When the input data is text, algorithms cannot work with it in its raw form. The text must therefore be converted into numbers, or vectors of numbers, so that it becomes usable for them.
The bag of words model is one way of making text usable for algorithms. Essentially, it is a representation of text based on the occurrence of words in a document. It takes no account of structure, order, or location; the model only uses word counts as features.
The idea behind it is that documents containing similar words have similar content.
The steps involved in building a bag of words model are as follows:
- Building the corpus: In this step, the documents are collected and combined to form a corpus. For example, the famous song from the TV series Friends is used as the corpus here:
I will be there for you
When the rain starts to pour
I will be there for you
Like I have been there before
I will be there for you
Let's consider each line of this song as a separate document.
- Vocabulary building: In this step, we identify the unique words in the corpus and create a list of them (a short code sketch of this step follows the list):
- I
- will
- be
- there
- for
- you
- when
- the
- rain
- starts
- to
- pour
- like
- have
- been
- before
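As a minimal sketch of these first two steps in plain Python (each line of the song is treated as one document, and words are lowercased for consistency), the vocabulary can be built by collecting the unique words across the corpus:

```python
# Each line of the song is treated as a separate document.
corpus = [
    "I will be there for you",
    "When the rain starts to pour",
    "I will be there for you",
    "Like I have been there before",
    "I will be there for you",
]

# Build the vocabulary: the unique words, in order of first appearance.
vocabulary = []
for document in corpus:
    for word in document.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

print(vocabulary)
# ['i', 'will', 'be', 'there', 'for', 'you', 'when', 'the', 'rain',
#  'starts', 'to', 'pour', 'like', 'have', 'been', 'before']
```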
- Document vector creation: Now, it's time to convert each document into a vector.
The simplest way to do this is the Boolean route. This means that each document is transformed into a document vector based on the presence or absence of each vocabulary word in that document.
For example, for the first line of the song, I will be there for you, the document vector turns out as follows:
| Word | Document vector |
| --- | --- |
| I | 1 |
| will | 1 |
| be | 1 |
| there | 1 |
| for | 1 |
| you | 1 |
| when | 0 |
| the | 0 |
| rain | 0 |
| starts | 0 |
| to | 0 |
| pour | 0 |
| like | 0 |
| have | 0 |
| been | 0 |
| before | 0 |
All the words that are present in the document are marked as 1, and the rest are marked as 0.
Hence, the document vector for the first sentence is [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0].
Similarly, the document vector for the second sentence is [0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0].
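As a minimal sketch of this Boolean route, reusing the corpus and vocabulary from the earlier snippet (an illustrative helper, not a library function), each document can be converted into a vector of 1s and 0s:

```python
def boolean_vector(document, vocabulary):
    """Mark 1 if the vocabulary word is present in the document, else 0."""
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

print(boolean_vector("I will be there for you", vocabulary))
# [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
print(boolean_vector("When the rain starts to pour", vocabulary))
# [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0]
```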
As the size of the corpus grows, the number of zeros in each document vector rises as well. This induces sparsity, turning it into a sparse vector, and computing with sparse vectors becomes challenging for many algorithms. Data cleansing is one way to counter this, to some extent:
- Cleansing the text: This involves converting the entire corpus into a single case (typically lowercase) and removing punctuation. Stemming, which means reducing words to their root form, can be incorporated to reduce the number of unique words in the corpus. Also, the removal of stop words, such as is and of, can abate the pain of sparsity. A rough sketch of these steps follows.
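As a rough sketch of these cleansing steps (assuming NLTK is installed, and that its English stop word list and Porter stemmer are acceptable choices; the exact output depends on the stemmer used):

```python
import string

from nltk.corpus import stopwords      # requires nltk.download('stopwords')
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()
stop_words = set(stopwords.words("english"))

def cleanse(document):
    # Lowercase the text and strip punctuation.
    document = document.lower().translate(str.maketrans("", "", string.punctuation))
    # Drop stop words and reduce the remaining words to their stems.
    return [stemmer.stem(word) for word in document.split() if word not in stop_words]

print(cleanse("Is count vector better than the Boolean way of creating feature vector?"))
# Roughly: ['count', 'vector', 'better', 'boolean', 'way', 'creat', 'featur', 'vector']
```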
- Count vector: There is another way to create the document vector, based on the frequency of the words appearing in the document. Suppose that a corpus comprises N documents and that T tokens (words) have been extracted from it. These T tokens form our dictionary, so the dimension of the count vector matrix turns out to be N x T. Each row contains the frequencies of the dictionary's tokens (words) in the respective document.
For example, let's suppose that we have three documents:
- N1: Count vector has got count in it
- N2: Is count vector better than the Boolean way of creating feature vector?
- N3: Creation of feature vector is very important
After removing the stop words, the count vector matrix turns out as shown in the following table:
|  | count | vector | got | it | better | than | Boolean | way | creating | feature | creation | important |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| N1 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| N2 | 1 | 2 | 0 | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 0 |
| N3 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 1 |
Now, take a careful look at the matrix dimensions: since N = 3 and T = 12, this is a 3 x 12 matrix.
Let's look at how the matrix is formed. In document N1, count occurs 2 times, vector occurs 1 time, and so on; these frequencies are entered as the values of that row. The same process is repeated for the other two documents.
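A small sketch of the same count vector matrix can be built with scikit-learn's CountVectorizer. Here, the vocabulary is passed in explicitly (in lowercase, since the vectorizer lowercases tokens by default) so that the columns line up with the table above; in practice, the vectorizer would learn its own vocabulary from the corpus:

```python
from sklearn.feature_extraction.text import CountVectorizer

documents = [
    "Count vector has got count in it",                                         # N1
    "Is count vector better than the Boolean way of creating feature vector",   # N2
    "Creation of feature vector is very important",                             # N3
]

# Fix the vocabulary explicitly so the columns match the table above.
columns = ["count", "vector", "got", "it", "better", "than", "boolean",
           "way", "creating", "feature", "creation", "important"]

vectorizer = CountVectorizer(vocabulary=columns)
matrix = vectorizer.fit_transform(documents)

print(matrix.toarray())
# [[2 1 1 1 0 0 0 0 0 0 0 0]
#  [1 2 0 0 1 1 1 1 1 1 0 0]
#  [0 1 0 0 0 0 0 0 0 1 1 1]]
```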
However, this approach has a drawback. A highly frequent word might start to dominate the document, and the corpus too, which results in limited information being extracted from the features. To counter this, term frequency-inverse document frequency (TF-IDF) was introduced.