Bags of words

When the input data is text, most algorithms cannot work with it directly. The raw text first has to be converted into numbers, or vectors of numbers, before it can be used.

The bag of words model is one way to make text usable for these algorithms. Essentially, it is a representation of text based on the occurrence of words in a document. It ignores structure, order, and position; the model only uses the counts of words as features.

The intuition behind this model is that documents with similar content are similar documents.

The steps involved in the bag of words model are, essentially, building a vocabulary from the corpus and then scoring each document against that vocabulary. To see these steps in action, consider the following lines of a song:

I will be there for you
When the rain starts to pour
I will be there for you
Like I have been there before
I will be there for you

Let's consider each line of this song as a separate document.

The simplest way to do this is the Boolean route: the raw text is transformed into a document vector based on the presence or absence of each vocabulary word in the respective document.

For example, if the first line of the song is treated as a document containing I will be there for you, then its document vector turns out as follows:

Word      Document vector
I         1
will      1
be        1
there     1
for       1
you       1
when      0
the       0
rain      0
starts    0
to        0
pour      0
like      0
have      0
been      0
before    0

All the words that are present in the document are marked as 1, and the rest are marked as 0.

Hence, the document vector for the first sentence is [1,1,1,1,1,1,0,0,0,0,0,0,0,0,0,0].

Similarly, the document vector for the second sentence is [0,0,0,0,0,0,1,1,1,1,1,1,0,0,0,0].
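
To make this concrete, here is a minimal, library-free Python sketch that builds the vocabulary from the five lines above and produces the same Boolean document vectors. Lowercasing and whitespace splitting are simplifying assumptions; real tokenizers do more work than this.

# Build the vocabulary from the corpus, in order of first appearance.
lines = [
    "I will be there for you",
    "When the rain starts to pour",
    "I will be there for you",
    "Like I have been there before",
    "I will be there for you",
]

vocabulary = []
for line in lines:
    for word in line.lower().split():
        if word not in vocabulary:
            vocabulary.append(word)

# Mark each vocabulary word as 1 if it is present in the document, 0 otherwise.
def boolean_vector(document):
    words = set(document.lower().split())
    return [1 if word in words else 0 for word in vocabulary]

for line in lines:
    print(boolean_vector(line))

The first line prints [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] and the second prints [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0], matching the document vectors above.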

As the size of the corpus grows, the number of zeros in each document vector grows as well. This induces sparsity, and the vector becomes a sparse vector, which many algorithms struggle to compute with efficiently. Data cleansing, such as removing stopwords, is one way to counter this to some extent.

Instead of only recording the presence or absence of a word, we can also record how many times it occurs; this gives a count vector. For example, suppose that we have three documents, N1, N2, and N3. After removing stopwords, the count vector matrix turns out as shown in the following table:

     count  vector  got  it  better  than  Boolean  way  creating  feature  creation  important
N1   2      1       1    1   0       0     0        0    0         0        0         0
N2   1      2       0    0   1       1     1        1    1         1        0         0
N3   0      1       0    0   0       0     0        0    0         1        1         1

Now, take a careful look at the dimensions of the matrix: the number of documents is N = 3 and the vocabulary size is T = 12, so this is a 3 x 12 matrix.

Let's see how the matrix was formed. In document N1, the word count occurs twice, the word vector occurs once, and so on; these frequencies are entered as the values of that row. The same process is carried out for the other two documents.
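
The same kind of count vector matrix can be produced with scikit-learn's CountVectorizer. The sketch below is only an illustration: the three sentences are stand-ins, since the original documents are not reproduced here, and scikit-learn's built-in English stopword list also removes words such as it and than, so the exact vocabulary and counts will differ from the table above; the N x T structure, however, is the same.

from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical stand-ins for the three documents N1, N2, and N3.
documents = [
    "the count vector got it, the count vector is simple",
    "a count vector is better than the Boolean way of creating a feature vector",
    "feature creation is important for a vector",
]

# stop_words="english" drops common words such as "the", "is", and "a" before counting.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(documents)   # sparse N x T matrix of raw term counts

print(vectorizer.get_feature_names_out())      # the T vocabulary terms (the columns)
print(counts.toarray())                        # one row of counts per document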

However, this approach has a drawback. A highly frequent word can start to dominate the document, and even the corpus, so the features end up carrying limited information. To counter this, term frequency-inverse document frequency (TF-IDF) was introduced.
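
As a quick preview of that idea, the same stand-in documents can be passed through scikit-learn's TfidfVectorizer: a word that appears in every document (such as vector here) receives a lower IDF weight than a rarer word with the same in-document frequency, so words that are frequent everywhere stop dominating the features.

from sklearn.feature_extraction.text import TfidfVectorizer

# The same hypothetical stand-in documents as in the previous sketch.
documents = [
    "the count vector got it, the count vector is simple",
    "a count vector is better than the Boolean way of creating a feature vector",
    "feature creation is important for a vector",
]

vectorizer = TfidfVectorizer(stop_words="english")
weights = vectorizer.fit_transform(documents)

print(vectorizer.get_feature_names_out())
print(weights.toarray().round(2))   # TF-IDF weights instead of raw counts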