We now put everything together and demonstrate our system for the following new post, which we assign to the variable new_post:
Disk drive problems. Hi, I have a problem with my hard disk.
After 1 year it is working only sporadically now.
I tried to format it, but now it doesn't boot any more.
Any ideas? Thanks.
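For reference, a minimal sketch of that assignment (the triple-quoted string is just one convenient way to hold the multi-line post):
>>> new_post = """Disk drive problems. Hi, I have a problem with my hard disk.
... After 1 year it is working only sporadically now.
... I tried to format it, but now it doesn't boot any more.
... Any ideas? Thanks."""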
As we have learned previously, we will first have to vectorize this post before we predict its label as follows:
>>> new_post_vec = vectorizer.transform([new_post])
>>> new_post_label = km.predict(new_post_vec)[0]
Now that we have the clustering, we do not need to compare new_post_vec to all post vectors. Instead, we can focus only on the posts of the same cluster. Let us fetch their indices in the original dataset:
>>> similar_indices = (km.labels_ == new_post_label).nonzero()[0]
The comparison in the brackets results in a Boolean array, and nonzero converts that array into a smaller array containing the indices of the True elements.
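To see the mechanics in isolation, here is a toy example with a made-up label array (the values are purely illustrative):
>>> import numpy as np
>>> labels = np.array([0, 2, 1, 2, 2])
>>> (labels == 2).nonzero()[0]
array([1, 3, 4])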
Using similar_indices, we then simply build a list of posts together with their similarity scores:
>>> similar = []
>>> for i in similar_indices:
...     dist = sp.linalg.norm((new_post_vec - vectorized[i]).toarray())
...     similar.append((dist, dataset.data[i]))
>>> similar = sorted(similar)
>>> print(len(similar))
44
We found 44 posts in the cluster of our post. To give the user a quick idea of what kind of similar posts are available, we can now present the most similar post (show_at_1), the least similar one (show_at_3), and an in-between post (show_at_2), all of which are from the same cluster:
>>> show_at_1 = similar[0]
>>> show_at_2 = similar[len(similar) // 2]
>>> show_at_3 = similar[-1]
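A quick way to inspect these three picks might look like the following sketch (the 80-character excerpt length is an arbitrary choice for display):
>>> for pos, (dist, post) in enumerate([show_at_1, show_at_2, show_at_3], 1):
...     print('=== Position %i, distance %.3f ===' % (pos, dist))
...     print(post[:80])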
The following table shows the posts together with their similarity scores. Note that the score is the Euclidean distance we just computed, so a smaller value means a more similar post:

Position | Similarity | Excerpt from post
---|---|---
1 | 1.018 | BOOT PROBLEM with IDE controller
2 | 1.294 | IDE Cable
3 | 1.375 |
It is interesting how the posts reflect the similarity measurement score. The first post contains all the salient words from our new post. The second one also revolves around hard disks but lacks concepts such as formatting. Finally, the third one is only slightly related. Still, for all three posts, we would say that they belong to the same domain as the new post.
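If we wanted to verify that word overlap ourselves, one way (a sketch, reusing the vectorizer's analyzer that we will also use further below) would be to intersect the analyzed terms of the new post with those of the closest hit:
>>> analyzer = vectorizer.build_analyzer()
>>> print(sorted(set(analyzer(new_post)) & set(analyzer(show_at_1[1]))))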
We should not expect a perfect clustering in the sense that posts from the same newsgroup (for example, comp.graphics) are also clustered together. An example will give us a quick impression of the noise that we have to expect:
>>> post_group = zip(dataset.data, dataset.target)
>>> z = sorted((len(post[0]), post[0], dataset.target_names[post[1]])
...            for post in post_group)
>>> print(z[5:7])
[(107, 'From: "kwansik kim" <kkim@cs.indiana.edu>\nSubject: Where is FAQ ?\n\nWhere can I find it ?\n\nThanks, Kwansik\n\n', 'comp.graphics'), (110, 'From: lioness@maple.circa.ufl.edu\nSubject: What is 3dO?\n\n\nSomeone please fill me in on what 3do.\n\nThanks,\n\nBH\n', 'comp.graphics')]
For both of these posts, there is no real indication that they belong to comp.graphics, considering only the wording that is left after the preprocessing step:
>>> analyzer = vectorizer.build_analyzer()
>>> list(analyzer(z[5][1]))
[u'kwansik', u'kim', u'kkim', u'cs', u'indiana', u'edu', u'subject', u'faq', u'thank', u'kwansik']
>>> list(analyzer(z[6][1]))
[u'lioness', u'mapl', u'circa', u'ufl', u'edu', u'subject', u'3do', u'3do', u'thank', u'bh']
This is only after tokenization, lowercasing, and stop word removal. If we also subtract those words that will later be filtered out via min_df and max_df (which happens in fit_transform), it gets even worse:
>>> list(set(analyzer(z[5][1])).intersection(
...     vectorizer.get_feature_names()))
[u'cs', u'faq', u'thank']
>>> list(set(analyzer(z[6][1])).intersection(
...     vectorizer.get_feature_names()))
[u'bh', u'thank']
Furthermore, most of the remaining words occur frequently in other posts as well, as we can check with the IDF scores. Remember that the higher the TF-IDF value, the more discriminative a term is for a given post. And as IDF is a multiplicative factor in TF-IDF, a low IDF value signals that the term is not of great value in general:
>>> for term in ['cs', 'faq', 'thank', 'bh', 'thank']:
...     print('IDF(%s)=%.2f' % (term,
...           vectorizer._tfidf.idf_[vectorizer.vocabulary_[term]]))
IDF(cs)=3.23
IDF(faq)=4.17
IDF(thank)=2.23
IDF(bh)=6.57
IDF(thank)=2.23
So, except for bh, which is close to the maximum overall IDF value of 6.74, the terms don't have much discriminative power. Understandably, posts from different newsgroups will be clustered together.
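If you want to confirm that 6.74 ceiling yourself, a one-line sketch (assuming the same fitted vectorizer as above, and relying on the same internal _tfidf attribute the previous snippet used):
>>> print('max IDF = %.2f' % vectorizer._tfidf.idf_.max())
max IDF = 6.74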
For our goal, however, this is no big deal, as we are only interested in cutting down the number of posts that we have to compare a new post to. After all, the particular newsgroup our training data came from is of no special interest.