Chapter 6. Information Retrieval

In the previous chapter we came across common words that made it difficult to characterize a corpus. This is a problem for many different kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.

Earlier, we talked about how much text data already exists and how more is being generated every day. We need some way to manage and search through this data. If there is an ID or a title, we can of course build an index on those fields, but how do we search by content? With structured data, we can write logical expressions and retrieve all rows that satisfy them. This can also be done with text, though less exactly.

The foundations of information retrieval predate computers. The field focuses on how to find specific pieces of information in a larger body of information, especially in text data. The most common information retrieval task is search—in other words, document search.

The following are the components of a document search:

Query q

A logical statement describing the document or kind of document you are looking for

Query term q_t

A term in the query, generally a token

Corpus of documents D

A collection of documents

Document d

A document in D with terms t_d that describe the document

Ranking function r(q, D)

A function that ranks the documents in D according to relevance to the query q

Result R

The ranked list of documents
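
In code, these components correspond to the pieces we will build over the course of this chapter. As a rough sketch (the function and argument names here are only illustrative), a document search boils down to the following:

def search(q, D, r, k=10):
    # r ranks the corpus D by relevance to the query q;
    # the result R is the ranked list, of which we keep the top k
    R = r(q, D)
    return R[:k]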

Before we get into how to implement these components, we need to consider a technical problem. How can we quickly access documents based on the information within them? If we had to scan every document, we could not search large collections of documents. To solve this problem, we use an inverted index.

Inverted Indices

Originally, indexing was a means of organizing and labeling information in a way that made retrieving it easier. For example, libraries use indexing to organize and find books. The Dewey Decimal Classification system is a way to index books based on their subject matter. We can also have indices based on titles, authors, publication dates, and so on. Another kind of index can often be found at the back of a book. This is a list of concepts in the book and pages on which to find them.

The index in inverted index is slightly different from the traditional index; instead, it takes inspiration from the mathematical concept of indexing—that is, assigning indices to the elements of a set. Recall our set of documents D. We can assign a number to each document, creating a mapping from integers to documents, i -> d.

Let’s create this index for our DataFrame. Normally, we would store an inverted index in a data store that allows quick lookups; Spark DataFrames are not designed for quick lookups, but building the index ourselves will help when we introduce the tools used for search.

Building an Inverted Index

Let’s look at how we can build an inverted index in Spark. Here are the steps we will follow:

  1. Load the data.
  2. Create the index: i -> d

    • Since we are using Spark, we will generate this index on the rows.
  3. Process the text.
  4. Create the inverted index from terms to documents: t_d -> i, j, k, ...

Step 1

We will be creating an inverted index for the mini_newsgroups data set.

import os

from pyspark.sql.types import *
from pyspark.sql.functions import collect_set
from pyspark.sql import Row
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

spark = sparknlp.start()
path = os.path.join('data', 'mini_newsgroups', '*')
texts = spark.sparkContext.wholeTextFiles(path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()

Step 2

Now we need to create the index. Spark assumes the data is distributed, so to assign an index, we need to use the lower-level RDD API. The zipWithIndex method assigns indices to the elements on the workers, following the order of the partitions and the order of the elements within each partition.

rows_w_indexed = texts.rdd.zipWithIndex()
(path, text), i = rows_w_indexed.first()

print(i)
print(path)
print(text[:200])
0
file:/home/alext/projects/spark-nlp-book/data/mini_...
Xref: cantaloupe.srv.cs.cmu.edu sci.astro:35223 sci.space:61404
Newsgroups: sci.astro,sci.space
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...

Now that we have created the index, we need to create a DataFrame like we did previously, except now we need to add our index into our Rows. Table 6-1 shows the result.

indexed = rows_w_indexed.map(
    lambda row_index: Row(
        index=row_index[1], 
        **row_index[0].asDict())
)
(i, path, text) = indexed.first()
indexed_schema = schema.add(StructField('index', IntegerType()))

indexed = spark.createDataFrame(indexed, schema=indexed_schema)\
    .persist()
indexed.limit(10).toPandas()
Table 6-1. Indexed documents
  path text index
0 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles\nFrom: lisa@alex.c... 0
1 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... 1
2 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles\nPath: cantaloupe.... 2
3 file:/.../spark-nlp-book/data/m... Xref: cantaloupe.srv.cs.cmu.edu rec.motorcycle... 3
4 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... 4
5 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... 5
6 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles\nPath: cantaloupe.... 6
7 file:/.../spark-nlp-book/data/m... Newsgroups: rec.motorcycles\nPath: cantaloupe.... 7
8 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!rochester!udel... 8
9 file:/.../spark-nlp-book/data/m... Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.... 9

Each document d is a collection of terms, t_d. So our index is the mapping from integers to collections of terms.

An inverted index, on the other hand, is the mapping from terms t_d to integers, inv-index: t_d -> i, j, k, .... This allows us to quickly look up what documents contain a given term.
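
To make this concrete, here is a toy example in plain Python (the three tiny “documents” are made up for illustration):

toy_docs = {
    0: ['the', 'cat', 'sat'],
    1: ['the', 'dog', 'ran'],
    2: ['the', 'cat', 'ran'],
}

# map each term to the set of document indices that contain it
toy_inverted_index = {}
for i, terms in toy_docs.items():
    for term in terms:
        toy_inverted_index.setdefault(term, set()).add(i)

toy_inverted_index['cat']  # {0, 2}
toy_inverted_index['the']  # {0, 1, 2}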

Step 3

Now let’s process the text (see Table 6-2).

from sparknlp.pretrained import PretrainedPipeline

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
tokenizer = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['token'])\
    .setOutputCol('lemma')
normalizer = Normalizer()\
    .setInputCols(['lemma'])\
    .setOutputCol('normalized')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normalized'])\
    .setOutputCols(['normalized'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, tokenizer, 
    lemmatizer, normalizer, finisher
]).fit(indexed)

indexed_w_tokens = pipeline.transform(indexed)
indexed_w_tokens.limit(10).toPandas()
Table 6-2. Documents with normalized tokens
  path text index normalized
0 file:/.../spark-nlp-book/data/m... ... 0 [newsgroups, recmotorcycles, from, lisaalexcom...
1 file:/.../spark-nlp-book/data/m... ... 1 [path, cantaloupesrvcscmuedudasnewsharvardedun...
2 file:/.../spark-nlp-book/data/m... ... 2 [newsgroups, recmotorcycles, path, cantaloupes...
3 file:/.../spark-nlp-book/data/m... ... 3 [xref, cantaloupesrvcscmuedu, recmotorcyclesha...
4 file:/.../spark-nlp-book/data/m... ... 4 [path, cantaloupesrvcscmuedudasnewsharvardeduo...
5 file:/.../spark-nlp-book/data/m... ... 5 [path, cantaloupesrvcscmuedumagnesiumclubcccmu...
6 file:/.../spark-nlp-book/data/m... ... 6 [newsgroups, recmotorcycles, path, cantaloupes...
7 file:/.../spark-nlp-book/data/m... ... 7 [newsgroups, recmotorcycles, path, cantaloupes...
8 file:/.../spark-nlp-book/data/m... ... 8 [path, cantaloupesrvcscmuedurochesterudelbogus...
9 file:/.../spark-nlp-book/data/m... ... 9 [path, cantaloupesrvcscmueducrabapplesrvcscmue...

Because we are using a small data set, we will move out of Spark for the purposes of this example. We will collect our data into pandas and use our index field as our DataFrame index.

doc_index = indexed_w_tokens.select('index', 'path', 'text').toPandas()
doc_index = doc_index.set_index('index')

Step 4

Now, let us create our inverted index. The following Spark SQL query expresses the logic; the code after it builds the same mapping with the DataFrame API. The result is shown in Table 6-3.

SELECT term, collect_set(index) AS documents
FROM (
    SELECT index, explode(normalized) AS term
    FROM indexed_w_tokens
)
GROUP BY term
ORDER BY term

inverted_index = indexed_w_tokens\
    .selectExpr('index', 'explode(normalized) AS term')\
    .distinct()\
    .groupBy('term').agg(collect_set('index').alias('documents'))\
    .persist()
inverted_index.show(10)
Table 6-3. Inverted index (mapping from term to document index)
  term documents
0 aaangel.qdeck.com [198]
1 accumulation [857]
2 adventists [314]
3 aecfb.student.cwru.edu [526]
4 again...hmmm [1657]
5 alt.binaries.pictures [815]
6 amplifier [630, 624, 654]
7 antennae [484, 482]
8 apr..gordian.com [1086]
9 apr..lokkur.dexter.mi.us [292]

This is our inverted index. We can see that the term “amplifier” occurs in documents 630, 624, and 654. With this information, we can quickly find all documents that contain particular terms.

Another benefit is that the size of this inverted index is governed by the size of our vocabulary, not by the amount of text in our corpus, so it is not big data. The inverted index grows only with new terms and new document indices. For very large corpora this can still be a lot of data for a single machine; in the case of the mini_newsgroups data set, however, it is easily manageable.

Let’s see how big our inverted index is.

inverted_index.count()
42624

For us, since we have such a small number of documents, the inverted index has more entries than the index. Word frequencies follow Zipf’s law—that is, the frequency of a word is inversely proportional to its rank when the words are sorted by frequency—so the most-used English words are already in our inverted index, and new documents add relatively few new terms. The inverted index can be kept smaller still by not tracking words that don’t occur at least a certain number of times.

inverted_index = {
    term: set(docs) 
    for term, docs in inverted_index.collect()
}
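
If we wanted to apply the minimum-occurrence constraint mentioned above, we could also drop terms that appear in fewer than some minimum number of documents before searching. This is only a sketch; min_df is an illustrative threshold:

min_df = 2
inverted_index_small = {
    term: docs
    for term, docs in inverted_index.items()
    if len(docs) >= min_df
}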

Now we can implement our most basic ranking function—simple Boolean search. In this case, let’s look up all the documents that contain the words “language” or “information.”

lang_docs = inverted_index['language']
print('docs', ('{}, ' * 10).format(*list(lang_docs)[:10]), '...')
print('number of docs', len(lang_docs))
docs 1926, 1937, 1171, 1173, 1182, 1188, 1830, 808, 1193, 433,  ...
number of docs 44
info_docs = inverted_index['information']
print('docs', ('{}, ' * 10).format(*list(info_docs)[:10]), '...')
print('number of docs', len(info_docs))
docs 516, 519, 520, 1547, 1035, 1550, 1551, 17, 1556, 22,  ...
number of docs 215
filter_set = list(lang_docs | info_docs)
print('number of docs in filter set', len(filter_set))
number of docs in filter set 246
intersection = list(lang_docs & info_docs)
print('number of docs in intersection set', len(intersection))
number of docs in intersection set 13

Let’s print out some lines from our filter set. Here, the filter set is also the result set; in general, though, the filter set is ranked by r(q, D) to produce the result set.

Let’s look at the lines in which we see the occurrences, to get an idea about our result set.

k = 1
for i in filter_set:
    path, text = doc_index.loc[i]
    lines = text.split('\n')
    print(path.split('/')[-1], 'length:', len(text))
    for line_number, line in enumerate(lines):
        if 'information' in line or 'language' in line:
            print(line_number, line)
    print()
    k += 1
    if k > 5:
        break
178813 length: 1783
14 >>    Where did you get this information?  The FBI stated ...

104863 length: 2795
14 of information that I received, but as my own bad mouthing) ...

104390 length: 2223
51 ... appropriate disclaimer to outgoing public information,

178569 length: 11735
60  confidential information obtained illegally from law ...
64  ... to allegations of obtaining confidential information from
86  employee and have said they simply traded information with ...
91  than truthful" in providing information during an earlier ...
125  and Department of Motor Vehicles information such as ...
130  Bullock also provided information to the South African ...
142  information.
151  exchanged information with the FBI and worked with ...
160  information in Los Angeles, San Francisco, New York, ...
168  some confidential information in the Anti-Defamation ...
182  police information on citizens and groups.
190  ... spying operations, which collected information on more than
209  information to police, journalists, academics, government ...
211  information illegally, he said.
215  identity of any source of information," Foxman said.

104616 length: 1846
45 ... an appropriate disclaimer to outgoing public information,

Now that we have our result set, how should we rank the results? We could just count the occurrences of the search terms in each document, but that would be biased toward long documents. Also, what happens if our query includes a very common word like “the”? If we just use the counts, common words like “the” will dominate our results. Indeed, in our result set, the document with the most occurrences of the query terms is also the one with the longest text.

We could instead say that the more distinct query terms found in a document, the more relevant it is, but this has problems too. What do we do with one-term queries? And in our example, only 13 of the 246 documents in the filter set contain both terms. Again, if our query has a common word—for example, “the cat in the hat”—should “the” and “in” have the same importance as “cat” and “hat”? To solve these problems, we need a more flexible model for our documents and queries.

Vector Space Model

In the previous chapter, we introduced the concept of vectorizing documents. We talked about creating binary vectors, where 1 means that the word is present in the document. We can also use the counts.

When we convert a corpus to a collection of vectors, we are implicitly modeling our language as a vector space. In this vector space, each dimension represents one term. This approach has benefits and drawbacks. It is a simple way to represent text in a form that machine learning algorithms can work with, and it allows us to store the vectors sparsely. On the other hand, we lose the information contained in the word order, and the resulting data sets are high dimensional, which can be problematic for some algorithms.
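
For example, with a toy five-term vocabulary (made up for illustration), a short document becomes a vector of term counts:

toy_vocabulary = ['cat', 'dog', 'ran', 'sat', 'the']
toy_document = ['the', 'cat', 'sat', 'the', 'cat']
[toy_document.count(term) for term in toy_vocabulary]
# [2, 0, 0, 1, 2]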

Let’s calculate the vectors for our data set. In the previous chapter, we used the CountVectorizer for this. Here we will build the vectors ourselves in Python; doing so will help us understand how libraries implement vectorization.

class SparseVector(object):
    
    def __init__(self, indices, values, length):
        # if the indices are not in ascending order, we need 
        # to sort them
        is_ascending = True
        for i in range(len(indices) - 1):
            is_ascending = is_ascending and indices[i] < indices[i+1]
        if not is_ascending:
            pairs = zip(indices, values)
            sorted_pairs = sorted(pairs, key=lambda x: x[0])
            indices, values = zip(*sorted_pairs)
        self.indices = indices
        self.values = values
        self.length = length
        
    def __getitem__(self, index):
        try:
            return self.values[self.indices.index(index)]
        except ValueError:
            return 0.0
        
    def dot(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res = 0
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res += self.values[i] * other.values[j]
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return res
    
    def hadamard(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res_indices = []
        res_values = []
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res_indices.append(self.indices[i])
                res_values.append(self.values[i] * other.values[j])
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return SparseVector(res_indices, res_values, self.length)
    
    def sum(self):
        return sum(self.values)
    
    def __repr__(self):
        return 'SparseVector({}, {})'.format(
            dict(zip(self.indices, self.values)), self.length)
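
As a quick sanity check, here is how two small sparse vectors (with made-up indices and values) behave:

u = SparseVector([0, 3, 7], [1.0, 2.0, 3.0], 10)
v = SparseVector([3, 7, 9], [4.0, 5.0, 6.0], 10)

u[3]           # 2.0
u.dot(v)       # 2.0 * 4.0 + 3.0 * 5.0 = 23.0
u.hadamard(v)  # SparseVector({3: 8.0, 7: 15.0}, 10)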

We need to make two passes over all the documents. In the first pass, we will get our vocabulary and the counts. In the second pass we will construct the vectors.

from collections import Counter

vocabulary = set()
vectors = {}

for row in indexed_w_tokens.toLocalIterator():
    counts = Counter(row['normalized'])
    vocabulary.update(counts.keys())
    vectors[row['index']] = counts
    
vocabulary = list(sorted(vocabulary))
inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)}
vocab_len = len(vocabulary)

Now that we have this information, we need to go back over our word counts and construct actual vectors.

for index in vectors:
    terms, values = zip(*vectors[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors[index] = SparseVector(indices, values, vocab_len)
vectors[42]
SparseVector({56: 1, 630: 1, 678: 1, 937: 1, 952: 1, 1031: 1, 1044: 1,
1203: 1, 1348: 1, 1396: 5, 1793: 1, 2828: 1, 3264: 3, 3598: 3, 3753: 1,
4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 8532: 1, 9570: 1, 11031: 1,
11731: 1, 12509: 1, 13555: 1, 13772: 1, 14918: 1, 15205: 1, 15350: 1,
15475: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 17627: 1, 17798: 1,
17931: 2, 18178: 1, 18329: 2, 18505: 1, 18730: 3, 18776: 1, 19346: 1,
19620: 1, 20381: 1, 20475: 1, 20594: 1, 20782: 1, 21831: 1, 21856: 1,
21907: 1, 22560: 1, 22565: 2, 22717: 1, 23714: 1, 23813: 1, 24145: 1,
24965: 3, 25937: 1, 26437: 1, 26438: 1, 26592: 1, 26674: 1, 26679: 1,
27091: 1, 27109: 1, 27491: 2, 27500: 1, 27670: 1, 28583: 1, 28864: 1,
29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1,
34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35635: 1, 35724: 1,
36136: 1, 36596: 1, 36672: 1, 37048: 1, 37854: 1, 37867: 3, 37872: 1,
37876: 3, 37891: 1, 37907: 1, 37949: 1, 38002: 1, 38224: 1, 38225: 2,
38226: 3, 38317: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 41239: 1,
41240: 1, 41276: 1, 41292: 1, 41507: 1, 41731: 1, 42384: 2}, 42624)

Let’s look at some of the words that occur the most.

vocabulary[3598]
'be'
vocabulary[37876]
'the'

As we discussed previously, there are many drawbacks to using only the counts for a search. The concern is that words that are generally common in English will have more impact than less common words. There are a couple of strategies for addressing this. First, let’s look at the simplest solution—removing the common words.

Stop-Word Removal

These common words we are looking to remove are called stop words. This term was coined in the 1950s by Hans Peter Luhn, a pioneer in information retrieval. Default stop-word lists are available, but it is often necessary to modify generic stop-word lists for different tasks.

from pyspark.ml.feature import StopWordsRemover

sw_remover = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))

filtered = sw_remover.transform(indexed_w_tokens)
from collections import Counter

vocabulary_filtered = set()
vectors_filtered = {}

for row in filtered.toLocalIterator():
    counts = Counter(row['filtered'])
    vocabulary_filtered.update(counts.keys())
    vectors_filtered[row['index']] = counts
    
vocabulary_filtered = list(sorted(vocabulary_filtered))
inv_vocabulary_filtered = {
    term: ix 
    for ix, term in enumerate(vocabulary_filtered)
}
vocab_len_filtered = len(vocabulary_filtered)

# Build the filtered vectors. We keep the original vocabulary's indices
# and length so that these vectors stay comparable with the unfiltered
# vectors and with the vocabulary list we use for lookups below.
for index in vectors:
    terms, values = zip(*vectors_filtered[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors_filtered[index] = \
        SparseVector(indices, values, vocab_len)
vectors_filtered[42]
SparseVector({630: 1, 678: 1, 952: 1, 1031: 1, 1044: 1, 1203: 1, 1348: 1, 
1793: 1, 2828: 1, 3264: 3, 4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 
8532: 1, 9570: 1, 11031: 1, 11731: 1, 12509: 1, 13555: 1, 13772: 1, 
14918: 1, 15205: 1, 15350: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 
17627: 1, 17798: 1, 17931: 2, 18178: 1, 18505: 1, 18776: 1, 20475: 1, 
20594: 1, 20782: 1, 21831: 1, 21856: 1, 21907: 1, 22560: 1, 22565: 2, 
22717: 1, 23714: 1, 23813: 1, 24145: 1, 25937: 1, 26437: 1, 26438: 1, 
26592: 1, 26674: 1, 26679: 1, 27109: 1, 27491: 2, 28583: 1, 28864: 1, 
29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1, 
34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35724: 1, 36136: 1, 
36596: 1, 36672: 1, 37048: 1, 37872: 1, 37891: 1, 37949: 1, 38002: 1, 
38224: 1, 38225: 2, 38226: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 
41239: 1, 41240: 1, 41276: 1, 41731: 1}, 42624)
vocabulary[3264]
'bake'
vocabulary[38226]
'timmons'

The words “bake” and “timmons” seem more informative. You should explore your data when determining what words should be included in the stop-word list.
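
For example, after exploring this corpus we might extend the default English list with newsgroup-header terms that carry no content. This is only a sketch; the added words are illustrative:

custom_stop_words = StopWordsRemover.loadDefaultStopWords('english') \
    + ['newsgroups', 'path', 'subject', 'messageid']

sw_remover = StopWordsRemover() \
    .setInputCol('normalized') \
    .setOutputCol('filtered') \
    .setStopWords(custom_stop_words)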

It may seem like a daunting task to list all the words we don’t want. However, recalling what we discussed about morphology, we can narrow down what we want to remove: unbound function morphemes, that is, standalone grammatical words like “the,” “of,” and “and.”

A fluent speaker of a language who knows these basics of morphology can create a reasonably good list. However, this still leaves two concerns. What if we need to keep some common words? What if we want to remove some common lexical morphemes? We can modify the list, but that still leaves one last concern: how do we handle queries like “fictional cats”? The word “fictional” is less common than “cats,” so it makes sense that the former should be more important in determining which documents are returned. Let’s look at how we can implement this using our data.

Inverse Document Frequency

Instead of manually editing our vocabulary, we can try weighting the words. We need some way of weighting the words by their “commonness.” One way to define “commonness” is by the number of documents in our corpus that contain the word. This is generally called document frequency. We want words with high document frequency to be down-weighted, so we are interested in using the inverse document frequency (IDF).

We take these values and multiply them by the term frequencies, which are the frequencies of the words within a given document. The product of term frequency and inverse document frequency gives us the TF.IDF.

tf(t, d) = the number of times t occurs in d
df(t) = the number of documents t occurs in
idf(t) = the number of documents / df(t)

There are many different flavors of TF.IDF. The most common kind is smoothed logarithmic.

tf(t, d) = log(1 + the number of times t occurs in d)
df(t) = the number of documents t occurs in
idf(t) = log(the number of documents / (1 + df(t)))
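
To get a feel for the scale of these weights, consider a hypothetical corpus of 1,000 documents (the numbers are made up for illustration): a term appearing in only 6 documents gets a far larger weight than one appearing in 900 of them.

from math import log

log(1000 / (1 + 6))    # ~4.96
log(1000 / (1 + 900))  # ~0.10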

Let’s calculate this with our vectors. We actually already have the term frequency, so all we need to do is calculate the idf, transform the values with log, and multiply tf and idf.

idf = Counter()

for vector in vectors.values():
    idf.update(vector.indices)
for ix, count in idf.most_common(20):
    print('{:5d} {:20s} {:d}'.format(ix, vocabulary[ix], count))
11031 date                 2000
15475 from                 2000
23813 messageid            2000
26438 newsgroups           2000
28583 path                 2000
36672 subject              2000
21907 lines                1993
27897 organization         1925
37876 the                  1874
 1793 apr                  1861
 3598 be                   1837
38317 to                   1767
27500 of                   1756
   56 a                    1730
16266 gmt                  1717
18329 i                    1708
18730 in                   1695
 1396 and                  1674
15166 for                  1474
17238 have                 1459

We can now make idf a SparseVector. We know it contains all the words, so it actually won’t be sparse, but this will help us implement the next steps.

indices, values = zip(*idf.items())
idf = SparseVector(indices, values, vocab_len)

from math import log

# log-transform the term frequencies: tf becomes log(1 + count)
for index, vector in vectors.items():
    vector.values = list(map(lambda v: log(1+v), vector.values))

# turn the document frequencies into smoothed inverse document frequencies
idf.values = list(map(lambda v: log(vocab_len / (1+v)), idf.values))

tfidf = {index: tf.hadamard(idf) for index, tf in vectors.items()}
tfidf[42]
SparseVector({56: 2.2206482367540246, 630: 5.866068667810157, 
678: 5.793038323439593, 937: 2.7785503981772224, 952: 5.157913986067814, 
..., 
41731: 2.4998956290056062, 42384: 3.8444034764394415}, 42624)

Let’s look at the TF.IDF values for “be” and “the.” Let’s also look at one of the terms with a higher TF.IDF than these common words.

tfidf[42][3598] # be
4.358148273729854
tfidf[42][37876] # the
4.3305185461380855
vocabulary[17236], tfidf[42][17236]
('hausmann', 10.188396765921954)

Let’s look at the document to get an idea of why this word is so important.

print(doc_index.loc[42]['text'])
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard...
From: timmbake@mcl.ucsb.edu (Bake Timmons)
Newsgroups: alt.atheism
Subject: Re: Amusing atheists and agnostics
Message-ID: <timmbake.735285604@mcl>
Date: 20 Apr 93 06:00:04 GMT
Sender: news@ucsbcsl.ucsb.edu
Lines: 32

Maddi Hausmann chirps:

>timmbake@mcl.ucsb.edu (Bake Timmons) writes: >

...

>"Killfile" Keith Allen Schneider = Frank "Closet Theist" O'Dwyer = ...

= Maddi "The Mad Sound-O-Geek" Hausmann

...whirrr...click...whirrr

--
Bake Timmons, III
...

We can see the document is talking about a person named “Maddi Hausmann.”

Exercises

Now that we have calculated the TF.IDF values, let’s build a search function. First, we need a function to process the query.

def process_query(query, pipeline):
    data = spark.createDataFrame([(query,)], ['text'])
    return pipeline.transform(data).first()['normalized']

Then we need a function to get the filter set.

def get_filter_set(processed_query):
    filter_set = set()
    # find all the documents that contain any of the terms
    return filter_set

Next, we need a function that will compute the score for the document.

def get_score(index, terms):
    return # return a single score

We also want a function for displaying results.

def display(index, score, terms):
    hits = [term for term in terms
            if term in inv_vocabulary
            and tfidf[index][inv_vocabulary[term]] > 0.]
    print('terms', terms, 'hits', hits)
    print('score', score)
    print('path', doc_index.loc[index]['path'])
    print('length', len(doc_index.loc[index]['text']))

Finally, we are ready for our search function.

def search(query, pipeline, k=5):
    processed_query = process_query(query, pipeline)
    filter_set = get_filter_set(processed_query)
    scored = {index: get_score(index, processed_query) for index in filter_set}
    display_list = list(sorted(filter_set, key=scored.get, reverse=True))[:k]
    for index in display_list:
        display(index, scored[index], processed_query)
search('search engine', pipeline)

You should be able to implement get_filter_set and get_score easily using examples in this chapter. Try out a few queries. You will likely notice that there are two big limitations here. There is no N-gram support, and the ranker is biased toward longer documents. What could you modify to fix these problems?

Resources

  • An Introduction to Information Retrieval, by Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze (Cambridge University Press)
    • This book covers many important aspects of information retrieval. Two of its three authors are also authors of Foundations of Statistical Natural Language Processing.
  • Apache Lucene
    • The most widely used open source search engine. Often, one of the search platforms built on top of Lucene, Apache Solr or Elasticsearch, is used.
  • Lucene in Action, 2nd ed., by Michael McCandless, Erik Hatcher, and Otis Gospodnetic (Manning Publications)
    • A guide to implementing search using Lucene
  • Elasticsearch: The Definitive Guide, by Clinton Gormley and Zachary Tong (O’Reilly)
    • A guide to implementing search using Elasticsearch
  • Learning to Rank for Information Retrieval, by Tie-Yan Liu (Springer)
    • Learning to rank, building machine learning–based rankers, is an important part of modern search engines. Tie-Yan Liu is one of the most important contributors to the field of learning to rank.