In the previous chapter we came across common words that made it difficult to characterize a corpus. This is a problem for many kinds of NLP tasks. Fortunately, the field of information retrieval has developed many techniques that can be used to improve a variety of NLP applications.
Earlier, we talked about how much text data already exists, with more being generated every day. We need some way to manage and search through this data. If a document has an ID or a title, we can of course build an index on those fields, but how do we search by content? With structured data, we can write logical expressions and retrieve all rows that satisfy them. This can also be done with text, though less exactly.
The foundation of information retrieval predates computers. Information retrieval focuses on how to find specific pieces of information in a larger set of information, especially information in text data. The most common type of task in information retrieval is search—in other words, document search.
The following are the components of a document search:

- q: A logical statement describing the document or kind of document you are looking for
- q_t: A term in the query, generally a token
- D: A collection of documents
- d: A document in D, with terms t_d that describe the document
- r(q, D): A ranking function that orders the documents in D according to relevance to the query q
- R: The result, the documents of D ordered by relevance to q
Before we get into how to implement these components, we need to consider a technical problem: how can we quickly access documents based on the information within them? If we had to scan every document, we would not be able to search large collections. To solve this problem, we use an inverted index.
Originally, indexing was a means of organizing and labeling information in a way that made retrieving it easier. For example, libraries use indexing to organize and find books. The Dewey Decimal Classification system is a way to index books based on their subject matter. We can also have indices based on titles, authors, publication dates, and so on. Another kind of index can often be found at the back of a book. This is a list of concepts in the book and pages on which to find them.
The index in an inverted index is slightly different from these traditional indices; it takes inspiration from the mathematical notion of indexing, that is, assigning an index to each element of a set. Recall our set of documents D. We can assign a number to each document, creating a mapping from integers to documents, i -> d.
Let’s create this index for our DataFrame. Normally, we would store an inverted index in a data store that allows for quick lookups; Spark DataFrames are not built for quick lookups. We will introduce the tools used for search.
Let’s look at how we can build an inverted index in Spark. Here are the steps we will follow:

1. Create the index: i -> d
2. Process the text
3. Create the inverted index: t_d -> i, j, k, ...
We will be creating an inverted index for the mini_newsgroups data set.
import os

from pyspark.sql.types import *
from pyspark.sql.functions import collect_set
from pyspark.sql import Row
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

spark = sparknlp.start()
path = os.path.join('data', 'mini_newsgroups', '*')
texts = spark.sparkContext.wholeTextFiles(path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()
Now we need to create the index. Spark assumes the data is distributed, so to assign an index we need to use the lower-level RDD API. The zipWithIndex method assigns consecutive indices to the elements, following the order of the partitions on the workers.
rows_w_indexed = texts.rdd.zipWithIndex()

(path, text), i = rows_w_indexed.first()

print(i)
print(path)
print(text[:200])
0
file:/home/alext/projects/spark-nlp-book/data/mini_...
Xref: cantaloupe.srv.cs.cmu.edu sci.astro:35223 sci.space:61404
Newsgroups: sci.astro,sci.space
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard.edu!...
Now that we have created the index, we need to create a DataFrame like we did previously, except now we need to add our index to our Rows. Table 6-1 shows the result.
indexed = rows_w_indexed.map(
    lambda row_index: Row(
        index=row_index[1],
        **row_index[0].asDict())
)

(i, path, text) = indexed.first()
indexed_schema = schema.add(StructField('index', IntegerType()))

indexed = spark.createDataFrame(indexed, schema=indexed_schema)\
    .persist()
indexed.limit(10).toPandas()
|   | path | text | index |
|---|---|---|---|
| 0 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles\nFrom: lisa@alex.c... | 0 |
| 1 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | 1 |
| 2 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles\nPath: cantaloupe.... | 2 |
| 3 | file:/.../spark-nlp-book/data/m... | Xref: cantaloupe.srv.cs.cmu.edu rec.motorcycle... | 3 |
| 4 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!das-news.harva... | 4 |
| 5 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!magnesium.club... | 5 |
| 6 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles\nPath: cantaloupe.... | 6 |
| 7 | file:/.../spark-nlp-book/data/m... | Newsgroups: rec.motorcycles\nPath: cantaloupe.... | 7 |
| 8 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!rochester!udel... | 8 |
| 9 | file:/.../spark-nlp-book/data/m... | Path: cantaloupe.srv.cs.cmu.edu!crabapple.srv.... | 9 |
Each document d is a collection of terms, t_d, so our index is a mapping from integers to collections of terms. An inverted index, on the other hand, is a mapping from terms to integers, inv-index: t_d -> i, j, k, .... This allows us to quickly look up which documents contain a given term.
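To make the idea concrete, here is a toy sketch in plain Python, using a few made-up documents rather than our data set:

docs = {
    0: ['the', 'cat', 'sat'],
    1: ['the', 'dog', 'sat'],
    2: ['a', 'cat', 'and', 'a', 'dog'],
}

# map each term to the set of document indices that contain it
toy_inverted_index = {}
for i, terms in docs.items():
    for term in terms:
        toy_inverted_index.setdefault(term, set()).add(i)

toy_inverted_index['cat']  # {0, 2}
toy_inverted_index['sat']  # {0, 1}

Looking up a term now costs a single dictionary access instead of a scan over every document.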
Now let’s process the text (see Table 6-2).
from sparknlp.pretrained import PretrainedPipeline

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
tokenizer = Tokenizer()\
    .setInputCols(['document'])\
    .setOutputCol('token')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['token'])\
    .setOutputCol('lemma')
normalizer = Normalizer()\
    .setInputCols(['lemma'])\
    .setOutputCol('normalized')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normalized'])\
    .setOutputCols(['normalized'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, tokenizer, lemmatizer, normalizer, finisher
]).fit(indexed)

indexed_w_tokens = pipeline.transform(indexed)
indexed_w_tokens.limit(10).toPandas()
|   | path | text | index | normalized |
|---|---|---|---|---|
| 0 | file:/.../spark-nlp-book/data/m... | ... | 0 | [newsgroups, recmotorcycles, from, lisaalexcom... |
| 1 | file:/.../spark-nlp-book/data/m... | ... | 1 | [path, cantaloupesrvcscmuedudasnewsharvardedun... |
| 2 | file:/.../spark-nlp-book/data/m... | ... | 2 | [newsgroups, recmotorcycles, path, cantaloupes... |
| 3 | file:/.../spark-nlp-book/data/m... | ... | 3 | [xref, cantaloupesrvcscmuedu, recmotorcyclesha... |
| 4 | file:/.../spark-nlp-book/data/m... | ... | 4 | [path, cantaloupesrvcscmuedudasnewsharvardeduo... |
| 5 | file:/.../spark-nlp-book/data/m... | ... | 5 | [path, cantaloupesrvcscmuedumagnesiumclubcccmu... |
| 6 | file:/.../spark-nlp-book/data/m... | ... | 6 | [newsgroups, recmotorcycles, path, cantaloupes... |
| 7 | file:/.../spark-nlp-book/data/m... | ... | 7 | [newsgroups, recmotorcycles, path, cantaloupes... |
| 8 | file:/.../spark-nlp-book/data/m... | ... | 8 | [path, cantaloupesrvcscmuedurochesterudelbogus... |
| 9 | file:/.../spark-nlp-book/data/m... | ... | 9 | [path, cantaloupesrvcscmueducrabapplesrvcscmue... |
Because we are using a small data set, we will move out of Spark for the purposes of this example. We will collect our data into pandas and use our index field as our DataFrame index.
doc_index = indexed_w_tokens.select('index', 'path', 'text').toPandas()
doc_index = doc_index.set_index('index')
Now, let us create our inverted index. We can express the logic in Spark SQL; the equivalent DataFrame API code follows. The result is shown in Table 6-3.
SELECT term, collect_set(index) AS documents
FROM (
SELECT index, explode(normalized) AS term
FROM indexed_w_tokens
)
GROUP BY term
ORDER BY term
inverted_index = indexed_w_tokens\
    .selectExpr('index', 'explode(normalized) AS term')\
    .distinct()\
    .groupBy('term').agg(collect_set('index').alias('documents'))\
    .persist()
inverted_index.show(10)
|   | term | documents |
|---|---|---|
| 0 | aaangel.qdeck.com | [198] |
| 1 | accumulation | [857] |
| 2 | adventists | [314] |
| 3 | aecfb.student.cwru.edu | [526] |
| 4 | again...hmmm | [1657] |
| 5 | alt.binaries.pictures | [815] |
| 6 | amplifier | [630, 624, 654] |
| 7 | antennae | [484, 482] |
| 8 | apr..gordian.com | [1086] |
| 9 | apr..lokkur.dexter.mi.us | [292] |
This is our inverted index. We can see that the term “amplifier” occurs in documents 630, 624, and 654. With this information, we can quickly find all documents that contain particular terms.
Another benefit is that this inverted index is based on the size of our vocabulary, not on the amount of text in our corpus, so it is not big data. The inverted index grows only with new terms and document indices. For very large corpora, this can still be a large amount of data for a single machine. In the case of the mini_newsgroups data set, however, it is easily manageable.
Let’s see how big our inverted index is.
inverted_index.count()
42624
For us, since we have such a small number of documents, the inverted index (42,624 terms) has more entries than the index (2,000 documents). Word frequencies follow Zipf’s law; that is, the frequency of a word is inversely proportional to its rank when sorted by frequency. Because the most-used English words show up early and are already in our vocabulary, the inverted index grows much more slowly than the corpus. We can constrain its size further by not tracking words that don’t occur at least a certain number of times.
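For example, if we wanted to prune terms that appear in fewer than some minimum number of documents, we could filter on the size of each posting list. This is only a sketch (the threshold of 2 is an arbitrary illustration); we keep the full inverted index for the rest of the chapter:

from pyspark.sql.functions import size

pruned = inverted_index.where(size('documents') >= 2)
pruned.count()

For what follows, we collect the full inverted index into a plain Python dict.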
inverted_index = {
    term: set(docs)
    for term, docs in inverted_index.collect()
}
Now we can begin our most basic ranking function—simple Boolean search. In this case, let’s look up all the documents that contain the words “language” or “information.”
lang_docs = inverted_index['language']
print('docs', ('{}, ' * 10).format(*list(lang_docs)[:10]), '...')
print('number of docs', len(lang_docs))
docs 1926, 1937, 1171, 1173, 1182, 1188, 1830, 808, 1193, 433,  ...
number of docs 44
info_docs = inverted_index['information']
print('docs', ('{}, ' * 10).format(*list(info_docs)[:10]), '...')
print('number of docs', len(info_docs))
docs 516, 519, 520, 1547, 1035, 1550, 1551, 17, 1556, 22,  ...
number of docs 215
filter_set = list(lang_docs | info_docs)
print('number of docs in filter set', len(filter_set))
number of docs in filter set 246
intersection = list(lang_docs & info_docs)
print('number of docs in intersection set', len(intersection))
number of docs in intersection set 13
Let’s print out lines from our filter set. Here, the filter set is also the result set, but generally the filter set is ranked by r(q, D) to produce the result set.
Let’s look at the lines in which we see the occurrences, to get an idea about our result set.
k = 1
for i in filter_set:
    path, text = doc_index.loc[i]
    lines = text.split('\n')
    print(path.split('/')[-1], 'length:', len(text))
    for line_number, line in enumerate(lines):
        if 'information' in line or 'language' in line:
            print(line_number, line)
    print()
    k += 1
    if k > 5:
        break
Now that we have our result set, how should we rank our results? We could just count the number of occurrences of the search terms, but that would be biased toward long documents. Also, what happens if our query includes a very common word like “the”? If we just use the counts, common words like “the” will dominate our results. In our result set, the document with the most occurrences of the query terms is also the one with the longest text. We could say that the more of the query terms a document contains, the more relevant it is, but this has problems too. What do we do with one-term queries? And in our example, only 13 of the 246 documents in the filter set contain both terms. Again, if our query contains common words, for example “the cat in the hat,” should “the” and “in” have the same importance as “cat” and “hat”? To solve this problem, we need a more flexible model for our documents and queries.
In the previous chapter, we introduced the concept of vectorizing documents. We talked about creating binary vectors, where 1 means that the word is present in the document. We can also use the counts.
When we convert a corpus to a collection of vectors, we are implicitly modeling our language as a vector space. In this vector space, each dimension represents one term. This approach has benefits and drawbacks. It is a simple way to represent our text in a manner that machine learning algorithms can work with, and it allows us to store the vectors sparsely. On the other hand, we lose the information contained in the word order, and the resulting data sets are high-dimensional, which can be problematic for some algorithms.
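For intuition, here is a toy example with two made-up documents showing how count vectors keep term frequencies but discard word order:

from collections import Counter

toy_docs = [['the', 'cat', 'sat', 'on', 'the', 'mat'],
            ['the', 'dog', 'sat']]
toy_vocab = sorted({term for doc in toy_docs for term in doc})
# one dimension per vocabulary term; the value is the count in the document
toy_vectors = [[Counter(doc)[term] for term in toy_vocab] for doc in toy_docs]

toy_vocab    # ['cat', 'dog', 'mat', 'on', 'sat', 'the']
toy_vectors  # [[1, 0, 1, 1, 1, 2], [0, 1, 0, 0, 1, 1]]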
Let’s calculate the vectors for our data set. In the previous chapter, we used the CountVectorizer for this. Here, we will build the vectors in Python, and the way we build them will help us understand how libraries implement vectorization.
class SparseVector(object):
    def __init__(self, indices, values, length):
        # if the indices are not in ascending order, we need
        # to sort them
        is_ascending = True
        for i in range(len(indices) - 1):
            is_ascending = is_ascending and indices[i] < indices[i+1]
        if not is_ascending:
            pairs = zip(indices, values)
            sorted_pairs = sorted(pairs, key=lambda x: x[0])
            indices, values = zip(*sorted_pairs)
        self.indices = indices
        self.values = values
        self.length = length

    def __getitem__(self, index):
        try:
            return self.values[self.indices.index(index)]
        except ValueError:
            return 0.0

    def dot(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res = 0
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res += self.values[i] * other.values[j]
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return res

    def hadamard(self, other):
        assert isinstance(other, SparseVector)
        assert self.length == other.length
        res_indices = []
        res_values = []
        i = j = 0
        while i < len(self.indices) and j < len(other.indices):
            if self.indices[i] == other.indices[j]:
                res_indices.append(self.indices[i])
                res_values.append(self.values[i] * other.values[j])
                i += 1
                j += 1
            elif self.indices[i] < other.indices[j]:
                i += 1
            elif self.indices[i] > other.indices[j]:
                j += 1
        return SparseVector(res_indices, res_values, self.length)

    def sum(self):
        return sum(self.values)

    def __repr__(self):
        return 'SparseVector({}, {})'.format(
            dict(zip(self.indices, self.values)), self.length)
We need to make two passes over all the documents. In the first pass, we will get our vocabulary and the counts. In the second pass we will construct the vectors.
from collections import Counter

vocabulary = set()
vectors = {}
for row in indexed_w_tokens.toLocalIterator():
    counts = Counter(row['normalized'])
    vocabulary.update(counts.keys())
    vectors[row['index']] = counts

vocabulary = list(sorted(vocabulary))
inv_vocabulary = {term: ix for ix, term in enumerate(vocabulary)}
vocab_len = len(vocabulary)
Now that we have this information, we need to go back over our word counts and construct actual vectors.
for index in vectors:
    terms, values = zip(*vectors[index].items())
    indices = [inv_vocabulary[term] for term in terms]
    vectors[index] = SparseVector(indices, values, vocab_len)
vectors[42]
SparseVector({56: 1, 630: 1, 678: 1, 937: 1, 952: 1, 1031: 1, 1044: 1, 1203: 1, 1348: 1, 1396: 5, 1793: 1, 2828: 1, 3264: 3, 3598: 3, 3753: 1, 4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 8532: 1, 9570: 1, 11031: 1, 11731: 1, 12509: 1, 13555: 1, 13772: 1, 14918: 1, 15205: 1, 15350: 1, 15475: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 17627: 1, 17798: 1, 17931: 2, 18178: 1, 18329: 2, 18505: 1, 18730: 3, 18776: 1, 19346: 1, 19620: 1, 20381: 1, 20475: 1, 20594: 1, 20782: 1, 21831: 1, 21856: 1, 21907: 1, 22560: 1, 22565: 2, 22717: 1, 23714: 1, 23813: 1, 24145: 1, 24965: 3, 25937: 1, 26437: 1, 26438: 1, 26592: 1, 26674: 1, 26679: 1, 27091: 1, 27109: 1, 27491: 2, 27500: 1, 27670: 1, 28583: 1, 28864: 1, 29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1, 34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35635: 1, 35724: 1, 36136: 1, 36596: 1, 36672: 1, 37048: 1, 37854: 1, 37867: 3, 37872: 1, 37876: 3, 37891: 1, 37907: 1, 37949: 1, 38002: 1, 38224: 1, 38225: 2, 38226: 3, 38317: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 41239: 1, 41240: 1, 41276: 1, 41292: 1, 41507: 1, 41731: 1, 42384: 2}, 42624)
Let’s look at some of the words that occur the most.
vocabulary[3598]
'be'
vocabulary[37876]
'the'
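The inverse vocabulary mapping lets us go from a term to its dimension, so we can read counts directly out of a vector. From the vector for document 42 shown above:

vectors[42][inv_vocabulary['the']]  # 3
vectors[42][inv_vocabulary['be']]   # 3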
As we discussed previously, there are many drawbacks to using only the counts for a search. The concern is that words that are common in English generally will have more impact than less common words. There are a couple of strategies for addressing this. First, let’s look at the simplest solution: removing the common words.
These common words we are looking to remove are called stop words. This term was coined in the 1950s by Hans Peter Luhn, a pioneer in information retrieval. Default stop-word lists are available, but it is often necessary to modify generic stop-word lists for different tasks.
from pyspark.ml.feature import StopWordsRemover

sw_remover = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(StopWordsRemover.loadDefaultStopWords("english"))

filtered = sw_remover.transform(indexed_w_tokens)
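If we want to tailor the list, as suggested above, we can extend the defaults before setting them on the remover. This is a sketch; the extra words here are hypothetical examples, not a recommendation for this data set:

custom_stop_words = StopWordsRemover.loadDefaultStopWords("english") + \
    ['newsgroups', 'subject', 'organization']

sw_remover_custom = StopWordsRemover() \
    .setInputCol("normalized") \
    .setOutputCol("filtered") \
    .setStopWords(custom_stop_words)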
from collections import Counter

vocabulary_filtered = set()
vectors_filtered = {}
for row in filtered.toLocalIterator():
    counts = Counter(row['filtered'])
    vocabulary_filtered.update(counts.keys())
    vectors_filtered[row['index']] = counts

vocabulary_filtered = list(sorted(vocabulary_filtered))
inv_vocabulary_filtered = {
    term: ix for ix, term in enumerate(vocabulary_filtered)
}
vocab_len_filtered = len(vocabulary)
for index in vectors:
    terms, values = zip(*vectors_filtered[index].items())
    # keep the original vocabulary's index space so the indices
    # below remain comparable with `vocabulary`
    indices = [inv_vocabulary[term] for term in terms]
    vectors_filtered[index] = \
        SparseVector(indices, values, vocab_len_filtered)
vectors_filtered[42]
SparseVector({630: 1, 678: 1, 952: 1, 1031: 1, 1044: 1, 1203: 1, 1348: 1, 1793: 1, 2828: 1, 3264: 3, 4742: 1, 5907: 1, 7990: 1, 7999: 1, 8451: 1, 8532: 1, 9570: 1, 11031: 1, 11731: 1, 12509: 1, 13555: 1, 13772: 1, 14918: 1, 15205: 1, 15350: 1, 16266: 1, 16356: 1, 16865: 1, 17236: 2, 17627: 1, 17798: 1, 17931: 2, 18178: 1, 18505: 1, 18776: 1, 20475: 1, 20594: 1, 20782: 1, 21831: 1, 21856: 1, 21907: 1, 22560: 1, 22565: 2, 22717: 1, 23714: 1, 23813: 1, 24145: 1, 25937: 1, 26437: 1, 26438: 1, 26592: 1, 26674: 1, 26679: 1, 27109: 1, 27491: 2, 28583: 1, 28864: 1, 29636: 1, 31652: 1, 31725: 1, 31862: 1, 33382: 1, 33923: 1, 34311: 1, 34451: 1, 34478: 1, 34778: 1, 34904: 1, 35034: 1, 35724: 1, 36136: 1, 36596: 1, 36672: 1, 37048: 1, 37872: 1, 37891: 1, 37949: 1, 38002: 1, 38224: 1, 38225: 2, 38226: 3, 38856: 1, 39818: 1, 40870: 1, 41238: 1, 41239: 1, 41240: 1, 41276: 1, 41731: 1}, 42624)
vocabulary[3264]
'bake'
vocabulary[38226]
'timmons'
The words “bake” and “timmons” seem more informative. You should explore your data when determining what words should be included in the stop-word list.
It may seem like a daunting task to list all the words we don’t want. However, recalling what we discussed about morphology, we can narrow down what we want to remove. We want to remove unbound function morphemes.
A fluent speaker of a language, who knows these basics of morphology, is able to create a reasonably good list. However, this still leaves two concerns. What if we need to keep some common words? What if we want to remove some common lexical morphemes? You can modify the list, but that still leaves one last concern. How do we handle queries like “fictional cats”? The word “fictional” is less common than “cats,” so it makes sense that the former should be more important in determining what documents are returned. Let’s look at how we can implement this using our data.
Instead of manually editing our vocabulary, we can try and weight the words. We need to find some way of weighting the words using their “commonness.” One way to define “commonness” is by identifying the number of documents in our corpus that contain the word. This is generally called document frequency. We want words with high document frequency to be down-weighted, so we are interested in using inverse document frequency (IDF).
We multiply these values by the term frequencies, that is, the frequencies of the words within a given document. The product of term frequency and inverse document frequency is the TF.IDF value.
There are many different flavors of TF.IDF. The most common kind is smoothed logarithmic.
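For reference, a common smoothed logarithmic formulation (libraries differ slightly in the exact variant) is:

tf(t, d) = log(1 + count(t, d))
idf(t, D) = log(N / (1 + df(t)))
tfidf(t, d, D) = tf(t, d) * idf(t, D)

where N is the number of documents in D and df(t) is the number of documents containing t. (The code below substitutes the vocabulary size for N; that changes the scale of the weights but not the procedure.)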
Let’s calculate this with our vectors. We actually already have the term frequencies, so all we need to do is calculate the idf, transform the values with log, and multiply tf and idf.
idf = Counter()
for vector in vectors.values():
    idf.update(vector.indices)
for ix, count in idf.most_common(20):
    print('{:5d} {:20s} {:d}'.format(ix, vocabulary[ix], count))
11031 date                 2000
15475 from                 2000
23813 messageid            2000
26438 newsgroups           2000
28583 path                 2000
36672 subject              2000
21907 lines                1993
27897 organization         1925
37876 the                  1874
 1793 apr                  1861
 3598 be                   1837
38317 to                   1767
27500 of                   1756
   56 a                    1730
16266 gmt                  1717
18329 i                    1708
18730 in                   1695
 1396 and                  1674
15166 for                  1474
17238 have                 1459
We can now make idf a SparseVector. We know it contains all the words, so it actually won’t be sparse, but this will help us implement the next steps.
indices, values = zip(*idf.items())
idf = SparseVector(indices, values, vocab_len)
from math import log

for index, vector in vectors.items():
    vector.values = list(map(lambda v: log(1+v), vector.values))

idf.values = list(map(lambda v: log(vocab_len / (1+v)), idf.values))
tfidf = {index: tf.hadamard(idf) for index, tf in vectors.items()}
tfidf[42]
SparseVector({56: 2.2206482367540246, 630: 5.866068667810157, 678: 5.793038323439593, 937: 2.7785503981772224, 952: 5.157913986067814, ..., 41731: 2.4998956290056062, 42384: 3.8444034764394415}, 42624)
Let’s look at the TF.IDF values for “be” and “the.” Let’s also look at one of the terms with a higher TF.IDF than these common words.
tfidf[42][3598] # be
4.358148273729854
tfidf[42][37876] # the
4.3305185461380855
vocabulary[17236], tfidf[42][17236]
Let’s look at the document to get an idea of why this word is so important.
print(doc_index.loc[42]['text'])
Path: cantaloupe.srv.cs.cmu.edu!das-news.harvard...
From: timmbake@mcl.ucsb.edu (Bake Timmons)
Newsgroups: alt.atheism
Subject: Re: Amusing atheists and agnostics
Message-ID: <timmbake.735285604@mcl>
Date: 20 Apr 93 06:00:04 GMT
Sender: news@ucsbcsl.ucsb.edu
Lines: 32

Maddi Hausmann chirps:
>timmbake@mcl.ucsb.edu (Bake Timmons) writes:

> ...

>"Killfile" Keith Allen Schneider = Frank "Closet Theist" O'Dwyer = ...
= Maddi "The Mad Sound-O-Geek" Hausmann

...whirrr...click...whirrr

--
Bake Timmons, III

...
We can see that the document is talking about someone named “Maddi Hausmann.”
Spark has stages for calculating TF.IDF in MLlib. If you have a column that contains arrays of strings, you can use either CountVectorizer, which we are already familiar with, or HashingTF to get the tf values. HashingTF uses the hashing trick: you decide on a vector space beforehand and hash the words into that space. If there is a collision, the colliding words are counted as the same term. This lets you trade off memory efficiency against accuracy. As you make the predetermined vector space larger, the output vectors become larger, but the chance of collisions decreases.
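A minimal sketch of what that might look like with the MLlib stages, applied to our normalized tokens (the number of features here is an arbitrary choice for illustration):

from pyspark.ml.feature import HashingTF, IDF

hashing_tf = HashingTF() \
    .setInputCol('normalized') \
    .setOutputCol('tf') \
    .setNumFeatures(1 << 18)  # the predetermined vector space size

hashed = hashing_tf.transform(indexed_w_tokens)
idf_model = IDF().setInputCol('tf').setOutputCol('tfidf').fit(hashed)
tfidf_df = idf_model.transform(hashed)

Unlike CountVectorizer, HashingTF never materializes a vocabulary, which is what makes it memory efficient.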
Now that we know how to turn a document into a vector, in the next chapter we can explore how we can use that vector in classic machine learning tasks.
Now that we have calculated the TF.IDF values, let’s build a search function. First, we need a function to process the query.
def process_query(query, pipeline):
    data = spark.createDataFrame([(query,)], ['text'])
    return pipeline.transform(data).first()['normalized']
Then we need a function to get the filter set.
def get_filter_set(processed_query):
    filter_set = set()
    # find all the documents that contain any of the terms
    return filter_set
Next, we need a function that will compute the score for the document.
def get_score(index, terms):
    return  # return a single score
We also want a function for displaying results.
def display(index, score, terms):
    hits = [term for term in terms
            if term in vocabulary
            and tfidf[index][inv_vocabulary[term]] > 0.]
    print('terms', terms, 'hits', hits)
    print('score', score)
    print('path', doc_index.loc[index]['path'])
    print('length', len(doc_index.loc[index]['text']))
Finally, we are ready for our search function.
def search(query, pipeline, k=5):
    processed_query = process_query(query, pipeline)
    filter_set = get_filter_set(processed_query)
    scored = {index: get_score(index, processed_query)
              for index in filter_set}
    display_list = list(sorted(filter_set, key=scored.get, reverse=True))[:k]
    for index in display_list:
        display(index, scored[index], processed_query)
search('search engine', pipeline)
You should be able to implement get_filter_set and get_score easily using examples in this chapter. Try out a few queries. You will likely notice that there are two big limitations here: there is no N-gram support, and the ranker is biased toward longer documents. What could you modify to fix these problems?
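If you get stuck, here is one possible sketch, assuming the inverted_index, tfidf, and inv_vocabulary objects built earlier in this chapter; it builds the filter set as the union of the posting sets and scores a document by summing its TF.IDF values for the query terms:

def get_filter_set(processed_query):
    filter_set = set()
    # collect every document that contains any of the query terms
    for term in processed_query:
        filter_set |= inverted_index.get(term, set())
    return filter_set

def get_score(index, terms):
    # sum of the document's TF.IDF weights for the query terms
    return sum(tfidf[index][inv_vocabulary[term]]
               for term in terms if term in inv_vocabulary)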