This chapter focuses on the basic text-processing techniques you can apply to get started with NLP: tokenization, vocabulary reduction, bag-of-words, and N-grams. You can solve many tasks with these techniques plus some basic machine learning, and knowing how, when, and why to use them will help you with both simple and complicated NLP tasks. That is why the discussion of each linguistic technique also covers its implementation. We will focus on working with English for now, though we will mention some considerations that arise when working with other languages. We are focusing on English because covering these techniques in depth across different languages would be very difficult.
Let’s load the data from the mini_newsgroups again, and then we will explore tokenization.
import os

from pyspark.sql.types import *
from pyspark.ml import Pipeline

import sparknlp
from sparknlp import DocumentAssembler, Finisher

spark = sparknlp.start()
space_path = os.path.join('data', 'mini_newsgroups', 'sci.space')
texts = spark.sparkContext.wholeTextFiles(space_path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

texts = spark.createDataFrame(texts, schema=schema).persist()
## excerpt from mini newsgroups modified for examples
example = '''
Nick's right about this. It's always easier to obtian forgiveness than
permission. Not many poeple remember that Britan's Kng George III
expressly forbade his american subjects to cross the
alleghany/appalachian mountains. Said subjects basically said, "Stop us
if you can." He couldn't.
'''

example = spark.createDataFrame([('.', example)], schema=schema).persist()
Language data, from both text and speech, is sequential data. When working with sequential data, it is vital to understand what your sequences are made up of. On disk and in memory, our text data is a sequence of bytes. We use encodings like UTF-8 to turn these bytes into characters. This is the first step toward interpreting our data as language. This is almost always straightforward because we have agreed-upon standards for encoding characters as bytes. Turning bytes into characters is not enough to get the useful information we want, however. We next need to turn our sequence of characters into words. This is called tokenization.
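To make this progression concrete, here is a minimal sketch in plain Python (the example string is our own, not taken from the newsgroup data):

raw = "Nick's right".encode('utf-8')   # on disk: a sequence of bytes
print(raw)      # b"Nick's right"

chars = raw.decode('utf-8')            # bytes -> characters, via an agreed-upon encoding
tokens = chars.split()                 # a first, naive pass at turning characters into words
print(tokens)   # ["Nick's", 'right']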
Although we all intuitively understand what a “word” is, defining it linguistically is more difficult. Identifying a word is easy for a human. Let’s look at some examples:
English speakers will recognize example 1 as a word, example 2 as a possible word, and example 3 as not a possible word; example 4 is trickier. The suffix “-ism” is something we attach to a word, a bound morpheme, but it has been used as an unbound morpheme. Indeed, there are languages that do not traditionally have word boundaries in their writing, like Chinese. So, although we can recognize what is and is not a word when standing alone, it is more difficult to define what is and is not a word in a sequence of words. We can go with the following definition: a sequence of morphemes is a word if splitting it apart or combining it with neighboring morphemes would change the meaning of the sentence.
In English, and other languages that use a delimiter between words, it is common to use regular expressions to tokenize. Let’s look at some examples.
First, let’s look at a whitespace tokenizer:
from pyspark.ml.feature import RegexTokenizer

ws_tokenizer = RegexTokenizer()\
    .setInputCol('text')\
    .setOutputCol('ws_tokens')\
    .setPattern('\\s+')\
    .setGaps(True)\
    .setToLowercase(False)

text, tokens = ws_tokenizer.transform(example)\
    .select("text", "ws_tokens").first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(tokens)
["Nick's", 'right', 'about', 'this.', "It's", 'always', 'easier', 'to', 'obtian', 'forgiveness', 'than', 'permission.', 'Not', 'many', 'poeple', 'remember', 'that', "Britan's", 'Kng', 'George', 'III', 'expressly', 'forbade', 'his', 'American', 'subjects', 'to', 'cross', 'the', 'alleghany/appalachian', 'mountains.', 'Said', 'subjects', 'basically', 'said,', '"Stop', 'us', 'if', 'you', 'can."', 'He', "couldn't."]
This leaves a lot to be desired. We can see that many of the tokens are words with punctuation still attached. Let’s add the word-boundary pattern “\b”.
b_tokenizer = RegexTokenizer()\
    .setInputCol('text')\
    .setOutputCol('b_tokens')\
    .setPattern('\\s+|\\b')\
    .setGaps(True)\
    .setToLowercase(False)

text, tokens = b_tokenizer.transform(example)\
    .select("text", "b_tokens").first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(tokens)
['Nick', "'", 's', 'right', 'about', 'this', '.', 'It', "'", 's', 'always', 'easier', 'to', 'obtian', 'forgiveness', 'than', 'permission', '.', 'Not', 'many', 'poeple', 'remember', 'that', 'Britan', "'", 's', 'Kng', 'George', 'III', 'expressly', 'forbade', 'his', 'American', 'subjects', 'to', 'cross', 'the', 'alleghany', '/', 'appalachian', 'mountains', '.', 'Said', 'subjects', 'basically', 'said', ',', '"', 'Stop', 'us', 'if', 'you', 'can', '."', 'He', 'couldn', "'", 't', '.']
We have the punctuation separated, but now all the contractions are broken into three tokens—for example, “It’s” becomes “It”, “‘”, “s”. This is less than ideal.
In Spark NLP, the tokenizer is more sophisticated than a single regex. It takes the following parameters (apart from the usual input and output column name parameters):
The algorithm works in the following steps:
Let’s see an example.
from sparknlp.annotator import Tokenizer

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('doc')
tokenizer = Tokenizer()\
    .setInputCols(['doc'])\
    .setOutputCol('tokens_annotations')
finisher = Finisher()\
    .setInputCols(['tokens_annotations'])\
    .setOutputCols(['tokens'])\
    .setOutputAsArray(True)

pipeline = Pipeline()\
    .setStages([assembler, tokenizer, finisher])

text, tokens = pipeline.fit(texts).transform(example)\
    .select("text", "tokens").first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(tokens)
['Nick', "'s", 'right', 'about', 'this', '.', 'It', "'s", 'always', 'easier', 'to', 'obtian', 'forgiveness', 'than', 'permission', '.', 'Not', 'many', 'poeple', 'remember', 'that', 'Britan', "'s", 'Kng', 'George', 'III', 'expressly', 'forbade', 'his', 'American', 'subjects', 'to', 'cross', 'the', 'alleghany/appalachian', 'mountains', '.', 'Said', 'subjects', 'basically', 'said', ',', '"', 'Stop', 'us', 'if', 'you', 'can', '.', '"', 'He', 'could', "n't", '.']
Here we see that the punctuation is separated out and the contractions are split into two tokens—for example, “It’s” becomes “It” and “’s”. This matches more closely with our intuitive definition of a word.
Now that we have our tokens, we have another thing to contend with—reducing our vocabulary.
Most NLP tasks involve turning the text into vectors. Initially, your vectors will have dimension equal to the size of your vocabulary. An implicit assumption in doing this is that the words are orthogonal to each other—that “cat,” “dog,” and “dogs” are all considered equally different. We would like to represent words in a vector space that is somehow related to their meaning, but that is more complicated; we will cover such representations in Chapters 10 and 11. There are simpler ways to tackle this problem, however. If we know that two words are almost the same, or are at least equivalent for our purposes, we can represent them with the same dimension in our vector. This will help classification, regression, and search tasks. So how can we do this? We can use our knowledge of morphology (how words are constructed from smaller words and affixes) and remove affixes before constructing our vector. The two primary techniques for doing this are stemming and lemmatization.
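To see why this matters for our vectors, consider a small sketch (with a toy, made-up vocabulary) of counting words before and after mapping “dogs” to “dog”:

# Toy example: one dimension per surface form vs. one per reduced form.
vocab = ['cat', 'dog', 'dogs']
doc = ['dog', 'dogs', 'cat']
print([doc.count(w) for w in vocab])             # [1, 1, 1] -- "dog" and "dogs" look unrelated

reduced = ['dog' if w == 'dogs' else w for w in doc]
merged_vocab = ['cat', 'dog']
print([reduced.count(w) for w in merged_vocab])  # [1, 2] -- the two forms now share a dimension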
Stemming is the process of removing affixes and leaving a word stem. This is done according to sets of rules that determine what characters to delete or replace. The first stemming algorithm was created by Julie Beth Lovins in 1968, although there had been earlier work on the subject. In 1980, Martin Porter created the Porter stemmer, which is certainly the most well-known stemming algorithm. He later created a domain-specific language and associated tools for writing stemming algorithms, called Snowball. Although people almost always use a predefined stemmer, if you find that affixes that should be removed are not being removed, or vice versa, consider writing your own stemmer or modifying an existing algorithm.
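To make the idea of rule-based affix removal concrete, here is a toy suffix-stripping stemmer of our own; real stemmers such as Porter’s apply many more rules, with conditions on the remaining stem:

# A toy rule-based stemmer, for illustration only.
suffix_rules = [
    ('sses', 'ss'),   # classes -> class
    ('ies', 'i'),     # ponies -> poni
    ('ness', ''),     # forgiveness -> forgive (the Porter stemmer goes further, to "forgiv")
    ('s', ''),        # mountains -> mountain
]

def toy_stem(word):
    # Apply the first matching rule; leave the word alone if none match.
    for suffix, replacement in suffix_rules:
        if word.endswith(suffix):
            return word[:-len(suffix)] + replacement
    return word

print([toy_stem(w) for w in ['classes', 'ponies', 'mountains', 'forgiveness']])
# ['class', 'poni', 'mountain', 'forgive']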
Lemmatization is the process of replacing a word with its lemma or head-word. The lemma is the form of a word that has a full dictionary entry. For example, if you look up “oxen” in a dictionary, it will likely redirect you to “ox.” Algorithmically, this is easy to implement but is dependent on the data you use for looking up lemmas.
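Here is a minimal sketch of dictionary-based lemmatization; the tiny lemma dictionary below is made up for illustration, whereas a real lemmatizer’s dictionary has many thousands of entries:

# A toy lemma dictionary mapping inflected forms to their headwords.
lemma_dict = {
    'oxen': 'ox',
    'geese': 'goose',
    'forbade': 'forbid',
    'mountains': 'mountain',
}

def lemmatize(token, lemmas=lemma_dict):
    # Fall back to the token itself when it has no dictionary entry.
    return lemmas.get(token.lower(), token)

print([lemmatize(t) for t in ['oxen', 'forbade', 'cross']])   # ['ox', 'forbid', 'cross']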
There are pros and cons to both stemming and lemmatization.
Which method you use will depend on your task and your resource constraints.
Use stemming if:
Use lemmatization if:
Let’s look at some examples of using stemming and lemmatization in Spark NLP.
from sparknlp.annotator import Stemmer, Lemmatizer, LemmatizerModel

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('doc')
tokenizer = Tokenizer()\
    .setInputCols(['doc'])\
    .setOutputCol('tokens_annotations')
stemmer = Stemmer()\
    .setInputCols(['tokens_annotations'])\
    .setOutputCol('stems_annotations')
# The next line downloads the lemmatizer "model". Here, "training"
# is reading the user-supplied lemma dictionary.
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['tokens_annotations'])\
    .setOutputCol('lemma_annotations')
finisher = Finisher()\
    .setInputCols(['stems_annotations', 'lemma_annotations'])\
    .setOutputCols(['stems', 'lemmas'])\
    .setOutputAsArray(True)

pipeline = Pipeline()\
    .setStages([
        assembler, tokenizer, stemmer, lemmatizer, finisher])

text, stems, lemmas = pipeline.fit(texts).transform(example)\
    .select("text", "stems", "lemmas").first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(stems)
['nick', "'", 'right', 'about', 'thi', '.', 'it', "'", 'alwai', 'easier', 'to', 'obtian', 'forgiv', 'than', 'permiss', '.', 'not', 'mani', 'poepl', 'rememb', 'that', 'britan', "'", 'kng', 'georg', 'iii', 'expressli', 'forbad', 'hi', 'american', 'subject', 'to', 'cross', 'the', 'alleghany/appalachian', 'mountain', '.', 'said', 'subject', 'basic', 'said', ',', '"', 'stop', 'u', 'if', 'you', 'can', '.', '"', 'he', 'could', "n't", '.']
print(lemmas)
['Nick', 'have', 'right', 'about', 'this', '.', 'It', 'have', 'always', 'easy', 'to', 'obtian', 'forgiveness', 'than', 'permission', '.', 'Not', 'many', 'poeple', 'remember', 'that', 'Britan', 'have', 'Kng', 'George', 'III', 'expressly', 'forbid', 'he', 'American', 'subject', 'to', 'cross', 'the', 'alleghany/appalachian', 'mountain', '.', 'Said', 'subject', 'basically', 'say', ',', '"', 'Stop', 'we', 'if', 'you', 'can', '.', '"', 'He', 'could', 'not', '.']
Some examples to note: the stemmer reduces “this” to “thi” and “always” to “alwai,” while the lemmatizer leaves those words unchanged; and the possessive “'s” is handled differently by the two—the stemmer leaves a bare apostrophe (“Britan's” becomes ["britan", "'"]), while the lemmatizer maps it to “have” (["Britan", "have"]), erroneously.

The typology of the language will greatly influence which approach is easier. There are common forms of words that have drastically different meanings. For example, in Spanish we would not want to combine “puerto” (“port”) and “puerta” (“door”), but we may want to combine “niño” (“boy”) and “niña” (“girl”). This means that reductions depend on lexical semantics (the meaning of the word). This would be nearly impossible to fully support in a stemming algorithm, so you likely want to lemmatize. On the other hand, if a language has a rich morphology, the lemma dictionary will be very large (the number of forms times the number of words).
An often overlooked aspect of vocabulary reduction is misspellings. In text that is not edited or proofread by the author, this can create a very long tail. Worse, there are some mistakes that are so common that the misspelling can actually be a moderately common token, which makes it very hard to remove.
There are two approaches to spelling correction in Spark NLP. SymmetricDelete needs a set of correct words to search. This vocabulary can be provided as a dictionary, or by providing a trusted corpus. It is based on the SymSpell project by Wolf Garbe. The other approach is the Norvig spelling correction algorithm, which works by creating a simple probability model. This approach also needs a correct vocabulary, but it suggests the most probable word—i.e., the most frequent word in the trusted corpus with a certain edit distance from the given word.
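To give a feel for the idea behind Norvig-style correction, here is a rough sketch of our own (not Spark NLP’s implementation, and the corpus counts are hypothetical): generate all candidate strings within edit distance 1 of a word and pick the most frequent one in a trusted corpus.

from collections import Counter

# Hypothetical word frequencies from a trusted corpus.
corpus_counts = Counter({'people': 120, 'britain': 45, 'obtain': 30, 'poodle': 2})
letters = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one delete, transpose, replace, or insert away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def correct(word):
    # Among known candidates, pick the most frequent; otherwise keep the word.
    candidates = [w for w in edits1(word) if w in corpus_counts] or [word]
    return max(candidates, key=corpus_counts.get)

print(correct('poeple'))   # 'people'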
Let’s look at the pretrained Norvig spelling correction.
from sparknlp.annotator import NorvigSweetingModel
from sparknlp.annotator import SymmetricDeleteModel
# Norvig pretrained
assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('doc')
tokenizer = Tokenizer()\
    .setInputCols(['doc'])\
    .setOutputCol('tokens_annotations')
norvig_pretrained = NorvigSweetingModel.pretrained()\
    .setInputCols(['tokens_annotations'])\
    .setOutputCol('norvig_annotations')
finisher = Finisher()\
    .setInputCols(['norvig_annotations'])\
    .setOutputCols(['norvig'])\
    .setOutputAsArray(True)

# The lemmatizer defined in the previous example is reused here.
pipeline = Pipeline()\
    .setStages([
        assembler, tokenizer, norvig_pretrained, lemmatizer, finisher])

text, norvig = pipeline.fit(texts).transform(example)\
    .select("text", "norvig").first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(norvig)
['Nick', "'s", 'right', 'about', 'this', '.', 'It', "'s", 'always', 'easier', 'to', 'obtain', 'forgiveness', 'than', 'permission', '.', 'Not', 'many', 'people', 'remember', 'that', 'Britain', "'s", 'Kng', 'George', 'III', 'expressly', 'forbade', 'his', 'American', 'subjects', 'to', 'cross', 'the', 'alleghany/appalachian', 'mountains', '.', 'Said', 'subjects', 'basically', 'said', ',', '"', 'Stop', 'us', 'if', 'you', 'can', '.', '"', 'He', 'could', "n't", '.']
We see that “obtian,” “poeple,” and “Britan” are all corrected. However, “Kng” is missed, and “american” is converted to “Americana.” These two mistakes are likely due to capitalization, which makes matching against the probability model more difficult.
Normalization is a more heuristic-based cleanup step. If you are processing data scraped from the web, it is not uncommon to have HTML artifacts (tags, HTML encodings, etc.) left behind. Getting rid of these artifacts can reduce your vocabulary by quite a bit. If your task does not require numbers or anything else nonalphabetic, you can also use normalization to remove them.
from sparknlp.annotator import Normalizer
assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('doc')
tokenizer = Tokenizer()\
    .setInputCols(['doc'])\
    .setOutputCol('tokens_annotations')
norvig_pretrained = NorvigSweetingModel.pretrained()\
    .setInputCols(['tokens_annotations'])\
    .setOutputCol('norvig_annotations')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['norvig_annotations'])\
    .setOutputCol('lemma_annotations')
normalizer = Normalizer()\
    .setInputCols(['lemma_annotations'])\
    .setOutputCol('normtoken_annotations')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normtoken_annotations'])\
    .setOutputCols(['normtokens'])\
    .setOutputAsArray(True)

pipeline = Pipeline()\
    .setStages([
        assembler, tokenizer, norvig_pretrained,
        lemmatizer, normalizer, finisher])

text, normalized = pipeline.fit(texts).transform(example)\
    .select("text", "normtokens").first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his american subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(normalized)
['nicks', 'right', 'about', 'this', 'itys', 'always', 'easy', 'to', 'obtain', 'forgiveness', 'than', 'permission', 'not', 'many', 'people', 'remember', 'that', 'britans', 'kng', 'george', 'iii', 'expressly', 'forbid', 'he', 'americana', 'subject', 'to', 'cross', 'the', 'alleghanyappalachian', 'mountain', 'said', 'subject', 'basically', 'say', 'stop', 'we', 'if', 'you', 'can', 'he', 'couldnt']
Now that we have reduced our vocabulary by combining similar and misspelled words and removing HTML artifacts, we can feel confident that the vocabulary we are working with is a realistic reflection of the content of our documents. The next step is to turn these words into vectors for our model. There are many techniques for doing this, but we will start with the most straightforward approach, called bag-of-words. A bag (also called a multiset) is a set in which each element has a count; if you are familiar with the Python collection Counter, that is a good way to understand what a bag is. And so, a bag-of-words is the count of the words in our document. Once we have these counts, we turn them into a vector by mapping each unique word to an index.

Let’s look at a simple example using Python’s Counter.
text = "the cat in the hat" tokens = text.split() tokens
['the', 'cat', 'in', 'the', 'hat']
from collections import Counter

counts = Counter(tokens)
counts
Counter({'the': 2, 'cat': 1, 'in': 1, 'hat': 1})
index = {token: ix for ix, token in enumerate(counts.keys())}
index
{'the': 0, 'cat': 1, 'in': 2, 'hat': 3}
import numpy as np

vec = np.zeros(len(index))
for token, count in counts.items():
    vec[index[token]] = count
vec
array([2., 1., 1., 1.])
The example we have is for only one document. If we are working on a large corpus, our index will have far more words than we would expect to ever find in a single document. It is not uncommon for corpus vocabularies to number in the tens of thousands or hundreds of thousands, even though a single document will generally have tens to hundreds of unique words. For this reason, we want our vectors to be sparse.
A sparse vector is one in which only the nonzero values are stored. Sparse vectors are generally implemented as associative arrays, maps, or dictionaries from index to value. For sparse data, like bags-of-words, this can save a great deal of space. However, not all algorithms are implemented in a way that is compatible with sparse vectors.
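For example, the dense vector we built above can be stored sparsely as a map from index to count, which is the same structure Spark’s SparseVector uses. Here is a small sketch reusing the counts and index from the previous example:

from pyspark.ml.linalg import Vectors

# Keep only the nonzero entries, keyed by their index in the vocabulary.
sparse_dict = {index[token]: float(count) for token, count in counts.items()}
print(sparse_dict)   # {0: 2.0, 1: 1.0, 2: 1.0, 3: 1.0}

sparse_vec = Vectors.sparse(len(index), sparse_dict)
print(sparse_vec)    # SparseVector(4, {0: 2.0, 1: 1.0, 2: 1.0, 3: 1.0})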
In Spark, we can use the CountVectorizer to create our bags-of-words.
from pyspark.ml.feature import CountVectorizer

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('doc')
tokenizer = Tokenizer()\
    .setInputCols(['doc'])\
    .setOutputCol('tokens_annotations')
norvig_pretrained = NorvigSweetingModel.pretrained()\
    .setInputCols(['tokens_annotations'])\
    .setOutputCol('norvig_annotations')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['norvig_annotations'])\
    .setOutputCol('lemma_annotations')
normalizer = Normalizer()\
    .setInputCols(['lemma_annotations'])\
    .setOutputCol('normtoken_annotations')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normtoken_annotations'])\
    .setOutputCols(['normtokens'])\
    .setOutputAsArray(True)

sparknlp_pipeline = Pipeline().setStages([
    assembler, tokenizer, norvig_pretrained,
    lemmatizer, normalizer, finisher
])

count_vectorizer = CountVectorizer()\
    .setInputCol('normtokens')\
    .setOutputCol('bows')

pipeline = Pipeline().setStages([sparknlp_pipeline, count_vectorizer])
model = pipeline.fit(texts)
processed = model.transform(example)

text, normtokens, bow = processed\
    .select("text", "normtokens", 'bows').first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(normtokens)
['nick', 'have', 'right', 'about', 'this', 'it', 'have', 'always', 'easy', 'to', 'obtain', 'forgiveness', 'than', 'permission', 'not', 'many', 'people', 'remember', 'that', 'britain', 'have', 'kng', 'george', 'iii', 'expressly', 'forbid', 'he', 'american', 'subject', 'to', 'cross', 'the', 'alleghanyappalachian', 'mountain', 'said', 'subject', 'basically', 'say', 'stop', 'we', 'if', 'you', 'can', 'he', 'could', 'not']
Let’s look at the bag-of-words. This will be a sparse vector, so the elements are indices into the vocabulary and counts of occurrences. For example, 7: 3.0 means that the word at index 7 in our vocabulary occurs three times in this document.
bow
SparseVector(5319, {0: 1.0, 3: 2.0, 7: 3.0, 9: 1.0, 10: 1.0, 14: 2.0, 15: 1.0, 17: 1.0, 28: 1.0, 30: 1.0, 31: 2.0, 37: 1.0, 52: 1.0, 67: 1.0, 79: 2.0, 81: 1.0, 128: 1.0, 150: 1.0, 182: 1.0, 214: 1.0, 339: 1.0, 369: 1.0, 439: 1.0, 459: 1.0, 649: 1.0, 822: 1.0, 953: 1.0, 1268: 1.0, 1348: 1.0, 1698: 1.0, 2122: 1.0, 2220: 1.0, 3149: 1.0, 3200: 1.0, 3203: 1.0, 3331: 1.0, 3611: 1.0, 4129: 1.0, 4840: 1.0})
We can get the learned vocabulary from the CountVectorizerModel. This is the list of words. In the previous example, we saw that the word at index 7 occurs three times in this document. Looking at this vocabulary, that means “have” occurs three times.
count_vectorizer_model = model.stages[-1] vocab = count_vectorizer_model.vocabulary print(vocab[:20])
['the', 'be', 'of', 'to', 'and', 'a', 'in', 'have', 'for', 'it', 'that', 'i', 'on', 'from', 'not', 'you', 'space', 'this', 'they', 'as']
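As a quick check (not part of the original example), we can confirm this by pairing the sparse vector’s indices with the vocabulary:

# Map each nonzero index to its count, then look up index 7 in the vocabulary.
bow_dict = dict(zip(bow.indices, bow.values))
print(vocab[7], bow_dict[7])   # have 3.0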
The drawback to doing this is that we lose the meaning communicated by the arrangement of the words—the syntax. To say that parsing the syntax of natural language is difficult is an understatement. Fortunately, we often don’t need all the information encoded in the syntax.
The main drawback to using bag-of-words is that we are making use of only the meanings encoded in individual words and document-wide context. Language encodes a great deal of meaning in local contexts as well. Syntax is hard to model, let alone parse. Fortunately, we can use N-grams to extract some of the context without needing to use a complicated syntax parser.
N-grams, also known as shingles, are subsequences of words of length n within a string of words. They allow us to extract information from small windows of context. This gives us a first approximation of the information we can gather from syntax because, although we are looking at local context, there is structural information explicitly extracted. In many applications, N-grams are enough to extract the information necessary.
For low values of n there are special names. For example, 1-grams are called unigrams, 2-grams are called bigrams, and 3-grams are called trigrams. For values higher than 3, they are usually referred to as “number” + grams, like 4-grams.
Let’s look at some example N-grams.
text = "the quick brown fox jumped over the lazy dog" tokens = ["the", "quick", "brown", "fox", "jumped", "over", "the", "lazy", "dog"] unigrams = [('the',), ('quick',), ('brown',), ('fox',), ('jumped',), ('over',), ('the',), ('lazy',), ('dog',)] bigrams = [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox'), ('fox', 'jumped'), ('jumped', 'over'), ('over', 'the'), ('the', 'lazy'), ('lazy', 'dog')] trigrams = [('the', 'quick', 'brown'), ('quick', 'brown', 'fox'), ('brown', 'fox', 'jumped'), ('fox', 'jumped', 'over'), ('jumped', 'over', 'the'), ('over', 'the', 'lazy'), ('the', 'lazy', 'dog')]
We still need to choose our n. Generally, n is less than 4. Consider the longest multiword phrase you think will be important to your application. This generally depends on the length of your documents and on how technical the language is expected to be. In hospital medical records or other long documents with highly technical language, 3-, 4-, or even 5-grams might be useful. For tweets or short documents with informal language, bigrams should suffice.
Let’s look at some examples.
from pyspark.ml.feature import NGram

bigrams = NGram()\
    .setN(2)\
    .setInputCol("normtokens")\
    .setOutputCol("bigrams")
trigrams = NGram()\
    .setN(3)\
    .setInputCol("normtokens")\
    .setOutputCol("trigrams")

pipeline = Pipeline().setStages([sparknlp_pipeline, bigrams, trigrams])
model = pipeline.fit(texts)
processed = model.transform(example)

text, normtokens, bigrams, trigrams = processed\
    .select("text", "normtokens", 'bigrams', 'trigrams').first()
print(text)
Nick's right about this. It's always easier to obtian forgiveness than permission. Not many poeple remember that Britan's Kng George III expressly forbade his American subjects to cross the alleghany/appalachian mountains. Said subjects basically said, "Stop us if you can." He couldn't.
print(normtokens)
['nick', 'have', 'right', 'about', 'this', 'it', 'have', 'always', 'easy', 'to', 'obtain', 'forgiveness', 'than', 'permission', 'not', 'many', 'people', 'remember', 'that', 'britain', 'have', 'kng', 'george', 'iii', 'expressly', 'forbid', 'he', 'american', 'subject', 'to', 'cross', 'the', 'alleghanyappalachian', 'mountain', 'said', 'subject', 'basically', 'say', 'stop', 'we', 'if', 'you', 'can', 'he', 'could', 'not']
print(bigrams)
['nick have', 'have right', 'right about', 'about this', 'this it', 'it have', 'have always', 'always easy', 'easy to', 'to obtain', 'obtain forgiveness', 'forgiveness than', 'than permission', 'permission not', 'not many', 'many people', 'people remember', 'remember that', 'that britain', 'britain have', 'have kng', 'kng george', 'george iii', 'iii expressly', 'expressly forbid', 'forbid he', 'he american', 'american subject', 'subject to', 'to cross', 'cross the', 'the alleghanyappalachian', 'alleghanyappalachian mountain', 'mountain said', 'said subject', 'subject basically', 'basically say', 'say stop', 'stop we', 'we if', 'if you', 'you can', 'can he', 'he could', 'could not']
print(trigrams)
['nick have right', 'have right about', 'right about this', 'about this it', 'this it have', 'it have always', 'have always easy', 'always easy to', 'easy to obtain', 'to obtain forgiveness', 'obtain forgiveness than', 'forgiveness than permission', 'than permission not', 'permission not many', 'not many people', 'many people remember', 'people remember that', 'remember that britain', 'that britain have', 'britain have kng', 'have kng george', 'kng george iii', 'george iii expressly', 'iii expressly forbid', 'expressly forbid he', 'forbid he american', 'he american subject', 'american subject to', 'subject to cross', 'to cross the', 'cross the alleghanyappalachian', 'the alleghanyappalachian mountain', 'alleghanyappalachian mountain said', 'mountain said subject', 'said subject basically', 'subject basically say', 'basically say stop', 'say stop we', 'stop we if', 'we if you', 'if you can', 'you can he', 'can he could', 'he could not']
Now that we have learned how to extract tokens, we can look at how we can visualize a data set. We will look at two visualizations: word frequencies and word clouds from the space and autos newsgroups. They represent the same information but in different ways.
from sparknlp.pretrained import PretrainedPipeline

space_path = os.path.join('data', 'mini_newsgroups', 'sci.space')
space = spark.sparkContext.wholeTextFiles(space_path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

space = spark.createDataFrame(space, schema=schema).persist()

sparknlp_pipeline = PretrainedPipeline(
    'explain_document_ml', lang='en').model

normalizer = Normalizer()\
    .setInputCols(['lemmas'])\
    .setOutputCol('normalized')\
    .setLowercase(True)
finisher = Finisher()\
    .setInputCols(['normalized'])\
    .setOutputCols(['normalized'])\
    .setOutputAsArray(True)
count_vectorizer = CountVectorizer()\
    .setInputCol('normalized')\
    .setOutputCol('bows')

pipeline = Pipeline().setStages([
    sparknlp_pipeline, normalizer, finisher, count_vectorizer])
model = pipeline.fit(space)
processed = model.transform(space)
vocabulary = model.stages[-1].vocabulary
word_counts = Counter()
for row in processed.toLocalIterator():
    for ix, count in zip(row['bows'].indices, row['bows'].values):
        word_counts[vocabulary[ix]] += count
from matplotlib import pyplot as plt

%matplotlib inline
y = list(range(20))
top_words, counts = zip(*word_counts.most_common(20))

plt.figure(figsize=(10, 8))
plt.barh(y, counts)
plt.yticks(y, top_words)
plt.show()
Figure 5-1 shows the word frequencies from the space newsgroup.
from wordcloud import WordCloud

plt.figure(figsize=(10, 8))
wc = WordCloud(colormap='Greys', background_color='white')
im = wc.generate_from_frequencies(word_counts)
plt.imshow(im, interpolation='bilinear')
plt.axis("off")
plt.title('sci.space')
plt.show()
Figure 5-2 shows the word cloud from the space newsgroup.
autos_path = os.path.join('data', 'mini_newsgroups', 'rec.autos')
autos = spark.sparkContext.wholeTextFiles(autos_path)

schema = StructType([
    StructField('path', StringType()),
    StructField('text', StringType()),
])

autos = spark.createDataFrame(autos, schema=schema).persist()

model = pipeline.fit(autos)
processed = model.transform(autos)
vocabulary = model.stages[-1].vocabulary
word_counts = Counter()
for row in processed.toLocalIterator():
    for ix, count in zip(row['bows'].indices, row['bows'].values):
        word_counts[vocabulary[ix]] += count
y = list(range(20))
top_words, counts = zip(*word_counts.most_common(20))

plt.figure(figsize=(10, 8))
plt.barh(y, counts)
plt.yticks(y, top_words)
plt.show()
Figure 5-3 shows the word frequencies from the autos newsgroup.
from wordcloud import WordCloud

plt.figure(figsize=(10, 8))
wc = WordCloud(colormap='Greys', background_color='white')
im = wc.generate_from_frequencies(word_counts)
plt.imshow(im, interpolation='bilinear')
plt.axis("off")
plt.title('rec.autos')
plt.show()
Figure 5-4 shows the word cloud from the autos newsgroup.
Now we can visualize our text. However, the two newsgroups share many of the same most frequent words. In the next chapter, we will learn how to address this issue.
We’ve visualized unigrams, but we have a problem with common words. Let’s try visualizing N-grams. Try bigrams and trigrams.
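One possible way to approach this exercise (a sketch, not a prescribed solution; the stage names here are our own) is to insert an NGram stage before the CountVectorizer in the visualization pipeline and reuse the counting and plotting code above unchanged:

# Count bigrams instead of single words in the sci.space documents.
bigram_ngrams = NGram()\
    .setN(2)\
    .setInputCol('normalized')\
    .setOutputCol('bigram_tokens')
bigram_counter = CountVectorizer()\
    .setInputCol('bigram_tokens')\
    .setOutputCol('bows')

bigram_pipeline = Pipeline().setStages([
    sparknlp_pipeline, normalizer, finisher, bigram_ngrams, bigram_counter])
model = bigram_pipeline.fit(space)
processed = model.transform(space)
# From here, vocabulary = model.stages[-1].vocabulary and the word_counts,
# bar-chart, and word-cloud code above work the same way for bigrams;
# setting setN(3) gives trigrams.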