Chapter 8. Sequence Modeling with Keras

So far, we have looked at documents as bags-of-words. This is a common, easily achievable approach for many NLP tasks. However, this approach produces a low-fidelity model of language. The order of words is essential to the encoding and decoding of meaning in language, and to incorporate this, we will need to model sequences.

When people refer to sequences in a machine learning context, they are generally talking about sequences of data in which data points are not independent of the data points around them. We can still use features derived from the data point, as with general machine learning, but now we can also use the data and labels from the nearby data points. For example, if we are trying to determine if the token “produce” is being used as a noun or verb, knowing what words are around it will be very informative. If the token before it is “might,” that indicates “produce” is a verb. If “the” is the preceding token, that indicates it is a noun. These other words give us context.

What if “to” precedes “produce”? That could still indicate either a noun or a verb. We need to look back further. This gives us the concept of windows—the amount of context we want to capture. Many algorithms have a fixed amount of context that is decided as a hyperparameter. There are some algorithms, for example LSTMs, that can learn how long to remember the context.

Sequence problems come from different domains, and in different data formats. In this book we are working with text data, which is a sequence of characters, or words, or even messages. Working with voice data is inherently a sequence problem. Economics, physics, and medicine also have many sequence problems. The statistics and behavior of this data is generally very different among these disparate fields, but there are many techniques in common.

One of the common techniques used for sequence modeling is a family of machine learning algorithms called graphical models. Graphical models can be used to model probabilistic structures. This is what allows us to capture context. When building the graphical models, we have to decide how much context to capture. How much context we want to capture is based on the specifics of the problem. There are also models that let us learn how much context to use. The algorithm needed to learn a graphical model differs depending on the model chosen. Some can be learned with gradient descent; others have their own learning algorithms.

In this chapter, we will go over some of the popular algorithms for learning sequences, such as hidden Markov models, conditional random fields, and recurrent neural networks. Let’s start out with sentence segmentation and hidden Markov models.

Sentence Segmentation

One of the first sequence-modeling problems most learn is the sentence-boundary detection, or SBD problem, in which we want to determine where a sentence boundary is. This seems like a problem that could be solved by a regular expression. We can simply say that every “.”, “?”, and “!” are sentence boundaries. We can add something to check for acronyms—perhaps we can check if the previous character is capitalized and the character after the next whitespace is not. This will still miss some situations with acronyms—for example, two acronyms following each other like “U.S. D.O.S.”

What if our text is not so regularly formatted? If our text has many lists, as is common in technical communications, there may not be sentence-ending punctuation at the end of each item. However, we don’t want to call all the items in a list the same sentence. We could make more and more complicated regular expressions, or we could build a model.

There are some pros and cons to help us in deciding when to use a regular expression for SBD and when to use a model. Using a regular expression is easy to do, if you ignore some exceptions. If you are in a situation where using a model on text will be difficult, it is best to just use a regular expression. On the other hand, SBD models are somewhat generalizable. That being said, if you use an SBD model built on newspapers to process clinical text, it will produce more errors than you would expect.

(Hidden) Markov Models

Hidden Markov model, or HMM is a popular model for dealing with sequence data. To understand it, we need to discuss Markov models and the Markov property.

The Markov property relates to a stochastic (random) process. In a stochastic process, a random variable for time $t$ , $X_{t}$ , can be dependent on all previous variables in the sequence, $X_{t - 1}, X_{t - 2}, . . ., X_{0}$ . If $X_{t}$ is only dependent on the previous variable, $upper X Subscript t minus 1$ , then we say it has the Markov property. This simplifies problems greatly. This is an unrealistic assumption in language. However, like we saw in the previous chapter with independence assumptions of naïve Bayes and logistic regression, unrealistic assumptions don’t necessarily produce a bad model. We can also ease the Markov property and say that $X_{t}$ is dependent on the last $k$ variables.

We can also use the relationship between the sequence we wish to model and an observable sequence. Now we can estimate the probability of a particular state as:

P [y_{i} = k | x_{i} = c] \approx P [y_{i} = k | y_{i - 1} = k^{'}] \cdot P [y_{i - 1} = k^{'}] \cdot P [x_{i} = c | y_{i} = k]

We can calculate the transition probability $P [y_{i} = k | y_{i - 1} = k^{'}]$ , the initial probability $upper P left-bracket y 0 equals k right-bracket$ , and the emission probability $P [x_{i} = c | y_{i} = k]$ directly from the data if we have labels. Once we have this, how do we predict the hidden states, $y_{i}$ ? We use the Viterbi algorithm.

Let’s look at an example. We will be using the Brown corpus available with NLTK.

from collections import defaultdict, Counter

import numpy as np
import pandas as pd

import sparknlp

spark = sparknlp.start()

from nltk.corpus import brown
sentences = brown.sents()

The corpus is already split into sentences, so we will have to detokenize the data in order to get training data. Fortunately, this also provides us with labels. We will label the last character of a sentence with E, and everything else will be labeled S.

def detokenize(sentence):
    text = ''
    for token in sentence:
        if text and any(c.isalnum() for c in token):
            text += ' '
        text += token
    return text

word_counts = Counter()
raw = []
labels = []
for fid in brown.fileids():
        sentences = brown.sents(fid)
        word_counts.update(
            [t for s in sentences for t in s if t.isalpha()])
        sentences = [detokenize(s) for s in sentences]
        raw.append(' '.join(sentences))
        labels.append('S'.join([
            ('S' * (len(s) - 1)) + 'E' 
            for s in sentences
        ]))

word_counts = pd.Series(word_counts))

Now, let us define our training algorithm. If you note the equations defined previously, we will be doing repeated multiplication of probabilities; this creates a danger of underflow. Underflow is a limitation of floating-point numbers. Floating-point numbers can represent only so much precision. So, if a number approaches too close to 0, the floating-point representation may round to 0. For example, when implementing naïve Bayes, instead of multiplying probabilities, we will add log-probabilities. This is to avoid underflow when multiplying many numbers that are less than 1.

We will need the set of observations, the characters in the text, the set of states, whether or not a character marks a sentence ending, and the log probabilities. We will need to have the log probability for initial states. In this modeling problem the initial state will always be “S.” We will need the emission log probabilities—the log probability of a character given a state. And finally, we need the transition log probability, which is the log probability of a state given a previous state.

class HMM(object):
    def __init__(self, obs_set, state_set, initial_log, emission_log, 
                 transition_log):
        self.obs_set = obs_set
        self.state_set = state_set
        self.initial_log = initial_log
        self.emission_log = emission_log
        self.transition_log = transition_log

To calculate these things, we need to track total possible observations, states, and the counts we will use to calculate the log-probabilities.

def training_data():
    data_dict = {}
    data_dict['obs_set'] = set()
    data_dict['state_set'] = set()
    data_dict['transition_ct'] = defaultdict(Counter)
    data_dict['emission_ct'] = defaultdict(Counter)
    data_dict['initial_ct'] = Counter()
    return data_dict

Now we need to have a function that will update this data as we traverse over the data set.

def update_state(data_dict, ob_seq, st_seq):
    assert len(ob_seq) == len(st_seq)
    data_dict['initial_ct'][st_seq[0]] += 1
    for i in range(1, len(st_seq)):
        ob = ob_seq[i]
        st = st_seq[i]
        data_dict['obs_set'].add(ob)
        data_dict['state_set'].add(st)
        data_dict['transition_ct'][ob_seq[i-1]][ob] += 1
        data_dict['emission_ct'][st][ob] += 1

Now that we have the counts and the total set of observations and states, we can calculate the sums we will need for the log probabilties.

def calculate_sums(data_dict):
    data_dict['transition_sums'] = {
        st: np.sum(list(data_dict['transition_ct'][st].values())) 
        for st in data_dict['state_set']
    }
    data_dict['initial_sum'] = np.sum(
        list(data_dict['initial_ct'].values()))
    data_dict['emission_sums'] = {
        st: np.sum(list(data_dict['emission_ct'][st].values())) 
        for st in data_dict['state_set']
    }

Once we have the counts and the sums, we can calculate the log probabilties.

def calculate_log_probs(data_dict, eps):
    data_dict['transition_log'] = {
        prev_st: {
            # log P[y_i = k | y_i-1 = k']
            st: (np.log(data_dict['transition_ct'][prev_st][st] + \
                        eps) - \
                 np.log(data_dict['transition_sums'][prev_st] + \
                        eps)) 
            for st in data_dict['state_set']
        } 
        for prev_st in data_dict['state_set']
    }
    
    data_dict['initial_log'] = {
            # log P[y_0 = k]
        st: (np.log(data_dict['initial_ct'][st] + eps) - \
             np.log(data_dict['initial_sum'] + eps)) 
        for st in data_dict['state_set']
    }
    
    data_dict['emission_log'] = {
        st: {
            # log P[x_i = c | y_i = k]
            ob: (np.log(data_dict['emission_ct'][st][ob] + eps) - \
                 np.log(data_dict['emission_sums'][st] + eps)) 
            for ob in data_dict['obs_set']
        } 
        for st in data_dict['state_set']
    }

Finally, we have our train method to tie everything together.

def train(observations, states, eps=1e-8):
    # initialize
    data_dict = training_data()
    
    # traverse data and count all transitions, initials, and 
    # emissions
    for ob_seq, st_seq in zip(observations, states):
        update_state(data_dict, ob_seq, st_seq)
                
    calculate_sums(data_dict)
    
    calculate_log_probs(data_dict, eps)
    
    return HMM(list(data_dict['obs_set']), list(data_dict['state_set']), 
               data_dict['initial_log'], data_dict['emission_log'], 
               data_dict['transition_log'])

model = train(raw, labels)

Now, given a piece of text, we need to calculate the most probable sequence of states. We can use these predicted states to split a piece of text into sentences. For doing this, we will use the Viterbi algorithm. This algorithm will let us efficiently traverse the set of possible sequences.

def viterbi(y, model):
    # probabilities for the initial states
    path_logs = [{
        st: model.initial_log[st] + model.emission_log[st][y[0]] 
        for st in model.state_set
    }]
    path_preds = [{st: '' for st in model.state_set}]
    
    for i in range(1, len(y)):
        curr_log = {}
        curr_preds = {}
        for st in model.state_set:
            # find the most probable previous state that 
            # would lead to st
            curr_log[st] = -np.inf
            curr_preds[st] = ''
            for prev_st in model.state_set:
                # log probability
                local_log = path_logs[i-1][prev_st] + \
                    model.transition_log[prev_st][st] + \
                    model.emission_log[st][y[i]]
                if curr_log[st] < local_log:
                    curr_log[st] = local_log
                    curr_preds[st] = prev_st
        path_logs.append(curr_log)
        path_preds.append(curr_preds)

    # Now we work backwards. Find the most probable final 
    # state, and work back to the beginning.
    terminal_log = -np.inf
    curr_st = ''
    for st in model.state_set:
        if terminal_log < path_logs[-1][st]:
            terminal_log = path_logs[-1][st]
            curr_st = st
    preds = curr_st
    for i in range(len(y)-1, 0, -1):
        curr_st = path_preds[i][curr_st]
        preds = curr_st + preds
    return preds

Now that we can make predictions, we can build our own sentence splitter.

def split(text, model):
    state_seq = viterbi(text, model)
    sentences = []
    start = 0
    for end in range(1, len(text)):
        if state_seq[end] == 'E':
            sentences.append(text[start:end+1])
            start = end+1
    sentences.append(text[start:])
    return sentences

Let’s see how it does.

example = raw[0]

print('\n###\n'.join(split(example, model)[:10]))

The Fulton County Grand Jury said Friday an investigation of 
Atlanta's recent primary election produced`` no evidence'' that 
any irregularities took place.
###
The jury further said in term-
###
end presentments that the City Executive Committee, which had over-
###
all charge of the election,`` deserves the praise and thanks of the 
City of Atlanta'' for the manner in which the election was 
conducted.
###
The September-
###
October term jury had been charged by Fulton Superior Court Judge 
Durwood Pye to investigate reports of possible`` irregularities'' 
in the hard-
###
fought primary which was won by Mayor-
###
nominate Ivan Allen Jr.
###
.
###
`` Only a relative handful of such reports was received'', the jury 
said,`` considering the widespread interest in the election, the 
number of voters and the size of this city''.

Not great, but this is a simple model built from the data—not from manually encoded heuristics. There are several ways we could improve this. We can add more emission features because we are assuming they are independent of each other. We should look at the data to understand why the model thinks hyphens are ends of sentences.

This model is simple, but we already had plenty of labels. Getting labels like this can be a time-consuming process. You can use the Baum-Welch algorithm to learn transition and emission probabilities on a partially labeled or unlabeled data set.

Let’s look at how Spark NLP does sentence detection. The algorithm is based on Kevin Dias’s pragmatic_segmenter, originally implemented in Ruby. Let’s compare how it does to how our simple HMM does.

example_df = spark.createDataFrame([(example,)], ['text'])

from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import SentenceDetector

from pyspark.ml import Pipeline

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
sent_detector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')
finisher = Finisher()\
    .setInputCols(['sentences'])\
    .setOutputCols(['sentences'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, sent_detector, finisher
]).fit(example_df)

sentences = pipeline.transform(example_df)

print('\n###\n'.join(sentences.first()['sentences'][:10]))

The Fulton County Grand Jury said Friday an investigation of 
Atlanta's recent primary election produced`` no evidence'' that 
any irregularities took place.
###
The jury further said in term-end presentments that the City 
Executive Committee, which had over-all charge of the election,`` 
deserves the praise and thanks of the City of Atlanta'' for the 
manner in which the election was conducted.
###
The September-October term jury had been charged by Fulton Superior 
Court Judge Durwood Pye to investigate reports of possible`` 
irregularities'' in the hard-fought primary which was won by 
Mayor-nominate Ivan Allen Jr..
###
`` Only a relative handful of such reports was received'', the jury 
said,`` considering the widespread interest in the election, the 
number of voters and the size of this city''.
###
The jury said it did find that many of Georgia's registration and 
election laws`` are outmoded or inadequate and often ambiguous''.
###
It recommended that Fulton legislators act`` to have these laws 
studied and revised to the end of modernizing and improving them''.
###
The grand jury commented on a number of other topics, among them 
the Atlanta and Fulton County purchasing departments which it 
said`` are well operated and follow generally accepted practices 
which inure to the best interest of both governments''.
###
Merger proposed However, the jury said it believes`` these two 
offices should be combined to achieve greater efficiency and reduce 
the cost of administration''.
###
The City Purchasing Department, the jury said,`` is lacking in 
experienced clerical personnel as a result of city personnel 
policies''.
###
It urged that the city`` take steps to remedy'' this problem.

It definitely does better than the HMM. The pragmatic_segmenter is quite complex. We could build a more complex model, though. This is a good lesson that in some situations heuristics may be preferable to a model. When you are working on your NLP application, always try the simplest solution first. Look at what goes wrong and then make improvements .

Section Segmentation

There are some documents that are more like a collection of documents. A clinical encounter is recorded as a collection of notes, perhaps from different providers. Legal text is broken into different sections, each with different kinds of content and functions. Narrative text is often broken into chapters or scenes. These different sections may require different treatments, or even different models. For example, we may not want to use the same models on an admission note as we do on a radiological exam, even if they are part of the same visit.

Although sections are incredibly important, they are not actually a part of language per se. They are an artifact of the document formatting that the text is recorded in. There may still be meaning in the different kinds of sections, and even in their placement. This means that we often can’t generalize our techniques outside of a given corpus. Fortunately, regular expressions are even more effective on this problem than sentence-boundary detection.

Part-of-Speech Tagging

Parts of speech are word categories that govern how words are combined to form phrases and sentences. These can be very valuable, especially in a process that involves extracting information from the text. You are likely familiar with the most common parts of speech. In NLP, the categories are a little more complicated.

The following are common parts of speech:

Verbs: “know,” “try,” “call”
Nouns: “cat,” “banana,” “happiness”
Adjectives: “big,” “red,” “quick”
Adverbs: “well,” “now,” “quickly”
Prepositions: “of,” “behind,” “with”

Most part-of-speech tagging data comes from the University of Pennsylvania Treebank, or it is similarly formatted. This data set has a much larger set of parts-of-speech:

CC: Coordinating conjunction (“and”)
CD: Cardinal number (“one,” “1”)
DT: Determiner (“an,” “the”)
EX: Existential “there” (“there are”)
FW: Foreign word (“zeitgeist”)
IN: Preposition or subordinating conjunction (“of,” “because”)
JJ: Adjective (“happy,” “fun”)
JJR: Adjective, comparative (“happier”)
JJS: Adjective, superlative (“happiest”)
LS: List item marker (“a)”)
MD: Modal (“can,” “might”)
NN: Noun, singular or mass (“cat,” “knowledge”)
NNS: Noun, plural (“cats”)
NNP: Proper noun, singular (“Sarah”)
NNPS: Proper noun, plural (“Hungarians”)
PDT: Predeterminer (“half” in “It is half the price.”)
POS: Possessive ending (possessive “’s”)
PRP: Personal pronoun (“I,” “they”)
PRP\$: Possessive pronoun (“my,” “their”)
RB: Adverb (“quickly,” “well”)
RBR: Adverb, comparative (“quicker,” “better”)
RBS: Adverb, superlative (“quickest,” “best”)
RP: Particle (varies, but infinitive “to,” “off” in “It’s a write-off”)
SYM: Symbol (x in mathematical context)
TO: to (sometimes a separate category just for infinitive “to”)
UH: Interjection (“uh”)
VB: Verb, base form (after infinitive “to,” “call,” “know”)
VBD: Verb, past tense (“called,” “knew”)
VBG: Verb, gerund or present participle (“calling,” “knowing”)
VBN: Verb, past participle (“called,” “known”)
VBP: Verb, non–third-person singular present (“call,” “know”)
VBZ: Verb, third-person singular present (“calls,” “knows”)
WDT: Wh-determiner (“which”)
WP: Wh-pronoun (“who”)
WP\$: Possessive wh-pronoun (“whose”)
WRB: Wh-adverb (“when”)

Understanding the linguistics behind these lexical categories will help us understand how to extract them, as well as how to use them. Let’s look a little at how humans identify parts of speech.

Humans decode the part of speech from morphological and syntactic clues. This is why we can determine the parts of speech of nonsense words. Let’s look at part of “Jabberwocky,” the poem by Lewis Carroll.

’Twas brillig, and the slithy toves
Did gyre and gimble in the wabe:
All mimsy were the borogoves,
And the mome raths outgrabe.

“Beware the Jabberwock, my son!
The jaws that bite, the claws that catch!
Beware the Jubjub bird, and shun
The frumious Bandersnatch!”

He took his vorpal sword in hand;
Long time the manxome foe he sought—
So rested he by the Tumtum tree
And stood awhile in thought.

Fluent English speakers will be able to tell you that “brillig” and “vorpal” are adjectives, “gyre” and “gimble” are verbs, and “toves” and “Jabberwock” are nouns. It’s not as easy to do this with every category. If you make up your own subordinating conjunction, people may have a hard time identifying it.

I went there cloom they told me to.

This sentence seems wrong. The reason is that we are used to learning new words in some categories and not in others. If we can create words in a category, it is considered an open category; those we cannot easily add to are called closed categories. This is on a spectrum, though. Pronouns are not as open as nouns, but it is not uncommon for languages to innovate new pronouns. For example, “y’all” is no more than two or three centuries old.

Knowing that there are some categories that are more or less fixed, and some that are fully open, we can tell that our models are learning two kinds of prediction. Lexical cues are useful for closed categories, and contextual cues are useful for open categories.

Let’s take a look at how Spark NLP does POS tagging. In Spark NLP, a perceptron is used. We can train a model on the brown corpus, but first we must save the data in a particular format. Each token-tag pair must be joined by an underscore '_'. Each sentence worth of tagged tokens goes on one line. For example:

The_AT mayor's_NN$ present_JJ term_NN of_IN office_NN expires_VBZ 
Jan._NP 1_CD ._. 
He_PPS will_MD be_BE succeeded_VBN by_IN Ivan_NP Allen_NP Jr._NP 
,_, who_WPS became_VBD a_AT candidate_NN in_IN the_AT Sept._NP 
13_CD primary_NN after_CS Mayor_NN-TL Hartsfield_NP announced_VBD 
that_CS he_PPS would_MD not_* run_VB for_IN reelection_NN ._.

from sparknlp.training import POS

with open('tagged_brown.txt', 'w') as out:
    for fid in brown.fileids():
        for sent in brown.tagged_sents(fid):
            for token, tag in sent:
                out.write('{}_{} '.format(token, tag))
            out.write('\n')
        
tag_data = POS().readDataset(spark, 'tagged_brown.txt', '_', 'tags')

Now we can build our pipeline and train our model.

from sparknlp.annotator import Tokenizer, PerceptronApproach

assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
sent_detector = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentences')
tokenizer = Tokenizer() \
    .setInputCols(['sentences']) \
    .setOutputCol('tokens')

pos_tagger = PerceptronApproach() \
    .setNIterations(1) \
    .setInputCols(["sentences", "tokens"]) \
    .setOutputCol("pos") \
    .setPosCol("tags")

finisher = Finisher()\
    .setInputCols(['tokens', 'pos'])\
    .setOutputCols(['tokens', 'pos'])\
    .setOutputAsArray(True)

pipeline = Pipeline().setStages([
    assembler, sent_detector, tokenizer, pos_tagger, finisher
])

pipeline = pipeline.fit(tag_data)

Let’s look at how it did on the first sentence.

tag_data.first()['text']

'The Friday an investigation of primary election produced evidence 
any irregularities took place .'

tag_data_first = tag_data.first()['tags']
txformed_first = pipeline.transform(tag_data).first()

for i in range(len(tag_data_first)):
    word = tag_data_first[i]['metadata']['word']
    true_pos = tag_data_first[i]['result']
    pred_pos = txformed_first['pos'][i]
    print('{:20s} {:5s} {:5s}'.format(word, true_pos, pred_pos))

The                  AT    AT   
Friday               NR    NR   
an                   AT    AT   
investigation        NN    NN   
of                   IN    IN   
primary              NN    JJ   
election             NN    NN   
produced             VBD   VBN  
evidence             NN    NN   
any                  DTI   DTI  
irregularities       NNS   NNS  
took                 VBD   VBD  
place                NN    NN

The model has learned this data set well. In fact, “primary” being a noun or adjective can both be true, depending on how you parse the sentence.

There are other techniques for part-of-speech tagging, such as conditional random fields.

Conditional Random Field

If hidden Markov models are sequential naïve Bayes, then conditional random fields (CRFs) can be thought of as sequential logistic regression. CRFs are another popular technique for part-of-speech tagging. CRFs have a few benefits over HMMs. They allow for more complex features because they make fewer assumptions on the relationships between features and labels. In Spark NLP, CRFs are one approach used for part-of-speech tagging, in addition to using RNNs. CRFs are learned using gradient descent like logistic regression, but they are run using algorithms like the Viterbi algorithm.

Chunking and Syntactic Parsing

Now that we have the parts of speech for individual tokens, we must consider combining them. Many NLP tasks involve finding entities in the text. Often, these entities are referenced using more than one-word phrases. This means that we have to understand how to combine tagged tokens in phrases, which are generally known as chunks.

Similar to sentence-boundary detection, we can do this with a heuristic, but there will be errors that we’ll need a model to avoid. The heuristics are relatively simple—if two adjacent tokens have the same or similar tags, combine them into a single phrase. For example, the words “fan blade” are both nouns, so they can be combined into a single noun phrase. We can combine certain known structures, like some verb-preposition combinations, such as “knock off,” into a single verb or noun. Heuristics won’t cover more complex syntactic structures. For example, the verb “knock off” inserts its object between “knock” and “off,” so if you are trying to normalize your vocabulary, the infinitive “to knock off” won’t be combined with the inflected form “knocked X off.” There may also be certain syntactic structures you are interested in, like knowing who did what to whom.

To extract these more complex syntactic structures, we need a syntactic parser. These models turn sequences into tree structures. The tokens then become constituents in larger phrases. Language is very complicated, but fortunately, straightforward structures are generally more common than complex ones.

Before using a syntactic parser, make sure that you are certain that you need it. They are often generally complex models that are resource-intensive to train and use. It’s probably best if you try to solve your problem with heuristics first, and if that is not sufficient, consider a syntactic parser.

An additional caveat about syntactic parsers is that labeling is a difficult task. Anyone who is fluent in a language can reliably split text into sentences. Most people can learn parts of speech well enough to label data in a few minutes. Syntactic parsing is a significantly more complex labeling task.

Language Models

Another classic application of sequence modeling is language modeling. A language model is a model of process that generates language. This is called a generative model, as opposed to a discriminative model, which is used for distinguishing the difference between things. Of course, the actual process by which human language is generated is incredibly complex, and is still being explored by neurologists, psychologists, and philosophers. In NLP, language models make simplifying assumptions—for example, that the text generation process can be learned from text alone.

There are many uses for language models. We can use them for feature generation with other models. For example, a neural-network–based language model can be used to feed into other layers used for sequence labeling. Language models are also used for text generation and summarization.

Some of the techniques covered in this chapter can be used to create a language model. For example, we could use a CRF to predict a sequence of words. We could also use a Markov model to predict a sequence of words by learning the transition probabilities. This would not be a hidden Markov model, because there are no hidden states here. We would likely need a larger context window than just the previous token. Fortunately, we can relax the assumption that the language generation has the Markov property. Once we start doing this, however, our model quickly becomes much more complex. This is related to the complication of syntax that made syntactic parsers so challenging.

Currently, RNNs (recurrent neural networks) are the most popular approach to building language models. We introduced RNNs in Chapter 4, but now let’s look into more detail about how they work.

Recurrent Neural Networks

We will build a model that can generate English words. To do this we will use an LSTM. Following are the equations that define it:

\begin{matrix} v_{0} & = 0 \\ c_{0} & = 0 \\ 1 & \leq t \leq T \\ f_{t} & = σ (W_{f} \cdot {\vec{x}}_{t} + U_{f} \cdot v^{(t - 1)} + b_{f}) \\ i_{t} & = σ (W_{i} \cdot {\vec{x}}_{t} + U_{i} \cdot v^{(t - 1)} + b_{i}) \\ o_{t} & = σ (W_{o} \cdot {\vec{x}}_{t} + U_{o} \cdot v^{(t - 1)} + b_{o}) \\ {\tilde{c}}_{t} & = tanh (W_{c} \cdot {\vec{x}}_{t} + U_{c} \cdot v^{(t - 1)} + b_{c}) \\ c_{t} & = f_{t} \circ c_{t - 1} + i_{t} \circ {\tilde{c}}_{t} \\ v_{t} & = o_{t} \circ tanh (c_{t}) \end{matrix}

The idea behind the LSTM is that it will maintain state as you progress through the sequence. It will learn when to update its state through the training process. The input, $x_{t}$ , is combined with the previous state, $v_{t}$ , using four different sets of weights. The first set, $W_{f}, U_{f}, b_{f}$ , represents forgetting, controlling how much prior information affects the state of the cell. The second set, $W_{i}, U_{i}, b_{i}$ , represents input to the cell, controlling how much the current example affects the state of the cell. The third set, $W_{o}, U_{o}, b_{o}$ , represents output of the cell, controlling how much the new cell state affects the output of the LSTM. And finally, $W_{c}, U_{c}, b_{c}$ , represents memory of the current state, controlling what is remembered from the current input and stored in the cell.

In order to pass in our characters, we will need to vectorize them first.

from keras.models import Model, Sequential
from keras.layers import *
import keras.utils as ku
import keras.preprocessing as kp

from scipy.special import expit as sigmoid, softmax

Let’s take only the words that occur frequently. Also, we will mark the end of each word. This will let our model predict when a word should end.

vocab = word_counts[word_counts > 100]
vocab = list(w + '#' for w in vocab.index)

We will build two lookups, c2i for mapping characters to indices, and i2c for mapping indices back to characters. We will also use ? as a symbol for unknown characters.

UNK = '?'
c2i = {UNK: 0}

for word in vocab:
    for c in word:
        c2i.setdefault(c, len(c2i))
        
i2c = {ix: c for c, ix in c2i.items()}
alphabet_size = len(i2c) + 1

Now let’s define some utility functions for converting data.

def texts_to_sequences(texts):
    return [[c2i.get(c, c2i[UNK]) for c in t] for t in texts]

def sequences_to_texts(seqs):
    return [''.join([i2c.get(ix, UNK) for ix in s]) for s in seqs]

Here, we specify the maximum context as 10. We could potentially not do so, but that could lead to some technical difficulties. The implementation of sequence modeling expects to know the maximum length of the sequences. Without fixing the size of the window, the length of the longest word will determine this length. Because long words are much more rare than short ones, most of our sequences will need to be padded. This padding does not help us learn the likely sequences. So there is a trade-off: the larger the window the more context your model has to predict the next item in the sequence. On the other hand, it also increases computational complexity. It’s best to realistically consider how large the context is in your data. In English, the median length of vocabulary words is 6 letters. If you take into account word frequency, the median is 4. Indeed, 10 is the 95th percentile of word length considering word frequency. This means that very rarely will information from more than 10 characters away help in predicting the next character.

seqs = texts_to_sequences(vocab)

w = 10
X = []
Y = []
for seq in seqs:
    for k in range(1, min(w, len(seq))):
        X.append(seq[:k])
        Y.append(seq[k])
    for k in range(0, len(seq) - w):
        X.append(seq[k:k+w])
        Y.append(seq[k+w])
X = kp.sequence.pad_sequences(X, maxlen=w, padding='pre')
Y = ku.to_categorical(Y, num_classes=alphabet_size)

Now we build our model. You may notice the Embedding layer here. This will reduce the dimensionality of our input. Instead of having the width of the input to LSTM be the size of our alphabet, it will be 5. We will learn more about embeddings in Chapter 11.

units = 20

model = Sequential()
model.add(Embedding(alphabet_size, 5, input_length=w))
model.add(LSTM(units, unroll=True))
model.add(Dense(alphabet_size, activation='softmax'))

print(model.summary())

Model: "sequential_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
embedding_1 (Embedding)      (None, 10, 5)             250       
_________________________________________________________________
lstm_1 (LSTM)                (None, 20)                2080      
_________________________________________________________________
dense_1 (Dense)              (None, 50)                1050      
=================================================================
Total params: 3,380
Trainable params: 3,380
Non-trainable params: 0
_________________________________________________________________
None

There are 3,380 parameters to be learned. Unsurprisingly, most of them are in the LSTM—the most complex part of our network. Now, let’s train our network.

model.compile(
    loss='categorical_crossentropy', optimizer='adam', 
    metrics=['accuracy'])
model.fit(X, Y, epochs=300, verbose=1)

Epoch 1/300
5688/5688 [==============================] - 1s 224us/step - 
  loss: 3.1837 - acc: 0.1790
Epoch 2/300
5688/5688 [==============================] - 0s 79us/step - 
  loss: 2.7894 - acc: 0.1834
...
Epoch 299/300
5688/5688 [==============================] - 0s 88us/step - 
  loss: 1.7275 - acc: 0.4524
Epoch 300/300
5688/5688 [==============================] - 0s 84us/step - 
  loss: 1.7267 - acc: 0.4517

<keras.callbacks.History at 0x7fbf5e94d438>

Now that we have a model of the character sequences, we can actually use it to generate words. All we need is a seed character.

def generate_word(seed_char, model):
    text = seed_char 
    for _ in range(100):
        # encode the current text
        encoded = texts_to_sequences([text])[0]
        # pad the sequence
        encoded = kp.sequence.pad_sequences( 
            [encoded], maxlen=w, padding='pre', truncating='pre')
        # predict the next index
        pred = model.predict_classes(encoded, verbose=0) 
        # convert the index
        pred = sequences_to_texts([pred])[0] 
        # if the model predicts the end of the word, exit
        if pred == '#': 
            break
        text += pred
    return text

alphabet = 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'

for c in alphabet:
    print(c, '->', generate_word(c, model), end=', ')
    c = c.lower()
    print(c, '->', generate_word(c, model))

A -> And, a -> an
B -> Breng, b -> beation
C -> Court, c -> court
D -> Dear, d -> dear
E -> Englent, e -> exter
F -> Fer, f -> fort
G -> Gearth, g -> groud
H -> Her, h -> hort
I -> In, i -> indedant
J -> Jome, j -> jorter
K -> Kenglert, k -> kear
L -> Lexter, l -> land
M -> Mand, m -> mearth
N -> Not, n -> near
O -> One, o -> on
P -> Pling, p -> provest
Q -> Qexter, q -> quale
R -> Rhong, r -> rear
S -> Some, s -> state
T -> Ther, t -> ther
U -> Ued, u -> under
V -> Vexter, v -> vering
W -> Watere, w -> wate
X -> Xexter, x -> xowe
Y -> Yot, y -> yurn
Z -> Zexter, z -> zelous

This looks interesting. Some of these are even real words, like “Her,” “land,” and “state.” The principles for doing this with words is the same. The number of dimensions increases, so we must be mindful of memory usage.

Let’s look at how these work. First, let’s extract our layers.

embedding = model.layers[0]
lstm = model.layers[1]
dense = model.layers[2]

After this, we want to extract the weights. The Keras library does not store each of the weights of the LSTM layer separately, so we will have to split them out.

# embedding layers don't have a bias term, so 
# we only get one here
W_e = embedding.get_weights()[0]

W, U, b = lstm.get_weights()

# The W_* weights are concatenated along the second axis
W_i = W[:, :units]
W_f = W[:, units: units * 2]
W_c = W[:, units * 2: units * 3]
W_o = W[:, units * 3:]

# The U_* weights are concatenated along the second axis
U_i = U[:, :units]
U_f = U[:, units: units * 2]
U_c = U[:, units * 2: units * 3]
U_o = U[:, units * 3:]

# The b_* weights are also concatenated
b_i = b[:units]
b_f = b[units: units * 2]
b_c = b[units * 2: units * 3]
b_o = b[units * 3:]

# Finally, the output weights
W_d, b_d = dense.get_weights()

Let’s see what we should expect when we try and predict the next character after “recurren.”

text = ['recurren']
encoded = texts_to_sequences(text) 
encoded = kp.sequence.pad_sequences(
    encoded, maxlen=w, padding='pre')
pred = model.predict_classes(encoded)
pred = sequences_to_texts([pred])
pred

['t']

This makes sense because this would make the word “recurrent.” Now, let’s see if we can calculate this for ourselves. First, we must create our inputs.

X = ['recurren']
X = texts_to_sequences(X)
X = kp.sequence.pad_sequences(encoded, maxlen=w, padding='pre')
X = np.eye(alphabet_size)[X.reshape(-1)].T
X.shape

(50, 10)

Now, we can convert our 50-dimensional sparse vectors into much more dense 5-dimensional vectors using the embedding layer.

V_e = np.dot(W_e.T, X).T
V_e.shape

(10, 5)

Let’s run this through the LSTM. This code is mostly parallel to the previous equations, the exception being that we store the values in h_* variables before sending them through the activation functions. This is done to keep the lines of code from being too long.

v_t = np.zeros(units)
c_t = np.zeros(units)
for v_e in V_e:
    h_f = np.dot(W_f.T, v_e) + np.dot(U_f.T, v_t) + b_f
    f_t = sigmoid(h_f)
    h_i = np.dot(W_i.T, v_e) + np.dot(U_i.T, v_t) + b_i
    i_t = sigmoid(h_i)
    h_o = np.dot(W_o.T, v_e) + np.dot(U_o.T, v_t) + b_o
    o_t = sigmoid(h_o)
    h_cc = np.dot(W_c.T, v_e) + np.dot(U_c.T, v_t) + b_c
    cc_t = np.tanh(h_cc)
    c_t = np.multiply(f_t, c_t) + np.multiply(i_t, cc_t)
    v_t = np.multiply(o_t, np.tanh(c_t))
    
v_t.shape

(20,)

We will take the last output and pass it through the dense layer to get our prediction.

h_d = np.dot(W_d.T, v_t) + b_d
pred = softmax(h_d)
pred

array([5.82594437e-14, 1.42019430e-13, 6.24594676e-05, 7.96185826e-03,
       1.44256098e-01, 8.38904616e-02, 2.30058043e-03, 1.34377453e-02,
       2.41413353e-02, 8.99782631e-03, 3.62877644e-04, 7.10518831e-04,
       4.20883844e-05, 1.14326228e-01, 4.10492247e-01, 1.37839318e-03,
       1.71264211e-03, 1.74333516e-03, 2.45791054e-03, 3.24176673e-04,
       9.32490754e-05, 7.62545395e-14, 9.35015496e-05, 1.53205409e-01,
       2.67653674e-02, 1.24012713e-03, 5.49467572e-14, 3.55922084e-11,
       8.92650636e-14, 9.91368315e-14, 8.16121572e-14, 2.14432828e-18,
       9.12657866e-14, 3.24019025e-06, 9.51441183e-14, 8.55035544e-14,
       8.72065395e-14, 7.73119241e-14, 9.14113806e-14, 1.08231855e-13,
       3.22887102e-07, 8.59110640e-14, 1.10092976e-13, 8.71172907e-14,
       1.04723547e-13, 7.06023940e-14, 8.18420309e-14, 1.21049563e-13,
       8.37730240e-14, 1.04719426e-13])

We just need to find the index with the highest value, so we use argmax.

i2c[pred.argmax()]

't'

Finally, we have our prediction. Let’s look at some of the runners-up.

top5 = sorted(enumerate(pred), key=lambda x: x[1], reverse=True)[:5]

for ix, p in top5:
    print(i2c[ix], p)

t 0.5984201360888022
e 0.12310572070859592
# 0.08221461341991843
c 0.043372692317500176
l 0.037207052073704866

We can speculate as to why the other characters might be predicted. The character “g” could be predicted based on how similar the input string is to “recurring.” As we discussed previously, most words are not as long as “recurren,” so predicting the end of the word also makes sense. The character “c” is part of the word “recurrence.” And the character “s” is a common letter in English, so it will often show up with a high probability.

Modeling sequences is a central part of modern NLP. In the bag-of-words approach, we lose all the information communicated by syntax. Now that we can model language as a sequence, we can look at how to extract information from the sequences.

Exercise: Character N-Grams

Modify the language model RNN to use character N-grams. So, the sequences should be recurrent -> 2-gram -> [re, cu, rr, en, t#].

Remember that you may want to update w to reflect how much context you need.

Exercise: Word Language Model

Take the poem “Jabberwocky” and build a language model from that.

Your sequences will be lines in the poem—for example, ["’Twas brillig, and the slithy toves,” “Did gyre and gimble in the wabe:"]. Use a Spark NLP pipeline to process the text and extract the list of tokens.
Your encoding code will need to change, c2i, i2c, texts_to_sequences, and sequences_to_texts will need to be updated to work with words.
generate_word will need to be updated to generate lines instead.

Resources

The Penn Treebank: An Overview, by Ann Taylor, Mitchell Marcus, and Beatrice Santorini (Springer): this is an overview of the Penn Treebank data set, which is one of the most widely used NLP data sets.

Long Short-Term Memory in Recurrent Neural Networks, by Felix Gers: the PhD dissertation describing LSTMs. It’s a little dense, as expected in a dissertation, but if you are interested in the mathematical underpinnings it’s a must-read.
Probabilistic Graphical Models: Principles and Techniques, by Daphne Koller and Nir Friedman (MIT Press): this is a textbook for probabilistic graphical models like HMMs and CRFs.