Chapter 15. Chatbot

When we discussed language models, we showed how we can generate text. Building a chatbot is similar, except that we are modeling an exchange. Depending on how we want to approach the problem, this can make our requirements more complex or simpler.

In this chapter we will discuss some of the ways this can be modeled, and then we will build a program that will use a generative model to take input and then generate responses. First, let’s talk about what discourse is.

Morphology and syntax tell us how morphemes are combined into words, and words into phrases and sentences. The combination of sentences into larger acts of language is not as easily modeled, but there is still a notion of appropriate and inappropriate combinations of sentences. Let’s look at some examples:

I went to the doctor yesterday. It is just a sprained ankle.
I went to the doctor yesterday. Mosquitoes have 47 teeth.

In the first example, the second sentence is obviously related to the first. From these two sentences, combined with common knowledge, we can infer that the speaker went to the doctor for an ankle problem that turned out to be a sprain. The second example makes no sense. From a linguistics point of view, sentences are generated from concepts and then encoded into words and phrases. The concepts that are expressed by sentences are connected, so a sequence of sentences should be connected by similar concepts. This will be true whether a conversation has one speaker or several.

The pragmatics of a discourse is important to understanding how to model it. If we are modeling a customer-service exchange, the range of responses can be limited. These limited types of responses are often called intents. When building a customer-service chatbot, this greatly reduces the potential complexity. If we are modeling general conversation, this can become much more difficult. Language models learn what is likely to occur in a sequence, but they cannot learn to generate concepts. So our choice is to either build something that models the probable sequences or find a way to cheat.

We can cheat by building canned responses to unrecognized intents. For example, if the user makes a statement that our simple model is not expecting, we can have it respond with, “Sorry, I don’t understand.” If we are logging the conversations, we can use exchanges that use the canned responses to expand the intents we cover.
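To make this concrete, here is a minimal sketch of intent matching with a canned fallback. The intent names and patterns are hypothetical, and a real system would use a trained classifier rather than substring matching.

INTENT_RESPONSES = {
    'track_order': 'Your order is on its way.',
    'reset_password': 'I have sent you a password-reset link.',
}

INTENT_PATTERNS = {
    'track_order': ['where is my order', 'track my order'],
    'reset_password': ['reset my password', 'forgot my password'],
}

def respond(utterance, unrecognized_log):
    utterance = utterance.lower()
    for intent, patterns in INTENT_PATTERNS.items():
        if any(p in utterance for p in patterns):
            return INTENT_RESPONSES[intent]
    # log what we could not handle so we can expand our intents later
    unrecognized_log.append(utterance)
    return "Sorry, I don't understand."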

In the example we are covering, we will be building a program that purely models the full text of the discourse. Essentially, it is a language model. The difference will be in how we use it.

This chapter is different from previous ones in that it doesn’t make use of Spark. Spark is great for processing large amounts of data in batches, but it is not great for interactive applications. Also, recurrent neural networks can take a long time to train on large amounts of data. So, in this chapter we are working with a small data set. If you have the right hardware, you can change the NLTK processing to use Spark NLP.

Problem Statement and Constraints

We will build a story-building tool. The idea is to help someone write an original story similar to one of the Grimm fairy tales. This model will be much more complex, in the sense of containing many more parameters, than the previous language model was. The program will be a script that asks for an input sentence and generates a new sentence. The user then takes that sentence, modifies and corrects it, and enters it.

  1. What is the problem we are trying to solve?

    We want a system that will recommend the next sentence in a story. We also must recognize the limitations of text generation techniques. We will need to have the user in the loop. So we need a model that can generate related text and a system that lets us review the output.

  2. What constraints are there?

    First, we need a model that has two notions of context—the previous sentence and the current sentence. We don’t need to worry about performance as much, since this will be interacting with a person. This might seem counterintuitive because most interactive systems require quite low latency. However, if you consider what this program is producing, it is not unreasonable to wait one to three seconds for a response.

  3. How do we solve the problem with the constraints?

    We will be building a neural network for generating text, specifically an RNN, as discussed in Chapters 4 and 8. We could learn the word embeddings in this model, but we can instead use a prebuilt embedding. This will help us train a model more quickly.

Design the Solution

If you recall our language model, we used four layers.

  1. Input
  2. Embedding
  3. LSTM
  4. Dense output

We input windows of characters of a fixed size and predicted the following character. Now we need to find a way to take into account larger portions of text. There are a couple of options.

Many RNN architectures include a layer for learning an embedding for the words. That would require us to learn additional parameters, so we will use a pretrained GloVe model instead. Also, we will be building our model on the token level, and not on the character level as before.

We could make the window size much larger than the average sentence. This has the benefit of keeping the same model architecture. The downside is that our LSTM layer will have to maintain information over quite long distances. Alternatively, we can use one of the architectures used for machine translation.

Let’s consider the concatenating approach.

  1. Context input
  2. Context LSTM
  3. Current input
  4. Current LSTM
  5. Concatenate 2 and 4
  6. Dense output

The current inputs will be windows over sentences, so for each window of a given sentence we will use the same context vector. This approach has the benefit of being able to be extended to multiple sentences. The downside is that the model has to learn to balance the information from far away and from nearby.
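Here is a sketch of what the concatenating architecture could look like in Keras, assuming the same shapes we will use later in the chapter (max_len=42, w=10, dim=50, units=200, and a vocabulary of 4,407 tokens):

from keras.models import Model
from keras.layers import Input, LSTM, Dense
from keras.layers.merge import Concatenate

context_in = Input(shape=(42, 50))   # 1. context input
current_in = Input(shape=(10, 50))   # 3. current input

context_vec = LSTM(200)(context_in)  # 2. context LSTM
current_vec = LSTM(200)(current_in)  # 4. current LSTM

merged = Concatenate()([context_vec, current_vec])  # 5. concatenate 2 and 4
out = Dense(4407, activation='softmax')(merged)     # 6. dense output

concat_model = Model(inputs=[context_in, current_in], outputs=[out])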

Let’s consider the stateful approach.

  1. Context input
  2. Context LSTM
  3. Current input
  4. Current LSTM, initialized with state of 2
  5. Dense output

This helps make training easier by reducing the influence of the previous sentence. This is a double-edged sword, however, because the context gives us less information. We will be using this approach.

Implement the Solution

Let’s start out by doing our imports. This chapter will rely on Keras.

from collections import Counter
import pickle as pkl

import nltk
import numpy as np
import pandas as pd

from keras.models import Model
from keras.layers import Input, Embedding, LSTM, Dense, CuDNNLSTM
from keras.layers.merge import Concatenate
import keras.utils as ku
import keras.preprocessing as kp
import tensorflow as tf
np.random.seed(1)
tf.set_random_seed(2)

Let’s also define some special tokens for the beginning and ending of sentences, as well as for unknown tokens.

START = '>'
END = '###'
UNK = '???'

Now, we can load the data. We will need to replace some of the special characters.

with open('grimms_fairytales.txt', encoding='UTF-8') as fp:
    text = fp.read()
    
text = text\
    .replace('\t', ' ')\
    .replace('“', '"')\
    .replace('”', '"')\
    .replace('‘', "'")\
    .replace('’', "'")

Now, we can process our text into tokenized sentences.

sentences = nltk.tokenize.sent_tokenize(text)
sentences = [s.strip() for s in sentences]
sentences = [
    [t.lower() for t in nltk.tokenize.wordpunct_tokenize(s)]
    for s in sentences]
word_counts = Counter([t for s in sentences for t in s])
word_counts = pd.Series(word_counts)
vocab = [START, END, UNK] + list(sorted(word_counts.index))

We need to define some hyperparameters for our model.

  • dim is the size of the token embeddings
  • w is the size of the windows we’ll use
  • max_len is the sentence length that we use
  • units is the size of the state vectors we’ll use for our LSTMs

dim = 50
w = 10
max_len = int(np.quantile([len(s) for s in sentences], 0.95))
units = 200

Now, let’s load the GloVe embeddings.

glove = {}
with open('glove.6B/glove.6B.50d.txt', encoding='utf-8') as fp:
    for line in fp:
        token, embedding = line.split(maxsplit=1)
        if token in vocab:
            embedding = np.fromstring(embedding, 'f', sep=' ')
            glove[token] = embedding
            
vocab = list(sorted(glove.keys()))
vocab_size = len(vocab)

We will also need to have a lookup for the one-hot–encoded output.

i2t = dict(enumerate(vocab))
t2i = {t: i for i, t in i2t.items()}

token_oh = ku.to_categorical(np.arange(vocab_size))
token_oh = {t: token_oh[i,:] for t, i in t2i.items()}
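As a quick sanity check, the lookups should round-trip, and each one-hot vector should have one entry per vocabulary token (assuming “the” is in the vocabulary, which it certainly is for this corpus):

assert i2t[t2i['the']] == 'the'
print(token_oh['the'].shape)  # (4407,)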

Now, we can define some utility functions.

We will need to pad the end of the sentences; otherwise, we will not learn from the last words in the sentences.

def pad_sentence(sentence, length):
    sentence = sentence[:length]
    if len(sentence) < length:
        sentence += [END] * (length - len(sentence))
    return sentence
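For example, a five-token sentence padded to length 8 gets three END tokens appended:

pad_sentence(['the', 'king', 'went', 'home', '.'], 8)
# ['the', 'king', 'went', 'home', '.', '###', '###', '###']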

We also need to convert sentences to matrices.

def sent2mat(sentence, embedding):
    mat = [embedding.get(t, embedding[UNK]) for t in sentence]
    return np.array(mat)

We need a function for converting sequences to a sequence of sliding windows.

def slide_seq(seq, w):
    window = []
    target = []
    for i in range(len(seq)-w-1):
        window.append(seq[i:i+w])
        target.append(seq[i+w])
    return window, target
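For example, with a window size of 3, a five-token sequence yields a single window-target pair. The last element never becomes a target, which is one reason we pad sentences with extra END tokens before windowing:

slide_seq(['a', 'b', 'c', 'd', 'e'], 3)
# ([['a', 'b', 'c']], ['d'])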

Now we can build our input matrices. We will have two input matrices. One is from the context, and one is from the current sentence.

Xc = []
Xi = []
Y = []

for i in range(len(sentences)-1):
    
    context_sentence = pad_sentence(sentences[i], max_len)
    xc = sent2mat(context_sentence, glove)
    
    input_sentence = [START]*(w-1) + sentences[i+1] + [END]*(w-1)
    for window, target in zip(*slide_seq(input_sentence, w)):
        xi = sent2mat(window, glove)
        y = token_oh.get(target, token_oh[UNK])
    
        Xc.append(np.copy(xc))
        Xi.append(xi)
        Y.append(y)
    
Xc = np.array(Xc)
Xi = np.array(Xi)
Y = np.array(Y)
print('context sentence: ', xc.shape)
print('input sentence: ', xi.shape)
print('target sentence: ', y.shape)
context sentence:  (42, 50)
input sentence:  (10, 50)
target sentence:  (4407,)
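Note that these printed shapes are for single examples left over from the last loop iteration. The stacked arrays Xc, Xi, and Y each have an additional leading dimension for the 145,061 training examples.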

Let’s build our model.

input_c = Input(shape=(max_len, dim), dtype='float32')
lstm_c, h, c = LSTM(units, return_state=True)(input_c)

input_i = Input(shape=(w, dim), dtype='float32')
lstm_i = LSTM(units)(input_i, initial_state=[h, c])

out = Dense(vocab_size, activation='softmax')(lstm_i)
model = Model(inputs=[input_c, input_i], outputs=[out])
print(model.summary())
print(model.summary())
Model: "model_1"
__________________________________________________________________________
Layer (type)                Output Shape         Param #     Connected to 
==========================================================================
input_1 (InputLayer)        (None, 42, 50)       0                        
__________________________________________________________________________
input_2 (InputLayer)        (None, 10, 50)       0                        
__________________________________________________________________________
lstm_1 (LSTM)               [(None, 200), (None, 200800      input_1[0][0]
__________________________________________________________________________
lstm_2 (LSTM)               (None, 200)          200800      input_2[0][0]
                                                             lstm_1[0][1] 
                                                             lstm_1[0][2] 
__________________________________________________________________________
dense_1 (Dense)             (None, 4407)         885807      lstm_2[0][0] 
==========================================================================
Total params: 1,287,407
Trainable params: 1,287,407
Non-trainable params: 0
__________________________________________________________________________
None
model.compile(
    loss='categorical_crossentropy', optimizer='adam',
    metrics=['accuracy'])

Now we can train our model. This is our most complex model yet, with almost 1.3 million parameters, and depending on your hardware, training can take around four minutes per epoch on CPU.
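A minimal sketch of the training call, assuming a batch size of 64 (adjust for your hardware):

model.fit([Xc, Xi], Y, epochs=10, batch_size=64)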

Epoch 1/10
145061/145061 [==============================] - 241s 2ms/step 
- loss: 3.7840 - accuracy: 0.3894
...
Epoch 10/10
145061/145061 [==============================] - 244s 2ms/step 
- loss: 1.8933 - accuracy: 0.5645

Once we have this model trained, we can try to generate some sentences. This function will need a context sentence and an input sentence—we can simply supply one word to begin. The function will append tokens to the input sentence until the END token is generated or we have hit the maximum allowed length.

def generate_sentence(context_sentence, input_sentence, max_len=100):
    # tokenize and pad the context sentence to the model's expected length
    context_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(context_sentence)]
    context_sentence = pad_sentence(context_sentence, max_len)
    context_vector = sent2mat(context_sentence, glove)
    # tokenize the seed text and left-pad it with START tokens to fill a window
    input_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(input_sentence)]
    input_sentence = [START] * (w-1) + input_sentence
    input_sentence = input_sentence[:w]
    output_sentence = list(input_sentence)  # copy so appends don't alias the window

    input_vector = sent2mat(input_sentence, glove)
    predicted_vector = model.predict([[context_vector], [input_vector]])
    predicted_token = i2t[np.argmax(predicted_vector)]
    output_sentence.append(predicted_token)
    i = 0
    # slide the window forward one token at a time until END or the length cap
    while predicted_token != END and i < max_len:
        input_sentence = input_sentence[1:w] + [predicted_token]
        input_vector = sent2mat(input_sentence, glove)
        predicted_vector = model.predict([[context_vector], [input_vector]])
        predicted_token = i2t[np.argmax(predicted_vector)]
        output_sentence.append(predicted_token)
        i += 1
    return output_sentence

Because we need to supply the first word of the new sentence, we can simply sample from the beginning tokens found in our corpus. Let’s save the distribution of first words that we will need as JSON.

first_words = Counter([s[0] for s in sentences])
first_words = pd.Series(first_words)
first_words = first_words / first_words.sum()  # normalize counts into probabilities
first_words.to_json('grimm-first-words.json')
with open('glove-dict.pkl', 'wb') as out:
    pkl.dump(glove, out)
with open('vocab.pkl', 'wb') as out:
    pkl.dump(i2t, out)

Let’s see what is generated without human intervention.

context_sentence = '''
In old times, when wishing was having, there lived a King whose
daughters were all beautiful, but the youngest was so beautiful that
the sun itself, which has seen so much, was astonished whenever it
shone in her face.
'''.strip().replace('\n', ' ')

input_sentence = np.random.choice(first_words.index, p=first_words)

for _ in range(10):
    print(context_sentence, END)
    output_sentence = generate_sentence(context_sentence, input_sentence, max_len)
    output_sentence = ' '.join(output_sentence[w-1:-1])
    context_sentence = output_sentence
    input_sentence = np.random.choice(first_words.index, p=first_words)
print(output_sentence, END)
In old times, when wishing was having, there lived a King whose daughters 
were all beautiful, but the youngest was so beautiful that the sun 
itself, which has seen so much, was astonished whenever it shone in her 
face. ###
" what do you desire ??? ###
the king ' s son , however , was still beautiful , and a little chair 
there ' s blood and so that she is alive ??? ###
the king ' s son , however , was still beautiful , and the king ' s 
daughter was only of silver , and the king ' s son came to the forest , 
and the king ' s son seated himself on the leg , and said , " i will go 
to church , and you shall be have lost my life ??? ###
" what are you saying ??? ###
cannon - maiden , and the king ' s daughter was only a looker - boy . ###
but the king ' s daughter was humble , and said , " you are not afraid 
??? ###
then the king said , " i will go with you ??? ###
" i will go with you ??? ###
he was now to go with a long time , and the bird threw in the path , and 
the strong of them were on their of candles and bale - plants . ###
then the king said , " i will go with you ??? ###

This model won’t be passing the Turing test any time soon. This is why we need to have a human in the loop. Let’s build our script. First, let’s save our model.

model.save('grimm-model')

Our script will need to have access to some of our utility functions, as well as to the hyperparameters—for example, dim, w.

%%writefile fairywriter.py
"""
This script helps you generate a fairytale.
"""

import pickle as pkl

import nltk
import numpy as np
import pandas as pd

from keras.models import load_model
import keras.utils as ku
import keras.preprocessing as kp
import tensorflow as tf


START = '>'
END = '###'
UNK = '???'


FINISH_CMDS = ['finish', 'f']
BACK_CMDS = ['back', 'b']
QUIT_CMDS = ['quit', 'q']
CMD_PROMPT = ' | '.join(','.join(c) for c in [FINISH_CMDS, BACK_CMDS, QUIT_CMDS])
QUIT_PROMPT = '"{}" to quit'.format('" or "'.join(QUIT_CMDS))
ENDING = ['THE END']


def pad_sentence(sentence, length):
    sentence = sentence[:length]
    if len(sentence) < length:
        sentence += [END] * (length - len(sentence))
    return sentence


def sent2mat(sentence, embedding):
    mat = [embedding.get(t, embedding[UNK]) for t in sentence]
    return np.array(mat)


def generate_sentence(context_sentence, input_sentence, vocab, hparams=(42, 50, 10)):
    max_len, dim, w = hparams
    context_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(context_sentence)]
    context_sentence = pad_sentence(context_sentence, max_len)
    context_vector = sent2mat(context_sentence, glove)
    input_sentence = [t.lower() for t in nltk.tokenize.wordpunct_tokenize(input_sentence)]
    input_sentence = [START] * (w-1) + input_sentence
    input_sentence = input_sentence[:w]
    output_sentence = list(input_sentence)  # copy so appends don't alias the window

    input_vector = sent2mat(input_sentence, glove)
    predicted_vector = model.predict([[context_vector], [input_vector]])
    predicted_token = vocab[np.argmax(predicted_vector)]
    output_sentence.append(predicted_token)
    i = 0
    while predicted_token != END and i < max_len:
        input_sentence = input_sentence[1:w] + [predicted_token]
        input_vector = sent2mat(input_sentence, glove)
        predicted_vector = model.predict([[context_vector], [input_vector]])
        predicted_token = vocab[np.argmax(predicted_vector)]
        output_sentence.append(predicted_token)
        i += 1
    return output_sentence


if __name__ == '__main__':
    model = load_model('grimm-model')
    (_, max_len, dim), (_, w, _) = model.get_input_shape_at(0)
    hparams = (max_len, dim, w)
    first_words = pd.read_json('grimm-first-words.json', typ='series')
    with open('glove-dict.pkl', 'rb') as fp:
        glove = pkl.load(fp)
    with open('vocab.pkl', 'rb') as fp:
        vocab = pkl.load(fp)
    
    print("Let's write a story!")
    title = input('Give me a title ({}) '.format(QUIT_PROMPT))
    if title.lower() in QUIT_CMDS:
        exit()
    story = [title]
    context_sentence = title
    
    print(CMD_PROMPT)
    while True:
        input_sentence = np.random.choice(first_words.index, p=first_words)
        generated = generate_sentence(context_sentence, input_sentence, vocab, hparams=hparams)
        generated = ' '.join(generated)
        ### the model creates a suggested sentence
        print('Suggestion:', generated)
        ### the user responds with the sentence they want add
        ### the user can fix up the suggested sentence or write their own
        ### this is the sentence that will be used to make the next suggestion
        sentence = input('Sentence: ')
        if sentence.lower() in QUIT_CMDS:
            story = []
            break
        elif sentence.lower() in FINISH_CMDS:
            story.append(np.random.choice(ENDING))
            break
        elif sentence.lower() in BACK_CMDS:
            if len(story) == 1:
                print('You are at the beginning')
                continue
            story = story[:-1]
            context_sentence = story[-1]
            continue
        else:
            story.append(sentence)
            context_sentence = sentence
            
    print('\n'.join(story))
    print('exiting...')

Let’s give our script a run. I’ll read each suggestion and borrow elements from it to write the next line. A more complex model might be able to produce sentences that can be edited and added directly, but this model isn’t quite there.

%run fairywriter.py
Let's write a story!
Give me a title ("quit" or "q" to quit) The Wolf Goes Home
finish,f | back,b | quit,q
Suggestion: > > > > > > > > > and when they had walked for the time , and 
the king ' s son seated himself on the leg , and said , " i will go to 
church , and you shall be have lost my life ??? ###
Sentence: There was once a prince who got lost in the woods on the way 
to a church.
Suggestion: > > > > > > > > > she was called hans , and as the king ' s 
daughter , who was so beautiful than the children , who was called clever 
elsie . ###
Sentence: The prince was called Hans, and he was more handsome than the 
boys.
Suggestion: > > > > > > > > > no one will do not know what to say , but i 
have been compelled to you ??? ###
Sentence: The Wolf came along and asked, "does no one know where are?"
Suggestion: > > > > > > > > > there was once a man who had a daughter who 
had three daughters , and he had a child and went , the king ' s daughter 
, and said , " you are growing and thou now , i will go and fetch
Sentence: The Wolf had three daughters, and he said to the prince, "I 
will help you return home if you take one of my daughters as your 
betrothed."
Suggestion: > > > > > > > > > but the king ' s daughter was humble , and 
said , " you are not afraid ??? ###
Sentence: The prince asked, "are you not afraid that she will be killed 
as soon as we return home?" 
Suggestion: > > > > > > > > > i will go and fetch the golden horse ??? 
###
Sentence: The Wolf said, "I will go and fetch a golden horse as dowry."
Suggestion: > > > > > > > > > one day , the king ' s daughter , who was 
a witch , and lived in a great forest , and the clouds of earth , and in 
the evening , came to the glass mountain , and the king ' s son
Sentence: The Wolf went to find the forest witch that she might conjure 
a golden horse.
Suggestion: > > > > > > > > > when the king ' s daughter , however , was 
sitting on a chair , and sang and reproached , and said , " you are not 
to be my wife , and i will take you to take care of your ??? ###
Sentence: The witch reproached the wolf saying, "you come and ask me such 
a favor with no gift yourself?"
Suggestion: > > > > > > > > > then the king said , " i will go with you 
??? ###
Sentence: So the wolf said, "if you grant me this favor, I will be your 
servant."
Suggestion: > > > > > > > > > he was now to go with a long time , and 
the other will be polluted , and we will leave you ??? ###
Sentence: f
The Wolf Goes Home
There was once a prince who got lost in the woods on the way to a church.
The prince was called Hans, and he was more handsome than the boys.
The Wolf came along and asked, "does no one know where are?"
The Wolf had three daughters, and he said to the prince, "I will help 
you return home if you take one of my daughters as your betrothed."
The prince asked, "are you not afraid that she will be killed as soon as 
we return home?" 
The Wolf said, "I will go and fetch a golden horse as dowry."
The Wolf went to find the forest witch that she might conjure a golden 
horse.
The witch reproached the wolf saying, "you come and ask me such a favor 
with no gift yourself?"
So the wolf said, "if you grant me this favor, I will be your servant."
THE END
exiting...

You can train for additional epochs to get better suggestions, but beware of overfitting. If you overfit this model, it will generate worse results when given contexts and inputs that it doesn’t recognize.

Now that we have a model that we can interact with, the next step would be to integrate it with a chatbot system. Most systems require some server that will serve the model. The specifics will depend on your chatbot platform.
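As a minimal sketch (Flask here is an assumption; your chatbot platform may dictate a different stack), we could wrap the suggestion logic from fairywriter.py in an HTTP endpoint, assuming the model, glove, vocab, first_words, and hparams objects are loaded at startup exactly as in that script:

from flask import Flask, request, jsonify
import numpy as np

# assumes model, glove, vocab, first_words, and hparams have been
# loaded as in the __main__ block of fairywriter.py
app = Flask(__name__)

@app.route('/suggest', methods=['POST'])
def suggest():
    # the client sends the previous sentence as JSON, e.g. {"context": "..."}
    context = request.get_json()['context']
    # sample a first word, then let the model complete the sentence
    first_word = np.random.choice(first_words.index, p=first_words)
    suggestion = generate_sentence(context, first_word, vocab, hparams=hparams)
    return jsonify({'suggestion': ' '.join(suggestion)})

if __name__ == '__main__':
    app.run(port=5000)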