1D convolutional neural network model

We have seen that the bag-of-words approach used in traditional NLP ignores sentence structure. Consider applying sentiment analysis to the four movie reviews in the following table:

Id   Sentence                       Rating (1 = recommended, 0 = not recommended)
1    this movie is very good        1
2    this movie is not good         0
3    this movie is not very good    0
4    this movie is not bad          1

If we represent this as a bag of words with term frequency, we will get the following output:

Id   bad   good   is   movie   not   this   very
1    0     1      1    1       0     1      1
2    0     1      1    1       1     1      0
3    0     1      1    1       1     1      1
4    1     0      1    1       1     1      0
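For reference, the term-frequency matrix above can be reproduced in a few lines of base R. This is only an illustrative sketch, not code from the chapter:

# Illustrative sketch: build the term-frequency matrix for the four reviews
reviews <- c("this movie is very good",
             "this movie is not good",
             "this movie is not very good",
             "this movie is not bad")
tokens <- strsplit(reviews, " ")                 # split each review into words
vocab  <- sort(unique(unlist(tokens)))           # alphabetical vocabulary
tf <- t(sapply(tokens, function(doc) table(factor(doc, levels = vocab))))
tf                                               # one row per review, one column per word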

In this simple example, we can see some of the problems with a bag-of-words approach: we have lost the relationship between the negation (not) and the adjectives (good, bad). To work around this problem, traditional NLP could use bigrams, so instead of using single words as tokens, we use pairs of consecutive words as tokens. Now, for the second example, not good is a single token, which makes it more likely that the machine learning algorithm will pick it up. However, we still have a problem with the third example (not very good), where we get the tokens not very and very good. These are still ambiguous, as not very implies negative sentiment, while very good implies positive sentiment. We could try higher-order n-grams, but this further exacerbates the sparsity problem we saw in the previous section.
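To make the bigram idea concrete, here is a minimal base R sketch of bigram tokenization; it is purely illustrative and not part of the chapter's code:

# Illustrative sketch: bigram tokenization of a single review
sentence <- "this movie is not very good"
words    <- strsplit(sentence, " ")[[1]]
bigrams  <- paste(head(words, -1), tail(words, -1))
bigrams
# "this movie" "movie is" "is not" "not very" "very good"

Note that not good never appears as a token here, while not very and very good both do, which is exactly the ambiguity described above.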

Word vectors or embeddings suffer from the same problem: on their own, they do not capture word order, so we need some method of handling word sequences. Fortunately, there are types of layers in deep learning models that can handle sequential data. One that we have already seen is the convolutional layer from Chapter 5, Image Classification Using Convolutional Neural Networks. Recall that these layers slide small 2D patches (filters) across the image to identify patterns such as a diagonal line or an edge. In a similar manner, we can apply a 1D convolution across the sequence of word vectors. Here is an example of using a 1D convolutional layer for the same text classification problem. The code is in Chapter7/classify_keras2.R. We are only showing the code for the model architecture, because that is the only change from the code in Chapter7/classify_keras1.R:

model <- keras_model_sequential() %>%
  # map each word index to a 16-dimensional embedding vector
  layer_embedding(input_dim = max_features, output_dim = 16,
                  input_length = maxlen) %>%
  layer_dropout(rate = 0.25) %>%
  # 64 filters, each spanning 5 consecutive word embeddings
  layer_conv_1d(filters = 64, kernel_size = 5, activation = "relu") %>%
  layer_dropout(rate = 0.25) %>%
  layer_max_pooling_1d() %>%
  layer_flatten() %>%
  layer_dense(units = 50, activation = "relu") %>%
  layer_dropout(rate = 0.6) %>%
  # single sigmoid unit for binary classification
  layer_dense(units = 1, activation = "sigmoid")
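For context, here is a hedged sketch of how such a model is typically compiled and fit with Keras in R. The loss and metric follow from the single sigmoid output, but the optimizer, batch size, and the x_train/y_train/x_test/y_test object names are illustrative assumptions rather than the exact settings used in Chapter7/classify_keras1.R:

# Illustrative sketch only: settings and variable names are assumptions,
# not the exact values used in the chapter's code.
model %>% compile(
  loss = "binary_crossentropy",   # matches the single sigmoid output unit
  optimizer = "adam",             # assumed optimizer
  metrics = c("accuracy")
)

history <- model %>% fit(
  x_train, y_train,               # padded integer sequences and 0/1 labels
  batch_size = 32,                # assumed batch size
  epochs = 5,                     # five epochs, as in the training log below
  validation_data = list(x_test, y_test)
)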

We can see that this follows the same pattern we saw with image data: a convolutional layer followed by a max pooling layer. The convolutional layer has 64 filters, each of length 5, so each filter learns local patterns spanning five consecutive words. Here is the output from the model's training:

Train on 8982 samples, validate on 2246 samples
Epoch 1/5
8982/8982 [==============================] - 13s 1ms/step - loss: 0.3020 - acc: 0.8965 - val_loss: 0.1909 - val_acc: 0.9470
Epoch 2/5
8982/8982 [==============================] - 13s 1ms/step - loss: 0.1980 - acc: 0.9498 - val_loss: 0.1816 - val_acc: 0.9537
Epoch 3/5
8982/8982 [==============================] - 12s 1ms/step - loss: 0.1674 - acc: 0.9575 - val_loss: 0.2233 - val_acc: 0.9368
Epoch 4/5
8982/8982 [==============================] - 12s 1ms/step - loss: 0.1587 - acc: 0.9606 - val_loss: 0.1787 - val_acc: 0.9573
Epoch 5/5
8982/8982 [==============================] - 12s 1ms/step - loss: 0.1513 - acc: 0.9628 - val_loss: 0.2186 - val_acc: 0.9408

This model is an improvement on our previous deep learning model; it gets 95.73% accuracy on the fourth epoch. This beats the traditional NLP approach by 0.49%, which is a significant improvement. Let's move on to other methods that also model word sequences. We will start with recurrent neural networks (RNNs).