Chapter 11. Performing Sentiment Analysis on Text Data

In every interaction that we have in the real world, our brain subconsciously registers feedback not just in the words said but also using facial expressions, body language, and other physical cues. However, as more of our communication becomes digital, it increasingly appears in the form of text, where we do not have the possibility of evaluating physical cues. Therefore, it’s important to understand the mood or sentiment felt by a person through the text they write in order to form a complete understanding of their message.

For example, a lot of customer support is now automated through the use of a software ticketing system or even an automated chatbot. As a result, the only way to understand how a customer is feeling is by understanding the sentiment from their responses. Therefore, if we are dealing with a particularly irate customer, it’s important to be extra careful with our responses to not annoy them further. Similarly, if we want to understand what customers think about a particular product or brand, we can analyze the sentiment from their posts, comments, or reviews about that brand in social media channels and understand how they feel about the brand.

Understanding sentiment from text is challenging because there are several aspects that need to be inferred that are not directly evident. A simple example is the following customer review for a laptop purchased on Amazon:

“This laptop is full of series problem. Its speed is exactly as per specifications which is very slow! Boot time is more.”

If a human were to read it, they could detect the irony expressed about the speed of the laptop and the fact that it takes a long time to boot up, which leads us to conclude that this is a negative review. However, if we analyze only the text, it’s clear that the speed is exactly as specified. The fact that the boot time is high might also be perceived as a good thing unless we know that this is a parameter that needs to be small. The task of sentiment analysis is also specific to the type of text data being used. For example, a newspaper article is written in a structured manner, whereas tweets and other social media text follow a loose structure with the presence of slang and incorrect punctuation. As a result, there isn’t one blueprint that might work for every scenario. Instead, we will present a set of blueprints that can be used to produce a successful sentiment analysis.

Sentiment Analysis

A lot of information is available in the form of text, and based on the context of the communication, the information can be categorized into objective texts and subjective texts. Objective texts contain a simple statement of facts, like we might find in a textbook or Wikipedia article. Such texts generally present the facts and do not express an opinion or sentiment. Subjective texts, on the other hand, convey someone’s reaction or contain information about emotion, mood, or feelings. This might be typically found in social media channels in tweets or where customers express their opinions, such as in product reviews. We undertake a study of sentiment in order to understand the state of mind of an individual expressed through the medium of text. Therefore, sentiment analysis works best on subjective texts that contain this kind of information rather than objective texts. Before starting our analysis, we must ensure that we have the right kind of dataset that captures the sentiment information we are looking for.

The sentiment of a piece of text can be determined at the phrase, sentence, or document level. For example, if we take the case of a customer writing an email to a company, there will be several paragraphs, with each paragraph containing multiple sentences. Sentiment can be calculated for each sentence and also for each paragraph. While paragraph 1 may be positive, paragraphs 3 and 4 could be negative. So, if we want to determine the overall sentiment expressed by this customer, we would have to determine the best way to aggregate the sentiment for each paragraph up to the document level. In the blueprints that we present, we calculate sentiment at a sentence level.
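
For example, one simple aggregation strategy is to score each sentence separately and average the scores. The following is a minimal sketch, assuming a hypothetical sentence_score function (such as the lexicon-based scorer developed later in this chapter) that returns a score for a single sentence:

import nltk
from nltk.tokenize import sent_tokenize

nltk.download('punkt')

# sentence_score is any function that scores a single sentence,
# for example the lexicon-based scorer developed later in this chapter
def document_sentiment(document, sentence_score):
    sentences = sent_tokenize(document)
    scores = [sentence_score(sentence) for sentence in sentences]
    # Simple aggregation: average the sentence-level scores
    return sum(scores) / len(scores) if scores else 0.0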

The techniques for performing sentiment analysis can be broken down into simple rule-based techniques and supervised machine learning approaches. Rule-based techniques are easier to apply since they do not require annotated training data. Supervised learning approaches provide better results but require the additional effort of labeling the data, although there are often simple ways to work around this requirement, as we will show in our use case. In this chapter, we provide blueprints for both: a lexicon-based approach, a supervised machine learning approach that vectorizes the review text, and a transfer learning approach that fine-tunes a pretrained language model.

Introducing the Amazon Customer Reviews Dataset

Let’s assume you are an analyst working in the marketing department of a leading consumer electronics company and would like to know how your smartphone products compare with competitors. You can easily compare the technical specifications, but it is more interesting to understand the consumer perception of the product. You could determine this by analyzing the sentiment expressed by customers in product reviews on Amazon. Using the blueprints and aggregating the sentiment for each review for a brand, you would be able to identify how customers perceive each brand. Similarly, what if your company is looking to expand their business by introducing products in an adjacent category? You could analyze customer reviews for all products in a segment, such as media tablets, smartwatches, or action cameras, and based on the aggregated sentiment determine a segment with poor customer satisfaction and therefore higher potential success of your product.

For our blueprints, we will use a dataset containing a collection of Amazon customer reviews for different products across multiple product categories. This dataset of Amazon customer reviews has already been scraped and compiled by researchers at Stanford University.1 The last updated version consists of product reviews from the Amazon website between 1996 and 2018 across several categories. It includes product reviews, product ratings, and other information such as helpfulness votes and product metadata. For our blueprints, we are going to focus on product reviews and use only those that are one sentence long. This is to keep the blueprint simple and remove the step of aggregation. A review containing multiple sentences can include both positive and negative sentiment. Therefore, if we tag all sentences in a review with the same sentiment, this would be incorrect. We only use data for some of the categories so that it can fit in memory and reduce processing time. This dataset has already been prepared, but you can refer to the Data_Preparation notebook present in the repository to understand the steps and possibly extend it. The blueprints work on any kind of dataset, and therefore if you have access to powerful hardware or cloud infrastructure, then you can choose more categories.

Let’s now take a look at the dataset:

import pandas as pd

df = pd.read_json('reviews.json', lines=True)
df.sample(5)

Out:

  overall verified reviewerID asin text summary
163807 5 False A2A8GHFXUG1B28 B0045Z4JAI Good Decaf... it has a good flavour for a decaf :) Nice!
195640 5 True A1VU337W6PKAR3 B00K0TIC56 I could not ask for a better system for my small greenhouse, easy to set up and nozzles do very well I could not ask for a better system for my small greenhouse
167820 4 True A1Z5TT1BBSDLRM B0012ORBT6 good product at a good price and saves a trip to the store Four Stars
104268 1 False A4PRXX2G8900X B005SPI45U I like the principle of a raw chip - something I can eat with my homemade salsa and guac - but these taste absolutely revolting. No better alternatives but still tastes bad.
51961 1 True AYETYLNYDIS2S B00D1HLUP8 Fake China knockoff, you get what you pay for. Definitely not OEM

Looking at a summary of the dataset, we can see that it contains the following columns:

Overall
This is the final rating provided by the reviewer to the product. Ranges from 1 (lowest) to 5 (highest).
Verified
This indicates whether the product purchase has been verified by Amazon.
ReviewerID
This is the unique identifier allocated by Amazon to each reviewer.
ASIN
This is a unique product code that Amazon uses to identify the product.
Text
The actual text in the review provided by the user.
Summary
This is the headline or summary of the review that the user provided.

The column text contains the main content of the customer review and expresses the user’s opinion. While the rest of the information can be useful, we will focus on using this column in the blueprints.

Blueprint: Performing Sentiment Analysis Using Lexicon-Based Approaches

As an analyst working on the Amazon customer reviews data, the first challenge that might come up is the absence of target labels. We do not automatically know whether a particular review is positive or negative. Does the text express happiness because the product worked perfectly or anger because the product has broken at the first use? We cannot determine this until we actually read the review. This is challenging because we would have to read close to 300,000 reviews and manually assign a target sentiment to each of the reviews. We overcome this problem by using a lexicon-based approach.

What is a lexicon? A lexicon is like a dictionary that contains a collection of words and has been compiled using expert knowledge. The key differentiating factor for a lexicon is that it incorporates specific knowledge and has been collected for a specific purpose. We will use sentiment lexicons that contain commonly used words and capture the sentiment associated with them. A simple example of this is the word happy, with a sentiment score of 1, and another is the word frustrated, which would have a score of -1. Several standardized lexicons are available for the English language, and the popular ones are AFINN Lexicon, SentiWordNet, Bing Liu’s lexicon, and VADER lexicon, among others. They differ from each other in the size of their vocabulary and their representation. For example, the AFINN Lexicon comes in the form of a single dictionary with 3,300 words, with each word assigned a signed sentiment score ranging from -3 to +3. Negative/positive indicate the polarity, and the magnitude indicates the strength. On the other hand, if we look at Bing Liu lexicon, it comes in the form of two lists: one for positive words and another for negative, with a combined vocabulary of 6,800 words. Most sentiment lexicons are available for English, but there are also lexicons available for German2 and for 81 other languages as generated by this research paper.3

The sentiment of a sentence or phrase is determined by first identifying the sentiment score for each word from the chosen lexicon and then adding them up to arrive at the overall sentiment. By using this technique, we avoid the need to manually look at each review and assign the sentiment label. Instead, we rely on the lexicon that provides expert sentiment scores for each word. For our first blueprint, we will use the Bing Liu lexicon, but you are free to extend the blueprint to use other lexicons as well. The lexicons normally contain several variants of the word and exclude stop words, and therefore the standard preprocessing steps are not essential in this approach. Only those words that are present in the lexicon will actually be scored. This also leads to one of the disadvantages of this method, which we will discuss at the end of the blueprint.

Bing Liu Lexicon

The Bing Liu lexicon has been compiled by dividing the words into those that express positive opinion and those that express negative opinion. This lexicon also contains misspelled words and is more suitable to be used on text that has been extracted from online discussion forums, social media, and other such sources and should therefore produce better results on the Amazon customer reviews data.

The Bing Liu lexicon is available from the authors’ website as a zip file that contains a set of positive and negative words. It is also available within the NLTK library as a corpus that we can use after downloading. Once we have extracted the lexicon, we will create a dictionary that can hold the lexicon words and their corresponding sentiment score. Our next step is to generate the score for each review in our dataset. We convert the contents of text to lowercase first; then using the word_tokenize function from the NLTK package, we split the sentence into words and check whether this word is part of our lexicon, and if so, we add the corresponding sentiment score of the word to the total sentiment score for the review. As the final step, we normalize this score based on the number of words in the sentence. This functionality is encapsulated in the function bing_liu_score and is applied to every review in our dataset:

import nltk
from nltk.corpus import opinion_lexicon
from nltk.tokenize import word_tokenize

nltk.download('opinion_lexicon')
nltk.download('punkt')  # the punkt models are required by word_tokenize

print('Total number of words in opinion lexicon', len(opinion_lexicon.words()))
print('Examples of positive words in opinion lexicon',
      opinion_lexicon.positive()[:5])
print('Examples of negative words in opinion lexicon',
      opinion_lexicon.negative()[:5])

Out:

Total number of words in opinion lexicon 6789
Examples of positive words in opinion lexicon ['a+', 'abound', 'abounds',
'abundance', 'abundant']
Examples of negative words in opinion lexicon ['2-faced', '2-faces',
'abnormal', 'abolish', 'abominable']
Then:

# Let's create a dictionary which we can use for scoring our review text
df.rename(columns={"reviewText": "text"}, inplace=True)
pos_score = 1
neg_score = -1
word_dict = {}

# Adding the positive words to the dictionary
for word in opinion_lexicon.positive():
    word_dict[word] = pos_score

# Adding the negative words to the dictionary
for word in opinion_lexicon.negative():
    word_dict[word] = neg_score

def bing_liu_score(text):
    sentiment_score = 0
    bag_of_words = word_tokenize(text.lower())
    for word in bag_of_words:
        if word in word_dict:
            sentiment_score += word_dict[word]
    return sentiment_score / len(bag_of_words)
df['Bing_Liu_Score'] = df['text'].apply(bing_liu_score)
df[['asin','text','Bing_Liu_Score']].sample(2)

Out:

  asin text Bing_Liu_Score
188097 B00099QWOU As expected 0.00
184654 B000RW1XO8 Works as designed... 0.25

Now that we have calculated the sentiment score, we would like to check whether the calculated score matches the expectation based on the rating provided by the customer. Instead of checking this for each review, we could compare the sentiment score across reviews that have different ratings. We would expect that a review that has a five-star rating would have a higher sentiment score than a review with a one-star rating. In the next step, we standardize the scores with preprocessing.scale (zero mean, unit variance) and compute the average sentiment score across all reviews for each star rating:

from sklearn import preprocessing

df['Bing_Liu_Score'] = preprocessing.scale(df['Bing_Liu_Score'])
df.groupby('overall').agg({'Bing_Liu_Score':'mean'})

Out:

overall Bing_Liu_Score
1 -0.587061
2 -0.426529
4 0.344645
5 0.529065

The previous blueprint allows us to use any kind of sentiment lexicon to quickly determine a sentiment score and can also serve as a baseline to compare other sophisticated techniques, which should improve the accuracy of sentiment prediction.

Supervised Learning Approaches

The use of a supervised learning approach is beneficial because it allows us to model the patterns in the data and create a prediction function that is close to reality. It also gives us the flexibility to choose from different techniques and identify the one that provides maximum accuracy. A more detailed overview of supervised machine learning is provided in Chapter 6.

To use such an approach, we would need labeled data that may not be easily available. Often, it involves two or more human annotators looking at each review and determining the sentiment. If the annotators do not agree, then a third annotator might be needed to break the deadlock. It is common to have five annotators, with three of them agreeing on the opinion to confirm the label. This can be tedious and expensive but is the preferred approach when working with real business problems.

However, in many cases we will be able to test a supervised learning approach without going through the expensive labeling process. A simpler option is to check for any proxy indicators within the data that might help us annotate it automatically. Let’s illustrate this in the case of the Amazon reviews. If somebody has given a five-star product rating, then we can assume that they liked the product they used, and this should be reflected in their review. Similarly, if somebody has provided a one-star rating for a product, then they are dissatisfied with it and would have some negative things to say. Therefore, we could use the product rating as a proxy measure of whether a particular review would be positive or negative. The higher the rating, the more positive a particular review should be.

Preparing Data for a Supervised Learning Approach

Therefore, as the first step in converting our dataset into a supervised machine learning problem, we will automatically annotate our reviews using the rating. We have chosen to annotate all reviews with a rating of 4 and 5 as positive and with ratings 1 and 2 as negative based on the reasoning provided earlier. In the data preparation process, we also filtered out reviews with a rating of 3 to provide a clearer separation between positive and negative reviews. This step can be tailored based on your use case.

df = pd.read_json('reviews.json', lines=True)

# Assigning a new [1,0] target class label based on the product rating
df['sentiment'] = 0
df.loc[df['overall'] > 3, 'sentiment'] = 1
df.loc[df['overall'] < 3, 'sentiment'] = 0

# Removing unnecessary columns to keep a simple DataFrame
df.drop(columns=[
    'reviewTime', 'unixReviewTime', 'overall', 'reviewerID', 'summary'],
        inplace=True)
df.sample(3)

Out:

  verified asin text sentiment
176400 True B000C5BN72 everything was as listed and is in use all appear to be in good working order 1
65073 True B00PK03IVI this is not the product i received. 0
254348 True B004AIKVPC Just like the dealership part. 1

As you can see from the selection of reviews presented, we have created a new column named sentiment that contains a value of 1 or 0 depending on the rating provided by the user. We can now treat this as a supervised machine learning problem where we will use the content present in text to predict the sentiment: positive (1) or negative (0).

Blueprint: Vectorizing Text Data and Applying a Supervised Machine Learning Algorithm

In this blueprint, we will build a supervised machine learning algorithm by first cleaning the text data, then performing vectorization, and finally applying a support vector machine model for the classification.
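
The earlier steps of this blueprint (text cleaning, train-test split, and TF-IDF vectorization) are not reproduced here, but the training code that follows assumes variables created along these lines. Here is a minimal sketch, assuming the cleaned review text is kept in the text column and the unmodified text in text_orig; the vectorizer parameters are illustrative choices, and the cleaning step itself is omitted:

from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

# Preserve the original review text; any cleaning (not shown here)
# would modify df['text'] while df['text_orig'] stays untouched
df['text_orig'] = df['text']

# Stratified train-test split on the sentiment label
X_train, X_test, Y_train, Y_test = train_test_split(df['text'],
                                                    df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['sentiment'])

# Vectorize the review text with TF-IDF
tfidf = TfidfVectorizer(min_df=10, ngram_range=(1, 1))
X_train_tf = tfidf.fit_transform(X_train)
X_test_tf = tfidf.transform(X_test)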

Step 4: Training the Machine Learning Model

As described in Chapter 6, support vector machines are the preferred machine learning algorithm when working with text data. SVMs are known to work well on datasets with a large number of sparse numeric features, and the LinearSVC module we use is quite fast. We could also choose tree-based methods like random forest or XGBoost, but in our experience their accuracy is comparable, and the quick training time of LinearSVC makes experimentation faster:

from sklearn.svm import LinearSVC

model1 = LinearSVC(random_state=42, tol=1e-5)
model1.fit(X_train_tf, Y_train)

Out:

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=1000,
          multi_class='ovr', penalty='l2', random_state=42, tol=1e-05,
          verbose=0)

Then:

from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score

Y_pred = model1.predict(X_test_tf)
print ('Accuracy Score - ', accuracy_score(Y_test, Y_pred))
print ('ROC-AUC Score - ', roc_auc_score(Y_test, Y_pred))

Out:

Accuracy Score -  0.8658396979172006
ROC-AUC Score -  0.8660667427476778

As we can see, this model achieves an accuracy of around 86%. Let’s look at some of the model predictions and the review text to perform a sense check of the model:

sample_reviews = df.sample(5)
sample_reviews_tf = tfidf.transform(sample_reviews['text'])
sentiment_predictions = model1.predict(sample_reviews_tf)
sentiment_predictions = pd.DataFrame(data = sentiment_predictions,
                                     index=sample_reviews.index,
                                     columns=['sentiment_prediction'])
sample_reviews = pd.concat([sample_reviews, sentiment_predictions], axis=1)
print ('Some sample reviews with their sentiment - ')
sample_reviews[['text_orig','sentiment_prediction']]

Out:

Some sample reviews with their sentiment -
  text_orig sentiment_prediction
29500 Its a nice night light, but not much else apparently! 1
98387 Way to small, do not know what to do with them or how to use them 0
113648 Didn’t make the room “blue” enough - returned with no questions asked 0
281527 Excellent 1
233713 fit like oem and looks good 1

We can see that this model is able to predict the reviews reasonably well. For example, review 98387, where the user found the product too small and unusable, is marked as negative, while review 233713, where the user says that the product fits well and looks good, is marked as positive. How does the model compare with a baseline that uses the Bing Liu lexicon?

def baseline_scorer(text):
    score = bing_liu_score(text)
    if score > 0:
        return 1
    else:
        return 0

Y_pred_baseline = X_test.apply(baseline_scorer)
acc_score = accuracy_score(Y_pred_baseline, Y_test)
print (acc_score)

Out:

0.7521998393903668

It does provide an uplift on the lexicon baseline of 75%, and while the accuracy can be improved further, this is a simple blueprint that provides quick results. For example, if you’re looking to determine the customer perception of your brand versus competitors, then using this blueprint and aggregating sentiments for each brand will give you a fair understanding. Or let’s say you want to create an app that helps people decide whether to watch a movie. Using this blueprint on data collected from Twitter or YouTube comments, you could determine whether people feel more positively or negatively and use that to provide a suggestion. In the next blueprint, we will describe a more sophisticated technique that can be used to improve the accuracy.

Pretrained Language Models Using Deep Learning

Languages have evolved over centuries and are still continuously changing. While there are rules of grammar and guidelines to forming sentences, these are often not strictly followed and depend heavily on context. The words that a person chooses while tweeting would be quite different when writing an email to express the same thought. And in many languages (including English) the exceptions to the rules are far too many! As a result, it is difficult for a computer program to understand text-based communication. This can be overcome by giving algorithms a deeper language understanding by making use of language models.

Language models are a mathematical representation of natural language that allows us to understand the structure of a sentence and the words in it. There are several types of language models, but we will focus on the use of pretrained language models in this blueprint. The most important characteristic of these language models is that they make use of deep neural network architectures and are trained on a large corpus of data. The use of language models greatly improves the performance of NLP tasks such as language translation, automatic spelling correction, and text summarization.

Deep Learning and Transfer Learning

Deep learning is commonly used to describe a set of machine learning methods that leverage artificial neural networks (ANNs). ANNs were inspired by the human brain and try to mimic the connections and information processing activity between neurons in biological systems. Simply explained, it tries to model a function using an interconnected network of nodes spanning several layers with the weights of the network edges learned with the help of data. For a more detailed explanation, please refer to Section II of Hands-On Machine Learning (O’Reilly, 2019) by Aurélien Géron.

Transfer learning is a technique within deep learning that allows us to benefit from pretrained, widely available language models by transferring a model to our specific use case. It gives us the ability to use the knowledge and information obtained in one task and apply it to another problem. As humans, we are good at doing this. For example, we initially learn to play the guitar but can then relatively easily apply that knowledge to pick up the cello or harp more quickly (than a complete beginner). When the same concepts are applied with regard to a machine learning algorithm, then it’s referred to as transfer learning.

This idea was first popularized in the computer vision industry, where a large-scale image recognition challenge led to several research groups competing to build complex neural networks that are several layers deep to reduce the error on the challenge. Other researchers discovered that these complex models work well not just for that challenge but also on other image recognition tasks with small tweaks. These large models had already learned basic features about images (think of edges, shapes, etc.) and could be fine-tuned for the specific application without the need to train from scratch. In the last two years, the same techniques have been successfully applied to text analytics. First, a deep neural network is trained on a large text corpus (often derived from publicly available data sources like Wikipedia). The chosen model architecture is a variant of LSTM or Transformer.4 When training these models, one word is removed (masked) in the sentence, and the prediction task is to determine the masked word given all the other words in the sentence. To go back to our human analogy, there might be far more YouTube videos that teach you how to play the guitar than the harp or cello. Therefore, it would be beneficial to first learn to play the guitar because of the large number of resources available and then apply this knowledge to a different task, like playing the harp or cello.
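
To make the masked-word prediction task concrete, the Transformers library exposes it directly through its fill-mask pipeline. The following is a small illustrative example; the exact predictions returned will depend on the model version:

from transformers import pipeline

# The fill-mask pipeline wraps a pretrained masked language model
fill_mask = pipeline('fill-mask', model='bert-base-uncased')

# The model predicts the most likely words for the [MASK] position
print(fill_mask("This laptop is really [MASK]."))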

Such large models take a lot of time and computational resources to train. Fortunately, many research groups have made such pretrained models publicly available, including ULMFiT from fastai, BERT from Google, GPT-2 from OpenAI, and Turing from Microsoft. Figure 11-1 shows the final step of applying transfer learning, where the initial layers of the pretrained models are kept fixed, and the final layers of the model are retrained to better suit the task at hand. In this way, we can apply a pretrained model to specific tasks such as text classification and sentiment analysis.

Figure 11-1. Transfer learning. The parameters of earlier layers in the network are learned by training the model on the large corpus, and the parameters of the final layers are unfrozen and allowed to be fine-tuned during the training on the specific dataset.

For our blueprint we will use the BERT pretrained model released by Google. BERT is an acronym for Bidirectional Encoder Representations from Transformers. It uses the Transformers architecture and trains a model using a large corpus of text data. The model that we use in this blueprint (bert-base-uncased) is trained on the combined English Wikipedia and Books corpus using a Masked Language Model (MLM). There are other versions of the BERT model that can be trained on different corpora. For example, there is a BERT model trained on German Wikipedia articles. The masked language model randomly masks (hides) some of the tokens from the input, and the objective is to predict the original vocabulary ID of the masked word based only on its context (surrounding words). Since it’s bidirectional, the model looks at each sentence in both directions and is able to understand context better. In addition, BERT also uses subwords as tokens, which provides more granularity when identifying the meaning of a word. Another advantage is that BERT generates context-aware embeddings. For example, depending on the surrounding words in a sentence where the word cell is used, it can have a biological reference or actually refer to a prison cell. For a much more detailed understanding of how BERT works, please see “Further Reading”.

Blueprint: Using the Transfer Learning Technique and a Pretrained Language Model

This blueprint will show you how we can leverage pretrained language models to perform sentiment classification. Consider the use case where you would like to take action based on the sentiment expressed. For example, if a customer is particularly unhappy, you would like to route them to your best customer service representative. It’s important that you are able to detect the sentiment accurately or else you risk losing them. Or let’s say you are a small business that relies heavily on reviews and ratings on public websites like Yelp. To improve your ratings, you would like to follow up with unhappy customers by offering them coupons or special services. It’s important to be accurate so that you target the right customers. In such use cases, we may not have a lot of data to train the model, but having a high accuracy is important. We know that sentiment is influenced by the context in which a word is used, and the use of a pretrained language model can improve our sentiment predictions. This gives us the ability to go beyond the limited dataset that we have to incorporate knowledge from general usage.

In our blueprint we will use the Transformers library because of its easy-to-use functionality and wide support for multiple pretrained models. “Choosing the Transformers Library” provides more details about this topic. The Transformers library is continuously updated, with multiple researchers contributing.

Step 1: Loading Models and Tokenization

The first step when using the Transformers library is to import the three classes needed for the chosen model. This includes the config class, used to store important model parameters; the tokenizer, to tokenize and prepare the text for model training; and the model class, which defines the model architecture and weights. These classes are specific to the model architecture, and if we want to use a different architecture, then the appropriate classes need to be imported instead. We instantiate these classes from a pretrained model and choose the smallest BERT model, bert-base-uncased, which is 12 layers deep and contains 110 million parameters!

The advantage of using the Transformers library is that it already provides multiple pretrained models for many model architectures, which you can browse on the Hugging Face model hub. When we instantiate a model class from a pretrained model, the model architecture and weights are downloaded from an AWS S3 bucket hosted by Hugging Face. This might take a while the first time, but it is then cached on your machine, which removes the need for subsequent downloads. Note that since we are using the pretrained model to predict the sentiment (positive versus negative), we specify finetuning_task='binary'. We have provided additional instructions in the accompanying notebook to ensure that additional Python packages are installed before running this blueprint.

from transformers import BertConfig, BertTokenizer, BertForSequenceClassification

config = BertConfig.from_pretrained('bert-base-uncased',finetuning_task='binary')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

We have to transform the input text data into a standard format required by the model architecture. We define a simple get_tokens method to convert the raw text of our reviews to numeric values. The pretrained model accepts each observation as a fixed-length sequence. So, if an observation is shorter than the maximum sequence length, then it is padded with empty (zero) tokens, and if it's longer, then it is truncated. Each model architecture has a maximum sequence length that it supports. The tokenizer class provides an encode function that splits the sentence into tokens, pads it to create the fixed-length sequence, and represents each token as a numeric ID that can be used during model training. Our get_tokens function also creates an attention mask to differentiate those positions that contain actual words from those that contain padding characters. Here is an example of how this process works:

def get_tokens(text, tokenizer, max_seq_length, add_special_tokens=True):
  input_ids = tokenizer.encode(text,
                               add_special_tokens=add_special_tokens,
                               max_length=max_seq_length,
                               pad_to_max_length=True)
  attention_mask = [int(id > 0) for id in input_ids]
  assert len(input_ids) == max_seq_length
  assert len(attention_mask) == max_seq_length
  return (input_ids, attention_mask)

text = "Here is the sentence I want embeddings for."
input_ids, attention_mask = get_tokens(text,
                                       tokenizer,
                                       max_seq_length=30,
                                       add_special_tokens = True)
input_tokens = tokenizer.convert_ids_to_tokens(input_ids)
print (text)
print (input_tokens)
print (input_ids)
print (attention_mask)

Out:

Here is the sentence I want embeddings for.
['[CLS]', 'here', 'is', 'the', 'sentence', 'i', 'want', 'em', '##bed',
'##ding', '##s', 'for', '.', '[SEP]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]', '[PAD]',
'[PAD]', '[PAD]', '[PAD]', '[PAD]']
[101, 2182, 2003, 1996, 6251, 1045, 2215, 7861, 8270, 4667, 2015, 2005, 1012,
102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0]

The first token that we observe is the [CLS] token, which stands for classification, one of the pretraining tasks of the BERT model. This token is used to identify the start of a sentence and stores the aggregated representation of the entire sentence within the model. We also see the [SEP] token at the end of the sentence, which stands for separator. When BERT is used for nonclassification tasks like language translation, each observation would include a pair of texts (for example, text in English and text in French), and the [SEP] token is used to separate the first text from the second. However, since we are building a classification model, the separator token is followed by [PAD] tokens. We specified the sequence length to be 30, and since our test observation was not that long, multiple padding tokens have been added at the end. Another interesting observation is that a word like embedding is not one token but is actually split into em, ##bed, ##ding, and ##s. The ## is used to identify tokens that are subwords, which is a special characteristic of the BERT model. This allows the model to better distinguish between root words, prefixes, and suffixes and to infer the meaning of words that it may not have seen before.

An important point to note is that since deep learning models use a context-based approach, it is advisable to use the text in the original form without any preprocessing, thus allowing the tokenizer to produce all possible tokens from its vocabulary. As a result, we must split the data again using the original text_orig column rather than the cleaned text column. After that, let’s apply the same function to our train and test data and this time use a max_seq_length of 50:

X_train, X_test, Y_train, Y_test = train_test_split(df['text_orig'],
                                                    df['sentiment'],
                                                    test_size=0.2,
                                                    random_state=42,
                                                    stratify=df['sentiment'])
X_train_tokens = X_train.apply(get_tokens, args=(tokenizer, 50))
X_test_tokens = X_test.apply(get_tokens, args=(tokenizer, 50))

Deep learning models are trained on GPUs using frameworks like TensorFlow and PyTorch. A tensor is the basic data structure used by these frameworks to represent and work with data and can store data in N dimensions. A simple way to visualize a tensor is by drawing an analogy with a chessboard. Let’s suppose that we mark an unoccupied position with 0, a position occupied by a white piece with 1, and a position occupied by a black piece with 2. We get an 8 × 8 matrix denoting the status of the chessboard at a given point in time. If we now want to track and store this over several moves, then we get multiple 8 × 8 matrices, which can be stored in what we call a tensor. Tensors are n-dimensional representations of data, containing an array of components that are functions of the coordinates of a space. The tensor that tracks the historical chess moves would be a rank 3 tensor, whereas the initial 8 × 8 matrix could also be considered a tensor, but with rank 2.
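
To make the chessboard analogy concrete, here is a small illustrative example using PyTorch tensors (the positions are made up):

import torch

# One board position: an 8 x 8 matrix (a rank 2 tensor)
# 0 = unoccupied, 1 = white piece, 2 = black piece
board = torch.zeros(8, 8, dtype=torch.long)
board[0, 0] = 2  # a black piece in one corner
board[7, 4] = 1  # a white piece on the opposite side

# A game history: stacking the board after each move gives
# a rank 3 tensor of shape (number_of_moves, 8, 8)
game = torch.stack([board, board, board])
print(game.shape)  # torch.Size([3, 8, 8])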

This is a simplistic explanation, but to get a more in-depth understanding, we would recommend reading “An Introduction to Tensors for Students of Physics and Engineering” by Joseph C. Kolecki. In our case, we create three tensors that contain the tokens (tensors containing multiple arrays of size 50), input masks (tensors containing arrays of size 50), and target labels (tensors containing scalars of size 1):

import torch
from torch.utils.data import TensorDataset

input_ids_train = torch.tensor(
    [features[0] for features in X_train_tokens.values], dtype=torch.long)
input_mask_train = torch.tensor(
    [features[1] for features in X_train_tokens.values], dtype=torch.long)
label_ids_train = torch.tensor(Y_train.values, dtype=torch.long)

print (input_ids_train.shape)
print (input_mask_train.shape)
print (label_ids_train.shape)

Out:

torch.Size([234104, 50])
torch.Size([234104, 50])
torch.Size([234104])

We can take a peek at what is in this tensor and see that it contains a mapping to the BERT vocabulary for each of the tokens in a sentence. The number 101 indicates the start, and 102 indicates the end of the review sentence. We combine these tensors together into a TensorDataset, which is the basic data structure used to load all observations during model training:

input_ids_train[1]

Out:

tensor([ 101, 2009, 2134, 1005, 1056, 2147, 6314, 2055, 2009, 1037, 5808, 1997,
        2026, 2769,  102,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,    0,
           0,    0])

Then:

train_dataset = TensorDataset(input_ids_train,input_mask_train,label_ids_train)
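
The evaluation step later in this blueprint uses a test_dataset built in the same way. A parallel construction from the X_test_tokens computed earlier might look like this:

# Build the corresponding tensors and dataset for the test observations
input_ids_test = torch.tensor(
    [features[0] for features in X_test_tokens.values], dtype=torch.long)
input_mask_test = torch.tensor(
    [features[1] for features in X_test_tokens.values], dtype=torch.long)
label_ids_test = torch.tensor(Y_test.values, dtype=torch.long)

test_dataset = TensorDataset(input_ids_test, input_mask_test, label_ids_test)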

Step 2: Model Training

Now that we have preprocessed and tokenized the data, we are ready to train the model. Because of the large memory usage and computation demands of deep learning models, we follow a different approach compared to the SVM model used in the previous blueprint. All training observations are split into batches (defined by train_batch_size and randomly sampled from all observations using RandomSampler) and passed forward through the layers of the model. When the model has seen all the training observations by going through the batches, it is said to have been trained for one epoch. An epoch is therefore one pass through all the observations in the training data. The combination of batch_size and number of epochs determines how long the model takes to train. Choosing a larger batch_size reduces the number of forward passes in an epoch but might result in higher memory consumption. Choosing a larger number of epochs gives the model more time to learn the right value of the parameters but will also result in a longer training time. For this blueprint we have defined batch_size to be 64 and num_train_epochs to be 2:

from torch.utils.data import DataLoader, RandomSampler

train_batch_size = 64
num_train_epochs = 2

train_sampler = RandomSampler(train_dataset)
train_dataloader = DataLoader(train_dataset,
                              sampler=train_sampler,
                              batch_size=train_batch_size)
t_total = len(train_dataloader) // num_train_epochs

print ("Num examples = ", len(train_dataset))
print ("Num Epochs = ", num_train_epochs)
print ("Total train batch size  = ", train_batch_size)
print ("Total optimization steps = ", t_total)

Out:

Num examples =  234104
Num Epochs =  2
Total train batch size  =  64
Total optimization steps =  1829

When all the observations in one batch have passed forward through the layers of the model, the backpropagation algorithm is applied in the backward direction. This technique allows us to automatically compute the gradients for each parameter in the neural network, giving us a way to tweak the parameters to reduce the error. This is similar to how stochastic gradient descent works, but we do not attempt a detailed explanation. Chapter 4 of Hands-On Machine Learning (O’Reilly, 2019) provides a good introduction and mathematical explanation. The key thing to note is that when training a deep learning algorithm, parameters that influence backpropagation, like the learning rate and choice of optimizer, determine how quickly the model is able to learn the parameters and reach higher accuracies. There is often no definitive scientific reason why a certain method or value works better, but many researchers5 attempt to determine the best options empirically. We make informed choices for the blueprint based on the parameters in the BERT paper and recommendations in the Transformers library, as shown here:

from transformers import AdamW, get_linear_schedule_with_warmup

learning_rate = 1e-4
adam_epsilon = 1e-8
warmup_steps = 0

optimizer = AdamW(model.parameters(), lr=learning_rate, eps=adam_epsilon)
scheduler = get_linear_schedule_with_warmup(optimizer,
                                            num_warmup_steps=warmup_steps,
                                            num_training_steps=t_total)

Before setting up the training loop, we check whether a GPU is available (see “Using GPUs for Free with Google Colab”). If so, the model and input data are transferred to the GPU, and then we set up the forward pass by running the inputs through the model to produce outputs. Since we have specified the labels, we already know the deviation from actual (loss), and we adjust the parameters using backpropagation that calculates gradients. The optimizer and scheduler steps are used to determine the amount of parameter adjustment. Note the special condition to clip the gradients to a max value to prevent the problem of exploding gradients.

We will now wrap all these steps in nested for loops—one for each epoch and another for each batch in the epoch—and use the TQDM library introduced earlier to keep track of the training progress while printing the loss value:

from tqdm import trange, notebook

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
train_iterator = trange(num_train_epochs, desc="Epoch")

# Put model in 'train' mode
model.train()

for epoch in train_iterator:
    epoch_iterator = notebook.tqdm(train_dataloader, desc="Iteration")
    for step, batch in enumerate(epoch_iterator):

        # Reset all gradients at start of every iteration
        model.zero_grad()

        # Put the model and the input observations to GPU
        model.to(device)
        batch = tuple(t.to(device) for t in batch)

        # Identify the inputs to the model
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        # Forward Pass through the model. Input -> Model -> Output
        outputs = model(**inputs)

        # Determine the deviation (loss)
        loss = outputs[0]
        print("\r%f" % loss, end='')

        # Back-propagate the loss (automatically calculates gradients)
        loss.backward()

        # Prevent exploding gradients by limiting gradients to 1.0
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)

        # Update the parameters and learning rate
        optimizer.step()
        scheduler.step()

The steps we have performed up to now have fine-tuned the parameters of the BERT model that we downloaded to fit the sentiment analysis of the Amazon customer reviews. If the model is learning the parameter values correctly, you should observe that the loss value reduces over multiple iterations. At the end of the training step, we can save the model and tokenizer into a chosen output folder:

model.save_pretrained('outputs')
tokenizer.save_pretrained('outputs')

Step 3: Model Evaluation

Evaluating our model on the test data is similar to the training steps, with only minor differences. First, we have to evaluate the entire test dataset and therefore don’t need to make random samples; instead, we use the SequentialSampler class to load observations. However, we are still constrained by the number of observations we can load at a time and therefore must use test_batch_size to determine this. Second, we do not need a backward pass or adjustment of parameters and only perform the forward pass. The model provides us with output tensors that contain the value of loss and output probabilities. We use the np.argmax function to determine the output label with maximum probability and calculate the accuracy by comparing with actual labels:

import numpy as np
from torch.utils.data import SequentialSampler

test_batch_size = 64
test_sampler = SequentialSampler(test_dataset)
test_dataloader = DataLoader(test_dataset,
                             sampler=test_sampler,
                             batch_size=test_batch_size)

# Load the fine-tuned model that was saved earlier
# model = BertForSequenceClassification.from_pretrained('outputs')

# Initialize the prediction and actual labels
preds = None
out_label_ids = None

# Put model in "eval" mode
model.eval()

for batch in notebook.tqdm(test_dataloader, desc="Evaluating"):

    # Put the model and the input observations to GPU
    model.to(device)
    batch = tuple(t.to(device) for t in batch)

    # Do not track any gradients since in 'eval' mode
    with torch.no_grad():
        inputs = {'input_ids':      batch[0],
                  'attention_mask': batch[1],
                  'labels':         batch[2]}

        # Forward pass through the model
        outputs = model(**inputs)

        # We get loss since we provided the labels
        tmp_eval_loss, logits = outputs[:2]

        # There may be more than one batch of items in the test dataset
        if preds is None:
            preds = logits.detach().cpu().numpy()
            out_label_ids = inputs['labels'].detach().cpu().numpy()
        else:
            preds = np.append(preds, logits.detach().cpu().numpy(), axis=0)
            out_label_ids = np.append(out_label_ids,
                                      inputs['labels'].detach().cpu().numpy(),
                                      axis=0)

# Get final loss, predictions and accuracy
preds = np.argmax(preds, axis=1)
acc_score = accuracy_score(preds, out_label_ids)
print ('Accuracy Score on Test data ', acc_score)

Out:

Accuracy Score on Test data  0.9535086370393152

The results for our test data show an increase in model accuracy to roughly 95%, a jump of almost nine percentage points over our previous baseline with TF-IDF and SVM. This is the benefit of using a state-of-the-art language model and is most likely a result of BERT being trained on a large corpus of data. The reviews are quite short, and the earlier model has only that data to learn a relationship. BERT, on the other hand, is context aware and can transfer the prior information it has about the words in the review. The accuracy can be improved further by fine-tuning hyperparameters like learning_rate or by training for more epochs. Because the number of parameters in a pretrained language model far exceeds the number of observations we use for fine-tuning, we must be careful to avoid overfitting during this process!

Using Saved Models

If you are running the evaluation separately, you can load the fine-tuned model directly without the need to train again. Note that this is the same function that we initially used to load the pretrained model from transformers, but this time we are using the fine-tuned model that we trained ourselves.
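
Here is a minimal sketch of reloading the fine-tuned model and tokenizer, assuming both were saved to the outputs folder as shown earlier:

from transformers import BertTokenizer, BertForSequenceClassification

# Load the fine-tuned weights and the tokenizer saved at the end of training
model = BertForSequenceClassification.from_pretrained('outputs')
tokenizer = BertTokenizer.from_pretrained('outputs')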

As you can see, using a pretrained language model improves the accuracy of our model but also involves many additional steps and can incur costs like the use of a GPU (training a useful model on CPU can take 50–100 times longer). The pretrained models are quite large and not memory efficient. Using these models in production is often more complicated because of the time taken to load millions of parameters into memory, and they are inefficient for real-time scenarios because of longer inference times. Some pretrained models like DistilBERT and ALBERT have been specifically developed for a more favorable trade-off between accuracy and model simplicity. You can easily try this by reusing the blueprint and changing the appropriate model classes to the distilbert-base-uncased or albert-base-v1 model, both of which are available in the Transformers library, and checking the accuracy.
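
For example, a sketch of swapping in DistilBERT; only the imported classes and the model name change, and the rest of the blueprint stays the same:

from transformers import (DistilBertConfig, DistilBertTokenizer,
                          DistilBertForSequenceClassification)

# Same three classes as before, but for the DistilBERT architecture
config = DistilBertConfig.from_pretrained('distilbert-base-uncased',
                                           finetuning_task='binary')
tokenizer = DistilBertTokenizer.from_pretrained('distilbert-base-uncased')
model = DistilBertForSequenceClassification.from_pretrained(
    'distilbert-base-uncased')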

1 J. McAuley and J. Leskovec. “Hidden Factors and Hidden Topics: Understanding Rating Dimensions with Review Text.” RecSys, 2013. https://snap.stanford.edu/data/web-Amazon.html.

2 “Interest Group on German Sentiment Analysis, Multi-Domain Sentiment Lexicon for German,” https://oreil.ly/WpMhF.

3 Yanqing Chen and Steven Skiena. Building Sentiment Lexicons for All Major Languages. Lexicons available on Kaggle.

4 Ashish Vaswani et al. “Attention Is All You Need.” 2017. https://arxiv.org/abs/1706.03762.

5 Robin M. Schmidt, Frank Schneider, and Phillipp Hennig. “Descending through a Crowded Valley: Benchmarking Deep Learning Optimizers.” 2020. https://arxiv.org/pdf/2007.01547.pdf.