Chapter 12. Sentiment Analysis and Emotion Detection

Sentiment analysis is a set of techniques for quantifying the sentiment expressed in a piece of text. There are many community sites and e-commerce sites that allow users to comment on and rate products and services. However, these are not the only places where people discuss products and services; there is also social media. We can leverage the data from the sites with comments and ratings to learn the relationship between the language used and positive or negative sentiment. These approaches can be extended to predicting the emotions of the author of a piece of text. Sentiment analysis is one of the most popular uses of NLP.

For this application, we are trying to build a program that we can use to quantify movie reviews. Many, though not all, movie reviewers use some quantifiable metric (for example, thumbs up/down, stars, or letter grades), but these metrics are not normalized. Two reviewers who use a 10-point scale may have different distributions: one reviewer may give most movies scores in the 4–6 range, while another gives scores in the 6–8 range. We could normalize them, but what about the other reviewers who use different metrics or no metrics at all? It might be better if we build a model that looks at the review text and produces a score. This way, we know that the scores for a given reviewer are based on the text of the review, instead of on an ad hoc score.

Problem Statement and Constraints

  1. What is the problem we are trying to solve?

    We want to build an application that takes the text of a movie review and produces a score. We will use this to aggregate reviews, so this application will run as a batch process. We will surface this to the user using a display that shows how positively or negatively the movie was received. We will not worry about the other aspects of the presentation. We will assume that the display will be embedded in other content.

  2. What constraints are there?

    Here are our constraints:

    • We are assuming that we are working with English-language reviews.
    • We do not have much constraint on the speed of the program, since this is a batch offline process. We want to return the aggregate score for 95% of movies in less than 1 minute.
    • We want to make sure that our model is performing well on this task, so we will use a well-known data set. We will use the Large Movie Review Dataset based on IMDb user reviews.
    • We will assume that the input to this program is a JavaScript Object Notation (JSON) file of reviews. The output will be a score.
    • The model must have an F1 score of at least 0.7 on new data.

    It may seem unreasonable to set a desired metric threshold before we have even looked at the data, but this situation is common. Negotiating with stakeholders is important. If an arbitrary threshold has been set but experimentation reveals it is unrealistic, the data scientist should be able to explain to the stakeholders why the problem is more difficult than expected.

    As you work on the project, this list may change. The earlier you catch missed constraints, the better. If you discover a constraint just before deployment, it can be very expensive to fix. This is why we want to iterate with stakeholders during development.

    Now that we have listed our constraints, let’s discuss how we can build our application.

  3. How do we solve the problem with the constraints?

    The first constraint, that the reviews are in English, actually makes our task easier. The second constraint, concerning how long it takes to calculate the aggregate score, controls how complex a model we can build, but it is a light constraint. We have the IMDb data set. When we build our program, we will load the reviews from JSON; however, our modeling code does not need to follow this constraint, since we can work directly from the raw data files while experimenting.

Plan the Project

To plan the project, let’s define what our acceptance criteria are. The product owner would normally define these by incorporating stakeholder requests. In this chapter, you are both product owner and developer.

We want a script that does the following:

  • Takes a file with reviews as JSON objects, one per line (a sample input file is sketched after this list)
  • Returns distribution information based on the output from the model
    • Mean
    • Standard deviation
    • Quartiles
    • Min
    • Max
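
To make the input format concrete, the following sketch writes a tiny example of such a file. The review text and file name here are invented; the only requirement the pipeline below imposes is that each JSON object has a text field, since that is the column our Spark NLP stages read.

import json

# Hypothetical input: one JSON review per line, each with a "text" field
sample_reviews = [
    {"text": "One of the best films I have seen in years."},
    {"text": "A dull, cliched mess. I want my two hours back."},
]
with open('sample_reviews.json', 'w') as f:
    for review in sample_reviews:
        f.write(json.dumps(review) + '\n')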

We will use Spark NLP to process the data and a Spark MLlib model to predict the sentiment.

Now that we have these high-level acceptance criteria, let’s look at the data. First, we will load the data into DataFrames and add the label columns.

import sparknlp

from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *

# Start a Spark session with Spark NLP loaded
spark = sparknlp.start()

# Each review is a separate text file; wholeTextFiles gives (path, contents) pairs
pos_train = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/train/pos/')
neg_train = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/train/neg/')
pos_test = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/test/pos/')
neg_test = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/test/neg/')

# Convert to DataFrames, repartition for parallelism, and label
# positive reviews 1 and negative reviews 0
pos_train = spark.createDataFrame(pos_train, ['path', 'text'])
pos_train = pos_train.repartition(100)
pos_train = pos_train.withColumn('label', lit(1)).persist()
neg_train = spark.createDataFrame(neg_train, ['path', 'text'])
neg_train = neg_train.repartition(100)
neg_train = neg_train.withColumn('label', lit(0)).persist()
pos_test = spark.createDataFrame(pos_test, ['path', 'text'])
pos_test = pos_test.repartition(100)
pos_test = pos_test.withColumn('label', lit(1)).persist()
neg_test = spark.createDataFrame(neg_test, ['path', 'text'])
neg_test = neg_test.repartition(100)
neg_test = neg_test.withColumn('label', lit(0)).persist()

Let’s look at an example of a positive review.

print(pos_train.first()['text'])
I laughed a lot while watching this. It's an amusing short with a 
fun musical act and a lot of wackiness. The characters are simple, 
but their simplicity adds to the humor stylization. The dialog is 
funny and often unexpected, and from the first line to the last 
everything just seems to flow wonderfully. There's Max, who has 
apparently led a horrible life. And there's Edward, who isn't sure 
what life he wants to lead. My favorite character was Tom, Edward's 
insane boss. Tom has a short role but a memorable one. Highly 
recommended for anyone who likes silly humor. And you can find it 
online now, which is a bonus! I am a fan of all of Jason's cartoons 
and can't wait to see what he comes out with next.

This seems like a clearly positive review. We can identify a few words that seem like a good signal, like “amusing” and “recommended.”

Now, let’s look at an example of a negative review.

print(neg_train.first()['text'])
I sat glued to the screen, riveted, yawning, yet keeping an 
attentive eye. I waited for the next awful special effect, or the 
next ridiculously clichéd plot item to show up full force, so I 
could learn how not to make a movie.<br /><br />It seems when they 
set out to make this movie, the crew watched every single other 
action/science-fiction/shoot-em-up/good vs. evil movie ever made, 
and saw cool things and said: "Hey, we can do that." For example, 
the only car parked within a mile on what seems like a one way road 
with a shoulder not meant for parking, is the one car the 
protagonist, an attractive brunette born of bile, is thrown on to. 
The car blows to pieces before she even lands on it. The special 
effects were quite obviously my biggest beef with this movie. But 
what really put it in my bad books was the implausibility, and lack 
of reason for so many elements! For example, the antagonist, a 
flying demon with the ability to inflict harm in bizarre ways, 
happens upon a lone army truck transporting an important VIP. 
Nameless security guys with guns get out of the truck, you know 
they are already dead. Then the guy protecting the VIP says "Under 
no circumstances do you leave this truck, do you understand me?" He 
gets out to find the beast that killed his 3 buddies, he gets 
whacked in an almost comically cliché fashion. Then for no apparent 
reason, defying logic, convention, and common sense, the dumb ass 
VIP GETS OUT OF THE TRUCK!!! A lot of what happened along the 
course of the movie didn't make sense. Transparent acting distanced 
me from the movie, as well as bad camera-work, and things that just 
make you go: "Wow, that's incredibly cheesy." Shiri Appleby saved 
the movie from a 1, because she gave the movie the one element that 
always makes viewers enjoy the experience, sex appeal.

This is a clear example of a negative review. We see many words here that seem like solid indicators of negative sentiment, like “awful” and “cheesy.”

Notice that there are some HTML artifacts that we will want to remove.
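
If we wanted to strip these artifacts explicitly before any NLP processing, one option is a regular-expression replacement on the text column. This is only a sketch; in this chapter we rely on the Normalizer in our Spark NLP pipeline to strip most nonalphabetic characters instead.

from pyspark.sql.functions import regexp_replace

# Sketch: remove HTML tags such as <br /> before NLP processing;
# not applied in the rest of this chapter
pos_train_clean = pos_train.withColumn(
    'text', regexp_replace('text', '<[^>]+>', ' '))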

Now, let’s look at the corpus as a whole.

print('pos_train size', pos_train.count())
print('neg_train size', neg_train.count())
print('pos_test size', pos_test.count())
print('neg_test size', neg_test.count())
pos_train size 12500
neg_train size 12500
pos_test size 12500
neg_test size 12500

So we have 50,000 documents. Such an even split between positive and negative reviews is artificial; real review data is rarely this balanced. Let’s look at the length of the text, shown in Table 12-1.

pos_train.selectExpr('length(text) AS text_len')\
    .toPandas().describe()
Table 12-1. Summary of text lengths
text_len
count 12500.000000
mean 1347.160240
std 1046.747365
min 70.000000
25% 695.000000
50% 982.000000
75% 1651.000000
max 13704.000000

There appears to be a lot of variation in character lengths. This may be a useful feature. Text length may seem very low level, but it can often be useful information about a text. We may find that longer reviews are more likely to be negative, since ranting reviewers tend to write more. In this situation, it would be more useful if we had reviewer IDs, so we could get a sense of what is normal for a reviewer; alas, that is not in the data.
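
If we later wanted to use text length as a feature, it is easy to compute with Spark SQL functions. The following is only a sketch; we do not add it to the model in this chapter.

from pyspark.sql.functions import length

# Sketch: a character-length column that could later be assembled into
# the feature vector; not used in this chapter's model
pos_train_len = pos_train.withColumn('text_len', length('text'))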

Now that we have taken a brief look at the data, let’s begin to design our solution.

Implement the Solution

Recall the steps to a modeling project we discussed in Chapter 7. Let’s go through them here.

  1. Get data.
  2. Look at the data.
  3. Process data.

    We already have the data, and we have looked at it. Let’s do some basic processing and store it so we can more quickly iterate on our model.

    First, let’s combine the positives and negatives into two data sets: train and test.

    train = pos_train.unionAll(neg_train)
    test = pos_test.unionAll(neg_test)

    Now, let’s use Spark NLP to process the data. We will save both the lemmatized and normalized tokens, as well as GloVe embeddings. This way, we can experiment with different features.

    Let’s create our pipeline.

    # Spark NLP pipeline: document assembly, sentence detection,
    # tokenization, lemmatization, normalization, and GloVe embeddings
    assembler = DocumentAssembler()\
        .setInputCol('text')\
        .setOutputCol('document')
    sentence = SentenceDetector()\
        .setInputCols(['document'])\
        .setOutputCol('sentences')
    tokenizer = Tokenizer()\
        .setInputCols(['sentences'])\
        .setOutputCol('tokens')
    lemmatizer = LemmatizerModel.pretrained()\
        .setInputCols(['tokens'])\
        .setOutputCol('lemmas')
    # Lowercase the lemmas and strip nonalphabetic characters
    normalizer = Normalizer()\
        .setCleanupPatterns([
            '[^a-zA-Z.-]+',
            '^[^a-zA-Z]+',
            '[^a-zA-Z]+$',
        ])\
        .setInputCols(['lemmas'])\
        .setOutputCol('normalized')\
        .setLowercase(True)
    # Pretrained 100-dimensional GloVe word embeddings
    glove = WordEmbeddingsModel.pretrained(name='glove_100d')\
        .setInputCols(['document', 'normalized'])\
        .setOutputCol('embeddings')

    nlp_pipeline = Pipeline().setStages([
        assembler, sentence, tokenizer,
        lemmatizer, normalizer, glove
    ]).fit(train)

    Let’s select just the values we are interested in—namely, the original data plus the normalized tokens and embeddings.

    train = nlp_pipeline.transform(train) \
        .selectExpr(
            'path', 'text', 'label', 
            'normalized.result AS normalized', 
            'embeddings.embeddings'
        )
    
    test = nlp_pipeline.transform(test) \
        .selectExpr(
            'path', 'text', 'label', 
            'normalized.result AS normalized', 
            'embeddings.embeddings'
        )
    
    nlp_pipeline.write().overwrite().save('nlp_pipeline.3.12')

    Recall the simplest version of doc2vec that we covered in Chapter 11, in which we average the word vectors in a document to create a document vector. We will use this technique here.

    import numpy as np
    from pyspark.ml.linalg import DenseVector, VectorUDT

    # Average the word vectors in each document to get a document vector
    def avg_wordvecs_fun(wordvecs):
        return DenseVector(np.mean(wordvecs, axis=0))

    avg_wordvecs = spark.udf.register(
        'avg_wordvecs',
        avg_wordvecs_fun,
        returnType=VectorUDT())

    train = train.withColumn('avg_wordvec', avg_wordvecs('embeddings'))
    test = test.withColumn('avg_wordvec', avg_wordvecs('embeddings'))

    # drop returns a new DataFrame, so reassign to actually remove the column
    train = train.drop('embeddings')
    test = test.drop('embeddings')

    Now we will save the train and test sets as Parquet files. This will let us free up some memory.

    train.write.mode('overwrite').parquet('imdb.train')
    test.write.mode('overwrite').parquet('imdb.test')

    Let’s unpersist the data we cached before so we can have more memory to work with.

    pos_train.unpersist()
    neg_train.unpersist()
    pos_test.unpersist()
    neg_test.unpersist()

    Now we load our processed data back from Parquet and persist it.

    train = spark.read.parquet('imdb.train').persist()
    test = spark.read.parquet('imdb.test').persist()
  4. Featurize

    Let’s build TF.IDF features from the normalized tokens. We will include them in the model pipeline so we can experiment with them later; the first classifier below uses only the averaged word embeddings as its features.

    from pyspark.ml.feature import CountVectorizer, IDF

    # Term frequencies from the normalized tokens
    tf = CountVectorizer()\
        .setInputCol('normalized')\
        .setOutputCol('tf')
    # Weight the term frequencies by inverse document frequency
    idf = IDF()\
        .setInputCol('tf')\
        .setOutputCol('tfidf')

    featurizer = Pipeline().setStages([tf, idf])
    
  5. Model

    Now that we have our features, we can build our first model. Let’s start with logistic regression, which is often a good baseline.

    from pyspark.ml.feature import VectorAssembler
    from pyspark.ml.classification import LogisticRegression

    # The first model uses only the averaged word vectors as features;
    # the TF.IDF column is computed by the pipeline but not yet used
    vec_assembler = VectorAssembler()\
        .setInputCols(['avg_wordvec'])\
        .setOutputCol('features')
    logreg = LogisticRegression()\
        .setFeaturesCol('features')\
        .setLabelCol('label')

    model_pipeline = Pipeline()\
        .setStages([featurizer, vec_assembler, logreg])

    model = model_pipeline.fit(train)

    Now let’s save the model.

    model.write().overwrite().save('model.3.12')

    Now that we have fit a model, let’s get our predictions.

    train_preds = model.transform(train)
    test_preds = model.transform(test)
  6. Evaluate

    Let’s calculate our F1 score on train and test.

    from pyspark.ml.evaluation import MulticlassClassificationEvaluator
    evaluator = MulticlassClassificationEvaluator()\
        .setMetricName('f1')
    evaluator.evaluate(train_preds)
    0.8029598474121058
    evaluator.evaluate(test_preds)
    0.8010723532212578

    This is above our minimum acceptance criterion of an F1 score of 0.7, so we are ready to ship this model.

  7. Review

    We can, of course, identify ways to improve the model. But it is important to get a first version out. After we have deployed an initial version, we can begin to look at ways to improve the model.

  8. Deploy

    For this application, deployment is merely making the script available. Realistically, offline “deployments” often involve creating a workflow that can be run on demand or periodically. For this application, having the script in a place that can be run for new reviews is all that is needed.

    %%writefile movie_review_analysis.py
    
    """
    This script takes file containing reviews of the same. 
    It will output the results of the analysis to std.out.
    """
    
    
    import argparse as ap
    import json
    
    import numpy as np
    
    from pyspark.sql import SparkSession
    from pyspark.sql.functions import udf
    from pyspark.sql.types import DoubleType
    from pyspark.ml import PipelineModel
    from pyspark.ml.linalg import DenseVector, VectorUDT
    
    
    def avg_wordvecs_fun(wordvecs):
        # Average the word vectors in a document, as we did in training
        return DenseVector(np.mean(wordvecs, axis=0))
    
    
    avg_wordvecs = udf(avg_wordvecs_fun, VectorUDT())
    
    
    if __name__ == '__main__':
        print('beginning...')
        parser = ap.ArgumentParser(description='Movie Review Analysis')
        parser.add_argument('-file', metavar='DATA', type=str,
                            required=True, 
                            help='The file containing the reviews '\
                                 'in JSON format, one JSON review '\
                                 'per line')
        
        options = vars(parser.parse_args())
        
        spark = SparkSession.builder \
            .master("local[*]") \
            .appName("Movie Analysis") \
            .config("spark.driver.memory", "12g") \
            .config("spark.executor.memory", "12g") \
            .config("spark.jars.packages", 
                    "JohnSnowLabs:spark-nlp:2.2.2") \
            .getOrCreate()
        
        nlp_pipeline = PipelineModel.load('nlp_pipeline.3.12')
        model = PipelineModel.load('model.3.12')
        
        data = spark.read.json(options['file'])
        
        nlp_procd = nlp_pipeline.transform(data)
        
        # Recreate the columns the model was trained on: plain-string
        # normalized tokens and averaged word vectors
        nlp_procd = nlp_procd.selectExpr(
            'text',
            'normalized.result AS normalized',
            'embeddings.embeddings')
        nlp_procd = nlp_procd.withColumn(
            'avg_wordvec', avg_wordvecs('embeddings'))
        
        preds = model.transform(nlp_procd)
        
        # Extract the positive-class raw score as a plain double so we
        # can aggregate it with SQL functions
        pos_score = udf(lambda v: float(v[1]), DoubleType())
        preds = preds.withColumn('score', pos_score('rawPrediction'))
        
        results = preds.selectExpr(
            'count(*) AS n_reviews',
            'mean(score) AS mean_score',
            'stddev(score) AS std_score',
            'percentile_approx(score, array(0.25, 0.5, 0.75)) AS quartiles',
            'min(score) AS min_score',
            'max(score) AS max_score',
        ).first().asDict()
        
        print(json.dumps(results))
    
    

    This script takes a set of reviews and aggregates the model’s scores into a single summary plus some additional statistics.
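
    Assuming PySpark is installed and the saved nlp_pipeline.3.12 and model.3.12 directories are in the working directory, running the script might look like the following (the input file name here is made up).

    python movie_review_analysis.py -file reviews.json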

Test and Measure the Solution

Now that we have a first implementation of the application, let’s talk about metrics. In a more realistic scenario, you would define your metrics in the planning stage. However, it is easier to explain some of these topics once we have something concrete to refer to.

Model-Centric Metrics

For sentiment analysis, you will generally be using classification metrics, like we are here. Sometimes, sentiment labels have grades—for example, very bad, bad, neutral, good, very good. In these situations you can potentially build a regression model instead of a classifier.

There are traditional classifier metrics like precision and recall. For these to make sense, you need to decide which label is the “positive” one; with precision and recall, “positive” is in the sense of “testing positive,” which can make discussing these metrics a little confusing. Let’s say that for this application, good sentiment is the positive label. Under that assumption, precision is the proportion of reviews predicted to be good that are actually good, and recall is the proportion of actually good reviews that are predicted to be good. The F-score, the harmonic mean of precision and recall, is a convenient way of summarizing the two. We can calculate precision, recall, and the F-score with MulticlassClassificationEvaluator in Spark.
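
Here is a short sketch of computing these metrics on our test predictions. Note that Spark’s MulticlassClassificationEvaluator reports weighted precision and recall, averaged over both labels, rather than precision and recall for the positive label alone.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Weighted precision and recall over the test predictions
precision_eval = MulticlassClassificationEvaluator(
    metricName='weightedPrecision')
recall_eval = MulticlassClassificationEvaluator(
    metricName='weightedRecall')

print('precision', precision_eval.evaluate(test_preds))
print('recall', recall_eval.evaluate(test_preds))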

Another way of measuring classifier models is with a metric called log-loss, also known as cross-entropy. The idea behind log-loss is to measure how different the observed distribution of labels is from the predicted distribution. This has the benefit of not requiring us to map the good and bad labels to “positive” and “negative.” On the downside, it is less interpretable than precision and recall.
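
Depending on your Spark version, the evaluator may not offer a log-loss metric directly, but a rough sketch of computing it by hand from the logistic regression’s probability column might look like this.

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Sketch: log-loss computed by hand from the positive-class probability;
# probabilities are clipped away from 0 and 1 to avoid log(0)
p_pos = udf(lambda v: float(min(max(v[1], 1e-12), 1.0 - 1e-12)), DoubleType())

log_loss = test_preds \
    .withColumn('p', p_pos('probability')) \
    .selectExpr(
        'mean(-(label * log(p) + (1 - label) * log(1 - p))) AS log_loss') \
    .first()['log_loss']

print('log-loss', log_loss)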

When deciding on your model metrics, you should decide on which ones will be useful for experimentation and what singular metric is best for reporting out to share with stakeholders. The metric you select for stakeholders should be one that can be easily explained to an audience that may not be familiar with data-science concepts.

An important part of every machine learning project is deployment. You want to make sure that you have metrics for this stage as well.

Process Metrics

There are many software development metrics out there, and most of them depend on how you track work—for example, number of tickets per unit of time or average time from opening a ticket to closing a ticket.

In an application like this, you are not continually developing new features, so feature-oriented ticket metrics won’t make much sense, though you can still use tickets to measure responsiveness to bugs. The process around this application is that of evaluating reviews for a movie, so measure how long it takes to gather the aggregate score for a new movie. As you automate the process of submitting a set of reviews for aggregation, this time will improve.

Another valuable metric for machine-learning–based applications is how long it takes to develop a new model. A simple model like this should not require more than a week, including gathering data, data validation, iterating on model training, documenting results, and deployment. If you find that making a new model takes prohibitively long, try and determine what part of the development process is slowing you down. The following are some common problems:

Data problems are discovered when iterating on the model
  • Do improve the data validation so these problems are caught earlier
  • Don’t just remove the problems in an ad-hoc way without knowing the scope of the data problem—this can lead to invalid models
Every time a new model is needed there is too much data-cleaning work required
  • Do improve the ETL pipeline to handle some of this cleaning in an automated manner, or, if possible, find a better data source
  • Don’t pad the time necessary to build the model and accept that each new developer cleans the data—this can lead to inconsistent models over time
The score of a new model is very different from that of the previous model
  • Do review the evaluation code used in the current and previous models; the difference may be valid, or the measurement code could have a bug
  • Don’t ignore these changes—this can lead to deploying a worse model that was improperly measured

Now that we have ways to measure the technology and processes of our application, let’s talk about monitoring. This is essential in data-science–based applications because we make assumptions about the data when doing modeling. These assumptions may not hold in production.

Offline Versus Online Model Measurement

The difficulty in monitoring a model is that we generally do not have labels in production, so we can’t measure our model with metrics like precision or root mean square error (RMSE). What we should do instead is check that the distributions of features and predictions in production are similar to what we saw during experimentation. We can do this in offline applications (applications that are run on request, like the application in this chapter) as well as in online applications (for example, a model that is available as a web service).

For an application like this, we have only offline measurement. We should track the aggregates over time. Naïvely, we can assume that the mean score for movies should be stable. This may very well not be the case, but if there is a trend, we should review the data to make sure that reviews are indeed changing overall, and that the drift is not a sign that our model was overfit.
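
This kind of tracking can be as simple as appending each batch’s summary statistics, along with a run date, to a small monitoring table and charting it over time. The following is only a sketch; it assumes we extend the script above, so the results dictionary comes from that script, and the output path is made up.

import datetime

# Sketch: record this run's aggregate statistics so we can watch for
# drift in the prediction distribution; results is the dictionary
# computed in the script above, and the path is hypothetical
monitoring_row = dict(results)
monitoring_row['run_date'] = datetime.date.today().isoformat()

monitoring = spark.createDataFrame([monitoring_row])
monitoring.write.mode('append').parquet('sentiment_monitoring')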

When we look at applications that are deployed as real-time applications, we will discuss online metrics. In spirit, they are similar—they monitor the features and the scores.

There is one more step we need to discuss for this application: the review. Data-science–based applications are more complicated to review than most other software because you must review the actual software just as thoroughly as any other application, but you must also review the methodology. NLP applications are even more complicated, because natural language data is messier and less cleanly modeled than simpler kinds of data.

Review

The review process is another vital part of writing any application. It is easy for a developer to have blind spots in their own projects. This is why we must bring in others to review our work. This process can be difficult for technical and human reasons. The most important part of any review process is that it not be personal. Both the reviewer and the developer should approach the process with the goal of collaboration. If a problem is found, that is an opportunity. The developer has avoided a later problem, and the reviewer can learn something that may help them avoid problems in future work.

Let’s talk about the steps of the review.

  1. Architecture review: this is where other engineers, product owners, and stakeholders review how the application will be deployed. This should be done at the end of the planning stage of development.
  2. Model review: this is done when the developer or data scientist has a model that they believe will meet the expectations of the stakeholder. The model should be reviewed with other data scientists or those familiar with machine learning concepts, and another review should be conducted with stakeholders. The technical review should cover the data, processing, modeling, and measurement aspects of the project. The nontechnical review should explain the assumptions and limitations of the model to verify that it will meet the expectations of the stakeholders.
  3. Code reviews: this is necessary for any software application. The code should be reviewed by someone who has some knowledge of the project. If the reviewer has no context on the application, it will be difficult or impossible for them to catch logical bugs in the code.

In our situation, this application is very simple. We are not developing anything but a script, so there is no actual architecture to review. However, the plan to deploy this as a script must be reviewed by the stakeholders. The model review would also be straightforward, since we have a clean data set and a simple model; this will generally not be the case. Similarly, our script is very simple. A code review might suggest that we develop a small test data set to make sure that the model will run on the kind of data we expect.

These reviews take place during the development of the model. Once we are ready to deploy, we should have some more reviews to make sure that we have prepared for deployment.