Sentiment analysis is a set of techniques used to quantify the sentiment expressed in a piece of text. There are many community sites and e-commerce sites that allow users to comment on and rate products and services. However, these are not the only places where people discuss products and services; there is also social media. We can leverage the data from the sites with comments and ratings to learn the relationship between the language used and positive or negative sentiment. These approaches can be extended to predicting the emotions of the author of a piece of text. Sentiment analysis is one of the most popular uses of NLP.
For this application, we are trying to build a program that we can use to quantify movie reviews. Many, though not all, movie reviewers use some quantifiable metric, such as thumbs up/down, stars, or letter grades, but these are not normalized. Two reviewers who use a 10-point scale may have different distributions: one reviewer may give most movies a 4–6, while another gives a 6–8. We could normalize them, but what about the reviewers who use different metrics or no metrics at all? It might be better to build a model that looks at the reviews and produces a score. This way, we know that the scores from a given reviewer are based on the text of the review, instead of on an ad hoc rating.
What is the problem we are trying to solve?
We want to build an application that takes the text of a movie review and produces a score. We will use this to aggregate reviews, so this application will run as a batch process. We will surface this to the user using a display that shows how positively or negatively the movie was received. We will not worry about the other aspects of the presentation. We will assume that the display will be embedded in other content.
What constraints are there?
Here are our constraints:
It may seem unreasonable to set a desired metric threshold before we have even looked at the data, but this situation is common. Negotiating with stakeholders is important. If an arbitrary threshold has been set but experimentation reveals it is unrealistic, the data scientist should be able to explain to the stakeholders why the problem is more difficult than expected.
As you work on the project, this list may change. The earlier you catch missed constraints, the better. If you discover a constraint just before deployment, it can be very expensive to fix. This is why we want to iterate with stakeholders during development.
Now that we have listed our constraints, let’s discuss how we can build our application.
How do we solve the problem with the constraints?
The first constraint, that the reviews are in English, actually makes our task easier. The second constraint, concerning how long it takes to calculate the aggregate score, controls how complex a model we can build, but it is a light constraint. We have the IMDb data set. When we build our program, we will load the reviews from JSON; however, our modeling code does not need to follow that constraint.
To plan the project, let’s define what our acceptance criteria are. The product owner would normally define these by incorporating stakeholder requests. In this chapter, you are both product owner and developer.
We want a script that does the following:
We will use Spark NLP to process the data and a Spark MLlib model to predict the sentiment.
Now that we have these high-level acceptance criteria, let’s look at the data. First, we will load the data into DataFrames and add the label columns.
import sparknlp
from pyspark.ml import Pipeline
from pyspark.sql import SparkSession
from pyspark.sql.functions import lit

from sparknlp import DocumentAssembler, Finisher
from sparknlp.annotator import *
spark = sparknlp.start()
pos_train = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/train/pos/')
neg_train = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/train/neg/')
pos_test = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/test/pos/')
neg_test = spark.sparkContext.wholeTextFiles(
    'aclImdb_v1/aclImdb/test/neg/')
pos_train = spark.createDataFrame(pos_train, ['path', 'text'])
pos_train = pos_train.repartition(100)
pos_train = pos_train.withColumn('label', lit(1)).persist()

neg_train = spark.createDataFrame(neg_train, ['path', 'text'])
neg_train = neg_train.repartition(100)
neg_train = neg_train.withColumn('label', lit(0)).persist()

pos_test = spark.createDataFrame(pos_test, ['path', 'text'])
pos_test = pos_test.repartition(100)
pos_test = pos_test.withColumn('label', lit(1)).persist()

neg_test = spark.createDataFrame(neg_test, ['path', 'text'])
neg_test = neg_test.repartition(100)
neg_test = neg_test.withColumn('label', lit(0)).persist()
Let’s look at an example of a positive review.
print(pos_train.first()['text'])
I laughed a lot while watching this. It's an amusing short with a fun musical act and a lot of wackiness. The characters are simple, but their simplicity adds to the humor stylization. The dialog is funny and often unexpected, and from the first line to the last everything just seems to flow wonderfully. There's Max, who has apparently led a horrible life. And there's Edward, who isn't sure what life he wants to lead. My favorite character was Tom, Edward's insane boss. Tom has a short role but a memorable one. Highly recommended for anyone who likes silly humor. And you can find it online now, which is a bonus! I am a fan of all of Jason's cartoons and can't wait to see what he comes out with next.
This seems like a clearly positive review. We can identify a few words and phrases that seem like a good signal, like “funny” and “highly recommended.”
Now, let’s look at an example of a negative review.
print(neg_train.first()['text'])
I sat glued to the screen, riveted, yawning, yet keeping an attentive eye. I waited for the next awful special effect, or the next ridiculously clichéd plot item to show up full force, so I could learn how not to make a movie.<br /><br />It seems when they set out to make this movie, the crew watched every single other action/science-fiction/shoot-em-up/good vs. evil movie ever made, and saw cool things and said: "Hey, we can do that." For example, the only car parked within a mile on what seems like a one way road with a shoulder not meant for parking, is the one car the protagonist, an attractive brunette born of bile, is thrown on to. The car blows to pieces before she even lands on it. The special effects were quite obviously my biggest beef with this movie. But what really put it in my bad books was the implausibility, and lack of reason for so many elements! For example, the antagonist, a flying demon with the ability to inflict harm in bizarre ways, happens upon a lone army truck transporting an important VIP. Nameless security guys with guns get out of the truck, you know they are already dead. Then the guy protecting the VIP says "Under no circumstances do you leave this truck, do you understand me?" He gets out to find the beast that killed his 3 buddies, he gets whacked in an almost comically cliché fashion. Then for no apparent reason, defying logic, convention, and common sense, the dumb ass VIP GETS OUT OF THE TRUCK!!! A lot of what happened along the course of the movie didn't make sense. Transparent acting distanced me from the movie, as well as bad camera-work, and things that just make you go: "Wow, that's incredibly cheesy." Shiri Appleby saved the movie from a 1, because she gave the movie the one element that always makes viewers enjoy the experience, sex appeal.
This is a clear example of a negative review. We see many words here that seem like solid indicators of negative sentiment, like “awful” and “cheesy.”
Notice that there are some HTML artifacts that we will want to remove.
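For example, one option is a quick regular-expression pass over the raw text before it enters the NLP pipeline. This is just a sketch of that idea; the Normalizer we configure later strips the angle brackets from tokens, but stray tokens like br can survive, so cleaning the raw text is worth considering.

from pyspark.sql.functions import regexp_replace

# Replace anything that looks like an HTML tag with a space.
pos_train = pos_train.withColumn('text', regexp_replace('text', '<[^>]+>', ' '))
neg_train = neg_train.withColumn('text', regexp_replace('text', '<[^>]+>', ' '))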
Now, let’s look at the corpus as a whole.
print('pos_train size', pos_train.count())
print('neg_train size', neg_train.count())
print('pos_test size', pos_test.count())
print('neg_test size', neg_test.count())
pos_train size 12500
neg_train size 12500
pos_test size 12500
neg_test size 12500
So we have 50,000 documents. Having such an even distribution between positive and negative is artificial in this case. Let’s look at the length of the text, shown in Table 12-1.
pos_train.selectExpr('length(text) AS text_len')\
    .toPandas().describe()
| | text_len |
| --- | --- |
| count | 12500.000000 |
| mean | 1347.160240 |
| std | 1046.747365 |
| min | 70.000000 |
| 25% | 695.000000 |
| 50% | 982.000000 |
| 75% | 1651.000000 |
| max | 13704.000000 |
There appears to be a lot of variation in character length, which may itself be a useful feature. Text length may seem very low level, but it often carries useful information about a text; for example, longer comments may be more likely to be negative due to rants. It would be even more useful if we had reviewer IDs, so we could get a sense of what is normal for a given reviewer; alas, that is not in the data.
Now that we have taken a brief look at the data, let’s begin to design our solution.
First, let’s separate our project into two phases.
Training and measuring the model
The quality of modeling code is often overlooked. This is an important piece of a project. You will want to be able to hand off your experiment, so the code should be reusable. You also need the model to be reproducible, not just for academic reasons but also in case the model needs to be rebuilt for business purposes. You may also want to return to the project at some point to improve the model.
One common way of making a modeling project reusable is to build a notebook, or a collection of notebooks. We won’t cover that in this chapter, since the modeling project is straightforward.
Building the script
The script will take one argument, the path to the reviews in JSON format—one JSON-formatted review per line. It will output a JSON-formatted report on the distribution of reviews.
The following are the acceptance criteria for the script:
{ "count": ###, "mean": 0.###, "std": 0.###, "median": 0.###, "min": 0.###, "max": 0.###, }
The scores, which we will take the mean of, should be floating-point numbers between 0 and 1. Many classifiers output predicted probabilities, but this does place an assumption on the output of this script.
Now that we have a plan, let’s implement it.
Recall the steps to a modeling project we discussed in Chapter 7. Let’s go through them here.
Process data.
We already have the data, and we have looked at it. Let’s do some basic processing and store it so we can more quickly iterate on our model.
First, let’s combine the positives and negatives into two data sets: train and test.
train = pos_train.unionAll(neg_train)
test = pos_test.unionAll(neg_test)
Now, let’s use Spark NLP to process the data. We will save both the lemmatized and normalized tokens, as well as GloVe embeddings. This way, we can experiment with different features.
Let’s create our pipeline.
assembler = DocumentAssembler()\
    .setInputCol('text')\
    .setOutputCol('document')
sentence = SentenceDetector() \
    .setInputCols(["document"]) \
    .setOutputCol("sentences")
tokenizer = Tokenizer()\
    .setInputCols(['sentences'])\
    .setOutputCol('tokens')
lemmatizer = LemmatizerModel.pretrained()\
    .setInputCols(['tokens'])\
    .setOutputCol('lemmas')
normalizer = Normalizer()\
    .setCleanupPatterns([
        '[^a-zA-Z.-]+',
        '^[^a-zA-Z]+',
        '[^a-zA-Z]+$',
    ])\
    .setInputCols(['lemmas'])\
    .setOutputCol('normalized')\
    .setLowercase(True)
glove = WordEmbeddingsModel.pretrained(name='glove_100d') \
    .setInputCols(['document', 'normalized']) \
    .setOutputCol('embeddings')

nlp_pipeline = Pipeline().setStages([
    assembler, sentence, tokenizer,
    lemmatizer, normalizer, glove
]).fit(train)
Let’s select just the values we are interested in—namely, the original data plus the normalized tokens and embeddings.
train = nlp_pipeline.transform(train) \
    .selectExpr(
        'path', 'text', 'label',
        'normalized.result AS normalized',
        'embeddings.embeddings'
    )
test = nlp_pipeline.transform(test) \
    .selectExpr(
        'path', 'text', 'label',
        'normalized.result AS normalized',
        'embeddings.embeddings'
    )
nlp_pipeline.write().overwrite().save('nlp_pipeline.3.12')
Recall the simplest version of doc2vec that we covered in Chapter 11, in which we average the word vectors in a document to create a document vector. We will use this technique here.
import numpy as np
from pyspark.sql.types import *
from pyspark.ml.linalg import DenseVector, VectorUDT

def avg_wordvecs_fun(wordvecs):
    return DenseVector(np.mean(wordvecs, axis=0))

avg_wordvecs = spark.udf.register(
    'avg_wordvecs', avg_wordvecs_fun, returnType=VectorUDT())

train = train.withColumn('avg_wordvec', avg_wordvecs('embeddings'))
test = test.withColumn('avg_wordvec', avg_wordvecs('embeddings'))

# drop returns a new DataFrame, so we reassign to actually remove the column
train = train.drop('embeddings')
test = test.drop('embeddings')
Now, we will save them as Parquet files. This will let us free up some memory.
train.write.mode('overwrite').parquet('imdb.train')
test.write.mode('overwrite').parquet('imdb.test')
Let’s clean up the data we persisted before, so we have more memory to work with.
pos_train.unpersist()
neg_train.unpersist()
pos_test.unpersist()
neg_test.unpersist()
Now we load our processed data and persist it.
train = spark.read.parquet('imdb.train').persist()
test = spark.read.parquet('imdb.test').persist()
Featurize
Let’s see how well our model does. First, we will set up TF.IDF features so that we can experiment with them.
from pyspark.ml.feature import CountVectorizer, IDF
tf = CountVectorizer()\
    .setInputCol('normalized')\
    .setOutputCol('tf')
idf = IDF()\
    .setInputCol('tf')\
    .setOutputCol('tfidf')
featurizer = Pipeline().setStages([tf, idf])
Model
Now that we have our features, we can build our first model. Let’s start with logistic regression, which is often a good baseline. For this first model, we will use the averaged word vectors as the feature vector.
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
vec_assembler = VectorAssembler()\
    .setInputCols(['avg_wordvec'])\
    .setOutputCol('features')
logreg = LogisticRegression()\
    .setFeaturesCol('features')\
    .setLabelCol('label')
model_pipeline = Pipeline()\
    .setStages([featurizer, vec_assembler, logreg])
model = model_pipeline.fit(train)
Now let’s save the model.
model.write().overwrite().save('model.3.12')
Now that we have fit a model, let’s get our predictions.
train_preds = model.transform(train)
test_preds = model.transform(test)
Evaluate
Let’s calculate our F1 score on train and test.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator()\
    .setMetricName('f1')
evaluator.evaluate(train_preds)
0.8029598474121058
evaluator.evaluate(test_preds)
0.8010723532212578
This is above the minimal acceptance criteria, so we are ready to ship this model.
Review
We can, of course, identify ways to improve the model. But it is important to get a first version out. After we have deployed an initial version, we can begin to look at ways to improve the model.
Deploy
For this application, deployment is merely making the script available. Realistically, offline “deployments” often involve creating a workflow that can be run on demand or periodically. For this application, having the script in a place that can be run for new reviews is all that is needed.
%%writefile movie_review_analysis.py
"""
This script takes a file containing reviews of the same movie. It will output
the results of the analysis to stdout.
"""
import argparse as ap
import json

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
from pyspark.ml import PipelineModel

if __name__ == '__main__':
    print('beginning...')
    parser = ap.ArgumentParser(description='Movie Review Analysis')
    parser.add_argument('-file', metavar='DATA', type=str, required=True,
                        help='The file containing the reviews '
                             'in JSON format, one JSON review '
                             'per line')
    options = vars(parser.parse_args())

    spark = SparkSession.builder \
        .master("local[*]") \
        .appName("Movie Analysis") \
        .config("spark.driver.memory", "12g") \
        .config("spark.executor.memory", "12g") \
        .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.2.2") \
        .getOrCreate()

    nlp_pipeline = PipelineModel.load('nlp_pipeline.3.12')
    model = PipelineModel.load('model.3.12')

    data = spark.read.json(options['file'])
    nlp_procd = nlp_pipeline.transform(data)
    preds = model.transform(nlp_procd)

    # Vector columns cannot be indexed directly in SQL expressions, so we
    # register a small UDF that pulls out the probability of the positive
    # class. Using the probability (rather than rawPrediction) keeps the
    # scores between 0 and 1, as the acceptance criteria require.
    spark.udf.register('pos_prob', lambda v: float(v[1]), DoubleType())

    results = preds.selectExpr(
        'count(*) AS count',
        'mean(pos_prob(probability)) AS mean',
        'std(pos_prob(probability)) AS std',
        'percentile(pos_prob(probability), 0.5) AS median',
        'min(pos_prob(probability)) AS min',
        'max(pos_prob(probability)) AS max',
    ).first().asDict()
    print(json.dumps(results))
This script can be used to take a set of reviews for a movie and aggregate them into a single score plus some additional statistics.
Now that we have a first implementation of the application, let’s talk about metrics. In a more realistic scenario, you would define your metrics in the planning stage. However, it is easier to explain some of these topics once we have something concrete to refer to.
Normally, an NLP project ties into a new or existing product or service. In this case, let’s say that the output of this script will be used in a film blog. Likely, you will already be tracking views. When first introducing this feature to the blog, you may want to do some A/B testing. Ideally, you would do the testing by randomly showing or not showing the score on blog entries. If that isn’t technically feasible, you could show the scores in some entries and not in others during the initial deployment.
Aggregated scores, like those produced by this tool, can be added to blog entries but may not necessarily affect views much. This feature may make your review more attractive for mentions by other outlets. This can be an additional metric. You can possibly capture this by logging where visitors are coming from.
You might want to consider including the aggregate in the messages and notifications you send out. For example, if you notify people of new entries via email, you can include the aggregate in the subject. Also, consider adding the aggregate in the title of the entry.
Once you have decided on your business metrics, you can start working on the technical metrics.
For sentiment analysis, you will generally be using classification metrics, like we are here. Sometimes, sentiment labels have grades—for example, very bad, bad, neutral, good, very good. In these situations you can potentially build a regression model instead of a classifier.
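As a minimal sketch of that alternative, assuming a hypothetical ordinal grade column (0 = very bad through 4 = very good) in place of our binary label, we could swap the classifier for a regressor while reusing the featurization stages from earlier.

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression

# `grade` is hypothetical; our IMDb data set has only a binary label.
regressor = LinearRegression()\
    .setFeaturesCol('features')\
    .setLabelCol('grade')
grade_pipeline = Pipeline()\
    .setStages([featurizer, vec_assembler, regressor])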
There are traditional metrics used with classifiers, like precision and recall. For these to make sense, you need to decide which label is the “positive” one. With precision and recall, “positive” is in the sense of “testing positive,” which can make discussing these metrics a little confusing. Let’s say that for this application the good sentiment has the positive label. Under this assumption, precision is the proportion of reviews predicted to be good that are actually good, and recall is the proportion of actually good reviews that are predicted to be good. Another common classification metric used alongside precision and recall is the F-score, the harmonic mean of precision and recall, which is a convenient way of summarizing the two. We can calculate precision, recall, and the F-score with MulticlassClassificationEvaluator in Spark.
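For instance, here is a minimal sketch of computing per-label precision and recall on our test predictions, treating good (label 1) as the positive class. The per-label metrics assume Spark 3.0 or later; older versions offer only the weighted variants.

from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Precision and recall for the positive (good) class; we computed F1 above.
precision_eval = MulticlassClassificationEvaluator()\
    .setMetricName('precisionByLabel')\
    .setMetricLabel(1.0)
recall_eval = MulticlassClassificationEvaluator()\
    .setMetricName('recallByLabel')\
    .setMetricLabel(1.0)

print('precision', precision_eval.evaluate(test_preds))
print('recall', recall_eval.evaluate(test_preds))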
Another way of measuring classifier models is with a metric called log-loss. This is also called cross-entropy. The idea behind log-loss is measuring how different the observed distribution of labels is from the predicted distribution. This has the benefit of not relying on a mapping of the meaning of good and bad labels to positive and negative. On the downside, this is less interpretable than precision and recall.
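As a minimal sketch, we can compute log-loss by hand from the prediction DataFrame we already have (test_preds), pulling the positive-class probability out of the probability vector with a small UDF; newer Spark versions also offer a logLoss metric on MulticlassClassificationEvaluator.

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

# Pull the probability of the positive class out of the probability vector.
get_pos_prob = F.udf(lambda v: float(v[1]), DoubleType())

eps = 1e-15  # clip the probabilities so log(0) cannot occur
with_p = test_preds.withColumn(
    'p', F.least(F.greatest(get_pos_prob('probability'), F.lit(eps)),
                 F.lit(1.0 - eps)))
log_loss = with_p.select(
    F.avg(-(F.col('label') * F.log('p')
            + (1 - F.col('label')) * F.log(1.0 - F.col('p'))))
     .alias('log_loss')
).first()['log_loss']
print(log_loss)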
When deciding on your model metrics, you should decide on which ones will be useful for experimentation and what singular metric is best for reporting out to share with stakeholders. The metric you select for stakeholders should be one that can be easily explained to an audience that may not be familiar with data-science concepts.
An important part of every machine learning project is deployment. You want to make sure that you have metrics for this stage as well.
The infrastructure metrics you choose depend on how your application is deployed. In this case, because the application is a script, you will likely want to measure how long it takes to run. We could put this measurement in the script itself, but if we are deploying it in some sort of workflow system, that system will likely measure it for us.
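If we did want to capture it in the script itself, a minimal sketch could be as simple as logging elapsed wall-clock time around the work (the names here are illustrative):

import time

start = time.time()
# ... load the pipelines, score the reviews, print the report ...
elapsed = time.time() - start
print('elapsed_seconds', round(elapsed, 1))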
We will talk about more common infrastructure metrics when we get to other applications. Now that we have talked about the metrics for monitoring the technology behind our application, let’s talk about metrics we can use to make sure that we are properly supporting the application.
There are many software development metrics out there, and most of them depend on how you track work—for example, number of tickets per unit of time or average time from opening a ticket to closing a ticket.
In an application like this, you are not developing new features, so feature-ticket metrics won’t make much sense, though you can still use tickets to measure responsiveness to bugs. The process around this application is evaluating the reviews for a movie, so measure how long it takes to produce the aggregate score for a new movie. As you automate the process of submitting a set of reviews for aggregation, this will improve.
Another valuable metric for machine-learning–based applications is how long it takes to develop a new model. A simple model like this should not require more than a week, including gathering data, validating data, iterating on model training, documenting results, and deploying. If you find that making a new model takes prohibitively long, try to determine which part of the development process is slowing you down. The following are some common problems:
Now that we have ways to measure the technology and processes of our application, let’s talk about monitoring. This is essential in data-science–based applications because we make assumptions about the data when doing modeling. These assumptions may not hold in production.
The difficulty in monitoring a model is that we generally do not have labels in production, so we can’t measure our model with things like precision or root mean square error (RMSE). What we should do instead is check that the distributions of features and predictions in production are similar to what we saw during experimentation. We can measure this both in offline applications (applications that are run on request, like the one in this chapter) and in online applications (like a model that is available as a web service).
For an application like this, we have only offline measurement, so we should track the aggregates over time. Naïvely, we can assume that the average score for movies should be stable. This may very well not be the case, but if there is a trend, we should review the data to make sure that reviews really are changing overall and that our model has not simply been overfit.
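As a minimal sketch of such a check, suppose each batch run appends its JSON report (the output of our script) to a log file; the file name, baseline, and threshold below are illustrative, not measured values.

import json

BASELINE_MEAN = 0.5   # what we observed during experimentation (illustrative)
DRIFT_THRESHOLD = 0.15

with open('aggregate_reports.log') as f:
    reports = [json.loads(line) for line in f if line.strip()]

latest = reports[-1]
if abs(latest['mean'] - BASELINE_MEAN) > DRIFT_THRESHOLD:
    print('WARNING: mean predicted score has drifted; check the data and the model')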
When we look at applications that are deployed as real-time applications, we will discuss online metrics. In spirit, they are similar—they monitor the features and the scores.
There is one more step we need to discuss for this application—that is the review. Data-science–based applications are more complicated to review than most other software because you must review the actual software just as thoroughly as any other application, but you must also review the methodology. NLP applications are even more complicated. The theory behind linguistics and natural language data is not as cleanly modeled as other simpler kinds of data.
The review process is another vital part of writing any application. It is easy for a developer to have blind spots in their own projects. This is why we must bring in others to review our work. This process can be difficult for technical and human reasons. The most important part of any review process is that it not be personal. Both the reviewer and the developer should approach the process with the goal of collaboration. If a problem is found, that is an opportunity. The developer has avoided a later problem, and the reviewer can learn something that may help them avoid problems in future work.
Let’s talk about the steps of the review.
In our situation, this application is very simple. We are not developing anything but a script, so there is no actual architecture to review. However, the plan to deploy this as a script must be reviewed by the stakeholders. The model review would also be straightforward, since we have a clean data set and a simple model. This will generally not be the case. Similarly, our script is very simple. A code review might suggest that we develop a small test data set to make sure that a model will run on the data we expect.
These reviews take place during the development of the model. Once we are ready to deploy, we should have some more reviews to make sure that we have prepared for deployment.
When deploying your application, you should also have a fallback plan. If there is a major problem, can you bring down your application until it is fixed, or must there be something there no matter what? If it is the latter, consider having a “dummy” stand-in that you can deploy. You should work out the specifics with the stakeholders. Ideally, this should be discussed early in the project because this can help guide development and testing.
In our situation, this script is not mission critical. If the script doesn’t run, it will cause a delay only in the use of the aggregate score. Perhaps a backup script that uses a much simpler model could be devised if there absolutely must be a score added.
Finally, once you are ready to deploy, you should decide what the next steps will be. In our situation, we would likely want to talk about how the model can be improved. The model’s performance is not terrible, but it is well below the state of the art. Perhaps we can consider building a more complex model.
Additionally, we may eventually want to put this model behind a service. This will allow reviews to be scored immediately.
Now we have built our first application. This is a simple application, but it has allowed us to learn many things about how we will deploy more complex applications. In the next chapter we will again be looking at an offline application, but this will not be based on just the output of a model. We will be building an ontology that we can query.
Many Spark-based applications are offline tools like this. If we want to serve a live model behind a service, we would need to look elsewhere. There are several options for this, which we will discuss in Chapter 19.
Sentiment analysis is a fascinating task, and it uses the tools and techniques we have already covered. There are more complex examples, but by using good development processes we can always grow a simple application.