Chapter 19. Productionizing NLP Applications

In this book we have covered many of the approaches and techniques we can use to build an NLP application, and we have talked about how to plan and develop one. Now, let’s talk about deploying NLP applications.

We will also talk about deploying models in production environments. Before we discuss how to deploy the models, we need to know the requirements of our product. Whether the model is used in a batch process or by a web service for individual evaluations changes how we want to deploy it. We also need to know what kind of hardware the models will require. Some of the things we discuss here should be considered before modeling has even begun, such as the hardware available in production.

The easiest situation is where your application is running as a batch process on an internal cluster. This means that your performance requirements are based only on internal users (in your organization), and securing the data will also be simpler. But not everything is this simple.

Another important part of deploying a production-quality system is making sure that the application works fast enough for user needs without taking up too many resources. In this chapter we will discuss how to optimize the performance of your NLP system. First, we need to consider what we want to optimize.

When people talk about performance testing, they generally mean testing how long the program takes to run and how much memory it uses. The possible variance in document size can make performance testing NLP-based applications more difficult. Additionally, annotation frameworks like Spark NLP can produce many times more data than they take in, so optimizing disk usage is important as well. Spark NLP is a distributed framework, so you should also take its performance as a distributed system into consideration.

Distributed systems need to take into account all the performance requirements of individual machines and make sure that the cluster is used efficiently. This means that you are not locking up resources unnecessarily and are using what is allocated to your process.

Even once the application is in production, there is still work to be done. We need to monitor the performance of the software and the model. In this chapter, we will talk about what we need to do when we want to take our application live.

The first step in taking any application live is making sure that the product owner and stakeholders are satisfied. For some applications this will be as simple as demoing the functionality in the requirements. With NLP-based applications, this can be more difficult, because intuitions about how NLP works are often wrong. This is why testing is so important.

The checklist for this chapter is much larger than for the others in this part of the book. This is because deployment of NLP applications can be very complicated. It may seem overwhelming, but we can also use the answers to these questions to get a clearer scope of our project.

Let’s start with model deployment.

Spark NLP Model Cache

We’ve used pretrained models from Spark NLP in several chapters of this book. These models are stored in a local cache. Pretrained pipelines are stored here, as are models for individual steps in a pipeline, as well as TensorFlow models. We’ve used Keras in this book when exploring neural networks. Keras, however, is a high-level API for neural network libraries. TensorFlow is the framework that performs the actual computation. The TensorFlow models are a different animal, though, because they are required on the worker machines and not just the driver. Spark NLP will handle setting up this cache for you as long as those machines have internet access. If you do not have internet access, you can put the files in shared storage, like HDFS, and modify your code to load from that location.
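
As a rough sketch of how the cache is used, the snippet below first loads a pretrained pipeline, which is downloaded into the local cache on first use, and then loads a model from shared storage. The pipeline name is a real Spark NLP pretrained pipeline, but the HDFS path is a placeholder for wherever you stage models in an offline environment.

    import sparknlp
    from sparknlp.pretrained import PretrainedPipeline
    from sparknlp.annotator import NerDLModel

    spark = sparknlp.start()

    # Downloaded into the local model cache on first use; later runs load
    # directly from the cache.
    pipeline = PretrainedPipeline("explain_document_dl", lang="en")
    print(pipeline.annotate("Spark NLP caches pretrained models locally."))

    # Without internet access, load a model that was copied to shared storage.
    # The path below is a placeholder for wherever you staged the files.
    ner_model = NerDLModel.load("hdfs:///models/ner_dl_en")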

This model cache requires access to persistent disk storage. Most deployment scenarios meet this requirement, but if you were to deploy on AWS Lambda, for example, this would not be a good fit.

Generally, Spark is not a good solution for real-time NLP applications. Although the cache improves performance, there is a minimum overhead to running a Spark job. Where available, you can use Spark NLP light pipelines, which run a fitted pipeline locally on the driver, outside Spark’s distributed execution, but you should test performance before deploying in any such scenario.
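
If you are considering serving individual documents, a light pipeline is worth benchmarking. The sketch below assumes the pretrained pipeline shown earlier and simply times a single-document call; the attribute and key names reflect current Spark NLP releases, so verify them against the version you use.

    import time

    import sparknlp
    from sparknlp.base import LightPipeline
    from sparknlp.pretrained import PretrainedPipeline

    spark = sparknlp.start()
    pretrained = PretrainedPipeline("explain_document_dl", lang="en")

    # LightPipeline runs the fitted pipeline locally, skipping the overhead of
    # submitting a distributed Spark job for every request.
    light = LightPipeline(pretrained.model)

    start = time.time()
    result = light.annotate("How fast is a single-document annotation?")
    print(result["token"])
    print("latency (s):", round(time.time() - start, 3))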

Another thing to consider is the availability of memory in your production environment. Spark NLP uses RocksDB as a local key-value store for static embeddings. You should make sure that your environment can support this memory load. If you are using Spark, it is almost certainly the case that you have enough memory for the embeddings.

We’ve talked about how Spark NLP accesses models; now let’s talk about how it integrates with TensorFlow.

Spark NLP and TensorFlow Integration

TensorFlow is implemented in C++ and CUDA, although most data scientists use it from its Python interface. Spark NLP is implemented in Scala and runs on the JVM, although we have also been using it from its Python interface. Spark NLP interfaces with TensorFlow through TensorFlow’s Java API. This requires that TensorFlow be installed on any machine that will use these models. Unfortunately, this means that we have a dependency outside our JAR file. It’s less of an issue if you are using the Python Spark NLP package, because it has TensorFlow as a dependency. This dependency requires that you be able to install this software on all production machines running your application. You should also note whether you will be using a GPU, since the TensorFlow dependency for GPUs is different.
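
To make the CPU/GPU choice explicit in code, here is a minimal sketch of starting a Spark NLP session from Python. It assumes a recent Spark NLP release in which sparknlp.start() accepts a gpu flag, so confirm the option against the version you actually deploy.

    import sparknlp

    # Starting the session this way lets Spark NLP pull in the matching build
    # of its TensorFlow dependency. The gpu flag is an assumption about the
    # installed Spark NLP version; check its documentation before relying on it.
    spark = sparknlp.start(gpu=True)
    print("Spark NLP version:", sparknlp.version())
    print("Spark version:", spark.version)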

The reason GPUs can improve training time so much is that they are built for batches of parallel processing, which is great for matrix operations. However, not all machines have the appropriate hardware, so enabling GPU support for your project may require an additional investment. If you are training on your development machine, there are common video cards that are good enough for some simple GPU training. Since training is much more computationally intensive than serving a model, you may need GPU support only for training. Some models, however, are complex enough that evaluating them on a CPU is prohibitively slow. If you are planning to use such a complex model, you need to coordinate with the team handling hardware infrastructure. They will need to requisition the machines, and you will need to do performance testing to make sure that you can serve the model in an appropriate amount of time.

Now that we have talked about the deployment considerations specific to Spark NLP, let’s discuss deployment of a composite system.

Spark Optimization Basics

An important aspect of optimizing Spark-based programs, and therefore Spark NLP-based programs, is persistence. To talk about persistence, let’s review how Spark organizes work.

When you have a reference to a DataFrame, it does not necessarily refer to actual data on the cluster, because Spark is lazy. This means that if you load data and perform some simple transformations, like changing strings to lowercase, no data is loaded or transformed yet. Instead, Spark builds an execution plan. As you add more instructions, this execution plan forms a directed acyclic graph (DAG). When you request data from the DataFrame, it triggers Spark to create a job. The job is split into stages, which are the sequences of processing steps necessary to produce the data for the object you have a reference to. These stages are then split into tasks, one per partition, which are distributed to the executors. Each executor runs as many tasks concurrently as it has processors for.
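
A small PySpark example makes this laziness concrete; the input path is a placeholder, and nothing is read until the action at the end.

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("lazy-demo").getOrCreate()

    # These lines only build the execution plan (the DAG); no data is read yet.
    docs = spark.read.text("hdfs:///data/documents")  # placeholder path
    lowered = docs.select(F.lower(F.col("value")).alias("text"))

    # explain() prints the plan Spark has built so far.
    lowered.explain()

    # An action triggers a job, which is split into stages and then into
    # per-partition tasks that run on the executors.
    print(lowered.count())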

When you persist a DataFrame, Spark stores the actual data once it is computed. This is useful when you will be reusing a particular set of data. For example, when you train a logistic regression model, there are multiple iterations over the data. You don’t want Spark to reload it from disk on each iteration, so you should persist the DataFrame containing the training data. Fortunately, you don’t need to do this yourself, because it is implemented in the training code for logistic regression.

There are parameters that control how your data is persisted. The first is whether to use disk. If you persist to disk you will have more space, but reloading will be much more time-consuming. The second parameter is whether to use memory. You must use disk or memory, or you can choose both; if you choose both, Spark will store what it can in memory and “spill” to disk if necessary. You can also choose to use off-heap memory. In Java, memory is split into two parts. The heap, or on-heap memory, is where Java objects are stored, and it is what the JVM garbage collector manages. The other part is off-heap memory, where Java stores classes, threads, and other data used by the JVM. Persisting data off-heap means that you are not restricted to the memory allocated to the JVM. This can be dangerous, since the JVM does not manage or limit this space. If you take up too much heap memory, your program will get an OutOfMemoryError; if you take up too much off-heap memory, you could potentially bring down the machine.

Apart from configuring where you store your persisted data, you can also decide whether to serialize it. Storing serialized data can be more space-efficient, but it will be more CPU-intensive. The last parameter is replication. This will cause the data to be replicated on different workers, which can be useful if a worker fails.
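
Here is a minimal sketch of persisting a DataFrame with an explicit storage level. The data path is a placeholder, and MEMORY_AND_DISK is just one of the levels PySpark exposes (DISK_ONLY, OFF_HEAP, and the replicated *_2 variants are others).

    from pyspark import StorageLevel
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("persist-demo").getOrCreate()
    docs = spark.read.text("hdfs:///data/documents")  # placeholder path
    texts = docs.select(F.lower(F.col("value")).alias("text"))

    # MEMORY_AND_DISK keeps what fits in memory and spills the rest to disk.
    texts.persist(StorageLevel.MEMORY_AND_DISK)
    texts.count()      # an action materializes the persisted data

    # ... iterate over `texts` here, e.g., for model training ...

    texts.unpersist()  # release the storage once the reuse is done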

Persisting helps us avoid redoing work unnecessarily, but we also want to make sure that we do the work efficiently. If your partitions are too large, executors will not be able to process them. You could add more memory to the executors, but this causes poor CPU utilization: if your workers have multiple cores but most of the memory goes to processing one partition on one core, all the other cores are wasted. Instead, you should try to reduce the size of your partitions. However, you do not want to go to the other extreme. Partitions have overhead, since Spark may need to shuffle the data, and too many small partitions will make aggregations and group-by operations very inefficient. Ideally, each partition should be around 200 MB in size.
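
To inspect and adjust partitioning, something like the following works; the partition counts are illustrative, and you would pick values that keep each partition near the size target above.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-demo").getOrCreate()
    docs = spark.read.text("hdfs:///data/documents")  # placeholder path

    print("current partitions:", docs.rdd.getNumPartitions())

    # repartition() redistributes rows evenly but triggers a shuffle;
    # coalesce() only merges partitions and avoids a full shuffle.
    balanced = docs.repartition(64)
    merged = balanced.coalesce(16)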

The Spark developers are constantly working on new ways to improve performance, so you should check the programming guides in each version to see if there are new ways to optimize your application.

Now that we have talked about how to optimize Spark operations, let’s talk about some design-level considerations to improve performance.

Design-Level Optimization

When you are designing your NLP application, you should consider how to divide your pipelines into manageable pieces. It may be tempting to have a single über-pipeline, but this causes several problems. First, having everything in one job makes the code harder to maintain, and even if you organize the code into a maintainable structure, errors at runtime will be harder to diagnose. Second, it can lead to inefficiencies in the design of your job. If your data extraction is memory intensive but your batch model evaluation is not, then you are taking up unnecessary resources during evaluation. Instead, you should have two jobs: data extraction and model evaluation. You should be using the job orchestrator of your cluster (Airflow, Databricks job scheduler, etc.). If your application loads data and runs the model as a batch job, here is a list of potential jobs you can create to break your code into more manageable chunks (a sketch of how these jobs might be orchestrated follows the list):

  • Data preparation
  • Feature creation
  • Hyperparameter tuning
  • Final training
  • Metrics calculation
  • Model evaluation

You could potentially combine these, but be considerate of the other inhabitants of the cluster, and be mindful of the resource needs of different parts of your workflow.
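
As a sketch of what this orchestration might look like, here is a minimal Airflow DAG that runs three of the jobs above in sequence. The DAG id, schedule, and spark-submit scripts are hypothetical names standing in for your own jobs.

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash import BashOperator

    # Each task submits one of the smaller jobs as its own Spark application,
    # so each step can request only the resources it actually needs.
    with DAG(
        dag_id="nlp_batch_pipeline",
        start_date=datetime(2024, 1, 1),
        schedule_interval="@daily",
        catchup=False,
    ) as dag:
        prepare = BashOperator(
            task_id="data_preparation",
            bash_command="spark-submit prepare_data.py",
        )
        features = BashOperator(
            task_id="feature_creation",
            bash_command="spark-submit create_features.py",
        )
        evaluate = BashOperator(
            task_id="model_evaluation",
            bash_command="spark-submit evaluate_model.py",
        )

        prepare >> features >> evaluate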

Another important aspect is monitoring your pipelines and the data they consume and produce. Many “mysterious” failures turn out to be due to a strange document, an empty document, or a document that is three hundred times larger than normal. You should log information from your pipelines, although sometimes this creates big data of its own. Unless you are trying to debug a pipeline, you do not need to output information for every document or record you process. Instead, you can at least track minima, means, and maxima. The basic values that should be tracked are document size and processing time. If you implement this, you have a quick first step for triaging problems.
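
A first pass at this kind of monitoring can be a few aggregations at the end of a batch job. The sketch below assumes a plain-text input at a placeholder path and logs document-size statistics plus wall-clock time.

    import time

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("pipeline-monitoring").getOrCreate()
    docs = spark.read.text("hdfs:///data/documents")  # placeholder path

    start = time.time()
    stats = docs.agg(
        F.count("*").alias("num_docs"),
        F.min(F.length("value")).alias("min_chars"),
        F.avg(F.length("value")).alias("mean_chars"),
        F.max(F.length("value")).alias("max_chars"),
    )
    stats.show()
    print("batch processing time (s):", round(time.time() - start, 2))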

Demoing NLP-Based Applications

Properly demoing an NLP-based application is as much a matter of technical skills as communication skills. When you are showing your work to the product owner and the stakeholders, you need to be prepared to explain the application from three perspectives: software, data science, and linguistics. When you are building your demo, you should try to “break” the system by finding data and language edge cases that produce poor-looking results. Sometimes these results are reasonable, but to someone without a technical understanding of such systems they look ridiculous. If the client finds an example like this, it can derail the whole demo. If you find one beforehand, you should prepare an explanation of either why this is the correct result given the data or how you will fix it. Sometimes “fixing” a problem like this is more aesthetic than technical. This is why you should be considering the user experience from the beginning of the project.

Because these apparently bad, but statistically justified, examples can be embarrassing, it can be tempting to cherry-pick examples. This is not only unethical but also merely defers their discovery to production, which would be worse. Even if the problem is not a “real” problem, you should be as upfront as possible. The intersection of people who know software engineering, data science, and linguistics is small, so the stakeholders may well have difficulty understanding the explanation. If the problem is found after the application has been fully deployed, your explanation will be met with extra skepticism.

As with any application, the work doesn’t end with deployment. You will need to monitor the application.

Checklists

Consider the questions in each of these checklists.

Conclusion

In this chapter, we talked about the final steps needed before your NLP application is used. However, this is not the end. You will likely think of ways to improve your processing, your modeling, your testing, and everything else about your NLP application. The ideas talked about in this chapter are starting points for improvement. One of the hardest things in software development is accepting that finding problems and mistakes is ultimately a good thing. If you can’t see a problem with a piece of software, that means you will eventually be surprised.

In this book, I have talked about wearing three hats—software engineer, linguist, and data scientist—and have discussed the need to consider all three perspectives when building an NLP application. That may seem difficult, and it often is, but it is also an opportunity to grow. Although there are statistically justifiable errors that can be difficult to explain, when an NLP application does something that makes intuitive sense it is incredibly rewarding.

There is always a balance between needing to add or “fix” one more thing and wanting to push it out into the world. The great thing about software engineering, data science, and sciences like linguistics is that you are guaranteed to have mistakes in your work. Everyone before you made mistakes, as will everyone after you. What is important is that we fix them and become a little less wrong.

Thank you for reading this book. I am passionate about NLP and all three disciplines that inform it. I know I have made mistakes here, and I hope to get better in time.

Good luck!