In this book we have covered many of the approaches and techniques we can use to build an NLP application, and we have discussed how to plan and develop one. Now, let's talk about deploying NLP applications.
We will also talk about deploying models in production environments. Before we talk about how to deploy the models, we need to know our product's requirements. Whether the model will be used in a batch process or by a web service that evaluates individual requests changes how we want to deploy it. We also need to know what kind of hardware the models will require. Some of the things we discuss here, such as the hardware available in production, should be considered before modeling has even begun.
The easiest situation is where your application is running as a batch process on an internal cluster. This means that your performance requirements are based only on internal users (in your organization), and securing the data will also be simpler. But not everything is this simple.
Another important part of deploying a production-quality system is making sure that the application works fast enough for user needs without taking up too many resources. In this chapter we will discuss how to optimize the performance of your NLP system. First, we need to consider what we want to optimize.
When people talk about performance testing, they generally mean testing how long the program takes to run and how much memory it uses. The wide variance in document size can make performance testing NLP-based applications more difficult. Additionally, annotation frameworks like Spark NLP can produce many times more data than is input, so optimizing disk usage is important as well. Spark NLP is a distributed framework, so you should also consider its performance as a distributed system.
Distributed systems need to take into account all the performance requirements of individual machines and make sure that the cluster is used efficiently. This means that you are not locking up resources unnecessarily and are using what is allocated to your process.
Even once the application is in production, there is still work to be done. We need to monitor the performance of the software and the model. In this chapter, we will talk about what we need to do when we want to take our application live.
The first step in taking any application live is making sure that the product owner and stakeholders are satisfied. For some applications this will be as simple as demonstrating the functionality in the requirements. With NLP-based applications, this can be more difficult, because intuitions about how NLP works are often wrong. This is why testing is so important.
The checklist for this chapter is much larger than for the others in this part of the book. This is because deployment of NLP applications can be very complicated. It may seem overwhelming, but we can also use the answers to these questions to get a clearer scope of our project.
Let’s start with model deployment.
We’ve used pretrained models from Spark NLP in several chapters of this book. These models are stored in a local cache. Pretrained pipelines are stored here, as are models for individual steps in a pipeline, as well as TensorFlow models. We’ve used Keras in this book when exploring neural networks. Keras, however, is a high-level API for neural network libraries. TensorFlow is the framework that performs the actual computation. The TensorFlow models are a different animal, though, because they are required on the worker machines and not just the driver. Spark NLP will handle setting up this cache for you as long as those machines have internet access. If you do not have internet access, you can put the files in shared storage, like HDFS, and modify your code to load from that location.
This model cache requires access to persistent disk storage. Most deployment scenarios meet this requirement, but deploying on something like AWS Lambda, for example, would be a poor fit.
Generally, Spark is not a good solution for real-time NLP applications. Although the cache improves performance, there is a minimum overhead to running anything through Spark. Where available, you can use Spark NLP light pipelines, which run a fitted pipeline locally, outside of Spark's distributed execution, but you should test performance before deploying in any external scenario.
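A minimal sketch of a light pipeline, assuming `pipeline_model` is a fitted PipelineModel you have already built or loaded:

```python
from sparknlp.base import LightPipeline

# Wrap the fitted pipeline so it can run on the driver without distributing
# the work, avoiding most of Spark's per-job overhead for single documents.
light = LightPipeline(pipeline_model)

result = light.annotate("The quick brown fox jumps over the lazy dog.")
print(result)
```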
Another thing to consider is the availability of memory in your production environment. Spark NLP uses RocksDB as an in-memory key-value store for static embeddings. You should make sure that your environment can support this memory load. If you are using Spark, then it is almost certainly the case that you have enough memory for the embeddings.
We’ve talked about how Spark NLP accesses models; now let’s talk about how it integrates with TensorFlow.
TensorFlow is implemented in C++ and CUDA, although most data scientists use it from its Python interface. Spark NLP is implemented in Scala and runs on the JVM, although we have also been using it from its Python interface. Spark NLP interfaces with TensorFlow through TensorFlow's Java interface. This requires that TensorFlow be installed on any machine that will use these models. Unfortunately, this means we have a dependency outside our JAR file. It is less of an issue if you are using the Python Spark NLP package, because it has TensorFlow as a dependency. This dependency requires that you be able to install this software on all production machines running your application. You should also note whether you will be using a GPU, since the TensorFlow dependency for the GPU is different.
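One way to check this dependency before going live is to run a small job that imports TensorFlow on the executors, not just the driver. This is only a sketch, assuming a running SparkSession named `spark`; spreading the work over many tasks makes it likely, though not guaranteed, that every executor is touched.

```python
def tf_version(_):
    import tensorflow as tf  # raises ImportError on workers missing TensorFlow
    return tf.__version__

versions = (
    spark.sparkContext
    .parallelize(range(100), 100)  # many small tasks spread across executors
    .map(tf_version)
    .distinct()
    .collect()
)
print("TensorFlow versions found on the cluster:", versions)
```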
The reason GPUs can improve training time so much is that they are built for batches of parallel processing, which is great for matrix operations. However, not all machines have appropriate hardware for this, so enabling GPU support for your project may require an additional investment. If you are training on your development machine, there are common video cards that are good enough for some simple GPU training. Since training is much more computationally intensive than serving a model, you may need GPU support only for training. Some models, however, are complex enough that evaluating them on a CPU is prohibitively slow. If you are planning to use such a complex model, you need to coordinate with the team handling hardware infrastructure. They will need to requisition the machines, and you will need to do performance testing to make sure that you can serve the model in an appropriate amount of time.
Now that we have talked about the deployment considerations specific to Spark NLP, let’s discuss deployment of a composite system.
An important aspect of optimizing Spark-based programs, and therefore Spark NLP-based programs, is persistence. To talk about persistence, let’s review how Spark organizes work.
When you have a reference to a DataFrame, it does not necessarily refer to actual data on the cluster, since Spark is lazy. This means that if you load data and perform some simple transformations, like changing strings to lowercase, no data will be loaded or transformed. Instead, Spark builds an execution plan. As you add more instructions to this execution plan, it forms a directed acyclic graph (DAG). When you request data from the DataFrame, Spark creates a job. The job is split into stages, which are the sequences of processing steps necessary to produce the data for the object you have a reference to. These stages are then split into tasks, one for each partition, that are distributed to the executors. The executors will run as many tasks as they have processors for.
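As a small illustration, assuming a SparkSession `spark` and an input path that exists in your environment, nothing below does any work until the final action:

```python
from pyspark.sql import functions as F

df = spark.read.text("hdfs:///data/documents.txt")          # no data read yet
lowered = df.withColumn("value", F.lower(F.col("value")))   # still no work done

# Only an action triggers a job, which Spark splits into stages and tasks.
print(lowered.count())
```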
When you persist a DataFrame, that will cause Spark to store the actual data once it is realized. This is useful when you will be reusing a particular set of data. For example, when you train a logistic regression model, there will be iterations over the data. You don't want Spark to reload from disk for each iteration, so you should persist the DataFrame containing the training data. Fortunately, you don't need to do this yourself, because it is implemented in the training code for logistic regression.
There are parameters that control how your data is persisted. The first is whether to use disk. If you persist to disk you will have more space, but reloading will be much more time-consuming. The second parameter is whether to use memory. You must use disk or memory, or you can choose both. If you choose both, Spark will store what it can in memory and "spill" to disk if necessary. You can also choose to use off-heap memory. In Java, there are two parts to the memory. The heap, or on-heap memory, is where Java objects are stored; the JVM garbage collector works on the heap. The other part is off-heap memory, where Java stores classes, threads, and other data used by the JVM. Persisting data in the off-heap space means that you are not restricted to the memory allocated to the JVM. This can be dangerous, since the JVM does not manage or limit this space. If you take up too much heap memory, your program will get an OutOfMemoryError; if you take up too much off-heap memory, you could potentially bring down the machine.
Apart from configuring where you store your persisted data, you can also decide whether to serialize it. Storing serialized data can be more space-efficient, but it will be more CPU-intensive. The last parameter is replication. This will cause the data to be replicated on different workers, which can be useful if a worker fails.
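In PySpark, these choices map onto `StorageLevel` values. Here is a sketch, assuming `df` is a DataFrame you intend to reuse, such as training data:

```python
from pyspark import StorageLevel

df.persist(StorageLevel.MEMORY_AND_DISK)    # keep in memory, spill to disk
# df.persist(StorageLevel.MEMORY_ONLY)      # memory only
# df.persist(StorageLevel.DISK_ONLY)        # disk only
# df.persist(StorageLevel.OFF_HEAP)         # off-heap memory; use with care
# df.persist(StorageLevel.MEMORY_ONLY_2)    # replicated on two workers

df.count()      # an action materializes the data so later reuse hits the cache
df.unpersist()  # release the storage when you no longer need the data
```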
Persisting will help us avoid redoing work unnecessarily, but we also want to make sure that we do the work efficiently. If your partitions are too large, executors will not be able to process them. You could add more memory to the executors, but this leads to poor CPU utilization: if your workers have multiple cores but most of the memory goes to processing one partition on one core, the other cores are wasted. Instead, you should try to reduce the size of your partitions. However, you do not want to go to the other extreme. Each partition carries overhead, and Spark may need to shuffle the data, so having too many small partitions makes aggregations and group-by operations very inefficient. Ideally, each partition should be about 200 MB in size.
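As a rough sketch, you can size the number of partitions from an estimate of your input size; the numbers below are assumptions for illustration, not recommendations.

```python
target_partition_bytes = 200 * 1024 * 1024      # the ~200 MB target
estimated_input_bytes = 50 * 1024 ** 3          # assume roughly 50 GB of text

# Repartition so each partition lands near the target size.
num_partitions = max(1, estimated_input_bytes // target_partition_bytes)
df = df.repartition(num_partitions)
```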
The Spark developers are constantly working on new ways to improve performance, so you should check the programming guides in each version to see if there are new ways to optimize your application.
Now that we have talked about how to optimize Spark operations, let’s talk about some design-level considerations to improve performance.
When you are designing your NLP application, you should consider how to divide your pipelines into manageable pieces. It may be tempting to have a single über-pipeline, but this causes several problems. First, having everything in one job makes the code harder to maintain. Even if you organize the code into a maintainable structure, errors at runtime will be harder to diagnose. Second, it can make the design of your job inefficient. If your data extraction is memory intensive but your batch model evaluation is not, then you are taking up unnecessary resources during evaluation. Instead, you should have two jobs: data extraction and model evaluation. You should use the job orchestrator of your cluster (Airflow, Databricks job scheduler, etc.). If your application loads data and runs the model as a batch job, here is a list of potential jobs you can create to break your code into more manageable chunks:
You could potentially combine these, but be considerate of the other inhabitants of the cluster, and be mindful of the resource needs of different parts of your workflow.
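As a sketch of how such a split might be orchestrated, here is a hypothetical Airflow DAG with separate data-extraction and model-evaluation jobs. The DAG name, schedule, and spark-submit commands are placeholders; any orchestrator your cluster already uses works just as well.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG("nlp_batch_pipeline", start_date=datetime(2024, 1, 1),
         schedule_interval="@daily", catchup=False) as dag:
    extract = BashOperator(
        task_id="data_extraction",
        bash_command="spark-submit --driver-memory 8g extract_documents.py",
    )
    evaluate = BashOperator(
        task_id="model_evaluation",
        bash_command="spark-submit --driver-memory 4g evaluate_model.py",
    )

    # Evaluation runs only after extraction succeeds, and each job can request
    # resources appropriate to its own workload.
    extract >> evaluate
```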
Another important aspect is monitoring your pipelines and the data that they consume and produce. There have been many “mysterious” failures that are due to a strange document, an empty document, or a document that is three hundred times larger than normal. You should log information from your pipelines. Sometimes, this creates big data of its own. Unless you are trying to debug a pipeline, you do not need to output information for each document or record what you are processing. Instead, you can at least track minima, means, and maxima. The basic values that should be tracked are document size and processing time. If you implement this, then you have a quick first step to triage problems.
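A minimal sketch of such logging, assuming a DataFrame `df` with a "text" column; processing time per batch can be logged the same way by timing each job.

```python
import logging

from pyspark.sql import functions as F

stats = df.agg(
    F.count("*").alias("docs"),
    F.min(F.length("text")).alias("min_len"),
    F.avg(F.length("text")).alias("mean_len"),
    F.max(F.length("text")).alias("max_len"),
).first()

# Log one compact line per batch instead of one line per document.
logging.getLogger("nlp_pipeline").info(
    "docs=%d min_len=%d mean_len=%.1f max_len=%d",
    stats["docs"], stats["min_len"], stats["mean_len"], stats["max_len"],
)
```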
There are a variety of profiling tools available for examining performance. These tools each have a context in which they should be used. Let’s look first at the Java Microbenchmark Harness.
The Java Microbenchmark Harness (JMH) is a JVM framework for testing atomic (or nearly atomic) operations in your code. For example, if you are using a custom data structure, you can use the JMH to test inserts and retrievals. The JMH is more useful for testing library code than application code, because most application code relies on a number of library functions and so is not atomic. It is not a monitoring tool, though; it works by compiling a separate program that runs (and reruns) parts of your code.
VisualVM is a free profiler for JVM applications. It allows you to track the number of instances of classes created, as well as the time spent in methods. If you find a performance problem, this is a great tool for investigating it. One downside is that it really requires that you be able to run your application on one machine. VisualVM runs an application that inspects your application's JVM, so it can negatively impact performance.
If you want to monitor NLP applications, Ganglia is an application I’m fond of. Ganglia allows you to view CPU and memory utilization in a cluster in essentially real time. Many modern resource managers, like Mesos and Kubernetes, have similar functionality baked in. Ganglia, or the resource monitoring available from resource managers, is a must-have if you need your application to run reliably.
Now that we know how we will examine the resources used by our application, we need to think about the data that our application consumes and produces.
There are three kinds of data used in NLP applications. There is input data, which is the data your application processes. Examples of input data include a corpus of documents, an individual document, or a search query. There is output data, which is the data your application produces. Examples of output data include a directory of serialized, annotated documents, a document object containing the annotations, or a list of documents and relevance scores. The third kind is configuration data, which your application uses to do its work. Examples of configuration data include trained models, lemma dictionaries, and stop-word lists.
When you are building NLP applications, as with any software application, you want to develop your software tests first. Test-driven development is a great way to state your expected behaviors in code before you start writing the actual application code. However, in my experience, this is rarely done. Test-driven development can be difficult if you don’t have a clear idea of how the product will work or you need to show results immediately. If you have to write tests as you are writing your code, you run the risk of writing tests that are built to pass—not test—the code. It is always better to have a test plan before you’ve built your application. Let’s look at the different kinds of tests.
The most well-known kind of test is the unit test. A unit test checks a single unit of functionality in your code. For example, if you have a phrase chunker that uses a helper function to extract the POS tags, you shouldn't write separate tests for the helper function; that helper function is not a separate piece of functionality, it is part of the chunker's chunking functionality. Unit tests should require only the code they are testing. They should not require network resources, production data, other services, and so on. If your code does assume these things, you will need to mock them. Mocking is a technique that creates a façade of the components you need. With data, you can either take the smallest necessary sample of real data or create a small amount of fake data.
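Here is a minimal sketch of such a unit test; the `PhraseChunker` class, its module, and its `pos_tagger` dependency are hypothetical stand-ins for your own code.

```python
from unittest.mock import MagicMock

from myapp.chunking import PhraseChunker  # hypothetical module under test


def test_chunker_groups_adjacent_nouns():
    # Mock the POS tagger so the test does not depend on a real model,
    # network resources, or production data.
    fake_tagger = MagicMock()
    fake_tagger.tag.return_value = [("red", "JJ"), ("wine", "NN"), ("glass", "NN")]

    chunker = PhraseChunker(pos_tagger=fake_tagger)

    assert chunker.chunk("red wine glass") == ["red wine glass"]
```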
Once you have built a component of a system, you will need to make sure that it works with other components. This is called integration testing. Often, integration tests are implemented in unit-testing frameworks, so they may be mistakenly referred to as "unit" tests. Let's look at our research paper classifier project for a hypothetical example of an integration test. If we will be integrating with an existing system that is used for submitting papers to the university's database, we will need two sets of integration tests. The first set will test how the classifier service integrates with the code that manages the database. The second set will test how the classifier integrates with the UI of the paper submission system.
You will also need tests that help us determine whether the system, overall, does what we expect. There are generally two kinds of tests like this. The first is the smoke test. A smoke test exercises as much of the code as possible to find out whether there are any show-stopper problems. The metaphor comes from testing plumbing: if you want to find a leak in a septic system, you can pump smoke into the pipes, and where you see smoke rising is where the problem is. You generally want smoke tests that cover the major uses of your system. The other kind of overall system test is the sanity test, which is used to make sure that the system works with "routine" inputs. If the smoke tests or sanity tests fail, you should not deploy. If you find a bug in the system after it is deployed, you should turn its reproduction steps into a future smoke test. These are called regression tests, and they help keep us from accidentally reintroducing bugs.
Once you have the functional testing in place, you can look at testing other aspects of the system. Let's start with performance testing. Previously in this chapter, we discussed ways to optimize the performance of our application. To know whether a new version of the application will introduce performance problems, you should have automated tests. This is something you use in combination with performance monitoring. If performance is not an important requirement, then it is reasonable to skip this kind of test. Sometimes, performance tests are skipped because of the expense of creating a production-like environment. Instead of skipping them entirely, you should change the scale of your performance tests. While it is true that you cannot test the application's overall performance without a production-like environment, you can still test the performance of components. This won't catch global problems, but a local performance test can tell you whether a particular component or function has a performance-worsening bug. Using the profiling tools we discussed earlier in this chapter, you can find hotspots in your code, the areas that take the most time or memory. Local performance tests should be focused on these areas.
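A local performance test can be as simple as timing a hotspot and asserting against a budget. The function, the document size, and the 50 ms budget below are hypothetical; in practice you would choose them from your own profiling results.

```python
import time

from myapp.preprocessing import normalize_document  # hypothetical hotspot


def test_normalize_document_stays_within_budget():
    document = "some representative text " * 2_000   # roughly a typical document
    start = time.perf_counter()
    for _ in range(100):
        normalize_document(document)
    elapsed = (time.perf_counter() - start) / 100

    # Fail the build if the component gets dramatically slower.
    assert elapsed < 0.05
```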
Another vital kind of nonfunctional testing is usability testing. If your application is simply a batch process, then there is not much need for this kind of test. However, if your application has real end users, like customers, you should definitely do it. When people work with NLP-based systems, there can sometimes be inflated expectations. For example, let's say we are building a sentiment analysis tool that predicts a sentiment and highlights the terms that led to the prediction. The model may identify words that make sense statistically for the corpus it was trained on, but that a human may consider silly. To find these inflated expectations, you should find test users who are as similar as possible to the actual users. If you use only colleagues who are familiar with software, they may have a more realistic understanding of what the system can do, and so they won't surface the expectations your real users will bring. If you find that users have these inflated expectations, you should consider how you can modify the user experience to better set expectations.
Since intuitions are not infallible, especially when it comes to language, we need to make sure that we test our assumptions. However, testing stakeholder assumptions is harder. There is another test, of sorts, that the application needs to pass—the demo.
Properly demoing an NLP-based application is as much a matter of technical skills as communication skills. When you are showing your work to the product owner and the stakeholders, you will need to be prepared to explain the application from the three NLP perspectives: software, data science, and linguistics. When you are building your demo, you should try to "break" the system by finding data and language edge cases that produce poor-looking results. Sometimes these results are reasonable, but to someone without a technical understanding of such systems they look ridiculous. If the client finds an example like this, it can derail the whole demo. If you find one beforehand, you should prepare an explanation of either why this is the correct result given the data or how you will fix it. Sometimes "fixing" a problem like this is more aesthetic than technical. This is why you should be considering the user experience from the beginning of the project.
Because these apparently bad, but statistically justified, examples can be embarrassing, it can be tempting to cherry-pick examples. This is not only unethical but also pushes the discovery of such examples into production, which would be worse. Even if the problem is not a "real" problem, you should try to be as upfront as possible. The intersection of people who know software engineering, data science, and linguistics is small, so the stakeholder may very well have difficulty understanding the explanation. If the problem is found after the application has been fully deployed, your explanation will be met with extra skepticism.
As with any application, the work doesn’t end with deployment. You will need to monitor the application.
Consider the questions in each of these checklists.
Spark NLP model cache checklist
The Spark NLP and TensorFlow integration checklist
Profiling tools
Testing NLP-based applications
Demoing NLP-based applications
In this chapter, we talked about the final steps needed before your NLP application is used. However, this is not the end. You will likely think of ways to improve your processing, your modeling, your testing, and everything else about your NLP application. The ideas talked about in this chapter are starting points for improvement. One of the hardest things in software development is accepting that finding problems and mistakes is ultimately a good thing. If you can’t see a problem with a piece of software, that means you will eventually be surprised.
In this book, I have talked about wearing three hats—software engineer, linguist, and data scientist—and have discussed the need to consider all three perspectives when building an NLP application. That may seem difficult, and it often is, but it is also an opportunity to grow. Although there are statistically justifiable errors that can be difficult to explain, when an NLP application does something that makes intuitive sense it is incredibly rewarding.
There is always a balance between needing to add or "fix" one more thing and wanting to push the work out into the world. The great thing about software engineering, sciences like linguistics, and data science is that you are guaranteed to have mistakes in your work. Everyone before you made mistakes, as will everyone after you. What is important is that we fix them and become a little less wrong.
Thank you for reading this book. I am passionate about NLP and about all three disciplines that inform it. I know I have made mistakes here, and I hope to get better in time.