When building an NLP system, the first question you should answer is what language or languages you will support. This can affect everything from data storage to modeling to the user interface. In this chapter, we will talk about what you want to consider if you are productionizing a multilingual NLP system.
At the end of the chapter, we will have a checklist of questions to ask yourself about your project.
When supporting multiple languages, one way you can manage complexity is by identifying commonalities between your expected languages. For example, if you are dealing with only Western European languages, you know that you need to consider only the Latin alphabet and its extensions. Also, you know that all the languages are fusional languages, so stemming or lemmatizing will work. They also have similar grammatical gender systems: masculine, feminine, and maybe an inanimate neuter.
Let’s look at a hypothetical scenario.
In this scenario, your inputs will be text documents, PDF documents, or scans of text documents. The output is expected to be JSON documents with text, title, and tags. The languages you will be accepting as input are English, French, German, and Russian. You have labeled data, but it is from only the last five years of articles. This is when the publisher started requiring that articles be tagged during submission. The initial classifications can be at the departmental level—for example, mathematics, biology, or physics. However, we need to have a plan to support subdisciplines.
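To make the expected output concrete, here is a minimal sketch of one such JSON document. The `text`, `title`, and `tags` fields come from the scenario; the `ProcessedDocument` name and the extra `language` field are assumptions added for illustration, since downstream steps in a multilingual system will usually need to know the language.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class ProcessedDocument:
    """One output record; the class name is hypothetical."""
    title: str
    text: str
    tags: List[str] = field(default_factory=list)
    language: str = "und"  # assumption: not part of the stated output format


def to_json(doc: ProcessedDocument) -> str:
    # ensure_ascii=False keeps Cyrillic and accented characters readable
    return json.dumps(asdict(doc), ensure_ascii=False)


example = ProcessedDocument(
    title="Группы",
    text="Текст статьи",
    tags=["mathematics"],
    language="ru",
)
print(to_json(example))
```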
First, we should consider the different input formats. We need to have OCR models for converting the images and PDF documents to text. We saw in Chapter 16 that we can use tools like Tesseract for this. We can use the text files to create a data set for training if we cannot find a satisfactory model. Some of the scanned documents will have issues. For example, the document may not have been well aligned with the scanning bed, so the text is skewed. The document may have aged, so the text is eroded. Our model will need to accommodate this. So, we will need to find some way of generating eroded and skewed text to support the scanned documents. We could transcribe some documents, but that is very labor intensive. You can try to make the transcription process easier by breaking it into pieces. Another complication to transcription is that if you use transcribers who do not know the language, they will be much more error prone.
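One way to generate such training data is to distort clean page images synthetically. The sketch below is only an illustration of the idea, not a production augmentation pipeline: it represents a page as a small bitmap of 0s and 1s, approximates skew with a horizontal shear, and approximates erosion by randomly dropping ink pixels. The function names and parameters are hypothetical, and a real system would work on rendered page images with an imaging library.

```python
import random
from typing import List

Bitmap = List[List[int]]  # 1 = ink, 0 = background


def shear(bitmap: Bitmap, slope: float) -> Bitmap:
    """Shift each row sideways in proportion to its row index.

    This approximates a page that was misaligned on the scanner.
    The output is wider than the input so that no ink is lost.
    """
    height = len(bitmap)
    max_shift = int(round(abs(slope) * (height - 1)))
    sheared = []
    for y, row in enumerate(bitmap):
        shift = int(round(abs(slope) * y))
        if slope < 0:
            shift = max_shift - shift  # lean the other way
        sheared.append([0] * shift + list(row) + [0] * (max_shift - shift))
    return sheared


def erode(bitmap: Bitmap, drop_prob: float, seed: int = 0) -> Bitmap:
    """Turn each ink pixel into background with probability drop_prob."""
    rng = random.Random(seed)  # seeded so the augmentation is reproducible
    return [
        [px if px == 0 or rng.random() >= drop_prob else 0 for px in row]
        for row in bitmap
    ]


clean = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
degraded = erode(shear(clean, slope=1.0), drop_prob=0.2)
```

Because the distortions are applied to pages whose text is already known, each degraded image comes with a correct transcription for free.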
Second, we want to consider the classification task. For the initial model, we could potentially use lexical features and a simple linear classifier. Separating papers at the department level is very doable using just keywords. You can use experts to generate these keywords, or you can find them from vocabulary analysis. You will still want to review the keywords with domain experts who are fluent in these languages. In the future, simple lexical features will be less useful for separating subdisciplines, especially niche subdisciplines that may not have many unique keywords. In this situation, we may want to move on to a more sophisticated model. First, we can start with a bag-of-words with a more complex model or go straight to an RNN. Either way, we must structure our code so that we can support different preprocessing and modeling frameworks.
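A minimal sketch of such a first iteration follows. It scores each department by counting keyword hits, which amounts to a linear model with unit weights standing in for learned ones. The keyword lists, the class name, and the tokenizer are hypothetical placeholders; in practice the keywords would come from experts or vocabulary analysis. The tokenizer is passed in as a parameter so that preprocessing can be swapped without touching the model.

```python
import re
from typing import Callable, Dict, List, Set

Tokenizer = Callable[[str], List[str]]

# Hypothetical keyword lists covering the four languages in the scenario
DEPARTMENT_KEYWORDS: Dict[str, Set[str]] = {
    "mathematics": {"theorem", "théorème", "satz", "теорема", "group", "gruppe"},
    "biology": {"cell", "cellule", "zelle", "клетка", "protein"},
    "physics": {"quantum", "quantique", "квантовый", "energy", "energie"},
}


def simple_tokenize(text: str) -> List[str]:
    # \w matches Unicode letters in Python 3, so Cyrillic and accents work
    return re.findall(r"\w+", text.lower())


class KeywordClassifier:
    """Counts keyword hits per label and predicts the highest-scoring label."""

    def __init__(self, keywords: Dict[str, Set[str]],
                 tokenize: Tokenizer = simple_tokenize):
        self.keywords = keywords
        self.tokenize = tokenize

    def scores(self, text: str) -> Dict[str, int]:
        tokens = self.tokenize(text)
        return {
            label: sum(1 for token in tokens if token in words)
            for label, words in self.keywords.items()
        }

    def predict(self, text: str) -> str:
        scores = self.scores(text)
        return max(scores, key=scores.get)  # ties go to the first label


classifier = KeywordClassifier(DEPARTMENT_KEYWORDS)
print(classifier.predict("Der Satz über die Gruppe"))  # mathematics
```

When subdisciplines require a stronger model, only the class behind `predict` needs to change, as long as the replacement keeps the same interface.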
Before we discuss model building in a project, we need to determine how we’re going to process the text. We’ve already talked about some common considerations of tokenization in Chapters 2 and 5. Most writing systems use a space as a word separator; some use other symbols, and some don’t separate words at all. Another consideration is word compounding.
Word compounding is when we combine two words into one. For example, “moonlight” is a combination of two words. In some languages, like German, this is more common. In fact, it is common enough in German that a word splitter is a common text-processing technique for German. Consider the word “Fixpunktgruppe” (“fixed point group”), which is a mathematical term for a special kind of algebraic structure. If we wanted to find all group structures mentioned in a document, we would need to have the “gruppe” separated. This could potentially be useful in languages that have more productive suffixes.
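As a rough illustration of what a word splitter does, the sketch below tries to cover a word completely with entries from a known vocabulary, preferring longer parts first and backtracking when a choice leads nowhere. The vocabulary is a hypothetical toy example, and the approach ignores German linking elements (such as the "s" in "Arbeitszimmer"), which a real splitter would need to handle.

```python
from typing import List, Optional, Set


def split_compound(word: str, vocabulary: Set[str],
                   min_len: int = 3) -> Optional[List[str]]:
    """Split `word` into vocabulary entries, or return None if impossible."""

    def helper(rest: str) -> Optional[List[str]]:
        if not rest:
            return []
        # Try the longest candidate prefix first, then shorter ones
        for end in range(len(rest), min_len - 1, -1):
            head = rest[:end]
            if head in vocabulary:
                tail = helper(rest[end:])
                if tail is not None:
                    return [head] + tail
        return None

    return helper(word.lower())


vocabulary = {"fix", "punkt", "gruppe"}  # hypothetical toy vocabulary
parts = split_compound("Fixpunktgruppe", vocabulary)
print(parts)  # ['fix', 'punkt', 'gruppe']
```

With the parts available, a search for "gruppe" can match the compound even though the full word was never indexed separately.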
In English, it is as common to borrow a word as it is to add a prefix or suffix to create a new word. For example, we use the word “tractor” for a machine that is used for pulling—“tractor” is simply the Latin word for “puller.” In some other languages, borrowing is less common, like in Greek, Icelandic, and Mandarin. In these languages, we may want to consider splitting these words into their component morphemes. This can be especially important for languages in which compound words might not be compounded in all contexts. These separable words are similar to some phrasal verbs in English. A phrasal verb is a verb like “wake up.” The “up” particle can be separated from the verb, or not.
I woke up the dog.
I woke the dog up.
However, some objects require separation.
*I woke up her.
I woke her up.
The German translation, “aufwecken,” separates its prefix from the verb in finite forms.
den Hund aufzuwecken ["to wake the dog up"]
Ich weckte den Hund auf ["I woke the dog up"]
Because these derived words often have very distinct meanings from their base, we may not need to deal with them. In document-level work—for example, document classification—it is unlikely that these words will affect the model. You are more likely to need to deal with this in search-based applications. I recommend not dealing with this in your first iteration and monitoring usage to see if compound words are commonly searched for.
In Chapter 2, we talked about the different ways languages combine morphemes into words. Analytic languages, like Mandarin, tend to use particles to express things like the past tense. Meanwhile, synthetic (or agglutinative) languages, like Turkish, have systems of affixes for expressing a noun’s role in a sentence, tense, prepositions, and so on. In between these two are fusional languages, like Spanish, that don’t have as many possible word forms as synthetic languages do but have more than analytic languages. For these different types of morphologies there are trade-offs when considering stemming versus lemmatization.
The more possible word forms there are, the more memory will be required for lemmatization. Also, some fusional languages are more regular than others. The less regular the language, the more difficult the stemming algorithm will be. For example, Finnish nouns can have up to 30 different forms. This means that there will need to be 30 entries for each noun. Finnish verbs are much more complex. This means that if you have a one-million-word vocabulary, you will need well in excess of 30 million entries.
Analytic languages can use either stemming or lemmatization, or even neither. Mandarin likely does not need such processing. English, which is a language in transition from fusional to analytic, can use either. There are few enough forms that lemmatization is feasible, and stemming is not too difficult. Let’s look at a regular verb in English (also called a weak verb). The verb “call” has the forms “call,” “calling,” “called,” and “calls.” Nouns are even simpler in English: there are only two forms (singular and plural). The rules for determining the forms of nouns and regular verbs are also straightforward enough to build a lemmatizer for.
Synthetic languages, like Finnish, are often quite regular, so stemming algorithms are straightforward. For fusional languages you can potentially use a combined approach. Irregular forms are more common in the most frequently used words. So you can use lemmatization for the most common words and use stemming as a fallback.
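The combined approach can be sketched as a dictionary lookup followed by a rule-based fallback. The example below uses English for readability. The lemma table and suffix list are tiny hypothetical samples, and the suffix stripper is deliberately naive; a real system would use a full lemma dictionary for frequent words and an established stemming algorithm for the rest.

```python
# Hypothetical lemma table for frequent irregular words
IRREGULAR_LEMMAS = {
    "went": "go",
    "gone": "go",
    "was": "be",
    "were": "be",
    "is": "be",
    "children": "child",
    "mice": "mouse",
}

# Checked in order, so longer suffixes come first
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(token: str) -> str:
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def normalize(token: str) -> str:
    """Use the lemma table when possible; otherwise fall back to stemming."""
    token = token.lower()
    if token in IRREGULAR_LEMMAS:
        return IRREGULAR_LEMMAS[token]
    return naive_stem(token)


print([normalize(t) for t in ["call", "calling", "called", "calls", "went"]])
```

The lemma table stays small because it only has to cover the irregular forms, which are concentrated in the most frequent words.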
One of the ideas behind embeddings and transfer learning is that the neural network is learning higher-level features from the data. These features can be used to take a model, or part of a model, trained on one data set and use it on a different data set or different problem altogether. However, we must be mindful of how different the data is. If the differences between the English on Twitter and in medical records are enough to reduce transferability, imagine how much is lost in translation between English and another language. That being said, if you are looking to build a model with a better than random starting point, you should experiment with transferability. This makes more sense for some problems than for others. For example, in our scenario earlier, the classification of academic documents is going to be dependent on technical terms that may have similar distributions in all of our languages. This means that transferability might be helpful—it would certainly be worth experimenting with. On the other hand, if we are building a model that processes medical records from different countries, transferability across language will likely be less useful. Not only do the underlying phenomena differ (different common ailments in different places), but also the regulatory requirements on documentation differ. So the documents differ not only in language but also in content and purpose.
Word embeddings are a general enough technique that there is hope for transferability. This is still a topic of research. The idea is that although word frequencies may differ for equivalent words, the distribution of concepts is more universal. If this is so, perhaps we can learn a transformation from the vector space of one language to another that preserves relationships between the semantic content.
One way of doing this is to learn a transformation based on reference translations. Let’s say we have two languages, L1 and L2. We take a list of words from L1, with their translations in L2. Each of these reference words will be mapped to a point in the vector space for L2. So let’s say that L1 is Latin, and L2 is English. The word “nauta” has the vector w in the Latin vector space, and v in the English vector space after transformation. The English equivalent “sailor” has the vector u. We can define the error of the transformation for that word by looking at the Euclidean distance between u and v. The transformation that minimizes this distance across the reference words should hopefully work well. The problem with this is that different cultures can use equivalent words very differently. Also, polysemy is different between languages, and this approach works only with static embeddings.
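As a minimal sketch of this idea, assume the transformation is linear, which is one common simplification and not the only option. The code below fits a matrix by ordinary least squares, so it minimizes the sum of squared Euclidean distances between the transformed L1 vectors and their L2 references. All vectors are made-up two-dimensional toy values, the function names are hypothetical, and the small pure-Python solver stands in for a numerical library.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[i]) + list(b[i]) for i in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]  # zero here means too few independent reference words
        aug[col] = [value / scale for value in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(source: Matrix, target: Matrix) -> Matrix:
    """Least-squares matrix M such that each source row w gives w @ M close to u."""
    source_t = transpose(source)
    return solve(matmul(source_t, source), matmul(source_t, target))


def apply_map(w: Vector, mapping: Matrix) -> Vector:
    return matmul([w], mapping)[0]


def euclidean(u: Vector, v: Vector) -> float:
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy reference pairs; the first row plays the role of "nauta" and "sailor"
latin = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
english = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]

mapping = fit_linear_map(latin, english)
v = apply_map(latin[0], mapping)  # "nauta" moved into the English space
error = euclidean(english[0], v)  # distance between u and v
```

With real embeddings the error will not be zero, and the per-word errors are a useful diagnostic for which reference translations fit the learned transformation poorly.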
This is an active area of research, and there will be new developments. One of the hopes for these techniques is that they will let us apply advanced methods to languages that do not have the huge corpora required to build deep learning models.
If you are building a search solution across languages, you generally separate the documents by language and have the user select a language when searching. It is possible to build a multilanguage index, but it can be difficult. There are multiple approaches, but ultimately you need some common way to represent the words or concepts in your corpus. Here are some possible approaches.
You can translate everything into a single language using machine translation. In our scenario, we could translate all the documents into English. The benefit of this is that you can review the quality of these translations. The drawback is that the search quality will suffer for the non-English documents.
On the other hand, if you can serve the translation model efficiently, you can translate at query time into all available languages. This has the benefit of not biasing toward one particular language. The drawback is that you need to find a way to make a common score from these indices. An additional complication is that automatic machine translation is built with complete texts and not queries. So a query may be mistranslated, especially if it is a word with multiple meanings.
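One simple way to make a common score, shown here only as an illustrative assumption, is to rescale the scores from each language index to the range 0 to 1 before merging the result lists. The function names and the sample scores are hypothetical, and min-max scaling is just one of several possible normalization schemes.

```python
from typing import Dict, List, Tuple

Hit = Tuple[str, float]  # (document id, raw score from one index)


def min_max_normalize(hits: List[Hit]) -> List[Hit]:
    """Rescale scores from a single index to the range [0, 1]."""
    if not hits:
        return []
    scores = [score for _, score in hits]
    low, high = min(scores), max(scores)
    if high == low:
        return [(doc_id, 1.0) for doc_id, _ in hits]
    return [(doc_id, (score - low) / (high - low)) for doc_id, score in hits]


def merge_results(results_by_language: Dict[str, List[Hit]],
                  top_k: int = 10) -> List[Tuple[float, str, str]]:
    """Combine per-language hits into one ranking of (score, language, doc id)."""
    merged = []
    for language, hits in results_by_language.items():
        for doc_id, score in min_max_normalize(hits):
            merged.append((score, language, doc_id))
    # The sort is stable, so ties keep the order in which languages were given
    merged.sort(key=lambda item: item[0], reverse=True)
    return merged[:top_k]


results = {
    "en": [("e1", 12.0), ("e2", 6.0)],
    "de": [("d1", 0.9), ("d2", 0.3), ("d3", 0.6)],
}
top = merge_results(results, top_k=3)
```

Normalizing per index keeps an index with large raw scores from dominating the merged list, but it also hides the fact that one language may have no truly relevant documents, so the merged ranking should be evaluated with real queries.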
If automatic machine translation is not an option, you can also consider using word embeddings. This will require the transformations talked about previously. This is essentially building a translation model without the sequence prediction.
Consider these questions about your project:
Dealing with multilanguage applications can be complicated, but it also offers great opportunities. There are not many NLP applications out there that are multilanguage. There are also not many people who have experience creating such applications.
One of the reasons that multilanguage applications are so difficult is that the availability of labeled multilanguage data is poor. This means that multilanguage NLP projects will often require you to gather labeled data. We will discuss human labeling in Chapter 18.