Following the historical overview, it is worth examining some of the limitations of statistical machine translation systems. One fundamental question is related to the approach itself: is it really possible to translate just by putting together sequences of words extracted from very large bilingual corpora? What quality can be expected with such an approach? We will return to this topic at the end of this chapter.
In the meantime, we wish to address two other issues with this approach. It is clear that sentence alignment works better when there is a certain proximity between the language of the source text and the language of the target text. This has an impact on the performance one can expect from a machine translation system: how far can we go with the statistical approach when it is applied to genetically distant languages? Is translation from Chinese or Arabic to English doomed to lag behind? Lastly, statistical translation presupposes the availability of large bilingual training corpora. Thus, problems arise once one leaves the very restricted circle of the best-represented languages on the Internet.
As we have seen, the majority of machine translation systems today use a statistical approach. Identifying translational equivalents at the word or segment level works better when similar languages are involved, since these languages share a similar linguistic structure. This can be seen very directly in the performance of various systems (see chapter 13): it is easier to translate between French or German and English than between Arabic and Japanese or between Chinese and English. Even among closely related languages there are crucial differences. For example, translating from German into English works better than translating from English into German, because compound words in German (i.e., the complex combination of several simple words into a single string of characters) remain problematic for automatic processing. We know relatively well how to automatically decompose existing German compounds, so compounds are not a major obstacle when translating into English. It remains quite difficult, however, for an automatic system to generate correct compound words in German, which means that poor translations are often produced when translating into German.
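To give a concrete idea of this decomposition step, here is a minimal sketch of dictionary-based compound splitting; the tiny vocabulary, the list of linking elements, and the greedy search are illustrative assumptions, not the actual implementation of any existing system.

```python
# Toy dictionary-based compound splitter for German, in the spirit of the
# decomposition step mentioned above. The vocabulary, the linking elements,
# and the greedy longest-head search are illustrative assumptions only.

VOCAB = {"bund", "kanzler", "amt", "haus"}
LINKERS = ("", "s", "es", "n", "en")   # common German linking elements

def split_compound(word, vocab=VOCAB):
    """Return a list of known parts covering `word`, or [word] if none is found."""
    word = word.lower()
    if word in vocab:
        return [word]
    for i in range(len(word) - 2, 2, -1):          # try the longest head first
        head, rest = word[:i], word[i:]
        for link in LINKERS:
            if head.endswith(link) and head[:len(head) - len(link)] in vocab:
                tail = split_compound(rest, vocab)
                if all(part in vocab for part in tail):
                    return [head[:len(head) - len(link)]] + tail
    return [word]

print(split_compound("Bundeskanzleramt"))   # -> ['bund', 'kanzler', 'amt']
```

Splitting a compound into known parts is enough for translation into English; generating a correct compound in the other direction is the much harder problem mentioned above.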
Translating into Japanese or, more recently, Chinese or Arabic has generated a significant amount of research. Performance, compared to that obtained with Indo-European languages such as French or Spanish, remains lower, as these languages have a structure that is very different from English. For these languages, the development of hybrid systems integrating a statistical component as well as advanced linguistic modules that take language specificities into account will likely be the main source of progress in the years to come. This is already the case for morphologically rich languages, for example (i.e., languages in which many different surface forms can be generated from one basic linguistic form): language-specific modules dealing with morphology are very helpful in enhancing the overall performance of natural language processing systems in this context.
All statistical systems require a huge amount of bilingual texts in order to work satisfactorily. Corpora made of millions of aligned sentences are nowadays commonplace. As Mercer said, “there is no data like more data” (see chapter 7).
Consequently, it is clear that beyond a handful of languages that are widely used on the Internet, the systems’ performance decreases considerably, especially if one of the languages (source or target) is not English. The quantities of data available on the Internet for these languages are simply insufficient to obtain good performance. Some techniques have been developed to overcome the lack of bilingual data. For example, it is possible to obtain more information from large monolingual corpora, but this remains insufficient for the task.
A popular strategy consists in designing translation systems that use English as a pivot language in order to overcome, to a certain extent, the lack of training data. The idea is that when there are not enough bilingual data between two languages (for example, between Greek and Finnish), a solution is to translate first from Greek into English, and then from English into Finnish. The approach is simple and can provide interesting results in some contexts. However, it does not fully solve the problem: the quality of the translation to and from English is sometimes mediocre, and applying two steps of translation instead of one also multiplies the likelihood of errors.
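The effect of pivoting can be made concrete with a small sketch: the Greek-to-Finnish translation probabilities are approximated by combining Greek-to-English and English-to-Finnish probabilities through the English pivot. The toy probability tables below are invented for illustration and stand in for the much larger tables a real system would learn.

```python
# A minimal sketch of pivot translation at the level of word probabilities:
# the Greek->Finnish table is approximated by combining Greek->English and
# English->Finnish tables through the English pivot. Toy entries, invented.

from collections import defaultdict

greek_to_english = {"σπίτι": {"house": 0.7, "home": 0.3}}
english_to_finnish = {"house": {"talo": 0.8, "rakennus": 0.2},
                      "home": {"koti": 0.9, "talo": 0.1}}

def pivot_table(src_to_piv, piv_to_tgt):
    """p(tgt | src) ~= sum over pivot of p(pivot | src) * p(tgt | pivot)."""
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_piv.items():
        for piv, p_piv in pivots.items():
            for tgt, p_tgt in piv_to_tgt.get(piv, {}).items():
                table[src][tgt] += p_piv * p_tgt
    return table

print(dict(pivot_table(greek_to_english, english_to_finnish)["σπίτι"]))
# -> roughly {'talo': 0.59, 'rakennus': 0.14, 'koti': 0.27}: the uncertainty
#    of each step is multiplied, which is why two-step translation degrades.
```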
The problem of multiple translations is well known and can be observed even when one tries to translate iteratively between the same two languages (for example, from English to French and then back to English). The prototypical example is probably this Biblical sentence: "The spirit is willing, but the flesh is weak." The story goes that this was translated into Russian and then translated back into English as "The whiskey is strong, but the meat is rotten." Yet this example is in fact apocryphal.1 The longevity of this invented example is due to the comical nature of the resulting translation, but it also illustrates the fact that multiplying translation steps amounts to gradually straying away from the original text until an incomprehensible translation is obtained.
Despite these well-known problems with the "pivot approach," many critics have pointed out that Google Translate increasingly uses English as a pivot language. This often leads to strange results. As Frédéric Kaplan noted on his blog,2 with Google Translate, "Il pleut des cordes" (literally "it is raining ropes," the French equivalent of "it is raining cats and dogs") was translated into Italian as "Piove cani e gatti." Likewise, "Cette fille est jolie" ("this girl is pretty") strangely enough was transformed into "Questa ragazza è abbastanza" ("this girl is quite"). These major errors are due to the fact that English is used as a pivot language. In the French expression "Il pleut des cordes," Google identifies a frozen expression that should not be translated word for word. "It rains cats and dogs" is thus a good English translation in this context, but the system then fails to find the equivalent expression in Italian and simply performs a word-for-word translation, which is poetic but not really accurate. As for "jolie" in the other example, the system identifies "pretty" as an equivalent adjective but then seems to confuse the adjectival and adverbial values of "pretty" in English, hence a translation corresponding to "quite" ("abbastanza") instead of "pretty" in the sense of "nice" or "beautiful."
These examples quickly lose their relevance as Google constantly updates its system: Kaplan's blog post dates from November 15, 2014, and by December 1 the Italian translation of "Il pleut des cordes" had already become "Piove a dirotto," which is correct. It is clear that Google works intensively to improve the translation of frozen and multiword expressions like "il pleut des cordes" in French or "it rains cats and dogs" in English: these expressions are special since they must be translated as a whole and not literally. There are obviously no cats or dogs falling from the sky when one uses the English expression! This is a major source of enhancement for current systems, but the task is huge since there are a large number of frozen expressions in any language.
Clearly, using English as a pivot language has several implications. It reinforces the dominant position of English as a world language and its cultural hegemony. Besides this, and despite official discourse promoting language diversity, it is clear that the research effort is mainly directed toward dominant languages (the handful of languages dominating the Internet). Google Translate can officially translate between more than 100 languages, but in practice the results are very uneven and almost unusable for some languages.
Finally, we must keep in mind that the advances since the 1990s in the field of statistical machine translation are related to the fact that more and more data are available, but also to an increase in computing power. The deep learning approach uses some algorithms that were to a certain extent described in the 1980s but that researchers were unable to apply in practice due to the limitations of machines at the time. Things have changed, and it comes as no surprise that it is the companies with huge computational capacities that are now developing the most efficient systems.
Despite the current predominance of statistical machine translation systems, we should note the persistence of rule-based systems. At the same time, the majority of historical systems are now said to be "hybrid": they attempt to combine the benefits of a symbolic approach (i.e., dictionaries with very wide coverage, transfer rules between languages) with the recent benefits of statistical techniques. Lastly, for rare languages with too little data to make it possible to develop statistical systems, rule-based systems remain the norm.
Following the success of statistical translation systems, the majority of traditional systems (based on large lexicons and transfer rules) gradually tried to incorporate statistical information into their approach. A striking example is Systran, which was the main proponent of the rule-based approach in the early 2000s before developing a hybrid approach to translation, based on both knowledge and statistics. There is, in particular, a clear benefit in using a language model to control the fluency of the generated translation: Systran first used statistics to correct the output and produce more fluent translations.
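The following sketch illustrates the general idea of using a language model for fluency, not Systran's actual implementation: a target-language bigram model re-ranks candidate outputs produced by a rule-based component, and the most fluent-sounding candidate is kept. The bigram scores and the candidates are invented.

```python
# Illustrative sketch only: a target-language bigram model re-ranks the
# candidate outputs of a rule-based component so that the most fluent
# translation is kept. The log-probabilities below are invented.

BIGRAM_LOGPROB = {
    ("it", "is"): -0.5, ("is", "raining"): -1.0, ("raining", "hard"): -1.5,
    ("it", "rains"): -2.5, ("rains", "ropes"): -6.0,
}
UNSEEN = -8.0  # back-off penalty for bigrams never seen in the corpus

def fluency(sentence):
    words = sentence.lower().split()
    return sum(BIGRAM_LOGPROB.get(bigram, UNSEEN)
               for bigram in zip(words, words[1:]))

candidates = ["it rains ropes", "it is raining hard"]   # rule-based outputs
print(max(candidates, key=fluency))                      # -> 'it is raining hard'
```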
In practice, statistical information can be integrated in countless ways into systems that otherwise manipulate symbolic information. It is, for example, possible to design modules that dynamically adapt the system (the dictionaries and the rules used for translation) to the domain at hand, such as the medical, legal, or information technology domain. The statistical approach can also help choose the correct translation at the word level. For example, if the system detects a text related to the military domain, the meaning of "tank" as a military vehicle will be preferred to its other meaning of receptacle (for more on this subject, see chapter 10).
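A deliberately simple sketch of this kind of domain-driven word-sense selection is given below; the keyword lists, the domain labels, and the French glosses of "tank" are illustrative assumptions, not the method of any particular system.

```python
# A toy sketch of domain-driven sense selection for "tank". The keyword
# lists and the French translations are illustrative assumptions.

DOMAIN_KEYWORDS = {
    "military": {"troops", "artillery", "battalion", "armored"},
    "industry": {"fuel", "storage", "liquid", "pressure"},
}
TANK_TRANSLATIONS = {"military": "char d'assaut", "industry": "réservoir"}

def detect_domain(text):
    words = set(text.lower().split())
    scores = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    return max(scores, key=scores.get)

def translate_tank(text):
    return TANK_TRANSLATIONS[detect_domain(text)]

print(translate_tank("The battalion moved its armored tank near the troops"))
# -> "char d'assaut" (the military sense is selected in this context)
```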
The overall idea is of course to combine the richness of existing resources, which are sometimes the result of years of research and development, with the efficiency of statistical approaches. It is now possible to say that the gap between the rule-based and the statistical approaches has largely closed: today most commercial systems are hybrid. We saw above that even Google is integrating more and more semantic resources into its system, making it a truly hybrid system.
Finally, the survival of conventional systems based on rules and bilingual dictionaries only (with little or no statistics at all) should also be noted. When there are too few bilingual corpora available, the statistical approach is no longer of great interest.
Different platforms exist for the development of rule-based systems. Apertium (www.apertium.org) is one such platform. The system was originally intended for closely related languages that require only a limited number of transfer rules (for example, two dialects or two closely related languages differing mainly in their vocabulary but not so much in their syntax). After a while, however, the platform also turned out to be useful for languages underrepresented on the Internet. It has even since specialized in processing rare languages like Basque, Breton, or Northern Sami (a language from the north of Scandinavia), which are available in the system (data are available for about 30 languages, but translation is only possible for some 40 language pairs, most of the time in one direction only). Performance varies depending on the language pair considered, and most of the implemented language pairs rely on bilingual dictionaries and reordering rules. One of the goals of the project is to promote rare languages and provide access to texts that would not be translated otherwise. It also aims to generate interest in these languages, which for the most part are endangered.
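The following toy pipeline illustrates the general dictionary-plus-reordering approach; it is not the Apertium code base, and the lexicon, the part-of-speech tags, and the single reordering rule are invented for the example.

```python
# Toy shallow-transfer pipeline in the spirit of dictionary-plus-reordering
# systems (not the Apertium code base; lexicon, tags, and the single
# reordering rule are illustrative).

LEXICON = {  # source word -> (target word, part of speech)
    "une": ("a", "det"), "maison": ("house", "noun"), "blanche": ("white", "adj"),
}

def translate(sentence):
    tagged = [LEXICON.get(w, (w, "unk")) for w in sentence.lower().split()]
    out, i = [], 0
    while i < len(tagged):
        # Reordering rule: French noun + adjective -> English adjective + noun.
        if (i + 1 < len(tagged)
                and tagged[i][1] == "noun" and tagged[i + 1][1] == "adj"):
            out += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

print(translate("une maison blanche"))   # -> "a white house"
```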
Here we must say a few words about one of the current challenges in the field of machine translation: the rapid development of translation systems for languages that have not been covered so far. This mainly concerns the defense and intelligence industry; surveillance and intelligence needs evolve rapidly depending on geopolitical risks (see chapter 14, dedicated to the machine translation market).
From a technical point of view, the challenge is to collect bilingual corpora very quickly for the languages considered. While automatic corpus collection is a well-known technique nowadays (see chapter 7), the volume of data collected is often insufficient in practice to develop operational machine translation systems. Since most of the time the target is a language that is distant from English, the result is not as good as for closely related language pairs.
In this context, the statistical approach is not the predominant one. The task consists for the most part in developing large bilingual dictionaries manually. Monolingual corpora are collected and processed so as to automatically produce a large list of words in the target language. Then an automatic process or, more likely, a team of linguists provides word translations, making it possible to quickly develop a rudimentary translation system. The production speed of such a system depends closely on the number of linguists that can be hired for the project, but this type of demand currently represents a significant part of the machine translation business.
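The preparatory step described above can be sketched as follows: a frequency-ranked word list is extracted from a monolingual corpus so that linguists can translate the most frequent, hence most useful, words first. The two-sentence corpus and the crude tokenization below are of course only illustrative.

```python
# Sketch of the preparatory step: build a frequency-ranked word list from a
# monolingual corpus so that linguists start with the most frequent words.
# The corpus and the simple tokenization are deliberately minimal.

from collections import Counter
import re

corpus = [
    "the army crossed the river at dawn",
    "the river was high after the rain",
]

counts = Counter()
for line in corpus:
    counts.update(re.findall(r"[a-zà-ÿ']+", line.lower()))

# The most frequent words are handed to the linguists first; their
# translations then seed a rudimentary word-for-word dictionary.
for word, freq in counts.most_common(5):
    print(f"{word}\t{freq}")
```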
The current situation of machine translation poses many fundamental questions for the field. Is semantics necessary to translate or can we settle for statistics? Also, can we say that current systems, specifically those based on statistics, are completely stripped of semantics? Finally, are the approaches used today going to allow for significant progress or, on the contrary, can we anticipate some strong limitations that will prevent improvements in the near future?
Experts in the field have discussed all these questions. Systems are now based on highly technical machine-learning approaches, and linguistics has been left aside. As we will see in chapter 14, commercial issues are important and put pressure on developers to find efficient short-term solutions. At the same time, the progress of systems is measured annually in evaluation conferences: competition is strong, leaving little time for reflecting on the current state of affairs.
It is first of all crucial to recall that since the 1990s we have seen real and undeniable progress in the field of machine translation. Statistical methods made it possible to better process a large number of frequent and important phenomena (such as the search for the best translation at the word level, the management of local ambiguities, the relative contribution of different linguistic constraints when they seem to contradict one another, etc.). These phenomena were generally not solved satisfactorily by rule-based methods. However, the success of statistical models was such that even some of their proponents have called recent progress into question.3 Some local phenomena are solved relatively satisfactorily by current statistical techniques, but more complex phenomena should also be taken into account.
Many very frequent linguistic phenomena (agreement, coordination, pronoun resolution) would indeed require a more complex analysis. They are poorly addressed, or not addressed at all, by statistical systems (but note that this also applies to most rule-based systems, which likewise focus on the local context). The state of the art is simply too limited to deal with these complex issues. Syntactic analysis is hard, but semantics is even harder, and we are still far from knowing how to address this kind of problem properly.
One interesting question concerning recent approaches to machine translation is the status of statistics in this context. It is widely assumed that statistics are opposed to semantics: on the one hand calculation, on the other the representation of word and sentence meaning. Yet this opposition is too crude. As we have seen in the previous chapters, statistics make it possible to accurately model the different meanings of words according to the context. Statistics are also effective at finding translational equivalences at the word or phrase level.
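To make the latter point concrete, even a very simple association score computed over aligned sentence pairs can surface word-level equivalents. The sketch below uses the Dice coefficient on a tiny invented corpus; real systems rely on far more elaborate alignment models.

```python
# Toy illustration of finding word-level translational equivalents from
# aligned sentence pairs with a simple association score (the Dice
# coefficient). The three-sentence "corpus" is obviously invented.

from collections import Counter
from itertools import product

aligned = [
    ("the house is white", "la maison est blanche"),
    ("the house is big", "la maison est grande"),
    ("the car is red", "la voiture est rouge"),
]

src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
for en, fr in aligned:
    en_words, fr_words = set(en.split()), set(fr.split())
    src_count.update(en_words)
    tgt_count.update(fr_words)
    pair_count.update(product(en_words, fr_words))

def dice(e, f):
    return 2 * pair_count[(e, f)] / (src_count[e] + tgt_count[f])

best = max(tgt_count, key=lambda f: dice("house", f))
print(best, round(dice("house", best), 2))   # -> maison 1.0
```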
This raises another question: what, in fact, is the meaning of a word? How can we represent it? It is indeed very difficult to define the meaning of words precisely. This is the job of lexicographers, who spend years developing dictionaries, but the enumeration of word meanings they propose is known to be quite subjective and does not always correspond to word usage in context. Moreover, definitions vary significantly from one dictionary to another, especially for abstract concepts and functional words.
Given this state of affairs, it should be taken into account that although accurate definitions are hard to produce, it is easy for any speaker of a language to give synonyms of a given word or examples of its usage in context. The various meanings of a word correspond, in fact, to its various contexts of use. The challenge thus lies in defining and characterizing the notion of "context." In other words: how can one determine the various meanings of a given word just by observing its usage within a very large corpus? How can usage patterns be identified? Lexicographers generally use a multitude of tools and criteria to define the various word meanings, and they try to be comprehensive, regular, and coherent. Statistics help automate the process and obtain results that are often different but always interesting.
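The following very rough sketch illustrates the idea that usage patterns can be recovered from context alone: occurrences of an ambiguous word ("bank") are grouped according to the overlap of their context words. The sentences, the context window, and the minimal stopword list are illustrative, and real systems use far larger corpora and proper clustering algorithms.

```python
# Very rough sketch: word senses as contexts of use. Occurrences of "bank"
# are grouped by the overlap of their context words. Sentences, window size,
# and the tiny stopword list are illustrative.

sentences = [
    "she deposited money at the bank yesterday",
    "the bank lent money for the loan",
    "they walked along the river bank at dusk",
    "the bank of the river was muddy",
]

def context(sentence, target="bank", window=3):
    words = sentence.split()
    i = words.index(target)
    neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    return set(neighbors) - {"the", "of", "at"}     # drop a few function words

clusters = []   # each cluster: [shared context words, member sentences]
for s in sentences:
    ctx = context(s)
    for cluster in clusters:
        if cluster[0] & ctx:               # overlapping contexts -> same usage
            cluster[0].update(ctx)
            cluster[1].append(s)
            break
    else:
        clusters.append([ctx, [s]])

for ctx, members in clusters:
    print(len(members), "occurrence(s) with context words:", sorted(ctx))
```

On this toy corpus the financial and riverside uses of "bank" end up in two separate groups, which is exactly the kind of usage-based distinction lexicographers try to capture.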
Multilingual corpora provide a direct and quite natural model for the question of word meaning. The more vague or ambiguous a word is, the more it will be matched with a variety of different words in the target language. In contrast, the more stable and fixed an expression is (for example, "cryptographie"), the more it will be aligned with a limited number of words (such as "cryptography"), because the word is not ambiguous, or less so. For the same reasons, the approach is able to recognize that a multiword expression (like "pomme de terre" in French) corresponds to a single word in the target language ("potato" in English). The same is true for frozen expressions ("kick the bucket" or "passer l'arme à gauche," which both mean "to die"). Statistical approaches may seem too simple or too crude, but a properly trained system will not produce "frapper le seau" for "kick the bucket" or "pass the weapon to the left" for "passer l'arme à gauche" (although these kinds of problems may occur when frozen expressions are not properly recognized, as already mentioned in this chapter). These examples show that statistical analysis leads to a direct modeling of polysemy, idioms, and frozen expressions, without any predefined linguistic theory.
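This observation can be quantified: the entropy of a word's translation distribution is low for stable, unambiguous terms and high for vague or polysemous ones. The probabilities below are invented for the sake of illustration.

```python
# The entropy of a word's translation distribution as a rough ambiguity
# indicator: low for stable terms, high for vague or polysemous words.
# The probabilities are invented for illustration.

import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

translations = {
    "cryptographie": {"cryptography": 0.97, "encryption": 0.03},
    "important":     {"important": 0.4, "large": 0.25,
                      "significant": 0.2, "serious": 0.15},
}

for word, dist in translations.items():
    print(f"{word}: {entropy(dist):.2f} bits")
# "cryptographie" has low entropy (one dominant equivalent), while the vaguer
# "important" spreads its probability mass over several equivalents.
```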
It can even be claimed that the type of representation obtained from a statistical analysis is more appropriate and cognitively more plausible than what formal approaches propose. Notions such as ambiguity and polysemy (or, in other words, meaning) are closely linked to usage and are not absolute notions. In this respect, it is understandable that statistical analysis can help define the various meanings of a word, the different contexts in which it appears, and so on. Numerous linguists and philosophers have defended such ideas, from Ludwig Wittgenstein to John Rupert Firth, the latter of whom is the author of the famous quote: “You shall know a word by the company it keeps” (i.e., you know the meaning of a word by its context of use). This remark has been cited again and again in modern natural language processing texts. Current approaches may not have anything to say about Wittgenstein or Firth, but they are beyond doubt very close to the text, and they have eliminated everything that was “metaphysical” in semantics (the quest for artificial modes of representation, the idea of a universal language, the goal of transforming sentences into logical forms, etc.). It is possible that current approaches will form the basis of a new theory of meaning.
However, the remarks made at the beginning of this chapter should be kept in mind: the representations used by statistical machine translation systems remain, for the most part, local, which does not make it possible to address many of the fundamental problems related to semantics. Lexical semantics (i.e., the meaning of words) is relatively well formalized today, but propositional semantics (i.e., the meaning of sentences and the relations between them) remains very difficult to formalize and thus, to a large extent, a "terra incognita." This is what a new approach, known as deep learning or neural machine translation and described in the following chapter, is trying to address.