Following the historical overview, it is worth examining some of the limitations of statistical machine translation systems. One fundamental question is related to the approach itself: is it really possible to translate just by putting together sequences of words extracted from very large bilingual corpora? What quality can be expected with such an approach? We will return to this topic at the end of this chapter.
In the meantime, we wish to address two other issues with this approach. It is clear that sentence alignment works better when there is a certain proximity between the language of the source text and the language of the target text. This has an impact on the performance one can expect from a machine translation system: how far can we go with the statistical approach when it is applied to genetically distant languages? Is translation from Chinese or Arabic to English doomed to lag behind? Lastly, statistical translation presupposes the availability of large bilingual training corpora. Thus, problems arise once one leaves the very restricted circle of the best-represented languages on the Internet.
As we have seen, the majority of machine translation systems today use a statistical approach. Identifying translational equivalents at the word or segment level works better when similar languages are involved, since these languages share a similar linguistic structure. This can be seen very directly in the performance of various systems (see chapter 13): it is easier to translate between French or German and English than between Arabic and Japanese or between Chinese and English. Even among closely related languages there are crucial differences. For example, translating from German into English works better than translating from English into German, because compound words in German (i.e., the complex combination of several simple words into a single string of characters) remain problematic for automatic processing. We know relatively well how to automatically decompose existing German compounds, so compounds are not a major obstacle when translating into English. It remains quite difficult, however, for an automatic system to generate correct compound words in German, which means that poor translations are often produced when translating into German.
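To give a concrete idea of this decomposition step, here is a minimal sketch of dictionary-based compound splitting; the tiny vocabulary, the list of linking elements, and the greedy search are illustrative assumptions, not the actual implementation of any existing system.

```python
# Toy dictionary-based compound splitter for German, in the spirit of the
# decomposition step mentioned above. The vocabulary, the linking elements,
# and the greedy longest-head search are illustrative assumptions only.

VOCAB = {"bund", "kanzler", "amt", "haus"}
LINKERS = ("", "s", "es", "n", "en")   # common German linking elements

def split_compound(word, vocab=VOCAB):
    """Return a list of known parts covering `word`, or [word] if none is found."""
    word = word.lower()
    if word in vocab:
        return [word]
    for i in range(len(word) - 2, 2, -1):          # try the longest head first
        head, rest = word[:i], word[i:]
        for link in LINKERS:
            if head.endswith(link) and head[:len(head) - len(link)] in vocab:
                tail = split_compound(rest, vocab)
                if all(part in vocab for part in tail):
                    return [head[:len(head) - len(link)]] + tail
    return [word]

print(split_compound("Bundeskanzleramt"))   # -> ['bund', 'kanzler', 'amt']
```

Splitting a compound into known parts is enough for translation into English; generating a correct compound in the other direction is the much harder problem mentioned above.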
Translating into Japanese or, more recently, Chinese or Arabic has generated a significant amount of research. Performance, compared to that obtained with Indo-European languages such as French or Spanish, remains lower, as these languages have a structure that is very different from English. For these languages, the development of hybrid systems integrating a statistical component as well as advanced linguistic modules that take language specificities into account will likely be the main source of progress in the years to come. This is already the case for morphologically rich languages, for example (i.e., languages in which many different surface forms can be generated from one basic linguistic form): language-specific modules dealing with morphology are very helpful in enhancing the overall performance of natural language processing systems in this context.
All statistical systems require a huge amount of bilingual texts in order to work satisfactorily. Corpora made of millions of aligned sentences are nowadays commonplace. As Mercer said, “there is no data like more data” (see chapter 7).
Consequently, it is clear that beyond a handful of languages that are widely used on the Internet, the systems’ performance decreases considerably, especially if one of the languages (source or target) is not English. The quantities of data available on the Internet for these languages are simply insufficient to obtain good performance. Some techniques have been developed to overcome the lack of bilingual data. For example, it is possible to obtain more information from large monolingual corpora, but this remains insufficient for the task.
A popular strategy consists in designing translation systems that use English as a pivot language in order to overcome, to a certain extent, the lack of training data. The idea is that when there are not enough bilingual data between two languages (for example, between Greek and Finnish), a solution is to translate first from Greek into English, and then from English into Finnish. The approach is simple and can provide interesting results in some contexts. However, it does not fully solve the problem: the quality of the translation to and from English is sometimes mediocre, and applying two steps of translation instead of one also multiplies the likelihood of errors.
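The effect of pivoting can be made concrete with a small sketch: the Greek-to-Finnish translation probabilities are approximated by combining Greek-to-English and English-to-Finnish probabilities through the English pivot. The toy probability tables below are invented for illustration and stand in for the much larger tables a real system would learn.

```python
# A minimal sketch of pivot translation at the level of word probabilities:
# the Greek->Finnish table is approximated by combining Greek->English and
# English->Finnish tables through the English pivot. Toy entries, invented.

from collections import defaultdict

greek_to_english = {"σπίτι": {"house": 0.7, "home": 0.3}}
english_to_finnish = {"house": {"talo": 0.8, "rakennus": 0.2},
                      "home": {"koti": 0.9, "talo": 0.1}}

def pivot_table(src_to_piv, piv_to_tgt):
    """p(tgt | src) ~= sum over pivot of p(pivot | src) * p(tgt | pivot)."""
    table = defaultdict(lambda: defaultdict(float))
    for src, pivots in src_to_piv.items():
        for piv, p_piv in pivots.items():
            for tgt, p_tgt in piv_to_tgt.get(piv, {}).items():
                table[src][tgt] += p_piv * p_tgt
    return table

print(dict(pivot_table(greek_to_english, english_to_finnish)["σπίτι"]))
# -> roughly {'talo': 0.59, 'rakennus': 0.14, 'koti': 0.27}: the uncertainty
#    of each step is multiplied, which is why two-step translation degrades.
```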
The problem of multiple translations is well known and can be observed even when one tries to translate iteratively between the same two languages (for example, from English to French and then back to English). The prototypical example is probably this Biblical sentence: "The spirit is willing, but the flesh is weak." The story goes that this was translated into Russian and then translated back into English as "The whiskey is strong, but the meat is rotten." Yet this example is in fact apocryphal.1 The longevity of this invented example is due to the comical nature of the resulting translation, but it also illustrates the fact that multiplying translation steps amounts to gradually straying away from the original text until an incomprehensible translation is obtained.
Despite these well-known problems with the "pivot approach," many critics have pointed out that Google Translate increasingly uses English as a pivot language. This often leads to strange results. As Frédéric Kaplan noted on his blog,2 with Google Translate, "Il pleut des cordes" (literally "it is raining ropes," the French equivalent of "it is raining cats and dogs") was translated into Italian as "Piove cani e gatti." Likewise, "Cette fille est jolie" ("this girl is pretty") strangely enough was transformed into "Questa ragazza è abbastanza" ("this girl is quite"). These major errors are due to the fact that English is used as a pivot language. In the French expression "Il pleut des cordes," Google identifies a frozen expression that should not be translated word for word. "It rains cats and dogs" is thus a good English translation in this context, but the system then fails to find the equivalent expression in Italian and simply performs a word-for-word translation, which is poetic but not really accurate. As for "jolie" in the other example, the system identifies "pretty" as an equivalent adjective but then seems to confuse the adjectival and adverbial values of "pretty" in English, hence a translation corresponding to "quite" ("abbastanza") instead of "pretty" in the sense of "nice" or "beautiful."
These examples quickly lose their relevance as Google constantly updates its system: Kaplan's blog post dates from November 15, 2014, and by December 1 the Italian translation of "Il pleut des cordes" had already become "Piove a dirotto," which is correct. It is clear that Google works intensively to improve the translation of frozen and multiword expressions like "il pleut des cordes" in French or "it rains cats and dogs" in English: these expressions are special since they must be translated as a whole and not literally. There are obviously no cats or dogs falling from the sky when one uses the English expression! This is a major source of enhancement for current systems, but the task is huge since there are a large number of frozen expressions in any language.
Clearly, using English as a pivot language has several implications. It reinforces the dominant position of English as a world language and its cultural hegemony. Besides this, and despite official discourse promoting language diversity, it is clear that the research effort is mainly directed toward dominant languages (the handful of languages dominating the Internet). Google Translate can officially translate between more than 100 languages, but in practice the results are very uneven and almost unusable for some languages.
Finally, we must keep in mind that the advances since the 1990s in the field of statistical machine translation are related to the fact that more and more data are available, but also to an increase in computing power. The deep learning approach uses some algorithms that were to a certain extent described in the 1980s but that researchers were unable to apply in practice due to the limitations of machines at the time. Things have changed, and it comes as no surprise that it is the companies with huge computational capacities that are now developing the most efficient systems.
Despite the current predominance of statistical machine translation systems, we should note the persistence of rule-based systems. At the same time, the majority of historical systems are now said to be "hybrid": they attempt to combine the benefits of a symbolic approach (i.e., dictionaries with very wide coverage, transfer rules between languages) with the recent benefits of statistical techniques. Lastly, for rare languages with too little data to make it possible to develop statistical systems, rule-based systems remain the norm.
Following the success of statistical translation systems, the majority of traditional systems (based on large lexicons and transfer rules) gradually tried to incorporate statistical information into their approach. A striking example is Systran, which was the main proponent of the rule-based approach in the early 2000s before developing a hybrid approach to translation, based on both knowledge and statistics. There is, in particular, a clear benefit in using a language model to control the fluency of the generated translation: Systran first used statistics to correct the output and produce more fluent translations.
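The following sketch illustrates the general idea of using a language model for fluency, not Systran's actual implementation: a target-language bigram model re-ranks candidate outputs produced by a rule-based component, and the most fluent-sounding candidate is kept. The bigram scores and the candidates are invented.

```python
# Illustrative sketch only: a target-language bigram model re-ranks the
# candidate outputs of a rule-based component so that the most fluent
# translation is kept. The log-probabilities below are invented.

BIGRAM_LOGPROB = {
    ("it", "is"): -0.5, ("is", "raining"): -1.0, ("raining", "hard"): -1.5,
    ("it", "rains"): -2.5, ("rains", "ropes"): -6.0,
}
UNSEEN = -8.0  # back-off penalty for bigrams never seen in the corpus

def fluency(sentence):
    words = sentence.lower().split()
    return sum(BIGRAM_LOGPROB.get(bigram, UNSEEN)
               for bigram in zip(words, words[1:]))

candidates = ["it rains ropes", "it is raining hard"]   # rule-based outputs
print(max(candidates, key=fluency))                      # -> 'it is raining hard'
```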
In practice, statistical information can be integrated in countless ways into systems that otherwise manipulate symbolic information. It is, for example, possible to design modules that dynamically adapt the system (the dictionaries and the rules used for translation) to the domain at hand, such as the medical, legal, or information technology domain. The statistical approach can also help choose the correct translation at the word level. For example, if the system detects a text related to the military domain, the meaning of "tank" as a military vehicle will be preferred to its other meaning of receptacle (for more on this subject, see chapter 10).
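A deliberately simple sketch of this kind of domain-driven word-sense selection is given below; the keyword lists, the domain labels, and the French glosses of "tank" are illustrative assumptions, not the method of any particular system.

```python
# A toy sketch of domain-driven sense selection for "tank". The keyword
# lists and the French translations are illustrative assumptions.

DOMAIN_KEYWORDS = {
    "military": {"troops", "artillery", "battalion", "armored"},
    "industry": {"fuel", "storage", "liquid", "pressure"},
}
TANK_TRANSLATIONS = {"military": "char d'assaut", "industry": "réservoir"}

def detect_domain(text):
    words = set(text.lower().split())
    scores = {d: len(words & kw) for d, kw in DOMAIN_KEYWORDS.items()}
    return max(scores, key=scores.get)

def translate_tank(text):
    return TANK_TRANSLATIONS[detect_domain(text)]

print(translate_tank("The battalion moved its armored tank near the troops"))
# -> "char d'assaut" (the military sense is selected in this context)
```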
The overall idea is of course to combine the richness of existing resources, which are sometimes the result of years of research and development, with the efficiency of statistical approaches. It is now possible to say that the gap between the rule-based and the statistical approaches has largely closed: today most commercial systems are hybrid. We saw above that even Google is integrating more and more semantic resources into its system, making it a truly hybrid system.
Finally, the survival of conventional systems based on rules and bilingual dictionaries only (with little or no statistics at all) should also be noted. When there are too few bilingual corpora available, the statistical approach is no longer of great interest.
Different platforms exist for the development of rule-based systems. Apertium (www.apertium.org) is one such platform. The system was originally intended for closely related languages that require only a limited number of transfer rules (for example, two dialects or two closely related languages differing mainly in their vocabulary but not so much in their syntax). After a while, however, the platform also turned out to be useful for languages underrepresented on the Internet. It has even since specialized in processing rare languages like Basque, Breton, or Northern Sami (a language from the north of Scandinavia), which are available in the system (data are available for about 30 languages, but translation is only possible for some 40 language pairs, most of the time in one direction only). Performance varies depending on the language pair considered, and most of the implemented language pairs rely on bilingual dictionaries and reordering rules. One of the goals of the project is to promote rare languages and provide access to texts that would not be translated otherwise. It also aims to generate interest in these languages, which for the most part are endangered.
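The following toy pipeline illustrates the general dictionary-plus-reordering approach; it is not the Apertium code base, and the lexicon, the part-of-speech tags, and the single reordering rule are invented for the example.

```python
# Toy shallow-transfer pipeline in the spirit of dictionary-plus-reordering
# systems (not the Apertium code base; lexicon, tags, and the single
# reordering rule are illustrative).

LEXICON = {  # source word -> (target word, part of speech)
    "une": ("a", "det"), "maison": ("house", "noun"), "blanche": ("white", "adj"),
}

def translate(sentence):
    tagged = [LEXICON.get(w, (w, "unk")) for w in sentence.lower().split()]
    out, i = [], 0
    while i < len(tagged):
        # Reordering rule: French noun + adjective -> English adjective + noun.
        if (i + 1 < len(tagged)
                and tagged[i][1] == "noun" and tagged[i + 1][1] == "adj"):
            out += [tagged[i + 1][0], tagged[i][0]]
            i += 2
        else:
            out.append(tagged[i][0])
            i += 1
    return " ".join(out)

print(translate("une maison blanche"))   # -> "a white house"
```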
Here we must say a few words about one of the current challenges in the field of machine translation: the rapid development of translation systems for languages that have not been covered so far. This mainly concerns the defense and intelligence industry; surveillance and intelligence needs evolve rapidly depending on geopolitical risks (see chapter 14, dedicated to the machine translation market).
From a technical point of view, the challenge is to collect bilingual corpora very quickly for the languages considered. While automatic corpus collection is a well-known technique nowadays (see chapter 7), the volume of data collected is often insufficient in practice to develop operational machine translation systems. Since most of the time the target is a language that is distant from English, the result is not as good as for closely related language pairs.
In this context, the statistical approach is not the predominant one. The task consists for the most part in developing large bilingual dictionaries manually. Monolingual corpora are collected and processed so as to automatically produce a large list of words in the target language. Then an automatic process or, more likely, a team of linguists provides word translations, making it possible to quickly develop a rudimentary translation system. The production speed of such a system depends closely on the number of linguists that can be hired for the project, but this type of demand currently represents a significant part of the machine translation business.
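The preparatory step described above can be sketched as follows: a frequency-ranked word list is extracted from a monolingual corpus so that linguists can translate the most frequent, hence most useful, words first. The two-sentence corpus and the crude tokenization below are of course only illustrative.

```python
# Sketch of the preparatory step: build a frequency-ranked word list from a
# monolingual corpus so that linguists start with the most frequent words.
# The corpus and the simple tokenization are deliberately minimal.

from collections import Counter
import re

corpus = [
    "the army crossed the river at dawn",
    "the river was high after the rain",
]

counts = Counter()
for line in corpus:
    counts.update(re.findall(r"[a-zà-ÿ']+", line.lower()))

# The most frequent words are handed to the linguists first; their
# translations then seed a rudimentary word-for-word dictionary.
for word, freq in counts.most_common(5):
    print(f"{word}\t{freq}")
```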
The current situation of machine translation poses many fundamental questions for the field. Is semantics necessary to translate or can we settle for statistics? Also, can we say that current systems, specifically those based on statistics, are completely stripped of semantics? Finally, are the approaches used today going to allow for significant progress or, on the contrary, can we anticipate some strong limitations that will prevent improvements in the near future?
Experts in the field have discussed all these questions. Systems are now based on highly technical machine-learning approaches, and linguistics has been left aside. As we will see in chapter 14, commercial issues are important and put pressure on developers to find efficient short-term solutions. At the same time, the progress of systems is measured annually in evaluation conferences: competition is strong, leaving little time for reflecting on the current state of affairs.
It is first of all crucial to recall that since the 1990s we have seen real and undeniable progress in the field of machine translation. Statistical methods made it possible to better process a large number of frequent and important phenomena (such as the search for the best translation at the word level, the management of local ambiguities, the relative contribution of different linguistic constraints when they seem to contradict one another, etc.). These phenomena were generally not solved satisfactorily by rule-based methods. However, the success of statistical models was such that even some of their proponents have called recent progress into question.3 Some local phenomena are solved relatively satisfactorily by current statistical techniques, but more complex phenomena should also be taken into account.
Many very frequent linguistic phenomena (agreement, coordination, pronoun resolution) would indeed require a more complex analysis. They are poorly addressed, or not addressed at all, by statistical systems (but note that this also applies to most rule-based systems, which likewise focus on the local context). The state of the art is simply too limited to deal with these complex issues. Syntactic analysis is hard, but semantics is even harder, and we are still far from knowing how to address this kind of problem properly.
One interesting question concerning recent approaches to machine translation is the status of statistics in this context. It is widely assumed that statistics are opposed to semantics: on the one hand calculation, on the other the representation of word and sentence meaning. Yet this opposition is too crude. As we have seen in the previous chapters, statistics make it possible to accurately model the different meanings of words according to the context. Statistics are also effective at finding translational equivalences at the word or phrase level.
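To make the latter point concrete, even a very simple association score computed over aligned sentence pairs can surface word-level equivalents. The sketch below uses the Dice coefficient on a tiny invented corpus; real systems rely on far more elaborate alignment models.

```python
# Toy illustration of finding word-level translational equivalents from
# aligned sentence pairs with a simple association score (the Dice
# coefficient). The three-sentence "corpus" is obviously invented.

from collections import Counter
from itertools import product

aligned = [
    ("the house is white", "la maison est blanche"),
    ("the house is big", "la maison est grande"),
    ("the car is red", "la voiture est rouge"),
]

src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
for en, fr in aligned:
    en_words, fr_words = set(en.split()), set(fr.split())
    src_count.update(en_words)
    tgt_count.update(fr_words)
    pair_count.update(product(en_words, fr_words))

def dice(e, f):
    return 2 * pair_count[(e, f)] / (src_count[e] + tgt_count[f])

best = max(tgt_count, key=lambda f: dice("house", f))
print(best, round(dice("house", best), 2))   # -> maison 1.0
```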
This raises another question: what, in fact, is the meaning of a word? How can we represent it? It is indeed very difficult to define the meaning of words precisely. This is the job of lexicographers, who spend years developing dictionaries, but the enumeration of word meanings they propose is known to be quite subjective and does not always correspond to word usage in context. Moreover, definitions vary significantly from one dictionary to another, especially for abstract concepts and functional words.
Given this state of affairs, it should be taken into account that although accurate definitions are hard to produce, it is easy for any speaker of a language to give synonyms of a given word or examples of its usage in context. The various meanings of a word correspond, in fact, to its various contexts of use. The challenge thus lies in defining and characterizing the notion of "context." In other words: how can one determine the various meanings of a given word just by observing its usage within a very large corpus? How can usage patterns be identified? Lexicographers generally use a multitude of tools and criteria to define the various word meanings, and they try to be comprehensive, regular, and coherent. Statistics help automate the process and obtain results that are often different but always interesting.
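The following very rough sketch illustrates the idea that usage patterns can be recovered from context alone: occurrences of an ambiguous word ("bank") are grouped according to the overlap of their context words. The sentences, the context window, and the minimal stopword list are illustrative, and real systems use far larger corpora and proper clustering algorithms.

```python
# Very rough sketch: word senses as contexts of use. Occurrences of "bank"
# are grouped by the overlap of their context words. Sentences, window size,
# and the tiny stopword list are illustrative.

sentences = [
    "she deposited money at the bank yesterday",
    "the bank lent money for the loan",
    "they walked along the river bank at dusk",
    "the bank of the river was muddy",
]

def context(sentence, target="bank", window=3):
    words = sentence.split()
    i = words.index(target)
    neighbors = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
    return set(neighbors) - {"the", "of", "at"}     # drop a few function words

clusters = []   # each cluster: [shared context words, member sentences]
for s in sentences:
    ctx = context(s)
    for cluster in clusters:
        if cluster[0] & ctx:               # overlapping contexts -> same usage
            cluster[0].update(ctx)
            cluster[1].append(s)
            break
    else:
        clusters.append([ctx, [s]])

for ctx, members in clusters:
    print(len(members), "occurrence(s) with context words:", sorted(ctx))
```

On this toy corpus the financial and riverside uses of "bank" end up in two separate groups, which is exactly the kind of usage-based distinction lexicographers try to capture.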
Multilingual corpora provide a direct and quite natural model for the question of word meaning. The more vague or ambiguous a word is, the more it will be matched with a variety of different words in the target language. In contrast, the more stable and fixed an expression is (for example, "cryptographie"), the more it will be aligned with a limited number of words (such as "cryptography"), because the word is not ambiguous, or less so. For the same reasons, the approach is able to recognize that a multiword expression (like "pomme de terre" in French) corresponds to a single word in the target language ("potato" in English). The same is true for frozen expressions ("kick the bucket" or "passer l'arme à gauche," which both mean "to die"). Statistical approaches may seem too simple or too crude, but a properly trained system will not produce "frapper le seau" for "kick the bucket" or "pass the weapon to the left" for "passer l'arme à gauche" (although these kinds of problems may occur when frozen expressions are not properly recognized, as already mentioned in this chapter). These examples show that statistical analysis leads to a direct modeling of polysemy, idioms, and frozen expressions, without any predefined linguistic theory.
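This observation can be quantified: the entropy of a word's translation distribution is low for stable, unambiguous terms and high for vague or polysemous ones. The probabilities below are invented for the sake of illustration.

```python
# The entropy of a word's translation distribution as a rough ambiguity
# indicator: low for stable terms, high for vague or polysemous words.
# The probabilities are invented for illustration.

import math

def entropy(dist):
    return -sum(p * math.log2(p) for p in dist.values() if p > 0)

translations = {
    "cryptographie": {"cryptography": 0.97, "encryption": 0.03},
    "important":     {"important": 0.4, "large": 0.25,
                      "significant": 0.2, "serious": 0.15},
}

for word, dist in translations.items():
    print(f"{word}: {entropy(dist):.2f} bits")
# "cryptographie" has low entropy (one dominant equivalent), while the vaguer
# "important" spreads its probability mass over several equivalents.
```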
It can even be claimed that the type of representation obtained from a statistical analysis is more appropriate and cognitively more plausible than what formal approaches propose. Notions such as ambiguity and polysemy (or, in other words, meaning) are closely linked to usage and are not absolute notions. In this respect, it is understandable that statistical analysis can help define the various meanings of a word, the different contexts in which it appears, and so on. Numerous linguists and philosophers have defended such ideas, from Ludwig Wittgenstein to John Rupert Firth, the latter of whom is the author of the famous quote: “You shall know a word by the company it keeps” (i.e., you know the meaning of a word by its context of use). This remark has been cited again and again in modern natural language processing texts. Current approaches may not have anything to say about Wittgenstein or Firth, but they are beyond doubt very close to the text, and they have eliminated everything that was “metaphysical” in semantics (the quest for artificial modes of representation, the idea of a universal language, the goal of transforming sentences into logical forms, etc.). It is possible that current approaches will form the basis of a new theory of meaning.
However, the remarks made at the beginning of this chapter should be kept in mind: the representations used by statistical machine translation systems remain, for the most part, local, which does not make it possible to address many of the fundamental problems related to semantics. Lexical semantics (i.e., the meaning of words) is relatively well formalized today, but propositional semantics (i.e., the meaning of sentences and the relations between them) remains very difficult to formalize and thus, to a large extent, a "terra incognita." This is what a new approach, known as deep learning or neural machine translation and described in the following chapter, is trying to address.