We saw in the previous chapter the success of the models developed by IBM for machine translation. One of the main limitations of these models is that they rely on alignment at word level (i.e., they essentially produce word-for-word translations, even if they also allow 1-m alignments, where one word in the source language corresponds to several words in the target language). This chapter covers developments that took place in the 1990s and 2000s that aimed to overcome these limitations. We examine how information of a syntactic and semantic nature has been progressively integrated into models to compensate for the limitations of purely statistical approaches.
The IBM models have been subjected to numerous enhancements. The most significant improvement was to take into consideration the notion of segments (or sequences of words) in order to overcome the limitation of a simple word-for-word translation. Among the other improvements, the notion of double alignment is worth mentioning, since it greatly increases the quality of the search for translations at word level in bilingual corpora.
The IBM models are able to recognize correspondences such that, for one word in the source language, there are 0, 1, or n words in the target language. However, the original IBM models, because of their formal basis, do not make it possible to obtain the opposite correspondences (in other words, one word from the target language cannot correspond to a multiword expression in the source language, for example). This is a strong limitation of these models that has no linguistic basis, since multiword expressions clearly exist in every language. It therefore seemed necessary to overcome this limitation imposed by the IBM models in order to allow for m-n alignments (where any number of words from the source language corresponds to any number of words in the target language).
The original IBM article specifically mentioned the example shown in figure 17, which the models proposed in 1993 were unable to handle.
One way of overcoming this problem is to first calculate the alignments from the source language into the target language, then repeat the operation in the opposite direction (from the target language into the source language). The shared alignments are kept, namely those concerning words that have been aligned in both directions. The alignments obtained using this technique are generally precise but provide a low coverage of the bi-texts. Overall, this method has two major drawbacks: first, the process is more complex than a simple alignment, and is therefore more costly in terms of computation time; second, at the end of the process, a large number of words are no longer aligned because the constraints imposed by the double-direction analysis cause numerous alignments identified in a single direction to be rejected. Various heuristics then have to be used to expand the alignments to neighboring words in order to compensate for the coverage problems incurred (the double alignments, or “symmetric alignments,” can be seen as “islands of confidence”; see chapter 7).
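The double alignment procedure described above can be illustrated with a minimal sketch. The function names `symmetrize` and `grow`, the word indices, and the two directional alignments are assumptions made up for the illustration; the expansion step is a deliberately simplified stand-in for the heuristics used in real systems, not a reproduction of any particular one.

```python
def symmetrize(src_to_tgt, tgt_to_src):
    """Keep only the links found in both directions (the "islands of confidence")."""
    # Links of the reverse run are stored as (tgt_index, src_index): flip them first.
    flipped = {(s, t) for (t, s) in tgt_to_src}
    return set(src_to_tgt) & flipped


def grow(intersection, union):
    """Simplified expansion: add a one-direction link when it touches an
    accepted link and covers a word that is still unaligned on one side."""
    aligned = set(intersection)
    added = True
    while added:
        added = False
        for s, t in sorted(union - aligned):
            src_free = all(s != a for a, _ in aligned)
            tgt_free = all(t != b for _, b in aligned)
            is_neighbor = any(abs(s - a) + abs(t - b) == 1 for a, b in aligned)
            if is_neighbor and (src_free or tgt_free):
                aligned.add((s, t))
                added = True
    return aligned


# Hypothetical toy alignments for
# "they(0) don't(1) have(2) any(3) money(4)" and "ils(0) sont(1) démunis(2)".
english_to_french = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)}   # (en, fr) pairs
french_to_english = {(0, 0), (1, 2), (2, 4)}                   # (fr, en) pairs

sure_links = symmetrize(english_to_french, french_to_english)
all_links = english_to_french | {(s, t) for (t, s) in french_to_english}
expanded_links = grow(sure_links, all_links)
```

In this toy case the intersection keeps three links out of five, which shows the coverage problem; the expansion step then recovers the two rejected links because each one sits next to an accepted link.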
It has been shown that this method improves the results of the original IBM models. However, in order to obtain a good coverage of the data with these models, it is necessary to have huge quantities of data, which makes them impractical in some circumstances.
We have seen that the double alignment approach helps to identify m-n translational equivalences, such as “don’t have any money” ⇔ “sont démunis” (where four English words amount to two French words). In fact, it is possible to generalize the approach so as to consider the problem of translation as an alignment problem at the level of sequences of words, and not at the level of isolated words only. The goal is to translate at the phrase level (i.e., sequences of several words): this would enable the context to be better taken into account and would thus offer translations of better quality than simple word-for-word equivalences.
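To make the move from word links to sequences of words more concrete, here is a minimal sketch of how segment pairs can be read off a word alignment. It assumes the commonly used consistency criterion (a source span and a target span form a pair only if no link leaves the pair); the function name `extract_phrase_pairs`, the `max_len` parameter, and the toy alignment are illustrative assumptions, and real systems also handle unaligned words and attach probabilities to each pair.

```python
def extract_phrase_pairs(src, tgt, links, max_len=4):
    """Return (source segment, target segment) pairs consistent with the links.

    links is a set of (src_index, tgt_index) word alignments.
    """
    pairs = []
    for i1 in range(len(src)):
        for i2 in range(i1, min(i1 + max_len, len(src))):
            # Target positions linked to the source span [i1, i2].
            tgt_positions = [t for s, t in links if i1 <= s <= i2]
            if not tgt_positions:
                continue
            j1, j2 = min(tgt_positions), max(tgt_positions)
            if j2 - j1 + 1 > max_len:
                continue
            # Consistency: no word of the target span is linked outside the source span.
            if all(i1 <= s <= i2 for s, t in links if j1 <= t <= j2):
                pairs.append((" ".join(src[i1:i2 + 1]), " ".join(tgt[j1:j2 + 1])))
    return pairs


# Hypothetical word alignment for the example discussed in the text.
english = ["they", "don't", "have", "any", "money"]
french = ["ils", "sont", "démunis"]
word_links = {(0, 0), (1, 1), (2, 1), (3, 2), (4, 2)}

phrase_pairs = extract_phrase_pairs(english, french, word_links)
```

With this alignment, the four-word segment "don't have any money" is paired with "sont démunis", whereas "don't" alone is not paired with "sont", since "sont" is also linked to "have".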
Several research groups have tackled this problem since the late 1990s, and various strategies have been explored. One strategy is to systematically symmetrize the alignments (see the previous section) in order to identify all the possible m-n alignments. Other researchers have tried to directly identify linguistically coherent sequences in texts, through rules describing syntactic phrases for example (this can be seen as a first attempt to introduce a light syntactic analysis in the translation process). A last line of research tried to import some techniques from the example-based paradigm (see chapter 8), the idea being to make the alignment process both more robust and more precise by aligning from tags and not from word forms. For example, the following two sentences may seem very different for a computer, since several words are different: “In September 2008, the crisis …” and “In October 2009, the crisis … .” However, if the system is able to recognize date expressions, it is possible to recognize the structure “In <DATE>, the crisis” in both sentences: they can thus be aligned successfully. This technique can significantly improve the quality of the alignment.
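The date example can be sketched in a few lines. This is only an illustration under simple assumptions: the pattern below recognizes English "month + four-digit year" expressions and nothing else, and the names `DATE_PATTERN` and `generalize` are invented for the example.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Assumption: a date is an English month name followed by a four-digit year.
DATE_PATTERN = re.compile(rf"\b(?:{MONTHS})\s+\d{{4}}\b")


def generalize(sentence):
    """Replace every recognized date expression with the tag <DATE>."""
    return DATE_PATTERN.sub("<DATE>", sentence)


first = generalize("In September 2008, the crisis")
second = generalize("In October 2009, the crisis")
```

Both sentences are reduced to the same tagged form, "In <DATE>, the crisis", so an aligner working on tags rather than word forms would treat them as instances of one structure.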
The results obtained by these models show a clear improvement in comparison to the more complex IBM models, notably IBM model 4. However, the results are still very dependent on the training data: the more data there are, the more accurate the models will be. Moreover, segment models require a lot more training data than models based only on word alignment. Finally, it should be noted that the notion of segments does not generally correspond to the notion of phrases. A closer look at the results obtained shows that the segments obtained by training from large bilingual corpora correspond to frequent but fragmentary groups of words (for example, “table of” or “table based on”). Conversely, limiting the analysis to linguistically coherent phrases (for example, “the table” or “on the table”) seriously degrades the results. In other words, if one forces the system to focus on linguistically coherent sequences corresponding to syntactically complete groups, the results are not as good as with a purely mechanical approach that does not take syntax into account.
The most challenging part of segment-based translation is for the system to produce a relevant sentence from scattered pieces of translation. Figure 18 gives a simplified but typical view of the situation after the selection of translation fragments (this view is simplified, because here the sentence to be translated is short, the number of segments to be taken into account is limited, and in real systems all fragments have a probability score).
It is clear from figure 18 that only the careful selection of some fragments can lead to a meaningful translation. The language model of the target language helps in finding the most probable sequence in the target language; in other words, it tries to separate linguistically correct sentences from incorrect ones (independently from the source sentence at this stage).
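The role of the language model can be illustrated with a toy sketch. It assumes a bigram model with add-one smoothing trained on four invented sentences, and it enumerates every combination of fragments exhaustively; the names `train_bigram_lm` and `best_translation`, the training sentences, and the fragment options are all made up for the illustration. Real systems combine this score with the fragment probabilities and use a heuristic search instead of full enumeration.

```python
import math
from collections import Counter
from itertools import product


def train_bigram_lm(sentences):
    """Return a function giving the smoothed log-probability of a sentence."""
    contexts, bigrams = Counter(), Counter()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        contexts.update(tokens[:-1])
        bigrams.update(zip(tokens, tokens[1:]))
    vocab_size = len(set(contexts) | {"</s>"})

    def log_prob(sentence):
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        # Add-one smoothing keeps unseen bigrams from getting probability zero.
        return sum(
            math.log((bigrams[(a, b)] + 1) / (contexts[a] + vocab_size))
            for a, b in zip(tokens, tokens[1:])
        )

    return log_prob


def best_translation(options, log_prob):
    """Pick one fragment per slot so that the whole sentence is most probable."""
    candidates = (" ".join(choice) for choice in product(*options))
    return max(candidates, key=log_prob)


# Invented target-language sample and invented fragment options.
lm = train_bigram_lm([
    "the house is small",
    "the home is big",
    "it is small",
    "the house is big",
])
fragment_options = [["the", "it"], ["house", "home"], ["is small", "small is"]]
chosen = best_translation(fragment_options, lm)
```

Note that the score depends only on the target language: a sequence such as "it home small is" is rejected because it is improbable in the target language, not because it is a poor match for the source sentence.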
As one can easily imagine, these models are much more complex than the original IBM models based on simple words. Thus, they may require considerable processing time compared to the original IBM models. The increasing computational power of computers somewhat compensates for this problem. From a linguistic point of view, it should be noted that these models fail to identify discontinuous phrases (where one word in the source language corresponds to two noncontiguous words in the target language), which are crucial in languages such as French or German (English: “I bought the car” ⇔ German: “Ich habe das Auto gekauft”; English: “I don’t want” ⇔ French: “Je ne veux pas”).
The recent developments described in this section have nevertheless helped improve the IBM models, and the resulting systems can still be considered the state of the art in machine translation.
Statistical translation models, despite their increasing complexity to better fit language specificities, have not solved all the difficulties encountered. In fact, bilingual corpora, even large ones, remain insufficient at times to properly cover rare or complex linguistic phenomena. One solution is to then integrate more information of a linguistic nature in the machine translation system to better represent the relations between words (syntax) and their meanings (semantics).
The statistical models described so far are all direct translation systems: they search for equivalences between the source language and the target language at word level, or, at best, they take into consideration sequences of words that are not necessarily linguistically coherent. As we have seen, a strategy that takes into account the fragments identified on a purely statistical basis is more efficient than a strategy that only retains linguistically motivated phrases (e.g., noun phrases or verb phrases).
Several attempts have been made, nonetheless, to take better account of syntax in the machine translation process. This can be done through the integration of parsers, also known as syntactic analyzers: a parser is a tool that tries to automatically identify the syntactic structure of the sentences to be analyzed. If we recall Vauquois’ triangle (see figure 2 in chapter 3), syntax makes it possible to take into account the relations between words. For example, there may be a relation between distant words in the sentence that is very difficult to spot with a purely statistical model. Parsers, however, are theoretically able to consider the relations between the words, even in cases where those words are not adjacent on the surface. In Vauquois’ words, syntax involves transfer rules between equivalent syntactic structures, not a word-for-word translation.
From a theoretical perspective, this type of approach can better analyze discontinuous morphemes that may be common in some languages (as in the example in German above, “Ich habe das Auto gekauft,” where the analyzer should identify “habe gekauft” as a whole despite the distance between the two words). Syntax can also help represent the connection between a preposition and the following noun phrase, or between the verb and its arguments (subject, object, etc.) that are frequently distant from one another.
As already said, this approach requires “parsers” (automatic syntactic analyzers) for the various languages to be processed. However, these are complex tools, far from perfect, whose quality varies considerably depending on the language considered. Applying such tools to machine translation is thus a complex operation.
One basic approach for the integration of a parser in a machine translation system is to analyze the structure of the source sentence (so as to produce a “syntactic tree”) and then, for each level of the syntactic tree, try to determine an equivalent structure in the target language. As one can imagine, identifying equivalent structures between different languages is a daunting challenge that results in a lot of “silence” (in other words, many of the sentences to be translated include structures that will never have been observed up to that point in the training data; i.e., in the available aligned bilingual corpora). It is therefore necessary to imagine a process that makes it possible to translate even if the system cannot analyze the whole sentence properly from a syntactic point of view. Different strategies have been tried, especially “generalization tricks” inherited from the example-based paradigm, where one tries to find a similar structure if a sentence or a part of a sentence cannot be properly analyzed.
It also happens, for some language pairs, that a parser exists for one language (the source or the target language) but not for the other one. Some experiments have been done to integrate the parser only on one side of the translation process, with mixed results.
Overall, the idea of integrating syntax in the translation process seems, of course, promising. However, the approach is still in its infancy and, for the time being, has not obtained better results than the simpler models using direct translation techniques based on segments. The first reason is certainly the highly variable performance of syntactic parsers. Moreover, if the parser makes an error, it will “percolate” throughout the entire translation process. The “silence” issue (i.e., syntactic structures for which the parser is unable to provide a relevant analysis or for which there is no direct equivalent in the target language) is the other main cause of this limited performance. The problem is in a way logical: things are often phrased differently in different languages. It is thus not surprising that no directly equivalent structures can be found in the target language. This is surely a serious challenge for the integration of syntactic parsers in the translation process.
So far, syntax seems to be especially useful in certain specific contexts or for certain languages like German, where the verb is frequently broken into two elements as we have seen. The main benefit of the approach is that it performs a local and limited syntactic analysis to resolve some of the specific problems of each language. All of this makes syntax both a serious avenue to improve existing systems and a difficult solution to implement in practice.
Semantic analysis clearly remains a crucial perspective for the domain. We have seen that the systems thus far have mainly used large bilingual corpora as their main source of data. We have also seen that the integration of linguistic knowledge has brought very little benefit to the different systems so far. However, most experts in the field think that, despite everything, it will be necessary in the short or medium term to integrate semantic information in order to overcome current limitations.
In fact, semantic resources are already used, even by well-known available systems. For example, Google Translate integrates WordNet, a large lexical database developed at Princeton University for English. Google also uses other semantic resources depending on the language considered. Semantic resources may be useful to disambiguate ambiguous elements: for example, if the system has to translate “the tank was full of water,” it must decide if “tank” represents “a container that receives something” or “a military vehicle.” WordNet offers a long list of synonyms for “tank” as container, including the word “bucket.” The sentence “the bucket was full of water” is well attested (whereas “the armoured vehicle was full of water” is not nearly as attested), which leads the system to identify “tank” as representing a container (or, more precisely, a “reservoir”) in the sentence.
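The substitution test just described can be sketched as follows. Everything in this example is a stand-in: the small `SYNONYMS` table only imitates the kind of information a lexical database provides (it does not query WordNet), the figures in `CORPUS_COUNTS` are invented to mimic "well attested" versus "barely attested" sentences, and the function name `disambiguate` is hypothetical.

```python
# Hypothetical synonym table: (word, sense) -> possible substitutes.
SYNONYMS = {
    ("tank", "container"): ["bucket", "reservoir"],
    ("tank", "military vehicle"): ["armoured vehicle"],
}

# Invented attestation counts, standing in for lookups in a large corpus.
CORPUS_COUNTS = {
    "the bucket was full of water": 120,
    "the reservoir was full of water": 45,
    "the armoured vehicle was full of water": 1,
}


def disambiguate(sentence, word, synonyms, counts):
    """Choose the sense whose substitutes yield the best-attested sentences."""
    scores = {}
    for (lemma, sense), substitutes in synonyms.items():
        if lemma != word:
            continue
        # Replace the ambiguous word by each synonym and add up the attestations.
        scores[sense] = sum(
            counts.get(sentence.replace(word, substitute), 0)
            for substitute in substitutes
        )
    return max(scores, key=scores.get)


sense = disambiguate("the tank was full of water", "tank", SYNONYMS, CORPUS_COUNTS)
```

With these invented counts the container reading wins by a wide margin, which mirrors the reasoning in the text; with real data the margin would depend on the corpus and on how the substituted sentences are matched.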
The integration of large databases of synonyms makes it possible to group certain words (essentially nouns and verbs) into semantic classes, thus providing a more abstract representation. The system can then spot identical structures beyond surface divergence. The semantic analysis may also focus on identifying specific sequences, such as named entities (proper nouns, dates, etc.), again for the purpose of providing more general and more abstract representations. This is very close to the strategies we have already seen for example-based translation (see chapter 8).
A deeper semantic analysis could provide a representation of the semantic structure of the sentences to translate. This applies particularly to the verb, its arguments, and their role in the sentence (subject, object, or, even better, agent, patient, temporal argument, etc.). This corresponds to the vertex of Vauquois’ triangle (see chapter 3). Although many research groups currently focus on these issues in natural language processing, the performance of the resulting analyzers is still too low for them to be applied as they are to any text, as must be the case for machine translation systems. This remains an open line of research, and given the difficulty of the tasks it will most likely take several more years before efficient systems integrating a semantic analysis become available.