Segment-Based Machine Translation

Toward Segment-Based Machine Translation

The IBM models have been subjected to numerous enhancements. The most significant improvement was to take into consideration the notion of segments (or sequences of words) in order to overcome the limitation of a simple word-for-word translation. Among the other improvements, the notion of double alignment is worth mentioning, since it greatly increases the quality of the search for translations at word level in bilingual corpora.

Double Alignment

The IBM models are able to recognize correspondences such that, for one word in the source language, there is 0, 1, or n words in the target language. However, the original IBM models, because of their formal basis, do not make it possible to obtain the opposite correspondences (in other words, one word from the target language cannot correspond to a multiword expression in the source language, for example). This is a strong limitation of these models that has no linguistic basis, since multiword expressions clearly exist in every language. It therefore seemed necessary to overcome this limitation imposed by the IBM models in order to allow for m-n alignments (where any number of words from the source language corresponds to any number of words in the target language).

The original IBM article specifically mentioned the example shown in figure 17, which the models proposed in 1993 were unable to handle.

Figure 17 Example of an alignment that is impossible to obtain from IBM models. The sequence “*don’t have any money*” corresponds to the group “*sont démunis*” in French: this is an example of an *m-n* correspondence (here, m=4 and n=2 such that four English words correspond to two French words, if we consider “don’t” as a single word).

One way of overcoming this problem is to first calculate the alignments from the source language into the target language, then repeat the operation in the opposite direction (from the target language into the source language). The shared alignments are kept, namely those concerning words that have been aligned in both directions. The alignments obtained using this technique are generally precise but provide a low coverage of the bi-texts. Globally, this method has two major defects: first, the process is more complex than a simple alignment, and is therefore more costly in terms of computation time; second, at the end of the process, a large number of words are no longer aligned because the constraints imposed by the double-direction analysis cause numerous alignments identified in a single direction to be rejected. Various heuristics then have to be used to expand the alignments to neighboring words in order to compensate for the coverage problems incurred (the double alignments, or “symmetric alignments,” can be seen as “islands of confidence”; see chapter 7).

It has been shown that this method improves the results of the original IBM models. However, in order to obtain a good coverage of the data with these models, it is necessary to have huge quantities of data, which makes them impractical in some circumstances.

Introduction of Linguistic Information into Statistical Models

Statistical translation models, despite their increasing complexity to better fit language specificities, have not solved all the difficulties encountered. In fact, bilingual corpora, even large ones, remain insufficient at times to properly cover rare or complex linguistic phenomena. One solution is to then integrate more information of a linguistic nature in the machine translation system to better represent the relations between words (syntax) and their meanings (semantics).

Alignment Models Accounting for Syntax

The statistical models described so far are all direct translation systems: they search for equivalences between the source language and the target language at word level, or, at best, they take into consideration sequences of words that are not necessarily linguistically coherent. As we have seen, a strategy that takes into account the fragments identified on a purely statistical basis is more efficient than a strategy that only retains linguistically motivated phrases (e.g., noun phrases or verb phrases).

Several attempts have been made, nonetheless, to take better account of syntax in the machine translation process. This can be done through the integration of parsers, also known as syntactic analyzers: a parser is a tool that tries to automatically identify the syntactic structure of the sentences to be analyzed. If we recall Vauquois’ triangle (see figure 2 in chapter 3), syntax makes it possible to take into account the relations between words. For example, there may be a relation between distant words in the sentence that is very difficult to spot with a purely statistical model. Parsers, however, are theoretically able to consider the relations between the words, even in cases where those words are not adjacent on the surface. In Vauquois’ words, syntax involves transfer rules between equivalent syntactic structures, not a word-for-word translation.

From a theoretical perspective, this type of approach can better analyze discontinuous morphemes that may be common in some languages (as in the example in German above, “Ich habe das Auto gekauft,” where the analyzer should identify “habe gekauft” as a whole despite the distance between the two words). Syntax can also help represent the connection between a preposition and the following noun phrase, or between the verb and its arguments (subject, object, etc.) that are frequently distant from one another.

As already said, this approach requires “parsers” (automatic syntactic analyzers) for the various languages to be processed. However, these are complex tools, far from perfect, whose quality varies considerably depending on the language considered. Applying such tools to machine translation is thus a complex operation.

One basic approach for the integration of a parser in a machine translation system is to analyze the structure of the source sentence (so as to produce a “syntactic tree”) and then, for each level of the syntactic tree, try to determine an equivalent structure in the target language. As one can imagine, identifying equivalent structures between different languages is a daunting challenge that results in a lot of “silence” (in other words, many of the sentences to be translated include structures that will never have been observed up to that point in the training data; i.e., in the available aligned bilingual corpora). It is therefore necessary to imagine a process that makes it possible to translate even if the system cannot analyze the whole sentence properly from a syntactic point of view. Different strategies have been tried, especially “generalization tricks” inherited from the example-based paradigm, where one tries to find a similar structure if a sentence or a part of a sentence cannot be properly analyzed.

It also happens, for some language pairs, that a parser exists for one language (the source or the target language) but not for the other one. Some experiments have been done to integrate the parser only on one side of the translation process, with mixed results.

Globally, the idea of integrating syntax in the translation process seems, of course, promising. However, the approach is still in its infancy and, for the time being, has not obtained better results than the simpler models using direct translation techniques based on segments. The first reason is certainly the highly variable performance of syntactic parsers. Moreover, if the tool makes an error, it will “percolate” throughout the entire translation process. The “silence” issue (i.e., syntactic structures for which the parser is unable to provide a relevant analysis or for which there is no direct equivalent in the target language) is the other main cause of this limited performance. The problem is in a way logical: things are often phrased differently in different languages. It is thus not surprising that no directly equivalent structures can be found in the target language. This is surely a serious challenge for the integration of syntactic parsers in the translation process.

So far, syntax seems to be especially useful in certain specific contexts or for certain languages like German, where the verb is frequently broken into two elements as we have seen. The main benefit of the approach is that it performs a local and limited syntactic analysis to resolve some of the specific problems of each language. All of this makes syntax both a serious avenue to improve existing systems and a difficult solution to implement in practice.

10 Segment-Based Machine Translation

Toward Segment-Based Machine Translation

Double Alignment

The Generalizations of Segment-Based Machine Translation

Introduction of Linguistic Information into Statistical Models

Alignment Models Accounting for Syntax

Alignment Models Accounting for Semantics