7 Parallel Corpora and Sentence Alignment

The 1980s saw an increase in the number of available electronic texts—texts directly accessible by computers. Among these texts, some were translations of each other and could therefore be “aligned,” or matched at paragraph or sentence level. These aligned texts have always been an invaluable source of knowledge for human translators. It soon became clear, however, that automatic tools could also benefit from these new data. In fact, this increasingly large amount of data revived the field and, most importantly, completely revolutionized the approach used for machine translation.

The Notion of Parallel Corpora or Bi-texts

A parallel corpus is a corpus composed of a set of pairs of texts in a translation context. An aligned parallel pair of texts is called a bi-text, from bilingual text, while a multi-text includes multiple aligned translations.

This type of resource is popular among professional human translators. It is actually an extremely valuable source of knowledge: more than a bilingual dictionary, previous translations can provide examples of relevant translations according to the context. In the case of technical translation, the translator must generally use the terminology, the phraseology, and the style of previous translations for reasons of uniformity. It is therefore essential to have access to past translations. It is also important to know in which direction the translation was carried out (i.e., the source language) since the source is by definition the reference text.

Human translators thus generally have access to past translations through a tool called a “translation memory.” A translation memory module makes it possible to store and retrieve fragments of translation from past work, generally through a powerful search engine. Past translations can be analyzed and tagged before being stored, which makes it possible to query the translation memory powerfully, rather than with basic keywords. There are several tools like this available on the market, primarily for professional translators.

Bi-texts were perceived very early on as an important source of knowledge for machine translation. Translation memories contain relevant translation fragments, since these tools store past professional translations. Beyond that, more and more bilingual texts are available on the Internet, so one can imagine developing systems based solely on bilingual data collected from the Internet. Today, this is in fact the dominant approach in the field of machine translation.

Two types of approaches can be distinguished. On the one hand, the analysis of existing translations and their generalization according to various linguistic strategies can be used as a reservoir of knowledge for future translations. This is known as example-based translation, because in this approach previous translations are considered examples for new translations. On the other hand, with the increasing amount of translations available on the Internet, it is now possible to directly design statistical models for machine translation. This approach, known as statistical machine translation, is the most popular today.

Unlike a translation memory, which can be relatively small, automatic processing presumes the availability of an enormous amount of data. Robert Mercer, one of the pioneers of statistical translation,1 proclaimed: “There is no data like more data.” In other words, for Mercer as well as for the followers of the statistical approach, the best strategy for developing a system consists in accumulating as much data as possible. These data must be representative and diversified, but as these are qualitative criteria that are difficult to evaluate, it is the quantitative criterion that continues to prevail. In fact, it has been shown that a system’s performance regularly improves as more bi-texts become available for its development.

Availability of Parallel Corpora

There are two major sources of bi-texts. On the one hand, there are corpora already available in two or more languages; these may or may not already be aligned. On the other hand, for pairs of languages without adequate corpora, techniques have been developed to build such corpora automatically, generally by collecting texts available on the web.

Existing Corpora

There are well-known sources of parallel texts. For example, the majority of countries and institutions that have several official languages have to produce official texts (legislative texts for example) in each of these different languages. This is generally a source of very valuable bilingual corpora, since the translation must be an accurate copy of the original. But since these texts are for the most part associated with legislative and legal fields, machine translation systems based on these data may not be very accurate for other domains or other genres.

The first experiments regarding the alignment of texts, in the 1980s, drew heavily on the Canadian Hansard, which records the official transcripts of Canadian parliamentary debates. The Canadian Hansard is aligned at text level and also at sentence level, which means that this corpus is an invaluable source of knowledge when translating between French and English (see figure 3).

Figure 3 An extract from the Hansard corpus aligned at sentence level.

Other corpora of the same type are available today, particularly corpora made of texts produced by European institutions. The European context is by nature highly multilingual and has already produced several invaluable resources, such as the Europarl corpus and the JRC-Acquis corpus. Both of these corpora include more than 20 European languages. These corpora have been intensively used by machine translation systems and are easy to use since they are already aligned at text and paragraph level, as well as occasionally at sentence level. They consist of several tens of millions of words for each language, but the size varies a great deal according to the language or the pair of languages under consideration (for example, Europarl contains 11 million words for Estonian, 33 million for Finnish, and 54 million words for both English and French).

Many other corpora exist, especially for other families of languages, although it is generally the “strongest” languages (meaning the most widely represented on the Internet) that are the most popular. However, these corpora are not always sufficient: most languages have very few or even no resources at all to develop a system. In this context, it is necessary to develop new corpora, and this is usually done on the web.

Automatic Creation of Parallel Corpora

Researchers early on sought to exploit the mass of texts available on the Internet to complement existing resources. The web is actually far more diverse than the existing parallel corpora, which, as already mentioned, are for the most part related to the legal domain. The techniques for “harvesting” high-quality bilingual texts on the web are relatively simple. A harvesting system generally includes a “robot”—that is, a program capable of browsing the web by bouncing from page to page, following the links found on each webpage. Then, for each webpage, the system checks the language used and whether an equivalent page exists in the target language.

The system begins with the rarer language of the two. For example, if the aim is to develop a bilingual corpus between Greek and English, it seems more appropriate to begin with websites written in Greek, which are fewer in number than websites in English. It should also be noted that few pages in English have a corresponding page in Greek, whereas the opposite situation is more likely; for example, websites of English universities rarely have a translation in Greek, whereas websites of Greek universities often have a translation in English.

For each website or page, two tricks can be used. First, the system searches for an equivalent at the level of the website address (URL). For example, if one site corresponds to the URL http://my.website.com/gr/, the system will look for an equivalent such as http://my.website.com/en/—that is, a “mirror site” in the target language, identified by its URL. If this first strategy does not work, the system can search each webpage for a link to the corresponding page in the target language, since multilingual sites often make it possible to navigate from one language to the other (these links are often identifiable through small icons featuring the flag or symbol of the country associated with the target language). Once two websites have been identified as translations of each other, the system has to check the correspondences at the level of individual webpages. Several tools can be applied to verify the language of the identified webpages if this was not done earlier. Then one can compare, for example, the length of the documents (if two documents are of very different sizes, they are probably not translations of each other), their HTML structure (the two files should share the same structure), and so on.
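
To make these two tricks concrete, here is a minimal sketch in Python. The function names, the fixed language codes, and the length threshold are assumptions made for this illustration; a real harvester would in addition fetch the pages, detect their language, and follow in-page language links.

```python
from urllib.parse import urlparse


def mirror_url(url, source_code="gr", target_code="en"):
    """Guess the URL of the mirror page by swapping the language segment of the path."""
    parsed = urlparse(url)
    segments = parsed.path.split("/")
    if source_code not in segments:
        return None  # no language segment in the path: this trick does not apply
    segments = [target_code if s == source_code else s for s in segments]
    return parsed._replace(path="/".join(segments)).geturl()


def plausibly_parallel(source_text, target_text, max_ratio=1.5):
    """Crude filter: two documents whose lengths differ too much are
    probably not translations of each other."""
    len_src, len_tgt = len(source_text.split()), len(target_text.split())
    if min(len_src, len_tgt) == 0:
        return False
    return max(len_src, len_tgt) / min(len_src, len_tgt) <= max_ratio


# Example: derive the English "mirror site" of a Greek page.
print(mirror_url("http://my.website.com/gr/"))
# -> http://my.website.com/en/
```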

These techniques have little to do with linguistics, but, when applied at web scale, they allow for the extremely rapid development of large corpora in numerous languages from scratch. If the content of the selected webpages is closely monitored (for example, by starting with a list of specific URLs and then retrieving only those webpages that contain specific keywords), it is furthermore possible to obtain specialized corpora for different domains at a lower cost. Nonetheless, one must keep in mind that the process is entirely automatic: it guarantees neither the representativeness of the data nor the quality of the identified sources. In fact, nothing guarantees the quality of the bi-texts obtained in this way. However, quantity compensates to a large extent for quality: a given website may offer poor translations, but the consequences will be limited, since one can expect a multitude of other websites to offer “good” translations, which means that the bad translations will be statistically negligible and have no influence in the end. For the same reasons, a literary translation that is unique and original will also be discarded, because it will not be statistically significant among all the other translation possibilities. This is not really a problem for machine translation, which looks for standard equivalents and does not attempt to produce originality.

Still, the limitations of this approach must be noted. Not all languages are well represented on the Internet, and this is especially true when searching for bilingual texts. In practice, in the majority of existing corpora, one of the languages is English, which increases the influence of this language. Despite the amount of available data, it is difficult to harvest enough material to develop a quality bilingual corpus if neither the source nor the target language is English. I will return to this subject in chapter 11.

Once the corpus has been built, it is necessary to align it at paragraph or sentence level, or both, for it to be usable by machine translation systems.

Sentence Alignment

In nearly all languages, a sentence is a linguistic unit that is syntactically and semantically autonomous (as opposed to a phrase or any other nonautonomous group of words). Consequently, natural language processing is often based on the notion of the sentence—particularly machine translation, which operates generally sentence by sentence, each being considered independently from the others.

Sentence alignment flourished toward the end of the 1980s and in the 1990s, during a time when more and more corpora were becoming available. Several kinds of applications using this type of resource were also beginning to appear: machine translation, of course, but also other multilingual applications, for example multilingual terminology extraction.

Sentence alignment is generally based on specific features of bi-texts: it is assumed that the translation generally follows the structure of the original text and that the sentences usually follow one another in the same order in the source text and the target text. Furthermore, one can define a length ratio between two languages (for example, in terms of number of words, a French text is generally 1.2 times longer than the corresponding English text). The relative length of sentences was the first criterion explored for sentence alignment. The first experiments in the domain were carried out on the transcripts of the Canadian Parliament, since this corpus is of extremely good quality and the translations stay very close to the original texts, contrary to what is usually found on the Internet.

Alignment Based on Relative Length of Sentences

A simple strategy for sentence alignment is to observe, first, that the sentences of a text vary in length, and second, that there is usually a good correlation between sentence lengths in the source text and sentence lengths in the target text. One can try to align the sentences on this basis, that is, by observing the relative length differences in the source text and looking for similar patterns in the target text. To avoid spreading alignment errors (a mistake in a given place that propagates to the rest of the text), it is necessary to proceed somewhat globally, not just sentence by sentence. One way to do this is to find specific patterns in the source text and check whether the same patterns can be found in the target text. This way, it is possible to find “islands of confidence,” that is, relatively reliable configurations distributed throughout the text.
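
As a simple illustration of these islands of confidence, the following sketch looks for a distinctive pattern (a very short sentence immediately followed by a very long one) in both texts; the thresholds and the expected length ratio are arbitrary values chosen for the example.

```python
def distinctive_positions(lengths, ratio=1.0, short=8, long=25):
    """Return positions i where sentence i is very short and sentence i+1 is very long."""
    return [i for i in range(len(lengths) - 1)
            if lengths[i] <= short * ratio and lengths[i + 1] >= long * ratio]


# Sentence lengths (in words) of a source text and of its translation,
# the latter being on average about 1.2 times longer.
src_lengths = [12, 30, 5, 33, 18, 7]
tgt_lengths = [14, 34, 6, 38, 15, 11, 8]

print(distinctive_positions(src_lengths))             # -> [2]
print(distinctive_positions(tgt_lengths, ratio=1.2))  # -> [2]
# The pattern occurs at position 2 in both texts: this position can serve as
# a reliable anchor from which the rest of the alignment can proceed.
```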

Let’s imagine a text composed of a given number of sentences, along with its translation. In the figure below, which shows both texts, each cell is a sentence, and the number in each cell gives the number of words in that sentence:

Figure 4 Two texts of different length. Each cell with a number n corresponds to a sentence of length n.

We can see that the two texts do not have exactly the same number of sentences. The first three sentences have a relatively similar number of words and can therefore be linked together (note that the target language seems to systematically use a slightly larger number of words per sentence than the source language).

Figure 5 Beginning of alignment based on sentence length.

The same applies for the end of the text and some specific patterns in the text (for example, two consecutive sentences whose lengths are very different).

Figure 6 Other possible simple alignments.

Finally, the system tries to “bridge the gaps” by establishing links between the source and the target text, so as to obtain a fully connected bi-text (each sentence in the source language must be linked to one or sometimes two sentences in the target language). To do so, the system may have to resort to “asymmetric alignments,” that is, connecting one sentence in the source text with more than one sentence in the target text.

Figure 7 Alignment of remaining sentences.

Our example is clearly simplified. There are many ways of accomplishing a dynamic alignment: for example, by identifying the shortest and longest sentences or the differences in length between adjacent sentences; by first calculating the length of groups of sentences rather than of individual sentences; and so on.

Gale and Church (1993) applied this type of algorithm to the Hansard corpus (Canadian Parliament texts), obtaining an error rate of about 4% (i.e., 4% of the sentences were wrongly aligned). They show that this rate can even be lowered to less than 1% if only one-to-one mappings between sentences are taken into account (i.e., if we keep only single sentences in the source text that correspond to single sentences in the target text, which means that asymmetric alignments lead to more errors). They also show that the 1–1 relation corresponds to more than 89% of the sentences in the source text, about 9% correspond to 1–2 or 2–1 relations (i.e., one sentence in the source text is connected to exactly two sentences in the target language, or vice versa), and the other cases (i.e., a sentence that is not translated, or one sentence translated by three or more sentences) are very marginal.
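
To give a flavor of how such an alignment can be computed, here is a minimal sketch of a length-based dynamic program in the spirit of Gale and Church's algorithm. It is not their original probabilistic model: the cost function below is a deliberately simplified stand-in that penalizes the deviation of the target length from the expected length, and the expansion ratio and the penalty for asymmetric alignments are arbitrary illustration values.

```python
def align(src_lengths, tgt_lengths, ratio=1.2, asym_penalty=3.0):
    """Align two texts given as lists of sentence lengths (in words).
    Returns a list of (source sentence indices, target sentence indices) pairs."""
    n, m = len(src_lengths), len(tgt_lengths)
    INF = float("inf")
    best = [[INF] * (m + 1) for _ in range(n + 1)]   # best cost to align prefixes
    back = [[None] * (m + 1) for _ in range(n + 1)]  # back-pointers for recovery
    best[0][0] = 0.0

    def cost(src_sum, tgt_sum):
        # Penalize the deviation of the target length from the expected length.
        return abs(tgt_sum - ratio * src_sum)

    # Allowed moves: (source sentences consumed, target sentences consumed, penalty).
    # 1-1 alignments are free; 1-2 and 2-1 alignments pay a fixed penalty.
    moves = [(1, 1, 0.0), (1, 2, asym_penalty), (2, 1, asym_penalty)]

    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == INF:
                continue
            for di, dj, penalty in moves:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                c = best[i][j] + penalty + cost(sum(src_lengths[i:ni]),
                                                sum(tgt_lengths[j:nj]))
                if c < best[ni][nj]:
                    best[ni][nj] = c
                    back[ni][nj] = (i, j)

    # Recover the alignment by following the back-pointers from the end.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj = back[i][j]
        pairs.append((list(range(pi, i)), list(range(pj, j))))
        i, j = pi, pj
    return list(reversed(pairs))


# Toy example: the second source sentence is rendered as two target sentences.
print(align([10, 22, 8], [12, 14, 13, 9]))
# -> [([0], [0]), ([1], [1, 2]), ([2], [3])]
```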

The great advantage of this approach is its simplicity and its relative robustness. It has been shown that the method works well for different pairs of languages: it is transferable and completely independent of the languages considered. It can even be applied to nonalphabetical languages with syllabic or ideographic writing systems, such as some Asian languages. This robustness must nevertheless be qualified, since performance worsens when the translation is not as faithful as in the case of the Canadian Parliament transcripts, which form an exceptionally good corpus in this respect. The method can also suffer from error propagation (one misalignment leading to other misalignments, creating a cascading effect), even if the dynamic approach described above is meant to mitigate this problem.

Different strategies have been devised to limit the problem of cascading misalignments. One way consists in first finding homogeneous portions of text made of several sentences. Paragraphs are the most obvious units between the text and the sentence level, and they have been used with some success for this purpose. Additionally, most texts now come from the web, which means they contain HTML or other explicit tags that can be used for alignment, since the target text may have the same structure as the source text. Finally, it is also possible to locate similar words in the original text and in the translation, which helps in finding what are called correspondence points. This approach is said to be lexical, since it is based on the analysis of part of the lexicon.

Lexical Approach

Several studies have proposed strategies to align sentences based on lexical correspondences. This strategy is less generic than those described so far, but it is relatively efficient, especially between linguistically related languages.

If one considers a given bi-text (i.e., a pair of texts such that one is a translation of the other), one can often observe similar or nearly similar strings referring, for example, to person names, locations, and more generally proper nouns. These lexical correspondences are generally called cognates. Other elements can play a similar role, especially numbers, acronyms, and so on. Typography can also help identify related words, for example words set in bold or in italics.

Figure 8 Two texts in a translation situation. Although the content of the texts is unknown (here represented by “xxx” and “yyy”), some words are identical or similar and can help determine reliable correspondence points.

All of these elements can be used to identify correspondence points between the source and the target texts. Sentence alignment is then calculated by resorting to dynamic programming, in a manner similar to what is done for alignment based on sentence length. Pairs of sentences with several correspondence points are most probably translations of each other. The process is applied iteratively until there is nothing left to align.
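
The following sketch illustrates how such correspondence points can be detected. The cognate test used here (two words are candidates if they share their first four characters, or if they are identical numbers or identical long strings) is a deliberately rough assumption for the example; real systems refine it considerably before feeding the resulting points into a dynamic program such as the one sketched above.

```python
def is_cognate(w1, w2, prefix=4):
    """Very rough cognate test based on identity or on a shared prefix."""
    w1, w2 = w1.lower(), w2.lower()
    if w1 == w2 and (w1.isdigit() or len(w1) >= prefix):
        return True  # identical numbers, acronyms, proper nouns, etc.
    return len(w1) >= prefix and len(w2) >= prefix and w1[:prefix] == w2[:prefix]


def correspondence_points(src_sentences, tgt_sentences):
    """Return (i, j) pairs of sentences sharing at least one cognate."""
    points = []
    for i, src in enumerate(src_sentences):
        for j, tgt in enumerate(tgt_sentences):
            if any(is_cognate(a, b) for a in src.split() for b in tgt.split()):
                points.append((i, j))
    return points


src = ["Le professeur Smith arrive le 12 mars.",
       "Il parle de traduction automatique."]
tgt = ["Professor Smith arrives on March 12.",
       "He talks about automatic translation."]
print(correspondence_points(src, tgt))
# -> [(0, 0), (1, 1)]
```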

Mixed Approaches

It is, of course, possible to combine the two approaches in order to define a system based on both lexical cues and sentence length. On the one hand, cognates alone are rarely sufficient for aligning two texts. On the other hand, sentence length is generally a good feature for alignment, but it can happen that several consecutive sentences have a similar length. The idea is therefore to gather as many cues as possible between sentences to reinforce confidence in the different local alignments.
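
A minimal sketch of such a combined score is given below; the weights and the cognate test are arbitrary illustration values, and a score of this kind could replace the purely length-based cost in the dynamic program sketched earlier.

```python
def mixed_cost(src_sentence, tgt_sentence, ratio=1.2, cognate_bonus=2.0, prefix=4):
    """Score a candidate sentence pair: length mismatch minus a bonus for shared cognates."""
    src_words = src_sentence.lower().split()
    tgt_words = tgt_sentence.lower().split()
    # Length cue: how far the target length is from the expected length.
    length_cost = abs(len(tgt_words) - ratio * len(src_words))
    # Lexical cue: number of word pairs sharing their first few characters.
    shared = sum(1 for a in src_words for b in tgt_words
                 if len(a) >= prefix and a[:prefix] == b[:prefix])
    return length_cost - cognate_bonus * shared


# The lower the score, the more likely the two sentences are translations.
print(mixed_cost("Il parle de traduction automatique.",
                 "He talks about automatic translation."))
# -> -1.0
```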

Sentence alignment was a particularly active research topic in the 1990s. Researchers explored various cues, especially the structure of HTML documents, as seen above; titles, frames, and icons were also used as features for the task. As a result of this research effort, the number of available bilingual corpora exploded during that decade. These new resources paved the way for example-based translation and then for statistical translation, which is now the dominant paradigm in the field. The following chapters will discuss how these resources have been exploited in order to produce more robust and more reliable translation systems than those developed previously.

Note