MTR menu
overview
previous page    next page
British Computer Society's coat of arms British Computer Society
Natural Language Translation
Specialist Group

URL: http://www.bcs-mt.org.uk/
WEB PAGE 6
Machine Translation Review
No. 2, October 1995.   ISSN: 1358-8346
http://www.bcs-mt.org.uk/mtreview/2/6.htm

 
 
MATCHING WORDS IN A BILINGUAL CORPUS

by Roger Garside
Department of Computing and UCREL,
University of Lancaster,
United Kingdom.

Abstract
The following account is based on a talk given to the Annual General Meeting of the Group on 6 July 1995. Readers are also referred to the paper 'The Use of Approximate String Matching Techniques in the Alignment of Sentences in Parallel Corpora' by A. M. McEnery, M. P. Oakes, and R. G. Garside (Clarke and Vella 1995). The work I am going to talk about is part of a project called CRATER (Corpus Resources and Terminology Extraction), which is being conducted at the University of Lancaster. A further project that is of interest here is UCREL, a joint research group involving the Department of Linguistics and the Computing Department of the University of Lancaster. When it was first set up nearly ten years ago, UCREL stood for Unit for Computer Research into the English Language. The title reflected the fact that the project would be concerned mainly with English. Since then the focus of the research has changed to include the matching of words in a bilingual English-French corpus. However, the acronym UCREL has been retained because it is familiar to most people (I believe it now stands for University Central Computing Corpus Research into Modern Language).


 

Corpus-based, probabilistic NLP versus rule-based NLPB
 
The NLP research that we do at Lancaster is corpus-based and probabilistic. This approach was fairly rare up until a few years ago, although it was the first to be proposed when NLP began in the 1940s and 1950s. The idea was that, given a large number of possible linguistic items and a more limited set of probabilities that certain items would actually occur, it would be possible somehow to use the probabilities of the context in order to arrive at an analysis of the linguistic structure. At the time the approach did not work because the computing and linguistic resources were simply not available: machinery and resources of any kind were very limited, but this was especially true for dictionaries and corpora that could be machine-processed. As a result, most researchers concentrated on rule-based natural language processing. The standard paradigm for NLP came to be that a group of linguists generated some sort of rule system that would be used to decide what the structure of the language was. Not until the early 1980s, with the emergence of much more powerful computing and linguistic resources, was this paradigm seriously challenged by the probabilistic model.
 
Underlying probabilistic NLP is the notion of a mathematical model that can be trained and to which probabilities can be assigned for conducting linguistic analysis. The standard — and very popular and successful — example of such a model is word tagging. Tagging involves assigning parts of speech to words in a given sentence. It is performed by training a Hidden Markov Model. This asserts that it is possible to generate sequences of parts of speech according to a given probability pattern. Thus, given a word in context which has a particular part of speech, the category of the following word is more or less probable. For example, the occurrence of either a noun or an adjective after an article is probable, but it is unlikely that the article will be followed by a verb. The model can also be used to predict particular words from given categories. Thus, given a part of speech, we can say the possible words within this category are such-and-such. The model is referred to as 'hidden' because, when analysing input words in sequence, it guesses on the basis of given probabilities what the parts of speech of those words might be: in this sense the categories are 'hidden'.
 
The advantages of probabilistic processing over rule-based analysis are three-fold:

  • the search space is constrained;
  • the process is robust in the face of language variability;
  • the model is trainable.

First, as far as search space is concerned, one of the main problems of performing any kind of natural language parse is the vast number of possible structures that is produced as a result of the inherent ambiguity in language. Even a simple sentence can generate thousands or millions of possible parses, and it is not clear which is the right one to choose. One of the things that the probabilistic mechanism does is to constrain the search space. In looking for the correct parse it will hopefully provide a home-in sector for isolating a small number of possibilities, in some cases narrowing down the choice to a single parse.
 
Second, a probabilistic system is robust in the sense that it will always return an answer. It may not give the right answer — of course, it often does not — but it will always give some sort of answer. It is important to chose the right model of probabilistic analysis and also to select the right corpus on which to train the model: models can be general in relation to the text or they can be specific to the genre or to the domain. The appropriate choice in each case will hopefully return more reliable probabilities and therefore a better result.
 
Third, an advantage of the probabilistic model is that it is trainable. In order words, there are mechanisms for starting the model and for setting probabilities either automatically or semi-automatically. One way is to start out by manually marking up a section of text, using this to generate the initial set of probabilities. There are also unsupervised ways of training the model, including a mechanism for working out the best set of probabilities; these probabilities are then rated repeatedly until a set is eventually arrived at that is actually used.


 

'Statistical' machine translation
 
There are various ways in which probabilistic or statistical NLP can be relevant to machine translation. One is corpus alignment. Another is the development of parallel lexica, especially for specialized subject areas.
 
Corpus alignment involves establishing pointers to matching text units, such as sentences, in translated parallel corpora. If, in a bilingual corpus, the alignment of individual sentences is known, then it is possible to determine automatically the correspondences between the individual words. The method operates as follows. Given an English-French bilingual corpus in which it is known only which sentences correspond to each other, it is presumed that a particular word in an English sentence must correspond to some word in the parallel French sentence. If the corpus contains enough sentences in which the same English word occurs, then it is reasonable to assume that the French translation equivalent will also occur in the parallel French sentences. The method by which the presence of such words may be calculated is known as mutual information and is based on the expectation that a particular English word and its French equivalent will occur more often than would be probable by chance alone. In most cases one would expect to see the same French translation for a given English word in the parallel French sentences. But clearly this is not always so in practice. It is more or less true for English and French but applies less to translations of English into highly inflective languages. In such cases the same word in English is unlikely to correspond to a single word in the other language.
 
Despite these drawbacks the relatively simple approach outlined above achieves remarkably good results. Indeed, researchers have often started out with an astonishingly simple statistical approach which has been subsequently refined by applying some linguistic knowledge.
 
To illustrate the approach, consider how purely statistical techniques can be used to extract multiword units. The technique is fundamentally the same for both monolingual and multilingual corpora. For, say, a monolingual English corpus in a subject domain, the first stage is to ask whether some of the words in the corpus occur next to each other more often than might be expected by sheer chance. For a multilingual corpus — English-French, for instance — the same question is asked for individual word pairs in each language. The next stage is to ask whether there is a pair of words in the English sentence that tends to correspond to a pair of words in the parallel French sentence. If two such pairs are found, then it is plausible that the word pairs are translations of each other. For the method to work the corpus for the domain must be very large, but the basic technique is the same. There are various statistical tests for establishing (a) whether a given pair of words occurs more frequently in English than in French, (b) whether a corresponding pair occurs more frequently in French than in English, and (c) whether both pairs occur more often than might be expected in the aligned sentences.
 
In the first phase of the CRATER project extensive use is made of statistical techniques. We determine the number of times two words occur together in a pair of sentences, the number of times one word occurs without the other, the number of times the second word occurs without the first, and the number of sentences in which neither word occurs at all. Given these four numbers various statistical tests are carried out to try to establish whether it is possible to extract the more interesting pairs of words.
 
The next stage is to refine the process by applying linguistic knowledge. By using a linguistic filter it is possible to concentrate the search on the most interesting word pairs. For example in French, useful pairs of words may have a particular sequence of categories: 'noun d', 'noun d noun', or 'noun d determiner noun', or 'noun adjective'. The corresponding pattern in English might be 'noun noun' or 'adjective noun'. In this way a common translation pattern can be established which states that the 'noun adjective' sequence in French corresponds to an 'adjective noun' sequence in English. The filter is applied on top of the purely statistical techniques. In effect the filter says: do not look at all possible sequences of words, only at those sequences which match a particular pattern. The same technique can be applied to three-word sequences. In the CRATER project we have found that, of the 500 two-word phrases ascertained by the statistical test as translations, about 80% are valid, that is, four out of five. Clearly the technique is not 100% successful and some manual work is required at the end.
 
Work on statistical machine translation proper began at IBM (Yorktown Heights), although the original team has since been dispersed. The project, called CANDIDE, was based on a corpus of English and French parliamentary texts contained in the Canadian Hansard. The object was initially to build a dictionary of translation equivalents. The first step involved taking a French sentence and guessing at some of the corresponding English words in the translation. For this to work it is important to maximize the probability of given French words corresponding to certain English words. There are two probability distributions involved here. The first is what sequences of English words are likely to occur (some sequences are more probable than others). The second is, given a French word, what are the likely English words generated from it? This is a well established and reliable mechanism and, although it is possible to improve considerably on the technique, the basic method remains the same. The approach can also be used to decide on the correct sense of a word in context. Thus by extracting statistically what object the French verb 'prendre' takes, it is possible to determine whether it should be translated as 'take' or 'make'.


 

Corpus alignment
 
As mentioned earlier, the object of corpus alignment is to identify the corresponding units of language in a translated parallel corpus. The unit may be a paragraph, a sentence, or a single word. Alignment at word level effectively means developing a bilingual lexicon. However, most current work is carried out at sentence level. Generating a sentence alignment for a bilingual corpus is a useful end in itself. Monolingual corpora have been used as aids in teaching English at Lancaster, while parallel multilingual corpora have possible applications in second language learning.
 
There are two reasons why corpus alignment is a non-trivial activity. First, there is in practice never a one-to-one correspondence between sentence units in bilingual corpora. If each sentence in one language corresponded to a single sentence in another language, then alignment would present no problem. The difficulty is that some sentences in one language will correspond to more than one sentence in the other. The correspondence will vary from genre to genre: it is less predictable in narrative fiction, for example, than in technical language. The absence of one-to-one correspondences means that we require a matching algorithm that aims to align the disparate text units.
 
The second reason why alignment is non-trivial is that entire sections of text will be missing. If, for instance, an attempt is made to align a large corpus on the basis of paragraph markers alone, it is likely that some paragraph markers or even parts of the text will be left out. There are such missing sections in the alignment performed by IBM on the Canadian Hansard. On a really large corpus containing tens of millions of words the process of alignment is far from straightforward. Although it is possible to go through a small corpus by hand and to identify the corresponding paragraphs, this is not feasible with large corpora.


 

The Gale and Church Algorithm
 
There are two well known algorithms for performing an alignment. One, called the Gale and Church Algorithm, is publicly available and is used at Lancaster. The second is the Kay and Röscheisen Algorithm. A third, less well known algorithm is used solely by IBM. The Gale and Church and Kay and Röscheisen Algorithms are described in the March 1993 issue of Computational Linguistics (Gale and Church 1993; Kay and Röscheisen 1993).
 
The Gale and Church Algorithm employs dynamic programming as an efficient search technique for establishing the optimal alignment. The basic method is as follows. The starting-point is a long sequence of words, that is, a text in language A (say, English), which has to be matched up with a text in language B (say, French). There is always a number of different possible alignments: for example, the first sentences in each corpus might match one another; the first of the following two sentences in English might match the second sentence in French; the fourth French sentence might match the fifth English sentence, and so on. The goal is always to find the best possible alignment. 'Best possible' here means best in a probabilistic sense.
 
There are two main elements of probability in the Gale and Church Algorithm. The first element has to do with sentence length and is based on the strong likelihood that a long sentence in English will correspond to a long sentence in French; similarly a short sentence in one language will correspond to a short sentence in the other. Roughly speaking, given that the average lengths of sentences in French and English are known, it is possible to set up a distribution of possibilities from this information.
 
The second element of probability is based on the lack of a one-to-one correspondence between sentences. A sentence in language A may correspond to zero, one, two, or even seven sentences in language B. In the Gale and Church Algorithm a given sentence in one language corresponds to one, zero or two sentences in the other language. Thus one can have two sentences in language A matching two sentences in language B, but one cannot have three sentences in language A matching two in language B. The only permitted matches are 1–1, 2–2, 1–0, 0–1, 1–2, or 2–1.
 
From their own data, Gale and Church assessed their algorithm as providing valid 1–1 correspondences in 89% of cases. Thus in nine out of ten cases the sentences corresponded 1–1. Gale and Church achieved these results with economics texts. On the Canadian Hansard they achieved 91%. The figure for 1–0 correspondences was 1%, for 1–2 correspondences around 9% (in either direction), and for 2–2 correspondences 31%. The probabilities in Gale and Church's original paper were obtained from hand-aligned data. On the basis of figures thus derived the authors counted what the probabilities should be for the distribution of sentences.


 

Evaluation of the Gale and Church Algorithm
 
Gale and Church give the following figures for the performance of their algorithm: 94.2% success for English to French, and 97.3% for English to German. After performing the alignment they gave it to speakers of French and German to check the correspondences and to comment on whether or not they believed them to be correct. The checkers were themselves tested with economics texts to ensure that their assessments could be relied upon.
 
The algorithm performs best on 1–1 alignments. Separated out from other alignments, error rates on 1–1 matches are 5%, 5–6%, and 3–4%. These results have led to claims that the algorithm works only on the 1–1 alignments and that it is practically useless on the others. However, the assessment depends on what the aims of the user are. If the object is to produce a fully aligned corpus, then clearly it will be necessary to go through and to correct manually the alignment that has been produced automatically. This was done in the CRATER project. On the other hand the user may not wish to have the entire corpus aligned. If he is using the corpus to develop a word lexicon, for instance, he may need only a limited number of parallel sentences, in which case he will throw away all the non1–1 alignments because the error rate for these is too high; for his purposes the corpus consists only of those sentences that are correctly aligned. At Lancaster we obtained 98% alignment for 1–1 correspondences in English–French texts and 93.2% for English-Spanish; the domain of the corpus was telecommunications and the translation a fairly literal one. Tests on various English-Polish parallel texts returned results in the range 65–85%.


 

The IBM Alignment Algorithm
 
The IBM Alignment Algorithm, which to my knowledge has not been made public, returns very similar results to the Gale and Church Algorithm in terms of its success in finding 1–1 alignments.
 
The first thing to note is that the IBM algorithm uses a different definition of length from that of Gale and Church. Clearly, long sentences in one language will be long in another language by whatever definition of length is applied. But whereas Gale and Church measure sentence length by characters, the IBM alignment method is based on tokens (words). There are arguments for both approaches, but it is safe to say that, in general, it is better to base sentence length on tokens rather than on characters.
 
The IBM researchers also adopted a different methodological approach. Whereas Gale and Church took some linguistic data, aligned it by hand, and then used the figures thus obtained in order to generate the values of the distributions, the IBM workers devised a method called unsupervised training. This assumes a mechanism for guessing initial probability estimates which are used to determine the likelihood of an alignment. These values are then used to generate some new (and more accurate) estimates. The process is repeated over and over again. Once the initial guestimates of the probabilities are decided on, an alignment is generated — the most likely one given the probabilities used. From this, new probability estimates are derived, and, in turn, a new and probably more accurate alignment is performed.
 
The method is not without its drawbacks: it requires large volumes of data and is very time-consuming. It is a hill-climbing technique, guaranteed to give a local minimum. Thus, if one thinks of the possibilities as a hilly surface, the method will ensure that one climbs the local peak.
 
While it will not necessarily ascend Everest, it will conquer at least some of the lesser summits. IBM concentrated on French-English because of the availability of the large (French-English) Canadian Hansard corpus. IBM claimed 99% success in correct alignment, although only the 1–1 correspondences were counted. It has also been suggested that the Canadian Hansard is a particularly literal translation, which would help account for the remarkably high alignment figure; the algorithm might work less well with a freer translation. It should be noted that IBM did not carry out a 2–2 alignment, whereas Gale and Church performed 1–1, 1–0, 1–2 and 2–2 alignments. It should also be noted that IBM performed a two-part alignment. This involved first aligning the texts at paragraph level and then discarding about 10% of the material that did not match, either because the paragraphs did not align or because parses and paragraphs were missing. The intention was to generate a parallel corpus with different menus for extracting various probabilities, ultimately for use in machine translation.


 

The Kay and Röscheisen Algorithm
 
The Kay and Röscheisen Algorithm is a much more complicated affair than the Gale and Church and IBM algorithms. It may characterized as a sort of relaxation technique. The method presupposes a so-called alignable sentence table (AST). The AST is a section of sentences in the corpus which may possibly match: thus, for instance, the first sentence in language A may match the first, second, or even the third sentence of language B; it will not, however, match the 99th sentence of language B. Similarly, the last sentence in language A is unlikely to match the first sentence in language B. Thus one can say that any particular sentence in language A is likely to be matched with only a selection of sentences in language B. If this is the case, one can also make some guesses at which individual words might be aligned. The technique operates as follows: if a particular word occurs in language A and another word also seems to occur in at least one of the sentences in language B, then it is possible to say that they might be translations of each other. In this way one generates from the initial AST a word alignment table (WAT) of words similarly distributed in the alignable sentence and therefore presumed to be translation equivalents. By assuming that certain words have been aligned, we assume that particular sentences must also be aligned: if a word is a translation of another word, then the sentences that contain those words are also likely to correspond. The result is a sentence alignment table (SAT), that is, a list of sentences containing words that have been aligned. Thus alignments of sentences are generated from alignments of words, although it must be noted that this applies only to sentences that are fairly certain to correspond. The process is then repeated, leading to adjusted versions of the initial alignable sentence table and the associated SAT. Once it is decided that some sentences are definitely aligned, then constraints begin to operate within the text. Since there can be no crossing over of alignments, it can be assumed that certain other sentences must match. If sentence 27 in language A matches sentence 29 in language B, then we can be confident that some of the sentences preceding the former must match some of the sentences preceding the latter; by the same token, one or two of the sentences that follow must match each other.
 
The Kay and Röscheisen Algorithm clearly works very well, but there are some problems with it. Since the authors have not provided a paper for it, we at Lancaster have written our own implementation. This has actually proved to be very difficult, and the program that we have produced operates very slowly. As a result we have not done very much with it. Kay and Röscheisen claim to achieve very good results with four iterations. Indeed, with very small files it is possible to get very good alignments. The advantage of the approach is, of course, that it does not just align whole sentences: it also aligns partial sentences and words. The other algorithms guarantee to produced an alignment of all sentences, even if the alignment contains 1–0 correspondences. The Kay and Röscheisen program does not necessarily do this, but it does produce a word alignment table.


 

Anchor points
 
The above techniques all make use of what may be called anchor points. These are positions in one text which, with a degree of certainty, seem to match up with positions in a parallel text. Obvious anchor points are paragraph markers. Given a very long text, there is going to be a large number of such markers, so it is unrealistic to assume a close alignment from these alone. In view of the increase in popularity of electronic publishing a proliferation of markers of all kinds is inevitable, regardless of the language involved. Thus if a text in language A contains a mark-up for a section in italic script, then the corresponding section in language B is likely to have the same mark-up. It is interesting to note that the IBM team performed a first-pass alignment at the paragraph level and then a second-pass alignment at the sentence level.
 
The IBM team also used elements of the text as anchor points. They were able to assume that the phrase 'Mr Speaker' in the English text of a parliamentary debate would correspond to the French 'M. le Président' and that the phrase 'some time later' would match 'temps plus tard', and so on. Such elements are highly specific to the corpus. Other elements might include numbers, proper names, and dates. It may even be possible to use certain features of punctuation as anchor points: an example is question marks, since a question in one language is likely to correspond to a question in the other language.


 

Cognates
 
 
Work on the CRATER project at Lancaster has concentrated on identifying word cognates as anchor points. The first question here is whether it is at all possible to find cognates by automatic methods. For instance, are there orthographic marks in verbs which represent tactic correspondences and which can be used to determine automatically whether, say, a 'telephono' in an Italian corpus matches the word 'telephone' in an English corpus?
 
CRATER stands for Corpus Resources and Terminology Extraction MLAP 93/20; partners in the project are UCREL (Lancaster), C2V and IBM (Paris), and the Universidad Autónoma (Madrid). The corpus used is the ITU (Intenational Telecommunications Union) corpus. The English and French parts of the corpus were processed separately from the Spanish section. The corpus contains about one million words in each language and the subject domain is telecommunications; since the corpus is the text of an official CCITT book, it is available in various languages. The English text was tagged using a system developed at Lancaster which assigns 150 parts of speech. The French text was processed using an IBM tagger developed in Paris; this assigns about 100 tags. The Spanish text has been tagged using a publicly available Hidden Markov Model tagger; its complexity is such that is employs about 500 tags. The English and French texts have already been manually post-edited; the Spanish text is in the process of being post-edited.
 
The first thing to note is that lexical cognates can be identified non-automatically. Working from English to Italian, for example, it is clear that in some cases English 'ph' appears as 'f' in Italian. This means that if English word contains the sequence 'ph', we may immediately consider the possibility that the sequence can be replaced by 'f' in Italian. Examples of French to Spanish include ç (c + cedilla) to 'z', 'c' or 'cc'. In certain phonemic contexts double consonants in French go to single consonants in Spanish: thus French 'recommander' becomes Spanish 'recomendar'. Finally, an initial 's' followed by a consonant in French might correspond to initial 'es' plus the consonant in Spanish: thus 'scène' could be matched to 'escena'.


 

The Dice Similarity Coefficient
 
Work at Lancaster has concentrated on approximate string matching, a widely used technique in information retrieval for natural language processing. One way of carrying out string matching is based on the Dice Similarity Coefficient. There are other methods, but the Dice method is the most popular. Essentially, the technique assumes a number of possibilities for the two items that are to be matched. Thus, there is a certain number of possibilities for the first element that is to be matched, and another number for the second element. The first stage is to count the total number of character bigrams (that is, sequences of two characters), both matched and unmatched, in a pair of words.
 
As an example consider the words 'telephone' in English and 'telefono' in Italian. The bigrams for English are: 'te', 'el', 'le', 'ep', 'ph', and so on. For Italian they are 'te', 'el', 'le', 'ef', etc. We count the total number of bigrams in both words, the total number of bigrams that match (such as 'te', 'el', etc), and apply the following formula:

2  (  bm
(((
bi

where bm is the number of matching bigrams, and bi is the total number of bigrams (both matching and non-matching) in the two words (that is, b1 + b2 + .... + bn). This provides a measure of the average match, that is, the number of real matches over the number of possible matches.
 
If the result of applying the formula is 1, then there is an exact match; if the result is 0, then there is no match at all. In our example the result is 0.47, which is a fairly average figure. For the two words in this particular case one might have expected a higher value. This would indeed be returned if we applied linguistic information and changed the Italian 'f' to 'ph': in this case the only bigram that would not match would be the final 'ne' and 'no'.
 
There are particular problems with three-letter acronyms. Our corpus contains a large number of three-character acronyms in one language which seem to match four-character words in the other language. The only way to eliminate this problem would be to start matching at four characters. Since the ITU corpus is lemmatized, it would be possible to match, not on characters, but on tokens and lemmas.
 
Here are some results of percentage accuracy rates for different dice coefficients:

Dice	0.4	0.5	0.6	0.7	0.8	0.9	1.0
Tokens    0   7	 22  49  82  97	100
Lemmas    0   5  23  48  71  95 100

 

Truncation
 
An even simpler technique used in information retrieval is truncation. This is based on the assumption that a pair of words are cognates if the words match on the first n characters. The technique does not work well for low values of n. One way of overcoming this problem is to generate stoplists. These are lists of 'false friends' ( that is, words that are known not to match beyond n characters ( for each value of n. At Lancaster we took a thousand pairs of words and looked through them to see where the technique worked and where it did not: for example, we would identify the cases in which, where n = 5, we would not want words longer than n to be seen as cognates.
 
The percentage rates for truncation without human intervention are as follows:

Length	3	4	5	6	7	8	9	>10
Raw	1	8	14	69	93	98	100	100
Stop	5	20	74	94	98	100	99	100

In this table the top row denotes the different truncation lengths, that is, the value of n chosen for the truncation; the bottom row is the accuracy rate achieved after the 'raw' results (given in the second row) are corrected against the stoplist of 'false friend' cognates. The table indicates that 98% of cognates in the corpus are found automatically and without human intervention by matching the first eight letters of each pair. It is important to note that these cognates are only possible alignments or translations of each other, not necessarily actual or real translations.


 

Dynamic Programming
 
A third method involves Dynamic Programming, which has been mentioned earlier in the context of the Gale and Church Algorithm. Dynamic Programming is based on the situation in which two elements partially match and in which it must be decided what has to be done in order to make the elements match completely. Essentially the same method is used in spell-checking. We begin by asking how much we have to change one element in order to match the other, and what the cost is of achieving the match. Thus, for instance, we might consider a word A to be a misspelling of word B. The objective is then to modify word A with minimal cost in order to make it match word B. There is a given repertoire of possible operations for modifying or rewriting a word, each of which is associated with a certain cost. Typical operations include insertion, deletion, and substitution. For example, to match the English word 'telephone' with the Italian 'telefono', we would have to substitute the 'p' for an 'f', delete the 'h', and substitute the 'e' for an 'o'. Assigning to each operation the cost value of '1' ( that is, each operation is considered to be equally costly ( we arrive at '3' as the total cost of making the match; this value is divided by the length of the longer word in order to arrive at a normalized score that can be used as a basis of comparison.
 
The percentage scores for the accuracy of the Dynamic Programming method are as follows:

Score      0.4  0.5  0.6  0.7  0.8  0.9   1.0
Accuracy     1    2   12   39   73   95   100

The technique can be refined by adjusting the costs of the modification operations in the light of the language pairs involved. For instance, the cost of substituting English 'ph' by Italian 'f' should be much lower than, say, for changing the 'ph' to an 'x' or an 'n'.


 

Restricted regions Up to now we have assumed that cognates are generated from the words in the entire text. A different approach is to restrict the region of search and to look for correspondences in particular areas of the texts to be aligned. Gale and Church distinguished between 'hard' regions (paragraphs) and 'soft' regions (sentences). Given an alignment at paragraph level or a preliminary alignment at sentence level, we might choose to limit the generation of cognates to just those regions. Here are some dice coefficients for matches in restricted regions:

Dice	0.4	0.5	0.6	0.7	0.8	0.9	1.0
All	0	7	22	49	82	97	100
Hard	5	33	66	94	96	99	100
Soft	15	69	71	97	100	100	100

 

The 'best match' criterion The 'best match' criterion was adopted by the IBM team working in Paris. It proceeds from the assumption that a word x in language A might match a number of different words in language B. Conversely, a word y in language B might match a number of different words in language A. We first identify the best of all the matches for x in language B and the best match for y in language A. If x and y turn out to be the best matches for each other, then these are the matches that are retained and the others are discarded. In this way the parallelisms are restricted to those that in an automatic way seem to afford the best possibilities.
 
The following table shows the percentage accuracy rate for both dice and the best match criterion.

Dice   All   Hard   H/BM   Soft   S/BM 
 0.4     0      5     26     15     41  
 0.5     7     33     72     69     93  
 0.6    22     66     81     71     90  
 0.7    49     94    100     97    100 
 0.8    82     96    100    100    100 
 0.9    97     99    100    100    100  
 1.0   100    100    100    100    100 

The figures show that the accuracy rate improves for hard and soft regions. The best results of all are obtained with soft regions, although the problem here is that there may not be enough matches to apply the 'best match' criterion; in this case it is necessary to balance the accuracy achieved with the number of cognates that actually turn up.


 

The incorporation of cognates in the Gale and Church Algorithm
 
In conclusion, I shall discuss how cognates can be incorporated into the Gale and Church Algorithm. The first stage is to obtain a set of possible cognate pairs using one of the mechanisms, such as the Dice Coefficient mentioned above. A particular cognate pair will be associated with a probability estimate which is a measure of the likelihood of its being the correct pair. The Gale and Church Algorithm also states that there is a probability that the sentences are in some sort of alignment, that is of 1–1, 2–1, and so on. There is also a third measure of probability: we can say that there are likely to be cognates in one language which occur in a parallel sentence, while there will also be cognates in the second language, where the expected match is not present.
 
The above approach has been tried out at Lancaster on an English-Spanish corpus. Although the processing has proved to be slower than expected, we can say that the accuracy of alignment has improved considerably. The technique is still being refined and has yet to be tested fully.


 
References
Machine Translation — Ten Years On, proceedings of an international conference held at the University of Cranfield, 12 – 14 November 1995:
Clarke, D. and Vella, A. (1995)
Cranfield University Press (forthcoming)

 
'A Program for Aligning Sentences in Bilingual Corpora' in Computational Linguistics, No. 1 (March), Vol. 19: 75–102
Gale, W. A. and Church, K. W. (1993)

 
'Text-Translation Alignment' in Computational Linguistics, No. 1 (March), Vol. 19: 121–42
Kay, M. and Röscheisen, M. (1993)