British Computer Society
Natural Language Translation
Specialist Group

URL: http://www.bcs-mt.org.uk/
Machine Translation Review
No. 2, October 1995.   ISSN: 1358-8346
http://www.bcs-mt.org.uk/mtreview/2/8.htm

 
A Corpus-based Bilingual Dictionary: Why and How?

by Marie-Hélène Corréard
Oxford University Press, United Kingdom.


 

Abstract

In the following talk I shall describe how the Oxford-Hachette French Dictionary (OHD) was written with the help of French and English corpora and how the use of this corpus data has helped the editors to make it a better dictionary.
 
First I shall look at the general set-up of the project. The dictionary was written by two teams of native speakers in one location (Oxford), each working in their own language and cross-checking with each other for accuracy. The method chosen to compile the dictionary, which is a completely new text and not a revision of an existing one, made good use of the most recent developments in language research combined with the traditional craft of the lexicographer. The dictionary was produced in three stages: compilation, translation, and editing; but its distinguishing feature was the recourse to corpus data at all stages, more particularly at the editing stage.
 
Next I shall examine why it was necessary to use corpora, what sort of corpora were available for editors, and when they were most useful. I shall also look at concrete examples of how we used the corpora.
 
No human being, even with all the time in the world, can be sure of remembering all the ways of using one word, let alone how other people use it. Traditional dictionaries, however well researched, have the same problem and do not always represent all the ways in which a word is used. In addition to this, languages evolve and grow. Some words appear while others go out of use. Simply because of the time it takes to write them, dictionaries do not always reflect these changes. Computerized corpora, however, allow editors to keep pace with them.
 
A dictionary of modern language should reflect everybody's use of language, not just the language of one particular editor or group of editors. Using a corpus allows editors to see how a broad and varied range of people have used the word they are describing.
 
The editors had access to two corpora of ten million words each, one in French and one in English. With the help of search and display tools they could, in a matter of seconds, draw out all the occurrences of a particular word. Further, they could customize the search to suit their needs by choosing whether they wanted to alphabetize the results on the keyword, on the word following it, or on the word preceding it, according to the type of entry and the language they were working on. These corpora included the text of letters, books, journals, and newspapers, as well as transcripts of conversations, lectures, and discussions.
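The kind of search described above can be illustrated with a small keyword-in-context sketch. It is not the tool the OHD editors used, which is not described here; the names `Hit`, `tokenize`, `concordance`, and `sort_hits`, the four-word context window, and the crude tokenizer are all assumptions made for illustration.

```python
import re
from typing import Dict, Iterable, List, NamedTuple


class Hit(NamedTuple):
    """One occurrence of a search form with its context (hypothetical type)."""
    left: List[str]    # words before the keyword, in text order
    keyword: str       # the form that matched
    right: List[str]   # words after the keyword, in text order
    source: str        # identifier of the text the citation came from


def tokenize(text: str) -> List[str]:
    # Crude lowercase word splitter; an assumption, not a real tokenizer.
    return re.findall(r"\w+", text.lower())


def concordance(texts: Dict[str, str], forms: Iterable[str],
                width: int = 4) -> List[Hit]:
    """Collect every occurrence of any of `forms` across `texts`."""
    wanted = {form.lower() for form in forms}
    hits = []
    for source, text in texts.items():
        tokens = tokenize(text)
        for i, token in enumerate(tokens):
            if token in wanted:
                left = tokens[max(0, i - width):i]
                right = tokens[i + 1:i + 1 + width]
                hits.append(Hit(left, token, right, source))
    return hits


def sort_hits(hits: List[Hit], by: str = "keyword") -> List[Hit]:
    """Alphabetize hits on the keyword, the following word, or the preceding word."""
    if by == "keyword":
        return sorted(hits, key=lambda h: h.keyword)
    if by == "following":
        return sorted(hits, key=lambda h: h.right)
    if by == "preceding":
        # Reverse the left context so the nearest preceding word is compared first.
        return sorted(hits, key=lambda h: h.left[::-1])
    raise ValueError(f"unknown sort order: {by}")
```

Sorting on the following word groups complementation patterns together (for example "depend on" next to "depended on"), while sorting on the preceding word groups modifiers and subjects; sorting on the keyword only matters when a search matches several forms.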
 
Deliberately chosen for their contemporary contents, our corpora were not suitable for illustrating words that are to be found more in classical literature than in the contemporary language. This does not mean that such words were not included, but simply that we had to use other sources to describe them, notably monolingual dictionaries and other corpora. The fact that a word was not in the corpus was not an argument for discarding it from the dictionary altogether. Conversely, the fact that a word was in the corpus did not mean it had to be in the dictionary; it could be a highly specialized term generated by an event much covered by the media, for example, the Challenger disaster and the subsequent enquiry. When working with corpus data, editors always had access to the source of the citations and were therefore able to moderate some interesting finds if it turned out that they were all from one author.
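The check on citation sources mentioned above can be sketched in a few lines. This is an illustrative assumption about how such a check might look, not a description of the editors' actual procedure; the function names, the `(author, text)` pair layout, and the threshold are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def author_share(citations: List[Tuple[str, str]]) -> Tuple[Optional[str], float]:
    """Return the most frequent author and the fraction of citations they account for.

    `citations` is a list of (author, citation_text) pairs (hypothetical layout).
    """
    if not citations:
        return None, 0.0
    counts = Counter(author for author, _ in citations)
    author, n = counts.most_common(1)[0]
    return author, n / len(citations)


def is_single_author_find(citations: List[Tuple[str, str]],
                          threshold: float = 1.0) -> bool:
    """Flag a usage whose citations come (almost) entirely from one author."""
    _, share = author_share(citations)
    return bool(citations) and share >= threshold
```

With the default threshold a usage is flagged only when every citation comes from the same author; a lower threshold would also flag usages dominated by one writer.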
 
The corpora were most useful for core vocabulary, new senses, and new words. We used corpus data for several tasks:

  1. when working on the source language: to improve the coverage of core vocabulary and to ensure that all essential complementation patterns (one of the greatest pitfalls for non-native speakers) were given;
  2. when working on the target language: to improve the quality of translations;
  3. when checking the finalized entry: to ensure that our user would be well equipped to use correctly all the information provided in as broad a range of contexts as possible; this was achieved by having lots of contexts against which to test both source and target language elements.

Finally I shall consider the results of using corpora in the compilation of the dictionary. In general it is true that the dictionary is a better product than it would have been without the use of corpora. The benefits can be summarized as follows:

  1. The dictionary is more comprehensive: the wordlist is more extensive; entries provide more examples of common phrases and the most common senses are covered.
  2. The dictionary is more accurate: it reflects more accurately the way French, for example, is used by native speakers, since all the example sentences are taken from real examples of French in our corpus.
  3. The dictionary is more reliable: all the translations are authentic because they have been checked in the corpus.
  4. The dictionary is more user-friendly: all the material the user needs is to be found in the entry; furthermore the translations have been checked against a wide range of real contexts in which they might be used.
  5. The dictionary is safer: because it is possible to verify all the usages in real language, the restrictions that accompany certain usages can be clearly indicated; all essential complementation patterns are shown.
  6. The dictionary is more up-to-date: the corpus was updated throughout the entire editing period and new text was added up to the last minute.

In conclusion we may note that we have used corpus data for other tasks, notably for writing lexical notes and compiling and editing grammar words. The use of corpora brings a new dimension to lexicography, making dictionaries more reliable and more comprehensive. It is particularly valuable for bilingual dictionaries because the same corpus is used for different tasks at the various stages of the actual writing of the dictionary. Also it enables editors to cater better for all the different users of a bidirectional dictionary. Since the publication of the unabridged OHD, we have published a unidirectional dictionary; once again, the corpora were essential for selecting the most important items to be represented in the dictionary.