British Computer Society
Natural Language Translation
Specialist Group

URL: http://www.bcs-mt.org.uk/
Machine Translation Review
No. 2, October 1995.   ISSN: 1358-8346
http://www.bcs-mt.org.uk/mtreview/2/7.htm

 
Lexical Resources for MT: a Survey

by Adam Kilgarriff
Research Fellow
Information Technology Research Institute,
University of Brighton.


 

Abstract
 
The following survey first explains why we should be interested in monolingual as well as bilingual dictionaries. It then discusses the potential providers and the sorts of resources they have, such as the following:

  • machine translation companies;
  • dictionary publishers;
  • CD-ROM dictionaries;
  • NLP research groups working on machine-readable dictionaries;
  • other NLP and psycholinguistic lexicographic work.
The survey concludes with an annotated list of sources.


 

Bilingual and monolingual, 'direct' and 'transfer'
 
At a first pass it might appear that it is bilingual dictionaries alone which will be of interest to machine translation (MT). The task is to get from one language to the other, so why would an account of a word given in the same language as the original be of any use? This line of argument is akin to the earliest, 'direct' approach to machine translation: you start from the premise that you simply look the words up in a bilingual dictionary and then swap source word for target word; you do something more sophisticated only when that fails. The shortcomings of this approach are well known (see Hutchins and Somers 1992:71–77 for a discussion), and most recent systems have been 'transfer' systems. Such systems do some grammatical analysis of the source language text in order to produce an intermediate representation (usually a tree structure); they then transfer this into an equivalent target language representation, and from this they generate the target language text. In such a scenario the relevance of monolingual dictionaries becomes evident. If they provide the best grammatical descriptions of words, then they will be the best resources to use for the analysis and generation stages of the process, with the bilingual dictionary being used just for the transfer. Thus MT systems need bilingual lexicons for stating mappings between lexical items in the source and target languages (or, for interlingua systems, between the interlingua and each language). But they also need a source for language-specific facts about words, possibly a monolingual dictionary.
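
The contrast between the two approaches can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: the word lists, the function names and the single reordering rule are all invented here, and a flat list of tagged words stands in for the tree structure that a real transfer system would build.

```python
# Toy data: every entry below is invented for illustration only.
BILINGUAL = {"the": "le", "black": "noir", "cat": "chat", "sleeps": "dort"}

# Monolingual facts about source-language words (here just part of speech).
SOURCE_LEXICON = {"the": "DET", "black": "ADJ", "cat": "NOUN", "sleeps": "VERB"}


def direct_translate(words):
    """'Direct' approach: swap each source word for its target word."""
    return [BILINGUAL.get(w, w) for w in words]


def analyse(words):
    """Analysis: tag each word using the monolingual source lexicon."""
    return [(w, SOURCE_LEXICON.get(w, "UNK")) for w in words]


def transfer(tagged):
    """Transfer: the bilingual lexicon is used only to map lexical items."""
    return [(BILINGUAL.get(w, w), pos) for w, pos in tagged]


def generate(tagged):
    """Generation: apply one toy target-language fact, namely that an
    adjective directly before a noun is moved to follow that noun."""
    out = list(tagged)
    i = 0
    while i < len(out) - 1:
        if out[i][1] == "ADJ" and out[i + 1][1] == "NOUN":
            out[i], out[i + 1] = out[i + 1], out[i]
            i += 2
        else:
            i += 1
    return [w for w, _ in out]


def transfer_translate(words):
    """Analysis, then transfer, then generation."""
    return generate(transfer(analyse(words)))


sentence = ["the", "black", "cat", "sleeps"]
print(direct_translate(sentence))    # word-for-word substitution
print(transfer_translate(sentence))  # reordered using grammatical facts
```

The point of the sketch is the division of labour: only `transfer` consults the bilingual lexicon, while `analyse` and `generate` rely on language-specific facts of the kind a monolingual dictionary might supply.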
 
 
Machine translation companies
 
Researchers and system-builders in MT have always been aware that lexicons need to be big. Indeed commercial MT companies have been building up their lexicons for as long as they have been in the business. The evidence suggests that it takes at least ten years for an MT product to reach the market, and much of that time is spent on lexicography. Lexicons are MT companies' most valuable resources, representing very large quantities of detailed information about words, recorded formally and consistently. Some of the firms are currently collaborating with university research groups.
 
Informal approaches to some of the leading companies suggest that they are indeed interested in academics experimenting with their lexicons, though this would be on the basis of a licence which confirmed that their resources (or software derived from them) would not be redistributed.
 
 
Machine-readable dictionaries (MRDs)
 
Until relatively recently most work in NLP took place on a 'toy' scale. Such work was done in the laboratory, exploring properties of some interesting aspect of grammatical behaviour, and there was no reason to have more than a few dozen words in the lexicon. By the mid-1980s much of the theoretical basis for NLP (chart parsing, feature-based formalisms, unification) was in place and the question 'how do we apply it?' came to the fore. A prerequisite for many possible applications is 'scaling up'. This is particularly important in relation to the lexicon, where scaling up means moving from dozens of words to tens of thousands of words. For the last ten years scaling up has been a very active research area.
 
From within NLP the first major push was to exploit machine-readable copies of dictionaries. This involved first getting hold of the data from dictionary publishers and then decoding it in order to turn a typesetters' tape into some form of database where the different fields of information — headword, part-of-speech, definition, etc. — were retrievable. How to do this and what can be achieved are explored and reviewed in Boguraev and Briscoe (1989) and in Byrd et al. (1987). Much of the leading-edge research was performed on monolingual English dictionaries, though some groups, notably the IBM group at the T. J. Watson Research Center, have been busy gathering and decoding dictionaries (monolingual and bilingual) for a number of languages.
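
To give a flavour of the decoding step, here is a minimal Python sketch that turns one raw entry into a record with retrievable fields. The markup is an assumption made up for this example: the codes `hw`, `ps` and `df` and the sample entry are hypothetical, whereas real typesetting tapes use publisher-specific codes that are far less regular.

```python
import re

# Hypothetical field codes; real tapes use publisher-specific codes.
FIELD_NAMES = {"hw": "headword", "ps": "part_of_speech", "df": "definition"}

# A code in angle brackets followed by text up to the next code.
FIELD_RE = re.compile(r"<(\w+)>([^<]*)")


def decode_entry(raw):
    """Turn one raw entry string into a dict mapping field names to lists."""
    entry = {}
    for code, text in FIELD_RE.findall(raw):
        name = FIELD_NAMES.get(code)
        if name is None:
            continue  # skip codes we do not recognise (e.g. layout codes)
        entry.setdefault(name, []).append(text.strip())
    return entry


# Invented sample entry in the hypothetical markup.
raw = ("<hw>bank<ps>n<it>"
       "<df>the land along the side of a river"
       "<df>a place where money is kept")
entry = decode_entry(raw)
print(entry["headword"], entry["part_of_speech"], len(entry["definition"]))
```

Each field is kept as a list because a single entry may carry several definitions; unrecognised codes such as the invented `<it>` are silently dropped, which is one of the places where a real decoder would need much more care.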
 
One goal of machine-readable dictionary research was simple: to produce 'no-semantics' lexicons containing spellings, morphology, and parts of speech — possibly also spelling variants, and a code for the domain a word was likely to occur in. In these aims the enterprise has been fairly successful, with the ANLT lexicon (see the list of resources given below) as one useful output.
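
As a rough picture of what an entry in such a 'no-semantics' lexicon might hold, the following Python sketch defines a record with the kinds of information listed above and an index over all its surface forms. The class, its field names and the sample data are assumptions for illustration; they do not reproduce the format of the ANLT lexicon or of any other resource.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class LexiconEntry:
    """One 'no-semantics' entry; all field names are illustrative."""
    spelling: str
    part_of_speech: str
    inflections: Dict[str, str] = field(default_factory=dict)  # morphology
    variants: List[str] = field(default_factory=list)          # spelling variants
    domain: Optional[str] = None                               # domain code, if any


def build_index(entries):
    """Map every surface form (spelling, variant, inflection) to its entries."""
    index = {}
    for e in entries:
        for form in [e.spelling, *e.variants, *e.inflections.values()]:
            index.setdefault(form, []).append(e)
    return index


# Invented sample data.
colour = LexiconEntry("colour", "noun",
                      inflections={"plural": "colours"},
                      variants=["color"])
index = build_index([colour])
print(index["color"][0].spelling)
```

Indexing by every surface form lets an analysis component look up a variant or an inflected form and arrive at the same entry, without the lexicon saying anything about meaning.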
 
A more ambitious goal was the construction of a lexical knowledge base containing formalized information about the meanings of words. The outcome here was less successful. The EU-sponsored project ACQUILEX explored the issues and problems involved. Researchers started to see diminishing returns from effort spent on dictionary processing as they moved from orthography, through syntax, to semantics. For the follow-on project, ACQUILEX-2, there was a greater emphasis on working with lexicographers and corpora to produce dictionaries with the right information in a well-structured form in the first place. At an ACQUILEX-2 review workshop last year, one key paper was subtitled 'Have we wasted our time?' (also published as Ide and Veronis 1995). It noted that errors, inconsistencies, and circularities continue to be the bane of the MRD community and undermine hopes of producing a usable, wide-coverage LKB via the largely automatic processing of MRDs.
 
In the last few years dictionary publishers have realized the benefits of writing and storing dictionaries as databases. As a result we now find that most large dictionary publishers have started to use computerized dictionary-writing environments. Such environments are built on the premise that dictionary entries are highly structured entities and that the dictionary entry is a database entry from the moment it flows from the lexicographer's metaphorical pen.
 
 
Licences
 
Dictionaries are very rich in information. This richness corresponds to a large number of person-hours expended on the making of dictionaries. Since the cost of writing a state-of-the-art dictionary runs into millions of pounds, the product is not given away lightly.
 
Various kinds of licence arrangements are possible, but one that allows the licensee to pass the dictionary itself (or software derived from it) to anyone else is sure to be very expensive. Since academics are typically poor and typically acquire fame and glory through their work being adopted and developed by others, this is a fairly severe constraint.
 
The high cost of licences is one reason why existing MT products do not use MRDs. Another reason is the long time lag between the fixing of the architecture for an MT system and its arrival on the marketplace. The design of the MT systems currently on the market pre-dates most MRD research. This may be set to change. At least one MT group is now extensively using MRDs: Sharp Laboratories Europe are partners of Cambridge University Press (CUP) in a project called 'Integrated Language Database' and are using CUP's new dictionary CIDE in their development work.
 
The licence fee for an MRD for a new dictionary is variable, but often seems to be in the region of £1,000. This figure would cover a licence only for research use in a university and typically comes with an obligation to keep the publisher informed of what is being done with it. Where resources developed in academia are nonetheless based on published dictionaries, a licence with the publisher, under similar terms, is usually still required.
 
There are, of course, many free resources, though in most cases these will not be as accurate or as complete as a commercial dictionary.
 
 
CD-ROM dictionaries
 
In the last few years the market for dictionaries on CD-ROM has opened up. As a result there are now machine-readable dictionaries available for many languages and language-pairs which can be bought for a modest sum. There are, of course, legal constraints on what you can do with the information you buy when you purchase such a CD-ROM, as well as the practical difficulty of extracting the information from the medium in a usable form. If you embedded the lexical information in an MT system which was then redistributed without the publisher's consent, this would infringe copyright; building such a system for your own use would not. The team at ISI in California has done just that for the CD-ROM MRD format used by the Electronic Book Catalogue; one of the team's researchers, Matthew Haines, has devoted many person-hours to decoding between fifteen and twenty dictionaries. Thus one way to get hold of lexical databases for languages and language-pairs available on CD-ROM is to buy the CD-ROM itself and set about decoding it. Haines declares: 'each dictionary is a new project. It takes a lot of time to reduce an electronic book to a database, but with individuals serious about investing that time, I will be happy to share my programs and offer advice.'
 
 
Research lexicography
 
It is worthwhile noting that the research community itself has developed some substantial resources. Foremost amongst these is Wordnet (Miller 1990). This is a dictionary-like resource produced by lexicographers and students at Princeton University, whose motivation was to pursue various hypotheses from psycholinguistics about the mental lexicon. The outcome is an online dictionary, available entirely free for research purposes over the Internet. Wordnet is now being very widely used in the NLP and information retrieval communities. An EU project which is about to begin aims to extend Wordnet to include words from a number of European languages. Another important resource for English that is being developed specifically for reuse in NLP research and commercial applications is COMLEX (Grishman et al. 1994); this is being produced under contract to the Linguistic Data Consortium.
 
The Message Understanding Conference (MUC) initiative in the United States has nurtured the development of application-specific lexicons on short timescales, making imaginative use of corpora of the appropriate genre for automatic and semi-automatic 'lexicography'. A wide range of approaches and in some cases public resources is described in the MUC-5 proceedings and in Boguraev and Pustejovsky (1993). Recent work includes the automatic 'learning' of translation pairs from corpora in two different languages (Fung 1995, Wu 1995).
 
 
Annotated resource list
 
The following contact names and e-mail addresses have been gleaned from various sources. Where I have not yet been able to confirm the e-mail address by eliciting a reply, I have marked the address as unconfirmed.
 
 
Commercial MT companies
 
LOGOS
Friederike Bruckert,
LOGOS Computer Integrated Translation GmbH,
Mergenthallerallee 79–81,
D-65760 Eschborn/Ts., Germany,
Tel: +49 (0)619 659030
Fax: +49 (0)619 6590315
E-mail: bruckert@logos.de (unconfirmed)
 
INTERGRAPH
The organisation has recently contracted to sell its service over the Compuserve network.
Susan Moore
Tel: +1 205 730 3315
E-mail: sjmoore@com.ingr
Also web page http://www.intergraph.com
 
SYSTRAN
Boasts the largest repertoire of language pairs and the largest dictionaries.
Tel: +1 619 459 6700
Fax: +1 619 459 8487
E-mail: info@systranmt.com (unconfirmed)
 
METAL
Geert Adriaens,
Siemens-Nixdorf,
Centre Software de Liège,
Rue des Fories 2,
4020 Liège,
Belgium;
E-mail: gad@csl.sni.be (unconfirmed)
 
 
Dictionary Publishers
 
LONGMAN
The Longman Dictionary of Contemporary English (LDOCE) is the single most widely used dictionary in NLP research. Most work has been based on the first edition of 1978. A corpus-based third edition is now available in SGML. Other monolingual English learners' dictionaries and thesauri are also available.
Steve Crowdy,
Longman Dictionaries,
Longman House,
Burnt Mill,
Harlow,
Essex CM20 2JE
Tel: +44 (0)1279 623816
E-mail: 100425.3057@com.compuserve
 
CAMBRIDGE LANGUAGE SERVICES
The commercial wing of CUP. They have recently produced the Cambridge International Dictionary of English (CIDE). The Cambridge University NLP group and the Sharp Laboratories Europe MT group were involved in some aspects of its production, and CLS have widely advertised that they intend to make the database readily available for research use.
Paul Proctor,
Sue Allen-Mills
E-mail: sallmill@uk.ac.cam.cup
 
HARPER COLLINS
The 1978 Collins English Dictionary is readily available on the $25 ACL-DCI CD-ROM. A version of the Collins English-Spanish dictionary for MT has been produced at Carnegie-Mellon University and may be made available to other researchers.
Contact:
Bob Frederking
E-mail: ref@cs.cmu.edu
 

For other Collins monolingual and bilingual dictionaries contact:
Lorna Sinclair-Knight
E-mail: lornas@reference.collins.co.uk
 
For COBUILD dictionaries contact:
Gwyneth Fox
E-mail: gwyneth@cobuild.collins.co.uk
 
OXFORD UNIVERSITY PRESS
OUP publish various monolingual (learner, concise, shorter, full OED) and bilingual dictionaries in machine-readable form.
 

Simon Murison-Bowie,
Electronic Publishing,
Walton Street,
Oxford OX2 6DP
 
CD-ROMs
The Electronic Book Catalogue lists dictionaries (monolingual and/or bilingual) for English, French, Spanish, German, Japanese, Danish and Dutch.
Tel: +44 (0)171 561 9590
Fax: +44 (0)171 561 9591
 

Academia and the NLP Research Community
Distributors of Lexical Resources:
 
LINGUISTIC DATA CONSORTIUM (LDC)
This is a membership organisation which exists explicitly for the purpose of developing and redistributing linguistic resources for NLP. The LDC distributes the ACL-DCI CD-ROM, CELEX, COMLEX, and others. Anyone wishing to gain access to the LDC resources can either pay for one specific resource or pay for a year's membership, which entitles them to take copies of all resources published in that year. A year's membership fee for a university is $2,000.
E-mail: ldc@unagi.cis.upenn.edu
WWW: ftp://ftp.cis.upenn.edu/ldc_www/ldc_catalogue
 
CONSORTIUM FOR LEXICAL RESOURCES (CLR)
This is a library of lexical resources with some overlap with the LDC, although the emphasis is on distributing rather than developing resources. Most resources at the CLR are available free.
E-mail: lexical@nmsu.edu
WWW: ftp://crl.nmsu.edu/CLR/catalog
 
Other resources:
 
WORDNET
This is available free by ftp from: clarity.princeton.edu/pub/
 
ANLT LEXICON
This was produced from LDOCE (1978) as part of the UK-funded Alvey Natural Language Tools Project. It contains fairly detailed syntactic information, with subcategorisation codes extracted from LDOCE and then extensively checked and corrected; there is no semantic information. ANLT is available under licence through Lynxvale (the trading arm of Cambridge University) for 500 ECU.
http://www.cl.cam.ac.uk/Research/NL/anlt.html
 
CELEX
This contains detailed morphology and some syntactic information for English, German and Dutch. It is available on CD-ROM through the LDC for $150.
 
 
Bibliography

Boguraev, B. K. and Briscoe, E. J. (1989)
Computational Lexicography for Natural Language Processing, Harlow: Longman.

Boguraev, B. K. and Pustejovsky, J. (1993)
Acquisition of Lexical Knowledge From Text: Workshop Proceedings, Ohio: ACL Special Interest Group on the Lexicon.

Brown, P., Cocke, J., Della Pietra, S., Jelinek, F., Lafferty, J. D., Mercer, R. I., and Roossin, P. S. (1990)
'A Statistical Approach to Machine Translation' in Computational Linguistics, No. 6, Vol. 2: 79–86.

Byrd, R. J., Calzolari, N., Chodorow, M. S., Klavans, J. L., Neff, M. S., and Rizk, O. A. (1987)
'Tools and Methods for Computational Lexicology' in Computational Linguistics, Vol. 13: 219–40.

Fung, P. (1995)
'A Pattern Matching Method for Finding Noun and Proper Noun Translations from Noisy Parallel Corpora' in Proceedings, 33rd Annual Meeting of the Association for Computational Linguistics, MIT: 236–43.

Grishman, R., MacLeod, C., and Meyers, A. (1994)
'Comlex Syntax: Building a Computational Lexicon' in COLING 94, Tokyo.

Hutchins, J. and Somers, H. (1992)
Introduction to Machine Translation, London: Academic Press.

Ide, N. M. and Veronis, J. (1995)
'Knowledge Extraction from Machine-readable Dictionaries: an Evaluation' in Machine Translation and the Lexicon, Springer Verlag, Lecture Notes in Artificial Intelligence 898: 19–34.

Miller, G. (1990)
'Wordnet: An On-line Lexical Database', International Journal of Lexicography (special issue), No. 4, Vol. 4: 235–312.

Wu, D. (1995)
'An Algorithm for Simultaneously Bracketing Parallel Texts by Aligning Words' in Proceedings, 33rd Annual Meeting of the Association for Computational Linguistics, MIT: 244–51.