British Computer Society
Natural Language Translation
Specialist Group

URL: http://www.bcs-mt.org.uk/
WEB PAGE 7
Machine Translation Review
No. 1, April 1995   ISSN: 1358-8346
http://www.bcs-mt.org.uk/mtreview/1/7.htm

 
CAT2 — A UNIFICATION-BASED MACHINE TRANSLATION SYSTEM

by Ruslan Mitkov


 
Abstract

CAT2 is an MT system embodying a unification-based formalism similar to PATR-II (Sharp 1988, Sharp 1991), together with software for the development of grammars, lexicons and translation modules. It was developed at the IAI, Saarbrücken, as a sideline implementation to the Eurotra Project, and has been undergoing constant development and evolution since 1987. Experimental versions for numerous languages have been implemented, including English, German, Spanish, French, Portuguese, Italian, Dutch, Russian, Greek, Korean, and Japanese. It is now being used in pre-industrial projects for a number of commercial firms and academic institutions (Sharp and Streiter 1995).
 
The translation strategy is based on tree-to-tree transduction, where an initial syntactico-semantic tree is parsed, then transduced to an abstract representation ('interface structure') that is designed for simple transfer to a target language interface structure. This structure is then transduced to a syntactico-semantic tree in the target language, whose yield provides the actual translated text. The analysis of a source language, as well as the generation of a target language, is based on strictly monolingual rules, the transfer component being the only interface between two languages. Thus, an analysis in one language may be transferred to any number of target languages without requiring re-analysis. The various components, however, may make use of common rules, much like subroutines, so that 'universal' descriptions may be made to apply equally to any number of languages, thereby reducing the rule base tremendously, as well as simplifying the maintenance of grammars and the addition of new language components.
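
The division of labour described above can be sketched in a few lines of Python. This is an illustration only, not CAT2 code: the function `translate_to_many`, the `Tree` type and all the toy components are hypothetical names introduced here, and each stage is reduced to a plain function. The point it shows is that analysis runs once, and only the transfer step is specific to a language pair.

```python
from typing import Callable, Dict

Tree = dict  # hypothetical stand-in for a tree or interface structure


def translate_to_many(
    text: str,
    analyse: Callable[[str], Tree],
    transfer: Dict[str, Callable[[Tree], Tree]],
    generate: Dict[str, Callable[[Tree], str]],
) -> Dict[str, str]:
    """Analyse once, then transfer and generate for every target language."""
    source_interface = analyse(text)  # monolingual analysis, run a single time
    translations = {}
    for lang, transfer_fn in transfer.items():
        target_interface = transfer_fn(source_interface)  # the only bilingual step
        translations[lang] = generate[lang](target_interface)  # monolingual generation
    return translations


# Toy stand-ins, purely illustrative; they are not real CAT2 components.
calls = []


def toy_analyse(text: str) -> Tree:
    calls.append(text)  # record each analysis so the single pass can be checked
    return {"pred": text.lower()}


toy_transfer = {
    "de": lambda s: {"pred": {"hello": "hallo"}[s["pred"]]},
    "fr": lambda s: {"pred": {"hello": "bonjour"}[s["pred"]]},
}
toy_generate = {
    "de": lambda t: t["pred"].capitalize(),
    "fr": lambda t: t["pred"].capitalize(),
}

result = translate_to_many("Hello", toy_analyse, toy_transfer, toy_generate)
```

Adding a further target language in this sketch means adding one transfer function and one generation function; the analysis side is left untouched.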
 
The formalism specifies two rule types for tree construction, and two rule types for tree transduction. Trees are built using 'b-rules', a context-free backbone with attribute-value pairs rather than simple category symbols. The following illustrates how the rule 'S → NP VP' might be written in CAT2 notation:

(a)     {cat=s} .[ {cat=np}, {cat=vp} ].

The feature bundles may include any number of simple or complex feature descriptions; simple features have atomic values, for example: 'cat=s', whereas complex features have feature bundles as values, for example: 'agr={num=sing,per=3}'. In addition, since the formalism is implemented in Prolog, a value may be a logical variable, bound to another variable with the same name within the rule; instantiation of one of the variables automatically instantiates the other. The implementation also allows for negative and disjunctive features, implemented in SICStus Prolog using the 'when/2' construct for freezing goal evaluations.
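
To make the role of unification concrete, the following Python sketch models feature bundles as nested dictionaries and applies rule (a) to two candidate daughters. It is a simplified illustration under stated assumptions, not the CAT2 implementation: the names `unify` and `apply_b_rule` are hypothetical, and logical variables, negation and disjunction are deliberately left out.

```python
def unify(a, b):
    """Unify two feature values; return the merged value, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None  # conflicting values for the same attribute
                result[key] = merged
            else:
                result[key] = b_val  # attribute present on one side only
        return result
    # Atomic values (or an atom against a bundle) unify only if they are equal.
    return a if a == b else None


def apply_b_rule(mother, daughter_patterns, nodes):
    """Build a (mother, daughters) pair if each node unifies with its pattern.

    Assumption: a b-rule is modelled as a mother bundle plus an ordered list
    of daughter bundles, mirroring '{cat=s} .[ {cat=np}, {cat=vp} ].'
    """
    if len(nodes) != len(daughter_patterns):
        return None
    daughters = []
    for pattern, node in zip(daughter_patterns, nodes):
        merged = unify(pattern, node)
        if merged is None:
            return None
        daughters.append(merged)
    return (dict(mother), daughters)


# Rule (a): {cat=s} .[ {cat=np}, {cat=vp} ].
s_mother = {"cat": "s"}
s_daughters = [{"cat": "np"}, {"cat": "vp"}]

np_node = {"cat": "np", "agr": {"num": "sing", "per": 3}}
vp_node = {"cat": "vp"}

tree = apply_b_rule(s_mother, s_daughters, [np_node, vp_node])
wrong_order = apply_b_rule(s_mother, s_daughters, [vp_node, np_node])
```

Here `tree` is a pair of the mother bundle and the two unified daughters, while `wrong_order` is `None` because 'cat=vp' clashes with the pattern 'cat=np'.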
 
The second rule type in tree construction is the 'f-rule' for validating the feature content of partial trees. A simple f-rule for ensuring subject-verb agreement might be coded as follows:

(b)     {} .[ {cat=np}>>{agr=X}, {cat=vp}>>{agr=X} ].

This rule states that, in a tree configuration containing an NP as left daughter and a VP as right daughter, their agreement features must unify.
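
The effect of rule (b) can be approximated in Python as follows. This is a sketch under assumptions rather than a rendering of CAT2 semantics: the shared variable X is modelled by unifying the two 'agr' values and writing the result back to both daughters, the function names are hypothetical, and the behaviour when the daughters are not an NP followed by a VP (leaving them unchanged) is our own choice.

```python
def unify(a, b):
    """Unify two feature values; return the merged value, or None on a clash."""
    if isinstance(a, dict) and isinstance(b, dict):
        result = dict(a)
        for key, b_val in b.items():
            if key in result:
                merged = unify(result[key], b_val)
                if merged is None:
                    return None
                result[key] = merged
            else:
                result[key] = b_val
        return result
    return a if a == b else None


def apply_agreement_rule(left, right):
    """Approximate rule (b): an NP and VP pair must share one 'agr' value.

    Returns the two daughters with their unified 'agr', or None if the
    agreement features clash. Daughters that do not match the NP/VP
    configuration are returned unchanged (an assumption of this sketch).
    """
    if left.get("cat") != "np" or right.get("cat") != "vp":
        return left, right
    shared = unify(left.get("agr", {}), right.get("agr", {}))
    if shared is None:
        return None  # the partial tree is rejected
    # Both daughters now carry the same value, as a shared variable X would.
    return {**left, "agr": shared}, {**right, "agr": shared}


np_node = {"cat": "np", "agr": {"num": "sing", "per": 3}}
vp_ok = {"cat": "vp", "agr": {"num": "sing"}}
vp_bad = {"cat": "vp", "agr": {"num": "plur"}}

agreed = apply_agreement_rule(np_node, vp_ok)
clash = apply_agreement_rule(np_node, vp_bad)
```

In the first call the VP inherits 'per=3' from the NP through the shared value; in the second the conflicting 'num' values cause the configuration to be rejected.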
 
In practice, our grammars make use of a very small number of b-rules, based on X-bar syntax, (extended) head features (Streiter 1994), and lexically-driven tree construction. The f-rules instantiate various universal and language-specific principles and properties, as well as supplying default values to lexical and phrasal constructions.
 
The tree transduction rules employ analogous rule types: t-rules transform tree structures, and tf-rules copy or transform selected features from source to target trees. The rule formats are similar to b- and f-rules, and again unification underlies the rule application. CAT2 was recently extended to handle pronominal anaphora (Mitkov, Choi and Sharp 1995). Since the rules for anaphora resolution in our model do not employ t- or tf-rules, these rule types are not described further here (see Sharp 1994 for a complete description of the formalism).
 
A pre-industrial prototype of CAT2 has been developed with a dictionary of about 10,000 entries (mainly data processing) for the language pair German-English. For other language pairs (for example, French-German and French-English), experimental versions are available. The prototype accepts free input, especially with German as the source language.
 
Since 1987 CAT2 has been used in various universities as a teaching device and for the definition and processing of language analysis, synthesis and translation.
 
Continuing the long tradition of the University in Saarbrücken in 'electronic language research', the IAI is currently carrying out, apart from CAT2, a number of other application-oriented projects sponsored by the LRE and MLAP programmes of the Commission of the European Union, by German ministries, and by private industry.
 
Among these are projects using the most advanced techniques in NLP, such as typed feature structures (for example, ALEP, the Advanced Language Engineering Platform). Other projects are using the well-tested CAT2 prototype for industrial validations.
 
 
References
  • Mitkov, R., Choi, S. K., and Sharp, R. (1995) 'Anaphora resolution in Machine Translation' (in press).
  • Sharp, R. (1991) 'CAT2: An Experimental Eurotra Alternative', Machine Translation, Vol. 6: 215-28.
  • Sharp, R. (1994) CAT2 Reference Manual, Version 3.6, IAI, Saarbrücken.
  • Sharp, R. and Streiter, O. (1995) 'Applications in Multilingual Machine Translation', paper submitted to Practical Applications of Prolog, Paris.
  • Streiter, O. (1994) 'Komplexe Disjunktion und erweiterter Kopf: Ein Kontrollmechanismus für die MÜ', Proceedings of Konvens '94: 28-30, Vienna.