Machine Translation Review
This page URL: http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-9.htm
by Raül Canals, Anna Esteve, Alicia Garrido, M. Isabel Guardiola, Amaia Iturraspe-Bellver, Sandra Montserrat, Pedro Pérez-Antón, Sergio Ortiz, Hermínia Pastor and Mikel L. Forcada, all of the Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant, Spain. E-mail: mlf@dlsi.ua.es
Abstract This paper describes interNOSTRUM, a Spanish-Catalan machine translation system currently under development that achieves great speed through the use of finite-state technologies and a reasonable accuracy for this pair of closely-related languages by using a classical low-level approach which could be described as an advanced morphological transfer strategy.
Keywords: Spanish, Catalan, machine translation, finite-state.
This paper describes interNOSTRUM, a Spanish-Catalan machine translation system. The main reason for the demand for translations from Spanish (the official language of Spain) into Catalan is the drive toward 'linguistic normalization' in the Catalan-speaking regions (ten million inhabitants and about six million speakers), where Catalan was receding and where the language is now co-official. Catalan and Spanish are two closely related Romance languages with rather limited syntactic divergence.
The interNOSTRUM system is currently under development, and a prototype has just started to serve the Universitat d'Alacant, a medium-sized university, and the Caja de Ahorros del Mediterráneo, one of the largest savings banks in Spain. These two institutions started and currently fund this three-year project (1999-2001), which has a staff of two linguists and three computer engineers. Even though translation accuracy and vocabulary coverage can still be much improved, the speed of the system (thousands of words per second, or millions of words per day, on a 1999-model desktop machine acting as an Internet server) has prompted its use as a system to obtain instantaneous rough translations that are relatively easy to turn into publishable documents. These speeds are achieved through the use of finite-state technology (Roche and Schabes 1997) in most of its modules.
Although interNOSTRUM is not yet a finished product, it can already be used to obtain instantaneous rough translations ready for post-editing. Indeed, two of the basic objectives of our project have been, first, to generate an operational version of interNOSTRUM as soon as possible (launched November 1999) and, second, to make the latest stable version available as soon as it is ready. These are the main reasons for its current configuration as a single Internet server.
Currently, interNOSTRUM only translates unformatted ANSI or ASCII texts from Castilian Spanish to the central or Barcelona variety of Catalan (versions generating a València variety and a Balearic Islands variety will be ready by the end of the project), but RTF (Microsoft's Rich Text Format) and HTML (HyperText Markup Language) versions are about to be launched. We expect to release the inverse (Catalan-Spanish) translator by November 2000.
interNOSTRUM is a classical indirect machine translation system using an advanced morphological transfer strategy, similar to what has been called a transformer architecture (Bouillon and Clas 1993) or a direct system (Arnold 1993), and analogous to the one used in commercial PC-based machine translation systems. interNOSTRUM has six modules (see figure 1): two analysis modules (morphological analyser and part-of-speech tagger), two transfer modules (bilingual dictionary module and pattern processing module) and two generation modules (morphological generator and postgenerator). The six modules are automatically generated from data (see table 1).
Four of the modules in interNOSTRUM, namely the morphological analyser, the bilingual dictionary module, the morphological generator, and the postgenerator, are based on finite-state transducers (FSTs) (Roche and Schabes 1997). This allows for processing speeds on the order of 10,000 words per second, which are practically independent of the size of the dictionaries. Another interesting feature of FST-based modules is that they may be made very compact using standard minimization techniques. FSTs read their input symbol by symbol; each time a symbol is read, they move to a new state and write, also symbol by symbol, one or more output symbols.
The morphological analyser is automatically generated (Garrido et al. 1999) from a morphological dictionary (MD) for the source language (SL).
The MD contains the lemmas (canonical or base forms of inflected words), the inflection paradigms, and their mutual relationships. The subprogram reads the text as surface forms (SF) and writes, for each surface form, one or more lexical forms (LF) consisting of a lemma, a part of speech, and inflection information.
The bilingual dictionary module is called by the pattern processing module (see below); it is automatically generated from a file that contains the bilingual correspondences. The program reads an SL LF and writes the corresponding target-language (TL) LF.
The morphological generator basically performs the reverse of morphological analysis, but applied to the TL. It is generated from an MD for the TL.
The postgenerator is activated by those SFs involved in apostrophation and hyphenation (such as clitic pronouns, articles, some prepositions, etc.); otherwise it remains dormant. It is generated from a file containing the corresponding rules for the TL.
The division of a text into words has some nontrivial aspects. On the one hand, there are a number of word groups that cannot be translated word for word and may be treated as fixed-length multiword units (MWU); they are gradually being incorporated into the bilingual dictionary. Examples: Sp. con cargo a → Cat. a càrrec de ('at the expense of'); Sp. por adelantado → Cat. per endavant ('in advance'); Sp. echar de menos → Cat. trobar a faltar ('to miss [someone]'). In the last example, the MWU has a variable element that may be inflected (the verb: Sp. echar, Cat. trobar); MWUs with inflection have just started to be incorporated into interNOSTRUM's dictionaries. On the other hand, combinations of certain verb forms and enclitic pronouns are written in Spanish as a single word; these combinations occur with orthographical transformations such as changes in accent marks or loss of consonants: Sp. dámelo = da + me + lo → Cat. dóna + me + lo = dóna-me'l ('give it to me!'); Sp. presentémonos = presentemos + nos → Cat. presentem + nos = presentem-nos ('let us introduce ourselves').
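The symbol-by-symbol behaviour of FSTs and the mapping from surface forms to lexical forms described above can be illustrated with a toy letter transducer. This is only a sketch under stated assumptions: the class `LetterTransducer`, the angle-bracket tag notation, and the four dictionary entries are hypothetical and are not taken from interNOSTRUM. The real analyser is generated from a morphological dictionary with inflection paradigms and can be minimized; neither step is modelled here, and enclitic pronouns and multiword units are left out.

```python
from collections import defaultdict


class LetterTransducer:
    """Toy nondeterministic letter transducer (illustrative only)."""

    def __init__(self):
        # (state, input symbol) -> list of (output string, next state)
        self.trans = defaultdict(list)
        self.finals = set()
        self.n_states = 1  # state 0 is the initial state

    def add(self, surface, lexical):
        """Add one surface form / lexical form pair as a path."""
        state = 0
        last = len(surface) - 1
        for i, sym in enumerate(surface):
            # Emit one output symbol per input symbol; the last input
            # symbol emits whatever remains of the lexical form.
            out = lexical[i:i + 1] if i < last else lexical[i:]
            for o, nxt in self.trans[(state, sym)]:
                if o == out:  # reuse an existing identical transition
                    state = nxt
                    break
            else:
                nxt = self.n_states
                self.n_states += 1
                self.trans[(state, sym)].append((out, nxt))
                state = nxt
        self.finals.add(state)

    def analyse(self, surface):
        """Return every lexical form the transducer assigns to a surface form."""
        alive = [(0, "")]  # (current state, output written so far)
        for sym in surface:
            alive = [(nxt, out + o)
                     for state, out in alive
                     for o, nxt in self.trans.get((state, sym), [])]
        return sorted(out for state, out in alive if state in self.finals)


# Hypothetical entries; the tag names are assumptions made for this sketch.
fst = LetterTransducer()
fst.add("canta", "cantar<vblex><pri><p3><sg>")
fst.add("cantas", "cantar<vblex><pri><p2><sg>")
fst.add("vino", "vino<n><m><sg>")
fst.add("vino", "venir<vblex><ifi><p3><sg>")

print(fst.analyse("vino"))    # a homograph: two lexical forms
print(fst.analyse("cantas"))  # an unambiguous form
```

Processing time in `analyse` grows with the length of the word rather than with the number of entries, which is in line with the observation that speed is practically independent of dictionary size.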
Figure 1: the six modules of interNOSTRUM.
SL text -> Morphological analyser -> Tagger -> Pattern processor -> Morphological generator -> Postgenerator -> TL text
Pattern processor <-> Bilingual dictionary
Most lexical ambiguities fall into two main groups: homography (when an SF has more than one LF or analysis) and polysemy (when the SF has a single LF but the lemma may have more than one interpretation). The lexical disambiguation module or part-of-speech tagger uses a language model based on trigrams (sequences of three lexical categories) to resolve those homographs occurring in Spanish texts that present a category ambiguity. The model's parameters reflect the frequencies observed for each trigram in a reference text corpus; the tagger assigns a probability to each possible disambiguation of a sentence containing a lexical categorial ambiguity, and the most likely disambiguation is chosen. We are currently fine-tuning the tagset used and building a larger training corpus to improve the performance of this module and to address homographies inside the same lexical category. Polysemous words will be avoided through the use of a controlled Spanish biased toward banking and administration applications (see section 3).
In spite of the great similarity between Spanish and Catalan, there are still a number of important grammatical divergences: modal constructions, as in Sp. tienen que firmar → Cat. han de signar ('they have to sign'); gender and number divergences, as in Sp. la deuda contraída (fem.) → Cat. el deute contret (masc.; 'the assumed debt'); dropping of prepositions before que, as in Sp. la intención de que el cliente esté satisfecho → Cat. la intenció Ø que el client estigui satisfet ('the intention that the customer be satisfied'); and relative constructions using cuyo ('whose'), absent in Catalan, as in Sp. la cuenta cuyo titular es el asegurado → Cat. el compte el titular del qual és l'assegurat ('the account whose owner is the insured person'). These divergences have to be treated using suitable grammatical rules. interNOSTRUM uses a solution which may also be found in commercial MT systems.
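The trigram-based disambiguation performed by the tagger, described earlier in this section, can be sketched as follows. The paper does not say how the trigram probabilities are estimated or how the best disambiguation is searched for, so everything specific here is an assumption: the boundary symbol `BOS`, the add-one smoothing, the brute-force enumeration of candidate sequences (a practical tagger would more likely use dynamic programming), the category names, and the four-sentence training corpus are all hypothetical.

```python
import math
from collections import Counter
from itertools import product

BOS = "<s>"  # hypothetical sentence-boundary category


def train(tag_sequences):
    """Count category trigrams and their two-category contexts."""
    tri, ctx, tagset = Counter(), Counter(), set()
    for tags in tag_sequences:
        padded = [BOS, BOS] + list(tags)
        tagset.update(tags)
        for a, b, c in zip(padded, padded[1:], padded[2:]):
            tri[(a, b, c)] += 1
            ctx[(a, b)] += 1
    return tri, ctx, len(tagset)


def log_prob(tags, model):
    """Log-probability of a category sequence (add-one smoothing, an assumption)."""
    tri, ctx, n_tags = model
    padded = [BOS, BOS] + list(tags)
    return sum(math.log((tri[(a, b, c)] + 1) / (ctx[(a, b)] + n_tags))
               for a, b, c in zip(padded, padded[1:], padded[2:]))


def disambiguate(candidates, model):
    """Pick the most likely sequence among all combinations of candidate categories."""
    return max(product(*candidates), key=lambda tags: log_prob(tags, model))


# Toy reference corpus of category sequences (invented for this sketch).
corpus = [
    ["det", "n", "vb", "adj"],
    ["det", "n", "vb", "det", "n"],
    ["prn", "vb", "det", "n"],
    ["det", "adj", "n", "vb"],
]
model = train(corpus)

# Sp. "la casa es grande": "la" may be a determiner or a pronoun,
# "casa" a noun or a verb form; the other two words are unambiguous.
candidates = [["det", "prn"], ["n", "vb"], ["vb"], ["adj"]]
print(disambiguate(candidates, model))
```

With this toy corpus the determiner-noun reading of "la casa" receives the highest score, since the trigrams it produces were the most frequent ones in training.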
For grammatical divergences, interNOSTRUM detects and treats predefined sequences of lexical categories (patterns), which may be seen as rudimentary phrase-structure constructs: for example, art.-noun or art.-noun-adj are two possible valid noun phrases. Those sequences known to the program constitute its pattern catalog. The pattern processing module works as follows:
When no pattern is detected in the current position, the program translates one LF literally and restarts at the following LF. "Long-range" phenomena such as subject-verb agreement require the propagation of information from one pattern to the following ones; we are currently working on this aspect. The pattern processing module is automatically generated from a file containing rules that specify the patterns and the associated actions. This is the slowest module: around 1,000 words per second (wps), compared to the 10,000 wps of the rest of the modules. The current catalog only contains a few patterns.
To convey a rough idea of the performance of interNOSTRUM in its current state of development (August 2000 version), a short example (randomly chosen from an Internet newspaper) is translated from Spanish to Catalan, and the minimal editing operations needed to render the translation acceptable are shown in brackets within it. Most of the observed problems are due to the lack of coverage of current dictionaries, which are expected to be reasonably complete at the end of the project.
Spanish text: Fujimori deja el poder y convoca elecciones generales a las que no se presentará. Lima. -- El presidente de Perú, Alberto Fujimori, ha anunciado por sorpresa en un mensaje televisado a la nación la convocatoria de nuevos comicios y ha precisado que 'en esas elecciones generales, de más está decirlo, no participará quien habla'. Tras 10 años en el poder, el gobernante peruano, el más veterano en Latinoamérica después de su colega cubano, Fidel Castro, presentó así lo que ha sido interpretado como una dimisión en toda regla. La presunta implicación de Vladimiro Montesinos, el más íntimo colaborador de Fujimori, en diferentes actos de corrupción y atentados desde el poder contra el Estado de derecho es la causa directa de la caída del gobernante peruano.
Catalan text with corrections: Fujimori deixa el poder i convoca eleccions generals a les que [correct: les quals] no es presentarà.
Llima [correct: Lima] -- El president de Perú, Alberto Fujimori, ha anunciat per sorpresa en un missatge *televisado [unknown; correct: televisat] a la nació la convocatòria de nous *comicios [unknown; correct: comicis] i ha precisat que 'en aquestes eleccions generals, de més està [calque; correct: no cal] dir-lo [correct: -ho], no [insert: hi] participarà qui parla'. Després de 10 anys en el poder, el governant *peruano [unknown; correct: peruà], el més veterà en Llatinoamèrica després del seu col·lega cubà, Fidel Castro, va presentar així el que ha estat interpretat com una dimissió en tota regla. La pressumpta implicació de Vladimiro Montesinos, el més íntim col·laborador de Fujimori, en diferents actes de corrupció i atemptats des del poder contra l'Estat de dret és la causa directa de la caiguda del governant *peruano [unknown; correct: peruà].
We are currently working on three support tools: (a) a style assistant to help authors of Spanish texts avoid many difficult ambiguities using the syntactic, lexical and style rules specified in a controlled Spanish; (b) a pre-editing assistant for the manual disambiguation of problematic words and structures, which are clicked on to get a menu of options (helpful when the statistical strategy used by the program is unable to make the right choice); and (c) a post-editing assistant, in which the author will be able to click on a target-language word suspected of being an incorrect translation and replace it with an alternative, taking into account the original text.
interNOSTRUM is still under development and has not yet been thoroughly compared with its competitors, but they are listed here for completeness; we expect the translation accuracy of interNOSTRUM to be comparable to that of its best competitors toward the end of the project, with the added benefit of a speed in the range of 1,000 wps.
There are currently four more Spanish-Catalan MT products available (the first two may be tested through the Internet):
We have presented interNOSTRUM, a Spanish-Catalan machine translation system currently under development that achieves great speed through the use of finite-state technologies and a reasonable accuracy using an advanced morphological transfer strategy.