

 

British Computer Society
Natural Language Translation Specialist Group

Web-site: http://www.bcs.org.uk/siggroup/sg37.htm

Machine Translation Review
Issue No. 11, December 2000 - pages 21-25
ISSN 1358-8346

This page URL: http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-9.htm

 
interNOSTRUM: A Spanish-Catalan Machine Translation System

by Raül Canals, Anna Esteve, Alicia Garrido, M. Isabel Guardiola, Amaia Iturraspe-Bellver, Sandra Montserrat, Pedro Pérez-Antón, Sergio Ortiz, Hermínia Pastor and Mikel L. Forcada, all of the Departament de Llenguatges i Sistemes Informàtics, Universitat d'Alacant, E-03071 Alacant, Spain. E-mail: mlf@dlsi.ua.es

 

Abstract

This paper describes interNOSTRUM, a Spanish-Catalan machine translation system currently under development that achieves great speed through the use of finite-state technologies and a reasonable accuracy for this pair of closely-related languages by using a classical low-level approach which could be described as an advanced morphological transfer strategy.

Keywords: Spanish, Catalan, machine translation, finite-state.

 
1 Introduction

This paper describes a Spanish-Catalan machine translation system, interNOSTRUM. The main reason for the demand for translations from Spanish (the official language of Spain) into Catalan is the drive toward 'linguistic normalization' in the Catalan-speaking regions (ten million inhabitants and about six million speakers), where Catalan was receding and where the language is now co-official. Catalan and Spanish are two closely related Romance languages with rather limited syntactic divergence. The interNOSTRUM system is currently under development, and a prototype has just started to serve the Universitat d'Alacant, a medium-sized university, and the Caja de Ahorros del Mediterráneo, one of the largest savings banks in Spain. These two institutions started and currently fund this three-year project (1999-2001), which has a staff of two linguists and three computer engineers. Even though translation accuracy and vocabulary coverage can still be much improved, the speed of the system (thousands of words per second, or millions of words per day, on a 1999-model desktop machine acting as an Internet server) has prompted its use as a system to obtain instantaneous rough translations that are relatively easy to turn into publishable documents. These speeds are achieved through the use of finite-state technology (Roche and Schabes 1997) in most of its modules.

 
2 Current prototype and future versions

As noted above, even though interNOSTRUM is not yet a finished product, it can already be used to obtain instantaneous rough translations ready for post-editing. Indeed, two of the basic objectives of our project have been, first, to produce an operational version of interNOSTRUM as soon as possible (launched November 1999) and, second, to make the latest stable version available as soon as it is ready. These are the main reasons for its current configuration as a single Internet server.

Currently, interNOSTRUM only translates unformatted ANSI or ASCII texts from Castilian Spanish into the central or Barcelona variety of Catalan (versions generating a València variety and a Balearic Islands variety will be ready by the end of the project), but both an RTF (Microsoft's Rich Text Format) version and an HTML (HyperText Markup Language) version are about to be launched. We expect to release the inverse (Catalan-Spanish) translator by November 2000.

 
2.1 Platform

interNOSTRUM currently runs on Linux and may be accessed through an Internet server (http://www.internostrum.com or http://www.torsimany.ua.es). It consists of six modules that run in parallel and communicate through text channels. Each module is automatically generated from the corresponding linguistic data using compilers written with the aid of yacc and lex, which are standard in Unix environments. The current speed of the system is on the order of 1,000 wps (words per second) on a standard 1999 desktop PC (a 400 MHz Pentium PC).

 
2.2 Machine translation strategy

interNOSTRUM is a classical indirect machine translation system using an advanced morphological transfer strategy (similar to a transformer architecture; Bouillon and Clas 1993) or direct system (Arnold 1993), similar to the one used in commercial PC-based machine translation systems. interNOSTRUM has six modules (see figure 1): two analysis modules (morphological analyser and part-of-speech tagger), two transfer modules (bilingual dictionary module and pattern processing module) and two generation modules (morphological generator and postgenerator). The six modules are automatically generated from data (see table 1).

 
2.2.1 Modules based on finite-state technology

Four of the modules in interNOSTRUM, namely, the morphological analyser, the bilingual dictionary module, the morphological generator, and the postgenerator are based on finite-state transducers (FSTs) (Roche and Schabes 1997). This allows for processing speeds on the order of 10,000 words per second, which are practically independent of the size of the dictionaries. Another interesting feature of FST-based modules is that they may be made very compact using standard minimization techniques. FSTs read their input symbol by symbol; each time a symbol is read, they move to a new state, and write, also symbol by symbol, one or more output symbols.

The morphological analyser is automatically generated (Garrido et al. 1999) from a morphological dictionary (MD) for the source language (SL). The MD contains the lemmas (canonical or base forms of inflected words), the inflection paradigms, and their mutual relationships. The module reads the surface forms (SFs) of the text and writes, for each surface form, one or more lexical forms (LFs), each consisting of a lemma, a part of speech, and inflection information.
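The symbol-by-symbol behaviour of an FST-based analyser can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names TRANSITIONS, FINAL_OUTPUT and transduce, the two-word Spanish lexicon, and the angle-bracket tag notation are hypothetical and are not taken from interNOSTRUM's dictionaries or compilers. The toy transducer is also deterministic, so it returns a single LF per SF, whereas the real analyser may return several.

```python
from typing import Dict, Optional, Tuple

# Hypothetical toy transducer for the Spanish surface forms "casa" / "casas".
# (state, input symbol) -> (next state, output written for that symbol)
TRANSITIONS: Dict[Tuple[int, str], Tuple[int, str]] = {
    (0, "c"): (1, "c"),
    (1, "a"): (2, "a"),
    (2, "s"): (3, "s"),
    (3, "a"): (4, "a"),
    (4, "s"): (5, ""),   # plural -s is consumed but not copied to the lemma
}

# Output written when the input ends in an accepting state (illustrative tags).
FINAL_OUTPUT: Dict[int, str] = {
    4: "<n><f><sg>",
    5: "<n><f><pl>",
}


def transduce(surface_form: str) -> Optional[str]:
    """Read the SF one symbol at a time and build the LF, or return None."""
    state = 0
    written = []
    for symbol in surface_form:
        step = TRANSITIONS.get((state, symbol))
        if step is None:          # no transition: the word is unknown
            return None
        state, output = step
        written.append(output)
    if state not in FINAL_OUTPUT:  # input ended in a non-accepting state
        return None
    written.append(FINAL_OUTPUT[state])
    return "".join(written)


if __name__ == "__main__":
    for word in ("casa", "casas", "cas"):
        print(word, "->", transduce(word))
```

Each lookup costs one dictionary access per input symbol, which is consistent with the remark that processing speed is practically independent of dictionary size.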

The bilingual dictionary module is called by the pattern processing module (see below); it is automatically generated from a file that contains the bilingual correspondences. The program reads an SL LF and writes the corresponding target-language (TL) LF.

The morphological generator basically performs the reverse of morphological analysis, but applied to the TL. The morphological generator is generated from an MD for the TL.

The postgenerator handles apostrophation and hyphenation. Those SFs involved in these phenomena (such as clitic pronouns, articles, some prepositions, etc.) activate this module, which is otherwise dormant. The postgenerator is generated from a file containing the corresponding rules for the TL.
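As a rough illustration of the kind of rule the postgenerator applies, the sketch below implements two simplified Catalan rules in plain Python: the masculine article "el" becomes "l'" before a vowel-initial word, and the preposition "de" contracts to "del" before "el" or to "d'" before a vowel-initial word. The function names and the rule set are hypothetical; the real module is compiled into a finite-state transducer from a rule file, and real Catalan apostrophation has exceptions (for example before consonantal i/u) that are ignored here.

```python
from typing import Callable, List, Optional

VOWELS = "aeiouàèéíòóúïü"


def starts_with_vowel(word: str) -> bool:
    """True if the word begins with a vowel, optionally after a silent h."""
    w = word.lower()
    if w.startswith("h"):
        w = w[1:]
    return bool(w) and w[0] in VOWELS


def article_rule(tok: str, nxt: str) -> Optional[str]:
    # el + vowel-initial word -> l' + word
    if tok.lower() == "el" and starts_with_vowel(nxt):
        return ("L'" if tok[0].isupper() else "l'") + nxt
    return None


def preposition_rule(tok: str, nxt: str) -> Optional[str]:
    if tok.lower() != "de":
        return None
    if nxt.lower() == "el":        # de + el -> del
        return tok + "l"
    if starts_with_vowel(nxt):     # de + vowel-initial word -> d' + word
        return tok[0] + "'" + nxt
    return None


def apply_rule(tokens: List[str],
               rule: Callable[[str, str], Optional[str]]) -> List[str]:
    """Scan left to right, merging a token with its successor when the rule fires."""
    out, i = [], 0
    while i < len(tokens):
        merged = rule(tokens[i], tokens[i + 1]) if i + 1 < len(tokens) else None
        if merged is None:
            out.append(tokens[i])
            i += 1
        else:
            out.append(merged)
            i += 2
    return out


def postgenerate(tokens: List[str]) -> List[str]:
    # Articles are elided first, so that "de el Estat" yields "de l'Estat".
    return apply_rule(apply_rule(tokens, article_rule), preposition_rule)


if __name__ == "__main__":
    print(" ".join(postgenerate("contra el Estat de dret".split())))
    print(" ".join(postgenerate("la caiguda de el governant".split())))
```

Tokens that trigger no rule pass through unchanged, which mirrors the idea that the module stays inactive for most of the text.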

The division of a text into words has some nontrivial aspects. On the one hand, there are a number of word groups that cannot be translated word for word and may be treated as fixed-length multiword units (MWUs); they are gradually being incorporated into the bilingual dictionary. Examples: Sp. con cargo a → Cat. a càrrec de ('at the expense of'); Sp. por adelantado → Cat. per endavant ('in advance'); Sp. echar de menos → Cat. trobar a faltar ('to miss [someone]'). In the last example, the MWU has a variable element that may be inflected (the verb, echar → trobar); MWUs with inflection have just started to be incorporated into interNOSTRUM's dictionaries. On the other hand, combinations of certain verb forms and enclitic pronouns are written in Spanish as a single word; these combinations involve orthographical transformations such as changes in accent marks or loss of consonants: Sp. dámelo = da + me + lo → Cat. dóna + me + lo = dóna-me'l ('give it to me!'); Sp. presentémonos = presentemos + nos → Cat. presentem + nos = presentem-nos ('let us introduce ourselves').

 
Figure 1: Basic interNOSTRUM modules

SL_text -> Morph._analyser -> Tagger -> Pattern_processor -> Morph._generator -> Post-generator -> TL_text
                                               ||
                                      Bilingual_dictionary

 
Table 1: Automatic generation of interNOSTRUM's modules from linguistic data

LANGUAGE   LINGUISTIC DATA                   GENERATION PROGRAM                 INTERNOSTRUM MODULE
SL         morphological dictionary          morphological analyser compiler    morphological analyser
SL         morphologically analysed corpus   tagger trainer                     tagger
SL, TL     bilingual dictionary              bilingual dictionary compiler      bilingual dictionary module
SL, TL     pattern processing rules          pattern processing rule compiler   pattern processing module
TL         morphological dictionary          morphological generator compiler   morphological generator
TL         apostrophe & hyphen rules         postgenerator compiler             postgenerator

 
2.2.2 The part-of-speech tagger

Most lexical ambiguities fall into two main groups: homography (when a SF has more than one LF or analysis) and polysemy (when the SF has a single LF but the lemma may have more than one interpretation).

The lexical disambiguation module or part-of-speech tagger uses a language model based on trigrams (sequences of three lexical categories) to resolve those homographs occurring in Spanish texts that present a category ambiguity. The model's parameters reflect the frequencies observed for each trigram in a reference text corpus; the tagger assigns a probability to each possible disambiguation of a sentence containing a lexical category ambiguity, and the most likely disambiguation is chosen. We are currently fine-tuning the tagset and building a larger training corpus to improve the performance of this module and to address homographs within the same lexical category. Polysemous words will be avoided through the use of a controlled Spanish biased toward banking and administration applications (see section 4).
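The trigram-based choice described above can be sketched as follows. This is only an illustration under stated assumptions: the probability table, the category names, the "BOS" padding symbol, the UNSEEN smoothing value and the function names are all invented for the example, and candidate taggings are enumerated exhaustively for clarity, which the paper does not claim to be the method used in interNOSTRUM.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

# Hypothetical trigram probabilities P(c3 | c1, c2); "BOS" pads the sentence start.
TRIGRAM_PROB: Dict[Tuple[str, str, str], float] = {
    ("BOS", "BOS", "art"): 0.6,
    ("BOS", "BOS", "pron"): 0.4,
    ("BOS", "art", "noun"): 0.8,
    ("BOS", "art", "verb"): 0.01,
    ("BOS", "pron", "verb"): 0.7,
    ("BOS", "pron", "noun"): 0.05,
    ("art", "noun", "adj"): 0.5,
    ("pron", "verb", "adj"): 0.1,
}
UNSEEN = 1e-4  # small probability assumed for trigrams absent from the table


def score(tags: Sequence[str]) -> float:
    """Probability of a category sequence as a product of trigram probabilities."""
    padded = ("BOS", "BOS") + tuple(tags)
    prob = 1.0
    for i in range(2, len(padded)):
        prob *= TRIGRAM_PROB.get(padded[i - 2:i + 1], UNSEEN)
    return prob


def disambiguate(candidates: List[List[str]]) -> Tuple[str, ...]:
    """Pick the most likely tagging among all combinations of candidate categories."""
    return max(product(*candidates), key=score)


if __name__ == "__main__":
    # Sp. "la casa grande": "la" is article or pronoun, "casa" is noun or verb.
    print(disambiguate([["art", "pron"], ["noun", "verb"], ["adj"]]))
```

Exhaustive enumeration grows exponentially with the number of ambiguous words, so a practical tagger would normally use dynamic programming over the same scores.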

 
2.2.3 The pattern processing module

In spite of the great similarity between Spanish and Catalan, there are still a number of important grammatical divergences: modal constructions --Sp. tienen que firmar → Cat. han de signar ('they have to sign')--; gender and number divergences --Sp. la deuda contraída (fem.) → Cat. el deute contret (masc.; 'the assumed debt')--; dropping of prepositions before que --Sp. la intención de que el cliente esté satisfecho → Cat. la intenció ∅ que el client estigui satisfet ('the intention that the customer be satisfied')--; relative constructions using cuyo ('whose'), absent in Catalan --Sp. la cuenta cuyo titular es el asegurado → Cat. el compte el titular del qual és l'assegurat ('the account whose owner is the insured person')--.

These divergences have to be treated using suitable grammatical rules. interNOSTRUM uses a solution which may also be found in commercial MT systems. It is based on the detection and treatment of predefined sequences of lexical categories (patterns) which may be seen as rudimentary phrase-structure constructs: for example, art.-noun or art.-noun-adj are two possible valid noun phrases. Those sequences known to the program constitute its pattern catalog. This module works as follows:

  • The text (morphologically analysed and disambiguated) is read left to right, one LF at a time.
  • The module searches, starting at the current position in the sentence, for the longest LF sequence that matches a pattern in its pattern catalog (for example, if the text starting in the current position is "una señal inequívoca..." ("an unmistakable signal"), it will choose art.-noun-adj instead of art.-noun).
  • The module operates on this pattern (to propagate gender and number agreement, to reorder it, to make lexical changes) following the rules associated with the pattern.
  • Then, the pattern processing module continues immediately after the pattern just processed (it does not visit again any of the LFs on which it has operated).

When no pattern is detected in the current position, the program translates one LF literally and restarts at the following LF. "Long-range" phenomena such as subject-verb agreement require the propagation of information from one pattern to the following ones; we are currently working on this aspect.
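The left-to-right, longest-match procedure described above can be sketched as follows. The data structures are assumptions made for illustration: the LF tuple, the BILINGUAL entries, the two-pattern CATALOG and the agree_with_noun action are hypothetical and far simpler than interNOSTRUM's compiled rule file. The example reproduces the gender divergence Sp. la deuda contraída → Cat. el deute contret.

```python
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple


class LF(NamedTuple):
    """Lexical form: lemma, lexical category, gender and number."""
    lemma: str
    cat: str
    gender: str
    number: str


# Hypothetical bilingual entries: SL lemma -> (TL lemma, TL gender if it changes).
BILINGUAL: Dict[str, Tuple[str, Optional[str]]] = {
    "el": ("el", None),
    "deuda": ("deute", "m"),        # feminine in Spanish, masculine in Catalan
    "contraído": ("contret", None),
}


def translate(lf: LF) -> LF:
    """Literal, word-for-word transfer of a single LF."""
    lemma, gender = BILINGUAL.get(lf.lemma, (lf.lemma, None))
    return lf._replace(lemma=lemma, gender=gender or lf.gender)


def agree_with_noun(lfs: List[LF]) -> List[LF]:
    """Translate each LF, then propagate the TL noun's gender and number."""
    out = [translate(lf) for lf in lfs]
    noun = next(lf for lf in out if lf.cat == "noun")
    return [lf._replace(gender=noun.gender, number=noun.number) for lf in out]


# Pattern catalog: sequence of categories -> action applied to the matched LFs.
CATALOG: Dict[Tuple[str, ...], Callable[[List[LF]], List[LF]]] = {
    ("art", "noun"): agree_with_noun,
    ("art", "noun", "adj"): agree_with_noun,
}
MAX_LEN = max(len(pattern) for pattern in CATALOG)


def process(sentence: List[LF]) -> List[LF]:
    out: List[LF] = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first, then shorter ones.
        for n in range(min(MAX_LEN, len(sentence) - i), 0, -1):
            window = sentence[i:i + n]
            action = CATALOG.get(tuple(lf.cat for lf in window))
            if action is not None:
                out.extend(action(window))
                i += n          # resume right after the processed pattern
                break
        else:
            out.append(translate(sentence[i]))  # no pattern: literal translation
            i += 1
    return out


if __name__ == "__main__":
    source = [LF("el", "art", "f", "sg"), LF("deuda", "noun", "f", "sg"),
              LF("contraído", "adj", "f", "sg")]
    for lf in process(source):
        print(lf)
```

Because art.-noun-adj is tried before art.-noun, the adjective also receives the masculine gender of the Catalan noun; with the shorter pattern it would have been translated literally and kept the feminine.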

The pattern processing module is automatically generated from a file containing rules that specify the patterns and the associated actions. This is the slowest module (around 1,000 wps), compared to the 10,000 wps of the rest of the modules. The current catalog only contains a few patterns.

 
3 A translation example

To convey a rough idea of the performance of interNOSTRUM in its current state of development (August 2000 version), a short example (randomly chosen from an Internet newspaper) is translated from Spanish into Catalan, and the minimal editing operations needed to make the translation acceptable are shown in brackets within the translated text. Most of the observed problems are due to the limited coverage of the current dictionaries, which are expected to be reasonably complete at the end of the project.

Spanish text: Fujimori deja el poder y convoca elecciones generales a las que no se presentará. Lima. -- El presidente de Perú, Alberto Fujimori, ha anunciado por sorpresa en un mensaje televisado a la nación la convocatoria de nuevos comicios y ha precisado que 'en esas elecciones generales, de más está decirlo, no participará quien habla'. Tras 10 años en el poder, el gobernante peruano, el más veterano en Latinoamérica después de su colega cubano, Fidel Castro, presentó así lo que ha sido interpretado como una dimisión en toda regla. La presunta implicación de Vladimiro Montesinos, el más íntimo colaborador de Fujimori, en diferentes actos de corrupción y atentados desde el poder contra el Estado de derecho es la causa directa de la caída del gobernante peruano.

Catalan text with corrections: Fujimori deixa el poder i convoca eleccions generals a les que [correct: les quals] no es presentarà. Llima [correct: Lima]. -- El president de Perú, Alberto Fujimori, ha anunciat per sorpresa en un missatge *televisado [unknown; correct: televisat] a la nació la convocatòria de nous *comicios [unknown; correct: comicis] i ha precisat que 'en aquestes eleccions generals, de més està [calque; correct: no cal] dir-lo [correct: -ho], no [insert: hi] participarà qui parla'. Després de 10 anys en el poder, el governant *peruano [unknown; correct: peruà], el més veterà en Llatinoamèrica després del seu col·lega cubà, Fidel Castro, va presentar així el que ha estat interpretat com una dimissió en tota regla. La pressumpta implicació de Vladimiro Montesinos, el més íntim col·laborador de Fujimori, en diferents actes de corrupció i atemptats des del poder contra l'Estat de dret és la causa directa de la caiguda del governant *peruano [unknown; correct: peruà].

 
4 Projected support tools for interNOSTRUM

We are currently working on three support tools: (a) a style assistant to help authors of Spanish texts avoid many difficult ambiguities using the syntactic, lexical and style rules specified in a controlled Spanish; (b) a pre-editing assistant for the manual disambiguation of problematic words and structures, in which clicking on them brings up a menu of options (helpful when the statistical strategy used by the program is unable to make the right choice); and (c) a post-editing assistant, in which the author can click on a target-language word suspected of being an incorrect translation and replace it with an alternative, taking into account the original text.

 
5 Other Spanish-Catalan MT products

interNOSTRUM is still under development and has not yet been thoroughly compared with its competitors, but they are listed here for completeness; we expect the translation accuracy of interNOSTRUM to be comparable to that of its best competitors toward the end of the project, with the added benefit of a speed in the range of 1,000 wps. There are currently four other Spanish-Catalan MT products available (the first two may be tested through the Internet):

  • INCYTA's Es-Ca (http://www.incyta.es) is a syntactical transfer system very much in the spirit of METAL. It runs as a server; customers pay by the word (0.02 euro per word).
  • The newspaper El Periódico de Catalunya publishes daily a Spanish version and a cover-to-cover translation into Catalan, using a system developed by SoftLibrary (http://www.softly.es) which may be described as a very efficient translation memory which draws from the bilingual corpus of the newspaper.
  • SALT (developed by the Generalitat Valenciana, the government of the autonomous region of València) is available through unofficial channels because it has not been officially published. It runs on Windows at about 10 wps but stops very frequently for human assistance with ambiguous words. It may be classified as a direct system with a plethora of ad-hoc strategies. It generates texts of reasonable quality in the València variety of Catalan.
  • AutoTrad's Ara (www.autotrad.com) is basically a commercial, improved version of SALT which generates the Barcelona variety of Catalan and runs without interruption by accumulating all disambiguation dialogues at the end of the translation.

     
6 Concluding remarks

We have presented interNOSTRUM, a Spanish-Catalan machine translation system currently under development that achieves great speed through the use of finite-state technologies and a reasonable accuracy using an advanced morphological transfer strategy.

     
References

  • Roche, E. and Schabes, Y. (eds) (1997) Finite-State Language Processing, Cambridge, Mass.: MIT Press: 1-65
  • Bouillon, P. and Clas, A. (eds) (1993) La traductique, Montréal University Presses
  • Arnold, D. (1993) 'Sur la conception du transfert', in Bouillon and Clas: 64-76
  • Hutchins, W. J. and Somers, H. L. (1992) An Introduction to Machine Translation, London: Academic Press
  • Mira i Gimènez, M. and Forcada, M. L. (1998) 'Understanding PC-based machine translation systems for evaluation, teaching and reverse engineering: the treatment of noun phrases in Power Translator', in Machine Translation Review (British Computer Society), 7:20-27
  • Garrido, A., Iturraspe, A., Pastor, H., Forcada, M. L. and Montserrat, S. (1999) 'A compiler for morphological analysers and generators based on finite-state transducers', in Procesamiento del Lenguaje Natural, (25):93-98

     

     
