Machine Translation Review
This page URL: http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-12/5.htm
by Paul R. Bowden Dept. of Computing
Abstract

The Brutus program is a direct-MT attempt at Latin to English translation. Incorporating a Latin morphological processor and two specialised English morphological processors, it produces translations on a sentence-by-sentence basis in a series of stages. The first stage is a word-for-word translation, each English word being marked with the case, gender, tense etc. information inherent in the Latin source word. Subsequent stages correct English noun number and verb endings, find connected noun-adjective groups, mark subject and object, and alter word order to standard English SVO order, amongst other things. An open-loop learning mechanism is employed for building the Latin vocabulary. Brutus is likely to be useful as a teaching aid or as an assistant to those whose Latin skills are not good. A future possibility is automatic marking of student Latin to English translations.

Introduction

The direct method was the earliest approach to machine translation (MT). Although its deficiencies soon became apparent, it remains popular in certain situations such as uni-directional MT. This is largely due to its usefulness, robustness and relative simplicity. Direct translation involves a series of stages commencing with word-for-word translation, each stage refining the output from the previous stage, e.g. by word-order changes (Hutchins and Somers, 1992).

The Latin language, the tongue of the Roman Empire, began as just one of many languages spoken in pre-Roman Italy, but grew in prominence along with Rome itself (Baugh and Cable, 1978). The Vulgar Latin spoken in the street eventually gave rise to the modern Romance languages. Classical Latin, a more highly inflected form, came to be used for literary and cultural purposes, and in later centuries as a means of international communication and storage of knowledge, particularly in the domains of the law, religion and scientific study.
It is Classical Latin that is the subject of this research (henceforth referred to simply as Latin). Latin is an inflected language in which sentence word order is used mostly to emphasise certain words rather than to indicate e.g. subject/object/indirect object (SOI) information. In Latin, nouns are inflected to indicate case (subject, object, possessive etc.) and number. Adjectives agree with nouns in number, case and gender. Verbs are conjugated to indicate person, number (singular or plural), voice (active or passive), mood (indicative or subjunctive), and tense. (See e.g. Paterson and Macnaughton (1968) for an introduction to Latin.)

This highly inflected nature lends itself well to a direct MT approach. For example, the two-word Latin sentence Regem laudabo translates directly as King I-will-praise, where the word regem is the inflected form of rex (king) in the accusative case (used for the direct object), and where the -abo ending of the verb indicates the first person singular, future tense of the verb laudare, to praise.

The computer program described in this paper attempts Latin to English MT in a multistage direct approach. This program has been named Brutus. Although the program is still being developed, it is sufficiently advanced to be useful already. The program incorporates a Latin lexical morphological processor capable of deducing the meaning(s) of a Latin word based upon a stem form. A vocabulary file of already-met words is used for fast look-up, and this also holds the stem forms available to the morphological processor. The program incorporates an open-loop learning mechanism so that newly deduced word meanings may be presented to a human user for checking before being appended to the vocabulary list.

Direct Look Up and Morphological Processing

The first stage of processing provides a word-for-word translation of the Latin text into English, on a sentence-by-sentence basis.
(Texts of several hundred sentences in length may be processed, so that realistic texts may be handled.) However, the English words so produced are tagged with the syntactical and functional information inherent within the original inflected Latin words. Thus, for example, regem from the example above has the following vocabulary entry:

    regem=kings[NOUNMasc-AccSing]

In the above case we see that the English word king is the direct object of the verb (because the Latin was in the accusative case, indicated by Acc), and this information is preserved within the square-bracketed section for use by later processing stages. (For practical reasons, all noun meanings are stored in their plural English forms; a later processing stage alters the plural form to the singular if necessary.) The word king in English is not inflected by functional role; in English this information is instead held by sentence word order.

On the other hand, the English I_will_praise given for laudabo clearly already holds within it the person and tense information, which is also stated in the square-bracketed section that follows it. (The auxiliary will is used to indicate future tense in all persons; shall is not used, despite its historical use in the 1st person singular and plural.) Thus Brutus does not distinguish the simple future from the emphatic future, traditionally distinguished by swapping will for shall and vice versa in all person/numbers. Whenever Brutus encounters laudabo, its entry is looked up in the vocabulary list.

The inflectional nature of Latin means that several tens of forms derived from each stem may need to be held in the vocabulary file. This is in theory not a problem, but in practice it requires constant additions of new words for each new text encountered. Without any form of automated assistance the human user would have to create the meaning entry in every case, a time-consuming and tedious task.
To counteract this, Brutus contains a morphological processor which derives the meaning entry from the stem form of the new word. Thus the human operator need only check the presented deduced meaning, rather than create it from scratch. The morphological processor does, however, require that the relevant stem form is already in the vocabulary file.

An example is now given. The word laudabas is encountered, and is not found in the vocabulary file. However, the vocabulary list is found to contain the entry:

    laudare=to_praise[VERB-InfConj1]

This indicates that laudare is the infinitive form of the first conjugation verb meaning 'to praise'. The morphological processor finds that laudabas may be derived from laudare as the active voice, indicative mood, imperfect tense, 2nd person singular form, and suggests the following new vocabulary entry:

    laudabas=you_were_praise[VERB-2ndSingImpIndAct]

Here, although you_were is inserted by the Latin morphological processor, the stem form of the verb is returned; a later English verb morphology stage alters praise to praising. The user accepts or rejects this meaning, and the translation continues. Once accepted, the new meaning is added to the vocabulary list and is available to all future runs of the program. Learning mode is selectable so that the program may be run with or without the need for human feedback. The vocabulary file may be filled manually or through the learning mechanism.

Brutus is robust and does not stop when an unknown word is encountered; instead, either a deduced meaning is used or, if this is not possible (e.g. because the stem form is not in the vocabulary file), the word is left as UNKNOWN. In the rare cases where one Latin word has more than one possible derived meaning, the human user is presented with all meanings in turn for acceptance or rejection.
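The look-up, derivation and open-loop learning steps described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the Brutus implementation: the names `parse_entry`, `derive`, `look_up` and `FIRST_CONJ_ENDINGS` are hypothetical, the ending table covers just two first-conjugation endings, and the tag `VERB-1stSingFutIndAct` is assumed by analogy with the tags shown in the text.

```python
import re

# Vocabulary entries use the "latin=english[TAGS]" layout shown in the text.
ENTRY = re.compile(r"^([^=\[\]]+)=([^\[\]]+)\[([^\]]*)\]$")

# Hypothetical ending table for first-conjugation verbs only.
# The "abas" row mirrors the laudabas example; the tag on the "abo" row is
# an assumption made by analogy with the other tags in the text.
FIRST_CONJ_ENDINGS = {
    "abas": ("you_were_{stem}", "VERB-2ndSingImpIndAct"),
    "abo": ("I_will_{stem}", "VERB-1stSingFutIndAct"),
}


def parse_entry(line):
    """Split 'latin=english[TAGS]' into (latin, english, tags)."""
    match = ENTRY.match(line.strip())
    if match is None:
        raise ValueError(f"not a vocabulary entry: {line!r}")
    return match.group(1), match.group(2), match.group(3)


def derive(word, vocabulary):
    """Return (english, tags) candidates derived from stored infinitives."""
    candidates = []
    for latin, (english, tags) in vocabulary.items():
        if tags != "VERB-InfConj1" or not latin.endswith("are"):
            continue
        root = latin[:-3]  # laudare -> laud
        stem = english[3:] if english.startswith("to_") else english
        for ending, (template, new_tags) in FIRST_CONJ_ENDINGS.items():
            if word == root + ending:
                candidates.append((template.format(stem=stem), new_tags))
    return candidates


def look_up(word, vocabulary, confirm=None):
    """Direct look-up first, then derivation, else UNKNOWN.

    `confirm` stands in for the human check in learning mode. When it is
    None (learning mode off), the first deduced meaning is used but not
    stored; that behaviour is an assumption of this sketch.
    """
    if word in vocabulary:
        english, tags = vocabulary[word]
        return f"{english}[{tags}]"
    for english, tags in derive(word, vocabulary):
        if confirm is None:
            return f"{english}[{tags}]"
        if confirm(f"{word}={english}[{tags}]"):
            vocabulary[word] = (english, tags)  # available to future runs
            return f"{english}[{tags}]"
    return "UNKNOWN"
```

Calling `look_up("laudabas", vocabulary, confirm=...)` with only the `laudare` entry loaded presents the deduced entry to the `confirm` callback and, if it is accepted, stores it so that the next occurrence is found by direct look-up.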
When translation is proceeding normally by look-up, Brutus likewise returns all looked-up meaning forms for each Latin word in the text, and these are disambiguated later.

Further Processing Stages

To illustrate the intended processing stages following the initial direct stage, the following example sentence will be used:

    Secunda legio castra in Gallia habet, sed in Britanniam cum imperatore festinabit.

STAGE 1

The output from the first stage is as follows:

    Second[ADJ-NomSingFem, VocSingFem, AblSingFem, NomPlurNeut, VocPlurNeut, AccPlurNeut] legions[NOUNFEm-NomSing,VocSing] camps[NOUNNeut-NomPlur, VocPlur,AccPlur] in[PREP-+Abl]-OR-into[PREP-+Acc] Gaul[NOUNFem-NomSing, VocSing, AblSing] he/she/it_have[VERB-3rdSingPresIndAct], but[CONJ] in[PREP-+Abl]-OR-into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] generals[NOUNMasc-AblSing] he/she/it_will_hurry[VERB-3rdSingFutIndAct].

Nouns, with the exception of proper nouns, are given in their plural form at this point, as stored in the vocabulary. Verbs are in the form of pronoun plus modals indicating tense/number plus uninflected stem (e.g. he/she/it_eat, I_will_praise, you_had_eat). The forward slash is used to separate alternatives within entries, but where the Latin word has more than one quite distinct sense the possible senses are each given using -OR- as the separator (see e.g. the entry for in).

STAGE 2

This stage corrects the noun numbers where possible and also morphs verb forms to correct English. The output from the second stage is as follows:

    Second[ADJ-NomSingFem, VocSingFem, AblSingFem, NomPlurNeut, VocPlurNeut, AccPlurNeut] legion[NOUNFEm-NomSing, VocSing] camps[NOUNNeut-NomPlur, VocPlur, AccPlur] in[PREP-+Abl]-OR-into[PREP-+Acc] Gaul[NOUNFem-NomSing,VocSing,AblSing] he/she/it_has[VERB-3rdSingPresIndAct], but[CONJ] in[PREP-+Abl]-OR-into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] general[NOUNMasc-AblSing] he/she/it_will_hurry[VERB-3rdSingFutIndAct].
Here, legions and generals have been converted to their singular forms, since the string 'Plur' does not occur in the square bracketed sections following them. This is achieved using the sing function described in Bowden, Halstead and Rose (1996). Also in this stage the stem verb form have has been altered to has. This is done using a look-up table containing rules for the construction of irregular verb forms, such as those shown in Table 1. Stem/[...] pairs not present in the table are altered in a regular manner by a rule-based system which examines the stem ending, although in many cases the stem need not be altered (see e.g. the laudabo example above).
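The Stage 2 corrections described above might be sketched along the following lines. This is a rough illustration, not the Brutus code: `naive_singular` is a crude stand-in for the `sing` function of Bowden, Halstead and Rose (1996), the single `IRREGULAR` row is taken from the have/has example, and the two regular rules (an -ing form for the Imperfect, an added -s for the 3rd person singular present) are assumptions made for the sake of the example.

```python
import re

# A tagged token such as "legions[NOUNFEm-NomSing,VocSing]".
TOKEN = re.compile(r"^(?P<word>[^\[\]]+)\[(?P<tags>[^\]]*)\]$")

# Hypothetical irregular-verb table keyed by (stem, tag marker).
IRREGULAR = {("have", "3rdSingPres"): "has"}


def naive_singular(noun):
    """Crude stand-in for the `sing` function cited in the text."""
    if noun.endswith("s") and not noun.endswith("ss"):
        return noun[:-1]
    return noun


def inflect_verb(stem, tags):
    """Apply the irregular table first, then assumed regular rules."""
    for (irregular_stem, marker), form in IRREGULAR.items():
        if stem == irregular_stem and marker in tags:
            return form
    if "Imp" in tags:  # Imperfect is rendered as a continuous past
        return (stem[:-1] if stem.endswith("e") else stem) + "ing"
    if "3rdSingPres" in tags:
        return stem + "s"
    return stem  # e.g. I_will_praise needs no change


def stage2(token):
    """Correct noun number and verb form for one tagged token."""
    match = TOKEN.match(token)
    if match is None:  # e.g. unresolved "-OR-" alternatives
        return token
    word, tags = match["word"], match["tags"]
    if tags.startswith("NOUN") and "Plur" not in tags:
        word = naive_singular(word)
    elif tags.startswith("VERB"):
        prefix, _, stem = word.rpartition("_")
        new_stem = inflect_verb(stem, tags)
        word = f"{prefix}_{new_stem}" if prefix else new_stem
    return f"{word}[{tags}]"
```

Note that the stand-in singulariser would damage a proper noun ending in s; the text states that proper nouns are not stored in plural form, so a fuller version would need to exempt them.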
Table 1. Irregular English verb morphological processing rules

This morphological processing is specifically tailored to the task at hand; the rules are triggered by the Latin tenses etc. as found in the square-bracketed parts. Brutus always translates the Imperfect tense as a continuous past (e.g. we were eating) and always translates the Perfect as a simple past (e.g. I ate). This is not ideal, but it is pragmatic. The other Latin tenses present no such dilemmas.

STAGE 3

This stage looks for connected adjective and noun runs, and reduces the possibilities in the square brackets accordingly. Adjectives in Latin may follow or precede the noun, and in addition there is a construct where et (and) is used to link two adjectives applying to the same noun. The output is as follows:

    Second[ADJ-NomSingFem, VocSingFem] legion[NOUNFEm-NomSing, VocSing] camps[NOUNNeut-NomPlur, VocPlur, AccPlur] in[PREP-+Abl]-OR-into[PREP-+Acc] Gaul[NOUNFem-NomSing,VocSing, AblSing] he/she/it_has[VERB-3rdSingPresIndAct], but[CONJ] in[PREP-+Abl]-OR-into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] general[NOUNMasc-AblSing] he/she/it_will_hurry[VERB-3rdSingFutIndAct].

In this case the plural possibilities have been removed from Second since the noun it modifies, legion, is in the singular, and the Abl possibility has also been deleted because it is not present for the noun. The difficulty in this stage lies in finding the boundaries between adjective-noun groups; for example, the noun camps is not part of the first group. This is done using heuristics which tell where to place the boundary for each possible ADJ/NOUN run.

STAGE 4

This stage resolves prepositions, both internally (/-parts) and between polysemous forms (-OR- parts).
The output becomes:

    Second[ADJ-NomSingFem,VocSingFem] legion[NOUNFEm-NomSing,VocSing] camps[NOUNNeut-NomPlur,VocPlur,AccPlur] in[PREP-+Abl] Gaul[NOUNFem-AblSing] he/she/it_has[VERB-3rdSingPresIndAct] , but[CONJ] into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] general[NOUNMasc-AblSing] he/she/it_will_hurry[VERB-3rdSingFutIndAct] .

Here, the first occurrence of Latin in can only mean 'existing inside', since what follows is possibly Abl but definitely not Acc. (Latin in+Abl can sometimes mean 'on', but Brutus always translates it as 'in'; a future stage will perform in-on changing where necessary.) The second occurrence of Latin in must mean 'into', since what follows is an Acc noun. Redundant case information is deleted from within the [...] parts. The important factor is the case of the following noun phrase, but this example shows that a list of place names is also maintained for /-resolution.

STAGE 5

This stage marks the Subject, Verb, Object, and Everything-Else parts (SVOE structure). This is done purely by part of speech and case as indicated within the square-bracketed parts:

    <Subj1=Second[ADJ-NomSingFem] legion[NOUNFEm-NomSing]> <Obj1=camps[NOUNNeut-AccPlur]> <Else1= in[PREP-+Abl] Gaul[NOUNFem-AblSing]> <Verb1=he/she/it_has[VERB-3rdSingPresIndAct]>, <Else2=but[CONJ] into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] general[NOUNMasc-AblSing]> <Verb2= he/she/it_will_hurry[VERB-3rdSingFutIndAct]>.

Angle brackets are used to indicate the Subject etc. parts as illustrated. It is assumed that all sentences have a subject. In the example, this allows various cases of nouns to be deleted from within the square brackets. Subjects are identified first so that nouns which might be either subject or object can be disambiguated.
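Returning to the preposition resolution of Stage 4, the case-matching idea can be sketched as below. This is a simplified illustration, not the Brutus code: the function names are hypothetical, only the single token immediately after the preposition is inspected (rather than a whole noun phrase), and the list of place names mentioned in the text is not modelled.

```python
import re

# A tagged token such as "Gaul[NOUNFem-NomSing, VocSing, AblSing]".
TOKEN = re.compile(r"^(?P<word>[^\[\]]+)\[(?P<pos>[A-Za-z]+)-(?P<info>[^\]]*)\]$")


def parse(token):
    """Return (word, part_of_speech, [info items]) or None if no match."""
    match = TOKEN.match(token)
    if match is None:
        return None
    items = [item.strip() for item in match["info"].split(",")]
    return match["word"], match["pos"], items


def resolve_prepositions(tokens):
    """Pick the preposition sense whose required case the next noun allows."""
    out = list(tokens)
    for i in range(len(out) - 1):
        alternatives = [parse(alt) for alt in out[i].split("-OR-")]
        if any(alt is None or alt[1] != "PREP" for alt in alternatives):
            continue
        following = parse(out[i + 1])
        if following is None or not following[1].startswith("NOUN"):
            continue
        word, pos, cases = following
        fits = []
        for prep, _, need in alternatives:
            wanted = need[0].lstrip("+")  # "+Abl" -> "Abl"
            kept = [case for case in cases if case.startswith(wanted)]
            if kept:
                fits.append((prep, need[0], kept))
        if len(fits) != 1:
            continue  # no sense fits, or still ambiguous: leave untouched
        prep, need, kept = fits[0]
        out[i] = f"{prep}[PREP-{need}]"
        out[i + 1] = f"{word}[{pos}-{','.join(kept)}]"  # drop redundant cases
    return out
```

Applied to the relevant tokens of the Stage 3 output, this reduces the first in/into pair to in[PREP-+Abl] followed by Gaul[NOUNFem-AblSing], and the second to into[PREP-+Acc] followed by Britain[NOUNFem-AccSing], in line with the Stage 4 output shown in the text.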
STAGE 6

Following the identification of possible Subject, Object etc. entities in Stage 5, this stage links verbs to their subjects and deletes /-parts within the verbs:

    <Subj1(Verb1,Verb2)=Second[ADJ-NomSingFem] legion[NOUNFEm-NomSing]> <Obj1(Verb1)=camps[NOUNNeut-AccPlur]> <Else1= in[PREP-+Abl] Gaul[NOUNFem-AblSing]> <Verb1=has[VERB-3rdSingPresIndAct]> , <Else2=but[CONJ] into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] general[NOUNMasc-AblSing]> <Verb2= will_hurry[VERB-3rdSingFutIndAct]>.

Here, the first verb (has) is linked to the subject, which has Sing marking. The second verb (he/she/it_will_hurry) might have posed more of a problem, since it might have applied either to the legion in the first clause or to the general in the second. However, general is not marked as a subject group, and so this possibility may be discounted; the second verb must be linked to the single subject of the sentence.

STAGE 7

In this stage, word order is altered to reflect standard English SVOE order, and the <...> parts can therefore be removed. This results in the following:

    Second[ADJ-NomSingFem] legion[NOUNFEm-NomSing] has[VERB-3rdSingPresIndAct] camps[NOUNNeut-AccPlur] in[PREP-+Abl] Gaul[NOUNFem-AblSing], but[CONJ] will_hurry[VERB-3rdSingFutIndAct] into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] general[NOUNMasc-AblSing].

Moving the verb positions also allows removal of he/she/it where the subject immediately precedes the verb or where it is deemed to be ellipted from that position. In the case of a second verb, where one verb has already been linked to a subject and moved, the second verb is moved to the start of the clause it appears in (after a conjunction if one is present).

STAGE 8

The translation is nearly complete. This penultimate stage cannot be performed on an isolated sentence, for it comprises the insertion of determiners.
(In Latin, the specificity of objects within a text is deduced largely by the reader, rather than being explicitly marked as it is in English.) The entire translated text after Stage 7 will be passed to a discourse-entity (DE) recogniser, which will use the [...] parts to detect and count each possible DE and insert definite and indefinite articles. Assuming that the example sentence is mid-way through a longer text, this would result in the following:

    The second[ADJ-NomSingFem] legion[NOUNFEm-NomSing] has[VERB-3rdSingPresIndAct] camps[NOUNNeut-AccPlur] in[PREP-+Abl] Gaul[NOUNFem-AblSing], but[CONJ] will_hurry[VERB-3rdSingFutIndAct] into[PREP-+Acc] Britain[NOUNFem-AccSing] with[PREP-+Abl] the general[NOUNMasc-AblSing].

STAGE 9

In the final stage, all square-bracket text is removed, and underscores are replaced by spaces:

    The second legion has camps in Gaul, but will hurry into Britain with the general.
    Secunda legio castra in Gallia habet, sed in Britanniam cum imperatore festinabit.

Discussion

The Brutus program is still being developed, and although coding for Stages 1, 2 and 9 is complete, most of the remaining stages are still being built. In addition, much more vocabulary needs to be added to the program's dictionary. The above description does not discuss certain tasks such as the linking of adverbs to verbs. There are also constructions in Latin which take a standard form (e.g. the 'ablative absolute' construction, the use of -ne on the first word of a sentence to indicate a question etc.). These will be tackled within the above stages or in separate stages at relevant positions within the above framework.

At this early stage, it is difficult to know how much of a problem polysemous words will be. It is thought that these are in fact much rarer than in English, due to the highly inflected nature of Latin.
It is possible to concoct examples where one orthographic Latin word has more than one distinct meaning, but it remains to be seen whether this is a problem in practice.

Brutus embodies a very shallow approach to MT, but even the output from Stage 1 (completed) is largely understandable and is in itself a good translation aid. Thus Brutus is already useful. The question arises as to who might use such a program; clearly, expert classical scholars are unlikely to need such assistance. However, as a teaching aid Brutus might well prove very helpful. Students could use it to check their translations, and in this sense it might actually be more useful than a human marker, since human markers do not usually write down e.g. all the possible case/gender combinations for each noun in the text. Also, as the basis for an automatic marking system, Brutus has the potential to do more than just highlight incorrectly translated words or phrases, since it could in effect explain the correct translation to the student.

References

Baugh, A. C. and Cable, T. (1978) A History of the English Language (3rd Edition). Routledge and Kegan Paul.
Bowden, P. R., Halstead, P. and Rose, T. G. (1996) Dictionaryless English Plural Noun Singularisation Using A Corpus-Based List of Irregular Forms. In Corpus-based Studies in English: Papers from the Seventeenth International Conference on English Language Research on Computerized Corpora (ICAME 17), Stockholm, May 15-19 1996. Rodopi.
Hutchins, W. J. and Somers, H. L. (1992) An Introduction to Machine Translation. Academic Press.
Paterson, J. and Macnaughton, E. G. (1968) The Approach to Latin (First Part) (revised 1968). Oliver and Boyd.