The British Computer Society
Natural Language Translation Specialist Group

http://www.bcs.org.uk/siggroup/sg37.htm

Machine Translation Review
PAGES 6-16

No. 14, December 2003   ISSN: 1358-8346
http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-14/5.htm


 
A Latin Morphological Processor

by Paul R. Bowden
School of Computing and Mathematics
The Nottingham Trent University
E-mail: paul.bowden@ntu.ac.uk
 

Abstract
 
This paper describes the Latin morphological processor used by BRUTUS, a Latin-to-English MT system being developed by the author. The processor uses the system’s Latin lexicon to find base forms, and a database of morphology rules to create a set of possible parts of speech (PoS) and other features for each Latin word in the input text. Because Latin is highly inflected, most Latin words are labelled with only a single PoS, and so the processor provides the basis for an effective PoS tagger for Latin. The features provided depend on the PoS, e.g. gender/case/number for nouns and person/number/tense/voice/mood for verbs, and all such possibilities are indicated for each possible PoS. The processor relies on a suitable base form being present in the lexicon (e.g. infinitive forms for verbs, nominative singular forms for nouns), each annotated with standard Latin dictionary category information. When the base form is present, the processor is highly effective, finding all inflectional possibilities for each Latin word encountered (together with English meanings).
 
 
Introduction
 
There appears to be a current revival in the fortunes of the Latin language, with many UK schools now teaching it once more, even at primary level (see e.g. the popular primary-level text Minimus the Mouse (Bell, 1999)). The BRUTUS Latin MT system was first described in (Bowden, 2001) and is currently being developed, partly with pedagogic needs in mind. BRUTUS is a unidirectional MT system (Latin to English only, at present). It is a ‘direct’ MT system, i.e. a multistage system that starts with lexical transfer but involves no parsing (Hutchins and Somers, 1992). The system uses a morphological processor which aims to present all possible senses of each Latin word in the input text, for later disambiguation down to the gender / case / number level (for nouns and adjectives) and the person / number / tense / voice / mood level (for verbs). Figure 1 illustrates the output of the processor for each PoS, and represents the current first-stage output of the MT system. The Latin input sentence is:
 
Secunda legio castra magna in Gallia habet, sed in Britanniam cum imperatore festinabit.
 
Stage 1 output sentence:
 
Second[ADJV(us,a,um)-FemNomSing,FemVocSing,FemAblSing,NeuNomPlur,NeuVocPlur,NeuAccPlur]
legions[NOUNFemDec3(?,?is)-NomSing,VocSing]
camps[NOUNNeuPlurDec2(a,orum)-NomPlur,VocPlur,AccPlur]
large[ADJV(us,a,um)-FemNomSing,FemVocSing,FemAblSing,NeuNomPlur,NeuVocPlur,NeuAccPlur]
{in[PREP-+Abl]^OR^on[PREP-+Abl]^OR^into[PREP-+Acc]}
Gaul[NOUNPrpDec1(a,ae)-NomSing,VocSing,AblSing]
he/she/it_have[VERB-3rdSingPresIndAct]
,
but[CONJ]
{in[PREP-+Abl]^OR^on[PREP-+Abl]^OR^into[PREP-+Acc]}
Britain[NOUNPrpDec1(a,ae)-AccSing]
with[PREP-+Abl]
generals[NOUNMscDec3(?,?is)-AblSing]
he/she/it_will_hurry[VERB-3rdSingFutrIndAct]
.

 
Figure 1. Example BRUTUS stage-1 output.
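
The bracketed notation in Figure 1 is regular enough to be read back mechanically. The following is a minimal Python sketch, not part of BRUTUS itself: the function name parse_stage1_token and the returned tuple layout are assumptions made for illustration, and the format is inferred only from the example output above.

```python
import re

# One "sense[TAG]" unit, e.g. into[PREP-+Acc]; the names here are illustrative.
UNIT_RE = re.compile(r"^(?P<sense>[^\[\]]+)\[(?P<tag>[^\[\]]*)\]$")


def parse_stage1_token(token):
    """Split one stage-1 token into (sense, category, features) alternatives."""
    body = token.strip()
    # Alternative senses are wrapped in braces and separated by ^OR^.
    if body.startswith("{") and body.endswith("}"):
        body = body[1:-1]
    alternatives = []
    for unit in body.split("^OR^"):
        match = UNIT_RE.match(unit)
        if match is None:
            raise ValueError(f"not a stage-1 unit: {unit!r}")
        # Everything before the first hyphen is treated as the category label.
        category, _, features = match.group("tag").partition("-")
        alternatives.append(
            (match.group("sense"), category, features.split(",") if features else [])
        )
    return alternatives


for sense, category, features in parse_stage1_token(
    "{in[PREP-+Abl]^OR^on[PREP-+Abl]^OR^into[PREP-+Acc]}"
):
    print(sense, category, features)
```

A later disambiguation stage could then work on the lists of alternatives rather than on the raw strings.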
 
Most of the input Latin words are looked up directly in the lexicon, which contains both base-form entries and inflected-form entries, all annotated as above with information in the square brackets. However, when a Latin word is encountered which has not been seen by BRUTUS before, i.e. which is not in the lexicon, the morphological processor is automatically applied to it. The processor allows the many tens of inflected forms for each Latin base-word to be generated on-the-fly, i.e. it obviates the need for their manual addition to the lexicon. In order to be able to suggest the square-bracketed parts above for any such new word, as well as the initial English sense, the morphological processor requires access to the system lexicon and the database of lemmatisation rules.
 
BRUTUS also allows an interactive mode in which the user is asked to view the output of the morphological processor and reject any suggestions which are not correct in Latin. This is necessary because on rare occasions a lemmatisation rule can be applied to the Latin word even though this specific lemmatisation is not grammatically correct in Latin. However, this need be done only once for each Latin lexeme, as the results are then stored in the lexicon for immediate lookup in later runs. An example of this situation will be given later.
 
The morphological processor was initially intended as an aid to building a very large lexicon containing all inflected forms of all Latin words. The philosophy of BRUTUS is to attempt a very shallow direct-MT approach, the first stage of which is word lookup (with presentation of several senses if these exist; see e.g. the ^OR^ parts of the preposition entries in Figure 1). Later processing stages are intended to refine the first-stage output (Bowden, 2001). However, the processor identifies all the lemmatisations which could be applied to any given Latin lexeme, and so is able to suggest a PoS even for words whose base form is unknown. Thus it may in the future provide the basis for a Latin PoS tagger, even without any Latin lexicon. Such a tagger would need further PoS tag disambiguation stages, since Latin words in different PoS categories may end with the same character(s).
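
The idea of suggesting a PoS from endings alone can be sketched as follows. This is an illustrative Python fragment, not the BRUTUS tagger: the table ENDING_POS is a tiny hand-picked subset of the endings that appear in the rule extracts of Figures 3 and 4, and the function name guess_pos is an assumption.

```python
# A tiny, hand-picked subset of endings from the noun and verb rules;
# the real rulebase is far larger, so these guesses are illustrative only.
ENDING_POS = [
    ("am", "NOUN"), ("ae", "NOUN"), ("arum", "NOUN"), ("ibus", "NOUN"),
    ("is", "NOUN"),
    ("is", "VERB"), ("amus", "VERB"), ("ant", "VERB"), ("unt", "VERB"),
]


def guess_pos(word):
    """Return every PoS whose listed endings match the end of the word."""
    return sorted({pos for ending, pos in ENDING_POS if word.endswith(ending)})


print(guess_pos("puellarum"))  # matches a noun ending only
print(guess_pos("regis"))      # -is is both a noun and a verb ending
```

The second call shows why further disambiguation would be needed: an ending such as -is is shared by noun and verb rules, so more than one PoS is returned.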
 
 
Lexicon Entries and Morphological Rules
 
Extracts from the system lexicon are given in Figure 2. Some of the entries were added manually, and some by the morphological processor. The manually added entries include those having -NomSing (for nouns and adjectives) and -PresInfAct (for verbs), which are the primary base forms towards which the morphological processor works. (Positive adverbs and prepositions are also manual additions.) Other base forms are also required (discussed later).
 
ad===at[PREP-+Acc]
ab===from[PREP-+Abl]
per===through[PREP-+Acc]
ad===towards[PREP-+Acc]
agricola===farmers[NOUNMscDec1(a,ae)-NomSing,VocSing,AblSing]
exitus===ends[NOUNMscDec4(us,us)-NomSing,VocSing,GenSing,NomPlur,VocPlur,
     AccPlur]
cornu===horns[NOUNNeuDec4(u,us)-NomSing,VocSing,AccSing,DatSing,AblSing]
domus===houses[NOUNFemDec4Irg(us,us)-NomSing,VocSing,GenSing,NomPlur,
     VocPlur,AccPlur]
urbs===towns[NOUNFemDec3(?,?is)-NomSing,VocSing]
urbis===towns[NOUNFemDec3(?,?is)-GenSing]
navis===ships[NOUNFemDec3(?,?is)-NomSing,VocSing,GenSing]
classis===fleets[NOUNFemDec3(?,?is)-NomSing,VocSing,GenSing]
canis===dogs[NOUNMscDec3(?,?is)-NomSing,VocSing,GenSing]
canis===dogs[NOUNFemDec3(?,?is)-NomSing,VocSing,GenSing]
bos===oxen[NOUNMscDec3Irg(?,?is)-NomSing]
rex===kings[NOUNMscDec3(?,?is)-NomSing,VocSing]
regis===kings[NOUNMscDec3(?,?is)-GenSing]
dolor===pains[NOUNMscDec3(?,?is)-NomSing,VocSing]
doloris===pains[NOUNMscDec3(?,?is)-GenSing]
amabam===I_was_love[VERB-1stSingImpfIndAct]
amabatis===you_were_love[VERB-2ndPlurImpfIndAct]
amabimus===we_will_love[VERB-1stPlurFutrIndAct]
amabo===I_will_love[VERB-1stSingFutrIndAct]
amabunt===they_will_love[VERB-3rdPlurFutrIndAct]
amamus===we_love[VERB-1stPlurPresIndAct]
amant===they_love[VERB-3rdPlurPresIndAct]
amare===to_love[VERB-PresInfAct]
facere===to_make[VERB-PresInfAct]
facere===to_do[VERB-PresInfAct]
deducere===to_launch[VERB-PresInfAct]
deduxi===I_launch[VERB-1stSingPerfIndAct]
deductum===launch[VERB-SpnAcc/PARTICPerfPas-MscAccSing,NeuNomSing,
     NeuVocSing,NeuAccSing]
regere===to_rule[VERB-PresInfAct]
rectum===rule[VERB-SpnAcc/PARTICPerfPas-MscAccSing,NeuNomSing,
     NeuVocSing,NeuAccSing]
communicare===to_share[VERB-PresInfAct]
navigare===to_sail[VERB-PresInfAct]
appellare===to_call[VERB-PresInfAct]
amas===you_love[VERB-2ndSingPresIndAct]
amaveram===I_had_love[VERB-1stSingPlupIndAct]
amaverint===they_will_have_love[VERB-3rdPlurFperIndAct]
amavi===I_love[VERB-1stSingPerfIndAct]

 
Figure 2. Extracts (non-contiguous) from the BRUTUS lexicon.

Note that nouns are classified by PoS (NOUN), gender (Msc, Fem, Neu, with Prp marking proper nouns), declension number (e.g. Dec1) and "how they go" (e.g. a,ae). The base-form entry, which includes the string -NomSing, also lists any other case/number combinations which the same word form might represent. (Third declension ‘increasing’ nouns and adjectives also need the GenSing as a base form.) The English sense is always given as a plural, as BRUTUS has available to it a function to create the English singular form as necessary (Bowden et al., 1996). For example, here is the entry for agricola:
 
agricola===farmers[NOUNMscDec1(a,ae)-NomSing,VocSing,AblSing]
 
Verbs have a simpler base-form entry. They are merely labelled as VERB-PresInfAct. No conjugation number is given (perhaps surprisingly), as this has not yet proved to be vital (but see later comments). (In addition, the 1stSingPerfIndAct is used to cover verbs having a distinct perfect stem, e.g. resurgo, resurrexi.) Here is the entry for numerare:
 
numerare===to_count[VERB-PresInfAct]
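
As an illustration of the entry format, the sketch below reads such lines into a small in-memory index. It is a Python sketch under assumptions: the names LexEntry, parse_lexicon_line and build_lexicon are hypothetical, and each entry is assumed to sit on a single line (the wrapped entries in Figure 2 would need to be rejoined first).

```python
from collections import defaultdict
from typing import NamedTuple


class LexEntry(NamedTuple):
    latin: str
    english: str
    category: str    # e.g. "NOUNMscDec1(a,ae)" or "VERB"
    features: tuple  # e.g. ("NomSing", "VocSing", "AblSing")


def parse_lexicon_line(line):
    """Parse one 'latin===english[CATEGORY-features]' line."""
    latin, _, rest = line.strip().partition("===")
    english, _, tag = rest.partition("[")
    category, _, features = tag.rstrip("]").partition("-")
    return LexEntry(latin, english, category,
                    tuple(features.split(",")) if features else ())


def build_lexicon(lines):
    """Index entries by Latin form; one form may have several entries."""
    lexicon = defaultdict(list)
    for line in lines:
        stripped = line.strip()
        if stripped and not stripped.startswith("//"):
            entry = parse_lexicon_line(stripped)
            lexicon[entry.latin].append(entry)
    return lexicon


lexicon = build_lexicon([
    "agricola===farmers[NOUNMscDec1(a,ae)-NomSing,VocSing,AblSing]",
    "numerare===to_count[VERB-PresInfAct]",
])
print(lexicon["agricola"][0].features)
```

Keeping a list per Latin form allows for forms with more than one sense, such as the two entries for facere in Figure 2.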
 
These lexicon entries, together with the morphological rulebase, allow the morphological processor to explain lexemes such as agricolae and numero, if they are not already present in the lexicon as separate entries. Figure 3 gives examples of noun rules from the morphological rulebase, and Figure 4 gives some of the verb rules. In all cases, the notation used is x>>>y===text<<<[...] where x represents the ending characters of the Latin word being investigated, y the ending with which x is replaced by the processor, and text some characters which need to be inserted in the created meaning (used for verbs but not nouns). In the case of nouns, the base form rule always has x the same as y, and so this rule is not used at present, but has been included for completeness and because of potential future applications (e.g. identifying PoS in a Latin PoS tagger).
 

//------------------------------------------------------------------
// NOUNS
//------------------------------------------------------------------
//
// 1st DECLENSION
//
// Note: we morph to the Nom Sing, so we only need this
//       base form in the vocab file.
//
// e.g. mensa===tables[NOUNFemDec1(a,ae)-NomSing,VocSing,AblSing]
//
// mensa
// inaccessible rule: a>>>a===<<<[NOUNFemDec1(a,ae)-NomSing,VocSing,AblSing]
am>>>a===<<<[NOUNFemDec1(a,ae)-AccSing]
ae>>>a===<<<[NOUNFemDec1(a,ae)-GenSing,DatSing,NomPlur,VocPlur]
as>>>a===<<<[NOUNFemDec1(a,ae)-AccPlur]
arum>>>a===<<<[NOUNFemDec1(a,ae)-GenPlur]
is>>>a===<<<[NOUNFemDec1(a,ae)-DatPlur,AblPlur]
abus>>>a===<<<[NOUNFemDec1(a,ae)-DatPlur,AblPlur]
//
// nauta
// inaccessible rule: a>>>a===<<<[NOUNMscDec1(a,ae)-NomSing,VocSing,AblSing]
am>>>a===<<<[NOUNMscDec1(a,ae)-AccSing]
ae>>>a===<<<[NOUNMscDec1(a,ae)-GenSing,DatSing,NomPlur,VocPlur]
as>>>a===<<<[NOUNMscDec1(a,ae)-AccPlur]
arum>>>a===<<<[NOUNMscDec1(a,ae)-GenPlur]
is>>>a===<<<[NOUNMscDec1(a,ae)-DatPlur,AblPlur]
//
//--------------------------------------------------------------------------
// 2nd DECLENSION
//
// Note: we morph to the Nom Sing, so we only need this
//       base form in the vocab file.
//
// e.g. dominus===masters[NOUNMscDec2(us,i)-NomSing]
//
// dominus
// inaccessible rule: us>>>us===<<<[NOUNMscDec2(us,i)-NomSing]
e>>>us===<<<[NOUNMscDec2(us,i)-VocSing]
um>>>us===<<<[NOUNMscDec2(us,i)-AccSing]
i>>>us===<<<[NOUNMscDec2(us,i)-GenSing,NomPlur,VocPlur]
o>>>us===<<<[NOUNMscDec2(us,i)-DatSing,AblSing]
os>>>us===<<<[NOUNMscDec2(us,i)-AccPlur]
orum>>>us===<<<[NOUNMscDec2(us,i)-GenPlur]
is>>>us===<<<[NOUNMscDec2(us,i)-DatPlur,AblPlur]
//
// filius
// inaccessible rule: ius>>>ius===<<<[NOUNMscDec2(ius,i)-NomSing]
ium>>>ius===<<<[NOUNMscDec2(ius,i)-AccSing]
i>>>ius===<<<[NOUNMscDec2(ius,i)-VocSing,GenSing,NomPlur,VocPlur]
ii>>>ius===<<<[NOUNMscDec2(ius,i)-GenSing]
io>>>ius===<<<[NOUNMscDec2(ius,i)-DatSing,AblSing]
ios>>>ius===<<<[NOUNMscDec2(ius,i)-AccPlur]
iorum>>>ius===<<<[NOUNMscDec2(ius,i)-GenPlur]
iis>>>ius===<<<[NOUNMscDec2(ius,i)-DatPlur,AblPlur]
//
// ager
// inaccessible rule: er>>>er===<<<[NOUNMscDec2(er,ri)-NomSing,VocSing]
rum>>>er===<<<[NOUNMscDec2(er,ri)-AccSing]
ri>>>er===<<<[NOUNMscDec2(er,ri)-GenSing,NomPlur,VocPlur]
ro>>>er===<<<[NOUNMscDec2(er,ri)-DatSing,AblSing]
ros>>>er===<<<[NOUNMscDec2(er,ri)-AccPlur]
rorum>>>er===<<<[NOUNMscDec2(er,ri)-GenPlur]
ris>>>er===<<<[NOUNMscDec2(er,ri)-DatPlur,AblPlur]
//
// vir
// inaccessible rule: ir>>>ir===<<<[NOUNMscDec2(ir,iri)-NomSing,VocSing]
irum>>>ir===<<<[NOUNMscDec2(ir,iri)-AccSing]
iri>>>ir===<<<[NOUNMscDec2(ir,iri)-GenSing,NomPlur,VocPlur]
iro>>>ir===<<<[NOUNMscDec2(ir,iri)-DatSing,AblSing]
iros>>>ir===<<<[NOUNMscDec2(ir,iri)-AccPlur]
irorum>>>ir===<<<[NOUNMscDec2(ir,iri)-GenPlur]
iris>>>ir===<<<[NOUNMscDec2(ir,iri)-DatPlur,AblPlur]
//
// puer
// inaccessible rule: er>>>er===<<<[NOUNMscDec2(er,eri)-NomSing,VocSing]
erum>>>er===<<<[NOUNMscDec2(er,eri)-AccSing]
eri>>>er===<<<[NOUNMscDec2(er,eri)-GenSing,NomPlur,VocPlur]
ero>>>er===<<<[NOUNMscDec2(er,eri)-DatSing,AblSing]
eros>>>er===<<<[NOUNMscDec2(er,eri)-AccPlur]
erorum>>>er===<<<[NOUNMscDec2(er,eri)-GenPlur]
eris>>>er===<<<[NOUNMscDec2(er,eri)-DatPlur,AblPlur]
//
// bellum
// inaccessible rule: um>>>um===<<<[NOUNNeuDec2(um,i)-NomSing,VocSing,
     AccSing]
i>>>um===<<<[NOUNNeuDec2(um,i)-GenSing]
o>>>um===<<<[NOUNNeuDec2(um,i)-DatSing,AblSing]
a>>>um===<<<[NOUNNeuDec2(um,i)-NomPlur,VocPlur,AccPlur]
orum>>>um===<<<[NOUNNeuDec2(um,i)-GenPlur]
is>>>um===<<<[NOUNNeuDec2(um,i)-DatPlur,AblPlur]
//
//--------------------------------------------------------------------------
// 3rd DECLENSION
//
// Dec3 increasing nouns: we morph to the genitive singular, and the vocab
// file has to contain two entries for this type of noun: one for Nom/Voc
// (and Acc for Neu nouns), and one for the Gen Sing. This turns the problem
// into a vocab task (which it really is) rather than a morph task.
// Non-increasing nouns need only the one entry if the Nom is the same as
// the Gen.
//
// e.g. rex===kings[NOUNMscDec3(?,?is)-NomSing,VocSing]
//    regis===kings[NOUNMscDec3(?,?is)-GenSing]
//
// e.g. civis===citizens[NOUNMscDec3(?,?is)-NomSing,VocSing,GenSing]
//
// e.g. cubile===sofas[NOUNNeuDec3(?,?is)-NomSing,VocSing,AccSing,AblSing]
//     cubilis===sofas[NOUNNeuDec3(?,?is)-GenSing]
// (unusual, because NomSing is same as AblSing, hence latter in first 
// entry)
//
// Masculine
// ?>>>is===<<<[NOUNMscDec3(?,?is)-NomSing,VocSing]
em>>>is===<<<[NOUNMscDec3(?,?is)-AccSing]
// is>>>is===<<<[NOUNMscDec3(?,?is)-GenSing]
i>>>is===<<<[NOUNMscDec3(?,?is)-DatSing]
e>>>is===<<<[NOUNMscDec3(?,?is)-AblSing]
es>>>is===<<<[NOUNMscDec3(?,?is)-NomPlur,VocPlur,AccPlur]
um>>>is===<<<[NOUNMscDec3(?,?is)-GenPlur]
ium>>>is===<<<[NOUNMscDec3(?,?is)-GenPlur]
ibus>>>is===<<<[NOUNMscDec3(?,?is)-DatPlur,AblPlur]
//
// Feminine - as masculine
// ?>>>is===<<<[NOUNFemDec3(?,?is)-NomSing,VocSing]
em>>>is===<<<[NOUNFemDec3(?,?is)-AccSing]
// is>>>is===<<<[NOUNFemDec3(?,?is)-GenSing]
i>>>is===<<<[NOUNFemDec3(?,?is)-DatSing]
e>>>is===<<<[NOUNFemDec3(?,?is)-AblSing]
es>>>is===<<<[NOUNFemDec3(?,?is)-NomPlur,VocPlur,AccPlur]
um>>>is===<<<[NOUNFemDec3(?,?is)-GenPlur]
ium>>>is===<<<[NOUNFemDec3(?,?is)-GenPlur]
ibus>>>is===<<<[NOUNFemDec3(?,?is)-DatPlur,AblPlur]
//
// Neuter
// ?>>>is===<<<[NOUNNeuDec3(?,?is)-NomSing,VocSing,AccSing]
// is>>>is===<<<[NOUNNeuDec3(?,?is)-GenSing]
i>>>is===<<<[NOUNNeuDec3(?,?is)-DatSing]
e>>>is===<<<[NOUNNeuDec3(?,?is)-AblSing]
a>>>is===<<<[NOUNNeuDec3(?,?is)-NomPlur,VocPlur,AccPlur]
ia>>>is===<<<[NOUNNeuDec3(?,?is)-NomPlur,VocPlur,AccPlur]
um>>>is===<<<[NOUNNeuDec3(?,?is)-GenPlur]
ium>>>is===<<<[NOUNNeuDec3(?,?is)-GenPlur]
ibus>>>is===<<<[NOUNNeuDec3(?,?is)-DatPlur,AblPlur]
//

Figure 3. Some NOUN rules from the BRUTUS morphological ruleset.

//------------------------------------------------------------------
// VERBS
//------------------------------------------------------------------
// Most rules go back to the PresInfAct as the base form, but because
// of irregular verbs, we go back to the 1stSingPerfIndAct (the 3rd
// principal part of the verb) for perfect and other tenses. A few
// irregulars are built in, though.
//------------------------------------------------------------------
// PRESENT
// amo
o>>>are===I_<<<[VERB-1stSingPresIndAct]
as>>>are===you_<<<[VERB-2ndSingPresIndAct]
at>>>are===he/she/it_<<<[VERB-3rdSingPresIndAct]
amus>>>are===we_<<<[VERB-1stPlurPresIndAct]
atis>>>are===you_<<<[VERB-2ndPlurPresIndAct]
ant>>>are===they_<<<[VERB-3rdPlurPresIndAct]
//
// moneo
eo>>>ere===I_<<<[VERB-1stSingPresIndAct]
es>>>ere===you_<<<[VERB-2ndSingPresIndAct]
et>>>ere===he/she/it_<<<[VERB-3rdSingPresIndAct]
emus>>>ere===we_<<<[VERB-1stPlurPresIndAct]
etis>>>ere===you_<<<[VERB-2ndPlurPresIndAct]
ent>>>ere===they_<<<[VERB-3rdPlurPresIndAct]
//
// ?eo
eo>>>ire===I_<<<[VERB-1stSingPresIndAct]
es>>>ire===you_<<<[VERB-2ndSingPresIndAct]
et>>>ire===he/she/it_<<<[VERB-3rdSingPresIndAct]
emus>>>ire===we_<<<[VERB-1stPlurPresIndAct]
etis>>>ire===you_<<<[VERB-2ndPlurPresIndAct]
eunt>>>ire===they_<<<[VERB-3rdPlurPresIndAct]
//
//rego, capio
o>>>ere===I_<<<[VERB-1stSingPresIndAct]
io>>>ere===I_<<<[VERB-1stSingPresIndAct]
is>>>ere===you_<<<[VERB-2ndSingPresIndAct]
it>>>ere===he/she/it_<<<[VERB-3rdSingPresIndAct]
imus>>>ere===we_<<<[VERB-1stPlurPresIndAct]
itis>>>ere===you_<<<[VERB-2ndPlurPresIndAct]
unt>>>ere===they_<<<[VERB-3rdPlurPresIndAct]
iunt>>>ere===they_<<<[VERB-3rdPlurPresIndAct]
//
//audio
io>>>ire===I_<<<[VERB-1stSingPresIndAct]
is>>>ire===you_<<<[VERB-2ndSingPresIndAct]
it>>>ire===he/she/it_<<<[VERB-3rdSingPresIndAct]
imus>>>ire===we_<<<[VERB-1stPlurPresIndAct]
itis>>>ire===you_<<<[VERB-2ndPlurPresIndAct]
iunt>>>ire===they_<<<[VERB-3rdPlurPresIndAct]
//
//------------------------------------------------------------------
// amem
em>>>are===I_<<<[VERB-1stSingPresSubAct]
es>>>are===you_<<<[VERB-2ndSingPresSubAct]
et>>>are===he/she/it_<<<[VERB-3rdSingPresSubAct]
emus>>>are===we_<<<[VERB-1stPlurPresSubAct]
etis>>>are===you_<<<[VERB-2ndPlurPresSubAct]
ent>>>are===they_<<<[VERB-3rdPlurPresSubAct]
//
// moneam
eam>>>ere===I_<<<[VERB-1stSingPresSubAct]
eas>>>ere===you_<<<[VERB-2ndSingPresSubAct]
eat>>>ere===he/she/it_<<<[VERB-3rdSingPresSubAct]
eamus>>>ere===we_<<<[VERB-1stPlurPresSubAct]
eatis>>>ere===you_<<<[VERB-2ndPlurPresSubAct]
eant>>>ere===they_<<<[VERB-3rdPlurPresSubAct]
//
// regam
am>>>ere===I_<<<[VERB-1stSingPresSubAct]
as>>>ere===you_<<<[VERB-2ndSingPresSubAct]
at>>>ere===he/she/it_<<<[VERB-3rdSingPresSubAct]
amus>>>ere===we_<<<[VERB-1stPlurPresSubAct]
atis>>>ere===you_<<<[VERB-2ndPlurPresSubAct]
ant>>>ere===they_<<<[VERB-3rdPlurPresSubAct]
//
// capiam
iam>>>ere===I_<<<[VERB-1stSingPresSubAct]
ias>>>ere===you_<<<[VERB-2ndSingPresSubAct]
iat>>>ere===he/she/it_<<<[VERB-3rdSingPresSubAct]
iamus>>>ere===we_<<<[VERB-1stPlurPresSubAct]
iatis>>>ere===you_<<<[VERB-2ndPlurPresSubAct]
iant>>>ere===they_<<<[VERB-3rdPlurPresSubAct]
//

Figure 4. Some VERB rules from the BRUTUS morphological ruleset.
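
The rule notation x>>>y===text<<<[...] used in Figures 3 and 4 can be split into its four parts with plain string operations. The Python sketch below is illustrative only: the names MorphRule and parse_rule are assumptions, comment lines (including the "inaccessible rule" lines) are skipped, and each rule is assumed to occupy a single line.

```python
from typing import NamedTuple


class MorphRule(NamedTuple):
    ending: str       # x: ending found on the inflected Latin word
    base_ending: str  # y: ending that replaces x to give the base form
    prefix: str       # text inserted into the English meaning (verbs only)
    tag: str          # the bracketed label, without the brackets


def parse_rule(line):
    """Parse 'x>>>y===text<<<[tag]'; return None for comments and blanks."""
    line = line.strip()
    if not line or line.startswith("//"):
        return None
    ending, sep1, rest = line.partition(">>>")
    base_ending, sep2, rest = rest.partition("===")
    prefix, sep3, tag = rest.partition("<<<")
    well_formed = sep1 and sep2 and sep3 and tag.startswith("[") and tag.endswith("]")
    if not well_formed:
        raise ValueError(f"malformed rule: {line!r}")
    return MorphRule(ending, base_ending, prefix, tag[1:-1])


print(parse_rule("am>>>a===<<<[NOUNFemDec1(a,ae)-AccSing]"))
print(parse_rule("at>>>are===he/she/it_<<<[VERB-3rdSingPresIndAct]"))
```

For noun rules the prefix field is simply empty, which matches the statement above that the inserted text is used for verbs but not nouns.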

 
Examples of Morphological Processing

I shall now describe the action of the morphological processor. Two examples follow: for a verb, and for a noun.

VERB example:

STEP (1) The word laudaverant is encountered in the input, but is not in the lexicon. However, the following entry exists:

laudare===to_praise[VERB-PresInfAct]

STEP (2) The system looks in the morphological rules file to see if it can reduce the encountered verb laudaverant to the base form (the infinitive, in the case of verbs) laudare, ‘to praise’. The following morphological rule is found:

averant>>>are===they_had_<<<[VERB-3rdPlurPlupIndAct]

This says that the ending -averant can be reduced to the infinitive stem -are where the verb is the 3rd person plural, pluperfect tense, indicative mood, active voice form. In this case the English verb would start "they had".

STEP (3) The morphological processor then creates a possible meaning for laudaverant as follows:

laudaverant===they_had_praise[VERB-3rdPlurPlupIndAct]

Note that the word praise has been inserted. (A later processing stage changes this to praised.)

STEP (4) The user is then asked to inspect the above suggestion. If the user confirms it as being correct, it is added to the lexicon for immediate and future use.
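
Steps (1) to (3) of the verb example can be traced in a few lines of Python. This is a sketch under assumptions, not the BRUTUS code: the function name apply_verb_rule is hypothetical, the lexicon is reduced to a plain dictionary, and the removal of the leading "to_" from the infinitive sense is inferred from the example entries rather than stated in the text.

```python
def apply_verb_rule(word, rule, lexicon):
    """Try one (ending, base_ending, prefix, tag) rule on an unknown word.

    Returns a suggested lexicon line, or None if the rule does not apply
    or the resulting infinitive is not in the lexicon.
    """
    ending, base_ending, prefix, tag = rule
    if not word.endswith(ending):
        return None
    # Replace the inflected ending with the base-form (infinitive) ending.
    base = word[: len(word) - len(ending)] + base_ending
    infinitive_sense = lexicon.get(base)  # e.g. "to_praise"
    if infinitive_sense is None or not infinitive_sense.startswith("to_"):
        return None
    # Assumption: the bare English verb is the sense without its "to_".
    bare_verb = infinitive_sense[len("to_"):]
    return f"{word}==={prefix}{bare_verb}[{tag}]"


lexicon = {"laudare": "to_praise"}
rule = ("averant", "are", "they_had_", "VERB-3rdPlurPlupIndAct")
print(apply_verb_rule("laudaverant", rule, lexicon))
```

Step (4), confirmation by the user, is deliberately left out: the function only produces a suggestion, which would still need to be accepted before being added to the lexicon.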

NOUN example:

STEP (1) The word tabernam is encountered in the input, but is not in the lexicon. However, the following entry exists:

taberna===inns[NOUNFemDec1(a,ae)-NomSing,VocSing,AblSing]

This states that taberna is a first declension feminine noun like those having Nominative singular ending -a and Genitive singular ending -ae, and that the word taberna might be Nominative, Vocative or Ablative singular. Note that only the plural version of the English noun is stored (inns), since BRUTUS contains a reliable function to create the singular form if this is later found to be needed.

STEP (2) The system looks in the morphological rules file to see if tabernam might be reduced to taberna. The rules file is found to contain an entry for the same type of noun:

am>>>a===<<<[NOUNFemDec1(a,ae)-AccSing]

STEP (3) Applying this rule does indeed reduce tabernam to taberna, so the morphological processor suggests the meaning for tabernam is as follows:

tabernam===inns[NOUNFemDec1(a,ae)-AccSing]

STEP (4) When confirmed by the user, this is then added to the vocab file and can be used immediately. When used, since it can only be Accusative Singular (according to the square-bracketed part), inns is changed to inn.

 
Discussion

The morphological ruleset is largely complete. One of the problems addressed in the early stages of the design of the morphological processor concerned the handling of ‘increasing’ 3rd declension nouns, such as gladiator, gladiatoris and lex, legis. The problem here is that the nominative singular has a shorter form than the genitive singular and that this shorter form is not easily predictable and hence is not easily generated by a few morphological rules. Although there are groups of nouns having similar patterns (e.g. rex, regis is like lex, legis) the problem is that there are many tens of such patterns. Each pattern (-o,-onis; -as,-atis; -as,-adis; -ens,-entis; -ex,-egis; -ex,-icis; -s,-ris; -s,-ssis; -ix,-icis; -or,-oris; -en,-inis etc) would require its own ruleset (for each gender), and every time a new pattern was encountered, not only would the lexicon entry be required but also a new morphological ruleset to go with it. On the other hand, most of the 3rd declension endings themselves are quite regular (those from the accusative singular onwards), and could be described by only three rulesets, one for each gender. It was realised that this problem is not really a morphological problem - it is actually a vocabulary problem (i.e. knowing what the NomSing looks like). Therefore it was decided to morph back to the genitive singular form for the ‘increasing’ 3rd declension nouns, and have two entries provided in the lexicon, e.g. one for lex and one for legis. Happily, this approach also covers the non-increasing 3rd declension nouns, such as clades, cladis, and in some cases only requires one entry in the lexicon e.g. for civis, civis.
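
The decision to morph back to the genitive singular can be illustrated with the masculine 3rd declension rules from Figure 3. The Python sketch below is a simplified reading of that approach, not the actual implementation: the names DEC3_MSC_RULES, LEXICON and analyse_dec3 are assumptions, and the lexicon holds only the rex/regis and civis entries quoted in this paper.

```python
CATEGORY = "NOUNMscDec3(?,?is)"

# (ending, replacement, features) taken from the masculine rules in Figure 3.
DEC3_MSC_RULES = [
    ("em", "is", "AccSing"),
    ("i", "is", "DatSing"),
    ("e", "is", "AblSing"),
    ("es", "is", "NomPlur,VocPlur,AccPlur"),
    ("um", "is", "GenPlur"),
    ("ium", "is", "GenPlur"),
    ("ibus", "is", "DatPlur,AblPlur"),
]

# Two-line entry for the increasing noun rex, one-line entry for civis.
LEXICON = {
    "rex": ("kings", CATEGORY, "NomSing,VocSing"),
    "regis": ("kings", CATEGORY, "GenSing"),
    "civis": ("citizens", CATEGORY, "NomSing,VocSing,GenSing"),
}


def analyse_dec3(word):
    """Suggest entries for a 3rd declension form by morphing to the GenSing."""
    suggestions = []
    for ending, replacement, features in DEC3_MSC_RULES:
        if not word.endswith(ending):
            continue
        base = word[: len(word) - len(ending)] + replacement
        entry = LEXICON.get(base)
        if entry is None:
            continue
        english, category, base_features = entry
        # Only accept a base form that is actually listed as a GenSing.
        if category == CATEGORY and "GenSing" in base_features.split(","):
            suggestions.append(f"{word}==={english}[{category}-{features}]")
    return suggestions


print(analyse_dec3("regem"))
print(analyse_dec3("civium"))
```

No rule ever needs to know that the nominative of regis is rex; that information lives only in the lexicon, which is the sense in which the problem is treated as a vocabulary problem rather than a morphological one.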
 
The ruleset also contains all the "standard" rules for the conjugation of all the main categories of verb (in all their tenses, voices and moods, as well as participles, supines, gerunds etc; there are about 1,500 individual rules for verbs) and for adjective declensions. Adverbs are also handled (comparative and superlative forms morphing back to positives). As such the ruleset encapsulates almost all of the inflectional grammar of Latin. This represents a useful teaching resource in its own right, and in fact the potential pedagogical aspects of BRUTUS were a motivating factor for the research right from the start. The Stage-1 output (see Figure 1) is intended to be helpful to learners of Latin, in that it gives all the possible meanings for each Latin word. Particularly useful for a learner is that this includes, for example, every case/number combination that an encountered Latin noun could represent.
 
The morphological processor does sometimes suggest bad lemmatisations. The system may suggest more than one meaning for any unknown Latin word, because there may be multiple morphological rules that can be applied, or multiple vocab entries e.g. where one stem has different meanings, or both of these may apply. It is possible for one of the suggestions to be wrong. For example, here are two suggestions made for the same Latin word:

reges===you_will_rule[VERB-2ndSingFutrIndAct]
reges===kings[NOUNMscDec3(?,?is)-NomPlur,VocPlur,AccPlur]

Both of these suggestions are correct. However, there was in fact a third suggestion for reges along with the above two:

reges===you_rule[VERB-2ndSingPresIndAct]
WRONG!

This bad suggestion was made because the system does not currently store information about which conjugation a verb belongs to. Regere is a 3rd conjugation verb, so reges can only be future; but the 2nd conjugation present-tense rule es>>>ere also reduces reges to regere, and so a present-tense reading is suggested as well. It is interesting to note that the lexical information given in both traditional and more modern Latin dictionaries and textbooks is usually good enough to prevent errors such as this, e.g. by giving the first person singular present indicative active form of the verb (or the PresInfAct), plus the conjugation number, perfect form and supine; see e.g. (Balme and Morwood, 1996), (Bell, 1999), (Jones, 1997), (Jones and Sidwell, 1986), (Paterson and Macnaughton, 1968), (Morwood, 2001). However, the error situation illustrated above has been rare, and so unless its frequency increases as the lexicon grows, the current arrangement for verbs (using the infinitive) will remain. The advantage of the current system is that it is not necessary to divide the VERB category into five separate categories (VERBCnj1, VERBCnj2, VERBCnj3, VERBCnj4, VERBMixd), so the number of verb morphological rules is lower than would otherwise be required, and the morphological processor is kept as simple as possible. Furthermore, since this situation always arises with two conflicting VERB meanings being suggested (see footnote 1), the morphological processor itself can detect its occurrence. The possibility of human error can therefore be signalled to the user in the form of an extra ‘caution’ message.
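
One way such a ‘caution’ check might work is sketched below in Python. It is a heuristic written for illustration, not the check used in BRUTUS: the function name needs_caution is an assumption, and the test applied (two VERB suggestions for the same word that differ only in present versus future indicative) is taken from the reges example and may miss or over-report other cases.

```python
def needs_caution(suggestions):
    """Flag a word whose VERB suggestions clash on present vs future tense."""
    seen = set()
    verb_tags = set()
    for line in suggestions:
        # The tag is the text between the first "[" and the last "]".
        tag = line[line.index("[") + 1 : line.rindex("]")]
        if tag.startswith("VERB-"):
            verb_tags.add(tag)
    for tag in verb_tags:
        # Erase the tense so that Pres and Futr readings collapse together.
        key = tag.replace("PresInd", "*Ind").replace("FutrInd", "*Ind")
        if key in seen:
            return True
        seen.add(key)
    return False


suggestions = [
    "reges===you_will_rule[VERB-2ndSingFutrIndAct]",
    "reges===kings[NOUNMscDec3(?,?is)-NomPlur,VocPlur,AccPlur]",
    "reges===you_rule[VERB-2ndSingPresIndAct]",
]
print(needs_caution(suggestions))
```

The check cannot say which of the two readings is wrong; it only tells the user that the suggestions for this word deserve closer inspection.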
 
The description given above concerning the working of the morphological processor with respect to what needs to be in the lexicon (as a base form) has omitted some detail e.g. for some 3rd declension nouns and for irregular verbs. More detail is given in the Appendix, which contains the header of the lexicon file explaining what is required for every part of speech.
 
 
Conclusion
 
The morphological processor built into the BRUTUS system has the potential to aid in the construction of a very large look-up lexical transfer stage for a Latin MT system. In addition, it is capable of forming the basis for a Latin part-of-speech tagger, even in the absence of any lexicon. However, its successful use depends upon (a) the existence of base-form lexemes in the lexicon, (b) existence of the correct lemmatisation rule in the rulebase, and (c) an expert (Latin-fluent) human user to confirm suggested lemmatisations, as these are incorrect in a very few cases.
 
Future experiments will examine the morphological processor’s ability to PoS tag Latin texts in the absence of a lexicon. In addition, missing rules will be added as necessary (which will mostly be at the same time as the addition of new vocabulary). Vocabulary building will also continue as a separate activity alongside the addition of new rules. It is hoped also that BRUTUS will eventually become a useful teaching aid, particularly for school pupils. To this end, it is planned to provide a website so that single Latin sentences may be submitted, and the useful Stage 1 output returned to the student. The morphological processor is central to these aims.
 



 
Footnote 1. The problem in fact only occurs for the 2nd/3rd persons singular and 1st/2nd/3rd persons plural of 2nd and 3rd conjugation verbs, where the endings for the 2nd conjugation present tense and the 3rd conjugation future tense are the same, i.e. -es, -et, -emus, -etis, -ent. The situation arises because both 2nd and 3rd conjugation verbs have present infinitives ending in -ere.

 
 
References

Balme, M. and Morwood, J. (1996) Oxford Latin Course Vol. I. Oxford University Press
Bell, B. (1999) Minimus the Mouse. Cambridge University Press
Bowden, P. R. (2001) 'Latin to English Machine Translation - A Direct Approach'. In Machine Translation Review, December 2001 (on-line via BCS Natural Language Translation Specialist Group at http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-12/5.htm)
Bowden, P. R., Halstead, P. and Rose, T. G. (1996) 'Dictionaryless English Plural Noun Singularisation Using A Corpus-Based List of Irregular Forms'. In Corpus-based Studies in English - Papers from the Seventeenth International Conference on English Language Research on Computerized Corpora (ICAME 17) Stockholm, May 15 - 19 1996 (Rodopi)
Hutchins, W. J. and Somers, H. L. (1992) An Introduction to Machine Translation. Academic Press
Jones, P. (1997) Learn Latin - The Book of The Daily Telegraph QED Series. Duckworth
Jones, P. V. and Sidwell, K. C. (1986) Reading Latin - Grammar, Vocabulary and Exercises. Cambridge University Press
Morwood, J. (Ed.) (2001) Pocket Oxford Latin Dictionary (2nd edition). Oxford University Press
Paterson, J. and Macnaughton, E. G. (1968) The Approach to Latin (First Part) (revised 1968). Oliver and Boyd

 
 
Appendix
 
This is the explanatory header for the BRUTUS lexicon:
//------------------------------------------------------------------
// Brutus' VOCAB FILE
//
// Paul R. Bowden
//
// A.D. MMII
//
//------------------------------------------------------------------
// WARNING! DO NOT SORT THIS FILE!
//------------------------------------------------------------------
//
// ENGLISH NOUNS MUST ALWAYS BE PLURAL IN HERE!
// (EXCEPT PROPER NOUNS)
// ENGLISH VERBS MUST ALWAYS HAVE AUX's BUT NO TENSE CHANGES
//
// Example entries: 
// When adding new vocab manually, there is a minimum amount
// of information you must add in order for the morphological
// processor to be able to do the rest. The minimum information 
// required for each part of speech is given below. 
//
// Note: make multiple entries for different senses (applies to all
//       parts of speech)
//
//
// VERB
// Regular: e.g. amare===to_love[VERB-PresInfAct]
//          e.g. minari===to_threaten[VERB-PresInfDep]
// Irregular: dare===to_give[VERB-PresInfAct]
//            dedi===I_give[VERB-1stSingPerfIndAct]
//            datum===give[VERB-SpnAcc/PARTICPerfPas-MscAccSing,
//                                       NeuNomSing,NeuVocSing,NeuAccSing]
// i.e. give any principal part that is not what you'd expect
// Note: some irregular/defective verbs are here in their entirety.
//
// NOUN
// 1st, 2nd, 4th, 5th declensions: Nom Sing needed:
//        mensa===tables[NOUNFemDec1(a,ae)-NomSing,VocSing,AblSing]
// 3rd declension: Must have Nom and Gen Sing forms:
//       One-liner:
//        civis===citizens[NOUNMscDec3(?,?is)-NomSing,VocSing,GenSing]
//       Two-liner:
//        rex===kings[NOUNMscDec3(?,?is)-NomSing,VocSing]
//        regis===kings[NOUNMscDec3(?,?is)-GenSing]
//
// ADJV
// 1st/2nd declension: Need MscNomSing form:
//      bonus===good[ADJV(us,a,um)-MscNomSing]
// 3rd dec. need 2-line entries:
//      audax===bold[ADJVDec3-MscNomSing,MscVocSing,FemNomSing,FemVocSing,
//                             NeuNomSing,NeuVocSing,NeuAccSing]
//      audacis===bold[ADJVDec3-MscGenSing,FemGenSing,NeuGenSing]
// IRREGULAR comparative and superlatives are also needed:
//     melior===more_good[ADJV(us,a,um)-COMPARA-MscNomSing,MscVocSing,
//                          FemNomSing,FemVocSing]
//     optimus===most_good[ADJV(us,a,um)-SUPERL-MscNomSing]
// Note: (us,a,um) comes from base form bonus.
// Later processing will change "more-","most-" into required word.
//
// ADVB
// Only positive forms required, except for irregular comparatives
// and superlatives:
//   Regular:
//    vere===correctly[ADVB]
//   Irregular:
//    bene===well[ADVB]
//    melius===more_well[ADVB-COMPARA]
//    optime===most_well[ADVB-SUPERL]
// Later processing takes care of more/most, as for ADJVs.
//
// INTJ
//     ecce===behold[INTJ]
//
// PREP
// Need multiple entries if more than one sense:
//      in===in[PREP-+Abl]
//      in===on[PREP-+Abl]
//      in===into[PREP-+Acc]
//
// PRON
// All pronouns are in this vocab list;
// none are generated by the morphological processor.
// Need case/number info. 
//     ego===I[PRON-NomSing]
// Note: there are some ADJV-PRON entries - see below.
//
// In what follows, where you see more than the minimum information 
// present, the morphological processor or a person has added the 
// other lines over the course of time. The morphological processor
// always adds to the end of the file. The order of the entries is
// not important.

 
