The British Computer Society
Natural Language Translation Specialist Group

Web-site: http://www.bcs.org.uk/siggroup/sg37.htm

Machine Translation Review
Issue No. 12, December 2001 - pages 27-28
ISSN 1358-8346

This page URL: http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-12/12.htm

Book Review
Parallel Text Processing: Alignment and Use of Translation Corpora

Jean Véronis (ed.)
Kluwer Academic Publishers, London, Hardback, 402pp.
ISBN 0792365461
Price £99

Reviewer: Mark Stevenson, Research and Standards Group, Reuters, 85 Fleet Street, London EC4P 4AJ, United Kingdom. E-mail: mark.stevenson@reuters.com

 

It is worth making clear at the outset that this book uses the phrase 'parallel text' in the sense current in Computational Linguistics and Natural Language Processing, that is, to refer to a text accompanied by its direct translation. In the translation and terminology communities the same expression refers to texts in the same domain that are written in different languages but are not necessarily translations of each other. The most important characteristic of a parallel text (in the sense used here) is that the same thing is said in more than one language. This book provides good coverage of contemporary methods for exploiting this feature.

This volume consists of nineteen chapters by different contributors, in addition to a preface by Martin Kay. The first chapter is written by the editor and provides an excellent introduction to the field of parallel text processing. It covers the main methods by which parallel texts can be exploited computationally and describes their applications. Véronis's introduction also proves invaluable for placing the contributions in the volume within a wider context.

The remaining chapters are organised into three sections: Alignment Methodology; Applications; and Resources and Evaluation. The first of these forms the core of the volume and contains nine chapters describing specific alignment techniques. There are three main levels at which a parallel text may be aligned. The coarsest is the sentence level, at which corresponding sentences in the two texts are identified. Others have attempted to align text at the level of individual words or expressions, and this is often a by-product of the sentence alignment process. Between the two lies alignment at the clause level, in which linguistic units larger than the word or expression but smaller than the sentence are aligned.

The majority of alignment methods extend two basic approaches reported around 1990, although none of those papers is reprinted in this book. The two approaches are lexical anchoring and sentence length correlation. The first relies on the assumption that if sentences correspond then the words they contain must also correspond, so these techniques use a mapping between the lexical items in the parallel texts to provide clues about sentence correspondences. In contrast, sentence length correlation methods do not focus on particular lexical items but exploit the observation that the lengths of corresponding sentences in the two languages tend to be correlated. The techniques reported in this volume make use of one, or both, of these approaches. In addition, some further assumptions are often employed to reduce the search space of potential correspondences to a tractable level: the order of sentences in the two texts is very similar; the texts contain few (if any) additions or omissions; and, finally, the majority of sentences in one language correspond to exactly one sentence in the other.
These assumptions may not hold in real-world applications, in which translations can be structured differently from the original text with, for example, tables and figures being positioned differently. This leads to a further level of alignment, at the level of document structure, and two chapters describe approaches to this problem. One uses techniques borrowed from cross-language information retrieval, treating the set of sentences in one language as a corpus and those in the other as a set of queries. The other makes the structure of the texts explicit using mark-up languages such as XML.
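
As a rough illustration of sentence length correlation, the sketch below aligns two lists of sentences by dynamic programming over character lengths alone. It is not the method of any chapter in the volume: the function name `align_by_length`, the flat `skip_penalty` and `merge_penalty` values and the absolute-difference cost are all assumptions made for illustration, standing in for the probabilistic length models used in the literature. It leans on the simplifying assumptions listed earlier: sentence order is preserved, omissions are rare and most sentences map one-to-one.

```python
def align_by_length(src, tgt, skip_penalty=50.0, merge_penalty=10.0):
    """Align two sentence lists using character lengths only (illustrative sketch).

    Returns a list of (source_sentences, target_sentences) pairs.
    The penalties are hypothetical values chosen for the example.
    """
    # Allowed alignment patterns: (source sentences used, target sentences used).
    beads = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2)]
    # Scale source lengths so they are comparable with target lengths.
    ratio = sum(map(len, tgt)) / max(1, sum(map(len, src)))

    def cost(i, di, j, dj):
        if di == 0 or dj == 0:
            return skip_penalty  # flat charge for an omission or addition
        s_len = sum(len(s) for s in src[i:i + di])
        t_len = sum(len(t) for t in tgt[j:j + dj])
        extra = 0.0 if (di, dj) == (1, 1) else merge_penalty
        return abs(s_len * ratio - t_len) + extra

    n, m = len(src), len(tgt)
    inf = float("inf")
    best = [[inf] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    best[0][0] = 0.0
    for i in range(n + 1):
        for j in range(m + 1):
            if best[i][j] == inf:
                continue
            for di, dj in beads:
                ni, nj = i + di, j + dj
                if ni > n or nj > m:
                    continue
                total = best[i][j] + cost(i, di, j, dj)
                if total < best[ni][nj]:
                    best[ni][nj] = total
                    back[ni][nj] = (di, dj)

    # Follow the back-pointers from the end to recover the cheapest path.
    pairs = []
    i, j = n, m
    while i > 0 or j > 0:
        di, dj = back[i][j]
        pairs.append((src[i - di:i], tgt[j - dj:j]))
        i, j = i - di, j - dj
    pairs.reverse()
    return pairs
```

Restricting the search to a handful of patterns is what keeps the problem tractable; a text with reordered sections or displaced tables breaks the monotonic path that this kind of search depends on.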

The second section contains five chapters on applications of alignment techniques. The most relevant for those interested in translation is likely to be Gaussier, Hull and Ait-Mokhtar's 'Term Alignment in use: Machine-Aided human translation'. This describes work carried out at Xerox's French research centre, which shows that word and term alignment methods can be used to build tools that assist human translators by providing them with examples of previously translated sentences similar to the material to be translated. This sort of technology is most often used within translators' workbenches. The next chapter describes experiments on cross-language information retrieval using a bilingual training corpus. It was found that a simple technique for automatically extracting a bilingual dictionary from parallel text was the most effective.
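
The idea of offering a translator previously translated sentences similar to the one at hand can be sketched very simply. The example below is an illustrative assumption rather than the Xerox approach, which rests on word and term alignment: it merely scores stored sentences by word overlap using Python's standard `difflib.SequenceMatcher`, and the names `most_similar`, `memory` and `threshold` are hypothetical.

```python
from difflib import SequenceMatcher


def most_similar(query, memory, threshold=0.6):
    """Return (score, source, translation) for the closest stored sentence.

    `memory` is a list of (source sentence, translation) pairs.
    Returns None when no stored sentence reaches the threshold.
    """
    query_words = query.lower().split()
    best = None
    for source, translation in memory:
        # Compare word sequences rather than characters.
        matcher = SequenceMatcher(None, query_words, source.lower().split())
        score = matcher.ratio()
        if score >= threshold and (best is None or score > best[0]):
            best = (score, source, translation)
    return best


memory = [
    ("Press the red button to stop the machine.",
     "Appuyez sur le bouton rouge pour arrêter la machine."),
    ("Close the cover before printing.",
     "Fermez le capot avant d'imprimer."),
]
match = most_similar("Press the green button to start the machine.", memory)
```

Here `match` holds the first stored pair, which the translator could then adapt instead of translating from scratch.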

Other applications covered in the second section include lexicography, in which parallel texts are used as evidence for dictionary creation; bilingual terminology extraction, in which the translations of more complex linguistic units such as collocations and expressions are extracted from parallel texts; and computer-assisted language learning, for which a survey chapter describes how aligned texts can provide a valuable resource. There are other applications of parallel text that are not represented by chapters in this volume, one example being word sense disambiguation. Work in this area at IBM and AT&T in the early 1990s made use of the translations of ambiguous words in an aligned corpus to define word senses at the level of granularity most appropriate for machine translation.

The final section consists of four chapters concerned with resources and evaluation. Two chapters describe projects in which bilingual corpora were created: the first a Japanese-English corpus and the second an English-Panjabi corpus. The next chapter describes the TMX (Translation Memory eXchange) format, an XML standard developed within the Localisation Industry Standards Association (LISA) by a consortium of companies and users to provide a standard means of exchanging data between the proprietary formats used in MT and related systems. The volume comes full circle with a final chapter written, like the introduction, by Véronis. In this chapter he describes the ARCADE project, which was designed to provide standard methods for the evaluation and comparison of sentence alignment algorithms (and was later extended to include word alignment algorithms). Like many areas of natural language processing, the field of parallel text processing suffers from a lack of standard benchmarking resources. The ARCADE project is an attempt to resolve this problem for alignment techniques.
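
To give a flavour of what a TMX file holds, the sketch below reads translation units from a small hand-written sample using Python's standard `xml.etree.ElementTree`. The element and attribute names (`tu`, `tuv`, `seg`, `xml:lang`) follow the TMX 1.4 layout, which is assumed here; the header values, the sentences and the function name `read_units` are invented for illustration, and real files may carry further metadata that this sketch ignores.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes the xml:lang attribute under the XML namespace URI.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Hand-written sample; all values are invented for illustration.
SAMPLE_TMX = """<?xml version="1.0"?>
<tmx version="1.4">
  <header creationtool="example" creationtoolversion="1.0" segtype="sentence"
          o-tmf="none" adminlang="en" srclang="en" datatype="plaintext"/>
  <body>
    <tu>
      <tuv xml:lang="en"><seg>Open the file.</seg></tuv>
      <tuv xml:lang="fr"><seg>Ouvrez le fichier.</seg></tuv>
    </tu>
  </body>
</tmx>"""


def read_units(tmx_text):
    """Return one {language: segment text} dictionary per translation unit."""
    root = ET.fromstring(tmx_text)
    units = []
    for tu in root.iter("tu"):
        units.append(
            {tuv.get(XML_LANG): tuv.findtext("seg") for tuv in tu.findall("tuv")}
        )
    return units


units = read_units(SAMPLE_TMX)
```

Each translation unit groups the variants of one segment by language, which is what allows tools with different internal formats to exchange aligned material.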

To summarise, this volume offers a useful overview of contemporary work on parallel text processing, covering many aspects of the area including techniques, evaluation, standards and applications. Parallel text processing is a relatively new area of NLP, and this volume is more modern still, since it contains none of the papers describing the early work carried out at Xerox and AT&T. The reader will nonetheless find a collection that provides a valuable introduction to contemporary work on processing parallel texts.

 

 
