Machine Translation Review
This page URL: http://www.bcs.org.uk/siggroup/nalatran/mtreview/mtr-11/mtr-11-8.htm
by
Natalia V. Loukachevitch [louk@mail.cir.ru]
and Boris V. Dobrov [dobroff@mail.cir.ru]
The paper describes a technique for constructing a structural thematic summary. A structural thematic summary represents the contents of a text by indicating its main theme and subthemes, each simulated by a set of terms corresponding to that theme. It comprises the most informative fragments of the thematic representation of a text, which contains all terms of the text divided into thematic nodes. A structural thematic summary is created on the basis of a detailed bilingual description of the sociopolitical domain. It can represent the contents of documents of any size and of different genres. The language of documents and of the corresponding structural thematic summaries can be Russian or English.
In multilingual information retrieval there is a serious problem of how users can estimate the relevance of retrieved documents and how they can choose the most relevant documents for computer or human translation. Summarisation of texts is usually considered one of the important tools that help to evaluate the relevance of texts to users' information needs. Summarisation of texts in broad domains, or domain-independent summarisation, is mostly based on passage extraction (Salton 1989). Such summaries are constructed as ordered sequences of sentences or paragraphs of the initial texts, chosen using some criteria.
Users of multilingual systems can be unfamiliar with the language of the document collection. Machine translation of retrieved documents or their summaries can significantly slow down the choice of relevant documents. Translations of the most frequent words can be used as representative lists of the contents of documents. But in these lists the terms corresponding to different topics of a text can be intermixed. Also, the many terms of the most frequent topic can occupy all the available space, so that terms of other important topics are missed.
Boguraev et al (1997) proposed a dynamic presentation of document content based on salient fragments of sentences, but obtaining good-quality translations of such fragments into other languages is as difficult as translating the whole text.
In this paper we describe the construction of the structural thematic summary of a text. The structural thematic summary describes the contents of a text by a representation of its main theme, simulated by sets of terms corresponding to the main theme. Construction of the structural thematic summary is based on a domain-specific thesaurus, specially created as a linguistic resource for automatic text processing. The structural thematic summary is the most informative fragment of the thematic representation of a text, which is the result of a complicated process of thematic analysis including term disambiguation and analysis of cohesion relations.
Construction of the thematic representation is based on the Russian-English Thesaurus on Sociopolitical Life, specially created as a tool for automatic text processing. The Thesaurus contains 24 thousand concepts, 50 thousand terms and 90 thousand relations between concepts. A thematic representation that includes all the terms of a text is too detailed to serve as a summary for users of information systems. So we construct a structural thematic summary containing terms of the main theme and important subthemes of a text as a means of rapidly evaluating the relevance of texts. A structural thematic summary can be created for Russian or English documents and presented in Russian or in English.
The second section briefly describes the main features of the Thesaurus on Sociopolitical Life. The third section is devoted to the main principles of construction of the thematic representation. The fourth section indicates the main steps of automatic construction of the thematic representation. In the fifth section we consider the form of the structural thematic summary. In the sixth section an evaluation of the automatic process of the thematic representation construction is considered.
The Thesaurus on Sociopolitical Life is a hierarchical net of concepts constructed specially as a tool for different applications of automatic text processing. It contains many terms from the economic, financial, political, military, social, legislative and cultural spheres. The Thesaurus has the following main features (for a detailed description see Loukachevitch et al 1999):
The Sociopolitical thesaurus differs from conventional information-retrieval thesauri (LIV 1984; Subject Headings 1991; UNBIS Thesaurus 1976) and from such linguistic resources as WordNet (Miller et al 1990) and EuroWordNet (Climent et al 1996). The goal in developing a conventional information retrieval thesaurus is to describe terms necessary for representation of main topics of documents. More specific terms are not included. Ambiguous terms are provided with scope notes and comments convenient for human subjects. In fact a conventional information retrieval thesaurus describes an artificial language based on a real language of a domain. Human subjects have to use their domain, common sense, and grammatical knowledge not described in a thesaurus in order to index documents. Therefore conventional information-retrieval thesauri created for manual indexing are hard to utilise in an automatic indexing environment (Salton 1989). To be effective in automatic text processing a thesaurus needs to include a lot of information that is usually missed in thesauri for manual indexing. On the other hand the Sociopolitical thesaurus differs from such linguistic resources as WordNet (Miller et al 1990) and EuroWordNet (Climent et al 1996):
The quality of such automatic text processing procedures as text indexing, text categorisation and text summarisation depends on the quality of recognition of the main theme and subthemes of a text. All these procedures should be based on the same document representation in terms of themes and subthemes and on the same linguistic resource.
Construction of the thematic representation of a text is based on lexical cohesion, a property of a connected text. Cohesion is one of the main properties of a connected text (Halliday and Hasan 1976). It involves relations between words that connect different parts of the text. Lexical cohesion is the most frequent type of cohesion. It can be expressed by repetitions, synonyms and hyponyms, or by words connected by other semantic relations such as whole - part, situation - participant, object - property and so on. For example, in a fragment of text FB6-F001-0015 (see Appendix) from the Text Retrieval Conference text collection (Voorhees and Harman 1995) the terms border troops and serviceman establish one chain of lexical cohesion; the terms poaching, illegal activity and law form another chain of lexical cohesion relations. Such relations can connect sentences of a text without visible markers:
The border troops 'are not saber rattling' in Russian territorial waters in the Far East as the mass media, especially the Japanese mass media, are attempting to portray it. Servicemen have been legally granted the right to utilize all of the tools at their disposal, including weapons, to put a stop to poaching. Russian Border Troops Commander-in-Chief Colonel-General Andrey Nikolayev stated that to an ITAR-TASS correspondent while stressing that his subordinates are conducting a strict policy to put a stop to the illegal activities of foreign boats. He noted that the President of Russia supports the position of the border troops for the full observance of the law in the country's territorial waters.
Cohesion relations connect not only the sentences of a text, but also its main theme and subthemes with each other. Van Dijk and Kintsch (1983) describe the topical structure of a text, the macrostructure, as a hierarchical structure in the sense that the theme of a whole text can be identified and summed up in a single macroproposition. The theme of the text can usually be described in terms of less general themes, which in turn can be characterised in terms of even more specific themes, and so on.
This means that a connected text has its main theme and this main theme can be formulated. Formulation of the main theme names the most important concepts of the text and the relations between them (here and below when we say 'concepts of a text' we imply concepts, the terms of which were mentioned in a text). We call concepts of the main theme of a text 'macroconcepts'. Subthemes of the text discuss relations between macroconcepts or important aspects of a macroconcept. To refer to the main theme a subtheme has to include a macroconcept or a concept related to it; in the sentences of a text such references look like lexical cohesion relations.
Relationships between concepts in the main theme can be subdivided into two subtypes. Some concepts together name an object. For example, the concepts border troops and Russian Federation from the example text name the object 'border troops of the Russian Federation'; the concepts Japan and boat name the object 'Japanese boats'. Relations between such concepts are considered as known and usually are not discussed in the text. Various combinations of such concepts and their related concepts often occur together in the text to refer to the corresponding objects, as for example in 'Russian Border Troops Commander-in-Chief Colonel-General Andrey Nikolayev'.
Relationships between other macroconcepts are discussed in subthemes of the text: how border troops plan to struggle with poaching; what their needs are in order to overcome the problem; the results; how Japanese fishermen poach; how they interact with border troops, and so on. Therefore combinations of such macroconcepts and their related concepts also often occur together in clauses and sentences of the text. Taking into account the presentation of macroconcepts and their relations in the text we made three main assumptions:
We can restore the conceptual net of a text using the Thesaurus. For every concept of a text we take its direct thesaurus relations with other textual concepts from the Thesaurus or automatically infer them using properties of thesaurus relations. This gives us the conceptual net, or so-called 'thesaurus projection', of the text.
The resulting conceptual net can be subdivided into conceptual nodes. We call a set of concepts related to the same concept a 'thematic node'. The concept that all concepts of a thematic node are related to is called the 'thematic centre'. Thematic nodes with macroconcepts as thematic centres are called the 'main thematic nodes' of a text. Thematic nodes with concepts of various subthemes of a text as thematic centres are called 'specific thematic nodes'.
It is not necessary to create thematic nodes around every concept of the text. We assume that the thematic centre has to be more important for the text content than the other concepts of the thematic node and that it has to be somehow stressed in the text: it can be used in the title or at the beginning of the text, or it can have the highest frequency among related concepts. Thematic nodes can be constructed around such concepts.
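The grouping of concepts around thematic centres described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name, the greedy order (stressed concepts first, then by falling frequency), the tie-breaking and the toy relation table are all hypothetical. The rule that a concept already included in a node cannot become a new centre is taken from the description of thematic node creation later in the paper. The toy relations are hand-written from the two cohesion chains of the example text and are not real Thesaurus data.

```python
from collections import Counter


def build_thematic_nodes(concepts, related, stressed):
    """Greedy grouping of text concepts into thematic nodes (illustrative only).

    concepts: concepts of the text in order of occurrence, with repeats
    related:  dict mapping a concept to the set of concepts related to it
              in the thesaurus projection (hypothetical toy data here)
    stressed: concepts from the title or the beginning of the text
    """
    freq = Counter(concepts)
    # Candidate centres: stressed concepts first, then the rest by falling frequency.
    by_freq = sorted(freq, key=lambda c: (-freq[c], c))
    candidates = [c for c in stressed if c in freq]
    candidates += [c for c in by_freq if c not in candidates]

    nodes = {}
    assigned = set()
    for centre in candidates:
        if centre in assigned:
            continue  # already inside some node, so it cannot become a centre
        members = {c for c in related.get(centre, set()) if c in freq and c != centre}
        if not members:
            continue  # assumption: no node is created around an isolated concept
        nodes[centre] = members
        assigned.add(centre)
        assigned.update(members)
    return nodes


# Toy input loosely following the two cohesion chains of the example text.
concepts = ["border troops", "serviceman", "poaching", "border troops",
            "poaching", "illegal activity", "law"]
related = {
    "border troops": {"serviceman"},
    "serviceman": {"border troops"},
    "poaching": {"illegal activity", "law"},
    "illegal activity": {"poaching"},
    "law": {"poaching"},
}
nodes = build_thematic_nodes(concepts, related, stressed=["border troops"])
```

In this sketch a concept may still appear as a member of several nodes, which is consistent with the intersecting thematic nodes mentioned in the paper.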
To estimate the distribution of concepts of different thematic nodes in a text, we use the notion of 'textual relation': a given concept has textual relations with those concepts of the text that are located no further than three concepts from the given concept (location order is not important). Thus the context of a concept occurrence is determined by the quantity of meaningful elements. Other words (not from the Thesaurus) are not included in the count. Textual relations pass through sentence borders and are interrupted only by paragraphs. The restriction of 'three concepts' of textual relations was derived experimentally. It means that every occurrence of a concept is considered within a set of seven neighbouring concepts (three on each side plus the concept itself).
We incorporate the constructed thematic nodes in the 'thematic representation' of the text. The thematic representation of a text is a hierarchical structure of concepts where concepts semantically or thematically related to thematic centres are gathered in thematic nodes. Thematic nodes whose thematic centres can characterise the contents of the text are main thematic nodes. The hierarchy of the thematic representation characterises the importance of terms in the text: the thematic centre is more important than the other terms of the thematic node; terms of main thematic nodes are more important than the terms of other thematic nodes.
Recent work investigates lexical cohesion, expressed mainly by repetitions, synonyms and hyponyms, and the construction of 'lexical chains'. A lexical chain is a chain of words in which the criterion for inclusion of a word is some kind of cohesive relationship to a word that is already in the chain (Morris and Hirst 1991). Morris and Hirst also proposed a specification of cohesive relations based on Roget's Thesaurus. Hirst and St-Onge (1997) and Barzilay and Elhadad (1997) construct lexical chains based on WordNet relations.
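The notion of textual relation defined above lends itself to a short sketch. The code below is a minimal illustration, assuming that each paragraph has already been reduced to its sequence of Thesaurus concepts; the function name and the sample sequences are hypothetical, and skipping relations of a concept with itself is our own simplification.

```python
from collections import Counter


def textual_relations(paragraphs, window=3):
    """Count unordered concept pairs lying within `window` concepts of each other.

    paragraphs: list of paragraphs, each a list of concepts in text order
                (words outside the Thesaurus are assumed to be removed already)
    """
    counts = Counter()
    for concepts in paragraphs:  # relations never cross a paragraph break
        for i, a in enumerate(concepts):
            # Looking only forward counts every pair of positions exactly once.
            for b in concepts[i + 1 : i + 1 + window]:
                if a != b:  # assumption: a concept has no relation with itself
                    counts[frozenset((a, b))] += 1
    return counts


# Hypothetical concept sequences for two paragraphs.
paragraphs = [
    ["border troops", "mass media", "serviceman", "poaching", "border troops"],
    ["law", "territorial waters"],
]
relations = textual_relations(paragraphs)
```

Because the pairs are stored as unordered sets, the location order of the two concepts plays no role, as the definition requires.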
The main stages in the proposed construction of lexical chains are as follows:
In our Thesaurus we tried to describe different types of conceptual relations that can be useful for the detection of lexical cohesion in texts, and we specially tested the Thesaurus as a source of cohesive relations in texts.
Text units are compared with terms of the Thesaurus using morphological representation of the text and terms. If the same fragment of a text corresponds to different concepts of the Thesaurus, ambiguity of the text unit is indicated. Texts can include names that coincide with terms of the Thesaurus. A name that corresponds to a term of the Thesaurus but has different spelling (capital letters, quotes) is marked as an ambiguous term. After comparison with the Thesaurus the text is represented as a sequence of concepts. All terms of any concept are represented by the concept and are not differentiated further. On the basis of the whole set of concepts of the text the thesaurus projection of the text is constructed (see section 3). Figure 2 shows a fragment of the thesaurus projection of the example text.
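The comparison of text units with Thesaurus terms can be pictured with the following sketch. It is not the matching procedure of the described system: we assume that the text is already morphologically normalised, we use greedy longest-match lookup as a stand-in technique, and the term index, including the ambiguous entry for law, is invented for illustration.

```python
def match_concepts(tokens, term_index, max_len=3):
    """Greedy longest-match lookup of thesaurus terms in a normalised token list.

    Returns a list of (term, concepts, is_ambiguous) triples in text order.
    """
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate term first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            term = " ".join(tokens[i : i + n])
            concepts = term_index.get(term)
            if concepts:
                # A term linked to several concepts is flagged as ambiguous.
                found.append((term, sorted(concepts), len(concepts) > 1))
                i += n
                break
        else:
            i += 1  # no thesaurus term starts at this token
    return found


# Hypothetical term index: term -> set of concepts it may denote.
term_index = {
    "border troops": {"BORDER TROOPS"},
    "territorial waters": {"TERRITORIAL WATERS"},
    "law": {"LAW (LEGAL ACT)", "LAW (LEGAL SYSTEM)"},
}
tokens = ["border", "troops", "observe", "law", "in", "territorial", "waters"]
matches = match_concepts(tokens, term_index)
```

Ambiguous matches would then be passed to the disambiguation step, which uses the thesaurus projection to choose among the candidate concepts.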
Concepts corresponding to different meanings of ambiguous terms also participate in construction of the thesaurus projection for a text. Using the thesaurus projection a proper meaning of an ambiguous term is chosen. For every meaning of an ambiguous term the following conditions are verified:
If one of the conditions is met we consider that the text 'supports' this meaning of the ambiguous term. If the text supports only one meaning of the ambiguous term the corresponding concept is chosen. If the text supports more than one meaning of the term we look through concepts that are the nearest ones to every usage of the ambiguous term and choose the meaning of the concept supported by the nearest concepts. Only chosen concepts participate in further processing of the text. The creation of the thematic nodes begins from choosing the thematic centres. At first concepts mentioned in the title and first sentence of the text can gather all related concepts from the thesaurus projection and become the thematic centres of thematic nodes. Then the most frequent concepts of the text can become thematic centres. A concept included in a thematic node cannot become the thematic centre of a new thematic node. Let us consider document FB6-F001-0015 from TIPSTER Text Collection. Some thematic nodes that were constructed during automatic processing of the example text (the right column represents concept frequency in the text) are as follows:
Figure 2 represents two intersecting thematic nodes: a thematic node with the thematic centre fish and a thematic node with the thematic centre poaching.
During the comparison of the text with the Thesaurus terms, the textual relations of every concept (that is, its neighbouring concepts) are collected. As a result we obtain a set of textual relations for every concept of the text. For example, here are fragments of the set of textual relations of the concept fish obtained during processing of the text (the frequency of textual relations is indicated on the right):
fish
Textual relations between concepts are determined at the stage of comparison of the text with the Thesaurus. After construction of thematic nodes the textual relations of the concepts in each thematic node are summed up, and we derive the textual relations between thematic nodes. Let us consider fragments of textual relations between thematic nodes in the above example. Thematic nodes are represented by their thematic centres; the numbers to the right are the total frequencies of textual relations between thematic nodes; textual relations are given for the thematic node with the thematic centre fish.
Fish
In our approach we assume that the main thematic nodes are, first of all, those that:
In our example the main thematic nodes were the thematic nodes with the main concepts border guards, territorial waters, Russian Federation, fish, poaching, Japan. The main thematic nodes determined in this way set a threshold that distinguishes main thematic nodes from all the other thematic nodes of a text. This threshold is the average frequency of concepts in the main thematic nodes already determined. The initial set of main thematic nodes is supplemented with those thematic nodes whose frequency exceeds the threshold.
Besides main thematic nodes there are specific thematic nodes and mentioned concepts. Specific thematic nodes represent the primary characteristics of the main topics discussed in the text. Specific nodes are those thematic nodes that have textual relations with at least two different main thematic nodes. Concepts that are not elements of main or specific thematic nodes are called mentioned concepts. Specific thematic nodes are as follows:
logistics: equipment, computer
mass media: correspondent
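The threshold rule and the definition of specific thematic nodes given above can be put together in a small sketch. Several points are our assumptions rather than details from the paper: the frequency of a node is taken to be a single number per node (for instance the summed frequency of its concepts), the threshold is the mean of these numbers over the initial main nodes, and all frequencies and relations in the sample data are invented. The criteria for the initial set of main nodes are not reproduced here, so that set is simply passed in.

```python
def classify_nodes(node_freq, initial_main, node_relations):
    """Split thematic nodes into main, specific and remaining ones (illustrative).

    node_freq:      dict centre -> frequency of the node (assumed to be one number)
    initial_main:   non-empty list of centres already recognised as main nodes
    node_relations: dict centre -> set of centres it has textual relations with
    """
    # Assumed reading of the rule: mean frequency over the initial main nodes.
    threshold = sum(node_freq[c] for c in initial_main) / len(initial_main)
    # Nodes above the threshold are added to the initial main nodes.
    main = set(initial_main) | {c for c, f in node_freq.items() if f > threshold}
    # Specific nodes are related to at least two different main nodes.
    specific = {
        c for c in node_freq
        if c not in main and len(node_relations.get(c, set()) & main) >= 2
    }
    other = set(node_freq) - main - specific
    return threshold, main, specific, other


# Invented numbers; only the node names come from the example text.
node_freq = {"border troops": 12, "poaching": 8, "fish": 11,
             "logistics": 4, "mass media": 3, "expert": 1}
node_relations = {
    "logistics": {"border troops", "poaching"},
    "mass media": {"border troops", "fish"},
    "expert": {"poaching"},
}
threshold, main, specific, other = classify_nodes(
    node_freq, ["border troops", "poaching"], node_relations
)
```

With these invented numbers, fish joins the main nodes because its frequency exceeds the threshold, while expert, related to only one main node, stays among the remaining concepts.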
Mentioned concepts are legislator, expert, ice situation .... Thus all concepts of the text are divided into five classes of different importance for the text:
We made intensive use of the thematic representation to verify the Thesaurus. We used the following procedure. Thematic representations of texts were produced. Our specialists compared the contents of the texts with the main thematic nodes identified in the thematic representations of these texts. If they found considerable differences between them, in most cases the reason was some error or inaccuracy in the Thesaurus descriptions. In this way the described variants of concepts, ambiguity of terms, missed or extra relations between concepts, and English translations were verified.
Fig.3. Structural thematic summary of text FB6-F001-0015
The full thematic representation is too detailed to serve for an evaluation of text relevance. The main topics section alone is not enough for an adequate identification of the contents of documents because it does not represent the aspects of topics discussed in a text. Full main thematic nodes are much more informative, but in a large text such thematic nodes can be long. We therefore created a new structural form of the most important parts of the thematic representation - a structural thematic summary (Figure 3). A structural summary allows us to estimate the contents of a document at first sight. This is the structural summary for the example text (see the fragment in Appendix). A structural thematic summary contains the following parts:
- marks of strength of textual relations between different thematic nodes:
Our experience showed that a thematic representation can be produced for texts of any size and for a wide variety of genres and can be used for different applications. Thematic representations became a basis for information retrieval (Yudina and Dorsey 1995), automatic text categorisation (Loukachevitch 1997) and text summarisation (Loukachevitch 1998) in University Information System RUSSIA (http://www.cir.ru/eng/). Over 500 Mb of texts, including 100 Mb of Russian official documents (Presidential and governmental decrees 1990-2000), 200 Mb of reports by Russian information agencies, and newspaper articles, were processed. The smallest texts had a size of 100 bytes; one of the biggest texts, the Russian Civil Code, was more than 500 Kb. We also processed texts in English, such as documents of the 104th and 105th Congresses, documents from the routing task of TREC6 and ad hoc tasks of TREC8 (Dobrov, Loukachevitch and Yudina 1997).
In 1998 we participated in the text categorisation task of the SUMMAC conference, using another of our summarisation technologies, automatic sentence-based text summarisation (Loukachevitch 1998). Evaluation in this task was as follows. Given a document, which could be a summary or a full-text source, the human subject determines to which single category of six categories (each of which has an associated topic description) the document is relevant (the sixth category being 'none of the above'). Here the evaluation seeks to determine whether a summary is effective in capturing whatever information in the document is needed to categorise the document correctly. Ten topics were chosen, with 100 documents used per topic. These topics were selected such that they could be grouped into two mutually exclusive classes: environment and global economy. Participants submitted two summaries: a fixed-length summary limited to 10% of the character length of the source and a 'best' summary which was not limited in length.
Our summaries of 'best length' had a maximal F-score (SUMMAC Final Report 1998). The F-score of our 10% summaries was above the medium level. The system extracted sentences in correspondence with the constructed thematic representations. The main rule for choosing a sentence for a summary was that it used terms from different main thematic nodes. The constructed set of thematic nodes allowed us to control whether all concepts of the main theme of a text were mentioned within the restricted volume of the summary. The system was fully based on the Thesaurus knowledge and did not include any processing of the numerous proper names, which were very important in the texts. Nevertheless we obtained good results. Therefore we consider our results in this competition as confirmation of the quality of our representation of text contents and representation of knowledge.
We proposed a new technique for representing the contents of documents. A structural thematic summary presents the main theme and subthemes of a document, which are simulated by sets of semantically related terms. A structural thematic summary can be constructed for documents of various sizes and genres. It is particularly useful for users of multilingual text retrieval systems. Clarity and readability of the structural thematic summary are provided by:
Fragments of Text FB6-F001-0015