The use of analogy in natural language processing
-- Application to machine translation --
This tutorial examines the application of a particular operation, analogy, to natural language processing.
An analogy puts four objects, A, B, C and D, into relation. The usual notation is A:B::C:D.
It reads "A is to B as C is to D". For instance, on words:
English: to walk : I walked :: to laugh : I laughed
Arabic: arsala : mursilun :: aslama : muslimun
and on sentences:
I'd like to open these windows. : Could you open a window? :: I'd like to cash these traveler's checks. : Could you cash a traveler's check?
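One property that the three examples above share can be checked mechanically: every symbol occurs as many times in A and D taken together as in B and C taken together. The sketch below assumes this balance as a necessary, but not sufficient, condition on analogies of form between strings; the abstract itself does not state it, and the name `counts_balance` is purely illustrative.

```python
from collections import Counter

def counts_balance(a: str, b: str, c: str, d: str) -> bool:
    """Check that each symbol occurs as often in A+D as in B+C.

    Assumed here as a necessary condition only: any reordering of the
    symbols of D passes this check as well, so it cannot confirm an analogy.
    """
    return Counter(a) + Counter(d) == Counter(b) + Counter(c)

# The three examples quoted in the text, as (A, B, C, D) tuples.
EXAMPLES = [
    ("to walk", "I walked", "to laugh", "I laughed"),
    ("arsala", "mursilun", "aslama", "muslimun"),
    ("I'd like to open these windows.",
     "Could you open a window?",
     "I'd like to cash these traveler's checks.",
     "Could you cash a traveler's check?"),
]

for a, b, c, d in EXAMPLES:
    print(counts_balance(a, b, c, d), "|", a, ":", b, "::", c, ":", d)
```

All three examples pass the check, whereas replacing "I laughed" with "I laughing" fails it. A scrambled fourth term such as "I laughde" still passes, which is why the check can only rule candidates out.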
The purpose of the tutorial is to show how analogy may contribute to natural language processing, both within corpus-based approaches and within "least effort" approaches.
First part (1h30)
The first part of the tutorial will give a historical perspective on the notion and will set terminological restrictions, because the meaning of the word has been widened over the ages, misunderstood in some respects, and abused. It is thus necessary to distinguish between analogy, similarity and metaphor. We shall mention some works where the term seems to be abused. A historical perspective will allow us to identify the articulatory and notional concepts at work in analogy. It also explains why the notion has been deemed suspect in the natural language processing community.
The second part of the tutorial will tackle the problem of analogies between strings of symbols. General formal properties will be given first. Linguistic examples will show that, although the problem appears to be simple, its resolution is not. We shall then distinguish between different types of analogies between strings of symbols and give a tentative formalisation for one particular case, one that is relatively economical in terms of the notions used. From this formalisation, we shall define a family of languages called languages of analogical strings, for which we shall present some formal results. In particular, we will discuss their adequacy with respect to the problem of the power of natural languages. We shall conclude this part with some consequences for natural language processing.
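To illustrate why solving A : B :: C : x is harder than it looks, here is a deliberately naive solver based on prefix and suffix substitution around the longest common substring of A and B. It is not the formalisation presented in the tutorial; the function name `solve_by_affixes` and the strategy itself are assumptions made only for this sketch.

```python
from difflib import SequenceMatcher
from typing import Optional

def solve_by_affixes(a: str, b: str, c: str) -> Optional[str]:
    """Naive solver for A : B :: C : x; returns x, or None when it gives up.

    Strategy (illustrative only): split A and B around their longest common
    substring, then swap the prefix and suffix of A for those of B inside C.
    """
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    m = matcher.find_longest_match(0, len(a), 0, len(b))
    if m.size == 0:
        return None  # A and B share nothing to anchor on
    a_pre, a_suf = a[:m.a], a[m.a + m.size:]
    b_pre, b_suf = b[:m.b], b[m.b + m.size:]
    # C must carry the same prefix and suffix as A, without overlap.
    if len(c) < len(a_pre) + len(a_suf):
        return None
    if not (c.startswith(a_pre) and c.endswith(a_suf)):
        return None
    core = c[len(a_pre):len(c) - len(a_suf)]
    return b_pre + core + b_suf

print(solve_by_affixes("to walk", "I walked", "to laugh"))
print(solve_by_affixes("arsala", "mursilun", "aslama"))
```

The English example is solved ("I laughed"), but the Arabic one returns None: the change from arsala to mursilun is spread across the inside of the word, so no single prefix and suffix swap captures it. This is the kind of case that calls for a more general treatment of analogies between strings.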
Second part (1h30)
The third part will show that analogies are indeed present in some corpora and that they allow us to structure those corpora, and even to densify them to some extent. We shall tackle the problem of true analogies, that is, those analogies of form which are also analogies of meaning, and we will provide some measures obtained by sampling. In relation to this problem, we will show the efficiency of analogy in the automatic generation of paraphrases in conjunction with filtering by n-sequences.
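The filtering step can be pictured as follows. This sketch assumes that an "n-sequence" is a contiguous sequence of n characters and that a generated candidate is kept only if all of its n-sequences are attested in a reference corpus; the abstract does not spell out either point, and the function names and the toy corpus are hypothetical.

```python
from typing import Iterable, Set

def n_sequences(s: str, n: int) -> Set[str]:
    """All contiguous character sequences of length n in s (empty if len(s) < n)."""
    return {s[i:i + n] for i in range(len(s) - n + 1)}

def build_attested(corpus: Iterable[str], n: int) -> Set[str]:
    """Collect every n-sequence seen in the reference corpus."""
    attested: Set[str] = set()
    for sentence in corpus:
        attested |= n_sequences(sentence, n)
    return attested

def passes_filter(candidate: str, attested: Set[str], n: int) -> bool:
    """Keep a candidate only if all its n-sequences are attested.

    Candidates shorter than n have no n-sequences and pass trivially.
    """
    return n_sequences(candidate, n) <= attested

# Toy corpus (assumption): the two questions quoted earlier in the text.
corpus = ["Could you open a window?", "Could you cash a traveler's check?"]
attested = build_attested(corpus, 4)

print(passes_filter("Could you open a traveler's check?", attested, 4))
print(passes_filter("Could you open a check?", attested, 4))
```

With n = 4 the first candidate is kept and the second is rejected, because the sequence " a c" never occurs in the toy corpus. With so little data the filter is very conservative; the choice of n and the size of the corpus control that trade-off.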
In the fourth part of the tutorial, we will introduce a "universal engine" based on the principle of conservation of analogies between two analogical domains. The method can be applied to various tasks of natural language processing, from conjugation or declension, through parsing or word segmentation for Asian languages, up to machine translation. Some preliminary results will illustrate those tasks. We shall then discuss machine translation in more detail. We will show that the method addresses the problem of translation divergences across languages, a core problem for translation. The scores obtained by a "pure" system in two tasks of international evaluation campaigns will be presented and compared with the scores of other systems.
Although the range of applications covered will be large, it is in no way the purpose of this tutorial to claim that everything is possible in natural language processing through analogy alone. In this respect, the conclusion of the tutorial, while recalling the notions and the advantages of the method, will insist on its drawbacks, so as to show where there is room for other techniques. We would also like to address the implications of using such a linguistic operation in natural language processing.