AUTHECO: AUtomatic THEsaurus COnstruction
Information extraction for the automatic construction of thesauri and semantic networks from text corpora.
Introduction and context
Semantic networks and thesauri are powerful tools for representing knowledge about a given problem domain, such as agriculture, finance, or medicine. A well-constructed thesaurus is recognized as a valuable source of semantic information for various applications, especially Information Retrieval.
An information retrieval thesaurus describes a knowledge domain by listing its main concepts and the semantic relations between them. In their simplest form, thesauri consist of a list of important terms and the relations that link them. Thesauri have been used in documentation management projects for years; libraries and documentation centers relied on them long before the computer era. This long tradition, together with the more recent success of thesaurus-based information systems, has led to the adoption of thesaurus-based techniques by industry and to the development of international standards such as ANSI Z39.19-2005.
A thesaurus serves three main purposes: (1) to provide a standard vocabulary for indexing and searching, (2) to assist users in locating terms for proper query formulation, and (3) to provide classified hierarchies that allow a request to be broadened or narrowed according to the needs of the user.
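To illustrate purpose (3), the sketch below models a toy thesaurus that records broader and narrower terms and uses it to widen or tighten a query term. The terms and structure are hypothetical examples, not taken from any real thesaurus.

```python
# Toy thesaurus mapping each term to its broader (BT) and narrower (NT) terms.
# All entries are illustrative, not from a real resource.
thesaurus = {
    "hepatitis": {"broader": ["liver disease"],
                  "narrower": ["hepatitis A", "hepatitis B"]},
    "liver disease": {"broader": ["disease"], "narrower": ["hepatitis"]},
}

def broaden(term):
    """Replace a query term with its broader terms (fall back to the term itself)."""
    return thesaurus.get(term, {}).get("broader", [term])

def narrow(term):
    """Expand a query term into its narrower terms (fall back to the term itself)."""
    return thesaurus.get(term, {}).get("narrower", [term])

print(broaden("hepatitis"))  # → ['liver disease']
print(narrow("hepatitis"))   # → ['hepatitis A', 'hepatitis B']
```

A retrieval system can apply such expansions to a user query when too few (narrow) or too many (broaden) documents are returned.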
EuroVOC is one example of a large contemporary information retrieval thesaurus: it is used for indexing documents of the European Parliament, the Office for Official Publications of the European Communities, and many other European institutions. Another well-known thesaurus is AgroVOC, a multilingual, structured, and controlled vocabulary designed to cover the terminology of all subject fields in agriculture, forestry, fisheries, food, and related domains. This resource was created by the Food and Agriculture Organization of the United Nations (FAO) and is used in many applications all over the world.
The main hindrances to thesaurus-oriented approaches are the high complexity and cost of manual thesaurus creation. The traditional construction process involves a great amount of manual labor and has proved to be very time-consuming and costly. Furthermore, it offers no easy way to keep semantic resources up to date. All these factors limit the applications of thesaurus-oriented approaches. One solution is to develop an information technology that automates thesaurus construction; this is the main objective of this research project.
This research project aims to develop an information extraction technology for the automatic construction of semantic networks and thesauri from corpora of domain-specific texts.
The project investigates two problems of (semi-)automatic thesaurus generation from text corpora. The first is selecting salient domain-specific terms from a corpus, a subtask known as term extraction. For instance, to construct a medical thesaurus such as MeSH, we should include terms relevant to the medical domain, such as "poliomyelitis", "hepatitis A", or "swine vesicular disease", but not terms like "gear box" or "building". The second is extracting meaningful semantic relations between the terms of a domain, such as synonymy, hyponymy, and association; this subtask is known as relationship extraction.
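The two subtasks can be sketched with common textbook baselines: a frequency-contrast ("weirdness") score for term extraction and a lexico-syntactic Hearst pattern for hyponymy extraction. These are generic illustrations under simplifying assumptions (single-word terms, toy corpora), not necessarily the methods developed in this project.

```python
import re
from collections import Counter

def extract_terms(domain_tokens, reference_tokens, top_k=3):
    """Rank candidate terms by how much more frequent they are in the
    domain corpus than in a general reference corpus ("weirdness" ratio)."""
    dom, ref = Counter(domain_tokens), Counter(reference_tokens)
    n_dom, n_ref = len(domain_tokens), len(reference_tokens)
    def weirdness(t):
        # Add-one smoothing so unseen reference words do not divide by zero.
        return (dom[t] / n_dom) / ((ref[t] + 1) / n_ref)
    return sorted(dom, key=weirdness, reverse=True)[:top_k]

def hearst_hyponyms(text):
    """Extract (hypernym, hyponym) pairs with the classic Hearst pattern
    'X such as Y', a simple baseline for relationship extraction."""
    return [(m.group(1), m.group(2))
            for m in re.finditer(r"(\w+) such as (\w+)", text)]

domain = "hepatitis poliomyelitis hepatitis vaccine patient patient".split()
reference = "building patient gear box vaccine road".split()
print(extract_terms(domain, reference))
# → ['hepatitis', 'poliomyelitis', 'patient']
print(hearst_hyponyms("diseases such as hepatitis are studied"))
# → [('diseases', 'hepatitis')]
```

Real systems additionally handle multi-word terms, larger pattern inventories, and statistical filtering, but the contrast-and-pattern idea is the same.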
The project involves using and developing techniques from Natural Language Processing, Computer Science, and Data Mining. The proposed information technology will be implemented in a prototype system for the automation of thesaurus construction. To bring more value from the project to Wallonia, the experiments will be conducted on datasets (text corpora) corresponding to the priority domains of the Marshall Plan: agro-industry, transport and logistics, life sciences, mechanical engineering, and aeronautics-aerospace.
You can try a small demo of the technology developed within this project. The system is a kind of "lexico-semantic search engine": given a text query, it provides a list of related words, whereas a traditional search engine returns a list of related documents. The current version is based on two semantic similarity measures, Serelex and PatternSim. The first relies on word definitions, while the second relies on a text corpus.
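As an illustration of how a corpus-based related-word lookup can work, the sketch below builds simple co-occurrence vectors from a toy corpus and ranks words by cosine similarity. This is a generic distributional baseline, not the actual Serelex or PatternSim implementation.

```python
import math
from collections import Counter, defaultdict

def cooccurrence_vectors(sentences, window=2):
    """Build sparse co-occurrence vectors from tokenized sentences."""
    vecs = defaultdict(Counter)
    for toks in sentences:
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), min(len(toks), i + window + 1)):
                if i != j:
                    vecs[w][toks[j]] += 1
    return vecs

def cosine(u, v):
    """Cosine similarity between two sparse vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def related_words(query, vecs, top_k=3):
    """Return the words most similar to the query, ranked by cosine."""
    scores = {w: cosine(vecs[query], v) for w, v in vecs.items() if w != query}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]

sentences = [["hepatitis", "affects", "the", "liver"],
             ["jaundice", "affects", "the", "liver"],
             ["tractor", "plows", "the", "field"]]
vecs = cooccurrence_vectors(sentences)
print(related_words("hepatitis", vecs, top_k=2))
# 'jaundice' and 'liver' share hepatitis's contexts and rank highest
```

Measures such as PatternSim replace raw co-occurrence counts with evidence from lexico-syntactic patterns, but the query interface ("word in, ranked related words out") is the same.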