CENTRE FOR NATURAL LANGUAGE PROCESSING



GLOSSANET

Title

GlossaNet: a linguistic search engine for RSS-based corpora

Abstract

GlossaNet 2 is a free online concordance service that lets users search dynamic Web corpora. Using GlossaNet involves two steps. First, users define a corpus by selecting RSS feeds from a preselected pool of sources (they can also add their own RSS feeds). A crawler visits these sources regularly to build a dynamic corpus. Second, users can register one or more search queries on their dynamic corpus. The queries are re-applied every time the corpus is updated, and new concordances are recorded for the user (results can be emailed, published in a private RSS feed, or viewed online). The service integrates two existing programs: Corporator (Fairon, 2006), which creates corpora by downloading and filtering RSS feeds, and Unitex (Paumier, 2003), an open source corpus processor that relies on linguistic resources. After a short introduction, we briefly present the concept of “RSS corpora” and the advantages of this approach to corpus development. We then give an overview of the GlossaNet architecture and present various use cases.
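
The two-step workflow described in the abstract (build a corpus from RSS feeds, then re-apply registered queries on each update) can be illustrated with a minimal Python sketch. It is not the GlossaNet, Corporator, or Unitex implementation: the class `DynamicCorpus`, the helper `kwic`, and the sample feed are hypothetical names introduced here, a plain regular expression stands in for a Unitex query, and the feed is an inline string rather than a crawled source.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample feed; a real crawler would download feeds periodically.
SAMPLE_RSS = """<rss version="2.0"><channel><title>Demo feed</title>
<item><guid>a1</guid><description>She was like, hey, it works!</description></item>
<item><guid>a2</guid><description>Nothing to see here.</description></item>
</channel></rss>"""


def kwic(text, match, width=20):
    """Format one match as a keyword-in-context concordance line."""
    left = text[max(0, match.start() - width):match.start()]
    right = text[match.end():match.end() + width]
    return f"{left}[{match.group(0)}]{right}"


class DynamicCorpus:
    """Toy dynamic corpus: stores unseen RSS items and re-applies queries."""

    def __init__(self):
        self.seen = set()    # identifiers of items already in the corpus
        self.queries = {}    # query name -> compiled regular expression
        self.results = {}    # query name -> recorded concordance lines

    def register_query(self, name, pattern):
        # Assumption: a regex replaces the linguistic queries Unitex supports.
        self.queries[name] = re.compile(pattern, re.IGNORECASE)
        self.results[name] = []

    def update(self, rss_xml):
        """Add unseen items, apply every query to them, return new hit count."""
        new_hits = 0
        for item in ET.fromstring(rss_xml).iter("item"):
            text = item.findtext("description") or ""
            guid = item.findtext("guid") or text  # fall back to the text itself
            if guid in self.seen:
                continue  # item was already processed in an earlier update
            self.seen.add(guid)
            for name, regex in self.queries.items():
                for match in regex.finditer(text):
                    self.results[name].append(kwic(text, match))
                    new_hits += 1
        return new_hits
```

Registering a query such as `r"\b(was|is|am|are|were) like\b"` and calling `update(SAMPLE_RSS)` twice records a concordance only on the first call, because the second call finds no unseen items. This mirrors the idea that only new concordances are reported after each corpus update.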

References

  • Fairon Cédrick, Macé Kévin, Naets Hubert, "GlossaNet2: a linguistic search engine for RSS-based corpora", In: Proceedings of LREC 2008, Workshop Web As Corpus (WAC4), Marrakesh, 2008.
  • Fairon Cédrick, "Corporator: A tool for creating RSS-based specialized corpora", In: Proceedings of the Workshop Web as Corpus, EACL 2006, Trento, 2006.
  • Fairon Cédrick, Singler John V., "I'm like, 'Hey, it works!': Using GlossaNet to find attestations of the quotative (be) like in English-language newspapers", In: The Changing Face of Corpus Linguistics, A. Renouf and A. Kehoe (eds.), Rodopi.
  • Dister Anne, Fairon Cédrick, "Extension des ressources lexicales grâce à un corpus dynamique", Lexicometrica, 2004.
  • Fairon Cédrick, "Extension dynamique de ressources lexicales par consultation du Web", Bulletin de linguistique appliquée et générale (BULAG), 26, 2001, p. 39-52.

Duration

  • Start: 1998 (UP7, Fr).
  • From: 2001 (UCL, Be)

Researcher(s)

Advisor