Table of Contents

Table of Contents ..... vii
Preface ..... 1

WAC3 ..... 3
Kevin P. Scannell, The Crúbadán Project: Corpus building for under-resourced languages ..... 5
Sebastian Blohm, Philipp Cimiano, A Human Evaluation of Filtering Functions for Pattern-based Extraction of Arbitrary Relations from the Web ..... 17
Emmanuel Cartier, TextBox, a Written Corpus Tool for Linguistic Analysis ..... 33
William H. Fletcher, Implementing a BNC-Compare-able Web Corpus ..... 57
Igor Leturia, Antton Gurrutxaga, Iñaki Alegria, Aitzol Ezeiza, CorpEus, a "web as corpus" tool designed for the agglutinative nature of Basque ..... 69
Serge Sharoff, Classifying Web corpora into domain and genre using automatic feature identification ..... 83
Anil Kumar Singh, Jagadeesh Gorla, Identification of Languages and Encodings in a Multilingual Document ..... 95

CLEANEVAL ..... 109
Daniel Bauer, Judith Degen, Xiaoye Deng, Priska Herger, Jan Gasthaus, Eugenie Giesbrecht, Lina Jansen, Christin Kalina, Thorben Krüger, Robert Märtin, Martin Schmidt, Simon Scholler, Johannes Steger, Egon Stemle, Stefan Evert, FIASCO: Filtering the Internet by Automatic Subtree Classification, Osnabrück ..... 111
Stefan Evert, StupidOS: A high-precision approach to boilerplate removal ..... 123
Weizheng Gao, Tony Abou-Assaleh, GenieKnows Web Page Cleaning System ..... 135
Christian Girardi, Htmcleaner: Extracting the Relevant Text from the Web Pages ..... 141
Katja Hofmann, Wouter Weerkamp, Web Corpus Cleaning using Content and Structure ..... 145
Michal Marek, Pavel Pecina, Miroslav Spousta, Web Page Cleaning with Conditional Random Fields ..... 155
Xabier Saralegi, Igor Leturia, Kimatu, a tool for cleaning non-content text parts from HTML docs ..... 163