A Health Corpus selected and downloaded from the Web – is it healthy enough?

By November 17, 2016,
Page 71-78
Author Anna Braasch
Title A Health Corpus selected and downloaded from the Web – is it healthy enough?
Abstract The work on the Danish corpus-based computational lexicon STO was finished by the end of February 2004. This is the most comprehensive computational lexicon of Danish developed for NLP/MLT applications, and it also contains lemmas from six different domains. The domain vocabulary has been selected from corpora created by assembling texts from the web; these corpora also form the basis for linguistic analysis and description. The average size of these corpora is 1.5 million tokens. Although the method of text selection and encoding of linguistic information was identical for all domains, a comparison of the lexicon entries showed that the health domain entries are considerably less complex than those of other domains. In order to investigate the cause of this difference, an additional corpus consisting of encyclopedic articles has been used as a control for the health domain. This paper focuses on a comparison of syntactic structures, especially the variety of prepositional complements, in these two corpora. Firstly, the basic properties of both corpora are discussed from the point of view of comparability; secondly, the objectives and the method of the comparison are outlined; thirdly, some results and insights gathered from the work are presented. Finally, a method of using supplementary corpus look-ups in the case of deficiencies is suggested.
Session Computational Lexicography and Lexicology
Keywords
BibTex
@InProceedings{ELX04-007,
author = {Anna Braasch},
title = {A Health Corpus selected and downloaded from the Web - is it healthy enough?},
pages = {71-78},
booktitle = {Proceedings of the 11th EURALEX International Congress},
year = {2004},
month = {july},
date = {6-10},
address = {Lorient, France},
editor = {Geoffrey Williams and Sandra Vessier},
publisher = {Université de Bretagne-Sud, Faculté des lettres et des sciences humaines},
isbn = {29-52245-70-3},
}