Domain Specific Corpora from the Web

By November 17, 2016
Pages 336-342
Author Avinesh PVS, Diana McCarthy, Dominic Glennon, Jan Pomikálek
Title Domain Specific Corpora from the Web
Abstract Language usage is dependent on domain and, as a consequence, domain specific corpora are extremely useful for language learning and lexicography. It is possible to label heterogeneous data for domain either manually or automatically using human knowledge or machine learning. State-of-the-art text classification uses supervised techniques whereby a system learns from previously annotated data. This works well when such data is available in sufficient quantities for supervised machine learning, though often that is not the case depending on the domain and language required. Moreover, this approach assumes that the heterogeneous data in the available corpus covers the required domains. In this paper we present the results of an approach using WebBootCat to retrieve data from the web in eight specific domains. A key component of this work was the use of the DANTE database for generating seed words for initial web data retrieval. To tailor the corpus to the nuances of the domain categorisation that we required, we used some of our own corpus data already annotated with subject codes (domain codes) to help refine the seed words used at the start of the iterative web retrieval process. Human effort was needed to refine a whitelist of words for each domain to reduce the chance of irrelevant data due to ambiguous terms in the seeds and extracted keywords used for subsequent retrieval. The domain corpora retrieved are loaded in the Sketch Engine. The word sketches and sketch difference functionality help reveal appropriate domain specific behaviour of words in the respective corpora.
Session Corpus-driven lexicography
Keywords domain corpus, DANTE, WebBootCat.
author = {Avinesh PVS and Diana McCarthy and Dominic Glennon and Jan Pomikálek},
title = {Domain Specific Corpora from the Web},
pages = {336--342},
booktitle = {Proceedings of the 15th EURALEX International Congress},
year = {2012},
month = {aug},
date = {7-11},
address = {Oslo, Norway},
editor = {Ruth Vatvedt Fjeld and Julie Matilde Torjusen},
publisher = {Department of Linguistics and Scandinavian Studies, University of Oslo},
isbn = {978-82-303-2228-4},