A co-occurrence taxonomy from a general language corpus

By November 17, 2016,
Page 367-375
Author Rogelio Nazar, Irene Renau
Title A co-occurrence taxonomy from a general language corpus
Abstract This paper presents a quantitative approach to the generation of a taxonomy of general language. The methodology is based on statistics of word co-occurrence and it exploits the fact that word association is asymmetrical in nature, in much the same way as hyperonymy relations are. Words tend to be syntagmatically associated with their hyperonyms, though this is not true the other way round. Taking advantage of this phenomenon, and with the help of directed graphs of word co-occurrence, we were able to collect hyperonym-hyponym pairs using a reference corpus of general language as the only source of information, i.e., without using lexico-syntactic patterns nor any kind of pre-existing semantic resources such as dictionaries, ontologies or thesauri. The results obtained by using this method are not precise enough to be used for immediate practical purposes, but they confirm the hypothesis that as a general rule hyperonymy is linked to asymmetric co-occurrence relations. The paper discusses an experiment in Spanish, but we believe the same conclusions apply to other languages as well.
Session Corpus-driven lexicography
Keywords asymmetric word association, computational lexicography, co-occurrence statistics, distributional semantics, taxonomy extraction
author = {Rogelio Nazar and Irene Renau},
title = {A co-occurrence taxonomy from a general language corpus},
pages = {367--375},
booktitle = {Proceedings of the 15th EURALEX International Congress},
year = {2012},
month = {aug},
date = {7-11},
address = {Oslo,Norway},
editor = {Ruth Vatvedt Fjeld and Julie Matilde Torjusen},
publisher = {Department of Linguistics and Scandinavian Studies, University of Oslo},
isbn = {978-82-303-2228-4},