Benefiting from Multidomain Corpora to Extract Terminologically Relevant Multiword Lexical Units

By November 17, 2016,
Page 339-350
Author Gaël Dias, Sylvie Guilloré, José Gabriel Pereira Lopes
Title Benefiting from Multidomain Corpora to Extract Terminologically Relevant Multiword Lexical Units
Abstract The acquisition of terminology for particular domains has long been a significant problem in Natural Language Processing, requiring a great deal of manual effort. In order to provide terminologists with powerful tools for the creation, maintenance and upgrade of terminological data collections, we present the SENTA software, which retrieves terminologically relevant multiword lexical units from naturally occurring text. SENTA is a statistical system that combines a new association measure based on the concept of normalised expectation, the Mutual Expectation, with a new acquisition process based on an algorithm of local maxima, the LocalMaxs. The results obtained by applying SENTA to the IJS-ELAN Slovene-English parallel corpus show that a great proportion of the extracted units are terms, with 74% precision on average. Moreover, by conducting further experiments, we show that the average precision rate can be drastically improved, up to 82%, by benefiting from the multidomain structure of the IJS-ELAN Slovene-English parallel corpus.
Session PART 8 - Extraction of terminologically relevant multiword expressions
BibTex
@InProceedings{ELX00-041,
author = {Gaël Dias and Sylvie Guilloré and José Gabriel Pereira Lopes},
title = {Benefiting from Multidomain Corpora to Extract Terminologically Relevant Multiword Lexical Units},
pages = {339--350},
booktitle = {Proceedings of the 9th EURALEX International Congress},
year = {2000},
month = aug,
date = {8-12},
address = {Stuttgart, Germany},
editor = {Ulrich Heid and Stefan Evert and Egbert Lehmann and Christian Rohrer},
publisher = {Institut für Maschinelle Sprachverarbeitung},
isbn = {3-00-006574-1},
}