Benefiting from Multidomain Corpora to Extract Terminologically Relevant Multiword Lexical Units

Gaël Dias; Sylvie Guilloré; José Gabriel Pereira Lopes

By admyn, November 17, 2016. Euralex 2000, Publications

Page	339-350
Author	Gaël Dias, Sylvie Guilloré, José Gabriel Pereira Lopes
Title	Benefiting from Multidomain Corpora to Extract Terminologically Relevant Multiword Lexical Units
Abstract	The acquisition of terminology for particular domains has long been a significant problem in Natural Language Processing, requiring a great deal of manual effort. In order to provide terminologists with powerful tools for creating, maintaining and upgrading terminological data collections, we present the SENTA software, which retrieves terminologically relevant multiword lexical units from naturally occurring text. SENTA is a statistical system that combines a new association measure based on the concept of normalised expectation, the Mutual Expectation, with a new acquisition process based on an algorithm of local maxima, the LocalMaxs. The results obtained by applying SENTA to the IJS-ELAN Slovene-English parallel corpus show that it extracts a large proportion of terms, with 74% precision on average. Moreover, further experiments show that the average precision can be raised to as much as 82% by exploiting the multidomain structure of the IJS-ELAN Slovene-English parallel corpus.
Session	PART 8 - Extraction of terminologically relevant multiword expressions
Keywords
BibTex	@InProceedings{ELX00-041, author = {Gaël Dias and Sylvie Guilloré and José Gabriel Pereira Lopes}, title = {Benefiting from Multidomain Corpora to Extract Terminologically Relevant Multiword Lexical Units}, pages = {339-350}, booktitle = {Proceedings of the 9th EURALEX International Congress}, year = {2000}, month = {aug}, date = {8-12}, address = {Stuttgart, Germany}, editor = {Ulrich Heid and Stefan Evert and Egbert Lehmann and Christian Rohrer}, publisher = {Institut für Maschinelle Sprachverarbeitung}, isbn = {3-00-006574-1}, }