Corpus-derived data on German multiword expressions for lexicography

Ulrich Heid; Marion Weller


By admyn; November 17, 2016; Euralex 2010, Publications

Page	331-340
Author	Ulrich Heid, Marion Weller
Title	Corpus-derived data on German multiword expressions for lexicography
Abstract	We present a parsing-based architecture for the extraction of German verbal multiword expressions. It uses dependency parsing as a preprocessing step, allows us to extract syntactic patterns of arbitrary form from the parsed data, and comprises a relational database in which each extracted multiword occurrence is stored along with the sentence it was extracted from and a number of morphosyntactic and syntactic features. These features serve (i) for an automatic decision about the likely idiomatization of the candidate under review, and (ii) in later lexicographic work, to get a clear picture of the lexicographically relevant linguistic properties of the selected candidates. We use dependency-parsed text because this allows us to find non-adjacent multiwords and to use subcategorization knowledge to identify e.g. verb + object pairs more reliably than on the basis of surface patterns. The extraction results illustrate the potential of the tools: we can identify morphosyntactic preferences in collocations (these often indicate idiomatization), longer collocational or idiomatic structures (where e.g. the core elements and possible modifiers can be clearly distinguished), lexical variation in idioms, as well as certain specific features of collocations or idioms (e.g. preferences for negation). As all data are stored in a database that supports a variety of generalization steps, it is in principle possible to prepare different layouts (i.e. presentations and selections) of dictionary entries for different user groups and user needs.
Session	Computational Lexicography and Lexicology
Keywords
BibTex	@InProceedings{ELX10-020, author = {Ulrich Heid and Marion Weller}, title = {Corpus-derived data on German multiword expressions for lexicography}, pages = {331-340}, booktitle = {Proceedings of the 14th EURALEX International Congress}, year = {2010}, month = {jul}, date = {6-10}, address = {Leeuwarden/Ljouwert, The Netherlands}, editor = {Anne Dykstra and Tanneke Schoonheim}, publisher = {Fryske Akademy}, isbn = {978-90-6273-850-3}, }
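
The abstract describes storing each extracted occurrence in a relational database together with its sentence and morphosyntactic features, and using those features to spot preferences that may indicate idiomatization. The following is a minimal sketch of that idea only, not the authors' implementation: the table name `occurrence`, its columns, the helper `feature_preference`, and the toy sentences are all assumptions made up for illustration.

```python
import sqlite3

# Hypothetical schema: one row per extracted verb + object occurrence.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE occurrence (
        verb TEXT,
        noun TEXT,
        number TEXT,      -- 'sg' or 'pl'
        determiner TEXT,  -- 'def', 'indef' or 'none'
        negated INTEGER,  -- 1 if the occurrence is negated, else 0
        sentence TEXT     -- sentence the occurrence was extracted from
    )
    """
)

# Invented toy data, standing in for the output of a dependency parser.
rows = [
    ("nehmen", "Abschied", "sg", "none", 0, "Sie nahm Abschied von ihren Kollegen."),
    ("nehmen", "Abschied", "sg", "none", 0, "Er nimmt heute Abschied."),
    ("nehmen", "Abschied", "sg", "none", 0, "Wir nahmen Abschied."),
    ("lesen", "Buch", "sg", "indef", 0, "Sie liest ein Buch."),
    ("lesen", "Buch", "pl", "none", 0, "Er liest Bücher."),
    ("lesen", "Buch", "sg", "def", 1, "Ich lese das Buch nicht."),
    ("lesen", "Buch", "pl", "def", 0, "Wir lesen die Bücher."),
]
conn.executemany("INSERT INTO occurrence VALUES (?, ?, ?, ?, ?, ?)", rows)

ALLOWED_FEATURES = {"number", "determiner", "negated"}


def feature_preference(conn, verb, noun, feature):
    """Return (most frequent value, relative share) of one feature for a
    verb + noun candidate, or None if the candidate has no occurrences."""
    if feature not in ALLOWED_FEATURES:
        raise ValueError(f"unknown feature: {feature}")
    # The column name is checked against a whitelist above, so it is safe
    # to interpolate; the lexical items are passed as bound parameters.
    counts = conn.execute(
        f"SELECT {feature}, COUNT(*) FROM occurrence "
        f"WHERE verb = ? AND noun = ? "
        f"GROUP BY {feature} ORDER BY COUNT(*) DESC, {feature}",
        (verb, noun),
    ).fetchall()
    total = sum(n for _, n in counts)
    if total == 0:
        return None
    value, n = counts[0]
    return value, n / total


print(feature_preference(conn, "nehmen", "Abschied", "determiner"))
print(feature_preference(conn, "lesen", "Buch", "number"))
```

In this toy data, "Abschied nehmen" always occurs without a determiner, whereas "Buch lesen" shows no such fixedness. A strong share for one feature value is the kind of morphosyntactic preference the abstract mentions; any threshold for treating it as a sign of idiomatization would be a separate design decision.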