Abstract |
We show a parsing-based architecture for the extraction of German verbal multiword expressions. It uses dependency parsing as a preprocessing step, allows us to extract syntactic patterns of arbitrary form from the parsed data, and comprises a relational database where each extracted multiword occurrence is stored along with the sentence it is extracted from, and with a number of morphosyntactic and syntactic features. These features serve (i) for an automatic decision about the likely idiomatization of the candidate under review, and (ii) in later lexicographic work to get a clear picture of lexicographically relevant linguistic properties of the selected candidates. We use dependency-parsed text, because this allows us to find non-adjacent multiwords and to use subcategorization knowledge to identify e.g. verb + object pairs more reliably than on the basis of surface patterns. The extraction results illustrate the potential of the tools; we can identify morphosyntactic preferences in collocations (these often indicate idiomatization), longer collocational or idiomatic structures (where e.g. the core elements and possible modifiers can be clearly distinguished), lexical variation in idioms, as well as certain specific features of collocations or idioms (e.g. preferences for negation). As all data are stored in a database, which supports a variety of generalization steps, it is in principle possible to prepare different layouts (i.e. presentations and selections) of dictionary entries, for different user groups and user needs. |
BibTeX |
@InProceedings{ELX10-020, author = {Ulrich Heid and Marion Weller}, title = {Corpus-derived data on German multiword expressions for lexicography}, pages = {331--340}, booktitle = {Proceedings of the 14th EURALEX International Congress}, year = {2010}, month = {jul}, date = {6-10}, address = {Leeuwarden/Ljouwert, The Netherlands}, editor = {Anne Dykstra and Tanneke Schoonheim}, publisher = {Fryske Akademy}, isbn = {978-90-6273-850-3}, } |