Generation of Word Profiles on the Basis of a Large and Balanced German Corpus

By November 17, 2016,
Pages 371-383
Author Alexander Geyken, Jörg Didakowski, Alexander Siebert
Title Generation of Word Profiles on the Basis of a Large and Balanced German Corpus
Abstract In this paper we present the DWDS word profile system, a unified approach to the extraction of collocations for German, based entirely on finite-state transducers. The system is intended as an additional source of information for the DWDS web platform. The DWDS website, with 2.5 million page impressions per month, is a widely used internet platform that provides a word-information system based on a large monolingual German dictionary and the DWDS-Kerncorpus, a balanced corpus of German texts of the 20th century. The DWDS word profile consists of two parts: a language-specific part, which consists of a complete German morphology and an efficient syntax parser for German, and a language-independent part comprising a database management system for collocations and a corpus query engine, together with a web interface. We have applied the DWDS word profile to a balanced German corpus of the 20th century and present some technical details of this application. Another experiment, applying the DWDS word profile to a tabloid newspaper, shows that there may be significant differences between corpora, underlining the importance of corpus choice for language learning as well as for the construction of lexical resources. Future work will focus on language learning; in particular, we will use a simplified tag set and a more systematic description of the word profile differences between corpora. We also plan to create word profiles for the DWDS-extended corpus, a 2-billion-token corpus.
Session 1. Computational Lexicography and Lexicology
author = {Alexander Geyken and Jörg Didakowski and Alexander Siebert},
title = {Generation of Word Profiles on the Basis of a Large and Balanced German Corpus},
pages = {371--383},
booktitle = {Proceedings of the 13th EURALEX International Congress},
year = {2008},
month = {jul},
date = {15-19},
address = {Barcelona, Spain},
editor = {Elisenda Bernal and Janet DeCesaris},
publisher = {Institut Universitari de Lingüística Aplicada, Universitat Pompeu Fabra},
isbn = {978-84-96742-67-3},