Abstract
In this paper we present a newly developed formal framework, together with its practical implementation, for automatic, lexically driven analysis of Danish text tokens. The framework, called “CLINK”, employs a minimal token definition (the “morph”) and a compact lexical representation (the “CLINK template”). All morphs (i.e., text elements with an individual semantic contribution) are lexicalized using the same template: word forms, affixes, glue elements, punctuation marks, multi-word expressions, etc. Thus, the definition of “lexeme” is reinterpreted in functional-computational terms. The grammar rules of CLINK are purely abstract, viz. those of the Lambek calculus (categorial grammar). This paper gives an overview of the CLINK framework (motivations and application). We give references to performance metrics, which suggest that CLINK is on a par with the Danish state of the art in PoS tagging while providing a much richer annotation structure. However, we consider the formal framework itself to be the main contribution of this short paper.
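To make the idea of "one lexical format for every morph, combined by abstract categorial rules" concrete, here is a minimal Python sketch. It is not the CLINK implementation: the category inventory, the toy lexicon, and the names `Slash`, `combine`, `parse`, and `LEXICON` are all assumptions made for illustration, and the code covers only forward and backward function application, which is a small fragment of the Lambek calculus.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Slash:
    """Functor category (hypothetical encoding, not the CLINK template).

    direction "/"  : seeks `arg` to its right and yields `result`
    direction "\\" : seeks `arg` to its left and yields `result`
    """
    result: "Cat"
    arg: "Cat"
    direction: str


Cat = Union[str, Slash]  # atomic categories are plain strings


def combine(left, right):
    """Return every category derivable from two adjacent categories."""
    out = []
    # Forward application: A/B followed by B gives A
    if isinstance(left, Slash) and left.direction == "/" and left.arg == right:
        out.append(left.result)
    # Backward application: B followed by B\A gives A
    if isinstance(right, Slash) and right.direction == "\\" and right.arg == left:
        out.append(right.result)
    return out


def parse(morphs, lexicon):
    """CYK-style chart parse; returns the categories spanning all morphs."""
    n = len(morphs)
    chart = {(i, i + 1): set(lexicon[m]) for i, m in enumerate(morphs)}
    for width in range(2, n + 1):
        for i in range(n - width + 1):
            k = i + width
            cell = set()
            for j in range(i + 1, k):
                for lcat in chart[(i, j)]:
                    for rcat in chart[(j, k)]:
                        cell.update(combine(lcat, rcat))
            chart[(i, k)] = cell
    return chart[(0, n)]


# Toy lexicon (assumed): a stem, a suffix, a verb form, and a punctuation
# mark are all entered in the same format, each with a list of categories.
LEXICON = {
    "hus": ["n"],                           # noun stem, "house"
    "et": [Slash("np", "n", "\\")],         # definite suffix: n\np
    "falder": [Slash("s", "np", "\\")],     # intransitive verb: np\s
    ".": [Slash("txt", "s", "\\")],         # full stop: s\txt
}

result = parse(["hus", "et", "falder", "."], LEXICON)
```

In this toy setting, the morph sequence for "huset falder." reduces to the single category `txt`, while a misordered sequence such as `["et", "hus"]` yields no category at all. The lexicon alone drives the analysis; the two combination rules never mention any particular word or word class.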
BibTeX
@inproceedings{euralex_2024_paper_27,
  address = {Cavtat},
  title = {Make Each Morph Count – A New Approach to Computational Lexicography for Text Processing},
  isbn = {978-953-7967-77-2},
  shorttitle = {Euralex 2024},
  url = {},
  language = {eng},
  booktitle = {Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress},
  publisher = {Institut za hrvatski jezik},
  author = {Henrichsen, Peter Juel},
  editor = {Despot, Kristina Š. and Ostroški Anić, Ana and Brač, Ivana},
  year = {2024},
  pages = {349--357}
}