Creating the Dataset of Croatian Verbal Idioms – Automatic Identification in a Corpus and Lexicographic Implementation

December 19, 2024
Pages 429-443
Authors Ivana Filipović Petrović, Kristina Kocijan
Title Creating the Dataset of Croatian Verbal Idioms – Automatic Identification in a Corpus and Lexicographic Implementation
Abstract This research proposes a step forward in the automatic identification and analysis of verbal idioms in Croatian. The use of the NooJ automated text processing tool, along with the MaCoCu corpus and the Online Dictionary of Croatian Idioms (ODCI), provides a robust framework for recognizing and categorizing these multi-word expressions (MWEs). The research comprises two parts: (a) creation of a dataset by utilizing the ODCI that allowed for a set of 898 verbal idioms to be compiled and annotated with linguistic features, including structure, morphological features, and variation patterns; (b) analysis of extracted data that provides insights into the lexicographical and linguistic significance of the idioms, such as variability, modification, and frequency of use. The study highlights the challenges posed by idiomatic variations and the verb’s role as the most variable component in idioms. For instance, the idiom soliti pamet komu (‘to give unsolicited advice’) is often modified for expressiveness, such as in the phrase “having a big saltshaker to salt everyone’s mind.” The dataset aims for lexicographic integration into ODCI and supports the creation of electronic language resources. It also contributes to theoretical and cross-lingual research, with the CLARIN repository expected to enhance data reusability in NLP. The study’s findings offer a deeper understanding of verbal idioms’ dynamics and their computational processing.
Session Talk
Keywords verbal idioms; automatic identification of multi-word expressions; linguistic features of idioms; NooJ; lexicographic analysis
BibTex
@inproceedings{euralex_2024_paper_34,
address = {Cavtat},
title = {Creating the Dataset of Croatian Verbal Idioms – Automatic Identification in a Corpus and Lexicographic Implementation},
isbn = {978-953-7967-77-2},
shorttitle = {Euralex 2024},
url = {},
language = {eng},
booktitle = {Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress},
publisher = {Institut za hrvatski jezik},
author = {Filipović Petrović, Ivana and Kocijan, Kristina},
editor = {Despot, Kristina Š. and Ostroški Anić, Ana and Brač, Ivana},
year = {2024},
pages = {429--443}
}