Abstract |
The paper presents preliminary results of the automatic extraction of candidates for dictionary definitions from unstructured texts in the Serbian language, with the aim of accelerating dictionary development. Definitions in the Serbian Academy of Sciences and Arts (SASA) dictionary were used to model different definition types (descriptive, grammatical, reference-based and synonym-based), each with different syntactic and lexical features. The research corpus consists of 61,213 definitions of nouns, which were analysed using Serbian morphological e-dictionaries and local grammars implemented as finite-state transducers in the open-source corpus processing suite Unitex. The 21 models developed so far cover 57% of dictionary definitions, 83% of which were fully recognized. The analysis has shown that many definitions have a structure that can be modelled, as evidenced by the statistics of definitions grouped by type. These models were used to retrieve noun definitions from a 1.4-million-word corpus containing 25 primary and secondary school textbooks covering various domains. The obtained results were thoroughly analysed, and guidelines were offered for their improvement. |
BibTeX |
@inproceedings{ELX2020_2021-072, address = {Alexandroupolis}, title = {Towards {Automatic} {Definition} {Extraction} for {Serbian}}, isbn = {978-618-85138-2-2}, url = {https://www.euralex.org/elx_proceedings/Euralex2020-2021/EURALEX2020-2021_Vol2-p695-703.pdf}, language = {en}, booktitle = {Lexicography for {Inclusion}: {Proceedings} of the 19th {EURALEX} {International} {Congress}, 7-9 {September} 2021, {Alexandroupolis}, {Vol}. 2}, publisher = {Democritus University of Thrace}, author = {Stanković, Ranka and Krstev, Cvetana and Stijović, Rada and Gočanin, Mirjana and Škorić, Mihailo}, editor = {Mitits, Lydia and Kiosses, Spyros}, year = {2021}, pages = {695--703},} |