Towards a multilingual dictionary of discourse markers. Automatic extraction of units from parallel corpus

September 7, 2022
Pages 262-272
Author Irene Renau, Rogelio Nazar
Title Towards a multilingual dictionary of discourse markers. Automatic extraction of units from parallel corpus
Abstract This paper presents a multilingual dictionary project of discourse markers. During its first stage, consisting of collecting the list of headwords, we used a parallel corpus to automatically extract units from texts written in Spanish, Catalan, English, French and German. We also applied a method to create a taxonomy structure for automatically organising the markers in clusters. As a result, we obtain an extensive, corpus-driven list of headwords. We present a prototype of the microstructure of the dictionary in the form of a standard XML database and describe the procedure to automatically fill in most of its fields (e.g., the type of DM, the equivalents in other languages, etc.), before human intervention.
Session Talk
Keywords Computational lexicography, corpus-driven lexicography, discourse markers, multilingual lexicography
BibTeX
@inproceedings{euralex_mannheim_towards_2022,
address = {Mannheim},
title = {Towards a {Multilingual} {Dictionary} of {Discourse} {Markers}. {Automatic} {Extraction} of {Units} from {Parallel} {Corpus}},
isbn = {978-3-937241-87-6},
shorttitle = {Euralex (2022)},
url = {},
language = {eng},
booktitle = {Dictionaries and {Society}. {Proceedings} of the {XX} {EURALEX} {International} {Congress}},
publisher = {IDS-Verlag},
author = {Renau, Irene and Nazar, Rogelio},
editor = {Klosa-Kückelhaus, Annette and Engelberg, Stefan and Möhrs, Christine and Storjohann, Petra},
year = {2022},
pages = {262--272},
}