Abstract |
This software demonstration presents a data model and a first use case for the representation of text corpus data on a Wikibase instance, including morphosyntactic, semantic and philological annotations as well as links to dictionary entries. Wikibase, an extension of MediaWiki, is the software that underlies Wikidata (Vrandečić & Krötzsch, 2014), an exceptionally large crowdsourced queryable knowledge graph, which includes nodes for ontological concepts, on the one hand, and for lexemes, lexeme senses and lexeme forms, on the other, together with annotations to and relations between them. We argue that the proposed model and the chosen software solutions for the representation of corpus and dictionary data, all free and open source, meet with the requirements of provenance transparency, open access and re-use, and the capability of collaborative work on the data. We also present our own scripts wrapped in a web application that shortcut several workflow steps in a first use case, a 1737 Basque manuscript, transcribed on Wikisource, and represented as an annotated dataset on our Wikibase instance. |
BibTex |
@inproceedings{euralex_2024_paper_64, address = {Cavtat}, title = {Linking Historical Corpus Data and Annotations Using Wikibase},isbn = {978-953-7967-77-2}, shorttitle = {Euralex 2024}, url = {}, language = {eng}, booktitle = {Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress}, publisher = {Institut za hrvatski jezik}, author = {Lindemann, David and Alonso, Mikel}, editor = {Despot, Kristina Š. and Ostroški Anić, Ana and Brač, Ivana}, year = {2024}, pages = {785-791} } |