Linking Historical Corpus Data and Annotations Using Wikibase

December 19, 2024
Pages 785-791
Author David Lindemann, Mikel Alonso
Title Linking Historical Corpus Data and Annotations Using Wikibase
Abstract This software demonstration presents a data model and a first use case for the representation of text corpus data on a Wikibase instance, including morphosyntactic, semantic and philological annotations as well as links to dictionary entries. Wikibase, an extension of MediaWiki, is the software that underlies Wikidata (Vrandečić & Krötzsch, 2014), an exceptionally large crowdsourced queryable knowledge graph, which includes nodes for ontological concepts, on the one hand, and for lexemes, lexeme senses and lexeme forms, on the other, together with annotations to and relations between them. We argue that the proposed model and the chosen software solutions for the representation of corpus and dictionary data, all free and open source, meet the requirements of provenance transparency, open access and re-use, and the capability of collaborative work on the data. We also present our own scripts, wrapped in a web application, that shortcut several workflow steps in a first use case: a 1737 Basque manuscript, transcribed on Wikisource and represented as an annotated dataset on our Wikibase instance.
Session Software Demonstration
Keywords Basque; historical corpus; Wikibase; corpus annotations; Linked Data
BibTex
@inproceedings{euralex_2024_paper_64,
address = {Cavtat},
title = {Linking Historical Corpus Data and Annotations Using Wikibase},
isbn = {978-953-7967-77-2},
shorttitle = {Euralex 2024},
url = {},
language = {eng},
booktitle = {Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress},
publisher = {Institut za hrvatski jezik},
author = {Lindemann, David and Alonso, Mikel},
editor = {Despot, Kristina Š. and Ostroški Anić, Ana and Brač, Ivana},
year = {2024},
pages = {785-791}
}