Using Neural Machine Translation for Normalising Historical Documents

December 19, 2024
Pages 827-839
Authors Leonardo Zilio, Besim Kabashi
Title Using Neural Machine Translation for Normalising Historical Documents
Abstract Working with historical documents presents many challenges, not only because some sources are poorly preserved, but also because the grammar and spelling rules of earlier periods were not always consistent. Still, these texts remain a rich source of information about our history, and we could benefit greatly from the information that can be extracted from them. At the same time, the lack of spelling and grammatical consistency poses a problem for the application of computational tools, so most of the analysis work is done manually. To overcome this lack of consistency, researchers have started normalising the spelling of historical documents, as this improves the performance of modern tools. Spelling normalisation is, however, also carried out manually most of the time. In this paper, we present experiments on automatically normalising historical documents in two languages: Portuguese and Albanian. Leveraging state-of-the-art large language models that were pre-trained for translation, we used carefully curated and manually normalised corpora to train new computational models. These models can automatically normalise documents in these languages, achieving new state-of-the-art BLEU scores above 90 for Portuguese and up to 59 for Albanian, beating the task baselines.
Session Talk
Keywords historical linguistics; Albanian; Portuguese; spelling normalisation; computational linguistics; large language models; neural machine translation
BibTeX
@inproceedings{euralex_2024_paper_68,
address = {Cavtat},
title = {Using Neural Machine Translation for Normalising Historical Documents},
isbn = {978-953-7967-77-2},
shorttitle = {Euralex 2024},
url = {},
language = {eng},
booktitle = {Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress},
publisher = {Institut za hrvatski jezik},
author = {Zilio, Leonardo and Kabashi, Besim},
editor = {Despot, Kristina Š. and Ostroški Anić, Ana and Brač, Ivana},
year = {2024},
pages = {827--839}
}