Abstract
Working with historical documents presents many challenges, not only because some sources are not well preserved, but also because grammar and spelling conventions in older periods were not always consistent. Still, these texts remain a rich source of historical information, and we could greatly benefit from the information that can be extracted from them. At the same time, the lack of spelling and grammatical consistency poses a problem for the application of computational tools, so most of the analysis work is done manually. To overcome this lack of consistency, researchers have started normalising the spelling of historical documents, as this increases the performance of modern tools. Spelling normalisation is, however, also carried out manually most of the time. In this paper, we present experiments on automatically normalising historical documents in two languages: Portuguese and Albanian. Leveraging state-of-the-art large language models pre-trained for translation, we trained new computational models on carefully curated, manually normalised corpora. These models can automatically normalise documents in both languages, achieving new state-of-the-art BLEU scores above 90 for Portuguese and up to 59 for Albanian, beating the task baselines.
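The core idea is to treat spelling normalisation as a translation task from historical to modern spelling and to evaluate it with MT metrics such as BLEU. The following is a minimal sketch of such a pipeline, assuming a Hugging Face seq2seq checkpoint; the model name and the historical/modern sentence pair are illustrative placeholders, not the authors' actual setup or data:

```python
# Sketch: spelling normalisation framed as sequence-to-sequence
# "translation", evaluated with BLEU as in machine translation.
# The checkpoint name below is a hypothetical placeholder; in
# practice it would be a translation model fine-tuned on pairs of
# historical and manually normalised sentences.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
import sacrebleu

model_name = "path/to/fine-tuned-normalisation-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def normalise(historical_sentence: str) -> str:
    """Generate a modern-spelling version of one historical sentence."""
    inputs = tokenizer(historical_sentence, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

# Evaluation mirrors MT practice: corpus BLEU between the system
# output and the manually normalised reference (invented example pair).
hypotheses = [normalise("hũa carta que escreui")]
references = [["uma carta que escrevi"]]
print(sacrebleu.corpus_bleu(hypotheses, references).score)
```

Framing the task this way lets off-the-shelf MT tooling (tokenisers, seq2seq training loops, BLEU scorers) be reused unchanged for normalisation.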
BibTeX
@inproceedings{euralex_2024_paper_68,
  address    = {Cavtat},
  title      = {Using Neural Machine Translation for Normalising Historical Documents},
  isbn       = {978-953-7967-77-2},
  shorttitle = {Euralex 2024},
  url        = {},
  language   = {eng},
  booktitle  = {Lexicography and Semantics. Proceedings of the XXI EURALEX International Congress},
  publisher  = {Institut za hrvatski jezik},
  author     = {Zilio, Leonardo and Kabashi, Besim},
  editor     = {Despot, Kristina Š. and Ostroški Anić, Ana and Brač, Ivana},
  year       = {2024},
  pages      = {827--839}
}