SCyDia – OCR for Serbian Cyrillic with Diacritics

September 7, 2022
Page 387-400
Author Velibor Ilić, Lenka Bajčetić, Snežana Petrović, Ana Španović
Title SCyDia – OCR for Serbian Cyrillic with Diacritics
Abstract In the ongoing retro-digitization of Serbian dialectal dictionaries, the biggest obstacle is the lack of machine-readable versions of the paper editions. One essential step, lacking until now, is therefore needed before venturing into dictionary-making in the digital environment: OCRing the pages with the highest possible accuracy. OCR is not a new technology, and many open-source and commercial software solutions can reliably convert scanned images of paper documents into digital documents. Available solutions are usually efficient enough to process scanned contracts, invoices, financial statements, newspapers, and books. Such solutions fall short, however, when documents contain accented text and each character must be extracted precisely together with its diacritics. This paper presents the OCR software “SCyDia”, developed to overcome this issue. We describe the organizational structure of “SCyDia” and present the first results. “SCyDia” is a web-based software solution that relies on the open-source software “Tesseract” in the background, and it also contains a module for semi-automatic text correction. We have already processed over 15,000 pages, 13 dialectal dictionaries, and five dialectal monographs. At this point in the project, we have analyzed the accuracy of “SCyDia” on the 13 dialectal dictionaries. The results were analyzed manually by an expert who examined a number of randomly selected pages from each dictionary. The preliminary results show great promise, with accuracy ranging from 97.19% to 99.87%.
Session Poster
Keywords OCR, Cyrillic, Serbian language, retro-digitization, convolutional neural networks
BibTeX
@inproceedings{euralex_mannheim_scydia_2022,
address = {Mannheim},
title = {{SCyDia} - {OCR} for {Serbian} {Cyrillic} with {Diacritics}},
isbn = {978-3-937241-87-6},
shorttitle = {Euralex (2022)},
url = {},
language = {eng},
booktitle = {Dictionaries and {Society}. {Proceedings} of the {XX} {EURALEX} {International} {Congress}},
publisher = {IDS-Verlag},
author = {Ilić, Velibor and Bajčetić, Lenka and Petrović, Snežana and Španović, Ana},
editor = {Klosa-Kückelhaus, Annette and Engelberg, Stefan and Möhrs, Christine and Storjohann, Petra},
year = {2022},
pages = {387--400},
}