When Tesseract Meets PERO: Open-Source Optical Character Recognition of Medieval Texts
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F22%3A00127481" target="_blank" >RIV/00216224:14330/22:00127481 - isvavai.cz</a>
Result on the web
<a href="https://raslan2022.nlp-consulting.net/" target="_blank" >https://raslan2022.nlp-consulting.net/</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
When Tesseract Meets PERO: Open-Source Optical Character Recognition of Medieval Texts
Original language description
Conversion of scanned images to text, known as optical character recognition (OCR), is widely considered a solved problem for contemporary printed texts. However, OCR of early printed books and reprints of medieval texts remains an open challenge. In our previous work, we developed an end-to-end image-to-text OCR pipeline for medieval texts, named AHISTO OCR, and released it together with our test dataset under open licenses. However, the published system relied on the closed-source Google Vision AI service as one component, which made the experiments less reproducible. In this work, we replace Google Vision AI with an open-source OCR algorithm named PERO and show that this not only makes the AHISTO OCR pipeline open, but also improves the performance of the system. We again release the updated AHISTO OCR system and its test results under open licenses.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences
Result continuities
Project
<a href="/en/project/LM2018101" target="_blank" >LM2018101: Digital Research Infrastructure for the Language Technologies, Arts and Humanities</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities
Others
Publication year
2022
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Sixteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2022
ISBN
9788026317524
ISSN
2336-4289
e-ISSN
—
Number of pages
4
Pages from-to
157-160
Publisher name
Tribun EU
Place of publication
Brno
Event location
Brno
Event date
Jan 1, 2022
Type of event by nationality
CST - National event
UT code for WoS article
—