When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F21%3A00119901" target="_blank" >RIV/00216224:14330/21:00119901 - isvavai.cz</a>
Result on the web
<a href="https://nlp.fi.muni.cz/raslan/raslan21.pdf#page=37" target="_blank" >https://nlp.fi.muni.cz/raslan/raslan21.pdf#page=37</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in original language
When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts
Result description in original language
The aim of the AHISTO project is to make documents from the Hussite era (1419–1436) available to the general public through a web-hosted searchable database. Although scanned images of letterpress reprints from the 19th and 20th centuries are available, accurate optical character recognition (OCR) algorithms are required to extract searchable text from them. In our previous article [15], we showed that Tesseract 4 was the second fastest and the most accurate of five OCR algorithms. In this article, we investigate the impact of six preprocessing techniques on the accuracy of Tesseract 4. Additionally, we compare Tesseract 4 with three other OCR algorithms on the language identification task. Furthermore, we publish an open dataset [16] of scanned images and OCR texts with human annotations for layout analysis, OCR evaluation, and language identification. In Section 2, we describe the related work in OCR preprocessing. In Section 3, we describe our three preprocessing techniques and our two evaluation tasks. In Section 4, we discuss the results of our evaluation. In Section 5, we offer concluding remarks and ideas for future work in the OCR of medieval texts.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10200 - Computer and information sciences
Result continuities
Project
The result was created during the implementation of more than one project. More information is available in the Projects tab.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2021
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Article name in the proceedings
Recent Advances in Slavonic Natural Language Processing (RASLAN 2021)
ISBN
9788026316701
ISSN
2336-4289
e-ISSN
—
Number of pages
11
Pages from-to
29-39
Publisher name
Tribun EU
Place of publication
Brno
Event location
Brno
Event date
1. 1. 2021
Type of event by nationality
EUR - European event
UT code of the WoS article
—