When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts

Identifikátory výsledku

Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F20%3A00117104" target="_blank" >RIV/00216224:14330/20:00117104 - isvavai.cz</a>
Výsledek na webu
<a href="https://nlp.fi.muni.cz/raslan/raslan20.pdf#page=11" target="_blank" >https://nlp.fi.muni.cz/raslan/raslan20.pdf#page=11</a>
DOI - Digital Object Identifier
—

Alternativní jazyky

Jazyk výsledku
angličtina
Název v původním jazyce
When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts
Popis výsledku v původním jazyce
Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge. In our work, we present a dataset of 19th and 20th century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of speed and accuracy on six existing OCR algorithms. We conclude that the Tesseract family of OCR algoritms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.
Název v anglickém jazyce
When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts
Popis výsledku anglicky
Optical character recognition of scanned images for contemporary printed texts is widely considered a solved problem. However, the optical character recognition of early printed books and reprints of Medieval texts remains an open challenge. In our work, we present a dataset of 19th and 20th century letterpress reprints of documents from the Hussite era (1419–1436) and perform a quantitative and qualitative evaluation of speed and accuracy on six existing OCR algorithms. We conclude that the Tesseract family of OCR algoritms is the fastest and the most accurate on our dataset, and we suggest improvements to our dataset.

Klasifikace

Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

Projekt
—
Návaznosti
S - Specificky vyzkum na vysokych skolach

Ostatní

Rok uplatnění
2020
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

Název statě ve sborníku
Proceedings of the Fourteenth Workshop on Recent Advances in Slavonic Natural Language Processing, RASLAN 2020
ISBN
9788026316008
ISSN
2336-4289
e-ISSN
—
Počet stran výsledku
10
Strana od-do
3-12
Název nakladatele
Tribun EU
Místo vydání
Brno
Místo konání akce
online
Datum konání akce
8. 12. 2020
Typ akce podle státní příslušnosti
EUR - Evropská akce
Kód UT WoS článku
000655471300001

Podobné výsledky(10)

When Tesseract Meets PERO : Open-Source Optical Character Recognition of Medieval Texts When Tesseract Brings Friends: Layout Analysis, Language Identification, and Super-Resolution in the Optical Character Recognition of Medieval Texts Application of Super-Resolution Models in Optical Character Recognition of Czech Medieval Texts

Co hledáte?

Rychlé hledání

Chytré vyhledávání

When Tesseract Does It Alone: Optical Character Recognition of Medieval Texts

Identifikátory výsledku

Alternativní jazyky

Klasifikace

Návaznosti výsledku

Ostatní

Údaje specifické pro druh výsledku

Podobné výsledky(10)

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Popis výsledku

Identifikátory výsledku

Identifikátory výsledku

Alternativní jazyky

Alternativní jazyky

Klasifikace

Klasifikace

Návaznosti výsledku

Návaznosti výsledku

Ostatní

Ostatní

Údaje specifické pro druh výsledku

Údaje specifické pro druh výsledku

Podobné výsledky(10)