OCR Improvements for Images of Multi-page Historical Documents

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F21%3A43962457" target="_blank" >RIV/49777513:23520/21:43962457 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/chapter/10.1007%2F978-3-030-87802-3_21" target="_blank" >https://link.springer.com/chapter/10.1007%2F978-3-030-87802-3_21</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-030-87802-3_21" target="_blank" >10.1007/978-3-030-87802-3_21</a>

Alternative languages

Result language
English
Title in the original language
OCR Improvements for Images of Multi-page Historical Documents
Result description in the original language
This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline handles images of varying quality, whether obtained by a digital scanner or a camera. An image can contain multiple pages in any layout, but an approximately upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system, which has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56%, an absolute gain of 19.19% over the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
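
The description reports its result as word recall but does not define the metric in this record. The sketch below is one common bag-of-words reading, given only as an assumption: the share of ground-truth words that also occur in the OCR output, with repeated words counted at most as often as they appear in the output. The function names word_recall and baseline_from_gain are hypothetical and are not taken from the paper. The second helper only restates the arithmetic in the description: a recall of 60.56% after an absolute gain of 19.19% implies a baseline of about 41.37%.

```python
from collections import Counter


def word_recall(reference: str, hypothesis: str) -> float:
    """Share of reference words that also appear in the OCR output.

    Assumption: case-insensitive, whitespace-tokenised, bag-of-words
    matching. The paper may define the metric differently.
    """
    ref_counts = Counter(reference.lower().split())
    hyp_counts = Counter(hypothesis.lower().split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # A reference word counts as recalled at most as many times
    # as it occurs in the OCR output.
    matched = sum(min(n, hyp_counts[w]) for w, n in ref_counts.items())
    return matched / total


def baseline_from_gain(final_pct: float, absolute_gain_pct: float) -> float:
    """Baseline value implied by a final value and an absolute gain."""
    return round(final_pct - absolute_gain_pct, 2)


if __name__ == "__main__":
    truth = "the quick brown fox"
    ocr_output = "the quick brovvn fox"  # one word misread by OCR
    print(word_recall(truth, ocr_output))
    print(baseline_from_gain(60.56, 19.19))
```

Under this reading, a single misread character makes the whole word count as missed, which is why word recall is a stricter measure than character-level accuracy for noisy historical scans.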

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
20205 - Automation and control systems

Result linkages

Project
<a href="/cs/project/DG20P02OVV018" target="_blank" >DG20P02OVV018: Digitální archiv dokumentů NKVD/KGB vztahujících se k Československu (Digital archive of NKVD/KGB documents relating to Czechoslovakia)</a>
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of publication
2021
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the article in the proceedings
23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings
ISBN
978-3-030-87801-6
ISSN
0302-9743
e-ISSN
1611-3349
Number of pages
12
Pages from-to
226-237
Publisher name
Springer
Place of publication
Cham
Event location
St. Petersburg, Russia
Event date
27 September 2021
Event type by nationality
WRD - Worldwide event
UT WoS article code
—
