OCR Improvements for Images of Multi-page Historical Documents
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F21%3A43962457" target="_blank" >RIV/49777513:23520/21:43962457 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/chapter/10.1007%2F978-3-030-87802-3_21" target="_blank" >https://link.springer.com/chapter/10.1007%2F978-3-030-87802-3_21</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-030-87802-3_21" target="_blank" >10.1007/978-3-030-87802-3_21</a>
Alternative languages
Result language
English
Original language name
OCR Improvements for Images of Multi-page Historical Documents
Original language description
This work presents a pipeline for processing digitally scanned documents, reading their textual content, and storing it in a dataset for the purpose of information retrieval. The pipeline is able to handle images of various quality, whether they were obtained by a digital scanner or camera. The image can contain multiple pages in any layout, but an approximate upright orientation is assumed. The pipeline uses Faster R-CNN to detect individual pages. These are then processed by a deskew algorithm to correct the orientation, and finally read by the Tesseract OCR system that has been retrained on a large set of synthetic images and a small set of annotated real-world documents. By applying the pipeline, we were able to increase the word recall to 60.56%, which is an absolute gain of 19.19% over the baseline solution that uses only Tesseract OCR. A demo of the proposed pipeline can be found at https://archivkgb.zcu.cz/.
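The word recall figures in the description can be illustrated with a minimal sketch. The description does not define the metric precisely, so the function below assumes one common reading: the share of ground-truth words that also appear in the OCR output, with whitespace tokenization and multiset matching. The name `word_recall` and the tokenization are assumptions made for illustration, not the authors' evaluation code. The last line derives the baseline recall implied by the reported numbers (60.56% after an absolute gain of 19.19%).

```python
from collections import Counter


def word_recall(reference: str, hypothesis: str) -> float:
    """Fraction of reference words that also appear in the OCR output.

    Assumption: words are whitespace-separated tokens compared
    case-sensitively, and each OCR word can match at most one
    reference word (multiset matching).
    """
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    total = sum(ref_counts.values())
    if total == 0:
        return 0.0
    # Count each reference word as recalled at most as often as the OCR produced it.
    matched = sum(min(count, hyp_counts[word]) for word, count in ref_counts.items())
    return matched / total


# Baseline recall implied by the reported figures (values in percent).
baseline_recall = round(60.56 - 19.19, 2)
```

For example, with the reference text `"a b c d"` and the OCR output `"a b x d"`, three of the four reference words are recovered, so `word_recall` returns 0.75. Under the stated figures, the Tesseract-only baseline corresponds to a word recall of 41.37%.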
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
20205 - Automation and control systems
Result continuities
Project
<a href="/en/project/DG20P02OVV018" target="_blank" >DG20P02OVV018: Digital archive of the NKVD/KGB files related to Czechoslovakia</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2021
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
23rd International Conference, SPECOM 2021, St. Petersburg, Russia, September 27–30, 2021, Proceedings
ISBN
978-3-030-87801-6
ISSN
0302-9743
e-ISSN
1611-3349
Number of pages
12
Pages from-to
226-237
Publisher name
Springer
Place of publication
Cham
Event location
St. Petersburg, Russia
Event date
Sep 27, 2021
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—