Building an efficient OCR system for historical documents with little training data
The result's identifiers
Result code in IS VaVaI
RIV/49777513:23520/20:43958971 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F20%3A43958971)
Result on the web
https://link.springer.com/content/pdf/10.1007/s00521-020-04910-x.pdf
DOI - Digital Object Identifier
10.1007/s00521-020-04910-x (http://dx.doi.org/10.1007/s00521-020-04910-x)
Alternative languages
Result language
English
Original language name
Building an efficient OCR system for historical documents with little training data
Original language description
As the number of digitized historical documents has increased rapidly, it is necessary to provide efficient methods of information retrieval and knowledge extraction to make the data accessible. Such methods depend on optical character recognition (OCR), which converts document images into textual representations. This paper introduces a set of methods that allow OCR to be performed on historical document images using only a small amount of real, manually annotated training data. The presented OCR system comprises two main tasks: page layout analysis, including text block and line segmentation, and OCR. Our segmentation methods are based on fully convolutional networks, and the OCR approach utilizes recurrent neural networks. We show that both the segmentation and OCR tasks are feasible with only a few annotated real data samples. The experiments aim to determine the best way to achieve good performance with the given small set of data. We also demonstrate that the obtained scores are comparable to or even better than the scores of several state-of-the-art systems.
Czech name
—
Czech description
—
Classification
Type
Jimp - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
O - Operational programme project
Others
Publication year
2020
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Neural Computing and Applications
ISSN
0941-0643
e-ISSN
—
Volume of the periodical
32
Issue of the periodical within the volume
23
Country of publishing house
GB - UNITED KINGDOM
Number of pages
19
Pages from-to
17209-17227
UT code for WoS article
000531222300001
EID of the result in the Scopus database
2-s2.0-85084519412