Semantic text segmentation from synthetic images of full-text documents

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F19%3A43958267" target="_blank" >RIV/49777513:23520/19:43958267 - isvavai.cz</a>
Result on the web
<a href="http://proceedings.spiiras.nw.ru/index.php/sp/article/view/4527/2627" target="_blank" >http://proceedings.spiiras.nw.ru/index.php/sp/article/view/4527/2627</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.15622/sp.2019.18.6.1381-1406" target="_blank" >10.15622/sp.2019.18.6.1381-1406</a>

Alternative languages

Result language
English
Original language name
Semantic text segmentation from synthetic images of full-text documents
Original language description
An algorithm for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular: individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described; it uses a novel approach based on a Variational AutoEncoder (VAE) to train a generative model. The trained model makes it possible to generate, on the fly, background images similar to the training ones. The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable, natural-looking images of aged documents. A few types of page layout are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare real-world images with generated images. The recognition rate is very similar, indicating that the synthetic images have a realistic appearance. Moreover, the errors made by the OCR system in both cases are very similar. Using the generated images, a fully convolutional encoder-decoder neural network architecture for semantic segmentation of individual characters was trained. With this architecture, a recognition accuracy of 99.28% on a test set of synthetic documents is reached.
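The text-printing step described above (characters placed with positional and brightness noise, plus a per-pixel annotation) can be illustrated with a minimal sketch. It is not the authors' implementation: the tiny glyph bitmaps, the function name `synthesize_line`, and all parameter names and defaults are hypothetical, a flat grey grid stands in for a VAE-generated paper background, and no real font is rasterised.

```python
import random

# Hypothetical 3x3 glyph bitmaps (row, col offsets of inked pixels).
# A real generator would rasterise a font instead.
GLYPHS = {
    "I": [(0, 1), (1, 1), (2, 1)],
    "L": [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)],
}
# Class 0 is reserved for the background; characters get ids 1..N.
CLASS_ID = {ch: k + 1 for k, ch in enumerate(sorted(GLYPHS))}


def synthesize_line(text, background, cell=5, max_shift=1,
                    ink=0.15, ink_noise=0.1, seed=0):
    """Print `text` on a copy of `background` (rows of floats in [0, 1]).

    Returns (image, labels, boxes): the noisy image, a per-pixel class
    mask, and a structured list describing each printed character.
    """
    rng = random.Random(seed)
    image = [row[:] for row in background]           # leave the input intact
    labels = [[0] * len(row) for row in background]
    boxes = []
    for i, ch in enumerate(text):
        # Positional noise: jitter the glyph inside its 5x5 cell.
        top = 1 + rng.randint(-max_shift, max_shift)
        left = i * cell + 1 + rng.randint(-max_shift, max_shift)
        # Brightness noise: each character gets its own ink level.
        level = ink + rng.uniform(-ink_noise, ink_noise)
        level = min(1.0, max(0.0, level))
        for dy, dx in GLYPHS[ch]:
            image[top + dy][left + dx] = level
            labels[top + dy][left + dx] = CLASS_ID[ch]
        boxes.append({"char": ch, "top": top, "left": left, "ink": level})
    return image, labels, boxes


# Usage: a uniform light-grey "paper" of 5 rows by 10 columns.
paper = [[0.9] * 10 for _ in range(5)]
image, labels, boxes = synthesize_line("IL", paper)
```

The `labels` mask is the kind of target a semantic segmentation network would be trained against, while `boxes` plays the role of the structured annotation.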
Czech name
—
Czech description
—

Classification

Type
J<sub>SC</sub> - Article in a specialist periodical, which is included in the SCOPUS database
CEP classification
—
OECD FORD branch
20205 - Automation and control systems

Result continuities

Project
The result was created during the realization of more than one project.
Continuities
P - Research and development project financed from public funds (with a link to CEP)<br>S - Specific research at universities<br>I - Institutional support for the long-term conceptual development of a research organisation

Others

Publication year
2019
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

Name of the periodical
SPIIRAS Proceedings
ISSN
2078-9181
e-ISSN
—
Volume of the periodical
18
Issue of the periodical within the volume
6
Country of publishing house
RU - RUSSIAN FEDERATION
Number of pages
26
Pages from-to
1380-1405
UT code for WoS article
—
EID of the result in the Scopus database
2-s2.0-85078454715

Similar results (10)

Generation of Synthetic Images of Full-Text Documents
Hybrid Training Data for Historical Text OCR
An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
