Semantic text segmentation from synthetic images of full-text documents
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F19%3A43958267" target="_blank" >RIV/49777513:23520/19:43958267 - isvavai.cz</a>
Result on the web
<a href="http://proceedings.spiiras.nw.ru/index.php/sp/article/view/4527/2627" target="_blank" >http://proceedings.spiiras.nw.ru/index.php/sp/article/view/4527/2627</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.15622/sp.2019.18.6.1381-1406" target="_blank" >10.15622/sp.2019.18.6.1381-1406</a>
Alternative languages
Result language
English
Title in the original language
Semantic text segmentation from synthetic images of full-text documents
Result description in the original language
An algorithm (divided into multiple modules) for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular; individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach based on a Variational AutoEncoder (VAE) was used to train a generative model. These backgrounds enable on-the-fly generation of background images similar to the training ones. The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results (natural-looking aged documents). Several types of page layout are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare the real-world images to the generated ones. The recognition rates are very similar, indicating the proper appearance of the synthetic images. Moreover, the errors made by the OCR system in both cases are very similar. A fully convolutional encoder-decoder neural network architecture for semantic segmentation of individual characters was trained on the generated images. With this architecture, a recognition accuracy of 99.28% is reached on a test set of synthetic documents.
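To make the VAE-based background generation concrete, below is a minimal, hypothetical sketch in PyTorch, not the authors' implementation. It assumes 64x64 grayscale background patches cropped from digitized documents and shows how a convolutional VAE can be trained so that new, similar backgrounds can later be sampled from the latent prior on the fly.

# Sketch only: a convolutional VAE for paper-background patches (assumed 1x64x64).
import torch
import torch.nn as nn
import torch.nn.functional as F

class BackgroundVAE(nn.Module):
    def __init__(self, latent_dim=32):
        super().__init__()
        # Encoder: 1x64x64 patch -> flattened feature vector
        self.enc = nn.Sequential(
            nn.Conv2d(1, 32, 4, stride=2, padding=1), nn.ReLU(),   # 32x32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),  # 16x16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(), # 8x8
            nn.Flatten(),
        )
        self.fc_mu = nn.Linear(128 * 8 * 8, latent_dim)
        self.fc_logvar = nn.Linear(128 * 8 * 8, latent_dim)
        # Decoder: latent vector -> 1x64x64 background patch
        self.fc_dec = nn.Linear(latent_dim, 128 * 8 * 8)
        self.dec = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 1, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.fc_mu(h), self.fc_logvar(h)
        # Reparameterization trick: z = mu + sigma * eps
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        recon = self.dec(self.fc_dec(z).view(-1, 128, 8, 8))
        return recon, mu, logvar

def vae_loss(recon, x, mu, logvar):
    # Reconstruction term + KL divergence to the standard normal prior
    rec = F.binary_cross_entropy(recon, x, reduction="sum")
    kld = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + kld

# After training, new backgrounds are sampled on the fly from the prior, e.g.:
# z = torch.randn(1, 32); patch = model.dec(model.fc_dec(z).view(-1, 128, 8, 8))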
Title in English
Semantic text segmentation from synthetic images of full-text documents
Result description in English
An algorithm (divided into multiple modules) for generating images of full-text documents is presented. These images can be used to train, test, and evaluate models for Optical Character Recognition (OCR). The algorithm is modular; individual parts can be changed and tweaked to generate the desired images. A method for obtaining background images of paper from already digitized documents is described. For this, a novel approach based on a Variational AutoEncoder (VAE) was used to train a generative model. These backgrounds enable on-the-fly generation of background images similar to the training ones. The module for printing the text uses large text corpora, a font, and suitable positional and brightness character noise to obtain believable results (natural-looking aged documents). Several types of page layout are supported. The system generates a detailed, structured annotation of the synthesized image. Tesseract OCR is used to compare the real-world images to the generated ones. The recognition rates are very similar, indicating the proper appearance of the synthetic images. Moreover, the errors made by the OCR system in both cases are very similar. A fully convolutional encoder-decoder neural network architecture for semantic segmentation of individual characters was trained on the generated images. With this architecture, a recognition accuracy of 99.28% is reached on a test set of synthetic documents.
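As an illustration of the segmentation stage, the following is a minimal sketch of a fully convolutional encoder-decoder that assigns a character class to every pixel. The layer sizes, the number of classes, and the tensors are assumptions chosen for shape illustration only, not the architecture or data reported in the paper.

# Sketch only: fully convolutional encoder-decoder for per-pixel character labels.
import torch
import torch.nn as nn

class CharSegNet(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Encoder: progressively downsample the grayscale page image
        self.encoder = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 1/2 resolution
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),                                  # 1/4 resolution
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        # Decoder: upsample back to full resolution and predict per-pixel logits
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 2, stride=2), nn.ReLU(),
            nn.Conv2d(32, num_classes, 1),                    # class scores per pixel
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Training against the generator's structured annotation rendered as a
# per-pixel label map (hypothetical tensors, shown for shape only):
model = CharSegNet(num_classes=80)
image = torch.rand(4, 1, 256, 256)            # batch of synthetic page crops
labels = torch.randint(0, 80, (4, 256, 256))  # per-pixel character labels
loss = nn.CrossEntropyLoss()(model(image), labels)
loss.backward()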
Classification
Type
J<sub>SC</sub> - Article in a periodical indexed in the SCOPUS database
CEP field
—
OECD FORD field
20205 - Automation and control systems
Result continuities
Project
The result was created during the implementation of multiple projects. More information is available in the Projects tab.
Continuities
P - Research and development project financed from public sources (with a link to CEP)<br>S - Specific research at universities<br>I - Institutional support for the long-term conceptual development of a research organisation
Others
Year of implementation
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Periodical name
SPIIRAS Proceedings
ISSN
2078-9181
e-ISSN
—
Periodical volume
18
Issue of the periodical within the volume
6
Country of the periodical's publisher
RU - Russian Federation
Number of pages of the result
26
Pages from-to
1380-1405
UT WoS code of the article
—
EID of the result in the Scopus database
2-s2.0-85078454715