OCR error correction using correction patterns and self-organizing migrating algorithm
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F21%3A10247265" target="_blank" >RIV/61989100:27240/21:10247265 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/content/pdf/10.1007/s10044-020-00936-y.pdf" target="_blank" >https://link.springer.com/content/pdf/10.1007/s10044-020-00936-y.pdf</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10044-020-00936-y" target="_blank" >10.1007/s10044-020-00936-y</a>
Alternative languages
Result language
English
Title in original language
OCR error correction using correction patterns and self-organizing migrating algorithm
Result description in original language
Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques introduce various kinds of errors into OCR output. Post-processing, which detects and corrects these errors, is an essential step in improving the output quality of OCR systems. In this paper, we present an automatic model for OCR post-processing that consists of both an error detection phase and an error correction phase. We propose a novel approach to post-OCR error correction that uses correction pattern edits together with an evolutionary algorithm, a class of methods mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which covers all types of edit operations and is learned directly from the training dataset. With efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition. © 2020, Springer-Verlag London Ltd., part of Springer Nature.
Title in English
OCR error correction using correction patterns and self-organizing migrating algorithm
Result description in English
Optical character recognition (OCR) systems help to digitize paper-based historical archives. However, the poor quality of scanned documents and the limitations of text recognition techniques introduce various kinds of errors into OCR output. Post-processing, which detects and corrects these errors, is an essential step in improving the output quality of OCR systems. In this paper, we present an automatic model for OCR post-processing that consists of both an error detection phase and an error correction phase. We propose a novel approach to post-OCR error correction that uses correction pattern edits together with an evolutionary algorithm, a class of methods mainly used for solving optimization problems. Our model adopts a variant of the self-organizing migrating algorithm along with a fitness function based on modifications of important linguistic features. We illustrate how to construct the table of correction pattern edits, which covers all types of edit operations and is learned directly from the training dataset. With efficient settings of the algorithm parameters, our model achieves high-quality candidate generation and error correction. The experimental results show that our proposed approach outperforms various baseline approaches on the benchmark dataset of the ICDAR 2017 Post-OCR text correction competition. © 2020, Springer-Verlag London Ltd., part of Springer Nature.
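The abstract mentions a table of correction pattern edits learned directly from training data. The record gives no implementation details, so the following is only a minimal sketch of the general idea, not the authors' method: it assumes the training data is available as (OCR text, ground-truth text) pairs, uses Python's `difflib.SequenceMatcher` as a stand-in character aligner, and counts every non-matching span as a pattern. The function names `learn_correction_patterns` and `generate_candidates` and the sample pairs are hypothetical. The self-organizing migrating algorithm and the linguistic fitness function are not shown.

```python
from collections import Counter
from difflib import SequenceMatcher


def learn_correction_patterns(pairs):
    """Count character-level edits (OCR fragment -> ground-truth fragment).

    `pairs` is an iterable of (ocr_text, truth_text) strings. The alignment
    comes from difflib, which is an assumption made for this sketch.
    """
    patterns = Counter()
    for ocr, truth in pairs:
        matcher = SequenceMatcher(None, ocr, truth, autojunk=False)
        for tag, i1, i2, j1, j2 in matcher.get_opcodes():
            if tag != "equal":  # replace, delete, or insert
                patterns[(ocr[i1:i2], truth[j1:j2])] += 1
    return patterns


def generate_candidates(token, patterns):
    """Apply each learned pattern once at every position where it matches."""
    candidates = set()
    for wrong, right in patterns:
        if not wrong:
            continue  # pure insertions have no anchor text; skipped here
        start = token.find(wrong)
        while start != -1:
            candidates.add(token[:start] + right + token[start + len(wrong):])
            start = token.find(wrong, start + 1)
    return candidates


# Hypothetical training pairs, invented for illustration only.
training_pairs = [("tbe", "the"), ("tbis", "this"), ("rnodern", "modern")]
patterns = learn_correction_patterns(training_pairs)
candidates = generate_candidates("tbat", patterns)
```

In a full system, the candidates produced this way would still need to be ranked, for example by a fitness function over linguistic features as the abstract describes.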
Classification
Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10200 - Computer and information sciences
Result continuities
Project
—
Continuities
S - Specific university research
Others
Year of implementation
2021
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Name of the periodical
Pattern Analysis and Applications
ISSN
1433-7541
e-ISSN
1433-755X
Volume of the periodical
24
Issue of the periodical within the volume
2
Country of the periodical's publisher
US - United States of America
Number of pages
21
Pages from-to
701-721
UT WoS code of the article
000591971700001
EID of the result in the Scopus database
2-s2.0-85096431401