An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F61989100%3A27240%2F23%3A10252607" target="_blank" >RIV/61989100:27240/23:10252607 - isvavai.cz</a>
Result on the web
<a href="https://ieeexplore.ieee.org/document/10144767" target="_blank" >https://ieeexplore.ieee.org/document/10144767</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ACCESS.2023.3283340" target="_blank" >10.1109/ACCESS.2023.3283340</a>
Alternative languages
Result language
English
Title in the original language
An Efficient Unsupervised Approach for OCR Error Correction of Vietnamese OCR Text
Description of the result in the original language
Different types of OCR errors often occur in OCR texts due to the low quality of scanned document images or limitations in OCR software. In this paper, we propose a novel unsupervised approach for OCR error correction. Correction candidates for OCR errors are generated and explored in their neighborhoods using correction character edits controlled by an adapted hill-climbing algorithm. Correction characters are extracted from only original ground truth texts, which do not depend on OCR texts in training data. A weighted objective function used to score and rank correction candidates is heuristically tested to find optimal weight combinations. The proposed model is evaluated on an OCR text dataset originating from the Vietnamese handwritten database in the ICFHR 2018 Vietnamese online handwritten text recognition competition. The proposed model is also verified concerning its stability and complexity. The experimental results show that our model achieves competitive performance compared to the other models in the ICFHR 2018 competition.
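The search described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the boundary markers `^` and `$`, the two objective terms (character-bigram coverage and similarity to the OCR token), and the weights 0.7 and 0.3 are all assumptions chosen for illustration. The sketch keeps only the ideas stated in the abstract: correction characters come from ground-truth text alone, candidates are explored through single-character edits, and a weighted objective decides whether the hill climb moves on.

```python
from difflib import SequenceMatcher


def bigrams(word):
    """Character bigrams of a word, padded with assumed boundary markers ^ and $."""
    padded = "^" + word + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def build_model(ground_truth_words):
    """Collect correction characters and known bigrams from clean text only."""
    alphabet = sorted({ch for word in ground_truth_words for ch in word})
    known = {bg for word in ground_truth_words for bg in bigrams(word)}
    return alphabet, known


def objective(candidate, ocr_word, known, w_lm, w_sim):
    """Assumed objective: weighted bigram coverage plus similarity to the OCR token."""
    grams = bigrams(candidate)
    coverage = sum(bg in known for bg in grams) / len(grams)
    similarity = SequenceMatcher(None, ocr_word, candidate).ratio()
    return w_lm * coverage + w_sim * similarity


def neighbors(word, alphabet):
    """All strings one character edit (insert, delete, substitute) away from word."""
    out = set()
    for i in range(len(word) + 1):
        for ch in alphabet:
            out.add(word[:i] + ch + word[i:])          # insertion
        if i < len(word):
            out.add(word[:i] + word[i + 1:])           # deletion
            for ch in alphabet:
                out.add(word[:i] + ch + word[i + 1:])  # substitution
    out.discard(word)
    out.discard("")
    return out


def hill_climb(ocr_word, alphabet, known, w_lm=0.7, w_sim=0.3, max_steps=10):
    """Move to the best-scoring neighbor until no neighbor improves the objective."""
    current = ocr_word
    current_score = objective(current, ocr_word, known, w_lm, w_sim)
    for _ in range(max_steps):
        candidates = sorted(neighbors(current, alphabet))  # sorted for determinism
        if not candidates:
            break
        best = max(candidates,
                   key=lambda c: objective(c, ocr_word, known, w_lm, w_sim))
        best_score = objective(best, ocr_word, known, w_lm, w_sim)
        if best_score <= current_score:
            break  # local optimum reached
        current, current_score = best, best_score
    return current


# Toy ground truth; a real setup would use Vietnamese ground-truth text.
alphabet, known = build_model(["hello", "world"])
print(hill_climb("hcllo", alphabet, known))  # hello
```

In this toy run the character `c` never appears in the ground truth, so the climb can only move toward strings built from ground-truth characters. The similarity term keeps the candidate close to the OCR token. Changing the two weights shifts that balance, which is the kind of trade-off the abstract says was tuned heuristically.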
Classification
Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10200 - Computer and information sciences
Result continuities
Project
<a href="/cs/project/EF17_049%2F0008425" target="_blank" >EF17_049/0008425: Platforma pro výzkum orientovaný na Průmysl 4.0 a robotiku v ostravské aglomeraci (Platform for research oriented toward Industry 4.0 and robotics in the Ostrava agglomeration)</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities
Others
Publication year
2023
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Name of the periodical
IEEE Access
ISSN
2169-3536
e-ISSN
—
Volume of the periodical
11
Issue of the periodical within the volume
06 June 2023
Country of the periodical's publisher
US - United States of America
Number of pages
16
Pages from-to
58406-58421
UT code for WoS article
001012334700001
EID of the result in the Scopus database
—