Augmenting Historical Alphabet Datasets Using Generative Adversarial Networks
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F60460709%3A41110%2F23%3A92542" target="_blank" >RIV/60460709:41110/23:92542 - isvavai.cz</a>
Result on the web
<a href="https://link.springer.com/chapter/10.1007/978-3-031-21438-7_11" target="_blank" >https://link.springer.com/chapter/10.1007/978-3-031-21438-7_11</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-21438-7_11" target="_blank" >10.1007/978-3-031-21438-7_11</a>
Alternative languages
Result language
English
Title in the original language
Augmenting Historical Alphabet Datasets Using Generative Adversarial Networks
Result description in the original language
In this paper, we present a method for expanding small classification datasets. Every research project is based on data and methods, including text analysis. When analyzing historical texts in different alphabets, Optical Character Recognition (OCR) algorithms are not always available and, in many cases, such texts need to be transliterated and translated manually; alternatively, an OCR algorithm can be developed. Creating such an algorithm requires a large volume of input data. Each alphabet consists of elementary units: letters, vowels, or in some cases ideograms. The texts need to be segmented into such elements, and the elements are then classified. In many cases, it is very difficult and time-consuming to obtain a sufficient amount of data, and it is advisable to use augmentation methods. In our research, we propose using a Generative Adversarial Network to expand a relatively small dataset of Palmyrene letters and show that even adding generated data equal to a third of the size of the original dataset improves the classification results by 120 percent.
Title in English
Augmenting Historical Alphabet Datasets Using Generative Adversarial Networks
Result description in English
In this paper, we present a method for expanding small classification datasets. Every research project is based on data and methods, including text analysis. When analyzing historical texts in different alphabets, Optical Character Recognition (OCR) algorithms are not always available and, in many cases, such texts need to be transliterated and translated manually; alternatively, an OCR algorithm can be developed. Creating such an algorithm requires a large volume of input data. Each alphabet consists of elementary units: letters, vowels, or in some cases ideograms. The texts need to be segmented into such elements, and the elements are then classified. In many cases, it is very difficult and time-consuming to obtain a sufficient amount of data, and it is advisable to use augmentation methods. In our research, we propose using a Generative Adversarial Network to expand a relatively small dataset of Palmyrene letters and show that even adding generated data equal to a third of the size of the original dataset improves the classification results by 120 percent.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific research at universities
Others
Publication year
2023
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Article name in the collection
Data Science and Algorithms in Systems
ISBN
978-3-031-21438-7
ISSN
2367-3389
e-ISSN
—
Number of pages
10
Pages from-to
132-141
Publisher name
Springer
Place of publication
Gewerbestrasse 11, 6330 Cham, Switzerland
Event location
online (Prague)
Event date
1 January 2022
Event type by nationality
WRD - Worldwide event
UT WoS article code
—