Fuzzy Influenced Process to Generate Comparable to Parallel Corpora

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3AGPAWR8UW" target="_blank" >RIV/00216208:11320/23:GPAWR8UW - isvavai.cz</a>
Result on the web
<a href="https://dl.acm.org/doi/10.1145/3599235" target="_blank" >https://dl.acm.org/doi/10.1145/3599235</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1145/3599235" target="_blank" >10.1145/3599235</a>

Alternative languages

Result language
English
Title in original language
Fuzzy Influenced Process to Generate Comparable to Parallel Corpora
Result description in original language
"Data-driven supervised approaches rely on the parallel corpus. Due to lack of data and resources availability, it has become more difficult to achieve accurate outputs. In addition, the efficiency of the machine translation system depends on the quality of the used corpora. Hindi still lacks good quality parallel corpora and needs more resources for accurate machine translation. Comparable corpora are easily available compared to parallel corpora, but they cannot be used directly in machine translation. In our present research, we propose an algorithm to mine these comparable corpora from the web, and generate the parallel corpora automatically. Machine translation systems, system combination approach, and IR-based technique join their hands together to choose the set of sentence pairs. Then the sentence pairs having the best score are chosen to prepare the final parallel corpora. The primary modules of this architecture are fuzzy logic-based evaluation metric, information retrieval module, statistical machine translation system, Google neural machine translation system, Microsoft machine translation system, and system combination module for machine translation. For case study, we prepare the Hindi-English parallel corpora of (30825 + 51235) = 82060 sentence pairs. Evaluation results show that the F-Score measurement varies from 95.73 to 96.98 for various data sets. The source code and prepared dataset (comparable and parallel corpus) can be found at https://github.com/debajyoty/Comparable-partallel-Algo2.git."
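The selection step described in the abstract (score candidate sentence pairs, keep the best-scoring one) can be illustrated with a minimal sketch. This is not the authors' implementation: the paper's fuzzy logic-based metric is not specified in this record, so a plain token-overlap F1 score is swapped in as a stand-in, and the function names `token_f1` and `best_pair` are hypothetical. The sketch assumes that machine translations of one Hindi sentence are compared against English candidate sentences retrieved from a comparable corpus.

```python
from collections import Counter


def token_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1 between two sentences.

    Stand-in scoring function (an assumption), not the paper's fuzzy metric.
    """
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Multiset intersection counts each shared token at most as often
    # as it appears in both sentences.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def best_pair(translations: list[str], english_candidates: list[str]) -> tuple[str, float]:
    """Return the English candidate with the highest score against any MT output.

    Both lists are assumed to be non-empty.
    """
    scored = [
        (eng, max(token_f1(mt, eng) for mt in translations))
        for eng in english_candidates
    ]
    return max(scored, key=lambda item: item[1])


# Hypothetical example data: outputs of several MT systems for one source sentence
# and English sentences retrieved from a comparable document.
mt_outputs = ["the weather is nice today", "today weather is good"]
candidates = ["The weather is nice today", "Stock prices fell sharply"]
sentence, score = best_pair(mt_outputs, candidates)
```

In this sketch, using several translations per source sentence makes the match less sensitive to the wording of any single MT system, which is one plausible motivation for combining systems before scoring.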

Classification

Type
J<sub>ost</sub> - Other articles in peer-reviewed periodicals
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

Project
—
Linkages
—

Other

Year of implementation
2023
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Name of the periodical
"ACM Transactions on Asian and Low-Resource Language Information Processing"
ISSN
2375-4699
e-ISSN
—
Volume of the periodical
""
Issue of the periodical within the volume
2023-12-22
Country of the periodical's publisher
US - United States of America
Number of pages of the result
23
Pages from-to
1-23
UT WoS code of the article
—
EID of the result in the Scopus database
—

Similar results (10)

Automatic Resource Augmentation for Machine Translation in Low Resource Language: EnIndic Corpus
HindEnCorp - Hindi-English and Hindi-only Corpus for Machine Translation
Improving Parallel Data Identification using Iteratively Refined Sentence Alignments and Bilingual Mappings of Pre-trained Language Models
