Evaluating automatic sentence alignment approaches on English-Slovak sentences
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216275%3A25410%2F23%3A39920747" target="_blank" >RIV/00216275:25410/23:39920747 - isvavai.cz</a>
Výsledek na webu
<a href="https://www.nature.com/articles/s41598-023-47479-w" target="_blank" >https://www.nature.com/articles/s41598-023-47479-w</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1038/s41598-023-47479-w" target="_blank" >10.1038/s41598-023-47479-w</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Evaluating automatic sentence alignment approaches on English-Slovak sentences
Popis výsledku v původním jazyce
Parallel texts represent a very valuable resource in many applications of natural language processing. The fundamental step in creating parallel corpus is the alignment. Sentence alignment is the issue of finding correspondence between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches were proposed including neural networks, which can be divided into length-based, lexicon-based, and translation-based. In our study, we used five different aligners, namely Bilingual sentence aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated both, the performance of the Bertalign in terms of accuracy against the up to now employed aligners as well as among each other in the language pair English-Sovak. We created our custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign performed statistically significantly best and BSA the worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign achieved the most diverse results in terms of performance.
Název v anglickém jazyce
Evaluating automatic sentence alignment approaches on English-Slovak sentences
Popis výsledku anglicky
Parallel texts represent a very valuable resource in many applications of natural language processing. The fundamental step in creating parallel corpus is the alignment. Sentence alignment is the issue of finding correspondence between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches were proposed including neural networks, which can be divided into length-based, lexicon-based, and translation-based. In our study, we used five different aligners, namely Bilingual sentence aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated both, the performance of the Bertalign in terms of accuracy against the up to now employed aligners as well as among each other in the language pair English-Sovak. We created our custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign performed statistically significantly best and BSA the worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign achieved the most diverse results in terms of performance.
Klasifikace
Druh
J<sub>imp</sub> - Článek v periodiku v databázi Web of Science
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
—
Návaznosti
I - Institucionalni podpora na dlouhodoby koncepcni rozvoj vyzkumne organizace
Ostatní
Rok uplatnění
2023
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název periodika
Scientific Reports
ISSN
2045-2322
e-ISSN
2045-2322
Svazek periodika
13
Číslo periodika v rámci svazku
1
Stát vydavatele periodika
GB - Spojené království Velké Británie a Severního Irska
Počet stran výsledku
12
Strana od-do
20123
Kód UT WoS článku
001125371600054
EID výsledku v databázi Scopus
2-s2.0-85177092385