Evaluating automatic sentence alignment approaches on English-Slovak sentences
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216275%3A25410%2F23%3A39920747" target="_blank" >RIV/00216275:25410/23:39920747 - isvavai.cz</a>
Result on the web
<a href="https://www.nature.com/articles/s41598-023-47479-w" target="_blank" >https://www.nature.com/articles/s41598-023-47479-w</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1038/s41598-023-47479-w" target="_blank" >10.1038/s41598-023-47479-w</a>
Alternative languages
Result language
English
Original language name
Evaluating automatic sentence alignment approaches on English-Slovak sentences
Original language description
Parallel texts are a valuable resource in many applications of natural language processing. The fundamental step in creating a parallel corpus is alignment. Sentence alignment is the task of finding correspondences between source sentences and their equivalent translations in the target text. A number of automatic sentence alignment approaches have been proposed, including ones based on neural networks; they can be divided into length-based, lexicon-based, and translation-based approaches. In our study, we used five different aligners, namely Bilingual sentence aligner (BSA), Hunalign, Bleualign, Vecalign, and Bertalign. We evaluated the accuracy of Bertalign against the previously employed aligners, and compared the aligners with each other, for the English-Slovak language pair. We created a custom corpus consisting of texts collected in 2021 and 2022. Vecalign and Bertalign performed statistically significantly best and BSA worst. Hunalign and Bleualign achieved the same performance in terms of F1 score. However, Bleualign showed the most varied results in terms of performance.
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Scientific Reports
ISSN
2045-2322
e-ISSN
2045-2322
Volume of the periodical
13
Issue of the periodical within the volume
1
Country of publishing house
GB - UNITED KINGDOM
Number of pages
12
Pages from-to
20123
UT code for WoS article
001125371600054
EID of the result in the Scopus database
2-s2.0-85177092385