Using a multilingual literary parallel corpus to train NMT systems
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A90244%2F24%3A10495709" target="_blank" >RIV/00216208:90244/24:10495709 - isvavai.cz</a>
Výsledek na webu
<a href="https://aclanthology.org/2024.ctt-1.1.pdf" target="_blank" >https://aclanthology.org/2024.ctt-1.1.pdf</a>
DOI - Digital Object Identifier
—
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Using a multilingual literary parallel corpus to train NMT systems
Popis výsledku v původním jazyce
This article presents an application of a multilingual and multidirectional parallel corpus composed of literary texts in five Romance languages (Spanish, French, Italian, Portuguese, Romanian) and a Slavic language (Croatian), with a total of 142,000 segments and 15.7 million words. After combining it with very large freely available parallel corpora, this resource is used to train NMT systems tailored to literature. A total of five NMT systems have been trained: Spanish-French, Spanish-Italian, Spanish-Portuguese, Spanish-Romanian and Spanish-Croatian. The trained systems were evaluated using automatic metrics (BLEU, chrF2 and TER) and a comparison with a rule-based MT system (Apertium) and a neural system (Google Translate) is presented. As a main conclusion, we can highlight that the use of this literary corpus has been very productive, as the majority of the trained systems achieve comparable, and in some cases even better, values of the automatic quality metrics than a widely used commercial NMT system.
Název v anglickém jazyce
Using a multilingual literary parallel corpus to train NMT systems
Popis výsledku anglicky
This article presents an application of a multilingual and multidirectional parallel corpus composed of literary texts in five Romance languages (Spanish, French, Italian, Portuguese, Romanian) and a Slavic language (Croatian), with a total of 142,000 segments and 15.7 million words. After combining it with very large freely available parallel corpora, this resource is used to train NMT systems tailored to literature. A total of five NMT systems have been trained: Spanish-French, Spanish-Italian, Spanish-Portuguese, Spanish-Romanian and Spanish-Croatian. The trained systems were evaluated using automatic metrics (BLEU, chrF2 and TER) and a comparison with a rule-based MT system (Apertium) and a neural system (Google Translate) is presented. As a main conclusion, we can highlight that the use of this literary corpus has been very productive, as the majority of the trained systems achieve comparable, and in some cases even better, values of the automatic quality metrics than a widely used commercial NMT system.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
60203 - Linguistics
Návaznosti výsledku
Projekt
—
Návaznosti
—
Ostatní
Rok uplatnění
2024
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Proceedings of the 1st Workshop on Creative-text Translation and Technology
ISBN
978-1-06-869073-0
ISSN
—
e-ISSN
—
Počet stran výsledku
9
Strana od-do
1-9
Název nakladatele
European Association for Machine Translation
Místo vydání
Sheffield
Místo konání akce
Sheffield
Datum konání akce
27. 6. 2024
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
—