Better Low-Resource Machine Translation with Smaller Vocabularies

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00137399" target="_blank" >RIV/00216224:14330/24:00137399 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-031-70563-2_15" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70563-2_15</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-70563-2_15" target="_blank" >10.1007/978-3-031-70563-2_15</a>

Alternative languages

Result language
English
Title in original language
Better Low-Resource Machine Translation with Smaller Vocabularies
Result description in original language
Data scarcity is still a major challenge in machine translation. The performance of state-of-the-art deep learning architectures, such as the Transformer, on under-resourced languages is well below that on high-resource languages. This precludes access to information for millions of speakers across the globe. Previous research has shown that the Transformer is highly sensitive to hyperparameters in low-resource conditions. One such hyperparameter is the size of the model's subword vocabulary. In this paper, we show that using smaller vocabularies, as small as 1k tokens, instead of the default value of 32k, is preferable in a diverse array of low-resource conditions. We experiment with different sizes on English-Akkadian, Lower Sorbian-German, and English-Manipuri to obtain models that are faster to train, smaller, and better performing than the default setting. These models achieve improvements of up to 322% in ChrF score, while being up to 66% smaller and up to 17% faster to train.

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10200 - Computer and information sciences

Result linkages

Project
<a href="/cs/project/LM2023062" target="_blank" >LM2023062: Digital Research Infrastructure for Language Technologies, Arts and Humanities</a>
Linkages
P - Research and development project financed from public funds (with a link to CEP)
S - Specific research at universities

Other

Year of implementation
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the article in the proceedings
Text, Speech, and Dialogue
ISBN
9783031705625
ISSN
0302-9743
e-ISSN
1611-3349
Number of pages
12
Pages from-to
184-195
Publisher name
Springer
Place of publication
Cham
Event location
Brno, Czech Republic
Event date
1 January 2024
Type of event by nationality
WRD - Worldwide event
UT WoS article code
001307840300015
