Better Low-Resource Machine Translation with Smaller Vocabularies
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00137399" target="_blank" >RIV/00216224:14330/24:00137399 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-031-70563-2_15" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70563-2_15</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-70563-2_15" target="_blank" >10.1007/978-3-031-70563-2_15</a>
Alternative languages
Result language
English
Original language name
Better Low-Resource Machine Translation with Smaller Vocabularies
Original language description
Data scarcity is still a major challenge in machine translation. The performance of state-of-the-art deep learning architectures, such as the Transformer, on under-resourced languages is well below that on high-resource languages. This precludes access to information for millions of speakers across the globe. Previous research has shown that the Transformer is highly sensitive to hyperparameters in low-resource conditions. One such hyperparameter is the size of the model's subword vocabulary. In this paper, we show that using smaller vocabularies, as small as 1k tokens, instead of the default value of 32k, is preferable in a diverse array of low-resource conditions. We experiment with different vocabulary sizes on English-Akkadian, Lower Sorbian-German, and English-Manipuri to obtain models that are faster to train, smaller, and better performing than the default setting. These models achieve improvements of up to 322% in ChrF score, while being up to 66% smaller and up to 17% faster to train.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences
Result continuities
Project
<a href="/en/project/LM2023062" target="_blank" >LM2023062: Digital Research Infrastructure for Language Technologies, Arts and Humanities</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)<br>S - Specific research at universities
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Text, Speech, and Dialogue
ISBN
9783031705625
ISSN
0302-9743
e-ISSN
1611-3349
Number of pages
12
Pages from-to
184-195
Publisher name
Springer
Place of publication
Cham
Event location
Brno, Czech Republic
Event date
Jan 1, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
001307840300015