
Better Low-Resource Machine Translation with Smaller Vocabularies

The result's identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00137399" target="_blank" >RIV/00216224:14330/24:00137399 - isvavai.cz</a>

  • Result on the web

    <a href="http://dx.doi.org/10.1007/978-3-031-70563-2_15" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70563-2_15</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/978-3-031-70563-2_15" target="_blank" >10.1007/978-3-031-70563-2_15</a>

Alternative languages

  • Result language

    English

  • Original language name

    Better Low-Resource Machine Translation with Smaller Vocabularies

  • Original language description

    Data scarcity is still a major challenge in machine translation. The performance of state-of-the-art deep learning architectures, such as the Transformer, on under-resourced languages is well below that on high-resourced languages. This precludes access to information for millions of speakers across the globe. Previous research has shown that the Transformer is highly sensitive to hyperparameters in low-resource conditions. One such hyperparameter is the size of the model's subword vocabulary. In this paper, we show that using smaller vocabularies, as small as 1k tokens instead of the default value of 32k, is preferable in a diverse array of low-resource conditions. We experiment with different sizes on English-Akkadian, Lower Sorbian-German, and English-Manipuri to obtain models that are faster to train, smaller, and better performing than the default setting. These models achieve improvements of up to 322% in ChrF score while being up to 66% smaller and up to 17% faster to train. (A minimal sketch of training such a small vocabulary follows this list.)

  • Czech name

  • Czech description
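
The description above argues for very small subword vocabularies (around 1k tokens instead of the common 32k default). Below is a minimal sketch of building such a vocabulary with the SentencePiece library; SentencePiece is one common implementation, and the corpus file name, model prefix, and example sentence are illustrative assumptions, not part of this record or the paper's released code.

    import sentencepiece as spm

    # Train a BPE subword vocabulary of only 1,000 tokens, in line with the
    # paper's finding that small vocabularies help in low-resource settings.
    # "train.txt" is a hypothetical plain-text training corpus.
    spm.SentencePieceTrainer.train(
        input="train.txt",
        model_prefix="bpe_1k",
        vocab_size=1000,
        model_type="bpe",
    )

    # Segment a sentence with the trained model; a smaller vocabulary yields
    # longer sequences of shorter subword units.
    sp = spm.SentencePieceProcessor(model_file="bpe_1k.model")
    print(sp.encode("Data scarcity is still a major challenge.", out_type=str))

A sweep over vocab_size values (e.g., 1k, 2k, 4k, 8k) per language pair would mirror the paper's experiments with different vocabulary sizes.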

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10200 - Computer and information sciences

Result continuities

  • Project

    <a href="/en/project/LM2023062" target="_blank" >LM2023062: Digital Research Infrastructure for Language Technologies, Arts and Humanities</a><br>

  • Continuities

    P - Research and development project financed from public sources (with a link to CEP)
    S - Specific research at universities

Others

  • Publication year

    2024

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Article name in the collection

    Text, Speech, and Dialogue

  • ISBN

    978-3-031-70562-5

  • ISSN

    0302-9743

  • e-ISSN

    1611-3349

  • Number of pages

    12

  • Pages from-to

    184-195

  • Publisher name

    Springer

  • Place of publication

    Cham

  • Event location

    Brno, Czech Republic

  • Event date

    Jan 1, 2024

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article

    001307840300015