Vše

Co hledáte?

Vše
Projekty
Výsledky výzkumu
Subjekty

Rychlé hledání

  • Projekty podpořené TA ČR
  • Významné projekty
  • Projekty s nejvyšší státní podporou
  • Aktuálně běžící projekty

Chytré vyhledávání

  • Takto najdu konkrétní +slovo
  • Takto z výsledků -slovo zcela vynechám
  • “Takto můžu najít celou frázi”

Sequence-to-sequence pretraining for a less-resourced Slovenian language

Identifikátory výsledku

  • Kód výsledku v IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3AGY8X9D7V" target="_blank" >RIV/00216208:11320/23:GY8X9D7V - isvavai.cz</a>

  • Výsledek na webu

    <a href="https://www.frontiersin.org/articles/10.3389/frai.2023.932519" target="_blank" >https://www.frontiersin.org/articles/10.3389/frai.2023.932519</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.3389/frai.2023.932519" target="_blank" >10.3389/frai.2023.932519</a>

Alternativní jazyky

  • Jazyk výsledku

    angličtina

  • Název v původním jazyce

    Sequence-to-sequence pretraining for a less-resourced Slovenian language

  • Popis výsledku v původním jazyce

    "IntroductionLarge pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modeling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which more naturally fits text generation tasks. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages.MethodsWe trained two different-sized T5-type sequence-to-sequence models for morphologically rich Slovene language with much fewer resources. We analyzed the behavior of new models on 11 tasks, eight classification ones (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task), and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, multilingual mBART-50 model, and with four encoder BERT-like models: multilingual BERT, multilingual XLM-RoBERTa, trilingual Croatian-Slovene-English BERT, and monolingual Slovene RoBERTa model.ResultsConcerning the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, these models are helpful for generative tasks and provide several useful results. In general, the size of models matters, and currently, there is not enough training data for Slovene for successful pretraining of large models.DiscussionWhile the results are obtained on Slovene, we believe that they may generalize to other less-resourced languages, where such models will be built. We make the training and evaluation code, as well as the trained models, publicly available."

  • Název v anglickém jazyce

    Sequence-to-sequence pretraining for a less-resourced Slovenian language

  • Popis výsledku anglicky

    "IntroductionLarge pretrained language models have recently conquered the area of natural language processing. As an alternative to predominant masked language modeling introduced in BERT, the T5 model has introduced a more general training objective, namely sequence to sequence transformation, which more naturally fits text generation tasks. The monolingual variants of T5 models have been limited to well-resourced languages, while the massively multilingual T5 model supports 101 languages.MethodsWe trained two different-sized T5-type sequence-to-sequence models for morphologically rich Slovene language with much fewer resources. We analyzed the behavior of new models on 11 tasks, eight classification ones (named entity recognition, sentiment classification, lemmatization, two question answering tasks, two natural language inference tasks, and a coreference resolution task), and three text generation tasks (text simplification and two summarization tasks on different datasets). We compared the new SloT5 models with the multilingual mT5 model, multilingual mBART-50 model, and with four encoder BERT-like models: multilingual BERT, multilingual XLM-RoBERTa, trilingual Croatian-Slovene-English BERT, and monolingual Slovene RoBERTa model.ResultsConcerning the classification tasks, the SloT5 models mostly lag behind the monolingual Slovene SloBERTa model. However, these models are helpful for generative tasks and provide several useful results. In general, the size of models matters, and currently, there is not enough training data for Slovene for successful pretraining of large models.DiscussionWhile the results are obtained on Slovene, we believe that they may generalize to other less-resourced languages, where such models will be built. We make the training and evaluation code, as well as the trained models, publicly available."

Klasifikace

  • Druh

    J<sub>ost</sub> - Ostatní články v recenzovaných periodicích

  • CEP obor

  • OECD FORD obor

    10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Návaznosti výsledku

  • Projekt

  • Návaznosti

Ostatní

  • Rok uplatnění

    2023

  • Kód důvěrnosti údajů

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

  • Název periodika

    "Frontiers in Artificial Intelligence"

  • ISSN

    2624-8212

  • e-ISSN

  • Svazek periodika

    6

  • Číslo periodika v rámci svazku

    2023

  • Stát vydavatele periodika

    US - Spojené státy americké

  • Počet stran výsledku

    13

  • Strana od-do

    1-13

  • Kód UT WoS článku

  • EID výsledku v databázi Scopus

    2-s2.0-85152679109