Training Dataset and Dictionary Sizes Matter in BERT Models: The Case of Baltic Languages

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3A4SGWFY9I" target="_blank" >RIV/00216208:11320/22:4SGWFY9I - isvavai.cz</a>
Result on the web
<a href="https://www.researchgate.net/publication/357201955_Training_dataset_and_dictionary_sizes_matter_in_BERT_models_the_case_of_Baltic_languages" target="_blank" >https://www.researchgate.net/publication/357201955_Training_dataset_and_dictionary_sizes_matter_in_BERT_models_the_case_of_Baltic_languages</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-16500-9_14" target="_blank" >10.1007/978-3-031-16500-9_14</a>

Alternative languages

Result language
English
Title in original language
Training Dataset and Dictionary Sizes Matter in BERT Models: The Case of Baltic Languages
Result description in original language
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most situations.
Title in English
Training Dataset and Dictionary Sizes Matter in BERT Models: The Case of Baltic Languages
Result description in English
Large pretrained masked language models have become state-of-the-art solutions for many NLP problems. While studies have shown that monolingual models produce better results than multilingual models, the training datasets must be sufficiently large. We trained a trilingual LitLat BERT-like model for Lithuanian, Latvian, and English, and a monolingual Est-RoBERTa model for Estonian. We evaluate their performance on four downstream tasks: named entity recognition, dependency parsing, part-of-speech tagging, and word analogy. To analyze the importance of focusing on a single language and the importance of a large training set, we compare created models with existing monolingual and multilingual BERT models for Estonian, Latvian, and Lithuanian. The results show that the newly created LitLat BERT and Est-RoBERTa models improve the results of existing models on all tested tasks in most situations.

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Other

Year of publication
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the article in the proceedings
Analysis of Images, Social Networks and Texts
ISBN
978-3-031-16500-9
ISSN
—
e-ISSN
—
Number of pages
11
Pages from-to
162-172
Publisher name
Springer International Publishing
Place of publication
—
Event location
Cham
Event date
1. 1. 2022
Event type by nationality
WRD - Worldwide event
UT WoS code of the article
—

Similar results (10)

Sequence-to-sequence pretraining for a less-resourced Slovenian language
Is Multilingual BERT Fluent in Language Generation?
Adapting Monolingual Models: Data can be Scarce when Language Similarity is High