SlamaTrain – Representative Training Dataset for Slavonic Large Language Models
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00138085" target="_blank" >RIV/00216224:14330/24:00138085 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
SlamaTrain – Representative Training Dataset for Slavonic Large Language Models
Original language description
The Slama project focuses on building a series of foundational language models for Slavonic languages. Even though the latest development yields a number of new large pre-trained and fine-tuned models, the main data source came from English-written websites. Therefore the majority of the training data that is used for language model development consists of the English language. Multilingual language models like Llama, GPT-4o, mT5, etc. are also predominantly (around 80%) trained on the English language, even though they capture the structure of dozens of languages. In this paper, we detail the process of acquiring one of the largest training datasets for Czech, Slovak and other Slavonic languages. We started with huge multi-lingual datasets, extracted the mono-lingual data and joined them with other sources. The combined mono-lingual datasets were then cleaned, deduplicated and filtered for adult content. As a result, we have obtained 71 billion tokens for the Czech and Slovak languages suitable for the Slama language models training.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences
Result continuities
Project
<a href="/en/project/LM2023062" target="_blank" >LM2023062: Digital Research Infrastructure for Language Technologies, Arts and Humanities</a>
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Recent Advances in Slavonic Natural Language Processing, RASLAN 2024
ISBN
9788026318354
ISSN
2336-4289
e-ISSN
—
Number of pages
9
Pages from-to
25-33
Publisher name
Tribun EU
Place of publication
Brno, Czech Republic
Event location
Kouty nad Desnou, Czech Republic
Event date
Jan 1, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—