A Dataset and Strong Baselines for Classification of Czech News Texts

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A10475727" target="_blank" >RIV/00216208:11320/23:10475727 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1007/978-3-031-40498-6_4" target="_blank" >https://doi.org/10.1007/978-3-031-40498-6_4</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-40498-6_4" target="_blank" >10.1007/978-3-031-40498-6_4</a>

Alternative languages

Result language
English
Title in original language
A Dataset and Strong Baselines for Classification of Czech News Texts
Result description in original language
Pre-trained models for Czech Natural Language Processing are often evaluated on purely linguistic tasks (POS tagging, parsing, NER) and relatively simple classification tasks such as sentiment classification or article classification from a single news source. As an alternative, we present the CZEch NEws Classification dataset (CZE-NEC), one of the largest Czech classification datasets, composed of news articles from various sources spanning over twenty years, which allows a more rigorous evaluation of such models. We define four classification tasks: news source, news category, inferred author's gender, and day of the week. To verify the task difficulty, we conducted a human evaluation, which revealed that human performance lags behind strong machine-learning baselines built upon pre-trained transformer models. Furthermore, we show that language-specific pre-trained encoder analysis outperforms selected commercially available large-scale generative language models.

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation

Others

Publication year
2023
Data confidentiality code
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific to the result type

Article name in the proceedings
Lecture Notes in Artificial Intelligence
ISBN
978-3-031-40497-9
ISSN
—
e-ISSN
1611-3349
Number of pages
12
Pages from-to
33-44
Publisher name
Springer
Place of publication
Cham, Switzerland
Event location
Plzeň, Czechia
Event date
4 September 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—
