Text Filtering Classifiers for Medium-Resource Languages
Result identifiers
Result code in IS VaVaI
RIV/00216208:11320/25:PVLDQKY7 - isvavai.cz: https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3APVLDQKY7
Result on the web
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85195236234&partnerID=40&md5=87469362bd8df3682429baa73f0c0621
DOI - Digital Object Identifier
—
Alternative languages
Language of the result
English
Title in the original language
Text Filtering Classifiers for Medium-Resource Languages
Description of the result in the original language
Web-crawled corpora are essential resources for linguistic and NLP research, offering far more data than is available from curated corpora. However, they often contain a great deal of low-quality texts which can complicate research and degrade the quality of pre-trained language models. Therefore, they are typically filtered, e.g. by applying rules or classifiers. In this paper, we compare the effectiveness of various text filtering classifiers and measure their impact on language model performance for three medium-resource languages. We present TQ-IS, an Icelandic text quality dataset consisting of 2,000 web-crawled documents, in which spans of low-quality text have been manually identified and labeled. We then evaluate a perplexity-based classifier, a supervised classifier trained on TQ-IS, and a self-supervised classifier trained to discern between documents from curated and web-crawled corpora on Icelandic, Estonian and Basque. We find that these classifiers obtain F1 scores of 94.48%, 99.01% and 93.40%, respectively, when evaluated on the TQ-IS dataset. Furthermore, our results show that while adding filtered web-crawled text to a pre-training corpus can improve downstream performance for pre-trained language models, any improvement is likely to remain modest unless the web-crawled corpus is significantly larger in size. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
Title in English
Text Filtering Classifiers for Medium-Resource Languages
Description of the result in English
Web-crawled corpora are essential resources for linguistic and NLP research, offering far more data than is available from curated corpora. However, they often contain a great deal of low-quality texts which can complicate research and degrade the quality of pre-trained language models. Therefore, they are typically filtered, e.g. by applying rules or classifiers. In this paper, we compare the effectiveness of various text filtering classifiers and measure their impact on language model performance for three medium-resource languages. We present TQ-IS, an Icelandic text quality dataset consisting of 2,000 web-crawled documents, in which spans of low-quality text have been manually identified and labeled. We then evaluate a perplexity-based classifier, a supervised classifier trained on TQ-IS, and a self-supervised classifier trained to discern between documents from curated and web-crawled corpora on Icelandic, Estonian and Basque. We find that these classifiers obtain F1 scores of 94.48%, 99.01% and 93.40%, respectively, when evaluated on the TQ-IS dataset. Furthermore, our results show that while adding filtered web-crawled text to a pre-training corpus can improve downstream performance for pre-trained language models, any improvement is likely to remain modest unless the web-crawled corpus is significantly larger in size. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
—
Other
Year of implementation
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in proceedings
Jt. Int. Conf. Comput. Linguist., Lang. Resour. Eval., LREC-COLING - Main Conf. Proc.
ISBN
978-249381410-4
ISSN
—
e-ISSN
—
Number of pages
13
Pages from-to
15789-15801
Publisher name
European Language Resources Association (ELRA)
Place of publication
—
Event location
Torino, Italia
Event date
1. 1. 2025
Event type by nationality
WRD - Worldwide event
UT WoS article code
—