Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F21%3A10440911" target="_blank" >RIV/00216208:11320/21:10440911 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
angličtina
Original language name
Beyond the English web: Zero-shot cross-lingual and lightweight monolingual classification of registers
Original language description
We explore cross-lingual transfer of register classification for web documents. Registers, that is, text varieties such as blogs or news are one of the primary predictors of linguistic variation and thus affect the automatic processing of language. We introduce two new register-annotated corpora, FreCORE and SweCORE, for French and Swedish. We demonstrate that deep pre-trained language models perform strongly in these languages and outperform previous state-of-the-art in English and Finnish. Specifically, we show 1) that zero-shot cross-lingual transfer from the large English CORE corpus can match or surpass previously published monolingual models, and 2) that lightweight monolingual classification requiring very little training data can reach or surpass our zero-shot performance. We further analyse classification results finding that certain registers continue to pose challenges in particular for cross-lingual transfer.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2021
Confidentiality
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Data specific for result type
Article name in the collection
EACL 2021 - 16th Conference of the European Chapter of the Association for Computational Linguistics, Proceedings of the Student Research Workshop
ISBN
978-1-954085-04-6
ISSN
—
e-ISSN
—
Number of pages
9
Pages from-to
183-191
Publisher name
Association for Computational Linguistics
Place of publication
Stroudsburg
Event location
online
Event date
Apr 19, 2021
Type of event by nationality
WRD - Celosvětová akce
UT code for WoS article
—