Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A10475921" target="_blank" >RIV/00216208:11320/23:10475921 - isvavai.cz</a>
Result on the web
<a href="https://aclanthology.org/2023.findings-acl.350" target="_blank" >https://aclanthology.org/2023.findings-acl.350</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages
Original language description
Multilingual language models have recently gained attention as a promising solution for representing multiple languages in a single model. In this paper, we propose new criteria to evaluate the quality of lexical representation and vocabulary overlap observed in sub-word tokenizers. Our findings show that the overlap of vocabulary across languages can actually be detrimental to certain downstream tasks (POS, dependency tree labeling). In contrast, NER and sentence-level tasks (cross-lingual retrieval, NLI) benefit from sharing vocabulary. We also observe that the coverage of the language-specific tokens in the multilingual vocabulary significantly impacts the word-level tasks. Our study offers a deeper understanding of the role of tokenizers in multilingual language models and guidelines for future model developers to choose the most suitable tokenizer for their specific application before undertaking costly model pre-training.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific university research<br>I - Institutional support for the long-term conceptual development of a research organisation
Others
Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Findings of the Association for Computational Linguistics: ACL 2023
ISBN
978-1-959429-62-3
ISSN
—
e-ISSN
—
Number of pages
21
Pages from-to
5661-5681
Publisher name
Association for Computational Linguistics
Place of publication
Stroudsburg, PA, USA
Event location
Toronto, Canada
Event date
Jul 9, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—