Accelerating Multilingual Language Model for Excessively Tokenized Languages
The result's identifiers
Result code in IS VaVaI
RIV/00216208:11320/25:GK7S7KIB - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AGK7S7KIB)
Result on the web
https://www.scopus.com/inward/record.uri?eid=2-s2.0-85205297630&partnerID=40&md5=a9d0637da93cb011dc3fe5887dff9884
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Accelerating Multilingual Language Model for Excessively Tokenized Languages
Original language description
Recent advances in large language models (LLMs) have markedly improved performance on a variety of tasks in multiple languages. However, tokenizers in LLMs trained primarily on English-centric corpora often over-fragment text into character-level or Unicode-level tokens in non-Roman alphabetic languages, leading to inefficient text generation. We introduce a simple yet effective framework to accelerate text generation in such languages. Our approach adds a new language model head, with a vocabulary tailored to a specific target language, to a pre-trained LLM. The new head is then fine-tuned, with a verification step incorporated to ensure that the model's performance is preserved. We show that this targeted fine-tuning, with all other model parameters frozen, effectively reduces token fragmentation for the target language. Our extensive experiments demonstrate that the proposed framework increases generation speed by a factor of 1.7 while maintaining the performance of pre-trained multilingual models on target monolingual tasks. © 2024 Association for Computational Linguistics.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proc. Annu. Meet. Assoc. Comput. Linguist.
ISBN
979-889176099-8
ISSN
0736-587X
e-ISSN
—
Number of pages
17
Pages from-to
11095-11111
Publisher name
Association for Computational Linguistics (ACL)
Place of publication
—
Event location
Hybrid, Bangkok
Event date
Jan 1, 2025
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—