Distilling Efficient Language-Specific Models for Cross-Lingual Transfer

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3AM8P3JWBW" target="_blank" >RIV/00216208:11320/23:M8P3JWBW - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85175460005&partnerID=40&md5=c99170bb0e8e28087c599cd1ad55592d" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85175460005&partnerID=40&md5=c99170bb0e8e28087c599cd1ad55592d</a>
DOI - Digital Object Identifier
—

Alternative languages

Result language
English
Original language name
Distilling Efficient Language-Specific Models for Cross-Lingual Transfer
Original language description
"Massively multilingual Transformers (MMTs), such as mBERT and XLM-R, are widely used for cross-lingual transfer learning. While these are pretrained to represent hundreds of languages, end users of NLP systems are often interested only in individual languages. For such purposes, the MMTs' language coverage makes them unnecessarily expensive to deploy in terms of model size, inference time, energy, and hardware cost. We thus propose to extract compressed, language-specific models from MMTs which retain the capacity of the original MMTs for cross-lingual transfer. This is achieved by distilling the MMT bilingually, i.e., using data from only the source and target language of interest. Specifically, we use a two-phase distillation approach, termed BISTILLATION: (i) the first phase distils a general bilingual model from the MMT, while (ii) the second, task-specific phase sparsely fine-tunes the bilingual 'student' model using a task-tuned variant of the original MMT as its 'teacher'. We evaluate this distillation technique in zero-shot cross-lingual transfer across a number of standard cross-lingual benchmarks. The key results indicate that the distilled models exhibit minimal degradation in target language performance relative to the base MMT despite being significantly smaller and faster. Furthermore, we find that they outperform multilingually distilled models such as DistilmBERT and MiniLMv2 while having a very modest training budget in comparison, even on a per-language basis. We also show that bilingual models distilled from MMTs greatly outperform bilingual models trained from scratch. © 2023 Association for Computational Linguistics."
Czech name
—
Czech description
—

Classification

Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2023
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

Article name in the collection
"Proc. Annu. Meet. Assoc. Comput. Linguist."
ISBN
978-195942962-3
ISSN
0736-587X
e-ISSN
—
Number of pages
19
Pages from-to
8147-8165
Publisher name
Association for Computational Linguistics (ACL)
Place of publication
—
Event location
Melaka, Malaysia
Event date
Jan 1, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—

Similar results(10)

First Align, then Predict: Understanding the Cross-Lingual Ability of Multilingual BERT
Exploring the Relationship between Alignment and Cross-lingual Transfer in Multilingual Transformers
Cross-Lingual Transfer from Related Languages: Treating Low-Resource Maltese as Multilingual Code-Switching
