Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10415122" target="_blank" >RIV/00216208:11320/20:10415122 - isvavai.cz</a>
Nalezeny alternativní kódy
RIV/00216208:11320/20:10424477
Výsledek na webu
<a href="https://doi.org/10.1007/978-3-030-58323-1_18" target="_blank" >https://doi.org/10.1007/978-3-030-58323-1_18</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-030-58323-1_18" target="_blank" >10.1007/978-3-030-58323-1_18</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
Popis výsledku v původním jazyce
Reading comprehension is a well studied task, with huge training datasets in English. This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data. First of all, we automatically translated SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data, which we release at http://hdl.handle.net/11234/1-3249. We then trained and evaluated several BERT and XLM-RoBERTa baseline models. However, our main focus lies in cross-lingual transfer models. We report that a XLM-RoBERTa model trained on English data and evaluated on Czech achieves very competitive performance, only approximately 2% points worse than a model trained on the translated Czech data. This result is extremely good, considering the fact that the model has not seen any Czech data during training. The cross-lingual transfer approach is very flexible and provides a reading comprehension in any language, for which we have enough monolingual raw texts.
Název v anglickém jazyce
Reading Comprehension in Czech via Machine Translation and Cross-lingual Transfer
Popis výsledku anglicky
Reading comprehension is a well studied task, with huge training datasets in English. This work focuses on building reading comprehension systems for Czech, without requiring any manually annotated Czech training data. First of all, we automatically translated SQuAD 1.1 and SQuAD 2.0 datasets to Czech to create training and development data, which we release at http://hdl.handle.net/11234/1-3249. We then trained and evaluated several BERT and XLM-RoBERTa baseline models. However, our main focus lies in cross-lingual transfer models. We report that a XLM-RoBERTa model trained on English data and evaluated on Czech achieves very competitive performance, only approximately 2% points worse than a model trained on the translated Czech data. This result is extremely good, considering the fact that the model has not seen any Czech data during training. The cross-lingual transfer approach is very flexible and provides a reading comprehension in any language, for which we have enough monolingual raw texts.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
Výsledek vznikl pri realizaci vícero projektů. Více informací v záložce Projekty.
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)
Ostatní
Rok uplatnění
2020
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Text, Speech, and Dialogue. TSD 2020
ISBN
978-3-030-58322-4
ISSN
0302-9743
e-ISSN
—
Počet stran výsledku
9
Strana od-do
171-179
Název nakladatele
Springer
Místo vydání
Cham
Místo konání akce
Brno
Datum konání akce
8. 9. 2020
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
—