Sequence-Labeling RoBERTa Model for Dependency-Parsing in Classical Chinese and Its Application to Vietnamese and Thai
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A2MU6KJB2" target="_blank" >RIV/00216208:11320/23:2MU6KJB2 - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85164289459&doi=10.1109%2fICBIR57571.2023.10147628&partnerID=40&md5=2fb9f35120c62448088e49e9154f0479" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85164289459&doi=10.1109%2fICBIR57571.2023.10147628&partnerID=40&md5=2fb9f35120c62448088e49e9154f0479</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICBIR57571.2023.10147628" target="_blank" >10.1109/ICBIR57571.2023.10147628</a>
Alternative languages
Result language
English
Title in original language
Sequence-Labeling RoBERTa Model for Dependency-Parsing in Classical Chinese and Its Application to Vietnamese and Thai
Result description in original language
"The author and his colleagues have been developing a classical Chinese treebank using Universal Dependencies. We have also developed a RoBERTa-Classical-Chinese model pre-trained on 1.7 billion characters of classical Chinese text. In this paper we describe how to fine-tune a sequence-labeling RoBERTa model for dependency parsing in classical Chinese. We introduce 'goeswith'-labeled edges into the directed acyclic graphs of Universal Dependencies in order to resolve the mismatch between the token length of RoBERTa-Classical-Chinese and the word length in classical Chinese. We utilize the [MASK] token of the RoBERTa model to handle outgoing edges and to produce the adjacency matrices for the Universal Dependencies graphs. Our RoBERTa-UDgoeswith model outperforms other dependency parsers for classical Chinese on LAS/MLAS/BLEX benchmark scores. We then apply our methods to other isolating languages. For Vietnamese we introduce 'goeswith'-labeled edges to split words into space-separated syllables, and fine-tune RoBERTa and PhoBERT models. For Thai we try three kinds of tokenizers (character-wise, quasi-syllable, and SentencePiece) to produce RoBERTa models. © 2023 IEEE."
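The 'goeswith' device mentioned in the description can be sketched as follows. This is an illustrative reconstruction, not the authors' code: the helper `split_with_goeswith` and the Vietnamese example sentence are assumptions made for this record. In UD terms, when a tokenizer splits one word into several tokens, the first token inherits the word's head and relation, and every following token attaches to its predecessor via a 'goeswith' edge.

```python
# Hedged sketch of 'goeswith'-labeled edges for token/word mismatch.
# Words are (form, head, deprel) triples with 1-based heads, 0 = root.

def split_with_goeswith(words, split):
    """words: list of (form, head, deprel); split: form -> list of tokens."""
    pieces = [split(form) for form, _, _ in words]
    # 1-based index of the first token of each original word
    first, pos = [], 1
    for p in pieces:
        first.append(pos)
        pos += len(p)
    tokens = []
    for (form, head, deprel), toks, start in zip(words, pieces, first):
        # the first token keeps the word's relation; heads are remapped
        # to the head word's first token (root stays 0)
        new_head = 0 if head == 0 else first[head - 1]
        tokens.append((toks[0], new_head, deprel))
        # each remaining token chains to the previous one with 'goeswith'
        for i, t in enumerate(toks[1:], start=start):
            tokens.append((t, i, "goeswith"))
    return tokens

# Vietnamese-style example: multi-syllable words split on spaces
print(split_with_goeswith([("Hà Nội", 2, "nsubj"), ("đẹp", 0, "root")],
                          lambda w: w.split(" ")))
# → [('Hà', 3, 'nsubj'), ('Nội', 1, 'goeswith'), ('đẹp', 0, 'root')]
```

The same splitting function could be swapped for a character-wise or quasi-syllable tokenizer, which is how the Thai variants in the paper presumably differ.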
Title in English
Sequence-Labeling RoBERTa Model for Dependency-Parsing in Classical Chinese and Its Application to Vietnamese and Thai
Result description in English
"The author and his colleagues have been developing a classical Chinese treebank using Universal Dependencies. We have also developed a RoBERTa-Classical-Chinese model pre-trained on 1.7 billion characters of classical Chinese text. In this paper we describe how to fine-tune a sequence-labeling RoBERTa model for dependency parsing in classical Chinese. We introduce 'goeswith'-labeled edges into the directed acyclic graphs of Universal Dependencies in order to resolve the mismatch between the token length of RoBERTa-Classical-Chinese and the word length in classical Chinese. We utilize the [MASK] token of the RoBERTa model to handle outgoing edges and to produce the adjacency matrices for the Universal Dependencies graphs. Our RoBERTa-UDgoeswith model outperforms other dependency parsers for classical Chinese on LAS/MLAS/BLEX benchmark scores. We then apply our methods to other isolating languages. For Vietnamese we introduce 'goeswith'-labeled edges to split words into space-separated syllables, and fine-tune RoBERTa and PhoBERT models. For Thai we try three kinds of tokenizers (character-wise, quasi-syllable, and SentencePiece) to produce RoBERTa models. © 2023 IEEE."
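The description casts dependency parsing as sequence labeling whose output is decoded into the adjacency matrix of a UD graph. Below is a minimal sketch of one common relative-offset labeling scheme; it is an assumption for illustration only, and the authors' actual scheme, which uses the [MASK] token to handle outgoing edges, differs in detail.

```python
# Hedged sketch: dependency parsing as per-token sequence labeling.
# Each token i (1-based) gets a label "offset|deprel", where offset is
# head - i, and "+0|root" marks the root. Decoding the label sequence
# reconstructs the adjacency matrix adj[head][dependent] = deprel.

def encode(heads, deprels):
    """heads: 1-based head per token (0 = root) -> labels like '+1|nsubj'."""
    labels = []
    for i, (h, d) in enumerate(zip(heads, deprels), start=1):
        off = 0 if h == 0 else h - i
        labels.append(f"{off:+d}|{d}")
    return labels

def decode(labels):
    """Label sequence back to an (n+1) x (n+1) adjacency matrix."""
    n = len(labels)
    adj = [[None] * (n + 1) for _ in range(n + 1)]
    for i, lab in enumerate(labels, start=1):
        off, d = lab.split("|")
        o = int(off)
        h = 0 if o == 0 else i + o  # offset 0 is reserved for the root
        adj[h][i] = d
    return adj

# e.g. a 3-token sentence whose second token is the root
labels = encode([2, 0, 2], ["nsubj", "root", "obj"])
print(labels)  # → ['+1|nsubj', '+0|root', '-1|obj']
```

Because each token carries exactly one label, any token-classification head over RoBERTa embeddings can in principle predict such labels, which is the appeal of the sequence-labeling formulation.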
Classification
Type
D - Proceedings paper
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
—
Linkages
—
Others
Year of implementation
2023
Data confidentiality code
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific to the result type
Proceedings name
"Int. Conf. Bus. Ind. Res., ICBIR - Proc."
ISBN
979-835039964-6
ISSN
—
e-ISSN
—
Number of result pages
5
Pages from-to
169-173
Publisher name
Institute of Electrical and Electronics Engineers Inc.
Place of publication
—
Event venue
Cham
Event date
1 Jan 2023
Event type by nationality
WRD - Worldwide event
Article UT WoS code
—