Sequence-Labeling RoBERTa Model for Dependency-Parsing in Classical Chinese and Its Application to Vietnamese and Thai
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A2MU6KJB2" target="_blank" >RIV/00216208:11320/23:2MU6KJB2 - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85164289459&doi=10.1109%2fICBIR57571.2023.10147628&partnerID=40&md5=2fb9f35120c62448088e49e9154f0479" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85164289459&doi=10.1109%2fICBIR57571.2023.10147628&partnerID=40&md5=2fb9f35120c62448088e49e9154f0479</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICBIR57571.2023.10147628" target="_blank" >10.1109/ICBIR57571.2023.10147628</a>
Alternative languages
Result language
English
Original language name
Sequence-Labeling RoBERTa Model for Dependency-Parsing in Classical Chinese and Its Application to Vietnamese and Thai
Original language description
"The author and his colleagues have been developing classical Chinese treebank using Universal Dependencies. We also developed RoBERTa-Classical-Chinese model pre-trained with classical Chinese texts of 1.7 billion characters. In this paper we describe how to finetune sequence-labeling RoBERTa model for dependency-parsing in classical Chinese. We introduce 'goeswith'-labeled edges into the directed acyclic graphs of Universal Dependencies in order to resolve the mismatch between the token length of RoBERTa-Classical-Chinese and the word length in classical Chinese. We utilize [MASK]token of RoBERTa model to handle outgoing edges and to produce the adjacency-matrices for the graphs of Universal Dependencies. Our RoBERTa-UDgoeswith model outperforms other dependency-parsers in classical Chinese on LAS / MLAS / BLEX benchmark scores. Then we apply our methods to other isolating languages. For Vietnamese we introduce 'goeswith'-labeled edges to separate words into space-separated syllables, and finetune RoBERTa and PhoBERT models. For Thai we try three kinds of tokenizers, character-wise tokenizer, quasi-syllable tokenizer, and SentencePiece, to produce RoBERTa models. © 2023 IEEE."
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2023
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
"Int. Conf. Bus. Ind. Res., ICBIR - Proc."
ISBN
979-8-3503-9964-6
ISSN
—
e-ISSN
—
Number of pages
5
Pages from-to
169-173
Publisher name
Institute of Electrical and Electronics Engineers Inc.
Place of publication
—
Event location
Cham
Event date
Jan 1, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—