Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F25%3AVLN936EW" target="_blank" >RIV/00216208:11320/25:VLN936EW - isvavai.cz</a>
Result on the web
<a href="https://www.scopus.com/inward/record.uri?eid=2-s2.0-85195902793&partnerID=40&md5=f3dad4cceb57390264a080384872eb53" target="_blank" >https://www.scopus.com/inward/record.uri?eid=2-s2.0-85195902793&partnerID=40&md5=f3dad4cceb57390264a080384872eb53</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Chinese Sequence Labeling with Semi-Supervised Boundary-Aware Language Model Pre-training
Original language description
Chinese sequence labeling tasks are heavily reliant on accurate word boundary demarcation. Although current pre-trained language models (PLMs) have achieved substantial gains on these tasks, they rarely explicitly incorporate boundary information into the modeling process. An exception to this is BABERT (Jiang et al., 2022), which incorporates unsupervised statistical boundary information into Chinese BERT's pre-training objectives. Building upon this approach, we incorporate supervised, high-quality boundary information to enhance BABERT's learning, developing a semi-supervised boundary-aware PLM. To assess PLMs' ability to encode boundaries, we introduce a novel “Boundary Information Metric” that is both simple and effective. This metric allows comparison of different PLMs without task-specific fine-tuning. Experimental results on Chinese sequence labeling datasets demonstrate that the improved BABERT variant outperforms the vanilla version, not only on these tasks but also more broadly across a range of Chinese natural language understanding tasks. Additionally, our proposed metric offers a convenient and accurate means of evaluating PLMs' boundary awareness. © 2024 ELRA Language Resource Association: CC BY-NC 4.0.
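The record names a “Boundary Information Metric” but does not give its formula. Purely as a hypothetical illustration of what a fine-tuning-free boundary probe could look like (an assumption, not the paper's actual metric), one might compare how similar adjacent character representations are within gold words versus across gold word boundaries:

```python
# Hypothetical sketch of a boundary-awareness probe (NOT the paper's metric):
# a boundary-aware PLM should make adjacent characters inside a word more
# similar to each other than adjacent characters straddling a word boundary.
import numpy as np

def boundary_awareness_score(char_embeddings: np.ndarray, word_lengths: list[int]) -> float:
    """char_embeddings: (num_chars, dim) contextual vectors for one sentence.
    word_lengths: gold segmentation, e.g. [2, 1, 2] for a 5-character sentence.
    Returns mean within-word similarity minus mean cross-boundary similarity."""
    # indices i such that a word boundary falls between character i and i+1
    boundary_after = set(np.cumsum(word_lengths)[:-1] - 1)

    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    within, across = [], []
    for i in range(len(char_embeddings) - 1):
        sim = cos(char_embeddings[i], char_embeddings[i + 1])
        (across if i in boundary_after else within).append(sim)
    return float(np.mean(within) - np.mean(across))

# toy usage: random vectors stand in for a PLM's per-character embeddings
rng = np.random.default_rng(0)
emb = rng.normal(size=(5, 768))
print(boundary_awareness_score(emb, word_lengths=[2, 1, 2]))
```

A larger gap would then indicate stronger boundary encoding; this is only one plausible reading of a metric that “allows comparison of different PLMs without task-specific fine-tuning.”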
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2024
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Jt. Int. Conf. Comput. Linguist., Lang. Resour. Eval., LREC-COLING - Main Conf. Proc.
ISBN
978-2-493814-10-4
ISSN
—
e-ISSN
—
Number of pages
13
Pages from-to
3179-3191
Publisher name
European Language Resources Association (ELRA)
Place of publication
—
Event location
Torino, Italia
Event date
May 20, 2024 - May 25, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—