Using Language Models to Improve Rule-based Linguistic Annotation of Modern Historical Japanese Corpora

The result's identifiers

Result code in IS VaVaI
RIV/00216208:11320/22:SZHJDYMU - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3ASZHJDYMU)
Result on the web
https://aclanthology.org/2022.latechclfl-1.5
DOI - Digital Object Identifier
—

Alternative languages

Result language
English
Original language name
Using Language Models to Improve Rule-based Linguistic Annotation of Modern Historical Japanese Corpora
Original language description
Annotation of unlabeled textual corpora with linguistic metadata is a fundamental technology in many scholarly workflows in the digital humanities (DH). Pretrained natural language processing pipelines offer tokenization, tagging, and dependency parsing of raw text simultaneously using an annotation scheme like Universal Dependencies (UD). However, the accuracy of these UD tools remains unknown for historical texts and current methods lack mechanisms that enable helpful evaluations by domain experts. To address both points for the case of Modern Historical Japanese text, this paper proposes the use of unsupervised domain adaptation methods to develop a domain-adapted language model (LM) that can flag instances of inaccurate UD output from a pretrained LM and the use of these instances to form rules that, when applied, improve pretrained annotation accuracy. To test the efficacy of the proposed approach, the paper evaluates the domain-adapted LM against three baselines that are not adapted to the historical domain. The experiments conducted demonstrate that the domain-adapted LM improves UD annotation in the Modern Historical Japanese domain and that rules produced using this LM are most indicative of characteristics of the domain in terms of out-of-vocabulary rate and candidate normalized form discovery for “difficult” bigram terms.
Czech name
—
Czech description
—

Classification

Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Others

Publication year
2022
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

Article name in the collection
Proceedings of the 6th Joint SIGHUM Workshop on Computational Linguistics for Cultural Heritage, Social Sciences, Humanities and Literature
ISBN
—
ISSN
2951-2093
e-ISSN
—
Number of pages
10
Pages from-to
30-39
Publisher name
International Conference on Computational Linguistics
Place of publication
—
Event location
Gyeongju, Republic of Korea
Event date
Jan 1, 2022
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
—

Similar results(10)

Natural Language Processing in Cultural Heritage Domains: Integrating Pretrained Language Models with Rule-based Systems for Historical Japanese Corpora
Introducing Morphology in Universal Dependencies Japanese
Text-in-Context: Token-Level Error Detection for Table-to-Text Generation
