OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AZ7XSFRK3" target="_blank" >RIV/00216208:11320/22:Z7XSFRK3 - isvavai.cz</a>
Result on the web
<a href="https://clinjournal.org/clinj/article/view/157" target="_blank" >https://clinjournal.org/clinj/article/view/157</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization
Original language description
We present OpenBoek: a corpus of 103k tokens of classic Dutch novels with annotated coreference and entities. The corpus has several properties that are challenging for current coreference models: long documents (fragments of 10k+ words each), domain-specific literary phenomena, and 19th-century Dutch spelling. Spelling normalization is added to the corpus as an additional annotation layer, using a data-driven, rule-based spelling normalization tool. Normalizations are added using meta-annotation, such that evaluation can be performed with annotations on the original texts without losing token alignment. This tool enables the application of parsing and coreference systems originally developed for modern Dutch. We evaluate parsing and coreference systems on the OpenBoek dataset and find that spelling normalization gives a substantial increase in performance. The OpenBoek corpus is available under an open license at https://andreasvc.github.io/openboek/
Czech name
—
Czech description
—
Classification
Type
J<sub>ost</sub> - Miscellaneous article in a specialist periodical
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2022
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Computational Linguistics in the Netherlands Journal
ISSN
2211-4009
e-ISSN
1744-4217
Volume of the periodical
12
Issue of the periodical within the volume
2022-12-22
Country of publishing house
US - UNITED STATES
Number of pages
17
Pages from-to
235-251
UT code for WoS article
—
EID of the result in the Scopus database
—