OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AZ7XSFRK3" target="_blank" >RIV/00216208:11320/22:Z7XSFRK3 - isvavai.cz</a>
Result on the web
<a href="https://clinjournal.org/clinj/article/view/157" target="_blank" >https://clinjournal.org/clinj/article/view/157</a>
DOI - Digital Object Identifier
—

Alternative languages

Result language
English
Title in original language
OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization
Result description in original language
We present OpenBoek: a corpus of 103k tokens of classic Dutch novels with annotated coreference and entities. The corpus has several properties that are challenging for current coreference models: long documents (fragments of 10k+ words each), domain-specific literary phenomena, and 19th-century Dutch spelling. Spelling normalization is added to the corpus as an additional annotation layer, using a data-driven rule-based spelling normalization tool. Normalizations are added using meta-annotation, such that evaluation can be performed with annotations on the original texts without losing token alignment. This tool enables the application of parsing and coreference systems originally developed for modern Dutch. We evaluate parsing and coreference systems on the OpenBoek dataset and find that spelling normalization gives a substantial increase in performance. The OpenBoek corpus is available under an open license at https://andreasvc.github.io/openboek/
Title in English
OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization
Result description in English
We present OpenBoek: a corpus of 103k tokens of classic Dutch novels with annotated coreference and entities. The corpus has several properties that are challenging for current coreference models: long documents (fragments of 10k+ words each), domain-specific literary phenomena, and 19th-century Dutch spelling. Spelling normalization is added to the corpus as an additional annotation layer, using a data-driven rule-based spelling normalization tool. Normalizations are added using meta-annotation, such that evaluation can be performed with annotations on the original texts without losing token alignment. This tool enables the application of parsing and coreference systems originally developed for modern Dutch. We evaluate parsing and coreference systems on the OpenBoek dataset and find that spelling normalization gives a substantial increase in performance. The OpenBoek corpus is available under an open license at https://andreasvc.github.io/openboek/

Classification

Type
J<sub>ost</sub> - Other articles in peer-reviewed periodicals
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

Project
—
Linkages
—

Other

Year of implementation
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Periodical name
Computational Linguistics in the Netherlands Journal
ISSN
2211-4009
e-ISSN
1744-4217
Periodical volume
12
Issue of the periodical within the volume
2022-12-22
Publisher country
US - United States of America
Number of pages
17
Pages from-to
235-251
UT WoS article code
—
EID of the result in the Scopus database
—

Similar results (10)

A Dutch coreference resolution system with an evaluation on literary fiction
Is one head enough? Mention heads in coreference annotations compared with UD-style heads
Skript 2015: Akviziční korpus češtiny rodilých mluvčích - přepisy písemných prací žáků základních a středních škol
