OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization

The result's identifiers

  • Result code in IS VaVaI

    RIV/00216208:11320/22:Z7XSFRK3 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AZ7XSFRK3)

  • Result on the web

    https://clinjournal.org/clinj/article/view/157

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    OpenBoek: A Corpus of Literary Coreference and Entities with an Exploration of Historical Spelling Normalization

  • Original language description

    We present OpenBoek: a corpus of 103k tokens of classic Dutch novels with annotated coreference and entities. The corpus has several properties that are challenging for current coreference models: long documents (fragments of 10k+ words each), domain-specific literary phenomena, and 19th-century Dutch spelling. Spelling normalization is added to the corpus as an additional annotation layer, using a data-driven rule-based spelling normalization tool. Normalizations are added using meta-annotation, such that evaluation can be performed with annotations on the original texts without losing token alignment. This tool enables the application of parsing and coreference systems originally developed for modern Dutch. We evaluate parsing and coreference systems on the OpenBoek dataset and find that spelling normalization gives a substantial increase in performance. The OpenBoek corpus is available under an open license at https://andreasvc.github.io/openboek/

  • Czech name

  • Czech description

Classification

  • Type

    Jost - Miscellaneous article in a specialist periodical

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

  • Continuities

Others

  • Publication year

    2022

  • Confidentiality

    S - Complete and truthful data about the project are not subject to protection under special legal regulations

Data specific for result type

  • Name of the periodical

    Computational Linguistics in the Netherlands Journal

  • ISSN

    2211-4009

  • e-ISSN

    1744-4217

  • Volume of the periodical

    12

  • Issue of the periodical within the volume

    2022-12-22

  • Country of publishing house

    US - UNITED STATES

  • Number of pages

    17

  • Pages from-to

    235-251

  • UT code for WoS article

  • EID of the result in the Scopus database