Czech Grammar Error Correction with a Large and Diverse Corpus
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3A10456875" target="_blank" >RIV/00216208:11320/22:10456875 - isvavai.cz</a>
Alternative codes found
RIV/00216208:11210/22:10456875
Result on the web
<a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=~GZ69iKhQ_" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=~GZ69iKhQ_</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1162/tacl_a_00470" target="_blank" >10.1162/tacl_a_00470</a>
Alternative languages
Result language
English
Original language name
Czech Grammar Error Correction with a Large and Diverse Corpus
Original language description
We introduce a large and diverse Czech corpus annotated for grammatical error correction (GEC) with the aim of contributing to the still scarce data resources in this domain for languages other than English. The Grammar Error Correction Corpus for Czech (GECCC) offers a variety of four domains, covering error distributions ranging from high-error-density essays written by non-native speakers to website texts, where errors are expected to be much less common. We compare several Czech GEC systems, including several Transformer-based ones, setting a strong baseline for future research. Finally, we meta-evaluate common GEC metrics against human judgements on our data. We make the new Czech GEC corpus publicly available under the CC BY-SA 4.0 license at http://hdl.handle.net/11234/1-4639.
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
The result was created during the realization of more than one project.
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2022
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Transactions of the Association for Computational Linguistics [online]
ISSN
2307-387X
e-ISSN
—
Volume of the periodical
10
Issue of the periodical within the volume
1
Country of publishing house
US - UNITED STATES
Number of pages
16
Pages from-to
452-467
UT code for WoS article
000923411900003
EID of the result in the Scopus database
2-s2.0-85128897589