Introducing a corpus of non-native Czech with automatic annotation
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11210%2F17%3A10366730" target="_blank" >RIV/00216208:11210/17:10366730 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in the original language
Introducing a corpus of non-native Czech with automatic annotation
Result description in the original language
A learner corpus can be annotated with linguistic categories, target hypotheses, and error labels. We show that useful results can be achieved even for non-native Czech by applying methods and tools developed for the standard language. The corpus includes more than 8.6 thousand short essays, totalling nearly one million words. First, the texts are processed by a tagger and lemmatizer. Then, a stochastic spelling and grammar checker proposes correct forms for non-words and for some incorrect 'real words'. The precision of this step is above 80%. The corrected texts are then tagged again. Original and corrected forms are compared, and error labels are assigned based on formally specifiable criteria. The metadata include, among other things, the author's sex, age, first language, and CEFR level of proficiency in Czech, as well as the task's time limit and topic. The corpus is available online via a search interface or for download.
Title in English
Introducing a corpus of non-native Czech with automatic annotation
Result description in English
A learner corpus can be annotated with linguistic categories, target hypotheses, and error labels. We show that useful results can be achieved even for non-native Czech by applying methods and tools developed for the standard language. The corpus includes more than 8.6 thousand short essays, totalling nearly one million words. First, the texts are processed by a tagger and lemmatizer. Then, a stochastic spelling and grammar checker proposes correct forms for non-words and for some incorrect 'real words'. The precision of this step is above 80%. The corrected texts are then tagged again. Original and corrected forms are compared, and error labels are assigned based on formally specifiable criteria. The metadata include, among other things, the author's sex, age, first language, and CEFR level of proficiency in Czech, as well as the task's time limit and topic. The corpus is available online via a search interface or for download.
Classification
Type
C - Chapter in a scholarly book
CEP field
—
OECD FORD field
60203 - Linguistics
Result linkages
Project
<a href="/cs/project/LM2011023" target="_blank" >LM2011023: Český národní korpus</a> (Czech National Corpus)
Linkages
P - Research and development project financed from public funds (with a link to CEP)
Other
Year of implementation
2017
Data confidentiality code
S - Complete and accurate project data are not subject to protection under special legal regulations
Data specific to the result type
Title of the book or collection
Language, Corpora and Cognition
ISBN
978-3-631-70709-8
Number of pages of the result
18
Pages from-to
163-180
Number of pages of the book
296
Publisher name
Peter Lang
Place of publication
Frankfurt am Main
UT WoS code of the chapter
—