Introducing a corpus of non-native Czech with automatic annotation

Result identifiers

Result code in IS VaVaI
RIV/00216208:11210/17:10366730 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11210%2F17%3A10366730)
Result on the web
—
DOI - Digital Object Identifier
—

Alternative languages

Result language
English
Title in the original language
Introducing a corpus of non-native Czech with automatic annotation
Result description in the original language
A learner corpus can be annotated with linguistic categories, target hypotheses and error labels. We show that useful results can be achieved even for non-native Czech by applying methods and tools developed for the standard language. The corpus includes more than 8.6 thousand short essays, totalling nearly one million words. First, the texts are processed by a tagger and lemmatizer. Then, a stochastic spelling and grammar checker is used to propose correct forms for non-words and some incorrect 'real words'. The precision of this step is above 80%. The corrected texts are tagged again. Original and corrected forms are then compared, and error labels are assigned based on formally specifiable criteria. The metadata include, among other things, the author's sex, age, first language, CEFR level of proficiency in Czech, and the task's time limit and topic. The corpus is available online via a search interface or for download.
Title in English
Introducing a corpus of non-native Czech with automatic annotation
Result description in English
A learner corpus can be annotated with linguistic categories, target hypotheses and error labels. We show that useful results can be achieved even for non-native Czech by applying methods and tools developed for the standard language. The corpus includes more than 8.6 thousand short essays, totalling nearly one million words. First, the texts are processed by a tagger and lemmatizer. Then, a stochastic spelling and grammar checker is used to propose correct forms for non-words and some incorrect 'real words'. The precision of this step is above 80%. The corrected texts are tagged again. Original and corrected forms are then compared, and error labels are assigned based on formally specifiable criteria. The metadata include, among other things, the author's sex, age, first language, CEFR level of proficiency in Czech, and the task's time limit and topic. The corpus is available online via a search interface or for download.
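
The final step in the description, comparing original and corrected forms and assigning labels by formal criteria, can be illustrated with a minimal sketch. The function names, the label names and the three rules below are hypothetical, chosen only for illustration; they are not the corpus's actual tagset or the authors' implementation.

```python
import unicodedata
from typing import Optional


def strip_diacritics(word: str) -> str:
    """Remove combining marks, so that 'č' becomes 'c' and 'á' becomes 'a'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def error_label(original: str, corrected: str) -> Optional[str]:
    """Return a hypothetical formal error label for one token pair.

    None means the checker left the form unchanged.
    """
    if original == corrected:
        return None
    # The forms differ only in letter case.
    if original.lower() == corrected.lower():
        return "capitalization"
    # The forms differ only in diacritics (compared case-insensitively).
    if strip_diacritics(original.lower()) == strip_diacritics(corrected.lower()):
        return "diacritics"
    # Anything else is left unclassified in this sketch.
    return "other"


# Hypothetical (original, corrected) token pairs.
pairs = [("cesky", "česky"), ("praha", "Praha"), ("bil", "byl"), ("byl", "byl")]
labels = [error_label(o, c) for o, c in pairs]
```

Each rule inspects only the surface strings, which is one way to read "criteria applicable in a formally specifiable way": no judgment about the writer's intent is needed to apply them.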

Classification

Type
C - Chapter in a specialist book
CEP field
—
OECD FORD field
60203 - Linguistics

Result linkages

Project
LM2011023: Český národní korpus (Czech National Corpus)
Linkages
P - Research and development project financed from public funds (with a link to CEP)

Other

Year of implementation
2017
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the book or collection
Language, Corpora and Cognition
ISBN
978-3-631-70709-8
Number of pages of the result
18
Pages from-to
163-180
Number of pages of the book
296
Publisher name
Peter Lang
Place of publication
Frankfurt am Main
UT WoS code of the chapter
—

Similar results (10)

Non-native use of the verb JÍT/GO in Czech: a corpus study
A New Approach to Automatically Find and Fix Erroneous Labels in Dependency Parsing Treebanks
Improvements to Korektor: A case study with native and non-native Czech
