A morphologically annotated longitudinal corpus of spoken Czech child-adult interactions

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11210%2F24%3A10471403" target="_blank" >RIV/00216208:11210/24:10471403 - isvavai.cz</a>
Alternative codes found
RIV/00216208:11320/25:HSRPS3XP RIV/00216208:11320/26:ZMNXY9GK
Result on the web
<a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=O5tdJDeU9b" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=O5tdJDeU9b</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10579-023-09710-y" target="_blank" >10.1007/s10579-023-09710-y</a>

Alternative languages

Result language
English
Title in the original language
A morphologically annotated longitudinal corpus of spoken Czech child-adult interactions
Description of the result in the original language
The paper presents a longitudinal corpus of transcribed spontaneous child-adult interactions in Czech. It consists of 99,388 tokens in 42,103 utterances produced by seven children between ca. 1.5 and 3.5 years of age, and 238,211 tokens in 61,252 utterances produced by their close caregivers in everyday situations at home. The corpus covers the children's language production from a mean length of 1.01 words per utterance up to 5.33 words per utterance. The recorded period for individual children ranges from 11 to 27 months. The transcripts of both child and adult utterances were lemmatized and tagged using MorphoDiTa, a tool for automatic morphological analysis of Czech. The annotation was transformed into the MOR format used within CHILDES, a database dedicated to corpora of first language acquisition. Detailed manual checking was performed on the annotation of all children's utterances. Data from three children were used to compare part-of-speech classification before and after manual checking; data from one child were additionally analyzed for differences in morphological tagging proper. The number of differences was rather low, with (expected) limitations in the areas of part-of-speech classification for uninflected words, annotation of homonymous forms, and annotation of child-specific words. The corpus represents an important contribution to research on child language, with special significance for Slavic languages and other morphologically rich inflecting languages, which are still underrepresented in the study of first language acquisition.

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
60203 - Linguistics

Result linkages

Project
The result was created in the course of more than one project. More information is available in the Projects tab.
Linkages
P - Research and development project financed from public sources (with a link to CEP)<br>S - Specific research at universities<br>I - Institutional support for the long-term conceptual development of a research organization

Other

Year of implementation
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Name of the periodical
Language Resources and Evaluation
ISSN
1574-020X
e-ISSN
1574-0218
Volume of the periodical
Not specified
Issue of the periodical within the volume
30.03.2024
Country of the periodical's publisher
NL - Netherlands
Number of pages
24
Pages from-to
1-24
UT WoS code of the article
001194629700002
EID of the result in the Scopus database
—

Similar results (10)

Chromá Czech Corpus
První korpus mluvčích češtiny v dětském věku
Cross-linguistically consistent semantic and syntactic annotation of child-directed speech
