Expressing Time in English and Czech Children's Literature: A Contrastive N-gram-Based Study of Typologically Distant Languages
Result identifiers
Result code in IS VaVaI
RIV/00216208:11210/19:10397901 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11210%2F19%3A10397901)
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in the original language
Expressing Time in English and Czech Children's Literature: A Contrastive N-gram-Based Study of Typologically Distant Languages
Result description in the original language
The study addresses two issues raised by previous studies dealing with children's literature and phraseology. First, we explore how TIME is expressed in English and Czech children's fiction (cf. Hunt, 2005; Thompson & Sealey, 2007). Our approach relies on the neo-Firthian phraseological tradition, "where meaning... is said to reside in multi-word units rather than single words" (Ebeling & Ebeling, 2013: 65). Second, because the study is data-driven and based on n-gram extraction, it raises the question of "the potential contribution" of n-gram-based approaches to language comparison (Granger, 2014). N-grams appear to be a useful starting point when comparing typologically related languages, but rather "challenging" when dealing with distant ones, e.g. predominantly analytical English and inflectional Czech (Čermáková & Chlumská, 2017; Hasselgård, 2017; Ebeling & Ebeling, 2013). The study uses comparable English and Czech corpora of children's fiction: two small ones (650,000 words each) and two large ones (2,700,000 words each; sub-corpora of the Czech National Corpus (SYN) and the British National Corpus). For technical reasons, queries in the large corpora are restricted to 250,000 hits. The small corpora enabled detailed examination; the large ones served to verify our small-corpus findings, supplementing them with lemma and POS queries. We extracted 2-5-grams (i.e. continuous sequences of 2-5 words, excluding punctuation) from the smaller corpora. The numbers of n-grams above the threshold are consistently higher in English. The ratios suggest a larger extent of recurrent patterning in analytical English than in Czech, which is characterized by high morphological variability and free word order (cf. the Czech 4-grams se nedá nic dělat, nedá se nic dělat, nedalo se nic dělat). Higher type/token ratios in Czech likewise point to its greater variability. Another difference is the higher representation of verbs among the most frequent n-grams in Czech (e.g. se vydal na cestu) and of prepositional phrases in English (e.g. for a long time). This again accords with typological expectations, with Czech generally preferring (finite) verbal expression and English being more 'nominal'. The POS observations highlighted the importance of verbs for Czech but also their high morphological variability as a potential hindrance to the n-gram approach. Frequent 3-5-grams in the small corpora were classified semantically, and we then focused on TIME n-grams. The expression of TIME tends to rely on n-grams comprising temporal nouns in English (e.g. end, time, moment), while in Czech adverbs and conjunctions were salient (pak, hned, když), pointing to the 'nominal' vs. 'verbal' character of English and Czech, respectively. The recurrent lexemes can then be used to identify (partly lemmatized) patterns expressing TIME in both languages (e.g. a pak SE, by the time) (Ebeling & Ebeling, 2013; Gries, 2008). The n-gram method proved a useful starting point in corpus-driven cross-linguistic genre analysis, highlighting typological characteristics of the languages compared. Owing to the limitations of the n-gram method in Czech, a combination of approaches seems beneficial, including semantic analysis, partial lemmatization and n-gram-based patterns.
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
60203 - Linguistics
Result continuities
Project
—
Continuities
S - Specific university research
I - Institutional support for the long-term conceptual development of a research organization
Other
Year of implementation
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Title of the article in the proceedings
Language Use and Linguistic Structure: Proceedings of the Olomouc Linguistics Colloquium 2018
ISBN
978-80-244-5525-9
ISSN
—
e-ISSN
—
Number of pages
15
Pages from-to
469-483
Publisher name
Palacký University
Place of publication
Olomouc
Event location
Olomouc: Palacký University
Event date
7 June 2018
Event type by nationality
EUR - European event
UT WoS article code
—