Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AGML525B7" target="_blank" >RIV/00216208:11320/22:GML525B7 - isvavai.cz</a>
Result on the web
<a href="https://www.mdpi.com/1099-4300/24/2/280" target="_blank" >https://www.mdpi.com/1099-4300/24/2/280</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.3390/e24020280" target="_blank" >10.3390/e24020280</a>

Alternative languages

Result language
English
Title in the original language
Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
Result description in the original language
Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
Title in English
Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
Result description in English
Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently, depending on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given the previous word and informativity given the next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
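
As an illustration of the measure named in the description, the following minimal sketch computes "informativity given previous word" as the average surprisal of a word over its occurrences, using plain maximum-likelihood bigram estimates. The function name, the toy token list, and the absence of smoothing or frequency thresholds are assumptions made for the example; this is not the study's actual processing pipeline.

```python
import math
from collections import Counter, defaultdict


def informativity_given_previous(tokens):
    """Average surprisal (in bits) of each word given the preceding word.

    Hypothetical helper: probabilities are unsmoothed maximum-likelihood
    estimates, P(word | prev) = count(prev, word) / count(prev as context).
    """
    context_counts = Counter(tokens[:-1])        # how often each word is a left context
    bigram_counts = Counter(zip(tokens, tokens[1:]))

    surprisal_sum = defaultdict(float)           # summed surprisal per word
    occurrences = Counter()                      # occurrences that have a left context

    for (prev, word), n in bigram_counts.items():
        p = n / context_counts[prev]
        surprisal_sum[word] += -n * math.log2(p)
        occurrences[word] += n

    return {w: surprisal_sum[w] / occurrences[w] for w in occurrences}


# Toy input (an assumption, not corpus data)
tokens = "a b a b a c".split()
info = informativity_given_previous(tokens)
```

In a sketch like this, word length could be taken as `len(word)` (Unicode code points) and rank-correlated with both raw frequency and the informativity values, which is the kind of comparison the description refers to.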

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
—
Continuities
—

Other

Publication year
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Name of the periodical
Entropy
ISSN
1099-4300
e-ISSN
1099-4300
Volume of the periodical
24
Issue of the periodical within the volume
2
Country of the periodical's publisher
CH - Swiss Confederation
Number of pages of the result
16
Pages from-to
1-16
UT WoS code of the article
000823754800001
EID of the result in the Scopus database
2-s2.0-85125048960

Similar results (10)

Calc: Corpus Calculator
Distribution of words across the first years of life: A longitudinal analysis of everyday language input to three English-learning infants
Average Word Length from the Diachronic Perspective: The Case of Arabic
