Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F22%3AGML525B7" target="_blank" >RIV/00216208:11320/22:GML525B7 - isvavai.cz</a>
Result on the web
<a href="https://www.mdpi.com/1099-4300/24/2/280" target="_blank" >https://www.mdpi.com/1099-4300/24/2/280</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.3390/e24020280" target="_blank" >10.3390/e24020280</a>
Alternative languages
Result language
English
Original language name
Frequency, Informativity and Word Length: Insights from Typologically Diverse Corpora
Original language description
Zipf’s law of abbreviation, which posits a negative correlation between word frequency and length, is one of the most famous and robust cross-linguistic generalizations. At the same time, it has been shown that contextual informativity (average surprisal given previous context) is more strongly correlated with word length, although this tendency is not observed consistently and depends on several methodological choices. The present study examines a more diverse sample of languages than previous studies (Arabic, Finnish, Hungarian, Indonesian, Russian, Spanish and Turkish). I use large web-based corpora from the Leipzig Corpora Collection to estimate word lengths in UTF-8 characters and in phonemes (for some of the languages), as well as word frequency, informativity given previous word and informativity given next word, applying different methods of bigram processing. The results show different correlations between word length and the corpus-based measures for different languages. I argue that these differences can be explained by the properties of noun phrases in a language, most importantly, by the order of heads and modifiers and their relative morphological complexity, as well as by orthographic conventions.
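As a rough illustration of the measure named in the description, informativity given the previous word can be read as the average surprisal of a word over all its occurrences, with each conditional probability estimated from bigram counts. The sketch below is a minimal version under that assumption; the function name `informativity_given_previous` and the toy token list are hypothetical, and the study's actual bigram processing, smoothing and corpus filtering are not reproduced here.

```python
import math
from collections import Counter, defaultdict


def informativity_given_previous(tokens):
    """Average surprisal (in bits) of each word given the preceding word.

    P(word | prev) is estimated as count(prev, word) / count(prev),
    with no smoothing (an assumption made for this sketch).
    """
    bigrams = Counter(zip(tokens, tokens[1:]))
    # Count each token only in its role as a left context.
    context_counts = Counter(tokens[:-1])

    total_surprisal = defaultdict(float)
    occurrences = Counter()
    for (prev, word), n in bigrams.items():
        p = n / context_counts[prev]
        total_surprisal[word] += n * -math.log2(p)
        occurrences[word] += n

    return {w: total_surprisal[w] / occurrences[w] for w in total_surprisal}


# Toy input (hypothetical, not data from the study).
tokens = "a b a c a b".split()
info = informativity_given_previous(tokens)

# Pair each word with its length in characters, the quantity that the
# description correlates with frequency and informativity.
rows = sorted((word, len(word), value) for word, value in info.items())
```

Informativity given the next word could be obtained the same way by reversing the token list before calling the function. The first token of a corpus has no left context, so it contributes nothing to the averages.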
Czech name
—
Czech description
—
Classification
Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
—
Others
Publication year
2022
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Entropy
ISSN
1099-4300
e-ISSN
1099-4300
Volume of the periodical
24
Issue of the periodical within the volume
2
Country of publishing house
CH - SWITZERLAND
Number of pages
16
Pages from-to
1-16
UT code for WoS article
000823754800001
EID of the result in the Scopus database
2-s2.0-85125048960