The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68378092%3A_____%2F23%3A00584073" target="_blank" >RIV/68378092:_____/23:00584073 - isvavai.cz</a>
Alternative codes found
RIV/00216208:11210/23:10465205
Result on the web
<a href="https://www.euppublishing.com/doi/full/10.3366/word.2023.0230" target="_blank" >https://www.euppublishing.com/doi/full/10.3366/word.2023.0230</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.3366/word.2023.0230" target="_blank" >10.3366/word.2023.0230</a>

Alternative languages

Result language
English
Title in the original language
The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book
Result description in the original language
This paper demonstrates how the corpus grammar tool GramatiKat can be used to improve and refine morphological information in the Internet Language Reference Book (ILRB), which presents complete declension paradigms for 45,632 standard Czech nouns. The paradigm tables are based mainly on morphological types, following structuralist conceptions of language as a fully articulated system. The paper discusses how to update the ILRB and provide users with empirically based grammatical information for individual word forms in each cell of the paradigm. All noun lemmas have been investigated using the GramatiKat tool for research into grammatical categories in Czech. The tool observes the distribution of word forms of a particular lexeme in comparison with the standard distribution across the whole word class. It is capable of identifying nouns that have an unusually high occurrence of a certain word form, as well as nouns with unattested word forms. GramatiKat is based on the data from two corpora of Czech written texts, SYN2015 and SYN2020 (200 million word tokens). The paper investigates the relationship between defectiveness and overabundance on one side and language variability and potentiality on the other. Based on the unique combination of data from the ILRB and GramatiKat, the paper suggests how information about unusually frequent or overabundant word forms as well as unattested ones should be pointed out, so that ILRB provides the user with accurate, empirically based data.

Classification

Type
J<sub>imp</sub> - Article in a periodical indexed in the Web of Science database
CEP field
—
OECD FORD field
60203 - Linguistics

Result linkages

Project
<a href="/cs/project/LM2023044" target="_blank" >LM2023044: Český národní korpus (Czech National Corpus)</a>
Linkages
I - Institutional support for the long-term conceptual development of a research organisation

Other

Year of implementation
2023
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Periodical name
Word Structure
ISSN
1750-1245
e-ISSN
1755-2036
Periodical volume
16
Issue number within the volume
2/3
Publisher's country
GB - United Kingdom of Great Britain and Northern Ireland
Number of pages
25
Pages from-to
233-257
UT WoS code of the article
001099547400005
EID of the result in the Scopus database
2-s2.0-85179302875
