The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book

The result's identifiers

  • Result code in IS VaVaI

    RIV/68378092:_____/23:00584073 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68378092%3A_____%2F23%3A00584073)

  • Alternative codes found

    RIV/00216208:11210/23:10465205

  • Result on the web

    https://www.euppublishing.com/doi/full/10.3366/word.2023.0230

  • DOI - Digital Object Identifier

    10.3366/word.2023.0230 (http://dx.doi.org/10.3366/word.2023.0230)

Alternative languages

  • Result language

    English

  • Original language name

    The structuralist tradition meets empirical data: Corpus data enhancing the Czech Internet Language Reference Book

  • Original language description

    This paper demonstrates how the corpus grammar tool GramatiKat can be used to improve and refine morphological information in the Internet Language Reference Book (ILRB), which presents complete declension paradigms for 45,632 standard Czech nouns. The paradigm tables are based mainly on morphological types, following structuralist conceptions of language as a fully articulated system. The paper discusses how to update the ILRB and provide users with empirically based grammatical information for individual word forms in each cell of the paradigm. All noun lemmas have been investigated using the GramatiKat tool for research into grammatical categories in Czech. The tool observes the distribution of word forms of a particular lexeme in comparison with the standard distribution across the whole word class. It is capable of identifying nouns that have an unusually high occurrence of a certain word form, as well as nouns with unattested word forms. GramatiKat is based on the data from two corpora of Czech written texts, SYN2015 and SYN2020 (200 million word tokens). The paper investigates the relationship between defectiveness and overabundance on one side and language variability and potentiality on the other. Based on the unique combination of data from the ILRB and GramatiKat, the paper suggests how information about unusually frequent or overabundant word forms as well as unattested ones should be pointed out, so that ILRB provides the user with accurate, empirically based data.

  • Czech name

  • Czech description

Classification

  • Type

    Jimp - Article in a specialist periodical, which is included in the Web of Science database

  • CEP classification

  • OECD FORD branch

    60203 - Linguistics

Result continuities

  • Project

    LM2023044: Czech National Corpus

  • Continuities

    I - Institutional support for the long-term conceptual development of a research organization

Others

  • Publication year

    2023

  • Confidentiality

    S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Name of the periodical

    Word Structure

  • ISSN

    1750-1245

  • e-ISSN

    1755-2036

  • Volume of the periodical

    16

  • Issue of the periodical within the volume

    2/3

  • Country of publishing house

    GB - UNITED KINGDOM

  • Number of pages

    25

  • Pages from-to

    233-257

  • UT code for WoS article

    001099547400005

  • EID of the result in the Scopus database

    2-s2.0-85179302875