Vše

Co hledáte?

Vše
Projekty
Výsledky výzkumu
Subjekty

Rychlé hledání

  • Projekty podpořené TA ČR
  • Významné projekty
  • Projekty s nejvyšší státní podporou
  • Aktuálně běžící projekty

Chytré vyhledávání

  • Takto najdu konkrétní +slovo
  • Takto z výsledků -slovo zcela vynechám
  • “Takto můžu najít celou frázi”

Comparing web-crawled and traditional corpora

Identifikátory výsledku

  • Kód výsledku v IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11210%2F20%3A10415257" target="_blank" >RIV/00216208:11210/20:10415257 - isvavai.cz</a>

  • Výsledek na webu

    <a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=5~MgQz0ASE" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=5~MgQz0ASE</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/s10579-020-09487-4" target="_blank" >10.1007/s10579-020-09487-4</a>

Alternativní jazyky

  • Jazyk výsledku

    angličtina

  • Název v původním jazyce

    Comparing web-crawled and traditional corpora

  • Popis výsledku v původním jazyce

    Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a &quot;traditional&quot; corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the &quot;searchable&quot; web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).

  • Název v anglickém jazyce

    Comparing web-crawled and traditional corpora

  • Popis výsledku anglicky

    Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a &quot;traditional&quot; corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the &quot;searchable&quot; web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).

Klasifikace

  • Druh

    J<sub>imp</sub> - Článek v periodiku v databázi Web of Science

  • CEP obor

  • OECD FORD obor

    60203 - Linguistics

Návaznosti výsledku

  • Projekt

  • Návaznosti

    O - Projekt operacniho programu

Ostatní

  • Rok uplatnění

    2020

  • Kód důvěrnosti údajů

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

  • Název periodika

    Language Resources and Evaluation

  • ISSN

    1574-020X

  • e-ISSN

  • Svazek periodika

    54

  • Číslo periodika v rámci svazku

    3

  • Stát vydavatele periodika

    NL - Nizozemsko

  • Počet stran výsledku

    33

  • Strana od-do

    713-745

  • Kód UT WoS článku

    000520997900001

  • EID výsledku v databázi Scopus

    2-s2.0-85082853232