Comparing web-crawled and traditional corpora

The result's identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11210%2F20%3A10415257" target="_blank" >RIV/00216208:11210/20:10415257 - isvavai.cz</a>
Result on the web
<a href="https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=5~MgQz0ASE" target="_blank" >https://verso.is.cuni.cz/pub/verso.fpl?fname=obd_publikace_handle&handle=5~MgQz0ASE</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/s10579-020-09487-4" target="_blank" >10.1007/s10579-020-09487-4</a>

Alternative languages

Result language
English
Original language name
Comparing web-crawled and traditional corpora
Original language description
Using a multi-dimensional (MD) analysis of register variability, the study compares two corpora of Czech: Koditex, a "traditional" corpus carefully designed using various sources with rich metadata, and Araneum Bohemicum Maximum, a web-crawled corpus with an opportunistic composition representative of the "searchable" web. Both types of corpora are projected onto the space induced by the MD model, with the main objective being to find out whether they overlap in the linguistic variation they cover, or whether one introduces some specific variation which cannot be found in the other. We also document a crucial methodological point which has broader relevance for MD analyses in general, namely that texts have to be of similar lengths in order for their scores on the dimensions to be comparable. Results indicate that some traditional text categories, such as journalism or non-fiction, are characterized by language phenomena which are equally well covered by web-crawled data, though of course traditional corpora keep their edge in terms of the richness of the accompanying metadata. But overall, the range of variation in Koditex is broader as it contains texts which have no adequate substitute (i.e. texts with a comparable set of linguistic characteristics, regardless of their extratextual label) in data acquired through general-purpose web-crawling techniques. These include informal conversations, private correspondence, some types of fiction, but also user-generated content (comments on Facebook, forums etc.).
Czech name
—
Czech description
—

Classification

Type
J<sub>imp</sub> - Article in a specialist periodical, which is included in the Web of Science database
CEP classification
—
OECD FORD branch
60203 - Linguistics

Result continuities

Project
—
Continuities
O - Operational programme project

Others

Publication year
2020
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

Name of the periodical
Language Resources and Evaluation
ISSN
1574-020X
e-ISSN
—
Volume of the periodical
54
Issue of the periodical within the volume
3
Country of publishing house
NL - THE KINGDOM OF THE NETHERLANDS
Number of pages
33
Pages from-to
713-745
UT code for WoS article
000520997900001
EID of the result in the Scopus database
2-s2.0-85082853232

