Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F10%3A00169505" target="_blank" >RIV/68407700:21230/10:00169505 - isvavai.cz</a>
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data
Original language description
This paper analyses the newly issued Czech Web 1T 5-gram corpus, created by Google and LDC, and compares it with a reference n-gram corpus obtained from the Czech National Corpus. The original 5-grams from both corpora were post-processed, and statistical trigram language models with various vocabulary sizes and parameters were created. Corpus statistics such as unique and total word and n-gram counts, before and after post-processing, are presented and discussed, with particular focus on cleaning invalid tokens from the Web 1T data. Tools from the HTK Toolkit were used for the evaluation; accuracy, OOV rates, and perplexity were measured using sentence transcriptions from the Czech SPEECON database.
Czech name
—
Czech description
—
Classification
Type
J<sub>x</sub> - Unclassified - Peer-reviewed scientific article (Jimp, Jsc and Jost)
CEP classification
JA - Electronics and optoelectronics
OECD FORD branch
—
Result continuities
Project
<a href="/en/project/GA102%2F08%2F0707" target="_blank" >GA102/08/0707: Speech Recognition under Real-World Conditions</a>
Continuities
Z - Research plan (with a link to CEZ)
Others
Publication year
2010
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Lecture Notes in Artificial Intelligence
ISSN
0302-9743
e-ISSN
—
Volume of the periodical
6231
Issue of the periodical within the volume
2010933819
Country of publishing house
DE - GERMANY
Number of pages
8
Pages from-to
—
UT code for WoS article
000288619400024
EID of the result in the Scopus database
—