All

What are you looking for?

All
Projects
Results
Organizations

Quick search

  • Projects supported by TA ČR
  • Excellent projects
  • Projects with the highest public support
  • Current projects

Smart search

  • That is how I find a specific +word
  • That is how I leave the -word out of the results
  • “That is how I can find the whole phrase”

Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data

The result's identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F10%3A00169505" target="_blank" >RIV/68407700:21230/10:00169505 - isvavai.cz</a>

  • Result on the web

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    angličtina

  • Original language name

    Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data

  • Original language description

    In this paper, newly issued Czech Web 1T 5-grams corpus created by Google and LDC is analysed and compared with reference n-gram corpus obtained from Czech National Corpus. Original 5-grams from both corpora were post-processed and statistical trigram language models of various vocabulary sizes and parameters were created. The comparison of various corpus statistics such as unique and total word and n-gram counts before and after post-processing is presented and discussed, especially with the focus on clearing Web 1T data from invalid tokens. The tools from HTK Toolkit were used for the evaluation and accuracy, OOV rates and perplexity were measured using sentence transcriptions from Czech SPEECON database.

  • Czech name

  • Czech description

Classification

  • Type

    J<sub>x</sub> - Unclassified - Peer-reviewed scientific article (Jimp, Jsc and Jost)

  • CEP classification

    JA - Electronics and optoelectronics

  • OECD FORD branch

Result continuities

  • Project

    <a href="/en/project/GA102%2F08%2F0707" target="_blank" >GA102/08/0707: Speech Recognition under Real-World Conditions</a><br>

  • Continuities

    Z - Vyzkumny zamer (s odkazem do CEZ)

Others

  • Publication year

    2010

  • Confidentiality

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

  • Name of the periodical

    Lecture Notes in Artificial Intelligence

  • ISSN

    0302-9743

  • e-ISSN

  • Volume of the periodical

    6231

  • Issue of the periodical within the volume

    2010933819

  • Country of publishing house

    DE - GERMANY

  • Number of pages

    8

  • Pages from-to

  • UT code for WoS article

    000288619400024

  • EID of the result in the Scopus database