Performance of Czech Speech Recognition with Language Models Created from Public Resources

The result's identifiers

  • Result code in IS VaVaI

    RIV/68407700:21230/11:00185981 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F11%3A00185981)

  • Alternative codes found

    RIV/46747885:24220/11:#0001963

  • Result on the web

    http://www.radioeng.cz

  • DOI - Digital Object Identifier

Alternative languages

  • Result language

    English

  • Original language name

    Performance of Czech Speech Recognition with Language Models Created from Public Resources

  • Original language description

    In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus created from the Czech National Corpus. We also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and by their performance when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made from smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, out-of-vocabulary (OOV), and recognition accuracy rates.

  • Czech name

  • Czech description
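
The description above compares LMs by perplexity and OOV rate. As a rough illustration of what those two measures compute, here is a minimal sketch in Python. It is not the authors' method: it assumes a toy bigram model with add-one smoothing (the paper works with 5-gram corpora and does not state its smoothing here), and the function names and the tiny Czech example sentences are hypothetical.

```python
import math
from collections import Counter


def oov_rate(vocab, test_tokens):
    """Fraction of test tokens that do not occur in the vocabulary."""
    missing = sum(1 for tok in test_tokens if tok not in vocab)
    return missing / len(test_tokens)


def bigram_perplexity(train_tokens, test_tokens):
    """Perplexity of an add-one-smoothed bigram model on test_tokens.

    Assumption: unseen test words are mapped to a single <unk> token,
    which is counted as part of the vocabulary.
    """
    vocab = set(train_tokens) | {"<unk>"}
    # Count each word as a context only where it has a successor,
    # so the smoothed probabilities sum to 1 for every context.
    contexts = Counter(train_tokens[:-1])
    bigrams = Counter(zip(train_tokens, train_tokens[1:]))

    mapped = [t if t in vocab else "<unk>" for t in test_tokens]
    log_prob = 0.0
    for prev, cur in zip(mapped, mapped[1:]):
        p = (bigrams[(prev, cur)] + 1) / (contexts[prev] + len(vocab))
        log_prob += math.log(p)

    n_predictions = len(mapped) - 1
    return math.exp(-log_prob / n_predictions)


# Toy data (hypothetical): "praze" is an inflected form of "praha",
# so it is out of vocabulary even though its lemma was seen in training.
train = "praha je město a praha je velká".split()
test = "v praze je město".split()

print("OOV rate:", oov_rate(set(train), test))
print("Perplexity:", bigram_perplexity(train, test))
```

In the toy data, the inflected form counts as an unseen word, which is one way a highly inflective language can raise both the OOV rate and the perplexity for a fixed vocabulary size.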

Classification

  • Type

    Jx - Unclassified - Peer-reviewed scientific article (Jimp, Jsc and Jost)

  • CEP classification

    JA - Electronics and optoelectronics

  • OECD FORD branch

Result continuities

  • Project

    Result was created during the realization of more than one project. More information in the Projects tab.

  • Continuities

    Z - Research plan (with a link to CEZ)

Others

  • Publication year

    2011

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Name of the periodical

    Radioengineering

  • ISSN

    1210-2512

  • e-ISSN

  • Volume of the periodical

    40

  • Issue of the periodical within the volume

    4

  • Country of publishing house

    CZ - CZECH REPUBLIC

  • Number of pages

    7

  • Pages from-to

    1002-1008

  • UT code for WoS article

    000298636800039

  • EID of the result in the Scopus database