Performance of Czech Speech Recognition with Language Models Created from Public Resources
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F11%3A00185981" target="_blank" >RIV/68407700:21230/11:00185981 - isvavai.cz</a>
Alternative codes found
RIV/46747885:24220/11:#0001963
Result on the web
<a href="http://www.radioeng.cz" target="_blank" >http://www.radioeng.cz</a>
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Original language name
Performance of Czech Speech Recognition with Language Models Created from Public Resources
Original language description
In this paper, we investigate the usability of publicly available n-gram corpora for creating language models (LMs) applicable to Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets: the Czech Web 1T 5-gram corpus provided by Google and a 5-gram corpus created from the Czech National Corpus. We also tested an LM built from a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared by their perplexity and by their performance when employed in large vocabulary continuous speech recognition systems. Our study shows that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made from smaller but more consistent corpora. The experiments on large test data also illustrate the impact of Czech as a highly inflective language on the perplexity, out-of-vocabulary (OOV), and recognition accuracy rates.
Czech name
—
Czech description
—
Classification
Type
J<sub>x</sub> - Unclassified - Peer-reviewed scientific article (Jimp, Jsc and Jost)
CEP classification
JA - Electronics and optoelectronics
OECD FORD branch
—
Result continuities
Project
The result was created during the implementation of more than one project. More information is available in the Projects tab.
Continuities
Z - Research plan (with a link to CEZ)
Others
Publication year
2011
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Name of the periodical
Radioengineering
ISSN
1210-2512
e-ISSN
—
Volume of the periodical
40
Issue of the periodical within the volume
4
Country of publishing house
CZ - CZECH REPUBLIC
Number of pages
7
Pages from-to
1002-1008
UT code for WoS article
000298636800039
EID of the result in the Scopus database
—