Performance of Czech Speech Recognition with Language Models Created from Public Resources
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F11%3A00185981" target="_blank" >RIV/68407700:21230/11:00185981 - isvavai.cz</a>
Nalezeny alternativní kódy
RIV/46747885:24220/11:#0001963
Výsledek na webu
<a href="http://www.radioeng.cz" target="_blank" >http://www.radioeng.cz</a>
DOI - Digital Object Identifier
—
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Performance of Czech Speech Recognition with Language Models Created from Public Resources
Popis výsledku v původním jazyce
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus created from the Czech National Corpus. We tested also a LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and when employed in large vocabulary continuous speech recognition systems. Our study show that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.
Název v anglickém jazyce
Performance of Czech Speech Recognition with Language Models Created from Public Resources
Popis výsledku anglicky
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus created from the Czech National Corpus. We tested also a LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and when employed in large vocabulary continuous speech recognition systems. Our study show that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.
Klasifikace
Druh
J<sub>x</sub> - Nezařazeno - Článek v odborném periodiku (Jimp, Jsc a Jost)
CEP obor
JA - Elektronika a optoelektronika, elektrotechnika
OECD FORD obor
—
Návaznosti výsledku
Projekt
Výsledek vznikl pri realizaci vícero projektů. Více informací v záložce Projekty.
Návaznosti
Z - Vyzkumny zamer (s odkazem do CEZ)
Ostatní
Rok uplatnění
2011
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název periodika
Radioengineering
ISSN
1210-2512
e-ISSN
—
Svazek periodika
40
Číslo periodika v rámci svazku
4
Stát vydavatele periodika
CZ - Česká republika
Počet stran výsledku
7
Strana od-do
1002-1008
Kód UT WoS článku
000298636800039
EID výsledku v databázi Scopus
—