Performance of Czech Speech Recognition with Language Models Created from Public Resources

Identifikátory výsledku

Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F11%3A00185981" target="_blank" >RIV/68407700:21230/11:00185981 - isvavai.cz</a>
Nalezeny alternativní kódy
RIV/46747885:24220/11:#0001963
Výsledek na webu
<a href="http://www.radioeng.cz" target="_blank" >http://www.radioeng.cz</a>
DOI - Digital Object Identifier
—

Alternativní jazyky

Jazyk výsledku
angličtina
Název v původním jazyce
Performance of Czech Speech Recognition with Language Models Created from Public Resources
Popis výsledku v původním jazyce
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus created from the Czech National Corpus. We tested also a LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and when employed in large vocabulary continuous speech recognition systems. Our study show that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.
Název v anglickém jazyce
Performance of Czech Speech Recognition with Language Models Created from Public Resources
Popis výsledku anglicky
In this paper, we investigate the usability of publicly available n-gram corpora for the creation of language models (LM) applicable for Czech speech recognition systems. N-gram LMs with various parameters and settings were created from two publicly available sets, Czech Web 1T 5-gram corpus provided by Google and 5-gram corpus created from the Czech National Corpus. We tested also a LM made of a large private resource of newspaper and broadcast texts collected by a Czech media mining company. The LMs were analyzed and compared via their perplexity rates and when employed in large vocabulary continuous speech recognition systems. Our study show that the Web1T-based LMs, even after intensive cleaning and normalization procedures, cannot compete with those made of smaller but more consistent corpora. The experiments done on large test data also illustrate the impact of Czech as highly inflective language on the perplexity, OOV, and recognition accuracy rates.

Klasifikace

Druh
J<sub>x</sub> - Nezařazeno - Článek v odborném periodiku (Jimp, Jsc a Jost)
CEP obor
JA - Elektronika a optoelektronika, elektrotechnika
OECD FORD obor
—

Návaznosti výsledku

Projekt
Výsledek vznikl pri realizaci vícero projektů. Více informací v záložce Projekty.
Návaznosti
Z - Vyzkumny zamer (s odkazem do CEZ)

Ostatní

Rok uplatnění
2011
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Údaje specifické pro druh výsledku

Název periodika
Radioengineering
ISSN
1210-2512
e-ISSN
—
Svazek periodika
40
Číslo periodika v rámci svazku
4
Stát vydavatele periodika
CZ - Česká republika
Počet stran výsledku
7
Strana od-do
1002-1008
Kód UT WoS článku
000298636800039
EID výsledku v databázi Scopus
—

Podobné výsledky(10)

Language models for spontaneous speech recognition Coverage of Spontaneous Conversational Speech from Nijmegen Corpus of Casual Czech by General ASR Language Models Analysis of Czech Web 1T 5-gram corpus and its comparison with Czech National Corpus Data

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Performance of Czech Speech Recognition with Language Models Created from Public Resources

Identifikátory výsledku

Alternativní jazyky

Klasifikace

Návaznosti výsledku

Ostatní

Údaje specifické pro druh výsledku

Podobné výsledky(10)

Co hledáte?

Rychlé hledání

Chytré vyhledávání

Popis výsledku

Identifikátory výsledku

Identifikátory výsledku

Alternativní jazyky

Alternativní jazyky

Klasifikace

Klasifikace

Návaznosti výsledku

Návaznosti výsledku

Ostatní

Ostatní

Údaje specifické pro druh výsledku

Údaje specifické pro druh výsledku

Podobné výsledky(10)