Applying articulatory features within speech recognition
Result identifiers
Result code in IS VaVaI
RIV/68407700:21230/19:00341897 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F19%3A00341897)
Result on the web
—
DOI - Digital Object Identifier
—
Alternative languages
Result language
English
Title in the original language
Applying articulatory features within speech recognition
Result description in the original language
This thesis investigates Articulatory Features (AF) of speech, with a special focus on improving the recognition of spontaneous Czech speech. Because spontaneous speech is characterized by frequent coarticulation, assimilation, and reduction of phones, and because AF carry information about speech production mechanisms, they may offer a way to improve the results of recognition systems. The potential contribution of the AF-based TANDEM ASR architecture to the recognition and phonetic segmentation of spontaneous speech is described. Multi-valued AF classes for Czech and four East-European languages were defined and unified. Subsequent work focused on estimating AF with artificial neural networks. The suitability of standard and advanced acoustic speech features was analyzed, mainly with respect to the temporal context at the input of the ANN/DNN. AF estimation in mismatched or adverse noisy acoustic conditions was also studied, and DCT-TRAP features proved to be the most robust choice for this task. AF were applied within ASR in the form of an AF-based TANDEM system, whose performance was analyzed for English phone recognition and Czech ASR tasks. A positive impact of this system was observed for standard monophone and triphone systems based on MFCC features. Combining GMM-HMM/DNN-HMM systems with the AF-based TANDEM system at the level of lattices of decoded hypotheses significantly improved on the baseline results. Finally, the phonetic segmentation task was analyzed using various types of acoustic model architectures, with attention to proper pronunciation variant selection. This was done for two tasks: read English and casual Czech. A two-stage forced alignment combining a DNN-HMM with an optimized monophone-based system was proposed, and improved phone boundary determination was demonstrated for both tasks.
Classification
Type
O - Other results
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations