Multimodal speech recognition: increasing accuracy using high speed video data

Result identifiers

  • Result code in IS VaVaI

    RIV/49777513:23520/18:43952641 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F18%3A43952641)

  • Result on the web

    http://dx.doi.org/10.1007/s12193-018-0267-1

  • DOI - Digital Object Identifier

    10.1007/s12193-018-0267-1

Alternative languages

  • Result language

    English

  • Title in the original language

    Multimodal speech recognition: increasing accuracy using high speed video data

  • Result description in the original language

    To date, multimodal speech recognition systems based on the processing of audio and video signals show significantly better results than their unimodal counterparts. In general, researchers divide the audio–visual speech recognition problem into two parts: first, extracting the most informative features from each modality, and second, finding the most successful way of fusing the two modalities. Ultimately, this leads to an improvement in the accuracy of speech recognition. Almost all modern studies use this approach with video data recorded at the standard speed of 25 frames per second. This choice is easily explained, since the vast majority of existing audio–visual databases are recorded at this rate. However, it should be noted that 25 frames per second is a world standard for many areas and has never been specifically calculated for speech recognition tasks. The main purpose of this study is to investigate the effect of high-speed video data (up to 200 frames per second) on speech recognition accuracy, and to find out whether the use of a high-speed video camera makes speech recognition systems more robust to acoustic noise. To this end, we recorded a database of audio–visual Russian speech with high-speed video recordings, which consists of recordings of 20 speakers, each pronouncing 200 phrases of continuous Russian speech. Experiments performed on this database showed an improvement in the absolute speech recognition rate of up to 3.10%. We also showed that the use of a high-speed camera at 200 fps achieves better recognition results under different acoustically noisy conditions (signal-to-noise ratio varied between 40 and 0 dB) with different types of noise (e.g. white noise, babble noise).

Classification

  • Type

    Jimp - Article in a periodical indexed in the Web of Science database

  • CEP field

  • OECD FORD field

    20205 - Automation and control systems

Result linkages

  • Project

    LO1506: Podpora udržitelnosti centra NTIS - Nové technologie pro informační společnost (Sustainability support of the centre NTIS - New Technologies for the Information Society)

  • Linkages

    P - Research and development project financed from public funds (with a link to CEP)

Other

  • Year of implementation

    2018

  • Data confidentiality code

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

  • Periodical name

    Journal on Multimodal User Interfaces

  • ISSN

    1783-7677

  • e-ISSN

  • Periodical volume

    12

  • Issue number within the volume

    4

  • Publisher country

    US - United States of America

  • Number of pages

    10

  • Pages from-to

    319-328

  • UT WoS article code

    000448519400006

  • Result EID in the Scopus database

    2-s2.0-85051679221