Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F49777513%3A23520%2F21%3A43962417" target="_blank" >RIV/49777513:23520/21:43962417 - isvavai.cz</a>
Výsledek na webu
<a href="https://www.isca-speech.org/archive/interspeech_2021/svec21_interspeech.html" target="_blank" >https://www.isca-speech.org/archive/interspeech_2021/svec21_interspeech.html</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.21437/Interspeech.2021-1704" target="_blank" >10.21437/Interspeech.2021-1704</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings
Popis výsledku v původním jazyce
The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.
Název v anglickém jazyce
Spoken Term Detection and Relevance Score Estimation Using Dot-Product of Pronunciation Embeddings
Popis výsledku anglicky
The paper describes a novel approach to Spoken Term Detection (STD) in large spoken archives using deep LSTM networks. The work is based on the previous approach of using Siamese neural networks for STD and naturally extends it to directly localize a spoken term and estimate its relevance score. The phoneme confusion network generated by a phoneme recognizer is processed by the deep LSTM network which projects each segment of the confusion network into an embedding space. The searched term is projected into the same embedding space using another deep LSTM network. The relevance score is then computed using a simple dot-product in the embedding space and calibrated using a sigmoid function to predict the probability of occurrence. The location of the searched term is then estimated from the sequence of output probabilities. The deep LSTM networks are trained in a self-supervised manner from paired recognition hypotheses on word and phoneme levels. The method is experimentally evaluated on MALACH data in English and Czech languages.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
20205 - Automation and control systems
Návaznosti výsledku
Projekt
<a href="/cs/project/VJ01010108" target="_blank" >VJ01010108: Robustní zpracování nahrávek pro operativu a bezpečnost</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)
Ostatní
Rok uplatnění
2021
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISBN
978-1-71383-690-2
ISSN
2308-457X
e-ISSN
—
Počet stran výsledku
5
Strana od-do
851-855
Název nakladatele
International Speech Communication Association
Místo vydání
Red Hook, NY
Místo konání akce
Brno, Czech Republic
Datum konání akce
30. 8. 2021
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
—