Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F12%3A%230002003" target="_blank" >RIV/46747885:24220/12:#0002003 - isvavai.cz</a>
Výsledek na webu
—
DOI - Digital Object Identifier
—
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams
Popis výsledku v původním jazyce
In this paper we study the effect of incorporation of automatic transcriptions in the speaker diarization process. We aim to improve both the diarization accuracy as evaluated by standard objective measures and quality of the diarization output from user?s perspective. Although the presented approach relies on output of an automatic speech recognizer, it makes no use of lexical information. Instead, we use information about word boundaries and classification of non-speech events occurring in the processed stream. The former information is used as constraining condition for speaker change-point candidates and the latter facilitate to neglect various vocal noise sounds that carry no speaker-specific information (considering representation of the signal by cepstral features) and thus harm the speaker?s representation. The experimental evaluation of the presented approach was carried out using the COST278 multilingual broadcast news database. We demonstrate that the approach yields improve
Název v anglickém jazyce
Incorporation of the ASR output in speaker segmentation and clustering within the task of speaker diarization of broadcast streams
Popis výsledku anglicky
In this paper we study the effect of incorporation of automatic transcriptions in the speaker diarization process. We aim to improve both the diarization accuracy as evaluated by standard objective measures and quality of the diarization output from user?s perspective. Although the presented approach relies on output of an automatic speech recognizer, it makes no use of lexical information. Instead, we use information about word boundaries and classification of non-speech events occurring in the processed stream. The former information is used as constraining condition for speaker change-point candidates and the latter facilitate to neglect various vocal noise sounds that carry no speaker-specific information (considering representation of the signal by cepstral features) and thus harm the speaker?s representation. The experimental evaluation of the presented approach was carried out using the COST278 multilingual broadcast news database. We demonstrate that the approach yields improve
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
JC - Počítačový hardware a software
OECD FORD obor
—
Návaznosti výsledku
Projekt
<a href="/cs/project/TA01011204" target="_blank" >TA01011204: Živé archivy</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)
Ostatní
Rok uplatnění
2012
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Proc. of IEEE conf. on Multimedia Signal Processing (MMSP)
ISBN
978-1-4673-4572-9
ISSN
—
e-ISSN
—
Počet stran výsledku
6
Strana od-do
118-123
Název nakladatele
—
Místo vydání
Kanada
Místo konání akce
Banff, Kanada
Datum konání akce
1. 1. 2012
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
000312670200021