Using X-vectors for Speech Activity Detection in Broadcast Streams
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F46747885%3A24220%2F21%3A00009297" target="_blank" >RIV/46747885:24220/21:00009297 - isvavai.cz</a>
Result on the web
<a href="https://www.isca-speech.org/archive/pdfs/interspeech_2021/mateju21_interspeech.pdf" target="_blank" >https://www.isca-speech.org/archive/pdfs/interspeech_2021/mateju21_interspeech.pdf</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.21437/Interspeech.2021-192" target="_blank" >10.21437/Interspeech.2021-192</a>
Alternative languages
Result language
English
Original language name
Using X-vectors for Speech Activity Detection in Broadcast Streams
Original language description
A new approach to speech activity detection (SAD) is presented in this work. It allows us to reduce the complexity and computational demands, namely in services that process streaming speech, where a SAD module usually forms the first block of the data pipeline (e.g., in a platform for 24/7 broadcast transcription). Our approach utilizes x-vectors as input features so that, within the subsequent pipeline stages, these embedding instances can also directly be employed for speaker diarization and recognition. The x-vectors are extracted by a feed-forward sequential memory network (FSMN), allowing for modeling long-time dependencies; they thus form an input into a computationally undemanding binary classifier, whose output is smoothed by a decoder. Evaluation is performed on the standardized QUT-NOISE-TIMIT dataset as well as on broadcast data with large portions of music and background noise. The former data allows for comparison with other existing approaches. The latter shows the performance in terms of word error rate (WER) and reduction in real-time factor (RTF) of the transcription process.
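The description above ends with a binary classifier whose output is smoothed by a decoder. The record does not say how that decoder works, so the following is only a minimal sketch of one common choice: a two-state Viterbi pass over per-frame speech probabilities with a fixed penalty for changing state. The function name `smooth_sad`, the `switch_penalty` parameter, and its default value are assumptions made for illustration and are not taken from the paper.

```python
import math


def smooth_sad(speech_probs, switch_penalty=2.0):
    """Two-state Viterbi smoothing of per-frame speech probabilities.

    Returns a list of 0/1 labels (1 = speech). `switch_penalty` is a
    hypothetical log-domain cost charged for every state change; it is
    an illustrative assumption, not a value from the paper.
    """
    if not speech_probs:
        return []
    eps = 1e-10
    # Log-likelihoods per frame for state 0 (non-speech) and state 1 (speech).
    emis = [(math.log(max(1.0 - p, eps)), math.log(max(p, eps)))
            for p in speech_probs]
    score = list(emis[0])
    back = []  # back[t][s] = best predecessor of state s at frame t + 1
    for e in emis[1:]:
        new_score, ptr = [0.0, 0.0], [0, 0]
        for s in (0, 1):
            stay = score[s]
            switch = score[1 - s] - switch_penalty
            if stay >= switch:
                new_score[s], ptr[s] = stay + e[s], s
            else:
                new_score[s], ptr[s] = switch + e[s], 1 - s
        score = new_score
        back.append(ptr)
    # Backtrack from the better final state.
    state = 0 if score[0] >= score[1] else 1
    labels = [state]
    for ptr in reversed(back):
        state = ptr[state]
        labels.append(state)
    labels.reverse()
    return labels
```

With a positive penalty, a single low-probability frame inside a speech region is absorbed into the surrounding segment; with `switch_penalty=0` the function reduces to thresholding each frame at 0.5.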
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
<a href="/en/project/TH03010018" target="_blank" >TH03010018: DeepSpot - Multilingual technology for spotting and instant alerting</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)<br>S - Specific research at universities
Others
Publication year
2021
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH
ISBN
978-171383690-2
ISSN
2308-457X
e-ISSN
—
Number of pages
5
Pages from-to
4161 - 4165
Publisher name
ISCA
Place of publication
—
Event location
Brno, Czech Republic
Event date
Jan 1, 2021
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000841879501118