DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F22%3APU144910" target="_blank" >RIV/00216305:26230/22:PU144910 - isvavai.cz</a>
Result on the web
<a href="https://ieeexplore.ieee.org/document/9747340" target="_blank" >https://ieeexplore.ieee.org/document/9747340</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICASSP43922.2022.9747340" target="_blank" >10.1109/ICASSP43922.2022.9747340</a>
Alternative languages
Result language
English
Title in the original language
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction
Result description in the original language
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the acoustic environment and to tasks with wide domain coverage. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions. Furthermore, we generalize DPCCN to target speech extraction (TSE) by integrating a newly designed speaker encoder. Moreover, we investigate the robustness of DPCCN on unsupervised cross-domain TSE tasks. A Mixture-Remix approach is proposed to adapt the source model to the target-domain acoustic characteristics during fine-tuning. We evaluate the proposed methods not only under noisy and reverberant in-domain conditions, but also under clean cross-domain conditions. Results show that, for both speech separation and extraction, the DPCCN-based systems achieve significantly better performance and robustness than the currently dominant time-domain methods, especially on cross-domain tasks. In particular, we find that Mixture-Remix fine-tuning with DPCCN significantly outperforms TD-SpeakerBeam for unsupervised cross-domain TSE, with around a 3.5 dB SI-SNR improvement on the target-domain test set and no source-domain performance degradation.
Title in English
DPCCN: Densely-Connected Pyramid Complex Convolutional Network for Robust Speech Separation and Extraction
Result description in English
In recent years, a number of time-domain speech separation methods have been proposed. However, most of them are very sensitive to the acoustic environment and to tasks with wide domain coverage. In this paper, from the time-frequency domain perspective, we propose a densely-connected pyramid complex convolutional network, termed DPCCN, to improve the robustness of speech separation under complicated conditions. Furthermore, we generalize DPCCN to target speech extraction (TSE) by integrating a newly designed speaker encoder. Moreover, we investigate the robustness of DPCCN on unsupervised cross-domain TSE tasks. A Mixture-Remix approach is proposed to adapt the source model to the target-domain acoustic characteristics during fine-tuning. We evaluate the proposed methods not only under noisy and reverberant in-domain conditions, but also under clean cross-domain conditions. Results show that, for both speech separation and extraction, the DPCCN-based systems achieve significantly better performance and robustness than the currently dominant time-domain methods, especially on cross-domain tasks. In particular, we find that Mixture-Remix fine-tuning with DPCCN significantly outperforms TD-SpeakerBeam for unsupervised cross-domain TSE, with around a 3.5 dB SI-SNR improvement on the target-domain test set and no source-domain performance degradation.
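The reported gains are given in SI-SNR, the standard scale-invariant signal-to-noise ratio used to score separation and extraction quality. As a point of reference only, the sketch below shows one common way to compute this metric in NumPy; it is not code from the paper, and the function name, epsilon constant, and toy signals are illustrative assumptions.

import numpy as np

def si_snr(estimate: np.ndarray, reference: np.ndarray, eps: float = 1e-8) -> float:
    """Scale-invariant signal-to-noise ratio (SI-SNR) in dB (illustrative sketch)."""
    # Remove DC offsets so the metric is invariant to constant shifts.
    estimate = estimate - estimate.mean()
    reference = reference - reference.mean()
    # Project the estimate onto the reference to get the "target" component.
    target = (np.dot(estimate, reference) / (np.dot(reference, reference) + eps)) * reference
    # Everything orthogonal to the reference counts as noise.
    noise = estimate - target
    return 10.0 * np.log10((np.sum(target ** 2) + eps) / (np.sum(noise ** 2) + eps))

# Toy usage with synthetic signals: a lightly corrupted estimate scores around 20 dB.
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)                 # 1 s of "speech" at 16 kHz
noisy = clean + 0.1 * rng.standard_normal(16000)   # estimate with additive noise
print(f"SI-SNR: {si_snr(noisy, clean):.1f} dB")

A 3.5 dB improvement in this metric, as quoted in the abstract for the cross-domain TSE setting, corresponds to a substantial reduction of the residual interference energy relative to the extracted target signal.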
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
The result was created during the implementation of multiple projects. More information is available in the Projects tab.
Continuities
P - Research and development project financed from public sources (with a link to CEP)
Others
Year of implementation
2022
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Article name in the proceedings
ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing - Proceedings
ISBN
978-1-6654-0540-9
ISSN
—
e-ISSN
—
Number of pages of the result
5
Pages from-to
7292-7296
Publisher name
IEEE Signal Processing Society
Place of publication
Singapore
Event location
Singapore
Event date
22. 5. 2022
Event type by nationality
WRD - Worldwide event
UT WoS code of the article
000864187907119