SpeedySpeech: Efficient Neural Speech Synthesis
Result identifiers
Result code in IS VaVaI
RIV/00216208:11320/20:10424507 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F20%3A10424507)
Result on the web
https://www.isca-speech.org/archive/Interspeech_2020/abstracts/2867.html
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.21437/Interspeech.2020-2867" target="_blank" >10.21437/Interspeech.2020-2867</a>
Alternative languages
Result language
English
Title in the original language
SpeedySpeech: Efficient Neural Speech Synthesis
Description in the original language
While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio samples in our GitHub repository.
Title in English
SpeedySpeech: Efficient Neural Speech Synthesis
Description in English
While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio samples in our GitHub repository.
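For illustration, the architecture sketched in the abstract (stacks of simple convolutional blocks with residual connections instead of self-attention) can be expressed in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the class name, channel count, normalization, and dilation schedule below are illustrative choices, not the exact SpeedySpeech implementation, which the abstract says is available in the authors' GitHub repository.

import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Dilated 1-D convolution with a residual connection.

    Illustrative sketch only; not the authors' exact block.
    """
    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # "Same" padding so the time dimension is preserved.
        padding = (kernel_size - 1) // 2 * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time); identity shortcut around conv + norm + activation.
        return x + self.act(self.norm(self.conv(x)))

# Stacking blocks with growing dilations widens the receptive field without
# any self-attention, which is the property the abstract argues is sufficient.
encoder = nn.Sequential(*(ResidualConvBlock(128, dilation=d) for d in (1, 2, 4, 8)))
out = encoder(torch.randn(1, 128, 200))  # shape preserved: (1, 128, 200)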
Classification
Type
D - Paper in conference proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
—
Continuities
I - Institutional support for the long-term conceptual development of a research organisation
Others
Year of publication
2020
Data confidentiality code
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific to the result type
Proceedings title
Proceedings of the 21st Annual Conference of the International Speech Communication Association
ISBN
—
ISSN
1990-9772
e-ISSN
—
Number of pages
5
Pages from-to
3575-3579
Publisher name
International Speech Communication Association
Place of publication
Baixas, France
Event venue
Online
Event date
25 October 2020
Event type by nationality
WRD - Worldwide event
UT WoS article code
—