Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00136935" target="_blank" >RIV/00216224:14330/24:00136935 - isvavai.cz</a>
Výsledek na webu
<a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70566-3_13</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >10.1007/978-3-031-70566-3_13</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Popis výsledku v původním jazyce
Language models operating on discrete audio representa- tions are increasingly becoming the go-to framework for many speech- processing tasks. Recently, discrete embeddings of the fundamental fre- AQ1 quency (F0), have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters, in order to achieve high- est possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the Hug- gingFace website.
Název v anglickém jazyce
Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Popis výsledku anglicky
Language models operating on discrete audio representa- tions are increasingly becoming the go-to framework for many speech- processing tasks. Recently, discrete embeddings of the fundamental fre- AQ1 quency (F0), have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters, in order to achieve high- est possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the Hug- gingFace website.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10200 - Computer and information sciences
Návaznosti výsledku
Projekt
—
Návaznosti
S - Specificky vyzkum na vysokych skolach
Ostatní
Rok uplatnění
2024
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
Text, Speech, and Dialogue
ISBN
9783031705656
ISSN
0302-9743
e-ISSN
—
Počet stran výsledku
10
Strana od-do
139-148
Název nakladatele
Springer Nature Switzerland
Místo vydání
Cham
Místo konání akce
Brno
Datum konání akce
1. 1. 2024
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
001307848400013