Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00136935" target="_blank" >RIV/00216224:14330/24:00136935 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70566-3_13</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >10.1007/978-3-031-70566-3_13</a>

Alternative languages

Result language
English
Title in the original language
Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Result description in the original language
Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters, in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website.
Title in English
Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Result description in English
Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters, in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website.

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10200 - Computer and information sciences

Result continuities

Project
—
Continuities
S - Specific research at universities

Others

Publication year
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Article name in the collection
Text, Speech, and Dialogue
ISBN
9783031705656
ISSN
0302-9743
e-ISSN
—
Number of pages
10
Pages from-to
139-148
Publisher name
Springer Nature Switzerland
Place of publication
Cham
Event location
Brno
Event date
1. 1. 2024
Event type by nationality
WRD - Worldwide event
UT WoS article code
001307848400013

Similar results (10)

Learning Optimal Prosody Embedding Codebook based on F0 and Energy
Intrinsic vs. Extrinsic Evaluation of Czech Sentence Embeddings: Semantic Relevance Doesn't Help with MT Evaluation
Bioacoustic fundamental frequency estimation: a cross-species dataset and deep learning baseline
