Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00136935" target="_blank" >RIV/00216224:14330/24:00136935 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70566-3_13</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >10.1007/978-3-031-70566-3_13</a>
Alternative languages
Result language
English
Original language name
Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder
Original language description
Language models operating on discrete audio representations are increasingly becoming the go-to framework for many speech-processing tasks. Recently, discrete embeddings of the fundamental frequency (F0) have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters in order to achieve the highest possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the HuggingFace website.
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences
Result continuities
Project
—
Continuities
S - Specific university research
Others
Publication year
2024
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Text, Speech, and Dialogue
ISBN
9783031705656
ISSN
0302-9743
e-ISSN
—
Number of pages
10
Pages from-to
139-148
Publisher name
Springer Nature Switzerland
Place of publication
Cham
Event location
Brno
Event date
Jan 1, 2024
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
001307848400013