All

What are you looking for?

All
Projects
Results
Organizations

Quick search

  • Projects supported by TA ČR
  • Excellent projects
  • Projects with the highest public support
  • Current projects

Smart search

  • That is how I find a specific +word
  • That is how I leave the -word out of the results
  • “That is how I can find the whole phrase”

Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder

The result's identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F24%3A00136935" target="_blank" >RIV/00216224:14330/24:00136935 - isvavai.cz</a>

  • Result on the web

    <a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >http://dx.doi.org/10.1007/978-3-031-70566-3_13</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/978-3-031-70566-3_13" target="_blank" >10.1007/978-3-031-70566-3_13</a>

Alternative languages

  • Result language

    angličtina

  • Original language name

    Generating High-Quality F0 Embeddings Using the Vector-Quantized Variational Autoencoder

  • Original language description

    Language models operating on discrete audio representa- tions are increasingly becoming the go-to framework for many speech- processing tasks. Recently, discrete embeddings of the fundamental fre- AQ1 quency (F0), have been shown to improve performance across a variety of tasks. However, the benefits of using F0 embeddings can only be as good as the embeddings themselves. Therefore, in this paper, we present an exhaustive study on using the Vector-Quantized Variational Autoencoder (VQ-VAE) to generate high-quality embeddings of the F0 curve. We experiment with various input transformations that focus on handling unvoiced regions of the F0, which are regions where F0 is not defined. For each transformation, we perform an exhaustive grid search over the embedding size and codebook size parameters, in order to achieve high- est possible embedding quality. Our experiments are conducted on two different-sized datasets, LJSpeech and LibriTTS, and, in total, comprise over 140 different experiment settings. We reach results ranging from 0.53% to 4.29% F0 Frame Error (FFE), depending on the dataset and preprocessing strategy used, and we publish our best models on the Hug- gingFace website.

  • Czech name

  • Czech description

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10200 - Computer and information sciences

Result continuities

  • Project

  • Continuities

    S - Specificky vyzkum na vysokych skolach

Others

  • Publication year

    2024

  • Confidentiality

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

  • Article name in the collection

    Text, Speech, and Dialogue

  • ISBN

    9783031705656

  • ISSN

    0302-9743

  • e-ISSN

  • Number of pages

    10

  • Pages from-to

    139-148

  • Publisher name

    Springer Nature Switzerland

  • Place of publication

    Cham

  • Event location

    Brno

  • Event date

    Jan 1, 2024

  • Type of event by nationality

    WRD - Celosvětová akce

  • UT code for WoS article

    001307848400013