All

What are you looking for?

All
Projects
Results
Organizations

Quick search

  • Projects supported by TA ČR
  • Excellent projects
  • Projects with the highest public support
  • Current projects

Smart search

  • That is how I find a specific +word
  • That is how I leave the -word out of the results
  • “That is how I can find the whole phrase”

Comparison of wav2vec 2.0 models on three speech processing tasks

The result's identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F24%3APU154885" target="_blank" >RIV/00216305:26230/24:PU154885 - isvavai.cz</a>

  • Result on the web

    <a href="https://link.springer.com/article/10.1007/s10772-024-10140-6" target="_blank" >https://link.springer.com/article/10.1007/s10772-024-10140-6</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1007/s10772-024-10140-6" target="_blank" >10.1007/s10772-024-10140-6</a>

Alternative languages

  • Result language

    angličtina

  • Original language name

    Comparison of wav2vec 2.0 models on three speech processing tasks

  • Original language description

    The current state-of-the-art for various speech processing problems is a sequence-to-sequence model based on a self-attention mechanism known as transformer. The widely used wav2vec 2.0 is a self-supervised transformer model pre-trained on large amounts of unlabeled speech and then fine-tuned for a specific task. The data used for training and fine-tuning, along with the size of the transformer model, play a crucial role in both of these training steps. The most commonly used wav2vec 2.0 models are trained on relatively "clean" data from sources such as the LibriSpeech dataset, but we can expect there to be a benefit in using more realistic data gathered from a variety of acoustic conditions. However, it is not entirely clear how big the difference would be. Investigating this is the main goal of our article. To this end, we utilize wav2vec 2.0 models in three fundamental speech processing tasks: speaker change detection, voice activity detection, and overlapped speech detection, and test them on four real conversation datasets. We compare four wav2vec 2.0 models with different sizes and different data used for pre-training, and we fine-tune them either on in-domain data from the same dataset or on artificial training data created from the LibriSpeech corpus. Our results suggest that richer data that are more similar to the task domain bring better performance than a larger model.

  • Czech name

  • Czech description

Classification

  • Type

    J<sub>SC</sub> - Article in a specialist periodical, which is included in the SCOPUS database

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

    <a href="/en/project/VJ01010108" target="_blank" >VJ01010108: Robust processing of recordings for operations and security</a><br>

  • Continuities

    P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)

Others

  • Publication year

    2024

  • Confidentiality

    S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů

Data specific for result type

  • Name of the periodical

    International Journal of Speech Technology

  • ISSN

    1381-2416

  • e-ISSN

    1572-8110

  • Volume of the periodical

    27

  • Issue of the periodical within the volume

    4

  • Country of publishing house

    US - UNITED STATES

  • Number of pages

    13

  • Pages from-to

    847-859

  • UT code for WoS article

  • EID of the result in the Scopus database

    2-s2.0-85206375991