Comparison of wav2vec 2.0 models on three speech processing tasks
Result identifiers
Result code in IS VaVaI
RIV/00216305:26230/24:PU154885 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216305%3A26230%2F24%3APU154885)
Result on the web
https://link.springer.com/article/10.1007/s10772-024-10140-6
DOI - Digital Object Identifier
10.1007/s10772-024-10140-6 (http://dx.doi.org/10.1007/s10772-024-10140-6)
Alternative languages
Result language
English
Title in original language
Comparison of wav2vec 2.0 models on three speech processing tasks
Result description in original language
The current state of the art for various speech processing problems is a sequence-to-sequence model based on a self-attention mechanism, known as the transformer. The widely used wav2vec 2.0 is a self-supervised transformer model that is pre-trained on large amounts of unlabeled speech and then fine-tuned for a specific task. The data used for pre-training and fine-tuning, along with the size of the transformer model, play a crucial role in both of these training steps. The most commonly used wav2vec 2.0 models are trained on relatively "clean" data from sources such as the LibriSpeech dataset, but we can expect a benefit from using more realistic data gathered from a variety of acoustic conditions. However, it is not entirely clear how big the difference would be. Investigating this is the main goal of our article. To this end, we utilize wav2vec 2.0 models in three fundamental speech processing tasks (speaker change detection, voice activity detection, and overlapped speech detection) and test them on four real conversation datasets. We compare four wav2vec 2.0 models that differ in size and in the data used for pre-training, and we fine-tune them either on in-domain data from the same dataset or on artificial training data created from the LibriSpeech corpus. Our results suggest that richer data that are more similar to the task domain improve performance more than a larger model does.
Classification
Type
J<sub>SC</sub> - Article in a periodical indexed in the SCOPUS database
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
VJ01010108: Robust processing of recordings for operations and security (Robustní zpracování nahrávek pro operativu a bezpečnost)
Linkages
P - Research and development project financed from public funds (with a link to CEP)
Others
Year of implementation
2024
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Name of the periodical
International Journal of Speech Technology
ISSN
1381-2416
e-ISSN
1572-8110
Volume of the periodical
27
Issue of the periodical within the volume
4
Country of the periodical's publisher
US - United States of America
Number of pages of the result
13
Pages from-to
847-859
UT WoS code of the article
—
EID of the result in the Scopus database
2-s2.0-85206375991