MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216208%3A11320%2F23%3A10475701" target="_blank" >RIV/00216208:11320/23:10475701 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.21437/SSW.2023-8" target="_blank" >http://dx.doi.org/10.21437/SSW.2023-8</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.21437/SSW.2023-8" target="_blank" >10.21437/SSW.2023-8</a>

Alternative languages

Result language
English
Title in original language
MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
Result description in original language
We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminative Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-fine-tuned SSL model when trained only on 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.
Title in English
MooseNet: A Trainable Metric for Synthesized Speech with a PLDA Module
Result description in English
We present MooseNet, a trainable speech metric that predicts the listeners' Mean Opinion Score (MOS). We propose a novel approach where the Probabilistic Linear Discriminative Analysis (PLDA) generative model is used on top of an embedding obtained from a self-supervised learning (SSL) neural network (NN) model. We show that PLDA works well with a non-fine-tuned SSL model when trained only on 136 utterances (ca. one minute training time) and that PLDA consistently improves various neural MOS prediction models, even state-of-the-art models with task-specific fine-tuning. Our ablation study shows PLDA training superiority over SSL model fine-tuning in a low-resource scenario. We also improve SSL model fine-tuning using a convenient optimizer choice and additional contrastive and multi-task training objectives. The fine-tuned MooseNet NN with the PLDA module achieves the best results, surpassing the SSL baseline on the VoiceMOS Challenge data.

Classification

Type
O - Other results
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result linkages

Project
—
Linkages
S - Specific research at universities

Other

Year of application
2023
Data confidentiality code
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Similar results (10)

Dereverberation and Beamforming in Far-Field Speaker Recognition
Unsupervised Pretraining for Neural Machine Translation Using Elastic Weight Consolidation
Finetuning Is a Surprisingly Effective Domain Adaptation Baseline in Handwriting Recognition