Transformers of multiple modalities for more natural spoken dialog

Project goals

The goal of the project is to research more natural spoken dialog systems based on the Transformer framework. Because Transformers handle sequence-to-sequence tasks, they are widely used in natural language understanding and generation. We focus on cases where the input or output of a neural network is speech. To convert speech into a semantic representation or dialog intents, we will use the speech recognizer as a black box, but we plan to develop novel methods for processing speech lattices in general Transformer or recurrent neural networks. The inverse process, generating speech from intents, will employ pre-trained Transformer models for language generation and recent DNN-based speech synthesis architectures. Dialog management will use neural attention mechanisms to track the dialog state and to generate consistent prompts in an informal or conversational style. The challenging task of synthesizing speech in a given speaking style will be supported by a recorded corpus of conversational speech.

Keywords

spoken dialog, speech synthesis, Transformer, deep learning

Public support

Provider
Czech Science Foundation
Programme
Standard projects
Call for proposals
SGA0202200004
Main participants
Západočeská univerzita v Plzni / Fakulta aplikovaných věd
Contest type
VS - Public tender
Contract ID
22-27800S

Alternative language

Project name in Czech
Využití vícemodálních Transformerů pro přirozenější hlasový dialog
Annotation in Czech
Cílem projektu je výzkum přirozenějších hlasových dialogových systémů založených na Transformerech. Vzhledem k tomu, že Transformery lze použít v úlohách typu sequence-to-sequence, běžně se využívají v úlohách porozumění přirozenému jazyku a generování přirozeného jazyka. V projektu se chceme zaměřit na případy, kdy vstupem nebo výstupem neuronové sítě je řeč. K převodu řeči na sémantickou reprezentaci nebo dialogové záměry využijeme rozpoznávač řeči jako černou skříňku, pro zpracování výstupních řečových mřížek v obecném Transformeru nebo rekurentních neuronových sítích ale plánujeme vyvinout nové metody a přístupy. Inverzní proces generování řeči ze záměrů bude využívat předtrénované modely Transformerů pro generování jazyka a moderní architektury syntézy řeči založené na DNN. Řízení dialogu bude využívat neurální attention mechanismy ke sledování stavu dialogu a ke generování konzistentních výstupů v neformálním nebo konverzačním stylu. Pro náročný úkol syntézy řeči v daném řečovém stylu plánujeme vytvořit vlastní korpus konverzační řeči.

Scientific branches

R&D category
ZV - Basic research
OECD FORD - main branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
OECD FORD - secondary branch
20205 - Automation and control systems
OECD FORD - another secondary branch
—
AF - Documentation, librarianship, work with information
BC - Theory and management systems
BD - Information theory
IN - Informatics
JD - Use of computers, robotics and its application

Solution timeline

Realization period - beginning
Jan 1, 2022
Realization period - end
Dec 31, 2024
Project status
—
Latest support payment
Feb 29, 2024

Data delivery to CEP

Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data delivery code
CEP25-GA0-GA-R
Data delivery date
Mar 12, 2025

Finance

Total approved costs
6,810 thou. CZK
Public financial support
6,471 thou. CZK
Other public sources
339 thou. CZK
Non public and foreign sources
0 thou. CZK

Basic information

Recognised costs
6,810 thou. CZK
Public support
6,471 thou. CZK (95%)
Provider
Czech Science Foundation
OECD FORD
Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Solution period
Jan 1, 2022 - Dec 31, 2024

Similar projects (10)

Deep Learning Methods in Speech Synthesis as a Source of Innovation in Language Learning (EG19_262/0020235)
Combining phonetic and corpus-based approaches to remedy disruptive effects in synthetic speech (GA16-04420S)
Fully Trainable Deep Neural Network Based Czech Text-to-Speech Synthesis (GA19-19324S)
