Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F23%3A00130552" target="_blank" >RIV/00216224:14330/23:00130552 - isvavai.cz</a>
Result on the web
<a href="http://dx.doi.org/10.1145/3539618.3592069" target="_blank" >http://dx.doi.org/10.1145/3539618.3592069</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1145/3539618.3592069" target="_blank" >10.1145/3539618.3592069</a>

Alternative languages

Result language
English
Title in the original language
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Result description in the original language
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
Title in English
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Result description in English
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10200 - Computer and information sciences

Result linkages

Project
<a href="/cs/project/EF16_019%2F0000822" target="_blank" >EF16_019/0000822: Centrum excelence pro kyberkriminalitu, kyberbezpečnost a ochranu kritických informačních infrastruktur</a><br>
Linkages
P - Research and development project financed from public sources (with a link to CEP)

Other

Year of implementation
2023
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Title of the article in the proceedings
46th International Conference on Research and Development in Information Retrieval (SIGIR)
ISBN
9781450394086
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
2420-2425
Publisher name
Association for Computing Machinery
Place of publication
New York, NY, USA
Event location
Taipei, Taiwan
Event date
1. 1. 2023
Event type by nationality
WRD - Worldwide event
UT WoS article code
001118084002091

Similar results (10)

TubeDETR: Spatio-Temporal Video Grounding with Transformers
Efficient Indexing of 3D Human Motions
Efficient Retrieval of Human Motion Episodes Based on Indexed Motion-Word Representations
