Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language

The result's identifiers

Result code in IS VaVaI
RIV/00216224:14330/23:00130552 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F00216224%3A14330%2F23%3A00130552)
Result on the web
http://dx.doi.org/10.1145/3539618.3592069
DOI - Digital Object Identifier
10.1145/3539618.3592069

Alternative languages

Result language
English
Original language name
Text-to-Motion Retrieval: Towards Joint Understanding of Human Motion Data and Natural Language
Original language description
Due to recent advances in pose-estimation methods, human motion can be extracted from a common video in the form of 3D skeleton sequences. Despite wonderful application opportunities, effective and efficient content-based access to large volumes of such spatio-temporal skeleton data still remains a challenging problem. In this paper, we propose a novel content-based text-to-motion retrieval task, which aims at retrieving relevant motions based on a specified natural-language textual description. To define baselines for this uncharted task, we employ the BERT and CLIP language representations to encode the text modality and successful spatio-temporal models to encode the motion modality. We additionally introduce our transformer-based approach, called Motion Transformer (MoT), which employs divided space-time attention to effectively aggregate the different skeleton joints in space and time. Inspired by the recent progress in text-to-image/video matching, we experiment with two widely-adopted metric-learning loss functions. Finally, we set up a common evaluation protocol by defining qualitative metrics for assessing the quality of the retrieved motions, targeting the two recently-introduced KIT Motion-Language and HumanML3D datasets. The code for reproducing our results is available here: https://github.com/mesnico/text-to-motion-retrieval.
Czech name
—
Czech description
—

Classification

Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10200 - Computer and information sciences

Result continuities

Project
EF16_019/0000822: CyberSecurity, CyberCrime and Critical Information Infrastructures Center of Excellence
Continuities
P - Research and development project financed from public funds (with a link to CEP)

Others

Publication year
2023
Confidentiality
S - Complete and truthful data on the project are not subject to protection under special legal regulations

Data specific for result type

Article name in the collection
46th International Conference on Research and Development in Information Retrieval (SIGIR)
ISBN
9781450394086
ISSN
—
e-ISSN
—
Number of pages
6
Pages from-to
2420-2425
Publisher name
Association for Computing Machinery
Place of publication
New York, NY, USA
Event location
Taipei, Taiwan
Event date
Jan 1, 2023
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
001118084002091

Similar results(10)

TubeDETR: Spatio-Temporal Video Grounding with Transformers
Efficient Indexing of 3D Human Motions
Efficient Retrieval of Human Motion Episodes Based on Indexed Motion-Word Representations
