HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
The result's identifiers
Result code in IS VaVaI
RIV/68407700:21730/19:00337255 - isvavai.cz (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F19%3A00337255)
Result on the web
https://doi.org/10.1109/ICCV.2019.00272
DOI - Digital Object Identifier
10.1109/ICCV.2019.00272
Alternative languages
Result language
English
Original language name
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Original language description
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone.
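The embedding described above is typically learned by projecting clip features and narration features into a shared space and training with a ranking objective so that each clip scores higher with its own narration than with others. Below is a minimal PyTorch sketch under that assumption; the feature dimensions, margin value, and two-linear-layer architecture are illustrative choices, not the authors' released implementation.

```python
# Minimal sketch of a joint text-video embedding trained with a max-margin
# ranking loss on clip/narration pairs. Dimensions and margin are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointEmbedding(nn.Module):
    def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
        super().__init__()
        # Project pre-extracted video and narration features into a shared space.
        self.video_proj = nn.Linear(video_dim, embed_dim)
        self.text_proj = nn.Linear(text_dim, embed_dim)

    def forward(self, video_feats, text_feats):
        v = F.normalize(self.video_proj(video_feats), dim=-1)
        t = F.normalize(self.text_proj(text_feats), dim=-1)
        return v, t

def max_margin_loss(v, t, margin=0.2):
    """Each clip should be closer to its own narration than to the
    narrations of other clips in the batch, and vice versa."""
    scores = v @ t.T                                  # cosine similarities (B x B)
    pos = scores.diag().unsqueeze(1)                  # matching pairs on the diagonal
    cost_t = (margin + scores - pos).clamp(min=0)     # video vs. wrong narration
    cost_v = (margin + scores - pos.T).clamp(min=0)   # narration vs. wrong video
    mask = torch.eye(scores.size(0), dtype=torch.bool)
    cost_t = cost_t.masked_fill(mask, 0)
    cost_v = cost_v.masked_fill(mask, 0)
    return (cost_t + cost_v).mean()

# Example: a batch of 8 clip/narration feature pairs.
model = JointEmbedding()
v, t = model(torch.randn(8, 4096), torch.randn(8, 300))
loss = max_margin_loss(v, t)
loss.backward()
```

At retrieval time, the same shared space is used directly: a text query is projected and scored against projected clip features, so no task-specific head is needed.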
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspects to be 5.8)
Result continuities
Project
EF15_003/0000468: Intelligent Machine Perception
Continuities
P - Research and development project financed from public sources (with a link to CEP)
Others
Publication year
2019
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
2019 IEEE International Conference on Computer Vision (ICCV 2019)
ISBN
978-1-7281-4804-5
ISSN
1550-5499
e-ISSN
2380-7504
Number of pages
11
Pages from-to
2630-2640
Publisher name
IEEE Computer Society Press
Place of publication
Los Alamitos
Event location
Seoul
Event date
Oct 27, 2019
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000531438102077