HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F19%3A00337255" target="_blank" >RIV/68407700:21730/19:00337255 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1109/ICCV.2019.00272" target="_blank" >https://doi.org/10.1109/ICCV.2019.00272</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICCV.2019.00272" target="_blank" >10.1109/ICCV.2019.00272</a>
Alternative languages
Result language
English
Title in original language
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Description in original language
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone.
Title in English
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Description in English
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone.
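The abstract describes learning a joint text-video embedding with weak supervision from transcribed narrations. The following toy sketch illustrates the general idea only: two linear projections map video and text features into a shared space, and a max-margin ranking loss scores matching clip-caption pairs above mismatched ones. All dimensions, weights, and data here are random placeholders, not the authors' model or features.

```python
import numpy as np

# Toy dimensions (assumptions for illustration, not the paper's settings).
rng = np.random.default_rng(0)
d_video, d_text, d_joint, batch = 16, 12, 8, 4

# Linear projections into the shared embedding space.
W_v = rng.normal(size=(d_video, d_joint)) / np.sqrt(d_video)
W_t = rng.normal(size=(d_text, d_joint)) / np.sqrt(d_text)

# Stand-ins for pretrained clip and narration features.
video_feats = rng.normal(size=(batch, d_video))
text_feats = rng.normal(size=(batch, d_text))

def embed(x, W):
    """Project features and L2-normalize so dot product = cosine similarity."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

v = embed(video_feats, W_v)
t = embed(text_feats, W_t)
sim = v @ t.T  # sim[i, j]: similarity between clip i and caption j

# Max-margin ranking loss over in-batch negatives: each matching pair
# (i, i) should outscore every mismatched pair by at least `margin`.
margin = 0.2
pos = np.diag(sim)
loss = 0.0
for i in range(batch):
    for j in range(batch):
        if i != j:
            loss += max(0.0, margin + sim[i, j] - pos[i])  # caption negatives
            loss += max(0.0, margin + sim[j, i] - pos[i])  # video negatives
loss /= batch * (batch - 1)
print(f"ranking loss: {loss:.4f}")
```

Once such an embedding is trained, text-to-video retrieval reduces to ranking clips by their cosine similarity to the query caption, which is how the YouCook2 and CrossTask evaluations mentioned above are typically run.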
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
<a href="/cs/project/EF15_003%2F0000468" target="_blank" >EF15_003/0000468: Intelligent Machine Perception</a><br>
Linkages
P - Research and development project financed from public sources (with a link to CEP)
Others
Year of implementation
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Article name in the proceedings
2019 IEEE International Conference on Computer Vision (ICCV 2019)
ISBN
978-1-7281-4804-5
ISSN
1550-5499
e-ISSN
2380-7504
Number of pages
11
Pages from-to
2630-2640
Publisher name
IEEE Computer Society Press
Place of publication
Los Alamitos
Event location
Seoul
Event date
27. 10. 2019
Event type by nationality
WRD - Worldwide event
UT WoS article code
000531438102077