HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Result identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F19%3A00337255" target="_blank" >RIV/68407700:21730/19:00337255 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1109/ICCV.2019.00272" target="_blank" >https://doi.org/10.1109/ICCV.2019.00272</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICCV.2019.00272" target="_blank" >10.1109/ICCV.2019.00272</a>
Alternative languages
Result language
English
Title in original language
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Description in original language
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone.
Title in English
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Description in English
Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone.
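The abstract describes learning a joint text-video embedding with weak supervision from transcribed narrations. The following toy sketch illustrates the general idea only: two linear projections map video and text features into a shared space, and a max-margin ranking loss scores matching clip-caption pairs above mismatched ones. All dimensions, weights, and data here are random placeholders, not the authors' model or features.

```python
import numpy as np

# Toy dimensions (assumptions for illustration, not the paper's settings).
rng = np.random.default_rng(0)
d_video, d_text, d_joint, batch = 16, 12, 8, 4

# Linear projections into the shared embedding space.
W_v = rng.normal(size=(d_video, d_joint)) / np.sqrt(d_video)
W_t = rng.normal(size=(d_text, d_joint)) / np.sqrt(d_text)

# Stand-ins for pretrained clip and narration features.
video_feats = rng.normal(size=(batch, d_video))
text_feats = rng.normal(size=(batch, d_text))

def embed(x, W):
    """Project features and L2-normalize so dot product = cosine similarity."""
    z = x @ W
    return z / np.linalg.norm(z, axis=1, keepdims=True)

v = embed(video_feats, W_v)
t = embed(text_feats, W_t)
sim = v @ t.T  # sim[i, j]: similarity between clip i and caption j

# Max-margin ranking loss over in-batch negatives: each matching pair
# (i, i) should outscore every mismatched pair by at least `margin`.
margin = 0.2
pos = np.diag(sim)
loss = 0.0
for i in range(batch):
    for j in range(batch):
        if i != j:
            loss += max(0.0, margin + sim[i, j] - pos[i])  # caption negatives
            loss += max(0.0, margin + sim[j, i] - pos[i])  # video negatives
loss /= batch * (batch - 1)
print(f"ranking loss: {loss:.4f}")
```

Once such an embedding is trained, text-to-video retrieval reduces to ranking clips by their cosine similarity to the query caption, which is how the YouCook2 and CrossTask evaluations mentioned above are typically run.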
Classification
Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result linkages
Project
<a href="/cs/project/EF15_003%2F0000468" target="_blank" >EF15_003/0000468: Intelligent Machine Perception</a><br>
Linkages
P - Research and development project financed from public sources (with a link to CEP)
Others
Year of implementation
2019
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific to the result type
Article name in the proceedings
2019 IEEE International Conference on Computer Vision (ICCV 2019)
ISBN
978-1-7281-4804-5
ISSN
1550-5499
e-ISSN
2380-7504
Number of pages
11
Pages from-to
2630-2640
Publisher name
IEEE Computer Society Press
Place of publication
Los Alamitos
Event location
Seoul
Event date
27. 10. 2019
Event type by nationality
WRD - Worldwide event
UT WoS article code
000531438102077