HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

Result identifiers

  • Result code in IS VaVaI

    <a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F19%3A00337255" target="_blank" >RIV/68407700:21730/19:00337255 - isvavai.cz</a>

  • Result on the web

    <a href="https://doi.org/10.1109/ICCV.2019.00272" target="_blank" >https://doi.org/10.1109/ICCV.2019.00272</a>

  • DOI - Digital Object Identifier

    <a href="http://dx.doi.org/10.1109/ICCV.2019.00272" target="_blank" >10.1109/ICCV.2019.00272</a>

Alternative languages

  • Result language

    English

  • Original language name

    HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips

  • Original language description

    Learning text-video embeddings usually requires a dataset of video clips with manually provided captions. However, such datasets are expensive and time-consuming to create and therefore difficult to obtain on a large scale. In this work, we propose instead to learn such embeddings from video data with readily available natural language annotations in the form of automatically transcribed narrations. The contributions of this work are three-fold. First, we introduce HowTo100M: a large-scale dataset of 136 million video clips sourced from 1.22M narrated instructional web videos depicting humans performing and describing over 23k different visual tasks. Our data collection procedure is fast, scalable and does not require any additional manual annotation. Second, we demonstrate that a text-video embedding trained on this data leads to state-of-the-art results for text-to-video retrieval and action localization on instructional video datasets such as YouCook2 or CrossTask. Finally, we show that this embedding transfers well to other domains: fine-tuning on generic YouTube videos (MSR-VTT dataset) and movies (LSMDC dataset) outperforms models trained on these datasets alone.
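
    (A minimal, illustrative sketch of such a joint text-video embedding objective follows this list.)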

  • Czech name

  • Czech description
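
The abstract above describes learning a joint text-video embedding from clip-narration pairs. Below is a minimal PyTorch sketch of that kind of model, trained with a bidirectional max-margin ranking loss over in-batch negatives, the loss family commonly used for text-to-video retrieval. The module names, feature dimensions, and margin value are illustrative assumptions, not the authors' released implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class JointEmbedding(nn.Module):
        """Projects precomputed video and text features into a shared space."""
        def __init__(self, video_dim=4096, text_dim=300, embed_dim=512):
            super().__init__()
            self.video_proj = nn.Linear(video_dim, embed_dim)  # e.g. pooled CNN clip features
            self.text_proj = nn.Linear(text_dim, embed_dim)    # e.g. pooled word vectors

        def forward(self, video_feats, text_feats):
            v = F.normalize(self.video_proj(video_feats), dim=-1)
            t = F.normalize(self.text_proj(text_feats), dim=-1)
            return v, t

    def max_margin_loss(v, t, margin=0.2):
        """Bidirectional max-margin ranking loss over in-batch negatives."""
        scores = v @ t.T                  # (B, B) cosine similarity matrix
        pos = scores.diag().unsqueeze(1)  # similarity of matching clip-caption pairs
        # Penalize any negative whose score comes within `margin` of the positive,
        # in both the video-to-text and the text-to-video direction.
        cost_v2t = (margin + scores - pos).clamp(min=0)
        cost_t2v = (margin + scores - pos.T).clamp(min=0)
        off_diag = 1.0 - torch.eye(scores.size(0), device=scores.device)
        return ((cost_v2t + cost_t2v) * off_diag).sum() / scores.size(0)

    # Toy usage: one training step on random stand-in features.
    model = JointEmbedding()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    video = torch.randn(8, 4096)  # batch of pooled clip features
    text = torch.randn(8, 300)    # batch of pooled narration features
    v, t = model(video, text)
    loss = max_margin_loss(v, t)
    loss.backward()
    optimizer.step()

The loss pulls each matching clip-narration pair together while pushing every mismatched in-batch pair at least `margin` apart, which is what makes nearest-neighbor text-to-video retrieval in the shared space possible.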

Classification

  • Type

    D - Article in proceedings

  • CEP classification

  • OECD FORD branch

    10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

  • Project

    <a href="/en/project/EF15_003%2F0000468" target="_blank" >EF15_003/0000468: Intelligent Machine Perception</a><br>

  • Continuities

    P - Research and development project financed from public funds (with a link to CEP)

Others

  • Publication year

    2019

  • Confidentiality

    S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific for result type

  • Name of the proceedings

    2019 IEEE International Conference on Computer Vision (ICCV 2019)

  • ISBN

    978-1-7281-4804-5

  • ISSN

    1550-5499

  • e-ISSN

    2380-7504

  • Number of pages

    11

  • Pages from-to

    2630-2640

  • Publisher name

    IEEE Computer Society Press

  • Place of publication

    Los Alamitos

  • Event location

    Seoul

  • Event date

    Oct 27, 2019

  • Type of event by nationality

    WRD - Worldwide event

  • UT code for WoS article

    000531438102077