Just Ask: Learning To Answer Questions From Millions of Narrated Videos
The result's identifiers
Result code in IS VaVaI
RIV/68407700:21730/21:00356150 (https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F21%3A00356150)
Result on the web
https://doi.org/10.1109/ICCV48922.2021.00171
DOI - Digital Object Identifier
10.1109/ICCV48922.2021.00171
Alternative languages
Result language
English
Original language name
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
Original language description
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
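The contrastive training idea in the description above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `contrastive_loss` and `predict`, the plain dot-product scoring, and the toy two-dimensional embeddings are all assumptions made for illustration. The sketch treats each video-question embedding as matching the answer embedding at the same batch index and uses the other answers in the batch as negatives.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over a batch (illustrative assumption).

    The i-th video-question embedding should score highest against the
    i-th answer embedding; all other answers in the batch are negatives.
    """
    total = 0.0
    for i, vq in enumerate(vq_embeddings):
        scores = [dot(vq, a) for a in answer_embeddings]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_norm - scores[i]
    return total / len(vq_embeddings)


def predict(vq, answer_embeddings):
    """Return the index of the answer embedding with the highest score."""
    scores = [dot(vq, a) for a in answer_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy embeddings (hypothetical values, not taken from the paper).
answers = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[1.0, 0.0], [0.0, 1.0]]
misaligned = [[0.0, 1.0], [1.0, 0.0]]
```

In this toy setting, `contrastive_loss(aligned, answers)` is lower than `contrastive_loss(misaligned, answers)`. Because answers are scored by embedding similarity rather than by a fixed classifier output, the candidate answer set passed to `predict` can be changed at inference time, which is one way to handle an open answer vocabulary.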
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
EF15_003/0000468: Intelligent Machine Perception
Continuities
P - Research and development project financed from public funds (with a link to CEP)
Others
Publication year
2021
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
ICCV2021: Proceedings of the International Conference on Computer Vision
ISBN
978-1-6654-2812-5
ISSN
1550-5499
e-ISSN
2380-7504
Number of pages
12
Pages from-to
1666-1677
Publisher name
IEEE
Place of publication
Piscataway
Event location
Montreal
Event date
Oct 11, 2021
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000797698901085