Just Ask: Learning To Answer Questions From Millions of Narrated Videos
Identifikátory výsledku
Kód výsledku v IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F21%3A00356150" target="_blank" >RIV/68407700:21730/21:00356150 - isvavai.cz</a>
Výsledek na webu
<a href="https://doi.org/10.1109/ICCV48922.2021.00171" target="_blank" >https://doi.org/10.1109/ICCV48922.2021.00171</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICCV48922.2021.00171" target="_blank" >10.1109/ICCV48922.2021.00171</a>
Alternativní jazyky
Jazyk výsledku
angličtina
Název v původním jazyce
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
Popis výsledku v původním jazyce
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
Název v anglickém jazyce
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
Popis výsledku anglicky
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
Klasifikace
Druh
D - Stať ve sborníku
CEP obor
—
OECD FORD obor
10201 - Computer sciences, information science, bioinformathics (hardware development to be 2.2, social aspect to be 5.8)
Návaznosti výsledku
Projekt
<a href="/cs/project/EF15_003%2F0000468" target="_blank" >EF15_003/0000468: Inteligentní strojové vnímání</a><br>
Návaznosti
P - Projekt vyzkumu a vyvoje financovany z verejnych zdroju (s odkazem do CEP)
Ostatní
Rok uplatnění
2021
Kód důvěrnosti údajů
S - Úplné a pravdivé údaje o projektu nepodléhají ochraně podle zvláštních právních předpisů
Údaje specifické pro druh výsledku
Název statě ve sborníku
ICCV2021: Proceedings of the International Conference on Computer Vision
ISBN
978-1-6654-2812-5
ISSN
1550-5499
e-ISSN
2380-7504
Počet stran výsledku
12
Strana od-do
1666-1677
Název nakladatele
IEEE
Místo vydání
Piscataway
Místo konání akce
Montreal
Datum konání akce
11. 10. 2021
Typ akce podle státní příslušnosti
WRD - Celosvětová akce
Kód UT WoS článku
000797698901085