Just Ask: Learning To Answer Questions From Millions of Narrated Videos

Result identifiers

Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21730%2F21%3A00356150" target="_blank" >RIV/68407700:21730/21:00356150 - isvavai.cz</a>
Result on the web
<a href="https://doi.org/10.1109/ICCV48922.2021.00171" target="_blank" >https://doi.org/10.1109/ICCV48922.2021.00171</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1109/ICCV48922.2021.00171" target="_blank" >10.1109/ICCV48922.2021.00171</a>

Alternative languages

Result language
English
Title in original language
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
Result description in original language
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
Title in English
Just Ask: Learning To Answer Questions From Millions of Narrated Videos
Result description in English
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations.
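The abstract mentions a contrastive loss between a video-question encoder and an answer encoder but does not give its exact form. The following is only a minimal sketch of a generic in-batch softmax contrastive loss, assuming dot-product scores and that the other answers in the batch act as negatives. The function name `contrastive_loss` and the plain-list embeddings are illustrative assumptions, not the authors' implementation.

```python
import math


def contrastive_loss(vq_embs, ans_embs):
    """Generic in-batch softmax contrastive loss (illustrative assumption).

    vq_embs[i] and ans_embs[i] are the embeddings of a matching
    (video + question, answer) pair. Every other answer in the batch
    is treated as a negative for row i.
    """
    n = len(vq_embs)
    total = 0.0
    for i in range(n):
        # Dot-product score of video-question i against every answer.
        scores = [sum(a * b for a, b in zip(vq_embs[i], ans)) for ans in ans_embs]
        # Numerically stable log-sum-exp over all candidate answers.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        # Negative log-probability assigned to the matching answer.
        total += log_z - scores[i]
    return total / n
```

Under these assumptions the loss is small when each video-question embedding scores highest against its own answer, and large when it scores higher against another answer in the batch.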

Classification

Type
D - Article in proceedings
CEP field
—
OECD FORD field
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)

Result continuities

Project
<a href="/cs/project/EF15_003%2F0000468" target="_blank" >EF15_003/0000468: Inteligentní strojové vnímání</a><br>
Continuities
P - Research and development project financed from public funds (with a link to CEP)

Others

Publication year
2021
Data confidentiality code
S - Complete and true data on the project are not subject to protection under special legal regulations

Data specific to the result type

Proceedings title
ICCV2021: Proceedings of the International Conference on Computer Vision
ISBN
978-1-6654-2812-5
ISSN
1550-5499
e-ISSN
2380-7504
Number of pages
12
Pages from-to
1666-1677
Publisher name
IEEE
Place of publication
Piscataway
Event location
Montreal
Event date
11 October 2021
Event type by nationality
WRD - Worldwide event
UT WoS article code
000797698901085

Similar results (10)

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models
HowTo100M: Learning a Text-Video Embedding by Watching Hundred Million Narrated Video Clips
Enlargement of the Czech Question-Answering Dataset to SQAD v2.0
