Compositional models for VQA: Can neural module networks really count?
The result's identifiers
Result code in IS VaVaI
<a href="https://www.isvavai.cz/riv?ss=detail&h=RIV%2F68407700%3A21230%2F18%3A00327912" target="_blank" >RIV/68407700:21230/18:00327912 - isvavai.cz</a>
Alternative codes found
RIV/68407700:21730/18:00327912
Result on the web
<a href="https://ac.els-cdn.com/S1877050918323986/1-s2.0-S1877050918323986-main.pdf?_tid=a4ba8c06-ab27-49ab-b27e-28c3ef34031c&acdnat=1549358710_448d7843295e9400663948d0d99401d8" target="_blank" >https://ac.els-cdn.com/S1877050918323986/1-s2.0-S1877050918323986-main.pdf?_tid=a4ba8c06-ab27-49ab-b27e-28c3ef34031c&acdnat=1549358710_448d7843295e9400663948d0d99401d8</a>
DOI - Digital Object Identifier
<a href="http://dx.doi.org/10.1016/j.procs.2018.11.110" target="_blank" >10.1016/j.procs.2018.11.110</a>
Alternative languages
Result language
English
Original language name
Compositional models for VQA: Can neural module networks really count?
Original language description
Large neural networks trained end-to-end usually fail to generalize to novel inputs that were not included in the training data. In contrast, biologically inspired compositional models offer a more robust solution due to the adaptive chaining of logical operations performed by specialized modules. In this paper, we present an implementation of a cognitive architecture based on the End-to-End Module Networks (N2NMNs) model [9] in the humanoid robot Pepper. The architecture is focused on the Visual Question Answering (VQA) task, in which the robot answers natural-language questions about the image it sees. We trained the system on the synthetic CLEVR dataset [10] and tested it on both synthetic images and real-world scenes with CLEVR-like objects. We compare the results and discuss the drop in accuracy in real-world situations. Furthermore, we propose a new evaluation method that tests whether the model's counts of objects in each category are consistent with the overall number of objects it sees. In summary, our results show that current visual reasoning models are still far from being applicable in everyday life.
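The proposed consistency check can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the paper's code: the model object, its answer(image, question) interface, and the question templates are hypothetical names introduced for the example.

def count_consistency(model, image, categories):
    # Ask one counting question per category plus one overall counting
    # question, then check that the per-category answers sum to the total.
    words = {"zero": 0, "one": 1, "two": 2, "three": 3, "four": 4,
             "five": 5, "six": 6, "seven": 7, "eight": 8, "nine": 9,
             "ten": 10}

    def as_int(answer):
        # CLEVR counting answers are small integers, sometimes spelled out.
        a = answer.strip().lower()
        return int(a) if a.isdigit() else words[a]

    per_category = {
        c: as_int(model.answer(image, f"How many {c} objects are there?"))
        for c in categories
    }
    total = as_int(model.answer(image, "How many objects are there?"))
    return sum(per_category.values()) == total, per_category, total

For example, with the CLEVR shape categories one would call count_consistency(model, image, ["cube", "sphere", "cylinder"]). Note that the check is only meaningful when the categories partition the scene, i.e. every object belongs to exactly one of them.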
Czech name
—
Czech description
—
Classification
Type
D - Article in proceedings
CEP classification
—
OECD FORD branch
10201 - Computer sciences, information science, bioinformatics (hardware development to be 2.2, social aspect to be 5.8)
Result continuities
Project
<a href="/en/project/TJ01000470" target="_blank" >TJ01000470: Imitation learning supported by language for industrial robotics</a><br>
Continuities
S - Specific research at universities
Others
Publication year
2018
Confidentiality
S - Complete and true data on the project are not subject to protection under special legal regulations
Data specific for result type
Article name in the collection
Procedia Computer Science
ISBN
—
ISSN
1877-0509
e-ISSN
1877-0509
Number of pages
7
Pages from-to
481-487
Publisher name
Elsevier B.V.
Place of publication
New York
Event location
Prague
Event date
Aug 22, 2018
Type of event by nationality
WRD - Worldwide event
UT code for WoS article
000551069000073