GAMR: A GUIDED ATTENTION MODEL FOR (VISUAL) REASONING

Abstract

Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Here, we present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR), which instantiates an active vision theory positing that the brain solves complex visual reasoning problems dynamically, via sequences of attention shifts to select and route task-relevant visual information into memory. Experiments on an array of visual reasoning tasks and datasets demonstrate GAMR's ability to learn visual routines in a robust and sample-efficient manner. In addition, GAMR is shown to be capable of zero-shot generalization on completely novel reasoning tasks. Overall, our work provides computational support for cognitive theories that postulate the need for a critical interplay between attention and memory to dynamically maintain and manipulate task-relevant visual information to solve complex visual reasoning tasks.

1. INTRODUCTION

Abstract reasoning refers to our ability to analyze information and discover rules to solve arbitrary tasks, and it is fundamental to general intelligence in human and non-human animals (Gentner & Markman, 1997; Lovett & Forbus, 2017). It is considered a critical component for the development of artificial intelligence (AI) systems and has rapidly been gaining attention. A growing body of literature suggests that current neural architectures exhibit significant limitations in their ability to solve relatively simple visual cognitive tasks in comparison to humans (see Ricci et al. (2021) for a review). Given the vast superiority of animals over state-of-the-art AI systems on such tasks, it makes sense to turn to the brain sciences for inspiration and to leverage brain-like mechanisms to improve the ability of modern deep neural networks to solve complex visual reasoning tasks. Indeed, a recent human EEG study has shown that attention and memory processes are needed to solve same-different visual reasoning tasks (Alamia et al., 2021). This interplay between attention and memory has previously been discussed by Buehner et al. (2006), Fougnie (2008), and Cochrane et al. (2019), who emphasize that a model must learn to attend over the contents of memory in order to reason. It is thus not surprising that deep neural networks that lack attention and/or memory systems fail to robustly solve visual reasoning problems that involve such same-different judgments (Kim et al., 2018). Recent computer vision works (Messina et al., 2021a; Vaishnav et al., 2022) have provided further computational evidence for the benefits of attention mechanisms in solving a variety of visual reasoning tasks. Interestingly, in both aforementioned studies, a Transformer module was used to implement a form of attention known as self-attention (Cheng et al., 2016; Parikh et al., 2016). In such a static module, attention mechanisms are deployed in parallel across an entire visual scene. By contrast, modern cognitive theories of active vision postulate that the visual system explores the environment dynamically via sequences of attention shifts to select and route task-relevant information to memory. Psychophysics experiments on overt visual attention (Hayhoe, 2000) have shown that eye movement patterns are driven by task-dependent routines.
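To make this contrast concrete, the sketch below illustrates the difference between static self-attention, deployed in parallel across all image locations, and a guided sequence of attention shifts that routes one selected feature vector into an external memory at each step. This is only a minimal illustration, not the GAMR architecture itself: the tensor shapes, the number of steps, and the rule that lets the previously stored item guide the next shift are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) contrasting static
# self-attention with guided, sequential attention shifts that write
# selected features into an external memory. All shapes and the
# query-update rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def self_attention(feats):
    """Static self-attention: all locations attend to all others in parallel.
    feats: (N, D) feature vectors for N image locations."""
    scores = feats @ feats.T / feats.shape[-1] ** 0.5   # (N, N) pairwise similarities
    return F.softmax(scores, dim=-1) @ feats             # (N, D) re-weighted features

def guided_attention_shifts(feats, query, steps=4):
    """Sequential attention shifts: at each step, a task-dependent query
    selects one relevant location and its features are routed into memory."""
    memory = []
    for _ in range(steps):
        scores = feats @ query / feats.shape[-1] ** 0.5  # (N,) relevance to current query
        attn = F.softmax(scores, dim=-1)                 # soft attention over locations
        selected = attn @ feats                          # (D,) routed feature vector
        memory.append(selected)
        query = selected                                 # assumed rule: next shift guided by last stored item
    return torch.stack(memory)                           # (steps, D) memory contents

feats = torch.randn(49, 64)   # e.g. a 7x7 grid of 64-d image features
query = torch.randn(64)       # initial task-dependent query
print(self_attention(feats).shape)            # torch.Size([49, 64])
print(guided_attention_shifts(feats, query).shape)  # torch.Size([4, 64])
```

The key difference the sketch highlights is that the static variant produces a full set of re-weighted features in one shot, whereas the sequential variant accumulates a small, ordered memory of task-relevant selections, which downstream reasoning can then operate over.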

