GAMR: A GUIDED ATTENTION MODEL FOR (VISUAL) REASONING

Abstract

Humans continue to outperform modern AI systems in their ability to flexibly parse and understand complex visual scenes. Here, we present a novel module for visual reasoning, the Guided Attention Model for (visual) Reasoning (GAMR), which instantiates an active vision theory positing that the brain solves complex visual reasoning problems dynamically, via sequences of attention shifts that select and route task-relevant visual information into memory. Experiments on an array of visual reasoning tasks and datasets demonstrate GAMR's ability to learn visual routines in a robust and sample-efficient manner. In addition, GAMR is shown to be capable of zero-shot generalization to completely novel reasoning tasks. Overall, our work provides computational support for cognitive theories that postulate a critical interplay between attention and memory to dynamically maintain and manipulate task-relevant visual information when solving complex visual reasoning tasks.

1. INTRODUCTION

Abstract reasoning refers to our ability to analyze information and discover rules to solve arbitrary tasks, and it is fundamental to general intelligence in human and non-human animals (Gentner & Markman, 1997; Lovett & Forbus, 2017). It is considered a critical component for the development of artificial intelligence (AI) systems and has rapidly begun to attract attention. A growing body of literature suggests that current neural architectures exhibit significant limitations in their ability to solve relatively simple visual cognitive tasks in comparison to humans (see Ricci et al. (2021) for a review). Given the vast superiority of animals over state-of-the-art AI systems on such tasks, it makes sense to turn to the brain sciences for inspiration, leveraging brain-like mechanisms to improve the ability of modern deep neural networks to solve complex visual reasoning tasks. Indeed, a recent human EEG study has shown that attention and memory processes are needed to solve same-different visual reasoning tasks (Alamia et al., 2021). This interplay between attention and memory was previously discussed by Buehner et al. (2006), Fougnie (2008), and Cochrane et al. (2019), who emphasized that a model must learn to attend over memory in order to reason. It is thus not surprising that deep neural networks which lack attention and/or memory systems fail to robustly solve visual reasoning problems that involve such same-different judgments (Kim et al., 2018). Recent computer vision work (Messina et al., 2021a; Vaishnav et al., 2022) has provided further computational evidence for the benefits of attention mechanisms in solving a variety of visual reasoning tasks. Interestingly, both of these studies used a Transformer module to implement a form of attention known as self-attention (Cheng et al., 2016; Parikh et al., 2016). In such a static module, attention mechanisms are deployed in parallel across an entire visual scene.
By contrast, modern cognitive theories of active vision postulate that the visual system explores the environment dynamically, via sequences of attention shifts, to select and route task-relevant information to memory. Psychophysics experiments on overt visual attention (Hayhoe, 2000) have shown that eye movement patterns are driven by task-dependent routines. Inspired by active vision theory, we describe a dynamic attention mechanism, which we call guided attention. Our proposed Guided Attention Model for (visual) Reasoning (GAMR) learns to shift attention dynamically, in a task-dependent manner, based on queries internally generated by an LSTM executive controller. Through extensive experiments on two visual reasoning challenges, the Synthetic Visual Reasoning Test (SVRT) by Fleuret et al. (2011) and the Abstract Reasoning Task (ART) by Webb et al. (2021), we demonstrate that our neural architecture is capable of learning complex compositions of relational rules in a data-efficient manner and performs better than other state-of-the-art neural architectures for visual reasoning. Using explainability methods, we further characterize the visual strategies leveraged by the model to solve representative reasoning tasks. We demonstrate that our model is compositional, in that it is able to generalize to novel tasks efficiently and learn novel visual routines by re-composing previously learned elementary operations. It also exhibits zero-shot generalization by transferring knowledge across tasks that share similar abstract rules, without the need for re-training.

Contributions Our contributions are as follows:
• We present a novel end-to-end trainable guided-attention module that learns to solve visual reasoning challenges in a data-efficient manner.
• We show that our guided-attention module learns to shift attention to task-relevant locations and to gate relevant visual elements into a memory bank.
• We show that our architecture demonstrates zero-shot generalization and learns compositionally: GAMR is capable of learning efficiently by re-arranging previously learned elementary operations stored within a reasoning module.
• Our architecture sets new benchmarks on two visual reasoning challenges, SVRT (Fleuret et al., 2011) and ART (Webb et al., 2021).

2. PROPOSED APPROACH

Figure 1: Our proposed GAMR architecture is composed of three components: an encoder module (f_e) builds a representation (z_img) of an image; a controller guides the attention module to dynamically shift attention and selectively routes task-relevant object representations (z_t) into a memory bank (M). The recurrent controller (f_s) generates a query vector (q_int_t) at each time step to guide the next shift of attention based on the current fixation. After a few shifts of attention, a reasoning module (r_θ) learns to identify the relationships between the objects stored in memory.

Our model can be divided into three components: an encoder, a controller, and a relational module (see Figure 1 for an overview). The encoder module creates a low-dimensional representation (z_img) of an input image (x_in). It includes a feature extraction block (f_e) composed of five convolutional blocks (SI Figure S1). The output of the module is denoted z_img ∈ R^(128×hw) (with height h and width w). We applied instance normalization (iNorm) (Ulyanov et al., 2016) over z_img
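The dataflow described above can be sketched in PyTorch as follows. This is a minimal illustration, not the paper's exact implementation: the layer sizes, the number of attention steps, the dot-product attention over z_img, and the simple concatenation-based read of the memory bank are all assumptions made for clarity (the paper's encoder has five convolutional blocks; only two are shown here).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GAMRSketch(nn.Module):
    """Schematic GAMR: encoder -> LSTM controller + guided attention -> memory -> reasoning.
    All dimensions and the attention/read rules are illustrative assumptions."""

    def __init__(self, feat_dim=128, hidden_dim=512, time_steps=4, n_classes=2):
        super().__init__()
        # Encoder f_e: convolutional blocks producing a (feat_dim, h, w) feature map
        # (the paper uses five blocks; two suffice for this sketch).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, feat_dim, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.norm = nn.InstanceNorm1d(feat_dim)              # iNorm over z_img
        self.controller = nn.LSTMCell(feat_dim, hidden_dim)  # recurrent controller f_s
        self.query_proj = nn.Linear(hidden_dim, feat_dim)    # emits query q_int_t
        self.time_steps = time_steps
        # Reasoning module r_theta: reads the memory bank M (here, by concatenation).
        self.reason = nn.Sequential(
            nn.Linear(time_steps * feat_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, x):
        b = x.size(0)
        z = self.encoder(x).flatten(2)       # z_img: (b, feat_dim, h*w)
        z = self.norm(z)
        h = z.new_zeros(b, self.controller.hidden_size)
        c = torch.zeros_like(h)
        memory = []
        inp = z.mean(dim=2)                  # initial glimpse: global average of z_img
        for _ in range(self.time_steps):
            h, c = self.controller(inp, (h, c))
            q = self.query_proj(h)           # internally generated query q_int_t
            # Attention shift: query-feature similarity over spatial locations.
            attn = F.softmax(torch.einsum("bd,bdl->bl", q, z), dim=1)
            z_t = torch.einsum("bl,bdl->bd", attn, z)  # attended object vector z_t
            memory.append(z_t)               # route z_t into the memory bank M
            inp = z_t                        # next shift conditioned on current fixation
        M = torch.cat(memory, dim=1)
        return self.reason(M)
```

A usage note: for a batch of two 32×32 RGB images, `GAMRSketch()(torch.randn(2, 3, 32, 32))` yields class logits of shape `(2, n_classes)`; the memory bank here simply accumulates one attended vector per time step.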

