LEARNING TO REASON OVER VISUAL OBJECTS

Abstract

A core component of human intelligence is the ability to identify abstract patterns inherent in complex, high-dimensional perceptual data, as exemplified by visual reasoning tasks such as Raven's Progressive Matrices (RPM). Motivated by the goal of designing AI systems with this capacity, recent work has focused on evaluating whether neural networks can learn to solve RPM-like problems. Previous work has generally found that strong performance on these problems requires the incorporation of inductive biases that are specific to the RPM problem format, raising the question of whether such models might be more broadly useful. Here, we investigated the extent to which a general-purpose mechanism for processing visual scenes in terms of objects might help promote abstract visual reasoning. We found that a simple model, consisting only of an object-centric encoder and a transformer reasoning module, achieved state-of-the-art results on two challenging RPM-like benchmarks (PGM and I-RAVEN), as well as on a novel benchmark with greater visual complexity (CLEVR-Matrices). These results suggest that an inductive bias for object-centric processing may be a key component of abstract visual reasoning, obviating the need for problem-specific inductive biases.

1. INTRODUCTION

Human reasoning is driven by a capacity to extract simple, low-dimensional abstractions from complex, high-dimensional inputs. We perceive the world around us in terms of objects, relations, and higher-order patterns, allowing us to generalize beyond the sensory details of our experiences and make powerful inferences about novel situations (Spearman, 1923; Gick & Holyoak, 1983; Lake et al., 2017). This capacity for abstraction is particularly well captured by visual analogy problems, in which the reasoner must abstract over the superficial details of visual inputs in order to identify a common higher-order pattern (Gentner, 1983; Holyoak, 2012). A particularly challenging example of this kind of problem is the Raven's Progressive Matrices (RPM) problem set (Raven, 1938), which has been found to be especially diagnostic of human reasoning abilities (Snow et al., 1984). A growing body of recent work has aimed to build learning algorithms that capture this capacity for abstract visual reasoning. Much of this work has revolved around two recently developed benchmarks, the Procedurally Generated Matrices (PGM) (Barrett et al., 2018) and the RAVEN dataset (Zhang et al., 2019a), each consisting of a large number of automatically generated RPM-like problems. As in RPM, each problem consists of a 3 × 3 matrix populated with geometric forms, in which the bottom right cell is blank. The challenge is to infer the abstract pattern that governs the relationship along the first two columns and/or rows of the matrix, and to use that inferred pattern to 'fill in the blank' by selecting from a set of choices. As can be seen in Figure 1, these problems can be quite complex, with potentially many objects per cell and multiple rules per problem, yielding a highly challenging visual reasoning task.

There is substantial evidence that human visual reasoning is fundamentally organized around the decomposition of visual scenes into objects (Duncan, 1984; Pylyshyn, 1989; Peters & Kriegeskorte, 2021). Objects offer a simple, yet powerful, low-dimensional abstraction that captures the inherent compositionality underlying visual scenes. Despite the centrality of objects in visual reasoning, previous work has not explored the use of object-centric representations in abstract visual reasoning tasks such as RAVEN and PGM, or has at best employed an imprecise approximation to object representations based on spatial location. Recently, a number of methods have been proposed for extracting precise object-centric representations directly from pixel-level inputs, without the need for veridical segmentation data (Greff et al., 2019; Burgess et al., 2019; Locatello et al., 2020; Engelcke et al., 2021). While these methods have been shown to improve performance in some visual reasoning tasks, including question answering from video (Ding et al., 2021) and prediction of physical interactions from video (Wu et al., 2022), previous work has not addressed whether this approach is useful in the domain of abstract visual reasoning (i.e., visual analogy).

To address this, we developed a model that combines an object-centric encoding method, slot attention (Locatello et al., 2020), with a generic transformer-based reasoning module (Vaswani et al., 2017). The combined system, termed the Slot Transformer Scoring Network (STSN, Figure 1), achieves state-of-the-art performance on both PGM and I-RAVEN (a more challenging variant of RAVEN), despite its general-purpose architecture and lack of task-specific augmentations. Furthermore, we developed a novel benchmark, CLEVR-Matrices (Figure 2), with a similar RPM-like problem structure but greater visual complexity, and found that STSN also achieves state-of-the-art performance on this task. These results suggest that object-centric encoding is an essential component of strong abstract visual reasoning, and indeed may be even more important than some task-specific inductive biases.
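To make this scoring scheme concrete, the following is a minimal PyTorch sketch of how each candidate answer can be paired with the eight context panels and scored; the names (score_answers, reasoner, readout) and the tensor shapes are illustrative assumptions, not the paper's exact implementation.

    import torch
    import torch.nn as nn

    def score_answers(context_slots, choice_slots, reasoner, readout):
        # context_slots: (B, 8, K, D) -- slots for the 8 context panels
        # choice_slots:  (B, 8, K, D) -- slots for the 8 candidate answers
        B, C, K, D = choice_slots.shape
        scores = []
        for c in range(C):
            # concatenate the context-panel slots with one candidate's slots
            seq = torch.cat([context_slots.reshape(B, -1, D),
                             choice_slots[:, c]], dim=1)   # (B, 9*K, D)
            h = reasoner(seq).mean(dim=1)                   # pool over sequence
            scores.append(readout(h))                       # (B, 1) score
        return torch.cat(scores, dim=-1)                    # (B, 8) logits

    # Illustrative usage: a generic transformer encoder as the reasoner,
    # with the task loss computed as a softmax cross-entropy over scores.
    reasoner = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True),
        num_layers=2)
    readout = nn.Linear(64, 1)
    logits = score_answers(torch.randn(2, 8, 6, 64),
                           torch.randn(2, 8, 6, 64), reasoner, readout)
    loss_task = nn.functional.cross_entropy(logits, torch.tensor([3, 5]))

In this sketch the reconstruction loss is omitted; as described below (Figure 1), the full model additionally decodes each panel's slots back to pixels to compute L_recon.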



Figure 1: Slot Transformer Scoring Network (STSN). STSN combines slot attention, an object-centric encoding method, with a transformer reasoning module. Slot attention decomposes each image panel into a set of K slots, which are randomly initialized and iteratively updated through competitive attention over the image. STSN assigns a score to each of the 8 potential answers by independently evaluating the combination of each answer choice together with the 8 context panels. For each answer choice, slots are extracted from that choice and from the context panels, and these slots are concatenated to form a sequence that is passed to the transformer, which then generates a score. The scores for all answer choices are passed through a softmax in order to compute the task loss L_task. Additionally, the slots for each image panel are passed through a slot decoder, yielding a reconstruction of that image panel, from which the reconstruction loss L_recon is computed.
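The slot attention update described in the caption can be summarized in a short sketch; this is a simplified PyTorch rendering of the algorithm from Locatello et al. (2020) (omitting the residual MLP refinement and positional embeddings), with assumed dimensions, rather than the authors' exact code.

    import torch
    import torch.nn as nn

    class SlotAttention(nn.Module):
        # Simplified slot attention: K randomly initialized slots compete,
        # via a softmax over the slot axis, to explain image features.
        def __init__(self, num_slots=8, dim=64, iters=3):
            super().__init__()
            self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
            self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
            self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
            self.to_q = nn.Linear(dim, dim)
            self.to_k = nn.Linear(dim, dim)
            self.to_v = nn.Linear(dim, dim)
            self.gru = nn.GRUCell(dim, dim)
            self.norm_in = nn.LayerNorm(dim)
            self.norm_slots = nn.LayerNorm(dim)

        def forward(self, feats):  # feats: (B, N, dim) flattened CNN features
            B, N, D = feats.shape
            feats = self.norm_in(feats)
            k, v = self.to_k(feats), self.to_v(feats)
            # random initialization of slots, refined over several iterations
            slots = self.slots_mu + self.slots_logsigma.exp() * \
                torch.randn(B, self.num_slots, D, device=feats.device)
            for _ in range(self.iters):
                q = self.to_q(self.norm_slots(slots))
                # softmax over the slot axis -> slots compete for image locations
                attn = (q @ k.transpose(1, 2) * self.scale).softmax(dim=1)
                attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean
                updates = attn @ v                             # (B, K, D)
                slots = self.gru(updates.reshape(-1, D),
                                 slots.reshape(-1, D)).reshape(B, self.num_slots, D)
            return slots

Because the softmax is taken over the slot axis rather than the input axis, slots compete to explain each image location, which is what drives the decomposition of a panel into object-like components.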

