OCIM: OBJECT-CENTRIC COMPOSITIONAL IMAGINATION FOR VISUAL ABSTRACT REASONING

Abstract

A long-sought property of machine learning systems is the ability to compose learned concepts in novel ways, enabling them to make sense of new situations. Such a capacity for imagination, a core aspect of human intelligence, has not yet been attained by machines. In this work, we show that object-centric inductive biases can be leveraged to derive an imagination-based learning framework that achieves compositional generalization on a series of tasks. Our method, denoted Object-centric Compositional Imagination (OCIM), decomposes visual reasoning tasks into a series of primitives applied to objects, without relying on a domain-specific language. We show that these primitives can be recomposed to generate new imaginary tasks. By training on such imagined tasks, the model learns to reuse previously learned concepts and to generalize systematically at test time. We evaluate our model on a series of arithmetic tasks in which it must infer the sequence of operations (programs) applied to a series of inputs. We find that imagination is key for the model to find the correct solution for unseen combinations of operations.

1. INTRODUCTION

Humans have the remarkable ability to adapt to new, unseen environments with little experience (Lake et al., 2017). In contrast, machine learning systems are sensitive to distribution shifts (Arjovsky et al., 2019; Su et al., 2019; Engstrom et al., 2019). One of the key aspects that makes human learning so robust is the ability to produce or acquire new knowledge by composing a few learned concepts in novel ways, an ability known as compositional generalization (Fodor and Pylyshyn, 1988; Lake et al., 2017). Although the question of how to achieve such compositional generalization in brains or machines is an active area of research (Ruis and Lake, 2022), a promising hypothesis, formulated as the Overfitted Brain Hypothesis (OBH) (Hoel, 2021), is that dreams are a crucial element. Both imagination and abstraction are core to human intelligence. Objects in particular are an important representation used by the human brain when applying analogical reasoning (Spelke, 2000). For instance, we can infer the properties of a new object by transferring our knowledge of these properties from similar objects (Mitchell, 2021). This realization has inspired a recent body of work on learning models that discover objects in a visual scene without supervision (Eslami et al., 2016b; Kosiorek et al., 2018; Greff et al., 2017; van Steenkiste et al., 2018; Greff et al., 2019; Burgess et al., 2019; van Steenkiste et al., 2019; Locatello et al., 2020). Many of these works propose inductive biases that lead to a decomposition of the visual scene in terms of its constituent objects. The expectation is that such an object-centric decomposition should lead to better generalization, since it better reflects the underlying structure of the physical world (Parascandolo et al., 2018). To the best of our knowledge, however, the effect of object-centric representations on systematic generalization in visual reasoning tasks remains largely unexplored.
While abstractions, such as objects, allow for reasoning and planning beyond direct experience, novel configurations of experienced concepts become possible through imagination. Hoel (2021) goes even further and posits that dreaming, which is a form of imagination, improves the generalization and robustness of learned representations. Dreams do so by producing new perceptual events composed of concepts experienced and learned during wake-time. These perceptual events can be described by two types of knowledge (Goyal et al., 2020; 2021b): declarative knowledge encoding object states (i.e., the entities that constitute the dreams), and procedural knowledge encoding how those entities behave and interact with each other (i.e., how they are processed to form the perceptual event). In this work, we take a step towards showing how the OBH can be implemented to derive a new imagination-based learning framework that, like dreams, allows for better compositional generalization. We thus propose OCIM, an example of how object-centric inductive biases can be exploited to derive imagination-based learning frameworks. More specifically, we model a perceptual event by its object-centric representations together with a modular architecture that processes them to solve the task at hand. Similar to Ellis et al. (2021), we take a program-induction approach to reasoning. In order to solve a task, the model needs to (1) abstract the perceptual input in an object-centric manner (i.e., represent declarative knowledge), and (2) select the right arrangement of processing modules (which can be seen as a neural program) that solves the task at hand. In order to generalize beyond direct experience through imagined scenarios, a model would have to imagine both of these components (i.e., the objects and how to process them). Here we restrict ourselves to imagining new ways to process experienced perceptual objects, and we propose to do so by exploiting object-centric processing inductive biases.
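To make the two components concrete, the following is a minimal sketch (not the paper's implementation; the class name, module choices, and dimensions are all hypothetical) of a model that operates on object slots and, at each program step, selects one of a set of independently parameterized modules through a selection bottleneck:

```python
# Illustrative sketch only: a modular reasoner that (1) consumes an
# object-centric slot representation and (2) chooses a sequence of
# neural modules (a "neural program") to apply to it.
import torch
import torch.nn as nn


class ModularReasoner(nn.Module):
    def __init__(self, num_modules=4, slot_dim=32, program_len=3):
        super().__init__()
        # Independently parameterized modules: primitive operations on slots.
        self.primitives = nn.ModuleList(
            nn.Linear(slot_dim, slot_dim) for _ in range(num_modules))
        # Selection bottleneck: scores each module at each program step.
        self.selector = nn.Linear(slot_dim, num_modules)
        self.program_len = program_len

    def forward(self, slots):
        # slots: (batch, num_slots, slot_dim) object-centric representation.
        program = []
        for _ in range(self.program_len):
            scores = self.selector(slots.mean(dim=1))  # (batch, num_modules)
            choice = scores.argmax(dim=-1)             # hard selection per example
            program.append(choice)
            # Apply each example's selected module to all of its slots.
            outs = torch.stack([m(slots) for m in self.primitives], dim=1)
            idx = choice.view(-1, 1, 1, 1).expand(-1, 1, *slots.shape[1:])
            slots = outs.gather(1, idx).squeeze(1)
        return slots, program
```

The sketch uses a hard argmax for clarity; in practice a differentiable selection (e.g., attention or Gumbel-softmax) would be needed to train the selector end-to-end.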
The idea is to have a neural program (Reed and De Freitas, 2015; Cai et al., 2017; Li et al., 2020) composed of modular neural components that can be rearranged (e.g., "sampled" through selection bottlenecks) to invent new tasks. The capacity to generate unseen tasks enables OCIM to generalize systematically to never-seen-before tasks by (1) producing new imagined scenarios composed of learned and experienced concepts, and (2) training the model on these imagined samples to predict back their constituting concepts (i.e., the modules that were sampled to produce them). Our contribution is threefold:
• We propose an example of how object-centric inductive biases can be used to derive an imagination-based learning framework. Specifically, we show that rearranging modular parts of an object-centric processing model to produce an imagined sample, and training the model to predict the arrangement that produced that sample, helps with compositional generalization.
• We propose a visual abstract reasoning dataset to illustrate our imagination framework and to evaluate models along different axes of generalization.
• We highlight some drawbacks of current state-of-the-art (SOTA) object-centric perception models when it comes to disentangling independent factors of variation within a single visual object.
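The imagination loop described above, sample a module arrangement, generate an imagined sample, and train the model to recover the arrangement, can be sketched as follows. This is a simplified illustration under assumed names and dimensions, not the paper's actual training code:

```python
# Illustrative sketch of the imagination loop (hypothetical setup):
# 1) sample a random program over learned modules,
# 2) apply it to experienced object representations to imagine a task,
# 3) train a predictor to recover the program from (input, output).
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

num_modules, slot_dim, program_len = 4, 16, 2
# Learned primitive modules (frozen while imagining in this sketch).
primitives = nn.ModuleList(
    nn.Linear(slot_dim, slot_dim) for _ in range(num_modules))
# Program predictor: from an (input, output) pair, predict the module
# used at each program step.
predictor = nn.Linear(2 * slot_dim, program_len * num_modules)
opt = torch.optim.Adam(predictor.parameters())


def imagine_and_train(slots):
    # 1) Imagine: sample a possibly never-seen arrangement of modules.
    program = [random.randrange(num_modules) for _ in range(program_len)]
    out = slots
    with torch.no_grad():  # the imagined sample is treated as data
        for step in program:
            out = primitives[step](out)
    # 2) Train: predict the sampled program back from the imagined pair.
    logits = predictor(torch.cat([slots, out], dim=-1))
    logits = logits.view(-1, program_len, num_modules)
    target = torch.tensor(program).expand(slots.shape[0], -1)
    loss = F.cross_entropy(logits.reshape(-1, num_modules),
                           target.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


loss = imagine_and_train(torch.randn(8, slot_dim))
```

Because the imagined program is sampled uniformly, the predictor is exposed to combinations of modules that never co-occur in the training tasks, which is the mechanism the paper credits for systematic generalization.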

2. RELATED WORK

Object-centric Representation. A recent research direction explores unsupervised object-centric representation learning from visual inputs (Locatello et al., 2020; Burgess et al., 2019; Greff et al., 2019; Eslami et al., 2016a; Crawford and Pineau, 2019; Stelzner et al., 2019; Lin et al., 2020; Geirhos et al., 2019). The main motivation behind this line of work is to disentangle a latent representation in terms of the objects composing the visual scene (e.g., slots). Recent approaches to slot-based representation learning focus on the generative abilities of the models; in our case, we study the impact of object-centric inductive biases on the systematic generalization of models in a visual reasoning task. We observe that the modularity of representations is as important as the mechanisms that operate on them (Goyal et al., 2020; 2021b). Additionally, we show that object-centric inductive biases on both representations and mechanisms allow us to derive an imagination framework that leads to better systematic generalization.

Modular Architectures. Recent approaches have explored architectures composed of a set of independently parameterized modules that compete with each other to communicate and to attend to or process an input (Goyal et al., 2019; 2020; 2021b). Such architectures are inspired by the notion of independent mechanisms (Pearl, 2009; Bengio et al., 2019; Goyal et al., 2019; Goyal and Bengio, 2022), which suggests that a set of independently parameterized modules capturing causal mechanisms should remain robust to distribution shifts. Goyal and Bengio (2022) have proposed to translate these characteristics into architectural inductive biases for deep neural networks.

