OCIM: OBJECT-CENTRIC COMPOSITIONAL IMAGINATION FOR VISUAL ABSTRACT REASONING

Abstract

A long-sought property of machine learning systems is the ability to compose learned concepts in novel ways that would enable them to make sense of new situations. Such capacity for imagination, a core aspect of human intelligence, has not yet been attained by machines. In this work, we show that object-centric inductive biases can be leveraged to derive an imagination-based learning framework that achieves compositional generalization on a series of tasks. Our method, denoted Object-centric Compositional Imagination (OCIM), decomposes visual reasoning tasks into a series of primitives applied to objects without using a domain-specific language. We show that these primitives can be recomposed to generate new imaginary tasks. By training on such imagined tasks, the model learns to reuse the previously learned concepts to systematically generalize at test time. We test our model on a series of arithmetic tasks where the model has to infer the sequence of operations (programs) applied to a series of inputs. We find that imagination is key for the model to find the correct solution for unseen combinations of operations.

1. INTRODUCTION

Humans have the remarkable ability to adapt to new unseen environments with little experience (Lake et al., 2017). In contrast, machine learning systems are sensitive to distribution shifts (Arjovsky et al., 2019; Su et al., 2019; Engstrom et al., 2019). One of the key aspects that makes human learning so robust is the ability to produce or acquire new knowledge by composing a few learned concepts in novel ways, an ability known as compositional generalization (Fodor and Pylyshyn, 1988; Lake et al., 2017). Although the question of how to achieve such compositional generalization in brains or machines is an active area of research (Ruis and Lake, 2022), a promising hypothesis, the Overfitted Brain Hypothesis (OBH), is that dreams are a crucial element (Hoel, 2021). Both imagination and abstraction are core to human intelligence. Objects in particular are an important representation used by the human brain when applying analogical reasoning (Spelke, 2000). For instance, we can infer the properties of a new object by transferring our knowledge of these properties from similar objects (Mitchell, 2021). This realization has inspired a recent body of work that focuses on learning models that discover objects in a visual scene without supervision (Eslami et al., 2016b; Kosiorek et al., 2018; Greff et al., 2017; van Steenkiste et al., 2018; Greff et al., 2019; Burgess et al., 2019; van Steenkiste et al., 2019; Locatello et al., 2020). Many of these works propose inductive biases that lead to a decomposition of the visual scene in terms of its constituent objects. The expectation is that such an object-centric decomposition would lead to better generalization since it better represents the underlying structure of the physical world (Parascandolo et al., 2018). To the best of our knowledge, the effect of object-centric representations on systematic generalization in visual reasoning tasks remains largely unexplored.
While abstractions, like objects, allow for reasoning and planning beyond direct experience, novel configurations of experienced concepts are possible through imagination. Hoel (2021) goes even further and posits that dreaming, which is a form of imagination, improves the generalization and robustness of learned representations. Dreams do so by producing new perceptual events that are composed of concepts experienced or learned during wake-time. These perceptual events can be described by two knowledge types (Goyal et al., 2020; 2021b): the declarative knowledge encoding object states (e.g. entities that constitute the dreams), and the procedural knowledge encoding how they behave and interact with each other (e.g. how these entities are processed to form the perceptual event). In this work, we take a step towards showing how OBH can be implemented to derive a new imagination-based learning framework that allows for better compositional generalization, as dreams do. We thus propose OCIM, an example of how object-centric inductive biases can be exploited to derive imagination-based learning frameworks. More specifically, we model a perceptual event by its object-centric representations and a modular architecture that processes them to solve the task at hand. Similar to Ellis et al. (2021), we take the program-induction approach to reasoning. In order to solve a task, the model needs to (1) abstract the perceptual input in an object-centric manner (e.g. represent declarative knowledge), and (2) select the right arrangement of processing modules (which can be seen as a neural program) that solves the task at hand. In order to generalize beyond direct experience through imagined scenarios, a model would have to imagine both of these components (i.e. objects and how to process them). Here we restrict ourselves to imagining new ways to process experienced perceptual objects. We propose to do so by exploiting object-centric processing of inductive biases.
The idea is to have a neural program (Reed and De Freitas, 2015; Cai et al., 2017; Li et al., 2020) composed of modular neural components that can be rearranged (e.g. "sampled" through selection bottlenecks) to invent new tasks. The capacity to generate unseen tasks enables OCIM to generalize systematically to never-seen-before tasks by (1) producing new imagined scenarios composed of learned/experienced concepts and (2) training the model on these imagined samples to predict back their constituting concepts (e.g. the modules that were sampled to produce them). Our contribution is threefold:
• We propose an example of how object-centric inductive biases can be used to derive an imagination-based learning framework. Specifically, we show that rearranging modular parts of an object-centric processing model to produce an imagined sample, and training the model to predict the arrangement that produced that sample, helps with compositional generalization.
• We propose a visual abstract reasoning dataset to illustrate our imagination framework and evaluate the models along different axes of generalization.
• We highlight some drawbacks of current state-of-the-art (SOTA) object-centric perception models when it comes to disentangling independent factors of variation within a single visual object.

2. RELATED WORK

Object-centric Representation. A recent research direction explores unsupervised object-centric representation learning from visual inputs (Locatello et al., 2020; Burgess et al., 2019; Greff et al., 2019; Eslami et al., 2016a; Crawford and Pineau, 2019; Stelzner et al., 2019; Lin et al., 2020; Geirhos et al., 2019). The main motivation behind this line of work is to disentangle a latent representation in terms of the objects composing the visual scene (e.g. slots). Recent approaches to slot-based representation learning focus on the generative abilities of the models; in our case, we study the impact of object-centric inductive biases on the systematic generalization of the models in a visual reasoning task. We observe that modularity of representations is as important as the mechanisms that operate on them (Goyal et al., 2020; 2021b). Additionally, we show that object-centric inductive biases of both representations and mechanisms allow us to derive an imagination framework that leads to better systematic generalization.
Modularity. Extensive work from the cognitive neuroscience literature (Baars, 1997; Dehaene et al., 2017) suggests that the human brain represents knowledge in a modular way, with different parts (e.g. modules) interacting with a working memory bottleneck via attention mechanisms. Following these observations, a line of work in machine learning (Goyal and Bengio, 2020; Goyal et al., 2019; 2020; 2021b; Ostapenko et al., 2021; Goyal and Bengio, 2022) has proposed to translate these characteristics into architectural inductive biases for deep neural networks. Recent approaches have explored architectures composed of a set of independently parameterized modules that compete with each other to communicate and attend or process an input (Goyal et al., 2019; 2020; 2021b).
Such architectures are inspired by the notion of independent mechanisms (Pearl, 2009; Bengio et al., 2019; Goyal et al., 2019; Goyal and Bengio, 2022), which suggests that a set of independently parameterized modules capturing causal mechanisms should remain robust to distribution shifts caused by interventions, as adapting one module should not require adapting the others. The hope is that out-of-distribution (OOD) generalization would be facilitated by making it possible to sequentially compose the computations performed by these modules, whereby new situations can be explained by novel combinations of existing concepts. In this work, we show how modular architectural choices can be exploited to derive an imagination-based learning paradigm that allows better compositional generalization; we do so by explicitly exposing the model to data samples composed of novel combinations of learned concepts.
Imagination, Dreaming, and Generalization. Dreams are a form of imagination that has inspired a significant amount of influential work (Hinton et al., 2006; Ellis et al., 2021; Hafner et al., 2019; 2020). An interesting explanation for this phenomenon is the overfitted brain hypothesis (OBH) (Hoel, 2021), which states that dreaming improves the generalization and robustness of learned representations. The idea is that, while dreaming, the brain recombines patterns seen during wake time. This results in artificial data augmentation in the form of dreams. This way, dreams regularize and prevent the brain from overfitting the patterns seen while being awake. In machine
Most visual reasoning benchmarks revolve around variations of Raven's Progressive Matrices (RPM) (James, 1936; Zhang et al., 2019; Barrett et al., 2018; Hoshen and Werman, 2017), which are discriminative tasks in which the solver chooses from a set of candidate answers.
However, in a recent survey, Mitchell (2021) recommends evaluating models on generative tasks that focus on human core knowledge (Spelke, 2000). Models trained on generative tasks are indeed less prone to shortcut learning, and systems that generate answers are in many cases more interpretable. To that end, Chollet (2019) proposes the Abstract Reasoning Corpus (ARC), where the model is given a few examples of input/output (I/O) pairs and has to understand the underlying common program that was applied to the inputs to obtain the outputs. ARC tasks are meant to rely only on the innate core knowledge systems, which include intuitive knowledge about objects, agents and their goals, numerosity, and basic spatial-temporal concepts. However, ARC remains unapproachable by current deep learning methods. We propose to take a step towards solving ARC by designing a new generative benchmark in which we evaluate systematic compositional generalization. Like ARC, our dataset is composed of a collection of support sets, each having a number of input/output pairs, such that the output for every sample in a support set is obtained by applying the same program to the corresponding input. The model is then presented with a new query input and evaluated on its ability to predict the right associated output (i.e. applying the program inferred from the support set to the query input). The inputs are 56 × 56 images with three colored MNIST digits placed at three different positions. These visual digits can have values between -9 and 9, and their color represents their sign. There are six different colors in total (3 of them are negative and the remaining 3 are positive). The program applied to the inputs is a sequence of arithmetic operations in a particular positional order. We restrict ourselves to addition and subtraction; the dataset can further be extended with more complicated queries that involve comparisons between the different objects, maximum operations, etc.
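As an illustration of how a target could be derived from one sample of the dataset described above, the following minimal sketch applies a program (operations plus positional order) to three signed digits. The function names, the dictionary layout, and the two's-complement 8-bit encoding are assumptions made for this example; the text only states that digits lie in [-9, 9], that programs are sequences of additions and subtractions in a positional order, and that the output is predicted as 8 binary bits.

```python
# Illustrative sketch only: names and the 8-bit encoding are assumptions.

def apply_program(digits, order, ops):
    """Apply a sequence of +/- operations to signed digits taken in a positional order.

    digits: dict mapping a position name to a signed value in [-9, 9]
    order:  sequence of position names, e.g. ("top-left", "top-right", "down")
    ops:    sequence of "+" / "-" with len(ops) == len(order) - 1
    """
    if len(ops) != len(order) - 1:
        raise ValueError("need exactly one operation between consecutive arguments")
    result = digits[order[0]]
    for op, pos in zip(ops, order[1:]):
        result = result + digits[pos] if op == "+" else result - digits[pos]
    return result


def to_bits(value, n_bits=8):
    """Encode an integer as n_bits binary labels (two's complement, MSB first)."""
    return [(value >> i) & 1 for i in reversed(range(n_bits))]


sample = {"top-left": 3, "top-right": -7, "down": 5}
target = apply_program(sample, ("top-left", "top-right", "down"), ("+", "-"))
print(target, to_bits(target))  # 3 + (-7) - 5 = -9
```

Every input in a support set would share the same `order` and `ops`, so only the digit values change from one I/O pair to the next.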
Since we are interested in the model's ability to generalize compositionally to unseen examples, we create different splits that aim at evaluating different axes of compositional generalization. These three splits are as follows: (1) OOD seq: during training, we leave out some sequences of operations (e.g. (+, -)) and perform the evaluation on samples requiring the excluded sequences; (2) OOD order: during training, the model only sees programs that take input digits in some particular positional orders (e.g. top-left, top-right, down) and is evaluated on unseen orders; and (3) OOD perception, which evaluates the ability of the perception module to disentangle the digit class (e.g. 1 to 9) from its color (representing its sign). For this last split, we consider certain pairs of digit-color configurations during training and evaluate the model on unseen pairs.
Our model is designed to generate answers in a sequential way. The design choices that we make reflect the fact that the output answer to a generative reasoning task can be computed by sequentially updating a working memory whose arguments are obtained from the available input (e.g. slots extracted from images in our case). The computation steps are the following: (1) visual inputs $x$ are mapped to $N_s$ object-centric slots $S = [S_1, \dots, S_{N_s}]$ using a Slot Attention module (Locatello et al., 2020); then, (2) the controller takes the support set as input and outputs a single task embedding $z$; this task embedding is then (3) translated into a sequential neural program (e.g. a sequence of transformations to be applied to a working memory); finally, (4) the executor takes this neural program along with an input query (i.e. its object-centric slots) and performs the sequential updates. The overall computation paths are given in Figure 2. The main contribution of our work resides in the architecture of the executor, its interface with the controller (i.e. how the executor uses the information encoded in the controller) and the derived imagination machinery. The controller scaffold (detailed in the appendix) that we use in all baselines always outputs a single task embedding $z$ and can be adapted depending on the task at hand. In this section, we detail the modeling choices of the executor, the interface between the executor and the controller (formulated as a selection bottleneck), and the imagination component.

4.1. EXECUTOR

The executor takes a visual query input $x \in \mathbb{R}^{56 \times 56 \times 3}$ and a neural program; it then updates a working memory $h \in \mathbb{R}^{p}$ in a sequential and structured manner. The visual input is first mapped to a set of $N_s$ object-centric slots $[S_1, \dots, S_{N_s}]$ that is later used as candidate arguments for each update of the working memory. The executor is composed of a library of $N_r$ learned modules (e.g. rules, implemented as small GRU cells) and $N_c$ condition values. The conditions are expected to encode the way in which to select an argument (e.g. among the slots) that is in turn used by a module to update the working memory. Both rules (i.e. modules) and conditions are indexed by learned tags $M = [M_1, \dots, M_{N_r}]$ and $C = [C_1, \dots, C_{N_c}]$. The neural program that the executor takes as input specifies (1) the number $T$ of updates that the executor needs to perform, given by a scalar gate at each time step (we denote the sequence of such gates by $g = [g_1, \dots, g_T]$); (2) the sequence of modules $[\hat{m}_1, \dots, \hat{m}_T]$ (each $\hat{m}_t$ parameterized by a small learned GRU RNN) that will perform the $T$ updates of the working memory; and (3) the sequence of conditions $[\hat{c}_0, \dots, \hat{c}_T]$ (each $\hat{c}_t$ being a condition vector that selects one slot; this slot is used as an argument to the selected module at each time step). At each time step $t$, the update is done in the following two steps:
• Argument selection: given a condition vector $\hat{c}_t$, select an argument of the update from among the input slots $S$ of the query.
• Update: update the working memory $h_{t-1}$ with the GRU rule $\hat{m}_t$ and the selected argument $\hat{s}_t$.
Argument Selection. At each time step, a slot argument is selected through a key-query attention mechanism. The idea is that the condition vector $\hat{c}_t$ is compared against all the input slots to select the one that corresponds best to the features encoded in the condition (e.g. select the slot at the "top-left" of the image).
The attention mechanism is thus realized using the condition vector $\hat{c}_t \in \mathbb{R}^{1 \times d}$ as a query and the $N_s$ slots $S = [S_1, \dots, S_{N_s}] \in \mathbb{R}^{N_s \times d}$ as keys, such that the selected argument $\hat{s}_t$ at time step $t$ is given by:
$$\hat{s}_t = \mathrm{GumbelSoftmax}\left(\frac{\hat{c}_t S^{\top}}{\sqrt{d}}\right) S \in \mathbb{R}^{1 \times d} \tag{1}$$
The sequence of selected arguments is thus given by $\hat{s} = [\hat{s}_0, \dots, \hat{s}_T]$.
Sequential Update. Given a sequence of processing modules $[\hat{m}_1, \dots, \hat{m}_T]$, a sequence of input arguments $[\hat{s}_0, \dots, \hat{s}_T]$ and a length given by a sequence of gates $[g_1, \dots, g_T]$, the executor updates a working memory whose state at time step $t$ is denoted by $h_t$, such that:
$$h_{t+1} = g_{t+1}\, h_t + (1 - g_{t+1})\, \hat{m}_t(\hat{s}_{t+1}, h_t) \quad \text{and} \quad h_0 = \hat{s}_0 \tag{2}$$
For ease of notation, we let Executor($x$, $P$) be the result of applying the neural program $P$ to the visual input $x$.
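The two executor steps (Eq. 1 and Eq. 2) can be sketched in plain Python as follows. This is a toy illustration under stated assumptions, not the authors' implementation: slots and the working memory are short lists of floats, the learned GRU modules are replaced by hypothetical hand-written `add` and `sub` functions, and the Gumbel noise is optional so that the example stays deterministic.

```python
import math
import random


def gumbel_softmax(logits, tau=1.0, rng=None):
    """Softmax over (optionally Gumbel-perturbed) logits; rng=None disables the noise."""
    if rng is not None:
        logits = [l - math.log(-math.log(rng.uniform(1e-9, 1.0 - 1e-9))) for l in logits]
    peak = max(logits)
    exps = [math.exp((l - peak) / tau) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def select_argument(condition, slots, rng=None):
    """Eq. 1: one condition vector (query) attends over the slots (keys and values)."""
    d = len(condition)
    scores = [sum(c * s for c, s in zip(condition, slot)) / math.sqrt(d) for slot in slots]
    weights = gumbel_softmax(scores, rng=rng)
    return [sum(w * slot[k] for w, slot in zip(weights, slots)) for k in range(d)]


def run_executor(slots, conditions, modules, gates, rng=None):
    """Eq. 2: h_0 = s_0, then h_{t+1} = g_{t+1} h_t + (1 - g_{t+1}) m_t(s_{t+1}, h_t)."""
    args = [select_argument(c, slots, rng) for c in conditions]  # s_0 ... s_T
    h = args[0]
    for module, gate, arg in zip(modules, gates, args[1:]):
        update = module(arg, h)
        h = [gate * hi + (1.0 - gate) * ui for hi, ui in zip(h, update)]
    return h


def add(s, h):  # hypothetical stand-in for a learned module
    return [hi + si for hi, si in zip(h, s)]


def sub(s, h):  # hypothetical stand-in for a learned module
    return [hi - si for hi, si in zip(h, s)]


slots = [[10.0, 0.0], [0.0, 10.0]]
pick_first, pick_second = [10.0, 0.0], [0.0, 10.0]
h = run_executor(slots, [pick_first, pick_second, pick_first], [add, sub], [0.0, 0.0])
print(h)  # close to [0.0, 10.0]: (slot 1 + slot 2) - slot 1
noisy = run_executor(slots, [pick_first, pick_second, pick_first], [add, sub],
                     [0.0, 0.0], rng=random.Random(0))
```

Note that, following Eq. 2, a gate value of 1 leaves the working memory unchanged while a gate value of 0 applies the full module update, which is how the gates can encode the effective program length.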

4.2. SELECTION BOTTLENECK

In this section, we describe the interface between the controller and the executor: how the task embedding $z$ output by the controller is transformed into a neural program that the executor then takes as input (i.e. sequences of modules, conditions and gates) to perform the sequential update. First, the task embedding $z \in \mathbb{R}^{d}$ is transformed into a sequence of embeddings by feeding $z$ as argument to a GRU RNN that starts with an empty hidden state: $[z_1, \dots, z_T] = \mathrm{GRU}(z)$. Both module and condition selections are done through a key-query attention mechanism comparing the task embedding $z_t$ to the $N_r$ learned module tags (denoted by $M = [M_1, \dots, M_{N_r}] \in \mathbb{R}^{N_r \times d}$) and the $N_c$ learned condition tags (denoted by $C = [C_1, \dots, C_{N_c}] \in \mathbb{R}^{N_c \times d}$). The keys are extracted from the module and condition tags, whereas the query is extracted in both attention operations from the task embedding $z_t$ (using two MLPs $Q_r$ and $Q_c$), such that the $t$-th element of each sequence is obtained with:
$$W^{t}_{m} = \mathrm{GumbelSoftmax}\left(\frac{Q_r(z_t) M^{\top}}{\sqrt{d}}\right) \in \mathbb{R}^{1 \times N_r} \tag{3}$$
and the resulting update is given by the following weighted sum:
$$\hat{m}_t(h_t, \hat{s}_t) = \sum_{i=1}^{N_r} W^{t}_{m}[i]\, m_i(h_t, \hat{s}_t).$$
Similarly, the conditions are obtained through:
$$\hat{c}_t = \mathrm{GumbelSoftmax}\left(\frac{Q_c(z_t) C^{\top}}{\sqrt{d}}\right) c$$
with $c \in \mathbb{R}^{N_c \times d}$ denoting the set of learned condition vectors. Finally, the sequence of step gates is obtained directly from the sequence $[z_1, \dots, z_T]$ such that $g_t = \mathrm{MLP}(z_t)$. For ease of notation, we let $P_z = \mathrm{SelectionBottleneck}(z) = \{g, \hat{c}, \hat{m}\}$ denote the neural program obtained from the task embedding $z$, where $g$, $\hat{c}$, $\hat{m}$ correspond to the associated step gates, condition vectors and processing module sequences.
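The tag-matching step of the selection bottleneck can be sketched as below. This is a simplified illustration with several assumptions: the query is taken to be already projected (the MLPs $Q_r$ and $Q_c$ are omitted), the Gumbel noise is dropped so that a plain softmax is used, and the library modules are hypothetical hand-written functions rather than learned GRU cells.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(l - peak) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, tags):
    """Scaled dot-product attention weights of one query over a list of tag vectors."""
    d = len(query)
    return softmax([sum(q * t for q, t in zip(query, tag)) / math.sqrt(d) for tag in tags])


def select_module(query, module_tags, modules):
    """Eq. 3 without Gumbel noise, followed by the weighted sum over module outputs."""
    weights = attend(query, module_tags)

    def mixed(h, s):
        outputs = [m(h, s) for m in modules]
        return [sum(w * out[k] for w, out in zip(weights, outputs)) for k in range(len(h))]

    return weights, mixed


def select_condition(query, condition_tags, condition_vectors):
    """Attention over condition tags, returning a mixture of condition vectors."""
    weights = attend(query, condition_tags)
    d = len(condition_vectors[0])
    return [sum(w * c[k] for w, c in zip(weights, condition_vectors)) for k in range(d)]


def add(h, s):  # hypothetical library module
    return [hi + si for hi, si in zip(h, s)]


def sub(h, s):  # hypothetical library module
    return [hi - si for hi, si in zip(h, s)]


module_tags = [[8.0, 0.0], [0.0, 8.0]]
weights, module = select_module([8.0, 0.0], module_tags, [add, sub])
print(weights)                 # almost all mass on the first tag
print(module([5.0], [3.0]))    # close to [8.0], i.e. the "add" module
```

Because the selection is a soft mixture, a query that matches no tag clearly would blend the outputs of several modules; the Gumbel-softmax used in the paper pushes this selection towards a discrete choice.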

4.3. COMPOSITIONAL IMAGINATION

Our main contribution resides in showing how object-centric inductive biases (used in the executor) can be leveraged to induce a new imagination-based learning framework that leads to better compositional generalization. The idea is that, in the same way we select a sequence of modules, conditions and gates using the task embedding output by the controller, we can also sample them at random (from a uniform distribution) to create a new neural program that can be used to create imagined scenarios. To do so, we sample at random a sequence of gates $g^{im} = [g^{im}_1, \dots, g^{im}_T]$, a sequence of condition vectors $c^{im} = [c^{im}_0, \dots, c^{im}_T]$ and a sequence of processing modules $m^{im} = [m^{im}_0, \dots, m^{im}_T]$ that correspond to the procedural part of the knowledge we have about the reasoning task at hand. Ideally we would also sample the query to process (i.e. the declarative part), but we leave that for future work. Instead, we take visual inputs that are already present in the training data and we apply an imagined neural program to them. Since the goal is to create new samples, we need to apply the same imagined program to a set of visual inputs to form a support set. Let $X^{supp} = \{x_1, \dots, x_L\}$ denote a set of visual inputs from the training data, and let $P^{im} = \{g^{im}, c^{im}, m^{im}\}$ be an imagined program. Then the imagination phase can be split into 3 main steps:
• Imagined samples: this step consists of applying an imagined program $P^{im}$ to a support set of visual inputs $X^{supp}$ to obtain an imagined I/O support set $S^{im} = \{X^{supp}, O^{im}\}$ with $O^{im} = [\mathrm{Executor}(x_i, P^{im}) \text{ for } x_i \in X^{supp}]$.
• Task embedding inference: this step consists of encoding the imagined support set with the controller to produce a task embedding $z^{im} = \mathrm{Controller}(S^{im})$.
• Mechanisms prediction: the last step consists of predicting back the neural program (i.e. its components) that produced the imagined sample. This means matching $\hat{P}_{z^{im}} = \mathrm{SelectionBottleneck}(z^{im})$ with $P^{im}$.
The associated loss is called the imagination loss $L_{im} = L(\hat{P}_{z^{im}}, P^{im})$, which can be split into 3 cross-entropies predicting the step gate values, the condition vector indices and the processing module indices. During training, we introduce this loss after a warm-up period during which the model is trained only on the available training data. We detail the hyperparameters in the Appendix.
Algorithm 1 (final lines)
5: for (t = 1; t < T; t++) do
6:   $\hat{s}_t \leftarrow \mathrm{GumbelSoftmax}(\hat{c}_t S^{\top} / \sqrt{d})\, S \in \mathbb{R}^{1 \times d}$ ▷ Eq. 1
7:   $h_{t+1} \leftarrow g_{t+1}\, h_t + (1 - g_{t+1})\, \hat{m}_t(\hat{s}_{t+1}, h_t)$ ▷ Eq. 2
8: end for
9: return pred($h_T$) ▷ Task-specific prediction
10: end function
Algorithm 2 Compositional Imagination
Require: $X^{supp}$ ▷ Samples seen during training
1: $S^{supp} \leftarrow \mathrm{SlotAttention}(X^{supp})$ ▷ Object-centric perception
2: $P^{im} \sim U(g, C, M)$ ▷ Sample a program
3: $O^{im} = \{\mathrm{Executor}(x_i, P^{im}) \text{ for } x_i \in X^{supp}\}$
4: $S^{im} \leftarrow \{X^{supp}, O^{im}\}$
5: $z^{im} \leftarrow \mathrm{Controller}(S^{im})$
Each of the training steps (perception pretraining, task training, and imagination) gives rise to a specific objective loss to optimize. The task prediction objective in our case is a simple cross-entropy on the output of the executor, since we treat each bit of the output as a binary label to predict. This loss is given by:
$$L_{task} = -\sum_{i \in D_{train}} \big( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \big).$$
The pretraining phase consists of training the Slot Attention module on a reconstruction task. During the imagination phase, new samples are created according to Algorithm 2, and the model is optimized to infer the programs that generated these samples and to minimize $L_{task}$ at the same time. We detail the hyperparameters associated with the different training phases in the Appendix.
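The imagination procedure can be illustrated with a purely symbolic sketch in which slots are plain numbers and programs are lists of indices. Everything below is an assumption made for illustration: the dictionary layout of a program, binary gates, hypothetical `add` and `sub` library modules, and an imagination loss written as an unweighted sum of per-step cross-entropies (the text only states that the loss splits into three cross-entropies).

```python
import math
import random


def sample_program(n_modules, n_conditions, max_steps, rng):
    """Uniformly sample gates, condition indices and module indices (an imagined program)."""
    return {
        "gates": [rng.randint(0, 1) for _ in range(max_steps)],
        "conditions": [rng.randrange(n_conditions) for _ in range(max_steps + 1)],
        "modules": [rng.randrange(n_modules) for _ in range(max_steps)],
    }


def execute(slots, program, library):
    """Symbolic stand-in for Executor(x, P): slots are plain numbers here."""
    h = slots[program["conditions"][0]]
    steps = zip(program["gates"], program["conditions"][1:], program["modules"])
    for gate, cond, mod in steps:
        if gate == 0:  # a gate of 1 keeps the memory unchanged, as in Eq. 2
            h = library[mod](slots[cond], h)
    return h


def imagine_support_set(support_inputs, program, library):
    """Apply the same imagined program to every support input to get I/O pairs."""
    return [(x, execute(x, program, library)) for x in support_inputs]


def imagination_loss(predicted, program):
    """Sum of cross-entropies over gates, condition indices and module indices."""
    loss = 0.0
    for key in ("gates", "conditions", "modules"):
        for probs, target in zip(predicted[key], program[key]):
            loss += -math.log(probs[target])
    return loss


library = [lambda s, h: h + s, lambda s, h: h - s]  # hypothetical add / sub modules
rng = random.Random(0)
program = sample_program(n_modules=2, n_conditions=3, max_steps=2, rng=rng)
imagined = imagine_support_set([(1, 2, 3), (4, -5, 6)], program, library)
print(program, imagined)
```

In the full model, `predicted` would hold the selection probabilities produced by the selection bottleneck from the task embedding of the imagined support set, and the targets are the sampled indices themselves, so no additional labels are needed.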

5. EXPERIMENTS

OCIM has two main components: the perception component and the object processing (i.e. reasoning) component. Our contribution lies in the object processing component, while for the perception component we use a SOTA Slot Attention module (Locatello et al., 2020). The goal of this section is two-fold: (1) to evaluate our imagination-based learning paradigm on a set of compositional generalization axes and (2) to evaluate the ability of the perception module to extract symbolic-like representations that can be used to solve our visual abstract reasoning task.

5.1. BASELINES

Our model OCIM can be seen as an extension of the sparse interaction inductive biases proposed in Neural Production Systems (NPS) (Goyal et al., 2021a), augmented with an imagination-based learning mechanism. NPS sequentially updates a set of slots by choosing at each time step a primary slot, a secondary slot, and an interaction rule with key-query attention mechanisms. OCIM instead sequentially updates the state of a working memory shared across time steps (i.e. instead of slot states), from which the final answer can be extracted. As a result, at each time step the primary argument of an interaction is always the shared memory and the second argument is selected among the input slots. We also compare OCIM (and its variant without imagination, which we call OCIM-noim) to two other baselines in which the executor is parameterized with a single monolithic GRU RNN in one case, and with a dense GNN in the other, for which we use the interaction component from the C-SWM model (Kipf et al., 2019), as in Goyal et al. (2021b). In each of these baselines, input nodes correspond to extracted slots concatenated with the output of the controller. For the GNN baseline, a GRU RNN is added after computing the interactions between nodes to aggregate the final result. We refer to these two baselines as GRU RNN and GNN respectively, and detail their exact parameterization in the Appendix.

5.2. OOD SPLITS

We are interested in 3 axes of compositional generalization that evaluate both our proposed imagination-based learning paradigm and the robustness of the perception module: (1) we first want to evaluate whether the imagination phase in OCIM can lead to better generalization to arithmetic tasks composed of sequences of operations never seen during training; (2) we then want to evaluate whether OCIM is able to generalize to never-seen orders in which the input digits are taken to perform the sequence of operations (i.e. its ability to extract meaningful and general argument selection conditions); and finally, (3) we want to evaluate whether current object-centric inductive biases as proposed in Slot Attention (Locatello et al., 2020) are well suited for disentangling independent factors of variation within an object (e.g. color and digit class in our case). To that end, we evaluate the models on the three splits described in Section 3.
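One possible way to build the first two splits is sketched below: enumerate every program and hold out those that use an excluded operation sequence or an excluded positional order. The function names and the choice of two operations per program (one between each pair of consecutive digits) are assumptions for this example; the actual held-out sequences and orders are those listed in the paper's appendix, not the ones used here.

```python
from itertools import permutations, product

POSITIONS = ("top-left", "top-right", "down")


def all_programs(n_ops=2):
    """Every (operation sequence, positional order) pair for three digits."""
    return [(ops, order)
            for ops in product("+-", repeat=n_ops)
            for order in permutations(POSITIONS)]


def split_programs(programs, held_out_ops=(), held_out_orders=()):
    """Programs using a held-out sequence or order go to the OOD test set."""
    train, test = [], []
    for ops, order in programs:
        is_ood = ops in held_out_ops or order in held_out_orders
        (test if is_ood else train).append((ops, order))
    return train, test


programs = all_programs()
# OOD seq: the sequence (+, -) is never seen during training.
train_seq, test_seq = split_programs(programs, held_out_ops={("+", "-")})
# OOD order: one positional order is never seen during training.
train_ord, test_ord = split_programs(
    programs, held_out_orders={("down", "top-left", "top-right")})
print(len(programs), len(train_seq), len(test_seq), len(train_ord), len(test_ord))
```

The OOD perception split would be built analogously at the level of (digit, color) pairs rather than programs.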

5.3. RESULTS

Imagination and Generalization. In our experiments, all the models share the same perception model (Locatello et al., 2020) and the same controller. They only differ in their execution component. In Figure 3, we report the accuracy of our model compared to the baselines of interest across three different splits, each of which aims at evaluating a particular axis of generalization. We observe two main results: (1) Imagination does help to generalize to novel sequences of operations, as shown by the gap between OCIM and the other baselines on the OOD op split. (2) Current SOTA object-centric perception models like Slot Attention are not quite able to systematically generalize to objects composed of never-seen-before arrangements of known arguments (such as new pairs of color/shape). This result is interesting and suggests that additional inductive biases or learning paradigms are needed to learn object-centric representations that disentangle independent factors of variation within an object. We did however notice that the choice of architecture for the execution component seems to have an impact on the perception part, and that, surprisingly, both the GNN and GRU baselines perform better than OCIM on the OOD perception split.
Modules Specialization. We are also interested in analyzing how specialization of the learned modules (i.e. becoming active for a certain operation) impacts the generalization performance of OCIM. For each training random seed, we count the number of times each module was selected for each of the ground-truth operations. Since we use the Gumbel softmax trick to select modules, we use the argmax of the attention coefficients to decide which module is selected. We report these proportions in the heatmap in Figure 4 for OCIM with and without imagination. The x-axis corresponds to the seed number; the y-axis corresponds to the module indices for both the addition and the subtraction operations.
The accuracies reported at the top of the heatmaps correspond to the validation and test accuracies on the OOD op splits (i.e. when evaluating the models on sequences of operations that have not been seen during training). We note the following three observations: (1) Current inductive biases are not sufficient for specialization to systematically occur, since there are some seeds that have overlaps between selected modules. (2) As shown in the generalization results of OCIM-noim, specialization in the modules is not enough for the model to generalize to novel sequences of operations (e.g. modules in seed 2 of OCIM-noim (left column) are specialized, yet we do not observe systematic generalization). Finally, (3) imagining new samples is necessary for the model to generalize, but it is not sufficient, as the modules need to be specialized as well. Seeds 0 and 5 of OCIM do have overlap in the module selections, and the imagination framework was not successful for them. This observation explains the variance in performance on the OOD op split that we report in Figure 3.
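The specialization analysis described above amounts to a small counting procedure, sketched here under assumptions: `selection_weights` is a hypothetical list of per-step attention coefficients over modules, `operations` holds the matching ground-truth operation for each step, and "specialized" is taken to mean that every operation has a distinct dominant module.

```python
from collections import Counter


def specialization_table(selection_weights, operations, n_modules):
    """Share of argmax module selections for each ground-truth operation."""
    counts = {}
    for weights, op in zip(selection_weights, operations):
        chosen = max(range(n_modules), key=lambda i: weights[i])  # argmax of attention
        counts.setdefault(op, Counter())[chosen] += 1
    return {op: [c[i] / sum(c.values()) for i in range(n_modules)]
            for op, c in counts.items()}


def is_specialized(table):
    """True if every operation has its own dominant module (no overlap)."""
    dominant = [max(range(len(row)), key=lambda i: row[i]) for row in table.values()]
    return len(set(dominant)) == len(dominant)


weights = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
table = specialization_table(weights, ["+", "+", "-", "-"], n_modules=2)
print(table, is_specialized(table))
```

Each row of the resulting table corresponds to one column block of a heatmap such as Figure 4, computed separately for every random seed.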

6. CONCLUSION AND FUTURE WORK

We have presented OCIM, a method that leverages object-centric representations to decompose visual reasoning tasks into a series of learned primitives (operation and object choices). OCIM combines these primitives in novel ways in order to generate and learn from unseen imaginary tasks, which radically improves OOD generalization. We compared OCIM against NPS and two other baselines without imagination on a synthetic visual arithmetic reasoning task in which we apply a sequence of operations to colored MNIST digits. We found that only OCIM was able to systematically generalize to new tasks composed of unseen sequences of arithmetical operations. Interestingly, we observed that specialization among the neural modules seems to be a necessary but not sufficient condition for modular architectures like OCIM and NPS to generalize to unseen sequences of operations: imagination seems to be a critical addition to the specialization condition. The effectiveness of imagination in our setup raises the question of whether its function is similar in biological brains. An interesting hypothesis is that dreams have a regularizing effect in the brain (OBH). While the link between OBH and OCIM is superficial, it poses an interesting question that might be worth exploring in future work. Along with OCIM, we have introduced a synthetic visual reasoning benchmark to assess the extent to which imagination improves compositional generalization. Despite the simplicity of the benchmark, we found that SOTA models like NPS fail to compose the primitives learned during training in novel ways in order to generalize. As research in compositional generalization progresses, the benchmark could be extended with more challenging scenarios by increasing the number of operations, the length of the programs, and the number of objects.

Reproducibility Statement

We report in the appendix all the model and training hyperparameters needed to implement and reproduce our model (Tables 1, 2, 3 and 4), as well as the detailed content of the different data splits. We will release the code and scripts for our model, the baselines and the generation of the different dataset splits.



Figure 1: Data sample and dataset splits.

Figure 2: Task and Imagination pathways in OCIM

Algorithm 2 (continued)
6: $P_{z^{im}} = \mathrm{SelectionBottleneck}(z^{im})$ ▷ Infer program
7: $\mathrm{loss}_{im} \leftarrow \mathrm{CrossEntropy}(P^{im}, P_{z^{im}})$

4.4. TRAINING

The training of the whole model can be split into three phases:
• Step 1: Pretraining of the perception model such that the next steps start with reasonable latent slots.
• Step 2: Regular training on the task prediction objective (8-binary-bits digit prediction).
• Step 3: Imagination, where random modules and conditions are sampled to create new data points and expose the model to potentially OOD samples.

Figure 3: Validation and test accuracy for the baselines and OCIM on the three axes of compositional generalization described in Section 3. Insights about the high variance of the results for OCIM are given in Section 5.3 when inspecting modules specialization.

Figure 4: Modules specialization.

A APPENDIX

A.1 MODEL

Controller. The controller encodes a support set of I/O pairs $S = \{X, O\}$ and outputs a task embedding $z$ that is used to predict a task-specific output associated with a query input. We formulate the controller in an iterative manner: it starts with a random guess $z_{init}$ (sampled from a learned Gaussian) and refines it. We denote by Refine($z$, $S$) a refinement step and by $T$ the total number of refinement steps. A refinement step at timestep $t$ can be decomposed into three main steps:
• Step 1 - Current guess prediction: we compute the current predicted outputs $\hat{O}^t$ associated with each input in the support set, s.t. $\hat{O}^t = \{\mathrm{ComputeGuess}(X, z^t)\}$.
• Step 2 - Compare the current guess $\hat{O}^t$ to the ground-truth outputs $O$ given their associated inputs in $S$. We denote by $a^t = \mathrm{Compare}(\hat{O}^t, S, z^t)$ the output of this step.
• Step 3 - Update the current guess: $z^{t+1} = \mathrm{Update}(a^t, z^t)$.
ComputeGuess($X$, $z^t$) is model-specific and simply consists of predicting the current task output given $z^t$. Update is parametrized as a simple GRU RNN taking $a^t$ as input and updating the hidden state $z^t$. We then need to detail how we obtain $a^t = \mathrm{Compare}(\hat{O}^t, S, z^t)$. This is done in two steps: (1) first, we compute a representation $b^t_i$ for each sample $i$ in the support set; then, (2) we aggregate those representations to obtain $a^t$. To do so, each input $x_i \in X$ is transformed into an object-centric set of slots $\{s^i_j\}_j$ using the same perception model as in the rest of the model. Each slot is then concatenated to the current guess $z^t$, the associated ground-truth output $o_i$ and the currently predicted output $\hat{o}_i$. The sample-wise result is then obtained with a simple GRU RNN. We denote this step by $b^t_i = \mathrm{EncodeSample}(x_i, z^t, o_i, \hat{o}_i)$ and the resulting sequence representation by $b^t = [b^t_i]$. We then need to aggregate the obtained results across the whole support set. To do so, we concatenate each sample representation $b^t_i$ with the current task embedding $z^t$, the associated ground-truth output $o_i$ and the currently predicted output $\hat{o}_i$. Similarly, we aggregate the results with a simple GRU RNN. We denote $a^t = \mathrm{EncodeSupport}(b^t, z^t, O, \hat{O}^t)$. The exact parametrization of each of the modules constituting the controller is given in Table 1.
Executor. The executor is composed of three main parts: (1) the learned neural program library (i.e. modules and condition vectors), (2) the argument selection part given a condition vector, and (3) the task-related output prediction. We detail the parameterization of these parts in Table 3.
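The iterative refinement performed by the controller can be summarized by the loop below. It is only a structural sketch: `compute_guess`, `compare` and `update` are hypothetical callables standing in for the learned components (ComputeGuess, Compare and the GRU-based Update), and the toy instantiation that follows uses scalar arithmetic purely to show the loop converging.

```python
def infer_task_embedding(support, z_init, compute_guess, compare, update, n_steps):
    """Iteratively refine a task embedding z from a support set of (input, output) pairs."""
    inputs = [x for x, _ in support]
    z = z_init
    for _ in range(n_steps):
        guesses = [compute_guess(x, z) for x in inputs]  # Step 1: current predictions
        a = compare(guesses, support, z)                 # Step 2: compare with ground truth
        z = update(a, z)                                 # Step 3: refine the embedding
    return z


# Toy instantiation: the hidden "task" is y = x + 4 and z should converge to 4.
def toy_guess(x, z):
    return x + z


def toy_compare(guesses, support, z):
    errors = [o - g for g, (_, o) in zip(guesses, support)]
    return sum(errors) / len(errors)


def toy_update(a, z):
    return z + 0.5 * a


support = [(1.0, 5.0), (2.0, 6.0)]
z = infer_task_embedding(support, 0.0, toy_guess, toy_compare, toy_update, n_steps=20)
print(z)  # close to 4.0
```

In the model itself, the comparison step is the two-stage encoding described above (EncodeSample followed by EncodeSupport) and the update is a GRU, so the refinement is learned rather than hand-designed.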
The training of the whole model can be split into three phases:
• Step 1: Pretraining of the perception model such that the next steps start with reasonable latent slots.
• Step 2: Regular training on the task prediction objective (here 8-binary-bits digit prediction).
• Step 3: Imagination, where random modules and conditions are sampled to create new data points and expose the model to potentially OOD samples.
Each of these steps gives rise to a specific objective loss to optimize, namely $L_{rec}$, $L_{task}$, and $L_{im}$. The different phases consist of progressively adding these losses to the optimized objective. The total loss is:
The pretraining of the slot attention modules (Step 1) is done separately, and we initialize it with the pretrained weights when adding the task-specific loss at Step 2. The coefficient α is fixed, whereas β is introduced after a warm-up period and increased linearly during a certain number of epochs. We report these training-specific hyperparameters in

