Robust and Controllable Object-Centric Learning through Energy-based Models

Abstract

Humans are remarkably good at understanding and reasoning about complex visual scenes. The capability to decompose low-level observations into discrete objects allows us to build a grounded abstract representation and identify the compositional structure of the world. Accordingly, it is a crucial step for machine learning models to be capable of inferring objects and their properties from visual scenes without explicit supervision. However, existing works on object-centric representation learning rely either on tailor-made neural network modules or on strong probabilistic assumptions in the underlying generative and inference processes. In this work, we present EGO, a conceptually simple and general approach to learning object-centric representations through an energy-based model. By forming a permutation-invariant energy function using vanilla attention blocks readily available in Transformers, we can infer object-centric latent variables via gradient-based MCMC methods where permutation equivariance is automatically guaranteed. We show that EGO can be easily integrated into existing architectures and can effectively extract high-quality object-centric representations, leading to better segmentation accuracy and competitive downstream task performance. Further, empirical evaluations show that EGO's learned representations are robust against distribution shift. Finally, we demonstrate the effectiveness of EGO in systematic compositional generalization, by re-composing learned energy functions for novel scene generation and manipulation.

1. INTRODUCTION

The ability to recognize objects and infer their properties and relations in a scene is a fundamental capability of human cognition. The central question of how objects are discovered and represented in the brain has been a subject of intense research for decades, and has prompted the field of cognitive science (Spelke, 1990) to ask how we might develop intelligent machine agents that learn to represent objects the way humans do, without being explicitly taught what those objects are. Developing artificial agents capable of decomposing complex scenes into discrete objects can be a crucial step for many applications in robotics, vision, reasoning, and planning. Learning such object-centric representations can further help to identify the relational and compositional structure among objects, and enables an agent to reason about a novel scene composed of new objects by leveraging knowledge from previously learned representations of similar objects. In recent years, many works have been proposed to learn object-centric representations from visual scenes without human supervision. A variety of models, in the form of structured generative models (Greff et al., 2019; Burgess et al., 2019; Engelcke et al., 2020; Lin et al., 2020) or specifically designed neural network modules (Locatello et al., 2020), have been proposed to tackle the problem of visual scene decomposition and generation. On the other hand, recent progress in large language models (Vaswani et al., 2017; Brown et al., 2020) and visual-language models (Radford et al., 2021; Ramesh et al., 2022) shows the huge potential of training expressive neural network models with minimal hand-designed inductive biases. In a similar spirit, we ask whether we can learn object-centric representations with minimal human assumptions and task-specific architectures.

Figure 1: Architecture of EGO-Attention, the variant of EGO used in experiments. x is the input image and z_i are the object-centric representations. In each block, EGO refines the hidden scene representation by attending to the latent variables via cross-attention between x and z_i, measuring the consistency between the image input and the latent representation.

Contributions In this work, we introduce EGO (EnerGy-based Object-centric learning), a conceptually simple yet effective approach to learning object-centric representations without the need for specially tailored neural network architectures or strong (typically parametric) assumptions on the data-generating process. Based on the Energy-based Model (EBM) framework, we propose to learn an energy function that takes as input a visual scene and a set of object-centric latent variables and outputs a scalar value measuring the consistency between the observation and the latent representation (Section 2). We minimally assume permutation invariance among objects and embed this assumption into the energy function by leveraging the vanilla attention mechanisms of the Transformer (Vaswani et al., 2017) architecture (Section 2.1). In essence, our method makes models act as segmentation annotators that iteratively improve their annotations by minimizing our energy function. We use gradient-based Markov chain Monte Carlo (MCMC) sampling to efficiently sample latent variables from the EBM distribution, which automatically yields a permutation-equivariant update rule for the latent variables (Section 2.2). This stochastic inference procedure also addresses the inherent uncertainty in learning object-centric representations: models can learn to represent scenes containing multiple objects and potential occlusions in a probabilistic and multi-modal manner.
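To make the permutation-invariance idea concrete, the following is a minimal NumPy sketch of a cross-attention energy over a set of slots. It is an illustrative stand-in, not the paper's EGO-Attention architecture: the feature/slot dimensions, the random projection matrices `Wq`, `Wk`, `Wv`, and the agreement-based scoring are all assumptions made for the example. Because the slots enter only through an attention readout that sums over them, reordering the slots cannot change the scalar energy.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def attention_energy(x_feats, slots, Wq, Wk, Wv):
    """Scalar energy from cross-attention between N image features
    (rows of x_feats) and K object slots. Summing over slots inside
    the attention readout makes the result invariant to slot order."""
    q = x_feats @ Wq                                # (N, d) queries from the image
    k = slots @ Wk                                  # (K, d) keys from the slots
    v = slots @ Wv                                  # (K, d) values from the slots
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, K) attention weights
    readout = attn @ v                              # (N, d) slot-conditioned readout
    return -float(np.sum(x_feats * readout))        # low energy = high agreement

# Permuting the slots leaves the energy unchanged.
rng = np.random.default_rng(0)
x_feats = rng.normal(size=(6, 4))
slots = rng.normal(size=(3, 4))
Wq, Wk, Wv = (rng.normal(size=(4, 4)) for _ in range(3))
e_orig = attention_energy(x_feats, slots, Wq, Wk, Wv)
e_perm = attention_energy(x_feats, slots[[2, 0, 1]], Wq, Wk, Wv)
```

Permuting the slot rows permutes the keys and values consistently, so the attention weights are permuted by the same order and the readout, hence the energy, is unchanged.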
We demonstrate the effectiveness of our approach on a variety of unsupervised object discovery tasks and show, both qualitatively and quantitatively, that our model can learn to decompose complex scenes into highly accurate and interpretable objects, outperforming state-of-the-art methods on segmentation performance (Section 4.1). We also show that we can reuse the learned energy functions for controllable scene generation and manipulation, which enables systematic compositional generalization to novel scenes (Section 4.2). Finally, we demonstrate the robustness of our model to various distribution shifts and hyperparameter settings (Section 4.3).

2. ENERGY-BASED OBJECT-CENTRIC REPRESENTATION LEARNING

The goal of object-centric representation learning is to learn a mapping from a visual observation $x \in \mathbb{R}^{D_x}$ to a set of vectors $\{z_k\}$, where each vector $z_k \in \mathbb{R}^{D_z}$ describes an individual object (or the background) in $x$. In this work, we make use of an EBM $E(x, z; \theta)$, parameterized by $\theta$, to learn a joint energy function that assigns low energy to regions where the visual observation $x$ and the latent object descriptors $z$ are consistent, where $z = \{z_k\}_{k=1}^{K}$ is a set of $K$ object-centric latent variables. To implement the mapping from a visual scene to its constituent objects, we can sample from the posterior distribution $z \sim p(z \mid x; \theta) \propto e^{-E(x, z; \theta)}$ using any efficient MCMC sampling algorithm, such as stochastic gradient Langevin dynamics (Parisi, 1981; Welling & Teh, 2011) or Hamiltonian Monte Carlo (Duane et al., 1987; Neal et al., 2011). Accordingly, the EBM $E$ can be used as a generic module for object-centric representation learning, offering great flexibility in both the choice of neural network architecture and the functional form of the energy function.
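To make the sampling step concrete, here is a minimal NumPy sketch of unadjusted Langevin dynamics over the latents, with the observation held fixed inside the energy gradient. The quadratic toy energy, step size, step count, and number of chains are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def langevin_sample(grad_E, z0, step=0.01, n_steps=1000, rng=None):
    """Draw approximate samples z ~ p(z|x) ∝ exp(-E(x, z)) with
    unadjusted Langevin dynamics, given the gradient of E w.r.t. z:
        z <- z - (step / 2) * dE/dz + sqrt(step) * Gaussian noise."""
    rng = rng if rng is not None else np.random.default_rng(0)
    z = z0.copy()
    for _ in range(n_steps):
        noise = rng.normal(size=z.shape)
        z = z - 0.5 * step * grad_E(z) + np.sqrt(step) * noise
    return z

# Toy energy E(z) = ||z - mu||^2 / 2, so the target p(z|x) is a
# unit-variance Gaussian centered at mu.
mu = np.array([1.0, -1.0])
grad_E = lambda z: z - mu          # analytic gradient of the toy energy
z0 = np.zeros((1000, 2))           # 1000 independent chains, started at 0
z = langevin_sample(grad_E, z0)    # chain means concentrate near mu
```

Note that when the energy is permutation invariant in the slots, its gradient is permutation equivariant, so running this update on a permuted set of slots (with correspondingly permuted noise) yields the permuted samples, which is the equivariance property the inference procedure relies on.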

2.1. PERMUTATION INVARIANT ENERGY FUNCTION

One fundamental inductive bias in object-centric representation learning is encoding the permutation invariance of a set of objects into model learning. In this section, we introduce two formulations

