BRIDGING THE GAP TO REAL-WORLD OBJECT-CENTRIC LEARNING

Abstract

Humans naturally decompose their environment into entities at the appropriate level of abstraction to act in the world. Allowing machine learning algorithms to derive this decomposition in an unsupervised way has become an important line of research. However, current methods are restricted to simulated data or require additional information in the form of motion or depth in order to successfully discover objects. In this work, we overcome this limitation by showing that reconstructing features from models trained in a self-supervised manner is a sufficient training signal for object-centric representations to arise in a fully unsupervised way. Our approach, DINOSAUR, significantly outperforms existing image-based object-centric learning models on simulated data and is the first unsupervised object-centric model that scales to real-world datasets such as COCO and PASCAL VOC. DINOSAUR is conceptually simple and shows competitive performance compared to more involved pipelines from the computer vision literature.

1. INTRODUCTION

Object-centric representation learning has the potential to greatly improve generalization of computer vision models, as it aligns with causal mechanisms that govern our physical world (Schölkopf et al., 2021; Dittadi et al., 2022). Due to the compositional nature of scenes (Greff et al., 2020), object-centric representations can be more robust towards out-of-distribution data (Dittadi et al., 2022) and support more complex tasks like reasoning (Assouel et al., 2022; Yang et al., 2020) and control (Zadaianchuk et al., 2020; Mambelli et al., 2022; Biza et al., 2022). They are also in line with studies on the characterization of human perception and reasoning (Kahneman et al., 1992; Spelke & Kinzler, 2007). Inspired by the seemingly unlimited availability of unlabeled image data, this work focuses on unsupervised object-centric representation learning. Most unsupervised object-centric learning approaches rely on a reconstruction objective, which struggles with the variation in real-world data. Existing approaches typically implement "slot"-structured bottlenecks that transform the input into a set of object representations, together with a corresponding decoding scheme that reconstructs the input data. The emergence of object representations is primed by the set bottleneck of models like Slot Attention (Locatello et al., 2020), which groups together independently repeating visual patterns across a fixed dataset. While this approach has been successful on simple synthetic datasets, where low-level features like color help to indicate the assignment of pixels to objects, these methods have failed to scale to complex synthetic or real-world data (Eslami et al., 2016; Greff et al., 2019; Burgess et al., 2019; Locatello et al., 2020; Engelcke et al., 2021). To overcome these limitations, previous work has used additional information sources, e.g. motion or depth (Kipf et al., 2022; Elsayed et al., 2022).
Like color, motion and depth act as grouping signals when objects move or stand out in 3D space. Unfortunately, this precludes training on most real-world image datasets, which do not include depth annotations or motion cues. Following deep learning's mantra of scale, another appealing approach could be to increase the capacity of the Slot Attention architecture. However, our experiments (Sec. 4.3) suggest that scale alone is not sufficient to close the gap between synthetic and real-world datasets. We thus conjecture that the image reconstruction objective on its own does not provide a sufficient inductive bias for object groupings to arise when objects have complex appearance. Instead of relying on auxiliary external signals, we introduce an additional inductive bias by reconstructing features that have a high level of homogeneity within objects. Such features can easily be obtained via recent self-supervised learning techniques like DINO (Caron et al., 2021). We show that combining such a feature reconstruction loss with existing grouping modules such as Slot Attention leads to models that significantly outperform other image-based object-centric methods and bridge the gap to real-world object-centric representation learning. The proposed architecture, DINOSAUR (DINO and Slot Attention Using Real-world data), is conceptually simple and highly competitive with existing unsupervised segmentation and object discovery methods in computer vision.
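To make the feature reconstruction idea concrete, the following is a minimal PyTorch sketch of such an objective: slots are decoded into per-token features plus an alpha mask and compared against frozen self-supervised (e.g. DINO) features. All dimensions, the MLP decoder, and the alpha-mask combination are illustrative assumptions, not the paper's exact configuration (e.g. `N_TOKENS = 196` corresponds to a hypothetical 14×14 ViT feature map).

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
N_TOKENS, FEAT_DIM, N_SLOTS, SLOT_DIM = 196, 768, 7, 128

class FeatureDecoder(nn.Module):
    """Decode each slot into per-token features plus an alpha logit,
    then mix slots with a softmax over the alpha channel (a
    spatial-broadcast-style decoder, sketched)."""
    def __init__(self):
        super().__init__()
        self.pos = nn.Parameter(torch.randn(N_TOKENS, SLOT_DIM))
        self.mlp = nn.Sequential(
            nn.Linear(SLOT_DIM, 256), nn.ReLU(),
            nn.Linear(256, FEAT_DIM + 1),  # features + alpha logit
        )

    def forward(self, slots):                      # slots: (B, K, SLOT_DIM)
        x = slots.unsqueeze(2) + self.pos          # (B, K, N, SLOT_DIM)
        out = self.mlp(x)                          # (B, K, N, FEAT_DIM + 1)
        feats, alpha = out[..., :-1], out[..., -1:]
        weights = alpha.softmax(dim=1)             # slots compete per token
        return (weights * feats).sum(dim=1)        # (B, N, FEAT_DIM)

def feature_reconstruction_loss(decoder, slots, target_feats):
    # target_feats: frozen self-supervised features, (B, N, FEAT_DIM);
    # no gradient flows into the feature extractor.
    return nn.functional.mse_loss(decoder(slots), target_feats)
```

The key design choice relative to pixel reconstruction is only the target: the loss is computed in the feature space of a frozen self-supervised encoder rather than in RGB space.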

2. RELATED WORK

Our research follows a body of work studying the emergence of object-centric representations in neural networks trained end-to-end with certain architectural biases (Eslami et al., 2016; Burgess et al., 2019; Greff et al., 2019; Lin et al., 2020; Engelcke et al., 2020; Locatello et al., 2020; Singh et al., 2022a). These approaches implicitly define objects as repeating patterns across a closed-world dataset that can be discovered, e.g., via semantic discrete- or set-valued bottlenecks. As the grouping of low-level features into object entities is often somewhat arbitrary (it depends, for example, on the scale and level of detail considered), recent work has explored additional information sources such as video (Kosiorek et al., 2018; Jiang et al., 2020; Weis et al., 2021; Singh et al., 2022b; Traub et al., 2023), optical flow (Kipf et al., 2022; Elsayed et al., 2022; Bao et al., 2022), text descriptions of the scene (Xu et al., 2022), or some form of object-location information (e.g. bounding boxes) (Kipf et al., 2022). In contrast, we completely avoid additional supervision by leveraging the implicit inductive bias contained in the self-supervised features we reconstruct, which exhibit a high level of homogeneity within objects (Caron et al., 2021). This circumvents the scalability challenges of previous works that rely on pixel similarity as opposed to perceptual similarity (Dosovitskiy & Brox, 2016) and enables object discovery on real-world data without changing the existing grouping modules. Our approach is similar to SLATE (Singh et al., 2022a), but with the crucial difference of reconstructing global features from a Vision Transformer (Dosovitskiy et al., 2021) instead of local features from a VQ-VAE (van den Oord et al., 2017).

Challenging object-centric methods by scaling dataset complexity has been of recent interest: Karazija et al. (2021) propose ClevrTex, a textured variant of the popular CLEVR dataset, and show that previous object-centric models mostly perform poorly on it. Greff et al. (2022) introduce the MOVi datasets with rendered videos of highly realistic objects with complex shapes and appearances. These are arguably the most advanced synthetic datasets to date, and we find that current state-of-the-art models struggle with them in the unsupervised setting. Finally, Yang & Yang (2022) show that existing image-based object-centric methods fail catastrophically on real-world datasets such as COCO, likely because they cannot cope with the diversity of shapes and appearances presented by natural data. In contrast, we demonstrate that our approach works well on both complex synthetic and real-world datasets.

In the computer vision literature, structuring natural scenes without any human annotations has also enjoyed popularity, with tasks such as unsupervised semantic segmentation and object localization. These tasks are interesting for us because they constitute established real-world benchmarks related to unsupervised object discovery, and we show that our method is also competitive on them. We refer to App. A for a detailed discussion of prior research in these areas.

3. METHOD

Our approach follows the usual autoencoder-like design of object-centric models and is summarized in Figure 1: a first module (the encoder) extracts features from the input data, a second module groups them into a set of latent vectors called slots, and a final module (the decoder) reconstructs some target signal from the slots. However, our method crucially differs from other


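As a concrete reference for the grouping stage of the pipeline above, the following is a minimal PyTorch sketch of the Slot Attention module (Locatello et al., 2020): slots compete for input features via attention normalized over slots and are iteratively refined with a GRU. All hyperparameters and the omitted residual MLP are simplifications, not the configuration used in this paper.

```python
import torch
import torch.nn as nn

class SlotAttention(nn.Module):
    """Minimal Slot Attention sketch: softmax over slots makes slots
    compete for each input token; a GRU refines the slots per iteration."""
    def __init__(self, dim=128, n_slots=7, n_iter=3):
        super().__init__()
        self.n_slots, self.n_iter, self.scale = n_slots, n_iter, dim ** -0.5
        # Slots are initialized from a learned Gaussian.
        self.mu = nn.Parameter(torch.randn(1, 1, dim))
        self.log_sigma = nn.Parameter(torch.zeros(1, 1, dim))
        self.to_q = nn.Linear(dim, dim)
        self.to_k = nn.Linear(dim, dim)
        self.to_v = nn.Linear(dim, dim)
        self.gru = nn.GRUCell(dim, dim)
        self.norm_in = nn.LayerNorm(dim)
        self.norm_slots = nn.LayerNorm(dim)

    def forward(self, feats):                       # feats: (B, N, dim)
        B = feats.shape[0]
        feats = self.norm_in(feats)
        k, v = self.to_k(feats), self.to_v(feats)
        slots = self.mu + self.log_sigma.exp() * torch.randn(
            B, self.n_slots, feats.shape[-1])
        for _ in range(self.n_iter):
            q = self.to_q(self.norm_slots(slots))
            attn = torch.einsum("bkd,bnd->bkn", q, k) * self.scale
            attn = attn.softmax(dim=1)              # compete across slots
            attn = attn / attn.sum(dim=-1, keepdim=True).clamp_min(1e-8)
            updates = torch.einsum("bkn,bnd->bkd", attn, v)
            slots = self.gru(
                updates.reshape(-1, updates.shape[-1]),
                slots.reshape(-1, slots.shape[-1]),
            ).reshape(B, self.n_slots, -1)
        return slots                                # (B, n_slots, dim)
```

The defining detail is the normalization axis: unlike standard attention, the softmax runs over the slot dimension, so every input token distributes its "vote" among slots, which is what induces a partition of the features.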