TEST-TIME ADAPTATION WITH SLOT-CENTRIC MODELS

Abstract

Figure 1: Image and point-cloud instance segmentation with Slot-TTA. Slot-TTA parses completely novel scenes into familiar entities via slow inference, i.e., gradient descent on the reconstruction error of the scene example under consideration. Left: Slot-TTA outperforms Mask2Former (Cheng et al., 2021), a state-of-the-art 2D image segmenter, on segmenting novel images via gradient descent on image synthesis of neighboring image views. Right: Slot-TTA outperforms a state-of-the-art 3D-DETR detector by 30% in instance segmentation accuracy on out-of-distribution 3D point clouds, when trained on the same training data.

1. INTRODUCTION

While significant progress has been made in machine scene perception and segmentation within the last decade, object (and part) detectors continue to generalize poorly outside their training distribution (Geirhos et al., 2020; Hendrycks et al., 2021). Consider the unfamiliar entity shown in Figure 1 (last row on the right). We can intuitively reason about meaningful parts that this shape could be broken into. Yet, a state-of-the-art 3D detection transformer (Misra et al., 2021) trained to segment chairs in a supervised manner struggles with this decomposition, even though the entity contains familiar (chair) parts. This lack of generalization requires us to build systems that can robustly adapt to such changes in distribution. Test-time adaptation (TTA) (Ghifary et al., 2016; Sun et al., 2020; Wang et al., 2020) describes a setting where a model adapts to changes in distribution at test time, at the cost of additional computation. In recent years, a variety of TTA methods have been proposed, focusing on few-shot adaptation (Ren et al., 2018), where the network is given access to a few labelled examples, or unsupervised domain adaptation (UDA) (Zhang, 2021), where the network is given access to many unlabelled examples from the new distribution. Of particular relevance is a specific UDA setting where model parameters are adapted independently to each unlabelled example in the test set. This setting has been previously referred to as single-example UDA, and here we also refer to it as slow inference, since it is similar to a human taking more time to parse a difficult example. Existing approaches for this setting typically devise a self-supervised loss that aligns well with the task of image classification and then optimize this loss during test-time adaptation (Sun et al., 2020; Gandelsman et al., 2022; Bartler et al., 2022; Grill et al., 2020).
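To make the single-example UDA / slow-inference setting concrete, the following is a minimal sketch, assuming a tiny linear autoencoder as a stand-in for the full model: the pretrained parameters are copied, a few gradient steps are taken on the reconstruction error of one test example alone, and the adapted copy is then used for prediction. All names and hyperparameters here are illustrative, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical tiny linear autoencoder standing in for the full model:
# W_enc plays the role of the encoder, W_dec of the decoder.
d, k = 16, 4
W_enc = rng.normal(0.0, 0.1, (k, d))
W_dec = rng.normal(0.0, 0.1, (d, k))

def recon_loss(E, D, x):
    """Self-supervised objective: squared reconstruction error."""
    r = D @ (E @ x) - x
    return float(r @ r)

def adapt_single_example(E0, D0, x, steps=500, lr=0.005):
    """Gradient descent on the reconstruction error of ONE test example.

    Parameters are copied first, so each test sample is adapted
    independently of the others (single-example UDA / slow inference)."""
    E, D = E0.copy(), D0.copy()
    for _ in range(steps):
        z = E @ x
        r = D @ z - x
        grad_D = 2.0 * np.outer(r, z)        # dL/dD
        grad_E = 2.0 * np.outer(D.T @ r, x)  # dL/dE
        D -= lr * grad_D
        E -= lr * grad_E
    return E, D

x_test = rng.normal(size=d)  # a single (out-of-distribution) test sample
loss_before = recon_loss(W_enc, W_dec, x_test)
E_adapted, D_adapted = adapt_single_example(W_enc, W_dec, x_test)
loss_after = recon_loss(E_adapted, D_adapted, x_test)
```

The key property is that adaptation needs no labels: only the self-supervised reconstruction loss is optimized, and the extra gradient steps are the "additional computation" that slow inference trades for robustness.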
However, despite their success for image classification, these approaches do not provide adequate support for other scene understanding tasks, and in particular scene segmentation, as we showcase in Section 4.1. One potentially important aspect of supporting TTA for other scene understanding tasks is the inductive bias of the underlying architecture. In the context of instance segmentation, there has been a lot of recent development in building models that segment scenes into entities in an unsupervised way by optimizing a reconstruction objective (Eslami et al., 2016; Greff et al., 2016; Van Steenkiste et al., 2018; Goyal et al., 2021; Du et al., 2020; Locatello et al., 2020; Zoran et al., 2021). These methods differ in details but share the notion of incorporating a fixed set of entities, also known as slots or object files. Each slot extracts information about a single entity during encoding, and is "synthesized" back to the input domain during decoding. Their ability to distinguish visual objects at a representation level makes them a particularly promising candidate for TTA for instance segmentation tasks. In light of the above, we propose Test-time adaptation with slot-centric models (Slot-TTA), a semi-supervised slot-centric approach that combines Slot Attention (Locatello et al., 2020) (in the 2D image or point-cloud setting) or Object Scene Representation Transformer (Sajjadi et al., 2022a) (in the multi-view image setting) with a supervised segmentation loss, enabling it to leverage instance-level image or point cloud annotations. Slot-TTA is trained jointly to synthesize and segment scenes. At test time, the model adapts without supervision to a single test sample by optimizing the self-supervised objective alone. Different from fully-unsupervised object-centric generative models, Slot-TTA uses annotations at training time to help it develop the notion of what an object is, which lets it scale to more complex visual settings.
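The interplay between the slots, the synthesis objective, and the segmentation loss can be sketched as follows. This is a simplified numpy illustration, not the paper's architecture: shapes, names, and the cross-entropy form of the segmentation loss are assumptions. Each slot decodes to a per-pixel appearance map and a mask logit map; masks compete across slots via a softmax and the reconstruction is their weighted sum, so the same masks serve both synthesis and segmentation.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical shapes: S slots, each decoded over an H x W image.
S, H, W = 3, 4, 4
rng = np.random.default_rng(1)
slot_rgb = rng.uniform(size=(S, H, W, 3))         # per-slot appearance
slot_mask_logits = rng.normal(size=(S, H, W, 1))  # per-slot mask logits

# Alpha-composite: masks compete across slots via a softmax, then the
# reconstruction is the mask-weighted sum of the per-slot RGB maps.
masks = softmax(slot_mask_logits, axis=0)   # (S, H, W, 1), sums to 1 over S
recon = (masks * slot_rgb).sum(axis=0)      # (H, W, 3)

# Self-supervised synthesis loss (available at both train and test time).
target_rgb = rng.uniform(size=(H, W, 3))
recon_loss = float(((recon - target_rgb) ** 2).mean())

# Supervised segmentation loss (training only): cross-entropy between the
# predicted masks and ground-truth per-pixel instance labels.
gt_labels = rng.integers(0, S, size=(H, W))
rows = np.arange(H)[:, None]
cols = np.arange(W)[None, :]
seg_loss = float(-np.log(masks[gt_labels, rows, cols, 0] + 1e-9).mean())

train_loss = recon_loss + seg_loss  # joint objective during training
tta_loss = recon_loss               # reconstruction alone at test time
```

Because the masks that render the reconstruction are the same masks that are scored by the segmentation loss, minimizing the reconstruction error at test time can directly improve the segmentation, which is the mechanism Slot-TTA exploits.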
Different from existing TTA methods, Slot-TTA uses a slot-centric architecture and a self-supervised synthesis loss that better aligns with the task of instance segmentation. Different from state-of-the-art detectors, Slot-TTA is equipped with reconstruction feedback that allows it to adapt at test time without supervision, i.e., without using additional annotated data. Indeed, we show that test-time adaptation via image or point cloud synthesis in Slot-TTA enables successfully parsing completely unfamiliar scenes composed of familiar entities (Figure 3). We test Slot-TTA's instance segmentation ability on the following datasets: PartNet (Mo et al., 2019), MultiShapeNet-Hard (Sajjadi et al., 2022b), Multi-Shape, and Plating. We evaluate Slot-TTA's ability to parse out-of-distribution scenes and compare it against state-of-the-art entity-centric generative models (Locatello et al., 2020; Sajjadi et al., 2022a), program synthesis models (Tian et al., 2019), 3D unsupervised part discovery models (Wu et al., 2020), and supervised visual detectors (Cheng et al., 2021; Misra et al., 2021) trained with labeled data to segment objects. We show improvements over all baselines in Slot-TTA's ability to segment novel scenes. Additionally, we ablate different design choices of Slot-TTA. We will make our code and datasets publicly available to the community.

