IMPROVING OBJECT-CENTRIC LEARNING WITH QUERY OPTIMIZATION

Abstract

The ability to decompose complex natural scenes into meaningful object-centric abstractions lies at the core of human perception and reasoning. In the recent culmination of unsupervised object-centric learning, the Slot-Attention module has played an important role with its simple yet effective design and fostered many powerful variants. These methods, however, have been exceedingly difficult to train without supervision and remain ambiguous in their notion of an object, especially for complex natural scenes. In this paper, we propose to address these issues by investigating the potential of learnable queries as initializations for Slot-Attention learning, uniting it with efforts from existing attempts on improving Slot-Attention learning with bi-level optimization. With simple code adjustments on Slot-Attention, our model, Bi-level Optimized Query Slot Attention, achieves state-of-the-art results on 3 challenging synthetic and 7 complex real-world datasets in unsupervised image segmentation and reconstruction, outperforming previous baselines by a large margin. We provide thorough ablative studies to validate the necessity and effectiveness of our design. Additionally, our model exhibits great potential for concept binding and zero-shot learning.

1. INTRODUCTION

Objects, and their interactions, are the foundations of human cognition (Spelke & Kinzler, 2007). The ability to make abstractions from perception and organize them systematically empowers humans to accomplish and generalize across a broad range of tasks, such as scene modeling (Bear et al., 2020), visual reasoning (Yi et al., 2020), and simulating interactions (Bear et al., 2020). The key to such success lies in the emergence of symbol-like mental representations of object concepts (Whitehead, 1928). However, important as it is, disentangling object-centric concepts from visual stimuli is an exceedingly difficult task to accomplish with limited supervision (Greff et al., 2020) and requires proper inductive biases (Schölkopf et al., 2021). Motivated by the development of symbolic thought in human cognition, slot-based representations, whether instance-based (Greff et al., 2017; 2019; Locatello et al., 2020) or sequential (Gregor et al., 2019; Jiang et al., 2019), have been the key inductive bias behind recent advances in unsupervised object-centric learning. Among them, the Slot-Attention module has received tremendous attention given its simple yet effective design (Locatello et al., 2020). By leveraging an iterative attention mechanism, Slot-Attention induces competition between slots for explaining parts of the input, exhibiting a soft-clustering effect on visual signals. It was later shown to be more memory- and training-efficient as a plug-and-play module for unsupervised object-centric learning (Locatello et al., 2020) and fostered powerful variants in understanding images (Singh et al., 2021; Xu et al., 2022), 3D scenes (Yu et al., 2022; Sajjadi et al., 2022a), and videos (Kipf et al., 2022; Elsayed et al., 2022; Singh et al., 2022). However, as revealed by recent studies, the Slot-Attention module comes with innate discrepancies for object-centric representation learning.
First, with slots randomly initialized each time, the object-centric representations obtained by these models do not necessarily bind to object concepts (Kipf et al., 2022). Intuitively, such randomness leads to undesired scenarios where slots with similar initializations compete for objects on different images. It also challenges the iterative refinement procedure, which must project sets of potentially similar representations onto independent constituents of the input. As discovered by Chang et al. (2022), differentiating through such recurrences contributes to various training instabilities with a growing spectral norm of the Slot-Attention weights. This leads to the second and perhaps least desired property of Slot-Attention: it relies heavily on hyper-parameter tuning, including gradient clipping, learning-rate warm-up, etc., which further hurts the flexibility of Slot-Attention in adapting to broader applications with more complex signals.

To this end, we propose an extension of the Slot-Attention module, Bi-level Optimized Query Slot Attention (BO-QSA), to tackle the aforementioned problems. First, we follow the bi-level optimization framework proposed by Chang et al. (2022) for easing the training difficulty of Slot-Attention. More importantly, instead of sampling from a learnable Gaussian distribution, we propose to directly learn the slot initializations as queries. With these learnable representations, we eliminate the ambiguous competition between slots and give them a better chance to bind to specific object concepts. We improve the training of query-initialized Slot-Attention with a straight-through gradient estimator (STE), connecting our method with first-order approaches (Finn et al., 2017; Nichol & Schulman, 2018; Geng et al., 2021) for solving bi-level optimization problems.
The experimental results show that the proposed BO-QSA achieves state-of-the-art results on both synthetic and real-world image datasets with simple code adjustments to the original Slot-Attention module. With our model significantly outperforming previous methods in both domains, we provide thorough ablative studies demonstrating the effectiveness of our model design. We later show that our BO-QSA possesses the potential of binding object concepts to slots. To validate this potential, we design zero-shot transfer learning experiments that show the generalization power of our model on unsupervised object-centric learning. As the experiments suggest (see Sec. 5), our model could be a principled approach for unsupervised object-centric learning and serve as a general plug-and-play module for the broader range of modalities where variants of Slot-Attention prosper. We hope these efforts can help foster new insights in the field of object-centric learning.

Contributions In summary, our main contributions are three-fold:
• We propose BO-QSA, a query-initialized Slot-Attention model that unites straight-through gradient updates to learnable queries with methods for improving Slot-Attention via bi-level optimization.
• We show that, with simple code adjustments to Slot-Attention, the proposed BO-QSA achieves state-of-the-art results on several challenging synthetic and real-world image benchmarks, outperforming previous methods by a large margin.
• We demonstrate, with qualitative results and zero-shot transfer learning experiments, the potential of BO-QSA as a better approach to concept binding and learning generalizable representations.

2. PRELIMINARIES

2.1. OBJECT-CENTRIC LEARNING WITH SLOT-ATTENTION

The Slot-Attention module computes a soft assignment of input features to slots:

$$\tilde{s} = f_{\phi_{\text{attn}}}(s, x) = \Big(\frac{A_{i,j}}{\sum_{l=1}^{N} A_{l,j}}\Big)^{\top} v(x), \quad \text{where } A = \operatorname{softmax}\Big(\frac{k(x)\, q(s)^{\top}}{\sqrt{D}}\Big) \in \mathbb{R}^{N \times K}.$$

The slots are initialized from a learnable Gaussian distribution with mean $\mu$ and variance $\sigma$. They are refined iteratively within the Slot-Attention module by passing the updates into a Gated Recurrent Unit (GRU) (Cho et al., 2014) and an MLP, parameterized by $\phi_{\text{update}}$, for $T$ iterations:

$$s^{(t+1)} = h_{\phi_{\text{update}}}\big(s^{(t)}, \tilde{s}^{(t)}\big), \quad s^{(0)} \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma)), \quad \hat{s} = s^{(T)}. \quad (1)$$

The final prediction $\hat{s}$ can be treated as the learned object-centric representation w.r.t. input features $x$. In the image domain, we take as input a set of images $I$ and encode them with $f_{\phi_{\text{enc}}}$ to obtain features $x \in \mathbb{R}^{HW \times D_{\text{input}}}$. After obtaining $\hat{s}$ through the iterative refinement procedure with $h_{\phi_{\text{update}}}$, images can be decoded from these object-centric representations with a mixture-based decoder or an autoregressive transformer-based decoder. We refer the readers to Appendix A.1 for details on different decoder designs and their ways of visualizing learned object concepts.
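The attention step above can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation: the GRU/MLP update $h_{\phi_{\text{update}}}$ is replaced by direct assignment, and `Wq`, `Wk`, `Wv` are random stand-ins for the learned projections $q$, $k$, $v$.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(slots, inputs, Wq, Wk, Wv):
    """One Slot-Attention iteration: slots compete over input features.

    slots: (K, D) current slots; inputs: (N, D) encoded features.
    Returns updated slots and the (N, K) attention matrix A.
    """
    D = slots.shape[1]
    # Softmax over the slot axis: slots compete for each input feature.
    A = softmax((inputs @ Wk) @ (slots @ Wq).T / np.sqrt(D), axis=1)
    # Normalize over inputs: each slot takes a weighted mean of its features.
    W = A / (A.sum(axis=0, keepdims=True) + 1e-8)
    return W.T @ (inputs @ Wv), A

rng = np.random.default_rng(0)
K, N, D = 4, 16, 8
slots = rng.normal(size=(K, D))
inputs = rng.normal(size=(N, D))
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))
for _ in range(3):  # T = 3 refinement iterations
    slots, A = slot_attention_step(slots, inputs, Wq, Wk, Wv)
```

Because the softmax runs over slots rather than over inputs, each input feature distributes its explanation across slots, which is what produces the soft-clustering effect described above.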

2.2. IMPROVING SLOT-ATTENTION WITH BI-LEVEL OPTIMIZATION

The problem of bi-level optimization embeds the optimization of an inner objective within an outer objective. Normally, a bi-level optimization problem can be formulated as:

$$\min_{\theta, \phi} f(\theta, \phi) \quad \text{s.t.} \quad \theta \in \arg\min_{\theta'} g(\theta', \phi), \quad (2)$$

where we call $f(\theta, \phi)$ the outer objective function and $g(\theta, \phi)$ the inner objective function. To jointly optimize both objectives w.r.t. parameters $\theta$ and $\phi$, a straightforward approach to solving Eq. (2) is to represent the inner solution of $\theta$ as a function of $\phi$, i.e., $\theta^*(\phi) = \arg\min_{\theta'} g(\theta', \phi)$.
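To make the hyper-gradient concrete, consider a toy problem (illustrative, not from the paper): inner objective $g(\theta, \phi) = (\theta - \phi)^2$ and outer objective $f(\theta, \phi) = \theta^2$. The inner solution is $\theta^*(\phi) = \phi$, so $\nabla_\phi f(\theta^*(\phi), \phi) = 2\phi$, which a finite-difference check confirms:

```python
import numpy as np

def inner_solution(phi, steps=100, lr=0.1):
    """Solve theta* = argmin_theta (theta - phi)^2 by gradient descent."""
    theta = 0.0
    for _ in range(steps):
        theta -= lr * 2.0 * (theta - phi)
    return theta

def outer(phi):
    """f(theta*(phi), phi) = theta*(phi)^2."""
    return inner_solution(phi) ** 2

phi, eps = 1.5, 1e-4
fd_grad = (outer(phi + eps) - outer(phi - eps)) / (2 * eps)
analytic = 2 * phi  # since theta*(phi) = phi
print(fd_grad, analytic)  # both ≈ 3.0
```

Differentiating through the inner solver like this (unrolling) is exactly what becomes unstable for Slot-Attention's recurrent updates, motivating the implicit and first-order approximations discussed below.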

3.2. RETHINKING BI-LEVEL OPTIMIZATION METHODS FOR QUERY SLOT ATTENTION

To improve the learning of QSA, we rewind to the idea of improving the learning of the vanilla Slot-Attention module with bi-level optimization (Chang et al., 2022). Under this formulation, Slot-Attention can be treated as solving the following objectives:

$$\min_{s, \Phi} \sum_{i=1}^{M} \mathcal{L}(x_i, s_i, \Phi) \quad \text{s.t.} \quad s_i = \arg\min_{s} \mathcal{L}_{\text{cluster}}(x_i, s, \Phi), \quad (3)$$

where $x_i$ and $s_i$ denote the input features from the $i$-th image and its corresponding slots, and $\Phi = \{\phi_{\text{init}}, \phi_{\text{attn}}, \phi_{\text{update}}\}$ denotes the parameters for assigning input features $x$ to different slots. Under this setting, the outer objective $\mathcal{L}$ is usually a reconstruction objective, and the inner objective can be viewed as a soft-clustering objective (Locatello et al., 2020). The inner objective is then solved by iterative refinement, which can be formulated as solving for fixed points (Chang et al., 2022). With Eq. (3), the instabilities through the iterative updates can be avoided by detaching gradients, treating the slots in the final iteration as an approximation of $s_i^*$, and computing first-order gradient approximations for updating $\Phi$ with $s_i^*$. However, we demonstrate in Tab. 7 that this design is only beneficial for randomly initialized slots and detrimental for query-initialized Slot-Attention architectures, since it relies heavily on a good approximation of the solution to the inner objective. With no randomness in slot initializations or gradients during training, starting from a fixed set of initialization points challenges the learning of the Slot-Attention update $F_\Phi$, as it becomes difficult to provide a good approximation of $s_i^*$ within a fixed number of iterations (see Appendix B.2). This urges the need for information flow to the slot initialization queries. We propose BO-QSA to address this learning problem of QSA. As shown in Algorithm 1, we initialize slots with learnable queries in BO-QSA and perform $T$ steps of Slot-Attention updates to obtain an approximation of $s_i^*$.
These near-optimal solutions of the inner objective are passed into one additional Slot-Attention step, where gradients to all previous iterations are detached. In contrast to I-SA, we use an STE (Bengio et al., 2013; Van Den Oord et al., 2017) to backpropagate gradients to the slot initialization queries as well. This design helps find good starting points for the inner optimization problem on clustering, alleviating the problem of bi-level optimization with QSA mentioned in Sec. 3.2. Similar to dVAE, the STE adds bias to the gradient of the initialization queries. However, since these learnable queries are meant for disentangling image features, they do not have to maintain information about the approximated $s^*$.
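The straight-through update can be written as `slots = SG(slots) + init − SG(init)` (see Algorithm 1): the forward value is unchanged, while under autograd the gradient flows only to `init`. A NumPy check of the forward identity, with the stop-gradient `SG` modeled as a plain copy since NumPy has no autograd:

```python
import numpy as np

def sg(x):
    # Stand-in for stop-gradient: identity in the forward pass;
    # in an autograd framework this would block the backward pass.
    return x.copy()

rng = np.random.default_rng(0)
init = rng.normal(size=(4, 8))   # learnable slot queries
slots = rng.normal(size=(4, 8))  # post-iteration slots

ste_slots = sg(slots) + init - sg(init)
# The forward pass is numerically identical to `slots` ...
print(np.allclose(ste_slots, slots))  # → True
# ... but in an autograd framework, gradients w.r.t. ste_slots
# would reach only `init`, not the detached iterative updates.
```

This is the same trick used in VQ-VAE's codebook training, here repurposed to route learning signal into the slot initialization queries.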

3.3. BI-LEVEL OPTIMIZED QUERY SLOT ATTENTION

Such bias could lead to learned queries that are better pivots for separating different image features, similar to anchors or filter queries learned for different tasks (Carion et al., 2020; Zhang et al., 2021). Note that we do not add constraints on the consistency between $s^{(0)}$ and $\hat{s}$ (e.g., $\|\text{sg}(\hat{s}) - s^{(0)}\|^2$) as done in dVAE, since we find such constraints lead to a mean representation of the dataset that forbids better concept binding (see Appendix B.3). As shown in Tab. 7 and Fig. 3, our learned slot initialization queries do fulfill this goal by providing a more separable initialization space and can significantly facilitate model learning. Locatello et al. (2020) demonstrate the detrimental effect of using independently initialized slots in Slot-Attention learning; however, we show that our BO-QSA method successfully overcomes this issue and generalizes the success of query networks to the domain of unsupervised object-centric learning. Chang et al. (2022) formulate Slot-Attention learning from an implicit-gradient perspective with gradient stopping derived from first-order hyper-gradient methods (Geng et al., 2021). However, they ignore the important role of slot initializations in generalization and concept binding. As our experiments suggest, such gradient-stopping methods do not guarantee superior performance compared to the original Slot-Attention. We leave the details to Sec. 5.3 for an in-depth discussion.

5. EXPERIMENTS

In this section, we aim to address the following questions with our experimental results:
• How good is our proposed BO-QSA on both synthetic and complex natural scenes?
• How important are the learnable queries and the optimization method in BO-QSA?
• Does BO-QSA possess the potential for concept binding and zero-shot transfer?
We provide details in the following sections with thorough comparative and ablative experiments and leave the details on model implementation and hyperparameter selection to Appendix A.3. Here we clarify the datasets and metrics selected for evaluating our model on each domain.

Synthetic Domain For the synthetic domain, we select three well-established, challenging multi-object datasets, ShapeStacks (Groth et al., 2018), ObjectsRoom (Kabra et al., 2019), and CLEVRTEX, for evaluating our BO-QSA model. Specifically, we consider three metrics to evaluate the quality of object segmentation and reconstruction: Adjusted Rand Index (ARI) (Hubert & Arabie, 1985) and Mean Segmentation Covering (MSC) (Engelcke et al., 2020) for segmentation, and Mean Squared Error (MSE) for reconstruction. Following the evaluation setting of recent works, we report the two segmentation metrics over foreground objects (ARI-FG and MSC-FG). Additionally, we conduct extra experiments on more datasets and leave the discussion to Appendix B.1.
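For reference, the ARI metric can be computed from the contingency table between predicted and ground-truth segment labels; a self-contained NumPy implementation (equivalent to the standard sklearn-style adjusted Rand score; the foreground-only variant simply restricts the set of pixels considered):

```python
import numpy as np

def adjusted_rand_index(labels_true, labels_pred):
    """ARI between two integer label arrays of equal length."""
    labels_true = np.asarray(labels_true)
    labels_pred = np.asarray(labels_pred)
    n = labels_true.size
    _, true_idx = np.unique(labels_true, return_inverse=True)
    _, pred_idx = np.unique(labels_pred, return_inverse=True)
    # Contingency table between the two partitions.
    contingency = np.zeros((true_idx.max() + 1, pred_idx.max() + 1), dtype=np.int64)
    np.add.at(contingency, (true_idx, pred_idx), 1)
    comb2 = lambda x: x * (x - 1) / 2.0  # "n choose 2", elementwise
    sum_comb = comb2(contingency).sum()
    sum_a = comb2(contingency.sum(axis=1)).sum()
    sum_b = comb2(contingency.sum(axis=0)).sum()
    expected = sum_a * sum_b / comb2(n)
    max_index = (sum_a + sum_b) / 2.0
    if max_index == expected:  # degenerate case, e.g. both partitions trivial
        return 1.0
    return (sum_comb - expected) / (max_index - expected)
```

ARI is invariant to label permutation, so a prediction that swaps slot identities still scores 1.0 on a perfect segmentation, which is why it suits unsupervised slot models.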

Results

We report multi-object segmentation results on synthetic datasets in Tab. 1 and visualize qualitative results in Fig. 1. As shown in Tab. 1, our BO-QSA achieves state-of-the-art results with large improvements over previous object-centric learning methods on all metrics in ShapeStacks and ObjectsRoom. We also observe more stable model performance, i.e., smaller variances in results across different trials. Our model with mixture-based decoders obtains the best overall performance on all datasets. More specifically, our mixture-based BO-QSA significantly outperforms the vanilla Slot-Attention model (∼15%) with minimal architectural differences. This validates the importance of the learnable queries and our optimization method; we continue this discussion in Sec. 5.3. As shown in Tab. 2, our model also achieves state-of-the-art results on the unsupervised object segmentation task in CLEVRTEX, with consistent improvements over Slot-Attention on the CAMO and OOD generalization splits. Interestingly, our model (1) shows larger reconstruction errors, (2) generalizes well in out-of-distribution scenarios, and (3) shows marginal improvement on camouflaged images. We attribute (1) and (3) to the simple encoder/decoder architectures currently adopted and provide insights on (2) in Sec. 5.4. We attribute the low segmentation performance of transformer-based methods to their mask prediction, which relies on the attention matrix computed over input features; this leads to coarse object masks as a result of image tokenization. Nonetheless, we observe consistent improvements from applying our slot encoder with both mixture and transformer decoders.

5.2. OBJECT DISCOVERY ON REAL DATASETS

Experimental Setup For real-world experiments, we use the same slot encoder design as in Sec. 5.1 with a 4-layer CNN image encoder and initialize slots with learnable queries.

Results We show quantitative experimental results in Tab. 5 and Tab. 4 and visualize qualitative results in Fig. 1. For multi-object segmentation, as shown in Tab. 4, our model outperforms existing object-centric learning baselines by a large margin, especially on the YCB dataset, where the segmented objects have clear semantic meanings. For foreground extraction, as shown in Tab. 5, our method significantly outperforms all existing baselines, achieving a new state-of-the-art on all datasets. We recognize the discrepancy of mixture-based decoders, in both Slot-Attention and our mixture-based design, in modeling real-world images, reflecting similar discoveries from recent works (Singh et al., 2021). On the other hand, our transformer-based model shows significant improvements over the vanilla version. Notably, our method outperforms a broad range of models, including GAN-based generative models (i.e., OneGAN, Voynov et al. (2020)) and large-scale pre-trained contrastive methods (i.e., MoCo-v2, BYOL, R2O). As shown in Tab. 6, our method achieves comparable results with state-of-the-art self-supervised contrastive learning methods without large-scale pre-training and data augmentation.

[Caption of Fig. 3: We show our BO-QSA provides the best overall separation as well as correspondence between initialization vectors and post-iteration slots. For I-SA, there exist mismatches between initialization vectors and post-iteration slots (yellow and red). The same optimization method is also not effective for I-QSA, leading to post-iteration slots mixed similarly to SA for slot initializations (best viewed in color and with zoom-in).]
This result sheds light on the potential of object-centric learning as a pre-training task for learning general visual representations.

Experimental Setup

We perform ablative studies by comparing our design against variants on ShapeStacks and Stanford Dogs. For slot initialization, we consider (1) the original Slot-Attention module's sampling initialization (SA), and (2) initialization with learnable queries (QSA). For optimization, we consider (1) the original Slot-Attention optimization (i.e., w/o detach or STE), (2) the I-SA optimization, where gradients to slots in the iterative updates are detached (i.e., w/ detach only), and (3) our optimization, where we both detach the gradients in the iterative refinement and pass gradients to the initialization queries with the STE (i.e., w/ detach and STE). For simplicity, we term these variants with the prefixes (I-) for I-SA and (BO-) for our full method. We run all ablations on each dataset with the same encoder-decoder architecture.

Results

We show experimental results in Tab. 7 and Fig. 2. First, from Tab. 7, we observe that BO-QSA significantly outperforms the other variants. For sample-based slot initializations, our method shows an effect similar to I-SA on improving Slot-Attention learning. For query-based slot initializations, we validate the difficulty of training query-based Slot-Attention with its inferior performance, and we further show the ineffectiveness of I-SA for query-based Slot-Attention. These experiments on query-based Slot-Attention confirm that both of our design choices are necessary and effective for superior performance. To study the effect of the learned queries, we visualize in Fig. 2 the results of setting different numbers of Slot-Attention iterations during inference on the Stanford Dogs dataset.* Our BO-QSA significantly outperforms the other variants with only one iteration, indicating that our query-based design can help ease training difficulties. In Fig. 3, we further visualize the learned initializations and post-iteration slots in the same feature space using t-SNE (Van der Maaten & Hinton, 2008). Our initializers provide a more separable space when differentiating image features, which validates the desired model behaviors mentioned in Sec. 3.3.

*Note that for the ShapeStacks experiment (left), we alternate object colors in CLEVRTEX with seen colors for better qualitative evaluations; we do not perform such operations for quantitative evaluations.

5.4. ADDITIONAL ANALYSES

In this section, we provide additional analyses on the potential of our BO-QSA as a concept binder generalizing to new examples. First, we qualitatively visualize the learned content of each slot (without additional clustering) in ShapeStacks, Birds, and YCB in Fig. 4. We observe high similarity within the learned content of each slot, indicating that specific slots learn similar concepts. This shows the potential of the slots in our BO-QSA for binding specific concepts of object properties (e.g., colors, contours, and spatial positions). Although we cannot control which concepts are learned, these results are important indicators that our learned initialization queries could potentially serve as generalizable concept probes. We further provide quantitative evaluations where we use models trained on dataset X for zero-shot inference on dataset Y, terming this transfer (X→Y). As shown in Tab. 8, when adapting models trained on YCB to zero-shot inference on ScanNet and COCO, our method outperforms I-SA as well as the majority of the fine-tuned methods shown in Tab. 4. Due to the page limit, we show in Appendix B.1 that this superior transfer capability is general across datasets when compared to Slot-Attention variants.

6. CONCLUSIONS

We introduce BO-QSA for unsupervised object-centric representation learning. We initialize Slot-Attention with learnable queries and combine bi-level optimization with straight-through gradient estimators to ease the difficulty of query-based Slot-Attention learning. With simple code adjustments to Slot-Attention, we obtain state-of-the-art models for unsupervised object segmentation in both synthetic and natural image domains, outperforming previous baselines by a large margin. More importantly, our learned model exhibits concept-binding effects, where visual concepts attach to specific slot queries. With a fixed number of initialized slots, our model is limited to handling a fixed maximum number of objects in the inputs. However, our queries can learn to bind object attributes, which leads to meaningful segmentation of images by grouping similar properties (e.g., color, position, etc.). As a future direction, this connects our method with weakly-supervised contrastive learning methods that learn grounded visual representations with language.

A MODEL ARCHITECTURE AND DESIGN

A.1 DESIGN OF DECODERS

In this section, we follow the notations used in Sec. 2.1 and describe two common approaches, mixture-based and transformer-based, for decoding images from the learned slot representations.

Mixture-based Decoder

The mixture-based decoder (Watters et al., 2019) decodes each slot $\hat{s}_i$ into an object image $\hat{I}_i$ and a mask $m_i$ with decoding functions $g^{\text{img}}_{\phi_{\text{dec}}}$ and $g^{\text{mask}}_{\phi_{\text{dec}}}$, which are implemented using CNNs. The decoded images and masks are calculated by:

$$\hat{I}_i = g^{\text{img}}_{\phi_{\text{dec}}}(\hat{s}_i), \quad m_i = \frac{\exp\big(g^{\text{mask}}_{\phi_{\text{dec}}}(\hat{s}_i)\big)}{\sum_{j=1}^{K} \exp\big(g^{\text{mask}}_{\phi_{\text{dec}}}(\hat{s}_j)\big)}, \quad \hat{I} = \sum_{i=1}^{K} m_i \cdot \hat{I}_i.$$

During training, a reconstruction objective is employed for supervising model learning. Despite their wide usage, mixture-based decoders have shown limited capability in handling natural scenes with high visual complexity (Singh et al., 2021).

Transformer-based Decoder

The transformer-based decoder auto-regressively predicts image tokens with the learned slot representation $\hat{s}$:

$$o_l = g^{\text{transformer}}_{\phi_{\text{dec}}}(\hat{s}; z_{<l}), \quad \text{where } z = f^{\text{dVAE}}_{\phi_{\text{enc}}}(I).$$

To train the entire model, a reconstruction objective supervises the learning of $z$ with the dVAE decoder $g^{\text{dVAE}}_{\phi_{\text{dec}}}$, while the objective for object-centric learning relies on the auto-regressive transformer predicting the correct tokens:

$$\mathcal{L} = \mathcal{L}_{\text{dVAE}} + \mathcal{L}_{\text{CE}}, \quad \text{where } \mathcal{L}_{\text{dVAE}} = \big\|g^{\text{dVAE}}_{\phi_{\text{dec}}}(z) - I\big\|_2^2, \quad \mathcal{L}_{\text{CE}} = \sum_{l=1}^{L} \operatorname{CrossEntropy}(z_l, o_l).$$

Under this setting, the model does not predict additional masks and relies on the attention $A$ within the Slot-Attention module for obtaining slot-specific object masks. Although such models can achieve competitive results on real-world datasets, as our experiments suggest, they can be inferior to mixture-based decoders on segmentation in synthetic datasets. We suspect this originates from the low resolution used when discretizing images into tokens.

A.2 BI-LEVEL OPTIMIZATION

A bi-level optimization problem can be formulated as:

$$\min_{\theta, \phi} f(\theta, \phi) \quad \text{s.t.} \quad \theta \in \arg\min_{\theta'} g(\theta', \phi), \quad (6)$$

where we call $f(\theta, \phi)$ the outer objective function and $g(\theta, \phi)$ the inner objective function. To jointly optimize both objectives w.r.t. parameters $\theta$ and $\phi$, a straightforward approach to solving Eq. (6) is to represent the inner solution of $\theta$ as a function of $\phi$, i.e., $\theta^*(\phi) = \arg\min_{\theta'} g(\theta', \phi)$.
Then we can optimize the outer objective with gradient descent:

$$\nabla_{\phi} f(\theta^*(\phi), \phi) = \big(\nabla_{\phi} \theta^*(\phi)\big)^{\top} \nabla_1 f(\theta^*(\phi), \phi) + \nabla_2 f(\theta^*(\phi), \phi).$$

However, the difficulty of this method lies in the calculation of $\nabla_{\phi} \theta^*(\phi)$, which requires solving the linear system given by the implicit function theorem:

$$\nabla_{1,1}\, g(\theta^*(\phi), \phi)\, \nabla_{\phi} \theta^*(\phi) + \nabla_{1,2}\, g(\theta^*(\phi), \phi) = 0.$$

If $\nabla_{1,1}\, g(\theta^*, \phi)$ is invertible, we can solve for $\nabla_{\phi} \theta^*(\phi)$ and obtain the gradient update on $\phi$:

$$\phi_{k+1} = \phi_k - \xi \big(\nabla_2 f_k - (\nabla_{1,2}\, g_k)^{\top} (\nabla_{1,1}\, g_k)^{-1} \nabla_1 f_k\big).$$

Our BO-QSA achieves the overall best results compared with other powerful Slot-Attention variants (models that achieve the best or second-best results in our ablation studies, as in Tab. 7), except for (Birds→Cars). Moreover, our optimization method still helps improve zero-shot transfer for randomly initialized Slot-Attention.

As described in Sec. 3.3, our method is connected with recent works on dVAE. However, we do not require the initialization queries to maintain information about the post-iteration slots $\hat{s}$, as we found such constraints lead to learning the mean representation of the dataset, which forbids disentanglement and concept binding. In this section, we provide experimental results to verify this argument. Specifically, we consider three ways to update the slot initialization queries in addition to our proposed method: (1) using the running mean of the post-iteration slots as initialization queries (RunningMean); (2) running K-Means clustering on the post-iteration slots and updating the initialization queries with the re-clustered centers via Hungarian matching (KMeans); (3) adding a consistency loss between the initialization queries and post-iteration slots as done in VQ-VAE (VQ-constraint). For (1) and (2), we empirically found such designs to suffer from frequent updates and therefore use momentum updates to stabilize their training; we term these variants with the suffix (-M). As shown in Tab.
17, our model achieves the best overall performance compared to the other initialization methods. Specifically, we find that using the running mean of post-iteration slots or K-Means cluster centers re-clustered from post-iteration slots is harmful to model performance. We attribute this effect to learning the mean representation of the dataset. This is further supported by the experiments with a VQ-VAE loss on the consistency between slot initializations and post-iteration slots (i.e., $\|\text{sg}(\hat{s}) - s^{(0)}\|^2$), where the VQ-constraint variant shows inferior performance. We also found that the weight of this additional loss needs careful tuning for the model to decompose objects; empirically, most configurations of this hyperparameter lead to bad reconstructions, except for certain small weights (e.g., the 0.01 reported here). Above all, we believe these experimental results verify the effectiveness of our design choices for initialization query learning. We provide additional visualizations of the learned slot contents for each update method in Fig. 6.
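As an illustration of the momentum variants above, a RunningMean-M style update can be sketched as follows (shapes and the momentum value are hypothetical, and a fixed batch stands in for the training stream; not the paper's exact configuration):

```python
import numpy as np

def momentum_update(init, post_slots, m=0.99):
    """RunningMean-M style update: move the initialization queries
    slowly toward the mean of the post-iteration slots."""
    batch_mean = post_slots.mean(axis=0)  # (K, D) mean over the batch
    return m * init + (1.0 - m) * batch_mean

rng = np.random.default_rng(0)
init = np.zeros((4, 8))                             # K = 4 queries of dim 8
post_slots = rng.normal(loc=1.0, size=(32, 4, 8))   # batch of post-iteration slots
for _ in range(500):
    init = momentum_update(init, post_slots)
# After many updates, init converges to the batch mean, illustrating
# why such schemes drift toward a mean representation of the dataset.
```

The convergence to the mean in this sketch mirrors the failure mode reported above: queries that average over the data lose the separability needed for concept binding.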

B.4 EXPERIMENTS ON ADDITIONAL DATASETS

In addition to the datasets considered in Sec. 5, we conduct experiments on other synthetic datasets and visualize qualitative results. More specifically, we test our model on PTR (Hong et al., 2021), a synthetic dataset of 3D objects from PartNet with rendering variations. We run our BO-QSA with the same configuration mentioned in Appendix A.3 and compare it with the vanilla Slot-Attention module on multi-object segmentation, reporting ARI-FG and MSC-FG scores on the PTR validation set. As we can see from Tab. 18, our model achieves similar performance to Slot-Attention on ARI-FG and significantly outperforms it on MSC-FG. We attribute this result to our model's capability to precisely segment objects. As ARI-FG applies masks to each slot prediction when calculating results, it does not require models to precisely segment the object from the background; MSC-FG, in contrast, uses a mIoU-like measure that requires the model to precisely predict object boundaries. This indicates that our model is better at precisely segmenting objects without noise. Similarly, we observe the binding of certain slots to scene backgrounds, but with more complex concepts, the binding of slots to concepts is not as straightforward as in ShapeStacks and CUB200 Birds. To further investigate the effectiveness and generality of our method, we adapt BO-QSA to the recent 3D object-centric learning model uORF (Yu et al., 2022) and test it on 3D datasets including CLEVR-567, Room-Chair, and Room-Diverse. uORF can decompose complex 3D scenes from a single image by combining NeRF (Mildenhall et al., 2021) with Slot-Attention. We only modify the initialization and optimization method of the Slot-Attention module in uORF, leaving all other hyperparameters unchanged. As we can see from Tab.
19, with our method, the uORF model trained for 600 epochs achieves similar or even superior results compared to the original model trained for 1200 epochs. Additionally, when the dataset complexity increases (e.g., in Room-Diverse), our method demonstrates significant improvements. Please refer to uORF (Yu et al., 2022) for more details about the model, datasets, and evaluation metrics.

C LIMITATIONS AND FUTURE WORK

We discuss the limitations of our work found in the experiments. First, we observed a strong correlation between the capacity of the encoder-decoder architecture and model performance. However, in contrast to supervised learning, more powerful encoders/decoders do not guarantee superior performance. Drawing on how contrastive learning methods have shown concept emergence with large-scale pre-training, we could also incorporate representations learned by self-supervised learning into object-centric learning to unite the best of both worlds. Second, our work is primarily limited by the fixed number of slot initialization vectors. In contrast to the vanilla Slot-Attention, which can generalize to a new number of objects, our model cannot easily generalize to scenarios with new concepts, since it learns a fixed set of separating spaces that best disentangle different parts of the image. This problem is also frequently met in semantic segmentation and object classification, where only existing concepts can be used to interpret novel objects/semantic entities. Although solutions to this closed-vocabulary problem have been proposed in supervised classification and segmentation, we leave its exploration in object-centric learning to future work. Finally, the currently learned slot initialization vectors do not explicitly bind to concepts and need to be mined by humans. We believe an important next step for this work is to combine unsupervised object-centric learning with semantic alignments from language for concept grounding. This opens future research directions on learning finer-level organizations of object concepts in more complex scenarios (e.g., hierarchical grouping) with weak supervision of correspondence.



Figure 16: Unsupervised Multi-Object Segmentation on ScanNet.



Algorithm 1: BO-QSA
Input: input features inputs, learnable queries init, number of iterations T
Output: object-centric representation slots
Modules: stop-gradient module SG(·), slot attention module SA(·, ·)

slots = init
for t = 1, · · · , T do
    slots = SA(slots, inputs)
slots = SG(slots) + init − SG(init)
slots = SA(slots, inputs)
return slots
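Algorithm 1 can be rendered as a runnable NumPy sketch of the forward pass. This is an illustration only: without autograd, SG is a no-op, the GRU/MLP slot update is omitted, and `Wq`, `Wk`, `Wv` are random stand-ins for the learned projections.

```python
import numpy as np

def softmax(z, axis):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sa_step(slots, inputs, Wq, Wk, Wv):
    """Simplified Slot-Attention step (GRU/MLP update omitted)."""
    A = softmax((inputs @ Wk) @ (slots @ Wq).T / np.sqrt(slots.shape[1]), axis=1)
    A = A / (A.sum(axis=0, keepdims=True) + 1e-8)
    return A.T @ (inputs @ Wv)

def bo_qsa_forward(inputs, init, Wq, Wk, Wv, T=3):
    sg = lambda x: x.copy()  # stop-gradient: a no-op without autograd
    slots = init
    for _ in range(T):       # T inner refinement steps (detached during training)
        slots = sa_step(slots, inputs, Wq, Wk, Wv)
    slots = sg(slots) + init - sg(init)        # straight-through path to queries
    return sa_step(slots, inputs, Wq, Wk, Wv)  # one differentiable step

rng = np.random.default_rng(0)
Wq, Wk, Wv = (0.1 * rng.normal(size=(8, 8)) for _ in range(3))
out = bo_qsa_forward(rng.normal(size=(16, 8)), rng.normal(size=(4, 8)), Wq, Wk, Wv)
```

In an autograd framework, only the final `sa_step` and the `init` term of the straight-through line would receive gradients, matching the detach-plus-STE design described in Sec. 3.3.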

Unsupervised Object-Centric Learning Our work falls into the recent line of research on unsupervised object-centric learning on images (Greff et al., 2016; Eslami et al., 2016; Greff et al., 2017; 2019; Burgess et al., 2019; Crawford & Pineau, 2019; Engelcke et al., 2020; Lin et al., 2020; Bear et al., 2020; Locatello et al., 2020; Zoran et al., 2021). A thorough review and discussion of this type of method can be found in Greff et al. (2020). One critical issue with these methods lies in handling complex natural scenes. Singh et al. (2021) and Lamb et al. (2021) leverage a transformer-based decoder with Slot-Attention to address this problem. Similar attempts have also been made by exploiting self-supervised contrastive learning (Choudhury et al., 2021; Caron et al., 2021; Wang et al., 2022; Hénaff et al., 2022) and energy-based models (Du et al., 2021; Yu et al., 2022). Our work builds upon Slot-Attention by extending it with learnable queries and a novel optimization method for learning. Our compelling experimental results suggest that our model could potentially serve as a general plug-and-play module for a wider range of modalities where variants of Slot-Attention prosper (Kipf et al., 2022; Elsayed et al., 2022; Singh et al., 2022; Yu et al., 2022; Sajjadi et al., 2022a;b).

Query Networks Sets of latent queries are commonly used in neural networks. These methods leverage permutation-equivariant network modules (e.g., GNNs (Scarselli et al., 2008) and attention modules (Vaswani et al., 2017)) in model design for solving set-related tasks such as clustering (Lee et al., 2019), outlier detection (Zaheer et al., 2017; Zhang et al., 2019), etc. These learned latent queries have been shown to have good potential as features for tasks like contrastive learning (Caron et al., 2020), object detection (Carion et al., 2020), and data compression (Jaegle et al., 2021a;b).
In contrast to the recent success of query networks in supervised or weakly-supervised learning (Carion et al., 2020; Zhang et al., 2021; Kipf et al., 2022; Elsayed et al., 2022; Xu et al., 2022), Locatello et al.

For the real image domain, we use two tasks, (1) unsupervised foreground extraction and (2) unsupervised multi-object segmentation, for evaluating our method. Specifically, we select Stanford Dogs (Khosla et al., 2011), Stanford Cars (Krause et al., 2013), CUB200 Birds (Welinder et al., 2010), and Flowers (Nilsback & Zisserman, 2010) as our benchmarking datasets for foreground extraction, and YCB (Calli et al., 2017), ScanNet (Dai et al., 2017), and COCO (Lin et al., 2014), as proposed by Yang & Yang (2022), for multi-object segmentation. We use mean Intersection over Union (mIoU) and Dice as metrics for evaluating the quality of foreground extraction, and use the evaluation metrics adopted by Yang & Yang (2022) for multi-object segmentation.

5.1 OBJECT DISCOVERY ON SYNTHETIC DATASETS

Experimental Setup We explore our proposed BO-QSA with two types of decoder designs, mixture-based and transformer-based, as discussed in Sec. 2.1 and Appendix A.1. We follow the decoder architecture in Slot-Attention (Locatello et al., 2020) for mixture-based decoders and

Figure 1: Visualization of our predicted segmentation and reconstruction results on synthetic and real images. We color the predicted mask that has a maximum intersection with the ground-truth background in black.

Figure 2: Effects of iterative updates in testing.

Figure 4: Visualization of learned concepts and attention maps in zero-shot transfer. At the top, we visualize the per-slot reconstruction of our model trained on ShapeStacks (left), Birds (middle), and YCB (right). At the bottom, we show that our learned slot initialization queries bind to the same concepts in zero-shot transfer experiments (i.e., color in ShapeStacks to CLEVRTEX, contours in Birds to Dogs and Cars, and spatial positions in YCB to ScanNet and COCO) by visualizing attention maps of slot initialization queries over input images. *Note that for the ShapeStacks experiment (left), we replace object colors in CLEVRTEX with seen colors for better qualitative evaluation; we do not perform such operations for quantitative evaluations.

Figure 6: Visualization of per-slot reconstruction for different update methods. We show that RunningMean and KMeans struggle to decompose the image, even with momentum updates. For VQ-constraint, though this model variant achieves a similar but slightly inferior effect on segmentation, it cannot preserve the same filtering property for each slot across images.

Figure 9: Unsupervised Multi-Object Segmentation on ShapeStacks.

Figure 10: Unsupervised Multi-Object Segmentation on ObjectsRoom. In contrast to ShapeStacks, we observe consistent binding of slots to ground, wall, sky, and also objects in the front.

Figure 11: Unsupervised Foreground Extraction on CUB200 Birds.

Figure 12: Unsupervised Foreground Extraction on Stanford Dogs.

Figure 14: Unsupervised Foreground Extraction on Caltech Flowers.

Figure 15: Unsupervised Multi-Object Segmentation on YCB.

Figure 17: Unsupervised Multi-Object Segmentation on COCO.

al., 2015; Burgess et al., 2019; Engelcke et al., 2021; Goyal et al., 2021), or spatial (Crawford & Pineau, 2019; Lin et al., 2020;

θ*(ϕ) = argmin_θ′ g(θ′, ϕ). Then we can optimize the outer objective with gradient descent by approximating ∇_ϕ f(θ*(ϕ), ϕ) as a function of ϕ. When the inner optimization objective can be solved by a fixed-point iteration θ = F_ϕ(θ) (Amos & Kolter, 2017; Bai et al., 2019), the bi-level optimization problem can be solved by differentiating through the fixed point. Given that Slot-Attention is, in essence, an iterative refinement method that falls into the same framework, Chang et al. (2022) adapted this technique to improve Slot-Attention training and obtained significant improvement in both model performance and training stability. We provide more discussion of this in Sec. 3.2, and of other bi-level optimization methods for approximating ∇_ϕ f(θ*(ϕ), ϕ) in Appendix A.2.
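As a toy illustration of this first-order shortcut (our own example, not the paper's model): take the contraction F_ϕ(θ) = 0.5θ + 0.5ϕ, whose fixed point is θ*(ϕ) = ϕ, and the outer objective f(θ, ϕ) = θ². Running the inner iteration without gradient tracking and differentiating only one final application of F_ϕ gives a cheap, biased estimate of ∇_ϕ f:

```python
def inner_fixed_point(phi, T=50):
    # solve the inner problem by fixed-point iteration theta <- F_phi(theta) = 0.5*theta + 0.5*phi
    theta = 0.0
    for _ in range(T):
        theta = 0.5 * theta + 0.5 * phi
    return theta  # converges to theta*(phi) = phi

def outer_grad_first_order(phi, T=50):
    theta = inner_fixed_point(phi, T)    # treated as a constant (stop-gradient)
    theta_out = 0.5 * theta + 0.5 * phi  # one differentiable application of F_phi
    # f(theta, phi) = theta**2, so df/dphi ~= f'(theta_out) * dF/dphi = 2 * theta_out * 0.5
    return 2 * theta_out * 0.5
```

The exact gradient here is 2ϕ, while the one-step estimate returns approximately ϕ: the truncation is biased but avoids backpropagating through all inner iterations, which is the trade-off these methods exploit.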

al., 2022) of s = h^ϕ_update(s, ŝ) = h^ϕ_update(s, f^ϕ_attn(s, x)) = F_Φ(s, x),

Our work is closely related to bi-level optimization methods with iterative fixed update rules for solving the inner objective. Specifically, methods are designed with implicit differentiation (Amos & Kolter, 2017; Bai et al., 2019) to stabilize the iterative update procedure. Similar formulations are also found when combined with meta-learning where Madan et al. (2021) train queries through recurrence in a meta-learning fashion and Rajeswaran et al. (2019) provides a unified view of the optimization problem with implicit gradients. Concurrent work from Chang et al. (

Multi-object segmentation results on ShapeStacks and ObjectsRoom. We report ARI-FG and MSC-FG of all models with (mean ± variance) across 3 experiment trials. We visualize the best results in bold.

Multi-object segmentation results on CLEVRTEX. We report ARI-FG (%) and MSE of all models in the form of (mean ± variance) across 3 experiment trials. We visualize the best results in bold.

Reconstruction results on ShapeStacks and ObjectsRoom (MSE ↓). We compare mixture-based and transformer-based decoder designs.

Unsupervised multi-object segmentation results on the YCB, ScanNet, and COCO variants proposed by Yang & Yang (2022). We use the same evaluation metrics as in Yang & Yang (2022) and report all models' results with (mean (variance)) over 3 experiment trials. We visualize the best results in bold.

Unsupervised foreground extraction results on CUB200 Birds (Birds), Stanford Dogs (Dogs), Stanford Cars (Cars), and Caltech Flowers (Flowers). We visualize the best results in bold.



Ablative experiments on slot initialization and optimization methods. We visualize the best results in bold and underline the second-best results. (*Note that SA represents Slot-Attention with our encoder-decoder design and is different from the original one reported in Tab. 5.)

Zero-shot transfer results of unsupervised multi-object segmentation on real images.

Transformer Decoder Recently, Singh et al. (2021; 2022) revealed the limitations of mixture-based decoders and leveraged transformers and dVAEs (Van Den Oord et al., 2017; Ramesh et al., 2021) for decoding slot-based object-centric representations. To obtain decoded images Î, they learn a separate dVAE that first encodes I into a sequence of L tokens z = {z_1, ..., z_L} with the dVAE encoder f^dVAE_ϕ_enc. Next, they use a transformer decoder g^transformer

A.2 BI-LEVEL OPTIMIZATION AND META-LEARNING

Recall the bi-level optimization problem we introduced in Sec. 2.2.

Zero-shot transfer results on unsupervised foreground extraction (mIoU ↑).

As described in Sec. 3.2, we study whether a fixed point s* could be reached within a fixed number of iterations during training. Since we hypothesized that the low performance of I-QSA in Sec. 5.3 originated from an insufficient number of starting points for fixed-point approximation, we conduct experiments on increasing the number of Slot-Attention iterations during training for I-QSA on the Dogs dataset. As shown in Tab. 16, increasing the number of Slot-Attention iterations during training for I-QSA significantly improves its performance. However, we found that adding more iterations beyond a threshold (i.e., 7 in this case) does not further improve overall performance. This verifies the need for learning slot initialization vectors to better approximate the fixed-point solution of the inner soft-clustering objective in Slot-Attention.

Increasing the number of iterations during training for I-QSA.

Comparison between update methods for slot-initialization queries.

Multi-object segmentation results on PTR. We visualize the best results in bold.

3D-object segmentation results on CLEVR-567, Room-Chair, and Room-Diverse. We visualize the best results in bold and underline the second-best results. * indicates reimplemented results.

ACKNOWLEDGEMENT

We gratefully thank all colleagues from BIGAI for fruitful discussions. We would also like to thank the anonymous reviewers for their constructive feedback. The work reported herein was supported by the National Key R&D Program of China (2021ZD0150200).

where ∇₂f_k = ∇₂f(θ*(ϕ_k), ϕ_k) and ∇₁f_k = ∇₁f(θ*(ϕ_k), ϕ_k). Various methods have been proposed to approximate the solution (Pedregosa, 2016; Lorraine et al., 2020), and we refer the readers to Ye et al. (2022) for a thorough review of related methods.

Bi-level optimization is closely related to meta-learning. In meta-learning, we have a set of meta-training tasks, each with its own training and validation data. The inner and outer objectives in Eq. (6) are substituted by averaging training and validation errors over multiple tasks (Franceschi et al., 2018), where L_i represents the task-dependent error on D_i. The final goal of meta-learning is to seek the meta-parameter ϕ shared between tasks, which later enables few-shot learning and fast adaptation. Given its connection to bi-level optimization, the previously mentioned optimization methods are broadly adapted for solving meta-learning problems (Finn et al., 2017; Nichol & Schulman, 2018; Rajeswaran et al., 2019). From the meta-learning perspective, our attempt shares similar insights with first-order meta-learning methods (Finn et al., 2017; Nichol & Schulman, 2018), where we use the gradient at some task-specific optimal solution s_i* of the inner optimization to optimize the slot initialization queries shared across the dataset in the outer objective. This meta-learning perspective also indicates the potential of our BO-QSA for fast adaptation and generalization.
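To make the analogy concrete, here is a small self-contained toy (our illustration, not the paper's training procedure) of a first-order meta-update in the style of Reptile (Nichol & Schulman, 2018): each task is a quadratic, the inner loop adapts a copy of the shared parameter with plain gradient steps, and the meta-parameter then moves toward the average adapted solution without any second-order terms, mirroring how slot initialization queries receive only first-order gradients at the inner optimum:

```python
import numpy as np

def reptile_style_update(phi, tasks, inner_steps=5, inner_lr=0.1, meta_lr=0.5):
    # first-order meta-update: adapt phi on each task with plain gradient steps,
    # then move phi toward the average adapted solution (no second-order terms)
    adapted = []
    for target in tasks:  # toy task: minimize (theta - target)**2
        theta = phi
        for _ in range(inner_steps):
            theta -= inner_lr * 2 * (theta - target)  # gradient of the task loss
        adapted.append(theta)
    return phi + meta_lr * (np.mean(adapted) - phi)
```

With symmetric tasks the meta-parameter stays put, since the adapted solutions cancel on average; with identical tasks it moves toward their shared optimum.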

A.3 IMPLEMENTATION DETAILS

We provide a visualization of our designed slot-encoder in Fig. 5 and discuss the implementation details for different experimental settings in the following sections.

A.3.1 SLOT INITIALIZATION

We initialize all models with the number of slots shown in Tab. 13. During training, we add a small perturbation to the queries by sampling from a zero-mean distribution with variance σ, as we found it empirically helpful for better performance. We perform annealing over σ to gradually eliminate the effect of this random perturbation during training. We adopt a cosine annealing strategy such that σ starts from 1 and gradually anneals to 0 after N_σ training steps, where N_σ is a hyperparameter that controls the annealing rate of σ. In our experiments, we use N_σ = 0 on Cars and Flowers and N_σ = 30000 on the rest of the datasets.

2021). However, we change the kernel size of the dVAE encoder from 1 to 3, since we find that such a change helps increase model performance when decomposing scenes. We train our model for 250k steps with a batch size of 128; all training configurations in our experiments are described in Tab. 12.
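The σ annealing described above can be sketched as follows; the exact functional form is our assumption beyond "cosine annealing from 1 to 0 over N_σ steps" (with N_σ = 0 disabling the perturbation entirely):

```python
import math

def sigma_schedule(step, n_sigma):
    # cosine-anneal the perturbation scale sigma from 1 to 0 over n_sigma steps;
    # n_sigma == 0 (or a step past the horizon) disables the perturbation
    if n_sigma == 0 or step >= n_sigma:
        return 0.0
    return 0.5 * (1.0 + math.cos(math.pi * step / n_sigma))
```

A perturbed query is then obtained as init + σ·ε, with ε drawn from the zero-mean distribution.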

A.3.4 BASELINES

The reproduction of Slot-Attention and SLATE follows the architecture and hyperparameter selection mentioned in their papers. Similar to our models, we train all baseline models for 250K steps on all datasets. For SLATE, we use an input image size of 96 on the ShapeStacks dataset, as we find that an image size of 128 causes all objects to be grouped into the same slot, resulting in low

In this section, we continue the discussion in Sec. 5.4 and provide additional zero-shot transfer results. Similarly, we use the notation (X → Y) to denote the zero-shot adaptation of models trained without supervision on dataset X to new datasets Y. For unsupervised multi-object segmentation, we report transfer results from ScanNet and COCO to all other real-image multi-object segmentation datasets, in addition to the results on YCB (mentioned in Sec. 5.4). As shown in Tab. 14, our model shows consistent improvement over Slot-Attention and I-SA during zero-shot transfer.

D ADDITIONAL VISUALIZATIONS

We provide more qualitative results of our model on different datasets in the following pages. 

