OBJECT-CENTRIC LEARNING WITH SLOT MIXTURE MODELS

Abstract

Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster center in latent space serves as a slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft k-means algorithm. In our work, we use a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as cluster centers but also use information about the distance between clusters and assigned vectors, which leads to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in different object-centric scenarios, achieving state-of-the-art performance in the set property prediction task.

1. INTRODUCTION

In recent years, interest in object-centric representations has greatly increased (Greff et al. (2019); Burgess et al. (2019); Li et al. (2020); Engelcke et al. (2020; 2021); Locatello et al. (2020)). Such representations have the potential to improve the generalization ability of machine learning methods in many domains, such as reinforcement learning (Keramati et al. (2018); Watters et al. (2019a); Kulkarni et al. (2019); Berner et al. (2019); Sun et al. (2019)), scene representation and generation (El-Nouby et al. (2019); Matsumori et al. (2021); Kulkarni et al. (2019)), reasoning (Yang et al. (2020)), object-centric visual tasks (Groth et al. (2018a); Yi et al. (2020); Singh et al. (2021b)), and planning (Migimatsu & Bohg (2020)). Automatic segmentation of objects in a scene and the formation of a structured latent space can be carried out in various ways (Greff et al. (2020)): augmentation of features with grouping information, the use of tensor product representations, ideas from attractor dynamics, etc. However, the most effective method for learning object-centric representations is Slot Attention (Locatello et al. (2020)). Slot Attention maps the input feature vectors received from a convolutional encoder to a fixed number of output feature vectors, which are called slots. As a result of training, each object is assigned to a corresponding slot; if the number of slots is greater than the number of objects, some slots remain empty (do not contain objects). This approach has shown significant results in object-centric tasks such as set property prediction and object discovery. The iterative slot-based approach can be considered a variant of the soft k-means algorithm (Bauckhage (2015b)) in which the key/value/query projections are replaced with the identity function and the recurrent update is excluded (Locatello et al. (2020)). In our work, we propose another generalization, in which the k-means algorithm is extended to a Gaussian Mixture Model (Bauckhage (2015a)). We represent slots not only as cluster centers, but also use information about the distance between clusters and assigned vectors, which leads to more expressive slot representations. Representing slots in this way improves model quality on object-centric problems, achieving state-of-the-art results in the set property prediction task even compared to highly specialized models (Zhang et al. (2019b)), and also improves the generalization ability of the model on the image reconstruction task. The paper is structured as follows.
In Section 2, we provide background information about the Slot Attention module and mixture models, and describe how they are trained. In Section 3, we introduce the Slot Mixture Module, a modification of the Slot Attention module that provides more expressive slot representations. In Section 4.1, we show through extensive experiments that the proposed Slot Mixture Module reaches state-of-the-art performance in the set property prediction task on the CLEVR dataset (Johnson et al. (2017)) and outperforms even highly specialized models (Zhang et al. (2019b)). In Section 4.2, we provide experimental results for the image reconstruction task on four datasets: three with synthetic images (CLEVR-Mirror (Singh et al. (2021a)), ShapeStacks (Groth et al. (2018b)), ClevrTex (Karazija et al. (2021))) and one with real-life images, COCO-2017 (Lin et al. (2014)). The proposed Slot Mixture Module improves reconstruction performance. In Section 4.3, we demonstrate that the Slot Mixture Module outperforms the original Slot Attention on the object discovery task on the CLEVR10 dataset. In Section 4.4, we compare k-means and Gaussian Mixture Model clustering approaches on the set property prediction task and show that Gaussian Mixture Model clustering is a better choice for object-centric learning. We give a short overview of related work in Section 5. In Section 6, we discuss the obtained results, advantages, and limitations of our work.

The main contributions of our paper are as follows: 1. We propose a generalization of the slot-based approach for object-centric representations as a Gaussian Mixture Model (Section 3). 2. Such a representation achieves state-of-the-art performance in the set property prediction task, even in comparison with specialized models that are not aimed at building disentangled representations (Section 4.1). 3. The slot representations as a Gaussian mixture improve the generalization ability of the model in other object-centric tasks (Section 4.2). 4. The Slot Mixture Module shows much faster convergence on the object discovery task compared to the original Slot Attention (Section 4.3).
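The distinction between soft k-means and Gaussian-mixture clustering that motivates the contributions above can be illustrated with a minimal NumPy sketch (an isotropic-GMM E-step written for illustration, not the paper's actual module; the function names and toy data are ours): soft k-means assignments depend only on centroid distances, whereas GMM responsibilities also use per-cluster variances and mixing weights, which is the extra information a mixture-based slot update can exploit.

```python
import numpy as np

def soft_kmeans_resp(x, mu, beta=1.0):
    """Soft k-means assignments: only distances to centroids matter."""
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K) squared distances
    e = np.exp(-beta * d2)
    return e / e.sum(axis=1, keepdims=True)               # rows sum to 1

def gmm_resp(x, mu, var, pi):
    """E-step of an isotropic GMM: responsibilities also depend on
    per-cluster variances `var` and mixing weights `pi`."""
    N, D = x.shape
    d2 = ((x[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K)
    log_p = (np.log(pi)[None, :]
             - 0.5 * D * np.log(2 * np.pi * var)[None, :]
             - d2 / (2 * var[None, :]))
    log_p -= log_p.max(axis=1, keepdims=True)             # numerical stability
    p = np.exp(log_p)
    return p / p.sum(axis=1, keepdims=True)

# toy data: two points sitting exactly on the two centroids
x = np.array([[0.0, 0.0], [2.0, 0.0]])
mu = np.array([[0.0, 0.0], [2.0, 0.0]])
r_km = soft_kmeans_resp(x, mu)
# a tight cluster (var=0.1) and a broad one (var=10.0) give different
# responsibilities even though the distances are symmetric
r_gmm = gmm_resp(x, mu, var=np.array([0.1, 10.0]), pi=np.array([0.5, 0.5]))
```

Both functions return an (N, K) matrix of soft assignments; the difference is that `gmm_resp` sharpens or softens assignments per cluster according to its variance.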

2. BACKGROUND

2.1 SLOT ATTENTION

The Slot Attention (SA) module (Locatello et al. (2020)) is an iterative attention mechanism designed to map a distributed feature map to a set of K slots. Slots are initialized randomly from a Gaussian distribution with trainable parameters and are used to obtain the q projections; the feature-map vectors with their corresponding projections serve as the k and v vectors. Dot-product attention between the q and k vectors with a softmax across the slot (q) dimension induces competition between slots for explaining parts of the input. The attention coefficients are then used to assign the v vectors to slots via a weighted mean:

$$M = \frac{1}{\sqrt{D}}\, k(\text{inputs})\, q(\text{slots})^T \in \mathbb{R}^{N \times K}, \qquad \text{attn}_{i,j} = \frac{e^{M_{i,j}}}{\sum_{l=1}^{K} e^{M_{i,l}}},$$
$$W_{i,j} = \frac{\text{attn}_{i,j}}{\sum_{l=1}^{N} \text{attn}_{l,j}}, \qquad \text{updates} = W^T v(\text{inputs}) \in \mathbb{R}^{K \times D}.$$

A Gated Recurrent Unit (GRU) (Cho et al. (2014)) is used for an additional slot update: it takes the slot representations before the update iteration as its hidden state and the updated slots as its inputs. An important property of Slot Attention is that it is permutation invariant with respect to the input feature-map vectors and permutation equivariant with respect to the slots. These properties make the Slot Attention module suitable for operating with sets and object-centric representations. Technically, Slot Attention is a learnable analogue of the k-means clustering algorithm with an additional trainable GRU update step and dot-product similarity (with trainable q, k, v projections) instead of the Euclidean distance between the input vectors and cluster centroids. At the same time, k-means clustering can be considered a simplified version of the Gaussian Mixture Model.
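The update equations above can be sketched in a few lines of NumPy (a minimal illustration: identity matrices stand in for the learned q/k/v projections, and the GRU update is omitted for brevity).

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def slot_attention_step(inputs, slots, Wq, Wk, Wv):
    """One Slot Attention iteration (without the GRU update).

    inputs: (N, D) feature-map vectors; slots: (K, D) current slots;
    Wq, Wk, Wv: (D, D) linear projections (stand-ins for the learned ones).
    """
    N, D = inputs.shape
    q = slots @ Wq                      # (K, D)
    k = inputs @ Wk                     # (N, D)
    v = inputs @ Wv                     # (N, D)
    M = (k @ q.T) / np.sqrt(D)          # (N, K) similarity logits
    attn = softmax(M, axis=1)           # softmax over slots: slots compete per input
    W = attn / attn.sum(axis=0, keepdims=True)  # normalize over inputs per slot
    return W.T @ v                      # (K, D) weighted-mean slot updates

# toy usage: 6 input vectors, 3 slots, dimension 4
rng = np.random.default_rng(0)
inputs = rng.normal(size=(6, 4))
slots = rng.normal(size=(3, 4))
eye = np.eye(4)
new_slots = slot_attention_step(inputs, slots, eye, eye, eye)
```

Because each column of W sums to one, every updated slot is a convex combination of the v vectors, which is exactly the weighted-mean assignment step of a soft clustering algorithm.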




