OBJECT-CENTRIC LEARNING WITH SLOT MIXTURE MODELS

Abstract

Object-centric architectures usually apply a differentiable module to the entire feature map to decompose it into sets of entity representations called slots. Some of these methods structurally resemble clustering algorithms, where the cluster center in latent space serves as the slot representation. Slot Attention is an example of such a method, acting as a learnable analog of the soft K-Means algorithm. In this work, we use a learnable clustering method based on the Gaussian Mixture Model. Unlike other approaches, we represent slots not only as cluster centers but also use information about the distance between clusters and their assigned vectors, which leads to more expressive slot representations. Our experiments demonstrate that using this approach instead of Slot Attention improves performance in different object-centric scenarios, achieving state-of-the-art results in the set property prediction task.

1. INTRODUCTION

In recent years, interest in object-centric representations has greatly increased (Greff et al. (2019); Burgess et al. (2019); Li et al. (2020); Engelcke et al. (2020; 2021); Locatello et al. (2020)). Such representations have the potential to improve the generalization ability of machine learning methods in many domains, such as reinforcement learning (Keramati et al. (2018); Watters et al. (2019a); Kulkarni et al. (2019); Berner et al. (2019); Sun et al. (2019)), scene representation and generation (El-Nouby et al. (2019); Matsumori et al. (2021); Kulkarni et al. (2019)), reasoning (Yang et al. (2020)), object-centric visual tasks (Groth et al. (2018a); Yi et al. (2020); Singh et al. (2021b)), and planning (Migimatsu & Bohg (2020)).

Automatic segmentation of objects in a scene and the formation of a structured latent space can be carried out in various ways (Greff et al. (2020)): augmenting features with grouping information, using a tensor product representation, engaging ideas of attractor dynamics, etc. However, the most effective method for learning object-centric representations is Slot Attention (Locatello et al. (2020)). Slot Attention maps the input feature vectors received from a convolutional encoder to a fixed number of output feature vectors, called slots. As a result of training, each object is assigned a corresponding slot; if the number of slots exceeds the number of objects, the remaining slots stay empty (contain no object). This approach has shown significant results in object-centric tasks such as set property prediction and object discovery.

The iterative slot-based approach can be considered a variant of the soft K-Means algorithm (Bauckhage (2015b)) in which the key/value/query projections are replaced with the identity function and the recurrent-network update is excluded (Locatello et al. (2020)). In our work, we propose another generalization, in which the K-Means algorithm is considered as a Gaussian Mixture Model (Bauckhage (2015a)). We represent slots not only as cluster centers but also use information about the distance between clusters and their assigned vectors, which leads to more expressive slot representations. Representing slots in this way improves model quality on object-centric problems, achieving state-of-the-art results in the set property prediction task even compared to highly specialized models (Zhang et al. (2019b)), and also improves the generalization ability of the model on the image reconstruction task.

The paper is structured as follows. In Section 2, we provide background on the Slot Attention module and mixture models, and describe how they are trained. In Section 3, we introduce the Slot Mixture Module, a modification of the Slot Attention module which provides
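The contrast between the two clustering views can be sketched in a few lines of NumPy. This is only an illustrative, non-learnable toy (no key/value/query projections, no GRU update, and all function names here are ours, not from the paper's code): `soft_kmeans_step` mirrors the soft K-Means update underlying Slot Attention, where a slot is just a cluster center, while `gmm_e_step` shows the Gaussian Mixture view, where each slot additionally carries a per-cluster variance, i.e. a distance statistic between the center and its assigned vectors.

```python
import numpy as np

def soft_kmeans_step(inputs, slots, temperature=1.0):
    """One soft K-Means update: softmax assignment of inputs over slots
    (slots compete for inputs), then each slot becomes the
    attention-weighted mean of the inputs assigned to it."""
    # squared Euclidean distances, shape (n_inputs, n_slots)
    d2 = ((inputs[:, None, :] - slots[None, :, :]) ** 2).sum(-1)
    logits = -d2 / temperature
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)           # softmax over slots
    weights = attn / (attn.sum(axis=0, keepdims=True) + 1e-8)
    return weights.T @ inputs                         # updated slot centers

def gmm_e_step(inputs, means, variances, priors):
    """E-step of an isotropic GMM: responsibilities depend on both the
    distance to each center and the per-cluster variance, so a slot can
    be summarized by its mean *and* its spread."""
    n, d = inputs.shape
    d2 = ((inputs[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    log_p = (np.log(priors)[None, :]
             - 0.5 * d * np.log(2 * np.pi * variances)[None, :]
             - 0.5 * d2 / variances[None, :])
    log_p -= log_p.max(axis=1, keepdims=True)
    resp = np.exp(log_p)
    return resp / resp.sum(axis=1, keepdims=True)

# Toy data: two well-separated Gaussian blobs with different spreads.
rng = np.random.default_rng(0)
inputs = np.concatenate([rng.normal(0.0, 0.5, (50, 2)),
                         rng.normal(4.0, 1.5, (50, 2))])
slots = np.array([[0.0, 0.0], [3.0, 3.0]])            # illustrative init
for _ in range(10):
    slots = soft_kmeans_step(inputs, slots)

# GMM view: each slot is augmented with a variance; a richer slot
# representation could concatenate the two, e.g. [mean, sigma^2].
resp = gmm_e_step(inputs, slots,
                  variances=np.array([0.25, 2.25]),
                  priors=np.array([0.5, 0.5]))
```

In the K-Means view both blobs yield only a 2-D center per slot; in the GMM view the tight blob and the diffuse blob are distinguished by their variances even when their centers are equally informative, which is the extra information the Slot Mixture Module exploits.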

