QUANTIZED DISENTANGLED REPRESENTATIONS FOR OBJECT-CENTRIC VISUAL TASKS

Abstract

Recently, the pre-quantization of image features into discrete latent variables has helped to achieve remarkable results in image modeling. In this paper, we propose a method for learning discrete latent variables applied to object-centric tasks. In our approach, each object is assigned a slot, represented as a vector generated by sampling from non-overlapping sets of low-dimensional discrete variables. We empirically demonstrate that embeddings from the learned discrete latent spaces have the disentanglement property. The model is trained with set prediction and object discovery as downstream tasks. It achieves state-of-the-art results on the CLEVR dataset among a class of object-centric methods for the set prediction task. We also demonstrate manipulation of individual objects in a scene with controllable image generation in the object discovery setting.

1. INTRODUCTION

A known problem of existing neural networks is that they do not generalize at the human level (Lake et al. (2016); Greff et al. (2020)). The presumed reason for this is the inability of current neural networks to dynamically and flexibly bind information distributed throughout the network. This is called the binding problem. It affects the ability of neural networks 1) to construct meaningful representations of entities from unstructured sensory inputs; 2) to maintain the obtained separation of information at the representation level; and 3) to reuse these representations of entities for new inferences and predictions. One way to address this problem is to constrain the neural network to learn disentangled object-centric representations of a scene (Burgess et al. (2019); Greff et al. (2019); Yang et al. (2020b)).

A disentangled object-centric representation may potentially improve generalization and explainability in many machine learning domains, such as structured scene representation and scene generation (El-Nouby et al. (2019); Matsumori et al. (2021); Kulkarni et al. (2019)), reinforcement learning (Keramati et al. (2018); Watters et al. (2019a); Kulkarni et al. (2019); Berner et al. (2019); Sun et al. (2019)), planning (Migimatsu & Bohg (2020)), reasoning (Yang et al. (2020a)), and object-centric visual tasks (Groth et al. (2018); Yi et al. (2020); Singh et al. (2021)). However, recent research has focused on either object-centric or disentangled representations and has not paid enough attention to combining them. Only a few works consider both objectives (Burgess et al. (2019); Greff et al. (2019); Li et al. (2020); Yang et al. (2020b)).

We propose a method that produces a disentangled representation of objects by quantizing the corresponding slot representations. We call it Vector Quantized Slot Attention (VQ-SA). VQ-SA obtains object slots in an unsupervised manner (Locatello et al. (2020)) and then performs quantization. The slot quantization involves two steps. At the first step, we initialize several discrete latent spaces, each corresponding to one of the potential generative factors in the data. At the second step, we initialize each latent space with separate embeddings for the potential values of the corresponding generative factor. This two-step quantization allows the model to assign a particular generative factor value to a particular latent embedding.

The proposed object-centric disentangled representation improves the results of the conventional model of Locatello et al. (2020) on object-centric visual tasks such as set prediction, compared to light-weight specialized models (Locatello et al. (2020); Zhang et al. (2019)). We demonstrate this through extensive experiments on the CLEVR dataset (Johnson et al. (2017)). The commonly used metrics for measuring the degree of disentanglement are the BetaVAE score (Higgins et al. (2017a)), MIG (Chen et al. (2018)), DCI disentanglement (Eastwood & Williams (2018a)), SAP score (Kumar et al. (2018a)), and FactorVAE score (Kim & Mnih (2018)). These metrics are based on the assumption that disentanglement is achieved at the level of vector coordinates, i.e., each coordinate corresponds to a generative factor. In our approach, generative factors are expressed by vectors, and separate coordinates are not interpretable. Thus, the metrics listed above are not suitable, and the quantitative evaluation of disentanglement in the case of vector representations of generative factors remains an open question for future studies. Nevertheless, we propose the DQCF-micro and DQCF-macro methods, which qualitatively evaluate disentanglement in the object discovery task. The original Slot Attention based model achieves remarkable results in the object discovery task, but our model not only separates distributed features into object representations, it also separates the distributed features of the objects themselves into representations of their properties.

We first give an overview of the proposed VQ-SA model (Section 2.1). Then, we provide a detailed explanation of the slot quantization approach (Section 2.3) that we use to represent objects in an image. We conduct experiments on the CLEVR dataset (Johnson et al. (2017)) for the set prediction task (Section 3.1) and show that our model achieves state-of-the-art results in some settings and performs comparably well in others. We also conduct experiments on the object discovery task (Section 3.2) and show qualitative results on the CLEVR dataset. We conduct ablation studies (Section 5) and report results for modified versions of the proposed model to confirm our design choices. The learned discrete latent spaces possess the disentanglement property; we demonstrate this qualitatively (Section 4) by analyzing set prediction results. Finally, we position our work relative to other approaches (Section 6) and discuss the obtained results, advantages, and limitations of our work (Section 7). Our main contributions are as follows:

• We propose a discrete representation (quantization) of object-centric embeddings (Section 2.3) that maps them to several latent spaces.
• The quantization produces a disentangled representation (Section 4), where disentanglement is achieved at the level of latent embeddings rather than embedding coordinates.
• The learned discrete representations allow us to manipulate individual objects in a scene and to generate scenes containing objects with given attributes by manipulation in the latent space (Section 3.2).
• The proposed VQ-SA model achieves state-of-the-art results on the set prediction task on the CLEVR dataset (Section 3.1) among a class of object-centric methods.
• We propose the DQCF-micro and DQCF-macro methods, which qualitatively evaluate the disentanglement of the learned discrete variables when they are represented by vectors rather than by vector coordinates.

2.1 OVERVIEW

To obtain valuable object representations, we should first discover objects in the image and then transform their representations into the desired ones. We discover objects in an unsupervised manner using the slot attention mechanism (Locatello et al. (2020)). The idea of slot representation is to map an input (image) to a set of latent variables (slots) instead of a single latent vector (Kingma & Welling (2014); Rezende et al. (2014)), such that each slot describes a part of the input (Locatello et al. (2020); Engelcke et al. (2020; 2021)). We assign each object to a slot of dimension d_s. We then transform each slot into the desired latent representation. We draw inspiration from the discrete latent representation proposed by van den Oord et al. (2017) and apply a modification of it to each slot. Instead of using a single discrete latent space to map slots, we use multiple latent spaces, each with a small embedding dimension d_l (d_l < d_s) and a small number of embeddings. The small dimension of the vectors in the latent spaces enables us to construct the resulting slot representation by concatenation, which can be seen as constructing a new vector of factors from the given ones.
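The per-slot quantization into multiple small latent spaces can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the random linear projections, the codebook sizes, and the hard nearest-neighbor assignment are all assumptions for the sketch, and a trained model would instead learn the codebooks with straight-through gradient estimation as in van den Oord et al. (2017).

```python
import numpy as np

rng = np.random.default_rng(0)

d_s = 64    # slot dimension
K = 4       # number of discrete latent spaces (one per hypothesised generative factor)
d_l = 8     # embedding dimension in each latent space (d_l < d_s)
n_emb = 16  # number of embeddings (codes) per latent space

# K small codebooks, one per latent space.
codebooks = [rng.normal(size=(n_emb, d_l)) for _ in range(K)]
# Hypothetical linear maps projecting a slot into each low-dimensional latent space.
projections = [rng.normal(size=(d_s, d_l)) / np.sqrt(d_s) for _ in range(K)]

def quantize_slot(slot):
    """Map one slot vector to a concatenation of K codebook embeddings."""
    parts = []
    for P, book in zip(projections, codebooks):
        z = slot @ P                                    # low-dimensional query
        idx = np.argmin(((book - z) ** 2).sum(axis=1))  # nearest code in this space
        parts.append(book[idx])                         # selected discrete embedding
    return np.concatenate(parts)                        # shape (K * d_l,)

slot = rng.normal(size=d_s)
q = quantize_slot(slot)
print(q.shape)  # (32,)
```

Because each latent space contributes an independent d_l-dimensional segment of the final vector, replacing the selected code in one codebook changes only that segment, which is the mechanism that makes per-attribute manipulation in the latent space possible.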

