QUANTIZED DISENTANGLED REPRESENTATIONS FOR OBJECT-CENTRIC VISUAL TASKS

Abstract

Recently, the pre-quantization of image features into discrete latent variables has helped to achieve remarkable results in image modeling. In this paper, we propose a method for learning discrete latent variables applied to object-centric tasks. In our approach, each object is assigned a slot, represented as a vector generated by sampling from non-overlapping sets of low-dimensional discrete variables. We empirically demonstrate that embeddings from the learned discrete latent spaces have the disentanglement property. The model is trained with set prediction and object discovery as downstream tasks. It achieves state-of-the-art results on the CLEVR dataset among object-centric methods on the set prediction task. We also demonstrate manipulation of individual objects in a scene with controllable image generation in the object discovery setting.

1. INTRODUCTION

A known problem of existing neural networks is that they cannot generalize at the human level (Lake et al. (2016); Greff et al. (2020)). It is assumed that the reason for this is the inability of current neural networks to dynamically and flexibly bind information distributed throughout the network. This is called the binding problem. It affects the ability of neural networks 1) to construct meaningful representations of entities from unstructured sensory inputs; 2) to maintain the obtained separation of information at the representation level; 3) to reuse these representations of entities for new inferences and predictions. One way to address this problem is to constrain the neural network to learn disentangled object-centric representations of a scene (Burgess et al. (2019); Greff et al. (2019); Yang et al. (2020b)). A disentangled object-centric representation may potentially improve generalization and explainability in many machine learning domains such as structured scene representation and scene generation (El-Nouby et al. (2019); Matsumori et al. (2021); Kulkarni et al. (2019)), reinforcement learning (Keramati et al. (2018); Watters et al. (2019a); Kulkarni et al. (2019); Berner et al. (2019); Sun et al. (2019)), planning (Migimatsu & Bohg (2020)), reasoning (Yang et al. (2020a)), and object-centric visual tasks (Groth et al. (2018); Yi et al. (2020); Singh et al. (2021)). However, recent research has focused on either object-centric or disentangled representations and has not paid enough attention to combining them; only a few works consider both objectives (Burgess et al. (2019); Greff et al. (2019); Li et al. (2020); Yang et al. (2020b)).

We propose a method that produces disentangled representations of objects by quantizing the corresponding slot representations. We call it Vector Quantized Slot Attention (VQ-SA). VQ-SA obtains object slots in an unsupervised manner (Locatello et al. (2020)) and then performs quantization. The slot quantization involves two steps. First, we initialize several discrete latent spaces, each corresponding to one of the potential generative factors in the data. Second, we initialize each latent space with separate embeddings for the potential values of the corresponding generative factor. This two-step quantization allows the model to assign a particular generative factor value to a particular latent embedding.
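The two-step quantization of a slot can be sketched as a nearest-neighbour lookup into per-factor codebooks. The sketch below is illustrative only, not the paper's implementation: the function name `quantize_slot`, the equal-sized split of the slot vector, the codebook shapes, and the Euclidean distance metric are all assumptions for demonstration.

```python
import numpy as np

def quantize_slot(slot, codebooks):
    """Quantize one slot vector against non-overlapping discrete latent spaces.

    slot:      (D,) vector, split into len(codebooks) equal chunks,
               one chunk per hypothesized generative factor.
    codebooks: list of (V_f, D_f) arrays; row v of codebook f is the
               embedding for candidate value v of factor f.
    Returns the concatenated quantized slot and the chosen code indices.
    """
    chunks = np.split(slot, len(codebooks))
    quantized, codes = [], []
    for chunk, book in zip(chunks, codebooks):
        # Nearest-neighbour lookup: pick the embedding closest to the chunk,
        # assigning this factor's value to one discrete latent embedding.
        dists = np.linalg.norm(book - chunk, axis=1)
        idx = int(np.argmin(dists))
        codes.append(idx)
        quantized.append(book[idx])
    return np.concatenate(quantized), codes

# Toy usage: 8-dim slot, 2 factors, 4 candidate values per factor.
rng = np.random.default_rng(0)
D, F, V = 8, 2, 4
codebooks = [rng.normal(size=(V, D // F)) for _ in range(F)]
slot = rng.normal(size=D)
q, codes = quantize_slot(slot, codebooks)
```

Because each chunk of the slot is matched against its own codebook, the discrete code for one factor cannot interfere with another, which is the intuition behind the non-overlapping latent spaces.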

The proposed object-centric disentangled representation improves on the results of the conventional model from Locatello et al. (2020) on object-centric visual tasks such as set prediction, compared to lightweight specialized models (Locatello et al. (2020); Zhang et al. (2019)). We demonstrate this through extensive experiments on the CLEVR dataset (Johnson et al. (2017)).

