INTERACTION-BASED DISENTANGLEMENT OF ENTITIES FOR OBJECT-CENTRIC WORLD MODELS

Abstract

Perceiving the world compositionally, in terms of space and time, is essential to understanding object dynamics and solving downstream tasks. Object-centric generative models have steadily improved at learning distinct representations of individual objects and predicting their interactions, and how to utilize the learned representations to solve untrained, downstream tasks has become a focal question. However, because models still struggle to predict object interactions and to track objects accurately, especially in unseen configurations, using object-centric representations in downstream tasks remains a challenge. This paper proposes STEDIE, a new model that disentangles object representations, based on interactions, into interaction-relevant relational features and interaction-irrelevant global features without supervision. Empirical evaluation shows that the proposed model factorizes global features, which are unaffected by interactions, from the relational features that are necessary to predict the outcome of interactions. We also show that STEDIE achieves better performance in planning tasks and in understanding causal relationships. In both tasks, our model not only achieves better reconstruction quality but also utilizes the disentangled representations to solve the tasks in a structured manner.

1. INTRODUCTION

Humans have evolved to perceive the world in a structured manner: we can infer unseen objects, predict their interactions with the environment, and plan to use them to perform certain tasks. Previous works have emphasized that achieving similar levels of systematic generalization in deep learning (Goyal & Bengio, 2020) requires a new set of inductive biases that would enable a model to perceive the world as a composition of objects and their relationships (Greff et al., 2020). Recent works on object-centric learning have devised generative models that achieve spatial disentanglement; i.e., they decompose images into individual objects in an unsupervised manner (Kosiorek et al., 2018; Greff et al., 2017; Van Steenkiste et al., 2018; Hsieh et al., 2018; Lin et al., 2020; Kossen et al., 2020). This is achieved through various inductive biases, such as object propagation and discovery (Kosiorek et al., 2018; Greff et al., 2017; Van Steenkiste et al., 2018), temporal decomposition (Hsieh et al., 2018), background modeling (Lin et al., 2020), and object-interaction modeling (Kossen et al., 2020). Given the success of object-centric learning, several studies have investigated its effectiveness in learning world models (Ha & Schmidhuber, 2018) to solve downstream tasks (Veerapaneni et al., 2020; Watters et al., 2019a; Kossen et al., 2020; Min et al., 2021). Because object-centric learning represents a scene as a composition of objects and their interactions, it can drastically reduce the complexity of modeling the temporal evolution of the scene and therefore help in predicting the future and planning. For example, Veerapaneni et al. (2020) propose OP3, the first fully probabilistic object-centric and action-conditioned video prediction model, and show that it can model interactions between entities and plan interactions to solve simple block-stacking tasks.
Nevertheless, prior works have still struggled to solve downstream tasks, as accurately modeling object interactions and keeping track of individual objects remain common challenges. In addition, few studies have proposed models conditioned on actions (Veerapaneni et al., 2020; Kossen et al., 2020). To improve video generation quality and downstream performance, we introduce interaction-based disentanglement, which factorizes the representations of entities into interaction-relevant relational features and interaction-irrelevant global features. Here, interaction relevance refers to a feature affecting the future properties of other objects via interaction; e.g., the weight of an object is essential to determining whether it would be moved after contact occurs. At the same time, some object features would not be affected by interaction; e.g., the shape of a rigid body would remain unchanged. Importantly, this factorization must be fully unsupervised for the model to handle diverse causal relationships, as hand-crafting a decomposition for each task is an infeasible burden. Table 1 summarizes relevant earlier research (Hsieh et al., 2018; Veerapaneni et al., 2020; Zoran et al., 2021; Kossen et al., 2020). Although learning a single representation or explicitly factorizing object representations for model-based RL has been explored in prior investigations, implicit factorization has not.
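To illustrate the kind of factorization intended here, the following toy numpy sketch (not the paper's architecture; all names, sizes, and the interaction function are invented for illustration) splits each entity latent into a global part that a dynamics step copies through unchanged and a relational part that is updated from pairwise interactions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: each entity latent is the concatenation of an
# interaction-irrelevant "global" part and an interaction-relevant
# "relational" part.
D_GLOBAL, D_REL, N_ENTITIES = 4, 4, 3

def split_latent(z):
    """Factorize one entity latent into (global, relational) features."""
    return z[:D_GLOBAL], z[D_GLOBAL:]

def pairwise_effect(rel_i, rel_j, W):
    """Toy interaction function: only relational features enter it."""
    return np.tanh(W @ np.concatenate([rel_i, rel_j]))

Z = rng.normal(size=(N_ENTITIES, D_GLOBAL + D_REL))
W = rng.normal(size=(D_REL, 2 * D_REL))

# One dynamics step: each entity's relational features are updated from
# pairwise interactions; its global features are copied through unchanged.
Z_next = Z.copy()
for i in range(N_ENTITIES):
    g_i, r_i = split_latent(Z[i])
    effect = sum(pairwise_effect(r_i, split_latent(Z[j])[1], W)
                 for j in range(N_ENTITIES) if j != i)
    Z_next[i] = np.concatenate([g_i, r_i + effect])

# Interaction-irrelevant features are untouched by the dynamics step.
assert np.allclose(Z_next[:, :D_GLOBAL], Z[:, :D_GLOBAL])
```

The point of the sketch is only the structural bias: because the interaction function never reads or writes the global slice, any feature that must influence other objects is forced into the relational slice.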

2. RELATED WORKS

There has been much research on using Variational Autoencoders (VAEs) (Kingma & Welling, 2013) to learn object-factorized representations for both images and videos containing multiple objects. An object-centric generative model trained on videos should learn disentangled representations of not only the objects but also their dynamics, which would lead to better generation of unseen objects or dynamics and better understanding of the scenes. Expanding the data domain temporally raises two major challenges: keeping track of each object and modeling object interactions, such as collisions. To overcome these hurdles, there have been mainly three approaches in designing
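The VAE objective shared by these models can be stated compactly. Below is a minimal numpy sketch (function names are ours, not from any cited codebase) of the per-sample evidence lower bound for a diagonal-Gaussian encoder and a unit-variance Gaussian decoder:

```python
import numpy as np

def gaussian_kl(mu, logvar):
    """KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(logvar) + mu ** 2 - 1.0 - logvar, axis=-1)

def elbo(x, x_recon, mu, logvar):
    """Per-sample ELBO with a unit-variance Gaussian decoder:
    reconstruction log-likelihood (up to a constant) minus the KL term."""
    recon_ll = -0.5 * np.sum((x - x_recon) ** 2, axis=-1)
    return recon_ll - gaussian_kl(mu, logvar)

# A perfect reconstruction with a posterior equal to the prior gives an
# ELBO of 0 (up to the dropped normalization constant).
x = np.ones((2, 5))
val = elbo(x, x, np.zeros((2, 3)), np.zeros((2, 3)))
```

Object-centric variants replace the single latent with one latent per object slot and sum the per-slot KL terms, but the overall bound keeps this reconstruction-minus-KL shape.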



Table 1: Comparison of relevant work on object-centric generative models and their utilization for model-based RL. The models can be categorized by whether they model interactions, whether they are action-conditioned, and the type of factorization of object representations. OP3 (Veerapaneni et al., 2020) is similar to our work in that it models object interactions and agent actions, but it represents each object as a single latent variable. PARTS (Zoran et al., 2021) likewise uses single latents, but its generative process is not conditioned on actions. DDPAE (Hsieh et al., 2018) is also close to our work in that it learns a factorization implicitly, but it neither models interactions nor is action-conditioned. For a thorough comparison, see Appendix G.

To enable interaction-based disentanglement, we propose SpatioTEmporal Disentanglement from Interaction of Entities (STEDIE): a fully probabilistic, object-centric, and action-conditioned video prediction model. By designing generative and inference models that implement interaction-based disentanglement, STEDIE disentangles videos both spatially into objects and temporally into relational and global features. Whereas previous works (Li & Mandt, 2018; Hsieh et al., 2018) have factorized representations into time-varying and time-invariant features, we model object interactions with neural networks to motivate a temporal factorization into interaction-relevant and interaction-irrelevant features. The model can be trained end-to-end from raw video alone with the standard evidence lower bound (ELBO). In our experiments, we first demonstrate the model's ability to disentangle videos spatiotemporally. Furthermore, to verify whether such a decomposition helps solve downstream tasks, we evaluate the trained model on a planning task and a causal-relationship-understanding task. In the planning task, combined with the cross-entropy method (Rubinstein & Kroese, 2014), we show that STEDIE performs a complex task (building block towers) better (by up to 13%) and approximately 2x more efficiently than OP3. For causal understanding, we use CausalMBRL (Ke et al., 2021), a recently proposed video prediction benchmark requiring an understanding of causal relationships. We show that STEDIE outperforms various baselines, including a standard VAE-based model, an object-centric method trained with a pixel-based loss function (CSWM (Kipf et al., 2020)), and OP3, by a large margin.
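The cross-entropy method used in the planning experiments admits a short sketch. In the toy numpy example below (illustrative only: the quadratic `rollout_cost` stands in for a learned world model's predicted task cost, and all names and hyperparameters are invented), a Gaussian over action sequences is repeatedly refit to the lowest-cost samples:

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_cost(actions):
    """Stand-in for a learned world model's predicted task cost.
    Here: make the summed action sequence reach the point (2, -1)."""
    return np.sum((actions.sum(axis=0) - np.array([2.0, -1.0])) ** 2)

def cem_plan(horizon=5, dim=2, pop=64, n_elite=8, iters=20):
    """Cross-entropy method: sample action sequences from a Gaussian,
    keep the lowest-cost ("elite") ones, refit the Gaussian, repeat."""
    mu = np.zeros((horizon, dim))
    sigma = np.ones((horizon, dim))
    for _ in range(iters):
        samples = mu + sigma * rng.normal(size=(pop, horizon, dim))
        costs = np.array([rollout_cost(s) for s in samples])
        elites = samples[np.argsort(costs)[:n_elite]]
        mu = elites.mean(axis=0)
        sigma = elites.std(axis=0) + 1e-6  # keep a floor on exploration
    return mu

plan = cem_plan()  # cost of the returned plan shrinks toward zero
```

In the paper's setting the cost would instead come from rolling the action sequence through the learned dynamics model and scoring the predicted latents or frames against the goal; the refit loop itself is unchanged.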

