SLOTFORMER: UNSUPERVISED VISUAL DYNAMICS SIMULATION WITH OBJECT-CENTRIC MODELS

Abstract

Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer, a Transformer-based autoregressive model that operates on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, the dynamics model that SlotFormer learns without supervision can be used to improve performance on supervised downstream tasks, such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to prior work on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high-quality visual generation. In addition, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, where it is competitive with methods designed specifically for such tasks. Additional results and details are available on our website.

1. INTRODUCTION

The ability to understand complex systems and the interactions between their elements is a key component of intelligent systems. Learning the dynamics of a multi-object system from visual observations entails capturing object instances, their appearance, position, and motion, and simulating their spatio-temporal interactions. Both in robotics (Finn et al., 2016; Lee et al., 2018) and computer vision (Shi et al., 2015; Wang et al., 2017), unsupervised learning of dynamics has been a central problem due to its important practical implications. A faithful dynamics model of the environment enables future prediction and planning and, crucially, allows transferring the learned dynamics knowledge to improve downstream supervised tasks, such as visual reasoning (Chen et al., 2020b; Ding et al., 2021b), planning (Sun et al., 2022), and model-based control (Micheli et al., 2022). Yet, an effective domain-independent approach for unsupervised visual dynamics learning from video remains elusive.

One approach to visual dynamics modeling is to frame it as a prediction problem directly in pixel space (Shi et al., 2015; Wang et al., 2017; Denton & Fergus, 2018). This paradigm builds on global frame-level representations, using dense feature maps of past frames to predict future features. By design, such models are object-agnostic, treating background and foreground modeling equally. This frequently results in poorly learned object dynamics, producing unrealistic future predictions over longer horizons (Oprea et al., 2020).

Another perspective on dynamics learning is offered by object-centric dynamics models (Kosiorek et al., 2018; van Steenkiste et al., 2018; Kossen et al., 2019). This class of methods first represents a scene as a set of object-centric features (a.k.a. slots), and then learns the interactions among the slots to model scene dynamics. This formulation allows for more natural dynamics modeling and leads to more faithful simulation (Veerapaneni et al., 2020; Zoran et al., 2021).
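To make the slot-based autoregressive idea concrete, the following is a minimal NumPy sketch, not the paper's implementation: slots from the last few frames are flattened into one token sequence, a single self-attention step models their interactions, and the updated slots for the newest frame become the prediction that is fed back in for rollout. The random untrained weights, single attention head, and omission of positional encodings, feed-forward layers, and any decoder are all simplifying assumptions; slot features are assumed to come from an upstream object-centric encoder.

```python
import numpy as np

rng = np.random.default_rng(0)

T, N, D = 4, 3, 8  # burn-in frames, slots per frame, slot feature dim

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Random (untrained) projection matrices -- illustrative stand-ins
# for learned Transformer parameters.
Wq, Wk, Wv, Wo = (rng.normal(scale=D**-0.5, size=(D, D)) for _ in range(4))

def attend(tokens):
    """One single-head self-attention step over slot tokens, with a
    residual connection (no positional encoding or MLP, for brevity)."""
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    att = softmax(q @ k.T / np.sqrt(D))      # every slot attends to every slot
    return tokens + (att @ v) @ Wo

def rollout(slots, steps):
    """Autoregressively predict future slots from a burn-in window.

    slots: (T, N, D) slot features from an object-centric encoder
    (assumed to be produced by an upstream model).
    """
    history = list(slots)
    preds = []
    for _ in range(steps):
        # Flatten the last T frames of slots into one token sequence,
        # so attention spans both time and objects.
        tokens = np.concatenate(history[-T:], axis=0)  # (T*N, D)
        out = attend(tokens)
        next_slots = out[-N:]       # read off the newest frame's slots
        preds.append(next_slots)
        history.append(next_slots)  # feed predictions back in
    return np.stack(preds)          # (steps, N, D)

burn_in = rng.normal(size=(T, N, D))
future = rollout(burn_in, steps=6)
print(future.shape)  # (6, 3, 8)
```

The key design choice this sketch mirrors is that prediction happens entirely in slot space: each rolled-out step consumes only object features, never pixels, so long-horizon simulation does not require repeatedly encoding and decoding full frames.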
To achieve this goal, earlier object-centric models bake strong scene (Jiang et al., 2019) or object (Lin et al., 2020) priors into their frameworks, while more recent methods (Kipf et al., 2020; Zoran et al., 2021) learn object interactions purely from data, with the aid of Graph Neural Networks (GNNs) (Battaglia et al., 2018) or Transformers (Vaswani et al., 2017). Yet, these approaches

