SLOTFORMER: UNSUPERVISED VISUAL DYNAMICS SIMULATION WITH OBJECT-CENTRIC MODELS

Abstract

Understanding dynamics from visual observations is a challenging problem that requires disentangling individual objects from the scene and learning their interactions. While recent object-centric models can successfully decompose a scene into objects, modeling their dynamics effectively still remains a challenge. We address this problem by introducing SlotFormer, a Transformer-based autoregressive model operating on learned object-centric representations. Given a video clip, our approach reasons over object features to model spatio-temporal relationships and predicts accurate future object states. In this paper, we successfully apply SlotFormer to perform video prediction on datasets with complex object interactions. Moreover, SlotFormer's unsupervised dynamics model can be used to improve performance on supervised downstream tasks, such as Visual Question Answering (VQA) and goal-conditioned planning. Compared to past works on dynamics modeling, our method achieves significantly better long-term synthesis of object dynamics, while retaining high-quality visual generation. In addition, SlotFormer enables VQA models to reason about the future without object-level labels, even outperforming counterparts that use ground-truth annotations. Finally, we show its ability to serve as a world model for model-based planning, where it is competitive with methods designed specifically for such tasks. Additional results and details are available on our website.

1. INTRODUCTION

The ability to understand complex systems and the interactions between their elements is a key component of intelligent systems. Learning the dynamics of a multi-object system from visual observations entails capturing object instances, their appearance, position, and motion, and simulating their spatio-temporal interactions. Both in robotics (Finn et al., 2016; Lee et al., 2018) and computer vision (Shi et al., 2015; Wang et al., 2017), unsupervised learning of dynamics has been a central problem due to its important practical implications. Obtaining a faithful dynamics model of the environment enables future prediction and planning and, crucially, allows transferring the dynamics knowledge to improve downstream supervised tasks, such as visual reasoning (Chen et al., 2020b; Ding et al., 2021b), planning (Sun et al., 2022), and model-based control (Micheli et al., 2022). Yet, an effective domain-independent approach for unsupervised visual dynamics learning from video remains elusive.

One approach to visual dynamics modeling is to frame it as a prediction problem directly in the pixel space (Shi et al., 2015; Wang et al., 2017; Denton & Fergus, 2018). This paradigm builds on global frame-level representations, and uses dense feature maps of past frames to predict future features. By design, such models are object-agnostic, treating background and foreground modeling equally. This frequently results in poorly learned object dynamics, producing unrealistic future predictions over longer horizons (Oprea et al., 2020).

Another perspective on dynamics learning is through object-centric dynamics models (Kosiorek et al., 2018; van Steenkiste et al., 2018; Kossen et al., 2019). This class of methods first represents a scene as a set of object-centric features (a.k.a. slots), and then learns the interactions among the slots to model scene dynamics. It allows for more natural dynamics modeling and leads to more faithful simulation (Veerapaneni et al., 2020; Zoran et al., 2021).
To achieve this goal, earlier object-centric models bake strong scene (Jiang et al., 2019) or object (Lin et al., 2020) priors into their frameworks, while more recent methods (Kipf et al., 2020; Zoran et al., 2021) learn object interactions purely from data, with the aid of Graph Neural Networks (GNNs) (Battaglia et al., 2018) or Transformers (Vaswani et al., 2017). Yet, these approaches model the per-frame object interactions and their temporal evolution independently, using different networks. This suggests that a simpler and more effective dynamics model is yet to be designed.

In this work, we argue that effectively learning a system's dynamics from video requires two key components: i) strong unsupervised object-centric representations (to capture objects in each frame) and ii) a powerful dynamics module (to simulate spatio-temporal interactions between the objects). To this end, we propose SlotFormer: an elegant and effective Transformer-based object-centric dynamics model, which builds upon object-centric features (Kipf et al., 2022; Singh et al., 2022) and requires no human supervision. We treat dynamics modeling as a sequence learning problem: given a sequence of input images, SlotFormer takes in the object-centric representations extracted from these frames and predicts the object features at future steps. By conditioning on multiple frames, our method captures spatial and temporal object relationships simultaneously, ensuring consistency of object properties and motion in the synthesized frames.

We evaluate SlotFormer on four video datasets featuring diverse object dynamics. Our method not only presents competitive results on standard video prediction metrics, but also achieves significant gains on object-aware metrics in the long range.
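The autoregressive design described above — jointly attending over slots from multiple past frames and feeding predicted slots back in to roll out further — can be sketched as follows. This is a minimal, illustrative PyTorch implementation, not the exact architecture from the paper; the module names, layer sizes, the shared temporal positional embedding, and the last-frame readout are all our own simplifications.

```python
import torch
import torch.nn as nn


class SlotRollout(nn.Module):
    """Illustrative sketch: a Transformer that takes slots from T burn-in
    frames (N slots of dimension D per frame) and predicts the slots at the
    next step; rollout() feeds predictions back autoregressively."""

    def __init__(self, slot_dim=128, num_slots=6, context_len=6,
                 d_model=256, nhead=8, num_layers=4):
        super().__init__()
        self.in_proj = nn.Linear(slot_dim, d_model)
        # one learned temporal embedding per timestep, shared across slots,
        # so slots from the same frame are tagged with the same time code
        self.time_emb = nn.Parameter(torch.zeros(context_len, 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.out_proj = nn.Linear(d_model, slot_dim)
        self.num_slots = num_slots

    def forward(self, slots):                    # slots: (B, T, N, D)
        B, T, N, _ = slots.shape
        x = self.in_proj(slots) + self.time_emb[:T]   # add temporal code
        x = x.reshape(B, T * N, -1)              # flatten time and slot axes
        x = self.encoder(x)                      # joint spatio-temporal attention
        # read out next-step slots from the last frame's slot tokens
        return self.out_proj(x[:, -N:])          # (B, N, D)

    @torch.no_grad()
    def rollout(self, slots, steps):
        context = self.time_emb.shape[0]
        preds = []
        for _ in range(steps):
            nxt = self.forward(slots[:, -context:])        # predict next slots
            preds.append(nxt)
            slots = torch.cat([slots, nxt[:, None]], dim=1)  # feed back
        return torch.stack(preds, dim=1)         # (B, steps, N, D)
```

The key point this sketch illustrates is that a single Transformer attends over all slots from all context frames at once, so per-frame interactions and temporal evolution are handled by one module rather than two separate networks.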
Crucially, we demonstrate that SlotFormer's unsupervised dynamics knowledge can be successfully transferred to downstream supervised tasks (e.g., VQA and goal-conditioned planning) to improve their performance "for free". In summary, this work makes the following contributions:
1. SlotFormer: a Transformer-based model for object-centric visual simulation;
2. SlotFormer achieves state-of-the-art performance on two video prediction datasets, with a significant advantage in modeling long-term dynamics;
3. When equipped with a corresponding task-specific readout module, SlotFormer achieves state-of-the-art results on two VQA datasets and competitive results on one planning task.

2. RELATED WORK

In this section, we provide a brief overview of related work on physical reasoning, object-centric models, and Transformers, which is further expanded in Appendix A.

Dynamics modeling and intuitive physics. Video prediction methods treat dynamics modeling as an image translation problem (Shi et al., 2015; Wang et al., 2017; Denton & Fergus, 2018; Lee et al., 2018), and model changes in the pixel space. However, methods that model dynamics using global image-level features usually struggle with long-horizon predictions. Some approaches leverage local priors (Finn et al., 2016; Ebert et al., 2017) or extra input information (Walker et al., 2016; Villegas et al., 2017), which only help in the short term. More recent works improve visual dynamics modeling using explicit object-centric representations. Several works directly learn deep models in the abstracted state space of objects (Wu et al., 2015; Battaglia et al., 2016; Fragkiadaki et al., 2016; Chang et al., 2016). However, they require ground-truth physical properties for training, which is unrealistic for visual dynamics simulation. Instead, recent works use object features from a supervised detector as the base representation for visual simulation (Ye et al., 2019; Li et al., 2019; Qi et al., 2020; Yu et al., 2022), with a GNN-based dynamics model. In contrast to the above works, our model is completely unsupervised; SlotFormer belongs to the class of models that learn both object discovery and scene dynamics without supervision. We review this class of models below.

Unsupervised object-centric representation learning from videos. Our work builds upon recent efforts in decomposing raw videos into temporally aligned slots (Kipf et al., 2022; Kabra et al., 2021; Singh et al., 2022). Earlier works often make strong assumptions on the underlying object representations. Jiang et al. (2019) explicitly decompose the scene into foreground and background to apply fixed object size and presence priors. Lin et al. (2020) further disentangle object features to represent object positions, depth, and semantic attributes separately. Some methods leverage the power of GNNs or Transformers to eliminate these domain-specific priors (Goyal et al., 2019; 2021; Veerapaneni et al., 2020; van Steenkiste et al., 2018; Creswell et al., 2021; Zoran et al., 2021). However, they still model the object interactions and temporal dynamics with separate modules, and set the context window of the recurrent dynamics module to only a single timestep. The most relevant work to ours is OCVT (Wu et al., 2021), which also applies Transformers to slots from multiple frames. However, OCVT utilizes manually disentangled object features, and needs Hungarian matching for latent alignment during training. Therefore, it still underperforms RNN-based baselines in the video prediction task. In contrast, SlotFormer is a general Transformer-based dynamics model which is ag-

