UNSUPERVISED VIDEO DECOMPOSITION USING SPATIO-TEMPORAL ITERATIVE INFERENCE

Anonymous

Abstract

Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress on static scenes, such models fail to leverage the important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by incorporating a 2D-LSTM and temporally conditioned inference and generation into the iterative amortized inference procedure for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects' dynamics, and can predict the trajectory of each object separately. Additionally, we show that our model achieves high accuracy even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state of the art on several benchmark datasets, one of which was curated for this work and will be made publicly available.

1. INTRODUCTION

Unsupervised representation learning, which has a long history dating back to Boltzmann Machines (Hinton & Sejnowski, 1986) and the original works of Marr (1970), has recently emerged as one of the important directions of research, carrying the newfound promise of alleviating the need for excessively large and fully labeled datasets. More traditional representation learning approaches focus on unsupervised (e.g., autoencoder-based (Pathak et al., 2016; Vincent et al., 2008)) or self-supervised (Noroozi & Favaro, 2016; Vondrick et al., 2016; Zhang et al., 2016) learning of holistic representations that, for example, are tasked with producing spatial (Noroozi & Favaro, 2016), temporal (Vondrick et al., 2016), or color (Zhang et al., 2016) encodings of images or patches. The latest and most successful methods along these lines include ViLBERT (Lu et al., 2019) and others (Sun et al., 2019; Tan & Bansal, 2019) that utilize powerful transformer architectures (Vaswani et al., 2017) coupled with proxy multi-modal tasks (e.g., masked token prediction or visual-linguistic alignment). However, learning good disentangled, spatially granular representations that are, for example, able to decouple object appearance and shape in complex visual scenes consisting of multiple moving objects remains elusive. Recent works that attempt to address this challenge can be characterized as: (i) attention-based methods (Crawford & Pineau, 2019b; Eslami et al., 2016), which infer latent representations for each object in a scene, and (ii) iterative refinement models (Greff et al., 2019; 2017), which decompose a scene into a collection of components by grouping pixels. Importantly, the former have been limited to latent representations at the object or image-patch level, while the latter class of models has demonstrated the ability to produce more granular latent representations at the pixel (segmentation) level.
Specifically, most refinement models learn pixel-level generative models driven by spatial mixtures (Greff et al., 2017) and employ iterative amortized inference (Marino et al., 2018) to infer disentangled latent representations within the VAE framework (Kingma & Welling, 2014); a prime example is IODINE (Greff et al., 2019). However, while providing a powerful model and abstraction able to segment and disentangle complex scenes, IODINE (Greff et al., 2019) and other similar architectures are fundamentally limited by the fact that they only consider images. Even when applied to video, they process one frame at a time. This makes it excessively challenging to discover and represent individual instances of objects that may share properties such as appearance and shape but differ in dynamics.
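To make the iterative-refinement idea concrete, the following toy sketch illustrates the loop structure of iterative amortized inference over a spatial mixture. It is not the model described here or IODINE itself: the linear per-slot decoder, the softmax masks derived from the slot outputs, and the finite-difference gradient step (standing in for a learned refinement network) are all illustrative assumptions. Posterior means for K slots are repeatedly refined by descending a simple negative-ELBO objective:

```python
import numpy as np

rng = np.random.default_rng(0)

K, D, P = 3, 4, 16           # slots, latent dimension, pixels (toy sizes)
W = rng.normal(size=(D, P))  # hypothetical fixed linear "decoder", shared by slots
x = rng.normal(size=P)       # observed image, flattened

def decode(lam):
    """Per-slot reconstructions and softmax mixing masks (spatial mixture)."""
    comps = lam @ W                               # (K, P) slot-wise outputs
    masks = np.exp(comps) / np.exp(comps).sum(0)  # (K, P) pixel-wise mixing weights
    return comps, masks

def neg_elbo(lam):
    """Squared reconstruction error plus a unit-Gaussian prior term."""
    comps, masks = decode(lam)
    recon = (masks * comps).sum(0)                # mixture reconstruction of x
    return 0.5 * ((x - recon) ** 2).sum() + 0.5 * (lam ** 2).sum()

def grad(lam, eps=1e-5):
    """Finite-difference gradient: a stand-in for the learned refinement network."""
    g = np.zeros_like(lam)
    it = np.nditer(lam, flags=["multi_index"])
    for _ in it:
        d = np.zeros_like(lam)
        d[it.multi_index] = eps
        g[it.multi_index] = (neg_elbo(lam + d) - neg_elbo(lam - d)) / (2 * eps)
    return g

lam = np.zeros((K, D))       # posterior means, refined in place across iterations
losses = [neg_elbo(lam)]
for _ in range(30):          # iterative refinement steps
    lam -= 0.02 * grad(lam)  # gradient step in place of the learned refiner
    losses.append(neg_elbo(lam))

print(losses[0], losses[-1])  # the refinement loop drives the loss down
```

In IODINE the gradient of the ELBO is fed, together with other auxiliary inputs, into a recurrent refinement network that outputs the posterior update; the plain gradient step above only mirrors the outer loop, not the learned update rule.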

