UNSUPERVISED VIDEO DECOMPOSITION USING SPATIO-TEMPORAL ITERATIVE INFERENCE

Anonymous

Abstract

Unsupervised multi-object scene decomposition is a fast-emerging problem in representation learning. Despite significant progress on static scenes, existing models are unable to leverage important dynamic cues present in video. We propose a novel spatio-temporal iterative inference framework that is powerful enough to jointly model complex multi-object representations and explicit temporal dependencies between latent variables across frames. This is achieved by leveraging a 2D-LSTM and temporally conditioned inference and generation within iterative amortized inference for posterior refinement. Our method improves the overall quality of decompositions, encodes information about the objects' dynamics, and can be used to predict the trajectory of each object separately. Additionally, we show that our model remains highly accurate even without color information. We demonstrate the decomposition, segmentation, and prediction capabilities of our model and show that it outperforms the state-of-the-art on several benchmark datasets, one of which was curated for this work and will be made publicly available.

1. INTRODUCTION

Unsupervised representation learning, which has a long history dating back to Boltzmann Machines (Hinton & Sejnowski, 1986) and the original works of Marr (1970), has recently emerged as one of the important directions of research, carrying the newfound promise of alleviating the need for excessively large and fully labeled datasets. More traditional representation learning approaches focus on unsupervised (e.g., autoencoder-based (Pathak et al., 2016; Vincent et al., 2008)) or self-supervised (Noroozi & Favaro, 2016; Vondrick et al., 2016; Zhang et al., 2016) learning of holistic representations that, for example, are tasked with producing spatial (Noroozi & Favaro, 2016), temporal (Vondrick et al., 2016), or color (Zhang et al., 2016) encodings of images or patches. The latest and most successful methods along these lines include ViLBERT (Lu et al., 2019) and others (Sun et al., 2019; Tan & Bansal, 2019) that utilize powerful transformer architectures (Vaswani et al., 2017) coupled with proxy multi-modal tasks (e.g., masked token prediction or visual-linguistic alignment). Learning good disentangled, spatially granular representations that are, for example, able to decouple object appearance and shape in complex visual scenes consisting of multiple moving objects remains elusive. Recent works that attempt to address this challenge can be characterized as: (i) attention-based methods (Crawford & Pineau, 2019b; Eslami et al., 2016), which infer latent representations for each object in a scene, and (ii) iterative refinement models (Greff et al., 2019; 2017), which decompose a scene into a collection of components by grouping pixels. Importantly, the former have been limited to latent representations at the object or image-patch level, while the latter class of models has demonstrated more granular latent representations at the pixel (segmentation) level.
Specifically, most refinement models learn pixel-level generative models driven by spatial mixtures (Greff et al., 2017) and utilize amortized iterative refinement (Marino et al., 2018) to infer disentangled latent representations within the VAE framework (Kingma & Welling, 2014); a prime example is IODINE (Greff et al., 2019). However, while providing a powerful model and abstraction able to segment and disentangle complex scenes, IODINE (Greff et al., 2019) and other similar architectures are fundamentally limited by the fact that they only consider images. Even when applied for inference in video, they process one frame at a time. This makes it excessively challenging to discover and represent individual instances of objects that may share properties such as appearance and shape but differ in their dynamics.

Figure 1: Unsupervised Video Decomposition. Our approach infers precise segmentations of the objects via interpretable latent representations, which can be used to decompose each frame and simulate the future dynamics, all in an unsupervised fashion. Whenever a new object emerges into a frame, the model dynamically adapts and assigns one of the segmentation slots to the new object.

In computer vision, it has been a long-held belief that motion carries important information for segmenting objects (Jepson et al., 2002; Weiss & Adelson, 1996). Armed with this intuition, we propose a spatio-temporal amortized inference model capable not only of unsupervised multi-object scene decomposition, but also of learning and leveraging the implicit probabilistic dynamics of each object from raw video alone. This is achieved by introducing temporal dependencies between the latent variables across time. As such, IODINE (Greff et al., 2019) can be considered a special (spatial) case of our spatio-temporal formulation.
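To make the structure of this idea concrete, the following is a minimal, self-contained sketch of iterative amortized inference extended across time: posterior parameters are refined for several steps within each frame and then carried over to the next frame, so inference at time t is conditioned on the result at t-1. All names (`refine`, `elbo_grad`), dimensions, and the gradient stand-in are illustrative placeholders, not the actual learned networks of the proposed model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only): T frames, K refinement steps,
# S object slots, D latent dimensions per slot.
T, K, S, D = 4, 3, 2, 8

def refine(lam, grad):
    # Stand-in for the learned refinement network: a small gradient-style
    # update of the posterior parameters. The real model uses recurrent
    # (LSTM-based) refinement instead of this fixed step.
    return lam - 0.1 * grad

def elbo_grad(lam, frame):
    # Placeholder for d(ELBO)/d(lambda); here just a smooth function of
    # the current estimate and the frame so the loop is runnable.
    return lam - frame.mean()

lam = np.zeros((S, D))                     # posterior parameters per slot
for t in range(T):                         # temporal loop: lam is carried
    frame = rng.normal(size=(16, 16))      # over, conditioning t on t-1
    for k in range(K):                     # iterative refinement per frame
        lam = refine(lam, elbo_grad(lam, frame))

print(lam.shape)  # one refined latent distribution per slot
```

In a purely per-frame model such as IODINE, `lam` would be reset at every frame; keeping it across the temporal loop is the structural difference the spatio-temporal formulation introduces.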
Modeling temporal dependencies among video frames also allows us to make use of conditional priors (Chung et al., 2015) for variational inference, leading to more accurate and efficient inference. The resulting model, illustrated in Fig. 1, achieves superior performance on complex multi-object benchmark datasets (Bouncing Balls and CLEVRER) compared to state-of-the-art models, including R-NEM (Van Steenkiste et al., 2018) and IODINE (Greff et al., 2019), in terms of segmentation, prediction, and generalization. Our model has a number of appealing properties, including temporal extrapolation, computational efficiency, and the ability to work with complex data exhibiting non-linear dynamics, color, and a changing number of objects within the same video sequence. In addition, we introduce an entropy prior to improve our model's performance in scenarios where object appearance alone is not sufficiently distinctive (e.g., greyscale data).
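The benefit of a conditional prior can be illustrated numerically: when the prior on z_t is predicted from z_{t-1} rather than fixed to a standard normal, it can sit much closer to the inferred posterior, reducing the KL term of the ELBO. The numbers below are toy values chosen for illustration, not outputs of the model.

```python
import numpy as np

def kl_diag_gauss(mu_q, var_q, mu_p, var_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians,
    # summed over latent dimensions.
    return 0.5 * np.sum(
        np.log(var_p / var_q) + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Posterior at time t, inferred from frames up to t (toy numbers).
mu_q, var_q = np.array([0.9, -0.4]), np.array([0.3, 0.3])

# Fixed standard-normal prior, as in a per-frame VAE such as IODINE.
kl_fixed = kl_diag_gauss(mu_q, var_q, np.zeros(2), np.ones(2))

# Conditional prior p(z_t | z_{t-1}): centered on a prediction from the
# previous latent, so it can track the posterior.
mu_p, var_p = np.array([1.0, -0.5]), np.array([0.4, 0.4])
kl_cond = kl_diag_gauss(mu_q, var_q, mu_p, var_p)

print(kl_cond < kl_fixed)  # True: the conditional prior pays a smaller KL cost
```

A smaller KL term leaves more of the ELBO budget for reconstruction, which is one way to view the accuracy gains from temporally conditioned inference.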

2. RELATED WORK

Unsupervised Scene Representation Learning. Unsupervised scene representation learning can generally be divided into two groups: attention-based methods, which infer latent representations for each object in a scene, and more complex and powerful iterative refinement models, which often make use of spatial mixtures and can decompose a scene into a collection of estimated components by grouping pixels together. Attention-based methods, such as AIR (Eslami et al., 2016; Xu et al., 2019) and SPAIR (Crawford & Pineau, 2019b), decompose scenes into latent variables representing the appearance, position, and size of the underlying objects. However, both methods can only infer the objects' bounding boxes and have not been shown to work on non-trivial 3D scenes with perspective distortions and occlusions. MONet (Burgess et al., 2019) is the first model in this family tackling more complex data and inferring representations that can be used for instance segmentation of objects. On the other hand, it is not a probabilistic generative model and thus not suitable for density estimation. GENESIS (Engelcke et al., 2020) extends it and alleviates some of its limitations by introducing a probabilistic framework and allowing for spatial relations between the objects. DDPAE (Hsieh et al., 2018) is a framework that uses structured probabilistic models to decompose a video into low-dimensional temporal dynamics with the sole purpose of prediction. It has only been shown to operate on binary scenes with no perspective distortion and is not capable of generating per-object segmentation masks.

Iterative refinement models started with Tagger (Greff et al., 2016), which reasons about the segmentation of its inputs. However, it does not allow explicit latent representations and cannot be scaled to more complex images. NEM (Greff et al., 2017), an extension of Tagger, uses a spatial mixture model inside an expectation-maximization framework, but is limited to binary data.
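The spatial mixture formulation underlying this family of models can be sketched in a few lines: each slot predicts a per-pixel appearance and a mask logit, the masks are normalized with a softmax across slots, and the image is reconstructed as the mask-weighted sum of slot appearances. The dimensions and random inputs below are illustrative stand-ins for the decoder outputs of an actual model.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy spatial mixture over S slots for an H x W image with C channels.
S, H, W, C = 3, 8, 8, 3
means = rng.uniform(size=(S, H, W, C))    # per-slot predicted appearance
logits = rng.normal(size=(S, H, W, 1))    # per-slot mask logits

# Softmax across slots gives per-pixel mixing weights (masks):
# every pixel's weights sum to 1, so each pixel is softly assigned
# to the competing slots.
masks = np.exp(logits) / np.exp(logits).sum(axis=0, keepdims=True)

# The reconstruction is the mask-weighted sum of slot appearances.
recon = (masks * means).sum(axis=0)

print(recon.shape, np.allclose(masks.sum(axis=0), 1.0))
```

Taking the argmax of `masks` over the slot axis yields the per-pixel instance segmentation that models such as NEM and IODINE are evaluated on.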
Finally, IODINE (Greff et al., 2019) is a notable example of a model employing iterative amortized inference w.r.t. a spatial mixture formulation, achieving state-of-the-art performance in scene decomposition and segmentation.

Unsupervised Video Tracking and Object Detection. SQAIR (Kosiorek et al., 2018), SILOT (Crawford & Pineau, 2019a), and SCALOR (Jiang et al., 2020) are temporal extensions

