LEARNING LONG-TERM VISUAL DYNAMICS WITH REGION PROPOSAL INTERACTION NETWORKS

Abstract

Learning long-term dynamics models is the key to understanding physical common sense. Most existing approaches to learning dynamics from visual input sidestep long-term predictions by resorting to rapid re-planning with short-term models. This not only requires such models to be highly accurate but also limits them to tasks where an agent can continuously obtain feedback and take action at each step until completion. In this paper, we aim to leverage ideas from success stories in visual recognition tasks to build object representations that can capture inter-object and object-environment interactions over a long range. To this end, we propose Region Proposal Interaction Networks (RPIN), which reason about each object's trajectory in a latent region-proposal feature space. Thanks to the simple yet effective object representation, our approach outperforms prior methods by a significant margin both in prediction quality and in the ability to plan for downstream tasks, and it also generalizes well to novel environments. Code, pre-trained models, and more visualization results are available at our website.

1. INTRODUCTION

As argued by Kenneth Craik, if an organism carries a model of external reality and of its own possible actions within its head, it is able to react in a much fuller, safer, and more competent manner to the emergencies which face it (Craik, 1952). Indeed, building prediction models has long been studied in computer vision and intuitive physics. In computer vision, most approaches make predictions in pixel space (Denton & Fergus, 2018; Lee et al., 2018; Ebert et al., 2018b; Jayaraman et al., 2019; Walker et al., 2016), which ends up capturing optical flow (Walker et al., 2016) and is difficult to generalize to long horizons. In intuitive physics, a common approach is to learn the dynamics directly in an abstracted state space of objects in order to capture Newtonian physics (Battaglia et al., 2016; Chang et al., 2016; Sanchez-Gonzalez et al., 2020). However, the states end up detached from raw sensory perception. Unfortunately, these two extremes have barely been connected. In this paper, we argue for a middle ground that treats images as a window into the world, i.e., objects exist but can only be accessed via images. Images are neither to be used for predicting pixels nor to be isolated from dynamics. We operationalize this by learning to extract a rich state representation directly from images and building dynamics models on top of the extracted state representations.

"It is difficult to make predictions, especially about the future." — Niels Bohr

Contrary to Niels Bohr, predictions are, in fact, easy if made only for the short term. The predictions that are genuinely difficult to make, and that actually matter, are the ones made over the long term. Consider the example of "Three-cushion Billiards" in Figure 1. The goal is to hit the cue ball in such a way that it touches the other two balls and contacts the wall thrice before hitting the last ball. This task is extremely challenging even for human experts because successful trajectories are very sparse.
Do players perform classical Newtonian physics calculations to obtain the best action before each shot, or do they just memorize the solution by practicing through exponentially many configurations? Neither extreme is impossible, but both are often impractical. Instead, players build a physical understanding by experience (McCloskey, 1983; Kubricht et al., 2017) and plan by making intuitive, yet accurate, long-term predictions. Learning such a long-term prediction model is arguably the "Achilles' heel" of modern machine learning methods. Current approaches to learning the physical dynamics of the world cleverly side-step the long-term dependency by re-planning at each step via model-predictive control (MPC) (Allgöwer & Zheng, 2012; Camacho & Alba, 2013). The common practice is to train short-term dynamics models (usually 1-step) in a simulator. However, small errors in short-term predictions can accumulate over time in MPC. Hence, in this work, we focus primarily on the long-term aspect of prediction by considering only environments, such as the three-cushion billiards example or PHYRE (Bakhtin et al., 2019) in Figure 1, where an agent is allowed to take a single action at the beginning, so as to preclude any scope for re-planning.

How to learn an accurate dynamics model has been a popular research topic for years. Recently, a series of works has tried to represent video frames using object-centric representations (Battaglia et al., 2016; Watters et al., 2017; Chang et al., 2016; Janner et al., 2019; Ye et al., 2019; Kipf et al., 2020). However, those methods either operate in the state space or ignore the environment, neither of which is practical in real-world scenarios. In contrast, our objective is to build a data-driven prediction model that can both: (a) model long-term interactions over time to plan successfully for new instances, and (b) work from raw visual input in complex real-world environments. The question we therefore ask is: how can we extract such an effective and flexible object representation and perform long-term predictions? We propose Region Proposal Interaction Networks (RPIN), which contain two key components.
First, we leverage the Region of Interest pooling (RoIPooling) operator (Girshick, 2015) to extract object feature maps from the frame-level feature map. Object feature extraction based on region proposals has achieved huge success in computer vision (Girshick, 2015; He et al., 2017; Dai et al., 2017; Gkioxari et al., 2019), and yet it is surprisingly under-explored in the field of intuitive physics. With RoIPooling, each object feature contains not only the object's own information but also the context of the environment. Second, we extend the Interaction Network and propose Convolutional Interaction Networks, which perform interaction reasoning on the extracted RoI features. Interaction Networks were originally proposed in (Battaglia et al., 2016), where interaction reasoning is conducted via MLPs. By replacing the MLPs with convolutions, we can effectively exploit the spatial information of an object and make accurate predictions of future object locations and shape changes. Notably, our approach is simple, yet it outperforms state-of-the-art methods on both simulated and real datasets. In Section 5, we thoroughly evaluate our approach across four datasets to study scientific questions related to: a) prediction quality; b) generalization to time horizons longer than those seen in training; c) generalization to unseen configurations; and d) planning ability for downstream tasks. Our method reduces the prediction error by 75% in the complex PHYRE environment and achieves state-of-the-art performance on the PHYRE reasoning benchmark.
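To make the two components concrete, the sketch below illustrates the core mechanics in NumPy: cropping a box from a frame feature map and max-pooling it into a fixed-size object feature, and one pairwise interaction term computed as a 1x1 convolution (a per-location channel mixing) over the concatenated pair of object features. This is a minimal illustration only: the function names `roi_pool` and `interact`, the `(x0, y0, x1, y1)` box format, and the toy shapes are our assumptions, while the actual model operates on learned CNN features with bilinear RoI sampling and learned convolutional kernels.

```python
import numpy as np

def roi_pool(feature_map, box, output_size=2):
    """Max-pool the region given by `box` from a C x H x W feature map
    into a fixed C x output_size x output_size object feature."""
    x0, y0, x1, y1 = box  # box in feature-map coordinates (illustrative format)
    region = feature_map[:, y0:y1, x0:x1]
    c, h, w = region.shape
    out = np.zeros((c, output_size, output_size))
    # Split the region into an output_size x output_size grid of bins and
    # take the maximum activation inside each bin.
    ys = np.linspace(0, h, output_size + 1).astype(int)
    xs = np.linspace(0, w, output_size + 1).astype(int)
    for i in range(output_size):
        for j in range(output_size):
            out[:, i, j] = region[:, ys[i]:ys[i + 1], xs[j]:xs[j + 1]].max(axis=(1, 2))
    return out

def interact(feat_i, feat_j, w):
    """One pairwise interaction term via a 1x1 convolution: concatenate
    the two object features along channels, then mix channels with a
    per-location linear map `w` of shape (C, 2C)."""
    pair = np.concatenate([feat_i, feat_j], axis=0)  # (2C, k, k)
    return np.einsum('oc,chw->ohw', w, pair)         # (C, k, k)

# Toy example: a one-channel 8x8 "frame feature" and a 4x4 object box.
fmap = np.arange(64, dtype=float).reshape(1, 8, 8)
obj = roi_pool(fmap, box=(2, 2, 6, 6), output_size=2)
print(obj.shape)  # (1, 2, 2)

# Sanity check: averaging an object with itself (w = [0.5, 0.5]) is identity.
w = np.full((1, 2), 0.5)
print(np.allclose(interact(obj, obj, w), obj))  # True
```

Because the pooled object feature is cut from the shared frame feature map, its receptive field extends beyond the box itself, which is how environment context enters each object representation; the convolutional interaction then preserves the spatial layout that an MLP over flattened vectors would discard.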

2. RELATED WORK

Physical Reasoning and Intuitive Physics. Learning models that can predict the changing dynamics of a scene is the key to building physical common sense. Such models date back to "NeuroAnimator" (Grzeszczuk et al., 1998) for simulating articulated objects. Several methods in recent years have leveraged deep networks to build data-driven models of intuitive physics (Bhattacharyya et al., 2016; Ehrhardt et al., 2017; Fragkiadaki et al., 2015; Chang et al., 2016; Stewart & Ermon, 2017). However, these methods either require access to the underlying ground-truth state space or do not scale to long horizons due to the absence of interaction reasoning. A more generic



Figure 1: Two examples of long-term dynamics prediction tasks. Left: three-cushion billiards. Right: the PHYRE intuitive-physics benchmark (Bakhtin et al., 2019). Our proposed approach makes accurate long-term predictions that do not necessarily align with the ground truth but provide a strong signal for planning.

