FORWARD PREDICTION FOR PHYSICAL REASONING

Abstract

Physical reasoning requires forward prediction: the ability to forecast what will happen next given some initial world state. We study the performance of state-of-the-art forward-prediction models in the complex physical-reasoning tasks of the PHYRE benchmark (Bakhtin et al., 2019). We do so by incorporating models that operate on object-based or pixel-based representations of the world into simple physical-reasoning agents. We find that forward-prediction models can improve physical-reasoning performance, particularly on complex tasks that involve many objects. However, we also find that these improvements are contingent on the test tasks being small variations of train tasks, and that generalization to completely new task templates is challenging. Surprisingly, we observe that forward predictors with better pixel accuracy do not necessarily lead to better physical-reasoning performance. Nevertheless, our best models set a new state-of-the-art on the PHYRE benchmark.

1. INTRODUCTION

When presented with a picture of a Rube Goldberg machine, we can predict how the machine works. We do so by using our intuitive understanding of concepts such as force, mass, energy, and collisions to imagine how the machine state would evolve once released. This ability allows us to solve real-world physical-reasoning tasks, such as how to strike a billiards ball such that it ends up in the pocket, or how to balance the weight of two children on a see-saw. In contrast, the physical-reasoning abilities of machine-learning models have largely been limited to closed domains such as predicting the dynamics of multi-body gravitational systems (Battaglia et al., 2016), the stability of block towers (Lerer et al., 2016), or the physical plausibility of observed dynamics (Riochet et al., 2018). In this work, we explore the use of imaginative, forward-prediction approaches to solve complex physical-reasoning puzzles. We study modern object-based (Battaglia et al., 2016; Sanchez-Gonzalez et al., 2020; Watters et al., 2017) and pixel-based (Finn et al., 2016; Ye et al., 2019; Hafner et al., 2020) forward-prediction models in simple search-based agents on the PHYRE benchmark (Bakhtin et al., 2019). PHYRE tasks involve placing one or two balls in a 2D world, such that the world reaches a state with a particular property (e.g., two balls are touching) after being played forward. PHYRE tasks are very challenging because small changes in the action (or the world) can have a very large effect on an action's efficacy; see Figure 1 for an example. Moreover, PHYRE tests models' ability to generalize to completely new physical environments at test time, a significantly harder task than prior work that mostly varies the number or properties of objects in the same environment. As a result, physical-reasoning agents may struggle even when their forward-prediction model works well. Nevertheless, our best agents substantially outperform the prior state-of-the-art on PHYRE.
Specifically, we find that forward-prediction models can improve the performance of physical-reasoning agents when the models are trained on tasks that are very similar to the tasks that need to be solved at test time. However, forward-prediction-based agents struggle to generalize to truly unseen tasks, presumably because small errors in forward predictions compound over time. We also observe that better forward prediction does not always lead to better physical-reasoning performance on PHYRE (cf. Buesing et al. (2018) for similar observations in RL). In particular, we find that object-based forward-prediction models make more accurate forward predictions, but pixel-based models are more helpful in physical reasoning. This observation may be the result of two key advantages of models that use pixel-based state representations. First, in fully observable 2D environments like PHYRE, it is easier to determine whether a task is solved from a pixel-based representation than from an object-based one. Second, pixel-based models facilitate end-to-end training of the forward-prediction model together with the physical-reasoning agent.

2. RELATED WORK

Our study builds on a large body of prior research on forward prediction and physical reasoning. We discuss the most closely related work in this section and report additional prior work in Appendix B. Forward-prediction models attempt to predict the future state of objects in the world based on observations of past states. Such models operate either on object-based (proprioceptive) state representations or on pixel-based ones. A popular class of object-based models uses graph neural networks to model interactions between objects (Kipf et al., 2018; Battaglia et al., 2016), for example, to simulate environments with thousands of particles (Sanchez-Gonzalez et al., 2020; Li et al., 2019). Another class of object-based models explicitly represents the Hamiltonian or Lagrangian of the physical system (Greydanus et al., 2019; Cranmer et al., 2020; Chen et al., 2019).
While promising, such models are currently limited to simple point objects and physical systems that conserve energy. Hence, they cannot currently be used on PHYRE, which contains dissipative forces and extended objects. Modern pixel-based forward-prediction models extract state representations by applying a convolutional network to the observed frame(s) (Watters et al., 2017; Kipf et al., 2020) or to object segments (Ye et al., 2019; Janner et al., 2019). These models perform forward prediction on the resulting state representation using graph neural networks (Kipf et al., 2020; Ye et al., 2019; Li et al., 2020), recurrent neural networks (Xingjian et al., 2015; Hochreiter & Schmidhuber, 1997; Cho et al., 2014; Finn et al., 2016), or a physics engine (Wu et al., 2017). The models can be trained to predict object state (Watters et al., 2017), perform pixel reconstruction (Villegas et al., 2017; Ye et al., 2019), transform the previous frames (Ye et al., 2018; 2019; Finn et al., 2016), or produce a contrastive state representation (Kipf et al., 2020; Hafner et al., 2020).

Physical-reasoning tasks gauge a system's ability to intuitively reason about physical phenomena (Battaglia et al., 2013; Kubricht et al., 2017). Prior work has developed models that predict whether physical structures are stable (Lerer et al., 2016; Groth et al., 2018; Li et al., 2016), predict whether physical phenomena are plausible (Riochet et al., 2018), describe or answer questions about physical systems (Yi et al., 2020; Rajani et al., 2020), perform counterfactual prediction in physical worlds (Baradel et al., 2020), predict the effect of forces (Mottaghi et al., 2016; Wang et al., 2018), or solve physical puzzles and games (Allen et al., 2020; Bakhtin et al., 2019; Du & Narasimhan, 2019).
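To make the object-based, graph-network approach concrete, the following is a minimal sketch of one forward-prediction step in the style of an interaction network. It is not the architecture of any model cited above: the toy repulsive force stands in for the learned pairwise MLP, and the Euler state update stands in for a learned update function.

```python
import numpy as np

def interaction_step(positions, velocities, dt=0.01, strength=0.1):
    """One forward-prediction step in the style of an interaction network:
    pairwise effects between objects are computed, aggregated per object,
    and used to update each object's state.

    positions, velocities: (N, 2) arrays of 2D object states.
    The hand-coded repulsive "message" is a placeholder for a learned
    pairwise function in a real model.
    """
    n = positions.shape[0]
    diff = positions[:, None, :] - positions[None, :, :]   # (N, N, 2), p_i - p_j
    dist2 = (diff ** 2).sum(-1, keepdims=True) + 1e-6      # (N, N, 1)
    messages = strength * diff / dist2                     # pairwise effects
    idx = np.arange(n)
    messages[idx, idx] = 0.0                               # no self-interaction
    aggregated = messages.sum(axis=1)                      # (N, 2) per-object effect
    velocities = velocities + dt * aggregated              # update each object's state
    positions = positions + dt * velocities
    return positions, velocities
```

Rolling such a step forward repeatedly yields a multi-step prediction; in a learned model, prediction error at each step is what tends to compound over long horizons.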
Unlike other physical-reasoning tasks, physical-puzzle benchmarks such as PHYRE (Bakhtin et al., 2019) and Tools (Allen et al., 2020) incorporate a full physics simulator and contain a large set of physical environments with which to study generalization. This makes them particularly suitable for studying the effectiveness of forward prediction for physical reasoning, and we adopt the PHYRE benchmark in our study for that reason.

Inferring object representations from pixels involves techniques such as generative models and attention mechanisms that decompose scenes into objects (Eslami et al., 2016; Greff et al., 2019; Burgess et al., 2019; Engelcke et al., 2019). Many techniques also leverage motion information for better decomposition or to implicitly learn object dynamics (Kipf et al., 2020; van Steenkiste et al., 2018; Crawford & Pineau, 2020; Kosiorek et al., 2018). While such work is relevant to our exploration of pixel-based methods, we leverage the simplicity of the PHYRE visual world and extract object-like representations using a connected-component algorithm (cf. STN in Section 4.1). More sophisticated approaches could further improve performance, and would be especially useful in more visually complex and 3D environments.
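The connected-component extraction mentioned above can be sketched as follows. This is an illustrative implementation, not the paper's actual code: it assumes an (H, W) frame of integer color codes with 0 as background (PHYRE observations take roughly this form, but the exact encoding here is an assumption), and treats each 4-connected blob of same-colored pixels as one object.

```python
import numpy as np
from collections import deque

def extract_objects(frame):
    """Extract object-like entities from a PHYRE-style observation.

    frame: (H, W) integer array of color codes, 0 = background (assumed
    encoding, for illustration). Each 4-connected component of
    same-colored, non-background pixels becomes one object, summarized
    by its color, centroid, and pixel area.
    """
    h, w = frame.shape
    seen = np.zeros((h, w), dtype=bool)
    objects = []
    for y in range(h):
        for x in range(w):
            if frame[y, x] == 0 or seen[y, x]:
                continue
            color = frame[y, x]
            # Breadth-first flood fill over same-colored 4-neighbors.
            queue, pixels = deque([(y, x)]), []
            seen[y, x] = True
            while queue:
                cy, cx = queue.popleft()
                pixels.append((cy, cx))
                for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                    if 0 <= ny < h and 0 <= nx < w and not seen[ny, nx] \
                            and frame[ny, nx] == color:
                        seen[ny, nx] = True
                        queue.append((ny, nx))
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            objects.append({
                "color": int(color),
                "centroid": (sum(ys) / len(ys), sum(xs) / len(xs)),
                "area": len(pixels),
            })
    return objects
```

The resulting per-object summaries (color, centroid, area) are the kind of object-like representation that a downstream forward-prediction model can consume.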

3. PHYRE BENCHMARK

In PHYRE, each task consists of an initial state that is a 256 × 256 image. Colors indicate object properties; for instance, black objects are static whereas gray objects are dynamic, and neither is involved in the goal state. PHYRE defines two task tiers (B and 2B) that differ in their action space. An action involves placing one ball (in the B tier) or two balls (in the 2B tier) in the image. Balls are parameterized by their position and radius, which determines the ball's mass. An action solves the task if, when the world is played forward, the blue or purple object touches the green object (the goal state) for a minimum of three seconds.
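In a pixel-based representation, checking the PHYRE goal condition for a single frame reduces to a mask-adjacency test, which is part of why pixel-based agents can evaluate task success easily. The sketch below assumes hypothetical integer color codes (the real PHYRE palette differs) and checks one frame only; the benchmark requires the contact to persist for three seconds.

```python
import numpy as np

# Hypothetical color codes for illustration; PHYRE's actual encoding
# uses its own fixed palette of integer indices.
GREEN, BLUE, PURPLE = 2, 3, 4

def objects_touch(frame):
    """Single-frame, pixel-space check of the PHYRE-style goal condition:
    does a blue or purple object touch the green object?

    frame: (H, W) integer array of color codes. Two masks "touch" if the
    green mask overlaps the goal mask shifted by at most one pixel in a
    cardinal direction. Note: np.roll wraps around image borders, which
    is acceptable for a sketch; a careful check would pad instead.
    """
    goal = (frame == BLUE) | (frame == PURPLE)
    green = frame == GREEN
    shifts = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]
    for dy, dx in shifts:
        shifted = np.roll(np.roll(goal, dy, axis=0), dx, axis=1)
        if (shifted & green).any():
            return True
    return False
```

An object-based representation, by contrast, would need object shapes and a geometric contact test to answer the same question, which is the asymmetry noted in the introduction.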

