LEARNING 3D VISUAL INTUITIVE PHYSICS FOR FLUIDS, RIGID BODIES, AND GRANULAR MATERIALS

Abstract

Given a visual scene, humans have strong intuitions about how the scene can evolve over time under given actions. This intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene and achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models purely from unlabeled images. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, in which we impose strong relational and structural inductive biases to capture the structure of the underlying environment. Unlike existing point-based dynamics works that rely on supervision from dense point trajectories produced by simulators, we relax this requirement and only assume access to multi-view RGB images and (imperfect) instance masks. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We evaluate our model on three challenging scenarios involving fluids, granular materials, and rigid objects, where standard detection and tracking methods are not applicable. We show that our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that, once trained, our model achieves strong generalization in complex scenarios under extrapolation settings.
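To make the two-stage pipeline described above concrete, the sketch below illustrates one plausible reading of it: a conditional density field that turns an image-derived scene code into a 3D point set, followed by a point-based dynamics step with a relational (message-passing) inductive bias. All module names, shapes, thresholds, and hyperparameters here are illustrative assumptions for exposition, not the authors' actual architecture.

```python
# Minimal, hypothetical sketch (not the paper's implementation) of:
#   (1) a conditional NeRF-style frontend producing a 3D point set, and
#   (2) a point-based dynamics model with relational inductive bias.
import torch
import torch.nn as nn


class ConditionalDensityField(nn.Module):
    """Predicts occupancy at 3D query points, conditioned on an image-derived scene code."""

    def __init__(self, latent_dim=128, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3 + latent_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, xyz, z):
        # xyz: (N, 3) query points; z: (latent_dim,) scene code from an image encoder
        z_tiled = z.expand(xyz.shape[0], -1)
        return torch.sigmoid(self.mlp(torch.cat([xyz, z_tiled], dim=-1)))  # (N, 1)


class PointDynamics(nn.Module):
    """One message-passing step over a particle set, predicting per-point displacement."""

    def __init__(self, hidden=128):
        super().__init__()
        self.edge_mlp = nn.Sequential(nn.Linear(6, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.node_mlp = nn.Sequential(nn.Linear(hidden + 3, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, points, radius=0.1):
        # points: (N, 3); connect neighbors within `radius`, aggregate messages, predict motion
        dist = torch.cdist(points, points)                      # (N, N) pairwise distances
        adj = (dist < radius).float()
        adj.fill_diagonal_(0.0)                                 # no self-edges
        rel = points.unsqueeze(1) - points.unsqueeze(0)         # (N, N, 3) relative offsets
        edge_in = torch.cat([rel, points.unsqueeze(0).expand_as(rel)], dim=-1)
        messages = self.edge_mlp(edge_in) * adj.unsqueeze(-1)   # zero out non-edges
        agg = messages.sum(dim=1) / adj.sum(dim=1, keepdim=True).clamp(min=1.0)
        delta = self.node_mlp(torch.cat([agg, points], dim=-1))
        return points + delta                                   # predicted next-step positions


# Usage: sample a point set from the density field, then roll the dynamics forward.
field = ConditionalDensityField()
dynamics = PointDynamics()
z = torch.randn(128)                                # stand-in for an image-derived scene code
candidates = torch.rand(4096, 3) * 2 - 1            # uniform samples in a [-1, 1]^3 volume
occ = field(candidates, z).squeeze(-1)
points = candidates[occ.topk(1024).indices]         # keep the points deemed most occupied
for _ in range(10):                                 # long-horizon rollout
    points = dynamics(points)
```

A real system would train the frontend with a NeRF-style photometric loss on multi-view images and the dynamics model end-to-end on predicted point sets; the sketch only fixes the data flow between the two stages.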

1. INTRODUCTION

Humans can achieve a strong intuitive understanding of the 3D physical world around us simply from visual perception (Baillargeon et al., 1985; Battaglia et al., 2013; Spelke, 1990; Smith et al., 2019; Sanborn et al., 2013; Carey & Xu, 2001). Because we constantly interact physically with the environment, this intuitive physical understanding extends to objects of a wide variety of materials (Bates et al., 2018; Ullman et al., 2019). For example, after watching videos of water pouring and performing the task ourselves, we can develop a mental model of the interaction process and predict how the water will move when we apply actions like tilting or shaking the cup (Figure 1). The ability to predict the future evolution of the physical environment is extremely useful for planning our behavior and performing everyday manipulation tasks. It is thus desirable to develop computational tools that learn 3D-grounded models of the world purely from visual observations and generalize to objects with complex physical properties such as fluids and granular materials.

There has been a series of works on learning intuitive physics models of the environment from data. However, most existing work either focuses on 2D environments (Watter et al., 2015; Agrawal et al., 2016; Fragkiadaki et al., 2016; Xu et al., 2019; Finn & Levine, 2017; Babaeizadeh et al., 2021; Qi et al., 2021; Kipf et al., 2019; Ye et al., 2019b; Hafner et al., 2019b; a; Schrittwieser et al., 2020; Li et al., 2016; Finn et al., 2016; Lerer et al., 2016; Veerapaneni et al., 2019; Girdhar et al., 2020; Chang et al., 2016; Xue et al., 2016) or has to make strong assumptions about what information of the underlying environment is accessible (Li et al., 2019a; 2020; Sanchez-Gonzalez et al., 2020; Pfaff et al., 2021; Zhang et al., 2016; Tacchetti et al., 2018; Sanchez-Gonzalez et al., 2018; Battaglia et al., 2018; Ajay et al., 2019; Janner et al., 2019a) (e.g., full-state information of the fluids represented as points). These limitations prevent their use in tasks requiring an explicit 3D understanding of the environment and

