LEARNING 3D VISUAL INTUITIVE PHYSICS FOR FLUIDS, RIGID BODIES, AND GRANULAR MATERIALS

Abstract

Given a visual scene, humans have strong intuitions about how the scene can evolve over time under given actions. This intuition, often termed visual intuitive physics, is a critical ability that allows us to make effective plans to manipulate the scene to achieve desired outcomes without relying on extensive trial and error. In this paper, we present a framework capable of learning 3D-grounded visual intuitive physics models purely from unlabeled images. Our method is composed of a conditional Neural Radiance Field (NeRF)-style visual frontend and a 3D point-based dynamics prediction backend, in which we impose strong relational and structural inductive biases to capture the structure of the underlying environment. Unlike existing intuitive point-based dynamics works that rely on supervision from dense point trajectories produced by simulators, we relax this requirement and only assume access to multi-view RGB images and (imperfect) instance masks. This enables the proposed model to handle scenarios where accurate point estimation and tracking are hard or impossible. We evaluate the model on three challenging scenarios involving fluids, granular materials, and rigid objects, where standard detection and tracking methods are not applicable. We show our model can make long-horizon future predictions by learning from raw images and significantly outperforms models that do not employ an explicit 3D representation space. We also show that, once trained, our model achieves strong generalization in complex scenarios under extrapolation settings.

1. INTRODUCTION

Humans can achieve a strong intuitive understanding of the 3D physical world around us simply from visual perception (Baillargeon et al., 1985; Battaglia et al., 2013; Spelke, 1990; Smith et al., 2019; Sanborn et al., 2013; Carey & Xu, 2001). As we constantly make physical interactions with the environment, this intuitive physical understanding applies to objects of a wide variety of materials (Bates et al., 2018; Ullman et al., 2019). For example, after watching videos of water pouring and doing the task ourselves, we can develop a mental model of the interaction process and predict how the water will move when we apply actions like tilting or shaking the cup (Figure 1). The ability to predict the future evolution of the physical environment is extremely useful for humans to plan our behavior and perform everyday manipulation tasks. It is thus desirable to develop computational tools that learn 3D-grounded models of the world purely from visual observations and that generalize to objects with complicated physical properties like fluids and granular materials. There has been a series of works on learning intuitive physics models of the environment from data.
However, most existing work either focuses on 2D environments (Watter et al., 2015; Agrawal et al., 2016; Fragkiadaki et al., 2016; Xu et al., 2019; Finn & Levine, 2017; Babaeizadeh et al., 2021; Qi et al., 2021; Kipf et al., 2019; Ye et al., 2019b; Hafner et al., 2019b; a; Schrittwieser et al., 2020; Li et al., 2016; Finn et al., 2016; Lerer et al., 2016; Veerapaneni et al., 2019; Girdhar et al., 2020; Chang et al., 2016; Xue et al., 2016) or has to make strong assumptions about the accessible information of the underlying environment (Li et al., 2019a; 2020; Sanchez-Gonzalez et al., 2020; Pfaff et al., 2021; Zhang et al., 2016; Tacchetti et al., 2018; Sanchez-Gonzalez et al., 2018; Battaglia et al., 2018; Ajay et al., 2019; Janner et al., 2019a) (e.g., full-state information of the fluids represented as points). These limitations prevent their use in tasks requiring an explicit 3D understanding of the environment and make it hard to extend them to more complicated real-world settings where only visual observations are available. There are works aiming to address this issue by learning a 3D-grounded representation of the environment and modeling the dynamics in a latent vector space (Li et al., 2021b;a). However, these models typically encode the entire scene into one single vector.
Such a design does not capture the structure of the underlying systems, limiting generalization to compositional systems or systems of different sizes (e.g., unseen container shapes or different numbers of floating ice cubes).

In this work, we propose 3D Visual Intuitive Physics (3D-IntPhys), a framework that learns intuitive physics models of the environment with explicit 3D and compositional structures, purely from visual observations. Specifically, the model consists of (1) a perception module based on conditional Neural Radiance Fields (NeRF) (Mildenhall et al., 2020; Yu et al., 2021) that transforms the input images and instance masks into 3D point representations, and (2) a dynamics module instantiated as graph neural networks that models the interactions between the points and predicts their evolution over time.

Despite advances in graph-based dynamics networks (Sanchez-Gonzalez et al., 2020; Li et al., 2019a), existing methods require strong supervision in the form of 3D ground-truth point trajectories, which are hard to obtain in most real setups. To tackle this problem, we train the dynamics model using (1) a distribution-based loss function measuring the difference between the predicted point sets and the actual point distributions at future timesteps, and (2) a spacing loss to avoid degenerate point set predictions. Our perception module learns spatially-equivariant representations of the environment grounded in 3D space, which are then transformed into points as a flexible representation describing the system's state. Our dynamics module treats the point set as a graph and exploits the compositional structure of the point system.
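To make the graph-based dynamics concrete, the following is a minimal NumPy sketch of one prediction step over a point set treated as a graph. The function name, the connectivity radius, and the fixed linear message/update maps are illustrative assumptions; in the actual model these functions are learned graph neural networks.

```python
import numpy as np

def dynamics_step(points, velocities, radius=0.1, dt=0.01):
    """One message-passing step over a point set treated as a graph.

    Edges connect points within `radius`; each point aggregates
    relative-displacement messages from its neighbors and updates its
    velocity before integration. A learned model would parameterize the
    message and update functions with neural networks; here they are
    fixed maps for illustration only.
    """
    n = len(points)
    rel = points[None, :, :] - points[:, None, :]    # rel[i, j] = p_j - p_i
    dist = np.linalg.norm(rel, axis=-1)
    adj = (dist < radius) & ~np.eye(n, dtype=bool)   # neighborhood graph
    # Message: mean relative displacement from neighbors (zero if isolated).
    counts = np.maximum(adj.sum(axis=1, keepdims=True), 1)
    messages = (rel * adj[..., None]).sum(axis=1) / counts
    # Update: nudge velocities by the aggregated messages, then integrate.
    new_velocities = velocities + messages
    new_points = points + dt * new_velocities
    return new_points, new_velocities
```

Because every operation is defined on relative displacements within local neighborhoods, the same step applies unchanged to point sets of any size, which is the compositional property the text refers to.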
These structures allow the model to capture the compositionality of the underlying environment, handle systems involving objects with complicated physical properties (e.g., fluids and granular materials), and generalize under extrapolation; our experiments show it greatly outperforms various baselines that lack a structured 3D representation space.
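The two training objectives described above can be sketched as follows. A Chamfer-style set distance is a common choice for comparing point sets without correspondences, and a repulsion term is a common way to keep predicted points from collapsing; the function names, the margin value, and the exact weighting here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets.

    pred: (N, 3), target: (M, 3). Compares point *distributions*
    without requiring point-to-point correspondence or tracking,
    which is what makes simulator-free supervision possible.
    """
    # Pairwise squared distances, shape (N, M).
    d2 = np.sum((pred[:, None, :] - target[None, :, :]) ** 2, axis=-1)
    # Nearest target for each prediction, and vice versa.
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

def spacing_loss(pred, margin=0.05):
    """Penalize predicted points closer together than `margin`,
    discouraging degenerate (collapsed) point-set predictions."""
    d2 = np.sum((pred[:, None, :] - pred[None, :, :]) ** 2, axis=-1)
    d = np.sqrt(d2 + np.eye(len(pred)))  # mask out self-distances
    return np.clip(margin - d, 0.0, None).mean()
```

The distribution term drives predicted points toward the observed point cloud at each future timestep, while the spacing term acts only when points crowd within the margin, so a well-spread prediction incurs no penalty from it.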

2. RELATED WORK

Visual dynamics learning. Existing works learn to predict object motions from pixels using frame-centric features (Agrawal et al., 2016; Finn & Levine, 2017; Babaeizadeh et al., 2021; Hafner et al., 2019b; a; Suh & Tedrake, 2020; Lee et al., 2018; Vondrick et al., 2015; Zhang et al., 2018; Burda et al., 2019; Hafner et al., 2019c; Wu et al., 2021) or object-centric features (Fragkiadaki et al., 2016; Watters et al., 2017; Kipf et al., 2019; Qi et al., 2021; Janner et al., 2019b; Veerapaneni et al., 2019; Ding et al., 2020; Girdhar et al., 2020; Riochet et al., 2020; Ye et al., 2019a); yet most works only demonstrate learning in 2D scenes with objects moving on a 2D plane. We argue that one reason these methods are hard to apply to general 3D visual scenes is that they often operate on view-dependent features, which can change dramatically with the camera viewpoint even though the viewpoint has no effect on the actual motion of the objects. Recent



Figure 1: Visual Intuitive Physics Grounded in 3D Space. Humans have a strong intuitive understanding of the physical environment. We can predict how the environment would evolve when applying specific actions. This ability is rooted in our understanding of 3D and applies to objects of diverse materials, which is essential when planning our behavior to achieve specific goals. In this work, we leverage a combination of implicit neural representation and particle representation to build 3D-grounded visual intuitive physics models of the world that apply to objects with complicated physical properties, such as fluids, rigid objects, and granular materials.

