∇Sim: DIFFERENTIABLE SIMULATION FOR SYSTEM IDENTIFICATION AND VISUOMOTOR CONTROL

Project page: https://gradsim.github.io

Abstract

Figure 1: ∇Sim is a unified differentiable rendering and multiphysics framework that allows solving a range of control and parameter estimation tasks (rigid bodies, deformable solids, and cloth) directly from images/video.

1. INTRODUCTION

Accurately predicting the dynamics and physical characteristics of objects from image sequences is a long-standing challenge in computer vision. This end-to-end reasoning task requires a fundamental understanding of both the underlying scene dynamics and the imaging process. Imagine watching a short video of a basketball bouncing off the ground and asking: "Can we infer the mass and elasticity of the ball, predict its trajectory, and make informed decisions, e.g., how to pass and shoot?" These seemingly simple questions are extremely challenging to answer even for modern computer vision models. The underlying physical attributes of objects and the system dynamics need to be modeled and estimated, all while accounting for the loss of information during 3D-to-2D image formation. Depending on the assumptions made about scene structure and dynamics, three types of solutions exist: black, grey, or white box. Black box methods (Watters et al., 2017; Xu et al., 2019b; Janner et al., 2019; Chang et al., 2016) model the state of a dynamical system (such as the basketball's trajectory in time) as a learned embedding of its states or observations. These methods require few prior assumptions about the system itself, but lack interpretability due to entangled variational factors (Chen et al., 2016) or due to the ambiguities in unsupervised learning (Greydanus et al., 2019; Cranmer et al., 2020b). Recently, grey box methods (Mehta et al., 2020) have leveraged partial knowledge about the system dynamics to improve performance. In contrast, white box methods (Degrave et al., 2016; Liang et al., 2019; Hu et al., 2020; Qiao et al., 2020) impose prior knowledge by employing explicit dynamics models, reducing the space of learnable parameters and improving system interpretability.

Figure 2: ∇Sim: Given video observations of an evolving physical system (e), we randomly initialize scene object properties (a) and evolve them over time using a differentiable physics engine (b), which generates states. Our renderer (c) processes states, object vertices, and global rendering parameters to produce image frames for computing our loss. We backprop through this computation graph to estimate physical attributes and controls. Existing methods rely solely on differentiable physics engines and require supervision in state space (f), while ∇Sim needs only image-space supervision (g).

Most notably in our context, all of these approaches require precise 3D labels, which are labor-intensive to gather and infeasible to generate for many systems such as deformable solids or cloth. We eliminate the dependence of white box dynamics methods on 3D supervision by coupling explicit (and differentiable) models of scene dynamics with image formation (rendering)¹.

Explicitly modeling the end-to-end dynamics and image formation underlying video observations is challenging, even with access to the full system state. This problem has been studied in the vision, graphics, and physics communities (Pharr et al., 2016; Macklin et al., 2014), leading to the development of robust forward simulation models and algorithms. However, these simulators are not readily usable for solving inverse problems, due in part to their non-differentiability. As such, applications involving black-box forward processes often require surrogate gradient estimators, such as finite differences or REINFORCE (Williams, 1992), to enable any learning.
Likelihood-free inference for black-box forward simulators (Ramos et al., 2019; Cranmer et al., 2020a; Kulkarni et al., 2015; Yildirim et al., 2015; 2017; 2020; Wu et al., 2017b) has led to some improvements in this setting, but remains limited in terms of data efficiency and scalability to high-dimensional parameter spaces. Recent progress in differentiable simulation further improves the learning dynamics; however, we still lack a method for end-to-end differentiation through the entire simulation process (i.e., from video pixels to physical attributes), a prerequisite for effective learning from video frames alone.

We present ∇Sim, a versatile end-to-end differentiable simulator that adopts a holistic, unified view of differentiable dynamics and image formation (cf. Fig. 1, 2). Existing differentiable physics engines model only time-varying dynamics and require supervision in state space (usually 3D tracking). We additionally model a differentiable image formation process, and thus require only target information specified in image space. This enables us to backpropagate (Griewank & Walther, 2003) training signals from video pixels all the way to the underlying physical and dynamical attributes of a scene. Our main contributions are:

• ∇Sim, a differentiable simulator that demonstrates the ability to backprop from video pixels to the underlying physical attributes (cf. Fig. 2).
• A demonstration that many physical properties, including friction, elasticity, deformable material parameters, and visuomotor controls, can be recovered exclusively from video observations (sans 3D supervision).
• A PyTorch framework facilitating interoperability with existing machine learning modules.

We evaluate ∇Sim's effectiveness on parameter identification tasks for rigid, deformable, and thin-shell bodies, and demonstrate performance that is competitive with, or in some cases superior to, current physics-only differentiable simulators. Additionally, we demonstrate the effectiveness of the gradients provided by ∇Sim on challenging visuomotor control tasks involving deformable solids and cloth.
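To illustrate the kind of end-to-end pipeline described above, the following is a minimal, self-contained PyTorch sketch (a toy analogue, not ∇Sim's actual API): a differentiable one-dimensional "physics engine" and "renderer" are composed, and a contact-stiffness parameter is recovered purely from pixel-space supervision by gradient descent. The functions `simulate` and `render`, the penalty-based contact model, and all constants are illustrative assumptions.

```python
import torch

def simulate(stiffness: torch.Tensor, steps: int = 100, dt: float = 0.02) -> torch.Tensor:
    """Toy differentiable dynamics: a 1-D ball dropped under gravity.

    A smooth penalty-based ground contact (in lieu of hard impacts) keeps
    the trajectory differentiable w.r.t. the contact stiffness parameter.
    """
    h, v = torch.tensor(1.0), torch.tensor(0.0)
    heights = []
    for _ in range(steps):
        contact = stiffness * torch.relu(-h)   # penalty force when h < 0
        v = v + dt * (-9.81 + contact)         # semi-implicit Euler step
        h = h + dt * v
        heights.append(h)
    return torch.stack(heights)                # (steps,)

def render(heights: torch.Tensor, n_pixels: int = 32) -> torch.Tensor:
    """Toy differentiable renderer: splat each height onto a 1-D pixel
    column with a Gaussian point-spread function (one column per frame)."""
    rows = torch.linspace(0.0, 1.5, n_pixels).unsqueeze(1)          # (pixels, 1)
    return torch.exp(-((rows - heights.unsqueeze(0)) ** 2) / 0.01)  # (pixels, steps)

# "Video" observation generated by an unknown stiffness; we recover the
# parameter from pixels alone, with no 3D state supervision anywhere.
target_video = render(simulate(torch.tensor(900.0))).detach()

param = torch.tensor(300.0, requires_grad=True)
opt = torch.optim.Adam([param], lr=20.0)
for _ in range(200):
    opt.zero_grad()
    loss = torch.nn.functional.mse_loss(render(simulate(param)), target_video)
    loss.backward()   # gradients flow: pixels -> renderer -> dynamics -> parameter
    opt.step()
```

The same pattern (compose differentiable dynamics with differentiable rendering, compare in image space, and backpropagate) is what ∇Sim scales up to rigid bodies, deformable solids, and cloth.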

2. RELATED WORK

Differentiable physics simulators have seen significant attention and activity, with efforts centered around embedding physics structure into automatic differentiation frameworks. This has enabled differentiation through contact and friction models (Toussaint et al., 2018; de Avila Belbute-Peres et al., 2018).



¹ Dynamics refers to the laws governing the motion and deformation of objects over time. Rendering refers to the interaction of these scene elements, including their material properties, with scene lighting to form image sequences as observed by a virtual camera. Simulation refers to a unified treatment of these two processes.

