VARIATIONAL REPARAMETRIZED POLICY LEARNING WITH DIFFERENTIABLE PHYSICS

Abstract

We study the problem of policy parameterization for reinforcement learning (RL) with high-dimensional continuous action spaces. Our goal is to find a good way to parameterize the policy of continuous RL as a multi-modality distribution. To this end, we propose to treat the continuous RL policy as a generative model over the distribution of optimal trajectories. We model the policy with a diffusion-process-like strategy and derive a novel variational bound that serves as the optimization objective for learning the policy. To maximize this objective by gradient descent, we introduce the Reparameterized Policy Gradient Theorem, which elegantly connects the classical REINFORCE method with trajectory-return optimization for computing the policy gradient. Moreover, our method enjoys strong exploration ability thanks to the multi-modality policy parameterization; notably, when a strong differentiable world model is available, it also enjoys the fast convergence of trajectory optimization. We evaluate our method on numerical problems and manipulation tasks in a differentiable simulator. Qualitative results show its ability to capture the multi-modality distribution of optimal trajectories, and quantitative results show that it avoids local optima and outperforms baseline approaches.

1. INTRODUCTION

Reinforcement learning (RL) with high-dimensional continuous action spaces is notoriously hard despite its fundamental importance for many application problems such as robotic manipulation (OpenAI et al., 2019; Mu et al., 2021). Compared with its discrete-action counterpart, representing policies for continuous RL is much trickier: by the optimality theory of RL, the function class of discrete RL policies is simply the categorical distributions (Sutton & Barto, 2018), while the function class of continuous RL policies has to include density functions of arbitrary probability distributions. In practice, popular deep RL frameworks (Silver et al., 2014; Haarnoja et al., 2018; Schulman et al., 2017) formulate the continuous policy as a neural network that outputs a single-modality density function over the action space (e.g., a Gaussian distribution over actions). This formulation, however, breaks the promise of RL as a global optimizer of the return function, because the single-modality policy parameterization introduces local minima that are hard to escape using gradients w.r.t. the distribution parameters. Besides, a single-modality policy significantly weakens the exploration ability of RL algorithms because the sampled actions are usually concentrated around the single mode. Our bandit examples show how a single-modality RL policy fails to solve a simple continuous-action bandit problem. Therefore, in practice, continuous RL often requires meticulous reward design that takes considerable human effort (Mu et al., 2021; Savva et al., 2019; Yu et al., 2020; Makoviychuk et al., 2021; Zhu et al., 2020).

In this paper, we propose a principled framework to learn the continuous RL policy as a multi-modality density function. We provide a holistic solution to two closely related questions: 1) how to parameterize the continuous RL policy? 2) how to update the policy parameterized as in 1)?
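To make the local-minimum failure mode concrete, the following minimal sketch (not from the paper; the bimodal reward function and initialization are hypothetical choices for illustration) runs gradient ascent on the mean of a single-modality policy over a continuous bandit. The mean climbs to the nearby suboptimal mode and never reaches the global optimum:

```python
import math

# Hypothetical bimodal bandit reward: a local optimum near a = +2
# (reward ~1) and a global optimum near a = -2 (reward ~2).
def reward(a):
    return math.exp(-(a - 2.0) ** 2) + 2.0 * math.exp(-(a + 2.0) ** 2)

# Central finite-difference gradient of the reward.
def grad_reward(a, eps=1e-5):
    return (reward(a + eps) - reward(a - eps)) / (2.0 * eps)

# A single-modality policy (here reduced to its mean, mu) updated by
# gradient ascent, initialized nearer the suboptimal mode.
mu = 1.0
for _ in range(2000):
    mu += 0.1 * grad_reward(mu)

# mu converges to the local optimum at +2 (reward ~1), missing the
# global optimum at -2 (reward ~2).
```

Because the gradient only sees the local shape of the reward around the current mode, no amount of further gradient ascent moves the policy across the low-reward region to the better mode; this is the behavior a multi-modality parameterization is meant to avoid.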
Our parameterization of the continuous RL policy is based on two ideas. First, we take a sequence modeling perspective (Chen et al., 2021) and view the policy as a density function over the entire trajectory space (instead of the action space) (Ziebart, 2010; Levine, 2018). Under this sequence modeling perspective, we can sample a population of trajectories that cover multiple modalities of the optimal trajectory distribution.
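As a toy illustration of this trajectory-level view (a sketch of ours, not the paper's actual generative model), a policy represented as a mixture over trajectory modes produces samples that cover both modes, whereas a unimodal policy would concentrate around one:

```python
import random

random.seed(0)

# Hypothetical trajectory-level policy: a two-component mixture over
# whole trajectories, with modes centered at -2 and +2.
def sample_trajectory(horizon=5):
    mode = random.choice([-2.0, 2.0])  # pick one modality
    return [mode + random.gauss(0.0, 0.1) for _ in range(horizon)]

# A population of samples covers both modes of the distribution.
trajectories = [sample_trajectory() for _ in range(100)]
modes_hit = {1 if traj[0] > 0 else -1 for traj in trajectories}
```

Sampling at the trajectory level is what lets a population of rollouts explore several distinct solution modes at once, instead of collapsing onto a single one.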

