VARIATIONAL REPARAMETRIZED POLICY LEARNING WITH DIFFERENTIABLE PHYSICS

Abstract

We study the problem of policy parameterization for reinforcement learning (RL) with high-dimensional continuous action spaces. Our goal is to find a good way to parameterize the policy of continuous RL as a multi-modal distribution. To this end, we propose to treat the continuous RL policy as a generative model over the distribution of optimal trajectories. We model the policy with a diffusion-process-like strategy and derive a novel variational bound that serves as the optimization objective for policy learning. To maximize this objective by gradient descent, we introduce the Reparameterized Policy Gradient Theorem, which elegantly connects the classical REINFORCE method with trajectory return optimization for computing the policy gradient. Moreover, our method enjoys strong exploration ability due to its multi-modal policy parameterization; notably, when a strong differentiable world model is available, it also enjoys the fast convergence of trajectory optimization. We evaluate our method on numerical problems and manipulation tasks within a differentiable simulator. Qualitative results show its ability to capture the multi-modal distribution of optimal trajectories, and quantitative results show that it avoids local optima and outperforms baseline approaches.

1. INTRODUCTION

Reinforcement learning (RL) with high-dimensional continuous action spaces is notoriously hard despite its fundamental importance for many applications such as robotic manipulation (OpenAI et al., 2019; Mu et al., 2021). Compared with the discrete-action setting, representing policies for continuous RL is much trickier: by the optimality theory of RL, the function class of discrete RL policies is simply categorical distributions (Sutton & Barto, 2018), while the function class of continuous RL policies must include density functions of arbitrary probability distributions. In practice, popular deep RL frameworks (Silver et al., 2014; Haarnoja et al., 2018; Schulman et al., 2017) formulate the continuous policy as a neural network that outputs a single-modal density function over the action space (e.g., a Gaussian distribution over actions). This formulation, however, breaks the promise of RL being a global optimizer of the return function, because the single-modal parameterization introduces local minima that are hard to escape using gradients w.r.t. the distribution parameters. Moreover, a single-modal policy significantly weakens the exploration ability of RL algorithms, because sampled actions concentrate around the single mode. Our bandit examples show how a single-modal RL policy fails to solve a simple continuous-action bandit problem. Therefore, in practice, continuous RL often requires meticulous reward design that takes considerable human effort (Mu et al., 2021; Savva et al., 2019; Yu et al., 2020; Makoviychuk et al., 2021; Zhu et al., 2020).

In this paper, we propose a principled framework to learn the continuous RL policy as a multi-modal density function. We provide a holistic solution to two closely related questions: 1) how to parameterize the continuous RL policy, and 2) how to update a policy parameterized as in 1).
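To make the failure mode concrete, the following self-contained sketch (our own construction, not the paper's bandit benchmark) trains a single-Gaussian policy with REINFORCE on a bimodal continuous-action bandit. Initialized near the shallow reward mode, the policy mean converges to that mode and never discovers the global optimum:

```python
import numpy as np

rng = np.random.default_rng(0)

def reward(a):
    # Bimodal reward: a shallow optimum near a = +2, the global optimum near a = -2.
    return np.exp(-(a - 2.0) ** 2) + 2.0 * np.exp(-(a + 2.0) ** 2)

def train_gaussian_policy(mu0, sigma=0.3, lr=0.05, steps=2000, batch=64):
    # REINFORCE on a single-Gaussian policy pi(a) = N(mu, sigma^2) with a mean baseline.
    mu = mu0
    for _ in range(steps):
        a = mu + sigma * rng.standard_normal(batch)
        r = reward(a)
        # grad of log pi(a) w.r.t. mu is (a - mu) / sigma^2.
        grad = np.mean((r - r.mean()) * (a - mu) / sigma**2)
        mu += lr * grad
    return mu

# Initialized near the shallow mode, the mean gets trapped near +2, not -2.
print(train_gaussian_policy(mu0=1.5))
```

Because the Gaussian's samples concentrate within a few standard deviations of its mean, the gradient never carries information about the distant, better mode.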
Our parameterization of the continuous RL policy is based on two ideas. First, we take a sequence-modeling perspective (Chen et al., 2021) and view the policy as a density function over the entire trajectory space rather than the action space (Ziebart, 2010; Levine, 2018). Under this perspective, we can sample a population of trajectories covering multiple modes, which allows us to concurrently explore distant regions of the solution space. Second, we use a generative model to parameterize the multi-modal policy, inspired by the success of generative models in capturing highly complicated distributions such as natural images (Goodfellow et al., 2016; Zhu et al., 2017; Rombach et al., 2022; Ramesh et al., 2021). We introduce a sequence of latent variables z and learn a decoder that "reparameterizes" the distribution of z into the multi-modal trajectory distribution (Kingma & Welling, 2013), from which we can sample trajectories τ. Our policy network is, in spirit, akin to diffusion models (Rombach et al., 2022; Ramesh et al., 2021) in that it learns to model the joint distribution p(z, τ). As a prototypical work, we prefer a simple design and choose not to include the multi-step diffusion process used in image modeling.

Our choice of policy parameterization leads us to adopt the variational method (Kingma & Welling, 2013; Haarnoja et al., 2018; Moon, 1996) to derive an on-policy learning algorithm. Classical on-policy learning leverages the policy gradient theorem (Sutton & Barto, 2018), i.e., ∇J(π) = E_τ[R(τ)∇ log p(τ)]. Because we model p(z, τ), this requires computing the marginal p(τ) = ∫ p(z, τ) dz and its gradient.
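As a toy illustration of this reparameterization (a hypothetical decoder of our own, not the paper's network), even a simple deterministic map from a Gaussian latent z can induce a multi-modal, disconnected trajectory distribution:

```python
import numpy as np

rng = np.random.default_rng(1)

def decoder(z):
    # Hypothetical decoder: maps latent z in R to a 1-D action sequence.
    # The sign of z selects one of two trajectory modes, so the induced
    # trajectory distribution is multi-modal even though p(z) is Gaussian.
    mode = np.where(z >= 0, 2.0, -2.0)
    t = np.linspace(0.0, 1.0, 5)
    return mode[:, None] * t[None, :]  # shape (batch, horizon)

z = rng.standard_normal(1000)
tau = decoder(z)
# The final action clusters around exactly two modes, -2 and +2.
print(np.unique(tau[:, -1]))  # [-2.  2.]
```

Evaluating the marginal density p(τ) of such samples requires integrating over z, which motivates the variational treatment below.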
However, marginalizing out z is often intractable when z is continuous, and it is well known that optimizing the marginal distribution log p(τ) by gradient descent suffers from local-optimality issues (e.g., gradient descent is ineffective for optimizing Gaussian mixture models, which contain latent variables, so EM is often used instead (Ng, 2000)). To overcome these obstacles, we take a different route and adopt a variational method (maximum entropy RL) that directly optimizes the joint distribution of the optimal policy without the hassle of integrating over z. We derive a novel variational bound that serves as the optimization objective for policy learning. To maximize this objective, we introduce the Reparameterized Policy Gradient Theorem. The theorem states a principled way to compute the policy gradient by combining the reward-weighted gradient (as in the classical policy gradient theorem) and the path-wise gradient from a differentiable world model (as in classical trajectory optimization methods). The two sources of gradient complement each other: the reward-weighted gradient increases the likelihood of sampling trajectories from regions containing high-reward ones, whereas the path-wise gradient makes local improvements to individual trajectories. This combination allows us to enjoy the precise gradient computation of differentiable world models (Werling et al., 2021; Hu et al., 2019; Huang et al., 2021) while maintaining the flexibility to sample and optimize the trajectory distribution globally. Note that this result is also a consequence of introducing the latent variable z.

The effectiveness of our method is rooted in our choice of generative policy modeling. An ideal method must be powerful enough to model multi-modal distributions, and it must support sampling, density evaluation, and stable gradient computation.
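The two gradient estimators combined by the theorem can be contrasted on a one-dimensional toy objective (a sketch of ours, not the paper's setting): with a reparameterized Gaussian sample and an analytically differentiable reward standing in for the world model, the reward-weighted (score-function) estimator and the path-wise estimator agree in expectation, but the path-wise one has far lower variance:

```python
import numpy as np

rng = np.random.default_rng(2)

# A differentiable "world model": reward and its analytic gradient.
def reward(a):
    return -(a - 1.0) ** 2

def reward_grad(a):
    return -2.0 * (a - 1.0)

theta, sigma, n = 0.0, 0.5, 200_000
eps = rng.standard_normal(n)
a = theta + sigma * eps  # reparameterized sample, a ~ N(theta, sigma^2)

# 1) Reward-weighted (REINFORCE) estimate of dJ/dtheta.
score = (a - theta) / sigma**2  # grad of log N(a; theta, sigma^2) w.r.t. theta
g_reinforce = np.mean(reward(a) * score)

# 2) Path-wise estimate through the differentiable model: da/dtheta = 1.
g_pathwise = np.mean(reward_grad(a))

# Both estimate dJ/dtheta = -2*(theta - 1) = 2 for theta = 0.
print(g_reinforce, g_pathwise)
```

Here J(θ) = E[r(a)] = −((θ − 1)² + σ²), so the true gradient at θ = 0 is 2; the score-function estimate only needs to evaluate rewards, while the path-wise estimate needs the model's derivative but is much less noisy.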
Although there are other candidates among deep generative models, they often have limitations when used for continuous policy modeling. For example, GAN-like generative models (Goodfellow et al., 2020) can sample but cannot compute density values. Normalizing flows (Rezende & Mohamed, 2015) can compute density values, but they may be numerically less robust due to their dependence on the determinant of the network Jacobian; moreover, normalizing flows apply continuous transformations to a connected base distribution, making it difficult to model disconnected modes (Rasul et al., 2021).

We apply our method to several numerical problems and three trajectory optimization tasks manipulating rigid-body and deformable objects in differentiable physics engines (Werling et al., 2021; Huang et al., 2020). These environments contain various local optima that challenge single-modal policies (whether trained with reward-weighted policy gradients as in REINFORCE or with path-wise gradients as in trajectory optimization). In contrast, our approach benefits from modeling trajectories generatively and from the ability to sample in the trajectory space. In qualitative experiments, we observe a strong multi-modal pattern when visualizing the policy after learning converges. In quantitative evaluations, the policy learned by our framework does not suffer from the local-optimality issue and significantly outperforms baselines.
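For reference, the exact density evaluation that flows do support comes from the change-of-variables formula, whose Jacobian term is precisely the source of the numerical concern above. A minimal affine-flow sketch (our own example, with a 1-D map x = a·z + b applied to z ~ N(0, 1)):

```python
import numpy as np

# Invertible 1-D "flow": x = f(z) = a*z + b, with z ~ N(0, 1).
a, b = 2.0, 1.0

def log_prob_x(x):
    # Change of variables: log p(x) = log p_z(f^{-1}(x)) - log |df/dz|.
    z = (x - b) / a
    log_pz = -0.5 * (z**2 + np.log(2 * np.pi))
    return log_pz - np.log(abs(a))

# This matches the closed-form density of N(b, a^2).
x = 1.5
expected = -0.5 * (((x - b) / a) ** 2 + np.log(2 * np.pi * a**2))
print(np.isclose(log_prob_x(x), expected))  # True
```

Deeper flows stack such invertible maps, accumulating a log-Jacobian-determinant term per layer; since each map is continuous, the support of the output distribution stays connected, which is why disconnected modes are hard for flows to represent.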

2. VARIATIONAL REPARAMETERIZED POLICY LEARNING

In this section, we introduce our basic framework. Our notation follows community conventions; for clarity, we provide background material in Appendix A and a notation table in Appendix B.

