VARIATIONAL REPARAMETRIZED POLICY LEARNING WITH DIFFERENTIABLE PHYSICS

Abstract

We study the problem of policy parameterization for reinforcement learning (RL) with high-dimensional continuous action spaces. Our goal is to find a good way to parameterize the policy of continuous RL as a multi-modality distribution. To this end, we propose to treat the continuous RL policy as a generative model over the distribution of optimal trajectories. We use a diffusion-process-like strategy to model the policy and derive a novel variational bound that serves as the optimization objective for learning the policy. To maximize the objective by gradient descent, we introduce the Reparameterized Policy Gradient Theorem. This theorem elegantly connects the classical REINFORCE method and trajectory return optimization for computing the gradient of a policy. Moreover, our method enjoys strong exploration ability due to the multi-modality policy parameterization; notably, when a strong differentiable world model is available, it also enjoys the fast convergence speed of trajectory optimization. We evaluate our method on numerical problems and manipulation tasks within a differentiable simulator. Qualitative results show its ability to capture the multi-modality distribution of optimal trajectories, and quantitative results show that it can avoid local optima and outperforms baseline approaches.

1. INTRODUCTION

Reinforcement learning (RL) with a high-dimensional continuous action space is notoriously hard despite its fundamental importance for many application problems such as robotic manipulation (OpenAI et al., 2019; Mu et al., 2021). Compared with the discrete action space setting, it is much trickier to represent policies for continuous RL: by the optimality theory of RL, the function class of discrete RL policies is simply categorical distributions (Sutton & Barto, 2018), while the function class of continuous RL policies has to include density functions of arbitrary probabilistic distributions. In practice, popular frameworks (Silver et al., 2014; Haarnoja et al., 2018; Schulman et al., 2017) of deep RL formulate the continuous policy as a neural network that outputs a single-modality density function over the action space (e.g., a Gaussian distribution over actions). This formulation, however, breaks the promise of RL being a global optimizer of the return function, because the single-modality policy parameterization introduces local minima that are hard to escape using gradients w.r.t. distribution parameters. Besides, a single-modality policy significantly weakens the exploration ability of RL algorithms because the sampled actions are usually concentrated around the single mode. Our bandit examples show how a single-modality RL policy fails to solve a simple continuous-action bandit problem. Therefore, in practice, continuous RL often requires meticulous reward design that takes considerable human effort (Mu et al., 2021; Savva et al., 2019; Yu et al., 2020; Makoviychuk et al., 2021; Zhu et al., 2020).

In this paper, we propose a principled framework to learn the continuous RL policy as a multi-modality density function. We provide a holistic solution to two closely related questions: 1) how to parameterize the continuous RL policy? 2) how to update the policy parameterized as in 1)?

Our parameterization of the continuous RL policy is based on two ideas. First, we take a sequence-modeling perspective (Chen et al., 2021) and view the policy as a density function over the entire trajectory space (instead of the action space) (Ziebart, 2010; Levine, 2018). Under this sequence-modeling perspective, we can sample a population of trajectories that cover multiple modes, which allows us to concurrently explore distant regions in the solution space. Second, we use a generative model to parameterize the multi-modality policies, inspired by their success in modeling highly complicated distributions such as natural images (Goodfellow et al., 2016; Zhu et al., 2017; Rombach et al., 2022; Ramesh et al., 2021). We introduce a sequence of latent variables z, and we learn a decoder that "reparameterizes" the random distribution of z into the multi-modality trajectory distribution (Kingma & Welling, 2013), from which we can sample trajectories τ. Our policy network is, in spirit, akin to diffusion models (Rombach et al., 2022; Ramesh et al., 2021) in that it learns to model the joint distribution p(z, τ). As a prototypical work, we prefer a simple design and choose not to include the multi-step diffusion process used in image modeling.

Our choice of policy parameterization leads us to adopt the variational method (Kingma & Welling, 2013; Haarnoja et al., 2018; Moon, 1996) to derive an on-policy learning algorithm. Classical on-policy learning leverages the policy gradient theorem (Sutton & Barto, 2018), i.e., $\nabla J(\pi) = \mathbb{E}_{\tau}[R(\tau)\nabla \log p(\tau)]$.
Because we model p(z, τ), this requires computing p(τ) and its gradient via $p(\tau) = \int_z p(z, \tau)\,dz$. However, marginalizing out z is often intractable when z is continuous, and it is well known that optimizing the marginal distribution log p(τ) by gradient descent suffers from local-optimality issues (e.g., gradient descent is not effective for optimizing Gaussian mixture models, which have latent variables, so EM is often used instead (Ng, 2000)). To overcome these obstacles, we take a different route and adopt the variational method (maximum entropy RL) that directly optimizes the joint distribution of the optimal policy without the hassle of integrating over z. We derive a novel variational bound that serves as the optimization objective for policy learning. To maximize this objective, we introduce the Reparameterized Policy Gradient Theorem. The theorem states a principled way to compute the policy gradient by combining the reward-weighted gradient (as in the classical policy gradient theorem) and the path-wise gradient from a differentiable world model (as in classical trajectory optimization methods). The two sources of gradient complement each other: the reward-weighted gradient improves the likelihood of selecting trajectories from regions containing high-reward ones, whereas the path-wise gradient suggests updates that make local improvements to trajectories. This combination allows us to enjoy the precise gradient computation of differentiable world models (Werling et al., 2021; Hu et al., 2019; Huang et al., 2021) while maintaining the flexibility to sample and optimize the trajectory distribution globally. Note that this result is also a consequence of introducing the latent variable z.

The effectiveness of our method is rooted in our choice of generative policy modeling. An ideal method needs to be powerful in modeling multi-modality distributions, and it needs to support sampling, density evaluation, and stable gradient computation. Although there are other candidates among deep generative models, they often have limitations for continuous policy modeling. For example, GAN-like generative models (Goodfellow et al., 2020) can only sample but not compute the density value. While normalizing flows (Rezende & Mohamed, 2015) can compute the density value, they might not be as robust numerically due to the dependency on the determinant of the network Jacobian; moreover, normalizing flows must apply continuous transformations to a connected base distribution, making it difficult to model disconnected modes (Rasul et al., 2021).

We apply our method to several numerical problems and three trajectory optimization tasks that manipulate rigid-body and deformable objects, supported by differentiable physics engines (Werling et al., 2021; Huang et al., 2020). These environments contain various local optima that challenge single-modality policies (whether using reward-weighted policy gradients as in REINFORCE or path-wise gradients as in trajectory optimization). In contrast, our approach benefits from modeling the trajectories with a generative model and from the ability to sample in trajectory space. In qualitative experiments, we observe a strong pattern of multi-modality by visualizing the policy after learning converges. In quantitative evaluation, the policy learned by our framework does not suffer from the local-optimality issue and significantly outperforms baselines.

2. VARIATIONAL REPARAMETERIZED POLICY LEARNING

In this section, we introduce our basic framework. All notations follow the convention of the community. For clarity, we provide background knowledge in Appendix A and a table of mathematical notations in Appendix B.

Figure 1: a) the reward landscape, where the agent needs to move from the red dots to the region containing high reward; b) the latent space Z modeled by a random Gaussian; c) the state density of a sampled trajectory from q_θ(z, τ). In (c), each red dot corresponds to a state in the trajectory. Our policy can be viewed as encoding the stochastic latent variable z into the trajectory distribution through the decoder q_θ(z, τ). We rely on an encoder p_ϕ(z|τ) to ensure cycle consistency between the latent z and the sampled trajectory τ.

2.1. GENERATIVE MODELING OF OPTIMAL TRAJECTORIES

Following (Todorov, 2006; 2008; Toussaint, 2009; Ziebart, 2010; Kappen et al., 2012; Levine, 2018; Haarnoja et al., 2018), our Variational Reparameterized Policy (VRP) framework, as shown in Fig. 1, views policy optimization as learning a generative model that generates optimal trajectories. To capture the multiple modes of the optimal trajectories, we introduce a latent distribution and construct a decoder-encoder structure similar to recent deep generative models (Kingma & Welling, 2013; Ho et al., 2020; Rombach et al., 2022; Ramesh et al., 2021). The "encoder" maps trajectories to latent samples, while the decoder is a neural network that "reparameterizes" random samples from the latent distribution into different trajectories.

Method Overview. As introduced in Sec. 2.2, the latent distribution and the reparameterization function together form a policy that can generate diverse trajectories. We derive a novel variational bound that approximates the posterior of optimal trajectories in Sec. 2.3 and serves as the optimization objective. The variational bound naturally incorporates maximum entropy RL while also containing a term that enforces cycle consistency (Zhu et al., 2017) between the encoder and the decoder, preventing the policy from mode collapse. We then introduce the Reparameterized Policy Gradient Theorem in Sec. 2.4 to optimize our reparameterized policy so as to improve the variational bound. The derived policy gradient includes two terms: one optimizes the latent distribution through a reward-weighted gradient as in the classical policy gradient theorem, and the other optimizes the reparameterization network, which can benefit from the path-wise gradient of a differentiable world model. This allows our inference algorithm to enjoy the efficiency of differentiable physics while being able to sample globally when the reward landscape is discontinuous or non-convex.

2.2. TRAJECTORY GENERATION WITH REPARAMETERIZED POLICY

Let z ∈ Z be a latent variable, which can be either continuous or categorical, used to model optimal trajectories. We design our "policy" as a joint distribution q_θ(z, τ) over the latent z and the trajectory τ. To model a sequential trajectory, we consider the following two factorizations:

1. sample z before τ: $q_\theta(z, \tau) = p(s_1)\, q_\theta(z|s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, q_\theta(a_t|z, s_t)$;

2. sample z along with τ: $q_\theta(z, \tau) = p(s_1)\, q_\theta(z_0|s_1) \prod_{t=1}^{T} p(s_{t+1}|s_t, a_t)\, q_\theta(a_t|z_{t-1}, s_t)\, q_\theta(z_t|s_t, z_{t-1})$.

The latter allows us to model a hybrid policy as in the option-critic (Bacon et al., 2017), where we can treat z_t as the per-step option. Though sampling from q_θ may require us to sample τ and z together, it is still convenient to view it as the combination of two marginal distributions: (1) the policy for the latent representation p_θ(z|s_1)p(s_1), and (2) a decoder p_θ(τ|z, s_1) that reparameterizes z into a real trajectory. Note that we use p_θ to refer to various marginal distributions of q_θ(z, τ). For example, $p_\theta(\tau) = \int_z q_\theta(z, \tau)\,dz$ is the marginal distribution of trajectories sampled from q_θ(z, τ). The term "reparameterization" is inspired by the reparameterization trick in (Kingma & Welling, 2013): when z is Gaussian noise ξ, we can reparameterize the noise into actions by a = µ(θ) + ξσ(θ) and optimize it with gradients directly. The reparameterized policy q_θ can be viewed as an instance of a stochastic computation graph (Schulman et al., 2015; Weber et al., 2019). In practice, q_θ(z|s) and q_θ(a|s, z) are modeled by neural networks as in common policy learning frameworks (Haarnoja et al., 2018).
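As an illustration of the first factorization (sampling z once before the trajectory), the sketch below rolls out a reparameterized policy with a categorical latent and a Gaussian action head. This is only a minimal sketch: the network sizes, the categorical latent, and the module names are assumptions made for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn


class ReparameterizedPolicy(nn.Module):
    """Sketch of factorization 1: q_theta(z, tau) = p(s1) q_theta(z|s1) prod_t p(s_{t+1}|s_t,a_t) q_theta(a_t|z,s_t)."""

    def __init__(self, state_dim, action_dim, n_latent=8, hidden=256):
        super().__init__()
        # q_theta(z|s1): a categorical latent sampled once from the initial state.
        self.latent_head = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(), nn.Linear(hidden, n_latent))
        # q_theta(a_t|z, s_t): the decoder that "reparameterizes" z into actions.
        self.decoder = nn.Sequential(
            nn.Linear(state_dim + n_latent, hidden), nn.ReLU(), nn.Linear(hidden, 2 * action_dim))
        self.n_latent = n_latent

    def sample_latent(self, s1):
        dist = torch.distributions.Categorical(logits=self.latent_head(s1))
        z = dist.sample()
        return z, dist.log_prob(z)

    def act(self, s, z):
        z_onehot = nn.functional.one_hot(z, self.n_latent).float()
        mu, log_std = self.decoder(torch.cat([s, z_onehot], -1)).chunk(2, -1)
        dist = torch.distributions.Normal(mu, log_std.clamp(-5, 2).exp())
        a = dist.rsample()  # reparameterization trick for the action noise
        return a, dist.log_prob(a).sum(-1)
```

A rollout samples z once from s_1 and conditions every action on it, so different latent values can steer the trajectory toward different modes; the second factorization would instead resample z_t along the trajectory.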

2.3. THE VARIATIONAL LOWER BOUND FOR REPARAMETERIZED POLICIES

The Auxiliary Trajectory Encoder. Intuitively, feeding a random latent distribution into the policy network can help generate diverse samples, but it also suffers from mode collapse (Li et al., 2017). To build a connection between the latent z and the trajectory τ and to enforce cycle consistency, we introduce an auxiliary distribution p_ϕ(z|τ), which looks at the whole trajectory and maps τ back into the latent z. We illustrate below how the auxiliary encoder naturally emerges from the variational lower bound, following (Rezende & Mohamed, 2015; Levine, 2018).

The Evidence Lower Bound. With the help of the auxiliary encoder p_ϕ(z|τ), we can define a joint distribution over the optimality O, the latent z, and the trajectory τ as p_ϕ(O, z, τ) = p(O|τ)p_ϕ(z|τ)p(τ), where O and z are independent conditioned on τ. Treating q_θ(z, τ) as the posterior approximation, we can write the Evidence Lower Bound (ELBO) for the optimality. Note that the following inequality holds for arbitrary distributions p_ϕ and q_θ:

$$
\begin{aligned}
\log p(O) &= \underbrace{\mathbb{E}_{z,\tau\sim q_\theta}\left[\log p_\phi(O, z, \tau) - \log q_\theta(z, \tau)\right]}_{\text{ELBO}} + D_{\mathrm{KL}}\left(q_\theta(z,\tau)\,\|\,p_\phi(z,\tau|O)\right) \\
&\geq \mathbb{E}_{z,\tau\sim q_\theta}\left[\log p_\phi(O, \tau, z) - \log q_\theta(z, \tau)\right] \\
&= \mathbb{E}_{z,\tau\sim q_\theta}\left[\log p(O, \tau) + \log p_\phi(z|\tau) - \log q_\theta(z, \tau)\right] \\
&= \mathbb{E}_{z,\tau\sim q_\theta}\Big[\underbrace{\log p(O|\tau)}_{\text{reward}} + \underbrace{\log p(\tau)}_{\text{prior}} + \underbrace{\log p_\phi(z|\tau)}_{\text{cross entropy}} - \underbrace{\log q_\theta(z, \tau)}_{\text{entropy}}\Big] \qquad (1)
\end{aligned}
$$

The ELBO contains four parts that can all be computed directly given the sampled z and τ (the environment probability p(s_{t+1}|s_t, a_t) cancels as in (Levine, 2018)). The first two parts are the predefined reward log p(O|τ) = R(τ)/T + c, where c is a normalizing constant that can be ignored during optimization, and a prior distribution p(τ), which is assumed to be known. The third part is the log-likelihood of z under our encoder. It is easy to see that, with q_θ fixed, maximizing over p_ϕ alone minimizes the cross-entropy $\mathbb{E}_{z,\tau\sim q_\theta}[-\log p_\phi(z|\tau)]$, similar to supervised learning. It achieves optimality when $p_\phi(z|\tau) = p_\theta(z|\tau) = \frac{q_\theta(z,\tau)}{\int_z q_\theta(z,\tau)\,dz}$, i.e., the posterior of z for τ sampled from q_θ. On the other hand, with ϕ fixed, the decoder q_θ is encouraged to generate trajectories that are easy to identify or classify; this helps increase diversity and enforces cycle consistency to avoid mode collapse. The fourth part is the policy entropy that enables maximum-entropy exploration. Maximizing all terms together over the parameters θ and ϕ minimizes $D_{\mathrm{KL}}(q_\theta(z,\tau)\,\|\,p_\phi(z,\tau|O)) = D_{\mathrm{KL}}(q_\theta(z,\tau)\,\|\,p_\phi(z|\tau)p(\tau|O))$, where optimality is achieved when $p_\theta(z|\tau) = p_\phi(z|\tau)$ and $p_\theta(\tau)p_\theta(z|\tau) = p_\phi(z|\tau)p(\tau|O) \Rightarrow p_\theta(\tau) = \int q_\theta(\tau, z)\,dz = p(\tau|O)$. We discuss the method's connection with other methods in Appendix C.
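For concreteness, the four ELBO terms can be assembled from a single sampled pair (z, τ) as in the minimal sketch below; the argument names and the temperature handling are assumptions made for illustration rather than the paper's exact implementation.

```python
def r_elbo(traj_return, prior_log_prob, encoder_log_prob, policy_log_prob, temperature=1.0):
    """R_elbo = log p(O|tau) + log p(tau) + log p_phi(z|tau) - log q_theta(z, tau).

    traj_return:      R(tau); log p(O|tau) = R(tau)/T up to an ignorable constant.
    prior_log_prob:   sum_t log p(a_t|s_t) under the known action prior (the dynamics terms
                      in log p(tau) cancel against those in log q_theta(z, tau)).
    encoder_log_prob: log p_phi(z|tau) from the auxiliary encoder (cross-entropy term).
    policy_log_prob:  log q_theta(z, tau) accumulated over latent and action choices (entropy term).
    """
    return traj_return / temperature + prior_log_prob + encoder_log_prob - policy_log_prob
```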

2.4. REPARAMETERIZED POLICY GRADIENT WITH DIFFERENTIABLE PHYSICS

Given θ, ϕ, we can treat the ELBO as our reward: $R_{\text{elbo}}(\tau) = \log p(O|\tau) + \log p(\tau) + \log p_\phi(z|\tau) - \log q_\theta(z, \tau)$. The maximization w.r.t. ϕ is straightforward. As for q_θ, depending on its structure, we can optimize it with various on-policy or off-policy RL methods as in (Weber et al., 2019). Here we study a special case where R_elbo(τ) is differentiable w.r.t. θ through a sampled trajectory τ, so we can optimize θ with first-order path-wise gradients directly. This case is particularly interesting: analytical gradients provide fast convergence but also suffer from local optima (Li et al., 2017), while our reparameterized policy naturally includes a latent distribution for sampling and a parameterized decoder for optimization, providing more chances to search globally. One can obtain a path-wise gradient from a differentiable simulator (Hu et al., 2019; Werling et al., 2021) or a learned neural model (Liang et al., 2022).

Reparameterized Policy in Differentiable Environments. Formally, we hope to find θ that maximizes the expectation $\mathbb{E}[R_{\text{elbo}}(\tau)] = \int_{z,\tau} q_\theta(z, \tau) R_{\text{elbo}}(\tau)\, d\tau\, dz$ (here we assume z contains only continuous variables to simplify the derivation; adding discrete variables is straightforward by summing over all possibilities). The sampling procedure is usually non-differentiable, but if we let $a_t = f_\theta(s_t, z, t)$ and the dynamics $s_{t+1} = h(s_t, a_t)$ be differentiable functions, the trajectory τ becomes a differentiable function of θ almost everywhere. We further assume that s_1 is fixed and that the functions f_θ, h are deterministic to simplify the derivation; these assumptions can be removed through reparameterization as in (Silver et al., 2014; Heess et al., 2015) to generalize to stochastic functions. As a result, we can write the density as q_θ(z, τ_θ) with τ_θ = τ_θ(z, s_1), where the trajectory and the density are deterministic functions of the sampled z and s_1. In this case, $\mathbb{E}[R_{\text{elbo}}(\tau_\theta)] = \int_z q_\theta(z, \tau_\theta) R_{\text{elbo}}(\tau_\theta)\, dz$. For each sampled trajectory τ_θ, we can compute its path-wise gradient w.r.t. the policy parameter θ as ∇_θ R_elbo(τ_θ). We then have the following theorem for the gradient of the expected reward:

Theorem 1 (Reparameterized Policy Gradient Theorem). Given s_1 and under the regularity conditions in Appendix F.1, for almost every θ the expected reward $\mathbb{E}[R_{\text{elbo}}(\tau_\theta)]$ exists and is differentiable, and its gradient can be computed by

$$
\nabla_\theta \mathbb{E}[R_{\text{elbo}}(\tau_\theta)] = \int_z q_\theta(z, \tau_\theta)\Big[\underbrace{R_{\text{elbo}}(\tau_\theta)\nabla_\theta \log q_\theta(z, \tau_\theta)}_{\text{Reward-weighted Gradient (REINFORCE)}} + \underbrace{\nabla_\theta R_{\text{elbo}}(\tau_\theta)}_{\text{Path-wise Gradient (Trajectory Optimization)}}\Big]\, dz.
$$

We provide a short proof in Appendix F.2.

Combining Sampling and Optimization. Our theorem shares the same spirit as Schulman et al. (2015) and is related to the problem of exchanging derivative and expectation (L'Ecuyer, 1995). Here we want to emphasize its use case in trajectory optimization. Theorem 1 splits the gradient into two parts. The first is a reward-weighted gradient that increases the likelihood under q_θ(z, τ_θ) of high-reward samples. If the environment is differentiable, the second part optimizes the policy directly through a path-wise gradient that makes local improvements to trajectories. This natural combination of sampling and optimization allows our approach to enjoy the benefits of both: it can search over the whole space while also leveraging local structure for fast convergence. This cannot be achieved without the latent variable z. Our method also provides the potential to avoid discontinuities in the reward landscape.
If q_θ(z, τ_θ) does not depend on states beyond s_1, the reward-weighted gradient can optimize the sampling distribution directly without relying on the path-wise gradients, giving the policy a chance to ignore discontinuous points and move toward the high-reward region directly. Another appealing point of Theorem 1 is that Assumption 2 only requires R_elbo(τ_θ) to be Lipschitz continuous where q_θ(z, τ_θ) ≥ 0. Thus, our policy gradient estimate remains unbiased as long as the probability of sampling a discontinuous trajectory is negligible. Even if a discontinuous point produces a biased gradient estimate for a sampled z, the zeroth-order part still has a chance to correct the density by directly observing the reward, provided the optimal trajectories have suitable properties. By defining latent distributions, building the reparameterized policy and the auxiliary encoder, and then optimizing the ELBO with the reparameterized policy gradient through a differentiable world model, we combine search and optimization to enjoy generative modeling of the optimal trajectories. We summarize the whole recipe in Algorithm 1 of Appendix D and describe implementation details in Appendix D.
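The estimator implied by Theorem 1 can be implemented with a single surrogate loss whose gradient contains both terms; the sketch below uses the standard detach trick and is an assumption about one reasonable implementation, not the released code.

```python
def rpg_surrogate_loss(log_q_z_tau, r_elbo_pathwise):
    """Surrogate whose gradient matches Theorem 1.

    log_q_z_tau:      log q_theta(z, tau) per sample, differentiable w.r.t. theta.
    r_elbo_pathwise:  R_elbo(tau_theta) per sample, computed through a differentiable
                      world model so gradients can flow back to theta along the trajectory.
    """
    # Reward-weighted (REINFORCE) term: detach the return so only log q_theta carries gradient.
    reinforce_term = (r_elbo_pathwise.detach() * log_q_z_tau).mean()
    # Path-wise term: gradient flows through the differentiable dynamics and reward.
    pathwise_term = r_elbo_pathwise.mean()
    # We maximize the expected R_elbo, i.e. minimize the negative surrogate.
    return -(reinforce_term + pathwise_term)
```

In practice a baseline can be subtracted from the detached return to reduce the variance of the reward-weighted term without changing its expectation.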

3. RELATED WORK

Differentiable Simulation and Trajectory Optimization. Trajectory optimization (Kelly, 2017) aims to find a trajectory that optimizes the target metrics under given constraints. One branch of methods optimizes the trajectory using the gradient of the optimization objective (Ratliff et al., 2009; Kalakrishnan et al., 2011; Schulman et al., 2014). Recently, the development of differentiable simulation techniques (Hu et al., 2019; Werling et al., 2021; Qiao et al., 2021; Freeman et al., 2021; Howell et al., 2022) has made it possible to compute the gradient of the reward function and optimize the trajectory inside physical simulators. Such methods are efficient near the optimal solution; however, the optimization often gets stuck at local optima elsewhere, especially when the reward function is not smooth. In our work, one term in the formula of our RPG theorem corresponds to the gradient from trajectory optimization.

Policy as Sequential Generative Model. Maximum entropy reinforcement learning (Todorov, 2006; 2008; Toussaint, 2009; Ziebart, 2010; Kappen et al., 2012) can be viewed as variational inference in probabilistic graphical models (Levine, 2018), which model optimality as an observed variable and the trajectory as a latent variable. When demonstrations or a fixed dataset are provided in the offline RL setting (Chen et al., 2021; Reed et al., 2022), policy learning is simplified into a sequence modeling task (Chen et al., 2021; Zheng et al.; Reed et al., 2022): autoregressive models learn the distribution of the whole trajectory, including actions, states, and rewards, and use the action prediction as the policy. In our work, we learn a sequential generative model of the policy for online RL via the variational method; the policy can model any distribution in the trajectory space.

Variational Skill Discovery. Our method is closely related to work in unsupervised reinforcement learning (Eysenbach et al., 2018; Achiam et al., 2018) and diverse skill learning (Kumar et al., 2020; Osa et al., 2022). These methods share the same technique of using neural networks to approximate the posterior of the latent variable given either states or state-action pairs and encourage the policy to reach states consistent with the latent variables. However, these methods do not model the optimal trajectory distribution; they only aim to generate a diverse set of solutions by adding the mutual information term as a reward bonus, resulting in different formulations and effects. For example, they all fix the initial latent distribution without optimizing it, limiting their ability to achieve optimality. Moreover, Eysenbach et al. (2018) and Achiam et al. (2018) do not optimize the learned skills for the environment rewards; Osa et al. (2022) does not optimize the mutual information along trajectories; Kumar et al. (2020) needs to solve the optimization problem first before finding a diverse set of solutions. In contrast, we jointly optimize the latent representation and the policy with a single objective, providing a simple but unified perspective on previous approaches.

Hierarchical Methods. As mentioned in Sec. 2.2, hierarchical methods, e.g., the option-critic (Bacon et al., 2017), can be seen as a special case of our method when we use a sequence of latent variables z = (z_1, ..., z_T) to reparameterize the policy.
Without optimizing the latent variable through variational inference, most hierarchical RL methods need special designs for the latent space, e.g., state-based subgoals (Kulkarni et al., 2016; Nachum et al., 2018b;a) or predefined skills (Li et al., 2020), to avoid mode collapse. Osa et al. (2019) regularize options to maximize the mutual information between the action and the options, which is closely related to our approach; however, they do not model temporal structure as we do to ensure consistency along the trajectories. Hierarchical imitation learning methods (Gupta et al., 2019; Pertsch et al., 2021; Shankar & Gupta, 2020; Jiang et al., 2022; Lynch et al., 2020; Fang et al., 2020) extract temporal abstractions from demonstrations using generative models. InfoGAN (Li et al., 2017) and ASE (Peng et al., 2022) use adversarial training (Goodfellow et al., 2020; Ho & Ermon, 2016) to imitate demonstrations. These works all rely on demonstrations rather than rewards to learn abstractions. For example, (Co-Reyes et al., 2018) learns representations on a collected dataset with variational inference and then utilizes the trained model for planning or policy learning. The separation of representation learning and reward maximization distinguishes it from our method: first, it requires a state reconstruction module to supervise the generative model, which is challenging for high-dimensional observations; second, it optimizes neither the latent distribution nor the actions for the reward directly, and thus requires an additional planning procedure during execution to find suitable actions.

4.1.1. ENVIRONMENT AND EVALUATION

In this section, we investigate the following questions: 1) whether our method is more efficient for trajectory optimization than gradient-free algorithms; 2) whether our method can search over a large solution space and avoid local optima that might trap a single-modality policy. We design the following environments to answer these questions.

Bandit 1 and 2: Our bandit problems in Figure 2a-(1) and Figure 2a-(2) have a 1D action space and a non-convex reward landscape. We initialize our policy as a Gaussian centered around 0, with its scale barely touching the right mode of the reward. The reward function of Bandit 2 contains an additional discontinuous point.

Move 1 and 2: An agent moves in a 2D environment for a fixed number of steps and receives a reward according to its terminal position. The terminal reward landscape contains 4 Gaussian peaks (deeper color = higher reward), as shown in Figure 2a-(3) and (4). The agent is initialized at the center. Note that the reward of Move 1 near the initialization point is constant, so the agent receives a non-zero gradient only if it arrives at a position closer to one of the four Gaussian peaks.

Move 3: The terminal reward is shown in Figure 2a-(5). There is one obstacle consisting of three white circles. When an agent runs into the obstacle, it bounces off (continuous collision detection (Hu et al., 2019)) due to an impulse normal to the contact surface. Note that there is a small dent between the three circles, creating a local optimum.

We evaluate our method against the single-modality policies learned by the following baselines: 1) REINFORCE (Williams, 1992), and 2) policies optimized through path-wise gradients with Adam (Kingma & Ba, 2014). We compare their sample efficiency and final performance after training.
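To make the Move environments concrete, the toy rollout below moves a point agent for T steps and pays a terminal reward that is a sum of Gaussian peaks; the peak locations, widths, and step size are illustrative guesses rather than the values used in our experiments.

```python
import torch


def move_rollout(actions, peaks, step_size=0.1):
    """actions: (T, 2) tensor of per-step controls for a point agent starting at the origin.
    peaks:   (K, 2) tensor of Gaussian peak centers in the terminal reward landscape."""
    pos = torch.zeros(2)
    for a in actions:
        pos = pos + step_size * torch.tanh(a)  # bounded displacement per step
    # Terminal reward: sum of Gaussian bumps (deeper color = higher reward in Figure 2a).
    return torch.exp(-((pos - peaks) ** 2).sum(-1) / 0.05).sum()


# Because the rollout is built from differentiable operations, both the path-wise gradient
# (autograd through move_rollout) and the reward-weighted gradient (treating the return as a
# scalar weight) are available to the learner.
```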

4.1.2. EXPERIMENT RESULTS

We plot the average episode return of each algorithm against the number of samples in Figure 2b. Error bars show the standard deviations over five runs with different random seeds. The results suggest that our method achieves better sample efficiency and finds better solutions than the baseline algorithms.

Figure 2(a): Environments for numerical experiments. We plot the reward landscapes of the first two bandit environments and show the rewards of the terminating positions as contour plots for the three 2D move environments on the right. These environments all pose some level of challenge for optimizing reward/value landscapes that are discontinuous and/or locally optimal.

We first take Bandit 2 in Figure 2a-(2), which has two modes and regions containing zero gradients in its reward landscape, as an example for comparison. We show the performance of each algorithm in Figure 2b-(2). In terms of optimality, our method converges to the globally optimal solution in the end and achieves a much higher return than the other methods. Two factors prevent our method from getting stuck at the local optima. First, the entropy term in the R_elbo objective encourages the agent to perform maximum-entropy exploration over the whole reward landscape, while the cross-entropy term drives the agent to sample distinguishable actions for different latent variables, causing different action modes determined by latent variables to move away from each other and resulting in better state coverage. Second, after discovering different local optima, our policy is able to maintain a multi-modality action distribution and exploit those local optima independently. By adjusting the probabilities of different modes through optimizing rewards with latent variables, it can eventually converge to the globally optimal solution. In experiments, we observed that although our method quickly found the right mode of the reward function shown in Figure 2a-(2), which is locally optimal, it would still increase the probability of the under-explored region due to the exploration terms in our objective. After it found the left mode, it maintained two action modes and optimized them together until the expected return of the left mode exceeded that of the right one, eventually converging to the left and finding the global optimum. A notable feature of the multi-modality policy is that during the whole optimization process, the expected reward increases monotonically, as it only needs to increase the probability of modes with higher rewards. In contrast, when Adam and REINFORCE converge to the right mode, even though we have introduced the maximum-entropy reward to encourage exploration, their single-modality property prevents them from jumping from one mode to a better one, as there is no way to cross the flat region in between without decreasing the reward. In terms of sample efficiency, our method benefits from the path-wise analytical gradient, consistently outperforming REINFORCE, whose noisy gradient estimate reduces convergence speed when the number of samples is limited. Though Adam increases rewards faster than our method in the beginning, it quickly gets stuck at the local optimum. Current RL algorithms like PPO or SAC are also constrained by the single-modality Gaussian parameterization of the policy; these methods can only find suboptimal solutions, even when more than enough samples are given.
Remaining environments. Experiments on the remaining four environments show a similar pattern: our method converges to the global optima that the baselines fail to find. We observe that the curve of Adam in Figure 2b-(1) even decreases after reaching a local optimum. This is due to the discontinuity on the right of the reward landscape shown in Figure 2a-(1). When a Gaussian policy reaches the right local optimum, the analytical gradients near that point still only contain a positive value, pushing the policy past the optimal point and into the flat region on the right, causing the return to decrease to -0.5 in the end. REINFORCE and PPO do not suffer from this issue, as they only use the reward-weighted gradient estimate and can correctly identify the left low-reward region and move away from it. SAC also does not suffer from this issue, as its policy optimizes the Q function, which smooths the discontinuity and allows the policy to converge stably at the local maximum. But it is still very challenging for SAC to find the global optimum, for a similar reason as in the Bandit 2 environment. For the sequential decision-making problems, our method consistently outperforms the algorithms that maintain a single-modality policy in the move environments by a large margin and finds the global optima, as shown in Figure 2b-(3)-(5). We observed that REINFORCE went to the left-side mode in Move 1 and Move 2, as at the starting state there is no reward signal pointing toward the global optimum in the upper part of the state space; its variance is also high due to the increased problem dimension. Adam did not make any progress in Move 1, as there is no gradient around the initial state. The Move 3 environment includes physics-based contacts in the dynamics, yet our method can still solve it with analytical gradients in the presence of the discontinuities caused by contacts.

4.1.3. HOW LATENT DISTRIBUTIONS AFFECT THE TRAJECTORY MODELING?

Modeling the trajectories with a sequence {z_i} instead of a single latent variable sampled before the trajectory τ helps factorize the trajectory space and build a more compact representation. We construct an environment containing four obstacles to demonstrate this effect in Figure 3. The agent, starting at the bottom left, must avoid the obstacles and reach the goal at the top-right corner. We study the case where z is a discrete variable. We first plot the state and action distributions for policies learned by sampling a single categorical z before the trajectory τ in Figure 3a, where different colors represent different values of z. Though a categorical distribution can represent various paths toward the goal, its expressive power is constrained by the number of latent samples. In contrast, as shown in Figure 3b, we can model the distribution with a sequence z_1, z_{K+1}, z_{2K+1}, ..., z_{NK+1}. This brings compositionality: the agent can select various z sequences to generate trajectories that better cover the state space. The action distribution on the right successfully captures the symmetry of the optimal trajectories and the pattern of moving along different directions, which helps the policy cover the whole state space better than a single categorical distribution.
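A rollout with per-segment latents can be sketched as below, where a new latent is drawn every K steps from q_θ(z_t|s_t, z_{t-1}); the `env`, `policy`, and `latent_policy` interfaces are hypothetical placeholders, not the experimental code.

```python
def rollout_with_segment_latents(policy, latent_policy, env, horizon, K):
    """Resample the latent every K steps (z_1, z_{K+1}, z_{2K+1}, ...), so different
    latent sequences compose into different families of paths around the obstacles."""
    s = env.reset()
    z, total_log_prob, total_return = None, 0.0, 0.0
    for t in range(horizon):
        if t % K == 0:                         # draw a new option-like latent for this segment
            z, log_p_z = latent_policy(s, z)   # q_theta(z_t | s_t, z_{t-1})
            total_log_prob = total_log_prob + log_p_z
        a, log_p_a = policy(s, z)              # q_theta(a_t | s_t, z_t)
        total_log_prob = total_log_prob + log_p_a
        s, r, _ = env.step(a)
        total_return = total_return + r
    return total_return, total_log_prob
```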

4.2. EVALUATION ON MANIPULATION TASKS WITH DIFFERENTIABLE PHYSICS

We evaluate our approach in several physical environments to demonstrate its potential for trajectory optimization. Figure 4 compares our approach against a gradient-based trajectory optimizer with a single Gaussian head in three environments. The Grasp environment, based on Nimble Physics (Werling et al., 2021), contains three balls. The agent needs to control a 3-DoF gripper to grasp the green ball and pick it up. However, the agent only knows that it needs to grasp an object, and it does not receive any reward before lifting one. We use a reward that minimizes the distance between the gripper and the nearest ball, which creates a local minimum encouraging the gripper to touch the nearest ball in front of it, as shown in Figure 4. The first-order gradient-based trajectory optimization gets stuck there. In contrast, our method captures the multi-modality of the reward landscape, approaches different balls, and exploits the different options (touching different balls) simultaneously. In the end, our method successfully finds the solution that picks up the green ball. The Rope and Cut environments are two soft-body environments built with PlasticineLab (Huang et al., 2020), each containing a 3-DoF rigid manipulator. In the Rope environment, an agent needs to push the left side of a rope forward. Similarly, in the Cut environment, an agent needs to cut off the left end of a rope. However, the manipulators are initialized in the middle of the objects, far away from the contact points required to finish the tasks, leading to local optima (Li et al., 2022) as shown in Fig. 4. In contrast, our method can explore different contact points and find correct solutions in the end. These experiments show the potential of our approach for learning optimal policies with differentiable physics.

5. CONCLUSION AND FUTURE WORK

We derive a framework that models the policy of continuous RL as a multi-modality distribution within the variational inference framework. Under this framework, we also derive a Reparameterized Policy Gradient Theorem, which enjoys the advantages of both classical sample-based methods (as in REINFORCE) and trajectory optimization methods (which require the support of differentiable world models). Our framework opens a new avenue for continuous RL. It has deep connections with diffusion models, decision sequence modeling, and differentiable physics techniques. In the future, we are interested in exploring how sequence modeling techniques, such as transformers, can be used to model the policy in our framework. We are also interested in testing our method on more complicated tasks, such as dexterous object manipulation. Finally, our method can be extended to offline RL and off-policy RL setups. We leave these explorations to future work.

Figure 4: Visualization of policy distributions for manipulation with differentiable physics. We qualitatively compare the Gaussian policy learned by Adam and the multi-modality policy learned by our method. On the left, we overlay final states from three trajectories of the Gaussian policy. On the right, we draw three samples (trajectories) from our policy and visualize the final state of each trajectory in a separate plot. The three final states from the Gaussian policy are similar, while the final states of trajectories from our policy are very different. Additionally, one of the modes of our method solves the problem (the rightmost column), while the Gaussian policy gets stuck in a local minimum.

REPRODUCIBILITY STATEMENT

We will provide an open-source implementation of our method on GitHub and will add more illustrative examples and demos to our website. The hyperparameters used in the experiment section can be found in Table 2 in the appendix. The theoretical results are detailed in multiple sections of the appendix.

A PRELIMINARY

Markov decision process. A Markov decision process (MDP) is a tuple (S, A, P, R), where S is the state space and A is the action space. p(s'|s, a) is the transition probability of moving from state s to another state s' after taking action a. The function R(s, a, s') computes a reward per transition. A policy π(a|s) selects an action distribution according to the state s. Executing a policy π starting from an initial state s_1 with density p(s_1) results in a trajectory τ, which is a sequence of states and actions {s_1, a_1, s_2, ..., s_t, a_t, ...} where a_t ∼ π(a|s = s_t) and s_{t+1} ∼ p(s|s = s_t, a = a_t). We also use the term environment to refer to an MDP in an RL problem. The discounted return of a trajectory is $R_\gamma(\tau) = \sum_{t=1}^{\infty} \gamma^t R(s_t, a_t, s_{t+1})$, where 0 < γ < 1 is the discount factor that ensures the series converges. The goal of reinforcement learning (RL) is to find a parameterized policy π_θ that maximizes the expected return $\mathbb{E}_{s_1\sim p(s_1)}[V^{\pi_\theta}(s_1)] = \mathbb{E}_{\tau\sim\pi_\theta, s_1\sim p(s_1)}[R_\gamma(\tau)]$, where $V^{\pi_\theta}(s_1)$ is the value of the state s_1 under the policy π_θ. For simplicity, we focus on optimization problems in finite-horizon MDPs in this paper. This means that we only consider the first T states and actions {s_1, a_1, ..., s_T, a_T, s_{T+1}}. In this case, we can directly optimize the total reward $R(\tau) = \sum_{t=1}^{T} R(s_t, a_t, s_{t+1})$. Defining the measure and ensuring the existence of policy gradients for infinite-horizon MDPs with differentiable physics requires stricter conditions, and we leave it as future work.

Policy gradient and zeroth-order gradient estimate. REINFORCE (Sutton & Barto, 2018) approximates the expected return with Monte-Carlo sampling and optimizes the policy parameter θ with a zeroth-order gradient estimate, which can be written as

$$\frac{\partial \mathbb{E}_{s_1\sim p(s_1)}[V^{\pi_\theta}(s_1)]}{\partial \theta} \approx \sum_{i=1}^{N_{\text{samples}}} R(\tau^i)\nabla_\theta \log \pi_\theta(a^i),$$

where N_samples is the number of sampled trajectories, τ^i and a^i = {a^i_t}_{1 ≤ t ≤ T} are the trajectory and action sequence of the i-th sample, and the log-likelihood is $\log \pi_\theta(a^i) = \sum_{1\le t\le T} \log \pi_\theta(a^i_t|s^i_t)$.
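The zeroth-order estimate can be written in a few lines for a 1D Gaussian policy; the vectorized reward function and the added baseline are illustrative choices, not part of the definition above.

```python
import numpy as np


def reinforce_gradient(theta, reward_fn, n_samples=1024, sigma=0.3, seed=0):
    """Zeroth-order (REINFORCE) estimate of d/d_theta E_{a ~ N(theta, sigma^2)}[R(a)] for a 1D bandit."""
    rng = np.random.default_rng(seed)
    a = rng.normal(theta, sigma, size=n_samples)
    r = reward_fn(a)                         # reward_fn is assumed to accept a numpy array
    score = (a - theta) / sigma ** 2         # grad_theta log N(a; theta, sigma^2)
    baseline = r.mean()                      # variance reduction; does not bias the estimate
    return np.mean((r - baseline) * score)
```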

Differentiable simulation

In differentiable physics, we consider a special MDP where the transition probability p(s'|s, a) can be represented as δ(s' - h(s, a)), where δ is the Dirac delta function. The transition function h : S × A → S and the reward function R(s, a, s') are deterministic and differentiable. If we further assume that the policy π_θ(a|s) = δ(a - f_θ(s)) is a deterministic function that is differentiable w.r.t. θ and the input s, then each element of a sampled trajectory τ and the total reward R(τ) are also differentiable w.r.t. each s_t, a_t, and θ. We can optimize the reward R(τ) with gradient-based optimizers after computing the first-order path-wise gradient $\frac{\partial R(\tau)}{\partial \theta}$ through back-propagation. It has been argued that even if the environment is deterministic, it is still beneficial to optimize a stochastic policy (Suh et al., 2022; Xu et al., 2022) to leverage the advantages of stochastic sampling for non-smooth reward landscapes (Hu et al., 2019; Xu et al., 2022; Antonova et al., 2022). A typical choice is to add Gaussian noise w to the action sequence a and optimize it with the reparameterization trick (Kingma & Welling, 2013), which estimates the same policy gradient as REINFORCE by

$$\frac{\partial \mathbb{E}_{s_1\sim p(s_1)}[V^{\pi_\theta}(s_1)]}{\partial \theta} \approx \sum_{i=1}^{N_{\text{samples}}} \nabla_\theta R\big(\tau^i(a^i + w^i)\big),$$

where a^i and w^i are the actions and Gaussian noises used to generate the i-th trajectory τ^i.
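The corresponding first-order estimate backpropagates through the rollout. The toy dynamics and reward below stand in for a differentiable simulator; only the structure of the computation is meant to match the estimator above.

```python
import torch


def pathwise_gradient(theta, noise, horizon=10):
    """First-order estimate obtained by differentiating the rollout directly.
    theta: scalar policy parameter with requires_grad=True; noise: (n_samples, horizon) tensor."""
    returns = []
    for w in noise:
        s, total = torch.zeros(()), 0.0
        for t in range(horizon):
            a = theta + 0.1 * w[t]             # reparameterized action a = mu(theta) + sigma * w
            s = s + 0.1 * torch.tanh(a)        # toy differentiable dynamics h(s, a)
            total = total + (-(s - 1.0) ** 2)  # toy differentiable reward R(s, a, s')
        returns.append(total)
    (-torch.stack(returns).mean()).backward()  # d E[R] / d theta accumulates in theta.grad
    return theta.grad


# Usage sketch: theta = torch.zeros((), requires_grad=True); noise = torch.randn(64, 10)
# grad = pathwise_gradient(theta, noise)
```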

RL as Probabilistic Inference

The RL-as-inference framework (Todorov, 2006; 2008; Toussaint, 2009; Ziebart, 2010; Kappen et al., 2012; Levine, 2018) defines the optimality variable via $p(O|\tau) \propto e^{R(\tau)/T}$, where T is a temperature. It further defines a prior distribution over trajectories $p(\tau) = p(s_1)\prod_{t=1}^{T} p(a_t|s_t)p(s_{t+1}|s_t, a_t)$, where p(a_t|s_t) is a known prior action distribution, e.g., a Gaussian. We can then compute the probability of optimality $p(O) = \int p(O|\tau)p(\tau)\,d\tau$. The goal of the framework is to approximate the posterior distribution of optimal trajectories $p(\tau|O) = \frac{p(O|\tau)p(\tau)}{\int p(O|\tau)p(\tau)\,d\tau}$, which can be viewed as a softmax weighting over all prior trajectories based on their rewards, sharing the same spirit as path integral control and the softmax policy.
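As a worked example, a self-normalized approximation of p(τ|O) over trajectories drawn from the prior is simply a softmax of their returns divided by the temperature; the helper below is schematic.

```python
import numpy as np


def posterior_weights(returns, temperature=1.0):
    """Weights w_i proportional to exp(R(tau_i)/T) over trajectories sampled from the prior p(tau),
    i.e. a softmax weighting that approximates the posterior p(tau|O)."""
    logits = np.asarray(returns, dtype=float) / temperature
    logits -= logits.max()                   # numerical stability
    w = np.exp(logits)
    return w / w.sum()
```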

B MATHEMATICAL NOTATIONS

Notation : Explanation
z : The latent variable
τ : The sampled trajectory
p(O|τ) : Optimality of a trajectory τ
p(τ) : Prior distribution of trajectories (for example, random actions)
p(O) : ∫ p(O|τ)p(τ)dτ
p(τ|O) : The posterior policy we want to model
p_ϕ(z|τ) : The introduced auxiliary encoder
q_θ(z, τ) : Joint distribution of z, τ obtained by sampling from the reparameterized policy in the environment p(s_{t+1}|s_t, a_t)
q_θ(a_t|s_t, z) or q_θ(a_t|s_t, z_t) : The policy that selects actions at different steps
q_θ(z|s_1) or q_θ(z_t|s_t, z_{t-1}) : The policy that samples the latent z
p_θ(z|τ) : The probability of sampling z (or all z_i) given τ in q_θ
p_θ(z|s_1) : The marginal distribution of z in q_θ conditioned on s_1
p_θ(τ) : The marginal distribution of τ in q_θ
p_θ(τ|z) : The marginal distribution of τ conditioned on z in q_θ
R_elbo : The variational lower bound defined in Eq. 1
f_θ(s_t, z, t) : The deterministic policy function
h(s_t, a_t) : The deterministic state transition function
q_θ(z|τ) : Shorthand for q_θ(z, τ_θ) in deterministic environments
τ_θ or τ_θ(z, s_1) : In a deterministic environment, the trajectory becomes a function of θ, z, and s_1
∇_θ R_elbo(τ_θ) : The path-wise gradient obtained from the differentiable environment

C CONNECTION WITH GENERATIVE MODELS

Many generative models are based on the ELBO:
$$\log p(x) = \mathbb{E}_{z\sim q(z)}[\log p(x, z) - \log q(z)] + D_{\mathrm{KL}}(q(z)\,\|\,p(z|x)).$$
One can compare our approach with other generative models.

VAE defines p_θ(x, z) = p_θ(x|z)p(z) and q(z) = q_ϕ(z|x), and then optimizes θ, ϕ jointly to maximize the ELBO. By doing so, q_ϕ(z|x) has to align with the true posterior p_θ(z|x). Thus
$$\log p(x) \geq \mathbb{E}_{z\sim q_\phi(z|x)}\left[\log p_\theta(x|z) + \log p(z) - \log q_\phi(z|x)\right].$$

The Expectation-Maximization algorithm (EM) (Dempster et al., 1977) assumes p_θ(x, z) = p_θ(x|z)p_θ(z) with finitely many values of z, so we can compute the posterior p_θ(z|x) explicitly and treat it as q_ϕ. The E-step computes p_θ(z|x) to solve $\max_\phi \log p_\theta(x) - D_{\mathrm{KL}}(q_\phi(z|x)\,\|\,p_\theta(z|x))$; the M-step, fixing ϕ, finds $\max_\theta \mathbb{E}_{q_\phi}[\log p_\theta(x, z)] - \mathbb{E}_{q_\phi}[\log q_\phi(z|x)]$, which maximizes the ELBO.

In maximum entropy RL, we have p(O, τ) = p(O|τ)p(τ) defined by the reward, and we optimize q_θ(τ|O) only. The ELBO becomes a maximum-entropy objective
$$\mathbb{E}_{\tau\sim\pi}\left[\log p(O|\tau) + \log p(\tau) - \log \pi(\tau)\right].$$

Table 1: Comparison of different algorithms that optimize ELBO bounds for inference.

| Method | Latent | Encoder q(z|x) | Joint p(x, z) | MLE objective |
|---|---|---|---|---|
| VAE | z | p_ϕ(z|x) | p_θ(x|z)p(z) | p(x) |
| EM | z | max_ϕ log p_θ(x) - D_KL(q_ϕ(z|x)||p_θ(z|x)) | p_θ(x|z)p_θ(z) | p(x) |
| Diffusion | {x_t}_{t≥1} | ∏_{t=1}^T N(x_t; √(1-β_t) x_{t-1}, β_t I) | p(x_T) ∏_{t≥1} p_θ(x_{t-1}|x_t) | p(x_0) |
| MaxEntRL | τ | π_θ(τ) | p(O|τ)p(τ) | p(O) |
| RPG | τ, z | q_θ(z, τ) | p(O|τ)p_ϕ(z|τ)p(τ) | p(O) |

D IMPLEMENTATION DETAILS

The algorithm is shown in Algorithm 1.

Algorithm 1: Variational Reparameterized Policy
1: Input: p_ϕ, q_θ(z, τ).
2: Initialize p_ϕ, q_θ.
3: while time remains do
4:     Sample a start state s_1.
5:     Sample z, τ from the latent-code policy q_θ.
6:     Compute the variational lower bound R_elbo based on Eq. 1.
7:     Update the auxiliary encoder p_ϕ to maximize R_elbo.
8:     Update the reparameterized policy q_θ to maximize R_elbo with Theorem 1.
9: end while

The learning rate (3 × 10^-4) is the same across all environments. Adam uses exactly the same hyperparameters as our method, except that the dimension of the latent variable is 1. REINFORCE uses exactly the same hyperparameters as our method, except that the dimension of the latent variable is 1 and the policy gradient is estimated with the REINFORCE estimator.
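A compact training-loop sketch of Algorithm 1 is given below. The `sample_trajectory`, `compute_r_elbo`, and `encoder.log_prob` helpers are assumed placeholders, and the batching and detaching choices are one reasonable way to realize steps 4-8, not the released implementation.

```python
import torch


def train_vrp(policy, encoder, env, iterations=1000, lr=3e-4):
    """Alternate the two updates of Algorithm 1: fit the auxiliary encoder p_phi,
    then update the reparameterized policy q_theta with the Theorem 1 surrogate."""
    opt_policy = torch.optim.Adam(policy.parameters(), lr=lr)
    opt_encoder = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(iterations):
        # Steps 4-5: sample (z, tau) from q_theta; log_q = log q_theta(z, tau).
        z, traj, log_q = sample_trajectory(policy, env)

        # Step 7: maximize log p_phi(z|tau); detach so this update only touches the encoder.
        enc_loss = -encoder.log_prob(z, traj.detach()).mean()
        opt_encoder.zero_grad()
        enc_loss.backward()
        opt_encoder.step()

        # Step 6: R_elbo of Eq. 1, computed through the differentiable rollout.
        r_elbo = compute_r_elbo(traj, z, encoder, log_q)

        # Step 8: Theorem 1 surrogate = reward-weighted term + path-wise term.
        policy_loss = -(r_elbo.detach() * log_q + r_elbo).mean()
        opt_policy.zero_grad()
        policy_loss.backward()
        opt_policy.step()
```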

E BIAS OF THE FIRST-ORDER GRADIENT ESTIMATOR

In this section, we analyze the difference between two types of policy gradient estimators, namely the zeroth-order (ZoBG) and first-order (FoBG) gradient estimators (Suh et al., 2022). We establish a necessary condition for the two to be equivalent when both the action and the parameter are 1-dimensional. From the 1D case, we also see the difficulty of constructing an analogous estimator in higher dimensions.

Lemma 1 (Leibniz integral rule, general form). Let f(x, t) be a function such that both f(x, t) and its partial derivative ∂f(x, t)/∂x are continuous in t and x in some region of the xt-plane, including a(x) ≤ t ≤ b(x), x_0 ≤ x ≤ x_1. Also suppose that the functions a(x) and b(x) are both continuous and have continuous derivatives for x_0 ≤ x ≤ x_1. Then, for x_0 ≤ x ≤ x_1,
$$\frac{d}{dx}\int_{a(x)}^{b(x)} f(x, t)\,dt = \int_{a(x)}^{b(x)} \frac{\partial}{\partial x} f(x, t)\,dt + \frac{d\,b(x)}{dx} f(x, b(x)) - \frac{d\,a(x)}{dx} f(x, a(x)).$$

Lemma 2 (Jump discontinuity, 1D version). Let f(x, t) be a function such that both f(x, t) and its partial derivative ∂f(x, t)/∂x are continuous in t and x in some region of the xt-plane. f has a jump discontinuity if and only if there exist b_i = q_i(x) such that
$$\lim_{t\to q_i(x)^-} f(x, t) \neq \lim_{t\to q_i(x)^+} f(x, t).$$
We let i index the i-th jump discontinuity of f. For convenience, we abbreviate the above as f(b_i^-) ≠ f(b_i^+) in later parts of this section.

Corollary 1 (Leibniz integral rule, 1D region form). Let f(x, t) be a function such that both f(x, t) and its partial derivative ∂f(x, t)/∂x are continuous in t and x in some region of the xt-plane, including t ∈ T = ℝ and x ∈ X = ℝ. Also suppose that the functions q_i(x) are continuous and have continuous derivatives for x ∈ ℝ. Then, by Lemmas 1 and 2, for x ∈ X,
$$\frac{d}{dx}\int_{T} f(x, t)\,dt = \underbrace{\int_{T} \frac{\partial}{\partial x} f(x, t)\,dt}_{\text{interior part}} + \underbrace{\sum_i \frac{\partial q_i(x)}{\partial x}\Big[f\big(x, q_i(x)^-\big) - f\big(x, q_i(x)^+\big)\Big]}_{\text{boundary part}}.$$
For convenience, we refer to the first term as the interior part and the second term as the boundary part in the rest of this section.

Theorem 2 (Policy Gradient Theorem, zeroth-order version). Let π_θ(a) be a function such that π_θ(·) ∈ C⁰ for all θ ∈ ℝ and π_(·)(a) ∈ C¹ for all a ∈ A = ℝ. Let R(a) be a function with jump discontinuities, defined for all a ∈ A. Define the integrand π_θ(a)R(a) = f(θ, a); by Corollary 1,
$$\frac{d}{d\theta}\int_{A} \pi_\theta(a)R(a)\,da = \int_{A} \frac{\partial \pi_\theta(a)}{\partial \theta} R(a)\,da + \sum_i \frac{\partial q_i(\theta)}{\partial \theta}\Big[f\big(\theta, q_i(\theta)^-\big) - f\big(\theta, q_i(\theta)^+\big)\Big].$$

Case 1: If π_θ(b_i) ≠ 0, by Lemma 2 the definition of the i-th jump discontinuity can be rewritten as
$$q_i(\theta) = b_i:\quad f(\theta, b_i^-) \neq f(\theta, b_i^+) \;\Leftrightarrow\; \pi_\theta(b_i^-)R(b_i^-) \neq \pi_\theta(b_i^+)R(b_i^+) \;\Leftrightarrow\; R(b_i^-) \neq R(b_i^+),$$
since π_θ(·) ∈ C⁰ for all θ; this means q_i(θ) is constant with respect to θ, so the boundary part vanishes.

Case 2: If π_θ(b) = 0, then by Lemma 2, f(θ, b) = π_θ(b)R(b) = 0, which means f(θ, b^-) = f(θ, b^+) = 0 and f does not have a jump discontinuity at b.

Lemma 3 (Law of the unconscious statistician, continuous random variable). Let x be a random variable and let y = g(x) be a function of this random variable. Then
$$\mathbb{E}[g(x)] = \int_{X} f_X(x)g(x)\,dx.$$

Theorem 3 (Policy Gradient Theorem, first-order version). Let ω be a random variable such that a = h(θ, ω), with probability density p(ω) ∈ C⁰ and ∂h(θ, ω)/∂ω ≠ 0 for all θ. Let f(θ, ω) = p(ω)R(h(θ, ω)), with the i-th jump discontinuity of f located at q_i(θ). The first-order gradient is
$$\frac{d}{d\theta}\int_{A} \pi_\theta(a)R(a)\,da = \int_{\Omega} p(\omega)\frac{\partial}{\partial \theta}R(h(\theta, \omega))\,d\omega + \sum_i \frac{\partial q_i(\theta)}{\partial \theta}\Big[f\big(\theta, q_i(\theta)^-\big) - f\big(\theta, q_i(\theta)^+\big)\Big].$$
Proof. Reparameterize a = h(θ, ω) and substitute into f by definition:
$$\frac{d}{d\theta}\int_{A} \pi_\theta(a)R(a)\,da = \frac{d}{d\theta}\int_{\Omega} p(\omega)R(h(\theta, \omega))\,d\omega = \frac{d}{d\theta}\int_{\Omega} f(\theta, \omega)\,d\omega.$$
By Lemma 2, the definition of the i-th jump discontinuity can be rewritten as
$$q_i(\theta) = b_i:\quad f(\theta, b_i^-) \neq f(\theta, b_i^+) \;\Leftrightarrow\; p(b_i^-)R(h(\theta, b_i^-)) \neq p(b_i^+)R(h(\theta, b_i^+)) \;\Leftrightarrow\; R(h(\theta, b_i^-)) \neq R(h(\theta, b_i^+)),$$
since p(ω) ∈ C⁰. Suppose ω ∼ N(0, σ), h(θ, ω) = θ + σω, and there exists an action a_i = h(θ, b_i) such that R(a_i^-) ≠ R(a_i^+). Then the definition of q_i(θ) and its evaluation at a local neighbor θ + Δθ can be rewritten as
$$q_i(\theta) = b_i:\; h(\theta, b_i) = a_i, \qquad q_i(\theta + \Delta\theta) = b_i^{\Delta}:\; h(\theta + \Delta\theta, b_i^{\Delta}) = a_i;$$
since the function R is not parameterized, the location a_i of its i-th jump discontinuity does not change. Substituting h(θ, ω) = θ + σω and solving for b_i and b_i^Δ gives b_i = (a_i - θ)/σ and b_i^Δ = (a_i - (θ + Δθ))/σ. Differentiating q_i(θ) gives
$$\frac{dq_i(\theta)}{d\theta} = \lim_{\Delta\theta\to 0}\frac{q_i(\theta + \Delta\theta) - q_i(\theta)}{\Delta\theta} = \lim_{\Delta\theta\to 0}\frac{b_i^{\Delta} - b_i}{\Delta\theta} = \lim_{\Delta\theta\to 0}\frac{-\Delta\theta/\sigma}{\Delta\theta} = -\frac{1}{\sigma} \neq 0.$$
In general, ignoring the boundary term when estimating the left-hand side with a first-order estimator leads to a bias due to the discontinuity of the reward/value function. Therefore, a gradient estimator that only covers the interior part of the first-order gradient is a biased estimator of the original policy gradient; the bias is the boundary term.

F PROOF OF THE REPARAMETERIZED POLICY GRADIENT THEOREM

F.1 REGULARITY CONDITIONS

Let θ ∈ Θ = (-ϵ, ϵ) ⊂ ℝ without loss of generality, and let Ω = {z ∈ Z | q_θ(z, τ_θ) > 0} be a measure space. For z ∈ Ω and θ ∈ [-ϵ, ϵ], we make the following assumptions.

Assumption 1 (Bounded ELBO). The reward R, the prior density log p(τ), log q_θ(z|τ), and log p_ϕ(z|τ) are bounded for all τ.

Assumption 2 (Lipschitz condition). q_θ(z|τ), log q_θ(z|τ), log p_ϕ(z|τ), the prior distribution log p(τ), the action policy f_θ(s_t, z, t), the reward R, and the dynamics h(a_t, s_t) are Lipschitz continuous w.r.t. any s, a and continuous z. In addition, q_θ is Lipschitz continuous w.r.t. θ.

Assumption 3 (Lebesgue integrability). The probability density q_θ(z, τ_θ) and its derivative |∇_θ q_θ(z, τ_θ)| are Lebesgue-integrable jointly over Ω, and their integrals are bounded for all θ ∈ [-ϵ, ϵ].

Assumption 1 is easy to guarantee when the state space and the action space are compact and the time horizon T is finite, because continuous functions on a compact set are always bounded. Similarly, for bounded inputs, common neural networks also satisfy Assumption 2; otherwise, we can simply clamp the inputs before feeding them into the network, as we do not need the network and the environment to be continuously differentiable as in Lemma 1 but only absolutely continuous. We require the dynamics h(a_t, s_t) to be Lipschitz continuous, ruling out chaotic systems. Assumption 3 holds for common distributions. For example, for a Gaussian distribution $p(z) = e^{-z^2/(2\sigma^2)}/(\sqrt{2\pi}\sigma)$, the partial derivative w.r.t. the parameter σ, $\frac{\partial p}{\partial \sigma}(z) = e^{-z^2/(2\sigma^2)}(-\sigma^2 + z^2)/(\sqrt{2\pi}\sigma^4)$, is simply a function of the exponential family and can be verified to be absolutely integrable on ℝ and bounded for any domain σ ∈ [a, b], a, b > 0.
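The boundary-term bias is easy to verify numerically on a step reward: the zeroth-order estimator is consistent for the true gradient, while the first-order estimator only sees the interior part, which is zero. The toy setup below (a Heaviside reward and a Gaussian reparameterization) is illustrative.

```python
import numpy as np


def compare_estimators(theta=0.0, sigma=0.5, n=200_000, seed=0):
    """Compare ZoBG and FoBG for R(a) = 1[a > 0] with a = theta + sigma * w, w ~ N(0, 1).
    True gradient: d/d_theta E[R] = Gaussian density of the jump location a = 0 under N(theta, sigma^2)."""
    rng = np.random.default_rng(seed)
    w = rng.standard_normal(n)
    a = theta + sigma * w
    r = (a > 0).astype(float)
    zobg = np.mean(r * w / sigma)            # score-function (zeroth-order) estimate; consistent
    fobg = 0.0                               # dR/da = 0 almost everywhere, so the interior part vanishes
    true_grad = np.exp(-theta ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return zobg, fobg, true_grad
```

Running it shows the zeroth-order estimate close to the true gradient while the first-order estimate misses it entirely; the gap is exactly the boundary term derived above.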

F.2 PROOF OF THEOREM 1

To differentiate under the integral sign, we use the following lemma (cheng; L'Ecuyer, 1995).

Lemma 4 (Leibniz integral rule). Let X be an open subset of ℝ, and Ω be a measure space. Suppose that a function f : X × Ω → ℝ satisfies the following conditions:
1. f(x, ω) is a measurable function of x and ω jointly and is integrable over ω for almost all x ∈ X fixed.
2. For almost all ω ∈ Ω, f(x, ω) is an absolutely continuous function of x (the derivative ∂f(x, ω)/∂x exists almost everywhere).
3. ∂f/∂x is "locally integrable", that is, for all compact intervals [a, b] ⊂ X,
$$\int_a^b \int_\Omega \left|\frac{\partial}{\partial x} f(x, \omega)\right| d\omega\, dx < \infty.$$
Then for almost every x ∈ X, the derivative exists and
$$\frac{d}{dx}\int_\Omega f(x, \omega)\,d\omega = \int_\Omega \frac{\partial}{\partial x} f(x, \omega)\,d\omega.$$

With this lemma, Theorem 1 is easy to prove.

Proof. Define the measurable function f(θ, z) = q_θ(z, τ_θ)R_elbo(τ_θ). Since R_elbo(τ_θ) is bounded by a constant M by Assumption 1 and q_θ(z, τ_θ) is positive and Lebesgue-integrable by Assumption 3, the absolute value of their product is bounded by q_θ(z, τ_θ)M, so f(θ, z) is Lebesgue-integrable for every θ. By Assumption 2 and because the horizon is finite, the trajectory τ is a Lipschitz function of θ and z, and so are R_elbo(τ_θ), q_θ(z, τ_θ), and their product f(θ, z). Thus ∂f(θ, z)/∂θ exists almost everywhere for all z ∈ Ω. To bound the derivative, let R_elbo(τ_θ) be L-Lipschitz w.r.t. θ and note that
$$|\nabla_\theta f(\theta, z)| = |R_{\text{elbo}}(\tau_\theta)\nabla_\theta q_\theta(z, \tau_\theta) + q_\theta(z, \tau_\theta)\nabla_\theta R_{\text{elbo}}(\tau_\theta)| \leq M|\nabla_\theta q_\theta(z, \tau_\theta)| + L|q_\theta(z, \tau_\theta)|,$$
which is dominated by the sum of two Lebesgue-integrable functions whose integrals are bounded for all θ ∈ Θ by Assumption 3. Thus the gradient is locally integrable. Applying Lemma 4, we can differentiate under the integral sign and finish the proof:
$$
\begin{aligned}
\nabla_\theta \mathbb{E}[R_{\text{elbo}}(\tau_\theta)] &= \int_z \nabla_\theta\big(q_\theta(z, \tau_\theta)R_{\text{elbo}}(\tau_\theta)\big)\,dz \\
&= \int_z R_{\text{elbo}}(\tau_\theta)\nabla_\theta q_\theta(z, \tau_\theta) + q_\theta(z, \tau_\theta)\nabla_\theta R_{\text{elbo}}(\tau_\theta)\,dz \\
&= \int_z q_\theta(z, \tau_\theta)\big[R_{\text{elbo}}(\tau_\theta)\nabla_\theta \log q_\theta(z, \tau_\theta) + \nabla_\theta R_{\text{elbo}}(\tau_\theta)\big]\,dz.
\end{aligned}
$$

We want to emphasize the importance of the Lipschitz continuity that helps bound the integral. Informally, the Lipschitz condition is directly related to the scale of the path-wise gradient. When the Lipschitz constant is too large, the environment will generate exploding gradients or contain sharp changes in the reward landscape, similar to a discontinuity, causing empirical bias (Suh et al., 2022). Besides, when the reward landscape has discontinuous points, there is a bias caused by the motion of the discontinuity, as suggested by the Leibniz integral rule. We refer interested readers to Appendix E for a detailed illustration of the difference between the zeroth-order and first-order gradient estimates under the expectation, which we did not find in the literature. Note that we do not require continuity of the gradients, so activation functions such as ReLU are supported. The finite-horizon assumption is also necessary for the theorem to hold; otherwise, the gradients ∇_θ R_elbo(τ_θ) may be unbounded if we do not set a suitable discount factor.

The agent always receives a penalty proportional to its distance to the closest goal, which motivates it to go upward instead of moving right. However, if the agent chooses the longer right path, it receives a higher bonus when it reaches the goal. Such a local optimum will trap a single-modality agent, as shown in Figure 6(c). When we remove the latent space and the latent policy p(z|s_1), the baseline single-modality RL policy (MBRL) gets stuck in the local optimum, while our approach (MBRPG), thanks to its ability to maintain a multi-modality trajectory distribution, will attempt to go right even when doing so is sub-optimal at the beginning, and thus has a higher chance of finding the global optimum. We also test our method on an 11-DoF mobile robot opening a cabinet, shown in Figure 5(b). It is easier for the robot to open the left door, but it receives a higher reward when it opens the right one. Similarly, the normal RL agent modeled by a Gaussian distribution (MBRL) fails to explore a way to open the right door. In contrast, our method explores the two directions simultaneously and eventually opens the right one, resulting in higher rewards, as shown in Figure 6.

H INSUFFICIENCY OF THE SINGLE MODALITY GAUSSIAN POLICY

Our method is motivated by the insufficiency of the Gaussian policy in solving non-convex problems. A Gaussian policy with large entropy has proven effective in many continuous control problems. However, even if the Gaussian distribution has a very high initial variance, it may still fail to solve many tasks. The Gaussian parameterization can only model a single-modality action distribution, which limits the policy update in both REINFORCE and Adam and prevents them from converging to the global optima even though they can explore the whole reward landscape. Without loss of generality, let us take the 1D continuous bandit problem as an example. In this case, a Gaussian policy is fully determined by its mean µ and variance σ². Both REINFORCE and Adam maximize the expected reward $\mathbb{E}_{a\sim\mathcal N(\mu,\sigma)}[R(a)]$ by computing its policy gradient $\nabla\mathbb{E}_{a\sim\mathcal N(\mu,\sigma)}[R(a)]$. We assume R(a) is Lipschitz continuous and differentiable almost everywhere. As we show in the paper, the gradients computed by the two methods are the same. If the standard deviation σ is not large enough, a Gaussian policy suffers from the non-convexity issue just like a deterministic optimizer. It is easy to see that
$$\nabla_\mu \mathbb{E}_{a\sim\mathcal N(\mu,\sigma)}[R(a)] = \nabla_\mu \int \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(\mu-a)^2}{2\sigma^2}}R(a)\,da = \nabla_\mu F_\sigma(\mu), \quad\text{where } F_\sigma(\mu) = \int \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{(\mu-a)^2}{2\sigma^2}}R(a)\,da$$
is a deterministic function obtained by passing the original reward landscape through a Gaussian filter. As shown in Figure 7, optimizing R(a) with the policy gradient is equivalent to running gradient descent on F_σ(µ) with a sufficient number of samples. So if a non-convex reward R(a) is still non-convex after being smoothed by a Gaussian kernel, the random perturbation provided by the Gaussian noise will not help the policy jump out of local optima.

We now consider a method that starts with a very large standard deviation σ and gradually reduces it to zero, as shown in Figure 7, where we plot F_σ for different standard deviations σ: the leftmost panel shows F_σ for large σ, and the rightmost is close to the original reward function. We use the red dotted line to denote the optimum µ*(σ) = arg max_µ F_σ(µ) for each σ. We can see that for very large σ, F_σ is convex, and we can assume the policy gradient method finds the optimal µ*(σ). However, µ* for very large σ is not necessarily close to the global optimum µ*(0). In the beginning, the right mode is optimal as it has a high average reward (which can be visually measured by the area under the right mode). But when we reduce σ gradually, the global maximum may drastically change from the right mode to the left mode, as shown in the red block (middle two panels) of Figure 7, because the left mode has a higher extreme value that exceeds the right. However, at the moment the optimum changes to the left mode, F_σ is already non-convex; the policy gradient method has little chance of discovering the new global optimum and gets stuck at the right local optimum. This behavior will devastate all gradient-based algorithms that optimize a single-modality policy (REINFORCE, Adam, SAC, and PPO). To illustrate the insufficiency of the Gaussian policy with a large standard deviation, we vary the initial standard deviation for REINFORCE and anneal it to 0 over a sufficiently large number of training steps. Notice that the global optimum has a reward of 1.10.
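The smoothing argument can be checked with a quick Monte-Carlo sweep over µ; the bimodal toy reward below (a narrow left peak of height 1.10 and a broad right mode) only illustrates the effect shown in Figure 7 and is not the exact landscape used in the experiments.

```python
import numpy as np


def smoothed_landscape(reward_fn, mus, sigma, n=20_000, seed=0):
    """F_sigma(mu) = E_{a ~ N(mu, sigma^2)}[R(a)]: the landscape seen by a Gaussian policy."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(n)
    return np.array([reward_fn(mu + sigma * noise).mean() for mu in mus])


def reward(a):
    # Narrow-but-high left mode (height 1.10) vs. broad-but-lower right mode.
    return 1.10 * np.exp(-(a + 1.0) ** 2 / 0.02) + 0.8 * np.exp(-(a - 1.0) ** 2 / 0.5)


mus = np.linspace(-3.0, 3.0, 121)
for sigma in (2.0, 1.0, 0.5, 0.1):
    F = smoothed_landscape(reward, mus, sigma)
    print(sigma, mus[F.argmax()])  # the maximizer jumps from the right mode to the left as sigma shrinks
```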



Here we assume z contains only continuous variables to simplify the derivation. Adding discrete variables is straightforward by summing over all possibilities. δ is the Dirac delta function.



Hierarchical imitation learning Gupta et al. (2019); Pertsch et al. (2021); Shankar & Gupta (2020); Jiang et al. (2022); Lynch et al. (2020); Fang et al. (

(b) Training curve. X-axis: number of sampled states. Y-axis: average return. Error bars show standard deviations over 5 runs. Hyperparameters are available in Table 2.

Figure 3: A navigation task to demonstrate the effects of learning a sequence of z. An agent needs to move from the bottom left to the top right and avoid the green obstacles. Both the state and action space are 2D. Different colors represent different values of z or z i .

Theorem 2 (Policy Gradient Theorem, zeroth-order version). Let π_θ(a) be a function such that π_θ(·) ∈ C⁰ for all θ ∈ R and π_(·)(a) ∈ C¹ for all a ∈ A, with A = R. Let R(a) be a function with jump discontinuities, defined for all a ∈ A. Define the integrand f(θ, a) = π_θ(a)R(a); by Corollary 1

Case 2: If π_θ(b) = 0, then by Lemma 2, f(θ, b) = π_θ(b)R(b) = 0, which means f(θ, b⁻) = f(θ, b⁺) = 0, so f does not have a jump discontinuity at b.

Figure 5: Non-differentiable environments

(b) It is easier for the robot to open the left door, but it receives a higher reward when it opens the right one. Similarly, the standard RL agent modeled by a Gaussian distribution (MBRL) fails to discover a way to open the right door. In contrast, our method explores the two directions simultaneously and eventually opens the right one, resulting in higher rewards, as shown in Figure 6.

Figure 6: Experiments with a learned world model.

Figure 7: The landscape of the expected reward after a Gaussian filter with gradually decreasing standard deviation.

$$\nabla_\mu \mathbb{E}_{a\sim\mathcal{N}(\mu,\sigma^2)}[R(a)] = \nabla_\mu \int \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\mu-a)^2}{2\sigma^2}} R(a)\,da = \nabla_\mu F_\sigma(\mu), \quad \text{where } F_\sigma(\mu) = \int \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(\mu-a)^2}{2\sigma^2}} R(a)\,da.$$
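For clarity, the following short derivation (added here; it uses only the Lipschitz and almost-everywhere differentiability assumptions stated in Appendix H) shows that the score-function (REINFORCE) estimator and the reparameterized/pathwise estimator used by direct gradient ascent with Adam both compute the same quantity, namely the gradient of the smoothed landscape F_σ:

```latex
\begin{align*}
\nabla_\mu \mathbb{E}_{a\sim\mathcal{N}(\mu,\sigma^2)}[R(a)]
  &= \mathbb{E}_{a\sim\mathcal{N}(\mu,\sigma^2)}\!\big[R(a)\,\nabla_\mu \log \mathcal{N}(a;\mu,\sigma^2)\big]
   = \mathbb{E}_{a\sim\mathcal{N}(\mu,\sigma^2)}\!\Big[R(a)\,\tfrac{a-\mu}{\sigma^2}\Big]
   && \text{(score function / REINFORCE)}\\
  &= \nabla_\mu\, \mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)}\big[R(\mu+\sigma\epsilon)\big]
   = \mathbb{E}_{\epsilon\sim\mathcal{N}(0,1)}\big[R'(\mu+\sigma\epsilon)\big]
   && \text{(reparameterization / pathwise)}\\
  &= \nabla_\mu F_\sigma(\mu).
\end{align*}
```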

Comparison of different algorithms that optimize ELBO bounds for inference

Hyperparameters and rewards of our algorithms.

Algorithm 2 Model-based Variational Reparameterized Policy
1: Input: p_ϕ, π_θ, h_ψ, R_ψ, f_ψ, Q_ψ
2: while time remains do

G LEARNING A DIFFERENTIABLE WORLD MODEL

G.1 METHOD

To apply our method in a non-differentiable environment, we train a differentiable world model jointly with the policy optimization, yielding the Model-based Reparameterized Policy Gradient (MBRPG). Learning a world model has been proven data-efficient for policy optimization (Hafner et al., 2019; Schrittwieser et al., 2020; Ye et al., 2021; Hansen et al., 2022). Specifically, besides the encoder p_ϕ(z|s, a) and the policies π_θ(a|s, z) and π_θ(z|s_1), we additionally learn a deterministic dynamics network h_ψ(s, a), a reward network R_ψ(s, a), an observation encoder f_ψ(o), and a Q-network Q_ψ(s, a, z). In practice, they are all two-layer fully connected neural networks with a hidden dimension of 256. We also define α and β to weigh the entropy term and the cross-entropy term in the variational lower bound.

Given any z, any initial latent state s_i = f_ψ(o_i), and an arbitrary action sequence, we can use the learned dynamics network to generate the trajectory by unrolling s_{t+1} = h_ψ(s_t, a_t). Note that during the model rollout, we can either use actions a_t sampled from the current policy π_θ(a|s_t, z) or an action sequence {a^gt_t} sampled from the replay buffer. When the actions are sampled from the current policy π_θ(a|s_t, z), we obtain a Monte-Carlo estimate V_estimate(s_i, z) of the value of s_i, which can be used to optimize the policy π_θ.

To train the dynamics model, we sample trajectory segments of length K + 1, τ_{i:i+K} = {o_i, a^gt_i, r^gt_i, o_{i+1}, a^gt_{i+1}, r^gt_{i+1}, ..., o_{i+K}}, from the replay buffer and select a latent z. We then self-supervise the dynamics network to ensure state consistency and avoid reconstruction as in (Ye et al., 2021; Hansen et al., 2022), letting a_t = a^gt_t in the rollout; the resulting loss is given in Eq. (2), where ng(x) means stopping the gradient and L_1 = 1000, L_2 = L_3 = 0.5 are constants that balance the loss terms. The whole process is illustrated in Algorithm 2. For all experiments, we take K = 6.
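For concreteness, here is a minimal sketch of the K-step self-supervised model loss described above. It is an illustrative reading rather than the paper's implementation: the assignment of the weights L_1, L_2, L_3 to a reward term, a latent-state consistency term, and a Q-value term follows the EfficientZero/TD-MPC style of consistency training the text cites, and the names segment and q_target are assumptions.

```python
# Minimal sketch (not the authors' code) of the K-step self-supervised model loss.
import torch
import torch.nn.functional as F

L1, L2, L3 = 1000.0, 0.5, 0.5   # loss weights from the text; their assignment to terms is assumed

def model_loss(h_psi, R_psi, f_psi, Q_psi, segment, z, K=6):
    """segment: dict of tensors o[t], a_gt[t], r_gt[t], q_target[t] for t = 0..K."""
    s = f_psi(segment["o"][0])                    # initial latent state s_i
    loss = 0.0
    for t in range(K):
        a_t = segment["a_gt"][t]                  # ground-truth action from the buffer
        r_pred = R_psi(s, a_t)                    # predicted reward
        q_pred = Q_psi(s, a_t, z)                 # predicted Q-value (assumed third term)
        s = h_psi(s, a_t)                         # latent rollout s_{t+1} = h_psi(s_t, a_t)
        s_target = f_psi(segment["o"][t + 1]).detach()   # encode next observation, stop gradient (ng)
        loss = loss + L1 * F.mse_loss(r_pred, segment["r_gt"][t]) \
                    + L2 * F.mse_loss(s, s_target) \
                    + L3 * F.mse_loss(q_pred, segment["q_target"][t])
    return loss
```

The stop-gradient on the encoded next observation plays the role of ng(·) in Eq. (2): the targets are treated as fixed so the latent rollout is trained toward them without reconstructing observations.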

G.2 EXPERIMENT

We first compare our method with SAC (Haarnoja et al., 2018) on locomotion tasks (Cheetah-v3 and Humanoid-v3 in OpenAI Gym (Brockman et al., 2016)) in Figure 6 (a) and (b). We plot the learning curve of MBRPG together with SAC's performance after training for 3 million steps. By learning a model, our method achieves on-par performance with fewer samples: it requires only 1 million samples to reach a score of 15000 on Cheetah-v3 and only 0.5 million samples to reach 6000+ on Humanoid-v3. We attribute this to the efficiency of learning a model for policy optimization. However, we observed that removing the latent variables and the variational bound, which yields a baseline model-based RL algorithm (MBRL), does not affect performance on these tasks. We conjecture that these locomotion tasks do not require a multi-modal policy for exploration, so a Gaussian policy is sufficient.

To illustrate the effectiveness of our approach, we built two environments that require multi-modality exploration, as illustrated in Figure 5. In (a), the AntMove environment, an ant robot can move

Table 3: Final performance of REINFORCE with a linearly decreasing standard deviation.

Initial std       0.01        0.1         0.5         1           2           5           10
Expected reward   0.50±0.00   0.50±0.00   0.50±0.00   0.50±0.00   0.49±0.00   0.47±0.00   0.42±0.00

I ABLATION STUDIES

In this section, we first study the importance of the trajectory encoder to understand what role it plays in helping maintain a multi-modality trajectory distribution. Then, we show how the hyperparameter controlling the entropy reward weight on actions affects algorithms with different policy parameterizations.

Figure 8: Ablating the trajectory encoder.

In Figure 8, we show the importance of the trajectory encoder. When we do not optimize the mutual information between z and τ through log p_ϕ(z|τ) (Ours (no info loss)), our method degrades to almost the same behavior as Adam. The best-performing method is our method in its current form.

