EFFICIENT WASSERSTEIN NATURAL GRADIENTS FOR REINFORCEMENT LEARNING

Abstract

A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines.

1. INTRODUCTION

Defining efficient optimization algorithms for reinforcement learning (RL) that are able to leverage a meaningful measure of similarity between policies is a longstanding and challenging problem (Lee & Popović, 2010; Meyerson et al., 2016; Conti et al., 2018b). Many such works rely on similarity measures such as the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) to define procedures for updating the policy of an agent as it interacts with the environment. These are generally motivated by the need to maintain a small variation in the KL between successive updates in an off-policy context, in order to control the variance of the importance weights used in the estimation of the gradient. This includes work by Kakade (2002) and Schulman et al. (2015), who propose to use the Fisher Natural Gradient (Amari, 1997) as a way to update policies, using local geometric information to allow larger steps in directions where policies vary less; and the work of Schulman et al. (2017), which relies on a global measure of proximity using a soft KL penalty to the objective. While those methods achieve impressive performance, and the choice of the KL is well-motivated, one can still ask if it is possible to include information about the behavior of policies when measuring similarity, and whether this could lead to more efficient algorithms. Pacchiano et al. (2019) provide a first insight into this question, representing policies using behavioral distributions which incorporate information about the outcome of the policies in the environment. The Wasserstein Distance (WD) (Villani, 2016) between those behavioral distributions is then used as a similarity measure between their corresponding policies. They further propose to use such behavioral similarity as a global soft penalty to the total objective. Hence, like the KL penalty, proximity between policies is measured globally, and does not necessarily exploit the local geometry defined by the behavioral embeddings.
In this work, we show that substantial improvements can be achieved by taking into account the local behavior of policies. We introduce new, efficient optimization methods for RL that incorporate the local geometry defined by the behavioral distributions for both policy gradient (PG) and evolution strategies (ES) approaches. Our main contributions are as follows: (1) We leverage recent work (Li & Montufar, 2018a;b; Li, 2018; Li & Zhao, 2019; Chen & Li, 2018) which introduces the notion of the Wasserstein Information Matrix to define a local behavioral similarity measure between policies. This allows us to identify the Wasserstein Natural Gradient (WNG) as a key ingredient for optimization methods that rely on the local behavior of policies. To enable efficient estimation of the WNG, we build on the recent work of Arbel et al. (2020), and further extend it to cases where the re-parameterization trick is not applicable and only the score function of the model is available. (2) This allows us to introduce two novel methods: Wasserstein natural policy gradients (WNPG) and Wasserstein natural evolution strategies (WNES), which use the local behavioral structure of policies through the WNG and can easily be incorporated into standard RL optimization routines. When combined with a global behavioral similarity measure such as a WD penalty, we show substantial improvement over using the penalty alone without access to local information. We find that such WNG-based methods are especially useful on tasks where initial progress is difficult. (3) Finally, we present, to our knowledge, the first in-depth comparative analysis of the FNG and WNG, highlighting a clear, interpretable advantage of using the WNG over the FNG on tasks where the optimal solution is deterministic. This scenario arises frequently in ES and in policy optimization for MDPs (Puterman, 2010).
This suggests that WNG could be a powerful tool for this class of problems, especially when reaching accurate solutions quickly is crucial. In Section 2, we present a brief review of policy gradient approaches and the role of divergence measures as regularization penalties. In Section 3 we introduce the WNG and detail its relationship with the FNG and the use of Wasserstein penalties, and in Section 4 we derive practical algorithms for applying the WNG to PG and ES. Section 5 contains our empirical results.

2. BACKGROUND

Policy Gradient (PG) methods directly parametrize a policy $\pi_\theta$, optimizing the parameter $\theta$ using stochastic gradient ascent on the expected total discounted reward $R(\theta)$. An estimate $\hat{g}_k$ of the gradient of $R(\theta)$ at $\theta_k$ can be computed by differentiating a surrogate objective $L(\theta)$, which often comes in two flavors, depending on whether training is on-policy (left) or off-policy (right):

$L(\theta) = \hat{\mathbb{E}}\big[\log \pi_\theta(a_t|s_t)\,\hat{A}_t\big], \quad \text{or} \quad L(\theta) = \hat{\mathbb{E}}\Big[\tfrac{\pi_\theta(a_t|s_t)}{\pi_{\theta_k}(a_t|s_t)}\,\hat{A}_t\Big].$

The expectation $\hat{\mathbb{E}}$ is an empirical average over $N$ trajectories $\tau_i = (s^i_1, a^i_1, r^i_1, \ldots, s^i_T, a^i_T, r^i_T)$ of state-action-rewards obtained by simulating from the environment using $\pi_{\theta_k}$. The scalar $\hat{A}_t$ is an estimator of the advantage function and can be computed, for instance, as $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $\gamma \in [0,1)$ is a discount factor and $V$ is the value function, often learned as a parametric function via temporal-difference learning (Sutton & Barto, 2018). Reusing trajectories can reduce the computational cost at the expense of increased variance of the gradient estimator (Schulman et al., 2017). Indeed, performing multiple policy updates while using trajectories from an older policy $\pi_{\theta_{old}}$ means that the current policy $\pi_\theta$ can drift away from the older policy. On the other hand, the objective is obtained as an expectation under $\pi_\theta$, for which fresh trajectories are not available. Instead, the objective is estimated using importance sampling (by re-weighting the old trajectories according to importance weights $\pi_\theta / \pi_{\theta_{old}}$). When $\pi_\theta$ is too far from $\pi_{\theta_{old}}$, the importance weights can have a large variance. This can lead to a drastic degradation of performance if done naïvely (Schulman et al., 2017). KL-based policy optimization (PO) methods address these limitations by ensuring that the policy does not change substantially between successive updates, where change is measured by the KL divergence between the resulting action distributions.
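The two surrogate objectives and the one-step advantage estimator above can be sketched in a few lines of numpy. This is a minimal illustration, not part of any RL library; the helper names and the plain-array interface are our own choices.

```python
import numpy as np

def advantage_td(rewards, values, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds V(s_0), ..., V(s_T); `rewards` has length T."""
    rewards = np.asarray(rewards, dtype=float)
    values = np.asarray(values, dtype=float)
    return rewards + gamma * values[1:] - values[:-1]

def surrogate_on_policy(logp, adv):
    """On-policy surrogate: empirical mean of log pi_theta(a_t|s_t) * A_t."""
    return float(np.mean(logp * adv))

def surrogate_off_policy(logp_new, logp_old, adv):
    """Off-policy surrogate: importance-weighted advantages, with
    weights pi_theta / pi_theta_k = exp(logp_new - logp_old)."""
    ratio = np.exp(logp_new - logp_old)
    return float(np.mean(ratio * adv))
```

Note that when the new and old policies coincide, the importance ratio is one and the off-policy surrogate reduces to the mean advantage.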
The general idea is to add either a hard KL constraint, as in TRPO (Schulman et al., 2015), or a soft constraint, as in PPO (Schulman et al., 2017), to encourage proximity between policies. In the first case, TRPO recovers the FNG with a step-size further adjusted using line-search to enforce the hard constraint. The FNG permits larger steps in directions where the policy changes the least, thus reducing the number of updates required for optimization. In the second case, the soft constraint leads to an objective of the form

$\text{maximize}_{\theta} \;\; L(\theta) - \beta\, \hat{\mathbb{E}}\big[\mathrm{KL}\big(\pi_{\theta_k}(\cdot|s_t),\, \pi_{\theta}(\cdot|s_t)\big)\big].$

The KL penalty prevents the updates from deviating too far from the current policy $\pi_{\theta_k}$, thereby controlling the variance of the gradient estimator. This allows making multiple steps with the same simulated trajectories without degrading performance. While both methods take into account the proximity between policies as measured by the KL, they do not take into account the behavior of those policies in the environment. Exploiting such information can greatly improve performance.

Behavior-Guided Policy Optimization. Motivated by the idea that policies can differ substantially as measured by their KL divergence but still behave similarly in the environment, Pacchiano et al. (2019) recently proposed to use a notion of proximity in behavior between policies for PO. Exploiting similarity in behavior during optimization allows taking larger steps in directions where policies behave similarly despite having a large KL divergence. To capture a sense of global behavior, they define a behavioral embedding map (BEM) $\Phi$ that maps every trajectory $\tau$ to a behavior variable $X = \Phi(\tau)$ belonging to some embedding space $E$. The behavior variable $X$ provides a simple yet meaningful representation of each trajectory $\tau$. As a random variable, $X$ is distributed according to a distribution $q_\theta$, called the behavior distribution.
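As a concrete sketch of the soft-constraint objective, the KL between two diagonal-gaussian action distributions has a standard closed form, so the penalized objective can be evaluated directly. The helper names below are ours, and this is a simplified illustration, not the PPO implementation.

```python
import numpy as np

def kl_diag_gauss(mu_p, std_p, mu_q, std_q):
    """KL(p || q) between diagonal gaussian action distributions,
    summed over action dimensions (standard closed form):
    log(std_q/std_p) + (std_p^2 + (mu_p - mu_q)^2) / (2 std_q^2) - 1/2."""
    var_p, var_q = std_p**2, std_q**2
    return float(np.sum(np.log(std_q / std_p)
                        + (var_p + (mu_p - mu_q)**2) / (2.0 * var_q)
                        - 0.5))

def kl_penalized_objective(surrogate, mu_old, std_old, mu_new, std_new, beta):
    """Soft-constraint objective: L(theta) - beta * KL(pi_old, pi_new)."""
    return surrogate - beta * kl_diag_gauss(mu_old, std_old, mu_new, std_new)
```

With identical old and new policies, the penalty vanishes and the objective reduces to the surrogate.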
Examples of $\Phi$ include simply returning the final state of a trajectory ($\Phi(\tau) = s_T$) or its concatenated actions ($\Phi(\tau) = [a_0, \ldots, a_T]$). Proximity between two policies $\pi_\theta$ and $\pi_{\theta'}$ is then measured using the Wasserstein distance between their behavior distributions $q_\theta$ and $q_{\theta'}$. Although the KL could also be used in some cases, the Wasserstein distance has the advantage of being well-defined even for distributions with non-overlapping supports, therefore allowing more freedom in choosing the embedding $\Phi$ (see Section 3.1). This leads to a penalized objective that regulates behavioral proximity:

$\text{maximize}_{\theta} \;\; L(\theta) - \frac{\beta}{2}\, W_2\big(q_{\theta_k}, q_{\theta}\big), \quad (4)$

where $\beta \in \mathbb{R}$ is a hyper-parameter controlling the strength of the regularization. To compute the penalty, Pacchiano et al. (2019) use an iterative method from Genevay et al. (2016). This procedure is highly accurate when the Wasserstein distance changes slowly between successive updates, as ensured when $\beta$ is large. At the same time, larger values of $\beta$ also mean that the policy is updated using smaller steps, which can impede convergence. An optimal trade-off between the rate of convergence and the precision of the estimated Wasserstein distance can be achieved using an adaptive choice of $\beta$, as done for PPO (Schulman et al., 2017). For a finite value of $\beta$, the penalty accounts for global proximity in behavior and does not explicitly exploit the local geometry induced by the BEM, which can further improve convergence. We introduce an efficient method that explicitly exploits the local geometry induced by the BEM through the Wasserstein natural gradient (WNG), leading to gains in performance at a reduced computational cost. When global proximity is important to the task, we show that using the Wasserstein penalty in Equation (4) and optimizing it with the WNG yields more efficient updates, thus converging faster than simply optimizing Equation (4) using standard gradients.
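The two example behavioral embedding maps mentioned above are easy to state in code. A minimal sketch, assuming trajectories are stored as dictionaries of state and action lists (our own convention, not from the paper):

```python
import numpy as np

def bem_final_state(trajectory):
    """Phi(tau) = s_T: embed a trajectory by its final state."""
    return np.asarray(trajectory["states"][-1], dtype=float)

def bem_actions(trajectory):
    """Phi(tau) = [a_0, ..., a_T]: concatenation-of-actions embedding."""
    return np.concatenate(
        [np.atleast_1d(a) for a in trajectory["actions"]]
    ).astype(float)
```

A batch of such embeddings, one per trajectory, defines an empirical version of the behavior distribution $q_\theta$.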

3. THE WASSERSTEIN NATURAL GRADIENT

The Wasserstein natural gradient (WNG) (Li & Montufar, 2018a;b) corresponds to the steepest-ascent direction of an objective within a trust region defined by the local behavior of the Wasserstein-2 distance ($W_2$). The $W_2$ between two nearby densities $q_\theta$ and $q_{\theta+u}$ can be approximated by computing the average cost of moving every sample $X$ from $q_\theta$ to a new sample $X'$ approximately distributed according to $q_{\theta+u}$ using an optimal vector field of the form $\nabla_x f_u(x)$, so that $X' = X + \nabla_x f_u(X)$ (see Figure 6). Optimality of $\nabla_x f_u$ is defined as a trade-off between accurately moving mass from $q_\theta$ to $q_{\theta+u}$ and reducing the transport cost, measured by the average squared norm of $\nabla_x f_u$:

$\sup_{f_u} \; \nabla_\theta \mathbb{E}_{q_\theta}[f_u(X)]^\top u - \frac{1}{2}\, \mathbb{E}_{q_\theta}\big[\|\nabla_x f_u(X)\|^2\big], \quad (5)$

where the optimization is over a suitable set of smooth real-valued functions on $E$. Hence, the optimal function $f_u$ solving Equation (5) defines the optimal vector field $\nabla_x f_u(x)$. Proposition 1 makes this intuition more precise and defines the Wasserstein Information Matrix.

Proposition 1 (Adapted from Definition 3 of Li & Zhao (2019)). The second-order Taylor expansion of $W_2$ between two nearby parametric probability distributions $q_\theta$ and $q_{\theta+u}$ is given by

$W_2^2(q_\theta, q_{\theta+u}) = u^\top G(\theta)\, u + o(\|u\|^2), \quad (6)$

where $G(\theta)$ is the Wasserstein Information Matrix (WIM), with components in a basis $(e_1, \ldots, e_p)$

$G_{j,j'}(\theta) = \mathbb{E}_{q_\theta}\big[\nabla_x f_j(X)^\top \nabla_x f_{j'}(X)\big]. \quad (7)$

The functions $f_j$ solve Equation (5) with $u$ chosen as $e_j$. Moreover, for any given $u$, the solution $f_u$ to Equation (5) satisfies $\mathbb{E}_{q_\theta}[\|\nabla_x f_u(X)\|^2] = u^\top G(\theta)\, u$.

[Figure 1 (setup): $L(\theta) = \mathbb{E}_{q_\theta}[\psi(x)]$, where $q_\theta$ is a 100-dimensional gaussian with parameters $\theta = (\mu, v)$; $\mu$ is the mean vector and $v$ parameterizes the diagonal covariance matrix $\Sigma$. Two parameterizations of the covariance are considered: $\Sigma_{ii} = e^{v_i}$ (log-diagonal) and $\Sigma_{ii} = v_i$ (diagonal).]

When $q_\theta$ and $q_{\theta+u}$ are the behavioral embedding distributions of two policies $\pi_\theta$ and $\pi_{\theta+u}$, the function $f_u$ transports the behavior of policy $\pi_\theta$ to a behavior as close as possible to that of $\pi_{\theta+u}$ at the least cost. We thus refer to $f_u$ as the behavioral transport function. The function $f_u$ determines how hard it is to change behavior locally from policy $\pi_\theta$ in a direction $u$, thus providing a tool to find update directions $u$ with either maximal or minimal change in behavior. Probing all directions in a basis $(e_1, \ldots, e_p)$ of parameters allows us to construct the WIM $G(\theta)$ in Equation (7), which summarizes proximity in behavior along all possible directions $u$ through $u^\top G(\theta)\, u = \mathbb{E}_{q_\theta}[\|\nabla_x f_u(X)\|^2]$. For an objective $L(\theta)$, such as the expected total reward of a policy, the Wasserstein natural gradient (WNG) is then defined as the direction $u$ that locally increases $L(\theta + u)$ the most with the least change in behavior as measured by $f_u$. Formally, the WNG is related to the usual Euclidean gradient $g = \nabla_\theta L(\theta)$ by

$g_W = \arg\max_u \; 2 g^\top u - u^\top G(\theta)\, u. \quad (8)$

From Equation (8), the WNG can be expressed in closed form in terms of $G(\theta)$ and $g$ as $g_W = G^{-1}(\theta)\, g$. Hence, WNG ascent is simply performed using the update equation $\theta_{k+1} = \theta_k + \lambda g_W^k$. We will see in Section 4 how to estimate the WNG efficiently without storing or explicitly inverting the matrix $G$. Next, we discuss the advantages of using the WNG over other methods.
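Given an estimate of the WIM, the WNG ascent update is a single linear solve. Below is a minimal sketch (our own helper, assuming $G$ has already been estimated). As a sanity check, for a one-dimensional gaussian parameterized by mean and standard deviation, $W_2^2 = (\mu - \mu')^2 + (\sigma - \sigma')^2$, so the WIM in that chart is the identity and the WNG coincides with the Euclidean gradient.

```python
import numpy as np

def wng_step(theta, grad, G, lr=0.1):
    """One WNG ascent step: theta <- theta + lr * G(theta)^{-1} grad.

    `G` is the (estimated) Wasserstein Information Matrix at theta;
    solving the linear system avoids forming G^{-1} explicitly."""
    g_w = np.linalg.solve(G, grad)
    return theta + lr * g_w, g_w
```

In practice $G$ is never formed or inverted explicitly; Section 4 derives a low-rank estimator that scales to neural-network policies.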

3.1. WHY USE THE WASSERSTEIN NATURAL GRADIENT?

To illustrate the advantages of the WNG, we consider a simple setting where the objective is of the form $L(\theta) = \mathbb{E}_{q_\theta}[\psi(x)]$, with $q_\theta$ a gaussian distribution. The optimal solution in this example is a deterministic point mass located at the global optimum $x^*$ of the function $\psi$. This situation arises systematically in ES when using a gaussian noise distribution with learnable mean and variance. Moreover, the optimal policy of a Markov Decision Process (MDP) is necessarily deterministic (Puterman, 2010). Thus, despite its simplicity, this example allows us to obtain closed-form expressions for all methods while capturing a property crucial to many RL problems (deterministic optimal policies) which, as we will see, results in differences in performance.

Wasserstein natural gradient vs. Fisher natural gradient. While Figure 1 (c) shows that both methods seem to reach the same solution, a closer inspection of the loss, as shown in Figure 1 (d) and (e) for two different parameterizations of $q_\theta$, shows that the FNG is faster at first, then slows down to reach a final error of $10^{-4}$. The WNG, on the other hand, is slower at first, then transitions suddenly to an error of $10^{-8}$. The optimal solution being deterministic, the variance of the gaussian $q_\theta$ needs to shrink to 0. In this case, the KL blows up, while the $W_2$ distance remains finite. As the natural gradient methods are derived from these two divergences (Theorem 2 of Appendix B), they inherit the same behavior. This explains why, unlike the WNG, the FNG doesn't achieve the error of $10^{-8}$. Beyond this example, when the policy $\pi_\theta$ is defined only implicitly using a generative network, as in Tang & Agrawal (2019), the FNG and KL penalty are ill-defined since $\pi_{\theta_k}$ and $\pi_{\theta_{k+1}}$ might have non-overlapping supports. The WNG, however, remains well-defined (see Arbel et al. (2020)) and allows for more flexibility in representing policies, such as with behavioral embeddings.
Wasserstein penalty vs. Wasserstein natural gradient. The Wasserstein penalty of Equation (4) encourages global proximity between successive updates $q_{\theta_k}$. For small values of the penalty parameter $\beta$, the method behaves like standard gradient descent (Figure 1 (a)). As $\beta$ increases, the penalty encourages more local updates and thus incorporates more information about the local geometry defined by $q_\theta$. In fact, it recovers the WNG direction (Theorem 2 of Appendix B), albeit with an infinitesimally small step-size, which is detrimental to convergence of the algorithm. To avoid slowing down, an intricate balance between the step-size and the penalty $\beta$ needs to be maintained (Schulman et al., 2017). All of these issues are avoided when directly using the WNG, which performs the best (Figure 1 (a)) and tolerates the widest range of step-sizes (Figure 1 (f)). Moreover, when using the log-diagonal parameterization as in Figure 1 (d), WNG descent (in red) achieves an error of $10^{-8}$, while the $W_2$ penalty achieves a much larger error, of order 1, for all tested values of $\beta$. When using the diagonal parameterization instead, as shown in Figure 1 (e), both methods achieve a similar error of $10^{-6}$. This discrepancy in performance highlights the robustness of the WNG to the parameterization of the model.

Combining WNG and a Wasserstein penalty. The global proximity encouraged by a $W_2$ penalty can be useful on its own, for instance, to explicitly guarantee policy improvement as in (Pacchiano et al., 2019, Theorem 5.1). However, this requires estimating the $W_2$ at every iteration, which can be costly. Using the WNG instead of the usual gradient can yield more efficient updates, thus reducing the number of times the $W_2$ needs to be estimated. The speed-up can be understood as performing second-order optimization on the $W_2$ penalty, since the WNG arises precisely from a second-order expansion of the $W_2$ distance, as shown in Section 3 (see also Example 2 in Arbel et al. (2020)).

4. POLICY OPTIMIZATION USING BEHAVIORAL GEOMETRY

We now present practical algorithms to exploit the behavioral geometry induced by the embedding $\Phi$. We begin by describing how to efficiently estimate the WNG. Efficient estimation of the WNG can be performed using kernel methods, as shown in Arbel et al. (2020), in the case where the re-parametrization trick is applicable. This is the case if, for instance, the behavioral variable is the concatenation of actions $X = [a_0, \ldots, a_T]$ and the actions are sampled from a gaussian with mean and variance parameterized by a neural network, as is often done in practice for real-valued actions. Then $X$ can be expressed as $X = B_\theta(Z)$, where $B_\theta$ is a known function and $Z$ is an input sample consisting of the concatenation of states $[s_0, \ldots, s_T]$ and the gaussian noise used to generate the actions. However, the proposed algorithm is not readily applicable if, for instance, the behavioral variable $X$ is a function of the reward. We now introduce a procedure that extends the previous method to more general cases, including those where only the score $\nabla_\theta \log q_\theta$ is available without an explicit re-parametrization trick.

The core idea is to approximate the functions $f_{e_j}$ defining $G(\theta_k)$ in Equation (7) using linear combinations of user-specified basis functions $(h_1(x), \ldots, h_M(x))$:

$\hat{f}_{e_j}(x) = \sum_{m=1}^{M} \alpha^j_m h_m(x).$

The number $M$ controls the computational cost of the estimation and is typically chosen on the order of $M = 10$. The basis can be chosen to be data-dependent using kernel methods. More precisely, we use the same approach as in Arbel et al. (2020): we first subsample $M$ data points $Y_m$ from a batch of $N$ variables $X_n$, and $M$ indices $i_m$ from $\{1, \ldots, d\}$, where $d$ is the dimension of $X_n$. Then, each basis function is of the form $h_m(x) = \partial_{i_m} K(Y_m, x)$, where $K$ is a positive semi-definite kernel such as the gaussian kernel $K(x, y) = \exp(-\|x - y\|^2 / \sigma^2)$. This choice of basis allows us to provide guarantees for the functions $f_j$ in terms of the batch size $N$ and the number of basis points $M$ (Arbel et al., 2020, Theorem 7). Plugging each $\hat{f}_j$ into the transport cost problem of Equation (5) yields a quadratic problem of dimension $M$ in the coefficients $\alpha^j$:

$\text{maximize}_{\alpha^j} \; 2 J_{\cdot,j}^\top \alpha^j - (\alpha^j)^\top L\, \alpha^j, \quad (9)$

where $L$ is a square matrix of size $M \times M$ independent of the index $j$, and $J$ is a Jacobian matrix of shape $M \times p$ with rows given by

$J_{m,\cdot} = \nabla_\theta \mathbb{E}_{q_{\theta_k}}[h_m(X)]. \quad (10)$

There are two expressions for $J$, depending on the applicability of the re-parametrization trick or the availability of the score:

$J_{m,\cdot} = \hat{\mathbb{E}}_{q_\theta}\big[\nabla_x h_m(X)^\top \nabla_\theta B_\theta(Z)\big] \quad \text{or} \quad J_{m,\cdot} = \hat{\mathbb{E}}_{q_\theta}\big[\nabla_\theta \log q_\theta(X)\, h_m(X)\big].$

Computing $J$ can be done efficiently for moderate $M$ by first computing a surrogate vector $V$ of size $M$ whose Jacobian recovers $J$ using automatic differentiation software:

$V_m = \hat{\mathbb{E}}_{q_\theta}[h_m(X)], \quad \text{or} \quad V_m = \hat{\mathbb{E}}_{q_\theta}[\log q_\theta(X)\, h_m(X)]. \quad (11)$

The optimal coefficients are then simply expressed as $\alpha^j = L^\dagger J_{\cdot,j}$. Plugging the optimal functions into the expression of the Wasserstein Information Matrix (Equation (7)) yields a low-rank approximation of $G$ of the form $\hat{G} = J^\top L^\dagger J$. By adding a small diagonal perturbation matrix $\varepsilon I$, it is possible to efficiently compute $(\hat{G} + \varepsilon I)^{-1} \hat{g}$ using a generalized Woodbury matrix identity, which yields an estimator for the Wasserstein natural gradient:

$\hat{g}_W = \frac{1}{\varepsilon}\Big(\hat{g} - J^\top \big(J J^\top + \varepsilon L\big)^\dagger J \hat{g}\Big). \quad (12)$

The pseudo-inverse is only computed for a matrix of size $M$. Using Jacobian-vector products, Equation (12) can be computed without storing large matrices, as shown in Algorithm 3.

Algorithm 1: Wasserstein Natural Policy Gradient
1: Input: initial policy $\pi_{\theta_0}$
2: for iteration $k = 1, 2, \ldots$ do
3:   Obtain $N$ rollouts $\{\tau_n\}_{n=1}^N$ of length $T$ using policy $\pi_{\theta_k}$
4:   Compute the loss $L(\theta_k)$ in a forward pass
5:   Compute the gradient $\hat{g}_k$ in the backward pass on $L(\theta_k)$
6:   Compute the behavioral embeddings $\{X_n = \Phi(\tau_n)\}_{n=1}^N$
7:   Compute the WNG $\hat{g}_W^k$ using Algorithm 3 with samples $\{X_n\}_{n=1}^N$ and gradient estimate $\hat{g}_k$
8:   Update the policy: $\theta_{k+1} = \theta_k + \lambda \hat{g}_W^k$
9: end for
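The kernel basis and the Woodbury-style estimator of Equation (12) can be sketched in numpy as follows. This is a simplified illustration under our own naming; the actual implementation forms $J$ via automatic differentiation rather than explicitly.

```python
import numpy as np

def gaussian_kernel_partial(y, x, i, sigma=1.0):
    """Basis function h(x) = d/dx_i K(y, x) for the gaussian kernel
    K(x, y) = exp(-||x - y||^2 / sigma^2), anchored at a data point y."""
    diff = x - y
    k = np.exp(-np.dot(diff, diff) / sigma**2)
    return -2.0 * diff[i] / sigma**2 * k

def wng_from_low_rank(J, L, g, eps=1e-5):
    """Equation (12): g_W = (1/eps) * (g - J^T (J J^T + eps L)^+ J g).

    J: M x p Jacobian of the surrogate vector V; L: M x M similarity
    matrix of the basis. Only an M x M matrix is pseudo-inverted."""
    D = J @ J.T + eps * L
    return (g - J.T @ (np.linalg.pinv(D) @ (J @ g))) / eps
```

With $L = I$ the identity, Equation (12) reduces to the classical Woodbury identity for $(J^\top J + \varepsilon I)^{-1} g$, which gives a convenient correctness check.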
Wasserstein Natural Policy Gradient (WNPG). It is possible to incorporate local information about the behavior of a policy into standard policy gradient algorithms, as summarized in Algorithm 1. In its simplest form, one first computes the gradient $\hat{g}_k$ of the objective $L(\theta_k)$ using, for instance, the REINFORCE estimator computed from $N$ trajectories $\tau_n$. The trajectories are then used to compute the BEMs, which are fed as input, along with the gradient $\hat{g}_k$, to obtain an estimate of the WNG $\hat{g}_W^k$. Finally, the policy is updated in the direction of $\hat{g}_W^k$. Algorithm 1 can also be used in combination with an explicit $W_2$ penalty to control non-local changes in the behavior of the policy, thus ensuring a policy improvement property as in (Pacchiano et al., 2019, Theorem 5.1). In that case, the WNG enhances convergence by acting as a second-order optimizer, as discussed in Section 3.1. The standard gradient $\hat{g}_k$ in Algorithm 1 is then simply replaced by the one computed in (Pacchiano et al., 2019, Algorithm 3). In Section 5, we show that this combination, which we call behavior-guided WNPG (BG-WNPG), leads to the best overall performance.

Wasserstein Natural Evolution Strategies (WNES). ES treats the total reward observed on a trajectory under policy $\pi_\theta$ as a black-box function $L(\theta)$ (Salimans et al., 2017; Mania et al., 2018; Choromanski et al., 2020). Evaluating it under $N$ policies whose parameters $\tilde{\theta}_n$ are gaussian perturbations centered around $\theta_k$ with scale $\sigma$ gives an estimate of the gradient of $L(\theta_k)$:

$\hat{g}_k = \frac{1}{N\sigma} \sum_{n=1}^{N} \big[L(\tilde{\theta}_n) - L(\theta_k)\big]\, (\tilde{\theta}_n - \theta_k). \quad (13)$

Instead of directly updating the policy using Equation (13), it is possible to encourage either proximity or diversity in behavior using the embeddings $X_n = \Phi(\tau_n)$ of the trajectories $\tau_n$ generated for each perturbed policy $\pi_{\tilde{\theta}_n}$. These embeddings can be used as input to Algorithm 3 (see appendix), along with Equation (13), to estimate $\hat{g}_W^k$, which captures similarity in behavior. The algorithm remains unchanged except for the estimation of the Jacobian $J$ of Equation (10), which becomes

$J_{m,\cdot} = \frac{1}{N\sigma} \sum_{n=1}^{N} h_m(X_n)\, (\tilde{\theta}_n - \theta_k)^\top. \quad (14)$

The policy parameter can then be updated using an interpolation between $\hat{g}_k$ and the WNG $\hat{g}_W^k$:

$\Delta\theta_k \propto (1 - \delta)\, \hat{g}_k + \delta\, \hat{g}_W^k, \quad (15)$

with $\delta \leq 1$, which can also be negative. Positive values of $\delta$ encourage proximity in behavior, the limit case being $\delta = 1$, where a full WNG step is taken. Negative values encourage repulsion and therefore need to be compensated by $\hat{g}_k$ to ensure overall policy improvement. Algorithm 2 summarizes the whole procedure, which can easily be adapted from existing ES implementations by calling a variant of Algorithm 3. In particular, it can also be used along with an explicit $W_2$ penalty, in which case the algorithm proposed in Pacchiano et al. (2019) is used to estimate the standard gradient $\hat{g}_k$ of the penalized loss; the policy is then updated using Equation (15) instead of $\hat{g}_k$. We refer to this approach as behavior-guided WNES (BG-WNES).

Algorithm 2: Wasserstein Natural Evolution Strategies
1: Input: initial policy $\pi_{\theta_0}$, $\alpha > 0$, $\delta \leq 1$
2: for iteration $k = 1, 2, \ldots$ do
3:   Sample $\epsilon_1, \ldots, \epsilon_N \sim \mathcal{N}(0, I)$
4:   Perform rollouts $\{\tau_n\}_{n=1}^N$ of length $T$ using the perturbed parameters $\{\tilde{\theta}_n = \theta_k + \sigma \epsilon_n\}_{n=1}^N$ and compute the behavioral embeddings $\{X_n = \Phi(\tau_n)\}_{n=1}^N$
5:   Compute the gradient estimate $\hat{g}_k$ using Equation (13) and the trajectories $\{\tau_n\}_{n=1}^N$
6:   Compute the Jacobian matrix $J$ appearing in Algorithm 3 using Equation (14)
7:   Compute the WNG $\hat{g}_W^k$ using Algorithm 3 with samples $\{X_n\}_{n=1}^N$, the computed $\hat{g}_k$, and $J$
8:   Update the policy using Equation (15)
9: end for
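The ES gradient estimate of Equation (13) and the interpolated update of Equation (15) can be sketched as follows. The helper names are ours; note that the literal estimator in Equation (13) recovers the gradient direction up to a $\sigma$-dependent scale, which the learning rate absorbs in practice.

```python
import numpy as np

def es_gradient(L_fn, theta, sigma=0.02, N=100, rng=None):
    """ES gradient estimate (Equation (13)):
        g = (1/(N*sigma)) * sum_n [L(theta_n) - L(theta)] (theta_n - theta),
    with theta_n = theta + sigma * eps_n, eps_n ~ N(0, I)."""
    rng = np.random.default_rng(rng)
    base = L_fn(theta)
    g = np.zeros_like(theta)
    for _ in range(N):
        pert = theta + sigma * rng.standard_normal(theta.shape)
        g += (L_fn(pert) - base) * (pert - theta)
    return g / (N * sigma)

def wnes_update(theta, g, g_w, lr=0.1, delta=0.5):
    """Interpolated WNES step (Equation (15)):
    delta in (0, 1] pulls toward the WNG (behavioral proximity);
    delta < 0 encourages behavioral repulsion, compensated by g."""
    return theta + lr * ((1.0 - delta) * g + delta * g_w)
```

On a smooth test function such as $L(\theta) = -\|\theta\|^2$, the estimate aligns with the true ascent direction for moderate $N$.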

5. EXPERIMENTS

We now test the performance of our estimators for both policy gradients (PG) and evolution strategies (ES) against their associated baseline methods. We show that, in addition to improved computational efficiency, our approach can effectively utilize the geometry induced by a Wasserstein penalty to improve performance, particularly when the optimization problem is ill-conditioned. Further experimental details can be found in the appendix, and our code is available online.

Policy Gradients. We first apply WNPG and BG-WNPG to challenging tasks from OpenAI Gym (Brockman et al., 2016) and Roboschool (RS). We compare performance against behavior-guided policy gradients (BGPG) (Pacchiano et al., 2019), PPO with the clipped surrogate objective (Schulman et al., 2017) (PPO (Clip)), and PG with no trust region (None). From Figure 2, we can see that BGPG outperforms the corresponding KL-based method (PPO) and vanilla PG, as also demonstrated in the work of Pacchiano et al. (2019). Our method (WNPG) matches or exceeds the final performance of BGPG on all tasks. Moreover, combining both (BG-WNPG) produces the largest gains on all environments.

[Figure 2: performance mean ± standard deviation versus time steps, for 5 random seeds on each task.]

Final mean rewards are reported in Table 1. It is also important to note that WNG-based methods appear to offer the biggest advantage on tasks where initial progress is difficult. To investigate this further, we computed the Hessian matrix at the end of training for each task and measured the ratios of its largest eigenvalue to each successive eigenvalue (Figure 3). Larger ratios indicate ill-conditioning, and it is significant that WNG methods produce the greatest improvement on the environments with the poorest conditioning. This is consistent with the findings in Arbel et al. (2020), which showed the WNG to perform most favorably compared to other methods when the optimization problem is ill-conditioned, and it implies a useful heuristic for gauging when WNG-based methods are most useful for a given problem.

Evolution Strategies. To test our estimator for WNES, as well as BG-WNES, we applied our approach to the environment introduced by Pacchiano et al. (2019), designed to test the ability of behavior-guided learning to succeed despite deceptive rewards. During the task, the agent receives a penalty proportional to its distance from a goal, but a wall is placed directly in the agent's path (Figure 7). This barrier induces a local maximum in the objective: a naïve agent will simply walk directly towards the goal and get stuck at the barrier. The idea is that the behavioral repulsion fostered by applying a positive coefficient to the Wasserstein penalty (β > 0) will encourage the agent to seek novel policies, helping it eventually circumvent the wall. As in Pacchiano et al. (2019), we test two agents, a simple point and a quadruped. We then compare our method with vanilla ES as described by Salimans et al. (2017), ES with gradient norm clipping, BGES (Pacchiano et al., 2019), and NSR-ES (Conti et al., 2018a). In Figure 4, we can see that WNES and BG-WNES improve over the baselines for both agents. To test that the improvement shown by BG-WNES wasn't simply a case of additional "repulsion" supplied by the WNG to BGES, we also tested BGES with an increased β = 0.75, compared to the default of 0.5. This resulted in a decrease in performance, attesting to the unique benefit provided by the WNES estimator.

Computational Efficiency. We define the computational efficiency of an algorithm as the rate at which it accumulates reward relative to its runtime. To test the computational efficiency of our approach, we plotted the total reward divided by wall-clock time obtained by each agent for each task (Fig. 5).
Methods using a WNG estimator were the most efficient on each task for both PG and ES agents. On several of the policy gradient environments, the added cost of BG-WNPG reduced its efficiency, even though it achieved the highest absolute performance.

6. CONCLUSION

Explicit regularization using divergence measures between policy representations has been a common theme in recent work on policy optimization for RL. While prior works have focused on the KL divergence, Pacchiano et al. (2019) showed that a Wasserstein regularizer over behavioral distributions provides a powerful alternative framework. Both approaches implicitly define a form of natural gradient, depending on which divergence measure is chosen. Through the introduction of WNPG and WNES, we demonstrate that directly estimating the natural gradient of the un-regularized objective can deliver greater performance at lower computational cost. These algorithms represent novel extensions of previous work on the WNG to problems where the re-parameterization trick is not available, as well as to black-box methods like ES. Moreover, using the WNG in conjunction with a WD penalty allows the WNG to take advantage of the local geometry induced by the regularization, further improving performance. We also provide a novel comparison between the WNG and FNG, showing that the former has significant advantages on certain problems. We believe this framework opens up a number of avenues for future work. Developing a principled way to identify useful behavioral embeddings for a given RL task would allow us to get the highest benefit from WNPG and WNES. From a theoretical perspective, it would be useful to characterize the convergence boost granted by the combination of explicit regularization and the corresponding natural gradient approach.

[Figure 6: a sample $X$ from $q_\theta$ is transported to $X' = X + \nabla f_u(X)$, approximately distributed according to $q_{\theta+u}$; the policies $\pi_\theta$, $\pi_{\theta+u}$ are embedded through $\Phi(\tau)$.]

then

$g_D^k = \lim_{\beta \to +\infty} \arg\max_u \; \beta\big[L(\theta_k + \beta^{-1} u) - L(\theta_k)\big] - \frac{\beta}{2}\, D\big(\theta_k,\, \theta_k + \beta^{-1} u\big). \quad (18)$

Equation (18) simply states that both the WNG and FNG arise as limit cases of penalized objectives, provided the strength of the penalty $\beta$ diverges to infinity and the step-size is shrunk proportionally to $\beta^{-1}$. An additional global rescaling of the total objective by $\beta$ prevents it from collapsing to 0.
Intuitively, performing a Taylor expansion of Equation (18) recovers an equation similar to Equation (8). Equation (18) shows that, by using a penalty that encourages global proximity between successive policies, it is possible to recover the local geometry of policies (captured by the local metric) by increasing the strength of the penalty with appropriate re-scaling. This also informally shows why both natural gradients are said to be invariant to re-parameterization (Arbel et al., 2020, Proposition 1), since both the KL and W_2 remain unchanged if q_θ is parameterized in a different way.

C ALGORITHM FOR ESTIMATING WNG

Algorithm 3: Efficient Wasserstein Natural Gradient
1: Input: mini-batch of samples {X_n}_{n=1}^N distributed according to q_θ, gradient direction ĝ, basis functions h_1, ..., h_M, regularization parameter ε.
2: Output: Wasserstein natural gradient ĝ_W.
3: Compute a matrix C of shape M × Nd using C_{m,(n,i)} ← ∂_i h_m(X_n).
4: Compute the similarity matrix L ← (1/N) CC^⊤.
5: Compute the surrogate vector V using Equation (11).
6: for m = 1, 2, ..., M do
7:   Use automatic differentiation on V_m to compute the Jacobian matrix J in Equation (10).
8: end for
9: Compute a matrix D of shape M × M using D ← JJ^⊤ + εL.

We conserve all baseline and shared hyperparameters used by Pacchiano et al. (2019). More precisely, for each task we ran a hyperparameter sweep over learning rates in the set {1e-5, 5e-5, 1e-4, 3e-4}, and used the concatenation-of-actions behavioral embedding Φ(τ) = [a_0, a_1, ..., a_T], with the base network implementation the same as in Dhariwal et al. (2017). The WNG hyperparameters were also left the same as in Arbel et al. (2020). Specifically, the number of basis points was set to M = 5, the reduction factor was bounded in the range [0.25, 0.75], and ε ∈ [1e-10, 1e5].
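Assuming the symbol dropped during extraction in steps 1, 9, and 12 of Algorithm 3 is the regularization parameter ε, the final steps reduce to a single small linear solve. A minimal NumPy sketch (J, L, and ĝ are taken as precomputed by steps 3–8; the exact role of ε is our assumption):

```python
import numpy as np

def wng_estimate(g_hat, J, L, eps=1e-5):
    """Steps 9-12 of Algorithm 3 (sketch): given the M x p Jacobian J of the
    surrogate vector, the M x M similarity matrix L, and the Euclidean
    gradient g_hat, return a Wasserstein natural gradient estimate."""
    D = J @ J.T + eps * L            # step 9: M x M system matrix
    b = J @ g_hat                    # step 10: project the gradient onto the basis
    b = np.linalg.solve(D, b)        # step 11: solve the M x M linear system
    return (g_hat - J.T @ b) / eps   # step 12: assemble the natural gradient
```

By the Woodbury identity this is equivalent to applying (εI + J^⊤ L^{-1} J)^{-1} to ĝ, so only an M × M system is solved rather than a p × p one, which is what makes the estimator cheap when M is small.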

D.2 EVOLUTION STRATEGIES TASKS

As with the policy gradient tasks, we conserved all baseline and shared hyperparameters used by Pacchiano et al. (2019). Specifically, for the point task, we set the learning rate to η = 0.1, the standard deviation of the noise to σ = 0.01, the rollout length to H = 50 time steps, and the behavioral embedding function to the last state, Φ(τ) = s_H. For the quadruped task we set η = 0.02, σ = 0.02, H = 400, and Φ(τ) = Σ_{t=0}^{H} r_t Σ_{i=0}^{t} e_i, where e_i denotes the i-th standard basis vector (the reward-to-go encoding; see Pacchiano et al. (2019) for more details). Both tasks used 1000-dimensional random features and embeddings from the n = 2 previous policies to compute the WD. For WNG, the same hyperparameters were used as in the policy gradient tasks.
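Reading Φ(τ) = Σ_{t=0}^{H} r_t Σ_{i=0}^{t} e_i with e_i the i-th standard basis vector (our interpretation of the garbled formula), the i-th coordinate of the embedding is the reward-to-go Σ_{t≥i} r_t, i.e. a reversed cumulative sum:

```python
import numpy as np

def reward_to_go_embedding(rewards):
    """Phi(tau)_i = sum_{t >= i} r_t: embed a trajectory by its rewards-to-go."""
    r = np.asarray(rewards, dtype=float)
    return np.cumsum(r[::-1])[::-1]  # reversed cumulative sum

# e.g. reward_to_go_embedding([1.0, 2.0, 3.0]) -> array([6., 5., 3.])
```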

D.3 EXPERIMENTAL SETTING OF FIGURE 1

The Objective. We consider a function ψ(x) built from sinc functions over all dimensions of x ∈ R^100:

$$\psi(x) = \sum_{i=1}^{100} \left(1 - \frac{\sin(x_i)}{x_i}\right).$$

Such a function is highly non-convex and admits multiple bad local minima, with the global minimum of ψ(x) reached at x = 0. However, we do not make use of this information during optimization. To alleviate the non-convexity of this loss, we consider a Gaussian relaxation objective L(θ), obtained by taking the expectation of ψ(x) over the 100-dimensional vector x w.r.t. a Gaussian q_θ with parameter vector θ. Thus the objective function to be optimized is a function of θ:

$$L(\theta) = \mathbb{E}_{q_\theta}[\psi(x)].$$

The parameter vector θ is of the form θ = (µ, v), where µ is the mean of the Gaussian q_θ and v is a vector in R^100 parameterizing its covariance matrix Σ. We will later consider two parameterizations for the covariance matrix. The minimal value of L(θ) is reached when the Gaussian q_θ is degenerate, with Σ = 0 and mean µ = x = 0. Hence, the mean parameter of the global minimum of L(θ) recovers the global optimum of ψ.

Parameterization of the Gaussian. We choose the covariance matrix of the Gaussian to be diagonal and consider two parameterizations: diagonal and log-diagonal. For the diagonal parameterization, Σ_ii = v_i; for the log-diagonal parameterization, Σ_ii = exp(2v_i).
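This objective can be reproduced in a few lines. The sign/shift convention below (each term 1 − sin(x_i)/x_i, so the global minimum is 0 at x = 0) is an assumption chosen to match the text's claims about the minimizer, and the expectation is estimated by plain Monte Carlo:

```python
import numpy as np

def psi(x):
    """Sum over dimensions of 1 - sin(x_i)/x_i; global minimum 0 at x = 0
    (sign convention assumed)."""
    x = np.asarray(x, dtype=float)
    # np.sinc(z) = sin(pi z)/(pi z), so sin(x)/x = np.sinc(x / np.pi)
    return np.sum(1.0 - np.sinc(x / np.pi), axis=-1)

def L_estimate(mu, v, log_diag=True, n_samples=2000, seed=0):
    """Monte Carlo estimate of L(theta) = E_{q_theta}[psi(x)] for a diagonal
    Gaussian; log_diag selects Sigma_ii = exp(2 v_i) versus Sigma_ii = v_i."""
    rng = np.random.default_rng(seed)
    sigma = np.exp(v) if log_diag else np.sqrt(v)
    x = mu + sigma * rng.standard_normal((n_samples, mu.size))
    return psi(x).mean()
```

As the Gaussian concentrates around µ = 0 (small σ), the estimate of L(θ) approaches the global minimum value 0, illustrating why the relaxation is minimized by a degenerate Gaussian at the optimum of ψ.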



https://github.com/tedmoskovitz/WNPG



Figure 1: Different optimization methods using an objective L(θ) = E_{q_θ}[ψ(x)], where q_θ is a 100-dimensional Gaussian with parameters θ = (µ, v). Here µ (in bold) is the mean vector and v parameterizes the covariance matrix Σ, which is chosen to be diagonal. Two parameterizations for the covariance matrix are considered: Σ_ii = e^{v_i} (log-diagonal) and Σ_ii = v_i (diagonal). ψ(x) is the sum of sinc functions over all dimensions. Training is run for up to 4000 iterations, with λ = .9 and β = .1 unless they are varied. In Figure 1(c), σ and µ refer to the standard deviation and mean of the first component of the Gaussian: σ = √Σ_11 and µ = µ_1. More details about the experimental setting are provided in Appendix D.3.


Figure 2: WNG-based algorithms provide large gains on tasks where initial progress is difficult. The performance mean ± standard deviation is plotted versus time steps for 5 random seeds on each task.

Figure 3: Condition numbers for different tasks.

Figure 4: WNES methods more reliably overcome local maxima. Results obtained on the point (a) and quadruped (b) tasks. The mean ± standard deviation is plotted across 5 random seeds.

Figure 6: A visualization of the behavioral transport function.

10: Compute a vector b of size M using b ← Jĝ.
11: Solve the linear system of size M: b ← solve(D, b).
12: Return ĝ_W ← (1/ε)(ĝ − J^⊤ b).

D ADDITIONAL EXPERIMENTAL DETAILS

D.1 POLICY GRADIENT TASKS

Figure 7: A visualization of the quadruped task. The agent receives more reward the closer it is to the goal (green). A naïve agent will get stuck in the local maximum at the wall if it attempts to move directly to the goal.


Acknowledgments The authors would like to thank Jack Parker-Holder for sharing his code for BGPG and BGES, as well as colleagues at Gatsby for useful discussions.


± values denote one standard deviation across trials. The value for the best-performing method is listed in bold, while a * denotes the second best-performing method. BG-WNPG reaches the highest performance on all tasks. WNPG beats the best-performing baseline (BGPG) on all tasks except HalfCheetah, where the difference is small.

A BACKGROUND

A.1 POLICY OPTIMIZATION

An agent interacting with an environment forms a system that can be described by a state variable s belonging to a state space S. In the Markov Decision Process (MDP) setting, the agent interacts with the environment by taking an action a from a set of possible actions A, given the current state s of the system. As a consequence, the system moves to a new state s′ according to a probability transition function P(s′|a, s), which describes the probability of moving to state s′ given the previous state s and action a. The agent also receives a partial reward r, which can be expressed as a possibly randomized function of the new state s′: r = r(s′).

The agent has access to a set of possible policies π_θ(a|s), parametrized by θ ∈ R^p, that generate an action a given a current state s. Thus, each policy can be seen as a probability distribution conditioned on a state s. Following the same policy over time induces a whole trajectory of state-action-rewards τ = (s_t, a_t, r_t)_{t≥0}, which can be viewed as a sample from a trajectory distribution P_θ defined over the space of possible trajectories. For a given random trajectory τ induced by a policy π_θ, the agent receives a total discounted reward R(τ) := Σ_{t=1}^{∞} γ^{t−1} r(s_t), with discount factor 0 < γ < 1. This allows us to define the value function as the expected total reward conditioned on a particular initial state s:

$$V_\theta(s) := \mathbb{E}_{P_\theta}\left[R(\tau) \mid s_0 = s\right].$$

When the gradient of the score function ∇_θ log π_θ(a|s) is available, the policy gradient theorem allows us to express the gradient of R(θ):

$$\nabla_\theta R(\theta) = \mathbb{E}_{P_\theta}\Big[\sum_{t \geq 0} \gamma^{t}\, \nabla_\theta \log \pi_\theta(a_t|s_t)\, A_\theta(s_t, a_t)\Big],$$

where the expectation is taken over trajectories τ under P_θ, and A_θ(s, a) represents the advantage function, which can be expressed in terms of the value function V_θ(s) as

$$A_\theta(s, a) := \mathbb{E}\left[r(s') + \gamma V_\theta(s') \mid s, a\right] - V_\theta(s).$$

The agent seeks an optimal policy π_θ that maximizes the expected total reward under the trajectory distribution: R(θ) := E_{P_θ}[R(τ)].
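As a small concrete illustration of the quantities above, the total discounted reward R(τ) = Σ_{t≥1} γ^{t−1} r(s_t) of a finite rollout can be computed as follows (the helper name is ours; a finite reward list stands in for the infinite sum):

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^(t-1) * r_t for a finite list of rewards
    r_1, r_2, ... (a truncation of the infinite discounted sum)."""
    total = 0.0
    for t, r in enumerate(rewards, start=1):
        total += gamma ** (t - 1) * r
    return total

# discounted_return([1.0, 1.0, 1.0], gamma=0.5) -> 1.0 + 0.5 + 0.25 = 1.75
```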

B WASSERSTEIN NATURAL GRADIENT

Connection to the Fisher natural gradient and proximal methods. Both the WNG and FNG are obtained from a proximity measure between probability distributions.

Proposition 2. Let D(θ, θ′) be either the KL divergence KL(π_θ, π_θ′) or the Wasserstein-2 distance between the behavioral distributions, W_2(q_θ, q_θ′), and let g_D be either the FNG g_F or the WNG g_W; then g_D is given by the limit in Equation (18).

Optimization methods. We consider different optimization methods using the same objective L(θ). For the penalty methods, we use the closed-form expressions for both the Wasserstein distance and the KL, which are available explicitly in the case of Gaussians. For the natural gradient methods (WNG and FNG), we use the closed-form expressions that are also available in the Gaussian case. We denote them ∇_W L(θ) for the WNG and ∇_F L(θ) for the FNG, and express them in terms of the Euclidean/standard gradient ∇L(θ):

• Diagonal parameterization:
  – WNG:
  – FNG:
• Log-diagonal parameterization:
  – WNG:
  – FNG:

Training details. Training is run for up to 4000 gradient iterations, with λ = .9 and β = .1 unless they are varied.
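The limit in Equation (18) can be checked numerically. The sketch below uses a generic quadratic proximity D(θ, θ′) = (θ′ − θ)^⊤ G (θ′ − θ) as a stand-in for the KL / W_2 terms (the matrix G and the test objective are illustrative assumptions, not the paper's): for large β, the maximizer approaches the natural gradient G^{-1} ∇L(θ).

```python
import numpy as np
from scipy.optimize import minimize

def penalized_step(L, theta, G, beta):
    """argmax_u beta * [L(theta + u/beta) - L(theta) - (beta/2) * D],
    with D = (u/beta)^T G (u/beta); numerical maximization via BFGS."""
    def neg_obj(u):
        d = u / beta
        return -beta * (L(theta + d) - L(theta) - 0.5 * beta * (d @ G @ d))
    return minimize(neg_obj, np.zeros_like(theta)).x

# Illustrative smooth objective and metric (hypothetical choices).
L = lambda th: np.sin(th).sum() + th @ np.array([0.3, -0.2])
grad_L = lambda th: np.cos(th) + np.array([0.3, -0.2])
G = np.array([[2.0, 0.5], [0.5, 1.0]])
theta = np.array([0.4, -1.1])

u_beta = penalized_step(L, theta, G, beta=1e4)
natural = np.linalg.solve(G, grad_L(theta))  # the limit predicted by Eq. (18)
```

A first-order expansion of the penalized objective gives ∇L(θ)^⊤u − (1/2) u^⊤Gu + O(β^{-1}), whose maximizer is G^{-1}∇L(θ) up to an O(β^{-1}) correction, which is exactly what the code recovers for large β.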

