EFFICIENT WASSERSTEIN NATURAL GRADIENTS FOR REINFORCEMENT LEARNING

Abstract

A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines.

1. INTRODUCTION

Defining efficient optimization algorithms for reinforcement learning (RL) that are able to leverage a meaningful measure of similarity between policies is a longstanding and challenging problem (Lee & Popović, 2010; Meyerson et al., 2016; Conti et al., 2018b). Many such works rely on similarity measures such as the Kullback-Leibler (KL) divergence (Kullback & Leibler, 1951) to define procedures for updating the policy of an agent as it interacts with the environment. These are generally motivated by the need to maintain a small variation in the KL between successive updates in an off-policy context, in order to control the variance of the importance weights used in the estimation of the gradient. This includes work by Kakade (2002) and Schulman et al. (2015), who propose to use the Fisher Natural Gradient (Amari, 1997) as a way to update policies, using local geometric information to allow larger steps in directions where policies vary less; and the work of Schulman et al. (2017), which relies on a global measure of proximity using a soft KL penalty in the objective. While those methods achieve impressive performance, and the choice of the KL is well-motivated, one can still ask whether it is possible to include information about the behavior of policies when measuring similarity, and whether this could lead to more efficient algorithms. Pacchiano et al. (2019) provide a first insight into this question, representing policies using behavioral distributions which incorporate information about the outcome of the policies in the environment. The Wasserstein Distance (WD) (Villani, 2016) between those behavioral distributions is then used as a similarity measure between their corresponding policies. They further propose to use such behavioral similarity as a global soft penalty in the total objective. Hence, like the KL penalty, proximity between policies is measured globally, and does not necessarily exploit the local geometry defined by the behavioral embeddings.
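As a toy illustration of the ingredient at the heart of this approach, consider the Wasserstein-1 distance between two equal-size one-dimensional empirical samples (for instance, returns produced by two policies, standing in here for their behavioral embeddings). In this special case, W1 reduces to the mean absolute difference of matched order statistics. This is a minimal sketch under those simplifying assumptions; the function names and sample data are illustrative, not taken from the paper.

```python
import numpy as np

def wasserstein1_empirical(x, y):
    """W1 distance between two equal-size 1-D empirical distributions.

    For sorted samples x_(1) <= ... <= x_(n) and y_(1) <= ... <= y_(n),
    W1 equals the mean absolute difference of matched order statistics.
    """
    x, y = np.sort(np.asarray(x, float)), np.sort(np.asarray(y, float))
    assert x.shape == y.shape, "toy version assumes equal sample sizes"
    return float(np.mean(np.abs(x - y)))

# Returns collected under two hypothetical policies: shifting every
# outcome by a constant c shifts W1 by exactly |c|, reflecting the
# metric structure of outcomes that the KL divergence ignores.
old_returns = [0.0, 1.0, 2.0]
new_returns = [1.0, 2.0, 3.0]
print(wasserstein1_empirical(old_returns, new_returns))  # 1.0
```

Unlike the KL, which compares densities pointwise, this distance depends on how far apart outcomes are, which is what makes it a natural similarity measure between behavioral distributions.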
In this work, we show that substantial improvements can be achieved by taking into account the local behavior of policies. We introduce new, efficient optimization methods for RL that incorporate the local geometry defined by the behavioral distributions for both policy gradient (PG) and evolution strategies (ES) approaches. Our main contributions are as follows:
1. We leverage recent work (Li & Montufar, 2018a;b; Li, 2018; Li & Zhao, 2019; Chen & Li, 2018) which introduces the notion of the Wasserstein Information Matrix to define a local behavioral similarity measure between policies. This allows us to identify the Wasserstein Natural Gradient (WNG) as a key ingredient for optimization methods that rely on the local behavior of policies. To enable efficient estimation of the WNG, we build on the recent work of Arbel et al. (2020) and extend it to cases where the re-parameterization trick is not applicable and only the score function of the model is available.
2. We introduce two novel methods, Wasserstein natural policy gradients (WNPG) and Wasserstein natural evolution strategies (WNES), which exploit the local behavioral structure of policies through the WNG and can easily be incorporated into standard RL optimization routines. When combined with a global behavioral similarity measure such as a WD penalty, they yield substantial improvement over using the penalty alone without access to local information. We find that such WNG-based methods are especially useful on tasks in which initial progress is difficult.
3. Finally, we present, to our knowledge, the first in-depth comparative analysis of the FNG and WNG, highlighting a clear, interpretable advantage of the WNG over the FNG on tasks where the optimal solution is deterministic, a scenario that arises frequently in ES and in policy optimization for MDPs (Puterman, 2010).
This suggests that WNG could be a powerful tool for this class of problems, especially when reaching accurate solutions quickly is crucial. In Section 2, we present a brief review of policy gradient approaches and the role of divergence measures as regularization penalties. In Section 3 we introduce the WNG and detail its relationship with the FNG and the use of Wasserstein penalties, and in Section 4 we derive practical algorithms for applying the WNG to PG and ES. Section 5 contains our empirical results.

2. BACKGROUND

Policy Gradient (PG) methods directly parametrize a policy $\pi_\theta$ and optimize the parameter $\theta$ by stochastic gradient ascent on the expected total discounted reward $R(\theta)$. An estimate $\hat{g}_k$ of the gradient of $R(\theta)$ at $\theta_k$ can be computed by differentiating a surrogate objective $L(\theta)$, which typically comes in two flavors, depending on whether training is on-policy or off-policy: $L(\theta) = \hat{\mathbb{E}}\big[\log \pi_\theta(a_t \mid s_t)\, \hat{A}_t\big]$ (on-policy), or $L(\theta) = \hat{\mathbb{E}}\big[\tfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_k}(a_t \mid s_t)}\, \hat{A}_t\big]$ (off-policy). The expectation $\hat{\mathbb{E}}$ is an empirical average over $N$ trajectories $\tau_i = (s_1^i, a_1^i, r_1^i, \dots, s_T^i, a_T^i, r_T^i)$ of states, actions, and rewards obtained by simulating from the environment using $\pi_{\theta_k}$. The scalar $\hat{A}_t$ is an estimator of the advantage function and can be computed, for instance, as $\hat{A}_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, where $\gamma \in [0, 1)$ is a discount factor and $V$ is the value function, often learned as a parametric function via temporal-difference learning (Sutton & Barto, 2018). Reusing trajectories can reduce the computational cost, at the expense of increased variance of the gradient estimator (Schulman et al., 2017): performing multiple policy updates with trajectories from an older policy $\pi_{\theta_{\text{old}}}$ means that the current policy $\pi_\theta$ can drift away from the older one. Since the objective is an expectation under $\pi_\theta$, for which fresh trajectories are not available, it is instead estimated using importance sampling, re-weighting the old trajectories by the importance weights $\pi_\theta / \pi_{\theta_{\text{old}}}$. When $\pi_\theta$ is too far from $\pi_{\theta_{\text{old}}}$, these weights can have large variance, which can drastically degrade performance if handled naïvely (Schulman et al., 2017). KL-based policy optimization (PO) methods address these limitations by ensuring that the policy does not change substantially between successive updates, where change is measured by the KL divergence between the resulting action distributions.
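The estimators above can be sketched in a few lines of NumPy. This is a minimal illustration of the one-step TD advantage and the two surrogate objectives; the function names are ours, not the paper's, and the off-policy ratio is computed in log space, a standard numerical-stability choice.

```python
import numpy as np

def td_advantages(rewards, values, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` has one more entry than `rewards`: it includes the value
    of the final state reached by the trajectory.
    """
    rewards = np.asarray(rewards, float)
    values = np.asarray(values, float)
    return rewards + gamma * values[1:] - values[:-1]

def on_policy_surrogate(log_probs, advantages):
    # L(theta) = E[ log pi_theta(a_t|s_t) * A_t ]
    return float(np.mean(np.asarray(log_probs) * np.asarray(advantages)))

def off_policy_surrogate(log_probs_new, log_probs_old, advantages):
    # L(theta) = E[ (pi_theta / pi_theta_k)(a_t|s_t) * A_t ], with the
    # importance ratio computed in log space for numerical stability.
    ratios = np.exp(np.asarray(log_probs_new) - np.asarray(log_probs_old))
    return float(np.mean(ratios * np.asarray(advantages)))
```

Note that when the new and old log-probabilities coincide, the importance ratios are all one and the off-policy surrogate reduces to the mean advantage, matching the on-policy case at the current parameters.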
The general idea is to add either a hard KL constraint, as in TRPO (Schulman et al., 2015), or a soft penalty, as in PPO (Schulman et al., 2017), to encourage proximity between policies. In the first case, TRPO recovers the FNG, with a step-size further adjusted using line search to enforce the hard constraint. The FNG permits larger steps in directions where the policy changes least, thus reducing the number of updates required for optimization. In the second case, the soft penalty leads to an objective of the form $\max_\theta\; L(\theta) - \beta\, \hat{\mathbb{E}}\big[\mathrm{KL}\big(\pi_{\theta_k}(\cdot \mid s_t),\, \pi_\theta(\cdot \mid s_t)\big)\big]$. The KL penalty prevents updates from deviating too far from the current policy $\pi_{\theta_k}$, thereby controlling the variance of the gradient estimator; this allows taking multiple steps with the same simulated trajectories without degrading performance. While both methods account for proximity between policies as measured by the KL, they do not take into account the behavior of those policies in the environment. Exploiting such information can greatly improve performance.
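For discrete action spaces, the soft-penalty objective above can be sketched directly: compute the KL between the old and new action distributions at each visited state, average over states, and subtract it from the surrogate. This is a schematic illustration with hypothetical function names, not the authors' implementation.

```python
import numpy as np

def categorical_kl(p, q):
    """KL(p, q) for two discrete action distributions over the same support."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * (np.log(p) - np.log(q))))

def kl_penalized_objective(surrogate, old_dists, new_dists, beta):
    """Soft-penalty objective: L(theta) - beta * mean_t KL(pi_old(.|s_t), pi_theta(.|s_t)).

    `old_dists` / `new_dists` are per-state action distributions under the
    old and candidate policies; `surrogate` is the estimated L(theta).
    """
    penalty = np.mean([categorical_kl(p, q)
                       for p, q in zip(old_dists, new_dists)])
    return float(surrogate - beta * penalty)

# When the candidate policy matches the old one, the penalty vanishes
# and the objective equals the surrogate; as the policies diverge, the
# penalty grows and pulls the objective down.
print(kl_penalized_objective(1.0, [[0.5, 0.5]], [[0.5, 0.5]], beta=0.5))
print(kl_penalized_objective(1.0, [[0.5, 0.5]], [[0.9, 0.1]], beta=0.5))
```

Tuning $\beta$ trades off progress on the surrogate against the variance of the importance-weighted estimator: a larger $\beta$ keeps successive policies closer, permitting more updates per batch of trajectories.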

