ON PROXIMAL POLICY OPTIMIZATION'S HEAVY-TAILED GRADIENTS

Abstract

Modern policy gradient algorithms, notably Proximal Policy Optimization (PPO), rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate pronounced heavy-tailedness of the gradients, specifically for the actor network, which increases as the current policy diverges from the behavioral one (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. Subsequently, we study the effects of the standard PPO clipping heuristics, demonstrating how these tricks primarily serve to offset heavy-tailedness in gradients. Motivated by these connections, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Our method achieves comparable performance to that of PPO with all heuristics enabled on a battery of MuJoCo continuous control tasks.

1. INTRODUCTION

As Deep Reinforcement Learning (DRL) methods have made strides on tasks as diverse as game playing and continuous control (Berner et al., 2019; Silver et al., 2017; Mnih et al., 2015), policy gradient methods (Williams, 1992; Sutton et al., 2000; Mnih et al., 2016) have emerged as a popular alternative to dynamic programming approaches. Since the breakthrough results of Mnih et al. (2016) demonstrated the applicability of policy gradients in DRL, a number of popular variants have emerged (Schulman et al., 2017; Espeholt et al., 2018). Proximal Policy Optimization (PPO) (Schulman et al., 2017), one of the most popular policy gradient methods, introduced the clipped importance sampling update, an effective heuristic for off-policy learning. However, while the stated motivation for clipping draws upon trust-region enforcement, the method's behavior tends to deviate from this key algorithmic principle (Ilyas et al., 2018) and exhibits sensitivity to implementation details (Engstrom et al., 2019). More generally, policy gradient methods are brittle, sensitive to both the random seed and hyperparameter choices, and poorly understood (Ilyas et al., 2018; Engstrom et al., 2019; Henderson et al., 2017; 2018; Islam et al., 2017). The ubiquity of these issues raises a broader concern about our understanding of policy gradient methods. In this work, we take a step towards understanding the workings of PPO, the most prominent and widely used deep policy gradient method. Noting that the heuristics implemented in PPO are evocative of estimation techniques from robust statistics in outlier-rich settings, we conjecture that the heavy-tailed distribution of gradients is the main obstacle addressed by these heuristics. We perform a rigorous empirical study to understand the causes of heavy-tailedness in PPO gradients.
Furthermore, we provide a novel perspective on the clipping heuristics implemented in PPO by showing that these heuristics primarily serve to alleviate heavy-tailedness in gradients. Our first contribution is to analyze the role played by each component of the PPO objective in the heavy-tailedness of the gradients. We observe that as training proceeds, gradients of both the actor and the critic loss become more heavy-tailed. Our findings show that during on-policy gradient steps the advantage estimates are the primary contributors to the heavy-tailed nature of the gradients. Moreover, as off-policyness increases during training (i.e., as the behavioral policy and actor policy diverge), the likelihood ratios that appear in the surrogate objective exacerbate the heavy-tailed behavior. Subsequently, we demonstrate that the clipping heuristics present in standard PPO implementations (i.e., gradient clipping, actor objective clipping, and value loss clipping) significantly counteract the heavy-tailedness induced by off-policy training. Finally, motivated by this analysis, we present an algorithm that uses Geometric Median-of-Means (GMOM), a high-dimensional robust aggregation method adapted from the statistics literature. Without any of the objective clipping and gradient clipping heuristics implemented in PPO, the GMOM algorithm nearly matches PPO's performance on MuJoCo (Todorov et al., 2012) continuous control tasks.
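To make the GMOM aggregation concrete, the following is a minimal NumPy sketch of the general Geometric Median-of-Means recipe: partition per-sample gradients into blocks, average within each block, and return the geometric median of the block means (computed here with Weiszfeld's algorithm). This illustrates the robust-aggregation idea in general, not the paper's exact implementation; function names, the number of blocks, and tolerances are illustrative choices.

```python
import numpy as np

def geometric_median(points, iters=100, tol=1e-7):
    # Weiszfeld's algorithm: iteratively re-weight points by inverse
    # distance to the current estimate to find the geometric median.
    z = points.mean(axis=0)
    for _ in range(iters):
        d = np.linalg.norm(points - z, axis=1)
        d = np.maximum(d, 1e-12)  # guard against division by zero
        w = 1.0 / d
        z_new = (w[:, None] * points).sum(axis=0) / w.sum()
        if np.linalg.norm(z_new - z) < tol:
            return z_new
        z = z_new
    return z

def gmom(grads, n_blocks=8):
    # Geometric Median-of-Means: split per-sample gradients into blocks,
    # average each block, then aggregate the block means robustly.
    blocks = np.array_split(grads, n_blocks)
    means = np.stack([b.mean(axis=0) for b in blocks])
    return geometric_median(means)
```

With heavy-tailed gradient samples, a few extreme values corrupt only a minority of block means, so the geometric median stays close to the bulk of the data while the plain mean is dragged far away.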

2. PRELIMINARIES

We define a Markov Decision Process (MDP) as a tuple $(S, A, R, \gamma, P)$, where $S$ is the set of environment states, $A$ is the set of agent actions, $R : S \times A \to \mathbb{R}$ is the reward function, $\gamma$ is the discount factor, and $P : S \times A \times S \to \mathbb{R}$ is the state transition probability distribution. The goal in reinforcement learning is to learn a policy $\pi_\theta : S \times A \to \mathbb{R}^+$, parameterized by $\theta$, that maximizes the expected cumulative discounted reward (known as the return). Formally, $\pi^* := \operatorname{argmax}_\pi \mathbb{E}_{a_t \sim \pi(\cdot \mid s_t),\, s_{t+1} \sim P(\cdot \mid s_t, a_t)}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$. Policy gradient methods directly parameterize the policy (also known as the actor network). Since directly optimizing the cumulative reward can be challenging, modern policy gradient algorithms typically optimize a surrogate reward function. Often the surrogate objective includes a likelihood ratio to allow importance sampling from a behavior policy $\pi_0$ while optimizing the policy $\pi_\theta$. For example, Schulman et al. (2015a) optimize: $\max_\theta \mathbb{E}_{(s_t, a_t) \sim \pi_0}\left[\frac{\pi_\theta(a_t \mid s_t)}{\pi_0(a_t \mid s_t)} A^{\pi_0}(s_t, a_t)\right]$, where $A^{\pi_\theta}(s_t, a_t) = Q^{\pi_\theta}(s_t, a_t) - V^{\pi_\theta}(s_t)$. Here, the Q-function $Q^{\pi_\theta}(s, a)$ is the expected discounted reward after taking action $a$ in state $s$ and following $\pi_\theta$ afterwards, and $V^{\pi_\theta}(s)$ is the value estimate (implemented with a critic network). However, the surrogate is indicative of the true reward function only when $\pi_\theta$ and $\pi_0$ are close in distribution. Different policy gradient methods (Schulman et al., 2015a; 2017; Kakade, 2002) attempt to enforce this closeness in different ways. Natural Policy Gradients (Kakade, 2002) and Trust Region Policy Optimization (TRPO) (Schulman et al., 2015a) utilize conservative policy iteration with an explicit divergence constraint, which provides provable lower-bound guarantees on the improvement of the parameterized policy.
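As a toy illustration of the importance-sampled surrogate above (all numbers made up): each advantage estimate is weighted by the likelihood ratio $\rho_t = \pi_\theta(a_t \mid s_t) / \pi_0(a_t \mid s_t)$, and the surrogate is the empirical mean of the weighted terms.

```python
import numpy as np

# Hypothetical action probabilities and advantage estimates for three
# sampled state-action pairs.
pi_theta = np.array([0.5, 0.3, 0.9])   # probs under the current policy
pi_0     = np.array([0.4, 0.6, 0.3])   # probs under the behavior policy
adv      = np.array([1.0, -0.5, 2.0])  # advantage estimates A^{pi_0}

rho = pi_theta / pi_0                  # likelihood ratios: [1.25, 0.5, 3.0]
surrogate = np.mean(rho * adv)         # (1.25 - 0.25 + 6.0) / 3 ≈ 2.333
```

Note how the single large ratio (3.0) dominates the estimate; this is precisely the mechanism by which likelihood ratios can induce heavy-tailed gradients as the two policies drift apart.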
On the other hand, PPO (Schulman et al., 2017) implements a clipping heuristic on the likelihood ratio of the surrogate reward function to avoid excessively large policy updates. Specifically, PPO optimizes the following objective: $\max_\theta \mathbb{E}_{(s_t, a_t) \sim \pi_0}\left[\min\left(\operatorname{clip}(\rho_t, 1 - \epsilon, 1 + \epsilon)\, \hat{A}^{\pi_0}(s_t, a_t),\; \rho_t\, \hat{A}^{\pi_0}(s_t, a_t)\right)\right]$, where $\rho_t := \frac{\pi_\theta(a_t \mid s_t)}{\pi_0(a_t \mid s_t)}$. We refer to $\rho_t$ as the likelihood ratio. Due to the minimum with the unclipped surrogate reward, the PPO objective acts as a pessimistic bound on the true surrogate reward. As in standard PPO implementations, we use Generalized Advantage Estimation (GAE) (Schulman et al., 2015b). Moreover, instead of fitting the value network via regression to target values, $L_V = (V_{\theta_t} - V_{\mathrm{targ}})^2$, standard implementations fit the value network with a PPO-like objective: $L_V = \max\left((V_{\theta_t} - V_{\mathrm{targ}})^2,\; \left(\operatorname{clip}(V_{\theta_t}, V_{\theta_{t-1}} - \epsilon, V_{\theta_{t-1}} + \epsilon) - V_{\mathrm{targ}}\right)^2\right)$, where $\epsilon$ is the same value used to clip likelihood ratios in PPO's objective. PPO uses the following training procedure: at any iteration $t$, the agent creates a clone of the current policy $\pi_{\theta_t}$, which interacts with the environment to collect rollouts $B$ (i.e., state-action pairs $\{(s_i, a_i)\}_{i=1}^{N}$). Then the algorithm optimizes the policy $\pi_\theta$ and value function $V_\theta$ for a fixed $K$ gradient steps on the sampled data $B$. Since at every iteration the first gradient step is taken on the same policy from which the data was sampled, we refer to these gradient updates as on-policy steps.
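The two clipped objectives above can be sketched in a few lines of NumPy. This is a simplified sketch of the standard formulas, not the exact loss code of any particular implementation; both losses are written as quantities to minimize, so the actor surrogate is negated.

```python
import numpy as np

def ppo_actor_loss(rho, adv, eps=0.2):
    # Pessimistic clipped surrogate: take the elementwise minimum of the
    # unclipped and clipped importance-weighted advantages, then negate
    # so that minimizing this loss maximizes the surrogate reward.
    unclipped = rho * adv
    clipped = np.clip(rho, 1.0 - eps, 1.0 + eps) * adv
    return -np.mean(np.minimum(unclipped, clipped))

def ppo_value_loss(v_new, v_old, v_targ, eps=0.2):
    # Value clipping mirrors the actor objective: take the elementwise
    # maximum of the unclipped and clipped squared errors.
    v_clipped = np.clip(v_new, v_old - eps, v_old + eps)
    return np.mean(np.maximum((v_new - v_targ) ** 2,
                              (v_clipped - v_targ) ** 2))
```

For example, with $\rho_t = 3$, $\hat{A} = 2$, and $\epsilon = 0.2$, the unclipped term is $6$ but the clipped term is $1.2 \times 2 = 2.4$, so the minimum caps the contribution at $2.4$, blunting exactly the kind of extreme likelihood-ratio terms that drive heavy-tailedness.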

