ON PROXIMAL POLICY OPTIMIZATION'S HEAVY-TAILED GRADIENTS

Abstract

Modern policy gradient algorithms, notably Proximal Policy Optimization (PPO), rely on an arsenal of heuristics, including loss clipping and gradient clipping, to ensure successful learning. These heuristics are reminiscent of techniques from robust statistics, commonly used for estimation in outlier-rich ("heavy-tailed") regimes. In this paper, we present a detailed empirical study to characterize the heavy-tailed nature of the gradients of the PPO surrogate reward function. We demonstrate pronounced heavy-tailedness of the gradients, specifically for the actor network, which increases as the current policy diverges from the behavioral one (i.e., as the agent goes further off policy). Further examination implicates the likelihood ratios and advantages in the surrogate reward as the main sources of the observed heavy-tailedness. Subsequently, we study the effects of the standard PPO clipping heuristics, demonstrating how these tricks primarily serve to offset heavy-tailedness in gradients. Motivated by these connections, we propose incorporating GMOM, a high-dimensional robust estimator, into PPO as a substitute for three clipping tricks. Our method achieves comparable performance to that of PPO with all heuristics enabled on a battery of MuJoCo continuous control tasks.

1. INTRODUCTION

As Deep Reinforcement Learning (DRL) methods have made strides on such diverse tasks as game playing and continuous control (Berner et al., 2019; Silver et al., 2017; Mnih et al., 2015), policy gradient methods (Williams, 1992; Sutton et al., 2000; Mnih et al., 2016) have emerged as a popular alternative to dynamic programming approaches. Since the breakthrough results of Mnih et al. (2016) demonstrated the applicability of policy gradients in DRL, a number of popular variants have emerged (Schulman et al., 2017; Espeholt et al., 2018). Proximal Policy Optimization (PPO) (Schulman et al., 2017), one of the most popular policy gradient methods, introduced the clipped importance sampling update, an effective heuristic for off-policy learning. However, while the stated motivation for clipping draws upon trust-region enforcement, the behavior of these methods tends to deviate from this key algorithmic principle (Ilyas et al., 2018) and exhibits sensitivity to implementation details (Engstrom et al., 2019). More generally, policy gradient methods are brittle, sensitive to both the random seed and hyperparameter choices, and poorly understood (Ilyas et al., 2018; Engstrom et al., 2019; Henderson et al., 2017; 2018; Islam et al., 2017). The ubiquity of these issues raises a broader concern about our understanding of policy gradient methods. In this work, we take a step toward understanding the workings of PPO, the most prominent and widely used deep policy gradient method. Noting that the heuristics implemented in PPO are evocative of estimation techniques from robust statistics in outlier-rich settings, we conjecture that the heavy-tailed distribution of gradients is the main obstacle addressed by these heuristics. We perform a rigorous empirical study to understand the causes of heavy-tailedness in PPO gradients.
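The clipped importance sampling update referenced above (Schulman et al., 2017) can be sketched numerically. The snippet below is a minimal illustration under our own naming conventions, not the implementation used in any of the cited works; the function name and the eps=0.2 default are assumptions for the example:

```python
import numpy as np

def ppo_clipped_objective(ratios, advantages, eps=0.2):
    """Sketch of the PPO clipped surrogate objective.

    ratios:     likelihood ratios pi_theta(a|s) / pi_theta_old(a|s)
    advantages: advantage estimates A_t
    eps:        clipping parameter (0.2 is a commonly used default;
                treated here as an illustrative choice)
    """
    unclipped = ratios * advantages
    clipped = np.clip(ratios, 1.0 - eps, 1.0 + eps) * advantages
    # The elementwise minimum gives a pessimistic bound on the unclipped
    # surrogate, removing the incentive to push the ratio far outside
    # the interval [1 - eps, 1 + eps].
    return np.mean(np.minimum(unclipped, clipped))

# A ratio of 2.0 with positive advantage is capped at 1 + eps = 1.2:
# ppo_clipped_objective(np.array([2.0]), np.array([1.0]))  -> 1.2
```

Note that the clipping acts directly on the likelihood ratios, which, as discussed below, are one of the main sources of the heavy-tailedness we observe.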
Furthermore, we provide a novel perspective on the clipping heuristics implemented in PPO by showing that these heuristics primarily serve to alleviate heavy-tailedness in gradients. Our first contribution is to analyze the role played by each component of the PPO objective in the heavy-tailedness of the gradients. We observe that as training proceeds, gradients of both the actor and the critic loss become more heavy-tailed. Our findings show that during on-policy gradient steps the advantage estimates are the primary contributors to the heavy-tailed nature of the gradients. Moreover, as off-policyness increases (i.e., as the behavioral policy and actor policy diverge) dur-

