A STRONG ON-POLICY COMPETITOR TO PPO

Anonymous authors
Paper under double-blind review

Abstract

As a recognized variant of and improvement over Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely used thanks to several advantages: efficient data utilization, easy implementation, and good parallelism. In this paper, we propose another powerful variant: a first-order policy gradient algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D), whose penalty is a lower bound to the square of the total variation divergence. The penalty term has dual effects: it prevents policy updates from overshooting and encourages more exploration. Through carefully controlled experiments on both discrete and continuous benchmarks, we show that our approach is highly competitive with PPO.

1. INTRODUCTION

With the development of deep reinforcement learning, impressive results have been produced in a wide range of fields such as playing Atari games (Mnih et al., 2015; Hessel et al., 2018), robotic control (Lillicrap et al., 2015), Go (Silver et al., 2017), and neural architecture search (Tan et al., 2019; Pham et al., 2018). The basis of a reinforcement learning algorithm is generalized policy iteration (Sutton & Barto, 2018), which consists of two essential iterative steps: policy evaluation and policy improvement. Among various algorithms, policy gradient methods form an active branch of reinforcement learning whose foundations are the Policy Gradient Theorem and the classical algorithm REINFORCE (Sutton & Barto, 2018). Since then, a handful of policy gradient variants have been proposed, such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Actor-Critic using Kronecker-factored Trust Region (ACKTR) (Wu et al., 2017), and Proximal Policy Optimization (PPO) (Schulman et al., 2017).

Improving the policy monotonically had been nontrivial until Schulman et al. (2015) proposed Trust Region Policy Optimization (TRPO), in which the Fisher vector product is utilized to cut down the computational burden. Specifically, the Kullback-Leibler divergence (KLD) acts as a hard constraint instead of a penalty in the objective, because a fixed penalty coefficient is difficult to set across different problems. However, TRPO still has several drawbacks: it is complicated to implement and uses data inefficiently. Considerable effort has been devoted to improving TRPO since then, and the most widely used successor is PPO. PPO can be regarded as a first-order variant of TRPO and improves on it in several respects. In particular, it proposes a pessimistic clipped surrogate objective in which TRPO's hard constraint is replaced by clipping the action probability ratio. In this way, it constructs an unconstrained optimization problem so that any first-order stochastic gradient optimizer can be applied directly. Besides, it is easier to implement and more robust across various problems, achieving impressive results on Atari games (Brockman et al., 2016).

However, the cost of data sampling is not always cheap. Haarnoja et al. (2018) design an off-policy algorithm called Soft Actor-Critic and achieve state-of-the-art results by encouraging better exploration through maximum entropy. In this paper, we focus on the on-policy setting: we improve PPO and answer the question of how to successfully leverage penalized optimization to solve the constrained problem formulated by Schulman et al. (2015). The main contributions of this paper are as follows:

1. It proposes a simple variant of TRPO called POP3D along with a new surrogate objective containing a point probability penalty term, which is a symmetric lower bound to the square of the total variation divergence between policy distributions. Specifically, it helps to stabilize the learning process and encourages exploration. Furthermore, it avoids the coefficient-tuning headache of the penalized version of TRPO, where it is arduous to select one fixed value for various environments.

2. It achieves state-of-the-art results among on-policy algorithms by a clear margin on 49 Atari games within 40 million frame steps, based on two shared metrics, and it also achieves competitive results compared with PPO in the continuous domain. In addition, it dives into the mechanism behind PPO's improvement over TRPO from the perspective of solution manifolds, which also plays an important role in our method.

3. It enjoys almost all of PPO's advantages, such as easy implementation and fast learning. We provide the code and training logs to make our work reproducible.
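For reference, we recall the pessimistic clipped surrogate objective of PPO (Schulman et al., 2017) discussed above. With the probability ratio $r_t(\theta) = \pi_\theta(a_t|s_t) / \pi_{\theta_{old}}(a_t|s_t)$ and clipping parameter $\epsilon$, PPO maximizes

$$L^{CLIP}(\theta) = \mathbb{E}_t\Big[\min\big(r_t(\theta)\hat{A}_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big],$$

which removes TRPO's hard KLD constraint so that any first-order stochastic gradient optimizer can be applied.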

2. PRELIMINARY KNOWLEDGE AND RELATED WORK

2.1 POLICY GRADIENT

Agents interact with the environment and receive rewards, which are in turn used to adjust their policy. At state $s_t$, an agent acts according to its policy $\pi$, transitions to a new state $s_{t+1}$, and is rewarded $r_t$ by the environment. Its objective is to maximize the discounted return (accumulated reward) $R_t$. In particular, given a policy $\pi$, $R_t$ is defined as

$$R_t = \sum_{n=0}^{\infty} \gamma^n r_{t+n} = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots, \quad (1)$$

where $\gamma \in (0, 1)$ is the discount factor that controls the weight of future rewards. For a neural network with parameters $\theta$, the policy $\pi_\theta(a|s)$ can be learned by maximizing Equation 1 using the back-propagation algorithm. In particular, given $Q(s, a)$, which represents the agent's return in state $s$ after taking action $a$, the objective function can be written as

$$\max_\theta \, \mathbb{E}_{s,a}\big[\log \pi_\theta(a|s) Q(s, a)\big]. \quad (2)$$

Equation 2 lays the foundation for a handful of policy-gradient-based algorithms. Another variant can be derived by using

$$A(s, a) = Q(s, a) - V(s) \quad (3)$$

to replace $Q(s, a)$ in Equation 2 equivalently; here $V(s)$ can be any function as long as it depends on $s$ but not on $a$. In most cases the state-value function is used for $V$, which not only helps to reduce variance but also has a clear physical meaning. Formally, the objective can be written as

$$\max_\theta \, \mathbb{E}_{s,a}\big[\log \pi_\theta(a|s) A(s, a)\big]. \quad (4)$$
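To make Equation 4 concrete, the following is a minimal sketch of the corresponding surrogate loss for a discrete-action policy; the function name, the use of PyTorch, and the categorical parameterization are illustrative assumptions rather than details prescribed in this paper.

```python
import torch
from torch.distributions import Categorical

def policy_gradient_loss(logits, actions, advantages):
    # logits:     (batch, num_actions) outputs of the policy network pi_theta
    # actions:    (batch,) actions actually taken in the rollout
    # advantages: (batch,) advantage estimates A(s, a), treated as constants
    dist = Categorical(logits=logits)
    log_probs = dist.log_prob(actions)  # log pi_theta(a|s)
    # Negate because optimizers minimize, while Equation 4 is a maximization.
    return -(log_probs * advantages.detach()).mean()
```

Minimizing this loss with any stochastic gradient optimizer performs gradient ascent on Equation 4.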

2.2 ADVANTAGE ESTIMATE

A commonly used method for advantage calculation is the one-step estimate

$$A(s_t, a_t) = Q(s_t, a_t) - V(s_t) = r_t + \gamma V(s_{t+1}) - V(s_t).$$

More generally, with the temporal-difference residual $\delta^V_t = r_t + \gamma V(s_{t+1}) - V(s_t)$, the $k$-step estimate is

$$\hat{A}^{(k)}_t := \sum_{l=0}^{k-1} \gamma^l \delta^V_{t+l} = -V(s_t) + r_t + \gamma r_{t+1} + \cdots + \gamma^{k-1} r_{t+k-1} + \gamma^k V(s_{t+k}).$$

However, a more accurate method called generalized advantage estimation (GAE) is proposed in Schulman et al. (2016), where the estimates over all time steps are combined and summarized using $\lambda$-based weights:

$$\hat{A}^{GAE(\gamma,\lambda)}_t := \sum_{l=0}^{\infty} (\gamma\lambda)^l \delta^V_{t+l}.$$

The parameter $\lambda$ satisfies $0 \leq \lambda \leq 1$ and controls the trade-off between bias and variance. All methods in this paper utilize $\hat{A}^{GAE(\gamma,\lambda)}_t$ to estimate the advantage.
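As an illustration of how $\hat{A}^{GAE(\gamma,\lambda)}_t$ can be computed from a finite rollout, below is a minimal NumPy sketch; the function name, the truncation to $T$ steps, and the bootstrap value $V(s_T)$ appended to `values` are assumptions made for the example, and episode-termination masking is omitted for brevity.

```python
import numpy as np

def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    # rewards: length-T array of r_t from one rollout
    # values:  length-(T+1) array of V(s_t), including the bootstrap value V(s_T)
    T = len(rewards)
    advantages = np.zeros(T)
    gae = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD residual delta_t
        gae = delta + gamma * lam * gae                          # accumulate (gamma*lam)-weighted sum
        advantages[t] = gae
    return advantages
```

The backward recursion is equivalent to the truncated sum $\sum_{l=0}^{T-t-1} (\gamma\lambda)^l \delta^V_{t+l}$.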

