A STRONG ON-POLICY COMPETITOR TO PPO

Anonymous authors
Paper under double-blind review

Abstract

As a recognized variant and improvement of Trust Region Policy Optimization (TRPO), Proximal Policy Optimization (PPO) has been widely used thanks to several advantages: efficient data utilization, easy implementation, and good parallelism. In this paper, we propose another powerful variant: a first-order policy gradient algorithm called Policy Optimization with Penalized Point Probability Distance (POP3D), whose penalty term is a lower bound to the square of the total variation divergence between the old and new policies. The penalty term has a dual effect: it prevents policy updates from overshooting and encourages more exploration. Through carefully controlled experiments on both discrete and continuous benchmarks, we show that our approach is highly competitive with PPO.

1. INTRODUCTION

With the development of deep reinforcement learning, impressive results have been produced in a wide range of fields, such as playing Atari games (Mnih et al., 2015; Hessel et al., 2018), controlling robots (Lillicrap et al., 2015), Go (Silver et al., 2017), and neural architecture search (Tan et al., 2019; Pham et al., 2018). The basis of a reinforcement learning algorithm is generalized policy iteration (Sutton & Barto, 2018), which consists of two essential iterative steps: policy evaluation and policy improvement. Among various algorithms, policy gradient methods form an active branch of reinforcement learning, whose foundations are the Policy Gradient Theorem and the classical REINFORCE algorithm (Sutton & Barto, 2018). Since then, a number of policy gradient variants have been proposed, such as Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015), Asynchronous Advantage Actor-Critic (A3C) (Mnih et al., 2016), Actor-Critic using Kronecker-factored Trust Region (ACKTR) (Wu et al., 2017), and Proximal Policy Optimization (PPO) (Schulman et al., 2017). Improving the policy monotonically had been nontrivial until Schulman et al. (2015) proposed Trust Region Policy Optimization (TRPO), in which the Fisher-vector product is used to cut down the computational burden. Specifically, the Kullback-Leibler divergence (KLD) acts as a hard constraint rather than a penalty in the objective, because the corresponding penalty coefficient is difficult to set across different problems. However, TRPO still has several drawbacks: it is complicated to implement and uses data inefficiently. Considerable effort has been devoted to improving TRPO since then, and the most commonly used successor is PPO. PPO can be regarded as a first-order variant of TRPO with obvious improvements in several facets. In particular, it proposes a pessimistic clipped surrogate objective in which TRPO's hard constraint is replaced by clipping the action probability ratio.
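The pessimistic clipped surrogate objective can be sketched as follows; this is a minimal NumPy illustration, and the function and variable names are ours rather than from the PPO paper:

```python
import numpy as np

def ppo_clipped_objective(new_logp, old_logp, advantages, clip_eps=0.2):
    """Pessimistic clipped surrogate objective of PPO (Schulman et al., 2017).

    new_logp / old_logp: log-probabilities of the sampled actions under the
    current and the old (behavior) policy. clip_eps is the clip range; 0.2 is
    a commonly used value. Names here are illustrative, not from the paper.
    """
    ratio = np.exp(new_logp - old_logp)  # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The elementwise minimum makes the bound pessimistic: the objective gains
    # nothing from pushing the ratio outside [1 - eps, 1 + eps], which keeps
    # the policy update close to the old policy without a hard constraint.
    return np.mean(np.minimum(unclipped, clipped))
```

For example, with a positive advantage and a probability ratio of 1.5, the contribution is clipped down to 1.2 times the advantage, so there is no incentive to move the ratio further from 1.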
In such a way, it constructs an unconstrained optimization problem to which any first-order stochastic gradient optimizer can be directly applied. Besides, it is easier to implement and more robust across various problems, achieving impressive results on Atari games (Brockman et al., 2016). However, the cost of data sampling is not always cheap. Haarnoja et al. (2018) designed an off-policy algorithm called Soft Actor-Critic that achieves state-of-the-art results by encouraging better exploration through maximum entropy. In this paper, we focus on on-policy improvements to PPO and answer the question: how can penalized optimization be successfully leveraged to solve the constrained problem formulated by Schulman et al. (2015)? Our contributions are as follows:

1. It proposes a simple variant of TRPO called POP3D along with a new surrogate objective containing a point probability penalty term, which is a symmetric lower bound to the square of the total variation divergence between policy distributions. Specifically, it helps to stabilize

