COUPLING POLICY GRADIENT WITH POPULATION-BASED SEARCH (PGPS)

Abstract

Gradient-based policy search algorithms (such as PPO, SAC, or TD3) in deep reinforcement learning (DRL) have shown successful results on a range of challenging control tasks. However, they often suffer from deceptive gradients: in flat or gently sloped regions of the objective function, the gradient provides little or no useful search direction. As an alternative to policy gradient methods, population-based evolutionary approaches have been applied to DRL. While population-based search algorithms show more robust learning across a broader range of tasks, they are usually sample-inefficient. Recently, a few attempts (such as CEM-RL) have been reported that combine gradient-based and population-based searches for the optimal policy; such hybrid algorithms take advantage of both camps. In this paper, we propose another hybrid algorithm, which more tightly couples policy gradient with population-based search. More specifically, we use the Cross Entropy Method (CEM) for population-based search and Twin Delayed Deep Deterministic Policy Gradient (TD3) for policy gradient. In the proposed algorithm, called Coupling Policy Gradient with Population-based Search (PGPS), a single TD3 agent, which learns via gradients from all experiences generated by the population, leads the population by providing its critic function Q as a surrogate for selecting a better-performing next-generation population from candidates. Conversely, if the TD3 agent falls behind the CEM population, it is updated toward the elite member of the CEM population using a loss function augmented with the distance between the TD3 agent and the CEM elite. Experiments on five challenging control tasks in the MuJoCo environment show that PGPS is robust to deceptive gradients and outperforms state-of-the-art algorithms.
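The two coupling mechanisms described above can be sketched as follows. This is a minimal illustration under assumed simplifications, not the authors' implementation: policies are plain callables, the critic is a function `critic_q(states, actions)`, and the proximity weight `beta` is a scalar; all names are hypothetical.

```python
import numpy as np

def select_next_population(candidates, critic_q, states, k):
    # Rank candidate policies by the TD3 critic's Q-value on a batch of
    # states (a cheap surrogate for full rollout returns) and keep top-k.
    scores = [np.mean(critic_q(states, pi(states))) for pi in candidates]
    order = np.argsort(scores)[::-1]
    return [candidates[i] for i in order[:k]]

def augmented_actor_loss(td3_params, elite_params, q_value, beta):
    # Standard TD3 actor loss (-Q) plus a proximity term that pulls the
    # TD3 parameters toward the CEM elite when the agent falls behind.
    distance = np.sum((td3_params - elite_params) ** 2)
    return -q_value + beta * distance

# Toy usage: with a critic that prefers larger actions, the surrogate
# selection keeps the policies with the larger gains.
make_policy = lambda w: (lambda s: w * s)
top = select_next_population(
    [make_policy(w) for w in (0.1, 0.9, 0.5)],
    critic_q=lambda s, a: a,
    states=np.ones(4),
    k=2,
)
```

The surrogate selection avoids evaluating every candidate with costly environment rollouts, which is where the sample-efficiency gain over pure population-based search comes from.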

1. INTRODUCTION

In Reinforcement Learning (RL), an agent interacts with the environment, and its goal is to find the policy that maximizes the objective function, which is generally defined as a cumulative discounted reward. Recently, many researchers have worked on combining deep neural networks with gradient-based RL algorithms, an approach generally known as Deep Reinforcement Learning (DRL). This approach has achieved great success not only in discrete action domains, such as Go (Silver et al., 2017) and Atari games (Mnih et al., 2015; 2016), but also in continuous action domains, such as robot control (Fujimoto et al., 2018; Lillicrap et al., 2015; Schulman et al., 2015). However, it is difficult to apply gradient-based methods to an objective function J that includes many wide flat regions, since the gradient ∇θJ is near zero at any point in a flat region. Figure 1 shows an extreme case consisting only of flat regions, called a piece-wise constant function. This problem remains an unsolved issue for gradient-based DRL in continuous control domains (Colas et al., 2018). The Swimmer task in the MuJoCo environment (Todorov et al., 2012) has already been reported to be difficult for gradient-based methods (Jung et al., 2020; Liu et al., 2019). Our experiment shows that the objective function of Swimmer includes wide flat regions (Appendix A). The population-based Evolutionary Approach (EA), which is an alternative to gradient-based methods, has also shown successful results in various control tasks (Conti et al., 2018; Liu et al., 2019;



Figure 1: Flat gradient and population-based search on a piece-wise constant function
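To make the flat-gradient issue behind Figure 1 concrete, the following toy sketch (assumed names and settings, not taken from the paper) computes a finite-difference gradient of a piece-wise constant function, which is zero inside any flat region, and then runs a minimal CEM loop that still makes progress on the same function because it relies on ranking sampled candidates rather than on local slope.

```python
import numpy as np

# Hypothetical piece-wise constant objective: constant on each interval
# [k, k+1), so the local gradient carries no information about the optimum.
def J(theta):
    return float(np.floor(theta))

# Finite-difference gradient at a point inside a flat region is exactly zero.
theta = 2.5
eps = 1e-4
grad = (J(theta + eps) - J(theta - eps)) / (2 * eps)

# Minimal Cross Entropy Method: sample a population, keep the elites,
# refit the sampling distribution. No gradient information is used.
rng = np.random.default_rng(0)
mu, sigma = 0.0, 3.0
for _ in range(20):
    pop = rng.normal(mu, sigma, size=50)                  # sample candidates
    elites = pop[np.argsort([J(p) for p in pop])[-10:]]   # keep top 20%
    mu, sigma = elites.mean(), elites.std() + 1e-3        # refit distribution
```

After the loop, `grad` is 0.0 while `mu` has moved to a plateau with a strictly higher objective value than the starting point, illustrating why population-based search remains effective where the policy gradient is deceptive.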

