COUPLING POLICY GRADIENT WITH POPULATION-BASED SEARCH (PGPS)

Abstract

Gradient-based policy search algorithms (such as PPO, SAC, or TD3) in deep reinforcement learning (DRL) have achieved successful results on a range of challenging control tasks. However, they often suffer from deceptive gradients in flat or gently sloped regions of the objective function. As an alternative to policy gradient methods, population-based evolutionary approaches have also been applied to DRL. While population-based search algorithms learn more robustly across a broader range of tasks, they are usually sample inefficient. Recently, a few hybrid algorithms (such as CEM-RL) have been proposed that combine the gradient with a population in searching for the optimal policy, taking advantage of both camps. In this paper, we propose another hybrid algorithm, which couples policy gradient with population-based search more tightly. More specifically, we use the Cross-Entropy Method (CEM) for population-based search and Twin Delayed Deep Deterministic Policy Gradient (TD3) for policy gradient. In the proposed algorithm, called Coupling Policy Gradient with Population-based Search (PGPS), a single TD3 agent, which learns by gradient from all experiences generated by the population, leads the population by providing its critic function Q as a surrogate for selecting a better-performing next-generation population from the candidates. Conversely, if the TD3 agent falls behind the CEM population, it is updated toward the elite member of the CEM population using a loss function augmented with the distance between the TD3 actor and the CEM elite. Experiments on five challenging control tasks in the MuJoCo environment show that PGPS is robust to deceptive gradients and outperforms state-of-the-art algorithms.

1. INTRODUCTION

In Reinforcement Learning (RL), an agent interacts with the environment, and its goal is to find the policy that maximizes the objective function, which is generally defined as a cumulative discounted reward. Recently, many researchers have worked on combining deep neural networks with gradient-based RL algorithms, generally known as Deep Reinforcement Learning (DRL). This approach has achieved great success not only in the discrete action domain, such as Go (Silver et al., 2017) and Atari games (Mnih et al., 2015; 2016), but also in the continuous action domain, such as robot control (Fujimoto et al., 2018; Lillicrap et al., 2015; Schulman et al., 2015). However, it is difficult to apply gradient-based methods to an objective function J that includes many wide flat regions, since the gradient ∇θJ is near zero at a flat point. Figure 1 illustrates an extreme case consisting only of flat regions, called a piece-wise constant function. This problem remains an unsolved issue in gradient-based DRL for continuous control domains (Colas et al., 2018). The Swimmer task in the MuJoCo environment (Todorov et al., 2012) has already been reported to be hard for gradient-based methods (Jung et al., 2020; Liu et al., 2019). Our experiment shows that the objective function of Swimmer includes wide flat regions (Appendix A). The population-based Evolutionary Approach (EA), which is an alternative to the gradient-based method, has also shown successful results on various control tasks (Conti et al., 2018; Liu et al., 2019; Salimans et al., 2017; Such et al., 2017). As a population-based search, the EA generates a population of agents to explore the policy space, and the population is regenerated with improvement in each generation. The EA is also known as direct policy search (Schmidhuber & Zhao, 1998) because it searches directly by perturbing the parameters of the policy.
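The vanishing gradient on flat regions can be seen in a few lines. The following is a minimal numerical illustration (not from the paper, using our own toy objective): away from the step boundaries of a piece-wise constant function, a finite-difference gradient estimate is exactly zero, so a gradient ascent update cannot make progress.

```python
import numpy as np

def piecewise_constant(theta):
    """Toy objective with only flat regions: J(theta) is a step function."""
    return float(np.floor(theta))

def finite_difference_grad(J, theta, eps=1e-4):
    """Central-difference estimate of dJ/dtheta."""
    return (J(theta + eps) - J(theta - eps)) / (2 * eps)

# Away from a step boundary, both perturbed evaluations land on the same
# plateau, so the estimated gradient is exactly zero.
theta = 0.5
grad = finite_difference_grad(piecewise_constant, theta)
print(grad)  # 0.0
```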
Figure 1 also gives a simple depiction of the Cross-Entropy Method (CEM), a kind of population-based search: the current population, sampled from a target distribution, is evaluated, and the distribution is then updated in the direction that generates a more promising population. Because they do not depend on the gradient, these approaches are robust to flat or deceptive gradients (Staines & Barber, 2013; Liu et al., 2019). However, the EA is sample inefficient because it requires Monte-Carlo evaluation, and previous results and data generally cannot be reused. The off-policy Policy Gradient (PG) algorithm, in contrast, can use data from arbitrary policies to train its actor and critic functions. This creates exciting potential in combining the EA and PG: the data that would be discarded in a standard EA can be used directly to train the PG's functions. Khadka & Tumer (2018) and Pourchot & Sigaud (2018) introduced frameworks combining the EA and off-policy PG. However, the framework of Khadka & Tumer (2018) is less efficient at training the policy for general tasks than the PG algorithm alone, and the framework of Pourchot & Sigaud (2018) is unsuitable for tasks that provide a deceptive gradient. In this paper, we propose another hybrid algorithm, called Policy Gradient with Population-based Search (PGPS), in which the CEM and Twin Delayed Deep Deterministic Policy Gradient (TD3) (Fujimoto et al., 2018) are combined. It is as robust to a deceptive gradient as the CEM and more efficient at training the policy for general tasks than TD3. To be robust to a deceptive gradient, the proposed algorithm is constructed in a way similar to Khadka & Tumer (2018), where TD3 is trained using data from the CEM and periodically participates in the CEM population as an individual (PG guides EA). However, in this basic framework, the TD3 agent sometimes falls into an inferior solution and searches inefficiently.
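The CEM loop described above can be sketched in a few lines. The following is a minimal, self-contained sketch (our own illustration, not the paper's implementation): a Gaussian search distribution is sampled, the population is scored by Monte-Carlo evaluation, and the distribution is refit to the top-performing elites. The toy objective is deliberately piece-wise constant to show that the method does not stall on flat regions.

```python
import numpy as np

def cem_search(objective, dim, pop_size=50, n_elite=10, iters=100, seed=0):
    """Minimal Cross-Entropy Method: sample a population from a Gaussian,
    keep the best-scoring elites, and refit the Gaussian to them."""
    rng = np.random.default_rng(seed)
    mu, sigma = np.zeros(dim), np.ones(dim)
    for _ in range(iters):
        pop = rng.normal(mu, sigma, size=(pop_size, dim))  # sample candidates
        scores = np.array([objective(p) for p in pop])     # Monte-Carlo evaluation
        elites = pop[np.argsort(scores)[-n_elite:]]        # select elite members
        mu = elites.mean(axis=0)                           # update distribution
        sigma = elites.std(axis=0) + 1e-3                  # small floor avoids collapse
    return mu

# A piece-wise constant objective maximized at the origin: flat plateaus
# defeat gradient ascent, but CEM only needs score rankings to make progress.
step_objective = lambda p: float(np.floor(-np.abs(p).sum() * 10)) / 10
best = cem_search(step_objective, dim=3)
```

The distribution-update rule here (mean and standard deviation of the elites) is the standard diagonal-Gaussian CEM; practical variants add importance weights or noise schedules.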
To get TD3 out of an inferior solution, we let the EA guide TD3 via guided policy learning (Jung et al., 2020) (EA guides PG). Furthermore, the TD3 critic contributes to generating a more promising population by filtering the set of actors sampled from the CEM (Q-critic filtering). Lastly, to control the trade-off between search frequency and stable estimation, we use evaluation step scheduling in the population evaluation process (increasing evaluation steps): the algorithm searches frequently when far from the optimum and estimates stably when close to it. These approaches bring out more synergy between the CEM and TD3 while maintaining both the population-based search and the gradient-based search. Consequently, the proposed algorithm is not only robust to a deceptive gradient, but also produces outstanding performance at a low additional computational cost.
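The "EA guides PG" update can be written as an augmented actor objective. The following is an illustrative sketch under our own notation (the weight `lam` and the function names are our assumptions, not the paper's): the usual TD3 actor term (maximize the critic's Q) is combined with a penalty pulling the lagging TD3 actor's actions toward those of the current CEM elite.

```python
import numpy as np

def guided_actor_loss(q_values, actor_actions, elite_actions, lam=1.0):
    """Illustrative guided-policy-learning objective: standard deterministic
    policy-gradient term (maximize Q) plus a distance penalty pulling the
    TD3 actor toward the actions of the current CEM elite on the same states."""
    pg_term = -np.mean(q_values)  # standard TD3 actor loss: -E[Q(s, pi(s))]
    distance = np.mean(np.sum((actor_actions - elite_actions) ** 2, axis=1))
    return pg_term + lam * distance

# When the TD3 actor already matches the elite, only the PG term remains.
q = np.array([1.0, 2.0, 3.0])            # critic values on a batch of 3 states
a = np.ones((3, 2))                      # 2-dimensional actions
loss_same = guided_actor_loss(q, a, a)         # distance term vanishes
loss_far = guided_actor_loss(q, a, a + 1.0)    # distance term adds 2.0
print(loss_same, loss_far)  # -2.0 0.0
```

In an actual implementation this loss would be minimized by backpropagation through the actor network, and the penalty would be applied only when the TD3 agent falls behind the CEM population.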

2. RELATED WORK

Recently, beyond viewing EA merely as an alternative, a few attempts have been proposed in which one approach supports the other. One attempt is to use the EA to fill a replay buffer with diverse samples. In Colas et al. (2018), a Goal Exploration Process (GEP), a kind of EA, is first applied to search the policy space and fill a replay buffer with diverse samples, and then an off-policy PG algorithm is used sequentially to fine-tune the parameters of the policy. Another attempt is to combine a population-based approach and PG to efficiently search for a good policy, or for good hyper-parameters of an algorithm, in a parallel multi-learner setting. These applications generally consist of periodically evaluating the population and then distributing good knowledge to the other learners. To find the best architecture and hyper-parameters, Jaderberg et al. (2017) proposed Population-Based Training (PBT), in which the current best knowledge is periodically transferred to PG learners. Gangwani & Peng (2017) developed a distilled crossover using imitation learning and a mutation based on the PG. The proposed operators transfer the information in the current good policies into the next population without destructive changes to the neural networks. Jung et al. (2020) introduced a soft guided policy learning method to fuse the knowledge of the best policy with that of other identical multiple learners while maintaining a more extensive search area for exploration. The idea of combining the population-based EA and off-policy PG was recently introduced by Khadka & Tumer (2018). Their approach, called Evolution-Guided Reinforcement Learning (ERL), combines the Genetic Algorithm (GA) and the Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015). In the ERL framework, the GA transfers the experience from evaluation into the DDPG through a replay buffer, and the DDPG transfers the knowledge learned



Figure 1: Flat gradient and population-based search on a piece-wise constant function

