GENETIC SOFT UPDATES FOR POLICY EVOLUTION IN DEEP REINFORCEMENT LEARNING

Abstract

The combination of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) has recently been proposed to merge the benefits of both solutions. Existing mixed approaches, however, have been successfully applied only to actor-critic methods and present significant overhead. We address these issues by introducing a novel mixed framework that exploits a periodic genetic evaluation to soft-update the weights of a DRL agent. The resulting approach is applicable with any DRL method and, in a worst-case scenario, does not exhibit detrimental behaviours. Experiments in robotic applications and continuous control benchmarks demonstrate the versatility of our approach, which significantly outperforms prior DRL, EA, and mixed approaches. Finally, we employ formal verification to confirm the policy improvement, mitigating the inefficient exploration and hyper-parameter sensitivity of DRL.

1. INTRODUCTION

The key to a wider and successful application of DRL techniques in real scenarios is the ability to adapt to the surrounding environment by generalizing from training experiences. These solutions have to cope with the uncertainties of the operational environment and require a huge number of trials to achieve good performance. Hence, devising robust learning approaches while improving sample efficiency is one of the challenges for wider utilization of DRL. Despite promising results (Tai et al., 2017; Zhang et al., 2017; Marchesini et al., 2019), DRL also suffers from convergence to local optima, which is mainly caused by the lack of diverse exploration when operating in high-dimensional spaces. Several studies address the exploration problem (e.g., curiosity-driven exploration (Pathak et al., 2017), count-based exploration (Ostrovski et al., 2017)), but they typically rely on sensitive, task-specific hyper-parameters. The sensitivity to such hyper-parameters is another significant issue in DRL, as it typically results in brittle convergence properties and poor performance in practical tasks (Haarnoja et al., 2018). Evolutionary Algorithms (Fogel, 2006) have recently been employed as a promising gradient-free optimization alternative to DRL. The redundancy of these population-based approaches has the advantage of enabling diverse exploration and improving robustness, leading to more stable convergence. In particular, Genetic Algorithms (GAs) (Montana & Davis, 1989) show competitive results compared to gradient-based DRL (Such et al., 2017) and are characterized by low computational cost. These gradient-free approaches, however, struggle to solve high-dimensional problems, generalize poorly, and are significantly less sample efficient than gradient-based methods.
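To make the population-based search discussed above concrete, the following is a minimal sketch of a GA over flat network-weight vectors. The Gaussian-mutation operator, elitist selection, and the names `mutate`, `evolve`, and `elite_frac` are illustrative choices for exposition, not the exact operators of any of the cited works.

```python
import numpy as np

def mutate(genome, sigma=0.02, rng=None):
    """Return a mutated copy of a flat weight vector (Gaussian perturbation)."""
    rng = rng or np.random.default_rng()
    return genome + rng.normal(0.0, sigma, size=genome.shape)

def evolve(population, fitness_fn, elite_frac=0.2, rng=None):
    """One GA generation: keep the top elite_frac genomes, refill by mutating elites."""
    rng = rng or np.random.default_rng()
    ranked = sorted(population, key=fitness_fn, reverse=True)
    n_elite = max(1, int(len(population) * elite_frac))
    elites = ranked[:n_elite]
    # Refill the population with mutated copies of randomly chosen elites.
    children = [mutate(elites[rng.integers(n_elite)], rng=rng)
                for _ in range(len(population) - n_elite)]
    return elites + children
```

Because elites survive unchanged, the best fitness in the population is monotone non-decreasing across generations; this is the stability property that makes GAs attractive as a gradient-free baseline, at the cost of the poor sample efficiency noted above.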
An emergent research direction proposes the combination of gradient-free and gradient-based methods following the physical world, where evolution and learning cooperate to assimilate the best of both solutions (Simpson, 1953). The first mixed approach, Evolutionary Reinforcement Learning (ERL) (Khadka & Tumer, 2018), relies on an actor-critic architecture to inject information into an evolutionary population while the gradient-free and gradient-based training phases proceed in parallel. Similarly, Proximal Distilled ERL (PDERL) (Bodnar, 2020) extends ERL with different evolutionary methods. CEM-RL (Pourchot, 2019) brings this research direction into the family of distributed approaches, combining a portfolio of TD3 (Fujimoto et al., 2018) learners with the Cross-Entropy Method (Yan Duan, 2016). These mixed approaches, however, also present several limitations, which we address through our work: (i) the parallel training phases of the DRL and EA components (Khadka & Tumer, 2018; Bodnar, 2020), or the multitude of learners (Pourchot, 2019), result in significant overhead (detailed in Section 4). (ii) The actor-critic formalization of previous mixed approaches allows them to be easily evaluated in continuous locomotion benchmarks (Brockman et al., 2016; Todorov et al., 2012). However, it also hinders their combination with value-based DRL (Marchesini & Farinelli, 2020a). This is important as recent work (Matheron et al., 2019) shows the limitations of actor-critic methods in deterministic tasks, which in contrast can be effectively addressed with value-based DRL. In particular, Section 4 shows that a value-based implementation of Khadka & Tumer (2018) does not converge in our discrete robotic task. (iii) The combination strategy does not ensure better performance than the DRL agent alone, as it does not prevent detrimental behaviours (e.g., drops in performance). This is shown by the poor performance of value-based implementations of ERL and PDERL (Section 4).
We propose a novel mixed framework, called Soft Updates for Policy Evolution (Supe-RL), that enables us to combine the characteristics of GAs with any DRL algorithm, addressing the limitations of previous approaches. Supe-RL (Figure 1) benefits from the high sample efficiency of gradient-based DRL while incorporating gradient-free GAs to generate diverse experiences and find better policies. In summary, Supe-RL-based algorithms perform a periodic genetic evaluation, applying GAs to the agent network. A selection operator uses a fitness metric to evaluate the population, choosing the best performing genome (i.e., the weights of the network), which is used to update the weights of the DRL agent. In contrast to previous work, our genetic evaluation is only performed periodically, drastically reducing the overhead. Furthermore, our soft update (Section 3) allows a direct integration of GAs with any DRL algorithm, as it is akin to performing a gradient step towards a better policy, avoiding detrimental behaviours. As detailed in Section 3.1, this allows using value-based DRL, exploiting the variety of optimizations developed for the well-known DQN (van Hasselt et al., 2016; Schaul et al., 2016; Wang et al., 2016; Fortunato et al., 2017; Bellemare et al., 2017). Crucially, our genetic component influences the policy of the DRL agent only if one of its mutated versions performs better in a subset of evaluation episodes. Hence, as detailed in Section 3, with a sufficient number of episodes we obtain a good estimate of the overall performance of the population. Our evaluation focuses on mapless navigation, a well-known problem in robotics and recent DRL (Zhang et al., 2017; Wahid et al., 2019; Marchesini & Farinelli, 2020b). In particular, we consider two tasks developed with Unity (Juliani et al., 2018): (i) a discrete action space indoor scenario with obstacles for a mobile robot and (ii) a continuous task for aquatic drones, with dynamic waves and physically realistic water.
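The mechanism described above (periodic genetic evaluation, fitness-based selection, and a soft update only when a mutated version outperforms the agent) can be sketched as follows. This is a hedged illustration over a flat weight vector: the names `genetic_soft_update`, `evaluate`, and the specific values of `sigma`, `pop_size`, and the soft-update coefficient `tau` are illustrative assumptions, not the paper's exact operators or hyper-parameters.

```python
import numpy as np

def genetic_soft_update(theta, evaluate, sigma=0.05, pop_size=10,
                        tau=0.01, rng=None):
    """One periodic genetic evaluation with a soft update (Supe-RL-style sketch).

    theta:    flat weight vector of the DRL agent's network
    evaluate: fitness metric, e.g. average return over a few evaluation episodes
    """
    rng = rng or np.random.default_rng()
    parent_fitness = evaluate(theta)
    # Population of mutated copies of the agent's weights (the genomes).
    population = [theta + rng.normal(0.0, sigma, theta.shape)
                  for _ in range(pop_size)]
    fitnesses = [evaluate(g) for g in population]
    best = int(np.argmax(fitnesses))
    # The genetic component influences the agent only if a mutated
    # version performs better on the evaluation episodes.
    if fitnesses[best] > parent_fitness:
        # Soft update: a small step towards the better policy, akin to a
        # gradient step, which avoids abrupt detrimental changes.
        theta = (1.0 - tau) * theta + tau * population[best]
    return theta
```

Calling this routine only every few training epochs, rather than running evolution in parallel with learning, is what reduces the overhead relative to ERL-style approaches; the conservative interpolation with `tau` is what makes the update safe to bolt onto any DRL algorithm, value-based or actor-critic.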
Besides considering standard metrics related to performance (success rate and reward), we also consider safety properties that are particularly important in these domains (e.g., the agent does not collide with obstacles). In more detail, we employ formal verification (Corsi et al., 2020) to compute the percentage of input cases that cause violations of these properties. This is important to confirm our claim that Supe-RL-based approaches correctly bias the exploration process towards more robust policy regions with higher returns. Results show that Supe-RL algorithms improve performance (i.e., training time, success rate, average reward), stability, and safety over value-based and policy-gradient DRL (Rainbow (Hessel et al., 2018), PPO (Schulman et al., 2017)) and ERL. Finally, we performed additional comparisons of Supe-RL with: (i) PDERL (Bodnar, 2020), to highlight the poor performance of previous mixed approaches when combined with value-based DRL; (ii) CEM-RL (Pourchot, 2019) in the aquatic scenario, to show the differences with a multi-learner approach; and (iii) ERL in standard continuous benchmarks (i.e., MuJoCo locomotion (Brockman et al., 2016; Todorov et al., 2012)), where results confirm the superior performance of Supe-RL.
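For intuition on the violation-percentage metric above: formal verification tools such as the one of Corsi et al. (2020) compute this quantity exhaustively over the input domain, whereas the sketch below only approximates it by uniform sampling. The function name `violation_rate`, the bound-based input domain, and the callbacks are hypothetical illustrations, not the verifier's actual interface.

```python
import numpy as np

def violation_rate(policy, property_holds, input_low, input_high,
                   n_samples=10_000, rng=None):
    """Estimate the fraction of inputs whose action violates a safety property.

    policy:         maps an observation to an action
    property_holds: returns True if (obs, action) satisfies the property
    """
    rng = rng or np.random.default_rng()
    low, high = np.asarray(input_low), np.asarray(input_high)
    violations = 0
    for _ in range(n_samples):
        obs = rng.uniform(low, high)          # sample an input case
        if not property_holds(obs, policy(obs)):
            violations += 1
    return violations / n_samples
```

Unlike this Monte-Carlo estimate, formal verification provides guarantees over the whole input region, which is why it can soundly confirm that training has moved the policy into safer regions.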

2. BACKGROUND AND RELATED WORK

We formalize robotic navigation as an RL problem, defined over a Markov Decision Process, as described in recent DRL literature (Tai et al., 2017; Zhang et al., 2017; Wahid et al., 2019). DRL for robotic navigation focuses exclusively on continuous-action algorithms such as the actor-critic methods DDPG (Lillicrap et al., 2015), TD3 (Fujimoto et al., 2018), and PPO (Schulman et al., 2017). Such methods
Figure 1: Supe-RL overview.

