GENETIC SOFT UPDATES FOR POLICY EVOLUTION IN DEEP REINFORCEMENT LEARNING

Abstract

The combination of Evolutionary Algorithms (EAs) and Deep Reinforcement Learning (DRL) has recently been proposed to merge the benefits of both solutions. Existing mixed approaches, however, have been successfully applied only to actor-critic methods and introduce significant overhead. We address these issues by introducing a novel mixed framework that exploits a periodic genetic evaluation to soft update the weights of a DRL agent. The resulting approach is applicable to any DRL method and, in the worst case, does not exhibit detrimental behaviours. Experiments on robotic applications and continuous control benchmarks demonstrate the versatility of our approach, which significantly outperforms prior DRL, EA, and mixed approaches. Finally, we employ formal verification to confirm the policy improvement, mitigating the inefficient exploration and hyper-parameter sensitivity of DRL.
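The soft update mentioned above can be illustrated with a minimal Polyak-style interpolation of network weights toward a genetically selected elite. This is a sketch under stated assumptions, not the paper's exact procedure: the function name `soft_update`, the coefficient `tau`, and the use of plain NumPy arrays in place of network parameters are illustrative choices of ours.

```python
import numpy as np

def soft_update(agent_weights, elite_weights, tau=0.005):
    """Polyak-style soft update: move each of the agent's parameter
    tensors a small step (controlled by tau) toward the corresponding
    tensor of a genetically selected elite policy."""
    return [(1.0 - tau) * w + tau * e
            for w, e in zip(agent_weights, elite_weights)]

# Toy usage: a "network" with one weight matrix and one bias vector.
agent = [np.zeros((2, 2)), np.zeros(2)]
elite = [np.ones((2, 2)), np.ones(2)]
updated = soft_update(agent, elite, tau=0.1)
```

In this toy case every agent parameter starts at 0 and every elite parameter is 1, so each updated entry equals `tau`; a small `tau` keeps the gradient-based learner's weights dominant while still injecting the elite's information.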

1. INTRODUCTION

The key to a wider and successful application of DRL techniques in real scenarios is the ability to adapt to the surrounding environment by generalizing from training experiences. These solutions have to cope with the uncertainties of the operational environment and require a huge number of trials to achieve good performance. Hence, devising robust learning approaches while improving sample efficiency is one of the main challenges for a wider adoption of DRL. Despite promising results (Tai et al., 2017; Zhang et al., 2017; Marchesini et al., 2019), DRL also suffers from convergence to local optima, which is mainly caused by the lack of diverse exploration when operating in high-dimensional spaces. Several studies address the exploration problem (e.g., curiosity-driven exploration (Pathak et al., 2017), count-based exploration (Ostrovski et al., 2017)), but they typically rely on sensitive task-specific hyper-parameters. The sensitivity to such hyper-parameters is another significant issue in DRL, as it typically results in brittle convergence properties and poor performance in practical tasks (Haarnoja et al., 2018).

Evolutionary Algorithms (Fogel, 2006) have recently been employed as a promising gradient-free optimization alternative to DRL. The redundancy of these population-based approaches enables diverse exploration and improves robustness, leading to more stable convergence. In particular, Genetic Algorithms (GAs) (Montana & Davis, 1989) show competitive results compared to gradient-based DRL (Such et al., 2017) and are characterized by low computational cost. These gradient-free approaches, however, struggle to solve high-dimensional problems, exhibit poor generalization, and are significantly less sample efficient than gradient-based methods.
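As a concrete illustration of the gradient-free, population-based search discussed above, the following sketch runs a simple generational GA directly on flat weight vectors, using elitism plus Gaussian mutation. Everything here is an illustrative assumption rather than any cited method: the names `evolve`, `fitness`, `sigma`, and `elite_frac` are ours, and the toy fitness function simply rewards proximity to an arbitrary target vector.

```python
import numpy as np

rng = np.random.default_rng(0)

def fitness(w):
    # Stand-in fitness: negative squared distance to an arbitrary
    # target vector of ones (a real GA would roll out the policy).
    return -np.sum((w - 1.0) ** 2)

def evolve(pop, n_gen=50, sigma=0.1, elite_frac=0.2):
    """Generational GA on flat weight vectors: keep the top
    elite_frac of the population, refill the rest by adding
    Gaussian noise to randomly chosen elites."""
    n_elite = max(1, int(len(pop) * elite_frac))
    for _ in range(n_gen):
        pop.sort(key=fitness, reverse=True)
        elites = pop[:n_elite]
        offspring = [elites[rng.integers(n_elite)]
                     + sigma * rng.standard_normal(elites[0].shape)
                     for _ in range(len(pop) - n_elite)]
        pop = elites + offspring
    return max(pop, key=fitness)

best = evolve([rng.standard_normal(4) for _ in range(20)])
```

Because no gradient is computed, each candidate requires full episode rollouts to score, which is exactly the sample-inefficiency drawback noted above; the redundancy of the population is what buys the diverse exploration.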
An emergent research direction proposes the combination of gradient-free and gradient-based methods, mirroring the physical world, where evolution and learning cooperate to assimilate the best of both solutions (Simpson, 1953). The first mixed approach, Evolutionary Reinforcement Learning (ERL) (Khadka & Tumer, 2018), relies on an actor-critic architecture to inject information into an evolutionary population while the gradient-free and gradient-based training phases proceed in parallel. Similarly, Proximal Distilled ERL (PDERL) (Bodnar, 2020) extends ERL with different evolutionary methods. CEM-RL (Pourchot, 2019) brings this research direction into the family of distributed approaches, combining a portfolio of TD3 (Fujimoto et al., 2018) learners with the Cross-Entropy Method (Yan Duan, 2016).

