SAMPLE-EFFICIENT AUTOMATED DEEP REINFORCEMENT LEARNING

Abstract

Despite significant progress on challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains challenging due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, which may require different hyperparameter settings at different stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of the successes of RL to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. In this framework, we optimize both the hyperparameters and the neural architecture while simultaneously training the agent. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm in the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.

1. INTRODUCTION

Deep reinforcement learning (RL) algorithms are often sensitive to the choice of internal hyperparameters (Jaderberg et al., 2017; Mahmood et al., 2018) and the hyperparameters of the neural network architecture (Islam et al., 2017; Henderson et al., 2018), hindering their out-of-the-box application to new environments. Tuning the hyperparameters of RL algorithms can quickly become very expensive, both in terms of computational cost and the number of required environment interactions. Especially in real-world applications, sample efficiency is crucial (Lee et al., 2019). Hyperparameter optimization (HPO; Snoek et al., 2012; Feurer & Hutter, 2019) approaches often treat the algorithm under optimization as a black box, which in the RL setting requires a full training run every time a configuration is evaluated. This leads to suboptimal sample efficiency in terms of environment interactions. Another pitfall for HPO is the non-stationarity of the RL problem: hyperparameter settings that are optimal at the beginning of the learning phase can become unfavorable or even harmful in later stages (François-Lavet et al., 2015). This issue can be addressed through dynamic configuration, either through self-adaptation (Tokic & Palm, 2011; François-Lavet et al., 2015; Tokic, 2010) or through external adaptation as in population-based training (PBT; Jaderberg et al., 2017). However, current dynamic configuration approaches substantially increase the number of environment interactions, and this prior work does not consider adapting the architecture.

In this work, we introduce a simple meta-optimization framework for Sample-Efficient Automated RL (SEARL) to address all three challenges: sample-efficient HPO, dynamic configuration, and dynamic modification of the neural architecture. The foundation of our approach is the joint optimization of an off-policy RL agent and its hyperparameters using an evolutionary approach.
To reduce the number of required environment interactions, we use a shared replay memory across the population of RL agents. This allows agents to learn better policies from the diverse collection of experience and enables us to perform AutoRL at practically the same number of environment interactions as training a single configuration. Further, SEARL preserves the benefits of dynamic configuration present in PBT, enabling online HPO that discovers hyperparameter schedules rather than a single static configuration. Our approach uses evolvable neural networks that preserve trained network parameters while adapting their architecture. We emphasize that SEARL is simple to use and allows efficient AutoRL for any off-policy deep RL algorithm. In a case study optimizing the popular TD3 algorithm (Fujimoto et al., 2018) in the MuJoCo benchmark suite, we demonstrate the benefits of our framework and provide extensive ablation and analytic experiments. We show a 10× improvement in the sample efficiency of the meta-optimization compared to random search and PBT. We also demonstrate the generalization capabilities of our approach by meta-optimizing the established DQN algorithm (Mnih et al., 2015) for the Atari benchmark.
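The population-with-shared-replay loop described above can be sketched in a few lines. This is an illustrative sketch, not SEARL's actual implementation: the names `SharedReplayBuffer`, `mutate_hyperparams`, and `searl_generation`, the mutation operator, and the agent interface (`clone`, `hyperparams`) are all hypothetical stand-ins.

```python
import random
from collections import deque

class SharedReplayBuffer:
    """One buffer filled by every member of the population."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)

    def add(self, transition):
        self.buffer.append(transition)

    def sample(self, batch_size):
        return random.sample(list(self.buffer), min(batch_size, len(self.buffer)))

def mutate_hyperparams(hp):
    """Perturb one randomly chosen hyperparameter (hypothetical operator)."""
    new_hp = dict(hp)
    key = random.choice(list(new_hp))
    new_hp[key] *= random.choice([0.8, 1.25])
    return new_hp

def searl_generation(population, buffer, env_rollout, train_step, n_select):
    # 1) Every agent acts; ALL transitions go into the single shared buffer,
    #    so meta-optimization costs roughly the interactions of one agent.
    fitness = []
    for agent in population:
        episode_return, transitions = env_rollout(agent)
        for t in transitions:
            buffer.add(t)
        fitness.append(episode_return)
    # 2) Every agent trains on the shared, diverse experience.
    for agent in population:
        train_step(agent, buffer)
    # 3) Selection + hyperparameter mutation yields the next generation,
    #    producing a dynamic schedule rather than one static configuration.
    ranked = [a for _, a in sorted(zip(fitness, population),
                                   key=lambda p: p[0], reverse=True)]
    elites = ranked[:n_select]
    children = [e.clone(hyperparams=mutate_hyperparams(e.hyperparams))
                for e in elites]
    return elites + children
```

Because every agent both writes to and samples from the same buffer, the interaction cost of evaluating many configurations collapses to roughly that of a single run, which is the source of the sample-efficiency gain claimed above.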

2. RELATED WORK

Advanced experience collection: Evolutionary RL (ERL), introduced by Khadka & Tumer (2018), and its successors PDERL (Bodnar et al., 2020) and CERL (Khadka et al., 2019) combine actor-critic RL algorithms with genetic algorithms to evolve a small population of agents. This line of work mutates policies to increase the diversity of collected sample trajectories. The experience is stored in a shared replay memory and used to train an actor-critic learner with a fixed network architecture using DDPG/TD3, while periodically adding the trained actor to a separate population of evolved actors. CERL extends this approach by using a whole population of learners with varying discount rates. However, this line of work aims to increase the performance of a single configuration, whereas our work optimizes the hyperparameters and the neural architecture while training multiple agents. SEARL also benefits from a diverse set of mutated actors collecting experience in a shared replay memory. Schmitt et al. (2019) mix on-policy experience with experience shared across concurrent hyperparameter sweeps to take advantage of parallel exploration. However, this work tackles neither dynamic configuration schedules nor architecture adaptation.

Ape-X/IMPALA: Resource utilization in the RL setting can be improved by using multiple actors in a distributed setup and decoupling the learner from the actors. Horgan et al. (2018) extend a prioritized replay memory to a distributed setting (Ape-X) to scale experience collection for a replay memory used by a single trainer. In IMPALA (Espeholt et al., 2018), multiple rollout actors asynchronously send their collected trajectories to a central learner through a queue. To correct the policy lag this distributed setup introduces, IMPALA leverages the proposed V-trace algorithm in the central learner. These works aim at collecting large amounts of experience to benefit the learner, but they do not explore the space of hyperparameter configurations. In contrast, the presented work aims to reduce the number of environment interactions needed to perform efficient AutoRL.
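The actor/learner decoupling discussed above can be illustrated with a minimal producer-consumer sketch: several rollout actors feed one central learner through a queue, as in Ape-X and IMPALA. The environment, transitions, and "gradient step" here are placeholders, not the real systems, and the function names are invented for illustration.

```python
import queue
import threading

def actor(actor_id, out_queue, n_steps):
    """Rollout worker: pushes transitions to the learner's queue."""
    for step in range(n_steps):
        transition = (actor_id, step, 0.0)  # placeholder (id, obs, reward)
        out_queue.put(transition)
    out_queue.put(None)  # sentinel: this actor is done

def learner(in_queue, n_actors):
    """Central learner: consumes transitions until all actors finish."""
    consumed, done = 0, 0
    while done < n_actors:
        item = in_queue.get()
        if item is None:
            done += 1
        else:
            consumed += 1  # a real learner would take a gradient step here
    return consumed

def run(n_actors=3, n_steps=5):
    q = queue.Queue()
    threads = [threading.Thread(target=actor, args=(i, q, n_steps))
               for i in range(n_actors)]
    for t in threads:
        t.start()
    total = learner(q, n_actors)
    for t in threads:
        t.join()
    return total
```

Note how this setup scales experience *collection* but leaves the hyperparameters of all actors identical, which is exactly the gap the present work addresses.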



Please find the source code on GitHub: github.com/automl/SEARL



We provide an open-source implementation of SEARL. Our contributions are:
• We introduce an AutoRL framework for off-policy RL which enables: (i) sample-efficient HPO while training a population of RL agents using a shared replay memory; (ii) dynamic optimization of hyperparameters to adjust to different training phases; (iii) online neural architecture search in the context of gradient-based deep RL.
• We propose a fair evaluation protocol for comparing AutoRL and HPO in RL that accounts for the actual cost in terms of environment interactions.
• We demonstrate the benefits of SEARL in a case study, reducing the number of environment interactions needed for meta-optimization by up to an order of magnitude.
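The online architecture search in contribution (iii) depends on mutating a network's architecture without discarding its trained weights. A minimal sketch of one such function-preserving mutation, widening a hidden layer by duplicating an existing unit in the spirit of Net2Net (Chen et al.); the function name and plain-array representation are illustrative, not SEARL's actual mechanism:

```python
import numpy as np

def widen_hidden_layer(w_in, b, w_out, rng=None):
    """Add one hidden unit by duplicating a random existing unit.

    w_in:  (n_hidden, n_inputs) weights into the hidden layer
    b:     (n_hidden,) hidden biases
    w_out: (n_out, n_hidden) weights out of the hidden layer

    The duplicated unit's outgoing weight column is halved and shared
    between the original and the copy, so the layer's output, and hence
    the trained policy, is unchanged by the mutation.
    """
    rng = rng or np.random.default_rng()
    i = int(rng.integers(w_in.shape[0]))
    w_in_new = np.vstack([w_in, w_in[i:i + 1]])        # copy incoming weights
    b_new = np.append(b, b[i])                         # copy bias
    w_out_new = np.hstack([w_out, w_out[:, i:i + 1]])  # copy outgoing column
    w_out_new[:, i] *= 0.5                             # split the unit's
    w_out_new[:, -1] *= 0.5                            # contribution in half
    return w_in_new, b_new, w_out_new
```

Because the widened network computes exactly the same function as before the mutation, training can continue without a performance drop, which is what makes architecture search feasible *during* RL training rather than across separate runs.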

Neural architecture search with Reinforcement Learning: The work of Zoph & Le (2016) on RL for neural architecture search (NAS) is an interesting counterpart to our work at the intersection of RL and NAS. Zoph & Le (2016) employ RL for NAS to search for better-performing architectures, whereas we employ NAS for RL to make use of better network architectures.

AutoRL: Within the framework of AutoRL, the joint hyperparameter optimization and architecture search problem is addressed as a two-stage optimization problem by Chiang et al. (2019), first shaping the reward function and optimizing the network architecture afterward. Similarly, Runge et al. (2019) propose to jointly optimize algorithm hyperparameters and network architectures by searching

