SAMPLE-EFFICIENT AUTOMATED DEEP REINFORCEMENT LEARNING

Abstract

Despite significant progress on challenging problems across various domains, applying state-of-the-art deep reinforcement learning (RL) algorithms remains difficult due to their sensitivity to the choice of hyperparameters. This sensitivity can partly be attributed to the non-stationarity of the RL problem, potentially requiring different hyperparameter settings at various stages of the learning process. Additionally, in the RL setting, hyperparameter optimization (HPO) requires a large number of environment interactions, hindering the transfer of the successes in RL to real-world applications. In this work, we tackle the issues of sample-efficient and dynamic HPO in RL. We propose a population-based automated RL (AutoRL) framework to meta-optimize arbitrary off-policy RL algorithms. In this framework, we optimize not only the hyperparameters but also the neural architecture while simultaneously training the agent. By sharing the collected experience across the population, we substantially increase the sample efficiency of the meta-optimization. We demonstrate the capabilities of our sample-efficient AutoRL approach in a case study with the popular TD3 algorithm on the MuJoCo benchmark suite, where we reduce the number of environment interactions needed for meta-optimization by up to an order of magnitude compared to population-based training.

1. INTRODUCTION

Deep reinforcement learning (RL) algorithms are often sensitive to the choice of internal hyperparameters (Jaderberg et al., 2017; Mahmood et al., 2018) and to the hyperparameters of the neural network architecture (Islam et al., 2017; Henderson et al., 2018), hindering them from being applied out-of-the-box to new environments. Tuning hyperparameters of RL algorithms can quickly become very expensive, both in terms of high computational costs and a large number of required environment interactions. Especially in real-world applications, sample efficiency is crucial (Lee et al., 2019). Hyperparameter optimization (HPO; Snoek et al., 2012; Feurer & Hutter, 2019) approaches often treat the algorithm under optimization as a black box, which in the setting of RL requires a full training run every time a configuration is evaluated. This leads to suboptimal sample efficiency in terms of environment interactions. Another pitfall for HPO is the non-stationarity of the RL problem. Hyperparameter settings that are optimal at the beginning of the learning phase can become unfavorable or even harmful in later stages (François-Lavet et al., 2015). This issue can be addressed through dynamic configuration, either through self-adaptation (Tokic, 2010; Tokic & Palm, 2011; François-Lavet et al., 2015) or through external adaptation as in population-based training (PBT; Jaderberg et al., 2017). However, current dynamic configuration approaches substantially increase the number of environment interactions. Furthermore, this prior work does not consider adapting the architecture.

In this work, we introduce a simple meta-optimization framework for Sample-Efficient Automated RL (SEARL) to address all three challenges: sample-efficient HPO, dynamic configuration, and the dynamic modification of the neural architecture. The foundation of our approach is a joint optimization of an off-policy RL agent and its hyperparameters using an evolutionary approach.
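To make the evolutionary outer loop concrete, the following is a minimal sketch of population-based hyperparameter evolution: rank the population by fitness, keep the better half, and refill with mutated copies. All names here (`mutate`, `evolve`, the `lr` hyperparameter, and the fitness values) are illustrative assumptions, not the actual SEARL implementation.

```python
import random

def mutate(hparams, rng):
    """Perturb one hyperparameter; here, halve or double the learning rate."""
    new = dict(hparams)
    new["lr"] = new["lr"] * rng.choice([0.5, 2.0])
    return new

def evolve(population, fitnesses, rng):
    """Truncation selection: keep the fitter half of the population and
    replace the rest with mutated copies of the survivors."""
    ranked = sorted(zip(fitnesses, population), key=lambda p: p[0], reverse=True)
    survivors = [hp for _, hp in ranked[: len(ranked) // 2]]
    children = [mutate(hp, rng) for hp in survivors]
    return survivors + children

rng = random.Random(0)
population = [{"lr": 1e-3} for _ in range(4)]      # 4 agents, same initial config
fitnesses = [1.0, 3.0, 2.0, 0.5]                   # e.g. mean episodic return
population = evolve(population, fitnesses, rng)    # next generation of configs
```

In a full dynamic-configuration setup, this step would run repeatedly during training rather than once per complete training run, which is what allows hyperparameters to change across learning stages.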
To reduce the amount of required environment interactions, we use a shared replay memory across the population of different RL agents. This allows agents to learn better policies due to the diverse

