GRAY-BOX GAUSSIAN PROCESSES FOR AUTOMATED REINFORCEMENT LEARNING

Abstract

Despite having achieved spectacular milestones in an array of important real-world applications, most Reinforcement Learning (RL) methods are very brittle with respect to their hyperparameters. Notwithstanding the crucial importance of hyperparameter settings for training state-of-the-art agents, the task of hyperparameter optimization (HPO) in RL is understudied. In this paper, we propose a novel gray-box Bayesian Optimization technique for HPO in RL that enriches Gaussian Processes with reward curve estimations based on generalized logistic functions. In a very large-scale experimental protocol, comprising 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), 22 environments (OpenAI Gym: MuJoCo, Atari, Classic Control), and 7 HPO baselines, we demonstrate that our method significantly outperforms current HPO practices in RL.

1. INTRODUCTION

While Reinforcement Learning (RL) has celebrated amazing successes in many applications (Mnih et al., 2015; Silver et al., 2016; OpenAI, 2018; Andrychowicz et al., 2020; Degrave et al., 2022), it remains very brittle (Henderson et al., 2018; Engstrom et al., 2020). The successes of RL are achieved by leading experts with many years of expertise in the "art" of RL, but the field does not yet provide a technology that broadly yields successes off the shelf. A crucial hindrance to both broader impact and faster research progress is that an RL algorithm well-tuned for one problem does not necessarily work for another; in particular, optimal hyperparameters are environment-specific and must be carefully tuned to yield strong performance. Despite the crucial importance of strong hyperparameter settings in RL (Henderson et al., 2018; Chen et al., 2018; Zhang et al., 2021; Andrychowicz et al., 2021), the field of hyperparameter optimization (HPO) for RL is understudied. It is largely dominated by manual tuning, computationally expensive hyperparameter sweeps, or population-based training, which trains many agents in parallel that exchange hyperparameters and states (Jaderberg et al., 2017). While these methods are feasible for large industrial research labs, they are costly, substantially increase the CO2 footprint of artificial intelligence research (Dhar, 2020), and make it very hard for smaller industrial and academic labs to partake in RL research.

In this paper, we address this gap by developing a computationally efficient yet robust HPO method for RL. The method we propose exploits the fact that reward curves tend to have similar shapes. As a result, the future rewards an agent collects with a given hyperparameter setting can be predicted quite well from its initial rewards, providing a computationally cheap mechanism for comparing hyperparameter settings against each other.
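The reward-curve insight above can be sketched in a few lines: fit a generalized (four-parameter) logistic function to the rewards observed early in training and extrapolate it to estimate the final reward, so that hyperparameter settings can be compared without training any of them to completion. This is only an illustrative sketch using SciPy, with made-up function names and parameter values, not the paper's actual model or implementation.

```python
# Hedged sketch: extrapolating a reward curve with a generalized logistic
# function. All names and numerical values here are illustrative.
import numpy as np
from scipy.optimize import curve_fit

def generalized_logistic(t, lower, upper, growth, midpoint):
    """Logistic with lower/upper asymptotes, growth rate, and inflection point."""
    return lower + (upper - lower) / (1.0 + np.exp(-growth * (t - midpoint)))

def extrapolate_reward(steps, rewards, horizon):
    """Fit the logistic to observed (steps, rewards) and predict the reward
    the run would reach at training step `horizon`."""
    p0 = [min(rewards), max(rewards), 0.05, float(np.median(steps))]
    params, _ = curve_fit(generalized_logistic, steps, rewards, p0=p0, maxfev=10000)
    return generalized_logistic(horizon, *params)

# Synthetic reward curve: observe the first 60% of training, predict the end.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 100.0, 200)          # training progress (e.g., epochs)
true_rewards = generalized_logistic(t, -100.0, 300.0, 0.1, 40.0)
seen = t <= 60.0
noisy = true_rewards[seen] + rng.normal(0.0, 2.0, seen.sum())
est = extrapolate_reward(t[seen], noisy, horizon=100.0)
```

In a Bayesian optimization loop, such extrapolated final rewards (rather than the noisy partial rewards themselves) would be what the surrogate model compares across hyperparameter configurations.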
We combine this insight into a novel gray-box Bayesian optimization method that includes a parametric reward-curve extrapolation layer in the neural network used to compute a Gaussian process kernel. In a large-scale empirical evaluation using 5 popular RL methods (DDPG, A2C, PPO, SAC, TD3), 22 environments (OpenAI Gym: MuJoCo, Atari, Classic Control), and 7 HPO baselines, we demonstrate that our resulting method, the Reward-Curve Gaussian Process (RCGP), yields state-of-the-art performance across the board. In summary, our contributions are as follows:

