EVOLVING REINFORCEMENT LEARNING ALGORITHMS

Abstract

We propose a method for meta-learning reinforcement learning algorithms by searching over the space of computational graphs which compute the loss function for a value-based model-free RL agent to optimize. The learned algorithms are domain-agnostic and can generalize to new environments not seen during training. Our method can both learn from scratch and bootstrap off known existing algorithms, like DQN, enabling interpretable modifications which improve performance. Learning from scratch on simple classical control and gridworld tasks, our method rediscovers the temporal-difference (TD) algorithm. Bootstrapped from DQN, we highlight two learned algorithms which obtain good generalization performance over other classical control tasks, gridworld-type tasks, and Atari games. Analysis of the learned algorithms' behavior shows that they resemble recently proposed RL algorithms that address overestimation in value-based methods.

1. INTRODUCTION

Designing new deep reinforcement learning algorithms that can efficiently solve a wide variety of problems generally requires a tremendous amount of manual effort. Learning to design reinforcement learning algorithms, or even small sub-components of algorithms, would help ease this burden and could result in better algorithms than researchers could design manually. Our work might then shift from designing these algorithms manually to designing the language and optimization methods for developing these algorithms automatically.

Reinforcement learning algorithms can be viewed as a procedure that maps an agent's experience to a policy that obtains high cumulative reward over the course of training. We formulate the problem of training an agent as one of meta-learning: an outer loop searches over the space of computational graphs or programs that compute the objective function for the agent to minimize, and an inner loop performs the updates using the learned loss function. The objective of the outer loop is to maximize the training return of the inner loop algorithm. Our learned loss function should generalize across many different environments, instead of being specific to a particular domain. Thus, we design a search language based on genetic programming (Koza, 1993) that can express general symbolic loss functions which can be applied to any environment. Data typing and a generic interface to variables in the MDP allow the learned program to be domain-agnostic. This language also supports the use of neural network modules as subcomponents of the program, so that more complex neural network architectures can be realized. Efficiently searching over the space of useful programs is generally difficult. For the outer loop optimization, we use regularized evolution (Real et al., 2019), a recent variant of classic evolutionary algorithms that employs tournament selection (Goldberg & Deb, 1991).
This approach can scale with the number of compute nodes and has been shown to work for designing algorithms for supervised learning (Real et al., 2020). We adapt this method to automatically design algorithms for reinforcement learning. While learning from scratch is generally less biased, encoding existing human knowledge into the learning process can speed up the optimization and also make the learned algorithm more interpretable. Because our search language expresses algorithms as a generalized computation graph, we can embed known RL algorithms in the graphs of the starting population of programs. We compare starting from scratch with bootstrapping off existing algorithms and find that while starting from scratch can rediscover existing algorithms, starting from existing knowledge leads to new RL algorithms which can outperform the initial programs. We learn two new RL algorithms which outperform existing algorithms in both sample efficiency and final performance on the training and test environments. The learned algorithms are domain-agnostic and generalize to new environments. Importantly, the training environments consist of a suite of discrete-action classical control tasks and gridworld-style environments, while the test environments include Atari games and are unlike anything seen during training.

The contribution of this paper is a method for searching over the space of RL algorithms, which we instantiate by developing a formal language that describes a broad class of value-based model-free reinforcement learning methods. Our search language enables us to embed existing algorithms into the starting graphs, which leads to faster learning and interpretable algorithms. We highlight two learned algorithms which generalize to completely new environments.
Our analysis of the meta-learned programs shows that our method automatically discovers algorithms that share structure with recently proposed RL innovations and empirically attain better performance than deep Q-learning methods.
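The outer-loop optimization described above can be sketched as follows. This is a minimal illustration of regularized evolution with tournament selection, where `evaluate` (training return of the inner-loop agent) and `mutate` (graph mutation) are caller-supplied placeholders for the paper's actual components, not its implementation.

```python
import random

def regularized_evolution(evaluate, mutate, init_population,
                          cycles=300, tournament_size=5):
    """Sketch of regularized evolution (Real et al., 2019):
    tournament selection plus age-based removal."""
    # The population is a FIFO queue of (program, fitness) pairs.
    population = [(p, evaluate(p)) for p in init_population]
    history = list(population)  # every program ever evaluated
    for _ in range(cycles):
        # Tournament selection: sample a few members, pick the fittest.
        tournament = random.sample(population, tournament_size)
        parent = max(tournament, key=lambda pf: pf[1])[0]
        # Mutate the winner to produce a child and evaluate it.
        child = mutate(parent)
        population.append((child, evaluate(child)))
        history.append(population[-1])
        # "Regularization": remove the oldest member rather than the
        # worst, so no program survives indefinitely on a lucky score.
        population.pop(0)
    return max(history, key=lambda pf: pf[1])[0]
```

A toy run with a scalar "program" and a fitness that rewards proximity to a target value shows the search converging; in the paper, the same loop operates over loss-function graphs evaluated by training agents on a suite of environments.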

2. RELATED WORK

Learning to learn is an established idea in supervised learning, including meta-learning with genetic programming (Schmidhuber, 1987; Holland, 1975; Koza, 1993), learning a neural network update rule (Bengio et al., 1991), and self-modifying RNNs (Schmidhuber, 1993). Genetic programming has been used to find new loss functions (Bengio et al., 1994; Trujillo & Olague, 2006). More recently, AutoML (Hutter et al., 2018) aims to automate the machine learning training process. Automated neural network architecture search (Stanley & Miikkulainen, 2002; Real et al., 2017; 2019; Liu et al., 2017; Zoph & Le, 2016; Elsken et al., 2018; Pham et al., 2018) has made large improvements in image classification. Instead of learning the architecture, AutoML-Zero (Real et al., 2020) learns the algorithm from scratch using basic mathematical operations. Our work shares similar ideas, but is applied to the RL setting and assumes additional primitives such as neural network modules. In contrast to AutoML-Zero, we learn computational graphs with the goal of automating RL algorithm design. Our learned RL algorithms generalize to new problems not seen in training.

Automating RL. While RL is used for AutoML (Zoph & Le, 2016; Zoph et al., 2018; Cai et al., 2018; Bello et al., 2017), automating RL itself has been somewhat limited. RL requires different design choices compared to supervised learning, including the formulation of reward and policy update rules, all of which affect learning and performance and are usually chosen through trial and error. AutoRL addresses this gap by applying the AutoML framework from supervised learning to the MDP setting in RL.
For example, evolutionary algorithms are used to mutate the value or actor network weights (Whiteson & Stone, 2006; Khadka & Tumer, 2018), learn task reward (Faust et al., 2019), tune hyperparameters (Tang & Choromanski, 2020; Franke et al., 2020), or search for a neural network architecture (Song et al., 2020; Franke et al., 2020). This paper focuses on task-agnostic RL update rules in the value-based RL setting which are both interpretable and generalizable.



Figure 1: Method overview. We use regularized evolution to evolve a population of RL algorithms. A mutator alters top-performing algorithms to produce a new algorithm. The performance of the algorithm is evaluated over a set of training environments and the population is updated. Our method can incorporate existing knowledge by starting the population from known RL algorithms instead of purely from scratch.
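Starting the population from known RL algorithms can be made concrete: a known loss such as DQN's squared TD error can be written as a small program over typed graph inputs and placed in the initial population, after which the mutator can rewire nodes or swap operations. The node encoding and operation set below are our own illustrative assumptions, not the paper's exact search language.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

@dataclass
class Node:
    op: Callable             # operation applied at this node
    inputs: Tuple[int, ...]  # indices of earlier values to consume

def run_graph(nodes, graph_inputs):
    """Evaluate nodes in order; each node may read graph inputs
    or the outputs of earlier nodes. The last node is the loss."""
    values = list(graph_inputs)
    for node in nodes:
        values.append(node.op(*(values[i] for i in node.inputs)))
    return values[-1]

# Graph inputs (by index):
#   0: Q(s, a)   1: max_a' Q_target(s', a')   2: reward r   3: discount gamma
dqn_loss_graph = [
    Node(lambda q_next, gamma: gamma * q_next, (1, 3)),  # value 4: discounted bootstrap
    Node(lambda r, disc: r + disc, (2, 4)),              # value 5: TD target y
    Node(lambda q, y: (q - y) ** 2, (0, 5)),             # value 6: squared TD error
]
```

With inputs `[2.0, 3.0, 1.0, 0.9]` the graph computes (2.0 − (1.0 + 0.9·3.0))², i.e. the squared TD error for one transition. A mutation in this representation corresponds to changing a node's operation or rewiring its input indices, which is what makes modifications to a bootstrapped algorithm interpretable.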

