DISTRIBUTIONAL META-GRADIENT REINFORCEMENT LEARNING

Abstract

Meta-gradient reinforcement learning (RL) algorithms have substantially boosted the performance of RL agents by learning an adaptive return. All existing algorithms adhere to the same reward-learning principle: the adaptive return is formulated as an expected cumulative reward, upon which the policy and critic update rules are specified under well-adopted distance metrics. In this paper, we present a novel algorithm that builds on the success of meta-gradient RL algorithms and improves them by following a simple recipe: going beyond the expected return to formulate and learn the return in a more expressive form, namely value distributions. To this end, we first formulate a distributional return that captures bootstrapping and discounting behaviors over distributions, forming an informative distributional return target for the value update. We then derive an efficient meta update rule to learn the adaptive distributional return with meta-gradients. For empirical evaluation, we first present an illustrative example on a toy two-color grid-world domain, which validates the benefit of learning a distributional return over an expected one; we then conduct extensive comparisons on the large-scale Atari 2600 benchmark, confirming that our proposed method with distributional return works seamlessly within the actor-critic framework and achieves a state-of-the-art median human-normalized score among meta-gradient RL methods.

1. INTRODUCTION

Meta-gradient reinforcement learning (MGRL) (Xu et al., 2018b; Sutton, 2022) is a family of algorithms that leverage the gradient of the gradient-descent update to learn better objectives for reinforcement learning (RL). The MGRL paradigm has achieved substantial performance breakthroughs and dominates state-of-the-art model-free RL algorithms in various domains, such as Atari (Xu et al., 2018b; Zahavy et al., 2020; Flennerhag et al., 2022), DMLab (Zahavy et al., 2020), and DeepMind Control (Zahavy et al., 2020). Algorithmic research on MGRL has been developing towards increasingly ambitious targets: it started from learning two fundamental hyperparameters (the discount factor γ and the bootstrapping factor λ) (Xu et al., 2018b) and has progressed to self-tuning 20+ hyperparameters in a high-performing RL agent (Zahavy et al., 2020). As a powerful tool for discovering RL semantics from data, MGRL has been used to discover intrinsic rewards (Zheng et al., 2018; 2020), auxiliary tasks (Veeriah et al., 2019), options (Veeriah et al., 2021), etc. Recently, it has even been used to tackle a more flexible parameterization of the objective, i.e., using black-box neural networks to learn objectives from environment interactions and learning context (Xu et al., 2020; Oh et al., 2020). However, MGRL research has been limited to scalar forms: either the meta-parameters themselves are scalars (Xu et al., 2018b; Zahavy et al., 2020), or the outputs of meta networks are scalars (Xu et al., 2020). This is in line with the traditional definition of the return (Sutton & Barto, 1998) as an expected sum of discounted rewards. Note, however, that MGRL is not limited to providing RL objectives; one can also meta-learn broader concepts, e.g., learning rates (Sutton, 1981; Baik et al., 2020; Pinion et al., 2021) and exploration (Gupta et al., 2018; Xu et al., 2018a).
In this work, we take an essential step in extending the algorithmic scope of MGRL to distributional methods, in which RL algorithms remove the expectations in Bellman updates and consider the richer formulation of value distributions (Bellemare et al., 2017). This extension is beneficial for RL updates and even more so for MGRL, for several reasons. Sparse signals are hard to learn from in RL, and the bi-level optimization nature of MGRL makes it extremely difficult to propagate helpful signals for learning the meta-parameters. By modeling value distributions, the RL algorithm naturally produces a rich set of predictions, which provides dense and informative signals for meta-optimization. Moreover, some distributional RL methods come with optimization properties in their loss forms that are unavailable to conventional non-distributional RL methods, e.g., the KL divergence between discrete distributions (Bellemare et al., 2017) or quantile regression (Dabney et al., 2018b). Despite the great accomplishments of distributional RL for non-meta policy learning, no prior MGRL method has attempted to learn a distributional meta objective. This paper develops a novel distributional meta-gradient RL algorithm that can discover meaningful value distributions online. We aim both to methodologically establish a novel distributional meta-gradient framework and to investigate the effect of applying distributional meta-gradient techniques to real-world problems, for which we conduct extensive evaluations on a motivating toy domain as well as on large-scale end-to-end policy training problems. The evaluation results clearly demonstrate the effectiveness of our method. Overall, the contributions of this paper are as follows: (i) We present a novel meta-gradient RL method that approximates the distributional return in its value update.
(ii) Our proposed distributional meta-gradient algorithm is general and compatible with almost all existing meta-gradient RL approaches. (iii) We conduct an extensive empirical study on a toy control domain and the large-scale Atari 2600 benchmark. Our results demonstrate substantial improvements of our method over strong baseline methods on the major evaluation domains.
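For intuition on what learning a return "over distributions" involves, consider the following minimal sketch (plain Python; the helper name and the fixed shared-support representation are illustrative choices, not our actual implementation). It shows one way bootstrapping behavior can be expressed distributionally: mixing n-step categorical target distributions with TD(λ)-style weights.

```python
def lambda_mixture(nstep_probs, lam):
    """Mix n-step target distributions (all on one shared support) with
    TD(lambda)-style weights (1 - lam) * lam**(n-1); the last term absorbs
    the remaining tail so the weights sum to one."""
    n = len(nstep_probs)
    weights = [(1 - lam) * lam ** k for k in range(n - 1)] + [lam ** (n - 1)]
    mixed = [0.0] * len(nstep_probs[0])
    for w, probs in zip(weights, nstep_probs):
        for i, p in enumerate(probs):
            mixed[i] += w * p
    return mixed
```

Because the weights sum to one, the mixture is itself a valid probability distribution; setting lam=0 recovers the one-step target and lam=1 the longest-horizon target.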

2. RELATED WORK

In recent decades, meta-learning, or learning-to-learn, has been explored extensively within the machine learning community (Hospedales et al., 2022; Sutton, 2022), e.g., learning a neural network update rule (Bengio et al., 1990), adapting the learning rate (Sutton, 1992; Schraudolph & Giannakopoulos, 1999), and transferring domain-invariant knowledge (Thrun & Mitchell, 1995). Notably, it has also driven many advances in RL. One direction for such attempts is to learn meta-optimizers (Andrychowicz et al., 2016; Wichrowska et al., 2017) or meta-policies (Duan et al., 2016; Wang et al., 2017) parameterized by recurrent neural networks, which capture high-level time-dependent information as meta knowledge. Another direction is gradient-based meta-learning, introduced in Model-Agnostic Meta-Learning (MAML) (Finn et al., 2017a) and its follow-up works (Finn & Levine, 2018; Finn et al., 2017b; Grant et al., 2018; Al-Shedivat et al., 2018). Those works focus on learning a good initialization of the model for one- or few-shot multi-task learning with meta-gradients. In this paper, we consider the meta-gradient RL problem from a new angle, i.e., optimizing the adaptive return as distributions. Policy learning with a distributional return has been considered in the existing RL literature for off-policy learning only. All existing methods fall into the strand of DQN variants that replace the n-step return in DQN with distributions during Q-learning. In (Bellemare et al., 2017), the return is formulated as a categorical distribution over a fixed set of base values, where the Bellman operator shifts the base values of the distribution, and the target distribution is contracted and projected back onto the fixed base. In (Dabney et al., 2018b), the return is formulated as a quantile distribution, where the algorithm learns the value corresponding to each quantile, alleviating the need for distribution projection or value-range approximation.
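As a reference point, the categorical update of Bellemare et al. (2017) can be sketched as follows (plain Python; a simplified single-transition version assuming a uniformly spaced support, not the authors' released implementation): the Bellman operator shifts each atom z to r + γz, and the resulting mass is redistributed onto the fixed support by linear interpolation.

```python
import math

def project_categorical(probs, atoms, reward, gamma):
    """Project the shifted distribution over r + gamma * z back onto the
    fixed, uniformly spaced support `atoms`."""
    v_min, v_max = atoms[0], atoms[-1]
    dz = atoms[1] - atoms[0]
    out = [0.0] * len(atoms)
    for p, z in zip(probs, atoms):
        tz = min(max(reward + gamma * z, v_min), v_max)  # clip to support
        b = (tz - v_min) / dz                            # fractional atom index
        l, u = math.floor(b), math.ceil(b)
        if l == u:                    # shifted atom lands exactly on the grid
            out[l] += p
        else:                         # split mass between the two neighbors
            out[l] += p * (u - b)
            out[u] += p * (b - l)
    return out
```

The projected vector remains a valid distribution (mass is conserved), which is what makes the subsequent KL-divergence loss between prediction and target well defined.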
In (Dabney et al., 2018a), the quantiles are approximated by mapping samples drawn from a uniform distribution (proposals) to quantile values, yielding an effectively unlimited number of implicit distributions. In (Wang et al., 2019), the proposals of the implicit quantiles are replaced by the output of a learned proposal network. In contrast to these off-policy distributional RL methods, we tackle the novel problem of learning an adaptive distributional return in on-policy actor-critic algorithms with meta-gradients.
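For completeness, the quantile-regression objective underlying Dabney et al. (2018b) can be sketched as the pinball loss below (plain Python, without the Huber smoothing used in the original; names are illustrative): each prediction is trained toward the τ-quantile of the target return samples, with fixed quantile midpoints τ_i = (i + 0.5) / N.

```python
def pinball_loss(pred_quantiles, target_samples):
    """Quantile-regression (pinball) loss averaged over fixed quantile
    midpoints tau_i = (i + 0.5) / N and over the target samples."""
    n = len(pred_quantiles)
    taus = [(i + 0.5) / n for i in range(n)]
    total = 0.0
    for q, tau in zip(pred_quantiles, taus):
        for t in target_samples:
            u = t - q  # residual; penalized asymmetrically around tau
            total += (tau if u >= 0 else tau - 1) * u
    return total / (n * len(target_samples))
```

Minimizing this loss drives each prediction to the corresponding quantile of the samples; for instance, with a single quantile (τ = 0.5) the minimizer is the sample median.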

