ADAPTIVE N-STEP BOOTSTRAPPING WITH OFF-POLICY DATA

Abstract

The definition of the update target is a crucial design choice in reinforcement learning. Due to their low computation cost and strong empirical performance, n-step returns with off-policy data are a widely used update target for bootstrapping from scratch. A critical issue in applying n-step returns is identifying the optimal value of n. In practice, n is often set to a fixed value, determined either by an empirical guess or by a hyper-parameter search. In this work, we point out that the optimal value of n actually differs for each data point, and a fixed n is only a rough average over them. The estimation error can be decomposed into two sources, off-policy bias and approximation error, and a fixed value of n trades one off against the other. Based on this observation, we introduce a new metric, policy age, to quantify the off-policyness of each data point. We propose Adaptive N-step Bootstrapping, which calculates the value of n for each data point from its policy age instead of an empirical guess. We conduct experiments on both MuJoCo and Atari games. The results show that adaptive n-step bootstrapping achieves state-of-the-art performance in terms of both final reward and data efficiency.

1. INTRODUCTION

The goal of reinforcement learning (RL) is to find an optimal policy by interacting with the environment. To do so, an RL algorithm needs to define a target, e.g., a Q function or value function, and update it iteratively to bootstrap from scratch. The challenge of designing an efficient update target manifests in both sample complexity and computation complexity. Ideally, the target should only be updated with data generated by the corresponding policy to obtain an unbiased estimate (Sutton & Barto, 2018), while the amount of data needs to reach a certain scale to control the variance (Schulman et al., 2015). These two requirements limit the update frequency and ultimately lead to high sample complexity. On the computational side, the consideration is the trade-off between the cost of each step and the number of total steps. Monte-Carlo returns have advantages in generalization (they behave well with function approximation) and exploration (new findings propagate quickly) at the cost of computing the whole trajectory at each step (Sutton & Barto, 2018). Bootstrapping methods apply readily to off-policy data and keep the trace length under control (Sutton & Barto, 2018); however, they require more update steps to converge compared with Monte-Carlo returns. These design concerns are intertwined, which makes a good balance hard to achieve. N-step returns (Sutton & Barto, 2018) serve as the basis of various update targets due to their flexibility and simple implementation. Together with off-policy learning (Sutton & Barto, 2018) and the replay buffer (Mnih et al., 2015), n-step returns make it possible to update the target frequently while keeping the variance in a controllable range. However, a systematic study (Fedus et al., 2020) reveals that the performance of n-step returns relies heavily on the exact value of n.
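As a concrete illustration of the update target discussed above, the n-step return accumulates n rewards and then bootstraps from the current Q-function. The following is a minimal sketch (function and argument names are ours, not from any particular codebase); `bootstrap_q` stands for max_a Q(s_{t+n}, a) computed by the current value network:

```python
def n_step_target(rewards, bootstrap_q, gamma=0.99):
    """Compute the n-step return target
        G_t = r_t + gamma * r_{t+1} + ... + gamma^{n-1} * r_{t+n-1}
              + gamma^n * max_a Q(s_{t+n}, a),
    where n = len(rewards). Folding from the back keeps it O(n)
    with no explicit powers of gamma."""
    g = bootstrap_q
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

Setting `rewards` to the full trajectory with `bootstrap_q = 0` recovers the Monte-Carlo return, while a length-one `rewards` recovers one-step TD(0), which is why n interpolates between the two regimes.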
Since the underlying mechanism is unclear, previous research can only give vague suggestions based on empirical results, namely to simply increase the value from one to a larger number, e.g., 3 or 4. In this paper, we show that the estimation error of n-step returns can be decomposed into off-policy bias (an under-estimation) and approximation error (an over-estimation), and that the selection of n controls the balance between them. Data stored in the replay buffer are generated by previous policies, so using them for updates introduces off-policy bias. Since the current policy is better than previous policies, the off-policy bias is an underestimation. The replay buffer is not the only source of off-policy bias; epsilon-greedy exploration also introduces it. On the other hand, n-step returns apply a max operator explicitly (in Q-learning based algorithms) or implicitly (in actor-critic algorithms) to an existing function to approximate the real target. The max operator brings the approximation error, which is an overestimation. Section 4 gives the formal definition of the decomposition and verifies the conclusion experimentally. According to our analysis, the magnitudes of the off-policy bias and the approximation error vary considerably across data points. Thus, a fixed value of n is just a rough average, and there is plenty of room for improvement. We introduce a new metric, policy age, to quantify the off-policyness of each data point. As the policy age grows, the off-policy bias increases linearly, while the approximation error decreases exponentially. Based on this observation, we propose a novel algorithm, named adaptive n-step bootstrapping. Given the policy age of each data point, adaptive n-step calculates the optimal n by an exponential function, whose hyperparameter is determined by the tree-structured Parzen estimator (Bergstra et al., 2011). We conduct extensive experiments on both MuJoCo and Atari games.
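The age-to-n mapping can be sketched as follows. The exponential form mirrors the observation above (bias grows with age, approximation error shrinks exponentially); the function name, `n_max`, and `decay` values here are illustrative stand-ins, not the hyperparameters the tree-structured Parzen estimator would select:

```python
import math

def adaptive_n(policy_age, n_max=8, decay=0.1, n_min=1):
    """Map a transition's policy age (current policy index minus the
    index of the policy that generated it) to a trace length n.
    Fresh, near-on-policy data gets a long trace (up to n_max);
    stale data, which carries more off-policy bias, gets a trace
    that decays exponentially toward n_min."""
    n = n_max * math.exp(-decay * policy_age)
    return max(n_min, int(round(n)))
```

In a replay-buffer implementation, each transition would be stored with the index of the policy that produced it, and `adaptive_n` would be evaluated per sampled transition before forming the n-step target.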
Adaptive n-step bootstrapping outperforms all fixed-value n settings by a large margin, in terms of both data efficiency and final reward. Among other update target definitions, we select Retrace (Munos et al., 2016) as a representative. Compared with Retrace, our method maintains its performance advantage while keeping computational complexity low and the implementation simple.

2.1. RESEARCH ON N-STEP RETURNS

Recent work on n-step returns focuses on finding the optimal value of n. In Ape-X (Horgan et al., 2018) and R2D2 (Kapturowski et al., 2019), the value of n is fixed, set by manual tuning or hyper-parameter search. Rainbow (Hessel et al., 2018) observes that the final performance is sensitive to the value of n, and that n = 3 achieves the best score in most cases on Atari games. Fedus et al. (2020) verify that setting n to 3 is a good choice, and further reveal that the replay buffer must also be large enough to gain performance benefits. These studies give heuristic rules for setting the value of n, but the underlying mechanism of why n = 3 performs better than one-step temporal difference remains unclear.

2.2. OTHER UPDATE TARGETS

To improve on vanilla n-step returns, many other update target definitions exist in the literature (Hernandez-Garcia & Sutton, 2019). Importance sampling (IS) (Precup et al., 2000) provides a simple way to correct the off-policy bias. It can be seen as a weighted average of multiple one-step TD(0) targets. However, IS brings large (and possibly infinite) variance, which makes it impractical for large-scale problems. Retrace (Munos et al., 2016) clips the IS ratio to a maximum value of 1 to reduce this variance. It has many applications in recent reinforcement learning agents, such as the distributed off-policy learning agent Reactor (Gruslys et al., 2017). The most serious disadvantage of Retrace is its high computation cost: Retrace needs to evaluate Q and π each O(n) times per target, compared with only O(1) for n-step returns (n being the trace length). In large-scale problems, evaluating Q and π requires a forward pass through the neural network, which is slow and expensive. Reactor (Gruslys et al., 2017) calculates the Retrace target as a linear combination of many n-step targets and dispatches these calculation workloads to different nodes for acceleration. Since the computation complexity is still high, Reactor cannot operate under limited resources. Furthermore, even disregarding the calculation cost, Retrace does not apply as broadly as n-step returns, as reported in Hernandez-Garcia & Sutton (2019).
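The O(n) cost mentioned above is visible in a direct implementation of the Retrace(λ) target from Munos et al. (2016): forming one target requires Q-values and π-probabilities at every step of the trace. The sketch below is a generic single-trace version (array layout and names are illustrative, not taken from the paper's code):

```python
import numpy as np

def retrace_target(q, pi, mu, rewards, actions, gamma=0.99, lam=1.0):
    """Retrace(lambda) target for Q(s_0, a_0) over a length-T trace:
        Q(s_0, a_0) + sum_t gamma^t (prod_{j<=t} c_j) * delta_t,
        c_j = lam * min(1, pi(a_j|s_j) / mu(a_j|s_j)),
        delta_t = r_t + gamma * E_pi[Q(s_{t+1}, .)] - Q(s_t, a_t).
    q:  [T+1, A] Q-values along the trajectory -- one network
        evaluation per step, which is the O(n) cost discussed above
    pi: [T+1, A] target-policy probabilities (also O(n) evaluations)
    mu: [T] behavior-policy probabilities mu(a_t | s_t)
    actions: [T] actions actually taken along the trace"""
    T = len(rewards)
    target = q[0, actions[0]]
    c = 1.0  # running product of clipped importance ratios
    for t in range(T):
        if t > 0:
            c *= lam * min(1.0, pi[t, actions[t]] / mu[t])
        # TD error bootstrapped with the expectation under pi
        exp_q = np.dot(pi[t + 1], q[t + 1])
        delta = rewards[t] + gamma * exp_q - q[t, actions[t]]
        target += (gamma ** t) * c * delta
    return target
```

By contrast, the n-step target needs only a single bootstrap evaluation at s_{t+n}, which is the computational gap this paper exploits.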

3. PRELIMINARIES

The goal of reinforcement learning is to find an optimal policy π* with maximal discounted return R^π = E_π[∑_t γ^(t-1) r_t] given a Markov Decision Process (MDP). To achieve this, agents often

