ADAPTIVE N-STEP BOOTSTRAPPING WITH OFF-POLICY DATA

Abstract

The definition of the update target is a crucial design choice in reinforcement learning. Due to their low computation cost and strong empirical performance, n-step returns computed from off-policy data are a widely used update target for bootstrapping from scratch. A critical issue in applying n-step returns is identifying the optimal value of n. In practice, n is often set to a fixed value, determined either by an empirical guess or by a hyper-parameter search. In this work, we point out that the optimal value of n actually differs for each data point, and that a fixed n is only a rough average over them. The estimation error can be decomposed into two sources, off-policy bias and approximation error, and the fixed value of n trades one off against the other. Based on this observation, we introduce a new metric, policy age, to quantify the off-policyness of each data point. We then propose Adaptive N-step Bootstrapping, which calculates the value of n for each data point from its policy age instead of an empirical guess. We conduct experiments on both MuJoCo and Atari games. The results show that adaptive n-step bootstrapping achieves state-of-the-art performance in terms of both final reward and data efficiency.

1. INTRODUCTION

The goal of reinforcement learning (RL) is to find an optimal policy by interacting with the environment. To do so, an RL algorithm needs to define a target, e.g., a Q function or a value function, and update it iteratively to bootstrap from scratch. The challenge of designing an efficient update target manifests in both sample complexity and computational complexity. Ideally, the target should only be updated with data generated by the corresponding policy to obtain an unbiased estimate (Sutton & Barto, 2018), while the amount of data needs to reach a certain scale to control the variance (Schulman et al., 2015). These two requirements limit the update frequency and ultimately lead to high sample complexity. On the computational side, the concern is the trade-off between the cost of each step and the total number of steps. Monte-Carlo returns have advantages in generalization (they behave well with function approximation) and exploration (new findings propagate quickly), at the cost of computing over the whole trajectory at each step (Sutton & Barto, 2018). Bootstrapping methods apply readily to off-policy data and control the trace length (Sutton & Barto, 2018), but they require more update steps to converge compared with Monte-Carlo returns. These design concerns are intertwined, which makes a good balance hard to achieve.

N-step returns (Sutton & Barto, 2018) serve as the basis of various update targets due to their flexibility and simple implementation. Together with off-policy learning (Sutton & Barto, 2018) and the replay buffer (Mnih et al., 2015), n-step returns can update the target frequently while keeping the variance in a controllable range. However, a systematic study (Fedus et al., 2020) reveals that the performance of n-step returns relies heavily on the exact value of n. Since the underlying working mechanism is unclear, previous research can only give vague suggestions based on empirical results, namely to increase the value from one to a slightly larger number, e.g., 3 or 4.

In this paper, we show that the estimation error of n-step returns can be decomposed into off-policy bias (the underestimation part) and approximation error (the overestimation part), and that the selection of n controls the balance between them. Data stored in the replay buffer are generated by previous policies, so adopting them for updates introduces off-policy bias. Since the current policy is better than previous policies, the off-policy bias is an underestimation. The replay buffer is not the only source of off-policy bias; epsilon-greedy exploration also introduces off-policyness. On the other hand, n-step returns apply a max operator explicitly (in Q-learning based algorithms) or implicitly (in actor-critic algorithms) to an existing estimate to approximate the real target. The max operator brings approximation error, which is an overestimation. Sec. 4 gives the formal definition of the decomposition and verifies it experimentally. According to our analysis, the magnitudes of the off-policy bias and the approximation error vary considerably across data points. A fixed value of n is therefore only a rough average, and there is plenty of room for improvement. We introduce a new metric, policy age, to quantify the off-policyness of each data point. As the policy age grows, the off-policy bias increases linearly, while the approximation error decreases exponentially.
Based on this observation, we propose a novel algorithm named adaptive n-step bootstrapping. Given the policy age of each data point, adaptive n-step calculates the optimal n by an exponential function. The hyperparameters of this function are determined by the tree-structured Parzen estimator (Bergstra et al., 2011). We conduct extensive experiments on both MuJoCo and Atari games. Adaptive n-step bootstrapping outperforms all fixed-value n settings by a large margin, in terms of both data efficiency and final reward. Among the other update target definitions, we select Retrace (Munos et al., 2016) as a representative. Compared with Retrace, our method maintains a performance advantage while retaining low computational complexity and a simple implementation.

2. RELATED WORK

2.1. RESEARCH ON N-STEP RETURNS

Recent works on n-step returns focus on finding the optimal value of n. In Ape-X (Horgan et al., 2018) and R2D2 (Kapturowski et al., 2019), the value of n is fixed, set by manual tuning or hyper-parameter search. Rainbow (Hessel et al., 2018) finds that the final performance is sensitive to the value of n, and that n = 3 achieves the best score in most cases on Atari games. Fedus et al. (2020) verify that setting n to 3 is a good choice, and further reveal that the replay buffer must also be large enough to gain performance benefits. These studies give heuristic rules for setting the value of n, but the underlying mechanism that makes n = 3 perform better than one-step temporal difference is still unclear.

2.2. OTHER UPDATE TARGETS

To improve the performance of vanilla n-step returns, many other update target definitions have been proposed in the literature (Hernandez-Garcia & Sutton, 2019). Importance sampling (IS) (Precup et al., 2000) provides a simple way to correct the off-policy bias. It can be seen as a weighted average of multiple one-step TD(0) targets. However, IS brings large (and possibly infinite) variance, which makes it impractical on large-scale problems. Retrace (Munos et al., 2016) clips the IS ratio to a maximum value of 1 to reduce the large variance of IS targets. It has many applications in recent reinforcement learning agents, such as the distributed off-policy learning agent Reactor (Gruslys et al., 2017). The most serious disadvantage of Retrace is its high computation cost. Retrace needs to evaluate Q and π O(n) times per target, compared with only O(1) evaluations for n-step returns (n is the trace length). In large-scale problems, evaluating Q and π requires a forward pass through the neural network, which is slow and expensive. Reactor (Gruslys et al., 2017) calculates the Retrace target as a linear combination of many n-step targets and dispatches those calculation workloads to different nodes for acceleration. Since the computational complexity is still high, Reactor cannot work under limited resources. Furthermore, even without considering the calculation cost, the applicability of Retrace is narrower than that of n-step returns, as reported in Hernandez-Garcia & Sutton (2019).
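To make the computational comparison concrete, below is a minimal tabular-style sketch of a Retrace target computed over a stored trace. The helper names `q_fn`, `pi_fn`, and `mu_prob` are illustrative assumptions, not from any released implementation; note that Q and π are evaluated at every step of the trace, in contrast to the single bootstrap evaluation of a vanilla n-step target.

```python
import numpy as np

def retrace_target(q_fn, pi_fn, mu_prob, transitions, gamma=0.99, lam=1.0):
    """Retrace(lambda) target for the first transition of a stored trace.

    Assumed interfaces: q_fn(s) returns a vector of Q-values over actions,
    pi_fn(s) returns the target policy's action probabilities, and mu_prob[t]
    is the stored behavior probability of the taken action a_t.
    """
    s0, a0, _, _ = transitions[0]
    target = q_fn(s0)[a0]
    coef = 1.0  # running product gamma^t * c_1 ... c_t
    for t, (s, a, r, s_next) in enumerate(transitions):
        # Each step needs fresh Q and pi evaluations -> O(n) network passes.
        exp_q_next = float(np.dot(pi_fn(s_next), q_fn(s_next)))
        delta = r + gamma * exp_q_next - q_fn(s)[a]
        target += coef * delta
        if t + 1 < len(transitions):
            s1, a1 = transitions[t + 1][0], transitions[t + 1][1]
            c = lam * min(1.0, pi_fn(s1)[a1] / mu_prob[t + 1])
            coef *= gamma * c
    return target
```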

3. PRELIMINARIES

The goal of reinforcement learning is to find an optimal policy $\pi^*$ with maximal discounted return $R_\pi = \mathbb{E}_\pi[\sum_t \gamma^{t-1} r_t]$ in a Markov Decision Process (MDP). To achieve this, agents often estimate the state-action value function $q_\pi(s, a) = \mathbb{E}_\pi[\sum_t \gamma^{t-1} r_t \mid s_0 = s, a_0 = a]$. Let $Q_\pi(s, a)$ denote the estimate of $q_\pi(s, a)$. In tabular settings, $Q_\pi$ can be represented by a table over all state-action pairs $(s, a)$, while in large-scale settings $Q_\pi$ is often approximated by a deep neural network (DNN) with parameters $\theta$, written as $Q_{\pi;\theta}$. During training, $Q_\pi$ is continuously updated towards the update target $\hat{G}_\pi$, which is calculated from data points $(s, a, r, s')$. In tabular settings (Watkins & Dayan, 1992), the update equation can be written as $Q_\pi(s, a) \leftarrow Q_\pi(s, a) + \alpha[\hat{G}_\pi(s, a) - Q_\pi(s, a)]$, where $\alpha$ is the learning rate. In large-scale settings (Mnih et al., 2015; Lillicrap et al., 2016), $Q_{\pi;\theta}$ is updated by mini-batch gradient descent on the network parameters $\theta$: $\theta \leftarrow \theta - \alpha \frac{1}{N}\sum_{i=1}^{N} \nabla_\theta L(Q_{\pi;\theta}(s_i, a_i), \hat{G}_\pi(s_i, a_i))$, where $L$ is the loss function.

Off-policy learning uses two policies: the behavior policy $\mu$ for generating data points and the target policy $\pi$ for learning from them. A replay buffer (Fedus et al., 2020) is often used together with off-policy learning to handle the increasing sample complexity of large-scale problems. Agents draw data points from the replay buffer to update the estimator $Q_\pi$, which lets them learn from past experience and thus yields higher sample efficiency. In the sequel, the term off-policy learning refers to off-policy learning combined with a replay buffer, as used by most recent off-policy algorithms.

N-step returns (Sutton & Barto, 2018) together with off-policy learning have strong empirical performance (Hessel et al., 2018) and are easy to compute. The target $\hat{G}^n_\pi$ is calculated as
$$\hat{G}^n_\pi(s, a) = r_0 + \gamma r_1 + \cdots + \gamma^{n-1} r_{n-1} + \gamma^n \mathbb{E}_{a_n \sim \pi(s_n)}[Q_\pi(s_n, a_n)]. \tag{1}$$
It can be seen as a mix of the Monte-Carlo (MC) estimate $q_\pi \approx \sum_t \gamma^{t-1} r_t$ (the $n \to \infty$ case) and the one-step TD(0) estimate $r_0 + \gamma \mathbb{E}_{a \sim \pi(s_1)}[Q_\pi(s_1, a)]$ (the $n = 1$ case). In off-policy learning, to calculate the n-step return $\hat{G}^n_\pi(s_0, a_0)$ for a data point $(s_0, a_0, r_0, s'_0)$, we draw the consecutive transitions $(s_t, a_t, r_t, s'_t)_{t=0,1,2,\ldots}$ of the same trajectory $\tau$ from the replay buffer. As the n-step return estimates $q_\pi$ using a trajectory $\tau$ generated by the behavior policy $\mu$, the discrepancy between $\pi$ and $\mu$ makes it biased. We call this off-policy induced bias the off-policy bias, and the difference between $\pi$ and $\mu$ the off-policyness.
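As a concrete illustration of Eq. 1, the following is a minimal sketch of computing $\hat{G}^n_\pi$ from consecutive transitions drawn from the replay buffer. The helpers `q_fn` and `pi_fn` are assumed interfaces for evaluating $Q_\pi$ and $\pi$ (illustrative names, not from any released implementation).

```python
import numpy as np

def nstep_return(transitions, q_fn, pi_fn, n, gamma=0.99):
    """Sketch of the n-step target G^n(s_0, a_0) from Eq. 1.

    `transitions` is a list of consecutive (s, a, r, s_next) tuples from one
    trajectory in the replay buffer; `q_fn(s)` returns a vector of Q-values
    and `pi_fn(s)` the target policy's action probabilities (assumed helpers).
    """
    n = min(n, len(transitions))        # the stored trace may be shorter than n
    ret, discount = 0.0, 1.0
    for (_, _, r, _) in transitions[:n]:
        ret += discount * r             # accumulate gamma^t * r_t
        discount *= gamma
    s_n = transitions[n - 1][3]
    # Single bootstrap term E_{a~pi(s_n)}[Q(s_n, a)]: one evaluation regardless of n.
    return ret + discount * float(np.dot(pi_fn(s_n), q_fn(s_n)))
```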

4. UNDERLYING WORKING MECHANISM OF N-STEP BOOTSTRAPPING

N-step returns lie at the center of update target design. They not only unify Monte-Carlo returns and one-step temporal difference but also lay the foundation of eligibility traces (Singh & Sutton, 1996). Together with off-policy learning, n-step bootstrapping works well because it achieves a good balance between the bias (TD) and the variance (MC). Despite its importance and wide application, the underlying mechanism of n-step returns has not been studied in detail. In this section, we give a careful analysis of why n-step bootstrapping works and what property the optimal selection of n should satisfy.

4.1. DECOMPOSITION OF THE ESTIMATION ERROR

To understand how n-step bootstrapping works, we conduct a systematic analysis of the estimation error. We formalize the estimation error of an update target $\hat{G}_\pi$ as its difference from the ground truth $q_\pi$ over every $(s, a)$ pair the agent has experienced:
$$\mathcal{E}(\hat{G}_\pi) = \mathbb{E}_\mu[\hat{G}_\pi(s, a) - q_\pi(s, a)]. \tag{2}$$
As shown in Eq. 1, the estimation error consists of two parts: the off-policy bias and the approximation error. The off-policy bias is introduced by the difference between $\pi$ and $\mu$, as mentioned before, while the approximation error arises because the agent's estimate of $Q_\pi$ is not perfect, i.e., $Q_\pi \neq q_\pi$. Note that the behavior policy $\mu$ is an older version of the target policy $\pi$. We split these two types of error by defining two intermediate targets, $\hat{G}^n_{\pi,q_\pi;\tau}$ and $\hat{G}^n_{\pi,Q_\pi;\tilde{\tau}}$, each of which eliminates one source of error so that the other can be measured in isolation.

$\hat{G}^n_{\pi,q_\pi;\tau}$ adopts the ground truth $q_\pi$ instead of the estimated $Q_\pi$ to eliminate the approximation error:
$$\hat{G}^n_{\pi,q_\pi;\tau}(s, a) = r_0 + \gamma r_1 + \cdots + \gamma^{n-1} r_{n-1} + \gamma^n \mathbb{E}_{a_n \sim \pi(s_n)}[q_\pi(s_n, a_n)].$$
$\hat{G}^n_{\pi,Q_\pi;\tilde{\tau}}$ uses a trajectory $\tilde{\tau} = (\tilde{s}_t, \tilde{a}_t, \tilde{r}_t, \tilde{s}'_t)_{t=0,1,2,\ldots}$ generated by the current policy $\pi$ instead of the old trajectory $\tau$ to remove the off-policy bias:
$$\hat{G}^n_{\pi,Q_\pi;\tilde{\tau}}(s, a) = \tilde{r}_0 + \gamma \tilde{r}_1 + \cdots + \gamma^{n-1} \tilde{r}_{n-1} + \gamma^n \mathbb{E}_{\tilde{a}_n \sim \pi(\tilde{s}_n)}[Q_\pi(\tilde{s}_n, \tilde{a}_n)].$$
We can then quantify the off-policy bias $\mathcal{E}_{\text{offpolicy}}$ and the approximation error $\mathcal{E}_{\text{approx}}$ independently:
$$\mathcal{E}_{\text{offpolicy}}(\hat{G}^n_\pi) = \mathcal{E}(\hat{G}^n_{\pi,q_\pi;\tau}), \qquad \mathcal{E}_{\text{approx}}(\hat{G}^n_\pi) = \mathcal{E}(\hat{G}^n_{\pi,Q_\pi;\tilde{\tau}}).$$
Now we can decompose the total error $\mathcal{E}(\hat{G}^n_\pi)$ into the sum of the off-policy bias and the approximation error, plus a negligibly small residual term $\mathcal{E}_{\text{residual}}(\hat{G}^n_\pi)$:
$$\mathcal{E}_{\text{residual}}(\hat{G}^n_\pi) = \mathcal{E}(\hat{G}^n_\pi) - \mathcal{E}_{\text{offpolicy}}(\hat{G}^n_\pi) - \mathcal{E}_{\text{approx}}(\hat{G}^n_\pi) = \gamma^n \big( \mathbb{E}_{a \sim \pi(s_n)}[Q_\pi(s_n, a) - q_\pi(s_n, a)] - \mathbb{E}_{\tilde{a} \sim \pi(\tilde{s}_n)}[Q_\pi(\tilde{s}_n, \tilde{a}) - q_\pi(\tilde{s}_n, \tilde{a})] \big).$$
The residual term $\mathcal{E}_{\text{residual}}(\hat{G}^n_\pi)$ is the difference of the discounted error $\gamma^n \mathbb{E}_{a \sim \pi(s)}[Q_\pi(s, a) - q_\pi(s, a)]$ between the end states $s_n$ and $\tilde{s}_n$ of the two traces. If $n$ is small, the difference between $Q_\pi$ and $q_\pi$ at $s_n$ will be close to that at $\tilde{s}_n$. Otherwise, the discount factor $\gamma^n$ shrinks exponentially, making the residual term very small. The experimental results in Sec. 4.2 show that in practice the residual term is an order of magnitude smaller than the two main sources. Thus it can be ignored, and we obtain the approximate decomposition:
$$\mathcal{E}(\hat{G}^n_\pi) \approx \mathcal{E}_{\text{offpolicy}}(\hat{G}^n_\pi) + \mathcal{E}_{\text{approx}}(\hat{G}^n_\pi).$$
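The decomposition can be measured directly in a tabular setting. The sketch below assumes access to the ground-truth values `q_true` (e.g., from exhaustive rollouts) and a fresh on-policy rollout from the same state-action pair; all names are illustrative, not the paper's code.

```python
import numpy as np

def nstep_from_trace(trace, bootstrap_value, gamma, n):
    """Sum of the first n discounted rewards plus the discounted bootstrap term."""
    ret = sum(gamma ** t * r for t, (_, _, r, _) in enumerate(trace[:n]))
    return ret + gamma ** n * bootstrap_value

def error_decomposition(trace, onpolicy_trace, Q, q_true, pi, gamma, n):
    """Sketch of the Sec. 4.1 decomposition for a single data point.

    Assumed inputs (tabular setting): `trace` is the stored off-policy trace
    starting at (s0, a0); `onpolicy_trace` is a fresh rollout of the current
    policy from the same (s0, a0); `Q`/`q_true` map a state to a vector of
    estimated / ground-truth action values; `pi(s)` returns action
    probabilities of the current policy.
    """
    s0, a0 = trace[0][0], trace[0][1]
    s_n = trace[n - 1][3]                 # end state of the stored trace
    s_n_tilde = onpolicy_trace[n - 1][3]  # end state of the on-policy trace

    g_full = nstep_from_trace(trace, np.dot(pi(s_n), Q(s_n)), gamma, n)
    g_no_approx = nstep_from_trace(trace, np.dot(pi(s_n), q_true(s_n)), gamma, n)
    g_no_offpolicy = nstep_from_trace(
        onpolicy_trace, np.dot(pi(s_n_tilde), Q(s_n_tilde)), gamma, n)

    ground_truth = q_true(s0)[a0]
    e_offpolicy = g_no_approx - ground_truth      # bias from the old trajectory
    e_approx = g_no_offpolicy - ground_truth      # error from Q != q
    e_total = g_full - ground_truth
    residual = e_total - e_offpolicy - e_approx   # small when gamma^n is small
    return e_offpolicy, e_approx, residual
```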

4.2. VERIFICATION BY EXPERIMENTS

In this section, we run tabular Q-learning and Soft Actor-Critic (Haarnoja et al., 2018a) on the Pendulum-v0 task (Brockman et al., 2016) to verify the decomposition quantitatively. For tabular Q-learning, we add a replay buffer (Mnih et al., 2015), as recent off-policy learning does, and discretize the continuous observation and action spaces; a sketch of this setup follows this paragraph. The approximation error is an overestimation, i.e., E_approx ≥ 0, while the off-policy bias is an underestimation, E_offpolicy ≤ 0. The single-step temporal difference target (n = 1) has a large approximation error, while too many steps (large n) lead to too much underestimation that cannot be balanced out. N-step returns work because a suitable selection of n makes the overestimation and the underestimation cancel each other.

The overestimation in E_approx is a well-studied issue. The greedy policy π acts as a max operator in Q-learning targets. Gradient ascent on π in actor-critic architectures is also an implicit max operator (Fujimoto et al., 2018). The max operator is one main cause of overestimation, as reported in (Hasselt, 2010). The use of function approximation (Thrun & Schwartz, 1999; Van Hasselt et al., 2015) in large-scale settings exacerbates this issue. Off-policy learning with a replay buffer is the root cause of the underestimation in E_offpolicy. Having undergone fewer update iterations, µ usually performs worse than π. Estimating q_π using µ therefore leads to underestimation, as we accumulate µ's rewards in the target.

As shown in Figure 1, from a per-data-point perspective, the off-policy bias E_offpolicy grows with off-policyness. Older data points make up a large proportion of the replay buffer, while data points from the current policy π are few, so Q_π has been updated mostly with older data points. That makes the approximation error E_approx decrease with off-policyness. From the perspective of n, a larger n leads to a smaller approximation error, as the weight γ^n of the estimated Q_π shrinks exponentially. But a larger n also makes the target accumulate more rewards from µ, and thus enlarges the off-policy bias.
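The discretization for the tabular experiment can be as simple as uniform binning. The sketch below is an assumed setup for Pendulum-v0; the bin counts and ranges are illustrative guesses, not the paper's actual values.

```python
import numpy as np

def make_discretizer(low, high, bins):
    """Map a continuous vector to a single table index by uniform binning."""
    low, high, bins = map(np.asarray, (low, high, bins))
    def index(x):
        frac = np.clip((np.asarray(x) - low) / (high - low), 0.0, 1.0 - 1e-9)
        digits = (frac * bins).astype(int)                 # per-dimension bin id
        return int(np.ravel_multi_index(tuple(digits), tuple(bins)))
    return index, int(np.prod(bins))

# Pendulum-v0: 3-dim observation (cos, sin, angular velocity), 1-dim action in [-2, 2].
obs_idx, n_obs = make_discretizer([-1, -1, -8], [1, 1, 8], [15, 15, 15])
act_idx, n_act = make_discretizer([-2], [2], [9])
Q = np.zeros((n_obs, n_act))   # tabular Q over the discretized state-action pairs
```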

5. IDENTIFYING THE OPTIMAL VALUE OF N

The magnitude of the off-policy bias is closely related to the off-policyness. To analyze the error of n-step returns, we need a precise measurement of off-policyness. Evaluating the real off-policyness directly, i.e., computing the actual difference between π and µ, is hard, as comparing two policies is non-trivial and computationally heavy, especially in complicated environments. We therefore introduce a new metric, policy age, which is simple to evaluate and predicts the real off-policyness well. For every data point $(s_t, a_t, r_t, s'_t)$, the policy age is defined as the number of update steps between π and µ. As shown in Figure 2, policy age accurately predicts the difference between π and µ, measured as $\mathbb{E}_\mu[|\log \pi(a|s) - \log \mu(a|s)|]$ over the $(s, a)$ pairs the agent has experienced. We use policy age as an indicator of off-policyness throughout the paper.
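Tracking policy age only requires storing a counter with each transition. The following is a minimal sketch of such a replay buffer; field and method names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

class ReplayBufferWithAge:
    """Replay buffer that records the update step at which each transition was
    collected, so policy age = current update step - stored update step."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.storage = []          # (s, a, r, s_next, update_step_when_collected)
        self.pos = 0
        self.update_step = 0       # gradient updates applied to the target policy so far

    def add(self, s, a, r, s_next):
        item = (s, a, r, s_next, self.update_step)
        if len(self.storage) < self.capacity:
            self.storage.append(item)
        else:
            self.storage[self.pos] = item
            self.pos = (self.pos + 1) % self.capacity

    def on_update(self):
        self.update_step += 1      # call once per gradient update of the policy

    def sample(self, batch_size):
        idx = np.random.randint(len(self.storage), size=batch_size)
        batch = [self.storage[i] for i in idx]
        ages = [self.update_step - item[-1] for item in batch]  # policy age per data point
        return batch, ages
```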

5.1. ADAPTIVE N-STEP BOOTSTRAPPING

As we pointed out in Sec. 4.2, the optimal value of n strikes a balance between overestimation and underestimation. Since the optimal value of n varies with the policy age, a fixed n is only a coarse approximation. To achieve the best performance, n should be chosen individually for each data point. We propose a novel algorithm, Adaptive N-step Bootstrapping, which selects the n that achieves the minimal error for every data point. We define the error at policy age p as
$$\mathcal{E}_p(\hat{G}_\pi) = \mathbb{E}_{\mu_p}[\hat{G}_\pi(s, a) - q_\pi(s, a)],$$
where $\mu_p$ refers to the behavior policy that is p updates away from π. Then, the optimal n* for every data point is
$$n^*(p) = \arg\min_n |\mathcal{E}_p(\hat{G}^n_\pi)|. \tag{5}$$
Since evaluating the error $\mathcal{E}_p(\hat{G}^n_\pi)$ requires the ground truth $q_\pi$, directly solving Eq. 5 for every data point is very expensive in large-scale environments. We therefore start by solving for n* on a small number of data points. As shown in Figure 3, these data points reveal that n*(p) has an exponential form in a wide variety of settings, both tabular and large-scale. This exponential pattern is rooted in the management of the replay buffer and in mini-batch gradient descent. The larger the policy age of a data point, the more times it has been sampled into mini-batches. This difference in sample counts affects both the off-policy bias and the approximation error. The off-policy bias part is simple: it increases linearly as the policy age grows. The approximation error is a little more complicated because the sampled data update the parameters through gradient descent. Bhandari et al. (2018) describe a similar setting and conclude that the rate of convergence is exponential. In summary, as the policy age grows, the off-policy bias increases linearly while the approximation error decreases exponentially. Thus, we use an exponential approximation for n*:
$$n^*(p) = \arg\min_n |\mathcal{E}_p(\hat{G}^n_\pi)| \approx \mathrm{round}\big(n_{\max} \cdot e^{-\log(n_{\max}) \cdot \min(1,\, p/d)}\big), \tag{6}$$
where p is the policy age of the data point. The maximum trace length $n_{\max}$ and the decay rate d are hyperparameters. Adaptive n-step returns can also be seen as a form of off-policy correction, cutting the trace when the difference between π and µ is large. It is simpler and more stable than IS and Retrace because it does not rely on a quantity like the IS ratio, which may have infinite variance. Algorithm 1 describes how adaptive n-step bootstrapping works.
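The schedule in Eq. 6 interpolates between the full trace length for fresh data and plain one-step TD for stale data. A direct sketch (assuming $n_{\max} > 1$):

```python
import math

def adaptive_n(policy_age, n_max, d):
    """Eq. 6: n shrinks from n_max (fresh data) to 1 (data older than d updates)."""
    decay = math.exp(-math.log(n_max) * min(1.0, policy_age / d))
    return max(1, round(n_max * decay))

# policy_age = 0   -> n = n_max
# policy_age >= d  -> n = 1 (plain one-step TD)
```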

5.2. HYPERPARAMETER SELECTION

As shown in Sec. 5.1, solving Eq. 5 on a limited number of data points is enough to reveal the exponential form; the n* curve of the complicated Swimmer-v3 environment is also successfully reconstructed. We adopt the same setting to solve for the optimal values of n_max and d. The exact solution on Swimmer-v3 gives n_max = 755. In the near-on-policy case, adaptive n-step behaves like the MC target, which is unbiased in the on-policy case (Sutton & Barto, 2018). However, the MC target also has large variance, which leads to performance degradation. So we need variance control measures, like the λ factor in Retrace (Munos et al., 2016). Vanilla n-step returns control the variance by setting n to a small value (Sutton & Barto, 2018); in our method, we clip n_max to reduce the variance. The notorious bias-variance trade-off makes it non-trivial to solve for the optimal n_max. However, the solved d = 122952 is very close to the d = 100000 that works for all MuJoCo tasks, hinting at the best hyperparameter range. We found that the best hyperparameters vary little across environments. Under this assumption, we use the tree-structured Parzen estimator (Bergstra et al., 2011) to optimize the hyperparameters on one environment and use this single set of hyperparameters for the whole benchmark.
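The paper does not specify the search library; below is a sketch using the TPE implementation in `hyperopt`. The search ranges and the `evaluate_agent(n_max, d)` routine (training an agent and returning its mean evaluation reward) are assumptions for illustration.

```python
import math
from hyperopt import fmin, tpe, hp

def evaluate_agent(n_max, d):
    # Hypothetical: train an adaptive n-step agent on one environment with the
    # given schedule hyperparameters and return its average evaluation return.
    raise NotImplementedError

space = {
    "n_max": hp.quniform("n_max", 2, 128, 1),                     # maximum trace length
    "d": hp.qloguniform("d", math.log(1e4), math.log(1e6), 1e4),  # decay rate in update steps
}

def objective(params):
    # TPE minimizes the objective, so negate the return.
    return -evaluate_agent(int(params["n_max"]), float(params["d"]))

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print(best)  # single hyperparameter set reused across the whole benchmark
```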

6. EXPERIMENTAL RESULTS

Q-learning and actor-critic methods span the space of model-free reinforcement learning algorithms. Q-learning updates the state-action value function Q_π and selects the action with maximal Q_π in an ε-greedy manner. Actor-critic methods have an explicit representation of π and update Q_π and π simultaneously. We conduct experiments on a representative of each family, DQN (Mnih et al., 2015) for Q-learning and SAC (Haarnoja et al., 2018a) for actor-critic, to test the generality of our method.

6.1. ACTOR-CRITIC METHODS

Soft Actor-Critic (SAC) is the state-of-the-art algorithm in the off-policy actor-critic domain. It focuses on stability and data efficiency, and can even be applied to challenging real-world robot control (Haarnoja et al., 2018b). We compare adaptive n-step with fixed n-step and Retrace on the SAC algorithm. For each update target, we use it in place of the single-step TD target in the original SAC implementation and evaluate on the Gym MuJoCo benchmark (Brockman et al., 2016). The results show that adaptive n-step returns consistently outperform all fixed n-step returns across all tasks. Adaptive n-step also has the lowest variance across runs, making it more stable. It is worth noting that different environments have different best-performing fixed n, while adaptive n-step performs well with a single set of n_max and d across all environments. This supports our claim that the optimal n varies across data points, while a fixed n is only a coarse approximation. Adaptive n-step also outperforms the Retrace method, both in average return and in computational cost. The calculation of the Retrace target requires 2n Q_π and 3n π evaluations per step (n is the trace length, n = 32 in our MuJoCo experiment), and each evaluation is a network forward pass. In contrast, the adaptive n-step target inherits the O(1) complexity of the vanilla n-step method, evaluating only one Q_π and one π for arbitrary n, which makes it much faster than Retrace.

6.2. Q-LEARNING

DQN (Mnih et al., 2015) set the foundation for combining Q-learning with deep neural networks, so we pick it as the representative of Q-learning methods. We conduct our experiments on a subset of Atari 2600 games. We compare adaptive n-step with fixed n-step at n = 3, the best value on Atari games as recommended by Hessel et al. (2018) and Fedus et al. (2020). As shown in Figure 5, adaptive n-step outperforms the fixed-value n-step returns in all games, and in most of them the performance benefit brought by the adaptive n-step method exceeds 20%.

7. CONCLUSION

N-step bootstrapping is usually viewed simply as a unification of Monte-Carlo returns and one-step temporal difference. However, with the introduction of the replay buffer to apply reinforcement learning to large-scale problems, we show that n actually serves as a control factor that reduces the estimation error. Thus, n should be selected per data point rather than fixed. Based on this observation, we propose the adaptive n-step bootstrapping algorithm, which selects the value of n for each data point individually. Experimental results show that adaptive n-step outperforms all fixed-value n settings by a large margin. Compared with other update target definitions, e.g., Retrace, adaptive n-step bootstrapping introduces only negligible computation cost and is easy to implement. These characteristics make it easy to embed into other algorithms.



Figure 1: N-step target errors. The top row is the tabular setting, the bottom row is SAC. The x-axis is off-policyness, quantified by the policy age described at the beginning of Sec. 5.

Figure 2: The x-axis is the policy age, and the y-axis is the difference between µ and π, which grows approximately linearly with the policy age.

Figure 3: n* and its exponential fit. Note that we only evaluate the error up to 128 steps, so n* may be clipped at a maximum of 128.

Figure 4: Training curves on the continuous control benchmark. We report the top 4 of 8 runs.

Figure 5: Comparison on Atari 2600 games. For both fixed 3-step and adaptive n-step, we report the agent that obtained the highest reward during training. The curves are smoothed by a moving average of length 10 to improve readability.

Algorithm 1 Adaptive N-step Bootstrapping
Hyperparameters: maximum steps n_max, decay rate d
for each bootstrapping iteration do
    sample a batch (s_i, a_i, r_i, s'_i) from the replay buffer
    for each data point i in the batch do
        calculate the policy age p of data point i
        n ← round(n_max · e^{-log(n_max) · min(1, p/d)})
        g_i ← Ĝ^n_π(s_i, a_i)
    end for
    update Q_π using the batch (s_i, a_i, r_i, s'_i) with targets g_i
end for
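For reference, a minimal Python sketch of the inner loop of Algorithm 1 is given below. The inputs are assumed interfaces: `traces[i]` is the list of stored transitions (s, a, r, s_next) starting at batch item i, `ages[i]` its policy age, and `q_bootstrap(s)` returns E_{a~π}[Q(s, a)] with a single network call; names are illustrative, not the paper's code.

```python
import numpy as np

def adaptive_nstep_targets(traces, ages, q_bootstrap, n_max, d, gamma=0.99):
    """Compute one adaptive n-step target per data point (Algorithm 1 inner loop)."""
    targets = []
    for trace, age in zip(traces, ages):
        # Eq. 6: trace length shrinks exponentially with policy age.
        n = max(1, round(n_max * np.exp(-np.log(n_max) * min(1.0, age / d))))
        n = min(n, len(trace))                  # cannot look past the stored trace
        ret, discount = 0.0, 1.0
        for (_, _, r, _) in trace[:n]:
            ret += discount * r
            discount *= gamma
        s_n = trace[n - 1][3]
        targets.append(ret + discount * q_bootstrap(s_n))  # one bootstrap evaluation only
    return np.asarray(targets)
```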

REFERENCES

Satinder P. Singh and Richard S. Sutton. Reinforcement learning with replacing eligibility traces. Machine Learning, 22(1-3):123-158, 1996.

Richard S. Sutton and Andrew G. Barto. Reinforcement Learning: An Introduction. MIT Press, 2018.

S. Thrun and A. Schwartz. Issues in using function approximation for reinforcement learning. 1999.

Hado Van Hasselt, Arthur Guez, and David Silver. Deep reinforcement learning with double Q-learning. arXiv preprint arXiv:1509.06461, 2015.

Christopher J. C. H. Watkins and Peter Dayan. Q-learning. Machine Learning, 8(3-4):279-292, 1992.

