ADAPTIVE N-STEP BOOTSTRAPPING WITH OFF-POLICY DATA

Abstract

The definition of the update target is a crucial design choice in reinforcement learning. Due to their low computation cost and strong empirical performance, n-step returns computed from off-policy data are a widely used update target for bootstrapping from scratch. A critical issue in applying n-step returns is identifying the optimal value of n. In practice, n is often set to a fixed value, determined either by an empirical guess or by a hyper-parameter search. In this work, we point out that the optimal value of n actually differs across data points, while a fixed n is only a rough average of them. The estimation error can be decomposed into two sources, off-policy bias and approximation error, and a fixed value of n is a trade-off between them. Based on this observation, we introduce a new metric, policy age, to quantify the off-policyness of each data point. We propose Adaptive N-step Bootstrapping, which computes the value of n for each data point from its policy age instead of an empirical guess. We conduct experiments on both MuJoCo and Atari games. The results show that adaptive n-step bootstrapping achieves state-of-the-art performance in terms of both final reward and data efficiency.

1. INTRODUCTION

The goal of reinforcement learning (RL) is to find an optimal policy by interacting with the environment. To do so, an RL algorithm needs to define a target, e.g., a Q function or value function, and update it iteratively to bootstrap from scratch. The challenge of designing an efficient update target manifests in both sample complexity and computational complexity. Ideally, the target should only be updated with data generated by the corresponding policy to obtain an unbiased estimate (Sutton & Barto, 2018), while the amount of data needs to reach a certain scale to control the variance (Schulman et al., 2015). These two requirements limit the update frequency and ultimately lead to high sample complexity. On the computational side, the consideration is the trade-off between the computation cost of each step and the number of total steps. Monte-Carlo returns have advantages in generalization (they behave well with function approximation) and exploration (a quick propagation of new findings), at the cost of computing the whole trajectory at each step (Sutton & Barto, 2018). Bootstrapping methods apply readily to off-policy data and keep the trace length under control (Sutton & Barto, 2018). However, they require more update steps to converge compared with Monte-Carlo returns. These design concerns are intertwined, which makes it hard to achieve a good balance. N-step returns (Sutton & Barto, 2018) serve as the basis of various update targets, due to their flexibility and simple implementation. Together with off-policy learning (Sutton & Barto, 2018) and the replay buffer (Mnih et al., 2015), n-step returns make it possible to update the target frequently while keeping the variance within a controllable range. However, a systematic study (Fedus et al., 2020) reveals that the performance of n-step returns depends heavily on the exact value of n.
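To make the trade-off concrete, the n-step target sums n discounted rewards along the stored trajectory and then bootstraps from a learned value estimate. The following is a minimal sketch of that computation; the function name and arguments are illustrative, not the paper's implementation, and `bootstrap_value` stands in for whatever value estimate the agent uses (e.g., max over actions of the current Q network at the state reached after n steps).

```python
def n_step_return(rewards, bootstrap_value, n, gamma=0.99):
    """Sketch of an n-step bootstrapped return for a single transition.

    rewards: rewards r_t, ..., r_{t+k-1} stored along the trajectory.
    bootstrap_value: value estimate of the state reached after n steps
        (hypothetical; e.g., max_a Q(s_{t+n}, a) from the current network).
    n: number of real rewards to accumulate before bootstrapping.
    """
    n = min(n, len(rewards))  # truncate at episode end
    # sum of discounted rewards over the first n steps
    g = sum(gamma ** k * rewards[k] for k in range(n))
    # bootstrap: replace the remaining return with the value estimate
    return g + gamma ** n * bootstrap_value
```

With n = 1 this reduces to the standard one-step TD target, while a large n approaches a Monte-Carlo return; in a replay buffer the rewards come from an older policy, which is the source of the off-policy bias discussed next.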
Since the underlying working mechanism is unclear, previous research could only offer vague suggestions based on empirical results, such as simply increasing the value from one to a larger number, e.g., 3 or 4. In this paper, we show that the estimation error of n-step returns can be decomposed into off-policy bias (the under-estimation part) and approximation error (the over-estimation part), and that the selection of n controls the balance between them. Data stored in the replay buffer are generated by previous policies; adopting them for updates therefore introduces the off-policy bias. Since the current policy

