MIXTURE OF STEP RETURNS IN BOOTSTRAPPED DQN

Abstract

The concept of utilizing multi-step returns for updating value functions has been adopted in deep reinforcement learning (DRL) for a number of years. Updating value functions with different backup lengths provides advantages in different aspects, including bias and variance of value estimates, convergence speed, and exploration behavior of the agent. Conventional methods such as TD (λ) leverage these advantages by using a target value equivalent to an exponential average of different step returns. Nevertheless, integrating step returns into a single target sacrifices the diversity of the advantages offered by different step return targets. To address this issue, we propose Mixture Bootstrapped DQN (MB-DQN), which is built on top of bootstrapped DQN and assigns different backup lengths to different bootstrapped heads. MB-DQN enables heterogeneity of the target values that is unavailable in approaches relying only on a single target value. As a result, it is able to maintain the advantages offered by different backup lengths. In this paper, we first discuss our motivational insights through a simple maze environment. In order to validate the effectiveness of MB-DQN, we perform experiments on the Atari 2600 benchmark environments, and demonstrate the performance improvement of MB-DQN over a number of baseline methods. We further provide a set of ablation studies to examine the impacts of different design configurations of MB-DQN.

1. INTRODUCTION

In recent value-based deep reinforcement learning (DRL), a value function is usually utilized to evaluate state values, which stand for estimates of the expected long-term cumulative rewards that might be collected by an agent. In order to perform such an evaluation, a deep neural network (DNN) is employed by a number of contemporary value-based DRL methods (Mnih et al., 2015; Wang et al., 2016; Hasselt et al., 2016; Osband et al., 2016; Hessel et al., 2018) as the value function approximator, in which the network parameters are iteratively updated based on the agent's experience of interactions with an environment. For many of these methods (Mnih et al., 2015; Wang et al., 2016; Hasselt et al., 2016; Osband et al., 2016; Hessel et al., 2018), the update procedure is carried out by one-step temporal-difference (TD) learning (Sutton & Barto, 1998) (or simply "one-step TD"), which calculates the error between an estimated state value and a target differing by one timestep. One-step TD has been demonstrated effective in backing up immediate reward signals collected by an agent. Nevertheless, the long temporal horizon that the reward signals from farther states have to propagate through might lead to an extended learning period of the value function approximator. Learning from multi-step returns (Sutton & Barto, 1998) is a way of propagating rewards newly observed by the agent faster to earlier visited states, and has been adopted in several previous works. Asynchronous advantage actor-critic (A3C) (Mnih et al., 2016) employs multi-step returns as targets to update the value functions of its asynchronous threads. Rainbow deep Q-network (Rainbow DQN) (Hessel et al., 2018) also utilizes multi-step returns during the backup procedure. The authors of (Barth-Maron et al., 2018) further modify the target value function of deep deterministic policy gradient (DDPG) (Lillicrap et al., 2016) to estimate TD errors using multi-step returns.
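The n-step target discussed above can be made concrete with a small sketch. The helper below (its name and signature are our own, for illustration only) sums n discounted rewards and bootstraps from the value estimate of the state reached n steps later; one-step TD is recovered as the special case n = 1, where the target reduces to r + γV(s').

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """N-step TD target: the sum of discounted rewards over the next
    n = len(rewards) steps, plus the discounted value estimate of the
    state reached n steps ahead."""
    n = len(rewards)
    target = sum(gamma ** i * r for i, r in enumerate(rewards))
    return target + gamma ** n * bootstrap_value
```

With a longer reward list, newly observed rewards reach the target of an earlier state directly instead of having to propagate one step per update, which is the speed-up the multi-step methods above exploit.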
Updating value functions with different backup lengths provides advantages in different aspects, including bias and variance of value estimates, convergence speed, and exploration behavior of the agent. Backing up reward signals through multi-step returns shifts the bias-variance tradeoff (Hessel et al., 2018). Therefore, backing up with different step return lengths (or simply 'backup length' hereafter (Asis et al., 2018)) might lead to different target values in the Bellman equation, resulting in different exploration behaviors of the agent as well as different achievable performance. The authors of (Amiranashvili et al., 2018) have demonstrated that the performance of the agent varies with different backup lengths, and showed that both very short and very long backup lengths could cause performance drops. These insights suggest that identifying the best backup length for an environment is not straightforward. In addition, although learning based on multi-step returns enhances the immediate sensitivity to future rewards, it comes at the expense of greater variance, which may cause the value function approximator to require more data samples to converge to the true expectation. Moreover, relying on a single target value with any specific backup length constrains the exploration behaviors of the agent, and might limit its achievable performance. Based on the above observations, several research works have been proposed to unify target values with different backup lengths so as to leverage their respective advantages. The traditional TD (λ) (Sutton & Barto, 1998) uses a target value equivalent to an exponential average of all n-step returns (where n is a natural number), providing faster empirical convergence by interpolating between low-variance TD returns and low-bias Monte Carlo returns.
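The TD (λ) mixture mentioned above can be sketched as follows, assuming the n-step returns G_t^(1), ..., G_t^(N) have already been computed (the function name is hypothetical). Each n-step return receives weight (1 − λ)λ^(n−1), and the final return absorbs the remaining mass so the weights sum to one; λ = 0 recovers the one-step TD target and λ = 1 the Monte Carlo return.

```python
def lambda_return(n_step_returns, lam):
    """TD(lambda) target: exponentially weighted average of n-step
    returns. The last entry (the Monte Carlo return) absorbs the
    remaining weight so that all weights sum to one."""
    total, weight = 0.0, (1.0 - lam)
    for g in n_step_returns[:-1]:
        total += weight * g
        weight *= lam
    # remaining mass lam ** (N - 1) goes to the final (full) return
    total += lam ** (len(n_step_returns) - 1) * n_step_returns[-1]
    return total
```

Note that whatever λ is chosen, the output is still a single scalar target, which is exactly the property MB-DQN relaxes.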
DQN (λ) (Daley & Amato, 2019) further proposes an efficient implementation of TD (λ) for DRL by modifying the replay buffer memory such that λ-returns can be pre-computed. Although these methods benefit from combining multiple distinct backup lengths, they still rely on a single target value during the update procedure. Integrating step returns into a single target value, nevertheless, may sacrifice the diversity of the advantages provided by different step return targets. As a result, in this paper, we propose Mixture Bootstrapped DQN (abbreviated as "MB-DQN") to address the above issues. MB-DQN is built on top of bootstrapped DQN (Osband et al., 2016), which contains multiple bootstrapped heads with randomly initialized weights to learn a set of value functions. MB-DQN leverages the advantages of different step return targets by assigning a distinct backup length to each bootstrapped head. Each bootstrapped head maintains its own target value derived from the assigned backup length during the update procedure. Since the backup lengths of the bootstrapped heads are distinct from each other, MB-DQN provides heterogeneity in the target values as well as diversified exploration behaviors of the agent that are unavailable in approaches relying only on a single target value. To validate the proposed concept, in our experiments, we first provide motivational insights on the influence of different configurations of backup lengths in a simple maze environment. We then evaluate the proposed MB-DQN on the Atari 2600 (Bellemare et al., 2015) benchmark environments, and demonstrate its performance improvement over a number of baseline methods. We further provide a set of ablation studies to analyze the impacts of different design configurations of MB-DQN.
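The per-head target construction can be illustrated with a minimal sketch. All names here are our own, and the real MB-DQN operates on neural bootstrapped heads rather than scalar value estimates; the sketch only shows how assigning a distinct backup length to each head yields a heterogeneous set of targets from the same trajectory.

```python
GAMMA = 0.99  # discount factor (an illustrative choice)

def head_target(rewards, bootstrap_value, n):
    """n-step TD target for one bootstrapped head, using that head's
    own backup length n (truncated if the trajectory is shorter)."""
    rs = rewards[:n]
    g = sum(GAMMA ** i * r for i, r in enumerate(rs))
    return g + GAMMA ** len(rs) * bootstrap_value

def mbdqn_targets(rewards, head_values, backup_lengths):
    """Each head k bootstraps from its own value estimate with its own
    backup length n_k, so the heads are trained toward heterogeneous
    targets instead of one shared target."""
    return [head_target(rewards, v, n)
            for v, n in zip(head_values, backup_lengths)]
```

In contrast to TD (λ), which collapses the n-step returns into one averaged scalar, the list returned here keeps one target per head, preserving the distinct bias-variance profiles of the different backup lengths.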
In summary, the primary contributions of this paper include: (1) introducing an approach for maintaining the advantages from different backup lengths, (2) providing heterogeneity in the target values by utilizing multiple bootstrapped heads, and (3) enabling diversified exploration behaviors of the agent. The remainder of this paper is organized as follows. Section 2 provides the background material related to this work. Section 3 walks through the proposed MB-DQN methodology. Section 4 reports the experimental results, and presents a set of ablation analyses. Section 5 concludes this paper.

2. BACKGROUND

In this section, we provide the background material related to this work. We first introduce the basic concepts of the Markov Decision Process (MDP) and one-step return, followed by an explanation of the concept of multi-step returns. Next, we provide a brief overview of the Deep Q-Network (DQN).

2.1. MARKOV DECISION PROCESS AND ONE-STEP RETURN

In RL, an agent interacting with an environment E with state space S and action space A is often formulated as an MDP. At each timestep t, the agent perceives a state s_t ∈ S, takes an action a_t ∈ A according to its policy π(a|s), receives a reward r_t ∼ R(s_t, a_t), and transits to the next state s_{t+1} ∼ p(s_{t+1}|s_t, a_t), where R(s_t, a_t) and p(s_{t+1}|s_t, a_t) are the reward function and the transition probability function, respectively. The main objective of the agent is to learn an optimal policy π*(a|s) that maximizes the discounted cumulative return G_t = \sum_{i=t}^{T} \gamma^{i-t} r_i, where γ ∈ (0, 1] is the discount factor and T is the horizon. For a given policy π(a|s), the state value function V^π and the state-action value function Q^π are defined as the expected discounted cumulative return G_t starting
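As a small numerical check of the discounted return defined above (with arbitrarily chosen rewards and discount factor):

```python
# Discounted return G_t = sum over i from t to T of gamma^(i-t) * r_i,
# computed for a short trajectory of three rewards starting at time t.
gamma = 0.9
rewards = [1.0, 0.0, 2.0]  # r_t, r_{t+1}, r_{t+2}
G = sum(gamma ** i * r for i, r in enumerate(rewards))
# G = 1.0 + 0.9 * 0.0 + 0.81 * 2.0 = 2.62
```

Rewards farther in the future are attenuated geometrically by γ, which is why a longer backup horizon both carries more reward information and accumulates more variance.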

