MIXTURE OF STEP RETURNS IN BOOTSTRAPPED DQN

Abstract

The concept of utilizing multi-step returns for updating value functions has been adopted in deep reinforcement learning (DRL) for a number of years. Updating value functions with different backup lengths provides advantages in different aspects, including bias and variance of value estimates, convergence speed, and exploration behavior of the agent. Conventional methods such as TD (λ) leverage these advantages by using a target value equivalent to an exponential average of different step returns. Nevertheless, integrating step returns into a single target sacrifices the diversity of the advantages offered by different step return targets. To address this issue, we propose Mixture Bootstrapped DQN (MB-DQN), which is built on top of bootstrapped DQN and uses different backup lengths for different bootstrapped heads. MB-DQN enables heterogeneity of the target values that is unavailable in approaches relying only on a single target value. As a result, it is able to maintain the advantages offered by different backup lengths. In this paper, we first discuss the motivational insights through a simple maze environment. In order to validate the effectiveness of MB-DQN, we perform experiments on the Atari 2600 benchmark environments, and demonstrate the performance improvement of MB-DQN over a number of baseline methods. We further provide a set of ablation studies to examine the impacts of different design configurations of MB-DQN.

1. INTRODUCTION

In recent value-based deep reinforcement learning (DRL), a value function is usually utilized to evaluate state values, which stand for estimates of the expected long-term cumulative rewards that might be collected by an agent. In order to perform such an evaluation, a deep neural network (DNN) is employed by a number of contemporary value-based DRL methods (Mnih et al., 2015; Wang et al., 2016; Hasselt et al., 2016; Osband et al., 2016; Hessel et al., 2018) as the value function approximator, in which the network parameters are iteratively updated based on the agent's experience of interactions with an environment. For many of these methods (Mnih et al., 2015; Wang et al., 2016; Hasselt et al., 2016; Osband et al., 2016; Hessel et al., 2018), the update procedure is carried out by one-step temporal-difference (TD) learning (Sutton & Barto, 1998) (or simply "one-step TD"), which calculates the error between an estimated state value and a target differing by one timestep. One-step TD has been demonstrated effective in backing up immediate reward signals collected by an agent. Nevertheless, the long temporal horizon that the reward signals from farther states have to propagate through might lead to an extended learning period of the value function approximator. Learning from multi-step returns (Sutton & Barto, 1998) is a way of propagating newly observed rewards faster to earlier visited states, and has been adopted in several previous works. Asynchronous advantage actor-critic (A3C) (Mnih et al., 2016) employs multi-step returns as targets to update the value functions of its asynchronous threads. Rainbow deep Q-network (Rainbow DQN) (Hessel et al., 2018) also utilizes multi-step returns during the backup procedure. The authors in (Barth-Maron et al., 2018) also modify the target value function of deep deterministic policy gradient (DDPG) (Lillicrap et al., 2016) to estimate TD errors using multi-step returns. Updating value functions with different backup lengths provides advantages in different aspects, including bias and variance of value estimates, convergence speed, and exploration behavior of the agent. Backing up reward signals through multi-step returns shifts the bias-variance tradeoff (Hessel et al., 2018). Therefore, backing up with different step return lengths (or simply 'backup length' hereafter (Asis et al., 2018)) might lead to different target values in the Bellman equation, resulting in different exploration behaviors of the agent as well as different achievable performance. The authors in (Amiranashvili et al., 2018) have demonstrated that the performance of the agent varies with different backup lengths, and showed that both very short and very long backup lengths could cause performance drops. These insights suggest that identifying the best backup length for an environment is not straightforward. In addition, although learning based on multi-step returns enhances the immediate sensitivity to future rewards, it comes at the expense of greater variance, which may cause the value function approximator to require more data samples to converge to the true expectation. Moreover, relying on a single target value with any specific backup length constrains the exploration behaviors of the agent, and might limit its achievable performance. Based on the above observations, several research works have been proposed to unify target values with different backup lengths so as to leverage their respective advantages.
The traditional TD (λ) (Sutton & Barto, 1998) uses a target value equivalent to an exponential average of all n-step returns (where n is a natural number), providing faster empirical convergence by interpolating between low-variance TD returns and low-bias Monte Carlo returns. DQN (λ) (Daley & Amato, 2019) further proposes an efficient implementation of TD (λ) for DRL by modifying the replay buffer memory such that λ-returns can be pre-computed. Although these methods benefit from combining multiple distinct backup lengths, they still rely on a single target value during the update procedure. Integrating step returns into a single target value, nevertheless, may sacrifice the diversity of the advantages provided by different step return targets. As a result, in this paper, we propose Mixture Bootstrapped DQN (abbreviated as "MB-DQN") to address the above issues. MB-DQN is built on top of bootstrapped DQN (Osband et al., 2016), which contains multiple bootstrapped heads with randomly initialized weights to learn a set of value functions. MB-DQN leverages the advantages of different step return targets by assigning a distinct backup length to each bootstrapped head. Each bootstrapped head maintains its own target value derived from the assigned backup length during the update procedure. Since the backup lengths of the bootstrapped heads are distinct from each other, MB-DQN provides heterogeneity in the target values as well as diversified exploration behaviors of the agent that are unavailable in approaches relying only on a single target value. To validate the proposed concept, in our experiments, we first provide motivational insights on the influence of different configurations of backup lengths in a simple maze environment. We then evaluate the proposed MB-DQN on the Atari 2600 (Bellemare et al., 2015) benchmark environments, and demonstrate its performance improvement over a number of baseline methods. We further provide a set of ablation studies to analyze the impacts of different design configurations of MB-DQN. In summary, the primary contributions of this paper include: (1) introducing an approach for maintaining the advantages from different backup lengths, (2) providing heterogeneity in the target values by utilizing multiple bootstrapped heads, and (3) enabling diversified exploration behaviors of the agent. The remainder of this paper is organized as follows. Section 2 provides the background material related to this work. Section 3 walks through the proposed MB-DQN methodology. Section 4 reports the experimental results and presents a set of ablation analyses. Section 5 concludes this paper.

2. BACKGROUND

In this section, we provide the background material related to this work. We first introduce the basic concepts of the Markov Decision Process (MDP) and one-step return, followed by an explanation of the concept of multi-step returns. Next, we provide a brief overview of the Deep Q-Network (DQN).

2.1. MARKOV DECISION PROCESS AND ONE-STEP RETURN

In RL, an agent interacting with an environment E with state space S and action space A is often formulated as an MDP. At each timestep t, the agent perceives a state s_t ∈ S, takes an action a_t ∈ A according to its policy π(a|s), receives a reward r_t ∼ R(s_t, a_t), and transits to the next state s_{t+1} ∼ p(s_{t+1}|s_t, a_t), where R(s_t, a_t) and p(s_{t+1}|s_t, a_t) are the reward function and the transition probability function, respectively. The main objective of the agent is to learn an optimal policy π*(a|s) that maximizes the discounted cumulative return $G_t = \sum_{i=t}^{T} \gamma^{i-t} r_i$, where γ ∈ (0, 1] is the discount factor and T is the horizon. For a given policy π(a|s), the state value function V^π and the state-action value function Q^π are defined as the expected discounted cumulative return G_t starting from a state s and a state-action pair (s, a), respectively, and can be represented as the following:

$$V^{\pi}(s) = \mathbb{E}[G_t \mid s_t = s, \pi], \qquad Q^{\pi}(s, a) = \mathbb{E}[G_t \mid s_t = s, a_t = a, \pi]. \tag{1}$$

In order to maximize E[G_t], conventional value-based RL methods often use one-step TD learning to iteratively update V^π and Q^π. Taking Q^π as an example, the update rule is expressed as the following:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_t + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right], \tag{2}$$

where α ∈ (0, 1] is a step size parameter which controls the update speed. This update procedure only considers the immediate reward r_t and the bootstrapped estimate γQ(s_{t+1}, a_{t+1}), which together form the one-step return.
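For concreteness, the one-step TD update in Eq. (2) can be sketched in Python as follows. This is a minimal tabular illustration; the environment interface (env.reset, env.step, env.actions) and the epsilon-greedy behavior policy are assumptions made for the example rather than part of the original formulation.

```python
import random
from collections import defaultdict

def one_step_q_learning(env, num_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(float)  # Q[(state, action)] -> value estimate

    def policy(state, actions):
        # epsilon-greedy action selection over the current estimates
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state, env.actions)
            next_state, reward, done = env.step(action)
            if done:
                target = reward
            else:
                next_action = policy(next_state, env.actions)
                # one-step return target: r_t + gamma * Q(s_{t+1}, a_{t+1})
                target = reward + gamma * Q[(next_state, next_action)]
            # TD update with step size alpha, as in Eq. (2)
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state = next_state
    return Q
```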

2.2. MULTI-STEP RETURN

Multi-step return is a variant of the one-step return presented in the previous section. It modifies the target of the one-step return by bootstrapping over longer time intervals. It replaces the single reward r_t in Eq. (2) with the truncated multi-step return R^n_t, which is represented as follows:

$$R^{n}_{t} = \sum_{j=0}^{n} \gamma^{j} r_{t+j}, \tag{3}$$

where n is the selected backup length. The update rule of Eq. (2) is then re-written as the following:

$$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ R^{n}_{t} + \gamma^{n+1} Q(s_{t+n+1}, a_{t+n+1}) - Q(s_t, a_t) \right].$$

A longer backup length n has been shown to increase the variance of the estimated Q(s_t, a_t) as well as decrease its bias (Jaakkola et al., 1994). Despite the higher variance and the increased computational cost, multi-step returns enhance the immediate sensitivity of the value approximator to future rewards, and allow reward signals to be backed up faster. As a result, in certain cases, it is possible to achieve a faster learning speed for the value approximator by using an appropriate backup length n larger than one (Sutton & Barto, 1998; Hessel et al., 2018; Amiranashvili et al., 2018).
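A minimal tabular sketch of the truncated multi-step return and the corresponding update is given below; the transition-list format is an assumption made purely for illustration.

```python
def n_step_return(rewards, t, n, gamma):
    """Truncated multi-step return R^n_t = sum_{j=0}^{n} gamma^j * r_{t+j} (Eq. 3)."""
    return sum(gamma ** j * rewards[t + j] for j in range(n + 1))

def n_step_update(Q, transitions, t, n, alpha, gamma):
    """Tabular update following the re-written rule above.

    transitions[i] = (state_i, action_i, reward_i); the list must extend at
    least to index t + n + 1 so that the bootstrap pair is available.
    """
    rewards = [r for (_, _, r) in transitions]
    s_t, a_t, _ = transitions[t]
    s_boot, a_boot, _ = transitions[t + n + 1]        # bootstrap state-action pair
    target = n_step_return(rewards, t, n, gamma) + gamma ** (n + 1) * Q[(s_boot, a_boot)]
    Q[(s_t, a_t)] += alpha * (target - Q[(s_t, a_t)])
```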

2.3. DEEP Q-NETWORK

DQN is a DNN parameterized by θ for approximating the optimal Q-function. DQN is trained using samples drawn from an experience replay buffer Z, and is updated based on one-step TD learning with an objective to minimize a loss function L_DQN, which is typically expressed as the following:

$$L_{\mathrm{DQN}} = \mathbb{E}_{s,a,r,s' \sim U(Z)} \left[ \left( y_{s,a} - Q(s, a; \theta) \right)^{2} \right],$$

where $y_{s,a} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^{-})$ is the one-step target value, U(Z) is a uniform distribution over Z, and θ⁻ denotes the parameters of the target network. θ⁻ is updated from θ at predefined intervals.
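The following is a minimal PyTorch-flavored sketch of this loss; the batch layout, tensor shapes, and network interfaces are illustrative assumptions rather than the implementation used in this paper.

```python
import torch
import torch.nn.functional as F

def dqn_loss(q_net, target_net, batch, gamma):
    """One-step DQN loss with a periodically synchronized target network."""
    states, actions, rewards, next_states, dones = batch
    # Q(s, a; theta) for the actions that were actually taken
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        # y_{s,a} = r + gamma * max_a' Q(s', a'; theta^-), zeroed at terminal states
        next_q = target_net(next_states).max(dim=1).values
        targets = rewards + gamma * (1.0 - dones) * next_q
    return F.mse_loss(q_values, targets)
```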

3. METHODOLOGY

In this section, we first demonstrate the impacts of different backup lengths on the behaviors of an agent in a simple maze environment. Then, we walk through the details of the MB-DQN framework.


3.1. AGENT BEHAVIOR WITH DIFFERENT BACKUP LENGTHS IN DQN

To illustrate the impacts of different backup lengths on an agent's behavior, we first consider a toy model in a two-dimensional maze environment containing a starting point and a goal, as depicted in Fig. 1 (a). We use DQN as our default agent and perform our experiments on this maze environment (denoted as Dense Maze) with dense rewards. In this setting, the reward of a grid gradually decreases as its distance to the goal increases. We depict the states visited by the agents for 100k timesteps in the training phase in Figs. 1 (b) and (c). It is observed that the agent trained with 5-step return reaches the goal through a shorter path than that of the 1-step return case. This is because the longer backup length allows the agent to adjust its value function estimation faster. On the other hand, although the agent trained with 1-step return might converge slower than that of the 5-step return case, it is observed that 1-step return enables the agent to visit and explore more states in the early stage. This is because the reward signal from a farther state has to propagate through a longer temporal horizon. Therefore, the agent explores more extensively before learning an effective policy to reach the goal.

Figure 2: Overview of the proposed MB-DQN framework, comparing vanilla bootstrapped DQN, in which every head Q_k(s, a; θ_k) uses the same 1-step backup, with MB-DQN, in which each head is assigned its own backup length n_k.

3.2. MIXTURE OF STEP RETURNS IN BOOTSTRAPPED DQN

In order to combine step returns with different backup lengths, we choose bootstrapped DQN (Osband et al., 2016) as our backbone framework. Bootstrapped DQN modifies DQN to approximate a distribution over Q-values via bootstrapping, and has demonstrated improvements in both the learning speed and the performance of the agents in various environments. At the beginning of an episode, bootstrapped DQN uniformly samples a Q-value function head Q_k(s, a; θ_k), k ∈ {1, ..., K}, from its K bootstrapped Q-value function heads, as shown in Fig. 2. The agent then performs its control according to Q_k(s, a; θ_k) during the entire episode. Bootstrapped DQN re-samples a Q-value function head for each episode, but every head uses the same backup length (i.e., 1-step return) to calculate its target value (i.e., Eq. (2)). The framework, nevertheless, might lack diversity and heterogeneity among the bootstrapped heads. As a result, we leverage the advantages of distinct Q-value function heads in bootstrapped DQN, and propose the usage of mixed backup lengths for different bootstrapped Q-value function heads in our MB-DQN framework, which is shown in Fig. 2. MB-DQN is similarly implemented with K bootstrapped heads for estimating the Q-value function, where each bootstrapped head k ∈ {1, ..., K} corresponds to its own backup length n_k. In each episode, MB-DQN also uniformly and randomly selects a head k ∈ {1, ..., K}, and stores the state transition data collected by the agent using this head into a replay buffer. The replay buffer is played back periodically to update the parameters of all the bootstrapped Q-value function heads as well as the shared convolutional neural network. Each head is trained with its own target network Q_k(s, a; θ⁻_k) and its own target value y^k_{s,a} with the truncated multi-step return defined in Eq. (3). The detailed update method is summarized in Algorithm 1, while the training methodology is the same as bootstrapped DQN (Osband et al., 2016). The truncated multi-step returns with different backup lengths thus provide diversity and heterogeneity for the K bootstrapped estimates, which balance the strengths and the weaknesses of different backup lengths.

Algorithm 1 Update Methodology of MB-DQN
1: Initialize K Q-networks Q_k(s, a; θ_k) with random weights θ_k
2: Assign each network Q_k its own backup length n_k
3: for each update step do
4:   for each head k = 1, 2, ..., K do
5:     $R^{n_k}_t = \sum_{j=0}^{n_k} \gamma^j r_{t+j}$
6:     $y^k_{s,a} = R^{n_k}_t + \gamma^{n_k} Q_k\big(s_{t+n_k}, \arg\max_a Q(s_{t+n_k}, a; \theta_k); \theta^{-}_k\big)$
7:     $\theta_k \approx \arg\min_{\theta_k} \mathbb{E}\big[(y^k_{s,a} - Q_k(s, a; \theta_k))^2\big]$

Qualitative comparison via attention maps. In order to understand the rationale behind the high performance and advantages offered by MB-DQN, we further visualize the attention areas (Greydanus et al., 2018) of the agents trained with MB-DQN for three cases: attention areas generated from (a) all of the 1-step bootstrapped heads in MB-DQN (denoted as 1-Step Heads), (b) all of the 3-step bootstrapped heads in MB-DQN (denoted as 3-Step Heads), and (c) the composition of the bootstrapped heads which contribute to the decided actions (i.e., the majority of the bootstrapped heads that vote for the resultant actions, denoted as Majority Heads). Please note that the composition in (c) may contain both the 1-step and 3-step bootstrapped heads. Fig. 4 illustrates the attention areas (rendered in red) of these three cases for two Atari games: Breakout and Seaquest, and highlights their differences with yellow circles. In Breakout, it is observed that the 1-Step Heads case focuses more on the ball, the most important object in this game, than the 3-Step Heads case. It is also observed that when the majority of the bootstrapped heads is considered, the attention of the agent falls on the ball as well, allowing MB-DQN to play as well as the All-1-Step baseline in Breakout in Fig. 3. In Seaquest, it is observed that the 1-Step Heads case focuses more on the scoreboard, while the 3-Step Heads case focuses more on the enemies and the submarine. The attention areas of the Majority Heads case, on the other hand, cover the areas from both the 1-Step Heads and the 3-Step Heads cases, allowing MB-DQN to outperform the two baselines in Fig. 3. These examples therefore validate that MB-DQN is able to leverage the advantages from different backup lengths, and can achieve superior performance to the two baselines by offering heterogeneity among its bootstrapped heads.
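For concreteness, the per-head target computation of Algorithm 1 can be sketched as follows. The sketch assumes the common n-step convention (n_k rewards followed by a γ^{n_k}-discounted bootstrap at s_{t+n_k}); the per-head network interfaces q_net(k, s) and target_net(k, s) and the trajectory format are illustrative assumptions, not the authors' RLTF implementation.

```python
import torch

def mb_dqn_targets(q_net, target_net, trajectory, t, backup_lengths, gamma):
    """Compute one target value per bootstrapped head for the transition at time t.

    trajectory: list of (state, action, reward) tuples, long enough to cover
                the largest backup length n_k.
    backup_lengths: list of n_k values, one per head k.
    """
    targets = []
    for k, n_k in enumerate(backup_lengths):
        # truncated multi-step return: sum of n_k discounted rewards from timestep t
        R = sum(gamma ** j * trajectory[t + j][2] for j in range(n_k))
        s_boot = trajectory[t + n_k][0]                     # bootstrap state s_{t+n_k}
        with torch.no_grad():
            # action selected by the online head, evaluated by its target head
            a_star = q_net(k, s_boot).argmax(dim=-1)
            boot_value = target_net(k, s_boot).gather(-1, a_star.unsqueeze(-1)).squeeze(-1)
        targets.append(R + gamma ** n_k * boot_value)       # per-head target y^k_{s,a}
    return targets
```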

4.2. ANALYSIS OF THE DATA SAMPLE QUALITY FOR MB-DQN

As the experimental results presented in the previous section have quantitatively and qualitatively demonstrated the performance benefits offered by MB-DQN, we next dive further to investigate the rationale behind these advantages. We hypothesize that the performance improvements provided by MB-DQN may come from the quality of the collected data samples in the experience replay buffer. In other words, MB-DQN may have benefited from the heterogeneity in the data samples collected by bootstrapped heads with different backup lengths. To validate this hypothesis, we design an experiment containing two agents: one agent is responsible for generating state-action pairs for an experience replay buffer while updating its Q-value network with the data contained in it. The other agent only updates its Q-value network with the existing data samples contained in the replay buffer, without contributing data to it. Both of these agents are implemented with ten bootstrapped heads. We consider three configurations for the former data generation agent: Mixed-1-3-Step (MB-DQN), All-1-Step (Baseline), and All-3-Step (Baseline), and two configurations for the latter learning-only agent (i.e., the one without contributing data samples to the replay buffer). The evaluation curves of these configurations are plotted in Fig. 5 (a). These results thus validate our hypothesis that the data samples generated by MB-DQN are superior in quality to those generated by the other configurations, and explain why MB-DQN is able to offer performance benefits in the environments presented in this paper.
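The protocol of this experiment can be summarized by the following minimal sketch, in which a data-generating agent and a learning-only agent share a single replay buffer; all interfaces (agent.act, agent.update, and the buffer methods) are illustrative assumptions.

```python
def run_data_quality_experiment(env, generator, learner_only, buffer,
                                num_steps, update_every=4):
    """The generator agent fills and learns from the shared buffer;
    the learner-only agent updates from the same buffer without adding data."""
    state = env.reset()
    for step in range(num_steps):
        # only the generator interacts with the environment and stores transitions
        action = generator.act(state)
        next_state, reward, done = env.step(action)
        buffer.add(state, action, reward, next_state, done)
        state = env.reset() if done else next_state

        if step % update_every == 0 and len(buffer) >= buffer.min_size:
            batch = buffer.sample()
            generator.update(batch)      # learns from the data it generated itself
            learner_only.update(batch)   # learns from the same data, contributes none
    return generator, learner_only
```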

4.3. λ-TARGET VERSUS HETEROGENEOUS BOOTSTRAPPED TARGETS

In order to validate our assumption in Section 1 that unifying different step return targets into a single target value may not be as effective as the heterogeneous bootstrapped approach adopted by MB-DQN, in this section, we compare these strategies of combining step returns in several Atari environments. For the unified return target strategy, we consider a recently proposed method called DQN (λ) (Daley & Amato, 2019), which implements TD (λ) by pre-computing λ-returns using an additional cache for its replay buffer memory. On the other hand, MB-DQN employs a strategy that leverages K heterogeneous bootstrapped heads, where each head k ∈ {1, ..., K} has its own target value. For a fair comparison, we further include a variant of DQN (λ), called DQN (λ) Ensemble, which employs K bootstrapped heads using the λ-return as the target value. In our experiments, we set K = 10 for MB-DQN and DQN (λ) Ensemble, while the settings for DQN (λ) are configured with its default values (Daley & Amato, 2019). The single target value used by DQN (λ) is derived from multiple backup lengths ranging from one to a hundred. The evaluation curves of these strategies are plotted in Fig. 5 (b). It can be observed that for the four environments presented in Fig. 5 (b), the curves corresponding to the heterogeneous bootstrapped targets strategy (i.e., MB-DQN) grow faster and higher than those corresponding to the single-unified target strategy (i.e., DQN (λ) and DQN (λ) Ensemble). The above evidence not only validates our assumption in Section 1, but also reveals that the advantages offered by the heterogeneity in multiple target values may outweigh the advantages offered by a single TD (λ) target that aggregates returns over a long temporal horizon.

4.4. ABLATION ANALYSIS

In this section, we provide a set of ablation analyses for the proposed MB-DQN on four selected Atari games, including Breakout, Qbert, Seaquest, and Freeway, to examine the impacts of different configurations on MB-DQN's performance. We perform two sets of analyses for MB-DQN: (a) different configurations of step returns for the bootstrapped heads, and (b) different numbers of the bootstrapped heads. Please note that additional experimental results are provided in the appendix.

4.4.1. DIFFERENT CONFIGURATIONS OF STEP RETURNS FOR THE BOOTSTRAPPED HEADS

For the Mixed-1-2-Step, Mixed-1-3-Step, and Mixed-2-3-Step configurations, the ten bootstrapped heads are evenly distributed to the different backup lengths. The results are presented in Fig. 6 (a). For all of the configurations, it is observed that the agents trained with different mixtures of step returns perform similarly, and outperform those trained with the All-1-Step baseline. These evaluation results thus suggest that the bootstrapped heads in the proposed MB-DQN are not limited to certain configurations of step returns.

4.4.2. DIFFERENT NUMBERS OF THE BOOTSTRAPPED HEADS

In bootstrapped DQN (Osband et al., 2016), more bootstrapped heads lead to faster learning, while even a small number of bootstrapped heads is still able to capture most of its benefits. As MB-DQN inherits its architecture from bootstrapped DQN, we investigate the impacts of different numbers of bootstrapped heads on MB-DQN, and examine whether MB-DQN still maintains this property. We perform a set of experiments with three different configurations of the bootstrapped heads: K = 2, 4, and 10. For each of these configurations, MB-DQN is set to consist of K/2 1-step bootstrapped heads and K/2 3-step bootstrapped heads. In contrast, the bootstrapped DQN baseline is set to its default configuration and is implemented with K 1-step bootstrapped heads. Fig. 6 (b) illustrates the evaluation curves of the above configurations. In most cases, it is observed that for both MB-DQN and bootstrapped DQN, more bootstrapped heads lead to better performance. It is worth noticing that the proposed MB-DQN trained with two bootstrapped heads outperforms the baseline trained with ten bootstrapped heads in three of the four games. This fact shows the significance and advantage of using a mixture of multi-step returns in the bootstrapped heads. On the other hand, MB-DQN's performance drops in Breakout as the number of the bootstrapped heads K becomes less than that of the baseline. This is caused by the fact that the 3-Step Heads perform worse than the 1-Step Heads in Breakout, as described in Section 4.1. As a result, a smaller K strengthens the negative influence caused by the 3-Step Heads, causing MB-DQN to become sensitive to the undesirable performance of certain bootstrapped heads. The above results thus suggest that an appropriate value of K has to be selected in order to maintain the advantages as well as the performance of MB-DQN.

5. CONCLUSION

In this paper, we proposed MB-DQN for combining and leveraging the advantages of different step return targets using multiple bootstrapped heads. Instead of unifying different step return targets into a single target value, MB-DQN assigns a distinct backup length to each bootstrapped head. This allows MB-DQN to offer heterogeneity in the target values during its update procedure, and enables a DRL agent to have diversified exploration behaviors. In our experiments, we first provided motivational examples to demonstrate the influence of different configurations of backup lengths in a simple maze environment. We then evaluated the proposed MB-DQN methodology on a number of Atari 2600 environments both quantitatively and qualitatively, and validated that MB-DQN is able to outperform a number of baseline methods with different configurations of backup lengths. Finally, we presented a set of ablation studies to examine the impacts of different design configurations for MB-DQN.

This is due to the fact that the visualization approach proposed in (Greydanus et al., 2018) is only a way to interpret the behaviors of the agents, and is thus unable to reflect their full behaviors. In the following paragraphs, we provide further discussions of the three games in which the attention regions of the 1-Step Heads and the 3-Step Heads are different in MB-DQN.

Frostbite. The 1-Step Heads keep focusing on the scoreboard, which might be a sign of overfitting. This might be the reason for the poor performance of the 1-Step Heads in Table A3.

Alien. In order to achieve a high score in Alien, the agent has to learn to dodge the attacks from the aliens, survive, and then destroy the alien eggs laid in the hallways. It seems that the 1-Step Heads concentrate on both the eggs and the monster with a similar extent of attention (i.e., the red colors of the attention regions are highlighted with similar magnitudes), which might cause the agent to be overly greedy in destroying the eggs, leading to its death and the undesirable performance in Table A3.

Amidar. From Fig. A1 (f), it is observed that the 3-Step Heads seem to pay more attention to the monsters than the 1-Step Heads. The 3-Step Heads can easily walk around the grid to obtain higher rewards without being attacked by the monsters. On the contrary, the 1-Step Heads merely concentrate on the character.

For these three games, the attention areas of the Majority Heads cover the areas from both the 1-Step Heads and the 3-Step Heads, allowing MB-DQN to outperform the two baselines All-1-Step (Baseline) and All-3-Step (Baseline) in Table A3. These examples again validate that MB-DQN can leverage the benefits from different backup lengths, as discussed in Section 4.1 of the main manuscript.

We further inspect in detail three different configurations of backup lengths for the bootstrapped heads in MB-DQN, including All-1-Step (Baseline), Mixed-1-3-Step, and Mixed-1-3-5-Step. The learning curves of these three configurations are presented in Fig. A3. It can be observed that the agents trained with a larger upper bound of backup lengths (i.e., Mixed-1-3-5-Step) learn faster in three out of the four games, including Qbert, Seaquest, and Freeway. An observation from the results is that the agents suffer from performance drops in Seaquest and Breakout under the Mixed-1-3-5-Step setting. This is consistent with the implication revealed in Fig. 3 and Section 4.1 of the main manuscript that longer backup lengths might not necessarily bring a beneficial impact on the learning process of the value function. These observations suggest that the optimal configuration of the bootstrapped heads and the upper bound of the backup length are still challenging issues to be investigated in the future.

A4 COMPUTING INFRASTRUCTURE

In this section, we provide the configuration of our computing infrastructure in Table A2 for reference.



For all of our experiments, the MB-DQN and the bootstrapped DQN agents are evaluated every 250k timesteps based on the results voted by the majority of their bootstrapped heads. The evaluation curves are averaged from three random seeds, and are drawn with 68% confidence interval, illustrated as the shaded regions.



Figure 1: A visualization of the behaviors of the agents with different backup lengths. (a) presents the layout of the maze environment (denoted as Dense Maze), which contains a starting grid and a goal grid. (b) and (c) illustrate the behaviors of the agents updated using 1-step return and 5-step return, respectively. It is observed that the agent trained with 5-step return reaches the goal through shorter diagonal trajectories, while the agent trained with 1-step return explores more grids in the early stage.

Figure 3: Comparison of the evaluation curves of MB-DQN and the baselines in eight Atari games.

Figure 5: The evaluation curves of (a) comparison of different configurations of data generation agents and learning-only agents for validating the quality of data samples collected by MB-DQN in Section 4.2, and (b) comparison between the single λ-target strategy adopted by DQN (λ) (Daley & Amato, 2019) and the multiple bootstrapped targets strategy adopted by MB-DQN in Section 4.3.

Figure 6: Impacts of (a) different configurations of step returns for the bootstrapped heads, and (b) different numbers of the bootstrapped heads on the proposed MB-DQN for four different Atari games.

Figure A1: Visualization of the agents' attention areas (rendered in red) for six Atari games.

Figure A2: Curves of four more Atari games for (a) different configurations of step returns for the bootstrapped heads, and (b) different numbers of the bootstrapped heads on the proposed MB-DQN.

A1 ADDITIONAL BACKGROUND MATERIAL

In this section, we provide additional background materials related to our work. We first introduce the concept of DQN (λ) (Daley & Amato, 2019), which is compared in Section 4.3. Next, we explain the generation method of the attention maps (Greydanus et al., 2018) used for the qualitative comparisons in Section 4.1 of the main manuscript.

A1.1 DQN (λ)

DQN (λ) (Daley & Amato, 2019) incorporates the concept of λ-return into DQN by modifying the replay buffer, with an aim to reduce the computation time required for deriving λ-returns. The replay buffer is modified to store the λ-return R^λ_t at timestep t along with its corresponding transition. The value of R^λ_t is computed in a recursive fashion, allowing repeated computations of λ-returns to be reduced when the same transitions are sampled multiple times. The recursive rule of R^λ_t is thus expressed as follows:

$$R^{\lambda}_{t} = R^{1}_{t} + \gamma \lambda \left[ R^{\lambda}_{t+1} - \max_{a \in A} Q(s_{t+1}, a) \right],$$

where $R^{1}_{t} = r_t + \gamma \max_{a \in A} Q(s_{t+1}, a)$ is the one-step return target at timestep t, Q is the state-action value function, γ is the discount factor, and s_{t+1} is the next state. In order to refresh outdated λ-returns caused by Q-function updates, DQN (λ) introduces a mechanism which periodically samples random intervals of consecutive transitions from the experience replay buffer, and stores them into another cache memory. During each update, the transitions in the cache are refreshed by the current Q-function. The agent then samples batches of training data from the relatively smaller cache using a prioritized sampling mechanism similar to the one used by the prioritized replay buffer (Schaul et al., 2016) (i.e., the larger the TD error of a transition is, the more likely it is to be sampled). DQN (λ) has been validated with two different λ-return estimators, Peng's Q (λ) (Peng & Williams, 1994) and Watkins's Q (λ) (Watkins, 1989). The difference between the two estimators is that Watkins's Q (λ) terminates the calculation of the λ-return whenever an exploratory action is taken. In our experiments, we adopt the configuration that uses Peng's Q (λ), since it has been demonstrated in (Daley & Amato, 2019) that Peng's Q (λ) is superior to Watkins's Q (λ).
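A minimal sketch of this recursive λ-return computation over a cached interval of consecutive transitions is given below, assuming the Peng-style recursion stated above; the q_function interface and the data layout are illustrative assumptions.

```python
def compute_lambda_returns(rewards, next_states, dones, q_function, gamma, lam):
    """Compute R^lambda_t for a contiguous interval of transitions, backwards in time."""
    T = len(rewards)
    lambda_returns = [0.0] * T
    for t in reversed(range(T)):
        max_q = 0.0 if dones[t] else max(q_function(next_states[t]))
        one_step = rewards[t] + gamma * max_q                  # R^1_t
        if t == T - 1 or dones[t]:
            lambda_returns[t] = one_step                       # no future return to blend in
        else:
            # R^lambda_t = R^1_t + gamma * lambda * (R^lambda_{t+1} - max_a Q(s_{t+1}, a))
            lambda_returns[t] = one_step + gamma * lam * (lambda_returns[t + 1] - max_q)
    return lambda_returns
```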

A1.2 SALIENCY MAP GENERATION METHODOLOGY OF THE ATTENTION MAPS FOR MB-DQN

In order to perform the qualitative analysis for comparing MB-DQN with bootstrapped DQN in Section 4.1, we utilize the methodology proposed in (Greydanus et al., 2018) to construct perturbation-based saliency maps to highlight what an agent actually perceives and focuses on from its observation. This information offers a way to understand the learning procedure of the agent, and enables us to validate that bootstrapped heads with different backup lengths may concentrate on different subjects in an agent's observation. Given an image I, the method first generates a perturbed image I′ defined as:

$$I'(i, j) = I \odot \left(1 - M(i, j)\right) + A(I, \sigma_A) \odot M(i, j),$$

where M(i, j) is the mask centered at the coordinate (i, j) of I, and A is the Gaussian blur operator with a standard deviation σ_A. The aim of the perturbation process is to add uncertainty, in order to make the agent more uncertain about the area around (i, j). The saliency map S is then defined as the L2-norm difference between the estimated values of I before and after each perturbation process:

$$S(i, j) = \frac{1}{2} \left\| V(I) - V\big(I'(i, j)\big) \right\|^{2}.$$

The L2-norm difference serves as a measure to reflect the extent of attention of the agent around region (i, j). The saliency map S can be added to one of the three color channels of the original image I to visualize the attention map of the agent. In order to construct saliency maps for MB-DQN, the methodology discussed above is extended to a bootstrapped version, in which all the bootstrapped heads that participate in the final decision of a taken action (denoted as a_voting) are taken into consideration (i.e., the bootstrapped heads contributing to the highest votes in the majority voting procedure). The saliency measure of each bootstrapped head k ∈ {1, ..., K} (where K is the total number of heads) is then defined as S_k, and the total saliency map S_bootstrapped for MB-DQN is thus defined as:

$$S_{\text{bootstrapped}} = \sum_{k \,:\, \arg\max_a Q_k(s, a) = a_{\text{voting}}} S_k.$$

A2 ADDITIONAL DETAILS OF THE EXPERIMENTAL SETUP

In this section, we provide additional training details of our experiments. The agents are evaluated based on the average scores of ten test episodes every 250k timesteps. During the evaluation phase, both the proposed MB-DQN and bootstrapped DQN use a majority vote policy to decide the action to be taken. In other words, every bootstrapped head predicts its own action for each input state, while the action taken by the agent is determined by the highest vote from all of the bootstrapped heads. The experimental results presented in this paper are generated based on three different random seeds, and the evaluation curves are drawn with a 68% confidence interval (i.e., one standard deviation) as the shaded areas. The detailed settings of the hyper-parameters in our experiments are listed in Table A1.
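The following is a minimal NumPy/SciPy sketch of the perturbation-based saliency computation described above. The agent.value interface, the stride, and the mask construction details are illustrative assumptions; this is a sketch of the technique in (Greydanus et al., 2018), not the exact implementation used for the figures in this paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_mask(shape, center, sigma):
    """2-D Gaussian mask M(i, j) with peak value 1 at `center`."""
    i, j = np.ogrid[:shape[0], :shape[1]]
    return np.exp(-((i - center[0]) ** 2 + (j - center[1]) ** 2) / (2.0 * sigma ** 2))

def saliency_map(agent, image, sigma=5, stride=5):
    """Perturbation-based saliency map S for a single observation `image`."""
    if image.ndim == 3:
        blurred = gaussian_filter(image, sigma=(sigma, sigma, 0))  # blur spatially only
    else:
        blurred = gaussian_filter(image, sigma=sigma)
    base_value = agent.value(image)                 # value estimate of the unperturbed image
    S = np.zeros(image.shape[:2])
    for i in range(0, image.shape[0], stride):
        for j in range(0, image.shape[1], stride):
            M = gaussian_mask(image.shape[:2], (i, j), sigma)
            if image.ndim == 3:
                M = M[..., None]                    # broadcast the mask over color channels
            perturbed = image * (1.0 - M) + blurred * M
            # squared difference of the value estimates before and after the perturbation
            S[i, j] = 0.5 * np.sum((base_value - agent.value(perturbed)) ** 2)
    return S
```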

A2.1 THE HYPERPARAMETERS FOR TRAINING MB-DQN

Table A1 summarizes the hyper-parameters of MB-DQN, Bootstrapped DQN (Osband et al., 2016) , and DQN(λ) (Daley & Amato, 2019) , including the training parameters and their configurations, followed by the input sizes for the agents.

A2.2 NETWORK STRUCTURE

In this section, we explain the network structure used in our experiments. The shared convolutional neural network depicted in Fig. 2 of the main manuscript consists of three convolutional layers with 32, 64, and 64 filters, which is the same as the configuration used in DQN (Mnih et al., 2015), Bootstrapped DQN (Osband et al., 2016), and DQN (λ) (Daley & Amato, 2019). The output of the last convolutional layer is then fed into K distinct heads, where each head is implemented as a fully-connected (FC) layer with 512 units, followed by another FC layer that predicts a Q-value for each action. The default activation function is set to ReLU.
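A minimal PyTorch sketch of this architecture is given below; the filter counts and head sizes follow the text, while the kernel sizes, strides, and 84x84 input resolution are assumed to follow the standard DQN configuration referenced there. The authors' implementation uses TensorFlow/RLTF, so this is purely illustrative.

```python
import torch
import torch.nn as nn

class MultiHeadQNetwork(nn.Module):
    def __init__(self, in_channels=4, num_actions=18, num_heads=10):
        super().__init__()
        # shared convolutional trunk with 32, 64, and 64 filters
        self.shared = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=8, stride=4), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=4, stride=2), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, stride=1), nn.ReLU(),
            nn.Flatten(),
        )
        conv_out = 64 * 7 * 7  # for 84x84 inputs under the standard DQN preprocessing
        # K bootstrapped heads: FC(512) + ReLU, then FC(num_actions)
        self.heads = nn.ModuleList([
            nn.Sequential(nn.Linear(conv_out, 512), nn.ReLU(), nn.Linear(512, num_actions))
            for _ in range(num_heads)
        ])

    def forward(self, x):
        features = self.shared(x)
        # returns one Q-value vector per head: shape (num_heads, batch, num_actions)
        return torch.stack([head(features) for head in self.heads])
```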

A3 ADDITIONAL EXPERIMENTAL RESULTS

In this section, we provide and discuss additional experimental results evaluated on Atari 2600 games in Section A3.1, as well as additional sets of ablation analyses of the backup lengths in Section A3.3. The evaluation results are summarized in Table A3, where these methods are similarly denoted as All-1-Step (Baseline) and All-3-Step (Baseline).

A5 REPRODUCIBILITY

We implemented the proposed MB-DQN based on the RLTF framework (Nikolov, 2018), which is a research framework that provides high-quality implementations of common RL algorithms based on the TensorFlow development platform. We modified the source code of RLTF and added MB-DQN as an additional option in its DQN family. All the experiments presented in our paper are reproducible with easy-to-follow instructions. For more details about our source code, please refer to the anonymous GitHub repository at the following link: https://github.com/Anonymous-Source-Code/MB-DQN.

Table A3: Comparison of the evaluation results of MB-DQN and the baselines in 33 Atari games.

