OPTIMISTIC EXPLORATION WITH BACKWARD BOOTSTRAPPED BONUS FOR DEEP REINFORCEMENT LEARNING

Abstract

Optimism in the face of uncertainty is a principled approach for provably efficient exploration in reinforcement learning in tabular and linear settings. However, it remains challenging to turn this principle into practical exploration algorithms for Deep Reinforcement Learning (DRL). To address this problem, we propose an Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL, which follows both the optimism and the posterior sampling principles. OEB3 is built on bootstrapped deep Q-learning, a non-parametric posterior sampling method for temporally-extended exploration. On top of such temporally-extended exploration, we construct a UCB-bonus that quantifies the uncertainty of the Q-functions. The UCB-bonus is then used to estimate an optimistic Q-value, which encourages the agent to explore scarcely visited states and actions in order to reduce uncertainty. When estimating the Q-function, we adopt an episodic backward update strategy to propagate the future uncertainty to the estimated Q-function consistently. Extensive evaluations show that OEB3 outperforms several state-of-the-art exploration approaches in MNIST maze and 49 Atari games.

1. INTRODUCTION

In Reinforcement learning (RL) (Sutton & Barto, 2018), formalized by the Markov decision process (MDP), an agent aims to maximize the long-term reward by interacting with an unknown environment. The agent takes actions according to the knowledge accumulated from experience, which leads to the fundamental exploration-exploitation dilemma. An agent may choose the best decision given current information or acquire more information by exploring poorly understood states and actions. Exploring the environment may sacrifice immediate rewards but potentially improves future performance. The exploration strategy is thus crucial for the RL agent to find the optimal policy. Theoretical RL offers various provably efficient exploration methods in tabular and linear MDPs, built on the basic value iteration algorithm: least-squares value iteration (LSVI). Optimism in the face of uncertainty (Auer & Ortner, 2007; Jin et al., 2018) is a principled approach. In tabular cases, optimism-based methods incorporate upper confidence bounds (UCB) into the value functions as bonuses and attain the optimal worst-case regret (Azar et al., 2017; Jaksch et al., 2010; Dann & Brunskill, 2015). Randomized value functions based on posterior sampling choose actions according to randomly sampled, statistically plausible value functions and are known to achieve near-optimal worst-case and Bayesian regrets (Osband & Van Roy, 2017; Russo, 2019). Recently, the theoretical analyses in tabular cases have been extended to linear MDPs, where the transition and the reward function are assumed to be linear. In linear cases, optimistic LSVI (Jin et al., 2020) attains a near-optimal worst-case regret by using a designed bonus and is provably efficient. Randomized LSVI (Zanette et al., 2020) also attains a near-optimal worst-case regret.
Although the analyses in tabular and linear cases provide attractive approaches for efficient exploration, these principles remain challenging to turn into practical exploration algorithms for Deep Reinforcement Learning (DRL) (Mnih et al., 2015), which achieves human-level performance in large-scale tasks such as Atari games and robotic tasks. For example, in linear cases, the bonus in optimistic LSVI (Jin et al., 2020) and the nontrivial noise in randomized LSVI (Zanette et al., 2020) are tailored to linear models (Abbasi-Yadkori et al., 2011) and are incompatible with practically powerful function approximators such as neural networks. To address this problem, we propose the Optimistic Exploration algorithm with Backward Bootstrapped Bonus (OEB3) for DRL. OEB3 is an instantiation of optimistic LSVI (Jin et al., 2020) in DRL that uses a general-purpose UCB-bonus to provide an optimistic Q-value and a randomized value function to perform temporally-extended exploration. We propose a UCB-bonus that represents the disagreement of bootstrapped Q-functions (Osband et al., 2016) to measure the epistemic uncertainty of the unknown optimal value function. Adding the UCB-bonus to the Q-value yields an optimistic Q⁺-function that is higher than Q for scarcely visited state-action pairs and remains close to Q for frequently visited ones. The optimistic Q⁺-function encourages the agent to explore states and actions with high UCB-bonuses, which signify scarcely visited areas or meaningful events in completing a task. We propose an extension of episodic backward update (Lee et al., 2019) to propagate the future uncertainty to the estimated action-value function consistently within an episode. The backward update also enables OEB3 to perform highly sample-efficient training. Compared to existing count-based and curiosity-driven exploration methods (Taiga et al., 2020), OEB3 has several benefits.
(1) We utilize intrinsic rewards to produce optimistic value function and also take advantage of bootstrapped Q-learning to perform temporally-consistent exploration while existing methods do not combine these two principles. (2) The UCB-bonus measures the disagreement of Q-values, which considers the long-term uncertainty in an episode rather than the single-step uncertainty used in most bonus-based methods (Pathak et al., 2019; Burda et al., 2019b) . Meanwhile, the UCB-bonus is computed without introducing additional modules compared to bootstrapped DQN. (3) We provide a theoretical analysis showing that OEB3 has theoretical consistency with optimistic LSVI in linear cases. (4) Extensive evaluations show that OEB3 outperforms several state-of-the-art exploration approaches in MNIST maze and 49 Atari games.

2. BACKGROUND

In this section, we introduce bootstrapped DQN (Osband et al., 2016), which we utilize in OEB3 for temporally-extended exploration. We further introduce optimistic LSVI (Jin et al., 2020), which we instantiate in DRL to obtain OEB3.

2.1. BOOTSTRAPPED DQN

We consider an episodic MDP represented as (S, A, T, P, r), where $T \in \mathbb{Z}^+$ is the episode length, S is the state space, A is the action space, r is the reward function, and P is the unknown dynamics. In each timestep, the agent observes the current state $s_t$, takes an action $a_t$, receives a reward $r_t$, and transitions to the next state $s_{t+1}$. The action-value function $Q^\pi(s_t, a_t) := \mathbb{E}_\pi[\sum_{i=t}^{T-1} \gamma^{i-t} r_i]$ represents the expected cumulative reward obtained by starting from state $s_t$, taking action $a_t$, and thereafter following policy $\pi(a_t \mid s_t)$ until the end of the episode, where $\gamma \in [0, 1)$ is the discount factor. The optimal value function is $Q^* = \max_\pi Q^\pi$, and the optimal action is $a^* = \arg\max_{a \in A} Q^*(s, a)$. Deep Q-Network (DQN) uses a deep neural network with parameters θ to approximate the Q-function. The loss function takes the form $L(\theta) = \mathbb{E}[(y_t - Q(s_t, a_t; \theta))^2 \mid (s_t, a_t, r_t, s_{t+1}) \sim D]$, where $y_t = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$ is the target value and $\theta^-$ is the parameter of the target network. The agent accumulates experiences $(s_t, a_t, r_t, s_{t+1})$ in a replay buffer D and samples mini-batches in training. Bootstrapped DQN (Osband et al., 2016; 2018) is a non-parametric posterior sampling method that maintains K estimates of the Q-value to represent the posterior distribution of a randomized value function. Bootstrapped DQN uses a multi-head network that contains a shared convolutional network and K heads, each defining a $Q^k$-function. Bootstrapped DQN diversifies the $Q^k$ by using different random initializations and individual target networks. The loss for training $Q^k$ is
$$L(\theta^k) = \mathbb{E}\Big[\big(r_t + \gamma \max_{a'} Q^k(s_{t+1}, a'; \theta^{k-}) - Q^k(s_t, a_t; \theta^k)\big)^2 \mid (s_t, a_t, r_t, s_{t+1}) \sim D\Big]. \quad (1)$$
The k-th head $Q^k(s, a; \theta^k)$ is trained with its own target network $Q^k(s, a; \theta^{k-})$.
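The per-head target in Eq. (1) can be sketched in a small tabular setting. This is a minimal illustration only: the array shapes and the names `q`, `q_target`, `bootstrapped_targets`, and `head_loss` are assumptions for the sketch, not part of the paper's implementation.

```python
import numpy as np

# Tabular stand-in for a K-head bootstrapped Q-network (Eq. (1)).
rng = np.random.default_rng(0)
K, n_states, n_actions, gamma = 4, 5, 3, 0.99

q = rng.normal(size=(K, n_states, n_actions))         # K online heads, theta^k
q_target = rng.normal(size=(K, n_states, n_actions))  # K target heads, theta^{k-}

def bootstrapped_targets(r, s_next, q_target, gamma):
    """Per-head target y^k = r + gamma * max_a' Q^k(s', a'; theta^{k-})."""
    return r + gamma * q_target[:, s_next, :].max(axis=1)  # shape (K,)

def head_loss(k, s, a, y, q):
    """Squared TD error of head k for one transition, as in Eq. (1)."""
    return (y[k] - q[k, s, a]) ** 2

y = bootstrapped_targets(r=1.0, s_next=2, q_target=q_target, gamma=gamma)
losses = np.array([head_loss(k, s=0, a=1, y=y, q=q) for k in range(K)])
```

Each head regresses onto its own target network, which is what keeps the K estimates diverse.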
If the k-th head is sampled at the start of an episode when interacting with the environment, the agent follows $Q^k$ to choose actions for the whole episode, which provides temporally-consistent exploration for DRL.

Algorithm 1 Optimistic LSVI in linear MDP
1: Initialize: $\Lambda_t \leftarrow \lambda \cdot I$ and $w_t \leftarrow 0$
2: for episode m = 0 to M-1 do
3:   Receive the initial state $s_0$
4:   for step t = 0 to T-1 do
5:     Take action $a_t = \arg\max_{a \in A} Q_t(s_t, a)$ and observe $s_{t+1}$
6:   end for
7:   for step t = T-1 to 0 do
8:     $\Lambda_t \leftarrow \sum_{\tau=0}^{m} \phi(x^\tau_t, a^\tau_t)\phi(x^\tau_t, a^\tau_t)^\top + \lambda \cdot I$
9:     $w_t \leftarrow \Lambda_t^{-1} \sum_{\tau=0}^{m} \phi(x^\tau_t, a^\tau_t)\big[r_t(x^\tau_t, a^\tau_t) + \max_{a} Q_{t+1}(x^\tau_{t+1}, a)\big]$
10:    $Q_t(\cdot, \cdot) = \min\{w_t^\top \phi(\cdot, \cdot) + \alpha[\phi(\cdot, \cdot)^\top \Lambda_t^{-1} \phi(\cdot, \cdot)]^{1/2},\ T\}$
11:  end for
12: end for

2.2. OPTIMISTIC LSVI

Optimistic LSVI (Jin et al., 2020) uses an optimistic Q-value with LSVI in linear MDPs. We denote the feature map of the state-action pair by $\phi: S \times A \to \mathbb{R}^d$. The transition kernel and the reward function are assumed to be linear in φ. Optimistic LSVI, shown in Algorithm 1, consists of two parts. In the first part (lines 3-6), the agent executes the policy according to $Q_t$ for an episode. In the second part (lines 7-11), the parameter $w_t$ of the Q-function is updated in closed form by solving the regularized least-squares problem
$$w_t \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{\tau=0}^{m} \Big[r_t(s^\tau_t, a^\tau_t) + \max_{a \in A} Q_{t+1}(s^\tau_{t+1}, a) - w^\top \phi(s^\tau_t, a^\tau_t)\Big]^2 + \lambda \|w\|^2, \quad (2)$$
where m is the number of episodes and τ is the episodic index. The least-squares problem has the explicit solution $w_t = \Lambda_t^{-1} \sum_{\tau=0}^{m} \phi(x^\tau_t, a^\tau_t)\big[r_t(x^\tau_t, a^\tau_t) + \max_a Q_{t+1}(x^\tau_{t+1}, a)\big]$ (line 9), where $\Lambda_t$ is the Gram matrix. The value function is then estimated by $Q_t(s, a) \approx w_t^\top \phi(s, a)$. Optimistic LSVI uses a bonus $\alpha[\phi(s, a)^\top \Lambda_t^{-1} \phi(s, a)]^{1/2}$ (line 10) to measure the uncertainty of state-action pairs. Intuitively, we can regard $u := (\phi^\top \Lambda_t^{-1} \phi)^{-1}$ as a pseudo-count of the state-action pair, obtained by projecting the total observed features ($\Lambda_t$) onto the direction of the corresponding feature φ. Thus, the bonus $\alpha / \sqrt{u}$ represents the uncertainty along the direction of φ. By adding the bonus to the Q-value, we obtain an optimistic value function Q⁺, which serves as an upper bound of Q and encourages exploration. The bonus in each step is propagated from the end of the episode by the backward update of the Q-value (lines 7-11), which follows the principle of dynamic programming.

Theoretical analysis (Jin et al., 2020) shows that optimistic LSVI achieves a near-optimal worst-case regret of $\tilde{O}(\sqrt{d^3 T^3 L})$ with proper selections of α and λ, where L is the total number of steps.
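The closed-form update and bonus of Algorithm 1 (lines 8-10) can be sketched with synthetic features. The features `phi`, the regression `targets`, and the hyperparameter values below are illustrative assumptions; `T_len` plays the role of the truncation constant T in line 10.

```python
import numpy as np

# One backward step of optimistic LSVI on synthetic linear-MDP features.
rng = np.random.default_rng(1)
d, m, lam, alpha, T_len = 3, 20, 1.0, 0.5, 10.0

phi = rng.normal(size=(m, d))   # features phi(x^tau_t, a^tau_t) over m episodes
targets = rng.normal(size=m)    # r_t(x^tau_t, a^tau_t) + max_a Q_{t+1}(x^tau_{t+1}, a)

# Gram matrix and regularized least-squares solution (lines 8-9).
Lambda = phi.T @ phi + lam * np.eye(d)
w = np.linalg.solve(Lambda, phi.T @ targets)

def optimistic_q(phi_sa):
    """Line 10: Q_t = min(w^T phi + alpha * sqrt(phi^T Lambda^{-1} phi), T)."""
    bonus = alpha * np.sqrt(phi_sa @ np.linalg.solve(Lambda, phi_sa))
    return min(w @ phi_sa + bonus, T_len)

q_val = optimistic_q(np.ones(d))
```

Because $\Lambda_t$ is positive definite, the quadratic form under the square root is always positive, so the bonus is well defined for any feature direction.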

3. PROPOSED METHOD

Optimistic LSVI (Jin et al., 2020) provides an attractive approach for efficient exploration. Nevertheless, developing a practical exploration algorithm for DRL is challenging, since (i) the UCB-bonus utilized by optimistic LSVI is tailored for linear MDPs, and (ii) optimistic LSVI utilizes a backward update of Q-functions (lines 7-11 in Alg. 1) to aggregate uncertainty. To this end, we propose the following approaches, which are the building blocks of OEB3.

• We propose a general-purpose UCB-bonus for optimistic exploration. More specifically, we utilize bootstrapped DQN to construct a general-purpose UCB-bonus, which is theoretically consistent with optimistic LSVI for linear MDPs. We refer to Section 3.1 for the details.

• We propose a sample-efficient learning algorithm that integrates bootstrapped DQN and the UCB-bonus into the backward update, which faithfully follows the principle of dynamic programming. More specifically, we extend Episodic Backward Update (EBU) (Lee et al., 2019) from ordinary Q-learning to bootstrapped Q-learning, which we abbreviate as BEBU (Bootstrapped EBU). BEBU allows sample-efficient learning and fast, consistent propagation of the future uncertainty to the estimated Q-value. We further propose OEB3 by combining BEBU and the UCB-bonus obtained via bootstrapped Q-learning. We refer to Section 3.2 for the details.

3.1. UCB-BONUS IN OEB3

We define the UCB-bonus as the disagreement of the bootstrapped Q-functions,
$$B(s_t, a_t) := \sigma\big(Q^k(s_t, a_t)\big) = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\big(Q^k(s_t, a_t) - \bar{Q}(s_t, a_t)\big)^2}, \quad (3)$$
where $\bar{Q}(s_t, a_t)$ is the mean of the bootstrapped Q-values. A similar measurement was first used in Chen et al. (2017). We further establish a connection between the UCB-bonus defined in Eq. (3) and the bonus in optimistic LSVI.

Theorem 1 (Informal Version of Theorem 2). With linear function approximation, the UCB-bonus $B(s_t, a_t)$ in OEB3 is equivalent to the bonus term $[\phi_t^\top \Lambda_t^{-1} \phi_t]^{1/2}$ in optimistic LSVI, where $\Lambda_t \leftarrow \sum_{\tau=0}^{m} \phi(x^\tau_t, a^\tau_t)\phi(x^\tau_t, a^\tau_t)^\top + \lambda \cdot I$ and m is the current episode.
We refer to Appendix A for the proof and a detailed discussion. The UCB-bonus $B(s_t, a_t)$ defined in Eq. (3) is desirable for exploration in DRL for the following reasons.

• Bootstrapped DQN is a non-parametric posterior sampling method, which can be implemented via deep neural networks (Osband et al., 2019).

• The UCB-bonus $B(s_t, a_t)$ defined in Eq. (3) adequately quantifies the epistemic uncertainty of a specific state-action pair. More specifically, due to the non-convex nature of neural network optimization and the independent initializations, if $(s_t, a_t)$ is scarcely visited, the UCB-bonus obtained via bootstrapped DQN is high. Moreover, the UCB-bonus converges to zero asymptotically as the sample size increases to infinity.

• The UCB-bonus is computed in a batch when performing experience replay, which is more efficient than other optimistic methods that change the action-selection scheme in each timestep (Chen et al., 2017; Nikolov et al., 2019).

The optimistic Q⁺ is obtained by summing the UCB-bonus and the estimated Q-function:
$$Q^+(s_t, a_t) := Q(s_t, a_t) + \alpha B(s_t, a_t).$$
We use a simple regression task with neural networks to illustrate the proposed UCB-bonus, as shown in Figure 1. We use 20 neural networks with the same architecture to solve the same regression problem. Each network contains three residual blocks with two fully-connected layers each. According to Osband et al. (2016), the difference between the fitted networks results from their different initializations. For a single input x, the networks yield different estimates $\{g_i(x)\}_{i=1}^{20}$, as shown in Figure 1. Figure 1(a) shows that the estimates $\{g_i(x)\}_{i=1}^{20}$ behave similarly in the region with rich observations, resulting in small disagreement, but vary in the region with scarce observations, resulting in large disagreement.
In Figure 1(b), we illustrate the confidence bounds of the regression results, $\bar{g}(x) \pm \sigma(g_i(x))$ and $\bar{g}(x) \pm 2\sigma(g_i(x))$, where $\bar{g}(x)$ and $\sigma(g_i(x))$ are the mean and standard deviation of the estimates. The standard deviation $\sigma(g_i(x))$ captures the epistemic uncertainty of the regression results. Figure 1(c) shows the optimistic estimate $g^+(x) = \bar{g}(x) + \sigma(g_i(x))$, obtained by adding the uncertainty measured by the standard deviation. The optimistic estimate $g^+$ is close to $\bar{g}$ in the region with rich observations and higher than $\bar{g}$ in the region with scarce observations. In DRL, the bootstrapped Q-functions $\{Q^k(s_t, a_t)\}_{k=1}^{K}$ obtained by fitting the target Q-function behave similarly to $\{g_i(x)\}_{i=1}^{20}$ in the regression task. A higher UCB-bonus $B(s_t, a_t) := \sigma(Q^k(s_t, a_t))$ indicates a higher epistemic uncertainty of the action-value function at $(s_t, a_t)$. The estimated Q-function augmented with the UCB-bonus yields Q⁺, which produces optimistic estimates for novel state-action pairs and behaves similarly to the target Q-function in areas the agent has explored well. Hence, the optimistic estimate Q⁺ incentivizes the agent to explore potentially informative states and actions efficiently.
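The regression illustration above can be reproduced in miniature. As an assumption for this sketch, K random-feature ridge regressors stand in for the bootstrapped heads (the paper uses residual networks), and the disagreement of Eq. (3) is the standard deviation of their predictions; all names and the toy target function are illustrative.

```python
import numpy as np

# Ensemble disagreement as a UCB-bonus: K regressors with different random
# features fit the same data; their std is small on-data and large off-data.
rng = np.random.default_rng(0)
K, n_features, lam = 20, 64, 1e-3

x_train = np.linspace(-2.0, 2.0, 40)   # region with rich observations
y_train = np.sin(2.0 * x_train)

def fit_member(seed):
    """One ensemble member: random ReLU features + ridge regression."""
    r = np.random.default_rng(seed)
    w = r.normal(size=n_features)
    b = r.uniform(-2.0, 2.0, size=n_features)
    feats = lambda x: np.maximum(0.0, np.outer(np.atleast_1d(x), w) + b)
    F = feats(x_train)
    theta = np.linalg.solve(F.T @ F + lam * np.eye(n_features), F.T @ y_train)
    return lambda x: float(feats(x) @ theta)

members = [fit_member(s) for s in range(K)]

def ucb_bonus(x):
    """B(x) = std of ensemble predictions, as in Eq. (3)."""
    preds = np.array([g(x) for g in members])
    return preds.std()

bonus_in = ucb_bonus(0.0)    # inside the training region: small disagreement
bonus_out = ucb_bonus(6.0)   # far outside the training region: large disagreement
```

All members fit the observed data well, so they agree near the data; off the data, each member extrapolates according to its own random features, so the disagreement (and hence the bonus) grows.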

3.2. UNCERTAINTY BACKWARD IN OEB3

There are two major reasons for updating the action-value function through Bootstrapped Episodic Backward Update (BEBU) in OEB3.

• The backward propagation utilizes a complete trajectory from the replay buffer. Such an approach allows OEB3 to infer the long-term effect of decision making from the replay buffer. In contrast, DQN and bootstrapped DQN utilize the replay buffer by randomly sampling one-step transitions, which loses the information containing such a long-term effect (Lee et al., 2019).

• The backward propagation is required to propagate the future uncertainty to the estimated action-value function consistently via the UCB-bonus within an episode. For instance, let $t_2 > t_1$ be the indices of two steps in an episode. If the update of $Q_{t_2}$ occurs after the update of $Q_{t_1}$ within an episode, then the uncertainty propagated into $Q_{t_1}$ may be inconsistent with the uncertainty that $Q_{t_2}$ contains.

To integrate the UCB-bonus into bootstrapped Q-learning, we propose a novel Q-target that adds the bonus term to both the immediate reward and the next-Q value; the proposed Q-target needs to be suitable for BEBU in training. Formally, the Q-target for updating $Q^k$ is defined as
$$y^k_t := r(s_t, a_t) + \alpha_1 B(s_t, a_t; \theta) + \gamma\Big[Q^k(s_{t+1}, a'; \theta^{k-}) + \alpha_2 \mathbb{1}_{a' \neq a_{t+1}} \tilde{B}^k(s_{t+1}, a'; \theta^-)\Big], \quad (5)$$
where $a' = \arg\max_a Q^k(s_{t+1}, a; \theta^{k-})$. The choice of $a'$ is determined by the target Q-value without considering the bonus. The immediate reward is augmented by $B(s_t, a_t; \theta)$ with a factor $\alpha_1$, where the bonus B is computed by the bootstrapped Q-network with parameters θ. The next-Q value is augmented by $\mathbb{1}_{a' \neq a_{t+1}} \tilde{B}^k(s_{t+1}, a'; \theta^-)$ with a factor $\alpha_2$, where the bonus $\tilde{B}^k$ is computed by the target network with parameters $\theta^-$. We assign a different next-Q bonus $\tilde{B}^k$ to each head, since the choice of action $a'$ differs among the heads. Meanwhile, we assign the same immediate-reward bonus B to all heads.
We introduce the indicator function $\mathbb{1}_{a' \neq a_{t+1}}$ to suit the backward update of Q-values. More specifically, in the t-th step, the action-value function $Q^k$ has already been updated optimistically at the state-action pair $(s_{t+1}, a_{t+1})$ due to the backward update. Thus, we ignore the bonus of the next-Q value in the update of $Q^k$ when $a'$ is equal to $a_{t+1}$. We use an example to explain the process of the backward update. We store and sample experiences in episodes, and perform updates on episodes rather than on uniformly sampled transitions. Consider an episode containing three time steps, $(s_0, a_0) \to (s_1, a_1) \to (s_2, a_2)$. We update the Q-values of head k in the backward manner, namely $Q(s_2, a_2) \to Q(s_1, a_1) \to Q(s_0, a_0)$, from the end of the episode. The process is as follows.

1. First, we update $Q(s_2, a_2) \leftarrow r(s_2, a_2) + \alpha_1 B(s_2, a_2)$. In the last time step, there is no next-Q value to consider.

2. Then, $Q(s_1, a_1) \leftarrow [r(s_1, a_1) + \alpha_1 B(s_1, a_1)] + \gamma[Q(s_2, a') + \alpha_2 \mathbb{1}_{a' \neq a_2} \tilde{B}(s_2, a')]$ by following Eq. (5), where $a' = \arg\max_a Q(s_2, a)$. Since $Q(s_2, a_2)$ is updated optimistically in step 1, we ignore the bonus term $\tilde{B}$ in the next-Q value when $a' = a_2$. The UCB-bonus enters the update by adding B and $\tilde{B}$ to the immediate reward and the next-Q value, respectively.

3. The update of $Q(s_0, a_0)$ follows the same principle: $Q(s_0, a_0) \leftarrow [r(s_0, a_0) + \alpha_1 B(s_0, a_0)] + \gamma[Q(s_1, a') + \alpha_2 \mathbb{1}_{a' \neq a_1} \tilde{B}(s_1, a')]$, where $a' = \arg\max_a Q(s_1, a)$.

In practice, the episodic update typically leads to instability in DRL due to the strong correlation between consecutive transitions. Hence, we adopt a diffusion factor $\beta \in [0, 1]$ in BEBU to prevent instability, as in Lee et al. (2019). The Q-value is therefore computed as the weighted sum of the current value and the back-propagated estimate scaled by β.
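The three-step example above can be sketched literally for a single head. The tabular Q-values, bonuses, and episode below are illustrative assumptions; the diffusion factor is omitted here to match the example in the text.

```python
import numpy as np

# Scalar backward update with the indicator 1_{a' != a_{t+1}} for one head.
gamma, alpha1, alpha2 = 0.99, 0.1, 0.1
rng = np.random.default_rng(0)

Q = rng.normal(size=(3, 2))                 # Q-values of one head, 3 states x 2 actions
B = rng.uniform(size=(3, 2))                # UCB-bonus B(s, a), illustrative values
episode = [(0, 1, 0.0), (1, 0, 0.0), (2, 1, 1.0)]  # (s_t, a_t, r_t)

# Step 1: last transition, no next-Q term.
s2, a2, r2 = episode[2]
Q[s2, a2] = r2 + alpha1 * B[s2, a2]

# Steps 2-3: propagate optimism backward through the episode.
for t in (1, 0):
    s, a, r = episode[t]
    s_next, a_next, _ = episode[t + 1]
    a_star = int(np.argmax(Q[s_next]))      # greedy next action a'
    mask = 0.0 if a_star == a_next else 1.0  # ignore bonus when a' = a_{t+1}
    Q[s, a] = (r + alpha1 * B[s, a]
               + gamma * (Q[s_next, a_star] + alpha2 * mask * B[s_next, a_star]))
```

Because $(s_{t+1}, a_{t+1})$ was already updated optimistically in the previous backward step, the mask prevents the same uncertainty from being counted twice along the greedy path.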
We consider an episodic experience that contains T transitions, denoted by $E = \{S, A, R, S'\}$, where $S = \{s_0, \ldots, s_{T-1}\}$, $A = \{a_0, \ldots, a_{T-1}\}$, $R = \{r_0, \ldots, r_{T-1}\}$ and $S' = \{s_1, \ldots, s_T\}$. We initialize a Q-table $\tilde{Q} \in \mathbb{R}^{K \times |A| \times T}$ by $Q(\cdot; \theta^-)$ to store the next-Q values of all next states $S'$ and valid actions for the K heads, and initialize $y \in \mathbb{R}^{K \times T}$ to store the Q-targets for the K heads and T steps. We use the bootstrapped Q-network with parameters θ to compute the bonus $B = [B(s_0, a_0), \ldots, B(s_{T-1}, a_{T-1})]$ for the immediate reward, and use the target network with parameters $\theta^-$ to compute the bonus $\tilde{B}^k = [\tilde{B}^k(s_1, \tilde{a}_1), \ldots, \tilde{B}^k(s_T, \tilde{a}_T)]$ for the next-Q value in each head, where $\tilde{a}_t = \arg\max_a Q^k(s_t, a; \theta^{k-})$. The bonus vector $B \in \mathbb{R}^T$ is the same for all Q-heads, while $\tilde{B} \in \mathbb{R}^{K \times T}$ contains different values for different heads because the choices of $\tilde{a}_t$ differ. In the training of head k, we initialize the Q-target of the last step by $y[k, T-1] = R_{T-1} + \alpha_1 B_{T-1}$. We then perform a recursive backward update to obtain all Q-target values. The element $\tilde{Q}[k, a_{t+1}, t]$ for step t in head k is updated using its corresponding Q-target $y[k, t+1]$ with the diffusion factor as follows,
$$\tilde{Q}[k, a_{t+1}, t] \leftarrow \beta\, y[k, t+1] + (1 - \beta)\, \tilde{Q}[k, a_{t+1}, t].$$
Then, we update $y[k, t]$ for the previous time step based on the newly updated t-th column of $\tilde{Q}[k]$ as follows,
$$y[k, t] \leftarrow R_t + \alpha_1 B_t + \gamma\big(\tilde{Q}[k, a', t] + \alpha_2 \mathbb{1}_{a' \neq a_{t+1}} \tilde{B}[k, t]\big),$$
where $a' = \arg\max_a \tilde{Q}[k, a, t]$. In practice, we construct a matrix $\tilde{A} = \arg\max_a \tilde{Q}[\cdot, a, \cdot] \in \mathbb{R}^{K \times T}$ to gather all the actions $a'$ that correspond to the next-Q. We then construct a mask matrix $M \in \mathbb{R}^{K \times T}$ that records whether $\tilde{A}$ is identical to the executed action at the corresponding timestep. The bonus of the next-Q is the element-wise product of M and $\tilde{B}$ with factor $\alpha_2$. After the backward update, we compute the Q-values of (S, A) as $Q = Q(S, A; \theta) \in \mathbb{R}^{K \times T}$.
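The vectorized backward pass described above can be sketched with random tensors. The shapes follow the text, while the values, hyperparameters, and names (`Qtab`, `A_tilde`, `B_next`) are illustrative assumptions; the mask is computed per step rather than stored as a full matrix M.

```python
import numpy as np

# Vectorized BEBU backward pass: K heads, T steps, |A| = A_n actions.
rng = np.random.default_rng(0)
K, A_n, T = 4, 3, 5
gamma, alpha1, alpha2, beta = 0.99, 0.1, 0.1, 0.5

Qtab = rng.normal(size=(K, A_n, T))     # Q-table over next states S', from theta^-
R = rng.uniform(size=T)                 # episode rewards R_0..R_{T-1}
acts = rng.integers(0, A_n, size=T)     # executed actions a_0..a_{T-1}
B = rng.uniform(size=T)                 # immediate-reward bonus (shared by heads)
B_next = rng.uniform(size=(K, T))       # per-head next-Q bonus B~

A_tilde = Qtab.argmax(axis=1)           # (K, T): greedy actions a' per head/step

y = np.zeros((K, T))
y[:, T - 1] = R[T - 1] + alpha1 * B[T - 1]   # initialize the last target
heads = np.arange(K)
for t in range(T - 2, -1, -1):
    a_exec = acts[t + 1]                # executed action a_{t+1} at slice t
    # Diffuse the freshly computed target back into the Q-table (factor beta).
    Qtab[heads, a_exec, t] = beta * y[:, t + 1] + (1 - beta) * Qtab[heads, a_exec, t]
    a_star = A_tilde[:, t]              # gathered greedy actions a'
    mask = (a_star != a_exec).astype(float)   # 1_{a' != a_{t+1}}
    y[:, t] = (R[t] + alpha1 * B[t]
               + gamma * (Qtab[heads, a_star, t]
                          + alpha2 * mask * B_next[:, t]))
```

All K heads are updated in one pass over the episode, which is what makes the gradient computation for all heads simultaneous.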
The loss function takes the form $L(\theta) = \mathbb{E}\big[(y - Q)^2 \mid (s_t, a_t, r_t, s_{t+1}) \in E,\ E \sim D\big]$, where the episodic experience E is sampled from the replay buffer to perform gradient descent. The gradients of all heads can be computed simultaneously via BEBU. We refer to Appendix B for the full algorithm of OEB3. The theory of optimistic LSVI requires a strong linearity assumption on the transition dynamics. To make the algorithm work empirically, we make several adjustments in the implementation. First, in each training step of optimistic LSVI, all historical samples are utilized to update the weights of the Q-function and to calculate the confidence bonus, whereas in OEB3 we use samples from a batch of episodic trajectories drawn from the replay buffer in each training step. This difference is imposed to achieve computational efficiency. Second, in OEB3, the target network has a relatively low update frequency, whereas in optimistic LSVI the target Q-function is updated in each iteration. Such implementation techniques are commonly used in most existing (off-policy) DRL algorithms. We use BEBU to propagate the future uncertainty within an episode, which is an extension of EBU (Lee et al., 2019). Compared to EBU, BEBU requires extra tensors to store the UCB-bonuses for the immediate reward and the next-Q value, which are integrated to propagate uncertainties; integrating uncertainty into BEBU also requires a special design using the mask. Previous works do not propagate the future uncertainty and therefore do not capture the core benefit of utilizing the UCB-bonus for the exploration of MDPs. We highlight that OEB3 propagates future uncertainty in a time-consistent manner based on BEBU, which exploits the theoretical analysis established by Jin et al. (2020). The backward update significantly improves sample efficiency by allowing bonuses and delayed rewards to propagate through the transitions of the sampled episode, which we demonstrate in the sequel.

4. RELATED WORK

One practical principle for exploration in DRL is maintaining the epistemic uncertainty of action-value functions and learning to reduce that uncertainty. Epistemic uncertainty arises from missing knowledge of the environment and disappears with the progress of exploration. Bootstrapped DQN (Osband et al., 2016; 2018) maintains an ensemble of Q-functions to perform posterior sampling over value functions. UBE (O'Donoghue et al., 2018) estimates the variance of Q-values by propagating uncertainty through a Bellman equation. Bayesian DQN (Azizzadenesheli et al., 2018) replaces the linear regression of the last layer of the Q-network with Bayesian Linear Regression (BLR), which estimates a posterior of the Q-function. These methods use parametric distributions to describe the posterior, while OEB3 uses a non-parametric method to estimate the confidence bonus; moreover, UBE and BLR require inverting a large matrix in training. Previous methods also utilize the epistemic uncertainty of dynamics through Bayesian posteriors (Ratzlaff et al., 2020) and ensembles (Pathak et al., 2019). Nevertheless, these works consider single-step uncertainty, while we consider the long-term uncertainty within an episode. To measure the novelty of states for constructing count-based intrinsic rewards, previous methods use density models (Bellemare et al., 2016; Ostrovski et al., 2017), static hashing (Tang et al., 2017; Choi et al., 2019; Rashid et al., 2020), episodic curiosity (Savinov et al., 2019; Badia et al., 2020), representation changes (Raileanu & Rocktäschel, 2020), curiosity-bottleneck (Kim et al., 2019b), information gain (Houthooft et al., 2016) and RND (Burda et al., 2019b). Curiosity-driven exploration builds on the prediction error of environment models, such as ICM (Pathak et al., 2017; Burda et al., 2019a), EMI (Kim et al., 2019a), variational dynamics (Bai et al., 2020) and learning progress (Kim et al., 2020); estimating predictive uncertainty with neural networks dates back to Nix & Weigend (1994). Model-assisted RL (Kalweit & Boedecker, 2017) uses ensembles to exploit artificial data only in cases of high uncertainty. Buckman et al. (2018) use ensemble dynamics and Q-functions to rely on model rollouts only when they do not cause large errors.
Planning to explore (Sekar et al., 2020) seeks out future uncertainty by integrating uncertainty into Dreamer (Hafner et al., 2020). Ready Policy One (Ball et al., 2020) optimizes policies for both reward and model-uncertainty reduction. Noise-Augmented RL (Pacchiano et al., 2020) uses the statistical bootstrap to generalize optimistic posterior sampling (Agrawal & Jia, 2017) to DRL. Hallucinated UCRL (Curi et al., 2020) reduces optimistic exploration to exploitation by enlarging the control space. These model-based methods need to estimate the posterior of the dynamics, while OEB3 relies on the posterior of the Q-functions.

5.1. ENVIRONMENTAL SETUP

We evaluate OEB3 in an MNIST maze and 49 Atari games. The number of environmental interaction steps $L_1$ is typically set to be much larger than the number of training steps $L_2$ (e.g., in DQN, $L_1 \approx 4L_2$). We refer to Appendix D for the detailed specifications. The code is available at https://bit.ly/33jv1ab.

5.2. RESULT COMPARISON

In Table 1, we additionally report the performance of DQN (Mnih et al., 2015), NoisyNet (Fortunato et al., 2018), Bootstrapped DQN (Osband et al., 2016), and IDS (Nikolov et al., 2019) in 200M training frames. We choose NoisyNet as a baseline since it has been shown (Taiga et al., 2020) to perform substantially better than existing bonus-based methods when evaluated on the whole Atari suite (instead of several hard-exploration games), including CTS-counts (Bellemare et al., 2016), PixelCNN-counts (Ostrovski et al., 2017), RND (Burda et al., 2019b), and ICM (Pathak et al., 2017). To understand the proposed UCB-bonus, we use a trained OEB3 agent to interact with the environment for an episode of Breakout and record the UCB-bonus at each step. The curve in Figure 2 shows the UCB-bonuses of the subsampled steps in the episode. We choose 8 spikes with high UCB-bonuses and visualize the corresponding frames. The events at the spikes correspond to scarcely visited areas or crucial events that are important for the agent to obtain rewards: digging a tunnel (1), throwing the ball to the top of the bricks (2, 3), rebounding the ball (4, 5, 6), and eliminating all bricks to start a new round (7, 8). We provide more visualization examples in Appendix E. We further record the mean UCB-bonus of the training batch during learning, as shown in Figure 3. The UCB-bonus is low at the beginning since the networks are randomly initialized. When the agent starts to explore the environment, the mean UCB-bonus rises rapidly to incentivize exploration. As more experience of states and actions is gathered, the mean UCB-bonus decreases gradually, indicating that the bootstrapped value functions concentrate around the optimal value and the epistemic uncertainty decreases.
Nevertheless, according to Figure 2, the UCB-bonuses remain relatively high at scarcely visited areas or crucial events, which motivates the agent to enhance exploration at the corresponding events. We conduct an ablation study to better understand the importance of the backward update and the bonus term in OEB3; we refer to Table 2 for the results. We observe the following. (1) When we use the ordinary update strategy of sampling transitions instead of episodes in training, OEB3 reduces to BootDQN-UCB, with a significant performance loss; hence, the backward update is crucial for sample-efficient training in OEB3. (2) When the UCB-bonus is set to 0, OEB3 reduces to BEBU. We observe that OEB3 outperforms BEBU in 36 out of all 49 games. (3) When both the backward update and the UCB-bonus are removed, OEB3 reduces to standard BootDQN, which performs poorly in 20M training frames. (4) To illustrate the effect of the UCB-bonus, we substitute it with the popular RND-bonus (Burda et al., 2019b). Specifically, we use an independent RND network to generate an RND-bonus for each state in training; the RND-bonus is added to both the immediate reward and the next-Q. The results show that the UCB-bonus outperforms the RND-bonus without introducing additional modules compared to BootDQN.

6. CONCLUSION

In this work, we propose OEB3, which has theoretical underpinnings in optimistic LSVI. We propose a UCB-bonus to capture the epistemic uncertainty of the Q-function and the BEBU algorithm for sample-efficient training. We evaluate OEB3 empirically on MNIST maze and Atari games and show that it outperforms several strong baselines. The visualizations suggest that a high UCB-bonus corresponds to informative experiences for exploration. To the best of our knowledge, our work establishes the first empirical attempt at uncertainty propagation in deep RL. Moreover, we observe that such a connection between theoretical analysis and practical algorithm design yields strong empirical performance on Atari games, which we hope offers useful insights to the community on combining theory and practice.

A UCB BONUS IN OEB3

Recall that we consider the following regularized least-squares problem,
$$w_t \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{\tau=0}^{m} \Big[r_t(s^\tau_t, a^\tau_t) + \max_{a \in A} Q_{t+1}(s^\tau_{t+1}, a) - w^\top \phi(s^\tau_t, a^\tau_t)\Big]^2 + \lambda \|w\|^2. \quad (8)$$
In the sequel, we consider a Bayesian linear regression perspective of (8) that captures the intuition behind the UCB-bonus in OEB3. Our objective is to approximate the action-value function $Q_t$ by fitting the parameter w such that $w^\top \phi(s_t, a_t) \approx r_t(s_t, a_t) + \max_{a \in A} Q_{t+1}(s_{t+1}, a)$, where $Q_{t+1}$ is given. We assume a Gaussian prior on the initial parameter, $w \sim N(0, I/\lambda)$. With a slight abuse of notation, we denote by $w_t$ the Bayesian posterior of the parameter w given the set of independent observations $D_m = \{(s^\tau_t, a^\tau_t, s^\tau_{t+1})\}_{\tau \in [0, m]}$. We further define the following noise with respect to the least-squares problem in (8),
$$\epsilon = r_t(s_t, a_t) + \max_{a \in A} Q_{t+1}(s_{t+1}, a) - w^\top \phi(s_t, a_t),$$
where $(s_t, a_t, s_{t+1})$ follows the distribution of the trajectory. The following theorem justifies the UCB-bonus in OEB3 under the Bayesian linear regression perspective.

Theorem 2 (Formal Version of Theorem 1). Assume that ε follows the standard Gaussian distribution N(0, 1) given the state-action pair $(s_t, a_t)$ and the parameter w, and let w follow the Gaussian prior $N(0, I/\lambda)$. Define $\Lambda_t = \sum_{\tau=0}^{m} \phi(x^\tau_t, a^\tau_t)\phi(x^\tau_t, a^\tau_t)^\top + \lambda \cdot I$. Then, for the posterior of $w_t$ given the set of independent observations $D_m = \{(s^\tau_t, a^\tau_t, s^\tau_{t+1})\}_{\tau \in [0, m]}$, it holds that
$$\mathrm{Var}\big(\phi(s_t, a_t)^\top w_t\big) = \mathrm{Var}\big(\hat{Q}_t(s_t, a_t)\big) = \phi(s_t, a_t)^\top \Lambda_t^{-1} \phi(s_t, a_t), \quad \forall (s_t, a_t) \in S \times A.$$
Here we denote by $\hat{Q}_t = w_t^\top \phi$ the estimated action-value function.

Proof. The proof follows the standard analysis of Bayesian linear regression; see, e.g., West (1984) for a detailed treatment. We denote the target of the linear regression in (8) by $y_t = r_t(s_t, a_t) + \max_{a \in A} Q_{t+1}(s_{t+1}, a)$.
By the assumption that $\epsilon$ follows the standard Gaussian distribution, we obtain

$$y_t \mid (s_t, a_t), w \sim \mathcal{N}\bigl(w^\top \phi(s_t, a_t), 1\bigr). \tag{11}$$

Recall that we have the prior distribution $w \sim \mathcal{N}(0, I/\lambda)$. Our objective is to compute the posterior density of $w_t = w \mid \mathcal{D}_m$, where $\mathcal{D}_m = \{(s_t^\tau, a_t^\tau, s_{t+1}^\tau)\}_{\tau \in [0, m]}$. We have

$$\log p(w \mid \mathcal{D}_m) = -\lambda \|w\|^2/2 - \sum_{\tau=0}^{m} \bigl(w^\top \phi(s_t^\tau, a_t^\tau) - y_t^\tau\bigr)^2/2 + \mathrm{Const.} = -(w - \mu_t)^\top \Lambda_t (w - \mu_t)/2 + \mathrm{Const.}, \tag{13}$$

where we define

$$\mu_t = \Lambda_t^{-1} \sum_{\tau=0}^{m} \phi(s_t^\tau, a_t^\tau)\, y_t^\tau, \qquad \Lambda_t = \sum_{\tau=0}^{m} \phi(s_t^\tau, a_t^\tau)\, \phi(s_t^\tau, a_t^\tau)^\top + \lambda \cdot I.$$

Thus, by (13), we obtain $w_t = w \mid \mathcal{D}_m \sim \mathcal{N}(\mu_t, \Lambda_t^{-1})$. It then holds for all $(s_t, a_t) \in \mathcal{S} \times \mathcal{A}$ that

$$\mathrm{Var}\bigl[\phi(s_t, a_t)^\top w_t\bigr] = \mathrm{Var}\bigl[\widehat{Q}_t(s_t, a_t)\bigr] = \phi(s_t, a_t)^\top \Lambda_t^{-1} \phi(s_t, a_t),$$

which concludes the proof of Theorem 2.
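The posterior variance in Theorem 2 can be checked numerically. The sketch below uses toy dimensions and random features (assumptions for illustration, not the paper's setup): it builds $\Lambda_t$ from features, reads off the bonus $\phi^\top \Lambda_t^{-1} \phi$, and verifies that one extra observation of the same feature shrinks the bonus, which is exactly why the bonus rewards scarcely visited state-action pairs.

```python
import numpy as np

# Toy check of Theorem 2: with prior w ~ N(0, I/lam) and unit observation
# noise, the posterior covariance of w is Lambda^{-1}, so the predictive
# variance of Q_hat(s, a) = w^T phi(s, a) is phi^T Lambda^{-1} phi.
rng = np.random.default_rng(0)
d, m, lam = 4, 50, 1.0

Phi = rng.normal(size=(m, d))              # features phi(s_t^tau, a_t^tau)
Lambda = Phi.T @ Phi + lam * np.eye(d)     # Lambda_t
Sigma = np.linalg.inv(Lambda)              # posterior covariance of w_t

phi = rng.normal(size=d)                   # feature of a query pair (s_t, a_t)
bonus_sq = phi @ Sigma @ phi               # Var[Q_hat(s_t, a_t)], the squared bonus

# One more observation of the same feature shrinks the bonus
# (by Sherman-Morrison: b -> b / (1 + b)).
bonus_sq_new = phi @ np.linalg.inv(Lambda + np.outer(phi, phi)) @ phi
assert bonus_sq > 0.0
assert np.isclose(bonus_sq_new, bonus_sq / (1.0 + bonus_sq))
```

The monotone shrinkage is the key qualitative property: every visit to a state-action pair decreases its UCB-bonus, so optimistic action selection steers the agent toward unvisited pairs.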

B ALGORITHMIC DESCRIPTION

Batch-training steps of OEB3 (Algorithm steps 11-19 and 26-27):

11: Initialize a Q-table $\widetilde{Q} = Q(S', A; \theta^-) \in \mathbb{R}^{K \times |A| \times T}$ by the target Q-network
12: Compute the UCB-bonus for the immediate reward for all steps to construct $B \in \mathbb{R}^T$
13: Compute the action matrix $\widetilde{A} = \arg\max_a \widetilde{Q}[\cdot, a, \cdot] \in \mathbb{R}^{K \times T}$ to gather all $a^*$ of the next-Q
14: Compute the UCB-bonus for the next-Q for all heads and all steps to construct $\widetilde{B} \in \mathbb{R}^{K \times T}$
15: Compute the mask matrix $M \in \mathbb{R}^{K \times T}$, where $M[k, t] = \mathbb{1}\{\widetilde{A}[k, t] = A_{t+1}\}$
16: Initialize the target table $y \in \mathbb{R}^{K \times T}$ to zeros, and set $y[\cdot, T-1] = R_{T-1} + \alpha_1 B_{T-1}$
17: for $t = T-2$ to $0$ do
18:   $\widetilde{Q}[\cdot, a_{t+1}, t] \leftarrow \beta\, y[\cdot, t+1] + (1-\beta)\, \widetilde{Q}[\cdot, a_{t+1}, t]$
19:   $y[\cdot, t] \leftarrow R_t + \alpha_1 B_t + \gamma\, \widetilde{Q}[\cdot, a^*, t] + \alpha_2\, M[\cdot, t] \circ \widetilde{B}[\cdot, t]$, where $a^* = \widetilde{A}[\cdot, t]$
26: end for
27: end while

Action selection in the BEBU variants (Algorithm steps 6-13) proceeds as follows:

6: Reset the environment and receive the initial state $s_0$
7: for step $i = 0$ to Terminal do
8:   if Algorithm type is BEBU then
9:     With $\epsilon$-greedy, choose $a_i = \arg\max_a Q_k(s_i, a)$
10:  else if Algorithm type is BEBU-UCB then
11:    With $\epsilon$-greedy, choose $a_i = \arg\max_a [\bar{Q}(s_i, a) + \alpha \cdot \sigma(Q(s_i, a))]$, where $\bar{Q}(s_i, a) = \frac{1}{K}\sum_{k=1}^{K} Q_k(s_i, a)$ and $\sigma(Q(s_i, a)) = \sqrt{\frac{1}{K}\sum_{k=1}^{K} (Q_k(s_i, a) - \bar{Q}(s_i, a))^2}$
12:  else if Algorithm type is BEBU-IDS then
13:    With $\epsilon$-greedy, choose $a_i = \arg\min_a \hat{\Delta}_i(s_i, a)^2 / I_i(s_i, a)$ by following the regret-information ratio, where $\hat{\Delta}_i(s_i, a) = \max_{a' \in \mathcal{A}} u_i(s_i, a') - l_i(s_i, a)$ is the expected regret, $[l_i(s_i, a), u_i(s_i, a)]$ is the confidence interval with $u_i(s_i, a) = \bar{Q}(s_i, a) + \lambda_{\mathrm{ids}} \cdot \sigma(Q(s_i, a))$ and $l_i(s_i, a) = \bar{Q}(s_i, a) - \lambda_{\mathrm{ids}} \cdot \sigma(Q(s_i, a))$, and $I_i(s_i, a) = \log(1 + \sigma(Q(s_i, a))^2 / \rho^2) + \epsilon_{\mathrm{ids}}$ measures the information gain, with $\rho$ and $\epsilon_{\mathrm{ids}}$ constants

We summarize the closely related works in Table 3.

We use a 10 × 10 MNIST maze with randomly placed walls to evaluate our method. The agent starts from the initial position (0, 0) at the upper-left of the maze and aims to reach the goal position (9, 9) at the bottom-right. The state at position (i, j) is represented by stacking two images with labels i and j randomly sampled from the MNIST dataset. When the agent steps to a new position, the state representation is reconstructed by resampling images.
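The backward target construction of steps 11-19 can be sketched in numpy as follows. The shapes are toy values, and the bonus features `B` and `B_tilde` are simplified placeholders for the head-disagreement bonuses, not the exact implementation.

```python
import numpy as np

# Minimal sketch of the episodic backward update with UCB-bonus (steps 11-19).
rng = np.random.default_rng(1)
K, nA, T = 5, 3, 8                              # heads, actions, episode length
gamma, beta, a1, a2 = 0.9, 1.0, 0.01, 0.01      # discount, diffusion, bonus scales

Q = rng.normal(size=(K, nA, T))                 # Q-table from the target network
R = rng.normal(size=T)                          # episode rewards
A_next = rng.integers(nA, size=T)               # actions A_{t+1} taken in the episode

A_tilde = Q.argmax(axis=1)                      # action matrix, shape (K, T)
std_a = Q.std(axis=0)                           # head disagreement per (action, step)
B = std_a.max(axis=0)                           # immediate-reward bonus, shape (T,) (placeholder)
B_tilde = std_a[A_tilde, np.arange(T)]          # next-Q bonus per head, shape (K, T)
M = (A_tilde == A_next[None, :]).astype(float)  # mask M[k, t] = 1{A_tilde[k,t] = A_{t+1}}

y = np.zeros((K, T))                            # target table
y[:, -1] = R[-1] + a1 * B[-1]
heads = np.arange(K)
for t in range(T - 2, -1, -1):
    # step 18: diffuse the next-step target back into the Q-table
    Q[:, A_next[t + 1], t] = beta * y[:, t + 1] + (1 - beta) * Q[:, A_next[t + 1], t]
    # step 19: optimistic target with immediate and next-Q UCB-bonuses
    y[:, t] = R[t] + a1 * B[t] + gamma * Q[heads, A_tilde[:, t], t] \
              + a2 * M[:, t] * B_tilde[:, t]
```

Because the loop runs from the end of the episode backward, the bonus injected at a late step is discounted into every earlier target, which is how future uncertainty propagates into the estimated Q-function.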
Hence the agent observes different states even when stepping to the same location twice, which minimizes the correlation among locations. We further add stochasticity to the transition probabilities: when taking an action, the agent has a 10% probability of arriving at each adjacent location. For example, when taking action 'left', the agent has a 10% chance of transiting to 'up' and a 10% chance of transiting to 'down'. The agent receives a reward of -1 when bumping into a wall and 1000 when reaching the goal.

The backward update of the BEBU variants (Algorithm steps 20-25) is:

20: Initialize a Q-table $\widetilde{Q} = Q(S', A; \theta^-) \in \mathbb{R}^{K \times |A| \times T}$ by the target Q-network
21: Compute the action matrix $\widetilde{A} = \arg\max_a \widetilde{Q}[\cdot, a, \cdot] \in \mathbb{R}^{K \times T}$ to gather all $a^*$ of the next-Q
22: Initialize the target table $y \in \mathbb{R}^{K \times T}$ to zeros, and set $y[\cdot, T-1] = R_{T-1} + \alpha_1 B_{T-1}$
23: for $t = T-2$ to $0$ do
24:   $\widetilde{Q}[\cdot, a_{t+1}, t] \leftarrow \beta\, y[\cdot, t+1] + (1-\beta)\, \widetilde{Q}[\cdot, a_{t+1}, t]$
25:   $y[\cdot, t] \leftarrow R_t + \gamma\, \widetilde{Q}[\cdot, a^*, t]$, where $a^* = \widetilde{A}[\cdot, t]$

We use different wall-density setups in the experiment. The wall-density is the proportion of walls among all locations. Figure 5 (left) shows a generated maze with a wall-density of 50%; the gray positions represent walls. We train all methods with wall-densities of 30%, 40%, and 50%. For each setup, we train 50 independent agents on 50 randomly generated mazes. The relative length, defined by $l_{\mathrm{agent}} / l_{\mathrm{best}}$, is used to evaluate performance, where $l_{\mathrm{agent}}$ is the length of the agent's path in an episode (at most 1000 steps) and $l_{\mathrm{best}}$ is the length of the shortest path. The performance comparison is shown in Figure 4. OEB3 performs best among all methods, and BEBU-IDS is a strong baseline. We use a trained OEB3 agent to take actions in the maze of Figure 5 (left). The UCB-bonuses of the state-action pairs along the agent's path are computed and visualized in Figure 5 (right). The state-action pairs with high UCB-bonuses mark the bottleneck positions in the path.
For example, the state-action pairs at locations (3, 3) and (6, 7) produce high UCB-bonuses that guide the agent to choose the right direction. The UCB-bonus encourages the agent to pass through these bottleneck positions correctly. We give more examples of visualization in Appendix E.

D IMPLEMENTATION DETAILS

D.1 MNIST MAZE

Hyper-parameters of BEBU. BEBU is the base algorithm of BEBU-UCB and BEBU-IDS. BEBU uses the same network architecture as Bootstrapped DQN (Osband et al., 2016). The diffusion factor and other training parameters follow the EBU paper (Lee et al., 2019). Details are summarized in Table 4. The target network is updated every 2000 steps.

- optimizer: Adam, with $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon_{\mathrm{Adam}} = 10^{-7}$
- learning rate: 0.001
- $\epsilon$: $(h - H)^2 / H^2$, the exploration factor, where $H$ is the total number of training timesteps and $h$ is the current timestep; $\epsilon$ starts from 1 and is annealed to 0 quadratically
- $\gamma$: 0.9, the discount factor
- $\beta$: 1.0, the diffusion factor of the backward update
- wall density: 30%, 40%, and 50%, the proportion of walls among all maze locations
- reward: -1 when bumping into a wall, 1000 when reaching the goal
- stochasticity: 10%, the probability of arriving at each adjacent location when taking an action
- evaluation metric: $l_{\mathrm{rel}} = l_{\mathrm{agent}} / l_{\mathrm{best}}$, the ratio between the length of the agent's path and the shortest-path length

Hyper-parameters of BEBU-UCB. BEBU-UCB uses the upper bound of the Q-values to select actions. In particular, $a = \arg\max_{a \in \mathcal{A}} [\mu(s, a) + \lambda_{\mathrm{ucb}}\, \sigma(s, a)]$, where $\mu(s, a)$ and $\sigma(s, a)$ are the mean and standard deviation of the bootstrapped Q-values $\{Q_k(s, a)\}_{k=1}^{K}$. We use $\lambda_{\mathrm{ucb}} = 0.1$ in our experiments.

Hyper-parameters of BEBU-IDS. The action selection in IDS (Nikolov et al., 2019) follows the regret-information ratio $a_t = \arg\min_{a \in \mathcal{A}} \hat{\Delta}_t(s, a)^2 / I_t(s, a)$, which balances regret and exploration. $\hat{\Delta}_t(s, a)$ is the expected regret, indicating the loss of reward when choosing a suboptimal action $a$. IDS uses a conservative estimate of the regret, $\hat{\Delta}_t(s, a) = \max_{a' \in \mathcal{A}} u_t(s, a') - l_t(s, a)$, where $[l_t(s, a), u_t(s, a)]$ is the confidence interval of the action-value function. In particular, $u_t(s, a) = \mu(s, a) + \lambda_{\mathrm{ids}}\, \sigma(s, a)$ and $l_t(s, a) = \mu(s, a) - \lambda_{\mathrm{ids}}\, \sigma(s, a)$, where $\mu(s, a)$ and $\sigma(s, a)$ are the mean and standard deviation of the bootstrapped Q-values $\{Q_k(s, a)\}_{k=1}^{K}$. The information gain $I_t(s, a)$ measures the uncertainty of the action-values as $I(s, a) = \log(1 + \sigma(s, a)^2 / \rho(s, a)^2) + \epsilon_{\mathrm{ids}}$, where $\rho(s, a)$ is the variance of the return distribution; $\rho(s, a)$ is measured by C51 (Bellemare et al., 2017) in distributional RL and becomes a constant in ordinary Q-learning. We use $\lambda_{\mathrm{ids}} = 0.1$, $\rho(s, a) = 1.0$, and $\epsilon_{\mathrm{ids}} = 10^{-5}$ in our experiments.

Hyper-parameters of OEB3. We set $\alpha_1$ and $\alpha_2$ to the same value of 0.01. We find that normalizing the UCB-bonus $\widetilde{B}$ of the next Q-value yields more stable performance: $\widetilde{B}$ is smoothed by dividing it by a running estimate of its standard deviation. Because the UCB-bonuses for the next-Q differ across Q-heads, this normalization is useful in most cases, giving the value function a smooth and stable update.
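The IDS action-selection rule above can be sketched as follows. The function name `ids_action` and the toy Q-values are illustrative; the defaults match the stated hyper-parameters $\lambda_{\mathrm{ids}} = 0.1$, $\rho = 1.0$, $\epsilon_{\mathrm{ids}} = 10^{-5}$.

```python
import numpy as np

def ids_action(Qk, lam_ids=0.1, rho=1.0, eps_ids=1e-5):
    """Qk: bootstrapped Q-values of shape (K, |A|); returns the IDS action."""
    mu = Qk.mean(axis=0)
    sigma = Qk.std(axis=0)
    upper = mu + lam_ids * sigma                      # u_t(s, a)
    lower = mu - lam_ids * sigma                      # l_t(s, a)
    regret = upper.max() - lower                      # conservative regret estimate
    info = np.log1p(sigma ** 2 / rho ** 2) + eps_ids  # information gain I_t(s, a)
    return int(np.argmin(regret ** 2 / info))

# Three heads, three actions: action 1 has the highest mean, hence the
# smallest conservative regret, so IDS selects it here.
Qk = np.array([[1.0, 2.0, 0.5],
               [1.2, 1.8, 0.4],
               [0.9, 2.2, 0.6]])
```

Note that a high-variance action can still be chosen when its squared regret is small relative to its information gain, which is the intended regret-exploration trade-off.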

D.2 ATARI GAMES

Hyper-parameters of BEBU. The basic setting of the Atari environment is the same as in the Nature DQN (Mnih et al., 2015) and EBU (Lee et al., 2019) papers. We use a different network architecture, environmental setup, exploration factor, and evaluation scheme from the MNIST maze experiment. Details are summarized in Table 5. Hyper-parameters of OEB3. We set $\alpha_1$ and $\alpha_2$ to the same value of $0.5 \times 10^{-4}$. The UCB-bonus $\widetilde{B}$ for the next Q-value is normalized by dividing it by a running estimate of its standard deviation for stable performance.

E VISUALIZING OEB3

OEB3 uses the UCB-bonus, which indicates the disagreement of bootstrapped Q-estimates, to measure the uncertainty of Q-functions. State-action pairs with high UCB-bonuses signify bottleneck positions or meaningful events. We provide visualizations in several tasks to illustrate the effect of UCB-bonuses. Specifically, we analyze MNIST maze and two popular Atari games, Breakout and MsPacman. The positions with UCB-bonuses higher than 0 trace the path of the agent. The path is usually winding and includes positions beyond the shortest path because the state transitions are stochastic. The state-action pairs with high UCB-bonuses mark the bottleneck positions in the path. In maze 6(a), the agent slips from the correct path at position (4, 7) to (4, 9). The state-action pair at position (4, 8) produces a high bonus to guide the agent back to the correct path. In maze 6(b), the bottleneck state at (3, 2) has a high bonus that keeps the agent from entering the wrong side of the fork. The other two mazes also have bottleneck positions, such as (3, 3) in maze 6(c) and (7, 6) in maze 6(d). Selecting actions incorrectly at these important locations is prone to failure. The UCB-bonus encourages the agent to pass through these bottleneck positions correctly.

In Breakout, the agent uses the walls and the paddle to rebound the ball against the bricks and eliminate them. We use a trained OEB3 agent to interact with the environment for one episode. The episode contains 3445 steps, and we subsample every 4 steps for visualization. The curve in Figure 7 shows the UCB-bonus over the 861 sampled steps. We choose 16 spikes with high UCB-bonuses and visualize the corresponding frames. The events at the spikes usually correspond to meaningful experiences that are important for the agent to obtain rewards. In step 1, the agent is digging a tunnel, hoping to get rewards faster. After digging a tunnel, the ball appears on top of the bricks in steps 2, 3, 4, 5, 6, 9, and 12. Balls on top hit bricks more easily. The agent rebounds the ball and throws it over the bricks in steps 7, 8, 10, and 11. In step 13, the agent eliminates all bricks and then comes to a new round, which is novel and promising for obtaining more rewards. In steps 14, 15, and 16, the agent rebounds the ball and tries to dig a tunnel again. The UCB-bonus encourages the agent to explore potentially informative and novel state-action pairs to obtain high rewards. We record 15 frames after each spike for further visualization. The video is available at https://youtu.be/VptBkHyMt8g.

In MsPacman, the agent earns points by avoiding monsters and eating pellets. Eating an energizer causes the monsters to turn blue, allowing them to be eaten for extra points. We use a trained OEB3 agent to interact with the environment for one episode. Figure 8 shows the UCB-bonus over all 708 steps. We choose 16 spikes and visualize their frames. The spikes of the exploration bonus correspond to meaningful events for the agent to obtain rewards: starting a new scenario (1, 2, 9, 10), changing direction (3, 4, 13, 14, 16), eating an energizer (5, 11), eating monsters (7, 8, 12), and entering a corner (6, 15). These state-action pairs with high UCB-bonuses help the agent explore the environment efficiently. We record 15 frames after each spike; the video is available at https://youtu.be/C_8NHKpBNXM.

G PERFORMANCE COMPARISON

We use the relative score

$$\frac{\mathrm{Score}_{\mathrm{Agent}} - \mathrm{Score}_{\mathrm{Baseline}}}{\max\{\mathrm{Score}_{\mathrm{Human}}, \mathrm{Score}_{\mathrm{Baseline}}\} - \mathrm{Score}_{\mathrm{Random}}}$$

to compare OEB3 with the baselines. The results of OEB3 compared with BEBU, BEBU-UCB, and BEBU-IDS are shown in Figure 9, Figure 10, and Figure 11, respectively.

H FAILURE ANALYSIS

Our method does not perform well on Montezuma's Revenge (see Table 7) because epistemic-uncertainty-based methods are not particularly tailored to this domain. IDS, NoisyNet, and the BEBU-based methods all fail in this domain and score zero. Bootstrapped DQN achieves 100 points, which is also very low and does not indicate successful learning in Montezuma's Revenge. In contrast, bonus-based methods achieve significantly higher scores in this task (e.g., RND achieves 8152 points). However, according to Taiga et al. (2020) and Table 1, NoisyNet and IDS significantly outperform several strong bonus-based methods when evaluated by the mean and median scores over 49 Atari games.
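The relative score used in Appendix G can be computed as follows; the function name and the example values are illustrative, not scores from the paper.

```python
def relative_score(agent, baseline, human, random):
    """Relative score (in %) used to compare an agent with a baseline."""
    return 100.0 * (agent - baseline) / (max(human, baseline) - random)

# Illustrative values: the agent improves 100 points over the baseline on a
# normalizing scale of 400 points, i.e., a relative score of 25%.
score = relative_score(400.0, 300.0, 500.0, 100.0)  # 25.0
```

Taking the maximum of the human and baseline scores in the denominator keeps the measure bounded on games where the baseline already exceeds human performance.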



Figure 2: Visualizing the UCB-bonus in Breakout.

We set $\alpha_1$ and $\alpha_2$ to the same value of $0.5 \times 10^{-4}$ by coarse search. We use a diffusion factor $\beta = 0.5$ for all methods, following Lee et al. (2019). OEB3 requires much less training time than BEBU-UCB and BEBU-IDS because the confidence bound used in BEBU-UCB and the regret-information ratio used in BEBU-IDS are both computed at each time step when interacting with the environment, while the UCB-bonuses in OEB3 are calculated over episodes when performing batch training. The number of environmental interaction steps $L_1$ is typically much larger than the number of training steps $L_2$ (e.g., in DQN, $L_1 \approx 4L_2$). We refer to Appendix D for the detailed specifications. The code is available at https://bit.ly/33jv1ab.

Figure 3: The change of mean UCB-bonus in Breakout

Here $\mathcal{D}_m = \{(s_t^\tau, a_t^\tau, s_{t+1}^\tau)\}_{\tau \in [0, m]}$ is the set of observations. It holds from Bayes' rule that

$$\log p(w \mid \mathcal{D}_m) = \log p(w) + \log p(\mathcal{D}_m \mid w) + \mathrm{Const.}, \tag{12}$$

where $p(\cdot)$ denotes the probability density function of the respective distribution. Plugging (11) and the probability density function of the Gaussian distribution into (12) yields the posterior in (13).

Algorithm: OEB3 in DRL (steps 1-11)

1: Initialize: replay buffer $\mathcal{D}$, bootstrapped Q-network $Q(\cdot; \theta)$, and target network $Q(\cdot; \theta^-)$
2: Initialize: total training frames H = 20M, current frame h = 0
3: while h < H do
4:   Pick a bootstrapped Q-function to act by sampling $k \sim \mathrm{Unif}\{1, \ldots, K\}$
5:   Reset the environment and receive the initial state $s_0$
6:   for step i = 0 to Terminal do
7:     With $\epsilon$-greedy, choose $a_i = \arg\max_a Q_k(s_i, a)$
8:     Take the action and observe $r_i$ and $s_{i+1}$, then save the transition in buffer $\mathcal{D}$
9:     if h % training frequency = 0 then
10:      Sample an episodic experience $E = \{S, A, R, S'\}$ with length $T$ from $\mathcal{D}$
11:      Initialize a Q-table $\widetilde{Q} = Q(S', A; \theta^-) \in \mathbb{R}^{K \times |A| \times T}$ by the target Q-network
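The acting loop (steps 3-8) can be sketched as below. The environment, the `q_head` stand-in, and the 20-step episode cap are assumptions for illustration, not the paper's implementation; the quadratic $\epsilon$ schedule matches the one listed in Appendix D.

```python
import numpy as np

rng = np.random.default_rng(2)
K, nA = 10, 4      # bootstrapped heads, actions
H, h = 1000, 0     # total frames (toy value), current frame

def q_head(k, s):
    # stand-in for the k-th bootstrapped head Q_k(s, .)
    return rng.normal(size=nA)

def act(k, s, eps):
    # epsilon-greedy on the sampled head (step 7)
    if rng.random() < eps:
        return int(rng.integers(nA))
    return int(np.argmax(q_head(k, s)))

while h < H:
    k = int(rng.integers(K))          # sample one head per episode (step 4)
    s = 0                             # stand-in initial state (step 5)
    for i in range(20):               # episode capped at 20 steps for the sketch
        eps = (h - H) ** 2 / H ** 2   # quadratic annealing of epsilon from 1 to 0
        a = act(k, s, eps)
        h += 1                        # one environment frame consumed (step 8)
        if h >= H:
            break
```

Committing to a single head for a whole episode is what makes the exploration temporally extended, in contrast to per-step dithering.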

21: Gather the Q-values of $(S, A)$ for all heads as $Q = Q(S, A; \theta) \in \mathbb{R}^{K \times T}$
22: Perform a gradient descent step on $(y - Q)^2$ with respect to $\theta$

where $\bar{Q}(s_i, a)$ and $\sigma(Q(s_i, a))$ are the mean and standard deviation of the bootstrapped Q-estimates
12: else if Algorithm type is BEBU-IDS then
13:   With $\epsilon$-greedy, choose $a_i = \arg\min_a \hat{\Delta}_i(s_i, a)^2 / I_i(s_i, a)$

17: Take the action and observe $r_i$ and $s_{i+1}$, then save the transition in buffer $\mathcal{D}$
18: if h % training frequency = 0 then
19:   Sample an episodic experience $E = \{S, A, R, S'\}$ with length $T$ from $\mathcal{D}$

Figure 4: Results of 200K steps training of MNIST maze with different wall-density setup.

Hyper-parameters of BEBU-UCB. BEBU-UCB selects actions by $a = \arg\max_{a \in \mathcal{A}} [\mu(s, a) + \lambda_{\mathrm{ucb}}\, \sigma(s, a)]$. The details are given in Appendix D.1. We use $\lambda_{\mathrm{ucb}} = 0.1$ in our experiments, chosen by coarse search. Hyper-parameters of BEBU-IDS. The action selection follows the regret-information ratio $a_t = \arg\min_{a \in \mathcal{A}} \hat{\Delta}_t(s, a)^2 / I_t(s, a)$. The details are given in Appendix D.1. We use $\lambda_{\mathrm{ids}} = 0.1$, $\rho(s, a) = 1.0$, and $\epsilon_{\mathrm{ids}} = 10^{-5}$ in our experiments, chosen by coarse search.
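The BEBU-UCB selection rule can be sketched as below; `ucb_action` and the toy Q-values are illustrative.

```python
import numpy as np

def ucb_action(Qk, lam_ucb=0.1):
    """Qk: bootstrapped Q-values of shape (K, |A|); returns argmax mu + lam*sigma."""
    mu = Qk.mean(axis=0)
    sigma = Qk.std(axis=0)
    return int(np.argmax(mu + lam_ucb * sigma))

# Action 0: mean 1.0, std 0.0; action 1: mean 0.6, std 0.5.
Qk = np.array([[1.0, 1.1],
               [1.0, 0.1]])
# With lam_ucb = 0.1 the certain action 0 wins (1.0 > 0.65); a larger
# lam_ucb = 2.0 lets optimism dominate and action 1 wins (1.6 > 1.0).
```

The example shows how $\lambda_{\mathrm{ucb}}$ trades off exploitation of the bootstrapped mean against optimism about high-disagreement actions.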

Figure 6 illustrates the UCB-bonus in four randomly generated mazes. The mazes in Figures 6(a) and 6(b) have a wall-density of 40%, and those in Figures 6(c) and 6(d) have a wall-density of 50%. The left of each figure shows the map of the maze, where the black blocks represent walls. For simplicity, we do not show the MNIST representation of the states, although MNIST states are used in training. A trained OEB3 agent starts at the upper-left and takes actions to reach the goal at the bottom-right. The UCB-bonuses of the state-action pairs along the agent's path are computed and shown on the right of each figure. The values are normalized to [0, 1] for visualization. If the agent visits the same location several times, we show the maximal value.

Figure 6: Visualization of the UCB-bonus in MNIST maze

Figure 7: Visualization of UCB-bonus in Breakout


Figure 9: Relative score of OEB3 compared to BEBU, in percent (%).

Figure 10: Relative score of OEB3 compared to BEBU-UCB, in percent (%).


Figure 11: Relative score of OEB3 compared to BEBU-IDS, in percent (%).

Optimistic exploration uses an optimistic action-value function $Q^+$ to incentivize exploration by adding a bonus term to the ordinary Q-value; thus $Q^+$ serves as an upper bound of the ordinary $Q$. The bonus term represents the epistemic uncertainty that results from a lack of experience with the corresponding states and actions. In this paper, we use a UCB-bonus $B(s_t, a_t)$ that measures the disagreement of the bootstrapped Q-values $\{Q_k(s_t, a_t)\}_{k=1}^{K}$ of the state-action pair $(s_t, a_t)$ in bootstrapped DQN, which takes the following form,

$$B(s_t, a_t) = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \bigl(Q_k(s_t, a_t) - \bar{Q}(s_t, a_t)\bigr)^2}, \qquad \bar{Q}(s_t, a_t) = \frac{1}{K} \sum_{k=1}^{K} Q_k(s_t, a_t).$$

Bootstrapped DQN (Osband et al., 2016) samples Q-values from the randomized value function to encourage exploration through Thompson sampling. Chen et al. (2017) propose using the standard deviation of bootstrapped Q-functions to measure the uncertainty. While the uncertainty measurement is similar to that of OEB3, our method differs from Chen et al. (2017) in the following aspects. First, our approach propagates the uncertainty through time by the backward update, which allows for deep exploration in MDPs. Second, Chen et al. (2017) do not use the bonus in the update of the Q-functions; the bonus is computed only when taking actions. Third, we establish a theoretical connection between the UCB-bonus and the bonus in optimistic LSVI. SUNRISE (Lee et al., 2020) extends Chen et al. (2017) to continuous control through a confidence reward and weighted Bellman backup. Information-Directed Sampling (IDS) (Nikolov et al., 2019) is based on bootstrapped DQN and chooses actions by balancing the instantaneous regret and the information gain. OAC (Ciosek et al., 2019) uses two Q-networks to obtain lower and upper bounds of the Q-value to perform exploration in continuous-control tasks. These methods seek to estimate the epistemic uncertainty and choose optimistic actions. In contrast, we use the uncertainty of the value function to construct intrinsic rewards and perform a backward update, which propagates future uncertainty to the estimated Q-value.

Summary of human-normalized scores in 49 Atari games. BEBU, BEBU-UCB, BEBU-IDS, and OEB3 are trained for 20M frames on an RTX-2080Ti GPU with 5 random seeds. UBE (O'Donoghue et al., 2018) uses a parametric method to describe the posterior of Q-values, which is utilized for optimism in exploration. We also use UBE as a baseline. According to Table 1, BootDQN-IDS performs better than UBE, BootDQN, and NoisyNet; thus, BootDQN-IDS outperforms the existing bonus-based exploration methods that perform worse than NoisyNet. We re-implement BootDQN-IDS with the BEBU-based update and observe that OEB3 outperforms BEBU-IDS in both mean and median scores, thus also outperforming the bonus-based methods that perform worse than NoisyNet. We report the raw scores in Appendix F. Moreover, Appendix G shows that OEB3 outperforms BEBU, BEBU-UCB, and BEBU-IDS in 36, 34, and 35 out of all 49 games, respectively.

Ablation Study

Algorithmic Comparison of Related Works

Hyper-parameters of BEBU for MNIST-Maze

Hyper-parameters of BEBU for Atari games. The network uses convolutional (channels, kernel size, stride) layers first, then fully connects into K bootstrapped heads. Each head has 512 ReLUs and |A| linear units.

Raw scores for Atari games. Each game is trained for 20M frames with a single RTX-2080Ti GPU. Bold scores signify the best score out of all methods.

Comparison of scores in Montezuma's Revenge.

