REGRET BOUNDS AND REINFORCEMENT LEARNING EXPLORATION OF EXP-BASED ALGORITHMS

Anonymous

Abstract

EXP-based algorithms are often used for exploration in multi-armed bandits. We revisit the EXP3.P algorithm and establish both lower and upper bounds on regret in the Gaussian multi-armed bandit setting, as well as for a more general distribution option. Unlike classical regret analyses, ours do not require bounded rewards. We also extend EXP4 from multi-armed bandits to reinforcement learning to incentivize exploration by multiple agents. The resulting algorithm has been tested on hard-to-explore games, where it improves exploration compared to the state of the art.

1. INTRODUCTION

Multi-armed bandit (MAB) is the problem of maximizing the cumulative reward of a player throughout a bandit game by choosing different arms at each time step. It is equivalent to minimizing the regret, defined as the difference between the best reward that can be achieved and the actual reward gained by the player. Formally, given time horizon T, in time step t ≤ T the player chooses one arm a_t among K arms, receives reward r_{a_t}(t) from the reward vector r(t) = (r_1(t), r_2(t), ..., r_K(t)), and maximizes the total reward Σ_{t=1}^T r_{a_t}(t), or equivalently minimizes the regret. The EXP-type MAB algorithms are computationally efficient and come with abundant theoretical analyses. In EXP3.P, each arm has a trust coefficient (weight). The player samples each arm with probability equal to the sum of its normalized weight and a bias term, receives the reward of the sampled arm, and exponentially updates the weights based on the corresponding reward estimates. EXP3.P achieves regret of the order O(√T) with high probability. In EXP4, there is an arbitrary number of experts, each with a sampling rule over actions and a weight. The player samples according to the weighted average of the experts' sampling rules and updates the weights accordingly. Contextual bandit is a variant of MAB obtained by adding a context or state space S. At time step t, the player observes context s_t ∈ S, with s_{1:T} = (s_1, s_2, ..., s_T) being independent. Rewards r(t) follow F(µ(s_t)), where F is any distribution and µ(s_t) is the mean vector that depends on state s_t. Reinforcement Learning (RL) generalizes contextual bandits: state and reward transitions follow a Markov Decision Process (MDP) represented by a transition kernel P(s_{t+1}, r(t) | a_t, s_t). A key challenge in RL is the trade-off between exploration and exploitation. Exploration encourages the player to try new arms in MAB, or new actions in RL, to understand the game better. It helps to plan for the future, but at the price of potentially lowering the current reward.
Exploitation aims to exploit currently known states and arms to maximize the current reward, but it potentially prevents the player from gaining information that could increase future reward. To maximize the cumulative reward, the player needs to learn the game through exploration while securing current reward through exploitation. How to incentivize exploration has been a main focus in RL. Since RL is built on MAB, it is natural to extend MAB techniques to RL, and UCB is such a success. UCB (Auer et al. (2002a)) motivates count-based exploration (Strehl and Littman, 2008) in RL and the subsequent Pseudo-Count exploration (Bellemare et al., 2016). New deep RL exploration algorithms have recently been proposed. Using deep neural networks to keep track of the Q-values by means of Q-networks in RL is called DQN (Mnih et al. (2013)); this combination of deep learning and RL has shown great success. The ε-greedy rule in Mnih et al. (2015) is a simple exploration technique using DQN. Besides ε-greedy, intrinsic model exploration computes intrinsic rewards by focusing on experiences. Intrinsic rewards directly measure and incentivize exploration when added to the extrinsic (actual) rewards of RL, e.g. DORA (Fox et al., 2018) and Stadie et al. (2015). Random Network Distillation (RND) (Burda et al., 2018) is a more recent suggestion relying on a fixed target network. A drawback of RND is its local focus, without global exploration. In order to address the weak points of these various exploration algorithms in the RL context, the notion of experts is natural, and thus EXP-type MAB algorithms are appropriate. The allowance of arbitrary experts provides exploration for harder contextual bandits, and hence exploration possibilities for RL. We develop an EXP4 exploration algorithm for RL that relies on several general experts. This is the first RL algorithm using several exploration experts, enabling global exploration.
Focusing on DQN, in the computational study we use two agents consisting of RND and ε-greedy DQN. We implement the EXP4-RL algorithm on the hard-to-explore RL game Montezuma's Revenge and compare it with the benchmark algorithm RND (Burda et al. (2018)). The numerical results show that the algorithm gains more exploration than RND and acquires the ability of global exploration by not getting stuck in the local maxima of RND. Its total reward also increases with training. Overall, our algorithm improves exploration and exploitation on the benchmark game and demonstrates a learning process in RL. Reward in RL is in many cases unbounded, which relates to unbounded MAB rewards. There are three major versions of MAB: adversarial, stochastic, and the herein introduced Gaussian. For adversarial MAB, the rewards r(t) of the K arms can be chosen arbitrarily by adversaries at step t. For stochastic MAB, the rewards at different steps are assumed to be i.i.d. and the rewards across arms are independent; it is assumed that 0 ≤ r_i(t) ≤ 1 for any arm i and step t. For Gaussian MAB, rewards r(t) follow a multivariate normal N(µ, Σ), with µ being the mean vector and Σ the covariance matrix of the K arms. Here the rewards are neither bounded nor independent among the arms. For this reason the introduced Gaussian MAB reflects the RL setting and is the subject of our MAB analyses of EXP3.P. EXP-type algorithms (Auer et al. (2002b)) are optimal in the two classical MABs. Auer et al. (2002b) show lower and upper bounds on regret of the order O(√T) for adversarial MAB and of the order O(log T) for stochastic MAB. All of the proofs of these regret bounds for EXP-type algorithms rely on the bounded-reward assumption, which does not hold for Gaussian MAB. Therefore, the regret bounds for Gaussian MAB with unbounded rewards studied herein are significantly different from prior works. We show both lower and upper bounds on the regret of Gaussian MAB under certain assumptions.
Some analyses even hold for more generally distributed MAB. The upper bounds adapt ideas from the analysis of the EXP3.P algorithm in Auer et al. (2002b) for bounded MAB to our unbounded MAB, while the lower bounds rely on a new construction of instances. Precisely, we derive lower bounds of order Ω(T) for certain fixed T and upper bounds of order O*(√T) for T large enough. The question of bounds for any value of T remains open. The main contributions of this work are as follows. On the analytical side we introduce Gaussian MAB with the unique aspect and challenge of unbounded rewards. We provide the very first regret lower bound in such a case by constructing a novel family of Gaussian bandits, and we analyze the EXP3.P algorithm for Gaussian MAB. Unbounded rewards pose a non-trivial challenge in the analyses. We also provide the very first extension of EXP4 to RL exploration and show its superior performance on two hard-to-explore RL games. A literature review is provided in Section 2. Then in Section 3 we exhibit lower and upper bounds on regret for unbounded MAB under the EXP3.P algorithm. Section 4 discusses the EXP4 algorithm for RL exploration. Finally, in Section 5, we present numerical results related to the proposed algorithm.
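As background, the EXP4 sampling step recalled in the introduction (sampling from the weighted average of the experts' sampling rules) can be sketched as follows; the sizes, trust weights and advice distributions here are illustrative stand-ins, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
K, E = 4, 3                                   # number of arms and experts (illustrative)
weights = np.array([2.0, 1.0, 1.0])           # experts' trust coefficients
advice = rng.dirichlet(np.ones(K), size=E)    # each row: one expert's distribution over arms

# EXP4 samples an arm from the trust-weighted average of the experts' advice.
p = weights @ advice / weights.sum()
assert np.isclose(p.sum(), 1.0)
action = rng.choice(K, p=p)
assert 0 <= action < K
# After observing the reward, EXP4 exponentially reweights each expert
# according to an importance-weighted estimate of the reward of its advice.
```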

2. LITERATURE REVIEW

The importance of exploration in RL is well understood. Count-based exploration in RL relies on UCB. Strehl and Littman (2008) develop the Bellman value iteration V(s) = max_a [R(s, a) + γE[V(s')] + β·N(s, a)^{-1/2}], where N(s, a) is the number of visits to (s, a) for state s and action a. The bonus N(s, a)^{-1/2} is positively correlated with the curiosity about (s, a) and encourages exploration. This method is limited to tabular model-based MDPs with small state spaces, while Bellemare et al. (2016) introduce Pseudo-Count exploration for non-tabular MDPs with density models. In conjunction with DQN, ε-greedy in Mnih et al. (2015) is a simple exploration technique. Besides ε-greedy, intrinsic model exploration computes intrinsic rewards by the accuracy of a model trained on experiences. Intrinsic rewards directly measure and incentivize exploration when added to the extrinsic (actual) rewards of RL, e.g. DORA in Fox et al. (2018) and Stadie et al. (2015). Intrinsic rewards in Stadie et al. (2015) are defined as e(s, a) = ||σ(s') - M_φ(σ(s), a)||₂², where M_φ is a parametric model, s' is the next state and σ is input extraction. The intrinsic reward e(s, a) relies on the stochastic transition from s to s' and brings noise to exploration. Random Network Distillation (RND) in Burda et al. (2018) addresses this by defining e(s, a) = ||f̂(s') - f(s')||₂², where f̂ is a trained parametric model and f is a randomly initialized but fixed model. Here e(s, a), independent of the transition, depends only on state s' and drives RND to outperform other algorithms on Montezuma's Revenge. None of these algorithms use several experts, which is a significant departure from our work. In terms of MAB regret analyses focusing on EXP-type algorithms, Auer et al. (2002b) first introduce EXP3.P for bounded adversarial MAB and EXP4 for contextual bandits.
Under the EXP3.P algorithm, an upper bound on regret of the order O(√T) is achieved, which matches the lower bound and hence establishes that EXP3.P is optimal. However, these regret bounds are not applicable to Gaussian MAB since rewards can be unbounded. Meanwhile, for unbounded MAB, Srinivas et al. (2010) demonstrate a regret bound of order O(√(T·γ_T)) for noisy Gaussian process bandits, where a reward observation contains noise. The information gain γ_T is not well-defined in a noiseless Gaussian setting. For noiseless Gaussian bandits, Grünewälder et al. (2010) show both the optimal lower and upper bounds on regret, but their regret definition is not consistent with the one used in Auer et al. (2002b). We establish a lower bound of the order Ω(T) for certain T and an upper bound of the order O*(√T) asymptotically on the regret of unbounded noiseless Gaussian MAB following the standard definitions of regret.
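The RND mechanism described in this section can be illustrated with a small sketch. All networks here are toy maps of our own choosing (a random fixed tanh target and a linear predictor), not the architectures of Burda et al. (2018); the point is only that the prediction error, used as the intrinsic reward, shrinks on frequently visited states.

```python
import numpy as np

rng = np.random.default_rng(0)
D, H = 8, 16   # state and feature dimensions (illustrative)

# Fixed, randomly initialized target network f (never trained).
W_f = rng.normal(size=(D, H))
f = lambda s: np.tanh(s @ W_f)

# Trained predictor network f_hat; here a linear map fit by SGD.
W_hat = np.zeros((D, H))

def intrinsic_reward(s_next):
    # e(s, a) = ||f_hat(s') - f(s')||^2 depends only on the next state s'.
    return float(np.sum((s_next @ W_hat - f(s_next)) ** 2))

def train_predictor(s_next, lr=0.01):
    global W_hat
    err = s_next @ W_hat - f(s_next)      # prediction error on s'
    W_hat -= lr * np.outer(s_next, err)   # gradient step on the squared error (up to a factor 2)

s = rng.normal(size=D)
before = intrinsic_reward(s)
for _ in range(300):
    train_predictor(s)
after = intrinsic_reward(s)
# A frequently visited state becomes "familiar": its intrinsic reward shrinks.
assert after < before
```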

3. REGRET BOUNDS FOR GAUSSIAN MAB

For Gaussian MAB with time horizon T, at step 0 < t ≤ T the rewards r(t) follow a multivariate normal N(µ, Σ), where µ = (µ_1, µ_2, ..., µ_K) is the mean vector and Σ = (a_ij), i, j ∈ {1, ..., K}, is the covariance matrix of the K arms. The player receives reward y_t = r_{a_t}(t) by pulling arm a_t. We use R_T = T·max_k µ_k - Σ_t E[y_t] to denote the pseudo regret, called simply regret. (Note that the alternative definition of regret, R̂_T = max_i Σ_{t=1}^T r_i(t) - Σ_{t=1}^T y_t, depends on realizations of rewards.)

3.1 LOWER BOUNDS ON REGRET

In this section we derive a lower bound for Gaussian and general MAB under an assumption; general MAB replaces the Gaussian distribution with a general one. The main technique is to construct instances, or sub-classes, that incur a certain regret no matter what strategies are deployed. We need the following assumption or setting.

Assumption 1 There are two types of arms, for general K, with one type being superior (S is the set of superior arms) and the other being inferior (I is the set of inferior arms). Let 1 - q, q be the proportions of the superior and inferior arms, respectively, which is known to the adversary, and clearly 0 ≤ q ≤ 1. The arms in S are indistinguishable and so are those in I. The first pull of the player has two steps. In the first step the player selects the inferior or superior set of arms based on P(S) = 1 - q and P(I) = q, and once a set is selected, the corresponding reward of an arm from the selected set is received.

An interesting special case of Assumption 1 is the case of two arms and q = 1/2. In this case, the player has no prior knowledge and in the first pull chooses an arm uniformly at random. The lower bound is defined as R_L(T) = inf sup R_T, where, first, inf is taken among all strategies and then sup is among all Gaussian MAB. All proofs are in the Appendix. The following is the main result with respect to lower bounds; it is based on the inferior arms being distributed as N(0, 1) and the superior ones as N(µ, 1) with µ > 0.
Theorem 1. In Gaussian MAB under Assumption 1, for any q ≥ 1/3 we have R_L(T) ≥ (q - ε)·µ·T, where µ has to satisfy G(q, µ) < q, with ε and T determined by

G(q, µ) < ε < q,   T ≤ (ε - G(q, µ)) / [(1 - q)·∫ |f_0(x) - f_1(x)| dx] + 2,

where f_0 and f_1 denote the densities of N(0, 1) and N(µ, 1), respectively, and

G(q, µ) = max( ∫ |q·f_0(x) - (1 - q)·f_1(x)| dx, ∫ |(1 - q)·f_0(x) - q·f_1(x)| dx ).

To prove Theorem 1, we construct a special subset of Gaussian MAB with equal variances and zero covariances. On these instances we find a unique way to explicitly represent any policy. This builds a connection between abstract policies and a concrete mathematical representation. We then show that the pseudo regret R_T must exceed certain values no matter what policies are deployed, which yields a regret lower bound on this subset of instances. The feasibility of the aforementioned conditions is established in the following theorem.

Theorem 2. In Gaussian MAB under Assumption 1, for any q ≥ 1/3, there exist µ and ε, ε < µ, such that R_L(T) ≥ (q - ε)·µ·T.

The following result, with two arms and equal probability in the first pull, deals with general distributions. Even in the case of Gaussian MAB it is not a special case of Theorem 2 since it is stronger.

Theorem 3. For general MAB under Assumption 1 with K = 2, q = 1/2, we have that R_L(T) ≥ T·µ/4 holds for any distributions f_0 for the arms in I and f_1 for the arms in S with ∫ |f_1(x) - f_0(x)| dx > 0 (possibly with unbounded support), for any µ > 0 and T satisfying T ≤ 1 / (2·∫ |f_0(x) - f_1(x)| dx) + 1.

The theorem establishes that for any fixed µ > 0 there is a finite set of horizons T and instances of Gaussian MAB such that no algorithm can achieve regret smaller than linear in T. Table 1 provides the relationship between µ and the largest such T in the Gaussian case where the inferior arms are distributed according to the standard normal and the superior arms have mean µ > 0 and variance 1. For example, there is no way to attain regret lower than T·10^{-4}/4 for any 1 ≤ T ≤ 2501. The function decreases very quickly.
Table 1: Upper bounds for T as a function of µ

µ                    10^{-5}    10^{-4}    10^{-3}    10^{-2}    10^{-1}
Upper bound for T    25001      2501       251        26         3.5

The established lower bound R_L(T) ≥ Ω(T) is of a larger order than known results for classical MAB. This is not surprising since the rewards in classical MAB are assumed to be bounded, while the rewards in our setting follow an unbounded Gaussian distribution, which apparently increases regret. Besides the known results of Ω(√T) for adversarial MAB and Ω(log T) for stochastic MAB, for noisy Gaussian process bandits, Srinivas et al. (2010) show a lower bound of order Ω(√(T·γ_T)). Our lower bound for Gaussian MAB is different from this lower bound. The information gain term γ_T of noisy Gaussian bandits is not well-defined in Gaussian MAB, and thus the two bounds are not comparable.
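As an illustrative numerical check of the feasibility condition of Theorem 1 (not part of the proofs), one can evaluate G(q, µ) by direct numerical integration and verify that G(q, µ) < q holds for small µ, as Proposition 1 in the Appendix asserts for q ≥ 1/3:

```python
import numpy as np

def G(q, mu):
    # Riemann-sum evaluation of the two integrals defining G(q, mu)
    x = np.linspace(-20.0, 20.0, 200001)
    dx = x[1] - x[0]
    f0 = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)          # N(0, 1) density
    f1 = np.exp(-(x - mu)**2 / 2) / np.sqrt(2 * np.pi)   # N(mu, 1) density
    g1 = np.sum(np.abs(q * f0 - (1 - q) * f1)) * dx
    g2 = np.sum(np.abs((1 - q) * f0 - q * f1)) * dx
    return max(g1, g2)

# Feasibility G(q, mu) < q for small mu, at a few values of q >= 1/3.
for q in (0.4, 0.5):
    for mu in (0.01, 0.1, 0.5):
        assert G(q, mu) < q
```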

3.2. UPPER BOUNDS ON REGRET

In this section, we establish upper bounds on the regret of Gaussian MAB by means of the EXP3.P algorithm (see Algorithm 1) from Auer et al. (2002b). We stress that rewards can be unbounded, without the boundedness assumption present in stochastic and adversarial MAB. We only consider non-degenerate Gaussian MAB where the variance of each arm is strictly positive, i.e. min_i a_ii > 0.

Algorithm 1: EXP3.P
Initialization: weights w_i(1) = exp((αδ/3)·√(T/K)) for i ∈ {1, 2, ..., K}, with α > 0 and δ ∈ (0, 1);
for t = 1, 2, ..., T do
    for i = 1, 2, ..., K do
        p_i(t) = (1 - δ)·w_i(t) / Σ_{j=1}^K w_j(t) + δ/K
    end
    Choose i_t randomly according to the distribution p_1(t), ..., p_K(t);
    Receive reward r_{i_t}(t);
    for j = 1, ..., K do
        x̂_j(t) = (r_j(t)/p_j(t))·1{j = i_t},   w_j(t+1) = w_j(t)·exp( (δ/(3K))·( x̂_j(t) + α/(p_j(t)·√(KT)) ) )
    end
end

Formally, we provide analyses of upper bounds on R̂_T with high probability, on E[R̂_T], and on R_T. In Auer et al. (2002b), EXP3.P is shown to yield a bound on the regret R̂_T with high probability in the bounded MAB setting. As part of our contributions, we show that the EXP3.P regret is of the order O*(√T) in the unbounded Gaussian MAB for R̂_T with high probability, for E[R̂_T], and for R_T. The results are summarized as follows. The density of N(µ, Σ) is denoted by f.

Theorem 4. For Gaussian MAB, any time horizon T and any 0 < η < 1, EXP3.P has regret

R̂_T ≤ 4∆(η)·( √(KT·log(KT/δ)) + 4·√((5/3)·KT·log K) + 8·log(KT/δ) )

with probability (1 - δ)·(1 - η)^T, where ∆(η) is determined by ∫_{-∆}^{∆} ... ∫_{-∆}^{∆} f(x_1, ..., x_K) dx_1 ... dx_K = 1 - η.

In the proof of Theorem 4, we first truncate the rewards of Gaussian MAB by splitting them into a bounded part and an unbounded tail throughout the game. For the bounded part, we directly borrow the regret upper bound of EXP3.P in Auer et al. (2002b) and conclude with a regret upper bound of order O(∆(η)·√T). Since the Gaussian distribution is light-tailed, we can control the shrinking probability of the tail, which leads to the overall result. The dependence of the bound on ∆ can be removed by considering large enough T, as stated next.

Theorem 5. For Gaussian MAB and any a > 2, 0 < δ < 1, EXP3.P has regret R̂_T ≤ log(1/δ)·O*(√T) with probability (1 - δ)·(1 - 1/T^a)^T. The constant behind O* depends on K, a, µ and Σ.

The above theorems deal with R̂_T, but the aforementioned lower bounds are with respect to the pseudo regret. To complete the analysis of Gaussian MAB, it is desirable to have an upper bound on the pseudo regret, which is established next. It is easy to verify by Jensen's inequality that R_T ≤ E[R̂_T], and thus it suffices to obtain an upper bound on E[R̂_T]. For adversarial and stochastic MAB, the upper bound on E[R̂_T] is of the same order as that on R̂_T, which follows by a simple argument. For Gaussian MAB, establishing an upper bound on E[R̂_T] or R_T based on R̂_T requires more work. We show an upper bound on E[R̂_T] by using select inequalities, limit theorems, and Rademacher complexity. The main result reads as follows.

Theorem 6. The regret of EXP3.P in Gaussian MAB satisfies R_T ≤ E[R̂_T] ≤ O*(√T).

All three theorems also hold for sub-Gaussian MAB, defined by replacing the Gaussian distribution with a sub-Gaussian one. This generalization is straightforward and is directly shown in the proof for Gaussian MAB in the Appendix. Optimal upper bounds for adversarial MAB and noisy Gaussian process bandits are of the same order as our upper bound: Auer et al. (2002b) derive an upper bound of the same order O(√T) as the lower bound for adversarial MAB, and for noisy Gaussian process bandits there is also no gap between the upper and lower bounds.
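A minimal runnable sketch of Algorithm 1 on a toy two-armed Gaussian bandit follows. The mixing rate δ chosen below is our own illustrative default, not a value prescribed by the analysis; the per-step weight rescaling is a numerical-stability device, harmless because only weight ratios matter.

```python
import numpy as np

def exp3p(rewards, alpha=1.0, rng=None):
    """Sketch of EXP3.P (Algorithm 1) on a T x K reward matrix; only the
    pulled arm's reward is observed and used at each step."""
    rng = rng or np.random.default_rng(0)
    T, K = rewards.shape
    delta = min(0.9, np.sqrt(K * np.log(K) / T))   # illustrative mixing rate
    w = np.full(K, np.exp((alpha * delta / 3) * np.sqrt(T / K)))
    total = 0.0
    for t in range(T):
        p = (1 - delta) * w / w.sum() + delta / K  # mixture of weights and uniform bias
        i = rng.choice(K, p=p)
        total += rewards[t, i]
        x_hat = np.zeros(K)
        x_hat[i] = rewards[t, i] / p[i]            # importance-weighted reward estimate
        w *= np.exp((delta / (3 * K)) * (x_hat + alpha / (p * np.sqrt(K * T))))
        w /= w.max()                               # rescale to avoid overflow
    return total

# Toy Gaussian bandit: arm 1 has mean 1, arm 0 has mean 0; EXP3.P should
# concentrate on arm 1 and clearly beat the ~0.5*T of uniform play.
rng = np.random.default_rng(1)
T = 5000
R = rng.normal(loc=[0.0, 1.0], scale=1.0, size=(T, 2))
assert exp3p(R) > 0.65 * T
```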

4. EXP4 ALGORITHM FOR RL

EXP4 has shown great success in contextual bandits. Therefore, in this section, we extend EXP4 to RL and develop EXP4-RL, presented in Algorithm 2. The player has experts that are represented by deep Q-networks trained by RL algorithms (there is a one-to-one correspondence between the experts and the Q-networks). Each expert also has a trust coefficient. Trust coefficients are updated exponentially based on reward estimates, as in EXP4. At each step of an episode, the player samples an expert (Q-network) with probability proportional to the weighted average of the experts' trust coefficients. Then ε-greedy DQN is applied to the chosen Q-network. Here, differently from EXP4, the player needs to store all interaction tuples in an experience buffer, since RL is an MDP. After each episode, the player trains all Q-networks on the experience buffer and uses the trained networks as experts for the next episode.

Algorithm 2: EXP4-RL
Initialization: trust coefficients w_k = 1 for every k ∈ {1, ..., E}, E = number of experts (Q-networks), K = number of actions, ∆, ε, η > 0, temperatures z, τ > 0, n_r = -∞ (a running upper bound on reward);
while True do
    Initialize the episode by setting s_0;
    for i = 1, 2, ..., T (length of episode) do
        Observe state s_i;
        Let the probability of the Q_k-network be ρ_k = (1 - η)·w_k / Σ_{k'=1}^E w_{k'} + η/E;
        Sample network k̂ according to {ρ_k}_k;
        For the Q_k̂-network, use ε-greedy to sample an action: a* = argmax_a Q_k̂(s_i, a), π_j = (1 - ε)·1{j = a*} + (ε/(K - 1))·1{j ≠ a*}, j ∈ {1, 2, ..., K};
        Sample action a_i based on π;
        Interact with the environment to receive reward r_i and next state s_{i+1};
        n_r = max{r_i, n_r};
        Update the trust coefficient w_k of each Q_k-network as follows:
            P_k = ε-greedy(Q_k),   x̂_{k,j} = 1 - (1{j = a_i}/(P_{k,j} + ∆))·(n_r - r_i), j ∈ {1, 2, ..., K},   ŷ_k = E_{j∼P_k}[x̂_{k,j}],   w_k = w_k·e^{ŷ_k/z};
        Store (s_i, a_i, r_i, s_{i+1}) in the experience replay buffer B;
    end
    Update each expert's Q_k-network from buffer B;
end

The basic idea is the same as in EXP4, with the experts that give advice vectors being deep Q-networks. It is a combination of deep neural networks with EXP4 updates. From a different perspective, we can view it as an ensemble method, as in classification (Xia et al. (2011)), treating Q-networks as ensemble members in RL instead of classification algorithms. While the experts do not necessarily have to be Q-networks, i.e., other experts can be used, these are natural in a DQN framework. In our implementation and experiments we use two experts, thus E = 2 with two Q-networks. The first one is based on RND (Burda et al. (2018)) while the second one is a simple DQN. To this end, in the algorithm, before storing to the buffer, we also record c_r^i = ||f̂(s_i) - f(s_i)||², the RND intrinsic reward as in Burda et al. (2018). This value is then added to the 4-tuple pushed to B. When updating Q_1, corresponding to RND, at the end of an iteration of the algorithm, we modify the Q_1-network using r_j + c_r^j and update f̂ using c_r^j. Network Q_2, pertaining to ε-greedy, is updated directly using r_j. Intuitively, Algorithm 2 circumvents the local-exploration drawback of RND with the total exploration guided by two experts with EXP4-updated trust coefficients. When the RND expert drives high exploration, its trust coefficient leads to high total exploration. When it has low exploration, the second expert, DQN, should have a high one and it incentivizes the total exploration accordingly. Trust coefficients are updated by reward estimates iteratively as in EXP4, so they keep track of the long-term performance of the experts and thus guide the total exploration globally. These dynamics of EXP4 combined with intrinsic rewards promote global exploration.
The experimental results in the next section verify this intuition regarding the exploration behind Algorithm 2. We point out that potentially more general RL algorithms based on Q-factors can be used, e.g., bootstrapped DQN (Osband et al. (2016)), randomized prior DQN (Osband et al. (2018)) or adaptive ε-greedy VDBE (Tokic (2010)). Furthermore, the experts in EXP4 can even be policy networks trained by PPO (Schulman et al. (2017)) instead of DQN. These possibilities demonstrate the flexibility of the EXP4-RL algorithm. As a numerical demonstration of the superior performance and exploration incentive of Algorithm 2, we show improvements over baselines on two hard-to-explore RL games, Mountain Car and Montezuma's Revenge. More precisely, we show that the real reward on Mountain Car improves significantly under Algorithm 2 in Section 5.1. We then implement Algorithm 2 on Montezuma's Revenge and show a growing and remarkable improvement in exploration in Section 5.2. The intrinsic reward c_r^i = ||f̂(s_i) - f(s_i)||² given by the intrinsic model f̂ represents the exploration of RND in Burda et al. (2018), as introduced in Sections 2 and 4. We use the same criterion for evaluating the exploration performance of our algorithm and RND herein. RND incentivizes local exploration with the single-step intrinsic reward, but lacks global exploration.
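The trust-coefficient update at the heart of Algorithm 2 can be sketched as follows. The expert policies P_k stand in for ε-greedy(Q_k), and the estimate x̂ (with ∆ smoothing the importance weight and n_r - r_i acting as a shortfall relative to the best seen reward) follows our reading of the update rule above, so treat the exact form as an assumption of this sketch.

```python
import numpy as np

def exp4rl_trust_update(w, P, a, r, n_r, z=0.1, Delta=0.01):
    """One trust-coefficient update of Algorithm 2 (our reconstruction).
    w: trust coefficients (E,); P: each expert's action distribution (E, K);
    a: action taken; r: reward received; n_r: running reward upper bound."""
    E, K = P.shape
    n_r = max(n_r, r)
    shortfall = n_r - r                 # nonnegative "loss" w.r.t. the best seen reward
    for k in range(E):
        x_hat = np.ones(K)
        x_hat[a] = 1 - shortfall / (P[k, a] + Delta)   # importance weighting, smoothed by Delta
        y_hat = P[k] @ x_hat            # expected estimate under expert k's distribution
        w[k] *= np.exp(y_hat / z)
    return w / w.max(), n_r             # rescale for numerical stability

# Two experts over 3 actions; the played action 1 turns out poor (reward 0
# while the best seen reward is 1), so the expert recommending it loses trust.
w = np.ones(2)
P = np.array([[0.90, 0.05, 0.05],      # expert 0 avoids action 1
              [0.05, 0.90, 0.05]])     # expert 1 recommends action 1
w, n_r = exp4rl_trust_update(w, P, a=1, r=0.0, n_r=1.0)
assert w[0] > w[1]
```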

5.1. MOUNTAIN CAR

In this section, we summarize the experimental results of Algorithm 2 on Mountain Car, a classical control RL game. This game has very sparse positive rewards, which makes exploration both necessary and hard. The blog post Rivlin (2019) shows that RND based on DQN improves the performance of traditional DQN, since RND has an intrinsic reward to incentivize exploration. We use RND on DQN from Rivlin (2019) as the baseline and show the real-reward improvement of Algorithm 2, which supports the intuition behind and the superiority of the algorithm. The comparison between Algorithm 2 and RND is presented in Figure 1. Here the x-axis is the epoch number and the y-axis is the cumulative reward of that epoch. Figure 1a shows the raw-data comparison between EXP4-RL and RND. We observe that though at first RND has several spikes exceeding those of EXP4-RL, EXP4-RL has much higher rewards than RND after 300 epochs. Overall, the relative difference in the areas under the curves (AUC) is 4.9% for EXP4-RL over RND, which indicates a significant improvement by our algorithm. The improvement is better illustrated in Figure 1b with the smoothed reward values: there is a notable difference between EXP4-RL and RND. Note that the maximum reward hit by EXP4-RL is -86 and the one by RND is -118, which additionally demonstrates our improvement over RND. We conclude that Algorithm 2 performs better than the RND baseline and that the improvement increases in the later training stage. Exploration brought by Algorithm 2 gains real reward on this hard-to-explore game, compared to the RND counterpart (without the DQN expert). The power of our algorithm can be enhanced by adopting more complex experts, not limited to DQN.
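The relative AUC metric used above can be computed as sketched below; the reward curves here are made-up stand-ins for the per-epoch rewards of EXP4-RL and RND, and the unit-spacing sum is an assumed reading of "area under the curve".

```python
import numpy as np

def relative_auc_gain(curve_a, curve_b):
    """Relative difference of areas under two per-epoch reward curves,
    (AUC_a - AUC_b) / |AUC_b|; with unit epoch spacing the area is a sum."""
    return (np.sum(curve_a) - np.sum(curve_b)) / abs(np.sum(curve_b))

# Hypothetical per-epoch rewards (Mountain Car rewards are negative).
exp4rl = np.array([-150.0, -120.0, -100.0, -90.0])
rnd    = np.array([-160.0, -140.0, -130.0, -125.0])
gain = relative_auc_gain(exp4rl, rnd)
assert gain > 0   # less negative area: EXP4-RL ahead on this toy data
```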

5.2. MONTEZUMA'S REVENGE AND PURE EXPLORATION SETTING

In this section, we present the experimental details of Algorithm 2 on Montezuma's Revenge, another notoriously hard-to-explore RL game. The benchmark on Montezuma's Revenge is RND based on DQN, which achieves a reward of zero in our environment (the PPO algorithm reported in Burda et al. (2018) has reward 8,000 with many more computing resources; we ran the PPO-based RND with 10 parallel environments and 800 epochs and observed that the reward is also 0), which indicates that DQN has room for improvement regarding exploration. To this end, we first implement the DQN-version RND (called simply RND hereafter) on Montezuma's Revenge as our benchmark by replacing PPO with DQN. Then we implement Algorithm 2 with the two experts mentioned above. Our computing environment allows at most 10 parallel environments. In subsequent figures the x-axis always corresponds to the number of epochs. The RND update probability is the proportion of experiences used for training the intrinsic model f̂ (Burda et al., 2018). A comparison between Algorithm 2 (EXP4-RL) and RND without parallel environments (the update probability is 100% since there is a single environment) is shown in Figure 2, with the emphasis on exploration by means of the intrinsic reward. We use 3 different burn-in periods (58, 68, 167 burn-in epochs) to remove the initial training steps, as is common in Gibbs sampling. Overall, EXP4-RL outperforms RND with many significant spikes in the intrinsic rewards. The larger the number of burn-in epochs, the more significant the dominance of EXP4-RL over RND. EXP4-RL has much higher exploration than RND at some epochs and stays close to RND at others; at some epochs, EXP4-RL even has 6 times higher exploration. The relative differences in the areas under the curves are 6.9%, 17.0% and 146.0%, respectively, which quantifies the much better performance of EXP4-RL.
We next compare EXP4-RL and RND with 10 parallel environments and different RND update probabilities in Figure 3. The experiences are generated by the 10 parallel environments. Figure 3a shows that both experts in EXP4-RL are learning, with decreasing losses of their Q-networks. The drop is steeper for the RND expert, but it starts with a higher loss. With RND update probability 0.25, in Figure 3b we observe that EXP4-RL and RND are very close when RND exhibits high exploration. When RND is at its local minima, EXP4-RL outperforms it. Usually these local minima are caused by sticking to local maxima and then training the model intensively at those maxima, typical of the local exploration behavior of RND. EXP4-RL improves on RND as training progresses, e.g. the improvement after 550 epochs is higher than the one between epochs 250 and 550. In terms of AUC, this is expressed by 1.6% and 3.5%, respectively. Overall, EXP4-RL improves RND's local minima of exploration, keeps the high exploration of RND and induces smoother global exploration. With the update probability of 0.125, in Figure 3c EXP4-RL almost always outperforms RND with a notable difference. The improvement also increases with epochs and is dramatically larger at RND's local minima. These local minima appear more frequently in the training of RND, so our improvement is more significant as well as crucial. The relative AUC improvement is 49.4%. The excellent performance in Figure 3c additionally shows that EXP4-RL improves RND with global exploration by improving the local minima of RND and by not staying at local maxima. Overall, with either 0.25 or 0.125 RND update probability, EXP4-RL improves the exploration of RND.

A DETAILS ABOUT NUMERICAL EXPERIMENTS

A.1 MOUNTAIN CAR

For the Mountain Car experiment, we use the Adam optimizer with a 2·10^{-4} learning rate. The batch size for updating the models is 64, with a replay buffer size of 10,000. The remaining parameters are as follows: the discount factor for the Q-networks is 0.95, the temperature parameter τ is 0.1, η is 0.05, and ε decays exponentially with respect to the number of steps, with maximum 0.9 and minimum 0.05. The length of one epoch is 200 steps. The target networks load the weights and biases of the trained networks every 400 steps. Since a reward upper bound is known in advance, we use n_r = 1. We next describe the structure of the neural networks used in the experiment. The neural networks of both experts are fully connected. The RND expert's network has an input layer with 2 input neurons, followed by a hidden layer with 64 neurons, and then a two-headed output layer. The first output head represents the Q-values, taking the 64 hidden neurons as input and having as many output neurons as actions, while the second output head corresponds to the intrinsic values, with 1 output neuron. For the DQN expert, the only difference lies in the absence of the second output head.
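The two-headed architecture just described can be sketched as a plain forward pass. The layer sizes come from the text; the random initialization and the ReLU activation are our own illustrative assumptions, not stated in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_actions = 2, 64, 3   # Mountain Car: 2 state variables, 3 actions

W1 = rng.normal(scale=0.1, size=(n_in, n_hidden))       # shared hidden layer
Wq = rng.normal(scale=0.1, size=(n_hidden, n_actions))  # head 1: Q-values
Wv = rng.normal(scale=0.1, size=(n_hidden, 1))          # head 2: intrinsic value

def forward(state):
    h = np.maximum(0.0, state @ W1)    # hidden activation (ReLU assumed)
    return h @ Wq, h @ Wv              # (Q-values, intrinsic value)

q_values, intrinsic_value = forward(np.array([0.3, -0.01]))
assert q_values.shape == (n_actions,) and intrinsic_value.shape == (1,)
# The DQN expert's network is identical except it has no intrinsic-value head.
```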

A.2 MONTEZUMA'S REVENGE

For the Montezuma's Revenge experiment, we use the Adam optimizer with a 10^{-5} learning rate. The other parameters read: the mini-batch size is 4, the replay buffer size is 1,000, the discount factor for the Q-networks is 0.999 and the same value is used for the intrinsic value head, the temperature parameter τ is 0.1, η is 0.05, and ε increases exponentially with minimum 0.05 and maximum 0.9. The length of one epoch is 100 steps. Target networks are updated every 300 steps. Pre-normalization lasts 50 epochs and the weights for the intrinsic and extrinsic values in the first network are 1 and 2, respectively. The upper bound on reward is set to the constant n_r = 1. For the structure of the neural networks, we use CNN architectures since we are dealing with videos. More precisely, for the Q-network of the DQN expert in EXP4-RL and the predictor network f̂ for computing the intrinsic rewards, we use AlexNet (Krizhevsky et al. (2012)) pretrained on ImageNet (Deng et al. (2009)). The number of output neurons of the final layer is 18, the number of actions in Montezuma's Revenge. For the RND baseline and the RND expert in EXP4-RL, we customize the Q-network with different linear layers while keeping all but the final layer of the pretrained AlexNet. Here we have two final linear layers representing two value heads, the extrinsic value head and the intrinsic value head. The number of output neurons in the first value head is again 18, while the second value head has 1 output neuron. More details about the setup of the experiment on Montezuma's Revenge follow. The experiment of RND with PPO in Burda et al. (2018) uses many more resources, such as 1024 parallel environments, and runs 30,000 epochs for each environment. Parallel environments generate experiences simultaneously and store them in the replay buffer. Our computing environment allows at most 10 parallel environments.
For the DQN version of RND (called simply RND hereafter), we use the same settings as Burda et al. (2018), such as observation normalization, intrinsic reward normalization, and random initialization. The RND update probability is the proportion of experiences in the replay buffer that are used for training the intrinsic model $f$ in RND (Burda et al., 2018). In our experiment, we compare the performance under RND update probabilities of 0.125 and 0.25.
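The intrinsic-reward mechanism referenced above can be sketched as follows. This is a minimal PyTorch sketch in the spirit of RND (Burda et al., 2018); the layer sizes and names are illustrative, not the ones used in our experiments.

```python
import torch
import torch.nn as nn

# RND idea: the intrinsic reward is the prediction error of a trained
# predictor network against a fixed, randomly initialized target network;
# novel states yield larger errors and hence larger exploration bonuses.
target = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 8))
predictor = nn.Sequential(nn.Linear(4, 32), nn.ReLU(), nn.Linear(32, 8))
for p in target.parameters():
    p.requires_grad_(False)  # the target network is never trained

def intrinsic_reward(obs: torch.Tensor) -> torch.Tensor:
    # per-state novelty: mean squared error between predictor and target features
    return ((predictor(obs) - target(obs)) ** 2).mean(dim=-1)
```

In training, only a fraction of the replay buffer (the RND update probability above) would be used to fit `predictor` toward `target` by minimizing this same error.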

B PROOF OF RESULTS IN SECTION 3.1

For brevity, we define $n = T-1$. We start by showing the following proposition, which is used in the proofs.

Proposition 1. Let $G(q,\mu)$, $q$, and $\mu$ be defined as in Theorem 1. Then for any $q \ge 1/3$ there exists a $\mu$ that satisfies the constraint $G(q,\mu) < q$.

Proof. Let us denote
$$G_1 = \int |q f_0(x) - (1-q) f_1(x)|\,dx, \qquad G_2 = \int |(1-q) f_0(x) - q f_1(x)|\,dx.$$
Then we have
$$G_1(q,\mu) = \int (q f_0(x) - (1-q) f_1(x))\,\mathbf{1}_{q f_0(x) > (1-q) f_1(x)}\,dx + \int \big((1-q) f_1(x) - q f_0(x)\big)\,\mathbf{1}_{q f_0(x) < (1-q) f_1(x)}\,dx.$$
Since $f_0 = N(0,1)$ and $f_1 = N(\mu,1)$, the condition $q f_0(x) > (1-q) f_1(x)$ is equivalent to $x < g(\mu)$ with
$$g(\mu) = \frac{\mu}{2} - \frac{1}{\mu}\log\frac{1-q}{q},$$
and therefore
$$G_1(q,\mu) = \frac{1}{\sqrt{2\pi}} \left[ q \int_{-g(\mu)}^{g(\mu)} e^{-x^2/2}\,dx - (1-q) \int_{-g(\mu)+\mu}^{g(\mu)-\mu} e^{-x^2/2}\,dx \right].$$
Similarly we get
$$G_2(q,\mu) = \frac{1}{\sqrt{2\pi}} \left[ (1-q) \int_{-g(\mu)}^{g(\mu)} e^{-x^2/2}\,dx - q \int_{-g(\mu)+\mu}^{g(\mu)-\mu} e^{-x^2/2}\,dx \right].$$
It is easy to establish continuity of $G_1(q,\mu)$ and $G_2(q,\mu)$ in $\mu$ on $[0,\infty)$, as well as the continuity of $G(q,\mu) = \max(G_1, G_2)$. The boundary values are
$$G(q,0) = |1-2q|, \qquad \lim_{\mu\to\infty} G(q,\mu) = \max(q, 1-q).$$
Since $q \ge 1/3$, we have $|1-2q| < q$. By continuity of $G(q,\mu)$, there exists $\mu_0 > 0$ such that $G(q,\mu) < q$ for any $\mu \le \mu_0$.

Proof of Theorem 1. As in Assumption 1, let the inferior arm set be $I$ and the superior one be $S$, with $P(I) = q$ and $P(S) = 1-q$. Arms in $I$ follow $f_0(x) = N(0,1)$ and arms in $S$ follow $f_1(x) = N(\mu,1)$ with $\mu > 0$. According to Assumption 1, at the first step the player pulls an arm from either $I$ or $S$ and receives reward $y_1$. At time step $i > 1$ the reward is $y_i$; let $b_i$ represent the player's policy, defined by $b_i = 1$ if the arm chosen at step $i$ is not in the same arm set as the initial arm, and $b_i = 0$ otherwise. Let $a_i \in \{0,1\}$ indicate the arm set actually played at step $i$: it suffices to specify whether the arm is in $I$ ($a_i = 0$) or $S$ ($a_i = 1$), since the arms within $I$ and within $S$ are identical. The connection between $a_i$ and $b_i$ is explicitly given by $b_i = |a_i - a_1|$.
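As a quick numerical sanity check on Proposition 1 (our own illustration, not part of the proof), one can integrate the definition of $G(q,\mu)$ directly and verify that $G(q,0) = |1-2q|$ and that $G(q,\mu) < q$ for a small $\mu$:

```python
import math

def gauss(x, mu=0.0):
    # standard-variance Gaussian density N(mu, 1)
    return math.exp(-(x - mu) ** 2 / 2) / math.sqrt(2 * math.pi)

def G(q, mu, lo=-12.0, hi=12.0, n=50_000):
    # midpoint-rule integration of both L1 distances in the definition of G
    h = (hi - lo) / n
    g1 = g2 = 0.0
    for i in range(n):
        x = lo + (i + 0.5) * h
        f0, f1 = gauss(x), gauss(x, mu)
        g1 += abs(q * f0 - (1 - q) * f1) * h
        g2 += abs((1 - q) * f0 - q * f1) * h
    return max(g1, g2)

q = 0.4
print(abs(G(q, 0.0) - abs(1 - 2 * q)) < 1e-3)  # True: G(q, 0) = |1 - 2q|
print(G(q, 0.1) < q)                            # True: G(q, mu) < q for small mu
```

The constants here (q = 0.4, mu = 0.1, the integration range) are arbitrary choices for illustration.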
By Assumption 1, it is easy to argue that $b_i = S_i(y_1, y_2, \ldots, y_{i-1})$ for a set of functions $S_2, S_3, \ldots, S_n, S_{n+1}$. We proceed with the following lemma.

Lemma 1. Let the rewards of the arms in set $I$ follow any $L^1$ density $f_0(x)$ and in set $S$ any $L^1$ density $f_1(x)$, where the means satisfy $\mu(f_1) > \mu(f_0)$. Let $B$ be the number of plays of arms from set $S$ during the game, and assume the player meets Assumption 1. Then, no matter what strategy the player takes,
$$\frac{E[B] - (1-q)(n+1)}{n+1} \le \varepsilon,$$
where $\varepsilon, T, f_0, f_1$ satisfy
$$G(q, f_0, f_1) + (1-q)(n-1) \int |f_0(x) - f_1(x)|\,dx \le \varepsilon,$$
$$G(q, f_0, f_1) = \max\left( \int |q f_0(x) - (1-q) f_1(x)|\,dx,\; \int |(1-q) f_0(x) - q f_1(x)|\,dx \right).$$

Proof. We have
$$E[B] = \int (a_1 + a_2 + \cdots + a_{n+1})\, f_{a_1}(y_1) f_{a_2}(y_2) \cdots f_{a_n}(y_n)\,dy_1\,dy_2 \cdots dy_n.$$
If $a_1 = 0$, then $a_i = b_i$ and
$$E[B \mid a_1 = 0] = \int \big(0 + b_2(y_{1:1}) + \cdots + b_{n+1}(y_{1:n})\big)\, f_0(y_1) f_{b_2}(y_2) \cdots f_{b_n}(y_n)\,dy_1 \cdots dy_n.$$
If $a_1 = 1$, then $1 - a_i = b_i$ and
$$E[B \mid a_1 = 1] = \int \big(1 + (1 - b_2(y_{1:1})) + \cdots + (1 - b_{n+1}(y_{1:n}))\big)\, f_1(y_1) \cdots f_{1-b_n}(y_n)\,dy_1 \cdots dy_n.$$
This gives us
$$E[B] = q\, E[B \mid a_1 = 0] + (1-q)\, E[B \mid a_1 = 1] = (1-q)(n+1) + \int (b_2 + \cdots + b_{n+1}) \big( q\, f_0(y_1) \cdots f_{b_n}(y_n) - (1-q)\, f_1(y_1) \cdots f_{1-b_n}(y_n) \big)\,dy_1 \cdots dy_n.$$
By defining $b_1 = 0$, we have
$$E[B] = (1-q)(n+1) + \int (b_2 + \cdots + b_{n+1}) \left( q \prod_{i=1}^n f_{b_i}(y_i) - (1-q) \prod_{i=1}^n f_{1-b_i}(y_i) \right) dy_1 \cdots dy_n.$$
For any $1 \le m \le n$ we also derive
$$\int \left| \prod_{i=1}^m f_{b_i}(y_i) - \prod_{i=1}^m f_{1-b_i}(y_i) \right| dy_1 \cdots dy_m \le \int \prod_{i=1}^{m-1} f_{b_i}(y_i)\, \big| f_{b_m}(y_m) - f_{1-b_m}(y_m) \big|\, dy_1 \cdots dy_m + \int \left| \prod_{i=1}^{m-1} f_{b_i}(y_i) - \prod_{i=1}^{m-1} f_{1-b_i}(y_i) \right| f_{1-b_m}(y_m)\,dy_1 \cdots dy_m$$
$$\le \int |f_0(x) - f_1(x)|\,dx + \int \left| \prod_{i=1}^{m-1} f_{b_i}(y_i) - \prod_{i=1}^{m-1} f_{1-b_i}(y_i) \right| dy_1 \cdots dy_{m-1} \le \cdots \le m \int |f_0(x) - f_1(x)|\,dx. \tag{1}$$
This provides
$$\frac{E[B] - (1-q)(n+1)}{n+1} \le \int \left| q \prod_{i=1}^n f_{b_i}(y_i) - (1-q) \prod_{i=1}^n f_{1-b_i}(y_i) \right| dy_1 \cdots dy_n$$
$$\le \int \prod_{i=1}^{n-1} f_{b_i}(y_i)\, \big| q\, f_{b_n}(y_n) - (1-q)\, f_{1-b_n}(y_n) \big|\, dy_1 \cdots dy_n + (1-q) \int \left| \prod_{i=1}^{n-1} f_{b_i}(y_i) - \prod_{i=1}^{n-1} f_{1-b_i}(y_i) \right| f_{1-b_n}(y_n)\,dy_1 \cdots dy_n$$
$$\le \max\left( \int |q f_0(x) - (1-q) f_1(x)|\,dx,\; \int |(1-q) f_0(x) - q f_1(x)|\,dx \right) + (1-q)(n-1) \int |f_0(x) - f_1(x)|\,dx,$$
where the last inequality follows from (1). The statement of the lemma now follows.

According to Proposition 1, there is $\mu$ satisfying the constraint $G(q,\mu) < q$; note that $G(q,\mu) = G(q, f_0, f_1)$. Then we can choose $\varepsilon$ to be any quantity with $G(q,\mu) < \varepsilon < q$. Finally, there is $T$ satisfying
$$T \le \frac{\varepsilon - G(q,\mu)}{(1-q) \int |f_0(x) - f_1(x)|\,dx} + 2,$$
which gives $G(q,\mu) + (1-q)(T-2) \int |f_0(x) - f_1(x)|\,dx \le \varepsilon$. Choosing $\varepsilon, T, \mu$ as above, Lemma 1 yields
$$\frac{E[B] - (1-q)T}{T} < \varepsilon,$$
which is equivalent to $E[B] < (1-q+\varepsilon)T$. Therefore, with $A$ denoting the number of arm pulls from $I$, the regret $R_T$ satisfies
$$R_T = \sum_t \max_k(\mu_k) - \sum_t E[y_t] = T\mu - \sum_t E[y_t] = T\mu - \big(E[B]\,\mu + E[A] \cdot 0\big) \ge T\mu - (1-q+\varepsilon)\mu T = (q-\varepsilon)\mu T.$$
This yields $R_T^L = \inf \sup R_T \ge (q-\varepsilon)\,\mu T$.

Theorem 2 follows from Theorem 1 and Proposition 1.

Proof of Theorem 3. The assumption here is the special case of Assumption 1 with two arms and $q = 1/2$: set $I$ follows $f_0$ and $S$ follows $f_1$, where $\mu(f_0) < \mu(f_1)$. In the same way as in the proof of Theorem 1 we obtain $R^L(T) \ge (\tfrac12 - \varepsilon)\,T\mu$ under the constraint $\tfrac{n}{2}\,\mathrm{TV}(f_0, f_1) < \varepsilon$, where $\mathrm{TV}(f_0, f_1) = \int |f_0 - f_1|\,dx$ stands for the total variation; here we use $G(1/2, \mu) = \tfrac12\,\mathrm{TV}(f_0, f_1)$. Setting $\varepsilon = 1/4$ yields the statement.
In the Gaussian case it turns out that $\varepsilon = 1/4$ yields the highest bound. For the total variation of the Gaussian variables $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$, Devroye et al. (2018) show that
$$\mathrm{TV}\big(N(\mu_1,\sigma_1^2),\, N(\mu_2,\sigma_2^2)\big) \le \frac{3|\sigma_1^2 - \sigma_2^2|}{2\sigma_1^2} + \frac{|\mu_1 - \mu_2|}{2\sigma_1},$$
which in our case yields $\mathrm{TV} \le \mu/2$. From this we obtain $\mu T \ge \varepsilon$ and in turn $R_T^L \ge \varepsilon\,(\tfrac12 - \varepsilon)$. The maximum of the right-hand side is attained at $\varepsilon = 1/4$. This justifies the choice of $\varepsilon$ in the proof of Theorem 3.

C PROOF OF RESULTS IN SECTION 3.2

C.1 PROOF FOR THEOREM 4

Proof. Since the rewards can be unbounded in our setting, we truncate the reward at any level $\Delta > 0$: for any arm $i$,
$$r_i^t = \tilde r_i^t + \hat r_i^t, \qquad \tilde r_i^t = r_i^t \cdot \mathbf{1}_{(-\Delta \le r_i^t \le \Delta)}, \qquad \hat r_i^t = r_i^t \cdot \mathbf{1}_{(|r_i^t| > \Delta)}.$$
Then, for any parameter $0 < \eta < 1$, we choose $\Delta$ such that
$$P(r_i^t = \tilde r_i^t,\; i \le K) = P(-\Delta \le r_1^t \le \Delta, \ldots, -\Delta \le r_K^t \le \Delta) = \int_{-\Delta}^{\Delta} \!\!\cdots\! \int_{-\Delta}^{\Delta} f(x_1, \ldots, x_K)\,dx_1 \cdots dx_K \ge 1 - \eta.$$
The existence of such $\Delta = \Delta(\eta)$ follows from elementary calculus. Let $A = \{|r_i^t| \le \Delta \text{ for every } i \le K,\; t \le T\}$. The probability of this event is
$$P(A) = P(r_i^t = \tilde r_i^t,\; i \le K,\; t \le T) \ge (1-\eta)^T.$$
Hence, with probability at least $(1-\eta)^T$, the rewards of the player are bounded in $[-\Delta, \Delta]$ throughout the game. Then
$$R_T^B = \sum_{t=1}^T \Big( \max_i \tilde r_i^t - \tilde r_{a_t}^t \Big) \le T\Delta - \sum_{t=1}^T \tilde r_{a_t}^t$$
is the regret under event $A$, i.e. $R_T = R_T^B$ with probability $(1-\eta)^T$. For the EXP3.P algorithm and $R_T^B$, for every $\delta > 0$, according to Auer et al. (2002b) we have
$$R_T^B \le 4\Delta \sqrt{KT \log\tfrac{KT}{\delta}} + 4\sqrt{\tfrac53 KT \log K} + 8 \log\tfrac{KT}{\delta}$$
with probability $1 - \delta$. Then we have
$$R_T \le 4\Delta(\eta) \sqrt{KT \log\tfrac{KT}{\delta}} + 4\sqrt{\tfrac53 KT \log K} + 8 \log\tfrac{KT}{\delta}$$
with probability $(1-\delta)\,(1-\eta)^T$.

C.2 PROOF FOR THEOREM 5

Lemma 2. For any non-decreasing differentiable function $\Delta = \Delta(T) > 0$ satisfying $\lim_{T\to\infty} \frac{\Delta(T)^2}{\log T} = \infty$ and $\lim_{T\to\infty} \Delta'(T) \le C_0 < \infty$, and any $0 < \delta < 1$ and $a > 2$, we have
$$P\Big( R_T \le \Delta(T) \cdot \log(1/\delta) \cdot O^*(\sqrt T) \Big) \ge (1-\delta)\Big(1 - \frac{1}{T^a}\Big)^T$$
for any $T$ large enough.

Proof.
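To illustrate the choice of $\Delta(\eta)$ in the proof of Theorem 4, the following sketch (our own illustration, assuming $K$ independent standard Gaussian arms rather than a general joint density $f$) computes the smallest truncation level achieving the required coverage:

```python
from statistics import NormalDist

# With K independent N(0, 1) rewards per step we need
# P(-D <= r_i <= D for all i) = (2 * Phi(D) - 1) ** K >= 1 - eta,
# which inverts to a closed-form truncation level D.
def truncation_level(eta: float, K: int) -> float:
    per_arm = (1 - eta) ** (1 / K)           # required per-arm coverage
    return NormalDist().inv_cdf((1 + per_arm) / 2)

D = truncation_level(eta=0.01, K=10)
covered = (2 * NormalDist().cdf(D) - 1) ** 10
print(abs(covered - 0.99) < 1e-6)  # True: coverage matches 1 - eta
```

The values eta = 0.01 and K = 10 are arbitrary; for a correlated joint density the level would instead be found numerically.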
Let $a > 2$ and denote
$$F(y) = \int_{-y}^{y} f(x_1, x_2, \ldots, x_K)\,dx_1\,dx_2 \cdots dx_K, \qquad \zeta(T) = F(\Delta(T) \cdot \mathbf{1}) - \Big(1 - \frac{1}{T^a}\Big)$$
for $y \in \mathbb{R}^K$ and $\mathbf{1} = (1, \ldots, 1) \in \mathbb{R}^K$. Let also $y_{-i} = (y_1, \ldots, y_{i-1}, y_{i+1}, \ldots, y_K)$ and $x|_{x_i = y} = (x_1, \ldots, x_{i-1}, y, x_{i+1}, \ldots, x_K)$. We have $\lim_{T\to\infty} \zeta(T) = 0$. The gradient of $F$ can be estimated as
$$\nabla F \le \left( \int_{-y_{-1}}^{y_{-1}} f(x|_{x_1 = y_1})\,dx_2 \cdots dx_K,\; \ldots,\; \int_{-y_{-K}}^{y_{-K}} f(x|_{x_K = y_K})\,dx_1 \cdots dx_{K-1} \right).$$
By the chain rule and since $\Delta'(T) \ge 0$, we have
$$\frac{dF(\Delta(T)\cdot\mathbf{1})}{dT} \le \int_{-\Delta(T)\mathbf{1}_{-1}}^{\Delta(T)\mathbf{1}_{-1}} f\big(x|_{x_1 = \Delta(T)}\big)\,dx_2 \cdots dx_K \cdot \Delta'(T) + \cdots + \int_{-\Delta(T)\mathbf{1}_{-K}}^{\Delta(T)\mathbf{1}_{-K}} f\big(x|_{x_K = \Delta(T)}\big)\,dx_1 \cdots dx_{K-1} \cdot \Delta'(T).$$
Next we consider
$$\int_{-\Delta(T)\mathbf{1}_{-i}}^{\Delta(T)\mathbf{1}_{-i}} f\big(x|_{x_i = \Delta(T)}\big)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_K = e^{-\frac12 a_{ii}\Delta(T)^2 + \mu_i \Delta(T)} \int_{-\Delta(T)\mathbf{1}_{-i}}^{\Delta(T)\mathbf{1}_{-i}} e^{g(x_{-i})}\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_K.$$
Here $e^{g(x_{-i})}$ is the conditional density function given $x_i = \Delta(T)$, and thus the last integral is at most 1. We obtain
$$\int_{-\Delta(T)\mathbf{1}_{-i}}^{\Delta(T)\mathbf{1}_{-i}} f\big(x|_{x_i = \Delta(T)}\big)\,dx_1 \cdots dx_{i-1}\,dx_{i+1} \cdots dx_K \le e^{-\frac12 a_{ii}\Delta(T)^2 + \mu_i \Delta(T)} \le e^{-\frac12 \min_j a_{jj}\,\Delta(T)^2 + \max_j \mu_j\,\Delta(T)}.$$
Then, for $T \ge T_0$, we have $\Delta'(T) \le C_0 + 1$ and in turn
$$\zeta'(T) \le (C_0 + 1) \cdot K \cdot e^{-\frac12 \min_j a_{jj}\,\Delta(T)^2 + \max_j \mu_j\,\Delta(T)} - a \cdot T^{-a-1}.$$
Since we only consider non-degenerate Gaussian bandits with $\min_i a_{ii} > 0$, the $\mu_i$ are constants and $\Delta(T) \to \infty$ as $T \to \infty$ by the assumptions of Lemma 2, there exist $C_1 > 0$ and $T_1$ such that
$$e^{-\frac12 \min_j a_{jj}\,\Delta(T)^2 + \max_j \mu_j\,\Delta(T)} \le e^{-C_1 \Delta(T)^2} \qquad \text{for every } T > T_1.$$
Since $\lim_{T\to\infty} \frac{\Delta(T)^2}{\log T} = \infty$, we have $\Delta(T)^2 > \frac{2(a+1)}{C_1} \log T$ for $T > T_2$. These give us
$$\zeta'(T) \le (C_0+1)\,K\,e^{-2(a+1)\log T} - a\,T^{-a-1} = (C_0+1)\,K\,e^{-2(a+1)\log T} - a\,e^{-(a+1)\log T} < 0$$
for $T \ge T_3 \ge \max(T_0, T_1, T_2)$. This concludes that $\zeta'(T) < 0$ for $T \ge T_3$. We also have $\lim_{T\to\infty} \zeta(T) = 0$ according to the assumptions. Therefore, we finally arrive at $\zeta(T) > 0$ for $T \ge T_3$, which is equivalent to
$$\int_{-\Delta(T)\cdot\mathbf{1}}^{\Delta(T)\cdot\mathbf{1}} f(x_1, \ldots, x_K)\,dx_1 \cdots dx_K \ge 1 - \frac{1}{T^a},$$
i.e. the rewards are bounded by $\Delta(T)$ with probability $1 - \frac{1}{T^a}$. Then, by the same argument as in the proof of Theorem 4, for $T$ large enough we have
$$P\Big( R_T \le \Delta(T)\cdot\log(1/\delta)\cdot O^*(\sqrt T) \Big) \ge (1-\delta)\Big(1 - \frac{1}{T^a}\Big)^T.$$

Proof of Theorem 5. In Lemma 2 we choose $\Delta(T) = \log T$, which meets all of the assumptions. The result now follows from $\log T \cdot O^*(\sqrt T) = O^*(\sqrt T)$, Lemma 2, and Theorem 4.
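The probability guarantee in Lemma 2 involves the factor $(1-\delta)(1-1/T^a)^T$. A quick numerical illustration (our own, with $\delta = 1/\sqrt T$ and $a = 3$, the kind of choices used later in the proof of Theorem 6) shows this confidence level tending to 1 as $T$ grows:

```python
# Confidence factor (1 - delta) * (1 - T**-a)**T with delta = 1/sqrt(T), a = 3.
# The second factor behaves like exp(-T**(1 - a)) and tends to 1 for a > 2,
# so the whole expression is dominated by 1 - 1/sqrt(T).
a = 3
for T in (10, 100, 10_000, 1_000_000):
    delta = T ** -0.5
    p = (1 - delta) * (1 - T ** -a) ** T
    print(T, round(p, 6))
```

The printed values increase toward 1, matching the claim that the high-probability bound becomes an almost-sure statement in the limit.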

C.3 PROOF FOR THEOREM 6

We first list three known lemmas. The following lemma by Duchi (2009) provides a way to bound deviations.

Lemma 3. For any function class $F$ and i.i.d. random variables $\{x_1, x_2, \ldots, x_T\}$, we have
$$E_x\left[ \sup_{f\in F} \left( E_x f - \frac1T \sum_{t=1}^T f(x_t) \right) \right] \le 2 R_T^c(F),$$
where $R_T^c(F) = E_{x,\sigma}\left[ \sup_{f\in F} \frac1T \sum_{t=1}^T \sigma_t f(x_t) \right]$ and $\sigma_t$ is a $\{-1,1\}$ random walk of $t$ steps.

The following result holds according to Balcan (2011).

Lemma 4. For any subclass $A \subset F$, we have
$$\hat R_T^c \le R(A,T) \cdot \frac{\sqrt{2 \log |A|}}{T},$$
where $R(A,T) = \sup_{f\in A} \left( \sum_{t=1}^T f(x_t)^2 \right)^{1/2}$ and $\hat R_T^c = \sup_f \frac1T \sum_{t=1}^T \sigma_t f(x_t)$.

A random variable $X$ is $\sigma^2$-sub-Gaussian if for any $t > 0$ the tail probability satisfies $P(|X| > t) \le B e^{-\sigma^2 t^2}$, where $B$ is a positive constant. The following lemma is listed in Appendix A of Chatterjee (2014).

Lemma 5. For i.i.d. $\sigma^2$-sub-Gaussian random variables $\{Y_1, Y_2, \ldots, Y_T\}$, we have
$$E\left[ \max_{1\le t\le T} |Y_t| \right] \le \sigma \sqrt{2\log T} + \frac{4\sigma}{\sqrt{2\log T}}.$$

Proof of Theorem 6. Let us define $F = \{ f_j : x \mapsto x_j \mid j = 1, 2, \ldots, K \}$. Let $x_t = (r_1^t, r_2^t, \ldots, r_K^t)$, where $r_i^t$ is the reward of arm $i$ at step $t$, and let $a_t$ be the arm selected at time $t$ by EXP3.P; denote by $R_T = \max_i \sum_{t=1}^T r_i^t - \sum_{t=1}^T r_{a_t}^t$ the realized regret and by $\bar R_T = T \max_i \mu_i - \sum_{t=1}^T \mu_{a_t}$ the pseudo-regret. For any $f_j \in F$ we have $f_j(x_t) = r_j^t$. In the Gaussian MAB, $\{x_1, x_2, \ldots, x_T\}$ are i.i.d. random variables, since the Gaussian distribution $N(\mu, \Sigma)$ does not change over time. Then by Lemma 3 we have
$$E\left[ \max_i \left( \mu_i - \frac1T \sum_{t=1}^T r_i^t \right) \right] \le 2 R_T^c(F).$$
We consider
$$E[|\bar R_T - R_T|] = E\left[ \left| \Big(T\max_i\mu_i - \max_i\sum_{t=1}^T r_i^t\Big) - \Big(\sum_{t=1}^T \mu_{a_t} - \sum_{t=1}^T r_{a_t}^t\Big) \right| \right] \le E\left[ \left| T\max_i\mu_i - \max_i\sum_{t=1}^T r_i^t \right| \right] + E\left[ \left| \sum_{t=1}^T \mu_{a_t} - \sum_{t=1}^T r_{a_t}^t \right| \right]$$
$$\le E\left[ \max_i \left| T\mu_i - \sum_{t=1}^T r_i^t \right| \right] + E\left[ \left| \sum_{t=1}^T \mu_{a_t} - \sum_{t=1}^T r_{a_t}^t \right| \right] \le 2T R_T^c(F) + 2T_1 R_{T_1}^c(F) + \cdots + 2T_K R_{T_K}^c(F), \tag{3}$$
where $T_i$ is the number of pulls of arm $i$; clearly $T_1 + T_2 + \cdots + T_K = T$.
By Lemma 4 with $A = F$ we get
$$R_T^c(F) \le E[R(F,T)] \cdot \frac{\sqrt{2\log K}}{T}, \qquad R_{T_i}^c(F) \le E[R(F,T_i)] \cdot \frac{\sqrt{2\log K}}{T_i}, \quad i = 1, 2, \ldots, K.$$
Since $R(F,T)$ is increasing in $T$ and $T_i \le T$, we have $R_{T_i}^c(F) \le E[R(F,T)] \cdot \sqrt{2\log K}/T_i$, and together with (3) the expected deviation satisfies
$$E[|\bar R_T - R_T|] \le 2T \cdot E[R(F,T)]\,\frac{\sqrt{2\log K}}{T} + \sum_{i=1}^K 2T_i \cdot E[R(F,T)]\,\frac{\sqrt{2\log K}}{T_i} = 2(K+1)\sqrt{2\log K}\; E[R(F,T)]. \tag{4}$$
Regarding $E[R(F,T)]$, we have
$$E[R(F,T)] = E\left[ \sup_{f\in F} \Big( \sum_{t=1}^T f(x_t)^2 \Big)^{1/2} \right] = E\left[ \sup_i \Big( \sum_{t=1}^T (r_i^t)^2 \Big)^{1/2} \right] \le \sum_{i=1}^K E\left[ \Big( T \max_{1\le t\le T} (r_i^t)^2 \Big)^{1/2} \right] = \sqrt T \sum_{i=1}^K E\left[ \max_{1\le t\le T} |r_i^t| \right]. \tag{5}$$
We next use Lemma 5 for each arm $i$, with $Y_t = r_i^t$. Since $x_t$ is Gaussian, the marginals $Y_t$ are also Gaussian with mean $\mu_i$ and standard deviation $a_{ii}$; combining this with the fact that a Gaussian random variable is also sub-Gaussian justifies the use of the lemma. Thus
$$E\left[ \max_{1\le t\le T} |r_i^t| \right] \le a_{ii}\sqrt{2\log T} + \frac{4 a_{ii}}{\sqrt{2\log T}}.$$
Continuing from (5) we further obtain
$$E[R(F,T)] \le \sqrt T \cdot K \cdot \max_i a_{ii} \left( \sqrt{2\log T} + \frac{4}{\sqrt{2\log T}} \right) = K \left( \sqrt{2T\log T} + \frac{4\sqrt T}{\sqrt{2\log T}} \right) \max_i a_{ii}. \tag{6}$$
By combining (4) and (6) we conclude
$$E[|\bar R_T - R_T|] \le 2(K+1)\sqrt{2\log K}\,\max_i a_{ii} \cdot K \left( \sqrt{2T\log T} + \frac{4\sqrt T}{\sqrt{2\log T}} \right) = O^*(\sqrt T). \tag{7}$$

We now turn our attention to the expected regret $E[R_T]$. It can be written as
$$E[R_T] = E\big[R_T \mathbf 1_{R_T \le O^*(\sqrt T)}\big] + E\big[R_T \mathbf 1_{O^*(\sqrt T) < R_T < O^*(\sqrt T)+E[R_T]}\big] + E\big[R_T \mathbf 1_{R_T \ge O^*(\sqrt T)+E[R_T]}\big], \tag{8}$$
and the first term is at most $O^*(\sqrt T)$. We consider $\delta = 1/\sqrt T$ and $\eta = T^{-a}$ for $a > 2$, and set
$$P_1 = (1-\delta)(1-\eta)^T = (1-\delta)\Big(1 - \frac{1}{T^a}\Big)^{(T^a)\cdot\frac{T}{T^a}} \longrightarrow 1 \quad \text{as } T \to \infty, \tag{9}$$
so that by Theorem 5, $P\big(R_T \le \log(1/\delta)\,O^*(\sqrt T)\big) \ge P_1$, which equals $P\big(R_T \le O^*(\sqrt T)\big)$ since $\log(1/\delta) = \log\sqrt T$ is absorbed into $O^*(\sqrt T)$. Note also that
$$E[R_T] \le 2T\, E\Big[\max_i \max_t |r_i^t|\Big] \le 2T \sum_{i=1}^K E\Big[\max_t |r_i^t|\Big] \le 2T \sum_{i=1}^K a_{ii}\Big(\sqrt{2\log T} + \frac{4}{\sqrt{2\log T}}\Big) \le 2TK \max_i a_{ii}\Big(\sqrt{2\log T} + \frac{4}{\sqrt{2\log T}}\Big) \le C_0\, T \sqrt{\log T}$$
for a constant $C_0$.

The asymptotic behavior of the second term in (8) reads
$$E\big[R_T \mathbf 1_{O^*(\sqrt T) < R_T < O^*(\sqrt T)+E[R_T]}\big] = E\big[(R_T - O^*(\sqrt T))\,\mathbf 1_{R_T - O^*(\sqrt T) \in (0, E[R_T])}\big] + O^*(\sqrt T)\,P\big(R_T - O^*(\sqrt T) \in (0, E[R_T])\big)$$
$$\le E[R_T]\, P\big(R_T - O^*(\sqrt T) > 0\big) + O^*(\sqrt T) \le C_0\, T\sqrt{\log T}\,(1 - P_1) + O^*(\sqrt T) = O^*(\sqrt T),$$
where at the end we use (9) together with $1 - P_1 \le \delta + T\eta = T^{-1/2} + T^{1-a}$.

Regarding the third term in (8), we note that $\bar R_T \le E[R_T]$ by Jensen's inequality. By using (7) and again (9) we obtain
$$E\big[R_T \mathbf 1_{R_T \ge O^*(\sqrt T)+E[R_T]}\big] = E\big[(R_T - \bar R_T)\,\mathbf 1_{R_T - E[R_T] \ge O^*(\sqrt T)}\big] + E\big[\bar R_T\,\mathbf 1_{R_T - E[R_T] \ge O^*(\sqrt T)}\big]$$
$$\le E[|\bar R_T - R_T|] + E[R_T]\, P\big(R_T \ge O^*(\sqrt T)\big) \le O^*(\sqrt T) + C_0\, T\sqrt{\log T}\,(1 - P_1) = O^*(\sqrt T).$$
Combining all these together we obtain $E[R_T] = O^*(\sqrt T)$, which concludes the proof.

Our upper bound of order $O^*(\sqrt T)$ matches the order known for bounded MAB. In our case the upper bound $O^*(\sqrt T)$ holds for $T$ large enough, with the threshold hidden behind $O^*$, while the linear lower bound is valid only for small values of $T$. This explains how a lower bound of order $O(T)$ and an upper bound of order $O^*(\sqrt T)$ can coexist.

Figure 1: The performance of Algorithm 2 and RND measured by the epoch-wise reward on Mountain Car; the left plot shows the original data and the right plot the smoothed reward values.

Figure 2: The performance of Algorithm 2 and RND measured by the intrinsic reward without parallel environments, with three different burn-in periods.
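As a complement to the analysis of EXP3.P on Gaussian bandits in this section, the following simulation is our own minimal EXP3.P-style sketch on a two-armed Gaussian bandit. The parameter choices (gamma, alpha, the learning-rate factor) are simplified stand-ins, not the tuned constants of Auer et al. (2002b); the point is only to see the weight dynamics concentrate pulls on the better arm.

```python
import math, random

def exp3p(T, means, gamma=0.1, alpha=1.0, seed=0):
    """EXP3.P-style sketch: exponential weights with an exploration
    floor gamma/K and an optimistic bias term in the reward estimates."""
    rng = random.Random(seed)
    K = len(means)
    w = [1.0] * K
    best = max(range(K), key=lambda i: means[i])
    pulls_best = 0
    for _ in range(T):
        total = sum(w)
        p = [(1 - gamma) * wi / total + gamma / K for wi in w]
        i = rng.choices(range(K), weights=p)[0]
        r = rng.gauss(means[i], 1.0)           # Gaussian reward, unbounded
        pulls_best += (i == best)
        for j in range(K):
            # importance-weighted estimate plus optimistic bias for all arms
            xhat = (r / p[j] if j == i else 0.0) + alpha / (p[j] * math.sqrt(K * T))
            w[j] *= math.exp(gamma / (3 * K) * xhat)
        m = max(w)
        w = [wi / m for wi in w]               # normalize to avoid overflow
    return pulls_best / T

frac = exp3p(T=5000, means=[0.0, 1.0])
print(frac)  # fraction of pulls on the better arm, well above 1/2
```

The per-round weight normalization is our addition: with unbounded Gaussian rewards the raw weights would eventually overflow, and dividing by the maximum leaves the sampling probabilities unchanged.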


