TEMPORAL DIFFERENCE UNCERTAINTIES AS A SIGNAL FOR EXPLORATION

Anonymous authors
Paper under double-blind review

Abstract

An effective approach to exploration in reinforcement learning is to rely on an agent's uncertainty over the optimal policy, which can yield near-optimal exploration strategies in tabular settings. However, in non-tabular settings that involve function approximators, obtaining accurate uncertainty estimates is almost as challenging as the exploration problem itself. In this paper, we highlight that value estimates are easily biased and temporally inconsistent. In light of this, we propose a novel method for estimating uncertainty over the value function that relies on inducing a distribution over temporal difference errors. This exploration signal controls for state-action transitions so as to isolate uncertainty in value that is due to uncertainty over the agent's parameters. Because our measure of uncertainty conditions on state-action transitions, we cannot act on this measure directly. Instead, we incorporate it as an intrinsic reward and treat exploration as a separate learning problem, induced by the agent's temporal difference uncertainties. We introduce a distinct exploration policy that learns to collect data with high estimated uncertainty, which gives rise to a "curriculum" that smoothly changes throughout learning and vanishes in the limit of perfect value estimates. We evaluate our method on hard-exploration tasks, including Deep Sea and Atari 2600 environments, and find that our method facilitates efficient exploration.

1. INTRODUCTION

Striking the right balance between exploration and exploitation is fundamental to the reinforcement learning problem. A common approach is to derive exploration from the policy being learned. Dithering strategies, such as ε-greedy exploration, render a policy stochastic around its reward-maximising behaviour (Williams & Peng, 1991). Other methods encourage higher entropy in the policy (Ziebart et al., 2008), introduce an intrinsic reward (Singh et al., 2005), or drive exploration by sampling from the agent's belief over the MDP (Strens, 2000). While greedy or entropy-maximising policies cannot facilitate temporally extended exploration (Osband et al., 2013; 2016a), the efficacy of intrinsic rewards depends crucially on how they relate to the extrinsic reward that comes from the environment (Burda et al., 2018a). Typically, intrinsic rewards for exploration provide a bonus for visiting novel states (e.g. Bellemare et al., 2016) or visiting states where the agent cannot predict future transitions (e.g. Pathak et al., 2017; Burda et al., 2018a). Such approaches can facilitate learning an optimal policy, but they can also fail entirely in large environments, as they prioritise novelty over rewards (Burda et al., 2018b). Methods based on the agent's uncertainty over the optimal policy explicitly trade off exploration and exploitation (Kearns & Singh, 2002). Posterior Sampling for Reinforcement Learning (PSRL; Strens, 2000; Osband et al., 2013) is one such approach, which models a distribution over Markov Decision Processes (MDPs). While PSRL is near-optimal in tabular settings (Osband et al., 2013; 2016b), it cannot easily be scaled to complex problems that require function approximators. Prior work has attempted to overcome this by instead directly estimating the agent's uncertainty over the policy's value function (Osband et al., 2016a; Moerland et al., 2017; Osband et al., 2019; O'Donoghue et al., 2018; Janz et al., 2019).
While these approaches can scale posterior sampling to complex problems and nonlinear function approximators, estimating uncertainty over value functions introduces issues that can cause a bias in the posterior distribution (Janz et al., 2019). In response to these challenges, we introduce Temporal Difference Uncertainties (TDU), which derives an intrinsic reward from the agent's uncertainty over the value function. Concretely, TDU relies on the Bootstrapped DQN (Osband et al., 2016a) and separates exploration and reward-maximising behaviour into two distinct policies that bootstrap from a shared replay buffer. This separation allows us to derive an exploration signal for the exploratory policy from estimates of uncertainty of the reward-maximising policy. Thus, TDU encourages exploration to collect data with high model uncertainty over reward-maximising behaviour, which is made possible by treating exploration as a separate learning problem. In contrast to prior works that directly estimate value function uncertainty, we estimate uncertainty over temporal difference (TD) errors. By conditioning on observed state-action transitions, TDU controls for environment uncertainty and provides an exploration signal only insofar as there is model uncertainty. We demonstrate that TDU can facilitate efficient exploration in challenging exploration problems such as Deep Sea and Montezuma's Revenge.

2. ESTIMATING VALUE FUNCTION UNCERTAINTY IS HARD

We begin by highlighting that estimating uncertainty over the value function can suffer from bias that is very hard to overcome with typical approaches (see also Janz et al., 2019). Our analysis shows that biased estimates arise because uncertainty estimates require an integration over unknown future state visitations. This requires tremendous model capacity and is in general infeasible. Our results show that we cannot escape a bias in general, but we can take steps to mitigate it by conditioning on an observed trajectory. Doing so removes some uncertainty over future state visitations, and we show in Section 3 that it can result in a substantially smaller bias. We consider a Markov Decision Process $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \mathcal{R}, \gamma)$ for some given state space $\mathcal{S}$, action space $\mathcal{A}$, transition dynamics $\mathcal{P}$, reward function $\mathcal{R}$, and discount factor $\gamma$. For a given (deterministic) policy $\pi : \mathcal{S} \to \mathcal{A}$, the action-value function is defined as the expected cumulative reward under the policy starting from state $s$ with action $a$:

$$Q^\pi(s, a) := \mathbb{E}_\pi\!\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \,\middle|\, s_0 = s,\, a_0 = a\right] = \mathbb{E}_{\substack{r \sim \mathcal{R}(s,a) \\ s' \sim \mathcal{P}(s,a)}}\!\left[r + \gamma Q^\pi(s', \pi(s'))\right], \qquad (1)$$

where $t$ indexes time and the expectation $\mathbb{E}_\pi$ is with respect to realised rewards $r$ sampled under the policy $\pi$; the right-hand side characterises $Q$ recursively under the Bellman equation. The action-value function $Q^\pi$ is estimated under a function approximator $Q_\theta$ parameterised by $\theta$. Uncertainty over $Q^\pi$ is expressed by placing a distribution over the parameters of the function approximator, $p(\theta)$. We overload notation slightly and write $p(\theta)$ to denote the probability density function $p_\theta$ over a random variable $\theta$. Further, we denote by $\theta \sim p(\theta)$ a random sample $\theta$ from the distribution defined by $p_\theta$.
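As a concrete illustration of Eq. 1, the sketch below (a hypothetical two-state MDP, not from the paper) verifies that the action-value under a fixed policy is the fixed point of the Bellman backup, matching the closed-form solution of the linear system:

```python
import numpy as np

# Hypothetical 2-state MDP under a fixed policy: the value in Eq. 1 is
# the fixed point of the Bellman backup Q = r + gamma * P @ Q.
gamma = 0.9
P = np.array([[0.8, 0.2],   # P[s, s']: transition probabilities under pi
              [0.1, 0.9]])
r = np.array([1.0, 0.0])    # expected immediate reward per state under pi

# Iterate the Bellman operator; it is a gamma-contraction, so it converges.
Q = np.zeros(2)
for _ in range(1000):
    Q = r + gamma * P @ Q

# Closed-form fixed point: Q = (I - gamma * P)^{-1} r.
Q_exact = np.linalg.solve(np.eye(2) - gamma * P, r)
```

After 1000 backups the iterative estimate agrees with the closed form to numerical precision, illustrating the recursive characterisation that the rest of the section builds on.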
Methods that rely on posterior sampling under function approximators assume that the induced distribution $p(Q_\theta)$ is an accurate estimate of the agent's uncertainty over its value function, $p(Q^\pi)$, so that sampling $Q_\theta \sim p(Q_\theta)$ is approximately equivalent to sampling $Q^\pi \sim p(Q^\pi)$. For this to hold, the moments of $p(Q_\theta)$ at each state-action pair $(s, a)$ must correspond to the expected moments in future states. In particular, moments of $p(Q^\pi)$ must satisfy a Bellman equation akin to Eq. 1 (O'Donoghue et al., 2018). We focus on the mean ($\mathbb{E}$) and variance ($\mathbb{V}$):

$$\mathbb{E}_\theta[Q_\theta(s, a)] = \mathbb{E}_\theta\!\left[\mathbb{E}_{r, s'}\!\left[r + \gamma Q_\theta(s', \pi(s'))\right]\right], \qquad (2)$$
$$\mathbb{V}_\theta[Q_\theta(s, a)] = \mathbb{V}_\theta\!\left[\mathbb{E}_{r, s'}\!\left[r + \gamma Q_\theta(s', \pi(s'))\right]\right]. \qquad (3)$$

If $\mathbb{E}_\theta[Q_\theta]$ and $\mathbb{V}_\theta[Q_\theta]$ fail to satisfy these conditions, the estimates of $\mathbb{E}[Q^\pi]$ and $\mathbb{V}[Q^\pi]$ are biased, causing a bias in exploration under posterior sampling from $p(Q_\theta)$. Formally, the agent's uncertainty over $p(Q)$ implies uncertainty over the MDP (Strens, 2000). Given a belief over the MDP, i.e., a distribution $p(M)$, we can associate each $M \sim p(M)$ with a distinct value function $Q^\pi_M$. Lemma 1 below shows that, for $p(\theta)$ to be interpreted as representing some $p(M)$ by push-forward to $p(Q_\theta)$, the induced moments must match under the Bellman equation.

Lemma 1. If $\mathbb{E}_\theta[Q_\theta]$ and $\mathbb{V}_\theta[Q_\theta]$ fail to satisfy Eqs. 2 and 3, respectively, they are biased estimators of $\mathbb{E}_M[Q^\pi_M]$ and $\mathbb{V}_M[Q^\pi_M]$ for any choice of $p(M)$.

All proofs are deferred to Appendix B. Lemma 1 highlights why estimating uncertainty over value functions is so challenging: while the left-hand sides of Eqs. 2 and 3 are stochastic in $\theta$ only, the right-hand sides depend on marginalising over the MDP. This requires the function approximator to generalise to unseen future trajectories. Lemma 1 is therefore a statement about scale; the harder it is to generalise, the more likely we are to observe a bias, even in deterministic environments.
This requirement of "strong generalisation" poses a particular problem for neural networks, which tend to interpolate over the training data (e.g. Li et al., 2020; Liu et al., 2020; Belkin et al., 2019), but the issue is more general. In particular, we show that factorising the posterior $p(\theta)$ will typically cause estimation bias for all but tabular MDPs. This is problematic because it is often computationally infeasible to maintain a full posterior; previous work either maintains a full posterior over the final layer of the function approximator (Osband et al., 2016a; O'Donoghue et al., 2018; Janz et al., 2019) or maintains a diagonal posterior over all parameters of the neural network (Fortunato et al., 2018; Plappert et al., 2018). Either method limits how expressive the function approximator can be with respect to future states, thereby causing an estimation bias. To establish this formally, let $Q_\theta := w \cdot \phi_\vartheta$, where $\theta = (w_1, \ldots, w_n, \vartheta_1, \ldots, \vartheta_v)$, with $w \in \mathbb{R}^n$ a linear projection and $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^n$ a feature extractor with parameters $\vartheta \in \mathbb{R}^v$.

Proposition 1. If the number of state-action pairs with distinct expected values $\mathbb{E}_\theta[Q_\theta(s, a)] \neq \mathbb{E}_\theta[Q_\theta(s', a')]$ is greater than $n$, where $w \in \mathbb{R}^n$, then $\mathbb{E}_\theta[Q_\theta]$ and $\mathbb{V}_\theta[Q_\theta]$ are biased estimators of $\mathbb{E}_M[Q^\pi_M]$ and $\mathbb{V}_M[Q^\pi_M]$ for any choice of $p(M)$.

This result is a consequence of the feature extractor $\phi$ mapping into a co-domain that is larger than the space spanned by $w$; a bias results from having more unique state-action representations $\phi(s, a)$ than degrees of freedom in $w$. The implication is that function approximators under factorised posteriors cannot generalise uncertainty estimates across states (a similar observation in tabular settings was made by Janz et al., 2019): they can only produce temporally consistent uncertainty estimates if they have the capacity to memorise point-wise uncertainty estimates for each $(s, a)$, which defeats the purpose of a function approximator.
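To make the last-layer setting concrete, the following sketch (with hypothetical features and posterior parameters, not from the paper) shows the moments induced when uncertainty lives only in the linear head $w$: the per-state variance is $\phi(s,a)^\top \Sigma\, \phi(s,a)$, so all state-action variances are pinned down by the few free entries of $\Sigma$, illustrating the limited degrees of freedom that Proposition 1 exploits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 5 distinct state-action pairs embedded in R^2 by a
# fixed feature extractor phi, with a Gaussian posterior over the linear
# head w only (mean mu, covariance Sigma) -- a last-layer posterior.
n = 2
phi = rng.normal(size=(5, n))     # 5 state-action representations
mu = rng.normal(size=n)
Sigma = np.diag([0.5, 0.1])

# Induced moments of Q_theta = w . phi under the last-layer posterior.
mean_Q = phi @ mu                                   # E_theta[Q_theta(s,a)]
var_Q = np.einsum('ij,jk,ik->i', phi, Sigma, phi)   # V_theta[Q_theta(s,a)]

# Monte Carlo check of the closed-form moments.
ws = rng.multivariate_normal(mu, Sigma, size=200_000)
Q_samples = ws @ phi.T
```

All five variances are determined by only three free entries of a symmetric 2x2 covariance, so they cannot be matched to arbitrary per-state targets, which is the overdetermination behind Proposition 1.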
This is a statement about the structure of p(θ) and holds for any estimation method. Thus, common approaches to uncertainty estimation with neural networks generally fail to provide unbiased uncertainty estimates over the value function in non-trivial MDPs. Proposition 1 shows that to accurately capture value function uncertainty, we need a full posterior over parameters, which is often infeasible. It also underscores that the main issue is the dependence on future state visitation. This motivates Temporal Difference Uncertainties as an estimate of uncertainty conditioned on observed state-action transitions.

3. TEMPORAL DIFFERENCE UNCERTAINTIES

While Proposition 1 states that we cannot remove this bias unless we are willing to maintain a full posterior $p(\theta)$, we can construct uncertainty estimates that control for uncertainty over future state-action transitions. In this paper, we propose to estimate uncertainty over a full transition $\tau := (s, a, r, s')$ to isolate uncertainty due to $p(\theta)$. Fixing a transition, we induce a conditional distribution $p(\delta \mid \tau)$ over Temporal Difference (TD) errors, $\delta(\theta, \tau) := \gamma Q_\theta(s', \pi(s')) + r - Q_\theta(s, a)$, that we characterise by its mean and variance: $\mathbb{E}_\delta[\delta \mid \tau] = \mathbb{E}_\theta[\delta(\theta, \tau) \mid \tau]$ and $\mathbb{V}_\delta[\delta \mid \tau] = \mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]$. Estimators over TD-errors are akin to first-difference estimators of uncertainty over the action-value. They can therefore exhibit smaller bias if that bias is temporally consistent. To illustrate, assume for simplicity that $\mathbb{E}_\theta[Q_\theta]$ consistently over- or under-estimates $\mathbb{E}_M[Q^\pi_M]$ by an amount $b \in \mathbb{R}$. The corresponding bias in $\mathbb{E}_\theta[\delta(\theta, \tau) \mid \tau]$ is given by

$$\mathrm{Bias}\!\left(\mathbb{E}_\theta[\delta(\theta, \tau) \mid \tau]\right) = \mathrm{Bias}\!\left(\gamma\, \mathbb{E}_\theta[Q_\theta(s', \pi(s'))] + r - \mathbb{E}_\theta[Q_\theta(s, a)]\right) = (\gamma - 1)\, b. \qquad (4)$$

This bias is close to 0 for typical values of $\gamma$; notably, for $\gamma = 1$, $\mathbb{E}_\theta[\delta(\theta, \tau) \mid \tau]$ is unbiased. More generally, unless the bias is constant over time as in the above example, we cannot fully remove the bias when constructing an estimator over a quantity that relies on $Q_\theta$. However, as the above example shows, by conditioning on a state-action transition, we can make it significantly smaller. We formalise this logic in the following result.

Proposition 2. For any $\tau := (s, a, r, s')$ and any $p(M)$, given $p(\theta)$, define the following ratios:

$$\rho = \mathrm{Bias}\!\left(\mathbb{E}_\theta[Q_\theta(s', \pi(s'))]\right) / \mathrm{Bias}\!\left(\mathbb{E}_\theta[Q_\theta(s, a)]\right), \qquad (5)$$
$$\phi = \mathrm{Bias}\!\left(\mathbb{E}_\theta[Q_\theta(s', \pi(s'))^2]\right) / \mathrm{Bias}\!\left(\mathbb{E}_\theta[Q_\theta(s, a)^2]\right), \qquad (6)$$
$$\kappa = \mathrm{Bias}\!\left(\mathbb{E}_\theta[Q_\theta(s', \pi(s'))\, Q_\theta(s, a)]\right) / \mathrm{Bias}\!\left(\mathbb{E}_\theta[Q_\theta(s, a)^2]\right), \qquad (7)$$
$$\alpha = \mathbb{E}_M[Q^\pi_M(s', \pi(s'))] / \mathbb{E}_M[Q^\pi_M(s, a)]. \qquad (8)$$

If $\rho \in (0, 2/\gamma)$, then $\mathbb{E}_\delta[\delta \mid \tau]$ has lower bias than $\mathbb{E}_\theta[Q_\theta(s, a)]$. Moreover, if $\rho = 1/\gamma$, then $\mathbb{E}_\delta[\delta \mid \tau]$ is unbiased.
Additionally, there exist $\rho \approx 1$, $\phi \approx 1$, $\kappa \approx 1$, $\alpha \approx 1$ such that $\mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]$ has less bias than $\mathbb{V}_\theta[Q_\theta(s, a)]$. In particular, if $\rho = \phi = \kappa = \alpha = 1$, then

$$\left| \mathrm{Bias}\!\left(\mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]\right) \right| = \left| (\gamma - 1)^2\, \mathrm{Bias}\!\left(\mathbb{V}_\theta[Q_\theta(s, a)]\right) \right| < \left| \mathrm{Bias}\!\left(\mathbb{V}_\theta[Q_\theta(s, a)]\right) \right|. \qquad (9)$$

Further, if $\rho = 1/\gamma$, $\kappa = 1/\gamma$, and $\phi = 1/\gamma^2$, then $\mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]$ is unbiased for any $\alpha$. The first part of Proposition 2 generalises the example above to cases where the bias $b$ varies across state-action transitions. It is worth noting that the required "smoothness" of the bias is not very stringent: the bias of $\mathbb{E}_\theta[Q_\theta](s', \pi(s'))$ can be twice as large as that of $\mathbb{E}_\theta[Q_\theta](s, a)$ and $\mathbb{E}_\delta[\delta \mid \tau]$ can still produce a less biased estimate. Importantly, it must have the same sign, and so Proposition 2 requires temporal consistency. To establish a similar claim for $\mathbb{V}_\delta[\delta \mid \tau]$, we need a bit more structure. The ratios $\rho$, $\phi$, and $\kappa$ capture temporal consistency in the bias, while $\alpha$ relates to the temporal consistency of the underlying estimand. Proposition 2 establishes that if these ratios are close to unity, then $\mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]$ will have less bias. For most transitions, it is reasonable to assume that this holds. In some MDPs, large changes in the reward can cause these requirements to break. Because Proposition 2 only establishes sufficiency, violating this requirement does not necessarily mean that $\mathbb{V}_\delta[\delta \mid \tau]$ has greater bias than $\mathbb{V}_\theta[Q_\theta(s, a)]$. Finally, it is worth noting that these are statements about a given transition $\tau$. In most state-action transitions, the requirements in Proposition 2 will hold, in which case $\mathbb{E}_\delta[\delta \mid \tau]$ and $\mathbb{V}_\delta[\delta \mid \tau]$ exhibit less overall bias. We provide direct empirical support that Proposition 2 holds in practice through careful ceteris paribus comparisons in Section 5.1. To obtain a concrete signal for exploration, we follow O'Donoghue et al. (2018) and derive an exploration signal from the variance $\mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]$.
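The constant-bias example above can be checked numerically. The sketch below (hypothetical true values and bias, not from the paper) confirms that a value bias of b at both ends of the transition induces a TD-error bias of exactly (γ − 1)b:

```python
import numpy as np

# Numeric check of the first-difference argument: if E_theta[Q_theta]
# over-estimates the true value everywhere by a constant b, the bias of
# the conditional mean TD-error is (gamma - 1) * b.
gamma = 0.99
b = 5.0                            # assumed constant value-estimation bias
r = 1.0                            # observed reward on the fixed transition tau
Q_true_s, Q_true_next = 2.0, 3.0   # hypothetical true Q(s,a), Q(s',pi(s'))

# Biased posterior means at both state-action pairs in tau.
EQ_s = Q_true_s + b
EQ_next = Q_true_next + b

# Bias of the mean TD-error, conditioned on tau: the constant b largely
# cancels in the first difference gamma*Q' + r - Q.
true_delta = gamma * Q_true_next + r - Q_true_s
est_delta = gamma * EQ_next + r - EQ_s
bias_delta = est_delta - true_delta
```

With γ = 0.99 the residual bias is only 0.01·b, two orders of magnitude smaller than the raw value bias, which is the cancellation that motivates estimating uncertainty over TD-errors rather than over Q directly.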
Because $p(\delta \mid \tau)$ is defined per transition, it cannot be used as-is for posterior sampling. Therefore, we incorporate TDU as a signal for exploration via an intrinsic reward. To obtain an exploration signal that is on approximately the same scale as the extrinsic reward, we use the standard deviation $\sigma(\tau) := \sqrt{\mathbb{V}_\theta[\delta(\theta, \tau) \mid \tau]}$ to define an augmented reward function

$$\tilde{\mathcal{R}}(\tau) := \mathcal{R}((s, a) \in \tau) + \beta\, \sigma(\tau), \qquad (10)$$

where $\beta \in [0, \infty)$ is a hyper-parameter that determines the emphasis on exploration. Another appealing property of $\sigma$ is that it naturally decays as the agent converges on a solution (as model uncertainty diminishes); TDU defines a distinct MDP $(\mathcal{S}, \mathcal{A}, \mathcal{P}, \tilde{\mathcal{R}}, \gamma)$ under Eq. 10 that converges on the true MDP in the limit of no model uncertainty. For a given policy $\pi$ and distribution $p(Q_\theta)$, there exists an exploration policy $\mu$ that collects transitions over which $p(Q_\theta)$ exhibits maximal uncertainty, as measured by $\sigma$. In hard exploration problems, the exploration policy $\mu$ can behave fundamentally differently from $\pi$. To capture such distinct exploration behaviour, we treat $\mu$ as a separate exploration policy that we train to maximise the augmented reward $\tilde{\mathcal{R}}$, alongside a policy $\pi$ trained to maximise the extrinsic reward $\mathcal{R}$. This gives rise to a natural separation of exploitation and exploration in the form of a cooperative multi-agent game, where the exploration policy is tasked with finding experiences where the agent is uncertain of its value estimate for the greedy policy $\pi$. As $\pi$ is trained on this data, we expect uncertainty to vanish (up to noise). As this happens, the exploration policy $\mu$ is incentivised to find new experiences with higher estimated uncertainty. This induces a particular pattern where exploration will reinforce experiences until the agent's uncertainty vanishes, at which point the exploration policy expands its state visitation further.
This process can allow TDU to overcome estimation bias in the posterior, since it is in effect exploiting it, in contrast to previous methods that do not maintain a distinct exploration policy. We demonstrate this empirically both on Montezuma's Revenge and on Deep Sea (Osband et al., 2020).
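The decay property of the augmented reward in Eq. 10 can be sketched in a few lines (the numbers are illustrative, not from the paper): when ensemble TD-errors disagree, the reward bonus is positive; once they agree, σ vanishes and the augmented reward reduces to the extrinsic reward:

```python
import numpy as np

# Sketch of the augmented reward R~(tau) = R(s,a) + beta * sigma(tau).
beta = 1.0
r_extrinsic = 0.5

def augmented_reward(td_errors, r, beta):
    # sigma is the (sample) standard deviation of the ensemble's TD-errors.
    sigma = np.std(td_errors, ddof=1)
    return r + beta * sigma

# High model uncertainty: TD-errors disagree, so the bonus is positive.
r_uncertain = augmented_reward(np.array([0.9, -0.3, 0.5, -1.1]), r_extrinsic, beta)
# Converged ensemble: identical TD-errors, so sigma = 0 and R~ = R.
r_converged = augmented_reward(np.array([0.1, 0.1, 0.1, 0.1]), r_extrinsic, beta)
```

This is the mechanism behind the vanishing "curriculum": the augmented MDP collapses onto the true MDP exactly when model uncertainty disappears.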

4. IMPLEMENTING TDU WITH BOOTSTRAPPING

The distribution over TD-errors that underlies TDU can be estimated using standard techniques for probability density estimation. In this paper, we leverage the statistical bootstrap, as it is both easy to implement and provides a robust approximation without requiring distributional assumptions. TDU is easy to implement under the statistical bootstrap: it requires only a few lines of extra code. It can be implemented with value-based as well as actor-critic algorithms (we provide generic pseudo-code in Appendix A); in this paper, we focus on Q-learning. Q-learning alternates between policy evaluation (Eq. 1) and policy improvement under a greedy policy $\pi_\theta(s) = \arg\max_a Q_\theta(s, a)$. Deep Q-learning (Mnih et al., 2015) learns $Q_\theta$ by minimising its TD-error via stochastic gradient descent on transitions sampled from a replay buffer. Unless otherwise stated, in practice we adopt the common approach of evaluating the action taken by the learned network through a target network with separate parameters that are updated periodically (Van Hasselt et al., 2016). Our implementation starts from the Bootstrapped DQN (Osband et al., 2016a), which maintains a set of $K$ function approximators $\mathcal{Q} = \{Q_{\theta_k}\}_{k=1}^K$, each parameterised by $\theta_k$ and regressed towards a unique target function using bootstrapped sampling of data from a shared replay memory. The Bootstrapped DQN derives a policy $\pi_\theta$ by sampling $\theta$ uniformly from $\mathcal{Q}$ at the start of each episode. We provide an overview of the Bootstrapped DQN in Algorithm 1 for reference. To implement TDU in this setting, we make a change to the loss function (Algorithm 2, changes highlighted in green). First, we estimate the TDU signal $\sigma$ using bootstrapped value estimation, through the observed TD-errors $\{\delta_k\}_{k=1}^K$ incurred by the ensemble $\mathcal{Q}$ on a given transition:

$$\sigma(\tau) \approx \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left( \delta(\theta_k, \tau) - \bar{\delta}(\tau) \right)^2}, \qquad (11)$$

where $\bar{\delta} = \gamma \bar{Q}' + r - \bar{Q}$, with $\bar{x} := \frac{1}{K}\sum_{i=1}^K x_i$, $Q'_k := Q_{\theta_k}(s', \pi(s'))$, and $Q_k := Q_{\theta_k}(s, a)$.
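A minimal sketch of Eq. 11 (with illustrative ensemble values, not from the paper): for a single transition, σ is simply the sample standard deviation of the TD-errors across the K bootstrapped value functions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Estimate sigma(tau) per Eq. 11 as the sample standard deviation of
# TD-errors across a bootstrapped ensemble of K value functions.
K, gamma = 10, 0.99
r = 1.0
Q = rng.normal(2.0, 0.3, size=K)        # Q_{theta_k}(s, a), illustrative
Q_next = rng.normal(2.5, 0.3, size=K)   # Q_{theta_k}(s', pi(s')), illustrative

deltas = gamma * Q_next + r - Q          # TD-error per ensemble member
sigma = np.sqrt(np.sum((deltas - deltas.mean()) ** 2) / (K - 1))
```

Note the Bessel correction (dividing by K − 1 rather than K), matching Eq. 11's unbiased sample-variance form.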
An important assumption underpinning the bootstrapped estimation is that of stochastic optimism (Osband et al., 2016b), which requires the distribution over $\mathcal{Q}$ to be approximately as wide as the true distribution over value estimates. If not, uncertainty over $\mathcal{Q}$ can collapse, which would cause $\sigma$ to also collapse. To prevent this, $\mathcal{Q}$ can be endowed with a prior (Osband et al., 2018) that maintains diversity in the ensemble by defining each value function as $Q_{\theta_k} + \lambda P_k$, $\lambda \in [0, \infty)$, where $P_k$ is a random prior function. Rather than feeding this exploration signal back into the value functions in $\mathcal{Q}$, which would create a positive feedback loop (uncertainty begets higher reward, which begets higher uncertainty ad infinitum), we introduce a separate ensemble of exploration value functions $\tilde{\mathcal{Q}} = \{Q_{\tilde{\theta}_k}\}_{k=1}^N$ that we train on the augmented reward (Eqs. 10 and 11). We derive an exploration policy $\mu_{\tilde{\theta}}$ by sampling exploration parameters $\tilde{\theta}$ uniformly from $\tilde{\mathcal{Q}}$, as in the standard bootstrapped DQN. In summary, our implementation of TDU maintains $K + N$ value functions. The first $K$ define a standard Bootstrapped DQN. From these, we derive an exploration signal $\sigma$, which we use to train the last $N$ value functions. At the start of each episode, we proceed as in the standard Bootstrapped DQN and randomly sample a parameterisation $\theta$ from $\mathcal{Q} \cup \tilde{\mathcal{Q}}$ that we act under for the duration of the episode. All value functions are trained by bootstrapping from a single shared replay memory (Algorithm 1); see Appendix A for a complete JAX (Bradbury et al., 2018) implementation. Consequently, we execute the (extrinsic) reward-maximising policy $\pi_{\theta \sim \mathcal{Q}}$ with probability $K/(K+N)$ and the exploration policy $\mu_{\tilde{\theta} \sim \tilde{\mathcal{Q}}}$ with probability $N/(K+N)$. While $\pi$ visits states around current reward-maximising behaviour, $\mu$ searches for data with high model uncertainty.
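The randomised-prior construction can be sketched as follows (linear functions and parameter values are hypothetical, chosen only for illustration): each member's value is the sum of a trainable network and a fixed, untrained random prior, so even a fully collapsed trained ensemble retains diversity:

```python
import numpy as np

rng = np.random.default_rng(2)

# Sketch of randomised prior functions (Osband et al., 2018): each
# member's value estimate is Q_{theta_k} + lambda * P_k, where P_k is a
# fixed random function that is never trained. Here both parts are
# linear in a feature vector for simplicity; only w_k would be trained.
n, K, lam = 4, 3, 1.0
priors = [rng.normal(size=n) for _ in range(K)]   # fixed P_k weights
weights = [np.zeros(n) for _ in range(K)]         # trainable, init at 0

def member_value(k, features):
    return weights[k] @ features + lam * priors[k] @ features

features = np.ones(n)
values = np.array([member_value(k, features) for k in range(K)])

# Even with identical (here, untrained) Q-weights, the priors keep the
# ensemble's value estimates diverse, preventing sigma from collapsing.
spread = values.std(ddof=1)
```

Setting λ = 0 recovers the plain Bootstrapped DQN, which is the degenerate case where ensemble diversity (and hence σ) can collapse.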
While each population $\mathcal{Q}$ and $\tilde{\mathcal{Q}}$ can be seen as performing Bayesian inference, it is not immediately clear that the full agent admits a Bayesian interpretation. We leave this question for future work. There are several equally valid implementations of TDU (see Appendix A for generic implementations for value-based learning and policy-gradient methods). In our case, it would be equally valid to define only a single exploration policy (i.e. $N = 1$) and specify the probability of sampling this policy. While this can result in faster learning, a potential drawback is that it restricts the exploratory behaviour that $\mu$ can exhibit at any given time. Using a full bootstrapped ensemble for the exploration policy leverages the behavioural diversity of bootstrapping.

Algorithm 1 Bootstrapped DQN with TDU (excerpt).
4: Observe $s$ and choose $Q_k \sim \mathcal{Q} \cup \tilde{\mathcal{Q}}$
5: while episode not done do
6:   Take action $a = \arg\max_{\hat{a}} Q_k(s, \hat{a})$
7:   Sample mask $m$, $m_i \sim \mathrm{Bin}(n = 1, p = \rho)$
8:   Enqueue transition $(s, a, r, s', m)$ to $B$
9:   Optimise $L(\{\theta_k\}_{k=1}^K, \{\tilde{\theta}_k\}_{k=1}^N, \gamma, \beta, D \sim B)$
10: end while
11: end while

Algorithm 2 Bootstrapped TD-loss with TDU.
Require: $\{\theta_k\}_{k=1}^K, \{\tilde{\theta}_k\}_{k=1}^N$: parameters
Require: $\gamma, \beta, D$: hyper-parameters, data
1: Initialise $\ell \leftarrow 0$
2: for $(s, a, r, s', m) \in D$ do
3:   $\tau \leftarrow (s, a, r, s', \gamma)$
4:   Compute $\{\delta_i\}_{i=1}^K = \{\delta(\theta_i, \tau)\}_{i=1}^K$
5:   Compute $\sigma$ from $\{\delta_k\}_{k=1}^K$ (Eq. 11)
6:   Update $\tau$ by $r \leftarrow r + \beta\sigma$
7:   Compute $\{\tilde{\delta}_j\}_{j=1}^N = \{\delta(\tilde{\theta}_j, \tau)\}_{j=1}^N$
8:   $\ell \leftarrow \ell + \sum_{i=1}^K m_i \delta_i^2 + \sum_{j=1}^N m_{K+j} \tilde{\delta}_j^2$
9: end for
10: return $\ell \,/\, (2(N + K)|D|)$

5. EMPIRICAL EVALUATION

5.1. BEHAVIOUR SUITE

Bsuite (Osband et al., 2020) was introduced as a benchmark for characterising core capabilities of RL agents. We focus on Deep Sea, which is explicitly designed to test for deep exploration. It is a challenging exploration problem where only one out of $2^N$ policies yields any positive reward. Performance is compared on instances of the environment with grid sizes N ∈ {10, 12, ...
, 50}, with an overall "score" that is the percentage of sizes $N$ for which average regret falls below 0.9 faster than in $2^N$ episodes. The stochastic version generates a 'bad' transition with probability $1/N$. This is a relatively high degree of stochasticity, since the agent cannot recover from a bad transition within an episode. For all experiments, we use a standard MLP with Q-learning, off-policy replay, and a separate target network. See Appendix D for details and TDU results on the full suite. We compare TDU on Deep Sea to a battery of exploration methods, broadly divided into methods that facilitate exploration by (a) sampling from a posterior (Bootstrapped DQN, Noisy Nets (Fortunato et al., 2018), Successor Uncertainties (Janz et al., 2019)) or (b) using an intrinsic reward (Random Network Distillation (RND; Burda et al., 2018b), CTS (Bellemare et al., 2016), and Q-Explore (QEX; Simmons-Edler et al., 2019)). We report best scores obtained from a hyper-parameter sweep for each method. Overall, performance varies substantially between methods; only TDU performs (near-)optimally on both the deterministic and stochastic versions. Methods that rely on posterior sampling do well on the deterministic version but suffer a substantial drop in performance on the stochastic version. As the stochastic version serves to increase the complexity of modelling future state visitation, this is clear evidence that these methods suffer from the estimation bias identified in Section 2. We could not make Q-Explore and Noisy Nets perform well in the default Bsuite setup, while Successor Uncertainties suffer a catastrophic loss of performance on the stochastic version of Deep Sea. Examining TDU, we find that it facilitates exploration while retaining overall performance, except on Mountain Car, where β > 0 hurts performance (Appendix D). For Deep Sea (Figure 2), prior functions are instrumental, even for large exploration bonuses (β ≫ 0).
However, for a given prior strength, TDU does better than the BDQN (β = 0). In the stochastic version of Deep Sea, the BDQN suffers a significant loss of performance (Figure 2). As this is a ceteris paribus comparison, the performance difference can be directly attributed to an estimation bias in the BDQN that TDU circumvents through its intrinsic reward. That TDU facilitates efficient exploration despite environment stochasticity demonstrates that it can correct for such estimation errors. Finally, we verify Proposition 2 experimentally by comparing TDU to versions that estimate uncertainty directly over Q (full analysis in Appendix D.2): (a) a version where σ is defined as the standard deviation over Q, and (b) a version where σ(Q) is used as an upper confidence bound in the policy instead of as an intrinsic reward (Figure 2). Neither matches TDU's performance across Bsuite, and in particular on Deep Sea. These being ceteris paribus comparisons, this demonstrates that estimating uncertainty over TD-errors provides a stronger signal for exploration, as per Proposition 2.

5.2. ATARI

Proposition 1 shows that estimation bias is particularly likely in complex environments that require neural networks to generalise across states. In recent years, such domains have seen significant improvements from running on distributed training platforms that can process large amounts of experience obtained through agent parallelism. It is thus important to develop exploration algorithms that scale gracefully and can leverage the benefits of distributed training. Therefore, we evaluate whether TDU can have a positive impact when combined with the Recurrent Replay Distributed DQN (R2D2; Kapturowski et al., 2018), which achieves state-of-the-art results on the Atari 2600 suite by carefully combining a set of key components: a recurrent state, experience replay, off-policy value learning, and distributed training. As a baseline, we implemented a distributed version of the Bootstrapped DQN with additive prior functions. We present full implementation details, hyper-parameter choices, and results on all games in Appendix E. For our main results, we run each agent on 8 seeds for 20 billion steps. We focus on games that are well known to pose challenging exploration problems (Machado et al., 2018): montezuma_revenge, pitfall, private_eye, solaris, venture, gravitar, and tennis. Following standard practice, Figure 3 reports the Human Normalised Score (HNS),

$$\mathrm{HNS} = \frac{\text{Agent score} - \text{Random score}}{\text{Human score} - \text{Random score}},$$

as an aggregate result across exploration games, as well as results on montezuma_revenge and tennis, which are both known to be particularly hard exploration games (Machado et al., 2018). Generally, we find that TDU facilitates exploration substantially, improving the mean HNS across exploration games by 30% compared to baselines (right panel, Figure 3). An ANOVA analysis yields a statistically significant difference between TDU and non-TDU methods, controlling for game (F = 8.17, p = 0.0045).
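The HNS normalisation above is straightforward to compute; a minimal sketch with made-up scores (not results from the paper), where 0 corresponds to random play and 1 to human-level performance:

```python
# Human Normalised Score: linearly rescales a raw game score so that the
# random-policy baseline maps to 0 and the human baseline maps to 1.
def human_normalised_score(agent: float, random: float, human: float) -> float:
    return (agent - random) / (human - random)

# Illustrative (hypothetical) scores: halfway between random and human.
hns = human_normalised_score(agent=150.0, random=50.0, human=250.0)
```

Aggregate figures such as "mean HNS across exploration games" then average this quantity over the selected games.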
Notably, TDU achieves significantly higher returns on montezuma_revenge and is the only agent that consistently achieves the maximal return on tennis. We report all per-game results in Appendix E.4. We observe no significant gains from including prior functions with TDU and find that bootstrapping alone produces relatively marginal gains. Beyond exploration games, TDU can match or improve upon the baseline, but exhibits sensitivity to TDU hyper-parameters (β and the number of explorers N; see Appendix E.3 for details). This finding is in line with observations made by Puigdomènech Badia et al. (2020); combining TDU with online hyper-parameter adaptation (Schaul et al., 2019; Xu et al., 2018; Zahavy et al., 2020) is an exciting avenue for future research. See Appendix E for further comparisons. In Table 1, we compare TDU to recently proposed state-of-the-art exploration methods. While comparisons must be made with care due to different training regimes, computational budgets, and architectures, we note a general trend that no method is uniformly superior. Methods that are good on extremely sparse exploration games (montezuma_revenge and pitfall!) tend to do poorly on games with dense rewards, and vice versa. TDU is generally among the top 2 algorithms in all cases except on montezuma_revenge and pitfall!, where state-based exploration is needed to achieve sufficient coverage of the MDP. TDU generally outperforms Pixel-CNN (Ostrovski et al., 2017), CTS, and RND. TDU is the only algorithm to achieve super-human performance on solaris and achieves the highest score of all baselines considered on venture.

6. RELATED WORK

Bayesian approaches to exploration typically use uncertainty as the mechanism for balancing exploitation and exploration (Strens, 2000). A popular instance of this form of exploration is the PILCO algorithm (Deisenroth & Rasmussen, 2011). While we rely on the Bootstrapped DQN (Osband et al., 2016a) in this paper, several other uncertainty estimation techniques have been proposed, such as placing a parameterised distribution over model parameters (Fortunato et al., 2018; Plappert et al., 2018), modelling a distribution over both the value and the returns (Moerland et al., 2017), using Bayesian linear regression on the value function (Azizzadenesheli et al., 2018; Janz et al., 2019), or modelling the variance over value estimates as a Bellman operation (O'Donoghue et al., 2018). The underlying exploration mechanism in these works is posterior sampling from the agent's current beliefs (Thompson, 1933; Dearden et al., 1998); our work suggests that estimating this posterior is significantly more challenging than previously thought. An alternative to posterior sampling is to facilitate exploration via learning by introducing an intrinsic reward function. Previous works typically formulate intrinsic rewards in terms of state visitation (Lopes et al., 2012; Bellemare et al., 2016; Puigdomènech Badia et al., 2020), state novelty (Schmidhuber, 1991; Oudeyer & Kaplan, 2009; Pathak et al., 2017), or state predictability (Florensa et al., 2017; Burda et al., 2018b; Gregor et al., 2016; Hausman et al., 2018). Most of these works rely on properties of the state space to drive exploration while ignoring rewards. While this can be effective in sparse reward settings (e.g. Burda et al., 2018b; Puigdomènech Badia et al., 2020), it can also lead to arbitrarily bad exploration (see analysis in Osband et al., 2019).
A smaller body of work uses statistics derived from observed rewards (Nachum et al., 2016) or TD-errors to design intrinsic reward functions; our work is particularly related to the latter. Tokic (2010) proposes an extension of ε-greedy exploration in which the exploration rate is modulated by the TD-error, so that exploration is higher in states with higher TD-error. Gehring & Precup (2013) use the mean absolute TD-error, accumulated over time, to measure controllability of a state and reward the agent for visiting states with low mean absolute TD-error. In contrast to our work, this method integrates the TD-error over time to obtain a measure of irreducibility. Simmons-Edler et al. (2019) propose to use two Q-networks, where one is trained on data collected under both networks and the other obtains an intrinsic reward equal to the absolute TD-error of the first network on a given transition. In contrast to our work, this method does not have a probabilistic interpretation and thus does not control for uncertainty over the environment. TD-errors have also been used in White et al. (2015), where surprise is defined in terms of the moving average of the TD-error relative to its variance. Kumaraswamy et al. (2018) rely on least-squares TD-errors to derive a context-dependent upper-confidence bound for directed exploration. Finally, using the TD-error as an exploration signal is related to the notion of "learnability" or curiosity as a signal for exploration, which is often modelled in terms of the prediction error in a dynamics model (e.g. Schmidhuber, 1991; Oudeyer et al., 2007; Gordon & Ahissar, 2011; Pathak et al., 2017).

7. CONCLUSION

We present Temporal Difference Uncertainties (TDU), a method for estimating uncertainty over an agent's value function. Obtaining well-calibrated uncertainty estimates under function approximation is non-trivial and we show that popular approaches, while in principle valid, can fail to accurately represent uncertainty over the value function because they must represent an unknown future. This motivates TDU as an estimate of uncertainty conditioned on observed state-action transitions, so that the only source of uncertainty for a given transition is due to uncertainty over the agent's parameters. This gives rise to an intrinsic reward that encodes the agent's model uncertainty, and we capitalise on this signal by introducing a distinct exploration policy. This policy is incentivised to collect data over which the agent has high model uncertainty and we highlight how this separation gives rise to a form of cooperative multi-agent game. We demonstrate empirically that TDU can facilitate efficient exploration in hard exploration games such as Deep Sea and Montezuma's Revenge.

A IMPLEMENTATION AND CODE

In this section, we provide code for implementing TDU in a general policy-agnostic setting and in the specific case of bootstrapped Q-learning. Algorithm 3 presents TDU in a policy-agnostic framework. TDU can be implemented as a pre-processing step (Line 9) that augments the reward with the exploration signal before computing the policy loss. If Algorithm 3 is used to learn a single policy, that policy benefits from the TDU exploration signal, but no distinct exploration policy can be learned; in particular, on-policy learning does not admit such a separation. To learn a distinct exploration policy, we can use Algorithm 3 to train the exploration policy, while another policy is trained to maximise extrinsic rewards only, using both its own data and data from the exploration policy. With multiple policies, we need a mechanism for sampling behavioural policies; in our experiments we settled on uniform sampling, though more sophisticated methods can potentially yield better performance. In the case of value-based learning, TDU takes a special form that can be implemented efficiently as a staggered computation of TD-errors (Algorithm 4). Concretely, we compute an estimate of the distribution of TD-errors from a given distribution over the value-function parameters (Algorithm 4, Line 3). These TD-errors are used to compute the TDU signal σ, which then modulates the reward used to train a Q-function (Algorithm 4, Line 7). Because the only quantities being computed are TD-errors, this can be combined into a single error signal (Algorithm 4, Line 11). When implemented with bootstrapping, Qparams denotes the ensemble Q and Qtilde_distribution_params denotes the ensemble Q̃; we compute the loss as in Algorithm 2. Finally, Algorithm 5 presents a complete JAX (Bradbury et al., 2018) implementation that can be used with the Bsuite (Osband et al., 2020) codebase.
We also present the corresponding TDU agent class, a modified version of the BootstrappedDqn class in bsuite/baselines/jax/boot_dqn/agent.py that can be used as a direct swap-in. Algorithm 3 (pseudo-code for the generic TDU loss) defines a function loss(transitions, pi_params, Qtilde_distribution_params, beta) whose first step is to estimate the TD-error distribution.
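As a minimal sketch of the generic TDU pre-processing step described above, the following NumPy snippet computes per-transition TD-errors under an ensemble of value functions and augments the reward with β times their standard deviation. This is our own illustration, not the paper's Algorithm 3; the function and argument names are hypothetical.

```python
import numpy as np

def tdu_augmented_rewards(q_values, next_q_values, rewards, discounts, beta):
    """Augment rewards with a TDU-style exploration bonus.

    q_values:      [K, B] Q(s, a) under K ensemble members for B transitions.
    next_q_values: [K, B] Q(s', pi(s')) under the same K members.
    rewards:       [B] extrinsic rewards.
    discounts:     [B] per-transition discounts (gamma, or 0 at terminal states).
    beta:          scale of the exploration bonus.
    """
    # TD-error per ensemble member and transition: delta_k = r + gamma * Q_k(s') - Q_k(s).
    td_errors = rewards + discounts * next_q_values - q_values  # shape [K, B]
    # TDU signal: sample standard deviation of TD-errors across the ensemble.
    sigma = td_errors.std(axis=0, ddof=1)  # shape [B]
    return rewards + beta * sigma
```

In an agent, this augmented reward would replace the extrinsic reward in the exploration policy's loss, leaving the exploiter's loss untouched.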

B PROOFS

We begin with the proof of Lemma 1. We show that if Eqs. 2 and 3 fail, then p(θ) induces a distribution p(Q_θ) whose first two moments are biased estimators of the moments of the distribution of interest, p(Q^M_π), for any choice of belief over the MDP, p(M). We restate the lemma here for convenience.

Lemma 1. If E_θ[Q_θ] and V_θ[Q_θ] fail to satisfy Eqs. 2 and 3, respectively, they are biased estimators of E_M[Q^M_π] and V_M[Q^M_π] for any choice of p(M).

Proof. Assume the contrary, that M_M[Q^M_π(s, π(s))] = M_θ[Q_θ(s, π(s))] for all (s, a) ∈ S × A and for any M ∈ {E, V}. Then

    M_M[Q^M_π(s, π(s))] = M_θ[Q_θ(s, π(s))]                                          (12)
      = M_θ[ E_{s'∼P(s,π(s)), r∼R(s,π(s))}[ r + γ Q_θ(s', π(s')) ] ]                 (13)
      = E_{s'∼P(s,π(s)), r∼R(s,π(s))}[ r + γ M_θ[Q_θ(s', π(s'))] ]                   (14)
      = E_{s'∼P(s,π(s)), r∼R(s,π(s))}[ r + γ M_M[Q^M_π(s', π(s'))] ]                 (15)
      = M_M[ E_{s'∼P(s,π(s)), r∼R(s,π(s))}[ r + γ Q^M_π(s', π(s')) ] ]               (16)
      = M_M[Q^M_π(s, π(s))],                                                         (17)

a contradiction; we conclude that M_M[Q^M_π(s, π(s))] ≠ M_θ[Q_θ(s, π(s))]. Eqs. 13 and 17 use Eqs. 2 and 3; Eqs. 12 and 15 follow by assumption; Eqs. 14 and 16 use linearity of the expectation operator E_{r,s'}, by virtue of M being defined over θ. As (s, a, r, s') and p(M) are arbitrary, the conclusion follows. ∎

Methods that take inspiration from PSRL but rely on neural networks typically approximate p(M) by a parameter distribution p(θ) over the value function. Lemma 1 establishes that the distribution p(Q_θ), induced as the push-forward of p(θ), must propagate its moments consistently over the state space to be an unbiased estimate of p(Q^M_π), for any p(M). With this in mind, we now turn to neural networks and their ability to estimate value-function uncertainty in MDPs.

To prove our main result, we establish two intermediate results. Recall that we define a function approximator Q_θ = w ∘ φ_ϑ, where θ = (w_1, ..., w_n, ϑ_1, ..., ϑ_v); w ∈ R^n is a linear layer and φ : S × A → R^n is a feature extractor with parameters ϑ ∈ R^v. As before, let M be an MDP (S, A, P, R, γ) with discrete state and action spaces. We denote by N the number of state-action pairs with E_θ[Q_θ(s, a)] ≠ E_θ[Q_θ(s', a')], with N ⊂ S × A × S × A the set of all such pairs (s, a, s', a'). This set can be thought of as a minimal MDP: the set of states within a larger MDP where the function approximator generates unique predictions. It arises in an MDP through dense rewards, stochastic rewards, or irrevocable decisions, such as in Deep Sea.

Our first result concerns a very common approach, where ϑ is taken to be a point estimate so that p(θ) = p(w). This approach is often used for large neural networks, where placing a posterior over the full network would be too costly (Osband et al., 2016a; O'Donoghue et al., 2018; Azizzadenesheli et al., 2018; Janz et al., 2019).

Lemma 2. Let p(θ) = p(w). If N > n, with w ∈ R^n, then E_θ[Q_θ] fails to satisfy the first-moment Bellman equation (Eq. 2). Further, if N > n², then V_θ[Q_θ] fails to satisfy the second-moment Bellman equation (Eq. 3).

Proof. Write the first condition of Eq. 2 as

    E_θ[ w^T φ_ϑ(s, a) ] = E_θ[ E_{r,s'}[ r + γ w^T φ_ϑ(s', π(s')) ] ].              (19)

Using linearity of the expectation operator along with p(θ) = p(w), we have

    E_w[w]^T φ_ϑ(s, a) = μ(s, a) + γ E_w[w]^T E_{s'}[φ_ϑ(s', π(s'))],

where μ(s, a) = E_{r∼R(s,a)}[r]. Rearrange to get

    μ(s, a) = E_w[w]^T ( φ_ϑ(s, a) − γ E_{s'}[φ_ϑ(s', π(s'))] ).                     (20)

By assumption, E_θ[Q_θ(s, a)] ≠ E_θ[Q_θ(s', a')], which implies φ_ϑ(s, a) ≠ φ_ϑ(s', π(s')) by linearity in w. Hence φ_ϑ(s, a) − γ E_{s'}[φ_ϑ(s', π(s'))] is non-zero and unique for each (s, a). Thus Eq. 20 forms a system of linear equations over S × A, which can be reduced to a full-rank system over N:

    μ = Φ E_w[w],                                                                    (21)

where μ ∈ R^N stacks the expected rewards μ(s, a) and Φ ∈ R^{N×n} stacks the vectors φ_ϑ(s, a) − γ E_{s'}[φ_ϑ(s', π(s'))] row-wise. Because Φ is full rank, if N > n this system has no solution, and the conclusion follows for E_θ[Q_θ].

If the estimator of the mean is biased, any estimator of the variance constructed from it is also biased. For an unbiased mean, using linearity in w, write the condition of Eq. 3 as

    E_θ[ ( (w − E_w[w])^T φ_ϑ(s, a) )² ] = E_θ[ ( γ (w − E_w[w])^T E_{s'}[φ_ϑ(s', π(s'))] )² ].   (22)

Let w̄ = w − E_w[w], x = w̄^T φ_ϑ(s, a) and y = γ w̄^T E_{s'}[φ_ϑ(s', a')]. Rearranging gives E_θ[x² − y²] = E_w[(x − y)(x + y)] = 0. Expanding terms, we find

    0 = E_w[ ( w̄^T [φ_ϑ(s, a) − γ E_{s'}[φ_ϑ(s', a')]] ) ( w̄^T [φ_ϑ(s, a) + γ E_{s'}[φ_ϑ(s', a')]] ) ]   (23)
      = Σ_{i=1}^n Σ_{j=1}^n E_w[w̄_i w̄_j] d⁻_i d⁺_j = Σ_{i=1}^n Σ_{j=1}^n Cov(w_i, w_j) d⁻_i d⁺_j,         (24)

where d⁻ = φ_ϑ(s, a) − γ E_{s'}[φ_ϑ(s', a')] and d⁺ = φ_ϑ(s, a) + γ E_{s'}[φ_ϑ(s', a')]. As before, d⁻ and d⁺ are non-zero by the assumption of unique Q-values. Performing a change of variables ω_{α(i,j)} = Cov(w_i, w_j) and λ_{α(i,j)} = d⁻_i d⁺_j, we write Eq. 24 as 0 = λ^T ω. Repeating this for every state and action yields a system 0 = Λω, where 0 ∈ R^N and Λ ∈ R^{N×n²} is defined by stacking the vectors λ row-wise. This is a system of linear equations, and if N > n² no solution exists; the conclusion follows for V_θ[Q_θ], concluding the proof. ∎

Note that if E_θ[Q_θ] is biased and is used to construct the estimator V_θ[Q_θ], then this estimator is also biased; hence if N > n, p(θ) induces biased estimators E_θ[Q_θ] and V_θ[Q_θ] of E_M[Q^M_π] and V_M[Q^M_π], respectively. Lemma 2 can be seen as a statement about linear uncertainty. While the result is not too surprising from this point of view, it nonetheless concerns a frequently used approach to uncertainty estimation. We may hope, then, that by placing uncertainty over the feature extractor as well, we can benefit from its non-linearity to obtain greater representational capacity with respect to uncertainty propagation. Such posteriors come at a price.
Placing a full posterior over a neural network is often computationally infeasible; instead, a common approach is to use a diagonal posterior, i.e. Cov(θ_i, θ_j) = 0 for i ≠ j (Fortunato et al., 2018; Plappert et al., 2018). Our next result shows that any posterior of this form suffers from the same limitations as placing a posterior only over the final layer. In fact, we establish something stronger: any posterior of the form p(θ) = p(w)p(ϑ) suffers from the limitations described in Lemma 2.

Lemma 3. Let p(θ) = p(w)p(ϑ). If N > n, with w ∈ R^n, then E_θ[Q_θ] fails to satisfy the first-moment Bellman equation (Eq. 2). Further, if N > n², then V_θ[Q_θ] fails to satisfy the second-moment Bellman equation (Eq. 3).

Proof. The proof largely proceeds as that of Lemma 2. Rewrite Eq. 19 as

    E_w[w]^T E_ϑ[φ_ϑ(s, a)] = μ(s, a) + γ E_w[w]^T E_{s'}[ E_ϑ[φ_ϑ(s', π(s'))] ].    (25)

Perform a change of variables φ̄ = E_ϑ[φ_ϑ] to obtain

    μ(s, a) = E_w[w]^T ( φ̄(s, a) − γ E_{s'}[φ̄(s', π(s'))] ).                        (26)

Because E_θ[Q_θ(s, a)] ≠ E_θ[Q_θ(s', a')], by linearity in w we have that φ̄(s, a) − φ̄(s', a') is non-zero for any (s', a'), and hence Eq. 26 has no trivial solutions. Proceeding as in the proof of Lemma 2 gives μ = Φ̄ E_w[w], where Φ̄ is defined analogously. If N > n there is no solution E_w[w] for any admissible (full-rank) choice of Φ̄, and the conclusion follows for the first part. For the second part, using E_θ = E_w E_ϑ in Eq. 24 yields

    0 = Σ_{i=1}^n Σ_{j=1}^n E_w[w̄_i w̄_j] E_ϑ[d⁻_i d⁺_j] = Σ_{i=1}^n Σ_{j=1}^n Cov(w_i, w_j) E_ϑ[d⁻_i d⁺_j].   (27)

Perform a change of variables λ̄_{α(i,j)} = E_ϑ[d⁻_i d⁺_j]. Again, because E_θ[Q_θ(s, a)] ≠ E_θ[Q_θ(s', a')], λ̄ is non-zero; proceed as before to complete the proof. ∎

We are now ready to prove our main result, restated here for convenience.

Proposition 1. If the number N of state-action pairs with E_θ[Q_θ(s, a)] ≠ E_θ[Q_θ(s', a')] is greater than n, where w ∈ R^n, then E_θ[Q_θ] and V_θ[Q_θ] are biased estimators of E_M[Q^M_π] and V_M[Q^M_π] for any choice of p(M).

Proof. Let p(θ) be of the form p(θ) = p(w) or p(θ) = p(w)p(ϑ). By Lemmas 2 and 3, E_θ[Q_θ] fails to satisfy Eq. 2. By Lemma 1, this causes E_θ[Q_θ] to be a biased estimator of E_M[Q^M_π], which in turn implies that V_θ[Q_θ] is a biased estimator of V_M[Q^M_π]. Further, if N > n², V_θ[Q_θ] is biased independently of E_θ[Q_θ]. ∎

We now turn to analysing the bias of our proposed estimators. As before, we build up to Proposition 2 through a series of lemmas. For the purpose of these results, let B : S × A → R denote the bias of E_θ[Q_θ] at any tuple (s, a) ∈ S × A, so that Bias(E_θ[Q_θ](s, a)) = B(s, a).

Lemma 4. Given a transition τ := (s, a, r, s'), for any p(M) and given p(θ), if

    ρ := B(s', π(s')) / B(s, a) ∈ (0, 2/γ),                                          (28)

then E_θ[δ(θ, τ) | τ] has less bias than E_θ[Q_θ(s, a)].

Proof. By direct manipulation of E_θ[δ(θ, τ) | τ],

    E_θ[δ(θ, τ) | τ] = E_θ[ γ Q_θ(s', π(s')) + r − Q_θ(s, a) ]                       (29)
      = γ E_θ[Q_θ(s', π(s'))] + r − E_θ[Q_θ(s, a)]                                   (30)
      = γ E_M[Q^M_π(s', π(s'))] + r − E_M[Q^M_π(s, a)] + γ B(s', π(s')) − B(s, a)    (31)
      = E_M[δ^M_π(τ)] + γ B(s', π(s')) − B(s, a).                                    (32)

The bias of E_θ[δ(θ, τ) | τ] is therefore γ B(s', π(s')) − B(s, a) = (γρ − 1)B(s, a). For this to be smaller in magnitude than B(s, a), we need |γρ − 1| < 1, i.e. ρ ∈ (0, 2/γ), as was to be proved. ∎

We now characterise the conditions under which V_θ[δ(θ, τ) | τ] enjoys a smaller bias than V_θ[Q_θ(s, a)]. Because the variance term involves squaring the TD-error, we must place some restrictions on the expected behaviour of the Q-function to bound the bias. First, as with B, let C : S × A → R denote the bias of E_θ[Q_θ²] at any tuple (s, a), so that Bias(E_θ[Q_θ(s, a)²]) = C(s, a). Similarly, let D : S × A × S → R denote the bias of E_θ[Q_θ(s', π(s'))Q_θ(s, a)] at any (s, a, s') ∈ S × A × S.

Lemma 5. For any τ and any p(M), given p(θ), define the relative bias ratios

    ρ = B(s', π(s')) / B(s, a),   φ = C(s', π(s')) / C(s, a),
    κ = D(s, a, s') / C(s, a),    α = E_M[Q^M_π(s', π(s'))] / E_M[Q^M_π(s, a)].      (33)

There exist ρ ≈ 1, φ ≈ 1, κ ≈ 1, α ≈ 1 such that V_θ[δ(θ, τ) | τ] has less bias than V_θ[Q_θ(s, a)]. In particular, if ρ = φ = κ = α = 1, then

    |Bias(V_θ[δ(θ, τ) | τ])| = (γ − 1)² |Bias(V_θ[Q_θ(s, a)])| < |Bias(V_θ[Q_θ(s, a)])|.   (34)

Further, if ρ = 1/γ, κ = 1/γ and φ = 1/γ², then |Bias(V_θ[δ(θ, τ) | τ])| = 0 for any α.

Proof. We begin by characterising the bias of V_θ[Q_θ(s, a)]. Write

    V_θ[Q_θ(s, a)] = E_θ[Q_θ(s, a)²] − E_θ[Q_θ(s, a)]²                               (35)
      = E_M[Q^M_π(s, a)²] + C(s, a) − ( E_M[Q^M_π(s, a)] + B(s, a) )².               (36)

The squared term expands as

    ( E_M[Q^M_π(s, a)] + B(s, a) )² = E_M[Q^M_π(s, a)]² + 2 E_M[Q^M_π(s, a)] B(s, a) + B(s, a)².   (37)

Writing A(s, a) := E_M[Q^M_π(s, a)] B(s, a), the bias is therefore

    Bias(V_θ[Q_θ(s, a)]) = C(s, a) − 2A(s, a) − B(s, a)².                            (38)

We now turn to V_θ[δ(θ, τ) | τ]. First note that the reward cancels in this expression:

    δ(θ, τ) − E_θ[δ(θ, τ)] = γ Q_θ(s', π(s')) − Q_θ(s, a) − ( γ E_θ[Q_θ(s', π(s'))] − E_θ[Q_θ(s, a)] ).   (39)

Denote x_θ = γ Q_θ(s', π(s')) − Q_θ(s, a), with E_θ[x_θ] = γ E_θ[Q_θ(s', π(s'))] − E_θ[Q_θ(s, a)]. Write

    V_θ[δ(θ, τ) | τ] = E_θ[ ( δ(θ, τ) − E_θ[δ(θ, τ)] )² ]                            (40)
      = E_θ[ ( x_θ − E_θ[x_θ] )² ]                                                   (41)
      = E_θ[x_θ²] − E_θ[x_θ]²                                                        (42)
      = E_θ[ ( γ Q_θ(s', π(s')) − Q_θ(s, a) )² ] − ( γ E_θ[Q_θ(s', π(s'))] − E_θ[Q_θ(s, a)] )².   (43)

Eq. 41 uses Eq. 39, and Eq. 43 substitutes back for x_θ. We consider each term in the last expression in turn. For the first term, expanding the square yields

    γ² E_θ[Q_θ(s', π(s'))²] − 2γ E_θ[Q_θ(s', π(s')) Q_θ(s, a)] + E_θ[Q_θ(s, a)²].    (44)

From this, we obtain the bias

    Bias( E_θ[ ( γ Q_θ(s', π(s')) − Q_θ(s, a) )² ] ) = γ² C(s', π(s')) − 2γ D(s, a, s') + C(s, a)   (45)
      = ( γ²φ − 2γκ + 1 ) C(s, a).                                                   (46)

We can compare this to the term C(s, a) in the bias of V_θ[Q_θ(s, a)] (Eq. 38). For the bias term in Eq. 46 to be smaller, we require |(γ²φ − 2γκ + 1) C(s, a)| < |C(s, a)|, from which it follows that γ²φ − 2γκ + 1 ∈ (−1, 1). In terms of φ, this means

    φ ∈ ( (2κγ − 2)/γ², 2κ/γ ).                                                      (47)

If the bias D is close to C (κ ≈ 1), this is approximately the same condition as for ρ in Lemma 4. Generally, as κ grows large, φ must grow small, and vice versa. The gist of this requirement is that the biases should be relatively balanced, κ ≈ φ ≈ 1.

For the second term in Eq. 43, recall that E_θ[Q_θ(s', π(s'))] = E_M[Q^M_π(s', π(s'))] + B(s', π(s')) and E_θ[Q_θ(s, a)] = E_M[Q^M_π(s, a)] + B(s, a). We have

    ( γ E_θ[Q_θ(s', π(s'))] − E_θ[Q_θ(s, a)] )² = ( (γα − 1) E_M[Q^M_π(s, a)] + (γρ − 1) B(s, a) )²,   (48)

where α = E_M[Q^M_π(s', π(s'))] / E_M[Q^M_π(s, a)]. This expands as

    (γα − 1)² E_M[Q^M_π(s, a)]² + 2(γα − 1)(γρ − 1) E_M[Q^M_π(s, a)] B(s, a) + (γρ − 1)² B(s, a)².   (49)

Note that ( γ E_M[Q^M_π(s', π(s'))] − E_M[Q^M_π(s, a)] )² = (γα − 1)² E_M[Q^M_π(s, a)]², and so the bias of V_θ[δ(θ, τ) | τ] can be written as

    Bias(V_θ[δ(θ, τ) | τ]) = w₁(φ, κ) C(s, a) − w₂(α, ρ) 2A(s, a) − w₃(ρ) B(s, a)²,  (50)

where

    w₁(φ, κ) = γ²φ − 2γκ + 1,   w₂(α, ρ) = (γα − 1)(γρ − 1),   w₃(ρ) = (γρ − 1)².   (51)

The bias in Eq. 50 involves the same terms as the bias of V_θ[Q_θ(s, a)] (Eq. 38), but weighted. Hence there always exists a set of weights such that |Bias(V_θ[δ(θ, τ) | τ])| < |Bias(V_θ[Q_θ(s, a)])|. In particular, if ρ = 1/γ, κ = 1/γ and φ = 1/γ², then w₁ = w₂ = w₃ = 0 and |Bias(V_θ[δ(θ, τ) | τ])| = 0 for any α. Further, if ρ = α = κ = φ = 1, then w₁(φ, κ) = w₂(α, ρ) = w₃(ρ) = (γ − 1)², and so |Bias(V_θ[δ(θ, τ) | τ])| = (γ − 1)² |Bias(V_θ[Q_θ(s, a)])| < |Bias(V_θ[Q_θ(s, a)])|, as desired. ∎

Proposition 2. For any τ and any p(M), given p(θ), if ρ ∈ (0, 2/γ), then E_θ[δ(θ, τ) | τ] has lower bias than E_θ[Q_θ(s, a)]. Additionally, there exist ρ ≈ 1, φ ≈ 1, κ ≈ 1, α ≈ 1 such that V_θ[δ(θ, τ) | τ] has less bias than V_θ[Q_θ(s, a)]. In particular, if ρ = φ = κ = α = 1, then |Bias(V_θ[δ(θ, τ) | τ])| = (γ − 1)² |Bias(V_θ[Q_θ(s, a)])| < |Bias(V_θ[Q_θ(s, a)])|. Further, if ρ = 1/γ, κ = 1/γ and φ = 1/γ², then |Bias(V_θ[δ(θ, τ) | τ])| = 0 for any α.

Proof. The first part follows from Lemma 4, the second from Lemma 5. ∎
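As a quick numerical sanity check of the two special cases in Lemma 5, the snippet below evaluates the weights w₁(φ, κ), w₂(α, ρ) and w₃(ρ) on the bias terms (our own helper function, not from the paper's code):

```python
def bias_weights(gamma, rho, phi, kappa, alpha):
    """Weights on the bias terms C, 2A and B^2 in the bias of V[delta | tau]."""
    w1 = gamma ** 2 * phi - 2.0 * gamma * kappa + 1.0
    w2 = (gamma * alpha - 1.0) * (gamma * rho - 1.0)
    w3 = (gamma * rho - 1.0) ** 2
    return w1, w2, w3
```

With ρ = φ = κ = α = 1 every weight collapses to (γ − 1)², and with ρ = κ = 1/γ, φ = 1/γ² all weights vanish regardless of α, matching the lemma.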

C BINARY TREE MDP

In this section, we make a direct comparison between the bootstrapped DQN and TDU on the Binary Tree MDP introduced by Janz et al. (2019). In this MDP, the agent has two actions in every state: one action terminates the episode with 0 reward, while the other moves the agent one step further up the tree. At the final branch, one leaf yields a reward of 1. Which action terminates the episode and which moves the agent to the next branch is chosen at random per branch, so the agent must learn the action map for each branch separately. This environment is similar to Deep Sea, but simpler in that an episode terminates upon taking a wrong action and the agent does not receive a small negative reward for taking the correct action. We include the Binary Tree MDP experiment to compare the scaling properties of TDU against the bootstrapped DQN on a well-known benchmark. We use the default Bsuite implementation [2] of the bootstrapped DQN, with the default architecture and hyper-parameters from the published baseline, reported in Table 2. The agent is composed of a two-layer MLP with ReLU activations that approximates Q(s, a) and is trained using experience replay. For the bootstrapped DQN, all ensemble members learn from a shared replay buffer with bootstrapped data sampling, where each member Q_θk is a separate MLP (no parameter sharing) regressed towards its own target network. We use Adam (Kingma & Ba, 2015) and update target networks periodically (Table 2). We run 5 seeds per tree depth, for depths L ∈ {10, 20, ..., 250}, and report mean performance in Figure 4. Our results are in line with those of Janz et al. (2019); differences are due to how many gradient steps are taken per episode (our results fall between the reported scores for the 1× and 25× versions of the bootstrapped DQN). We observe a clear beneficial effect of including TDU, even for small values of β.
Further, we note that performance is largely monotonically increasing in β, further demonstrating that the TDU signal is well-behaved and robust to hyper-parameter values. We study the properties of TDU in Figure 5 , which reports performance without prior functions (λ = 0). We vary β and the number of exploration value functions N . The total number of value functions is fixed at 20, and so varying N is equivalent to varying the degree of exploration. We note that N has a similar effect to β, but has a slightly larger tendency to induce over-exploration for large values of N . 
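The Binary Tree MDP described above can be sketched in a few lines; the class and method names below are ours, and the sketch only captures the dynamics as described (one randomly assigned "advance" action per branch, termination with zero reward otherwise, reward 1 at the final branch):

```python
import random

class BinaryTreeMDP:
    """Chain of `depth` branches; at each branch one (randomly assigned)
    action advances, the other terminates with zero reward. Advancing at
    the final branch yields reward 1."""

    def __init__(self, depth, seed=0):
        rng = random.Random(seed)
        self.depth = depth
        # Per-branch action map: which of the two actions (0 or 1) advances.
        self.advance_action = [rng.randint(0, 1) for _ in range(depth)]
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Returns (next_state, reward, done)."""
        if action != self.advance_action[self.state]:
            return self.state, 0.0, True   # wrong action: terminate, no reward
        if self.state == self.depth - 1:
            return self.state, 1.0, True   # solved the final branch
        self.state += 1
        return self.state, 0.0, False
```

An optimal agent collects reward 1 after exactly `depth` steps, which is why the problem becomes exponentially hard for dithering exploration as the depth grows.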

D BEHAVIOUR SUITE

From Osband et al. (2020) [1]: "The Behaviour Suite for Reinforcement Learning (Bsuite) is a collection of carefully-designed experiments that investigate core capabilities of a reinforcement learning agent. The aim of the Bsuite project is to collect clear, informative and scalable problems that capture key issues in the design of efficient and general learning algorithms and study agent behaviour through their performance on these shared benchmarks." All baselines use the default Bsuite DQN implementation [3]. We use the default architecture and hyper-parameters from the published baseline, reported in Table 2, and sweep over algorithm-specific hyper-parameters, reported in Table 3. The agent is composed of a two-layer MLP with ReLU activations that approximates Q(s, a) and is trained using experience replay. For the bootstrapped DQN, all ensemble members learn from a shared replay buffer with bootstrapped data sampling, where each member Q_θk is a separate MLP (no parameter sharing) regressed towards its own target network. We use Adam (Kingma & Ba, 2015) and update target networks periodically (Table 2).

D.1 AGENTS AND HYPER-PARAMETERS

QEX Uses two networks Q_θ and Q_ϑ, where Q_θ is trained to maximise the extrinsic reward while Q_ϑ is trained to maximise the absolute TD-error of Q_θ (Simmons-Edler et al., 2019). In contrast to TDU, the intrinsic reward is a point estimate of the TD-error for a given transition and thus cannot be interpreted as measuring uncertainty as such.

CTS Implements a count-based reward defined by i(s, a, H) = (N(s, a, H) + 0.01)^(−1/2), where H is the history and N(s, a, H) = Σ_{τ∈H} 1_{(s,a)∈τ} is the number of times (s, a) has appeared in a transition τ := (s, a, r, s'). This intrinsic reward is added to the extrinsic reward to form an augmented reward r̄ = r + βi used to train a DQN agent (Bellemare et al., 2016).

RND Uses two auxiliary networks f_ϑ and f_θ̄ that map a state into vectors x = f_ϑ(s) and x̄ = f_θ̄(s), with x, x̄ ∈ R^m. While θ̄ is a random parameter vector that is fixed throughout training, ϑ is trained to minimise the mean squared error i(s) = ||x − x̄||². This error is simultaneously used as an intrinsic reward in the augmented reward function r̄(s, a) = r(s, a) + βi(s) used to train a DQN agent. Following Burda et al. (2018b), we normalise intrinsic rewards by an exponential moving average of the mean and standard deviation, updated with batch statistics (with decay α).

BDQN Trains an ensemble Q = {Q_θk}_{k=1}^K of DQNs (Osband et al., 2016a). At the start of each episode, one DQN is chosen uniformly at random, from which a greedy policy is derived. Collected data is placed in a shared replay memory, and each ensemble member has some probability ρ of training on any given transition in the replay. Each ensemble member has its own target network. In addition, each DQN is augmented with a random prior function f_ϑk, where ϑk is a fixed parameter vector randomly sampled at the start of training; each DQN is defined by Q_θk + λf_ϑk, where λ is a hyper-parameter regulating the scale of the prior. Note that the target network uses a distinct prior function.

SU Decomposes the DQN as Q_θ(s, a) = w^T ψ_ϑ(s, a). The parameters ϑ are trained to satisfy the successor feature identity, while w is learned using Bayesian linear regression; at the start of each episode, a new w is sampled from the posterior p(w | history) (Janz et al., 2019) [4].

NNS NoisyNets replaces feed-forward layers Wx + b by a noisy equivalent (W + Σ ⊙ ε^W)x + (b + σ ⊙ ε^b), where ⊙ is element-wise multiplication and ε^W_ij ∼ N(0, β), ε^b_i ∼ N(0, β) are white noise of the same shape as W and b, respectively. The set (W, Σ, b, σ) are learnable parameters trained on the normal TD-error, with the noise re-sampled after every optimisation step. Following Fortunato et al. (2018), noise is sampled separately for the target and the online network.

TDU We fix the number of explorers to 10 (half the number of value functions in the ensemble), which roughly corresponds to randomly sampling between a reward-maximising policy and an exploration policy. Our experiments can be replicated by running the TDU agent implemented in Algorithm 5 in the Bsuite GitHub repository [5].

Table 3: Algorithm-specific hyper-parameter sweeps.

QEX   Intrinsic reward scale (β): {10^-4, 10^-3, 10^-2, 10^-1, 10^0, 5·10^0, 10^1, 10^2, 10^3}
CTS   Intrinsic reward scale (β): {10^-4, 10^-3, 10^-2, 10^-1, 5·10^0, 10^0, 10^1, 10^2, 10^3}
RND   Intrinsic reward scale (β): {10^-2, 5·10^-1, 10^-1, 10^0, 5·10^0, 10^1, 10^2};
      x-dim (m): {10, 64, 128}; moving-average decay (α): {0.9, 0.99, 0.999};
      normalise intrinsic reward: {True, False}
BDQN  Prior scale (λ): {0, 1, 3, 5, 10, 50, 100}
SU    Hidden size: {20, 64}; likelihood variance (β): {10^-2, 10^-1, 10^0, 10^1, 10^2};
      prior variance: {10^-3, 10^-1, 10^0, 10^1, 10^3}
NNS   Noise scale (β): {10^-2, 10^-1, 10^0, 10^1, 10^2}
TDU   Prior scale (λ): {0, 10^0, 3·10^0}; intrinsic reward scale (β): {10^-3, 10^-2, 10^-1, 10^0, 5·10^0, 10^1}
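The CTS-style count-based bonus above can be sketched as follows; this is a minimal tabular illustration with our own class and method names, not the baseline's actual implementation:

```python
from collections import defaultdict

class CountBonus:
    """Tabular count-based intrinsic reward i(s, a) = (N(s, a) + 0.01)^(-1/2)."""

    def __init__(self, beta):
        self.beta = beta
        self.counts = defaultdict(int)

    def augment(self, s, a, extrinsic_reward):
        # Record the visit, then compute the (decaying) novelty bonus.
        self.counts[(s, a)] += 1
        intrinsic = (self.counts[(s, a)] + 0.01) ** -0.5
        return extrinsic_reward + self.beta * intrinsic
```

The bonus decays as 1/√N with the visitation count, so frequently visited state-action pairs quickly stop contributing to the augmented reward.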

D.2 TDU EXPERIMENTS

Effect of TDU Our main experiment sweeps over β to study the effect of increasing the TDU exploration bonus, with β ∈ {0, 0.01, 0.1, 0.5, 1, 2, 3, 5}; β = 0 corresponds to the default bootstrapped DQN. We find that β reflects the exploitation-exploration trade-off: increasing β leads to better performance on exploration tasks (see main paper) but typically to worse performance on tasks that do not require exploration beyond ε-greedy (Figure 6). In particular, we find that β > 0 prevents the agent from learning on Mountain Car, but otherwise retains performance on non-exploration tasks. Figure 7 provides an in-depth comparison per game. Because σ is a principled measure of concentration in the distribution p(δ | s, a, r, s'), β can be interpreted as specifying how much of the tail of the distribution the agent should care about: the higher we set β, the greater the agent's sensitivity to the tail of its uncertainty estimate. Thus, there is no reason in general to believe that a single β should fit all environments, and recent advances in multi-policy learning (Schaul et al., 2019; Zahavy et al., 2020; Puigdomènech Badia et al., 2020) suggest that a promising avenue for further research is to incorporate mechanisms that allow either β or the sampling probability over policies to adapt dynamically. To provide concrete evidence to that effect, we conduct an ablation study using bandit policy sampling below.

Effect of prior functions

We study the interrelationship between additive prior functions (Osband et al., 2019) and TDU. We sweep over λ ∈ {0, 1, 3}, where prior functions define value-function estimates by Q_k = Q_θk + λP_k for some random network P_k; thus λ = 0 implies no prior function. We find a generally synergistic relationship: increasing λ improves performance (both with and without TDU), and for a given level of λ, performance on exploration tasks improves for any β > 0. It should be noted that these effects do not materialise as clearly in our Atari experiments, where we find no conclusive evidence to support λ > 0 under TDU. Ablation: exploration under non-TD signals To empirically support the theoretical underpinnings of TDU (Proposition 2), we conduct an ablation study where σ is redefined as the standard deviation over value estimates: σ(Q) := √( (1/(K−1)) Σ_{k=1}^K (Q_k − Q̄)² ), where Q̄ is the ensemble mean. In contrast to TDU, this signal does not condition on the future and is consequently likely to suffer from a greater bias. We apply this signal both as an intrinsic reward (QU), as in TDU, and as a UCB-style exploration bonus (Q+UCB), where σ is instead applied while acting, by defining the policy π(·) = arg max_a Q̄(·, a) + βσ(Q; ·, a). Note that TDU cannot be applied in this way because the TDU exploration signal depends on r and s'. We tune each baseline over the same set of β values as above (incidentally, the best values coincide at β = 1) and report the best results in Figure 6. We find that either alternative is strictly worse than TDU: both suffer a significant drop in performance on exploration tasks, and are also less able to handle noise and reward scaling. The only difference between QU and TDU is that in TDU σ conditions on the next state; these results therefore directly support Proposition 2 and demonstrate that V_θ[δ | τ] is likely to have less bias than V_θ[Q_θ(s, a)].
Ablation: bandit policy sampling Our main results indicate, unsurprisingly, that different environments require a different emphasis on exploration. To test this more concretely, in this experiment we replace uniform policy sampling with the UCB1 bandit algorithm. In contrast to the bandit example in the main text, where UCB1 is used to take actions, here it is used to select a policy for the next episode. We treat each of the N + K value functions as an "arm" and estimate its mean reward V_k ≈ E_{π_k}[r], where the expectation is with respect to rewards r collected under the policy π_k(·) = arg max_a Q_k(·, a). The mean reward is estimated by the running average V_k(n) = (1/n(k)) Σ_{i=1}^{n(k)} r_i, where n(k) is the number of environment steps for which policy π_k has been used and the r_i are the observed rewards under π_k. Prior to an episode, we choose the policy to act under according to arg max_{k=1,...,N+K} V_k(n) + η √( log n / n(k) ), where n is the total number of environment steps taken so far and η is a hyper-parameter that we tune. As in the bandit example, this sampling strategy biases selection towards policies that currently collect higher reward, but balances sampling with a count-based exploration bonus that encourages the agent to eventually try all policies. This bandit mechanism is deliberately simple, as our purpose is to test whether some form of adaptive sampling can provide benefits; more sophisticated methods (e.g. Schaul et al., 2019) may yield further gains. We report full results in Figure 7; we use β = 1 and tune η ∈ {0.1, 1, 2, 4, 6, 8}. We report results for the hyper-parameter that performed best overall, η = 8, though differences for η > 4 are marginal. While TDU does not generally impact performance negatively, in the one case where it does, Mountain Car, introducing a bandit to adapt exploration can largely recover performance.
The bandit yields further gains in dense reward settings, such as in Cartpole and Catch, with an outlying exception in the bandit setting with scaled rewards.
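The bandit policy-sampling rule can be sketched as follows. This is our own minimal illustration (class and method names are ours), assuming the standard UCB1 bonus η√(log n / n(k)) on the running mean reward:

```python
import math

class UCB1PolicySampler:
    """Selects among K policies via UCB1 on their running mean reward."""

    def __init__(self, num_policies, eta):
        self.eta = eta
        self.value = [0.0] * num_policies  # running mean reward per policy
        self.count = [0] * num_policies    # steps taken under each policy

    def select(self):
        # Try every policy once before applying the UCB rule.
        for k, n in enumerate(self.count):
            if n == 0:
                return k
        total = sum(self.count)
        return max(
            range(len(self.count)),
            key=lambda k: self.value[k]
            + self.eta * math.sqrt(math.log(total) / self.count[k]),
        )

    def update(self, k, reward):
        # Incremental running-average update for policy k.
        self.count[k] += 1
        self.value[k] += (reward - self.value[k]) / self.count[k]
```

In use, `select()` is called before each episode to pick the behavioural policy, and `update(k, r)` is called with the rewards observed while acting under policy k.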

E ATARI WITH R2D2

E.1 BOOTSTRAPPED R2D2

We augment the R2D2 agent with an ensemble of dueling action-value heads Q_i. The behaviour policy followed by the actors is an ε-greedy policy as before, but the greedy action is determined according to a single Q_i for a fixed period of time (100 actor steps in all of our experiments) before a new Q_i is sampled uniformly at random. The evaluation policy is also ε-greedy, with ε = 0.001, where the Q-values are averaged over the exploiter heads only. Each trajectory inserted into the replay buffer is associated with a binary mask indicating which Q_i will be trained on this data, ensuring that the same mask is used every time the trajectory is sampled. Priorities are computed as in R2D2, except that TD-errors are now averaged over all heads. Instead of using reward clipping, R2D2 estimates a transformed version of the state-action value function to make it easier for a neural network to approximate. One can define a transformed Bellman operator given any squashing function h : R → R that is monotonically increasing and invertible. We use the function h : R → R defined by

    h(z) = sign(z)(√(|z| + 1) − 1) + εz,
    h⁻¹(z) = sign(z)[ ( (√(1 + 4ε(|z| + 1 + ε)) − 1) / (2ε) )² − 1 ],

for small ε. To compute TD-errors accurately, we need to account for the transformation:

    δ(θ, s, a, r, s') := γ h⁻¹(Q_θ(s', π(s'))) + r − h⁻¹(Q_θ(s, a)).

Similarly, at evaluation time we apply h⁻¹ to the output of each head before averaging. When making use of a prior, we use the form Q_k = Q_θ^k + λP_k, where P_k has the same architecture as the Q_θ^k network but with the widths of all layers reduced to cut computational cost. Finally, instead of n-step returns we use Q(λ) (Peng & Williams, 1994), as in Guez et al. (2020). In all variants we used the hyper-parameters listed in Table 4.
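The squashing function h and its inverse can be checked numerically. The sketch below uses ε = 10⁻³, a common choice in prior work but an assumption here, since the section does not state its value:

```python
import math

EPS = 1e-3  # assumed value of epsilon; the paper only says "for small epsilon"

def h(z):
    """Invertible squashing of the value target: sign(z)(sqrt(|z|+1)-1) + eps*z."""
    return math.copysign(math.sqrt(abs(z) + 1.0) - 1.0, z) + EPS * z

def h_inv(z):
    """Inverse of h, obtained by solving y = sqrt(x+1) - 1 + eps*x for x >= 0
    and extending by oddness."""
    return math.copysign(
        ((math.sqrt(1.0 + 4.0 * EPS * (abs(z) + 1.0 + EPS)) - 1.0) / (2.0 * EPS)) ** 2 - 1.0,
        z,
    )
```

Round-tripping h_inv(h(x)) recovers x to floating-point precision across positive, negative and zero inputs, which is the invertibility property the transformed Bellman operator relies on.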

E.2 PRE-PROCESSING

We used the standard pre-processing of the frames received from the Arcade Learning Environment [6]; see Table 5 for details. In the distributed setting we have three TDU-specific hyper-parameters to tune: β, N and the prior weight λ. For our main results, we run each agent across 8 seeds for 20 billion steps. For ablations and hyper-parameter tuning, we ran agents across 3 seeds for 5 billion environment steps on a subset of 8 games: frostbite, gravitar, hero, montezuma_revenge, ms_pacman, seaquest, space_invaders, venture. This subset is quite diverse, including dense-reward games as well as three hard-exploration games: gravitar, montezuma_revenge and venture. To minimise computational cost, we started by setting λ and N while keeping β = 1, employing a coarse grid of λ ∈ {0, 0.05, 0.1} and N ∈ {2, 3, 5}. Figure 8 summarises the results in terms of the mean Human Normalised Score (HNS) across the set. We see that performance depends on the type of game being evaluated; specifically, hard-exploration games achieve a significantly lower score. Performance does not change significantly with the number of explorers; the largest differences are observed on the exploration games when N = 5. We select the best-performing hyper-parameter settings for TDU with and without additive priors: (N = 2, λ = 0.1) and (N = 5, λ = 0), respectively.

E.3 HYPER-PARAMETER SELECTION

We evaluate the influence of the exploration bonus strength by fixing (N = 5, λ = 0) and choosing β ∈ {0.1, 1., 2.}. Figure 9 summarises the results. The set of dense rewards is composed of the games in the ablation set that are not considered hard exploration games. We observe that larger values of β help on exploration but affect performance on dense reward games. We plot jointly the performance in mean HNS acting when averaging the Q-values for both, the exploiter heads (solid lines) and the explorer heads (dotted lines). We can see that higher strengths for the exploration bonus (higher β) renders the explorers "uninterested" in the extrinsic rewards, preventing them to converge to exploitative behaviours. This effect is less strong for the hard exploration games. We fix (N = 5, λ = 0). We report the mean HNS for the ensemble of exploiter (solid lines) and the ensemble of explorers (dotted lines). All runs are average over three seeds per game. Refer to the text for details on ablation and exploration set of games. Figure 10 we show how this effect manifests itself on the performance on three games: gravitar, space_invaders, and hero. This finding also applies to the evaluation performed on our evaluation using all 57 games in the Atari suite, as shown below. We conjecture that controlling for the strength of the exploration bonus on a per game manner would significantly improve the results. This finding is in line with observations made by (Puigdomènech Badia et al., 2020) ; combining TDU with adaptive policy sampling (Schaul et al., 2019) or online hyper-parameter tuning (Xu et al., 2018; Zahavy et al., 2020) are exciting avenues for future research. 
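The Human Normalised Score (HNS) used throughout this section follows the standard definition: a per-game linear rescaling under which a random policy scores 0 and average human performance scores 1. A minimal sketch (the per-game reference scores are published elsewhere; the numbers below are purely illustrative):

```python
def hns(score, random_score, human_score):
    """Human Normalised Score: 0 at random-policy level, 1 at human level."""
    return (score - random_score) / (human_score - random_score)

def mean_hns(scores):
    """Mean HNS over a set of games given (score, random, human) triples."""
    return sum(hns(*triple) for triple in scores) / len(scores)

# An agent scoring halfway between random and human on every game has mean HNS 0.5:
games = [(55.0, 10.0, 100.0), (0.5, 0.0, 1.0)]
assert mean_hns(games) == 0.5
```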



Footnote links referenced in this appendix:
- bsuite: available at https://github.com/deepmind/bsuite.
- Bootstrapped DQN baseline: https://github.com/deepmind/bsuite/tree/master/bsuite/baselines/jax/bootdqn.
- DQN baseline: https://github.com/deepmind/bsuite/tree/master/bsuite/baselines/jax/dqn.
- Successor Uncertainties (tabular): see https://github.com/DavidJanz/successor_uncertainties_tabular.
- bsuite JAX baselines: https://github.com/deepmind/bsuite/blob/master/bsuite/baselines/jax.
- Arcade Learning Environment: publicly available at https://github.com/mgbellemare/Arcade-Learning-Environment.



Figure 1: Deep Sea Benchmark. QEX, CTS, and RND use intrinsic rewards; BDQN, SU, and NNS use posterior sampling (Section 5.1). Posterior sampling does well on the deterministic version, but struggles on the stochastic version, suggesting an estimation bias (Section 2). Only TDU performs (near-)optimally on both the deterministic and the stochastic version of Deep Sea.

Figure 3: Atari results with distributed training. We compare TDU with and without additive prior functions to R2D2 and Bootstrapped R2D2 (B-R2D2). Left: Results for montezuma_revenge. Center: Results for tennis. Right: Mean HNS for the hard exploration games in the Atari 2600 suite (including tennis). Shading depicts standard deviation over 8 seeds.

Pseudo-code for Q-learning TDU loss:

def loss(transitions, Q_params, Qtilde_distribution_params, beta):
    # Estimate TD-error distribution.
    td = array([td_error(p, transitions)
                for p in sample(Qtilde_distribution_params)])
    ...
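Completing the fragment above under our own assumptions (a tabular Q-table stands in for the network, and the TDU signal is taken as the standard deviation of TD errors across sampled value functions), a runnable sketch might look like:

```python
import numpy as np

def td_error(q_values, transitions, gamma=0.99):
    """One-step TD errors for a batch of transitions against one Q sample.

    q_values: (num_states, num_actions) table standing in for Q_theta.
    transitions: arrays (s, a, r, s_next), one entry per transition.
    """
    s, a, r, s_next = transitions
    return r + gamma * q_values[s_next].max(axis=-1) - q_values[s, a]

def tdu_bonus(q_samples, transitions, beta=1.0):
    """Intrinsic reward: beta times the spread of TD errors across samples.

    Holding the transition fixed isolates uncertainty that is due to the
    agent's parameters, which is the exploration signal described here.
    """
    td = np.array([td_error(q, transitions) for q in q_samples])
    return beta * td.std(axis=0)
```

If all sampled value functions agree, the bonus vanishes, matching the property that the exploration signal disappears in the limit of perfect value estimates.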

Under review as a conference paper at ICLR 2021

Algorithm 5: JAX implementation of the TDU agent under Bootstrapped DQN.
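The bootstrap masking described in Appendix E.1 can be sketched as follows (illustrative NumPy code with names of our own choosing; the mask probability p = 0.5 is an assumption):

```python
import numpy as np

def make_mask(num_heads, p=0.5, rng=None):
    """Draw a Bernoulli(p) mask once per trajectory; storing it with the
    trajectory ensures the same heads train on it at every replay."""
    if rng is None:
        rng = np.random.default_rng(0)
    return rng.random(num_heads) < p

def mean_abs_td(td_per_head, mask=None):
    """Absolute TD error averaged over heads (used for priorities), or over
    the masked subset of heads when a training mask is supplied."""
    td = np.abs(np.asarray(td_per_head, dtype=np.float64))
    if mask is None:
        return float(td.mean())
    m = mask.astype(np.float64)
    return float((td * m).sum() / max(m.sum(), 1.0))
```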

Consequently, $\mathrm{Bias}(\mathbb{E}_\theta[\delta(\theta, \tau) \mid \tau]) = \gamma B(s', \pi(s')) - B(s, a)$. For this bias to be smaller than $\mathrm{Bias}(\mathbb{E}_\theta[Q_\theta(s, a)]) = B(s, a)$, we require $|\gamma B(s', \pi(s')) - B(s, a)| < |B(s, a)|$. Let $\rho = B(s', \pi(s')) / B(s, a)$ and write $|(\gamma\rho - 1) B(s, a)| < |B(s, a)|$, from which it follows that $|\gamma\rho - 1| < 1$, i.e. $0 < \rho < 2/\gamma$.
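The algebraic step above, that $|\gamma\rho - 1| < 1$ holds exactly on the interval $0 < \rho < 2/\gamma$, can be sanity-checked numerically (a small sketch; $\gamma = 0.99$ is an arbitrary illustrative choice):

```python
import numpy as np

gamma = 0.99
rho = np.linspace(-1.0, 3.0, 4001)            # grid of bias ratios
cond = np.abs(gamma * rho - 1.0) < 1.0        # the bias-reduction condition
interval = (rho > 0.0) & (rho < 2.0 / gamma)  # the claimed equivalent interval
assert np.array_equal(cond, interval)
```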

$\cdots\;\mathbb{E}_M[Q^\pi_M(s, a)]\, B(s, a) + B(s, a)^2. \qquad (37)$

Let $A(s, a) = \mathbb{E}_M[Q^\pi_M(s, a)]\, B(s, a)$ and write the bias of $\mathbb{V}_\theta[Q_\theta(s, a)]$ as …

Figure 6: Overall performance scores on Bsuite. Left: Effect of varying β. Right: Comparison of TDU to exploration under σ = σ(Q) as intrinsic reward (QU) or as an immediate bonus (Q+UCB).

Figure 7: Bsuite per-task results. Results reported for different values of β with prior λ = 3. We also report results under UCB1 policy sampling ("bandit") for β = 1, λ = 3, η = 8.

Figure 8: Ablation over the prior scale λ and the number of explorers N in the distributed setting. We fix β = 1. Refer to the text for details on the ablation and exploration sets of games.

Figure 9: Ablation over the exploration bonus strength β in the distributed setting. We fix (N = 5, λ = 0). We report the mean HNS for the ensemble of exploiters (solid lines) and the ensemble of explorers (dotted lines). All runs are averaged over three seeds per game. Refer to the text for details on the ablation and exploration sets of games.

Figure 11: Performance on each game in the main experiment in Section 5.2. Shading depicts standard deviation over 8 seeds.

Figure 12: Performance across all games in the main experiment in Section 5.2. We report mean HNS over the full set of games used in the main experiment, dense reward games, and exploration games. Shading depicts standard deviation over 8 seeds.

Figure 14: Results for each individual game. Shading depicts standard deviation over 3 seeds.

Atari benchmark on exploration games. † Ostrovski et al. (2017); ‡ Bellemare et al. (2016); Burda et al. (2018b); Choi et al. (2018); § Puigdomènech Badia et al. (2020); + with prior functions.

# Copyright 2020 the Temporal Difference Uncertainties as a Signal for Exploration authors.
#
# Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except
# in compliance with the License. You may obtain a copy of the License at
#
#     https://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software distributed under the License
# is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express
# or implied. See the License for the specific language governing permissions and limitations under
# the License.

Hyper-parameters for Bsuite.

Hyper-parameter grid searches for Bsuite. Best values in bold.

Atari pre-processing hyperparameters.


Figure 10: Scores on three individual games (gravitar, space_invaders, and hero). We fix (N = 5, λ = 0). We report the score for the ensemble of exploiters (solid lines) and the ensemble of explorers (dotted lines). All runs are averaged over three seeds per game.

E.4 DETAILED RESULTS: MAIN EXPERIMENT

In this section we provide more detailed results from our main experiment in Section 5.2. We concentrate on the subset of games that are well known to pose challenging exploration problems (Machado et al., 2018): montezuma_revenge, pitfall, private_eye, solaris, venture, gravitar, and tennis. We also add a varied set of dense-reward games. Figure 11 shows the performance for each game. TDU always performs on par with or better than each of the baselines, leading to significant improvements in data efficiency and final score in games such as montezuma_revenge, private_eye, venture, gravitar, and tennis. Gains in exploration games can be substantial: in montezuma_revenge, private_eye, venture, and gravitar, TDU without prior functions achieves statistically significant improvements, and TDU with prior functions achieves statistically significant improvements on montezuma_revenge, private_eye, and gravitar. Beyond this, both methods improve the rate of convergence on seaquest and tennis and achieve a higher final mean score. Overall, TDU yields benefits across both dense-reward and exploration games, as summarised in Figure 12. Note that R2D2's performance on dense-reward games is deflated due to particularly low scores on space_invaders; our results are in line with the original publication, where R2D2 does not show substantial improvements until after 35 billion steps.

E.5 FULL ATARI SUITE

In this section we report the performance on all 57 games of the Atari suite. In addition to the two configurations used to obtain the results presented in the main text (Section 5.2), we include a variant of each with a lower exploration bonus strength of β = 0.1. In all figures we refer to these variants by appending an L (for lower β) to the name, e.g. TDU-R2D2-L. In Figure 13 we report a summary of the results in terms of mean and median HNS for the suite, as well as mean HNS restricted to the hard-exploration games only. We show the performance on each game in Figure 14. Reducing the value of β significantly improves the mean HNS without strongly degrading performance on the games that are challenging from an exploration standpoint. The difference in mean HNS can be explained by a few high-scoring games, for instance assault, asterix, demon_attack, and gopher (see Figure 14). We can also see that incorporating priors into TDU is not crucial for achieving high performance in the distributed setting.

Figure 13: Performance over the 57 Atari games. We report mean and median HNS over the full suite, and mean HNS over the exploration games.

