OPTIMISM IN REINFORCEMENT LEARNING WITH GENERALIZED LINEAR FUNCTION APPROXIMATION

Abstract

We design a new provably efficient algorithm for episodic reinforcement learning with generalized linear function approximation. We analyze the algorithm under a new expressivity assumption that we call "optimistic closure," which is strictly weaker than assumptions from prior analyses for the linear setting. With optimistic closure, we prove that our algorithm enjoys a regret bound of $\tilde{O}(H\sqrt{d^3 T})$, where $H$ is the horizon, $d$ is the dimensionality of the state-action features and $T$ is the number of episodes. This is the first statistically and computationally efficient algorithm for reinforcement learning with generalized linear functions.

1. INTRODUCTION

We study episodic reinforcement learning problems with infinitely large state spaces, where the agent must use function approximation to generalize across states while simultaneously engaging in strategic exploration. Such problems form the core of modern empirical/deep-RL, but relatively little work focuses on exploration, and even fewer algorithms enjoy strong sample efficiency guarantees. On the theoretical side, classical sample efficiency results from the early 00s focus on "tabular" environments with small finite state spaces (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Strehl et al., 2006) , but as these methods scale with the number of states, they do not address problems with infinite or large state spaces. While this classical work has inspired practically effective approaches for large state spaces (Bellemare et al., 2016; Osband et al., 2016; Tang et al., 2017) , these methods do not enjoy sample efficiency guarantees. More recent theoretical progress has produced provably sample efficient algorithms for complex environments where function approximation is required, but these algorithms are relatively impractical (Krishnamurthy et al., 2016; Jiang et al., 2017) . In particular, these methods are computationally inefficient or rely crucially on strong dynamics assumptions (Du et al., 2019b) . In this paper, with an eye toward practicality, we study a simple variation of Q-learning, where we approximate the optimal Q-function with a generalized linear model. The algorithm is appealingly simple: collect a trajectory by following the greedy policy corresponding to the current model, perform a dynamic programming back-up to update the model, and repeat. The key difference over traditional Q-learning-like algorithms is in the dynamic programming step. Here we ensure that the updated model is optimistic in the sense that it always overestimates the optimal Q-function. This optimism is essential for our guarantees. 
Optimism in the face of uncertainty is a well-understood and powerful algorithmic principle in short-horizon (e.g., bandit) problems, as well as in tabular reinforcement learning (Azar et al., 2017; Dann et al., 2017; Jin et al., 2018). With linear function approximation, Yang & Wang (2019) and Jin et al. (2019) show that the optimism principle can also yield provably sample-efficient algorithms, when the environment dynamics satisfy certain linearity properties. Their assumptions are always satisfied in tabular problems, but are somewhat unnatural in settings where function approximation is required. Moreover, as these assumptions are directly on the dynamics, it is unclear how their analysis can accommodate other forms of function approximation, including generalized linear models. In the present paper, we replace explicit dynamics assumptions with expressivity assumptions on the function approximator, and, by analyzing a similar algorithm to Jin et al. (2019), we show that the optimism principle succeeds under these strictly weaker assumptions. More importantly, the relaxed assumption facilitates moving beyond linear models, and we demonstrate this by providing the first practical and provably efficient RL algorithm with generalized linear function approximation. The paper is organized as follows: In Section 2 we formalize our setting, introduce the optimistic closure assumption, and discuss related assumptions in the literature. In Section 3 we study optimistic closure in detail and verify that it is strictly weaker than the recently proposed Linear MDP assumption. Our main algorithm and results are presented in Section 4, with the main proof in Section A. We close with some final remarks and future directions in Section 5.

2. PRELIMINARIES

We consider episodic reinforcement learning in a finite-horizon Markov decision process (MDP) with possibly infinitely large state space $S$, finite action space $A$, initial distribution $\mu \in \Delta(S)$, transition operator $P : S \times A \to \Delta(S)$, reward function $R : S \times A \to \Delta([0,1])$ and horizon $H$. The agent interacts with the MDP in episodes and, in each episode, a trajectory $(s_1, a_1, r_1, s_2, a_2, r_2, \ldots, s_H, a_H, r_H)$ is generated where $s_1 \sim \mu$, for $h > 1$ we have $s_h \sim P(\cdot \mid s_{h-1}, a_{h-1})$, $r_h \sim R(s_h, a_h)$, and actions $a_{1:H}$ are chosen by the agent. For normalization, we assume that $\sum_{h=1}^H r_h \in [0,1]$ almost surely. A (deterministic, nonstationary) policy $\pi = (\pi_1, \ldots, \pi_H)$ consists of $H$ mappings $\pi_h : S \to A$, where $\pi_h(s_h)$ denotes the action to be taken at time point $h$ if at state $s_h \in S$. The value function for a policy $\pi$ is a collection of functions $(V^\pi_1, \ldots, V^\pi_H)$ where $V^\pi_h : S \to \mathbb{R}$ is the expected future reward the policy collects if it starts in a particular state at time point $h$. Formally, $V^\pi_h(s) \triangleq \mathbb{E}\left[\sum_{h'=h}^H r_{h'} \mid s_h = s, a_{h:H} \sim \pi\right]$. The value for a policy $\pi$ is simply $V^\pi \triangleq \mathbb{E}_{s_1 \sim \mu}[V^\pi_1(s_1)]$, and the optimal value is $V^\star \triangleq \max_\pi V^\pi$, where the maximization is over all nonstationary policies. The typical goal is to find an approximately optimal policy, and in this paper, we measure performance by the regret accumulated over $T$ episodes, $\mathrm{Reg}(T) \triangleq T V^\star - \mathbb{E}\left[\sum_{t=1}^T \sum_{h=1}^H r_{h,t}\right]$. Here $r_{h,t}$ is the reward collected by the agent at time point $h$ in the $t$-th episode. We seek algorithms with regret that is sublinear in $T$, which demonstrates the agent's ability to act near-optimally over the long run.

2.1. Q-VALUES AND FUNCTION APPROXIMATION

For any policy $\pi$, the state-action value function, or the Q-function, is a sequence of mappings $Q^\pi = (Q^\pi_1, \ldots, Q^\pi_H)$ where $Q^\pi_h : S \times A \to \mathbb{R}$ is defined as $Q^\pi_h(s,a) \triangleq \mathbb{E}\left[\sum_{h'=h}^H r_{h'} \mid s_h = s, a_h = a, a_{h+1:H} \sim \pi\right]$. The optimal Q-function is $Q^\star_h \triangleq Q^{\pi^\star}_h$ where $\pi^\star \triangleq \operatorname{argmax}_\pi V^\pi$ is the optimal policy. In the value-based function approximation setting, we use a function class $\mathcal{G}$ to model $Q^\star$. In this paper, we always take $\mathcal{G}$ to be a class of generalized linear models (GLMs), defined as follows. Let $d \in \mathbb{N}$ be a dimensionality parameter and let $B_d \triangleq \{x \in \mathbb{R}^d : \|x\|_2 \le 1\}$ be the $\ell_2$ ball in $\mathbb{R}^d$.

Definition 1. For a known feature map $\phi : S \times A \to B_d$ and a known link function $f : [-1,1] \to [-1,1]$, the class of generalized linear models is $\mathcal{G} \triangleq \{(s,a) \mapsto f(\langle \phi(s,a), \theta \rangle) : \theta \in B_d\}$.

As is standard in the literature (Filippi et al., 2010; Li et al., 2017), we assume the link function satisfies certain regularity conditions.

Assumption 1. $f(\cdot)$ is either monotonically increasing or decreasing. Furthermore, there exist absolute constants $0 < \kappa < K < \infty$ and $M < \infty$ such that $\kappa \le |f'(z)| \le K$ and $|f''(z)| \le M$ for all $|z| \le 1$.

For intuition, two example link functions are the identity map $f(z) = z$ and the logistic map $f(z) = 1/(1+e^{-z})$ with bounded $z$. It is easy to verify that both of these maps satisfy Assumption 1.
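To make Assumption 1 concrete, the following sketch (illustrative, not from the paper) numerically estimates admissible constants $(\kappa, K, M)$ for the logistic link on $|z| \le 1$, using the closed forms $f'(z) = f(z)(1 - f(z))$ and $f''(z) = f'(z)(1 - 2f(z))$:

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def d_logistic(z):   # f'(z) = f(z) (1 - f(z))
    p = logistic(z)
    return p * (1.0 - p)

def dd_logistic(z):  # f''(z) = f'(z) (1 - 2 f(z))
    p = logistic(z)
    return p * (1.0 - p) * (1.0 - 2.0 * p)

# Scan |z| <= 1 on a fine grid to estimate the constants of Assumption 1.
zs = [-1.0 + 2.0 * i / 10_000 for i in range(10_001)]
kappa = min(d_logistic(z) for z in zs)        # lower bound on |f'|
K = max(d_logistic(z) for z in zs)            # upper bound on |f'|
M = max(abs(dd_logistic(z)) for z in zs)      # upper bound on |f''|

print(kappa, K, M)  # roughly 0.197, 0.25, 0.091
```

For the identity link, the constants are simply $\kappa = K = 1$ and $M = 0$.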

2.2. EXPRESSIVITY ASSUMPTIONS: REALIZABILITY AND OPTIMISTIC CLOSURE

To obtain sample complexity guarantees that scale polynomially with problem parameters in the function approximation setting, it is necessary to posit expressivity assumptions on the function class $\mathcal{G}$ (Krishnamurthy et al., 2016; Du et al., 2019a). The weakest such condition is realizability, which posits that the optimal Q-function is in $\mathcal{G}$, or at least well-approximated by $\mathcal{G}$. Realizability alone suffices for provably efficient algorithms in the "contextual bandits" setting where $H = 1$ (Li et al., 2017; Filippi et al., 2010; Abbasi-Yadkori et al., 2011), but it does not seem to be sufficient when $H > 1$. Indeed, in these settings it is common to make stronger expressivity assumptions (Chen & Jiang, 2019; Yang & Wang, 2019; Jin et al., 2019). Following these works, our main assumption is a closure property of the Bellman update operator $\mathcal{T}_h$. This operator has type $\mathcal{T}_h : (S \times A \to \mathbb{R}) \to (S \times A \to \mathbb{R})$ and is defined for all $s \in S$, $a \in A$ as

$\mathcal{T}_h(Q)(s,a) \triangleq \mathbb{E}\left[r_h + V_Q(s_{h+1}) \mid s_h = s, a_h = a\right], \qquad V_Q(s) \triangleq \max_{a \in A} Q(s,a).$

The Bellman update operator for time point $H$ is simply $\mathcal{T}_H(Q)(s,a) \triangleq \mathbb{E}[r_H \mid s_H = s, a_H = a]$, which is degenerate. To state the assumption, we must first define the enlarged function class $\mathcal{G}^{up}$. For a $d \times d$ matrix $A$, $A \succeq 0$ denotes that $A$ is positive semi-definite. For a positive semi-definite matrix $A$, $\|A\|_{op}$ is the matrix operator norm, which is just the largest eigenvalue, and $\|x\|_A \triangleq \sqrt{x^\top A x}$ is the matrix Mahalanobis seminorm. For a fixed constant $\Gamma \in \mathbb{R}_+$ that we will set to be polynomial in $d$ and $\log(T)$, define

$\mathcal{G}^{up} \triangleq \left\{(s,a) \mapsto 1 \wedge \left(f(\langle \phi(s,a), \theta \rangle) + \gamma \|\phi(s,a)\|_A\right) : \theta \in B_d,\ A \succeq 0,\ \|A\|_{op} \le 1,\ 0 \le \gamma \le \Gamma\right\}.$

Here we use $a \wedge b \triangleq \min\{a, b\}$. The class $\mathcal{G}^{up}$ contains $\mathcal{G}$ in addition to all possible upper confidence bounds that arise from solving least squares regression problems using the class $\mathcal{G}$. We now state our main expressivity assumption, which we call optimistic closure.

Assumption 2 (Optimistic closure). For any $1 \le h < H$ and $g \in \mathcal{G}^{up}$, we have $\mathcal{T}_h(g) \in \mathcal{G}$.
In words, when we perform a Bellman backup on any upper confidence bound function for time point $h+1$, we obtain a generalized linear function at time $h$. While this property seems quite strong, we note that a similar notion is mentioned informally in Jin et al. (2019) and that related closure-type assumptions are common in the literature (see Section 2.3 for a detailed discussion). More importantly, we will prove in Section 3 that optimistic closure is actually strictly weaker than previous assumptions used in our RL setting where exploration is required. Before turning to these discussions, we mention two basic properties of optimistic closure.

Fact 1 (Optimistic closure and realizability). Optimistic closure implies that $Q^\star \in \mathcal{G}$ (realizability).

Proof. We solve for $Q^\star$ via dynamic programming, starting from time point $H$. In this case, the Bellman update operator is degenerate, and we start by observing that $\mathcal{T}_H(g) \equiv Q^\star_H$ for all $g$. Consequently we have $Q^\star_H \in \mathcal{G}$. Next, inductively we assume that $Q^\star_{h+1} \in \mathcal{G}$, which implies that $Q^\star_{h+1} \in \mathcal{G}^{up}$, as we may take the same parameter $\theta$ and set $A \equiv 0$. Then, by the standard Bellman fixed-point characterization, we know that $Q^\star_h = \mathcal{T}_h(Q^\star_{h+1})$, at which point Assumption 2 yields that $Q^\star_h \in \mathcal{G}$.

Fact 2 (Optimistic closure in tabular settings). If $S$ is finite and $\phi(s,a) = e_{s,a}$ is the standard-basis feature map, then under Assumption 1 we have optimistic closure.

Proof. We simply verify that $\mathcal{G}$ contains all mappings $(s,a) \to [0,1]$, at which point the result is immediate. To see why, observe that via Assumption 1 we know that $f$ is invertible (it is monotonic with derivative bounded from above and below). Then, note that any function $(s,a) \to [0,1]$ can be written as a vector $v \in [0,1]^{|S| \times |A|}$. For such a vector $v$, if we define $\theta_{s,a} \triangleq f^{-1}(v_{s,a})$ we have that $f(\langle e_{s,a}, \theta \rangle) = v_{s,a}$. Hence $\mathcal{G}$ contains all such functions, so we trivially have optimistic closure.
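Fact 2's construction is easy to check mechanically. The sketch below (with hypothetical sizes) uses the identity link, so $\theta_{s,a} = f^{-1}(v_{s,a}) = v_{s,a}$; for simplicity we ignore the unit-norm constraint on $\theta$, which in general would require rescaling the values.

```python
import numpy as np

# Hypothetical small tabular problem: 3 states, 2 actions.
n_states, n_actions = 3, 2
rng = np.random.default_rng(0)

# An arbitrary target table of values in [0, 1].
v = rng.uniform(0.0, 1.0, size=(n_states, n_actions))

def phi(s, a):
    """Standard-basis feature map e_{s,a} in R^{|S||A|}."""
    e = np.zeros(n_states * n_actions)
    e[s * n_actions + a] = 1.0
    return e

# Identity link f(z) = z, so theta_{s,a} = f^{-1}(v_{s,a}) = v_{s,a}.
theta = v.reshape(-1)

# Check: the GLM reproduces the table exactly at every (s, a).
for s in range(n_states):
    for a in range(n_actions):
        assert np.isclose(phi(s, a) @ theta, v[s, a])
print("tabular Q recovered exactly")
```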

2.3. RELATED WORK

The majority of the theoretical results for reinforcement learning focus on the tabular setting, where the state space is finite and sample complexities scaling polynomially with $|S|$ are tolerable (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Strehl et al., 2006). Indeed, by now there are a number of algorithms that achieve strong guarantees in this setting (Dann et al., 2017; Azar et al., 2017; Jin et al., 2018; Simchowitz & Jamieson, 2019). Via Fact 2, our results apply to this setting, and indeed our algorithm can be viewed as a generalization of the canonical tabular algorithm (Azar et al., 2017; Dann et al., 2017; Simchowitz & Jamieson, 2019) to the function approximation setting. Turning to the function approximation setting, several other results concern function approximation in settings where exploration is not an issue, including the infinite-data regime (Munos, 2003; Farahmand et al., 2010) and the "batch RL" setting where the agent does not control the data-collection process (Munos & Szepesvári, 2008; Antos et al., 2008; Chen & Jiang, 2019). While the details differ, all of these results require that the function class satisfy some form of (approximate) closure with respect to the Bellman operator. As an example, one assumption is that $\mathcal{T}(g) \in \mathcal{G}$ for all $g \in \mathcal{G}$, with an appropriately defined approximate variant (Chen & Jiang, 2019). These results therefore provide motivation for our optimistic closure assumption. While optimistic closure is stronger than the assumptions in these works, we emphasize that we are also addressing exploration, so our setting is significantly more challenging. A recent line of work studies function approximation in settings where the agent must explore the environment (Krishnamurthy et al., 2016; Jiang et al., 2017; Du et al., 2019b).
The algorithms developed here can accommodate function classes beyond generalized linear models, but they are still relatively impractical, and the more practical ones require strong dynamics assumptions (Du et al., 2019b). In contrast, our algorithm is straightforward to implement and does not require any explicit dynamics assumption. As such, we view these results as complementary to our own. Most closely related to our work are the recent results of Yang & Wang (2019) and Jin et al. (2019). Both papers study MDPs with certain linear dynamics assumptions (what they call the Linear MDP assumption) and use linear function approximation to obtain provably efficient algorithms. Jin et al. (2019) hint at optimistic closure as a weakening of their Linear MDP assumption and remark that their guarantees continue to hold under this weaker assumption. One of our contributions is to formalize this remark; indeed, our algorithm is almost identical to theirs. However, we emphasize that optimistic closure is strictly weaker than their Linear MDP assumption, which in turn is strictly weaker than the assumption of Yang & Wang (2019). Further, and perhaps more importantly, by avoiding explicit dynamics assumptions, we enable approximation with GLMs, which are incompatible with the Linear MDP structure. Hence, the present paper can be seen as a significant generalization of these recent results. Since the initial version of this paper appeared, several other works have studied linear function approximation in reinforcement learning. A number of papers (Cai et al., 2019; Ayoub et al., 2020; Modi et al., 2020; Zhou et al., 2020) study an incomparable class of dynamics models that permit linear function approximation. Others study weakenings of the Linear MDP assumption. In particular, Agarwal et al. (2020) only require small transfer error for linear regression, which formalizes out-of-distribution generalization and is always zero in Linear MDPs. Zanette et al.
(2020a) only require that the Bellman operator is closed with respect to linear functions, which is considerably weaker than our optimistic closure assumption. However, their algorithm is not computationally efficient. Computational efficiency is addressed in Zanette et al. (2020b) in the reward-free setting with reachability assumptions. As we do not require reachability assumptions, this latter result is incomparable to ours. None of these results considers generalized linear models.

3. ON OPTIMISTIC CLOSURE

Linear MDPs are studied by Jin et al. (2019), who establish a $\sqrt{T}$-type regret bound for an optimistic algorithm. This assumption already subsumes that of Yang & Wang (2019), and related assumptions also appear elsewhere in the literature (Bradtke & Barto, 1996; Melo & Ribeiro, 2007; Zanette et al., 2019). In this section, we show that optimistic closure (Assumption 2) is strictly weaker than assuming the environment is a linear MDP.

Proposition 1. If an MDP is linear then Assumption 2 holds with $\mathcal{G} = \{(s,a) \mapsto \langle w, \psi(s,a) \rangle : w \in B_d\}$.

Proof. The result is implicit in Jin et al. (2019), and we include the proof for completeness. For any function $g$, observe that owing to the linear MDP property

$\mathcal{T}_h(g)(s,a) = \mathbb{E}\left[r + \max_{a'} g(s', a') \mid s, a\right] = \langle \psi(s,a), \eta \rangle + \left\langle \psi(s,a), \int \mu(s') \max_{a'} g(s', a')\, ds' \right\rangle,$

which is clearly a linear function in $\psi(s,a)$. Hence for any function $g$, which trivially includes the optimistic functions, we have $\mathcal{T}_h(g) \in \mathcal{G}$. Thus the linear MDP assumption is stronger than Assumption 2.

Next, we show that it is strictly stronger.

Proposition 2. There exists an MDP with $H = 2$, $d = 2$, $|A| = 2$ and $|S| = \infty$ such that Assumption 2 is satisfied, but the MDP is not a linear MDP.

Thus optimistic closure is strictly weaker than the linear MDP assumption from Jin et al. (2019), and our results strictly generalize theirs.

Proof. Fix the link function to be $f(z) = z$. We first construct the MDP. Set the action space $A = \{a_1, a_2\}$. We use $e_i$ to denote the $i$-th standard basis element, and let $x = (0.1/\Gamma, 0.1/\Gamma)$ be a fixed vector, where $\Gamma$ appears in the construction of $\mathcal{G}^{up}$. Recall that $s_1$ is the first state in each trajectory. In our example, for all $a \in A$, $\phi(s_1, a)$ is sampled uniformly at random from the set $\{\alpha e_1 + (1-\alpha)e_2 : \alpha \in [0,1]\}$. The transition rule is deterministic and given by: $\forall a \in A : \phi(s_2, a) = \alpha x$ if $\phi(s_1, a) = \alpha e_1 + (1-\alpha)e_2$. Moreover, for the reward function, $R(s_1, a) = 0$ and $R(s_2, a) = 0.1\alpha/\Gamma$.

We first show that the Linear MDP property does not hold for the constructed MDP and the given feature map $\phi$. Let $s_1^{(1)}$ be the state with $\phi(s_1^{(1)}, a) = e_2$ and $s_1^{(2)}$ be the state with $\phi(s_1^{(2)}, a) = e_1$. Notice that we deterministically transition from $s_1^{(2)}$ to the state $s_2$ with $\phi(s_2, a) = x$, which already fixes the whole transition operator under the linear MDP assumption. Thus, under the linear MDP assumption, we must have a randomized transition for any state $s_1$ with $\phi(s_1, a) = \alpha e_1 + (1-\alpha)e_2$ where $\alpha \in (0,1)$.
This contradicts the fact that our constructed MDP has deterministic transitions everywhere, so the linear MDP property cannot hold.

We next show that Assumption 2 holds. Consider an arbitrary optimistic Q estimate of the form $g = \min\{1, z^\top \theta + \gamma \sqrt{z^\top A z}\} \in \mathcal{G}^{up}$, where $z = \phi(s,a)$. Notice that for $x = (0.1/\Gamma, 0.1/\Gamma)$, we always have that $x^\top \theta + \gamma \sqrt{x^\top A x} \le 1$ for any $\theta \in B_d$ and $A$ with $\|A\|_{op} \le 1$. Moreover, for all $s_2$, i.e., the second state in the trajectory, we always have $\phi(s_2, a) = \alpha x$ for some $\alpha \in [0,1]$. Hence we can ignore the first term in the minimum, and, by direct calculation, we have that when $\phi(s,a) = \alpha e_1 + (1-\alpha)e_2$:

$\mathcal{T}_1(g)(s,a) = \alpha x^\top \theta + \gamma \sqrt{\alpha^2 x^\top A x} = \alpha\left(x^\top \theta + \gamma \sqrt{x^\top A x}\right) = \alpha c_0,$

where $c_0 \triangleq x^\top \theta + \gamma \sqrt{x^\top A x}$. Hence we can write $\mathcal{T}_1(g) = \langle \phi(s,a), (c_0, 0) \rangle$, which verifies Assumption 2.

Algorithm 1: The LSVI-UCB algorithm with generalized linear function approximation.
1: Initialize estimates $\hat{Q}_{h,0} \equiv 1$ for all $h \le H$ and $\hat{Q}_{H+1,t} \equiv 0$ for all $1 \le t \le T$;
2: Set $\gamma = CK\kappa^{-1}\sqrt{1 + M + K + d^2 \ln((1 + K + \Gamma)TH)}$ for a universal constant $C$;
3: for $t = 1, 2, \ldots, T$ do
4:   Commit to policy $\hat{\pi}_{h,t}(s) \triangleq \operatorname{argmax}_{a \in A} \hat{Q}_{h,t-1}(s, a)$;
5:   Use policy $\hat{\pi}_{\cdot,t}$ to collect one trajectory $\{(s_{h,t}, a_{h,t}, r_{h,t})\}_{h=1}^H$;
6:   for $h = H, H-1, \ldots, 1$ do
7:     Compute $x_{h,\tau} \triangleq \phi(s_{h,\tau}, a_{h,\tau})$ and $y_{h,\tau} \triangleq r_{h,\tau} + \max_{a' \in A} \hat{Q}_{h+1,t}(s_{h+1,\tau}, a')$ for all $\tau \le t$;
8:     Compute the constrained least squares estimate $\hat{\theta}_{h,t} \triangleq \operatorname{argmin}_{\|\theta\|_2 \le 1} \sum_{\tau \le t} \left(y_{h,\tau} - f(\langle x_{h,\tau}, \theta \rangle)\right)^2$;  (1)
9:     Compute $\Lambda_{h,t} \triangleq \sum_{\tau \le t} x_{h,\tau} x_{h,\tau}^\top + I$;
10:    Construct $\hat{Q}_{h,t}(s,a) \triangleq \min\left\{1, f(\phi(s,a)^\top \hat{\theta}_{h,t}) + \gamma \|\phi(s,a)\|_{\Lambda_{h,t}^{-1}}\right\}$;
11:   end for
12: end for

4. ALGORITHM AND MAIN RESULT

We now turn to presenting our algorithm and main results. We study a least-squares dynamic programming style algorithm that we call LSVI-UCB, with pseudocode presented in Algorithm 1. The algorithm is nearly identical to the algorithm proposed by Jin et al. (2019) with the same name. A similar algorithmic template has also been extensively studied in the tabular setting (Azar et al., 2017; Dann et al., 2017; Simchowitz & Jamieson, 2019), albeit with slightly different confidence bounds. As our algorithm applies to all of these settings, it should be considered as a generalization. The algorithm uses dynamic programming to maintain optimistic Q-function estimates $\{\hat{Q}_{h,t}\}_{h \le H, t \le T}$ for each time point $h$ and each episode $t$. In the $t$-th episode, we use the previously computed estimates to define the greedy policy $\hat{\pi}_{h,t}(\cdot) \triangleq \operatorname{argmax}_{a \in A} \hat{Q}_{h,t-1}(\cdot, a)$, which we use to take actions for the episode. Then, with all of the trajectories collected so far, we perform a dynamic programming update, where the main per-step optimization problem is (1). Starting from time point $H$, we update our Q-function estimates by solving constrained least squares problems using our class of GLMs. At time point $H$, the covariates are $\{\phi(s_{H,\tau}, a_{H,\tau})\}_{\tau \le t}$, and the regression targets are simply the immediate rewards $\{r_{H,\tau}\}_{\tau \le t}$. For time points $h < H$, the covariates are defined similarly as $\{\phi(s_{h,\tau}, a_{h,\tau})\}_{\tau \le t}$, but the regression targets are defined by inflating the learned Q-function for time point $h+1$ by an optimism bonus. In detail, the least squares problem for time point $h+1$ yields a parameter $\hat{\theta}_{h+1,t}$, and we also form the (regularized) second moment matrix $\Lambda_{h+1,t}$ of all the covariates at time $h+1$ that we have seen so far. Using these, we define the optimistic Q-function $\hat{Q}_{h+1,t}(s,a) \triangleq \min\{1, f(\langle \phi(s,a), \hat{\theta}_{h+1,t} \rangle) + \gamma \|\phi(s,a)\|_{\Lambda_{h+1,t}^{-1}}\}$. In our analysis, we verify that $\hat{Q}_{h+1,t}$ is optimistic in the sense that it overestimates $Q^\star$ for every $(s,a)$.
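To make the loop concrete, here is a minimal Python sketch of Algorithm 1 for the identity link $f(z) = z$. It is a simplification, not the paper's exact procedure: the norm constraint $\|\theta\|_2 \le 1$ is dropped (plain ridge regression is used), the bonus coefficient `gamma_bonus` is a tunable stand-in for the theoretical $\gamma$, and the environment interface (`env_reset`, `env_step`) is hypothetical.

```python
import numpy as np

def lsvi_ucb(env_step, env_reset, phi, d, n_actions, H, T, gamma_bonus=1.0):
    """Sketch of LSVI-UCB with the identity link f(z) = z.

    Hypothetical interface: env_reset() -> initial state;
    env_step(s, a, h) -> (next_state, reward); phi(s, a) -> vector in R^d.
    """
    data = [{"x": [], "s_next": [], "r": []} for _ in range(H)]
    theta = [np.zeros(d) for _ in range(H)]
    Lam_inv = [np.eye(d) for _ in range(H)]  # (sum_tau x x^T + I)^{-1}
    total_reward = 0.0

    def q_hat(h, s):
        """Optimistic Q-values for all actions at state s, clipped at 1."""
        if h >= H:
            return np.zeros(n_actions)
        feats = np.stack([phi(s, a) for a in range(n_actions)])
        bonus = np.sqrt(np.einsum("ad,de,ae->a", feats, Lam_inv[h], feats))
        return np.minimum(1.0, feats @ theta[h] + gamma_bonus * bonus)

    for t in range(T):
        # Roll out the greedy policy w.r.t. the current optimistic estimates.
        s = env_reset()
        for h in range(H):
            a = int(np.argmax(q_hat(h, s)))
            s_next, r = env_step(s, a, h)
            data[h]["x"].append(phi(s, a))
            data[h]["s_next"].append(s_next)
            data[h]["r"].append(r)
            total_reward += r
            s = s_next
        # Dynamic-programming backup from the last step down to the first.
        for h in range(H - 1, -1, -1):
            X = np.stack(data[h]["x"])
            y = np.array([r + q_hat(h + 1, sn).max()
                          for r, sn in zip(data[h]["r"], data[h]["s_next"])])
            Lam_inv[h] = np.linalg.inv(X.T @ X + np.eye(d))
            theta[h] = Lam_inv[h] @ X.T @ y  # ridge solution (norm constraint dropped)
    return total_reward

# Hypothetical toy check: a 2-armed bandit (H = 1) with deterministic rewards.
rewards = [0.1, 0.9]
eye2 = np.eye(2)
total = lsvi_ucb(env_step=lambda s, a, h: (0, rewards[a]),
                 env_reset=lambda: 0,
                 phi=lambda s, a: eye2[a],
                 d=2, n_actions=2, H=1, T=50)
print(total)  # one exploratory pull of the 0.1 arm, then the 0.9 arm: 44.2
```

In the toy run, the optimism bonus forces a single pull of the inferior arm before the estimates and shrinking confidence widths lock onto the better arm, illustrating the exploration mechanism described above.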
Then, the regression targets for the least squares problem at time point $h$ are $r_{h,\tau} + \max_{a' \in A} \hat{Q}_{h+1,t}(s_{h+1,\tau}, a')$, which is a natural stochastic approximation to the Bellman backup of $\hat{Q}_{h+1,t}$. Applying this update backward from time point $H$ to $1$, we obtain the Q-function estimates that we use to define the policy for the next episode. The main conceptual difference between Algorithm 1 and the algorithm of Jin et al. (2019) is that we allow non-linear function approximation with GLMs, while they consider only linear models. On a more technical level, we use constrained least squares for our dynamic programming backup, which we find easier to analyze, while they use the ridge regularized version. On the computational side, the algorithm is straightforward to implement, and, depending on the link function $f$, it can be easily shown to run in polynomial time. For example, if $f$ is the identity map, then (1) is equivalent to a standard ridge regression problem, which can be solved in closed form. Moreover, we can use the Sherman-Morrison formula to amortize matrix inversions, and, by doing so, we obtain a running time of $O(d^2 |A| H T^2)$. The dominant cost in this calculation is evaluating the optimism bonus when computing the regression targets. In practice, using an epoch schedule or incremental optimization algorithms for updating $\hat{Q}$ would yield an even faster algorithm. Of course, with modern machine learning libraries, it is also straightforward to implement the algorithm with a non-trivial link function $f$, even though (1) may be non-convex.
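The Sherman-Morrison amortization mentioned above can be sketched as follows (an illustrative snippet, not code from the paper): maintaining $\Lambda^{-1}$ under rank-one updates $\Lambda \leftarrow \Lambda + x x^\top$ costs $O(d^2)$ per update rather than an $O(d^3)$ inversion.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Lam = np.eye(d)          # Lambda_0 = I
Lam_inv = np.eye(d)      # maintained inverse

for _ in range(200):
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)   # feature vectors lie in the unit ball
    # Sherman-Morrison: (Lam + x x^T)^{-1} from Lam^{-1} in O(d^2).
    v = Lam_inv @ x
    Lam_inv -= np.outer(v, v) / (1.0 + x @ v)
    Lam += np.outer(x, x)

# The incrementally maintained inverse matches direct inversion.
assert np.allclose(Lam_inv, np.linalg.inv(Lam))
print("Sherman-Morrison updates agree with direct inversion")
```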

4.1. MAIN RESULT

Our main result is a regret bound for LSVI-UCB when the link function satisfies Assumption 1 and the function class satisfies Assumption 2.

Theorem 1. For any episodic MDP, with Assumption 1 and Assumption 2, and for any $T$, the cumulative regret of Algorithm 1 is

$O\left(HK\kappa^{-1}\sqrt{\left(M + K + d^2 \ln(KTH)\right) \cdot Td\ln(T/d)}\right) = \tilde{O}\left(H\sqrt{d^3 T}\right),$

with probability $1 - 1/(TH)$.

The result states that LSVI-UCB enjoys $\sqrt{T}$-regret for any episodic MDP problem and any GLM, provided that the regularity conditions are satisfied and that optimistic closure holds. As we have mentioned, these assumptions are relatively mild, encompassing the tabular setting and prior work on linear function approximation. Importantly, no explicit dynamics assumptions are required. Thus, Theorem 1 is one of the most general results we are aware of for provably efficient exploration with function approximation. Nevertheless, to develop further intuition for our bound, it is worth comparing to prior results. First, in the linear MDP setting of Jin et al. (2019), we use the identity link function so that $K = \kappa = 1$ and $M = 1$, and we are also guaranteed to satisfy Assumption 2. In this case, our bound differs from that of Jin et al. (2019) only in the dependence on $H$, which arises due to a difference in normalization. Our bound is essentially equivalent to theirs and can therefore be seen as a strict generalization. To capture the tabular setting, we use the standard basis featurization as in Fact 2 and the identity link function, which gives $d = |S||A|$, $K = \kappa = 1$, and $M = 1$. Thus, we obtain the following corollary:

Corollary 2. For MDPs with finite state and action spaces, using feature map $\phi(s,a) \triangleq e_{s,a} \in \mathbb{R}^{|S| \times |A|}$, for any $T$, the cumulative regret of Algorithm 1 is $\tilde{O}\left(H\sqrt{|S|^3 |A|^3 T}\right)$, with probability $1 - 1/(TH)$.

Note that this bound is polynomially worse than the near-optimal $\tilde{O}(H\sqrt{SAT} + H^2 S^2 A \log(T))$ bound of Azar et al. (2017). However, a refined analysis specialized to the tabular setting can be shown to obtain a better regret bound of $\tilde{O}\left(H\sqrt{|S|^2 |A|^2 T}\right)$. Of course, our algorithm and analysis address problems with infinitely large state spaces and other settings that are significantly more complex than tabular MDPs, which we believe is more important than recovering the optimal guarantee for tabular MDPs.

Published as a conference paper at ICLR 2021

5. DISCUSSION

This paper presents a provably efficient reinforcement learning algorithm that approximates the Q-function with a generalized linear model. We prove that the algorithm obtains $\tilde{O}(H\sqrt{d^3 T})$ regret under mild regularity conditions and a new expressivity condition that we call optimistic closure. These assumptions generalize both the tabular setting, which is classical, and the linear MDP setting studied in recent work. Further, ours is the first statistically and computationally efficient algorithm for reinforcement learning with generalized linear function approximation, without explicit dynamics assumptions. We close with some open problems. First, using the fact that Corollary 3 applies beyond GLMs, can we develop algorithms that can employ general function classes? While such algorithms do exist for the contextual bandit setting (Foster et al., 2018), it seems quite difficult to generalize this analysis to multi-step reinforcement learning. More importantly, while optimistic closure is weaker than some prior assumptions (and incomparable to others), it is still quite strong, and stronger than what is required for the batch RL setting. An important direction is to investigate weaker assumptions that enable provably efficient reinforcement learning with function approximation. We look forward to studying these questions in future work.

A. PROOF OF THEOREM 1

We now provide the proof of Theorem 1, deferring some technical details to later sections in this appendix. The proof has three main components: a regret decomposition for optimistic Q-learning, a deviation analysis for least squares with GLMs to ensure optimism, and a potential argument to obtain the final regret bound.

Regret decomposition. The first step of the proof is a regret decomposition that applies generically to optimistic algorithms.
The lemma demonstrates concisely the value of optimism in reinforcement learning, and is the primary technical motivation for our interest in optimistic algorithms. We state the lemma in some generality, which requires additional notation. Fix round $t$ and let $\{\hat{Q}_{h,t-1}\}_{h \le H}$ denote the current estimated Q-functions. The precondition is that $\hat{Q}_{h,t-1}$ is optimistic and has controlled overestimation. Precisely, we assume that there exists a function $\mathrm{cnf}_{h,t-1} : S \times A \to \mathbb{R}_+$ such that

$Q^\star_h(s,a) \le \hat{Q}_{h,t-1}(s,a), \qquad (2)$

$\hat{Q}_{h,t-1}(s,a) \le \mathcal{T}_h(\hat{Q}_{h+1,t-1})(s,a) + \mathrm{cnf}_{h,t-1}(s,a). \qquad (3)$

We will verify that our estimates $\hat{Q}_{h,\cdot}$ satisfy these properties subsequently. Before doing so, we state the regret decomposition lemma and an immediate corollary.

Lemma 1. Fix episode $t$ and let $\mathcal{F}_{t-1}$ be the filtration of $\{(s_{h,\tau}, a_{h,\tau}, r_{h,\tau})\}_{\tau < t}$. Assume that $\hat{Q}_{h,t-1}$ satisfies (2) and (3) for some function $\mathrm{cnf}_{h,t-1}$. Then, if $\pi_t = \operatorname{argmax}_{a \in A} \hat{Q}_{h,t-1}(\cdot, a)$ is deployed, we have

$V^\star - \mathbb{E}\left[\sum_{h=1}^H r_{h,t} \mid \mathcal{F}_{t-1}\right] \le \zeta_t + \sum_{h=1}^H \mathrm{cnf}_{h,t-1}(s_{h,t}, a_{h,t}),$

where $\mathbb{E}[\zeta_t \mid \mathcal{F}_{t-1}] = 0$ and $|\zeta_t| \le 2H$ almost surely.

Corollary 3. Assume that for all $t$, $\hat{Q}_{h,t-1}$ satisfies (2) and (3) and that $\pi_t$ is the greedy policy with respect to $\hat{Q}_{h,t-1}$. Then with probability at least $1 - \delta$, we have

$\mathrm{Reg}(T) \le \sum_{t=1}^T \sum_{h=1}^H \mathrm{cnf}_{h,t-1}(s_{h,t}, a_{h,t}) + O\left(H\sqrt{T\log(1/\delta)}\right).$

Proof of Lemma 1. Observe that

$V^\star = \mathbb{E}[Q^\star_1(s_1, \pi^\star(s_1))] \le \mathbb{E}[\hat{Q}_{1,t-1}(s_1, \pi^\star(s_1))] \le \mathbb{E}[\hat{Q}_{1,t-1}(s_1, \pi_t(s_1))] \le \mathbb{E}[\mathrm{cnf}_{1,t-1}(s_1, \pi_t(s_1))] + \mathbb{E}[\mathcal{T}_1(\hat{Q}_{2,t-1})(s_1, \pi_t(s_1))] = \mathbb{E}[\mathrm{cnf}_{1,t-1}(s_1, \pi_t(s_1))] + \mathbb{E}[r_1 \mid s_1, a_1 = \pi_t(s_1)] + \mathbb{E}_{s_2 \sim \pi_t}[\hat{Q}_{2,t-1}(s_2, \pi_t(s_2))].$

Throughout this calculation, $s_1 \sim \mu$. The first step here is by definition, and the second uses the optimism property (2) for $\hat{Q}_{1,t-1}$. The third uses that $\pi_t$ is the greedy policy with respect to $\hat{Q}_{1,t-1}$, while the fourth uses the upper bound (3) on $\hat{Q}_{1,t-1}$. Finally, we use the definition of the Bellman operator and the fact that $\pi_t$ is the greedy policy yet again.
Comparing this upper bound with the expected reward collected by $\pi_t$, we observe that $r_1$ cancels, and we get

$V^\star - \mathbb{E}\left[\sum_{h=1}^H r_{h,t} \mid \mathcal{F}_{t-1}\right] \le \mathbb{E}_{\pi_t}[\mathrm{cnf}_{1,t-1}(s_1, \pi_t(s_1))] + \mathbb{E}_{\pi_t}\left[\hat{Q}_{2,t-1}(s_2, \pi_t(s_2)) - \sum_{h=2}^H r_{h,t} \mid \mathcal{F}_{t-1}\right].$

At this point, notice that $\hat{Q}_{2,t-1}(s_2, \pi_t(s_2))$ is precisely what we already upper bounded at time point $h = 1$, and we are always considering the state-action distribution induced by $\pi_t$. Hence, repeating the argument for all $h$, we obtain

$V^\star - \mathbb{E}\left[\sum_{h=1}^H r_{h,t} \mid \mathcal{F}_{t-1}\right] \le \sum_{h=1}^H \mathbb{E}_{\pi_t}[\mathrm{cnf}_{h,t-1}(s_h, a_h)] = \sum_{h=1}^H \mathrm{cnf}_{h,t-1}(s_{h,t}, a_{h,t}) + \zeta_t,$

where $\zeta_t \triangleq \sum_{h=1}^H \zeta_{h,t}$ and $\zeta_{h,t} \triangleq \mathbb{E}_{\pi_t}[\mathrm{cnf}_{h,t-1}(s_h, \pi_t(s_h))] - \mathrm{cnf}_{h,t-1}(s_{h,t}, a_{h,t})$, which is easily seen to have the required properties.

The lemma states that if $\hat{Q}_{h,t-1}$ is optimistic and we deploy the greedy policy $\pi_t$, then the per-episode regret is controlled by the overestimation error of $\hat{Q}_{h,t-1}$, up to a stochastic term that enjoys favorable concentration properties. Crucially, the errors are accumulated on the observed trajectory, or, stated another way, $\mathrm{cnf}_{h,t-1}$ is evaluated on the states and actions visited during the episode. As these states and actions will be used to update $\hat{Q}$, we can expect that the $\mathrm{cnf}$ function will decrease on these arguments. This can yield one of two outcomes: either we will incur lower regret in the next episode, or we will explore the environment by visiting new states and actions. In this sense, the lemma demonstrates how optimism navigates the exploration-exploitation tradeoff in the multi-step RL setting, analogously to the bandit setting. Note that Lemma 1 and Corollary 3 do not assume any form for $\hat{Q}_{h,t-1}$ and do not require Assumption 2. In particular, they are not specialized to GLMs. In our proof, we use the GLM representation and Assumption 2 to ensure that (3) holds and to bound the confidence sum in Corollary 3.
We believe these technical results will be useful in designing RL algorithms for general function classes, which is a natural direction for future work.

Deviation analysis. The next step of the proof is to design the $\mathrm{cnf}$ function and ensure that (3) holds with high probability. This is the content of the next lemma.

Lemma 2. Under Assumption 1 and Assumption 2, with probability $1 - 1/(TH)$, we have that for all $t, h, s, a$:

$\left|f(\langle \phi(s,a), \hat{\theta}_{h,t} \rangle) - \mathcal{T}_h(\hat{Q}_{h+1,t})(s,a)\right| \le \gamma \|\phi(s,a)\|_{\Lambda_{h,t}^{-1}},$

where $\gamma, \Lambda_{h,t}$ are defined in Algorithm 1.

A simple induction argument then verifies that (3) holds, which we summarize in the next corollary.

Corollary 4. Under Assumption 1 and Assumption 2, with probability $1 - 1/(TH)$, we have that (2) and (3) hold for all $t, h$ with $\mathrm{cnf}_{h,t-1}(s,a) = \min\{2, 2\gamma \|\phi(s,a)\|_{\Lambda_{h,t-1}^{-1}}\}$.

As the proof of Lemma 2 is rather long and technical, we defer the details to the appendix and instead explain the high-level argument here. The proof requires an intricate deviation analysis to account for the dependency structure in the data sequence. The intuition is that, thanks to Assumption 2 and the fact that $\hat{Q}_{h+1,t} \in \mathcal{G}^{up}$, we know that there exists a parameter $\bar{\theta}_{h,t}$ such that $f(\langle \phi(s,a), \bar{\theta}_{h,t} \rangle) = \mathcal{T}_h(\hat{Q}_{h+1,t})(s,a)$. It is easy to verify that $\bar{\theta}_{h,t}$ is the Bayes optimal predictor for the square loss problem in (1), and so with a uniform convergence argument we can expect that $\hat{\theta}_{h,t}$ is close to $\bar{\theta}_{h,t}$, which is our desired conclusion. There are two subtleties with this argument. First, we want to show that $\hat{\theta}_{h,t}$ and $\bar{\theta}_{h,t}$ are close in a data-dependent sense, to obtain the dependence on the $\Lambda_{h,t}^{-1}$-Mahalanobis norm in the bound. This can be done using vector-valued self-normalized martingale inequalities (Peña et al., 2008), as in prior work on linear stochastic bandits (Abbasi-Yadkori et al., 2012; Filippi et al., 2010; Abbasi-Yadkori et al., 2011).
However, the process we are considering is not a martingale, since $\hat{Q}_{h+1,t}$, which determines the regression targets $y_{h,\tau}$, depends on all data collected so far. Hence $y_{h,\tau}$ is not measurable with respect to the filtration $\mathcal{F}_\tau$, which prevents us from directly applying a self-normalized martingale concentration inequality. To circumvent this issue, we use a uniform convergence argument and introduce a deterministic covering of $\mathcal{G}_{\mathrm{up}}$. Each element of the cover induces a different sequence of regression targets $y_{h,\tau}$, but as the covering is deterministic, we do obtain martingale structure. Then, we show that the error term for the random $\hat{Q}_{h+1,t}$ that we need to bound is close to a corresponding term for one of the covering elements, and we finish the proof with a uniform convergence argument over all covering elements. The corollary is then obtained by a straightforward inductive argument: assuming $\hat{Q}_{h+1,t}$ dominates $Q^\star$, it is easy to show that $\hat{Q}_{h,t}$ also dominates $Q^\star$, and the upper bound is immediate. Combining Corollary 4 with Corollary 3, all that remains is to upper bound the confidence sum.

Potential argument. To bound the confidence sum, we use a standard potential argument that appears in a number of works on stochastic linear bandits. We summarize the conclusion with the following lemma, which follows directly from Lemma 11 of Abbasi-Yadkori et al. (2012).

Lemma 3. For any $h \le H$ we have that $\sum_{t=1}^T \|\phi(s_{h,t}, a_{h,t})\|^2_{\Lambda_{h,t-1}^{-1}} \le 2d \ln(1 + T/d)$.

Wrapping up. Equipped with the above results, we are now prepared to prove Theorem 1.

Proof of Theorem 1. Assume that Corollary 4 holds for all $1 \le h \le H$ and $1 \le t \le T$. Applying Lemma 1 and the definition of $\mathrm{cnf}_{h,t-1}$ implied by Corollary 4, the cumulative expected regret is at most
$$TV^\star - \mathbb{E}\Big[\sum_{t=1}^T \sum_{h=1}^H r_{h,t}\Big] \le \sum_{t=1}^T \zeta_t + \sum_{t=1}^T \sum_{h=1}^H \min\big\{2,\ \gamma \|\phi(s_{h,t}, a_{h,t})\|_{\Lambda_{h,t-1}^{-1}}\big\} \le \sum_{t=1}^T \zeta_t + \sum_{h=1}^H \sqrt{T\gamma^2 \cdot \sum_{t=1}^T \|\phi(s_{h,t}, a_{h,t})\|^2_{\Lambda_{h,t-1}^{-1}}} \le \sum_{t=1}^T \zeta_t + \sum_{h=1}^H \sqrt{T\gamma^2 \cdot 2d \ln(1 + T/d)}.$$
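The elliptical-potential bound of Lemma 3 is easy to check numerically. The sketch below (with arbitrary illustrative choices of $d$, $T$, and random unit-norm features; the regularized design matrix starts at the identity, consistent with a ridge term) accumulates the squared confidence widths under rank-one updates and compares against $2d\ln(1+T/d)$:

```python
import numpy as np

# Numerical check of the elliptical-potential bound in Lemma 3
# (a standard fact; see Lemma 11 of Abbasi-Yadkori et al., 2012).
rng = np.random.default_rng(1)
d, T = 8, 500

Lam = np.eye(d)                  # Lambda_{h,0} = I (ridge regularizer)
potential = 0.0
for t in range(T):
    x = rng.normal(size=d)
    x /= max(1.0, np.linalg.norm(x))           # features satisfy ||x||_2 <= 1
    potential += x @ np.linalg.solve(Lam, x)   # ||x||^2 in Lambda^{-1} norm
    Lam += np.outer(x, x)                      # rank-one update

bound = 2 * d * np.log(1 + T / d)
assert potential <= bound
print(f"sum of squared widths = {potential:.2f} <= {bound:.2f}")
```

The bound holds deterministically for any unit-norm feature sequence, which is why the confidence sum in the proof of Theorem 1 grows only logarithmically per dimension.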
Here, the second step follows from the Cauchy-Schwarz inequality, and the last step is an application of Lemma 3. The first term forms a martingale, and we know that $|\zeta_t| \le 2H$. Therefore, by Azuma's inequality, we have that with probability at least $1 - 1/(TH)$,
$$\sum_{t=1}^T \zeta_t \le \sqrt{8TH^2 \ln(TH)}.$$
Finally, using the definition of $\gamma$, the final regret is upper bounded by
$$\mathrm{Reg}(T) \le O\Big(H\sqrt{T \ln(TH)} + HK\kappa^{-1} \sqrt{\big(M + K + d^2 \ln((K + \Gamma)TH)\big) \cdot T d \ln(1 + T/d)}\Big) \le \tilde{O}\big(H\sqrt{d^3 T}\big),$$
which proves the result.

B PROOF OF LEMMA 2 AND COROLLARY 4

To facilitate our analysis we define the following important intermediate quantity:
$$\bar{\theta}_{h,t} \in \mathbb{B}_d: \quad f(\langle \phi(s,a), \bar{\theta}_{h,t} \rangle) \triangleq \mathbb{E}\Big[r_h + \max_{a' \in \mathcal{A}} \hat{Q}_{h+1,t}(s', a') \,\Big|\, s, a\Big].$$
In words, $\bar{\theta}_{h,t}$ is the Bayes optimal predictor for the squared loss problem at time point $h$ in the $t$-th episode. Since by inspection $\hat{Q}_{h+1,t} \in \mathcal{G}_{\mathrm{up}}$, by Assumption 2 we know that $\bar{\theta}_{h,t}$ exists for all $h$ and $t$.

Lemma 4. For any $\theta, \theta', x \in \mathbb{R}^d$ satisfying $\|\theta\|_2, \|\theta'\|_2, \|x\|_2 \le 1$,
$$\kappa^2 |\langle x, \theta - \theta' \rangle|^2 \le |f(\langle x, \theta \rangle) - f(\langle x, \theta' \rangle)|^2 \le K^2 \|\theta - \theta'\|_2^2.$$

Proof. By the mean-value theorem, there exists $\theta'' = \theta + \lambda(\theta' - \theta)$ for some $\lambda \in (0,1)$ such that $f(\langle x, \theta' \rangle) - f(\langle x, \theta \rangle) = \langle \nabla_\theta f(\langle x, \theta'' \rangle), \theta' - \theta \rangle$. On the other hand, by the chain rule and Assumption 1, $\nabla_\theta f(\langle x, \theta'' \rangle) = f'(\langle x, \theta'' \rangle) \cdot x$. Hence,
$$|\langle \nabla_\theta f(\langle x, \theta'' \rangle), \theta' - \theta \rangle|^2 \le f'(\langle x, \theta'' \rangle)^2 \cdot |\langle x, \theta' - \theta \rangle|^2 \le K^2 \|x\|_2^2\, \|\theta' - \theta\|_2^2 \le K^2 \|\theta' - \theta\|_2^2;$$
$$|\langle \nabla_\theta f(\langle x, \theta'' \rangle), \theta' - \theta \rangle|^2 \ge \kappa^2 |\langle x, \theta' - \theta \rangle|^2,$$
which were to be demonstrated.

Lemma 5. For any $0 < \varepsilon \le 1$, there exists a finite subset $\mathcal{V}_\varepsilon \subset \mathcal{G}_{\mathrm{up}}$ with $\ln |\mathcal{V}_\varepsilon| \le 6d^2 \ln(2(1 + K + \Gamma)/\varepsilon)$, such that $\sup_{g \in \mathcal{G}_{\mathrm{up}}} \min_{v \in \mathcal{V}_\varepsilon} \sup_{s,a} |g(\phi(s,a)) - v(\phi(s,a))| \le \varepsilon$.

Proof. Recall that for every $g \in \mathcal{G}_{\mathrm{up}}$, there exist $\theta \in \mathbb{B}_d$, $0 \le \gamma \le \Gamma$ and $\|A\|_{\mathrm{op}} \le 1$ such that $g(x) = \min\{1, f(\langle x, \theta \rangle) + \gamma \|x\|_A\}$. Let $\Theta_{\varepsilon'} \subseteq \mathbb{B}_d$, $\Gamma_{\varepsilon'} \subseteq [0, \Gamma]$ and $\mathcal{M}_{\varepsilon'} \subseteq \{M \in \mathcal{S}^+_d : \|M\|_{\mathrm{op}} \le 1\}$ be finite subsets such that for any $\theta, \gamma, A$, there exist $\theta' \in \Theta_{\varepsilon'}$, $\gamma' \in \Gamma_{\varepsilon'}$, $A' \in \mathcal{M}_{\varepsilon'}$ with $\max\{\|\theta - \theta'\|_2, |\gamma - \gamma'|, \|A - A'\|_{\mathrm{op}}\} \le \varepsilon'$, where $\varepsilon' \in (0,1)$ will be specified later in the proof. For the function $g \in \mathcal{G}_{\mathrm{up}}$ corresponding to the parameters $\theta, \gamma, A$, the function $g'$ corresponding to parameters $\theta', \gamma', A'$ satisfies
$$\sup_{s,a} |g(\phi(s,a)) - g'(\phi(s,a))| \le \sup_{x \in \mathbb{B}_d} |g(x) - g'(x)| \le \sup_{x \in \mathbb{B}_d} \big|f(\langle x, \theta \rangle) - f(\langle x, \theta' \rangle) + \gamma \|x\|_A - \gamma' \|x\|_{A'}\big| \le K \|\theta - \theta'\|_2 + |\gamma - \gamma'| + \Gamma \,\big|\|x\|_A - \|x\|_{A'}\big| \le K \|\theta - \theta'\|_2 + |\gamma - \gamma'| + \Gamma \sqrt{|x^\top (A - A') x|} \le K\varepsilon' + \varepsilon' + \Gamma \sqrt{\varepsilon'} \le (1 + K + \Gamma)\sqrt{\varepsilon'}.$$
In the last step we use $\varepsilon' \le 1$.
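Lemma 4's two-sided bound can be sanity-checked for a concrete link function. Below we take $f$ to be the sigmoid, an illustrative choice not mandated by the paper; for $|z| \le 1$ its derivative lies in $[\kappa, K]$ with $\kappa = \sigma'(1) \approx 0.197$ and $K = 1/4$, and we test both inequalities on random unit-ball vectors:

```python
import numpy as np

rng = np.random.default_rng(2)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
# For the sigmoid link restricted to |z| <= 1:
kappa = sigmoid(1) * (1 - sigmoid(1))   # min of f' on [-1, 1], about 0.197
K = 0.25                                # max of f' (attained at z = 0)

for _ in range(1000):
    # Three random vectors, each projected into the unit ball.
    x, th, th2 = (v / max(1.0, np.linalg.norm(v))
                  for v in rng.normal(size=(3, 6)))
    lhs = kappa * abs(x @ (th - th2))
    mid = abs(sigmoid(x @ th) - sigmoid(x @ th2))
    rhs = K * np.linalg.norm(th - th2)
    assert lhs <= mid + 1e-12 and mid <= rhs + 1e-12
print("kappa|<x, th-th'>| <= |f(<x,th>) - f(<x,th'>)| <= K ||th - th'||")
```

The lower inequality is what lets the proof convert prediction error back into parameter error, and the upper one is the Lipschitz step used throughout the deviation analysis.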
Therefore, if we define the class
$$\mathcal{V}_\varepsilon \triangleq \big\{(s,a) \mapsto \min\{1, f(\langle \phi(s,a), \theta' \rangle) + \gamma' \|\phi(s,a)\|_{A'}\} : \theta' \in \Theta_{\varepsilon'}, \gamma' \in \Gamma_{\varepsilon'}, A' \in \mathcal{M}_{\varepsilon'}\big\},$$
we know that the covering property is satisfied with parameter $(1 + K + \Gamma)\sqrt{\varepsilon'}$. Setting $\varepsilon' = \varepsilon^2/(1 + K + \Gamma)^2$ we have the desired covering property.

For the next lemma, let $\mathcal{F}_{t-1} \triangleq \sigma(\{(s_{h,\tau}, a_{h,\tau}, r_{h,\tau})\}_{\tau < t})$ be the filtration induced by all observed trajectories up to but not including time $t$. Observe that $\hat{Q}_{\cdot,t-1}$ and our policy $\pi_{h,t}$ are $\mathcal{F}_{t-1}$ measurable.

Published as a conference paper at ICLR 2021

Lemma 6 (Restatement of Lemma 2). Fix any $1 \le t \le T$ and $1 \le h \le H$. Then as long as $\pi_t$ is $\mathcal{F}_{t-1}$ measurable, with probability $1 - 1/(TH)^2$ it holds that
$$\big|f(\langle \phi(s,a), \hat{\theta}_{h,t} \rangle) - f(\langle \phi(s,a), \bar{\theta}_{h,t} \rangle)\big| \le \min\big\{2,\ \gamma \|\phi(s,a)\|_{\Lambda_{h,t}^{-1}}\big\}, \quad \forall s, a,$$
for $\gamma \ge CK\kappa^{-1}\sqrt{1 + M + K + d^2 \ln((1 + K + \Gamma)TH)}$, where $0 < C < \infty$ is a universal constant. Note that this is precisely Lemma 2, as $\bar{\theta}_{h,t}$ is defined via $f(\langle \phi(s,a), \bar{\theta}_{h,t} \rangle) = \mathcal{T}_h(\hat{Q}_{h+1,t})(s,a)$.

Proof. The upper bound of 2 is obvious, since both terms are upper bounded by 1 in absolute value. Therefore we focus on the second term in the minimum. To simplify notation we omit the dependence on $h$ in the subscripts and write $x_\tau, y_\tau$ for $x_{h,\tau}$ and $y_{h,\tau}$. We also abbreviate $\hat{\theta} \triangleq \hat{\theta}_{h,t}$ and $\bar{\theta} \triangleq \bar{\theta}_{h,t}$. Since $\|\bar{\theta}\|_2 \le 1$, the optimality of $\hat{\theta}$ for (1) implies that
$$\sum_{\tau \le t} \big(f(\langle x_\tau, \hat{\theta} \rangle) - y_\tau\big)^2 \le \sum_{\tau \le t} \big(f(\langle x_\tau, \bar{\theta} \rangle) - y_\tau\big)^2.$$
Decomposing the squares and re-organizing the terms, we have that
$$\sum_{\tau \le t} \big(f(\langle x_\tau, \hat{\theta} \rangle) - f(\langle x_\tau, \bar{\theta} \rangle)\big)^2 \le 2 \sum_{\tau \le t} \xi_\tau \big(f(\langle x_\tau, \hat{\theta} \rangle) - f(\langle x_\tau, \bar{\theta} \rangle)\big), \tag{5}$$
where $\xi_\tau \triangleq y_\tau - f(\langle x_\tau, \bar{\theta} \rangle)$. By the fundamental theorem of calculus, we have
$$f(\langle x_\tau, \hat{\theta} \rangle) - f(\langle x_\tau, \bar{\theta} \rangle) = \int_{\langle x_\tau, \bar{\theta} \rangle}^{\langle x_\tau, \hat{\theta} \rangle} f'(s)\,ds = \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle \underbrace{\int_0^1 f'\big(\langle x_\tau, s\hat{\theta} + (1-s)\bar{\theta} \rangle\big)\,ds}_{D_\tau}.$$
Using this identity on both sides of (5), we have that
$$\sum_{\tau \le t} D_\tau^2 \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle^2 \le 2 \sum_{\tau \le t} \xi_\tau D_\tau \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle. \tag{6}$$
Note also that, by Assumption 1, $D_\tau$ satisfies $\kappa^2 \le D_\tau^2 \le K^2$ almost surely for all $\tau$.
The difficulty in controlling (6) is that $\hat{\theta}$ itself is a random variable that depends on $\{(x_\tau, y_\tau)\}_{\tau \le t}$. In particular, we want that $\mathbb{E}[\xi_\tau \mid D_\tau \langle x_\tau, \varphi \rangle, \mathcal{F}_{\tau-1}] = 0$ for any fixed $\varphi$, but this is not immediate as $\hat{\theta}$ depends on $x_\tau$. To proceed, we eliminate this dependence with a uniform convergence argument. Let $\varepsilon \in (0,1)$ be a covering accuracy parameter to be determined later in this proof. Let $\mathcal{V}_\varepsilon$ be the pointwise covering for $\mathcal{G}_{\mathrm{up}}$ that is implied by Lemma 5. Let $g_\varepsilon \in \mathcal{V}_\varepsilon$ be the approximation for $\hat{Q}_{h+1,t}$ that satisfies (4). By Assumption 2, there exists some $\bar{\theta}' \in \mathbb{B}_d$ such that
$$\forall s, a: \quad f(\langle \phi(s,a), \bar{\theta}' \rangle) = \mathbb{E}\Big[r + \max_{a' \in \mathcal{A}} g_\varepsilon(s', a') \,\Big|\, s, a\Big].$$
Now, define $y'_\tau$ and $\xi'_\tau$ as
$$y'_\tau \triangleq r_{h,\tau} + \max_{a' \in \mathcal{A}} g_\varepsilon(s_{h+1,\tau}, a'), \qquad \xi'_\tau \triangleq y'_\tau - f(\langle x_{h,\tau}, \bar{\theta}' \rangle).$$
The right-hand side of (6) can then be upper bounded as
$$2 \sum_{\tau \le t} \xi_\tau D_\tau \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle \le 2 \sum_{\tau \le t} \xi'_\tau D_\tau \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle + \Delta, \tag{7}$$
where $|\Delta| \le Kt \times \max_{\tau \le t} |\xi_\tau - \xi'_\tau|$ almost surely.

Upper bounding $\Delta$ in (7). Fix $\tau \le t$. By definition, we have that
$$|\xi_\tau - \xi'_\tau| \le |y_\tau - y'_\tau| + \big|f(\langle x_\tau, \bar{\theta} \rangle) - f(\langle x_\tau, \bar{\theta}' \rangle)\big| \le \max_{a \in \mathcal{A}} \big|g_\varepsilon(s_{h+1,\tau}, a) - \hat{Q}_{h+1,t}(s_{h+1,\tau}, a)\big| + K \|\bar{\theta} - \bar{\theta}'\|_2 \tag{8}$$
$$\le \varepsilon + K\varepsilon \le (K + 1)\varepsilon, \tag{9}$$
where (8) holds by Lemma 4 and (9) follows from Lemma 5. In particular, the bound on $\|\bar{\theta} - \bar{\theta}'\|_2$ can be verified by expanding the definitions and noting that $g_\varepsilon$ is pointwise close to $\hat{Q}_{h+1,t}$. Therefore, we have $|\Delta| \le (K+1)^2 t\varepsilon$.

Upper bounding (7). Note that $D_\tau$ is a function of $x_\tau$, $\hat{\theta}$, and $\bar{\theta}$. For clarity, we define
$$D_\tau(\theta, \theta') \triangleq \int_0^1 f'\big(\langle x_\tau, s\theta + (1-s)\theta' \rangle\big)\,ds.$$
As $|f''(z)| \le M$ for all $|z| \le 1$ and $\|x_\tau\|_2 \le 1$, we have that for every $\theta, \theta', \tilde{\theta}, \tilde{\theta}' \in \mathbb{B}_d$,
$$\big|D_\tau(\theta, \theta') - D_\tau(\tilde{\theta}, \tilde{\theta}')\big| \le \int_0^1 \big|f'(\langle x_\tau, s\theta + (1-s)\theta' \rangle) - f'(\langle x_\tau, s\tilde{\theta} + (1-s)\tilde{\theta}' \rangle)\big|\,ds \le M\big(\|\theta - \tilde{\theta}\|_2 + \|\theta' - \tilde{\theta}'\|_2\big).$$
Hence, for any $(\theta, \theta')$ and $(\tilde{\theta}, \tilde{\theta}')$ pairs, we have for every $\tau$ that
$$\big|\xi'_\tau \langle x_\tau, D_\tau(\theta, \theta')(\theta - \theta') - D_\tau(\tilde{\theta}, \tilde{\theta}')(\tilde{\theta} - \tilde{\theta}') \rangle\big| \le \big|D_\tau(\theta, \theta') - D_\tau(\tilde{\theta}, \tilde{\theta}')\big| \times \|\theta - \theta'\|_2 + \big|D_\tau(\tilde{\theta}, \tilde{\theta}')\big| \times \big(\|\theta - \tilde{\theta}\|_2 + \|\theta' - \tilde{\theta}'\|_2\big) \le M\big(\|\theta - \tilde{\theta}\|_2 + \|\theta' - \tilde{\theta}'\|_2\big) \times 2 + K\big(\|\theta - \tilde{\theta}\|_2 + \|\theta' - \tilde{\theta}'\|_2\big) \le (2M + K)\big(\|\theta - \tilde{\theta}\|_2 + \|\theta' - \tilde{\theta}'\|_2\big). \tag{10}$$
Here we are using that $|\xi'_\tau| \le 1$.
We are now in a position to invoke Lemma 8. Consider a fixed function $g_\varepsilon$, which defines a fixed $\bar{\theta}'$. We will bound $\sum_{\tau \le t} \xi'_\tau \langle x_\tau, D_\tau(\theta, \theta')(\theta - \theta') \rangle$ uniformly over all pairs $(\theta, \theta')$. With $g_\varepsilon, \bar{\theta}'$ fixed and since $\pi_t$ is $\mathcal{F}_{t-1}$ measurable, we have that $\{x_\tau, \xi'_\tau\}_{\tau \le t}$ are random variables satisfying $\mathbb{E}[\xi'_\tau \mid x_{1:\tau}, \xi'_{1:\tau-1}] = 0$. For $\varphi = (\theta, \theta')$ we define the function $q(x_\tau, \varphi) = \langle x_\tau, D_\tau(\varphi)(\theta - \theta') \rangle$, which as we have just calculated satisfies $|q(x_\tau, \varphi) - q(x_\tau, \tilde{\varphi})| \le (2M + K)\|\varphi - \tilde{\varphi}\|_2$. For $\delta' \in (0, 1/2)$, with probability $1 - \delta'$ we have
$$\forall \varphi = (\theta, \theta') \in \mathbb{B}_d^2: \quad \Big|\sum_{\tau \le t} \xi'_\tau \langle x_\tau, D_\tau(\varphi)(\theta - \theta') \rangle\Big| \le (2M + K) + 2\big(1 + \sqrt{V(\varphi)}\big)\sqrt{2d \ln(4T) + \ln(1/\delta')} \le 4 \max\Big\{M + K + 2d \ln(4T) + \ln(1/\delta'),\ \sqrt{V(\varphi)\big(2d \ln(4T) + \ln(1/\delta')\big)}\Big\}, \tag{11}$$
where $V(\varphi) \triangleq \sum_{\tau \le t} \langle x_\tau, D_\tau(\varphi)(\theta - \theta') \rangle^2$. The last inequality holds because $a + b \le 2\max\{a, b\}$.

Next, take a union bound over all $g_\varepsilon \in \mathcal{V}_\varepsilon$, so that (11) holds for any $g_\varepsilon$ and any subsequently induced choice of $\xi'_\tau$ with probability at least $1 - |\mathcal{V}_\varepsilon|\delta'$. In particular, this union bound implies that (11) holds for the choice of $g_\varepsilon$ that approximates $\hat{Q}_{h+1,t}$. Therefore, combining (6), (7), (10) with (11) for this choice of $g_\varepsilon$, we have that with probability at least $1 - |\mathcal{V}_\varepsilon|\delta'$,
$$\sum_{\tau \le t} D_\tau^2 \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle^2 \le 2\Delta + 2\Big|\sum_{\tau \le t} \xi'_\tau \langle x_\tau, D_\tau(\hat{\theta} - \bar{\theta}) \rangle\Big| \le 2(K+1)^2 t\varepsilon + 8 \max\Big\{M + K + 2d \ln(4T) + \ln(|\mathcal{V}_\varepsilon|/\delta'),\ \sqrt{V(\hat{\theta}, \bar{\theta}) \cdot \big(2d \ln(4T) + \ln(|\mathcal{V}_\varepsilon|/\delta')\big)}\Big\}.$$
Observe that the left-hand side is precisely $V(\hat{\theta}, \bar{\theta})$. Now, set $\varepsilon = 1/(2(K+1)^2 T)$ and $\delta' = 1/(|\mathcal{V}_\varepsilon| T^2 H^2)$ and use the bound on $\ln|\mathcal{V}_\varepsilon|$ from Lemma 5 to get
$$2d \ln(4T) + \ln(|\mathcal{V}_\varepsilon|/\delta') \le 2d \ln(4T) + 12d^2 \ln(2(1 + K + \Gamma)/\varepsilon) + 2\ln(TH) \le 4d \ln(2TH) + 24d^2 \ln(2(1 + K + \Gamma)T) \le 28d^2 \ln(2(1 + K + \Gamma)TH).$$
Therefore, we obtain
$$V(\hat{\theta}, \bar{\theta}) \le 1 + 8\max\Big\{M + K + 28d^2 \ln(2(1 + K + \Gamma)TH),\ \sqrt{V(\hat{\theta}, \bar{\theta}) \cdot 28d^2 \ln(2(1 + K + \Gamma)TH)}\Big\} \le 16 \max\Big\{1 + M + K + 28d^2 \ln(2(1 + K + \Gamma)TH),\ \sqrt{V(\hat{\theta}, \bar{\theta}) \cdot 28d^2 \ln(2(1 + K + \Gamma)TH)}\Big\}.$$
Subsequently,
$$V(\hat{\theta}, \bar{\theta}) = \sum_{\tau \le t} D_\tau^2 \langle x_\tau, \hat{\theta} - \bar{\theta} \rangle^2 \le 16 \max\big\{1 + M + K + 28d^2 \ln(2(1 + K + \Gamma)TH),\ 448 d^2 \ln(2(1 + K + \Gamma)TH)\big\} \le C_V^2 \big(1 + M + K + d^2 \ln((1 + K + \Gamma)TH)\big),$$
where $0 < C_V < \infty$ is a universal constant. Next, note that $D_\tau^2 \ge \kappa^2$, thanks to Assumption 1. We then have
$$\sqrt{(\hat{\theta} - \bar{\theta})^\top \Lambda_{h,t} (\hat{\theta} - \bar{\theta})} \le \kappa^{-1} \sqrt{V(\hat{\theta}, \bar{\theta})} \le C_V \kappa^{-1} \sqrt{1 + M + K + d^2 \ln((1 + K + \Gamma)TH)},$$
where $\Lambda_{h,t} = \sum_{\tau < t} x_\tau x_\tau^\top$. Finally, for any $(s,a)$ pair, invoking Lemma 4 and the Cauchy-Schwarz inequality, we have
$$\big|f(\langle \phi(s,a), \hat{\theta} \rangle) - f(\langle \phi(s,a), \bar{\theta} \rangle)\big| \le K \big|\langle \phi(s,a), \hat{\theta} - \bar{\theta} \rangle\big| \le K \sqrt{(\hat{\theta} - \bar{\theta})^\top \Lambda_{h,t} (\hat{\theta} - \bar{\theta})} \times \sqrt{\phi(s,a)^\top \Lambda_{h,t}^{-1} \phi(s,a)} \le C_V K \kappa^{-1} \sqrt{1 + M + K + d^2 \ln((1 + K + \Gamma)TH)} \times \|\phi(s,a)\|_{\Lambda_{h,t}^{-1}},$$
which is what we wanted to demonstrate.

Corollary 5 (Restatement of Corollary 4). With probability $1 - 1/(TH)$, $\hat{Q}_{h,t}(s,a) \ge Q^\star_h(s,a)$ holds for all $h, t, s, a$.

Proof. Fix $1 \le t \le T$. We use induction on $h$ to prove this corollary. For $h = H + 1$, $\hat{Q}_{H+1,t}(\cdot,\cdot) \ge Q^\star_{H+1}(\cdot,\cdot)$ clearly holds because $\hat{Q}_{H+1,t} \equiv Q^\star_{H+1} \equiv 0$. Now assume that $\hat{Q}_{h+1,t} \ge Q^\star_{h+1}$, and let us prove that this is also true for time step $h$. Since $\hat{Q}_{h+1,t}(s', a') \ge Q^\star_{h+1}(s', a')$ for all $s', a'$, we have that $f(\langle \phi(s,a), \bar{\theta}_{h,t} \rangle) \ge f(\langle \phi(s,a), \theta^\star_h \rangle)$ for all $(s,a)$ pairs. Then, by the definition of $\hat{Q}_{h,t}$ and Lemma 6, with probability $1 - 1/(TH)^2$ it holds uniformly for all $(s,a)$ pairs that $\hat{Q}_{h,t}(s,a) \ge f(\langle \phi(s,a), \bar{\theta}_{h,t} \rangle)$. Hence, with the same probability, we have $\hat{Q}_{h,t}(s,a) \ge Q^\star_h(s,a)$ for all $(s,a)$ pairs. A union bound over all $t \le T$ and $h \le H$ completes the proof.
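The final step of the proof pairs the $\Lambda_{h,t}$-norm of the parameter error with the $\Lambda_{h,t}^{-1}$-norm of the features via Cauchy-Schwarz. A quick numerical check of this matrix-weighted inequality (the dimension and the construction of a positive-definite $\Lambda$ below are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(3)
d = 6
# A random positive-definite Lambda = I + sum of outer products.
X = rng.normal(size=(50, d))
Lam = np.eye(d) + X.T @ X

for _ in range(500):
    phi, delta = rng.normal(size=(2, d))   # delta plays theta_hat - theta_bar
    lhs = abs(phi @ delta)
    # |<phi, delta>| <= ||delta||_Lam * ||phi||_{Lam^{-1}}, i.e. Cauchy-Schwarz
    # applied to Lam^{1/2} delta and Lam^{-1/2} phi.
    rhs = np.sqrt(delta @ Lam @ delta) * np.sqrt(phi @ np.linalg.solve(Lam, phi))
    assert lhs <= rhs * (1 + 1e-9) + 1e-9
print("|<phi, delta>| <= ||delta||_Lam * ||phi||_{Lam^{-1}} verified")
```

This is the same decoupling that turns a single data-dependent parameter-error bound into a per-state-action confidence width.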

C TAIL INEQUALITIES

Lemma 7 (Azuma's inequality). Suppose $X_0, X_1, X_2, \cdots, X_N$ form a martingale (i.e., $\mathbb{E}[X_{k+1} \mid X_1, \cdots, X_k] = X_k$) and satisfy $|X_k - X_{k-1}| \le c_k$ almost surely. Then for any $\varepsilon > 0$,
$$\Pr\big(|X_n - X_0| \ge \varepsilon\big) \le 2\exp\Big(-\frac{\varepsilon^2}{2\sum_{k=1}^n c_k^2}\Big).$$

Lemma 8. Let $\{\xi_\tau, u_\tau\}_{\tau \le t}$ be random variables such that $\mathbb{E}[\xi_\tau \mid u_1, \xi_1, \cdots, u_{\tau-1}, \xi_{\tau-1}, u_\tau] = 0$ and $|\xi_\tau| \le 1$ almost surely. Let $q : (u, \varphi) \mapsto \mathbb{R}$ be an arbitrary deterministic function satisfying $|q(u, \varphi) - q(u, \varphi')| \le C\|\varphi - \varphi'\|_2$ for all $u$, $\varphi$ and $\varphi'$, where $\varphi, \varphi' \in \mathbb{R}^D$. Then for any $\delta \in (0,1)$ and $R > 0$,
$$\Pr\Big(\forall \varphi \in \mathbb{B}_D(R): \Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi)\Big| \le C + 2\big(1 + \sqrt{V_q(\varphi)}\big)\sqrt{D\ln(2tR) + \ln(1/\delta)}\Big) \ge 1 - \delta,$$
where $\mathbb{B}_D(R) \triangleq \{x \in \mathbb{R}^D : \|x\|_2 \le R\}$ and $V_q(\varphi) \triangleq \sum_{\tau \le t} q^2(u_\tau, \varphi)$.

Proof. Let $\varepsilon > 0$ be a small precision parameter to be specified later. Then
$$\Pr\Big(\exists \varphi \in \mathbb{B}_D(R): \Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi)\Big| > C\varepsilon t + \Delta\Big) \le \Pr\Big(\exists \varphi' \in \mathcal{H}: \Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi')\Big| > \Delta\Big) \le \sum_{\varphi' \in \mathcal{H}} \Pr\Big(\Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi')\Big| > \Delta\Big),$$
where the last inequality holds by the union bound. For any fixed $\varphi' \in \mathcal{H}$, $q(u_\tau, \varphi')$ only depends on $u_\tau$, and therefore $\mathbb{E}[\xi_\tau \mid q(u_\tau, \varphi')] = 0$ for all $\tau$. Invoking Lemma 7 with $X_k \triangleq \sum_{\tau \le k} \xi_\tau q(u_\tau, \varphi')$ and $c_k = |q(u_k, \varphi')|$, we have
$$\Pr\Big(\Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi')\Big| > \Delta\Big) \le 2\exp\Big(-\frac{\Delta^2}{2\sum_{\tau \le t} q^2(u_\tau, \varphi')}\Big) = 2\exp\Big(-\frac{\Delta^2}{2V_q(\varphi')}\Big).$$
Equating the right-hand side of the above inequality with $\delta'$ and combining with the union bound application, we have
$$\Pr\Big(\exists \varphi \in \mathbb{B}_D(R): \Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi)\Big| > C\varepsilon t + \sqrt{2DV_q(\varphi')\ln(2R/\varepsilon)} + \sqrt{2V_q(\varphi')\ln(1/\delta)}\Big) \le \delta.$$
Finally, as $|q(u_\tau, \varphi') - q(u_\tau, \varphi)| \le C\varepsilon$, we have $V_q(\varphi') \le 2V_q(\varphi) + 2C^2 t\varepsilon^2$ and so
$$\Pr\Big(\exists \varphi \in \mathbb{B}_D(R): \Big|\sum_{\tau=1}^t \xi_\tau q(u_\tau, \varphi)\Big| > C\varepsilon t + 2\varepsilon\sqrt{Dt\ln(2R/\varepsilon\delta)} + 2\sqrt{V_q(\varphi)\big(D\ln(2R/\varepsilon) + \ln(1/\delta)\big)}\Big) \le \delta.$$
Setting $\varepsilon = 1/t$ in the above inequality completes the proof.
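As a sanity check on Lemma 7, one can simulate a bounded-increment martingale and compare its empirical tail with the Azuma bound. The path length, increment bound, and failure level below are illustrative choices:

```python
import numpy as np

# Empirical illustration of Azuma's inequality: bounded-increment martingales
# concentrate at scale sqrt(sum_k c_k^2).
rng = np.random.default_rng(4)
n, c, trials = 1000, 1.0, 2000
delta = 0.01
# Solving 2 exp(-eps^2 / (2 n c^2)) = delta for eps:
eps = np.sqrt(2 * n * c**2 * np.log(2 / delta))

# A martingale with |X_k - X_{k-1}| <= c: partial sums of independent +-c signs.
paths = c * rng.choice([-1.0, 1.0], size=(trials, n)).sum(axis=1)
failure_rate = np.mean(np.abs(paths) >= eps)
assert failure_rate <= delta         # empirically well below the bound
print(f"empirical P(|X_n| >= eps) = {failure_rate:.4f} <= {delta}")
```

The empirical tail is far below $\delta$, as expected, since Azuma's inequality is not tight for independent signs.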



This is also mentioned as a remark in Jin et al. (2019). The description of the algorithm looks quite different from that of Azar et al. (2017), but via an equivalence between model-free methods with experience replay and model-based methods (Fujimoto et al., 2018), they are indeed quite similar. We use $\tilde{O}(\cdot)$ to suppress factors of $M, K, \kappa, \Gamma$ and any logarithmic dependencies on the arguments. Related results appear elsewhere in the literature focusing on the tabular setting; see, e.g., Simchowitz & Jamieson (2019).



To facilitate a detailed comparison to the recent work of Yang & Wang (2019) and Jin et al. (2019), we define the linear MDP model studied in the latter work.

Definition 2. An MDP is said to be a linear MDP if there exist a known feature map $\psi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$, unknown signed measures $\mu : \mathcal{S} \to \mathbb{R}^d$, and an unknown vector $\eta \in \mathbb{R}^d$ such that (1) $P(s' \mid s, a) = \langle \psi(s,a), \mu(s') \rangle$ holds for all states $s, s'$ and actions $a$, and (2) $\mathbb{E}[r \mid s, a] = \langle \psi(s,a), \eta \rangle$.
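The defining property of a linear MDP is that Bellman backups of arbitrary functions stay linear in $\psi$: $\mathcal{T}(Q)(s,a) = \langle \psi(s,a),\ \eta + \sum_{s'} \mu(s') \max_{a'} Q(s',a') \rangle$. A toy numerical sketch of this closure property (all sizes, and the particular construction of a valid factorized transition kernel, are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
S, A, d = 20, 4, 6

# Build a valid linear MDP: mu(s') are the columns of Mu, and each feature
# row is rescaled so that <psi(s,a), sum_s' mu(s')> = 1 (rows of P sum to 1).
Mu = rng.random((d, S))
m = Mu.sum(axis=1)
raw = rng.random((S * A, d))
Psi = raw / (raw @ m)[:, None]
P = Psi @ Mu                          # P[(s,a), s'] = <psi(s,a), mu(s')>
assert np.allclose(P.sum(axis=1), 1.0) and (P >= 0).all()

# Bellman backup of an ARBITRARY Q is linear in psi:
# T(Q)(s,a) = <psi(s,a), eta> + sum_s' P(s'|s,a) V(s') = <psi(s,a), eta + Mu V>
eta = rng.random(d)
Q = rng.random((S, A))
V = Q.max(axis=1)
target = Psi @ eta + P @ V
w = eta + Mu @ V                      # the claimed linear parameter
assert np.allclose(Psi @ w, target)
print("Bellman backup stays linear in the features of a linear MDP")
```

This exact closure under backups for every function is what the linear MDP assumption buys, and it is the property that optimistic closure relaxes.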

) = 0, and we deterministically transition from s

Finally, we upper bound $\ln|\mathcal{V}_\varepsilon|$. By definition, we have that $\ln|\mathcal{V}_\varepsilon| \le \ln|\Theta_{\varepsilon'}| + \ln|\Gamma_{\varepsilon'}| + \ln|\mathcal{M}_{\varepsilon'}|$. Furthermore, standard covering number bounds reveal that $\ln|\Theta_{\varepsilon'}| \le d\ln(2/\varepsilon')$, $\ln|\Gamma_{\varepsilon'}| \le \ln(1/\varepsilon')$ and $\ln|\mathcal{M}_{\varepsilon'}| \le d^2\ln(2/\varepsilon')$. Plugging in the definition of $\varepsilon'$ yields the result.

i.e., $\mathbb{E}[X_{k+1} \mid X_1, \cdots, X_k] = X_k$, and satisfy $|X_k - X_{k-1}| \le c_k$ almost surely. Then for any $\varepsilon > 0$, $\Pr\big(|X_n - X_0| \ge \varepsilon\big) \le 2\exp\big(-\varepsilon^2/(2\sum_{k=1}^n c_k^2)\big)$.

Let $\mathcal{H} \subseteq \mathbb{B}_D(R)$ be a finite $\varepsilon$-covering of $\mathbb{B}_D(R)$ such that $\sup_{x \in \mathbb{B}_D(R)} \min_{z \in \mathcal{H}} \|x - z\|_2 \le \varepsilon$. Using standard covering number arguments, such a covering exists with $\ln|\mathcal{H}| \le D\ln(2R/\varepsilon)$. For any $\varphi \in \mathbb{B}_D(R)$, let $\varphi' \triangleq \arg\min_{z \in \mathcal{H}} \|\varphi - z\|_2$. By definition, $\|\varphi - \varphi'\|_2 \le \varepsilon$. This implies $\big|\sum_{\tau=1}^t \xi_\tau [q(u_\tau, \varphi) - q(u_\tau, \varphi')]\big| \le C\varepsilon t$ because $|\xi_\tau| \le 1$ almost surely. Subsequently, for any $\Delta > 0$, $\Pr\big(\exists \varphi \in \mathbb{B}_D(R): \big|\sum_{\tau=1}^t$
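The $\varepsilon$-net existence claim can be illustrated with an explicit grid construction in two dimensions. Note this is only a sketch: a grid does give a covering of $\mathbb{B}_2(1)$, but with constants different from the volumetric $D\ln(2R/\varepsilon)$ bound used in the proof:

```python
import numpy as np
from itertools import product

# Grid-based epsilon-net of the unit ball in R^2 (D = 2, R = 1).
eps = 0.2
step = eps / np.sqrt(2)              # spacing so each grid cell has radius <= eps/2
grid = np.arange(-1, 1 + step, step)
net = np.array([p for p in product(grid, grid)
                if np.linalg.norm(p) <= 1 + eps])

# Check the covering property against random points of the ball.
rng = np.random.default_rng(6)
pts = rng.normal(size=(2000, 2))
pts /= np.maximum(1.0, np.linalg.norm(pts, axis=1, keepdims=True))
dists = np.linalg.norm(pts[:, None, :] - net[None, :, :], axis=2).min(axis=1)
assert dists.max() <= eps
print(f"net of size {len(net)} covers B_2(1) at scale {eps}")
```

Every sampled point is within $\varepsilon$ of the net, which is the only property the union-bound step of the proof needs.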

$$\Pr\Big(\exists \varphi' \in \mathcal{H}: \Big|\sum_{\tau} \xi_\tau q(u_\tau, \varphi')\Big| > C\varepsilon t + \sqrt{2V_q(\varphi')\ln(2/\delta')}\Big) \le \delta'|\mathcal{H}|. \tag{12}$$
Further equating $\delta' = \delta/|\mathcal{H}|$ and using the fact that $\ln|\mathcal{H}| \le D\ln(2R/\varepsilon)$, we have $\Pr\big(\exists \varphi \in \mathbb{B}_D(R):$

Lin F. Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. arXiv preprint arXiv:1905.10389, 2019.

Andrea Zanette, David Brandfonbrener, Matteo Pirotta, and Alessandro Lazaric. Frequentist regret bounds for randomized least-squares value iteration. arXiv preprint arXiv:1911.00567, 2019.

