OPTIMISTIC POLICY OPTIMIZATION WITH GENERAL FUNCTION APPROXIMATIONS

Abstract

Although policy optimization with neural networks has a track record of achieving state-of-the-art results in reinforcement learning across various domains, the theoretical understanding of the computational and sample efficiency of policy optimization remains restricted to linear function approximations with finite-dimensional feature representations, which hinders the design of principled, effective, and efficient algorithms. To this end, we propose an optimistic model-based policy optimization algorithm, which allows general function approximations while incorporating exploration. In the episodic setting, we establish a $\sqrt{T}$-regret that scales polynomially in the eluder dimension of the general model class. Here $T$ is the number of steps taken by the agent. In particular, we specialize such a regret to handle two nonparametric model classes: one based on reproducing kernel Hilbert spaces and another based on overparameterized neural networks.

1. INTRODUCTION

Reinforcement learning with neural networks has achieved impressive empirical breakthroughs (Mnih et al., 2015; Silver et al., 2016; 2017; Berner et al., 2019; Vinyals et al., 2019). These algorithms are often based on policy optimization (Williams, 1992; Baxter & Bartlett, 2000; Sutton et al., 2000; Kakade, 2002; Schulman et al., 2015; 2017). Compared with value-based approaches, which iteratively estimate the optimal value function, policy-based approaches directly optimize the expected total reward, which leads to steadier policy improvement. In particular, as shown in this paper, policy optimization generates steadily improving stochastic policies and consequently accommodates adversarially chosen reward functions. On the other hand, policy optimization often suffers from a lack of computational and statistical efficiency in practice, which calls for the principled design of efficient algorithms. Specifically, in terms of computational efficiency, recent progress (Abbasi-Yadkori et al., 2019a; b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019) establishes the convergence of policy optimization to a globally optimal policy given sufficiently many data points, even in the presence of neural networks. However, in terms of sample efficiency, it remains less understood how to sequentially acquire the data points used in policy optimization while balancing exploration and exploitation, especially in the presence of neural networks, despite the recent progress (Cai et al., 2019; Agarwal et al., 2020). In particular, such a lack of sample efficiency prohibits the principled application of policy optimization in critical domains, e.g., autonomous driving and dynamic treatment, where data acquisition is expensive. In this paper, we aim to provably achieve sample efficiency in model-based policy optimization, which is quantified through the lens of regret.
In particular, we focus on the episodic setting with general function approximations of the transition kernel. Such a setting is studied by Russo & Van Roy (2013; 2014); Osband & Van Roy (2014); Ayoub et al. (2020); Wang et al. (2020), which, however, focus on value iteration. In contrast, policy optimization remains less understood, despite its critical role in practice. To this end, we propose an optimistic policy optimization algorithm, which achieves exploration by incorporating optimism into policy evaluation and propagating it through policy improvement. In particular, we establish a $\kappa(\mathcal{P}) \cdot \sqrt{H^3 T}$-regret for the proposed algorithm, which matches that of existing value iteration algorithms but additionally allows the reward function to vary adversarially across episodes. Here $T$ is the number of steps, $H$ is the length of each episode, and $\kappa(\mathcal{P})$ is the model capacity, which is defined based on the eluder dimension. Moreover, we instantiate the proposed algorithm for the special cases of reproducing kernel Hilbert spaces and overparameterized neural networks, both of which are infinite-dimensional model classes. Our work is related to the study of the computational efficiency of policy optimization (Fazel et al., 2018; Yang et al., 2019; Abbasi-Yadkori et al., 2019a; b; Bhandari & Russo, 2019; Liu et al., 2019; Agarwal et al., 2019; Wang et al., 2019). These works assume either that the transition model is known or that there exists a well-explored behavior policy such that the policy update direction can be estimated accurately. Under such assumptions, the tradeoff between exploration and exploitation is absent, and their focus is solely on the computational aspect. In addition, our work is related to the works on adversarial MDPs (Even-Dar et al., 2009; Yu et al., 2009; Neu et al., 2010a; b; Zimin & Neu, 2013; Neu et al., 2012; Rosenberg & Mansour, 2019b; a).
The algorithms in these works directly estimate the visitation measure and utilize mirror descent to handle adversarial reward functions. Furthermore, our work is closely related to the recent work on the sample complexity of policy optimization by Cai et al. (2019), which only focuses on the tabular and linear settings. In contrast, our work considers the general function approximation setting, which is significantly more general. Moreover, our construction of optimistic policy evaluation is related to Ayoub et al. (2020), where a similar approach is incorporated into estimating the optimal value function. The theoretical foundation of this type of optimistic estimation was pioneered by Russo & Van Roy (2014) in the bandit problem. In particular, to characterize the optimism and accuracy of the optimistic evaluation, we rely on the notion of the eluder dimension proposed by Russo & Van Roy (2014), which this paper further instantiates for kernel and neural function approximations.

1.1. NOTATIONS

We denote by $\|\cdot\|_p$ the $\ell_p$-norm of a vector when $p \in \mathbb{N}$ or the spectral norm of a matrix when $p = 2$. For any two distributions $p_1, p_2$ over a discrete set $\mathcal{A}$, we denote by $D_{\mathrm{KL}}(p_1 \,\|\, p_2)$ the KL-divergence
$$D_{\mathrm{KL}}(p_1 \,\|\, p_2) = \sum_{a \in \mathcal{A}} p_1(a) \log \frac{p_1(a)}{p_2(a)}.$$
For any $a, b, x \in \mathbb{R}$, we define the clamp function
$$\mathrm{clamp}(x, a, b) = \begin{cases} b, & \text{if } x > b, \\ x, & \text{if } a \le x \le b, \\ a, & \text{if } x < a. \end{cases} \tag{1.1}$$
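As a sanity check, the clamp function in (1.1) is simply a three-case projection of $x$ onto the interval $[a, b]$; a one-line sketch:

```python
def clamp(x, a, b):
    """clamp(x, a, b) from (1.1): returns b if x > b, a if x < a, and x otherwise."""
    return max(a, min(x, b))
```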

2.1. ONLINE REINFORCEMENT LEARNING WITH ADVERSARIAL REWARDS

We consider an episodic MDP $(\mathcal{S}, \mathcal{A}, H, \{P_h\}_{h=1}^H, \{r_h\}_{h=1}^H)$, where $\mathcal{S}$ is a continuous state space, $\mathcal{A}$ is a discrete action space, $H$ is the number of steps in each episode, $\{P_h\}_{h=1}^H$ are the unknown transition kernels, and $\{r_h\}_{h=1}^H$ are the reward functions. In particular, for any $h \in [H]$, $P_h$ is the transition kernel from a state-action pair $(s_h, a_h)$ at the $h$-th step to the next state $s_{h+1}$, while $r_h$ is the reward function at the $h$-th step, which maps a state-action pair to a deterministic reward. Moreover, we allow the reward function to vary across episodes and denote by $r^k_h$ the reward function at the $h$-th step of the $k$-th episode. In particular, $r^k_h$ depends on the trajectories before the $k$-th episode begins, possibly in an adversarial manner, and remains unobservable until the $k$-th episode ends. Without loss of generality, we assume each episode starts from a fixed state $s_1$ and all rewards fall in the interval $[0, 1]$. For any $h \in [H]$, a policy $\pi_h$ is the conditional distribution of the action given the state at the $h$-th step. We drop the subscript $h$ to denote the collection of policies at all steps and still refer to such a collection as a policy when it is clear from the context. For any $(k, h) \in \mathbb{N} \times [H]$, given a policy $\pi$ and reward functions $\{r^k_h\}_{h=1}^H$, the value function and Q-function at the $h$-th step of the $k$-th episode are defined by
$$V^{\pi,k}_h(s) = \mathbb{E}_\pi\Big[\sum_{j=h}^H r^k_j(s_j, a_j) \,\Big|\, s_h = s\Big], \qquad Q^{\pi,k}_h(s, a) = \mathbb{E}_\pi\Big[\sum_{j=h}^H r^k_j(s_j, a_j) \,\Big|\, s_h = s,\, a_h = a\Big]$$
for any $(s, a) \in \mathcal{S} \times \mathcal{A}$. Here the subscript $\pi$ in the expectation $\mathbb{E}_\pi[\cdot]$ indicates that all actions are taken according to the policy $\pi$, except for those given in the condition. An online algorithm aims to construct and execute a sequence of policies $\{\pi^k\}_{k \ge 1}$ that minimizes the regret
$$\mathrm{Regret}(T) = \max_\pi \sum_{k=1}^K \big[V^{\pi,k}_1(s_1) - V^{\pi^k,k}_1(s_1)\big], \tag{2.1}$$
where $K$ is the number of episodes and $T = KH$ is the number of steps taken by the algorithm.
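On a finite MDP, the value function and Q-function defined above can be computed exactly by backward induction. The following is a minimal numpy sketch (the array layout and function name are ours, for illustration only):

```python
import numpy as np

def evaluate_policy(P, r, pi):
    """Compute V^{pi}_h and Q^{pi}_h on a finite MDP by backward induction.

    P:  (H, S, A, S) transition probabilities P[h, s, a, s']
    r:  (H, S, A)    deterministic rewards in [0, 1]
    pi: (H, S, A)    pi[h, s, a] = probability of action a in state s at step h
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))                    # V_{H+1} = 0 by convention
    Q = np.zeros((H, S, A))
    for h in range(H - 1, -1, -1):
        Q[h] = r[h] + P[h] @ V[h + 1]           # Q_h = r_h + P_h V_{h+1}
        V[h] = np.sum(pi[h] * Q[h], axis=-1)    # V_h(s) = <Q_h(s, .), pi_h(. | s)>
    return V, Q
```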

2.2. REPRODUCING KERNEL HILBERT SPACE

We say $\mathcal{H}$ is a reproducing kernel Hilbert space (RKHS) on a set $\mathcal{Y}$ with the reproducing kernel $K: \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ if there exists an inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$ such that, for any $f \in \mathcal{H}$ and $x \in \mathcal{Y}$, we have $f(x) = \langle f, K_x\rangle_{\mathcal{H}}$. Here $K_x$ denotes the function $K(x, \cdot)$, which is the Riesz representation of the evaluation functional at $x$ (Schölkopf et al., 2002). When the reproducing kernel $K$ is continuous, symmetric, and positive definite, Mercer's theorem (Steinwart & Christmann, 2008) gives the representation
$$K(x, y) = \sum_{j=1}^\infty \lambda_j \phi_j(x)\phi_j(y), \quad \text{for any } x, y \in \mathcal{Y}, \tag{2.2}$$
where $\{\phi_j\}_{j=1}^\infty$ is an orthonormal basis of $L^2(\mathcal{Y})$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. See Section A for more details on RKHSs.
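On a finite grid, Mercer's representation can be approximated by eigendecomposing the Gram matrix of the kernel. The following numpy sketch (with an RBF kernel and an arbitrarily chosen bandwidth) illustrates the truncated expansion $K(x, y) \approx \sum_{j \le d_0} \lambda_j \phi_j(x)\phi_j(y)$ and the fast eigenvalue decay:

```python
import numpy as np

# Eigendecompose the Gram matrix of an RBF kernel on a grid and verify the
# truncated Mercer representation K ≈ sum_{j<=d0} lam_j phi_j phi_j^T.
n, d0 = 50, 30
x = np.linspace(0.0, 1.0, n)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)   # Gram matrix K(x_i, x_j)
lam, U = np.linalg.eigh(K)            # eigenvalues in ascending order
lam, U = lam[::-1], U[:, ::-1]        # reorder so that lam_1 >= lam_2 >= ...
K_trunc = (U[:, :d0] * lam[:d0]) @ U[:, :d0].T
err = np.abs(K - K_trunc).max()       # truncation error; decays rapidly in d0
```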

3. ALGORITHM

Framework: Before the $k$-th episode begins, we construct the policy $\pi^k$ based on $\pi^{k-1}$ and $\{Q^{k-1}_h\}_{h=1}^H$, which are the policy in the $(k-1)$-th episode and the estimators of $\{Q^{\pi^{k-1},k-1}_h\}_{h=1}^H$, respectively. Then, we execute the policy $\pi^k$ in the $k$-th episode and correspondingly update the Q-function estimators $\{Q^k_h\}_{h=1}^H$ using the reward functions $\{r^k_h\}_{h=1}^H$, which are observed after the $k$-th episode ends.

Policy Improvement: For any $(k, h) \in [K] \times [H]$, we parametrize $\pi^k_h$ by
$$\pi^k_h(a \mid s) = \frac{\exp\{E^k_h(s, a)\}}{\sum_{a' \in \mathcal{A}} \exp\{E^k_h(s, a')\}}, \quad \text{for any } (s, a) \in \mathcal{S} \times \mathcal{A}. \tag{3.1}$$
Here $E^k_h$ is the potential function, which is initialized as the zero function and updated by
$$E^k_h(s, a) = E^{k-1}_h(s, a) + \alpha \cdot Q^{k-1}_h(s, a). \tag{3.2}$$
Here $\alpha > 0$ is the stepsize of policy improvement. Equivalently, we have $\pi^k_h(\cdot \mid s) \propto \pi^{k-1}_h(\cdot \mid s) \cdot \exp\{\alpha \cdot Q^{k-1}_h(s, \cdot)\}$ for any $s \in \mathcal{S}$. To see that (3.2) is a policy improvement step, note that $\pi^k_h$ is the maximizer of
$$L^k_h(\pi_h) = \mathbb{E}_{\pi^{k-1}}\Big[\big\langle Q^{k-1}_h(s_h, \cdot), \pi_h(\cdot \mid s_h)\big\rangle_{\mathcal{A}} - \alpha^{-1} \cdot D_{\mathrm{KL}}\big(\pi_h(\cdot \mid s_h) \,\|\, \pi^{k-1}_h(\cdot \mid s_h)\big)\Big].$$
This is the same as the update in Politex (Abbasi-Yadkori et al., 2019a), which originates from MDP-E (Even-Dar et al., 2009), and is closely related to a one-step variant of the objective in the proximal policy optimization (PPO) algorithm (Schulman et al., 2015; 2017).

Policy Evaluation: Let $\mathcal{P}$ be a known class of transition models such that $P_h \in \mathcal{P}$ for any $h \in [H]$, which is specified in Section 4. Also, for any $P \in \mathcal{P}$, $s \in \mathcal{S}$, $a \in \mathcal{A}$, and $V: \mathcal{S} \to [0, H]$, we define
$$z_P(s, a, V) = \int_{\mathcal{S}} V(s') \cdot P(s' \mid s, a)\, \mathrm{d}s'. \tag{3.3}$$
For any $(k, h) \in [K] \times [H]$, we construct a confidence set for the transition kernel $P_h$ and correspondingly the optimistic Q-function estimator $Q^k_h$ using the data collected before the $k$-th episode begins. Note that we do not use the data collected within the $k$-th episode even though they are available; this choice only serves to simplify the analysis. Let $V^k_{H+1}$ be the zero function.
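The equivalence between the potential parametrization (3.1)-(3.2) and the multiplicative update $\pi^k_h(\cdot \mid s) \propto \pi^{k-1}_h(\cdot \mid s) \cdot \exp\{\alpha \cdot Q^{k-1}_h(s, \cdot)\}$ can be checked numerically at a fixed state; a minimal sketch (all variable names are ours):

```python
import numpy as np

def softmax(E):
    """Numerically stable softmax over the action axis, as in (3.1)."""
    z = np.exp(E - E.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
alpha, A = 0.5, 4
Q_prev = rng.uniform(0.0, 1.0, size=A)        # Q^{k-1}_h(s, .) at a fixed state s
E_prev = rng.uniform(-1.0, 1.0, size=A)       # potential E^{k-1}_h(s, .)
pi_prev = softmax(E_prev)                     # pi^{k-1}_h(. | s) via (3.1)

pi_via_potential = softmax(E_prev + alpha * Q_prev)   # (3.1) after the update (3.2)
mult = pi_prev * np.exp(alpha * Q_prev)
pi_via_mw = mult / mult.sum()                 # pi^k ∝ pi^{k-1} exp(alpha Q^{k-1})
```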
Inspired by Ayoub et al. (2020), given the optimistic value function estimators $\{V^\tau_{h+1}\}_{\tau=1}^{k-1}$ from the first $(k-1)$ episodes, we construct the confidence set $\mathcal{P}^k_h$ for $P_h$ by
$$\mathcal{P}^k_h = \Big\{P \in \mathcal{P} : \sum_{\tau=1}^{k-1} \big[z_P(s^\tau_h, a^\tau_h, V^\tau_{h+1}) - z_{\widehat{P}^k_h}(s^\tau_h, a^\tau_h, V^\tau_{h+1})\big]^2 \le \beta\Big\}, \quad \text{where} \quad \widehat{P}^k_h = \operatorname*{argmin}_{P \in \mathcal{P}} \sum_{\tau=1}^{k-1} \big[V^\tau_{h+1}(s^\tau_{h+1}) - z_P(s^\tau_h, a^\tau_h, V^\tau_{h+1})\big]^2 \tag{3.4}$$
for a threshold $\beta > 0$, which quantifies the degree of optimism. Then, for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, given the optimistic value function estimator $V^k_{h+1}$, we define the optimistic Q-function estimator $Q^k_h$ by
$$Q^k_h(s, a) = r^k_h(s, a) + \max_{P \in \mathcal{P}^k_h} z_P(s, a, V^k_{h+1}) \tag{3.5}$$
and correspondingly update the optimistic value function estimator by $V^k_h(s) = \langle Q^k_h(s, \cdot), \pi^k_h(\cdot \mid s)\rangle_{\mathcal{A}}$ for any $s \in \mathcal{S}$. We apply the clamp function defined in (1.1) to the second term on the right-hand side of (3.5) to ensure it falls in the range $[0, H - h]$, which follows from the assumption that all rewards fall in the range $[0, 1]$.
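For a finite state space and a finite candidate model class, the policy evaluation step above, i.e., the least-squares fit, the confidence set, and the optimistic backup (3.5), can be sketched as follows (a toy illustration; the function and variable names are ours):

```python
import numpy as np

def optimistic_backup(models, data, V_next, beta):
    """One optimistic policy-evaluation backup over a finite model class.

    models: (M, S, A, S) candidate transition kernels (the class P)
    data:   list of (s, a, s_next, V) tuples from earlier episodes; V is an
            array (S,) playing the role of the value target V^tau_{h+1}
    V_next: (S,), the current V^k_{h+1}
    beta:   confidence threshold
    Returns max_{P in P^k_h} z_P(s, a, V_next) for all (s, a), as in (3.5).
    """
    # squared prediction error of each candidate on past value targets
    loss = np.zeros(len(models))
    for (s, a, s_next, V) in data:
        pred = models[:, s, a, :] @ V                    # z_P(s, a, V) for each P
        loss += (V[s_next] - pred) ** 2
    P_hat = models[np.argmin(loss)]                      # least-squares fit
    # confidence set (3.4): candidates whose predictions stay beta-close
    dev = np.zeros(len(models))
    for (s, a, s_next, V) in data:
        dev += (models[:, s, a, :] @ V - P_hat[s, a] @ V) ** 2
    in_set = dev <= beta
    # optimistic backup: maximize z_P(s, a, V_next) over the confidence set
    z = models @ V_next                                  # (M, S, A)
    return z[in_set].max(axis=0)
```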

Implementation:

The full algorithm is presented in Algorithm 1. Given a parametrization of the model class $\mathcal{P}$, we can apply the projected stochastic gradient descent (PSGD) algorithm to solve the constrained minimization problem in Line 11 of Algorithm 1. In particular, for the kernel function approximations in Section 4.1, it reduces to a convex optimization problem, which allows the PSGD algorithm to converge to a global minimizer. Meanwhile, for neural function approximations, it reduces to an approximately convex problem in the overparametrized regime (Arora et al., 2019), which leads to the same global convergence guarantee. Also, to implement Lines 12 and 13 of Algorithm 1, it suffices to solve a constrained maximization problem (Feng et al., 2020), where the constraint is defined in Line 12. The Lagrangian relaxation of such a constrained maximization problem can be solved by the PSGD algorithm in the same manner as Line 11. In addition, to instantiate the update of $Q^k_h$ in Line 13, it suffices to solve a least-squares regression problem. In summary, we can instantiate the aforementioned steps through supervised learning oracles, which can be implemented in a computationally efficient manner.
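A minimal illustration of the projection step in PSGD, under the assumption that the feasible set is a Euclidean ball around an initialization (as in the neural parametrization of Section 4.2; the function name and defaults are ours):

```python
import numpy as np

def psgd_ball(grad, w0, R, steps=200, eta=0.1, seed=0):
    """Projected (stochastic) gradient descent over the ball {w : ||w - w0||_2 <= R},
    a feasible set of the kind arising in the constrained fit of Line 11."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(steps):
        w = w - eta * grad(w, rng)      # (possibly stochastic) gradient step
        d = w - w0
        n = np.linalg.norm(d)
        if n > R:                       # Euclidean projection back onto the ball
            w = w0 + (R / n) * d
    return w
```

For instance, minimizing $\|w - w^\star\|_2^2$ with $w^\star$ outside the ball converges to the projection of $w^\star$ onto the ball.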

4. THEORY

We analyze the regret of Algorithm 1, which is defined in (2.1). In Sections 4.1 and 4.2, we characterize the regret for specific choices of the model class $\mathcal{P}$, while in Section 4.3, we characterize the regret for a general $\mathcal{P}$, which serves as a meta result. An informal version of the results is given in the following theorem.

Algorithm 1 Optimistic Policy Optimization with General Function Approximations
1: Input: number of episodes $K$, model class $\mathcal{P}$, stepsize $\alpha$, threshold $\beta$
2: Initialize $\pi^0$ as the uniformly random policy
3: for $k = 1$ to $K$ do
4:   Start the $k$-th episode and receive the initial state $s^k_1$
5:   for step $h = 1$ to $H$ do (policy improvement)
6:     Update the policy by $\pi^k_h(\cdot \mid s) \propto \pi^{k-1}_h(\cdot \mid s) \exp\{\alpha Q^{k-1}_h(s, \cdot)\}$ for any $s \in \mathcal{S}$
7:     Take the action $a^k_h \sim \pi^k_h(\cdot \mid s^k_h)$
⋮
10:  for step $h = H$ to $1$ do (policy evaluation)
11:    $\widehat{P}^k_h \leftarrow \operatorname{argmin}_{P \in \mathcal{P}} \sum_{\tau=1}^{k-1} \big(V^\tau_{h+1}(s^\tau_{h+1}) - \int_{\mathcal{S}} V^\tau_{h+1}(s') P(s' \mid s^\tau_h, a^\tau_h)\, \mathrm{d}s'\big)^2$
12:    $\mathcal{P}^k_h \leftarrow \big\{P \in \mathcal{P} : \sum_{\tau=1}^{k-1} \big(\int_{\mathcal{S}} V^\tau_{h+1}(s') P(s' \mid s^\tau_h, a^\tau_h)\, \mathrm{d}s' - \int_{\mathcal{S}} V^\tau_{h+1}(s') \widehat{P}^k_h(s' \mid s^\tau_h, a^\tau_h)\, \mathrm{d}s'\big)^2 \le \beta\big\}$
13:    $Q^k_h(\cdot, \cdot) \leftarrow r^k_h(\cdot, \cdot) + \mathrm{clamp}\big(\max_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') P(s' \mid \cdot, \cdot)\, \mathrm{d}s', 0, H - h\big)$
14:    $V^k_h(\cdot) \leftarrow \langle Q^k_h(\cdot, \cdot), \pi^k_h(\cdot \mid \cdot)\rangle_{\mathcal{A}}$

Theorem 4.1 (Informal). With proper choices of $\alpha$ and $\beta$, Algorithm 1 satisfies $\mathrm{Regret}(T) \le O\big(\kappa(\mathcal{P}) \cdot \sqrt{H^3 T}\big)$ with high probability, where $\kappa(\mathcal{P})$ is the model capacity of the model class $\mathcal{P}$.

Theorem 4.1 indicates that, compared with the optimal policy in hindsight, namely $\operatorname{argmax}_\pi \sum_{k=1}^K V^{\pi,k}_1(s_1)$, the average regret of Algorithm 1, namely $\mathrm{Regret}(T)/T$, converges to zero at a sublinear rate. In other words, at least one of the $K$ policies attained by Algorithm 1 achieves a vanishing optimality gap with respect to the varying reward functions across the $K$ episodes. The model capacity $\kappa(\mathcal{P})$ is specified in Sections 4.1 and 4.2 for kernel and neural function approximations, respectively. To establish such specific results, we characterize $\kappa(\mathcal{P})$ via the eluder dimension for general function approximations in Section 4.3, which serves as the unified analysis.
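The loop of Algorithm 1 can be instantiated end-to-end on a tiny finite MDP with a finite candidate model class. The following is a toy sketch (all helper names are ours; $h$ is 0-indexed here, so the paper's clamp range $[0, H - h]$ becomes $[0, H - h - 1]$):

```python
import numpy as np

def oppo_finite(models, true_P, r, K, alpha, beta, seed=0):
    """Run the loop of Algorithm 1 on a finite MDP with a finite model class.

    models: (M, H, S, A, S) candidate transition kernels (the class P)
    true_P: (H, S, A, S) the environment's kernel, assumed to lie in models
    r:      (K, H, S, A) per-episode rewards, which may vary across episodes
    Returns (V1, pi): optimistic estimates V^k_1(s_1) and the final policy.
    """
    M, H, S, A, _ = models.shape
    pi = np.full((H, S, A), 1.0 / A)              # pi^0: uniformly random
    Q = np.zeros((H, S, A))                       # Q^0 = 0
    data = [[] for _ in range(H)]                 # per-step tuples (s, a, s', V_{h+1})
    rng = np.random.default_rng(seed)
    V1 = []
    for k in range(K):
        # policy improvement (Line 6): pi^k ∝ pi^{k-1} exp(alpha Q^{k-1})
        pi = pi * np.exp(alpha * Q)
        pi /= pi.sum(axis=-1, keepdims=True)
        # execute pi^k (Lines 4-7) and record the trajectory
        s, traj = 0, []
        for h in range(H):
            a = rng.choice(A, p=pi[h, s])
            s_next = rng.choice(S, p=true_P[h, s, a])
            traj.append((h, s, a, s_next))
            s = s_next
        # optimistic policy evaluation (Lines 10-14), backward in h
        V = np.zeros((H + 1, S))                  # V^k_{H+1} = 0
        for h in range(H - 1, -1, -1):
            loss = np.zeros(M)                    # least-squares fit (Line 11)
            for (ss, aa, sn, Vh) in data[h]:
                loss += (Vh[sn] - models[:, h, ss, aa, :] @ Vh) ** 2
            P_hat = models[np.argmin(loss), h]
            dev = np.zeros(M)                     # confidence set (Line 12)
            for (ss, aa, sn, Vh) in data[h]:
                dev += (models[:, h, ss, aa, :] @ Vh - P_hat[ss, aa] @ Vh) ** 2
            in_set = dev <= beta
            z = (models[in_set, h] @ V[h + 1]).max(axis=0)   # optimism (Line 13)
            Q[h] = r[k, h] + np.clip(z, 0.0, H - h - 1)      # clamp, 0-indexed h
            V[h] = (pi[h] * Q[h]).sum(axis=-1)               # Line 14
        for (h, ss, aa, sn) in traj:              # store data for later episodes
            data[h].append((ss, aa, sn, V[h + 1].copy()))
        V1.append(V[0, 0])
    return np.array(V1), pi
```

On an instance whose reward always favors one action, the executed policy quickly concentrates on that action.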

4.1. KERNEL FUNCTION APPROXIMATIONS

Let $\mathcal{P}$ be a subset of an RKHS $\mathcal{H}$ with the reproducing kernel $K$, which has the representation in (2.2). In detail, let $\mathcal{S}$ be a measurable set with $|\mathcal{S}| \le 1$, where $|\cdot|$ denotes the Lebesgue measure. With a slight abuse of notation, we denote by $\mathcal{A}$ the embedding of the action space into a Euclidean space with dimension $|\mathcal{A}|$, where $|\mathcal{A}|$ denotes the number of actions. Meanwhile, let $\mathcal{Y}$ be a $d_{\mathcal{Y}}$-dimensional set such that $\mathcal{S} \times \mathcal{A} \times \mathcal{S} \subset \mathcal{Y}$. We assume there exists $R \ge 2$ such that $\mathcal{P} \subset \mathcal{H}_R$, where $\mathcal{H}_R$ is the RKHS ball over $\mathcal{Y}$ with radius $R$.

Assumption 4.2. We assume $K$ satisfies the following regularity conditions.
(i) It holds that $|K(x, y)| \le 1$, $|\phi_j(x)| \le 1$, and $\lambda_j \le 1$ for any $x, y \in \mathcal{Y}$ and $j \in \mathbb{N}$.
(ii) There exist a threshold $\gamma \in (0, 1/2)$ and absolute constants $C_1, C_2 > 0$ such that $\lambda_j \le C_1 \cdot \exp(-C_2 j^\gamma)$ for any $j \in \mathbb{N}$.

Note that we can replace the 1's in the upper bounds of Assumption 4.2 with any absolute constant, which is reflected in the $\mathcal{H}$-norm of any function in $\mathcal{H}$. Meanwhile, we can relax $|\phi_j(x)| \le 1$ to $|\lambda_j^\tau \cdot \phi_j(x)| \le 1$ for any absolute constant $\tau \in [0, 1/2)$, which leads to the same regret. We have the following result on the regret of Algorithm 1.

Theorem 4.3. Suppose Assumption 4.2 holds and $\mathcal{P} \subset \mathcal{H}_R$. There exist absolute constants $C_3, C_4 > 0$ such that, for any $p \in (0, 1)$, if we set $\alpha = \sqrt{2\log|\mathcal{A}|/(HT)}$ and $\beta = C_3 H^2 \cdot \log^{1+1/\gamma}(RT/p) \cdot \log^2(1/\gamma)/\gamma$ in Algorithm 1, then it holds that
$$\mathrm{Regret}(T) \le C_4 \sqrt{H^3 T} \cdot \log^{1+1/\gamma}(|\mathcal{A}|RT/p) \cdot \log^2(1/\gamma)/\gamma$$
with probability at least $1 - p$.

Proof. See Section C for a detailed proof.

Theorem 4.3 indicates that $\log^{1+1/\gamma}(|\mathcal{A}|RT/p) \cdot \log^2(1/\gamma)/\gamma$ serves as the model capacity $\kappa(\mathcal{P})$ in Theorem 4.1 for kernel function approximations. In particular, we can obtain $\gamma$ for a broad range of reproducing kernels $K$ (Srinivas et al., 2009). Meanwhile, we can scale $R$ to control the model capacity $\kappa(\mathcal{P})$ (Schölkopf et al., 2002).

4.2. NEURAL FUNCTION APPROXIMATIONS

Let $\mathcal{P}$ be a set of overparametrized neural networks. In detail, we denote by $\mathrm{NN}$ a neural network with its weights collected in a vector $w \in \mathbb{R}^m$. Let $w_0$ be the random initial weights. For a radius $R \ge 2$, we define
$$\mathcal{P} = \big\{P : \exists\, w \in B_R \text{ s.t. } P(s' \mid s, a) = \mathrm{NN}(x; w) \text{ for any } x = (s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S} \subset \mathcal{Y}\big\}, \quad \text{where } B_R = \{w \in \mathbb{R}^m : \|w - w_0\|_2 \le R\}. \tag{4.1}$$
Without loss of generality, we assume $\mathrm{NN}(x; w_0) = 0$ for any $x \in \mathcal{Y}$, which can be achieved by a symmetric initialization scheme. See Section E for a detailed explanation. To connect with the result for kernel function approximations in Section 4.1, we introduce the following condition.

Condition 4.4 (Implicit Linearization). It holds that
$$\xi_m = \max_{x \in \mathcal{Y},\, w \in B_R} \big|\mathrm{NN}(x; w) - \nabla_w \mathrm{NN}(x; w_0)^\top (w - w_0)\big| \le 1/(4K^{3/2}H).$$
Condition 4.4 states that $\mathrm{NN}(x; w)$ is uniformly close to the linear function $\nabla_w \mathrm{NN}(x; w_0)^\top(w - w_0)$ of $w$. In particular, the linearization error $\xi_m$ is negligible compared with the dominating terms in the regret. The following lemma ensures that Condition 4.4 holds for two-layer neural networks when $m$ is sufficiently large.

Lemma 4.5 (Overparametrization). Suppose $\mathrm{NN}$ is a two-layer neural network whose activation function is 1-smooth, and it holds that $\|x\|_2 \le 1$ for any $x \in \mathcal{Y}$. Then, Condition 4.4 holds when $m \ge d_{\mathcal{Y}} R^4 K^3 H^2$.

Proof. See Section E for a detailed proof.

Note that the analogue of Lemma 4.5 also applies to nonsmooth activation functions, for example, the rectified linear unit (ReLU), and to multilayer neural networks (Allen-Zhu et al., 2019; Du et al., 2019; Zou et al., 2020; Gao et al., 2019), which ensures Condition 4.4 holds. The linear function of $w$ in Condition 4.4 induces an RKHS $\mathcal{H}$ with the reproducing kernel
$$K_{\mathrm{NTK}}(x, y) = \nabla_w \mathrm{NN}(x; w_0)^\top \nabla_w \mathrm{NN}(y; w_0), \quad \text{for any } x, y \in \mathcal{Y}, \tag{4.2}$$
which is known as the neural tangent kernel (NTK) (Jacot et al., 2018).

Assumption 4.6. We assume $K_{\mathrm{NTK}}$ satisfies the regularity conditions in Assumption 4.2.
Note that the NTK defined in (4.2) depends on the randomness of $w_0$. When $m$ goes to infinity, such an empirical NTK converges to its expectation, which gives the population NTK. It is shown in Yang & Salman (2019) that Assumption 4.2 holds for the population NTK, which implies it also holds for the empirical NTK with high probability when $m$ is sufficiently large. We have the following result on the regret of Algorithm 1.

Theorem 4.7. Suppose Assumption 4.6 and Condition 4.4 hold and $\mathcal{P}$ has the representation in (4.1). There exist absolute constants $C_5, C_6 > 0$ such that, for any $p \in (0, 1)$, if we set $\alpha = \sqrt{2\log|\mathcal{A}|/(HT)}$ and $\beta = C_5 H^2 \cdot \log^{1+1/\gamma}(RT/p) \cdot \log^2(1/\gamma)/\gamma$ in Algorithm 1, then it holds that
$$\mathrm{Regret}(T) \le C_6 \sqrt{H^3 T} \cdot \log^{1+1/\gamma}(|\mathcal{A}|RT/p) \cdot \log^2(1/\gamma)/\gamma$$
with probability at least $1 - p$.

Proof. See Section D for a detailed proof.

In parallel with Theorem 4.3, Theorem 4.7 indicates that $\log^{1+1/\gamma}(|\mathcal{A}|RT/p) \cdot \log^2(1/\gamma)/\gamma$ serves as the model capacity $\kappa(\mathcal{P})$ in Theorem 4.1 for neural function approximations, which can be controlled by scaling $R$ (Arora et al., 2019).
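The empirical NTK (4.2) and the symmetric initialization $\mathrm{NN}(x; w_0) = 0$ can be checked numerically for a two-layer tanh network; a minimal sketch (the architecture and sizes are arbitrary, chosen only for illustration):

```python
import numpy as np

def ntk_gram(X, W0, a):
    """Empirical NTK (4.2) of f(x; W) = a^T tanh(W x) / sqrt(m), differentiating
    only through the hidden weights W at the initialization W0."""
    m = W0.shape[0]
    pre = X @ W0.T                                   # (n, m) preactivations
    s = 1.0 - np.tanh(pre) ** 2                      # tanh'(pre)
    # Jacobian rows: d f(x) / d W_i = a_i * tanh'(W0_i . x) * x / sqrt(m)
    J = (a[None, :] * s)[:, :, None] * X[:, None, :] / np.sqrt(m)
    J = J.reshape(X.shape[0], -1)
    return J @ J.T                                   # Gram matrix K_NTK(x_i, x_j)

rng = np.random.default_rng(0)
d, m, n = 3, 256, 5
W_half = rng.normal(size=(m // 2, d))
W0 = np.vstack([W_half, W_half])                         # symmetric initialization:
a = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])  # paired +/- output weights
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)            # ||x||_2 <= 1, cf. Lemma 4.5
K_ntk = ntk_gram(X, W0, a)
f0 = np.tanh(X @ W0.T) @ a / np.sqrt(m)                  # NN(x; w0) = 0 by symmetry
```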

4.3. GENERAL FUNCTION APPROXIMATIONS

Let $\mathcal{P}$ be a general model class, whose capacity is characterized by the eluder dimension (Russo & Van Roy, 2014; Osband & Van Roy, 2014; Ayoub et al., 2020), defined as follows.

Definition 4.8 (Eluder Dimension). Let $\mathcal{Z}$ be a set of real-valued functions on the domain $\mathcal{X}$. For any $\varepsilon > 0$ and $\tau \in \mathbb{N}$, we say $x_\tau \in \mathcal{X}$ is $(\mathcal{Z}, \varepsilon)$-independent of $x_1, \ldots, x_{\tau-1} \in \mathcal{X}$ if there exist $f_1, f_2 \in \mathcal{Z}$ such that
$$\Big(\sum_{j=1}^{\tau-1} |f_1(x_j) - f_2(x_j)|^2\Big)^{1/2} \le \varepsilon, \qquad |f_1(x_\tau) - f_2(x_\tau)| > \varepsilon. \tag{4.3}$$
The eluder dimension of $\mathcal{Z}$ at scale $\varepsilon$, denoted by $\dim_E(\mathcal{Z}, \varepsilon)$, is the length of the longest sequence $x_1, \ldots, x_\tau \in \mathcal{X}$ such that, for any $j \in [\tau]$, $x_j$ is $(\mathcal{Z}, \varepsilon')$-independent of $x_1, \ldots, x_{j-1}$ for some $\varepsilon' \ge \varepsilon$.

The following lemma decomposes the regret of Algorithm 1 into the errors that arise from policy improvement and policy evaluation, respectively. Here $\pi^*$ denotes the optimal policy in hindsight, namely $\operatorname{argmax}_\pi \sum_{k=1}^K V^{\pi,k}_1(s_1)$.

Lemma 4.9 (Regret Decomposition). It holds that
$$\mathrm{Regret}(T) = \sum_{k=1}^K \sum_{h=1}^H \mathbb{E}_{\pi^*}\big[\langle Q^k_h(s_h, \cdot), \pi^*_h(\cdot \mid s_h) - \pi^k_h(\cdot \mid s_h)\rangle\big] + \sum_{k=1}^K \big[V^k_1(s_1) - V^{\pi^k,k}_1(s_1)\big] + \sum_{k=1}^K \sum_{h=1}^H \mathbb{E}_{\pi^*}\big[r^k_h(s_h, a_h) + z_{P_h}(s_h, a_h, V^k_{h+1}) - Q^k_h(s_h, a_h)\big].$$

Proof. See Lemma 4.2 of Cai et al. (2019) for a detailed proof.

The following lemma characterizes the error that arises from policy improvement.

Lemma 4.10 (Policy Improvement). If we set $\alpha = \sqrt{2\log|\mathcal{A}|/(KH^2)}$ in Algorithm 1, then it holds that
$$\sum_{k=1}^K \sum_{h=1}^H \mathbb{E}_{\pi^*}\big[\langle Q^k_h(s_h, \cdot), \pi^*_h(\cdot \mid s_h) - \pi^k_h(\cdot \mid s_h)\rangle\big] \le \sqrt{2KH^4 \cdot \log|\mathcal{A}|}.$$

Proof. See Section B.2 for a detailed proof.

Recall that for any $P \in \mathcal{P}$, $z_P$ is defined in (3.3). Also, let $\mathcal{Z}_{\mathcal{P}} = \{z_P : P \in \mathcal{P}\}$. For any $\epsilon > 0$, we denote by $N_\epsilon(\mathcal{P}, \|\cdot\|_{\infty,1})$ the $\epsilon$-covering number of $\mathcal{P}$ with respect to the $\|\cdot\|_{\infty,1}$-norm distance, which is defined by
$$\|P - P'\|_{\infty,1} = \max_{(s,a) \in \mathcal{S} \times \mathcal{A}} \int_{\mathcal{S}} |P(s' \mid s, a) - P'(s' \mid s, a)|\, \mathrm{d}s'.$$
The following lemma characterizes the error that arises from policy evaluation. Lemma 4.11 (Policy Evaluation).
For any $p \in (0, 1)$, if we set
$$\beta \ge 2H^2 \cdot \log\big(N_{1/(KH)}(\mathcal{P}, \|\cdot\|_{\infty,1}) \cdot 2H/p\big) + 4\sqrt{(H + H^2/4) \cdot \log(8K^2H/p)} \tag{4.4}$$
in Algorithm 1, then the following results hold with probability at least $1 - p$.
• (Optimism) For any $(k, h) \in [K] \times [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, it holds that $r^k_h(s, a) + z_{P_h}(s, a, V^k_{h+1}) - Q^k_h(s, a) \le 0$.
• (Accuracy) Let $d = K \wedge \dim_E(\mathcal{Z}_{\mathcal{P}}, 1/K)$. It holds that
$$\sum_{k=1}^K \big[V^k_1(s_1) - V^{\pi^k,k}_1(s_1)\big] \le \sqrt{32KH^3 \cdot \log(2/p)} + H(dH + 1) + 4\sqrt{d\beta K H^2}.$$

Proof. See Section B.1 for a detailed proof.

Recall that $T = KH$. The following theorem characterizes the regret of Algorithm 1 when $\mathcal{P}$ is a general model class, which serves as a meta result.

Theorem 4.12. In Algorithm 1, if we set $\alpha$ as in Lemma 4.10 and $\beta$ as in Lemma 4.11, then it holds that
$$\mathrm{Regret}(T) \le \sqrt{2H^3T \cdot \log|\mathcal{A}|} + \sqrt{32H^2T \cdot \log(2/p)} + H(dH + 1) + 4\sqrt{d\beta HT}$$
with probability at least $1 - p$, where $d = K \wedge \dim_E(\mathcal{Z}_{\mathcal{P}}, 1/K)$.

Proof. The proof follows from combining Lemmas 4.9, 4.10, and 4.11.

Theorem 4.12 indicates that $\max\big\{d, \sqrt{d \cdot \log N_{1/(KH)}(\mathcal{P}, \|\cdot\|_{\infty,1})}\big\}$ serves as the model capacity $\kappa(\mathcal{P})$ in Theorem 4.1. The regret upper bound in Theorem 4.12 is similar to that of Ayoub et al. (2020) for a general model class whose capacity is characterized by the eluder dimension. In contrast, our algorithm additionally handles adversarial rewards, which is a benefit of the policy optimization approach. To establish the regret upper bounds in Sections 4.1 and 4.2, it remains to characterize the corresponding eluder dimensions and log-covering numbers, respectively. See Sections C and D for details. As a special case, Theorem 4.12 also applies to the case where $\mathcal{P}$ is a set of $d$-dimensional linear models with a finite $d$, which is studied in Cai et al. (2019). In particular, the eluder dimension and log-covering number in (4.4) are both $O(d)$ (Ayoub et al., 2020), which leads to the $\sqrt{d^2 H^3 T}$-regret in Cai et al. (2019).
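On a finite domain with a finite function class, the independence condition (4.3) in Definition 4.8 can be checked exhaustively, and a greedy scan yields a lower bound on the eluder dimension. A minimal sketch (the names are ours; the greedy sequence need not be the longest one):

```python
import numpy as np

def is_independent(Z, idx_prev, idx_new, eps):
    """Check (Z, eps)-independence as in (4.3) for a finite class Z, stored as
    an array of shape (num_functions, num_points) of function values."""
    for i in range(len(Z)):
        for j in range(len(Z)):
            gap = Z[i] - Z[j]
            if np.sqrt(np.sum(gap[idx_prev] ** 2)) <= eps and abs(gap[idx_new]) > eps:
                return True
    return False

def eluder_dim_lower_bound(Z, eps):
    """Greedily grow a sequence in which each point is eps-independent of its
    prefix; its length lower-bounds dim_E(Z, eps) over this finite domain."""
    seq = []
    for x in range(Z.shape[1]):
        if is_independent(Z, seq, x, eps):
            seq.append(x)
    return len(seq)
```

For instance, the class of all $\{0,1\}$-valued functions on two points yields a sequence of length two, whereas a class of constant functions yields length one.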
In contrast, Theorem 4.12 additionally handles the case where $d$ is infinite, as in kernel and neural function approximations.

A MORE DETAILS ON RKHS

An example of an RKHS is the linear model class. In particular, let $\psi$ be a $d$-dimensional feature vector $\psi(x) = (\psi_1(x), \ldots, \psi_d(x))^\top$ for any $x \in \mathcal{Y}$, where $\psi_1, \ldots, \psi_d$ are linearly independent. Then, the linear span of $\{\psi_j\}_{j=1}^d$ forms an RKHS $\mathcal{H}_\psi$ with the reproducing kernel $K_\psi(x, y) = \psi(x)^\top \psi(y)$ for any $x, y \in \mathcal{Y}$, and the corresponding inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}_\psi}$ is defined by the Euclidean inner product $\langle \psi(\cdot)^\top c_1, \psi(\cdot)^\top c_2\rangle_{\mathcal{H}_\psi} = c_1^\top c_2$ for any $c_1, c_2 \in \mathbb{R}^d$. The above example naturally generalizes to the case where $d = \infty$, that is, where the feature vector is infinite-dimensional. Moreover, recall that (2.2) says the reproducing kernel $K$ has the representation
$$K(x, y) = \sum_{j=1}^\infty \lambda_j \phi_j(x)\phi_j(y), \quad \text{for any } x, y \in \mathcal{Y},$$
where $\{\phi_j\}_{j=1}^\infty$ is an orthonormal basis of $L^2(\mathcal{Y})$ and $\lambda_1 \ge \lambda_2 \ge \cdots \ge 0$. We refer to $\{\phi_j\}_{j=1}^\infty$ as the eigenfunctions of $K$ with the corresponding eigenvalues $\{\lambda_j\}_{j=1}^\infty$. Such a representation gives the feature vector
$$\phi(x) = \big(\sqrt{\lambda_1} \cdot \phi_1(x), \sqrt{\lambda_2} \cdot \phi_2(x), \ldots\big)^\top, \quad \text{for any } x \in \mathcal{Y}.$$
The linear span of $\{\sqrt{\lambda_j} \cdot \phi_j\}_{j=1}^\infty$ recovers the RKHS $\mathcal{H}$ with the reproducing kernel $K$ and inner product $\langle\cdot,\cdot\rangle_{\mathcal{H}}$. When the reproducing kernel $K$ is infinite-dimensional, that is, when $K$ has an infinite number of nonzero eigenvalues, $\phi$ is an infinite-dimensional vector. It is known that the RKHSs of various infinite-dimensional reproducing kernels, for example, the Gaussian radial basis function kernel (Steinwart & Christmann, 2008), are rich model classes in the sense that they are dense in the class of continuous and bounded functions.

B PROOFS FOR SECTION 4.3

In this section, we provide the detailed proofs of the results in Section 4.3. For notational simplicity, for any $h \in [H]$, we denote by $\mathbb{P}_h$ the operator that takes the conditional expectation with respect to the transition kernel $P_h$, that is, $(\mathbb{P}_h V)(s, a) = \int_{\mathcal{S}} V(s') \cdot P_h(s' \mid s, a)\, \mathrm{d}s'$ for any $V: \mathcal{S} \to \mathbb{R}$.

B.1 PROOF OF LEMMA 4.11

Proof. We define $\mathcal{E}$ as the event that
$$P_h \in \mathcal{P}^k_h \quad \text{for any } (k, h) \in [K] \times [H]. \tag{B.1}$$
By our choice of $\beta$ and Lemma F.4 with $\delta = p/2$, the event $\mathcal{E}$ occurs with probability at least $1 - p/2$.

Optimism: For any $(k, h) \in [K] \times [H]$, $P_h \in \mathcal{P}^k_h$ implies
$$Q^k_h(\cdot, \cdot) - \big[r^k_h(\cdot, \cdot) + (\mathbb{P}_h V^k_{h+1})(\cdot, \cdot)\big] = \mathrm{clamp}\Big(\max_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P(s' \mid \cdot, \cdot)\, \mathrm{d}s', 0, H - h\Big) - \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P_h(s' \mid \cdot, \cdot)\, \mathrm{d}s' \ge \mathrm{clamp}\Big(\int_{\mathcal{S}} V^k_{h+1}(s') \cdot P_h(s' \mid \cdot, \cdot)\, \mathrm{d}s', 0, H - h\Big) - \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P_h(s' \mid \cdot, \cdot)\, \mathrm{d}s'. \tag{B.2}$$
When $h = H$, the right-hand side of (B.2) is zero since $V^k_{H+1}(\cdot) = 0$. When $h < H$, by the construction of $Q^k_{h+1}$ in (3.5) and the assumption that $r^k_{h+1}(\cdot, \cdot) \in [0, 1]$, we have
$$Q^k_{h+1}(\cdot, \cdot) \in [0, H - h], \qquad V^k_{h+1}(\cdot) \in [0, H - h], \qquad \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P_h(s' \mid \cdot, \cdot)\, \mathrm{d}s' \in [0, H - h],$$
so the right-hand side of (B.2) is again zero, which implies $Q^k_h(\cdot, \cdot) - [r^k_h(\cdot, \cdot) + (\mathbb{P}_h V^k_{h+1})(\cdot, \cdot)] \ge 0$. Thus, the optimism result holds under the event $\mathcal{E}$.

Accuracy: We invoke Lemma F.1 and obtain
$$\sum_{k=1}^K \big[V^k_1(s^k_1) - V^{\pi^k,k}_1(s^k_1)\big] = \sum_{k=1}^K \sum_{h=1}^H (D^k_{h,1} + D^k_{h,2}) + \sum_{k=1}^K \sum_{h=1}^H \Big[Q^k_h(s^k_h, a^k_h) - \big(r^k_h(s^k_h, a^k_h) + (\mathbb{P}_h V^k_{h+1})(s^k_h, a^k_h)\big)\Big], \tag{B.3}$$
where $|D^k_{h,1}| \le 2H$, $|D^k_{h,2}| \le 2H$, and $D^k_{H,2} = 0$ for any $(k, h) \in [K] \times [H]$, and
$$D^1_{1,1},\ D^1_{1,2} + D^1_{2,1},\ D^1_{2,2} + D^1_{3,1},\ \ldots,\ D^1_{H-1,2} + D^1_{H,1},\ D^2_{1,1},\ D^2_{1,2} + D^2_{2,1},\ D^2_{2,2} + D^2_{3,1},\ \ldots,\ D^2_{H-1,2} + D^2_{H,1},\ \ldots$$
is a martingale difference sequence. The Azuma-Hoeffding inequality (Azuma, 1967) implies
$$\sum_{k=1}^K \sum_{h=1}^H (D^k_{h,1} + D^k_{h,2}) \le \sqrt{32KH^3 \cdot \log(2/p)} \tag{B.4}$$
with probability at least $1 - p/2$. It remains to upper bound the second term on the right-hand side of (B.3).
For any $(k, h) \in [K] \times [H]$, $P_h \in \mathcal{P}^k_h$ implies
$$Q^k_h(s^k_h, a^k_h) - \big[r^k_h(s^k_h, a^k_h) + (\mathbb{P}_h V^k_{h+1})(s^k_h, a^k_h)\big] = \mathrm{clamp}\Big(\max_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P(s' \mid s^k_h, a^k_h)\, \mathrm{d}s', 0, H - h\Big) - \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P_h(s' \mid s^k_h, a^k_h)\, \mathrm{d}s' \le \max_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P(s' \mid s^k_h, a^k_h)\, \mathrm{d}s' - \min_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P(s' \mid s^k_h, a^k_h)\, \mathrm{d}s'.$$
Applying Lemma F.5, for any $h \in [H]$, we have
$$\sum_{k=1}^K \Big[Q^k_h(s^k_h, a^k_h) - \big(r^k_h(s^k_h, a^k_h) + (\mathbb{P}_h V^k_{h+1})(s^k_h, a^k_h)\big)\Big] \le \sum_{k=1}^K \Big[\max_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P(s' \mid s^k_h, a^k_h)\, \mathrm{d}s' - \min_{P \in \mathcal{P}^k_h} \int_{\mathcal{S}} V^k_{h+1}(s') \cdot P(s' \mid s^k_h, a^k_h)\, \mathrm{d}s'\Big] \le 1 + dH + 4\sqrt{d\beta K}, \tag{B.5}$$
where $d = K \wedge \dim_E(\mathcal{Z}_{\mathcal{P}}, 1/K)$. Combining (B.3)-(B.5), we have
$$\sum_{k=1}^K \big[V^k_1(s^k_1) - V^{\pi^k,k}_1(s^k_1)\big] \le \sqrt{32KH^3 \cdot \log(2/p)} + H(1 + dH) + 4\sqrt{d\beta KH^2}.$$
In summary, when the event $\mathcal{E}$ and (B.4) hold, which occurs with probability at least $1 - p$, we have
$$Q^k_h(\cdot, \cdot) \ge r^k_h(\cdot, \cdot) + (\mathbb{P}_h V^k_{h+1})(\cdot, \cdot) \quad \text{for any } (k, h) \in [K] \times [H], \qquad \sum_{k=1}^K \big[V^k_1(s^k_1) - V^{\pi^k,k}_1(s^k_1)\big] \le \sqrt{32KH^3 \cdot \log(2/p)} + H(1 + dH) + 4\sqrt{d\beta KH^2},$$
which concludes the proof of Lemma 4.11.

B.2 PROOF OF LEMMA 4.10

Proof. By the standard analysis of mirror descent, for any $s \in \mathcal{S}$ and $(k, h) \in [K] \times [H]$, it holds that
$$\big\langle Q^k_h(s, \cdot), \pi^*_h(\cdot \mid s) - \pi^k_h(\cdot \mid s)\big\rangle \le \alpha H^2/2 + \alpha^{-1} \cdot \Big[D_{\mathrm{KL}}\big(\pi^*_h(\cdot \mid s) \,\|\, \pi^k_h(\cdot \mid s)\big) - D_{\mathrm{KL}}\big(\pi^*_h(\cdot \mid s) \,\|\, \pi^{k+1}_h(\cdot \mid s)\big)\Big], \tag{B.6}$$
which implies
$$\sum_{k=1}^K \sum_{h=1}^H \mathbb{E}_{\pi^*}\big[\langle Q^k_h(s_h, \cdot), \pi^*_h(\cdot \mid s_h) - \pi^k_h(\cdot \mid s_h)\rangle\big] \le \alpha KH^3/2 + \alpha^{-1} \cdot \sum_{h=1}^H \mathbb{E}_{\pi^*}\Big[D_{\mathrm{KL}}\big(\pi^*_h(\cdot \mid s_h) \,\|\, \pi^1_h(\cdot \mid s_h)\big)\Big] \le \alpha KH^3/2 + \alpha^{-1} H \cdot \log|\mathcal{A}|. \tag{B.7}$$
Plugging $\alpha = \sqrt{2\log|\mathcal{A}|/(HT)}$ into the right-hand side of (B.7), we conclude the proof of Lemma 4.10.

C PROOF OF THEOREM 4.3

In this section, we prove Theorem 4.3. By Theorem 4.12, it suffices to upper bound the eluder dimension of $\mathcal{Z}_{\mathcal{P}} = \{z_P : P \in \mathcal{P}\}$ and the log-covering number of $\mathcal{H}_R$, which are characterized by the following two lemmas, respectively.

Lemma C.1. Under Assumption 4.2, there exists an absolute constant $C_7 > 0$ such that
$$K \wedge \dim_E(\mathcal{Z}_{\mathcal{P}}, 1/K) \le C_7 \cdot \log^2(1/\gamma)/\gamma \cdot \log^{1+1/\gamma}(RT).$$

Proof. Let $t = K \wedge \dim_E(\mathcal{Z}_{\mathcal{P}}, 1/K)$. By Definition 4.8, there exists a sequence $x_1, \ldots, x_t \in \mathcal{S} \times \mathcal{A} \times [0, H]^{\mathcal{S}}$ such that $x_\tau$ is $(\mathcal{Z}_{\mathcal{P}}, 1/K)$-independent of $x_1, \ldots, x_{\tau-1}$ for any $\tau \in [t]$. Here the independence scale is assumed to be $1/K$ without loss of generality; it can be replaced by any value larger than $1/K$. In other words, for any $\tau \in [t]$, there exist $P_1, P_2 \in \mathcal{P}$ such that
$$\sum_{i=1}^{\tau-1} |z_{P_1}(x_i) - z_{P_2}(x_i)|^2 \le 1/K^2, \tag{C.1}$$
$$|z_{P_1}(x_\tau) - z_{P_2}(x_\tau)| > 1/K. \tag{C.2}$$
Note that for any $x = (s, a, V) \in \mathcal{S} \times \mathcal{A} \times [0, H]^{\mathcal{S}}$ and $P \in \mathcal{P}$, by the reproducing property of the RKHS $\mathcal{H}$, we can write
$$z_P(x) = \int_{\mathcal{S}} V(s') \cdot P(s' \mid s, a)\, \mathrm{d}s' = \int_{\mathcal{S}} V(s') \cdot \langle P, K_{(s,a,s')}\rangle_{\mathcal{H}}\, \mathrm{d}s'.$$
The representation of $K$ in (2.2) implies
$$z_P(x) = \int_{\mathcal{S}} V(s') \cdot \Big\langle P, \sum_{j=1}^\infty \sqrt{\lambda_j} \cdot \phi_j(s, a, s') \cdot \sqrt{\lambda_j}\phi_j\Big\rangle_{\mathcal{H}}\, \mathrm{d}s' = \sum_{j=1}^\infty \sqrt{\lambda_j} \cdot \Big(\int_{\mathcal{S}} V(s') \cdot \phi_j(s, a, s')\, \mathrm{d}s'\Big) \cdot \big\langle P, \sqrt{\lambda_j}\phi_j\big\rangle_{\mathcal{H}}.$$
For any $d_0$ such that $d_0^\gamma \ge 4(1 - \gamma)(\gamma C_2)^{-1}$, where $\gamma$ and $C_2$ are defined in Assumption 4.2, we define
$$\bar{z}_P(x) = \sum_{1 \le j \le d_0} \sqrt{\lambda_j} \cdot \Big(\int_{\mathcal{S}} V(s') \cdot \phi_j(s, a, s')\, \mathrm{d}s'\Big) \cdot \big\langle P, \sqrt{\lambda_j}\phi_j\big\rangle_{\mathcal{H}}$$
for any $x = (s, a, V) \in \mathcal{S} \times \mathcal{A} \times [0, H]^{\mathcal{S}}$ and $P \in \mathcal{P}$. Then, we have
$$|z_P(x) - \bar{z}_P(x)| = \Big|\sum_{j > d_0} \sqrt{\lambda_j} \cdot \Big(\int_{\mathcal{S}} V(s') \cdot \phi_j(s, a, s')\, \mathrm{d}s'\Big) \cdot \big\langle P, \sqrt{\lambda_j}\phi_j\big\rangle_{\mathcal{H}}\Big| \le \sum_{j > d_0} \sqrt{\lambda_j} \cdot |\mathcal{S}| \cdot H \cdot \|P\|_{\mathcal{H}} \le \sum_{j > d_0} \sqrt{\lambda_j} \cdot RH,$$
where $|\mathcal{S}|$ is the Lebesgue measure of $\mathcal{S}$ and $|\mathcal{S}| \le 1$. Here the first inequality follows from the Cauchy-Schwarz inequality and $\|\sqrt{\lambda_j}\phi_j\|_{\mathcal{H}} = 1$ for any $j \in \mathbb{N}$, since $\{\sqrt{\lambda_j}\phi_j\}_{j=1}^\infty$ is an orthonormal basis of $\mathcal{H}$.
By Assumption 4.2, we obtain |z P (x) -z P (x)| ≤ j>d0 C 1 • exp(-C 2 j γ /2) • RH = C 1 • RH • j>d0 exp(-C 2 j γ /2) ≤ C 1 • RH • 4d 1-γ 0 (γC 2 ) -1 • exp(-C 2 d γ 0 /2), (C.3) where the last inequality follows from Lemma F.6. For notational simplicity, let Γ(d 0 ) = C 1 • RH • 4d 1-γ 0 (γC 2 ) -1 • exp(-C 2 d γ 0 /2). (C.4) By (C.1), it holds that τ -1 i=1 | z P1 (x i ) -z P2 (x i )| 2 ≤ τ -1 i=1 |z P1 (x i ) -z P2 (x i )| + 2Γ(d 0 ) 2 ≤ τ -1 i=1 2|z P1 (x i ) -z P2 (x i )| 2 + 8Γ(d 0 ) 2 ≤ 2/K 2 + 8t • Γ(d 0 ) 2 , (C.5) where the last inequality uses τ ≤ t. For any τ ∈ [t], we define y = P 1 -P 2 , λ 1 • φ 1 H , . . . , P 1 -P 2 , λ d0 • φ d0 H , (C.6) v τ = λ 1 • S V τ (s ) • φ 1 (s τ , a τ , s ) ds , . . . , λ d0 • S V τ (s ) • φ d0 (s τ , a τ , s ) ds , (C.7) Λ τ = I d0×d0 /(d 0 K 2 R 2 ) + τ -1 i=1 v i v i . (C.8) Then, it holds that y Λ τ y = 1 d 0 K 2 R 2 • y 2 2 + τ -1 i=1 (y v i ) 2 = 1 d 0 K 2 R 2 • d0 j=1 P 1 -P 2 , λ j • φ j 2 H + τ -1 i=1 | z P1 (x i ) -z P2 (x i )| 2 ≤ 1 d 0 K 2 R 2 • 4d 0 R 2 + 2/K 2 + 8t • Γ(d 0 ) 2 = 6/K 2 + 8t • Γ(d 0 ) 2 , (C.9) where the inequality uses (C.5) and P 1 -P 2 , λ j • φ j 2 H ≤ P 1 -P 2 2 H • λ j • φ j 2 H ≤ 4R 2 • 1 = 4R 2 . Here we uses P H ≤ R for any P ∈ P and √ λ i • φ i H = 1 for any j ∈ [d 0 ]. In the sequel, we establish an upper bound of | z P1 (x τ ) -z P2 (x τ )|. By the definitions of y in (C.6) and v τ in (C.7), we can write z P1 (x τ ) -z P2 (x τ ) = y, v τ . By (C.9), | z P1 (z τ ) -z P2 (z τ )| is upper bounded by max y ∈R d 0 y , v τ s.t. (y ) Λ τ y ≤ 6/K 2 + 8t • Γ(d 0 ) 2 . The maximizer of such a quadratic program is y = [6/K 2 + 8t • Γ(d 0 ) 2 ]/(v τ Λ -1 τ v τ ) • Λ -1 τ v τ . Therefore, we obtain | z P1 (x τ ) -z P2 (x τ )| ≤ 6/K 2 + 8t • Γ(d 0 ) 2 (v τ Λ -1 τ v τ ). (C.10) On the other hand, by (C.2)-(C.4), we have | z P1 (x τ ) -z P2 (x τ )| ≥ |z P1 (x τ ) -z P2 (x τ )| -| z P1 (x τ ) -z P1 (x τ )| + | z P2 (x τ ) -z P2 (x τ )| ≥ |z P1 (x τ ) -z P2 (x τ )| -2Γ(d 0 ) > 1/K -2Γ(d 0 ). 
We set $d_0 = \lceil C\log(1/\gamma)/\gamma \cdot \log^{1/\gamma}(4tKRH)\rceil$, where $C$ is defined in Lemma F.7. Then, by Lemma F.7, it holds that
$$d_0^{\gamma} \ge 4(1-\gamma)(\gamma C_2)^{-1} \quad\text{and}\quad \Gamma(d_0) \le 1/(4tK). \quad (C.12)$$
Combining (C.10), (C.11), and (C.12), we obtain
$$v_\tau^{\top}\Lambda_\tau^{-1}v_\tau > \frac{\big(1/K - 2\Gamma(d_0)\big)^2}{6/K^2 + 8t\cdot\Gamma(d_0)^2} \ge \frac{\big(1/K - 1/(2K)\big)^2}{6/K^2 + 1/(2K^2)} > 1/100. \quad (C.13)$$
Therefore, by (C.13), we have
$$t/100 = \sum_{\tau=1}^{t}\min\{1/100,\; v_\tau^{\top}\Lambda_\tau^{-1}v_\tau\} \le \sum_{\tau=1}^{t} 2\log\big(1 + v_\tau^{\top}\Lambda_\tau^{-1}v_\tau\big) \le 2\big(\log\det(\Lambda_{t+1}) - \log\det(\Lambda_1)\big), \quad (C.14)$$
where the last inequality uses the elliptical potential lemma (Abbasi-Yadkori et al., 2011). Here, similarly to (C.8), $\Lambda_{t+1}$ is defined by $\Lambda_{t+1} = I_{d_0\times d_0}/(d_0K^2R^2) + \sum_{i=1}^{t} v_iv_i^{\top}$. Note that $\|v_\tau\|_2^2 = \sum_{j=1}^{d_0}\lambda_j\big(\int_{\mathcal{S}} V_\tau(s')\phi_j(s_\tau,a_\tau,s')\,ds'\big)^2 \le d_0H^2$ for any $\tau\in[t]$. Thus, setting $\lambda = 1/(d_0K^2R^2)$, we have
$$\log\det(\Lambda_1) = d_0\cdot\log\lambda, \quad (C.15)$$
$$\log\det(\Lambda_{t+1}) \le d_0\cdot\log\|\Lambda_{t+1}\|_2 \le d_0\cdot\log\Big(\lambda + \sum_{i=1}^{t}\|v_i\|_2^2\Big) \le d_0\cdot\log(\lambda + d_0KH^2), \quad (C.16)$$
where the last inequality follows from $t \le K$ since $t = K\wedge\dim_E(\mathcal{Z}_{\mathcal{P}},1/K)$. Combining (C.15) and (C.16), we have
$$2\big(\log\det(\Lambda_{t+1}) - \log\det(\Lambda_1)\big) \le 2d_0\cdot\log(1 + d_0KH^2/\lambda) = 2d_0\cdot\log(1 + d_0^2K^3R^2H^2). \quad (C.17)$$
Combining (C.14) and (C.17), we obtain
$$t \le 200d_0\cdot\log(1 + d_0^2K^3R^2H^2) \le 600d_0\cdot\log(d_0KRH). \quad (C.18)$$
Moreover, by $d_0 = \lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(4tKRH)\rceil$ and (C.18), there exists an absolute constant $C_8>0$ such that
$$t \le 600d_0\cdot\log(d_0KRH) \le 600\big\lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(tKRH)\big\rceil\cdot\log\Big(\big\lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(tKRH)\big\rceil\cdot KRH\Big) \le C_8\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(tKRH). \quad (C.19)$$
Recall that $t = K\wedge\dim_E(\mathcal{Z}_{\mathcal{P}},1/K)$. Thus, by (C.19), we obtain
$$t \le C_7\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(KRH) \quad (C.20)$$
for an absolute constant $C_7>0$. Thus, we conclude the proof of Lemma C.1.

Lemma C.2. Under Assumption 4.2, there exists an absolute constant $C_9>0$ such that
$$\log N_{1/(KH)}(\mathcal{P}, \|\cdot\|_{\infty,1}) \le C_9\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT)$$
for any $R\ge2$.

Proof. Recall that we define $\Gamma(d_0)$ in (C.4).
By choosing $t = 2KH$ in Lemma F.7, there exists $d_0 = \lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(2KRH^2)\rceil$, where $C$ is defined in Lemma F.7, such that
$$d_0^{\gamma} \ge 4(1-\gamma)(\gamma C_2)^{-1} \quad\text{and}\quad \Gamma(d_0)\le 1/(2KH). \quad (C.21)$$
Here $\gamma$ and $C_2$ are defined in Assumption 4.2. Let $\mathcal{C}$ be a minimal $1/(2d_0^{1/2}KH)$-covering of the Euclidean ball $\{v\in\mathbb{R}^{d_0}: \|v\|_2\le R\}$ with respect to the $\ell_2$-norm distance. For any $P\in\mathcal{P}$, we define $\bar{P}$ by
$$\bar{P}(x) = \sum_{j=1}^{d_0}\lambda_j\cdot\langle P,\phi_j\rangle_{\mathcal{H}}\cdot\phi_j(x)$$
for any $x\in\mathcal{Y}$. Also, we write $v = \big(\langle P,\sqrt{\lambda_1}\,\phi_1\rangle_{\mathcal{H}}, \ldots, \langle P,\sqrt{\lambda_{d_0}}\,\phi_{d_0}\rangle_{\mathcal{H}}\big)^{\top}$. Since $\{\sqrt{\lambda_j}\,\phi_j\}_{j=1}^{\infty}$ is an orthonormal basis of $\mathcal{H}$ and $P\in\mathcal{H}_R$, we have $\|v\|_2 = \|\bar{P}\|_{\mathcal{H}} \le \|P\|_{\mathcal{H}}\le R$. Thus, by the definition of $\mathcal{C}$, there exists $v^*\in\mathcal{C}$ with $\|v-v^*\|_2\le 1/(2d_0^{1/2}KH)$. We define $P^*$ by $P^*(x) = \sum_{j=1}^{d_0}v_j^*\cdot\sqrt{\lambda_j}\cdot\phi_j(x)$ for any $x\in\mathcal{Y}$. Then, by Assumption 4.2, for any $x\in\mathcal{Y}$, we have
$$|\bar{P}(x)-P^*(x)| = \Big|\sum_{j=1}^{d_0}(v_j-v_j^*)\cdot\sqrt{\lambda_j}\cdot\phi_j(x)\Big| \le \|v-v^*\|_1 \le d_0^{1/2}\cdot\|v-v^*\|_2 \le 1/(2KH). \quad (C.22)$$
Also, for any $x\in\mathcal{Y}$, we have
$$|P(x)-\bar{P}(x)| = \Big|\sum_{j>d_0}\sqrt{\lambda_j}\cdot\phi_j(x)\cdot\langle P,\sqrt{\lambda_j}\,\phi_j\rangle_{\mathcal{H}}\Big| \le \sum_{j>d_0}\sqrt{\lambda_j}\cdot R \le \sum_{j>d_0}\sqrt{C_1}\cdot\exp(-C_2j^{\gamma}/2)\cdot R \le \Gamma(d_0), \quad (C.23)$$
where the first inequality uses $\|P\|_{\mathcal{H}}\le R$ and $\|\sqrt{\lambda_j}\,\phi_j\|_{\mathcal{H}}=1$, the second inequality follows from Assumption 4.2, and the last inequality follows from the same argument as in (C.3) and $H\ge1$. Thus, for any $x\in\mathcal{Y}$, it holds that
$$|P(x)-P^*(x)| \le |P(x)-\bar{P}(x)| + |\bar{P}(x)-P^*(x)| \le \Gamma(d_0)+1/(2KH)\le 1/(KH). \quad (C.24)$$
We define
$$\mathcal{P}_{\mathcal{C}} = \Big\{P: \exists\, v\in\mathcal{C}\ \text{s.t.}\ P(x)=\sum_{j=1}^{d_0}v_j\cdot\sqrt{\lambda_j}\cdot\phi_j(x)\ \text{for any } x\in\mathcal{Y}\Big\}.$$
Then, by (C.24), $\mathcal{P}_{\mathcal{C}}$ is a $1/(KH)$-covering of $\mathcal{P}$ with respect to the $\ell_\infty$-norm distance. Therefore, we have
$$\log N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1}) \le \log N_{1/(KH)}(\mathcal{P},\|\cdot\|_\infty) \le \log|\mathcal{P}_{\mathcal{C}}| \le \log|\mathcal{C}| \le C_{12}\cdot d_0\cdot\log(d_0KRH) \le C_9\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(KRH), \quad (C.25)$$
where $C_{12},C_9>0$ are absolute constants. Here the first inequality holds because $\|\cdot\|_{\infty,1}\le\|\cdot\|_\infty$ given $|\mathcal{S}|\le1$, the third inequality follows from Corollary 4.2.13 of Vershynin (2018), and the last inequality follows from the same argument as in (C.19). Thus, we conclude the proof of Lemma C.2.

We note that our proofs of Lemmas C.1 and C.2 for the exponential decay case can be made more general. For a finite-rank kernel, we can let $d_0$ be the rank of the kernel, i.e., the number of nonzero eigenvalues. Then the same proof still holds, and we can show that the eluder dimension is upper bounded by $O(d_0\log(RKH))$. Moreover, a similar truncation argument handles kernels with polynomially decaying eigenvalues.

Proof of Theorem 4.3

Proof. Recall that $d = K\wedge\dim_E(\mathcal{Z}_{\mathcal{P}},1/K)$. By Theorem 4.12, it holds that
$$\mathrm{Regret}(T) \le \sqrt{2H^3T\cdot\log|\mathcal{A}|} + \sqrt{32H^2T\cdot\log(2/p)} + H(dH+1) + 4\sqrt{d\beta HT} \quad (C.26)$$
with probability at least $1-p$, where we set $\alpha = \sqrt{2\log|\mathcal{A}|/(KH^2)}$ and
$$\beta \ge 2H^2\cdot\log\big(N_{1/T}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot2H/p\big) + 4\Big(H+\sqrt{H^2/4\cdot\log(8K^2H/p)}\Big). \quad (C.27)$$
By Lemma C.2, we have $\log N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1})\le C_9\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT)$, which implies there exists an absolute constant $C_3>0$ such that
$$2H^2\cdot\log\big(N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot2H/p\big) + 4\Big(H+\sqrt{H^2/4\cdot\log(8K^2H/p)}\Big) \le C_3H^2\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT/p).$$
In other words, (C.27) holds if we set
$$\beta = C_3H^2\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT/p). \quad (C.28)$$
On the other hand, by Lemma C.1, we have
$$d \le C_7\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT). \quad (C.29)$$
Plugging (C.28) and (C.29) into (C.26), we obtain
$$\mathrm{Regret}(T) \le \sqrt{2H^3T\cdot\log|\mathcal{A}|} + \sqrt{32H^2T\cdot\log(2/p)} + H + C_7H^2\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT) + 4\sqrt{C_3C_7H^3T}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{(1+1/\gamma)/2}(RT)\cdot\log^{(1+1/\gamma)/2}(RT/p) \le C_4\sqrt{H^3T}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(|\mathcal{A}|RT/p) \quad (C.30)$$
with probability at least $1-p$, where $C_4>0$ is an absolute constant. Thus, we conclude the proof of Theorem 4.3.

D PROOF OF THEOREM 4.7

In this section, we prove Theorem 4.7. Similar to the proof of Theorem 4.3, we upper bound the eluder dimension of $\mathcal{Z}_{\mathcal{P}}$ and the log-covering number of $\mathcal{P}$ by the following two lemmas, respectively. Recall that $B_R$ is defined in (4.1). By the definition of $K_{\mathrm{NTK}}$ in (4.2), we have
$$\mathcal{H}_R = \big\{P : \exists\, w\in B_R\ \text{s.t.}\ P(x) = \nabla_w\mathrm{NN}(x;w^0)^{\top}(w-w^0)\ \text{for any } x\in\mathcal{Y}\big\}.$$
Without loss of generality, we assume that the entries of $\nabla_w\mathrm{NN}(\cdot;w^0)$ are linearly independent, which happens with probability one for most neural networks with nonlinear activation functions and random initialization.

Lemma D.1. Under Assumption 4.6 and Condition 4.4, there exists an absolute constant $C_{10}>0$ such that
$$K\wedge\dim_E(\mathcal{Z}_{\mathcal{P}},1/K) \le C_{10}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT).$$

Proof. Let $t = K\wedge\dim_E(\mathcal{Z}_{\mathcal{P}},1/K)$. By Definition 4.8, there exists a sequence $x_1,\ldots,x_t\in\mathcal{S}\times\mathcal{A}\times[0,H]^{\mathcal{S}}$ such that $x_\tau$ is $(\mathcal{Z}_{\mathcal{P}},1/K)$-independent of $x_1,\ldots,x_{\tau-1}$ for any $\tau\in[t]$. In other words, for any $\tau\in[t]$, there exist $P_1,P_2\in\mathcal{P}$ such that
$$\sum_{i=1}^{\tau-1}|z_{P_1}(x_i)-z_{P_2}(x_i)|^2\le1/K^2, \quad (D.1)$$
$$|z_{P_1}(x_\tau)-z_{P_2}(x_\tau)|>1/K. \quad (D.2)$$
For any $x=(s,a,V)\in\mathcal{S}\times\mathcal{A}\times[0,H]^{\mathcal{S}}$ and $P\in\mathcal{P}$ with parameter $w$, we define $\widehat{P}(\cdot\,;w) = \nabla_w\mathrm{NN}(\cdot\,;w^0)^{\top}(w-w^0)$. Using the reproducing property of $\mathcal{H}$ and the representation of $K$ in (2.2), we have
$$z_P(x) = \int_{\mathcal{S}}V(s')\cdot P(s,a,s')\,ds' = \int_{\mathcal{S}}V(s')\cdot\widehat{P}(s,a,s';w)\,ds' + \int_{\mathcal{S}}V(s')\cdot\big(P(s,a,s')-\widehat{P}(s,a,s';w)\big)\,ds' = \sum_{j=1}^{\infty}\sqrt{\lambda_j}\cdot\int_{\mathcal{S}}V(s')\cdot\phi_j(s,a,s')\,ds'\cdot\big\langle\widehat{P},\sqrt{\lambda_j}\,\phi_j\big\rangle_{\mathcal{H}} + \int_{\mathcal{S}}V(s')\cdot\big(P(s,a,s')-\widehat{P}(s,a,s';w)\big)\,ds'.$$
For any $d_0$ such that $d_0^{\gamma}\ge4(1-\gamma)(\gamma C_2)^{-1}$, we define
$$\bar{z}_P(x) = \sum_{1\le j\le d_0}\sqrt{\lambda_j}\cdot\int_{\mathcal{S}}V(s')\cdot\phi_j(s,a,s')\,ds'\cdot\big\langle\widehat{P},\sqrt{\lambda_j}\,\phi_j\big\rangle_{\mathcal{H}}$$
for any $x=(s,a,V)\in\mathcal{S}\times\mathcal{A}\times[0,H]^{\mathcal{S}}$ and $P\in\mathcal{P}$. Then we have
$$|z_P(x)-\bar{z}_P(x)| \le \Big|\sum_{j>d_0}\sqrt{\lambda_j}\cdot\int_{\mathcal{S}}V(s')\cdot\phi_j(s,a,s')\,ds'\cdot\big\langle\widehat{P},\sqrt{\lambda_j}\,\phi_j\big\rangle_{\mathcal{H}}\Big| + \Big|\int_{\mathcal{S}}V(s')\cdot\big(P(s,a,s')-\widehat{P}(s,a,s';w)\big)\,ds'\Big| \le \sum_{j>d_0}\sqrt{\lambda_j}\cdot HR + \xi_mH \le 4d_0^{1-\gamma}\sqrt{C_1}\,HR(\gamma C_2)^{-1}\cdot\exp(-C_2d_0^{\gamma}/2) + \xi_mH = \Gamma(d_0)+\xi_mH, \quad (D.3)$$
where the second inequality uses Assumption 4.2, $\|\widehat{P}\|_{\mathcal{H}}\le R$, $\|\sqrt{\lambda_j}\,\phi_j\|_{\mathcal{H}}=1$, and the definition of $\xi_m$ in Condition 4.4; the third inequality follows from Lemma F.6; and $\Gamma(d_0)$ is defined in (C.4).
Then, using the triangle inequality, we have
$$\sum_{i=1}^{\tau-1}|\bar{z}_{P_1}(x_i)-\bar{z}_{P_2}(x_i)|^2 \le \sum_{i=1}^{\tau-1}\big(|z_{P_1}(x_i)-z_{P_2}(x_i)|+2\Gamma(d_0)+2\xi_mH\big)^2 \le \sum_{i=1}^{\tau-1}2|z_{P_1}(x_i)-z_{P_2}(x_i)|^2 + 16t\cdot\Gamma(d_0)^2 + 16t\,\xi_m^2H^2 \le 2/K^2 + 16t\cdot\Gamma(d_0)^2 + 16t\,\xi_m^2H^2, \quad (D.4)$$
where the last inequality is by (D.1). In the sequel, we establish an upper bound on $|\bar{z}_{P_1}(x_\tau)-\bar{z}_{P_2}(x_\tau)|$. For any $\tau\in[t]$, similar to (C.6)-(C.8), we denote
$$y = \big(\langle\widehat{P}_1-\widehat{P}_2,\sqrt{\lambda_1}\,\phi_1\rangle_{\mathcal{H}},\ldots,\langle\widehat{P}_1-\widehat{P}_2,\sqrt{\lambda_{d_0}}\,\phi_{d_0}\rangle_{\mathcal{H}}\big)^{\top}, \quad (D.5)$$
$$v_\tau = \Big(\sqrt{\lambda_1}\int_{\mathcal{S}}V_\tau(s')\,\phi_1(s_\tau,a_\tau,s')\,ds',\ldots,\sqrt{\lambda_{d_0}}\int_{\mathcal{S}}V_\tau(s')\,\phi_{d_0}(s_\tau,a_\tau,s')\,ds'\Big)^{\top}, \quad (D.6)$$
$$\Lambda_\tau = I_{d_0\times d_0}/(d_0K^2R^2) + \sum_{i=1}^{\tau-1}v_iv_i^{\top}. \quad (D.7)$$
Then we have
$$y^{\top}\Lambda_\tau y = \frac1{d_0K^2R^2}\|y\|_2^2 + \sum_{i=1}^{\tau-1}(y^{\top}v_i)^2 = \frac1{d_0K^2R^2}\sum_{j=1}^{d_0}\langle\widehat{P}_1-\widehat{P}_2,\sqrt{\lambda_j}\,\phi_j\rangle_{\mathcal{H}}^2 + \sum_{i=1}^{\tau-1}|\bar{z}_{P_1}(x_i)-\bar{z}_{P_2}(x_i)|^2 \le \frac{4d_0R^2}{d_0K^2R^2} + 2/K^2 + 16t\cdot\Gamma(d_0)^2 + 16t\,\xi_m^2H^2 = 6/K^2 + 16t\cdot\Gamma(d_0)^2 + 16t\,\xi_m^2H^2 \le 7/K^2 + 16t\cdot\Gamma(d_0)^2. \quad (D.8)$$
Here the first inequality uses (D.4) and $\langle\widehat{P}_1-\widehat{P}_2,\sqrt{\lambda_j}\,\phi_j\rangle_{\mathcal{H}}^2\le\|\widehat{P}_1-\widehat{P}_2\|_{\mathcal{H}}^2\cdot\|\sqrt{\lambda_j}\,\phi_j\|_{\mathcal{H}}^2\le4R^2\cdot1=4R^2$, which follows from $\|\widehat{P}_1\|_{\mathcal{H}},\|\widehat{P}_2\|_{\mathcal{H}}\le R$ and $\|\sqrt{\lambda_j}\,\phi_j\|_{\mathcal{H}}=1$ for any $j\in[d_0]$; the last inequality uses Condition 4.4. By the definitions of $y$ in (D.5) and $v_\tau$ in (D.6), we can write $\bar{z}_{P_1}(x_\tau)-\bar{z}_{P_2}(x_\tau)=\langle y,v_\tau\rangle$, which by (D.8) is upper bounded by
$$\max_{y'\in\mathbb{R}^{d_0}}\langle y',v_\tau\rangle \quad\text{s.t.}\quad (y')^{\top}\Lambda_\tau y'\le7/K^2+16t\cdot\Gamma(d_0)^2.$$
The maximizer of this quadratic program is $y'=\sqrt{\big(7/K^2+16t\cdot\Gamma(d_0)^2\big)/(v_\tau^{\top}\Lambda_\tau^{-1}v_\tau)}\cdot\Lambda_\tau^{-1}v_\tau$. Therefore, we have
$$|\bar{z}_{P_1}(x_\tau)-\bar{z}_{P_2}(x_\tau)| \le \sqrt{\big(7/K^2+16t\cdot\Gamma(d_0)^2\big)\big(v_\tau^{\top}\Lambda_\tau^{-1}v_\tau\big)}. \quad (D.9)$$
On the other hand, by (D.2), (D.3), and Condition 4.4, we have
$$|\bar{z}_{P_1}(x_\tau)-\bar{z}_{P_2}(x_\tau)| \ge |z_{P_1}(x_\tau)-z_{P_2}(x_\tau)| - \big(|\bar{z}_{P_1}(x_\tau)-z_{P_1}(x_\tau)|+|\bar{z}_{P_2}(x_\tau)-z_{P_2}(x_\tau)|\big) \ge |z_{P_1}(x_\tau)-z_{P_2}(x_\tau)| - 2\Gamma(d_0) - 2\xi_mH > 1/K - 2\Gamma(d_0) - 1/(2K) \ge 1/(2K) - 2\Gamma(d_0). \quad (D.10)$$
We set $d_0=\lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(8tKRH)\rceil$, where $C$ is defined in Lemma F.7.
Then, by Lemma F.7, it holds that
$$d_0^{\gamma}\ge4(1-\gamma)(\gamma C_2)^{-1} \quad\text{and}\quad \Gamma(d_0)\le1/(8tK). \quad (D.11)$$
Combining (D.9)-(D.11), we obtain
$$v_\tau^{\top}\Lambda_\tau^{-1}v_\tau > \frac{\big(1/(2K)-2\Gamma(d_0)\big)^2}{7/K^2+16t\cdot\Gamma(d_0)^2} \ge \frac{\big(1/(2K)-1/(4K)\big)^2}{7/K^2+1/(4K^2)} > 1/116.$$
Following the same argument as in (C.14)-(C.20) in the proof of Lemma C.1 (with the constant $1/100$ replaced by $1/116$), we obtain
$$t \le C_{10}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(KRH)$$
for an absolute constant $C_{10}>0$. Thus, we conclude the proof of Lemma D.1.

Lemma D.2. Under Assumption 4.6 and Condition 4.4, there exists an absolute constant $C_{11}>0$ such that
$$\log N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1}) \le C_{11}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT)$$
for any $R\ge2$.

Proof. By choosing $t=4KH$ in Lemma F.7, there exists $d_0=\lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(4KRH^2)\rceil$, where $C$ is defined in Lemma F.7, such that
$$d_0^{\gamma}\ge4(1-\gamma)(\gamma C_2)^{-1} \quad\text{and}\quad \Gamma(d_0)\le1/(4KH). \quad (D.12)$$
Here $\Gamma(d_0)$ is defined in (C.4), and $\gamma$ and $C_2$ are defined in Assumption 4.6. Let $\mathcal{C}$ be a minimal $1/(4d_0^{1/2}KH)$-covering of the Euclidean ball $\{v\in\mathbb{R}^{d_0}:\|v\|_2\le R\}$ with respect to the $\ell_2$-norm distance. For any $P\in\mathcal{P}$ with parameter $w$, we define $\widehat{P}$ and $\bar{P}$ by
$$\widehat{P}(x) = \nabla_w\mathrm{NN}(x;w^0)^{\top}(w-w^0), \quad \bar{P}(x) = \sum_{j=1}^{d_0}\lambda_j\cdot\langle\widehat{P},\phi_j\rangle_{\mathcal{H}}\cdot\phi_j(x)$$
for any $x\in\mathcal{Y}$. Also, we write $v=\big(\langle\widehat{P},\sqrt{\lambda_1}\,\phi_1\rangle_{\mathcal{H}},\ldots,\langle\widehat{P},\sqrt{\lambda_{d_0}}\,\phi_{d_0}\rangle_{\mathcal{H}}\big)^{\top}$. Note that, because $\{\sqrt{\lambda_j}\,\phi_j\}_{j=1}^{\infty}$ is an orthonormal basis of $\mathcal{H}$ and $w\in B_R$, we have $\|v\|_2=\|\bar{P}\|_{\mathcal{H}}\le\|\widehat{P}\|_{\mathcal{H}}\le R$. Thus, there exists $v^*\in\mathcal{C}$ with $\|v-v^*\|_2\le1/(4d_0^{1/2}KH)$. We define $P^*\in\mathcal{H}_R$ by $P^*(x)=\sum_{j=1}^{d_0}v_j^*\cdot\sqrt{\lambda_j}\cdot\phi_j(x)$.

By the same arguments as in (C.22) and (C.23), for any $x\in\mathcal{Y}$, we have
$$|\widehat{P}(x)-\bar{P}(x)|\le\Gamma(d_0), \quad |\bar{P}(x)-P^*(x)|\le1/(4KH). \quad (D.13)$$
Then, by (D.12), (D.13), and Condition 4.4, for any $x\in\mathcal{Y}$, it holds that
$$|P(x)-P^*(x)| \le |P(x)-\widehat{P}(x)| + |\widehat{P}(x)-\bar{P}(x)| + |\bar{P}(x)-P^*(x)| \le \xi_m+\Gamma(d_0)+1/(4KH) \le 3/(4KH). \quad (D.14)$$
Moreover, since $P^*\in\mathcal{H}_R$, there exists $w_v\in B_R$ such that $P^*(x)=\nabla_w\mathrm{NN}(x;w^0)^{\top}(w_v-w^0)$ for any $x\in\mathcal{Y}$. By Condition 4.4, we have
$$|P^*(x)-\mathrm{NN}(x;w_v)| \le \xi_m \le 1/(4KH). \quad (D.15)$$
Combining (D.14) and (D.15), we have
$$|P(x)-\mathrm{NN}(x;w_v)| \le 1/(KH). \quad (D.16)$$
Note that, because the entries of $\nabla_w\mathrm{NN}(\cdot;w^0)$ are linearly independent, $w_v$ is unique for any $v$ with $\|v\|_2\le R$. We define
$$\mathcal{P}_{\mathcal{C}} = \big\{P: \exists\, v\in\mathcal{C}\ \text{s.t.}\ P(x)=\mathrm{NN}(x;w_v)\ \text{for any } x\in\mathcal{Y}\big\}.$$
Then, by (D.16), $\mathcal{P}_{\mathcal{C}}$ is a $1/(KH)$-covering of $\mathcal{P}$ with respect to the $\ell_\infty$-norm distance. Following the same argument as in (C.25), we have
$$\log N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1}) \le \log N_{1/(KH)}(\mathcal{P},\|\cdot\|_\infty) \le \log|\mathcal{P}_{\mathcal{C}}| \le \log|\mathcal{C}| \le C_{13}\cdot d_0\cdot\log(d_0KRH) \le C_{11}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(KRH),$$
where $C_{13},C_{11}>0$ are absolute constants. Thus, we conclude the proof of Lemma D.2.

Proof of Theorem 4.7

Proof. Recall that $d=K\wedge\dim_E(\mathcal{Z}_{\mathcal{P}},1/K)$.
By the result for a general $\mathcal{P}$ in Theorem 4.12, we have
$$\mathrm{Regret}(T) \le \sqrt{2H^3T\cdot\log|\mathcal{A}|} + \sqrt{32H^2T\cdot\log(2/p)} + H(dH+1) + 4\sqrt{d\beta HT} \quad (D.17)$$
with probability at least $1-p$, when we set $\alpha = \sqrt{2\log|\mathcal{A}|/(KH^2)}$ and
$$\beta \ge 2H^2\cdot\log\big(N_{1/T}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot2H/p\big) + 4\Big(H+\sqrt{H^2/4\cdot\log(8K^2H/p)}\Big). \quad (D.18)$$
By Lemma D.2, we have $\log N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1})\le C_{11}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT)$, which implies that there exists an absolute constant $C_5>0$ such that
$$2H^2\cdot\log\big(N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot2H/p\big) + 4\Big(H+\sqrt{H^2/4\cdot\log(8K^2H/p)}\Big) \le C_5H^2\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT/p).$$
In other words, (D.18) holds if we set $\beta = C_5H^2\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT/p)$. On the other hand, by Lemma D.1, we have $d\le C_{10}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT)$. Plugging these into (D.17), we obtain
$$\mathrm{Regret}(T) \le \sqrt{2H^3T\cdot\log|\mathcal{A}|} + \sqrt{32H^2T\cdot\log(2/p)} + H + C_{10}H^2\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(RT) + 4\sqrt{C_5C_{10}H^3T}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{(1+1/\gamma)/2}(RT)\cdot\log^{(1+1/\gamma)/2}(RT/p) \le C_6\sqrt{H^3T}\cdot\log^2(1/\gamma)/\gamma\cdot\log^{1+1/\gamma}(|\mathcal{A}|RT/p)$$
with probability at least $1-p$, where $C_6>0$ is an absolute constant. Thus, we conclude the proof of Theorem 4.7.

E IMPLICIT LINEARIZATION

For any $w\in\mathbb{R}^m$ such that $\|w-w^0\|_2\le R$, we have
$$\|\nabla_w\mathrm{NN}(x;w)-\nabla_w\mathrm{NN}(x;w^0)\|_2^2 = \frac{1}{m/d_{\mathcal{Y}}}\sum_{j=1}^{m/d_{\mathcal{Y}}}\big\|b_j\sigma'(x^{\top}w_j)\cdot x - b_j\sigma'(x^{\top}w_j^0)\cdot x\big\|_2^2 = \frac{1}{m/d_{\mathcal{Y}}}\sum_{j=1}^{m/d_{\mathcal{Y}}}\big(\sigma'(x^{\top}w_j)-\sigma'(x^{\top}w_j^0)\big)^2\cdot(b_j)^2\cdot\|x\|_2^2.$$
Note that, by the assumption that $\sigma$ is 1-smooth, we have
$$\big(\sigma'(x^{\top}w_j)-\sigma'(x^{\top}w_j^0)\big)^2 \le \big(x^{\top}w_j-x^{\top}w_j^0\big)^2 \le \|w_j-w_j^0\|_2^2.$$
Thus, we have
$$\|\nabla_w\mathrm{NN}(x;w)-\nabla_w\mathrm{NN}(x;w^0)\|_2^2 \le \frac{1}{m/d_{\mathcal{Y}}}\sum_{j=1}^{m/d_{\mathcal{Y}}}\|w_j-w_j^0\|_2^2 \le d_{\mathcal{Y}}R^2/m. \quad (E.2)$$
By the mean value theorem, for any $w$ with $\|w-w^0\|_2\le R$, there exists $w^{\dagger}$ with $\|w^{\dagger}-w^0\|_2\le R$ such that
$$\mathrm{NN}(x;w)-\mathrm{NN}(x;w^0) = \nabla_w\mathrm{NN}(x;w^{\dagger})^{\top}(w-w^0) = \nabla_w\mathrm{NN}(x;w^0)^{\top}(w-w^0) + \big(\nabla_w\mathrm{NN}(x;w^{\dagger})-\nabla_w\mathrm{NN}(x;w^0)\big)^{\top}(w-w^0),$$
combining which with (E.2) we obtain
$$|\mathrm{NN}(x;w)-\nabla_w\mathrm{NN}(x;w^0)^{\top}(w-w^0)| = |\mathrm{NN}(x;w)-\mathrm{NN}(x;w^0)-\nabla_w\mathrm{NN}(x;w^0)^{\top}(w-w^0)| = \big|\big(\nabla_w\mathrm{NN}(x;w^{\dagger})-\nabla_w\mathrm{NN}(x;w^0)\big)^{\top}(w-w^0)\big| \le \|\nabla_w\mathrm{NN}(x;w^{\dagger})-\nabla_w\mathrm{NN}(x;w^0)\|_2\cdot\|w-w^0\|_2 \le d_{\mathcal{Y}}^{1/2}R^2m^{-1/2}.$$
Therefore, we have $\xi_m\le d_{\mathcal{Y}}^{1/2}R^2m^{-1/2}$. Then Condition 4.4 holds when $m\ge d_{\mathcal{Y}}R^4K^3H^2$. Thus, we conclude the proof of Lemma 4.5.
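As an illustrative sanity check (not part of the proof), the $R^2/\sqrt{m}$ linearization error established above can be observed numerically. The sketch below uses tanh as a 1-smooth activation and a simplified scaling without the $d_{\mathcal{Y}}$-blocked weight layout; the asserted bound $R^2/\sqrt{m}$ follows the same Cauchy-Schwarz argument as in the proof.

```python
import numpy as np

# Numerical illustration of Lemma 4.5: for a symmetrically initialized
# two-layer network with 1-smooth activation, the NTK linearization at w0
# has error at most R^2 / sqrt(m) when ||w - w0||_2 <= R.
rng = np.random.default_rng(0)
d, R = 4, 1.0
errs = []
for m in (100, 10_000):
    w0 = rng.normal(size=(m, d))
    w0[m // 2:] = w0[:m // 2]                     # symmetric initialization
    b = np.concatenate([np.ones(m // 2), -np.ones(m // 2)])
    x = rng.normal(size=d)
    x /= np.linalg.norm(x)                        # ||x||_2 = 1
    delta = rng.normal(size=(m, d))
    delta *= R / np.linalg.norm(delta)            # ||w - w0||_2 = R
    w = w0 + delta
    nn = (b * np.tanh(w @ x)).sum() / np.sqrt(m)  # NN(x; w); NN(x; w0) = 0
    # gradient of NN at w0 w.r.t. each w_j is b_j * tanh'(x^T w_j^0) * x / sqrt(m)
    grad0 = b[:, None] * (1.0 - np.tanh(w0 @ x) ** 2)[:, None] * x / np.sqrt(m)
    lin = float((grad0 * delta).sum())            # grad NN(x; w0)^T (w - w0)
    err = abs(nn - lin)
    assert err <= R**2 / np.sqrt(m)               # deterministic bound
    errs.append(err)
```

Doubling the width by a factor of 100 tightens the guaranteed bound from 0.1 to 0.01, matching the $m^{-1/2}$ rate.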

F SUPPORTING LEMMAS

F.1 DECOMPOSITION

For notational simplicity, we define the linear operator $J_h^k$ for $(k,h)\in[K]\times[H]$ by
$$(J_h^kf)(s) = \mathbb{E}\big[f(s,a)\,\big|\,a\sim\pi_h^k(\cdot\mid s)\big], \quad \text{for any } s\in\mathcal{S},\ f\in[0,H]^{\mathcal{S}\times\mathcal{A}}.$$

Lemma F.1 (Martingale Decomposition). For any $k\in[K]$, we have
$$V_1^k(s_1^k) - V_1^{\pi^k}(s_1^k) = \sum_{h=1}^{H}\big(D_{h,1}^k + D_{h,2}^k\big) + \sum_{h=1}^{H}\Big(Q_h^k(s_h^k,a_h^k) - \big(r_h^k(s_h^k,a_h^k) + \mathbb{P}_hV_{h+1}^k(s_h^k,a_h^k)\big)\Big),$$
where $D_{h,1}^k$ and $D_{h,2}^k$ take the forms
$$D_{h,1}^k = \big(J_h^k(Q_h^k-Q_h^{\pi^k,k})\big)(s_h^k) - (Q_h^k-Q_h^{\pi^k,k})(s_h^k,a_h^k),$$
$$D_{h,2}^k = (\mathbb{P}_hV_{h+1}^k-\mathbb{P}_hV_{h+1}^{\pi^k,k})(s_h^k,a_h^k) - (V_{h+1}^k-V_{h+1}^{\pi^k,k})(s_{h+1}^k).$$
Moreover, we have $D_{H,2}^k=0$ for any $k\in[K]$, and the sequence
$$D_{1,1}^1,\ D_{1,2}^1+D_{2,1}^1,\ D_{2,2}^1+D_{3,1}^1,\ \ldots,\ D_{H-1,2}^1+D_{H,1}^1,\ D_{1,1}^2,\ D_{1,2}^2+D_{2,1}^2,\ D_{2,2}^2+D_{3,1}^2,\ \ldots,\ D_{H-1,2}^2+D_{H,1}^2,\ \ldots \quad (F.1)$$
is a martingale difference sequence with respect to the filtration $\{\widetilde{\mathcal{F}}_t\}_{t\ge1}$ in Definition F.2, and each term is bounded by $4H$ in absolute value.

Proof. For any $(k,h)\in[K]\times[H]$, by the definition of the operator $J_h^k$, we have
$$V_h^k(s_h^k)-V_h^{\pi^k}(s_h^k) = (J_h^kQ_h^k)(s_h^k)-(J_h^kQ_h^{\pi^k,k})(s_h^k) = (Q_h^k-Q_h^{\pi^k,k})(s_h^k,a_h^k) + \underbrace{\big(J_h^k(Q_h^k-Q_h^{\pi^k,k})\big)(s_h^k) - (Q_h^k-Q_h^{\pi^k,k})(s_h^k,a_h^k)}_{=\,D_{h,1}^k}, \quad (F.2)$$
where we denote the second term on the right-hand side by $D_{h,1}^k$. Also, we have
$$(Q_h^k-Q_h^{\pi^k,k})(s_h^k,a_h^k) = (Q_h^k-r_h^k-\mathbb{P}_hV_{h+1}^{\pi^k,k})(s_h^k,a_h^k) = (\mathbb{P}_hV_{h+1}^k-\mathbb{P}_hV_{h+1}^{\pi^k,k})(s_h^k,a_h^k) + (Q_h^k-r_h^k-\mathbb{P}_hV_{h+1}^k)(s_h^k,a_h^k) = \big(V_{h+1}^k(s_{h+1}^k)-V_{h+1}^{\pi^k,k}(s_{h+1}^k)\big) + (Q_h^k-r_h^k-\mathbb{P}_hV_{h+1}^k)(s_h^k,a_h^k) + D_{h,2}^k, \quad (F.3)$$
where we denote the third term on the right-hand side by $D_{h,2}^k$.
Combining (F.2) and (F.3), we obtain
$$V_h^k(s_h^k)-V_h^{\pi^k}(s_h^k) - \big(V_{h+1}^k(s_{h+1}^k)-V_{h+1}^{\pi^k,k}(s_{h+1}^k)\big) = D_{h,1}^k + D_{h,2}^k + (Q_h^k-r_h^k-\mathbb{P}_hV_{h+1}^k)(s_h^k,a_h^k). \quad (F.4)$$
Note that $V_{H+1}^k(\cdot)=0$ for any $k\in[K]$. Summing the identity (F.4) over $h\in[H]$, we have
$$V_1^k(s_1^k)-V_1^{\pi^k}(s_1^k) = \sum_{h=1}^{H}\Big(V_h^k(s_h^k)-V_h^{\pi^k}(s_h^k) - \big(V_{h+1}^k(s_{h+1}^k)-V_{h+1}^{\pi^k,k}(s_{h+1}^k)\big)\Big) = \sum_{h=1}^{H}\big(D_{h,1}^k+D_{h,2}^k\big) + \sum_{h=1}^{H}\Big(Q_h^k(s_h^k,a_h^k)-r_h^k(s_h^k,a_h^k)-\mathbb{P}_hV_{h+1}^k(s_h^k,a_h^k)\Big).$$

Given $\{(X_\tau,Y_\tau)\}_{\tau=1}^{t}$, the least-squares predictor is defined by
$$\hat{z}_t = \operatorname*{argmin}_{z\in\mathcal{Z}}\sum_{\tau=1}^{t}\big(z(X_\tau)-Y_\tau\big)^2. \quad (F.7)$$
We say that $\eta$ is conditionally $\sigma$-sub-Gaussian given $\mathcal{F}_\tau\in\mathcal{F}$ for any $\tau\ge1$ if for any $\lambda\in\mathbb{R}$,
$$\log\mathbb{E}[\exp(\lambda\eta)\mid\mathcal{F}_\tau]\le\lambda^2\sigma^2/2.$$
For any $\varepsilon>0$, we denote by $N_\varepsilon(\mathcal{Z},\|\cdot\|_\infty)$ the $\varepsilon$-covering number of $\mathcal{Z}$ with respect to the supremum-norm distance $\|z_1-z_2\|_\infty=\sup_{x}|z_1(x)-z_2(x)|$. For any $\beta>0$, we define
$$\mathcal{Z}_t(\beta) = \Big\{z\in\mathcal{Z}: \sum_{\tau=1}^{t}\big(z(X_\tau)-\hat{z}_t(X_\tau)\big)^2\le\beta\Big\}. \quad (F.8)$$

Lemma F.3 (Proposition 6 of Russo & Van Roy (2014)). Assume that for any $t\in\mathbb{N}$, $Y_t-z^*(X_t)$ is conditionally $\sigma$-sub-Gaussian given $\mathcal{F}_{t-1}$. Then, for any $\varepsilon>0$ and $\delta\in[0,1]$, with probability at least $1-\delta$, it holds that $z^*\in\mathcal{Z}_t(\beta_t(\delta,\varepsilon))$ for any $t\in\mathbb{N}$, where
$$\beta_t(\delta,\varepsilon) = 8\sigma^2\cdot\log\big(N_\varepsilon(\mathcal{Z},\|\cdot\|_\infty)/\delta\big) + 4t\varepsilon\Big(C + \sigma\sqrt{\log\big(4t(t+1)/\delta\big)}\Big).$$

Lemma F.4. For any $\delta\in[0,1]$, if we let
$$\beta \ge 2H^2\cdot\log\big(N_{1/(KH)}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot H/\delta\big) + 4\Big(H+\sqrt{H^2/4\cdot\log(4K^2H/\delta)}\Big)$$
in Algorithm 1, then, with probability at least $1-\delta$, for any $(k,h)\in[K]\times[H]$, we have $\mathbb{P}_h\in\mathcal{P}_h^k$.

Proof. Recall that, for any $P\in\mathcal{P}$, we define $z_P:\mathcal{S}\times\mathcal{A}\times[0,H]^{\mathcal{S}}\to[0,H]$ by
$$z_P\big(s,a,V(\cdot)\big) = \int_{\mathcal{S}}V(s')\cdot P(s'\mid s,a)\,ds', \quad \forall\,\big(s,a,V(\cdot)\big)\in\mathcal{S}\times\mathcal{A}\times[0,H]^{\mathcal{S}}.$$
Let $\mathcal{Z}=\mathcal{Z}_{\mathcal{P}}=\{z_P:P\in\mathcal{P}\}$. For any $(k,h)\in[K]\times[H]$, we set $Y_k=V_{h+1}^k(s_{h+1}^k)$, $X_k=(s_h^k,a_h^k,V_{h+1}^k(\cdot))$, and $z^*=z_{\mathbb{P}_h}$.
Then $Y_\tau-z^*(X_\tau)$ is conditionally $H/2$-sub-Gaussian given $\widetilde{\mathcal{F}}_{i(k,h)}$, defined in Definition F.2. Then, by the definitions of $\mathcal{P}_h^k$ in (3.4) and $\mathcal{Z}_k(\beta)$ in (F.8), we have $\mathcal{Z}_k(\beta)=\{z_P:P\in\mathcal{P}_h^k\}$. By setting
$$\beta \ge 2H^2\cdot\log\big(N_{1/K}(\mathcal{Z})\cdot H/\delta\big) + 4\Big(H+\sqrt{H^2/4\cdot\log(4K^2H/\delta)}\Big)$$
in Algorithm 1, it holds that
$$\beta \ge 2H^2\cdot\log\big(N_{1/K}(\mathcal{Z})\cdot H/\delta\big) + 4(k-1)/K\cdot\Big(H+\sqrt{H^2/4\cdot\log(4k(k-1)H/\delta)}\Big) = \beta_{k-1}(\delta/H,1/K)$$
for any $k\in[K]$, where $\beta_{k-1}(\delta/H,1/K)$ is defined in Lemma F.3. Applying Lemma F.3 with $\sigma=H/2$ and $\varepsilon=1/K$, together with a union bound over $h\in[H]$, yields $\mathbb{P}_h\in\mathcal{P}_h^k$ for all $(k,h)\in[K]\times[H]$ with probability at least $1-\delta$.

For the integral bound, we have
$$\int_{d_0}^{\infty}\exp(-C_2t^{\gamma}/2)\,dt \le \frac{2d_0^{1-\gamma}/(\gamma C_2)}{1-2(1-\gamma)d_0^{-\gamma}/(\gamma C_2)}\cdot\exp(-C_2d_0^{\gamma}/2) \le 4d_0^{1-\gamma}/(\gamma C_2)\cdot\exp(-C_2d_0^{\gamma}/2),$$
where the first inequality holds because $1-\gamma\ge0$ and $t^{-\gamma}\le d_0^{-\gamma}$ for $t\ge d_0$, and the last inequality holds because $1-2(1-\gamma)d_0^{-\gamma}/(\gamma C_2)\ge1/2$ by (F.9). Then, by the fact that $\sum_{j>d_0}\exp(-C_2j^{\gamma}/2)\le\int_{d_0}^{\infty}\exp(-C_2t^{\gamma}/2)\,dt$, we conclude the proof of Lemma F.6.

Recall that $\Gamma(d_0)$ is defined in (C.4).

Lemma F.7. Let $C_1$ and $C_2$ be the absolute constants in Assumption 4.2. There exists an absolute constant $C$ such that for any $\gamma\in(0,1/2)$, $t\ge1$, and $R\ge2$, if we set $d_0=\lceil C\log(1/\gamma)/\gamma\cdot\log^{1/\gamma}(tRH)\rceil$, then it holds that $d_0^{\gamma}\ge4(1-\gamma)(\gamma C_2)^{-1}$ and
$$\Gamma(d_0) = 4\sqrt{C_1}\,d_0^{1-\gamma}RH(\gamma C_2)^{-1}\cdot\exp(-C_2d_0^{\gamma}/2)\le1/t. \quad (F.10)$$

Proof. For any $y>0$, we consider the function $f(x)=e^x/x^y$ for $x>0$. Taking derivatives, we have
$$f'(x) = \frac{e^xx^{y-1}(x-y)}{x^{2y}}.$$
Note that $f'(x)\ge0$ if and only if $x\ge y$, so $f$ is minimized at $x=y$, which implies $f(x)\ge f(y)$ for any $x>0$. Rearranging the inequality, we obtain $e^x\ge(ex/y)^y$ for any $x>0$ and $y>0$. Applying the above inequality with $x=C_2d_0^{\gamma}/4$ and $y=(1-\gamma)/\gamma$ yields $d_0^{1-\gamma}\exp(-C_2d_0^{\gamma}/4)\le\big(4(1-\gamma)/(e\gamma C_2)\big)^{(1-\gamma)/\gamma}$. Hence, to

G.1 SQUARED EXPONENTIAL KERNEL

The squared exponential kernel is defined as
$$K_{\mathrm{se}}(x,y) = \exp\big\{-1/\iota^2\cdot\|x-y\|_2^2\big\}, \quad \text{for any } x,y\in\mathcal{Y}, \quad (G.2)$$
where the constant $\iota$ satisfies $\iota^2\ge2/d_{\mathcal{Y}}$. For any $u\in[-1,1]$, we define $k(u)=\exp\{-2(1-u)/\iota^2\}$ and
$$P_j(u) = (-1/2)^j\cdot\frac{\Gamma\big((d_{\mathcal{Y}}-1)/2\big)}{\Gamma\big((2j+d_{\mathcal{Y}}-1)/2\big)}\cdot(1-u^2)^{(3-d_{\mathcal{Y}})/2}\cdot\Big(\frac{d}{du}\Big)^j\big[(1-u^2)^{j+(d_{\mathcal{Y}}-3)/2}\big], \quad (G.3)$$
where, with a slight abuse of notation, we use $\Gamma(\cdot)$ to denote the Gamma function in this section.

Lemma G.1 (Theorem 2 of Minh et al. (2006)). For the kernel $K_{\mathrm{se}}$ defined in (G.2), the eigenvalues $\{\rho_j\}_{j\ge1}$ (without duplicates) of the corresponding integral operator $T_{K_{\mathrm{se}}}$ take the form
$$\rho_j = \frac{|\mathbb{S}^{d_{\mathcal{Y}}-2}|}{|\mathbb{S}^{d_{\mathcal{Y}}-1}|}\cdot\int_{-1}^{1}k(u)\cdot P_j(u)\cdot(1-u^2)^{(d_{\mathcal{Y}}-3)/2}\,du.$$
Then, for any $j\ge1$, $f\in Y_j$, and $x\in\mathbb{S}^{d_{\mathcal{Y}}-1}$, we have
$$\int_{\mathbb{S}^{d_{\mathcal{Y}}-1}}K(x,y)f(y)\,d\mu(y) = \frac{|\mathbb{S}^{d_{\mathcal{Y}}-2}|}{|\mathbb{S}^{d_{\mathcal{Y}}-1}|}\cdot\int_{-1}^{1}k_2(u)\cdot P_j(u)\cdot(1-u^2)^{(d_{\mathcal{Y}}-3)/2}\,du\cdot f(x),$$
where $P_j(u)$ is defined in (G.3). We let $k_2(u)=u\cdot\exp\{u-1\}$. Recalling the definition of $k$ in Section G.1, we have $k_2(u)=u\cdot k(u)$ for $\iota=\sqrt2$, which satisfies the requirement of Lemma G.1. Lemma G.2 shows that the eigenvalues $\{\tilde{\rho}_j\}_{j\ge1}$ (with duplicates) of $T_K$ take the form
$$\tilde{\rho}_j = C_\rho\cdot\int_{-1}^{1}k_2(u)\cdot P_j(u)\cdot(1-u^2)^{(d_{\mathcal{Y}}-3)/2}\,du = C_\rho\cdot\int_{-1}^{1}u\cdot k(u)\cdot P_j(u)\cdot(1-u^2)^{(d_{\mathcal{Y}}-3)/2}\,du,$$
where $C_\rho=|\mathbb{S}^{d_{\mathcal{Y}}-2}|/|\mathbb{S}^{d_{\mathcal{Y}}-1}|$. Using the relation
$$u\cdot P_j(u) = \frac{j}{2j+d_{\mathcal{Y}}-2}\cdot P_{j-1}(u) + \frac{j+d_{\mathcal{Y}}-2}{2j+d_{\mathcal{Y}}-2}\cdot P_{j+1}(u),$$
which follows from the definition of $P_j(u)$, we have
$$\tilde{\rho}_j = \frac{j}{2j+d_{\mathcal{Y}}-2}\cdot\rho_{j-1} + \frac{j+d_{\mathcal{Y}}-2}{2j+d_{\mathcal{Y}}-2}\cdot\rho_{j+1},$$
where $\{\rho_j\}_{j\ge1}$ are the eigenvalues of the operator $T_{K_{\mathrm{se}}}$ studied in Section G.1 with $\iota=\sqrt2$. Thus, following the same argument as in (G.6)-(G.8), we know such an NTK satisfies the second condition of Assumption 4.2.
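The near-exponential spectral decay of the squared exponential kernel can also be observed empirically from the Gram matrix of random points on the sphere. This is an illustration, not a proof; the sample size, dimension, and the index 30 below are arbitrary choices.

```python
import numpy as np

# Empirical spectrum of the squared exponential kernel on S^{d-1}:
# its eigenvalues decay near-exponentially, so the 30th Gram eigenvalue
# is already negligible relative to the largest one.
rng = np.random.default_rng(0)
n, d = 200, 3
X = rng.normal(size=(n, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)      # points on the unit sphere
iota2 = 2.0                                        # iota^2 >= 2/d as required
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
G = np.exp(-sq / iota2)                            # squared exponential Gram matrix
eigs = np.linalg.eigvalsh(G)[::-1]                 # eigenvalues, descending
assert eigs[30] < 1e-2 * eigs[0]                   # rapid spectral decay
```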



$$\|P-P'\|_{\infty,1} = \sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}}\big|P(s'\mid s,a)-P'(s'\mid s,a)\big|\,ds', \quad \text{for any } P,P'\in\mathcal{P}.$$

For polynomially decaying eigenvalues, i.e., $\lambda_j\le C_{\mathrm{poly}}\cdot j^{-\gamma}$ for some constants $C_{\mathrm{poly}},\gamma>0$ and any $j\ge1$, we can still truncate at dimension $d_0$ and calculate $\Gamma(d_0)$ in (C.4) using the polynomial eigenvalue decay condition. It can be shown that $\Gamma(d_0)$ decays polynomially in $d_0$. Then, we can find a proper $d_0$ by solving (C.12) and (C.21) and follow the same proof afterwards.
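As an illustrative sanity check (not part of the proof), the truncation bound behind (C.3)-(C.4) for the exponential decay case can be verified numerically: under the decay condition, the tail sum of $\sqrt{\lambda_j}$ beyond $d_0$ is dominated by $\Gamma(d_0)$. The constants $C_1$, $C_2$, $\gamma$, $R$, $H$ below are arbitrary toy choices.

```python
import numpy as np

# Numerical check of the tail bound behind (C.3)-(C.4): with
# sqrt(lambda_j) <= C1 * exp(-C2 * j^gamma / 2), the tail sum beyond d0
# is at most Gamma(d0) = 4*C1*R*H*d0^(1-gamma)/(gamma*C2) * exp(-C2*d0^gamma/2),
# provided d0^gamma >= 4*(1-gamma)/(gamma*C2) (condition (F.9)).
C1, C2, gamma, R, H = 1.0, 1.0, 0.4, 1.0, 1.0
d0 = 100
assert d0**gamma >= 4 * (1 - gamma) / (gamma * C2)     # condition (F.9)
j = np.arange(d0 + 1, 200_000)
tail = float((C1 * np.exp(-C2 * j**gamma / 2)).sum()) * R * H
Gamma = 4 * C1 * R * H * d0**(1 - gamma) / (gamma * C2) \
    * np.exp(-C2 * d0**gamma / 2)
assert tail <= Gamma
```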


Two-layer fully-connected neural networks: A two-layer fully-connected neural network is defined as
$$\mathrm{NN}(x;w) = \frac{1}{\sqrt{m/d_{\mathcal{Y}}}}\sum_{j=1}^{m/d_{\mathcal{Y}}}b_j\cdot\sigma(x^{\top}w_j), \quad (E.1)$$
where, without loss of generality, we assume $m$ is an integer multiple of $2d_{\mathcal{Y}}$. Here $\sigma$ is the activation function. The weight vectors $w$ and $b$, corresponding to the first and second layers respectively, take the forms
$$w = (w_1,\ldots,w_{m/d_{\mathcal{Y}}})\in\mathbb{R}^m, \quad b = (b_1,\ldots,b_{m/d_{\mathcal{Y}}})\in\mathbb{R}^{m/d_{\mathcal{Y}}},$$
where each $w_j\in\mathbb{R}^{d_{\mathcal{Y}}}$. During training, we only tune the weights in $w$.

Symmetric initialization: When initializing the neural network, we generate the initial weight vectors $w^0$ and $b$ by
$$w_j^0\sim N\big(0,\,d_{\mathcal{Y}}^{-1}\cdot I_{d_{\mathcal{Y}}\times d_{\mathcal{Y}}}\big), \quad b_j\sim\mathrm{Unif}(\{-1,1\}), \quad w_{j+m/(2d_{\mathcal{Y}})}^0 = w_j^0, \quad b_{j+m/(2d_{\mathcal{Y}})} = -b_j \quad \text{for any } j\in[m/(2d_{\mathcal{Y}})].$$
As a result of such initialization, we have $\mathrm{NN}(\cdot;w^0)=0$, and we can generalize the result to multilayer neural networks by setting the last two layers in this manner.

Proof of Lemma 4.5

Proof. Let $\mathrm{NN}$ be a two-layer fully-connected neural network of the form (E.1), where the activation function $\sigma$ is 1-smooth and the second-layer weights satisfy $b_j\in\{-1,1\}$ for any $j\in[m/d_{\mathcal{Y}}]$.
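A minimal sketch of the symmetric initialization described above: mirroring the first-layer weights and negating the duplicated second-layer signs makes the two halves of the network cancel exactly, so $\mathrm{NN}(\cdot;w^0)=0$. The width, scaling, and activation below are illustrative choices rather than the paper's exact parametrization.

```python
import numpy as np

# Symmetric initialization: the second half of the network mirrors the
# first half with negated output signs, so the network is zero at w0.
rng = np.random.default_rng(0)
d, m = 4, 64                                   # m an even (toy) width
w0 = rng.normal(size=(m, d)) / np.sqrt(d)
b = rng.choice([-1.0, 1.0], size=m)
w0[m // 2:] = w0[:m // 2]                      # duplicated first-layer weights
b[m // 2:] = -b[:m // 2]                       # negated second-layer signs

def NN(x, w, b):
    return float((b * np.tanh(w @ x)).sum() / np.sqrt(len(b)))

x = rng.normal(size=d)
assert abs(NN(x, w0, b)) < 1e-12               # NN(.; w0) = 0 exactly
```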

obtain (F.10), it suffices to make the following inequality hold:
$$\exp(C_2d_0^{\gamma}/4) \ge t\sqrt{C_1}\,RH\,(\gamma C_2)^{-1}\cdot\Big(\frac{eC_2}{4}\cdot\frac{\gamma}{1-\gamma}\Big)^{-(1-\gamma)/\gamma}.$$
Since $\gamma\in(0,1/2)$ and $tRH\ge2$, there exists an absolute constant $\bar{C}$ such that both $4(1-\gamma)(\gamma C_2)^{-1}$ and the right-hand side requirement above are met once $d_0^{\gamma}\ge\bar{C}\cdot\log(1/\gamma)/\gamma\cdot\log(tRH)$. Therefore, by choosing $d_0\ge\big(\bar{C}\log(1/\gamma)/\gamma\cdot\log(tRH)\big)^{1/\gamma}$, we conclude the proof of Lemma F.7.

G EXAMPLES OF KERNELS WITH EXPONENTIALLY DECAYING EIGENVALUES

In this section, we provide examples of kernels that satisfy Assumption 4.2. We let $\mathcal{Y}=\mathbb{S}^{d_{\mathcal{Y}}-1}$, the unit sphere in $\mathbb{R}^{d_{\mathcal{Y}}}$. For any kernel $K$, we define the integral operator $T_K:L^2(\mathcal{Y})\to L^2(\mathcal{Y})$ by
$$(T_Kf)(x) = \int_{\mathcal{Y}}K(x,y)f(y)\,d\mu(y), \quad \text{for any } f\in L^2(\mathcal{Y}) \text{ and } x\in\mathcal{Y}, \quad (G.1)$$
where $\mu$ is the uniform measure on $\mathcal{Y}$.

and each $\rho_j$ has multiplicity
$$N(j) = \frac{(2j+d_{\mathcal{Y}}-2)\cdot(d_{\mathcal{Y}}+j-3)!}{j!\,(d_{\mathcal{Y}}-2)!}. \quad (G.4)$$
Moreover, when $\iota^2\ge2/d_{\mathcal{Y}}$, we have that $\{\rho_j\}_{j\ge1}$ is in decreasing order and satisfies
$$A_1\cdot\Big(\frac{e}{\iota^2}\Big)^j\cdot(2j+d_{\mathcal{Y}}-2)^{-(2j+d_{\mathcal{Y}}-1)/2} < \rho_j < A_2\cdot\Big(\frac{e}{\iota^2}\Big)^j\cdot(2j+d_{\mathcal{Y}}-2)^{-(2j+d_{\mathcal{Y}}-1)/2}. \quad (G.5)$$

Lemma G.2 (Funk-Hecke formula (Müller (2012), page 30)). Let $k_2:[-1,1]\to\mathbb{R}$ be a continuous function, which gives rise to an inner-product kernel $K$ on $\mathbb{S}^{d_{\mathcal{Y}}-1}\times\mathbb{S}^{d_{\mathcal{Y}}-1}$ with the definition $K(x,y)=k_2(x^{\top}y)$ for any $x,y\in\mathbb{S}^{d_{\mathcal{Y}}-1}$.






In the sequel, we show that the sequence in (F.1) is a bounded martingale difference sequence with respect to the filtration $\{\widetilde{\mathcal{F}}_t\}_{t\ge1}$. For any $(k,h)\in[K]\times[H]$, by the definitions of $D_{h,1}^k$ and $D_{h,2}^k$, each term of the sequence is bounded by $4H$ in absolute value. When $h=1$, we have
$$\mathbb{E}\big[D_{1,1}^k\,\big|\,\widetilde{\mathcal{F}}_{i(k,1)-1}\big] = \mathbb{E}\Big[\big(J_1^k(Q_1^k-Q_1^{\pi^k,k})\big)(s_1^k) - (Q_1^k-Q_1^{\pi^k,k})(s_1^k,a_1^k)\,\Big|\,\widetilde{\mathcal{F}}_{i(k,1)-1}\Big] = 0. \quad (F.5)$$
Here the second equality holds because the only randomness, conditional on $\widetilde{\mathcal{F}}_{i(k,1)-1}$, $s_1^k$, and $\pi^k$, lies in the action $a_1^k\sim\pi_1^k(\cdot\mid s_1^k)$. Similarly, when $h\ge2$, we have
$$\mathbb{E}\big[D_{h-1,2}^k+D_{h,1}^k\,\big|\,\widetilde{\mathcal{F}}_{i(k,h)-1}\big] = 0, \quad (F.6)$$
which holds because the only randomness, conditional on $\widetilde{\mathcal{F}}_{i(k,h)-1}$, lies in $s_h^k\sim\mathbb{P}_{h-1}(\cdot\mid s_{h-1}^k,a_{h-1}^k)$ and $a_h^k\sim\pi_h^k(\cdot\mid s_h^k)$. Combining (F.5) and (F.6), we obtain that the sequence in (F.1) is a martingale difference sequence. Therefore, we conclude the proof of Lemma F.1.

Definition F.2 (Filtration). We define the time index map $i(\cdot,\cdot)$ by $i(k,h)=(k-1)H+h$. For any $(k,h)\in[K]\times[H]$, we define $\widetilde{\mathcal{F}}_{i(k,h)}$ as the $\sigma$-algebra generated by the reward functions and state-action pairs determined before $s_{h+1}^k$ when $h\le H-1$, and by those determined before $s_1^{k+1}$ when $h=H$. Then the sequence $\{\widetilde{\mathcal{F}}_t\}_{t\ge1}$ forms a filtration. Note that $r^{\tau}=\{r_h^{\tau}\}_{h=1}^{H}$ are determined before the beginning of the $\tau$-th episode, although they are not revealed to the agent until the $\tau$-th episode ends.

F.2 CONCENTRATION

Let $\{(X_\tau,Y_\tau)\}_{\tau\ge1}$ be a sequence of random elements in $\mathcal{X}\times\mathbb{R}$ for some measurable set $\mathcal{X}$. Let $\mathcal{Z}$ be a set of $[0,C]$-valued measurable functions with domain $\mathcal{X}$ for some $C>0$.

The least-squares predictor $\hat{z}_t$ given $\{(X_\tau,Y_\tau)\}_{\tau=1}^{t}$ is defined by (F.7).

In the sequel, we prove that $N_\varepsilon(\mathcal{Z},\|\cdot\|_\infty)\le N_{\varepsilon/H}(\mathcal{P},\|\cdot\|_{\infty,1})$ for any $\varepsilon>0$. Indeed, this follows from the observation that, for any $z_P,z_{P'}\in\mathcal{Z}$ with $P,P'\in\mathcal{P}$, we have
$$\|z_P-z_{P'}\|_\infty = \sup_{(s,a,V)}\Big|\int_{\mathcal{S}}V(s')\cdot\big(P(s'\mid s,a)-P'(s'\mid s,a)\big)\,ds'\Big| \le H\cdot\|P-P'\|_{\infty,1}.$$
Thus, we conclude the proof of Lemma F.4.
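The confidence-set construction of Lemmas F.3 and F.4 can be simulated in a toy setting: least squares over a finite class of linear functions under sub-Gaussian noise, with $\beta$ set as in Lemma F.3. The class, grid resolution, noise level, and confidence level below are toy assumptions, not the paper's setting.

```python
import numpy as np

# Toy simulation of Lemma F.3: least squares over a finite class
# Z = { z_c(x) = c * x : c on a grid }, Gaussian (hence sub-Gaussian) noise,
# and the check that the truth lies in the confidence set Z_t(beta).
rng = np.random.default_rng(0)
grid = np.linspace(-1.0, 1.0, 201)        # finite class, |Z| = 201
c_star, sigma, t = 0.37, 0.1, 500
X = rng.uniform(-1.0, 1.0, size=t)
Y = c_star * X + sigma * rng.normal(size=t)
# least-squares predictor over the class, as in (F.7)
losses = ((grid[:, None] * X[None, :] - Y[None, :]) ** 2).sum(axis=1)
c_hat = grid[np.argmin(losses)]
# squared discrepancy of the truth against the least-squares fit
disc = float(((c_star - c_hat) ** 2 * X**2).sum())
# beta_t(delta, eps) with eps = grid resolution and C = 1 (range bound)
delta, eps, C = 0.05, 0.01, 1.0
beta = 8 * sigma**2 * np.log(len(grid) / delta) \
    + 4 * t * eps * (C + sigma * np.sqrt(np.log(4 * t * (t + 1) / delta)))
assert disc <= beta                       # z* lies in Z_t(beta) on this run
```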

F.3 ELUDER DIMENSION

Recall that $\mathcal{Z}$ is a set of $[0,C]$-valued functions with domain $\mathcal{X}$ for some $C>0$. Meanwhile, $\mathcal{Z}_k(\beta)$ is defined in (F.8).

Lemma F.5 (Lemma 5 of Russo & Van Roy (2014)). For any $\beta>0$, the width bound of Lemma 5 of Russo & Van Roy (2014) holds with $\dim_E(\mathcal{Z},1/K)$ replaced by $K\wedge\dim_E(\mathcal{Z},1/K)$.

Proof. When $\dim_E(\mathcal{Z},1/K)\le K$, the claim follows directly from Lemma 5 of Russo & Van Roy (2014). When $\dim_E(\mathcal{Z},1/K)>K$, since $\mathcal{Z}$ is a set of $[0,C]$-valued functions and $\mathcal{Z}_k(\beta)\subset\mathcal{Z}$ for any $k$ and $\beta$, the bound holds trivially. Thus, we conclude the proof of Lemma F.5.
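The elliptical-potential argument invoked in (C.14) (and implicitly behind the width bounds here) can be checked numerically. A minimal sketch with arbitrary dimension, regularization, and vectors of norm at most one; the asserted inequality is deterministic via the matrix determinant lemma.

```python
import numpy as np

# Elliptical potential check: with Lambda_1 = lam * I and
# Lambda_{t+1} = Lambda_t + v_t v_t^T, it holds that
#   sum_t min{1, v_t^T Lambda_t^{-1} v_t}
#     <= 2 * (logdet(Lambda_{T+1}) - logdet(Lambda_1)).
rng = np.random.default_rng(0)
d, lam, T = 5, 0.1, 300
Lam = lam * np.eye(d)
lhs = 0.0
for _ in range(T):
    v = rng.normal(size=d)
    v /= max(1.0, np.linalg.norm(v))     # keep ||v_t||_2 <= 1
    q = float(v @ np.linalg.solve(Lam, v))
    lhs += min(1.0, q)
    Lam += np.outer(v, v)
rhs = 2.0 * (np.linalg.slogdet(Lam)[1] - d * np.log(lam))
assert lhs <= rhs
```

The inequality holds term by term because $\min\{1,q\}\le 2\log(1+q)$ for $q\ge0$, and the log-determinants telescope exactly.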

F.4 OTHER USEFUL INEQUALITIES

Lemma F.6. For any $\gamma\in(0,1/2)$, $C_2>0$, and any $d_0\in\mathbb{N}$ such that
$$d_0^{\gamma}\ge4(1-\gamma)(\gamma C_2)^{-1}, \quad (F.9)$$
it holds that
$$\sum_{j>d_0}\exp(-C_2j^{\gamma}/2) \le 4d_0^{1-\gamma}(\gamma C_2)^{-1}\cdot\exp(-C_2d_0^{\gamma}/2).$$

Proof. By basic calculus, bounding the sum by the integral $\int_{d_0}^{\infty}\exp(-C_2t^{\gamma}/2)\,dt$ and comparing with the antiderivative of $d_0^{1-\gamma}\exp(-C_2t^{\gamma}/2)$-type terms yields the claim.

For any $j\ge1$, the constants $A_1,A_2$ only depend on $d_{\mathcal{Y}}$ and $\iota$. By Lemma G.1, the eigenvalues $\{\lambda_j\}_{j\ge1}$ (with duplicates) of the kernel $K_{\mathrm{se}}$ satisfy, for any $j\ge1$, the relation given in (G.6). By the definition of $N(j)$ in (G.4) and Stirling's formula, we obtain the multiplicity growth rate in (G.7); here the asymptotic notation omits constant factors that are independent of $j$. Combining (G.6) and (G.7), when $j$ is sufficiently large, we obtain (G.8). Then, by (G.5), we obtain $\lambda_j\le C_1\exp(-cj^{\gamma})$ as $j\to\infty$ for an absolute constant $c>0$. Thus, we know $K_{\mathrm{se}}$ satisfies the second condition of Assumption 4.2.

G.2 NTK OF SINE ACTIVATION

We consider the neural tangent kernel of a two-layer neural network of the form (E.1) whose activation function is the sine function. In detail, we modify the initial form in (E.1) by adding an intercept term, which is equivalent to adding one more dimension with constant value 1 to the input space. The initialization of the network weights follows the same symmetric random initialization scheme as in Section E; here, without loss of generality, we assume $m/(2d_{\mathcal{Y}}+2)\in\mathbb{N}$. The population NTK of such a parametrization is the limit of the empirical NTK defined in (4.2) as $m$ goes to infinity; in its derivation, the second equality follows from Rahimi & Recht (2007) and the third equality follows from the fact that $\|x\|_2=\|y\|_2=1$ for any $x,y\in\mathcal{Y}=\mathbb{S}^{d_{\mathcal{Y}}-1}$. For any $j\ge1$, let $Y_j$ be the set of all homogeneous harmonics of degree $j$ on $\mathbb{S}^{d_{\mathcal{Y}}-1}$, which is a finite-dimensional subspace of $L^2_\mu(\mathbb{S}^{d_{\mathcal{Y}}-1})$, the space of square-integrable functions on $\mathbb{S}^{d_{\mathcal{Y}}-1}$ with respect to $\mu$. It can be shown that the dimensionality of $Y_j$ is given by $N(j)$.

