VIPER: PROVABLY EFFICIENT ALGORITHM FOR OFFLINE RL WITH NEURAL FUNCTION APPROXIMATION

Abstract

We propose a novel algorithm for offline reinforcement learning called Value Iteration with Perturbed Rewards (VIPeR), which amalgamates the pessimism principle with random perturbations of the value function. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), which cannot easily scale to complex problems where a neural network is used to estimate the value functions. Instead, VIPeR implicitly obtains pessimism by simply perturbing the offline data multiple times with carefully designed i.i.d. Gaussian noises to learn an ensemble of estimated state-action value functions and acting greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., neural networks) to the perturbed datasets using gradient descent. As a result, VIPeR only needs $O(1)$ time complexity for action selection, while LCB-based algorithms require at least $\Omega(K^2)$, where $K$ is the total number of trajectories in the offline data. We also propose a novel data-splitting technique that helps remove a factor involving the log of the covering number in our bound. We prove that VIPeR yields a provable uncertainty quantifier with overparameterized neural networks and enjoys a sub-optimality bound of $\tilde{O}(\kappa H^{5/2} \tilde{d}/\sqrt{K})$, where $\tilde{d}$ is the effective dimension, $H$ is the horizon length, and $\kappa$ measures the distributional shift. We corroborate the statistical and computational efficiency of VIPeR with an empirical evaluation on a wide set of synthetic and real-world datasets. To the best of our knowledge, VIPeR is the first algorithm for offline RL that is provably efficient for general Markov decision processes (MDPs) with neural network function approximation.

1. INTRODUCTION

Offline reinforcement learning (offline RL) (Lange et al., 2012; Levine et al., 2020) is a practical paradigm of RL for domains where active exploration is not permissible. Instead, the learner can access a fixed dataset of previous experiences available a priori. Offline RL finds applications in several critical domains where exploration is prohibitively expensive or even implausible, including healthcare (Gottesman et al., 2019; Nie et al., 2021), recommendation systems (Strehl et al., 2010; Thomas et al., 2017), and econometrics (Kitagawa & Tetenov, 2018; Athey & Wager, 2021), among others. The recent surge of interest in this area and renewed research efforts have yielded several important empirical successes (Chen et al., 2021; Wang et al., 2023; 2022; Meng et al., 2021). A key challenge in offline RL is to efficiently exploit the given offline dataset to learn an optimal policy in the absence of any further exploration. The dominant approaches to offline RL address this challenge by incorporating uncertainty from the offline dataset into decision-making (Buckman et al., 2021; Jin et al., 2021; Xiao et al., 2021; Nguyen-Tang et al., 2022a; Ghasemipour et al., 2022; An et al., 2021; Bai et al., 2022). The main component of these uncertainty-aware approaches is the pessimism principle, which constrains the learned policy to the offline data and leads to various lower confidence bound (LCB)-based algorithms. However, these methods are not easily extended or scaled to complex problems where neural function approximation is used to estimate the value functions. In particular, it is costly to explicitly compute the statistical confidence regions of the model or value functions if the class of function approximators is given by overparameterized neural networks.
For example, constructing the LCB for neural offline contextual bandits (Nguyen-Tang et al., 2022a) and RL (Xu & Liang, 2022) requires computing the inverse of a large covariance matrix whose size scales with the number of parameters in the neural network. This computational cost hinders the practical application of these provably efficient offline RL algorithms. Therefore, a largely open question is how to design provably computationally efficient algorithms for offline RL with neural network function approximation. In this work, we present a solution based on a computational approach that combines the pessimism principle with randomizing the value function (Osband et al., 2016; Ishfaq et al., 2021). The algorithm is strikingly simple: we randomly perturb the offline rewards several times and act greedily with respect to the minimum of the estimated state-action values. The intuition is that taking the minimum over an ensemble of randomized state-action values can efficiently achieve pessimism with high probability while avoiding explicit computation of statistical confidence regions. We learn the state-action value function by training a neural network using gradient descent (GD). Further, we consider a novel data-splitting technique that helps remove the dependence on the potentially large log covering number in the learning bound. We show that the proposed algorithm yields a provable uncertainty quantifier with overparameterized neural network function approximation and achieves a sub-optimality bound of $\tilde{O}(\kappa H^{5/2} \tilde{d}/\sqrt{K})$, where $K$ is the total number of episodes in the offline data, $\tilde{d}$ is the effective dimension, $H$ is the horizon length, and $\kappa$ measures the distributional shift. We achieve computational efficiency since the proposed algorithm only needs $O(1)$ time complexity for action selection, while LCB-based algorithms require at least $\Omega(K^2)$ time complexity.
We empirically corroborate the statistical and computational efficiency of our proposed algorithm on a wide set of synthetic and real-world datasets. The experimental results show that the proposed algorithm has a strong advantage in computational efficiency while outperforming LCB-based neural algorithms. To the best of our knowledge, ours is the first offline RL algorithm that is both provably sample-efficient and computationally efficient in general MDPs with neural network function approximation.

2. RELATED WORK

Randomized value functions for RL. For online RL, Osband et al. (2016; 2019) were the first to explore randomization of estimates of the value function for exploration. Their approach was inspired by posterior sampling for RL (Osband et al., 2013), which samples a value function from a posterior distribution and acts greedily with respect to the sampled function. Concretely, Osband et al. (2016; 2019) generate randomized value functions by injecting Gaussian noise into the training data and fitting a model on the perturbed data. Jia et al. (2022) extended the idea of perturbing rewards to online contextual bandits with neural function approximation. Ishfaq et al. (2021) obtained a provably efficient method for online RL with general function approximation using perturbed rewards. While randomizing the value function is an intuitive approach to obtaining optimism in online RL, obtaining pessimism from randomized value functions can be tricky in offline RL. Indeed, Ghasemipour et al. (2022) point out a critical flaw in several popular existing methods for offline RL that update an ensemble of randomized Q-networks toward a shared pessimistic temporal difference target. In this paper, we propose a simple fix that obtains pessimism properly by updating each randomized value function independently and taking the minimum over the ensemble of randomized value functions to form a pessimistic value function.

Offline RL with function approximation. Provably efficient offline RL has been studied extensively for linear function approximation. Jin et al. (2021) were the first to show that pessimistic value iteration is provably efficient for offline linear MDPs. Xiong et al. (2023); Yin et al. (2022) improved upon Jin et al. (2021) by leveraging variance reduction. Xie et al. (2021) proposed a Bellman-consistency assumption with general function approximation, which improves the bound of Jin et al.
(2021) by a factor of $\sqrt{d}$ when specialized to finite action spaces and linear MDPs. Wang et al. (2021); Zanette (2021) studied the statistical hardness of offline RL with linear function approximation via exponential lower bounds, and Foster et al. (2021) suggested that realizability and strong uniform data coverage alone are not sufficient for sample-efficient offline RL. Beyond linearity, some works study offline RL with general function approximation, both parametric and nonparametric. These approaches are either based on Fitted-Q Iteration (FQI) (Munos & Szepesvári, 2008; Le et al., 2019; Chen & Jiang, 2019; Duan et al., 2021a;b; Hu et al., 2021; Nguyen-Tang et al., 2022b) or on the pessimism principle (Uehara & Sun, 2022; Nguyen-Tang et al., 2022a; Jin et al., 2021). While pessimism-based algorithms avoid the strong data-coverage assumptions used by FQI-based algorithms, they require explicit computation of valid confidence regions and possibly the inverse of a large covariance matrix, which is computationally prohibitive and does not scale to complex function approximation settings. This limits the applicability of pessimism-based, provably efficient offline RL in practice. A recent work by Bai et al. (2022) estimates the uncertainty for constructing the LCB via the disagreement of bootstrapped Q-functions; however, the uncertainty quantifier is only guaranteed in linear MDPs and must be computed explicitly. We provide a more detailed discussion of our technical contribution in the context of existing literature in Section C.1.

3. PRELIMINARIES

In this section, we provide basic background on offline RL and overparameterized neural networks.

3.1. EPISODIC TIME-INHOMOGENEOUS MARKOV DECISION PROCESSES (MDPS)

A finite-horizon Markov decision process (MDP) is denoted as the tuple $M = (\mathcal{S}, \mathcal{A}, P, r, H, d_1)$, where $\mathcal{S}$ is an arbitrary state space, $\mathcal{A}$ an arbitrary action space, $H$ the episode length, and $d_1$ the initial state distribution. We assume that $SA := |\mathcal{S}||\mathcal{A}|$ is finite but arbitrarily large, e.g., it can be as large as the total number of atoms in the observable universe, $\approx 10^{82}$. Let $\mathcal{P}(\mathcal{S})$ denote the set of probability measures over $\mathcal{S}$. A time-inhomogeneous transition kernel is a collection $P = \{P_h\}_{h=1}^H$, where $P_h : \mathcal{S} \times \mathcal{A} \to \mathcal{P}(\mathcal{S})$ maps each state-action pair $(s_h, a_h)$ to a probability distribution $P_h(\cdot \mid s_h, a_h)$. Let $r = \{r_h\}_{h=1}^H$, where $r_h : \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the mean reward function at step $h$. A policy $\pi = \{\pi_h\}_{h=1}^H$ assigns each state $s_h \in \mathcal{S}$ a probability distribution $\pi_h(\cdot \mid s_h)$ over the action space and induces a random trajectory $s_1, a_1, r_1, \ldots, s_H, a_H, r_H, s_{H+1}$, where $s_1 \sim d_1$, $a_h \sim \pi_h(\cdot \mid s_h)$, and $s_{h+1} \sim P_h(\cdot \mid s_h, a_h)$. We define the state value function $V^\pi_h \in \mathbb{R}^{\mathcal{S}}$ and the state-action value function $Q^\pi_h \in \mathbb{R}^{\mathcal{S} \times \mathcal{A}}$ at each timestep $h$ as $Q^\pi_h(s, a) = \mathbb{E}_\pi[\sum_{t=h}^H r_t \mid s_h = s, a_h = a]$ and $V^\pi_h(s) = \mathbb{E}_{a \sim \pi_h(\cdot \mid s)}[Q^\pi_h(s, a)]$, where the expectation $\mathbb{E}_\pi$ is taken with respect to the randomness of the trajectory induced by $\pi$. Let $\mathbb{P}_h$ denote the transition operator defined as $(\mathbb{P}_h V)(s, a) := \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)}[V(s')]$. For any $V : \mathcal{S} \to \mathbb{R}$, we define the Bellman operator at timestep $h$ as $(\mathbb{B}_h V)(s, a) := r_h(s, a) + (\mathbb{P}_h V)(s, a)$. The Bellman equations are given as follows: for any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$, $Q^\pi_h(s, a) = (\mathbb{B}_h V^\pi_{h+1})(s, a)$, $V^\pi_h(s) = \langle Q^\pi_h(s, \cdot), \pi_h(\cdot \mid s) \rangle_{\mathcal{A}}$, and $V^\pi_{H+1}(s) = 0$, where $[H] := \{1, 2, \ldots, H\}$ and $\langle \cdot, \cdot \rangle_{\mathcal{A}}$ denotes the summation over all $a \in \mathcal{A}$. We define an optimal policy $\pi^*$ as any policy that yields the optimal value function, i.e., $V^{\pi^*}_h(s) = \sup_\pi V^\pi_h(s)$ for any $(s, h) \in \mathcal{S} \times [H]$. For simplicity, we denote $V^{\pi^*}_h$ and $Q^{\pi^*}_h$ as $V^*_h$ and $Q^*_h$, respectively.
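As a concrete illustration, the Bellman equations above can be evaluated by backward induction in a small tabular MDP. The sketch below (helper name and toy dimensions are ours, not part of the paper's setup) makes the recursion explicit.

```python
import numpy as np

def policy_evaluation(P, r, pi, H):
    """Backward induction on the Bellman equations:
      Q_h = r_h + P_h V_{h+1},   V_h(s) = <Q_h(s, .), pi_h(.|s)>,
    with V_{H+1} = 0.  Shapes: P[h]: (S, A, S), r[h]: (S, A), pi[h]: (S, A)."""
    S, A = r[0].shape
    V = np.zeros(S)                       # V_{H+1} = 0
    Qs = [None] * H
    for h in reversed(range(H)):
        Qs[h] = r[h] + P[h] @ V           # (B_h V_{h+1})(s, a)
        V = (pi[h] * Qs[h]).sum(axis=1)   # expectation over pi_h(.|s)
    return Qs, V                          # Q_1..H (0-indexed here) and V_1
```

Replacing the expectation over $\pi_h$ with a max over actions gives the Bellman optimality recursion in the same loop.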
The Bellman optimality equations can be written as $Q^*_h(s, a) = (\mathbb{B}_h V^*_{h+1})(s, a)$, $V^*_h(s) = \max_{a \in \mathcal{A}} Q^*_h(s, a)$, and $V^*_{H+1}(s) = 0$. Define the occupancy density $d^\pi_h(s, a) := \mathbb{P}((s_h, a_h) = (s, a) \mid \pi)$, which is the probability that we visit state $s$ and take action $a$ at timestep $h$ if we follow the policy $\pi$. We denote $d^{\pi^*}_h$ by $d^*_h$.

Offline regime. In the offline regime, the learner has access to a fixed dataset $\mathcal{D} = \{(s^t_h, a^t_h, r^t_h, s^t_{h+1})\}_{t \in [K], h \in [H]}$ generated a priori by some unknown behaviour policy $\mu = \{\mu_h\}_{h \in [H]}$. Here, $K$ is the total number of trajectories, and $a^t_h \sim \mu_h(\cdot \mid s^t_h)$, $s^t_{h+1} \sim P_h(\cdot \mid s^t_h, a^t_h)$ for any $(t, h) \in [K] \times [H]$. Note that we allow the trajectory at any time $t \in [K]$ to depend on the trajectories at previous times. The goal of offline RL is to learn a policy $\pi$, based on the historical data $\mathcal{D}$, such that $\pi$ achieves small sub-optimality, which we define as $\mathrm{SubOpt}(\pi) := \mathbb{E}_{s_1 \sim d_1}[\mathrm{SubOpt}(\pi; s_1)]$, where $\mathrm{SubOpt}(\pi; s_1) := V^{\pi^*}_1(s_1) - V^\pi_1(s_1)$.

Algorithm 1 Value Iteration with Perturbed Rewards (VIPeR)
1: Input: Offline data $\mathcal{D} = \{(s^k_h, a^k_h, r^k_h)\}_{k \in [K], h \in [H]}$, a parametric function family $\mathcal{F} = \{f(\cdot, \cdot; W) : W \in \mathcal{W}\} \subset \{\mathcal{X} \to \mathbb{R}\}$ (e.g., neural networks), perturbation variances $\{\sigma_h\}_{h \in [H]}$, number of bootstraps $M$, regularization parameter $\lambda$, step size $\eta$, number of gradient descent steps $J$, cutoff margin $\psi$, and split indices $\{\mathcal{I}_h\}_{h \in [H]}$, where $\mathcal{I}_h := \{(H-h)K' + 1, \ldots, (H-h+1)K'\}$
2: Initialize $\tilde{V}_{H+1}(\cdot) \leftarrow 0$ and initialize $f(\cdot, \cdot; W)$ with initial parameter $W_0$
3: for $h = H, \ldots, 1$ do
4:   for $i = 1, \ldots, M$ do
5:     Sample $\{\xi^{k,i}_h\}_{k \in \mathcal{I}_h} \sim \mathcal{N}(0, \sigma^2_h)$ and $\zeta^i_h = \{\zeta^{j,i}_h\}_{j \in [d]} \sim \mathcal{N}(0, \sigma^2_h I_d)$
6:     Perturb the dataset $\widetilde{\mathcal{D}}^i_h \leftarrow \{s^k_h, a^k_h, r^k_h + \tilde{V}_{h+1}(s^k_{h+1}) + \xi^{k,i}_h\}_{k \in \mathcal{I}_h}$ ▷ Perturbation
7:     Let $W^i_h \leftarrow \mathrm{GradientDescent}(\lambda, \eta, J, \widetilde{\mathcal{D}}^i_h, \zeta^i_h, W_0)$ (Algorithm 2) ▷ Optimization
8:   end for
9:   Compute $\widehat{Q}_h(\cdot, \cdot) \leftarrow \min\{\min_{i \in [M]} f(\cdot, \cdot; W^i_h), (H-h+1)(1+\psi)\}^+$ ▷ Pessimism
10:  $\widehat{\pi}_h \leftarrow \arg\max_{\pi_h} \langle \widehat{Q}_h, \pi_h \rangle$ and $\tilde{V}_h \leftarrow \langle \widehat{Q}_h, \widehat{\pi}_h \rangle$ ▷ Greedy
11: end for
12: Output: $\widehat{\pi} = \{\widehat{\pi}_h\}_{h \in [H]}$

Notation. For simplicity, we write $x^t_h = (s^t_h, a^t_h)$ and $x = (s, a)$. We write $\tilde{O}(\cdot)$ to hide logarithmic factors of the problem parameters $(d, H, K, m, 1/\delta)$ in the standard Big-Oh notation, and use $\Omega(\cdot)$ as the standard Omega notation. We write $u \lesssim v$ if $u = O(v)$ and $u \gtrsim v$ if $v \lesssim u$. We write $A \preceq B$ iff $B - A$ is a positive semi-definite matrix. $I_d$ denotes the $d \times d$ identity matrix.

3.2. OVERPARAMETERIZED NEURAL NETWORKS

In this paper, we consider the neural function approximation setting, where the state-action value function is approximated by a two-layer neural network. For simplicity, we denote $\mathcal{X} := \mathcal{S} \times \mathcal{A}$ and view it as a subset of $\mathbb{R}^d$. Without loss of generality, we assume $\mathcal{X} \subset \mathbb{S}^{d-1} := \{x \in \mathbb{R}^d : \|x\|_2 = 1\}$. We consider a standard two-layer neural network: $f(x; W, b) = \frac{1}{\sqrt{m}} \sum_{i=1}^m b_i \sigma(w_i^T x)$, where $m$ is an even number, $\sigma(\cdot) = \max\{\cdot, 0\}$ is the ReLU activation function (Arora et al., 2018), and $W = (w_1^T, \ldots, w_m^T)^T \in \mathbb{R}^{md}$. During training, we initialize $(W, b)$ via the symmetric initialization scheme (Gao et al., 2019): for each $i \in [m/2]$, $w_i = w_{m/2+i} \sim \mathcal{N}(0, I_d/d)$ and $b_{m/2+i} = -b_i \sim \mathrm{Unif}(\{-1, 1\})$. During training, we optimize over $W$ while the $b_i$ are kept fixed; thus we write $f(x; W, b)$ as $f(x; W)$. Denote $g(x; W) = \nabla_W f(x; W) \in \mathbb{R}^{md}$, and let $W_0$ be the initial parameters of $W$. We assume that the neural network is overparameterized, i.e., the width $m$ is sufficiently larger than the number of samples $K$. Overparameterization has been shown to be effective in studying the convergence and the interpolation behaviour of neural networks (Arora et al., 2019; Allen-Zhu et al., 2019; Hanin & Nica, 2020; Cao & Gu, 2019; Belkin, 2021). Under such an overparameterization regime, the dynamics of neural network training can be captured in the framework of the neural tangent kernel (NTK) (Jacot et al., 2018).
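A minimal NumPy sketch of this two-layer network and the symmetric initialization (function names are ours, not the paper's). A useful property of the paired weights with opposite-signed $b_i$ is that the network output is exactly zero at initialization.

```python
import numpy as np

def init_symmetric(m, d, rng):
    """Symmetric initialization (Gao et al., 2019): for i in [m/2],
    w_i = w_{m/2+i} ~ N(0, I_d/d) and b_{m/2+i} = -b_i ~ Unif({-1, +1});
    only W is trained afterwards, b stays fixed."""
    assert m % 2 == 0
    half_W = rng.normal(0.0, np.sqrt(1.0 / d), size=(m // 2, d))
    half_b = rng.choice([-1.0, 1.0], size=m // 2)
    return np.vstack([half_W, half_W]), np.concatenate([half_b, -half_b])

def two_layer_relu(x, W, b):
    """f(x; W, b) = (1/sqrt(m)) * sum_i b_i * relu(w_i^T x)."""
    m = W.shape[0]
    return float((b * np.maximum(W @ x, 0.0)).sum() / np.sqrt(m))
```

The zero-at-initialization property follows because each pair $i$, $m/2+i$ contributes $b_i \sigma(w_i^T x) - b_i \sigma(w_i^T x) = 0$.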

4. ALGORITHM

In this section, we present the proposed algorithm, called Value Iteration with Perturbed Rewards (VIPeR); see Algorithm 1 for the pseudocode. The key idea underlying VIPeR is to train a parametric model (e.g., a neural network) on a perturbed-reward dataset several times and act pessimistically by picking the minimum over an ensemble of estimated state-action value functions. In particular, at each timestep $h \in [H]$, we draw $M$ independent samples of zero-mean Gaussian noise with variance $\sigma_h^2$. We use these samples to perturb the sum of the observed rewards, $r^k_h$, and the estimated value function with a one-step lookahead, i.e., $\tilde{V}_{h+1}(s^k_{h+1})$ (see Line 6 of Algorithm 1). The weights $W^i_h$ are then updated by minimizing the perturbed regularized squared loss on $\{\widetilde{\mathcal{D}}^i_h\}_{i \in [M]}$ using gradient descent (Line 7). We pick the value function pessimistically by selecting the minimum over the finite ensemble. The chosen value function is truncated at $(H-h+1)(1+\psi)$ (see Line 9), where $\psi \geq 0$ is a small cutoff margin (more on this when we discuss the theoretical analysis). The returned policy is greedy with respect to the truncated pessimistic value function (see Line 10).

Algorithm 2 GradientDescent$(\lambda, \eta, J, \widetilde{\mathcal{D}}^i_h, \zeta^i_h, W_0)$
1: Input: Regularization parameter $\lambda$, step size $\eta$, number of gradient descent steps $J$, perturbed dataset $\widetilde{\mathcal{D}}^i_h = \{s^k_h, a^k_h, r^k_h + \tilde{V}_{h+1}(s^k_{h+1}) + \xi^{k,i}_h\}_{k \in \mathcal{I}_h}$, regularization perturber $\zeta^i_h$, initial parameter $W_0$
2: $L(W) := \frac{1}{2} \sum_{k \in \mathcal{I}_h} \big(f(s^k_h, a^k_h; W) - (r^k_h + \tilde{V}_{h+1}(s^k_{h+1}) + \xi^{k,i}_h)\big)^2 + \frac{\lambda}{2} \|W + \zeta^i_h - W_0\|_2^2$
3: for $j = 0, \ldots, J-1$ do
4:   $W_{j+1} \leftarrow W_j - \eta \nabla L(W_j)$
5: end for
6: Output: $W_J$

It is important to note that we split the trajectory indices $[K]$ evenly into $H$ disjoint buckets $[K] = \cup_{h \in [H]} \mathcal{I}_h$, where $\mathcal{I}_h = \{(H-h)K' + 1, \ldots, (H-h+1)K'\}$ for $K' := \lfloor K/H \rfloor$, as illustrated in Figure 1.
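To make the perturbation step (Line 6 of Algorithm 1) and the gradient-descent loop of Algorithm 2 concrete, here is a small NumPy sketch using a linear model $f(x; W) = \langle W, x \rangle$ as a stand-in for the neural network (the helper names and the linear stand-in are ours, not the paper's).

```python
import numpy as np

def perturb_targets(r, V_next, sigma, rng):
    """Line 6 of Algorithm 1: perturbed regression targets
    r_h^k + V_{h+1}(s_{h+1}^k) + xi^{k,i}, with xi^{k,i} ~ N(0, sigma^2)."""
    return r + V_next + rng.normal(0.0, sigma, size=r.shape)

def gradient_descent(lam, eta, J, X, y, zeta, W0):
    """Algorithm 2, with a linear model standing in for the network.
    Runs J gradient steps from W0 on
      L(W) = 0.5 * sum_k (<W, x_k> - y_k)^2 + 0.5 * lam * ||W + zeta - W0||^2."""
    W = W0.copy()
    for _ in range(J):
        grad = X.T @ (X @ W - y) + lam * (W + zeta - W0)
        W = W - eta * grad
    return W
```

With $\sigma > 0$, calling this $M$ times with fresh noise $\xi$ and $\zeta$ yields the ensemble $\{W^i_h\}_{i \in [M]}$ whose pointwise minimum provides the implicit pessimism.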
The estimated $\widehat{Q}_h$ is thus obtained only from the offline data with (trajectory) indices in $\mathcal{I}_h$, along with $\tilde{V}_{h+1}$. This novel design removes the data dependence structure in offline RL with function approximation (Nguyen-Tang et al., 2022b) and avoids a factor involving the log of the covering number in the bound on the sub-optimality of Algorithm 1, as we show in Section D.1. To deal with the non-linearity of the underlying MDP, we use a two-layer fully connected neural network as the parametric function family $\mathcal{F}$ in Algorithm 1. In other words, we approximate the state-action values by $f(x; W) = \frac{1}{\sqrt{m}} \sum_{i=1}^m b_i \sigma(w_i^T x)$, as described in Section 3.2. We use two-layer neural networks to simplify the computational analysis. We utilize gradient descent to train the state-action value functions $\{f(\cdot, \cdot; W^i_h)\}_{i \in [M]}$ on perturbed rewards. The use of gradient descent is for the convenience of computational analysis; our results can be extended to stochastic gradient descent by leveraging recent advances in the theory of deep learning (Allen-Zhu et al., 2019; Cao & Gu, 2019), albeit with a more involved analysis. Existing offline RL algorithms utilize estimates of statistical confidence regions to achieve pessimism in the offline setting. Explicitly constructing these confidence bounds is computationally expensive in complex problems where a neural network is used for function approximation. For example, the lower-confidence-bound-based algorithms in neural offline contextual bandits (Nguyen-Tang et al., 2022a) and RL (Xu & Liang, 2022) require computing the inverse of a large covariance matrix whose size scales with the number of network parameters. This is computationally prohibitive in most practical settings. Algorithm 1 (VIPeR) avoids such expensive computations while still obtaining provable pessimism and guaranteeing a rate of $\tilde{O}(1/\sqrt{K})$ on the sub-optimality, as we show in the next section.
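The pessimism and greedy steps (Lines 9-10 of Algorithm 1) are equally simple: given the $M$ fitted estimates, action selection is an ensemble minimum followed by an argmax, with per-decision cost independent of $K$. A sketch with our own helper names:

```python
import numpy as np

def pessimistic_Q(Q_ensemble, h, H, psi):
    """Line 9: Q_h = min{ min_i f(.,.; W_h^i), (H - h + 1)(1 + psi) }^+,
    i.e. an ensemble minimum truncated to [0, (H - h + 1)(1 + psi)].
    Q_ensemble has shape (M, S, A): one table per ensemble member."""
    cap = (H - h + 1) * (1.0 + psi)
    return np.clip(Q_ensemble.min(axis=0), 0.0, cap)

def greedy_policy(Q):
    """Line 10: the greedy policy w.r.t. the pessimistic estimate.  Note
    that no confidence region or covariance inverse is needed here."""
    return Q.argmax(axis=1)
```

This is the source of the $O(1)$ action-selection cost discussed in the introduction: the $K$-dependent work happens once, during training.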
Next, we provide a theoretical guarantee on the sub-optimality of VIPeR for the function approximation class, F, represented by (overparameterized) neural networks. Our analysis builds on the recent advances in generalization and optimization of deep neural networks (Arora et al., 2019; Allen-Zhu et al., 2019; Hanin & Nica, 2020; Cao & Gu, 2019; Belkin, 2021) that leverage the observation that the dynamics of the neural parameters learned by (stochastic) gradient descent can be captured by the corresponding neural tangent kernel (NTK) space (Jacot et al., 2018) when the network is overparameterized.

5. SUB-OPTIMALITY ANALYSIS

Next, we recall some definitions and formally state our key assumptions.

Definition 1 (NTK (Jacot et al., 2018)). The NTK kernel $K_{\mathrm{ntk}} : \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ is defined as $K_{\mathrm{ntk}}(x, x') = \mathbb{E}_{w \sim \mathcal{N}(0, I_d/d)} \langle x \sigma'(w^T x), x' \sigma'(w^T x') \rangle$, where $\sigma'(u) = \mathbb{1}\{u \geq 0\}$.

Let $\mathcal{H}_{\mathrm{ntk}}$ denote the reproducing kernel Hilbert space (RKHS) induced by the NTK, $K_{\mathrm{ntk}}$. Since $K_{\mathrm{ntk}}$ is a universal kernel (Ji et al., 2020), $\mathcal{H}_{\mathrm{ntk}}$ is dense in the space of continuous functions on (the compact set) $\mathcal{X} = \mathcal{S} \times \mathcal{A}$ (Rahimi & Recht, 2008).

Definition 2 (Effective dimension). For any $h \in [H]$, the effective dimension of the NTK matrix on data $\{x^k_h\}_{k \in \mathcal{I}_h}$ is defined as $\tilde{d}_h := \frac{\log\det(I_{K'} + K_h/\lambda)}{\log(1 + K'/\lambda)}$, where $K_h := [K_{\mathrm{ntk}}(x^i_h, x^j_h)]_{i,j \in \mathcal{I}_h}$ is the Gram matrix of $K_{\mathrm{ntk}}$ on the data, and we write $\tilde{d} := \max_{h \in [H]} \tilde{d}_h$. The notion of effective dimension was first introduced in 2013 for kernelized contextual bandits and was subsequently adopted by Yang & Wang (2020) and Zhou et al. (2020) for kernelized RL and neural contextual bandits, respectively. The effective dimension is data-dependent and can be bounded by $\tilde{d} \lesssim K'^{(d+1)/(2d)}$ in the worst case (see Section B for more details).

Definition 3 (RKHS of the infinite-width NTK). Define $\mathcal{Q}^* := \{f(x) = \int_{\mathbb{R}^d} c(w)^T x \sigma'(w^T x) \, dw : \sup_w \frac{\|c(w)\|_2}{p_0(w)} < B\}$, where $c : \mathbb{R}^d \to \mathbb{R}^d$ is any function, $p_0$ is the probability density function of $\mathcal{N}(0, I_d/d)$, and $B$ is some positive constant.

We make the following assumption about the regularity of the underlying MDP under function approximation.

Assumption 5.1 (Completeness). For any $V : \mathcal{S} \to [0, H+1]$ and any $h \in [H]$, $\mathbb{B}_h V \in \mathcal{Q}^*$.

Assumption 5.1 ensures that the Bellman operator $\mathbb{B}_h$ can be captured by an infinite-width neural network. This assumption is mild, as $\mathcal{Q}^*$ is a dense subset of $\mathcal{H}_{\mathrm{ntk}}$ (Gao et al., 2019, Lemma C.1) when $B = \infty$; thus $\mathcal{Q}^*$ is an expressive function class when $B$ is sufficiently large. Moreover, similar assumptions have been used in many prior works on provably efficient RL with function approximation (Cai et al., 2019; Wang et al., 2020; Yang et al., 2020; Nguyen-Tang et al., 2022b).
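For intuition, the expectation in Definition 1 admits a closed form on the unit sphere (a standard fact about the ReLU arc-cosine kernel, not stated in the paper), and the effective dimension in Definition 2 is directly computable from a Gram matrix. A small sketch with our own helper names:

```python
import numpy as np

def ntk(x, xp):
    """Closed form of K_ntk from Definition 1 for ||x|| = ||x'|| = 1:
    E_w[1{w.x >= 0} 1{w.x' >= 0}] = (pi - theta) / (2*pi) with
    theta = arccos(<x, x'>), hence K_ntk(x, x') = <x, x'> (pi - theta)/(2*pi)."""
    c = np.clip(np.dot(x, xp), -1.0, 1.0)
    return c * (np.pi - np.arccos(c)) / (2.0 * np.pi)

def effective_dimension(K, lam):
    """Definition 2: logdet(I_{K'} + K_h/lam) / log(1 + K'/lam)
    for a K' x K' Gram matrix K_h."""
    Kp = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(Kp) + K / lam)
    return logdet / np.log(1.0 + Kp / lam)
```

For example, $K_{\mathrm{ntk}}(x, x) = 1 \cdot (\pi - 0)/(2\pi) = 1/2$ for any unit vector $x$.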
Next, we present a bound on the sub-optimality of the policy $\widehat{\pi}$ returned by Algorithm 1. Recall that we use the initialization scheme described in Section 3.2, and fix any $\delta \in (0, 1)$.

Theorem 1. Let $\sigma_h = \sigma := 1 + \lambda^{1/2} B + (H+1)\big[\tilde{d} \log(1 + K'/\lambda) + 2 + 2\log(3H/\delta)\big]^{1/2}$, let $m = \mathrm{poly}(K', H, d, B, \tilde{d}, \lambda, \delta)$ be some high-order polynomial of the problem parameters, and set $\lambda = 1 + H/K$, $\eta \lesssim (\lambda + K')^{-1}$, $J \gtrsim K' \log(K'(H\tilde{d} + B))$, $\psi = 1$, and $M = \log\frac{HSA}{\delta} / \log\frac{1}{1 - \Phi(-1)}$, where $\Phi(\cdot)$ is the cumulative distribution function of the standard normal distribution. Then, under Assumption 5.1, with probability at least $1 - MHm^{-2} - 2\delta$, for any $s_1 \in \mathcal{S}$, we have
$\mathrm{SubOpt}(\widehat{\pi}; s_1) \leq \sigma\big(1 + \sqrt{2\log(MSAH/\delta)}\big) \cdot \mathbb{E}_{\pi^*}\Big[\sum_{h=1}^H \|g(s_h, a_h; W_0)\|_{\Lambda_h^{-1}}\Big] + \tilde{O}\big(\tfrac{1}{K'}\big)$,
where $\Lambda_h := \lambda I_{md} + \sum_{k \in \mathcal{I}_h} g(s^k_h, a^k_h; W_0) g(s^k_h, a^k_h; W_0)^T \in \mathbb{R}^{md \times md}$.

Remark 2. Theorem 1 shows that the randomized design in our proposed algorithm yields a provable uncertainty quantifier even though we do not explicitly maintain any confidence regions in the algorithm. The implicit pessimism via perturbed rewards introduces an extra factor of $1 + \sqrt{2\log(MSAH/\delta)}$ into the confidence parameter $\beta$. We build upon Theorem 1 to obtain an explicit bound using the following data coverage assumption.

Assumption 5.2 (Data coverage). There exists $\kappa < \infty$ such that $\sup_{(h, s, a) \in [H] \times \mathcal{S} \times \mathcal{A}} \frac{d^*_h(s, a)}{d^\mu_h(s, a)} \leq \kappa$.

Assumption 5.2 requires any positive-probability trajectory induced by the optimal policy to be covered by the behavior policy. This data coverage assumption is significantly milder than the uniform coverage assumptions in many FQI-based offline RL algorithms (Munos & Szepesvári, 2008; Chen & Jiang, 2019; Nguyen-Tang et al., 2022b) and is common in pessimism-based algorithms (Rashidinejad et al., 2021; Nguyen-Tang et al., 2022a; Chen & Jiang, 2022; Zhan et al., 2022).

Theorem 2.
For the same parameter settings and the same assumptions as in Theorem 1, with probability at least $1 - MHm^{-2} - 5\delta$, we have
$\mathrm{SubOpt}(\widehat{\pi}) \leq \frac{2\tilde{\sigma}\kappa H}{\sqrt{K'}} \Big( \sqrt{2\tilde{d}\log(1 + K'/\lambda)} + \frac{1 + \sqrt{\log(H/\delta)}}{\sqrt{\lambda}} \Big) + \frac{16H}{3K'} \log\frac{\log_2(K'H)}{\delta} + \tilde{O}\big(\tfrac{1}{K'}\big)$,
where $\tilde{\sigma} := \sigma\big(1 + \sqrt{2\log(SAH/\delta)}\big)$.

Remark 3. Theorem 2 shows that, with the parameter choices above, VIPeR achieves a sub-optimality of $\tilde{O}\big(\kappa H^{3/2} \sqrt{\tilde{d}} \cdot \max\{B, H\sqrt{\tilde{d}}\} / \sqrt{K}\big)$. Compared to Yang et al. (2020), we improve by a factor of $K^{2\tilde{d}\gamma - 1}$ for some $\gamma \in (0, 1)$, and we improve the bound of PEVI (Jin et al., 2021, Corollary 4.6) by a factor of $\sqrt{d_{\mathrm{lin}}}$ at the expense of additional factors. We provide a summary and comparison of results in Table 1 and give a more detailed discussion in Subsection B.1.
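The ensemble size prescribed in Theorem 1 is only logarithmic in $H$, $S$, $A$, and $1/\delta$. A quick numeric check of $M = \log(HSA/\delta)/\log(1/(1-\Phi(-1)))$, using only the standard library (the ceiling and the helper name are ours):

```python
import math

def ensemble_size(H, S, A, delta):
    """M = ceil( log(H*S*A/delta) / log(1/(1 - Phi(-1))) ) from Theorem 1,
    with Phi(-1) = 0.5*(1 + erf(-1/sqrt(2))) ~ 0.1587, the standard normal
    CDF evaluated at -1."""
    phi_m1 = 0.5 * (1.0 + math.erf(-1.0 / math.sqrt(2.0)))
    return math.ceil(math.log(H * S * A / delta) / math.log(1.0 / (1.0 - phi_m1)))
```

The denominator $\log(1/(1-\Phi(-1))) \approx 0.173$ reflects the anti-concentration argument: each independent Gaussian perturbation falls at least one standard deviation below its mean with probability $\Phi(-1)$, so roughly $\log(1/\delta')/0.173$ draws suffice for the ensemble minimum to be pessimistic with probability $1 - \delta'$.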

6. EXPERIMENTS

In this section, we empirically evaluate the proposed algorithm VIPeR against several state-of-the-art baselines, including: (a) PEVI (Jin et al., 2021), which explicitly constructs a lower confidence bound (LCB) for pessimism in a linear model (we therefore rename this algorithm LinLCB for convenience in our experiments); (b) NeuraLCB (Nguyen-Tang et al., 2022a), which explicitly constructs an LCB using neural network gradients; (c) NeuraLCB (Diag), which is NeuraLCB with a diagonal approximation for estimating the confidence set, as suggested in Nguyen-Tang et al. (2022a); (d) Lin-VIPeR, which is VIPeR realized with linear function approximation instead of neural network function approximation; and (e) NeuralGreedy (respectively, LinGreedy), which uses neural networks (respectively, linear models) to fit the offline data and acts greedily with respect to the estimated state-action value functions without any pessimism. Note that when the parametric class $\mathcal{F}$ in Algorithm 1 is that of neural networks, we refer to VIPeR as Neural-VIPeR. We do not utilize data splitting in the experiments. We provide further algorithmic details of the baselines in Section H. We evaluate all algorithms in two problem settings: (1) the underlying MDP is a linear MDP whose reward functions and transition kernels are linear in some known feature map (Jin et al., 2020), and (2) the underlying MDP is non-linear with horizon length $H = 1$ (i.e., non-linear contextual bandits) (Zhou et al., 2020), where the reward function is either synthetic or constructed from the MNIST dataset (LeCun et al., 1998). We also evaluate (a variant of) our algorithm on the D4RL benchmark (Fu et al., 2020) and show its strong performance advantage in Section A.3. We implemented all algorithms in PyTorch (Paszke et al., 2019) on a server with an Intel(R) Xeon(R) Gold 6248 CPU @ 2.50GHz, 755GB RAM, and one NVIDIA Tesla V100 Volta GPU accelerator with a 32GB graphics card.

6.1. LINEAR MDPS

We first test the effectiveness of the pessimism implicit in VIPeR (Algorithm 1). To that end, we construct a hard instance of linear MDPs (Yin et al., 2022; Min et al., 2021); due to space limitations, we defer the details of our construction to Section A.1. We test for different values of $H \in \{20, 30, 50, 80\}$ and report the sub-optimality of LinLCB, Lin-VIPeR, and LinGreedy, averaged over 30 runs, in Figure 2. We find that LinGreedy, which is uncertainty-agnostic, fails to learn from the offline data and performs poorly in terms of sub-optimality compared to the pessimism-based algorithms LinLCB and Lin-VIPeR. Further, LinLCB outperforms Lin-VIPeR when $K$ is smaller than 400, but the performance of the two algorithms matches for larger sample sizes. Unlike LinLCB, Lin-VIPeR does not construct any confidence regions and does not require computing and inverting large (covariance) matrices. Note that the Y-axis is in log scale; Lin-VIPeR already attains small sub-optimality within the first $K \approx 400$ samples. These results demonstrate the effectiveness of the randomized design for pessimism implicit in Algorithm 1. Finally, in Figure 5, we study the effect of the ensemble size on the performance of Neural-VIPeR.

6.2. NEURAL CONTEXTUAL BANDITS

We use different values of $M \in \{1, 2, 5, 10, 20, 30, 50, 100, 200\}$ for sample size $K = 1000$. We find that the sub-optimality of Neural-VIPeR decreases gracefully as $M$ increases. Indeed, the grid search in the previous experiment in Figure 3 also selects $M = 10$ and $M = 20$ from the search space $M \in \{1, 10, 20\}$ as the best values. This suggests that the ensemble size can play an important role as a hyperparameter that determines the amount of pessimism needed in a practical setting.

7. CONCLUSION

We propose a novel algorithmic approach for offline RL that combines random perturbations of the value function with pessimism. Our algorithm eliminates the computational overhead of explicitly maintaining a valid confidence region and computing the inverse of a large covariance matrix for pessimism. We bound the sub-optimality of the proposed algorithm as $\tilde{O}(\kappa H^{5/2} \tilde{d}/\sqrt{K})$. We support our theoretical claims of computational efficiency and the effectiveness of our algorithm with extensive experiments.

A EXPERIMENT DETAILS

A.1 LINEAR MDPS

In this subsection, we provide further details of the experimental setup used in Subsection 6.1. We describe in detail a variant of the hard instance of linear MDPs (Yin et al., 2022) used in our experiment. The linear MDP has $\mathcal{S} = \{0, 1\}$, $\mathcal{A} = \{0, 1, \ldots, 99\}$, and feature dimension $d = 10$. Each action $a \in [99] = \{1, \ldots, 99\}$ is represented by its binary encoding vector $u_a \in \mathbb{R}^8$ with entries in $\{-1, 1\}$. The feature mapping is given by $\phi(s, a) = [u_a^T, \delta(s, a), 1 - \delta(s, a)]^T \in \mathbb{R}^{10}$, where $\delta(s, a) = 1$ if $(s, a) = (0, 0)$ and $\delta(s, a) = 0$ otherwise. The true measure is $\nu_h(s) = [0, \ldots, 0, (1 - s) \oplus \alpha_h, s \oplus \alpha_h]$, where $\{\alpha_h\}_{h \in [H]} \in \{0, 1\}^H$ are generated uniformly at random and $\oplus$ is the XOR operator. We define $\theta_h = [0, \ldots, 0, r, 1 - r]^T \in \mathbb{R}^{10}$, where $r = 0.99$. Recall that the transition follows $P_h(s' \mid s, a) = \langle \phi(s, a), \nu_h(s') \rangle$ and the mean reward is $r_h(s, a) = \langle \phi(s, a), \theta_h \rangle$. We generated a priori $K \in \{1, \ldots, 1000\}$ trajectories using the behavior policy $\mu$, where for any $h \in [H]$ we set $\mu_h(0 \mid 0) = p$, $\mu_h(1 \mid 0) = 1 - p$, $\mu_h(a \mid 0) = 0$ for all $a > 1$; $\mu_h(0 \mid 1) = p$, $\mu_h(a \mid 1) = (1-p)/99$ for all $a > 0$, with $p = 0.6$. We run over $K \in \{1, \ldots, 1000\}$ and $H \in \{20, 30, 50, 80\}$. We set $\lambda = 0.01$ for all algorithms. For Lin-VIPeR, we grid-searched $\sigma_h = \sigma \in \{0.0, 0.1, 0.5, 1.0, 2.0\}$ and $M \in \{1, 2, 10, 20\}$. For LinLCB, we grid-searched its uncertainty multiplier $\beta \in \{0.1, 0.5, 1, 2\}$. The sub-optimality metric is used to compare algorithms. For each $H \in \{20, 30, 50, 80\}$, each algorithm was executed 30 times and the averaged results (with standard deviations) are reported in Figure 2.
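The construction above can be written out directly. The sketch below (helper names are ours) builds $\phi$ and $\nu_h$ and lets one check that $\langle \phi(s,a), \nu_h(\cdot) \rangle$ forms a valid transition distribution over $\mathcal{S} = \{0, 1\}$.

```python
import numpy as np

def phi(s, a):
    """phi(s, a) = [u_a^T, delta(s, a), 1 - delta(s, a)]^T in R^10, where u_a
    is the 8-bit +/-1 binary encoding of a and delta(s, a) = 1{(s, a) = (0, 0)}."""
    u = np.array([2.0 * int(bit) - 1.0 for bit in format(a, '08b')])
    delta = 1.0 if (s, a) == (0, 0) else 0.0
    return np.concatenate([u, [delta, 1.0 - delta]])

def nu(h, s, alpha):
    """nu_h(s) = [0, ..., 0, (1 - s) XOR alpha_h, s XOR alpha_h] in R^10."""
    v = np.zeros(10)
    v[8] = (1 - s) ^ alpha[h]
    v[9] = s ^ alpha[h]
    return v
```

Since the first eight coordinates of $\nu_h$ are zero, the inner product reduces to $\delta(s,a) \cdot ((1-s') \oplus \alpha_h) + (1 - \delta(s,a)) \cdot (s' \oplus \alpha_h)$, which sums to 1 over $s' \in \{0, 1\}$ for every $(s, a)$.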

A.2 NEURAL CONTEXTUAL BANDITS

In this subsection, we describe the experimental and hyperparameter setup for the experiment in Subsection 6.2. For Neural-VIPeR, NeuralGreedy, NeuraLCB, and NeuraLCB (Diag), we use the same neural network architecture with two hidden layers of width $m = 64$, and train the network with the Adam optimizer (Kingma & Ba, 2015), with the learning rate grid-searched over $\{0.0001, 0.001, 0.01\}$ and a batch size of 64. For NeuraLCB, NeuraLCB (Diag), and LinLCB, we grid-searched $\beta$ over $\{0.001, 0.01, 0.1, 1, 5, 10\}$. For Neural-VIPeR and Lin-VIPeR, we grid-searched $\sigma_h = \sigma$ over $\{0.001, 0.01, 0.1, 1, 5, 10\}$ and $M$ over $\{1, 10, 20\}$. We did not run NeuraLCB on MNIST, as inverting a full covariance matrix in this case is extremely expensive. We fixed the regularization parameter $\lambda = 0.01$ for all algorithms. Offline data is generated by the $(1-\epsilon)$-optimal policy, which takes non-optimal actions with probability $\epsilon$ and optimal actions with probability $1 - \epsilon$; we set $\epsilon = 0.5$ in our experiments. To estimate the expected sub-optimality, we randomly draw 1,000 novel samples (i.e., not used in training) to compute the average sub-optimality, and keep the same samples for all algorithms.

A.3 EXPERIMENT IN D4RL BENCHMARK

In this subsection, we evaluate the effectiveness of the reward-perturbing design of VIPeR in the Gym domain of the D4RL benchmark (Fu et al., 2020). The Gym domain has three environments (HalfCheetah, Hopper, and Walker2d) with five datasets each (random, medium, medium-replay, medium-expert, and expert), making up 15 different settings.

Design. To adapt the design of VIPeR to continuous control, we use the actor-critic framework. Specifically, we have $M$ critics $\{Q_{\theta_i}\}_{i \in [M]}$ and one actor $\pi_\phi$, where $\{\theta_i\}_{i \in [M]}$ and $\phi$ are the learnable parameters of the critics and the actor, respectively. Note that in the continuous domain, we consider a discounted MDP with discount factor $\gamma$, instead of the finite-horizon episodic MDP considered in the main paper. In the presence of the actor $\pi_\phi$, there are two modifications to Algorithm 1. The first modification is that when training the critics $\{Q_{\theta_i}\}_{i \in [M]}$, we augment the training loss in Algorithm 2 with a new penalization term. Specifically, the critic loss for $Q_{\theta_i}$ on a training sample $\tau := (s, a, r, s')$ (sampled from the offline data $\mathcal{D}$) is
$L(\theta_i; \tau) = \big(Q_{\theta_i}(s, a) - (r + \gamma Q_{\bar{\theta}_i}(s') + \xi)\big)^2 + \beta \underbrace{\mathbb{E}_{a' \sim \pi_\phi(\cdot \mid s)}\big[(Q_{\theta_i}(s, a') - \bar{Q}(s, a'))^2\big]}_{\text{penalization term } R(\theta_i; s, \phi)}$,
where $\bar{\theta}_i$ has the same value as the current $\theta_i$ but is kept fixed, $\bar{Q} = \frac{1}{M}\sum_{i=1}^M Q_{\theta_i}$, $\xi \sim \mathcal{N}(0, \sigma^2)$ is Gaussian noise, and $\beta$ is a penalization parameter (note that $\beta$ here is entirely different from the $\beta$ in Theorem 1). The penalization term $R(\theta_i; s, \phi)$ discourages overestimation in the value estimate $Q_{\theta_i}$ at out-of-distribution (OOD) actions $a' \sim \pi_\phi(\cdot \mid s)$. Our design of $R(\theta_i; s, \phi)$ is inspired by the OOD penalization in Bai et al. (2022), which creates a pessimistic pseudo-target for the values at OOD actions. Note that we do not need any penalization for OOD actions in our contextual bandit experiment in Section 6.2.
This is because in the contextual bandit setting of Section 6.2 the action space is finite and not large, so the offline data often sufficiently covers all good actions. In a continuous domain such as the Gym domain of D4RL, however, it is almost certain that some actions are not covered by the offline data, since the action space is continuous. We also note that the inclusion of the OOD-action penalization term R(θ_i; s, ϕ) in this experiment does not contradict our guarantee in Theorem 1, since the theorem considers a finite action space while this experiment considers a continuous one. We argue that some regularization for OOD actions (e.g., R(θ_i; s, ϕ)) is necessary in the continuous domain.

The second modification to Algorithm 1 for the continuous domain is the actor training, which implements the policy extraction in line 10 of Algorithm 1. Specifically, to train the actor π_ϕ given the ensemble of critics {Q_{θ_i}}_{i∈[M]}, we use the soft actor update of Haarnoja et al. (2018) via

max_ϕ 𝔼_{s∼D, a′∼π_ϕ(·|s)}[ min_{i∈[M]} Q_{θ_i}(s, a′) - log π_ϕ(a′|s) ],

which is trained using gradient ascent in practice. Note that in the discrete-action domain we do not need such actor training, as we can efficiently extract the greedy policy with respect to the estimated action-value functions when the action space is finite. Also note that we do not use data splitting or value truncation as in the original design of Algorithm 1.

Hyperparameters. For the hyperparameters of our training, we set M = 10 and the noise variance σ = 0.01. For β, we decrease it from 0.5 to 0.2 by linear decay for the first 50K steps and exponential decay for the remaining steps. The other hyperparameters of actor-critic training are fixed to the same values as in Bai et al. (2022). Specifically, the Q-network is a fully connected neural network with three hidden layers, each of which has 256 neurons.
The learning rates for the actor and the critic are 10^{-4} and 3 × 10^{-4}, respectively. The optimizer is Adam.

Results. We compare VIPeR with several state-of-the-art algorithms, including (i) BEAR (Kumar et al., 2019), which uses the MMD distance to constrain the policy to the offline data; (ii) UWAC (Wu et al., 2021), which improves BEAR using dropout uncertainty; (iii) CQL (Kumar et al., 2020), which minimizes the Q-values of OOD actions; (iv) MOPO (Yu et al., 2020), which uses model-based uncertainty via ensemble dynamics; (v) TD3-BC (Fujimoto & Gu, 2021), which uses adaptive behavior cloning; and (vi) PBRL (Bai et al., 2022), which quantifies uncertainty via the disagreement of bootstrapped Q-functions. We follow the evaluation protocol in Bai et al. (2022). We run our algorithm with five seeds and report the average final evaluation scores with standard deviations. We report the scores of our method and the baselines in Table 2. Our method achieves the highest scores in 11 out of 15 settings and good stability (small standard deviations) in all settings. We also have the strongest average score aggregated over all settings.
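The per-sample critic loss L(θ_i; τ) described in the Design paragraph can be sketched numerically as follows. This is a minimal NumPy illustration of the loss structure only: the Q-values are plain numbers rather than network outputs, real training would differentiate through the critic parameters, and all concrete values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, beta, gamma = 0.01, 0.5, 0.99  # noise std, OOD penalty weight, discount

def critic_loss(q_i, q_i_target_next, reward, q_ensemble_ood, q_i_ood, xi):
    """Per-sample critic loss for critic i (scalars / 1-D arrays of OOD actions).

    q_i             : Q_{theta_i}(s, a)
    q_i_target_next : fixed-target estimate Q_{theta_i^-}(s') at the next state
    q_ensemble_ood  : ensemble mean Q-bar at OOD actions a' ~ pi_phi(.|s)
    q_i_ood         : Q_{theta_i}(s, a') at the same OOD actions
    xi              : Gaussian reward perturbation, xi ~ N(0, sigma^2)
    """
    # Perturbed TD target r + gamma * Q_target(s') + xi.
    td_target = reward + gamma * q_i_target_next + xi
    td_term = (q_i - td_target) ** 2
    # OOD penalization R(theta_i; s, phi), estimated over sampled actions a'.
    penalization = np.mean((q_i_ood - q_ensemble_ood) ** 2)
    return td_term + beta * penalization

xi = sigma * rng.normal()
loss = critic_loss(q_i=1.2, q_i_target_next=0.9, reward=0.5,
                   q_ensemble_ood=np.array([0.8, 1.0]),
                   q_i_ood=np.array([1.1, 0.7]), xi=xi)
print(loss)
```

The penalization term shrinks each critic toward the ensemble mean at actions proposed by the actor, which is what discourages overestimation at OOD actions.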

B EXTENDED DISCUSSION

Here we provide an extended discussion of our results.

B.1 COMPARISON WITH OTHER WORKS AND DISCUSSION

We provide further discussion of how our work compares with others in the literature.

Table 2: Average normalized score and standard deviation of all algorithms over five seeds in the Gym domain on the "v2" datasets of D4RL (Fu et al., 2020). The scores for all the baselines are from Table 1 of Bai et al. (2022). The highest scores are highlighted.

Comparing to Jin et al. (2021). When the underlying MDP reduces to a linear MDP and we use the linear model as the plug-in parametric model in Algorithm 1, our bound reduces to Õ(κH^{5/2} d_lin / √K), which improves the bound Õ(d^{3/2}_lin H² / √K) of PEVI (Jin et al., 2021, Corollary 4.6) by a factor of √d_lin and worsens by a factor of √H due to the data splitting. Thus, our bound is more favorable in linear MDPs with high-dimensional features. Moreover, our bound holds in more practical scenarios where the offline data may have been adaptively generated and is not required to uniformly cover the state-action space; the explicit bound Õ(d^{3/2}_lin H² / √K) in (Jin et al., 2021, Corollary 4.6) is obtained under the assumption that the offline data have uniform coverage and are generated independently across episodes.

Comparing to Yang et al. (2020). Though Yang et al. (2020) work in the online regime, their work shares part of the function-approximation literature with ours. Besides the different learning regimes (offline versus online), we offer three key distinctions, which can potentially be used in the online regime as well: (i) perturbed rewards, (ii) optimization, and (iii) data splitting. Regarding (i), our perturbed-reward design can be applied to online RL with function approximation to obtain a provably efficient online RL algorithm that is computationally efficient, removing the need to maintain explicit confidence regions and to invert a large covariance matrix.
Regarding (ii), we incorporate the optimization analysis into our algorithm, which makes our algorithm and analysis more practical. We also note that, unlike Yang et al. (2020), we do not make any assumption on the eigenvalue decay rate of the empirical NTK kernel, as the empirical NTK kernel is data-dependent. Regarding (iii), our data-splitting technique completely removes the factor log N_∞(H, 1/K, B) from the bound, at the expense of increasing the bound by a factor of √H. For complex models, this log covering number can be excessively larger than the horizon H, making the algorithm too optimistic in the online regime (too pessimistic in the offline regime, respectively). For example, when the target function class is an RKHS with γ-polynomial eigenvalue decay, the log covering number scales as (Yang et al., 2020, Lemma D1)

log N_∞(H, 1/K, B) ≲ K^{2/(αγ-1)}, for some α ∈ (0, 1).

In the case of the two-layer ReLU NTK, γ = d (Bietti & Mairal, 2019); thus log N_∞(H, 1/K, B) ≲ K^{2/(αd-1)}, which is much larger than √H when the dataset is large. Note that our data-splitting technique is general and can be used in the online regime as well.

Comparing to Xu & Liang (2022). Xu & Liang (2022) consider a different setting where per-timestep rewards are not available and only the total reward of the whole trajectory is given. With neural function approximation, they obtain Õ(D_eff H² / √K), where D_eff is their effective dimension. Note that Xu & Liang (2022) do not use data splitting and still achieve the same order of D_eff as our result with data splitting. At first it might appear that our bound is inferior to theirs, as we pay a cost of √H for data splitting. However, to obtain that bound, they make three critical assumptions: (i) the offline data trajectories are independently and identically distributed (i.i.d.)
(see their Assumption 3), (ii) the offline data is uniformly explorative over all dimensions of the feature space (also their Assumption 3), and (iii) the induced NTK RKHS has a finite spectrum (see their Assumption 4). The i.i.d. assumption, together with the finite-dimensional RKHS (due to the finite-spectrum assumption) and the well-explored dataset, is critical in their proof for applying a matrix concentration that does not incur an extra factor of √D_eff, as it normally would without these assumptions (see Section E, the proof of their Lemma 2). Note that the celebrated ReLU NTK does not satisfy the finite-spectrum assumption (Bietti & Mairal, 2019). Moreover, we make none of these three assumptions for our bound to hold, which suggests that our bound is much more general. In addition, we do not need to compute any confidence regions or invert a large covariance matrix.

Comparing to Yin et al. (2023). During the submission of our work, the concurrent work of Yin et al. (2023) appeared online. Yin et al. (2023) study provably efficient offline RL with a general parametric function approximation that unifies the guarantees of offline RL in linear and generalized linear MDPs and beyond, with potential applications to other function classes used in practice. We remark that the result in Yin et al. (2023) is orthogonal/complementary to our paper, since they consider parametric classes with third-order differentiability, which cannot apply to neural networks (not necessarily overparameterized) with non-smooth activations such as ReLU. In addition, they do not consider reward perturbing in their algorithmic design or optimization errors in their analysis.

B.2 WORSE-CASE RATE OF EFFECTIVE DIMENSION

In the main paper, we prove an Õ(κH^{5/2} d̃ / √K) sub-optimality bound that depends on the notion of effective dimension defined in Definition 2. Here we give a worst-case rate of the effective dimension d̃ for the two-layer ReLU NTK. We first briefly review the relevant RKHS background. Let H be an RKHS on X ⊆ R^d with kernel function ρ : X × X → R, and let ⟨·,·⟩_H : H × H → R and ∥·∥_H : H → R be the inner product and the RKHS norm on H. By the reproducing property of H, there exists a feature mapping ϕ : X → H such that f(x) = ⟨f, ϕ(x)⟩_H and ρ(x, x′) = ⟨ϕ(x), ϕ(x′)⟩_H. We assume that the kernel function ρ is uniformly bounded, i.e., sup_{x∈X} ρ(x, x) < ∞. Let L²(X) be the space of square-integrable functions on X with respect to the Lebesgue measure, and let ⟨·,·⟩_{L²} be the inner product on L²(X). The kernel function ρ induces an integral operator T_ρ : L²(X) → L²(X) defined as

T_ρ f(x) = ∫_X ρ(x, x′) f(x′) dx′.

By Mercer's theorem (Steinwart & Christmann, 2008), T_ρ has countable positive eigenvalues {λ_i}_{i≥1} and eigenfunctions {ν_i}_{i≥1}, and the kernel function and H can be expressed as

ρ(x, x′) = Σ_{i=1}^∞ λ_i ν_i(x) ν_i(x′),    H = {f ∈ L²(X) : Σ_{i=1}^∞ ⟨f, ν_i⟩²_{L²} / λ_i < ∞}.

Now consider the NTK defined in Definition 1:

K_ntk(x, x′) = 𝔼_{w∼N(0, I_d/d)}[⟨x σ′(wᵀx), x′ σ′(wᵀx′)⟩].

It follows from (Bietti & Mairal, 2019, Proposition 1) that λ_i ≍ i^{-d}. Thus, by (Srinivas et al., 2010, Theorem 5), the data-dependent effective dimension of H_ntk can be bounded in the worst case by d̃ ≲ K′^{(d+1)/(2d)}. We remark that this is a worst-case bound that holds uniformly over all possible realizations of the training data. The effective dimension d̃ is, on the other hand, data-dependent, i.e., its value depends on the specific training data at hand; thus d̃ can actually be much smaller than the worst-case rate.

C PROOF OF THEOREM 1 AND THEOREM 2

In this section, we provide both the outline and detailed proofs of Theorem 1 and Theorem 2.
The work of Jia et al. (2022) is more closely related to ours, since they consider reward perturbing with overparameterized neural networks (though in contextual bandits). However, our reward-perturbing strategy is largely different from theirs. Specifically, Jia et al. (2022) perturb each reward only once, while we perturb each reward multiple times, and the number of perturbations is crucial in our work and needs to be controlled carefully. We show in Theorem 1 that our reward-perturbing strategy is effective in enforcing sufficient pessimism for offline learning in general MDPs, and the empirical results in Figure 2, Figure 3, Figure 5, and Table 2 are strongly consistent with this theoretical suggestion. Thus, our technical proofs are largely different from those of Jia et al. (2022). Finally, the idea of perturbing rewards multiple times in our algorithm is inspired by Ishfaq et al. (2021). However, Ishfaq et al. (2021) consider reward perturbing for obtaining optimism in online RL. While perturbing rewards is intuitive for obtaining optimism in online RL, in offline RL, under distributional shift, it can be paradoxically difficult to properly obtain pessimism with randomization and ensembles (Ghasemipour et al., 2022), especially with neural function approximation. We show affirmatively in our work that simply taking the minimum of the randomized value functions after perturbing the rewards multiple times is sufficient to obtain provable pessimism for offline RL. In addition, Ishfaq et al. (2021) do not consider neural network function approximation or optimization. Controlling the uncertainty of the randomization (via reward perturbing) under neural networks with the extra optimization errors induced by gradient descent sets our technical proof significantly apart from that of Ishfaq et al. (2021).
Besides all these differences, in this work we propose an intricately designed data-splitting technique that avoids the uniform convergence argument and could be of independent interest for studying sample-efficient RL with complex function approximation.

Proof Overview. The key steps for proving Theorem 1 and Theorem 2 are highlighted in Subsection C.2 and Subsection C.3, respectively. Here, we give an overview of our proof strategy. The key technical challenge in our proof is to quantify the uncertainty of the perturbed value function estimates. To deal with this, we carefully control both the near-linearity of neural networks in the NTK regime and the estimation error induced by reward perturbing. A key result that we use to control the linear approximation of the value function estimates is Lemma D.3. A main difficulty in establishing Lemma D.3 is how to carefully control and propagate the optimization error incurred by gradient descent; the complete proof of Lemma D.3 is provided in Section E.3. The implicit uncertainty quantifier induced by the reward perturbing is established in Lemma D.1 and Lemma D.2, where we carefully design a series of intricate auxiliary loss functions and establish the anti-concentrability of the perturbed value function estimates. This requires a careful design of the variance of the noises injected into the rewards. To remove a potentially large covering number when quantifying the implicit uncertainty, we propose our data-splitting technique, which is validated in the proof of Lemma D.1 in Section E.1. Moreover, establishing Lemma D.1 in the overparameterized regime poses an additional challenge, since a standard analysis would result in a vacuous bound that scales with the overparameterization. We avoid this issue by carefully incorporating the effective dimension in Lemma D.1.

Table 3: The problem parameters and the additional parameters that we introduce for our proofs. Here c_1, c_2, and C_g are absolute constants independent of the problem parameters.

K′ : bucket size, K/H
I_h : index buckets, [(H-h)K′ + 1, (H-h)K′ + 2, …, (H-h+1)K′]
B : parameter radius of the Bellman operator
γ_{h,1} : c_1 σ_h √(log(KM/δ))
γ_{h,2} : c_2 σ_h √(d log(dKM/δ))
B_1 : √(λ^{-1} · 2K(H+ψ)²) + 8 C_g R^{4/3} m^{-1/6} √(log m) + √K C_g R^{1/3} m^{-1/6} √(log m)
B̃_1 : √(λ^{-1}(2K′(H+ψ+γ_{h,1})² + λγ²_{h,2})) + 8 C_g R^{4/3} m^{-1/6} √(log m) + √K′ C_g R^{1/3} m^{-1/6} √(log m)
B̃_2 : λ^{-1} K′ C_g R^{4/3} m^{-1/6} √(log m)
ι_0 : B m^{-1/2} (2√d + √(2 log(3H/δ)))
ι_1 : C_g R^{4/3} m^{-1/6} √(log m) + C_g (B̃_1 + B̃_2) + λ^{-1} (1-ηλ)^J √(K′(H+ψ+γ_{h,1})² + λγ²_{h,2})
ι_2 : C_g R^{4/3} m^{-1/6} √(log m) + C_g (B_1 + B̃_2) + λ^{-1} (1-ηλ)^J √(K′(H+ψ)²)
ι : ι_0 + ι_1 + 2ι_2
β : (BK′/√m)(2√d + √(2 log(3H/δ))) λ^{-1/2} C_g + λ^{1/2} B + (H+ψ) √(d̃_h log(1 + K′/λ) + K′ log λ + 2 log(3H/δ))

C.2 PROOF OF THEOREM 1

In this subsection, we present the proof of Theorem 1. We first decompose the suboptimality SubOpt(π̂; s) and present the main lemmas that bound the evaluation error and the summation of the implicit confidence terms, respectively. The detailed proofs of these lemmas are deferred to Section D. For convenience, the key parameters used throughout our proofs are given in Table 3.

We define the model evaluation error at any (x, h) ∈ X × [H] as

err_h(x) = (B_h Ṽ_{h+1} - Q̂_h)(x),

where B_h is the Bellman operator defined in Section 3, and Ṽ_h and Q̂_h are the estimated state-value and action-value functions returned by Algorithm 1. Using the standard suboptimality decomposition (Jin et al., 2021, Lemma 3.1), for any s_1 ∈ S,

SubOpt(π̂; s_1) = -Σ_{h=1}^H 𝔼_π̂[err_h(s_h, a_h)] + Σ_{h=1}^H 𝔼_{π*}[err_h(s_h, a_h)] + Σ_{h=1}^H 𝔼_{π*}[⟨Q̂_h(s_h, ·), π*_h(·|s_h) - π̂_h(·|s_h)⟩_A],

where the third term is non-positive because π̂_h is greedy with respect to Q̂_h. Thus, for any s_1 ∈ S, we have

SubOpt(π̂; s_1) ≤ -Σ_{h=1}^H 𝔼_π̂[err_h(s_h, a_h)] + Σ_{h=1}^H 𝔼_{π*}[err_h(s_h, a_h)].    (4)

In the following main lemma, we bound the evaluation error err_h(s, a). In the rest of the proof, we consider an additional parameter R and fix any δ ∈ (0, 1).

Lemma C.1. Let

m = Ω(d^{3/2} R^{-1} log^{3/2}(√m / R)),
R = O(m^{1/2} log^{-3} m),
m = Ω(K′^{10} (H+ψ)² log(3K′H/δ)),
λ > 1,
K′ C_g² ≥ λ,
R ≥ max{4B̃_1, 4B̃_2, 2√(2λ^{-1} K′ (H+ψ+γ_{h,1})² + 4γ²_{h,2})},
η ≤ (λ + K′ C_g²)^{-1},
ψ > ι,  σ_h ≥ β, ∀h ∈ [H],    (5)

where B̃_1, B̃_2, γ_{h,1}, γ_{h,2}, and ι are defined in Table 3, C_g is an absolute constant given in Lemma G.1, and R is an additional parameter. Let M = log(HSA/δ) / log(1 / (1 - Φ(-1))), where Φ(·) is the cumulative distribution function of the standard normal distribution.
With probability at least 1 - MHm^{-2} - 2δ, for any (x, h) ∈ X × [H], we have

-ι ≤ err_h(x) ≤ σ_h (1 + √(2 log(MSAH/δ))) · ∥g(x; W_0)∥_{Λ_h^{-1}} + ι,

where Λ_h := λ I_{md} + Σ_{k∈I_h} g(x_h^k; W_0) g(x_h^k; W_0)ᵀ ∈ R^{md×md}.

Now we can prove Theorem 1.

Proof of Theorem 1. Theorem 1 follows directly from substituting Lemma C.1 into Equation (4). We now only need to simplify the conditions in Equation (5). To satisfy Equation (5), it suffices to set

λ = 1 + H/K,
ψ = 1 > ι,
σ_h = β,
8 C_g R^{4/3} m^{-1/6} √(log m) ≤ 1 ≤ λ^{-1} K′ H²,
B̃_1 ≤ √(2K′(H+ψ+γ_{h,1})² + λγ²_{h,2}) + 1,
√K′ C_g R^{1/3} m^{-1/6} √(log m) ≤ 1,
B̃_2 ≤ K′ C_g R^{4/3} m^{-1/6} √(log m) ≤ 1.

Combining with Equation (5), we have

λ = 1 + H/K,  ψ = 1 > ι,  σ_h = β,  η ≲ (λ + K′)^{-1},
m ≳ max{R^8 log³ m, K′^{10}(H+1)² log(3K′H/δ), d^{3/2} R^{-1} log^{3/2}(√m/R), K′^6 R^8 log³ m},
m ≳ [2K′(H + 1 + β√(log(K′M/δ)))² + λβ²d log(dK′M/δ) + 1]³ K′³ R log³ m,
4√K′ (H + 1 + β√(log(K′M/δ))) + 4β√(d log(dK′M/δ)) ≤ R ≲ K′.    (6)

Note that with the above choice of λ = 1 + H/K, we have K′ log λ = log((1 + 1/K′)^{K′}) ≤ log 3 < 2. If we further set m ≳ B² K′² d log(3H/δ), we have

β = (BK′/√m)(2√d + √(2 log(3H/δ))) λ^{-1/2} C_g + λ^{1/2} B + (H+ψ) √(d̃_h log(1 + K′/λ) + K′ log λ + 2 log(3H/δ))
  ≤ 1 + λ^{1/2} B + (H+1) √(d̃_h log(1 + K′/λ) + 2 + 2 log(3H/δ)) = o(√K′).

Thus, 4√K′ (H + 1 + β√(log(K′M/δ))) + 4β√(d log(dK′M/δ)) ≪ K′ for K′ large enough. Therefore, there exists an R that satisfies Equation (6). We now only need to verify ι < 1. We have ι_0 = B m^{-1/2}(2√d + √(2 log(3H/δ))) ≤ 1/3, and

ι_1 = C_g R^{4/3} m^{-1/6} √(log m) + C_g(B̃_1 + B̃_2) + λ^{-1}(1-ηλ)^J √(K′(H + 1 + γ_{h,1})² + λγ²_{h,2}) ≲ 1/3

if

(1-ηλ)^J √(K′(H + 1 + β√(log(K′M/δ)))² + λβ²d log(dK′M/δ)) ≲ 1.    (7)

Note that (1-ηλ)^J ≤ e^{-ηλJ} and K′(H + 1 + β√(log(K′M/δ)))² + λβ²d log(dK′M/δ) ≲ K′H²λβ²d log(dK′M/δ).
Thus, Equation (7) is satisfied if

J ≳ (ηλ)^{-1} log(K′H²λβ²d log(dK′M/δ)).

Finally, note that ι_2 ≤ ι_1. Rearranging the derived conditions gives the complete parameter conditions in Theorem 1. Specifically, the polynomial form of m is

m ≳ max{R^8 log³ m, K′^{10}(H+1)² log(3K′H/δ), d^{3/2} R^{-1} log^{3/2}(√m/R), K′^6 R^8 log³ m, B² K′² d log(3H/δ)},
m ≳ [2K′(H + 1 + β√(log(K′M/δ)))² + λβ²d log(dK′M/δ) + 1]³ K′³ R log³ m.

C.3 PROOF OF THEOREM 2

In this subsection, we give a detailed proof of Theorem 2. We first present intermediate lemmas whose proofs are deferred to Section D. For any h ∈ [H] and k ∈ I_h = [(H-h)K′ + 1, …, (H-h+1)K′], we define the filtration

F_h^k = σ({(s_{h′}^t, a_{h′}^t, r_{h′}^t)}_{t≤k, h′∈[H]} ∪ {(s_{h′}^{k+1}, a_{h′}^{k+1}, r_{h′}^{k+1})}_{h′≤h-1} ∪ {(s_h^{k+1}, a_h^{k+1})}).

Let

Λ_h^k := λI + Σ_{t∈I_h, t≤k} g(x_h^t; W_0) g(x_h^t; W_0)ᵀ,    β̄ := β(1 + √(2 log(SAH/δ))).

In the following lemma, we connect the expected sub-optimality of π̂ to the summation of the uncertainty quantifier at the empirical data.

Lemma C.2. Suppose that the conditions in Theorem 1 all hold. With probability at least 1 - MHm^{-2} - 3δ,

SubOpt(π̂) ≤ (2β̄/K′) Σ_{h=1}^H Σ_{k∈I_h} 𝔼_{π*}[∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | F_h^{k-1}, s_1^k] + (16/3)(H/K′) log(log_2(K′H)/δ) + 2/K′ + 2ι.

Lemma C.3. Under Assumption 5.2, for any h ∈ [H] and fixed W_0, with probability at least 1 - δ,

Σ_{k∈I_h} 𝔼_{π*}[∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | F^{k-1}, s_1^k] ≤ κ Σ_{k∈I_h} ∥g(x_h^k; W_0)∥_{(Λ_h^k)^{-1}} + κ √(K′ log(1/δ)/λ).

Lemma C.4. If λ ≥ C_g² and m = Ω(K′^4 log(K′H/δ)), then with probability at least 1 - δ, for any h ∈ [H], we have

Σ_{k∈I_h} ∥g(x_h^k; W_0)∥²_{(Λ_h^k)^{-1}} ≤ 2 d̃_h log(1 + K′/λ) + 1,

where d̃_h is the effective dimension defined in Definition 2.

Proof of Theorem 2. Theorem 2 directly follows from Lemmas C.2, C.3, and C.4 via the union bound.
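The elliptical-potential bound in Lemma C.4, and the data-dependent effective dimension it involves, can be checked numerically. The sketch below uses random unit-norm vectors as stand-ins for the NTK features g(x_h^k; W_0) (an illustrative assumption) and verifies that the summed quadratic forms are dominated by the log-determinant of the ridge-regularized Gram matrix, from which an effective-dimension quantity in the style of Definition 2 is computed.

```python
import numpy as np

rng = np.random.default_rng(0)
K_prime, dim, lam = 200, 16, 1.0

# Random unit-norm surrogate features standing in for g(x_h^k; W_0).
G = rng.normal(size=(K_prime, dim))
G /= np.linalg.norm(G, axis=1, keepdims=True)

# Running covariance Lambda_h^k = lam*I + sum_{t<=k} g_t g_t^T and the
# accumulated quadratic forms ||g_k||^2_{(Lambda_h^k)^{-1}}.
Lam = lam * np.eye(dim)
potential = 0.0
for g in G:
    Lam += np.outer(g, g)
    potential += g @ np.linalg.solve(Lam, g)

# log-det of the ridge-regularized Gram matrix of the same features.
gram = G @ G.T
logdet = np.linalg.slogdet(np.eye(K_prime) + gram / lam)[1]

# Data-dependent effective dimension: logdet(I + K/lam) / log(1 + K'/lam).
d_eff = logdet / np.log(1.0 + K_prime / lam)

print(potential, logdet, d_eff)
```

The inequality potential ≤ logdet is the matrix-determinant-lemma telescoping step behind (Abbasi-yadkori et al., 2011, Lemma 11); Lemma C.4 then replaces the empirical Gram matrix by the NTK Gram matrix K_h at the cost of the additive constant 1.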

D PROOF OF LEMMA C.1

In this section, we provide the proof of Lemma C.1. We also set up notation and definitions used for all the results in the rest of the paper, and provide the intermediate lemmas used to prove Lemma C.1. The detailed proofs of these intermediate lemmas are deferred to Section E.

D.1 PREPARATION

To prepare for the lemmas and proofs in the rest of the paper, we define the following quantities. Recall the abbreviations x = (s, a) ∈ X ⊂ S^{d-1} and x_h^k = (s_h^k, a_h^k) ∈ X ⊂ S^{d-1}. For any h ∈ [H] and i ∈ [M], we define the perturbed loss function

L̃_h^i(W) := (1/2) Σ_{k∈I_h} (f(x_h^k; W) - ỹ_h^{i,k})² + (λ/2) ∥W + ζ_h^i - W_0∥²_2,    (8)

where ỹ_h^{i,k} := r_h^k + Ṽ_{h+1}(s_{h+1}^k) + ξ_h^{i,k}, Ṽ_{h+1} is computed by Algorithm 1 at Line 10 for timestep h+1, and {ξ_h^{i,k}} and ζ_h^i are the Gaussian noises obtained at Line 5 of Algorithm 1. Here the subscript h and the superscript i in L̃_h^i(W) emphasize the dependence on the ensemble sample i and the timestep h. The gradient-descent update rule for L̃_h^i(W) is

W_h^{i,(j+1)} = W_h^{i,(j)} - η ∇L̃_h^i(W_h^{i,(j)}),    (9)

where W_h^{i,(0)} = W_0 is the initialization. Note that W_h^i = GradientDescent(λ, η, J, D̃_h^i, ζ_h^i, W_0) = W_h^{i,(J)}, where W_h^i is returned by Line 7 of Algorithm 1. We consider a non-perturbed auxiliary loss function

L_h(W) := (1/2) Σ_{k∈I_h} (f(x_h^k; W) - y_h^k)² + (λ/2) ∥W - W_0∥²_2,    (10)

where y_h^k := r_h^k + Ṽ_{h+1}(s_{h+1}^k). Note that L_h(W) is simply the non-perturbed version of L̃_h^i(W), obtained by dropping all the noises {ξ_h^{i,k}} and {ζ_h^i}. We consider the gradient-descent update rule for L_h(W),

Ŵ_h^{(j+1)} = Ŵ_h^{(j)} - η ∇L_h(Ŵ_h^{(j)}),    (11)

where Ŵ_h^{(0)} = W_0 is the initialization. To correspond with W_h^i, we denote

Ŵ_h := Ŵ_h^{(J)}.    (12)

We also define auxiliary loss functions for both the perturbed and non-perturbed data in the linear model with feature map g(·; W_0):

L̃_h^{i,lin}(W) := (1/2) Σ_{k∈I_h} (⟨g(x_h^k; W_0), W⟩ - ỹ_h^{i,k})² + (λ/2) ∥W + ζ_h^i - W_0∥²_2,    (13)

L_h^{lin}(W) := (1/2) Σ_{k∈I_h} (⟨g(x_h^k; W_0), W⟩ - y_h^k)² + (λ/2) ∥W - W_0∥²_2.    (14)

We consider the auxiliary gradient-descent updates

W_h^{i,lin,(j+1)} = W_h^{i,lin,(j)} - η ∇L̃_h^{i,lin}(W_h^{i,lin,(j)}),    (15)

Ŵ_h^{lin,(j+1)} = Ŵ_h^{lin,(j)} - η ∇L_h^{lin}(Ŵ_h^{lin,(j)}),    (16)

where W_h^{i,lin,(0)} = Ŵ_h^{lin,(0)} = W_0 for all i, h.
Finally, we define the least-squares solutions of the auxiliary perturbed and non-perturbed loss functions for the linear model:

W̃_h^{i,lin} = argmin_{W ∈ R^{md}} L̃_h^{i,lin}(W),    (17)

Ŵ_h^{lin} = argmin_{W ∈ R^{md}} L_h^{lin}(W).    (18)

For any h ∈ [H], we define the auxiliary covariance matrix

Λ_h := λ I_{md} + Σ_{k∈I_h} g(x_h^k; W_0) g(x_h^k; W_0)ᵀ.    (19)

It is worth remarking that Algorithm 1 uses only Equation (8) and Equation (9); it does not actually require any of the other auxiliary quantities defined in this subsection during its run time. The auxiliary quantities here are only for our theoretical analysis.
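As a concrete illustration of these auxiliary objects, the sketch below runs the gradient-descent update for the perturbed linear-surrogate loss on synthetic data and checks that it converges to the corresponding least-squares minimizer W̃_h^{i,lin}. The features, rewards, and noise scales are synthetic assumptions used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
K_prime, dim, lam, sigma = 50, 8, 1.0, 0.1

G = rng.normal(size=(K_prime, dim)) / np.sqrt(dim)  # rows: g(x_h^k; W_0)^T
y = rng.uniform(0.0, 1.0, size=K_prime)             # r_h^k + V~_{h+1}(s_{h+1}^k)
xi = sigma * rng.normal(size=K_prime)               # reward perturbations xi_h^{i,k}
zeta = sigma * rng.normal(size=dim)                 # regularization perturbation zeta_h^i
W0 = np.zeros(dim)                                  # initialization W_0
y_tilde = y + xi                                    # perturbed targets y~_h^{i,k}

def grad(W):
    # Gradient of (1/2)||G W - y~||^2 + (lam/2)||W + zeta - W0||^2.
    return G.T @ (G @ W - y_tilde) + lam * (W + zeta - W0)

# Step size eta <= (lam + K' C_g^2)^{-1}: here C_g^2 is replaced by the
# squared spectral norm of the synthetic feature matrix.
eta = 1.0 / (lam + np.linalg.norm(G, 2) ** 2)
W = W0.copy()
for _ in range(5000):  # plays the role of J gradient-descent steps
    W -= eta * grad(W)

# Closed-form least-squares minimizer of the same strongly convex objective.
W_star = np.linalg.solve(G.T @ G + lam * np.eye(dim),
                         G.T @ y_tilde + lam * (W0 - zeta))
print(np.max(np.abs(W - W_star)))
```

Because the objective is λ-strongly convex, the iterates contract geometrically toward W_star, mirroring the (1-ηλ)^J optimization-error terms in Table 3.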

D.2 PROOF OF LEMMA C.1

In this subsection, we give a detailed proof of Lemma C.1. To prepare for it, we first state the following intermediate lemmas; their detailed proofs are deferred to Section E. In the first lemma, we bound the error of f(x; Ŵ_h) in estimating B_h Ṽ_{h+1}, the Bellman operator applied to the estimated state-value function.

Lemma D.1. Let

m = Ω(K′^{10} (H+ψ)² log(3K′H/δ)),  λ > 1,  K′ C_g² ≥ λ.

With probability at least 1 - Hm^{-2} - 2δ, for any x ∈ S^{d-1} and any h ∈ [H],

|f(x; Ŵ_h) - (B_h Ṽ_{h+1})(x)| ≤ β · ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_2 + ι_0,

where Ṽ_{h+1} is computed by Algorithm 1 for timestep h+1, Ŵ_h is defined in Equation (12), and β, ι_2, and ι_0 are defined in Table 3.

In the following lemma, we establish the anti-concentration of Q̂_h.

Lemma D.2. Let

m = Ω(d^{3/2} R^{-1} log^{3/2}(√m/R)),
R = O(m^{1/2} log^{-3} m),
η ≤ (λ + K′ C_g²)^{-1},
R ≥ max{4B̃_1, 4B̃_2, 2√(2λ^{-1} K′ (H+ψ+γ_{h,1})² + 4γ²_{h,2})},    (20)

where B̃_1, B̃_2, γ_{h,1}, and γ_{h,2} are defined in Table 3, and C_g is a constant given in Lemma G.1. Let M = log(HSA/δ) / log(1/(1-Φ(-1))), where Φ(·) is the cumulative distribution function of the standard normal distribution and M is the number of bootstrapped samples in Algorithm 1. Then with probability at least 1 - MHm^{-2} - δ, for any x ∈ S^{d-1} and h ∈ [H],

Q̂_h(x) ≤ max{⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩ - σ_h ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_1 + ι_2, 0},

where Ŵ_h^{lin} is defined in Equation (18), Q̂_h is computed by Line 9 of Algorithm 1, and ι_1 and ι_2 are defined in Table 3.

We also prove the following linear approximation error lemma.

Lemma D.3. Let the conditions in Equation (20) hold, where B̃_1, B̃_2, γ_{h,1}, and γ_{h,2} are defined in Table 3, and C_g is a constant given in Lemma G.1.
With probability at least 1 - MHm^{-2} - δ, for any (x, i, j, h) ∈ S^{d-1} × [M] × [J] × [H],

|f(x; W_h^{i,(j)}) - ⟨g(x; W_0), W̃_h^{i,lin} - W_0⟩| ≤ ι_1,

where W_h^{i,(j)}, W̃_h^{i,lin}, and ι_1 are defined in Equation (9), Equation (17), and Table 3, respectively. In addition, with probability at least 1 - Hm^{-2}, for any (x, j, h) ∈ S^{d-1} × [J] × [H],

|f(x; Ŵ_h^{(j)}) - ⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩| ≤ ι_2,

where Ŵ_h^{(j)}, Ŵ_h^{lin}, and ι_2 are defined in Equation (11), Equation (18), and Table 3, respectively.

We now can prove Lemma C.1.

Proof of Lemma C.1. Note that the first four conditions in Equation (5) guarantee that the conditions of Lemmas D.1, D.2, and D.3 hold. It follows from Lemma D.1 that

(B_h Ṽ_{h+1})(x) ≥ f(x; Ŵ_h) - β · ∥g(x; W_0)∥_{Λ_h^{-1}} - ι_0 - ι_2.    (22)

It follows from Lemma D.2 that

Q̂_h(x) ≤ max{⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩ - σ_h ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_1 + ι_2, 0}.    (23)

Note that Q̂_h(x) ≥ 0. If ⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩ - σ_h ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_1 + ι_2 ≤ 0, Equation (23) implies that Q̂_h(x) = 0, and thus err_h(x) = (B_h Ṽ_{h+1})(x) - Q̂_h(x) = (B_h Ṽ_{h+1})(x) ≥ 0. Otherwise, if ⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩ - σ_h ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_1 + ι_2 > 0, Equation (23) implies that

Q̂_h(x) ≤ ⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩ - σ_h ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_1 + ι_2.    (24)

Thus, combining Equations (22) and (24) with Lemma D.3, and with the choice σ_h ≥ β, we have err_h(x) := (B_h Ṽ_{h+1})(x) - Q̂_h(x) ≥ -(ι_0 + ι_1 + 2ι_2) = -ι. As ι ≥ 0, in either case we have

err_h(x) := (B_h Ṽ_{h+1})(x) - Q̂_h(x) ≥ -ι.    (25)

Due to Equation (25), we have Q̂_h(x) ≤ (B_h Ṽ_{h+1})(x) + ι ≤ H - h + 1 + ι < H - h + 1 + ψ, where the last inequality holds by the choice ψ > ι. Thus, we have

Q̂_h(x) = min{min_{i∈[M]} f(x; W_h^i), H - h + 1 + ψ}⁺ = max{min_{i∈[M]} f(x; W_h^i), 0}.    (26)
Substituting Equation (26) into the definition of err_h(x), we have

err_h(x) = (B_h Ṽ_{h+1})(x) - Q̂_h(x)
≤ (B_h Ṽ_{h+1})(x) - min_{i∈[M]} f(x; W_h^i)
= (B_h Ṽ_{h+1})(x) - f(x; Ŵ_h) + f(x; Ŵ_h) - min_{i∈[M]} f(x; W_h^i)
≤ β · ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_0 + ι_2 + f(x; Ŵ_h) - min_{i∈[M]} f(x; W_h^i)
≤ β · ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_0 + ι_2 + ⟨g(x; W_0), Ŵ_h^{lin} - W_0⟩ + ι_2 - min_{i∈[M]} ⟨g(x; W_0), W̃_h^{i,lin} - W_0⟩ + ι_1
= β · ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_0 + ι_2 + max_{i∈[M]} ⟨g(x; W_0), Ŵ_h^{lin} - W̃_h^{i,lin}⟩ + ι_1 + ι_2
≤ β · ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_0 + ι_2 + √(2 log(MSAH/δ)) σ_h ∥g(x; W_0)∥_{Λ_h^{-1}} + ι_1 + ι_2,

where the first inequality holds due to Equation (26), the second inequality holds due to Lemma D.1, the third inequality holds due to Lemma D.3, and the last inequality holds due to Lemma E.2 and Lemma G.3 via the union bound.

D.3 PROOF OF LEMMA C.2

Proof of Lemma C.2. Let

Z_k := β̄ Σ_{h=1}^H 𝔼_{π*}[1{k ∈ I_h} ∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | s_1^k, F_h^{k-1}],

where 1{·} is the indicator function. Under the event in which the inequality in Theorem 1 holds, we have

SubOpt(π̂) ≤ min{H, β̄ · 𝔼_{π*}[Σ_{h=1}^H ∥g(x_h; W_0)∥_{Λ_h^{-1}}] + 2ι}
≤ min{H, β̄ 𝔼_{π*}[Σ_{h=1}^H ∥g(x_h; W_0)∥_{Λ_h^{-1}}]} + 2ι
= (1/K′) Σ_{k=1}^K min{H, β̄ 𝔼_{π*}[Σ_{h=1}^H 1{k ∈ I_h} ∥g(x_h; W_0)∥_{Λ_h^{-1}}]} + 2ι
≤ (1/K′) Σ_{k=1}^K min{H, β̄ 𝔼_{π*}[Σ_{h=1}^H 1{k ∈ I_h} ∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | F_h^{k-1}]} + 2ι
= (1/K′) Σ_{k=1}^K min{H, 𝔼[Z_k | F_h^{k-1}]} + 2ι
≤ (1/K′) Σ_{k=1}^K 𝔼[min{H, Z_k} | F_h^{k-1}] + 2ι,    (27)

where the first inequality holds due to Theorem 1 and SubOpt(π̂; s_1) ≤ H for all s_1 ∈ S, the second inequality holds due to min{a, b + c} ≤ min{a, b} + c, the third inequality holds due to Λ_h^{-1} ⪯ (Λ_h^k)^{-1}, and the fourth inequality holds due to Jensen's inequality applied to f(x) = min{H, x}. It follows from Lemma G.9 that, with probability at least 1 - δ,

Σ_{k=1}^K 𝔼[min{H, Z_k} | F^{k-1}] ≤ 2 Σ_{k=1}^K Z_k + (16/3) H log(log_2(KH)/δ) + 2.    (28)
Published as a conference paper at ICLR 2023

Substituting Equation (28) into Equation (27) and using the union bound completes the proof.

D.4 PROOF OF LEMMA C.3

Proof of Lemma C.3. Let

Z_h^k := 1{k ∈ I_h} (d_h^*(x_h^k) / d_h^μ(x_h^k)) ∥g(x_h^k; W_0)∥_{(Λ_h^k)^{-1}}.

Then Z_h^k is F_h^k-measurable, and by Assumption 5.2 we have

|Z_h^k| ≤ (d_h^*(x_h^k) / d_h^μ(x_h^k)) ∥g(x_h^k; W_0)∥_2 ∥(Λ_h^k)^{-1}∥ ≤ (1/√λ) (d_h^*(x_h^k) / d_h^μ(x_h^k)) < ∞,

𝔼[Z_h^k | F_h^{k-1}, s_1^k] = 𝔼_{x_h ∼ d_h^μ}[1{k ∈ I_h} (d_h^*(x_h) / d_h^μ(x_h)) ∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | F_h^{k-1}, s_1^k].

Thus, by Lemma G.4, for any h ∈ [H], with probability at least 1 - δ, we have

Σ_{k=1}^K 𝔼_{x ∼ d_h^*}[1{k ∈ I_h} ∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | F_h^{k-1}, s_1^k]
= Σ_{k=1}^K 𝔼_{x_h ∼ d_h^μ}[1{k ∈ I_h} (d_h^*(x_h) / d_h^μ(x_h)) ∥g(x_h; W_0)∥_{(Λ_h^k)^{-1}} | F_h^{k-1}, s_1^k]
≤ Σ_{k=1}^K 1{k ∈ I_h} (d_h^*(x_h^k) / d_h^μ(x_h^k)) ∥g(x_h^k; W_0)∥_{(Λ_h^k)^{-1}} + √((1/λ) log(1/δ) Σ_{k=1}^K 1{k ∈ I_h} (d_h^*(x_h^k) / d_h^μ(x_h^k))²)
≤ κ Σ_{k=1}^K 1{k ∈ I_h} ∥g(x_h^k; W_0)∥_{(Λ_h^k)^{-1}} + κ √(K′ log(1/δ)/λ)
= κ Σ_{k∈I_h} ∥g(x_h^k; W_0)∥_{(Λ_h^k)^{-1}} + κ √(K′ log(1/δ)/λ).

D.5 PROOF OF LEMMA C.4

Proof of Lemma C.4. For any fixed h ∈ [H], let U = [g(x_h^k; W_0)]_{k∈I_h} ∈ R^{md×K′}.
By the union bound, with probability at least 1 - δ, for any h ∈ [H], we have

Σ_{k∈I_h} ∥g(x_h^k; W_0)∥²_{(Λ_h^k)^{-1}}
≤ 2 log(det(Λ_h) / det(λI))
= 2 logdet(I + Σ_{k∈I_h} g(x_h^k; W_0) g(x_h^k; W_0)ᵀ / λ)
= 2 logdet(I + UUᵀ/λ)
= 2 logdet(I + UᵀU/λ)
= 2 logdet(I + K_h/λ + (UᵀU - K_h)/λ)
≤ 2 logdet(I + K_h/λ) + 2 tr((I + K_h/λ)^{-1} (UᵀU - K_h)/λ)
≤ 2 logdet(I + K_h/λ) + 2 ∥(I + K_h/λ)^{-1}∥_F ∥UᵀU - K_h∥_F
≤ 2 logdet(I + K_h/λ) + 2√K′ ∥UᵀU - K_h∥_F
≤ 2 logdet(I + K_h/λ) + 1
= 2 d̃_h log(1 + K′/λ) + 1,

where the first inequality holds due to λ ≥ C_g² and (Abbasi-yadkori et al., 2011, Lemma 11), the third equality holds due to logdet(I + AAᵀ) = logdet(I + AᵀA), the second inequality holds due to logdet(A + B) ≤ logdet(A) + tr(A^{-1}B) by the concavity of logdet, the third inequality holds due to tr(AB) ≤ ∥A∥_F ∥B∥_F, and the fourth inequality holds due to 2√K′ ∥UᵀU - K_h∥_F ≤ 1 by the choice of m = Ω(K′^4 log(K′H/δ)) and Lemma G.2.

E.1 PROOF OF LEMMA D.1

In this subsection, we give a detailed proof of Lemma D.1. For this, we first provide a lemma about the linear approximation of the Bellman operator. In the following lemma, we show that B_h Ṽ_{h+1} can be well approximated by the class of linear functions with features g(·; W_0) with respect to the l_∞-norm.

Lemma E.1. Under Assumption 5.1, with probability at least 1 - δ over w_1, …, w_m drawn i.i.d. from N(0, I_d), for any h ∈ [H] there exist c_1, …, c_m, where c_i ∈ R^d and ∥c_i∥_2 ≤ B/m, such that

Q̄_h(x) := Σ_{i=1}^m c_iᵀ x σ′(w_iᵀ x),    ∥B_h Ṽ_{h+1} - Q̄_h∥_∞ ≤ (B/√m)(2√d + √(2 log(H/δ))).

Moreover, Q̄_h(x) can be re-written as Q̄_h(x) = ⟨g(x; W_0), W̄_h⟩, where W̄_h := √m [a_1 c_1ᵀ, …, a_m c_mᵀ]ᵀ ∈ R^{md} and ∥W̄_h∥_2 ≤ B.

We now can prove Lemma D.1.

Proof of Lemma D.1.
We first bound the difference ⟨g(x; W 0 ), Wh ⟩ -⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩:

⟨g(x; W 0 ), Wh ⟩ -⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩ = g(x; W 0 ) T Wh -g(x; W 0 ) T Λ -1 h k∈I h g(x k h ; W 0 )y k h = g(x; W 0 ) T Wh -g(x; W 0 ) T Λ -1 h k∈I h g(x k h ; W 0 ) • (B h Ṽh+1 )(x k h ) [I 1 ] + g(x; W 0 ) T Λ -1 h k∈I h g(x k h ; W 0 ) • (B h Ṽh+1 )(x k h ) -(r k h + Ṽh+1 (s k h+1 )) [I 2 ].

For bounding I 1 , it follows from Lemma E.1 that with probability at least 1 -δ/3, for any x ∈ S d-1 and any h ∈ [H], |(B h Ṽh+1 )(x) -⟨g(x; W 0 ), Wh ⟩| ≤ ι 0 , where ι 0 is defined in Table 3 and Wh is defined in Lemma E.1. Thus, with probability at least 1 -δ/3, for any x ∈ S d-1 and any h ∈ [H],

I 1 = g(x; W 0 ) T Λ -1 h k∈I h g(x k h ; W 0 ) g(x k h ; W 0 ) T Wh -(B h Ṽh+1 )(x k h ) + λg(x; W 0 ) T Λ -1 h Wh ≤ ∥g(x; W 0 )∥ Λ -1 h k∈I h ι 0 ∥g(x k h ; W 0 )∥ Λ -1 h + λ∥g(x; W 0 )∥ Λ -1 h ∥ Wh ∥ Λ -1 h ≤ ∥g(x; W 0 )∥ Λ -1 h K ′ ι 0 λ -1/2 C g + λ 1/2 B ,

where the first equation holds due to the definition of Λ h , and the last inequality holds due to ∥ Wh ∥ Λ -1 h ≤ ∥Λ -1 h ∥ 1/2 2 • ∥ Wh ∥ 2 ≤ λ -1/2 B.

For bounding I 2 , we have

I 2 ≤ k∈I h g(x k h ; W 0 ) (B h Ṽh+1 )(x k h ) -r k h -Ṽh+1 (s k h+1 ) Λ -1 h [I 3 ] • ∥g(x; W 0 )∥ Λ -1 h .

If we directly applied the result of Jin et al. (2021) for linear MDPs, we would get I 2 ≲ dmH log(2dmK ′ H/δ) • ∥g(x; W 0 )∥ Λ -1 h , which gives a vacuous bound as m is much larger than K in our problem. Instead, in the following, we present an alternative proof that avoids such a vacuous bound. For notational simplicity, we write ϵ k h := (B h Ṽh+1 )(x k h ) -r k h -Ṽh+1 (s k h+1 ) and E h := [(ϵ k h ) k∈I h ] T ∈ R K ′ . We denote by K init h := [⟨g(x i h ; W 0 ), g(x j h ; W 0 )⟩] i,j∈I h the Gram matrix of the empirical NTK kernel on the data {x k h } k∈I h . We also write G 0 := [g(x k h ; W 0 )] k∈I h ∈ R md×K ′ , so that K init h = G T 0 G 0 ∈ R K ′ ×K ′ .
Recall the definition of the Gram matrix K h of the NTK kernel on the data {x k h } k∈I h . It follows from Lemma G.2 and the union bound that if m = Ω(ϵ -4 log(3K ′ H/δ)), then with probability at least 1 -δ/3, for any h ∈ [H], ∥K h -K init h ∥ F ≤ √ K ′ ϵ. We can now bound I 3 . We have

I 2 3 = k∈I h g(x k h ; W 0 )ϵ k h 2 Λ -1 h = E T h G T 0 (λI md + G 0 G T 0 ) -1 G 0 E h = E T h G T 0 G 0 (λI K ′ + G T 0 G 0 ) -1 E h = E T h K init h (K init h + λI K ′ ) -1 E h = E T h K h (K h + λI K ′ ) -1 E h [I 5 ] + E T h K init h (K init h + λI K ′ ) -1 -K h (K h + λI K ′ ) -1 E h [I 4 ].

We bound I 4 and I 5 separately. For bounding I 4 , applying Lemma G.1, with probability at least 1 -Hm -2 , for any h ∈ [H],

I 4 ≤ K h (K h + λI K ′ ) -1 -K init h (K init h + λI K ′ ) -1 2 ∥E h ∥ 2 2 = (K h -K init h )(K h + λI K ′ ) -1 + K init h (K h + λI K ′ ) -1 -(K init h + λI K ′ ) -1 2 ∥E h ∥ 2 2 ≤ ∥K h -K init h ∥ 2 /λ + ∥K init h ∥ 2 • ∥K h -K init h ∥ 2 /λ 2 ∥E h ∥ 2 2 ≤ λ + K ′ C 2 g λ 2 ∥K h -K init h ∥ 2 ∥E h ∥ 2 2 ≤ 2K ′ C 2 g K ′ (H + ψ) 2 ∥K h -K init h ∥ 2 ,

where the first inequality holds due to the triangle inequality, the second inequality holds due to the triangle inequality, Lemma G.7, and ∥(K h + λI K ′ ) -1 ∥ 2 ≤ λ -1 , the third inequality holds due to ∥K init h ∥ 2 ≤ ∥G 0 ∥ 2 2 ≤ ∥G 0 ∥ 2 F ≤ K ′ C 2 g by Lemma G.1, and the fourth inequality holds due to ∥E h ∥ 2 ≤ √ K ′ (H + ψ), λ ≥ 1, and K ′ C 2 g ≥ λ.

Substituting Equation (32) into Equation (34) and using the union bound, with probability at least 1 -Hm -2 -δ/3, for any h ∈ [H],

I 4 ≤ 2K ′ C 2 g K ′ (H + ψ) 2 √ K ′ ϵ ≤ 1,

where the last inequality holds due to the choice of ϵ = (1/2)K ′-5/2 (H + ψ) -2 C -2 g and thus m = Ω(ϵ -4 log(3K ′ H/δ)) = Ω K ′10 (H + ψ) 8 C 8 g log(3K ′ H/δ) .

For bounding I 5 , as λ > 1, we have

I 5 = E T h K h (K h + λI K ′ ) -1 E h ≤ E T h (K h + (λ -1)I K ′ )(K h + λI K ′ ) -1 E h = E T h [ (K h + (λ -1)I K ′ ) -1 + I K ′ ] -1 E h .

Let σ(•) denote the σ-algebra generated by a set of random variables.
For any h ∈ [H] and k ∈ I h = {(H -h)K ′ + 1, . . . , (H -h + 1)K ′ }, we define the filtration

F k h = σ {(s t h ′ , a t h ′ , r t h ′ )} t≤k, h ′ ∈[H] ∪ {(s k+1 h ′ , a k+1 h ′ , r k+1 h ′ )} h ′ ≤h-1 ∪ {(s k+1 h , a k+1 h )} ,

which is simply all the data up to episode k + 1 and timestep h, right before r k+1 h and s k+1 h+1 are generated (in the offline data). 7 Note that for any k ∈ I h , we have (s k h , a k h , r k h , s k h+1 ) ∈ F k h and Ṽh+1 ∈ σ {(s k h ′ , a k h ′ , r k h ′ )} k∈I h ′ , h ′ ∈[h+1,...,H] ⊆ F k-1 h ⊆ F k h . Thus, for any k ∈ I h , we have ϵ k h = (B h Ṽh+1 )(x k h ) -r k h -Ṽh+1 (s k h+1 ) ∈ F k h . The key property of our data-split design is that Ṽh+1 ∈ σ {(s k h ′ , a k h ′ , r k h ′ )} k∈I h ′ , h ′ ∈[h+1,...,H] ⊆ F k-1 h ; thus, conditioned on F k-1 h , Ṽh+1 becomes deterministic. This implies that

E ϵ k h |F k-1 h = E (B h Ṽh+1 )(s k h , a k h ) -r k h -Ṽh+1 (s k h+1 ) | F k-1 h = 0.

Note that this is only possible with our data-splitting technique; otherwise, ϵ k h is not zero-mean due to the data-dependence structure induced in offline RL with function approximation (Nguyen-Tang et al., 2022b). Our data-split technique is key to avoiding the uniform-convergence argument with the log covering number that is often used to bound this term in Jin et al. (2021), and which can be large for complex models. For example, in a two-layer ReLU NTK, the eigenvalues of the induced RKHS have a d-polynomial decay (Bietti & Mairal, 2019); thus, by (Yang et al., 2020, Lemma D1), its log covering number roughly follows log N ∞ (H ntk , ϵ, B) ≲ 1 ϵ 4 αd-1 for some α ∈ (0, 1).

7 To be more precise, we would need to include in the filtration the randomness from the generated noises {ξ k,i h } and {ζ i h }, but since these noises are independent of any other randomness, they do not affect any derivation here and would only complicate the notation.
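The data-splitting scheme just described can be sketched in a few lines of code; the helper name `split_indices` is ours, not the paper's. Timestep h only uses the fold I_h, and later timesteps use earlier episode indices, which is what makes Ṽ_{h+1} measurable with respect to F^{k-1}_h.

```python
# Sketch of the data split: trajectory indices [K] are partitioned into H disjoint
# folds of size K' = K/H, and timestep h only uses fold
#   I_h = {(H-h)K' + 1, ..., (H-h+1)K'}.
def split_indices(K: int, H: int) -> dict:
    assert K % H == 0, "the paper assumes K/H is an integer"
    Kp = K // H  # K' trajectories per fold
    # 1-indexed episodes, matching the notation in the proof
    return {h: range((H - h) * Kp + 1, (H - h + 1) * Kp + 1) for h in range(1, H + 1)}

folds = split_indices(K=12, H=3)
# the folds are disjoint and jointly cover {1, ..., K}
all_ids = sorted(i for r in folds.values() for i in r)
assert all_ids == list(range(1, 13))
# later timesteps use earlier episodes: I_H = {1,...,K'}, I_1 = {K-K'+1,...,K}
assert list(folds[3]) == [1, 2, 3, 4]
assert list(folds[1]) == [9, 10, 11, 12]
```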

Therefore, for any h ∈ [H], {ϵ k h } k∈I h is adapted to the filtration {F k h } k∈I h . Applying Lemma G.5 with Z t = ϵ t h ∈ [-(H + ψ), H + ψ], σ 2 = (H + ψ) 2 , and ρ = λ -1, for any δ > 0, with probability at least 1 -δ/3, for any h ∈ [H],

E T h [ (K h + (λ -1)I K ′ ) -1 + I K ′ ] -1 E h ≤ (H + ψ) 2 logdet(λI K ′ + K h ) + 2(H + ψ) 2 log(3H/δ).

Substituting Equation (37) into Equation (36), we have

I 5 ≤ (H + ψ) 2 logdet(λI K ′ + K h ) + 2(H + ψ) 2 log(3H/δ) = (H + ψ) 2 logdet(I K ′ + K h /λ) + (H + ψ) 2 K ′ log λ + 2(H + ψ) 2 log(3H/δ) = (H + ψ) 2 dh log(1 + K ′ /λ) + (H + ψ) 2 K ′ log λ + 2(H + ψ) 2 log(3H/δ),

where the last equation holds due to the definition of the effective dimension.

Combining Equations (38), (35), (33), (31), and (30) via the union bound, with probability at least 1 -Hm -2 -δ, for any x ∈ S d-1 and any h ∈ [H],

|⟨g(x; W 0 ), Wh ⟩ -⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩| ≤ β • ∥g(x; W 0 )∥ Λ -1 h , where β := K ′ ι 0 λ -1/2 C g + λ 1/2 B + (H + ψ) √( dh log(1 + K ′ /λ) + K ′ log λ + 2 log(3H/δ) ).

Combining with Lemma D.3 via the union bound, with probability at least 1 -Hm -2 -2δ, for any x ∈ S d-1 and any h ∈ [H],

f (x; Ŵh ) -(B h Ṽh+1 )(x) ≤ ⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩ + ι 2 -⟨g(x; W 0 ), Wh ⟩ + ι 0 ≤ β • ∥g(x; W 0 )∥ Λ -1 h + ι 2 + ι 0 ,

where ι 2 and β are defined in Table 3. Similarly, it is easy to show that

(B h Ṽh+1 )(x) -f (x; Ŵh ) ≤ β • ∥g(x; W 0 )∥ Λ -1 h + ι 2 + ι 0 .

E.2 PROOF OF LEMMA D.2

Lemma E.2. For any i ∈ [M ], W i,lin h -Ŵ lin h ∼ N (0, σ 2 h Λ -1 h ).

Lemma E.3. If we set M = log(HSA/δ)/ log(1/(1 -Φ(-1))), where Φ(•) is the cumulative distribution function of the standard normal distribution, then with probability at least 1 -δ, for any (x, h) ∈ X × [H],

min i∈[M ] ⟨g(x; W 0 ), W i,lin h ⟩ ≤ ⟨g(x; W 0 ), Ŵ lin h ⟩ -σ h ∥g(x; W 0 )∥ Λ -1 h .

We are now ready to prove Lemma D.2.
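Aside: the ensemble-size choice in Lemma E.3 above is easy to reproduce numerically. The sketch below (assumed parameter values, not the paper's experiments) computes M and checks that the per-(x, h) failure probability (1 - Φ(-1))^M is driven below δ/(HSA).

```python
import math

# Choosing M = log(HSA/delta) / log(1/(1 - Phi(-1))) copies guarantees that, for
# each (x, h), the event "min_i <g, W_i,lin> fails to drop sigma_h*||g||_{Lambda^-1}
# below the mean" has probability at most (1 - Phi(-1))^M <= delta/(HSA).
def std_normal_cdf(z: float) -> float:
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def ensemble_size(H: int, S: int, A: int, delta: float) -> int:
    phi_m1 = std_normal_cdf(-1.0)          # Phi(-1) ~ 0.1587
    return math.ceil(math.log(H * S * A / delta) / math.log(1.0 / (1.0 - phi_m1)))

M = ensemble_size(H=20, S=100, A=10, delta=0.01)
phi_m1 = std_normal_cdf(-1.0)
# per-(x, h) failure probability after taking the min over M independent draws
assert (1.0 - phi_m1) ** M <= 0.01 / (20 * 100 * 10)
```

Note that M grows only logarithmically in H, S, A, and 1/δ, which is why acting with the minimum of a small ensemble suffices for pessimism.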
Proof of Lemma D.2. We have

min i∈[M ] f (x; W i h ) -f (x; Ŵh ) ≤ min i∈[M ] ⟨g(x; W 0 ), W i,lin h -W 0 ⟩ -⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩ + ι 1 + ι 2 ≤ -σ h ∥g(x; W 0 )∥ Λ -1 h + ι 1 + ι 2 ,

where the first inequality holds due to Lemma D.3, and the second inequality holds due to Lemma E.3. Thus, we have

Qh (x) = min{ min i∈[M ] f (x; W i h ), H -h + 1 + ψ} + ≤ max{ min i∈[M ] f (x; W i h ), 0} ≤ max{⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩ -σ h ∥g(x; W 0 )∥ Λ -1 h + ι 1 + ι 2 , 0}.

E.3 PROOF OF LEMMA D.3

In this subsection, we provide a detailed proof of Lemma D.3. We first provide intermediate lemmas that we use for proving Lemma D.3; their detailed proofs are deferred to Section F. The following lemma bounds the gradient-descent weights of the perturbed loss function around their linear-model counterparts.

Lemma E.4. Let

m = Ω d 3/2 R -1 log 3/2 ( √ m/R) , R = O m 1/2 log -3 m , η ≤ (λ + K ′ C 2 g ) -1 , R ≥ max{4 B1 , 4 B2 , 2 √( 2λ -1 K ′ (H + ψ + γ h,1 ) 2 + 4γ 2 h,2 )},

where B1 , B2 , γ h,1 and γ h,2 are defined in Table 3 and C g is a constant given in Lemma G.1. With probability at least 1 -M Hm -2 -δ, for any (i, j, h) ∈ [M ] × [J] × [H], we have

• W i,(j) h ∈ B(W 0 ; R),
• ∥ W i,(j) h -W i,lin h ∥ 2 ≤ B1 + B2 + √( λ -1 (1 -ηλ) j ( K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 ) ).

Similar to Lemma E.4, we obtain the following lemma for the gradient-descent weights of the non-perturbed loss function.

Lemma E.5. Let

m = Ω d 3/2 R -1 log 3/2 ( √ m/R) , R = O m 1/2 log -3 m , η ≤ (λ + K ′ C 2 g ) -1 , R ≥ max{4B 1 , 4 B2 , 2 √ 2λ -1 K ′ (H + ψ)},

where B 1 , B2 , γ h,1 and γ h,2 are defined in Table 3 and C g is a constant given in Lemma G.1. With probability at least 1 -M Hm -2 -δ, for any (i, j, h) ∈ [M ] × [J] × [H], we have

• Ŵ (j) h ∈ B(W 0 ; R),
• ∥ Ŵ (j) h -Ŵ lin h ∥ 2 ≤ B 1 + B2 + √( λ -1 (1 -ηλ) j K ′ (H + ψ) 2 ).

We can now prove Lemma D.3.

Proof of Lemma D.3.
Note that Equation (21) implies both Equation (40) of Lemma E.4 and Equation (41) of Lemma E.5; thus, both Lemma E.4 and Lemma E.5 hold under Equation (21). Then, by the union bound, with probability at least 1 -M Hm -2 -δ, for any (i, j, h) ∈ [M ] × [J] × [H] and x ∈ S d-1 ,

|f (x; W i,(j) h ) -⟨g(x; W 0 ), W i,lin h -W 0 ⟩| ≤ |f (x; W i,(j) h ) -⟨g(x; W 0 ), W i,(j) h -W 0 ⟩| + |⟨g(x; W 0 ), W i,(j) h -W i,lin h ⟩| ≤ C g R 4/3 m -1/6 log m + C g B1 + B2 + √( λ -1 (1 -ηλ) j ( K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 ) ) = ι 1 ,

where the first inequality holds due to the triangle inequality, and the second inequality holds due to the Cauchy-Schwarz inequality, Lemma G.1, and Lemma E.4. Similarly, by the union bound, with probability at least 1 -Hm -2 , for any (j, h) ∈ [J] × [H] and x ∈ S d-1 ,

|f (x; Ŵ (j) h ) -⟨g(x; W 0 ), Ŵ lin h -W 0 ⟩| ≤ |f (x; Ŵ (j) h ) -⟨g(x; W 0 ), Ŵ (j) h -W 0 ⟩| + |⟨g(x; W 0 ), Ŵ (j) h -Ŵ lin h ⟩| ≤ C g R 4/3 m -1/6 log m + C g B 1 + B2 + √( λ -1 (1 -ηλ) j K ′ (H + ψ) 2 ) = ι 2 ,

where the first inequality holds due to the triangle inequality, and the second inequality holds due to the Cauchy-Schwarz inequality, Lemma E.5, and Lemma G.1.

F PROOFS OF LEMMAS IN SECTION E

In this section, we provide the detailed proofs of Lemmas in Section E.

F.1 PROOF OF LEMMA E.1

Proof of Lemma E.1. As B h Ṽh+1 ∈ Q * by Assumption 5.1, where Q * is defined in Section 5, we have B h Ṽh+1 = ∫ R d c(w) T x σ ′ (w T x) dw for some c : R d → R d such that sup w ∥c(w)∥ 2 /p 0 (w) ≤ B. The lemma then directly follows from approximation by a finite sum (Gao et al., 2019).

F.2 PROOF OF LEMMA E.2

Proof of Lemma E.2. Let W := W + ζ i h and Li h ( W ) := k∈I h ⟨g(x k h ; W 0 ), W ⟩ -ȳ i,k h 2 + λ∥ W ∥ 2 2 , where ȳ i,k h = r k h + Ṽh+1 (s k h+1 ) + ξ k,i h + ⟨g(x k h ; W 0 ), ζ i h ⟩. The minimizer of Li h satisfies

arg min W Li h ( W ) = Λ -1 h k∈I h g(x k h ; W 0 )ȳ i,k h = Λ -1 h k∈I h g(x k h ; W 0 )(r k h + Ṽh+1 (s k h+1 ) + ξ k,i h ) + k∈I h g(x k h ; W 0 )⟨g(x k h ; W 0 ), ζ i h ⟩ = Λ -1 h k∈I h g(x k h ; W 0 )(r k h + Ṽh+1 (s k h+1 ) + ξ k,i h ) + k∈I h g(x k h ; W 0 )g(x k h ; W 0 ) T ζ i h = Λ -1 h k∈I h g(x k h ; W 0 )(r k h + Ṽh+1 (s k h+1 ) + ξ k,i h ) + (Λ h -λI md )ζ i h .

Thus, we have

W i,lin h = arg min W Li,lin h (W ) = arg min W Li h ( W ) -ζ i h = Λ -1 h k∈I h g(x k h ; W 0 )(r k h + Ṽh+1 (s k h+1 ) + ξ k,i h ) + (Λ h -λI md )ζ i h -ζ i h = Λ -1 h ( k∈I h g(x k h ; W 0 )(r k h + Ṽh+1 (s k h+1 ) + ξ k,i h ) -λζ i h ) = Ŵ lin h + Λ -1 h ( k∈I h g(x k h ; W 0 )ξ k,i h -λζ i h ).

By direct computation, it is then easy to see that

W i,lin h -Ŵ lin h = Λ -1 h ( k∈I h g(x k h ; W 0 )ξ k,i h -λζ i h ) ∼ N (0, σ 2 h Λ -1 h ).

F.3 PROOF OF LEMMA E.3

In this subsection, we provide a proof of Lemma E.3. We first bound the perturbation noises used in Algorithm 1 in the following lemma.

Lemma F.1. There exist absolute constants c 1 , c 2 > 0 such that for any δ > 0, with probability at least 1 -δ, the event E(δ) holds in which, for any (k, h, i) ∈ [K] × [H] × [M ],

|ξ k,i h | ≤ c 1 σ h √( log(K ′ HM/δ) ) =: γ h,1 , ∥ζ i h ∥ 2 ≤ c 2 σ h √( d log(dK ′ HM/δ) ) =: γ h,2 .

Proof of Lemma F.1. It directly follows from the Gaussian concentration inequality in Lemma G.3 and the union bound.

We can now prove Lemma E.3.

Proof of Lemma E.3. By Lemma E.2, W i,lin h -Ŵ lin h ∼ N (0, σ 2 h Λ -1 h ).
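The closed-form algebra in the proof of Lemma E.2 above can be verified numerically before continuing. The sketch below (synthetic data, not the paper's code) solves the perturbed ridge problem in the shifted variable and checks that it matches Ŵ_lin + Λ⁻¹(Σ_k g_k ξ_k − λζ), which is exactly the Gaussian form N(Ŵ_lin, σ²_h Λ⁻¹) when ξ_k ~ N(0, σ²_h) and ζ ~ N(0, (σ²_h/λ) I).

```python
import numpy as np

# Verify: solving  min_W sum_k (<g_k, W> - (y_k + xi_k + <g_k, zeta>))^2
#                        + lam * ||W + zeta||^2
# (i.e., the perturbed objective in the shifted variable W~ = W + zeta) gives
#   W_i,lin = W_hat,lin + Lambda^{-1} (sum_k g_k xi_k - lam * zeta).
rng = np.random.default_rng(2)
d, K, lam = 6, 50, 1.0
G = rng.normal(size=(K, d))            # rows play the role of g(x_k; W_0)^T
y = rng.normal(size=K)                 # unperturbed targets r + V~
xi = rng.normal(size=K)                # reward perturbations xi_k
zeta = rng.normal(size=d)              # weight perturbation zeta

Lam = lam * np.eye(d) + G.T @ G
W_hat = np.linalg.solve(Lam, G.T @ y)  # unperturbed ridge solution W_hat,lin

# perturbed ridge, solved in the shifted variable, then shifted back
y_bar = y + xi + G @ zeta
W_tilde = np.linalg.solve(Lam, G.T @ y_bar)
W_i = W_tilde - zeta

closed_form = W_hat + np.linalg.solve(Lam, G.T @ xi - lam * zeta)
assert np.allclose(W_i, closed_form)
```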
Using the anti-concentration of Gaussian distribution, for any x = (s, a) ∈ S × A and any i ∈ [M ], P ⟨g(x; W 0 ), W i,lin h ⟩ ≤ ⟨g(x; W 0 ), Ŵ lin h ⟩ -σ h ∥g(x; W 0 )∥ Λ -1 h = Φ(-1) ∈ (0, 1). As { W i,lin h } i∈[M ] are independent, using the union bound, with probability at least 1 -SAH(1 -Φ(-1)) M , for any x = (s, a) ∈ S × A, and h ∈ [H], min i∈[M ] ⟨g(x; W 0 ), W i,lin h ⟩ ≤ ⟨g(x; W 0 ), Ŵ lin h ⟩ -σ h ∥g(x; W 0 )∥ Λ -1 h . Setting δ = SAH(1 -Φ(-1)) M completes the proof. F.4 PROOF OF LEMMA E.4 In this subsection, we provide a detailed proof of Lemma E.4. We first prove the following intermediate lemma whose proof is deferred to Subsection F.5. Lemma F.2. Let m = Ω d 3/2 R -1 log 3/2 ( √ m/R) R = O m 1/2 log -3 m . and additionally let η ≤ (K ′ C 2 g + λ/2) -1 , η ≤ 1 2λ . Then with probability at least 1 -M Hm -2 -δ, for any (i, j, h) ∈ [M ] × [J] × [H], if W i,(j) h ∈ B(W 0 ; R) for any j ′ ∈ [j], then ∥f j ′ -ỹ∥ 2 ≲ K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 + (λη) -2 R 4/3 m -1/6 log m. We now can prove Lemma E.4. Proof of Lemma E.4. To simplify the notations, we define ∆ j := W i,(j) h - W i,lin,(j) h ∈ R md , G j := g(x k h ; W i,(j) h ) k∈I h ∈ R md×K ′ , H j := G j G T j ∈ R md×md , f j := f (x k h ; W i,(j) h ) k∈I h ∈ R K ′ ỹ := ỹi,k h k∈I h ∈ R K ′ . The gradient descent update rule for W i,(j) h in Equation ( 9) can be written as: W i,(j+1) h = W i,(j) h -η G j (f j -ỹ) + λ( W i,(j) h + ζ i h -W 0 ) . The auxiliary updates in Equation ( 15) can be written as: W i,lin,(j+1) h = W i,lin,(j) h -η G 0 G T 0 ( W i,lin,(j) h -W 0 ) -ỹ + λ( W i,lin,(j) h + ζ i h -W 0 ) . Step 1: Proving W i,(j) h ∈ B(W 0 ; R) for all j. In the first step, we prove by induction that with probability at least 1 -M Hm -2 -δ, for any (i, j, h) ∈ [M ] × [J] × [H], we have W i,(j) h ∈ B(W 0 ; R). In the rest of the proof, we consider under the event that Lemma F.2 holds. 
Note that the condition in Lemma E.4 satisfies that of Lemma F.2 and under the above event of Lemma F.2, Lemma G.1 and Lemma F.1 both hold. It is trivial that W i,(0) h = W 0 ∈ B(W 0 ; R). For any fix j ≥ 0, we assume that W i,(j ′ ) h ∈ B(W 0 ; R), ∀j ′ ∈ [j]. We will prove that W i,(j+1) h ∈ B(W 0 ; R). We have ∥∆ j+1 ∥ 2 = (1 -ηλ)∆ j -η G 0 (f j -G T 0 ( W i,(j) h -W 0 )) + G 0 G T 0 ( W i,(j) h - W i,lin,(j) h ) + (f j -ỹ)(G j -G 0 ) 2 ≤ ∥(I -η(λI + H 0 ))∆ j ∥ 2 I1 + η∥f j -ỹ∥ 2 ∥G j -G 0 ∥ 2 I2 + η∥G 0 ∥ 2 ∥f j -G T 0 ( W i,(j) h -W 0 )∥ 2 I3 . We bound I 1 , I 2 and I 3 separately. Bounding I 1 . For bounding I 1 , I 1 = ∥(I -η(λI + H 0 ))∆ j ∥ 2 ≤ ∥I -η(λI + H 0 )∥ 2 ∥∆ j ∥ 2 ≤ (1 -η(λ + K ′ C 2 g ))∥∆ j ∥ 2 ≤ (1 -ηλ)∥∆ j ∥ 2 where the first inequality holds due to the spectral norm inequality, the second inequality holds due to η(λI + H 0 ) ⪯ η(λ + ∥G 0 ∥ 2 )I ⪯ η(λ + K ′ C 2 g )I ⪯ I, where the first inequality holds due to that H 0 ⪯ ∥H 0 ∥ 2 I ⪯ ∥G 0 ∥ 2 2 I, the second inequality holds due to that ∥G 0 ∥ 2 ≤ √ KC g due to Lemma G.1, and the last inequality holds due to the choice of η in Equation (40). Bounding I 2 . For bounding I 2 , I 2 = η∥f j -ỹ∥ 2 ∥G j -G 0 ∥ 2 ≤ η∥f j -ỹ∥ 2 max k∈I h √ K ′ ∥g(x k h ; W i,(j) h ) -g(x k h ; W 0 )∥ 2 ≤ η∥f j -ỹ∥ 2 √ K ′ C g R 1/3 m -1/6 log m ≤ η 2K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 + 8C g R 4/3 m -1/6 log m) √ K ′ C g R 1/3 m -1/6 log m where the first inequality holds due to Cauchy-Schwarz inequality, the second inequality holds due to the induction assumption in Equation ( 42) and Lemma G.1, and the third inequality holds due to Lemma F.2 and the induction assumption in Equation (42). Bounding I 3 . 
For bounding I 3 ,

I 3 = η∥G 0 ∥ 2 ∥f j -G T 0 ( W i,(j) h -W 0 )∥ 2 ≤ η √ K ′ C g • √ K ′ max k∈I h |f (x k h ; W i,(j) h ) -g(x k h ; W 0 ) T ( W i,(j) h -W 0 )| ≤ ηK ′ C g R 4/3 m -1/6 log m,

where the first inequality holds due to the Cauchy-Schwarz inequality and ∥G 0 ∥ 2 ≤ √ K ′ C g , and the second inequality holds due to the induction assumption in Equation (42) and Lemma G.1.

Combining the bounds of I 1 , I 2 , and I 3 above, we have ∥∆ j+1 ∥ 2 ≤ (1 -ηλ)∥∆ j ∥ 2 + I 2 + I 3 . Recursively applying this inequality over j, we have

∥∆ j ∥ 2 ≤ (I 2 + I 3 )/(ηλ) ≤ R/4 + R/4 = R/2, (43)

where the second inequality holds due to the choice specified in Equation (40). We also have

λ∥ W i,lin,(j+1) h + ζ i h -W 0 ∥ 2 2 ≤ 2 Li,lin h ( W i,lin,(j+1) h ) ≤ 2 Li,lin h ( W i,lin,(0) h ) = 2 Li,lin h (W 0 ) = k∈I h ⟨g(x k h ; W 0 ), W 0 ⟩ -ỹ i,k h 2 + λ∥ζ i h ∥ 2 2 = k∈I h (ỹ i,k h ) 2 + λ∥ζ i h ∥ 2 2 ≤ K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 , (44)

where the first inequality holds due to the definition of Li,lin h , the second inequality holds due to the monotonicity of Li,lin h along the gradient-descent updates { W i,lin,(j ′ ) h } j ′ for the squared loss on a linear model, the third equality holds due to ⟨g(x k h ; W 0 ), W 0 ⟩ = 0 from the symmetric initialization scheme, and the last inequality holds due to Lemma F.1. Thus, we have

∥ W i,lin,(j+1) h -W 0 ∥ 2 ≤ √( 2∥ W i,lin,(j+1) h + ζ i h -W 0 ∥ 2 2 + 2∥ζ i h ∥ 2 2 ) ≤ √( 2λ -1 K ′ (H + ψ + γ h,1 ) 2 + 4γ 2 h,2 ) ≤ R/2, (45)

where the first inequality holds due to the Cauchy-Schwarz inequality, the second inequality holds due to Equation (44) and Lemma F.1, and the last inequality holds due to the choice specified in Equation (40). Combining Equation (43) and Equation (45), we have

∥ W i,(j+1) h -W 0 ∥ 2 ≤ ∥ W i,(j+1) h - W i,lin,(j+1) h ∥ 2 + ∥ W i,lin,(j+1) h -W 0 ∥ 2 ≤ R/2 + R/2 = R,

where the first inequality holds due to the triangle inequality.
Step 2: Bounding ∥ W i,(j) h -W i,lin h ∥ 2 . By the linear convergence of gradient descent on the λ-strongly convex objective Li,lin h , we have

∥ W i,lin,(j) h -W i,lin h ∥ 2 2 ≤ (1 -ηλ) j (2/λ) ( Li,lin h (W 0 ) -Li,lin h ( W i,lin h ) ) ≤ (1 -ηλ) j (2/λ) Li,lin h (W 0 ) ≤ λ -1 (1 -ηλ) j ( K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 ). (46)

Thus, for any j, we have

∥ W i,(j) h -W i,lin h ∥ 2 ≤ ∥ W i,(j) h - W i,lin,(j) h ∥ 2 + ∥ W i,lin,(j) h -W i,lin h ∥ 2 ≤ (ηλ) -1 (I 2 + I 3 ) + √( λ -1 (1 -ηλ) j ( K ′ (H + ψ + γ h,1 ) 2 + λγ 2 h,2 ) ),

where the first inequality holds due to the triangle inequality, and the second inequality holds due to Equation (43) and Equation (46).
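The (1 - ηλ)^j contraction used in Step 2 above is the standard linear rate of gradient descent on a strongly convex ridge objective; the sketch below (synthetic data, not the paper's code) checks it numerically with the step size η = (λ + ∥G∥²₂)⁻¹ matching the condition η ≤ (λ + K′C²_g)⁻¹.

```python
import numpy as np

# On L(W) = (1/2)||G^T W - y||^2 + (lam/2)||W||^2 (lam-strongly convex and
# (lam + ||G||_2^2)-smooth), gradient descent with eta <= 1/(lam + ||G||_2^2)
# satisfies L(W_j) - L(W*) <= (1 - eta*lam)^j (L(W_0) - L(W*)).
rng = np.random.default_rng(4)
d, K, lam = 5, 40, 1.0
G = rng.normal(size=(d, K))
y = rng.normal(size=K)

def loss(W):
    return 0.5 * np.sum((G.T @ W - y) ** 2) + 0.5 * lam * W @ W

eta = 1.0 / (lam + np.linalg.norm(G, 2) ** 2)          # 1 / smoothness
W_star = np.linalg.solve(lam * np.eye(d) + G @ G.T, G @ y)
W, L0, Ls = np.zeros(d), loss(np.zeros(d)), loss(W_star)
for j in range(1, 201):
    W = W - eta * (G @ (G.T @ W - y) + lam * W)        # gradient step
    assert loss(W) - Ls <= (1.0 - eta * lam) ** j * (L0 - Ls) + 1e-9
```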

F.5 PROOF OF LEMMA F.2

Proof of Lemma F.2. We bound this term following the proof flow of (Zhou et al., 2020, Lemma C.3) with modifications for different neural parameterization and noisy targets. Suppose that for some fixed j, W i,(j ′ ) h ∈ B(W 0 ; R), ∀j ′ ∈ [j]. Let us define G(W ) := g(x k h ; W ) k∈I h ∈ R md×K ′ , f (W ) := f (x k h ; W ) k∈I h ∈ R K ′ , e(W ′ , W ) := f (W ′ ) -f (W ) -G(W ) T (W ′ -W ) ∈ R. To further simplify the notations in this proof, we drop i, h in Li h (W ) defined in Equation ( 8) to write Li h (W ) as L(W ) and write W j = W i,(j) h , where L(W ) = 1 2 k∈I h f (x k h ; W ) -ỹi,k h ) 2 + λ 2 ∥W + ζ i h -W 0 ∥ 2 2 = 1 2 ∥f (W ) -ỹ∥ 2 2 + λ 2 ∥W + ζ i h -W 0 ∥ 2 2 . Suppose that W ∈ B(W 0 ; R). By that ∥ • ∥ 2 2 is 1-smooth, L(W ′ ) -L(W ) ≤ ⟨f (W ) -ỹ, f (W ′ ) -f (W )⟩ + 1 2 ∥f (W ′ ) -f (W )∥ 2 2 + λ⟨W + ζ i h -W 0 , W ′ -W ⟩ + λ 2 ∥W ′ -W ∥ 2 2 = ⟨f (W ) -ỹ, G(W ) T (W ′ -W ) + e(W ′ , W )⟩ + 1 2 ∥G(W ) T (W ′ -W ) + e(W ′ , W )∥ 2 2 + λ⟨W + ζ i h -W 0 , W ′ -W ⟩ + λ 2 ∥W ′ -W ∥ 2 2 = ⟨∇ L(W ), W ′ -W ⟩ + ⟨f (W ) -ỹ, e(W ′ , W )⟩ + 1 2 ∥G(W ) T (W ′ -W ) + e(W ′ , W )∥ 2 2 + λ 2 ∥W ′ -W ∥ 2 2 I1 . For bounding I 1 , I 1 ≤ ∥f (W ) -ỹ∥ 2 ∥e(W ′ , W )∥ 2 + K ′ C 2 g ∥W ′ -W ∥ 2 2 + ∥e(W ′ , W )∥ 2 2 + λ 2 ∥W ′ -W ∥ 2 2 = ∥f (W ) -ỹ∥ 2 ∥e(W ′ , W )∥ 2 + (K ′ C 2 g + λ/2)∥W ′ -W ∥ 2 2 + ∥e(W ′ , W )∥ 2 2 , where the first inequality holds due to Cauchy-Schwarz inequality, W ∈ B(W 0 ; R) and Lemma G.1. Substituting Equation (49) into Equation ( 48) with W ′ = W -η∇ L(W ), L(W ′ ) -L(W ) ≤ -η(1 -(KC 2 g + λ/2)η)∥∇ L(W )∥ 2 2 + ∥f (W ) -ỹ∥ 2 ∥e(W ′ , W )∥ 2 + ∥e(W ′ , W )∥ 2 2 . 
( ) By the 1-strong convexity of ∥ • ∥ 2 2 , for any W ′ , L(W ′ ) -L(W ) ≥ ⟨f (W ) -ỹ, f (W ′ ) -f (W )⟩ + λ⟨W + ζ i h -W 0 , W ′ -W ⟩ + λ 2 ∥W ′ -W ∥ 2 2 = ⟨f (W ) -ỹ, G(W ) T (W ′ -W ) + e(W ′ , W )⟩ + λ⟨W + ζ i h -W 0 , W ′ -W ⟩ + λ 2 ∥W ′ -W ∥ 2 2 = ⟨∇ L(W ), W ′ -W ⟩ + ⟨f (W ) -ỹ, e(W ′ , W )⟩ + λ 2 ∥W ′ -W ∥ 2 2 ≥ - ∥∇ L(W )∥ 2 2 2λ -∥f (W ) -ỹ∥ 2 ∥e(W ′ , W )∥ 2 , where the last inequality holds due to Cauchy-Schwarz inequality. Substituting Equation (51) into Equation ( 50), for any W ′ , L(W -η∇ L(W )) -L(W ) ≤ 2λη(1 -(KC 2 g + λ/2)η) α L(W ′ ) -L(W ) + ∥f (W ) -ỹ∥ 2 ∥e(W ′ , W )∥ 2 + ∥f (W ) -ỹ∥ 2 ∥e(W -η∇ L(W ), W )∥ 2 + ∥e(W -η∇ L(W ), W )∥ 2 ≤ α L(W ′ ) -L(W ) + γ 1 2 ∥f (W ) -ỹ∥ 2 2 + 1 2γ 1 ∥e(W ′ , W )∥ 2 2 + γ 2 2 ∥f (W ) -ỹ∥ 2 2 + 1 2γ 2 ∥e(W -η∇ L(W ), W )∥ 2 2 + ∥e(W -η∇ L(W ), W )∥ 2 2 ≤ α L(W ′ ) -L(W ) + γ 1 L(W ) + 1 2γ 1 ∥e(W ′ , W )∥ 2 2 + γ 2 L(W ) + 1 2γ 2 ∥e(W -η∇ L(W ), W )∥ 2 2 + ∥e(W -η∇ L(W ), W )∥ 2 2 , where the second inequality holds due to Cauchy-Schwarz inequality for any γ 1 , γ 2 > 0, and the third inequality holds due to ∥f (W ) -ỹ∥ 2 2 ≤ 2 L(W ). Rearranging terms in Equation ( 52) and setting W = W j , W ′ = W 0 , γ 1 = 1 4 , γ 2 = α 4 , L(W j+1 ) -L(W 0 ) ≤ (1 -α + αγ 1 + γ 2 ) L(W j ) -(1 - α 2 ) L(W 0 ) + α 2 L(W 0 ) + α 2γ 1 ∥e(W 0 , W j )∥ 2 2 + 1 2γ 2 ∥e(W j+1 , W j )∥ 2 2 + ∥e(W j+1 , W j )∥ 2 2 = (1 - α 2 ) L(W j ) -L(W 0 ) + α 2 L(W 0 ) + 2α∥e(W 0 , W j )∥ 2 2 + (1 + 2 α )∥e(W j+1 , W j )∥ 2 2 ≤ (1 - α 2 ) L(W j ) -L(W 0 ) + α 2 L(W 0 ) + (1 + 2 α + 2α)e, where e := C g R 4/3 m -1/6 √ log m, the last inequality holds due to Equation ( 47) and Lemma G.1. Applying Equation ( 53), we have L(W j ) -L(W 0 ) ≤ 2 α α 2 L(W 0 ) + (1 + 2 α + 2α)e . Rearranging the above inequality, L(W j ) ≤ 2 L(W 0 ) + ( 2 α + 4 α 2 + 4 )e where the last inequality holds due to the choice of η. Finally, we have Proof of Lemma G.1. Due to (Yang et al., 2020, Lemma C.2 ) and (Cai et al., 2019, Lemma F.1, F. 
2), we have the first two inequalities and the following:

∥f j -ỹ∥ 2 2 ≤ 2 L(W j ) and L(W 0 ) = 1 2 ∥ỹ∥ 2 2 + λ 2 ∥ζ i h ∥ 2 2 ≤ K ′ 2 (H + ψ + γ h,1 ) 2 + λ 2 γ 2 h,2 ,

where the last inequality holds due to Lemma F.1.

G SUPPORT LEMMAS

Lemma G.1. Let m = Ω d 3/2 R -1 log 3/2 ( √ m/R) and R = O m 1/2 log -3 m . With probability at least 1 -e -Ω(log 2 m) ≥ 1 -m -2 , it holds that

|f (x; W ) -⟨g(x; W 0 ), W -W 0 ⟩| ≤ O C g R 4/3 m -1/6 log m .

For any W, W ′ ∈ B(W 0 ; R),

f (x; W ) -f (x; W ′ ) -⟨g(x; W ′ ), W -W ′ ⟩ = f (x; W ) -⟨g(x; W 0 ), W -W 0 ⟩ -(f (x; W ′ ) -⟨g(x; W 0 ), W ′ -W 0 ⟩) + ⟨g(x; W 0 ) -g(x; W ′ ), W 0 -W ′ ⟩.

Thus,

|f (x; W ) -f (x; W ′ ) -⟨g(x; W ′ ), W -W ′ ⟩| ≤ |f (x; W ) -⟨g(x; W 0 ), W -W 0 ⟩| + |f (x; W ′ ) -⟨g(x; W 0 ), W ′ -W 0 ⟩| + ∥g(x; W 0 ) -g(x; W ′ )∥ 2 ∥W 0 -W ′ ∥ 2 ≤ O C g R 4/3 m -1/6 log m .

Lemma G.2 ((Arora et al., 2019, Theorem 3)). If m = Ω(ϵ -4 log(1/δ)), then for any x, x ′ ∈ X ⊂ S d-1 , with probability at least 1 -δ, |⟨g(x; W 0 ), g(x ′ ; W 0 )⟩ -K ntk (x, x ′ )| ≤ 2ϵ.

Lemma G.3. Let X ∼ N (0, aΛ -1 ) be a d-dimensional normal variable, where a is a scalar. There exists an absolute constant c > 0 such that for any δ > 0, with probability at least 1 -δ, ∥X∥ Λ ≤ c √( da log(d/δ) ). For d = 1, c = √

Lemma G.4 (A variant of the Hoeffding-Azuma inequality). Suppose {Z k } ∞ k=0 is a real-valued stochastic process with corresponding filtration {F k } ∞ k=0 , i.e., Z k is F k -measurable for every k. Suppose that for any k, E[|Z k |] < ∞ and |Z k -E [Z k |F k-1 ] | ≤ c k almost surely. Then for all positive n and t, we have:

P n k=1 Z k - n k=1 E [Z k |F k-1 ] ≥ t ≤ 2 exp -t 2 / (2 n i=1 c 2 i ) .

Lemma G.5 ((Chowdhury & Gopalan, 2017, Theorem 1)). Let H be an RKHS defined over X ⊆ R d . Let {x t } ∞ t=1 be a discrete-time stochastic process adapted to the filtration {F t } ∞ t=0 . Let {Z k } ∞ k=1 be a real-valued stochastic process such that Z k ∈ F k , and Z k is zero-mean and σ-sub-Gaussian conditioned on F k-1 . Let E k = (Z 1 , . . .
, Z k-1 ) T ∈ R k-1 and K k be the Gram matrix of H defined on {x t } t≤k-1 . For any ρ > 0 and δ ∈ (0, 1), with probability at least 1 -δ, E T k (K k + ρI) -1 + I -1 E k ≤ σ 2 logdet [(1 + ρ)I + K k ] + 2σ 2 log(1/δ). Lemma G.6. For any matrices A and B where A is invertible, logdet(A + B) ≤ logdet(A) + tr(A -1 B). Lemma G.7. For any invertible matrices A, B, ∥A -1 -B -1 ∥ 2 ≤ ∥A -B∥ 2 λ min (A)λ min (B) . Proof of Lemma G.7. We have: ∥A -1 -B -1 ∥ 2 = ∥(AB) -1 (AB)(A -1 -B -1 )∥ 2 = ∥(AB) -1 (ABA -1 -A)∥ 2 ≤ ∥(AB) -1 ∥ 2 ∥ABA -1 -A∥ 2 = ∥(AB) -1 ∥ 2 ∥ABA -1 -AAA -1 ∥ 2 = ∥(AB) -1 ∥ 2 ∥A(B -A)A -1 ∥ 2 = ∥(AB) -1 ∥ 2 ∥B -A∥ 2 ≤ λ max (A -1 )λ max (B -1 )∥ 2 ∥B -A∥ 2 . Lemma G.8 (Freedman's inequality (Tropp, 2011) ). Let {X k } n k=1 be a real-valued martingale difference sequence with the corresponding filtration {F k } n k=1 , i.e. X k is F k -measurable and E[X k |F k-1 ] = 0. Suppose for any k, |X k | ≤ M almost surely and define V := n k=1 E X 2 k |F k-1 . For any a, b > 0, we have: P n k=1 X k ≥ a, V ≤ b ≤ exp -a 2 2b + 2aM/3 . In an alternative form, for any t > 0, we have: P n k=1 X k ≥ 2M t 3 + √ 2bt, V ≤ b ≤ e -t . Lemma G.9 (Improved online-to-batch argument Nguyen-Tang et al. ( 2023)). Let {X k } be any real-valued stochastic process adapted to the filtration {F k }, i.e. X k is F k -measurable. Suppose that for any k, X k ∈ [0, H] almost surely for some H > 0. For any K > 0, with probability at least 1 -δ, we have: K k=1 E [X k |F k-1 ] ≤ 2 K k=1 X k + 16 3 H log(log 2 (KH)/δ) + 2. Proof of Lemma G.9 . Let Z k = X k -E [X k |F k-1 ] and f (K) = K k=1 E [X k |F k-1 ]. We have Z k is a real-valued difference martingale with the corresponding filtration {F k } and that V := K k=1 E Z 2 k |F k-1 ≤ K k=1 E X 2 k |F k-1 ≤ H K k=1 E [X k |F k-1 ] = Hf (K). Note that |Z k | ≤ H and f (K) ∈ [0, KH] and let m = log 2 (KH). Also note that f (K) = K k=1 X k - K k=1 Z k ≥ - K k=1 Z k . Thus if K k=1 Z k ≤ -1, we have f (K) ≥ 1. 
For any t > 0, leveraging the peeling technique (Bartlett et al., 2005) , we have: P K k=1 Z k ≤ - 2Ht 3 -4Hf (K)t -1 = P K k=1 Z k ≤ - 2Ht 3 -4Hf (K)t -1, f (K) ∈ [1, KH] ≤ m i=1 P K k=1 Z k ≤ - 2Ht 3 -4Hf (K)t -1, f (K) ∈ [2 i-1 , 2 i ) ≤ m i=1 P K k=1 Z k ≤ - 2Ht 3 - √ 4H2 i-1 t -1, V ≤ H2 i , f (K) ∈ [2 i-1 , 2 i ) ≤ m i=1 P K k=1 Z k ≤ - 2Ht 3 - √ 2H2 i t, V ≤ H2 i ≤ m i=1 e -t = me -t , where the first equation is by that K k=1 Z k ≤ -2Ht 3 -4Hf (K)t -1 ≤ -1 thus f (K) ≥ 1, the second inequality is by that V ≤ Hf (K), and the last inequality is by Lemma G.8. Thus, with probability at least 1 -me -t , we have: K k=1 X k -f (K) = K k=1 Z k ≥ - 2Ht 3 -4Hf (K)t -1. The above inequality implies that f (K) ≤ 2 K k=1 X k + 4Ht/3 + 2 + 4Ht, due to the simple inequality: if x ≤ a √ x + b, x ≤ a 2 + 2b. Then setting t = log(m/δ) completes the proof.
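Lemma G.9 above can be probed empirically. The Monte-Carlo sketch below (an illustration under an i.i.d. assumption, not the paper's code) checks that the improved online-to-batch bound holds on all but (at most) a δ fraction of sample paths for a bounded i.i.d. process, where the conditional means reduce to the common mean.

```python
import math
import numpy as np

# For X_k i.i.d. in [0, H], E[X_k | F_{k-1}] = E[X], so Lemma G.9 reads
#   K * E[X] <= 2 * sum_k X_k + (16/3) * H * log(log2(K*H)/delta) + 2
# with probability at least 1 - delta over sample paths.
rng = np.random.default_rng(6)
H, K, delta, trials = 5.0, 1000, 0.05, 500
failures = 0
for _ in range(trials):
    X = H * rng.uniform(size=K)            # X_k in [0, H], mean H/2
    lhs = K * (H / 2.0)                    # sum of conditional means
    rhs = 2.0 * X.sum() + (16.0 / 3.0) * H * math.log(math.log2(K * H) / delta) + 2.0
    failures += lhs > rhs
assert failures / trials <= delta
```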

H ALGORITHMS

For completeness, we give the definition of linear MDPs as follows.

Definition 4 (Linear MDPs (Yang & Wang, 2019; Jin et al., 2020)). An MDP has a linear structure if for any (s, a, s ′ , h), we have

r h (s, a) = ϕ h (s, a) T θ h , P h (s ′ |s, a) = ϕ h (s, a) T µ h (s ′ ),

where ϕ : S × A → R d lin is a known feature map, θ h ∈ R d lin is an unknown vector, and µ h : S → R d lin are unknown signed measures.

We also give the details of the baseline algorithms: LinLCB in Algorithm 3, LinGreedy in Algorithm 4, Lin-VIPeR in Algorithm 5, NeuraLCB in Algorithm 6, and NeuralGreedy in Algorithm 7. For simplicity, we do not use data splitting in the algorithms presented here.

Algorithm 3 LinLCB (Jin et al., 2021) 1: Input: Offline data D = {(s k h , a k h , r k h )}



This symmetric initialization scheme makes f (x; W0) = 0 and ⟨g(x; W0), W0⟩ = 0 for any x.
Without loss of generality, we assume K/H ∈ N.
Note that this is the worst-case bound, and the effective dimension can be significantly smaller in practice.
We consider V : S → [0, H + 1] instead of V : S → [0, H] due to the cutoff margin ψ in Algorithm 1.
Our code is available here: https://github.com/thanhnguyentang/neural-offline-rl.
In our experiment, we also observe that without this penalization term, the method struggles to learn any good policy. However, using only the penalization term without the first term in Eq. (1), we observe that the method cannot learn either.



as follows: For any i ≤ m 2 , w i = w m 2 +i ∼ N (0, I d /d), and b

Figure 1: Data splitting.

When realized to a linear MDP in R d lin , d = d lin and our bound reduces into Õ κH 5/2 d lin √ K which improves the bound Õ(d

Figure 2: Empirical results of sub-optimality (in log scale) on linear MDPs.

Figure 3: Sub-optimality (on log-scale) vs. sample size (K) for neural contextual bandits with following reward functions: (a) r(s, a) = cos(3s T θ a ), (b) r(s, a) = exp(-10(s T θ a ) 2 ), and (c) MNIST.

Figure 5: Sub-optimality of Neural-VIPeR versus different values of M .

Figure 4 shows the average runtime for action selection of the neural-based algorithms NeuraLCB, NeuraLCB (Diag), and Neural-VIPeR. We observe that the algorithms that use explicit confidence regions, i.e., NeuraLCB and NeuraLCB (Diag), take significant time to select an action when either the number of offline samples K or the network width m increases. This is perhaps not surprising, because NeuraLCB and NeuraLCB (Diag) need to compute the inverse of a large covariance matrix to sample an action and maintain the confidence region for each action per state. The diagonal approximation significantly reduces the runtime of NeuraLCB, but the runtime still scales with the number of samples and the network width. In comparison, the runtime for action selection in Neural-VIPeR is constant. Since NeuraLCB, NeuraLCB (Diag), and Neural-VIPeR use the same neural network architecture, the runtime spent training one model is similar. The only difference is that Neural-VIPeR trains M models while NeuraLCB and NeuraLCB (Diag) train a single model. However, as the perturbed data in Algorithm 1 are independent, training the M models in Neural-VIPeR is embarrassingly parallelizable.
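The two action-selection rules compared in Figure 4 can be sketched as follows. The names, shapes, and the stand-in covariance below are ours, not the paper's code; the point is that the VIPeR-style rule only evaluates M trained networks and takes a minimum, while an LCB-style rule must additionally form an explicit bonus from a covariance matrix, which is what dominates its runtime.

```python
import numpy as np

# Illustrative contrast of the two action-selection rules on a single state.
rng = np.random.default_rng(7)
n_actions, feat_dim, M = 4, 32, 10

g = rng.normal(size=(n_actions, feat_dim))     # per-action features g(s, a; W_0)
ensemble_W = rng.normal(size=(M, feat_dim))    # M perturbed weight vectors

# VIPeR-style: pessimism via a min over the ensemble, then act greedily -- O(1)
# in the number of offline samples, since only M forward passes are needed.
q_ens = np.min(ensemble_W @ g.T, axis=0)       # shape (n_actions,)
a_viper = int(np.argmax(q_ens))

# LCB-style: point estimate minus an explicit confidence bonus, which requires
# maintaining and inverting a covariance matrix Lambda.
W_hat = ensemble_W.mean(axis=0)
Lam = np.eye(feat_dim) + g.T @ g               # stand-in covariance Lambda
bonus = np.sqrt(np.einsum('ad,de,ae->a', g, np.linalg.inv(Lam), g))
a_lcb = int(np.argmax(g @ W_hat - bonus))

assert 0 <= a_viper < n_actions and 0 <= a_lcb < n_actions
```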

TECHNICAL REVIEW AND PROOF OVERVIEW

Technical Review. In what follows, we provide a more detailed discussion placing our technical contribution in the context of the related literature. Our technical result starts with the value difference lemma in Jin et al. (2021) to connect bounding the suboptimality of an offline algorithm to controlling the uncertainty quantification in the value estimates. Thus, our key technical contribution is to provably quantify the uncertainty of the perturbed value function estimates, which are obtained via reward perturbation and gradient descent. This problem setting is largely different from the current analysis of overparameterized neural networks for supervised learning, which does not require uncertainty quantification. Our work is not the first to consider uncertainty quantification with overparameterized neural networks, since it has been studied in Zhou et al. (2020); Nguyen-Tang et al. (2022a); Jia et al. (2022). However, there are significant technical differences between our work and these works. The work in Zhou et al. (2020); Nguyen-Tang et al. (2022a) considers contextual bandits with overparameterized neural networks trained by (S)GD and quantifies the uncertainty of the value function with explicit empirical covariance matrices. We consider general MDPs and use reward perturbation to implicitly obtain pessimism, thus requiring different proof techniques. Jia et al. (

) of Lemma C.1 satisfy Equation (21). Moreover, the event in which the inequality in Lemma D.3 holds already implies the event in which the inequality in Lemma D.1 holds (see the proofs of Lemma D.3 and Lemma D.1 in Section D). Now in the rest of the proof, we consider the joint event in which both the inequality of Lemma D.3 and that of Lemma D.1 hold. Then, we also have the inequality in Lemma D.1. Consider any x ∈ X , h ∈ [H].

2 and the union bound, and the last equality holds due to the definition of d̃_h.

E PROOFS OF LEMMAS IN SECTION D

E.1 PROOF OF LEMMA D.1

respect to the random initialization, it holds for any W, W′ ∈ B(W_0; R) and x ∈ S^{d−1} that

∥g(x; W)∥_2 ≤ C_g,
∥g(x; W) − g(x; W_0)∥_2 ≤ O(C_g R^{1/3} m^{−1/6} √(log m)),
|f(x; W) − f(x; W′) − ⟨g(x; W′), W − W′⟩| ≤ O(C_g R^{4/3} m^{−1/6} √(log m)),

where C_g = O(1) is a constant independent of d and m. Moreover, without loss of generality, we assume C_g ≤ 1.

r_h(s, a) = φ_h(s, a)^⊤ θ_h,   P_h(s′ | s, a) = φ_h(s, a)^⊤ μ_h(s′),
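This linear structure is preserved under Bellman backups: for any value function V, since both the reward and the transition kernel are linear in φ_h,

```latex
(\mathbb{B}_h V)(s,a)
  = r_h(s,a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s,a)}\left[ V(s') \right]
  = \phi_h(s,a)^\top \Big( \theta_h + \int V(s')\, \mathrm{d}\mu_h(s') \Big)
  =: \phi_h(s,a)^\top w_h,
```

so every Bellman backup, and in particular every action-value function Q_h^π, is linear in the features. This is the property that algorithms for linear MDPs exploit by estimating w_h via regression on the features.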

Algorithm 3 NeuraLCB (a modification of (Nguyen-Tang et al., 2022a))
1: Input: Offline data D = {(s_h^k, a_h^k, r_h^k)}_{k∈[K], h∈[H]}, neural networks F = {f(·, ·; W) : W ∈ W} ⊂ {X → R}, uncertainty multiplier β, regularization parameter λ, step size η, number of gradient descent steps J
2: Initialize Ṽ_{H+1}(·) ← 0 and initialize f(·, ·; W) with initial parameter W_0
3: for h = H, . . . , 1 do
4:   Ŵ_h ← GradientDescent(λ, η, J, {(s_h^k, a_h^k, r_h^k)}_{k∈[K]}, 0, W_0) (Algorithm 2)
5:   Λ_h ← λI + Σ_{k∈[K]} g(s_h^k, a_h^k; Ŵ_h) g(s_h^k, a_h^k; Ŵ_h)^⊤
6:   Compute Q̂_h(·, ·) ← min{f(·, ·; Ŵ_h) − β ∥g(·, ·; Ŵ_h)∥_{Λ_h^{−1}}, H − h + 1}^+
7:   π̂_h ← arg max_{π_h} ⟨Q̂_h, π_h⟩ and V̂_h ← ⟨Q̂_h, π̂_h⟩
8: end for
9: Output: π̂ = {π̂_h}_{h∈[H]}

Algorithm 4 NeuralGreedy
1: Input: Offline data D = {(s_h^k, a_h^k, r_h^k)}_{k∈[K], h∈[H]}, neural networks F = {f(·, ·; W) : W ∈ W} ⊂ {X → R}, regularization parameter λ, step size η, number of gradient descent steps J
2: Initialize Ṽ_{H+1}(·) ← 0 and initialize f(·, ·; W) with initial parameter W_0
3: for h = H, . . . , 1 do
4:   Ŵ_h ← GradientDescent(λ, η, J, {(s_h^k, a_h^k, r_h^k)}_{k∈[K]}, 0, W_0) (Algorithm 2)
5:   Compute Q̂_h(·, ·) ← min{f(·, ·; Ŵ_h), H − h + 1}^+
6:   π̂_h ← arg max_{π_h} ⟨Q̂_h, π_h⟩ and V̂_h ← ⟨Q̂_h, π̂_h⟩
7: end for
8: Output: π̂ = {π̂_h}_{h∈[H]}
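The truncation operator min{·, H − h + 1}^+ used in the listings above simply clips the fitted action-values into the range of returns realizable from step h onward; a minimal numpy sketch (the function name is illustrative):

```python
import numpy as np

H = 5  # horizon length

def truncate(q_raw, h):
    """min{q, H - h + 1}^+ : clip estimated action-values into
    [0, H - h + 1], the range of returns achievable from step h."""
    return np.clip(q_raw, 0.0, H - h + 1)

q = truncate(np.array([-0.3, 2.4, 7.1]), h=2)  # valid range is [0, 4]
```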

_{j∈I_h} is the Gram matrix of K_ntk on the data {x_h^k}_{k∈I_h}. We further define d̃ := max_{h∈[H]} d̃_h.

Remark 1. Intuitively, the effective dimension d̃_h measures the number of principal dimensions over which the projection of the data {x_h^k}_{k∈I_h} in the RKHS H_ntk is spread. It was first introduced by Valko et al. (
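For intuition, a common log-determinant formulation of the effective dimension in the kernelized setting (variants in the literature differ in the normalization of the denominator) is

```latex
\widetilde{d}_h \;=\; \frac{\log \det\!\left( I + K_h / \lambda \right)}{\log\!\left( 1 + |\mathcal{I}_h| / \lambda \right)},
```

which is small whenever the spectrum of the Gram matrix K_h decays quickly, i.e., whenever the data effectively lie in a low-dimensional subspace of H_ntk.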

Assumption 5.2 (Optimal-Policy Concentrability). There exists κ < ∞ such that sup_{(h, s_h, a_h)} d^{π*}_h(s_h, a_h) / d^μ_h(s_h, a_h) ≤ κ, where d^{π*}_h and d^μ_h denote the marginal state-action distributions at step h induced by the optimal policy π* and by the behavior policy that generated the offline data, respectively.

3. The technical challenge

E.2 PROOF OF LEMMA D.2

Before proving Lemma D.2, we prove the following intermediate lemmas; their detailed proofs are deferred to Section F.

Lemma E.2. Conditioned on all the randomness except {ξ_h^{k,i}} and {ζ_h^i}, for any

Proof of Lemma D.2. Note that the parameter condition in Equation (20) of Lemma D.2 implies Equation (21) of Lemma D.3; thus, under the parameter condition of Lemma D.2, Lemma D.3 holds. For the rest of the proof, we work under the joint event in which both the inequality of Lemma D.3 and that of Lemma E.3 hold. By the union bound, the probability of this joint event is at least 1 − MHm^{−2} − δ. Thus, for any x ∈ S^{d−1}, h ∈ [H], and i ∈ [M],
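Explicitly, writing E_{D.3} and E_{E.3} for the two events, the probability bound follows from the union bound:

```latex
\Pr\left[ \mathcal{E}_{\mathrm{D.3}} \cap \mathcal{E}_{\mathrm{E.3}} \right]
  \;\ge\; 1 - \Pr\left[ \mathcal{E}_{\mathrm{D.3}}^{c} \right] - \Pr\left[ \mathcal{E}_{\mathrm{E.3}}^{c} \right]
  \;\ge\; 1 - M H m^{-2} - \delta.
```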

We have Li,lin

Algorithm 5 Lin-VIPeR
1: Input: Offline data D = {(s_h^k, a_h^k, r_h^k)}_{k∈[K], h∈[H]}, perturbed variances {σ_h}_{h∈[H]}, number of bootstraps M, regularization parameter λ
2: Initialize Ṽ_{H+1}(·) ← 0
3: for h = H, . . . , 1 do
     ⋮
10:  π̂_h ← arg max_{π_h} ⟨Q̂_h, π_h⟩ and Ṽ_h ← ⟨Q̂_h, π̂_h⟩
11: end for
12: Output: π̂ = {π̂_h}_{h∈[H]}
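The reward-perturbation recipe in the linear setting can be sketched as follows: for each bootstrap i ∈ [M], solve a ridge regression on Gaussian-perturbed targets and take the pointwise minimum over the ensemble as the pessimistic value estimate. A minimal numpy sketch under illustrative names (not the paper's code; the perturbed regularization anchor zeta mirrors the role of the {ζ_h^i} conditioned on in Lemma E.2):

```python
import numpy as np

def lin_viper_step(Phi, targets, sigma, lam, M, rng):
    """One backward step: M independent ridge solutions on
    Gaussian-perturbed targets (and a Gaussian-perturbed
    regularization anchor zeta_i), then the pointwise minimum
    over the ensemble as the pessimistic Q estimate."""
    d = Phi.shape[1]
    A = Phi.T @ Phi + lam * np.eye(d)
    thetas = []
    for _ in range(M):
        y_i = targets + sigma * rng.standard_normal(len(targets))
        zeta_i = sigma * rng.standard_normal(d)
        thetas.append(np.linalg.solve(A, Phi.T @ y_i + lam * zeta_i))
    Theta = np.stack(thetas, axis=1)            # shape (d, M)
    return lambda phi: np.min(phi @ Theta, axis=-1)

rng = np.random.default_rng(1)
Phi = rng.standard_normal((200, 3))
targets = Phi @ np.array([0.5, 1.0, -0.2])
q = lin_viper_step(Phi, targets, sigma=1.0, lam=1.0, M=8, rng=rng)
vals = q(Phi[:5])  # pessimistic value estimates at five data points
```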

ACKNOWLEDGEMENTS

This research was supported, in part, by DARPA GARD award HR00112020004, NSF CAREER award IIS-1943251, an award from the Institute of Assured Autonomy, and the Spring 2022 workshop on "Learning and Games" at the Simons Institute for the Theory of Computing.

