

Abstract

With the increasing need for handling large state and action spaces, general function approximation has become a key technique in reinforcement learning (RL). In this paper, we propose a general framework that unifies model-based and model-free RL, and an Admissible Bellman Characterization (ABC) class that subsumes nearly all Markov Decision Process (MDP) models in the literature for tractable RL. We propose a novel estimation function with decomposable structural properties for optimization-based exploration, and the functional eluder dimension as a complexity measure of the ABC class. Under our framework, we propose a new sample-efficient algorithm, OPtimization-based ExploRation with Approximation (OPERA), which achieves regret bounds that match or improve over the best-known results for a variety of MDP models. In particular, for MDPs with low Witness rank, under a slightly stronger assumption, OPERA improves the state-of-the-art sample complexity results by a factor of $dH$. Our framework provides a generic interface to design and analyze new RL models and algorithms.

1. INTRODUCTION

Reinforcement learning (RL) is a decision-making process that seeks to maximize the expected reward when an agent interacts with the environment (Sutton & Barto, 2018). Over the past decade, RL has gained increasing attention due to its successes in a wide range of domains, including Atari games (Mnih et al., 2013), the game of Go (Silver et al., 2016), autonomous driving (Yurtsever et al., 2020), and robotics (Kober et al., 2013). Existing RL algorithms can be categorized into value-based algorithms such as Q-learning (Watkins, 1989) and policy-based algorithms such as policy gradient (Sutton et al., 1999). They can also be categorized as model-free approaches, where one directly models the value function classes, or model-based approaches, where one needs to estimate the transition probability.

Due to the intractably large state and action spaces needed to model complex real-world environments, function approximation in RL has become prominent in both algorithm design and theoretical analysis. It is a pressing challenge to design sample-efficient RL algorithms with general function approximation. In the special case where the underlying Markov Decision Processes (MDPs) enjoy certain linear structures, several lines of works have achieved polynomial sample complexity and/or regression and a Bernstein-type bonus. Other structural MDP models include block MDPs (Du et al., 2019) and FLAMBE (Agarwal et al., 2020b), to mention a few. In a more general setting, however, there is still a gap between the plethora of MDP models and sample-efficient RL algorithms that can learn the MDP model with function approximation. The question remains open as to what constitutes minimal structural assumptions that admit sample-efficient reinforcement learning. Several lines of work have sought to answer this question.
Russo & Van Roy (2013) and Osband & Van Roy (2014) proposed a structural condition named eluder dimension, and Wang et al. (2020) extended LSVI-UCB to general linear function classes with small eluder dimension. Another line of work proposed low-rank structural conditions, including Bellman rank (Jiang et al., 2017; Dong et al., 2020) and Witness rank (Sun et al., 2019). Recently, Jin et al. (2021) proposed a complexity measure called Bellman eluder (BE) dimension, which unifies low Bellman rank and low eluder dimension. Concurrently, Du et al. (2021) proposed Bilinear Classes, which can be applied to a variety of loss estimators beyond the vanilla Bellman error. Very recently, Foster et al. (2021) proposed the Decision-Estimation Coefficient (DEC), which is a necessary and sufficient condition for sample-efficient interactive learning. To apply DEC to RL, they proposed an RL class named Bellman Representability, which can be viewed as a generalization of the Bilinear Class. Nevertheless, Sun et al. (2019) is limited to model-based RL, and Jin et al. (2021) is restricted to model-free RL. The only frameworks that can unify both model-based and model-free RL are Du et al. (2021) and Foster et al. (2021), but their sample complexity results, when restricted to special MDP instances, do not always match the best-known results. In view of this gap, we aim to answer the following question:

Is there a unified framework that includes all model-free and model-based RL classes while maintaining sharp sample efficiency?

In this paper, we tackle this challenging question and give a nearly affirmative answer to it. We summarize our contributions as follows:

• We propose a general framework called Admissible Bellman Characterization (ABC) that covers a wide set of structural assumptions in both model-free and model-based RL, such as linear MDPs, FLAMBE, linear mixture MDPs, kernelized nonlinear regulator (Kakade et al., 2020), etc.
Furthermore, our framework encompasses comparative structural frameworks such as the low Bellman eluder dimension and low Witness rank.
• Under our ABC framework, we design a novel algorithm, OPtimization-based ExploRation with Approximation (OPERA), based on maximizing the value function while constrained to a small confidence region around the model minimizing the estimation function.
• We apply our framework to several specific examples that are known to be not sample-efficient with value-based algorithms. For the kernelized nonlinear regulator (KNR), our framework is the first general framework to derive a $\sqrt{T}$ regret-bound result. For the Witness rank, our framework yields a sharper sample complexity with a mild additional assumption compared to prior works.

We visualize and compare prevailing sample-efficient RL frameworks and ours in Figure 1. We can see that both the general Bilinear Class and our ABC framework capture most existing MDP classes, including the low Witness rank and the KNR models.

Notation. For a state-action sequence $s_1, a_1, \dots, s_H$ in our given context, we use $\mathcal{J}_h := \sigma(s_1, a_1, \dots, s_h)$ to denote the $\sigma$-algebra generated by trajectories up to step $h \in [H]$. Let $\pi_f$ denote the policy of following the max-$Q$ strategy induced by hypothesis $f$. When $f = f_i$ we write $\pi_{f_i}$ as $\pi_i$ for notational simplicity. We write $s_h \sim \pi$ to indicate that the state-action sequence is generated up to step $h \in [H]$ by following policy $\pi(\cdot \mid s)$ and transition probabilities $P(\cdot \mid s, a)$ of the underlying MDP model $M$. We also write $a_h \sim \pi$ to mean $a_h \sim \pi(\cdot \mid s_h)$ for the $h$-th step. Let $\|\cdot\|_2$ denote the $\ell_2$-norm and $\|\cdot\|_\infty$ the $\ell_\infty$-norm of a given vector. Other notations will be explained at their first appearances.

2. PRELIMINARIES

We consider a finite-horizon, episodic Markov Decision Process (MDP) defined by the tuple $M = (\mathcal{S}, \mathcal{A}, P, r, H)$, where $\mathcal{S}$ is the space of feasible states, $\mathcal{A}$ is the action space, $H$ is the horizon defined by the number of action steps in one episode, and $P := \{P_h\}_{h\in[H]}$ is defined for every $h \in [H]$ as the transition probability from the current state-action pair $(s_h, a_h) \in \mathcal{S} \times \mathcal{A}$ to the next state $s_{h+1} \in \mathcal{S}$. We use $r_h(s, a) \ge 0$ to denote the reward received at step $h \in [H]$ when taking action $a$ at state $s$, and assume throughout this paper that for any possible trajectory, $\sum_{h=1}^{H} r_h(s_h, a_h) \in [0, 1]$.

A deterministic policy $\pi$ is a sequence of functions $\{\pi_h : \mathcal{S} \to \mathcal{A}\}_{h\in[H]}$, where each $\pi_h$ specifies a strategy at step $h$. Given a policy $\pi$, the action-value function is defined as the expected cumulative reward, where the expectation is taken over the trajectory distribution generated by $\{(P_h(\cdot \mid s_h, a_h), \pi_h(\cdot \mid s_h))\}_{h\in[H]}$:

$$Q_h^{\pi}(s, a) := \mathbb{E}_{\pi}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s, a_h = a\Big].$$

Similarly, we define the state-value function for policy $\pi$ as the expected cumulative reward

$$V_h^{\pi}(s) := \mathbb{E}_{\pi}\Big[\sum_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) \,\Big|\, s_h = s\Big].$$

We use $\pi^*$ to denote the optimal policy that satisfies $V_h^{\pi^*}(s) = \max_{\pi} V_h^{\pi}(s)$ for all $s \in \mathcal{S}$ (Puterman, 2014). For simplicity, we abbreviate $V_h^{\pi^*}$ as $V_h^*$ and $Q_h^{\pi^*}$ as $Q_h^*$. Moreover, for a sequence of value functions $\{Q_h\}_{h\in[H]}$, the Bellman operator at step $h$ is defined as

$$(\mathcal{T}_h Q_{h+1})(s, a) = r_h(s, a) + \mathbb{E}_{s' \sim P_h(\cdot \mid s, a)} \max_{a' \in \mathcal{A}} Q_{h+1}(s', a').$$

We also call $Q_h - (\mathcal{T}_h Q_{h+1})$ the Bellman error (or Bellman residual). The goal of an RL algorithm is to find an $\epsilon$-optimal policy $\pi$ such that $V_1^*(s_1) - V_1^{\pi}(s_1) \le \epsilon$. For an RL algorithm that updates the policy $\pi^t$ for $T$ iterations, the cumulative regret is defined as

$$\mathrm{Regret}(T) := \sum_{t=1}^{T} \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big].$$

Hypothesis Classes. Following Du et al.
(2021), we define the hypothesis class for both model-free and model-based RL. Generally speaking, a hypothesis class is a set of functions that are used to estimate the value functions (for model-free RL) or the transition probability and reward (for model-based RL). Specifically, a hypothesis class $\mathcal{F}$ on a finite-horizon MDP is the Cartesian product of $H$ hypothesis classes $\mathcal{F} := \mathcal{F}_1 \times \dots \times \mathcal{F}_H$, in which each hypothesis $f = \{f_h\}_{h\in[H]} \in \mathcal{F}$ can be identified by a pair of value functions $\{Q_f, V_f\} = \{Q_{h,f}, V_{h,f}\}_{h\in[H]}$. Based on the value function pair, it is natural to introduce the greedy policy $\pi_{h,f}(s) = \arg\max_{a\in\mathcal{A}} Q_{h,f}(s, a)$ at each step $h \in [H]$, and the corresponding $\pi_f$ as the sequence of time-dependent policies $\{\pi_{h,f}\}_{h\in[H]}$.

An example of a model-free hypothesis class is defined by a sequence of action-value functions $\{Q_{h,f}\}_{h\in[H]}$. The corresponding state-value function is given by $V_{h,f}(s) = \mathbb{E}_{a\sim\pi_{h,f}}[Q_{h,f}(s, a)]$. Another example falls under the model-based RL setting, where for each hypothesis $f \in \mathcal{F}$ we have knowledge of the transition matrix $P_f$ and the reward function $r_f$. We define the value function $Q_{h,f}$ corresponding to hypothesis $f$ as the optimal value function following $M_f := (P_f, r_f)$: $Q_{h,f}(s, a) = Q^*_{h,M_f}(s, a)$ and $V_{h,f}(s) = V^*_{h,M_f}(s)$.

We also need the following realizability assumption, which requires the true model $M_{f^*}$ (model-based RL) or the optimal value function $f^*$ (model-free RL) to belong to the hypothesis class $\mathcal{F}$.

Assumption 1 (Realizability). For an MDP model $M$ and a hypothesis class $\mathcal{F}$, we say that the hypothesis class $\mathcal{F}$ is realizable with respect to $M$ if there exists an $f^* \in \mathcal{F}$ such that for any $h \in [H]$, $Q_h^*(s, a) = Q_{h,f^*}(s, a)$. We call such $f^*$ an optimal hypothesis.

This assumption has also been made in the Bilinear Classes (Du et al., 2021) and low Bellman eluder dimension (Jin et al., 2021) frameworks. We also define the $\epsilon$-covering number of a hypothesis class $\mathcal{F}$ under a well-defined metric $\rho$:

Definition 2 ($\epsilon$-covering Number of Hypothesis Class). For any $\epsilon > 0$ and a hypothesis class $\mathcal{F}$, we use $N_{\mathcal{F}}(\epsilon)$ to denote the $\epsilon$-covering number, which is the smallest possible cardinality of an $\epsilon$-cover $\mathcal{F}_\epsilon$ such that for any $f \in \mathcal{F}$ there exists an $f' \in \mathcal{F}_\epsilon$ such that $\rho(f, f') \le \epsilon$.

Functional Eluder Dimension. We proceed to introduce our new complexity measure, the functional eluder dimension, which generalizes the concept of eluder dimension first proposed in the bandit literature (Russo & Van Roy, 2013; 2014). It has since become a widely used complexity measure for function approximation in RL (Wang et al., 2020; Ayoub et al., 2020; Jin et al., 2021; Foster et al., 2021). Here we revisit its definition:

Definition 3 (Eluder Dimension). For a given space $\mathcal{X}$ and a class $\mathcal{F}$ of functions defined on $\mathcal{X}$, the eluder dimension $\dim_E(\mathcal{F}, \epsilon)$ is the length of the longest sequence $x_1, \dots, x_n \in \mathcal{X}$ such that, for some $\epsilon' \ge \epsilon$ and any $2 \le t \le n$, there exist $f_1, f_2 \in \mathcal{F}$ with $\sqrt{\sum_{i=1}^{t-1} (f_1(x_i) - f_2(x_i))^2} \le \epsilon'$ while $|f_1(x_t) - f_2(x_t)| > \epsilon'$.

The eluder dimension is usually applied to the state-action space $\mathcal{X} = \mathcal{S} \times \mathcal{A}$ and the corresponding value function class $\mathcal{F} : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ (Jin et al., 2021; Wang et al., 2020). We extend the concept of eluder dimension to a complexity measure of the hypothesis class, namely the functional eluder dimension, which is formally defined as follows.

Definition 4 (Functional Eluder Dimension). For a given hypothesis class $\mathcal{F}$ and a function $G$ defined on $\mathcal{F} \times \mathcal{F}$, the functional eluder dimension (FE dimension) $\dim_{FE}(\mathcal{F}, G, \epsilon)$ is the length of the longest sequence $f_1, \dots, f_n \in \mathcal{F}$ such that, for some $\epsilon' \ge \epsilon$ and any $2 \le t \le n$, there exists $g \in \mathcal{F}$ with $\sqrt{\sum_{i=1}^{t-1} (G(g, f_i))^2} \le \epsilon'$ while $|G(g, f_t)| > \epsilon'$. The function $G$ is dubbed the coupling function.
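The condition in Definition 4 can be checked mechanically for a candidate sequence. The following is a minimal sketch, assuming a finite hypothesis class and the root-sum-of-squares form of the condition used in the standard eluder dimension; the names `is_fe_witness` and `inner`, and the toy bilinear coupling on $\mathbb{R}^2$, are hypothetical and chosen only for illustration. A sequence that passes the check certifies a lower bound on the FE dimension at scale $\epsilon'$; it does not compute the dimension itself.

```python
import math

def is_fe_witness(seq, hypotheses, coupling, eps_prime):
    """Return True if every element after the first is 'independent' of its
    predecessors: some g has small cumulative coupling on f_1..f_{t-1}
    but a coupling larger than eps_prime on f_t."""
    for t in range(1, len(seq)):
        if not any(
            math.sqrt(sum(coupling(g, f) ** 2 for f in seq[:t])) <= eps_prime
            and abs(coupling(g, seq[t])) > eps_prime
            for g in hypotheses
        ):
            return False
    return True

def inner(g, f):
    # Toy coupling function: Euclidean inner product on R^2 (an assumption).
    return g[0] * f[0] + g[1] * f[1]

e1, e2, zero = (1.0, 0.0), (0.0, 1.0), (0.0, 0.0)
hyps = [e1, e2, zero]
ok = is_fe_witness([e1, e2], hyps, inner, 0.5)        # two orthogonal directions
bad = is_fe_witness([e1, e2, e1], hyps, inner, 0.5)   # third element adds nothing new
```

In this toy case the sequence $(e_1, e_2)$ is a valid witness, while appending $e_1$ again fails because no hypothesis is nearly orthogonal to both earlier elements and still correlated with $e_1$.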
The notion of functional eluder dimension introduced in Definition 4 generalizes in a straightforward fashion to a sequence $G := \{G_h\}_{h\in[H]}$ of coupling functions: we simply set $\dim_{FE}(\mathcal{F}, G, \epsilon) = \max_{h\in[H]} \dim_{FE}(\mathcal{F}, G_h, \epsilon)$ to denote the FE dimension of $\{G_h\}_{h\in[H]}$. The Bellman eluder (BE) dimension recently proposed by Jin et al. (2021) is in fact a special case of the FE dimension with a specific choice of coupling function sequence. As will be shown later, our framework based on the FE dimension with respect to the corresponding coupling function captures many specific MDP instances such as the kernelized nonlinear regulator (KNR) (Kakade et al., 2020) and the generalized linear Bellman complete model (Wang et al., 2019), which are not captured by the framework of low BE dimension. As we will see in later sections, introducing the concept of FE dimension allows the coverage of a strictly wider range of MDP models and hypothesis classes.

3. ADMISSIBLE BELLMAN CHARACTERIZATION FRAMEWORK

In this section, we introduce the framework of admissible Bellman characterization.

3.1 ADMISSIBLE BELLMAN CHARACTERIZATION

Given an MDP $M$, a sequence of states and actions $s_1, a_1, \dots, s_H$, two hypothesis classes $\mathcal{F}$ and $\mathcal{G}$ satisfying the realizability assumption (Assumption 1), and a discriminator function class $\mathcal{V} = \{v(s, a, s') : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathbb{R}\}$, the estimation function $\ell = \{\ell_{h,f'}\}_{h\in[H], f'\in\mathcal{F}}$ is an $\mathbb{R}^{d_s}$-valued function defined on the set consisting of $o_h := (s_h, a_h, s_{h+1}) \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$, $f \in \mathcal{F}$, $g \in \mathcal{G}$ and $v \in \mathcal{V}$, and serves as a surrogate loss function of the Bellman error. Note that our estimation function is vector-valued, and is more general than the scalar-valued estimation function (or discrepancy function) used in Foster et al. (2021) and Du et al. (2021).
The discriminator $v$ originates from the function class with respect to which the Integral Probability Metric (IPM) (Müller, 1997) is taken (as a metric between two distributions), and is also used in the definition of Witness rank (Sun et al., 2019). We use a coupling function $G_{h,f^*}(f, g)$ defined on $\mathcal{F} \times \mathcal{F}$ to characterize the interaction between two hypotheses $f, g \in \mathcal{F}$. The subscript $f^*$ indicates the true model and is by default unchanged throughout the context. When the two hypotheses coincide, our characterization of the coupling function reduces to the Bellman error.

Definition 5 (Admissible Bellman Characterization). Given an MDP $M$, two hypothesis classes $\mathcal{F}, \mathcal{G}$ satisfying the realizability assumption (Assumption 1) with $\mathcal{F} \subset \mathcal{G}$, an estimation function $\ell_{h,f'} : (\mathcal{S} \times \mathcal{A} \times \mathcal{S}) \times \mathcal{F} \times \mathcal{G} \times \mathcal{V} \to \mathbb{R}^{d_s}$, an operation policy $\pi_{op}$ and a constant $\kappa \in (0, 1]$, we say that $G$ is an admissible Bellman characterization of $(M, \mathcal{F}, \mathcal{G}, \ell)$ if the following conditions hold:

(i) (Dominating Average Estimation Function) For any $f, g \in \mathcal{F}$,
$$\max_{v\in\mathcal{V}} \mathbb{E}_{s_h\sim\pi_g,\, a_h\sim\pi_{op}} \big\| \mathbb{E}_{s_{h+1}}[\ell_{h,g}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h] \big\|^2 \ge (G_{h,f^*}(f, g))^2.$$

(ii) (Bellman Dominance) For any $(h, f) \in [H] \times \mathcal{F}$,
$$\kappa \cdot \mathbb{E}_{s_h, a_h\sim\pi_f}[Q_{h,f}(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1})] \le |G_{h,f^*}(f, f)|.$$

We further say that $(M, \mathcal{F}, \mathcal{G}, \ell, G)$ is an ABC class if $G$ is an admissible Bellman characterization of $(M, \mathcal{F}, \mathcal{G}, \ell)$.

In Definition 5, one can choose either $\pi_{op} = \pi_g$ or $\pi_{op} = \pi_f$. We refer readers to §D for further explanations on $\pi_{op}$. The ABC class is quite general and covers many existing MDP models; see §3.2 for more details.

Comparison with Existing MDP Classes. Here we compare our ABC class with three recently proposed MDP structural classes: Bilinear Classes (Du et al., 2021), low Bellman eluder dimension (Jin et al., 2021), and Bellman Representability (Foster et al., 2021).

• Bilinear Classes. Compared to the structural framework of Bilinear Class in Du et al.
(2021, Definition 4.3), Definition 5 of Admissible Bellman Characterization does not require a bilinear structure and recovers the Bilinear Class when we set $G_{h,f^*}(f, g) = \langle W_h(g) - W_h(f^*), X_h(f) \rangle$. Our ABC class is strictly broader than the Bilinear Class since the latter does not capture low eluder dimension models, whereas our ABC class does. In addition, the ABC class admits a vector-valued estimation function, and the corresponding algorithm achieves a $\sqrt{T}$-regret for the KNR case while the BiLin-UCB algorithm for Bilinear Classes (Du et al., 2021) does not.

• Low Bellman Eluder Dimension. Definition 5 subsumes the MDP class of low BE dimension when $\ell_{h,f'}(o_h, f_{h+1}, g_h, v) := Q_{h,g}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})$. Moreover, our definition unifies the V-type and Q-type problems under the same framework through the notion of $\pi_{op}$. We provide a more detailed discussion on this in §3.2. Our extension from the Bellman error to the estimation function (i.e., a surrogate of the Bellman error) enables us to accommodate model-based RL for linear mixture MDPs, the KNR model, and low Witness rank.

• Bellman Representability. Foster et al. (2021) proposed the DEC framework, which is another MDP class that unifies both the Bilinear Class and the low BE dimension. Indeed, our ABC framework introduced in Definition 5 shares a similar spirit with Bellman Representability (Definition F.1 in Foster et al. (2021)). Nevertheless, our framework and theirs diverge from the outset: our work studies optimization-based exploration instead of the posterior sampling-based exploration in Foster et al. (2021). Structurally different from their DEC framework, our ABC allows estimation functions to be vector-valued, introduces the discriminator function $v$, and imposes a Bellman dominance property (condition (ii) in Definition 5) that is weaker than the corresponding one in Foster et al. (2021, Eq. (166)).
Altogether, this allows broader choices of the coupling function $G$ and lets our ABC class (with low FE dimension) include as special instances both low Witness rank and KNR models, which are not captured in Foster et al. (2021).

Decomposable Estimation Function. Now we introduce the concept of a decomposable estimation function, which generalizes the Bellman error in earlier literature and plays a pivotal role in our algorithm design and analysis.

Definition 6 (Decomposable Estimation Function). A decomposable estimation function $\ell : (\mathcal{S} \times \mathcal{A} \times \mathcal{S}) \times \mathcal{F} \times \mathcal{G} \times \mathcal{V} \to \mathbb{R}^{d_s}$ is a function with bounded $\ell_2$-norm such that the following two conditions hold:

(i) (Decomposability) There exists an operator $\mathcal{T}(\cdot) : \mathcal{F} \to \mathcal{G}$ that maps between the two hypothesis classes such that for any $f \in \mathcal{F}$, $(h, f', g, v) \in [H] \times \mathcal{F} \times \mathcal{G} \times \mathcal{V}$ and all possible $o_h$,
$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) - \mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, g_h, v) \mid s_h, a_h] = \ell_{h,f'}(o_h, f_{h+1}, \mathcal{T}(f)_h, v).$$
Moreover, if $f = f^*$, then $\mathcal{T}(f) = f^*$ holds.

(ii) (Global Discriminator Optimality) For any $f \in \mathcal{F}$ there exists a global maximum $v_h^*(f) \in \mathcal{V}$ such that for any $(h, f', g, v) \in [H] \times \mathcal{F} \times \mathcal{G} \times \mathcal{V}$ and all possible $o_h$,
$$\big\| \mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, f_h, v_h^*(f)) \mid s_h, a_h] \big\| \ge \big\| \mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h] \big\|.$$
Compared with the discrepancy function or estimation function used in prior work (Du et al., 2021; Foster et al., 2021), our estimation function (EF) has the following distinctive properties: (a) our EF enjoys a decomposable property inherited from the Bellman error: intuitively speaking, decomposability can be seen as a property shared by all functions of the form of a difference between a $\mathcal{J}_h$-measurable function and a $\mathcal{J}_{h+1}$-measurable function; (b) our EF involves a discriminator class and assumes the global optimality of the discriminator on all $(s_h, a_h)$ pairs; (c) our EF is a vector-valued function, which is more general than a scalar-valued estimation function (or discrepancy function).

We remark that when $f = g$, $\mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h]$ measures the discrepancy in optimality between $f$ and $f^*$. In particular, when $f = f^*$, $\mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f^*_{h+1}, f^*_h, v) \mid s_h, a_h] = 0$. Consider the special case $\ell_{h,f'}(o_h, f_{h+1}, g_h, v) := Q_{h,g}(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1})$. Then the decomposability condition (i) in Definition 6 reduces to

$$[Q_{h,g}(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1})] - [Q_{h,g}(s_h, a_h) - (\mathcal{T}_h V_{h+1})(s_h, a_h)] = (\mathcal{T}_h V_{h+1})(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1}).$$

In addition, we make the following Lipschitz continuity assumption on the estimation function.

Assumption 7 (Lipschitz Estimation Function). There exists an $L > 0$ such that for any $(h, f', f, g, v) \in [H] \times \mathcal{F} \times \mathcal{F} \times \mathcal{G} \times \mathcal{V}$, $(\tilde{f}, \tilde{g}, \tilde{v}, \tilde{f}') \in \mathcal{F} \times \mathcal{G} \times \mathcal{V} \times \mathcal{F}$ and all possible $o_h$,
$$\|\ell_{h,f'}(\cdot, f, g, v) - \ell_{h,f'}(\cdot, \tilde{f}, g, v)\|_\infty \le L\rho(f, \tilde{f}), \qquad \|\ell_{h,f'}(\cdot, f, g, v) - \ell_{h,f'}(\cdot, f, \tilde{g}, v)\|_\infty \le L\rho(g, \tilde{g}),$$
$$\|\ell_{h,f'}(\cdot, f, g, v) - \ell_{h,f'}(\cdot, f, g, \tilde{v})\|_\infty \le L\|v - \tilde{v}\|_\infty, \qquad \|\ell_{h,f'}(\cdot, f, g, v) - \ell_{h,\tilde{f}'}(\cdot, f, g, v)\|_\infty \le L\rho(f', \tilde{f}').$$

Note that we have omitted the subscript $h$ of hypotheses in Assumption 7 for notational simplicity.
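The decomposability identity for the Bellman-error special case can be verified numerically. The sketch below assumes a single step $h$ of a small random tabular MDP and takes $\mathcal{T}(f)_h$ to be the table $r + P V_{h+1,f}$; the sizes, the random seed, and all variable names are illustrative assumptions rather than part of the framework.

```python
import numpy as np

rng = np.random.default_rng(0)
S, A = 4, 3                              # illustrative sizes

P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)        # P[s, a, :] is a distribution over s'
r = rng.random((S, A))                   # reward r(s, a)
V_next = rng.random(S)                   # stands in for V_{h+1,f}
Q_g = rng.random((S, A))                 # stands in for an arbitrary Q_{h,g}

def ell(Q):
    """ell[s, a, s'] = Q(s, a) - r(s, a) - V_next(s')."""
    return (Q - r)[:, :, None] - V_next[None, None, :]

def cond_mean_ell(Q):
    """E_{s'}[ell | s, a] = Q(s, a) - r(s, a) - sum_{s'} P(s'|s, a) V_next(s')."""
    return Q - r - P @ V_next

TQ = r + P @ V_next                      # the table playing the role of T(f)_h

lhs = ell(Q_g) - cond_mean_ell(Q_g)[:, :, None]   # ell minus its conditional mean
rhs = ell(TQ)                                     # ell evaluated at g = T(f)
max_gap = float(np.abs(lhs - rhs).max())
mean_at_T = float(np.abs(cond_mean_ell(TQ)).max())
```

Both `max_gap` and `mean_at_T` are zero up to floating-point error: the centered estimation function does not depend on $g$, and the estimation function evaluated at $\mathcal{T}(f)_h$ has zero conditional mean.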
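We further define the induced estimation function class as $\mathcal{L} = \{\ell_{h,f'}(\cdot, f, g, v) : (h, f', f, g, v) \in [H] \times \mathcal{F} \times \mathcal{F} \times \mathcal{G} \times \mathcal{V}\}$. We can show that under Assumption 7, the covering number of the induced estimation function class $\mathcal{L}$ can be upper bounded as $N_{\mathcal{L}}(\epsilon) \le N_{\mathcal{F}}^2(\frac{\epsilon}{4L}) N_{\mathcal{G}}(\frac{\epsilon}{4L}) N_{\mathcal{V}}(\frac{\epsilon}{4L})$, where $N_{\mathcal{F}}(\epsilon)$, $N_{\mathcal{G}}(\epsilon)$, $N_{\mathcal{V}}(\epsilon)$ are the $\epsilon$-covering numbers of $\mathcal{F}$, $\mathcal{G}$ and $\mathcal{V}$, respectively. Later in our theoretical analysis in §4, our regret upper bound will depend on the growth rate of the covering number, or the metric entropy, $\log N_{\mathcal{L}}(\epsilon)$.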
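As a small worked step, taking logarithms of the covering-number bound makes the metric-entropy dependence explicit. The second line is an illustration under the added assumption that $\mathcal{F}$, $\mathcal{G}$ and $\mathcal{V}$ are finite, so that each covering number is at most the cardinality of its class:

```latex
\log N_{\mathcal{L}}(\epsilon)
  \le 2\log N_{\mathcal{F}}\!\left(\tfrac{\epsilon}{4L}\right)
    + \log N_{\mathcal{G}}\!\left(\tfrac{\epsilon}{4L}\right)
    + \log N_{\mathcal{V}}\!\left(\tfrac{\epsilon}{4L}\right),
\qquad
\log N_{\mathcal{L}}(\epsilon)
  \le 2\log|\mathcal{F}| + \log|\mathcal{G}| + \log|\mathcal{V}|
  \quad \text{(finite classes)}.
```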

3.2 MDP INSTANCES IN THE ABC CLASS

In this subsection, we present a number of MDP instances that belong to the ABC class with low FE dimension. As mentioned before, for all special cases with $\ell_{h,f'}(o_h, f_{h+1}, g_h, v) := Q_{h,g}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})$, both conditions in Definition 5 are satisfied automatically with $G_{h,f^*}(f, g) = \mathbb{E}_{s_h\sim\pi_g,\, a_h\sim\pi_{op}}[Q_{h,f}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})]$. The FE dimension under this setting recovers the BE dimension. Thus, all model-free RL models with low BE dimension (Jin et al., 2021) belong to our ABC class with low FE dimension. In the rest of this subsection, our focus shifts to the model-based RL models that belong to the ABC class: linear mixture MDPs, low Witness rank, and the kernelized nonlinear regulator.

Linear Mixture MDPs. We start with a model-based RL model with a linear structure called the linear mixture MDP (Modi et al., 2020; Ayoub et al., 2020; Zhou et al., 2021b). For known transition and reward feature mappings $\phi(s, a, s') : \mathcal{S} \times \mathcal{A} \times \mathcal{S} \to \mathcal{H}$ and $\psi(s, a) : \mathcal{S} \times \mathcal{A} \to \mathcal{H}$ taking values in a Hilbert space $\mathcal{H}$, and an unknown $\theta^* \in \mathcal{H}$, a linear mixture MDP assumes that for any $(s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$ and $h \in [H]$, the transition probability $P_h(s' \mid s, a)$ and the reward function $r(s, a)$ are linearly parameterized as
$$P_h(s' \mid s, a) = \langle \theta_h^*, \phi(s, a, s') \rangle, \qquad r(s, a) = \langle \theta_h^*, \psi(s, a) \rangle.$$
We provide the following proposition, which shows that linear mixture MDPs belong to the ABC class with low FE dimension.

Proposition 8 (Linear Mixture MDP ⊂ ABC with Low FE Dimension). The linear mixture MDP model belongs to the ABC class with estimation function
$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) = \theta_{h,g}^\top \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f'}(s')\Big] - r_h - V_{h+1,f'}(s_{h+1}), \qquad (3.1)$$
and coupling function
$$G_{h,f^*}(f, g) = \Big\langle \theta_{h,g} - \theta_h^*, \; \mathbb{E}_{s_h, a_h\sim\pi_f}\Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f}(s')\Big] \Big\rangle.$$
Moreover, it has a low FE dimension.

Low Witness Rank. The low Witness rank model (Sun et al., 2019) belongs to the ABC class with estimation function
$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) = \mathbb{E}_{s\sim g_h} v(s_h, a_h, s) - v(s_h, a_h, s_{h+1}), \qquad (3.2)$$
and coupling function $G_{h,f^*}(f, g) = \langle W_h(g), X_h(f) \rangle$. Moreover, it has a low FE dimension.

Kernelized Nonlinear Regulator. The KNR model (Kakade et al., 2020) belongs to the ABC class with estimation function
$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) = U_{h,g}\phi(s_h, a_h) - s_{h+1}, \qquad (3.4)$$
and coupling function $G_{h,f^*}(f, g) := \mathbb{E}_{s_h, a_h\sim\pi_g} \|(U_{h,f} - U_h^*)\phi(s_h, a_h)\|_2$. Moreover, it has a low FE dimension. Although the dimension of the RKHS $\mathcal{H}$ can be infinite, our complexity analysis depends solely on its effective dimension $d_\phi$.

Algorithm 1 (OPERA), optimization step. Set $\pi^t := \pi_{f^t}$, where $f^t$ is taken as
$$\arg\max_{f\in\mathcal{F}} Q_{1,f}(s_1, \pi_f(s_1)) \quad \text{subject to} \quad \max_{v\in\mathcal{V}} \Big\{ \sum_{i=1}^{t-1} \|\ell_{h,f^i}(o_h^i, f_{h+1}, f_h, v)\|^2 - \inf_{g_h\in\mathcal{G}_h} \sum_{i=1}^{t-1} \|\ell_{h,f^i}(o_h^i, f_{h+1}, g_h, v)\|^2 \Big\} \le \beta \ \text{ for all } h \in [H]. \qquad (4.1)$$

4. ALGORITHM AND MAIN RESULTS

In this section, we present an RL algorithm for the ABC class. We then present the regret bound of this algorithm, along with its implications for several MDP instances in the ABC class.

4.1 OPERA ALGORITHM

We first present the OPtimization-based ExploRation with Approximation (OPERA) algorithm in Algorithm 1, which finds an $\epsilon$-optimal policy in polynomial time. Following earlier algorithms in the same vein, e.g., GOLF (Jin et al., 2021), the core step of OPERA is optimization-based exploration under the constraint of an identified confidence region; we additionally introduce an estimation policy $\pi_{est}$ in a similar spirit to Du et al. (2021). Due to the space limit, we focus on the Q-type analysis here and defer the V-type results to §D in the appendix.

For the constrained optimization subproblem in Eq. (4.1) of Algorithm 1, we adopt a confidence region based on a general decomposable estimation function (DEF), extending the Bellman-error-based confidence region used in Jin et al. (2021). As a result of this extension, our algorithm can deal with more complex models such as low Witness rank and KNR. As in the existing literature on RL theory with general function approximation, our algorithm is in general computationally inefficient. Yet OPERA is oracle-efficient given an oracle for solving the optimization problem in Line 3 of Algorithm 1. We discuss its computational issues in detail in §E.1, §E.2 and §E.3.

4.2 REGRET BOUNDS

We are ready to present the main theoretical results for our ABC class with low FE dimension:

Theorem 11 (Regret Bound of OPERA). For an MDP $M$, hypothesis classes $\mathcal{F}, \mathcal{G}$, a decomposable estimation function $\ell$ satisfying Assumption 7, and an admissible Bellman characterization $G$, suppose $(M, \mathcal{F}, \mathcal{G}, \ell, G)$ is an ABC class with low functional eluder dimension. For any fixed $\delta \in (0, 1)$, we choose $\beta = O(\log(T H N_{\mathcal{L}}(1/T)/\delta))$ in Algorithm 1. Then for the on-policy case when $\pi_{op} = \pi_{est} = \pi^t$, with probability at least $1 - \delta$, the regret is upper bounded by
$$\mathrm{Regret}(T) = O\Big(\frac{H}{\kappa} \sqrt{T \cdot \dim_{FE}(\mathcal{F}, G, 1/T) \cdot \beta}\Big).$$

We defer the proof of Theorem 11, together with a corollary for sample complexity analysis, to §C in the appendix.
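The constrained optimization step in §4.1 can be illustrated in a heavily simplified setting. The sketch below is not Algorithm 1 itself: it assumes a finite hypothesis class with $\mathcal{F} = \mathcal{G}$, a trivial discriminator class, tabular $Q$-functions stored as arrays, and the Bellman-error estimation function, so it is closer to a GOLF-style selection rule. The function names `squared_loss` and `select_optimistic`, the toy MDP, and the value of `beta` are all hypothetical.

```python
import numpy as np

def squared_loss(q_f, q_g, batch, h):
    """Sum over step-h transitions of (Q_g[h](s,a) - r - max_a' Q_f[h+1](s',a'))^2."""
    return sum(
        (q_g[h][s, a] - r - q_f[h + 1][s_next].max()) ** 2
        for s, a, r, s_next in batch
    )

def select_optimistic(hypotheses, data, s1, beta):
    """Return the index of the feasible hypothesis with the largest initial value.

    hypotheses: list of arrays of shape (H + 1, S, A); the last slice is zero.
    data: list of length H; data[h] holds (s, a, r, s_next) tuples seen at step h.
    """
    best_idx, best_val = None, -np.inf
    for idx, q_f in enumerate(hypotheses):
        # Confidence region: own loss minus the best achievable loss is at most beta.
        feasible = all(
            squared_loss(q_f, q_f, data[h], h)
            - min(squared_loss(q_f, q_g, data[h], h) for q_g in hypotheses)
            <= beta
            for h in range(len(data))
        )
        value = q_f[0][s1].max()   # greedy value at the initial state (0-indexed step)
        if feasible and value > best_val:
            best_idx, best_val = idx, value
    return best_idx

# Toy deterministic MDP with H = 2, two states, two actions (illustrative numbers).
q_star = np.zeros((3, 2, 2))
q_star[1] = [[0.5, 0.1], [0.0, 0.3]]
q_star[0] = [[0.5, 0.4], [0.5, 0.3]]
q_over = np.zeros((3, 2, 2))
q_over[:2] = 1.0                   # over-optimistic at every state and action
no_data = [[], []]
data = [[(0, 0, 0.0, 0)], [(0, 0, 0.5, 0), (1, 1, 0.3, 1)]]
first = select_optimistic([q_star, q_over], no_data, s1=0, beta=0.1)
later = select_optimistic([q_star, q_over], data, s1=0, beta=0.1)
```

Before any data is collected, every hypothesis is feasible and the over-optimistic one is chosen; once the observed transitions contradict it, it leaves the confidence region and the selection falls back to the hypothesis consistent with the data. This is the optimism-under-constraint mechanism that drives exploration.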
We observe that the regret bound of the OPERA algorithm depends on both the functional eluder dimension $\dim_{FE}$ and the covering number of the induced DEF class $N_{\mathcal{L}}(1/T)$. In the special case when the DEF is chosen as the Bellman error, the relation $\dim_{FE}(\mathcal{F}, G, 1/T) = \dim_{BE}(\mathcal{F}, \Pi, 1/T)$ holds with $\Pi$ being the function class induced by $\{\pi_f, f \in \mathcal{F}\}$, and our Theorem 11 reduces to the regret bound in Jin et al. (2021) (Theorem 15). We provide a detailed comparison between our framework and other related frameworks, when applied to different MDP models, in §A in the appendix. Here we focus on comparing our results applied to the model-based RL models in §3.2, which are hardly analyzable in the model-free framework. We demonstrate how OPERA can find near-optimal policies and achieve a state-of-the-art sample complexity under our new framework. Regret-bound analyses of linear mixture MDPs and several other MDP models can be found in §B in the appendix. We highlight that Algorithm 1 not only provides a simple optimization-based scheme and recovers previous near-optimal algorithms in the literature (Algorithms 2 and 4 in §E) when applied to specific MDP instances, but also reduces to a novel Algorithm 3 for low Witness rank MDPs with improved sample complexity.

Low Witness Rank. We first provide a sample complexity result for the low Witness rank model structure. Let $|\mathcal{M}|$ and $|\mathcal{V}|$ be the cardinality of the model class $\mathcal{M}$ and the discriminator class $\mathcal{V}$, respectively, and let $W_\kappa$ be the Witness rank (Definition 28) of the model. We have the following sample complexity result for low Witness rank models.

Corollary 12 (Finite Witness Rank). For an MDP model $M$ with finite Witness rank structure and any fixed $\delta \in (0, 1)$, we choose $\beta = O(\log(T H |\mathcal{M}||\mathcal{V}|/\delta))$ in Algorithm 1. With probability at least $1 - \delta$, Algorithm 1 outputs an $\epsilon$-optimal policy $\pi_{out}$ within $T = O(H^2 |\mathcal{A}| W_\kappa \beta / (\kappa^2 \epsilon^2))$ trajectories.
The proof of Corollary 12 is deferred to §E.4. Compared with the previous best-known sample complexity result of $O(H^3 W_\kappa^2 |\mathcal{A}| \log(T|\mathcal{M}||\mathcal{V}|/\delta)/(\kappa^2 \epsilon^2))$ due to Sun et al. (2019), our sample complexity is superior by a factor of $dH$ up to a polylogarithmic prefactor in model parameters.

Kernelized Nonlinear Regulator. Now we turn to the implication of Theorem 11 for learning KNR models. We have the following regret bound result for KNR.

Corollary 13 (KNR). For the KNR model in Eq. (3.3) and any fixed $\delta \in (0, 1)$, we choose $\beta = O(\sigma^2 d_\phi d_s \log^2(T H/\delta))$ in Algorithm 1. With probability at least $1 - \delta$, the regret is upper bounded by $O(H^2 \sqrt{d_\phi T \beta}/\sigma)$.

We remark that neither the low BE dimension nor the Bellman Representability classes admit the KNR model with a sharp regret bound. Among earlier attempts, Du et al. (2021, §6) proposed to use a generalized version of Bilinear Classes to capture models including KNR, Generalized Linear Bellman Complete, and finite Witness rank. Nevertheless, their characterization requires imposing monotone transformations on the statistic and yields a suboptimal $O(T^{3/4})$ regret bound. Our ABC class with low FE dimension is free of monotone operators, even though the coupling function for the KNR model is not of a bilinear form.

5. CONCLUSION AND FUTURE WORK

In this paper, we proposed a unified framework that subsumes nearly all Markov Decision Process (MDP) models in the existing literature on model-based and model-free RL. For the complexity analysis, we proposed a new type of estimation function with the decomposable property for optimization-based exploration and used the functional eluder dimension with respect to an admissible Bellman characterization function as the complexity measure of our model class. In addition, we proposed a new sample-efficient algorithm, OPERA, which matches or improves the state-of-the-art sample complexity (or regret) results. On the other hand, we notice that some MDP instances are not covered by our framework, such as the $Q^*$ state-action aggregation and the deterministic linear $Q^*$ models where only $Q^*$ has a linear structure. We leave the inclusion of these MDP models as future work.

A RELATED WORK

Tabular RL. Tabular RL considers MDPs with finite state space $\mathcal{S}$ and action space $\mathcal{A}$. This setting has been extensively studied (Auer et al., 2008; Dann & Brunskill, 2015; Brafman & Tennenholtz, 2002; Agrawal & Jia, 2017; Azar et al., 2017; Zanette & Brunskill, 2019; Zhang et al., 2020), and the minimax-optimal regret bound has been proved to be $O(\sqrt{H^2 |\mathcal{S}||\mathcal{A}| T})$ (Jin et al., 2018; Domingues et al., 2021). The minimax-optimal bound suggests that tabular RL is information-theoretically hard for large $|\mathcal{S}|$ and $|\mathcal{A}|$. Therefore, in order to deal with the high-dimensional state-action spaces arising in many real-world applications, more advanced structural assumptions that enable function approximation are in demand.

Complexity Measures for Statistical Learning.
In classic statistical learning, a variety of complexity measures have been proposed to upper bound the sample complexity required for achieving a certain accuracy, including VC dimension (Vapnik, 1999), covering number (Pollard, 2012), Rademacher complexity (Bartlett & Mendelson, 2002), sequential Rademacher complexity (Rakhlin et al., 2010) and Littlestone dimension (Littlestone, 1988). However, for reinforcement learning, it is a major challenge to find such general complexity measures that can be used to analyze the sample complexity under a general framework.

RL with Linear Function Approximation. A line of work studied MDPs that can be represented as a linear function of some given feature mapping. Under certain completeness conditions, the proposed algorithms enjoy sample complexity/regret scaling with the dimension of the feature mapping rather than $|\mathcal{S}|$ and $|\mathcal{A}|$. One such class of MDPs is linear MDPs (Jin et al., 2020; Wang et al., 2019; Neu & Pike-Burke, 2020), where the transition probability function and reward function are linear in some feature mapping over state-action pairs. Zanette et al. (2020a;b) studied MDPs under a weaker assumption called low inherent Bellman error, where the value functions are nearly linear w.r.t. the feature mapping. Another class of MDPs is linear mixture MDPs (Modi et al., 2020; Jia et al., 2020; Ayoub et al., 2020; Zhou et al., 2021b; Cai et al., 2020), where the transition probability kernel is a linear mixture of a number of basis kernels. The above papers assume that the feature vectors are known in MDPs with linear approximation, while Agarwal et al. (2020b) studied a harder setting where both the features and the parameters of the linear model are unknown.

RL with General Function Approximation. Beyond the linear setting, a recent line of research attempted to unify existing sample-efficient approaches with general function approximation.
Osband & Van Roy (2014) proposed a structural condition named eluder dimension. Wang et al. (2020) further proposed an efficient algorithm LSVI-UCB for general linear function classes with small eluder dimension. Another line of work proposed low-rank structural conditions, including Bellman rank (Jiang et al., 2017; Dong et al., 2020) and Witness rank (Sun et al., 2019).

 | $d^3H^4/\epsilon^2$ | $d^2H^2/\epsilon^2$ | $d^3H^3/\epsilon^2$ | $d^2H^2/\epsilon^2$
Linear Mixture MDPs (Modi et al., 2020) | $d^3H^4/\epsilon^2$ | ✘ | $d^3H^3/\epsilon^2$ | $d^2H^2/\epsilon^2$
Bellman Rank (Jiang et al., 2017) | $d^2H^5|\mathcal{A}|/\epsilon^2$ | $dH^2|\mathcal{A}|/\epsilon^2$ | $d^2H^3|\mathcal{A}|/\epsilon^2$ | $dH^2|\mathcal{A}|/\epsilon^2$
Eluder Dimension (Wang et al., 2020) | ✘ | $\dim_E H^2/\epsilon^2$ | $\dim_E^2 H^3/\epsilon^2$ | $\dim_E H^2/\epsilon^2$
Witness Rank (Sun et al., 2019) | - | ✘ | - | $W_\kappa H^2|\mathcal{A}|/\epsilon^2$
Low Occupancy Complexity (Du et al., 2021) | $d^3H^4/\epsilon^2$ | $d^2H^2/\epsilon^2$ | $d^3H^3/\epsilon^2$ | $d^2H^2/\epsilon^2$
Kernelized Nonlinear Regulator (Kakade et al., 2020) | - | ✘ | - | $d_\phi^2 d_s H^4/\epsilon^2$
Linear $Q^*/V^*$ (Du et al., 2021) | $d^3H^4/\epsilon^2$ | $d^2H^2/\epsilon^2$ | $d^3H^3/\epsilon^2$ | $d^2H^2/\epsilon^2$

B A D D I T I O N A L E X A M P L E S

In this section, we compare our work with other results in the literature in terms of regret bounds and sample complexity. First, as mentioned in §3, when the DEF is taken as

$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) = Q_{h,g}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1}),$$

the ABC function reduces to the average Bellman error, and our ABC framework recovers the low Bellman eluder dimension framework for all cases compatible with such an estimation function. On several model-free structures, our regret bound is equivalent to that of the GOLF algorithm (Jin et al., 2021) (2021a) up to a factor of $H^{1/2}$. In the rest of this section, we compare on six additional examples: the linear $Q^*/V^*$ model (Du et al., 2021), the low occupancy complexity model (Du et al., 2021), kernel reactive POMDPs, FLAMBE/Feature Selection, Linear Quadratic Regulator, and finally Generalized Linear Bellman Complete. Moreover, we add a discussion on the $Q^*$ state-action aggregation model.
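To make the Bellman-residual DEF above concrete, here is a minimal sketch on a made-up two-state, two-action step. The reward table `R`, the candidate `V_next`, and both function names are hypothetical and chosen only for illustration; they are not part of the paper. The sketch shows that the empirical average of the DEF vanishes for a Bellman-consistent $Q_h$ and equals the perturbation when $Q_h$ is shifted by a constant.

```python
def bellman_residual(Q_h, V_next, s, a, r, s_next):
    """DEF value Q_h(s, a) - r - V_{h+1}(s') on one observed transition."""
    return Q_h[s][a] - r - V_next[s_next]


def average_bellman_error(Q_h, V_next, transitions):
    """Empirical mean of the DEF over a list of (s, a, r, s') tuples."""
    total = sum(bellman_residual(Q_h, V_next, *tr) for tr in transitions)
    return total / len(transitions)


# Hypothetical deterministic step: taking action a leads to state a, reward R[s][a].
R = [[0.0, 1.0], [0.5, 0.0]]
V_next = [2.0, 1.0]  # a candidate V_{h+1}

# Q_h that satisfies the Bellman equation with respect to V_next.
Q_consistent = [[R[s][a] + V_next[a] for a in range(2)] for s in range(2)]
# The same Q_h shifted upward by 0.25 everywhere.
Q_shifted = [[q + 0.25 for q in row] for row in Q_consistent]

transitions = [(s, a, R[s][a], a) for s in range(2) for a in range(2)]

err_consistent = average_bellman_error(Q_consistent, V_next, transitions)
err_shifted = average_bellman_error(Q_shifted, V_next, transitions)
```

In this toy setting the residual is an exact diagnostic because the transitions are deterministic; with stochastic transitions only its conditional expectation is informative, which is why the framework works with averaged quantities.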

B . 1 L I N E A R Q * /V *

The linear $Q^*/V^*$ model was proposed in Du et al. (2021). In addition to the linear structure of the optimal action-value function $Q^*$, we further assume a linear structure of the optimal state-value function $V^*$. We formally define the linear $Q^*/V^*$ model as follows:

Definition 14 (Linear $Q^*/V^*$, Definition 4.5 in Du et al. 2021). A linear $Q^*/V^*$ model satisfies, for two Hilbert spaces $\mathcal{H}_1, \mathcal{H}_2$ and two given feature mappings $\phi(s, a) : \mathcal{S} \times \mathcal{A} \to \mathcal{H}_1$, $\psi(s') : \mathcal{S} \to \mathcal{H}_2$, that there exist $w_h^* \in \mathcal{H}_1$, $\theta_h^* \in \mathcal{H}_2$ such that $Q_h^*(s, a) = \langle w_h^*, \phi(s, a)\rangle$ and $V_h^*(s') = \langle \theta_h^*, \psi(s')\rangle$ for any $h \in [H]$ and $(s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$.

Suppose that $\mathcal{H}_1$ and $\mathcal{H}_2$ have dimensions $d_1$ and $d_2$, respectively. Du et al. (2021) show that the linear $Q^*/V^*$ model belongs to the Bilinear Classes with dimension $d = d_1 + d_2$ and that the BiLin-UCB algorithm achieves an $O(d^3H^4/\epsilon^2)$ sample complexity. In contrast, the sample complexity of OPERA is $O(d^2H^2/\epsilon^2)$.

B . 2 L O W O C C U P A N C Y C O M P L E X I T Y

The low occupancy complexity model assumes linearity of the state-action distribution and was proposed in Du et al. (2021). We recap its definition formally as follows:

Definition 15 (Low Occupancy Complexity, Definition 4.7 in Du et al. 2021). A low occupancy complexity model is an MDP $M$ satisfying, for some hypothesis class $\mathcal{F}$, a Hilbert space $\mathcal{H}$ and feature mappings $\phi_h(\cdot, \cdot) : \mathcal{S} \times \mathcal{A} \to \mathcal{H}$, $\forall h \in [H]$, that there exists a function on hypothesis classes $\beta_h : \mathcal{F} \to \mathcal{H}$ such that

$$d^{\pi_f}(s_h, a_h) = \langle \beta_h(f), \phi_h(s_h, a_h)\rangle, \quad \forall f \in \mathcal{F},\ \forall (s_h, a_h) \in \mathcal{S} \times \mathcal{A}.$$

Du et al. (2021) proved that the low occupancy complexity model belongs to the Bilinear Classes and has a sample complexity of $d^3H^4/\epsilon^2$ under the BiLin-UCB algorithm. In comparison, the low occupancy complexity model admits an improved sample complexity of $d^2H^2/\epsilon^2$ under the OPERA algorithm.

B . 3 K E R N E L R E A C T I V E P O M D P S

The Reactive POMDP (Krishnamurthy et al., 2016) is a partially observable MDP (POMDP) model that can be described by the tuple $(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathbb{T}, \mathbb{O}, r, H)$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces respectively, $\mathcal{O}$ is the observation space, $\mathbb{T}$ is the transition matrix that maps each $(s, a) \in \mathcal{S} \times \mathcal{A}$ to a probability measure on $\mathcal{S}$ and determines the dynamics of the next state as $s_{h+1} \sim \mathbb{T}(\cdot \mid s_h, a_h)$, and $\mathbb{O}$ is the emission measure that determines the observation $o_h \sim \mathbb{O}(\cdot \mid s_h)$ given the current state $s_h$. The reactiveness of a POMDP refers to the property that the optimal value function $Q^*$ depends only on the current observation and action. In other words, for all $h$, there exists an $f_h^* : \mathcal{O} \times \mathcal{A} \to [0, 1]$ such that for any given trajectory $\tau_h = [o_1, a_1, \ldots, o_h]$ and $a_h$, we have $Q^*(\tau_h, a_h) = f_h^*(o_h, a_h)$. Given the definition of a reactive POMDP, we define the kernel reactive POMDP (Jin et al., 2021) as follows:

Definition 16 (Kernel Reactive POMDP). A kernel reactive POMDP is a reactive POMDP that satisfies, for each $h \in [H]$ and a given separable Hilbert space $\mathcal{H}$, that there exist feature mappings $\phi_h : \mathcal{S} \times \mathcal{A} \to \mathcal{H}$ and $\psi_h : \mathcal{S} \to \mathcal{H}$ such that the transition matrix satisfies $\mathbb{T}_h(s' \mid s, a) = \langle \phi_h(s, a), \psi_h(s')\rangle_{\mathcal{H}}$, and $\psi$ is bounded in the sense that for any $V(\cdot) : \mathcal{S} \to [0, 1]$, $\big\|\sum_{s' \in \mathcal{S}} V(s')\psi(s')\big\|_{\mathcal{H}} \le 1$.

In Jin et al. (2021), the authors showed that the kernel reactive POMDP with the vanilla estimation function $\ell_h(o_h, f_{h+1}, g_h, v) = Q_{h,g}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})$ has V-type BE dimension bounded by the effective dimension. According to Proposition 34, the kernel reactive POMDP model also has low FE dimension bounded by the effective dimension.

B . 4 F L A M B E / F E A T U R E S E L E C T I O N

The FLAMBE/feature selection model, first introduced in Agarwal et al. (2020b), is similar to the linear MDP setting; the main difference is that the feature mappings are unknown. We formally define the feature selection model as follows:

Definition 17 (Feature Selection). A low rank feature selection model is an MDP $M$ that satisfies, for any $h \in [H]$ and a given Hilbert space $\mathcal{H}$, that there exist unknown feature mappings $\mu_h^* : \mathcal{S} \to \mathcal{H}$ and $\phi^* : \mathcal{S} \times \mathcal{A} \to \mathcal{H}$ such that the transition probability satisfies

$$P_h(s' \mid s, a) = \mu_h^*(s')^\top \phi^*(s, a), \quad \forall (s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}.$$

We consider the feature selection model with DEF $\ell_h(o_h, f_{h+1}, g_h, v) := Q_{h,g}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})$. Lemma A.1 of Du et al. (2021) proves that

$$\mathbb{E}_{s_h \sim \pi_g, a_h \sim \pi_f}\big[Q_{h,f}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})\big] = \langle W_h(f), X_h(g)\rangle, \tag{B.1}$$

where

$$W_h(f) := \int_{s \in \mathcal{S}} \mu_h^*(s)\Big(V_{h,f}(s) - r(s, \pi_f(s)) - \mathbb{E}_{s' \sim P_h(\cdot \mid s, \pi_f(s))}[V_{h+1,f}(s')]\Big)\, ds, \qquad X_h(g) := \mathbb{E}_{s_{h-1}, a_{h-1} \sim \pi_g}\big[\phi^*(s_{h-1}, a_{h-1})\big].$$

We note that Eq. (B.1) ensures conditions (i) and (ii) in Definition 5 at the same time, and the ABC of the feature selection setting has a bilinear structure that enables us to apply Proposition 35 to conclude low FE dimension.

B . 5 L I N E A R Q U A D R A T I C R E G U L A T O R

In a linear quadratic regulator (LQR) model (Bradtke, 1992; Anderson & Moore, 2007; Dean et al., 2020), we consider the $d$-dimensional state space $\mathcal{S} \subseteq \mathbb{R}^d$ and $K$-dimensional action space $\mathcal{A} \subseteq \mathbb{R}^K$. The transition dynamics of an LQR model can be written in matrix form so that the induced value function is quadratic (Jiang et al., 2017). We formally define the LQR model as follows:

Definition 18 (Linear Quadratic Regulator).
A linear quadratic regulator model is an MDP $M$ such that there exist unknown matrices $A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times K}$ and $Q \in \mathbb{R}^{d \times d}$ satisfying, for all $h \in [H]$ and zero-centered random variables $\epsilon_h, \tau_h$ with $\mathbb{E}[\epsilon_h \epsilon_h^\top] = \Sigma$ and $\mathbb{E}[\tau_h^2] = \sigma^2$, that

$$s_{h+1} = A s_h + B a_h + \epsilon_h, \qquad r_h = s_h^\top Q s_h + a_h^\top a_h + \tau_h.$$

The LQR model has been analyzed in Du et al. (2021) and proved to belong to the Bilinear Classes. Du et al. (2021) used the hypothesis class defined as

$$\big\{\mathcal{F}_h = \{(C_h, \Lambda_h, O_h) : C_h \in \mathbb{R}^{K \times d},\ \Lambda_h \in \mathbb{R}^{d \times d},\ O_h \in \mathbb{R}\}\big\}_{h \in [H]}.$$

For each hypothesis $f \in \mathcal{F}$, the corresponding policy and value function are

$$\pi_f(s_h) = C_{h,f}\, s_h, \qquad V_{h,f}(s_h) = s_h^\top \Lambda_{h,f}\, s_h + O_{h,f}.$$

Under the above setting, we use the DEF for LQR $\ell_h(o_h, f_{h+1}, g_h, v) := Q_{h,g}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})$, and Lemma A.4 in Du et al. (2021) showed that

$$\mathbb{E}_{s_h, a_h \sim \pi_g}\big[Q_{h,f}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})\big] = \langle W_h(f), X_h(g)\rangle, \tag{B.2}$$

where

$$W_h(f) = \Big[\mathrm{vec}\big(\Lambda_{h,f} - Q - C_{h,f}^\top C_{h,f} - (A + BC_{h,f})^\top \Lambda_{h+1,f}(A + BC_{h,f})\big),\ O_{h,f} - O_{h+1,f} - \mathrm{trace}(\Lambda_{h+1,f}\Sigma)\Big], \qquad X_h(f) = \Big[\mathrm{vec}\big(\mathbb{E}_{s_h \sim \pi_f}[s_h s_h^\top]\big),\ 1\Big].$$

B . 6 G E N E R A L I Z E D L I N E A R B E L L M A N C O M P L E T E

Definition 19 (Generalized Linear Bellman Complete). [...] $\mathcal{F} = \big\{\mathcal{F}_h = \{\sigma(\theta_h^\top \phi(s, a)) : \theta_h \in \mathcal{H},\ \|\theta_h\| \le R\}\big\}_{h \in [H]}$ such that for any $f \in \mathcal{F}$ and all $h \in [H]$ the Bellman completeness condition holds:

$$r(s, a) + \mathbb{E}_{s' \sim P_h} \max_{a' \in \mathcal{A}} \sigma\big(\theta_{h+1,f}^\top \phi(s', a')\big) \in \mathcal{H}_h.$$

By the choice of the hypothesis class $\mathcal{F}$, we know that there exists a mapping $\mathcal{T}_h : \mathcal{H} \to \mathcal{H}$ such that

$$\sigma\big(\mathcal{T}_h(\theta_{h+1,f})^\top \phi(s, a)\big) = r(s, a) + \mathbb{E}_{s' \sim P_h} \max_{a' \in \mathcal{A}} \sigma\big(\theta_{h+1,f}^\top \phi(s', a')\big). \tag{B.3}$$

We note that Du et al. (2021) choose a discrepancy function dependent on a discriminator function $v$. In this work, we choose a different estimation function that allows a much simpler calculation and a sharper sample complexity result. We let

$$\ell_h(o_h, f_{h+1}, g_h, v) := \sigma\big(\theta_{h,g}^\top \phi(s_h, a_h)\big) - r_h - \max_{a'} \sigma\big(\theta_{h+1,f}^\top \phi(s_{h+1}, a')\big).$$

By Eq. (B.3), it is easy to check that the above DEF satisfies the decomposable condition. Assuming $a \le \sigma'(x) \le b$, Lemma 6.2 in Du et al. (2021) has already shown the Bellman dominance property that

$$\mathbb{E}_{s_h, a_h \sim \pi_f}\big[Q_{h,f}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})\big] \le b\,\Big\langle \mathrm{vec}\big((\theta_{h,f} - \mathcal{T}_h(\theta_{h+1,f}))(\theta_{h,f} - \mathcal{T}_h(\theta_{h+1,f}))^\top\big),\ \mathrm{vec}\big(\mathbb{E}_{s_h, a_h \sim \pi_f}\, \phi(s_h, a_h)\phi(s_h, a_h)^\top\big)\Big\rangle = b\,\langle W_h(f), X_h(f)\rangle.$$

Next, we illustrate that the Dominating Average EF condition holds in our framework. We have

$$\mathbb{E}_{s_h \sim \pi_g, a_h \sim \pi_{\mathrm{op}}} \big\|\mathbb{E}_{s_{h+1}}[\ell_{h,g}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h]\big\|^2 = \mathbb{E}_{s_h, a_h \sim \pi_g} \big\|\sigma(\theta_{h,f}^\top \phi(s_h, a_h)) - \sigma(\mathcal{T}_h(\theta_{h+1,f})^\top \phi(s_h, a_h))\big\|^2 \ge a\, \mathbb{E}_{s_h, a_h \sim \pi_g} \big((\theta_{h,f} - \mathcal{T}_h(\theta_{h+1,f}))^\top \phi(s_h, a_h)\big)^2 \ge a\, \langle W_h(f), X_h(g)\rangle,$$

where

$$W_h(f) := \mathrm{vec}\big((\theta_{h,f} - \mathcal{T}_h(\theta_{h+1,f}))(\theta_{h,f} - \mathcal{T}_h(\theta_{h+1,f}))^\top\big), \qquad X_h(f) := \mathrm{vec}\big(\mathbb{E}_{s_h, a_h \sim \pi_f}\, \phi(s_h, a_h)\phi(s_h, a_h)^\top\big).$$

Analogous to the KNR case and the proof of Lemma 30, the aforementioned model with ABC function $\langle W_h(f), X_h(f)\rangle$ has low FE dimension.

B . 7 Q * S T A T E - A C T I O N A G G R E G A T I O N

Finally, we consider the $Q^*$ state-action aggregation model (Dong et al., 2019), which cannot be covered by the Bilinear Classes (Du et al., 2021). We illustrate that our ABC framework covers this model with a nonlinear coupling function.

Definition 20 ($Q^*$ state-action aggregation). We call an MDP $M$ a $Q^*$ state-action aggregation model if there exists a $\xi(s, a) : \mathcal{S} \times \mathcal{A} \to \mathcal{B}$ such that for any state-action pairs $(s, a), (s', a') \in \mathcal{S} \times \mathcal{A}$, if $\xi(s, a) = \xi(s', a')$, then $Q^*(s, a) = Q^*(s', a')$.

For this case, we use the hypothesis class defined as $\big\{\mathcal{F}_h = \{w_h \in \mathbb{R}^d\}\big\}_{h \in [H]}$, where we take $Q_{h,g}(s, a) = \langle w_{h,g}, \psi(s, a)\rangle$ and $V_{h,g}(s') = \max_{a' \in \mathcal{A}} \langle w_{h,g}, \psi(s', a')\rangle$. We define the DEF as the Bellman residual and conclude that

$$\ell_h(o_h, g_{h+1}, g_h, v) = Q_{h,g}(s_h, a_h) - r_h - V_{h,g}(s_{h+1}) = Q_{h,g}(s_h, a_h) - V_{h,g}(s_{h+1}) - \big[Q_h^*(s_h, a_h) - V_h^*(s_{h+1})\big] = \langle w_{h,g} - w_h^*, \psi(s_h, a_h)\rangle - \Big[\max_{a' \in \mathcal{A}} \langle w_{h,g}, \psi(s_{h+1}, a')\rangle - \max_{a' \in \mathcal{A}} \langle w_h^*, \psi(s_{h+1}, a')\rangle\Big]. \tag{B.4}$$

Taking expectation over the distribution on $o_h$ given by $s_h, a_h \sim \pi_f$ and $s_{h+1} \sim P_h$, we have

$$\mathbb{E}_{s_h, a_h \sim \pi_f}\big[\ell_h(o_h, g_{h+1}, g_h, v)\big] = \big\langle w_{h,g} - w_h^*,\ \mathbb{E}_{s_h, a_h \sim \pi_f}\, \psi(s_h, a_h)\big\rangle - \mathbb{E}_{s_h, a_h \sim \pi_f,\, s_{h+1} \sim P_h}\Big[\max_{a' \in \mathcal{A}} \langle w_{h,g}, \psi(s_{h+1}, a')\rangle - \max_{a' \in \mathcal{A}} \langle w_h^*, \psi(s_{h+1}, a')\rangle\Big]. \tag{B.5}$$

If we define $G(g, f)$ by the right-hand side of (B.5), then $G_{h,f^*}(\cdot, \cdot)$ and $\ell$ defined in (B.4) serve as the coupling function and the estimation function of an ABC class, respectively. Meanwhile, Equation (B.5) cannot be expressed as an inner product of some $W_h(f)$, $X_h(g)$ and thus cannot be covered by Du et al. (2021). Nevertheless, in our ABC framework, it is unclear if the FE dimension of $\mathcal{F}$ with respect to $G_{h,f^*}(\cdot, \cdot)$ can be bounded in a nontrivial way (i.e., $\ll |\mathcal{S}| \cdot |\mathcal{A}|$).

C P R O O F O F M A I N R E S U L T S

In this section, we provide proofs of our main result, Theorem 11, and of a sample complexity corollary of the OPERA algorithm. Building on proof techniques widely used in confidence-bound-based RL algorithms (Russo & Van Roy, 2013), our proof generalizes that of the GOLF algorithm (Jin et al., 2021) while admitting general DEFs and ABCs. We prove our main result as follows.

C . 1 P R O O F O F T H E O R E M 1 1

Proof of Theorem 11. We recall that the objective of an RL problem is to find an $\epsilon$-optimal policy satisfying $V_1^*(s_1) - V_1^{\pi^t}(s_1) \le \epsilon$. Moreover, the regret of an RL problem is defined as $\sum_{t=1}^T \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big]$, where $\pi^t$ is the output policy of an algorithm at time $t$.

Step 1: Feasibility of $f^*$. First of all, we show that the optimal hypothesis $f^*$ lies within the confidence region defined by Eq. (4.1) with high probability:

Lemma 21 (Feasibility of $f^*$). In Algorithm 1, given $\rho > 0$ and $\delta > 0$ we choose $\beta = c\big(\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta) + T\rho\big)$ for some large enough constant $c$. Then with probability at least $1 - \delta$, $f^*$ satisfies for any $t \in [T]$:

$$\max_{v \in \mathcal{V}} \Big\{\sum_{i=1}^{t-1} \big\|\ell_{h,f^i}(o_h^i, f_{h+1}^*, f_h^*, v)\big\|^2 - \inf_{g_h \in \mathcal{G}_h} \sum_{i=1}^{t-1} \big\|\ell_{h,f^i}(o_h^i, f_{h+1}^*, g_h, v)\big\|^2\Big\} \le O(\beta).$$

Lemma 21 shows that at each round of updates the optimal hypothesis $f^*$ stays in the confidence region depicted by Eq. (4.1) with radius $O(\beta)$. We delay the proof of Lemma 21 to §F.2. Lemma 21 together with the optimization procedure in Line 3 of Algorithm 1 implies the following upper bound on $V_1^*(s_1) - V_1^{\pi^t}(s_1)$, which holds with probability at least $1 - \delta$:

$$V_1^*(s_1) - V_1^{\pi^t}(s_1) \le V_{1,f^t}(s_1) - V_1^{\pi^t}(s_1). \tag{C.1}$$

Step 2: Policy Loss Decomposition. The second step is to upper bound the regret by the summation of Bellman errors. We apply the policy loss decomposition lemma in Jiang et al. (2017).

Lemma 22 (Lemma 1 in Jiang et al. 2017). $\forall f \in \mathcal{H}$,

$$V_{1,f^t}(s_1) - V_1^{\pi^t}(s_1) = \sum_{h=1}^H \mathbb{E}_{s_h, a_h \sim \pi^t}\big[Q_{h,f^t}(s_h, a_h) - r_h - V_{h+1,f^t}(s_{h+1})\big].$$
Combining Lemma 22 with Eq. (C.1) we have the following:

$$V_1^*(s_1) - V_1^{\pi^t}(s_1) \le V_{1,f^t}(s_1) - V_1^{\pi^t}(s_1) = \sum_{h=1}^H \mathbb{E}_{s_h, a_h \sim \pi^t}\big[Q_{h,f^t}(s_h, a_h) - r_h - V_{h+1,f^t}(s_{h+1})\big]. \tag{C.2}$$

Step 3: Small ABC Value in the Confidence Region. The third step is devoted to controlling the cumulative square of the Admissible Bellman Characterization function. Recalling that the ABC function is upper bounded by the average DEF, where each feasible DEF stays in the confidence region that satisfies Eq. (4.1), we arrive at the following Lemma 23:

Lemma 23. In Algorithm 1, given $\rho > 0$ and $\delta > 0$ we choose $\beta = c\big(\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta) + T\rho\big)$ for some large enough constant $c$. Then with probability at least $1 - \delta$, for all $(t, h) \in [T] \times [H]$, we have

$$\sum_{i=1}^{t-1} \big(G_{h,f^*}(f^t, f^i)\big)^2 \le O(\beta). \tag{C.3}$$

The proof of Lemma 23 makes use of Freedman's inequality (the precise version as in Agarwal et al. (2014)) and we delay the proof to §F.1.

Step 4: [...] Lemma 24 [...] $\sum_{i=1}^{t-1} \big(G(f_t, g_i)\big)^2 \le \beta$, the following inequality holds for all $t \in [T]$ and $\omega > 0$:

$$\sum_{i=1}^{t} |G(f_i, g_i)| \le O\Big(\sqrt{\dim_{FE}(\mathcal{F}, G, \omega)\,\beta t} + C \cdot \min\{t, \dim_{FE}(\mathcal{F}, G, \omega)\} + t\omega\Big).$$

The proof of Lemma 24 is in §F.3.

Step 5: Combining Everything. In the final step, we combine the regret bound decomposition argument, the cumulative ABC bound, and the Bellman dominance property together to derive our final regret guarantee. For any $h \in [H]$, we take $G(\cdot, \cdot) = G_{h,f^*}(\cdot, \cdot)$, $g_i = f^i$, $f_t = f^t$ and $\omega = 1/T$ in Lemma 24. By Eq. (C.3) in Lemma 23, we have for any $h \in [H]$ and $t \in [T]$,

$$\sum_{i=1}^{t} |G_{h,f^*}(f^i, f^i)| \le O\Big(\sqrt{\dim_{FE}(\mathcal{F}, G_{h,f^*}, 1/T)\,\beta t} + C \cdot \min\{t, \dim_{FE}(\mathcal{F}, G_{h,f^*}, 1/T)\} + \sqrt{t}\Big) \le O\Big(\sqrt{\dim_{FE}(\mathcal{F}, G_{h,f^*}, 1/T)\,\beta t}\Big).$$

We recall our choice of $\beta = c\big(\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta) + T\rho\big)$. Taking $\rho = 1/T$, we have

$$\sum_{i=1}^{t} |G_{h,f^*}(f^i, f^i)| \le O\Big(\sqrt{\dim_{FE}(\mathcal{F}, G_{h,f^*}, 1/T) \log(TH\mathcal{N}_{\mathcal{L}}(1/T)/\delta) \cdot t}\Big) \le O\Big(\sqrt{\dim_{FE}(\mathcal{F}, G, 1/T) \log(TH\mathcal{N}_{\mathcal{L}}(1/T)/\delta) \cdot t}\Big).$$
Combining this with property (ii) in Definition 5 and decomposition (C.2), we conclude our main result that with probability at least $1 - \delta$,

$$\sum_{t=1}^T \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big] \le \frac{1}{\kappa} \sum_{t=1}^T \sum_{h=1}^H |G_{h,f^*}(f^t, f^t)| \le O\Big(\frac{H}{\kappa} \sqrt{T \cdot \dim_{FE}(\mathcal{F}, G, 1/T) \log(TH\mathcal{N}_{\mathcal{L}}(1/T)/\delta)}\Big).$$

This completes the proof of Theorem 11.
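To get a feel for how the regret bound of Theorem 11 translates into a number of episodes, the sketch below divides the bound by $T$ and searches for the first power of two at which the resulting average falls below a target $\epsilon$. It is only an illustration: the hidden constant in $O(\cdot)$ is set to 1, and the FE dimension `dim_fe` and the log-covering term `log_cover` (standing in for $\log \mathcal{N}_{\mathcal{L}}(1/T)$) are treated as fixed inputs, which are simplifying assumptions of ours rather than statements from the paper.

```python
import math


def avg_regret_bound(T, H, kappa, dim_fe, log_cover=0.0, delta=0.05):
    """Theorem 11 regret bound divided by T, with the O(.) constant set to 1."""
    log_term = math.log(T * H / delta) + log_cover
    return (H / kappa) * math.sqrt(dim_fe * log_term / T)


def episodes_needed(eps, H, kappa, dim_fe, log_cover=0.0, delta=0.05):
    """Smallest power of two T whose average regret bound is at most eps."""
    T = 1
    while avg_regret_bound(T, H, kappa, dim_fe, log_cover, delta) > eps:
        T *= 2
    return T


# Hypothetical problem sizes, for illustration only.
T_coarse = episodes_needed(eps=0.1, H=5, kappa=1.0, dim_fe=10)
T_fine = episodes_needed(eps=0.05, H=5, kappa=1.0, dim_fe=10)
```

Because $T$ also appears inside the logarithm, the required number of episodes is defined implicitly; the doubling search sidesteps solving that equation in closed form. Halving $\epsilon$ at least quadruples the returned value, reflecting the $1/\epsilon^2$ scaling.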

C . 2 S A M P L E C O M P L E X I T Y O F O P E R A

Corollary 25 (Sample Complexity of OPERA). Consider an MDP $M$ with hypothesis classes $\mathcal{F}, \mathcal{G}$ satisfying Assumption 1 and a Decomposable Estimation Function $\ell$ satisfying Assumption 7, and suppose there exists an Admissible Bellman Characterization $G$ with low functional eluder dimension. For any $\epsilon \in (0, 1]$, we choose $\beta = c\big(\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta) + T\rho\big)$ with $\rho = \frac{\kappa^2\epsilon^2}{\dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H)\,H^2}$ for some large enough constant $c$. For the on-policy case when $\pi_{\mathrm{op}} = \pi_{\mathrm{est}} = \pi^t$, with probability at least $1 - \delta$ Algorithm 1 outputs an $\epsilon$-optimal policy $\pi_{\mathrm{out}}$ within $T$ trajectories, where

$$T = \frac{\dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H)\, \log\big(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta\big)\, H^2}{\kappa^2\epsilon^2}.$$

Proof of Corollary 25. By the policy loss decomposition (C.2), (C.3) in Lemma 23 and Lemma 24, we have that

$$\frac{1}{T}\sum_{t=1}^T \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big] \le \frac{1}{\kappa T} \sum_{t=1}^T \sum_{h=1}^H |G_{h,f^*}(f^t, f^t)| \le O\Bigg(\frac{H}{\kappa}\sqrt{\dim_{FE}(\mathcal{F}, G, \omega)\Big(\frac{\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta)}{T} + \rho\Big)} + \frac{H\omega}{\kappa}\Bigg). \tag{C.4}$$

Taking $\omega = \kappa\epsilon/H$ and $\rho = \frac{\kappa^2\epsilon^2}{\dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H)\,H^2}$, Eq. (C.4) becomes

$$\frac{1}{T}\sum_{t=1}^T \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big] \le O\Bigg(\frac{H}{\kappa}\sqrt{\frac{\dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H) \log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta)}{T}} + \epsilon\Bigg).$$

Taking $T = \dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H) \log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta)\, H^2/(\kappa^2\epsilon^2)$ yields the desired result.

D Q - T Y P E A N D V - T Y P E S A M P L E C O M P L E X I T Y A N A L Y S I S

In Definition 5, we note that there are two ways to calculate the ABC of an MDP model depending on the choice of the operating policy $\pi_{\mathrm{op}}$. Specifically, if $\pi_{\mathrm{op}} = \pi_g$, we call it the Q-type ABC. Otherwise, if $\pi_{\mathrm{op}} = \pi_f$, we call it the V-type ABC. For example, when taking

$$G_{h,f^*}(f, g) = \mathbb{E}_{s_h \sim \pi_g, a_h \sim \pi_g}\big[Q_{h,f}(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1})\big],$$

the FE dimension of $G_{h,f^*}(f, g)$ recovers the Q-type BE dimension (Definition 8 in Jin et al., 2021). When taking

$$G_{h,f^*}(f, g) = \mathbb{E}_{s_h \sim \pi_g, a_h \sim \pi_f}\big[Q_{h,f}(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1})\big],$$

the FE dimension of $G_{h,f^*}(f, g)$ recovers the V-type BE dimension (Definition 20 in Jin et al., 2021).
The algorithms for solving Q-type and V-type models differ slightly in the executing policy $\pi_{\mathrm{est}}$. We use $\pi_{\mathrm{est}} = \pi^t$ for Q-type models in Algorithm 1, while $\pi_{\mathrm{est}} = U(\mathcal{A})$, the uniform distribution on the action set, for V-type models. The Q-type and V-type characterizations have their respective domains of applicability. For example, the reactive POMDP model belongs to ABC with low FE dimension with respect to the V-type ABC while inducing large FE dimension with respect to the Q-type ABC. On the contrary, the low inherent Bellman error problem in Zanette et al. (2020a) is more suitable for a Q-type characterization than a V-type characterization. For general RL models, we often prefer the Q-type ABC because the sample complexity of V-type algorithms scales with the size of the action space $|\mathcal{A}|$. Due to the uniform executing policy, we will only be able to derive regret bounds for Q-type characterizations, as is explained in Jin et al. (2021). In §4 and §C, we have illustrated regret bound and sample complexity results for the Q-type cases where we let $\pi_{\mathrm{op}} = \pi_{\mathrm{est}} = \pi^t$ through Algorithm 1. In the following Corollary 26, we prove a sample complexity result for V-type ABC models.

Corollary 26. Consider an MDP $M$ with hypothesis classes $\mathcal{F}, \mathcal{G}$ satisfying Assumption 1 and a Decomposable Estimation Function $\ell$ satisfying Assumption 7, and suppose there exists an Admissible Bellman Characterization $G$ with low functional eluder dimension. For any $\epsilon \in (0, 1]$, we choose $\beta = O\big(\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta) + T\rho\big)$. For V-type models, where $\pi_{\mathrm{op}} = \pi^t$ and $\pi_{\mathrm{est}} = U(\mathcal{A})$, with probability at least $1 - \delta$ Algorithm 1 outputs an $\epsilon$-optimal policy $\pi_{\mathrm{out}}$ within

$$T = \frac{|\mathcal{A}|\, \dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H)\, \log\big(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta\big)\, H^2}{\kappa^2\epsilon^2}$$

trajectories, where $\rho = \frac{\kappa^2\epsilon^2}{\dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H)\,H^2}$.

Proof of Corollary 26. The proof of Corollary 26 largely follows the proofs of Theorem 11 and Corollary 25. We again have feasibility of $f^*$ and the policy loss decomposition.
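However, due to the different sampling policy, the proof of Lemma 23 differs at Eq. (F.5). Instead, we have

$$\sum_{i=1}^{t-1} \max_{v \in \mathcal{V}} \mathbb{E}_{s_h \sim \pi^i, a_h \sim \pi^t}\, \mathbb{E}_{s_{h+1}}\big[X_i(h, f^t, v) \mid s_h, a_h\big] = \sum_{i=1}^{t-1} \max_{v \in \mathcal{V}} \mathbb{E}_{s_h \sim \pi^i, a_h \sim U(\mathcal{A})} \Big[\frac{\mathbb{1}(a_h^i = \pi_f(s_h^i))}{1/|\mathcal{A}|}\, \mathbb{E}_{s_{h+1}}\big[X_i(h, f^t, v) \mid s_h, a_h\big]\Big] = \sum_{i=1}^{t-1} \max_{v \in \mathcal{V}} \mathbb{E}_{s_h \sim \pi^i, a_h \sim U(\mathcal{A})} \Big[\frac{\mathbb{1}(a_h^i = \pi_f(s_h^i))}{1/|\mathcal{A}|}\, \big\|\mathbb{E}_{s_{h+1}}\big[\ell_{h,f^i}(o_h, f_{h+1}^t, f_h^t, v) \mid s_h, a_h\big]\big\|^2\Big] \le O\big(|\mathcal{A}|\beta + Rt\rho + R^2\iota\big). \tag{D.1}$$

Thus, Eq. (C.3) in Lemma 23 becomes

$$\sum_{i=1}^{t-1} \big(G_{h,f^*}(f^t, f^i)\big)^2 \le O(|\mathcal{A}|\beta).$$

The rest of the proof follows the proof of Corollary 25 with an additional $|\mathcal{A}|$ factor. By the policy loss decomposition (C.2) and Lemma 24, we have that

$$\frac{1}{T}\sum_{t=1}^T \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big] \le \frac{1}{\kappa T} \sum_{t=1}^T \sum_{h=1}^H |G_{h,f^*}(f^t, f^t)| \le O\Bigg(\frac{H}{\kappa}\sqrt{|\mathcal{A}| \dim_{FE}(\mathcal{F}, G, \omega)\Big(\frac{\log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta)}{T} + \rho\Big)} + \frac{H\omega}{\kappa}\Bigg). \tag{D.2}$$

Taking $\omega = \kappa\epsilon/H$ and $\rho = \frac{\kappa^2\epsilon^2}{\dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H)\,H^2}$, Eq. (D.2) becomes

$$\frac{1}{T}\sum_{t=1}^T \big[V_1^*(s_1) - V_1^{\pi^t}(s_1)\big] \le O\Bigg(\frac{H}{\kappa}\sqrt{\frac{|\mathcal{A}| \dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H) \log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta)}{T}} + \epsilon\Bigg).$$

Taking $T = |\mathcal{A}| \dim_{FE}(\mathcal{F}, G, \kappa\epsilon/H) \log(TH\mathcal{N}_{\mathcal{L}}(\rho)/\delta)\, H^2/(\kappa^2\epsilon^2)$ yields the desired result.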
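The $|\mathcal{A}|$ factor in Eq. (D.1) comes from rewriting an expectation under a deterministic action choice as an importance-weighted expectation under uniformly sampled actions. The following small check, with a made-up action set and made-up values (all names are hypothetical), verifies the identity $\mathbb{E}_{a \sim U(\mathcal{A})}\big[\mathbb{1}(a = \pi(s))\,|\mathcal{A}|\,X(a)\big] = X(\pi(s))$ exactly and exposes that the weight can be as large as $|\mathcal{A}|$.

```python
from fractions import Fraction


def uniform_importance_weighted(values, greedy_action):
    """Exact E_{a ~ U(A)}[ 1(a == greedy_action) / (1/|A|) * values[a] ]."""
    num_actions = len(values)
    prob = Fraction(1, num_actions)  # uniform probability 1/|A| of each action
    total = Fraction(0)
    for action, value in enumerate(values):
        # Importance weight: |A| on the greedy action, 0 elsewhere.
        weight = Fraction(1) / prob if action == greedy_action else Fraction(0)
        total += prob * weight * value
    return total


def largest_weight(num_actions):
    """Largest importance weight that can appear, namely |A|."""
    return Fraction(num_actions)


# Hypothetical per-action values X(a) for a three-action problem.
values = [Fraction(3, 10), Fraction(7, 10), Fraction(1, 5)]
weighted = uniform_importance_weighted(values, greedy_action=1)
```

The estimator is unbiased, but each sample is scaled by up to $|\mathcal{A}|$, which is the source of the extra factor in the V-type sample complexity.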

E P R O O F F O R S P E C I F I C E X A M P L E S

In this section, we consider three specific examples: linear mixture MDPs, low Witness rank MDPs, and KNRs. We explain how our framework exhibits superior properties compared with other general frameworks on these three instances of MDPs. For the reader's convenience, we summarize the conditions introduced in Items (i), (ii) in Definition 6 and also Items (i), (ii) in Definition 5, which are essential for any RL model to fit in our framework:

• Decomposability: $\ell_{h,f'}(o_h, f_{h+1}, g_h, v) - \mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, g_h, v) \mid s_h, a_h] = \ell_{h,f'}(o_h, f_{h+1}, \mathcal{T}(f)_h, v)$.
• Global Discriminator Optimality: $\big\|\mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, f_h, v_h^*(f)) \mid s_h, a_h]\big\| \ge \big\|\mathbb{E}_{s_{h+1}}[\ell_{h,f'}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h]\big\|$.
• Dominating Average EF: $\max_{v \in \mathcal{V}} \mathbb{E}_{s_h \sim \pi_g, a_h \sim \pi_{\mathrm{op}}} \big\|\mathbb{E}_{s_{h+1}}[\ell_{h,g}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h]\big\|^2 \ge \big(G_{h,f^*}(f, g)\big)^2$.
• Bellman Dominance: $\kappa \cdot \mathbb{E}_{s_h, a_h \sim \pi_f}\big[Q_{h,f}(s_h, a_h) - r(s_h, a_h) - V_{h+1,f}(s_{h+1})\big] \le |G_{h,f^*}(f, f)|$.

E . 1 L I N E A R M I X T U R E M D P S

In this case, we choose $\mathcal{F}_h = \mathcal{G}_h = \{\theta_h \in \mathcal{H}\}$. Thus, the hypothesis classes $\mathcal{F}$ and $\mathcal{G}$ consist of the sets of parameters $\theta_1, \ldots, \theta_H \in \mathcal{H}$. Moreover, for each hypothesis $f = (\theta_{1,f}, \ldots, \theta_{H,f}) \in \mathcal{F}$, the value function with respect to $f$ satisfies for any $h \in [H]$ that

$$Q_{h,f}(s, a) = \theta_{h,f}^\top \big[\psi(s, a) + \phi_{V_{h+1,f}}(s, a)\big], \quad \text{where } \phi_{V_{h+1,f}}(s, a) := \sum_{s' \in \mathcal{S}} \phi(s, a, s') V_{h+1,f}(s').$$

It is natural to define the DEF by

$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) = \theta_{h,g}^\top \big[\psi(s_h, a_h) + \phi_{V_{h+1,f'}}(s_h, a_h)\big] - r_h - V_{h+1,f'}(s_{h+1}).$$

If we use $\Phi_h^{t-1}$ to denote the matrix $\big[(\psi + \phi_{V_{h+1,f^1}})(s_h^1, a_h^1), \ldots, (\psi + \phi_{V_{h+1,f^{t-1}}})(s_h^{t-1}, a_h^{t-1})\big]$ and $y_h^{t-1}$ to denote the vector $\big[r_h + V_{h+1,f^1}(s_{h+1}^1), \ldots, r_h + V_{h+1,f^{t-1}}(s_{h+1}^{t-1})\big]$, Eq. (4.1) in Algorithm 1 under the linear mixture setting can be written in matrix form as

$$\big\|\theta_{h,f}^\top \Phi_h^{t-1} - y_h^{t-1}\big\|^2 - \inf_\theta \big\|\theta^\top \Phi_h^{t-1} - y_h^{t-1}\big\|^2 \le \beta. \tag{E.1}$$

Taking $\theta_{h,t} = \arg\min_\theta \big\|\theta^\top \Phi_h^{t-1} - y_h^{t-1}\big\|^2 = \big(\Phi_h^{t-1}(\Phi_h^{t-1})^\top\big)^{-1} \Phi_h^{t-1} (y_h^{t-1})^\top$ and $\Sigma_h^{t-1} := \Phi_h^{t-1}(\Phi_h^{t-1})^\top$, simple algebra yields

$$\big\|\theta_{h,f}^\top \Phi_h^{t-1} - y_h^{t-1}\big\|^2 - \inf_\theta \big\|\theta^\top \Phi_h^{t-1} - y_h^{t-1}\big\|^2 = \big\|(\theta_{h,f} - \theta_{h,t})^\top \Phi_h^{t-1}\big\|^2 = \big\|\theta_{h,f} - \theta_{h,t}\big\|^2_{\Sigma_h^{t-1}}, \tag{E.2}$$

so Algorithm 1 reduces to Algorithm 2. In particular, the confidence region defined by Eq. (E.2) in Algorithm 2 is the same as the confidence region used in the upper confidence RL with value-targeted regression (UCRL-VTR) algorithm (Jia et al., 2020; Ayoub et al., 2020). While UCRL-VTR performs step-by-step local optimization within the confidence region, resulting in a confidence bonus added to the Q value function, our Algorithm 2 follows a global optimization scheme, where the objective is the total expected return obtained by following the optimal policy under the current hypothesis. The design principle of the global optimization is the same as that of the ELEANOR algorithm (Zanette et al., 2020a). In fact, the difference between UCRL-VTR and Algorithm 2 is analogous to the difference between LSVI-UCB (Jin et al., 2020) [...] algorithm improves over the best-known results on general frameworks that subsume linear mixture MDPs. We provide more comparisons on the linear mixture model in §B.

Algorithm 2
3: Set $\pi^t := \pi_{f^t}$ where $f^t$ is taken as $\arg\max_{f \in \mathcal{F}} Q_{1,f}(s_1, \pi_f(s_1))$ subject to
   $\theta_{h,t} = \big(\Phi_h^{t-1}(\Phi_h^{t-1})^\top\big)^{-1} \Phi_h^{t-1} (y_h^{t-1})^\top$, $\big\|\theta_{h,f} - \theta_{h,t}\big\|^2_{\Sigma_h^{t-1}} \le \beta$ for all $h \in [H]$ (E.3)
4: For any $h \in [H]$, collect tuple $(r_h, s_h, a_h, s_{h+1})$ by executing $s_h, a_h \sim \pi^t$
5: Augment $\mathcal{D}_h = \mathcal{D}_h \cup \{(r_h, s_h, a_h, s_{h+1})\}$
6: end for
7: Output: $\pi_{\mathrm{out}}$ uniformly sampled from $\{\pi^t\}_{t=1}^T$

In terms of computation, assume that there exists a planning oracle for the optimization problem in Line 3 of Algorithm 2 that requires $B$ time complexity to solve. Then for each $t \in [T]$, $h \in [H]$, the computational complexity of the rest of the algorithm is dominated by the computation of $(\Sigma_h^{t-1})^{-1}$, and the total computational complexity would be $O(BT + d^2HT)$. Next, we proceed to prove that a linear mixture MDP belongs to the ABC class with low FE dimension.

Proof of Proposition 8. In the linear mixture model, we choose the hypothesis class $\mathcal{F}_h = \mathcal{G}_h = \{\theta_h \in \mathcal{H}\}$ and the DEF

$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) = \theta_{h,g}^\top \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f'}(s')\Big] - r_h - V_{h+1,f'}(s_{h+1}).$$

(a) Decomposability. Taking expectation over $s_{h+1}$, we obtain

$$\mathbb{E}_{s_{h+1}}\big[\ell_{h,f'}(o_h, f_{h+1}, g_h, v) \mid s_h, a_h\big] = (\theta_{h,g} - \theta_h^*)^\top \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f'}(s')\Big].$$

Thus, we have

$$\ell_{h,f'}(o_h, f_{h+1}, g_h, v) - \mathbb{E}_{s_{h+1}}\big[\ell_{h,f'}(o_h, f_{h+1}, g_h, v) \mid s_h, a_h\big] = (\theta_h^*)^\top \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f'}(s')\Big] - r_h - V_{h+1,f'}(s_{h+1}) = \ell_{h,f'}(o_h, f_{h+1}, f_h^*, v).$$

(b) Global Discriminator Optimality holds automatically since $\ell$ is independent of $v$.

(c) Dominating Average EF. We have the following inequality for linear mixture models:

$$\mathbb{E}_{s_h, a_h \sim \pi_g} \big\|\mathbb{E}[\ell_{h,g}(o_h, f_{h+1}, f_h, v) \mid s_h, a_h]\big\|^2 = \mathbb{E}_{s_h, a_h \sim \pi_g} \Big((\theta_{h,f} - \theta_h^*)^\top \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,g}(s')\Big]\Big)^2 \ge \Big((\theta_{h,f} - \theta_h^*)^\top\, \mathbb{E}_{s_h, a_h \sim \pi_g} \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,g}(s')\Big]\Big)^2. \tag{E.4}$$

(d) Bellman Dominance.
On the other hand, we know that

$$\mathbb{E}_{s_h, a_h \sim \pi_f}\big[Q_{h,f}(s_h, a_h) - r_h - V_{h+1,f}(s_{h+1})\big] = \mathbb{E}_{s_h, a_h \sim \pi_f} \Big[(\theta_{h,f} - \theta_h^*)^\top \Big(\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f}(s')\Big)\Big] = (\theta_{h,f} - \theta_h^*)^\top\, \mathbb{E}_{s_h, a_h \sim \pi_f} \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,f}(s')\Big]. \tag{E.5}$$

(e) Low FE Dimension. Observe from Eqs. (E.4) and (E.5) that we can choose the ABC function of a linear mixture MDP as

$$G_{h,f^*}(f, g) := (\theta_{h,f} - \theta_h^*)^\top\, \mathbb{E}_{s_h, a_h \sim \pi_g} \Big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,g}(s')\Big]. \tag{E.6}$$

The next Lemma 27 proves that the FE dimension of $\mathcal{F}$ with respect to the coupling function $G_{h,f^*}(f, g)$ is bounded by the effective dimension $d$ of the parameter space $\mathcal{H}$.

Lemma 27. The linear mixture MDP model has FE dimension $\le O(d)$ with respect to the ABC defined in (E.6).

We prove Lemma 27 in §G. Thus, we conclude our proof of Proposition 8.

From the above proof of Proposition 8, we see that linear mixture MDPs fit our framework. We apply Theorem 11 and Corollary 25 to linear mixture MDPs and conclude directly that Algorithm 2 has a regret upper bound of $dH\sqrt{T}$ together with a sample complexity upper bound of $d^2H^2/\epsilon^2$, matching the best-known results that use a Hoeffding-type bonus for exploration.

Algorithm 3 OPERA (Low Witness Rank MDPs)
1: Initialize: $\mathcal{D}_h = \emptyset$ for $h = 1, \ldots, H$
2: for iteration $t = 1, 2, \ldots, T$ do
3: Set $\pi^t := \pi_{f^t}$ where $f^t$ is taken as $\arg\max_{f \in \mathcal{F}} Q_{1,f}(s_1, \pi_f(s_1))$ subject to
   $\max_{v \in \mathcal{V}} \Big\{\sum_{i=1}^{t-1} \big(\mathbb{E}_{s \sim f_h} v(s_h^i, a_h^i, s) - v(s_h^i, a_h^i, s_{h+1}^i)\big)^2 - \inf_{g_h \in \mathcal{G}_h} \sum_{i=1}^{t-1} \big(\mathbb{E}_{s \sim g_h} v(s_h^i, a_h^i, s) - v(s_h^i, a_h^i, s_{h+1}^i)\big)^2\Big\} \le \beta$ for all $h \in [H]$ (E.7)
4: For any $h \in [H]$, collect tuple $(r_h, s_h, a_h, s_{h+1})$ by rolling in $s_h \sim \pi^t$ and executing $a_h \sim U(\mathcal{A})$
5: Augment $\mathcal{D}_h = \mathcal{D}_h \cup \{(r_h, s_h, a_h, s_{h+1})\}$
6: end for
7: Output: $\pi_{\mathrm{out}}$ uniformly sampled from $\{\pi^t\}_{t=1}^T$
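The "simple algebra" behind Eq. (E.2) in the linear mixture case, namely that the excess squared loss of any $\theta$ over the least-squares solution equals the $\Sigma$-weighted squared distance to that solution, can be checked numerically. The sketch below does so in exact rational arithmetic for a two-dimensional parameter; the data matrix `Phi`, targets `y`, and candidate `theta` are made-up numbers, and all helper names are ours.

```python
from fractions import Fraction as Fr


def sq_loss(theta, Phi, y):
    """||theta^T Phi - y||^2, where Phi stores one column per sample."""
    d, n = len(Phi), len(y)
    return sum((sum(theta[k] * Phi[k][i] for k in range(d)) - y[i]) ** 2
               for i in range(n))


def gram(Phi):
    """Sigma = Phi Phi^T."""
    d, n = len(Phi), len(Phi[0])
    return [[sum(Phi[a][i] * Phi[b][i] for i in range(n)) for b in range(d)]
            for a in range(d)]


def least_squares_2d(Phi, y):
    """Closed-form minimiser (Phi Phi^T)^{-1} Phi y for a 2-dimensional theta."""
    S = gram(Phi)
    b = [sum(Phi[k][i] * y[i] for i in range(len(y))) for k in range(2)]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    return [(S[1][1] * b[0] - S[0][1] * b[1]) / det,
            (S[0][0] * b[1] - S[1][0] * b[0]) / det]


def weighted_sq_norm(u, S):
    """u^T S u, i.e. the squared Sigma-norm of u."""
    return sum(u[a] * S[a][b] * u[b] for a in range(2) for b in range(2))


# Hypothetical features (2 x 4) and regression targets.
Phi = [[Fr(1), Fr(0), Fr(1), Fr(2)],
       [Fr(0), Fr(1), Fr(1), Fr(1)]]
y = [Fr(1), Fr(2), Fr(2), Fr(5)]
theta = [Fr(3, 2), Fr(-1, 2)]  # an arbitrary candidate parameter

theta_hat = least_squares_2d(Phi, y)
excess = sq_loss(theta, Phi, y) - sq_loss(theta_hat, Phi, y)
diff = [theta[k] - theta_hat[k] for k in range(2)]
ellipsoid = weighted_sq_norm(diff, gram(Phi))
```

The two quantities `excess` and `ellipsoid` coincide because the least-squares residual is orthogonal to the rows of `Phi`, so the loss-difference constraint and the ellipsoidal constraint describe the same confidence set.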

E . 2 L O W W I T N E S S R A N K M D P S

In this subsection, we provide a novel method for solving low Witness rank MDPs as a direct application of the OPERA algorithm. The Witness rank is an important model-based assumption that covers several structural models including factored MDPs (Kearns, 1998). Also, all models with low Bellman rank structure belong to the class of low Witness rank models while the converse does not hold (Sun et al., 2019). Although Witness rank models can be solved in a model-free manner, model-free algorithms cannot find near-optimal solutions of general Witness rank models in polynomial time. Meanwhile, existing frameworks (Sun et al., 2019; Du et al., 2021) with an efficient algorithm do not exhibit sharp sample complexity results.

We recall that in low Witness rank settings, hypotheses on model-based parameters (transition kernel and reward function) are made. Based on this, there are two recent lines of related approaches. Sun et al. (2019) first proposed an algorithm that eliminates candidate models with high estimated witness model misfits. On the other hand, Du et al. (2021) proposed a general algorithmic framework that would imply an optimization-based algorithm on low Witness rank models. The following definition is a generalized version of the Witness rank in Sun et al. (2019), where we require the discriminator class $\mathcal{V}$ to be complete, meaning that the assemblage of functions obtained by taking the value at $(s, a)$ from different functions also belongs to $\mathcal{V}$.

Definition 28 (Witness Rank).
For an MDP $M$, a given symmetric and complete discriminator class $\mathcal{V} = \{\mathcal{V}_h\}_{h\in[H]}$ with $\mathcal{V}_h \subset (\mathcal{S}\times\mathcal{A}\times\mathcal{S}\to\mathbb{R})$, and a hypothesis class $\mathcal{F}$, we define the Witness rank of $M$ as the smallest $d$ such that there exist two mappings $X_h : \mathcal{F}\to\mathbb{R}^d$ and $W_h : \mathcal{F}\to\mathbb{R}^d$ and a constant $\kappa\in(0,1]$ for which, for any two hypotheses $f, g\in\mathcal{F}$, the following inequalities hold for all $h\in[H]$:
$$\max_{v\in\mathcal{V}_h}\mathbb{E}_{s_h\sim\pi_f,\,a_h\sim\pi_g}\Big[\mathbb{E}_{\tilde s\sim g_h} v(s_h,a_h,\tilde s) - \mathbb{E}_{\tilde s\sim P_h} v(s_h,a_h,\tilde s)\Big] \ge \langle W_h(g), X_h(f)\rangle, \tag{E.8}$$
$$\kappa\cdot\mathbb{E}_{s_h\sim\pi_f,\,a_h\sim\pi_g}\Big[\mathbb{E}_{\tilde s\sim g_h} V_{h+1,g}(\tilde s) - \mathbb{E}_{\tilde s\sim P_h} V_{h+1,g}(\tilde s)\Big] \le \langle W_h(g), X_h(f)\rangle. \tag{E.9}$$

We prove an improved sample complexity result over the existing literature and illustrate the differences in the design scheme of our algorithm. We present the pseudocode in Algorithm 3. Note that in Eq. (E.7), we replace the DEF in Eq. (4.1) by (3.2).

Next, we elaborate on the design scheme of our algorithm in comparison with Sun et al. (2019) and Du et al. (2021). Note that the DEF $\mathbb{E}_{\tilde s\sim g_h} v(s_h,a_h,\tilde s) - v(s_h,a_h,s_{h+1})$ is similar to the discrepancy function used in Du et al. (2021) except for an importance sampling factor. Moreover, after taking the supremum over discriminator functions, the expected DEF equals the witnessed model misfit in Sun et al. (2019). Although Du et al. (2021) did not explicitly give an algorithm for the Witness rank, we observe some general differences between OPERA and BiLin-UCB (Du et al., 2021). The confidence region used in Algorithm 3 (simplified version for comparison) is $\sum_i \big[(\ell_f^i)^2 - \inf_g (\ell_g^i)^2\big] \le \beta$, centered at the optimal hypothesis, while the confidence region used in BiLin-UCB is $\sum_i \big(\frac{1}{m}\sum_{j\le m}\ell_f^{i,(j)}\big)^2 \le \beta'$, which bounds an estimate of $\ell$ centered at 0. Similarly to BiLin-UCB, Sun et al. (2019) also attempt to bound a batched estimate of $\ell$. Their algorithm repeatedly eliminates out-of-range models, enforcing a small witness model misfit on prior distributions. The analyses in Sun et al. (2019) and Du et al. (2021), however, do not impose the additional assumption on the discriminator class; under this assumption, we obtain a sharper sample complexity as in Corollary 12.

Suppose that there exists a planning oracle that solves the optimization problem in Line 3 of Algorithm 3 in time $B$. The computation in the rest of the algorithm depends on the structure of the discriminator class $\mathcal{V}$ and the hypothesis class $\mathcal{G}$. We omit the discussion here, as the planning oracle, with a total computational complexity of $O(BT)$, is usually the dominating term. In what follows, we prove that low Witness rank MDPs belong to the ABC class with low FE dimension.

Proof of Proposition 9. In the low Witness rank model, we choose the hypothesis class $\mathcal{F}_h = \mathcal{G}_h = \mathcal{M}$ and the DEF
$$\ell_h(o_h, f_{h+1}, g_h, v) = \mathbb{E}_{\tilde s\sim g_h} v(s_h,a_h,\tilde s) - v(s_h,a_h,s_{h+1}). \tag{E.10}$$
Without loss of generality, we assume that the discriminator class $\mathcal{V}$ is rich enough in the sense that if $v_{s,a}(\cdot,\cdot,\cdot)\in\mathcal{V}$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, then $v(s,a,s') := v_{s,a}(s,a,s')\in\mathcal{V}$ (if not, we can use a rich enough $\mathcal{V}'$ induced by $\mathcal{V}$), an assumption generally satisfied by common discriminator classes. For example, the total variation, exponential family, MMD, and factored MDP examples in Sun et al. (2019) all use a rich enough discriminator class. Also, if $\mathcal{V} = \{v : \|v\|_\infty \le c\}$ for some absolute constant $c$, the function class is rich enough.

(a) Decomposability. Taking expectation over $s_{h+1}$ in Eq. (E.10), we obtain
$$\mathbb{E}_{s_{h+1}}\big[\ell_h(o_h, f_{h+1}, g_h, v) \mid s_h, a_h\big] = \mathbb{E}_{\tilde s\sim g_h} v(s_h,a_h,\tilde s) - \mathbb{E}_{\tilde s\sim P_h} v(s_h,a_h,\tilde s). \tag{E.11}$$
Thus, we have
$$\ell_h(o_h, f_{h+1}, g_h, v) - \mathbb{E}_{s_{h+1}}\big[\ell_h(o_h, f_{h+1}, g_h, v) \mid s_h, a_h\big] = \mathbb{E}_{\tilde s\sim P_h} v(s_h,a_h,\tilde s) - v(s_h,a_h,s_{h+1}) = \ell_h(o_h, f_{h+1}, f_h^*, v).$$

(b) Global Discriminator Optimality. Eq. (E.11) implies that
$$\mathbb{E}_{s_{h+1}}\big[\ell_h(o_h, f_{h+1}, f_h, v) \mid s_h, a_h\big] = \int v(s_h,a_h,\tilde s)\big(f_h(\tilde s \mid s_h,a_h) - P_h(\tilde s \mid s_h,a_h)\big)\,d\tilde s.$$
We define v * h (f )(s, a, s ′ ) = v s,a (s, a, s ′ ) where v s,a := arg max v∈V v(s, a, s) (f h ( s | s, a) -P h ( s | s, a)) d s. It is easy to verify that v * h (f ) satisfies for all h ∈ [H] and (s h , a h ) ∈ S × A, E s h+1 [ℓ h (o h , f h+1 , f h , v * h (f )) | s h , a h ] ≥ E s h+1 [ℓ h (o h , f h+1 , f h , v) | s h , a h ] . Finally, the symmetry of V concludes the global discriminator optimality. Published as a conference paper at ICLR 2023 (c) Dominating Average EF. We have the following inequality for low Witness rank model: max v∈V E s h ∼πg,a h ∼π f ||E [ℓ h (o h , f h+1 , f h , v) | s h , a h ] || 2 = max v∈V E s h ∼πg,a h ∼π f (E s∼f h v(s h , a h , s) -E s∼P h v(s h , a h , s)) 2 ≥ max v∈V E s h ∼πg,a h ∼π f [E s∼f h v(s h , a h , s) -E s∼P h v(s h , a h , s)] 2 (i) ≥ ⟨W h (f ), X h (g)⟩ 2 . (E.12) where the last inequality (i) follows Definition 28 of witness rank. (d) Bellman Dominance. On the other hand, by Definition 28 we know that κ • E s h ,a h ∼π f [Q h,f (s h , a h ) -r h -V h+1,f (s h+1 )] ≤ ⟨W h (f ), X h (f )⟩ . (E.13) (e) Low FE Dimension. We see from Eq. (E.12) and (E.13) that we can choose ABC function with low Witness rank RL model as G h,f * (f, g) := ⟨W h (f ), X h (g)⟩ . (E.14) The next Lemma 29 proves that the FE dimension of F with respect to the coupling function G h,f * (f, g) is less than the dimension W κ of the witness model. Lemma 29. The low Witness rank MDP model has FE dimension ≤ O(W κ ) with respect to the ABC defined in (E.14). We prove Lemma 29 in §G. Thus, we conclude our proof of Proposition 9. By Proposition 9 we can straightforwardly derive the sample complexity by applying Corollary 26. For better understanding of the context, we present a complete proof of the sample complexity result of witness rank model in §E.4. E . 3 K E R N E L I Z E D N O N L I N E A R R E G U L AT O R In the KNR setting introduced in §3.2, the norm of s h+1 might be arbitrarily large if the random vector ϵ h+1 is large in magnitude. 
On the contrary, our framework requires the boundedness of the DEF. To resolve this issue, we note that the tail bound of the one-dimensional Gaussian distribution indicates that for any given positive $x$:
$$e^{x^2/2}\int_x^\infty e^{-t^2/2}\,dt \le e^{x^2/2}\int_x^\infty \frac{t}{x}\,e^{-t^2/2}\,dt = \frac{1}{x}.$$
Thus, for $TH$ i.i.d. $\mathbb{R}^{d_s}$-valued random vectors $\epsilon_h^t \sim N(0,\sigma^2 I)$ and a fixed $\delta\in(0,1)$, there exists an event $\mathcal{B}$ with $\mathbb{P}(\mathcal{B}) \ge 1-\delta$ such that $\|\epsilon_h^t\|_\infty \le O\big(\sigma\sqrt{\log(THd_s/\delta)}\big)$ holds on the event $\mathcal{B}$.

We first provide the application of OPERA to the KNR model; the algorithm is written in Algorithm 4. Note that by similar algebra as in Eq. (E.2), the confidence set (E.15) is equivalent to $\|(U_{h,f} - \widehat{U}_{h,f})(\Sigma_h^{t-1})^{1/2}\|^2 \le \beta$, where $\Sigma_h^{t-1} := \Phi_h^{t-1}(\Phi_h^{t-1})^\top$ and $\widehat{U}_{h,f}$ is the optimal solution to the least squares problem $\arg\min_U \sum_{i=1}^{t-1}\|U\phi(s_h^i,a_h^i) - s_{h+1}^i\|^2$. The OPERA algorithm reduces to the LC$^3$ algorithm in Kakade et al. (2020), except that LC$^3$ is under a homogeneous setting. The only difference between Algorithm 4 and LC$^3$ is that in Eq. (E.15), LC$^3$ sums over $t$ and $H$, whereas we can only sum over $t$ because of the inhomogeneous setting. The dependence on $H$ is due to the reduction from the time-inhomogeneous setting to the time-homogeneous setting. Thus, our regret bound matches the state-of-the-art result on KNR instances (Kakade et al., 2020) regarding the dependencies on $d_\phi$, $d_s$, $H$. However, $d_\phi^2 d_s$ in our result is slightly looser than $d_\phi(d_s + d_\phi)$ in Kakade et al. (2020) and can possibly be improved by an instance-specific analysis of KNR.

Algorithm 4 OPERA (kernelized nonlinear regulator)
1: Initialize: $\mathcal{D}_h = \emptyset$ for $h = 1, \ldots, H$
2: for iteration $t = 1, 2, \ldots, T$ do
3: Set $\pi^t := \pi_{f^t}$ where $f^t$ is taken as $\arg\max_{f\in\mathcal{F}} Q_{1,f}(s_1, \pi_f(s_1))$ subject to
$$\sum_{i=1}^{t-1}\|U_{h,f}\phi(s_h^i,a_h^i) - s_{h+1}^i\|^2 - \inf_{g_h\in\mathcal{G}_h}\sum_{i=1}^{t-1}\|U_{h,g}\phi(s_h^i,a_h^i) - s_{h+1}^i\|^2 \le \beta \quad \text{for all } h\in[H] \tag{E.15}$$
4: For any $h \in [H]$, collect tuple $(r_h, s_h, a_h, s_{h+1})$ by executing $s_h, a_h \sim \pi^t$
5: Augment $\mathcal{D}_h = \mathcal{D}_h \cup \{(r_h, s_h, a_h, s_{h+1})\}$
6: end for
7: Output: $\pi_{\mathrm{out}}$ uniformly sampled from $\{\pi^t\}_{t=1}^T$

Similar to the linear mixture MDP case, if we assume that there exists a planning oracle that solves the optimization problem in Line 3 of Algorithm 4 in time $B$, the rest of the algorithm can be carried out efficiently in $O(d_\phi(d_\phi + d_s)HT)$ time. So the total computational complexity of Algorithm 4 is $O(BT + d_\phi(d_\phi + d_s)HT)$. In addition, we would like to remark that we can adapt algorithms from the optimal control literature such as MPPI (Williams et al., 2015) and DMDMPC (Wagener et al., 2019) to solve the optimization problem in Line 3 of Algorithm 4. This approach has been used in Kakade et al. (2020), where they designed the LC$^3$ algorithm for solving KNRs. In particular, they leveraged the MPPI algorithm for the planning oracle and provided rich empirical results.

Proof of Proposition 10. In the KNR model, we choose the hypothesis class $\mathcal{F}_h = \mathcal{G}_h = \{U \in (\mathcal{H}\to\mathbb{R}^{d_s}) : \|U\| \le R\}$ and the DEF $\ell_h(o_h, f_{h+1}, g_h, v) = U_{h,g}\phi(s_h,a_h) - s_{h+1}$.

(a) Decomposability. Taking expectation over $s_{h+1}$, we obtain $\mathbb{E}_{s_{h+1}}[\ell_h(o_h, f_{h+1}, g_h, v) \mid s_h, a_h] = (U_{h,g} - U_h^*)\phi(s_h,a_h)$. Thus, we have
$$\ell_h(o_h, f_{h+1}, g_h, v) - \mathbb{E}_{s_{h+1}}[\ell_h(o_h, f_{h+1}, g_h, v) \mid s_h, a_h] = U_h^*\phi(s_h,a_h) - s_{h+1} = \ell_h(o_h, f_{h+1}, f_h^*, v).$$

(b) Global Discriminator Optimality holds automatically since $\ell$ is independent of $v$.

(c) Dominating Average EF. We have the following for the KNR model:
$$\mathbb{E}_{s_h,a_h\sim\pi_g}\big\|\mathbb{E}[\ell_h(o_h, f_{h+1}, f_h, v) \mid s_h, a_h]\big\|^2 = \mathbb{E}_{s_h,a_h\sim\pi_g}\big\|(U_{h,f} - U_h^*)\phi(s_h,a_h)\big\|^2. \tag{E.16}$$

(d) Bellman Dominance. On the other hand, we know that
$$\mathbb{E}_{s_h,a_h\sim\pi_f}\big[Q_{h,f}(s_h,a_h) - r_h - V_{h+1,f}(s_{h+1})\big] \le \frac{2H}{\sigma}\,\mathbb{E}_{s_h,a_h\sim\pi_f}\big\|(U_{h,f} - U_h^*)\phi(s_h,a_h)\big\|_2.$$
(E.17)

(e) Low FE Dimension. We see from Eqs. (E.16) and (E.17) that we can choose the ABC function of the KNR model as
$$G_{h,f^*}(f,g) := \mathbb{E}_{s_h,a_h\sim\pi_g}\big\|(U_{h,f} - U_h^*)\phi(s_h,a_h)\big\|_2, \tag{E.18}$$
and KNR has an ABC with $\kappa = \frac{\sigma}{2H}$. The next Lemma 30 proves that the FE dimension of $\mathcal{F}$ with respect to the coupling function $G_{h,f^*}(f,g)$ can be controlled by $d_\phi$:

Lemma 30. The KNR model has FE dimension $\le O(d_\phi)$ with respect to the ABC defined in (E.18).

We prove Lemma 30 in §G. Thus, we conclude our proof of Proposition 10.

E.4 PROOF OF COROLLARY 12

In this subsection, we provide a sample complexity guarantee for models with low Witness rank. In the main text in §4.3, we presented Corollary 12 for $\mathcal{M}$ and $\mathcal{V}$ with finite cardinality for convenience of comparison with previous works. Here, we prove a general result for a model class $\mathcal{M}$ and a discriminator class $\mathcal{V}$ with finite $\rho$-covering.

Proof of Corollary 12. We start the proof by showing that $V_1^*(s_1) - V_1^{\pi^t}(s_1)$ can be upper bounded by a sum of Bellman errors, which is a simple deduction from the policy loss decomposition lemma in Jiang et al. (2017) and is the same as the equality in Eq. (C.2) in the proof of Theorem 11 in §C. Next, we verify that $f^*$ satisfies the constraint (E.7), so that taking $f^t = \arg\max_f V_{1,f}(s_1)$ over the confidence region yields $V_1^*(s_1) \le V_{1,f^t}(s_1)$.

Lemma 31 (Feasibility of $f^*$). In Algorithm 3, given $\rho > 0$ and $\delta > 0$, we choose $\beta = c\big(\log(TH|\mathcal{M}_\rho||\mathcal{V}_\rho|/\delta) + T\rho\big)$ for some large enough constant $c$. Then with probability at least $1-\delta$, $f^*$ satisfies for any $t\in[T]$:
$$\max_{v\in\mathcal{V}}\Big\{\sum_{i=1}^{t-1}\Big(\mathbb{E}_{\tilde s\sim f_h^*} v(s_h^i,a_h^i,\tilde s) - v(s_h^i,a_h^i,s_{h+1}^i)\Big)^2 - \inf_{g_h\in\mathcal{G}_h}\sum_{i=1}^{t-1}\Big(\mathbb{E}_{\tilde s\sim g_h} v(s_h^i,a_h^i,\tilde s) - v(s_h^i,a_h^i,s_{h+1}^i)\Big)^2\Big\} \le \beta.$$

We prove Lemma 31 in §F.5. The next Lemma 32 is devoted to controlling the average squared DEF.

Lemma 32. 
In Algorithm 3, given ρ > 0 and δ > 0, we choose β = c(log (T H|M ρ ||V ρ |/δ) + T ρ) for some large enough constant c, then with probability at least 1 -δ, for all (t, h) ∈ [T ] × [H], we have t-1 i=1 max v∈V E s h ∼πi,a h ∼π f E s∼f h v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 ≤ O(|A|β). Proof is delayed to §F.4. By Lemma 32 and properties of the witness rank in Definition 28, we have t-1 i=1 ⟨W h (f ), X h (f i )⟩ 2 ≤ t-1 i=1 max v∈V E s h ∼πi,a h ∼π f E s∼f h v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 ≤ t-1 i=1 max v∈V E s h ∼πi,a h ∼π f E s∼f h v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 ≤ O (|A|β) . Applying Lemma 24 with G h,f * (f, g) := ⟨W h (f ), X h (g)⟩ and g i = f i , f t = f t , we have t i=1 |⟨W h (f ), X h (f i )⟩| ≤ O |A| dim FE (F, G h,f * , ω)βt + tω . Policy loss decomposition (C.2) yields 1 T T t=1 V * 1 (s 1 ) -V π t 1 (s 1 ) ≤ O H κ |A| dim FE (F, G h,f * , ω) log (T H|M ρ ||V ρ |/δ) T + ρ + Hω κ . Taking ω = κϵ H and ρ = ϵ 2 dimFE(F ,G, ϵ H )H 2 , the above Eq. (C.4) becomes 1 T T t=1 V * 1 (s 1 ) -V π t 1 (s 1 ) ≤ O H κ |A| dim FE (F, G, ϵ H ) log (T H|M ρ ||V ρ |/δ) T + ϵ . Taking T = |A| dim FE (F, G, ϵ H ) log (T H|M ρ ||V ρ |/δ) H 2 κ 2 ϵ 2 h+1 yields the desired result. We can directly apply Theorem 11 to the KNR model based on Proposition 10 to obtain the regret bound result. For better understanding of our framework, we illustrate the main features in the proof of Corollary 13 that are different from the proof of Theorem 11. Proof of Corollary 13. To resolve the unboundedness issue, we unfold the analysis of KNR case and conclude a high-probability event B analogous to the argument in §E.3. However, doing so would impose an additional √ d s factor induced by estimating the ℓ 2 -norm of multivariate Gaussians. In lieu to this, we present a sharper convergence analysis that incorporates KNR instance-specific structures. 
We recall the DEF of the KNR model: ℓ h (o h , f h+1 , g h , v) = U h,g ϕ(s h , a h ) -s h+1 . We first define an auxilliary random variable X t (h, f, v) := ℓ h (o t h , f h+1 , g h , v) 2 -ℓ h (o t h , f h+1 , T (f ) h , v) 2 = U h,f ϕ(s t h , a t h ) -s t h+1 2 2 -U * h ϕ(s t h , a t h ) -s t h+1 2 2 = (U h,f -U * h )ϕ(s t h , a t h ), (U h,f -U * h )ϕ(s t h , a t h ) -2ϵ t h+1 = ||(U h,f -U * h )ϕ(s t h , a t h )|| 2 -2 (U h,f -U * h )ϕ(s t h , a t h ), ϵ t h+1 . By the boundedness of operator U h,f , U * h and uniform boundedness of ϕ(s, a), we obtain that ||(U h,f -U * h )ϕ(s t h , a t h )|| 2 ≤ 4B 2 U B 2 . The conditional distribution of (U h,f -U * h )ϕ(s t h , a t h ), ϵ t h+1 is a zero-mean Gaussian with variance σ 2 ||(U h,f -U * h )ϕ(s t h , a t h )|| 2 ≤ 4B 2 U B 2 σ 2 . By the tail bound of Gaussian distributions along with standard union bound, we know that with probability at least 1 -δ, (U h,f -U * h )ϕ(s t h , a t h ), ϵ t h+1 ≤ O σ log (T H/δ) holds uniformly for all t ∈ [T ] and h ∈ [H]. Thus, we bound the absolute value of the auxillary variable X t by |X t | ≤ Rσ where R is positive and of order O log(T H/δ) . Taking expectation with respect to s h+1 , we have E s h+1 [X t (h, f, v) | s h , a h ] = (U h,f -U * h )ϕ(s t h , a t h ) 2 . On the other hand, E s h+1 (X t (h, f, v)) 2 | s h , a h = E s h+1 ||(U h,f -U * h )ϕ(s t h , a t h )|| 2 -2 (U h,f -U * h )ϕ(s t h , a t h ), ϵ t h+1 2 | s h , a h = E s h+1 ||(U h,f -U * h )ϕ(s t h , a t h )|| 4 + 4 (U h,f -U * h )ϕ(s t h , a t h ), ϵ t h+1 2 | s h , a h = E s h+1 ||(U h,f -U * h )ϕ(s t h , a t h )|| 4 + 4||(U h,f -U * h )ϕ(s t h , a t h )|| 2 σ 2 | s h , a h ≤ O σ 2 R 2 E [X t (h, f, v) | s h , a h ] . 
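The Gaussian tail bound combined with a union bound, used above to control the cross term uniformly over $t \in [T]$ and $h \in [H]$, can be checked numerically. The sketch below is illustrative only: the function names and the explicit constant $\sqrt{2\log(2N/\delta)}$ are assumptions of this example, not quantities taken from the paper, which only states the bound up to $O(\cdot)$.

```python
import math


def gaussian_upper_tail(x: float) -> float:
    """Exact P(Z > x) for Z ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_bound(x: float) -> float:
    """Upper bound exp(-x^2 / 2) / (x * sqrt(2 * pi)) on P(Z > x), for x > 0."""
    return math.exp(-0.5 * x * x) / (x * math.sqrt(2.0 * math.pi))


def union_bound_threshold(num_vars: int, delta: float, sigma: float = 1.0) -> float:
    """Threshold c with P(max_i |g_i| > c) <= delta for num_vars N(0, sigma^2) variables.

    Hypothetical helper: the constant is one valid choice, not the paper's.
    """
    return sigma * math.sqrt(2.0 * math.log(2.0 * num_vars / delta))


if __name__ == "__main__":
    T, H, delta, sigma = 1000, 10, 0.05, 0.5
    c = union_bound_threshold(T * H, delta, sigma)
    # Union bound over T * H two-sided tail events.
    failure = 2 * T * H * gaussian_upper_tail(c / sigma)
    print(f"threshold = {c:.3f}, union-bound failure probability = {failure:.2e}")
```

The threshold grows like $\sigma\sqrt{\log(TH/\delta)}$, which is the order used in the proof; the union bound requires no independence between the variables.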
By taking Z t = X t (h, f, v) -E s h+1 [X t (h, f, v) | s h , a h ] with |Z t | ≤ 2Rσ in Freedman's inequality (F.1) in Lemma 33, we have for any η satisfying 0 < η < 1 2R 2 σ 2 almost surely, with probability at least 1 -δ: t i=1 Z i ≤ O R 2 σ 2 η t i=1 E s h+1 [X i (h, f, v) | s h , a h ] + log(δ -1 ) η . Optimizing over η, we have t i=1 Z i ≤ O   Rσ t i=1 E s h+1 [X i (h, f, v) | s h , a h ] log(δ -1 ) + R 2 σ 2 log(δ -1 )   . (E.19) Following the same Freedman's inequality (Lemma 33) and ρ-covering argument as as in the proof of Theorem 11 with derivations detailed in §F.1, we have with probability ≥ 1 -δ and β = O σ 2 log(T HN L (ρ)/δ) + σρT : t i=1 E s h ,a h ∼π i ||(U h,f t -U * h )ϕ(s h , a h )|| 2 ≤ t i=1 E s h ,a h ∼π i ||(U h,f t -U * h )ϕ(s h , a h )|| 2 ≤ O(β). Feasibility of f * can be derived by taking the same auxilliary random variable and analyze on t i=1 X i (h, f, v) as in the proof of Lemma 31. As explained in §E.3 , we can apply Lemma 24 with ω = 1 T , ρ = 1 T , G h,f * (f, g) = E s h ,a h ∼πg ||(U h,f -U * h )ϕ(s h , a h )|| 2 , and have t i=1 E s h ,a h ∼π i ||(U h,f t -U * h )ϕ(s h , a h )|| 2 ≤ σ dim FE F, G, 1/T log (T HN L (1/T )) • t. The rest of the proof follows by applying Bellman dominance, policy loss decomposition and calculating the FE dimension based on G h,f * (f, g), which is shown in Lemma 30. We therefore obtain that T t=1 V * 1 (s 1 ) -V π t 1 (s 1 ) ≤ 1 κ T t=1 H h=1 |G h,f * (f t , f t )| ≤ O H κ σ T • dim FE (F, G, 1/T ) log (T HN L (1/T )/δ) = O H 2 d 2 ϕ d s T . F P R O O F O F T E C H N I C A L L E M M A S We start with introducing the Freedman's inequality that are crucial in proving concentration properties in our main results. Lemma 33 (Freedman-Style Inequality, Agarwal et al. 2014) . Consider an adapted sequence {Z t , J t } t=1,2,...,T that satisfies E [Z t | J t-1 ] = 0 and Z t ≤ R for any t = 1, 2, . . . T . 
Then for any δ > 0 and η ∈ [0, 1 R ], it holds with probability at least 1 -δ that T t=1 Z t ≤ (e -2)η T t=1 E Z 2 t | J t-1 + log(δ -1 ) η . (F.1) Before proving our technical lemmas, we note that for notational simplicity we use the expectation E s h+1 [• | s h , a h ] to denote the conditional expectation with respect to the transition probability of the true model at h. The value of s h , a h is data dependent (might be s i h , a i h or s t h , a t h depending on the function inside the expectation). F. 1 P R O O F O F L E M M A 2 3 Proof of Lemma 23. We recall that ℓ has a bounded ℓ 2 -norm in Definition 6 and assume that  ||ℓ h,f ′ (•, f h+1 , g h , v)|| ≤ R for ∀h ∈ [H], f ′ , f ∈ F, g ∈ G, v ∈ V (t, h, f, v) ∈ [T ] × [H] × F × V and consider X i (h, f, v) := ||ℓ h,f i (o i h , f h+1 , f h , v)|| 2 -||ℓ h,f i (o i h , f h+1 , T (f ) h , v)|| 2 , where the randomness is due to uniformly sampling the data sequence D h . We know that |X t (h, f )| ≤ R 2 . Take conditional expectation of X i with respect to s h , a h , we have by definition that E s h+1 [X i (h, f, v) | s h , a h ] = E s h+1 ||ℓ h,f i (o i h , f h+1 , f h , v)|| 2 -||ℓ h,f i (o i h , f h+1 , T (f ) h , v)|| 2 | s h , a h Using the fact that ∥a∥ 2 -∥b∥ 2 = ⟨a -b, a + b⟩ for arbitrary vectors a, b and property (i) in Definition 6 we have E s h+1 [X i (h, f, v) | s h , a h ] = ℓ h,f i (o i h , f h+1 , f h , v) -ℓ h,f ′ (o i h , f h+1 , T (f ) h , v), E s h+1 ℓ h,f i (o i h , f h+1 , f h , v) + ℓ h,f ′ (o i h , f h+1 , T (f ) h , v) | s h , a h = ||E s h+1 ℓ h,f i (o i h , f h+1 , f h , v) | s h , a h || 2 . On the other hand, E s h+1 (X i (h, f, v)) 2 | s h , a h ≤ E s h+1 ||ℓ h,f i (o i h , f h+1 , f h , v) -ℓ h,f i (o i h , f h+1 , T (f ) h , v)|| 2 •||ℓ h,f ′ (o i h , f h+1 , f h , v) + ℓ h,f ′ (o i h , f h+1 , T (f ) h , v)|| 2 | s h , a h ≤ 4||E s h+1 ℓ h,f i (o i h , f h+1 , f h , v) | s h , a h || 2 R 2 ≤ 4R 2 E s h+1 [X i (h, f, v) | s h , a h ] . 
By taking Z t = X t (h, f, v) -E s h+1 [X t (h, f, v) | s h , a h ] with |Z t | ≤ 2R 2 in Freedman's inequality (F.1) in Lemma 33, we have for any η satisfying 0 < η < 1 2R 2 , with probability at least 1 -δ: t i=1 Z i ≤ O η t i=1 Var [X i (h, f, v) | s h , a h ] + log(δ -1 ) η ≤ O η t i=1 E s h+1 X 2 i (h, f, v) | s h , a h + log(δ -1 ) η ≤ O 4R 2 η t i=1 E s h+1 [X i (h, f, v) | s h , a h ] + log(δ -1 ) η . Taking η = √ log(δ -1 ) 2R √ t i=1 E[Xi(h,f,v)|s h ,a h ] ∨ 1 2R 2 , we have t i=1 Z i ≤ O   2R t i=1 E s h+1 [X i (h, f, v) | s h , a h ] log(δ -1 ) + 2R 2 log(δ -1 )   . (F.2) Similarly by applying Freedman's inequality to t i=1 -Z t and combining with Eq. (F.2), we have that for any three-tuple (t, h, f ), the following holds with probability at least 1 -2δ: t i=1 Z i ≤ O   2R t i=1 E s h+1 [X i (h, f, v) | s h , a h ] log(δ -1 ) + 2R 2 log(δ -1 )   . We note that in §3 we have that L admits a ρ-covering of F, G, V, meaning that for any ℓ h,f ′ (•, f, g, v) and a ρ > 0 there exists a ρ and a four-tuple ( f ′ , f , g, v) ∈ F ρ × F ρ × G ρ × V ρ such that ℓ h, f ′ (•, f , g, v) -ℓ h,f ′ (•, f, g, v) ∞ ≤ ρ, where F ρ , G ρ , V ρ are ρ-covers of F, G, V respectively. This is denoted by ( f ′ , f , g, v) ∈ L ρ . In definition of X t , g is always taken as f or a function of T ( f ). Then if T is Lipschitz, as it is mostly the expectation operator, we omit the g in the tuple and use ( f ′ , f , v) ∈ L ρ to denote an element in the ρ-covering. By taking a union bound over L ρ , we have with probability at least 1 -2δ that the following holds for any ( f i , f , v) ∈ L ρ , t i=1 X i (h, f , v) - t i=1 E s h+1 X i (h, f , v) | s h , a h ≤ O   2R t i=1 E s h+1 X i (h, f , v) | s h , a h ι + 2R 2 ι   , (F.3) where X i (h, f , v) := ||ℓ h, f i (o i h , f h+1 , f h , v)|| 2 -||ℓ h, f i (o i h , f h+1 , T ( f ) h , v)|| 2 and ι = log HT N L (ρ) δ . 
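The choice of $\eta$ that turns the Freedman bound into the two-term bound (F.2) can be verified with a few lines of arithmetic. The sketch below is an illustration under stated assumptions: it writes $S$ for $\sum_i \mathbb{E}[X_i \mid s_h, a_h]$ and $L$ for $\log(\delta^{-1})$, and it takes the smaller of the unconstrained minimiser and the cap $1/(2R^2)$, since the constraint on $\eta$ requires a minimum. All function names are hypothetical.

```python
import math


def freedman_rhs(eta: float, s: float, log_term: float, r: float) -> float:
    """Value of 4 R^2 * eta * S + L / eta, the bound before optimising eta."""
    return 4.0 * r * r * eta * s + log_term / eta


def capped_eta(s: float, log_term: float, r: float) -> float:
    """Unconstrained minimiser sqrt(L / S) / (2 R), capped at 1 / (2 R^2)."""
    cap = 1.0 / (2.0 * r * r)
    if s <= 0.0:
        return cap
    return min(math.sqrt(log_term / s) / (2.0 * r), cap)


def closed_form_bound(s: float, log_term: float, r: float) -> float:
    """Two-term bound 4 R sqrt(S L) + 4 R^2 L, of the same shape as (F.2)."""
    return 4.0 * r * math.sqrt(s * log_term) + 4.0 * r * r * log_term


if __name__ == "__main__":
    r, log_term = 2.0, math.log(1.0 / 0.05)
    for s in [0.0, 0.1, 10.0, 1e4]:
        eta = capped_eta(s, log_term, r)
        print(s, freedman_rhs(eta, s, log_term, r), closed_form_bound(s, log_term, r))
```

When the cap is inactive the right-hand side equals $4R\sqrt{SL}$; when it is active, $S \le R^2 L$ and the right-hand side is $2S + 2R^2L \le 4R^2L$, so both cases are covered by the closed form.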
Further for any X i (h, f t , v), we choose the three-tuple ( f i , f t , v) := arg min ( f i , f t , v)∈Lρ X i (h, f t , v) -X i (h, f t , v) ≤ ρ and by the ρ-covering argument, we arrive at t-1 i=1 X i (h, f t , v) = t-1 i=1 ||ℓ h, f i (o i h , f t h+1 , f t h , v)|| 2 -||ℓ h, f i (o i h , f t h+1 , T ( f ) t h , v)|| 2 ≤ t-1 i=1 ||ℓ h,f i (o i h , f t h+1 , f t h , v)|| 2 -||ℓ h,f i (o i h , f t h+1 , T (f t ) h , v)|| 2 + O(Rtρ) (i) ≤ O(β + Rtρ), (F.4) where (i) comes from the constraint (4.1) of Algorithm 1. Combining (F.3) with (F.4), we derive the following t-1 i=1 E s h+1 X i (h, f t , v) | s h , a h ≤ O(β + Rtρ + R 2 ι). Applying the ρ-covering argument as in before, we conclude max v∈V t-1 i=1 E s h+1 X i (h, f t , v) | s h , a h ≤ O(β + Rtρ + R 2 ι). Global optimality of the discriminator in (ii) of Definition 6 implies that v * h is the optimal discriminator under any distribution or summation of s h , a h (and thus max is interchangeable with summation): t-1 i=1 E s h ,a h ∼π i ||E s h+1 ℓ h,f i (o h , f t h+1 , f t h , v * h (f t )) | s h , a h || 2 ≥ t-1 i=1 E s h ,a h ∼π i ||E s h+1 ℓ h,f i (o h , f t h+1 , f t h , v) | s h , a h || 2 , ∀v ∈ V. Thus, we have t-1 i=1 max v∈V E s h ,a h ∼π i ||E s h+1 ℓ h,f i (o h , f t h+1 , f t h , v) | s h , a h || 2 = t-1 i=1 E s h ,a h ∼π i ||E s h+1 ℓ h,f i (o h , f t h+1 , f t h , v * h (f t )) | s h , a h || 2 , and also t-1 i=1 E s h ,a h ∼π i ||E s h+1 ℓ h,f i (o h , f t h+1 , f t h , v * h (f t )) | s h , a h || 2 = max v∈V t-1 i=1 E s h ,a h ∼π i ||E s h+1 ℓ h,f i (o h , f t h+1 , f t h , v) | s h , a h || 2 = max v∈V t-1 i=1 E s h ,a h ∼π i E s h+1 X i (h, f t , v) | s h , a h ≤ O(β + Rtρ + R 2 ι). (F.5) We apply property (i) in Definition 5 and conclude that t-1 i=1 G h,f * (f t , f i ) 2 ≤ O(β), which finishes the proof of Lemma 23. F. 2 P R O O F O F L E M M A 2 1 Proof of Lemma 21. 
For a data set D h = {r t h , s t h , a t h , s t h+1 } t=1,2,...T , we first build an auxillary random variable defined for every (t, h, f, v) ∈ [T ] × [H] × F × V X i (h, f, v) := ||ℓ h,f i (o i h , f * h , f h , v)|| 2 -||ℓ h,f i (o i h , f * h , f * h , v)|| 2 . By similar derivations as in the proof of Lemma 23, we have E s h+1 [X i (h, f, v) | s h , a h ] = E s h+1 ℓ h,f i (o i h , f * h , f h , v) | s h , a h 2 , E s h+1 (X i (h, f, v)) 2 | s h , a h ≤ 4R 2 E s h+1 [X i (h, f, v) | s h , a h ] . Take Z t = X t (h, f, v) -E s h+1 [X t (h, f, v) | s h , a h ] with |Z t | ≤ 2R 2 in Freedman's inequality (F.1) in Lemma 33. Then via the same procedure as in the proof of Lemma 23 we have that for any four-tuple (t, h, f, v), the following holds with probability at least 1 -2δ: t i=1 X i (h, f, v) - t i=1 E s h+1 [X i (h, f, v) | s h , a h ] ≤ O   2R t i=1 E s h+1 [X i (h, f, v) | s h , a h ] log(δ -1 ) + 2R 2 log(δ -1 )   . Thus, we have - t i=1 X i (h, f, v) ≤ O(R 2 log(δ -1 )). By the same ρ-covering argument as in the proof of Lemma 23, there exists a ρ-covering of L such that we can take a union bound over L ρ and have - t-1 i=1 X i (h, f , v) ≤ O R 2 ι + Rtρ where ι = log HT N L (ρ) δ . Then for f * , any f ∈ F and any v ∈ V, we can use the nearest three-tuple ( f i , f , v) in the ρ-covering and conclude that max v∈V t-1 i=1 ||ℓ h,f i (o i h , f * h , f * h , v)|| 2 -||ℓ h,f i (o i h , f * h , f h , v)|| 2 = max v∈V t-1 i=1 -X i (h, f, v) ≤ O (β) . This in sum finishes our proof of Lemma 21 with β = O R 2 ι + Rρt . F. 3 P R O O F O F L E M M A 2 4 Proof of Lemma 24. The proof basically follows Appendix §C of Russo & Van Roy (2013) and Appendix §D of Jin et al. (2021) . We first prove that for all t ∈ [T ], Therefore, we have proved that there exists j such that |G(f sj , g sj )| > ϵ and f sj is ϵ-independent with at least L = ⌈(m -1)/ dim F E (F, G, ϵ)⌉ disjoint sequences in {f s1 , . . . , f sj-1 }. For each of the sequences { f 1 , . . . 
, f l }, by definition of the FE dimension in Definition 3 we have that l k=1 G( f k , g sj ) 2 ≥ ϵ 2 . (F.7) Summing all of bounds (F.7) for L disjoint sequences together we have that sj -1 k=1 G(f t , g sj ) 2 ≥ Lϵ 2 = ⌈(m -1)/ dim F E (F, G, ϵ)⌉ • ϵ 2 . (F.8) The left hand side of (F.8) can be upper bounded by β 2 due to the condition of lemma. Therefore, we have proved that  β 2 ≥ ⌈(m -1)/ dim F E (F, G, ϵ)⌉ • ϵ 2 which k ≤ t i=1 1(e i > ω) ≤ (β/α 2 + 1) dim F E (F, G, α) ≤ (β/α 2 + 1)d, which implies that α ≤ dβ/(k -d). Taking the limit α → e - k , we have that e k ≤ min{ dβ/(k -d), C}. Finally, we have that t k=1 e i 1(e k > ω) ≤ min{d, t} • C + t i=d+1 dβ k -d ≤ min{d, t} • C + dβ t 0 z -1/2 dz ≤ min{d, t} • C + 2 dβt. (F.10) Plugging (F.10) into (F.9) completes the proof. F. 4 P R O O F O F L E M M A 3 2 Proof of Lemma 32. We assume that ∥v∥ ∞ ≤ B and treat B as an absolute constant (B = 2 in Sun et al. ( 2019)) in the following derivations. For a dataset D h = {r t h , s t h , a t h , s t h+1 } t=1,2,...T , we first build an auxillary random variable defined for every (t, h, f, v) ∈ [T ] × [H] × F × V X t (h, f, v) := E s∼f v(s t h , a t h , s) -v(s t h , a t h , s t h+1 ) 2 -E s∼f * v(s t h , a t h , s) -v(s t h , a t h , s t h+1 ) 2 , where the randomness lies in the sampling of the dataset D h . We know that |X t (h, f )| ≤ 4B 2 almost surely. Take conditional expectation of X i with respect to s h , a h , we have by definition that E s h+1 [X i (h, f, v) | s h , a h ] = E s h+1 E s∼f v(s i h , a i h , s) -v(s i h , a i h , s i h+1 ) 2 -E s∼f * v(s i h , a i h , s) -v(s i h , a i h , s i h+1 ) 2 | s h , a h . 
Using the fact that a 2 -b 2 = (a -b)(a + b) and E s∼f v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) is nonrandom given s h , a h , we have E s h+1 [X i (h, f, v) | s h , a h ] = E s∼f v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) • E s h+1 E s∼f v(s i h , a i h , s) + E s∼f * v(s i h , a i h , s) -2v(s i h , a i h , s i h+1 ) | s h , a h = E s∼f v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 . On the other hand, E s h+1 X i (h, f, v) 2 | s h , a h ≤ E s h+1 E s∼f v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 4B 2 | s h , a h = 16B 2 E s∼f v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 ≤ 16B 2 E s h+1 [X i (h, f, v) | s h , a h ] . By taking Z t = X t (h, f, v) -E s h+1 [X t (h, f, v) | s h , a h ] with |Z t | ≤ 8B 2 a. s. in Freedman's inequality (F.1) in Lemma 33, by the same procedure as in the proof of Lemma 23, we have that for any four-tuple (t, h, f, v), the following holds with probability at least 1 -2δ: t i=1 X i (h, f, v) - t i=1 E s h+1 [X i (h, f, v) | s h , a h ] ≤ O   4B t i=1 E s h+1 [X i (h, f, v) | s h , a h ] log(δ -1 ) + 8B 2 log(δ -1 )   . (F.11) Let M ρ be a ρ-cover of M and V ρ a ρ-cover of V. By taking a union bound over all (t, h, f ′ , v') ∈ [T ] × [H] × M ρ × V ρ , we have with probability at least 1 -2δ that the following holds for any f ′ ∈ M ρ , v ′ ∈ V ρ , t i=1 X i (h, f ′ , v ′ ) - t i=1 E s h+1 [X i (h, f ′ , v ′ ) | s h , a h ] ≤ O   4B t i=1 E s h+1 [X i (h, f ′ , v ′ ) | s h , a h ] ι + 8B 2 ι   , (F.12) where ι = log(
HT |M_ρ| |V_ρ| / δ
). Further for any f t calculated at t ∈ [T ] and any v ∈ V, we choose f ′ = arg min f ∈Mρ dist( f , f t ) where dist is the distance measure on M, v ′ = min v ′ ∈Vρ (v ′ , v) and conclude t-1 i=1 X i (h, f ′ , v ′ ) = t-1 i=1 E s∼f ′ v ′ (s i h , a i h , s) -v ′ (s i h , a i h , s i h+1 ) 2 -E s∼f * v ′ (s i h , a i h , s) -v ′ (s i h , a i h , s i h+1 ) 2 ≤ t-1 i=1 E s∼f t v ′ (s i h , a i h , s) -v ′ (s i h , a i h , s i h+1 ) 2 -E s∼f * v ′ (s i h , a i h , s) -v ′ (s i h , a i h , s i h+1 ) 2 + O(Btρ) ≤ O(β + Btρ), (F.13) where (i) is due to the constraint of Algorithm 3. Combining (F.12) with (F.13), we derive the following t-1 i=1 E s h+1 [X i (h, f ′ , v ′ ) | s h , a h ] ≤ O(β + Btρ + B 2 ι). Note that f ′ is chosen as the nearest model to f t in the ρ-covering of M and for any v there exists a nearest v ′ in the ρ-covering of V, we conclude max v∈V t-1 i=1 E s h+1 X i (h, f t , v) | s h , a h ≤ O(β + Btρ + B 2 ι). Note we also have proved property (ii) in Definition 6 in §E.2, and we apply the global optimality of the discriminator as in the proof of Lemma 23 and obtains t-1 i=1 max v∈V E s h+1 X i (h, f t , v) | s h , a h ≤ O(β + Btρ + B 2 ι). Multiplying E s∼f v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 by 1(a i h =π f (s i h )) 1/|A| , taking expectation on s i h ∼ π i , a i h ∼ π f and again using the global discriminator optimality, we arrive at t-1 i=1 max v∈V E s i h ∼πi,a i h ∼π f E s∼f h v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 = t-1 i=1 max v∈V E s i h ∼π i ,a i h ∼U (A) 1(a i h = π f (s i h )) 1/|A| E s∼f h v(s i h , a i h , s) -E s∼f * v(s i h , a i h , s) 2 ≤ O(|A| β + Btρ + B 2 ι ), which concludes the proof. F. 5 P R O O F O F L E M M A 3 1 Proof of Lemma 31. 
For a dataset D h = {r t h , s t h , a t h , s t h+1 } t=1,2,...T , we first build an auxillary random variable defined for every (t, h, f, v) ∈ [T ] × [H] × F × V X t (h, f, v) := E s∼f v(s t h , a t h , s) -v(s t h , a t h , s t h+1 ) 2 -E s∼f * v(s t h , a t h , s) -v(s t h , a t h , s t h+1 ) 2 . By Eq. (F.11), with probability at least 1 -2δ, t i=1 X i (h, f, v) - t i=1 E s h+1 [X i (h, f, v) | s h , a h ] ≤ O   4B t i=1 E s h+1 [X i (h, f, v) | s h , a h ] log(δ -1 ) + 8B 2 log(δ -1 )   . Let M ρ be a ρ-cover of M and V ρ a ρ-cover of V. By taking a union bound over all (t, h, f ′ , v') ∈ [T ] × [H] × M ρ × V ρ , we have with probability at least 1 -2δ that the following holds for any f ′ ∈ Z ρ , t i=1 X i (h, f ′ , v ′ ) - t i=1 E s h+1 [X i (h, f ′ , v ′ ) | s h , a h ] ≤ O   4B t i=1 E s h+1 [X i (h, f ′ , v ′ ) | s h , a h ] ι + 8B 2 ι   , where ι = log HT |Mρ||Vρ| δ . Thus, we have - t i=1 X i (h, f ′ , v ′ ) ≤ O B 2 ι . Further for any f ∈ F and any v ∈ V, we choose f ′ = arg min f ∈Mρ dist( f , f ) where dist is the distance measure on M, v ′ = min v ′ ∈Vρ (v ′ , v) and have - t-1 i=1 X i (h, f, v) = t-1 i=1 E s∼f * v(s t h , a t h , s) -v(s t h , a t h , s t h+1 ) 2 - t-1 i=1 E s∼f v(s t h , a t h , s) -v(s t h , a t h , s t h+1 2 ≤ O B 2 ι + Bρt . Thus, max v∈V t-1 i=1 E s∼f * v(s i h , a i h , s) -v(s i h , a i h , s i h+1 ) 2 -inf g∈Q t-1 i=1 E s∼g v(s i h , a i h , s) -v(s i h , a i h , s i h+1 ) 2 ≤ β, which concludes the proof. G P R O O F F O R F U N C T I O N A L E L U D E R D I M E N S I O N In the following proposition, we prove that the Bellman eluder (BE) dimension (Jin et al., 2021 ) is a special case of the FE dimension when G h (g, f ) := E π h,f (g h -T h g h+1 ). Proposition 34. For any hypothesis class F, taking coupling function G to be the union of {G h : F h × F h → R} h=1,...,H with each G h (g, f ) := E π h,f (g h -T h g h+1 ). dim FE (F, G, ϵ) ≤ dim BE (F, Π, ϵ). Proof of Proposition 34. 
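By definition of the functional eluder dimension, $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) = \max_{h \in [H]} \dim_{\mathrm{FE}}(\mathcal{F}, G_h, \epsilon)$, where $\dim_{\mathrm{FE}}(\mathcal{F}, G_h, \epsilon)$ is the length $n$ of the longest sequence satisfying for every $t \in [n]$,

$$\sqrt{\sum_{i=1}^{t-1} (G_h(g_t, f_i))^2} \le \epsilon' \quad \text{and} \quad |G_h(g_t, f_t)| > \epsilon'.$$

Bringing in $G_h(g, f) := \mathbb{E}_{\pi_{h,f}}(g_h - \mathcal{T}_h g_{h+1})$, we have that $f_1, \ldots, f_n$ is also the longest sequence that satisfies, for some $g_1, \ldots, g_n$,

$$\sqrt{\sum_{i=1}^{t-1} \Big(\mathbb{E}_{\pi_{h,f_i}}(g_{t,h} - \mathcal{T}_h g_{t,h+1})\Big)^2} \le \epsilon', \quad \text{and} \quad \Big|\mathbb{E}_{\pi_{h,f_t}}(g_{t,h} - \mathcal{T}_h g_{t,h+1})\Big| > \epsilon'.$$

Thus, $\dim_{\mathrm{DE}}((I - \mathcal{T}_h)\mathcal{F}, \Pi_h, \epsilon) \ge n$. Taking the maximum over $h \in [H]$, we have

$$\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) = \max_{h \in [H]} \dim_{\mathrm{FE}}(\mathcal{F}, G_h, \epsilon) \le \max_{h \in [H]} \dim_{\mathrm{DE}}((I - \mathcal{T}_h)\mathcal{F}, \Pi_h, \epsilon) = \dim_{\mathrm{BE}}(\mathcal{F}, \Pi, \epsilon),$$

which concludes our proof.

$\log\det\Big(I + \frac{1}{\epsilon^2} \sum_{i=1}^{n} x_i x_i^\top\Big)$. Remark 5.2 in Du et al. (2021) showed that for the finite dimensional setting with $\mathcal{X} \subseteq \mathbb{R}^d$ and $\|x\| \le B$, $d_{\mathrm{eff}}(\mathcal{X}, \epsilon) = O(d)$. Moreover, the effective dimension can be small even in the infinite dimensional RKHS case. In the next proposition, we prove that when the coupling function exhibits a bilinear structure $G(f, g) = \langle W(f), X(g) \rangle_{\mathcal{H}}$ with feature space $\mathcal{X} := \{X(g) \in \mathcal{H} : g \in \mathcal{F}\}$ and $\|X(g)\|_{\mathcal{H}} \le \sqrt{B}$, the functional eluder dimension in Definition 4 is always less than the effective dimension of $\mathcal{X}$.

Proposition 35. For any hypothesis class $\mathcal{F}$ and coupling function $G(\cdot, \cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$ that can be expressed in bilinear form $\langle W(f), X(g) \rangle_{\mathcal{H}}$, we have $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) \le d_{\mathrm{eff}}(\mathcal{X}, \epsilon/\sqrt{B})$.

Proof of Proposition 35. The proof basically follows the proof of Proposition 29 in Jin et al. (2021) with modifications specified for the functional eluder dimension. Given a hypothesis class $\mathcal{F}$ and a coupling function $G(\cdot, \cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$. Suppose there exists an $\epsilon'$-independent sequence $f_1, \ldots, f_n \in \mathcal{F}$ such that there exist $g_1, \ldots, g_n \in \mathcal{F}$,

$$\begin{cases} \sqrt{\sum_{i=1}^{t-1} (G(g_t, f_i))^2} \le \epsilon', & t \in [n], \\ |G(g_t, f_t)| > \epsilon', & t \in [n]. \end{cases} \tag{G.1}$$

When $G(f, g) := \langle W(f), X(g) \rangle_{\mathcal{H}}$, the above becomes

$$\begin{cases} \sqrt{\sum_{i=1}^{t-1} \langle W(g_t), X(f_i) \rangle_{\mathcal{H}}^2} \le \epsilon', & t \in [n], \\ |\langle W(g_t), X(f_t) \rangle_{\mathcal{H}}| > \epsilon', & t \in [n]. \end{cases} \tag{G.2}$$

Defining $\Sigma_t = \sum_{i=1}^{t-1} X(f_i) X(f_i)^\top + \frac{\epsilon'^2}{B} \cdot I$,

$\phi(s_h, a_h, s') V_{h+1,g}(s') = \langle W_h(f), X_h(g) \rangle$, where $W_h(f) := \theta_{h,f} - \theta_h^*$, $X_h(g) := \mathbb{E}_{s_h, a_h \sim \pi_g}\big[\psi(s_h, a_h) + \sum_{s'} \phi(s_h, a_h, s') V_{h+1,g}(s')\big]$ in Proposition 35. Properties of the effective dimension yield that the FE dimension of the linear mixture MDP model is $\le O(d)$.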

G . 2 P R O O F O F L E M M A 2 9

Proof of Lemma 29. Taking $G_{h,f^*}(f, g) := \langle W_h(f), X_h(g) \rangle$ in Proposition 35, properties of the effective dimension yield that the FE dimension of the low Witness rank MDP model is $\le O(W_\kappa)$.

G . 3 P R O O F O F L E M M A 3 0

We first introduce two auxiliary lemmas.

Lemma 36. Let random variable $x_i \in \mathbb{R}^d$ and $\mathbb{E}\|x_i\|_2^2 \le B^2$. Then we have that

$$\frac{1}{n} \log\det\Big(I + \frac{1}{\lambda} \sum_{t=0}^{n-1} \mathbb{E}[x_t x_t^\top]\Big) \le \frac{d \log\big(1 + \frac{nB^2}{d\lambda}\big)}{n}.$$

Proof. We first have

$$\operatorname{trace}\Big(I + \frac{1}{\lambda} \sum_{t=0}^{n-1} \mathbb{E}[x_t x_t^\top]\Big) = d + \frac{1}{\lambda} \sum_{t=0}^{n-1} \mathbb{E}[\|x_t\|_2^2] \le d + \frac{nB^2}{\lambda}.$$

Therefore, using the Determinant-Trace inequality, we get the first result,

$$\log\det\Big(I + \frac{1}{\lambda} \sum_{t=0}^{n-1} \mathbb{E}[x_t x_t^\top]\Big) \le d \log\frac{\operatorname{trace}\big(I + \frac{1}{\lambda} \sum_{t=0}^{n-1} \mathbb{E}[x_t x_t^\top]\big)}{d} \le d \log\Big(1 + \frac{nB^2}{d\lambda}\Big).$$

Dividing both sides of the inequality by $n$ completes the proof.

The following lemma is a variant of the well-known Elliptical Potential Lemma (Dani et al., 2008; Srinivas et al., 2009; Abbasi-Yadkori et al., 2011; Agarwal et al., 2020a).

Lemma 37 (Randomized elliptical potential). Consider a sequence of random vectors $\{x_0, \ldots, x_{T-1}\}$. Let $\lambda > 0$ and $\Sigma_0 = \lambda I$ and $\Sigma_t = \Sigma_0 +$

Then we have that $n = O(d_\phi)$. We will prove $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) \le n$ by contradiction. Suppose that $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) > n$. Then there exists an $\epsilon'$-independent (where $\epsilon' \ge \epsilon$) sequence $f_1, \ldots, f_n \in \mathcal{F}$ such that there exist $g_1, \ldots, g_n \in \mathcal{F}$,

$$\begin{cases} \sqrt{\sum_{i=1}^{t-1} (G(g_t, f_i))^2} \le \epsilon', & t \in [n], \\ |G(g_t, f_t)| > \epsilon', & t \in [n]. \end{cases} \tag{G.6}$$

Recall that the ABC function of the KNR model is defined as

$G_{h,f^*}(f, g) = \operatorname{vec}\big((U_{h,f} - U_h^*)^\top (U_{h,f} - U_h^*)\big),$

d ϕ ϵ ′2 n ≤ e -1 . This leads to a contradiction because $\epsilon' \ge \epsilon$ and $\log(3/2) > e^{-1}$. We complete the proof of $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) = O(d_\phi)$.
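The determinant-trace bound of Lemma 36 can be sanity-checked numerically. The following is a minimal sketch, not part of the original analysis: it treats fixed vectors as degenerate random variables (so that $\mathbb{E}[x_t x_t^\top] = x_t x_t^\top$), and the helper names `log_det_spd` and `lemma36_sides` are hypothetical.

```python
import math
import random


def log_det_spd(m):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    n = len(m)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    # det(M) = prod(diag(L))^2, hence log det(M) = 2 * sum(log diag(L)).
    return 2.0 * sum(math.log(low[i][i]) for i in range(n))


def lemma36_sides(xs, lam, B):
    """Return (lhs, rhs) of Lemma 36 for vectors xs with ||x||_2 <= B."""
    n, d = len(xs), len(xs[0])
    # Matrix I + (1/lam) * sum_t x_t x_t^T.
    m = [[(1.0 if i == j else 0.0) + sum(x[i] * x[j] for x in xs) / lam
          for j in range(d)] for i in range(d)]
    lhs = log_det_spd(m) / n
    rhs = d * math.log(1.0 + n * B * B / (d * lam)) / n
    return lhs, rhs


# Assumed toy data: 20 vectors in [-1, 1]^3, so ||x||_2 <= sqrt(3).
random.seed(0)
sample = [[random.uniform(-1.0, 1.0) for _ in range(3)] for _ in range(20)]
lhs_demo, rhs_demo = lemma36_sides(sample, lam=0.5, B=math.sqrt(3.0))
print(lhs_demo <= rhs_demo)
```

The bound is tight when all eigenvalues of the matrix coincide, for example for the two standard basis vectors in $\mathbb{R}^2$ with $\lambda = B = 1$.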

H E X P E R I M E N T

In this section, we carry out experiments to evaluate the empirical performance of our algorithm OPERA for linear mixture MDPs (Algorithm 2). In this experiment, we construct an MDP $M$ with dimension $d = 3$ and episode length $H = 5$. The state space $\mathcal{S}$ consists of $H + 2$ different states $x_1, \ldots, x_{H+2}$ and the action space $\mathcal{A} = \{-1, 1\}^{d-1}$ consists of $2^{d-1}$ different actions. For each step $h \in [H]$ and episode $k \in [K]$, we assume that the reward function $r(s, a)$ is known (so there is no need to introduce $\psi$ as in Section E.1). In particular, for all $1 \le h \le H$, the reward function satisfies $r_h(s, a) = 1$ if and only if $s = x_{H+2}$, and $r_h(s, a) = 0$ otherwise. For each step $h \in [H]$ and the corresponding transition probability function $P_h$, $x_{H+1}$ and $x_{H+2}$ are absorbing states. For the other states $x_h$ ($1 \le h \le H$), the transition probability satisfies

$$P_h(x_{h+1} \mid x_h, a) = 0.95 - \langle 0.01 \cdot \mathbf{1}_{d-1}, a \rangle, \qquad P_h(x_{H+2} \mid x_h, a) = 0.05 + \langle 0.01 \cdot \mathbf{1}_{d-1}, a \rangle,$$

where each $\mathbf{1}_{d-1}$ is a $(d-1)$-dimensional vector of all ones and $a \in \mathcal{A}$ is also a $(d-1)$-dimensional vector. Then we have that $P_h(x_{h'} \mid x_h, a) = \langle \theta_h^*, \phi(x_{h'}, a, x_h) \rangle$, where $\theta_h^* = [0.01 \cdot \mathbf{1}_{d-1}, 1]$ and $\phi(x_{h'}, a, x_h)$ is as follows:

• $\phi(x_{h'}, a, x_h) = [-a, 0.95]$ if $1 \le h \le H$ and $h' = h + 1$,
• $\phi(x_{h'}, a, x_h) = [a, 0.05]$ if $1 \le h \le H$ and $h' = H + 2$,
• $\phi(x_{h'}, a, x_h) = [\mathbf{0}_{d-1}, 1]$ if $h' = h = H + 1$,
• $\phi(x_{h'}, a, x_h) = [\mathbf{0}_{d-1}, 1]$ if $h' = h = H + 2$,
• $\phi(x_{h'}, a, x_h) = \mathbf{0}_d$, otherwise.

Here $\mathbf{0}_{d-1}$ is a $(d-1)$-dimensional vector of all zeros. We compare our algorithm OPERA with the following two baselines: Optimal (the optimal policy) and Random (the uniformly random policy, which chooses actions uniformly from $\mathcal{A}$). For numerical stability, we add $\lambda I$ to $\Sigma_h^{(t)}$ with $\lambda = 1$ and use CVX (Diamond & Boyd, 2016; Agrawal et al., 2018) to approximately solve (E.3) when we implement Algorithm 2. The cumulative rewards of different algorithms averaged over 10 runs for the first 10000 episodes are plotted in Figure 2. We can see that OPERA performs much better than the random policy and can converge to the optimal policy after 10000 episodes.
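The MDP construction above can be sketched in a few lines. This is only an illustrative reconstruction of the environment, not the experimental code: it assumes that the dynamics depend on the current state alone (the same kernel is used at every step), that the reward is collected in the current state at each step, and all function names (`phi`, `transition`, `optimal_values`) are hypothetical. It does not implement OPERA itself.

```python
from itertools import product

D = 3                                            # feature dimension d
H = 5                                            # episode length
ACTIONS = list(product((-1, 1), repeat=D - 1))   # A = {-1, 1}^(d-1)
THETA_STAR = [0.01] * (D - 1) + [1.0]            # theta* = [0.01 * 1_{d-1}, 1]


def phi(h_next, a, h):
    """Feature phi(x_{h'}, a, x_h); states are 1-indexed x_1, ..., x_{H+2}."""
    if 1 <= h <= H and h_next == h + 1:
        return [-ai for ai in a] + [0.95]
    if 1 <= h <= H and h_next == H + 2:
        return list(a) + [0.05]
    if h_next == h and h in (H + 1, H + 2):      # absorbing states
        return [0.0] * (D - 1) + [1.0]
    return [0.0] * D


def transition(h_next, a, h):
    """P(x_{h'} | x_h, a) = <theta*, phi(x_{h'}, a, x_h)>."""
    return sum(t * f for t, f in zip(THETA_STAR, phi(h_next, a, h)))


def optimal_values():
    """Backward induction; reward is 1 iff the current state is x_{H+2}."""
    states = range(1, H + 3)
    values = {H + 1: {s: 0.0 for s in states}}
    policy = {}
    for step in range(H, 0, -1):
        values[step], policy[step] = {}, {}
        for s in states:
            reward = 1.0 if s == H + 2 else 0.0
            q = {a: reward + sum(transition(s2, a, s) * values[step + 1][s2]
                                 for s2 in states)
                 for a in ACTIONS}
            best = max(q, key=q.get)
            values[step][s], policy[step][s] = q[best], best
    return values, policy


values, policy = optimal_values()
print(policy[1][1], values[1][1])
```

Under these assumptions, the action $a = (1, 1)$ maximizes the probability of jumping to the rewarding absorbing state $x_{H+2}$, which is why it is selected at the first step from $x_1$.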



In this paper, we use FLAMBE to refer to both the algorithm and the low-rank MDP with unknown feature mappings.

For example, for model-free cases where $f, g$ are value functions, $\rho(f, g) = \max_{h \in [H]} \|f_h - g_h\|_\infty$. For model-based RL where $f, g$ are transition probabilities, we adopt $\rho(P, Q) = \max_{h \in [H]} \int \big(\sqrt{dP_h} - \sqrt{dQ_h}\big)^2$, which is the maximal (squared) Hellinger distance between two probability distribution sequences.

Indeed, when the coupling function is chosen as the expected Bellman error $G_h(g, f) := \mathbb{E}_{\pi_{h,f}}(Q_{h,g} - \mathcal{T}_h Q_{g,h+1})$, where $\mathcal{T}_h$ denotes the Bellman operator, we recover the definition of BE dimension (Jin et al., 2021), i.e. $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) = \dim_{\mathrm{BE}}(\mathcal{F}, G, \epsilon)$.

We assume $\mathcal{F} \subseteq \mathcal{G}$ throughout this paper, and in the general case where $\mathcal{F} \not\subseteq \mathcal{G}$, we overload $\mathcal{G} := \mathcal{F} \cup \mathcal{G}$.

The decomposability item (i) in Definition directly implies that a Generalized Completeness condition similar to Assumption 14 of Jin et al. (2021) holds.

Here and throughout our paper we consider $\pi_{\mathrm{est}} = \pi^t$ for Q-type models. For V-type models, we instead consider $\pi_{\mathrm{est}} = U(\mathcal{A})$ to be the uniform distribution over the action space. Such a representation of the estimation policy allows us to unify the Q-type and V-type models in a single analysis.

The hypothesis class reduces to the model class (Sun et al., 2019) when restricted to the model-based setting.

The definition of witness rank adopts a V-type representation and hence we can only derive the sample complexity of our algorithm. For a detailed discussion of the V-type cases, we refer readers to §D in the appendix.



Figure 1: Venn-Diagram Visualization of Prevailing Sample-Efficient RL Classes. As by far the richest concept, the DEC framework is both a necessary and sufficient condition for sample-efficient interactive learning. BE dimension is a rich class that subsumes both low Bellman rank and low eluder dimension and addresses almost all model-free RL classes. The generalized Bilinear Class captures model-based RL settings including KNRs, linear mixture MDPs and low Witness rank MDPs, yet precludes some eluder-dimension based models. Bellman Representability is another unified framework that subsumes the vanilla bilinear classes but fails to capture KNRs and low Witness rank MDPs. Our ABC class encloses both generalized Bilinear Class and Bellman Representability and subsumes almost all known solvable MDP cases, with the exception of the Q * state-action aggregation and deterministic linear Q * MDP models, which neither Bilinear Class nor our ABC class captures.

4 . 3 I M P L I C AT I O N F O R S P E C I F I C M D P I N S TA N C E S

1]. We note that Eq. (B.2) ensures conditions (i) and (ii) in Definition 5 simultaneously, and the ABC of the LQR model setting admits a bilinear structure that enables us to apply Proposition 35 and conclude low FE dimension.

B . 6 G E N E R A L I Z E D L I N E A R B E L L M A N C O M P L E T E

Next we introduce the generalized linear Bellman complete model, showing that our ABC class with low FE dimension captures this model even without the monotone operator $\sqrt{x}$ used in Du et al. (2021).

Definition 19 (Generalized Linear Bellman Complete). A generalized linear Bellman complete model consists of an inverse link function $\sigma : \mathbb{R} \to \mathbb{R}_+$ and a hypothesis class $\mathcal{F}$ :

The dimension d of the Q * state-action aggregation model is defined as the cardinality of B, i.e., d = |B|. For each element b ∈ B, we note that for all (s, a) ∈ S × A satisfying ξ(s, a) = b, they share the same value of Q(s, a). For notational simplicity, we use Q(b) to denote this common value for all ξ(s, a) = b. Moreover, we let w * h be a d-dimensional vector of all aggregated values with (w * h ) b = Q(b) and Q * (s, a) can be expressed as a linear function on the aggregated values: Q * h (s, a) = ⟨w * h , ψ(s, a)⟩ where ψ(s, a) : S × A → {0, 1} d is a one-hot vector satisfying (ψ(s, a)) b = 1 when ξ(s, a) = b and (ψ(s, a)) b = 0 otherwise.
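The one-hot representation described above can be illustrated with a toy example. The sketch below is purely hypothetical: the aggregation map `XI`, the aggregated values `W_STAR`, and the sizes of the state and action spaces are invented for illustration and do not come from the paper.

```python
# Hypothetical toy instance: 3 states, 2 actions, aggregated into B = {0, 1}.
XI = {(s, a): (s + a) % 2 for s in range(3) for a in range(2)}  # xi(s, a) in B
D_AGG = len(set(XI.values()))                                   # d = |B|
W_STAR = [0.3, 0.8]             # assumed aggregated values, (w*)_b = Q(b)


def psi(s, a):
    """One-hot feature: entry b equals 1 exactly when xi(s, a) = b."""
    return [1.0 if b == XI[(s, a)] else 0.0 for b in range(D_AGG)]


def q_star(s, a):
    """Q*(s, a) = <w*, psi(s, a)>, i.e. the value of the block containing (s, a)."""
    return sum(w * p for w, p in zip(W_STAR, psi(s, a)))


# State-action pairs mapped to the same block share the same value.
print(q_star(0, 0), q_star(1, 1), q_star(0, 1))
```

The inner product simply selects the entry of the weight vector indexed by the block of $(s, a)$, so the model has exactly $d = |\mathcal{B}|$ free parameters per step.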

Bringing in the choice of β in Corollary 13 yields a regret bound of O d 2 ϕ d s H 4 T . In comparison, LC 3 in Kakade et al. (2020) has a regret bound of O d ϕ (d s + d ϕ )H 3 T . The improved factor of √

Taking the telescoping sum from $t = 0$ to $t = T - 1$ completes the proof.

Proof of Lemma 30. Given a hypothesis class $\mathcal{F}$ and a coupling function $G(\cdot, \cdot) : \mathcal{F} \times \mathcal{F} \to \mathbb{R}$. Let $n$ be defined as follows: $n := \min\big\{n \in \mathbb{N} : n \ge e\, d_\phi \log\big(1 + 4 n d_s R^4 / (d_\phi \epsilon'^2)\big)\big\}$.

Figure 2: Cumulative regret comparison in the first 10000 episodes of different RL algorithms (i.e., OPERA, Optimal Policy, Random Policy). Results are averaged over 10 runs.

$P_h(x_{h+1} \mid x_h, a) = 0.95 - \langle 0.01 \cdot \mathbf{1}_{d-1}, a \rangle, \quad P_h(x_{H+2} \mid x_h, a) = 0.05 + \langle 0.01 \cdot \mathbf{1}_{d-1}, a \rangle,$

it has a low FE dimension. Low Witness Rank. We defer the formal definition of witness rank to §E.2 and provide the following proposition showing that low Witness rank models belong to our ABC class with low FE dimension. Proposition 9 (Low Witness Rank ⊂ ABC with Low FE Dimension). The low Witness rank model belongs to the ABC class with estimation function ℓ

The following proposition shows that KNR belongs to the ABC class with low FE dimension.

International Conference on Machine Learning, pp. 6995-7004. PMLR, 2019.

Lin Yang and Mengdi Wang. Reinforcement learning in feature space: Matrix bandit, kernels, and regret bound. In International Conference on Machine Learning, pp. 10746-10756. PMLR, 2020.

Zhuoran Yang, Chi Jin, Zhaoran Wang, Mengdi Wang, and Michael I Jordan. On function approximation in reinforcement learning: Optimism in the face of large state spaces. arXiv preprint arXiv:2011.04622, 2020.

Ekim Yurtsever, Jacob Lambert, Alexander Carballo, and Kazuya Takeda. A survey of autonomous driving: Common practices and emerging technologies. IEEE Access, 8:58443-58469, 2020.

Andrea Zanette and Emma Brunskill. Tighter problem-dependent regret bounds in reinforcement learning without domain knowledge using value function bounds. arXiv preprint arXiv:1901.00210, 2019.

The appendix is organized as follows. §A discusses the related work, providing comparisons with previous frameworks based on both coverage and sharpness of sample complexity. §B compares our regret bound and sample complexity on specific examples and discusses several additional examples including reactive POMDPs, FLAMBE, LQR, and the generalized linear Bellman complete model. §C proves the main results (Theorem 11 and Corollary 26 on sample complexity of OPERA). §D explains the V-type setting and the corresponding results. §E discusses the OPERA algorithm when applied to special examples (linear mixture MDPs, low Witness rank MDPs, KNRs). §F details the delayed proofs of technical lemmas. §G details the proofs relevant to the FE dimension.

Yang et al. (2020) studied MDPs with a structure where the action-value function can be represented by a kernel function or an over-parameterized neural network. Recently, Jin et al. (2021) proposed a complexity measure called the Bellman eluder (BE) dimension. The RL problems with low BE dimension subsume the problems with low Bellman rank and low eluder dimension. Simultaneously, Du et al. (2021) proposed Bilinear Classes, which can be applied to a variety of loss estimators beyond vanilla Bellman error, but with

Comparison of sample complexity for different MDP models under different RL frameworks.

Also, for models with low Bellman rank $d$, our sample complexity scales linearly in $d$ as in Jin et al. (2021), while the complexity in Jiang et al. (2017) scales quadratically. For model-based RL settings with linear structure that are not within the low BE dimension framework such as the linear mixture MDPs, our OPERA algorithm obtains a d FE H Bernstein-type bonus for exploration. The Bilinear Classes (Du et al., 2021) is a general framework that covers linear mixture MDPs as a special case. The sample complexity of the BiLin-UCB algorithm when constrained to linear mixture models is $d^3 H^4/\epsilon^2$, which is $dH^2$ worse than that of OPERA in this work. In terms of lower bound, when specialized to linear mixture MDPs, our result of $O(dH\sqrt{T})$ matches the lower bound provided in Zhou et al.

Compared with the d 3 H 4 /ϵ 2 sample complexity in Du et al. (2021), our Algorithm 2 OPERA (linear mixture MDPs)

completes the proof of (F.6). Now let d = dim F E (F, G, ω) and sort |G(f 1 , g 1 )|, . . . , |G(f t , g t )| in a nonincreasing order, denoted by e 1 , . . . , e t . Then we have that

Combining Proposition 34 with Proposition 29 in Jin et al. (2021), it is straightforward to conclude that the FE dimension is smaller than the effective dimension. In particular, Proposition 34 says that $\dim_{\mathrm{FE}}$ is controlled by $\dim_{\mathrm{BE}}$, and Proposition 29 in Jin et al. (2021) says that $\dim_{\mathrm{BE}}$ is controlled by the effective dimension $\dim_{\mathrm{eff}}$; therefore low effective dimension implies ABC with low FE dimension. In the following paragraphs and Proposition 35 we prove this conclusion from scratch to give a better understanding of the FE dimension.
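To make the independence condition behind the FE dimension concrete for a bilinear coupling $G(g, f) = \langle W(g), X(f) \rangle$, the following minimal sketch greedily builds an $\epsilon$-independent sequence over finite candidate sets. It is an illustration under stated assumptions, not the paper's procedure: the candidate vectors are supplied directly as lists, the greedy pass only gives a lower bound on the longest independent sequence at a single level $\epsilon$, and the function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def is_independent(x_new, prefix, witnesses, eps):
    """Check whether some witness w = W(g) satisfies
    sqrt(sum_i <w, x_i>^2) <= eps on the prefix but |<w, x_new>| > eps."""
    for w in witnesses:
        past = math.sqrt(sum(dot(w, x) ** 2 for x in prefix))
        if past <= eps and abs(dot(w, x_new)) > eps:
            return True
    return False


def greedy_independent_length(features, witnesses, eps):
    """Scan candidate features X(f) in order, keeping those that are
    eps-independent of the ones kept so far; return the sequence length."""
    kept = []
    for x in features:
        if is_independent(x, kept, witnesses, eps):
            kept.append(x)
    return len(kept)


# Assumed toy instance in R^2: standard basis vectors as features and witnesses.
basis = [[1.0, 0.0], [0.0, 1.0]]
print(greedy_independent_length(basis + basis, basis, eps=0.5))
```

In this toy instance each basis direction can be added only once, after which every witness that is large on a repeated feature is already large on the prefix, mirroring how the dimension of the feature space caps the length of an independent sequence.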

we have by Eq. (G.2) that $\|W(g_t)\|_{\Sigma_t} \le \sqrt{2}\epsilon'$. Furthermore, $\epsilon' \le |\langle W(g_t), X(f_t) \rangle_{\mathcal{H}}| \le \|W(g_t)\|_{\Sigma_t} \cdot \|X(f_t)\|_{\Sigma_t^{-1}}$ for any $t \in [n]$. By applying the log-determinant argument, we have

We now provide the detailed proofs of Lemmas 27, 29 and 30.

Proof. By definition of Σ t we have thatlog det(Σ t+1 ) = log det(Σ t ) + log det(I + Σ E[x t x ⊤ t ](Σ t ) -1/2 with eigenvalue λ 1 , . . . , λ d ≥ 0, we have that det(I + Λ t ) = Π d i=1 (λ i + 1) ≥ 1 +

vec E s h ,a h ∼πg ϕ(s h , a h )ϕ(s h , a h ) ⊤ = E s h ,a h ∼πg ||(U h,f -U * h )ϕ(s h , a h )|| 2 .Denote U h,gt,j , j ∈ [d s ] and U * h,j , j ∈ [d s ] to be the rows of U h,gt and U * h . Taking square over both side of the inequalities in (G.7) gives that ,a h ∼πf i [(U h,gt,j -U * h,j )ϕ(s h , a h )] 2 ≤ ϵ ′2 , t ∈ [n], ds j=1 E s h ,a h ∼π f t [(U h,gt,j -U * h,j )ϕ(s h , a h )] 2 > ϵ ′2 , t ∈ [n]. E s h ,a h ∼π f i [ϕ(s h , a h )ϕ(s h , a h ) ⊤ ] + (ϵ ′2 /4d s R 2 ) • I.Then by (G.8), we have thatEs h ,a h ∼π f i [(U h,g t ,j -U * h,j )ϕ(s h , a h )] 2 + (ϵ ′2 /4dsR 2 ) • Es h ,a h ∼π f t ∥U h,g t ,j -U * h,j ∥ 2 Es h ,a h ∼π f i [(U h,g t ,j -U * h,j )ϕ(s h , a h )] 2 + (ϵ ′2 /4dsR 2 ) • 2ds maxwhere the first equality is by the Cauchy-Schwartz inequality and the last inequality is by∥U h,gt,j ∥ 2 ≤ ∥U h,gt ∥ 2 ≤ R, ∥U * h,j ∥ 2 ≤ ∥U * h ∥ 2 ≤ R. Furthermore we have that E s h ,a h ∼π f t ∥Σ = E s h ,a h ∼π f t ∥Σwhere the last inequality is by the Cauchy-Schwarz inequality for random variables. Thus, we have thatE s h ,a h ∼π f t ∥Σ -1/2 t ϕ(s h , a h )∥ 2 2 ≥ 1/2 for all t ∈ [n]. By applying Lemma 37, we have that mint∈[n] log(1 + E s h ,a h ∼π f t ∥Σ ,a h ∼π f i [ϕ(s h , a h )ϕ(s h , a h ) ⊤ ] . ,a h ∼π f i [ϕ(s h , a h )ϕ(s h , a h ) ⊤ ] E s h ,a h ∼π f t ∥Σ


We first show that for the sequence $\{f_{s_1}, \ldots, f_{s_m}\} \subseteq \mathcal{F}$, there exists j ∈ Van Roy, 2013) . We prove this by the following procedure. Start with singleton sequences $B_1 = \{f_{s_1}\}, \ldots, B_L = \{f_{s_L}\}$ and $j = L + 1$. For each $j$, if $f_{s_j}$ is $\epsilon$-dependent on each of $B_1, \ldots, B_L$, we have achieved our goal and the process stops. Otherwise, there exists $i \in [L]$ such that $f_{s_j}$ is $\epsilon$-independent of $B_i$, and we update $B_i = B_i \cup \{f_{s_j}\}$. Then we increment $j$ by 1 and continue the process. By the definition of the FE dimension, the cardinality of each set $B_1, \ldots, B_L$ cannot be larger than $\dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon)$ at any point in this process. Therefore, by the pigeonhole principle, the process stops by step $j = L \dim_{\mathrm{FE}}(\mathcal{F}, G, \epsilon) + 1 \le m$.

