ENTROPIC RISK-SENSITIVE REINFORCEMENT LEARNING: A META REGRET FRAMEWORK WITH FUNCTION APPROXIMATION

Abstract

We study risk-sensitive reinforcement learning with the entropic risk measure and function approximation. We consider the finite-horizon episodic MDP setting, and propose a meta algorithm based on value iteration. We then derive two algorithms for linear and general function approximation, namely RSVI.L and RSVI.G, respectively, as special instances of the meta algorithm. We illustrate that the success of RSVI.L depends crucially on carefully designed feature mapping and regularization that adapt to risk sensitivity. In addition, both RSVI.L and RSVI.G maintain risk-sensitive optimism that facilitates efficient exploration. On the analytic side, we provide regret analysis for the algorithms by developing a meta analytic framework, at the core of which is a risk-sensitive optimism condition. We show that any instance of the meta algorithm that satisfies the condition yields a meta regret bound. We further verify the condition for RSVI.L and RSVI.G under respective function approximation settings to obtain concrete regret bounds that scale sublinearly in the number of episodes.

1. INTRODUCTION

Risk is one of the most important considerations in decision making, and so it should be in reinforcement learning (RL). As a prominent paradigm in RL that performs learning while accounting for risk, risk-sensitive RL explicitly models the risk of decisions via certain risk measures and simultaneously optimizes for rewards. It is poised to play an essential role in application domains where accounting for risk in decision making is crucial. A partial list of such domains includes autonomous driving (Buehler et al., 2009; Thrun, 2010), behavior modeling (Niv et al., 2012; Shen et al., 2014), real-time strategy games (Berner et al., 2019; Vinyals et al., 2019) and robotic surgery (Fagogenis et al., 2019; Shademan et al., 2016). In this paper, we study risk-sensitive RL through the lens of function approximation, which is an important apparatus for scaling up and accelerating RL algorithms in high-dimensional applications. We focus on risk-sensitive RL with the entropic risk measure, a classical framework established by the seminal work of Howard & Matheson (1972). Informally, for a fixed risk parameter β ≠ 0, our goal is to maximize the objective

V_β = (1/β) log{E[e^{βR}]}, (1)

where R denotes the cumulative reward. The definition of V_β will be made formal later in (2). The objective (1) admits the Taylor expansion V_β = E[R] + (β/2)·Var(R) + O(β²). Comparing (1) with the risk-neutral objective V = E[R] studied in the standard RL setting, we see that β > 0 induces a risk-seeking objective and β < 0 induces a risk-averse one. Therefore, the formulation with the entropic risk measure in (1) accounts for both risk-seeking and risk-averse modes of decision making, whereas most other formulations are restricted to the risk-averse setting (Fu et al., 2018). It can also be seen that V_β tends to the risk-neutral V as β → 0.
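Concretely, the objective in (1) can be estimated from Monte Carlo samples of the cumulative reward R. The following sketch (ours, purely for illustration; the return distribution is a made-up example) shows how the sign of β tilts the objective relative to the risk-neutral value:

```python
import numpy as np

def entropic_risk(returns, beta):
    """Entropic risk value V_beta = (1/beta) * log E[exp(beta * R)],
    estimated from i.i.d. samples of the cumulative reward R."""
    returns = np.asarray(returns, dtype=float)
    return np.log(np.mean(np.exp(beta * returns))) / beta

# Illustration on a hypothetical return distribution: R is 0 or 10 with equal probability.
rng = np.random.default_rng(0)
samples = rng.choice([0.0, 10.0], size=200_000)

v_neutral = samples.mean()                # risk-neutral value E[R], about 5
v_averse = entropic_risk(samples, -0.5)   # beta < 0 penalizes variance, below 5
v_seeking = entropic_risk(samples, 0.5)   # beta > 0 rewards variance, above 5
assert v_averse < v_neutral < v_seeking
```

For small |β| the estimate approaches E[R] + (β/2)·Var(R), consistent with the Taylor expansion above.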
Existing works on function approximation for RL have mostly focused on the risk-neutral setting and heavily exploit the linearity of the risk-neutral objective V in both the transition dynamics (implicitly captured by the expectation) and the reward R; this linearity is clearly not available in the risk-sensitive objective (1). It is also well known that, even in the risk-neutral setting, improperly implemented function approximation can result in errors that scale exponentially in the size of the state space. Combined with the nonlinearity of the risk-sensitive objective (1), this compounds the difficulty of implementing function approximation in risk-sensitive RL with provable guarantees. This work provides a principled solution to function approximation in risk-sensitive RL by overcoming the above difficulties. Under the finite-horizon MDP setting, we propose a meta algorithm based on value iteration, and from it we derive two concrete algorithms for linear and general function approximation, which we name RSVI.L and RSVI.G, respectively. By modeling a shifted exponential transformation of estimated value functions, RSVI.L and RSVI.G cater to the nonlinearity of the risk-sensitive objective (1) and adapt to both risk-seeking and risk-averse settings. Moreover, both RSVI.L and RSVI.G maintain risk-sensitive optimism in the face of uncertainty for effective exploration. In particular, RSVI.L exploits a synergistic relationship between feature mapping and regularization in a risk-sensitive fashion. The resulting structure of RSVI.L makes it more efficient in runtime and memory than RSVI.G under linear function approximation, while RSVI.G is more general and allows for function approximation beyond the linear setting. Furthermore, we develop a meta regret analytic framework and identify a risk-sensitive optimism condition that serves as its core component.
Under the optimism condition, we prove a meta regret bound incurred by any instance of the meta algorithm, regardless of the function approximation setting. Furthermore, we show that both RSVI.L and RSVI.G satisfy the optimism condition under their respective function approximation settings and achieve regret that scales sublinearly in the number of episodes. The meta framework therefore helps us disentangle the analysis associated with function approximation from the generic analysis, shedding light on the role of function approximation in regret guarantees. We hope that our meta framework will motivate and benefit future studies of function approximation in risk-sensitive RL.

Our contributions. We may summarize the contributions of the present paper as follows:
• we study function approximation in risk-sensitive RL with the entropic risk measure; we provide a meta algorithm, from which we derive two concrete algorithms for linear and general function approximation, respectively; the concrete algorithms are both shown to adapt to all levels of risk sensitivity and maintain risk-sensitive optimism over the learning process;
• we develop a meta regret analytic framework and identify a risk-sensitive optimism condition, under which we prove a meta regret bound for the meta algorithm; furthermore, by showing that the optimism condition holds for both concrete algorithms, we establish regret bounds for them under linear and general function approximation, respectively.

Notations. For a positive integer n, we let [n] := {1, 2, . . . , n}. For a number u ≠ 0, we define sign(u) = 1 if u > 0 and −1 if u < 0. For two non-negative sequences {a_i} and {b_i}, we write a_i ≲ b_i if there exists a universal constant C > 0 such that a_i ≤ C·b_i for all i, and write a_i ≍ b_i if a_i ≲ b_i and b_i ≲ a_i. We use Õ(·) to denote O(·) with logarithmic factors hidden. For any ε > 0 and set X, we let N_ε(X, ‖·‖) be the ε-net of the set X with respect to the norm ‖·‖.
We let ∆(X) be the set of probability distributions supported on X. For any vector u ∈ R^n and symmetric positive definite matrix Γ ∈ R^{n×n}, we let ‖u‖_Γ := √(u⊤Γu). We denote by I_n the n × n identity matrix.
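As a quick illustration of the weighted norm just defined (a toy sketch of ours, not part of the paper):

```python
import numpy as np

def weighted_norm(u, Gamma):
    """||u||_Gamma = sqrt(u^T Gamma u) for a symmetric positive definite Gamma."""
    return float(np.sqrt(u @ Gamma @ u))

# With Gamma = 2 * I_2, the weighted norm scales the Euclidean norm by sqrt(2).
u = np.array([3.0, 4.0])
assert np.isclose(weighted_norm(u, 2 * np.eye(2)), np.sqrt(2) * 5.0)
```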

2. RELATED WORK

Initiated by the seminal work of Howard & Matheson (1972), risk-sensitive control/RL with the entropic risk measure has been studied in a vast body of literature (Bäuerle & Rieder, 2014; Borkar, 2001; 2002; 2010; Borkar & Meyn, 2002; Cavazos-Cadena & Hernández-Hernández, 2011; Coraluppi & Marcus, 1999; Di Masi & Stettner, 1999; 2000; 2007; Fleming & McEneaney, 1995; Hernández-Hernández & Marcus, 1996; Jaśkiewicz, 2007; Marcus et al., 1997; Mihatsch & Neuneier, 2002; Osogami, 2012; Patek, 2001; Shen et al., 2013; 2014; Whittle, 1990). Yet, this line of work either assumes known transition kernels or focuses on asymptotic behaviors of the problem/algorithms, and finite-sample/finite-time results with unknown transitions have rarely been investigated. The work most relevant to ours is perhaps Fei et al. (2020), who consider the same problem as ours under the tabular setting. They propose two algorithms based on value iteration and Q-learning, and prove regret bounds for these algorithms, which are then certified to be nearly optimal by a lower bound. However, their algorithms and analysis are restricted to the tabular setting. Compared to Fei et al. (2020), our paper provides a novel and unified framework of algorithms and analysis for function approximation. We study linear and general function approximation as two instances of the framework, both of which subsume the tabular setting. We also briefly discuss existing works on function approximation with regret analysis, which have so far focused on the risk-neutral setting. The works of Cai et al. (2019); Jin et al. (2019); Wang et al. (2019); Yang & Wang (2019); Zhou et al. (2020) study linear function approximation, while Ayoub et al. (2020); Wang et al. (2020) investigate general function approximation. In addition, all of these works prove Õ(K^{1/2}) regret for their algorithms, although the dependence on other parameters varies across settings.
As we have argued in the previous section, the nonlinear objective (1) makes algorithm design and regret analysis for function approximation much more challenging in risk-sensitive settings than in the standard risk-neutral one.

3. PROBLEM FORMULATION

3.1. EPISODIC MDP

An episodic MDP is parameterized by a tuple (K, H, S, A, {P_h}_{h∈[H]}, {r_h}_{h∈[H]}), where K is the number of episodes, H is the number of steps in each episode, S is the state space, A is the action space, P_h : S × A → ∆(S) is the transition kernel at step h, and r_h : S × A → [0, 1] is the reward function at step h. We assume that {P_h} are unknown. For simplicity, we also assume that {r_h} are known and deterministic, as is done in existing works such as Yang & Wang (2019); Zhou et al. (2020). We interact with the episodic MDP as follows. At the beginning of each episode k ∈ [K], the environment chooses an arbitrary initial state s^k_1 ∈ S. Then in each step h ∈ [H], we take an action a^k_h ∈ A, receive a reward r_h(s^k_h, a^k_h) and transition to the next state s^k_{h+1} ∈ S sampled from P_h(· | s^k_h, a^k_h). Once we reach s^k_{H+1}, the current episode terminates and we advance to the next episode unless k = K.
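The interaction protocol above can be sketched as follows (an illustrative toy implementation of ours, not from the paper; the tabular representation of P and r is an assumption made for the example):

```python
import numpy as np

def run_episode(P, r, policy, s1, rng):
    """One episode of the interaction protocol: at step h take a_h = policy[h](s_h),
    receive r_h(s_h, a_h), then sample s_{h+1} ~ P_h(. | s_h, a_h).
    P[h][s, a] is a probability vector over next states; r[h][s, a] is the reward."""
    H = len(P)
    s, total = s1, 0.0
    for h in range(H):
        a = policy[h](s)
        total += r[h][s, a]
        s = rng.choice(len(P[h][s, a]), p=P[h][s, a])
    return total   # cumulative reward R of the episode

# A 2-state, 1-action, horizon-2 toy MDP under an always-action-0 policy.
rng = np.random.default_rng(0)
P = [np.full((2, 1, 2), 0.5) for _ in range(2)]    # uniform transitions
r = [np.array([[1.0], [0.0]]) for _ in range(2)]   # reward 1 in state 0, else 0
R = run_episode(P, r, [lambda s: 0] * 2, s1=0, rng=rng)
assert R in (1.0, 2.0)   # step 1 pays 1 from state 0; step 2 pays 0 or 1
```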

3.2. VALUE FUNCTIONS, BELLMAN EQUATIONS AND REGRET

We assume that β is fixed prior to the learning process, and for notational simplicity we omit it from quantities to be introduced subsequently. In risk-sensitive RL with the entropic risk measure, we aim to find a policy π = {π_h : S → A} so as to maximize the value function given by

V^π_h(s) := (1/β) log E[exp(β Σ_{h′=h}^H r_{h′}(s_{h′}, π_{h′}(s_{h′}))) | s_h = s], (2)

for all (h, s) ∈ [H] × S. Under some mild regularity conditions, there exists a greedy policy π* = {π*_h} which gives the optimal value V^{π*}_h(s) = sup_π V^π_h(s) for all (h, s) ∈ [H] × S (Bäuerle & Rieder, 2014). In addition to the value function, another key notion is the action-value function defined as

Q^π_h(s, a) := (1/β) log E[exp(β Σ_{h′=h}^H r_{h′}(s_{h′}, a_{h′})) | s_h = s, a_h = a], (3)

for all (h, s, a) ∈ [H] × S × A. The action-value function Q^π_h is closely associated with the value function V^π_h via the so-called Bellman equation

Q^π_h(s, a) = r_h(s, a) + (1/β) log E_{s′∼P_h(· | s,a)}[exp(β · V^π_{h+1}(s′))], V^π_h(s) = Q^π_h(s, π_h(s)), V^π_{H+1}(s) = 0, (4)

which holds for all (h, s, a) ∈ [H] × S × A. Note that the identity of Q^π_h in (4) is the result of a simple calculation based on (2) and (3). Similarly, the Bellman optimality equation is given by

Q*_h(s, a) = r_h(s, a) + (1/β) log E_{s′∼P_h(· | s,a)}[exp(β · V*_{h+1}(s′))], V*_h(s) = max_{a∈A} Q*_h(s, a), V*_{H+1}(s) = 0, (5)

again for all (h, s, a) ∈ [H] × S × A. In the above, we use the shorthand Q*_h(·, ·) := Q^{π*}_h(·, ·) for all h ∈ [H], and V*_h(·) is similarly defined. The identity V*_h(·) = max_{a∈A} Q*_h(·, a) implies that the optimal π* is the greedy policy with respect to the optimal action-value function {Q*_h}_{h∈[H]}. During the learning process, the policy π^k in each episode k may differ from the optimal π*. We quantify this difference over all K episodes through the notion of regret, defined as

Regret(K) := Σ_{k∈[K]} [V*_1(s^k_1) − V^{π^k}_1(s^k_1)]. (6)
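For intuition, the Bellman optimality equation above can be solved exactly by backward induction when the transitions are known. The following sketch (ours, not the paper's learning algorithm, which must handle unknown transitions) does so on a small tabular MDP:

```python
import numpy as np

def risk_sensitive_vi(P, r, beta):
    """Exact backward induction for the entropic-risk Bellman optimality equation
    on a tabular MDP with known transitions. P[h][s, a, s'] is the transition
    kernel at step h; r[h][s, a] is the reward. Returns V*_1 and all Q*_h."""
    H, (S, A, _) = len(P), P[0].shape
    V = np.zeros(S)                        # V*_{H+1} = 0
    Qs = []
    for h in range(H - 1, -1, -1):         # h = H, ..., 1
        # Q*_h(s,a) = r_h(s,a) + (1/beta) log E_{s'~P_h}[exp(beta * V*_{h+1}(s'))]
        Q = r[h] + np.log(P[h] @ np.exp(beta * V)) / beta
        V = Q.max(axis=1)                  # V*_h(s) = max_a Q*_h(s,a)
        Qs.append(Q)
    return V, Qs[::-1]

# Two-state toy MDP: transitions are a fair lottery, state 1 pays reward 1.
P = [np.full((2, 1, 2), 0.5) for _ in range(2)]
r = [np.array([[0.0], [1.0]]) for _ in range(2)]
V_averse, _ = risk_sensitive_vi(P, r, beta=-1.0)
V_seek, _ = risk_sensitive_vi(P, r, beta=1.0)
assert np.all(V_averse <= V_seek)   # beta < 0 penalizes reward variance
```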
Since V*_1(s) ≥ V^π_1(s) for any π and s ∈ S by definition, the regret also characterizes the suboptimality of {π^k} relative to the optimal π*.

3.3. FUNCTION APPROXIMATION

In this paper, we focus on linear and general function approximation. We consider the following form of linear function approximation, which assumes that each transition kernel admits a linear form.

Assumption 1 We assume that the MDP is equipped with a known feature function ψ : S × A × S → R^d such that for any h ∈ [H], there exists a vector θ_h ∈ R^d with ‖θ_h‖₂ ≤ √d and the transition kernel is given by P_h(s′ | s, a) = ψ(s, a, s′)⊤θ_h for any (s, a, s′) ∈ S × A × S.

This form of linear function approximation is also studied in the works of Ayoub et al. (2020); Cai et al. (2019); Zhou et al. (2020), whose setting is equivalent to ours when β → 0. The setting of Assumption 1 may be reduced to the tabular setting, in which ψ(s, a, s′) is a canonical basis vector in R^d with d = |S|²|A|, i.e., the (s, a, s′)-th entry of ψ(s, a, s′) is equal to one and the other entries are equal to zero. It also subsumes various settings of function approximation, including linear combinations of base models (Modi et al., 2020) and the matrix bandit setting (Yang & Wang, 2019); we refer readers to Zhou et al. (2020) for more details on the generality of Assumption 1. For general function approximation, we make the following assumption.

Assumption 2 We assume that we have access to a function set P such that the transition kernel P_h ∈ P for all h ∈ [H].

This setting is also considered in Ayoub et al. (2020). Under Assumption 2, we may measure the complexity of function sets using the notion of the so-called eluder dimension (Russo & Van Roy, 2014), which we define and discuss in Appendix A. Note that although we focus on function approximation of transition kernels, a similar approach can be taken to apply function approximation to reward functions, and our regret guarantees presented below would still hold, as argued in Yang & Wang (2019).
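To make the tabular reduction of Assumption 1 concrete, here is a minimal sketch (ours) of the canonical-basis feature ψ and the matching parameter vector θ_h; the flattening order is our assumption for the example:

```python
import numpy as np

def tabular_feature(s, a, s_next, nS, nA):
    """Canonical-basis feature for the tabular reduction of Assumption 1:
    psi(s, a, s') is an indicator vector in R^d with d = |S|^2 |A|, so that
    P_h(s'|s,a) = psi(s,a,s')^T theta_h recovers any tabular kernel."""
    psi = np.zeros(nS * nA * nS)
    psi[(s * nA + a) * nS + s_next] = 1.0
    return psi

# theta_h simply stacks the transition probabilities P_h(s'|s,a) in the same order.
nS, nA = 3, 2
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(nS), size=(nS, nA))    # a random tabular kernel
theta = P.reshape(-1)                            # matching parameter vector
s, a, s_next = 1, 0, 2
assert np.isclose(tabular_feature(s, a, s_next, nS, nA) @ theta, P[s, a, s_next])
```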

4. ALGORITHMS

We first present the Meta algorithm for Risk-Sensitive Value Iteration (MetaRSVI) in Algorithm 1, which is a high-level framework including key features of algorithms based on value iteration (Bradtke & Barto, 1996; Osband et al., 2014; Jin et al., 2019). It mainly consists of a value estimation step and a policy execution step. In the value estimation step, the algorithm estimates the optimal {Q*_h} by its iterates {Q^k_h} based on historical data. Since we focus on greedy policies, in Line 5 the estimated value function V^k_h(·) is simply taken as the maximum among {Q^k_h(·, a′)}_{a′∈A}. The primary machinery of value estimation, known as Risk-Sensitive Temporal Difference or RSTD, is presented in an abstract way in Line 4; this is because its concrete form depends on the function approximation of the underlying MDP and on algorithmic implementation. In the policy execution step, the algorithm uses the policy learned so far (represented by {Q^k_h}) to collect data for subsequent learning stages. We remark that Algorithm 1 is flexible and general enough to allow for any function approximation. In the remainder of this section, we derive two special instances of Algorithm 1 for linear and general function approximation, respectively, by providing concrete implementations of RSTD.

Algorithm 1 MetaRSVI

Input: risk parameter β, number of episodes K
1: for episode k = 1, . . . , K do
2:   V^k_{H+1}(·) ← 0
3:   for step h = H, H − 1, . . . , 1 do          ▷ value estimation
4:     Q^k_h(·, ·) ← RSTD(k, h, β, {V^τ_{h+1}}_{τ∈[k]})
5:     V^k_h(·) ← max_{a′∈A} Q^k_h(·, a′)
6:   end for
7:   Receive the initial state s^k_1 from the environment
8:   for step h = 1, 2, . . . , H do              ▷ policy execution
9:     Take action a^k_h ← argmax_{a′∈A} Q^k_h(s^k_h, a′)
10:    Receive the reward r_h(s^k_h, a^k_h) and the next state s^k_{h+1}
11:  end for
12: end for

We introduce Risk-Sensitive Value Iteration for Linear function approximation, or RSVI.L, in Algorithm 2 under the setting of Assumption 1. The algorithm is inspired by RSVI proposed in Fei et al. (2020), which specializes in the tabular setting. It replaces the abstract function RSTD in Algorithm 1 with a concrete implementation for linear function approximation. In Line 4, the iterate w^k_h can be interpreted as the solution of the following least-squares problem:

w^k_h ← argmin_{w∈R^d} Σ_{τ∈[k−1]} [e^{β·V^τ_{h+1}(s^τ_{h+1})} − 1 − w⊤φ^τ_h(s^τ_h, a^τ_h)]² + λ‖w‖²₂, (7)

where the surrogate feature mappings {φ^τ_h(·, ·)}_{τ∈[k−1]} are constructed in Line 5 and λ ≥ 0 is a regularization parameter to be set by users. The above regression problem essentially computes an estimate of θ_h, the parameter of the transition kernel P_h, by taking advantage of the linear form of E_{s′∼P_h(· | s,a)}[e^{β·V^π_{h+1}(s′)}] in the Bellman equation (4). Note that the regression targets are set as a shifted exponential transformation of estimated value functions, i.e., {e^{β·V^τ_{h+1}(s^τ_{h+1})} − 1}. A similar construction is applied to the surrogate features φ^k_h(·, ·) in Line 5. Mechanically, such a design ensures that when V^k_{h+1}(·) = 0, we have φ^k_h(·, ·) = 0 and therefore, by definition, Q^k_h(·, ·) = r_h(·, ·); this is similar to the behavior exhibited by risk-neutral algorithms for linear function approximation, e.g., Algorithm 1 in Cai et al.
(2019), whose surrogate features are given by replacing e^{β·V^k_{h+1}(·)} − 1 in Line 5 with V^k_{h+1}(·). To update Q^k_h(·, ·) in Line 7, we use the quantity

q^k_{h,L}(·, ·) := min{e^{β(H−h)}, ⟨φ^k_h(·, ·), w^k_h⟩ + 1 + b^k_h(·, ·)}, if β > 0; max{e^{β(H−h)}, ⟨φ^k_h(·, ·), w^k_h⟩ + 1 − b^k_h(·, ·)}, if β < 0. (8)

Informally, with b^k_h taking the role of a bonus, q^k_{h,L} can be seen as an "optimistic" and risk-adaptive estimate of the expected value of e^{β·V^k_{h+1}(s′)} under the transition kernel P_h.

Algorithm 2 RSVI.L
Input: risk parameter β, number of episodes K, confidence width γ_L, regularization parameter λ
1: Run Algorithm 1 with RSTD() therein overloaded by the following subroutine:
2: procedure RSTD(k, h, β, {V^τ_{h+1}}_{τ∈[k]}, γ_L, λ)
3:   Λ^k_h ← Σ_{τ∈[k−1]} φ^τ_h(s^τ_h, a^τ_h) φ^τ_h(s^τ_h, a^τ_h)⊤ + λ I_d
4:   w^k_h ← (Λ^k_h)^{−1} Σ_{τ∈[k−1]} φ^τ_h(s^τ_h, a^τ_h) · (e^{β·V^τ_{h+1}(s^τ_{h+1})} − 1)
5:   φ^k_h(·, ·) ← ∫_S ψ(·, ·, s′) · (e^{β·V^k_{h+1}(s′)} − 1) ds′
6:   b^k_h(·, ·) ← γ_L · [φ^k_h(·, ·)⊤ (Λ^k_h)^{−1} φ^k_h(·, ·)]^{1/2}
7:   return Q^k_h(·, ·) ← r_h(·, ·) + (1/β) log(q^k_{h,L}(·, ·)), where q^k_{h,L}(·, ·) is defined in (8)
8: end procedure

In Line 7, the term ⟨φ^k_h(·, ·), w^k_h⟩ + 1 serves as a correction of the shifted quantity e^{β·V^k_{h+1}(·)} − 1 in both φ^k_h(·, ·) and w^k_h, so as to align the structure of Line 7 with that of the Bellman equation (4). A proper definition of the bonus b^k_h (with γ_L therein formally given in Theorem 2 below) and the truncation at e^{β(H−h)} put q^k_{h,L} within [1, e^{β(H−h)}] entrywise. This ensures that the estimate Q^k_h(·, ·) is on the same scale as the optimal Q*_h(·, ·) ∈ [0, H − h + 1]. We note that there is a significant difference between Algorithm 2 and RSVI introduced in Fei et al. (2020). Since RSVI is designed only for the tabular setting, it suffices to set λ = 0 and update only the entries of Q^k_h whose corresponding state-action pairs have been visited in the history. In the linear setting, however, the entire Q^k_h is updated at once, and it is imperative to prevent singularity of the covariance matrix Λ^k_h. This boils down to a proper choice of the regularization parameter λ.
If λ is too small, Λ^k_h would be nearly singular; if λ is too large, the spectrum of Λ^k_h would be dominated by λ for prohibitively many episodes, with the algorithm making little progress in learning. Intuitively, since ‖φ^τ_h(·, ·)‖₂ ∝ |e^{βH} − 1|, an ideal λ should depend on β and be on the order of (e^{βH} − 1)², so that the spectra of the matrices φ^τ_h(·, ·) φ^τ_h(·, ·)⊤ and λ I_d are close. Indeed, this intuition guides our choice of λ (see Theorem 2). Algorithm 2 hence demonstrates a synergistic relationship between the surrogate features and regularization in a risk-sensitive fashion. This is in sharp contrast to existing algorithms for linear function approximation in the risk-neutral setting, such as Cai et al. (2019); Zhou et al. (2020), in which the design of surrogate features and the choice of regularization are decoupled. For general function approximation under Assumption 2, we present Risk-Sensitive Value Iteration for General function approximation, abbreviated as RSVI.G, in Algorithm 3 of Appendix B. Despite their apparent differences, Algorithms 2 and 3 both implement the principle of Risk-Sensitive Optimism in the Face of Uncertainty (RS-OFU) (Fei et al., 2020): they add a bonus/maximize over the confidence set when β > 0, and subtract a bonus/minimize over the confidence set when β < 0. Such a mechanism encourages the algorithms to explore actions that may have rarely been taken due to low estimated Q-values, while accounting for risk sensitivity at the same time.
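As an illustration of the RSTD subroutine of RSVI.L, the following sketch (ours) implements the ridge regression, bonus, and truncation steps for a single (s, a) query; the shapes, the extra clamping, and the variable names are our assumptions for the example:

```python
import numpy as np

def rstd_linear(phi_hist, targets, phi_query, beta, H, h, lam, gamma_L):
    """Sketch of one RSTD call in RSVI.L (Algorithm 2) under assumed shapes.
    phi_hist: (k-1, d) past surrogate features phi^tau_h(s^tau_h, a^tau_h);
    targets:  (k-1,) shifted-exponential targets e^{beta*V^tau_{h+1}} - 1;
    phi_query: (d,) surrogate feature at the (s, a) being evaluated.
    Returns the log-term (1/beta) log q^k_{h,L}, to be added to r_h(s, a)."""
    d = phi_query.shape[0]
    Lam = phi_hist.T @ phi_hist + lam * np.eye(d)           # Lambda^k_h (Line 3)
    w = np.linalg.solve(Lam, phi_hist.T @ targets)          # ridge solution w^k_h (Line 4)
    bonus = gamma_L * np.sqrt(phi_query @ np.linalg.solve(Lam, phi_query))  # b^k_h (Line 6)
    if beta > 0:   # optimism: add the bonus, truncate above at e^{beta(H-h)}
        q = min(np.exp(beta * (H - h)), phi_query @ w + 1.0 + bonus)
        q = max(q, 1.0)    # extra clamp for numerical safety; the paper's
    else:          # beta < 0: subtract the bonus, truncate below
        q = max(np.exp(beta * (H - h)), phi_query @ w + 1.0 - bonus)
        q = min(q, 1.0)    # choice of gamma_L makes these clamps inactive w.h.p.
    return np.log(q) / beta

# One past transition with feature phi = [1] and target e^{0.5*1} - 1 makes the
# estimated log-term approach the true continuation value of 1.
out = rstd_linear(np.array([[1.0]]), np.array([np.exp(0.5) - 1.0]),
                  np.array([1.0]), beta=0.5, H=2, h=1, lam=1e-6, gamma_L=0.0)
assert abs(out - 1.0) < 1e-3
```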

5. MAIN RESULTS

In this section, we present regret guarantees for our algorithms via a meta regret framework. We identify a risk-sensitive optimism condition, which certifies a certain form of optimism of an algorithm. Under the condition, we first provide a meta regret bound for Algorithm 1. We then instantiate the meta regret bound for Algorithms 2 and 3 under Assumptions 1 and 2, respectively.

5.1. META REGRET BOUND

Recall the iterates {Q^k_h} in Algorithm 1. For each tuple (k, h, s, a) ∈ [K] × [H] × S × A, we define

Q̄^k_h(s, a) := r_h(s, a) + (1/β) log{E_{s′∼P_h(· | s,a)}[e^{β·V^k_{h+1}(s′)}]}.

It can be seen that {Q̄^k_h} are the ideal counterparts of {Q^k_h} that could be constructed if the transition kernels {P_h} were known. We set forth the following condition, which is the central component of our meta regret framework.

Condition 1 For all (k, h, s, a) ∈ [K] × [H] × S × A, we have Q^k_h(s, a) ∈ [0, H − h + 1], and there exist some quantities m^k_h(s, a) > 0, g ≥ 1 and a universal constant c > 0 such that

0 ≤ Q^k_h(s, a) − Q̄^k_h(s, a) ≤ c · ((e^{|β|H} − 1)/|β|) · g · m^k_h(s, a).

Since {Q^k_h} are informally "optimistic" estimates of the ideal {Q̄^k_h}, the difference Q^k_h(s, a) − Q̄^k_h(s, a) in Condition 1 may be thought of as the level of optimism maintained by the algorithm for state-action pair (s, a) in step h of episode k, with its upper bound depending on risk sensitivity through the factor (e^{|β|H} − 1)/|β|. Therefore, we say that Condition 1 is a risk-sensitive optimism condition. In the upper bound of the condition, the actual values of g and m^k_h(s, a) may depend on the function approximation and the implementation of the abstract function RSTD. Recall that {(s^k_h, a^k_h)} are the state-action pairs visited by Algorithm 1; we are now ready to state the meta regret bound.

Theorem 1 Let M := g · Σ_{k∈[K]} Σ_{h∈[H]} min{1, m^k_h(s^k_h, a^k_h)}, where g and {m^k_h} are as given in Condition 1. On the event of Condition 1, for any δ ∈ (0, 1], with probability at least 1 − δ the regret of Algorithm 1 satisfies

Regret(K) ≲ ((e^{|β|H} − 1)/|β|) · e^{|β|H²} · M + e^{|β|H²} · √(KH³ log(1/δ)).

The proof is given in Appendix D. Even though the actual form of M depends on the specific function approximation, the derivation of Theorem 1 only requires the structure of Algorithm 1, which is agnostic of function approximation.
In the above bound, the first term can be interpreted as the total optimism maintained by Algorithm 1, and is in fact a direct consequence of Condition 1. The second term can be seen as the total drift of the iterates {V^k_h} from the value functions {V^{π^k}_h}, which results from a martingale analysis. The factor e^{|β|H²} shared by both terms is due to a local linearization of the nonlinear objective (2), as well as a standard backward-induction analysis of H-horizon MDPs. We will show shortly that M = Õ(K^{1/2}) under both linear and general function approximation, so the first term dominates the regret bound in Theorem 1. Similar to M, the exponential factor (e^{|β|H} − 1)/|β| also enters the bound through Condition 1. It has been shown to be a distinctive feature of risk-sensitive RL algorithms that represents a tradeoff between risk sensitivity and sample complexity; see Fei et al. (2020) for a detailed discussion on this point.

5.2. REGRET BOUND FOR LINEAR FUNCTION APPROXIMATION

We now present a regret bound for Algorithm 2, induced by Theorem 1. Let

γ_L = c_γ · |e^{βH} − 1| · √(d log(2dKH/δ)), (9)

where c_γ > 0 is an appropriate universal constant. We have the following result.

Theorem 2 Let γ_L of (9) and λ = (e^{βH} − 1)² be input to Algorithm 2, and let M be as defined in Theorem 1. Under Assumption 1, for any δ ∈ (0, 1], with probability at least 1 − δ, Condition 1 holds for Algorithm 2 so that M ≲ [d²KH² log²(2dKH/δ)]^{1/2}. Therefore, Theorem 1 implies that the regret of Algorithm 2 satisfies

Regret(K) ≲ ((e^{|β|H} − 1)/|β|) · e^{|β|H²} · √(d²KH² log²(2dKH/δ)).

The proof is given in Appendix E. One may obtain a regret bound for the tabular setting by taking d = |S|²|A| in Theorem 2, and the resulting bound nearly matches that of Fei et al. (2020, Theorem 1), except for the polynomial dependency on |S| and |A|.

Under review as a conference paper at ICLR 2021

The bound obtained from specializing Theorem 2 to the tabular setting is also nearly optimal for small |β| (with respect to |β|, K and H) in view of the lower bound

E[Regret(K)] ≳ ((e^{|β|H/2} − 1)/|β|) · √(K/log K) (10)

given by Fei et al. (2020, Theorem 3). In addition, as β → 0, the setting of risk-sensitive RL tends to that of standard risk-neutral RL. We have the following corollary as a precise characterization of Theorem 2 in that regime.

Corollary 1 Under the setting of Theorem 2 and as β → 0, with probability at least 1 − δ, the regret of Algorithm 2 satisfies Regret(K) ≲ √(d²KH⁴ log²(2dKH/δ)).

The proof is given in Appendix F. Corollary 1 matches the standard result in the risk-neutral setting, e.g., Cai et al. (2019, Theorem 3.1), up to logarithmic factors.

5.3. REGRET BOUND FOR GENERAL FUNCTION APPROXIMATION

To present the regret guarantee for general function approximation, we need a few additional notations. Recall the function set P from Assumption 2. For any P ∈ P, (s, a) ∈ S × A and V : S → [0, H], we define the function set Z := {z_P : P ∈ P}, where

z_P(s, a, V) := sign(β) · ∫_S P(s′ | s, a) · (e^{β·V(s′)} − 1) ds′.

For any P, P′ ∈ P, we define ‖P − P′‖_{∞,1} := sup_{(s,a)∈S×A} ‖P(· | s, a) − P′(· | s, a)‖₁. We let d_E := dim_E(Z, |e^{βH} − 1|/K) be the (|e^{βH} − 1|/K)-eluder dimension of the function set Z, and

ζ := log[H · N_{1/K}(P, ‖·‖_{∞,1})/δ] + log(4K²H/δ).

In Algorithm 3, we set

γ_G = 10 · |e^{βH} − 1| · √ζ. (13)

We now state our result for Algorithm 3, which is another instantiation of Theorem 1.

Theorem 3 Let γ_G of (13) be input to Algorithm 3 and M be as defined in Theorem 1. Under Assumption 2, for any δ ∈ (0, 1], with probability at least 1 − δ, Condition 1 holds for Algorithm 3 so that M ≲ H·min{d_E, K} + √(d_E KH² ζ). Therefore, Theorem 1 implies that the regret of Algorithm 3 satisfies

Regret(K) ≲ ((e^{|β|H} − 1)/|β|) · e^{|β|H²} · [H·min{d_E, K} + √(d_E KH² ζ)].

The proof is given in Appendix G. The term H·min{d_E, K} above also appears in the regret bound for the multi-armed bandit problem with general function approximation (Russo & Van Roy, 2014), which is a special case of our finite-horizon episodic MDP setting. When K is sufficiently large, we have H·min{d_E, K} ≲ √(d_E KH² ζ), and therefore Theorem 3 yields Regret(K) = ((e^{|β|H} − 1)/|β|) · e^{|β|H²} · Õ(√(d_E KH²)). In the case that the transition kernels in P take the linear form of Assumption 1, we have d_E ≲ d log K and log N_{1/K}(P, ‖·‖_{∞,1}) ≲ d log K, so that ζ ≲ d log(KH/δ). Then, for sufficiently large K, the bound in Theorem 3 matches that in Theorem 2 up to a logarithmic factor. On the other hand, under the linear setting of Assumption 1, Algorithm 2 may be more efficient in runtime and memory than Algorithm 3, as discussed in Appendix B.

A ELUDER DIMENSION

To introduce the eluder dimension, we first need the concept of ε-independence.

Definition 1 For any ε > 0 and function set G whose elements have domain X, we say that an x ∈ X is ε-dependent on the set of elements X_n := {x_1, x_2, . . . , x_n} ⊂ X with respect to G if any pair of functions g, g′ ∈ G satisfying Σ_{i∈[n]} (g(x_i) − g′(x_i))² ≤ ε² also satisfies |g(x) − g′(x)| ≤ ε. We say that x is ε-independent of X_n with respect to G if x is not ε-dependent on X_n with respect to G.

Hence, ε-independence characterizes a notion of dissimilarity of a point x to the elements of the subset X_n, with respect to the function set G. We are now ready to formally define the eluder dimension, which quantifies the length of the longest possible chain of dissimilar elements in a domain.

Definition 2 For any ε > 0 and function set G whose elements have domain X, the ε-eluder dimension dim_E(G, ε) is defined as the length d of the longest sequence of elements in X such that, for some ε′ ≥ ε, every element is ε′-independent of its predecessors.

The eluder dimension extends the concept of dimension in linear spaces and generalizes it to nonlinear function spaces. It is also related to the notions of Kolmogorov dimension and VC dimension. We refer readers to Russo & Van Roy (2014) for further details on the eluder dimension and its advantages compared to other complexity measures.
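The definition of ε-dependence can be checked by brute force when the function class is finite; the following toy sketch (ours; the function classes of interest in the paper are infinite) illustrates Definition 1:

```python
import numpy as np

def eps_dependent(x, X_n, G, eps):
    """Brute-force check of Definition 1's eps-dependence for a finite class G.
    x: query point; X_n: list of previously observed points; G: list of callables."""
    for g in G:
        for g2 in G:
            in_sample = sum((g(xi) - g2(xi)) ** 2 for xi in X_n)
            if in_sample <= eps ** 2 and abs(g(x) - g2(x)) > eps:
                return False   # a pair agrees on X_n but disagrees at x
    return True

# Toy class of linear functions on R: g_w(x) = w * x, with w in {0, 0.5, 1}.
G = [lambda x, w=w: w * x for w in (0.0, 0.5, 1.0)]
# After observing x = 1, any pair agreeing there also agrees at x = 1 again:
assert eps_dependent(1.0, [1.0], G, eps=0.1)
# But x = 1 is eps-independent of the empty history (every pair trivially
# has zero in-sample error, yet some pairs disagree at x = 1):
assert not eps_dependent(1.0, [], G, eps=0.1)
```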

B MORE ON ALGORITHMS

We present details of Algorithm 3 that were omitted from the main text. For each (k, h) ∈ [K] × [H], transition kernels P, P′ ∈ P and estimated value functions {V^τ_{h+1}}_{τ∈[k−1]}, we define the squared error

Γ^k_h(P, P′) := Σ_{τ=1}^{k−1} [∫_S P(s′ | s^τ_h, a^τ_h) e^{β·V^τ_{h+1}(s′)} ds′ − ∫_S P′(s′ | s^τ_h, a^τ_h) e^{β·V^τ_{h+1}(s′)} ds′]². (14)

In Algorithm 3, Line 3 computes an estimate P̂^k_h of the transition kernel P_h by solving a least-squares problem similar to (7). Line 4 then constructs a confidence set around the estimate P̂^k_h. This step is reminiscent of UCRL (Jaksch et al., 2010), where confidence sets are used to enforce the so-called Optimism in the Face of Uncertainty (OFU) principle for efficient exploration. Line 5 updates Q^k_h by solving an optimization problem over the confidence set P^k_h, with the definition

q^k_{h,G}(·, ·) := max_{P∈P^k_h} ∫_S P(s′ | ·, ·) e^{β·V^k_{h+1}(s′)} ds′, if β > 0; min_{P∈P^k_h} ∫_S P(s′ | ·, ·) e^{β·V^k_{h+1}(s′)} ds′, if β < 0. (15)

It is worth remarking that both Lines 3 and 4 implicitly operate with the shifted exponential transformation of estimated value functions: replacing e^{β·V^τ_{h+1}(·)} therein by e^{β·V^τ_{h+1}(·)} − 1 would not change the algorithm.

Computational aspects. We briefly discuss the computational aspects of Algorithms 2 and 3. It is not hard to see that the time and space complexities of Algorithm 2 are polynomial in the key model parameters d, K and H. For Algorithm 3, it is unclear how the complexities scale under Assumption 2, where the structure of P is unknown.
Algorithm 3 RSVI.G
Input: risk parameter β, number of episodes K, confidence width γ_G, function set P
1: Run Algorithm 1 with RSTD() therein overloaded by the following subroutine:
2: procedure RSTD(k, h, β, {V^τ_{h+1}}_{τ∈[k]}, γ_G, P)
3:   P̂^k_h ← argmin_{P∈P} Σ_{τ=1}^{k−1} (e^{β·V^τ_{h+1}(s^τ_{h+1})} − ∫_S P(s′ | s^τ_h, a^τ_h) e^{β·V^τ_{h+1}(s′)} ds′)²
4:   P^k_h ← {P ∈ P : Γ^k_h(P, P̂^k_h) ≤ γ²_G}, where Γ^k_h(·, ·) is defined in (14)
5:   return Q^k_h(·, ·) ← r_h(·, ·) + (1/β) log(q^k_{h,G}(·, ·)), where q^k_{h,G}(·, ·) is defined in (15)
6: end procedure

Nevertheless, under Assumption 1, in which the transition kernels admit a linear form, Algorithm 3 also attains polynomial time and space complexities with respect to the key model parameters. The actual runtime and memory consumption of Algorithm 3 may nonetheless be higher than those of Algorithm 2, since the construction of confidence sets in Algorithm 3 requires solving linear programs, which can be computationally cumbersome.
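The three lines of the RSTD subroutine above can be sketched over a finite candidate class as follows (an illustrative simplification of ours; the paper allows a general set P and continuous state spaces):

```python
import numpy as np

def rstd_general(P_class, history, V_next, beta, gamma_G):
    """Sketch of one RSTD call in RSVI.G (Algorithm 3) over a finite candidate
    class. Each P in P_class maps a pair (s, a) to an array of next-state
    probabilities; history holds observed (s, a, s') transitions."""
    def pred(P, s, a):                        # E_{s'~P(.|s,a)}[e^{beta * V(s')}]
        return float(P[(s, a)] @ np.exp(beta * V_next))

    # Line 3: least-squares fit of the transition model.
    losses = [sum((np.exp(beta * V_next[sn]) - pred(P, s, a)) ** 2
                  for (s, a, sn) in history) for P in P_class]
    P_hat = P_class[int(np.argmin(losses))]
    # Line 4: confidence set around the fit, via the squared error Gamma^k_h.
    conf = [P for P in P_class
            if sum((pred(P, s, a) - pred(P_hat, s, a)) ** 2
                   for (s, a, _) in history) <= gamma_G ** 2]
    # Line 5: risk-sensitive optimism over the confidence set.
    def q_value(s, a, r_sa):
        vals = [pred(P, s, a) for P in conf]
        q = max(vals) if beta > 0 else min(vals)
        return r_sa + np.log(q) / beta        # Q^k_h(s, a)
    return q_value

# Two candidate kernels on states {0, 1}; the data identify the true one.
P_true = {(0, 0): np.array([0.0, 1.0])}      # always moves to state 1
P_alt = {(0, 0): np.array([1.0, 0.0])}       # always stays in state 0
Q = rstd_general([P_true, P_alt], [(0, 0, 1)], V_next=np.array([0.0, 1.0]),
                 beta=1.0, gamma_G=100.0)
assert np.isclose(Q(0, 0, 0.0), 1.0)         # optimism picks q = e, so Q = 1
```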

C PRELIMINARIES TO PROOFS

We fix a tuple (k, h, s, a) ∈ [K] × [H] × S × A and a policy π. For Algorithm 1 (which subsumes both Algorithms 2 and 3), define

q_2 = q^k_{h,2}(s, a) := E_{s′∼P_h(· | s,a)}[e^{β·V^k_{h+1}(s′)}], (16)
q_3 = q^k_{h,3}(s, a) := E_{s′∼P_h(· | s,a)}[e^{β·V^π_{h+1}(s′)}]. (17)

In the above definitions, note that q_2 and q_3 depend on (k, h, s, a); we suppress this dependency for notational simplicity. We have the following bounds on q_2 and q_3.

Lemma 1 We have q_2, q_3 ∈ [min{1, e^{β(H−h)}}, max{1, e^{β(H−h)}}].

Proof. The result for q_2 and q_3 follows from their definitions and the fact that e^{β·V(·)} ∈ [min{1, e^{β(H−h)}}, max{1, e^{β(H−h)}}] for any V : S → [0, H − h].

For any q′ > 0, define

G_1(q′) := (1/β) log{q′} − (1/β) log{q_2}, G_2 := (1/β) log{q_2} − (1/β) log{q_3}. (18)

Since q_2, q_3 > 0 by definition, G_1(q′) and G_2 are well defined. By the Bellman equation (4), for any π we have Q^π_h(s, a) = r_h(s, a) + (1/β) log E_{s′∼P_h(· | s,a)}[e^{β·V^π_{h+1}(s′)}]. We restate Condition 1 in the following.

Condition 2 (Restatement of Condition 1) Let G_1(q′) be as defined in (18). For all (k, h, s, a) ∈ [K] × [H] × S × A, assume Q^k_h(s, a) ∈ [0, H] in Algorithm 1; then there exist some quantities g ≥ 1, m^k_h = m^k_h(s, a) > 0, q_1 = q^k_{h,1}(s, a) ≥ min{1, e^{β(H−h)}} and a universal constant c_1 > 0 such that

G_1(q^k_{h,1}(s, a)) = Q^k_h(s, a) − (1/β) log{q^k_{h,2}(s, a)}, 0 ≤ G_1(q^k_{h,1}(s, a)) ≤ c_1 · ((e^{|β|H} − 1)/|β|) · g · m^k_h.

It can be seen that G_1(q^k_{h,1}(s, a)) = Q^k_h(s, a) − Q̄^k_h(s, a), where Q̄^k_h is the ideal counterpart of Q^k_h defined in Section 5. Under Condition 2, it holds that

(Q^k_h − Q^π_h)(s, a) = (1/β) log{q_1} − (1/β) log{q_3} = G_1 + G_2, (19)

by the construction of Q^k_h in the algorithms, where we have let G_1 := G_1(q_1). Condition 2 has unspecified quantities q_1, {m^k_h} and g. The condition, along with the quantities therein, will be verified in Lemmas 6 and 11 under Assumptions 1 and 2, respectively.
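As a numeric sanity check of Lemma 1 (ours, not part of the proof): any convex combination of values e^{β·V(s′)} with V taking values in [0, H − h] stays inside the claimed interval.

```python
import numpy as np

# For arbitrary V : S -> [0, H-h] and next-state distribution p, the quantity
# q = E_{s'~p}[e^{beta*V(s')}] lies in [min{1, e^{beta(H-h)}}, max{1, e^{beta(H-h)}}].
rng = np.random.default_rng(0)
for beta in (-0.7, 0.3):
    Hh = 3.0                              # H - h
    V = rng.uniform(0.0, Hh, size=5)      # arbitrary value-function values
    p = rng.dirichlet(np.ones(5))         # arbitrary next-state distribution
    q = p @ np.exp(beta * V)
    lo, hi = min(1.0, np.exp(beta * Hh)), max(1.0, np.exp(beta * Hh))
    assert lo <= q <= hi
```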
For now let us focus on $G_2$; we need the following simple result to control it.

Fact 1 Consider $x, y, b \in \mathbb{R}$ such that $x \ge y$. (a) If $y \ge g_0$ for some $g_0 > 0$, then $\log(x) - \log(y) \le \tfrac{1}{g_0}(x-y)$. (b) Assume further that $y \ge 0$. If $b \ge 0$ and $x \le u$ for some $u > 0$, then $e^{bx} - e^{by} \le b\, e^{bu}(x-y)$; if $b < 0$, then $e^{by} - e^{bx} \le (-b)(x-y)$.

Proof. The results follow from the Lipschitz continuity of the functions $x\mapsto\log(x)$ (on $[g_0,\infty)$) and $x\mapsto e^{bx}$.

We next control $G_2$; the proof is agnostic to function approximation.

Lemma 2 For each $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$, if $V_{h+1}^k(s') \ge V_{h+1}^\pi(s')$ for all $s'\in\mathcal{S}$, then
$0 \le G_2 \le e^{|\beta|H}\cdot\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[V_{h+1}^k(s') - V_{h+1}^\pi(s')]$.

Proof. Case $\beta > 0$. The assumption $V_{h+1}^k(s') \ge V_{h+1}^\pi(s')$ for all $s'\in\mathcal{S}$ implies $q_2 \ge q_3$ (by the definitions of $q_2$ and $q_3$ in (16) and (17)), and therefore $G_2 \ge 0$ by the definition (18). We also have
$G_2 \le \tfrac{1}{\beta}(q_2 - q_3) \le e^{|\beta|H}\,\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[V_{h+1}^k(s') - V_{h+1}^\pi(s')]$,
where the first step holds by Fact 1(a) (with $g_0 = 1$, $x = q_2$, and $y = q_3$) and the fact that $q_2 \ge q_3 \ge 1$ (implied by Lemma 1), and the second step holds by Fact 1(b) (with $b = \beta$, $x = V_{h+1}^k(s')$, $y = V_{h+1}^\pi(s')$, and $u = H$) and $H \ge V_{h+1}^k(s') \ge V_{h+1}^\pi(s') \ge 0$.

Case $\beta < 0$. The assumption $V_{h+1}^k(s') \ge V_{h+1}^\pi(s')$ for all $s'\in\mathcal{S}$ implies $q_2 \le q_3$, and therefore $G_2 \ge 0$ due to its definition (18). We also have
$G_2 = \tfrac{1}{-\beta}(\log\{q_3\} - \log\{q_2\}) \le \tfrac{e^{-\beta H}}{-\beta}(q_3 - q_2) \le e^{|\beta|H}\,\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[V_{h+1}^k(s') - V_{h+1}^\pi(s')]$,
where the second step holds by Fact 1(a) (with $g_0 = e^{\beta H}$, $x = q_3$, and $y = q_2$) and the fact that $q_3 \ge q_2 \ge e^{\beta H}$ (implied by Lemma 1), and the third step holds by Fact 1(b) (with $b = \beta$, $x = V_{h+1}^k(s')$, and $y = V_{h+1}^\pi(s')$) and $V_{h+1}^k(s') \ge V_{h+1}^\pi(s') \ge 0$.

With the help of Lemma 2, we can show the "optimism" of $Q_h^k$ in the following sense.

Lemma 3 Suppose (19) holds with $G_1 \ge 0$. We have $Q_h^k(s,a) \ge Q_h^\pi(s,a)$ for all $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$.

Proof.
For the purpose of the proof, we set $Q_{H+1}^\pi(s,a) = Q_{H+1}^*(s,a) = 0$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$. We fix a tuple $(k,s,a)\in[K]\times\mathcal{S}\times\mathcal{A}$ and use strong induction on $h$. The base case $h = H+1$ is satisfied since $(Q_{H+1}^k - Q_{H+1}^\pi)(s,a) = 0$ for $k\in[K]$ by definition. Now we fix an $h\in[H]$ and assume that $0 \le (Q_{h+1}^k - Q_{h+1}^\pi)(s,a)$. By the induction assumption we have
$V_{h+1}^k(s) = \max_{a'\in\mathcal{A}} Q_{h+1}^k(s,a') \ge \max_{a'\in\mathcal{A}} Q_{h+1}^\pi(s,a') \ge V_{h+1}^\pi(s)$. (20)
Applying (20) to Lemma 2 yields $G_2 \ge 0$. Since $G_1 \ge 0$ by assumption, it follows from (19) that $(Q_h^k - Q_h^\pi)(s,a) \ge 0$. The induction is completed and so is the proof.

Lemma 3 implies an immediate but important corollary.

Lemma 4 Suppose (19) holds with $G_1 \ge 0$. We have $V_h^k(s) \ge V_h^\pi(s)$ for all $(k,h,s)\in[K]\times[H]\times\mathcal{S}$.

Proof. The result follows from Lemma 3 and Equation (20).

We now have all the keys to proving the meta regret bound.
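The two Lipschitz bounds of Fact 1, which drive the case analysis above, can also be spot-checked numerically (a quick sketch with random inputs; the tolerance `1e-12` only absorbs floating-point error):

```python
import math
import random

def fact1a(x, y, g0):
    # Fact 1(a): log is (1/g0)-Lipschitz on [g0, infinity)
    return math.log(x) - math.log(y) <= (x - y) / g0 + 1e-12

def fact1b(x, y, b, u=None):
    if b >= 0:
        # Fact 1(b), b >= 0: e^{bt} is (b e^{bu})-Lipschitz on (-inf, u]
        return math.exp(b * x) - math.exp(b * y) <= b * math.exp(b * u) * (x - y) + 1e-12
    # Fact 1(b), b < 0: for y >= 0, e^{bt} is (-b)-Lipschitz on [0, infinity)
    return math.exp(b * y) - math.exp(b * x) <= (-b) * (x - y) + 1e-12

rng = random.Random(1)
for _ in range(1000):
    y = rng.uniform(0.5, 5.0)
    x = y + rng.uniform(0.0, 5.0)
    assert fact1a(x, y, g0=0.5)
    assert fact1b(x, y, b=rng.uniform(0.0, 2.0), u=x)
    assert fact1b(x, y, b=-rng.uniform(0.0, 2.0))
```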

D PROOF OF THEOREM 1

We work on the event of Condition 2 (which is a restatement of Condition 1), on which $g$ and $\{m_h^k\}$ are defined; this means (19) also holds. Define
$\delta_h^k := V_h^k(s_h^k) - V_h^{\pi_k}(s_h^k)$, $\quad \zeta_{h+1}^k := \mathbb{E}_{s'\sim P_h(\cdot\mid s_h^k, a_h^k)}[V_{h+1}^k(s') - V_{h+1}^{\pi_k}(s')] - \delta_{h+1}^k$.
Let $\{m_h^k\}$ be as defined in Condition 2. For any $(k,h)\in[K]\times[H]$, we have
$\delta_h^k = (Q_h^k - Q_h^{\pi_k})(s_h^k, a_h^k) \le \min\{H,\ (Q_h^k - Q_h^{\pi_k})(s_h^k, a_h^k)\}$ (21)
$\le \min\{H,\ c_1\cdot\tfrac{e^{|\beta|H}-1}{|\beta|}\cdot g\cdot m_h^k(s_h^k,a_h^k) + e^{|\beta|H}\cdot\mathbb{E}_{s'\sim P_h(\cdot\mid s_h^k,a_h^k)}[V_{h+1}^k(s') - V_{h+1}^{\pi_k}(s')]\}$
$\le c_1\cdot\tfrac{e^{|\beta|H}-1}{|\beta|}\cdot g\cdot\min\{1, m_h^k(s_h^k,a_h^k)\} + e^{|\beta|H}\cdot\mathbb{E}_{s'\sim P_h(\cdot\mid s_h^k,a_h^k)}[V_{h+1}^k(s') - V_{h+1}^{\pi_k}(s')]$
$= c_1\cdot\tfrac{e^{|\beta|H}-1}{|\beta|}\cdot g\cdot\min\{1, m_h^k(s_h^k,a_h^k)\} + e^{|\beta|H}(\delta_{h+1}^k + \zeta_{h+1}^k)$. (23)
In the above display, the first step holds by the construction of Algorithm 1 and the definition of $V_h^{\pi_k}$ in (4); the second step is due to the fact that $Q_h^k(\cdot,\cdot) \le H$ and $Q_h^{\pi_k}(\cdot,\cdot) \ge 0$; the third step holds by (19) combined with Condition 2 and Lemma 2; the fourth step holds since $c_1, g \ge 1$ and $\tfrac{e^{|\beta|H}-1}{|\beta|} \ge H$; the last step follows from the definitions of $\delta_h^k$ and $\zeta_{h+1}^k$.

Recalling from Algorithm 1 and the Bellman equation (4) that $V_{H+1}^k(s) = V_{H+1}^{\pi_k}(s) = 0$, and noting that $\delta_{h+1}^k + \zeta_{h+1}^k \ge 0$ (implied by Lemma 4), we can expand the recursion in (23) to get
$\delta_1^k \le \sum_{h\in[H]} e^{|\beta|Hh}\,\zeta_{h+1}^k + c_1\cdot\tfrac{e^{|\beta|H}-1}{|\beta|}\cdot\sum_{h\in[H]} e^{|\beta|H(h-1)}\, g\cdot\min\{1, m_h^k(s_h^k,a_h^k)\}$. (24)
Therefore, we have
$\mathrm{Regret}(K) = \sum_{k\in[K]} (V_1^* - V_1^{\pi_k})(s_1^k) \le \sum_{k\in[K]} \delta_1^k \le e^{|\beta|H^2}\sum_{k\in[K]}\sum_{h\in[H]}\zeta_{h+1}^k + c_1\cdot\tfrac{e^{|\beta|H}-1}{|\beta|}\cdot e^{|\beta|H^2}\cdot g\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\}$, (25)
where the second step holds by Lemma 4 with $\pi$ therein set to the optimal policy, and in the last step we have applied (24) along with the Hölder inequality (note $e^{|\beta|Hh} \le e^{|\beta|H^2}$ for $h\in[H]$).

We proceed to control the first term in (25). Since the construction of $V_h^k$ is independent of the new observation $s_h^k$ in episode $k$, $\{\zeta_{h+1}^k\}$ is a martingale difference sequence satisfying $|\zeta_h^k| \le 2H$ for all $(k,h)\in[K]\times[H]$.
By the Azuma-Hoeffding inequality, we have for any $t > 0$,
$\Pr\big[\sum_{k\in[K]}\sum_{h\in[H]}\zeta_{h+1}^k \ge t\big] \le \exp\big(-\tfrac{t^2}{8TH^2}\big)$,
where $T := KH$. Hence, with probability at least $1-\delta/2$, there holds
$\sum_{k\in[K]}\sum_{h\in[H]}\zeta_{h+1}^k \le 2H\sqrt{2T\log(2/\delta)}$. (26)
Finally, plugging (26) into (25) yields
$\mathrm{Regret}(K) \le e^{|\beta|H^2}\cdot 2H\sqrt{2T\log(2/\delta)} + c_1\cdot\tfrac{e^{|\beta|H}-1}{|\beta|}\cdot e^{|\beta|H^2}\cdot g\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\}$.
We then rescale $\delta$ properly and finish the proof of Theorem 1.

To obtain regret bounds for Algorithms 2 and 3, it remains only to specify the quantities $g$ and $\sum_{k\in[K]}\sum_{h\in[H]} m_h^k(s_h^k,a_h^k)$ in Condition 2. The next result prepares us for this quest.

Lemma 5 Let $q_2 = q_{h,2}^k(s,a)$ be as defined in (16). Suppose that for each $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$, there exist quantities $g\ge1$, $m_h^k = m_h^k(s,a) > 0$, $q_1 = q_{h,1}^k(s,a) \ge \min\{1, e^{\beta(H-h)}\}$ and a universal constant $c_1 \ge 1$ such that $G_1(q_{h,1}^k(s,a)) = Q_h^k(s,a) - \tfrac{1}{\beta}\log\{q_{h,2}^k(s,a)\}$, and
$0 \le q_1 - q_2 \le c_1\,|e^{\beta H}-1|\, g\, m_h^k$ for $\beta > 0$, $\quad 0 \le q_2 - q_1 \le c_1\,|e^{\beta H}-1|\, g\, m_h^k$ for $\beta < 0$.
Then Condition 2 holds with the aforementioned $c_1$, $g$, $m_h^k$ and $q_1$.

Proof. Case $\beta > 0$. By the definition of $G_1$ in (18), the assumption $0 \le q_1 - q_2$ implies $G_1 \ge 0$. Moreover, Lemma 1 and Fact 1(a) (with $g_0 = 1$, $x = q_1$ and $y = q_2$) together imply $G_1 \le \tfrac{1}{\beta}(q_1 - q_2)$. Invoking the upper bound on $q_1 - q_2$ for $\beta > 0$ in the assumption completes the proof for this case.

Case $\beta < 0$. By the definition of $G_1$ in (18), the assumption $0 \le q_2 - q_1$ implies $G_1 \ge 0$. Furthermore, by Lemma 1 and Fact 1(a) (with $g_0 = e^{\beta H}$, $x = q_2$ and $y = q_1$), we further have
$G_1 = \tfrac{1}{-\beta}(\log\{q_2\} - \log\{q_1\}) \le \tfrac{e^{-\beta H}}{|\beta|}(q_2 - q_1)$.
Invoking the upper bound on $q_2 - q_1$ and the fact that $|e^{\beta H}-1| = 1 - e^{\beta H}$ for $\beta < 0$ completes the proof for this case.
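The Azuma-Hoeffding step can be illustrated numerically (a sketch on a synthetic Rademacher martingale; in the proof the differences are bounded by $2H$, so the constant `c` below plays the role of $2H$ there):

```python
import math
import random

def azuma_bound(T, c, t):
    """Azuma-Hoeffding: P(S_T >= t) <= exp(-t^2 / (2 T c^2))
    for a martingale with differences bounded by c in absolute value."""
    return math.exp(-t ** 2 / (2 * T * c ** 2))

def empirical_tail(T, c, t, trials=2000, seed=0):
    """Empirical estimate of P(S_T >= t) for a +/-c Rademacher martingale."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        s = sum(rng.choice((-c, c)) for _ in range(T))
        if s >= t:
            hits += 1
    return hits / trials

T, c = 200, 1.0
t = 2.0 * math.sqrt(T)  # two standard deviations of S_T
assert empirical_tail(T, c, t) <= azuma_bound(T, c, t) + 0.05
```

The `+ 0.05` slack only accounts for Monte-Carlo noise in the empirical frequency; the inequality itself is the concentration bound used in (26).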

E PROOF OF THEOREM 2

First of all, it can be seen that Line 7 of Algorithm 2 and the initial condition $V_{H+1}^k(\cdot) = 0$ in Algorithm 1 ensure that $Q_h^k(s,a)\in[0, H-h+1]$ for all $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$. We let $\lambda = |e^{\beta H}-1|^2$ and set $\gamma_L$ as in (9) for Algorithm 2. We define
$q_1^+ = q_{h,1}^{k,+}(s,a) := \begin{cases} \langle\phi_h^k(s,a), w_h^k\rangle + 1 + b_h^k(s,a), & \beta > 0,\\ \langle\phi_h^k(s,a), w_h^k\rangle + 1 - b_h^k(s,a), & \beta < 0; \end{cases}$ (27)
$q_1 = q_{h,1}^k(s,a) := \begin{cases} \min\{e^{\beta(H-h)}, q_1^+\}, & \beta > 0,\\ \max\{e^{\beta(H-h)}, q_1^+\}, & \beta < 0. \end{cases}$ (28)
Indeed, $q_1$ defined above is equivalent to $q_{h,L}^k$ defined in (8). It can also be verified that $G_1(q_1) = Q_h^k(s,a) - \tfrac{1}{\beta}\log\{q_2\}$, where $G_1(\cdot)$ is defined in (18) and $q_2$ is defined in (16). We have the following result, which shows that Algorithm 2 satisfies Condition 2 (a restatement of Condition 1) with high probability.

Lemma 6 Under Assumption 1, for any $\delta\in(0,1]$, with probability at least $1-\delta$, Condition 2 holds for Algorithm 2 with a universal constant $c_1 \ge 1$, $g = \sqrt{d\log(2dKH/\delta)}$, $m_h^k(s,a) = \sqrt{\phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\phi_h^k(s,a)}$, and $q_1$ as defined in (28).

Proof. Let us fix a tuple $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$.
Then, we have
$\langle\phi_h^k(s,a), w_h^k\rangle + 1 = \phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\big[\sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\cdot(e^{\beta\cdot V_{h+1}^\tau(s_{h+1}^\tau)} - 1)\big] + 1$ (29)
by Line 4 of Algorithm 2, and
$\mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\, e^{\beta\cdot V_{h+1}^k(s')} = \int_{\mathcal{S}}\psi(s,a,s')^\top\theta_h\,(e^{\beta\cdot V_{h+1}^k(s')}-1)\,ds' + 1 = \phi_h^k(s,a)^\top\theta_h + 1$
$= \phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\Lambda_h^k\theta_h + 1$
$= \phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\big[\sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\phi_h^\tau(s_h^\tau,a_h^\tau)^\top\theta_h + \lambda\,\theta_h\big] + 1$
$= \phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\big[\sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\cdot\mathbb{E}_{s'\sim P_h(\cdot\mid s_h^\tau,a_h^\tau)}[e^{\beta\cdot V_{h+1}^\tau(s')} - 1] + \lambda\,\theta_h\big] + 1$, (30)
where the first step follows from Assumption 1, the second step holds by Line 5 of Algorithm 2, the third step holds since $\Lambda_h^k$ is positive definite by construction, the fourth step holds by Line 3 of Algorithm 2, and the last step holds since
$\phi_h^\tau(s,a)^\top\theta_h = \int_{\mathcal{S}}\psi(s,a,s')^\top\theta_h\cdot(e^{\beta\cdot V_{h+1}^\tau(s')}-1)\,ds' = \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}[e^{\beta\cdot V_{h+1}^\tau(s')} - 1]$ for $\tau\in[K]$,
which is due to Assumption 1. Now we consider the cases $\beta > 0$ and $\beta < 0$ separately.

Case $\beta > 0$. Recall $q_1^+$ defined in (27) and $q_2$ in (16). To control $G_1$, we can compute
$q_1^+ - q_2 - b_h^k(s,a) = \langle\phi_h^k(s,a), w_h^k\rangle + 1 - \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\, e^{\beta\cdot V_{h+1}^k(s')} = S_1 - S_2 \le |S_1| + |S_2|$, (31)
where
$S_1 := \phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\big[\sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\cdot\big(e^{\beta\cdot V_{h+1}^\tau(s_{h+1}^\tau)} - \mathbb{E}_{s'\sim P_h(\cdot\mid s_h^\tau,a_h^\tau)}[e^{\beta\cdot V_{h+1}^\tau(s')}]\big)\big]$, $\quad S_2 := \lambda\cdot\phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\theta_h$;
the first step holds by the definitions of $q_1^+$ and $q_2$, and the second step is implied by (29) and (30). We control each of $S_1$ and $S_2$. For $S_1$, we have
$|S_1| \le \big\|\sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\cdot\big(e^{\beta\cdot V_{h+1}^\tau(s_{h+1}^\tau)} - \mathbb{E}_{s'\sim P_h(\cdot\mid s_h^\tau,a_h^\tau)}\, e^{\beta\cdot V_{h+1}^\tau(s')}\big)\big\|_{(\Lambda_h^k)^{-1}}\cdot\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}$
by the Cauchy-Schwarz inequality. On the event of Lemma 8, we further have
$|S_1| \le c\,|e^{\beta H}-1|\sqrt{d\log(2dKH/\delta)}\cdot\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}$
for some universal constant $c > 0$.
Now for $S_2$, we have
$|S_2| \le \lambda\cdot\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}\cdot\|\theta_h\|_{(\Lambda_h^k)^{-1}} \le \sqrt{\lambda}\cdot\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}\cdot\|\theta_h\|_2 \le \sqrt{\lambda d}\cdot\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}$,
where the first step holds by the Cauchy-Schwarz inequality, the second step holds since $\Lambda_h^k \succeq \lambda I_d$, and the last step holds by Assumption 1, which gives $\|\theta_h\|_2 \le \sqrt{d}$. Plugging the bounds on $S_1$ and $S_2$ into (31), and using the fact that $\lambda = |e^{\beta H}-1|^2$ and the definition of $b_h^k$, we have
$\big|\langle\phi_h^k(s,a), w_h^k\rangle + 1 - \mathbb{E}_{s'\sim P_h(\cdot\mid s,a)}\, e^{\beta\cdot V_{h+1}^k(s')}\big| \le b_h^k(s,a)$. (32)
We choose $c_\gamma = c + 1$ in the definition of $b_h^k(s,a)$ in Line 6 of Algorithm 2, and we have
$0 \le q_1^+ - q_2 \le c_1\cdot|e^{\beta H}-1|\cdot g\cdot m_h^k(s,a)$, with $c_1 := 2c_\gamma$, $g := \sqrt{d\log(2dKH/\delta)}$ and $m_h^k(s,a) := \|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}$. (33)
Since (32) implies $q_1^+ \ge q_2$ and Lemma 1 implies $q_2 \le e^{\beta(H-h)}$, we can infer $q_1 \ge q_2$ from the definition $q_1 = \min\{e^{\beta(H-h)}, q_1^+\}$ in (28). Then we have $0 \le q_1 - q_2 \le q_1^+ - q_2$. By Lemma 1, we also have $q_1 \ge q_2 \ge 1$.

Case $\beta < 0$. Similarly to the case $\beta > 0$, we have
$q_2 - q_1^+ - b_h^k(s,a) \le c\,|e^{\beta H}-1|\sqrt{d\log(2dKH/\delta)}\cdot\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}}$.
If we choose $c_\gamma = c + 1$ in the definition of $b_h^k(s,a)$ in Line 6 of Algorithm 2, the above implies
$0 \le q_2 - q_1^+ \le c_1\cdot|e^{\beta H}-1|\cdot g\cdot m_h^k(s,a)$, with $c_1$, $g$ and $m_h^k$ as above.
Therefore, using the same reasoning as in the case $\beta > 0$, we have $0 \le q_2 - q_1 \le q_2 - q_1^+$. Also, by the definition of $q_1$ in (28), we have $q_1 \ge e^{\beta(H-h)}$. The proof is completed by invoking Lemma 5 and recalling the identity $\|\phi_h^k(s,a)\|_{(\Lambda_h^k)^{-1}} = \sqrt{\phi_h^k(s,a)^\top(\Lambda_h^k)^{-1}\phi_h^k(s,a)}$.

Next, we bound the quantity $\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\}$.

Lemma 7 Under Assumption 1, let $\{m_h^k(s,a)\}$ be as defined in Lemma 6. Then we have
$\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\} \le \sqrt{2dKH^2\iota}$, where $\iota = \log(2dK/\delta)$.

Proof. We have
$\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\} \le \sum_{h\in[H]}\sqrt{K\sum_{k\in[K]}\min\{1, \phi_h^k(s_h^k,a_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k(s_h^k,a_h^k)\}} \le H\sqrt{2dK\iota}$,
where the first step holds by the Cauchy-Schwarz inequality (together with $\min\{1,m\}^2 \le \min\{1, m^2\}$), and the last step holds by Lemma 10. Recall the definition of $M$ from Theorem 1; its upper bound can now be determined by combining Lemmas 6 and 7, which completes the proof of the theorem.
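The linear-case quantities manipulated above can be sketched in code. This is a hypothetical stand-in, not the paper's implementation: `phis` and `next_vals` are synthetic data, and the bonus is written as $c_\gamma\,\gamma_L\,\|\phi\|_{(\Lambda)^{-1}}$ as a simplified form of Line 6 of Algorithm 2.

```python
import numpy as np

def rstd_linear(beta, H, h, phis, next_vals, c_gamma, gamma_L):
    """Sketch of one RSTD call of Algorithm 2 (linear setting, beta != 0).
    phis      : (k-1, d) array of past features phi^tau_h(s, a)
    next_vals : (k-1,) array of realized V^tau_{h+1}(s^tau_{h+1})
    Returns a function mapping (phi, r) to (Q-estimate, bonus)."""
    lam = (np.exp(beta * H) - 1.0) ** 2                  # lambda = |e^{beta H} - 1|^2
    d = phis.shape[1]
    Lambda = lam * np.eye(d) + phis.T @ phis             # Line 3
    targets = np.exp(beta * next_vals) - 1.0
    w = np.linalg.solve(Lambda, phis.T @ targets)        # Line 4 (ridge regression)

    def q_and_bonus(phi, r):
        # Simplified bonus b^k_h: c_gamma * gamma_L * ||phi||_{Lambda^{-1}} (cf. Line 6).
        bonus = c_gamma * gamma_L * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
        # Optimism in the direction dictated by the sign of beta, as in (27).
        q_plus = phi @ w + 1.0 + np.sign(beta) * bonus
        edge = np.exp(beta * (H - h))
        q1 = min(edge, q_plus) if beta > 0 else max(edge, q_plus)  # clipping as in (28)
        return r + np.log(max(q1, 1e-12)) / beta, bonus  # 1e-12 is a numerical guard only

    return q_and_bonus
```

The sign flip on the bonus reflects that optimism for $\beta < 0$ requires shifting the exponentiated value estimate downward, matching the two branches of (27).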

E.1 AUXILIARY LEMMAS

We first present a concentration result.

Lemma 8 Let $\lambda = |e^{\beta H}-1|^2$ in Algorithm 2. There exists a universal constant $c > 0$ such that for any $\delta\in(0,1]$ and $(k,h)\in[K]\times[H]$, with probability at least $1-\delta$, we have
$\big\|\sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\cdot\big(e^{\beta\cdot V_{h+1}^\tau(s_{h+1}^\tau)} - \mathbb{E}_{s'\sim P_h(\cdot\mid s_h^\tau,a_h^\tau)}\, e^{\beta\cdot V_{h+1}^\tau(s')}\big)\big\|_{(\Lambda_h^k)^{-1}} \le c\,|e^{\beta H}-1|\sqrt{d\log(2dKH/\delta)}$.

Proof. The proof is adapted from that of Cai et al. (2019, Lemma D.1), using the fact that for any function $V:\mathcal{S}\to[0,H]$, we have
$\sup_{(s,a,s')\in\mathcal{S}\times\mathcal{A}\times\mathcal{S}}\big|e^{\beta\cdot V(s')} - \mathbb{E}_{s''\sim P_h(\cdot\mid s,a)}\, e^{\beta\cdot V(s'')}\big| \le |e^{\beta H}-1|$,
and that Assumption 1 implies $\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\big\|\int_{\mathcal{S}}\psi(s,a,s')\cdot(e^{\beta\cdot V(s')}-1)\,ds'\big\|_2 \le \sqrt{d}\,|e^{\beta H}-1|$.

The next few lemmas help control the sum of the terms $\{\phi_h^k(s_h^k,a_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k(s_h^k,a_h^k)\}$.

Lemma 9 (Cai et al. (2019, Lemma D.3)) Let $\{\phi_j\}_{j\ge1}$ be a sequence in $\mathbb{R}^d$. Let $\Lambda_0\in\mathbb{R}^{d\times d}$ be a positive-definite matrix and $\Lambda_t := \Lambda_0 + \sum_{j\in[t-1]}\phi_j\phi_j^\top$. Then for any $t\in\mathbb{Z}_{>0}$, we have
$\sum_{j\in[t]}\min\{1, \phi_j^\top\Lambda_j^{-1}\phi_j\} \le 2\log\tfrac{\det(\Lambda_{t+1})}{\det(\Lambda_1)}$.

Lemma 10 Let $\lambda = |e^{\beta H}-1|^2$ in Algorithm 2. For any $h\in[H]$, we have
$\sum_{k\in[K]}\min\{1, \phi_h^k(s_h^k,a_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k(s_h^k,a_h^k)\} \le 2d\iota$, where $\iota = \log(2dK/\delta)$.

Proof. By the construction of Algorithm 2, we may define $\Lambda_h^0 := \lambda I_d$, so that $\Lambda_h^k = \Lambda_h^0 + \sum_{\tau\in[k-1]}\phi_h^\tau(s_h^\tau,a_h^\tau)\phi_h^\tau(s_h^\tau,a_h^\tau)^\top$. Since $\|\phi_h^k(s_h^k,a_h^k)\|_2 \le \sqrt{d}\,|e^{\beta H}-1|$ for all $(k,h)\in[K]\times[H]$, as implied by Assumption 1, we have for any $h\in[H]$ that
$\Lambda_h^{K+1} = \sum_{k\in[K]}\phi_h^k(s_h^k,a_h^k)\phi_h^k(s_h^k,a_h^k)^\top + \lambda I_d \preceq (dK|e^{\beta H}-1|^2 + \lambda) I_d$.
Given $\lambda = |e^{\beta H}-1|^2$, we have for any $h\in[H]$ that
$\log\tfrac{\det(\Lambda_h^{K+1})}{\det(\Lambda_h^1)} \le d\log\tfrac{dK|e^{\beta H}-1|^2 + \lambda}{\lambda} \le d\log(dK+1) \le d\iota$.
We now apply Lemma 9 to get
$\sum_{k\in[K]}\min\{1, \phi_h^k(s_h^k,a_h^k)^\top(\Lambda_h^k)^{-1}\phi_h^k(s_h^k,a_h^k)\} \le 2\log\tfrac{\det(\Lambda_h^{K+1})}{\det(\Lambda_h^1)} \le 2d\iota$,
which completes the proof.
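Lemma 9 (the elliptical potential bound) holds deterministically for any feature sequence, which makes it easy to verify numerically; the following sketch uses random Gaussian features as hypothetical inputs.

```python
import numpy as np

def elliptical_potential_check(d=4, t=200, lam=1.0, seed=0):
    """Numerically verify Lemma 9:
    sum_j min{1, phi_j^T Lambda_j^{-1} phi_j} <= 2 log(det(Lambda_{t+1}) / det(Lambda_1))."""
    rng = np.random.default_rng(seed)
    Lambda = lam * np.eye(d)                 # Lambda_1 = Lambda_0 (no data yet)
    det0 = np.linalg.det(Lambda)
    total = 0.0
    for _ in range(t):
        phi = rng.normal(size=d)
        total += min(1.0, phi @ np.linalg.solve(Lambda, phi))
        Lambda = Lambda + np.outer(phi, phi) # Lambda_{j+1} = Lambda_j + phi_j phi_j^T
    rhs = 2.0 * np.log(np.linalg.det(Lambda) / det0)
    return total <= rhs

assert elliptical_potential_check()
```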

G PROOF OF THEOREM 3

First of all, it can be seen that Line 5 of Algorithm 3 and the initial condition $V_{H+1}^k(\cdot) = 0$ in Algorithm 1 ensure that $Q_h^k(s,a)\in[0, H-h+1]$ for all $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$. We fix a tuple $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$. Recall that $\gamma_G$ is as defined in (13). For Algorithm 3, we let
$q_1 = q_{h,1}^k(s,a) := \begin{cases} \max_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds', & \beta > 0,\\ \min_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds', & \beta < 0, \end{cases}$ (34)
which is equivalent to $q_{h,G}^k(s,a)$ defined in (15). Recall the definitions of $z_P$ and $\mathcal{Z}$ in (11). We have the result below, which verifies Condition 2 (a restatement of Condition 1) for Algorithm 3 under general function approximation.

Lemma 11 Under Assumption 2, for any $\delta\in(0,1]$, the following holds with probability at least $1-\delta$: for all $(k,h,s,a)\in[K]\times[H]\times\mathcal{S}\times\mathcal{A}$, Condition 2 holds for Algorithm 3 with $c_1 = 1$, $g = 1$,
$m_h^k(s,a) = \tfrac{1}{|e^{\beta H}-1|}\big(\max_{P\in\mathcal{P}_h^k} z_P(s,a,V_{h+1}^k) - \min_{P\in\mathcal{P}_h^k} z_P(s,a,V_{h+1}^k)\big)$,
and $q_1$ as defined in (34).

Proof. It is not hard to see from the definition of $q_1$ in (34) that $q_1\in[\min\{1, e^{\beta(H-h)}\}, \max\{1, e^{\beta(H-h)}\}]$. Recalling the definitions of $q_2$ in (16) and $G_1(\cdot)$ in (18), it holds that $G_1(q_1) = Q_h^k(s,a) - \tfrac{1}{\beta}\log\{q_2\}$. On the event of Lemma 14, we have $P_h\in\mathcal{P}_h^k$.

Case $\beta > 0$. By the definitions of $q_1$ and $q_2$ and the fact that $P_h\in\mathcal{P}_h^k$, we have $q_1 \ge q_2$. We can also derive
$q_1 - q_2 = \max_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds' - \int_{\mathcal{S}} P_h(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds'$
$\le \max_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds' - \min_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds'$
$= \max_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)(e^{\beta\cdot V_{h+1}^k(s')}-1)\, ds' - \min_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)(e^{\beta\cdot V_{h+1}^k(s')}-1)\, ds'$
$= |e^{\beta H}-1|\cdot g\cdot m_h^k(s,a)$,
where the second step holds since $P_h\in\mathcal{P}_h^k$, and the third step holds since $\int_{\mathcal{S}} P(s'\mid s,a)\, ds' = 1$.

Case $\beta < 0$. By the definitions of $q_1$ and $q_2$ and the fact that $P_h\in\mathcal{P}_h^k$, we may deduce that $q_1 \le q_2$.
Also, we have
$q_2 - q_1 = \int_{\mathcal{S}} P_h(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds' - \min_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds'$
$\le \max_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds' - \min_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)\, e^{\beta\cdot V_{h+1}^k(s')}\, ds'$
$= \max_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)(1 - e^{\beta\cdot V_{h+1}^k(s')})\, ds' - \min_{P\in\mathcal{P}_h^k}\int_{\mathcal{S}} P(s'\mid s,a)(1 - e^{\beta\cdot V_{h+1}^k(s')})\, ds'$
$= |e^{\beta H}-1|\cdot g\cdot m_h^k(s,a)$,
where the second step holds since $P_h\in\mathcal{P}_h^k$, and the last step holds since $\int_{\mathcal{S}} P(s'\mid s,a)\, ds' = 1$. Finally, invoking Lemma 5 completes the proof.

The following lemma controls $\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\}$.

Lemma 12 Let $d := \dim_E(\mathcal{Z}, |e^{\beta H}-1|/K)$. Under Assumption 2, for any $\delta\in(0,1]$, let $\{m_h^k(s,a)\}$ be as defined in Lemma 11 and $\zeta$ be as defined in (12). Then with probability at least $1-\delta$, we have
$\sum_{k\in[K]}\sum_{h\in[H]}\min\{1, m_h^k(s_h^k,a_h^k)\} \le 2H\min\{d, K\} + 20H\sqrt{dK\zeta}$.

Recall the definition of $M$ from Theorem 1; its upper bound can now be determined by combining Lemmas 11 and 12, which completes the proof of the theorem.

G.1 AUXILIARY LEMMAS

Let $\mathcal{Z}$ be a set of $[0,D]$-valued functions for some number $D > 0$. Let $\{(X_\tau, Y_\tau)\}_{\tau\in[t]}$ be a sequence of random variables such that each $X_\tau$ lies in the domain of the elements of the function set $\mathcal{Z}$, and each $Y_\tau\in\mathbb{R}$. Let $\mathbb{F} = \{\mathcal{F}_\tau\}_{\tau\ge1}$ be a filtration such that for all $\tau\ge1$, the random variables $\{X_1, Y_1, \dots, X_{\tau-1}, Y_{\tau-1}, X_\tau\}$ are $\mathcal{F}_{\tau-1}$-measurable. Furthermore, we assume there exists a function $z^*\in\mathcal{Z}$ such that $\mathbb{E}[Y_\tau\mid\mathcal{F}_{\tau-1}] = z^*(X_\tau)$. For any $\varepsilon > 0$, we denote by $N_\varepsilon(\mathcal{Z}, \|\cdot\|_\infty)$ the $\varepsilon$-covering number of $\mathcal{Z}$ with respect to the supremum norm $\|z_1 - z_2\|_\infty = \sup_x |z_1(x) - z_2(x)|$. We define
$\hat{z}_t := \operatorname{argmin}_{z\in\mathcal{Z}}\sum_{\tau\in[t]}(z(X_\tau) - Y_\tau)^2$, and for $\gamma\ge0$, $\mathcal{Z}_t(\gamma) := \{z\in\mathcal{Z} : \sum_{\tau\in[t]}(z(X_\tau) - \hat{z}_t(X_\tau))^2 \le \gamma^2\}$.

We record a concentration result.

Lemma 13 Suppose that for any $\tau\ge1$, the random variable $Y_\tau - z^*(X_\tau)$ is conditionally $\sigma$-sub-Gaussian given the filtration $\mathcal{F}_{\tau-1}$. Let
$\gamma_t^2(\delta,\varepsilon) := 8\sigma^2\log(N_\varepsilon(\mathcal{Z},\|\cdot\|_\infty)/\delta) + 4\varepsilon t\big(D + \sqrt{\sigma^2\log(4t(t+1)/\delta)}\big)$.
Then for any $\varepsilon > 0$ and $\delta\in(0,1]$, with probability at least $1-\delta$ we have $z^*\in\mathcal{Z}_t(\gamma_t(\delta,\varepsilon))$.

Proof. The proof can be adapted from that of Russo & Van Roy (2014, Proposition 6).

In the following, we use the shorthand $\gamma := \gamma_G$, where $\gamma_G$ is defined in (13) and used in Line 4 of Algorithm 3.

Lemma 14 For any $\delta\in(0,1]$ and $\gamma^2 = 10|e^{\beta H}-1|^2\big(\log(N_{1/K}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot H/\delta) + \log(4K^2H/\delta)\big)$, we have $P_h\in\mathcal{P}_h^k$ for all $(k,h)\in[K]\times[H]$ with probability at least $1-\delta$.

Proof. We first note that for any $(k,h)\in[K]\times[H]$, Line 3 of Algorithm 3 can be equivalently written as
$\widehat{P}_h^k \leftarrow \operatorname{argmin}_{P\in\mathcal{P}}\sum_{\tau\in[k-1]}\big((e^{\beta\cdot V_{h+1}^\tau(s_{h+1}^\tau)} - 1) - \int_{\mathcal{S}} P(s'\mid s_h^\tau, a_h^\tau)\cdot(e^{\beta\cdot V_{h+1}^\tau(s')} - 1)\, ds'\big)^2$,
since $\int_{\mathcal{S}} P(s'\mid s_h^\tau, a_h^\tau)\, ds' = 1$. Recall the definition of $z_P$ in (11). For any $(k,h)\in[K]\times[H]$, we set $Y_k = e^{\beta\cdot V_{h+1}^k(s_{h+1}^k)} - 1$, $X_k = (s_h^k, a_h^k, V_{h+1}^k)$ and $z^* = z_{P_h}$. Then $Y_\tau - z^*(X_\tau)$ is conditionally $|e^{\beta H}-1|$-sub-Gaussian for $\tau\in[k-1]$ given a properly defined filtration. By the definition of $\mathcal{P}_h^k$, we have $\mathcal{Z}_k(\gamma) = \{z_P : P\in\mathcal{P}_h^k\}$. By the definition of $\gamma$, we have $\gamma \ge \gamma_{k-1}(\delta/H, |e^{\beta H}-1|/K)$ for all $k\in[K]$, where $\gamma_t(\cdot,\cdot)$ is as defined in Lemma 13. By Lemma 13 with $D = \sigma = |e^{\beta H}-1|$ and $\varepsilon = |e^{\beta H}-1|/K$, with probability at least $1-\delta/H$ and for all $k\in[K]$, we have $z^*\in\mathcal{Z}_k(\gamma_{k-1}(\delta/H, |e^{\beta H}-1|/K)) \subset \mathcal{Z}_k(\gamma)$, thus implying $P_h\in\mathcal{P}_h^k$. Applying the union bound over $h\in[H]$, we have $P_h\in\mathcal{P}_h^k$ for all $(k,h)$ with probability at least $1-\delta$.

It remains to show that $N_\varepsilon(\mathcal{Z},\|\cdot\|_\infty) \le N_{\varepsilon/|e^{\beta H}-1|}(\mathcal{P},\|\cdot\|_{\infty,1})$ for any $\varepsilon > 0$. Let $\mathcal{V} := \{V : \mathcal{S}\to[0,H]\}$. For any $P, P'\in\mathcal{P}$ and their corresponding $z_P, z_{P'}\in\mathcal{Z}$, we can compute
$\|z_P - z_{P'}\|_\infty = \sup_{(s,a,V)\in\mathcal{S}\times\mathcal{A}\times\mathcal{V}}\big|\int_{\mathcal{S}} P(s'\mid s,a)(e^{\beta\cdot V(s')}-1)\, ds' - \int_{\mathcal{S}} P'(s'\mid s,a)(e^{\beta\cdot V(s')}-1)\, ds'\big|$
$\le |e^{\beta H}-1|\cdot\sup_{(s,a)\in\mathcal{S}\times\mathcal{A}}\int_{\mathcal{S}}|P(s'\mid s,a) - P'(s'\mid s,a)|\, ds' = |e^{\beta H}-1|\cdot\|P - P'\|_{\infty,1}$,
as desired.

We have the following result on the eluder dimension.

Lemma 15 Let $d = \dim_E(\mathcal{Z}, |e^{\beta H}-1|/K)$. For any $K\ge1$, $\beta\in\mathbb{R}$ and $\gamma>0$, we have
$\sum_{k\in[K]}\sup_{z,z'\in\mathcal{Z}_k(\gamma)}|z(x_k) - z'(x_k)| \le |e^{\beta H}-1| + |e^{\beta H}-1|\cdot\min\{d, K\} + 4\sqrt{\gamma^2 dK}$.

Proof. This result is an adaptation of Russo & Van Roy (2014, Lemma 5) with $C = |e^{\beta H}-1|$ therein.
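The Lipschitz relation underlying the covering-number comparison in the proof of Lemma 14 can be checked numerically on discrete kernels (a sketch with random kernels as hypothetical inputs; the inequality holds deterministically since $\|e^{\beta\cdot V}-1\|_\infty \le |e^{\beta H}-1|$):

```python
import numpy as np

def z_gap_check(beta=0.7, H=3.0, S=4, A=2, n_vals=200, seed=0):
    """Check, for discrete kernels P, P', that
    sup_{s,a,V} |z_P(s,a,V) - z_P'(s,a,V)| <= |e^{beta H} - 1| * ||P - P'||_{inf,1}."""
    rng = np.random.default_rng(seed)

    def kernel():
        P = rng.uniform(size=(S, A, S))
        return P / P.sum(-1, keepdims=True)

    P1, P2 = kernel(), kernel()
    lip = abs(np.exp(beta * H) - 1.0)
    l1 = np.abs(P1 - P2).sum(-1).max()          # ||P - P'||_{inf,1}
    worst = 0.0
    for _ in range(n_vals):                      # random V: S -> [0, H]
        V = rng.uniform(0.0, H, size=S)
        g = np.exp(beta * V) - 1.0               # integrand e^{beta V} - 1
        worst = max(worst, np.abs((P1 - P2) @ g).max())
    return worst <= lip * l1 + 1e-9

assert z_gap_check()
```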



Throughout the paper, we use "function class" and "function set" interchangeably. We also discuss computational aspects of our algorithms; see Appendix B. By inspecting the proof of Fei et al. (2020, Theorem 1), we see that they apply the bound $\tfrac{1}{|\beta|}(e^{|\beta|H}-1)\, e^{|\beta|H^2} \le \tfrac{1}{|\beta|}(e^{C|\beta|H^2}-1)$ for some universal constant $C > 0$. Recall that the definition of the eluder dimension is formally given in Appendix A.



$\big\|\int_{\mathcal{S}}\psi(s,a,s')\cdot(e^{\beta\cdot V(s')}-1)\,ds'\big\|_2 \le \sqrt{d}\cdot|e^{\beta v}-1|$, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$ and $V:\mathcal{S}\to[0,v]$, where $v\ge0$.


from Theorem 2, as well as the facts that $\lim_{\beta\to0}\tfrac{e^{|\beta|H}-1}{|\beta|} = H$ and $\lim_{\beta\to0} e^{|\beta|H^2} = 1$.





for all $(k,h)\in[K]\times[H]$. For each fixed $h\in[H]$, by Lemma 15, we have
$\sum_{k\in[K]}\min\{1, m_h^k(s_h^k,a_h^k)\} \le \tfrac{1}{|e^{\beta H}-1|}\big(|e^{\beta H}-1| + |e^{\beta H}-1|\cdot\min\{d,K\} + 5\sqrt{\gamma^2 dK}\big)$
$\le \tfrac{1}{|e^{\beta H}-1|}\big(2|e^{\beta H}-1|\cdot\min\{d,K\} + 5\sqrt{\gamma^2 dK}\big) = 2\min\{d,K\} + \tfrac{5}{|e^{\beta H}-1|}\sqrt{\gamma^2 dK}$
$\le 2\min\{d,K\} + 20\sqrt{dK\big(\log(N_{1/K}(\mathcal{P},\|\cdot\|_{\infty,1})\cdot H/\delta) + \log(4K^2H/\delta)\big)}$,
where the last step holds since $\gamma_G = \gamma$ with $\gamma$ given in Lemma 14. Summing both sides of the above over $h\in[H]$ results in the desired bound.

