PROVABLY EFFICIENT RISK-SENSITIVE REINFORCEMENT LEARNING: ITERATED CVAR AND WORST PATH

Abstract

In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For these two metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes K. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower regret bounds. Finally, our techniques for bounding the change of CVaR due to the value function shift and decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive RL problems.

1. INTRODUCTION

Reinforcement Learning (RL) (Kaelbling et al., 1996; Szepesvári, 2010; Sutton & Barto, 2018) is a classic online decision-making formulation, where an agent interacts with an unknown environment with the goal of maximizing the obtained reward. Despite the empirical success and theoretical progress of recent RL algorithms, e.g., (Szepesvári, 2010; Agrawal & Jia, 2017; Azar et al., 2017; Zanette & Brunskill, 2019), they focus mainly on the risk-neutral criterion, i.e., maximizing the expected cumulative reward, and can fail to avoid rare but disastrous situations. As a result, existing algorithms cannot be applied to tackle real-world risk-sensitive tasks, such as autonomous driving (Wen et al., 2020) and clinical treatment planning (Coronato et al., 2020), where policies that ensure low risk of getting into catastrophic situations at all decision stages are strongly preferred.

Motivated by the above facts, we investigate Iterated CVaR RL, a novel episodic RL formulation equipped with an important risk-sensitive criterion, i.e., Iterated Conditional Value-at-Risk (CVaR) (Hardy & Wirch, 2004). Here, CVaR (Artzner et al., 1999) is a popular static (single-stage) risk measure which stands for the expected tail reward. Iterated CVaR is a dynamic (multi-stage) risk measure defined upon CVaR by backward iteration, and focuses on the worst portion of the reward-to-go at each stage. In the Iterated CVaR RL problem, an agent interacts with an unknown episodic Markov Decision Process (MDP) in order to maximize the worst α-portion of the reward-to-go at each step, where α ∈ (0, 1] is a given risk level. Under this model, we investigate two important performance metrics, i.e., Regret Minimization (RM), where the goal is to minimize the cumulative regret over all episodes, and Best Policy Identification (BPI), where the performance is measured by the number of episodes required for identifying an optimal policy.
Compared to the existing CVaR MDP model, e.g., (Boda & Filar, 2006; Ott, 2010; Bäuerle & Ott, 2011; Chow et al., 2015), which aims to maximize the CVaR (i.e., the worst α-portion) of the total reward, our Iterated CVaR RL concerns the worst α-portion of the reward-to-go at each step, and prevents the agent from getting into catastrophic states more carefully. Intuitively, CVaR MDP takes more of the cumulative reward into account and prefers actions which perform better in general, but which can have larger probabilities of getting into catastrophic states. Thus, CVaR MDP is suitable for scenarios where bad situations lead to a higher cost instead of fatal damage, e.g., finance. In contrast, our Iterated CVaR RL prefers actions which have smaller probabilities of getting into catastrophic states. Hence, Iterated CVaR RL is suitable for safety-critical applications, where catastrophic states are unacceptable and need to be carefully avoided, e.g., clinical treatment planning (Wang et al., 2019) and unmanned helicopter control (Johnson & Kannan, 2002). For example, consider the case where we fly an unmanned helicopter to complete some task. There is a small probability that, at each time during execution, the helicopter encounters a sensing or control failure and does not take the scheduled action. To guarantee the safety of surrounding workers and the helicopter, we need to make sure that even if the failure occurs, the chosen policy ensures that the helicopter does not crash and cause fatal damage (see Appendix C.2, C.3 for more detailed comparisons with existing risk-sensitive MDP models).

Iterated CVaR RL faces several unique challenges as follows. (i) The importance (contribution to regret) of a state in Iterated CVaR RL is not proportional to its visitation probability. Specifically, there can be states which are critical (risky) but have a small visitation probability.
As a result, the regret for Iterated CVaR RL cannot be decomposed into the estimation error at each step with respect to the visitation distribution, as in standard RL analysis (Jaksch et al., 2010; Azar et al., 2017; Zanette & Brunskill, 2019). (ii) In Iterated CVaR RL, the calculation of the estimation error involves bounding the change of CVaR when the true value function shifts to an optimistic value function, which is very different from bounding the change of expected rewards as is typical in existing RL analysis (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018). Therefore, Iterated CVaR RL demands brand-new algorithm design and analytical techniques.

To tackle the above challenges, we design two efficient algorithms ICVaR-RM and ICVaR-BPI for the RM and BPI metrics, respectively, equipped with delicate CVaR-adapted value iteration and exploration bonuses to allocate more attention to rare but potentially dangerous states. We also develop novel analytical techniques for bounding the change of CVaR due to the value function shift and for decomposing the regret via a distorted visitation distribution. Lower bounds for both metrics are established to demonstrate the optimality of our algorithms with respect to the number of episodes K. Moreover, we present experiments to validate our theoretical results and show the performance superiority of our algorithm (see Appendix A).

We further study an interesting limiting case of Iterated CVaR RL when α approaches 0, called Worst Path RL, where the goal becomes to maximize the minimum possible cumulative reward (optimize the worst path). This setting corresponds to the scenario where the decision maker is extremely risk-averse and is concerned with the worst situation (e.g., in clinical treatment planning (Coronato et al., 2020), the worst case can be disastrous).
We emphasize that Worst Path RL cannot be directly solved by taking α → 0 in Iterated CVaR RL's results, as the results there have a dependency on 1/α in both upper and lower bounds. To handle this limiting case, we design a simple yet efficient algorithm MaxWP, and obtain constant upper and lower regret bounds which are independent of K.

The contributions of this paper are summarized as follows.

• We propose a novel Iterated CVaR RL formulation, where an agent interacts with an unknown environment, with the objective of maximizing the worst α-portion tail of the reward-to-go at each step. This formulation enables one to tightly control risk throughout the decision process, and is most suitable for applications where such safety-at-all-time is critical.
• We investigate two important metrics of Iterated CVaR RL, i.e., Regret Minimization (RM) and Best Policy Identification (BPI), and propose efficient algorithms ICVaR-RM and ICVaR-BPI. We establish nearly matching regret/sample complexity upper and lower bounds with respect to K. Moreover, we develop novel techniques to bound the change of CVaR due to the value function shift and decompose the regret via a distorted visitation distribution, which can be applied to other risk-sensitive decision making problems.
• We further investigate a limiting case of Iterated CVaR RL when α approaches 0, called Worst Path RL, where the objective is to maximize the minimum possible cumulative reward. We develop a simple and efficient algorithm MaxWP, and provide constant regret upper and lower bounds (independent of K).

Due to the space limit, we defer all proofs and experiments to the Appendix.

2. RELATED WORK

Below we review the most related works, and defer a full literature review to Appendix B.

CVaR-based MDPs (Known Transition). Boda & Filar (2006); Ott (2010); Bäuerle & Ott (2011); Chow et al. (2015) study the CVaR MDP where the objective is to minimize the CVaR of the total cost, and show that the optimal policy for CVaR MDP is history-dependent (see Appendix C.2 for a detailed comparison with CVaR MDP). Hardy & Wirch (2004) first define the Iterated CVaR measure, and Osogami (2012); Chu & Zhang (2014); Bäuerle & Glauner (2022) consider iterated coherent risk measures (including Iterated CVaR) in MDPs, and demonstrate the existence of Markovian optimal policies. The above works focus mainly on the planning side, i.e., proposing algorithms and error guarantees for MDPs with known transition, while our work develops RL algorithms (interacting with the environment) and regret/sample complexity results for unknown transition.

Risk-sensitive Reinforcement Learning (Unknown Transition). Tamar et al. (2015); Keramati et al. (2020) study CVaR MDP with unknown transition and provide convergence analysis. Borkar & Jain (2014); Chow & Ghavamzadeh (2014); Chow et al. (2017) investigate RL with CVaR-based constraints. Heger (1994); Coraluppi & Marcus (1997; 1999) consider minimizing the worst-case cost in RL and design heuristic algorithms. Fei et al. (2020; 2021a;b) study risk-sensitive RL with the exponential utility criterion, which takes all successor states into account with an exponential reweighting scheme. In contrast, our Iterated CVaR RL primarily concerns the worst α-portion successor states, and focuses on optimizing the performance under bad situations (see Appendix C.3 for a detailed comparison).

3. PROBLEM FORMULATION

In this section, we present the problem formulations of Iterated CVaR RL and Worst Path RL.

Conditional Value-at-Risk (CVaR). We first introduce two risk measures, i.e., Value-at-Risk (VaR) and Conditional Value-at-Risk (CVaR). Let X be a random variable with cumulative distribution function $F(x) = \Pr[X \le x]$. Given a risk level α ∈ (0, 1], the VaR at risk level α is the α-quantile of X, i.e., $\mathrm{VaR}^{\alpha}(X) = \min\{x \mid F(x) \ge \alpha\}$, and the CVaR at risk level α is defined as (Rockafellar et al., 2000):

$$\mathrm{CVaR}^{\alpha}(X) = \sup_{x \in \mathbb{R}} \left\{ x - \frac{1}{\alpha}\, \mathbb{E}\left[(x - X)^{+}\right] \right\},$$

where $(x)^{+} := \max\{x, 0\}$. If there is no probability atom at $\mathrm{VaR}^{\alpha}(X)$, CVaR can also be written as $\mathrm{CVaR}^{\alpha}(X) = \mathbb{E}[X \mid X \le \mathrm{VaR}^{\alpha}(X)]$ (Shapiro et al., 2021). Intuitively, $\mathrm{CVaR}^{\alpha}(X)$ is a distorted expectation of X conditioning on its α-portion tail, which depicts the average value when bad situations happen. When α = 1, $\mathrm{CVaR}^{\alpha}(X) = \mathbb{E}[X]$, and when α → 0, $\mathrm{CVaR}^{\alpha}(X)$ tends to min(X) (Chow et al., 2015).

Iterated CVaR RL. We consider an episodic Markov Decision Process (MDP) $\mathcal{M}(S, A, H, p, r)$. Here S is the state space, A is the action space, and H is the length of the horizon in each episode. p is the transition distribution, i.e., $p(s'|s,a)$ gives the probability of transitioning to $s'$ when starting from state s and taking action a. $r : S \times A \to [0, 1]$ is a reward function, and r(s, a) gives a deterministic reward for taking action a in state s. A policy π is defined as a collection of H functions, i.e., $\pi = \{\pi_h : S \to A\}_{h \in [H]}$, where $[H] := \{1, 2, \dots, H\}$. The episodic RL game is as follows. In each episode k, an agent chooses a policy $\pi^k$, and starts from a fixed initial state $s_1$, i.e., $s^k_1 := s_1$, as assumed in many prior RL works (Fiechter, 1994; Kaufmann et al., 2021; Ménard et al., 2021). At each step h ∈ [H], the agent observes the state $s^k_h$ and takes an action $a^k_h = \pi^k_h(s^k_h)$.
After that, it receives a reward $r(s^k_h, a^k_h)$ and transitions to a next state $s^k_{h+1}$ according to the transition distribution $p(\cdot|s^k_h, a^k_h)$. The episode ends after H steps and the agent enters the next episode.

In Iterated CVaR RL, for any risk level α ∈ (0, 1] and a policy π, we use the value function $V^{\alpha,\pi}_h : S \to \mathbb{R}$ and the Q-value function $Q^{\alpha,\pi}_h : S \times A \to \mathbb{R}$ to denote the cumulative reward that can be obtained when the agent transitions to the worst α-portion states at each step, starting from s and (s, a) at step h, respectively. For simplicity of notation, when the value of α is clear, we omit the superscript α and use the notations $V^{\pi}_h$ and $Q^{\pi}_h$. Formally, $Q^{\pi}_h$ and $V^{\pi}_h$ are recursively defined in Eq. (i) below. Since S, A and H are finite and the maximization of $V^{\pi}_h(s)$ in Iterated CVaR RL satisfies the optimal substructure property, there exists an optimal policy $\pi^*$ which gives the optimal value $V^*_h(s) = \max_{\pi} V^{\pi}_h(s)$ for all s ∈ S and h ∈ [H] (Chu & Zhang, 2014). Therefore, the Bellman equation and the Bellman optimality equation are given in Eqs. (i), (ii) below, respectively (Chu & Zhang, 2014).

$$\begin{cases} Q^{\pi}_h(s,a) = r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)}\left(V^{\pi}_{h+1}(s')\right) \\ V^{\pi}_h(s) = Q^{\pi}_h(s, \pi_h(s)) \\ V^{\pi}_{H+1}(s) = 0, \ \forall s \in S, \end{cases} \tag{i}$$

$$\begin{cases} Q^{*}_h(s,a) = r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)}\left(V^{*}_{h+1}(s')\right) \\ V^{*}_h(s) = \max_{a \in A} Q^{*}_h(s,a) \\ V^{*}_{H+1}(s) = 0, \ \forall s \in S, \end{cases} \tag{ii}$$

where $\mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)}(V^{\pi}_{h+1}(s'))$ denotes the CVaR value of the random variable $V^{\pi}_{h+1}(s')$ with $s' \sim p(\cdot|s,a)$ at risk level α. We also provide the expanded version of the value function definitions for Iterated CVaR RL (Eqs. (i), (ii)) in Appendix C.1.

We consider two performance metrics for Iterated CVaR RL, i.e., Regret Minimization (RM) and Best Policy Identification (BPI). In Iterated CVaR RL-RM, the agent aims to minimize the cumulative regret in K episodes, defined as

$$R(K) = \sum_{k=1}^{K} \left( V^{*}_1(s_1) - V^{\pi^k}_1(s_1) \right). \tag{1}$$
In Iterated CVaR RL-BPI, given a confidence parameter δ ∈ (0, 1] and an accuracy parameter ε > 0, the agent needs to use as few trajectories (episodes) as possible to identify an ε-optimal policy $\hat{\pi}$, which satisfies $V^{\hat{\pi}}_1(s_1) \ge V^{*}_1(s_1) - \varepsilon$, with probability at least 1 − δ. That is, the performance of BPI is measured by the number of trajectories used, i.e., sample complexity.

Worst Path RL. Furthermore, we investigate an interesting limiting case of Iterated CVaR RL when α approaches 0, called Worst Path RL. In this case, the objective becomes maximizing the minimum possible cumulative reward (Heger, 1994). The Bellman (optimality) equations become

$$\begin{cases} Q^{\pi}_h(s,a) = r(s,a) + \min_{s' \sim p(\cdot|s,a)}\left(V^{\pi}_{h+1}(s')\right) \\ V^{\pi}_h(s) = Q^{\pi}_h(s, \pi_h(s)) \\ V^{\pi}_{H+1}(s) = 0, \ \forall s \in S, \end{cases} \qquad \begin{cases} Q^{*}_h(s,a) = r(s,a) + \min_{s' \sim p(\cdot|s,a)}\left(V^{*}_{h+1}(s')\right) \\ V^{*}_h(s) = \max_{a \in A} Q^{*}_h(s,a) \\ V^{*}_{H+1}(s) = 0, \ \forall s \in S, \end{cases} \tag{2}$$

where $\min_{s' \sim p(\cdot|s,a)}(V^{\pi}_{h+1}(s'))$ denotes the minimum value of the random variable $V^{\pi}_{h+1}(s')$ with $s' \sim p(\cdot|s,a)$, i.e., the minimum over the successor states that have positive transition probability. From Eq. (2), one sees that

$$Q^{\pi}_h(s,a) = \min_{(s_t,a_t) \sim \pi} \left[ \sum_{t=h}^{H} r(s_t,a_t) \,\middle|\, s_h = s, a_h = a, \pi \right], \qquad V^{\pi}_h(s) = \min_{(s_t,a_t) \sim \pi} \left[ \sum_{t=h}^{H} r(s_t,a_t) \,\middle|\, s_h = s, \pi \right].$$

Thus, $Q^{\pi}_h(s,a)$ and $V^{\pi}_h(s)$ denote the minimum possible cumulative reward under policy π, starting from (s, a) and s at step h, respectively. The optimal policy $\pi^*$ maximizes the minimum possible cumulative reward (i.e., optimizes the worst path) for all starting states and steps. Formally, $\pi^*$ gives the optimal value $V^{*}_h(s) = \max_{\pi} V^{\pi}_h(s)$ for all s ∈ S and h ∈ [H]. For Worst Path RL, in this paper we mainly consider the regret minimization setting, where the regret is defined in the same way as in Eq. (1). Note that this case cannot be directly solved by taking α → 0 in Iterated CVaR RL, as the results there have a dependency on 1/α. Thus, changing from CVaR(·) to min(·) in Worst Path RL requires a different algorithm design and analysis.
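The CVaR definition used throughout this section can be made concrete for a finite-support random variable, which is the case that arises when $s'$ ranges over a finite state space. The following is a minimal sketch, not the authors' implementation; the function names `cvar_tail` and `cvar_sup` are hypothetical. It computes CVaR once by averaging the worst α-probability mass and once via the sup formula, and the two agree.

```python
import numpy as np

def cvar_tail(values, probs, alpha):
    """CVaR at level alpha: average of the lowest outcomes carrying total mass alpha."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    remaining, total = alpha, 0.0
    for i in np.argsort(values):          # worst (lowest) outcomes first
        mass = min(probs[i], remaining)   # take mass until alpha is used up
        total += mass * values[i]
        remaining -= mass
        if remaining <= 0:
            break
    return total / alpha

def cvar_sup(values, probs, alpha):
    """CVaR at level alpha via sup_x { x - E[(x - X)^+] / alpha }.

    The objective is concave and piecewise linear in x with kinks at the
    support points, so it suffices to evaluate it on the support.
    """
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    return max(x - probs @ np.maximum(x - values, 0.0) / alpha for x in values)

# Illustrative distribution (made up for this example).
outcomes, weights = [0.0, 0.4, 1.0], [0.2, 0.3, 0.5]
half_tail = cvar_tail(outcomes, weights, 0.5)   # worst half: 0.2 mass at 0, 0.3 at 0.4
full_mean = cvar_tail(outcomes, weights, 1.0)   # alpha = 1 recovers the expectation
```

With α = 0.5 both functions evaluate to 0.24 (up to floating-point rounding), with α = 1 they return the mean 0.62, and for very small α they return the minimum outcome, matching the two limiting cases stated above.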
The best policy identification setting of Worst Path RL, on the other hand, is very challenging. This is because we cannot establish confidence intervals under the min(·) operation, and it is difficult to determine when the estimated optimal policy is accurate enough and when the algorithm should stop. We will further investigate this setting in future work.

Algorithm 1: ICVaR-RM
Input: δ, α, $\delta' := \frac{\delta}{5}$, $L := \log\left(\frac{KHSA}{\delta'}\right)$, $\bar{V}^k_{H+1}(s) = 0$ for any k > 0 and s ∈ S
for k = 1, 2, . . . , K do
    for h = H, H − 1, . . . , 1 do
        $\bar{Q}^k_h(s,a) \leftarrow \min\left\{ r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)}\left(\bar{V}^k_{h+1}(s')\right) + \frac{H}{\alpha}\sqrt{\frac{L}{n_k(s,a)}},\ H \right\}$, ∀(s, a) ∈ S × A;
        $\bar{V}^k_h(s) \leftarrow \max_{a \in A} \bar{Q}^k_h(s,a)$, $\pi^k_h(s) \leftarrow \operatorname{argmax}_{a \in A} \bar{Q}^k_h(s,a)$, ∀s ∈ S;
    Play the episode k with policy $\pi^k$, and update $n_{k+1}(s,a)$ and $\hat{p}^{k+1}(s'|s,a)$;
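The planning pass of Algorithm 1 (the inner loop over h) can be sketched as follows. This is an illustrative sketch rather than the authors' code: the function names, the array layout, and the clipping of visit counts at 1 for unvisited pairs are assumptions made here. The bonus term $\frac{H}{\alpha}\sqrt{L/n_k(s,a)}$ and the truncation at H follow the algorithm listing.

```python
import numpy as np

def cvar(values, probs, alpha):
    """Average of `values` over the worst alpha-probability mass under `probs`."""
    remaining, total = alpha, 0.0
    for i in np.argsort(values):
        mass = min(probs[i], remaining)
        total += mass * values[i]
        remaining -= mass
        if remaining <= 0:
            break
    return total / alpha

def optimistic_planning(r, p_hat, n, H, alpha, L):
    """Backward pass h = H, ..., 1 with CVaR backups plus exploration bonuses.

    r:     (S, A) rewards in [0, 1]
    p_hat: (S, A, S) empirical transition probabilities
    n:     (S, A) visit counts (clipped below at 1: an assumption of this sketch)
    Returns V of shape (H + 2, S) and a greedy policy pi of shape (H + 1, S);
    index 0 of both arrays is unused so that steps are 1-based as in the paper.
    """
    S, A = r.shape
    V = np.zeros((H + 2, S))              # V[H + 1] = 0
    pi = np.zeros((H + 1, S), dtype=int)
    for h in range(H, 0, -1):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                bonus = (H / alpha) * np.sqrt(L / max(n[s, a], 1))
                backup = r[s, a] + cvar(V[h + 1], p_hat[s, a], alpha) + bonus
                Q[s, a] = min(backup, H)  # values never exceed H
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi
```

Because the bonus scales with 1/α, rarely visited state-action pairs keep a large optimistic value until they have been sampled often enough, which is how the algorithm directs attention to rare but potentially dangerous states.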

4. ITERATED CVAR RL WITH REGRET MINIMIZATION

In this section, we consider regret minimization (Iterated CVaR RL-RM). We propose an algorithm ICVaR-RM with CVaR-adapted exploration bonuses, and demonstrate its sample efficiency.

In Algorithm 1, $n_k(s,a)$ is the number of times (s, a) was visited up to episode k, and $\hat{p}^k(s'|s,a)$ is the empirical estimate of the transition probability $p(s'|s,a)$. In each episode k, ICVaR-RM constructs an optimistic Q-value function $\bar{Q}^k_h(s,a)$, an optimistic value function $\bar{V}^k_h(s)$, and a greedy policy $\pi^k$ with respect to $\bar{Q}^k_h(s,a)$. After calculating the value functions and policy, ICVaR-RM plays episode k with policy $\pi^k$, observes a trajectory, and updates $n_{k+1}(s,a)$ and $\hat{p}^{k+1}(s'|s,a)$. The calculation of CVaR (Line 3 of Algorithm 1) can be implemented efficiently, and takes O(S log S) time (Shapiro et al., 2021). We summarize the regret performance of ICVaR-RM as follows.

Theorem 1 (Regret Upper Bound). With probability at least 1 − δ, the regret of algorithm ICVaR-RM is bounded by

$$O\left( \min\left\{ \frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}} \right\} \cdot \frac{HS\sqrt{KHA}}{\alpha} \log\left(\frac{KHSA}{\delta}\right) \right),$$

where $w_{\pi,h}(s)$ denotes the probability of visiting state s at step h under policy π.

Remark 1. The regret depends on the minimum between an MDP-intrinsic visitation factor $\left(\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)\right)^{-\frac{1}{2}}$ and $\frac{1}{\sqrt{\alpha^{H-1}}}$. When α is small, the first term dominates the bound; it reflects the minimum probability of visiting a reachable state under any feasible policy. Note that $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ takes the minimum over only the policies under which s is reachable, and thus, this factor will never be zero. Indeed, this factor also exists in the lower bound (see Section 4.2). Thus, it characterizes the essential problem hardness, i.e., when the agent is highly risk-averse, her regret will be heavily influenced by exploring critical but hard-to-reach states. When α is large, $\frac{1}{\sqrt{\alpha^{H-1}}}$ instead dominates the bound.
The intuition behind the factor $\frac{1}{\sqrt{\alpha^{H-1}}}$ is that for any state-action pair, the ratio of the visitation probability conditioning on transitioning to bad successor states over the original visitation probability can be upper bounded by $\frac{1}{\alpha^{H-1}}$. This ratio is critical and will appear in the regret bound (see Lemma 9 for a formal statement). In the special case when α = 1, our Iterated CVaR RL problem reduces to the classic RL formulation, and our regret bound becomes $\tilde{O}(HS\sqrt{KHA})$, which matches the result in existing classic RL work (Jaksch et al., 2010). This bound has a gap of $\sqrt{HS}$ to the state-of-the-art regret bound for classic RL (Azar et al., 2017; Zanette & Brunskill, 2019). This is because our algorithm is mainly designed for general risk-sensitive cases (which require CVaR-adapted exploration bonuses), and does not use the Bernstein-type exploration bonuses (which only work for the classic expectation maximization criterion). Such a phenomenon also appears in existing risk-sensitive RL works (Fei et al., 2020; 2021a). Designing an algorithm which achieves an optimal regret simultaneously for both risk-sensitive cases and the classic expectation maximization case is still an open problem, which we leave for future work. To validate our theoretical analysis, we also conduct experiments to exhibit the influences of parameters α, δ, H, S, A and K on the regret of ICVaR-RM in practice, and the empirical results match our theoretical bound well (see Appendix A).

Challenges and Novelty in Regret Analysis. The analysis of Iterated CVaR RL faces several challenges. (i) First of all, in Iterated CVaR RL, the contribution of a state to the regret is not proportional to its visitation probability as in standard RL analysis (Jaksch et al., 2010; Azar et al., 2017; Zanette & Brunskill, 2019). Instead, the regret is influenced more by risky but hard-to-reach states. Thus, the regret cannot be decomposed into estimation error with respect to the visitation distribution. (ii) Second, unlike existing RL analysis (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018) which typically calculates the change of expected rewards between optimistic and true value functions, in Iterated CVaR RL, we need to instead analyze the change of CVaR when the true value function shifts to an optimistic value function. To tackle these challenges, we develop a new analytical technique to bound the change of CVaR due to the value function shift via conditional transition probabilities, which can be applied to other CVaR-based RL problems.
Furthermore, we establish a novel regret decomposition for Iterated CVaR RL via a distorted (conditional) visitation distribution, and quantify the deviation between this distorted visitation distribution and the original visitation distribution. Below we present a proof sketch for Theorem 1 (see Appendix D.1 for a complete proof).

Proof sketch of Theorem 1. First, we introduce a key inequality (Eq. (3)) to bound the change of CVaR when the true value function shifts to an optimistic one. To this end, let $\beta^{\alpha,V}(\cdot|s,a) \in \mathbb{R}^{S}$ denote the conditional transition probability conditioning on transitioning to the worst α-portion successor states $s'$, i.e., those with the lowest values $V(s')$. It satisfies $\sum_{s' \in S} \beta^{\alpha,V}(s'|s,a) \cdot V(s') = \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)}(V(s'))$. Then, for any (s, a) and value functions $\bar{V}$, $V$ such that $\bar{V}(s') \ge V(s')$ for any $s' \in S$, we have

$$\mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)}\left(\bar{V}(s')\right) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)}\left(V(s')\right) \le \beta^{\alpha,V}(\cdot|s,a)^{\top} \left(\bar{V} - V\right). \tag{3}$$

Eq. (3) implies that the deviation of CVaR between optimistic and true value functions can be bounded by their value deviation under the conditional transition probability, which resolves the aforementioned challenge (ii), and serves as the basis of our recursive regret decomposition.
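Now, since $\bar{V}^k_h$ is an optimistic estimate of $V^{*}_h$, the regret in episode k is at most $\bar{V}^k_1(s^k_1) - V^{\pi^k}_1(s^k_1)$, which we decompose as

$$\begin{aligned}
\bar{V}^k_1(s^k_1) - V^{\pi^k}_1(s^k_1)
&\overset{(a)}{\le} \frac{H}{\alpha}\sqrt{\frac{L}{n_k(s^k_1,a^k_1)}} + \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s^k_1,a^k_1)}\left(\bar{V}^k_2(s')\right) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s^k_1,a^k_1)}\left(\bar{V}^k_2(s')\right) \\
&\qquad + \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s^k_1,a^k_1)}\left(\bar{V}^k_2(s')\right) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s^k_1,a^k_1)}\left(V^{\pi^k}_2(s')\right) \\
&\overset{(b)}{\le} \frac{H}{\alpha}\sqrt{\frac{L}{n_k(s^k_1,a^k_1)}} + \frac{4H}{\alpha}\sqrt{\frac{SL}{n_k(s^k_1,a^k_1)}} + \beta^{\alpha,V^{\pi^k}_2}(\cdot|s^k_1,a^k_1)^{\top}\left(\bar{V}^k_2 - V^{\pi^k}_2\right) \\
&\overset{(c)}{\le} \sum_{h=1}^{H} \sum_{(s,a)} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) \cdot \frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha\sqrt{n_k(s,a)}}.
\end{aligned} \tag{4}$$

Here $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)$ denotes the conditional probability of visiting (s, a) at step h of episode k, conditioning on transitioning to the worst α-portion successor states $s'$ (i.e., those with the lowest α-portion values $V^{\pi^k}_{h'+1}(s')$) at each step $h' = 1, \dots, h-1$. Intuitively, $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)$ is a visitation probability distorted toward the worst-case successor states. Eq. (4) decomposes the regret into estimation error at all state-action pairs via the distorted (conditional) visitation distribution $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)$, which overcomes the aforementioned challenge (i).

Summing Eq. (4) over all episodes k ∈ [K] and using the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
\mathbb{E}[R(K)]
&\le \frac{5H\sqrt{SL}}{\alpha} \sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)} \frac{w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)}{n_k(s,a)}} \cdot \sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)} \\
&\overset{(d)}{=} \frac{5H\sqrt{SL}\cdot\sqrt{KH}}{\alpha} \sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)} \frac{w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)}{w_{kh}(s,a)} \cdot \frac{w_{kh}(s,a)}{n_k(s,a)} \cdot \mathbb{1}\{w_{kh}(s,a) \ne 0\}} \\
&\overset{(e)}{\le} \frac{5H\sqrt{KHSL}}{\alpha} \sqrt{\min\left\{ \frac{1}{\min_{\pi,h,(s,a):\, w_{\pi,h}(s,a)>0} w_{\pi,h}(s,a)},\ \frac{1}{\alpha^{H-1}} \right\} \sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)} \frac{w_{kh}(s,a)}{n_k(s,a)}}.
\end{aligned}$$

Here $w_{kh}(s,a)$ denotes the probability of visiting (s, a) at step h of episode k, and $w_{\pi,h}(s,a)$ denotes the probability of visiting (s, a) at step h under policy π. Equality (d) uses the facts that $\sum_{(s,a)} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) = 1$, and that if the visitation probability $w_{kh}(s,a) = 0$, the conditional visitation probability $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)$ must be 0 as well.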
Inequality (e) holds because the ratio $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) / w_{kh}(s,a)$ can be bounded by both $1/\min_{\pi,h,(s,a):\, w_{\pi,h}(s,a)>0} w_{\pi,h}(s,a)$ and $1/\alpha^{H-1}$. Specifically, the first bound follows from $\min_{\pi,h,(s,a):\, w_{\pi,h}(s,a)>0} w_{\pi,h}(s,a) \le w_{kh}(s,a)$ together with $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) \le 1$, and the second bound comes from the fact that the conditional visitation probability $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)$ is at most $1/\alpha^{H-1}$ times the visitation probability $w_{kh}(s,a)$. Having established the above, we can use an analysis similar to that in classic RL (Azar et al., 2017; Zanette & Brunskill, 2019) to bound $\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)} \frac{w_{kh}(s,a)}{n_k(s,a)}$, and then obtain Theorem 1. □
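The key inequality Eq. (3) from the proof sketch can be checked numerically on random instances. The sketch below is illustrative only: the helper name `worst_case_weights`, the risk level, and the random test distributions are choices made here, not taken from the paper. It builds the conditional distribution β that places mass p(s')/α on the worst successor states until total mass 1 is reached, and verifies that the CVaR gap never exceeds β applied to the value gap.

```python
import numpy as np

def worst_case_weights(values, probs, alpha):
    """beta^{alpha,V}: p restricted to the worst alpha-mass under V, rescaled by 1/alpha."""
    beta = np.zeros(len(probs))
    remaining = alpha
    for i in np.argsort(values):          # lowest values first
        beta[i] = min(probs[i], remaining)
        remaining -= beta[i]
        if remaining <= 0:
            break
    return beta / alpha

rng = np.random.default_rng(0)
alpha, num_states = 0.3, 6
for _ in range(1000):
    p = rng.dirichlet(np.ones(num_states))
    V = rng.uniform(0.0, 5.0, size=num_states)
    V_bar = V + rng.uniform(0.0, 1.0, size=num_states)   # optimistic: V_bar >= V
    beta_V = worst_case_weights(V, p, alpha)
    beta_V_bar = worst_case_weights(V_bar, p, alpha)
    cvar_gap = beta_V_bar @ V_bar - beta_V @ V           # CVaR(V_bar) - CVaR(V)
    assert cvar_gap <= beta_V @ (V_bar - V) + 1e-12      # Eq. (3)
```

The inequality holds because the weights built from $\bar{V}$ minimize the weighted average of $\bar{V}$ among all distributions dominated by p/α, so replacing them with the weights built from V can only increase the first term.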

4.2. REGRET LOWER BOUND

We now present a regret lower bound to demonstrate the optimality of algorithm ICVaR-RM.

Theorem 2 (Regret Lower Bound). There exists an instance of Iterated CVaR RL-RM, where $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) > \alpha^{H-1}$ and the regret of any algorithm is at least

$$\Omega\left( H \sqrt{\frac{AK}{\alpha \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}} \right). \tag{5}$$

In addition, there exists an instance of Iterated CVaR RL-RM, where $\alpha^{H-1} > \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ and the regret of any algorithm is at least $\Omega\left(\sqrt{\frac{AK}{\alpha^{H-1}}}\right)$.

Remark 2. Theorem 2 demonstrates that when α is small, the factor $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ is inevitable in general. This reveals the intrinsic hardness of Iterated CVaR RL, i.e., when the agent is highly sensitive to bad situations, she must suffer a regret due to exploring risky but hard-to-reach states. This lower bound also validates that ICVaR-RM is near-optimal with respect to K.

Lower Bound Analysis. Here we provide the proof idea of the first lower bound (Eq. (5)) in Theorem 2, and defer the full proof to Appendix D.2. We construct an instance with a hard-to-reach bandit state (which has an optimal action and multiple sub-optimal actions), and show that this state is critical for minimizing the regret, but difficult for any algorithm to learn. As shown in Figure 1, we consider an MDP with A actions, n regular states $s_1, \dots, s_n$ and three absorbing states $x_1, x_2, x_3$, where $n < \frac{1}{2}H$.

Published as a conference paper at ICLR 2023

[Figure 1: the lower-bound instance, with regular states $s_1, \dots, s_n$, absorbing states $x_1, x_2, x_3$ (rewards 1, 0.8 and 0.2), and the transition probabilities described in the text.]

Algorithm 2: MaxWP
Input: δ, $\delta' := \frac{\delta}{2}$, $L := \log\left(\frac{SA}{\delta'}\right)$, $\hat{V}^k_{H+1}(s) = 0$ for any k > 0 and s ∈ S
for k = 1, 2, . . . , K do
    for h = H, H − 1, . . . , 1 do
        $\hat{Q}^k_h(s,a) \leftarrow r(s,a) + \min_{s' \sim \hat{p}^k(\cdot|s,a)}\left(\hat{V}^k_{h+1}(s')\right)$, ∀(s, a) ∈ S × A;
        $\hat{V}^k_h(s) \leftarrow \max_{a \in A} \hat{Q}^k_h(s,a)$, $\pi^k_h(s) \leftarrow \operatorname{argmax}_{a \in A} \hat{Q}^k_h(s,a)$, ∀s ∈ S;
    Play the episode k with policy $\pi^k$, and update $n_{k+1}(s,a)$ and $\hat{p}^{k+1}(s'|s,a)$;

The reward function r(s, a) depends only on the states, i.e., $s_1, \dots, s_n$ generate zero reward, and $x_1, x_2, x_3$ generate rewards 1, 0.8 and 0.2, respectively. Let μ be a parameter such that $0 < \alpha < \mu < \frac{1}{3}$. Under all actions, state $s_1$ transitions to $s_2, x_1, x_2, x_3$ with probabilities μ, 1 − 3μ, μ and μ, respectively, and state $s_i$ (2 ≤ i ≤ n − 1) transitions to $s_{i+1}, x_1$ with probabilities μ and 1 − μ, respectively. For the bandit state $s_n$, under the optimal action, $s_n$ transitions to $x_2, x_3$ with probabilities 1 − α + η and α − η, respectively. Under sub-optimal actions, $s_n$ transitions to $x_2, x_3$ with probabilities 1 − α and α, respectively.

In this MDP, under the Iterated CVaR criterion, the value function mainly depends on the path $s_1 \to s_2 \to \cdots \to s_n \to x_2 / x_3$, and especially on the action choice in the bandit state $s_n$. Thus, to distinguish the optimal action in $s_n$, any algorithm must suffer a regret dependent on the probability of visiting $s_n$, which is exactly the minimum visitation probability over all reachable states $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$. Note that in this instance, $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) = \mu^{n-1}$, which does not depend on α and H. This demonstrates that there is an essential dependency on $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ in the lower bound.
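To see why the action choice at the bandit state $s_n$ matters under the Iterated CVaR criterion, the worked calculation below compares the CVaR term of the optimal and sub-optimal actions. It is a sketch under two assumptions that are introduced here and are not stated in the text above: an absorbing state collects its reward at each of the m remaining steps, so that $V(x_2) = 0.8m$ and $V(x_3) = 0.2m$, and $0 < \eta < \alpha$.

```latex
% Sub-optimal action: x_3 has probability exactly \alpha,
% so the worst \alpha-portion consists of x_3 alone.
\mathrm{CVaR}^{\alpha}_{\mathrm{sub}} = V(x_3) = 0.2\,m

% Optimal action: the worst \alpha-portion is mass (\alpha - \eta) on x_3
% plus the remaining mass \eta on x_2.
\mathrm{CVaR}^{\alpha}_{\mathrm{opt}}
  = \frac{(\alpha - \eta)\,V(x_3) + \eta\,V(x_2)}{\alpha}
  = 0.2\,m + \frac{0.6\,\eta\,m}{\alpha}

% Gap between the two actions at the bandit state.
\mathrm{CVaR}^{\alpha}_{\mathrm{opt}} - \mathrm{CVaR}^{\alpha}_{\mathrm{sub}}
  = \frac{0.6\,\eta\,m}{\alpha}
```

Under these assumptions the two kinds of actions differ only through the small shift η, while the bandit state itself is reached with probability $\mu^{n-1}$, which is consistent with the intuition that learning the optimal action there is both important and slow.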

5. ITERATED CVAR RL WITH BEST POLICY IDENTIFICATION

In this section, we design an efficient algorithm ICVaR-BPI, and establish sample complexity upper and lower bounds for Iterated CVaR RL with best policy identification (BPI).

5.1. ALGORITHM ICVaR-BPI AND SAMPLE COMPLEXITY UPPER BOUND

Algorithm ICVaR-BPI introduces a novel distorted (conditional) empirical transition probability to construct the estimation error, which effectively assigns more attention to bad situations and fits the main focus of the Iterated CVaR criterion. Due to the space limit, we defer the pseudo-code and detailed description of ICVaR-BPI to Appendix E.1. Below we present the sample complexity of ICVaR-BPI.

Theorem 3 (Sample Complexity Upper Bound). The number of trajectories used by algorithm ICVaR-BPI to return an ε-optimal policy with probability at least 1 − δ is bounded by

$$O\left( \min\left\{ \frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}} \right\} \frac{H^3 S^2 A}{\varepsilon^2 \alpha^2} \cdot C \right),$$

where $C := \log^2\left( \min\left\{ \frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}} \right\} \frac{HSA}{\varepsilon\alpha\delta} \right)$.

Similar to Theorem 1, $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ and $\alpha^{H-1}$ dominate the bound for a small α and a large α, respectively. When α = 1, the problem reduces to the classic RL formulation with best policy identification, and our sample complexity becomes $\tilde{O}\left(\frac{H^3 S^2 A}{\varepsilon^2}\right)$, which recovers the result in prior classic RL work (Dann et al., 2017). Similar to Theorem 1, this bound has a gap of HS to the state-of-the-art sample complexity for classic RL (Ménard et al., 2021). This gap is due to the fact that the result in (Ménard et al., 2021) is obtained using the Bernstein-type exploration bonuses, which are more fine-grained for the classic RL problem but do not work for general risk-sensitive cases, because they cannot be used to quantify the estimation error of CVaR.

To validate the tightness of Theorem 3, we further provide sample complexity lower bounds $\Omega\left( \frac{H^2 A}{\varepsilon^2 \alpha \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)} \log\frac{1}{\delta} \right)$ and $\Omega\left( \frac{A}{\alpha^{H-1} \varepsilon^2} \log\frac{1}{\delta} \right)$ for different instances, which demonstrate that the factor $\min\{1/\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s),\ 1/\alpha^{H-1}\}$ is indispensable in general (see Appendix E.3 for a formal statement of the lower bound).
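The reduction of the Theorem 3 bound at α = 1 can be written out in one line. The short derivation below is a sketch that uses only the bound as stated above and the fact that a visitation probability is at most 1; the shorthand $w_{\min}$ is introduced here for $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$.

```latex
% At \alpha = 1 we have 1/\alpha^{H-1} = 1, and 1/w_{\min} \ge 1 since w_{\min} \le 1, so
\min\left\{ \frac{1}{w_{\min}},\ \frac{1}{\alpha^{H-1}} \right\}\Bigg|_{\alpha = 1} = 1 .

% Substituting into the bound of Theorem 3:
O\left( 1 \cdot \frac{H^3 S^2 A}{\varepsilon^2 \cdot 1^2} \cdot \log^2\left( \frac{HSA}{\varepsilon\delta} \right) \right)
  = \tilde{O}\left( \frac{H^3 S^2 A}{\varepsilon^2} \right).
```

The same substitution in Theorem 1 removes the minimum factor from the regret bound, which is why both results reduce to their risk-neutral counterparts at α = 1.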

6. WORST PATH RL

In this section, we investigate an interesting limiting case of Iterated CVaR RL when α → 0, called Worst Path RL, in which case the agent aims to maximize the minimum possible cumulative reward. Worst Path RL has a unique feature that, the value function (Eq. ( 2)) concerns only the minimum value of successor states, which are independent of specific transition probabilities. Therefore, once we learn the connectivity among states, we can perform a planning to compute the optimal policy. Yet, this feature does not make the Worst Path RL problem trivial, because it is still challenging to distinguish whether a successor state is hard to reach or does not exist. As a result, a careful scheme is needed to both explore undetected successor states and exploit observations to minimize regret.
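The observation that the Worst Path value function depends only on which successor states are possible, and not on their probabilities, can be sketched as a planning routine over a connectivity structure. This is an illustrative sketch and not the MaxWP algorithm itself: the function name, the boolean `support` array, and the optimistic default of H − h for pairs with no known successor are assumptions made here.

```python
import numpy as np

def worst_path_planning(r, support, H):
    """Backward induction for the Worst Path Bellman optimality equation.

    r:       (S, A) rewards in [0, 1]
    support: (S, A, S) boolean array; support[s, a, s2] is True when s2 is a
             known (or so far observed) successor of (s, a)
    Returns V of shape (H + 2, S) and a greedy policy pi of shape (H + 1, S);
    index 0 of both arrays is unused so that steps are 1-based.
    """
    S, A = r.shape
    V = np.zeros((H + 2, S))
    pi = np.zeros((H + 1, S), dtype=int)
    for h in range(H, 0, -1):
        Q = np.empty((S, A))
        for s in range(S):
            for a in range(A):
                successors = V[h + 1][support[s, a]]
                if successors.size:
                    worst = successors.min()
                else:
                    worst = float(H - h)   # assumption: optimistic default when nothing is known
                Q[s, a] = r[s, a] + worst
        V[h] = Q.max(axis=1)
        pi[h] = Q.argmax(axis=1)
    return V, pi
```

If `support` misses a low-value successor, the minimum is taken over too few states and the resulting Q-value can only be too large, never too small. This is the difficulty described above: a missing successor may be merely hard to reach rather than nonexistent, so exploration is still required.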

6.1. ALGORITHM MaxWP AND REGRET UPPER BOUND

We design an algorithm MaxWP (Algorithm 2) based on a simple and efficient empirical Q-value function, which makes full use of the unique feature of Worst Path RL, and simultaneously explores undetected successor states and exploits the current best action. Specifically, in episode k, MaxWP constructs empirical Q-value/value functions $\hat{Q}^k_h(s,a)$, $\hat{V}^k_h(s)$ using the estimated lowest value of next states, and then takes a greedy policy $\pi^k_h(s)$ with respect to $\hat{Q}^k_h(s,a)$ in this episode.

The intuition behind MaxWP is as follows. Since the Q-value function for Worst Path RL uses the min operator, if the Q-value function is not accurately estimated, it can only be over-estimated (not under-estimated). If over-estimation happens, MaxWP will explore an over-estimated action and drive its empirical Q-value back to its true Q-value. Otherwise, if the Q-value function is already accurate, MaxWP just selects the optimal action. In other words, MaxWP combines the exploration of over-estimated actions (which lead to undetected successor states) and the exploitation of current best actions.

Below we provide the regret guarantee for algorithm MaxWP (Theorem 4), in which $\upsilon_{\pi}(s,a)$ denotes the probability that (s, a) is visited at least once in an episode under policy π.

Remark 3. The factor $\min_{\pi:\, \upsilon_{\pi}(s,a)>0} \upsilon_{\pi}(s,a)$ stands for the minimum probability of visiting (s, a) at least once in an episode over all feasible policies, and $\min_{s' \in \mathrm{supp}(p(\cdot|s,a))} p(s'|s,a)$ denotes the minimum transition probability over all successor states of (s, a). Note that this result cannot be implied by Theorem 1, because the result for Iterated CVaR RL there depends on $\frac{1}{\alpha}$, and simply taking α → 0 leads to a vacuous bound. Theorem 4 demonstrates that algorithm MaxWP enjoys a constant regret with respect to K.
This constant regret is made possible by the unique feature of Worst Path RL that, under the worst path metric, once the agent determines the connectivity among states, she can accurately estimate the value function and find the optimal policy. Furthermore, determining the connectivity among states (with a given confidence) only requires a number of samples independent of K. MaxWP effectively utilizes this problem feature, and efficiently explores the connectivity among states. To validate the optimality of our regret upper bound, we also provide a lower bound Ω(max (s,a):∃h, a̸ =π * 

7. CONCLUSION

In this paper, we investigate a novel Iterated CVaR RL problem with the regret minimization and best policy identification metrics. We design two efficient algorithms ICVaR-RM and ICVaR-BPI, and provide nearly matching regret/sample complexity upper and lower bounds with respect to K. We also study an interesting limiting case called Worst Path RL, and propose a simple and efficient algorithm MaxWP with rigorous regret guarantees. There are several interesting directions for future work, e.g., further closing the gap between upper and lower bounds, and extending our model and results from the tabular setting to the function approximation framework.

APPENDIX A EXPERIMENTS

In this section, we provide experimental results to evaluate the empirical performance of our algorithm ICVaR-RM, and compare it to the state-of-the-art algorithms EULER (Zanette & Brunskill, 2019) and RSVI2 (Fei et al., 2021a) for classic RL and risk-sensitive RL, respectively.

In our experiments, we consider an $H$-layered MDP with $S = 3(H-1)+1$ states and $A$ actions. There is a single state $s_0$ (the initial state) in layer 1. For any $2 \le h \le H$, there are three states $s_{3(h-2)+1}$, $s_{3(h-2)+2}$ and $s_{3(h-2)+3}$ in layer $h$, which induce rewards 1, 0 and 0.4, respectively. The agent starts from $s_0$ in layer 1, and at each step $h \in [H]$ she takes an action from $\{a_1, \dots, a_A\}$ and then transitions to one of the three states in the next layer. For any $a \in \{a_1, \dots, a_{A-1}\}$, action $a$ leads to $s_{3(h-1)+1}$ and $s_{3(h-1)+2}$ with probabilities 0.5 and 0.5, respectively. Action $a_A$ leads to $s_{3(h-1)+2}$ and $s_{3(h-1)+3}$ with probabilities 0.001 and 0.999, respectively.

We set $\alpha \in \{0.05, 0.1, 0.15\}$, $\delta \in \{0.5, 0.005, 0.00005\}$, $H \in \{2, 5, 10\}$, $S \in \{7, 13, 25\}$, $A \in \{3, 5, 12\}$ and $K \in [0, 10000]$ (the change of $K$ can be seen from the X-axis in Figure 2). We take $\alpha = 0.05$, $\delta = 0.005$, $H = 5$, $S = 13$, $A = 5$ and $K = 10000$ as the basic setting, and change the parameters $\alpha$, $\delta$, $H$, $S$, $A$ and $K$ to see how they affect the empirical performance of algorithm ICVaR-RM. For each algorithm, we perform 20 independent runs and report the average regret across runs with 95% confidence intervals.

As shown in Figure 2, our algorithm ICVaR-RM achieves a significantly lower regret than the other algorithms EULER (Zanette & Brunskill, 2019) and RSVI2 (Fei et al., 2021a), which demonstrates that ICVaR-RM can effectively control the risk under the Iterated CVaR criterion and outperforms the baselines. Moreover, the influences of the parameters $\alpha$, $\delta$, $H$, $S$, $A$ and $K$ on the regret of algorithm ICVaR-RM match our theoretical bounds. Specifically, as $\alpha$ or $\delta$ increases, the regret of ICVaR-RM decreases. As $H$, $S$ or $A$ increases, the regret of ICVaR-RM increases as well. As the number of episodes $K$ increases, the regret of ICVaR-RM increases at a sublinear rate.
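The layered MDP described above can be written down directly as a transition tensor. The following is a minimal sketch rather than the authors' experimental code: the function name `build_layered_mdp`, the 0-based indexing of states and actions, the choice to attach rewards to states, the zero reward at $s_0$, and the treatment of the last layer as terminal (all-zero transition rows) are assumptions made for illustration.

```python
import numpy as np

def build_layered_mdp(H: int, A: int):
    """Build the H-layered MDP of the experiments (hypothetical helper).

    Returns P with shape (S, A, S), where P[s, a, s2] = p(s2 | s, a),
    and a per-state reward vector of length S = 3 * (H - 1) + 1.
    """
    S = 3 * (H - 1) + 1
    P = np.zeros((S, A, S))
    reward = np.zeros(S)  # assumption: s_0 carries no reward

    # Layer h (2 <= h <= H) holds states 3(h-2)+1, 3(h-2)+2, 3(h-2)+3.
    for h in range(2, H + 1):
        base = 3 * (h - 2)
        reward[base + 1], reward[base + 2], reward[base + 3] = 1.0, 0.0, 0.4

    def layer_states(h):
        return [0] if h == 1 else [3 * (h - 2) + i for i in (1, 2, 3)]

    # From every state in layer h < H, move to layer h + 1.
    for h in range(1, H):
        nxt = 3 * (h - 1)
        for s in layer_states(h):
            # Actions a_1, ..., a_{A-1}: rewards 1 or 0 with equal probability.
            P[s, :A - 1, nxt + 1] = 0.5
            P[s, :A - 1, nxt + 2] = 0.5
            # Action a_A: reward 0.4 almost surely, reward 0 very rarely.
            P[s, A - 1, nxt + 2] = 0.001
            P[s, A - 1, nxt + 3] = 0.999
    return P, reward

P, reward = build_layered_mdp(H=5, A=5)  # the basic setting has S = 13
```

The construction makes the trade-off visible: the first $A-1$ actions have the higher mean reward, while $a_A$ almost never lands in the zero-reward state, which is what a small risk level $\alpha$ rewards.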

B RELATED WORK

Below we present a complete review of related works.

CVaR-based MDPs (Known Transition). Boda & Filar (2006); Ott (2010); Bäuerle & Ott (2011); Haskell & Jain (2015); Chow et al. (2015) study the CVaR MDP problem, where the objective is to minimize the CVaR of the total cost under known transitions, and demonstrate that the optimal policy for CVaR MDP is history-dependent (not Markovian) and is inefficient to compute exactly. Hardy & Wirch (2004) first define the Iterated CVaR measure, and prove that it is a coherent dynamic risk measure that is applicable to equity-linked insurance. Osogami (2012); Chu & Zhang (2014); Bäuerle & Glauner (2022) investigate iterated coherent risk measures (including Iterated CVaR) in MDPs, and prove the existence of Markovian optimal policies for these MDPs. The above works focus mainly on designing planning algorithms and deriving planning error guarantees under known transitions, while our work develops RL algorithms (interacting with the environment online) and provides regret and sample complexity guarantees under unknown transitions.

Risk-Sensitive Reinforcement Learning (Unknown Transition). Heger (1994); Coraluppi & Marcus (1997; 1999) consider minimizing the worst-case cost in RL, and present dynamic programming of value functions and heuristic algorithms without theoretical analysis. Borkar (2001; 2002) study risk-sensitive RL with the exponential utility measure, and design algorithms based on actor-critic learning and Q-learning, respectively. Di Castro et al. (2012); La & Ghavamzadeh (2013) investigate variance-related risk measures, and devise policy gradient and actor-critic-based algorithms with convergence analysis. Tamar et al. (2015) consider maximizing the CVaR of the total reward, and propose a sampling-based estimator for the CVaR gradient and a stochastic gradient descent algorithm to optimize CVaR. Keramati et al. (2020) also investigate optimizing the CVaR of the total reward, and design an algorithm based on an optimistic version of the distributional Bellman operator. Borkar & Jain (2014); Chow & Ghavamzadeh (2014); Chow et al. (2017) study how to minimize the expected total cost with CVaR-based constraints, and develop policy gradient, actor-critic and stochastic approximation-style algorithms. The above works mainly give convergence analysis, and do not provide finite-time regret and sample complexity guarantees as in our work.

[Figure 2, panels (d)-(o): (d) δ = 0.00005, (e) δ = 0.005, (f) δ = 0.5, (g) H = 2, (h) H = 5, (i) H = 10, (j) S = 7, (k) S = 13, (l) S = 25, (m) A = 3, (n) A = 5, (o) A = 12.]

To the best of our knowledge, there are only a few risk-sensitive RL works which provide finite-time regret analysis (Fei et al., 2020; 2021a; b). Fei et al. (2020) consider risk-sensitive RL with the exponential utility criterion, and propose algorithms based on logarithmic-exponential transformation and least-squares updates. Fei et al. (2021a) further improve the regret bound in (Fei et al., 2020) by developing an exponential Bellman equation and a Bellman backup analytical procedure. Fei et al. (2021b) extend the model and results in (Fei et al., 2020; 2021a) from the tabular setting to the function approximation framework. Our work is very different from the above works (Fei et al., 2020; 2021a; b) in formulation, algorithms and results. The above works use the exponential utility criterion to characterize the risk and take all successor states into account in decision making. They design algorithms based on exponential Bellman equations and doubly decaying exploration bonuses. In contrast, we interpret the risk by the Iterated CVaR criterion, which primarily concerns the worst α-portion of successor states. We develop algorithms using CVaR-adapted exploration bonuses.

The works we discuss above fall in the literature of RL with risk-sensitive criteria. There are also other RL works which focus on state-wise safety. Cheng et al. (2019) utilize control barrier functions (CBFs) to keep the agent within a safe set and guide the learning by constraining explorable policies. Fatemi et al. (2019; 2021) define the notion of dead-end states (which lead to a suboptimal terminal state with probability 1 in finite steps) and aim to avoid getting into dead-end states. The formulations and algorithms in these works greatly differ from ours, and they do not provide finite-time regret and sample complexity analysis as we do. We refer interested readers to the survey (García & Fernández, 2015) for detailed categorization and discussion on safe RL.

C MORE DISCUSSION ON ITERATED CVAR RL

In this section, we first present the expanded value function definitions for Iterated CVaR RL. Then, we compare Iterated CVaR RL with existing risk-sensitive MDP models, including CVaR MDP (Boda & Filar, 2006; Ott, 2010; Bäuerle & Ott, 2011; Chow et al., 2015) and the exponential utility-based RL (Fei et al., 2020; 2021a) .

C.1 VALUE FUNCTION DEFINITIONS FOR ITERATED CVAR RL

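The value function definition for Iterated CVaR RL, i.e., Eq. (i) in Section 3, can be expanded as

$$
\begin{aligned}
Q^{\pi}_h(s,a) &= r(s,a) + \mathrm{CVaR}^{\alpha}_{s_{h+1} \sim p(\cdot|s,a)} \Big( r(s_{h+1}, \pi_{h+1}(s_{h+1})) + \mathrm{CVaR}^{\alpha}_{s_{h+2} \sim p(\cdot|s_{h+1}, \pi_{h+1}(s_{h+1}))} \Big( \dots \mathrm{CVaR}^{\alpha}_{s_H \sim p(\cdot|s_{H-1}, \pi_{H-1}(s_{H-1}))} \big( r(s_H, \pi_H(s_H)) \big) \Big) \Big), \\
V^{\pi}_h(s) &= r(s, \pi_h(s)) + \mathrm{CVaR}^{\alpha}_{s_{h+1} \sim p(\cdot|s,\pi_h(s))} \Big( r(s_{h+1}, \pi_{h+1}(s_{h+1})) + \mathrm{CVaR}^{\alpha}_{s_{h+2} \sim p(\cdot|s_{h+1}, \pi_{h+1}(s_{h+1}))} \Big( \dots \mathrm{CVaR}^{\alpha}_{s_H \sim p(\cdot|s_{H-1}, \pi_{H-1}(s_{H-1}))} \big( r(s_H, \pi_H(s_H)) \big) \Big) \Big).
\end{aligned}
$$

Similarly, the optimal value function definition, i.e., Eq. (ii) in Section 3, can be expanded as

$$
\begin{aligned}
Q^{*}_h(s,a) &= \max_{\pi} \Big\{ r(s,a) + \mathrm{CVaR}^{\alpha}_{s_{h+1} \sim p(\cdot|s,a)} \Big( r(s_{h+1}, \pi_{h+1}(s_{h+1})) + \mathrm{CVaR}^{\alpha}_{s_{h+2} \sim p(\cdot|s_{h+1}, \pi_{h+1}(s_{h+1}))} \Big( \dots \mathrm{CVaR}^{\alpha}_{s_H \sim p(\cdot|s_{H-1}, \pi_{H-1}(s_{H-1}))} \big( r(s_H, \pi_H(s_H)) \big) \Big) \Big) \Big\}, \\
V^{*}_h(s) &= \max_{\pi} \Big\{ r(s, \pi_h(s)) + \mathrm{CVaR}^{\alpha}_{s_{h+1} \sim p(\cdot|s,\pi_h(s))} \Big( r(s_{h+1}, \pi_{h+1}(s_{h+1})) + \mathrm{CVaR}^{\alpha}_{s_{h+2} \sim p(\cdot|s_{h+1}, \pi_{h+1}(s_{h+1}))} \Big( \dots \mathrm{CVaR}^{\alpha}_{s_H \sim p(\cdot|s_{H-1}, \pi_{H-1}(s_{H-1}))} \big( r(s_H, \pi_H(s_H)) \big) \Big) \Big) \Big\}.
\end{aligned}
$$

From the above value function definitions, we can see that Iterated CVaR RL aims to maximize the worst α-portion tail of the reward-to-go at each step, i.e., it applies the CVaR operator to the reward-to-go at each step. Intuitively, Iterated CVaR RL wants to optimize the performance even when bad situations happen at each decision stage.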

C.2 COMPARISON WITH CVAR MDP

The objective of CVaR MDP, e.g., (Boda & Filar, 2006; Ott, 2010; Bäuerle & Ott, 2011; Chow et al., 2015), is to maximize the worst α-portion of the total reward, which is formally defined as

$$
\max_{\pi} \ \mathrm{CVaR}^{\alpha}_{(s_h, a_h) \sim p, \pi} \Big( \sum_{h=1}^{H} r(s_h, a_h) \Big).
$$

Compared to our Iterated CVaR RL (Eq. (6) and Eq. (ii) in Section 3), which concerns bad situations at each step, CVaR MDP takes more of the cumulative reward into account and prefers actions which have better performance in general, but can have larger probabilities of getting into catastrophic states. Thus, CVaR MDP is suitable for scenarios where bad situations lead to a higher cost but not fatal damage, e.g., finance. In contrast, Iterated CVaR RL prefers actions which have smaller probabilities of getting into catastrophic states. Hence, Iterated CVaR RL is most suitable for safety-critical applications, where catastrophic states are unacceptable and need to be carefully avoided, e.g., clinical treatment planning.

We emphasize that Iterated CVaR is not equivalent to simply taking the worst $\alpha^H$-portion of the total reward. In fact, the good $(1-\alpha^H)$-portion of the total reward also contributes to Iterated CVaR. This is because Iterated CVaR accounts for bad situations in all states (both good and bad states) in its iterated computation, instead of just considering bad situations upon bad states.

Below we provide an example of clinical treatment planning to illustrate the difference between Iterated CVaR and CVaR MDP. Here we interpret the objective as cost minimization for ease of understanding, and set the risk level $\alpha = 0.05$. Consider a 4-layered binary tree-structured MDP shown in Figure 3. The state sets in layers 1, 2, 3 and 4 are $\{s_1\}$, $\{s_2, s_3\}$, $\{s_4, \dots, s_7\}$ and $\{s_8, \dots, s_{15}\}$, respectively. There are two actions $a_1, a_2$ in each state, and $a_1, a_2$ have the same transition distribution in all states except the initial state $s_1$.
Thus, a policy amounts to deciding whether to choose $a_1$ or $a_2$ in state $s_1$, which leads to different subsequent costs.

The agent starts from the initial state $s_1$ in layer 1. If the agent takes action $a_1$, she transitions to state $s_2$ deterministically and enters the left sub-tree. On the other hand, if the agent takes action $a_2$ in state $s_1$, she transitions to state $s_3$ deterministically and enters the right sub-tree.

If the agent enters the left sub-tree (state $s_2$) in layer 2, she transitions to $s_4$ and $s_5$ in layer 3 with probabilities 0.05 and 0.95, respectively. Then, if she starts from state $s_4$ in layer 3, she transitions to $s_8$ and $s_9$ in layer 4 with probabilities 0.05 and 0.95, respectively. Otherwise, if she starts from state $s_5$ in layer 3, she transitions to $s_{10}$ and $s_{11}$ in layer 4 with probabilities 0.05 and 0.95, respectively.

On the other hand, if the agent enters the right sub-tree (state $s_3$) in layer 2, she transitions to $s_6$ and $s_7$ in layer 3 with probabilities 0.01 and 0.99, respectively. Then, if she starts from state $s_6$ in layer 3, she transitions to $s_{12}$ and $s_{13}$ in layer 4 with probabilities 0.01 and 0.99, respectively. Otherwise, if she starts from state $s_7$ in layer 3, she transitions to $s_{14}$ and $s_{15}$ in layer 4 with probabilities 0.01 and 0.99, respectively.

The costs are state-dependent, and only the states in layer 4 produce non-zero costs. To be concrete, we use the clinical treatment example, and the costs represent the patient status. Specifically, in layer 4, $s_8$ and $s_{12}$ give cost 1, which denotes death. $s_{13}$ and $s_{14}$ produce cost 0.5, which means that the patient is getting better. $s_9$ and $s_{10}$ induce cost 0.4, which denotes that the patient gets much better. $s_{11}$ and $s_{15}$ produce cost 0, which means that the patient is fully cured.
Under the CVaR criterion, we have that

$$
\begin{aligned}
Q^{\mathrm{CVaR},\alpha}(s_1, a_1) &= \frac{0.0025}{0.05} \cdot 1 + \frac{0.05 - 0.0025}{0.05} \cdot 0.4 = 0.43, \\
Q^{\mathrm{CVaR},\alpha}(s_1, a_2) &= \frac{0.0001}{0.05} \cdot 1 + \frac{0.05 - 0.0001}{0.05} \cdot 0.5 = 0.501.
\end{aligned}
$$

Thus, CVaR MDP will choose action $a_1$ (and go into the left sub-tree), since $a_1$ leads to the better medium states $s_9$ and $s_{10}$, which give a lower cost 0.4 than the cost 0.5 produced by the right sub-tree.

On the other hand, under the Iterated CVaR criterion, we have that

$$
Q^{\mathrm{ICVaR},\alpha}(s_1, a_1) = \frac{0.05}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_4, \cdot) = \frac{0.05}{0.05} \cdot \frac{0.05}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_8, \cdot) = \frac{0.05}{0.05} \cdot \frac{0.05}{0.05} \cdot 1 = 1,
$$

and

$$
\begin{aligned}
Q^{\mathrm{ICVaR},\alpha}(s_1, a_2) &= \frac{0.01}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_6, \cdot) + \frac{0.05 - 0.01}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_7, \cdot) \\
&= \frac{0.01}{0.05} \cdot \Big( \frac{0.01}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_{12}, \cdot) + \frac{0.05 - 0.01}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_{13}, \cdot) \Big) + \frac{0.05 - 0.01}{0.05} \cdot \Big( \frac{0.01}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_{14}, \cdot) + \frac{0.05 - 0.01}{0.05} \cdot Q^{\mathrm{ICVaR},\alpha}(s_{15}, \cdot) \Big) \\
&= 0.2.
\end{aligned}
$$

Thus, Iterated CVaR RL will instead choose action $a_2$, because $a_2$ has a smaller probability of going into the bad left direction (which leads to the catastrophic state $s_{12}$).

The above example shows that Iterated CVaR RL prefers actions with a smaller probability of getting into catastrophic states. In contrast, CVaR MDP favors actions with better average therapeutic effects, but which have a larger probability of causing death.

Note that the above example also demonstrates that Iterated CVaR is not equivalent to the worst $\alpha^H$-portion of the total cost. To see this, we have that (here we consider $\alpha^3$ because there are 3 transition steps):

$$
Q^{\mathrm{CVaR},\alpha^3}(s_1, a_2) = \frac{0.0001}{0.000125} \cdot 1 + \cdots
$$

In addition, one can see that the good state which gives cost 0 (i.e., $s_{15}$) also contributes to $Q^{\mathrm{ICVaR},\alpha}(s_1, a_2)$, which shows that the good $(1-\alpha^H)$-portion of the total cost also matters for Iterated CVaR.
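The nested Iterated CVaR expressions in this example can be evaluated mechanically by a bottom-up recursion over the tree. The sketch below is only an illustration under the stated transition probabilities and layer-4 costs: the helper names `cvar_upper` and `iterated_cvar` and the dictionary encoding of the tree are assumptions, and since the example is phrased as cost minimization, the "worst α-portion" is taken to be the highest-cost tail.

```python
ALPHA = 0.05

def cvar_upper(costs, probs, alpha):
    """Mean of the worst (highest-cost) alpha-portion of a discrete distribution."""
    remaining, total = alpha, 0.0
    for cost, prob in sorted(zip(costs, probs), reverse=True):
        mass = min(prob, remaining)  # probability mass taken from this outcome
        total += mass * cost
        remaining -= mass
        if remaining <= 0:
            break
    return total / alpha

# Layer-4 costs and the transition structure from the example.
LEAF_COST = {8: 1.0, 9: 0.4, 10: 0.4, 11: 0.0,
             12: 1.0, 13: 0.5, 14: 0.5, 15: 0.0}
CHILDREN = {2: [(4, 0.05), (5, 0.95)], 3: [(6, 0.01), (7, 0.99)],
            4: [(8, 0.05), (9, 0.95)], 5: [(10, 0.05), (11, 0.95)],
            6: [(12, 0.01), (13, 0.99)], 7: [(14, 0.01), (15, 0.99)]}

def iterated_cvar(state):
    """Apply the CVaR operator at every internal node, from the leaves up."""
    if state in LEAF_COST:
        return LEAF_COST[state]
    kids = CHILDREN[state]
    return cvar_upper([iterated_cvar(c) for c, _ in kids],
                      [p for _, p in kids], ALPHA)

# a_1 moves to s_2 and a_2 moves to s_3 deterministically, at zero cost.
q_a1 = iterated_cvar(2)
q_a2 = iterated_cvar(3)
print(q_a1, q_a2)
```

Because the recursion applies the tail operator at every node, the zero-cost leaf $s_{15}$ enters the value of $a_2$ through the intermediate value of $s_7$, which is the point made in the text about good outcomes also contributing.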

C.3 COMPARISON WITH EXPONENTIAL UTILITY-BASED RISK-SENSITIVE RL

The Bellman optimality equation for risk-sensitive RL with the exponential utility criterion (Fei et al., 2020; 2021a) is defined as

$$
Q^{*}_h(s,a) = r_h(s,a) + \frac{1}{\beta} \log \Big\{ \mathbb{E}_{s' \sim p(\cdot|s,a)} \big[ \exp\big( \beta \cdot V^{*}_{h+1}(s') \big) \big] \Big\},
$$

which takes all successor states $s'$ into account, i.e., all successor states $s'$ contribute to the computation of the Q-value. Here $\beta < 0$ is a risk-sensitivity parameter.

In contrast, in Iterated CVaR RL, the Bellman optimality equation is defined as

$$
Q^{*}_h(s,a) = r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V^{*}_{h+1}(s') \big),
$$

which focuses only on the worst α-portion of successor states $s'$ (i.e., those with the lowest α-portion of values $V^{*}_{h+1}(s')$), i.e., only the worst α-portion of successor states $s'$ contributes to the computation of the Q-value.

Besides the formulation, our algorithm design and results are also very different from those in (Fei et al., 2020; 2021a). The algorithms in (Fei et al., 2020; 2021a) are based on exponential Bellman equations and doubly decaying exploration bonuses, and their results depend on $\exp(|\beta| H)$. In contrast, our algorithms are based on value iteration for Iterated CVaR with CVaR-adapted exploration bonuses, and our results depend on the minimum between an MDP-intrinsic visitation measure $1 / \min_{\pi, h, s:\, w_{\pi,h}(s) > 0} w_{\pi,h}(s)$ and a risk-level-dependent factor $1/\alpha^{H-1}$.
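The difference between the two backups can be seen numerically on a single state-action pair. The following sketch is illustrative only: the function names, the three-successor distribution and the parameter values are assumptions, and the CVaR here is the lower-tail (reward) version used in the Bellman equation above.

```python
import numpy as np

def cvar_lower(values, probs, alpha):
    """Mean of the lowest alpha-portion of a discrete value distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mu = np.zeros_like(probs)  # mass of each successor inside the tail
    remaining = alpha
    for i in np.argsort(values, kind="stable"):
        mu[i] = min(probs[i], remaining)
        remaining -= mu[i]
        if remaining <= 0:
            break
    return float(mu @ values) / alpha

def cvar_backup(r, next_values, probs, alpha):
    """One Iterated CVaR backup: r + CVaR^alpha over successor values."""
    return r + cvar_lower(next_values, probs, alpha)

def exp_utility_backup(r, next_values, probs, beta):
    """One exponential-utility backup: r + (1/beta) log E[exp(beta V)]."""
    next_values = np.asarray(next_values, dtype=float)
    return r + float(np.log(np.dot(probs, np.exp(beta * next_values))) / beta)

probs = np.array([0.1, 0.4, 0.5])
v_a = np.array([0.0, 5.0, 10.0])
v_b = np.array([0.0, 5.0, 20.0])  # only the best successor's value differs

cvar_a = cvar_backup(1.0, v_a, probs, alpha=0.1)
cvar_b = cvar_backup(1.0, v_b, probs, alpha=0.1)
exp_a = exp_utility_backup(1.0, v_a, probs, beta=-0.5)
exp_b = exp_utility_backup(1.0, v_b, probs, beta=-0.5)
```

Raising the value of the best successor leaves the CVaR backup unchanged, since that successor lies outside the worst α-portion, while the exponential-utility backup moves because every successor enters the expectation.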

D PROOFS FOR ITERATED CVAR RL WITH REGRET MINIMIZATION

In this section, we present the proofs of regret upper and lower bounds (Theorems 1 and 2) for Iterated CVaR RL-RM.

[Figure 4. For each $s' \in \{s_1, s_2, s_3, s_4\}$, the height of the bar denotes the value $V(s')$ (fixed), and the width of the bar denotes the transition probability $p(s'|s,a)$ or $\hat{p}^k(s'|s,a)$. The colored part of the bars denotes the worst α-portion of successor states (i.e., those with the lowest α-portion of values $V(s')$). In this example, $\alpha = 0.5$.]

Lemma 1 (Concentration for $V^*$). It holds that

$$
\Pr \left[ \Big| \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V^*_h(s') \big) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V^*_h(s') \big) \Big| \le \frac{H}{\alpha} \sqrt{ \frac{ \log \left( \frac{KHSA}{\delta'} \right) }{ n_k(s,a) } }, \ \forall k \in [K], \ \forall h \in [H], \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right] \ge 1 - 2\delta'.
$$

Proof of Lemma 1. Using Brown's inequality (Brown, 2007) (Theorem 2 in (Thomas & Learned-Miller, 2019)) and a union bound over $(s,a) \in \mathcal{S} \times \mathcal{A}$ and $n_k(s,a) \in [KH]$, we can obtain this lemma.

For any risk level $\alpha \in (0,1]$, function $V: \mathcal{S} \to \mathbb{R}$ and $(s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A}$, $\beta^{\alpha,V}(s'|s,a)$ is the conditional transition probability from $(s,a)$ to $s'$, conditioning on transitioning to the worst α-portion of successor states $s'$ (i.e., those with the lowest α-portion of values $V(s')$). Let $\mu^{\alpha,V}(s'|s,a)$ denote the amount of the transition probability of successor state $s'$ that belongs to the worst α-portion, which satisfies $\frac{\mu^{\alpha,V}(s'|s,a)}{\alpha} = \beta^{\alpha,V}(s'|s,a)$ and $\sum_{s' \in \mathcal{S}} \mu^{\alpha,V}(s'|s,a) = \alpha$. In addition, for any risk level $\alpha \in (0,1]$, function $V: \mathcal{S} \to \mathbb{R}$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$,

$$
\mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) = \frac{ \sum_{s' \in \mathcal{S}} \mu^{\alpha,V}(s'|s,a) \cdot V(s') }{ \alpha } = \sum_{s' \in \mathcal{S}} \beta^{\alpha,V}(s'|s,a) \cdot V(s').
$$

Lemma 2 (Concentration for any $V$). It holds that

$$
\Pr \left[ \Big| \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V(s') \big) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) \Big| \le \frac{2H}{\alpha} \sqrt{ \frac{ 2S \log \left( \frac{KHSA}{\delta'} \right) }{ n_k(s,a) } }, \ \forall V: \mathcal{S} \to [0,H], \ \forall k \in [K], \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right] \ge 1 - 2\delta'.
$$

Proof of Lemma 2. As shown in Figure 4, we sort all successor states $s' \in \mathcal{S}$ by $V(s')$ in ascending order (from left to right). We add a virtual line at the α-quantile, called the α-quantile line.
Fix the value function $V(\cdot)$, and let the transition probability change from $p(\cdot|s,a)$ to $\hat{p}^k(\cdot|s,a)$. Without loss of generality, below we consider the case where, as the transition probability changes from $p(\cdot|s,a)$ to $\hat{p}^k(\cdot|s,a)$, the α-quantile line shifts from left to right (the analysis of the opposite case can be obtained by interchanging $p(\cdot|s,a)$ and $\hat{p}^k(\cdot|s,a)$). We use original α-quantile line and shifted α-quantile line to denote the α-quantile line before and after the shift, respectively.

We divide the successor states $s' \in \mathcal{S}$ into five subsets as follows. Let $\mathcal{S}_{left}$ and $\mathcal{S}_{right}$ denote the sets of states which are always on the left and right sides of both the original and shifted α-quantile lines, respectively. Let $\mathcal{S}_{middle}$ denote the set of states which lie between the original and shifted α-quantile lines. Let $s_{line\text{-}l}$ and $s_{line\text{-}r}$ denote the states which lie on the original and shifted α-quantile lines, respectively.

For any $s' \in \mathcal{S}_{left}$, we have $\mu^{\alpha,V}(s'|s,a) = p(s'|s,a)$ and $\hat{\mu}^{k;\alpha,V}(s'|s,a) = \hat{p}^k(s'|s,a)$. For any $s' \in \mathcal{S}_{right}$, we have $\mu^{\alpha,V}(s'|s,a) = \hat{\mu}^{k;\alpha,V}(s'|s,a) = 0$. For any $s' \in \mathcal{S}_{middle}$, we have $\mu^{\alpha,V}(s'|s,a) = 0$ and $\hat{\mu}^{k;\alpha,V}(s'|s,a) = \hat{p}^k(s'|s,a)$. For state $s_{line\text{-}r}$, we have $\mu^{\alpha,V}(s_{line\text{-}r}|s,a) = 0$ and

$$
\hat{\mu}^{k;\alpha,V}(s_{line\text{-}r}|s,a) = \alpha - \sum_{s' \in \mathcal{S}_{left}} \hat{p}^k(s'|s,a) - \sum_{s' \in \mathcal{S}_{middle}} \hat{p}^k(s'|s,a) - \hat{p}^k(s_{line\text{-}l}|s,a).
$$
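Then, we obtain

$$
\begin{aligned}
\sum_{s' \in \mathcal{S}} \big| \hat{\mu}^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a) \big| \le\ & \sum_{s' \in \mathcal{S}_{left}} \big| \hat{\mu}^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a) \big| + \sum_{s' \in \mathcal{S}_{right}} \big| \hat{\mu}^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a) \big| \\
& + \sum_{s' \in \mathcal{S}_{middle}} \big| \hat{\mu}^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a) \big| + \big| \hat{\mu}^{k;\alpha,V}(s_{line\text{-}l}|s,a) - \mu^{\alpha,V}(s_{line\text{-}l}|s,a) \big| \\
& + \big| \hat{\mu}^{k;\alpha,V}(s_{line\text{-}r}|s,a) - \mu^{\alpha,V}(s_{line\text{-}r}|s,a) \big|.
\end{aligned}
$$

Thus, we have

$$
\begin{aligned}
\Big| \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V(s') \big) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) \Big| &= \left| \frac{ \sum_{s' \in \mathcal{S}} \hat{\mu}^{k;\alpha,V}(s'|s,a) \cdot V(s') }{ \alpha } - \frac{ \sum_{s' \in \mathcal{S}} \mu^{\alpha,V}(s'|s,a) \cdot V(s') }{ \alpha } \right| \\
&= \left| \frac{ \sum_{s' \in \mathcal{S}} \big( \hat{\mu}^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a) \big) \cdot V(s') }{ \alpha } \right| \\
&\le \sum_{s' \in \mathcal{S}} \big| \hat{\mu}^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a) \big| \cdot \frac{H}{\alpha} \\
&\le 2 \sum_{s' \in \mathcal{S}} \big| \hat{p}^k(s'|s,a) - p(s'|s,a) \big| \cdot \frac{H}{\alpha}. 
\end{aligned}
\tag{8}
$$

Using Eq. (55) in (Zanette & Brunskill, 2019) (originated from (Weissman et al., 2003)), we have that with probability at least $1 - 2\delta'$, for any $k \in [K]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$,

$$
\sum_{s' \in \mathcal{S}} \big| \hat{p}^k(s'|s,a) - p(s'|s,a) \big| \le \sqrt{ \frac{ 2S \log \left( \frac{KHSA}{\delta'} \right) }{ n_k(s,a) } }. \tag{9}
$$

Plugging Eq. (9) into Eq. (8), we obtain that with probability at least $1 - 2\delta'$, for any $k \in [K]$, $(s,a) \in \mathcal{S} \times \mathcal{A}$ and function $V: \mathcal{S} \to [0,H]$,

$$
\Big| \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V(s') \big) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) \Big| \le \frac{2H}{\alpha} \sqrt{ \frac{ 2S \log \left( \frac{KHSA}{\delta'} \right) }{ n_k(s,a) } }.
$$

For any $k > 0$, $h \in [H]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$, let $w_{kh}(s,a)$ denote the probability of visiting $(s,a)$ at step $h$ of episode $k$.

Lemma 3 (Concentration of Visitation). It holds that

$$
\Pr \left[ n_k(s,a) \ge \frac{1}{2} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) - H \log \left( \frac{HSA}{\delta'} \right), \ \forall k > 0, \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right] \ge 1 - \delta'.
$$

Proof of Lemma 3. Applying Lemma F.4 in (Dann et al., 2017), we have that for any fixed $h \in [H]$,

$$
\Pr \left[ n_{kh}(s,a) \ge \frac{1}{2} \sum_{k'=1}^{k-1} w_{k'h}(s,a) - \log \left( \frac{HSA}{\delta'} \right), \ \forall k > 0, \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right] \ge 1 - \frac{\delta'}{H}.
$$

By a union bound over $h \in [H]$, we have

$$
\Pr \left[ n_k(s,a) \ge \frac{1}{2} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) - H \log \left( \frac{HSA}{\delta'} \right) \right] \ge 1 - \delta'.
$$

To sum up, we define several concentration events which will be used in the following proof.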
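$$
\begin{aligned}
\mathcal{E}_1 &:= \left\{ \Big| \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V^*_h(s') \big) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V^*_h(s') \big) \Big| \le \frac{H}{\alpha} \sqrt{ \frac{ \log \left( \frac{KHSA}{\delta'} \right) }{ n_k(s,a) } }, \ \forall k \in [K], \ \forall h \in [H], \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right\}, \\
\mathcal{E}_2 &:= \left\{ \Big| \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V(s') \big) - \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) \Big| \le \frac{2H}{\alpha} \sqrt{ \frac{ 2S \log \left( \frac{KHSA}{\delta'} \right) }{ n_k(s,a) } }, \ \forall V: \mathcal{S} \to [0,H], \ \forall k \in [K], \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right\}, \\
\mathcal{E}_3 &:= \left\{ n_k(s,a) \ge \frac{1}{2} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) - H \log \left( \frac{HSA}{\delta'} \right), \ \forall k > 0, \ \forall (s,a) \in \mathcal{S} \times \mathcal{A} \right\}, \\
\mathcal{E} &:= \mathcal{E}_1 \cap \mathcal{E}_2 \cap \mathcal{E}_3.
\end{aligned}
$$

Lemma 4. Letting $\delta' = \frac{\delta}{5}$, it holds that $\Pr[\mathcal{E}] \ge 1 - \delta$.

Proof of Lemma 4. This lemma can be obtained by combining Lemmas 1-3 with a union bound, since the three events fail with probability at most $2\delta'$, $2\delta'$ and $\delta'$, respectively.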
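The deterministic step behind Lemma 2, namely Eq. (8), which bounds the change of CVaR by the $\ell_1$ distance between two transition distributions, can be sanity-checked numerically. The sketch below is an illustration and not part of the proof: the helper names, the Dirichlet sampling of random distributions and all parameter values are assumptions.

```python
import numpy as np

def cvar_lower(values, probs, alpha):
    """Mean of the lowest alpha-portion of a discrete value distribution."""
    values = np.asarray(values, dtype=float)
    probs = np.asarray(probs, dtype=float)
    mu = np.zeros_like(probs)  # mu[i]: mass of successor i inside the tail
    remaining = alpha
    for i in np.argsort(values, kind="stable"):
        mu[i] = min(probs[i], remaining)
        remaining -= mu[i]
        if remaining <= 0:
            break
    return float(mu @ values) / alpha

def check_l1_bound(num_trials=1000, S=6, H=5.0, alpha=0.2, seed=0):
    """Check |CVaR_q(V) - CVaR_p(V)| <= (2H / alpha) * ||q - p||_1 on random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(num_trials):
        p = rng.dirichlet(np.ones(S))   # stands in for the true p(.|s,a)
        q = rng.dirichlet(np.ones(S))   # stands in for the empirical estimate
        V = rng.uniform(0.0, H, size=S)
        gap = abs(cvar_lower(V, q, alpha) - cvar_lower(V, p, alpha))
        bound = 2.0 * H / alpha * np.abs(q - p).sum()
        if gap > bound + 1e-9:
            return False
    return True

bound_holds = check_l1_bound()
```

A passing check on random instances is of course not a proof, but it is a cheap way to catch sign or normalization mistakes when re-deriving the inequality.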

D.1.2 OPTIMISM, VISITATION AND CVAR GAP

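Recall that $L := \log \left( \frac{KHSA}{\delta'} \right)$.

Lemma 5 (Optimism). Suppose that event $\mathcal{E}$ holds. Then, for any $k \in [K]$, $h \in [H]$ and $s \in \mathcal{S}$, we have $\bar{V}^k_h(s) \ge V^*_h(s)$.

Proof of Lemma 5. We prove this lemma by induction. First, for any $k \in [K]$ and $s \in \mathcal{S}$, it holds that $\bar{V}^k_{H+1}(s) = V^*_{H+1}(s) = 0$. Then, for any $k \in [K]$, $h \in [H]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$, if $\bar{Q}^k_h(s,a) = H$, then $\bar{Q}^k_h(s,a) \ge Q^*_h(s,a)$ trivially holds, and otherwise,

$$
\begin{aligned}
\bar{Q}^k_h(s,a) &= r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( \bar{V}^k_{h+1}(s') \big) + \frac{H}{\alpha} \sqrt{ \frac{L}{n_k(s,a)} } \\
&\overset{(a)}{\ge} r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim \hat{p}^k(\cdot|s,a)} \big( V^*_{h+1}(s') \big) + \frac{H}{\alpha} \sqrt{ \frac{L}{n_k(s,a)} } \\
&\overset{(b)}{\ge} r(s,a) + \mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V^*_{h+1}(s') \big) \\
&= Q^*_h(s,a),
\end{aligned}
$$

where (a) uses the induction hypothesis and (b) comes from Lemma 1. Thus, we have

$$
\bar{V}^k_h(s) \ge \bar{Q}^k_h(s, \pi^*_h(s)) \ge Q^*_h(s, \pi^*_h(s)) = V^*_h(s),
$$

which concludes the proof.

Following (Zanette & Brunskill, 2019), for any episode $k > 0$, we define the set of state-action pairs which have sufficient visitations in expectation as follows.

$$
\mathcal{L}_k := \left\{ (s,a) \in \mathcal{S} \times \mathcal{A} : \ \frac{1}{4} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) \ge H \log \left( \frac{HSA}{\delta'} \right) + H \right\}. \tag{10}
$$

Lemma 6 (Sufficient Visitation). Suppose that event $\mathcal{E}$ holds. Then, for any $k > 0$ and $(s,a) \in \mathcal{L}_k$,

$$
n_k(s,a) \ge \frac{1}{4} \sum_{k'=1}^{k} \sum_{h=1}^{H} w_{k'h}(s,a).
$$

Proof of Lemma 6. This proof is the same as that of Lemma 6 in (Zanette & Brunskill, 2019). Using Lemma 3, we have

$$
\begin{aligned}
n_k(s,a) &\ge \frac{1}{2} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) - H \log \left( \frac{HSA}{\delta'} \right) \\
&= \frac{1}{4} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) + \frac{1}{4} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) - H \log \left( \frac{HSA}{\delta'} \right) \\
&\overset{(a)}{\ge} \frac{1}{4} \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) + H \\
&\overset{(b)}{\ge} \frac{1}{4} \left( \sum_{k'=1}^{k-1} \sum_{h=1}^{H} w_{k'h}(s,a) + \sum_{h=1}^{H} w_{kh}(s,a) \right) \\
&= \frac{1}{4} \sum_{k'=1}^{k} \sum_{h=1}^{H} w_{k'h}(s,a),
\end{aligned}
$$

where (a) uses the definition of $\mathcal{L}_k$ (Eq. (10)) and (b) uses $\sum_{h=1}^{H} w_{kh}(s,a) \le H$.

Lemma 7 (Standard Visitation Ratio). For any $K > 0$, we have

$$
\sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \in \mathcal{L}_k} \frac{ w_{kh}(s,a) }{ n_k(s,a) } \le 2 SA \log \left( \frac{KHSA}{\delta'} \right).
$$

Proof of Lemma 7. This proof is the same as that of Lemma 13 in (Zanette & Brunskill, 2019). Recall that for any $k > 0$, $w_k(s,a) := \sum_{h=1}^{H} w_{kh}(s,a)$.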
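Then, we have

$$
\begin{aligned}
\sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \in \mathcal{L}_k} \frac{ w_{kh}(s,a) }{ n_k(s,a) } &= \sum_{k=1}^{K} \sum_{(s,a) \in \mathcal{L}_k} \frac{ w_k(s,a) }{ n_k(s,a) } \\
&= \sum_{k=1}^{K} \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \frac{ w_k(s,a) }{ n_k(s,a) } \cdot \mathbb{1}\{ (s,a) \in \mathcal{L}_k \} \\
&\overset{(a)}{\le} 2 \sum_{k=1}^{K} \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \frac{ w_k(s,a) }{ \sum_{k'=1}^{k} w_{k'}(s,a) } \cdot \mathbb{1}\{ (s,a) \in \mathcal{L}_k \} \\
&= 2 \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \sum_{k=1}^{K} \frac{ w_k(s,a) }{ \sum_{k'=1}^{k} w_{k'}(s,a) } \cdot \mathbb{1}\{ (s,a) \in \mathcal{L}_k \},
\end{aligned}
$$

where (a) is due to Lemma 6.

According to the definition of $\mathcal{L}_k$ (Eq. (10)), for any $(s,a) \in \mathcal{S} \times \mathcal{A}$, once $(s,a)$ satisfies $(s,a) \in \mathcal{L}_k$ in some episode $k$, it will always satisfy $(s,a) \in \mathcal{L}_{k'}$ for all $k' \ge k$. For any $(s,a) \in \mathcal{S} \times \mathcal{A}$, let $k_0(s,a)$ denote the first episode $k$ where $(s,a) \in \mathcal{L}_k$. Then, for any $k > 0$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$, if $(s,a) \in \mathcal{L}_k$, we have

$$
\sum_{k'=1}^{k} w_{k'}(s,a) = \sum_{k'=1}^{k_0(s,a)-1} w_{k'}(s,a) + \sum_{k'=k_0(s,a)}^{k} w_{k'}(s,a) \overset{(a)}{\ge} H + \sum_{k'=k_0(s,a)}^{k} w_{k'}(s,a),
$$

where (a) uses the fact that $(s,a) \in \mathcal{L}_{k_0(s,a)}$ and the definition of $\mathcal{L}_k$ (Eq. (10)).

Thus, we have

$$
\begin{aligned}
\sum_{k=k_0(s,a)}^{K} \frac{ w_k(s,a) }{ H + \sum_{k'=k_0(s,a)}^{k} w_{k'}(s,a) } &= \sum_{k=1}^{K-k_0(s,a)+1} \frac{ f(k) }{ H + F(k) } \\
&= \int_{0}^{K-k_0(s,a)+1} \frac{ f(\lceil x \rceil) }{ H + F(\lceil x \rceil) } \, dx \\
&\overset{(a)}{\le} \int_{0}^{K-k_0(s,a)+1} \frac{ f(x) }{ H + F(x) } \, dx \\
&= \log \big( H + F(K - k_0(s,a) + 1) \big) - \log \big( H + F(0) \big) \\
&\overset{(b)}{\le} \log (KH) \\
&\le \log \left( \frac{KHSA}{\delta'} \right),
\end{aligned}
$$

where (a) uses the fact that for any $0 \le x \le K - k_0(s,a) + 1$, $f(x) = f(\lceil x \rceil)$ and $F(x) \le F(\lceil x \rceil)$, and (b) is due to the fact that $k_0(s,a) \ge 2$ by the definitions of $\mathcal{L}_k$ (Eq. (10)) and $k_0(s,a)$.

Therefore, we have

$$
\sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \in \mathcal{L}_k} \frac{ w_{kh}(s,a) }{ n_k(s,a) } \le 2 \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \log \left( \frac{KHSA}{\delta'} \right) \le 2 SA \log \left( \frac{KHSA}{\delta'} \right).
$$

Recall that for any $(s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A}$, $p(s'|s,a)$ is the transition probability from $(s,a)$ to $s'$. For any risk level $\alpha \in (0,1]$, function $V: \mathcal{S} \to \mathbb{R}$ and $(s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A}$, $\beta^{\alpha,V}(s'|s,a)$ is the conditional probability of transitioning to $s'$ from $(s,a)$, conditioning on transitioning to the worst α-portion of successor states $s'$ (i.e., those with the lowest α-portion of values $V(s')$), and it holds that $\mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) = \sum_{s' \in \mathcal{S}} \beta^{\alpha,V}(s'|s,a) \cdot V(s')$.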
For any k > 0, h ∈ [H] and (s, a) ∈ S × A, w kh (s, a) is the probability of visiting (s, a) at step h of episode k (under transition probability p(•|•, •)), and it holds that w kh (s, a) ∈ [0, 1] and (s,a)∈S×A w kh (s, a) = 1. For any risk level α ∈ (0, 1], k > 0, h ∈ [H] and (s, a) ∈ S × A, w CV aR,α,V π k kh (s, a) is the conditional probability of visiting (s, a) at step h of episode k, conditioning on transitioning to the worst α-portion successor states s ′ (i.e., with the lowest α-portion values V π k h ′ +1 (s ′ )) at each step h ′ = 1, . . . , h -1. Here π k is the policy taken in episode k, and V π k h (•) : S → R is the value function at step h for policy π k . Intuitively, w CV aR,α,V π k kh (s, a) is the probability of visiting (s, a) at step h of episode k under conditional transition probability Note that for each step h ′ = 1, . . . , h -1, the conditional transition probability β α,V π k h ′ +1 (s ′ |s, a) just renormalizes the transition probability and assigns more weights to the worst α-portion successor states s ′ (i.e., with the lowest α-portion values V π k h ′ +1 (s ′ )), but will not make an unreachable successor state reachable. Thus, (s, a) is also unreachable under conditional transition probability  β α,V π k h ′ +1 (•|•, •) for each step h ′ = 1, . . . , h -1. It holds that for any risk level α ∈ (0, 1], k > 0, h ∈ [H] β α,V π k h ′ +1 (•|•, •) for each step h ′ = 1, . . . , h -1, β α,V h ′ +1 (•|•, •) for each step h ′ = 1, . . . , h -1, respectively, we have that w kh (s, a) = (s2,...,s h-1 )∈S h-2 h-1 h ′ =1 p(s h ′ +1 |s h ′ , a h ′ ) and w CVaR,α,V kh (s, a) = (s2,...,s h-1 )∈S h-2 h-1 h ′ =1 β α,V h ′ +1 (s h ′ +1 |s h ′ , a h ′ ), where s 1 is the initial state, s h := s, and a h ′ := π k (s h ′ ) for h ′ = 1, . . . , h -1. 
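Recall that for any risk level $\alpha \in (0,1]$, function $V: \mathcal{S} \to \mathbb{R}$ and $(s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A}$, $\mu^{\alpha,V}(s'|s,a)$ denotes the amount of the transition probability of successor state $s'$ that belongs to the worst α-portion of successor states (i.e., those with the lowest α-portion of values $V(\cdot)$), which satisfies $\frac{\mu^{\alpha,V}(s'|s,a)}{\alpha} = \beta^{\alpha,V}(s'|s,a)$ and $0 \le \mu^{\alpha,V}(s'|s,a) \le p(s'|s,a)$.

Thus, we have

$$
\begin{aligned}
w^{\mathrm{CVaR},\alpha,V}_{kh}(s,a) &= \sum_{(s_2, \dots, s_{h-1}) \in \mathcal{S}^{h-2}} \prod_{h'=1}^{h-1} \frac{ \mu^{\alpha,V_{h'+1}}(s_{h'+1}|s_{h'}, a_{h'}) }{ \alpha } \\
&\le \sum_{(s_2, \dots, s_{h-1}) \in \mathcal{S}^{h-2}} \prod_{h'=1}^{h-1} \frac{ p(s_{h'+1}|s_{h'}, a_{h'}) }{ \alpha } \\
&= \frac{1}{\alpha^{h-1}} \sum_{(s_2, \dots, s_{h-1}) \in \mathcal{S}^{h-2}} \prod_{h'=1}^{h-1} p(s_{h'+1}|s_{h'}, a_{h'}) \\
&= \frac{1}{\alpha^{h-1}} \cdot w_{kh}(s,a).
\end{aligned}
$$

Therefore,

$$
\frac{ w^{\mathrm{CVaR},\alpha,V}_{kh}(s,a) }{ w_{kh}(s,a) } \le \frac{1}{\alpha^{h-1}}.
$$

Combining Eqs. (11) and (12), we obtain this lemma.

Lemma 10 (Insufficient Visitation). It holds that

$$
\sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \notin \mathcal{L}_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) \le \min \left\{ \frac{1}{ \min_{\pi, h, s:\, w_{\pi,h}(s) > 0} w_{\pi,h}(s) }, \ \frac{1}{\alpha^{H-1}} \right\} \cdot \left( 4SAH \log \left( \frac{HSA}{\delta'} \right) + 5SAH \right).
$$

Proof of Lemma 10. According to the definition of $\mathcal{L}_k$ (Eq. (10)), for any $(s,a) \in \mathcal{S} \times \mathcal{A}$, once $(s,a)$ satisfies $(s,a) \in \mathcal{L}_k$ in some episode $k$, it will always satisfy $(s,a) \in \mathcal{L}_{k'}$ for all $k' \ge k$. For any $(s,a) \in \mathcal{S} \times \mathcal{A}$, let $\bar{k}(s,a)$ denote the last episode $k$ where $(s,a) \notin \mathcal{L}_k$. For any policy $\pi$, $h \in [H]$ and $(s,a) \in \mathcal{S} \times \mathcal{A}$, let $w_{\pi,h}(s,a)$ and $w_{\pi,h}(s)$ denote the probabilities of visiting $(s,a)$ and $s$ at step $h$ under policy $\pi$, respectively. Then, we have

$$
\begin{aligned}
\sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \notin \mathcal{L}_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) &\overset{(a)}{=} \sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \notin \mathcal{L}_k} \frac{ w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) }{ w_{kh}(s,a) } \cdot w_{kh}(s,a) \cdot \mathbb{1}\{ w_{kh}(s,a) \ne 0 \} \\
&\overset{(b)}{\le} \min \left\{ \frac{1}{ \min_{\pi, h, (s,a):\, w_{\pi,h}(s,a) > 0} w_{\pi,h}(s,a) }, \ \frac{1}{\alpha^{H-1}} \right\} \sum_{k=1}^{K} \sum_{h=1}^{H} \sum_{(s,a) \notin \mathcal{L}_k} w_{kh}(s,a) \\
&\overset{(c)}{\le} \min \left\{ \frac{1}{ \min_{\pi, h, s:\, w_{\pi,h}(s) > 0} w_{\pi,h}(s) }, \ \frac{1}{\alpha^{H-1}} \right\} \left( 4SAH \log \left( \frac{HSA}{\delta'} \right) + 5SAH \right).
\end{aligned}
$$

Recall that for any risk level $\alpha \in (0,1]$, function $V: \mathcal{S} \to \mathbb{R}$ and $(s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A}$, $\beta^{\alpha,V}(s'|s,a)$ is the conditional probability of transitioning to $s'$ from $(s,a)$, conditioning on transitioning to the worst α-portion of successor states $s'$ (i.e., those with the lowest α-portion of values $V(s')$), and it holds that

$$
\mathrm{CVaR}^{\alpha}_{s' \sim p(\cdot|s,a)} \big( V(s') \big) = \sum_{s' \in \mathcal{S}} \beta^{\alpha,V}(s'|s,a) \cdot V(s').
$$

Lemma 11 (CVaR Gap due to Value Function Shift).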
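For any $(s,a)\in\mathcal{S}\times\mathcal{A}$, distribution $p(\cdot|s,a)\in\triangle_{\mathcal{S}}$, and functions $\bar V, V:\mathcal{S}\to[0,H]$ such that $\bar V(s')\ge V(s')$ for any $s'\in\mathcal{S}$,

$$\mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(\bar V(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V(s')) \le \beta^{\alpha,V}(\cdot|s,a)^{\top}\left(\bar V - V\right).$$

Proof of Lemma 11. Recall that for any risk level $\alpha\in(0,1]$, function $V:\mathcal{S}\to\mathbb{R}$ and $(s',s,a)\in\mathcal{S}\times\mathcal{S}\times\mathcal{A}$, $\beta^{\alpha,V}(s'|s,a)$ is the conditional transition probability from $(s,a)$ to $s'$, conditioning on transitioning to the worst $\alpha$-portion successor states $s'$ (i.e., with the lowest $\alpha$-portion values $V(s')$), and $\mu^{\alpha,V}(s'|s,a)$ denotes how much of the transition probability of successor state $s'$ belongs to the worst $\alpha$-portion, which satisfies $\frac{\mu^{\alpha,V}(s'|s,a)}{\alpha} = \beta^{\alpha,V}(s'|s,a)$ and $\sum_{s'\in\mathcal{S}}\mu^{\alpha,V}(s'|s,a) = \alpha$. Then, for any risk level $\alpha\in(0,1]$, function $V:\mathcal{S}\to\mathbb{R}$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$\mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V(s')) = \frac{\sum_{s'\in\mathcal{S}}\mu^{\alpha,V}(s'|s,a)\cdot V(s')}{\alpha} = \sum_{s'\in\mathcal{S}}\beta^{\alpha,V}(s'|s,a)\cdot V(s').$$

As shown in Figure 5, we sort all successor states $s'\in\mathcal{S}$ by their values $V(s')$ in ascending order (from left to right). Fix the transition probability $p(\cdot|s,a)$, and let the value function shift from $V(\cdot)$ to $\bar V(\cdot)$. We divide all successor states $s'\in\mathcal{S}$ into three subsets, $\mathcal{S}_{\mathrm{up}}$, $\mathcal{S}_{\mathrm{down}}$ and $\mathcal{S}_{\mathrm{unch}}$, according to how $\mu^{\alpha,V}(s'|s,a)$ changes to $\mu^{\alpha,\bar V}(s'|s,a)$ as $V(s')$ shifts to $\bar V(s')$.

• For any $s'\in\mathcal{S}_{\mathrm{up}}$, $\mu^{\alpha,\bar V}(s'|s,a) < \mu^{\alpha,V}(s'|s,a)$, the rank of $s'$ goes up, and the position of $s'$ moves to the right (here "rank" means to rank all successor states $s'\in\mathcal{S}$ by their values $V(s')$ or $\bar V(s')$ from highest to lowest).
• For any $s'\in\mathcal{S}_{\mathrm{down}}$, $\mu^{\alpha,\bar V}(s'|s,a) > \mu^{\alpha,V}(s'|s,a)$, the rank of $s'$ goes down, and the position of $s'$ moves to the left.
• For any $s'\in\mathcal{S}_{\mathrm{unch}}$, $\mu^{\alpha,\bar V}(s'|s,a) = \mu^{\alpha,V}(s'|s,a)$, and the rank and position of $s'$ remain unchanged.

Since both $\mu^{\alpha,\bar V}(\cdot|s,a)$ and $\mu^{\alpha,V}(\cdot|s,a)$ sum to $\alpha$, it holds that

$$\sum_{s'\in\mathcal{S}_{\mathrm{up}}}\left(\mu^{\alpha,\bar V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a)\right) + \sum_{s'\in\mathcal{S}_{\mathrm{down}}}\left(\mu^{\alpha,\bar V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a)\right) = 0. \tag{13}$$

Next, writing $\mu^{\bar V}(s')$ and $\mu^{V}(s')$ as shorthand for $\mu^{\alpha,\bar V}(s'|s,a)$ and $\mu^{\alpha,V}(s'|s,a)$ inside this display only, we have

$$\begin{aligned}
&\mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(\bar V(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V(s')) \\
&= \frac{1}{\alpha}\Bigg[\sum_{s'\in\mathcal{S}_{\mathrm{up}}}\Big(\mu^{\bar V}(s')\bar V(s') - \mu^{V}(s')V(s')\Big) + \sum_{s'\in\mathcal{S}_{\mathrm{down}}}\Big(\mu^{\bar V}(s')\bar V(s') - \mu^{V}(s')V(s')\Big) + \sum_{s'\in\mathcal{S}_{\mathrm{unch}}}\Big(\mu^{\bar V}(s')\bar V(s') - \mu^{V}(s')V(s')\Big)\Bigg] \\
&= \frac{1}{\alpha}\Bigg[\sum_{s'\in\mathcal{S}_{\mathrm{up}}}\Big(\mu^{V}(s')\big(\bar V(s') - V(s')\big) + \big(\mu^{\bar V}(s') - \mu^{V}(s')\big)\bar V(s')\Big) + \sum_{s'\in\mathcal{S}_{\mathrm{down}}}\Big(\mu^{V}(s')\big(\bar V(s') - V(s')\big) + \big(\mu^{\bar V}(s') - \mu^{V}(s')\big)\bar V(s')\Big) + \sum_{s'\in\mathcal{S}_{\mathrm{unch}}}\mu^{V}(s')\big(\bar V(s') - V(s')\big)\Bigg] \\
&= \frac{1}{\alpha}\Bigg[\sum_{s'\in\mathcal{S}}\mu^{V}(s')\big(\bar V(s') - V(s')\big) - \sum_{s'\in\mathcal{S}_{\mathrm{up}}}\big(\mu^{V}(s') - \mu^{\bar V}(s')\big)\bar V(s') + \sum_{s'\in\mathcal{S}_{\mathrm{down}}}\big(\mu^{\bar V}(s') - \mu^{V}(s')\big)\bar V(s')\Bigg] \\
&\overset{(a)}{\le} \frac{1}{\alpha}\Bigg[\sum_{s'\in\mathcal{S}}\mu^{V}(s')\big(\bar V(s') - V(s')\big) - \min_{s'\in\mathcal{S}_{\mathrm{up}}}\bar V(s')\cdot\sum_{s'\in\mathcal{S}_{\mathrm{up}}}\big(\mu^{V}(s') - \mu^{\bar V}(s')\big) + \min_{s'\in\mathcal{S}_{\mathrm{up}}}\bar V(s')\cdot\sum_{s'\in\mathcal{S}_{\mathrm{down}}}\big(\mu^{\bar V}(s') - \mu^{V}(s')\big)\Bigg] \\
&\overset{(b)}{=} \frac{1}{\alpha}\sum_{s'\in\mathcal{S}}\mu^{\alpha,V}(s'|s,a)\big(\bar V(s') - V(s')\big) \\
&= \beta^{\alpha,V}(\cdot|s,a)^{\top}\left(\bar V - V\right).
\end{aligned}$$

Here (a) holds because for any $s'\in\mathcal{S}_{\mathrm{up}}$, $\mu^{\alpha,\bar V}(s'|s,a) < \mu^{\alpha,V}(s'|s,a)$, and for any $s\in\mathcal{S}_{\mathrm{up}}$ and $s'\in\mathcal{S}_{\mathrm{down}}$, $\bar V(s)\ge\bar V(s')$. Step (b) comes from Eq. (13).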
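Lemma 11 can be sanity-checked numerically on small random instances. The following is a minimal sketch, not part of the original analysis: the helper names (`worst_alpha_mass`, `cvar`, `check_lemma11`) are hypothetical, and the discrete CVaR is computed by greedily filling the worst $\alpha$-portion in ascending order of value, which matches the definition of $\mu^{\alpha,V}$ used above.

```python
import numpy as np

def worst_alpha_mass(p, v, alpha):
    """mu[s'] = part of p[s'] that lies in the worst alpha-portion,
    filling successor states in ascending order of their values v."""
    order = np.argsort(v, kind="stable")
    mu = np.zeros_like(p, dtype=float)
    remaining = alpha
    for idx in order:
        take = min(p[idx], remaining)
        mu[idx] = take
        remaining -= take
        if remaining <= 0:
            break
    return mu

def cvar(p, v, alpha):
    """Discrete CVaR at level alpha: average value over the worst alpha-portion."""
    return float(worst_alpha_mass(p, v, alpha) @ v) / alpha

def check_lemma11(trials=1000, n_states=6, horizon=5.0, seed=0):
    """Return the largest observed value of LHS - RHS in Lemma 11 (should be <= 0
    up to floating-point error)."""
    rng = np.random.default_rng(seed)
    worst_violation = 0.0
    for _ in range(trials):
        p = rng.dirichlet(np.ones(n_states))
        alpha = rng.uniform(0.05, 1.0)
        v = rng.uniform(0.0, horizon, n_states)
        # v_bar dominates v pointwise and stays in [0, horizon]
        v_bar = np.minimum(v + rng.uniform(0.0, 1.0, n_states), horizon)
        lhs = cvar(p, v_bar, alpha) - cvar(p, v, alpha)
        beta = worst_alpha_mass(p, v, alpha) / alpha  # beta^{alpha,V}
        rhs = float(beta @ (v_bar - v))
        worst_violation = max(worst_violation, lhs - rhs)
    return worst_violation
```

Calling `check_lemma11()` draws random distributions and value pairs and reports the worst gap; a value that is zero up to rounding is consistent with the lemma, although it is of course not a proof.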

D.1.3 PROOF OF THEOREM 1

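Proof of Theorem 1. Suppose that event $E$ holds. Then, for any $k\in[K]$, writing $B_k(s,a) := \frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha\sqrt{n_k(s,a)}}$ as shorthand inside this display only,

$$\begin{aligned}
&V^{*}_1(s^k_1) - V^{\pi^k}_1(s^k_1) \overset{(a)}{\le} \bar V^k_1(s^k_1) - V^{\pi^k}_1(s^k_1) \\
&= \min\left\{r(s^k_1,a^k_1) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s^k_1,a^k_1)}(\bar V^k_2(s')) + \frac{H}{\alpha}\sqrt{\frac{L}{n_k(s^k_1,a^k_1)}},\ H\right\} - \left(r(s^k_1,a^k_1) + \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s^k_1,a^k_1)}(V^{\pi^k}_2(s'))\right) \\
&\le r(s^k_1,a^k_1) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s^k_1,a^k_1)}(\bar V^k_2(s')) + \min\left\{\frac{H}{\alpha}\sqrt{\frac{L}{n_k(s^k_1,a^k_1)}},\ H\right\} - \left(r(s^k_1,a^k_1) + \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s^k_1,a^k_1)}(V^{\pi^k}_2(s'))\right) \\
&= \min\left\{\frac{H}{\alpha}\sqrt{\frac{L}{n_k(s^k_1,a^k_1)}},\ H\right\} + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s^k_1,a^k_1)}(\bar V^k_2(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s^k_1,a^k_1)}(\bar V^k_2(s')) + \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s^k_1,a^k_1)}(\bar V^k_2(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s^k_1,a^k_1)}(V^{\pi^k}_2(s')) \\
&\overset{(b)}{\le} \min\left\{\frac{H}{\alpha}\sqrt{\frac{L}{n_k(s^k_1,a^k_1)}},\ H\right\} + \min\left\{\frac{4H}{\alpha}\sqrt{\frac{SL}{n_k(s^k_1,a^k_1)}},\ H\right\} + \beta^{\alpha,V^{\pi^k}_2}(\cdot|s^k_1,a^k_1)^{\top}\left(\bar V^k_2 - V^{\pi^k}_2\right) \\
&\overset{(c)}{\le} \min\left\{B_k(s^k_1,a^k_1),\ 2H\right\} + \sum_{s_2\in\mathcal{S}}\beta^{\alpha,V^{\pi^k}_2}(s_2|s^k_1,a^k_1)\cdot\left(\bar V^k_2(s_2) - V^{\pi^k}_2(s_2)\right) \\
&\overset{(d)}{\le} \min\left\{B_k(s^k_1,a^k_1),\ 2H\right\} + \sum_{s_2\in\mathcal{S}}\beta^{\alpha,V^{\pi^k}_2}(s_2|s^k_1,a^k_1)\cdot\left[\min\left\{B_k(s_2,a_2),\ 2H\right\} + \sum_{s_3\in\mathcal{S}}\beta^{\alpha,V^{\pi^k}_3}(s_3|s_2,a_2)\cdot\left(\bar V^k_3(s_3) - V^{\pi^k}_3(s_3)\right)\right] \\
&\overset{(e)}{\le} \min\left\{B_k(s^k_1,a^k_1),\ 2H\right\} + \sum_{s_2\in\mathcal{S}}\beta^{\alpha,V^{\pi^k}_2}(s_2|s^k_1,a^k_1)\cdot\left[\min\left\{B_k(s_2,a_2),\ 2H\right\} + \sum_{s_3\in\mathcal{S}}\beta^{\alpha,V^{\pi^k}_3}(s_3|s_2,a_2)\cdot\left[\cdots\sum_{s_H\in\mathcal{S}}\beta^{\alpha,V^{\pi^k}_H}(s_H|s_{H-1},a_{H-1})\cdot\min\left\{B_k(s_H,a_H),\ 2H\right\}\right]\right] \\
&\overset{(f)}{=} \sum_{h=1}^{H}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)\cdot\min\left\{B_k(s,a),\ 2H\right\} \\
&\le \sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)\cdot\frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha\sqrt{n_k(s,a)}} + \sum_{h=1}^{H}\sum_{(s,a)\notin L_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)\cdot 2H. \qquad (14)
\end{aligned}$$

Here $a_h := \pi^k(s_h)$ for $h = 2,\dots,H$. The truncation at $H$ in the second term of step (b) uses the trivial bound $\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\bar V^k_{h+1}(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(\bar V^k_{h+1}(s')) \le H$, and the last term of step (b) follows from Lemma 11. In step (f), $w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)$ is the probability of visiting $(s,a)$ at step $h$ of episode $k$ under the conditional transition probability $\beta^{\alpha,V^{\pi^k}_{h'+1}}(\cdot|\cdot,\cdot)$ for each step $h' = 1,\dots,h-1$.

Since the second term in Eq. (14) can be bounded by Lemma 10, below we analyze the first term.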
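Recall that for any policy $\pi$, $h\in[H]$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$, $w_{\pi,h}(s,a)$ and $w_{\pi,h}(s)$ denote the probabilities of visiting $(s,a)$ and $s$ at step $h$ under policy $\pi$, respectively. Summing the first term in Eq. (14) over $k\in[K]$, we have

$$\begin{aligned}
&\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)\cdot\frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha\sqrt{n_k(s,a)}} \\
&\le \frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)}{n_k(s,a)}}\cdot\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)} \\
&\overset{(a)}{\le} \frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)}{n_k(s,a)}\cdot\mathbb{1}\{w_{kh}(s,a)\neq 0\}}\cdot\sqrt{KH} \\
&= \frac{(H\sqrt{L} + 4H\sqrt{SL})\sqrt{KH}}{\alpha}\sqrt{\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)}{w_{kh}(s,a)}\cdot\frac{w_{kh}(s,a)}{n_k(s,a)}\cdot\mathbb{1}\{w_{kh}(s,a)\neq 0\}} \\
&\overset{(b)}{\le} \frac{(H\sqrt{L} + 4H\sqrt{SL})\sqrt{KH}}{\alpha}\sqrt{\min\left\{\frac{1}{\min_{\pi,h,(s,a):\, w_{\pi,h}(s,a)>0} w_{\pi,h}(s,a)},\ \frac{1}{\alpha^{H-1}}\right\}\sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w_{kh}(s,a)}{n_k(s,a)}} \\
&\overset{(c)}{\le} \frac{(H\sqrt{L} + 4H\sqrt{SL})\sqrt{KH}}{\alpha}\cdot 2\sqrt{SAL}\cdot\min\left\{\frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\} \\
&\le \frac{10HSL\sqrt{KHA}}{\alpha}\cdot\min\left\{\frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}.
\end{aligned}$$

Here the first inequality is the Cauchy-Schwarz inequality, and (a) is due to Lemma 8 and the fact that for any $k>0$ and $h\in[H]$, $\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a) = 1$.

Then, summing the first and second terms in Eq. (14) over $k\in[K]$ and using Lemma 10, we have

$$\begin{aligned}
R(K) &= \sum_{k=1}^{K}\left(V^{*}_1(s^k_1) - V^{\pi^k}_1(s^k_1)\right) \\
&\le \sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)\cdot\frac{H\sqrt{L} + 4H\sqrt{SL}}{\alpha\sqrt{n_k(s,a)}} + \sum_{k=1}^{K}\sum_{h=1}^{H}\sum_{(s,a)\notin L_k} w^{\mathrm{CVaR},\alpha,V^{\pi^k}}_{kh}(s,a)\cdot 2H \\
&\le \min\left\{\frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\frac{10HS\sqrt{KHA}}{\alpha}\log\left(\frac{KHSA}{\delta'}\right) + \min\left\{\frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}}\right\}\left(8SAH^2\log\left(\frac{HSA}{\delta'}\right) + 10SAH^2\right).
\end{aligned}$$

When $K$ is large enough, the first term dominates the bound, and thus we obtain Theorem 1.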

D.2 PROOF OF REGRET LOWER BOUND

Below we prove the regret lower bound (Theorem 2) for Iterated CVaR RL-RM.

Proof of Theorem 2. First, we construct an instance where $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) > \alpha^{H-1}$, and prove that on this instance any algorithm must suffer $\Omega\left(H\sqrt{\frac{AK}{\alpha\cdot\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}}\right)$ regret.

Consider the instance shown in Figure 6 (the same as Figure 1 in the main text). The state space is $\mathcal{S} = \{s_1, s_2, \dots, s_n, x_1, x_2, x_3\}$, where $s_1$ is the initial state, and $n = S-3 < S < \frac{1}{2}H$.

The reward functions are as follows. For any $a\in\mathcal{A}$, $r(x_1,a) = 1$, $r(x_2,a) = 0.8$ and $r(x_3,a) = 0.2$. For any $i\in[n]$ and $a\in\mathcal{A}$, $r(s_i,a) = 0$.

The transition distributions are as follows. Let $\mu$ be a parameter which satisfies $0 < \alpha < \mu < \frac{1}{3}$. For any $a\in\mathcal{A}$, $p(s_2|s_1,a) = \mu$, $p(x_1|s_1,a) = 1-3\mu$, $p(x_2|s_1,a) = \mu$ and $p(x_3|s_1,a) = \mu$. For any $i\in\{2,\dots,n-1\}$ and $a\in\mathcal{A}$, $p(s_{i+1}|s_i,a) = \mu$ and $p(x_1|s_i,a) = 1-\mu$. The states $x_1$, $x_2$ and $x_3$ are absorbing, i.e., for any $a\in\mathcal{A}$, $p(x_1|x_1,a) = 1$, $p(x_2|x_2,a) = 1$ and $p(x_3|x_3,a) = 1$. Let $a_J$ be the optimal action in state $s_n$, which is uniformly drawn from $\mathcal{A}$. For the optimal action $a_J$, $p(x_2|s_n,a_J) = 1-\alpha+\eta$ and $p(x_3|s_n,a_J) = \alpha-\eta$, where $\eta$ is a parameter which satisfies $0 < \eta < \alpha$ and will be chosen later. For any suboptimal action $a\in\mathcal{A}\setminus\{a_J\}$, $p(x_2|s_n,a) = 1-\alpha$ and $p(x_3|s_n,a) = \alpha$.

For any $a_j\in\mathcal{A}$, let $\mathbb{E}_j[\cdot]$ and $\Pr_j[\cdot]$ denote the expectation and probability operators under the instance with $a_J = a_j$. Let $\mathbb{E}_{\mathrm{unif}}[\cdot]$ and $\Pr_{\mathrm{unif}}[\cdot]$ denote the expectation and probability operators under the uniform instance where all actions $a\in\mathcal{A}$ in state $s_n$ have the same transition distribution, i.e., $p(x_2|s_n,a) = 1-\alpha$ and $p(x_3|s_n,a) = \alpha$.

Fix an algorithm $\mathcal{A}$. Let $\pi^k$ denote the policy taken by algorithm $\mathcal{A}$ in episode $k$. Let $N_{s_n,a_j} = \sum_{k=1}^{K}\mathbb{1}\{\pi^k(s_n) = a_j\}$ denote the number of episodes in which the policy chooses $a_j$ in state $s_n$.
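Let $V_{s_n,a_j}$ denote the number of episodes in which algorithm $\mathcal{A}$ visits $(s_n,a_j)$. Let $w(s_n)$ denote the probability of visiting $s_n$ in an episode (this probability is the same for all policies). Then, it holds that $\mathbb{E}[V_{s_n,a_j}] = w(s_n)\cdot\mathbb{E}[N_{s_n,a_j}]$.

Recall that $a_J$ is the optimal action in state $s_n$. According to the definition of the value function for Iterated CVaR RL, we have

$$V^{*}_1(s_1) = \frac{(\alpha-\eta)\cdot 0.2(H-n) + \eta\cdot 0.8(H-n)}{\alpha},$$

and for any policy $\pi$,

$$V^{\pi}_1(s_1) = \frac{(\alpha-\eta)\cdot 0.2(H-n) + \eta\cdot 0.8(H-n)}{\alpha}\cdot\mathbb{1}\{\pi(s_n) = a_J\} + 0.2(H-n)\cdot\left(1 - \mathbb{1}\{\pi(s_n) = a_J\}\right).$$

If $J = j$, for any policy $\pi$,

$$V^{*}_1(s_1) - V^{\pi}_1(s_1) = \frac{\eta\cdot 0.6(H-n)}{\alpha}\cdot\left(1 - \mathbb{1}\{\pi(s_n) = a_j\}\right),$$

and summing over all episodes $k\in[K]$ and taking the expectation, we have

$$\mathbb{E}_j[R(K)] = \mathbb{E}_j\left[\sum_{k=1}^{K}\left(V^{*}_1(s_1) - V^{\pi^k}_1(s_1)\right)\right] = \frac{\eta\cdot 0.6(H-n)}{\alpha}\cdot\left(K - \mathbb{E}_j\left[\sum_{k=1}^{K}\mathbb{1}\{\pi^k(s_n) = a_j\}\right]\right) = \frac{\eta\cdot 0.6(H-n)}{\alpha}\cdot\left(K - \mathbb{E}_j[N_{s_n,a_j}]\right).$$

Therefore, we have

$$\mathbb{E}[R(K)] = \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_j[R(K)] = \frac{1}{A}\sum_{j=1}^{A}\frac{\eta}{\alpha}\cdot 0.6(H-n)\left(K - \mathbb{E}_j[N_{s_n,a_j}]\right) = 0.6(H-n)\cdot\frac{\eta}{\alpha}\cdot\left(K - \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_j[N_{s_n,a_j}]\right). \tag{16}$$

For any $j\in[A]$, bounding the KL divergence by the $\chi^2$ divergence (a reverse Pinsker-type inequality) and using $0 < \alpha < \frac{1}{3}$, we have

$$\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right) = \mathrm{KL}\left(\mathrm{Ber}(\alpha)\,\|\,\mathrm{Ber}(\alpha-\eta)\right) \le \frac{\eta^2}{(\alpha-\eta)(1-\alpha+\eta)} \le \frac{c_1\eta^2}{\alpha}$$

for some constant $c_1$ and small enough $\eta$. Then, using Lemma A.1 in (Auer et al., 2002), we have that for any $j\in[A]$,

$$\begin{aligned}
\mathbb{E}_j[N_{s_n,a_j}] &\le \mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K}{2}\sqrt{\mathbb{E}_{\mathrm{unif}}[V_{s_n,a_j}]\cdot\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right)} \\
&\le \mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K}{2}\sqrt{w(s_n)\cdot\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}]\cdot\frac{c_1\eta^2}{\alpha}}.
\end{aligned}$$

Then, using $\sum_{j=1}^{A}\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] = K$ and the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
\frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_j[N_{s_n,a_j}] &\le \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K\eta}{2A}\sum_{j=1}^{A}\sqrt{\frac{c_1}{\alpha}\cdot w(s_n)\cdot\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}]} \\
&\le \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K\eta}{2A}\sqrt{A\sum_{j=1}^{A}\frac{c_1}{\alpha}\cdot w(s_n)\cdot\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}]} \\
&\le \frac{K}{A} + \frac{K\eta}{2}\sqrt{\frac{c_1\cdot w(s_n)K}{\alpha A}}. \qquad (17)
\end{aligned}$$

By plugging Eq. (17) into Eq. (16), we have

$$\mathbb{E}[R(K)] \ge 0.6(H-n)\cdot\frac{\eta}{\alpha}\cdot\left(K - \frac{K}{A} - \frac{K\eta}{2}\sqrt{\frac{c_1\cdot w(s_n)K}{\alpha A}}\right).$$

Let $\eta = c_2\sqrt{\frac{\alpha A}{w(s_n)K}}$ for a small enough constant $c_2$.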
We have

$$\mathbb{E}[R(K)] = \Omega\left(H\sqrt{\frac{A}{\alpha\cdot w(s_n)K}}\cdot K\right) = \Omega\left(H\sqrt{\frac{AK}{\alpha\cdot w(s_n)}}\right).$$

Recall that $n < \frac{1}{2}H$ and $0 < \alpha < \mu < \frac{1}{3}$. Thus, we have $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) = w(s_n) = \mu^{n-1} > \alpha^{H-1}$, and

$$\mathbb{E}[R(K)] = \Omega\left(H\sqrt{\frac{AK}{\alpha\cdot\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}}\right).$$

Next, we construct another instance where $\alpha^{H-1} > \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$, and prove that on this instance any algorithm must suffer $\Omega\left(\sqrt{\frac{AK}{\alpha^{H-1}}}\right)$ regret.

Consider the instance shown in Figure 7. The state space is $\mathcal{S} = \{s_1,\dots,s_n, s'_2,\dots,s'_n, x_1, x_2, x_3, x_4\}$, where $n = H-1$ and $s_1$ is the initial state. Let $0 < \alpha < \frac{1}{4}$, and let $\gamma$ be a parameter which satisfies $0 < \gamma < \alpha$.

The reward functions are as follows. For any $a\in\mathcal{A}$, $r(x_1,a) = r(x_4,a) = 1$, $r(x_2,a) = 0.8$ and $r(x_3,a) = 0.2$. For any $i\in[n]$ and $a\in\mathcal{A}$, $r(s_i,a) = 0$. For any $i\in\{2,\dots,n\}$ and $a\in\mathcal{A}$, $r(s'_i,a) = 0$.

The transition distributions are as follows. For any $a\in\mathcal{A}$, $p(s_2|s_1,a) = \alpha$, $p(s'_2|s_1,a) = \gamma$ and $p(x_1|s_1,a) = 1-\gamma-\alpha$. For any $i\in\{2,\dots,n-1\}$ and $a\in\mathcal{A}$, $p(s_{i+1}|s_i,a) = \alpha$ and $p(x_1|s_i,a) = 1-\alpha$. For any $i\in\{2,\dots,n-1\}$ and $a\in\mathcal{A}$, $p(s'_{i+1}|s'_i,a) = \gamma$ and $p(x_1|s'_i,a) = 1-\gamma$. For any $a\in\mathcal{A}$, $p(x_4|s'_n,a) = \gamma$ and $p(x_1|s'_n,a) = 1-\gamma$. The states $x_1$, $x_2$, $x_3$ and $x_4$ are absorbing, i.e., for any $a\in\mathcal{A}$ and $i\in[4]$, $p(x_i|x_i,a) = 1$. Let $a_J$ be the optimal action in state $s_n$, which is uniformly drawn from $\mathcal{A}$. For the optimal action $a_J$, $p(x_2|s_n,a_J) = 1-\alpha+\eta$ and $p(x_3|s_n,a_J) = \alpha-\eta$, where $\eta$ is a parameter which satisfies $0 < \eta < \alpha$ and will be chosen later. For any suboptimal action $a\in\mathcal{A}\setminus\{a_J\}$, $p(x_2|s_n,a) = 1-\alpha$ and $p(x_3|s_n,a) = \alpha$.

Figure 7: Instance of lower bounds (Theorems 2 and 5) for the $\alpha^{H-1} > \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ case.
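According to the definition of the value function for Iterated CVaR RL, we have

$$V^{*}_1(s_1) = \frac{0.2(\alpha-\eta) + 0.8\eta}{\alpha},$$

and for any policy $\pi$,

$$V^{\pi}_1(s_1) = \frac{0.2(\alpha-\eta) + 0.8\eta}{\alpha}\cdot\mathbb{1}\{\pi(s_n) = a_J\} + 0.2\left(1 - \mathbb{1}\{\pi(s_n) = a_J\}\right).$$

If $J = j$, for any policy $\pi$,

$$V^{*}_1(s_1) - V^{\pi}_1(s_1) = \frac{0.6\eta}{\alpha}\left(1 - \mathbb{1}\{\pi(s_n) = a_j\}\right),$$

and summing over all episodes $k\in[K]$ and taking the expectation, we have

$$\mathbb{E}_j[R(K)] = \mathbb{E}_j\left[\sum_{k=1}^{K}\left(V^{*}_1(s_1) - V^{\pi^k}_1(s_1)\right)\right] = \frac{0.6\eta}{\alpha}\cdot\left(K - \mathbb{E}_j\left[\sum_{k=1}^{K}\mathbb{1}\{\pi^k(s_n) = a_j\}\right]\right) = \frac{0.6\eta}{\alpha}\cdot\left(K - \mathbb{E}_j[N_{s_n,a_j}]\right).$$

Therefore, we have

$$\mathbb{E}[R(K)] = \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_j[R(K)] = \frac{1}{A}\sum_{j=1}^{A}\frac{0.6\eta}{\alpha}\left(K - \mathbb{E}_j[N_{s_n,a_j}]\right) = \frac{0.6\eta}{\alpha}\left(K - \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_j[N_{s_n,a_j}]\right). \tag{19}$$

Recall that $0 < \alpha < \frac{1}{4}$. For any $j\in[A]$, we have

$$\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right) = \mathrm{KL}\left(\mathrm{Ber}(\alpha)\,\|\,\mathrm{Ber}(\alpha-\eta)\right) \le \frac{\eta^2}{(\alpha-\eta)(1-\alpha+\eta)} \le \frac{c_1\eta^2}{\alpha}$$

for some constant $c_1$ and small enough $\eta$. Then, using Lemma A.1 in (Auer et al., 2002), we have that for any $j\in[A]$,

$$\begin{aligned}
\mathbb{E}_j[N_{s_n,a_j}] &\le \mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K}{2}\sqrt{\mathbb{E}_{\mathrm{unif}}[V_{s_n,a_j}]\cdot\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right)} \\
&\le \mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K}{2}\sqrt{w(s_n)\cdot\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}]\cdot\frac{c_1\eta^2}{\alpha}}.
\end{aligned}$$

Then, using $\sum_{j=1}^{A}\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] = K$ and the Cauchy-Schwarz inequality, we have

$$\begin{aligned}
\frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_j[N_{s_n,a_j}] &\le \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K\eta}{2A}\sum_{j=1}^{A}\sqrt{\frac{c_1}{\alpha}\cdot w(s_n)\cdot\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}]} \\
&\le \frac{1}{A}\sum_{j=1}^{A}\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}] + \frac{K\eta}{2A}\sqrt{A\sum_{j=1}^{A}\frac{c_1}{\alpha}\cdot w(s_n)\cdot\mathbb{E}_{\mathrm{unif}}[N_{s_n,a_j}]} \\
&\le \frac{K}{A} + \frac{K\eta}{2}\sqrt{\frac{c_1\cdot w(s_n)K}{\alpha A}}. \qquad (20)
\end{aligned}$$

By plugging Eq. (20) into Eq. (19), we have

$$\mathbb{E}[R(K)] \ge \frac{0.6\eta}{\alpha}\cdot\left(K - \frac{K}{A} - \frac{K\eta}{2}\sqrt{\frac{c_1\cdot w(s_n)K}{\alpha A}}\right).$$

Let $\eta = c_2\sqrt{\frac{\alpha A}{w(s_n)K}}$ for a small enough constant $c_2$. We have

$$\mathbb{E}[R(K)] = \Omega\left(\sqrt{\frac{A}{\alpha\cdot w(s_n)K}}\cdot K\right) = \Omega\left(\sqrt{\frac{AK}{\alpha\cdot w(s_n)}}\right).$$

Recall that $0 < \gamma < \alpha$ and $n = H-1$. Thus, we have $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) = w(x_4) = \gamma^{H-1} < \alpha^{H-1}$. In addition, since $w(s_n) = \alpha^{n-1} = \alpha^{H-2}$, we have

$$\mathbb{E}[R(K)] = \Omega\left(\sqrt{\frac{AK}{\alpha\cdot\alpha^{H-2}}}\right) = \Omega\left(\sqrt{\frac{AK}{\alpha^{H-1}}}\right).$$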
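Both lower-bound instances rely on the Bernoulli divergence bound $\mathrm{KL}(\mathrm{Ber}(\alpha)\|\mathrm{Ber}(\alpha-\eta)) \le \frac{\eta^2}{(\alpha-\eta)(1-\alpha+\eta)} \le \frac{c_1\eta^2}{\alpha}$. The sketch below checks it numerically on a small grid. The function names are hypothetical, and the specific choices $\eta \le \alpha/2$ and $c_1 = 3$ are illustrative assumptions (the proof only requires some constant $c_1$ and small enough $\eta$): for $\alpha < \frac{1}{3}$ and $\eta \le \alpha/2$, we have $(\alpha-\eta)(1-\alpha+\eta) \ge \frac{\alpha}{2}\cdot\frac{2}{3} = \frac{\alpha}{3}$.

```python
import math

def kl_bernoulli(p, q):
    """KL(Ber(p) || Ber(q)) for p, q strictly inside (0, 1)."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def chi2_upper_bound(alpha, eta):
    """The bound eta^2 / ((alpha - eta)(1 - alpha + eta)) used in the proof."""
    return eta**2 / ((alpha - eta) * (1 - alpha + eta))

def check_grid(alphas=(0.05, 0.1, 0.2, 0.3), fractions=(0.01, 0.1, 0.5), c1=3.0):
    """Check KL <= chi2 bound <= c1 * eta^2 / alpha for eta = fraction * alpha.
    c1 = 3 and eta <= alpha / 2 are illustrative assumptions, not from the paper."""
    for alpha in alphas:
        for frac in fractions:
            eta = frac * alpha
            kl = kl_bernoulli(alpha, alpha - eta)
            chi2 = chi2_upper_bound(alpha, eta)
            if not (kl <= chi2 + 1e-15 and chi2 <= c1 * eta**2 / alpha + 1e-15):
                return False
    return True
```

A `True` result from `check_grid()` only confirms the inequality on the sampled grid; the general statement follows from $\mathrm{KL} \le \chi^2$.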

E PROOFS FOR ITERATED CVAR RL WITH BEST POLICY IDENTIFICATION

In this section, we present the pseudo-code and detailed description of algorithm ICVaR-BPI, and formally state the sample complexity lower bound for Iterated CVaR-BPI (Theorem 5). We also give the proofs of sample complexity upper and lower bounds (Theorems 3 and 5).

E.1 ALGORITHM ICVaR-BPI

Algorithm ICVaR-BPI (Algorithm 3) constructs optimistic and pessimistic value functions, an estimation error, and a hypothesized optimal policy in each episode, and returns the hypothesized optimal policy when the estimation error shrinks within $\varepsilon$.

Algorithm 3: ICVaR-BPI
Input: $\varepsilon$, $\delta$, $\alpha$, $\delta' := \frac{\delta}{7}$, $L(k) := \log\left(\frac{2HSAk^3}{\delta'}\right)$ for any $k>0$, $J^k_{H+1}(s) = \bar V^k_{H+1}(s) = \underline V^k_{H+1}(s) = 0$ for any $k>0$ and $s\in\mathcal{S}$.
1 for $k = 1, 2, \dots, K$ do
2   for $h = H, H-1, \dots, 1$ do
3     for $s\in\mathcal{S}$ do
4       for $a\in\mathcal{A}$ do
5         $\bar Q^k_h(s,a) \leftarrow \min\left\{r(s,a) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\bar V^k_{h+1}(s')) + \frac{H}{\alpha}\sqrt{\frac{L(k)}{n_k(s,a)}},\ H\right\}$;
6         $\underline Q^k_h(s,a) \leftarrow \max\left\{r(s,a) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\underline V^k_{h+1}(s')) - \frac{4H}{\alpha}\sqrt{\frac{S\,L(k)}{n_k(s,a)}},\ 0\right\}$;
7         $G^k_h(s,a) \leftarrow \min\left\{\frac{H(1+4\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s,a)}} + \hat\beta^{k;\alpha,\underline V^k_{h+1}}(\cdot|s,a)^{\top}J^k_{h+1},\ H\right\}$;
8       $\pi^k_h(s) \leftarrow \mathrm{argmax}_{a\in\mathcal{A}}\,\bar Q^k_h(s,a)$;
9       $\bar V^k_h(s) \leftarrow \max_{a\in\mathcal{A}}\bar Q^k_h(s,a)$;
10      $\underline V^k_h(s) \leftarrow \underline Q^k_h(s,\pi^k_h(s))$;
11      $J^k_h(s) \leftarrow G^k_h(s,\pi^k_h(s))$;
12  if $J^k_1(s_1) \le \varepsilon$ then
13    return $\pi^k$
14  else
15    Play the episode $k$ with policy $\pi^k$, and update $n_{k+1}(s,a)$ and $\hat p^{k+1}(s'|s,a)$

Specifically, in each episode $k$, ICVaR-BPI calculates the empirical CVaR for values of next states, $\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\bar V^k_{h+1}(s'))$ and $\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\underline V^k_{h+1}(s'))$, and the exploration bonuses $\frac{H}{\alpha}\sqrt{\frac{L(k)}{n_k(s,a)}}$ and $\frac{4H}{\alpha}\sqrt{\frac{S\,L(k)}{n_k(s,a)}}$, to establish the optimistic and pessimistic Q-value functions $\bar Q^k_h(s,a)$ and $\underline Q^k_h(s,a)$, respectively. ICVaR-BPI further maintains a hypothesized optimal policy $\pi^k$, which is greedy with respect to $\bar Q^k_h(s,a)$. Let $\hat\beta^{k;\alpha,\underline V^k_{h+1}}(\cdot|s,a)$ denote the conditional empirical transition probability in episode $k$, conditioning on transitioning to the worst $\alpha$-portion successor states $s'$ (i.e., with the lowest $\alpha$-portion values $\underline V^k_{h+1}(s')$), which satisfies $\sum_{s'\in\mathcal{S}}\hat\beta^{k;\alpha,\underline V^k_{h+1}}(s'|s,a)\cdot\underline V^k_{h+1}(s') = \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\underline V^k_{h+1}(s'))$ (Line 7). Then, ICVaR-BPI computes the estimation errors $G^k_h(s,a)$ and $J^k_h(s)$ using the conditional transition probability $\hat\beta^{k;\alpha,\underline V^k_{h+1}}(\cdot|s,a)$.
Once the estimation error $J^k_1(s_1)$ shrinks within the accuracy parameter $\varepsilon$, ICVaR-BPI returns the hypothesized optimal policy $\pi^k$.
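The backward pass of one episode (the loop over $h$, $s$ and $a$ in Algorithm 3) can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names and array layout are hypothetical, steps are indexed $1,\dots,H$ with row $H+1$ holding the zero terminal values, and visit counts are clipped to at least 1 to avoid division by zero (the paper's handling of unvisited pairs is not shown in this section).

```python
import numpy as np

def worst_alpha_mass(p, v, alpha):
    """Part of each p[s'] that falls in the worst alpha-portion (ascending in v)."""
    mu = np.zeros_like(p, dtype=float)
    remaining = alpha
    for idx in np.argsort(v, kind="stable"):
        take = min(p[idx], remaining)
        mu[idx] = take
        remaining -= take
        if remaining <= 0:
            break
    return mu

def icvar_bpi_backward_pass(p_hat, r, n, k, H, alpha, delta_prime):
    """One backward pass of the update rules in Algorithm 3 (sketch).
    p_hat: (S, A, S) empirical transitions; r: (S, A) rewards; n: (S, A) counts."""
    S, A = r.shape
    L = np.log(2 * H * S * A * k**3 / delta_prime)
    n = np.maximum(n, 1)  # assumption: avoid division by zero for unvisited pairs
    V_up = np.zeros((H + 2, S))   # optimistic values
    V_lo = np.zeros((H + 2, S))   # pessimistic values
    J = np.zeros((H + 2, S))      # estimation error
    pi = np.zeros((H + 1, S), dtype=int)
    for h in range(H, 0, -1):
        for s in range(S):
            q_up, q_lo, g = np.empty(A), np.empty(A), np.empty(A)
            for a in range(A):
                p = p_hat[s, a]
                bonus = (H / alpha) * np.sqrt(L / n[s, a])
                cvar_up = worst_alpha_mass(p, V_up[h + 1], alpha) @ V_up[h + 1] / alpha
                beta_lo = worst_alpha_mass(p, V_lo[h + 1], alpha) / alpha
                cvar_lo = beta_lo @ V_lo[h + 1]
                q_up[a] = min(r[s, a] + cvar_up + bonus, H)
                q_lo[a] = max(r[s, a] + cvar_lo - 4 * np.sqrt(S) * bonus, 0.0)
                g[a] = min((1 + 4 * np.sqrt(S)) * bonus + beta_lo @ J[h + 1], H)
            a_star = int(np.argmax(q_up))  # greedy w.r.t. the optimistic Q-value
            pi[h, s] = a_star
            V_up[h, s], V_lo[h, s], J[h, s] = q_up[a_star], q_lo[a_star], g[a_star]
    return pi, V_up, V_lo, J
```

A caller would compare `J[1, s1]` against $\varepsilon$ to decide whether to stop or to play another episode with `pi`. Note that the inequality $\bar V^k_h(s) - \underline V^k_h(s) \le J^k_h(s)$ holds for the outputs of this recursion regardless of the data, which gives a convenient runtime check.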

E.2 PROOFS OF SAMPLE COMPLEXITY UPPER BOUND

E.2.1 CONCENTRATION

In the best policy identification analysis, we introduce several useful lemmas and concentration events. Unlike the regret minimization analysis, where the logarithmic factor $\log\left(\frac{KHSA}{\delta'}\right)$ in the exploration bonuses is fixed across episodes, here the logarithmic factor $\log\left(\frac{2k^3HSA}{\delta'}\right)$ increases with the episode index $k$.

Lemma 12 (Concentration for $V^{*}$ - BPI). It holds that

$$\Pr\left[\left|\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V^{*}_h(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V^{*}_h(s'))\right| \le \frac{H}{\alpha}\sqrt{\frac{\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall k>0,\ \forall h\in[H],\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right] \ge 1-2\delta'.$$

Proof of Lemma 12. Using the same analysis as Lemma 1, we have that for a fixed $k$,

$$\Pr\left[\left|\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V^{*}_h(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V^{*}_h(s'))\right| \le \frac{H}{\alpha}\sqrt{\frac{\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall h\in[H],\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right] \ge 1 - 2\cdot\frac{\delta'}{2k^2}.$$

By a union bound over $k = 1, 2, \dots$, the event in Lemma 12 holds with probability at least

$$1 - 2\cdot\sum_{k=1}^{\infty}\frac{\delta'}{2k^2} \ge 1 - 2\delta'.$$

Lemma 13 (Concentration for any $V$ - BPI). It holds that

$$\Pr\left[\left|\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V(s'))\right| \le \frac{2H}{\alpha}\sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall V:\mathcal{S}\to[0,H],\ \forall k>0,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right] \ge 1-2\delta'.$$

Proof of Lemma 13. Using the same analysis as Lemma 2, we have that for a fixed $k$,

$$\Pr\left[\left|\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V(s'))\right| \le \frac{2H}{\alpha}\sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall V:\mathcal{S}\to[0,H],\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right] \ge 1 - 2\cdot\frac{\delta'}{2k^2}.$$

By a union bound over $k = 1, 2, \dots$, the event in Lemma 13 holds with probability at least

$$1 - 2\cdot\sum_{k=1}^{\infty}\frac{\delta'}{2k^2} \ge 1 - 2\delta'.$$
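For any risk level $\alpha\in(0,1]$, function $V:\mathcal{S}\to\mathbb{R}$, $k>0$ and $(s',s,a)\in\mathcal{S}\times\mathcal{S}\times\mathcal{A}$, $\beta^{\alpha,V}(s'|s,a)$ and $\hat\beta^{k;\alpha,V}(s'|s,a)$ are the conditional transition probability from $(s,a)$ to $s'$ and the conditional empirical transition probability from $(s,a)$ to $s'$ in episode $k$, respectively, conditioning on transitioning to the worst $\alpha$-portion successor states $s'$ (i.e., with the lowest $\alpha$-portion values $V(s')$). $\mu^{\alpha,V}(s'|s,a)$ and $\hat\mu^{k;\alpha,V}(s'|s,a)$ denote how much of the transition probability of successor state $s'$ and of the empirical transition probability of successor state $s'$ in episode $k$ belongs to the worst $\alpha$-portion, respectively. It holds that $\frac{\mu^{\alpha,V}(s'|s,a)}{\alpha} = \beta^{\alpha,V}(s'|s,a)$ and $\frac{\hat\mu^{k;\alpha,V}(s'|s,a)}{\alpha} = \hat\beta^{k;\alpha,V}(s'|s,a)$.

Lemma 14 (Concentration for conditional transition probability). It holds that

$$\Pr\left[\sum_{s'\in\mathcal{S}}\left|\hat\beta^{k;\alpha,V}(s'|s,a) - \beta^{\alpha,V}(s'|s,a)\right| \le \frac{2}{\alpha}\sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall V:\mathcal{S}\to\mathbb{R},\ \forall k>0,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right] \ge 1-2\delta'.$$

Proof of Lemma 14. Using the analysis of Eq. (7), we have that for any risk level $\alpha\in(0,1]$, function $V:\mathcal{S}\to\mathbb{R}$, $k>0$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$\sum_{s'\in\mathcal{S}}\left|\hat\mu^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a)\right| \le 2\sum_{s'\in\mathcal{S}}\left|\hat p^k(s'|s,a) - p(s'|s,a)\right|.$$

Using Eq. (55) in (Zanette & Brunskill, 2019) (originating from (Weissman et al., 2003)), we have that for any fixed $k$, with probability at least $1 - 2\cdot\frac{\delta'}{2k^2}$, for any $(s,a)\in\mathcal{S}\times\mathcal{A}$,

$$\sum_{s'\in\mathcal{S}}\left|\hat p^k(s'|s,a) - p(s'|s,a)\right| \le \sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},$$

and thus,

$$\begin{aligned}
\sum_{s'\in\mathcal{S}}\left|\hat\beta^{k;\alpha,V}(s'|s,a) - \beta^{\alpha,V}(s'|s,a)\right| &= \sum_{s'\in\mathcal{S}}\left|\frac{\hat\mu^{k;\alpha,V}(s'|s,a)}{\alpha} - \frac{\mu^{\alpha,V}(s'|s,a)}{\alpha}\right| = \frac{\sum_{s'\in\mathcal{S}}\left|\hat\mu^{k;\alpha,V}(s'|s,a) - \mu^{\alpha,V}(s'|s,a)\right|}{\alpha} \\
&\le \frac{2\sum_{s'\in\mathcal{S}}\left|\hat p^k(s'|s,a) - p(s'|s,a)\right|}{\alpha} \le \frac{2}{\alpha}\sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}}.
\end{aligned}$$

By a union bound over $k = 1, 2, \dots$, the event in Lemma 14 holds with probability at least

$$1 - 2\cdot\sum_{k=1}^{\infty}\frac{\delta'}{2k^2} \ge 1 - 2\delta'.$$

To sum up, we define the following concentration events and recall event $E_3$.

$$\begin{aligned}
F_1 &:= \left\{\left|\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V^{*}_h(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V^{*}_h(s'))\right| \le \frac{H}{\alpha}\sqrt{\frac{\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall k>0,\ \forall h\in[H],\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right\} \\
F_2 &:= \left\{\left|\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V(s'))\right| \le \frac{2H}{\alpha}\sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall V:\mathcal{S}\to[0,H],\ \forall k>0,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right\} \\
F_3 &:= \left\{\sum_{s'\in\mathcal{S}}\left|\hat\beta^{k;\alpha,V}(s'|s,a) - \beta^{\alpha,V}(s'|s,a)\right| \le \frac{2}{\alpha}\sqrt{\frac{2S\log\left(\frac{2k^3HSA}{\delta'}\right)}{n_k(s,a)}},\ \forall V:\mathcal{S}\to\mathbb{R},\ \forall k>0,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right\} \\
E_3 &:= \left\{n_k(s,a) \ge \frac{1}{2}\sum_{k'=1}^{k-1}\sum_{h=1}^{H} w_{k'h}(s,a) - H\log\left(\frac{HSA}{\delta'}\right),\ \forall k>0,\ \forall(s,a)\in\mathcal{S}\times\mathcal{A}\right\} \\
F &:= F_1\cap F_2\cap F_3\cap E_3
\end{aligned}$$

Lemma 15. Letting $\delta' = \frac{\delta}{7}$, it holds that $\Pr[F] \ge 1-\delta$.

Proof of Lemma 15. This lemma can be obtained by combining Lemmas 12-14 and 3.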
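The constant 7 in $\delta' = \frac{\delta}{7}$ can be traced with a short worked computation. The following is a sketch of the accounting, assuming (as the total of $7\delta'$ suggests, since Lemma 3 is not restated in this section) that event $E_3$ fails with probability at most $\delta'$. The per-lemma union bound uses $\sum_{k\ge 1} k^{-2} = \pi^2/6$:

```latex
\begin{aligned}
2\sum_{k=1}^{\infty}\frac{\delta'}{2k^{2}}
  &= \delta'\sum_{k=1}^{\infty}\frac{1}{k^{2}}
   = \frac{\pi^{2}}{6}\,\delta' \approx 1.645\,\delta' \le 2\delta', \\
\Pr[\neg F]
  &\le \Pr[\neg F_1] + \Pr[\neg F_2] + \Pr[\neg F_3] + \Pr[\neg E_3] \\
  &\le 2\delta' + 2\delta' + 2\delta' + \delta'
   = 7\delta' = \delta .
\end{aligned}
```

The first line also shows that the $k^3$ inside the logarithm is what makes the failure probabilities summable over infinitely many episodes.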

E.2.2 OPTIMISM AND ESTIMATION ERROR

For any $k>0$, let $L(k) := \log\left(\frac{2HSAk^3}{\delta'}\right)$.

Lemma 16 (Optimism and Pessimism). Suppose that event $F$ holds. Then, for any $k>0$, $h\in[H]$ and $s\in\mathcal{S}$,

$$\bar V^k_h(s) \ge V^{*}_h(s), \qquad \underline V^k_h(s) \le V^{\pi^k}_h(s).$$

Proof of Lemma 16. The proof of $\bar V^k_h(s) \ge V^{*}_h(s)$ is similar to Lemma 5. Below we prove $\underline V^k_h(s) \le V^{\pi^k}_h(s)$ by induction.

First, for any $k>0$ and $s\in\mathcal{S}$, it holds that $\underline V^k_{H+1}(s) = V^{\pi^k}_{H+1}(s) = 0$. Then, for any $k>0$, $h\in[H]$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$, if $\underline Q^k_h(s,a) = 0$, $\underline Q^k_h(s,a) \le Q^{\pi^k}_h(s,a)$ trivially holds, and otherwise,

$$\begin{aligned}
\underline Q^k_h(s,a) &= r(s,a) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\underline V^k_{h+1}(s')) - \frac{4H}{\alpha}\sqrt{\frac{S\,L(k)}{n_k(s,a)}} \\
&\overset{(a)}{\le} r(s,a) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V^{\pi^k}_{h+1}(s')) - \frac{4H}{\alpha}\sqrt{\frac{S\,L(k)}{n_k(s,a)}} \\
&\overset{(b)}{\le} r(s,a) + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V^{\pi^k}_{h+1}(s')) - \left(\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(V^{\pi^k}_{h+1}(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V^{\pi^k}_{h+1}(s'))\right) \\
&= r(s,a) + \mathrm{CVaR}^{\alpha}_{s'\sim p(\cdot|s,a)}(V^{\pi^k}_{h+1}(s')) \\
&= Q^{\pi^k}_h(s,a),
\end{aligned}$$

where (a) uses the induction hypothesis, and (b) comes from Lemma 2. Thus, we have

$$\underline V^k_h(s) = \underline Q^k_h(s,\pi^k_h(s)) \le Q^{\pi^k}_h(s,\pi^k_h(s)) = V^{\pi^k}_h(s),$$

which concludes the proof.

For any risk level $\alpha\in(0,1]$, $k>0$, $h\in[H]$ and $(s',s,a)\in\mathcal{S}\times\mathcal{S}\times\mathcal{A}$, $\hat\beta^{k;\alpha,\underline V^k_{h+1}}(s'|s,a)$ is the conditional empirical transition probability from $(s,a)$ to $s'$ in episode $k$, conditioning on transitioning to the worst $\alpha$-portion successor states $s'$ (i.e., with the lowest $\alpha$-portion values $\underline V^k_{h+1}(s')$).

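It holds that

$$\mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\underline V^k_{h+1}(s')) = \sum_{s'\in\mathcal{S}}\hat\beta^{k;\alpha,\underline V^k_{h+1}}(s'|s,a)\cdot\underline V^k_{h+1}(s').$$

Lemma 17 (Estimation Error). Suppose that event $F$ holds. Then, for any $k>0$,

$$V^{*}_1(s_1) - V^{\pi^k}_1(s_1) \le J^k_1(s_1).$$

Proof of Lemma 17. In the following, we prove by induction that for any $k>0$, $h\in[H]$ and $s\in\mathcal{S}$,

$$\bar V^k_h(s) - \underline V^k_h(s) \le J^k_h(s). \tag{22}$$

First, for any $k>0$ and $s\in\mathcal{S}$, it holds that $\bar V^k_{H+1}(s) - \underline V^k_{H+1}(s) = J^k_{H+1}(s) = 0$. Then, for any $k>0$, $h\in[H]$ and $(s,a)\in\mathcal{S}\times\mathcal{A}$, if $G^k_h(s,a) = H$, $\bar Q^k_h(s,a) - \underline Q^k_h(s,a) \le G^k_h(s,a)$ holds trivially, and otherwise,

$$\begin{aligned}
\bar Q^k_h(s,a) - \underline Q^k_h(s,a) &\le \frac{H}{\alpha}\sqrt{\frac{L(k)}{n_k(s,a)}} + \frac{4H}{\alpha}\sqrt{\frac{S\,L(k)}{n_k(s,a)}} + \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\bar V^k_{h+1}(s')) - \mathrm{CVaR}^{\alpha}_{s'\sim\hat p^k(\cdot|s,a)}(\underline V^k_{h+1}(s')) \\
&\overset{(a)}{\le} \frac{H\sqrt{L(k)}(1+4\sqrt{S})}{\alpha\sqrt{n_k(s,a)}} + \hat\beta^{k;\alpha,\underline V^k_{h+1}}(\cdot|s,a)^{\top}\left(\bar V^k_{h+1} - \underline V^k_{h+1}\right) \\
&\overset{(b)}{\le} \frac{H\sqrt{L(k)}(1+4\sqrt{S})}{\alpha\sqrt{n_k(s,a)}} + \hat\beta^{k;\alpha,\underline V^k_{h+1}}(\cdot|s,a)^{\top}J^k_{h+1} \\
&= G^k_h(s,a).
\end{aligned}$$

Thus,

$$\bar V^k_h(s) - \underline V^k_h(s) = \bar Q^k_h(s,\pi^k_h(s)) - \underline Q^k_h(s,\pi^k_h(s)) \le G^k_h(s,\pi^k_h(s)) = J^k_h(s),$$

which completes the proof of Eq. (22).

Hence, for any $k>0$, $\bar V^k_1(s_1) - \underline V^k_1(s_1) \le J^k_1(s_1)$. Using Lemma 16, we have

$$V^{*}_1(s_1) - V^{\pi^k}_1(s_1) \le \bar V^k_1(s_1) - \underline V^k_1(s_1) \le J^k_1(s_1).$$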

E.2.3 PROOF OF THEOREM 3

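Proof of Theorem 3. Suppose that event $F$ holds.

First, we prove the correctness. Using Lemma 17, when algorithm ICVaR-BPI stops in episode $k$, we have $V^{*}_1(s_1) - V^{\pi^k}_1(s_1) \le J^k_1(s_1) \le \varepsilon$. Thus, the output policy $\pi^k$ is $\varepsilon$-optimal.

Next, we prove the sample complexity. Let $a_h := \pi^k_h(s_h)$ for $h\in[H]$. Unfolding $J^k_1(s_1)$, and writing $\hat\beta_2 := \hat\beta^{k;\alpha,\underline V^k_2}$, $\beta_h := \beta^{\alpha,\underline V^k_h}$ and $D_k(s,a) := \min\left\{\frac{H(1+8\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s,a)}},\ H\right\}$ as shorthand inside this display only, we have

$$\begin{aligned}
J^k_1(s_1) &\overset{(a)}{=} \min\left\{\frac{H(1+4\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s_1,a_1)}} + \sum_{s_2\in\mathcal{S}}\hat\beta_2(s_2|s_1,a_1)\cdot J^k_2(s_2),\ H\right\} \\
&= \min\left\{\frac{H(1+4\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s_1,a_1)}} + \sum_{s_2\in\mathcal{S}}\beta_2(s_2|s_1,a_1)\cdot J^k_2(s_2) + \sum_{s_2\in\mathcal{S}}\left(\hat\beta_2(s_2|s_1,a_1) - \beta_2(s_2|s_1,a_1)\right)\cdot J^k_2(s_2),\ H\right\} \\
&\le \min\left\{\frac{H(1+4\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s_1,a_1)}} + \sum_{s_2\in\mathcal{S}}\beta_2(s_2|s_1,a_1)\cdot J^k_2(s_2) + H\sum_{s_2\in\mathcal{S}}\left|\hat\beta_2(s_2|s_1,a_1) - \beta_2(s_2|s_1,a_1)\right|,\ H\right\} \\
&\le \min\left\{\frac{H(1+4\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s_1,a_1)}} + \sum_{s_2\in\mathcal{S}}\beta_2(s_2|s_1,a_1)\cdot J^k_2(s_2) + \frac{4H}{\alpha}\sqrt{\frac{S\cdot L(k)}{n_k(s_1,a_1)}},\ H\right\} \\
&\le D_k(s_1,a_1) + \sum_{s_2\in\mathcal{S}}\beta_2(s_2|s_1,a_1)\cdot J^k_2(s_2) \\
&\overset{(e)}{\le} D_k(s_1,a_1) + \sum_{s_2\in\mathcal{S}}\beta_2(s_2|s_1,a_1)\cdot\left[D_k(s_2,a_2) + \sum_{s_3\in\mathcal{S}}\beta_3(s_3|s_2,a_2)\cdot J^k_3(s_3)\right] \\
&\overset{(f)}{\le} D_k(s_1,a_1) + \sum_{s_2\in\mathcal{S}}\beta_2(s_2|s_1,a_1)\cdot\left[D_k(s_2,a_2) + \sum_{s_3\in\mathcal{S}}\beta_3(s_3|s_2,a_2)\cdot\left[\cdots\sum_{s_H\in\mathcal{S}}\beta_H(s_H|s_{H-1},a_{H-1})\cdot D_k(s_H,a_H)\right]\right] \\
&\overset{(g)}{=} \sum_{h=1}^{H}\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} w^{\mathrm{CVaR},\alpha,\underline V^k}_{kh}(s,a)\cdot\min\left\{\frac{H(1+8\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s,a)}},\ H\right\} \\
&\le \sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR},\alpha,\underline V^k}_{kh}(s,a)\cdot\frac{H(1+8\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s,a)}} + \sum_{h=1}^{H}\sum_{(s,a)\notin L_k} w^{\mathrm{CVaR},\alpha,\underline V^k}_{kh}(s,a)\cdot H.
\end{aligned}$$

Let $\tau$ denote the episode in which algorithm ICVaR-BPI stops (ICVaR-BPI will not sample any trajectory in the stopping episode $\tau$). Then, for any $k<\tau$, we have $\varepsilon < J^k_1(s_1)$. Summing over $k<\tau$, and writing $w^{\mathrm{CVaR}}_{kh} := w^{\mathrm{CVaR},\alpha,\underline V^k}_{kh}$, $w_{\min} := \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ and $\Lambda := \min\left\{\frac{1}{w_{\min}},\ \frac{1}{\alpha^{H-1}}\right\}\left(4SAH^2\log\left(\frac{HSA}{\delta'}\right) + 5SAH^2\right)$ as shorthand inside this display only, we have

$$\begin{aligned}
(\tau-1)\cdot\varepsilon &< \sum_{k=1}^{\tau-1} J^k_1(s_1) \\
&\le \sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR}}_{kh}(s,a)\cdot\frac{H(1+8\sqrt{S})\sqrt{L(k)}}{\alpha\sqrt{n_k(s,a)}} + \sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\notin L_k} w^{\mathrm{CVaR}}_{kh}(s,a)\cdot H \\
&\overset{(a)}{\le} \frac{H(1+8\sqrt{S})\sqrt{L(\tau-1)}}{\alpha}\sqrt{\sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w^{\mathrm{CVaR}}_{kh}(s,a)}{n_k(s,a)}}\cdot\sqrt{\sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\in L_k} w^{\mathrm{CVaR}}_{kh}(s,a)} + \Lambda \\
&\overset{(b)}{\le} \frac{H(1+8\sqrt{S})\sqrt{L(\tau-1)}}{\alpha}\cdot\sqrt{(\tau-1)H}\cdot\sqrt{\sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w^{\mathrm{CVaR}}_{kh}(s,a)}{n_k(s,a)}} + \Lambda \\
&\overset{(c)}{\le} \frac{(1+8\sqrt{S})H\sqrt{H\cdot L(\tau-1)\cdot(\tau-1)}}{\alpha}\cdot\sqrt{\sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w^{\mathrm{CVaR}}_{kh}(s,a)}{w_{kh}(s,a)}\cdot\frac{w_{kh}(s,a)}{n_k(s,a)}\cdot\mathbb{1}\{w_{kh}(s,a)\neq 0\}} + \Lambda \\
&\le \frac{(1+8\sqrt{S})H\sqrt{H\cdot L(\tau-1)\cdot(\tau-1)}}{\alpha}\cdot\min\left\{\frac{1}{\sqrt{w_{\min}}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\cdot\sqrt{\sum_{k=1}^{\tau-1}\sum_{h=1}^{H}\sum_{(s,a)\in L_k}\frac{w_{kh}(s,a)}{n_k(s,a)}} + \Lambda \\
&\overset{(d)}{\le} \frac{(1+8\sqrt{S})H\sqrt{H\cdot L(\tau-1)\cdot(\tau-1)}}{\alpha}\cdot\min\left\{\frac{1}{\sqrt{w_{\min}}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\cdot 2\sqrt{SA\,L(\tau-1)} + \Lambda \\
&\le \min\left\{\frac{1}{\sqrt{w_{\min}}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\frac{18SH\cdot L(\tau-1)\sqrt{HA(\tau-1)}}{\alpha} + \Lambda,
\end{aligned}$$

where (a) is due to Lemma 10, (b) uses the fact that for any risk level $\alpha\in(0,1]$, $k>0$ and $h\in[H]$, $\sum_{(s,a)\in\mathcal{S}\times\mathcal{A}} w^{\mathrm{CVaR},\alpha,\underline V^k}_{kh}(s,a) = 1$, (c) comes from Lemma 8, and (d) is due to Lemma 7.

Thus, when $\log\left(\frac{HSA}{\delta'}\right) \ge 1$, we have

$$\begin{aligned}
\tau-1 &\le \min\left\{\frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\frac{18SH\sqrt{HA}}{\varepsilon\alpha}\cdot\sqrt{\tau-1}\cdot\log\left(\frac{2HSA(\tau-1)^3}{\delta'}\right) + \min\left\{\frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}}\right\}\frac{4SAH^2\log\left(\frac{HSA}{\delta'}\right) + 5SAH^2}{\varepsilon} \\
&\le \min\left\{\frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\frac{54SH\sqrt{HA}}{\varepsilon\alpha}\cdot\sqrt{\tau-1}\cdot\log\left(\frac{2HSA(\tau-1)}{\delta'}\right) + \min\left\{\frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}}\right\}\frac{9SAH^2}{\varepsilon}\log\left(\frac{HSA}{\delta'}\right).
\end{aligned}$$

Using Lemma 24 with $A = 1$, $B = 0$, $C = \min\left\{\frac{1}{\sqrt{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}},\ \frac{1}{\sqrt{\alpha^{H-1}}}\right\}\frac{54SH\sqrt{HA}}{\varepsilon\alpha}$, $D = \min\left\{\frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}}\right\}\frac{9SAH^2}{\varepsilon}$, $E = 0$, $\beta = \frac{2HSA}{\delta'}$ and $T = \tau-1$ (here $A$, $B$, $C$, $D$, $E$ are the parameter names of Lemma 24, not the number of actions), and recalling that algorithm ICVaR-BPI does not sample any trajectory in the stopping episode $\tau$, we have that the number of used trajectories is bounded by

$$\tau-1 = O\left(\min\left\{\frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}}\right\}\frac{H^3S^2A}{\varepsilon^2\alpha^2}\cdot\log^2\left(\min\left\{\frac{1}{\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)},\ \frac{1}{\alpha^{H-1}}\right\}\frac{HSA}{\varepsilon\alpha\delta}\right)\right).$$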

E.3 SAMPLE COMPLEXITY LOWER BOUND

Below we present the sample complexity lower bound for Iterated CVaR RL-BPI and provide its proof. We say algorithm $\mathcal{A}$ is $(\delta,\varepsilon)$-correct if, with probability at least $1-\delta$, $\mathcal{A}$ returns an $\varepsilon$-optimal policy $\pi$, i.e., a policy such that $V^{\pi}_1(s_1) \ge V^{*}_1(s_1) - \varepsilon$.

Theorem 5 (Sample Complexity Lower Bound). There exists an instance of Iterated CVaR RL-BPI, where $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) > \alpha^{H-1}$ and the number of trajectories used by any $(\delta,\varepsilon)$-correct algorithm is at least

$$\Omega\left(\frac{H^2A}{\varepsilon^2\alpha\cdot\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}\log\left(\frac{1}{\delta}\right)\right).$$

In addition, there also exists an instance of Iterated CVaR RL-BPI, where $\alpha^{H-1} > \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ and the number of trajectories used by any $(\delta,\varepsilon)$-correct algorithm is at least

$$\Omega\left(\frac{A}{\alpha^{H-1}\varepsilon^2}\log\left(\frac{1}{\delta}\right)\right).$$

Theorem 5 corroborates that when $\alpha$ is small, the factor $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$ is unavoidable in general. This reveals the intrinsic hardness of Iterated CVaR RL, i.e., when the agent is highly risk-sensitive, she needs to spend a number of trajectories on exploring critical but hard-to-reach states in order to identify an optimal policy.

Proof of Theorem 5. This proof uses a similar analytical procedure as Theorem 2 in (Dann & Brunskill, 2015).

First, we consider the instance in Figure 6 as in the proof of Theorem 2, where $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) > \alpha^{H-1}$. Below we prove that on this instance any $(\delta,\varepsilon)$-correct algorithm must use $\Omega\left(\frac{H^2A}{\varepsilon^2\alpha\cdot\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}\log\left(\frac{1}{\delta}\right)\right)$ trajectories.

Fix an algorithm $\mathcal{A}$. Define $E_{s_n} := \{\pi(s_n) = a_J\}$ as the event that the output policy $\pi$ of algorithm $\mathcal{A}$ chooses the optimal action in state $s_n$. Using a similar analysis as in the proof of Theorem 2 (Eq. (15)), we have

$$V^{*}_1(s_1) - V^{\pi}_1(s_1) = 0.6(H-n)\cdot\frac{\eta}{\alpha}\cdot\left(1 - \mathbb{1}\{E_{s_n}\}\right).$$

For $\pi$ to be $\varepsilon$-optimal, we need

$$\varepsilon \ge V^{*}_1(s_1) - V^{\pi}_1(s_1) = 0.6(H-n)\cdot\frac{\eta}{\alpha}\cdot\left(1 - \mathbb{1}\{E_{s_n}\}\right),$$

which is equivalent to

$$\mathbb{1}\{E_{s_n}\} \ge 1 - \frac{\varepsilon\alpha}{0.6(H-n)\cdot\eta}.$$

Let $\eta = \frac{8e^4\varepsilon\alpha}{0.6c_0(H-n)}$ for some constant $c_0$ and small enough $\varepsilon$.
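Then, for $\pi$ to be $\varepsilon$-optimal, we need $\mathbb{1}\{E_{s_n}\} \ge 1 - \frac{c_0}{8e^4}$. Let $\phi := 1 - \frac{c_0}{8e^4}$. For algorithm $\mathcal{A}$ to be $(\delta,\varepsilon)$-correct, we need

$$1-\delta \le \Pr\left[V^{*}_1(s_1) - V^{\pi}_1(s_1) \le \varepsilon\right] \le \Pr\left[\mathbb{1}\{E_{s_n}\} \ge \phi\right] \le \frac{\mathbb{E}[\mathbb{1}\{E_{s_n}\}]}{\phi} = \frac{1}{\phi}\Pr[E_{s_n}],$$

which is equivalent to

$$\Pr[\bar E_{s_n}] = 1 - \Pr[E_{s_n}] \le 1 - \phi + \phi\delta.$$

Recall that $0 < \alpha < \frac{1}{3}$. For any $j\in[A]$, $\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right) = \mathrm{KL}\left(\mathrm{Ber}(\alpha)\,\|\,\mathrm{Ber}(\alpha-\eta)\right) \le \frac{\eta^2}{(\alpha-\eta)(1-\alpha+\eta)} \le \frac{c_1\cdot\eta^2}{\alpha}$ for some constant $c_1$ and small enough $\eta$.

Let $V_{s_n}$ be the number of times that algorithm $\mathcal{A}$ visited state $s_n$. To ensure $\Pr[\bar E_{s_n}] \le 1 - \phi + \phi\delta$, we need

$$\begin{aligned}
\mathbb{E}[V_{s_n}] &\ge \sum_{j=1}^{A}\frac{1}{\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right)}\log\left(\frac{c_2}{1-\phi+\phi\delta}\right) \\
&\ge \frac{\alpha A}{c_1\cdot\eta^2}\log\left(\frac{c_2}{1-\phi+\phi\delta}\right) \\
&\ge \frac{\alpha A\cdot 0.6^2c_0^2(H-n)^2}{c_1\cdot 64e^8\varepsilon^2\alpha^2}\log\left(\frac{c_2}{\frac{c_0}{8e^4}+\delta}\right)
\end{aligned}$$

for some constant $c_2$. Let $c_0$ be a small constant such that $\frac{c_0}{8e^4} < \delta$.

Let $w(s_n)$ denote the probability of visiting $s_n$ in an episode; this probability is the same for all policies. Let $\tau$ denote the number of trajectories required by $\mathcal{A}$ to be $(\delta,\varepsilon)$-correct. Then, $\tau$ must satisfy

$$\tau \ge \frac{A\cdot 0.6^2c_0^2(H-n)^2}{c_1\cdot 64e^8\varepsilon^2\alpha\cdot w(s_n)}\log\left(\frac{c_2}{\frac{c_0}{8e^4}+\delta}\right) = \Omega\left(\frac{H^2A}{\varepsilon^2\alpha\cdot w(s_n)}\log\left(\frac{1}{\delta}\right)\right).$$

Recall that $n < \frac{1}{2}H$ and $0 < \alpha < \mu < \frac{1}{3}$. Thus, in the constructed instance (Figure 6), we have $\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s) = w(s_n) = \mu^{n-1} > \alpha^{H-1}$, and

$$\tau = \Omega\left(\frac{H^2A}{\varepsilon^2\alpha\cdot\min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)}\log\left(\frac{1}{\delta}\right)\right).$$

Next, we consider the instance in Figure 7 as in the proof of Theorem 2, where $\alpha^{H-1} > \min_{\pi,h,s:\, w_{\pi,h}(s)>0} w_{\pi,h}(s)$. Below we prove that on this instance any $(\delta,\varepsilon)$-correct algorithm must use $\Omega\left(\frac{A}{\alpha^{H-1}\varepsilon^2}\log\left(\frac{1}{\delta}\right)\right)$ trajectories.

Define $E_{s_n} := \{\pi(s_n) = a_J\}$ as the event that the output policy $\pi$ of algorithm $\mathcal{A}$ chooses the optimal action in state $s_n$. Using a similar analysis as in the proof of Theorem 2 (Eq. (18)), we have

$$V^{*}_1(s_1) - V^{\pi}_1(s_1) = \frac{0.6\eta}{\alpha}\cdot\left(1 - \mathbb{1}\{E_{s_n}\}\right).$$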
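For $\pi$ to be $\varepsilon$-optimal, we need

$$\varepsilon \ge V^{*}_1(s_1) - V^{\pi}_1(s_1) = \frac{0.6\eta}{\alpha}\cdot\left(1 - \mathbb{1}\{E_{s_n}\}\right),$$

which is equivalent to

$$\mathbb{1}\{E_{s_n}\} \ge 1 - \frac{\varepsilon\alpha}{0.6\eta}.$$

Let $\eta = \frac{8e^4\varepsilon\alpha}{0.6c_0}$ for some constant $c_0$ and small enough $\varepsilon$. Then, for $\pi$ to be $\varepsilon$-optimal, we need $\mathbb{1}\{E_{s_n}\} \ge 1 - \frac{c_0}{8e^4}$. Let $\phi := 1 - \frac{c_0}{8e^4}$. For algorithm $\mathcal{A}$ to be $(\delta,\varepsilon)$-correct, we need

$$1-\delta \le \Pr\left[V^{*}_1(s_1) - V^{\pi}_1(s_1) \le \varepsilon\right] \le \Pr\left[\mathbb{1}\{E_{s_n}\} \ge \phi\right] \le \frac{\mathbb{E}[\mathbb{1}\{E_{s_n}\}]}{\phi} = \frac{1}{\phi}\Pr[E_{s_n}],$$

which is equivalent to

$$\Pr[\bar E_{s_n}] = 1 - \Pr[E_{s_n}] \le 1 - \phi + \phi\delta.$$

Recall that $0 < \alpha < \frac{1}{4}$. For any $j\in[A]$, $\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right) = \mathrm{KL}\left(\mathrm{Ber}(\alpha)\,\|\,\mathrm{Ber}(\alpha-\eta)\right) \le \frac{\eta^2}{(\alpha-\eta)(1-\alpha+\eta)} \le \frac{c_1\cdot\eta^2}{\alpha}$ for some constant $c_1$ and small enough $\eta$.

Let $V_{s_n}$ be the number of times that algorithm $\mathcal{A}$ visited state $s_n$. To ensure $\Pr[\bar E_{s_n}] \le 1 - \phi + \phi\delta$, we need

$$\begin{aligned}
\mathbb{E}[V_{s_n}] &\ge \sum_{j=1}^{A}\frac{1}{\mathrm{KL}\left(p_{\mathrm{unif}}(s_n,a_j)\,\|\,p_j(s_n,a_j)\right)}\log\left(\frac{c_2}{1-\phi+\phi\delta}\right) \\
&\ge \frac{\alpha A}{c_1\cdot\eta^2}\log\left(\frac{c_2}{1-\phi+\phi\delta}\right) \\
&\ge \frac{\alpha A\cdot 0.6^2c_0^2}{c_1\cdot 64e^8\varepsilon^2\alpha^2}\log\left(\frac{c_2}{\frac{c_0}{8e^4}+\delta}\right)
\end{aligned}$$

for some constant $c_2$. Let $c_0$ be a small constant such that $\frac{c_0}{8e^4} < \delta$.

Let $w(s_n)$ denote the probability of visiting $s_n$ in an episode; this probability is the same for all policies. Let $\tau$ denote the number of trajectories required by $\mathcal{A}$ to be $(\delta,\varepsilon)$-correct. Then, $\tau$ must satisfy

$$\tau \ge \frac{A\cdot 0.6^2c_0^2}{c_1\cdot 64e^8\varepsilon^2\alpha\cdot w(s_n)}\log\left(\frac{c_2}{\frac{c_0}{8e^4}+\delta}\right) = \Omega\left(\frac{A}{\varepsilon^2\alpha\cdot w(s_n)}\log\left(\frac{1}{\delta}\right)\right).$$

Recall that $n = H-1$. Thus, in the constructed instance (Figure 7), we have $w(s_n) = \alpha^{n-1} = \alpha^{H-2}$, and

$$\tau = \Omega\left(\frac{A}{\varepsilon^2\alpha\cdot\alpha^{H-2}}\log\left(\frac{1}{\delta}\right)\right) = \Omega\left(\frac{A}{\varepsilon^2\alpha^{H-1}}\log\left(\frac{1}{\delta}\right)\right).$$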

F PROOFS FOR WORST PATH RL

In this section, we provide the proofs of regret upper and lower bounds (Theorems 4 and 6) for Worst Path RL.

F.1 PROOFS OF REGRET UPPER BOUND

F.1.1 CONCENTRATION

Recall that for any $k > 0$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, $n_k(s, a)$ is the number of times that $(s, a)$ was visited before episode $k$. For any $k > 0$ and $(s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A}$, let $n_k(s', s, a)$ denote the number of times that $(s, a)$ was visited and transitioned to $s'$ before episode $k$. For any policy $\pi$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, let $\upsilon_{\pi}(s, a)$ and $\upsilon_{\pi}(s)$ denote the probabilities that $(s, a)$ and $s$ are visited at least once in an episode under policy $\pi$, respectively.

Lemma 18. It holds that

$$\Pr\left[ n_k(s, a) \ge \frac{1}{2} \sum_{k'=1}^{k-1} \upsilon_{\pi^{k'}}(s, a) - \log\frac{SA}{\delta'}, \ \forall k > 0, \ \forall (s, a) \in \mathcal{S} \times \mathcal{A} \right] \ge 1 - \delta'.$$

Proof of Lemma 18. For any $k$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, conditioning on the filtration generated by episodes $1, \dots, k-1$, whether the algorithm visited $(s, a)$ at least once in episode $k$ is a Bernoulli random variable with success probability $\upsilon_{\pi^k}(s, a)$. Then, using Lemma F.4 in (Dann et al., 2017), we can obtain this lemma.

Lemma 19. It holds that

$$\Pr\left[ n_k(s', s, a) \ge \frac{1}{2} \cdot n_k(s, a) \cdot p(s'|s, a) - 2\log\frac{SA}{\delta'}, \ \forall k > 0, \ \forall (s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A} \right] \ge 1 - \delta'.$$

Proof of Lemma 19. For any $k$, $h \in [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, conditioning on the event $\{s^k_h = s, a^k_h = a\}$, the indicator $\mathbb{1}\{s^k_{h+1} = s'\}$ is a Bernoulli random variable with success probability $p(s'|s, a)$. Then, using Lemma F.4 in (Dann et al., 2017), we can obtain this lemma.

To summarize, we define the following concentration events, which will be used in the rest of the proof:

$$\mathcal{G}_1 := \left\{ n_k(s, a) \ge \frac{1}{2} \sum_{k'=1}^{k-1} \upsilon_{\pi^{k'}}(s, a) - \log\frac{SA}{\delta'}, \ \forall k > 0, \ \forall (s, a) \in \mathcal{S} \times \mathcal{A} \right\},$$

$$\mathcal{G}_2 := \left\{ n_k(s', s, a) \ge \frac{1}{2} \cdot n_k(s, a) \cdot p(s'|s, a) - 2\log\frac{SA}{\delta'}, \ \forall k > 0, \ \forall (s', s, a) \in \mathcal{S} \times \mathcal{S} \times \mathcal{A} \right\},$$

$$\mathcal{G} := \mathcal{G}_1 \cap \mathcal{G}_2.$$

Lemma 20. Letting $\delta' = \frac{\delta}{2}$, it holds that $\Pr[\mathcal{G}] \ge 1 - \delta$.

Proof of Lemma 20. This lemma can be obtained by combining Lemmas 18 and 19 via a union bound.
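To make the counters in Lemmas 18 and 19 concrete, the following minimal sketch tallies $n_k(s, a)$ and $n_k(s', s, a)$ from logged trajectories. The trajectory format (a list of `(s, a, s_next)` triples per episode) and the function name `counts_before` are assumptions made for illustration only.

```python
from collections import Counter
from typing import Hashable, List, Tuple

Transition = Tuple[Hashable, Hashable, Hashable]  # (s, a, s_next)


def counts_before(episodes: List[List[Transition]], k: int) -> Tuple[Counter, Counter]:
    """Return (n_k(s, a), n_k(s', s, a)) computed from episodes 1, ..., k-1.

    episodes[i] holds the transitions of episode i + 1 (hypothetical format).
    Keys of the second counter follow the paper's argument order (s', s, a).
    """
    n_sa: Counter = Counter()
    n_next_sa: Counter = Counter()
    for trajectory in episodes[: k - 1]:
        for s, a, s_next in trajectory:
            n_sa[(s, a)] += 1
            n_next_sa[(s_next, s, a)] += 1
    return n_sa, n_next_sa


if __name__ == "__main__":
    logged = [
        [(0, 0, 1), (1, 0, 1)],
        [(0, 0, 2), (2, 1, 2)],
        [(0, 0, 1)],
    ]
    n_sa, n_next_sa = counts_before(logged, k=3)
    print(n_sa[(0, 0)], n_next_sa[(1, 0, 0)], n_next_sa[(2, 0, 0)])
```

A successor state $s'$ counts as detected at $(s, a)$ exactly when $n_k(s', s, a) \ge 1$, which is the quantity the proof of Theorem 4 lower-bounds through Lemma 19.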

F.1.2 OVERESTIMATION AND GOOD STAGE

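Recall that in Worst Path RL, for any $k > 0$, $h \in [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$,

$$Q^*_h(s, a) := r(s, a) + \min_{s' \sim p(\cdot|s, a)} \big(V^*_{h+1}(s')\big) \quad \text{and} \quad V^*_h(s) := \max_{a \in \mathcal{A}} Q^*_h(s, a).$$

In addition,

$$\hat{Q}^k_h(s, a) := r(s, a) + \min_{s' \sim \hat{p}^k(\cdot|s, a)} \big(\hat{V}^k_{h+1}(s')\big) \quad \text{and} \quad \hat{V}^k_h(s) := \max_{a \in \mathcal{A}} \hat{Q}^k_h(s, a).$$

Lemma 21 (Overestimation). For any $k > 0$, $h \in [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, $\hat{Q}^k_h(s, a) \ge Q^*_h(s, a)$ and $\hat{V}^k_h(s) \ge V^*_h(s)$.

Remark. Lemma 21 shows that if the Q-value of some state-action pair is not accurately estimated, it can only be overestimated (not underestimated). This feature is due to the min metric in the Worst Path RL formulation (Eq. (2)).

Proof of Lemma 21. We prove this lemma by induction. For any $k > 0$ and $s \in \mathcal{S}$, $\hat{V}^k_{H+1}(s) = V^*_{H+1}(s) = 0$. For any $k > 0$, $h \in [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, since $r(s, a)$ is known and $\hat{V}^k_{h+1}(s) \ge V^*_{h+1}(s)$ (due to the induction hypothesis), if $\hat{p}^k(\cdot|s, a)$ has detected all successor states, then $\hat{Q}^k_h(s, a) \ge Q^*_h(s, a)$. Otherwise, if $\hat{p}^k(\cdot|s, a)$ has not detected all successor states, then the minimum is taken over a subset of the true successor states, and by the property of min we also have $\hat{Q}^k_h(s, a) \ge Q^*_h(s, a)$. Therefore, we have $\hat{V}^k_h(s) \ge V^*_h(s)$, which completes the proof.

Lemma 22 (Non-increasing Estimated Value). For any $k_1, k_2 > 0$ such that $k_1 \le k_2$, $h \in [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, $\hat{Q}^{k_1}_h(s, a) \ge \hat{Q}^{k_2}_h(s, a)$ and $\hat{V}^{k_1}_h(s) \ge \hat{V}^{k_2}_h(s)$.

Proof of Lemma 22. We prove this lemma by induction. For any $k_1, k_2 > 0$ such that $k_1 \le k_2$ and $s \in \mathcal{S}$, $\hat{V}^{k_1}_{H+1}(s) = \hat{V}^{k_2}_{H+1}(s) = 0$. For any $k_1, k_2 > 0$ such that $k_1 \le k_2$, $h \in [H]$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, since $r(s, a)$ is known and $\hat{V}^{k_1}_{h+1}(s) \ge \hat{V}^{k_2}_{h+1}(s)$ (due to the induction hypothesis), if $\hat{p}^{k_1}(\cdot|s, a)$ has detected all successor states, then $\hat{Q}^{k_1}_h(s, a) \ge \hat{Q}^{k_2}_h(s, a)$. Otherwise, if $\hat{p}^{k_1}(\cdot|s, a)$ has not detected all successor states, due to the min metric and the fact that $\hat{p}^{k_2}(\cdot|s, a)$ detects more (or the same) successor states than $\hat{p}^{k_1}(\cdot|s, a)$, we also have $\hat{Q}^{k_1}_h(s, a) \ge \hat{Q}^{k_2}_h(s, a)$. Therefore, we have $\hat{V}^{k_1}_h(s) \ge \hat{V}^{k_2}_h(s)$, which completes the proof.

Remark. Combining Lemmas 21 and 22, we have that as the episode $k$ increases, the estimated value $\hat{Q}^k_h(s, a)$ ($\hat{V}^k_h(s)$) will either decrease to its true value $Q^*_h(s, a)$ ($V^*_h(s)$) or stay the same.

Let $\mathcal{S}^* := \{s \in \mathcal{S} : \upsilon_{\pi^*}(s) > 0\}$ denote the set of states which are reachable under an optimal policy.

Lemma 23 (Good Stage). If there exists some episode $\bar{k} > 0$ which satisfies that for any $h \in [H]$ and $s \in \mathcal{S}^*$, $\hat{V}^{\bar{k}}_h(s) = V^*_h(s)$ and $\pi^{\bar{k}}_h(s)$ suggests an optimal action, then we have that for any $k \ge \bar{k}$, $h \in [H]$ and $s \in \mathcal{S}^*$, $\hat{V}^k_h(s) = V^*_h(s)$ and $\pi^k_h(s)$ suggests an optimal action, and thus for any $k \ge \bar{k}$, algorithm MaxWP takes an optimal policy.

Remark. Lemma 23 reveals that if in some episode $\bar{k}$, for any step $h$ and state $s \in \mathcal{S}^*$, algorithm MaxWP estimates the V-value accurately and chooses an optimal action, then from that episode on, algorithm MaxWP always takes an optimal policy. We say that algorithm MaxWP enters a good stage if, starting from some episode $\bar{k}$, for any $k \ge \bar{k}$, $h \in [H]$ and $s \in \mathcal{S}^*$, $\hat{V}^k_h(s) = V^*_h(s)$ and $\pi^k_h(s)$ suggests an optimal action.

Proof of Lemma 23. Suppose that in episode $\bar{k}$, we have that for any $h \in [H]$ and $s \in \mathcal{S}^*$, $\hat{V}^{\bar{k}}_h(s) = V^*_h(s)$ and $\pi^{\bar{k}}_h(s)$ suggests an optimal action. This is equivalent to the statement that in episode $\bar{k}$, for any $h \in [H]$ and $s \in \mathcal{S}^*$, for each optimal action $a$ (such that $Q^*_h(s, a) = V^*_h(s)$), $\hat{Q}^{\bar{k}}_h(s, a) = Q^*_h(s, a)$, and for each suboptimal action $a$ (such that $Q^*_h(s, a) < V^*_h(s)$), $Q^*_h(s, a) \le \hat{Q}^{\bar{k}}_h(s, a) < V^*_h(s)$. Using Lemmas 21 and 22, we have that for any $h \in [H]$ and $s \in \mathcal{S}^*$, as $k$ increases, $\hat{Q}^k_h(s, a)$ will either decrease to the true value $Q^*_h(s, a)$ or stay the same. Therefore, we have that for any $k \ge \bar{k}$, $h \in [H]$ and $s \in \mathcal{S}^*$, for each optimal action $a$, $\hat{Q}^k_h(s, a) = Q^*_h(s, a)$, and for each suboptimal action $a$, $Q^*_h(s, a) \le \hat{Q}^k_h(s, a) < V^*_h(s)$, which completes the proof.

Proof of Theorem 4. For any $(s, a) \in \mathcal{S} \times \mathcal{A}$, let

$$T(s, a) := \frac{1}{\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a) \cdot \min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)} \cdot 8\left(2\log\frac{SA}{\delta} + 1\right).$$

It holds that $T = \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} T(s, a)$.

According to Lemma 23, in order to prove Theorem 4, it suffices to prove that in episode $T + 1$, for any $h \in [H]$ and $s \in \mathcal{S}^*$, $\hat{V}^{T+1}_h(s) = V^*_h(s)$ and $\pi^{T+1}_h(s)$ suggests an optimal action, i.e., algorithm MaxWP has entered the good stage in episode $T + 1$. We prove this statement by contradiction.

Suppose that in episode $T + 1$, there exist some $h \in [H]$ and some $s \in \mathcal{S}^*$ which satisfy that $\hat{V}^{T+1}_h(s) > V^*_h(s)$ (the value function can only be overestimated) or $\pi^{T+1}_h(s)$ suggests a suboptimal action. If the policy $\pi^{T+1}$ taken in episode $T + 1$ is optimal, then there exist some $h \in [H]$, some $s \in \mathcal{S}^*$ and some optimal action $a$ which satisfy that $\upsilon_{\pi^{T+1}}(s) > 0$ and $\hat{Q}^{T+1}_h(s, a) > Q^*_h(s, a)$. Otherwise, if the policy $\pi^{T+1}$ taken in episode $T + 1$ is suboptimal, then there exist some $h \in [H]$, some $s \in \mathcal{S}^*$ and some suboptimal action $a$ which satisfy that $\upsilon_{\pi^{T+1}}(s) > 0$ and $\hat{Q}^{T+1}_h(s, a) > Q^*_h(s, a)$. Hence, in either case, there exist some $h \in [H]$ and some $(s, a) \in \mathcal{S} \times \mathcal{A}$ which satisfy that $\upsilon_{\pi^{T+1}}(s, a) > 0$ and $\hat{Q}^{T+1}_h(s, a) > Q^*_h(s, a)$.

Under the min metric in Worst Path RL, the overestimation of a Q-value comes from the following reasons: (i) Algorithm MaxWP has not detected the successor state with the lowest V-value. (ii) The V-values of successor states at the next step are overestimated.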
If the overestimation of $\hat{Q}^{T+1}_h(s, a)$ comes from the overestimation of the V-values at the next step (reason (ii)), then at the next step there exists some state-action pair whose Q-value is overestimated. Then, we can trace the overestimation from $(s, a)$ at step $h$ to some $(s', a')$ at some step $h' \ge h$, which satisfies that $\hat{Q}^{T+1}_{h'}(s', a') > Q^*_{h'}(s', a')$ and $\hat{V}^{T+1}_{h'+1}(x) = V^*_{h'+1}(x)$ for any $x \in \mathcal{S}$. In other words, the overestimation of $\hat{Q}^{T+1}_{h'}(s', a')$ arises purely because, at $(s', a')$, algorithm MaxWP has not detected the successor state $x$ with the lowest value $V^*_{h'+1}(x)$.

For any $k > 0$ and $(s, a) \in \mathcal{S} \times \mathcal{A}$, let $\mathcal{T}_k(s, a) = \{k' < k : \upsilon_{\pi^{k'}}(s, a) > 0\}$ denote the set of episodes before episode $k$ in which $(s, a)$ is reachable. We consider the following two cases according to whether $|\mathcal{T}_{T+1}(s', a')|$ is large enough to detect all successor states of $(s', a')$.

Case (1): If $|\mathcal{T}_{T+1}(s', a')| \ge T(s', a')$, using Lemma 18, we have

$$n_k(s', a') \ge \frac{1}{2} \sum_{k'=1}^{k-1} \upsilon_{\pi^{k'}}(s', a') - \log\frac{SA}{\delta'} \ge \frac{1}{2} \cdot T(s', a') \cdot \min_{\pi:\, \upsilon_{\pi}(s', a') > 0} \upsilon_{\pi}(s', a') - \log\frac{SA}{\delta'} \ge \frac{2\left(2\log\frac{SA}{\delta} + 1\right)}{\min_{s \in \mathrm{supp}(p(\cdot|s', a'))} p(s|s', a')}.$$

Then, using Lemma 19, we have that for any $s \in \mathrm{supp}(p(\cdot|s', a'))$,

$$n_k(s, s', a') \ge \frac{1}{2} \cdot n_k(s', a') \cdot \min_{s \in \mathrm{supp}(p(\cdot|s', a'))} p(s|s', a') - 2\log\frac{SA}{\delta} \ge \frac{1}{2} \cdot 2\left(2\log\frac{SA}{\delta} + 1\right) - 2\log\frac{SA}{\delta} = 1,$$

so every successor state of $(s', a')$ has been detected, which contradicts that $\hat{Q}^k_{h'}(s', a')$ is overestimated.

Case (2): If $|\mathcal{T}_{T+1}(s', a')| < T(s', a')$, we say that the overestimation in episode $T + 1$ is due to insufficient visitation of $(s', a')$. Then, among episodes $1, \dots, T$, we exclude the episodes contained in $\mathcal{T}_{T+1}(s', a')$, i.e., we ignore the episodes where $(s', a')$ is reachable and can be the source of the overestimation. The number of excluded episodes due to $(s', a')$ is $|\mathcal{T}_{T+1}(s', a')| < T(s', a')$.

According to Lemma 23, since episode $T + 1$ has not entered the good stage, for any $k \le T$, episode $k$ has also not entered the good stage. Then, for any $k \le T$, there exist some $h \in [H]$ and some $s \in \mathcal{S}^*$ which satisfy that $\hat{V}^k_h(s) > V^*_h(s)$ or $\pi^k_h(s)$ suggests a suboptimal action. This implies that there exist some $h \in [H]$ and some $(s, a) \in \mathcal{S} \times \mathcal{A}$ which satisfy that $\upsilon_{\pi^k}(s, a) > 0$ and $\hat{Q}^k_h(s, a) > Q^*_h(s, a)$.

Consider the last episode $k$ among episodes $1, \dots, T$ which is not excluded. Using the above argument, let $(s_k, a_k)$ denote the source of overestimation in episode $k$, which satisfies that $\hat{Q}^k_{h'}(s_k, a_k) > Q^*_{h'}(s_k, a_k)$ and $\hat{V}^k_{h'+1}(x) = V^*_{h'+1}(x)$ for any $x \in \mathcal{S}$. Since we have excluded all the episodes where $(s', a')$ is reachable among episodes $1, \dots, T$ and episode $k$ is not excluded, it holds that $(s_k, a_k) \ne (s', a')$.

We repeat the above analysis on $\mathcal{T}_k(s_k, a_k)$. If Case (1) happens, i.e., $|\mathcal{T}_k(s_k, a_k)| \ge T(s_k, a_k)$, then we can derive a contradiction and complete the proof. If Case (2) happens, i.e., $|\mathcal{T}_k(s_k, a_k)| < T(s_k, a_k)$, we exclude episode $k$ and the episodes contained in $\mathcal{T}_k(s_k, a_k)$. Then, the number of excluded episodes due to $(s_k, a_k)$ among episodes $1, \dots, T$ is at most $|\mathcal{T}_k(s_k, a_k)| + 1 \le T(s_k, a_k)$.

We repeat the above procedure. Once Case (1) happens, we can derive a contradiction and complete the proof. Otherwise, if Case (2) keeps happening, we will exclude the episodes due to the reachability and possible overestimation of $(s, a)$ for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, and the total number of excluded episodes is strictly smaller than $\sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} T(s, a) = T$. Thus, there exists some episode $k_0$ among episodes $1, \dots, T$ which satisfies that for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, $\upsilon_{\pi^{k_0}}(s, a) = 0$, which gives a contradiction.
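The overestimation property in Lemma 21 and the two reasons (i) and (ii) used in the proof above can be seen on a toy example. The sketch below runs backward induction under the min metric, once with the true successor sets and once with a detected set that misses a bad successor. The three-state instance, the function name `worst_path_q_values`, and the requirement that every detected successor set is non-empty are assumptions for illustration; this is not the paper's MaxWP implementation.

```python
from typing import Dict, List, Set, Tuple

StateAction = Tuple[int, int]


def worst_path_q_values(
    horizon: int,
    states: List[int],
    actions: List[int],
    reward: Dict[StateAction, float],
    support: Dict[StateAction, Set[int]],
) -> Dict[int, Dict[StateAction, float]]:
    """Backward induction Q_h(s, a) = r(s, a) + min over the given successor set.

    `support` may be the true support or the successors detected so far;
    each set is assumed non-empty (an assumption of this sketch).
    """
    values = {s: 0.0 for s in states}  # V_{H+1} = 0
    q_by_step: Dict[int, Dict[StateAction, float]] = {}
    for h in range(horizon, 0, -1):
        q = {
            (s, a): reward[(s, a)] + min(values[x] for x in support[(s, a)])
            for s in states
            for a in actions
        }
        values = {s: max(q[(s, a)] for a in actions) for s in states}
        q_by_step[h] = q
    return q_by_step


# Hypothetical instance: state 1 pays reward 1, state 2 pays 0; both are absorbing.
H, STATES, ACTIONS = 2, [0, 1, 2], [0, 1]
REWARD = {(0, 0): 0.0, (0, 1): 0.0, (1, 0): 1.0, (1, 1): 1.0, (2, 0): 0.0, (2, 1): 0.0}
TRUE_SUPPORT = {(0, 0): {1}, (0, 1): {1, 2}, (1, 0): {1}, (1, 1): {1}, (2, 0): {2}, (2, 1): {2}}
# The rare bad successor (state 2) of (0, 1) has not been observed yet.
DETECTED = dict(TRUE_SUPPORT)
DETECTED[(0, 1)] = {1}

if __name__ == "__main__":
    q_star = worst_path_q_values(H, STATES, ACTIONS, REWARD, TRUE_SUPPORT)
    q_hat = worst_path_q_values(H, STATES, ACTIONS, REWARD, DETECTED)
    print("Q*_1(0, 1) =", q_star[1][(0, 1)], " estimate =", q_hat[1][(0, 1)])
```

In this toy instance the estimate for the pair $(0, 1)$ is too high only because the successor with the lowest value is missing from the detected set (reason (i)); once it is added, the estimate coincides with the true value, in line with Lemma 22.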

F.2 REGRET LOWER BOUND

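In the following, we establish a regret lower bound for Worst Path RL and give its proof. To exclude trivial instance-specific algorithms and formally state our lower bound, we first define an $o(K)$-consistent algorithm as an algorithm which guarantees an $o(K)$ regret on any instance of Worst Path RL.

Theorem 6. There exists an instance of Worst Path RL, for which the regret of any $o(K)$-consistent algorithm is at least

$$\Omega\left( \max_{(s, a):\, \exists h,\, a \ne \pi^*_h(s)} \frac{H}{\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a) \cdot \min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)} \right),$$

where $\max_{(s, a):\, \exists h,\, a \ne \pi^*_h(s)}$ takes the maximum over all $(s, a)$ such that $a$ is sub-optimal in state $s$ at some step.

The intuition behind this lower bound is as follows. For a critical but hard-to-reach state $s$, any $o(K)$-consistent algorithm must explore all actions $a$ in state $s$, in order to detect their induced successor states $s'$ and distinguish the optimal action. This process incurs a regret dependent on the factors $\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a)$ and $\min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)$, and hence the lower bound.

Proof of Theorem 6. Consider the instance $\mathcal{I}$ shown in Figure 8. The action space contains two actions, i.e., $\mathcal{A} = \{a_1, a_2\}$. The state space is $\mathcal{S} = \{s_1, s_2, \dots, s_n, x_1, x_2, x_3\}$, where $n = S - 3$ and $s_1$ is the initial state. Let $H \gg S$ and $0 < \alpha < \frac{1}{4}$.

The reward functions are as follows. For any $a \in \mathcal{A}$, $r(x_1, a) = 1$, $r(x_2, a) = 0.8$ and $r(x_3, a) = 0.2$. For any $i \in [n]$ and $a \in \mathcal{A}$, $r(s_i, a) = 0$.

The transition distributions are as follows. For any $a \in \mathcal{A}$, $p(s_2|s_1, a) = \alpha$, $p(x_1|s_1, a) = 1 - 3\alpha$, $p(x_2|s_1, a) = \alpha$ and $p(x_3|s_1, a) = \alpha$. For any $i \in \{2, \dots, n-1\}$ and $a \in \mathcal{A}$, $p(s_{i+1}|s_i, a) = \alpha$ and $p(x_1|s_i, a) = 1 - \alpha$. The states $x_1$, $x_2$ and $x_3$ are absorbing, i.e., for any $a \in \mathcal{A}$, $p(x_1|x_1, a) = 1$, $p(x_2|x_2, a) = 1$ and $p(x_3|x_3, a) = 1$. The state $s_n$ is a bandit state, which has an optimal action and a sub-optimal action. Let $a^*$ denote the optimal action in state $s_n$, which is uniformly drawn from $\{a_1, a_2\}$, and let $a_{\mathrm{sub}}$ denote the other, sub-optimal action in state $s_n$. For the optimal action $a^*$, $p(x_2|s_n, a^*) = 1$. For the sub-optimal action $a_{\mathrm{sub}}$, $p(x_2|s_n, a_{\mathrm{sub}}) = 1 - \alpha$ and $p(x_3|s_n, a_{\mathrm{sub}}) = \alpha$.

Fix an $o(K)$-consistent algorithm $\mathbb{A}$, which guarantees a sub-linear regret on any instance of Worst Path RL. We claim that $\mathbb{A}$ needs to observe the transition from $(s_n, a_{\mathrm{sub}})$ to $x_3$ at least once. Otherwise, without any observation of the transition from $(s_n, a_{\mathrm{sub}})$ to $x_3$, $\mathbb{A}$ can only trivially choose $a_1$ or $a_2$ in state $s_n$, and whichever of $a_1$ and $a_2$ it chooses, it will suffer a linear regret in the counter instance where the unchosen action is optimal. Thus, any $o(K)$-consistent algorithm must observe the transition from $(s_n, a_{\mathrm{sub}})$ to $x_3$ at least once, and needs at least

$$\frac{1}{\upsilon_{\pi_{\mathrm{sub}}}(s_n, a_{\mathrm{sub}}) \cdot p(x_3|s_n, a_{\mathrm{sub}})}$$

episodes with sub-optimal policies. Here $\pi_{\mathrm{sub}}$ denotes a policy which chooses $a_{\mathrm{sub}}$ in state $s_n$, and $\upsilon_{\pi_{\mathrm{sub}}}(s_n, a_{\mathrm{sub}})$ denotes the probability that $(s_n, a_{\mathrm{sub}})$ is visited at least once in an episode under policy $\pi_{\mathrm{sub}}$.

Once the agent takes a sub-optimal policy in an episode, she will suffer regret $0.6(H - n)$ in this episode. Therefore, $\mathbb{A}$ needs to suffer at least

$$\Omega\left( \frac{1}{\upsilon_{\pi_{\mathrm{sub}}}(s_n, a_{\mathrm{sub}}) \cdot p(x_3|s_n, a_{\mathrm{sub}})} \cdot 0.6(H - n) \right)$$

regret in expectation. Since in the constructed instance (Figure 8)

$$\max_{(s, a):\, \exists h,\, a \ne \pi^*_h(s)} \frac{1}{\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a) \cdot \min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)} = \frac{1}{\upsilon_{\pi_{\mathrm{sub}}}(s_n, a_{\mathrm{sub}}) \cdot p(x_3|s_n, a_{\mathrm{sub}})},$$

and $H \gg S$ implies $H - n = \Omega(H)$, we have that $\mathbb{A}$ needs to suffer at least

$$\Omega\left( \max_{(s, a):\, \exists h,\, a \ne \pi^*_h(s)} \frac{H}{\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a) \cdot \min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)} \right)$$

regret in expectation.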
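To get a feel for the quantity $\frac{1}{\upsilon_{\pi_{\mathrm{sub}}}(s_n, a_{\mathrm{sub}}) \cdot p(x_3|s_n, a_{\mathrm{sub}})}$ in this instance, the sketch below builds the chain part of the transition kernel described above and propagates the state distribution from $s_1$. Because the chain only moves forward, the probability of visiting $s_n$ at least once equals the probability of being in $s_n$ at step $n$, which evaluates to $\alpha^{n-1}$. The state labels, function names, and the numeric values $\alpha = 0.2$, $n = 4$ are illustrative assumptions.

```python
import math
from collections import defaultdict
from typing import Dict


def chain_kernel(alpha: float, n: int) -> Dict[str, Dict[str, float]]:
    """Transitions out of s_1, ..., s_{n-1}; they are identical for both actions."""
    kernel = {"s1": {"s2": alpha, "x1": 1 - 3 * alpha, "x2": alpha, "x3": alpha}}
    for i in range(2, n):
        kernel[f"s{i}"] = {f"s{i + 1}": alpha, "x1": 1 - alpha}
    return kernel


def reach_probability(alpha: float, n: int) -> float:
    """Probability that an episode started in s_1 reaches the bandit state s_n (n >= 2)."""
    kernel = chain_kernel(alpha, n)
    dist: Dict[str, float] = {"s1": 1.0}
    for _ in range(n - 1):
        nxt: Dict[str, float] = defaultdict(float)
        for state, mass in dist.items():
            if state in kernel:
                for succ, prob in kernel[state].items():
                    nxt[succ] += mass * prob
            else:
                nxt[state] += mass  # absorbing states keep their mass
        dist = nxt
    return dist.get(f"s{n}", 0.0)


def episodes_to_observe_x3(alpha: float, n: int) -> float:
    """The order 1 / (upsilon * p(x_3 | s_n, a_sub)) from the proof, with p = alpha."""
    return 1.0 / (reach_probability(alpha, n) * alpha)


if __name__ == "__main__":
    alpha, n = 0.2, 4
    print("upsilon(s_n) =", reach_probability(alpha, n))
    print("episodes of order", episodes_to_observe_x3(alpha, n))
```

In this instance the number of sub-optimal episodes therefore scales as $\alpha^{-n}$, which is why a hard-to-reach bandit state dominates the lower bound.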

G TECHNICAL TOOL

In this section, we present a useful technical tool. Lemma 24 (Lemma 13 in (Ménard et al., 2021) ). 



a) is a distorted visitation probability under the conditional transition probability β α,V π k (•|•, •). Inequality (b) uses the concentration of CVaR and Eq. (3). Inequality (c) follows from recurrently applying steps (a)-(b) to unfold V k h (•) -V π k h (•) for h = 2, . . . , H, and the fact that w CVaR,α,V π k kh (s, a) is the visitation probability under conditional transition probability β α,V π k (•|•, •). Eq. (

Figure 1: Instance for the lower bound.

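With probability at least $1 - \delta$, the regret of algorithm MaxWP is bounded by

$$O\left( \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} \frac{H}{\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a) \cdot \min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)} \log\frac{SA}{\delta} \right),$$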

υπ (s,a)>0 υπ(s,a)•min s ′ ∈supp(p(•|s,a)) p(s ′ |s,a) ) for Worst Path RL, which demonstrates the tightness of the factors min π: υπ(s,a)>0 υ π (s, a) and min s ′ ∈supp(p(•|s,a)) p(s ′ |s, a).

Figure 2: Experimental results for Iterated CVaR RL.

Figure 3: Illustrating example for the comparison between CVaR MDP and Iterated CVaR RL.

Figure 4: Illustrating example for Lemma 2. For each $s' \in \{s_1, s_2, s_3, s_4\}$, the height of the bar denotes the value $V(s')$ (fixed), and the width of the bar denotes the transition probability $p(s'|s, a)$ or $\hat{p}^k(s'|s, a)$. The colored part of the bars denotes the worst $\alpha$-portion successor states (i.e., with the lowest $\alpha$-portion values $V(s')$). In this example, $\alpha = 0.5$.

state s line-l , we have that µ α,V (s line-l |s, a) = p(s line-l |s, a) -( s ′ ∈S lef t p(s ′ |s, a) + p(s line-l |s, a) -α) and μk;α,V (s line-l |s, a) = pk (s line-l |s, a).

r |s, a) ≤ s ′ ∈S lef t pk (s ′ |s, a) -p(s ′ |s, a) + s ′ ∈S middle pk (s ′ |s, a) + pk (s line-l |s, a) -p(s line-l |s, a) +   s ′ ∈S lef t p(s ′ |s, a) + p(s line-l |s, a) -αs ′ ∈S lef t pk (s ′ |s, a)s ′ ∈S middle pk (s ′ |s, a) -pk (s line-l |s, a)   ≤ s ′ ∈S lef t pk (s ′ |s, a) -p(s ′ |s, a) + s ′ ∈S middle pk (s ′ |s, a) + pk (s line-l |s, a) -p(s line-l |s, a) + s ′ ∈S lef t p(s ′ |s, a) + p(s line-l |s, a) -α +   αs ′ ∈S lef t pk (s ′ |s, a)s ′ ∈S middle pk (s ′ |s,a) -pk (s line-l |s, a) lef t pk (s ′ |s, a) -p(s ′ |s, a) + s ′ ∈S middle pk (s ′ |s, a) + pk (s line-l |s, a) -p(s line-l |s, a) +   s ′ ∈S lef t p(s ′ |s, a) + p(s line-l |s, a) -α s ′ ∈S lef t pk (s ′ |s, a)s ′ ∈S middle pk (s ′ |s, a) -pk (s line-l |s, a)   = s ′ ∈S lef t pk (s ′ |s, a) -p(s ′ |s, a) + pk (s line-l |s, a) -p(s line-l |s, a) + s ′ ∈S lef t p(s ′ |s, a)s ′ ∈S lef t pk (s ′ |s, a) + p(s line-l |s, a) -pk (s line-l |s, a) ≤ s ′ ∈S lef t pk (s ′ |s, a) -p(s ′ |s, a) + pk (s line-l |s, a) -p(s line-l |s, a) + s ′ ∈S lef t p(s ′ |s, a) -pk (s ′ |s, a) + p(s line-l |s, a) -pk (s line-l |s, a) ≤2 s ′ ∈S pk (s ′ |s, a) -p(s ′ |s, a) , (7) where (a) is due to s ′ ∈S lef t p(s ′ |s, a) + p(s line-l |s, a) -α ≥ 0 by the definition of state s line-l .

[H]  and (s, a) ∈ S × A, let w kh (s, a) denote the probability of visiting (s, a) at step h of episode k. Then, it holds that for any k > 0, h ∈ [H] and (s, a) ∈ S ×A, w kh (s, a) ∈ [0, 1] and (s,a)∈S×A w kh (s, a) = 1.

a) where (a) uses the fact that (s, a) ∈ L k and the definition of L k , and (b) is due to that for any k > 0, h ∈ [H] and (s, a) ∈ S × A, w kh (s, a) ∈ [0, 1].

=k0(s,a) w k ′ (s, a).Let a 1 := w k0(s,a) (s, a), a 2 := w k0(s,a)+1 (s, a), . . . , a K-k0(s,a)+1 := w K (s, a). Define functionF (x) = ⌊x⌋ i=1 a i + a ⌈x⌉ (x -⌊x⌋), where 0 ≤ x ≤ K -k 0 (s, a) + 1. If x is an integer, we have F (x) =x i=1 a i , and otherwise, F (x) interpolates between the function values for integers x. The derivative of F (s) is f (x) = a ⌈x⌉ .

and (s, a) ∈ S ×A, w CV aR,α,V π k kh (s, a) ∈ [0, 1] and (s,a)∈S×A w CV aR,α,V π k kh (s, a) = 1. Lemma 8. For any risk level α ∈ (0, 1], k > 0, h ∈ [H] and (s, a) ∈ S × A, if w kh (s, a) = 0, then w CV aR,α,V π k kh (s, a) = 0. Proof of Lemma 8. If w kh (s, a) = 0, then the algorithm has zero probability to visit (s, a) at step h of episode k, which means that (s, a) is unreachable under transition probability p(•|•, •).

and therefore w CV aR,α,V π k kh (s, a) = 0. Lemma 9. For any functions V 1 , . . . , V H : S → R, k > 0, h ∈ [H] and (s, a) ∈ S × A such that w kh (s, a) > 0, a) denotes the conditional probability of visiting (s, a) at step h of episode k, conditioning on transitioning to the worst α-portion successor states s ′ (i.e., with the lowest α-portion values V h ′ +1 (s ′ )) at each step h ′ = 1, . . . , h -1.Proof of Lemma 9. Since w CVaR,α,V kh (s, a) is the conditional probability of visiting (s, a), we have w CVaR,α,V kh (s, a) ∈ [0, 1]. Since w kh (s, a) is the probability of visiting (s, a) at step h under policy π k and min π,h,(s,a): w π,h (s,a)>0 w π,h (s, a) is the minimum probability of visiting any reachable (s, a) at any step h over all policies π, we have w kh (s, a) ≥ min π,h,(s,a): w π,h (s,a)>0 w π,h (s, a). s 1 be the initial state. Since w kh (s, a) and w CVaR,α,V kh (s, a) are the probabilities of visiting (s, a) at step h with policy π k under transition probability p(•|•, •) and conditional transition probability

Figure 5: Illustrating example for Lemma 11. For each s ′ ∈ {s 1 , s 2 , s 3 }, the height of the bar denotes the value V (s ′ ) or V (s ′ ), and the width of the bar denotes the transition probability p(s ′ |s, a) (fixed). The colored part of the bars denotes the worst α-portion successor states (i.e., with the lowest α-portion values V (s ′ ) or V (s ′ )). In this example, α = 0.5. ≤4SAH log HSA δ ′ + 5SAH, where (a) is due to that (s, a) / ∈ L k(s,a) , and for any k > 0, h ∈ [H] and (s, a) ∈ S × A, w kh (s, a) ∈ [0, 1].

Here (a) is due to Lemma 8. (b) comes from Lemma 9. (c) uses the fact that for any deterministic policy π, h ∈ [H] and (s, a) ∈ S × A, we have either w π,h (s, a) = w π,h (s) or w π,h (s, a) = 0, and thus min π,h,(s,a): w π,h (s,a)>0 w π,h (s, a) = min π,h,s: w π,h (s)>0 w π,h (s).

(a) is due to Lemma 5. (b) uses Lemma 2 and the fact that for any k > 0, h ∈ [H] and s ∈ S, V k h (s) ∈ [0, H], and thus for any k > 0, h ∈ [H] and (s, a) ∈ S ×A,

and also uses Lemma 11. (c) comes from the property of min{•, •}. (d) and (e) follow from recurrently applying steps (a)-(c). (f) is due to that w CV aR,α,V π k kh (s, a) is defined as the probability of visiting (s, a) at step h of episode k under the conditional transition probability β

(s,a)∈S×A w CVaR,α,V π k kh (s, a) = 1. (b) comes from Lemma 9. (c) uses Lemma 7 and the fact that for any deterministic policy π, h ∈ [H] and (s, a) ∈ S × A, we have either w π,h (s, a) = w π,h (s) or w π,h (s, a) = 0, and thus min π,h,(s,a): w π,h (s,a)>0 w π,h (s, a) = min π,h,s: w π,h (s)>0 w π,h (s).

Figure 6: Instance of lower bounds (Theorems 2 and 5) for the $\min_{\pi, h, s:\, w_{\pi,h}(s) > 0} w_{\pi,h}(s) > \alpha^{H-1}$ case.

a), where (a) uses Lemma 11 with empirical transition probability pk (•|s, a), conditional empirical transition probability βk;α,V k h+1 (•|s, a), and values V k h+1 , V k h+1 , and (b) is due to the induction hypothesis.

a) • H Here (b) is due to that for any k > 0, h ∈ [H] and s ∈ S, J k h (s) ∈ [0, H]. (c) comes from Lemma 14. (e) and (f) follow from recurrently applying steps (a)-(d). (g) uses the fact that w CVaR,α,V k kh (s, a) is defined as the probability of visiting (s, a) at step h of episode k under the conditional transition probability β α,V k h ′ +1 (•|•, •) for each step h ′ = 1, . . . , h -1.

s)), Qk h (s, a) = Q * h (s, a), and for each suboptimal action a (such thatQ * h (s, a) < V * h (s)), Q * h (s, a) ≤ Qk h (s, a) < V * h(s). Using Lemmas 21 and 22, we have that for any h ∈ [H] and s ∈ S * , as k increases, Qk h (s, a) will either decrease to the true value Q * h (s, a) or keep the same. Therefore, we have that for any k ≥ k, h ∈ [H] and s ∈ S * , for each optimal action a, Qk h (s, a) = Q * h (s, a), and for each suboptimal action a, Q * h (s, a) ≤ Qk h (s, a) < V * h (s), which completes the proof. For any (s, a) ∈ S × A, let T (s, a) := 1 min π: υπ(s,a)>0 υ π (s, a) • min s ′ ∈supp(p(•|s,a)) p(s ′ |s, a) • 8 2 log SA δ + 1 . It holds that T = (s,a)∈S×A T (s, a).

a ′ ) • min π: υπ(s ′ ,a ′ )>0 υ π (s ′ , a ′ ) -log SA δ s ′ ∈supp(p(•|s,a)) p(s ′ |s, a)Then, using Lemma 19, we have that for any s ∈ supp(p(•|s ′ , a ′ )),n k (s, s ′ , a ′ ) ≥ 1 2 • n k (s ′ , a ′ ) • min s∈supp(p(•|s ′ ,a ′ ))

There exists an instance of Worst Path RL, for which the regret of any $o(K)$-consistent algorithm is at least

$$\Omega\left( \max_{(s, a):\, \exists h,\, a \ne \pi^*_h(s)} \frac{H}{\min_{\pi:\, \upsilon_{\pi}(s, a) > 0} \upsilon_{\pi}(s, a) \cdot \min_{s' \in \mathrm{supp}(p(\cdot|s, a))} p(s'|s, a)} \right),$$

Figure 8: The instance for lower bound under the min metric (Theorem 6).

Therefore, A needs to suffer at leastΩ 1 υ π sub (s n , a sub ) • p(x 3 |s n , a sub ) • 0.6(H -n) regret in expectation.Since in the constructed instance (Figure8)max (s,a): ∃h, a̸ =π * h (s) 1 min π: υπ(s,a)>0 υ π (s, a) • min s ′ ∈supp(p(•|s,a)) p(s ′ |s, a) = 1 υ(s n , a sub ) • p(x 3 |s n , a sub ) ,we have that A needs to suffer at least

Let A, B, C, D, E and β be positive scalars such that 1 ≤ B ≤ E and β ≥ e. If T ≥ 0 satisfies T ≤ C T A log (βT ) + B log 2 (βT ) + D A log (βT ) + E log 2 (βT ) , (A + E) (C + D) .

)

s) (due to the induction hypothesis), if pk1 (•|s, a) has detected all successor states, then Qk1 h (s, a) ≥ Qk2 h (s, a). Otherwise, if pk1 (•|s, a) has not detected all successor states, due to the min metric and that pk2 (•|s, a) will detect more (or the same) successor states than pk1 (•|s, a), Remark. Combining Lemmas 21 and 22, we have that as the episode k increases, the estimated value Qkh (s, a) ( V k h (s)) will decrease to its true value Q * h (s, a) (V * h (s))or keep the same. Let S * := {s ∈ S : v π * (s) > 0} denote the set of states which are reachable for an optimal policy. Lemma 23 (Good Stage). If there exists some episode k > 0 which satisfies that for any h ∈ [H] and s ∈ S * , V k h (s) = V * h (s) and π k h (s) suggests an optimal action, then we have that for any k ≥ k, h ∈ [H] and s ∈ S * , V k h (s) = V * h (s) and π k h (s) suggests an optimal action, and thus for any k ≥ k, algorithm MaxWP takes an optimal policy.

ACKNOWLEDGEMENTS

The work of Yihan Du and Longbo Huang is supported by the Technology and Innovation Major Project of the Ministry of Science and Technology of China under Grant 2020AAA0108400 and 2020AAA0108403, the Tsinghua University Initiative Scientific Research Program, and Tsinghua Precision Medicine Foundation 10001020109. The work of Siwei Wang was supported in part by the National Natural Science Foundation of China Grant 62106122.

