RISK-AWARE REINFORCEMENT LEARNING WITH COHERENT RISK MEASURES AND NON-LINEAR FUNCTION APPROXIMATION

Abstract

We study the risk-aware reinforcement learning (RL) problem in episodic finite-horizon Markov decision processes (MDPs) with unknown transition and reward functions. In contrast to the risk-neutral RL problem, we consider minimizing the risk of obtaining low rewards, which arises from both the intrinsic randomness of the MDP and imperfect knowledge of the model. Our work provides a unified framework for analyzing the regret of risk-aware RL policies with coherent risk measures in conjunction with non-linear function approximation, which yields the first sub-linear regret bounds in this setting. Finally, we validate our theoretical results via empirical experiments on synthetic and real-world data.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) is a control-theoretic problem in which an agent interacts with an unknown environment and aims to maximize its expected total reward. Due to the intrinsic randomness of the environment, even a policy with a high expected total reward may occasionally produce very low rewards. This uncertainty is problematic in many real-life applications like competitive games (Mnih et al., 2013) and healthcare (Liu et al., 2020), where the agent (or decision-maker) needs to be risk-averse. For example, patients' drug responses are stochastic due to their varying physiology or genetic profiles (McMahon & Insel, 2012); it is therefore desirable to select a set of treatments that yields high effectiveness while minimizing the possibility of adverse effects (Beutler et al., 2016; Fatemi et al., 2021). Existing RL policies that maximize the risk-neutral total reward cannot lead to an optimal risk-aware RL policy for problems where the total reward is uncertain (Yu et al., 2018). Therefore, our goal is to design an RL algorithm that learns a risk-aware RL policy minimizing the risk of having a small expected total reward.

How, then, should we learn a risk-aware RL policy? A natural approach is to directly learn a policy that minimizes the risk of having a small expected total reward (Howard & Matheson, 1972). To quantify such risk, one can use risk measures like entropic risk (Föllmer & Knispel, 2011), value-at-risk (VaR) (Dempster, 2002), conditional value-at-risk (CVaR) (Rockafellar et al., 2000), or entropic value-at-risk (EVaR) (Ahmadi-Javid, 2012). These risk measures capture the volatility of the total reward and quantify the possibility of rare but catastrophic events. The entropic risk measure can be viewed as a mean-variance criterion, where the risk is expressed through the variance of the total reward (Fei et al., 2021).
Alternatively, VaR, CVaR, and EVaR use quantile criteria, which are often preferred over the mean-variance criterion for better risk management (Chapter 3 of Kisiala (2015)). Among these, coherent risk measures such as CVaR and EVaR are preferred because they enjoy compelling theoretical properties such as coherence (Rockafellar et al., 2000). Risk-aware RL algorithms with CVaR as the risk measure exist in the literature (Bäuerle & Ott, 2011; Yu et al., 2018; Rigter et al., 2021). However, apart from being customized only for CVaR, these algorithms suffer from two significant shortcomings. First, most of them focus on the tabular MDP setting and need multiple complete traversals of the state space (Bäuerle & Ott, 2011; Rigter et al., 2021). These traversals are prohibitively expensive for problems with a large state space and impossible for problems with a continuous state space, limiting these algorithms' applicability in practice. Second, the existing algorithms that consider continuous or infinite state spaces assume that the MDP is known, i.e., the transition probabilities and the reward of each state are known a priori. In such settings, the agent does not need to explore or generalize to unseen scenarios; the problem considered in Yu et al. (2018) is therefore a planning problem rather than a learning problem.

This paper alleviates both shortcomings by proposing a new risk-aware RL algorithm in which the MDP is unknown and which uses non-linear function approximation to address continuous state spaces. Recent works (Jin et al., 2020; Yang et al., 2020) have proposed RL algorithms with function approximation and finite-sample regret guarantees, but they focus only on the risk-neutral RL setting. Extending their results to the risk-aware RL setting is non-trivial due to two major challenges. First, the existing analyses rely heavily on the linearity of the expectation in the risk-neutral Bellman equation.
This linearity property does not hold in the risk-aware RL setting when a coherent risk measure replaces the expectation in the Bellman equation. How can we address this challenge? We overcome it by a non-trivial application of the super-additivity property of coherent risk measures (see Lemma 3 and its application in Appendix D). Risk-neutral RL algorithms need only one sample of the next state to construct an unbiased estimate of the Bellman update (Yang et al., 2020), since the expectation in the risk-neutral Bellman equation can be unbiasedly estimated with a single sample. This does not hold in the risk-aware RL setting; moreover, whether one can construct an unbiased estimate of an arbitrary risk measure using only one sample is unknown. This leads to the second major challenge: how can we construct an unbiased estimate of the risk-aware Bellman update? To resolve it, we assume access to a weak simulator that can sample different next states given the current state and action, and we use these samples to construct an unbiased estimator. Such an assumption is mild and holds in many real-world applications; e.g., a player can anticipate the opponent's next moves and hence the possible next states of the game. After resolving both challenges, we propose an algorithm that uses a risk-aware value iteration procedure based on the upper confidence bound (UCB) and has a finite-sample sub-linear regret upper bound. Specifically, our contributions are as follows:

• We first formalize the risk-aware RL setting with coherent risk measures, namely the risk-aware objective function and the risk-aware Bellman equation, in Section 3. We then introduce the notion of regret for a risk-aware RL policy.

• We propose a general risk-aware RL algorithm named Risk-Aware Upper Confidence Bound (RA-UCB) for an entire class of coherent risk measures in Section 4. RA-UCB uses UCB-based value functions with non-linear function approximation and enjoys a finite-sample sub-linear regret upper bound guarantee.

• We provide a unified framework to analyze regret for any coherent risk measure in Section 4.1. The novelty of our analysis lies in the decomposition of the risk-aware RL policy's regret via the super-additivity property of coherent risk measures (shown in the proof of Lemma 4 in Appendix D.2).

• Our empirical experiments on synthetic and real datasets validate different performance aspects of our proposed algorithm in Section 5.

1.1. RELATED WORK

Risk-aware MDPs were first introduced in the seminal work of Howard & Matheson (1972) with the use of an exponential utility function known as the entropic risk measure. Since then, risk-aware MDPs have been studied with different risk criteria: optimizing moments of the total reward (Jaquette, 1973), exponential utility or entropic risk (Borkar, 2001; 2002; Bäuerle & Rieder, 2014; Fei et al., 2020; 2021; Moharrami et al., 2022), the mean-variance criterion (Sobel, 1982; Li & Ng, 2000; La & Ghavamzadeh, 2013; Tamar et al., 2016), and conditional value-at-risk (Boda & Filar, 2006; Artzner et al., 2007; Bäuerle & Mundt, 2009; Bäuerle & Ott, 2011; Tamar et al., 2015; Yu et al., 2018; Rigter et al., 2021). Vadori et al. (2020) focuses on the variability or uncertainty of the rewards. Many of these existing works assume the MDP is known a priori (known reward and transition kernels) (Yu et al., 2018), focus on the optimization problem (Bäuerle & Ott, 2011; Yu et al., 2018), or study asymptotic behaviors of algorithms (e.g., does an optimal policy exist, and if so, is it Markovian?) (Bäuerle & Ott, 2011; Bäuerle & Rieder, 2014).

The closest works to ours are Fei et al. (2021); Fei & Xu (2022), which consider risk-aware reinforcement learning in the function approximation and regret minimization setting. However, they use the entropic risk measure. In contrast, our work considers a significantly different family of risk measures, namely the coherent risk measures, which are preferable and widely used for risk management (Kisiala, 2015). The analysis in Fei et al. (2021); Fei & Xu (2022) utilizes a technique called the exponentiated Bellman equation, which is uniquely applicable to the entropic risk measure (or, more generally, the exponential utility family) and cannot be readily extended to coherent risk measures. Therefore, our analysis differs significantly from theirs. Tamar et al. (2015) propose an actor-critic algorithm for the entire class of coherent risk measures but do not provide any theoretical analysis of regret.

Safe RL and constrained MDPs represent a parallel approach to obtaining risk-aware policies in the presence of uncertainty. Unlike risk-aware MDPs, safe RL does not modify the optimality criterion; instead, risk-aversion is captured via constraints on the rewards or risks (Chow & Pavone, 2013; Chow et al., 2017), or as chance constraints (Ono et al., 2015; Chow et al., 2017). Compared with risk-aware MDPs, the constrained-MDP approach enjoys less compelling theoretical properties. The existence of a globally optimal Markov policy in constrained MDPs is unknown, and many existing algorithms only return locally optimal Markov policies using gradient-based techniques. This makes these methods extremely susceptible to policy initialization (Chow et al., 2017), and hence the best theoretical result one can obtain in this setting is convergence to a locally optimal policy (Chow et al., 2017). In contrast, our result concerns the regret (or sub-optimality) with respect to the globally optimal policy.

Distributional RL (Bellemare et al., 2022) attempts to model the state-value distribution, and any risk measure can be characterized by such a distribution. Distributional RL therefore represents a more ambitious approach in which the agent needs to estimate the entire value distribution. Existing distributional RL algorithms need to make additional distributional assumptions to work with distributional estimates such as quantiles (Dabney et al., 2018) or empirical distributions (Rowland et al., 2018). As a trade-off, the demand for data and computational resources to estimate the value distribution at every state can be prohibitively expensive for even moderately sized problems. In contrast, our risk-aware RL framework only considers risk measures applied to the random state-value.
We establish more detailed connections between risk-aware RL and distributional RL in Appendix A.

2. COHERENT RISK MEASURES

Let Z ∈ L^1(Ω, F, P) be a real-valued random variable with a finite mean and cumulative distribution function F_Z(z) = P(Z ≤ z). For Z′ ∈ L^1(Ω, F, P), a function ρ : L^1(Ω, F, P) → R ∪ {+∞} is a coherent risk measure if it satisfies the following properties:

1. Normalized: ρ(0) = 0.

2. Monotonic: If P(Z ≤ Z′) = 1, then ρ(Z) ≤ ρ(Z′).

3. Super-additive: ρ(Z + Z′) ≥ ρ(Z) + ρ(Z′).

4. Positively homogeneous: For α ≥ 0, we have ρ(αZ) = αρ(Z).

5. Translation invariant: For a constant random variable A with value a, we have ρ(Z + A) = ρ(Z) + a.

Since our reward-maximization setting contrasts with the cost-minimization setting often considered in the literature, we aim to maximize the risk measure applied to the random reward, i.e., to maximize ρ(Z). Consequently, the properties of the risk measure are upended compared to those usually presented in the cost-minimization setting (Föllmer & Schied, 2010). For example, super-additivity in the reward-maximization setting becomes sub-additivity in the cost-minimization setting.

Empirical estimation of the risk. The risk ρ(Z) of a random variable is completely determined by the distribution F_Z of Z. In practice, we do not know F_Z; instead, we observe m independent and identically distributed (IID) samples {Z_i}_{i=1}^m from F_Z. We can then use these samples to form an empirical estimator of ρ(Z), denoted ρ̂(Z_1, . . . , Z_m).
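As a concrete illustration, here is a minimal sketch of such an empirical estimator for CVaR (the coherent risk measure used later in the experiments), assuming the common convention that CVaR_α of a reward is the mean of its worst α-fraction of outcomes; the paper's exact estimator is the one it defines in its Eq. (14).

```python
import numpy as np

def empirical_cvar(samples, alpha=0.1):
    """Empirical CVaR of a reward sample: the mean of the worst
    ceil(alpha * m) of the m observed outcomes (lower tail, since we
    are in the reward-maximization convention)."""
    z = np.sort(np.asarray(samples, dtype=float))   # ascending: worst first
    k = max(1, int(np.ceil(alpha * len(z))))
    return z[:k].mean()

rng = np.random.default_rng(0)
samples = rng.normal(loc=5.0, scale=1.0, size=10_000)   # IID draws from F_Z
risk = empirical_cvar(samples, alpha=0.1)
print(risk)   # well below the mean of 5.0: the estimator targets the tail
```

As α → 1 the estimator recovers the sample mean, i.e., the risk-neutral criterion.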

3. PROBLEM SETTING

We consider an episodic finite-horizon Markov decision process (MDP), denoted by a tuple M = (S, A, H, P, r), where S and A are the sets of possible states and actions, respectively, H ∈ Z_+ is the episode length, P = {P_h}_{h∈[H]} are the state transition probability measures, and r = {r_h : S × A → [0, 1]}_{h∈[H]} are the deterministic reward functions. We assume S is a measurable space of possibly infinite cardinality and A is a finite set. For each h ∈ [H], P_h(·|x, a) denotes the probability transition kernel when the agent takes action a at state x in time step h. An agent interacts with the MDP as follows. There are T episodes. In the t-th episode, the agent begins at a state x^t_1 chosen arbitrarily by the environment. The standard risk-neutral objective is to find a policy π maximizing the expected total reward,

max_π E_π [ Σ_{h=1}^H r_h(x_h, a_h) ].    (1)

3.1. RISK-AWARE EPISODIC MDP

The risk-neutral objective defined in Eq. (1) does not account for the risk incurred due to the stochasticity of the state transitions and the agent's policy. Markov risk measures (Ruszczyński, 2010) were proposed to model and analyze such risks. The risk-aware MDP objective is defined as max_π J^π(x_1), where

J^π(x_1) := r_1(x_1, a_1) + ρ( r_2(x_2, a_2) + ρ( r_3(x_3, a_3) + . . . ) ),    (2)

where ρ is a coherent one-step conditional risk measure (Ruszczyński, 2010, Definition 6), and {x_1, a_1, x_2, a_2, . . . } is a trajectory of states and actions from the MDP under policy π. Here, J^π is defined as a nested, multi-stage composition of ρ rather than through a single-stage risk measure ρ( Σ_{h=1}^H r_h(x_h, a_h) ) on the cumulative reward. The choice of the risk-aware objective function in Eq. (2) has two advantages. First, it guarantees the existence of an optimal policy, and furthermore, this optimal policy is Markovian; see Theorem 4 in Ruszczyński (2010) for a rigorous treatment. Second, the above risk-aware objective satisfies the time-consistency property, which ensures that we do not contradict ourselves in our risk evaluation: the sequence that is better today should continue to be better tomorrow, i.e., our risk preference stays the same over time. Note that in standard RL, where the risk measure is replaced with expectation, this property is trivially satisfied. In contrast, a single-stage (i.e., static) risk measure ρ( Σ_{h=1}^H r_h(x_h, a_h) ) applied to the cumulative reward does not enjoy this time-consistency property (Ruszczyński, 2010). More detailed discussions are in Appendix B.
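To make the nested composition in Eq. (2) concrete, here is a minimal sketch that evaluates J^π on a hypothetical two-step chain (all rewards and transition probabilities below are illustrative stand-ins, not from the paper), using exact CVaR, taken here as the mean of the worst α-mass of outcomes, as the coherent risk measure ρ:

```python
import numpy as np

def cvar(values, probs, alpha):
    """Exact CVaR_alpha of a discrete distribution: the expected value of
    the worst alpha-mass of outcomes (reward-maximization convention)."""
    order = np.argsort(values)                      # worst outcomes first
    v = np.asarray(values, float)[order]
    p = np.asarray(probs, float)[order]
    remaining, acc = alpha, 0.0
    for vi, pi in zip(v, p):
        take = min(pi, remaining)                   # absorb tail mass
        acc += take * vi
        remaining -= take
        if remaining <= 1e-12:
            break
    return acc / alpha

alpha = 0.5
# Stage 3: from either stage-2 state, reward 0 or 4 with probability 1/2 each
rho3 = cvar([0.0, 4.0], [0.5, 0.5], alpha)          # inner risk-to-go
# Stage 2: states A and B reached with probability 1/2; r2(A)=0, r2(B)=2
inner = [0.0 + rho3, 2.0 + rho3]                    # r2 + rho(r3)
# Stage 1: r1 = 1, then the outer risk measure over the stage-2 states
J = 1.0 + cvar(inner, [0.5, 0.5], alpha)            # Eq. (2), nested
expectation = 1.0 + 0.5 * (0.0 + 2.0) + 0.5 * (0.0 + 4.0)  # risk-neutral
print(J, expectation)
```

With α = 0.5 the nested objective scores this trajectory distribution at 1.0, while the risk-neutral expectation of the same total reward is 4.0: the nested measure penalizes the downside at every stage.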

3.2. BELLMAN EQUATION AND REGRET

The risk-aware Bellman equation is developed for the risk-aware objective defined in Eq. (2) (Ruszczyński, 2010). More specifically, we define the risk-aware state- and action-value functions with respect to the Markov risk measure ρ as

V^π_h(x) = r_h(x, π_h(x)) + ρ( r_{h+1}(x_{h+1}, π_{h+1}(x_{h+1})) + ρ( r_{h+2}(x_{h+2}, π_{h+2}(x_{h+2})) + . . . ) ),
Q^π_h(x, a) = r_h(x, a) + ρ( r_{h+1}(x_{h+1}, π_{h+1}(x_{h+1})) + ρ( r_{h+2}(x_{h+2}, π_{h+2}(x_{h+2})) + . . . ) ).

We also define the optimal policy π⋆ to be the policy that yields the optimal value function V⋆_h(x) = sup_π V^π_h(x). The advantage of the formulation in Eq. (2) is that one can show that the optimal policy exists and is Markovian (Theorem 4 of Ruszczyński (2010)). For notational convenience, for any measurable function V : S → [0, H], we define the operator D^ρ_h as

(D^ρ_h V)(x, a) := ρ( V(x′) ),

where the risk measure ρ is taken over the random next state x′ ∼ P_h(·|x, a). The risk-aware Bellman equation associated with a policy π then takes the form

Q^π_h(x, a) = (r_h + D^ρ_h V^π_{h+1})(x, a),  V^π_h(x) = ⟨Q^π_h(x, ·), π_h(·|x)⟩_A,  V^π_{H+1}(x) = 0,    (3)

where ⟨·, ·⟩_A denotes the inner product over A and (f + g)(x) = f(x) + g(x) for functions f and g. Similarly, the Bellman optimality equation is given by

Q⋆_h(x, a) = (r_h + D^ρ_h V⋆_{h+1})(x, a),  V⋆_h(x) = max_{a∈A} Q⋆_h(x, a),  V⋆_{H+1}(x) = 0.    (4)

The above equation implies that the optimal policy π⋆ is the greedy policy with respect to the optimal action-value functions {Q⋆_h}_{h∈[H]}. In the episodic MDP setting, the agent interacts with the environment over T episodes to learn the optimal policy. At the beginning of episode t, the agent selects a policy π_t, and the environment chooses an initial state x^t_1. The difference between V^{π_t}_1(x^t_1) and V⋆_1(x^t_1) quantifies the sub-optimality of π_t, which serves as the regret of the agent at episode t.
The total regret after T episodes is defined as

R_T(ρ) = Σ_{t=1}^T ( V⋆_1(x^t_1) − V^{π_t}_1(x^t_1) ).

This is the widely adopted notion of regret in both the risk-neutral setting (Jin et al., 2020; Yang et al., 2020) and the risk-aware setting (Fei et al., 2020; 2021). Here, the policy's regret depends on the risk measure ρ via the optimal policy π⋆. A good policy should have sub-linear regret, i.e., lim_{T→∞} R_T/T = 0, which implies that the policy eventually learns to select the best risk-averse actions. Remark 1. R_T(ρ_1) < R_T(ρ_2) for two risk measures ρ_1 and ρ_2 does not imply that ρ_1 is a better choice of risk measure for the given problem. Because the optimal policies for ρ_1 and ρ_2 can be different, their regrets are not directly comparable. Therefore, regret cannot be used to compare or select risk measures.

3.3. WEAK SIMULATOR ASSUMPTION

One key challenge for risk-aware RL policies is that the empirical estimation of risk is more complex than the estimation of the expectation in risk-neutral RL (Yu et al., 2018). In this paper, we assume the existence of a weak simulator that we can use to draw samples from the probability transition kernel P_h(·|x, a) for any h ∈ [H], x ∈ S, a ∈ A. This assumption is much weaker than the archetypal simulator assumptions often seen in the RL literature, which additionally allow querying the reward r_h(x, a) of a given state and action. To the best of our knowledge, all existing works on risk-aware RL with coherent risk measures require some assumption on the transition probabilities to facilitate the risk estimation procedure; among these, our weak simulator assumption is the weakest.

3.4. ESTIMATING NON-LINEAR FUNCTIONS

We use a reproducing kernel Hilbert space (RKHS) as the class of non-linear functions representing the optimal action-value function Q⋆_h. For notational convenience, let us denote z = (x, a) and Z = S × A. Following the standard setting, we assume that Z is a compact subset of R^d for a fixed dimension d. Let H denote the RKHS defined on Z with kernel function k : Z × Z → R, and let ⟨·, ·⟩_H and ∥·∥_H be the inner product and norm on H, respectively. Since H is an RKHS, there exists a feature map ϕ : Z → H such that ϕ(z) = k(z, ·) and f(z) = ⟨ϕ(z), f⟩_H for all f ∈ H and all z ∈ Z; this is known as the reproducing property.
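Any valid kernel induces a positive semi-definite Gram matrix on every finite set of points, which is what makes the kernel-ridge computations in the next section well-posed. A quick numerical check of this property for the squared-exponential (RBF) kernel, on arbitrary illustrative points:

```python
import numpy as np

def rbf_kernel(z1, z2, gamma=1.0):
    """Squared-exponential (RBF) kernel k(z, z') = exp(-gamma * ||z - z'||^2),
    a standard choice for the kernel of the RKHS H."""
    return np.exp(-gamma * np.sum((z1 - z2) ** 2))

rng = np.random.default_rng(1)
Z = rng.uniform(size=(20, 3))          # 20 illustrative points in Z, a subset of R^3
K = np.array([[rbf_kernel(a, b) for b in Z] for a in Z])   # Gram matrix
eigs = np.linalg.eigvalsh(K)           # all eigenvalues should be >= 0
print(eigs.min())                      # PSD up to numerical round-off
```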

4. RISK-AWARE RL ALGORITHM WITH COHERENT RISK MEASURES

We now introduce our algorithm, named Risk-Aware Upper Confidence Bound (RA-UCB), which is built upon the celebrated value iteration algorithm (Sutton & Barto, 2018). RA-UCB first estimates the value function using kernel least-squares regression. It then computes an optimistic bonus that is added to the estimated value function to encourage exploration. Finally, it executes the greedy policy with respect to the estimated value function in the next episode.

RA-UCB: Risk-Aware Upper Confidence Bound
1: Input: hyperparameters of the coherent risk measure ρ (e.g., confidence level α ∈ (0, 1) for CVaR)
2: for episode t = 1, 2, . . . , T do
3:   Receive the initial state x^t_1 and initialize V^t_{H+1} as the zero function.
4:   for step h = H, . . . , 1 do
5:     For each τ ∈ [t − 1], draw m samples from the weak simulator and construct the response vector y^t_h using Eq. (7).
6:     Compute µ^t_h and σ^t_h using Eq. (8).
7:     Compute Q^t_h and V^t_h using Eq. (9).
8:   end for
9:   for step h = 1, . . . , H do
10:    Take action a^t_h ← arg max_{a∈A} Q^t_h(x^t_h, a).
11:    Observe reward r_h(x^t_h, a^t_h) and the next state x^t_{h+1}.
12:  end for
13: end for

Recall that we defined z = (x, a) and Z = S × A in Section 3.4. We define the Gram matrix K^t_h ∈ R^{(t−1)×(t−1)} and a function k^t_h : Z → R^{t−1} associated with the RKHS H as

K^t_h = [k(z^τ_h, z^{τ′}_h)]_{τ,τ′∈[t−1]},  k^t_h(z) = (k(z^1_h, z), . . . , k(z^{t−1}_h, z))^⊤.    (6)

Given the observed histories and the weak simulator, we define the response vector y^t_h ∈ R^{t−1} as

[y^t_h]_τ = r_h(x^τ_h, a^τ_h) + ρ̂({V^t_{h+1}(x′_{(i)})}_{i=1}^m),  τ ∈ [t − 1],    (7)

where {x′_{(i)}}_{i=1}^m are m next states drawn from the weak simulator P_h(·|x^τ_h, a^τ_h). This step contains one of the key differences between RA-UCB and its risk-neutral counterpart: the presence of the empirical risk estimator in the definition of the response vector y^t_h. With the newly introduced notation, we define two functions µ^t_h : Z → R and σ^t_h : Z → R as

µ^t_h(z) = k^t_h(z)^⊤ (K^t_h + λI)^{−1} y^t_h,  σ^t_h(z) = λ^{−1/2} [ k(z, z) − k^t_h(z)^⊤ (K^t_h + λI)^{−1} k^t_h(z) ]^{1/2}.    (8)

The terms µ^t_h and σ^t_h have several important connections to the literature. They resemble the posterior mean and standard deviation of a Gaussian process regression problem (Rasmussen, 2003) with y^t_h as its target, and σ^t_h reduces to the UCB term used in linear bandits when the feature map ϕ is finite-dimensional (Lattimore & Szepesvári, 2020). We then define our estimates of the value functions Q^t_h and V^t_h as follows:

Q^t_h(x, a) := min{ µ^t_h(x, a) + β · σ^t_h(x, a), H − h + 1 },  V^t_h(x) := max_{a∈A} Q^t_h(x, a),    (9)

where β > 0 is an exploration-exploitation trade-off parameter. To gain some insight into the algorithm, notice that Eq. (7) implements the one-step Bellman optimality update in Eq. (4).
To see this, let X′ ∼ P_h(·|x^τ_h, a^τ_h) be the random variable representing the next state. Recall that V^t_{h+1} is the value function estimated by our algorithm at episode t. Thus, V^t_{h+1}(X′) is also a random variable, where the randomness comes from X′. We can then consider ρ(V^t_{h+1}(X′)), i.e., the risk measure ρ applied to the random variable V^t_{h+1}(X′); intuitively, this can be interpreted as the risk-adjusted value of the next state. The second term in Eq. (7), ρ̂({V^t_{h+1}(x′_{(i)})}_{i=1}^m), is an empirical estimate of ρ(V^t_{h+1}(X′)). The choice of the response vector in Eq. (7) represents the primary novelty of our algorithm design: it enables a new regret decomposition and an upper bound via the concentration inequality of the risk estimator. More details are presented in Appendix D.1.
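The computation described by Eqs. (7)-(9) can be sketched as follows: a minimal stand-in with a toy value function, toy reward, and toy weak simulator (none of which come from the paper), and with the empirical CVaR estimator used as a common choice of ρ̂ rather than the paper's exact Eq. (14):

```python
import numpy as np

def rbf(a, b, gamma=5.0):
    return np.exp(-gamma * np.sum((a - b) ** 2))

def empirical_cvar(z, alpha):
    z = np.sort(np.asarray(z, dtype=float))
    k = max(1, int(np.ceil(alpha * len(z))))
    return z[:k].mean()

rng = np.random.default_rng(0)
lam, beta, alpha, m = 1.0, 1.0, 0.25, 100

# Hypothetical history of state-action pairs z_tau, a toy next-step value
# function V_{h+1}, a toy reward, and a stand-in weak simulator -- all
# illustrative placeholders.
Z_hist = rng.uniform(size=(30, 2))
V_next = lambda x: float(np.clip(np.sin(3 * x).sum(), 0.0, 1.0))
simulate = lambda z: z + 0.1 * rng.normal(size=(m, 2))   # weak simulator
reward = lambda z: float(z.sum() / 2)                    # r_h in [0, 1]

# Response vector y^t_h (Eq. 7): reward plus the empirical risk of the
# next-state values, estimated from m simulator draws per history point.
y = np.array([
    reward(z) + empirical_cvar([V_next(xp) for xp in simulate(z)], alpha)
    for z in Z_hist])

# Eq. (8): posterior mean mu, width sigma, and the optimistic Q at a query z
K = np.array([[rbf(a, b) for b in Z_hist] for a in Z_hist])
z_query = np.array([0.5, 0.5])
k_z = np.array([rbf(z, z_query) for z in Z_hist])
A = np.linalg.inv(K + lam * np.eye(len(Z_hist)))
mu = float(k_z @ A @ y)
sigma = float(np.sqrt(max(rbf(z_query, z_query) - k_z @ A @ k_z, 0.0) / lam))
Q = min(mu + beta * sigma, 1.0)   # Eq. (9), clipped at H - h + 1 (toy: 1)
```

One such regression is solved per step h and episode t; the greedy action then maximizes Q over the finite action set.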

4.1. MAIN THEORETICAL RESULTS

This section presents our main theoretical result, i.e., the regret upper bound guarantee of RA-UCB. We first outline the key assumption that enables efficient approximation of the value function.

Assumption 1. Let R > 0 be a fixed constant, H be the RKHS, and B(r) = {f ∈ H : ∥f∥_H ≤ r} be the RKHS-norm ball with radius r. We assume that for any h ∈ [H] and any Q : S × A → [0, H], we have T⋆_h Q ∈ B(RH), where T⋆_h is the Bellman optimality operator defined in Eq. (4).

This assumption postulates that the risk-aware Bellman optimality operator maps any bounded action-value function to a function in the RKHS H with bounded norm. It ensures that for all h ∈ [H], the optimal action-value function Q⋆_h lies inside B(RH); consequently, there is no approximation error when using functions from H to approximate Q⋆_h. It can be viewed as the analogue of the realizability assumption in supervised learning, and similar assumptions are made in Jin et al. (2020).

Given this assumption, it is clear that the complexity of H plays a central role in the regret bound of RA-UCB. Following the seminal work of Srinivas et al. (2009), we characterize the intrinsic complexity of H with the notion of maximum information gain, defined as

Γ_k(T, λ) = (1/2) sup_{D⊆Z, |D|≤T} log det(I + K_D/λ),    (10)

where k is the kernel function, λ > 0 is a parameter, and K_D is the Gram matrix on the set D. The maximum information gain depends on how fast the eigenvalues of H decay to zero and can be viewed as a proxy for the dimension of H when H is infinite-dimensional. Note that Γ_k(T, λ) is a problem-dependent quantity that depends on the kernel k, the state space S, and the action space A. Furthermore, let us define the action-value function class Q_ucb(h, R, B) as

Q_ucb(h, R, B) = { Q : Q(z) = min{ f(z) + β · λ^{−1/2} [ k(z, z) − k_D(z)^⊤ (K_D + λI)^{−1} k_D(z) ]^{1/2}, H − h + 1 }_+, f ∈ H, ∥f∥_H ≤ R, β ∈ [0, B], |D| ≤ T }.    (11)

With appropriate choices of R and B, the set Q_ucb(h, R, B) contains every possible Q^t_h that can be constructed by RA-UCB. The function class Q_ucb therefore resembles the hypothesis space in supervised learning, and, as we will see, its complexity, in particular its covering number, plays a crucial role in the regret bound of RA-UCB.

Theorem 1. Let λ = 1 + 1/T and β = B_T in RA-UCB, and let Γ_k(T, λ) be the maximal information gain defined in Eq. (10). Define a constant B_T > 0 that satisfies B_T = Θ( H( √(Γ_k(T, λ)) + max_{h∈[H]} √(log N_∞(ϵ, h, B_T)) ) ). Suppose that the empirical risk estimator ρ̂ achieves the rate Ξ(m, δ), i.e., P( |ρ(Z) − ρ̂({Z_i}_{i=1}^m)| ≤ Ξ(m, δ) ) ≥ 1 − δ. Then, under Assumption 1, with probability at least 1 − (T²H²)^{−1}, the regret of RA-UCB satisfies

R_T ≤ 5 B_T H √(T Γ_k(T, λ)) + 2TH · Ξ( m, (8T³H³)^{−1} ).

The proof of Theorem 1 is in Appendix D.1. The regret upper bound consists of two terms. The first term resembles the risk-neutral regret bound (Yang et al., 2020, Theorem 4.2). Interestingly, our bound distinguishes itself from the risk-neutral setting through the second term, which quantifies how fast one can estimate the risk from observed samples. It originates from the risk-aware Bellman optimality equation, in which the one-step update requires knowledge of the risk-to-go starting from the next state (see Eq. (4)). This risk-to-go quantity is approximated by its empirical counterpart, and the discrepancy gives rise to the second term of the regret. Thanks to the weak simulator assumption, we have good control over this term. In the following result, we derive the number of samples sufficient to achieve order-optimal regret for Conditional Value-at-Risk (CVaR), one of the most commonly used coherent risk measures; more details on CVaR and its properties are given in Appendix C.1.

Corollary 1. Let ρ be the CVaR measure defined in Eq. (13) and ρ̂ be the CVaR estimator defined in Eq. (14). Then, under the same conditions as in Theorem 1, RA-UCB achieves a regret of R_T = O( B_T H √(T Γ_k(T, λ)) ) with O( TH · log(T⁵H⁶) / (B²_T Γ_k(T, λ)) ) total samples (across all T episodes) from the weak simulator.

The detailed proof of Corollary 1 is in Appendix D.5. As an example, for the commonly used squared-exponential (SE) kernel, we get B_T = O( H · log(TH) · (log T)^d ) (Yang et al., 2020, Corollary 4) and Γ_k(T, λ) = O( (log T)^{d+1} ) (Srinivas et al., 2009), and thus RA-UCB incurs a regret of R_T = Õ( H² √T (log T)^{1.5d+1} ). This is the first sub-linear regret upper bound for a risk-aware RL policy with coherent risk measures.
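The information-gain quantity in Eq. (10) is directly computable for any fixed design D. A small sketch (with arbitrary random points standing in for a design over Z) that evaluates (1/2) log det(I + K_D/λ) for the SE kernel, illustrating its slow growth with the design size:

```python
import numpy as np

def info_gain(K, lam=1.0):
    """(1/2) * log det(I + K/lam) for a Gram matrix K; Gamma_k(T, lam) is
    the supremum of this quantity over all designs D with |D| <= T."""
    n = K.shape[0]
    _, logdet = np.linalg.slogdet(np.eye(n) + K / lam)
    return 0.5 * logdet

def se_gram(Z, gamma=1.0):
    """Gram matrix of the squared-exponential kernel exp(-gamma*||z-z'||^2)."""
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

rng = np.random.default_rng(0)
for T in (10, 100, 1000):
    Z = rng.uniform(size=(T, 2))
    print(T, round(info_gain(se_gram(Z)), 2))   # grows far more slowly than T
```

For the SE kernel the supremum over designs grows only polylogarithmically in T, which is what drives the Õ(√T) regret above.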

5. EXPERIMENTS

In this section, we empirically demonstrate the effectiveness of RA-UCB. We run experiments on synthetic and real-world data with CVaR, a commonly used coherent risk measure, as the risk measure. We analyze the influence of the risk-aversion parameter α (the confidence level for CVaR) on the total reward as well as on the behavior of the output policies. The code for these experiments is available in the supplementary material.

The robot navigation environment is a continuous version of the cliff walking problem considered in Example 6.6 of Sutton & Barto (2018), visualized in Fig. 1. In this synthetic experiment, a robot must navigate a room full of obstacles to reach its goal destination. The robot navigates by choosing from four actions {up, down, left, right}. Since the floor is slippery, the direction of movement is perturbed by r·ϕ, where ϕ ∼ U(−π, π) and r ∈ [0, 1] represent the angle and magnitude of the perturbation. The robot receives a positive reward of 10 for reaching the destination and a negative reward for being close to obstacles; the negative reward grows exponentially as the robot approaches an obstacle. We set the horizon of each episode to H = 30. The robot knows neither the perturbation parameter (r = 0.3) nor the obstacles' positions, so it has to learn them online by interacting with the environment. We approximate the state-action value function using the RBF kernel and the KernelRidge regressor from Scikit-learn.

5.1. SYNTHETIC EXPERIMENT: ROBOT NAVIGATION

Figure 2: Estimated distribution of the cumulative reward when following the learned policy for different risk parameters. For α = 0.9 (leftmost plot), the policy is more risk-tolerant, which makes the average reward higher but occasionally small. As we decrease α, the policy becomes more risk-averse, favoring safer paths with smaller average rewards and higher worst-case rewards.

In Fig. 2, we show the histograms of the cumulative rewards the robot receives over 50 episodes by following the learned policy with different values of the risk parameter α ∈ {0.9, 0.5, 0.1}. For smaller values of α, the learned policy successfully mitigates the tail risk of the distribution: the rightmost histogram has a smallest reward of at least 3.0, whereas the reward can go as low as near 0 for the other two policies. As we increase α, the policy becomes more risk-tolerant, leading to a higher average reward at the expense of occasional bad rewards. In this experiment, we use m = 100 samples from the weak simulator to estimate the risk in Eq. (7).

5.2. REAL-DATA EXPERIMENT: FOREX TRADING

This trading setup is a generalization of the betting game environment (Bäuerle & Ott, 2011; Rigter et al., 2021). The experiment considers a simplified foreign exchange trading environment based on real historical exchange rates and volumes between EUR and USD over the 12 months of 2017. For simplicity, we fix the trade volume for each hour at 10000. There are two actions in the environment: buy or sell. The state of the environment includes the current position, which is either long or short, and a vector of signal features containing the historical prices and trading volumes over a short period of time. We customize this environment based on the ForexEnv in the Python package gym-anytrading. In Fig. 3, we show a histogram of the cumulative terminal wealth achieved by the agents over 100 episodes with different risk parameters, plotted in different colors. Similar to the robot experiment, we observe that for a smaller value of α, the policy is risk-averse and successfully mitigates the tail of the distribution: the worst-case wealth for α = 0.1 (in green) is higher than for α = 0.5 (in red) or α = 0.9 (in blue). In this experiment, we again use m = 100 samples from the weak simulator to estimate the risk in Eq. (7). Additional experiments with other risk measures like VaR and EVaR are given in Appendix E.

Computational complexity of RA-UCB: We need to solve H kernel ridge regression problems in each episode. In the t-th episode, the cost of each regression problem is dominated by two operations: first, the inversion of the Gram matrix K^t_h of size (t − 1) × (t − 1) in Eq. (8), which has O(t³) time complexity and O(t²) space complexity; second, the construction of the response vector in Eq. (7), which has O(mt) time and space complexity. Therefore, the time and space complexities of the t-th episode are O(H(t³ + mt)) and O(H(t² + mt)), respectively.

6. CONCLUSION

We proposed a risk-aware RL algorithm named RA-UCB that uses coherent risk measures and non-linear function approximation. We then provided a finite-sample regret upper bound for RA-UCB and demonstrated its effectiveness in robot navigation and forex trading environments. The performance of the proposed algorithm depends profoundly on the quality of the empirical risk estimator. This paper assumes access to a weak simulator that can sample the next states, thus effectively alleviating the need to estimate the risk from the observed trajectories. Therefore, a potential future direction is to relax or weaken this assumption, allowing risk-aware RL algorithms to be useful in more practical problems. Another interesting direction is to consider episodic MDPs where episodes can have varying-length or even infinite horizons.

7. REPRODUCIBILITY STATEMENT

In this paper, we have dedicated substantial effort to improving the reproducibility and comprehensibility of both our theoretical results and empirical experiments. We formally state and discuss the necessity and implications of our assumptions (see Section 3.3 and the paragraph below Assumption 1) before presenting our theoretical results. We also provide a three-step proof sketch of our main theoretical result; for each step, we present the key ideas and high-level directions and refer the reader to more detailed and complete proofs in the appendices. For the experiments, we provide details of the different experimental settings in Section 5 and include our code in the supplementary material.

A CONNECTIONS TO DISTRIBUTIONAL RL

We first give a brief survey of the distributional RL literature and discuss its connection to risk-sensitive RL. For a policy π in distributional RL, the total reward from state x (or state-action pair (x, a)) and time step h is the sum of rewards collected by an agent that starts at time h in state x (or state-action pair (x, a)) and follows policy π:
$$G^\pi_h(x) := \left[\sum_{h'=h}^{H} r_{h'}(x_{h'}, a_{h'}) \,\Big|\, x_h = x,\ a_{h'} \sim \pi_{h'}(\cdot|x_{h'}),\ x_{h'+1} \sim P_{h'}(\cdot|x_{h'}, a_{h'})\right],$$
$$J^\pi_h(x, a) := \left[\sum_{h'=h}^{H} r_{h'}(x_{h'}, a_{h'}) \,\Big|\, x_h = x,\ a_h = a,\ a_{h'} \sim \pi_{h'}(\cdot|x_{h'}),\ x_{h'+1} \sim P_{h'}(\cdot|x_{h'}, a_{h'})\right].$$
Both $G^\pi_h(x)$ and $J^\pi_h(x, a)$ are random variables whose randomness comes from the transition probability P and the (possibly stochastic) policy π. In a standard MDP, the expected values of these random variables, $\mathbb{E}[G^\pi_h(x)]$ and $\mathbb{E}[J^\pi_h(x, a)]$, are known as the value function and the action-value function, respectively. Distributional RL (Bellemare et al., 2022) is built upon the overarching idea of estimating the value distribution, i.e., the distribution of the random variables $G^\pi_h(x)$ and $J^\pi_h(x, a)$, rather than just their expected values as in standard RL. The most important component of distributional RL is the distributional Bellman equation:
$$J^\pi_h(x, a) \overset{d}{=} r_h(x, a) + J^\pi_{h+1}\Big(x', \arg\max_{a' \in \mathcal{A}} \mathbb{E}[J^\pi_{h+1}(x', a')]\Big), \quad \forall h \in [H],$$
where $\overset{d}{=}$ denotes equality in distribution and the next state $x' \sim P_h(\cdot|x, a)$. Although distributional RL models the entire value distribution, notice that the optimal action is greedy with respect to the expectation of the value at the next state, which resembles standard RL.
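To make the distinction concrete, here is a minimal Monte-Carlo sketch of the value distribution $G^\pi_1(x)$ on a hypothetical two-step chain (the rewards and transition probabilities below are illustrative, not taken from the paper): standard RL would keep only the mean of `returns`, while distributional RL models its entire law.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout():
    # step 1: deterministic reward 1, then the next state is 0 or 1 w.p. 1/2 each
    r1 = 1.0
    x2 = rng.integers(2)
    # step 2: the reward depends on the random next state
    r2 = (0.0, 10.0)[x2]
    return r1 + r2

# samples of the random variable G^pi_1(x): here it takes values 1 or 11
returns = np.array([rollout() for _ in range(20000)])
mean_return = returns.mean()  # the scalar that standard (risk-neutral) RL estimates
```

A risk-sensitive agent would apply a risk measure to `returns` instead of averaging them, which is precisely where the value distribution, and not just its mean, matters.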

B COHERENT ONE-STEP CONDITIONAL RISK MEASURE

Recall the formulation of the risk-aware MDP objective in Eq. (2): $\max_\pi J^\pi(x_1)$, where
$$J^\pi(x_1) := r_1(x_1, a_1) + \rho\big(r_2(x_2, a_2) + \rho\big(r_3(x_3, a_3) + \dots\big)\big),$$
ρ is a coherent one-step conditional risk measure, and $x_1, a_1, x_2, a_2, \dots$ is a trajectory of states and actions from the MDP under policy π. Notice that $J^\pi$ is a nested, multi-stage composition of the conditional risk measure ρ; this is also referred to as the dynamic risk optimization problem (Rigter et al., 2021, who use CVaR as the risk measure). There are two advantages to this formulation. First, one can show that an optimal policy exists and is Markovian (Theorem 4 in Ruszczyński (2010)); therefore, we can use the Bellman update to learn the risk-aware RL policy. When the objective applies a single-stage risk measure to the sum of rewards, i.e., $\rho\big(\sum_{h=1}^H r_h(x_h, a_h)\big)$, the problem is referred to as static risk optimization. In this setting, Bäuerle & Ott (2011) (who also use CVaR as the risk measure) show that an optimal policy exists but is non-Markovian; hence history-dependent policies are required to optimally solve the static risk optimization problem, which is harder than learning a Markovian policy. Second, the above risk-aware objective satisfies the time consistency property. Intuitively, time consistency means that the sequence that is better today should continue to be better tomorrow, i.e., our risk preference stays the same over time. To formally define this concept, let us consider the problem of measuring the risk of sequences. Let (Ω, F, P) denote a probability space with filtration $\mathcal{F}_1 \subset \mathcal{F}_2 \subset \dots \subset \mathcal{F}_T \subset \mathcal{F}$. Let $Z_1, \dots, Z_T$ denote an adapted sequence of random variables (one may view this as a sequence of random rewards for the purposes of this paper). Finally, we define $\mathcal{Z}_t = L_p(\Omega, \mathcal{F}_t, P)$, $p \in [1, \infty)$, and $\mathcal{Z}_{t,T} = \mathcal{Z}_t \times \dots \times \mathcal{Z}_T$.
We define the notions of conditional risk measure and dynamic risk measure as follows.

Definition 1. A conditional risk measure is a mapping $\rho_{t,T} : \mathcal{Z}_{t,T} \to \mathcal{Z}_t$ that satisfies the following monotonicity condition: $\rho_{t,T}(Z) \le \rho_{t,T}(W)$ for all $Z, W \in \mathcal{Z}_{t,T}$ such that $Z \le W$. A sequence of conditional risk measures $\{\rho_{t,T}\}_{t=1,\dots,T}$ is called a dynamic risk measure.

One can view a conditional risk measure as a non-linear extension of the conditional expectation: the randomness of the entire sequence $t, \dots, T$ is reduced to the randomness at time t. This is analogous to the conditional expectation $\mathbb{E}[Z_t + \dots + Z_T \mid Z_t]$, a random variable whose randomness comes from $Z_t$. Intuitively, $\rho_{t,T}(Z_t, \dots, Z_T)$ represents the amount of reward a player is willing to accept in exchange for the sequence of future random rewards $Z_t + \dots + Z_T$. For a risk-neutral player, that amount equals $\mathbb{E}[Z_t + \dots + Z_T \mid Z_t]$. We are now ready to define the notion of time consistency.

Definition 2. A dynamic risk measure $\{\rho_{t,T}\}_{t=1,\dots,T}$ is time-consistent if for all $\tau, \theta \in [T]$ with $\tau < \theta$, the following holds: if $Z_k = W_k$ for all $k = \tau, \dots, \theta - 1$ and $\rho_{\theta,T}(Z_\theta, \dots, Z_T) \le \rho_{\theta,T}(W_\theta, \dots, W_T)$, then $\rho_{\tau,T}(Z_\tau, \dots, Z_T) \le \rho_{\tau,T}(W_\tau, \dots, W_T)$.

The time-consistency property ensures that, given the same rewards, the sequence that is better today is also better tomorrow. Why is time consistency desirable? It guarantees that we do not contradict ourselves in our risk evaluation: if we observe the same realization, the sequence that is better today should continue to be better tomorrow, so our risk preference stays the same over time. Note that this property is trivially satisfied in standard RL, where the risk measure is replaced with the expectation.
In contrast, a single risk measure applied to the cumulative reward, $\rho\big(\sum_{h=1}^H r_h(x_h, a_h)\big)$, does not enjoy this time-consistency property.
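As a concrete illustration of the nested (dynamic) formulation, the sketch below performs one risk-aware Bellman backup $Q(x,a) = r(x,a) + \mathrm{CVaR}_\alpha(V(x'))$ over a known discrete next-state distribution. All numbers are hypothetical; the helper computes the CVaR of a finite distribution as the average of its worst α-probability mass.

```python
import numpy as np

def cvar_discrete(values, probs, alpha):
    """CVaR_alpha of a discrete distribution over rewards (lower-tail average)."""
    order = np.argsort(values)
    v, p = np.asarray(values, float)[order], np.asarray(probs, float)[order]
    mass, total = 0.0, 0.0
    for vi, pi in zip(v, p):
        take = min(pi, alpha - mass)   # take probability mass from the worst outcomes
        total += take * vi
        mass += take
        if mass >= alpha:
            break
    return total / alpha

# one nested backup: immediate reward 1, next-state values 0 or 10 w.p. 1/2 each
next_vals, next_probs = [0.0, 10.0], [0.5, 0.5]
q_risk_neutral = 1.0 + cvar_discrete(next_vals, next_probs, 1.0)   # 1 + 5 = 6
q_risk_averse = 1.0 + cvar_discrete(next_vals, next_probs, 0.5)    # 1 + 0 = 1
```

Repeating this backup from h = H down to h = 1 evaluates the nested objective with a Markovian policy, which is exactly what the dynamic formulation enables; a static risk measure on the total reward would not decompose this way.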

C COHERENT RISK MEASURES C.1 CONDITIONAL VALUE-AT-RISK

Let Z be a finite-mean random variable, i.e., $\mathbb{E}[|Z|] < \infty$, with cumulative distribution function $F_Z(z) = \mathbb{P}(Z \le z)$ (an example to keep in mind is that Z represents the random total reward of a learning agent). The value-at-risk at confidence level α ∈ (0, 1) is defined as
$$\mathrm{VaR}_\alpha(Z) = \min\{z : F_Z(z) \ge \alpha\}.$$
The minimum is attained because the cumulative distribution function $F_Z$ is non-decreasing and right-continuous in z. When $F_Z$ is strictly increasing (and thus bijective), $\mathrm{VaR}_\alpha(Z) = F_Z^{-1}(\alpha)$. The conditional value-at-risk (also known as the average value-at-risk) at confidence level α ∈ (0, 1) is defined as^7
$$\mathrm{CVaR}_\alpha(Z) := \frac{1}{\alpha} \int_0^\alpha \mathrm{VaR}_t(Z)\, dt.$$
If Z is a continuous random variable, then $\mathrm{CVaR}_\alpha(Z) = \mathbb{E}[Z \mid Z \le \mathrm{VaR}_\alpha(Z)]$ (Acerbi & Tasche, 2002). From this expression, $\mathrm{CVaR}_\alpha(Z)$ can be viewed as the average of the worst-case α-fraction of Z. It is easy to see that $\mathrm{CVaR}_1(Z) = \mathbb{E}[Z]$, and as α → 0, $\mathrm{CVaR}_\alpha$ approaches the worst-case (or robust) realization. An important result on CVaR (Rockafellar & Uryasev, 2002, Theorem 10), also known as the fundamental minimization theorem, is that it can be represented as the solution of a convex optimization problem.

Lemma 1. (Keramati et al., 2020; Baudry et al., 2021) Let Z be a finite-mean random variable and let α ∈ (0, 1). Then it holds that
$$\mathrm{CVaR}_\alpha(Z) = \max_{s \in \mathbb{R}} \left\{ s - \frac{1}{\alpha} \mathbb{E}\big[(s - Z)^+\big] \right\} = \max_{s \in \mathbb{R}} \left\{ s + \frac{1}{\alpha} \mathbb{E}\big[(Z - s)^-\big] \right\},$$
where $(x)^+ = \max\{x, 0\}$ denotes the positive part of x, $(x)^- = \min\{x, 0\}$ denotes the negative part of x, and the maximum is attained at $s^* = \mathrm{VaR}_\alpha(Z)$.

Conditional value-at-risk is a prominent risk measure with extensive applications in stochastic optimization (see Rockafellar et al. (2000) for example). By carefully choosing α, CVaR can be tuned to be sensitive to rare events with exceptionally low rewards, making it attractive as a risk measure. CVaR is also known for favorable mathematical properties such as coherence.

Empirical estimation of the risk.
Let $Z_1, \dots, Z_m$ be m i.i.d. samples drawn from the distribution $F_Z$. The empirical estimate of $\mathrm{CVaR}_\alpha(Z)$ is given by
$$\widehat{\mathrm{CVaR}}_\alpha(\{Z_i\}_{i=1}^m) = \max_{s \in \mathbb{R}} \left\{ s + \frac{1}{\alpha m} \sum_{i=1}^m (Z_i - s)^- \right\}.$$

Lemma 2. (Lemma 3 in Yu et al. (2018)) Let $Z_1, \dots, Z_m \sim F_Z$ be m i.i.d. bounded random variables, i.e., $\mathbb{P}[0 \le Z_i \le B] = 1$ for all i. Then we have
$$\mathbb{P}\left[ \mathrm{CVaR}_\alpha(Z) - \widehat{\mathrm{CVaR}}_\alpha(\{Z_i\}_{i=1}^m) \ge \varepsilon \right] \le 2\left(1 + \frac{4}{2 - \alpha}\right) \exp\left( \frac{-m \varepsilon^2 (1 - \alpha)^2}{2(2 - \alpha)^2 B^2} \right).$$
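A sketch of the estimator in the display above (our own illustrative implementation, not the paper's code): since the objective $s \mapsto s + \frac{1}{\alpha m}\sum_i (Z_i - s)^-$ is piecewise-linear and concave with kinks at the samples, the maximum over $s \in \mathbb{R}$ is attained at one of the sample points, so it suffices to scan them.

```python
import numpy as np

def empirical_cvar(samples, alpha):
    """Empirical CVaR of rewards via the variational formula
    max_s { s + (1/(alpha*m)) * sum_i min(Z_i - s, 0) }."""
    z = np.asarray(samples, dtype=float)
    m = len(z)
    # the concave objective attains its max at a sample point, so scan them
    return max(s + np.minimum(z - s, 0.0).sum() / (alpha * m) for s in z)
```

For samples {1, 2, 3, 4} and α = 0.5 this returns 1.5, the average of the worst half, and at α = 1 it recovers the sample mean 2.5, matching $\mathrm{CVaR}_1(Z) = \mathbb{E}[Z]$.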

C.2 COMPARISONS BETWEEN ENTROPIC RISK MEASURE AND CVAR

The closest work to ours is Fei et al. (2021), which considers the risk-aware RL problem in the function approximation setting; however, they use the entropic risk measure. This section discusses some key differences between the entropic risk measure and CVaR.

Entropic risk measure. For a finite-mean random variable Z and a parameter β ≠ 0, the entropic risk measure of Z is defined as
$$\mathrm{ER}_\beta(Z) := \frac{1}{\beta} \log \mathbb{E}\big[e^{\beta Z}\big].$$
The entropic risk measure $\mathrm{ER}_\beta$ is the normalized cumulant-generating function of Z and is concave and additive for independent random variables (Föllmer & Knispel, 2011). Using a Taylor expansion, the entropic risk can be expressed as
$$\mathrm{ER}_\beta(Z) = \mathbb{E}[Z] + \frac{\beta}{2} \mathrm{Var}[Z] + O(\beta^2).$$
From this expression, we observe that β > 0 induces a risk-tolerant objective and β < 0 a risk-averse one. As β → 0, $\mathrm{ER}_\beta(Z)$ tends to the risk-neutral expectation $\mathbb{E}[Z]$. However, the additivity property of the entropic risk measure may not be desirable in many practical scenarios. For example, if $\mathrm{ER}_\beta(X_1 + X_2 + \dots + X_n)$ is the total reward of n i.i.d. random variables, then the reward per random variable is $\mathrm{ER}_\beta(X_1)$, no matter how large n is. Thus, aggregating independent risks does not reflect the principle that 'diversification reduces risk.' In contrast, coherent risk measures like CVaR are super-additive and thus enjoy this property.
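A small numerically stable sketch of the sample version of $\mathrm{ER}_\beta$ (illustrative code, not from the paper) makes the role of β visible: β < 0 is risk-averse (below the mean), β > 0 risk-tolerant (above it), and β → 0 recovers the mean.

```python
import numpy as np

def entropic_risk(samples, beta):
    """Sample entropic risk (1/beta) * log E[exp(beta * Z)], via a stable log-mean-exp."""
    z = np.asarray(samples, dtype=float)
    m = (beta * z).max()
    return (m + np.log(np.mean(np.exp(beta * z - m)))) / beta

z = [0.0, 1.0]                         # sample mean 0.5
risk_averse = entropic_risk(z, -2.0)   # below 0.5
risk_tolerant = entropic_risk(z, 2.0)  # above 0.5
```

Shifting the baseline `m` inside the log-mean-exp avoids overflow of `exp` for large |β|, a standard trick when estimating cumulant-generating functions from samples.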

C.3 ENTROPIC VALUE-AT-RISK

Let (Ω, F, P) be a probability space and let Z be a finite-mean random variable, i.e., $\mathbb{E}[|Z|] < \infty$, whose moment-generating function $M_Z(z)$ exists for all z ≥ 0. The entropic value-at-risk (EVaR) (Ahmadi-Javid, 2011; 2012) at confidence level 1 − α is defined as
$$\mathrm{EVaR}_{1-\alpha}(Z) := \inf_{z > 0} \left\{ z^{-1} \ln\left( \frac{M_Z(z)}{\alpha} \right) \right\}.$$
The EVaR admits the dual representation
$$\mathrm{EVaR}_{1-\alpha}(Z) = \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q[Z], \quad \text{where } \mathcal{Q} = \{Q \ll P : d_{\mathrm{KL}}(Q \,\|\, P) \le -\ln \alpha\},$$
and $d_{\mathrm{KL}}$ denotes the Kullback-Leibler (KL) divergence of Q with respect to P. This dual representation also reveals the reason behind the EVaR's name.
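The infimum over z > 0 can be approximated on samples by a simple grid search over the dual variable. This is a hedged sketch using the definition exactly as stated above, with the moment-generating function replaced by its empirical average; the grid range is our own choice.

```python
import numpy as np

def empirical_evar(samples, alpha):
    """Approximate EVaR_{1-alpha}(Z) = inf_{z>0} z^{-1} * ln(M_Z(z) / alpha)
    with an empirical moment-generating function and a log-spaced grid over z."""
    x = np.asarray(samples, dtype=float)
    best = np.inf
    for z in np.logspace(-3, 3, 200):
        m = (z * x).max()
        log_mgf = m + np.log(np.mean(np.exp(z * x - m)))  # stable log E[e^{zZ}]
        best = min(best, (log_mgf - np.log(alpha)) / z)
    return best
```

For a degenerate Z ≡ c the objective is $c - \ln(\alpha)/z$, so the infimum approaches c as z grows; the dual representation (taking Q = P) likewise guarantees $\mathrm{EVaR}_{1-\alpha}(Z) \ge \mathbb{E}[Z]$ in the convention stated above.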

C.4 G-ENTROPIC RISK MEASURE

Inspired by the dual representation of EVaR, Ahmadi-Javid (2012) proposes a large class of information-theoretic coherent risk measures called g-entropic risk measures, which contains both CVaR and EVaR. Let g be a convex proper function with g(1) = 0 and let β ≥ 0. The generalized relative entropy of Q with respect to P, denoted $H_g(P, Q)$, is an information-type pseudo-distance (also called a divergence measure) from Q to P:
$$H_g(P, Q) := \int g\left( \frac{dQ}{dP} \right) dP.$$
This quantity is an important divergence measure, initially mentioned in Ali & Silvey (1966); Csiszár (1967), and discussed in more detail in Liese & Vajda (2006); Ullah (1996). For $g(z) = z \ln z$, we obtain the Kullback-Leibler divergence from Q to P (Kullback & Leibler, 1951). Let Z be a finite-mean random variable. Then the g-entropic risk measure with divergence level β is defined as
$$\mathrm{ER}_{g,\beta}(Z) := \sup_{Q \in \mathcal{Q}} \mathbb{E}_Q[Z], \quad \text{where } \mathcal{Q} = \{Q \ll P : H_g(P, Q) \le \beta\}.$$
One can show that CVaR and EVaR are special cases of the g-entropic risk measure, with proper choices of g and β. For a more comprehensive discussion of the properties of the g-entropic risk measure, please refer to Section 5 of Ahmadi-Javid (2012).

C.5 OTHER COHERENT RISK MEASURES

Risk measures like Tail value-at-risk, Proportional Hazard (PH) risk measure, Wang risk measure, and Superhedging price also belong to the family of coherent risk measures. These risk measures have many important applications. For example, Proportional Hazard (PH) risk measure is widely used in healthcare domains such as clinical trials (Rulli et al., 2018) or epidemiology (Moolgavkar et al., 2018) . Wang risk measure and Superhedging price are commonly used in financial applications such as asset pricing (Wang, 2000) or portfolio optimization (Löhne & Rudloff, 2014) . In the RL context, CVaR is the most well-known and commonly used risk measure among all coherent risk measures and is relatively well-studied in the literature (Bäuerle & Ott, 2011; Yu et al., 2018) . Some very recent works (Ni & Lai, 2022a; b) have started investigating the use of EVaR in RL. Unlike our work, the techniques used by Ni & Lai (2022a; b) exploit properties that are unique to EVaR and thus cannot be generalized to others in the family of coherent risk measures.

D PROOFS

Before presenting the proofs of the supporting lemmas, we review the properties of coherent risk measures and derive the necessary results needed in our proofs. A coherent risk measure (Föllmer & Schied, 2010) is defined as follows.

Definition 3. Let X, Y be two random variables. A mapping ρ is called a coherent risk measure if ρ satisfies the following conditions for all X, Y:
• Monotonicity: if X ≤ Y a.s., then ρ(X) ≤ ρ(Y).
• Translation invariance: if m ∈ R, then ρ(X + m) = ρ(X) + m.
• Positive homogeneity: if α > 0, then ρ(αX) = αρ(X).
• Super-additivity: ρ(X + Y) ≥ ρ(X) + ρ(Y).

We highlight that, in our paper, we consider maximizing the risk of the rewards, in direct contrast to papers that consider minimizing the risk of costs. Therefore, the inequalities above are reversed compared to the properties presented in Föllmer & Schied (2010). The following result presents a simple inequality that we use throughout this section.

Lemma 3. For any two state-action value functions $f_1, f_2 : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, we have
$$(D^\rho_h f_1)(x, a) - (D^\rho_h f_2)(x, a) \le -\Big(D^\rho_h\big(-(f_1 - f_2)\big)\Big)(x, a).$$

Proof. The super-additivity of ρ implies that $\rho(X) + \rho(Y - X) \le \rho(Y)$, or equivalently,
$$\rho(X) - \rho(Y) \le -\rho(Y - X). \quad (15)$$
We emphasize that the statement $\rho(X) - \rho(Y) \le \rho(X - Y)$ would be incorrect, since ρ is only positively homogeneous. Now let $x' \sim P_h(\cdot|x, a)$ be the random next state drawn from the transition kernel $P_h$. By the definition of $D^\rho_h$ in Eq. (3), we have
$$(D^\rho_h f_1)(x, a) - (D^\rho_h f_2)(x, a) = \rho\big(f_1(x')\big) - \rho\big(f_2(x')\big) \le -\rho\big(-(f_1(x') - f_2(x'))\big) = -\Big(D^\rho_h\big(-(f_1 - f_2)\big)\Big)(x, a),$$
where the first and last equalities follow from the definition of $D^\rho_h$ in Eq. (3), and the inequality is due to Eq. (15). This concludes the proof.
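The four axioms (in the reward-maximization orientation above) can be sanity-checked numerically. The sketch below uses the lower-tail average as an empirical CVaR and verifies translation invariance, positive homogeneity, monotonicity, and super-additivity on random samples; the data and the choice α = 0.2 are illustrative.

```python
import numpy as np

def rho(z, alpha=0.2):
    """Empirical CVaR of rewards: mean of the worst alpha-fraction of samples."""
    z = np.asarray(z, dtype=float)
    k = int(alpha * len(z))          # assumes alpha * len(z) is an integer
    return np.sort(z)[:k].mean()

rng = np.random.default_rng(0)
X, Y = rng.normal(size=1000), rng.normal(size=1000)

translation = np.isclose(rho(X + 3.0), rho(X) + 3.0)   # rho(X + m) = rho(X) + m
homogeneity = np.isclose(rho(2.0 * X), 2.0 * rho(X))   # rho(aX) = a * rho(X), a > 0
monotone = rho(X) <= rho(X + np.abs(Y))                # X <= X + |Y| pointwise
super_additive = rho(X + Y) >= rho(X) + rho(Y)         # diversification helps
```

Super-additivity is the key axiom behind Lemma 3: it is what yields $\rho(X) - \rho(Y) \le -\rho(Y - X)$.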

D.1 PROOF OF THEOREM 1

We first introduce some notation to simplify the presentation of the proof. We define the temporal-difference (TD) error as
$$\delta^t_h(x, a) = (r_h + D^\rho_h V^t_{h+1})(x, a) - Q^t_h(x, a), \quad \forall (x, a) \in \mathcal{S} \times \mathcal{A}.$$
For a trajectory $\{(x^t_h, a^t_h)\}_{h \in [H]}$, we further define the two quantities
$$\zeta^1_{t,h} = \big(V^t_h(x^t_h) - V^{\pi_t}_h(x^t_h)\big) - \big(Q^t_h(x^t_h, a^t_h) - Q^{\pi_t}_h(x^t_h, a^t_h)\big),$$
$$\zeta^2_{t,h} = \big((D^\rho_h V^t_{h+1})(x^t_h, a^t_h) - (D^\rho_h V^{\pi_t}_{h+1})(x^t_h, a^t_h)\big) - \big(V^t_{h+1}(x^t_{h+1}) - V^{\pi_t}_{h+1}(x^t_{h+1})\big).$$
The random variables $\zeta^1_{t,h}$ and $\zeta^2_{t,h}$ capture the deviations of the value function due to the two sources of randomness in the MDP: choosing the action $a^t_h \sim \pi^t_h(\cdot|x^t_h)$ and drawing the next state $x^t_{h+1} \sim P_h(\cdot|x^t_h, a^t_h)$. We establish the upper bound in the following steps.

Step 1: Decomposition of the regret.

Lemma 4. We can upper bound the regret as
$$R(T) \le \underbrace{-\sum_{t=1}^T \sum_{h=1}^H \left( \prod_{i=1}^{h-1} J^{\pi^\star}_i D^\rho_i \right) J^{\pi^\star}_h(-\delta^t_h)(x^t_1) - \sum_{t=1}^T \sum_{h=1}^H \delta^t_h(x^t_h, a^t_h)}_{\text{Term I}} + \underbrace{\sum_{t=1}^T \sum_{h=1}^H (\zeta^1_{t,h} + \zeta^2_{t,h})}_{\text{Term II}},$$
where $\delta^t_h$, $\zeta^1_{t,h}$, and $\zeta^2_{t,h}$ are defined above.

Proof sketch. We decompose the instantaneous regret at the t-th episode into
$$V^\star_1(x^t_1) - V^{\pi_t}_1(x^t_1) = \big(V^\star_1(x^t_1) - V^t_1(x^t_1)\big) + \big(V^t_1(x^t_1) - V^{\pi_t}_1(x^t_1)\big).$$
To upper bound the first term, we establish an inequality of the form $V^\star_h - V^t_h \le f(V^\star_{h+1} - V^t_{h+1})$ for some function f and apply it recursively. This inequality is established using the Bellman equation and the super-additivity property of CVaR. Similar techniques apply to the second term. The detailed proof is given in Appendix D.2.

Step 2: Upper bounding Term I.

Lemma 5. Let λ = 1 + 1/T and β = $B_T$ in Algorithm RA-UCB. Then under Assumption 1, with probability at least $1 - (2T^2H^2)^{-1}$, we have for all t ∈ [T], h ∈ [H], x ∈ S, and a ∈ A:
$$-2\beta b^t_h(x, a) \le \delta^t_h(x, a) \le 0.$$
The proof of Lemma 5 is in Appendix D.3.
By Lemma 5, $\delta^t_h$ is a non-positive function, so we can upper bound the first sum in Term I by 0. We obtain that, with probability at least $1 - (2T^2H^2)^{-1}$,
$$\text{Term I} \le -\sum_{t=1}^T \sum_{h=1}^H \delta^t_h(x^t_h, a^t_h) \le 2\beta \sum_{t=1}^T \sum_{h=1}^H b^t_h(x^t_h, a^t_h),$$
which is an upper bound on the sum of the bonus terms. Recall that we can rewrite the bonus term as $b^t_h(x^t_h, a^t_h) = \big[\phi(x^t_h, a^t_h)^\top (\Lambda^t_h)^{-1} \phi(x^t_h, a^t_h)\big]^{1/2}$, where $\Lambda^t_h = \sum_{\tau=1}^{t-1} \phi(x^\tau_h, a^\tau_h) \phi(x^\tau_h, a^\tau_h)^\top + \lambda I_{\mathcal{H}}$ and $I_{\mathcal{H}}$ is the identity operator on $\mathcal{H}$. Then,
$$\text{Term I} \le 2\beta \sqrt{T} \sum_{h=1}^H \left[ \sum_{t=1}^T \phi(x^t_h, a^t_h)^\top (\Lambda^t_h)^{-1} \phi(x^t_h, a^t_h) \right]^{1/2} \le 2\beta \sqrt{T} \sum_{h=1}^H \big[ 2\log\det(I + K^T_h/\lambda) \big]^{1/2} = 4\beta H \sqrt{T \cdot \Gamma_k(T, \lambda)},$$
where $\Gamma_k(T, \lambda)$ is the maximal information gain defined in Eq. (10).

Step 3: Upper bounding Term II. Applying Lemma 6 with $\delta = (2T^2H^2)^{-1}$ gives
$$\text{Term II} \le \sqrt{16TH^3 \log(4T^2H^2)} + 2TH \cdot \Xi\big(m, (8T^3H^3)^{-1}\big).$$
Combining the above results, with probability at least $1 - (T^2H^2)^{-1}$, the regret is bounded by
$$R(T) \le 4\beta H \sqrt{T\, \Gamma_k(T, \lambda)} + \sqrt{16TH^3 \log(4T^2H^2)} + 2TH \cdot \Xi\big(m, (8T^3H^3)^{-1}\big) \le 5\beta H \sqrt{T\, \Gamma_k(T, \lambda)} + 2TH \cdot \Xi\big(m, (8T^3H^3)^{-1}\big).$$
Substituting $\beta = B_T$ completes the proof of Theorem 1.

D.2 PROOF OF LEMMA 4

We decompose the instantaneous regret at the t-th episode into
$$V^\star_1(x^t_1) - V^{\pi_t}_1(x^t_1) = \underbrace{V^\star_1(x^t_1) - V^t_1(x^t_1)}_{\text{(A)}} + \underbrace{V^t_1(x^t_1) - V^{\pi_t}_1(x^t_1)}_{\text{(B)}}.$$
We proceed to upper bound the two terms separately. For ease of presentation, we first define the operator $J^\pi$, acting on functions $f : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$, which maps a state-action value function to the state value function obtained by following policy π:
$$(J^\pi f)(x) = \langle f(x, \cdot), \pi(\cdot|x) \rangle_{\mathcal{A}}.$$
Since the domain of $f(x, \cdot)$ and $\pi(\cdot|x)$ is the finite set $\mathcal{A}$, the inner product above can be interpreted as an inner product between two Euclidean vectors.

Term (A). By the definition of the optimal value function, we have $V^\star_h(x) = \langle Q^\star_h(x, \cdot), \pi^\star_h(\cdot|x) \rangle_{\mathcal{A}}$ for all x ∈ S. Similarly, by the definition of $V^t_h$, we get $V^t_h(x) = \langle Q^t_h(x, \cdot), \pi^t_h(\cdot|x) \rangle_{\mathcal{A}}$ for all x ∈ S. Thus, for any t ∈ [T], h ∈ [H], x ∈ S, we have
$$V^\star_h(x) - V^t_h(x) = \langle Q^\star_h(x, \cdot), \pi^\star_h(\cdot|x) \rangle_{\mathcal{A}} - \langle Q^t_h(x, \cdot), \pi^t_h(\cdot|x) \rangle_{\mathcal{A}} = \langle Q^\star_h(x, \cdot) - Q^t_h(x, \cdot), \pi^\star_h(\cdot|x) \rangle_{\mathcal{A}} + \langle Q^t_h(x, \cdot), \pi^\star_h(\cdot|x) - \pi^t_h(\cdot|x) \rangle_{\mathcal{A}}.$$
Since $\pi^t_h$ is the greedy policy with respect to $Q^t_h$, we have
$$\langle Q^t_h(x_h, \cdot), \pi^\star_h(\cdot|x_h) - \pi^t_h(\cdot|x_h) \rangle_{\mathcal{A}} = \langle Q^t_h(x_h, \cdot), \pi^\star_h(\cdot|x_h) \rangle_{\mathcal{A}} - \max_{a \in \mathcal{A}} Q^t_h(x_h, a) \le 0$$
for all $x_h \in \mathcal{S}$. As a result, we can upper bound the second term by 0 and obtain
$$V^\star_h(x) - V^t_h(x) \le \langle Q^\star_h(x, \cdot) - Q^t_h(x, \cdot), \pi^\star_h(\cdot|x) \rangle_{\mathcal{A}} = J^{\pi^\star}_h(Q^\star_h - Q^t_h)(x).$$
From the Bellman optimality equation and the definition of the temporal-difference error, we get
$$Q^\star_h - Q^t_h = (r_h + D^\rho_h V^\star_{h+1}) - (r_h + D^\rho_h V^t_{h+1} - \delta^t_h) = D^\rho_h V^\star_{h+1} - D^\rho_h V^t_{h+1} + \delta^t_h \le -D^\rho_h\big(-(V^\star_{h+1} - V^t_{h+1})\big) + \delta^t_h,$$
where the last inequality follows from Lemma 3. Substituting this into the previous derivation gives
$$V^\star_h(x) - V^t_h(x) \le -J^{\pi^\star}_h D^\rho_h\big(-(V^\star_{h+1} - V^t_{h+1})\big)(x) - J^{\pi^\star}_h(-\delta^t_h)(x). \quad (18)$$
Eq. (18) is a recursive relation between $V^\star_h - V^t_h$ and $V^\star_{h+1} - V^t_{h+1}$. Recursively applying Eq. (18) for all h ∈ [H] gives
$$V^\star_1 - V^t_1 \le -J^{\pi^\star}_1 D^\rho_1\big(-(V^\star_2 - V^t_2)\big) - J^{\pi^\star}_1(-\delta^t_1) \le \dots \le -\sum_{h=1}^H \left( \prod_{i=1}^{h-1} J^{\pi^\star}_i D^\rho_i \right) J^{\pi^\star}_h(-\delta^t_h), \quad (19)$$
where the final equality in the recursion uses the fact that $V^\star_{H+1}(x) = V^t_{H+1}(x) = 0$ for all x ∈ S.

Term (B). By the definitions of $\delta^t_h$, $\zeta^1_{t,h}$, and $\zeta^2_{t,h}$ in Eq. (16) and Eq. (17), adding and subtracting $\delta^t_h(x^t_h, a^t_h)$ and regrouping the terms yields
$$V^t_h(x^t_h) - V^{\pi_t}_h(x^t_h) = \big(V^t_{h+1} - V^{\pi_t}_{h+1}\big)(x^t_{h+1}) + \zeta^1_{t,h} + \zeta^2_{t,h} - \delta^t_h(x^t_h, a^t_h).$$
Recursively applying this relation over h = 1, ..., H and using $V^t_{H+1}(x_{H+1}) = V^{\pi_t}_{H+1}(x_{H+1}) = 0$ gives
$$V^t_1(x^t_1) - V^{\pi_t}_1(x^t_1) = \sum_{h=1}^H (\zeta^1_{t,h} + \zeta^2_{t,h}) - \sum_{h=1}^H \delta^t_h(x^t_h, a^t_h). \quad (20)$$
Combining Eq. (19) and Eq. (20) gives
$$R_T = \sum_{t=1}^T \big[ V^\star_1(x^t_1) - V^{\pi_t}_1(x^t_1) \big] \le \sum_{t=1}^T \left[ -\sum_{h=1}^H \left( \prod_{i=1}^{h-1} J^{\pi^\star}_i D^\rho_i \right) J^{\pi^\star}_h(-\delta^t_h) + \sum_{h=1}^H (\zeta^1_{t,h} + \zeta^2_{t,h}) - \sum_{h=1}^H \delta^t_h(x^t_h, a^t_h) \right],$$
which concludes the proof of this lemma.

D.3 PROOF OF LEMMA 5

Let $\phi : \mathcal{Z} \to \mathcal{H}$ denote the feature representation induced by the kernel k, i.e., $k(z, z') = \langle \phi(z), \phi(z') \rangle_{\mathcal{H}}$. For ease of presentation, we view φ(z) as a vector and write $\phi(z)^\top \phi(z') = \langle \phi(z), \phi(z') \rangle_{\mathcal{H}}$ for the inner product. The kernel regression problem in Eq. (9) becomes
$$\hat\theta \leftarrow \arg\min_{\theta \in \mathcal{H}} L(\theta) = \sum_{\tau=1}^{t-1} \Big[ r_h(x^\tau_h, a^\tau_h) + \hat\rho\big(V^t_{h+1}(\{x'_{(i)}\}_{i=1}^m)\big) - \theta^\top \phi(x^\tau_h, a^\tau_h) \Big]^2 + \lambda \|\theta\|^2_{\mathcal{H}}. \quad (21)$$
We define the feature matrix $\Phi^t_h : \mathcal{H} \to \mathbb{R}^{t-1}$ and the covariance operator $\Lambda^t_h : \mathcal{H} \to \mathcal{H}$ as
$$\Phi^t_h = \big[ \phi(z^1_h)^\top, \dots, \phi(z^{t-1}_h)^\top \big]^\top \quad \text{and} \quad \Lambda^t_h = \sum_{\tau=1}^{t-1} \phi(z^\tau_h) \phi(z^\tau_h)^\top + \lambda I_{\mathcal{H}} = (\Phi^t_h)^\top \Phi^t_h + \lambda I_{\mathcal{H}},$$
where $I_{\mathcal{H}}$ is the identity mapping on $\mathcal{H}$. The Gram matrix $K^t_h$ in Eq. (6) can be expressed as $K^t_h = \Phi^t_h (\Phi^t_h)^\top$, and $k^t_h(z) = \Phi^t_h \phi(z)$. With these definitions, we can rewrite Eq. (21) as
$$\min_{\theta \in \mathcal{H}} L(\theta) = \|y^t_h - \Phi^t_h \theta\|^2_2 + \lambda \theta^\top \theta.$$
The solution to the optimization problem above is $\hat\theta^t_h = (\Lambda^t_h)^{-1} (\Phi^t_h)^\top y^t_h$; as a result, $Q^t_h$ in Eq. (9) can be expressed as $Q^t_h(z) = \phi(z)^\top \hat\theta^t_h$. In the rest of this section, to further simplify the notation, we write Φ for $\Phi^t_h$ when the context is clear. Since $(\Phi\Phi^\top + \lambda I)$ and $(\Phi^\top\Phi + \lambda I_{\mathcal{H}})$ are positive definite, and thus invertible, and $\Phi^\top(\Phi\Phi^\top + \lambda I) = (\Phi^\top\Phi + \lambda I_{\mathcal{H}})\Phi^\top$, we have
$$(\Lambda^t_h)^{-1} \Phi^\top = (\Phi^\top\Phi + \lambda I_{\mathcal{H}})^{-1} \Phi^\top = \Phi^\top (\Phi\Phi^\top + \lambda I)^{-1} = \Phi^\top (K^t_h + \lambda I)^{-1}.$$
Consequently, we can write $\hat\theta^t_h = (\Lambda^t_h)^{-1} \Phi^\top y^t_h = \Phi^\top (K^t_h + \lambda I)^{-1} y^t_h$. In the sequel, we bound the temporal-difference error $\delta^t_h$ defined in Eq. (16). Since $V^t_h(x) = \max_a Q^t_h(x, a)$, we have
$$\delta^t_h = r_h + D^\rho_h V^t_{h+1} - Q^t_h = T^\star_h Q^t_{h+1} - Q^t_h,$$
and we let $\theta^t_h \in \mathcal{H}$ be such that $(T^\star_h Q^t_{h+1})(z) = \phi(z)^\top \theta^t_h$ for all $z \in \mathcal{Z}$. We can write φ(z) as
$$\phi(z) = (\Lambda^t_h)^{-1} \Lambda^t_h \phi(z) = (\Lambda^t_h)^{-1} (\Phi^\top\Phi + \lambda I_{\mathcal{H}}) \phi(z) = \Phi^\top (K^t_h + \lambda I)^{-1} k^t_h(z) + \lambda (\Lambda^t_h)^{-1} \phi(z).$$
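The operator identity $(\Lambda^t_h)^{-1}\Phi^\top = \Phi^\top(K^t_h + \lambda I)^{-1}$ used above (sometimes called the push-through identity) can be checked numerically in finite dimensions; the matrix sizes below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
Phi = rng.normal(size=(5, 3))        # 5 past observations, 3-dimensional features
lam = 0.7

Lam = Phi.T @ Phi + lam * np.eye(3)  # covariance form (feature space)
K = Phi @ Phi.T                      # Gram form (sample space)

lhs = np.linalg.inv(Lam) @ Phi.T                  # (Lambda)^{-1} Phi^T
rhs = Phi.T @ np.linalg.inv(K + lam * np.eye(5))  # Phi^T (K + lam I)^{-1}
identity_holds = np.allclose(lhs, rhs)
```

This identity is what lets the algorithm work with the $(t-1) \times (t-1)$ Gram matrix instead of the possibly infinite-dimensional covariance operator.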
Using the above, we can write $\phi(z)^\top \theta^t_h$ as
$$\phi(z)^\top \theta^t_h = k^t_h(z)^\top (K^t_h + \lambda I)^{-1} \Phi \theta^t_h + \lambda \phi(z)^\top (\Lambda^t_h)^{-1} \theta^t_h.$$
We then have
$$\phi(z)^\top \hat\theta^t_h - \phi(z)^\top \theta^t_h = \underbrace{k^t_h(z)^\top (K^t_h + \lambda I)^{-1} (y^t_h - \Phi \theta^t_h)}_{\text{(i)}} - \underbrace{\lambda \phi(z)^\top (\Lambda^t_h)^{-1} \theta^t_h}_{\text{(ii)}}. \quad (22)$$
We proceed by bounding Terms (i) and (ii) separately. For Term (ii), by the Cauchy-Schwarz inequality,
$$|\text{Term (ii)}| = \big|\lambda \phi(z)^\top (\Lambda^t_h)^{-1} \theta^t_h\big| \le \|\lambda (\Lambda^t_h)^{-1} \phi(z)\|_{\mathcal{H}} \cdot \|\theta^t_h\|_{\mathcal{H}} \le RH \|\lambda (\Lambda^t_h)^{-1} \phi(z)\|_{\mathcal{H}} = RH \sqrt{\lambda \phi(z)^\top (\Lambda^t_h)^{-1} \cdot \lambda I_{\mathcal{H}} \cdot (\Lambda^t_h)^{-1} \phi(z)} \le RH \sqrt{\lambda \phi(z)^\top (\Lambda^t_h)^{-1} \Lambda^t_h (\Lambda^t_h)^{-1} \phi(z)} = \sqrt{\lambda} RH \cdot b^t_h(z), \quad (23)$$
where the second-to-last inequality follows from the fact that $\Lambda^t_h - \lambda I_{\mathcal{H}}$ is a positive-semidefinite operator, which implies $f^\top (\Lambda^t_h - \lambda I_{\mathcal{H}}) f \ge 0$ for any $f \in \mathcal{H}$. We bound Term (i) in the rest of this section. For $\tau \in [t-1]$, the τ-th entry of the vector $y^t_h - \Phi \theta^t_h$ can be expressed as
$$[y^t_h]_\tau - [\Phi \theta^t_h]_\tau = r_h(x^\tau_h, a^\tau_h) + \hat\rho\big(\{V^t_{h+1}(x'_{(i)})\}_{i=1}^m\big) - (T^\star_h Q^t_{h+1})(x^\tau_h, a^\tau_h) = \hat\rho\big(\{V^t_{h+1}(x'_{(i)})\}_{i=1}^m\big) - (D^\rho_h V^t_{h+1})(x^\tau_h, a^\tau_h).$$
Combining the above results, we have
$$|\text{Term (i)}| = \left| \phi(z)^\top (\Lambda^t_h)^{-1} \sum_{\tau=1}^{t-1} \phi(x^\tau_h, a^\tau_h) \Big[ \hat\rho\big(\{V^t_{h+1}(x'_{(i)})\}_{i=1}^m\big) - (D^\rho_h V^t_{h+1})(x^\tau_h, a^\tau_h) \Big] \right| \le \|\phi(z)\|_{(\Lambda^t_h)^{-1}} \cdot \left\| \sum_{\tau=1}^{t-1} \phi(x^\tau_h, a^\tau_h) \Big[ \hat\rho\big(\{V^t_{h+1}(x'_{(i)})\}_{i=1}^m\big) - (D^\rho_h V^t_{h+1})(x^\tau_h, a^\tau_h) \Big] \right\|_{(\Lambda^t_h)^{-1}},$$
where the inequality follows from the Cauchy-Schwarz inequality. To bound the RKHS norm in the second factor, we apply techniques similar to Yang et al. (2020), combining concentration of self-normalized processes with uniform convergence over the function class that contains $V^t_{h+1}$.
To achieve this, let us first define the action-value function class
$$\mathcal{Q}_{\mathrm{ucb}}(h, R, B) = \Big\{ Q : Q(z) = \min\big\{ Q_0(z) + \beta \cdot \lambda^{1/2} \big[ k(z, z) - k_{\mathcal{D}}(z)^\top (K_{\mathcal{D}} + \lambda I)^{-1} k_{\mathcal{D}}(z) \big]^{1/2},\ H - h + 1 \big\}^+,\ \|Q_0\|_{\mathcal{H}} \le R,\ \beta \in [0, B],\ |\mathcal{D}| \le T \Big\},$$
and the corresponding state-value function class
$$\mathcal{V}_{\mathrm{ucb}}(h, R, B) = \Big\{ V : V(x) = \max_{a \in \mathcal{A}} Q(x, a) \text{ for some } Q \in \mathcal{Q}_{\mathrm{ucb}}(h, R, B) \Big\}.$$
By (Yang et al., 2020, Lemma C.1), if we set $R_T = 2H\Gamma_k(T, \lambda)$, then for all t ∈ [T], h ∈ [H], the function $V^t_h$ defined in Eq. (9) satisfies $V^t_h \in \mathcal{V}_{\mathrm{ucb}}(h, R_T, B_T)$, where $B_T$ is defined in Theorem 1. We now bound Term (i) by a covering-number argument over the function classes $\mathcal{V}_{\mathrm{ucb}}(h, R_T, B_T)$ for h ∈ [H]. For any two state-value functions $V, V' : \mathcal{S} \to \mathbb{R}$, we consider the maximum metric (also known as the Chebyshev metric) $d(V, V') = \sup_{x \in \mathcal{S}} |V(x) - V'(x)|$. For ε, B > 0, let $N_d(\varepsilon, h, B)$ be the ε-covering number of $\mathcal{V}_{\mathrm{ucb}}(h, R_T, B)$ with respect to the metric d, and $N_\infty(\varepsilon, h, B)$ the ε-covering number of $\mathcal{Q}_{\mathrm{ucb}}(h, R_T, B)$ with respect to the maximum metric. Applying (Yang et al., 2020, Lemma E.2) with $\delta = (2T^2H^3)^{-1}$ and taking a union bound over h ∈ [H] gives
$$\left\| \sum_{\tau=1}^{t-1} \phi(x^\tau_h, a^\tau_h) \Big[ \hat\rho\big(\{V^t_{h+1}(x'_{(i)})\}_{i=1}^m\big) - (D^\rho_h V^t_{h+1})(x^\tau_h, a^\tau_h) \Big] \right\|^2_{(\Lambda^t_h)^{-1}} \le \sup_{V \in \mathcal{V}_{\mathrm{ucb}}(h+1, R_T, B_T)} \left\| \sum_{\tau=1}^{t-1} \phi(x^\tau_h, a^\tau_h) \Big[ \hat\rho\big(\{V(x'_{(i)})\}_{i=1}^m\big) - (D^\rho_h V)(x^\tau_h, a^\tau_h) \Big] \right\|^2_{(\Lambda^t_h)^{-1}} \le 2H^2 \log\det(I + K^t_h/\lambda) + 2H^2 t(\lambda - 1) + 4H^2 \big[ \log N_\infty(\varepsilon, h+1, B_T) + \log(2T^2H^3) \big] + 8t^2\varepsilon^2/\lambda,$$
uniformly over all t ∈ [T], h ∈ [H], with probability at least $1 - (2T^2H^2)^{-1}$. The first inequality is due to the fact that $V^t_{h+1} \in \mathcal{V}_{\mathrm{ucb}}(h+1, R_T, B_T)$. According to the algorithm, we set λ = 1 + 1/T; further setting $\varepsilon^* = H/T$, the above simplifies to
$$\left\| \sum_{\tau=1}^{t-1} \phi(x^\tau_h, a^\tau_h) \Big[ \hat\rho\big(\{V^t_{h+1}(x'_{(i)})\}_{i=1}^m\big) - (D^\rho_h V^t_{h+1})(x^\tau_h, a^\tau_h) \Big] \right\|^2_{(\Lambda^t_h)^{-1}} \le 4H^2 \Gamma_k(T, \lambda) + 10H^2 + 4H^2 \log N_\infty(\varepsilon^*, h+1, B_T) + 12H^2 \log(TH). \quad (24)$$
Combining the above result with Eq. (22), Eq. (23), and Eq. (24) yields the bound in Eq. (25), namely $|\phi(z)^\top \hat\theta^t_h - \phi(z)^\top \theta^t_h| \le \beta \cdot b^t_h(z)$. Finally, by the definition of the temporal-difference error $\delta^t_h$ and Eq. (25), we have
$$-\delta^t_h(z) = Q^t_h(z) - (T^\star_h Q^t_{h+1})(z) \le \phi(z)^\top (\hat\theta^t_h - \theta^t_h) + \beta b^t_h(z) \le 2\beta b^t_h(z),$$
which proves the left inequality of Lemma 5. For the right inequality, note that since $Q^t_{h+1}(z) \le H - h$ for all $z \in \mathcal{Z}$, we have $T^\star_h Q^t_{h+1} \le H - h + 1$. Therefore,
$$\delta^t_h(z) = (T^\star_h Q^t_{h+1})(z) - Q^t_h(z) = \phi(z)^\top \theta^t_h - \min\big\{ \phi(z)^\top \hat\theta^t_h + \beta \cdot b^t_h(z),\ H - h + 1 \big\}^+ \le \max\big\{ \phi(z)^\top \theta^t_h - \phi(z)^\top \hat\theta^t_h - \beta \cdot b^t_h(z),\ \phi(z)^\top \theta^t_h - (H - h + 1) \big\} \le 0,$$
where the first term is non-positive due to Eq. (25), and the second term is non-positive because $T^\star_h Q^t_{h+1} \le H - h + 1$. This completes the proof of Lemma 5.

D.4 PROOF OF LEMMA 6

This section follows the arguments of Yang et al. (2020). By the Azuma-Hoeffding inequality (Azuma, 1967), we obtain that for all s > 0:
$$\mathbb{P}\left[ \sum_{t=1}^T \sum_{h=1}^H \zeta^1_{t,h} > s \right] \le 2\exp\left( \frac{-s^2}{16TH^3} \right).$$
Setting the right-hand side equal to δ/2 yields $s = \sqrt{16TH^3 \log(4/\delta)}$. Next, we bound $\zeta^2_{t,h}$ using the risk-estimator concentration inequality, recalling that the empirical risk estimate $\hat\rho$ achieves the rate $\Xi(m, \delta)$.

In contrast, the VaR depends only on a single quantile: one can significantly reduce the smallest rewards below the VaR, and the value-at-risk will not change. In other words, VaR disregards some parts of the distribution. This property can be good or bad depending on the application. For example, VaR estimates are statistically more stable than CVaR estimates. However, in our case, we use a sufficiently large number of samples to estimate the risk values; thus, risk measures that consider the complete distribution, like CVaR or EVaR, are slightly more effective. Furthermore, it can be shown that, for the same risk parameter α, EVaR is more risk-averse than both VaR and CVaR (Ahmadi-Javid, 2012). Our results show that the reward distribution using EVaR (plots in the rightmost column) tends to have lower average rewards but better worst-case rewards.

Figure 4: Histograms show the distribution of the cumulative reward when following the learned policy. The solid lines represent the estimated densities using Gaussian kernels, and the vertical lines represent the means of the distributions. For α = 0.1 (top row), the policy is more risk-averse, favoring safer paths with higher worst-case rewards but smaller average rewards. As we increase α to α = 0.5 (middle row) and α = 0.9 (bottom row), the learned policy becomes more risk-tolerant, which makes the average rewards higher but occasionally small. Furthermore, for the same risk parameter, e.g., α = 0.1, EVaR usually has the smallest average but better worst-case rewards. This confirms that EVaR is more risk-averse than VaR and CVaR for the same risk parameter (Ahmadi-Javid, 2012).



Apart from CVaR and EVaR, risk measures such as g-entropic risk measures, tail value-at-risk, the proportional hazard (PH) risk measure, the Wang risk measure, and the superhedging price also belong to the coherent risk family. More details about various coherent risk measures are given in Appendix C.

Super-additivity in the reward-maximization setting becomes sub-additivity in the cost-minimization setting.

Note that the weak simulator can only sample possible next states and returns no information regarding the rewards. In this sense, our simulator is weaker than the archetypal simulators often assumed in the RL literature.

In our risk-aware RL setting, the random variable Z represents the random total reward of the agent.

Since A is a finite set, the inner product over A is the canonical inner product on Euclidean vector space.

The package gym-anytrading is available at https://github.com/AminHP/gym-anytrading.

Note that the definition of CVaR presented above differs from that in the literature, e.g., in (Rockafellar et al., 2000). This is because we treat Z as rewards and thus maximize Z, whereas existing works consider Z as costs and consequently minimize Z.
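The remark on the weak simulator can be made concrete: estimating the risk of the next-state value requires only next-state samples, never reward information. Below is a minimal sketch under that assumption; the grid-world transition in `weak_simulator`, the value table, and all names are illustrative inventions, not the paper's implementation.

```python
import numpy as np

def weak_simulator(rng, state, action, m):
    """Hypothetical weak simulator: returns m sampled next states only
    (no reward information), here via a noisy random-walk transition."""
    noise = rng.integers(-1, 2, size=m)  # each step perturbed by -1, 0, or +1
    return np.clip(state + action + noise, 0, 9)

def empirical_cvar(values, alpha):
    """Empirical CVaR of rewards: mean of the worst ceil(alpha*m) samples."""
    k = int(np.ceil(alpha * len(values)))
    return float(np.mean(np.sort(values)[:k]))

def estimate_risk_of_next_value(rng, v_next, state, action, m=100, alpha=0.1):
    """Estimate rho(V(x')) with x' ~ P_h(.|s, a), using only next-state
    samples from the weak simulator (no reward queries)."""
    next_states = weak_simulator(rng, state, action, m)
    return empirical_cvar(v_next[next_states], alpha)

rng = np.random.default_rng(0)
v_next = np.linspace(0.0, 10.0, 10)  # toy next-step value table on 10 states
risk = estimate_risk_of_next_value(rng, v_next, state=4, action=1, m=100)
print(risk)
```

The same routine works for any risk measure ρ applied to the sampled values, which is what makes the weak simulator sufficient for the risk-estimation step.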



…); Yang et al. (2020); Zanette et al. (2020). Please refer to Du et al. (2019) for a discussion of the necessity of this assumption.

Figure 1: Illustration of the continuous version of the cliff walking problem. The robot starts at (0, 0) and must navigate to the goal area (in green). The robot gets negative rewards for being close to the obstacles and receives a reward of 10 upon reaching the goal.

Figure 3: Estimated distribution of the normalized terminal wealth following the learned policy for different risk parameters. The vertical lines represent the average rewards. When α = 0.9 (the blue bars), the policy is more risk-tolerant, which causes the average reward to be higher at the expense of occasional low rewards. The policy becomes more risk-averse as we decrease the value of α, favoring safe paths with lower average-case rewards and higher worst-case rewards.

The episode terminates when the agent reaches state x_{H+1} at time step H + 1. In the last time step, the agent takes no action and receives no reward. A policy π of an agent is a sequence of H functions, i.e., π = {π_h}_{h∈[H]}, in which each π_h(·|x) is a probability distribution over A. Here, π_h(a|x) indicates the probability that the agent takes action a at state x in time step h. Any policy π and an initial state x_1 determine a probability measure P^π
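To make the notation concrete, a time-dependent policy π = {π_h} and an initial state induce a distribution over length-H trajectories, which can be sampled step by step. The following is a minimal sketch; the tabular MDP, the uniform policy, and the transition kernel are illustrative assumptions, not the paper's environment.

```python
import numpy as np

def rollout(rng, P, pi, x1, H):
    """Sample one trajectory of an episodic finite-horizon MDP.

    P[h][x][a] : distribution over next states at step h (length S)
    pi[h][x]   : distribution over actions at step h (length A), i.e. pi_h(.|x)
    Returns the visited states x_1..x_{H+1} and actions a_1..a_H.
    """
    states, actions = [x1], []
    x = x1
    for h in range(H):
        a = rng.choice(len(pi[h][x]), p=pi[h][x])      # a ~ pi_h(.|x)
        x = rng.choice(len(P[h][x][a]), p=P[h][x][a])  # x' ~ P_h(.|x, a)
        actions.append(a)
        states.append(x)
    return states, actions

rng = np.random.default_rng(0)
S, A, H = 3, 2, 5
pi = np.full((H, S, A), 1.0 / A)     # uniform toy policy
P = np.full((H, S, A, S), 1.0 / S)   # uniform toy transition kernel
states, actions = rollout(rng, P, pi, x1=0, H=H)
print(states, actions)
```

The trajectory has H + 1 states and H actions, matching the convention above that no action is taken at time step H + 1.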

3. Upper bounding Term II. Lemma 6. For ζ^1_{t,h} and ζ^2_{t,h} defined in Eq. (17), we have that, with probability at least 1 − δ,

Eq. (25) holds uniformly for all t ∈ [T], h ∈ [H] with probability 1 − (2T²H²)^{−1}. The second inequality follows from the fact that √a + √b ≤ √(2(a + b)), and the last inequality follows from the assumption on B_T.
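The elementary fact invoked for the second inequality is √a + √b ≤ √(2(a + b)) for a, b ≥ 0, which follows from (√a + √b)² = a + b + 2√(ab) ≤ 2(a + b). A quick numerical sanity check (purely illustrative, not part of the proof):

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.uniform(0.0, 100.0, size=10_000)
b = rng.uniform(0.0, 100.0, size=10_000)

lhs = np.sqrt(a) + np.sqrt(b)
rhs = np.sqrt(2.0 * (a + b))
gap = rhs - lhs  # non-negative (up to floating point) on every draw
print(gap.min())
```

Equality holds exactly when a = b, so the factor √2 in the bound cannot be improved.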

to show that {ζ^1_{t,h}, ζ^2_{t,h}}_{(t,h)∈[T]×[H]} is a bounded martingale difference sequence. We construct the filtration as follows. For any t ∈ [T], h ∈ [H], we define the σ-algebras

F_{t,h,1} = σ( {(x^τ_i, a^τ_i)}_{(τ,i)∈[t−1]×[H]} ∪ {(x^t_i, a^t_i)}_{i∈[h]} ),
F_{t,h,2} = σ( {(x^τ_i, a^τ_i)}_{(τ,i)∈[t−1]×[H]} ∪ {(x^t_i, a^t_i)}_{i∈[h]} ∪ {x^t_{h+1}} ),

where σ(·) denotes the σ-algebra generated by a finite set. Since V^t_h and Q^t_h are computed based on the trajectories of the first t − 1 episodes, they are measurable with respect to F_{t,1,1}. Since the action a^t_h is sampled from π^t(·|x^t_h), we have E[ζ^1_{t,h} | F_{t,h−1,2}] = 0. Thus, applying the Azuma–Hoeffding inequality
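As a sanity check of the concentration step, the Azuma–Hoeffding bound P(S_n ≥ t) ≤ exp(−t²/(2nc²)) for a martingale difference sequence with |ζ_i| ≤ c can be verified numerically. The simulation below uses i.i.d. Rademacher differences (a special case of a bounded martingale difference sequence) and is illustrative only, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, c, trials = 100, 1.0, 20_000
t = 2.0 * np.sqrt(n)  # deviation threshold (two standard deviations here)

# Rademacher steps: a bounded martingale difference sequence with |zeta_i| <= c.
steps = rng.choice([-c, c], size=(trials, n))
S_n = steps.sum(axis=1)

empirical = np.mean(S_n >= t)                   # observed tail frequency
azuma_bound = np.exp(-t**2 / (2 * n * c**2))    # Azuma-Hoeffding upper bound
print(empirical, azuma_bound)
```

The empirical tail frequency sits comfortably below the bound, as expected, since Azuma–Hoeffding is not tight for sums of independent Rademacher variables.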

P[ |ρ(Z) − ρ̂({Z_i}_{i=1}^{m})| ≤ Ξ(m, δ) ] ≥ 1 − δ. By definition, D^ρ_h V^t_{h+1}(x^t_h, a^t_h) = ρ(V^t_{h+1}(x′)) where x′ ∼ P_h(·|x^t_h, a^t_h). Note that x^t_{h+1} is also sampled from P_h(·|x^t_h, a^t_h). Applying the above inequality, for all t ∈ [T], h ∈ [H], we have

Finally, performing a union bound over Eqs. (26), (28), and (29) gives us that, with probability at least 1 − δ, we have

ACKNOWLEDGMENTS

This research is part of the programme DesCartes and is supported by the National Research Foundation, Prime Minister's Office, Singapore under its Campus for Research Excellence and Technological Enterprise (CREATE) programme. C. T. Lam is supported by the Singapore-MIT Alliance for Research and Technology (SMART) PhD Fellowship.

D.5 PROOF OF COROLLARY 1

We first restate the CVaR empirical risk estimator and its concentration result.

Lemma 7 (Lemma 3 in Yu et al. (2018)). Let Z_1, . . . , Z_m ∼ F_Z be m i.i.d. bounded random variables, i.e., P[0 ≤ Z_i ≤ B] = 1 for all i; then we have

where

Setting the RHS to δ, we have

or equivalently,

For the regret to be order-optimal, we need

That gives us

which concludes the proof of Corollary 1.
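The Ξ(m, δ) ∝ 1/√m behavior of the empirical CVaR estimator can be illustrated numerically. The sketch below uses a Uniform(0, B) reward distribution, for which the lower-tail CVaR has the closed form αB/2; it is an illustration of the concentration phenomenon, not the exact estimator analyzed in Yu et al. (2018).

```python
import numpy as np

def empirical_cvar(samples, alpha):
    """Empirical CVaR of rewards: mean of the worst ceil(alpha*m) samples."""
    k = int(np.ceil(alpha * len(samples)))
    return float(np.mean(np.sort(samples)[:k]))

rng = np.random.default_rng(0)
alpha, B = 0.2, 1.0
true_cvar = alpha * B / 2.0  # closed form for Uniform(0, B): mean of lower alpha-tail

errors = []
for m in (100, 10_000):
    est = empirical_cvar(rng.uniform(0.0, B, size=m), alpha)
    errors.append(abs(est - true_cvar))
print(errors)
```

Increasing m by a factor of 100 shrinks the estimation error by roughly a factor of 10, consistent with a 1/√m concentration rate.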

E ADDITIONAL EXPERIMENTS

This section provides additional empirical results to demonstrate the effectiveness of RA-UCB. We study the proposed algorithm under various risk measures, namely VaR, CVaR, and EVaR. Refer to Appendix C for a formal introduction and discussion of these risk measures.

The robot navigation environment is similar to the setting in Section 5.1. The robot receives a positive reward of 10 for reaching the destination and a negative reward for being close to obstacles. The negative reward increases exponentially as the robot comes close to the obstacles. We set the horizon of each episode to H = 30 and use m = 100 samples from the weak simulator to estimate the risk in Eq. (7). We approximate the state-action value function using the RBF kernel and the KernelRidge regressor from Scikit-learn. We run RA-UCB for 50 episodes and report the performance of the learned policy with three different risk measures, namely Value-at-Risk (VaR), Conditional Value-at-Risk (CVaR), and Entropic Value-at-Risk (EVaR). Each risk measure is evaluated against different values of the risk parameter α ∈ {0.1, 0.5, 0.9}. We note that VaR is not a coherent risk measure; therefore, our regret upper bound does not apply to VaR. Regardless, VaR is still an important risk measure with many real-world applications and has important connections with CVaR and EVaR.

Fig. 4 shows the robot's cumulative rewards following the learned policies after 50 episodes. Each column represents a different risk measure, and each row represents a different risk parameter. We observe that the reward distribution changes when we vary the risk parameters. For α = 0.1 (top row), the policies are more risk-averse, favoring safer paths with higher worst-case rewards but smaller average rewards.
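The value-function approximation step above can be sketched as follows. This is a minimal illustration with synthetic regression targets; the three-dimensional (state, action) features, the smooth target function, and the hyperparameters are assumptions for the sketch, not the paper's exact configuration.

```python
import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.default_rng(0)

# Synthetic (state, action) features in [0, 1]^3 and smooth value targets.
X = rng.uniform(size=(300, 3))
y = np.sin(2 * np.pi * X[:, 0]) + X[:, 1] * X[:, 2]

# RBF-kernel ridge regression as the non-linear function approximator.
model = KernelRidge(kernel="rbf", alpha=1e-3, gamma=2.0)
model.fit(X, y)

mse = float(np.mean((model.predict(X) - y) ** 2))
print(mse)
```

In the algorithm, the fitted regressor would supply Q-value predictions at candidate actions; the ridge penalty `alpha` and kernel width `gamma` control the bias-variance trade-off of the approximation.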
As we increase the value of α to 0.5 (middle row) and 0.9 (bottom row), the learned policies become more risk-tolerant, which causes the average rewards to be higher but occasionally small. We also observe that, for the same risk parameter, CVaR outperforms VaR marginally in terms of average rewards. This gain is because VaR does not control rewards below its value (for instance, one can significantly reduce the smallest rewards below VaR, but the Value-at-Risk will not change). In other words, VaR disregards some parts of the distribution. This property can be both good and bad, depending upon the application. For example, VaR estimates are statistically more stable than CVaR estimates. However, in our case, we use a sufficiently high number of samples to estimate the risk values; thus, risk measures that consider the complete distribution, like CVaR or EVaR, are slightly more effective. Furthermore, it can be shown that for the same risk parameter α, EVaR is more risk-averse than both VaR and CVaR (Ahmadi-Javid, 2012). Our results show that the reward distribution using EVaR (plots in the rightmost column) tends to have lower average rewards but better worst-case rewards.
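The ordering discussed above, namely that under the reward convention EVaR_α ≤ CVaR_α ≤ VaR_α for the same α (EVaR being the most risk-averse), can be checked with simple empirical estimators. This sketch is illustrative and not the paper's code: the EVaR estimator evaluates the dual form sup_{s>0} [−s log(E[e^{−Z/s}]/α)] on a grid of dual variables, which only underestimates the supremum.

```python
import numpy as np

def var(z, alpha):
    """Empirical VaR of rewards: the lower alpha-quantile."""
    return float(np.quantile(z, alpha))

def cvar(z, alpha):
    """Empirical CVaR of rewards: mean of the worst ceil(alpha*m) samples."""
    k = int(np.ceil(alpha * len(z)))
    return float(np.mean(np.sort(z)[:k]))

def evar(z, alpha, grid=np.logspace(-2, 2, 400)):
    """Empirical EVaR of rewards via its dual representation,
    maximized over a grid of dual variables s > 0 (log-sum-exp for stability)."""
    vals = []
    for s in grid:
        w = -z / s
        mx = w.max()
        log_mean = mx + np.log(np.mean(np.exp(w - mx)))  # log E[exp(-Z/s)]
        vals.append(-s * (log_mean - np.log(alpha)))
    return float(max(vals))

rng = np.random.default_rng(0)
z = rng.uniform(0.0, 10.0, size=5_000)  # bounded random rewards
alpha = 0.1
e, c, v = evar(z, alpha), cvar(z, alpha), var(z, alpha)
print(e, c, v)
```

On any sample, the three estimates respect the risk-aversion ordering, mirroring the comparison of the learned policies in Fig. 4.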

