RISK-AWARE REINFORCEMENT LEARNING WITH COHERENT RISK MEASURES AND NON-LINEAR FUNCTION APPROXIMATION

Abstract

We study the risk-aware reinforcement learning (RL) problem in episodic finite-horizon Markov decision processes (MDPs) with unknown transition and reward functions. In contrast to the risk-neutral RL problem, we consider minimizing the risk of receiving low rewards, which can arise due to the intrinsic randomness of the MDP and imperfect knowledge of the model. Our work provides a unified framework to analyze the regret of risk-aware RL policies with coherent risk measures in conjunction with non-linear function approximation, which yields the first sub-linear regret bounds in this setting. Finally, we validate our theoretical results via empirical experiments on synthetic and real-world data.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) is a control-theoretic problem in which an agent interacts with an unknown environment and aims to maximize its expected total reward. Due to the intrinsic randomness of the environment, even a policy with a high expected total reward may occasionally produce very low rewards. This uncertainty is problematic in many real-life applications such as competitive games (Mnih et al., 2013) and healthcare (Liu et al., 2020), where the agent (or decision-maker) needs to be risk-averse. For example, patients' drug responses are stochastic due to their varying physiology and genetic profiles (McMahon & Insel, 2012); it is therefore desirable to select treatments that yield high effectiveness while minimizing the possibility of adverse effects (Beutler et al., 2016; Fatemi et al., 2021). Existing RL policies that maximize the risk-neutral total reward cannot lead to an optimal risk-aware policy for problems where the total reward is uncertain (Yu et al., 2018). Our goal is therefore to design an RL algorithm that learns a risk-aware policy, i.e., one that minimizes the risk of receiving a small total reward. How should such a policy be learned? A natural approach is to optimize a risk-sensitive objective directly (Howard & Matheson, 1972). To quantify risk, one can use risk measures such as entropic risk (Föllmer & Knispel, 2011), value-at-risk (VaR) (Dempster, 2002), conditional value-at-risk (CVaR) (Rockafellar et al., 2000), or entropic value-at-risk (EVaR) (Ahmadi-Javid, 2012). These risk measures capture the volatility of the total reward and quantify the possibility of rare but catastrophic events. The entropic risk measure can be viewed as a mean-variance criterion, where the risk is expressed as the variance of the total reward (Fei et al., 2021).
Alternatively, VaR, CVaR, and EVaR use quantile criteria, which are often preferable to the mean-variance criterion for risk management (Chapter 3 of Kisiala (2015)). Among these, coherent risk measures[1] such as CVaR and EVaR are preferred because they enjoy compelling theoretical properties such as coherence (Rockafellar et al., 2000). Risk-aware RL algorithms with CVaR as the risk measure exist in the literature (Bäuerle & Ott, 2011; Yu et al., 2018; Rigter et al., 2021). However, apart from being customized only for CVaR, these algorithms suffer from two significant shortcomings. First, most of them focus on the tabular MDP setting and require multiple complete traversals of the state space (Bäuerle & Ott, 2011; Rigter et al., 2021). These traversals are prohibitively expensive for problems with large state spaces and impossible for problems with continuous state spaces, limiting the algorithms' applicability in practice. Second, the existing algorithms that consider continuous or infinite state spaces assume that the MDP is known, i.e., the transition probabilities and the reward of each state are known a priori. In such settings, the agent does not need to explore or generalize to unseen scenarios; hence the problem considered in Yu et al. (2018) is a planning problem rather than a learning problem. This paper alleviates both shortcomings by proposing a new risk-aware RL algorithm that works with unknown MDPs and uses non-linear function approximation to address continuous state spaces. Recent works (Jin et al., 2020; Yang et al., 2020) have proposed RL algorithms with function approximation and finite-sample regret guarantees, but they focus only on the risk-neutral setting. Extending their results to the risk-aware setting is non-trivial due to two major challenges. First, the existing analyses rely heavily on the linearity of the expectation in the risk-neutral Bellman equation.
This linearity property no longer holds when a coherent risk measure replaces the expectation in the Bellman equation. How can we address this challenge? We overcome it by a non-trivial application of the super-additivity property[2] of coherent risk measures (see Lemma 3 and its application in Appendix 4). Risk-neutral RL algorithms need only one sample of the next state to construct an unbiased estimate of the Bellman update (Yang et al., 2020), since the expectation in the risk-neutral Bellman equation can be estimated without bias from a single sample. This is not the case in the risk-aware setting; moreover, it is unknown whether an unbiased estimate of an arbitrary risk measure can be constructed from a single sample. This leads to the second major challenge: how can we construct an unbiased estimate of the risk-aware Bellman update? To resolve it, we assume access to a weak simulator[3] that can sample different next states given the current state and action, and we use these samples to construct an unbiased estimator. This assumption is mild and holds in many real-world applications; e.g., a player can anticipate the opponent's next moves and hence the possible next states of the game. After resolving both challenges, we propose an algorithm that uses a risk-aware value-iteration procedure based on the upper confidence bound (UCB) and enjoys a finite-sample sub-linear regret upper bound. Specifically, our contributions are as follows:

• We first formalize the risk-aware RL setting with coherent risk measures, namely the risk-aware objective function and the risk-aware Bellman equation, in Section 3. We then introduce the notion of regret for a risk-aware RL policy.

• We propose a general risk-aware RL algorithm named Risk-Aware Upper Confidence Bound (RA-UCB) for an entire class of coherent risk measures in Section 4.
RA-UCB uses UCB-based value functions with non-linear function approximation and enjoys a finite-sample sub-linear regret guarantee.

• We provide a unified framework to analyze the regret for any coherent risk measure in Section 4.1. The novelty of our analysis lies in the decomposition of the risk-aware policy's regret via the super-additivity property of coherent risk measures (shown in the proof of Lemma 4 in Appendix D.2).

• Our empirical experiments on synthetic and real datasets validate different performance aspects of our proposed algorithm in Section 5.
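As a concrete illustration of the quantile criteria discussed above (and not part of RA-UCB itself; the function and parameter names here are our own), the following sketch estimates VaR and CVaR from a sample of total rewards in the reward-maximization convention, where CVaR averages the worst α-fraction of outcomes:

```python
import random

def empirical_var_cvar(rewards, alpha=0.1):
    """Empirical VaR and CVaR at level alpha in the reward-maximization
    convention: VaR is (roughly) the alpha-quantile of the rewards, and
    CVaR is the average of the worst alpha-fraction of outcomes."""
    ordered = sorted(rewards)                 # worst rewards first
    k = max(1, int(alpha * len(ordered)))     # size of the alpha-tail
    tail = ordered[:k]
    var = tail[-1]                            # lower alpha-quantile
    cvar = sum(tail) / k                      # mean of the worst tail
    return var, cvar

random.seed(0)
# Simulated total rewards of a fixed policy over 100,000 episodes.
rewards = [random.gauss(10.0, 2.0) for _ in range(100_000)]
var, cvar = empirical_var_cvar(rewards, alpha=0.05)
# CVaR never exceeds VaR: it averages outcomes at or below the quantile,
# which is what makes it sensitive to rare but catastrophic events.
assert cvar <= var <= sum(rewards) / len(rewards)
```

This is why the quantile criteria are attractive for risk management: unlike a mean-variance criterion, CVaR looks only at the unfavorable tail of the reward distribution.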

1.1. RELATED WORK

Risk-aware MDPs were first introduced in the seminal work of Howard & Matheson (1972), which used an exponential utility function known as the entropic risk measure. Since then, risk-aware MDPs have been studied under different risk criteria: optimizing moments of the total reward (Jaquette, 1973), exponential utility or entropic risk (Borkar, 2001; 2002; Bäuerle & Rieder, 2014; Fei et al., 2020; 2021; Moharrami et al., 2022), the mean-variance criterion (Sobel, 1982; Li & Ng, 2000; La & Ghavamzadeh, 2013; Tamar et al., 2016), and conditional value-at-risk (Boda & Filar, 2006; Artzner et al., 2007; Bäuerle & Mundt, 2009; Bäuerle & Ott, 2011; Tamar et al., 2015; Yu et al., 2018; Rigter et al., 2021). Vadori et al. (2020) focus on the variability or uncertainty of the rewards.



[1] Apart from CVaR and EVaR, risk measures like g-entropic risk measures, tail value-at-risk, the proportional hazard (PH) risk measure, the Wang risk measure, and the superhedging price also belong to the coherent risk family. More details about various coherent risk measures are given in Appendix C.

[2] Super-additivity in the reward-maximization setting becomes sub-additivity in the cost-minimization setting.

[3] Note that the weak simulator can only sample possible next states and returns no information regarding the rewards. In this sense, our simulator is weaker than the archetypal simulators often assumed in the RL literature.
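The super-additivity property of coherent risk measures mentioned above can be checked numerically for CVaR. The sketch below (an illustrative Monte Carlo check under our own conventions, not the paper's proof technique) verifies that, in the reward-maximization convention where CVaR averages the worst α-tail, CVaR of a combined reward dominates the sum of the individual CVaRs for two independent Gaussian reward streams:

```python
import random

def cvar(rewards, alpha=0.05):
    # Average of the worst alpha-fraction of rewards
    # (reward-maximization convention of CVaR).
    ordered = sorted(rewards)
    k = max(1, int(alpha * len(ordered)))
    return sum(ordered[:k]) / k

random.seed(1)
n = 200_000
x = [random.gauss(0.0, 1.0) for _ in range(n)]  # reward stream X
y = [random.gauss(0.0, 1.0) for _ in range(n)]  # stream Y, independent of X
z = [a + b for a, b in zip(x, y)]               # combined reward X + Y
# Super-additivity: pooling rewards cannot increase coherent risk, so the
# CVaR of the combined reward dominates the sum of the individual CVaRs.
assert cvar(z) >= cvar(x) + cvar(y)
```

For independent standard Gaussians the gap is large (the tail of X + Y scales with the square root of two rather than two), so the sampled inequality holds with a wide margin.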

