RISK-AWARE REINFORCEMENT LEARNING WITH COHERENT RISK MEASURES AND NON-LINEAR FUNCTION APPROXIMATION

Abstract

We study the risk-aware reinforcement learning (RL) problem in episodic finite-horizon Markov decision processes (MDPs) with unknown transition and reward functions. In contrast to the risk-neutral RL problem, we consider minimizing the risk of obtaining low rewards, which arises from both the intrinsic randomness of the MDP and imperfect knowledge of the model. Our work provides a unified framework for analyzing the regret of risk-aware RL policies with coherent risk measures in conjunction with non-linear function approximation, yielding the first sub-linear regret bounds in this setting. Finally, we validate our theoretical results via empirical experiments on synthetic and real-world data.

1. INTRODUCTION

Reinforcement learning (RL) (Sutton & Barto, 2018) is a control-theoretic problem in which an agent interacts with an unknown environment and aims to maximize its expected total reward. Due to the intrinsic randomness of the environment, even a policy with a high expected total reward may occasionally produce very low rewards. This uncertainty is problematic in many real-life applications, such as competitive games (Mnih et al., 2013) and healthcare (Liu et al., 2020), where the agent (or decision-maker) needs to be risk-averse. For example, patients' responses to a drug are stochastic owing to their varying physiology or genetic profiles (McMahon & Insel, 2012); it is therefore desirable to select a set of treatments that is highly effective while minimizing the possibility of adverse effects (Beutler et al., 2016; Fatemi et al., 2021). Existing RL policies that maximize the risk-neutral total reward cannot yield an optimal risk-aware policy for problems in which the total reward is uncertain (Yu et al., 2018). Our goal is therefore to design an RL algorithm that learns a risk-aware policy, one that minimizes the risk of receiving a small total reward.

How, then, should we learn a risk-aware RL policy? A natural approach is to directly learn a policy that minimizes the risk of a small total reward (Howard & Matheson, 1972). To quantify such risk, one can use risk measures such as the entropic risk (Föllmer & Knispel, 2011), value-at-risk (VaR) (Dempster, 2002), conditional value-at-risk (CVaR) (Rockafellar et al., 2000), or entropic value-at-risk (EVaR) (Ahmadi-Javid, 2012). These risk measures capture the volatility of the total reward and quantify the possibility of rare but catastrophic events. The entropic risk measure can be viewed as a mean-variance criterion, in which the risk is expressed through the variance of the total reward (Fei et al., 2021). In contrast, VaR, CVaR, and EVaR are quantile criteria, which are often preferable to the mean-variance criterion for risk management (Chapter 3 of Kisiala (2015)); their formal definitions are recalled below. Among these, coherent risk measures¹ such as CVaR and EVaR are preferred, since coherence confers compelling theoretical properties such as subadditivity and monotonicity (Rockafellar et al., 2000).

Risk-aware RL algorithms that use CVaR as the risk measure exist in the literature (Bäuerle & Ott, 2011; Yu et al., 2018; Rigter et al., 2021). However, apart from being customized only for CVaR, these algorithms suffer from two significant shortcomings. First, most of them focus on the tabular MDP setting and need multiple complete traversals of the state space (Bäuerle & Ott, 2011; Rigter et al., 2021).
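For concreteness, the quantile-based risk measures discussed above admit the following standard definitions. We state them in the reward convention (random total reward $R$, lower tail, risk level $\alpha \in (0,1]$), which is our reading of the reward-maximization setting here; sources that work with losses flip the signs:
\[
\mathrm{VaR}_{\alpha}(R) \;=\; \inf\{\, x \in \mathbb{R} \;:\; \Pr(R \le x) \ge \alpha \,\},
\]
\[
\mathrm{CVaR}_{\alpha}(R) \;=\; \sup_{\nu \in \mathbb{R}} \Big\{ \nu - \tfrac{1}{\alpha}\, \mathbb{E}\big[(\nu - R)_{+}\big] \Big\},
\qquad
\mathrm{EVaR}_{\alpha}(R) \;=\; \sup_{z > 0} \Big\{ -\tfrac{1}{z} \ln\!\Big( \tfrac{\mathbb{E}[e^{-zR}]}{\alpha} \Big) \Big\},
\]
where $(x)_{+} = \max\{x, 0\}$. Under this convention, $\mathrm{EVaR}_{\alpha}(R) \le \mathrm{CVaR}_{\alpha}(R) \le \mathrm{VaR}_{\alpha}(R)$, so EVaR is the most conservative of the three; moreover, for a continuously distributed $R$, $\mathrm{CVaR}_{\alpha}(R)$ coincides with the conditional tail mean $\mathbb{E}[R \mid R \le \mathrm{VaR}_{\alpha}(R)]$.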



¹Apart from CVaR and EVaR, risk measures such as g-entropic risk measures, tail value-at-risk, the proportional hazard (PH) risk measure, the Wang risk measure, and the superhedging price also belong to the coherent risk family. More details about various coherent risk measures are given in Appendix C.
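For completeness, we recall the classical defining axioms of coherence, stated in the standard loss convention for a loss variable $L$ (the reward convention used above flips the inequalities accordingly). A risk measure $\rho$ is coherent if, for all losses $L_1, L_2$, constants $c \in \mathbb{R}$, and scalars $\lambda \ge 0$:
\begin{align*}
&\text{(Monotonicity)} && L_1 \le L_2 \ \text{a.s.} \;\Rightarrow\; \rho(L_1) \le \rho(L_2), \\
&\text{(Translation invariance)} && \rho(L + c) = \rho(L) + c, \\
&\text{(Positive homogeneity)} && \rho(\lambda L) = \lambda\, \rho(L), \\
&\text{(Subadditivity)} && \rho(L_1 + L_2) \le \rho(L_1) + \rho(L_2).
\end{align*}
Subadditivity, which encodes the diversification principle, is precisely the axiom that VaR fails to satisfy in general; CVaR and EVaR satisfy all four, which underlies their preferred status noted above.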

