PROVABLY EFFICIENT RISK-SENSITIVE REINFORCEMENT LEARNING: ITERATED CVAR AND WORST PATH

Abstract

In this paper, we study a novel episodic risk-sensitive Reinforcement Learning (RL) problem, named Iterated CVaR RL, which aims to maximize the tail of the reward-to-go at each step, and focuses on tightly controlling the risk of getting into catastrophic situations at each stage. This formulation is applicable to real-world tasks that demand strong risk avoidance throughout the decision process, such as autonomous driving, clinical treatment planning, and robotics. We investigate two performance metrics under Iterated CVaR RL, i.e., Regret Minimization and Best Policy Identification. For both metrics, we design efficient algorithms ICVaR-RM and ICVaR-BPI, respectively, and provide nearly matching upper and lower bounds with respect to the number of episodes K. We also investigate an interesting limiting case of Iterated CVaR RL, called Worst Path RL, where the objective becomes to maximize the minimum possible cumulative reward. For Worst Path RL, we propose an efficient algorithm with constant upper and lower regret bounds. Finally, our techniques for bounding the change of CVaR due to the value function shift and for decomposing the regret via a distorted visitation distribution are novel, and can find applications in other risk-sensitive RL problems.

1. INTRODUCTION

Reinforcement Learning (RL) (Kaelbling et al., 1996; Szepesvári, 2010; Sutton & Barto, 2018) is a classic online decision-making formulation, where an agent interacts with an unknown environment with the goal of maximizing the obtained reward. Despite the empirical success and theoretical progress of recent RL algorithms, e.g., (Szepesvári, 2010; Agrawal & Jia, 2017; Azar et al., 2017; Zanette & Brunskill, 2019), these algorithms focus mainly on the risk-neutral criterion, i.e., maximizing the expected cumulative reward, and can fail to avoid rare but disastrous situations. As a result, existing algorithms cannot be applied to tackle real-world risk-sensitive tasks, such as autonomous driving (Wen et al., 2020) and clinical treatment planning (Coronato et al., 2020), where policies that ensure a low risk of getting into catastrophic situations at all decision stages are strongly preferred.

Motivated by the above facts, we investigate Iterated CVaR RL, a novel episodic RL formulation equipped with an important risk-sensitive criterion, i.e., Iterated Conditional Value-at-Risk (CVaR) (Hardy & Wirch, 2004). Here, CVaR (Artzner et al., 1999) is a popular static (single-stage) risk measure which stands for the expected tail reward. Iterated CVaR is a dynamic (multi-stage) risk measure defined upon CVaR by backward iteration, and focuses on the worst portion of the reward-to-go at each stage. In the Iterated CVaR RL problem, an agent interacts with an unknown episodic Markov Decision Process (MDP) in order to maximize the worst α-portion of the reward-to-go at each step, where α ∈ (0, 1] is a given risk level. Under this model, we investigate two important performance metrics, i.e., Regret Minimization (RM), where the goal is to minimize the cumulative regret over all episodes, and Best Policy Identification (BPI), where the performance is measured by the number of episodes required to identify an optimal policy.
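For reference, the (lower-tail) CVaR at level α of a discrete reward distribution can be computed by averaging the worst α-fraction of the probability mass, placing a fractional weight on the boundary atom. The following is a minimal sketch under that discrete setting; the function name and interface are our own illustration, not the paper's notation:

```python
import numpy as np

def cvar(values, probs, alpha):
    """Lower-tail CVaR_alpha of a discrete reward distribution:
    the expectation over the worst alpha-fraction of outcomes."""
    order = np.argsort(values)                 # worst (smallest) outcomes first
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    mass_left, total = alpha, 0.0
    for vi, pi in zip(v, p):
        take = min(pi, mass_left)              # fractional weight on boundary atom
        total += take * vi
        mass_left -= take
        if mass_left <= 1e-12:
            break
    return total / alpha
```

For instance, for a reward that is 0 with probability 0.1 and 10 with probability 0.9, CVaR at α = 0.1 is 0 (the entire tail is the bad outcome), while α = 1 recovers the ordinary expectation.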
Compared to existing CVaR MDP models, e.g., (Boda & Filar, 2006; Ott, 2010; Bäuerle & Ott, 2011; Chow et al., 2015), which aim to maximize the CVaR (i.e., the worst α-portion) of the total reward, our Iterated CVaR RL concerns the worst α-portion of the reward-to-go at each step, and guards against entering catastrophic states more carefully. Intuitively, CVaR MDP takes more of the cumulative reward into account and prefers actions which perform better in general but can have larger probabilities of reaching catastrophic states. Thus, CVaR MDP is suitable for scenarios where bad situations lead to a higher cost rather than fatal damage, e.g., finance. In contrast, our Iterated CVaR RL prefers actions which have smaller probabilities of reaching catastrophic states. Hence, Iterated CVaR RL is suitable for safety-critical applications, where catastrophic states are unacceptable and must be carefully avoided, e.g., clinical treatment planning (Wang et al., 2019) and unmanned helicopter control (Johnson & Kannan, 2002). For example, consider the case where we fly an unmanned helicopter to complete some task. There is a small probability that, at any time during execution, the helicopter encounters a sensing or control failure and does not take the scheduled action. To guarantee the safety of surrounding workers and of the helicopter itself, we need to make sure that even if such a failure occurs, the adopted policy ensures that the helicopter does not crash and cause fatal damage (see Appendix C.2, C.3 for more detailed comparisons with existing risk-sensitive MDP models).

Iterated CVaR RL faces several unique challenges. (i) The importance (contribution to regret) of a state in Iterated CVaR RL is not proportional to its visitation probability. Specifically, there can be states which are critical (risky) but have a small visitation probability.
As a result, the regret for Iterated CVaR RL cannot be decomposed into the estimation error at each step with respect to the visitation distribution, as in standard RL analysis (Jaksch et al., 2010; Azar et al., 2017; Zanette & Brunskill, 2019). (ii) In Iterated CVaR RL, calculating the estimation error involves bounding the change of CVaR when the true value function shifts to the optimistic value function, which is very different from bounding the change of expected rewards as in existing RL analysis (Jaksch et al., 2010; Azar et al., 2017; Jin et al., 2018). Therefore, Iterated CVaR RL demands brand-new algorithm design and analytical techniques.

To tackle the above challenges, we design two efficient algorithms, ICVaR-RM and ICVaR-BPI, for the RM and BPI metrics, respectively, equipped with delicate CVaR-adapted value iteration and exploration bonuses that allocate more attention to rare but potentially dangerous states. We also develop novel analytical techniques for bounding the change of CVaR due to the value function shift and for decomposing the regret via a distorted visitation distribution. Lower bounds for both metrics are established to demonstrate the optimality of our algorithms with respect to the number of episodes K. Moreover, we present experiments to validate our theoretical results and show the performance superiority of our algorithms (see Appendix A).

We further study an interesting limiting case of Iterated CVaR RL when α approaches 0, called Worst Path RL, where the goal becomes to maximize the minimum possible cumulative reward (i.e., to optimize the worst path). This setting corresponds to the scenario where the decision maker is extremely risk-averse and is concerned with the worst-case outcome (e.g., in clinical treatment planning (Coronato et al., 2020), the worst case can be disastrous).
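To make the CVaR-adapted value iteration concrete, the following is a minimal planning sketch on a known toy MDP. This is not the paper's ICVaR-RM algorithm, which additionally handles unknown transitions via exploration bonuses; the toy MDP, function names, and parameters are illustrative assumptions. Setting α = 1 recovers the risk-neutral (expectation) backup, while a small α approximates the Worst Path (minimum) backup:

```python
import numpy as np

def discrete_cvar(values, probs, alpha):
    # Expectation over the worst alpha-fraction of a discrete distribution.
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    mass_left, total = alpha, 0.0
    for vi, pi in zip(v, p):
        take = min(pi, mass_left)
        total += take * vi
        mass_left -= take
        if mass_left <= 1e-12:
            break
    return total / alpha

def iterated_cvar_vi(P, R, H, alpha):
    """Backward induction with a CVaR backup over the next-state value
    in place of the usual expectation. P[s][a]: next-state distribution,
    R[s][a]: immediate reward, H: horizon."""
    S, A = len(P), len(P[0])
    V = np.zeros(S)                                   # V_{H+1} = 0
    for _ in range(H):
        Q = np.array([[R[s][a] + discrete_cvar(V, P[s][a], alpha)
                       for a in range(A)] for s in range(S)])
        V = Q.max(axis=1)                             # V_h(s) = max_a Q_h(s, a)
    return V

# Toy 2-state MDP: in state 0, action 0 is "safe" (stay, reward 1);
# action 1 is "risky" (reward 2, but a 5% chance of falling into the
# catastrophic absorbing state 1, which yields reward 0 forever).
P = [[[1.0, 0.0], [0.95, 0.05]],    # state 0
     [[0.0, 1.0], [0.0, 1.0]]]      # state 1 (absorbing)
R = [[1.0, 2.0], [0.0, 0.0]]
```

On this toy MDP, the risk-neutral backup (α = 1) prefers the risky action at earlier steps, whereas a small α (below the 5% failure probability) evaluates the risky action by the catastrophic state's value and thus prefers the safe action, illustrating why Iterated CVaR favors actions with small probabilities of reaching catastrophic states.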
We emphasize that Worst Path RL cannot be directly solved by taking α → 0 in Iterated CVaR RL's results, since those results depend on 1/α in both upper and lower bounds. To handle this limiting case, we design a simple yet efficient algorithm, MaxWP, and obtain constant upper and lower regret bounds which are independent of K.

The contributions of this paper are summarized as follows.

• We propose a novel Iterated CVaR RL formulation, where an agent interacts with an unknown environment with the objective of maximizing the worst α-portion of the reward-to-go at each step. This formulation enables one to tightly control risk throughout the decision process, and is most suitable for applications where such safety-at-all-times is critical.

• We investigate two important metrics of Iterated CVaR RL, i.e., Regret Minimization (RM) and Best Policy Identification (BPI), and propose efficient algorithms ICVaR-RM

