DISTRIBUTIONAL REINFORCEMENT LEARNING FOR RISK-SENSITIVE POLICIES

Abstract

We address the problem of learning a risk-sensitive policy based on the CVaR risk measure using distributional reinforcement learning. In particular, we show that applying the distributional Bellman optimality operator with respect to a risk-based action-selection strategy overestimates the dynamic, Markovian CVaR. The resulting policies, however, can still be overly conservative, and one often prefers to learn an optimal policy for the static, non-Markovian CVaR instead. To this end, we propose a modification to the existing algorithm and show that it can indeed learn a proper CVaR-optimized policy. Our proposed approach is a simple extension of standard distributional RL algorithms and can therefore take advantage of many of the recent advances in deep RL. On both synthetic and real data, we empirically show that our proposed algorithm is able to produce a family of risk-averse policies that achieves a better tradeoff between risk and expected return.

1. INTRODUCTION

In standard reinforcement learning (RL) (Sutton & Barto, 2018), one seeks to learn a policy that maximizes an objective, usually the expected total discounted reward or the long-term average reward. In stochastic domains, especially when the level of uncertainty is high, maximizing the expectation may not be desirable, since the resulting policy may have high variance and occasionally perform badly. In such scenarios one may instead choose to learn a more risk-averse policy that avoids bad outcomes, even if its long-term average performance is slightly lower than optimal. In this work we consider optimizing the conditional value-at-risk (CVaR) (Rockafellar & Uryasev, 2000), a popular risk measure that is widely used in financial applications and increasingly adopted in RL. The CVaR objective focuses on the lower tail of the return distribution and is therefore more sensitive to rare but catastrophic outcomes. Various settings and RL approaches have been proposed to solve this problem (Petrik & Subramanian, 2012; Chow & Ghavamzadeh, 2014; Chow & Pavone, 2014; Tamar et al., 2015; Tamar et al., 2017; Huang & Haskell, 2020). Most of the proposed approaches, however, involve algorithms that are considerably more complicated than standard RL algorithms such as Q-learning (Watkins & Dayan, 1992) and its deep variants, e.g. DQN (Mnih et al., 2015).

Recently, the distributional approach to RL (Bellemare et al., 2017; Morimura et al., 2010) has received increased attention due to its ability to learn better policies than the standard approaches in many challenging tasks (Dabney et al., 2018a; b; Yang et al., 2019). Instead of learning a value function that provides the expected return of each state-action pair, the distributional approach learns the entire return distribution of each state-action pair.
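As a concrete illustration (not part of the paper's formal development), CVaR at level α is the expected return conditioned on the return falling in its worst α-fraction of outcomes. A minimal empirical estimate from sampled returns, with illustrative numbers chosen to include one rare catastrophic outcome:

```python
import numpy as np

def empirical_cvar(returns, alpha):
    """Empirical CVaR_alpha: the mean of the worst alpha-fraction of returns.

    Returns are interpreted as rewards (higher is better), so CVaR_alpha
    averages the lower tail of the sample and is therefore far more
    sensitive to rare bad outcomes than the plain mean.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the lower tail
    return returns[:k].mean()

# Five sampled returns; one is catastrophic.
returns = [10.0, 9.0, 8.0, 7.0, -50.0]
mean_return = np.mean(returns)            # -3.2: the mean hides the tail risk
cvar_20 = empirical_cvar(returns, 0.2)    # -50.0: CVaR exposes it
```

At α = 1 the estimator recovers the ordinary mean, so the CVaR level interpolates between risk-neutral and worst-case objectives.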
While this is computationally more costly, the approach itself is a simple extension of standard RL and is therefore easy to implement and able to leverage many of the advances in deep RL. Since the entire distribution is available, one naturally considers exploiting this information to optimize an objective other than the expectation. Dabney et al. (2018a) presented a simple way to do so for a family of risk measures including the CVaR. The theoretical properties of such an approach, however, are not clear. In particular, it is not clear whether the algorithm converges to any particular variant of CVaR-optimal policy. We address this issue in this work. Our main contribution is to first show that the algorithm proposed in (Dabney et al., 2018a) overestimates the dynamic, Markovian CVaR but can, empirically, be just as conservative. It has been demon-
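To make the risk-based action-selection strategy concrete, the following is a hypothetical sketch in the spirit of (Dabney et al., 2018a): given quantile estimates of each action's return distribution (as produced by a QR-DQN-style network), act greedily with respect to an approximate CVaR rather than the mean. Function names, shapes, and the quantile-averaging approximation are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def cvar_from_quantiles(quantiles, alpha):
    """Approximate CVaR_alpha from N equally weighted quantile estimates.

    quantiles: array of shape (N,), estimated quantiles of the return
    distribution. CVaR_alpha is approximated by averaging the quantiles
    that fall in the lower alpha-fraction.
    """
    n = len(quantiles)
    k = max(1, int(np.floor(alpha * n)))  # number of lower-tail quantiles kept
    return np.sort(quantiles)[:k].mean()

def risk_sensitive_greedy(quantiles_per_action, alpha):
    """Select the action with the largest estimated CVaR_alpha.

    quantiles_per_action: array of shape (num_actions, N), e.g. the output
    of a quantile network for a single state. With alpha = 1 this reduces
    to the usual greedy rule on the mean return.
    """
    cvars = [cvar_from_quantiles(q, alpha) for q in quantiles_per_action]
    return int(np.argmax(cvars))

# A high-variance action (0) versus a safe action (1):
q = np.array([[-10.0, 0.0, 10.0, 20.0],
              [  1.0, 2.0,  3.0,  4.0]])
risk_neutral_choice = risk_sensitive_greedy(q, 1.0)   # 0: higher mean
risk_averse_choice = risk_sensitive_greedy(q, 0.25)   # 1: better worst case
```

Applying the distributional Bellman optimality operator with this selection rule at every state is precisely the dynamic, Markovian recursion whose relationship to the static CVaR objective the paper analyzes.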

