DISTRIBUTIONAL REINFORCEMENT LEARNING FOR RISK-SENSITIVE POLICIES

Abstract

We address the problem of learning a risk-sensitive policy based on the CVaR risk measure using distributional reinforcement learning. In particular, we show that applying the distributional Bellman optimality operator with a risk-based action-selection strategy overestimates the dynamic, Markovian CVaR. The resulting policies can, however, still be overly conservative, and one often prefers instead to learn an optimal policy for the static, non-Markovian CVaR. To this end, we propose a modification to the existing algorithm and show that it can indeed learn a proper CVaR-optimized policy. Our proposed approach is a simple extension of standard distributional RL algorithms and can therefore take advantage of many of the recent advances in deep RL. On both synthetic and real data, we empirically show that our proposed algorithm produces a family of risk-averse policies that achieves a better tradeoff between risk and expected return.

1. INTRODUCTION

In standard reinforcement learning (RL) (Sutton & Barto, 2018), one seeks to learn a policy that maximizes an objective, usually the expected total discounted reward or the long-term average reward. In stochastic domains, especially when the level of uncertainty is high, maximizing the expectation may not be desirable, since the resulting solution may have high variance and occasionally perform badly. In such scenarios one may prefer a policy that is more risk-averse and avoids bad outcomes, even if its long-term average performance is slightly below optimal. In this work we consider optimizing the conditional value-at-risk (CVaR) (Rockafellar & Uryasev, 2000), a popular risk measure that is widely used in financial applications and increasingly in RL. The CVaR objective focuses on the lower tail of the return distribution and is therefore more sensitive to rare but catastrophic outcomes. Various settings and RL approaches have been proposed to solve this problem (Petrik & Subramanian, 2012; Chow & Ghavamzadeh, 2014; Chow & Pavone, 2014; Tamar et al., 2015; 2017; Huang & Haskell, 2020). Most of the proposed approaches, however, involve algorithms more complicated than standard RL algorithms such as Q-learning (Watkins & Dayan, 1992) and its deep variants, e.g. DQN (Mnih et al., 2015).

Recently, the distributional approach to RL (Bellemare et al., 2017; Morimura et al., 2010) has received increased attention due to its ability to learn better policies than the standard approaches in many challenging tasks (Dabney et al., 2018a;b; Yang et al., 2019). Instead of learning a value function that gives the expected return of each state-action pair, the distributional approach learns the entire return distribution of each state-action pair.
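To make the CVaR objective concrete: CVaR at level α ∈ (0, 1] is the expected return conditioned on the return falling in its worst α-fraction. The following minimal sketch (in Python; the helper name `cvar` is ours, not from any library) computes an empirical CVaR from a sample of returns:

```python
import numpy as np

def cvar(returns, alpha):
    """Empirical CVaR_alpha: mean of the worst alpha-fraction of returns.

    VaR_alpha is the alpha-quantile of the return distribution; CVaR_alpha
    is the expectation of returns in the lower tail up to that quantile,
    so smaller alpha focuses on rarer, more catastrophic outcomes.
    """
    returns = np.sort(np.asarray(returns, dtype=float))
    k = max(1, int(np.ceil(alpha * len(returns))))  # size of the lower tail
    return returns[:k].mean()
```

For instance, `cvar([1, 2, 3, 4], 0.5)` averages the two worst returns and gives 1.5, while α = 1 recovers the ordinary mean.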
While this is computationally more costly, the approach itself is a simple extension of standard RL; it is therefore easy to implement and able to leverage many of the advances in deep RL. Since the entire distribution is available, it is natural to exploit this information to optimize an objective other than the expectation. Dabney et al. (2018a) presented a simple way to do so for a family of risk measures that includes the CVaR. The theoretical properties of this approach, however, are not clear. In particular, it is not clear whether the algorithm converges to any particular variant of CVaR-optimal policy. We address this issue in this work.

Our main contribution is to first show that the algorithm proposed in (Dabney et al., 2018a) overestimates the dynamic, Markovian CVaR, yet empirically can be just as conservative. It has been demonstrated that this variant of CVaR can be overly conservative in many scenarios (Tamar et al., 2017; Yu et al., 2017), and one may prefer the static CVaR as the objective instead. Our second contribution is a modified algorithm that helps achieve this. Empirically, we show that the proposed approach learns policies that perform better in terms of the overall CVaR objective on both synthetic and real-world problems.

We close the introduction with pointers to related work. We formally present our problem setup as well as our main analytical results in Section 2. Section 3 describes our proposed algorithm, and Section 4 presents our empirical results.
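The kind of risk-based action selection discussed above can be sketched as follows, assuming a QR-DQN-style critic that outputs N equally weighted quantile estimates of the return per action (the function and array names here are ours, for illustration only):

```python
import numpy as np

def cvar_greedy_action(quantiles, alpha):
    """Pick the action maximizing CVaR_alpha of its predicted return quantiles.

    `quantiles` has shape (num_actions, N): for each action, N equally
    weighted quantile estimates of its return distribution. Risk-neutral
    greedy selection would average all N quantiles; here we average only
    the lowest ceil(alpha * N) of them.
    """
    q = np.sort(np.asarray(quantiles, dtype=float), axis=1)
    k = max(1, int(np.ceil(alpha * q.shape[1])))
    cvar_per_action = q[:, :k].mean(axis=1)
    return int(np.argmax(cvar_per_action))
```

With α = 1 this reduces to the usual risk-neutral greedy rule over the predicted mean; shrinking α makes the agent increasingly averse to actions with heavy lower tails.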

1.1. RELATED WORK

The literature on distributional RL has expanded greatly in recent years (Morimura et al., 2010; Bellemare et al., 2017; Barth-Maron et al., 2018; Dabney et al., 2018a;b; Yang et al., 2019). Most of these works focus on modeling aspects, such as the choice of representation for the value distributions. The approach has been used to enhance exploration in RL (Mavrin et al., 2019; Zhang & Yao, 2019) and in risk-sensitive applications (Wang et al., 2019; Bernhard et al., 2019). Solving Markov decision processes (MDPs) with risk-sensitive objectives has been addressed in many works (Howard & Matheson, 1972; Ruszczynski, 2010; Bäuerle & Ott, 2011), including RL approaches (Borkar, 2001; Tamar et al., 2012; L.A. & Ghavamzadeh, 2013). In particular, Chow & Ghavamzadeh (2014); Tamar et al. (2015) deal with the static CVaR objective, while Petrik & Subramanian (2012); Chow & Pavone (2014) deal with the dynamic CVaR objective. Tamar et al. (2017) proposed a policy-gradient approach that handles both the static and the dynamic CVaR objectives. Closest to ours is the work by Stanko & Macek (2019). Their approach also makes use of distributional RL, but it is not clear whether their action-selection strategy properly optimizes either the static or the dynamic CVaR.

2. PROBLEM SETUP AND MAIN RESULTS

We consider a discrete-time Markov decision process (MDP) with state space $X$ and action space $A$. For simplicity we assume that $X$ and $A$ are finite, although our results and algorithm can be readily extended to more general state-action spaces. We assume that the rewards are bounded and drawn from a countable set $R \subset \mathbb{R}$. For any $t \in \{0, 1, \ldots\}$ and states $x_t, x_{t+1} \in X$, the probability of receiving reward $r_t \in R$ and transitioning to $x_{t+1}$ after executing $a_t \in A$ in $x_t$ is given by $p(r_t, x_{t+1} \mid x_t, a_t)$. Without loss of generality we assume a fixed initial state $x_0$, unless stated otherwise. A policy is a mapping $\pi : H \to P(A)$, where $H$ is the set of all histories $h_t := (x_0, a_0, r_0, x_1, a_1, r_1, \ldots, x_t)$ and $P(A)$ is the space of distributions over $A$. The expected total discounted reward of $\pi$ is
$$V^\pi := \mathbb{E}^\pi_p\left[\sum_{t=0}^{\infty} \gamma^t r_t\right],$$
where $\gamma \in (0, 1)$ is a discount factor. The superscript $\pi$ in the expectation indicates that the actions $a_t$ are drawn from $\pi(h_t)$; the subscript $p$ indicates that the rewards and state transitions are induced by $p$. In standard RL, we aim to find a policy that maximizes $V^\pi$. It is well known that there exists a deterministic stationary policy $\pi : X \to A$, whose decisions depend only on the current state, that attains the optimal $V^\pi$, and therefore one typically works in the space of stationary deterministic policies. Key to a dynamic-programming solution of this problem is the state-action value function
$$Q^\pi(x, a) := \mathbb{E}^\pi_p\left[\sum_{t=0}^{\infty} \gamma^t r_t \,\Big|\, x_0 = x, a_0 = a\right],$$
which satisfies the Bellman equation
$$\forall x, a, \quad Q^\pi(x, a) = \sum_{r, x'} p(r, x' \mid x, a)\left[r + \gamma Q^\pi(x', \pi(x'))\right]. \quad (1)$$
The optimal value $Q^*(x, a) := Q^{\pi^*}(x, a)$ for any optimal policy $\pi^*$ satisfies the Bellman optimality equation
$$\forall x, a, \quad Q^*(x, a) = \sum_{r, x'} p(r, x' \mid x, a)\left[r + \gamma \max_{a'} Q^*(x', a')\right]. \quad (2)$$
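The Bellman optimality equation above is the basis of value iteration. A minimal tabular sketch, using a hypothetical two-state, two-action MDP (since the backup only needs the expectation over rewards, we summarize $p$ by transition probabilities `P[x, a, x2]` and expected one-step rewards `Rw[x, a]`):

```python
import numpy as np

# Hypothetical MDP: P[x, a, x2] = transition probability, Rw[x, a] = E[r | x, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
Rw = np.array([[1.0, 0.0],
               [0.5, 2.0]])
gamma = 0.9

# Value iteration: repeatedly apply the Bellman optimality operator.
Q = np.zeros((2, 2))
for _ in range(1000):
    V = Q.max(axis=1)        # max_{a'} Q(x', a')
    Q = Rw + gamma * P @ V   # backup: E[r] + gamma * E[V(x')]
pi = Q.argmax(axis=1)        # greedy (optimal) stationary policy
```

Since the operator is a $\gamma$-contraction in the sup norm, the iterates converge to the unique fixed point $Q^*$, and the greedy policy with respect to $Q^*$ is optimal.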

