DISTRIBUTIONAL REINFORCEMENT LEARNING FOR RISK-SENSITIVE POLICIES

Abstract

We address the problem of learning a risk-sensitive policy based on the CVaR risk measure using distributional reinforcement learning. In particular, we show that applying the distributional Bellman optimality operator with respect to a risk-based action-selection strategy overestimates the dynamic, Markovian CVaR. The resulting policies can, however, still be overly conservative, and one often prefers to learn an optimal policy based on the static, non-Markovian CVaR. To this end, we propose a modification to the existing algorithm and show that it can indeed learn a proper CVaR-optimized policy. Our proposed approach is a simple extension of standard distributional RL algorithms and can therefore take advantage of many of the recent advances in deep RL. On both synthetic and real data, we empirically show that our proposed algorithm is able to produce a family of risk-averse policies that achieve a better tradeoff between risk and expected return.

1. INTRODUCTION

In standard reinforcement learning (RL) (Sutton & Barto, 2018), one seeks to learn a policy that maximizes an objective, usually the expected total discounted reward or the long-term average reward. In stochastic domains, especially when the level of uncertainty is high, maximizing the expectation may not be the most desirable goal, since the solution may have high variance and occasionally perform badly. In such scenarios one may choose to learn a policy that is more risk-averse and avoids bad outcomes, even if its long-term average performance is slightly below optimal. In this work we consider optimizing the conditional value-at-risk (CVaR) (Rockafellar & Uryasev, 2000), a popular risk measure that is widely used in financial applications and increasingly in RL. The CVaR objective focuses on the lower tail of the return and is therefore more sensitive to rare but catastrophic outcomes. Various settings and RL approaches have been proposed to solve this problem (Petrik & Subramanian, 2012; Chow & Ghavamzadeh, 2014; Chow & Pavone, 2014; Tamar et al., 2015; 2017; Huang & Haskell, 2020). Most of the proposed approaches, however, involve algorithms more complicated than standard RL algorithms such as Q-learning (Watkins & Dayan, 1992) and its deep variants, e.g., DQN (Mnih et al., 2015). Recently, the distributional approach to RL (Morimura et al., 2010; Bellemare et al., 2017) has received increased attention due to its ability to learn better policies than the standard approaches in many challenging tasks (Dabney et al., 2018a;b; Yang et al., 2019). Instead of learning a value function that provides the expected return of each state-action pair, the distributional approach learns the entire return distribution of each state-action pair.
While this is computationally more costly, the approach itself is a simple extension of standard RL; it is therefore easy to implement and able to leverage many of the advances in deep RL. Since the entire distribution is available, one naturally considers exploiting this information to optimize an objective other than the expectation. Dabney et al. (2018a) presented a simple way to do so for a family of risk measures including the CVaR. The theoretical properties of this approach, however, are not clear. In particular, it is not clear whether the algorithm converges to any particular variant of CVaR-optimal policy. We address this issue in this work. Our main contribution is to first show that the algorithm proposed in Dabney et al. (2018a) overestimates the dynamic, Markovian CVaR but empirically can be just as conservative. It has been demonstrated that this variant of CVaR can be overly conservative in many scenarios (Tamar et al., 2017; Yu et al., 2017), and one may prefer the static CVaR as the objective instead. Our second contribution is a modified algorithm that helps achieve this. Empirically, we show that the proposed approach learns policies that perform better in terms of the overall CVaR objective on both synthetic and real-world problems. We close the introduction with a review of related work. We formally present our problem setup and main analytical results in Section 2; Section 3 describes our proposed algorithm; and Section 4 presents our empirical results.

1.1. RELATED WORKS

The literature on distributional RL has expanded greatly in recent years (Morimura et al., 2010; Bellemare et al., 2017; Barth-Maron et al., 2018; Dabney et al., 2018a;b; Yang et al., 2019). Most of these works focus on modeling aspects, such as the choice of representation for the value distributions. The approach has been used to enhance exploration in RL (Mavrin et al., 2019; Zhang & Yao, 2019) and in risk-sensitive applications (Wang et al., 2019; Bernhard et al., 2019). Solving Markov decision processes (MDPs) with risk-sensitive objectives has been addressed in many works (Howard & Matheson, 1972; Ruszczynski, 2010; Bäuerle & Ott, 2011), including RL approaches (Borkar, 2001; Tamar et al., 2012; L.A. & Ghavamzadeh, 2013). In particular, Chow & Ghavamzadeh (2014) and Tamar et al. (2015) deal with static CVaR objectives, while Petrik & Subramanian (2012) and Chow & Pavone (2014) deal with dynamic CVaR objectives. Tamar et al. (2017) proposed a policy-gradient approach that handles both the static and the dynamic CVaR objectives. Closest to ours is the work of Stanko & Macek (2019). Their approach also makes use of distributional RL, but it is not clear whether their action-selection strategy properly optimizes either the static or the dynamic CVaR.

2. PROBLEM SETUP AND MAIN RESULTS

We consider a discrete-time Markov decision process with state space $X$ and action space $A$. For simplicity we assume that $X$ and $A$ are finite, although our results and algorithm can be readily extended to more general state-action spaces. We assume that the rewards are bounded and drawn from a countable set $R \subset \mathbb{R}$. Given states $x_t, x_{t+1} \in X$ for any $t \in \{0, 1, \ldots\}$, the probability of receiving reward $r_t \in R$ and transitioning to $x_{t+1}$ after executing $a_t \in A$ in $x_t$ is given by $p(r_t, x_{t+1} \mid x_t, a_t)$. Without loss of generality we assume a fixed initial state $x_0$, unless stated otherwise. Given a policy $\pi : H \to P(A)$, where $H$ is the set of all histories $h_t := (x_0, a_0, r_0, x_1, a_1, r_1, \ldots, x_t) \in H$ and $P(A)$ is the space of distributions over $A$, its expected total discounted reward over time is given by $V^\pi := \mathbb{E}^\pi_p\left[\sum_{t=0}^{\infty} \gamma^t r_t\right]$, where $\gamma \in (0,1)$ is a discount factor. The superscript $\pi$ in the expectation indicates that the actions $a_t$ are drawn from $\pi(h_t)$; the subscript $p$ indicates that the rewards and state transitions are induced by $p$. In standard RL, we aim to find a policy that maximizes $V^\pi$. It is well known that there exists a deterministic stationary policy $\pi : X \to A$, whose decisions depend only on the current state, that attains the optimal $V^\pi$; one therefore typically works in the space of stationary deterministic policies. Key to a dynamic-programming solution of the above problem is the value function $Q^\pi(x,a) := \mathbb{E}^\pi_p\left[\sum_{t=0}^{\infty} \gamma^t r_t \mid x_0 = x, a_0 = a\right]$, which satisfies the Bellman equation

$$Q^\pi(x,a) = \sum_{r,x'} p(r,x' \mid x,a)\,\left[r + \gamma Q^\pi(x', \pi(x'))\right] \quad \forall x, a. \tag{1}$$

The optimal value $Q^*(x,a) := Q^{\pi^*}(x,a)$ for any optimal policy $\pi^*$ satisfies the Bellman optimality equation

$$Q^*(x,a) = \sum_{r,x'} p(r,x' \mid x,a)\left[r + \gamma \max_{a'} Q^*(x', a')\right] \quad \forall x, a. \tag{2}$$
Furthermore, for any Q-function $Q \in \mathcal{Q} := \{q : X \times A \to \mathbb{R} \mid q(x,a) < \infty\ \forall x, a\}$, one can show that the operator $T^\pi$ defined by $T^\pi Q(x,a) := \sum_{r,x'} p(r,x' \mid x,a)\,[r + \gamma Q(x', \pi(x'))]$ is a $\gamma$-contraction in the sup-norm $\|Q\|_\infty := \max_{x,a} |Q(x,a)|$, with fixed point satisfying (1). One can therefore start with an arbitrary Q-function and repeatedly apply $T^\pi$, or its stochastic approximation, to learn $Q^\pi$. An analogous operator $T$ can also be shown to be a $\gamma$-contraction with fixed point satisfying (2).

2.1. STATIC AND DYNAMIC CVAR

The expected return $V^\pi$ is risk-neutral in the sense that it does not take into account the inherent variability of the return. In many application scenarios, one may prefer a policy that is more risk-averse, with better sensitivity to bad outcomes. In this work, we focus on the conditional value-at-risk (CVaR), a popular risk measure that satisfies the properties of being coherent (Artzner et al., 1999). The $\alpha$-level CVaR of a real-valued random variable $Z$, for $\alpha \in (0, 1]$, is given by (Rockafellar & Uryasev, 2000)

$$C_\alpha(Z) := \max_{s \in \mathbb{R}}\ s - \frac{1}{\alpha}\,\mathbb{E}[(s - Z)_+],$$

where $(x)_+ = \max\{x, 0\}$. Note that we are concerned with $Z$ representing returns (the higher, the better), so this version of CVaR focuses on the lower tail of the distribution. In particular, the function $s \mapsto s - \frac{1}{\alpha}\mathbb{E}[(s - Z)_+]$ is concave in $s$, and the maximum is always attained at the $\alpha$-level quantile, defined as $q_\alpha(Z) := \inf\{s : \Pr(Z \le s) \ge \alpha\}$. For $\alpha = 1$, $C_\alpha$ reduces to the standard expectation. When $Z$ is absolutely continuous, we have the intuitive $C_\alpha(Z) = \mathbb{E}[Z \mid Z < q_\alpha]$. Our target random variable is the total discounted return $Z^\pi := \sum_{t=0}^{\infty} \gamma^t r_t$ of a policy $\pi$, and our objective is to find a policy that maximizes $C_\alpha(Z^\pi)$, where the optimal CVaR is given by

$$\max_\pi \max_s\ s - \frac{1}{\alpha}\,\mathbb{E}^\pi_p[(s - Z^\pi)_+]. \tag{3}$$

In the context where $Z$ is accumulated over multiple time steps, objective (3) corresponds to maximizing the so-called static CVaR. This objective is time-inconsistent in the sense that the optimal policy may be history-dependent and therefore non-Markov. This is, however, perfectly expected, since the optimal behavior in later time steps may depend on how much reward has been accumulated thus far: more risky actions can be taken if one has already collected a sufficiently large total reward, and vice versa. From the point of view of dynamic programming, an alternative, time-consistent or Markovian version of CVaR may be more convenient.
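The Rockafellar-Uryasev form above translates directly into a sample-based estimator; the following sketch (function name is ours) computes $C_\alpha$ from Monte-Carlo return samples by plugging the empirical $\alpha$-quantile into the maximand:

```python
import numpy as np

def cvar(samples, alpha):
    # C_alpha(Z) = max_s s - E[(s - Z)_+] / alpha; the maximizer is the
    # alpha-level quantile q_alpha(Z), so we plug it in directly.
    s = np.quantile(samples, alpha)
    return s - np.mean(np.maximum(s - samples, 0.0)) / alpha
```

For $\alpha = 1$ this recovers the sample mean, while for small $\alpha$ it averages the lower tail, matching the intuition $C_\alpha(Z) = \mathbb{E}[Z \mid Z < q_\alpha]$.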
A class of such risk measures was proposed by Ruszczynski (2010), and we shall refer to this version of CVaR as the dynamic CVaR,¹ defined recursively as

$$D^\pi_{\alpha,0}(x,a) := C_\alpha[r_t \mid x_t = x, a_t = a] \quad \forall \pi, x, a,$$
$$D^\pi_{\alpha,T}(x,a) := C_\alpha\left[r_t + \gamma D^\pi_{\alpha,T-1}(x_{t+1}, \pi(x_{t+1})) \,\middle|\, x_t = x, a_t = a\right] \quad \forall \pi, x, a,\ T > 0,$$
$$D^\pi_\alpha(x,a) := \lim_{T \to \infty} D^\pi_{\alpha,T}(x,a) \quad \forall \pi, x, a.$$

It can be shown (Ruszczynski, 2010) that there exists a stationary deterministic optimal policy $\pi^*$, maximizing $D^\pi_\alpha(x,a)$ for all $x, a$, whose dynamic CVaR is given by $D^*_\alpha := D^{\pi^*}_\alpha$. In particular, the operator $T_{D,\alpha}$ defined by

$$T_{D,\alpha} D(x,a) := C_\alpha\left[r_t + \gamma \max_{a'} D(x_{t+1}, a') \,\middle|\, x_t = x, a_t = a\right] \tag{4}$$

for $D \in \mathcal{Q}$ is a $\gamma$-contraction in sup-norm, with fixed point satisfying

$$D^*_\alpha(x,a) = C_\alpha\left[r_t + \gamma \max_{a'} D^*_\alpha(x_{t+1}, a') \,\middle|\, x_t = x, a_t = a\right] \quad \forall x, a. \tag{5}$$

The dynamic CVaR, however, can be overly conservative in many cases; we illustrate this with empirical results in Section 4. In such cases it may be preferable to use the static CVaR. Bäuerle & Ott (2011) suggest an iterative process that can be used to solve for the optimal static CVaR policy, based on (3):

1. For a fixed $s$, solve for the optimal policy with respect to $\max_\pi \mathbb{E}[-(s - Z^\pi)_+]$.
2. For a fixed $\pi$, the optimal $s$ is given by the $\alpha$-level quantile of $Z^\pi$.
3. Repeat until convergence.

Step 1 can be done by solving an augmented MDP with states $\bar{x} = (x, s) \in X \times \mathbb{R}$, where $s$ is a moving threshold keeping track of the rewards accumulated so far. In particular, this MDP has no rewards, and its state transitions are given by $\bar{p}\big(0, (x', \tfrac{s-r}{\gamma}) \mid (x, s), a\big) := p(r, x' \mid x, a)$. Solving this augmented MDP directly using RL, however, can result in poor sample efficiency, since each example $(x, a, r, x')$ may need to be experienced many times under different thresholds $s$. In this work, we propose an alternative solution using the approach of distributional RL.

2.2. DISTRIBUTIONAL RL

In standard RL, one typically learns the value $Q^\pi(x,a)$ for each $(x,a)$ through some form of temporal-difference learning (Sutton & Barto, 2018). In distributional RL (Bellemare et al., 2017), one instead tries to learn the entire distribution of the possible future return $Z^\pi(x,a)$ for each $(x,a)$. The Q-value can then be extracted by simply taking the expectation $Q^\pi(x,a) = \mathbb{E}[Z^\pi(x,a)]$. The objects of learning are distribution functions $U \in \mathcal{Z} := \{Z : X \times A \to P(\mathbb{R}) \mid \mathbb{E}[|Z(x,a)|^q] < \infty\ \forall x, a,\ q \ge 1\}$. For any state-action pair $(x,a)$, we use $U(x,a)$ to denote a random variable with the respective distribution. Let $\widetilde{T}^\pi$ be the distributional Bellman operator on $\mathcal{Z}$ such that

$$\widetilde{T}^\pi U(x,a) \stackrel{D}{=} R + \gamma U(X', \pi(X')),$$

where $\stackrel{D}{=}$ denotes equality in distribution, generated by the random variables $R, X'$ induced by $p(r, x' \mid x, a)$. We use the notation $\widetilde{T}$ instead of $T$ when referring to a distributional operator, where $\widetilde{T}^\pi U(x,a)$ is a random variable. Bellemare et al. (2017) show that $\widetilde{T}^\pi$ is a $\gamma$-contraction on $\mathcal{Z}$ in the distance metric $d(U, V) := \sup_{x,a} W(U(x,a), V(x,a))$, where $W$ is the 1-Wasserstein distance between the distributions of $U(x,a)$ and $V(x,a)$. Furthermore, the operator $\widetilde{T}$ defined by

$$\widetilde{T} U(x,a) \stackrel{D}{=} R + \gamma U(X', A'), \quad A' = \arg\max_{a'} \mathbb{E}[U(X', a')] \tag{6}$$

can be shown to be a $\gamma$-contraction in $\mathcal{Q}$ in sup-norm under element-wise expectation, i.e., $\|\mathbb{E}\widetilde{T}U - \mathbb{E}\widetilde{T}V\|_\infty \le \gamma \|\mathbb{E}U - \mathbb{E}V\|_\infty$, where $\mathbb{E}\widetilde{T}U \in \mathcal{Q}$ is defined by $\mathbb{E}\widetilde{T}U(x,a) := \mathbb{E}[\widetilde{T}U(x,a)]$, and $\mathbb{E}U$, $\mathbb{E}V$, $\mathbb{E}\widetilde{T}V$ are all defined similarly. In general, $\widetilde{T}$ is not expected to be a contraction in the space of distributions, for the obvious reason that multiple optimal policies can have very different distributions of the total return even though they all have the same expected total return. Since one keeps the full distribution instead of just the expectation, a natural way to exploit this is to extract more than just the expectation from each distribution.
In particular, in (6) one can select the action $a'$ based on measures other than the expectation $\mathbb{E}[U(x', a')]$. This is done by Dabney et al. (2018a), where a distortion measure on the expectation is used to select actions according to various risk measures on $U(x', a')$, including the CVaR. If we replace $\mathbb{E}[U(x', a')]$ with $C_\alpha[U(x', a')]$, one may guess that the iteration converges to the optimal dynamic CVaR policy satisfying (5). This is, however, not true in general. We now show that choosing actions using $C_\alpha[U(x', a')]$ results in overestimating the dynamic CVaR values $D^\pi_\alpha$ and $D^*_\alpha$.

Proposition 1. Let $U \in \mathcal{Z}$, and let $C_\alpha[U] \in \mathcal{Q}$ be defined by $C_\alpha[U](x,a) := C_\alpha[U(x,a)]$. Let $T_{D,\alpha}$ be as defined in (4). The distributional Bellman operator $\widetilde{T}_{D,\alpha}$ given by

$$\widetilde{T}_{D,\alpha} U(x,a) \stackrel{D}{=} R + \gamma U(X', A'), \quad A' = A(X') := \arg\max_{a'} C_\alpha[U(X', a')]$$

satisfies

$$C_\alpha[\widetilde{T}_{D,\alpha} U(x,a)] \ge (T_{D,\alpha} C_\alpha[U])(x,a) \quad \forall x, a.$$

Similarly, for a fixed $\pi$, we have $C_\alpha[\widetilde{T}^\pi_{D,\alpha} U(x,a)] \ge (T^\pi_{D,\alpha} C_\alpha[U])(x,a)$ for all $x, a$.

Proof. We use the properties of CVaR as a coherent risk measure (Artzner et al., 1999). In particular, $C_\alpha(Z)$ is concave in $Z$ (recall that we use the lower-tail version), i.e.,

$$C_\alpha(\lambda Z_1 + (1 - \lambda) Z_2) \ge \lambda C_\alpha(Z_1) + (1 - \lambda) C_\alpha(Z_2) \quad \forall \lambda \in [0, 1],$$

and satisfies both translation invariance and positive homogeneity:

$$C_\alpha(\lambda Z + c) = \lambda C_\alpha(Z) + c \quad \forall c, \lambda \in \mathbb{R},\ \lambda \ge 0.$$

We then have

$$C_\alpha[\widetilde{T}_{D,\alpha} U(x,a)] = C_\alpha\left[\sum_{r,x'} p(r,x' \mid x,a)\,[r + \gamma U(x', A(x'))]\right]$$
$$\stackrel{(a)}{\ge} \sum_{r,x'} p(r,x' \mid x,a)\, C_\alpha[r + \gamma U(x', A(x'))]$$
$$\stackrel{(b)}{=} \sum_{r,x'} p(r,x' \mid x,a)\left(r + \gamma C_\alpha[U(x', A(x'))]\right)$$
$$= \mathbb{E}\left[R + \gamma C_\alpha[U(X', A(X'))]\right]$$
$$\stackrel{(c)}{\ge} C_\alpha\left[R + \gamma C_\alpha[U(X', A(X'))]\right] = (T_{D,\alpha} C_\alpha[U])(x,a),$$

where we use the coherence properties of $C_\alpha$ in (a) and (b), and in (c) the fact that the expectation $\mathbb{E}$ is $C_\alpha$ for $\alpha = 1$ and upper-bounds all other $C_\alpha$. The proof for the fixed-$\pi$ case is analogous. ∎
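With the quantile representation used later in Section 3, the $C_\alpha$-greedy selection analyzed in Proposition 1 reduces to averaging the quantile estimates below level $\alpha$. A minimal sketch, with our own function names and assuming sorted quantile estimates at levels $\tau_i = (i - 0.5)/N$:

```python
import numpy as np

def cvar_from_quantiles(theta, alpha):
    # Approximate C_alpha[U] from N quantile estimates theta at levels
    # tau_i = (i - 0.5) / N by averaging the quantiles with tau_i <= alpha.
    n = len(theta)
    tau = (np.arange(n) + 0.5) / n
    mask = tau <= alpha
    return theta[mask].mean() if mask.any() else theta[0]

def cvar_greedy_action(theta_x, alpha):
    # Dabney et al. (2018a)-style selection: argmax_a C_alpha[U(x', a)],
    # where theta_x has shape (num_actions, N).
    return int(np.argmax([cvar_from_quantiles(t, alpha) for t in theta_x]))
```

For small $\alpha$ this selection prefers actions with tight lower tails even when their mean is lower, which is exactly the behavior that can become overly conservative under the dynamic interpretation.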
It is easy to construct examples where the inequalities in Proposition 1 are strict and where $\widetilde{T}_{D,\alpha}$ converges to a policy different from the optimal $D^*_\alpha$ policy. Unfortunately, we observe empirically that $\widetilde{T}_{D,\alpha}$ still results in policies that are closer to those optimizing the dynamic CVaR rather than the static CVaR. It is now natural to ask whether we can optimize for the static CVaR instead while still staying within the framework of distributional RL. Recall that it is possible to optimize for the static CVaR by solving an augmented MDP as part of an iterative process. Instead of explicitly augmenting the state space, we rely on the distributions $U \in \mathcal{Z}$ to implicitly "store" the information needed. This approach makes the most of each transition example from $(x,a)$, since it updates an entire distribution, and thereby indirectly the entire set of states $(x, s)$ for all $s$ in the augmented MDP. For this, the action-selection strategy in (6) plays a critical role. Given $U \in \mathcal{Z}$ and $s \in \mathbb{R}$, define $\zeta(U, s) \in \mathcal{Q}$ by $\zeta(U, s)(x,a) := \mathbb{E}[-(s - U(x,a))_+]$. We define a distributional Bellman operator for the threshold $s$ as follows:

$$\widetilde{T}_s U(x,a) \stackrel{D}{=} R + \gamma U(X', A'), \quad A' = A^U_s(R, X') := \arg\max_{a'} \zeta\!\left(U, \frac{s - R}{\gamma}\right)\!(X', a').$$

The following result shows that, at least for a fixed target threshold, improvement is guaranteed after each application of $\widetilde{T}_s$.

Proposition 2. For any $U, V \in \mathcal{Z}$ and any $s \in \mathbb{R}$,

$$\left\|\zeta(\widetilde{T}_s U, s) - \zeta(\widetilde{T}_s V, s)\right\|_\infty \le \gamma \sup_{s'} \left\|\zeta(U, s') - \zeta(V, s')\right\|_\infty.$$

Proof.
For each $(x,a)$,

$$\zeta(\widetilde{T}_s U, s)(x,a) - \zeta(\widetilde{T}_s V, s)(x,a)$$
$$= \sum_{r,x'} p(r,x' \mid x,a)\left[-\int \big(s - (r + \gamma u)\big)_+\, dU(x', A^U_s(r,x')) + \int \big(s - (r + \gamma v)\big)_+\, dV(x', A^V_s(r,x'))\right]$$
$$= \gamma \sum_{r,x'} p(r,x' \mid x,a)\left[-\int \left(\tfrac{s-r}{\gamma} - u\right)_+ dU(x', A^U_s(r,x')) + \int \left(\tfrac{s-r}{\gamma} - v\right)_+ dV(x', A^V_s(r,x'))\right]$$
$$= \gamma \sum_{r,x'} p(r,x' \mid x,a)\left[\max_{a'} \zeta\!\left(U, \tfrac{s-r}{\gamma}\right)\!(x', a') - \max_{a'} \zeta\!\left(V, \tfrac{s-r}{\gamma}\right)\!(x', a')\right]$$
$$\stackrel{(a)}{\le} \gamma \sum_{r,x'} p(r,x' \mid x,a)\, \max_{a'} \left|\zeta\!\left(U, \tfrac{s-r}{\gamma}\right)\!(x', a') - \zeta\!\left(V, \tfrac{s-r}{\gamma}\right)\!(x', a')\right|$$
$$\le \gamma \sup_{x', a', s'} \left|\zeta(U, s')(x', a') - \zeta(V, s')(x', a')\right|,$$

where in (a) we use the triangle inequality and the fact that $|\max_a f(a) - \max_a g(a)| \le \max_a |f(a) - g(a)|$. ∎

Since we only keep one distribution for each $(x,a)$, we can only apply $\widetilde{T}_s$ for a single chosen $s$ during each update. Applying $\widetilde{T}_s$ can potentially change $\zeta(\widetilde{T}_s U, s')(x,a)$ for any other $s'$, and there is no guarantee that a similar improvement happens for these $s'$. Recall that we seek to optimize the long-term CVaR, where the "optimal" $s$ is actually the $\alpha$-quantile of the long-term return. We therefore propose the following operator:

$$\widetilde{T}_\alpha U(x,a) \stackrel{D}{=} \widetilde{T}_{q_\alpha(U(x,a))}\, U(x,a) \quad \forall x, a.$$

This can be easily implemented through distributional RL, which we describe in the next section.
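The statistic $\zeta(U, s)$ and the threshold-aware selection inside $\widetilde{T}_s$ are straightforward to estimate from a quantile representation. A sketch under our own naming:

```python
import numpy as np

def zeta(theta, s):
    # zeta(U, s) = E[-(s - U)_+], estimated from quantile samples theta.
    return -np.mean(np.maximum(s - theta, 0.0))

def threshold_greedy_action(theta_next, s, r, gamma):
    # Selection used by the operator T_s:
    # a' = argmax_a zeta(U, (s - r)/gamma)(x', a'),
    # where theta_next has shape (num_actions, N).
    s_next = (s - r) / gamma
    return int(np.argmax([zeta(t, s_next) for t in theta_next]))
```

Only return mass below the shifted threshold is penalized, so the selected action is the one least likely to fall short of the running target, regardless of how much upside it has.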

3. ALGORITHM

Our proposed algorithm is based on distributional Q-learning using quantile regression (Dabney et al., 2018b); it can be easily adapted to any other variant of distributional RL. Algorithm 1 shows the main procedure for computing the loss over a mini-batch containing $m$ transition samples. Here, each distribution $\theta(x,a)$ is represented by $N$ quantiles $\theta = (\theta_1, \ldots, \theta_N)$, each corresponding to a quantile level $\tau_i = \frac{i - 0.5}{N}$; the quantile function $q_\alpha(\theta)$ can therefore be easily extracted from $\theta$. The loss function is based on quantile regression, with $\rho_\tau(u) = u(\tau - \delta_{u<0})$, where $\delta_{u<0} = 1$ if $u < 0$ and $0$ otherwise. The key difference from ordinary quantile-regression distributional Q-learning is our target action-selection strategy for choosing $a'_k$ (steps 1(a) and 1(b)). For other implementation details, we refer the reader to Dabney et al. (2018b).

Algorithm 1: Quantile-regression distributional Q-learning for static CVaR
Input: $\gamma$, $\alpha$, $\theta$, $\theta'$, mini-batch $(x_k, a_k, r_k, x'_k)$ for $k = 1, \ldots, m$
1. For each $k = 1, \ldots, m$:
   (a) $s_k \leftarrow q_\alpha(\theta(x_k, a_k))$
   (b) $a'_k \leftarrow \arg\max_{a'} \zeta\!\left(\theta', \frac{s_k - r_k}{\gamma}\right)\!(x'_k, a')$
   (c) $\widetilde{T}\theta_j(x_k, a_k) \leftarrow r_k + \gamma \theta'_j(x'_k, a'_k)$
2. $L \leftarrow \frac{1}{m} \sum_{k=1}^{m} \frac{1}{N^2} \sum_{i,j} \rho_{\tau_i}\!\left(\widetilde{T}\theta_j(x_k, a_k) - \theta_i(x_k, a_k)\right)$
3. Output $\nabla_\theta L$.

The execution of a policy defined by $\theta$ requires an additional piece of state information $s$, which summarizes the rewards collected so far. This is not part of the MDP state $x$ and can easily be updated after observing each new reward; at the start of a new episode, $s$ is reset. The complete procedure for executing the policy over one episode is given in Algorithm 2.
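For concreteness, here is a tabular NumPy sketch of the loss computation in Algorithm 1 (the array layout and function name are our own; a deep-RL implementation would replace the tables with a network and differentiate the loss automatically):

```python
import numpy as np

def qr_loss_static_cvar(theta, theta_t, batch, gamma, alpha):
    # theta, theta_t: online and target quantile tables of shape (X, A, N).
    # batch: iterable of transitions (x, a, r, x_next).
    n_actions, n = theta.shape[1], theta.shape[2]
    tau = (np.arange(n) + 0.5) / n
    loss = 0.0
    for x, a, r, xn in batch:
        s = np.quantile(theta[x, a], alpha)                     # step 1(a)
        zeta = [-np.mean(np.maximum((s - r) / gamma - theta_t[xn, b], 0.0))
                for b in range(n_actions)]                      # zeta(theta', (s-r)/gamma)
        an = int(np.argmax(zeta))                               # step 1(b)
        target = r + gamma * theta_t[xn, an]                    # step 1(c), shape (N,)
        u = target[None, :] - theta[x, a][:, None]              # pairwise TD errors
        loss += np.mean(np.abs(tau[:, None] - (u < 0)) * np.abs(u))  # rho_tau loss
    return loss / len(batch)
```

The quantile loss uses the identity $\rho_\tau(u) = |u| \cdot |\tau - \delta_{u<0}|$; in practice one would substitute the Huberized $\rho^\kappa_\tau$ of Dabney et al. (2018b) for smoother gradients.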

4. EMPIRICAL RESULTS

We implement Algorithms 1 and 2 and represent our policies using a neural network with two hidden layers and ReLU activations. All our experiments use Adam as the stochastic gradient optimizer with learning rate 0.0001. For each action, the output consists of N = 100 quantile values. The complete code to reproduce our results is included in the supplementary material.

4.1. SYNTHETIC DATA

We first evaluate our proposed algorithm in a simple task where we know the optimal stationary policy for any CVaR level. The MDP has four states $x_0, x_1, x_2, x_3$, where $x_0$ is the initial state and $x_3$ is a terminal state. Each state has two actions, $a_0$ and $a_1$. Action $a_0$ generates an immediate reward following a Gaussian $N(1, 1)$, and action $a_1$ has immediate reward $N(0.8, 0.4^2)$; clearly, $a_0$ gives a better expected reward but with higher variance. Each action always moves the state from $x_i$ to $x_{i+1}$. We use $\gamma = 0.9$ for this task. For $\alpha > 0.63$, the optimal stationary policy is to choose action $a_0$ in all states, while for $\alpha < 0.62$ it is to choose action $a_1$ in all states. We compare our proposed algorithm for static CVaR with the optimal stationary policy at various CVaR levels; Figure 1 (left) shows the results. We label the results for our proposed algorithm "Static", and those for the action-selection strategy based on Dabney et al. (2018a) "Dynamic". We clearly see that the "Static" version outperforms "Dynamic" at all tested CVaR levels. In fact, our algorithm performs even better than the optimal stationary policy. Recall that the optimal CVaR policy may be non-stationary, where the actions in later states depend on the rewards collected so far. This example shows that learning with Algorithm 1 and execution with Algorithm 2 can indeed produce non-stationary policies by storing the "extra" information within the value distribution. Further insights are revealed in the middle and right plots of Figure 1. These show the ground-truth CVaR values for all stationary policies, where, e.g., [1 0 0] denotes always choosing action $a_1$ in $x_0$ and $a_0$ in the next two states. Notice the switching point around $\alpha = 0.625$ in the middle plot and around $\alpha = 0.83$ in the right plot.
The "Dynamic" CVaR action-selection strategy chooses action $a_1$ in $x_2$ for $\alpha < 0.83$, since this is the better action if one ignores the rewards collected since the beginning. This, however, results in a rather conservative strategy, since the optimal strategy should still favor $a_0$ in $x_2$ for $\alpha > 0.625$.
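The switching behavior described above can be verified by direct Monte-Carlo simulation of the chain; a sketch (`simulate_returns` and the sample sizes are our own choices):

```python
import numpy as np

def simulate_returns(policy, gamma=0.9, n=200_000, seed=0):
    # Monte-Carlo total discounted returns of a stationary policy in the
    # 3-step chain; policy[t] is the action (0 or 1) taken in state x_t.
    rng = np.random.default_rng(seed)
    z = np.zeros(n)
    for t, a in enumerate(policy):
        mu, sd = (1.0, 1.0) if a == 0 else (0.8, 0.4)
        z += gamma**t * rng.normal(mu, sd, size=n)
    return z

def cvar(z, alpha):
    # Lower-tail CVaR via the alpha-quantile plug-in.
    s = np.quantile(z, alpha)
    return s - np.mean(np.maximum(s - z, 0.0)) / alpha
```

At high $\alpha$ the riskier policy [0 0 0] dominates, while at low $\alpha$ the safe policy [1 1 1] does, reproducing the crossover visible in the middle plot of Figure 1.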

4.2. OPTION TRADING

We evaluate our proposed algorithm on the non-trivial real-world task of option trading, commonly used as a test domain for risk-sensitive RL (Li et al., 2009; Chow & Ghavamzadeh, 2014; Tamar et al., 2017). In particular, we tackle the task of learning an exercise policy for American options. This can be formulated as a discounted finite-horizon MDP with continuous states and two actions. The state $x_t$ includes the price of a stock at time $t$, as well as the number of steps to the maturity date, which we set to $T = 100$. The first action, "hold", always moves the state one time step forward with zero reward, while the second action, "execute", generates an immediate reward $\max\{0, K - x_t\}$ and enters a terminal state, where $K$ is the strike price. In our experiments, we use $K = 1$ and always normalize the prices such that $x_0 = 1$. At $t = T - 1$, all actions are interpreted as "execute". We set $\gamma = 0.999$, which corresponds to a non-zero daily risk-free interest rate. We use actual daily closing prices of the top 10 Dow components from 2005 to 2019: prices from 2005-2015 are used for training and prices from 2016-2019 for testing. To allow training on unlimited data, we follow Li et al. (2009) and create a stock-price simulator using the geometric Brownian motion (GBM) model. The GBM model assumes that the log-ratio of prices follows a Gaussian distribution, $\log \frac{x_{t+1}}{x_t} \sim N(\mu - \sigma^2/2, \sigma^2)$, with parameters $\mu$ and $\sigma$, which we estimate from the real training data. For each algorithm, each stock, and each CVaR level, we trained three policies using different random seeds. The policies are first tested on synthetic data (generated using the same training model) for 1000 episodes. The policies are further tested on the real data, using 100 episodes, each with 100 consecutive days of closing prices; the episodes' start and end dates are evenly spread out over the four years of the test period. All results are averaged over the three policies and over the 10 stocks.
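The GBM simulator described above amounts to a few lines; a sketch with our own function names ($\mu$ and $\sigma$ would be fitted per stock on the 2005-2015 training prices):

```python
import numpy as np

def fit_gbm(prices):
    # Under GBM, daily log-returns are N(mu - sigma^2/2, sigma^2).
    lr = np.diff(np.log(prices))
    sigma2 = lr.var()
    return lr.mean() + sigma2 / 2.0, np.sqrt(sigma2)

def simulate_gbm(mu, sigma, T=100, seed=None):
    # One episode of T daily prices, normalized so that x_0 = 1.
    rng = np.random.default_rng(seed)
    lr = rng.normal(mu - sigma**2 / 2.0, sigma, size=T - 1)
    return np.exp(np.concatenate([[0.0], np.cumsum(lr)]))
```

Normalizing each simulated episode to $x_0 = 1$ matches the strike-price convention $K = 1$ used in the experiments.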
Figures 2 and 3 show the test results on synthetic and real data, respectively. Again, we label the algorithms "Static" and "Dynamic" as in the previous section. On both synthetic and real data, the "Static" algorithm clearly performs better across the various CVaR levels; the gap is especially significant at lower $\alpha$ levels. Also included are the results from training with $\alpha = 1$ and testing at all $\alpha$ values, which corresponds to the standard action-selection strategy based on the expected return. These learned strategies perform badly at low $\alpha$ levels, suggesting that they take on too much risk.



¹We use a slightly different definition from that in Ruszczynski (2010), but conceptually they are essentially the same.



Algorithm 2: Policy execution for static CVaR for one episode
Input: $\gamma$, $\alpha$, $\theta$
1. $x \leftarrow$ initial state
2. $a \leftarrow \arg\max_{a'} \zeta(\theta, q_\alpha(\theta(x, a')))(x, a')$
3. $s \leftarrow q_\alpha(\theta(x, a))$
4. While $x$ is not a terminal state:
   (a) Execute $a$ in $x$; observe reward $r$ and next state $x'$
   (b) $s \leftarrow \frac{s - r}{\gamma}$
   (c) $x \leftarrow x'$
   (d) $a \leftarrow \arg\max_{a'} \zeta(\theta, s)(x, a')$
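The episode loop of Algorithm 2 can be sketched in Python as follows (the `env` interface and function names are our own assumptions; `theta(x)` is assumed to return the $(|A|, N)$ quantile array for state $x$):

```python
import numpy as np

def run_episode(env, theta, gamma, alpha):
    # Execute the learned static-CVaR policy, carrying the moving threshold s.
    def zeta(q, s):
        return -np.mean(np.maximum(s - q, 0.0))

    x = env.reset()
    q = theta(x)
    # Initial action: each candidate action uses its own alpha-quantile as threshold.
    a = int(np.argmax([zeta(qa, np.quantile(qa, alpha)) for qa in q]))
    s = np.quantile(q[a], alpha)
    ret, disc = 0.0, 1.0
    while True:
        r, x, done = env.step(a)
        ret += disc * r
        disc *= gamma
        if done:
            return ret
        s = (s - r) / gamma  # shift the threshold by the observed reward
        a = int(np.argmax([zeta(qa, s) for qa in theta(x)]))
```

The threshold update $s \leftarrow (s - r)/\gamma$ is exactly the augmented-MDP transition from Section 2.1, here maintained outside the learned representation.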

Figure 1: Left: Comparison with optimal stationary policies. Middle: Ground truth CVaR at x 0 . Right: Ground truth CVaR at x 2 .

Figure 2: Test results on synthetic data


