POLICY GRADIENT WITH EXPECTED QUADRATIC UTILITY MAXIMIZATION: A NEW MEAN-VARIANCE APPROACH IN REINFORCEMENT LEARNING

Abstract

In real-world decision-making problems, risk management is critical. Among various risk management approaches, the mean-variance criterion is one of the most widely used in practice. In this paper, we propose expected quadratic utility maximization (EQUM) as a new framework for policy gradient style reinforcement learning (RL) algorithms with mean-variance control. The quadratic utility function is a common objective of risk management in finance and economics. The proposed EQUM has several interpretations, such as reward-constrained variance minimization and regularization, as well as agent utility maximization. In addition, EQUM is computationally easier than existing mean-variance RL methods, which require double sampling. In experiments, we demonstrate the effectiveness of EQUM in benchmark settings of RL and on financial data.

1. INTRODUCTION

Reinforcement learning (RL) with Markov decision processes (MDPs) is one type of dynamic decision-making problem (Puterman, 1994; Sutton & Barto, 1998). While the typical objective is expected cumulative reward maximization, risk-aware decision-making has attracted great attention in real-world applications, such as finance and robotics (Geibel & Wysotzki, 2005; García & Fernández, 2015). The notion of risk is related to the fact that even an optimal policy may perform poorly owing to the stochastic nature of the problem. To capture the risk, various criteria have been proposed, such as Value at Risk (Luenberger, 1998; Chow & Ghavamzadeh, 2014; Chow et al., 2017) and variance (Markowitz, 1952; Markowitz et al., 2000; Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013). Among them, we focus on the mean-variance trade-off.

Typical mean-variance RL (MVRL) methods attempt to maximize the expected cumulative reward while keeping the variance below a threshold (Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013; 2016; Xie et al., 2018; Bisi et al., 2020; Zhang et al., 2020). However, most existing MVRL methods suffer from high computational costs owing to the double sampling issue when approximating the gradient of the variance term (Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013; 2016). To avoid the double sampling issue, Xie et al. (2018) proposed a method based on the Legendre-Fenchel duality (Boyd & Vandenberghe, 2004). Although this method does not suffer from the double sampling issue, it precludes a standard policy gradient method and requires a coordinate descent algorithm instead. In addition, the method cannot control risk at a desired level. From an economics perspective, the difference in the RL objectives arises from the forms of the utility functions.
When the objective of an agent is expected cumulative reward maximization, the utility function is risk-neutral; when an agent attempts to control the risk based on an expected reward, the utility function is risk-averse (Mas-Colell et al., 1995). In economics, several risk-averse utility functions have been proposed. The quadratic utility function is one such function and is frequently used in financial economics (Luenberger, 1998). Under the quadratic utility function, the mean-variance portfolio maximizes the utility of the investor. In addition, various other financial theories are also based on quadratic utility maximization (Markowitz, 1952; Sharpe, 1964; Lintner, 1965; Mossin, 1966). For more details, see Appendix A.

In this study, as one of the MVRL approaches, we consider expected quadratic utility maximization (EQUM) based on the policy gradient method (Williams, 1988; 1992; Sutton et al., 1999; Baxter & Bartlett, 2001). EQUM has the following advantages: (i) low computational cost; (ii) numerous interpretations; and (iii) direct connections to real-world applications. As interpretations of EQUM, we propose the minimization of the variance under a constraint on the expected cumulative reward, reward-targeting optimization, and regularization. Thus, this study contributes to the literature on risk-averse RL and MVRL.

In the following sections, we first formulate the problem setting in Section 2 and review the existing methods in Section 3. Then, we propose the main algorithms in Section 4. Finally, we investigate the empirical effectiveness of EQUM in Section 5.
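To make the connection between quadratic utility and the mean-variance criterion concrete, note that for a quadratic utility u(x) = x − (λ/2)x² with risk-aversion parameter λ > 0, we have E[u(R)] = E[R] − (λ/2)(Var(R) + E[R]²), so maximizing expected quadratic utility penalizes variance. The following is a minimal illustrative sketch, not from the paper; the function names, λ value, and Gaussian return distribution are our own assumptions:

```python
import numpy as np

def quadratic_utility(x, lam):
    """Quadratic utility u(x) = x - (lam / 2) * x**2; risk-averse for lam > 0."""
    return x - 0.5 * lam * x ** 2

def expected_quadratic_utility(returns, lam):
    """Monte Carlo estimate of E[u(R)] from sampled cumulative rewards."""
    return float(np.mean(quadratic_utility(np.asarray(returns, dtype=float), lam)))

# Illustrative returns: R ~ N(1.0, 0.5^2), so E[R] = 1.0 and Var(R) = 0.25.
rng = np.random.default_rng(0)
returns = rng.normal(loc=1.0, scale=0.5, size=100_000)

# Closed form: E[u(R)] = E[R] - (lam/2) * (Var(R) + E[R]^2).
est = expected_quadratic_utility(returns, lam=0.5)
closed_form = 1.0 - 0.25 * (0.25 + 1.0)
```

The Monte Carlo estimate agrees with the closed form up to sampling error, which is the sense in which expected quadratic utility maximization acts as a mean-variance objective.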

2. PROBLEM SETTING

We consider the standard RL framework, where a learning agent interacts with an unfamiliar, dynamic, and stochastic environment modeled by a Markov decision process (MDP) in discrete time. We define the MDP as the tuple (S, A, r, P, P_0), where S is a set of states, A is a set of actions, r : S × A → R is a reward function, P : S × S × A → [0, 1] is the transition kernel, and P_0 : S → [0, 1] is an initial state distribution. The initial state S_1 is sampled from P_0. Let π_θ : A × S → [0, 1] be a parameterized stochastic policy mapping states to actions, where θ is the tunable parameter. At time step t, an agent chooses an action A_t according to the policy π_θ(· | S_t). We assume that the policy π_θ is differentiable with respect to θ; that is, ∂π_θ(a, s)/∂θ exists.

There are several performance measures for a policy π_θ. One popular measure is the expected cumulative reward from time step t to u, defined as E_{π_θ}[R_{t:u}], where R_{t:u} = Σ_{i=0}^{u−t} γ^i r(S_{t+i}, A_{t+i}), γ ∈ (0, 1] is a discount factor, E_{π_θ} denotes the expectation operator over a policy π_θ, and S_1 is generated from P_0. When γ = 1, to ensure that the cumulative reward is well defined, it is usually assumed that all policies are proper (Bertsekas & Tsitsiklis, 1996); that is, under any policy π_θ, the agent reaches a recurrent state S* with probability 1. After the agent passes the recurrent state S* at a time τ, the rewards are always 0. Such a finite-horizon case is called an episodic MDP (Puterman, 1994). For brevity, we denote R_{t:u} as R when the meaning is obvious.

Under this criterion, the agent may attempt to obtain a higher cumulative reward while taking higher risks. In real-world applications, such as portfolio management in finance (Markowitz, 1952; Markowitz et al., 2000), such risky decision-making is not always desired, and we therefore consider the trade-off between the expected cumulative reward and the variance.
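The discounted cumulative reward defined above can be sketched in a few lines; the helper name below is our own, not the paper's:

```python
def cumulative_reward(rewards, gamma=0.99):
    """Discounted cumulative reward of one episode:
    sum over i of gamma**i * r(S_{t+i}, A_{t+i}), given the realized rewards."""
    return sum((gamma ** i) * r for i, r in enumerate(rewards))
```

In the episodic (γ = 1, proper-policy) case, rewards are 0 after the recurrent state is reached, so the sum is effectively finite.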
Thus, while the goal of the risk-neutral MDP problem is to find the parameter θ that maximizes the expected cumulative reward, in the MVRL problem we consider the trade-off between the mean and the variance of the cumulative reward R. Note that even when the reward r is deterministic, the cumulative reward is a random variable owing to the stochastic policy, so the mean-variance trade-off still exists. In addition, even if the optimal policy is deterministic, the observations in the experiments of Section 5.3 suggest that the proposed method has the potential to improve the stability of training.
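A small numerical sketch of the trade-off: two return distributions with the same mean but different variances are ranked differently by a mean-variance criterion of the form E[R] − λ·Var(R). The function name, λ, and the synthetic return samples below are illustrative assumptions of ours:

```python
import numpy as np

def mean_variance_score(returns, lam):
    """Mean-variance criterion E[R] - lam * Var(R), estimated from episode returns."""
    r = np.asarray(returns, dtype=float)
    return float(r.mean() - lam * r.var())

# Two hypothetical policies with equal mean return but different variance.
rng = np.random.default_rng(1)
safe = rng.normal(1.0, 0.1, 50_000)   # low-variance returns
risky = rng.normal(1.0, 1.0, 50_000)  # same mean, high variance
```

Under a risk-neutral criterion (λ = 0) the two policies are indistinguishable, whereas any λ > 0 prefers the low-variance one.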

3. EXISTING MVRL METHODS

In this section, we introduce existing studies of MVRL.

3.1 CONSTRAINED TRAJECTORY-VARIANCE PROBLEM

Tamar et al. (2012), Prashanth & Ghavamzadeh (2013), and Xie et al. (2018) formulated MVRL as a constrained optimization problem defined as max_{θ∈Θ} E_{π_θ}[R] s.t. Var_{π_θ}(R) ≤ η. In their formulation, the goal is to maximize the expected cumulative reward while controlling the trajectory-variance at a certain level. To solve this problem, they consider a penalized method defined as max_{θ∈Θ} E_{π_θ}[R] − δ g(Var_{π_θ}(R) − η), where δ > 0 is a constant and g : R → R is a penalty function, such as g(x) = x or g(x) = x².

3.1.1 DOUBLE SAMPLING ISSUE

Tamar et al. (2012), Prashanth & Ghavamzadeh (2013), and Xie et al. (2018) report the double sampling issue in MVRL, which requires sampling from two different trajectories to estimate the policy gradient. For instance, in an episodic MDP with the discount factor γ = 1 and the stop-
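To make the double sampling issue concrete, note that ∇_θ Var(R) = ∇_θ E[R²] − 2 E[R] ∇_θ E[R]. The product E[R]·∇_θ E[R] multiplies two expectations, so estimating both factors from the same batch of trajectories yields a biased product; an unbiased-style estimate draws two independent batches. The following one-step bandit is our own toy sketch, not an example from the cited papers; the policy form and reward distributions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_batch(theta, n):
    """One-step bandit: A ~ Bernoulli(sigmoid(theta)); arm 1 has high mean but high variance."""
    p = 1.0 / (1.0 + np.exp(-theta))
    a = (rng.random(n) < p).astype(float)
    r = np.where(a == 1.0, rng.normal(1.0, 2.0, n), rng.normal(0.5, 0.1, n))
    score = a - p  # d/dtheta log pi_theta(A) for a Bernoulli(sigmoid(theta)) policy
    return r, score

def grad_variance_double_sampling(theta, n):
    """Estimate d/dtheta Var(R) = dE[R^2]/dtheta - 2 * E[R] * dE[R]/dtheta,
    using two independent batches ("double sampling") so that the product
    E[R] * dE[R]/dtheta is not estimated from correlated samples."""
    r1, s1 = sample_batch(theta, n)  # batch 1: score-function gradient terms
    r2, _ = sample_batch(theta, n)   # batch 2: independent estimate of E[R]
    grad_second_moment = float(np.mean(r1 ** 2 * s1))
    grad_mean = float(np.mean(r1 * s1))
    return grad_second_moment - 2.0 * float(np.mean(r2)) * grad_mean
```

In multi-step MDPs the same product structure appears with full-trajectory returns, which is why the variance gradient doubles the sampling cost; EQUM is motivated partly by avoiding this.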

