POLICY GRADIENT WITH EXPECTED QUADRATIC UTILITY MAXIMIZATION: A NEW MEAN-VARIANCE APPROACH IN REINFORCEMENT LEARNING

Abstract

In real-world decision-making problems, risk management is critical. Among various risk management approaches, the mean-variance criterion is one of the most widely used in practice. In this paper, we propose expected quadratic utility maximization (EQUM) as a new framework for policy-gradient-style reinforcement learning (RL) algorithms with mean-variance control. The quadratic utility function is a common objective of risk management in finance and economics. The proposed EQUM admits several interpretations, such as reward-constrained variance minimization, regularization, and agent utility maximization. In addition, the EQUM is easier to compute than existing mean-variance RL methods, which require double sampling. In experiments, we demonstrate the effectiveness of the EQUM on RL benchmarks and on financial data.

1. INTRODUCTION

Reinforcement learning (RL) with Markov decision processes (MDPs) is one type of dynamic decision-making problem (Puterman, 1994; Sutton & Barto, 1998). While the typical objective is maximization of the expected cumulative reward, risk-aware decision-making has attracted great attention in real-world applications such as finance and robotics (Geibel & Wysotzki, 2005; García & Fernández, 2015). The notion of risk reflects the fact that even an optimal policy may perform poorly owing to the stochastic nature of the problem. To capture risk, various criteria have been proposed, such as Value at Risk (Luenberger, 1998; Chow & Ghavamzadeh, 2014; Chow et al., 2017) and variance (Markowitz, 1952; Markowitz et al., 2000; Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013). Among them, we focus on the mean-variance trade-off. Typical mean-variance RL (MVRL) methods attempt to maximize the expected cumulative reward while keeping the variance below a threshold (Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013; 2016; Xie et al., 2018; Bisi et al., 2020; Zhang et al., 2020). However, most existing MVRL methods suffer from high computational costs owing to the double sampling issue that arises when approximating the gradient of the variance term (Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013; 2016). To avoid the double sampling issue, Xie et al. (2018) proposed a method based on the Legendre-Fenchel duality (Boyd & Vandenberghe, 2004). Although that method does not suffer from the double sampling issue, a standard policy gradient method cannot be applied, and a coordinate descent algorithm must be used instead. In addition, the method cannot control risk at a desired level. From an economics perspective, the difference among RL objectives arises from the form of the utility function.
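The double sampling issue mentioned above can be made concrete with a short derivation. Writing $R$ for the cumulative reward under a policy parameterized by $\theta$ (notation chosen here for illustration, not taken from this paper), the gradient of the variance term decomposes via the standard identity:

```latex
\nabla_\theta \mathrm{Var}_\theta(R)
  = \nabla_\theta \mathbb{E}_\theta\!\left[R^2\right]
    \;-\; 2\,\mathbb{E}_\theta[R]\,\nabla_\theta \mathbb{E}_\theta[R].
```

The second term is a product of two expectations under the same policy; an unbiased estimate of such a product requires two independent trajectory samples per update, which is precisely the double sampling issue.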
When the objective of an agent is expected cumulative reward maximization, the utility function is risk-neutral; when an agent attempts to control risk based on an expected reward, the utility function is risk-averse (Mas-Colell et al., 1995). In economics, several risk-averse utility functions have been proposed. The quadratic utility function is one such function and is frequently used in financial economics (Luenberger, 1998). Under the quadratic utility function, the mean-variance portfolio maximizes the utility of the investor. Various other financial theories are also based on quadratic utility maximization (Markowitz, 1952; Sharpe, 1964; Lintner, 1965; Mossin, 1966). For more details, see Appendix A. In this study, as one of the MVRL approaches, we consider expected quadratic utility maximization (EQUM) based on the policy gradient method (Williams, 1988; 1992; Sutton et al., 1999; Baxter & Bartlett, 2001).
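The link between quadratic utility and the mean-variance criterion follows from a one-line identity: for quadratic utility $u(y) = y - \frac{\lambda}{2} y^2$ with risk-aversion parameter $\lambda > 0$, we have $\mathbb{E}[u(Y)] = \mathbb{E}[Y] - \frac{\lambda}{2}\left(\mathrm{Var}(Y) + \mathbb{E}[Y]^2\right)$, since $\mathbb{E}[Y^2] = \mathrm{Var}(Y) + \mathbb{E}[Y]^2$. Maximizing expected quadratic utility therefore trades expected reward against variance. The snippet below is a minimal numerical check of this identity on simulated returns; the names `lam` and `Y` are illustrative and not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 0.5                                   # risk-aversion parameter lambda
Y = rng.normal(loc=1.0, scale=2.0, size=100_000)  # simulated cumulative rewards

# Expected quadratic utility: E[Y - (lam/2) * Y^2]
equ = np.mean(Y - 0.5 * lam * Y**2)

# Mean-variance form: E[Y] - (lam/2) * (Var(Y) + E[Y]^2)
mv = Y.mean() - 0.5 * lam * (Y.var() + Y.mean() ** 2)

# The two objectives agree up to floating-point error.
print(abs(equ - mv) < 1e-8)
```

Because both sides reduce to the same sample moments, the agreement is exact up to floating-point rounding; the larger `lam` is, the more heavily variance is penalized relative to the mean.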

