POLICY GRADIENT WITH EXPECTED QUADRATIC UTILITY MAXIMIZATION: A NEW MEAN-VARIANCE APPROACH IN REINFORCEMENT LEARNING

Abstract

In real-world decision-making problems, risk management is critical. Among various risk management approaches, the mean-variance criterion is one of the most widely used in practice. In this paper, we propose expected quadratic utility maximization (EQUM) as a new framework for policy-gradient-style reinforcement learning (RL) algorithms with mean-variance control. The quadratic utility function is a common objective of risk management in finance and economics. The proposed EQUM has several interpretations, such as reward-constrained variance minimization and regularization, as well as agent utility maximization. In addition, the computation of the EQUM is easier than that of existing mean-variance RL methods, which require double sampling. In experiments, we demonstrate the effectiveness of the EQUM in benchmark RL settings and on financial data.

1. INTRODUCTION

Reinforcement learning (RL) with Markov decision processes (MDPs) is one type of dynamic decision-making problem (Puterman, 1994; Sutton & Barto, 1998). While the typical objective is expected cumulative reward maximization, risk-aware decision-making has attracted great attention in real-world applications, such as finance and robotics (Geibel & Wysotzki, 2005; García & Fernández, 2015). The notion of risk is related to the fact that even an optimal policy may perform poorly owing to the stochastic nature of the problem. To capture the risk, various criteria have been proposed, such as Value at Risk (Luenberger, 1998; Chow & Ghavamzadeh, 2014; Chow et al., 2017) and variance (Markowitz, 1952; Markowitz et al., 2000; Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013). Among them, we focus on the mean-variance trade-off. Typical mean-variance RL (MVRL) methods attempt to maximize the expected cumulative reward while keeping the variance below a threshold (Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013; 2016; Xie et al., 2018; Bisi et al., 2020; Zhang et al., 2020). However, most existing MVRL methods suffer from high computational costs owing to the double sampling issue when approximating the gradient of the variance term (Tamar et al., 2012; Prashanth & Ghavamzadeh, 2013; 2016). To avoid the double sampling issue, Xie et al. (2018) proposed a method based on the Legendre-Fenchel duality (Boyd & Vandenberghe, 2004). Although the method does not suffer from the double sampling issue, we cannot apply a standard policy gradient method and must use a coordinate descent algorithm. In addition, the method cannot control risk at a desirable level. From an economics perspective, the difference in the RL objectives arises from the forms of the utility functions.
When the objective of an agent is expected cumulative reward maximization, the utility function is risk-neutral; when an agent attempts to control the risk based on an expected reward, the utility function is risk-averse (Mas-Colell et al., 1995). In economics, several risk-averse utility functions have been proposed. The quadratic utility function is one such function and is frequently used in financial economics (Luenberger, 1998). Under the quadratic utility function, the mean-variance portfolio maximizes the utility of the investor. In addition, various other financial theories are also based on quadratic utility maximization (Markowitz, 1952; Sharpe, 1964; Lintner, 1965; Mossin, 1966). For more details, see Appendix A. In this study, as one of the MVRL approaches, we consider expected quadratic utility maximization (EQUM) based on the policy gradient method (Williams, 1988; 1992; Sutton et al., 1999; Baxter & Bartlett, 2001). The EQUM has the following advantages: (i) low computational cost; (ii) numerous interpretations; and (iii) direct connections to real-world applications. In this study, as interpretations of the EQUM, we propose the minimization of the variance under a constraint on the expected cumulative reward, reward-targeting optimization, and regularization. Thus, this study contributes to the context of risk-averse RL and MVRL. In the following sections, we first formulate the problem setting in Section 2 and review the existing methods in Section 3. Then, we propose the main algorithms in Section 4. Finally, we investigate the empirical effectiveness of the EQUM in Section 5.

2. PROBLEM SETTING

We consider the standard RL framework, where a learning agent interacts with an unfamiliar, dynamic, and stochastic environment modeled by a Markov decision process (MDP) in discrete time. We define the MDP as the tuple $(\mathcal{S}, \mathcal{A}, r, P, P_0)$, where $\mathcal{S}$ is a set of states, $\mathcal{A}$ is a set of actions, $r : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a reward function, $P : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, 1]$ is the transition kernel, and $P_0 : \mathcal{S} \to [0, 1]$ is an initial state distribution. The initial state $S_1$ is sampled from $P_0$. Let $\pi_\theta : \mathcal{A} \times \mathcal{S} \to [0, 1]$ be a parameterized stochastic policy mapping states to actions, where $\theta$ is the tunable parameter. At time step $t$, an agent chooses an action $A_t$ according to a policy $\pi_\theta(\cdot \mid S_t)$. We assume that the policy $\pi_\theta$ is differentiable with respect to $\theta$; that is, $\partial \pi_\theta(a, s) / \partial \theta$ exists. There are several performance measures for a policy $\pi_\theta$. One popular measure is the expected cumulative reward from time step $t$ to $u$, defined as $\mathbb{E}_{\pi_\theta}[R_{t:u}]$, where $R_{t:u} = \sum_{i=0}^{u-t} \gamma^i r(S_{t+i}, A_{t+i})$, $\gamma \in (0, 1]$ is a discount factor, $\mathbb{E}_{\pi_\theta}$ denotes the expectation operator over a policy $\pi_\theta$, and $S_1$ is generated from $P_0$. When $\gamma = 1$, to ensure that the cumulative reward is well defined, it is usually assumed that all policies are proper (Bertsekas & Tsitsiklis, 1996); that is, under any policy $\pi_\theta$, the agent reaches a recurrent state $S^*$ with probability 1. After the agent passes the recurrent state $S^*$ at a time $\tau$, the rewards are always 0. Such a finite-horizon case is called an episodic MDP (Puterman, 1994). For brevity, we denote $R_{t:u}$ as $R$ when the meaning is obvious. Under these criteria, the agent may attempt to obtain a higher cumulative reward while taking higher risks. In real-world applications, such as portfolio management in finance (Markowitz, 1952; Markowitz et al., 2000), such risky decision-making is not always desired, and we therefore consider the trade-off between the expected cumulative reward and the variance.
Thus, while the goal of the risk-neutral MDP problem is to find the parameter $\theta$ that maximizes the total reward, we consider the mean-variance trade-off between the cumulative expected reward and the variance of the cumulative reward $R$ in the MVRL problem. Note that even when the reward $r$ is deterministic, the cumulative reward is a random variable owing to the stochastic policy, and the mean-variance trade-off still exists. In addition, even if the optimal policy is deterministic, the proposed method empirically has the potential to improve the stability of training, based on the observations in the experiments of Section 5.3.
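As a concrete illustration of the discounted cumulative reward $R_{t:u}$ above, the following minimal sketch (the function name and NumPy usage are our own, not from the paper) computes it from a sequence of per-step rewards:

```python
import numpy as np

def cumulative_reward(rewards, gamma=0.99):
    """Discounted cumulative reward: sum_i gamma^i * r_i over one episode."""
    rewards = np.asarray(rewards, dtype=float)
    return float(np.sum(gamma ** np.arange(len(rewards)) * rewards))

# With gamma = 0.5 and three unit rewards: 1 + 0.5 + 0.25 = 1.75.
```

When $\gamma = 1$ and all policies are proper, the same computation applies with the rewards truncated at the recurrent state $S^*$.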

3. EXISTING MVRL METHODS

In this section, we introduce existing studies of MVRL. 

3.1. CONSTRAINED TRAJECTORY-VARIANCE PROBLEM

The constrained trajectory-variance problem is formulated as

$$\max_{\theta \in \Theta} \mathbb{E}_{\pi_\theta}[R] \quad \text{s.t.} \quad \mathrm{Var}_{\pi_\theta}(R) \le \eta.$$

In this formulation, the goal is to maximize the expected cumulative reward while controlling the trajectory-variance at a certain level. To solve this problem, Tamar et al. (2012), Prashanth & Ghavamzadeh (2013), and Xie et al. (2018) consider a penalized method defined as

$$\max_{\theta \in \Theta} \mathbb{E}_{\pi_\theta}[R] - \delta g\big(\mathrm{Var}_{\pi_\theta}(R) - \eta\big),$$

where $\delta > 0$ is a constant and $g : \mathbb{R} \to \mathbb{R}$ is a penalty function, such as $g(x) = x$ or $g(x) = x^2$.

3.1.1 DOUBLE SAMPLING ISSUE

Tamar et al. (2012), Prashanth & Ghavamzadeh (2013), and Xie et al. (2018) report the double sampling issue in MVRL, which requires sampling from two different trajectories for estimating the policy gradient of the variance term. For instance, in an episodic MDP with the discount factor $\gamma = 1$ and the stopping time $\tau = \min\{t \mid S_t = S^*\}$, the gradients of $\mathbb{E}_{\pi_\theta}[R]$, $\mathbb{E}_{\pi_\theta}[R^2]$, and $(\mathbb{E}_{\pi_\theta}[R])^2$ are given as $\nabla_\theta \mathbb{E}_{\pi_\theta}[R] = \mathbb{E}_{\pi_\theta}\big[R \sum_{t=1}^{\tau} \nabla_\theta \log \pi_\theta(S_t, A_t)\big]$, $\nabla_\theta \mathbb{E}_{\pi_\theta}[R^2] = \mathbb{E}_{\pi_\theta}\big[R^2 \sum_{t=1}^{\tau} \nabla_\theta \log \pi_\theta(S_t, A_t)\big]$, and $\nabla_\theta (\mathbb{E}_{\pi_\theta}[R])^2 = 2 \mathbb{E}_{\pi_\theta}[R] \nabla_\theta \mathbb{E}_{\pi_\theta}[R]$ (Tamar et al., 2012). Because optimizing the policy $\pi_\theta$ using these gradients directly is computationally intractable, we replace them with their unbiased estimators. Suppose that there is a simulator generating a trajectory $k$ with $\{(S^k_t, A^k_t, r(S^k_t, A^k_t))\}_{t=1}^{\tau_k}$, where $\tau_k$ is the stopping time of the trajectory. Then, we can naively construct unbiased estimators of $\nabla_\theta \mathbb{E}_{\pi_\theta}[R]$ and $\nabla_\theta \mathbb{E}_{\pi_\theta}[R^2]$ as $R_k \sum_{t=1}^{\tau_k} \nabla_\theta \log \pi_\theta(S^k_t, A^k_t)$ and $R_k^2 \sum_{t=1}^{\tau_k} \nabla_\theta \log \pi_\theta(S^k_t, A^k_t)$, where $R_k$ is the realized cumulative reward of episode $k$. However, obtaining an unbiased estimator of $\nabla_\theta (\mathbb{E}_{\pi_\theta}[R])^2 = 2 \mathbb{E}_{\pi_\theta}[R] \nabla_\theta \mathbb{E}_{\pi_\theta}[R]$ from a single trajectory is difficult, because the product of two expectations requires two independent samples.
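The double sampling issue can be seen concretely in the following sketch (our own toy example, not from the paper): a one-step Bernoulli policy $\pi_\theta(A=1) = \sigma(\theta)$ with reward $R = A$. Reusing a single trajectory for both factors of $2\mathbb{E}[R]\nabla_\theta\mathbb{E}[R]$ yields an estimator whose expectation is $2\nabla_\theta\mathbb{E}[R^2]$ instead; an unbiased estimate requires a second, independent trajectory:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0
p = 1.0 / (1.0 + np.exp(-theta))        # pi_theta(A = 1); the reward is R = A

N = 200_000
A1 = rng.random(N) < p                  # first batch of one-step trajectories
A2 = rng.random(N) < p                  # independent second batch ("double sampling")
score1 = np.where(A1, 1.0 - p, -p)      # d log pi_theta(A) / d theta for a Bernoulli policy
score2 = np.where(A2, 1.0 - p, -p)
R1, R2 = A1.astype(float), A2.astype(float)

# Reusing one trajectory: expectation is 2 * grad E[R^2], which is biased here.
naive = np.mean(2.0 * R1 * R1 * score1)
# Two independent trajectories: expectation is 2 * E[R] * grad E[R] (unbiased).
double = np.mean(2.0 * R1 * R2 * score2)
true_grad = 2.0 * p * p * (1.0 - p)     # analytic d (E[R])^2 / d theta = 0.25 at theta = 0
```

At $\theta = 0$ the single-trajectory estimator converges to $2\nabla_\theta\mathbb{E}[R^2] = 0.5$ rather than the true value $0.25$, which is exactly the bias that motivates the two-trajectory requirement.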

Weakness of existing approaches:

The multi-time-scale approaches by Tamar et al. (2012) and Prashanth & Ghavamzadeh (2013) are known to be sensitive to the choice of the step-size schedules, which are not easy to control (Xie et al., 2018). The method by Xie et al. (2018) does not reflect the constraint level $\eta$; that is, their objective function contains the penalty coefficient $\delta$ but not the constraint level $\eta$. This problem stems from their objective being based on the penalty function $g(x) = x$, namely $\mathbb{E}_{\pi_\theta}[R] - \delta(\mathrm{Var}_{\pi_\theta}(R) - \eta)$, whose first derivative does not include $\eta$. In addition, when using the quadratic penalty function $g(x) = x^2$ to account for $\eta$, we cannot remove $\mathbb{E}[R^2]$ even with the Legendre-Fenchel dual; that is, the method still suffers from the double sampling issue.

3.2. CONSTRAINED PER-STEP VARIANCE PROBLEM

Bisi et al. (2020) and Zhang et al. (2020) proposed solving a constrained per-step variance problem for MVRL. Bisi et al. (2020) showed that $\mathrm{Var}_{\pi_\theta}(R) \le \mathrm{Var}_{\pi_\theta}(r(S_t, A_t)) / (1 - \gamma)^2$, which implies that minimizing the per-step variance $\mathrm{Var}_{\pi_\theta}(r(S_t, A_t))$ also reduces the trajectory-variance $\mathrm{Var}_{\pi_\theta}(R)$. Therefore, they train a policy $\pi_\theta$ by maximizing $\mathbb{E}_{\pi_\theta}[r(S_t, A_t)] - \kappa \mathrm{Var}_{\pi_\theta}(r(S_t, A_t))$, where $\kappa > 0$ is a parameter of the penalty function. The methods of Bisi et al. (2020) and Zhang et al. (2020) are based on trust region policy optimization (Schulman et al., 2015) and coordinate descent with Legendre-Fenchel duality (Xie et al., 2018), respectively.
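A sample version of the per-step penalized objective is straightforward to estimate from observed per-step rewards (a minimal sketch; the function name is ours):

```python
import numpy as np

def per_step_objective(step_rewards, kappa):
    """Estimate E[r(S_t, A_t)] - kappa * Var(r(S_t, A_t)) from per-step rewards."""
    r = np.asarray(step_rewards, dtype=float)
    return float(r.mean() - kappa * r.var())

# Rewards [1, 2, 3] with kappa = 0.5: mean 2, variance 2/3, objective 2 - 1/3.
```

Unlike the trajectory-variance objective, this quantity involves only per-step moments, which is why no double sampling arises in the per-step formulation.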

3.3. CONSTRAINED CUMULATIVE EXPECTED REWARD PROBLEM

While existing MVRL studies mainly focus on a constrained trajectory-variance problem, a constrained cumulative expected reward optimization is also frequently used in practical situations, such as finance (Markowitz, 1952; Markowitz et al., 2000). In the constrained cumulative expected reward problem, we solve the following problem:

$$\min_{\theta \in \Theta} \mathrm{Var}_{\pi_\theta}(R) \quad \text{s.t.} \quad \mathbb{E}_{\pi_\theta}[R] = \xi. \qquad (1)$$

Our proposed EQUM framework is based on this motivation, namely variance minimization. As shown in the following section, from a computational perspective, there is a critical difference between the constrained cumulative expected reward and trajectory-variance problems.

4. EQUM FRAMEWORK

In this paper, as a novel approach for MVRL, we propose Expected Quadratic Utility Maximization RL (EQUM). In the economic model, using two parameters $\alpha > 0$ and $\beta \ge 0$, the quadratic utility function for the cumulative reward $R$ is defined as $U(R; \alpha, \beta) = \alpha R - \frac{1}{2}\beta R^2$ (Luenberger, 1998). The quadratic utility function captures the preference of a risk-averse agent over the cumulative reward $R$. Let us consider the expected quadratic utility defined as

$$\mathbb{E}_{\pi_\theta}[U(R; \alpha, \beta)] = \alpha \mathbb{E}_{\pi_\theta}[R] - \frac{1}{2}\beta \mathbb{E}_{\pi_\theta}[R^2] = \alpha \mathbb{E}_{\pi_\theta}[R] - \frac{1}{2}\beta (\mathbb{E}_{\pi_\theta}[R])^2 - \frac{1}{2}\beta \mathbb{E}_{\pi_\theta}\big[(\mathbb{E}_{\pi_\theta}[R] - R)^2\big]. \qquad (2)$$

In the EQUM framework, we train a policy by maximizing the expected quadratic utility.
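The decomposition in (2) can be checked numerically with simulated returns (a self-contained sketch; the return distribution is arbitrary). The sample analogue of $\alpha\mathbb{E}[R] - \frac{1}{2}\beta\mathbb{E}[R^2]$ coincides exactly with the mean-variance form, since $\mathbb{E}[R^2] = (\mathbb{E}[R])^2 + \mathrm{Var}(R)$:

```python
import numpy as np

rng = np.random.default_rng(1)
R = rng.normal(loc=2.0, scale=1.5, size=100_000)   # simulated cumulative rewards
alpha, beta = 1.0, 0.2

# Raw-moment form: alpha * E[R] - 0.5 * beta * E[R^2]
utility = alpha * R.mean() - 0.5 * beta * np.mean(R ** 2)
# Mean-variance form: alpha * E[R] - 0.5 * beta * (E[R])^2 - 0.5 * beta * Var(R)
mean_variance = alpha * R.mean() - 0.5 * beta * R.mean() ** 2 - 0.5 * beta * R.var()
```

The two quantities agree to floating-point precision, which is the identity that lets the EQUM avoid the $(\mathbb{E}_{\pi_\theta}[R])^2$ term.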

4.1. INTERPRETATIONS

Here, we introduce four interpretations of the EQUM. We can interpret the EQUM as an approach for (i) an expected utility maximization, (ii) a targeting optimization problem to achieve an expected cumulative reward $\zeta$, (iii) a constrained cumulative expected reward problem with a quadratic penalty function, and (iv) an expected cumulative reward maximization with regularization.

First, we discuss the connection with training an agent to achieve a predefined return (Berger, 1985). Let $\zeta$ be a target return that the algorithm aims to achieve. Then, we consider the mean squared error (MSE) minimization between the return and $\zeta$:

$$\arg\min_{\theta \in \Theta} J(\theta; \zeta) = \arg\min_{\theta \in \Theta} \mathbb{E}_{\pi_\theta}\big[(\zeta - R)^2\big]. \qquad (3)$$

We can decompose the MSE into the bias and variance as follows:

$$\mathbb{E}_{\pi_\theta}\big[(\zeta - R)^2\big] = \underbrace{(\zeta - \mathbb{E}_{\pi_\theta}[R])^2}_{\text{Bias}} + \underbrace{2\mathbb{E}_{\pi_\theta}\big[(\zeta - \mathbb{E}_{\pi_\theta}[R])(\mathbb{E}_{\pi_\theta}[R] - R)\big]}_{=0} + \underbrace{\mathbb{E}_{\pi_\theta}\big[(\mathbb{E}_{\pi_\theta}[R] - R)^2\big]}_{\text{Variance}} = \zeta^2 - 2\zeta\mathbb{E}_{\pi_\theta}[R] + (\mathbb{E}_{\pi_\theta}[R])^2 + \mathbb{E}_{\pi_\theta}\big[(\mathbb{E}_{\pi_\theta}[R] - R)^2\big].$$

Thus, the minimization problem (3) trains the policy $\pi_\theta$ to trade off the bias $(\zeta - \mathbb{E}_{\pi_\theta}[R])^2$ against the variance $\mathbb{E}_{\pi_\theta}[(\mathbb{E}_{\pi_\theta}[R] - R)^2]$ (Section 9.5 of Luenberger (1998)). Moreover, the EQUM is equivalent to the reward-targeting optimization when $\zeta = \alpha/\beta$; that is,

$$\arg\min_{\theta \in \Theta} J\Big(\theta; \frac{\alpha}{\beta}\Big) = \arg\min_{\theta \in \Theta} \Big(\frac{\alpha}{\beta}\Big)^2 - 2\frac{\alpha}{\beta}\mathbb{E}_{\pi_\theta}[R] + (\mathbb{E}_{\pi_\theta}[R])^2 + \mathbb{E}_{\pi_\theta}\big[(\mathbb{E}_{\pi_\theta}[R] - R)^2\big] = \arg\max_{\theta \in \Theta} \mathbb{E}_{\pi_\theta}[U(R; \alpha, \beta)].$$

Second, the bias-variance trade-off heuristically provides a solution to the constrained optimization problem (1) with the constraint $\xi = \zeta = \alpha/\beta$ by solving the following penalized problem:

$$\min_{\theta \in \Theta} \mathrm{Var}_{\pi_\theta}(R) + (\mathbb{E}_{\pi_\theta}[R] - \xi)^2. \qquad (4)$$

Third, we can regard the quadratic utility maximization as an expected cumulative reward maximization with a regularization term defined as $\mathbb{E}[R^2]$; that is, minimization of the risk $\mathcal{R}(\theta)$:

$$\mathcal{R}(\theta) = \underbrace{-\mathbb{E}_{\pi_\theta}[R]}_{\text{risk of expected cumulative reward maximization}} + \underbrace{\psi \mathbb{E}_{\pi_\theta}[R^2]}_{\text{regularization term}}, \qquad (5)$$

where $\psi > 0$ is a regularization parameter and $\psi = \frac{\beta}{2\alpha} = \frac{1}{2\zeta}$. As $\zeta \to \infty$, $\mathcal{R}(\theta) \to -\mathbb{E}_{\pi_\theta}[R]$.
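The equivalence between the reward-targeting MSE (3) and the EQUM objective with $\zeta = \alpha/\beta$ can be verified numerically (a sketch with simulated returns; the distribution is arbitrary). The sample MSE equals both its bias-variance decomposition and $\zeta^2 - \frac{2}{\beta}\,\widehat{\mathbb{E}}[U(R;\alpha,\beta)]$, so minimizing the MSE and maximizing the expected quadratic utility coincide:

```python
import numpy as np

rng = np.random.default_rng(2)
R = rng.normal(loc=1.0, scale=2.0, size=50_000)    # simulated cumulative rewards
alpha, beta = 1.0, 0.5
zeta = alpha / beta                                 # implied target reward

mse = np.mean((zeta - R) ** 2)                      # sample J(theta; zeta)
bias_var = (zeta - R.mean()) ** 2 + R.var()         # bias^2 + variance decomposition
utility = alpha * R.mean() - 0.5 * beta * np.mean(R ** 2)
offset_form = zeta ** 2 - (2.0 / beta) * utility    # J = zeta^2 - (2 / beta) * E[U]
```

Since $\zeta^2$ does not depend on $\theta$, the last identity shows that the argmin of $J$ and the argmax of $\mathbb{E}[U]$ are attained at the same policy.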

4.2. MERITS OF THE EQUM FRAMEWORK

In this section, we present two advantages of the EQUM framework. The first advantage is computational. The EQUM framework is an MVRL method; however, compared with existing MVRL methods, which involve the double sampling issue, its computation is much simpler because we transform the MVRL problem (4) into an optimization problem without the term $(\mathbb{E}_{\pi_\theta}[R])^2$ (see (2) and (5)). The second advantage is that it admits a variety of interpretations. Because the EQUM framework can be interpreted through economic theory, it is applicable to modeling economic dynamics. In addition, as an MVRL method, it is suitable for various real-world applications, such as finance (Deng et al., 2016) and playing games. By regarding the proposed EQUM as a regularization framework, we can also combine it with existing RL methods.

4.3. IMPLEMENTATIONS OF THE EQUM FRAMEWORK

Here, we discuss implementations of the EQUM framework.

Simplest policy gradient with EQUM: With the EQUM framework, we introduce a main algorithm with the simplest policy gradient (SPG) algorithm (Brockman et al., 2016), which is also called the REINFORCE algorithm (Williams, 1992). For an episode $k$ with length $n$, the method replaces the expectations $\mathbb{E}_{\pi_\theta}[R]$ and $\mathbb{E}_{\pi_\theta}[R^2]$ with the sample approximations $\sum_{t=1}^{n} \gamma^{t-1} r(S_t, A_t)$ and $\big(\sum_{t=1}^{n} \gamma^{t-1} r(S_t, A_t)\big)^2$, respectively (Brockman et al., 2016). Then, the unbiased gradient estimators are $R_k \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(S^k_t, A^k_t)$ for $\nabla_\theta \mathbb{E}_{\pi_\theta}[R]$ and $R_k^2 \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(S^k_t, A^k_t)$ for $\nabla_\theta \mathbb{E}_{\pi_\theta}[R^2]$, where $R_k$ is the sample approximation of the cumulative reward at episode $k$. Therefore, we optimize the policy by ascending the following gradient:

$$\Big(\alpha R_k - \frac{1}{2}\beta R_k^2\Big) \sum_{t=1}^{n} \nabla_\theta \log \pi_\theta(S^k_t, A^k_t).$$

Actor-critic with EQUM: As another combination with the EQUM framework, we apply an actor-critic (AC) based algorithm, which is also referred to as the advantage actor-critic (A2C) algorithm (Williams & Peng, 1991; Mnih et al., 2016). Extending the AC algorithm, for an episode $k$ with length $n$, we train the policy by the gradient

$$\nabla_\theta \log \pi_\theta(S^k_t, A^k_t) \Big(\alpha \hat{R}^k_{t:t+n-1} - \frac{1}{2}\beta \hat{R}^{k,2}_{t:t+n-1} - \alpha M^{(1)}_{\omega^{(1)}_k}(S^k_t) - \frac{1}{2}\beta M^{(2)}_{\omega^{(2)}_k}(S^k_t)\Big),$$

where $\hat{R}^k_{t:t+n-1} = R^k_{t:t+n-1} + \gamma^n M^{(1)}_{\omega^{(1)}_k}(S^k_{t+n})$, $\hat{R}^{k,2}_{t:t+n-1} = \big(R^k_{t:t+n-1} + \gamma^n M^{(1)}_{\omega^{(1)}_k}(S^k_{t+n})\big)^2$, and $M^{(1)}_{\omega^{(1)}_k}(S^k_t)$ and $M^{(2)}_{\omega^{(2)}_k}(S^k_t)$ are value functions approximating $\mathbb{E}[R_{t+1:\infty}]$ and $\mathbb{E}[R^2_{t+1:\infty}]$ with parameters $\omega^{(1)}_k$ and $\omega^{(2)}_k$, respectively. For other RL algorithms, we can heuristically extend the proposed EQUM framework by adding $\mathbb{E}[R^2]$ as a regularization term.

Determining α and β: Next, we discuss the tuning of the parameters $\alpha$ and $\beta$ (equivalently, $\zeta$, $\xi$, and $\psi$).
As explained, the parameter $\psi$ has several meanings based on the interpretations of the EQUM, such as the quadratic utility function, targeting optimization, and constrained optimization. In addition, from the regularization perspective, we can adjust the parameter to maximize the expected cumulative reward on validation data. Thus, we propose the following four directions for parameter tuning. First, in economic applications such as finance, we choose $\psi = \frac{\beta}{2\alpha}$ based on theoretical economic assumptions and empirical economic studies (Ziemba et al., 1974; Kallberg & Ziemba, 1983) (Appendix A). For instance, the Capital Asset Pricing Model (CAPM), one of the most popular economic models, is also based on the quadratic utility function (Sharpe, 1964; Lintner, 1965; Mossin, 1966). Second, we set $\zeta = \frac{1}{2\psi}$ as the targeted reward to achieve. Third, we regard the parameter $\psi$ as implied by the constrained problem (1). Fourth, we optimize the regularization parameter $\psi$ through cross-validation.
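The SPG-with-EQUM update described in this subsection can be sketched as follows (our own minimal implementation for a state-independent softmax policy over discrete actions; the function name and toy setup are assumptions, not the paper's code). Each episode's score-function sum is weighted by $\alpha R_k - \frac{1}{2}\beta R_k^2$:

```python
import numpy as np

def equm_spg_step(theta, episodes, alpha, beta, lr=0.01, gamma=0.99):
    """One EQUM policy-gradient ascent step for a softmax policy over discrete actions.

    episodes: list of (actions, rewards) trajectories sampled under the current policy.
    Each episode contributes (alpha * R_k - 0.5 * beta * R_k**2) * sum_t grad log pi.
    """
    grad = np.zeros_like(theta)
    probs = np.exp(theta - theta.max())
    probs /= probs.sum()                       # softmax policy (toy, state-independent)
    for actions, rewards in episodes:
        R_k = sum(g * r for g, r in zip(gamma ** np.arange(len(rewards)), rewards))
        weight = alpha * R_k - 0.5 * beta * R_k ** 2
        for a in actions:
            score = -probs.copy()
            score[a] += 1.0                    # grad_theta log pi_theta(a) for softmax
            grad += weight * score
    return theta + lr * grad / len(episodes)

theta = np.zeros(2)
new_theta = equm_spg_step(theta, [([0], [1.0])], alpha=1.0, beta=0.0)
```

With $\beta = 0$ the update reduces to plain REINFORCE; increasing $\beta$ shrinks the weight of high-return episodes, which is the variance-penalizing effect of the quadratic utility.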

5. EXPERIMENTS

This section investigates the empirical performance of the proposed EQUM with synthetic and real-world financial datasets. The goal of these experiments is to construct a portfolio with a well-controlled mean and variance through the algorithms. In addition, we also show experimental results using CartPole and Atari games. Note that the per-step rewards of portfolio selection are stochastic, whereas those of CartPole and Atari games are deterministic. However, even when the reward is deterministic, the cumulative reward has randomness owing to the stochastic policy.

5.1. EXPERIMENTS WITH SYNTHETIC PORTFOLIO SELECTION DATASET

We consider a synthetic environment with a liquid asset and a non-liquid asset. While we can sell the liquid asset at every time step $t = 1, 2, \dots, T$, we can sell the non-liquid asset only after a maturity period of $N$ steps; that is, when holding 1 liquid asset, we obtain 1.001 per period; when holding 1 non-liquid asset at the $t$-th period, we obtain 1.1 or 2 at the $(t+N)$-th period. In addition, the non-liquid asset has some risk of not being paid, with a probability $p_{\mathrm{risk}}$; that is, if the non-liquid asset defaults during the $N$ periods, we cannot obtain any profit from holding the asset. In this setup, a typical investment strategy is to construct a portfolio using both liquid and non-liquid assets to control the mean and variance. In our model, the investor may change the portfolio by investing a fixed fraction $\alpha = 0.2$ in the non-liquid asset at each time step. As a performance metric of the portfolio, we focus on the mean and variance of the cumulative reward when starting with cash 1 at the first period and investing for 50 periods following an algorithm. In particular, we aim to investigate the sensitivity of the EQUM to the parameter $\psi = 1/(2\zeta)$ and compare the EQUM with the standard SPG and AC algorithms. We apply the SPG algorithm with the EQUM (Section 4.3) to the synthetic datasets. For $\zeta$, we use $\zeta = 1, 2, 3, 4, 5, 6$.
Note that, from the regularization perspective with a parameter $\psi$, the EQUM with $\psi = 0.1$ is equivalent to minimizing the MSE between the cumulative reward and $\zeta = 5 = 1/(2 \times 0.1)$. We train the model with 500 episodes. In Figure 1, we plot the average reward and standard deviation at each episode of the training process over 1,000 trials. In Table 1, using the trained model in the test environment, we show the average reward (AR) and standard deviation (SD) over 1,000 trials. For each trial, we compute the AR and SD over 100 runs and then average the AR and SD over the 1,000 trials. In Figure 2, for the method of Tamar et al. (2012) (Tamar) with various variance constraints (var), we plot the AR and variance (Var) computed on the train environment over 1,000 trials. We show the Var instead of the SD because Tamar controls the variance. Compared with Tamar, the EQUM returns Pareto-efficient portfolios; that is, higher AR and lower Var. We consider that, unlike the direct optimization of the EQUM, Tamar suffers from its more complicated optimization mechanism.

5.2. EXPERIMENTS WITH REAL-WORLD DATASET

We use the well-known Fama & French (1992) (FF) benchmark datasets to ensure the reproducibility of the experiment. We use the FF25, FF48, and FF100 datasets. The FF25 and FF100 datasets include 25 and 100 portfolios formed on size and book-to-market ratio, respectively; the FF48 dataset contains 48 portfolios representing different industrial sectors. We formulate the problem as an episodic MDP. All datasets cover monthly data from July 1980 to June 2020. The state is the past 12 months of returns of each asset, and the action is the portfolio weight; that is, the number of actions is equal to the number of assets. The reward is the portfolio return. Here, the portfolio return at time $1 \le t \le T$ is defined as $y_t = \sum_{j=1}^{m} y_{j,t} w_{j,t-1}$, where $y_{j,t}$ is the return of asset $j$ at time $t$, $w_{j,t-1}$ is the weight of asset $j$ in the portfolio at time $t-1$, and $m$ is the number of assets. The length of an episode is 1 year (12 months). For the stochastic policy, we adopt a three-layer feed-forward neural network with the ReLU activation function, where the numbers of units in the respective layers are the number of assets, 100, 50, and the number of actions. We use the softmax function for the output layer. Portfolio models: We use the following portfolio models. An equally-weighted portfolio (EW) weights the financial assets equally (DeMiguel et al., 2007). A mean-variance portfolio (MV) computes the optimal variance under a mean constraint (Markowitz, 1952). For computing the mean vector and covariance matrix, we use the latest 10 years (120 months) of data. A Kelly growth optimal portfolio with ensemble learning (EGO) is proposed by Shen et al. (2019). We set the number of resamples as $m_1 = 50$, the size of each resample $m_2 = 5\tau$, the number of periods of return data $\tau = 60$, the number of resampled subsets $m_3 = 50$, and the size of each subset $m_4 = n^{0.7}$, where $m$ is the number of assets; that is, $m = 25$ in FF25, $m = 48$ in FF48, and $m = 100$ in FF100.
A portfolio blending via Thompson sampling (BLD) is proposed by Shen & Wang (2016). We use the latest 10 years (120 months) of data to compute the sample covariance matrix and blending parameters. A policy gradient with variance-related risk criteria (Tamar) is proposed by Tamar et al. (2012). We set the target variance term $\eta$ to 250, 500, and 1000. A block coordinate ascent algorithm is proposed by Xie et al. (2018), which is referred to as the Mean-Variance Portfolio (MVP). We set the regularization parameter $\delta$ to 10, 100, and 1000. We denote the SPG with the EQUM framework as EQUM. The parameter $\psi$ is chosen from 1/3, 2/3, and 1. For optimizing Tamar, MVP, and EQUM, we use the Adam optimizer with a learning rate of 0.01 and a weight decay parameter of 0.1. We train the neural networks for 10 episodes. Each portfolio is updated by sliding the window one month ahead.

Performance metrics:

The following measures, widely used in finance to evaluate portfolio strategies (Brandt, 2010), are chosen: the cumulative return (CR), the annualized risk (RISK) defined as the standard deviation of the returns, and the risk-return ratio $\mathrm{R/R} = \sqrt{T} \times \mathrm{CR}/\mathrm{RISK}$. R/R is the most important measure for a portfolio strategy. We also evaluate the maximum drawdown (MaxDD), which is another widely used risk measure (Magdon-Ismail & Atiya, 2004) for portfolio strategies. In particular, MaxDD is the largest drop from a peak, defined as $\mathrm{MaxDD} = \min_{t \in [1,T]} \big(0, \frac{W_t}{\max_{\tau \in [1,t]} W_\tau} - 1\big)$, where $W_t$ is the cumulative return of the portfolio until time $t$; that is, $W_t = \prod_{t'=1}^{t}(1 + y_{t'})$. Table 3 reports the performance of the portfolios. In almost all cases, the EQUM portfolio achieves the highest R/R and the lowest MaxDD. Therefore, we can confirm that the EQUM portfolio attains a high R/R while avoiding large drawdowns. The realized objective (minimizing variance with a penalty on return targeting) for Tamar, MVP, and EQUM is shown in Appendix B.3. Except for MVP on FF48, the EQUM attains the smallest objective value. Since the ranking of the objective values matches that of the R/R, we can empirically confirm that better optimization leads to better performance.
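These metrics can be computed from a monthly return series as follows (a minimal sketch; we assume CR is the total compounded return, since its exact definition is not spelled out here, while MaxDD follows the formula above):

```python
import numpy as np

def cumulative_return(y):
    """CR: total compounded return of the return series y_1, ..., y_T."""
    return float(np.prod(1.0 + np.asarray(y, dtype=float)) - 1.0)

def max_drawdown(y):
    """MaxDD = min_t (0, W_t / max_{tau<=t} W_tau - 1), with W_t = prod(1 + y)."""
    W = np.cumprod(1.0 + np.asarray(y, dtype=float))
    peak = np.maximum.accumulate(W)            # running maximum of the wealth curve
    return float(min(0.0, (W / peak - 1.0).min()))

returns = [0.1, -0.5, 0.2]                     # toy monthly returns
```

For the toy series, the wealth curve is 1.1, 0.55, 0.66, so CR is −0.34 and MaxDD is −0.5 (the drop from the 1.1 peak to 0.55).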

5.3. EXPERIMENTS WITH CARTPOLE AND ATARI GAMES

We also conduct experiments using CartPole and Atari games, where the reward is deterministic, and the randomness of the cumulative reward depends only on the stochastic policy. We investigate the sensitivity of $\psi = 1/(2\zeta)$ with CartPole and compare the performance of the EQUM framework with that of Tamar et al. (2012) and Xie et al. (2018). The results are shown in Appendix B.4. In many experimental results, contrary to our expectations, we observed that the EQUM also improves the expected cumulative reward, not only the variance. We hypothesize that this is because there is often a limit on the cumulative rewards achievable by standard expected cumulative reward maximizing algorithms. For instance, when the reward at each period is 1 and the discount factor is 0.99, the infinite discounted sum is 100. In such a case, instead of naive reward maximization, MSE minimization against the target reward 100 may result in more stable performance empirically. We also hypothesize that even if an optimal policy is deterministic, the EQUM can improve the stability of the training process by reducing the variance. This observation is an open problem related to the exploration-exploitation trade-off. Therefore, unless the cumulative reward can be increased without bound, the proposed EQUM framework can stabilize performance. From this aspect, we can confirm that the EQUM provides a regularization effect.
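The ceiling mentioned above is just the geometric series bound: with per-step reward 1 and $\gamma = 0.99$, the discounted sum can never exceed $1/(1-\gamma) = 100$, which a quick check confirms:

```python
gamma = 0.99
limit = 1.0 / (1.0 - gamma)                              # = 100, the reachable ceiling
partial = [sum(gamma ** t for t in range(n)) for n in (10, 100, 10_000)]
# The partial sums increase toward, but never pass, the limit.
```

This is why targeting $\zeta = 100$ via MSE minimization is, in this example, aligned with maximizing the reward itself.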

6. CONCLUSION

In this paper, we proposed the EQUM framework as a variant of MVRL. Compared with existing MVRL methods, the EQUM framework is computationally friendly. The proposed EQUM framework also admits various interpretations, such as targeting optimization and regularization, and is suitable for many real-world applications, such as finance and playing games. We investigated the effectiveness of the EQUM framework compared with standard RL and existing MVRL methods through experiments. In the results, the proposed method successfully controls the mean-variance trade-off. As an open problem, we also observed that even when an optimal policy is deterministic, the proposed method improves performance. We hypothesize that the proposed method contributes to stabilizing the training process.

A ECONOMIC THEORY AND QUADRATIC UTILITY FUNCTION

Considering the mean-variance trade-off in a portfolio and economic activity is an essential task in economics, as Tamar et al. (2012) and Xie et al. (2018) pointed out. The mean-variance trade-off is justified by assuming either a quadratic utility function for the economic agent or a multivariate normal distribution for the financial assets. Under either assumption, the expected utility of the agent is maximized by maximizing the expected reward and minimizing the variance. Based on this observation, Markowitz (1952) proposed the following steps for providing a portfolio to an economic agent (see also Markowitz (1959), page 288, and Luenberger (1998)):

• Construct portfolios minimizing the variance under several reward constraints;

• Among the portfolios constructed in the first step, the economic agent chooses the portfolio maximizing the utility function.

Therefore, the goal of Markowitz's portfolio is not only to construct the portfolio itself but also to maximize the expected utility of the agent. In traditional economics, this two-step procedure is adopted because directly predicting the reward and variance to maximize the expected utility is difficult; therefore, an economist first gathers information through analysis, then constructs the portfolios using that information and provides the set of portfolios to an economic agent. However, owing to the recent development of machine learning, we can directly represent complicated economic dynamics using flexible models, such as deep neural networks. In addition, as Tamar et al. (2012) and Xie et al. (2018) reported, when constructing the mean-variance portfolio in RL, we suffer from the double sampling issue. Therefore, in this paper, we aim to achieve the original goal of the mean-variance approach; that is, expected utility maximization.
Note that this idea is not restricted to financial applications; it can be applied to any setting where the agent's utility can be represented by the mean and variance alone. In the following subsections, we review the existing studies on finance and the quadratic utility function.

A.1 MARKOWITZ'S PORTFOLIO AND CAPITAL ASSET PRICING MODEL

Markowitz's portfolio is known as the mean-variance portfolio (Markowitz, 1952; Markowitz et al., 2000). Constructing the mean-variance portfolio is motivated by the agent's expected utility maximization. When the utility function is the quadratic utility function, or the financial assets follow a multivariate normal distribution, a portfolio maximizing the agent's expected utility is given as a portfolio with minimum variance under a certain expected reward. The Capital Asset Pricing Model (CAPM) theory is a concept closely related to Markowitz's portfolio (Sharpe, 1964; Mossin, 1966; Lintner, 1965). This theory explains the expected return of investors when investing in a financial asset; that is, it derives the optimal price of the financial asset. To derive this theory, as with Markowitz's portfolio, we assume the quadratic utility function for the investors or the multivariate normal distribution for the financial assets. Merton (1969) extended the static portfolio selection problem to a dynamic case. Fishburn & Porter (1976) studied the sensitivity of the portfolio proportion when the safe and risky asset distributions change under the quadratic utility function. Thus, various studies have investigated the relationship between the utility function and risk-averse optimization (Tobin, 1958; Kroll et al., 1984; Bulmuş & Özekici, 2014; Bodnar et al., 2015).

A.2 EMPIRICAL STUDIES ON THE UTILITY FUNCTIONS

The standard financial theory is built on the assumption that the economic agent has a quadratic utility function. Supporting this theory, there are several empirical studies estimating the parameters of the quadratic utility function. Ziemba et al. (1974) investigated the change in the portfolio proportion when the parameter of the quadratic utility function changes, using a Canadian financial dataset. Recently, Bodnar et al. (2018) investigated the risk parameters ($\alpha$ and $\beta$ in our formulation of the quadratic utility function) using market indexes around the world. They found that the utility function parameters depend on the market data model.

A.3 CRITICISM

Owing to the simple form of the quadratic utility function, financial models based on this utility are widely accepted in practice. However, there is also criticism that the simple form cannot capture complicated real-world utility functions. For instance, Kallberg & Ziemba (1983) criticized the use of the quadratic utility function and proposed using a utility function including higher moments. That study also provided empirical analyses using a U.S. financial dataset to investigate the properties of the alternative utility functions. Nevertheless, to the best of our knowledge, financial practitioners still use financial models based on the quadratic utility function. We consider this is because the simple form improves the interpretability of the financial models.

B DETAILS OF EXPERIMENTS

In this section, we present additional experiments and describe the experimental details. To implement the SPG and AC algorithms, we mainly follow the official PyTorch reinforcement learning example¹.

B.1 PARETO EFFICIENCY ON THE TEST ENVIRONMENT OF THE SYNTHETIC PORTFOLIO SELECTION DATASET

To investigate Pareto efficiency, Figure 2 in Section 5.1 shows the AR and Var on the train environment. Here, we also show the AR and Var on the test environment in Figure 3.

B.2 EXPERIMENTS USING THE SYNTHETIC AMERICAN-STYLE OPTION DATASET

Among the various options in finance, an American-style option is a contract that allows the holder to exercise the option right at any time before the maturity time τ; that is, a buyer of a call option has the right to buy the asset at the call strike price W_call at any time, and a buyer of a put option has the right to sell the asset at the put strike price W_put at any time. In the setting of Tamar et al. (2014) and Xie et al. (2018), the buyer simultaneously buys a call and a put option with strike prices W_call and W_put, respectively, and maturity time τ. If the buyer exercises the options at time t, the buyer obtains a reward r_t = max(0, W_put − x_t) + max(0, x_t − W_call), where x_t is the asset price. We set x_0 = 1 and define the stochastic process as x_t = x_{t−1} f_u with probability p and x_t = x_{t−1} f_d with probability 1 − p, where f_u and f_d are the up and down factors. These parameters follow Xie et al. (2018). Under this setting, we investigate the performance of our EQUM with ζ = 0.1, 0.3, 0.5, 0.7, 1, 1.3. For the other settings, we follow the previous experiment for portfolio management. In Figure 4, we plot the average reward and standard deviation at each episode of the training process over 1,000 trials. In Table 4, using the trained model on the test environment, we show the average reward (AR) and standard deviation (SD) over 1,000 trials. We show the MSE between the realized reward and the target reward ζ in Table 5, where the method with the lowest MSE is highlighted in bold. As in the experimental results with the synthetic portfolio dataset, we can confirm that the EQUM controls the risk well. In addition, as in the experiments with CartPole and the Atari games, we observe that the EQUM also increases the expected reward, contrary to our expectation.
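The reward dynamics above can be sketched in a few lines. This is a minimal illustration only: the actual strikes, up/down factors, up probability, and maturity are elided in the text (the experiments take them from Xie et al. (2018)), so the numbers below are hypothetical placeholders, and we exercise at maturity purely for illustration, whereas the RL agent actually chooses the exercise time.

```python
import random

def exercise_reward(x, w_call, w_put):
    # Exercise payoff: r = max(0, W_put - x) + max(0, x - W_call).
    return max(0.0, w_put - x) + max(0.0, x - w_call)

def simulate_option_episode(w_call, w_put, f_u, f_d, p_up, maturity, seed=None):
    # Binomial asset path: x_t = x_{t-1} * f_u with probability p_up,
    # else x_t = x_{t-1} * f_d, starting from x_0 = 1.
    rng = random.Random(seed)
    x = 1.0
    for _ in range(maturity):
        x *= f_u if rng.random() < p_up else f_d
    # Exercise at maturity for illustration.
    return exercise_reward(x, w_call, w_put)

# Hypothetical parameter values (the true ones are not stated in the text).
reward = simulate_option_episode(w_call=1.1, w_put=0.9, f_u=1.05, f_d=0.95,
                                 p_up=0.5, maturity=20, seed=0)
```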

B.3 DETAILS OF EXPERIMENTS OF PORTFOLIO OPTIMIZATION

The real objective (minimizing variance with a penalty on return targeting) for Tamar, MVP, and EQUM is shown in Table 6. Except for MVP on FF48, the EQUM attains the smallest objective value. Since the ranking of the objective values coincides with that of the R/R, we can empirically confirm that better optimization leads to better performance. We also divide the performance period in two for a robustness check. Table 7 shows the first-half results from July 2000 to June 2010 and the second-half results from July 2010 to June 2020. In almost all cases, the EQUM portfolio achieves the highest R/R.

Table 7: The performance of each portfolio during the first-half out-of-sample period (from July 2000 to June 2010) and the second-half out-of-sample period (from July 2010 to June 2020) for the FF25 (upper panel), FF48 (middle panel), and FF100 (lower panel) datasets. The best performance within each dataset is highlighted in bold.

B.4 EXPERIMENTS WITH WELL-KNOWN BENCHMARKS

In this section, we report the experimental performance of the proposed EQUM framework on well-known benchmarks. We investigate how the behaviors of existing RL methods change when the additional E[R²] term is added. We use a simple two-layer perceptron for modeling the policy, following the PyTorch example (Paszke et al., 2019). For the SPG-based algorithms, we define the cumulative reward as a finite sum with γ = 1, following Tamar et al. (2012); for the AC-based algorithms, we define the cumulative reward as an infinite sum with γ = 0.99, following Prashanth & Ghavamzadeh (2013).

B.4.1 SENSITIVITY ANALYSIS ON ψ

First, we investigate the sensitivity of the EQUM framework to the parameter ψ using CartPole, a classic control problem, applying the SPG and AC methods (Section 4.3) with the EQUM framework. We use ψ = 0.001, 0.002, 0.003, 0.005, 0.01, 0.1 and compare the results with the standard SPG and AC methods. For instance, from the targeting optimization perspective, the EQUM with ψ = 0.001 is equivalent to minimizing the MSE between the cumulative reward and ζ = 1/(2 × 0.001) = 500. We train the model for 300 episodes and calculate the average reward and standard deviation at each period over 300 trials. The results are shown in Figure 5. In the SPG results, we can confirm the mean-variance trade-off. In contrast, for the SPG with ψ = 0.001 and all AC results, the EQUM framework improves the expected cumulative reward. These results are contrary to our expectations, because MVRL methods typically decrease the expected cumulative return to decrease the variance. We discuss this topic in more detail in Section 5.3.
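The targeting interpretation used above rests on the algebraic identity E[R] − ψE[R²] = −ψ E[(R − 1/(2ψ))²] + 1/(4ψ), so maximizing the quadratic utility is equivalent to minimizing the MSE to the target ζ = 1/(2ψ). A quick numerical sketch (the reward distributions below are purely illustrative, not from the experiments):

```python
import numpy as np

def equm_objective(rewards, psi):
    # Expected quadratic utility: E[R] - psi * E[R^2].
    return rewards.mean() - psi * (rewards ** 2).mean()

def mse_to_target(rewards, zeta):
    # Return-targeting view of the same objective: E[(R - zeta)^2].
    return ((rewards - zeta) ** 2).mean()

psi = 0.001
zeta = 1.0 / (2.0 * psi)  # = 500, matching the CartPole example above

rng = np.random.default_rng(0)
# Hypothetical reward distributions with different means and spreads.
candidates = [rng.normal(mu, sd, 100_000) for mu, sd in
              [(100, 10), (300, 50), (480, 30), (500, 5)]]

# Both criteria rank the candidates identically.
best_equm = max(range(len(candidates)),
                key=lambda i: equm_objective(candidates[i], psi))
best_mse = min(range(len(candidates)),
               key=lambda i: mse_to_target(candidates[i], zeta))
```

Here both criteria select the distribution concentrated at the target ζ = 500, which is the equivalence the sensitivity analysis exploits.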



¹ https://github.com/pytorch/examples/tree/master/reinforcement_learning



Figure 1: The ARs and SDs in the training process of the experiment using the synthetic dataset.

5.1 EXPERIMENTS WITH SYNTHETIC PORTFOLIO SELECTION DATASETS

Following Tamar et al. (2012) and Xie et al. (2018), we artificially generate a portfolio dataset. Let us consider a portfolio composed of two types of assets: a liquid asset, which has a fixed interest rate r_l = 1.001, and a non-liquid asset, which has a time-dependent interest rate r_nl(t) ∈ {1.1, 2}.
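One step of the wealth dynamics implied by the two stated rates can be sketched as follows. This is a simplified sketch only: the switching rule for the non-liquid rate (and any further mechanics of the original environment of Tamar et al. (2012)) is not specified in the text, so a uniform draw is used purely as a placeholder.

```python
import random

def portfolio_step(wealth, frac_nonliquid, rng):
    # Liquid asset pays the fixed rate r_l = 1.001; the non-liquid asset's
    # rate is one of {1.1, 2}. The uniform draw below is a placeholder for
    # the unspecified time-dependent rule r_nl(t).
    r_l = 1.001
    r_nl = rng.choice([1.1, 2.0])
    return wealth * ((1 - frac_nonliquid) * r_l + frac_nonliquid * r_nl)

rng = random.Random(0)
w = portfolio_step(100.0, 0.3, rng)
```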

Figure 2: Higher AR and lower Var methods are Pareto efficient.

Figure 3 in Appendix B.1 corresponds to the test-environment version of Figure 2. In Appendix B.2, we also show experimental results using another synthetic dataset of American-style options, following Tamar et al. (2014) and Xie et al. (2018).

Figure 3: The AR and Var on the test environment. Higher AR and lower Var methods are Pareto efficient.

Figure 4: The ARs and SDs in the training process of the experiment using the synthetic dataset.

Figure5: Sensitivity analysis regarding ψ. The upper two graphs are the results using the SPG-based method and the lower two graphs are the results using the AC-based algorithms. The average reward of the SPG result with ψ = 0.1 is lower than 80, and we do not show the result here.

Tamar et al. (2012), Prashanth & Ghavamzadeh (2013), and Xie et al. (2018) formulated MVRL as a constrained optimization problem defined as max_{θ∈Θ} E_{π_θ}[R] subject to Var_{π_θ}(R) ≤ b for a threshold b.

Estimating the gradient of the variance term suffers from the double sampling issue, because it requires sampling from two different trajectories to approximate E_{π_θ}[R] and ∇_θ E_{π_θ}[R]. This issue makes the optimization problem difficult.
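The contrast can be made concrete by comparing the two gradients (a sketch in our notation, where R denotes the cumulative reward of trajectory τ):

```latex
% Variance gradient: the product of two expectations requires two
% independent trajectory samples (the "double sampling" issue).
\nabla_\theta \mathrm{Var}_{\pi_\theta}(R)
  = \nabla_\theta \mathbb{E}_{\pi_\theta}[R^2]
    - 2\,\mathbb{E}_{\pi_\theta}[R]\,\nabla_\theta \mathbb{E}_{\pi_\theta}[R]

% EQUM objective: a single expectation, so one trajectory per gradient
% estimate suffices via the likelihood-ratio trick.
\nabla_\theta \bigl( \mathbb{E}_{\pi_\theta}[R] - \psi\,\mathbb{E}_{\pi_\theta}[R^2] \bigr)
  = \mathbb{E}_{\pi_\theta}\!\bigl[ (R - \psi R^2)\, \nabla_\theta \log \pi_\theta(\tau) \bigr]
```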

Table 1: The experimental results of the synthetic dataset with the ARs and SDs over 1,000 trials.

Table 2: The experimental results of the synthetic dataset with the MSEs for ζ over 1,000 trials.

From Figure 1 and Table 1, we can clearly confirm that the EQUM framework controls the mean-variance trade-off. As shown in the results, we can reduce the variance by increasing ζ. The MSE from the target ζ is shown in Table 2. Figure 2 compares the EQUM with various ζ and the method of Tamar et al. (2012).

The performance of each portfolio model during the out-of-sample period (from July 2000 to June 2020) for the FF25 (upper table), FF48 (middle table), and FF100 (lower table) datasets. For each dataset, the best performance is highlighted in bold.


Table 6: The real objective (minimizing variance with a penalty on return targeting) for Tamar, MVP, and EQUM on the FF25 (upper panel), FF48 (middle panel), and FF100 (lower panel) datasets. The best value within each dataset is highlighted in bold.

Half Period (from July 2000 to June 2010): CR↑ 122.71 -14.30 159.25 83.19 1.32 123.11 72.27 135.12 148.24 15.86 107.99 118.99 73.20; RISK↓ 15.45 33.97 17.06 9.96 16.46 19.10 19.75 27.53 25.70 21.88 17.97 19.88 13.
Half Period (from July 2000 to June 2010): CR↑ 73.25 -44.84 87.78 48.85 143.26 146.59 166.74 218.62 55.87 170.36 215.59 142.66 199.92; RISK↓ 19.60 30.86 20.53 12.21 16.99 11.21 18.45 20.52 31.72 22.13 13.03 12.69 13.

B.4.2 EXPERIMENTS OF ATARI GAMES

When playing games, we often consider risk control while maintaining a certain level of reward. With this motivation, we benchmark the proposed EQUM framework on four Atari game tasks from the OpenAI Gym (Brockman et al., 2016). Among the games, we choose BeamRider, Seaquest, Qbert, and SpaceInvaders, on which the SPG and AC methods work well. We use simplified environments in which the observations are the RAM of the Atari machine, consisting of only 128 bytes. We compare our methods with the standard SPG algorithm and the MVRL methods of Tamar et al. (2012) and Xie et al. (2018), denoted as Tamar and MVP, respectively. We calculate the average reward (AR) at the last episode over 5 trials and the standard deviation (SD). In Table 8, we show the results of the SPG algorithm with the standard setting and with the EQUM framework, where ψ is chosen from 0.001, 0.003, 0.005, 0.01, 0.03, and 0.3. Note that using ψ = 0.005 is equivalent to minimizing the MSE between the expected cumulative reward and the target 1/(2 × 0.005) = 100. We choose the parameter of Tamar from 200 and 2000, denoted as b in Tamar et al. (2012), and the parameter of MVP from 1 and 10, denoted as λ in Xie et al. (2018); the parameters are denoted as param in Table 8. In almost all cases, the EQUM framework shows a better AR than the standard methods. As in the previous sensitivity experiment, this result is contrary to our expectation.
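The change these experiments apply to SPG is only the extra E[R²] term in the episode objective. A minimal sketch of the resulting per-episode REINFORCE-style loss (function and argument names are ours, not from the paper's code):

```python
import numpy as np

def equm_spg_loss(log_probs, rewards, psi, gamma=1.0):
    # log_probs: log pi(a_t | s_t) for each step of one episode.
    # rewards:   per-step rewards of the same episode.
    # Standard SPG maximizes E[R]; the EQUM variant maximizes
    # E[R] - psi * E[R^2], so the episode weight R becomes R - psi * R^2.
    discounts = gamma ** np.arange(len(rewards))
    R = float(np.sum(discounts * np.asarray(rewards)))
    weight = R - psi * R ** 2  # reduces to plain SPG when psi = 0
    return -weight * float(np.sum(log_probs))  # minimize the negative

loss = equm_spg_loss(log_probs=[-0.1, -0.2], rewards=[1.0, 1.0], psi=0.005)
```

Setting ψ = 0 recovers the standard SPG episode loss, which is why the comparison in Table 8 isolates the effect of the E[R²] term.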


Under review as a conference paper at ICLR 2021 

