FACTOR LEARNING PORTFOLIO OPTIMIZATION INFORMED BY CONTINUOUS-TIME FINANCE MODELS

Anonymous

Abstract

We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors.

1. INTRODUCTION

Portfolio optimization studies how to allocate investments across multiple risky financial assets, such as stocks, and safe assets, such as US government bonds. The investment target is often formulated as maximizing the expected utility of the investment portfolio's value at a fixed time horizon, which conceptually maximizes profit while constraining risk (von Neumann & Morgenstern, 1947). With continuous-time stochastic models of stock prices, great advances in the expected utility maximization framework were made in Merton (1969) using stochastic optimal control (dynamic programming) methods. More realistic models incorporate factors such as economic indices and proprietary trading signals (Merton et al., 1973; Fama & French, 2015; 1992), which (i) affect the dynamics of stock prices; (ii) evolve stochastically over time; (iii) are not affected by individual investment decisions. With greater data availability, it is natural to design and apply data-driven machine learning methods (Bengio, 1997; Dixon et al., 2020; De Prado, 2018) to handle factors in portfolio optimization.

This work proposes a novel method, Factor Learning Portfolio Optimization (FaLPO), which combines tools from both machine learning and continuous-time finance. Portfolio optimization with stochastic factors is challenging for three reasons. First, financial data is notoriously noisy and idiosyncratic (Goyal & Santa-Clara, 2003), which makes complex, purely data-driven methods unstable and prone to overfitting. Second, the relationship between the factors and stock prices can be extremely complicated and difficult to model ex ante. Third, many successful finance models are formulated in continuous time and require interacting with the environment infinitely frequently; as a result, they cannot easily be combined with machine learning methods, many of which operate in discrete time.
Current approaches to portfolio optimization broadly fall into two categories: reinforcement learning (RL) and continuous-time finance methods. Many RL solutions to portfolio optimization build on deep deterministic policy gradient (Silver et al., 2014; Hambly et al., 2021, Section 4.3). Such methods parameterize the policy as a neural network with strong representation power and train it by optimizing the corresponding portfolio performance. However, these approaches (as well as other model-free methods such as Haarnoja et al., 2018) have high sample complexity and tend to overfit due to the high noise in the data. Other RL methods explicitly learn representations (Watter et al., 2015; Lee et al., 2020; Laskin et al., 2021) or leverage discrete-time models (Deisenroth & Rasmussen, 2011; Gu et al., 2016; Mhammedi et al., 2020; Janner et al., 2019; Nagabandi et al., 2018). Nonetheless, these methods are not informed by continuous-time finance models and, as our experiments in Section 5 suggest, cannot benefit from structures inherent in the financial market.

Stochastic factor models can be used to mathematically derive optimal (or approximately optimal) investment policies (Kim & Omberg, 1996; Chacko & Viceira, 2005; Fouque et al., 2017; Avanesyan, 2021). To this end, one needs domain knowledge to pick and model the factors. Model calibration (a.k.a. model fitting or parameter estimation) is then conducted by maximizing a calibration objective. With the calibrated model, the optimal investment policy can be derived analytically or numerically (Merton, 1992; Fleming & Soner, 2006). This procedure of calibration followed by optimization effectively constrains the 'learning' in the optimization step and thus helps reduce overfitting to noisy data. However, this approach cannot capture complicated factor effects in the data, because the factors may be complex and unlikely to be identified manually.
Therefore, these methods may end up with oversimplified models and suffer from model bias, leading to suboptimal performance.

To tackle these limitations, we propose factor learning portfolio optimization (FaLPO), a new method that interpolates between the two aforementioned solutions (Figure 1). FaLPO includes (i) a neural stochastic factor model to handle high noise and complicated factor effects and (ii) a model-regularized policy learning method that combines continuous-time models with discrete-time policy learning. First, to reduce sample complexity and avoid overfitting, FaLPO assumes that factors and asset prices follow a parametric continuous-time finance model. To capture complicated factor effects, FaLPO models the factors with a representation function ϕ parameterized by a neural network with minimal parametric constraints. Second, for policy learning, FaLPO incorporates two regularizations derived from continuous-time stochastic factor models: a policy functional form and model calibration. Specifically, we derive a policy functional form from the neural stochastic factor model using stochastic optimal control tools and use it to parameterize the candidate policy in FaLPO; this form effectively acts as a regularizer in the learning algorithm. Model calibration and policy learning are then conducted jointly, so that the learned policy is informed by continuous-time models. Theoretically, we prove that the added continuous-time regularization leads to optimal portfolio performance as the trading frequency increases. Empirically, we demonstrate the improved performance of the proposed method in both synthetic and real-world experiments. We review related literature in Appendix A, and discuss in Appendix H how FaLPO extends beyond portfolio optimization to other decision-making problems with stochastic factors.
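To make the joint objective concrete, the following is a minimal numpy sketch of a FaLPO-style combined loss. It is not the paper's implementation: the dimensions, the tanh representation ϕ, the linear policy form, and the squared-error calibration term are all illustrative assumptions. The key structural point it shows is that the representation parameters are shared between the policy-learning term and the model-calibration term, so calibration regularizes the representation the policy uses.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d_y, d_x, d_s = 500, 4, 2, 3                         # hypothetical dimensions
features = rng.standard_normal((n, d_y))                # observed features Y_t
returns = 0.001 + 0.02 * rng.standard_normal((n, d_s))  # per-period asset returns

# Parameters shared across the two loss terms.
W = 0.1 * rng.standard_normal((d_y, d_x))      # representation phi(y) = tanh(y @ W)
theta = 0.1 * rng.standard_normal((d_x, d_s))  # policy-form parameters
B = 0.1 * rng.standard_normal((d_x, d_s))      # model (drift) parameters

def combined_objective(W, theta, B, lam=0.5):
    """Policy-learning loss plus lam times a model-calibration loss."""
    x = np.tanh(features @ W)                # learned factors X_t
    weights = x @ theta                      # stand-in for the model-derived policy form
    port_ret = np.sum(weights * returns, axis=1)
    policy_loss = -np.mean(np.log1p(port_ret))    # negative expected log utility
    calib_loss = np.mean((returns - x @ B) ** 2)  # fit of model-implied drift
    return policy_loss + lam * calib_loss

print(np.isfinite(combined_objective(W, theta, B)))  # True
```

In a full method, all three parameter blocks would be trained jointly by gradient descent on this combined loss, with the weight lam trading off portfolio performance against model fit.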

2. BACKGROUND

In this section, we first formulate the portfolio optimization problem. We then review two major solutions to this problem: deep deterministic policy gradient in reinforcement learning (RL) and stochastic factor models in continuous-time finance.

2.1. PORTFOLIO OPTIMIZATION

Problem Formulation. Portfolio optimization seeks to derive an asset-allocation policy that yields high return while maintaining low risk for the investment. Formally, consider $d_S$ risky assets with prices $S_t := [S_t^1, S_t^2, \dots, S_t^{d_S}]^\top$ and a risk-free money market account with, for simplicity, a zero interest rate of return (like cash). We observe $d_Y$ features (e.g., economic indices, market benchmarks) denoted $Y_t$. From $Y_t$, we can derive $d_X$ factors, denoted $X_t$, which (i) affect the dynamics of asset prices; (ii) evolve stochastically over time; (iii) are not affected by investment decisions. Given an initial investment capital (or wealth) $z_0$ and initial values $y_0$ and $s_0$ for $Y_t$ and $S_t$, we use a



Figure 1: Demonstration of FaLPO

