FACTOR LEARNING PORTFOLIO OPTIMIZATION INFORMED BY CONTINUOUS-TIME FINANCE MODELS Anonymous

Abstract

We study financial portfolio optimization in the presence of unknown and uncontrolled system variables referred to as stochastic factors. Existing work falls into two distinct categories: (i) reinforcement learning employs end-to-end policy learning with flexible factor representation, but does not precisely model the dynamics of asset prices or factors; (ii) continuous-time finance methods, in contrast, take advantage of explicitly modeled dynamics but pre-specify, rather than learn, factor representation. We propose FaLPO (factor learning portfolio optimization), a framework that interpolates between these two approaches. Specifically, FaLPO hinges on deep policy gradient to learn a performant investment policy that takes advantage of flexible representation for stochastic factors. Meanwhile, FaLPO also incorporates continuous-time finance models when modeling the dynamics. It uses the optimal policy functional form derived from such models and optimizes an objective that combines policy learning and model calibration. We prove the convergence of FaLPO and provide performance guarantees via a finite-sample bound. On both synthetic and real-world portfolio optimization tasks, we observe that FaLPO outperforms five leading methods. Finally, we show that FaLPO can be extended to other decision-making problems with stochastic factors.

In this section, we first formulate the portfolio optimization problem. We then review two major solutions to this problem: deep deterministic policy gradient in reinforcement learning (RL) and stochastic factor models in continuous-time finance.

Problem Formulation Portfolio optimization seeks to derive a policy of asset allocation that yields high return while maintaining low risk for the investment. Formally, consider $d_S$ risky assets with prices $S_t := [S_t^1, S_t^2, \cdots, S_t^{d_S}]^\top$ and a risk-free money market account with, for simplicity, zero interest rate of return (like cash). We observe $d_Y$ features (e.g. economic indices, market benchmarks) denoted as $Y_t$.
From $Y_t$, we can derive $d_X$ factors denoted as $X_t$, which (i) affect the dynamics of asset prices; (ii) evolve over time stochastically; (iii) are not affected by investment decisions. Given an initial investment capital (or wealth) $z_0$ and the initial values for $Y_t$ and $S_t$ as $y_0$ and $s_0$, we use a

1. INTRODUCTION

Portfolio optimization studies how to allocate investments across multiple risky financial assets such as stocks and safe assets such as US government bonds. The investment target is often formulated as maximizing the expected utility of the investment portfolio's value at a fixed time horizon, which conceptually maximizes profit while constraining risk (von Neumann & Morgenstern, 1947). With continuous-time stochastic models of stock prices, great advances in the expected utility maximization framework were made in Merton (1969) using stochastic optimal control (dynamic programming) methods. More realistic models incorporate factors like economic indices and proprietary trading signals (Merton et al., 1973; Fama & French, 1992; 2015), which (i) affect the dynamics of stock prices; (ii) stochastically evolve over time; (iii) are not affected by individual investment decisions. With greater data availability, it is natural to design and apply data-driven machine learning methods (Bengio, 1997; Dixon et al., 2020; De Prado, 2018) to handle factors for portfolio optimization. This work proposes a novel method, Factor Learning Portfolio Optimization (FaLPO), which combines tools from both machine learning and continuous-time finance.

Portfolio optimization with stochastic factors is challenging for three reasons. First, financial data is notoriously noisy and idiosyncratic (Goyal & Santa-Clara, 2003), causing complex purely data-driven methods to be unstable and prone to overfitting. Second, the relationship between the factors and their impact on stock prices can be extremely complicated and difficult to model ex ante. Third, many successful finance models are in continuous time and require interacting with the environment infinitely frequently. As a result, such models cannot be easily combined with machine learning methods, many of which are in discrete time.
Current approaches to portfolio optimization broadly fall into two categories: reinforcement learning (RL) and continuous-time finance methods. Many RL solutions to portfolio optimization are built on deep deterministic policy gradient (Silver et al., 2014; Hambly et al., 2021, Section 4.3). Such methods parameterize the policy function as a neural network with strong representation power and learn the neural network by optimizing the corresponding portfolio performance. However, these approaches (as well as other model-free methods like Haarnoja et al., 2018) have high sample complexity and tend to overfit due to the high noise in the data. Other RL methods explicitly learn representations (Watter et al., 2015; Lee et al., 2020; Laskin et al., 2021) and leverage discrete-time models (Deisenroth & Rasmussen, 2011; Gu et al., 2016; Mhammedi et al., 2020; Janner et al., 2019; Nagabandi et al., 2018). Nonetheless, these methods are not informed by continuous-time finance models and, as our experiments in Section 5 suggest, cannot benefit from structures inherent in the financial market.

Stochastic factor models can be used to mathematically derive optimal (or approximately optimal) investment policies (Kim & Omberg, 1996; Chacko & Viceira, 2005; Fouque et al., 2017; Avanesyan, 2021). To this end, one needs domain knowledge to pick and model the factors. Then, model calibration (a.k.a. model fitting, parameter estimation) is conducted by maximizing a calibration objective. With the calibrated model, the optimal investment policy can be derived analytically or numerically (Merton, 1992; Fleming & Soner, 2006). This procedure of calibration and optimization effectively constrains the 'learning' in the optimization step, and thus helps reduce overfitting to noisy data. However, this approach cannot capture the complicated factor effects in the data, because the factors may be complex and unlikely to be identified manually.
Therefore, these methods may end up with oversimplified models and suffer from model bias with suboptimal performance.

To tackle these limitations, we propose factor learning portfolio optimization (FaLPO), a new method that interpolates between the two aforementioned solutions (Figure 1). FaLPO includes (i) a neural stochastic factor model to handle high noise and complicated factor effects and (ii) a model-regularized policy learning method to combine continuous-time models with discrete-time policy learning methods. First, to reduce the sample complexity and avoid overfitting, FaLPO assumes factors and asset prices follow a parametric continuous-time finance model. To capture the complicated factor effects, FaLPO models the factors by a representation function ϕ parameterized by a neural network with minimal parametric constraints. Second, for policy learning, FaLPO incorporates two regularizations derived from continuous-time stochastic factor models: a policy functional form and model calibration. Specifically, we derive the policy functional form from the neural stochastic factor model using stochastic optimal control tools, and use it to parameterize the candidate policy in FaLPO. The use of this form in the learning algorithm effectively acts as a regularizer. Then, model calibration and policy learning are conducted jointly, such that the learned policy is informed by continuous-time models.

Theoretically, we prove that the added continuous-time regularization leads to the optimal portfolio performance as the trading frequency increases. Empirically, we demonstrate the improved performance of the proposed method in both synthetic and real-world experiments. We review the related literature in Appendix A. We also discuss in Appendix H how FaLPO extends beyond portfolio optimization to other decision-making problems with stochastic factors.
$d_S \times 1$ vector $\pi_t$ to denote the fractions of wealth invested in the $d_S$ assets at time point $t$. Note that negative values are allowed in $\pi_t$, indicating short positions. At the terminal time $T > 0$, the target is to maximize the expectation of a given utility function, $\mathbb{E}[U(Z_T^{\pi})]$, where $U : \mathcal{Z} \to \mathbb{R}$ with $\mathcal{Z} \subseteq \mathbb{R}$ is the utility function and $Z_T^{\pi}$ denotes the terminal wealth under $\pi$. Intuitively, a utility function reflects the risk preference of an investor. It is an increasing function of wealth that is also concave: it changes significantly when the wealth is low but less so when the wealth is high (Figure 2). This work focuses on the power utility $U(z; \gamma) := \frac{1}{1-\gamma} z^{1-\gamma}$ with $\mathcal{Z} = \mathbb{R}_+$, $\gamma > 0$, and $\gamma \neq 1$; and the exponential utility $U(z; \gamma) := -\frac{\exp(-\gamma z)}{\gamma}$ with $\mathcal{Z} = \mathbb{R}$ and $\gamma > 0$. Here, $\gamma$ is the investor's risk aversion coefficient and is hand-picked (instead of tuned) by the user. A larger $\gamma$ corresponds to more risk aversion, while a smaller $\gamma$ corresponds to more risk tolerance. Beyond these two utilities, our method is also applicable to other utility functions and other objective functions for portfolio optimization (see Appendix B).

Discrete- and Continuous-Time Policies Discrete- and continuous-time policies are two major types of investment policies, differing in how frequently the portfolio is rebalanced. A discrete-time policy rebalances the portfolio finitely frequently, leading to discrete-time dynamics for the wealth. Such policies are often considered in RL methods (Section 2.2). Continuous-time policies rebalance the investment infinitely frequently, leading to continuous-time dynamics for the wealth. These policies are often found explicitly in continuous-time finance models (Section 2.3).
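As a small illustration of the two utilities above, the following Python sketch evaluates them and checks the concavity property numerically. The function names and the sample wealth levels are illustrative assumptions, not part of the paper.

```python
import math


def power_utility(z: float, gamma: float) -> float:
    """Power utility U(z; gamma) = z**(1 - gamma) / (1 - gamma) for z > 0."""
    if z <= 0 or gamma <= 0 or gamma == 1:
        raise ValueError("power utility needs z > 0, gamma > 0, gamma != 1")
    return z ** (1.0 - gamma) / (1.0 - gamma)


def exponential_utility(z: float, gamma: float) -> float:
    """Exponential utility U(z; gamma) = -exp(-gamma * z) / gamma for any real z."""
    if gamma <= 0:
        raise ValueError("exponential utility needs gamma > 0")
    return -math.exp(-gamma * z) / gamma


# Concavity: one extra unit of wealth matters more at low wealth than at high wealth.
# The wealth levels (1, 2, 10, 11) and gamma = 5 are arbitrary illustrative values.
gain_low = power_utility(2.0, 5.0) - power_utility(1.0, 5.0)
gain_high = power_utility(11.0, 5.0) - power_utility(10.0, 5.0)
```

Both gains are positive (the utility is increasing), and `gain_low` exceeds `gain_high`, which is the "changes significantly when the wealth is low" behavior described in the text.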

2.2. DEEP DETERMINISTIC POLICY GRADIENT

We review deep deterministic policy gradient (DDPG; Silver et al., 2014) for portfolio optimization. DDPG parameterizes the policy function as a neural network $\pi(\cdot; \theta_D)$ with parameter $\theta_D$ and solves

$$\max_{\theta_D} V(\theta_D) \quad \text{with} \quad V(\theta_D) := \mathbb{E}\left[U\left(Z_T^{\pi(\cdot;\theta_D)}\right)\right], \tag{1}$$

where the expectation is over the terminal wealth $Z_T^{\pi(\cdot;\theta_D)}$ following the policy $\pi(\cdot; \theta_D)$. A key step of DDPG is to compute the gradient of $V(\theta_D)$ to update $\theta_D$. Following the procedure in Appendix C, this can be achieved by sampling the trajectories of $S_t$ and $Y_t$ to approximate the expectation and thus the gradient of $V(\theta_D)$.

Typically, DDPG learns a discrete-time policy that rebalances the portfolio finitely frequently. To see how the policy rebalances the portfolio, we study its corresponding wealth process $Z_t^{\pi(\cdot;\theta_D)}$, which characterizes the changes in wealth over time. Let $\Delta t > 0$ be the time interval (e.g. daily, weekly) to rebalance the portfolio and, for integer $M > 0$, let $T := M\Delta t$ be the fixed investment horizon (e.g. one or two months). At time $m\Delta t$ with $m \in \{0, 1, 2, \cdots, M-1\}$, define $\pi^i_{m\Delta t} := \pi^i(m\Delta t, S_{m\Delta t}, Z_{m\Delta t}, Y_{m\Delta t}; \theta_D)$ as the fraction of current wealth invested in the $i$th risky asset. Then, the wealth change at time $m\Delta t$ is

$$Z_{(m+1)\Delta t}^{\pi(\cdot;\theta_D)} - Z_{m\Delta t}^{\pi(\cdot;\theta_D)} = Z_{m\Delta t}^{\pi(\cdot;\theta_D)} \sum_{i=1}^{d_S} \pi^i_{m\Delta t}\, \frac{S^i_{(m+1)\Delta t} - S^i_{m\Delta t}}{S^i_{m\Delta t}},$$

where $Z_{m\Delta t}^{\pi(\cdot;\theta_D)}\, \pi^i_{m\Delta t}\, \frac{S^i_{(m+1)\Delta t} - S^i_{m\Delta t}}{S^i_{m\Delta t}}$ is the wealth change due to the investment in the $i$th risky asset. Note that the number of shares invested in an asset, $\frac{Z_{m\Delta t}^{\pi(\cdot;\theta_D)}\, \pi^i_{m\Delta t}}{S^i_{m\Delta t}}$, does not change during $(m\Delta t, (m+1)\Delta t)$: the portfolio rebalances every $\Delta t$ time.

RL methods like DDPG provide flexible representation for factors: the hidden layers of the neural network are considered as the representation learned for $Y_t$, providing strong representation power. Nonetheless, there is no explicit parametric model for the learned representation and asset prices. Consequently, such methods require large amounts of data and tend to overfit (Aboussalah, 2020).
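The discrete-time wealth recursion above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the `policy` callback signature is an assumption and, for brevity, omits the features $Y_t$; the prices and the equal-weight policy are made-up inputs.

```python
def terminal_wealth(z0, prices, policy):
    """Roll the discrete-time wealth recursion forward.

    prices: list of M + 1 price vectors, one per rebalancing time m * dt.
    policy: callable (m, s, z) -> list of wealth fractions, one per asset
            (hypothetical signature; the paper's policy also takes features Y).
    """
    z = z0
    for m in range(len(prices) - 1):
        s_now, s_next = prices[m], prices[m + 1]
        fractions = policy(m, s_now, z)
        # Sum over assets of pi^i * (S^i_next - S^i_now) / S^i_now.
        growth = sum(
            p * (sn - sc) / sc for p, sc, sn in zip(fractions, s_now, s_next)
        )
        z = z + z * growth  # wealth change over one rebalancing interval
    return z


# Two assets, two rebalancing periods, half of wealth in each asset.
prices = [[100.0, 50.0], [110.0, 50.0], [110.0, 45.0]]
z_T = terminal_wealth(1.0, prices, lambda m, s, z: [0.5, 0.5])
```

In this example wealth first grows by 5% (asset 1 rises 10%) and then falls by 5% (asset 2 drops 10%), giving a terminal wealth of 1.05 × 0.95 = 0.9975. A policy that returns all zeros keeps everything in the zero-interest money market account and leaves wealth unchanged.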

2.3. STOCHASTIC FACTOR MODELS

We review stochastic factor models in continuous-time finance. These models can explicitly formulate the dynamics and can also be used to mathematically derive the functional form of the optimal continuous-time investment policy.

Stochastic factor models are described by stochastic differential equations (SDEs) (see Oksendal 2013 and Appendix D) to formulate the dynamics of asset prices $S_t$. Specifically, with a $d_X \times 1$ factor variable $X_t$, let $W_t := [W_t^1, W_t^2, \cdots, W_t^{d_W}]^\top$ be a $d_W \times 1$ Brownian motion that characterizes random fluctuations. Then, $S_t$ and $X_t$ are assumed to follow

$$\frac{dS_t^i}{S_t^i} = f_S^i(X_t; \theta_S^*)\,dt + \sum_{j=1}^{d_W} g_S^{ij}(X_t; \theta_S^*)\,dW_t^j, \quad i \in \{1, 2, \cdots, d_S\},$$
$$dX_t = f_X(X_t; \theta_S^*)\,dt + g_X(X_t; \theta_S^*)^\top dW_t. \tag{2}$$

In (2), $f_S : \mathbb{R}^{d_X} \to \mathbb{R}^{d_S}$, $f_X : \mathbb{R}^{d_X} \to \mathbb{R}^{d_X}$, $g_S : \mathbb{R}^{d_X} \to \mathbb{R}^{d_S \times d_W}$, and $g_X : \mathbb{R}^{d_X} \to \mathbb{R}^{d_X \times d_W}$ are parametric functions pre-specified by domain knowledge. Further, $f_S$ and $f_X$ are often referred to as the drift, and $g_S$ and $g_X$ as the volatility, of $S_t$ and $X_t$ respectively. Intuitively, SDEs formulate the change of a variable in an infinitesimal time step as the sum of a deterministic part ($dt$) and a stochastic part ($dW_t$), and we use $\theta_S^*$ to parameterize the SDE. The factor $X_t$ appears in both the drift and volatility of the asset prices, thus affecting the price transition. With the parametric functional forms in (2), we can use tools in stochastic optimal control to derive the functional form of the optimal continuous-time investment policy.

Continuous-time policies change the investment in each asset at every time point. For a continuous-time investment policy $\pi_t$, the dynamics of wealth $Z_t^{\pi}$ is defined as

$$\frac{dZ_t^{\pi}}{Z_t^{\pi}} := \sum_{i=1}^{d_S} \pi_t^i\, \frac{dS_t^i}{S_t^i},$$

with $Z_0 = z_0$, $S_0 = s_0$, and $X_0 = x_0$.
Crucially, this is different from the discrete-time wealth process $Z_t^{\pi}$ in Section 2.2, as the number of shares $\frac{Z_t^{\pi} \pi_t^i}{S_t^i}$ in asset $i$ now changes continuously over time, as opposed to being rebalanced at finite intervals. This discrepancy creates obstacles to directly applying the results derived from stochastic factor models to discrete-time policy learning.

Stochastic factor models can reduce sample complexity for portfolio optimization, since the assumed functional forms in (2) significantly constrain the solution space. However, a crucial step in applying stochastic factor models is to pick or even construct, from the observed $Y_t$, an $X_t$ that perfectly follows a pre-specified model. This step often relies on domain knowledge and thus may end up with oversimplified models that suffer from model bias, eventually leading to suboptimal performance.
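To make the dynamics in (2) concrete, the sketch below simulates a scalar instance (one asset, one factor, one Brownian motion) with the Euler-Maruyama scheme, a standard discretization swapped in here for illustration; the paper's own simulation procedure is in its appendix and may differ. The drift and volatility functions and all numerical values are hypothetical.

```python
import math
import random


def euler_maruyama(s0, x0, f_s, g_s, f_x, g_x, dt, n_steps, rng):
    """Simulate scalar (S_t, X_t) following (2) on a time grid of width dt.

    Returns a list of n_steps + 1 pairs (S, X). A single Brownian motion
    drives both the price and the factor.
    """
    s, x = s0, x0
    path = [(s, x)]
    for _ in range(n_steps):
        dw = rng.gauss(0.0, math.sqrt(dt))  # Brownian increment ~ N(0, dt)
        s_next = s + s * (f_s(x) * dt + g_s(x) * dw)  # dS/S = f_S dt + g_S dW
        x_next = x + f_x(x) * dt + g_x(x) * dw        # dX = f_X dt + g_X dW
        s, x = s_next, x_next
        path.append((s, x))
    return path


# Hypothetical coefficients: the factor is the asset's drift and mean-reverts
# to 10%; the price volatility is 20%; 21 daily steps (about one month).
path = euler_maruyama(
    s0=100.0, x0=0.10,
    f_s=lambda x: x, g_s=lambda x: 0.2,
    f_x=lambda x: 2.0 * (0.10 - x), g_x=lambda x: 0.05,
    dt=1.0 / 252, n_steps=21, rng=random.Random(0),
)
```

With both volatilities set to zero the scheme reduces to deterministic compounding, which is a convenient sanity check on the drift terms.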

3. FACTOR LEARNING PORTFOLIO OPTIMIZATION

We propose factor learning portfolio optimization (FaLPO), a new decision-making framework that interpolates between DDPG and stochastic factor models. FaLPO has two components: (i) a neural stochastic factor model to handle huge noise and complicated factor effects and (ii) model-regularized policy learning to combine continuous-time models with discrete-time policy learning methods.

3.1. NEURAL STOCHASTIC FACTOR MODELS

We describe neural stochastic factor models (NSFM) and discuss their benefits. On the one hand, a neural stochastic factor model assumes the existence of a representation function $\phi$ such that the factors of the problem can be directly learned from the features: $X_t = \phi(Y_t; \theta_\phi^*)$. Here, $\phi$ is formulated as a neural network with parameter $\theta_\phi^*$ (Figure 3). As a result, FaLPO avoids hand-picking factors from features, as is the case in stochastic factor models (Section 2.3). The neural network representation has only a few parametric constraints and thus is able to capture complicated factor effects in the data. On the other hand, factors $X_t$ and asset prices $S_t$ are assumed to follow a stochastic factor model (e.g. (2) and (6)), which reduces the sample complexity and avoids overfitting.

3.2. MODEL-REGULARIZED POLICY LEARNING

Under the proposed neural stochastic factor model, we aim to learn the discrete-time optimal policy function π * t and the representation function ϕ(•; θ * ϕ ). However, while the policy learning is for discrete-time policies, our proposed model is in continuous time. To bridge this gap, we incorporate two types of continuous-time model regularization into discrete-time policy learning: (i) the policy functional form (3) and (ii) the model calibration objective (4).

Algorithm 1 (FaLPO), steps 4 to 8:
4: Parameterize the policy function according to (3).
5: Estimate the policy gradient for $H$ in (5) (Appendix C).
6: Update $\theta_\phi$, $\theta_\pi$, and $\theta_S$.
7: end for
8: Return $\pi(\cdot; \theta_\phi, \theta_\pi)$

Policy Functional Form We incorporate the functional form of a continuous-time optimal policy, derived from our model, into discrete-time policy learning. Using tools in stochastic optimal control, we can derive the functional form of an optimal continuous-time policy: $\pi_t^* = \Pi(t, S_t, Z_t, X_t; \theta_\pi^*)$, where the functional form of $\Pi$ can be obtained in many existing stochastic factor models (Kim & Omberg, 1996; Chacko & Viceira, 2005; Avanesyan, 2021; Zariphopoulou, 2001; Wachter, 2002; Kraft, 2005), and $\theta_\pi^*$ is an optimal parameter for $\Pi$. FaLPO uses the functional form of $\Pi$ in policy learning and parameterizes the candidate policy as

$$\pi(t, S_t, Z_t, Y_t; \theta_\phi, \theta_\pi) := \Pi(t, S_t, Z_t, \phi(Y_t; \theta_\phi); \theta_\pi), \tag{3}$$

where $\phi$ is the representation function for the factors in Section 3.1. As a result, $\Pi$ constrains the policy space and acts as regularization. Importantly, although $\Pi$ is derived for continuous-time policies, it can still provide guidance for discrete-time policy learning when $\Delta t$ is small. We rigorously prove the soundness of using (3) in Section 4.
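The composition in (3) can be sketched as a higher-order function: a fixed functional form `Pi` is wrapped around a learnable representation `phi`. Everything concrete below (the linear `phi`, the affine `Pi`, and the parameter values) is a hypothetical stand-in chosen for illustration; in FaLPO, `phi` is a neural network and `Pi` comes from the chosen stochastic factor model.

```python
def make_policy(Pi, phi):
    """Build pi(t, s, z, y; theta_phi, theta_pi) := Pi(t, s, z, phi(y; theta_phi); theta_pi)."""
    def policy(t, s, z, y, theta_phi, theta_pi):
        x = phi(y, theta_phi)            # factors learned from features
        return Pi(t, s, z, x, theta_pi)  # policy constrained to the form Pi
    return policy


# Hypothetical one-factor representation: a linear map of the features.
def linear_phi(y, theta_phi):
    return sum(w * v for w, v in zip(theta_phi, y))


# Hypothetical functional form: affine in the factor, ignoring t, s, and z.
def affine_Pi(t, s, z, x, theta_pi):
    slope, intercept = theta_pi
    return slope * x + intercept


policy = make_policy(affine_Pi, linear_phi)
fraction = policy(0.0, 100.0, 1.0, [1.0, 2.0], [0.5, 0.25], (2.0, 0.1))
```

Here the features `[1.0, 2.0]` map to the factor value 1.0, and the policy invests the fraction 2.0 × 1.0 + 0.1 = 2.1 of wealth in the asset. The design point is that `theta_phi` enters the policy only through `phi`, which is what later allows it to be shared with the calibration objective.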

Model Calibration

FaLPO also hinges on model calibration to regularize policy learning. Given the specific functional forms in (2), FaLPO conducts model calibration to estimate the parameters of the SDE. The calibration procedure can be summarized as maximizing a model calibration objective:

$$\max_{\theta_S} L(\theta_\phi, \theta_S). \tag{4}$$

In practice, with discrete data, one may use the likelihood (Phillips, 1972; Beskos & Roberts, 2005) or other likelihood-based objective functions (Bishwal, 2007; Ait-Sahalia & Kimmel, 2010) for $L$ (see Appendix E for concrete examples).

To harness the information provided by model calibration in policy learning, FaLPO combines the model calibration objective $L$ in (4) with the performance objective $V$ in (1) and facilitates a joint optimization over the two. Note that naively combining $L(\theta_\phi, \theta_S)$ and $V(\theta_D)$ is not effective since the two in general do not share common parameters: the parameter of the policy network $\theta_D$ has no overlap with the SDE parameter $\theta_S$ or the factor representation parameter $\theta_\phi$. However, by constraining the policy space to (3) in FaLPO, $\theta_\phi$ becomes part of the policy parameterization. Thus, $V$ can be written as $V(\theta_\phi, \theta_\pi) := \mathbb{E}\left[U\left(Z_T^{\pi(\cdot;\theta_\phi,\theta_\pi)}\right)\right]$. In other words, $\theta_\phi$ is shared by both $V$ and $L$, and hence FaLPO can carry out a joint optimization over the two:

$$\max_{(\theta_\phi, \theta_\pi, \theta_S) \in \mathcal{A}} H(\theta_\phi, \theta_\pi, \theta_S), \quad \text{with} \quad H(\theta_\phi, \theta_\pi, \theta_S) := (1-\lambda)\,V(\theta_\phi, \theta_\pi) + \lambda\,L(\theta_\phi, \theta_S), \tag{5}$$

where the candidate policy follows the functional form of the optimal continuous-time policy (3) (see Algorithm 1), and $\mathcal{A}$ denotes the considered parameter set. The model calibration objective also acts as a model regularization, where $\lambda \in (0, 1)$ is a hyperparameter determining its effect. In practice, we can optimize (5) by gradient-based methods, facilitating an easy and end-to-end learning procedure (see Appendix C for gradient estimation details).
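The way the joint objective couples the two problems is easiest to see at the gradient level. The sketch below assumes gradient estimates of $V$ and $L$ are already available (how they are estimated is deferred to the paper's Appendix C) and only shows how they would be combined; the dictionary layout and function names are illustrative assumptions.

```python
def joint_objective(v_value, l_value, lam):
    """H = (1 - lam) * V + lam * L, with lam in (0, 1)."""
    if not 0.0 < lam < 1.0:
        raise ValueError("lam must lie strictly between 0 and 1")
    return (1.0 - lam) * v_value + lam * l_value


def joint_gradient(grad_v, grad_l, lam):
    """Combine gradient estimates of V and L into a gradient of H.

    grad_v: dict with keys "phi" and "pi" (V depends on theta_phi, theta_pi).
    grad_l: dict with keys "phi" and "S"  (L depends on theta_phi, theta_S).
    Only the shared block theta_phi receives signal from both objectives.
    """
    return {
        "phi": [(1.0 - lam) * gv + lam * gl
                for gv, gl in zip(grad_v["phi"], grad_l["phi"])],
        "pi": [(1.0 - lam) * g for g in grad_v["pi"]],
        "S": [lam * g for g in grad_l["S"]],
    }


# Made-up gradient estimates with lam = 0.25.
grad = joint_gradient(
    {"phi": [1.0, -2.0], "pi": [4.0]},
    {"phi": [3.0, 2.0], "S": [8.0]},
    lam=0.25,
)
```

The resulting update direction is `{"phi": [1.5, -1.0], "pi": [3.0], "S": [2.0]}`: the representation parameters are pulled by both the performance and the calibration objectives, which is the regularization effect described above.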

3.3. EXAMPLE OF FALPO

In portfolio optimization, one can use different types of stochastic factor models. FaLPO can be applied to many such types (Appendix F). In this section, we use the Kim-Omberg model (Kim & Omberg, 1996) as an example to illustrate FaLPO's modeling and policy learning. Kim-Omberg is a standard model for portfolio optimization with stochastic factors, which has been extensively studied empirically (Welch & Goyal, 2008; Muhle-Karbe et al., 2017).

For modeling, FaLPO with the Kim-Omberg model formulates the dynamics of asset prices and factors as

$$\frac{dS_t^i}{S_t^i} = X_t^i\,dt + \sum_{j=1}^{d_W} \sigma^{ij}\,dW_t^j, \qquad dX_t = \mu(\omega - X_t)\,dt + v\,dW_t, \qquad X_t = \phi(Y_t; \theta_\phi^*), \tag{6}$$

where the SDE parameters $\omega$, $\sigma$, $v$, and $\mu$ are constant matrices or vectors.

For policy learning, we detail the policy functional form and model calibration. Under the Kim-Omberg model, we can derive the optimal policy functional form $\Pi$ in (3). Specifically, for power utility,

$$\Pi(t, S_t, Z_t, \phi(Y_t; \theta_\phi); \theta_\pi) = k_1(t; \theta_\pi)\,\phi(Y_t; \theta_\phi) + k_2(t; \theta_\pi),$$

and for exponential utility,

$$\Pi(t, S_t, Z_t, \phi(Y_t; \theta_\phi); \theta_\pi) = \frac{k_1(t; \theta_\pi)\,\phi(Y_t; \theta_\phi)}{Z_t} + \frac{k_2(t; \theta_\pi)}{Z_t},$$

where $k_1(\cdot; \theta_\pi) : [0, T] \to \mathbb{R}^{d_S \times d_X}$ and $k_2(\cdot; \theta_\pi) : [0, T] \to \mathbb{R}^{d_S \times d_X}$ are two time-dependent functions (Appendix F.2). We can derive the functional forms of $k_1$ and $k_2$ since the two are solutions to systems of ODEs related to algebraic Riccati equations (Appendix G). We can also directly use function approximators like neural networks or kernel methods for the two.

For model calibration, we use a negative mean squared loss, with the derivation deferred to Appendix E:

$$L(\theta_\phi, \theta_S) := -\mathbb{E}\left[\sum_{i=1}^{d_S} \left(\log(S^i_{t+\Delta t}) - \log(S^i_t) - \phi_i(Y_t; \theta_\phi)\,\Delta t - \theta_S^i\right)^2\right],$$

where in this case $\theta_S$ is a $d_S \times 1$ vector. Therefore, to implement FaLPO, we can parameterize the candidate policy function using $\Pi$ and optimize (5). Note that the methodology of FaLPO is also generally applicable to other decision-making problems besides portfolio optimization.
In Appendix H, we use linear quadratic control with stochastic factors as an example to demonstrate the generality of FaLPO.
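As an illustration of the Kim-Omberg calibration objective in Section 3.3, the sketch below computes its sample average from observed log prices. It assumes the factor values $\phi(Y_t; \theta_\phi)$ have already been evaluated and are passed in as plain lists; the function name, argument layout, and sample numbers are hypothetical.

```python
def calibration_objective(log_prices_now, log_prices_next, factors, theta_s, dt):
    """Sample version of the negative mean squared calibration objective.

    Each of the first three arguments is a list of samples; each sample is a
    list with one entry per asset. factors[n][i] stands for phi_i(Y_t; theta_phi)
    evaluated on sample n (assumed precomputed). theta_s has one entry per asset.
    """
    total = 0.0
    for lp_now, lp_next, x in zip(log_prices_now, log_prices_next, factors):
        # Residual of the log return against the model-implied drift.
        total += sum(
            (b - a - xi * dt - th) ** 2
            for a, b, xi, th in zip(lp_now, lp_next, x, theta_s)
        )
    return -total / len(log_prices_now)


# One asset, two samples generated so that the model fits exactly:
# log return = factor * dt + theta_s with dt = 0.5 and theta_s = -0.01.
value_fit = calibration_objective(
    [[0.0], [1.0]], [[0.09], [1.04]], [[0.2], [0.1]], [-0.01], 0.5
)
value_misfit = calibration_objective(
    [[0.0], [1.0]], [[0.09], [1.04]], [[0.2], [0.1]], [0.05], 0.5
)
```

The objective is never positive, is (numerically) zero when the residuals vanish, and decreases as the parameters move away from the fit, so maximizing it is equivalent to least-squares fitting of the log returns.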

4. THEORY

We theoretically analyze both the asymptotic and non-asymptotic characteristics of FaLPO.

4.1. ASYMPTOTIC ANALYSIS

FaLPO applies the policy functional form and model calibration derived from continuous-time models to discrete-time policy learning. We show that FaLPO can achieve the optimal performance asymptotically (i.e. with infinite data and perfect optimization). In the following, we describe the assumptions and results, and provide the formal theorem at the end.

We provide an intuitive description of the assumptions, with the formal statements provided in Appendix I.1. First, we assume that the portfolio optimization problem satisfies some standard regularity conditions (Higham et al., 2002): the drift and volatility are locally Lipschitz continuous; meanwhile, the asset prices, the stochastic factors, and the wealth process under the optimal policy have bounded moments. Second, we assume that the utility function $U(z)$ has linear growth on $z \in \mathcal{Z}$. Note that some widely used cases like power utility with $\gamma < 1$ and exponential utility with lower-bounded wealth satisfy this assumption. Third, we consider only admissible policies with parameters in $\mathcal{A}$ and we assume that $\mathcal{A}$ covers the optimal continuous-time policy parameter. A policy is admissible if it is predictable and if the wealth process $Z_t^{\pi} \in \mathcal{Z}$ for any $t \in [0, T]$ almost surely. It is a common practice to only consider such admissible policies in portfolio optimization. The last two assumptions are artifacts of the current theoretical analysis; in practice FaLPO can achieve reasonable performance without enforcing them (see Section 5).

With the foregoing assumptions, we show that the performance of FaLPO can asymptotically converge to that of the best policy in discrete time. In detail, we define $V^*_{\Delta t} := V(\pi^*)$ where $\pi^*$ is an optimal discrete-time admissible policy with time interval $\Delta t$, i.e., $V^*_{\Delta t}$ is the optimal performance obtained without constraining to the functional form (3) or leveraging model calibration like (5).
Next, define $\theta^*_{\Delta t} := (\theta^*_{\phi,\Delta t}, \theta^*_{\pi,\Delta t}, \theta^*_{S,\Delta t}) \in \arg\max_{(\theta_\phi, \theta_\pi, \theta_S) \in \mathcal{A}} H(\theta_\phi, \theta_\pi, \theta_S)$ with the policy functional form (3), such that $V(\theta^*_{\Delta t})$ is the performance that FaLPO can achieve with infinite data and perfect optimization. Then, Theorem 4.1 shows that the gap between $V^*_{\Delta t}$ and $V(\theta^*_{\Delta t})$ converges to zero as $\Delta t$ goes to zero.

THEOREM 4.1 With assumptions in Appendix I.1, $\lim_{\Delta t \to 0} \left[V^*_{\Delta t} - V(\theta^*_{\Delta t})\right] = 0$.

Theorem 4.1 justifies the methodology of FaLPO under a small time interval. The proof is provided in Appendix I.2 and Appendix I.3.

4.2. NON-ASYMPTOTIC ANALYSIS

We study the finite-sample performance of FaLPO. We describe the problem setup and major assumptions, and then provide the theorem.

In each iteration, we collect $B$ independent trajectories to estimate the gradient of $H$. Let $\theta_n$ be the estimate after the $n$th iteration, and $N$ the total number of iterations. We analyze the average estimate $\bar{\theta} := \frac{1}{N}\sum_{n=1}^{N} \theta_n$ instead of $\theta_N$, which is a common technique for stochastic optimization analysis. Specifically, we aim to bound the expected difference between $V^*_{\Delta t}$ and $V(\bar{\theta})$. Note that it is extremely challenging to theoretically analyze a non-convex stochastic optimization problem like (5) without further specifications in problem setup and assumptions (Polyak, 1963; Bhandari & Russo, 2019; Jin et al., 2021; Ma, 2020; Wang et al., 2019). Therefore, we consider a projection-based variant of FaLPO, under which the optimization process is conducted in a bounded parameter set $\mathcal{B} \subseteq \mathcal{A}$. Furthermore, we assume that the objective function $H$ is strongly concave in $\mathcal{B}$ with a local maximal point $\theta^\dagger_{\Delta t} := (\theta^\dagger_{\phi,\Delta t}, \theta^\dagger_{\pi,\Delta t}, \theta^\dagger_{S,\Delta t})$. Similar local curvature assumptions are commonly used to analyze non-convex problems (Bach et al., 2017; Loh, 2017). With the above setup and assumptions, the expected gap between $V^*_{\Delta t}$ and $V(\bar{\theta})$ satisfies the following finite-sample bound in Theorem 4.2.

THEOREM 4.2 With the aforementioned projection-based FaLPO algorithm and assumptions, both detailed in Appendix J.2, there exist positive constants $C_1$, $C_2$, $C_3$, and $C_4$ such that

$$\mathbb{E}\left[V^*_{\Delta t} - V(\bar{\theta})\right] \le \frac{e_{\Delta t}}{1-\lambda} + \frac{H(\theta^*_{\Delta t}) - H(\theta^\dagger_{\Delta t})}{1-\lambda} + \frac{C_1 \log(N)}{N(1-\lambda)} + \frac{C_1 \log(N)}{BN(1-\lambda)}\left[(1-\lambda)^2 C_2 + \lambda^2 C_3 + 2\lambda(1-\lambda) C_4\right], \tag{7}$$

where $\lambda \in [0, 1]$. Also, $e_{\Delta t}$ is an error term not related to $N$ or $B$ but dependent on $\Delta t$, with $\lim_{\Delta t \to 0} e_{\Delta t} = 0$.

Theorem 4.2 provides a non-asymptotic upper bound on the gap between the optimal performance $V^*_{\Delta t}$ and the one achieved by FaLPO, $V(\bar{\theta})$. We briefly comment on each term in the upper bound.
The term $\frac{e_{\Delta t}}{1-\lambda}$ bounds the asymptotic performance gap caused by leveraging the continuous-time policy functional form constraint and model calibration, and we explain its connection to Theorem 4.1 in Appendix J.4. The term $\frac{H(\theta^*_{\Delta t}) - H(\theta^\dagger_{\Delta t})}{1-\lambda}$ controls the performance gap between the local optimal point $\theta^\dagger_{\Delta t}$ and the global optimal point $\theta^*_{\Delta t}$. The remaining terms characterize the performance gap between $\bar{\theta}$ and $\theta^\dagger_{\Delta t}$.

Theorem 4.2 has two implications. First, the bound in (7) is a rational function of $\lambda$. Accordingly, there exist situations where a $\lambda \in (0, 1)$ provides a smaller upper bound than $\lambda = 0$, indicating the possibility that tuning $\lambda$ can provide better performance (see experiments in Appendix J). Second, when $\lambda = 1$, the bound diverges to infinity. This makes sense since when $\lambda = 1$, $H(\theta_\phi, \theta_\pi, \theta_S) = L(\theta_\phi, \theta_S)$ does not contain $\theta_\pi$: the algorithm does not learn the policy. We prove the theorem in Appendices J.1, J.3 and J.4.
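The two implications can be checked numerically with a small sketch that evaluates the right-hand side of the bound as a function of $\lambda$, together with the iterate averaging used in the analysis. All constants below are made up for illustration, and treating $e_{\Delta t}$ and the gap $H(\theta^*_{\Delta t}) - H(\theta^\dagger_{\Delta t})$ as independent of $\lambda$ is a simplifying assumption of this sketch, not a claim of the theorem.

```python
import math


def average_iterate(thetas):
    """Coordinate-wise average of the iterates theta_1, ..., theta_N."""
    n = len(thetas)
    return [sum(coords) / n for coords in zip(*thetas)]


def finite_sample_bound(lam, e_dt, h_gap, c1, c2, c3, c4, n_iter, batch):
    """Right-hand side of the finite-sample bound for a given lam in [0, 1)."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("the bound is finite only for lam in [0, 1)")
    one_minus = 1.0 - lam
    noise = one_minus ** 2 * c2 + lam ** 2 * c3 + 2.0 * lam * one_minus * c4
    rate = c1 * math.log(n_iter) / (n_iter * one_minus)
    return e_dt / one_minus + h_gap / one_minus + rate + rate / batch * noise


# Hypothetical constants: the noise constant tied to V (c2) is much larger
# than the one tied to L (c3), so shifting weight toward L shrinks the bound.
consts = dict(e_dt=0.01, h_gap=0.0, c1=1.0, c2=100.0, c3=1.0, c4=1.0,
              n_iter=1000, batch=1)
bound_at_0 = finite_sample_bound(0.0, **consts)
bound_at_half = finite_sample_bound(0.5, **consts)
bound_near_1 = finite_sample_bound(0.999, **consts)
```

With these made-up constants the bound at $\lambda = 0.5$ is smaller than at $\lambda = 0$, while it blows up as $\lambda$ approaches 1, matching the two implications discussed above.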

5. EXPERIMENTS

By incorporating continuous-time finance models into policy learning, FaLPO can deal with high data noise and complex factor effects. In this section, we demonstrate the improved performance of FaLPO against existing portfolio optimization methods, over synthetic and real-world experiments.

(ii) Deep deterministic policy gradient (DDPG; Silver et al., 2014), which many empirical portfolio optimization methods build on. (iii) Stochastic latent actor critic (SLAC): a state-of-the-art representation learning RL method that explicitly learns representation of latent variables (Lee et al., 2020). (iv) Model-based RL with rich observations (RichID): a state-of-the-art model-based RL method with representation learning (Mhammedi et al., 2020). (v) Continuous-time model-based RL (CT-MB-RL): policy gradient optimizing the performance objective using the policy functional form derived from continuous-time models, but directly treating the features as the factors.

Method      Representation Learning   Continuous-Time Model   Discrete-Time Model
MMMC        ✖                         ✔                       ✖
DDPG        ✖                         ✖                       ✖
SLAC        ✔                         ✖                       ✔
RichID      ✔                         ✔                       ✖
CT-MB-RL    ✖                         ✔                       ✖
FaLPO       ✔                         ✔                       ✖

For policy gradient methods, we pick a deterministic policy approach like DDPG because, compared to non-deterministic policy gradient alternatives, it is more suitable for portfolio optimization due to the continuous action space, high exploration cost, and high noise in financial data (Appendix A.2). For portfolio optimization, different variance reduction methods for policy gradient (Schulman et al., 2015; 2017; Xu et al., 2020) only provide minor performance improvements (Aboussalah, 2020). We hence do not report such results. For RL with representation learning and model-based RL, we focus on the methods (SLAC and RichID) that explicitly learn a representation of latent variables, since such methods are more closely related to FaLPO. There are other techniques like data augmentation, feature construction, adversarial training, and regularization that can improve the empirical performance of portfolio optimization (see the survey in Hambly et al., 2021, Section 4.3). In this work, we focus on the central methodological task of policy learning, and most such techniques can be directly combined with our proposed method.

Protocol We simulate environments with the Kim-Omberg model and implement the considered methods to compare their performance. Note that a key data generating parameter in portfolio optimization is the signal-to-noise ratio, which can be roughly characterized by the ratio between the scale of the drift and the scale of the volatility (see Appendix K.1 for a detailed explanation). We test our method under different signal-to-noise ratios. To this end, we randomly generate stock drifts of around 10% (to mimic the real-world average return of stocks in the S&P 500), vary the scale of volatility in {10%, 20%, 30%}, and simulate data following the procedure in Appendix K.2.
Then, we apply the considered methods to maximize the terminal power and exponential utility with different γ's. For each method, we tune the learning rate and other method-specific hyperparameters with early stopping (Appendix K.3). With each method-hyperparameter-environment combination, we repeat training, validation, and testing five times. Results For exponential utility maximization, Table 2 summarizes the average test utility after hyperparameter selection with 10 stocks, 10 features, and γ = 5. FaLPO outperforms all the competing methods in terms of average terminal utility. This performance gain may be explained by the factor representation learning informed by the continuous-time finance model, as other methods are incapable of doing so. Meanwhile, MMMC and CT-MB-RL underperform, which suggests the disadvantage of using oversimplified models. Compared with the more sophisticated RL methods like SLAC and RichID, the simple DDPG is fairly competitive. This is consistent Methods Energy Material Industrials Mix FaLPO -2.4 ± 1.9 -3.2 ± 1.0 -6.3 ± 2.3 -3.5 ± 1.5 DDPG -6.6 ± 1.2 -7.3 ± 1.5 -7.3 ± 2.1 -2.5 × 10 4 ± 3.3 × 10 8 SLAC -6.8 ± 0.2 -7.0 ± 1.5 -342.4 ± 886.8 -3.0 × 10 8 ± 4.3 × 10 12 RichID -6.5 ± 0.1 -6.9 ± 1.4 -6.9 ± 0.4 -8.1 ± 3.9 CT-MB-RL -4.2 ± 6.2 -5.4 ± 4.3 -11655 ± 32947.5 -5.7 ± 3.1 MMMC -8.5 ± 7.6 -6.5 ± 1.7 -11.0 ± 5.4 -7.5 ± 4.4 Table 3 : Average terminal utility for real-world data. Mix denotes a mix of stocks in the previous three sectors. with the existing observation that more complicated RL solutions may not always be suitable for portfolio optimization due to the large noise and idiosyncrasy in the data. In the appendix, we provide additional experimental results with different problem dimensions (Appendix K.4.1), other γ values (Appendix K.4.2), and different utility functions (Appendix K.4.3). The results are consistent. Appendix K.4.4 studies a simplified case, where the optimal performance can be mathematically derived. 
FaLPO achieves performance similar to the theoretically optimal one.

5.2. REAL-WORLD STOCK TRADING

In this section, we present an application of FaLPO to real-world stock trading problems. Following the synthetic portfolio optimization setup, we study the six considered methods for 21-day (one month) stock trading in four different stock sectors using the daily stock price data from Yahoo Finance between January 4, 2006 and April 1, 2022. For factors, we follow existing works (Aboussalah, 2020; De Prado, 2018; Dixon et al., 2020) and consider economic indexes, technical analysis indexes, and sector-specific features such as oil prices, gold prices, and related ETF prices, leading to around 30 factors for each sector. In each sector we select 10 stocks according to the availability and trading volume in the considered time range (Appendix L.1). The training, validation, and testing data are constructed using rolling windows (Appendix L.3). Table 3 reports the achieved average utility of each method under the selected hyperparameters. FaLPO achieves the highest average utility in all four sectors. Next, we conduct the training-tuning-testing procedure above with γ ∈ {5, 7, 9}, and report the returns of FaLPO in each quarter in Figure 4. Recall that a smaller γ corresponds to taking more risk. This is consistent with the observation in Figure 4 that the smaller the γ, the bigger the return but the larger the fluctuations. Also, the return fluctuates and drops around late 2018 and early 2020. The former corresponds to the abrupt bear market at the end of 2018, and the latter is consistent with the outbreak of COVID-19. During these two periods, the financial market was especially noisy and unpredictable. We also conduct a sensitivity analysis on λ in Appendix L.4 and observe that a small non-zero λ works well in practice.

6. EPILOGUE

Conclusion This work proposes FaLPO, a new decision-making framework for portfolio optimization with stochastic factors. By using continuous-time finance models to regularize policy learning, FaLPO is able to handle high noise and complex effects in financial data. We demonstrate FaLPO's benefits both theoretically and empirically. We focus on policy learning and defer more advanced feature engineering methods to future work.

Limitations FaLPO has two potential limitations. First, while we show the extension of FaLPO to problems beyond portfolio optimization, FaLPO is not applicable when there is no suitable parametric model to derive the optimal policy functional form. In such cases, the model-regularized policy learning of FaLPO cannot be implemented. Second, the performance of FaLPO still relies on good features ($Y_t$) to generate factors ($X_t$). In the presence of unpredictable market events (like COVID-19), or when the features do not contain any useful signals (like the Merton case in Appendix K.4.4), additional caution needs to be taken when applying FaLPO.

Reproducibility The assumptions and proof details are provided in Appendices I and J. The experiment implementation details are reported in Appendices K and L.

APPENDIX A RELATED LITERATURE

In this section, we discuss related literature.

A.1 CONTINUOUS-TIME FINANCE MODELS

Existing strategies for solving portfolio optimization using continuous-time finance models can be loosely summarized as performing three steps:

1. Choosing the model for the dynamics, i.e., the type of stochastic differential equation (SDE).
2. Estimating the parameters of the selected model (also referred to as model fitting, model identification, or calibration).
3. Solving for the optimal policy under the estimated model.

The third step leverages stochastic optimal control tools (Fleming & Rishel, 1975; Fleming & Mitter, 1982; Fleming & Soner, 2006; Yong & Zhou, 1999). Finding and estimating an appropriate model for stochastic optimal control requires significant domain knowledge. For instance, in finance, the modeler must specify both which features are relevant and how they affect stock prices (Fama & French, 1992). If any relevant factor is misspecified, optimal control can hardly lead to good performance. As a result, in stock trading, control methods typically hand-pick three to five economic indices as the factors and assume they follow a simple (often linear) SDE. In practice, however, trading can benefit from much richer datasets including related option prices, technical indicators, and interest rates (Aboussalah, 2020; De Prado, 2018; Dixon et al., 2020; Mehtab & Sen, 2019). Further, even with a correctly specified model and factors, likelihood-based estimation for SDE control models can be very challenging (Phillips & Yu, 2009). As a result, methods like Aït-Sahalia (2008) and Aït-Sahalia & Kimmel (2010) seek to replace the exact likelihood with other likelihood-based objective functions, while maintaining theoretical guarantees. However, the proposed objective function needs to be derived for each specific problem, and the derivation can be challenging. Other methods like Fasen (2013) and Holý & Tomanová (2018) rely on more specific parametric or low-dimensional setups.
To alleviate these issues, our framework extends the existing continuous-time finance models by allowing for a flexible and generalized definition of stochastic factor dynamics. Further, we simultaneously conduct policy learning and model calibration in an RL manner, with a square-loss objective that avoids the calculation of an exact likelihood.

A.2 REINFORCEMENT LEARNING

RL aims to conduct the aforementioned three steps by (i) relying more on data and (ii) working in an end-to-end fashion. Methods like model-free RL assume no parametric form for the dynamics, and directly learn the optimal policy without explicitly learning the model (step 1) or estimating the parameters (step 2).

Discrete-Time Model-Free RL There exist many discrete-time model-free RL methods (Sutton & Barto, 2018). In this category, deep deterministic policy gradient (DDPG) is the most relevant one, and is empirically the most widely used for portfolio optimization (Hambly et al., 2021). The reason is twofold. First, DDPG is a policy-gradient-based method, and thus can naturally handle continuous states and actions in portfolio optimization with simple procedures. Second, DDPG learns a deterministic policy instead of a stochastic one as in Haarnoja et al. (2018). This characteristic is especially important in portfolio optimization, where the goal is to learn a deterministic policy, since executing a stochastic policy is extremely costly.

Continuous-Time Model-Free RL Continuous-time model-free RL (Wang et al., 2018; Doya, 2000; Munos, 2006) aims to solve for a continuous-time policy. However, such methods do not use or assume any SDE structure, and thus struggle with the common open questions in model-free RL like poor stability and sample complexity.
As one example, path integral methods stem from the theoretical result that the value function of a class of continuous-time decision-making problems can be expressed in closed form as a Feynman-Kac path integral (Fleming & Rishel, 1975; Kappen, 2005). A series of control/RL methods follow the rationale of optimizing the policy to maximize such an integral. Specifically, Theodorou et al. (2010) propose an open-loop control strategy; Kappen & Ruiz (2016) build RL with importance sampling; Chebotar et al. (2017a;b) and Stulp & Sigaud (2012) combine path integral with other model-based or model-free RL methods. However, the core derivation only holds for decision-making problems satisfying Kappen & Ruiz (2016, Equation (1)), which is equivalent to assuming that the action does not affect the randomness in decision-making. Such an assumption is limiting, and does not hold for portfolio optimization, where how to allocate the wealth in order to minimize the risk is key to a successful policy.

Model-based RL and RL with Representation Learning Model-based RL and RL with representation learning are two active research areas but without a clear general state-of-the-art (Bharadhwaj et al., 2022; Eysenbach et al., 2021; Trabucco et al., 2022; Janner et al., 2019; Deisenroth & Rasmussen, 2011; Nagabandi et al., 2018; Laskin et al., 2021; Lee et al., 2020; Watter et al., 2015; Chebotar et al., 2017a; Hafner et al., 2019; Kim et al., 2019). The closest to FaLPO are those that learn an explicit representation of a latent variable, like Lee et al. (2020) and Mhammedi et al. (2020). But such methods are unable to leverage continuous-time finance models for portfolio optimization.

Bayesian RL Our proposed framework is also related to Bayesian models (Ghavamzadeh et al., 2015; Rawlik et al., 2012), if we treat the learned representation of factors as the hidden variable.
Strictly formulating an NSFM as a Bayesian model requires assumptions specifying the conditional distributions, and thus requires more domain knowledge. The optimization of Bayesian methods is also more challenging.

RL for Stock Trading

Various efforts have been made on applying RL to stock trading (Corazza & Bertoluzzo, 2014; Hambly et al., 2021; Nan et al., 2022; Xiong et al., 2018; Guan & Liu, 2021; Liu et al., 2021; Hu et al., 2018; Yu et al., 2019) . However, these methods focus more on feature selection or empirical performance-improving techniques. Methodologically, they do not take advantage of continuous-time finance models.

A.3 EMPIRICAL RISK MINIMIZATION

Another related area is Empirical Risk Minimization (ERM) (Vapnik, 1992). ERM studies the minimization of an objective function using the averages over training data to construct an empirical loss function. Recent work has connected ERM with simulation-based and data-based offline decision-making methods (Reppen & Soner, 2020). More specifically, when the random input is observable and unaffected by actions, and a training set is available, the decision-making problem can be formulated as an ERM problem. As a result, portfolio optimization may be reformulated as an extension of ERM.

B OTHER OBJECTIVE FUNCTIONS

Note that the goal of portfolio optimization is to maximize the return while minimizing or constraining the risk. In practice, one can use different objective functions for such a goal, like the mean-variance objective (Hambly et al., 2021), the Sharpe ratio, and so on. In this work, we consider utility maximization with power utility and exponential utility. The proposed method also works with other objective functions, as long as we can derive (part of) the optimal policy structure. Note that the selection among these objective functions is more a question of user preference.
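As an illustration of the two utility functions considered here, the following is a minimal sketch. The power utility uses the form $z^{1-\gamma}/(1-\gamma)$ that appears in Appendix F.3.1; the exponential form $-e^{-\gamma z}/\gamma$ is a common convention and is an assumption, since the exact normalization is not stated in this appendix. The function names are hypothetical.

```python
import numpy as np


def power_utility(z, gamma):
    """Power utility z^(1 - gamma) / (1 - gamma); requires z > 0 and gamma != 1."""
    z = np.asarray(z, dtype=float)
    return z ** (1.0 - gamma) / (1.0 - gamma)


def exponential_utility(z, gamma):
    """Exponential utility -exp(-gamma * z) / gamma (assumed normalization)."""
    z = np.asarray(z, dtype=float)
    return -np.exp(-gamma * z) / gamma
```

Both functions are increasing in wealth, and a larger γ penalizes low terminal wealth more heavily, which matches the interpretation of γ as a risk-aversion parameter.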

C GRADIENT ESTIMATES

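In this section, we discuss the gradient estimation for both $V$ and $L$. Assume that we collect $B$ independent trajectories for $S_t$ and $Y_t$, denoted as

$$D := \{(s_{0,k}, y_{0,k}), (s_{\Delta t,k}, y_{\Delta t,k}), (s_{2\Delta t,k}, y_{2\Delta t,k}), \cdots, (s_{M\Delta t,k}, y_{M\Delta t,k})\}_{k=1}^{B}.$$

Then, the gradient estimate for $V(\theta_\phi, \theta_\pi)$ is defined as

$$\nabla V(\theta_\phi, \theta_\pi) := \frac{1}{B} \sum_{k=1}^{B} \nabla V_k(\theta_\phi, \theta_\pi) \quad \text{with} \quad \nabla V_k(\theta_\phi, \theta_\pi) := \nabla_{\theta_\phi, \theta_\pi} U\big(z^{\pi(\cdot;\theta_\phi,\theta_\pi),\Delta t}_{T,k}\big).$$

The terminal wealth in trajectory $k$ under the policy $\pi(\cdot; \theta_\phi, \theta_\pi)$ is denoted as $z^{\pi(\cdot;\theta_\phi,\theta_\pi),\Delta t}_{T,k}$ with

$$z^{\pi(\cdot;\theta_\phi,\theta_\pi),\Delta t}_{T,k} := z_0 + \sum_{m=1}^{M} z_{m-1} \sum_{i=1}^{d_S} \pi_i(m\Delta t, y_{m\Delta t}; \theta_\phi, \theta_\pi) \frac{s^i_{(m+1)\Delta t} - s^i_{m\Delta t}}{s^i_{m\Delta t}}.$$

Next, we consider the gradient of $L$:

$$\nabla L(\theta_\phi, \theta_S) := \frac{1}{B} \sum_{k=1}^{B} \nabla L_k(\theta_\phi, \theta_S).$$

Specifically, for the likelihood and the negative mean square loss, we have

$$\nabla L_{\text{Likelihood},k}(\theta_\phi, \theta_S) := \frac{1}{M} \sum_{m=0}^{M-1} \nabla_{\theta_\phi,\theta_S} \log\big(P(s_{(m+1)\Delta t,k}, \phi(y_{(m+1)\Delta t,k}; \theta_\phi) \mid s_{m\Delta t,k}, \phi(y_{m\Delta t,k}; \theta_\phi); \theta_S)\big),$$

$$\nabla L_{NMSL,k}(\theta_\phi, \theta_S) := -\frac{1}{M} \sum_{m=0}^{M-1} \sum_{i=1}^{d_S} \nabla_{\theta_\phi,\theta_S} \Big( \log(s^i_{(m+1)\Delta t,k}) - \log(s^i_{m\Delta t,k}) - E\Big[ \int_{m\Delta t}^{(m+1)\Delta t} f^i_S(X_s; \theta_S) - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_s; \theta_S))^2 \, ds \,\Big|\, s_{m\Delta t,k}, \phi(y_{m\Delta t,k}; \theta_\phi) \Big] \Big)^2.$$

As a result, in each iteration, we collect $B$ trajectories to estimate the gradient of $H(\theta_\phi, \theta_\pi)$.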
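The batch-averaged gradient of the utility of terminal wealth can be sketched as follows. This is an illustration under stated assumptions, not the implementation used in the experiments: the policy is a constant weight vector `theta` (as in the Merton case), time steps are indexed from 0 to M-1, central finite differences stand in for automatic differentiation, and all function names are hypothetical.

```python
import numpy as np


def terminal_wealth(prices, weights, z0=1.0):
    """Roll wealth forward: z_{m+1} = z_m + z_m * sum_i pi_i * (simple return of asset i).

    prices:  array of shape (M + 1, d_S), one price trajectory
    weights: array of shape (M, d_S), fraction of wealth held in each asset
    """
    simple_returns = prices[1:] / prices[:-1] - 1.0
    z = z0
    for pi_m, r_m in zip(weights, simple_returns):
        z = z + z * float(pi_m @ r_m)
    return z


def power_utility(z, gamma):
    return z ** (1.0 - gamma) / (1.0 - gamma)


def value_gradient_estimate(batch_prices, theta, gamma, eps=1e-5):
    """Average over B trajectories of dU(z_T)/dtheta for a constant policy pi = theta."""
    theta = np.asarray(theta, dtype=float)

    def average_utility(th):
        utilities = []
        for prices in batch_prices:
            weights = np.tile(th, (prices.shape[0] - 1, 1))
            utilities.append(power_utility(terminal_wealth(prices, weights), gamma))
        return float(np.mean(utilities))

    grad = np.zeros_like(theta)
    for i in range(theta.size):
        step = np.zeros_like(theta)
        step[i] = eps
        # Central finite difference in coordinate i (assumption: replaces autodiff).
        grad[i] = (average_utility(theta + step) - average_utility(theta - step)) / (2.0 * eps)
    return grad
```

In practice the policy would be a neural network and the gradient would come from backpropagation through the wealth recursion; the finite-difference version only makes the averaged quantity explicit.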

D A PRIMER ON STOCHASTIC DIFFERENTIAL EQUATIONS (SDES)

We provide a general formulation of SDEs with two examples.

D.1 FORMULATION OF SDES

SDEs are a generalization of ordinary differential equations to dynamic systems influenced by random fluctuations. The structure of the randomness can in principle be quite general, such as with jump processes where the state evolution is no longer continuous (Tankov, 2003). Although our method can be generalized to all SDEs, we restrict ourselves to practical settings where the source of randomness is a Brownian motion. Let $W_t$ be a multi-dimensional independent standard Brownian motion. For a random process $S_t$, an SDE is typically expressed using a differential form as

$$dS_t = f(S_t)\,dt + g(S_t)\,dW_t, \tag{8}$$

or, equivalently, in integral form as

$$S_t = S_0 + \int_0^t f(S_s)\,ds + \int_0^t g(S_s)\,dW_s,$$

where $f(\cdot)$ and $g(\cdot)$ are functions of $S_t$. The stochastic integral $\int_0^t g(S_s)\,dW_s$ is the accumulated influence of the noise on the state. We refer the reader to Karatzas & Shreve (1987) for details on the construction of stochastic integrals and SDE theory. Important here is that Equation (8) defines the transition of $S_t$ in an infinitesimal time step. The drift coefficient $f(S_t)$ characterizes the deterministic part of the change of $S_t$, and the diffusion coefficient $g(S_t)$ models the randomness in the transition of $S_t$.
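To make the infinitesimal transition concrete, the following sketch simulates one path of such an SDE with the standard Euler-Maruyama scheme, which replaces $dt$ by a small step and $dW_t$ by a Gaussian increment with variance equal to the step size. The function name and argument layout are assumptions made for illustration.

```python
import numpy as np


def euler_maruyama(f, g, s0, T, n_steps, rng):
    """Simulate dS = f(S) dt + g(S) dW on [0, T] with n_steps equal steps.

    f(s) returns the drift, shape (d_S,)
    g(s) returns the diffusion matrix, shape (d_S, d_W)
    """
    s0 = np.asarray(s0, dtype=float)
    dt = T / n_steps
    path = np.empty((n_steps + 1,) + s0.shape)
    path[0] = s0
    for m in range(n_steps):
        s = path[m]
        diffusion = np.atleast_2d(g(s))
        # Brownian increment: independent N(0, dt) in each of the d_W coordinates.
        dW = rng.normal(0.0, np.sqrt(dt), size=diffusion.shape[1])
        path[m + 1] = s + f(s) * dt + diffusion @ dW
    return path
```

For example, `euler_maruyama(lambda s: 0.1 * s, lambda s: 0.2 * np.diag(s), [1.0], 1.0, 252, np.random.default_rng(0))` simulates a one-dimensional geometric Brownian motion with drift 0.1 and volatility 0.2 (illustrative values).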

D.2 EXAMPLES

As concrete examples, we discuss two families of SDEs widely used in finance, economics, and biology: geometric Brownian motion (GBM) and Ornstein-Uhlenbeck (OU) processes (Merton et al., 1973; Vasicek, 1977; Bartoszek et al., 2017; Blomberg et al., 2020; Rohlfs et al., 2014). The OU structure appears in both applications below, and the financial application uses GBM as a base, but extends it with OU drift coefficients. The two types of SDEs are given by

$$\text{GBM:} \quad \frac{dS_t}{S_t} = \mu\,dt + \sigma\,dW_t, \qquad \text{OU:} \quad dS_t = \mu S_t\,dt + \sigma\,dW_t,$$

where $\frac{dS_t}{S_t} := \big(\frac{dS^i_t}{S^i_t}\big)_i$ denotes the component-wise division of $S_t$, and the matrices $\mu$ and $\sigma$ define the drift and diffusion coefficients. We refer the interested reader to Fleming & Soner (2006) for more information on these topics. We now briefly formulate two classic stochastic optimal control models for decision-making with stochastic factors. The stochastic factors appear as the drift coefficients of other state variables and are themselves modeled as SDEs.

D.3 ITÔ'S FORMULA

Itô's formula is a fundamental analytical tool for SDEs, and crucial for their analysis. We only provide a simple version here, which is sufficient for our analysis. A more general and rigorous statement with assumptions and proof of Itô's formula and integral can be found in Karatzas & Shreve (1987, Theorem 3.3).

LEMMA D.1 (Itô's Formula) Consider a twice differentiable function $G$, and $S_t$ following $dS_t = f(S_t)\,dt + g(S_t)\,dW_t$. Then, we have

$$dG(t, S_t) = \Big( \frac{\partial G}{\partial t} + \Big(\frac{\partial G}{\partial S_t}\Big)^\top f(S_t) + \frac{1}{2} \mathrm{Tr}\Big( g(S_t)^\top \frac{\partial^2 G}{\partial S_t^2} g(S_t) \Big) \Big) dt + \Big(\frac{\partial G}{\partial S_t}\Big)^\top g(S_t)\,dW_t.$$

LEMMA D.2 For a suitable bounded process $S_t$, the Itô integral $\int_0^t S_s\,dW_s$ satisfies:

$$E\Big[\int_0^t S_s\,dW_s\Big] = 0, \qquad E\Big[\Big(\int_0^t S_s\,dW_s\Big)^2\Big] = E\Big[\int_0^t S_s^2\,ds\Big].$$

The latter is also referred to as Itô's isometry.
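The two properties in Lemma D.2 can be checked numerically. The sketch below is an illustrative Monte Carlo experiment, not part of the paper's analysis: it takes the integrand to be the Brownian motion itself, for which Itô's isometry gives $E[(\int_0^T W_s\,dW_s)^2] = \int_0^T s\,ds = T^2/2$, and approximates the Itô integral by left-endpoint sums. The sample sizes and seed are arbitrary choices.

```python
import numpy as np


def ito_integral_w_dw(n_paths, n_steps, T, rng):
    """Left-endpoint (Ito) approximation of int_0^T W_s dW_s for n_paths paths."""
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))
    # W at the left endpoint of each step: cumulative sum excluding the current increment.
    w_left = np.cumsum(dW, axis=1) - dW
    return np.sum(w_left * dW, axis=1)


rng = np.random.default_rng(0)
samples = ito_integral_w_dw(n_paths=20000, n_steps=200, T=1.0, rng=rng)
mean_estimate = samples.mean()                  # Lemma D.2 predicts a value near 0
second_moment_estimate = (samples ** 2).mean()  # Ito's isometry predicts a value near T^2 / 2
```

Evaluating the integrand at the left endpoint of each step is what makes the sum an Itô integral; using the midpoint or right endpoint would approximate a different stochastic integral with a non-zero mean.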

E MODEL CALIBRATION

We discuss two model calibration loss functions, log-likelihood and negative mean square loss.

E.1 LOG-LIKELIHOOD

We can use the log-likelihood as $L$ for model calibration. The log-likelihood of SDEs is derived in a sequential manner. Specifically, for (2) the log-likelihood is derived as

$$L_{\text{Log-Likelihood}}(\theta_\phi, \theta_S) := E\big[\log\big(P_{\theta_S}(S_{t+\Delta t}, \phi(Y_{t+\Delta t}; \theta_\phi) \mid S_t, \phi(Y_t; \theta_\phi))\big)\big],$$

where $P_{\theta_S}$ denotes the conditional likelihood according to (2) but with parameter $\theta_S$ instead of $\theta^*_S$. Then, in specific models, one can derive $P_{\theta_S}(S_{t+\Delta t}, \phi(Y_{t+\Delta t}) \mid S_t, \phi(Y_t); \theta)$ (Phillips, 1972; Holý & Tomanová, 2018; Beskos & Roberts, 2005) or an approximation of it (Aït-Sahalia & Kimmel, 2010).

E.2 NEGATIVE MEAN SQUARE LOSS

For the SDE system (2), one can also use a negative mean square loss (NMSL) as the calibration objective. To derive this loss function, we first derive the dynamics of the log price by applying Itô's formula (Lemma D.1) to (2):

$$d\log(S^i_t) = f^i_S(X_t; \theta_S)\,dt - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_t; \theta_S))^2\,dt + \sum_{j=1}^{d_W} g^{ij}_S(X_t; \theta_S)\,dW^j_t.$$

Then, under the assumptions of Lemma D.2, we take the expectation of both sides of the above equation, and derive

$$E[d\log(S^i_t)] = f^i_S(X_t; \theta_S)\,dt - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_t; \theta_S))^2\,dt.$$

In words, the expectation of the log price change in an infinitesimal time is $f^i_S(X_t; \theta_S)\,dt - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_t; \theta_S))^2\,dt$. Therefore, one can estimate the parameter $\theta_S$ by minimizing the mean square loss between the log price change and the expected log price change:

$$L_{NMSL}(\theta_S) := -E\Big[ \sum_{i=1}^{d_S} \Big( \log(S^i_{t+\Delta t}) - \log(S^i_t) - E\Big[ \int_t^{t+\Delta t} f^i_S(X_s; \theta_S) - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_s; \theta_S))^2 \, ds \,\Big|\, S_t, X_t \Big] \Big)^2 \Big].$$

It can be easily proved that the true data generating SDE parameter satisfies $\theta^*_S \in \arg\max L_{NMSL}(\theta_S)$. Further, if we take $X_t = \phi(Y_t; \theta_\phi)$ and parameterize the objective as

$$L_{NMSL}(\theta_\phi, \theta_S) := -E\Big[ \sum_{i=1}^{d_S} \big( \log(S^i_{t+\Delta t}) - \log(S^i_t) - \phi_i(Y_t; \theta_\phi)\Delta t - \theta^i_S \big)^2 \Big],$$

we can prove $\theta^*_\phi, \theta^*_S \in \arg\max L_{NMSL}(\theta_\phi, \theta_S)$. Note that in practice it can be very hard to calculate the expectation

$$E\Big[ \int_t^{t+\Delta t} f^i_S(X_s; \theta_S) - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_s; \theta_S))^2 \, ds \,\Big|\, S_t, X_t \Big].$$

Therefore, when $\Delta t$ is small, we replace the conditional expectation with

$$E\Big[ f^i_S(X_t; \theta_S)\Delta t - \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_t; \theta_S))^2 \Delta t \,\Big|\, S_t, X_t \Big].$$

Accordingly, the calibration objective is defined as

$$L_{NMSL}(\theta_S) \approx -E\Big[ \sum_{i=1}^{d_S} \Big( \log(S^i_{t+\Delta t}) - \log(S^i_t) - f^i_S(X_t; \theta_S)\Delta t + \frac{1}{2} \sum_{j=1}^{d_W} (g^{ij}_S(X_t; \theta_S))^2 \Delta t \Big)^2 \Big].$$
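The small-$\Delta t$ approximation of the calibration objective can be computed from a single observed trajectory as in the sketch below. It assumes that the drift and diffusion values at the left endpoint of each step have already been evaluated and are passed in as arrays; the function name and array layout are hypothetical, and the expectation is replaced by an average over time steps.

```python
import numpy as np


def negative_mean_square_loss(log_prices, drift, diffusion, dt):
    """Discretized NMSL for one trajectory.

    log_prices: shape (M + 1, d_S), observed log prices
    drift:      shape (M, d_S), values f_S^i(X_t) at the left endpoint of each step
    diffusion:  shape (M, d_S, d_W), values g_S^{ij}(X_t) at the left endpoint
    """
    increments = np.diff(log_prices, axis=0)
    # Expected log-price change: (f^i - 0.5 * sum_j (g^{ij})^2) * dt.
    expected = (drift - 0.5 * np.sum(diffusion ** 2, axis=2)) * dt
    residual = increments - expected
    # Sum squared residuals over assets, average over time, and negate.
    return -np.mean(np.sum(residual ** 2, axis=1))
```

The loss is at most zero and reaches zero only when every observed log-price increment matches its expected value, so maximizing it pulls the modeled drift toward the observed average log return.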

E.3 OTHER MODEL CALIBRATION OBJECTIVE

Another potential model calibration objective following the same rationale as (11) is

$$L_{NMSL\text{-}X}(\theta_\phi, \theta_S) := -E\Big[ \sum_{i=1}^{d_X} \Big( \phi(Y_{t+\Delta t}; \theta_\phi)_i - \phi(Y_t; \theta_\phi)_i - E\Big[ \int_t^{t+\Delta t} f^i_X(\phi(Y_s; \theta_\phi)_i; \theta_S)\,ds \,\Big|\, \phi(Y_t; \theta_\phi)_i \Big] \Big)^2 \Big],$$

which is derived using the conditional expectation of $X_{t+\Delta t}$ given $X_t$. However, the true data-generating parameter $(\theta^*_\phi, \theta^*_S)$ is not a maximal point of $L_{NMSL\text{-}X}$. To see this, note that this loss function encourages the representation function $\phi$ to take a constant output, so that $X_t$ is constant over time with $f^i_X(X_s; \theta_S) = 0$ and $L_{NMSL\text{-}X}(\theta_\phi) = 0$. We also tried this loss in experiments, and it led to poor validation and test performance.

F APPLICATIONS OF FALPO TO DIFFERENT STOCHASTIC FACTOR MODELS IN CONTINUOUS-TIME FINANCE

FaLPO can be used with many stochastic factor models in continuous-time finance models. In this section, we discuss the Merton model (Appendix F.1), Kim-Omberg (Appendix F.2) and EVE (Appendix F.3).

F.1 MERTON MODEL

Merton model (Merton, 1969 ) is a classic setup for portfolio optimization. It studies the allocation of capital across a set of financial assets in order to maximize profits and minimize risks.

F.1.1 MODELING

Consider $p$ risky assets with prices $S_t = \{S^i_t\}_{i=1}^{p}$ and an additional risk-free money market account with, for simplicity, zero interest rate of return (like cash). The Merton model does not include factors. The dynamics for asset prices is formulated as

$$\frac{dS^i_t}{S^i_t} = \mu\,dt + \sigma\,dW_t. \tag{13}$$

The parameters $\mu$, $\sigma$ are $p \times p$ matrices, with $\sigma$ denoting the volatility of assets. Further, we use $Z^\pi_t$ to denote the wealth at time point $t$ under the continuous-time policy $\pi$. Under the famous and widely used self-financing assumption (Björk, 2009), we have

$$\frac{dZ^\pi_t}{Z^\pi_t} = \pi_t \mu\,dt + \pi_t \sigma\,dW_t.$$

An investor's goal is to maximize the expected utility of capital $U(Z_T)$ at some future time point $T$:

$$\max_{\pi_t} E^{\pi_t}\big[ U(Z^\pi_T) \mid Z_0 = z, S_0 = s \big]. \tag{14}$$

Negative values in the policy output are allowed, meaning the agent can short any asset.

F.1.2 POLICY FUNCTIONAL FORM

For the Merton model, $\Pi$ in (3) can be explicitly derived.

LEMMA F.1 (Policy Functional Form for Merton Model) For a Merton model defined in (13), under common assumptions, the optimal policy for portfolio optimization (14) with power utility follows $\pi^* = \mu(\sigma\sigma^\top)^{-1}$. The optimal policy for portfolio optimization (14) with exponential utility follows $\pi^* = \frac{\mu(\sigma\sigma^\top)^{-1}}{Z^{\pi^*}_t}$.

Proof. Lemma F.1 is a classic result in continuous-time finance, proposed in Merton (1969). □

In words, in a Merton model, the optimal policy is independent of time, stock prices, features, and factors. The optimal investment strategy is to keep a constant fraction or amount of wealth in each asset all along, depending on the choice of utility function. Let $\theta_\pi$ be a $d_S \times 1$ parameter vector. According to Lemma F.1, in FaLPO, we parameterize the candidate policy function as

$$\pi(t, S_t, Z_t, Y_t; \theta_\phi, \theta_\pi) = \Pi(t, S_t, Z_t, \phi(Y_t; \theta_\phi); \theta_\pi) = \theta_\pi$$

for power utility, and

$$\pi(t, S_t, Z_t, Y_t; \theta_\phi, \theta_\pi) = \Pi(t, S_t, Z_t, \phi(Y_t; \theta_\phi); \theta_\pi) = \frac{\theta_\pi}{Z_t}$$

for exponential utility.

F.1.3 MODEL CALIBRATION

According to the Merton model formulation, there exist no factors affecting the evolution of stock prices. Therefore, we do not add the model calibration objective in FaLPO for the Merton problem.

F.2 KIM-OMBERG

The Kim-Omberg model (Kim & Omberg, 1996) is a standard model for portfolio optimization with predictable asset returns, which has been discussed extensively in the empirical literature (Welch & Goyal, 2008; Muhle-Karbe et al., 2017).

F.2.1 MODELING

In the Kim-Omberg model, the stock dynamics are formulated as

$$\frac{dS^i_t}{S^i_t} = X^i_t\,dt + \sum_{j=1}^{d_W} \sigma_{ij}\,dW^j_t, \qquad dX_t = \mu(\omega - X_t)\,dt + v\,dW_t.$$

The portfolio optimization goal is formulated as $\max_{\pi_t} \tilde{V}(\pi_t)$ with

$$\tilde{V}(\pi_t) := E^{\pi_t}\big[ U(Z^\pi_T) \mid X_0 = x, Z_0 = z, S_0 = s \big].$$
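A synthetic environment following these dynamics can be simulated with an Euler-Maruyama discretization, as in the sketch below. This is an illustration rather than the simulation procedure of Appendix K.2: the function name, the shared Brownian increment for prices and factors, and the requirement that the factor dimension equals the number of assets are assumptions made to match the equations above.

```python
import numpy as np


def simulate_kim_omberg(s0, x0, mu, omega, sigma, v, T, n_steps, rng):
    """Simulate prices S and drift factors X under Kim-Omberg dynamics.

    s0, x0, omega: shape (d,)     initial prices, initial factors, long-run factor mean
    mu:            shape (d, d)   mean-reversion matrix of the factors
    sigma, v:      shape (d, d_W) diffusion matrices of prices and factors
    """
    s0, x0, omega = (np.asarray(a, dtype=float) for a in (s0, x0, omega))
    dt = T / n_steps
    S = np.empty((n_steps + 1, s0.size))
    X = np.empty((n_steps + 1, x0.size))
    S[0], X[0] = s0, x0
    for m in range(n_steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=sigma.shape[1])
        # dS^i / S^i = X^i dt + sum_j sigma_ij dW^j
        S[m + 1] = S[m] * (1.0 + X[m] * dt + sigma @ dW)
        # dX = mu (omega - X) dt + v dW  (mean reversion towards omega)
        X[m + 1] = X[m] + (mu @ (omega - X[m])) * dt + v @ dW
    return S, X
```

With a coarse time step and large volatility the Euler update can produce non-positive prices; simulating the log price instead avoids this, at the cost of the Itô correction term derived in Appendix E.2.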

F.2.2 POLICY FUNCTIONAL FORM

An optimal policy function is derived in Lemma F.2 LEMMA F.2 Under common assumptions in Kim & Omberg (1996) ; Herzog et al. (2004) , an optimal policy functional form for ( 15) with power utility is derived as π * = 1 1 -γ (σσ ⊤ ) -1 X t + σv ⊤ (k 3 (t)X t + k 2 (t)) , where k 2 (t) and k 3 (t) satisfy dk 1 (t) dt + 1 2 Tr v ⊤ (k 2 (t)k 2 (t) ⊤ + k 3 (t))v + (µω) ⊤ k 2 (t) - γ 2(γ -1) (k 2 (t) ⊤ vv ⊤ k 2 (t)) = 0, dk 2 (t) dt + k 3 (t)vv ⊤ k 2 (t) -µ ⊤ k 2 (t) + k 3 (t)µω - γ γ -1 ((σσ ⊤ ) -1 σvk 2 (t) + k 3 (t)vv ⊤ k 2 (t)) = 0, dk 3 (t) dt + k 3 (t)vv ⊤ k 3 (t) -k 3 (t)µ -µ ⊤ k 3 (t) - γ γ -1 ((σσ ⊤ ) -1 + (σσ ⊤ ) -1 σv ⊤ k 3 (t) + k 3 (t)vσ ⊤ (σσ ⊤ ) -1 + k 3 (t)vv ⊤ k 3 (t)) = 0, with k 1 (T ) = 0, k 2 (T ) = 0, and k 3 (T ) = 0. Note that k 1 (t) is a scalar, k 2 (t) is a d S × 1 vector, and k 3 (t) is a d X × d X . And the ODE of k 3 (t) is the famous matrix Ricatti equation. Similarly, an optimal policy functional form for (15) with exponential utility is derived as π * t = (σσ ⊤ ) -1 1 -γ 2 Z t X(t) + σv ⊤ (k 3 (t)X t + k 2 (t)) , where k 2 (t) and k 3 (t) satisfy dk 1 (t) dt + 1 2 Tr v ⊤ (k 2 (t)k 2 (t) ⊤ + k 3 (t))v + (µω) ⊤ k 2 (t) - 1 2 (k 2 (t) ⊤ vv ⊤ k 2 (t)) = 0, dk 2 (t) dt + k 3 (t)vv ⊤ k 2 (t) -µk 2 (t) + k 3 (t)µω -((σσ ⊤ ) -1 σvk 2 (t) + k 3 (t)vv ⊤ k 2 (t)) = 0, dk 3 (t) dt + k 3 (t)vv ⊤ k 3 (t) -k 3 (t)µ -µ ⊤ k 3 (t) -((σσ ⊤ ) -1 + (σσ ⊤ ) -1 σv ⊤ k 3 (t) + k 3 (t)vσ ⊤ (σσ ⊤ ) -1 + k 3 (t)vv ⊤ k 3 (t)), with k 1 (T ) = 0, k 2 (T ) = 0, and k 3 (T ) = 0.

F.2.3 MODEL CALIBRATION

Following the derivation in Appendix E.2, the negative mean square loss for the Kim-Omberg model can be derived as

$$L(\theta_\phi, \theta_S) := -E\Big[ \sum_{i=1}^{d_S} \big( \log(S^i_{t+\Delta t}) - \log(S^i_t) - \phi_i(Y_t; \theta_\phi)\Delta t - \theta^i_S \big)^2 \Big],$$

where in this case $\theta_S$ is a $d_S \times 1$ vector.

F.3 EVE MODEL WITH STOCHASTIC MARKOVIAN FACTORS

We take EVE model (Avanesyan et al., 2018) with stochastic Markovian factors as another example.

F.3.1 MODELING

We first detail the modeling of EVE, which formulates the dynamics of asset prices by dS i t S i t = µ i (X t ; θ S )dt + d W j=1 σ ji (X t ; θ S )dW j t , i = 1, 2, • • • , d S , dX t = M ⊤ X t + ω dt + κ(X t ; θ S ) ⊤ dB t , B t = ρ ⊤ W t + A ⊤ W ⊥ t . We use W t and W ⊥ t to denote two sets of independent Brownian motions. ρ denotes a correlation matrix with components ρ ij ∈ [-1, 1]. Therefore, B is indeed another Brownian motion. We let µ, σ, and κ be parametric functions, with θ S denoting all the parameters. The SDE parameters include all the parameter matrices and β's. Again, with Z t as the wealth, we aim to maximize the power utility at the terminal time T > 0 as the performance objective: Ṽ (π t ) = E π ( Z π T ) 1-γ 1 -γ . EVE model poses further assumptions on ( 16). ASSUMPTION F.3 M has non-negative off-diagonal entries and ω ∈ [0, ∞) k . Further, we assume that there exist λ(x), Λ and L, and N such that µ(•), σ(•), κ(•) and ρ satisfy λ(x) ⊤ λ(x) = µ(x) ⊤ σ(x) -1 σ(x) -1 ⊤ µ(x) = Λ ⊤ x κ(x) ⊤ κ(x) = diag(L 1 x 1 , L 2 x 2 , • • • , L k x k ), with L 1 , L 2 , • • • , L k ≥ 0 Γκ(x) ⊤ ρ ⊤ λ(x) = N ⊤ x. The conditions in Assumption F.3 are necessary for the process of X t to be [0, ∞) k -valued and affine. Under these conditions, the SDE in ( 16  ρ ⊤ ρ = pI, where I is the identity matrix. Note that when p = 1, ρ is a vector and thus we define p := ρ ⊤ ρ.

F.3.2 CONCRETE EXAMPLE

We consider a more concrete example of EVE satisfying the formulation and assumptions in Appendix F.3.1. Specifically, we use D µ , D σ , and D λ to denote diagonal matrices. Further, let D(x) denote the diagonal matrix whose diagonal is x. Also, we use x •k for any k ∈ R to denote the component-wise power operation (Hadamard power). Then, we define µ(x) := D µ x • 3 2 σ(x) := D σ D(x) κ(x) := ρ -1 D κ D(x • 1 2 ). Then, we have λ(x) = D -1 σ D µ x • 1 2 . Further, we pose ρ ⊤ ρ = ρρ ⊤ = I. Then, λ(x) = σ(x) -1 ⊤ µ(x) = D -1 σ x • 1 2 and λ(x) ⊤ λ(x) = D -2 σ x, κ(x) ⊤ κ(x) = D κ D(x • 1 2 ) ρ -1 ⊤ ρ -1 D κ D(x • 1 2 ) = D 2 κ D(x), Γκ(x) ⊤ ρ ⊤ λ(x) = ΓD(x • 1 2 )D κ ρ -1 ⊤ ρ ⊤ D -1 σ x • 1 2 = ΓD(x • 1 2 )D κ D -1 σ x • 1 2 = ΓD κ D -1 σ x. Such a setup is shown satisfies (16).

F.3.3 POLICY FUNCTIONAL FORM

The policy functional form of the EVE model with Markovian stochastic factors can be derived as: LEMMA F.6 Under the assumptions in Appendix F.3.1 and Appendix F.3.2, the optimal policy function follows: π * t = 1 γ σ(X t ) -1 λ(X t ) + qρκ(X t )k 2 (t) ⊤ , with dk i 2 (t) dt + 1 2 L i (k 2 (t) i ) 2 + d X j=1 (M + N ) ij k 2 (t) j + Γ 2q Λ i = 0, i = 1, 2, • • • , d X dk 1 (t) dt + ω ⊤ k 2 (t) = 0. For the specific example in F.3.2, in FaLPO, we parameterize the candidate policy as π(t, Y t , Z t ; θ ϕ , θ π ) = Π(t, Y t , Z t ; θ ϕ , θ π ) = ϕ(Y t ) 1 2 k(t; θ π ).

F.4 MODEL CALIBRATION

Following the derivations in Section E.2, we can derive the calibration objective as:

$$L(\theta_S) := -\min_{\theta_S = (C_1, C_2)} E\Big\| \Delta\log(S_t) - C_1 \phi(Y_t; \theta_\phi)^{\circ \frac{3}{2}} \Delta t - C_2 \phi(Y_t; \theta_\phi)^{\circ 2} \Delta t \Big\|_2^2.$$

G SOLUTIONS OF RICCATI DIFFERENTIAL EQUATIONS

According to the analysis in Section F, the optimal policy function is closely related to the solutions of Riccati differential equations, which also have closed-form solutions. Specifically, with abuse of notation, let $A$, $B$ and $D$ be $p \times p$ matrices, and let $X(t): [0, T] \to \mathbb{R}^{p \times p}$ be a function of $t$ solving the following Riccati differential equation:

$$\frac{dX(t)}{dt} = A^\top X(t) + X(t)A - X(t)BB^\top X(t) + D^\top D, \qquad X(0) = X_0.$$

Following the analysis and assumptions in (Behr et al., 2019), the unique symmetric positive stabilizing solution $X(t)$ follows:

$$X(t) = X_\infty - e^{t\hat{A}^\top} (X_\infty - X_0) \Big[ I - \big(X_L - e^{t\hat{A}} X_L e^{t\hat{A}^\top}\big)(X_\infty - X_0) \Big]^{-1} e^{t\hat{A}},$$

where $\hat{A} X_L + X_L \hat{A}^\top + BB^\top = 0$, with $\hat{A} := A - BB^\top X_\infty$, and $0 = A^\top X_\infty + X_\infty A + D^\top D$. Note that with (18) we can further derive the policy functional forms without using neural networks to parameterize time-dependent functions.

H EXTENSION TO LINEAR QUADRATIC CONTROL (LQC)

The methodology of FaLPO can also be applied to decision-making problems other than portfolio optimization. To implement FaLPO, one needs to first construct a neural stochastic factor model combining factor representation with a continuous-time model. Then, the policy learning is conducted while leveraging the policy functional form and model calibration. As an example, we apply FaLPO to linear quadratic control (LQC), and detail the modeling (Section H.1), policy functional form (Section H.2), and model calibration (Section H.3).

H.1 MODELING

We consider the problem of LQC (Sun & Yong, 2020) but with a stochastic factor $X_t$ following an OU process. With slight abuse of notation, we use $S_t$ to denote the state variable in this section:

$$dS_t = (BS_t + UA_t + X_t)\,dt + \sum_{j=1}^{d_W} D_j A_t\,dW^j_t, \qquad dX_t = \mu X_t\,dt + v\,dW_t, \tag{19}$$

where $B$, $U$, $\mu$, $v$, and $D_j$ are redefined as matrices with appropriate dimensions. With $A_t = \pi(\cdot)$ following the policy $\pi$, we aim to solve $\max_\pi V(\pi)$ with

$$V(\pi) := E^\pi\Big[ \int_0^T \big( (QS_t)^\top S_t + (RA_t)^\top A_t \big)\,dt + (GS_T)^\top S_T \Big], \tag{20}$$

with $Q$, $R$, and $G$ as known matrices with appropriate dimensions, and $T$ the terminal time. Further, we apply the modeling strategy in Section 3.1 and aim to learn the representation of stochastic factors from the available features $Y_t$: $X_t = \phi(Y_t; \theta^*_\phi)$.

H.2 POLICY FUNCTIONAL FORM

By taking Ξ t (the combination of S t and X t ) as the state variables, we can reformulate the problem as a classic LQC problem: dΞ t = B Ξ Ξ t + U Ξ A t + d W j=1 (D Ξ j A t + β Ξ t )dW j t , with all the coefficients redefined. Then, under common assumptions in Sun & Yong (2020) ; Yong & Zhou (1999) , it can be derived that the optimal policy satisfies: π * (t, ξ) = Λ Ξ (K Ξ (t)) -1 (U Ξ ) ⊤ K Ξ (t)ξ, where Λ Ξ (K Ξ (t)) = R + d W j=1 (D ⊤ j K Ξ (t)D j ). Also, K Ξ (t) solves the differential Riccati equation: K Ξ (t) = -e (B Ξ ) ⊤ (T -t) G Ξ e B Ξ (T -t) - T t e (B Ξ ) ⊤ (T -τ ) K Ξ (τ ) ⊤ U Ξ Λ Ξ (K Ξ (τ )) ⊤ -1 (U Ξ ) ⊤ K Ξ (τ )e (B Ξ ) ⊤ (T -τ ) dτ, with K Ξ (T ) = 0. Therefore, we can formulate the candidate policy as π(t, S t , Y t ; θ ϕ , θ π ) = Π(t, S t , Y t ; θ ϕ , θ π ) = k 1 (t; θ π )S t + k 2 (t; θ π )ϕ(Y t ; θ ϕ ).

H.3 MODEL CALIBRATION

According to (19) and following the derivation strategy in Appendix E.2, we can derive the negative mean square loss for LQC as

$$L(\theta_\phi, \theta_S) := -E\big\| S_{t+\Delta t} - S_t - \phi(Y_t; \theta_\phi)\Delta t - C_1 S_t - C_2 A_t \big\|_2^2, \tag{22}$$

with $\theta_S = \{C_1, C_2\}$. In summary, to apply FaLPO to LQC with stochastic factors, we parameterize candidate policies following (21) and maximize $(1 - \lambda)V(\theta_\phi, \theta_\pi) + \lambda L(\theta_\phi, \theta_S)$, with $V$ in (20) and $L$ in (22).

I EXTENDED RESULTS FOR THEOREM 4.1

I.1 ASSUMPTIONS AND DEFINITIONS

To start with, we consider stochastic factor models such that the optimal feedback admissible policy admits a functional form as $\pi^*_t = \Pi(t, X_t; \theta^*_\pi)$. Also, we consider the $L$ function such that the true data generating parameters $\theta^*_S, \theta^*_\phi$ satisfy

$$\theta^*_S, \theta^*_\phi \in \arg\max_{\theta_\phi, \theta_S} L(\theta_\phi, \theta_S), \tag{23}$$

while other options for $L$ can also be empirically used in our method.

DEFINITION I.1 For a continuous-time policy $\pi_t := \Pi(t, S_t, Z_t, \phi(Y_t; \theta_\phi); \theta_\pi)$, we define its value function as $\tilde{V}(\theta_\phi, \theta_\pi) := E[U(Z^\pi_T)]$.

Accordingly, we define the continuous-time version objective function as

H(θ ϕ , θ π , θ S ) := (1 -λ) Ṽ (θ ϕ , θ π ) + λL(θ ϕ , θ S ). DEFINITION I.2 For t ∈ [m∆t, (m + 1)∆t), define ⌊t⌋ := m∆t and ⌊π * t ⌋ := π * ⌊t⌋ . DEFINITION I.3 For the continuous-time optimal policy π * , we use ⌊π * ⌋ to denote the piece-wise constant version π * . We use Z⌊π * ⌋ t to denote the wealth process when implementing the optimal continuous-time policy π(•; θ * ϕ , θ * π ) in the piece-wise constant manner. Specifically, d Z⌊π * ⌋ t := Z⌊π * ⌋ ⌊t⌋ d S i=1 (π * ⌊t⌋ ) i dS i t S i ⌊t⌋ = d S i=1 Π i (⌊t⌋, X ⌊t⌋ ; θ * π )f i S (X t ; θ * S ) S i t S i ⌊t⌋ Z⌊π * ⌋ ⌊t⌋ dt + Π i (⌊t⌋, X ⌊t⌋ ; θ * π ) d W j=1 g ij S (X t ; θ * S ) S i t S i ⌊t⌋ Z⌊π * ⌋ ⌊t⌋ dW j t . Further, remember Z π * t is used to denote the continuous-time wealth process under the policy π * . By the dynamics of continuous-time wealth process Z π t in Section 2.3 the dynamics of Z π * t is derived as d Z π * t Z π * t = Π(t, X t ; θ * π ) ⊤ f S (X t ; θ * S )dt + Π(t, X t ; θ * π ) ⊤ g S (X t ; θ * S ) ⊤ dW t . ASSUMPTION I.4 For each R > 0, if ∥x∥ ≤ R and t ≤ T , we assume that there exists a C R > 0 such that ∥Π(t, x; θ * π )∥ ∨ ∥f S (x; θ * S )∥ ∨ ∥g S (x; θ * S )∥ ∨ ∥f X (x; θ * S )∥ ∨ ∥g X (x; θ * S )∥ ≤ C R , and Π(t, x; θ * π ) is locally Lipschitz with Lipschitz constant C R . For some p > 2, there exists a constant A such that E sup 0≤t≤T | Z π * t | p ∨ E sup 0≤t≤T | Z⌊π * ⌋ t | p ∨ E sup 0≤t≤T ∥X t ∥ p ∨ E sup 0≤t≤T ∥log(S t )∥ p ≤ A. Note that Assumption I.4 requires the stochastic processes to have bounded high-order moments. For a specific model like Kim-Omberg, this is not guaranteed to hold for every initial value and SDE coefficient, but one can derive model-specific sufficient conditions for Assumption I.4. In practice when implementing the method, we calculate the empirical moments of wealth, factors and asset prices to approximately check whether Assumption I.4 holds. 
With Assumption I.4, we define a stopping time.

DEFINITION I.5 For $R > 0$, define the stopping time

$$\tau_R := \inf\big\{0 \le t \le T : |Z^{\pi^*}_t| \ge R \ \text{or}\ \|X_t\| \ge R \ \text{or}\ \|\log(S_t)\| \ge R \ \text{or}\ |Z^{\lfloor \pi^* \rfloor}_t| \ge R \big\}.$$

ASSUMPTION I.6 The utility function $U(z)$ has a linear bound on Z: $|U(z)| \le C_U(|z| + 1)$.

Note that Assumption I.6 holds as long as the utility function is bounded at the smallest value in Z. Specifically, for power utility, Assumption I.6 is equivalent to setting $\gamma \in (0, 1)$.

ASSUMPTION I.7 There exists $\Delta t' > 0$ such that for any $\Delta t < \Delta t'$, $\lfloor \pi^* \rfloor$ is also an admissible policy.

I.2 LEMMAS

LEMMA I.8 Consider $n$ non-negative constants $c_1, c_2, \dots, c_n$. The following inequality holds:

$$\Big(\sum_{i=1}^{n} c_i\Big)^2 \le n \sum_{i=1}^{n} c_i^2.$$

Proof. The result follows from the Cauchy-Schwarz inequality. □

LEMMA I.9 $(\theta^*_\phi, \theta^*_\pi, \theta^*_S)$ maximizes both $\tilde{V}(\theta_\phi, \theta_\pi)$ and $L(\theta_\phi, \theta_S)$.

Proof. First, since $\theta^*_\pi$ is defined to be the optimal parameter for the continuous-time policy and $\theta^*_\phi$ is defined to be the true data-generating parameter, $(\theta^*_\phi, \theta^*_\pi)$ maximizes $\tilde{V}$. Then, by (23), $(\theta^*_\phi, \theta^*_S)$ maximizes $L$. □

LEMMA I.10 For any $\delta > 0$,

$$E\Big[\sup_{0 \le t \le T} \big(Z^{\lfloor \pi^* \rfloor}_t - Z^{\pi^*}_t\big)^2\Big] \le E\Big[\sup_{0 \le t \le T} \big(Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R} - Z^{\pi^*}_{t \wedge \tau_R}\big)^2\Big] + \frac{2^{p+1} \delta A}{p} + \frac{(p-2)\, 8A}{p\, \delta^{2/(p-2)} R^p}.$$

Proof. The bound follows by applying Young's inequality. See the derivation in Higham et al. (2002, Equation (2.8)).
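□

LEMMA I.11 For any $t \le \tau_R$, the differences between the coefficients of the dynamics of $Z^{\pi^*}_t$ and $Z^{\lfloor \pi^* \rfloor}_t$ are bounded by

$$\Big| \sum_{i=1}^{d_S} \Pi_i(t, X_t; \theta^*_\pi) f^i_S(X_t; \theta^*_S) Z^{\pi^*}_t - \sum_{i=1}^{d_S} \Pi_i(\lfloor t \rfloor, X_{\lfloor t \rfloor}; \theta^*_\pi) f^i_S(X_t; \theta^*_S) Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} \frac{S^i_t}{S^i_{\lfloor t \rfloor}} \Big|^2 \le 5 d_S^2 \exp(2R) C_R^4 \Big[ \exp(2R) R^2 |t - \lfloor t \rfloor|^2 + \exp(2R) R^2 \|X_t - X_{\lfloor t \rfloor}\|^2 + R^2 \|S_{\lfloor t \rfloor} - S_t\|^2 + \exp(2R) |Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t|^2 + \exp(2R) |Z^{\lfloor \pi^* \rfloor}_t - Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor}|^2 \Big],$$

and

$$\Big| \sum_{i=1}^{d_S} \Pi_i(\lfloor t \rfloor, X_{\lfloor t \rfloor}; \theta^*_\pi) \sum_{j=1}^{d_W} g^{ij}_S(X_t; \theta^*_S) \frac{S^i_t}{S^i_{\lfloor t \rfloor}} Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} - \sum_{i=1}^{d_S} \Pi_i(t, X_t; \theta^*_\pi) \sum_{j=1}^{d_W} g^{ij}_S(X_t; \theta^*_S) Z^{\pi^*}_t \Big|^2 \le 5 d_S^2 \exp(2R) C_R^4 \Big[ \exp(2R) R^2 |t - \lfloor t \rfloor|^2 + \exp(2R) R^2 \|X_t - X_{\lfloor t \rfloor}\|^2 + R^2 \|S_{\lfloor t \rfloor} - S_t\|^2 + \exp(2R) |Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t|^2 + \exp(2R) |Z^{\lfloor \pi^* \rfloor}_t - Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor}|^2 \Big].$$

Proof. Write $\pi^i_t := \Pi_i(t, X_t; \theta^*_\pi)$ for brevity. By the triangle inequality,

$$\Big| \sum_{i=1}^{d_S} \pi^i_t f^i_S(X_t; \theta^*_S) Z^{\pi^*}_t - \sum_{i=1}^{d_S} \pi^i_{\lfloor t \rfloor} f^i_S(X_t; \theta^*_S) Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} \frac{S^i_t}{S^i_{\lfloor t \rfloor}} \Big| \le \sum_{i=1}^{d_S} \frac{1}{S^i_{\lfloor t \rfloor}} \big| f^i_S(X_t; \theta^*_S) \big| \Big( \big| \pi^i_t Z^{\pi^*}_t S^i_{\lfloor t \rfloor} - \pi^i_{\lfloor t \rfloor} Z^{\pi^*}_t S^i_{\lfloor t \rfloor} \big| + \big| \pi^i_{\lfloor t \rfloor} Z^{\pi^*}_t S^i_{\lfloor t \rfloor} - \pi^i_{\lfloor t \rfloor} Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} S^i_{\lfloor t \rfloor} \big| + \big| \pi^i_{\lfloor t \rfloor} Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} S^i_{\lfloor t \rfloor} - \pi^i_{\lfloor t \rfloor} Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} S^i_t \big| \Big).$$

For any $t \le \tau_R$, we can further bound the right-hand side by Assumption I.4:

$$\Big| \sum_{i=1}^{d_S} \pi^i_t f^i_S(X_t; \theta^*_S) Z^{\pi^*}_t - \sum_{i=1}^{d_S} \pi^i_{\lfloor t \rfloor} f^i_S(X_t; \theta^*_S) Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} \frac{S^i_t}{S^i_{\lfloor t \rfloor}} \Big| \le d_S \exp(R) C_R^2 \Big[ \exp(R) R \big( |t - \lfloor t \rfloor| + \|X_t - X_{\lfloor t \rfloor}\| \big) + \exp(R) \big( |Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t| + |Z^{\lfloor \pi^* \rfloor}_t - Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor}| \big) + R \|S_{\lfloor t \rfloor} - S_t\| \Big].$$

Then, by Lemma I.8 applied to the five terms in the bracket,

$$\Big| \sum_{i=1}^{d_S} \pi^i_t f^i_S(X_t; \theta^*_S) Z^{\pi^*}_t - \sum_{i=1}^{d_S} \pi^i_{\lfloor t \rfloor} f^i_S(X_t; \theta^*_S) Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} \frac{S^i_t}{S^i_{\lfloor t \rfloor}} \Big|^2 \le 5 d_S^2 \exp(2R) C_R^4 \Big[ \exp(2R) R^2 |t - \lfloor t \rfloor|^2 + \exp(2R) R^2 \|X_t - X_{\lfloor t \rfloor}\|^2 + R^2 \|S_{\lfloor t \rfloor} - S_t\|^2 + \exp(2R) |Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t|^2 + \exp(2R) |Z^{\lfloor \pi^* \rfloor}_t - Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor}|^2 \Big].$$

Similarly,

$$\Big| \sum_{i=1}^{d_S} \pi^i_{\lfloor t \rfloor} \sum_{j=1}^{d_W} g^{ij}_S(X_t; \theta^*_S) \frac{S^i_t}{S^i_{\lfloor t \rfloor}} Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor} - \sum_{i=1}^{d_S} \pi^i_t \sum_{j=1}^{d_W} g^{ij}_S(X_t; \theta^*_S) Z^{\pi^*}_t \Big|^2 \le 5 d_S^2 \exp(2R) C_R^4 \Big[ \exp(2R) R^2 |t - \lfloor t \rfloor|^2 + \exp(2R) R^2 \|X_t - X_{\lfloor t \rfloor}\|^2 + R^2 \|S_{\lfloor t \rfloor} - S_t\|^2 + \exp(2R) |Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t|^2 + \exp(2R) |Z^{\lfloor \pi^* \rfloor}_t - Z^{\lfloor \pi^* \rfloor}_{\lfloor t \rfloor}|^2 \Big].$$

□

LEMMA I.12 With $\tau_R$ defined in Definition I.5,

$$E\big\|X_{t \wedge \tau_R} - X_{\lfloor t \wedge \tau_R \rfloor}\big\|^2 \le 2 C_R^2 (\Delta t^2 + \Delta t),$$
$$E\big\|S_{t \wedge \tau_R} - S_{\lfloor t \wedge \tau_R \rfloor}\big\|^2 \le 2 \exp(2R) C_R^2 (\Delta t^2 + \Delta t),$$
$$E\big|Z^{\lfloor \pi^* \rfloor}_{\lfloor t \wedge \tau_R \rfloor} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big|^2 \le 2 R^2 \exp(4R) C_R^4 (\Delta t^2 + \Delta t).$$

Proof. By the dynamics of $X_t$ and Lemma I.8, we can derive

$$E\big\|X_{t \wedge \tau_R} - X_{\lfloor t \wedge \tau_R \rfloor}\big\|^2 \le 2 \Delta t\, E \int_{m_{t \wedge \tau_R} \Delta t}^{t \wedge \tau_R} \|f_X(X_s; \theta^*_S)\|^2\, ds + 2 E \Big\| \int_{m_{t \wedge \tau_R} \Delta t}^{t \wedge \tau_R} g_X(X_s; \theta^*_S)^\top dW_s \Big\|^2,$$

where $m_{t \wedge \tau_R}$ satisfies $m_{t \wedge \tau_R} \Delta t \le (t \wedge \tau_R) < (m_{t \wedge \tau_R} + 1) \Delta t$. Further, we apply Itô's isometry with stopping time (Lemma D.2) and derive

$$E\big\|X_{t \wedge \tau_R} - X_{\lfloor t \wedge \tau_R \rfloor}\big\|^2 \le 2 \Delta t\, E \int_{m_{t \wedge \tau_R} \Delta t}^{t \wedge \tau_R} \|f_X(X_s; \theta^*_S)\|^2\, ds + 2 E \int_{m_{t \wedge \tau_R} \Delta t}^{t \wedge \tau_R} \|g_X(X_s; \theta^*_S)\|^2\, ds.$$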
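By Assumption I.4, we derive

$$E\big\|X_{t \wedge \tau_R} - X_{\lfloor t \wedge \tau_R \rfloor}\big\|^2 \le 2 C_R^2 (\Delta t^2 + \Delta t).$$

Similarly,

$$E\big\|S_{t \wedge \tau_R} - S_{\lfloor t \wedge \tau_R \rfloor}\big\|^2 \le 2 \exp(2R) C_R^2 (\Delta t^2 + \Delta t), \qquad E\big|Z^{\lfloor \pi^* \rfloor}_{\lfloor t \wedge \tau_R \rfloor} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big|^2 \le 2 R^2 \exp(4R) C_R^4 (\Delta t^2 + \Delta t).$$

□

LEMMA I.13

$$E\Big[\sup_{0 \le t \le \tau} \big(Z^{\pi^*}_{t \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big)^2\Big] \le 10 (T+4) T d_S^2 C_R^4 \exp(4R) R^2 \Big[ \Delta t^2 + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 \exp(4R) C_R^4 (\Delta t^2 + \Delta t) \Big] + 10 (T+4) d_S^2 C_R^4 \exp(4R) \int_0^{\tau} E\Big[\sup_{0 \le r \le s} \big(Z^{\pi^*}_{r \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{r \wedge \tau_R}\big)^2\Big] ds.$$

Proof. By the Cauchy-Schwarz inequality and the dynamics of $Z^{\pi^*}_t$ and $Z^{\lfloor \pi^* \rfloor}_t$, for any $\tau \le T$, and with $\pi^i_s$ as in the proof of Lemma I.11,

$$E\Big[\sup_{0 \le t \le \tau} \big(Z^{\pi^*}_{t \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big)^2\Big] \le 2T\, E\Big[\sup_{0 \le t \le \tau} \int_0^{t \wedge \tau_R} \Big| \sum_{i=1}^{d_S} \pi^i_s f^i_S(X_s; \theta^*_S) Z^{\pi^*}_s - \sum_{i=1}^{d_S} \pi^i_{\lfloor s \rfloor} f^i_S(X_s; \theta^*_S) Z^{\lfloor \pi^* \rfloor}_{\lfloor s \rfloor} \frac{S^i_s}{S^i_{\lfloor s \rfloor}} \Big|^2 ds\Big] + 2 E\Big[\sup_{0 \le t \le \tau} \Big| \sum_{j=1}^{d_W} \int_0^{t \wedge \tau_R} \sum_{i=1}^{d_S} \Big( \pi^i_s g^{ij}_S(X_s; \theta^*_S) Z^{\pi^*}_s - \pi^i_{\lfloor s \rfloor} g^{ij}_S(X_s; \theta^*_S) \frac{S^i_s}{S^i_{\lfloor s \rfloor}} Z^{\lfloor \pi^* \rfloor}_{\lfloor s \rfloor} \Big) dW^j_s \Big|^2\Big].$$

Then, by Doob's martingale inequality, the supremum in the second term can be removed at the cost of a factor of 4, so the second term is bounded by

$$8 E\Big[\Big| \sum_{j=1}^{d_W} \int_0^{\tau \wedge \tau_R} \sum_{i=1}^{d_S} \Big( \pi^i_s g^{ij}_S(X_s; \theta^*_S) Z^{\pi^*}_s - \pi^i_{\lfloor s \rfloor} g^{ij}_S(X_s; \theta^*_S) \frac{S^i_s}{S^i_{\lfloor s \rfloor}} Z^{\lfloor \pi^* \rfloor}_{\lfloor s \rfloor} \Big) dW^j_s \Big|^2\Big].$$

Next, applying Itô's isometry to the stochastic integral and Lemma I.11 to both integrands,

$$E\Big[\sup_{0 \le t \le \tau} \big(Z^{\pi^*}_{t \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big)^2\Big] \le 10 (T+4) d_S^2 C_R^4 \exp(2R)\, E \int_0^{\tau \wedge \tau_R} \Big[ \exp(2R) R^2 |s - \lfloor s \rfloor|^2 + \exp(2R) R^2 \|X_s - X_{\lfloor s \rfloor}\|^2 + R^2 \|S_{\lfloor s \rfloor} - S_s\|^2 + \exp(2R) |Z^{\pi^*}_s - Z^{\lfloor \pi^* \rfloor}_s|^2 + \exp(2R) |Z^{\lfloor \pi^* \rfloor}_s - Z^{\lfloor \pi^* \rfloor}_{\lfloor s \rfloor}|^2 \Big] ds.$$

Therefore, combined with Lemma I.12,

$$E\Big[\sup_{0 \le t \le \tau} \big(Z^{\pi^*}_{t \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big)^2\Big] \le 10 (T+4) T d_S^2 C_R^4 \exp(4R) R^2 \Big[ \Delta t^2 + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 \exp(4R) C_R^4 (\Delta t^2 + \Delta t) \Big] + 10 (T+4) d_S^2 C_R^4 \exp(4R) \int_0^{\tau} E\Big[\sup_{0 \le r \le s} \big(Z^{\pi^*}_{r \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{r \wedge \tau_R}\big)^2\Big] ds.$$

□

LEMMA I.14 With the definitions and assumptions in Section I.1,

$$\lim_{\Delta t \to 0} E\big[(Z^{\lfloor \pi^* \rfloor}_T - Z^{\pi^*}_T)^2\big] = 0.$$

Proof. By Lemma I.13, we apply the Gronwall inequality and obtain

$$E\Big[\sup_{0 \le t \le T} \big(Z^{\pi^*}_{t \wedge \tau_R} - Z^{\lfloor \pi^* \rfloor}_{t \wedge \tau_R}\big)^2\Big] \le 10 (T+4) T d_S^2 C_R^4 \exp(4R) R^2 \Big[ \Delta t^2 + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 \exp(4R) C_R^4 (\Delta t^2 + \Delta t) \Big] \exp\big(10 (T+4) d_S^2 C_R^4 \exp(4R)\big).$$

Then, combined with Lemma I.10, for any $\delta > 0$,

$$E\Big[\sup_{0 \le t \le T} \big(Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t\big)^2\Big] \le 10 (T+4) T d_S^2 C_R^4 \exp(4R) R^2 \Big[ \Delta t^2 + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 C_R^2 (\Delta t^2 + \Delta t) + 2 \exp(4R) C_R^4 (\Delta t^2 + \Delta t) \Big] \exp\big(10 (T+4) d_S^2 C_R^4 \exp(4R)\big) + \frac{2^{p+1} \delta A}{p} + \frac{(p-2)\, 8A}{p\, \delta^{2/(p-2)} R^p}.$$

Given any $\epsilon > 0$, we can first choose $\delta$ so that the second term is below $\epsilon/3$, then choose $R$ so that the third term is below $\epsilon/3$, and finally choose $\Delta t$ small enough that the first term is below $\epsilon/3$. Therefore, $E[\sup_{0 \le t \le T} (Z^{\pi^*}_t - Z^{\lfloor \pi^* \rfloor}_t)^2]$ converges to 0 as $\Delta t$ goes to 0. □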
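The convergence stated in Lemma I.14 can be illustrated numerically. The following is a minimal sketch under simplifying assumptions that are not part of the analysis above: a single asset following a geometric Brownian motion with constant coefficients, and a constant-proportion policy `pi` standing in for $\Pi$. The function name and all parameter values are hypothetical. The piecewise-constant wealth follows the recursion implied by Definition I.3, while the continuously rebalanced wealth is available in closed form for this toy model.

```python
import numpy as np

def terminal_wealth_gap(n_steps, n_paths=20000, T=1.0, mu=0.08,
                        sigma=0.2, pi=0.5, z0=1.0, seed=0):
    """Monte Carlo estimate of E[(Z_T^{floor(pi)} - Z_T^{pi})^2] in a toy GBM model.

    Assumption: one asset, constant drift mu and volatility sigma, and a
    constant proportion pi invested in the asset (hypothetical values).
    """
    rng = np.random.default_rng(seed)
    dt = T / n_steps
    dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_steps))

    # Exact gross returns S_{t_{m+1}} / S_{t_m} of the asset on the trading grid.
    gross = np.exp((mu - 0.5 * sigma**2) * dt + sigma * dW)

    # Piecewise-constant policy: the proportion pi is set at each grid time
    # and the resulting position is held until the next grid time.
    z_piecewise = z0 * np.prod(1.0 + pi * (gross - 1.0), axis=1)

    # Continuous rebalancing: dZ/Z = pi*mu dt + pi*sigma dW has a closed form.
    W_T = dW.sum(axis=1)
    z_continuous = z0 * np.exp((pi * mu - 0.5 * pi**2 * sigma**2) * T
                               + pi * sigma * W_T)

    return float(np.mean((z_piecewise - z_continuous) ** 2))

# The estimated gap should shrink as the trading grid is refined.
gaps = {n: terminal_wealth_gap(n) for n in (4, 16, 64)}
```

In this sketch both wealth processes are driven by the same Brownian increments, which mirrors the pathwise comparison used in the proof.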

I.3 PROOF

For ease of presentation, we define θ := (θ ϕ , θ π , θ S ), θ * := (θ * ϕ , θ * π , θ * S ) and θ * ∆t := (θ * ϕ,∆t , θ * π,∆t , θ * S,∆t ). Note that every discrete-time admissible policy is a continuous-time admissible policy. Thus, the continuous-time admissible policy set includes the discrete-time admissible policy set. Therefore, Ṽ (θ * ) ≥ V * ∆t . In other words, it is enough to bound Ṽ (θ * ) -V (θ * ∆t ) for the proof. By Lemma I.9, θ * maximizes Ṽ and L simultaneously, leading to Ṽ (θ * ) -V (θ * ∆t ) ≤ H(θ * ) -H(θ * ∆t ) 1 -λ . By Assumption I.7, for ∆t ≤ ∆t ′ , θ * is also an admissible parameter θ * ∈ A, leading to H(θ * ) -H(θ * ∆t ) ≤ 0. Further, for any δ > 0, by adding and subtracting equal terms, Ṽ (θ * ) -V (θ * ∆t ) ≤ 1 1 -λ [ H(θ * ) -H(θ * ) + H(θ * ) -H(θ * ∆t )] ≤ 1 1 -λ H(θ * ) -H(θ * ) . Next, we focus on H(θ * ; δ) -H(θ * ; δ) , which by definition has H(θ * ; δ) -H(θ * ; δ) = (1 -λ) E[U ( Z π * T ; δ) -U ( Z⌊π * ⌋ T ; δ)] , where λL(θ  ; δ)] - → E[U ( Z π * T ; δ)], which finishes the proof.

J EXTENDED RESULTS OF THEOREM 4.2

In this section, we study the non-asymptotic guarantees on the performance of FaLPO.

J.1 ANOTHER VERSION OF THEOREM 4.2

DEFINITION J.1 For two random vectors $v$ and $w$, we define the trace of the covariance matrix as

$$\mathrm{Var}(v) := E\big[\|v\|_2^2 - \|E[v]\|_2^2\big], \qquad \mathrm{Cov}(v, w) := E[v^\top w] - E[v]^\top E[w].$$

Further, we use $\mathrm{Var}_\theta(v)$ and $\mathrm{Cov}_\theta(v, w)$ to denote the conditional versions of the two given $\theta$:

$$\mathrm{Var}_\theta(v) := E\big[\|v\|_2^2 - \|E[v \mid \theta]\|_2^2 \,\big|\, \theta\big], \qquad \mathrm{Cov}_\theta(v, w) := E[v^\top w \mid \theta] - E[v \mid \theta]^\top E[w \mid \theta].$$

Note that it is challenging to theoretically analyze the non-convex stochastic optimization (5), although various ad hoc procedures provide good empirical performance. To provide theoretical analysis, in this section we study a projection-based version of FaLPO (Algorithm 2). Specifically, the learning/optimization process is conducted in a bounded parameter space $\mathcal{B}$, in which we assume that the objective function is strongly concave in the parameters.

Estimate the gradients of $H$ following the procedure in Appendix C with the parameter $\lambda$.

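8: Update $\theta_S$ and $\theta_R$ with learning rate $\eta$ by gradients.
9: Project the achieved update to $\mathcal{B}$.
10: end for
11: Return $\theta_\phi$, $\theta_\pi$, and $\theta_S$.

For ease of presentation, we define $\theta^* := (\theta^*_\phi, \theta^*_\pi, \theta^*_S)$, $\theta^*_{\Delta t} := (\theta^*_{\phi,\Delta t}, \theta^*_{\pi,\Delta t}, \theta^*_{S,\Delta t})$, and $\theta^\dagger := (\theta^\dagger_{\phi,\Delta t}, \theta^\dagger_{\pi,\Delta t}, \theta^\dagger_{S,\Delta t})$. Let $\theta_n$ be the estimate after the $n$-th iteration, and $\bar\theta :=$

$C_L$ as the learning rate,

$$E\big[V^*_{\Delta t} - V(\bar\theta)\big] \le \frac{H(\theta^*) - H(\theta^*_{\Delta t})}{1-\lambda} + \frac{H(\theta^*_{\Delta t}) - H(\theta^\dagger)}{1-\lambda} + \frac{C_B \log(N)}{2N(1-\lambda)} + \frac{1}{2NB(1-\lambda)} E\Big[ \sum_{n=0}^{N-1} \frac{1}{n+1} \Big( (1-\lambda)^2\, \mathrm{Var}_{\theta_n}\big(\nabla V_k(\theta_n)\big) + \lambda^2\, \mathrm{Var}_{\theta_n}\big(\nabla L_k(\theta_n)\big) + 2\lambda(1-\lambda)\, \mathrm{Cov}_{\theta_n}\big(\nabla V_k(\theta_n), \nabla L_k(\theta_n)\big) \Big) \Big].$$

Also, there exist situations where some $\lambda \in (0, 1)$ provides a smaller value of

$$(1-\lambda)^2\, \mathrm{Var}\big(\nabla V_k(\theta_n)\big) + \lambda^2\, \mathrm{Var}\big(\nabla L_k(\theta_n)\big) + 2\lambda(1-\lambda)\, \mathrm{Cov}\big(\nabla V_k(\theta_n), \nabla L_k(\theta_n)\big)$$

than $\lambda = 0$. In other words, there exist cases where tuning $\lambda$ may provide better performance.

J.2 ASSUMPTIONS

ASSUMPTION J.4 There exists a constant $C_B > 0$ such that the parameter region $\mathcal{B}$ is a convex set and satisfies the following conditions:

1. In $\mathcal{B} \subseteq \mathcal{A}$, $H(\theta_{\phi,\Delta t}, \theta_{\pi,\Delta t}, \theta_{S,\Delta t})$ is locally $m$-strongly concave with a local maximal point $(\theta^\dagger_{\phi,\Delta t}, \theta^\dagger_{\pi,\Delta t}, \theta^\dagger_{S,\Delta t}) \in \mathcal{B}$.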

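2. For any $\theta \in \mathcal{B}$, $\|\theta\| \le C_B$.

For the last component of (31), since $\nabla H_k(\theta_n)$ is an unbiased gradient estimator,

$$E\Big[ \Big( \nabla H(\theta_n) - \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big)^\top (\theta^\dagger - \theta_n) \Big] = E\Big[ E\Big[ \Big( \nabla H(\theta_n) - \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big)^\top (\theta^\dagger - \theta_n) \,\Big|\, \theta_n \Big] \Big] = 0.$$

Then, for the first component in (31),

$$E\Big[ \frac{1}{n+1} \Big\| \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big\|^2 \Big] = E\Big[ E\Big[ \frac{1}{n+1} \Big\| \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big\|^2 \,\Big|\, \theta_n \Big] \Big] = \frac{1}{n+1} E\Big[ \mathrm{Var}_{\theta_n}\Big( \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big) + \Big\| E\Big[ \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \,\Big|\, \theta_n \Big] \Big\|^2 \Big] \le \frac{1}{n+1} \Big( E\Big[ \mathrm{Var}_{\theta_n}\Big( \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big) \Big] + C_B^2 \Big),$$

where the last inequality is due to condition 3 of Assumption J.4. Then, (31) can be further derived as

$$E\big[H(\theta^\dagger) - H(\bar\theta)\big] \le \frac{1}{2N} \sum_{n=0}^{N-1} E\Big[ \frac{1}{n+1} \mathrm{Var}_{\theta_n}\Big( \frac{\sum_{k=1}^{B} \nabla H_k(\theta_n)}{B} \Big) + \frac{1}{n+1} C_B \Big].$$

As a result,

$$E\big[H(\theta^*_{\Delta t}) - H(\bar\theta)\big] = E\big[H(\theta^*_{\Delta t}) - H(\theta^\dagger) + H(\theta^\dagger) - H(\bar\theta)\big] \le H(\theta^*_{\Delta t}) - H(\theta^\dagger) + \frac{C_B \log(N)}{2N} + E\Big[ \sum_{n=0}^{N-1} \frac{\eta \big[ (1-\lambda)^2\, \mathrm{Var}_{\theta_n}\big(\nabla V_k(\theta_n)\big) + \lambda^2\, \mathrm{Var}_{\theta_n}\big(\nabla L_k(\theta_n)\big) \big]}{BN(n+1)} \Big] + E\Big[ \sum_{n=0}^{N-1} \frac{\eta \big[ 2\lambda(1-\lambda)\, \mathrm{Cov}_{\theta_n}\big(\nabla V_k(\theta_n), \nabla L_k(\theta_n)\big) \big]}{BN(n+1)} \Big].$$

Taking the results back to (29), we finish the proof.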
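The variance term in the bound is the trace variance of the combined estimator $(1-\lambda)\nabla V_k + \lambda \nabla L_k$, expanded with the definitions of Definition J.1. The sketch below checks this expansion on synthetic samples and shows one situation in which a $\lambda \in (0, 1)$ gives a smaller value than $\lambda = 0$. The sample arrays `grad_v` and `grad_l` are hypothetical stand-ins for per-trajectory gradient estimates; they are constructed to be negatively correlated and are not produced by FaLPO.

```python
import numpy as np

def trace_var(v):
    """Var(v) = E||v||^2 - ||E v||^2, with samples stacked along axis 0."""
    return float(np.mean(np.sum(v * v, axis=1)) - np.sum(np.mean(v, axis=0) ** 2))

def trace_cov(v, w):
    """Cov(v, w) = E[v^T w] - E[v]^T E[w], with samples stacked along axis 0."""
    return float(np.mean(np.sum(v * w, axis=1)) - np.mean(v, axis=0) @ np.mean(w, axis=0))

def combined_variance(grad_v, grad_l, lam):
    """(1-lam)^2 Var(v) + lam^2 Var(l) + 2 lam (1-lam) Cov(v, l)."""
    return ((1.0 - lam) ** 2 * trace_var(grad_v)
            + lam ** 2 * trace_var(grad_l)
            + 2.0 * lam * (1.0 - lam) * trace_cov(grad_v, grad_l))

# Hypothetical gradient samples: the two estimators share a noise source
# with opposite signs, so their covariance is negative.
rng = np.random.default_rng(0)
noise = rng.normal(size=(5000, 3))
grad_v = 1.0 + noise
grad_l = 0.5 - 0.8 * noise + 0.1 * rng.normal(size=(5000, 3))

lam = 0.3
direct = trace_var((1.0 - lam) * grad_v + lam * grad_l)   # variance of the mix
expanded = combined_variance(grad_v, grad_l, lam)          # three-term expansion
```

Here `direct` and `expanded` agree up to floating-point error, and the value at `lam = 0.3` is below the value at `lam = 0` because of the negative covariance. Whether this happens in practice depends on the actual gradient estimators.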

K EXTENDED RESULTS FOR SYNTHETIC EXPERIMENTS

For synthetic portfolio optimization, we provide details for drift and volatility (Appendix K.1), data generation (Appendix K.2), hyperparameter tuning (Appendix K.3), and extended experimental results (Appendix K.4). We consider 21-day trading, and generate 1000 trajectories with 21 observations for training, 1000 for validation, and 1000 for testing. To compare different methods, we calculate the average terminal utility as the metric.
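As a rough illustration of how trajectories of this shape can be produced, the following minimal sketch applies an explicit Euler step to a price SDE. It is not the generator used in the experiments: a geometric Brownian motion with constant coefficients stands in for the stochastic factor model, and the function name, the number of assets, the annualised drift and volatility, and the step $\Delta t = 1/252$ are all assumptions of the sketch.

```python
import numpy as np

def simulate_prices(n_paths=1000, n_obs=21, n_assets=3,
                    annual_drift=0.1, annual_vol=0.2,
                    dt=1.0 / 252, seed=0):
    """Explicit Euler simulation of dS = S (mu dt + sigma dW) per asset.

    Assumption: constant drift and volatility and independent Brownian
    motions per asset; the factor dynamics of the paper are not modelled.
    """
    rng = np.random.default_rng(seed)
    prices = np.empty((n_paths, n_obs, n_assets))
    # Initial prices drawn uniformly from [20, 30].
    prices[:, 0, :] = rng.uniform(20.0, 30.0, size=(n_paths, n_assets))
    for k in range(1, n_obs):
        dW = rng.normal(0.0, np.sqrt(dt), size=(n_paths, n_assets))
        # Euler step: S_{k} = S_{k-1} + S_{k-1} * (mu * dt + sigma * dW)
        prices[:, k, :] = prices[:, k - 1, :] * (1.0 + annual_drift * dt
                                                 + annual_vol * dW)
    return prices

# One array per split: 1000 trajectories with 21 observations each.
train, val, test = (simulate_prices(seed=s) for s in (0, 1, 2))
```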

K.1 DRIFT AND VOLATILITY

Drift and volatility are two important concepts characterising the strength of signal and noise in financial markets. To demonstrate this, for an asset price $S^i_t$ and time interval $\Delta t$, define the return as

$$\mathrm{return}^i_t = \frac{S^i_{t+\Delta t} - S^i_t}{S^i_t}.$$
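The return definition above translates directly into code. The sketch below computes simple returns from a price series; the function name and the sample prices are hypothetical. The sample mean and sample standard deviation of these returns are the usual empirical proxies for drift and volatility over one step of length $\Delta t$.

```python
import numpy as np

def simple_returns(prices):
    """return_t = (S_{t+dt} - S_t) / S_t for consecutive observations."""
    prices = np.asarray(prices, dtype=float)
    return (prices[1:] - prices[:-1]) / prices[:-1]

# Hypothetical price series observed at a fixed interval.
prices = [100.0, 102.0, 99.96, 101.0]
returns = simple_returns(prices)

# Empirical proxies over one step of length dt.
drift_estimate = float(np.mean(returns))
volatility_estimate = float(np.std(returns, ddof=1))
```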

Hyperparameters    Values
Batch Size         {100, 50}
λ                  {0, 05, 0.1, 0.9}
Learning Rate      {0.0005, 0.001, 0.01, 0.1}

-59.2338 ± 77.4511    -70.8667 ± 49.2500    -76.8619 ± 37.448

Table 9: Average terminal utility after tuning with standard deviation for synthetic data with γ = 3.

K.4.3 SYNTHETIC EXPERIMENT RESULTS WITH POWER UTILITY

We also conduct synthetic experiments maximizing the expected power utility for portfolio optimization. The results are summarized in Figure 5 . As a sanity check, we study a Merton problem where the optimal performance can be mathematically derived, in order to compare the performance of FaLPO to the optimal one. We simulate data following a Merton model in Appendix F.1, with d S = 10, d W = 10. With an exponential utility function with γ = 5, according to Lemma F.1, the optimal policy can be derived as π * = µ(σσ ⊤ ) -1 Z t . ( ) Further, by taking (32) back into (14), we can derive the optimal expected terminal utility as max πt E π [U (Z π T )|z 0 ] = - e -γz0 γ e -1 2 µ ⊤ (σσ ⊤ ) -1 µT , which is the theoretically optimal performance. Then, to implement FaLPO, we generate fake features which are independent from the asset prices: the optimal policy is not dependent on these features. Ideally, FaLPO should be able to automatically ignore the fake features, and deliver performance similar to the theoretically optimal derivation. The results of FaLPO, MMMC, and the theoretically optimal derivation are reported in Figure 6 . Note that FaLPO achieves slightly worse performance compared to the other two. The reason for the slight suboptimality of FaLPO in the Merton case is twofold: i. the expected terminal utility is derived for a continuous-time policy while FaLPO learns a discrete-time policy with time interval ∆t; ii. FaLPO uses an over-complicated model with stochastic factors, while the true data generating process follows a Merton model without stochastic factors. We consider 21-day stock trading in four different stock sectors using the daily stock price data from Yahoo finance between January 4, 2006 and April 1, 2022. More specifically, we use the adjusted close price as the daily trading price. 
For factors, we consider economic indexes, technical analysis indexes (generated by the Python package TA), and sector-specific features such as oil prices, gold prices, and related ETF prices, leading to around 30 factors for each sector. In each sector we select 10 stocks according to availability and trading volume in the considered time range. The considered sectors, stocks, and features are provided in Table 10. We consider the same competing methods as in Section 5.1 and compare their performance using the average achieved terminal utility over different trajectories as the metric. The larger the utility, the better.


$$-\lambda_1 \min_{C, b} \sum_{i=1}^{m} \big\| \phi(X_{t_{i+1}}) - C \phi(X_{t_i}) - b \big\|_2^2,$$

where $C$ is a matrix and $b$ a vector of proper dimensions. As discussed in Appendix E.3, this penalty encourages a simple representation function $\phi$. The second penalty is the negative sample variance of the terminal wealth, with the parameter $\lambda_2$ determining its strength. The intuition of this penalty is to further penalize the instability of the algorithm performance. The second penalty is implemented for all the competing methods except MMMC for a fair comparison.
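Both penalties can be sketched in a few lines. The first is computed by solving the inner minimisation over $C$ and $b$ as an ordinary least-squares problem on consecutive values of $\phi$; the second is the negative sample variance of terminal wealth. This is a minimal sketch, assuming the values $\phi(X_{t_i})$ are already available as rows of an array. The function names are hypothetical, and in training the first penalty would have to be written in a differentiable framework rather than NumPy.

```python
import numpy as np

def calibration_penalty(phi_values, lam1):
    """-lam1 * min_{C,b} sum_i ||phi_{i+1} - C phi_i - b||^2.

    phi_values: array of shape (m + 1, k), rows are phi(X_{t_i}) in time order.
    """
    phi_values = np.asarray(phi_values, dtype=float)
    past, nxt = phi_values[:-1], phi_values[1:]
    # Append a column of ones so that the intercept b is fitted jointly with C.
    design = np.hstack([past, np.ones((past.shape[0], 1))])
    coef, *_ = np.linalg.lstsq(design, nxt, rcond=None)
    residual = nxt - design @ coef
    return -lam1 * float(np.sum(residual ** 2))

def variance_penalty(terminal_wealth, lam2):
    """-lam2 * sample variance of terminal wealth across trajectories."""
    return -lam2 * float(np.var(np.asarray(terminal_wealth, dtype=float), ddof=1))

# Hypothetical check: a representation that evolves exactly affinely
# incurs (numerically) zero calibration penalty.
C_true = np.array([[0.9, 0.1], [0.0, 0.8]])
b_true = np.array([0.2, -0.1])
phi_path = [np.array([1.0, 2.0])]
for _ in range(10):
    phi_path.append(C_true @ phi_path[-1] + b_true)
affine_penalty = calibration_penalty(np.array(phi_path), lam1=1.0)
```

Using a least-squares solver for the inner problem keeps the penalty free of extra trainable parameters, at the cost of one small linear solve per trajectory.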

L.3 TRAIN-VALIDATION-TEST SPLIT BY SLIDING WINDOW

We detail the sliding window method for train-validation-test split for real-world portfolio optimization experiments (Figure 7 ). In financial markets, the dynamics under asset prices and factors vary over time, leading us to construct a sliding window on the dataset for training, validation and testing. Specifically, given a dataset of asset prices and observed features, we construct several windows of observations of equal length. We divide each window into three contiguous periods, the first used for training, the second for validation, and the third for testing. We refer to the length (in days) of the training period as the training size (same for validation and test periods). After constructing one window, we move the start time point by a fixed number of days (the window gap), and construct the second window. A given method is trained on the training set of each window separately, and then validated and tested on the corresponding validation and test sets. The final validation and test performances are calculated by averaging over each window. The experimental setup is summarized in Table 11 . We report the considered hyperparameter values in Table 12 . 
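The window construction described above can be sketched as follows. This is a minimal illustration on day indices only; the function name, argument names, and the example sizes are hypothetical and are not the values of Table 11.

```python
def sliding_windows(n_days, train_size, val_size, test_size, window_gap):
    """Return a list of (train, val, test) index ranges, one triple per window.

    Each window consists of three contiguous periods; consecutive windows
    start window_gap days apart, and only complete windows are kept.
    """
    window_length = train_size + val_size + test_size
    windows = []
    start = 0
    while start + window_length <= n_days:
        train = range(start, start + train_size)
        val = range(train.stop, train.stop + val_size)
        test = range(val.stop, val.stop + test_size)
        windows.append((train, val, test))
        start += window_gap
    return windows

# Hypothetical example: 100 days, 40/10/10 split, windows 20 days apart.
windows = sliding_windows(n_days=100, train_size=40, val_size=10,
                          test_size=10, window_gap=20)
```

A method would then be fitted on `train`, tuned on `val`, and scored on `test` for each triple, with the final validation and test performance averaged over the windows.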



Note that it is impossible to rebalance a portfolio infinitely frequently in practice. Thus, continuous-time policies are more useful as analytical tools.



Figure 1: Demonstration of FaLPO

Figure 2: Power & exponential utilities.

Figure 3: Demonstration for neural stochastic factor models.

Figure 4: FaLPO average return over portfolio terminal dates.

) has a unique weak solution which is affine and takes values in $[0, \infty)^k$ (Filipović & Mayerhofer, 2009). Further, the EVE model requires the following two assumptions:

ASSUMPTION F.4 (Assumption 2.2 in Avanesyan et al. (2018)) The functions $\mu: \mathbb{R}^{d_X} \to \mathbb{R}^{d_S}$ and $\sigma: \mathbb{R}^{d_X} \to \mathbb{R}^{d_W \times d_S}$ are continuous. Moreover, the columns of $\rho$ belong to the range of left-multiplication by $\sigma(x)$ for all $x \in \mathbb{R}^{d_X}$.

ASSUMPTION F.5 (EVE Condition in Avanesyan et al. (2018)) For some $p \in [0, 1]$,

Projected FaLPO
1: Input: Hyperparameter $\lambda$, learning rate $\eta$, number of iterations $N$, the strongly concave region $\mathcal{B}$, and batch size $B$.
2: Output: $\theta_\phi$, $\theta_\pi$, and $\theta_S$
3: Initialize neural networks with initial parameters $(\theta_\phi, \theta_\pi, \theta_S) \in \mathcal{B}$.
4: Parameterize the policy function by (3).
5: for $n \in [N]$ do
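The projected update at the core of this algorithm can be sketched on a toy problem. The sketch below is not the FaLPO implementation. It assumes that the region is a Euclidean ball of a given radius (the analysis only requires a bounded convex set), uses a hypothetical noisy gradient oracle in place of the procedure of Appendix C, and uses the step size $1/(m(n+1))$, a shifted variant of the $1/(nm)$ schedule chosen here to avoid a division by zero at the first iteration. It returns both the averaged iterate and the final iterate.

```python
import numpy as np

def project_to_ball(theta, radius):
    """Euclidean projection onto {theta : ||theta|| <= radius}."""
    norm = np.linalg.norm(theta)
    return theta if norm <= radius else theta * (radius / norm)

def projected_sga(grad_estimator, theta0, n_iters, m, radius, rng):
    """Projected stochastic gradient ascent with step size 1 / (m (n + 1)).

    grad_estimator(theta, rng) is a hypothetical unbiased gradient oracle.
    Returns (average of theta_0..theta_{N-1}, final iterate theta_N).
    """
    theta = project_to_ball(np.asarray(theta0, dtype=float), radius)
    iterates = []
    for n in range(n_iters):
        iterates.append(theta)
        eta = 1.0 / (m * (n + 1))
        ascent_step = theta + eta * grad_estimator(theta, rng)
        theta = project_to_ball(ascent_step, radius)   # stay inside the region
    return np.mean(iterates, axis=0), theta

# Toy strongly concave objective H(theta) = -0.5 * m * ||theta - target||^2
# observed through noisy gradients (all values hypothetical).
target = np.array([0.5, -0.3])

def noisy_gradient(theta, rng):
    return -(theta - target) + 0.5 * rng.normal(size=theta.shape)

theta_avg, theta_last = projected_sga(noisy_gradient, np.zeros(2), n_iters=2000,
                                      m=1.0, radius=2.0,
                                      rng=np.random.default_rng(1))
```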

It is a common technique to consider the average estimate $\bar\theta$ instead of the final estimate $\theta_N$ for such analysis. Then, we provide a new version of Theorem 4.2.

DEFINITION J.2 With the gradient estimations discussed in Appendix C, we define $\nabla H_k(\theta) := (1-\lambda) \nabla V_k(\theta) + \lambda \nabla L_k(\theta)$.

THEOREM J.3 With assumptions in Section J.2, $\lambda \in [0, 1)$, and η < 1

Figure 5: Average Terminal Power Utility

Figure 6: Negative average terminal utility of FaLPO and MMMC, and negative expected optimal terminal utility. The smaller the better.

Competing methods and their characteristics.

Metrics We compare different methods using the average terminal utility since it is the ultimate goal in our portfolio optimization problem formulation and is commonly used in continuous-time finance models. There exist other statistics measuring the performance of portfolios (see Section B). These statistics are not equivalent to or consistent with the utility, and thus we do not emphasize them.

Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, and Sergey Levine. Path integral guided policy search. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pp. 3381-3388. IEEE, 2017b.

Ivo Welch and Amit Goyal. A comprehensive look at the empirical performance of equity premium prediction. The Review of Financial Studies, 21(4):1455-1508, 2008.

Zhuoran Xiong, Xiao-Yang Liu, Shan Zhong, Hongyang Yang, and Anwar Walid. Practical deep reinforcement learning approach for stock trading. arXiv preprint arXiv:1811.07522, 2018.

Pan Xu, Felicia Gao, and Quanquan Gu. An improved convergence analysis of stochastic variance-reduced policy gradient. In Uncertainty in Artificial Intelligence, pp. 541-551. PMLR, 2020.

Jiongmin Yong and Xun Yu Zhou. Stochastic controls: Hamiltonian systems and HJB equations, volume 43. Springer Science & Business Media, 1999.

Pengqian Yu, Joon Sern Lee, Ilya Kulyatin, Zekun Shi, and Sakyasingha Dasgupta. Model-based deep reinforcement learning for dynamic portfolio optimization. arXiv preprint arXiv:1901.08740, 2019.

Hyperparameters for synthetic experiments

K.4 SYNTHETIC EXPERIMENT RESULTS

To gain a more holistic understanding of the performance of FaLPO in a variety of settings, we conduct experiments with different numbers of stocks to be traded (Appendix K.4.1), different risk preferences (Appendix K.4.2), and alternative utility functions (Appendix K.4.3). Finally, we also compare the performance of various methods under the Merton model as a sanity check (Appendix K.4.4).

K.4.1 SYNTHETIC EXPERIMENT RESULTS WITH DIFFERENT DIMENSIONS

Tables 6 and 7 report the synthetic experiment results with the number of simulated stocks ($d_S$) varying in {10, 15}. The performance is not strictly negatively correlated with the number of dimensions of the problem or the annual volatility in simulation. The reason is that the noise in the problem is determined by the whole volatility matrix $\sigma$, which is randomly generated in the synthetic experiment (Appendix K.2). In other words, the dimension and average scale cannot fully characterize the extent of the noise in a synthetic task.

Average terminal utility after tuning with standard deviation for synthetic data with d S = 10 and d W = 10.

Average terminal utility after tuning with standard deviation for synthetic data with $d_S = 15$ and $d_W = 15$.

K.4.2 SYNTHETIC EXPERIMENT RESULTS WITH DIFFERENT VALUES OF γ

Tables 8 and 9 report experimental results with $d_S = 10$, $d_W = 10$, and $\gamma \in \{3, 10\}$ for an exponential utility. FaLPO outperforms the competing methods in most scenarios.

± 0.0021    -0.0055 ± 0.0008    -0.0132 ± 0.0028
DDPG      -0.003 ± 0.001     -0.0105 ± 0.006     -0.0205 ± 0.0034
SLAC      -0.003 ± 0.0007    -0.0153 ± 0.0013    -0.0192 ± 0.0011
RichID    -0.012 ± 0.0005    -0.0188 ± 0.0002

Average terminal utility after tuning with standard deviation for synthetic data with γ = 10.

Selected stocks and features

L.2 EXTRA PENALTIES

For real-world experiments, we consider two extra penalty terms for better stability. The first penalty is the model calibration loss discussed in Appendix E.3. Given a trajectory with time interval $\Delta t$, $\tau := \{t_i, s_{t_i}, x_{t_i} \mid i \in [m]\}$, it is defined as

Setup for real-world experiments.


3. For any $\theta \in \mathcal{B}$, the expectation of the gradient estimation is bounded by

These assumptions are widely used in existing analysis (Papini et al., 2018; Karimi et al., 2019; Agarwal et al., 2021; Bhandari & Russo, 2019; Wang et al., 2019; Xu et al., 2020).

ASSUMPTION J.5 At the $n$-th iteration, we use the learning rate $\eta_{n+1} = \frac{1}{nm}$.

Note that in practice we will tune the learning rate $\eta$ as a hyperparameter, since we may not know $m$. However, it is a common practice to set the learning rate as in Assumption J.5 (Hazan & Kale, 2011; Nemirovski et al., 2009; Shalev-Shwartz et al., 2011).

J.3 TECHNICAL LEMMAS FOR THEOREM 4.2

LEMMA J.6 With Assumptions J.4 and J.5, we have

Proof. By the strong concavity of $H$ in Assumption J.4,

Further, since $\theta_{n+1}$ is the projection of

to $\mathcal{B}$, the projection satisfies

We reorder (27) and derive

Taking the result back to (26):

By averaging over $n$ with Assumption J.5 we get

where the first inequality is due to condition 1 of Assumption J.4. □

J.4 PROOF OF THEOREM 4.2

Note that every discrete-time admissible policy is a continuous-time admissible policy. Thus, the continuous-time admissible policy set includes the discrete-time admissible policy set. Therefore, $\tilde{V}(\theta^*) \ge V^*_{\Delta t}$, and it is enough to bound $\tilde{V}(\theta^*) - V(\bar\theta)$ for the proof. By Lemma I.9, $\theta^*$ maximizes $\tilde{V}$ and $L$ simultaneously. Therefore,

Then, we use the convergence result for Algorithm 2 detailed by Lemma J.6:

Then, we take the expectation on both sides

The $\mathrm{return}^i_t$ can be daily, monthly, or yearly, depending on the length of $\Delta t$. For a specific asset, drift ($f^i(X_t; \theta^*_S)$ in (2)) is approximately the expectation of the return, while volatility ($g^i(X_t; \theta^*_S)$) is approximately the return's standard deviation. Given multiple assets, drift ($f(X_t; \theta^*_S)$) is a vector and volatility ($g(X_t; \theta^*_S)$) is a matrix. When generating synthetic data (Appendix K.2), we fix the scale of drift and vary the scale of volatility, which is defined as the average value of each component.

K.2 DATA GENERATION

We simulate data for $S_t$ and $X_t$ following SDE (6). To this end, drift and volatility are randomly picked while mimicking historical stock price data, with an average annual return of around 0.1 and an average annual volatility in {0.1, 0.2, 0.3}, leading to a daily return of around 0.1/252 and a daily volatility of around {0.1/252, 0.2/252, 0.3/252}. The true representation function is selected as a component-wise exponential operation. Then, we discretize the SDE following the explicit Euler method and generate data accordingly (Beskos & Roberts, 2005).

The specific configurations for data generation are:
• Define two scalars, $C_d$ and $C_v$, determining the scale of drift and volatility:
• $\sigma$ is selected as a random matrix, whose components follow a uniform distribution in
• $v$ is selected as a random matrix, whose components follow a uniform distribution in
• $\mu$ is selected as a diagonal matrix whose diagonal components follow a uniform distribution in [0.9, 1].
• The initial values of $X$ are randomly generated from a uniform distribution on
• The initial prices of assets are randomly generated from a uniform distribution in [20, 30].

Note that the design makes sure that the simulated price has approximately a yearly return of 0.1 and a yearly volatility in {0.1, 0.2, 0.3}.

For better performance, we conduct early stopping for all methods using the average validation utility, with a patience of 5 steps. The considered hyperparameters include the learning rate, $\lambda$, and batch size. For each configuration, we train 5 times and average the results. Then, we pick the configuration providing the best average validation utility, test it on the test data, and calculate the average test utility per trajectory. The tuning process is conducted using the software wandb (Biewald, 2020).

According to (5), the value of $\lambda$ determines the weight of the FaLPO model calibration. In this section, we conduct a sensitivity analysis of FaLPO on $\lambda$.
Under the same protocol as the experiments in Section 5.2, we also report FaLPO with different values of $\lambda$ when applied to different sectors. The results are reported in Figure 8. Compared to the case without model calibration ($\lambda = 0$), a small non-zero $\lambda$ provides higher terminal utilities and lower variance. This observation justifies our method of incorporating model calibration into policy learning. Then, when $\lambda$ gets bigger and close to one, the performance of FaLPO decays while the variance also gets smaller.

M MORE INFORMATION FOR COMPETING METHODS

Here we provide more information on the implemented competing methods. First of all, we focus on policy learning methods, without studying other performance-improving techniques like data augmentation or feature engineering (see Appendix A.2 for a review of such methods). Such techniques can be easily applied to FaLPO. Further, for a thorough comparison, we summarize the existing policy-learning methods for portfolio optimization with the following four representatives. Note that all the following methods take asset price data and features as the input for a fair comparison.

• DDPG is implemented with the gradient estimation detailed in Appendix C and also discussed in Nan et al. (2022); Xiong et al. (2018); Jiang et al. (2017). This design makes sure that DDPG can leverage offline data without exploration.
• SLAC (Lee et al., 2020) learns a representation of factors jointly with policy learning, but no parametric models are used in this process.
• RichID (Mhammedi et al., 2020) falls into the category of model-based policy learning, like Yu et al. (2019). It first learns the representation of factors and then conducts policy learning; both steps take advantage of a parametric model. For better performance in portfolio optimization, we use the Kim-Omberg model instead of the LQR model originally proposed with this method.
• CT-MB-RL is a policy gradient method optimizing the performance objective using the policy functional form derived from continuous-time models, but without factor representation learning.

We also implement MMMC as a representative of continuous-time finance methods. More complicated and advanced continuous-time finance methods are hard to implement for two reasons. First, to implement such methods, we need to estimate all the parameters of a multivariate SDE (like $\sigma$, $v$, $\mu$, and $\omega$ in Section 3.3). This is challenging since the derivation of the likelihood requires solving multivariate stochastic integrals (Ait-Sahalia & Kimmel, 2010). Second, deriving explicit optimal policy functions is also difficult, as it involves solving high-dimensional PDEs (like $k_2(t)$ and $k_3(t)$ in Lemma F.2). Further, such pure continuous-time models are expected to underperform, since they assume that the data exactly follow a parametric SDE and tend to underfit. This can also be seen in our comparison with CT-MB-RL (Section 5), which is a model-based RL method relying on a Kim-Omberg model. As a result, we focus our empirical comparison on more competitive RL methods.

Note that FaLPO circumvents the two aforementioned challenges. First, our model calibration does not aim to fit all the parameters in an SDE, but only those related to learning $\theta_\phi$ and $\theta_\pi$. That is why our model calibration loss in Section 3.3 has such an easy-to-calculate form, with the parameter $\theta_S$ as a simple vector. Second, FaLPO does not need a fully derived closed-form solution for the optimal policy. As in Section 3.3, we use neural networks to parameterize $K(t)$ and $\phi(\cdot)$, instead of fully deriving them as continuous-time finance methods do. Being able to bridge this gap between continuous-time finance models and high-dimensional stock trading problems is one of our contributions.

N EXPERIMENTS WITH TRANSACTION COSTS

In this section, we consider the case with transaction costs. The cost of borrowing a stock to short varies, but typically ranges from 0.3% to 3% per year. Therefore, we use a 1% annual transaction cost for short selling an asset, with the fees applied daily. Under this setting, we replicate our real-world experiments for the oil sector using the same protocol. After the tuning procedure in Appendix L.3, the achieved results are reported in

We vary the initial wealth in {3000, 5000, 8000, 1000} for portfolio optimization using stocks in the oil sector, following the same protocol as the experiments in Section 5.2. The results are summarized in Table 14. Specifically, FaLPO achieves superior performance to the competing methods across the different initial wealth levels. Also, note that all the methods achieve higher terminal utility given more initial wealth.

