MANAGING TEMPORAL RESOLUTION IN CONTINUOUS VALUE ESTIMATION: A FUNDAMENTAL TRADE-OFF

Anonymous authors
Paper under double-blind review

Abstract

A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.

1. INTRODUCTION

In many real-world applications of control and reinforcement learning, the underlying system evolves continuously in time. For instance, a physical system such as a robot is naturally modeled as a stochastic dynamical system. In practice, however, sensor measurements are usually captured at discrete time intervals, and the practitioner must decide how to discretize the time dimension, i.e. choose a sampling frequency or a measurement step-size. A common belief is that a finer time discretization always leads to better estimation of the system properties and of the control cost, or the reward in reinforcement learning. As we show, this is only true with an unlimited data budget. In practice there are always limitations on how much data can be collected, stored and processed.

Consider, for example, the task of episodic policy evaluation with a finite data budget. A higher temporal resolution means that more data is collected within fewer episodes. This inevitably leads to the question of how to optimally choose the time discretization for the task at hand. The practitioner therefore faces a fundamental trade-off: a finer temporal resolution leads to a better approximation of the continuous-time system from discrete measurements, but collecting denser data along fewer trajectories leads to larger estimation variance with respect to stochasticity in the system. This is true for any system with stochastic dynamics, even if the learner has access to exact (noiseless) measurements of the system's state. In this paper, we show that data efficiency can be significantly improved by leveraging a precise understanding of the trade-off between approximation error and statistical estimation error in long-term value estimation, two factors that react differently to the level of temporal discretization. The main contributions of this work are twofold.
First, we consider the simplest and canonical case of Monte-Carlo value estimation in a Langevin dynamical system (linear dynamics perturbed by a Wiener process) with quadratic instantaneous costs. Although the setup is specialized, it is simple enough such that we can obtain analytical expressions of the least-squares error that exactly characterize the approximation-estimation trade-off with respect to the step-size parameter. Second, we present a numerical study that illustrates and confirms the trade-off in both linear and non-linear systems, including several MuJoCo control environments. Our findings imply that practitioners should pay attention to carefully choosing the step-size parameter of the estimation to obtain the most accurate results possible.

1.1. RELATED WORK

There is a sizable literature on reinforcement learning in continuous-time systems (e.g. Doya, 2000; Lee & Sutton, 2021; Lewis et al., 2012; Bahl et al., 2020; Kim et al., 2021; Yildiz et al., 2021). These previous works have largely focused on deterministic dynamics, and do not investigate trade-offs in temporal discretization. A smaller body of work has considered learning continuous-time control under stochastic (Baird, 1994; Bradtke & Duff, 1994; Munos & Bourgine, 1997; Munos, 2006) or bounded (Lutter et al., 2021) perturbations, but with a focus on making standard learning methods more robust to small time scales (Tallec et al., 2019), again without explicitly managing the temporal discretization level. There have also been works that characterize the effects of temporal truncation in infinite-horizon problems (Jiang et al., 2016; Droge & Egerstedt, 2011). Despite these prevailing topics in the literature, we find that managing temporal discretization offers substantial improvements not captured by these previous studies. The LQR setting is a standard framework in control theory and gives rise to a fundamental optimal control problem (Lindquist, 1990), which has proven to be a challenging scenario for reinforcement learning algorithms (Tu & Recht, 2019; Krauth et al., 2019). The stochastic LQR considers linear systems driven by additive Gaussian noise with a quadratic cost, which is to be minimised by means of a feedback controller. Although it is a well-understood scenario and a closed form of the optimal controller is known thanks to the separation principle (Georgiou & Lindquist, 2013), only recently have the statistical properties of the long-term cost been investigated (Bijl et al., 2016).
The work in this paper is also closely related to the now sizable literature on reinforcement learning in LQR systems (Bradtke, 1992; Krauth et al., 2019; Tu & Recht, 2018; Dean et al., 2020; Tu & Recht, 2019; Dean et al., 2018; Fazel et al., 2018; Gu et al., 2016). These existing works have uniformly focused on the discrete-time setting, although the benefits of managing spatial rather than temporal discretization have been considered (Sinclair et al., 2019; Cao & Krishnamurthy, 2020). Wang et al. (2020) studied the continuous-time LQR setting but focused on the exploration problem rather than the temporal discretization. There is compelling empirical evidence that managing temporal resolution, typically via action persistence (Lakshminarayanan et al., 2017; Sharma et al., 2017; Huang et al., 2019; Huang & Zhu, 2020; Dabney et al., 2021; Park et al., 2021), can greatly improve learning performance. Even grid worlds (Sutton & Barto, 2018) can be seen as leveraging a form of action persistence, where a coarse spatial discretization is imposed on an otherwise continuous two-dimensional navigation problem to improve learning efficiency. These empirical findings have recently been supported by an initial theoretical analysis (Metelli et al., 2020) showing that temporal discretization plays a role in determining the effectiveness of fitted Q-iteration. The analysis by Metelli et al. (2020) does not consider fully continuous systems, but rather remains anchored in a base-level discretization, and only provides worst-case upper bounds that do not necessarily capture the detailed trade-offs one faces in practice. Choosing the temporal resolution can also be understood as a non-linear experimental design problem (Chaloner & Verdinelli, 1995; Ford et al., 1989): by choosing the time discretization, the experimenter determines how to allocate measurements for a given data budget.
What is peculiar to our objective is that any fixed design has a constant approximation error (bias) that persists even as the number of data points becomes infinite. At the same time, the bias can be reduced by sacrificing estimation error (variance). Optimal designs that consider the bias-variance trade-off jointly have been studied previously (e.g. Bardow, 2008; Mutny et al., 2020; Mutnỳ & Krause, 2022).

2. POLICY EVALUATION IN CONTINUOUS LINEAR QUADRATIC SYSTEMS

In the classical continuous-time linear quadratic regulator (LQR), a state variable X(t) ∈ R^n evolves over time t ≥ 0 according to the stochastic differential equation

dX(t) = AX(t) dt + BU(t) dt + σ dW(t). (1)

The dynamical model is fully specified by the matrices A ∈ R^{n×n}, B ∈ R^{n×p} and the diffusion coefficient σ. The control input U(·) ∈ R^p is given by a fixed policy, and W(t) is a Wiener process. The state variable X(t) is fully observed. For simplicity, we assume that the dynamics start at X(0) = 0 ∈ R^n (c.f. Abbasi-Yadkori & Szepesvári, 2011; Dean et al., 2020). The expected quadratic cost J is defined for positive semi-definite, symmetric matrices Q ∈ R^{n×n} and R ∈ R^{p×p}, a system horizon 0 < τ ≤ ∞ and a discount factor γ ∈ (0, 1]:

J_τ = ∫_0^τ γ^t ( X(t)^⊤ Q X(t) + U(t)^⊤ R U(t) ) dt. (2)

In the following we consider the class of controllers given by static feedback of the state, i.e.

U(t) = K X(t), (3)

where K ∈ R^{p×n} is the static control matrix yielding the control input. It is well known that in infinite-horizon systems with discounting, the optimal control is of the form of Eq. (3). The specific choice of policy plays no particular role in what follows; we therefore reduce the LQR in Eq. (1) further to a linear stochastic dynamical system described by a Langevin equation. Using the definitions Ā := A + BK and Q̄ := Q + K^⊤ R K, we express both the state dynamics and the cost in a more compact form:

dX(t) = Ā X(t) dt + σ dW(t),   J_τ = ∫_0^τ γ^t X(t)^⊤ Q̄ X(t) dt. (4)

The expected cost V_τ is the expectation of the cost w.r.t. the Wiener process, V_τ = E[J_τ]. Equation (4) is what we analyze in the following. From now on, we explicitly distinguish the finite-horizon setting, where τ < ∞, γ ≤ 1 and the cost is V_τ, from the infinite-horizon setting, where τ = ∞, γ < 1 and the cost is V_∞.
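The closed-loop reduction above is a one-line computation. A minimal NumPy sketch, using an illustrative double-integrator system and a hand-picked stabilizing gain K (these particular matrices are examples, not taken from the paper):

```python
import numpy as np

# Illustrative system: double integrator with a hand-picked stabilizing gain K.
A = np.array([[0.0, 1.0],
              [0.0, 0.0]])    # open-loop dynamics
B = np.array([[0.0],
              [1.0]])         # control input matrix
K = np.array([[-1.0, -2.0]])  # static feedback U(t) = K X(t)
Q = np.eye(2)                 # state cost
R = np.eye(1)                 # control cost

# Closed-loop reduction: A_bar := A + B K, Q_bar := Q + K^T R K.
A_bar = A + B @ K
Q_bar = Q + K.T @ R @ K

# A_bar should be stable (all eigenvalues with negative real part)
# and Q_bar symmetric positive semi-definite.
eigvals = np.linalg.eigvals(A_bar)
```

For this choice of K the closed-loop matrix has a double eigenvalue at -1, so the Langevin system in Eq. (4) is stable.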

Monte-Carlo Policy Evaluation

Our main objective in policy evaluation is to estimate the expected cost from discrete-time observations. To this end, we choose a uniform discretization of the interval [0, T] with increment h, resulting in N = T/h time points t_k := kh for k ∈ {0, 1, …, N}. Here the estimation horizon T, with T < ∞ and T ≤ τ, is chosen by the practitioner (for simplicity, assume that T/h is an integer). With the N points sampled from one trajectory, a standard way to approximate the integral in Eq. (4) is the Riemann sum estimator

Ĵ(h) = Σ_{k=0}^{N−1} γ^{t_k} h X(t_k)^⊤ Q̄ X(t_k).

To estimate V_τ, we average M independent trajectories with cost estimates Ĵ_1, …, Ĵ_M to obtain the Monte-Carlo estimator

V̂_M(h) = (1/M) Σ_{i=1}^M Ĵ_i(h) = (1/M) Σ_{i=1}^M Σ_{k=0}^{N−1} γ^{t_k} h X(t_k)^⊤ Q̄ X(t_k).

Our main objective is to understand the mean-squared error of the Monte-Carlo estimator for a fixed system (specified by Ā, σ and Q̄), with the goal of informing an optimal choice of the step-size parameter h for a fixed data budget B = M · N. Note that one degree of freedom remains in choosing M and N. For simplicity, we require that in the finite-horizon setting the estimation grid covers the full episode [0, τ], which leads to the constraint T = τ = N · h. We write the mean-squared error surface as a function of h and B:

MSE_T(h, B) = E[( V̂_M(h) − V_T )^2].

In the infinite-horizon setting, i.e. τ = ∞, the estimation horizon T is a free variable chosen by the experimenter that determines the number of trajectories through M = B/N = Bh/T. The mean-squared error for the infinite-horizon setting is given as a function of h, B, and T:

MSE_∞(h, B, T) = E[( V̂_M(h) − V_∞ )^2]. (8)
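The estimator is easy to sketch in the scalar case: we simulate the Langevin dynamics with the exact one-step transition of the Ornstein-Uhlenbeck process and accumulate the left Riemann sum per episode. The parameter values (a = −1, σ = 1, T = 2) are illustrative only:

```python
import numpy as np

def mc_value_estimate(a, sigma, T, h, M, rng):
    """Monte-Carlo estimate of V_T = E[∫_0^T x(t)^2 dt] (q = 1, γ = 1)
    using the left Riemann sum over N = T/h points per trajectory."""
    N = int(round(T / h))
    # Exact discretization of dx = a x dt + σ dw (valid for a < 0):
    #   x_{k+1} = e^{ah} x_k + η_k,  η_k ~ N(0, σ²(e^{2ah} − 1)/(2a)).
    phi = np.exp(a * h)
    noise_std = sigma * np.sqrt((np.exp(2 * a * h) - 1.0) / (2 * a))
    x = np.zeros(M)   # all trajectories start at x(0) = 0
    J = np.zeros(M)   # per-episode Riemann sums Ĵ_i(h)
    for _ in range(N):
        J += h * x**2                 # left endpoints k = 0, ..., N-1
        x = phi * x + noise_std * rng.standard_normal(M)
    return J.mean()                   # V̂_M(h)

# Closed form for comparison: V_T = σ²/(2a) [ (e^{2aT} − 1)/(2a) − T ].
a, sigma, T = -1.0, 1.0, 2.0
V_T = sigma**2 / (2 * a) * ((np.exp(2 * a * T) - 1.0) / (2 * a) - T)
est = mc_value_estimate(a, sigma, T, h=0.01, M=2000,
                        rng=np.random.default_rng(0))
```

With h = 0.01 and M = 2000 episodes, the estimate lands close to the analytic V_T ≈ 0.755; the residual gap is exactly the bias-plus-variance error analyzed in Section 3.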

3. CHARACTERIZING THE MEAN-SQUARED ERROR

In the following, our goal is to characterize the mean-squared error of the Monte-Carlo estimator as a function of the step size h and the total data budget B (and the estimation horizon T in the infinite-horizon setting). Our results uncover a fundamental trade-off for choosing an optimal step size that leads to a minimal mean-squared error.

One-Dimensional Langevin Process. To simplify the exposition while preserving the main ideas, we first present the results for the 1-dimensional case. The analysis for the vector case exhibits the same qualitative behavior but is significantly more involved. To distinguish the 1-dimensional from the n-dimensional setting described in Eq. (4), we use lower-case symbols. Let x(t) ∈ R be the scalar state variable that evolves according to the Langevin equation

dx(t) = a x(t) dt + σ dw(t). (9)

Here, a ∈ R is the drift coefficient and w(t) is a Wiener process with scale parameter σ > 0. We assume that a ≤ 0, i.e. the system is stable (or marginally stable). The realized sample path in episode i = 1, …, M is x_i(t) (with starting state x(0) = 0) and t ∈ [0, T]. With the quadratic instantaneous cost r_i(t) = q x_i^2(t) for a fixed q > 0, the expected cost is

V_τ = E[ ∫_0^τ γ^t r_i(t) dt ] = ∫_0^τ γ^t q E[x_i^2(t)] dt.

The Riemann sum that approximates the cost realized in episode i ∈ [M] becomes Ĵ_i(h) = Σ_{k=0}^{N−1} h q x_i^2(kh). Given data from M episodes, the Monte-Carlo estimator is V̂_M(h) = (1/M) Σ_{i=1}^M Ĵ_i(h). Since the square of the cost parameter q^2 factors out of the mean-squared error, we set q = 1 in what follows.

3.1. FINITE-HORIZON

Recall that in the finite-horizon setting we set the system horizon τ and the estimation horizon T to be the same. This implies that the estimation grid covers the full episode, i.e. hN = T = τ. Perhaps surprisingly, the mean-squared error of the Riemann estimator for the Langevin system (9) can be computed in closed form. The result takes its simplest form in the finite-horizon, undiscounted setting, where γ = 1 and τ < ∞, and is summarized in the next theorem.

Theorem 1 (Finite-horizon, undiscounted MSE). In the finite-horizon, undiscounted setting, the mean-squared error of the Monte-Carlo estimator is

MSE_T(h, B) = E_1(h, T, a) + E_2(h, T, a)/B,

where

E_1(h, T, a) = σ^4 (−2ah + e^{2ah} − 1)^2 (e^{2aT} − 1)^2 / ( 16 a^4 (e^{2ah} − 1)^2 ),
E_2(h, T, a) = σ^4 T [ h (e^{2aT} − 1)(4e^{2ah} + e^{2aT} + 1) − (e^{2ah} − 1)(e^{2ah} + 4e^{2aT} + 1) T ] / ( 2a^2 (e^{2ah} − 1)^2 ).

The proof involves computing the closed-form expressions for the second and fourth moments of the random trajectories x_i(t) and is provided in Appendices B and C.1. While perhaps daunting at first sight, the beauty of the result is that it exactly characterizes the error surface as a function of the step size h and the budget B. In principle, for any fixed B, we can optimize h to minimize the mean-squared error by searching over possible step-sizes h_m = T/m for m = 1, …, B, provided knowledge of the system parameters a, σ and the fixed horizon T. On the other hand, the practical scope of this procedure is somewhat limited. On the upside, as we show next, the underlying trade-offs can be characterized and understood closely in several different regimes. In Section 4, we show through numerical experiments how these insights translate into simulations of linear and non-linear systems. In the case of marginal stability (a = 0), a simpler form of the MSE emerges that is easier to interpret. Taking the limit a → 0 of the previous expression gives the following result.

Corollary 1 (MSE for marginally stable system).
Assume a marginally stable system, a = 0. Then the mean-squared error of the Monte-Carlo estimator is

MSE_T(h, B) = (σ^4 T^2 / 4) h^2 + (σ^4 T^5 / 3) · 1/(hB) + σ^4 T^2 (−2T^2 + 2hT − h^2) / (3B).

The first part of the expression can be understood as a Riemann sum approximation error that is controlled by the h^2 term. The second part corresponds to the variance term that decreases with the number of episodes as 1/M = T/(Bh). The remaining terms are of lower order for small h and large B. For a fixed data budget B, the step size h can be chosen to balance these two terms (up to lower order terms in 1/B):

h*(B) := argmin_{h>0} MSE_T(h, B) ≈ T (2/(3B))^{1/3}.

From this, we can compute the optimal number of episodes M* ≈ Bh*/T = (2/3)^{1/3} B^{2/3}. We remark that under the assumption B ≫ 1, we also obtain M* ≫ 1. This is in agreement with the implicit requirement that h is big enough to cover at least one whole trajectory, i.e. h > T/B. Consequently, the mean-squared error for the optimal choice of h (up to lower order terms in 1/B) is

MSE_T(h*, B) ≈ (1/2)(3/2)^{1/3} σ^4 T^4 B^{−2/3}.

In other words, the optimal error rate as a function of the data budget is O(B^{−2/3}). A similar form of h* can be obtained for the general case a ≤ 0.

Corollary 2 (Optimal step size). For B ≫ 1, the optimal step-size (up to lower order terms in 1/B) is

h*(B) ≈ [ −T ( 4aT − e^{4aT} + e^{2aT}(8aT − 4) + 5 ) / ( a^2 (e^{2aT} − 1)^2 ) ]^{1/3} B^{−1/3}.

Moreover, MSE_T(h*, B) ≤ O(B^{−2/3}). The proof is provided in Appendix C.2, where we also include a more precise expression for h*.

Discounted Cost. Adding discounting (γ < 1) in the finite-horizon setting does not fundamentally change the results but makes them more involved (details in Appendix C.3).

Vector Case. Also in the vector case (n > 1) it is possible to exactly characterise the mean-squared error of the Monte-Carlo estimator for the Langevin system in Eq. (4).
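The balance behind Corollary 1 is easy to check numerically: evaluating the two leading terms of the a = 0 MSE over a grid of step-sizes recovers the closed-form minimizer h* = T(2/(3B))^{1/3}. The values T = 8 and B = 2^14 below are illustrative:

```python
import numpy as np

def mse_leading(h, B, T, sigma=1.0):
    """Leading terms of the a = 0 MSE from Corollary 1: approximation
    error (∝ h²) plus variance (∝ 1/(hB)); lower-order terms omitted."""
    return sigma**4 * T**2 / 4 * h**2 + sigma**4 * T**5 / (3 * h * B)

T, B = 8.0, 2**14
h_grid = np.linspace(0.01, 2.0, 20000)
h_numeric = h_grid[np.argmin(mse_leading(h_grid, B, T))]

# Closed-form minimizer from balancing the two terms.
h_closed = T * (2.0 / (3.0 * B)) ** (1.0 / 3.0)
```

The grid minimizer agrees with the closed form up to grid resolution; increasing B shifts the minimizer left as B^{-1/3}, exactly the behavior seen in Fig. 1(a).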
The closed-form computations, however, require the matrix Ā governing the behaviour of the system to be diagonalisable and stable. The latter is a rather mild assumption, as it is sufficient for the system in Eq. (1) to be controllable to ensure this condition can be satisfied. Controllability in fact translates into the possibility of freely adjusting the eigenvalues of the closed-loop matrix Ā through the choice of the controller K. This means that it is always possible to choose the eigenvalues to be distinct from each other, so that Ā is diagonalisable. The explicit form of the mean-squared error, although computable, is given by a long formula that is not easy to interpret, and is therefore deferred to Appendix D. The following theorem summarizes the result for the vector case in the form of a Taylor expansion for small h and large B.

Theorem 2 (Mean-squared error, vector case). Assume Ā is diagonalisable, with eigenvalues λ_1, …, λ_n. The mean-squared error of the Monte-Carlo estimator in the finite-horizon, undiscounted setting is

MSE_T(h, B) = E_1(h, T, λ_1, …, λ_n) + E_2(h, T, λ_1, …, λ_n)/B,

where

E_1(h, T, λ_1, …, λ_n) = ( C_1 + C_1(λ_1, …, λ_n) O(T) ) σ^4 T^2 h^2 + O(h^3), (13)
E_2(h, T, λ_1, …, λ_n)/B = ( C_2 + C_2(λ_1, …, λ_n) O(T) ) σ^4 T^5 / (hB) + O(1/B).

The proof, with the exact derivation of the constants C_1, C_1(λ_1, …, λ_n), C_2, C_2(λ_1, …, λ_n), can be found in Appendix D.1. Note that the terms composing the MSE are very similar to the ones obtained in the scalar analysis. Indeed, comparing them with the expressions in Eq. (28) and Eq. (29) (in Appendix C.2), the expression has the same order in h, B and T. The only difference is that in the vector case cumbersome eigenvalue-dependent constants are involved, whereas in the scalar case the result can be expressed more easily in terms of the system parameter a.
Since the optimal choice for h is given by balancing the trade-off between the two terms above, E_1 for the approximation error and E_2 for the variance term, its expression is analogous to the scalar case, as shown by the following corollary.

Corollary 3 (Optimal step size, vector case). Under the assumption that B ≫ 1, the optimal step-size for the vector case is given by

h*(B) = [ ( C_1 + C_1(λ_1, …, λ_n) O(T) ) / ( C_2 + C_2(λ_1, …, λ_n) O(T) ) ]^{1/3} T B^{−1/3} + o(B^{−1/3}). (15)

The constants in Corollary 3 are the same as in Theorem 2. General bounds for a vector Langevin process with a stable matrix Ā are provided in Appendix D.3. These results show that the mean-squared error lies between two expressions of the same order in h and B, whose difference depends only on T and the eigenvalues of the matrix. Both the lower and upper bounds are convex functions of h, narrowing down the behaviour of the step size in this general case. In particular, the lower bound can always be expressed in terms of the mean-squared error for the scalar case, emphasizing the importance of examining this special case. Although convexity is only proven for the case of a Langevin system, our experimental results (Section 4) exhibit a similar trade-off for general nonlinear stochastic systems. From the present analysis, it is possible to derive guidelines on how to set the step-size even for nonlinear and unknown dynamics. Although the sharp order in B for the optimal step-size holds only for linear dynamics, we empirically show in Section 4 that a similar trade-off carries over to nonlinear dynamics, and h = cT B^{−1/3} is a solid choice in the more general setting. While the constant c depends on the controlled dynamics (and therefore on both the free dynamics and the policy), c can be estimated with a small budget in order to properly scale the value of h for a large-scale experiment. This approach does not require knowledge of the dynamics beforehand, yet it provides a systematic way of setting the step size h for any given scenario.
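A practical consequence of the h = cTB^{−1/3} guideline: once a good step-size has been found with a pilot budget, it can be rescaled to any larger budget without re-running the search. The helper below is a hypothetical illustration (the pilot values `h0 = 0.2` and `B0 = 4096` are made up, not quantities from the analysis):

```python
def rescale_step_size(h0, B0, B):
    """Rescale a pilot step-size h0, tuned at budget B0, to a new budget B
    using the h ∝ B^{-1/3} relation suggested by the linear-system analysis."""
    return h0 * (B / B0) ** (-1.0 / 3.0)

# Example: a step-size tuned at B0 = 4096 should halve when the
# budget grows by a factor of 8 (since 8^{1/3} = 2).
h_pilot = rescale_step_size(0.2, 4096, 4096)
h_large = rescale_step_size(0.2, 4096, 8 * 4096)
```

This matches the behavior seen in Fig. 1(a), where the empirical minimizer shifts left as the budget grows.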

3.2. INFINITE-HORIZON SETTING

The main characteristic of the finite-horizon setting is the trade-off between approximation and estimation error. Recall that in the infinite-horizon setting (τ = ∞), the estimation horizon T < ∞ becomes a free variable chosen by the experimenter to define the measurement range [0, T]. Consequently, the mean-squared error of the Monte-Carlo estimator suffers an additional truncation error from using a finite Riemann sum with N = T/h terms as an approximation of the infinite integral that defines the cost V_∞. More precisely, we decompose the expected cost V_∞ = V_T + V_{T,∞}, where V_T = ∫_0^T γ^t E[x^2(t)] dt as before, and

V_{T,∞} = ∫_T^∞ γ^t E[x^2(t)] dt = ( σ^2 γ^T / 2a ) ( 1/log(γ) − e^{2aT}/(log(γ) + 2a) ). (16)

The integral is a direct calculation based on Lemma 1 in Appendix B. Thus the mean-squared error becomes

MSE_∞(h, B, T) = E[( V̂_M(h) − V_∞ )^2] = MSE_T(h, B) − 2 E[ V̂_M(h) − V_T ] V_{T,∞} + V_{T,∞}^2, (17)

where MSE_T(h, B) = E[( V̂_M(h) − V_T )^2] is the mean-squared error of the discounted finite-horizon setting. Note that the term V_{T,∞}^2 is controlled neither by a small step-size h nor by a large data budget B, and hence results in a truncation error from finite estimation. Fortunately, the geometric discounting ensures that V_{T,∞}^2 = O(γ^{2T}), which is not unexpected given that the term constitutes the tail of the geometric integral. In particular, setting T = c · log(B)/log(1/γ) for large enough c > 1 suffices to ensure that the truncation error is below the estimation variance. We summarize the result in the next theorem.

Theorem 3 (Infinite-horizon, discounted MSE). In the infinite-horizon, discounted setting, the mean-squared error of the Monte-Carlo estimator is

MSE_∞(h, B, T) = σ^4 T C(a, γ) · 1/(hB) + (σ^4/144) · h^4 + O(h^5) + O(B^{−1}), (18)

where we let C(a, γ) = 1/( log(γ)(a + log(γ))(2a + log(γ))^2 ) and assume that γ^T = o(h^4). It follows that the optimal choice for the step-size is h*(B, T) ≈ (36 T C(a, γ)/B)^{1/5}.
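The truncation term in Eq. (16) can be sanity-checked against a numerical quadrature of the tail integral, using E[x²(t)] = σ²(e^{2at} − 1)/(2a) for the process started at x(0) = 0. The parameter values are illustrative, and a finite upper limit of t = 200 stands in for the infinite one (the integrand is negligible there):

```python
import numpy as np

def v_tail_closed(a, sigma, gamma, T):
    """Closed form of V_{T,∞} = ∫_T^∞ γ^t E[x²(t)] dt for the scalar
    Langevin process, with E[x²(t)] = σ²(e^{2at} − 1)/(2a)."""
    lg = np.log(gamma)
    return sigma**2 * gamma**T / (2 * a) * (1.0 / lg
                                            - np.exp(2 * a * T) / (lg + 2 * a))

a, sigma, gamma, T = -1.0, 1.0, 0.9, 5.0
# Numerical check via the trapezoidal rule on a fine grid.
t = np.linspace(T, 200.0, 400001)
integrand = gamma**t * sigma**2 * (np.exp(2 * a * t) - 1.0) / (2 * a)
dt = t[1] - t[0]
v_numeric = float(np.sum((integrand[:-1] + integrand[1:]) / 2) * dt)
v_closed = float(v_tail_closed(a, sigma, gamma, T))
```

For these values the tail is essentially ∫_T^∞ γ^t σ²/(2|a|) dt, since e^{2at} decays much faster than γ^t, and both computations agree to high precision.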
The minimal mean-squared error is MSE_∞(h*, B, T) ≤ O( (T C(a, γ)/B)^{4/5} + γ^{2T} ). Lastly, we remark that if γ^T is treated as a constant, the cross term E[ V̂_M(h) − V_T ] V_{T,∞} in Eq. (17) introduces a dependence of order O(h γ^{2T}) in the mean-squared error. In this case, the overall trade-off becomes MSE_∞(h, B, T) ≈ O( 1/(hB) + γ^{2T}(1 + h) ), and the optimal step-size is h* ≈ B^{−1/2}.

Vector Case. As before, the mean-squared error for the vector case can be explicitly computed in closed form, assuming that Ā is diagonalizable. The result reflects the same behaviour as in the scalar case. Conveniently, the MSE in Theorem 3 has been expressed with sharp terms in h and B, while confining the dependence on the system parameter a to the constant C, and the impact of higher-order terms in T to V_{T,∞}. This allows us to state the same result for the vector case, in which the constant now depends on the eigenvalues of the matrix Ā as well as the discount factor γ. Full details are provided in Appendix D.2.

Corollary 4. For Ā diagonalisable, with eigenvalues λ_1, …, λ_n, the mean-squared error of the Monte-Carlo estimator in the infinite-horizon, discounted setting is

MSE_∞(h, B, T) = C_3(λ_1, …, λ_n, γ) σ^4 T / (hB) + (σ^4/144) h^4 + O(h^5) + O(B^{−1}),

under the assumption that γ^T = o(h^4). The different terms in Corollary 4 correspond to the estimation error, the approximation error and the truncation error, as in the scalar case. The optimal step-size choice exhibits the same dependence on T and B as in the scalar case, but with a different constant depending on the eigenvalues. Lastly, the general case of Langevin processes with a stable matrix Ā is discussed in Appendix D.3.

4. FROM LINEAR TO NON-LINEAR SYSTEMS: A NUMERICAL STUDY

The trade-off identified in our analysis suggests that there exists an optimal choice of the temporal resolution in policy evaluation. Our next goal is to verify the trade-off in several simulated dynamical systems. While our analysis assumes a linear transition and quadratic cost, we empirically demonstrate that such a trade-off also exists in nonlinear systems. For our experimental setup, we choose simple linear quadratic systems mirroring the setup of Section 2, as well as several standard benchmarks from OpenAI Gym (Brockman et al., 2016) and MuJoCo (Todorov et al., 2012) . Our findings confirm the theoretical results and highlight the importance of choosing an appropriate step-size for policy evaluation.

4.1. LINEAR QUADRATIC SYSTEMS

We first run numerical experiments on Langevin dynamical systems to demonstrate the trade-off analyzed in the previous section; the results are shown in Fig. 1. For all systems, we fix the parameters σ^2 = 1 and Q = I. The lines in the plots represent the sample mean of ( V̂_M(h) − V )^2 and the shading represents the standard error of the sample means. Each data point was averaged over 50 independent runs in the scalar case and 40 in the vector case. We observe a clear trade-off in all plots of Fig. 1. Fig. 1(a) shows the MSE in a one-dimensional system with T = 8 and a = −1. The ground truth V is calculated analytically using Eq. (27). The figure illustrates how the error changes as we vary the data budget, B ∈ {2^12, 2^13, 2^14, 2^15, 2^16}, and also illustrates the improvement that can be obtained by increasing the budget. As we increase B, both the error and the optimal step size h* decrease. This result aligns with the analysis in Theorem 1 and Corollary 2. The objective is minimized when h is chosen appropriately. Fig. 1(b) and 1(c) show the experimental results for undiscounted finite-horizon and discounted infinite-horizon multi-dimensional systems, respectively. Calculating the ground truth V for our multi-dimensional systems is more involved than in the scalar case. To calculate V for the undiscounted finite-horizon system, we numerically solve the Riccati differential equation using backwards induction, as is standard practice. To calculate V for the discounted infinite-horizon system, we solve the Lyapunov equation using a standard solver. Note that in our experiments the dimension is n = 3. We fix all parameters and run our experiments on the systems A = cI_3 where c ∈ {−0.2, −0.5, −1, −2, −4}, producing stable systems with different eigenvalues.
Results in both plots suggest that the impact of the eigenvalues of A on h* is mild, and that the eigenvalue-dependent constant terms in Corollary 3 of our vector analysis do not significantly affect the optimal step-size h*. The eigenvalues do influence the values of the MSE achieved in each system: the MSE decreases as the magnitude of the eigenvalues increases. In the infinite-horizon system, the horizon needs to be large enough to manage the truncation error while simultaneously being small enough that we can run multiple rollouts. We choose γ large enough such that we can learn a good estimate of V. We set T = 1/(1 − γ), commonly referred to as the effective horizon in the RL literature.

Figure 1: Mean-squared error trade-off in linear quadratic systems of different dimension n. The left-most plot (a: n = 1) shows the dependence of the optimal step-size on the data budget; as expected, the optimal step-size decreases with more data. The middle and right plots (b: n = 3, finite horizon, T = 2, B = 4096; c: n = 3, infinite horizon, T = 20, B = 16384, γ = 0.95) show the MSE for drift matrices A = cI_3 with c ∈ {−0.2, −0.5, −1, −2, −4}. Note that the optimal step-size h exhibits only a mild dependence on the scale of A.
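For the vector systems, a ground-truth V_T can also be obtained by integrating the state covariance Σ(t), which obeys the Lyapunov ODE Σ′(t) = ĀΣ + ΣĀ^⊤ + σ²I, and accumulating tr(Q̄ Σ(t)). A sketch using simple forward-Euler integration (the choice Ā = −I_3, Q̄ = I is illustrative), checked against the scalar closed form, which for Ā = aI gives V_T = n · σ²/(2a)[(e^{2aT} − 1)/(2a) − T]:

```python
import numpy as np

def value_via_lyapunov(A_bar, Q_bar, sigma, T, dt=1e-4):
    """Ground-truth undiscounted finite-horizon cost V_T = ∫_0^T tr(Q̄ Σ(t)) dt,
    with the covariance Σ(t) integrated by forward Euler from Σ(0) = 0."""
    n = A_bar.shape[0]
    Sigma = np.zeros((n, n))
    V = 0.0
    for _ in range(int(round(T / dt))):
        V += np.trace(Q_bar @ Sigma) * dt
        Sigma = Sigma + dt * (A_bar @ Sigma + Sigma @ A_bar.T
                              + sigma**2 * np.eye(n))
    return V

a, sigma, T, n = -1.0, 1.0, 2.0, 3
V_vec = value_via_lyapunov(a * np.eye(n), np.eye(n), sigma, T)
# For Ā = aI, Σ(t) is a multiple of I and V_T is n copies of the scalar value.
V_scalar = sigma**2 / (2 * a) * ((np.exp(2 * a * T) - 1.0) / (2 * a) - T)
```

The Euler scheme is a stand-in for the Riccati/Lyapunov solvers mentioned above; a finer dt (or a proper ODE solver) tightens the agreement further.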

4.2. NONLINEAR SYSTEMS

We empirically show that the trade-off identified in linear quadratic systems carries over to nonlinear systems with more complex cost functions. We demonstrate it in several simulated nonlinear systems from OpenAI Gym (Brockman et al., 2016), including Pendulum, BipedalWalker and six MuJoCo (Todorov et al., 2012) environments: InvertedDoublePendulum, Pusher, Swimmer, Hopper, HalfCheetah and Ant. We note that the original environments all have a fixed temporal discretization δt, pre-chosen by the designer. To measure the effect of h, we first modify all environments to run at a small discretization δt = 0.001 as a proxy for the underlying continuous-time systems. We train a nonlinear policy parameterized by a neural network for each system using the algorithm DAU (Tallec et al., 2019). This policy is used to gather episode data from the continuous-time system proxy at intervals of δt = 0.001, which are then down-sampled for different h based on the ratio of h and δt. This allows us to handle all environments uniformly, yielding the behaviour of the MSE with respect to the sampling time h over the same interval. The policy is stable in the sense that it produces reasonable behavior (e.g., the pendulum stays mostly upright, Ant walks forward) and does not cause early termination of episodes (e.g., BipedalWalker does not fall) in the continuous-time system proxy. The results for the MSE of Monte-Carlo value estimation are shown in Fig. 2. As in the linear systems case, we vary the data budget B and observe how the MSE changes with the discretization h. We slightly abuse notation by using V, V̂ to refer to the true and estimated sum of rewards instead of the cost. The true value of V is approximated by averaging the sum of rewards observed at δt = 0.001 over 150k episodes. These environments fall under the finite-horizon undiscounted setting.
The system (and estimation) horizon T of our experiments is chosen to be the physical time of 1k steps under the default δt of the original environments (200 steps for Pendulum and 500 steps for BipedalWalker). Please refer to Appendix F for more details on the setup, including B, T, δt and h. In Fig. 2, the line and shaded region denote the sample mean and its standard error of ( V̂_M(h) − V )^2 over 30 random runs; T is the horizon in physical time (seconds), and B_0 denotes the environment-dependent base sample budget, chosen such that it gives a full episode for the smallest h (see Appendix F). In almost all environments, the optimal step-size depends on the data budget (with InvertedDoublePendulum-v2 being the only exception). In particular, the MSE as a function of h shows a clear minimum at the optimal step-size, which generally decreases as the data budget increases. These systems are stochastic in the starting state while having deterministic dynamics. Despite these differences from our analytical setting, a clear trade-off is evident in all systems.
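The down-sampling step described above amounts to striding through the finely-logged rollout and forming the Riemann sum at the coarser resolution. A minimal sketch (the constant reward array and the δt value are placeholders for logged episode data, not the actual environment rewards):

```python
import numpy as np

def downsample_return(rewards, dt, h):
    """Riemann-sum return estimate at resolution h, computed by striding
    through rewards logged at the fine base resolution dt
    (h is assumed to be an integer multiple of dt)."""
    stride = int(round(h / dt))
    return float(h * np.sum(rewards[::stride]))

# Placeholder rollout: constant reward 1.0 logged at δt = 0.001 for T = 2 s.
dt, T = 0.001, 2.0
rewards = np.ones(int(round(T / dt)))
# For a constant reward signal the return estimate is h-independent,
# which makes it a convenient correctness check for the striding logic.
r_fine = downsample_return(rewards, dt, h=0.001)
r_coarse = downsample_return(rewards, dt, h=0.05)
```

With real reward signals, coarser h introduces the Riemann-sum bias analyzed in Section 3, while permitting more episodes within the same budget.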

Optimal Step-Size in Nonlinear Systems

Fig. 3 plots the empirical h* over B for all nonlinear environments, together with fitted lines based on the relation h* = cT B^{−1/3} for finite-horizon undiscounted linear systems described in Corollary 3. The plot shows that the analytical trade-off for linear systems is observed approximately even in the non-linear experiments. The constant c depends on the system parameters (and the policy) and, as expected, varies with the environment. In our experiments, c ranges from 0.02 to 0.2, which can serve as a starting point for optimizing the step-size in other experiments.

5. CONCLUSION

We provide a precise characterization of the approximation, estimation and truncation errors incurred by Monte-Carlo value estimation in a Langevin dynamical system with quadratic cost. Our analysis reveals a fundamental bias-variance trade-off, modulated by the level of temporal discretization h. In a second step, we confirm in numerical simulations that the analysis captures the trade-off in a precise, quantitative manner. In particular, we show that the trade-off carries over to nonlinear environments such as the popular MuJoCo physics simulator. Our findings show that managing the temporal discretization level h can greatly improve the quality of value estimation under a fixed data budget B. This has implications for practitioners, as in most environments that we encountered the step-size is pre-set and rarely changed. There are several interesting directions for future work, including policy optimization and other value estimation techniques such as temporal-difference learning and system identification. Another direction is to extend the analysis to nonlinear systems via local linearization.

A WARM-UP: THE RIEMANN SUM APPROXIMATION

The Riemann sum approximation is a standard argument that we reproduce here for completeness. Let $g : [0, T] \to \mathbb{R}$ be continuously differentiable. Assume that we wish to approximate the integral $\int_0^T g(t)\,dt$ using the Riemann sum over $N = T/h$ elements, $\sum_{i=0}^{N-1} h\, g(ih)$. The difference is readily computed up to first order as follows:
$$D = \int_0^T g(t)\,dt - \sum_{i=0}^{N-1} h\, g(ih) = \sum_{i=0}^{N-1} \int_{ih}^{(i+1)h} \big(g(t) - g(ih)\big)\,dt \le \sum_{i=0}^{N-1} \int_0^h \big(g'(ih)\,t + O(t^2)\big)\,dt = \frac{1}{2}\sum_{i=0}^{N-1} g'(ih)\,h^2 + O(h^3).$$
A naive bound is obtained as $D \le \frac{1}{2} N h^2 \|g'\|_\infty + O(N h^3)$. Translated to a squared error, this explains the dependency $O(N^2 h^4) = O(T^2 h^2)$. In the case of discounting, let $g(t) = \gamma^t f(t)$, so that $g'(t) = \gamma^t \big(f(t)\log\gamma + f'(t)\big)$. Hence, the previous display leads to the bound
$$D \le \frac{1}{2}\sum_{i=0}^{N-1} \gamma^{ih}\, h^2\, \|f(t)\log\gamma + f'(t)\|_\infty + O(h^3) = \frac{h^2\big(1 - \gamma^{Nh}\big)\,\|f(t)\log\gamma + f'(t)\|_\infty}{2\big(1 - \gamma^h\big)} + O(h^3).$$
Overall, the squared error is now $O\big(h^4(1-\gamma^T)^2/(1-\gamma^h)^2\big) = O\big(h^2/\log(1/\gamma)^2\big)$. Note that this alone does not explain the improvement of the order from $h^2$ to $h^4$, which requires that $f(t)$ also decays fast enough.
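The $O(T^2 h^2)$ squared-error scaling can be illustrated numerically: halving h should cut the squared error of the left Riemann sum by roughly a factor of four. A small sketch with a test function of our own choosing:

```python
import math

def riemann_sq_error(g, T, h):
    """Squared error of the left Riemann sum of g on [0, T] with step h."""
    N = int(round(T / h))
    riemann = h * sum(g(i * h) for i in range(N))
    # Fine-grained reference value for the integral (step h/100).
    m = 100 * N
    reference = (T / m) * sum(g(i * T / m) for i in range(m))
    return (riemann - reference) ** 2

g = math.exp          # smooth test function with bounded g' on [0, T]
T = 1.0
e1 = riemann_sq_error(g, T, 0.01)
e2 = riemann_sq_error(g, T, 0.005)
ratio = e1 / e2       # expected to be close to 4
```

The observed ratio is close to 4, matching the $h^2$ scaling of the squared approximation error for fixed T.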

B MOMENT CALCULATIONS

Recall that the solution of the SDE in Eq. (9), with $x(0) = 0$, takes the following form:
$$x(t) = \sigma \int_0^t e^{a(t-s)}\, dw(s). \quad (20)$$
An integral part of finding the mean-squared error of the Monte-Carlo estimator is the computation of the moments $\mathbb{E}[x(t)^2]$, $\mathbb{E}[x(t)^4]$ and $\mathbb{E}[x(s)^2 x(t)^2]$ when $s \le t$.

Lemma 1. Let $x(t)$ be the solution of Eq. (9). The second moment of the state variable is
$$\mathbb{E}[x^2(t)] = \frac{\sigma^2}{2a}\big(e^{2at} - 1\big).$$
For the fourth moment, we get:
$$\mathbb{E}[x(t)^4] = \frac{3\sigma^4}{4a^2}\big(e^{2at} - 1\big)^2. \quad (22)$$
Assuming that $s \le t$, we further get:
$$\mathbb{E}[x^2(s)\, x^2(t)] = \frac{\sigma^4}{4a^2}\big(e^{2as} - 1\big)e^{2at}\Big[\big(e^{-2as} - e^{-2at}\big) + 3\big(1 - e^{-2as}\big)\Big].$$

Proof. (1) We start with the second moment $\mathbb{E}[x(t)^2]$:
$$\mathbb{E}[x(t)^2] = \sigma^2 e^{2at}\, \mathbb{E}\Big[\Big(\int_0^t e^{-as}\, dw(s)\Big)^2\Big] = \sigma^2 e^{2at} \int_0^t e^{-2as}\, ds = \frac{\sigma^2}{2a}\big(e^{2at} - 1\big).$$
The calculation makes use of the Itô isometry, which can be stated as
$$\mathbb{E}\Big[\Big(\int_0^t z(s)\, dw(s)\Big)^2\Big] = \mathbb{E}\Big[\int_0^t z(s)^2\, ds\Big],$$
for any stochastic process $z(\cdot)$ adapted to the filtration induced by the Wiener process $w(\cdot)$.
(2) Next we compute $\mathbb{E}[x(t)^4]$ through Itô's formula. Define $y(t) := \int_0^t e^{-au}\, dw(u)$, so that $dy(t) = e^{-at}\, dw(t)$. Thus,
$$df(y(t)) = f'(y(t))\, dy(t) + \tfrac{1}{2} f''(y(t))\,(dy(t))^2 = f'(y(t))\, e^{-at}\, dw(t) + \tfrac{1}{2} f''(y(t))\, e^{-2at}\, dt,$$
for any smooth $f(\cdot)$. Therefore, by integration and taking the expectation,
$$\mathbb{E}[f(y(t))] = \mathbb{E}\Big[\int_0^t f'(y(u))\, e^{-au}\, dw(u)\Big] + \frac{1}{2}\,\mathbb{E}\Big[\int_0^t f''(y(u))\, e^{-2au}\, du\Big].$$
By choosing $f(y) = y^4$, with $f'(y) = 4y^3$ and $f''(y) = 12y^2$, the martingale term vanishes and
$$\mathbb{E}[y(t)^4] = 6 \int_0^t e^{-2au}\, \mathbb{E}[y(u)^2]\, du = \frac{3}{a}\int_0^t e^{-2au}\big(1 - e^{-2au}\big)\, du = \frac{3}{4a^2}\big(1 - e^{-2at}\big)^2.$$
From Eq. (20) it holds that $x(t) = \sigma e^{at} y(t)$, so the second part of the lemma follows.
(3) Lastly, we compute $\mathbb{E}[x(s)^2\, x(t)^2]$ for $s \le t$. Write $y(t) = y(s) + \int_s^t e^{-au}\, dw(u)$, where the two terms are independent, so that $\mathbb{E}[y(s)^2 y(t)^2]$ decomposes into (i) $\mathbb{E}[y(s)^4]$ and (ii) $\mathbb{E}[y(s)^2]\,\mathbb{E}\big[\big(\int_s^t e^{-au}\, dw(u)\big)^2\big]$. Note that we computed (i) before. For (ii), by the Itô isometry it holds that
$$\mathbb{E}\Big[\Big(\int_0^s e^{-au}\, dw(u)\Big)^2\Big] = \int_0^s e^{-2au}\, du = \frac{1}{2a}\big(1 - e^{-2as}\big) \quad\text{and}\quad \mathbb{E}\Big[\Big(\int_s^t e^{-au}\, dw(u)\Big)^2\Big] = \int_s^t e^{-2au}\, du = \frac{1}{2a}\big(e^{-2as} - e^{-2at}\big).$$
Therefore, assuming $s \le t$, it holds that
$$\mathbb{E}[x^2(s)\, x^2(t)] = \sigma^4 e^{2a(s+t)}\Big[\frac{1}{4a^2}\big(1 - e^{-2as}\big)\big(e^{-2as} - e^{-2at}\big) + \frac{3}{4a^2}\big(1 - e^{-2as}\big)^2\Big] = \frac{\sigma^4}{4a^2}\big(e^{2as} - 1\big)e^{2at}\Big[\big(e^{-2as} - e^{-2at}\big) + 3\big(1 - e^{-2as}\big)\Big].$$
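The Itô-formula identity for the fourth moment can be checked by numerical quadrature (our own sketch; parameter values are arbitrary):

```python
import math

def y_fourth_moment_quadrature(a, t, n=200000):
    """E[y(t)^4] via the Ito-formula identity
    E[y(t)^4] = 6 * integral_0^t exp(-2au) * E[y(u)^2] du,
    with E[y(u)^2] = (1 - exp(-2au)) / (2a), using the midpoint rule.
    """
    du = t / n
    total = 0.0
    for i in range(n):
        u = (i + 0.5) * du
        total += math.exp(-2 * a * u) * (1 - math.exp(-2 * a * u)) / (2 * a)
    return 6 * total * du

def y_fourth_moment_closed_form(a, t):
    """Closed form from Lemma 1: (3 / 4a^2) * (1 - exp(-2at))^2."""
    return 3 / (4 * a ** 2) * (1 - math.exp(-2 * a * t)) ** 2

a, t = 1.0, 0.7
num = y_fourth_moment_quadrature(a, t)
ref = y_fourth_moment_closed_form(a, t)
```

The two values agree to high precision, confirming the integration step in part (2) of the proof.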

C CALCULATIONS OF THE MEAN-SQUARED ERROR

C.1 UNDISCOUNTED, FINITE-HORIZON: PROOF OF THEOREM 1

Proof. We first note that
$$\mathbb{E}[\hat{V}_M(h)] = \frac{h}{M}\sum_{i=1}^M \sum_{k=0}^{N-1} \mathbb{E}[x_i^2(kh)] = h \sum_{k=0}^{N-1} \mathbb{E}[x^2(kh)],$$
where we denote $x(t) = x_1(t)$ for simplicity. Next we expand the mean-squared error:
$$\mathbb{E}\big[(\hat{V}_M(h) - V_T)^2\big] = \mathbb{E}\big[\hat{V}_M^2(h)\big] - 2 V_T\, \mathbb{E}[\hat{V}_M(h)] + V_T^2 = \frac{h^2}{M^2}\,\mathbb{E}\Big[\Big(\sum_{i=1}^M \sum_{k=0}^{N-1} x_i^2(kh)\Big)^2\Big] - 2 V_T\, \mathbb{E}[\hat{V}_M(h)] + V_T^2$$
$$= \frac{h^2}{M^2} \sum_{i,j=1}^M \sum_{k,l=0}^{N-1} \mathbb{E}\big[x_i^2(kh)\, x_j^2(lh)\big] - 2 V_T\, \mathbb{E}[\hat{V}_M(h)] + V_T^2 = \frac{h^2}{M}\sum_{k,l=0}^{N-1} \mathbb{E}\big[x^2(kh)\, x^2(lh)\big] + \frac{M^2 - M}{M^2}\,\mathbb{E}[\hat{V}_M(h)]^2 - 2 V_T\, \mathbb{E}[\hat{V}_M(h)] + V_T^2.$$
For the last equality, note that $\mathbb{E}[\hat{V}_M(h)]^2 = h^2 \sum_{k,l=0}^{N-1} \mathbb{E}[x^2(kh)]\,\mathbb{E}[x^2(lh)]$. It remains to compute the individual expressions. By Lemma 1 we have for the second moment of the state variable:
$$\mathbb{E}[x^2(t)] = \frac{\sigma^2}{2a}\big(e^{2at} - 1\big). \quad (25)$$
Assuming that $s \le t$, from the same lemma we get the following for the fourth moments:
$$\mathbb{E}\big[x^2(s)\, x^2(t)\big] = \frac{\sigma^4}{4a^2}\big(e^{2as} - 1\big)e^{2at}\Big[\big(e^{-2as} - e^{-2at}\big) + 3\big(1 - e^{-2as}\big)\Big]. \quad (26)$$
Note that by symmetry, a similar expression follows for $s \ge t$. Using these expressions, for the expected cost we get
$$V_T = \int_0^T \mathbb{E}[x^2(t)]\, dt = \frac{\sigma^2}{2a}\int_0^T \big(e^{2at} - 1\big)\, dt = \frac{\sigma^2}{2a}\Big(\frac{e^{2aT} - 1}{2a} - T\Big).$$
We remark that a similar expression was previously obtained in (Bijl et al., 2016, Theorem 3). Next, the expected estimated cost is
$$\mathbb{E}[\hat{V}_M(h)] = h \sum_{k=0}^{N-1} \mathbb{E}[x^2(kh)] = \frac{\sigma^2 h}{2a}\sum_{k=0}^{N-1}\big(e^{2akh} - 1\big) = \frac{\sigma^2 h}{2a}\Big(\frac{1 - e^{2aT}}{1 - e^{2ah}} - N\Big).$$
Lastly, it remains to compute the sum
$$\frac{h^2}{M}\sum_{k,l=0}^{N-1} \mathbb{E}\big[x^2(kh)\, x^2(lh)\big] = \frac{2h^2}{M}\sum_{k<l} \mathbb{E}\big[x^2(kh)\, x^2(lh)\big] + \frac{h^2}{M}\sum_{k=0}^{N-1} \mathbb{E}\big[x^4(kh)\big]$$
$$= \sigma^4 T\,\frac{h^2\big(e^{2aT} - 1\big)\big(8e^{2ah} + 3e^{2aT} + 1\big) + T^2\big(e^{2ah} - 1\big)^2 - 2hT\big(e^{2ah} - 1\big)\big(e^{2ah} + 5e^{2aT}\big)}{4a^2\, Bh\,\big(e^{2ah} - 1\big)}.$$
The last equality is a cumbersome calculation that involves nested geometric sums. We verified the result using symbolic computation. For reference, we provide the notebooks containing all calculations in the supplementary material. It remains to collect all terms to get the final result.
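The two closed forms above for $V_T$ and $\mathbb{E}[\hat{V}_M(h)]$ can be compared directly; the gap between them is the (non-squared) approximation bias, which shrinks roughly linearly in h. A minimal sketch, using a stable scalar drift of our own choosing:

```python
import math

def expected_cost(a, sigma, T):
    """V_T = (sigma^2 / 2a) * ((e^{2aT} - 1) / (2a) - T)."""
    return sigma ** 2 / (2 * a) * ((math.exp(2 * a * T) - 1) / (2 * a) - T)

def expected_estimate(a, sigma, T, h):
    """E[V_hat_M(h)] = (sigma^2 h / 2a) * ((1 - e^{2aT}) / (1 - e^{2ah}) - N),
    with N = T / h."""
    N = int(round(T / h))
    return sigma ** 2 * h / (2 * a) * (
        (1 - math.exp(2 * a * T)) / (1 - math.exp(2 * a * h)) - N
    )

a, sigma, T = -1.0, 1.0, 5.0     # stable scalar drift (a < 0)
vt = expected_cost(a, sigma, T)
bias_coarse = abs(expected_estimate(a, sigma, T, 0.1) - vt)
bias_fine = abs(expected_estimate(a, sigma, T, 0.01) - vt)
```

The bias at h = 0.01 is roughly ten times smaller than at h = 0.1, consistent with the O(h) left-Riemann bias of the estimator's mean.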

C.2 UNDISCOUNTED, FINITE-HORIZON: STEP SIZE

Although the exact optimal step size $h^*$ can in principle be obtained from Theorem 1, it does not admit an explicit closed-form solution. The order of $h^*$ in terms of $B$, $a$, $T$ can nevertheless be obtained by identifying the dominant terms via a Taylor expansion of the exponential parts (valid for any $h$) in Theorem 1. A proof of Corollary 2 is given as follows.

Proof. From Theorem 1, we compute the leading terms in $h$ of the mean-squared error:
$$E_1(h, T, a) = \frac{\sigma^4\big(e^{2aT} - 1\big)^2}{16a^2}\, h^2 + O(h^3), \quad (28)$$
$$\frac{E_2(h, T, a)}{B} = -\frac{\sigma^4 T\big(4aT - e^{4aT} + e^{2aT}(8aT - 4) + 5\big)}{8a^4}\cdot\frac{1}{hB} + \frac{\sigma^4 T\big(1 - e^{4aT} + 4aT e^{2aT}\big)}{4a^3 B} - \frac{\sigma^4 T h\big(1 + 4aT + e^{2aT}(8aT + 4) - 5e^{4aT}\big)}{24a^2 B} - \frac{\sigma^4 T h^2\big(e^{4aT} - 1\big)}{12aB} + O(h^3/B). \quad (29)$$
When $h \ge 1$, both Eq. (28) and Eq. (29) blow up and increase exponentially in $h$; thus a small $h < 1$ is considered to minimize $E_1(h, T, a) + E_2(h, T, a)/B$. Keeping the first term in both Eq. (28) and Eq. (29) and solving for the optimal $h^*$ yields the result.

A more precise approximation of $h^*$ than Corollary 2 is a minimizer of $E_1(h, T, a) + E_2(h, T, a)/B$ truncated at $O(h^3)$:
$$h^*(a, T, B) = \frac{D_1}{3D_3} + \Big(\frac{D_1^3}{27 D_3^3} - \frac{3D_2}{2a^2 D_3} - \sqrt{\frac{9D_2^2}{4a^4 D_3^2} - \frac{D_1^3 D_2}{9a^2 D_3^4}}\Big)^{1/3} + \Big(\frac{D_1^3}{27 D_3^3} - \frac{3D_2}{2a^2 D_3} + \sqrt{\frac{9D_2^2}{4a^4 D_3^2} - \frac{D_1^3 D_2}{9a^2 D_3^4}}\Big)^{1/3}, \quad (30)$$
where
$$D_1 = T\big(1 + 4aT + e^{2aT}(8aT + 4) - 5e^{4aT}\big), \quad D_2 = T\big(4aT - e^{4aT} + e^{2aT}(8aT - 4) + 5\big), \quad D_3 = 3B\big(e^{2aT} - 1\big)^2 - 4aT\big(e^{4aT} - 1\big).$$
We can further expand Eq. (30) in terms of $B$, as
$$h^*(B) = \Big(-\frac{T\big(4aT - e^{4aT} + e^{2aT}(8aT - 4) + 5\big)}{a^2\big(e^{2aT} - 1\big)^2}\Big)^{1/3} B^{-1/3} + \frac{T\big(1 + 4aT + e^{2aT}(8aT + 4) - 5e^{4aT}\big)}{9\big(e^{2aT} - 1\big)^2}\, B^{-1} + \frac{4aT\big(e^{2aT} + 1\big)\, D_2^{1/3}}{9 a^{2/3}\big(e^{2aT} - 1\big)^{5/3}}\, B^{-4/3} + \frac{4aT^2\big(e^{2aT} + 1\big)\, D_1}{27\big(e^{2aT} - 1\big)^3}\, B^{-2} + O(B^{-7/3}),$$
where the first term is exactly the result in Corollary 2.
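The leading (Corollary 2) term of $h^*(B)$ is straightforward to evaluate; in particular, increasing the budget by a factor of 8 should halve the predicted optimal step size. A small sketch (helper name is ours; the parameter choice $a<0$ is an example for which the bracket is positive):

```python
import math

def h_star_leading(a, T, B):
    """Leading term of the optimal step size from Corollary 2:
    h* ~ (-T * (4aT - e^{4aT} + e^{2aT}(8aT - 4) + 5)
          / (a^2 * (e^{2aT} - 1)^2))^(1/3) * B^(-1/3).
    """
    num = -T * (4 * a * T - math.exp(4 * a * T)
                + math.exp(2 * a * T) * (8 * a * T - 4) + 5)
    den = a ** 2 * (math.exp(2 * a * T) - 1) ** 2
    return (num / den) ** (1.0 / 3.0) * B ** (-1.0 / 3.0)

h1 = h_star_leading(-1.0, 2.0, 1000)
h2 = h_star_leading(-1.0, 2.0, 8000)   # 8x the budget -> half the step size
```

This makes the $B^{-1/3}$ dependence, used for the fitted lines in Fig. 3, directly visible.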

C.3 FINITE-HORIZON, DISCOUNTED

As stated in Section 3.1, adding discounting in the finite-horizon setting makes the mean-squared error more involved. In the regime where $h$ is small and $B$ is large, a Taylor expansion characterizes the error surface as follows:
$$\mathrm{MSE}_T(h, B, \gamma) \approx \frac{\sigma^4 T}{\log(\gamma)\big(a + \log(\gamma)\big)\big(2a + \log(\gamma)\big)^2}\cdot\frac{1}{hB} + \frac{\sigma^4 \gamma^{2T}\big(e^{2aT} - 1\big)^2}{16a^2}\cdot h^2 + \frac{\sigma^4\gamma^T\big(e^{2aT} - 1\big)\big(\gamma^T e^{2aT}(2a + \log\gamma) - \log\gamma - 2a\big)}{48a^2}\cdot h^3 + \frac{\sigma^4}{144}\cdot h^4. \quad (31)$$
The approximation shows only the lowest-order terms for $1/(hB)$, $\gamma^T$ and $h$. The derivation is given in Lemma 2 below. The result shows that the main trade-off between $h$ and $B$ persists also for the discounted objective, as long as $\gamma^T$ is treated as a constant relative to $h^2$ and $1/(hB)$. In the limit where $\gamma^T$ becomes small (e.g. $\gamma^T = o(h^4)$), the nature of the trade-off changes in that the approximation error improves to $O(h^4)$. This can be understood from the fact that under geometric discounting combined with a decaying process, the sum of $N = T/h$ estimation errors does not suffer a factor $N$, thereby removing a factor of $1/h$ from the (non-squared) approximation error (see Appendix A for a more detailed explanation).

Lemma 2 (Finite-horizon, discounted). In the finite-horizon setting with discount factor $\gamma \in (0, 1]$, the mean-squared error of the Monte-Carlo estimator is
$$\mathrm{MSE}_T(h, B, \gamma) = E_1(h, T, a, \gamma) + \frac{E_2(h, T, a, \gamma)}{B},$$
where
$$E_1(h, T, a, \gamma) = C_1(T, \gamma, a)\,\sigma^4 h^2 + C_2(T, \gamma, a)\,\sigma^4 h^3 + \Big(\frac{1}{144} + C_3(T, \gamma, a)\Big)\sigma^4 h^4 + O(h^5),$$
$$E_2(h, T, a, \gamma) = \frac{\sigma^4 T + \gamma^T C_4(T, \gamma, a)}{\log(\gamma)\big(a + \log(\gamma)\big)\big(2a + \log(\gamma)\big)^2\, h} + \gamma^T O(1),$$
$$C_1(T, \gamma, a) = \frac{\gamma^{2T}\big(e^{2aT} - 1\big)^2}{16a^2}, \quad C_2(T, \gamma, a) = \frac{\gamma^T\big(e^{2aT} - 1\big)\big(\gamma^T e^{2aT}(2a + \log(\gamma)) - \log(\gamma) - 2a\big)}{48a^2},$$
$$C_3(T, \gamma, a) = \frac{\gamma^T\Big(\gamma^T\big(e^{2aT}(2a + \log(\gamma)) - \log(\gamma)\big)^2 - 4a\big(e^{2aT}(2a + \log(\gamma)) - \log(\gamma)\big)\Big)}{576a^2},$$
and $C_4(T, \gamma, a)$ is some finite constant of $(T, \gamma, a)$ that includes $\gamma^T$ as a factor.

Proof. The proof follows the same computations as the previous proof, with the new expected cost computed below.
In particular, using Lemma 1, we get
$$V_T = \int_0^T \gamma^t\, \mathbb{E}[x^2(t)]\, dt = \frac{\sigma^2}{2a}\Big(\frac{\gamma^T e^{2aT} - 1}{\log(\gamma) + 2a} - \frac{\gamma^T - 1}{\log(\gamma)}\Big).$$
Furthermore, the expected estimated cost is
$$\mathbb{E}[\hat{V}_M(h)] = \frac{\sigma^2 h}{2a}\sum_{k=0}^{N-1} \gamma^{kh}\big(e^{2akh} - 1\big) = \frac{\sigma^2 h}{2a}\Big(\frac{1 - \gamma^T e^{2aT}}{1 - \gamma^h e^{2ah}} - \frac{1 - \gamma^T}{1 - \gamma^h}\Big).$$
Finally, the sum containing the fourth-order cross-moments is
$$\frac{h^2}{M}\sum_{k,l=0}^{N-1} \gamma^{kh + lh}\, \mathbb{E}\big[x^2(kh)\, x^2(lh)\big] = \frac{2h^2}{M}\sum_{k<l} \gamma^{kh + lh}\, \mathbb{E}\big[x^2(kh)\, x^2(lh)\big] + \frac{h^2}{M}\sum_{k=0}^{N-1} \gamma^{2kh}\, \mathbb{E}\big[x^4(kh)\big].$$
While not impossible to calculate by hand, a written derivation is beyond the scope of this work. Instead, we rely on symbolic computation to obtain the expression and the corresponding Taylor approximations. The notebooks containing all derivations are provided in the supplementary material.
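The discounted expected cost above can be cross-checked against direct numerical integration of $\gamma^t\,\mathbb{E}[x^2(t)]$ (a sketch of our own; parameter values are arbitrary, with $a < 0$ and $\gamma < 1$):

```python
import math

def discounted_cost_closed_form(a, sigma, gamma, T):
    """V_T = (sigma^2 / 2a) * ((gamma^T e^{2aT} - 1) / (log(gamma) + 2a)
                               - (gamma^T - 1) / log(gamma))."""
    lg = math.log(gamma)
    return sigma ** 2 / (2 * a) * (
        (gamma ** T * math.exp(2 * a * T) - 1) / (lg + 2 * a)
        - (gamma ** T - 1) / lg
    )

def discounted_cost_quadrature(a, sigma, gamma, T, n=200000):
    """Midpoint-rule evaluation of integral_0^T gamma^t E[x^2(t)] dt,
    with E[x^2(t)] = (sigma^2 / 2a) (e^{2at} - 1) from Lemma 1."""
    dt = T / n
    total = 0.0
    for i in range(n):
        t = (i + 0.5) * dt
        total += gamma ** t * sigma ** 2 / (2 * a) * (math.exp(2 * a * t) - 1)
    return total * dt

a, sigma, gamma, T = -0.5, 1.0, 0.9, 4.0
closed = discounted_cost_closed_form(a, sigma, gamma, T)
quad = discounted_cost_quadrature(a, sigma, gamma, T)
```

The quadrature agrees with the closed form to high precision, which is a useful sanity check before feeding the expression into the symbolic MSE computation.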

C.4 INFINITE HORIZON: PROOF OF THEOREM 3

Proof. The proof relies on the decomposition provided in Eq. (17). It only remains to compute the following cross term:
$$\mathbb{E}\big[\hat{V}_M(h) - V_T\big]\, V_{T,\infty} = \frac{\sigma^4}{2a}\Big(\frac{\gamma^T}{\log(\gamma)} - \frac{\gamma^T e^{2aT}}{\log(\gamma) + 2a}\Big)\Big[\frac{h}{2a}\Big(\frac{1 - \gamma^T e^{2aT}}{1 - \gamma^h e^{2ah}} - \frac{1 - \gamma^T}{1 - \gamma^h}\Big) - \frac{1}{2a}\Big(\frac{\gamma^T e^{2aT} - 1}{\log(\gamma) + 2a} - \frac{\gamma^T - 1}{\log(\gamma)}\Big)\Big]$$
$$= \frac{\sigma^4 \gamma^{2T}\big(e^{2aT} - 1\big)\big(\log(\gamma)\big(e^{2aT} - 1\big) - 2a\big)}{8a^2 \log(\gamma)\big(2a + \log(\gamma)\big)}\, h + \frac{\sigma^4 \gamma^T\big(2a + \log(\gamma) - e^{2aT}\log(\gamma)\big)\big(2a\big(\gamma^T e^{2aT} - 1\big) + \gamma^T \log(\gamma)\big(e^{2aT} - 1\big)\big)}{48a^2 \log(\gamma)\big(2a + \log(\gamma)\big)}\, h^2 + O(h^3).$$
Thus, the mean-squared error $\mathrm{MSE}_\infty(h, B, T, \gamma) = \mathbb{E}\big[(\hat{V}_M(h) - V_\infty)^2\big]$ is obtained by combining the above computation with Eq. (16) and Lemma 2.

D VECTOR CASE ANALYSIS

D.1 FINITE-HORIZON, UNDISCOUNTED: PROOF OF THEOREM 2

Proof. Consider the $n$-dimensional system whose trajectory solution $X(t)$ is
$$X(t) = \sigma \int_0^t e^{A(t-s)}\, dW(s).$$
Since $A$ is diagonalizable, we can decompose $A$ as $A = P^{-1} D P$, where $P$ is an invertible matrix (not necessarily orthogonal) and $D$ is a diagonal matrix whose diagonal entries $(\lambda_1, \cdots, \lambda_n)$ are the eigenvalues of $A$. It follows that the matrix exponential of $A$ decomposes as $e^{At} = P^{-1} e^{Dt} P$. Define the "diagonalized" process $\widetilde{X}(\cdot)$ as
$$\widetilde{X}(t) := P X(t) = P\sigma \int_0^t e^{A(t-s)}\, dW(s) = \sigma P P^{-1} \int_0^t e^{D(t-s)} P\, dW(s) = \sigma \int_0^t e^{D(t-s)}\, d\widetilde{W}(s),$$
where $\widetilde{W}(s)$ is a Wiener process (with dependent components when $P$ is not orthogonal). This implies that $X(\cdot) = P^{-1}\widetilde{X}(\cdot)$. To make $\widetilde{X}_i(t)$ explicit, we denote $P = [p_{ij}]_{i,j=1}^n$ and $\widetilde{X}_i(t) = \big(\phi_1^{(i)}(t), \cdots, \phi_n^{(i)}(t)\big)^\top$; then
$$\phi_l^{(i)}(t) = \sum_{j=1}^n p_{lj}\,\sigma \int_0^t e^{\lambda_l(t-s)}\, dw_j^{(i)}(s) \quad\text{for each } l \in \{1, \cdots, n\},$$
where the $w_j^{(i)}(s)$ are independent Wiener processes for different $i$ or $j$. Correspondingly, $\widetilde{X}(t) = \big(\phi_1(t), \cdots, \phi_n(t)\big)^\top$, with
$$\phi_l(t) = \sum_{j=1}^n p_{lj}\,\sigma \int_0^t e^{\lambda_l(t-s)}\, dw_j(s) \quad\text{for each } l \in \{1, \cdots, n\},$$
where the $w_j(s)$ are independent Wiener processes for different $j$. By the trace operation, we can rewrite $\hat{V}_M(h)$ as follows:
$$\hat{V}_M(h) = \frac{1}{M}\sum_{i=1}^M \sum_{k=0}^{N-1} h\, X(t_k)^\top Q X(t_k) = \frac{1}{M}\sum_{i=1}^M \sum_{k=0}^{N-1} h\, \widetilde{X}(t_k)^\top P^{-\top} Q P^{-1}\, \widetilde{X}(t_k) = \mathrm{tr}\big(P^{-\top} Q P^{-1}\, \widetilde{V}_M(h)\big),$$
where $\widetilde{V}_M(h) = \frac{1}{M}\sum_{i=1}^M \sum_{k=0}^{N-1} h\, \widetilde{X}(t_k)\,\widetilde{X}(t_k)^\top \in \mathbb{R}^{n\times n}$. Similarly, $V_T = \mathrm{tr}\big(P^{-\top} Q P^{-1}\, \widetilde{V}_T\big)$, where $\widetilde{V}_T = \int_0^T \mathbb{E}\big[\widetilde{X}(t)\widetilde{X}(t)^\top\big]\, dt$. Therefore, $\mathrm{MSE}_T(h, B)$ can be written as
$$\mathrm{MSE}_T(h, B) = \mathbb{E}\Big[\big(\hat{V}_M(h) - V_T\big)^2\Big] = \mathbb{E}\Big[\mathrm{tr}\Big(P^{-\top} Q P^{-1}\big(\widetilde{V}_M(h) - \widetilde{V}_T\big)\Big)^2\Big]. \quad (34)$$
For notational simplicity, we denote the matrices $P^{-\top} Q P^{-1} =: B = [b_{lj}]_{l,j=1}^n$ and $\widetilde{V}_M(h) - \widetilde{V}_T =: C = [c_{lj}]_{l,j=1}^n$.
Noting the fact that
$$\mathrm{MSE}_T(h, B) = \mathbb{E}\Big[\Big(\sum_{l,j} b_{jl}\, c_{lj}\Big)^2\Big] = \sum_{l_1, j_1, l_2, j_2} b_{j_1 l_1} b_{j_2 l_2}\, \mathbb{E}\big[c_{l_1 j_1} c_{l_2 j_2}\big],$$
it is sufficient to find $\mathrm{MSE}_T$ by only computing $\mathbb{E}[c_{l_1 j_1} c_{l_2 j_2}]$. We first introduce the following expectations that are used in the computations. For any $s \le t$:
$$\mathbb{E}\Big[\int_0^t e^{\lambda_1(t-u)}\, dw(u) \int_0^s e^{\lambda_2(s-u)}\, dw(u)\Big] = \frac{e^{\lambda_1 t + \lambda_2 s}}{\lambda_1 + \lambda_2}\big(1 - e^{-(\lambda_1 + \lambda_2)s}\big), \quad (35)$$
$$\mathbb{E}\Big[\int_0^s e^{\lambda_1(s-u)}\, dw(u) \int_0^s e^{\lambda_2(s-u)}\, dw(u) \int_0^t e^{\lambda_3(t-u)}\, dw(u) \int_0^t e^{\lambda_4(t-u)}\, dw(u)\Big]$$
$$= e^{(\lambda_1+\lambda_2)s + (\lambda_3+\lambda_4)t}\Big[\frac{\big(1 - e^{-(\lambda_1+\lambda_2)s}\big)\big(1 - e^{-(\lambda_3+\lambda_4)s}\big)}{(\lambda_1+\lambda_2)(\lambda_3+\lambda_4)} + \frac{\big(1 - e^{-(\lambda_1+\lambda_3)s}\big)\big(1 - e^{-(\lambda_2+\lambda_4)s}\big)}{(\lambda_1+\lambda_3)(\lambda_2+\lambda_4)} + \frac{\big(1 - e^{-(\lambda_1+\lambda_4)s}\big)\big(1 - e^{-(\lambda_2+\lambda_3)s}\big)}{(\lambda_1+\lambda_4)(\lambda_2+\lambda_3)} + \frac{\big(1 - e^{-(\lambda_1+\lambda_2)s}\big)\big(e^{-(\lambda_3+\lambda_4)s} - e^{-(\lambda_3+\lambda_4)t}\big)}{(\lambda_1+\lambda_2)(\lambda_3+\lambda_4)}\Big], \quad (36)$$
$$\int_0^T \mathbb{E}\Big[\int_0^t e^{\lambda_1(t-u)}\, dw(u) \int_0^t e^{\lambda_2(t-u)}\, dw(u)\Big]\, dt = \frac{e^{(\lambda_1+\lambda_2)T} - 1 - (\lambda_1+\lambda_2)T}{(\lambda_1+\lambda_2)^2}. \quad (37)$$
By using the definitions of $\widetilde{V}_M(h)$ and $\widetilde{V}_T$, for any $l, j \in \{1, \cdots, n\}$,
$$c_{lj} = \frac{1}{M}\sum_{i=1}^M \sum_{k=0}^{N-1} h\, \phi_l^{(i)}(kh)\, \phi_j^{(i)}(kh) - \int_0^T \mathbb{E}\big[\phi_l(t)\, \phi_j(t)\big]\, dt$$
$$= \frac{h\sigma^2}{M}\sum_{i=1}^M \sum_{k=0}^{N-1}\Big[\sum_{\alpha=1}^n p_{l\alpha} p_{j\alpha} \int_0^{kh} e^{\lambda_l(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_j(kh-s)}\, dw_\alpha^{(i)}(s) + \sum_{\alpha\ne\beta} p_{l\alpha} p_{j\beta} \int_0^{kh} e^{\lambda_l(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_j(kh-s)}\, dw_\beta^{(i)}(s)\Big] - \sigma^2 \sum_{\alpha=1}^n p_{l\alpha} p_{j\alpha} \int_0^T \mathbb{E}\Big[\int_0^t e^{\lambda_l(t-s)}\, dw_\alpha(s) \int_0^t e^{\lambda_j(t-s)}\, dw_\alpha(s)\Big]\, dt,$$
where the last equation is due to the fact that, for $\alpha \ne \beta$,
$$\mathbb{E}\Big[\int_0^t e^{\lambda_l(t-s)}\, dw_\alpha(s) \int_0^t e^{\lambda_j(t-s)}\, dw_\beta(s)\Big] = 0.$$
Thus, for any $l_1, l_2, j_1, j_2 \in \{1, \cdots, n\}$,
$$\mathbb{E}\big[c_{l_1 j_1} c_{l_2 j_2}\big] = \sum_{\alpha=1}^n p_{l_1\alpha} p_{j_1\alpha} p_{l_2\alpha} p_{j_2\alpha}\, \sigma^4 I_1(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha) + \sum_{\alpha\ne\beta} p_{l_1\alpha} p_{j_1\alpha} p_{l_2\beta} p_{j_2\beta}\, \sigma^4 I_2(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta) + \sum_{\alpha\ne\beta} p_{l_1\alpha} p_{j_1\beta} p_{l_2\alpha} p_{j_2\beta}\, \sigma^4 I_3(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta),$$
where
$$I_1(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha) = \mathbb{E}\bigg[\Big(\frac{h}{M}\sum_{i=1}^M\sum_{k=0}^{N-1} \int_0^{kh} e^{\lambda_{l_1}(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_{j_1}(kh-s)}\, dw_\alpha^{(i)}(s) - \int_0^T \mathbb{E}\Big[\int_0^t e^{\lambda_{l_1}(t-s)}\, dw_\alpha(s) \int_0^t e^{\lambda_{j_1}(t-s)}\, dw_\alpha(s)\Big]\, dt\Big) \times \Big(\text{the same expression with } (\lambda_{l_2}, \lambda_{j_2})\Big)\bigg],$$
$I_2(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta)$ is defined analogously to $I_1$ but with the second factor driven by $w_\beta^{(i)}$ instead of $w_\alpha^{(i)}$, and
$$I_3(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta) = \mathbb{E}\bigg[\Big(\frac{h}{M}\sum_{i=1}^M\sum_{k=0}^{N-1} \int_0^{kh} e^{\lambda_{l_1}(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_{j_1}(kh-s)}\, dw_\beta^{(i)}(s)\Big)\Big(\frac{h}{M}\sum_{i=1}^M\sum_{k=0}^{N-1} \int_0^{kh} e^{\lambda_{l_2}(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_{j_2}(kh-s)}\, dw_\beta^{(i)}(s)\Big)\bigg].$$
Note that $w_\alpha^{(i)}$ and $w_\beta^{(i)}$ are independent for $\alpha \ne \beta$. By using the expectations in Eqs. (35) and (37), we can further obtain
$$I_2(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta) = \Big[\frac{h}{\lambda_{l_1}+\lambda_{j_1}}\Big(\frac{1 - e^{(\lambda_{l_1}+\lambda_{j_1})T}}{1 - e^{(\lambda_{l_1}+\lambda_{j_1})h}} - \frac{T}{h}\Big) - \frac{e^{(\lambda_{l_1}+\lambda_{j_1})T} - 1 - (\lambda_{l_1}+\lambda_{j_1})T}{(\lambda_{l_1}+\lambda_{j_1})^2}\Big] \times \Big[\text{the same expression with } (\lambda_{l_2}, \lambda_{j_2})\Big].$$
In the following computations, we will use $C$ and $C(\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2})$ to denote constants that do not depend on $h, T, B$.
The expectation $I_1(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha)$ is computed in exactly the same way as in the proof of Theorem 1, using the expectation results Eq. (35) and Eq. (36). Notice that the expectation result Eq. (35) (when $s = t$) has the same order in $t$ as the expectation Eq. (25). Moreover, the two expectations Eq. (36) and Eq. (26) have the same orders in $s$ and $t$. Thus, $I_1$ has the same orders in $h, T, B$ as the scalar MSE, i.e.
$$I_1(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha) = \big(\tilde{C}_1 + C_1(\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2})\, O(T)\big)\, T^2 h^2 + O(h^3) + \big(\tilde{C}_2 + C_2(\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2})\, O(T)\big)\, \frac{T^5}{hB} + O\Big(\frac{1}{B}\Big).$$
The expectation $I_2$ can be computed directly and has the result:
$$I_2(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta) = \frac{\big(e^{(\lambda_{l_1}+\lambda_{j_1})T} - 1\big)\big(e^{(\lambda_{l_2}+\lambda_{j_2})T} - 1\big)\, h^2}{4(\lambda_{l_1}+\lambda_{j_1})(\lambda_{l_2}+\lambda_{j_2})} + O(h^3) = \Big(\frac{1}{4} T^2 + C_3(\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2})\, O(T^3)\Big) h^2 + O(h^3).$$
The expectation $I_3$ can be computed as follows:
$$I_3(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \alpha, \beta) = \frac{h^2}{M}\sum_{k=0}^{N-1} \frac{\big(e^{(\lambda_{l_1}+\lambda_{l_2})kh} - 1\big)\big(e^{(\lambda_{j_1}+\lambda_{j_2})kh} - 1\big)}{(\lambda_{l_1}+\lambda_{l_2})(\lambda_{j_1}+\lambda_{j_2})} + \frac{h^2}{M}\sum_{k<q} \frac{e^{\lambda_{l_1}kh + \lambda_{l_2}qh + \lambda_{j_1}kh + \lambda_{j_2}qh}}{(\lambda_{l_1}+\lambda_{l_2})(\lambda_{j_1}+\lambda_{j_2})}\big(1 - e^{-(\lambda_{l_1}+\lambda_{l_2})kh}\big)\big(1 - e^{-(\lambda_{j_1}+\lambda_{j_2})kh}\big) + \frac{h^2}{M}\sum_{k<q} \frac{e^{\lambda_{l_1}qh + \lambda_{l_2}kh + \lambda_{j_1}qh + \lambda_{j_2}kh}}{(\lambda_{l_1}+\lambda_{l_2})(\lambda_{j_1}+\lambda_{j_2})}\big(1 - e^{-(\lambda_{l_1}+\lambda_{l_2})kh}\big)\big(1 - e^{-(\lambda_{j_1}+\lambda_{j_2})kh}\big)$$
$$= \big(\tilde{C}_4 + C_4(\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2})\, O(T)\big)\, \frac{T^5}{hB} + O\Big(\frac{1}{B}\Big).$$
Thus, the final result is obtained from the expression of the MSE in Eq. (34), Eq. (38) and the above computations. Again, we rely on symbolic computation to obtain the expression and the corresponding Taylor approximations, and include the notebooks of all derivations in the supplementary material.
The extension from Theorem 2 to the discounted finite-horizon results can be done in the same way as in the above proof (adding the discount factor $\gamma$ in $\hat{V}_M$), by using the expected cost, for any $\lambda_1$ and $\lambda_2$:
$$\int_0^T \gamma^t\, \mathbb{E}\Big[\int_0^t e^{\lambda_1(t-u)}\, dw(u) \int_0^t e^{\lambda_2(t-u)}\, dw(u)\Big]\, dt = \frac{1}{\lambda_1 + \lambda_2}\Big(\frac{\gamma^T e^{(\lambda_1+\lambda_2)T} - 1}{\log(\gamma) + (\lambda_1 + \lambda_2)} - \frac{\gamma^T - 1}{\log(\gamma)}\Big).$$
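The change of variables $\widetilde{X} = PX$ and the trace rewrite used in the proof above can be checked numerically on a toy example (matrices and values here are our own illustration, not the paper's systems):

```python
import numpy as np

# An invertible, non-orthogonal change of basis P and a symmetric cost Q.
P = np.array([[2.0, 1.0], [1.0, 1.0]])
Q = np.array([[3.0, 0.5], [0.5, 1.0]])
x = np.array([0.7, -1.3])                    # a state sample

x_tilde = P @ x                               # "diagonalized" coordinates
P_inv = np.linalg.inv(P)
B_mat = P_inv.T @ Q @ P_inv                   # P^{-T} Q P^{-1}

quad = x @ Q @ x                              # x^T Q x
traced = np.trace(B_mat @ np.outer(x_tilde, x_tilde))
```

The two quantities agree, which is the identity $X^\top Q X = \mathrm{tr}\big(P^{-\top} Q P^{-1}\, \widetilde{X}\widetilde{X}^\top\big)$ underlying Eq. (34).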

D.2 PROOF OF COROLLARY 4

Proof. We follow similar arguments as in the proofs of Theorem 2 and Theorem 3. Continuing from Eq. (38), in the infinite-horizon discounted setting, we have
$$I_1(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, \alpha) = \mathbb{E}\bigg[\Big(\frac{h}{M}\sum_{i=1}^M\sum_{k=0}^{N-1} \gamma^{kh} \int_0^{kh} e^{\lambda_{l_1}(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_{j_1}(kh-s)}\, dw_\alpha^{(i)}(s) - \int_0^\infty \gamma^t\, \mathbb{E}\Big[\int_0^t e^{\lambda_{l_1}(t-s)}\, dw_\alpha(s) \int_0^t e^{\lambda_{j_1}(t-s)}\, dw_\alpha(s)\Big]\, dt\Big) \times \Big(\text{the same expression with } (\lambda_{l_2}, \lambda_{j_2})\Big)\bigg],$$
$$I_2(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, \alpha, \beta) = \Big[\frac{h}{\lambda_{l_1}+\lambda_{j_1}}\Big(\frac{1 - \gamma^T e^{(\lambda_{l_1}+\lambda_{j_1})T}}{1 - \gamma^h e^{(\lambda_{l_1}+\lambda_{j_1})h}} - \frac{1 - \gamma^T}{1 - \gamma^h}\Big) - \frac{1}{\lambda_{l_1}+\lambda_{j_1}}\Big(\frac{1}{\log(\gamma)} - \frac{1}{\log(\gamma) + \lambda_{l_1} + \lambda_{j_1}}\Big)\Big] \times \Big[\text{the same expression with } (\lambda_{l_2}, \lambda_{j_2})\Big],$$
$$I_3(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, \alpha, \beta) = \mathbb{E}\bigg[\Big(\frac{h}{M}\sum_{i=1}^M\sum_{k=0}^{N-1} \gamma^{kh} \int_0^{kh} e^{\lambda_{l_1}(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_{j_1}(kh-s)}\, dw_\beta^{(i)}(s)\Big)\Big(\frac{h}{M}\sum_{i=1}^M\sum_{k=0}^{N-1} \gamma^{kh} \int_0^{kh} e^{\lambda_{l_2}(kh-s)}\, dw_\alpha^{(i)}(s) \int_0^{kh} e^{\lambda_{j_2}(kh-s)}\, dw_\beta^{(i)}(s)\Big)\bigg].$$
By similar arguments as in the proof of Theorem 2, we can conclude that $I_1(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, \alpha)$ has the same orders in $h, B, T$ as the MSE result in Theorem 3.
Moreover, letting $C_i(\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, T)$ denote constants that depend on $\lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, T$, we have
$$I_2(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, \alpha, \beta) = \sigma^4 \gamma^{2T}\big(C_1(\cdots) + C_2(\cdots)\, h\big) + \sigma^4 \gamma^T\big(C_3(\cdots)\, h^2 + C_4(\cdots)\, h^3\big) + \sigma^4\Big(\frac{1}{144} + \gamma^T C_5(\cdots)\Big) h^4 + O(h^5),$$
$$I_3(M, h, T, \lambda_{l_1}, \lambda_{j_1}, \lambda_{l_2}, \lambda_{j_2}, \gamma, \alpha, \beta) = \frac{h^2}{M}\sum_{k=0}^{N-1} \frac{\big(e^{(\lambda_{l_1}+\lambda_{l_2})kh} - 1\big)\big(e^{(\lambda_{j_1}+\lambda_{j_2})kh} - 1\big)\gamma^{2kh}}{(\lambda_{l_1}+\lambda_{l_2})(\lambda_{j_1}+\lambda_{j_2})} + \frac{h^2}{M}\sum_{k<q} \frac{e^{\lambda_{l_1}kh + \lambda_{l_2}qh + \lambda_{j_1}kh + \lambda_{j_2}qh}}{(\lambda_{l_1}+\lambda_{l_2})(\lambda_{j_1}+\lambda_{j_2})}\big(1 - e^{-(\lambda_{l_1}+\lambda_{l_2})kh}\big)\big(1 - e^{-(\lambda_{j_1}+\lambda_{j_2})kh}\big)\gamma^{(k+q)h} + \frac{h^2}{M}\sum_{k<q} \frac{e^{\lambda_{l_1}qh + \lambda_{l_2}kh + \lambda_{j_1}qh + \lambda_{j_2}kh}}{(\lambda_{l_1}+\lambda_{l_2})(\lambda_{j_1}+\lambda_{j_2})}\big(1 - e^{-(\lambda_{l_1}+\lambda_{l_2})kh}\big)\big(1 - e^{-(\lambda_{j_1}+\lambda_{j_2})kh}\big)\gamma^{(k+q)h} = C_6(\cdots)\, \frac{T^5}{hB} + O\Big(\frac{1}{B}\Big).$$
The result in this corollary is obtained by combining the above results. We include the notebooks of all derivations in the supplementary material.

D.3 THE CASE WHEN A IS A GENERAL STABLE MATRIX

Lemma 3 (MSE when $A$ is a general stable matrix). Let $A$ be a stable $n \times n$ matrix with distinct eigenvalues $\lambda_1, \cdots, \lambda_m$ and corresponding multiplicities $q_1, \cdots, q_m$. There exist constants $\{\tilde{C}_i\}_{i=1}^m$, $\tilde{C}_0$ and $C_j(\lambda_1, \cdots, \lambda_m, \gamma, T)$ such that the mean-squared error of the Monte-Carlo estimator in the different settings satisfies:
(1) Finite-horizon undiscounted setting:
$$\mathrm{MSE}_T \in \Big[\sum_{i=1}^m q_i \tilde{C}_i\, \mathrm{MSE}_T(h, B, \lambda_i),\;\; C_1(\lambda_1, \cdots, \lambda_m, T)\,\sigma^4 T^2 h^2 + \big(\tilde{C}_2 + C_3(\lambda_1, \cdots, \lambda_m, T)\, O(T)\big)\,\sigma^4 \frac{T^{2n+3}}{Bh} + O(h^3) + O\Big(\frac{1}{B}\Big)\Big],$$
where $\mathrm{MSE}_T(h, B, \lambda_i)$ is the mean-squared error of the Monte-Carlo estimator in Theorem 1 with the drift $a$ replaced by $\lambda_i$.
(2) Finite-horizon discounted setting:
$$\mathrm{MSE}_T \in \Big[\sum_{i=1}^m q_i \tilde{C}_i\, \mathrm{MSE}_T(h, B, \gamma, \lambda_i),\;\; C_4(\lambda_1, \cdots, \lambda_m, \gamma, T)\,\sigma^4 \gamma^{2T} T^2 h^2 + C_5(\lambda_1, \cdots, \lambda_m, \gamma, T)\,\sigma^4 \gamma^T h^3 + C_6(\lambda_1, \cdots, \lambda_m, T)\,\sigma^4 h^4 + C_7(\lambda_1, \cdots, \lambda_m, \gamma, T)\,\sigma^4 \frac{T^{2n-1}}{Bh} + O(h^5) + O\Big(\frac{1}{B}\Big)\Big],$$
where $\mathrm{MSE}_T(h, B, \gamma, \lambda_i)$ is the mean-squared error of the Monte-Carlo estimator in Lemma 2 with the drift $a$ replaced by $\lambda_i$.
(3) Infinite-horizon discounted setting:
$$\mathrm{MSE}_\infty \in \Big[\sum_{i=1}^m q_i \tilde{C}_i\, \mathrm{MSE}_\infty(h, B, \gamma, \lambda_i),\;\; \big(C_8(\lambda_1, \cdots, \lambda_m, \gamma, T) + C_9(\lambda_1, \cdots, \lambda_m, \gamma, T)\, h\big)\sigma^4\gamma^{2T} + \big(C_{10}(\lambda_1, \cdots, \lambda_m, \gamma, T)\, h^2 + C_{11}(\lambda_1, \cdots, \lambda_m, \gamma, T)\, h^3\big)\sigma^4\gamma^T + C_{12}(\lambda_1, \cdots, \lambda_m, T)\,\sigma^4 h^4 + C_{13}(\lambda_1, \cdots, \lambda_m, \gamma, T)\,\sigma^4 \frac{T^{2n-1}}{Bh} + O(h^5) + O\Big(\frac{1}{B}\Big)\Big],$$
where $\mathrm{MSE}_\infty(h, B, \gamma, \lambda_i)$ is the mean-squared error of the Monte-Carlo estimator in Theorem 3 with the drift $a$ replaced by $\lambda_i$.

Proof. The proof of Lemma 2 builds on the proof of Theorem 1 by adding a discount factor $\gamma$, and the proof of Theorem 3 builds on the proof of Lemma 2 via the decomposition Eq. (17). Following the same line of argument, it is sufficient to show the result in case (1); the results in cases (2) and (3) then follow.
Consider the decomposition of $\mathrm{MSE}_T$ in the finite-horizon undiscounted setting:
$$\mathrm{MSE}_T = \mathbb{E}\big[(\hat{V}_M - V_T)^2\big] = \mathbb{E}\big[(\hat{V}_M - \mathbb{E}\hat{V}_M + \mathbb{E}\hat{V}_M - V_T)^2\big] = \underbrace{\mathbb{E}\big[\hat{V}_M^2\big] - \big(\mathbb{E}\hat{V}_M\big)^2}_{\text{Part 1}} + \underbrace{\big(\mathbb{E}\hat{V}_M - V_T\big)^2}_{\text{Part 2}}.$$
Before the analysis of Part 1 and Part 2, we introduce the following mean-squared error notation for the finite-horizon undiscounted scalar case with drift $\lambda_i$:
$$\mathrm{MSE}_T(h, B, \lambda_i) = \mathrm{Var}(h, \lambda_i) + \mathrm{Approximation}(h, B, \lambda_i),$$
where $\mathrm{Var}(h, \lambda_i) = \mathbb{E}[\hat{V}_M^2] - (\mathbb{E}\hat{V}_M)^2$ and $\mathrm{Approximation}(h, B, \lambda_i) = (\mathbb{E}\hat{V}_M - V_T)^2$. For Part 1:
$$\mathbb{E}\big[\hat{V}_M^2\big] = \frac{h^2}{M^2}\sum_{i,j,k,l}\mathbb{E}\big[X_i(kh)^\top Q X_i(kh)\, X_j(lh)^\top Q X_j(lh)\big] = \frac{h^2}{M^2}\sum_{i,j,k,l}\Big(\mathbb{E}\big[X_i(kh)^\top Q X_i(kh)\big]\,\mathbb{E}\big[X_j(lh)^\top Q X_j(lh)\big] + 2\,\mathrm{tr}\Big(\big(Q\,\mathbb{E}\big[X_i(kh) X_j(lh)^\top\big]\big)^2\Big)\Big)$$
$$= h^2\sum_{k,l}\mathbb{E}\big[X(kh)^\top Q X(kh)\big]\,\mathbb{E}\big[X(lh)^\top Q X(lh)\big] + \frac{2h^2}{M}\sum_k \mathrm{tr}\Big(\big(Q\,\mathbb{E}\big[X(kh) X(kh)^\top\big]\big)^2\Big) + \frac{4h^2}{M}\sum_{k<l}\mathrm{tr}\Big(\big(Q\,\mathbb{E}\big[X(kh) X(lh)^\top\big]\big)^2\Big),$$
where the second equality is based on Isserlis' theorem and the trace operation. Noticing that $\big(\mathbb{E}\hat{V}_M\big)^2 = h^2\sum_{k,l}\mathbb{E}[X(kh)^\top Q X(kh)]\,\mathbb{E}[X(lh)^\top Q X(lh)]$, we thus get
$$\mathbb{E}\big[\hat{V}_M^2\big] - \big(\mathbb{E}\hat{V}_M\big)^2 = \frac{2h^2}{M}\sum_k \mathrm{tr}\Big(\big(Q\,\mathbb{E}[X(kh) X(kh)^\top]\big)^2\Big) + \frac{4h^2}{M}\sum_{k<l}\mathrm{tr}\Big(\big(Q\,\mathbb{E}[X(kh) X(lh)^\top]\big)^2\Big).$$
To analyze the above form, we decompose the matrix $A$ by its Jordan form, i.e. $A = P^{-1} J P$ for some invertible matrix $P$ and $J = \mathrm{diag}(J_1, \cdots, J_m)$, where $J_i$ is the Jordan block corresponding to the eigenvalue $\lambda_i$. Notice that $e^{J(kh-s)} = \mathrm{diag}\big(e^{J_1(kh-s)}, \cdots, e^{J_m(kh-s)}\big)$, where
$$e^{J_i(kh-s)} = e^{\lambda_i(kh-s)}\begin{pmatrix} 1 & kh-s & \frac{(kh-s)^2}{2!} & \cdots & \frac{(kh-s)^{q_i-1}}{(q_i-1)!} \\ & 1 & kh-s & \cdots & \frac{(kh-s)^{q_i-2}}{(q_i-2)!} \\ & & \ddots & & \vdots \\ & & & & 1 \end{pmatrix}.$$
Combining this with the fact that, for any $k, l$,
$$\mathbb{E}\big[X(kh) X(lh)^\top\big] = \int_0^{kh\wedge lh} e^{A(kh-s)} e^{A^\top(lh-s)}\, ds = \int_0^{kh\wedge lh} P^{-1} e^{J(kh-s)} P P^\top e^{J^\top(lh-s)} P^{-\top}\, ds,$$
we can conclude that for any $k \le l$, $\mathrm{tr}\big(Q\,\mathbb{E}[X(kh) X(lh)^\top]\big)$ is a linear combination of $L_{1,i,j}$ and $L_{2,i,j}$ for all $i, j$, where
$$L_{1,i,j} := C_{1,i,j}\int_0^{kh} e^{\lambda_i(kh-s) + \lambda_j(lh-s)}\, ds = C_{1,i,j}\,\frac{e^{\lambda_i kh + \lambda_j lh}}{\lambda_i + \lambda_j}\big(1 - e^{-(\lambda_i+\lambda_j)kh}\big), \qquad L_{2,i,j} := C_{2,i,j}\int_0^{kh} e^{\lambda_i(kh-s) + \lambda_j(lh-s)}\,(kh-s)^{\tilde{q}_i}(lh-s)^{\tilde{q}_j}\, ds,$$
where $C_{1,i,j}, C_{2,i,j}$ are some constants and $\tilde{q}_i \in \{0, \cdots, q_i - 1\}$, $\tilde{q}_j \in \{0, \cdots, q_j - 1\}$.
For the integral in $L_{2,i,j}$, since $\tilde{q}_i + \tilde{q}_j \le n - 1$, we have the inequality:
$$\int_0^{kh} e^{\lambda_i(kh-s) + \lambda_j(lh-s)}\,(kh-s)^{\tilde{q}_i}(lh-s)^{\tilde{q}_j}\, ds \le T^{n-1}\int_0^{kh} e^{\lambda_i(kh-s) + \lambda_j(lh-s)}\, ds. \quad (44)$$
Since
$$\sum_{k,l}\mathrm{tr}\big(Q\,\mathbb{E}[X(kh) X(lh)^\top]\big)^2 = \sum_{i_1,j_1,i_2,j_2}\sum_{k,l}\sum_{l_1,l_2\in\{1,2\}} L_{l_1,i_1,j_1}\, L_{l_2,i_2,j_2},$$
and all the terms are nonnegative, we drop all terms that include an $L_{2,i,j}$ factor and keep only the $L_{1,i,i}^2$ terms with $k = l$ in the lower bound of Part 1. That is, the lower bound of Part 1 is $\sum_{i=1}^m q_i \tilde{C}_i\,\mathrm{Var}(h, \lambda_i)$. The upper bound of Part 1 can be obtained by replacing all $L_{1,i,j}$ factors by $L_{2,i,j}$ and using the bound given in Eq. (44), which contributes the $T^{2n+3}/(Bh)$ factor in the upper bound of Eq. (39). For Part 2, let $g(t) = \mathbb{E}[X(t)^\top Q X(t)]$ on $[0, T]$. Then $\mathbb{E}\hat{V}_M$ is the left Riemann sum approximation of $\int_0^T g(t)\, dt$, and by the properties of the Riemann approximation, $|\mathbb{E}\hat{V}_M - V_T| \approx 2hT\, g(T) + O(h^2)$, where
$$g(T) = \mathrm{tr}\big(Q\,\mathbb{E}[X(T) X(T)^\top]\big) = \sigma^2\,\mathrm{tr}\Big(Q\int_0^T e^{A(T-s)} e^{A^\top(T-s)}\, ds\Big),$$
which is a constant depending on $\lambda_1, \cdots, \lambda_m, T$. Thus
$$\big(\mathbb{E}\hat{V}_M - V_T\big)^2 \approx C_1(\lambda_1, \cdots, \lambda_m, T)\,\sigma^4 T^2 h^2 + O(h^3),$$
which has the same order in $h$ as the scalar case in the finite-horizon undiscounted setting. The result in Eq. (39) is obtained by combining the Part 1 bounds and the Part 2 approximation. As we explained at the beginning of this proof, in the finite-horizon discounted setting we follow similar arguments as in the proof of Eq. (39) to obtain the result in Eq. (40). For the infinite-horizon discounted setting, the corresponding Part 1 in $\mathrm{MSE}_\infty$ is the same as the Part 1 in $\mathrm{MSE}_T$ of Eq. (40). Part 2 is approximated by using the decomposition Eq. (17) and the fact that
$$V_{T,\infty} = \int_T^\infty \gamma^t\, \mathbb{E}\big[X(t)^\top Q X(t)\big]\, dt = \gamma^T C(\gamma, T, \lambda_1, \cdots, \lambda_m).$$
To verify that $V_{T,\infty}$ is $O(\gamma^T)$, one can bound $V_{T,\infty}$ using arguments similar to the above proof of Eq. (39) together with the inequality
$$\int_0^t e^{(\lambda_i+\lambda_j)(t-s)}\,(t-s)^{\tilde{q}_i + \tilde{q}_j}\, ds \le t^{n-1}\int_0^t e^{(\lambda_i+\lambda_j)(t-s)}\, ds = \frac{t^{n-1}}{\lambda_i + \lambda_j}\big(e^{(\lambda_i+\lambda_j)t} - 1\big).$$
The components of $V_{T,\infty}$ are then of the form $\int_T^\infty \gamma^t\,\frac{t^{n-1}}{\lambda_i + \lambda_j}\big(e^{(\lambda_i+\lambda_j)t} - 1\big)\, dt$. By the well-known asymptotic approximation of the incomplete gamma function when $T$ is large, we have
$$\int_T^\infty \gamma^t\,\frac{t^{n-1}}{\lambda_i + \lambda_j}\big(e^{(\lambda_i+\lambda_j)t} - 1\big)\, dt \approx \gamma^T\,\frac{T^{n-1}}{\lambda_i + \lambda_j}\big(e^{(\lambda_i+\lambda_j)T} - 1\big).$$
Since $V_{T,\infty} = \mathrm{tr}\big(Q\int_T^\infty \gamma^t\,\mathbb{E}[X(t) X(t)^\top]\, dt\big)$, we obtain $V_{T,\infty} = \gamma^T C(\gamma, T, \lambda_1, \cdots, \lambda_m)$.
This result leads to the fact that Part 2 is
$$\big(C_8(\lambda_1, \cdots, \lambda_m, \gamma, T) + C_9(\lambda_1, \cdots, \lambda_m, \gamma, T)\, h\big)\sigma^4\gamma^{2T} + \big(C_{10}(\lambda_1, \cdots, \lambda_m, \gamma, T)\, h^2 + C_{11}(\lambda_1, \cdots, \lambda_m, \gamma, T)\, h^3\big)\sigma^4\gamma^T + C_{12}(\lambda_1, \cdots, \lambda_m, T)\,\sigma^4 h^4 + O(h^5),$$
which coincides in order with the corresponding terms of the infinite-horizon discounted scalar case. The result in case (3) follows.



Figure 2: MSE of Monte-Carlo value estimation in nonlinear systems. The line and shaded region denote the sample mean and its standard error of $(\hat{V}_M(h) - V)^2$ over 30 random runs. T is the horizon in physical time (seconds). B₀ denotes the environment-dependent base sample budget, chosen such that it gives a full episode for the smallest h (see Appendix F). In almost all environments the optimal step-size depends on the data budget (with 'InvertedDoublePendulum-v2' being the only exception). In particular, the MSE as a function of h shows a clear minimum at the optimal step-size, which generally decreases as the data budget increases.

Figure 3: Empirical $h^*$ in nonlinear experiments (solid lines) aligns well with the analysis in Corollary 3 (dashed lines): $h^* = cTB^{-1/3}$. The constant c is environment-dependent and estimated from the data by least squares.





Figure 4: Mean-squared error trade-off in LQR with random symmetric diagonalizable drift matrices A. The matrices A{1,2,3,4,5} in each plot were generated by randomly sampling an eigendecomposition and then using it to compute the drift matrix A. The eigenvalues are uniformly sampled on bounded disjoint intervals and the eigenvectors are sampled randomly from the classical compact groups detailed in Mezzadri (2006). Note that the matrices in (a) and (b) are not equal.

Recall that in Fig. 1(b) and Fig. 1(c), the drift matrices A were scaled identity matrices. In this section, we show empirically that the trade-off persists when A is a dense, stable matrix, which aligns with our theoretical results Theorem 2 and Corollary 4. Fig. 4 clearly shows a trade-off for 10 randomly sampled 3 × 3 dense, stable matrices. The procedure for randomly sampling a system starts with uniformly sampling two eigenvalues from disjoint, bounded intervals, $\lambda_1 \in [-1.5, -1.0)$ and $\lambda_3 \in (-1.0, -0.75]$. The final eigenvalue is set to $\lambda_2 = -1.0$. Note that since all the


eigenvalues sampled are negative, any matrix whose eigenvalues are $\lambda_1, \lambda_2, \lambda_3$ is stable. Next, we randomly sample an orthogonal matrix Q using the built-in SCIPY (Virtanen et al., 2020) routine ORTHO_GROUP.RVS. Letting $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \lambda_3)$, the random, dense, stable matrix A is defined as $A = Q^\top \Lambda Q$.
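The sampling procedure above can be sketched as follows, assuming SciPy is available (the seed and interval endpoints match the text; variable names are ours):

```python
import numpy as np
from scipy.stats import ortho_group

seed = 0
rng = np.random.default_rng(seed)

# Two eigenvalues from the disjoint intervals in the text; the middle
# eigenvalue is fixed at -1.0.
lam1 = rng.uniform(-1.5, -1.0)
lam3 = rng.uniform(-1.0, -0.75)
eigvals = np.array([lam1, -1.0, lam3])

# Random orthogonal eigenvectors, then A = Q^T diag(eigvals) Q.
Qm = ortho_group.rvs(dim=3, random_state=seed)
A = Qm.T @ np.diag(eigvals) @ Qm
```

Because Q is orthogonal, A is symmetric with exactly the sampled (negative) eigenvalues, hence dense and stable.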

F EXPERIMENTAL DETAILS

We summarize the environment-specific parameters in Table 1 for the nonlinear-system experiments. 

