MANAGING TEMPORAL RESOLUTION IN CONTINUOUS VALUE ESTIMATION: A FUNDAMENTAL TRADE-OFF

Anonymous authors
Paper under double-blind review

Abstract

A default assumption in reinforcement learning and optimal control is that experience arrives at discrete time points on a fixed clock cycle. Many applications, however, involve continuous systems where the time discretization is not fixed but instead can be managed by a learning algorithm. By analyzing Monte-Carlo value estimation for LQR systems in both finite-horizon and infinite-horizon settings, we uncover a fundamental trade-off between approximation and statistical error in value estimation. Importantly, these two errors behave differently with respect to time discretization, which implies that there is an optimal choice for the temporal resolution that depends on the data budget. These findings show how adapting the temporal resolution can provably improve value estimation quality in LQR systems from finite data. Empirically, we demonstrate the trade-off in numerical simulations of LQR instances and several non-linear environments.

1. INTRODUCTION

In many real-world applications of control and reinforcement learning, the underlying system evolves continuously in time. For instance, a physical system such as a robot is naturally modeled as a stochastic dynamical system. In practice, however, sensor measurements are usually captured at discrete time intervals, and the practitioner must decide how to discretize the time dimension, i.e. choose a sampling frequency or a measurement step-size. A common belief is that a finer time discretization always leads to better estimation of the system properties and of the control cost, or the reward in reinforcement learning. As we show, this is only true with an unlimited data budget; in practice there are always limitations on how much data can be collected, stored, and processed. Consider, for example, the task of episodic policy evaluation with a finite data budget. A higher temporal resolution means that more data is collected within fewer episodes, which inevitably raises the question of how to optimally choose the time discretization for the task at hand. The practitioner therefore faces a fundamental trade-off: a finer temporal resolution yields a better approximation of the continuous-time system from discrete measurements, but collecting denser data along fewer trajectories increases the estimation variance with respect to the stochasticity in the system. This holds for any system with stochastic dynamics, even if the learner has access to exact (noiseless) measurements of the system's state. In this paper, we show that data efficiency can be significantly improved by leveraging a precise understanding of the trade-off between approximation error and statistical estimation error in long-term value estimation: two factors that react differently to the level of temporal discretization. The main contributions of this work are twofold.
First, we consider the simplest and canonical case of Monte-Carlo value estimation in a Langevin dynamical system (linear dynamics perturbed by a Wiener process) with quadratic instantaneous costs. Although the setup is specialized, it is simple enough that we can obtain analytical expressions for the least-squares error that exactly characterize the approximation-estimation trade-off with respect to the step-size parameter. Second, we present a numerical study that illustrates and confirms the trade-off in both linear and non-linear systems, including several MuJoCo control environments. Our findings imply that practitioners should carefully choose the step-size parameter of the estimation to obtain the most accurate results possible.
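The trade-off can be illustrated with a small simulation sketch (not the paper's actual experiment; all function and parameter names here are hypothetical). With a fixed budget of state measurements, a step size h yields T/h samples per episode, so the budget is split across budget/(T/h) episodes; the quadratic cost integral is approximated by a Riemann sum along each trajectory, and the Monte-Carlo value estimate averages over episodes:

```python
import numpy as np

def mc_value_estimate(a, sigma, q, h, T, budget, rng):
    """Monte-Carlo estimate of E[ integral_0^T q*X(t)^2 dt ] for the scalar
    closed-loop Langevin system dX = a*X dt + sigma dW with X(0) = 0.
    A fixed budget of state measurements is split into episodes of T/h steps."""
    steps = int(round(T / h))
    episodes = max(budget // steps, 1)  # coarser h => fewer steps => more episodes
    costs = []
    for _ in range(episodes):
        x, cost = 0.0, 0.0
        for _ in range(steps):
            cost += q * x * x * h  # Riemann-sum approximation of the cost integral
            x += h * a * x + sigma * np.sqrt(h) * rng.standard_normal()
        costs.append(cost)
    return np.mean(costs)

rng = np.random.default_rng(1)
# A finer h reduces the approximation error of the Riemann sum, but leaves
# fewer episodes to average over, increasing the variance of the estimate.
for h in (0.5, 0.1, 0.02):
    est = mc_value_estimate(a=-1.0, sigma=1.0, q=1.0, h=h, T=5.0,
                            budget=2000, rng=rng)
```

Sweeping h under a fixed budget, as in the loop above, is exactly the experiment design that exposes the interior optimum: neither the coarsest nor the finest discretization minimizes the total error.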

1.1. RELATED WORK

There is a sizable literature on reinforcement learning in continuous-time systems (e.g. Doya, 2000; Lee & Sutton, 2021; Lewis et al., 2012; Bahl et al., 2020; Kim et al., 2021; Yildiz et al., 2021). These previous works have largely focused on deterministic dynamics and do not investigate trade-offs in temporal discretization. A smaller body of work has considered learning continuous-time control under stochastic (Baird, 1994; Bradtke & Duff, 1994; Munos & Bourgine, 1997; Munos, 2006) or bounded (Lutter et al., 2021) perturbations, but with a focus on making standard learning methods more robust to small time scales (Tallec et al., 2019), again without explicitly managing the temporal discretization level. There have also been works that characterize the effects of temporal truncation in infinite-horizon problems (Jiang et al., 2016; Droge & Egerstedt, 2011). Despite these prevailing topics in the literature, we find that managing temporal discretization offers substantial improvements not captured by these previous studies. The LQR setting is a standard framework in control theory and gives rise to a fundamental optimal control problem (Lindquist, 1990), which has proven to be a challenging scenario for reinforcement learning algorithms (Tu & Recht, 2019; Krauth et al., 2019). The stochastic LQR considers linear systems driven by additive Gaussian noise with a quadratic cost, which is to be minimized by means of a feedback controller. Although it is a well-understood scenario and a closed form of the optimal controller is known thanks to the separation principle (Georgiou & Lindquist, 2013), only recently have the statistical properties of the long-term cost been investigated (Bijl et al., 2016).
The work in this paper is also closely related to the now sizable literature on reinforcement learning in LQR systems (Bradtke, 1992; Krauth et al., 2019; Tu & Recht, 2018; Dean et al., 2020; Tu & Recht, 2019; Dean et al., 2018; Fazel et al., 2018; Gu et al., 2016). These existing works uniformly focus on the discrete-time setting, although the benefits of managing spatial rather than temporal discretization have been considered (Sinclair et al., 2019; Cao & Krishnamurthy, 2020). Wang et al. (2020) studied the continuous-time LQR setting, but focused on the exploration problem rather than the temporal discretization. There is compelling empirical evidence that managing temporal resolution, typically via action persistence (Lakshminarayanan et al., 2017; Sharma et al., 2017; Huang et al., 2019; Huang & Zhu, 2020; Dabney et al., 2021; Park et al., 2021), can greatly improve learning performance. Even grid worlds (Sutton & Barto, 2018) can be seen as leveraging a form of action persistence, where a coarse spatial discretization is imposed on an otherwise continuous two-dimensional navigation problem to improve learning efficiency. These empirical findings have recently been supported by an initial theoretical analysis (Metelli et al., 2020) showing that temporal discretization plays a role in determining the effectiveness of fitted Q-iteration. The analysis by Metelli et al. (2020) does not consider fully continuous systems, but rather remains anchored in a base-level discretization and only provides worst-case upper bounds that do not necessarily capture the detailed trade-offs one faces in practice. Choosing the temporal resolution can also be understood as a non-linear experimental design problem (Chaloner & Verdinelli, 1995; Ford et al., 1989): by choosing the time discretization, the experimenter determines how to allocate measurements for a given data budget.
What is peculiar to our objective is that any fixed design has a constant approximation error (bias) that persists even when the number of data points becomes infinite. At the same time, the bias can be reduced by sacrificing estimation error (variance). Optimal designs that consider the bias-variance trade-off jointly have been studied previously (e.g. Bardow, 2008; Mutny et al., 2020; Mutnỳ & Krause, 2022).

2. POLICY EVALUATION IN CONTINUOUS LINEAR QUADRATIC SYSTEMS

In the classical continuous-time linear quadratic regulator (LQR), a state variable X(t) ∈ R^n evolves over time t ≥ 0 according to the stochastic differential equation

dX(t) = AX(t) dt + BU(t) dt + σ dW(t). (1)

The dynamical model is fully specified by the matrices A ∈ R^{n×n}, B ∈ R^{n×p} and the diffusion coefficient σ. The control input U(·) ∈ R^p is given by a fixed policy, and W(t) is a Wiener process. The state variable X(t) is fully observed. For simplicity, we assume that the dynamics start at X(0) = 0 ∈ R^n (c.f. Abbasi-Yadkori & Szepesvári, 2011; Dean et al., 2020).
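As a concrete illustration (a sketch under our own assumptions, not the paper's implementation), the dynamics in Eq. (1) under a linear feedback policy U(t) = -KX(t) can be rolled out with an Euler-Maruyama discretization at step size h, where the Wiener increments are sampled as sqrt(h)-scaled Gaussians; the function and variable names below are hypothetical:

```python
import numpy as np

def simulate_lqr(A, B, K, sigma, h, T, rng):
    """Euler-Maruyama rollout of dX = AX dt + BU dt + sigma dW
    under the feedback policy U(t) = -K X(t), starting from X(0) = 0."""
    n = A.shape[0]
    steps = int(round(T / h))
    Acl = A - B @ K  # closed-loop drift matrix
    X = np.zeros(n)
    traj = [X.copy()]
    for _ in range(steps):
        # Wiener increment over [t, t+h] has standard deviation sqrt(h)
        dW = np.sqrt(h) * rng.standard_normal(n)
        X = X + h * (Acl @ X) + sigma * dW
        traj.append(X.copy())
    return np.array(traj)  # shape (steps + 1, n)

rng = np.random.default_rng(0)
A = np.array([[0.0, 1.0], [-1.0, -0.5]])  # example 2-d system
B = np.array([[0.0], [1.0]])
K = np.array([[1.0, 1.0]])                # example stabilizing gain
traj = simulate_lqr(A, B, K, sigma=0.1, h=0.01, T=5.0, rng=rng)
```

Note that the step size h appears in two roles here: it controls the fidelity of the Euler-Maruyama scheme to the continuous dynamics, and it determines how many state measurements a trajectory of horizon T consumes.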

