INITIAL VALUE PROBLEM ENHANCED SAMPLING FOR CLOSED-LOOP OPTIMAL CONTROL DESIGN WITH DEEP NEURAL NETWORKS

Anonymous authors
Paper under double-blind review

Abstract

Closed-loop optimal control design for high-dimensional nonlinear systems has been a long-standing challenge. Traditional methods, such as solving the associated Hamilton-Jacobi-Bellman equation, suffer from the curse of dimensionality. Recent literature proposed a promising new approach based on supervised learning, which leverages powerful open-loop optimal control solvers to generate training data and neural networks as efficient high-dimensional function approximators to fit the closed-loop optimal control. This approach successfully handles certain high-dimensional optimal control problems but still performs poorly on more challenging problems. One of the crucial reasons for the failure is the so-called distribution mismatch phenomenon brought by the controlled dynamics. In this paper, we investigate this phenomenon and propose the initial value problem enhanced sampling method to mitigate it. We theoretically prove that this sampling strategy improves over the vanilla strategy on the classical linear-quadratic regulator by a factor proportional to the total time duration. We further numerically demonstrate that the proposed sampling strategy significantly improves performance on the tested control problems, including the optimal landing problem of a quadrotor and the optimal reaching problem of a 7-DoF manipulator.

1. INTRODUCTION

Optimal control aims to find a control for a dynamical system over a period of time such that a specified loss function is minimized. Generally speaking, there are two types of optimal control: open-loop optimal control and closed-loop (feedback) optimal control. Open-loop optimal control deals with the problem for a given initial state, and its solution is a function of time for that specific initial state, independent of the other states of the system. In contrast, closed-loop optimal control aims to find the optimal control policy as a function of the state, which yields the optimal control for general initial states. By the nature of the problem, solving the open-loop control problem is relatively easy, and various open-loop control solvers can handle nonlinear problems even when the state lives in high dimensions (Betts, 1998; Rao, 2009). Closed-loop control is much more powerful than open-loop control since it can cope with different initial states, and it is more robust to disturbances of the dynamics. The classical approach to obtaining a closed-loop optimal control function is to solve the associated Hamilton-Jacobi-Bellman (HJB) equation. However, traditional numerical algorithms for HJB equations, such as the finite difference method or the finite element method, face the curse of dimensionality (Bellman, 1957) and hence cannot deal with high-dimensional problems. Since the work of Han & E (2016) on stochastic optimal control problems, there has been growing interest in exploiting the capacity of neural networks (NNs) to approximate high-dimensional functions in order to solve closed-loop optimal control problems (Nakamura-Zimmerer et al., 2021a;b; 2020; Böttcher et al., 2022; E et al., 2022). Generally speaking, there are two categories of methods in this promising direction.
One is the policy search approach (Han & E, 2016; Ainsworth et al., 2021; Böttcher et al., 2022; Zhao et al., 2022), which directly parameterizes the policy function by neural networks, computes the total cost over various initial points, and minimizes the average total cost. When solving problems with a long time span and high nonlinearity, the corresponding optimization problems can be extremely hard and may get stuck in local minima (Levine & Koltun, 2014). The other category of methods is based on supervised learning (Nakamura-Zimmerer et al., 2021a;b; 2020; 2022). Combining various techniques for open-loop control, one can solve complex high-dimensional open-loop optimal control problems; see Betts (1998); Rao (2009); Kang et al. (2021) for detailed surveys. Consequently, we can collect optimal trajectories for different initial points as training data, parameterize the control function (or value function) using NNs, and train the NN models to fit the closed-loop optimal controls (or optimal values). This work focuses on the second approach and aims to improve its performance through adaptive sampling. As demonstrated in Nakamura-Zimmerer et al. (2021b); Zang et al. (2022), NN controllers trained by the vanilla supervised-learning-based approach can perform poorly even when both the training error and the test error on the collected datasets are fairly small. Some existing works attribute this phenomenon to the fact that the learned controller may deteriorate badly at some difficult initial states even though the error is small in the average sense. Several adaptive sampling methods for the initial points have hence been proposed (see Section 4 for a detailed discussion). However, these methods all focus on choosing optimal paths according to different initial points and ignore the effect of the dynamics. This is an issue since the paths controlled by the NN deviate from the optimal paths further and further over time due to the accumulation of errors.
As shown in Section 6, applying adaptive sampling only to initial points is insufficient to solve challenging problems. This work is concerned with the so-called distribution mismatch phenomenon brought by the dynamics in the supervised-learning-based approach. This phenomenon refers to the fact that the discrepancy between the state distribution of the training data and the state distribution generated by the NN controller typically increases over time, so that the training data fails to represent the states encountered when the trained NN controller is deployed. Such a phenomenon has also been identified in reinforcement learning (Kakade & Langford, 2002; Long & Han, 2022) and imitation learning (Ross & Bagnell, 2010). To mitigate this phenomenon, we propose the initial value problem (IVP) enhanced sampling method, which makes the states in the training dataset match more closely the states that the controller actually reaches. In the IVP enhanced sampling method, we iteratively re-evaluate the states that the NN controller reaches by solving IVPs and recompute new training data by solving the open-loop control problems starting at these states. Our sampling method is versatile and can be combined with other techniques such as faster open-loop control solvers or better neural network structures. The resulting supervised-learning-based approach empowered by the IVP enhanced sampling can be interpreted as an instance of the exploration-labeling-training (ELT) algorithms (Zhang et al., 2018; E et al., 2021) for closed-loop optimal control problems (see Appendix A for more discussion). At a high level, the ELT algorithm proceeds iteratively with the following three steps: (1) exploring the state space and examining which states need to be labeled; (2) solving the control problem to label these states and adding them to the training data; (3) training the machine learning model. The main contributions of the paper can be summarized as follows.
(1) We investigate the distribution mismatch phenomenon brought by the controlled dynamics in the supervised-learning-based approach, which explains the failure of this approach on challenging problems. We propose the IVP enhanced sampling method to update the training data, which significantly alleviates the distribution mismatch problem.

(2) We show that the IVP enhanced sampling method can significantly improve the performance of the learned closed-loop controller on a uni-dimensional linear quadratic control problem (theoretically and numerically) and on two high-dimensional problems (numerically): the quadrotor landing problem and the reaching problem of a 7-DoF manipulator.

(3) We compare the IVP enhanced sampling method with other adaptive sampling methods and show that it gives the best performance.

2.1. OPEN-LOOP AND CLOSED-LOOP OPTIMAL CONTROL

We consider the following deterministic controlled dynamical system:
$$\dot{x}(t) = f(t, x(t), u(t)), \quad t \in [t_0, T], \qquad x(t_0) = x_0, \tag{1}$$
where $x(t) \in \mathbb{R}^n$ denotes the state, $u(t) \in \mathcal{U} \subset \mathbb{R}^m$ denotes the control with $\mathcal{U}$ being the set of admissible controls, $f: [0, T] \times \mathbb{R}^n \times \mathcal{U} \to \mathbb{R}^n$ is a smooth function describing the dynamics, $t_0 \in [0, T]$ denotes the initial time, and $x_0 \in \mathbb{R}^n$ denotes the initial state. Given a fixed $t_0 \in [0, T]$ and $x_0 \in \mathbb{R}^n$, solving the open-loop optimal control problem means finding a control path $u^*: [t_0, T] \to \mathcal{U}$ that minimizes
$$J(u; t_0, x_0) = \int_{t_0}^{T} L(t, x(t), u(t))\,dt + M(x(T)) \quad \text{s.t. } (x, u) \text{ satisfy the system (1)},$$
where $L: [0, T] \times \mathbb{R}^n \times \mathcal{U} \to \mathbb{R}$ and $M: \mathbb{R}^n \to \mathbb{R}$ are the running cost and terminal cost, respectively. We use $x^*(t; t_0, x_0)$ and $u^*(t; t_0, x_0)$ to denote the optimal state and control with the specified initial time $t_0$ and initial state $x_0$, which emphasizes the dependence of the open-loop optimal solutions on the initial time and state. We assume the open-loop optimal control problem is well-posed, i.e., the solution always exists and is unique. In contrast to the open-loop control, which is a function of time only, a closed-loop control is a function of the time-state pair $(t, x)$. Given a closed-loop control $u: [0, T] \times \mathbb{R}^n \to \mathcal{U}$, we can induce a family of open-loop controls for all possible initial time-state pairs $(t_0, x_0)$: $u(t; t_0, x_0) = u(t, x_u(t; t_0, x_0))$, where $x_u(t; t_0, x_0)$ is defined by the following initial value problem (IVP):
$$\mathrm{IVP}(x_0, t_0, T, u): \quad \dot{x}_u(t; t_0, x_0) = f\big(t, x_u(t; t_0, x_0), u(t, x_u(t; t_0, x_0))\big), \quad t \in [t_0, T], \qquad x_u(t_0; t_0, x_0) = x_0. \tag{2}$$
To ease the notation, we always use the same character to denote the closed-loop control function and the induced family of open-loop controls. Whether a closed-loop or open-loop control is meant can be inferred from the arguments and will not cause confusion. It is well known in the classical optimal control theory (see, e.g.
Liberzon (2011)) that there exists a closed-loop optimal control function $u^*: [0, T] \times \mathbb{R}^n \to \mathcal{U}$ such that for any $t_0 \in [0, T]$ and $x_0 \in \mathbb{R}^n$, $u^*(t; t_0, x_0) = u^*(t, x^*(t; t_0, x_0))$, which means the family of open-loop optimal controls with all possible initial time-state pairs can be induced from the closed-loop optimal control function. Since IVPs can be easily solved, one can handle the open-loop control problems with all possible initial time-state pairs if a good closed-loop control solution is available. Moreover, the closed-loop control is more robust to dynamic disturbances and model misspecification, and hence it is much more powerful in applications. In this paper, our goal is to find a near-optimal closed-loop control $\hat{u}$ such that for $x_0 \in \mathcal{X} \subset \mathbb{R}^n$, with $\mathcal{X}$ being the set of initial states of interest, the associated total cost is near-optimal, i.e., $|J(\hat{u}(\,\cdot\,; 0, x_0); 0, x_0) - J(u^*(\,\cdot\,; 0, x_0); 0, x_0)|$ is small.
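As a concrete illustration of how a closed-loop control induces open-loop trajectories through the IVP, the sketch below integrates $\dot{x} = f(t, x, u(t, x))$ with a standard RK4 scheme. This is our own illustration, not the paper's implementation; the function names are hypothetical, and the demo uses the 1-d LQR of Section 5, whose closed-loop optimal control and optimal state have closed forms.

```python
import numpy as np

def solve_ivp_closed_loop(f, u, x0, t0, T, n_steps=200):
    """RK4 integration of the closed-loop system x' = f(t, x, u(t, x)).

    f : dynamics, f(t, x, u) -> dx/dt
    u : closed-loop control, u(t, x) -> control
    Returns (times, states) on a uniform grid of n_steps + 1 points.
    """
    ts = np.linspace(t0, T, n_steps + 1)
    h = (T - t0) / n_steps
    xs = np.empty((n_steps + 1, np.size(x0)))
    xs[0] = x0
    g = lambda t, x: f(t, x, u(t, x))  # closed-loop vector field
    for k in range(n_steps):
        t, x = ts[k], xs[k]
        k1 = g(t, x)
        k2 = g(t + h / 2, x + h / 2 * k1)
        k3 = g(t + h / 2, x + h / 2 * k2)
        k4 = g(t + h, x + h * k3)
        xs[k + 1] = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return ts, xs

# Demo on the 1-d LQR of Section 5, where f(t, x, u) = u: the induced
# trajectory should match the analytic optimal state
# x*(t) = x0 (T(T - t) + 1) / (T^2 + 1).
T_hor = 2.0
u_star = lambda t, x: -T_hor / (T_hor * (T_hor - t) + 1.0) * x
ts, xs = solve_ivp_closed_loop(lambda t, x, u: u, u_star,
                               np.array([1.0]), 0.0, T_hor)
```

With these settings the terminal state should be close to $1/(T^2+1) = 0.2$, matching the closed-form optimal trajectory.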

2.2. SUPERVISED-LEARNING-BASED APPROACH FOR CLOSED-LOOP OPTIMAL CONTROL PROBLEM

Here we briefly explain the idea of the supervised-learning-based approach for the closed-loop optimal control problem. The first step is to generate training data by solving the open-loop optimal control problems with zero initial time and initial states randomly sampled from $\mathcal{X}$. The training data is then collected by evenly choosing points along every optimal path: $D = \{((t_{i,j}, x_{i,j}), u_{i,j})\}_{1 \le i \le M, 1 \le j \le N}$, where $M$ and $N$ are the number of sampled training trajectories and the number of points chosen on each path, respectively. Finally, a function approximator (mostly a neural network, as considered in this work) with parameters $\theta$ is trained by solving the regression problem
$$\min_\theta \frac{1}{MN} \sum_{i=1}^{M} \sum_{j=1}^{N} \big\| u_{i,j} - u_{\mathrm{NN}}(t_{i,j}, x_{i,j}; \theta) \big\|^2, \tag{3}$$
which gives the NN controller $u_{\mathrm{NN}}$.
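The two steps above (data generation, then regression) can be sketched in a self-contained toy example. This is our own illustration: the closed-form 1-d LQR solution of Section 5 plays the role of the open-loop solver, and a linear-in-features least-squares fit stands in for the neural network; all names and settings below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
T, M, N = 2.0, 50, 20          # horizon, #trajectories, #points per path

def u_star(t, x):
    # LQR-optimal control of Section 5, standing in for an open-loop solver
    return -T * x / (T * (T - t) + 1.0)

# Step 1: "solve" M open-loop problems from random x0 and keep N evenly
# spaced points (t, x*(t), u*(t)) on each optimal path.
data = []
for _ in range(M):
    x0 = rng.standard_normal()
    for t in np.linspace(0.0, T, N):
        x = x0 * (T * (T - t) + 1.0) / (T * T + 1.0)   # optimal state
        data.append((t, x, u_star(t, x)))
data = np.array(data)                                   # shape (M*N, 3)

# Step 2: regression of u on (t, x). A neural network would replace this
# linear-in-features model in practice.
feats = np.stack([data[:, 0] * data[:, 1], data[:, 1],
                  data[:, 0], np.ones(len(data))], axis=1)
theta, *_ = np.linalg.lstsq(feats, data[:, 2], rcond=None)

def u_nn(t, x):
    return theta @ np.array([t * x, x, t, 1.0])
```

The fitted controller is only an approximation (the optimal gain is nonlinear in $t$), which already hints at why the learned controller deviates from the optimal one when deployed in the dynamics.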

3. IVP ENHANCED SAMPLING METHOD

Although the vanilla supervised-learning-based approach can achieve good performance on certain problems (Nakamura-Zimmerer et al., 2021a), it is observed that its performance on complex problems is not satisfactory (see Nakamura-Zimmerer et al. (2021b); Zang et al. (2022) and the examples below). One of the crucial reasons that the vanilla method fails is the distribution mismatch phenomenon. To illustrate this phenomenon, let $\mu_0$ be the distribution of the initial states of interest and $u: [0, T] \times \mathbb{R}^n \to \mathcal{U}$ be a closed-loop control function. We use $\mu_u(t)$ to denote the distribution of $x(t)$ generated by $u$: $\dot{x}(t) = f(t, x(t), u(t, x(t)))$, $x_0 \sim \mu_0$. Note that in the training process (3), the distribution of the state at time $t$ is $\mu_{u^*}(t)$, the state distribution generated by the closed-loop optimal control. On the other hand, when we apply the learned NN controller in the dynamics, the distribution of the input state of $u_{\mathrm{NN}}$ at time $t$ is $\mu_{u_{\mathrm{NN}}}(t)$. The error between the states driven by $u^*$ and $u_{\mathrm{NN}}$ accumulates and makes the discrepancy between $\mu_{u^*}(t)$ and $\mu_{u_{\mathrm{NN}}}(t)$ increase over time. Hence, the training data fails to represent the states encountered in the controlled process, and the error between $u^*$ and $u_{\mathrm{NN}}$ increases dramatically when $t$ is large. See Figures 1 (left) and 2 below for an illustration of this phenomenon. To overcome this problem, we propose the following IVP enhanced sampling method. The key idea is to iteratively improve the quality of the NN controller by updating the training dataset with the states seen by the NN controller at earlier times. Given predesigned (not necessarily evenly spaced) temporal grid points $0 = t_0 < t_1 < \cdots < t_K = T$, we first generate a training dataset $S_0$ by solving open-loop optimal control problems on the time interval $[0, T]$ starting from points in $X_0$, a set of initial points sampled from $\mu_0$, and train the initial model $\hat{u}_0$.
Under the control of $\hat{u}_0$, the generated trajectory deviates more and more from the optimal trajectory. We therefore stop at time $t_1$: we compute the IVPs on the time interval $[0, t_1]$ using $\hat{u}_0$ as the closed-loop control and the points in $X_0$ as the initial points, and then on the interval $[t_1, T]$ solve new optimal paths that start at the endpoints of these IVPs. The new training dataset $S_1$ is then composed of the new data (between $t_1$ and $T$) and the data before time $t_1$ in the dataset $S_0$, and we train a new model $\hat{u}_1$ using $S_1$. We repeat this process at the predesigned temporal grid points $t_2, t_3, \ldots$ until we reach $T$. In other words, in each iteration, the adaptively sampled data replaces the corresponding data (defined on the same time interval) in the training dataset, so the size of the training data remains the same. The whole process is formulated as Algorithm 1, and we refer to Figure 1 for an illustration of the algorithm's mechanism. We call this method the IVP enhanced sampling method because the initial points of the open-loop optimal control problems are sampled by solving IVPs with the up-to-date NN controller. It is worth noting that the later iterations require less effort in labeling data, as the trajectories are shorter and thus easier to solve. It is also worth mentioning that the IVP enhanced sampling method is versatile enough to be combined with other improvements for closed-loop optimal control problems, such as efficient open-loop control problem solvers (Kang et al., 2021; Zang et al., 2022) or specialized neural network structures (Nakamura-Zimmerer et al., 2020; 2021b; 2022).

Algorithm 1: IVP enhanced sampling method for closed-loop optimal control design
1: Input: initial distribution $\mu_0$, number of temporal grid points $K$, temporal grid points $0 = t_0 < t_1 < \cdots < t_K = T$, time step $\delta$, number of initial points $N$.
2: Initialize: $S_{-1} = \emptyset$, $\hat{u}_{-1}(t, x) = 0$.
3: Independently sample $N$ initial points from $\mu_0$ to get an initial point set $X_0$.
4: for $i = 0, 1, \ldots, K-1$ do
5:   For any $x_0 \in X_0$, compute $\mathrm{IVP}(x_0, 0, t_i, \hat{u}_{i-1})$ according to (2). ▷ Exploration
6:   Set $X_i = \{x_{\hat{u}_{i-1}}(t_i; 0, x_0) : x_0 \in X_0\}$.
7:   For any $x_i \in X_i$, call the open-loop optimal control solver to obtain $x^*(t; t_i, x_i)$ and $u^*(t; t_i, x_i)$ for $t \in [t_i, T]$. ▷ Labeling
8:   Set $\hat{S}_i = \{(t, x^*(t; t_i, x_i), u^*(t; t_i, x_i)) : x_i \in X_i,\ t \in [t_i, T],\ (t - t_0)/\delta \in \mathbb{N}\}$.
9:   Set $S_i = \hat{S}_i \cup \{(t, x, u) : t < t_i, (t, x, u) \in S_{i-1}\}$.
10:  Train $\hat{u}_i$ with dataset $S_i$. ▷ Training
11: end for
12: Output: $\hat{u}_{K-1}$.

One design choice in the IVP enhanced sampling method concerns the network structure: whether to share the same network among different time intervals. We choose to use the same network for all time intervals in the following numerical examples, but the opposite choice is also feasible.
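To make the exploration-labeling-training loop concrete, the sketch below instantiates Algorithm 1 on the 1-d LQR of Section 5, where both the open-loop solver and the IVP have closed forms. This is our own illustrative code, not the paper's implementation: for a real problem, `open_loop` would call a numerical optimal-control solver, `ivp` a proper ODE integrator, and `train` would fit a neural network instead of a least-squares model; the grid, step size, and sample count are arbitrary choices.

```python
import numpy as np

T, delta, N = 2.0, 0.1, 30
grid = [0.0, 1.0, 2.0]                      # temporal grid 0 = t0 < t1 < t2 = T
rng = np.random.default_rng(1)
X0 = rng.standard_normal(N)                 # initial point set X_0

def open_loop(ti, xi):
    """Exact optimal path from (ti, xi): rows (t, x*(t), u*(t)) on [ti, T]."""
    ts = np.arange(ti, T + 1e-9, delta)
    xs = xi * (T * (T - ts) + 1.0) / (T * (T - ti) + 1.0)
    us = np.full_like(ts, -T * xi / (T * (T - ti) + 1.0))
    return np.stack([ts, xs, us], axis=1)

def train(S):
    """Least-squares fit u(t, x) ~ theta . [t x, x, 1]; stand-in for an NN."""
    A = np.stack([S[:, 0] * S[:, 1], S[:, 1], np.ones(len(S))], axis=1)
    theta, *_ = np.linalg.lstsq(A, S[:, 2], rcond=None)
    return lambda t, x: theta @ np.array([t * x, x, 1.0])

def ivp(u, x0, t1, n=100):
    """Euler integration of the closed loop x' = u(t, x) (here f = u)."""
    h, x = t1 / n, x0
    for k in range(n):
        x = x + h * u(k * h, x)
    return x

S = np.concatenate([open_loop(0.0, x0) for x0 in X0])        # dataset S_0
u_hat = train(S)                                             # controller u_0
for ti in grid[1:-1]:                                        # t_1, ..., t_{K-1}
    Xi = [ivp(u_hat, x0, ti) for x0 in X0]                   # exploration
    new = np.concatenate([open_loop(ti, xi) for xi in Xi])   # labeling
    S = np.concatenate([S[S[:, 0] < ti], new])               # replace tail data
    u_hat = train(S)                                         # training
```

Note that the last line of the loop mirrors lines 8-9 of Algorithm 1: data after $t_i$ is replaced by the adaptively sampled paths, while data before $t_i$ is kept, so the dataset size stays fixed.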

4. COMPARISON WITH OTHER ADAPTIVE SAMPLING METHODS

In this section, we review the existing literature on adaptive sampling methods for the closed-loop optimal control problem. We start with the methods in imitation learning (Hussein et al., 2017), which aim to learn the expert's control function. Our task can be viewed as an imitation problem if we take the optimal control function as the expert's control. By the same argument, the distribution mismatch phenomenon also exists therein. However, there is a key difference in the mechanism of data generation between the two settings: in imitation learning, it is often assumed that one can easily access the expert's behavior at every time-state pair, while in the optimal control problem, such access is much more computationally expensive since one must solve an open-loop optimal control problem. This difference fundamentally affects algorithm design. Take the forward training algorithm (Ross & Bagnell, 2010), a popular method in imitation learning for mitigating distribution mismatch, as an example. To apply it to the closed-loop optimal control problem, we first need to consider a discrete-time version of the problem with a sufficiently fine time grid: $0 = t_0 < t_1 < \cdots < t_{K'} = T$. At each time step $t_i$, we learn a policy function $\bar{u}_i : \mathbb{R}^n \to \mathcal{U}$, where the states $x$ in the training data are generated by sequentially applying $\bar{u}_0, \ldots, \bar{u}_{i-1}$ and the labels are generated by solving the open-loop optimal solutions with $(t_i, x)$ as the initial time-state pair. Hence, the open-loop control solver is called a number of times proportional to the number of discretized time steps $K'$, and only the first value on each optimal control path is used for learning. In contrast, in Algorithm 1, we can use many more values along the optimal control paths in learning, which allows its temporal grid for sampling to be much coarser than the grid in the forward training algorithm, and the total cost of solving open-loop optimal control problems is much lower.
Another popular method in imitation learning is DAGGER (Dataset Aggregation) (Ross et al., 2011), which can also be applied to help sampling in the closed-loop optimal control problem. In DAGGER, in order to improve the current closed-loop controller $\hat{u}$, one solves IVPs using $\hat{u}$ over $[0, T]$ starting from various initial states and collects the states on a time grid $0 < t_1 < \cdots < t_{K-1} < T$. The open-loop control problems are then solved with all the collected time-state pairs as initial time-state pairs, and all the corresponding optimal solutions are used to construct a dataset for learning a new controller. The process can be repeated until a good controller is obtained. The time-state selection in DAGGER is also related to the distribution mismatch phenomenon, but is somewhat different from the IVP enhanced sampling. Take the data collection using the controller $\hat{u}_1$ in the first iterative step as an example. The IVP enhanced sampling focuses on the states at the time grid point $t_1$, while DAGGER collects states at all the grid points. If $\hat{u}_1$ is still far from optimal, the data collected at later grid points may be irrelevant to, or even mislead, training due to error accumulation in the states. In Appendix G, we report more theoretical and numerical comparisons between DAGGER and the IVP enhanced sampling, which indicate that DAGGER performs less satisfactorily. Besides the forward training algorithm and DAGGER, there are other adaptive sampling methods for closed-loop optimal control problems. Nakamura-Zimmerer et al. (2021a) propose an adaptive sampling method that prefers initial points with large gradients of the value function, as the value function tends to be steep and hard to learn around these points. Landry et al. (2021) propose to sample initial points on which the errors between the values predicted by the NN and the optimal values are large.
These two adaptive sampling methods both focus on finding points that are not learned well, but ignore the accumulation of the distribution mismatch over time brought by the controlled dynamics. We will show in Section 6 that the IVP enhanced sampling method outperforms such sampling methods.

5. THEORETICAL ANALYSIS ON AN LQR EXAMPLE

In this section, we analyze the superiority of the IVP enhanced sampling method by considering the following uni-dimensional linear quadratic regulator (LQR) problem:
$$\min_{x(t), u(t)} \frac{1}{T}\int_{t_0}^{T} |u(t)|^2\,dt + |x(T)|^2 \quad \text{s.t. } \dot{x}(t) = u(t),\ t \in [t_0, T],\ x(t_0) = x_0,$$
where $T$ is a positive integer, $t_0 \in [0, T]$ and $x_0 \in \mathbb{R}$. Classical theory on linear quadratic control (see, e.g., Sontag (2013)) gives the following explicit linear form of the optimal controls:
$$u^*(t; t_0, x_0) = -\frac{T}{T(T - t_0) + 1}\, x_0 \quad \text{(open-loop optimal control)}, \qquad u^*(t, x) = -\frac{T}{T(T - t) + 1}\, x \quad \text{(closed-loop optimal control)}.$$
We consider the following two models to approximate the closed-loop optimal control function with parameter $\theta$:
$$\text{Model 1:}\quad u_\theta(t, x) = -\frac{T}{T(T - t) + 1}\, x + b(t), \quad \text{where } \theta = \{\theta_t\}_{0 \le t \le T} = \{b(t)\}_{0 \le t \le T}. \tag{4}$$
$$\text{Model 2:}\quad u_\theta(t, x) = a(t) x + b(t), \quad \text{where } \theta = \{\theta_t\}_{0 \le t \le T} = \{(a(t), b(t))\}_{0 \le t \le T}. \tag{5}$$
Since there will be no error in learning a linear model when the data is exact, to mimic the errors encountered when learning neural networks, throughout this section we assume the data has certain noise. To be precise, for any $t_0 \in [0, T]$ and $x_0 \in \mathbb{R}$, the open-loop optimal control solver gives the following approximated optimal path:
$$\hat{u}(t; t_0, x_0) = -\frac{T}{T(T - t_0) + 1}\, x_0 + \epsilon Z, \qquad \hat{x}(t; t_0, x_0) = x_0 + \int_{t_0}^{t} \hat{u}(s; t_0, x_0)\,ds = \frac{T(T - t) + 1}{T(T - t_0) + 1}\, x_0 + (t - t_0)\epsilon Z,$$
where $\epsilon > 0$ is a small positive number indicating the scale of the error and $Z$ is a normal random variable with mean $m$ and variance $\sigma^2$. In other words, the obtained open-loop control is still constant along each path, just like the optimal open-loop control, but perturbed by a random constant. The random variables in different approximated optimal paths starting from different $t_0$ or $x_0$ are assumed to be independent. We compare the vanilla supervised-learning-based method and the IVP enhanced sampling method theoretically for the first model and numerically for the second model (in Appendix B).
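As a quick numerical sanity check of the closed forms above (our own illustration, not part of the paper's analysis): since the optimal open-loop control is constant in time, restricting to constant controls $u$ reduces the cost to a scalar function $J(u) = \frac{T - t_0}{T} u^2 + (x_0 + (T - t_0) u)^2$, whose minimizer and minimum should reproduce the stated formulas.

```python
import numpy as np

# Cost of a constant control u on [t0, T] for the 1-d LQR with xdot = u:
#   running cost (1/T) * (T - t0) * u^2, terminal state x0 + (T - t0) u.
T, t0, x0 = 4.0, 1.0, 1.5                      # arbitrary test values

def J_const(u):
    return (T - t0) / T * u**2 + (x0 + (T - t0) * u) ** 2

# Closed-form minimizer and optimal value from the section above.
u_star = -T * x0 / (T * (T - t0) + 1.0)
v_star = x0**2 / (T * (T - t0) + 1.0)

# Grid search around u_star should recover it (spacing 1e-3).
us = np.linspace(u_star - 1.0, u_star + 1.0, 2001)
u_best = us[np.argmin(J_const(us))]
```

Setting $dJ/du = 0$ gives $u(1/T + T - t_0) = -x_0$, i.e., exactly the displayed open-loop formula, and substituting back yields the value $x_0^2 / (T(T - t_0) + 1)$.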
In the vanilla method, we randomly sample $NT$ initial points from a standard normal distribution and use the corresponding optimal paths to learn the controller. In the IVP enhanced sampling method, we randomly sample $N$ initial points from a standard normal distribution, set the temporal grid points for sampling as $0 < 1 < \cdots < T - 1 < T$, and perform Algorithm 1. In both methods, the open-loop optimal control solver is called $NT$ times in total. Theorem 1 compares the performance of the vanilla method and the IVP enhanced sampling method under Model 1 (4). A more detailed statement and proof can be found in Appendix B. This theorem shows that both the distribution difference and the performance gap relative to the optimal solution grow with $T$ for the vanilla method, while they are bounded by a constant for the IVP enhanced sampling method. Therefore, compared to the vanilla method, the IVP enhanced sampling method mitigates the distribution mismatch phenomenon and significantly improves the performance when $T$ is large.

Theorem 1. Under Model 1 (4), let $u_o$, $u_v$ and $u_a$ be the optimal controller, the controller learned by the vanilla method, and the controller learned by the IVP enhanced sampling method, respectively. Define the IVPs
$$\dot{x}_s(t) = u_s(t) = u_s(t, x_s(t)), \quad x_s(0) = x_{\mathrm{init}}, \quad 0 \le t \le T, \quad s \in \{o, v, a\}.$$
1. Suppose $x_{\mathrm{init}}$ is a random variable following a standard normal distribution, independent of the initial points and noises in the training process. Let $\{x_v^j(t)\}_{j=1}^{NT}$ and $\{x_a^j(t)\}_{j=1}^{N}$ be the state variables in the training data of the vanilla method and of the last iteration of the IVP enhanced sampling method, respectively. Then $x_v^j(t)$, $x_a^j(t)$, $x_v(t)$ and $x_a(t)$ are normal random variables, $\mathbb{E}x_v^j(t) = \mathbb{E}x_v(t)$, $\mathbb{E}x_a^j(t) = \mathbb{E}x_a(t)$, and
$$\big|\mathbb{E}|x_v^j(t)|^2 - \mathbb{E}|x_v(t)|^2\big| = \Big(1 - \frac{1}{NT}\Big)\epsilon^2 t^2, \qquad \big|\mathbb{E}|x_a^j(t)|^2 - \mathbb{E}|x_a(t)|^2\big| \le \epsilon^2.$$
2. Suppose $x_{\mathrm{init}}$ is a fixed initial point, and define the total cost $J_s = \frac{1}{T}\int_0^T |u_s(t)|^2\,dt + |x_s(T)|^2$, $s \in \{o, v, a\}$. Then
$$\mathbb{E}J_v - J_o = (T^2 + 1)\Big(m^2 + \frac{\sigma^2}{NT}\Big)\epsilon^2, \qquad \mathbb{E}J_a - J_o \le 3\Big(m^2 + \frac{\sigma^2}{N}\Big)\epsilon^2.$$

6. THE OPTIMAL LANDING PROBLEM OF QUADROTOR

In this section, we test the IVP enhanced sampling method on the optimal landing problem of a quadrotor. We consider the full quadrotor dynamic model with a 12-dimensional state variable and a 4-dimensional control variable (Bouabdallah et al., 2004; Madani & Benallegue, 2006; Mahony et al., 2012). We aim to find optimal landing paths from some initial states $x_0$ to a target state $x_T = 0$ with minimum control effort during a fixed time duration $T = 16$. The open-loop optimal solutions are obtained by solving the corresponding two-point boundary value problems with the space-marching technique (Zang et al., 2022). See Appendix C, D and E for more details. We sample $N = 500$ initial points for generating training data and use a fully-connected neural network to approximate the optimal control. The temporal grid points on which we do IVP enhanced sampling are $0 < 10 < 14 < 16$. After learning, we use the learned models to solve the initial value problem at the 500 training initial points and examine the similarity between the paths controlled by the NN controller and their corresponding training data. In Figure 2, the left sub-figure shows the average pointwise distance between the data reached by the NN controller and the corresponding training data at different times. The right sub-figure shows the maximum mean discrepancy (Borgwardt et al., 2006) between these two datasets using the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2 / 2)$. In both figures, there are jumps at $t = 10$ and $t = 14$ since the NN-controlled path is continuous in time while the training data is discontinuous at the locations where we do IVP enhanced sampling. It can be seen that without adaptive sampling (after iteration 0), the discrepancy between the states reached by the NN controller and the training data is large. With our method, they get closer to each other as the iterations proceed. We then compare our method with four other methods. As data generation is the most time-consuming part, for the sake of fairness, we keep the number of open-loop problems solved the same (1500) among all the methods (except the last one). The first method trains a model on 1500 directly sampled optimal paths (called vanilla sampling). The second method is the adaptive sampling (AS) method proposed by Nakamura-Zimmerer et al. (2021a) that chooses initial points with large gradient norms. This is equivalent to choosing initial points whose optimal control has large norms, and we refer to this method as AS w. large u. With a small modification, the third method, AS w. large v, chooses initial points whose total costs are large under the latest NN controller. The last adaptive sampling method, AS w. bad v, is a variant of the SEAGuL algorithm (Sample Efficient Adversarially Guided Learning) (Landry et al., 2021). The original SEAGuL algorithm uses a few gradient-ascent updates to find initial points with large gaps between the learned values and the optimal values. Here we give this method a larger computational budget to solve more open-loop optimal control problems to find such initial points. The cumulative distribution functions of the cost ratios of the above methods are shown in Figure 3 (right), which clearly demonstrates the superiority of the IVP enhanced sampling method. More details about the implementation and results of these methods are provided in Appendix E.
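The maximum mean discrepancy used for Figure 2 (right) admits a short sample-based estimator. The sketch below is our own minimal implementation of the (biased) squared-MMD estimator with the Gaussian kernel $k(x, y) = \exp(-\|x - y\|^2/2)$; the paper does not specify its exact implementation details.

```python
import numpy as np

def mmd2(X, Y):
    """Biased squared MMD between samples X (n, d) and Y (m, d) with the
    Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2)."""
    def kernel(A, B):
        # Pairwise squared distances via the expansion ||a-b||^2 = ||a||^2 + ||b||^2 - 2 a.b
        sq = (np.sum(A**2, axis=1)[:, None]
              + np.sum(B**2, axis=1)[None, :] - 2.0 * A @ B.T)
        return np.exp(-sq / 2.0)
    return (kernel(X, X).mean() + kernel(Y, Y).mean()
            - 2.0 * kernel(X, Y).mean())
```

In the setting of this section, `X` would hold the states reached by the NN controller at a given time $t$ and `Y` the training states at the same $t$; identical samples give an MMD of zero, while a growing value over $t$ signals the distribution mismatch.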
We also test the NN controllers obtained from the different sampling methods in the presence of observation noise, since sensors have measurement errors in reality. The detailed results are provided in Appendix E. We observe that when measurement errors exist, closed-loop controllers are more reliable than the open-loop controller, and the one trained by the IVP enhanced sampling method performs best among all the considered methods. Finally, we test 4 different choices of temporal grid points in Algorithm 1 and train networks on the same 500 initial points. The results listed in Appendix E show that our algorithm is robust to the choice of temporal grid points.

7. THE OPTIMAL REACHING PROBLEM OF A 7-DOF MANIPULATOR

In this section, we consider the optimal reaching problem on a 7-DoF torque-controlled manipulator, the KUKA LWR iiwa R820 14 (Kuka; Bischoff et al., 2010). See the figure in Appendix F for an illustration of this task. Let $x = (q, v) = (q, \dot{q})$ be the state of the system, where $q, v \in \mathbb{R}^7$ are the joint angles and velocities of the manipulator, respectively. Our goal is to find the optimal torque $u \in \mathcal{U} \subset \mathbb{R}^7$ that drives the manipulator from $x_0 = (q_0, 0)$ to $x_1 = (q_1, 0)$ in $T = 0.8$ seconds while minimizing a quadratic-type cost. See Appendix F for the details of the problem and the experiment configurations. To obtain training data, we use differential dynamic programming (Jacobson & Mayne, 1970) as implemented in the Crocoddyl library (Mastalli et al., 2020). We use the QRNet (Nakamura-Zimmerer et al., 2020; 2021b) as the backbone network in this example (see Appendix F for details) and evaluate networks trained in six different ways: four of them are trained using Algorithm 1 with different choices of temporal grid points for adaptive sampling, and two of them are trained by the vanilla sampling method with 300 (Vanilla300) and 900 (Vanilla900) trajectories, respectively. All four networks (AS1-AS4) trained by the IVP enhanced sampling start from initial training data of 100 trajectories and use three iterations ($K = 3$), i.e., each of them requires solving the open-loop problem 300 times in total to generate training data. The four networks differ only in the temporal grid points ($t_1$ and $t_2$) used for enhanced sampling. Each experiment has been independently run five times and we report the average results. We plot the cumulative distribution functions of the cost ratios (clipped at 2.0) between the NN-controlled cost and the optimal cost in Figure 4 (left). More details and results are provided in Appendix F.
We find that adding more data in the vanilla sampling method has a very limited effect, while the IVP enhanced sampling greatly improves the performance. Furthermore, this improvement is again robust to the choice of temporal grid points (AS1-AS4). In addition, we also test the performance of the network trained by our adaptive sampling method in the presence of measurement errors. At each simulation timestep, we sample the disturbances uniformly in $[-\sigma, \sigma]^{14}$ for $\sigma = 10^{-5}, 10^{-4}, 10^{-3}$ and add them to the input states of the network. See Figure 4 (right) for the results on the best model trained among AS1-AS4. We find that the NN controller performs well at $\sigma = 10^{-4}$, and our controller achieves a cost ratio less than 2.0 in more than 60% of the cases at $\sigma = 10^{-3}$.

8. CONCLUSION AND FUTURE WORK

In this work, we propose the IVP enhanced sampling method to overcome the distribution mismatch problem in supervised-learning-based approaches to the closed-loop optimal control problem. Both theoretical and numerical results show that the IVP enhanced sampling method significantly improves the performance of the learned NN controller and outperforms other adaptive sampling methods. There are a few directions worth exploring in future work. In the IVP enhanced sampling method, one choice we need to make is the set of temporal grid points for adaptive sampling. We recommend that at each iteration, one compute the distance between the training data and the data reached by the NN controller at different times (see Figure 2 for an example) and choose the time at which the distance starts to increase quickly as the temporal grid point for adaptive sampling. We observe that the IVP enhanced sampling method performs well using this strategy, but it would be ideal to make this process more systematic. Another direction is to design more effective approaches to utilizing the training data. In Algorithm 1 (lines 8-9), at each iteration, we replace part of the training data with the newly collected data, and hence some optimal labels, which are costly to obtain, are thrown away. An alternative choice is to augment the data directly, i.e., setting $S_i = \hat{S}_i \cup S_{i-1}$ in line 9. Numerically, we observe that this choice gives performance similar to the version used in Algorithm 1, which suggests that so far the dropped data provides little value for training; see Appendix E and F for details. But it is still possible to find smarter ways to utilize this data to improve performance. We also need to evaluate the IVP enhanced sampling method on problems with more features, such as state/control constraints.
Furthermore, the IVP enhanced sampling method can be straightforwardly applied to learning general dynamics from multiple trajectories as the controlled system under the optimal closed-loop policy can be viewed as a special dynamical system. It is an interesting direction to investigate its performance in such general settings. Finally, theoretical analysis beyond the LQR setting is also an interesting and important problem. 
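As a concrete illustration of the loop that Algorithm 1 describes (solve IVPs up to the next temporal grid point, relabel from the reached states, and replace the tail of the dataset), the following is a minimal runnable sketch on the 1-D LQR toy problem from Appendix B, with exact open-loop labels and a per-time-bin linear model. All helper names and the model class are our own illustrative choices, not the paper's implementation.

```python
# Toy instance: dynamics x' = u, cost (1/T) \int u^2 dt + x(T)^2, whose
# open-loop optimal control from (t0, x0) is the constant -T*x0/(T(T-t0)+1).
import random

T, DT = 4.0, 0.01
GRID = [0.0, 1.0, 2.0, 3.0]              # temporal grid points t_0 < ... < t_{K-1}

def solve_open_loop(t0, x0):
    """Closed-form open-loop solution; returns time-state-action tuples."""
    u = -T * x0 / (T * (T - t0) + 1.0)   # constant in time
    data, x, t = [], x0, t0
    while t < T - 1e-9:
        data.append((t, x, u))
        x += DT * u                       # Euler is exact for constant u
        t += DT
    return data

def fit_policy(dataset):
    """Least-squares fit of u = a_k * x on each time bin of width DT."""
    bins, coef = {}, {}
    for (t, x, u) in dataset:
        bins.setdefault(round(t / DT), []).append((x, u))
    for k, pts in bins.items():
        num = sum(x * u for x, u in pts)
        den = sum(x * x for x, _ in pts) + 1e-12
        coef[k] = num / den
    return lambda t, x: coef.get(round(t / DT), 0.0) * x

def rollout(policy, x0, t_end):
    x, t = x0, 0.0
    while t < t_end - 1e-9:
        x += DT * policy(t, x)
        t += DT
    return x

def cost(policy, x0):
    x, t, c = x0, 0.0, 0.0
    while t < T - 1e-9:
        u = policy(t, x)
        c += DT * u * u / T
        x += DT * u
        t += DT
    return c + x * x

rng = random.Random(0)
inits = [rng.gauss(0.0, 1.0) for _ in range(20)]
dataset = [tup for x0 in inits for tup in solve_open_loop(0.0, x0)]
policy = fit_policy(dataset)
for t_i in GRID[1:]:
    # reach t_i with the current policy, relabel from the reached states,
    # and replace the tail of the dataset (cf. Algorithm 1, lines 8-9)
    reached = [rollout(policy, x0, t_i) for x0 in inits]
    new_tail = [tup for x in reached for tup in solve_open_loop(t_i, x)]
    dataset = [tup for tup in dataset if tup[0] < t_i - 1e-9] + new_tail
    policy = fit_policy(dataset)

opt = 1.0 / (T * T + 1.0)                 # J_o for x_init = 1
ratio = cost(policy, 1.0) / opt
```

With exact labels, this toy recovers the optimal feedback gain, so the final rollout cost is close to the optimal cost; the point is only to show the data flow of the replacement step.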

A SUPERVISED-LEARNING-BASED APPROACH THROUGH THE LENS OF ELT ALGORITHM

As pointed out in the introduction, the supervised-learning-based approach empowered by adaptive sampling can be interpreted as an instance of the exploration-labeling-training (ELT) algorithms (Zhang et al., 2018; E et al., 2021) for closed-loop optimal control problems. Through the lens of the ELT algorithm, there are at least three aspects in which to improve the efficiency of the supervised-learning-based approach for the closed-loop optimal control problem:

• Use adaptive sampling methods. Adaptive sampling methods aim to sequentially choose the time-state pairs based on previous results to improve the performance of the NN controller. This corresponds to the first step of the ELT algorithm and is the main focus of this work.

• Improve the efficiency of data generation, i.e., of solving the open-loop optimal control problems. Although the open-loop optimal control problem is much easier than the closed-loop one, its time cost cannot be neglected, and the efficiency varies significantly across methods. This corresponds to the second step of the ELT algorithm, and we refer to Kang et al. (2021) for a detailed survey.

• Improve the learning process of the neural networks. This corresponds to the third step of the ELT algorithm. The recent works Nakamura-Zimmerer et al. (2020; 2021b; 2022) focus on the structure of the neural networks and design a special ansatz such that the NN controller is close to the linear quadratic controller around the equilibrium point, improving the stability of the NN controller.

B DETAILED ANALYSIS OF THE LQR EXAMPLE

In this section, we give the detailed settings for the comparison in Section 5 and the detailed statement and proof of Theorem 1. Throughout this section, all symbols with a hat denote open-loop optimal paths sampled for training, e.g., $\hat{u}_j, \hat{x}_j, \hat{u}^i_j, \hat{x}^i_j$. Let $x$ denote a single state instead of a state trajectory. A clean symbol $x$ without hat or tilde is the IVP solution generated by the specific controller indicated in the subscript; e.g., $x_o$, $x_v$, $x_a$ are trajectories generated by $u_o$, $u_v$, $u_a$, the optimal, vanilla, and IVP enhanced controllers, respectively. The positive integer $j$ in the subscript always denotes the index of the optimal path. Symbols with superscript $i$ are related to the $i$-th iteration of the IVP enhanced sampling method.

For the vanilla method, we first randomly sample $NT$ initial states $\{\hat{x}_j\}_{j=1}^{NT}$ from a standard normal distribution, where $N$ is a positive integer (recalling $T$ is a positive integer). Then $NT$ approximated optimal paths are collected starting at $t_0 = 0$:
$$\hat{u}_j(t) = -\frac{T}{T^2+1}\hat{x}_j + \epsilon Z_j, \qquad \hat{x}_j(t) = \frac{T(T-t)+1}{T^2+1}\hat{x}_j + \epsilon t Z_j, \tag{6}$$
where $\{Z_j\}_{j=1}^{NT}$ are i.i.d. normal random variables with mean $m$ and variance $\sigma^2$, independent of the initial states. Finally, the parameters $\theta$ are learned by solving the least squares problem
$$\min_\theta \int_0^T \sum_{j=1}^{NT} |\hat{u}_j(t) - u_\theta(t, \hat{x}_j(t))|^2\, dt. \tag{7}$$
Optimizing $\theta_t$ independently for each $t$, we have
$$\theta_t = \arg\min_b \sum_{j=1}^{NT} \Big|\hat{u}_j(t) + \frac{T}{T(T-t)+1}\hat{x}_j(t) - b\Big|^2 \quad\text{or}\quad \theta_t = \arg\min_{(a,b)} \sum_{j=1}^{NT} |\hat{u}_j(t) - a\hat{x}_j(t) - b|^2$$
for the first and second models, respectively. We will use $u_v$ to denote the closed-loop controller determined in this way. For the IVP enhanced sampling method, we choose $K = T$ and the temporal grid points $t_i = i$ for $0 \le i \le K$. We first sample $N$ initial points $\{\hat{x}^0_j\}_{j=1}^N$ from the standard normal distribution, denote the parameters optimized at the $i$-th iteration by $\theta^i$, and initialize $\theta^{-1} = 0$.
At the $i$-th iteration ($0 \le i \le T-1$), we use $u_{\theta^{i-1}}$ to solve the IVPs on the time horizon $[0, i]$:
$$\dot{x}^i_j(t) = u_{\theta^{i-1}}(t, x^i_j(t)), \quad x^i_j(0) = \hat{x}^0_j, \quad 1 \le j \le N, \tag{8}$$
and collect $\{\hat{x}^i_j\}_{j=1}^N$ as $\hat{x}^i_j := x^i_j(i)$. Here we omit the controller subscript $a$ for simplicity, i.e., $x^i_j = x^i_{a,j}$. We then compute $N$ approximated optimal paths starting from $\{\hat{x}^i_j\}_{j=1}^N$ at $t_i = i$:
$$\hat{u}^i_j(t) = -\frac{T}{T(T-i)+1}\hat{x}^i_j + \epsilon Z^i_j, \qquad \hat{x}^i_j(t) = \frac{T(T-t)+1}{T(T-i)+1}\hat{x}^i_j + (t-i)\epsilon Z^i_j, \quad t \in [i, T], \tag{9}$$
where $\{Z^i_j\}_{0 \le i \le T-1, 1 \le j \le N}$ are i.i.d. normal random variables with mean $m$ and variance $\sigma^2$, independent of $\{\hat{x}^0_j\}_{j=1}^N$. Since $\hat{u}^i_j$ and $\hat{x}^i_j$ are only defined on $t \in [i, T]$ (for $i \ge 1$), we fill their values on the interval $[0, i)$ with values from the previous iteration:
$$\hat{u}^i_j(t) = \hat{u}^{i-1}_j(t), \qquad \hat{x}^i_j(t) = \hat{x}^{i-1}_j(t), \quad t \in [0, i). \tag{10}$$
Finally, we solve the least squares problem to determine $\theta^i$:
$$\min_\theta \int_0^T \sum_{j=1}^N |\hat{u}^i_j(t) - u_\theta(t, \hat{x}^i_j(t))|^2\, dt. \tag{11}$$
We will use $u_a$ to denote the closed-loop controller $u_{\theta^{T-1}}$, i.e., the controller generated in the $(T-1)$-th iteration of the IVP enhanced sampling method. The theorem below gives the performance of $u_v$ and $u_a$ when using Model 1 (4).

Theorem 1'. Under Model 1 (4), define the IVPs generated by $u_o = u^*$, $u_v$, and $u_a$ as follows:
$$\dot{x}_o(t) = u_o(t) = u_o(t, x_o(t)), \quad x_o(0) = x_{\mathrm{init}}, \quad 0 \le t \le T,$$
$$\dot{x}_v(t) = u_v(t) = u_v(t, x_v(t)), \quad x_v(0) = x_{\mathrm{init}}, \quad 0 \le t \le T,$$
$$\dot{x}_a(t) = u_a(t) = u_a(t, x_a(t)), \quad x_a(0) = x_{\mathrm{init}}, \quad 0 \le t \le T.$$
1. If $x_{\mathrm{init}}$ is a random variable following a standard normal distribution, independent of the initial points $\{\hat{x}_j\}_{j=1}^{NT}$ / $\{\hat{x}^0_j\}_{j=1}^N$ and noises $\{Z_j\}_{j=1}^{NT}$ / $\{Z^i_j\}_{0 \le i \le T-1, 1 \le j \le N}$ in the training process, then the state variables $\hat{x}_j(t)$ in the training data and $x_v(t)$ in the IVP from the vanilla method follow normal distributions and satisfy
$$\mathbb{E}\hat{x}_j(t) = \mathbb{E}x_v(t), \qquad \big|\mathbb{E}|\hat{x}_j(t)|^2 - \mathbb{E}|x_v(t)|^2\big| = \sigma^2\Big(1 - \frac{1}{NT}\Big)\epsilon^2 t^2. \tag{12}$$
On the other hand, the state variables $\hat{x}^{T-1}_j(t)$ in the training data and $x_a(t)$ in the IVP from the IVP enhanced sampling method also follow normal distributions and satisfy, for $t \in [i, i+1)$,
$$\mathbb{E}\hat{x}^{T-1}_j(t) = \mathbb{E}x_a(t), \qquad \big|\mathbb{E}|\hat{x}^{T-1}_j(t)|^2 - \mathbb{E}|x_a(t)|^2\big| = \sigma^2\epsilon^2(t-i)^2\Big(1 - \frac{1}{N}\Big) \le \sigma^2\epsilon^2. \tag{13}$$
2. If $x_{\mathrm{init}}$ is a fixed initial point, define the total costs
$$J_o = \frac{1}{T}\int_0^T |u_o(t)|^2 dt + |x_o(T)|^2, \quad J_v = \frac{1}{T}\int_0^T |u_v(t)|^2 dt + |x_v(T)|^2, \quad J_a = \frac{1}{T}\int_0^T |u_a(t)|^2 dt + |x_a(T)|^2.$$
Then
$$\mathbb{E}J_v - J_o = (T^2+1)\Big(m^2 + \frac{\sigma^2}{NT}\Big)\epsilon^2, \tag{14}$$
$$\mathbb{E}J_a - J_o \le 3\Big(m^2 + \frac{\sigma^2}{N}\Big)\epsilon^2. \tag{15}$$

Proof. We first give the closed-form expressions of $u_v$ and $u_a$ under Model 1 (4). Recalling $\hat{u}_j(t)$ and $\hat{x}_j(t)$ given in equation (6), we have
$$\hat{u}_j(t) = -\frac{T}{T(T-t)+1}\hat{x}_j(t) + \frac{T^2+1}{T(T-t)+1}\epsilon Z_j, \quad 1 \le j \le NT.$$
Therefore, since $u_v$ is learned through the least squares problem (7), we have
$$u_v(t, x) = -\frac{T}{T(T-t)+1}x + \frac{T^2+1}{T(T-t)+1}\epsilon \bar{Z}_v, \quad \text{where } \bar{Z}_v = \frac{1}{NT}\sum_{j=1}^{NT} Z_j. \tag{16}$$
To compute $u_a$, recalling equations (9) and (10), when $0 \le i \le T-1$, $1 \le j \le N$, and $t \in [i, i+1)$, we have
$$\hat{u}^{T-1}_j(t) = \hat{u}^i_j(t) = -\frac{T}{T(T-i)+1}\hat{x}^i_j + \epsilon Z^i_j, \qquad \hat{x}^{T-1}_j(t) = \hat{x}^i_j(t) = \frac{T(T-t)+1}{T(T-i)+1}\hat{x}^i_j + (t-i)\epsilon Z^i_j. \tag{17}$$
Therefore,
$$\hat{u}^{T-1}_j(t) = -\frac{T}{T(T-t)+1}\hat{x}^{T-1}_j(t) + \frac{T(T-i)+1}{T(T-t)+1}\epsilon Z^i_j.$$
Hence, since $u_a$ is learned through the least squares problem (11), we have, for $t \in [i, i+1)$,
$$u_a(t, x) = -\frac{T}{T(T-t)+1}x + \frac{T(T-i)+1}{T(T-t)+1}\epsilon \bar{Z}^i_a, \quad \text{where } \bar{Z}^i_a = \frac{1}{N}\sum_{j=1}^N Z^i_j, \quad 0 \le i \le T-1. \tag{18}$$
Equation (18) also holds when $i = T-1$ and $t = T$. We then compute the starting points $\{\hat{x}^i_j\}_{0 \le i \le T-1, 1 \le j \le N}$ in the IVP enhanced sampling method. By equation (10), we know that when $1 \le i \le i' \le T-1$ and $0 \le t < t_i$, $\theta^i_t = \theta^{i'}_t$. Together with equation (8), we know that when $0 \le i \le T-2$, $u_{\theta^i}(t, x) = u_{\theta^{T-1}}(t, x) = u_a(t, x)$ for $t \in [i, i+1)$, and $x^{i+1}_j(t) \equiv x^i_j(t)$ for $t \in [0, i]$, which implies $x^{i+1}_j(i) = x^i_j(i) = \hat{x}^i_j$.
Therefore, for $1 \le j \le N$ and $0 \le i \le T-2$, we have
$$\dot{x}^{i+1}_j(t) = u_{\theta^i}(t, x^{i+1}_j(t)) = u_a(t, x^{i+1}_j(t)) = -\frac{T}{T(T-t)+1}x^{i+1}_j(t) + \frac{T(T-i)+1}{T(T-t)+1}\epsilon \bar{Z}^i_a, \quad t \in [i, i+1],$$
with $x^{i+1}_j(i) = \hat{x}^i_j$. Solving the above ODE, we get
$$x^{i+1}_j(t) = \frac{T(T-t)+1}{T(T-i)+1}\hat{x}^i_j + (t-i)\epsilon \bar{Z}^i_a, \quad t \in [i, i+1].$$
Hence, by definition, for $0 \le i \le T-2$,
$$\hat{x}^{i+1}_j = x^{i+1}_j(i+1) = \frac{T(T-i-1)+1}{T(T-i)+1}\hat{x}^i_j + \epsilon \bar{Z}^i_a.$$
Utilizing this recursive relationship, we obtain, for $0 \le i \le T-1$,
$$\hat{x}^i_j = \frac{T(T-i)+1}{T^2+1}\hat{x}^0_j + \sum_{k=0}^{i-1} \frac{T(T-i)+1}{T(T-k-1)+1}\epsilon \bar{Z}^k_a. \tag{19}$$
Now we are ready to prove the main results of the theorem. First, for equation (12), using the control (16), we have
$$\dot{x}_v(t) = -\frac{T}{T(T-t)+1}x_v(t) + \frac{T^2+1}{T(T-t)+1}\epsilon \bar{Z}_v, \qquad x_v(0) = x_{\mathrm{init}}.$$
Solving this ODE gives
$$x_v(t) = \frac{T(T-t)+1}{T^2+1}x_{\mathrm{init}} + \epsilon t \bar{Z}_v. \tag{20}$$
Combining the last equation with the fact that $\hat{x}_j(t) = \frac{T(T-t)+1}{T^2+1}\hat{x}_j + \epsilon t Z_j$, that $\hat{x}_j$, $x_{\mathrm{init}}$, and $\{Z_j\}_{j=1}^{NT}$ are independent normal random variables, and that $\hat{x}_j, x_{\mathrm{init}} \sim N(0, 1)$, $Z_j \sim N(m, \sigma^2)$, we know that $\hat{x}_j(t)$ and $x_v(t)$ are normal random variables with
$$\mathbb{E}\hat{x}_j(t) = \mathbb{E}x_v(t), \qquad \big|\mathbb{E}|\hat{x}_j(t)|^2 - \mathbb{E}|x_v(t)|^2\big| = \sigma^2\Big(1 - \frac{1}{NT}\Big)\epsilon^2 t^2.$$
Next, we prove equation (13). For $0 \le i \le T-1$ and $t \in [i, i+1)$, using the control (18), we have
$$\dot{x}_a(t) = -\frac{T}{T(T-t)+1}x_a(t) + \frac{T(T-i)+1}{T(T-t)+1}\epsilon \bar{Z}^i_a.$$
Solving the above ODE with the initial condition $x_a(0) = x_{\mathrm{init}}$, we get
$$x_a(t) = \frac{T(T-t)+1}{T^2+1}x_{\mathrm{init}} + \sum_{k=0}^{i-1} \frac{T(T-t)+1}{T(T-k-1)+1}\epsilon \bar{Z}^k_a + (t-i)\epsilon \bar{Z}^i_a \tag{21}$$
for $0 \le i \le T-1$ and $t \in [i, i+1)$; the equation also holds when $i = T-1$ and $t = T$. On the other hand, combining equations (17) and (19), we know that when $0 \le i \le T-1$ and $t \in [i, i+1)$, or $i = T-1$ and $t = T$,
$$\hat{x}^{T-1}_j(t) = \frac{T(T-t)+1}{T(T-i)+1}\hat{x}^i_j + (t-i)\epsilon Z^i_j = \frac{T(T-t)+1}{T^2+1}\hat{x}^0_j + \sum_{k=0}^{i-1} \frac{T(T-t)+1}{T(T-k-1)+1}\epsilon \bar{Z}^k_a + (t-i)\epsilon Z^i_j.$$
The above equation also holds when $i = T-1$ and $t = T$. Combining the last equation with equation (21) and the fact that $\hat{x}^0_j$, $x_{\mathrm{init}}$, and $\{Z^i_j\}_{0 \le i \le T-1, 1 \le j \le N}$ are independent normal random variables with $\hat{x}^0_j, x_{\mathrm{init}} \sim N(0, 1)$, $Z^i_j \sim N(m, \sigma^2)$, we know that $\hat{x}^{T-1}_j(t)$ and $x_a(t)$ are normal random variables with
$$\mathbb{E}\hat{x}^{T-1}_j(t) = \mathbb{E}x_a(t), \qquad \big|\mathbb{E}|\hat{x}^{T-1}_j(t)|^2 - \mathbb{E}|x_a(t)|^2\big| = \sigma^2\epsilon^2(t-i)^2\Big(1 - \frac{1}{N}\Big) \le \sigma^2\epsilon^2.$$
We then prove equations (14) and (15). First, with the optimal solution
$$u_o(t) = -\frac{T}{T^2+1}x_{\mathrm{init}}, \qquad x_o(T) = \frac{1}{T^2+1}x_{\mathrm{init}},$$
we have
$$J_o = \frac{1}{T}\int_0^T \Big|\frac{T}{T^2+1}x_{\mathrm{init}}\Big|^2 dt + \Big|\frac{1}{T^2+1}x_{\mathrm{init}}\Big|^2 = \frac{1}{T^2+1}|x_{\mathrm{init}}|^2.$$
Recalling equation (20) and plugging (20) into (16), we know that
$$x_v(T) = \frac{1}{T^2+1}x_{\mathrm{init}} + \epsilon T \bar{Z}_v, \qquad u_v(t) = -\frac{T}{T^2+1}x_{\mathrm{init}} + \epsilon \bar{Z}_v.$$
Hence,
$$J_v = \epsilon^2|\bar{Z}_v|^2 - \frac{2T}{T^2+1}x_{\mathrm{init}}\epsilon \bar{Z}_v + \epsilon^2 T^2|\bar{Z}_v|^2 + \frac{2T}{T^2+1}x_{\mathrm{init}}\epsilon \bar{Z}_v + \frac{1}{T^2+1}|x_{\mathrm{init}}|^2,$$
which gives
$$\mathbb{E}J_v - J_o = (T^2+1)\Big(m^2 + \frac{\sigma^2}{NT}\Big)\epsilon^2.$$
On the other hand, recalling equation (21) and plugging (21) into (18), we know that
$$x_a(T) = \frac{1}{T^2+1}x_{\mathrm{init}} + \sum_{k=0}^{T-1}\frac{1}{T(T-k-1)+1}\epsilon \bar{Z}^k_a,$$
$$u_a(t) = -\frac{T}{T^2+1}x_{\mathrm{init}} - \sum_{k=0}^{i-1}\frac{T}{T(T-k-1)+1}\epsilon \bar{Z}^k_a + \epsilon \bar{Z}^i_a, \quad 0 \le i \le T-1,\ t \in [i, i+1).$$
To compute the difference between $J_a$ and $J_o$, we first notice that
$$\mathbb{E}J_a - J_o = \frac{1}{T}\int_0^T \mathrm{Var}(u_a(t))\,dt + \mathrm{Var}(x_a(T)) + \frac{1}{T}\int_0^T |\mathbb{E}u_a(t)|^2 dt + |\mathbb{E}x_a(T)|^2 - J_o =: I_1 + I_2.$$
By the independence of $\{\bar{Z}^i_a\}_{i=0}^{T-1}$, we know that
$$I_1 = \frac{\sigma^2\epsilon^2}{NT}\sum_{i=0}^{T-1}\sum_{k=0}^{i-1}\frac{T^2}{[T(T-k-1)+1]^2} + \frac{\sigma^2\epsilon^2}{N} + \sum_{k=0}^{T-1}\frac{1}{[T(T-k-1)+1]^2}\frac{\sigma^2\epsilon^2}{N}$$
$$= \frac{\sigma^2\epsilon^2}{N}\Big(1 + \sum_{i=0}^{T-1}\sum_{k=0}^{i-1}\frac{T}{[T(T-k-1)+1]^2} + \sum_{k=0}^{T-1}\frac{1}{[T(T-k-1)+1]^2}\Big)$$
$$= \frac{\sigma^2\epsilon^2}{N}\Big(1 + \sum_{k=0}^{T-2}\sum_{i=k+1}^{T-1}\frac{T}{[T(T-k-1)+1]^2} + \sum_{k=0}^{T-1}\frac{1}{[T(T-k-1)+1]^2}\Big)$$
$$= \frac{\sigma^2\epsilon^2}{N}\Big(1 + \sum_{k=0}^{T-1}\frac{T(T-k-1)+1}{[T(T-k-1)+1]^2}\Big) = \frac{\sigma^2\epsilon^2}{N}\Big(1 + \sum_{k=0}^{T-1}\frac{1}{Tk+1}\Big) \le \frac{3\sigma^2\epsilon^2}{N}.$$
Meanwhile, noticing that
$$\mathbb{E}x_a(T) = \frac{1}{T^2+1}x_{\mathrm{init}} + \sum_{k=0}^{T-1}\frac{1}{T(T-k-1)+1}\epsilon m, \qquad \mathbb{E}u_a(t) = -\frac{T}{T^2+1}x_{\mathrm{init}} - \sum_{k=0}^{i-1}\frac{T}{T(T-k-1)+1}\epsilon m + \epsilon m,$$
it is straightforward to compute that
$$I_2 = \frac{2\epsilon m x_{\mathrm{init}}}{T^2+1}I_3 + \epsilon^2 m^2 (I_4 + 1),$$
where
$$I_3 = \sum_{k=0}^{T-1}\frac{1}{T(T-k-1)+1} + \sum_{i=0}^{T-1}\sum_{k=0}^{i-1}\frac{T}{T(T-k-1)+1} - T = \sum_{k=0}^{T-1}\frac{1}{T(T-k-1)+1} + \sum_{k=0}^{T-2}\sum_{i=k+1}^{T-1}\frac{T}{T(T-k-1)+1} - T = \sum_{k=0}^{T-1}\frac{T(T-k-1)+1}{T(T-k-1)+1} - T = 0,$$
and
$$I_4 = \Big(\sum_{k=0}^{T-1}\frac{1}{T(T-k-1)+1}\Big)^2 + T\sum_{i=0}^{T-1}\Big(\sum_{k=0}^{i-1}\frac{1}{T(T-k-1)+1}\Big)^2 - \sum_{i=0}^{T-1}\sum_{k=0}^{i-1}\frac{2}{T(T-k-1)+1}$$
$$= \sum_{k=0}^{T-1}\frac{1+T(T-k-1)}{[1+T(T-k-1)]^2} + 2\sum_{i=0}^{T-1}\sum_{k=0}^{i-1}\frac{T(T-i-1)+1}{[T(T-k-1)+1][T(T-i-1)+1]} - 2\sum_{i=0}^{T-1}\sum_{k=0}^{i-1}\frac{1}{T(T-k-1)+1}$$
$$= \sum_{k=0}^{T-1}\frac{1}{1+T(T-k-1)} = \sum_{k=0}^{T-1}\frac{1}{Tk+1} \le 2.$$
Therefore,
$$\mathbb{E}J_a - J_o = I_1 + \frac{2\epsilon m x_{\mathrm{init}}}{T^2+1}I_3 + \epsilon^2 m^2(I_4 + 1) \le 3\Big(m^2 + \frac{\sigma^2}{N}\Big)\epsilon^2.$$

Next, we present the numerical results when Model 2 (5) is used to fit the closed-loop optimal control. In the following experiments, we set $\epsilon = 0.1$, $m = 0.1$, and $\sigma^2 = 1$. Figure 5 (left) compares the optimal path $x_o(t)$ with $x_v(t)$ and $x_a(t)$, the IVPs generated by the controllers learned by the vanilla method and the IVP enhanced sampling method (all three paths start at $x_{\mathrm{init}} = 1$). In this experiment, we set $T = 30$ and $N = 100$.

Figure 5: Numerical results on learning Model 2 (5). Left: the optimal path and the paths generated by the vanilla sampling method and the IVP enhanced sampling method. Middle: differences of the second-order moments (in logarithmic scale) between the distributions of the training data and the data reached by the controllers at different times. Right: performance differences (in logarithmic scale) of the vanilla sampling method and the IVP enhanced sampling method for different total times (in logarithmic scale).
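The closed-form manipulations in this appendix are easy to check numerically. The following is our own sanity-check script (not part of the paper), verifying the summation identity behind the bound on $I_1$ and the exact excess cost of the vanilla controller for a single noise realization.

```python
# Sanity checks for two identities used in the proof above.

def D(T, k):
    return T * (T - k - 1) + 1   # the recurring denominator T(T-k-1)+1

for T in range(1, 30):
    # I_1 simplification: the double sum plus the single sum collapses
    # to sum_{k=0}^{T-1} 1/(Tk+1), which is bounded by 2.
    lhs = sum(T / D(T, k) ** 2 for i in range(T) for k in range(i)) \
        + sum(1.0 / D(T, k) ** 2 for k in range(T))
    rhs = sum(1.0 / (T * k + 1) for k in range(T))
    assert abs(lhs - rhs) < 1e-12
    assert rhs <= 2.0 + 1e-12

# Exact excess cost of the vanilla controller for one realization of the
# noise average Zbar: J_v - J_o = (T^2+1) * eps^2 * Zbar^2.
T, eps, Zbar, x0 = 5, 0.1, 0.37, 1.3
u_v = -T / (T ** 2 + 1) * x0 + eps * Zbar   # constant in time
xT = x0 + T * u_v                            # x' = u_v
J_v = u_v ** 2 + xT ** 2                     # (1/T) \int u^2 dt = u_v^2 here
J_o = x0 ** 2 / (T ** 2 + 1)
excess = J_v - J_o
```

Taking expectations of $\bar{Z}_v^2$ (with $\mathbb{E}\bar{Z}_v^2 = m^2 + \sigma^2/NT$) then recovers equation (14).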

C FULL DYNAMICS OF QUADROTOR

In this section, we introduce the full dynamics of the quadrotor (Bouabdallah et al., 2004; Madani & Benallegue, 2006; Mahony et al., 2012) considered in Section 6. The state variable of a quadrotor is $x = (p^T, v_b^T, \eta^T, w_b^T)^T \in \mathbb{R}^{12}$, where $p = (x, y, z) \in \mathbb{R}^3$ is the position of the quadrotor in Earth-fixed coordinates, $v_b \in \mathbb{R}^3$ is the velocity in body-fixed coordinates, $\eta = (\phi, \theta, \psi) \in \mathbb{R}^3$ (roll, pitch, yaw) is the attitude in terms of Euler angles in Earth-fixed coordinates, and $w_b \in \mathbb{R}^3$ is the angular velocity in body-fixed coordinates. The control $u = (s, \tau_x, \tau_y, \tau_z)^T \in \mathbb{R}^4$ is composed of the total thrust $s$ and the body torques $(\tau_x, \tau_y, \tau_z)$ from the four rotors. The quadrotor's dynamics are then
$$\begin{cases} \dot{p} = R^T(\eta)v_b, \\ \dot{v}_b = -w_b \times v_b - R(\eta)g + \frac{1}{m}Au, \\ \dot{\eta} = K(\eta)w_b, \\ \dot{w}_b = -J^{-1}(w_b \times Jw_b) + J^{-1}Bu, \end{cases}$$
with matrices $A$ and $B$ defined as
$$A = \begin{pmatrix} 0 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{pmatrix}, \qquad B = \begin{pmatrix} 0 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{pmatrix}.$$
The constant mass $m$ and inertia matrix $J = \mathrm{diag}(J_x, J_y, J_z)$ are the parameters of the quadrotor, where $J_x$, $J_y$, and $J_z$ are the moments of inertia of the quadrotor about the x-axis, y-axis, and z-axis, respectively. We set $m = 2\,\mathrm{kg}$ and $J_x = J_y = \frac{1}{2}J_z = 1.2416\,\mathrm{kg \cdot m^2}$, the same system parameters as in (Madani & Benallegue, 2006). The constant vector $g = (0, 0, g)^T$ denotes the gravity vector, where $g = 9.81\,\mathrm{m/s^2}$ is the acceleration of gravity on Earth.
The direction cosine matrix $R(\eta) \in SO(3)$ represents the transformation from Earth-fixed coordinates to body-fixed coordinates:
$$R(\eta) = \begin{pmatrix} \cos\theta\cos\psi & \cos\theta\sin\psi & -\sin\theta \\ \sin\theta\cos\psi\sin\phi - \sin\psi\cos\phi & \sin\theta\sin\psi\sin\phi + \cos\psi\cos\phi & \cos\theta\sin\phi \\ \sin\theta\cos\psi\cos\phi + \sin\psi\sin\phi & \sin\theta\sin\psi\cos\phi - \cos\psi\sin\phi & \cos\theta\cos\phi \end{pmatrix},$$
and the attitude kinematic matrix $K(\eta)$ relates the time derivative of the attitude representation to the associated angular rate:
$$K(\eta) = \begin{pmatrix} 1 & \sin\phi\tan\theta & \cos\phi\tan\theta \\ 0 & \cos\phi & -\sin\phi \\ 0 & \sin\phi\sec\theta & \cos\phi\sec\theta \end{pmatrix}.$$
Note that in practice the quadrotor is directly controlled by the individual rotor thrusts $F = (F_1, F_2, F_3, F_4)^T$, and we have the relation $u = EF$ with
$$E = \begin{pmatrix} 1 & 1 & 1 & 1 \\ 0 & l & 0 & -l \\ -l & 0 & l & 0 \\ c & -c & c & -c \end{pmatrix},$$
where $l$ is the distance from each rotor to the UAV's center of gravity and $c$ is a constant relating the rotor angular momentum to the rotor thrust (normal force). So once we obtain the optimal control $u^*$, we immediately get the optimal $F^*$ via $F^* = E^{-1}u^*$.
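For concreteness, the dynamics above can be transcribed directly into code. The following is our own sketch (plain Python, with $m$ and $J$ hard-coded to the values in this appendix); its only check is that hovering at thrust $s = mg$ with zero attitude and velocity is an equilibrium.

```python
import math

M_KG = 2.0                     # mass m
JX = JY = 1.2416               # moments of inertia; J_z = 2 * J_x
JZ = 2.4832
G = 9.81

def cross(a, b):
    return [a[1]*b[2] - a[2]*b[1], a[2]*b[0] - a[0]*b[2], a[0]*b[1] - a[1]*b[0]]

def R(eta):  # Earth-to-body direction cosine matrix (roll-pitch-yaw)
    ph, th, ps = eta
    return [
        [math.cos(th)*math.cos(ps), math.cos(th)*math.sin(ps), -math.sin(th)],
        [math.sin(th)*math.cos(ps)*math.sin(ph) - math.sin(ps)*math.cos(ph),
         math.sin(th)*math.sin(ps)*math.sin(ph) + math.cos(ps)*math.cos(ph),
         math.cos(th)*math.sin(ph)],
        [math.sin(th)*math.cos(ps)*math.cos(ph) + math.sin(ps)*math.sin(ph),
         math.sin(th)*math.sin(ps)*math.cos(ph) - math.cos(ps)*math.sin(ph),
         math.cos(th)*math.cos(ph)],
    ]

def K(eta):  # attitude kinematics: eta_dot = K(eta) w_b
    ph, th, _ = eta
    return [
        [1.0, math.sin(ph)*math.tan(th), math.cos(ph)*math.tan(th)],
        [0.0, math.cos(ph), -math.sin(ph)],
        [0.0, math.sin(ph)/math.cos(th), math.cos(ph)/math.cos(th)],
    ]

def matvec(A, v):
    return [sum(A[i][j]*v[j] for j in range(3)) for i in range(3)]

def transpose(A):
    return [[A[j][i] for j in range(3)] for i in range(3)]

def f(x, u):
    """x = (p, v_b, eta, w_b) flattened to length 12; u = (s, tau_x, tau_y, tau_z)."""
    p, vb, eta, wb = x[0:3], x[3:6], x[6:9], x[9:12]
    s, tx, ty, tz = u
    Re = R(eta)
    p_dot = matvec(transpose(Re), vb)                   # p' = R^T v_b
    g_body = matvec(Re, [0.0, 0.0, G])                  # R(eta) g
    thrust = [0.0, 0.0, s / M_KG]                       # (1/m) A u
    vb_dot = [-c - gb + th for c, gb, th in zip(cross(wb, vb), g_body, thrust)]
    eta_dot = matvec(K(eta), wb)
    Jw = [JX*wb[0], JY*wb[1], JZ*wb[2]]
    cw = cross(wb, Jw)
    wb_dot = [(-cw[0] + tx)/JX, (-cw[1] + ty)/JY, (-cw[2] + tz)/JZ]
    return p_dot + vb_dot + eta_dot + wb_dot

# hover check: at rest with total thrust s = m*g, all derivatives vanish
hover = f([0.0]*12, [M_KG * G, 0.0, 0.0, 0.0])
```

This transcription is only meant to make the equations unambiguous; the paper's experiments use their own implementation.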

D PMP AND SPACE MARCHING METHOD

In this section, we introduce the open-loop optimal control solver used for the optimal landing problem of a quadrotor. The solver is based on Pontryagin's minimum principle (PMP) (Pontryagin, 1987) and the space-marching method (Zang et al., 2022). The optimal landing problem is defined as
$$\min_{x, u} \int_0^T L(x(\tau), u(\tau))\,d\tau + M(x(T)), \quad \text{s.t. } \dot{x}(t) = f(x(t), u(t)),\ t \in [0, T],\ x(0) = x_0, \tag{22}$$
where $x(t): [0, T] \to \mathbb{R}^{12}$ and $u(t): [0, T] \to \mathbb{R}^4$ denote the state and control trajectories, respectively, and $f$ is the full dynamics of the quadrotor introduced in Appendix C. By PMP, problem (22) can be solved through a two-point boundary value problem (TPBVP). Introduce the costate variable $\lambda \in \mathbb{R}^{12}$ and the Hamiltonian $H(x, \lambda, u) = L(x, u) + \lambda \cdot f(x, u)$. The TPBVP is defined as
$$\dot{x}(t) = \partial_\lambda^T H(x(t), \lambda(t), u^*(t)), \qquad \dot{\lambda}(t) = -\partial_x^T H(x(t), \lambda(t), u^*(t)), \tag{23}$$
with boundary conditions $x(0) = x_0$, $\lambda(T) = \nabla M(x(T))$, and the optimal control $u^*(t)$ should minimize the Hamiltonian at each $t$:
$$u^*(t) = \arg\min_u H(x(t), \lambda(t), u). \tag{24}$$
We use the solve_bvp function of SciPy (Kierzenka & Shampine, 2001) to solve the TPBVP (23)-(24), with tolerance set to $10^{-5}$ and max_nodes to 5000. We note that when the initial state $x_0$ is far from the target state $x_T$, solving the TPBVP directly often fails. Thus, we use the space-marching method proposed in Zang et al. (2022): we uniformly select $K$ points on the line segment from $x_T$ to $x_0$, denoted $\{x_0^1, x_0^2, \cdots, x_0^K\}$ in order of increasing distance to $x_T$ (with $x_0^K = x_0$). These $K$ TPBVPs are solved in order, and at every step the previous solution serves as the initial guess for the current problem.
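While the paper solves the 12-dimensional quadrotor TPBVP with solve_bvp plus space marching, the PMP reduction itself can be illustrated on a 1-D toy problem (the LQR cost from Section 5 and Appendix B). The sketch below is our own illustration, using simple shooting with bisection instead of a collocation BVP solver.

```python
# Toy PMP/shooting sketch (not the quadrotor problem):
# minimize (1/T) \int_0^T u^2 dt + x(T)^2  subject to  x' = u, x(0) = x0.
# PMP: H = u^2/T + lam*u, so u* = -T*lam/2; then x' = -T*lam/2 and lam' = 0,
# with terminal condition lam(T) = 2 x(T).  We solve for lam(0) by bisection.

def shoot(lam0, x0, T, n=1000):
    """Integrate the PMP ODEs forward; return the terminal residual lam(T) - 2 x(T)."""
    dt = T / n
    x, lam = x0, lam0
    for _ in range(n):
        x += dt * (-T * lam / 2.0)   # x' = u* = -T*lam/2
        # lam' = -dH/dx = 0, so lam stays constant
    return lam - 2.0 * x

def solve_tpbvp(x0, T, lo=-100.0, hi=100.0, tol=1e-12):
    """Bisection on the shooting residual (monotone in lam0 for this problem)."""
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if shoot(lo, x0, T) * shoot(mid, x0, T) <= 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)

x0, T = 1.0, 4.0
lam0 = solve_tpbvp(x0, T)
u_star = -T * lam0 / 2.0   # analytic answer: -T*x0/(T^2+1)
```

The recovered control matches the closed-form open-loop solution $u^* = -T x_0/(T^2+1)$, and the space-marching idea corresponds to sweeping $x_0$ toward the target while warm-starting each solve with the previous one.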

E EXPERIMENT DETAILS AND MORE RESULTS OF THE OPTIMAL LANDING PROBLEM

In this section, we give more details about the implementation and numerical results in Section 6. We aim to find the optimal controls steering the quadrotor from initial states $x_0$ to the target state $x_T = 0$. The distribution of the initial state of interest is the uniform distribution on the set
$$X = \{x, y \in [-40, 40],\ z \in [20, 40],\ v_x, v_y, v_z \in [-1, 1],\ \theta, \phi \in [-\pi/4, \pi/4],\ \psi \in [-\pi, \pi];\ w = 0\}.$$
We consider a quadratic running cost $L(x, u) = (u - u_d)^T Q_u (u - u_d)$, where $u_d = (mg, 0, 0, 0)$ is the reference control that balances gravity and $Q_u = \mathrm{diag}(1, 1, 1, 1)$ is the weight matrix characterizing the cost of deviating from the reference control. The terminal cost is $M(x) = p^T Q_{pf} p + v^T Q_{vf} v + \eta^T Q_{\eta f}\eta + w^T Q_{wf} w = x^T Q_f x$, where $Q_{pf} = 5I_3$, $Q_{vf} = 10I_3$, $Q_{\eta f} = 25I_3$, $Q_{wf} = 50I_3$. We set the entries in the terminal cost larger than those in the running cost since we want to penalize the endpoint more for deviating from the landing target. We sample N = 500 initial points for training. On every optimal path, we select time-state-action tuples with time step δ = 0.2; thus the number of training data points is always 81 × 500 at every iteration. Note that when solving BVPs and IVPs, we use denser time grids to ensure the solutions are accurate enough. The neural network models in all quadrotor experiments have the same structure, with 13-dimensional input (12 for states and 1 for time) and 4-dimensional output. The networks are fully connected with 2 hidden layers; each layer has 128 hidden neurons, and we use tanh as the activation function. The inputs are scaled to (-1, 1), where the upper and lower bounds are the maximum and minimum of the training dataset. Since the activation function is tanh, we adopt Xavier initialization (Glorot & Bengio, 2010) before training. We train the neural network with the Adam (Kingma & Ba, 2015) optimizer with learning rate 0.001, batch size 1000, and 1000 epochs.
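The min-max input scaling described above can be sketched as follows; the helper names are our own, and we assume per-coordinate scaling with the bounds taken from the training dataset.

```python
# Scale each input coordinate to [-1, 1] using training-set min/max bounds
# (our own illustrative helper, not the paper's code).

def fit_minmax(dataset):
    """dataset: list of equal-length feature vectors, e.g. [t, x_1, ..., x_12]."""
    dim = len(dataset[0])
    lo = [min(row[d] for row in dataset) for d in range(dim)]
    hi = [max(row[d] for row in dataset) for d in range(dim)]
    return lo, hi

def scale(row, lo, hi):
    """Map each coordinate affinely so that the training min -> -1, max -> +1."""
    return [2.0 * (v - l) / (h - l) - 1.0 if h > l else 0.0
            for v, l, h in zip(row, lo, hi)]

# toy rows of the form [t, x, z] spanning the ranges used in this appendix
train = [[0.0, -40.0, 20.0], [16.0, 40.0, 40.0], [8.0, 0.0, 30.0]]
lo, hi = fit_minmax(train)
scaled = scale(train[2], lo, hi)
```

In the experiments the bounds would come from the full 13-dimensional training dataset; constant coordinates are mapped to 0 to avoid division by zero.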
At every iteration of the IVP enhanced sampling method, we train a new neural network from scratch. Our model and training programs are implemented in PyTorch (Paszke et al., 2019). In the first experiment, we use our IVP enhanced sampling method and choose temporal grid points 0 < 10 < 14 < 16. More statistics of the ratios between the NN-controlled costs and the optimal costs over the three iterations are shown in Table 1. We also test an alternative way to construct the dataset during the IVP enhanced sampling, as discussed in Section 8, i.e., setting $S_i = \hat{S}_i \cup S_{i-1}$ in Algorithm 1, line 9. We denote the final policy obtained by this approach as ũ and report the statistics of the corresponding cost ratios in Table 1 as well. We point out that the ratio of ũ has been clipped at 10.0, as there is a test path with a ratio over 1000. The performance of ũ is similar to that of û2, suggesting that so far the dropped data provides little value for training. Additionally, since we always train networks for 1000 epochs and this alternative approach has more training data, the training time of ũ is 1.5 times that of the others. We also illustrate the trajectories of states controlled by the learned policies and the optimal policy in Figure 6. The path controlled by û0 matches the optimal path at the beginning but deviates around t = 10. The path controlled by û1 fits the optimal path further but deviates around t = 14. Finally, the path controlled by û2 matches the optimal path until the terminal time. The costs of the three controlled paths are 3296.18, 119.91, and 6.69, respectively, while the optimal cost is 6.32.

Table 1: Statistics of the cost ratios, corresponding to the experiments in Figure 3 (left) and Figure 3 (middle), respectively. û0, û1, û2 denote the policy after the first, second, and third rounds of training, respectively. ũ denotes the final policy obtained from the alternative approach for constructing datasets during the iterations of IVP enhanced sampling; see the discussion in Section 8.

Figure 6: The optimal path and the paths controlled by the learned controllers.
We show the 3-dimensional position p = (x, y, z) and the 3-dimensional attitude η = (ϕ, θ, ψ) in terms of Euler angles in Earth-fixed coordinates. In the experiments comparing different methods, the three adaptive sampling baselines are processed as follows. The initial network is the same policy as in the IVP enhanced sampling method, i.e., the policy û0 after iteration 0, which is trained on 500 optimal paths. We then sequentially add 400, 300, and 300 paths to the training data, so 1500 optimal paths are used to train the final network (i.e., 1500 open-loop optimal control problems are solved). When sampling new training paths, we randomly sample 2 candidate initial points, calculate certain indices using the latest policy, and keep the preferable one. For AS w. large u, we calculate the norms of the control variables of these two points at time 0 and choose the one with the larger norm. For AS w. large v, we solve IVPs to obtain the NN-controlled costs and choose the larger one. For AS w. bad v, we solve the 2 TPBVPs corresponding to these two points and choose the one with the larger difference between the NN-controlled value and the optimal value. Note that in all other methods, the time span of the solved open-loop problems is always T. In our method, by contrast, the time span of the optimal paths to be solved gets shorter and shorter across iterations, so they take less time to solve. Specifically, in our experiment, it takes about 11.1 s to solve an optimal path with total time T - t0 = 16 at iteration 0, and about 3.1 s to solve an optimal path with total time T - t2 = 2 at iteration 2. More statistics are shown in Table 2. We further test the performance of the NN controllers obtained from different sampling methods in the presence of observation noises, considering that sensors have errors in reality.
During simulation, we add a disturbance $\epsilon$ to the input of the network, where $\epsilon \in \mathbb{R}^{13}$ (including the disturbance of time) is uniformly sampled from $[-\sigma, \sigma]^{13}$. We test σ = 0.01, 0.05, 0.1, and the numerical results are shown in Figure 7. We also test the performance of the open-loop optimal controller under perturbation, where a disturbance $\varepsilon \in \mathbb{R}$ is added to the input time. Figure 7 shows that when disturbances exist, closed-loop controllers are more reliable than the open-loop controller, and the one trained by the IVP enhanced sampling method performs best among all the methods.


Figure 7: Cumulative distribution function of the cost ratio between the NN-controlled value and the optimal value under disturbance. Finally, we consider the impact of different choices of temporal grid points in Algorithm 1. We run 4 experiments, all trained using the same 500 initial points. The results listed in Table 3 show that our algorithm is robust to the choice of temporal grid points. The average ratio of policy cost to optimal cost is 67.90 after iteration 0, and the average ratio after iteration 1 is 1.14. The second line shows the same experiment as in Figure 3 (middle).

F EXPERIMENT DETAILS AND MORE RESULTS OF THE MANIPULATOR

We can write down the dynamics of the manipulator as
$$\dot{x} = f(x, u) = (v, a(x, u)),$$
where $u \in \mathbb{R}^7$ is the control torque, $x = (q, v) \in \mathbb{R}^{14}$, $q \in \mathbb{R}^7$ is the vector of joint angles, $v = \dot{q} \in \mathbb{R}^7$ is the vector of joint velocities, and $\ddot{q} = a(x, u) \in \mathbb{R}^7$ is the acceleration of the joint angles. The acceleration $a$ is given by the (nonlinear) forward dynamics
$$M(q)a + C(q, \dot{q})\dot{q} + g(q) = u.$$
Here $M(q)$ is the generalized inertia matrix, $C(q, \dot{q})\dot{q}$ represents the centrifugal and Coriolis forces, and $g(q)$ is the generalized gravity.

Figure 8: An illustration of the reaching problem of the manipulator. The solid manipulator shows its initial position. We label the end effectors of the five instances of the robot by "1, 2, 3, 4, 5" to indicate the position of the robot at times $t_1 = 0.0$, $t_2 = 0.2$, $t_3 = 0.4$, $t_4 = 0.6$, $t_5 = 0.8$.

The reaching task is to move the manipulator from initial states near $x_0$ to the terminal state $x_1$. In the experiments, we take $x_0 = (q_0, 0)$, $x_1 = (q_1, 0)$ with
$$q_0 = [1.6800, 1.2501, 2.4428, -1.2669, -0.9778, 1.1236, -1.3575]^T,$$
$$q_1 = [2.7736, 0.5842, 1.5413, -1.7028, -2.1665, 0.0847, -2.5764]^T.$$
See Figure 8 for an illustration of the task. Besides, we take the running cost and terminal cost to be
$$L(x, u) = a(x, u)^T Q_a a(x, u) + (u - u_1)^T Q_u (u - u_1), \qquad M(x) = (x - x_1)^T Q_f (x - x_1),$$
where $u_1$ is the torque that balances gravity at state $x_1$, i.e., $a(x_1, u_1) = 0$. Under this setting, $(x_1, u_1)$ is an equilibrium of the system, i.e., $a_1 = a(x_1, u_1) = 0$ and $f(x_1, u_1) = (v_1, a_1) = 0$. In the experiment, we take $Q_a = 0.005 I_7$, $Q_u = 0.025 I_7$, $Q_f = 25000 I_{14}$, where we use a large weight $Q_f$ to ensure the reaching goal is approximately achieved. The backbone network for this example is the QRNet (Nakamura-Zimmerer et al., 2020; 2021b). QRNet exploits the solution of the LQR problem at the equilibrium and thus improves the network's performance around the equilibrium.
The successful use of a different network structure also demonstrates the versatility of the IVP enhanced sampling method. Suppose we have the linear quadratic regulator (LQR) $u_{\mathrm{LQR}}$ for the problem with linearized dynamics and quadratized costs at $(x_1, u_1)$; the QRNet can then be formulated as
$$u_{\mathrm{QR}}(t, x) = \sigma(u_{\mathrm{LQR}}(t, x) + u_{\mathrm{NN}}(t, x; \theta) - u_{\mathrm{NN}}(T, x_1; \theta)),$$
where $u_{\mathrm{NN}}(t, x; \theta)$ is any neural network with trainable parameters $\theta$, and $\sigma$ is a saturating function that satisfies $\sigma(u_1) = u_1$, $\sigma_u(u_1) = I_7$. The $\sigma$ used in this example is defined coordinate-wise as
$$\sigma(u) = u_{\min} + \frac{u_{\max} - u_{\min}}{1 + c_1\exp[-c_2(u - u_1)]},$$
where $c_1 = (u_{\max} - u_1)/(u_1 - u_{\min})$ and $c_2 = (u_{\max} - u_{\min})/[(u_{\max} - u_1)(u_1 - u_{\min})]$, with $u_{\min}$, $u_{\max}$ being the minimum and maximum values for $u$. Here $u$, $u_{\min} = -150$, and $u_{\max} = 150$ denote the corresponding values at each coordinate of $u$, $u_{\min}$, $u_{\max}$, respectively. To obtain the LQR, we expand the dynamics linearly as
$$f(x, u) \approx f_x(x_1, u_1)(x - x_1) + f_u(x_1, u_1)(u - u_1),$$
and the acceleration-related term in the running cost quadratically as
$$a(x, u)^T Q_a a(x, u) \approx L_a(x, u)^T Q_a L_a(x, u) = (x - x_1)^T a_x^T Q_a a_x (x - x_1) + (u - u_1)^T a_u^T Q_a a_u (u - u_1) + 2(x - x_1)^T a_x^T Q_a a_u (u - u_1),$$
where $L_a = a_x(x_1, u_1)(x - x_1) + a_u(x_1, u_1)(u - u_1)$, and we exploit $a(x_1, u_1) = 0$ and $f(x_1, u_1) = 0$. The derivatives boil down to $a_x$ and $a_u$, which can be analytically computed with the Pinocchio library (Carpentier et al., 2015-2021; 2019; Carpentier & Mansard, 2018). In the experiment, we solve the LQR with the implementation in the Drake library (Tedrake & the Drake Development Team, 2019). In the simulation and the open-loop solver, we take time step $\Delta t = 0.001$ and use the semi-implicit Euler discretization. The initial positions $q$ are sampled uniformly and independently in a 7-dimensional cube centered at $q_0$ with side length 0.02. Initial velocities $v$ are set to zero.
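The saturating function $\sigma$ and its two defining properties $\sigma(u_1) = u_1$ and $\sigma_u(u_1) = 1$ can be checked coordinate-wise with a few lines of code; the value $u_1 = 10$ below is an arbitrary test value of ours, not taken from the experiments.

```python
import math

def make_sigma(u_min, u_max, u1):
    """Coordinate-wise saturating function from the QRNet construction above."""
    c1 = (u_max - u1) / (u1 - u_min)
    c2 = (u_max - u_min) / ((u_max - u1) * (u1 - u_min))
    def sigma(u):
        return u_min + (u_max - u_min) / (1.0 + c1 * math.exp(-c2 * (u - u1)))
    return sigma

sigma = make_sigma(-150.0, 150.0, 10.0)    # u1 = 10 is a hypothetical value
fixed_point = sigma(10.0)                   # should equal u1
slope = (sigma(10.0 + 1e-6) - sigma(10.0 - 1e-6)) / 2e-6  # should be close to 1
```

The constants $c_1$ and $c_2$ are precisely chosen so that $\sigma$ is the identity to first order at $u_1$ while saturating to $u_{\min}$ and $u_{\max}$ far away, which keeps the learned controller bounded without distorting it near the equilibrium.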
Rather than directly applying the open-loop solver to the collected initial states, we first sample another minibatch of initial states and call the open-loop solver on it. We then pick the solution with the lowest cost and use it as an initial guess later. This not only speeds up the data generation process but also avoids sampling trajectories that fall into bad local minima. We use the differential dynamic programming (DDP) solver implemented in (Mastalli et al., 2020), a second-order algorithm that benefits from a good initial guess. The dataset is then created from the (discrete) optimal trajectories warm-started by this initial guess. Each trajectory has T/∆t = 800 data points, which are pairs of 15-dimensional input states (including time) and 7-dimensional output controls. The validation dataset and test dataset contain 605 and 1200 optimal trajectories, respectively. Finally, all the QRNets u_QR are trained by minimizing the mean squared error (3) over the training dataset using the Adam optimizer (Kingma & Ba, 2015) with learning rate 0.001, batch size 256, and 2000 epochs. u_NN is a fully-connected network with 6 hidden layers; each layer has 128 neurons. The first three layers use the tanh activation function while the last three use ELU (Clevert et al., 2016). The network and training are implemented in PyTorch (Paszke et al., 2019). Across iterations, all networks are trained from scratch, i.e., as new networks with random weights rather than inheriting weights from the previous iteration. After each epoch of training, we compute the loss on the validation dataset. The network with the smallest validation loss is then used for data generation in the next iteration or as the final policy (in the last iteration). We compare the ratio of the policy cost over the optimal cost; see Table 4 for the mean ratio over all 1200 trajectories of the test dataset. The ratio has been clipped at 2.0 for each trajectory.
The results demonstrate that the IVP enhanced sampling method brings a great improvement over the vanilla supervised-learning-based method. They also show that the IVP enhanced sampling is not sensitive to the choice of the temporal grid points. Besides, we also try augmenting the dataset with the newly collected data instead of replacing it, as detailed in Section 8. Through the comparison between AS3 and AS3* in Table 4, we find that the alternative approach does not bring further improvement.

G COMPARISON WITH DAGGER

In this section, we give a comprehensive comparison between DAGGER (Dataset Aggregation) (Ross et al., 2011) and the IVP enhanced sampling, in terms of concepts, theoretical results, and numerical results. When referring to DAGGER, we mostly mean the single-iteration version of DAGGER (i.e., augmenting the dataset once) unless stated otherwise explicitly.

Concept. Both DAGGER and the IVP enhanced sampling method solve IVPs using the policy from the previous iteration to generate time-state pairs as new initial time-state pairs and call the open-loop solver to label the trajectories starting from these pairs. Their main differences are as follows. In the i-th iteration, the IVP enhanced sampling method only solves the IVPs until the i-th temporal grid point and uses the time and the visited states at the i-th grid point as the initial time-state pairs to collect new open-loop optimal data. In contrast, in each iteration, the DAGGER method solves until the penultimate temporal grid point $t_{K-1}$, collects the states on all the time grids $0 < t_1 < \cdots < t_{K-1} < T$, and uses all the collected time-state pairs as initial time-state pairs to generate new open-loop optimal data. In the IVP enhanced sampling method, the later times are only visited by networks trained in later iterations. Hopefully, networks from later iterations perform better and generate relevant time-state pairs at later time grids.

Table 4: The mean ratio of policy costs / optimal costs for the optimal reaching problem of the manipulator. The ratio has been clipped at 2.0 for each test trajectory. Vanilla300/900 correspond to networks trained on 300/900 optimal trajectories, respectively. The choices of temporal grid points for adaptive sampling in the remaining rows can be inferred from the locations of the columns; for example, AS1 has temporal grid points 0 < 0.16 < 0.48 < 0.8. AS3* has the same temporal grid points as AS3 except that it augments the dataset directly instead of replacing it, as discussed in Section 8.
In contrast, each network from the DAGGER method acts on the dynamical system until the final grid. For stiff dynamics that accumulate errors quickly, the time-state pairs at later time grids generated by a network from earlier iterations can deviate significantly from those the optimal policy would visit. Such data might then deteriorate the performance instead, as supported by the numerical results below.

Computation cost. With the same time grids (say $0 = t_0 < t_1 < \cdots < t_K = T$), we argue that the training efforts of the IVP enhanced sampling method and single-iteration DAGGER are approximately the same, since the costs of data labeling and of training are each approximately equal. First, in terms of the computation cost of data labeling, the IVP enhanced sampling method requires labeling $K$ datasets of $M$ trajectories each, and single-iteration DAGGER labels the same amount of data. DAGGER does require less effort in solving the IVPs, since the IVP enhanced sampling method always solves IVPs from $t = 0$. However, for many problems, the time spent solving IVPs is negligible compared to that spent solving the open-loop optimal control problems. Specifically, in our first numerical example of optimal landing, the BVP computation takes about 2.5 hours in total while the IVP integration takes only 7 minutes in total. In the second numerical example of the reaching problem, solving 100 trajectories takes 308 seconds for the open-loop solution through DDP but only 3 seconds for the IVP integration. Therefore, the cost of data labeling is approximately the same. Second, in terms of training time, let us assume that the initial dataset contains $M$ trajectories and each trajectory contributes $N$ time-state-action tuples. The IVP enhanced sampling method needs to train $K$ networks; each network is trained on an enhanced dataset with $M$ trajectories.
Then, in total $MKN$ time-state-action tuples are visited in the IVP enhanced sampling method (the same data is counted repeatedly when training different networks; the same below). DAGGER trains two networks, one with $M$ trajectories and the other with $MK$ trajectories. The former dataset contains $MN$ time-state-action tuples, while the latter contains approximately
$$M\Big(\frac{T - t_0}{T} + \frac{T - t_1}{T} + \cdots + \frac{T - t_{K-1}}{T}\Big)N = M\Big(K - \frac{1}{T}(t_0 + t_1 + \cdots + t_{K-1})\Big)N$$
time-state-action tuples. Then, for time grids $t_0 = 0 < 10 < 14 < 16 = T$ (e.g., the landing problem of a quadrotor below), DAGGER visits approximately 16.67% less data than the IVP enhanced sampling method; for time grids $t_0 = 0 < 0.16 < 0.64 < 0.8 = T$ and $t_0 = 0 < 0.16 < 0.48 < 0.8 = T$ (e.g., the reaching problem of the manipulator below), DAGGER visits the same amount of data as, and approximately 6.67% more data than, the IVP enhanced sampling method, respectively. Therefore, with the same number of epochs for training each network, the training efforts of the two methods do not differ much.

In the following, we compare the IVP enhanced sampling method with DAGGER. We will see that, for the LQR example in Section 5 and Appendix B, the IVP enhanced sampling method surpasses DAGGER both theoretically and numerically, especially for large $T$. For the landing and reaching problems studied previously (see Appendices F and E), both methods perform similarly well. However, in more difficult settings of both problems, the IVP enhanced sampling method outperforms DAGGER.

For DAGGER, we again choose $K = T$ and the temporal grid points $t_i = i$ for $0 \le i \le K$. We first sample $N$ initial points $\{\hat{x}_j^0\}_{j=1}^N$ from the standard normal distribution and then generate $N$ approximate optimal paths starting at $t_0 = 0$:


$$\hat{u}_j^0(t) = -\frac{T}{T^2+1}\,\hat{x}_j^0 + \epsilon Z_j^0, \qquad \hat{x}_j^0(t) = \frac{T(T-t)+1}{T^2+1}\,\hat{x}_j^0 + \epsilon t Z_j^0,$$

where $\{Z_j^0\}_{j=1}^N$ are i.i.d. normal random variables with mean $m$ and variance $\sigma^2$, independent of the initial states. We then train the closed-loop controller $u^0$ by solving the least squares problem

$$\min_\theta \int_0^T \sum_{j=1}^N \big|\hat{u}_j^0(t) - u_\theta(t, \hat{x}_j^0(t))\big|^2\,dt.$$

Then, we use $u^0$ to solve the IVPs on the whole time horizon $[0, T]$ with initial states $\{\hat{x}_j^0\}$:

$$\dot{x}_j^0(t) = u^0(t, x_j^0(t)), \qquad x_j^0(0) = \hat{x}_j^0, \qquad 1 \le j \le N, \quad (25)$$

and collect $\{\hat{x}_j^i\}_{j=1}^N$ as $\hat{x}_j^i := x_j^0(i)$ for $i = 1, 2, \dots, T-1$. At each time grid $t_i = i$, we then compute $N$ approximate optimal paths starting from $\{\hat{x}_j^i\}_{j=1}^N$:

$$\hat{u}_j^i(t) = -\frac{T}{T(T-i)+1}\,\hat{x}_j^i + \epsilon Z_j^i, \qquad \hat{x}_j^i(t) = \frac{T(T-t)+1}{T(T-i)+1}\,\hat{x}_j^i + (t-i)\epsilon Z_j^i, \qquad t \in [i, T], \quad (26)$$

where $\{Z_j^i\}_{1 \le i \le T-1,\, 1 \le j \le N}$ are i.i.d. normal random variables with mean $m$ and variance $\sigma^2$, independent of $\{\hat{x}_j^0\}_{j=1}^N$ and $\{Z_j^0\}_{j=1}^N$. Finally, we collect the optimal paths $\{(\hat{u}_j^i, \hat{x}_j^i)\}_{0 \le i \le T-1,\, 1 \le j \le N}$ to train the closed-loop controller $u_d$ by solving the least squares problems

$$\min_\theta \int_i^{i+1} \sum_{k=0}^{i} \sum_{j=1}^N \big|\hat{u}_j^k(t) - u_\theta(t, \hat{x}_j^k(t))\big|^2\,dt \quad (27)$$

for $i = 0, 1, \dots, T-1$.

Theorem 2. Under Model 1 (4), define the IVP generated by $u_d$:

$$\dot{x}_d(t) = u_d(t) := u_d(t, x_d(t)), \qquad x_d(0) = x_{\mathrm{init}}, \qquad 0 \le t \le T, \quad (28)$$

and the total cost

$$J_d = \frac{1}{T}\int_0^T |u_d(t)|^2\,dt + |x_d(T)|^2.$$

If $x_{\mathrm{init}}$ is a fixed point, then

$$\mathbb{E} J_d - J_o \ge \Big(\frac{T^2 m^2}{4} + \frac{T\sigma^2}{3N}\Big)\epsilon^2.$$

Proof. With the same approach used to compute $u_v$ in (16), we have

$$u^0(t, x) = -\frac{T}{T(T-t)+1}\,x + \frac{T^2+1}{T(T-t)+1}\,\epsilon \bar{Z}^0, \qquad \text{where } \bar{Z}^0 = \frac{1}{N}\sum_{j=1}^N Z_j^0.$$

Recalling the definition of $x_j^0$ in equation (25), we have

$$x_j^0(t) = \frac{T(T-t)+1}{T^2+1}\,\hat{x}_j^0 + \epsilon t \bar{Z}^0.$$

Hence, for $0 \le i \le T-1$, we have

$$\hat{x}_j^i = x_j^0(i) = \frac{T(T-i)+1}{T^2+1}\,\hat{x}_j^0 + \epsilon i \bar{Z}^0.$$
Plugging the last equation into equation (26), we have that for $t \in [i, T]$,

$$\hat{u}_j^i(t) = -\frac{T}{T^2+1}\,\hat{x}_j^0 - \epsilon \frac{Ti}{T(T-i)+1}\,\bar{Z}^0 + \epsilon Z_j^i,$$
$$\hat{x}_j^i(t) = \frac{T(T-t)+1}{T^2+1}\,\hat{x}_j^0 + \epsilon\,\frac{iT(T-t) + i}{T(T-i)+1}\,\bar{Z}^0 + (t-i)\epsilon Z_j^i.$$

Therefore,

$$\hat{u}_j^i(t) = -\frac{T}{T(T-t)+1}\,\hat{x}_j^i(t) + \frac{T(T-i)+1}{T(T-t)+1}\,\epsilon Z_j^i.$$

Solving the least squares problem (27), we obtain that for $0 \le i \le T-1$ and $t \in [i, i+1)$,

$$u_d(t, x) = -\frac{T}{T(T-t)+1}\,x + \frac{1}{i+1}\sum_{k=0}^{i} \frac{T(T-k)+1}{T(T-t)+1}\,\epsilon \bar{Z}^k, \qquad \text{where } \bar{Z}^i = \frac{1}{N}\sum_{j=1}^N Z_j^i \text{ for } 0 \le i \le T-1.$$

Solving the ODE (28), we then have that for $0 \le i \le T-1$ and $t \in [i, i+1)$,

$$x_d(t) = \frac{F(t)}{T^2+1}\,x_{\mathrm{init}} + \sum_{k=0}^{i-1} \frac{F(t)}{F(k+1)F(k)} \cdot \frac{1}{k+1}\sum_{l=0}^{k} \epsilon F(l)\,\bar{Z}^l + \frac{t-i}{(i+1)F(i)}\sum_{k=0}^{i} F(k)\epsilon\,\bar{Z}^k,$$

where $F(t) = T(T-t)+1$. Therefore, for $0 \le i \le T-1$ and $t \in [i, i+1)$,

$$u_d(t) = -\frac{T}{T^2+1}\,x_{\mathrm{init}} - \sum_{k=0}^{i-1} \frac{T}{(k+1)F(k)F(k+1)}\sum_{l=0}^{k} \epsilon F(l)\,\bar{Z}^l + \frac{1}{(i+1)F(i)}\sum_{k=0}^{i} F(k)\epsilon\,\bar{Z}^k.$$

Define

$$e_i = -\sum_{k=0}^{i-1} \frac{T}{(k+1)F(k)F(k+1)}\sum_{l=0}^{k} \epsilon F(l)\,\bar{Z}^l + \frac{1}{(i+1)F(i)}\sum_{k=0}^{i} F(k)\epsilon\,\bar{Z}^k, \qquad 0 \le i \le T-1.$$

Then

$$J_d = \frac{1}{T}\sum_{i=0}^{T-1}\Big| -\frac{T}{T^2+1}\,x_{\mathrm{init}} + e_i \Big|^2 + \Big| x_{\mathrm{init}} - \frac{T^2}{T^2+1}\,x_{\mathrm{init}} + \sum_{i=0}^{T-1} e_i \Big|^2$$
$$= \frac{T^2 |x_{\mathrm{init}}|^2}{(T^2+1)^2} - \sum_{i=0}^{T-1}\frac{2 x_{\mathrm{init}}\, e_i}{T^2+1} + \frac{1}{T}\sum_{i=0}^{T-1}|e_i|^2 + \frac{|x_{\mathrm{init}}|^2}{(T^2+1)^2} + \sum_{i=0}^{T-1}\frac{2 e_i\, x_{\mathrm{init}}}{T^2+1} + \Big|\sum_{i=0}^{T-1} e_i\Big|^2$$
$$= \frac{|x_{\mathrm{init}}|^2}{T^2+1} + \frac{1}{T}\sum_{i=0}^{T-1}|e_i|^2 + \Big|\sum_{i=0}^{T-1} e_i\Big|^2. \quad (29)$$

On the one hand, since each $\bar{Z}^k$ has mean $m$,

$$\mathbb{E}\sum_{i=0}^{T-1} e_i = \epsilon m \sum_{i=0}^{T-1} \frac{(T^2+1)(i+1) - Ti(i+1)/2}{(i+1)F(i)F(i+1)} = \epsilon m \sum_{i=0}^{T-1} \frac{T^2+1 - Ti/2}{[T(T-i)+1][T(T-i-1)+1]} \quad (30)$$
$$\ge \epsilon m\,\frac{T^2+T+2}{2T}\sum_{i=0}^{T-1}\Big[\frac{1}{T(T-i-1)+1} - \frac{1}{T(T-i)+1}\Big] = \epsilon m\,\frac{T^2+T+2}{2T}\Big(1 - \frac{1}{T^2+1}\Big) \ge \frac{\epsilon m T}{2}.$$
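As a sanity check on the algebra above, the following stdlib-only Python snippet (not part of the proof, purely illustrative) verifies in exact rational arithmetic both the feedback form of $\hat{u}_j^i(t)$ and the chain of lower bounds on $\sum_{i=0}^{T-1} (T^2+1-Ti/2)/(F(i)F(i+1))$; the function names are ours.

```python
from fractions import Fraction as Fr
import random

def F(t, T):
    # F(t) = T(T - t) + 1, as in the proof above
    return T * (T - t) + 1

def check_identity(T, trials=20):
    # Exact check (over the rationals) that the displayed expressions for
    # u_hat_i(t) and x_hat_i(t) satisfy
    #   u_hat_i(t) = -T/F(t) * x_hat_i(t) + F(i)/F(t) * eps * Z_i.
    for _ in range(trials):
        i = random.randrange(T)
        t = Fr(random.randrange(10 * i, 10 * T), 10)     # t in [i, T)
        x0, Zbar, Zi = (Fr(random.randrange(-9, 10)) for _ in range(3))
        eps = Fr(1, 10)
        u_hat = (-Fr(T, T**2 + 1) * x0
                 - eps * Fr(T * i, F(i, T)) * Zbar + eps * Zi)
        x_hat = (F(t, T) * x0 / (T**2 + 1)
                 + eps * i * F(t, T) / F(i, T) * Zbar + (t - i) * eps * Zi)
        assert u_hat == -T * x_hat / F(t, T) + F(i, T) * eps * Zi / F(t, T)

def lower_bound_chain(T):
    # S = sum_i (T^2 + 1 - T i / 2) / (F(i) F(i+1)); the proof claims
    # S >= (T^2 + T + 2)/(2T) * (1 - 1/(T^2 + 1)) >= T/2.
    S = sum(Fr(2 * (T**2 + 1) - T * i, 2) / (F(i, T) * F(i + 1, T))
            for i in range(T))
    mid = Fr(T**2 + T + 2, 2 * T) * (1 - Fr(1, T**2 + 1))
    assert S >= mid >= Fr(T, 2)
    return S

for T in range(1, 30):
    check_identity(T)
    lower_bound_chain(T)
```

Exact `Fraction` arithmetic avoids floating-point tolerances, so the identity check is a strict equality test.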
On the other hand, noticing that the $\bar{Z}^k$ are independent with variance $\sigma^2/N$, we have

$$\mathrm{Var}\Big(\sum_{i=0}^{T-1} e_i\Big) = \frac{\epsilon^2\sigma^2}{N}\sum_{k=0}^{T-1} F^2(k)\Big(\sum_{i=k}^{T-1}\frac{1}{(i+1)F(i)F(i+1)}\Big)^2 \quad (31)$$
$$\ge \frac{\epsilon^2\sigma^2}{N T^4}\sum_{k=0}^{T-1}[T(T-k)+1]^2\Big(\sum_{i=k}^{T-1}\Big[\frac{1}{T(T-i-1)+1} - \frac{1}{T(T-i)+1}\Big]\Big)^2$$
$$= \frac{\epsilon^2\sigma^2}{N T^4}\sum_{k=0}^{T-1}[T(T-k)+1]^2\Big[1 - \frac{1}{T(T-k)+1}\Big]^2 = \frac{\epsilon^2\sigma^2}{N T^2}\sum_{k=0}^{T-1}(T-k)^2 = \frac{\epsilon^2\sigma^2}{N T^2}\cdot\frac{T(T+1)(2T+1)}{6} \ge \frac{\epsilon^2\sigma^2 T}{3N}.$$

Combining equations (29), (30), and (31), together with $\mathbb{E}\big|\sum_i e_i\big|^2 = \big|\mathbb{E}\sum_i e_i\big|^2 + \mathrm{Var}\big(\sum_i e_i\big)$, we conclude the result.

We then numerically compare the performance of DAGGER with the vanilla method and the IVP enhanced sampling method on Model 2 (5). In these experiments, we again set $\epsilon = 0.1$, $m = 0.1$, and $\sigma^2 = 1$. We present the paths generated by the optimal controller and the controllers learned by the vanilla method, the IVP enhanced sampling method, and DAGGER in Figure 9 (left). Figure 9 (middle) compares the performance of these three methods on different time horizons $T$. We also test DAGGER with multiple iterations in Figure 9 (right); here we set the total time $T = 30$. The experiment shows that the performance of the learned controller does not improve with more iterations. The detailed settings are identical to the numerical experiments in Appendix B.

Results on the landing problem. Following the same settings as in Appendix E, we train a policy by DAGGER with additional sampling at times $t = 10$ and $14$. The results are summarized in Table 5. They show that DAGGER performs closely to the IVP enhanced sampling method. However, if we decrease the number of trajectories in the initial dataset from 500 to 300 to train controllers using these two methods, we observe a larger performance drop for DAGGER, which implies that DAGGER is more sensitive to the amount of data. The main reason is that DAGGER demands enough data at the beginning to obtain a good initial controller that can explore the state over the whole time interval.
However, in complicated control problems, we do not have such a privilege and indeed require adaptive sampling to improve the controller.

              # of iterations    result
AS                   3           1.092
DAGGER               2           1.069
AS-300               3           1.287
DAGGER-300           2           1.405

Table 5: Average cost ratio on 200 test points of controllers trained by the IVP enhanced sampling method and DAGGER. All models use time grids 0 < 10 < 14 < 16. Models with the suffix 300 are trained on an initial dataset with 300 trajectories; their cost ratios are clipped at 10.0 for each test trajectory.

Results on the reaching problem. For the problem detailed in Appendix F, by additionally sampling at times 0.16 and 0.64 in one iteration, the DAGGER algorithm achieves a policy cost / optimal cost ratio of 1.049 on the test dataset, which is close to that achieved by the IVP enhanced sampling method. We then increase the difficulty of the control problem by increasing the moving distance. Following the configurations in Appendix F, we change the center of the initial position and the terminal position to
$$q_0 = [1.60, 1.30, 2.70, -0.85, -1.90, 0.95, -1.60]^\top, \qquad q_1 = [2.75, 0.60, 2.00, -1.55, -2.15, 0.00, -2.60]^\top.$$
Besides, since DAGGER generates states in a wider range, we modify $u_{\min}, u_{\max}$ in QRNet to $u_{\min} = -2000$, $u_{\max} = 2000$ in order to avoid saturation. We also increase the size of the initial dataset from 100 trajectories to 200 trajectories. Each network is trained for 1500 epochs. The other settings are the same as those in Appendix F. Each method is run 5 times independently, and we report the average and best performance. The results are summarized in Table 6 and Figure 10. As we can see, the IVP enhanced sampling method is capable of finding a closed-loop controller whose average ratio between policy cost and optimal cost reaches 1.0155. However, the DAGGER algorithm cannot yield such a satisfactory result.
In DAGGER1, we apply the DAGGER method with time grids 0 < 0.16 < 0.48 < 0.8, an earlier final grid compared to DAGGER2, which has time grids 0 < 0.16 < 0.64 < 0.8. We see an average improvement from 1.8528 to 1.6327. DAGGER1 performs similarly to the network trained at the second iteration of AS, which implies that the extra data sampled at t = 0.48 does not help much. Furthermore, we conduct an additional iteration of the DAGGER method, which requires solving the open-loop problem for 400 more trajectories and extra training of the network with 1000 trajectories in total. It performs worse, which also confirms the argument made at the beginning that carelessly collected data may deteriorate the performance.

Table 6: The mean ratio between policy costs and optimal costs for the reaching problem with larger moving distance. In each cell, the first number is averaged over 5 independent experiments and the number in parentheses is the average ratio achieved by the best controller among the 5 independent experiments. The ratio is clipped at 2.0 for each test trajectory. Both AS and DAGGER2 have temporal grid points 0 < 0.16 < 0.64 < 0.8. DAGGER3 repeats DAGGER2 for one more DAGGER iteration.
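To make the procedural difference between the two sampling schemes concrete, here is a minimal, stdlib-only Python sketch on the toy Model 1 problem (dynamics $\dot{x} = u$). The helper names (`open_loop_solve`, `fit_policy`, `rollout`) are our own stand-ins, and `fit_policy` simply returns the toy problem's exact optimal feedback instead of training a network, so the sketch only illustrates the control flow of data collection, not learning:

```python
import random

# Toy instance of Model 1: x' = u on [0, T], cost (1/T)∫|u|^2 dt + |x(T)|^2.
T, K, M = 4, 4, 8
grids = list(range(K))                 # t_i = i, with t_0 = 0

def F(t):
    return T * (T - t) + 1             # F(t) = T(T - t) + 1

def open_loop_solve(t0, x0):
    # Exact open-loop optimum from (t0, x0): constant control -T x0 / F(t0).
    return {"t0": t0, "x0": x0, "u": -T * x0 / F(t0)}

def fit_policy(dataset):
    # Stand-in for neural-network training: returns the exact optimal
    # feedback directly; a real implementation would regress on `dataset`.
    return lambda t, x: -T * x / F(t)

def rollout(policy, x0, t_end, dt=1e-3):
    # Forward-Euler integration of x' = policy(t, x) from t = 0 to t_end.
    x, t = x0, 0.0
    while t < t_end - 1e-12:
        x += dt * policy(t, x)
        t += dt
    return x

x_init = [random.gauss(0, 1) for _ in range(M)]

# IVP enhanced sampling: stage i rolls the latest policy out only to t_i and
# labels the states visited there; later grids are reached by later policies.
ivp_data = [open_loop_solve(0, x) for x in x_init]
for t_i in grids[1:]:
    policy = fit_policy(ivp_data)
    ivp_data += [open_loop_solve(t_i, rollout(policy, x, t_i)) for x in x_init]

# Single-iteration DAGGER: the initial policy alone is rolled out over the
# whole horizon, and the states it visits at every grid are labeled at once.
dagger_data = [open_loop_solve(0, x) for x in x_init]
policy = fit_policy(dagger_data)
for t_i in grids[1:]:
    dagger_data += [open_loop_solve(t_i, rollout(policy, x, t_i)) for x in x_init]

assert len(ivp_data) == len(dagger_data) == K * M   # same labeling budget
```

The final assertion reflects the labeling-budget comparison above: both schemes call the open-loop solver for $KM$ trajectories; they differ only in which policy generates the initial states at later grids.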



In this section, we adopt the convention that $\sum_{k=m}^{n} c_k = 0$ if $m > n$.



Figure 2: Left: the average pointwise distance between the training data and the data reached by controllers at different times. Right: the maximum mean discrepancy (on a logarithmic scale) between the training data and the data reached by controllers at every time, using the Gaussian kernel.

Figure 3: Cumulative distribution function of the cost ratio (the ideal curve is a straight horizontal segment passing through ratio = 1, percentage = 100%). Left: results of the IVP enhanced method on 500 training initial points. Middle: results of the IVP enhanced method on 200 test initial points. Right: comparison between different sampling methods.

Figure 4: Cumulative distribution functions of cost ratios (the ideal curve is a straight horizontal segment passing through ratio = 1, percentage = 100%) under different training schemes (left) and different intensities of measurement noise (right).

Figure 5 (middle) shows how the time t influences the differences in second-order moments between the state distribution of the training data and the state distribution of the IVPs generated by the learned controllers, for the vanilla method and the IVP enhanced sampling method. We set the total time T = 100 and N = 100. Figure 5 (right) compares the performance of the vanilla method and the IVP enhanced sampling method for different total times T. The performance difference is an empirical estimate of E[J_v - J_o] and E[J_a - J_o] when x_init follows a standard normal distribution. In this experiment, for each method, we set N = 100, learn 10 different controllers with different realizations of the training data, and average the performance difference over 1000 randomly sampled initial points (from a standard normal distribution) and the 10 learned controllers.

Figure 9: Numerical results on learning Model 2 (5) with DAGGER.

Figure 10: Cumulative distribution functions of average cost ratios over 5 independent experiments (left) and of cost ratios of the best controller among the 5 independent experiments (right), under the proposed method and DAGGER, for the optimal reaching problem with a larger moving distance.

Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pp. 627-635. JMLR Workshop and Conference Proceedings, 2011.

Jihao Long, Xuanxi Zhang, Wei Hu, Weinan E, and Jiequn Han. A machine learning enhanced algorithm for the optimal landing problem. In 3rd Annual Conference on Mathematical and Scientific Machine Learning, pp. 1-20. PMLR, 2022.

Linfeng Zhang, Han Wang, and Weinan E. Reinforced dynamics for enhanced sampling in large atomic and molecular systems. The Journal of Chemical Physics, 148(12):124113, 2018.

Zhigen Zhao, Simiao Zuo, Tuo Zhao, and Ye Zhao. Adversarially regularized policy learning guided by trajectory optimization. In Learning for Dynamics and Control Conference, pp. 844-857. PMLR, 2022.

The numerical results of the IVP enhanced sampling method on training and test points. The results in Tables (a) and (b) correspond to Figure

Comparison of different sampling methods.

Average cost ratio on 200 test points for every model. The first line means that we take 2 iterations, with corresponding temporal grid points for adaptive sampling 0 < 14 < 16.



Results on the LQR example. We first investigate the performance of DAGGER on the LQR example. With a slight abuse of notation, we use $\hat{x}_j^i(t)$ and $\hat{u}_j^i(t)$ to denote the open-loop optimal paths sampled for training, $\hat{x}_j^i$ to denote the initial states used to generate the training trajectories, $u_d(t, x)$ to denote the closed-loop controller learned by DAGGER, and $x_d(t)$, $u_d(t)$ to denote the IVP solution generated by the closed-loop controller $u_d$.

