LEARNING INSTANCE-SOLUTION OPERATOR FOR OPTIMAL CONTROL

Abstract

Optimal control problems (OCPs) involve finding a control function for a dynamical system such that a cost functional is optimized. They are central to physical system research in both academia and industry. In this paper, we propose a novel instance-solution operator learning perspective, which solves OCPs in a one-shot manner with no dependence on the explicit expression of dynamics or iterative optimization processes. The design is in principle endowed with substantial speedup in running time, and the model reusability is guaranteed by high-quality in- and out-of-distribution generalization. We theoretically validate the perspective by presenting approximation bounds for instance-solution operator learning. Experiments on 7 synthetic environments and a real-world dataset verify the effectiveness and efficiency of our approach. The source code will be made publicly available.

1. INTRODUCTION

The explosion of data for embedding the physical world is reshaping the ways we understand, model, and control dynamical systems. Though control theory has been classically rooted in a model-based design and solving paradigm, the demands of model reusability and the opacity of complex dynamical systems call for a rapprochement of modern control theory, machine learning, and optimization. Recent years have witnessed emerging trends of control theories with successful applications to engineering and scientific research, such as robotics (Krimsky & Collins, 2020), aerospace technology (He et al., 2019), and economics and management (Lapin et al., 2019).

We consider the well-established formulation of optimal control (Kirk, 2004) in finite time horizon $T = [t_0, t_f]$. Denote $X$ and $U$ as two vector-valued function sets, representing state functions and control functions respectively. Functions in $X$ (resp. $U$) are defined over $T$ and have their outputs in $\mathbb{R}^{d_x}$ (resp. $\mathbb{R}^{d_u}$). State functions $x \in X$ and control functions $u \in U$ are governed by a differential equation. The optimal control problem (OCP) is targeted at finding a control function that minimizes the cost functional $f$ (Lions, 1992; Kirk, 2004; Vinter & Vinter, 2010; Lewis et al., 2012):

$$\min_{u \in U} f(x, u) = \int_{t_0}^{t_f} p(x(t), u(t))\, dt + h(x(t_f)) \tag{1a}$$
$$\text{s.t.} \quad \dot{x}(t) = d(x(t), u(t)), \tag{1b}$$
$$x(t_0) = x_0, \tag{1c}$$

where $d$ is the dynamics of the differential equations; $p$ evaluates the cost along the trajectory and $h$ evaluates the cost at the terminal state $x(t_f)$; and $x_0$ is the initial state. We restrict our discussion to differential equation-governed optimal control problems, leaving control problems in stochastic networks (Dai & Gluzman, 2022), inventory management (Abdolazimi et al., 2021), etc. out of the scope of this paper. The analytic solution of Eq. 1 is usually unavailable, especially for complex dynamical systems.
Thus, there has been a wealth of research towards accurate, efficient, and scalable numerical OCP solvers (Rao, 2009) and neural network based solvers (Kiumarsi et al., 2017) in recent years. However, both classic and modern OCP solvers face challenges, especially in the big data era, which we briefly discuss as follows.

1) Opacity of Dynamical Systems. Existing works (Böhme & Frank, 2017a; Effati & Pakdaman, 2013; Jin et al., 2020) assume the dynamical systems a priori and exploit their explicit forms to ease

Table 1: Comparison of modern optimal control approaches. The proposed OptCtrlOP naturally covers all the merits in the sense of performing a single-phase direct-mapping paradigm that does not rely on known system dynamics, and supports arbitrary input-domain queries.

5) Running Phase. A two-phase model (Chen et al., 2018; Wang et al., 2021a; Hwang et al., 2022) can (partially) overcome the above issues at the cost of introducing an auxiliary dynamics inference phase. This thread of works first approximates the state dynamics by a differentiable surrogate model and then, in its second phase, solves an optimization problem for control variables (more explanation in Appx. B). However, the two-phase paradigm increases computational cost and manifests inconsistency between the two phases. A motivating example in Fig. 1 shows that the two-phase paradigm can lead to critical failures: when the domain of the phase-2 optimization goes outside the training distribution of phase 1, the method might collapse. Table 1 compares the methods regarding the above aspects.

We propose an instance-solution operator perspective for learning to solve OCPs, thereby tackling the issues above. The contributions are:

1. We propose the operator perspective and solve OCPs by learning direct mappings from OCPs to their solutions. The design holds the following merits.
The system dynamics is implicitly learned during training, which relies on neither an explicit form of the system nor an optimization process at test time. As such, the operator can be reused and generalized to similarly-formed OCPs without retraining, a generalization ability that is missing even for learning-free solvers. The single-phase direct mapping paradigm avoids iterative processes and yields substantial speedup.

2. We theoretically validate the instance-solution mapping perspective by leveraging Pontryagin's Maximum Principle, thereby converting Eq. 1 to a boundary value problem. We instantiate a neural solver, OptCtrlOP (Optimal Control OPerator), and derive bounds on its approximation error.

3. Experiments on both synthetic and real systems show that OptCtrlOP is versatile for various forms of OCPs. It achieves about 100x speedup against the MLP baseline, and 10Kx speedup (on synthetic environments) over classical direct method solvers. It also generalizes well on both in- and out-of-distribution OCP instances.

Related Works. Most OCPs cannot be solved analytically, so numerical methods have been developed, which can be mainly categorized into three groups: direct methods, indirect methods, and dynamic programming. Our work bears a resemblance to indirect methods. Due to the page limit, we leave the detailed discussion of related works to Appendix B.

2. METHODOLOGY

In this section, we present the instance-solution operator perspective for solving OCPs. Inspired by Pontryagin's Maximum Principle (PMP) (Pontryagin, 1987), OCPs (Eq. 1) can be converted to a boundary value problem (BVP), the solution of which is a typical function operator. The operator can further be approximated accurately by neural networks with theoretical guarantees. Hence, we propose OptCtrlOP, an end-to-end OCP neural solver that learns the underlying infinite-dimensional operator $G$ governed by the BVP. The input of operator $G$ is the cost functional $f$, and the output is the optimal control function $u^*$, i.e. $G : F \to U$. OptCtrlOP learns such an operator from paired data by minimizing the error between the predicted and optimal control, without explicit knowledge of the dynamics. Our theoretical analysis guarantees that for an arbitrarily small approximation error tolerance, there exists an instance of OptCtrlOP that satisfies the tolerance with bounded size and depth.

2.1. INSTANCE-SOLUTION OPERATOR PERSPECTIVE OF OCP

This section presents the pre-condition of our model, a novel perspective that converts OCPs into an operator that maps problem instances (defined by cost functionals and dynamics) into their solutions. The conversion has two steps: 1) convert the OCP into a BVP by PMP; 2) define the operator based on the BVP, as explained below. Here we assume the dynamics are governed by deterministic differential equations (DEs); the derivation for stochastic DEs (SDEs) (Carius et al., 2022; Kishida & Ogura, 2022) is similar but based on the Hamilton-Jacobi-Bellman (HJB) equation (Yong & Zhou, 1999), which is left for future work.

We elaborate how PMP converts the OCP into a BVP, following the derivation of the first-order necessary optimality conditions in the calculus of variations. To begin with, define the Hamiltonian $H$ of Eq. 1:

$$H(x, u, \lambda) := p(x, u) + \lambda^\top d(x, u).$$

Then suppose $u^* \in U$ is the optimal control function, and $x^* \in X$ is the optimal state trajectory. The PMP asserts that there exists a co-state (adjoint) function $\lambda^* : T \to \mathbb{R}^{d_x}$ such that the following BVP is satisfied (Pontryagin, 1987):

$$
\begin{aligned}
\text{dynamics:} \quad & \dot{x}^* = \frac{\partial}{\partial \lambda} H(x^*, u^*, \lambda^*), \\
\text{co-state:} \quad & \dot{\lambda}^* = -\frac{\partial}{\partial x} H(x^*, u^*, \lambda^*), \\
\text{optimal control:} \quad & 0 = \frac{\partial}{\partial u} H(x^*, u^*, \lambda^*), \\
\text{boundary conditions:} \quad & \lambda^*(t_f) = \frac{\partial}{\partial x} h(x^*(t_f)), \quad x^*(t_0) = x_{\mathrm{init}}.
\end{aligned}
$$

A BVP is composed of a set of DEs along with boundary conditions. The BVP defines an implicit mapping from the cost functional and dynamics, both of which can be represented by functions, to the optimal control function, with the state and co-state being intermediate variables. Based on the BVP, one can define an operator according to concrete problem settings or protocols. In the theoretical analysis of this paper, we assume that the initial state $x_{\mathrm{init}}$ and the dynamics $d : X \times U \to \mathbb{R}$ are unknown but constant (Hwang et al., 2022). Consequently, the state $x$ is uniquely determined by the control $u$, and thus the cost functional $f : X \times U \to \mathbb{R}$ can be rewritten interchangeably as $f_d : U \to \mathbb{R}$.
Under this protocol, the operator is defined as $G : F \to U$, a mapping from cost functional to optimal control. In this way, we have converted the OCPs into an operator, which directly maps OCP instances to their solutions. Theoretically, the operator is able to solve all OCPs of the same type (e.g. governed by the same dynamics $d$), and is thus highly reusable and efficient. Since in practice such a non-linear operator can hardly have a closed-form expression, we propose the following neural model.
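As a worked illustration of the two-step conversion, consider a toy scalar problem that is not taken from the paper: dynamics $d(x,u) = u$, running cost $p = \tfrac{1}{2}u^2$, and terminal cost $h = \tfrac{c}{2}(x(t_f) - x_{\mathrm{goal}})^2$ with an assumed weight $c > 0$. Under these assumptions the BVP can be solved by hand, and the resulting operator maps the target state (which parameterizes the cost functional) to a constant optimal control:

```latex
% Hamiltonian of the toy problem
H(x, u, \lambda) = \tfrac{1}{2}u^2 + \lambda u
% PMP conditions
\dot{x}^* = \partial_\lambda H = u^*, \qquad
\dot{\lambda}^* = -\partial_x H = 0, \qquad
0 = \partial_u H = u^* + \lambda^*
% boundary conditions
\lambda^*(t_f) = c\,\bigl(x^*(t_f) - x_{\mathrm{goal}}\bigr), \qquad x^*(t_0) = x_{\mathrm{init}}
% \lambda^* is constant, u^* = -\lambda^*, hence with \Delta := t_f - t_0:
x^*(t_f) = x_{\mathrm{init}} - \lambda^* \Delta
\;\Longrightarrow\;
\lambda^* = \frac{c\,(x_{\mathrm{init}} - x_{\mathrm{goal}})}{1 + c\,\Delta}
% instance-solution operator for this toy family
G(f)(t) = u^*(t) = \frac{c\,(x_{\mathrm{goal}} - x_{\mathrm{init}})}{1 + c\,\Delta}
\quad \text{for all } t \in [t_0, t_f]
```

Here the operator happens to be affine in $x_{\mathrm{goal}}$; for nonlinear dynamics no such closed form is expected, which is the case the neural model targets.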

2.2. ARCHITECTURE OF THE PROPOSED NEURAL MODEL

From the operator perspective discussed above, constructing an OCP solver is equivalent to an operator learning problem. Inspired by DeepONet (Lu et al., 2021), we propose Operator Learning based Control Network (OptCtrlOP) to solve the OCP by learning the non-linear operator $G$ directly, and theoretically estimate the error bounds. Specifically, we decompose OptCtrlOP into three components, as suggested in (Lanthaler et al., 2022):

1. Encoder. We define the following mapping as the encoder: $E : F \to \mathbb{R}^m$, which converts the infinite-dimensional cost functional $f$ into a finite-dimensional vector $e$ that can be processed by neural networks. For demonstration, we now let the encoder be an ad-hoc function that returns the target state $x_{\mathrm{goal}}$. Such a simplified encoder is still effective, since the form of the cost functional is assumed fixed, and the only parameter is the target state (Eq. 12, following Jin et al. (2020)). We term such an encoder physical-prior based. Otherwise, the encoder can be implemented by functional approximations, e.g. point-wise evaluation, basis expansion, or a neural network. For example, in the Heating system (Appendix E.6) the encoder output is the coefficients of a trigonometric polynomial.

2. Approximator. The encoded vector $e$ is mapped to another finite-dimensional vector $a$ by the approximator mapping defined as: $A : \mathbb{R}^m \to \mathbb{R}^p$. The approximator is implemented by a neural network termed the branch net $\beta$, which is a ReLU-activated multi-layer perceptron (MLP) (Goodfellow et al., 2016). The output vector $a = \beta(e)$ serves as the coefficient vector of the basis functions produced by the following reconstructor.

3. Reconstructor. It reconstructs the control function $u$ by the mapping: $R : \mathbb{R}^p \to U$. The reconstructor first constructs $p + 1$ basis functions via another MLP named the trunk net, $\tau : T \to \mathbb{R}^{p+1}$, a parametric mapping from time to basis function values. Whereas the original DeepONet trunk net output is $p$-dimensional, we add an additional bias function dimension as suggested by (Lanthaler et al., 2022). Then, the control function $u = R(a)$ is reconstructed by an affine combination of basis functions, where the coefficient vector $a$ is produced by the approximator:

$$R(a) := \tau_0 + a^\top \tau_{1:p}. \tag{4}$$

Combining the three components, OptCtrlOP maps the cost functional to a control function, $\mathcal{N} : F \to U$, and can be viewed as a composition of the above three mappings:

$$\mathcal{N} := R \circ A \circ E. \tag{5}$$

The architecture of OptCtrlOP is summarized in Fig. 2 and the pseudo code of the forward step is described in Alg. 1 in Appendix A.
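The encoder, approximator, and reconstructor composition can be sketched in a few lines of standard-library Python. This is a minimal illustration rather than the authors' implementation (their Alg. 1 is in Appendix A and not reproduced here): the layer widths, the random untrained weights, the linear output layers, and the helper names `init_mlp`, `mlp`, `optctrlop`, and `predict_grid` are all assumptions made for the example, and the control is taken to be scalar.

```python
import math
import random

def init_mlp(sizes, rng):
    """Random (untrained) weights for a fully connected net; sizes are assumed."""
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = math.sqrt(2.0 / n_in)
        W = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        layers.append((W, [0.0] * n_out))
    return layers

def mlp(layers, x):
    """ReLU MLP with a linear last layer (the linear output is an assumption)."""
    for k, (W, b) in enumerate(layers):
        x = [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]
        if k < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x

def encoder(x_goal):
    """Physical-prior encoder: the cost functional is represented by its target state."""
    return list(x_goal)

def optctrlop(branch, trunk, x_goal, t):
    """Evaluate tau_0(t) + a^T tau_{1:p}(t) for one instance and one time (Eq. 4)."""
    a = mlp(branch, encoder(x_goal))   # approximator: R^m -> R^p
    tau = mlp(trunk, [t])              # trunk net: T -> R^{p+1}
    return tau[0] + sum(ai * ti for ai, ti in zip(a, tau[1:]))

def predict_grid(branch, trunk, goals, times):
    """N instances x M times using only N branch passes and M trunk passes."""
    coeffs = [mlp(branch, encoder(g)) for g in goals]
    bases = [mlp(trunk, [t]) for t in times]
    return [[tau[0] + sum(ai * ti for ai, ti in zip(a, tau[1:])) for tau in bases]
            for a in coeffs]

rng = random.Random(0)
m, p = 3, 8                                  # encoder and coefficient dimensions (assumed)
branch = init_mlp([m, 16, p], rng)
trunk = init_mlp([1, 16, p + 1], rng)
goals = [[0.5, -0.2, 1.0], [0.1, 0.3, 0.9]]
times = [0.0, 0.25, 0.5, 0.75, 1.0]
U_hat = predict_grid(branch, trunk, goals, times)   # 2 x 5 grid of control estimates
```

Because the branch and trunk nets are evaluated separately, a grid of N instances and M query times needs N + M network passes before the cheap affine combination, which is one plausible reading of why the disentangled design is fast at inference.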

2.3. APPROXIMATION ERROR ESTIMATION

We give an estimate of the approximation error of the proposed model. The theoretical result guarantees that there exists a neural network instance of the OptCtrlOP architecture (Eq. 5) approximating the operator $G$ to arbitrary error tolerance. Furthermore, the size and depth of such a network are upper-bounded. The technical line of our analysis is partly inspired by Lanthaler et al. (2022), which provides error estimation for DeepONets (Lu et al., 2021).

The error of interest in OCP is the distance between the cost of the solution and the optimal cost. Let $\mu$ be the probability measure of cost functionals, with $\mu \in P_2(F)$, where $P_2(F)$ is the set of probability measures with finite second moments. We assume the control dimension $d_u = 1$ and $f_d > 0$, w.l.o.g. Then the approximation error of OptCtrlOP can be defined as:

$$\mathcal{E} := \int_F \frac{f_d \circ \mathcal{N}(f) - f_d \circ G(f)}{f_d \circ G(f)} \, d\mu(f). \tag{6}$$

Figure 2: The architecture of OptCtrlOP. The network takes 2 inputs: cost functional $f$ and time index $t$. The input is processed by the encoder, approximator, and reconstructor, as introduced in Section 2.2. Finally, it outputs $\hat{u}(t)$, the estimate of the optimal control at time $t$.

Following the decomposition described in Sec. 2.2, the approximation error of OptCtrlOP can also be decomposed into three parts: 1) encoder error, 2) approximator error, and 3) reconstructor error. The error of our simplified encoder is zero, since there exists a one-to-one mapping between the target state (encoder output) and the cost functional (encoder input). In other words, for a given OptCtrlOP encoder $E$, there exists an inverse mapping $E^{-1}$ such that $E^{-1} \circ E = \mathrm{Id}$. If the encoder is implemented by functional approximations, e.g. point-wise evaluation (Lu et al., 2021), then the encoder error should be considered (Appx. D). For a reconstructor $R$, its error $\mathcal{E}_R$ is estimated by the mismatch between $R$ and its approximate inverse mapping, the projector $P$, weighted by the push-forward measure $G_\# \mu(u) := \mu(G^{-1}(u))$.
$$\mathcal{E}_R := \left( \int_U \| R \circ P(u) - u \|_U^2 \, d(G_\# \mu)(u) \right)^{1/2}, \qquad P := \arg\min_{P} \mathcal{E}_R, \quad \text{s.t. } P \circ R = \mathrm{Id}.$$

Intuitively, such a reconstructor error quantifies the information loss induced by $R$. An ideal $R$ without any information loss should be invertible, i.e. its optimal inverse $P$ is exactly $R^{-1}$, and thus $\mathcal{E}_R = 0$. Given the encoder and reconstructor, and denoting the encoder output by $e \in \mathbb{R}^m$, the error $\mathcal{E}_A$ induced by the approximator $A$ is defined as the distance between the approximator output and the optimal coefficient vector, weighted by the push-forward measure $E_\# \mu(e) := \mu(E^{-1}(e))$:

$$\mathcal{E}_A := \left( \int_{\mathbb{R}^m} \| A(e) - P \circ G \circ E^{-1}(e) \|_{\ell^2(\mathbb{R}^p)}^2 \, d(E_\# \mu)(e) \right)^{1/2}.$$

With the definitions above, we can estimate the approximation error of OptCtrlOP by the error of each of its components, as stated in the following theorem (see detailed proof in Appendix C.1):

Theorem 1 (Decomposition of OptCtrlOP Approximation Error). Suppose the cost functional $f_d$ is Lipschitz continuous, with Lipschitz constant $\mathrm{Lip}(f_d)$. Define the constant $C = \sup_{f \in F} \frac{\mathrm{Lip}(f_d)}{f_d \circ G(f)}$. The approximation error $\mathcal{E}$ (Eq. 6) of an OptCtrlOP $\mathcal{N} = R \circ A \circ E$ is upper-bounded by

$$\mathcal{E} \le C \left( \mathrm{Lip}(R) \, \mathcal{E}_A + \mathcal{E}_R \right).$$

The estimation of the reconstructor error $\mathcal{E}_R$ is analyzed in the previous work (Lanthaler et al., 2022), which gives a detailed discussion of the error estimation of DeepONet, whose reconstructor error component is the same as that of OptCtrlOP. We cite the result to establish the following theorem:

Theorem 2 (Reconstructor Error (Lanthaler et al., 2022)). If $G$ defines a Lipschitz mapping $G : F \to H^s(T)$ for some $s > 0$, $M > 0$, with $\int_F \| G(f) \|_{H^s}^2 \, d\mu(f) \le M < \infty$, then there exists a constant $C = C(s, M) > 0$ such that for any $p \in \mathbb{N}$, there exists a trunk net $\tau : T \to \mathbb{R}^{p+1}$ (with bias term $\tau_0 \equiv 0$) whose associated reconstruction $R : \mathbb{R}^p \to U$ satisfies:

$$\mathrm{size}(\tau) \le C p \left( 1 + \log(p)^2 \right), \qquad \mathrm{depth}(\tau) \le C \left( 1 + \log(p)^2 \right), \qquad \mathcal{E}_R \le C p^{-s}.$$

Furthermore, the reconstruction $R$ and the optimal projection $P$ satisfy $\mathrm{Lip}(R), \mathrm{Lip}(P) \le 2$.
Note that $\mathrm{size}(\cdot)$ is defined as the number of trainable parameters of a neural network, and $\mathrm{depth}(\cdot)$ denotes the number of hidden layers. $H^s$ is a Sobolev space with $s$ degrees of regularity and $L^2$ norm. The proof (omitted here) is based on the observation that the reconstructor error is minimized when the trunk net outputs $\{\tau_1, \tau_2, \ldots, \tau_p\}$ are linearly independent. The observation is consistent with intuition, since the trunk net outputs are regarded as basis functions (Eq. 4).

Next, the error bound of the approximator is presented. Recall that the approximator learns a mapping between vector spaces by an MLP, whose error estimation is well studied in deep learning theory. One of the existing works (Gühring et al., 2020) derives the estimate based on the Sobolev regularity of the mapping. We extend their result to form the following theorem (proof given in Appendix C.2):

Theorem 3 (Approximator Error). Given the operator $G : F \to U$, encoder $E : F \to \mathbb{R}^m$, and reconstructor $R : \mathbb{R}^p \to U$, let $P$ denote the corresponding projector. If for some $s \in \mathbb{N}_{\ge 2}$, $M > 0$, the following bound is satisfied: $\| P \circ G \circ E^{-1} \|_{H^s(E_\# \mu)} \le M < \infty$, then there exists a constant $C = C(m, s, M) > 0$ such that for any $\varepsilon \in (0, \tfrac{1}{2})$, there exists an approximator $A : \mathbb{R}^m \to \mathbb{R}^p$ with:

$$\mathrm{size}(A) \le C p^2 \varepsilon^{-m/s} \log(\varepsilon^{-1}), \qquad \mathrm{depth}(A) \le C \log(\varepsilon^{-1}), \qquad \mathcal{E}_A \le \sqrt{p} \, \varepsilon.$$

In summary, we have proved that the approximation error is bounded by the sum of the reconstructor and approximator errors. Those errors can be made to satisfy an arbitrarily small tolerance, with bounded size and depth. The theorems hold as long as some integrability and continuity conditions are satisfied, which is trivial in real-world OCPs.
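To see how the three theorems fit together, the bounds can be chained as follows. This is a sketch of one way to combine them, not a statement from the paper: it reads Theorem 1 as $\mathcal{E} \le C(\mathrm{Lip}(R)\,\mathcal{E}_A + \mathcal{E}_R)$, writes $C_1$ for the constant of Theorem 1 and $C_2$ for the constant of Theorem 2, takes $s$ to be the regularity index of Theorem 2, and introduces a target tolerance $\delta > 0$ purely for illustration.

```latex
% Plug Lip(R) <= 2 and E_R <= C_2 p^{-s} (Theorem 2) and E_A <= sqrt(p) eps (Theorem 3) into Theorem 1
\mathcal{E} \;\le\; C_1\bigl(\mathrm{Lip}(R)\,\mathcal{E}_A + \mathcal{E}_R\bigr)
          \;\le\; C_1\bigl(2\sqrt{p}\,\varepsilon + C_2\,p^{-s}\bigr)
% Choose the number of basis functions so that the reconstructor term is at most delta/2
p \;\ge\; \Bigl(\tfrac{2 C_1 C_2}{\delta}\Bigr)^{1/s}
\;\Longrightarrow\; C_1 C_2\,p^{-s} \le \tfrac{\delta}{2}
% Then choose the approximator accuracy so that the remaining term is at most delta/2
\varepsilon \;\le\; \frac{\delta}{4 C_1 \sqrt{p}}, \quad \varepsilon < \tfrac{1}{2}
\;\Longrightarrow\; 2 C_1 \sqrt{p}\,\varepsilon \le \tfrac{\delta}{2}
% Hence
\mathcal{E} \le \delta
```

With $p$ and $\varepsilon$ fixed this way, the size and depth bounds of Theorems 2 and 3 become explicit functions of $\delta$, which is the sense in which the network size stays bounded for any prescribed tolerance.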

3. EXPERIMENTS

3.1. SYNTHETIC CONTROL SYSTEMS

3.1.1. CONTROL SYSTEMS AND DATA GENERATION

We evaluate OptCtrlOP on six representative optimal control systems by following the same protocol as (Jin et al., 2020; Hwang et al., 2022), as summarized in Table 4. We postpone the remaining systems to Appendix E, and only describe the details of the Quadrotor control here:

$$\dot{p} = v, \qquad m \dot{v} = \begin{bmatrix} 0 \\ 0 \\ mg \end{bmatrix} + R^\top(q) \begin{bmatrix} 0 \\ 0 \\ \mathbf{1}^\top u \end{bmatrix}, \qquad \dot{q} = \tfrac{1}{2} \Omega(\omega) q, \qquad J \dot{\omega} = T u - \omega \times J \omega.$$

This system describes the dynamics of a helicopter with four rotors. The state $x = [p^\top, v^\top, \omega^\top]^\top \in \mathbb{R}^9$ consists of three parts: position $p$, velocity $v$, and angular velocity $\omega$. The control $u \in \mathbb{R}^4$ is the thrusts of the four rotating propellers of the quadrotor. $q \in \mathbb{R}^4$ is the unit quaternion (Jia, 2019) representing the attitude (spatial rotation) of the quadrotor w.r.t. the inertial frame. $J$ is the moment of inertia in the quadrotor's frame, and $T u$ is the torque applied to the quadrotor. Our setting is similar to (Jin et al., 2020, Appx. E.1), but we exclude the quaternion from the state. We set the initial state $x_{\mathrm{init}} = [-8, -6, 9, 0]^\top$ and the initial quaternion $q_{\mathrm{init}} = 0$. The matrices $\Omega(\omega)$, $R(q)$, $T$ are coefficient matrices; see their definitions in Appx. E.4. The cost functional is defined as follows, with coefficients $c_x = 1$, $c_u = 0.1$:

$$\min_u \int_0^{t_f} \left[ c_x^\top (x(t) - x_{\mathrm{goal}})^2 + c_u \| u(t) \|^2 \right] dt.$$

With the settings above given, the solution of the OCP only depends on the target state $x_{\mathrm{goal}}$. Therefore, we generate datasets (for model training/validation) and benchmarks (for model testing) by sampling target states from a pre-defined distribution. To fully evaluate the generalization ability, we define both in-distribution (ID) and out-of-distribution (OOD) settings (Shen et al., 2021). Specifically, we design two random variables, $x_{\mathrm{goal}}^{\mathrm{in}} := x_{\mathrm{goal}}^{\mathrm{base}} + \epsilon_{\mathrm{in}}$ and $x_{\mathrm{goal}}^{\mathrm{out}} := x_{\mathrm{goal}}^{\mathrm{base}} + \epsilon_{\mathrm{out}}$, where $x_{\mathrm{goal}}^{\mathrm{base}}$ is a baseline goal state, and $\epsilon_{\mathrm{in}}$, $\epsilon_{\mathrm{out}}$ are different noises applied to ID and OOD. In Quadrotor problems for example, we set $x_{\mathrm{goal}}^{\mathrm{base}} = 0$, with uniform noise $\epsilon_{\mathrm{in}} \sim \mathcal{U}(0.1, 1.1)$ and $\epsilon_{\mathrm{out}} \sim \mathcal{U}(-0.1, 0.1)$.

The training data are sampled from ID, while validation data and benchmark sets are both sampled from ID and OOD separately. The data generation process is shown in Alg. 2 in the Appendix. For a given distribution, we sample a target state $x_{\mathrm{goal}}$, and construct the corresponding cost functional $f$ and OCP. We then define 100 time indices uniformly spaced in the time horizon $T = [0, t_f]$, $t_f \sim \mathcal{U}(1, 1.01)$. The length of $T$ is slightly perturbed so that the time indices cover the whole horizon instead of a set of fixed points. Then we solve the resulting OCP by the DM solver, and get the optimal control $u^*$ at those time indices. After that we sample 10 time indices $\{t_i\}_{1 \le i \le 10}$, creating 10 triplets $\{(f, t_i, u^*(t_i))\}_{1 \le i \le 10}$ and adding them to the dataset. The process is repeated until the dataset size meets the requirement. The benchmark set is generated in the same way, but we store an $(f, J_{\mathrm{opt}})$ pair ($J_{\mathrm{opt}}$ is the optimal cost) for each problem instead.
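The sampling loop described above can be sketched as follows. This is an illustrative stand-in for Alg. 2, which is not shown here: the DM solver is replaced by a hypothetical stub `solve_ocp_stub` that returns the closed-form optimum of a toy scalar OCP, the goal state is scalar for brevity, and sampling the 10 time indices without replacement is an assumption.

```python
import random

def solve_ocp_stub(x_goal, t, t_f, x_init=0.0, c=10.0):
    """Hypothetical placeholder for the DM solver.

    Closed-form optimum of a toy scalar OCP (dynamics x' = u, running cost
    u^2 / 2, terminal cost c/2 * (x(t_f) - x_goal)^2); the optimal control
    is constant in time, so the argument t is unused.
    """
    return c * (x_goal - x_init) / (1.0 + c * t_f)

def generate_dataset(rng, n_problems, base, noise_range, n_grid=100, n_keep=10):
    """Build (encoded f, t_i, u*(t_i)) triplets for one goal distribution."""
    data = []
    for _ in range(n_problems):
        x_goal = base + rng.uniform(*noise_range)        # sample a target state
        t_f = rng.uniform(1.0, 1.01)                     # slightly perturbed horizon
        grid = [t_f * k / (n_grid - 1) for k in range(n_grid)]
        u_star = [solve_ocp_stub(x_goal, t, t_f) for t in grid]
        for k in rng.sample(range(n_grid), n_keep):      # keep 10 of the 100 indices
            data.append((x_goal, grid[k], u_star[k]))
    return data

rng = random.Random(0)
train = generate_dataset(rng, 50, base=0.0, noise_range=(0.1, 1.1))      # ID noise range
val_ood = generate_dataset(rng, 5, base=0.0, noise_range=(-0.1, 0.1))    # OOD noise range
```

Each problem contributes 10 triplets, so the 50 training problems above yield 500 samples; a benchmark set would instead store the goal together with the optimal cost returned by the solver.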

3.1.2. IMPLEMENTATION AND BASELINES

For all systems and all neural models, the learning rate starts from 0.01, decaying every 1,000 epochs at a rate of 0.9. The batch size is 10,000, and the optimizer is Adam (Kingma & Ba, 2014). The loss is the mean squared error defined below, with a dataset $D$ of $N$ samples:

$$L = \frac{1}{N} \sum_{(i,j) \in D} \left\| \mathcal{N}(x_{\mathrm{goal},j})(t_i) - u_j^*(t_i) \right\|^2,$$

where $\mathcal{N}$ denotes the OptCtrlOP network. For comparison, we choose the following baselines. Other details of implementation and baselines are recorded in Appendix F.

2) Pontryagin Differentiable Programming (PDP) (Jin et al., 2020): an adjoint-based indirect method, differentiating through PMP, and optimized by gradient descent.
3) Multi-layer Perceptron (MLP): a fully connected counterpart (Alg. 4) of OptCtrlOP.
4) Fourier Neural Operator (FNO) (Li et al., 2020): a neural operator consisting of consecutive Fourier transform layers.
5) Graph Element Network (GEN) (Alet et al., 2019): a graph neural operator with a graph convolution backbone.
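The training configuration above can be made concrete with two small helpers. This is a sketch under stated assumptions: the decay is interpreted as a staircase schedule that multiplies the learning rate by 0.9 once every 1,000 epochs, the control is scalar, and the function names `learning_rate` and `mse_loss` are hypothetical.

```python
def learning_rate(epoch, base_lr=0.01, decay=0.9, step=1000):
    """Staircase decay: multiply by `decay` once every `step` epochs (assumed reading)."""
    return base_lr * decay ** (epoch // step)

def mse_loss(predictions, targets):
    """Mean squared error between predicted and optimal scalar controls."""
    n = len(predictions)
    return sum((u_hat - u_star) ** 2 for u_hat, u_star in zip(predictions, targets)) / n

lr_start = learning_rate(0)        # initial learning rate
lr_later = learning_rate(2500)     # after two decay steps
loss = mse_loss([1.0, 2.0], [1.0, 4.0])
```

A smooth exponential decay would be an equally valid reading of the text; the staircase form is chosen here only because it matches the phrase "decaying every 1,000 epochs" most literally.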

3.1.3. RESULTS AND DISCUSSION

We present the numerical results on the six systems to evaluate the efficiency and accuracy of OptCtrlOP. The metrics of interest are 1) the running time of solving problems; 2) the quality of the solution, measured by the mean absolute percentage error (MAPE) between the true optimal cost and the predicted cost, which is defined as the mean of $|(J_{\mathrm{opt}} - J_{\mathrm{sol}})/J_{\mathrm{opt}}|$, where $J_{\mathrm{opt}}$ is the optimal cost generated by DM (regarded as ground truth), and $J_{\mathrm{sol}}$ is the cost of the solution produced by the model. The MAPE is calculated on ID/OOD benchmarks respectively, and the running time is averaged over 2,000 random problems.

The results of ODE-constrained OCPs are visualized in Fig. 3. First, the comparison of running time is shown in Fig. 3a, which shows that the neural operator solver is much faster than the classic solver. For example, OptCtrlOP achieves over $10^5$ times speedup against the DM solver. The acceleration can be explained by two aspects: 1) the neural operator solvers produce the output by a single forward propagation, while the classic methods need to iterate between forward and backward passes multiple times; 2) the neural solver calculation is highly parallelizable. Moreover, OptCtrlOP is 100 times faster than MLP, although both of them are neural operator models. Next, the accuracy on in- and out-of-distribution benchmark sets is compared in Fig. 3b-3c. Compared with other neural models, OptCtrlOP achieves better or comparable accuracy in general. In addition, OptCtrlOP outperforms classical PDP on more than half of the benchmarks. As a concrete example, we investigate the performance on the Quadrotor environment.

Table 2: Results on the Quadrotor environment.

| Model | Running time | ID MAPE | OOD MAPE | ID-OOD gap |
| OptCtrlOP | | 1.74 × 10^-5 | 2.13 × 10^-4 | 1.96 × 10^-4 |
| MLP (Alg. 4) | 3.04 × 10^-5 | 4.12 × 10^-4 | 6.09 × 10^-4 | 1.97 × 10^-4 |
| GEN (Alet et al., 2019) | 6.15 × 10^-5 | 1.40 × 10^-4 | 7.31 × 10^-4 | 5.91 × 10^-4 |
| FNO (Li et al., 2020) | 6.38 × 10^-4 | 9.87 × 10^-6 | 1.28 × 10^-3 | 1.27 × 10^-3 |
| PDP (Jin et al., 2020) | 7.25 × 10^1 | 1.24 × 10^-4 | 1.16 × 10^-4 | 8.74 × 10^-6 |

[Figure 4: learning trajectories of the neural models; x-axis: Epochs.]
From Tab. 2 (and Fig. 4 for the learning trajectories of neural models), one can observe that the MAPE of OptCtrlOP on ID is the second lowest among all models, with a slight disadvantage compared to FNO. On OOD, however, OptCtrlOP outperforms FNO by a clear margin, and the performance is close to that of the classical method PDP. Among all neural models, OptCtrlOP achieves the lowest OOD MAPE as well as the smallest ID-OOD gap (defined as the absolute distance between ID and OOD MAPE). We conjecture that the OOD generalization ability of OptCtrlOP results from its architecture, where the branch net and trunk net (coefficients and basis functions) are explicitly disentangled. Such a structure may inherit the inductive bias from numerical basis expansion methods (e.g. Kafash et al. (2014)), thus being more robust to distribution shifts. Due to space limitations, the numerical results of other systems are given in Tab. 6-10 in the Appendix.

3.2. REAL-WORLD CONTROL DATASET OF PLANAR PUSHING

This section presents how to learn optimal control of a robot arm for pushing objects of varying shapes on various planar surfaces. We use the Pushing dataset (Yu et al., 2016), a challenging dataset consisting of noisy, real-world data produced by an ABB IRB 120 industrial robotic arm (Fig. 5, right part). The robot arm is controlled to push objects starting from various contact positions and 9 angles (initial state), along different trajectories (target state functions), with 11 object shapes and 4 surface materials (dynamics). The control function is represented by the force exerted on the object, measured by the force sensor. The left of Fig. 5 gives a compact overview of the input variables, and more details are given in Appx. E.8.

In our experiment, we apply OptCtrlOP to learn a mapping from a pushing OCP instance (represented by the variables above) to the optimal control function. The input now is no longer the cost functional $f$ only, but $f_d$ with abuse of notation, where the subscript $d$ denotes dynamics and initial conditions. The encoder is realized by different techniques for different inputs, such as Savitzky-Golay smoothing (Savitzky & Golay, 1964) and down-sampling for trajectories, mean and standard value extraction for the friction map, and a convolutional neural network (CNN) (LeCun et al., 1989) for shape images. The encoder error now is not negligible, but the analysis framework of approximation error can be extended to include this error (Appx. D). The estimation of the encoder error itself depends on the specific choice of the encoder, which is beyond the scope of this paper and is left to future work.

We extract training data from ID, and validation and test data from both ID and OOD. ID and OOD are distinguished by different initial contact positions. The accuracy metric MAPE is now defined as $\| \hat{u} - u^* \| / \| u^* \|$. We compare OptCtrlOP performance only with neural baselines, since the explicit expression of the pushing OCP is unavailable. All neural models share the same encoder structure, while the parameters of the CNN are trained end-to-end individually for each model. The results are displayed in Tab. 3, from which one can observe that OptCtrlOP outperforms all baselines in both running time and ID/OOD accuracy. Notice that the performance of FNO and GEN degrades compared with that on synthetic data. The reason might be that their architectures are not suitable for complex OCP tasks like pushing. For example, the essence of FNO is to learn a parametric transformation in Fourier space. For the pushing dataset, however, the input OCP instance is a composition of several functions and environment parameters, of which the Fourier transform does not have a clear physical meaning.
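The two accuracy metrics used in Sections 3.1.3 and 3.2 differ in what they compare: the first compares costs, the second compares control functions. The sketch below computes both on sampled values. The function names are hypothetical, and the Euclidean norm over sampled time points is an assumption, since the text does not state which norm is used for the pushing metric.

```python
import math

def mape_cost(j_opt, j_sol):
    """Mean of |(J_opt - J_sol) / J_opt| over a benchmark set (Section 3.1.3)."""
    return sum(abs((jo - js) / jo) for jo, js in zip(j_opt, j_sol)) / len(j_opt)

def relative_control_error(u_hat, u_star):
    """||u_hat - u*|| / ||u*|| on sampled controls (Section 3.2), Euclidean norm assumed."""
    num = math.sqrt(sum((a - b) ** 2 for a, b in zip(u_hat, u_star)))
    den = math.sqrt(sum(b ** 2 for b in u_star))
    return num / den

cost_error = mape_cost([2.0, 4.0], [2.2, 3.0])                 # per-problem errors 0.1 and 0.25
control_error = relative_control_error([0.0, 0.0], [3.0, 4.0])  # zero prediction gives error 1
```

The cost-based metric needs the dynamics to evaluate the cost of a predicted control, which is available for the synthetic systems but not for the pushing data; this is consistent with the switch to a control-space error in the real-world experiment.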

4. CONCLUSION

Future Works. This paper studies an effective way to solve general OCPs with data-driven approaches. We do not specify the forms of problem instances or investigate sophisticated models for specific problems, which calls for careful designs and exploitation of the problem structures. We also leave rigorous analysis on OOD performance and encoder error as future works.

Conclusion. We have proposed a novel instance-solution operator perspective of OCPs, where the operator directly maps cost functionals to optimal control functions. Based on this perspective, we present a neural operator OptCtrlOP, with a theoretical guarantee on its approximation capability. Extensive experiments on various OCP system benchmarks show the outstanding generalization ability and efficiency of OptCtrlOP, in both ID and OOD settings. We envision that the proposed model will be beneficial in solving numerous high-dimensional problems in the learning and control fields.

Algorithm 4: MLP inference on benchmark set
Input: Benchmark set of cost functionals $F = \{f_j\}_{j \le N}$, time indices $t = \{t_i\}_{i \le M}$
Output: estimated controls $\hat{U} = \{\hat{u}_j(t_i)\}_{i \le M, j \le N}$
for i ← 1 to M do
    for j ← 1 to N do
        e_j ← E(f_j)                  // Encoder
        û_j(t_i) ← MLP(e_j, t_i)      // repeat O(MN) times
return Û

Direct methods (Böhme & Frank, 2017a) reformulate the OCP as finite-dimensional nonlinear programming (NLP) (Bazaraa et al., 2013), and solve the problem by NLP algorithms, e.g. sequential quadratic programming (Boggs & Tolle, 1995) and the interior-point method (Mehrotra, 1992). The reformulation essentially constructs surrogate models, where the state and control functions (infinite-dimensional) are replaced by polynomial or piece-wise constant functions. The dynamics constraint is discretized into equality constraints. The direct methods optimize the surrogate models, thus the solution is not guaranteed to be optimal for the original problem. Likewise, typical direct neural solvers (Chen et al., 2018; Wang et al., 2021a; Hwang et al., 2022), termed Two-Phase models, consist of two phases: 1) approximating the dynamics by a neural network (surrogate model); 2) solving the NLP via gradient descent, by differentiating through the network. The advantage of Two-Phase models against traditional direct methods is computational efficiency, especially in high-dimensional cases.
However, the two-phase method is sensitive to distribution shift (see Fig. 1).

Indirect methods (Böhme & Frank, 2017b) are based on Pontryagin's Maximum Principle (PMP) (Pontryagin, 1987). By PMP, indirect methods convert the OCP (Eq. 1) into a boundary-value problem (BVP) (Lasota, 1968), which is then solved by numerical methods such as the shooting method (Bock & Plitt, 1984), the collocation method (Xiu & Hesthaven, 2005), and adjoint-based gradient descent (Effati & Pakdaman, 2013; Jin et al., 2020). These numerical methods are sensitive to the initial guesses of the solution. Some neural solvers based on indirect methods approximate the finite-dimensional mapping from state $x^*(t) \in \mathbb{R}^{d_x}$ to control $u^*(t) \in \mathbb{R}^{d_u}$ (Cheng et al., 2020), or to co-state $\lambda^*(t) \in \mathbb{R}^{d_x}$ (Xie et al., 2018). The full trajectory of the control function is obtained by repeatedly applying the mapping and getting feedback from the system, and such sequential nature is the efficiency bottleneck. Another work (D'ambrosio et al., 2021) proposes to solve the BVP via a PINN, thus its trained network works only for one specific OCP instance. Distinct from all these models, OptCtrlOP solves the BVP by learning an infinite-dimensional operator that maps the cost functional $f \in F$ to the control function $u^* \in U$. The trained model is available for parallel queries on different OCP instances, with high efficiency and accuracy.

Dynamic programming (DP) is an alternative, based on Bellman's principle of optimality (Bellman & Kalaba, 1960). It offers a rule to divide a high-dimensional optimization problem with a long time horizon into smaller, easier-to-solve auxiliary optimization problems. Typical methods are: the Hamilton-Jacobi-Bellman (HJB) equation (Al-Tamimi et al., 2008), differential dynamic programming (DDP) (Tassa et al., 2014), which assumes quadratic dynamics and value function, and the iterative linear quadratic regulator (iLQR) (Li & Todorov, 2004), which assumes linear dynamics and a quadratic value function. Similar to dynamic programming, model predictive control (MPC) synthesizes the approximate control function via the repeated solution of finite overlapping horizons (Hewing et al., 2020). The main drawback of DP is the curse of dimensionality in the number and complexity of the auxiliary problems. MPC alleviates this problem at the expense of optimality. Yet fast implementation of MPC is still under exploration and remains open (Nubert et al., 2020).

B.2 DIFFERENTIAL EQUATION NEURAL SOLVERS

A variety of networks have been developed to solve DEs, including physics-informed neural networks (PINNs) (Raissi et al., 2019), neural operators (Lu et al., 2021), hybrid models (Mathiesen et al., 2022; Lienen & Günnemann, 2021), and frequency domain models (Li et al., 2020; Poli et al., 2022). We briefly introduce the first two models because of their close relevance to our work.

PINNs parameterize the DE solution as a neural network, and learn the parameters by minimizing the residual loss and the boundary condition loss (Yu et al., 2018; Raissi et al., 2019; Sirignano & Spiliopoulos, 2018). PINNs resemble numerical methods such as the finite element method, but replace the linear span of a finite set of local basis functions with neural networks. PINNs usually have simple architectures (e.g. MLP), yet they have produced remarkable results across a wide range of areas in computational science and engineering (Raissi et al., 2020; Zhu et al., 2019). However, these models are limited to learning the solution of one specific DE instance, not the operator. In other words, if the coefficients of the DE instance change slightly, a new neural network needs to be trained, which is time-consuming. Another major drawback of PINNs, as pointed out by Wang et al. (2021b), is that the magnitudes of the two loss terms (i.e. residual loss and boundary condition loss) are inherently imbalanced, leading to heavily biased predictions even for simple linear equations.

Neural operators regard a DE as an operator that maps the input to the solution. Learning operators using neural networks was introduced in the seminal work of Chen & Chen (1995). It proposes the universal approximation theorem for operator learning, i.e. a network with a single hidden layer can approximate any nonlinear continuous operator. Lu et al. (2021) follow this theorem by designing a deep architecture named DeepONet, which consists of two networks: a branch net for input functions and a trunk net for the querying locations in the output space. We choose this architecture as our OCP operator learner, and our analysis of the optimal control error bound is partly inspired by Lanthaler et al. (2022), who provide error estimates for DeepONet. Note that DeepONet is designed to learn DE operators, and is not ready to handle optimization tasks such as optimal control.

In addition, there exist many neural operators with other architectures. For example, one type of neural operator parameterizes the operator as a convolutional neural network (CNN) between finite-dimensional data meshes (Guo et al., 2016; Zhu & Zabaras, 2018; Khoo et al., 2021). The major weakness of these models is that it is impossible to query solutions at off-grid points. Moreover, graph neural networks (GNNs) (Kipf & Welling, 2016) are also applied in operator learning (Alet et al., 2019; Anandkumar et al., 2020). The key idea is to construct the spatial mesh as a graph, where the nodes are discretized spatial locations, and the edges link neighboring locations. Compared with CNN-based models, the graph operator model is less sensitive to resolution, and is capable of inference at off-grid points by adding new nodes to the graph. However, its computational cost is still high, growing quadratically with the number of nodes. Another category of neural operators is based on the Fourier transform (Li et al., 2020; Kovachki et al., 2021). These models learn parametric linear functions in the frequency domain, along with nonlinear functions in the time domain. The conversion between the two domains is realized by the discrete Fourier transform.

C PROOFS

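C.1 PROOF OF THEOREM 1

Proof. First, extract the constant and decompose the error by the triangle inequality (subscripts of norms omitted):

$$E = \frac{\|f_d \circ \mathcal{G} - f_d \circ \mathcal{N}\|}{\|f_d \circ \mathcal{G}\|} \le \sup_{f \in \mathcal{F}} \frac{\mathrm{Lip}(f_d)}{\|f_d \circ \mathcal{G}\|}\, \|\mathcal{G} - \mathcal{N}\| = C\, \|\mathcal{G} - \mathcal{N}\|,$$

$$\|\mathcal{G} - \mathcal{N}\| = \|\mathcal{G} - \mathcal{R} \circ \mathcal{A} \circ \mathcal{E}\| = \|\mathcal{G} - \mathcal{R} \circ \mathcal{P} \circ \mathcal{G} + \mathcal{R} \circ \mathcal{P} \circ \mathcal{G} - \mathcal{R} \circ \mathcal{A} \circ \mathcal{E}\| \le \|\mathcal{G} - \mathcal{R} \circ \mathcal{P} \circ \mathcal{G}\| + \|\mathcal{R} \circ \mathcal{P} \circ \mathcal{G} - \mathcal{R} \circ \mathcal{A} \circ \mathcal{E}\|.$$

The first term is exactly the reconstructor error $E_{\mathcal{R}}$, by the definition of the push-forward:

$$\|\mathcal{G} - \mathcal{R} \circ \mathcal{P} \circ \mathcal{G}\|_{L^2(\mu)} = \|\mathrm{Id} - \mathcal{R} \circ \mathcal{P}\|_{L^2(\mathcal{G}_\# \mu)} = E_{\mathcal{R}}.$$

The second term is related to the approximator error $E_{\mathcal{A}}$:

$$\|\mathcal{R} \circ \mathcal{P} \circ \mathcal{G} - \mathcal{R} \circ \mathcal{A} \circ \mathcal{E}\|_{L^2(\mu)} = \|\mathcal{R} \circ \mathcal{P} \circ \mathcal{G} \circ \mathcal{E}^{-1} \circ \mathcal{E} - \mathcal{R} \circ \mathcal{A} \circ \mathcal{E}\|_{L^2(\mu)} \le \mathrm{Lip}(\mathcal{R})\, \|\mathcal{P} \circ \mathcal{G} \circ \mathcal{E}^{-1} \circ \mathcal{E} - \mathcal{A} \circ \mathcal{E}\|_{L^2(\mu)} = \mathrm{Lip}(\mathcal{R})\, \|\mathcal{P} \circ \mathcal{G} \circ \mathcal{E}^{-1} - \mathcal{A}\|_{L^2(\mathcal{E}_\# \mu)} = \mathrm{Lip}(\mathcal{R})\, E_{\mathcal{A}}.$$

C.2 PROOF OF THEOREM 3

Proof. Our estimation of the approximator error is based on the approximation rates of deep ReLU neural networks derived in Gühring et al. (2020) (notation modified for consistency):

Theorem 4. Let $m \in \mathbb{N}$, $s \in \mathbb{N}_{\ge 2}$, $1 \le q \le \infty$, $M > 0$, and $0 \le n \le 1$. Then there exists a constant $C = C(m, s, q, M, n)$ with the following properties: for any function $f$ with $m$-dimensional input and one-dimensional output in the Sobolev space $W^{s,q}$ with $\|f\|_{W^{s,q}} \le M$, and for any $\epsilon \in (0, 1/2)$, there exists a ReLU MLP $N$ such that

$$\|N - f\|_{W^{n,q}} \le \epsilon,$$

and

$$\mathrm{size}(N) \le C\, \epsilon^{-m/(s-n)} \log\left(\epsilon^{-s/(s-n)}\right), \qquad \mathrm{depth}(N) \le C \log\left(\epsilon^{-s/(s-n)}\right).$$

Such error bounds cannot be directly applied to the approximator $\mathcal{A}$ in our framework, since $\mathcal{A}$ is implemented by a single branch net $\beta$ with $p$-dimensional output. This differs from stacking $p$ independent networks with one-dimensional output $\{N_j\}_{1 \le j \le p}$ and concatenating their outputs. The key difference lies in the parameter sharing of hidden layers. To fill the gap, we design a special structure of the branch net $\beta$ without parameter sharing, as explained below.

Given $p$ independent networks with one-dimensional output $\{N_j : \mathbb{R}^m \to \mathbb{R}\}_{1 \le j \le p}$, denote the weight matrix of the $i$-th layer of $N_j$ as $W_{i,j}$. The weight matrix of the $i$-th layer of the branch net, $W^{\beta}_i$, can be constructed as:

$$W^{\beta}_1 = \begin{bmatrix} W_{1,1} \\ W_{1,2} \\ \vdots \\ W_{1,p} \end{bmatrix}, \qquad W^{\beta}_{i \ge 2} = \begin{bmatrix} W_{i,1} & 0 & \cdots & 0 \\ 0 & W_{i,2} & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & W_{i,p} \end{bmatrix}.$$

The weight of the first layer $W^{\beta}_1$ is a vertical concatenation of $\{W_{1,j}\}_{1 \le j \le p}$, and the weight $W^{\beta}_i$ of any remaining layer $i \ge 2$ is a block diagonal matrix whose main-diagonal blocks are $\{W_{i,j}\}_{1 \le j \le p}$. It is easy to verify that such an approximator is computationally equivalent to the stack of $p$ independent networks with one-dimensional output, $\mathrm{stack}(\{N_j\}_{1 \le j \le p})$. Let $q = 2$, $n = 0$, and $f = \mathcal{P} \circ \mathcal{G} \circ \mathcal{E}^{-1}$. Then by Theorem 4 the approximator error is bounded by:

$$E_{\mathcal{A}} = \left\|\beta - \mathcal{P} \circ \mathcal{G} \circ \mathcal{E}^{-1}\right\| = \left( \sum_j \left\| N_j - \left(\mathcal{P} \circ \mathcal{G} \circ \mathcal{E}^{-1}\right)_j \right\|^2 \right)^{1/2} \le \left(p \epsilon^2\right)^{1/2} = \sqrt{p}\, \epsilon.$$

The depth and size of $\beta$ can be calculated by comparison with any $N_j$:

$$\mathrm{depth}(\mathcal{A}) = \mathrm{depth}(N_j) \le C \log \epsilon^{-1},$$

$$\mathrm{size}(\mathcal{A}) = \mathrm{size}(W^{\beta}_1) + \sum_{i=2}^{p} \mathrm{size}(W^{\beta}_i) = p\, \mathrm{size}(W_{1,j}) + \sum_{i=2}^{p} p^2\, \mathrm{size}(W_{i,j}) \le \sum_{i=1}^{p} p^2\, \mathrm{size}(W_{i,j}) = p^2\, \mathrm{size}(N_j) \le C p^2 \epsilon^{-m/s} \log \epsilon^{-1}.$$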
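The equivalence between the branch net without parameter sharing and the stack of independent networks can be checked numerically. The sketch below is an illustration under simplifying assumptions that are not part of the proof: biases are omitted, the layer widths (`m_in`, `hidden`) and the number of outputs `p` are hypothetical, and weights are random. It also counts weight entries densely, i.e. including the zero blocks.

```python
import random


def relu(v):
    return [max(0.0, x) for x in v]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mlp(weights, x):
    """ReLU MLP without biases; ReLU follows every layer except the last."""
    h = x
    for i, W in enumerate(weights):
        h = matvec(W, h)
        if i < len(weights) - 1:
            h = relu(h)
    return h


def vstack(mats):
    """Vertical concatenation of matrices that share a column count."""
    return [list(row) for M in mats for row in M]


def block_diag(mats):
    """Block diagonal matrix with the given matrices on the main diagonal."""
    total_cols = sum(len(M[0]) for M in mats)
    out, offset = [], 0
    for M in mats:
        for row in M:
            pad_right = total_cols - offset - len(row)
            out.append([0.0] * offset + list(row) + [0.0] * pad_right)
        offset += len(M[0])
    return out


def build_branch(nets):
    """First layer: vertical concatenation; remaining layers: block diagonal."""
    layers = [vstack([net[0] for net in nets])]
    for i in range(1, len(nets[0])):
        layers.append(block_diag([net[i] for net in nets]))
    return layers


rng = random.Random(0)
p, m_in, hidden = 3, 3, 4  # hypothetical sizes


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


# p independent networks R^m_in -> R, each with three weight layers.
nets = [[rand_matrix(hidden, m_in), rand_matrix(hidden, hidden), rand_matrix(1, hidden)]
        for _ in range(p)]
branch = build_branch(nets)

x = [rng.uniform(-1.0, 1.0) for _ in range(m_in)]
stacked_out = [mlp(net, x)[0] for net in nets]
branch_out = mlp(branch, x)
```

Here `branch_out` agrees with `stacked_out` component by component. The first branch layer holds `p` times as many entries as one `W_{1,j}`, and each later layer holds `p**2` times as many entries as one `W_{i,j}`, which is the counting used in the size bound.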

D EXTENSION TO NON-ZERO ENCODER ERROR

The approximation error estimation can be naturally extended to model non-zero encoder error, as derived in Lanthaler et al. (2022). For an encoder $\mathcal{E}$, its error $E_{\mathcal{E}}$ is estimated by the distance to its optimal approximate inverse mapping, the decoder $\mathcal{D}$, weighted by the measure $\mu$:

$$E_{\mathcal{E}} := \left( \int_{\mathcal{F}} \|\mathcal{D} \circ \mathcal{E}(f) - f\|_{\mathcal{F}}^2 \, d\mu(f) \right)^{1/2}, \qquad \mathcal{D} := \operatorname{argmin}_{\mathcal{D}} E_{\mathcal{E}}, \quad \text{s.t. } \mathcal{E} \circ \mathcal{D} = \mathrm{Id}.$$

Similar to the reconstructor error, this error quantifies the information loss during encoding. An ideal encoder should be invertible, i.e. the decoder $\mathcal{D} = \mathcal{E}^{-1}$, and thus $E_{\mathcal{E}} = 0$. When the encoder error is non-zero, the definition of the approximator error $E_{\mathcal{A}}$ (Eq. 8) should be modified accordingly, by replacing $\mathcal{E}^{-1}$ with $\mathcal{D}$:

$$E_{\mathcal{A}} := \left( \int_{\mathbb{R}^m} \|\mathcal{A}(e) - \mathcal{P} \circ \mathcal{G} \circ \mathcal{D}(e)\|_{\ell^2(\mathbb{R}^p)}^2 \, d(\mathcal{E}_\# \mu)(e) \right)^{1/2}.$$

Also, the approximation error bound (Eq. 10) is changed to (the proof is similar to C.1 and omitted here):

$$E \le C \left( \mathrm{Lip}(\mathcal{G})\, \mathrm{Lip}(\mathcal{R} \circ \mathcal{P})\, E_{\mathcal{E}} + \mathrm{Lip}(\mathcal{R})\, E_{\mathcal{A}} + E_{\mathcal{R}} \right).$$

E EXPERIMENT ENVIRONMENTS

E.1 PENDULUM

$$\min_u \int_0^{t_f} \left[ c_x^\top (x(t) - x_{goal})^2 + c_u u^2(t) \right] dt \qquad (16)$$

$$\text{s.t. } \dot{x}(t) = \begin{bmatrix} x_2(t) \\ \left[ u(t) - m g l \sin x_1(t) - b\, x_2(t) \right] / I \end{bmatrix}, \qquad x(0) = x_{init},$$

where the state $x = [x_1, x_2]^\top$ denotes the angle and angular velocity of the pendulum respectively, and the control $u$ is the external torque. The initial condition is $x_{init} = [0, 0]^\top$, i.e. the pendulum starts from the lowest position with zero velocity. The cost functional consists of two parts, a state mismatch penalty and a control function regularization, and $c_x = [10, 1]^\top$, $c_u = 0.1$ are balancing coefficients. Other constants are: $m = 1$, $g = 10$, $l = 1$, $I = 1/3$.

E.2 ROBOT ARM

$$M(x) = \begin{bmatrix} m_1 r_1^2 + I_1 + m_2 \left( l_1^2 + r_2^2 + 2 l_1 r_2 \cos(x_2) \right) & m_2 \left( r_2^2 + l_1 r_2 \cos(x_2) \right) + I_2 \\ m_2 \left( r_2^2 + l_1 r_2 \cos(x_2) \right) + I_2 & m_2 r_2^2 + I_2 \end{bmatrix},$$

E.7 STOCHASTIC PENDULUM

$$dx_1(t) = x_2(t)\, dt, \qquad dx_2(t) = \frac{1}{I} \left[ u(o(t)) - m g l \sin x_1(t) \right] dt + \sigma\, dB(t), \qquad o(t) = \begin{bmatrix} \sin x_1(t) \\ \cos x_1(t) \\ x_2(t) \end{bmatrix}.$$

This system is similar to the simple Pendulum system introduced in the main text, but with Brownian motion $B(t)$ (Uhlenbeck & Ornstein, 1930) involved in the dynamics, resulting in a stochastic OCP (Fleming & Rishel, 2012). The state function is now a random process, so the cost functional is defined as an expectation:

$$\min_u \mathbb{E} \left[ \int_0^{t_f} \left[ c_x^\top (x(t) - x_{goal})^2 + c_u u^2(o(t)) \right] dt \right].$$

The state is defined by the angle $x_1$ and the angular velocity $x_2$, and the control $u$ is the torque applied to the pendulum. We follow the convention of the Gym environment (Brockman et al., 2016). We solve the problem in a closed-loop optimal control scheme, where the model takes the current state from the environment as input at each time index, then outputs the control to the environment. This setting is the same as RL, if we define the reward of RL as the negative cost of the OCP.
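A closed-loop rollout of the stochastic pendulum can be sketched with the Euler-Maruyama scheme, as below. This is an illustration, not the simulator used in the experiments: the step size `dt`, the number of steps, the moment of inertia (assumed here to equal the value 1/3 of the deterministic Pendulum), and the constant-torque `policy` are all assumptions. The constants m = 1, l = 1, g = 10, sigma = 0.01 and the bounds |u| ≤ 2, |x_2| ≤ 8 are those listed for this environment elsewhere in the appendix.

```python
import math
import random


def observe(x1, x2):
    """Observation o(t) = [sin x1, cos x1, x2]."""
    return (math.sin(x1), math.cos(x1), x2)


def clip(value, bound):
    return max(-bound, min(bound, value))


def simulate(policy, steps=200, dt=0.01, m=1.0, g=10.0, l=1.0,
             inertia=1.0 / 3.0, sigma=0.01, seed=0):
    """Euler-Maruyama rollout from x_init = [0, 0] under a feedback policy."""
    rng = random.Random(seed)
    x1, x2 = 0.0, 0.0
    trajectory = [(x1, x2)]
    for _ in range(steps):
        u = clip(policy(observe(x1, x2)), 2.0)      # control bound |u| <= 2
        dB = rng.gauss(0.0, math.sqrt(dt))          # Brownian increment ~ N(0, dt)
        new_x1 = x1 + x2 * dt
        new_x2 = x2 + (u - m * g * l * math.sin(x1)) / inertia * dt + sigma * dB
        x1, x2 = new_x1, clip(new_x2, 8.0)          # state bound |x2| <= 8
        trajectory.append((x1, x2))
    return trajectory


# Hypothetical placeholder policy: constant torque, independent of the observation.
noisy_run = simulate(lambda obs: 2.0)
repeat_run = simulate(lambda obs: 2.0)
# Without noise and without torque the pendulum stays at rest at the lowest position.
rest_run = simulate(lambda obs: 0.0, sigma=0.0)
```

A learned controller would replace the placeholder `policy`, mapping each observation to a torque at every time index.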

E.8 PUSHING

In this dataset, the robot executes an open-loop straight push along a straight line of 5 cm, with different object shapes, surface materials, velocities and accelerations, and contact positions and angles; see Fig. 6 for details. The control and state trajectories are recorded at 250 Hz. The length of the recorded time horizon varies among samples, due to the difference in velocity and acceleration. We select acceleration $a = 0.5\,\mathrm{m\,s^{-2}}$, with initial velocity $v = 0$. We then define the time horizon $T = 0.44$ s, and extract 110 time indices per instance. The input to the encoder is 4167-dim, including a 768-dim gray-scale image of the shape, a 3280-dim friction map matrix, a 110-dim trajectory, and 9-dim other parameters (e.g. mass and moment of inertia). The input to the neural network (including CNN) is 801-dim, and the encoded vector $e$ is 44-dim.
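The dimensions quoted above follow from simple arithmetic, which the short check below reproduces. The dictionary keys are hypothetical labels introduced only for this illustration.

```python
sample_rate_hz = 250   # recording frequency of the trajectories
horizon_s = 0.44       # selected time horizon in seconds

# Number of time indices extracted per instance.
time_indices = round(sample_rate_hz * horizon_s)

# Components of the encoder input (labels are illustrative).
encoder_input_parts = {
    "shape_image": 768,
    "friction_map": 3280,
    "trajectory": time_indices,
    "other_parameters": 9,
}
encoder_input_dim = sum(encoder_input_parts.values())
```

The trajectory contributes one entry per time index, so the four parts add up to the stated encoder input size.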

F DETAILS ON IMPLEMENTATIONS

For all the systems, we have 2,000 samples in each ID/OOD validation set, and 100 problems in each ID/OOD benchmark set. The size of the training set varies among systems, and is roughly proportional to the number of trainable parameters, as displayed in Table 5. OptCtrlOP and the other neural models are implemented in PyTorch (Paszke et al., 2019). For MLP, the hyper-parameters are the same as, or as close as possible to, those of OptCtrlOP. We adjust the width of the MLP layers to reach almost the same number of parameters. For FNO, we set the number of Fourier layers to 4 as suggested in the open-source code, and tune the network width such that the number of parameters is of the same order as that of OptCtrlOP. Notice that the original FNO outputs function values at fixed time indices, which is inconsistent with our experiment setting. Thus we slightly modify it by adding time indices to its input. For GEN, we set 9 graph nodes uniformly spaced over the time (or space) horizon, and perform 3 graph convolution steps on them. The input function initializes the node features at time index t = 0 (multiplied by weights). For any time index t, the GEN output is defined as the weighted average of all node features. Both input and output weights are the softmax of the negative distances between t and the node positions. For the stochastic OCP (i.e. Stochastic Pendulum), we choose a reinforcement learning algorithm named Proximal Policy Optimization (PPO) (Schulman et al., 2017) as the ground truth closed-loop OCP solver, by defining the reward as the negative cost. The PPO implementation is from the open-source package Stable Baselines (Raffin et al., 2021), with hyper-parameter settings taken from Raffin (2020). For fairness, all training/testing cases are executed on an Intel i9-10920X CPU, without GPU.
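The GEN readout described above (a weighted average of node features, with weights given by the softmax of negative distances) can be sketched as follows. This is an illustration rather than the baseline's actual code: the horizon is normalized to [0, 1], no temperature is applied to the distances, and the node features are made-up numbers.

```python
import math


def softmax_neg_distance(t, node_positions):
    """Softmax over the negative distances between t and every node position."""
    logits = [-abs(t - s) for s in node_positions]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def readout(t, node_positions, node_features):
    """Output at time t: weighted average of all node feature vectors."""
    weights = softmax_neg_distance(t, node_positions)
    dim = len(node_features[0])
    return [sum(w * f[d] for w, f in zip(weights, node_features)) for d in range(dim)]


# 9 nodes uniformly spaced on the (normalized) horizon [0, 1].
node_positions = [i / 8 for i in range(9)]
# Hypothetical 2-dimensional node features after the graph convolution steps.
node_features = [[float(i), 1.0] for i in range(9)]

weights_at_quarter = softmax_neg_distance(0.25, node_positions)
output_at_quarter = readout(0.25, node_positions, node_features)
```

Because the weights depend only on the query time, the readout can be evaluated at arbitrary, off-grid time indices; the node closest to the query receives the largest weight.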
$2.10 \times 10^{-4}$ ($\pm 1.87 \times 10^{-6}$)   $9.13 \times 10^{-2}$ ($\pm 4.59 \times 10^{-2}$)   $4.69 \times 10^{-2}$ ($\pm 1.07 \times 10^{-2}$)
OptCtrlOP   $3.33 \times 10^{-4}$ ($\pm 4.08 \times 10^{-6}$)   $9.19 \times 10^{-2}$ ($\pm 4.59 \times 10^{-2}$)   $3.04 \times 10^{-2}$ ($\pm 1.09 \times 10^{-2}$)



Figure 1: Phase-2 cost curves of two failed instances of two-phase control on Pendulum system. The control function gradually moves outside the training data distribution of phase-1. As a result, the control function converges w.r.t. the cost predicted by the surrogate model (blue), but diverges w.r.t. true cost (red).

Figure 3: The running time and mean absolute percentage error (MAPE) on in-distribution (ID) and out-of-distribution (OOD) benchmark problem sets. Compared with baselines, OptCtrlOP (red bars) achieves higher or comparable accuracy, with over 100x speedup.
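For reference, the mean absolute percentage error used as the accuracy metric can be computed as in the sketch below. Whether the reported values are expressed as percentages or as fractions is not stated here, so the factor of 100 is an assumption, and the numbers in the example are made up.

```python
def mape(predicted, reference):
    """Mean absolute percentage error of predictions against non-zero references."""
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    ratios = [abs(p - r) / abs(r) for p, r in zip(predicted, reference)]
    return 100.0 * sum(ratios) / len(ratios)


# Hypothetical costs: each prediction is off by 10% of its reference value.
example_error = mape([110.0, 90.0], [100.0, 100.0])
```

In the benchmark setting, `predicted` would hold the costs achieved by a model's controls and `reference` the costs of the ground truth solutions.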

Direct Method (DM): a classical direct OCP solver, with the Interior Point OPTimizer (IPOPT) (Biegler & Zavala, 2009) as the backend NLP solver.

Figure 4: (a) Quadrotor Loss curve. (b) Quadrotor cost MAPE (mean absolute percentage error) curve. For ID (train and ID test), OptCtrlOP (red curves) performs competitively, or better than others. It also outperforms all neural baselines on OOD.

Figure 5: Pushing environment, composed of images from Yu et al. (2016).

OptCtrlOP inference on benchmark set
Input: Benchmark set of cost functionals $F = \{f_j\}_{j \le N}$, time indices $t = \{t_i\}_{i \le M}$
Output: estimated controls $\hat{U} = \{\hat{u}_j(t_i)\}_{i \le M, j \le N}$
E ← $\mathcel{E}$(F)  // Encoder
A ← $\mathcal{A}$(E)  // Approximator, O(N)
R ← τ(t)  // trunk net (Reconstructor step 1), O(M)
for j ← 1 to N do
    $\hat{u}_j(t)$ ← $R_0 + A_j^\top R_{1:p}$  // affine combination (Reconstructor step 2)

are well developed over the decades, which are learning-free and often involve tedious optimization iterations to find an optimal solution.
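The inference procedure (encoder, approximator, trunk net, affine combination) can be sketched in a few lines. The stand-ins below are hypothetical: `approximator` and `trunk` are fixed closed-form functions rather than trained networks, the encoder is taken to be the identity so that instances are passed in already encoded, and `p = 3` is an arbitrary choice.

```python
import math


def infer(encoded_instances, approximator, trunk, times):
    """Batched inference: returns an N x M table of controls u_j(t_i).

    approximator: encoded instance -> p coefficients (one call per instance).
    trunk: time t -> p + 1 basis values [R_0, R_1, ..., R_p] (one call per time).
    """
    A = [approximator(e) for e in encoded_instances]   # N x p
    R = [trunk(t) for t in times]                      # M x (p + 1)
    # Affine combination: u_j(t_i) = R_0(t_i) + sum_k A_jk * R_k(t_i).
    return [[r[0] + sum(a_k * r_k for a_k, r_k in zip(a, r[1:])) for r in R]
            for a in A]


# Hypothetical stand-ins with p = 3 coefficients.
def approximator(e):
    return [e[0] + e[1], e[0], -e[1]]


def trunk(t):
    return [1.0, t, t * t, math.sin(t)]


encoded_instances = [[1.0, 2.0], [0.5, -1.0]]   # N = 2 encoded cost functionals
times = [0.0, 0.5, 1.0]                         # M = 3 query time indices
controls = infer(encoded_instances, approximator, trunk, times)
```

The trunk net is evaluated once for all instances and the approximator once per instance, so no iterative optimization is involved at inference time and the instances can be processed in parallel.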

Figure 6: List of variables explored in Pushing dataset, credited to Yu et al. (2016).

For Stochastic Pendulum, the Gym convention is followed by adding $o$, the observation of state $x$, and by further constraining the scale of state and control as $|u| \le 2$, $|x_2| \le 8$. The initial state is $x_{init} = 0$, and the target state baseline is $x_{goal} = [\pi, 0]^\top$. The cost functional coefficients are $c_x = 1$, $c_u = 0.001$. Other constants are: mass $m = 1$, length $l = 1$, gravitational acceleration $g = 10$, scale of noise $\sigma = 0.01$.

Figure 7: The loss curves of 4 systems on the training set and the in- and out-of-distribution testing sets. All curves are smoothed by an exponential moving average with weight 0.5.

Figure 8: The cost MAPE (mean absolute percentage error) curves of 4 systems on the in- and out-of-distribution testing sets. All curves are smoothed by an exponential moving average with weight 0.5.
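The smoothing mentioned in the captions of Figures 7 and 8 can be reproduced with a one-line recursion. The sketch below assumes the convention in which the weight multiplies the previous smoothed value and the first point is left unchanged; with weight 0.5 the old and new values are weighted equally either way. The input values are made up.

```python
def exponential_moving_average(values, weight=0.5):
    """Smooth a curve: s[0] = v[0], s[k] = weight * s[k-1] + (1 - weight) * v[k]."""
    smoothed = []
    for v in values:
        if not smoothed:
            smoothed.append(v)
        else:
            smoothed.append(weight * smoothed[-1] + (1.0 - weight) * v)
    return smoothed


# Hypothetical raw curve values.
smoothed_curve = exponential_moving_average([1.0, 3.0, 5.0])
```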

Figure 9: The state trajectories of five systems on four randomly sampled OCPs. The left two columns are in-distribution problems, and the right two columns are out-of-distribution problems. The three dimensions consist of time $t$ and the first two dimensions of the state, $x_1$ and $x_2$.

Figure 10: The MAPE (mean absolute percentage error) w.r.t. the number of training samples curves of 5 systems on the in-and out-of-distribution testing sets.

Performance of Quadrotor environment.

Results of Pushing environment.

Bing Yu et al. The deep Ritz method: a deep learning-based numerical algorithm for solving variational problems. Communications in Mathematics and Statistics, 6(1):1-12, 2018.
KT Yu, M Bauza, N Fazeli, and A Rodriguez. More than a million ways to be pushed: a high-fidelity experimental data set of planar pushing. In IEEE/RSJ IROS, 2016.

Typical control systems used in literature and our experiments and their dimensions.

Hyper-parameter settings of the proposed OptCtrlOP for different systems.

PDP, DM, and the synthetic control systems are implemented in CasADi (Andersson et al., 2019), adapted from the code repository 1. For classical methods, the dynamics are discretized by the Euler method. To limit the running time, we set the maximum number of iterations of PDP to 2,500.

Results of Pendulum environment.

Results of RobotArm environment.

Results of CartPole environment.

Results of Rocket environment.

Results of Heating environment.

Results of closed-loop control on StochasticPendulum environment. The ground truth result is generated by PPO.

$$C(x) = \begin{bmatrix} -m_2 l_1 r_2 \sin(x_2)\, x_4 & -m_2 l_1 r_2 \sin(x_2)\, (x_3 + x_4) \\ m_2 l_1 r_2 \sin(x_2)\, x_3 & 0 \end{bmatrix}, \qquad g(x) = \begin{bmatrix} m_1 r_1 g \cos(x_1) + m_2 g \left( r_2 \cos(x_1 + x_2) + l_1 \cos(x_1) \right) \\ m_2 g r_2 \cos(x_1 + x_2) \end{bmatrix}.$$

The RobotArm (also named Acrobot) is a planar two-link robotic arm in the vertical plane, with an actuator at the elbow. The state is $x = [x_1, x_2, x_3, x_4]^\top$, where $x_1$ is the shoulder joint angle, $x_2$ is the elbow (relative) joint angle, and $x_3$, $x_4$ denote their respective angular velocities. The control $u$ is the torque at the elbow. Note that the last equation is the manipulator equation, where $M$ is the inertia matrix, $C$ captures the Coriolis forces, and $g$ is the gravity vector. The details of the derivation can be found in Spong et al. (2020, Sec. 6.4).

The initial condition is $x_{init} = [\pi/4, \pi/2, 0, 0]^\top$, and the target state baseline is $x_{goal} = [\pi/2, 0, 0, 0]^\top$. The cost functional coefficients are $c_x = [0.1, 0.1, 0.1, 0.1]^\top$, $c_u = 0.1$. Other constants are: mass of the two links $m_{1,2} = 1$, gravitational acceleration $g = 0$, link lengths $l_{1,2} = 1$, distance from joint to center of mass $r_{1,2} = 0.5$, moment of inertia $I_{1,2} = 1/3$.

In the CartPole system, an unactuated joint connects a pole (pendulum) to a cart that moves along a frictionless track. The pendulum is initially positioned upright on the cart, and the goal is to balance the pendulum by applying horizontal forces to the cart. The state is $x = [x_1, x_2, x_3, x_4]^\top$, where $x_1$ is the horizontal position of the cart, $x_2$ is the counter-clockwise angle of the pendulum, and $x_3$, $x_4$ are the velocity of the cart and the angular velocity of the pendulum respectively. We refer the reader to Tedrake (2022, Sec. 3.2) for the derivation of the above equations.

The initial condition is $x_{init} = [0, 0, 0, 0]^\top$, and the target state baseline is $x_{goal} = [0, \pi, 0, 0]^\top$. The cost functional coefficients are $c_x = [0.1, 0.6, 0.1, 0.1]^\top$, $c_u = 0.3$.
Other constants are: mass of cart and pole $m_{c,p} = 0.1$, gravitational acceleration $g = 10$, pole length $l = 1$.

This system describes the dynamics of a helicopter with four rotors. The state $x = [p^\top, v^\top, \omega^\top]^\top \in \mathbb{R}^9$ consists of three parts: position $p$, velocity $v$, and angular velocity $\omega$. The control $u \in \mathbb{R}^4$ is the thrusts of the four rotating propellers of the quadrotor. $q \in \mathbb{R}^4$ is the unit quaternion (Jia, 2019) representing the attitude (spatial rotation) of the quadrotor w.r.t. the inertial frame. $J$ is the moment of inertia in the quadrotor's frame, and $T_u$ is the torque applied to the quadrotor. Our setting is similar to Jin et al. (2020, Appx. E.1), but we exclude the quaternion from the state.

The derivation is straightforward. The first two equations are Newton's laws of motion, the third equation is the time-derivative of the quaternion (Jia, 2019, Appx. B), and the last equation is Euler's rotation equation (Truesdell, 1992, Sec. I.10), $J \dot{\omega} = T_u - \omega \times J \omega$. The coefficient matrices and operators used in the equations are defined as follows:

$$\begin{bmatrix} 1 - 2\left(q_3^2 + q_4^2\right) & 2\left(q_2 q_3 - q_4 q_1\right) & 2\left(q_2 q_4 + q_3 q_1\right) \\ 2\left(q_2 q_3 + q_4 q_1\right) & 1 - 2\left(q_2^2 + q_4^2\right) & 2\left(q_3 q_4 - q_2 q_1\right) \\ 2\left(q_2 q_4 - q_3 q_1\right) & 2\left(q_3 q_4 + q_2 q_1\right) & 1 - 2\left(q_2^2 + q_3^2\right) \end{bmatrix}$$

We set the initial state $x_{init} = [[-8, -6, 9]^\top, 0, 0]^\top$, the initial quaternion $q_{init} = 0$, and the target state baseline $x_{goal} = 0$. The cost functional coefficients are $c_x = 1$, $c_u = 0.1$. Other constants are configured as: mass $m = 1$, wing length $l = 0.4$, moment of inertia $J = 1$, z-axis torque constant $c = 0.01$.

The rocket system models a 6-DoF rocket in 3D space. The formulation is very close to that of the Quadrotor mentioned above. The state $x = [p^\top, v^\top, \omega^\top]^\top \in \mathbb{R}^9$ is the same as that of the Quadrotor, but the control $u \in \mathbb{R}^3$ is slightly different. Here $u$ denotes the total thrust in 3 dimensions.
Accordingly, the torque $T_u$ is changed to:

We set the initial state $x_{init} = [10, -8, 5, -1, 0]^\top$, the initial quaternion $q_{init} = [\cos(0.75), 0, 0, \sin(0.75)]^\top$, and the target state baseline $x_{goal} = 0$. The cost functional coefficients are $c_x = 1$, $c_u = 0.4$. Other constants are configured as: mass $m = 1$, rocket length $l = 1$, moment of inertia $J = \mathrm{diag}([0.5, 1, 1])$.

E.6 HEATING

$$-\Delta x = u \ \text{ in } \Omega, \qquad x = 0 \ \text{ on } \partial\Omega.$$

The heating system mimics a 2D plane $\Omega$, whose temperature $x$ is controlled by a heating source $u$, such as a plane heated by electromagnetic induction or microwaves (Tröltzsch, 2010, Chap. 1.2.1). The dynamics is a Poisson equation with zero boundary condition, where $\Delta = \nabla \cdot \nabla = \nabla^2$ is the Laplace operator. The objective is thus a double integral over $\Omega$:

The target state here is a function, not a vector. To simplify the problem, we let $x_{goal}(s) = a \sin(\pi s_1) \sin(\pi s_2) + b \sin(\pi s_1) + c \sin(\pi s_2)$, with $[a, b, c]^\top$ being the parameters. Let the baseline parameters be $[a, b, c]^\top = 1$; the in- and out-of-distribution problems are generated by adding noise $\epsilon_{in}$ and $\epsilon_{out}$ to the baseline. The cost functional coefficients are $c_x = 1$, $c_u = 10^{-4}$. During testing, the area $\Omega = [0, 1]^2$ is evenly divided into 100 cells, with 121 grid points.

