PROVABLE SIM-TO-REAL TRANSFER IN CONTINUOUS DOMAIN WITH PARTIAL OBSERVATIONS

Abstract

Sim-to-real transfer, which trains RL agents in simulated environments and then deploys them in the real world, is widely used to overcome the difficulty of gathering samples in the real world. Despite the empirical success of sim-to-real transfer, its theoretical foundations are much less understood. In this paper, we study sim-to-real transfer in a continuous domain with partial observations, where both the simulated and real-world environments are modeled by linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm is capable of learning a policy from the simulated environment that is competitive with the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which might be of independent interest.

1. INTRODUCTION

Deep reinforcement learning has achieved great empirical success in various real-world decision-making problems, such as Atari games (Mnih et al., 2015), Go (Silver et al., 2016; 2017), and robotics control (Kober et al., 2013). In addition to the power of large-scale deep neural networks, these successes also critically rely on the availability of a tremendous amount of training data. For these applications, we have access to efficient simulators, which are capable of generating millions to billions of samples in a short time. However, in many other applications such as autonomous driving (Pan et al., 2017) and healthcare (Wang et al., 2018), interacting with the environment repeatedly and collecting a large amount of data is costly, risky, or even impossible.

A promising approach to the problem of data scarcity is sim-to-real transfer (Kober et al., 2013; Sadeghi & Levine, 2016; Tan et al., 2018; Zhao et al., 2020), which uses simulated environments to generate simulated data. These simulated data are used to train RL agents, which are then deployed in the real world. Such agents, however, may perform poorly in real-world environments owing to the mismatch between the simulated and real-world environments. This mismatch is commonly referred to as the sim-to-real gap. To close this gap, researchers have proposed various methods, including (1) system identification (Kristinsson & Dumont, 1992), which builds a precise mathematical model of the real-world environment; (2) domain randomization (Tobin et al., 2017), which randomizes the simulation and trains an agent that performs well on the randomized simulated environments; and (3) robust adversarial training (Pinto et al., 2017b), which finds a policy that performs well even in a bad or adversarial environment. Despite their empirical successes, these methods have very limited theoretical guarantees. A recent work Chen et al.
(2021) studies domain randomization algorithms for sim-to-real transfer, but it has two limitations. First, their results (Chen et al., 2021, Theorem 4) heavily rely on domain randomization being able to sample simulated models that are very close to the real-world model with at least constant probability, which is hardly the case for applications with a continuous domain. Second, their theoretical framework does not capture problems with partial observations. However, sim-to-real transfer in a continuous domain with partial observations is very common. Take dexterous in-hand manipulation (OpenAI et al., 2018) as an example: the domain consists of the angles of different joints, which are continuous, and training on the simulator only provides partial information due to observation noise (for example, the image generated by the agent's camera can be affected by the surrounding environment). Therefore, we ask the following question: Can we provide a rigorous theoretical analysis of the sim-to-real gap in a continuous domain with partial observations?

This paper answers the above question affirmatively. We study the sim-to-real gap of the robust adversarial training algorithm and address the aforementioned limitations. To summarize, our contributions are three-fold:

• We use finite-horizon LQGs to model the simulated and real-world environments, and formalize the problem of sim-to-real transfer in a continuous domain with partial observations. Under this framework, the learner is assumed to have access to a simulator class E, each member of which represents a simulator with certain control parameters. We analyze the sim-to-real gap (Eqn. (6)) of the robust adversarial training algorithm trained on the simulator class E. Our results show that the sim-to-real gap of the robust adversarial training algorithm is Õ(√(δ_E H)). Here H is the horizon length of the real task, and δ_E denotes an intrinsic complexity of the simulator class E. This result shows that the sim-to-real gap is small for simple simulator classes and short tasks, while it grows for complicated classes and long tasks. By establishing a nearly matching lower bound, we further show that this sim-to-real gap enjoys a near-optimal rate in terms of H.

• To bound the sim-to-real gap of the robust adversarial training algorithm, we develop a new reduction scheme that reduces the problem of bounding the sim-to-real gap to designing a sample-efficient algorithm for infinite-horizon average-cost LQGs. Our reduction scheme for LQGs differs from the one in Chen et al. (2021) for MDPs because the value function (or, more precisely, the optimal bias function) is not naturally bounded in an LQG instance, which requires a more sophisticated analysis.

• To prove our results, we propose a new algorithm, namely LQG-VTR, for infinite-horizon average-cost LQGs with convex cost functions. Theoretically, we establish a regret bound Õ(√(δ_E T)) for LQG-VTR, where T is the number of steps. To the best of our knowledge, this is the first "instance-dependent" result that depends on the intrinsic complexity of the LQG model class with convex cost functions, whereas previous works only provide worst-case regret bounds depending on the ambient dimensions (i.e., the dimensions of states, controls, and observations). At the core of LQG-VTR is a history clipping scheme, which uses a clipped history instead of the full history to estimate the model and make predictions. This history clipping scheme reduces the intrinsic complexity δ_E exponentially, from O(T) to O(poly(log T)) (cf. Appendix G).

Our theoretical results also have two implications. First, the robust adversarial training algorithm is provably efficient (cf. Theorem 1) in a continuous domain with partial observations; one can turn to it when domain randomization performs poorly (cf. Appendix A.4). Second, for stable LQG systems, only a short clipped history needs to be remembered to make accurate predictions in the infinite-horizon average-cost setting.

1.1. RELATED WORK

Sim-to-real Transfer Sim-to-real transfer, which uses simulated environments to train a policy that can be transferred to the real world, is widely used in realistic scenarios such as robotics (e.g., Rusu et al., 2017; Tan et al., 2018; Peng et al., 2018; OpenAI et al., 2018; Zhao et al., 2020). To close the sim-to-real gap, various empirical algorithms have been proposed, including robust adversarial training (Pinto et al., 2017b), domain adaptation (Tzeng et al., 2015), inverse dynamics methods (Christiano et al., 2016), progressive networks (Rusu et al., 2017), and domain randomization (Tobin et al., 2017). In this work, we focus on the robust adversarial training algorithm. Jiang (2018); Feng et al. (2019); Zhong et al. (2019) study sim-to-real transfer theoretically, but they require real-world samples to improve the policy during the training phase, while our work does not use any real-world samples. Our work is most closely related to Chen et al. (2021), which studies the benefits of domain randomization for sim-to-real transfer. As mentioned before, however, Chen et al. (2021) cannot tackle sim-to-real transfer in a continuous domain with partial observations, which is the focus of our work.

Robust Adversarial Training Algorithm

There are many empirical works studying robust adversarial training. Pinto et al. (2017b) propose the robust adversarial training algorithm, which trains a policy in an adversarial environment. Pinto et al. (2017a); Mandlekar et al. (2017); Pattanaik et al. (2017); Dennis et al. (2020) then show that the robust policy obtained by robust adversarial training can achieve good performance in the real world, but these works lack theoretical guarantees. Broadly speaking, the robust adversarial training method is also related to min-max optimal control (Ma et al., 1999; Ma & Braatz, 2001) and robust RL (Morimoto & Doya, 2005; Iyengar, 2005; Xu & Mannor, 2010; Ho et al., 2018; Tessler et al., 2019; Mankowitz et al., 2019; Goyal & Grand-Clement, 2022).

LQR and LQG

There is a line of works (Mania et al., 2019; Cohen et al., 2019; Simchowitz & Foster, 2020; Lale et al., 2020a; Chen & Hazan, 2021; Lale et al., 2022) studying the infinite-horizon linear quadratic regulator (LQR), where the learner can observe the state. For the more challenging infinite-horizon LQG control, where the learner only receives noisy observations generated from the hidden state, Mania et al. (2019); Simchowitz et al. (2020); Lale et al. (2020c; 2021) propose various algorithms and establish regret bounds for them. Specifically, Mania et al. (2019); Lale et al. (2020b) study the strongly convex cost setting, where it is possible to achieve an O(poly(log T)) regret bound. Simchowitz et al. (2020); Lale et al. (2020c; 2021) and our work focus on convex costs, where Simchowitz et al. (2020); Lale et al. (2020c) establish a T^{2/3} worst-case regret. Lale et al. (2021) derives a T^{1/2} worst-case regret depending on the ambient dimensions, based on strong assumptions and a complicated regret analysis. In contrast, with weaker assumptions and a cleaner analysis, our results only depend on the intrinsic complexity of the model class, which may be much smaller (cf. Appendix G).

2.1. FINITE-HORIZON LINEAR QUADRATIC GAUSSIAN

We consider the following finite-horizon linear quadratic Gaussian (LQG) model:

x_{h+1} = A x_h + B u_h + w_h,    y_h = C x_h + z_h,    (1)

where x_h ∈ R^n is the hidden state at step h; u_h ∈ R^m is the action at step h; w_h ∼ N(0, I_n) is the process noise at step h; y_h ∈ R^p is the observation at step h; and z_h ∼ N(0, I_p) is the measurement noise at step h. The noises are i.i.d. random vectors, and the initial state x_0 is assumed to follow a Gaussian distribution. We denote by Θ := (A ∈ R^{n×n}, B ∈ R^{n×m}, C ∈ R^{p×n}) the parameters of this LQG problem.

The learner interacts with the system as follows. At each step h, the learner observes an observation y_h, chooses an action u_h, and suffers a cost c_h = c(y_h, u_h), defined by

c(y_h, u_h) = y_h^⊤ Q y_h + u_h^⊤ R u_h,

where Q ∈ R^{p×p} and R ∈ R^{m×m} are known positive definite matrices. In the finite-horizon setting, the interaction ends after the learner receives the cost c_H, where H is a positive integer. Let H_h = {y_0, u_0, …, y_{h-1}, u_{h-1}, y_h} be the history at step h. Given a policy π = {π_h : H_h → u_h}_{h=0}^H, its expected total cost is defined by

V^π(Θ) = E_π[ Σ_{h=0}^H ( y_h^⊤ Q y_h + u_h^⊤ R u_h ) ],

where the expectation is taken with respect to the randomness induced by the underlying dynamics and the policy π. The learner aims to find the optimal policy π⋆ with minimal expected total cost, i.e., π⋆ = argmin_π V^π(Θ). For simplicity, we write V⋆(Θ) = V^{π⋆}(Θ).
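As a concrete illustration of this interaction protocol, the following sketch simulates the dynamics above and accumulates the quadratic cost for a naive zero-control policy. The system matrices here are hypothetical and chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small system: n = 2 hidden states, m = 1 control, p = 2 observations.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.eye(2)
Q, R = np.eye(2), np.eye(1)

def rollout(H, policy):
    """Simulate x_{h+1} = A x_h + B u_h + w_h, y_h = C x_h + z_h and
    return the total cost sum_h (y_h' Q y_h + u_h' R u_h)."""
    x = rng.standard_normal(2)                      # x_0 ~ N(0, I_n)
    total = 0.0
    for h in range(H + 1):
        y = C @ x + rng.standard_normal(2)          # z_h ~ N(0, I_p)
        u = policy(y)
        total += y @ Q @ y + u @ R @ u
        x = A @ x + B @ u + rng.standard_normal(2)  # w_h ~ N(0, I_n)
    return total

cost = rollout(H=50, policy=lambda y: np.zeros(1))
```

Any history-dependent policy π can be plugged in through the `policy` callback, which here sees only the noisy observation y_h, not the hidden state x_h.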

2.2. INFINITE-HORIZON LINEAR QUADRATIC GAUSSIAN

In the infinite-horizon average-cost LQG, we use t and J to denote the time step and the cost objective, respectively, to distinguish them from the finite-horizon setting. Similar to the finite-horizon setting, the learner aims to find a policy π = {π_t : H_t → u_t}_{t=0}^∞ that minimizes the long-run average cost J^π(Θ), defined by

J^π(Θ) = lim_{T→∞} (1/T) E_π[ Σ_{t=0}^T ( y_t^⊤ Q y_t + u_t^⊤ R u_t ) ].

The optimal policy π⋆_in is defined by π⋆_in := argmin_π J^π(Θ), and we write J⋆(Θ) = J^{π⋆_in}(Θ). We measure the T-step optimality of the learner's policy π by its regret:

Regret(π; T) = Σ_{t=0}^T E_π[ c_t − J⋆(Θ) ].    (2)

Throughout this paper, when π is clear from the context, we may omit π from Regret(π; T).

It is known that the optimal policy for this problem is a linear feedback control policy, i.e., u_t = −K(Θ) x̂_{t|t,Θ}, where K(Θ) is the optimal control gain matrix and x̂_{t|t,Θ} is the belief state at step t (i.e., the estimated mean of x_t). When the system Θ is known, one can obtain x̂_{t|t,Θ} with the Kalman filter (Kalman, 1960) and K(Θ) by dynamic programming. In particular, let P(Θ) denote the unique solution to the discrete-time algebraic Riccati equation (DARE):

P(Θ) = A^⊤ P(Θ) A + C^⊤ Q C − A^⊤ P(Θ) B ( R + B^⊤ P(Θ) B )^{-1} B^⊤ P(Θ) A;

then K(Θ) can be obtained from P(Θ). We also use Σ(Θ) to denote the steady-state covariance matrix of x_t. More details are deferred to Appendix A.3. The LQG instance (1) can be written in the predictor form (Kalman, 1960; Lale et al., 2020b; 2021)

x_{t+1} = ( A − F(Θ) C ) x_t + B u_t + F(Θ) y_t,    y_t = C x_t + e_t,    (3)

where F(Θ) = A L(Θ) and e_t denotes a zero-mean innovation process.

Bellman Equation We define the optimal bias function of Θ = (A, B, C) at (x̂_{t|t,Θ}, y_t) as

h⋆_Θ( x̂_{t|t,Θ}, y_t ) := x̂_{t|t,Θ}^⊤ ( P(Θ) − C^⊤ Q C ) x̂_{t|t,Θ} + y_t^⊤ Q y_t.    (4)

With this notation, the Bellman optimality equation (Lale et al., 2020c, Lemma 4.3) is given by

J⋆(Θ) + h⋆_Θ( x̂_{t|t,Θ}, y_t ) = min_u { c(y_t, u) + E_{Θ,u}[ h⋆_Θ( x̂_{t+1|t+1,Θ}, y_{t+1} ) ] },    (5)

where the minimum is achieved by the optimal control of Θ.

Notation We use O(•) to highlight the dependency on H, n, m, and p, and omit the polynomial dependency on some complicated instance-dependent constants (Õ(•) further omits polylogarithmic factors). For a function f : X → R, its ℓ_∞-norm is ∥f∥_∞ = sup_{x∈X} |f(x)|, and we define ∥F∥_∞ := sup_{f∈F} ∥f∥_∞. Let ρ(•) denote the spectral radius, i.e., the maximum absolute value of the eigenvalues.
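The DARE above can be solved numerically by a simple fixed-point iteration. The sketch below, on a hypothetical stable system, computes P(Θ), recovers the gain via K(Θ) = (R + B^⊤PB)^{-1} B^⊤PA (the formula given in Appendix A.3), and checks that the closed loop A − BK is stable.

```python
import numpy as np

# Hypothetical system used only to illustrate the DARE and the gain K(Θ).
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.eye(2)
Q, R = np.eye(2), np.eye(1)

def solve_dare(A, B, C, Q, R, iters=1000):
    """Fixed-point iteration for
    P = A'PA + C'QC - A'PB (R + B'PB)^{-1} B'PA."""
    P = C.T @ Q @ C
    for _ in range(iters):
        G = np.linalg.inv(R + B.T @ P @ B)
        P = A.T @ P @ A + C.T @ Q @ C - A.T @ P @ B @ G @ B.T @ P @ A
    return P

P = solve_dare(A, B, C, Q, R)
K = np.linalg.inv(R + B.T @ P @ B) @ B.T @ P @ A  # optimal feedback gain

# the closed-loop matrix A - BK should be stable (cf. Assumption 2)
rho = max(abs(np.linalg.eigvals(A - B @ K)))
```

For stabilizable systems the iteration converges geometrically; production code would typically call a dedicated Riccati solver instead.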

2.3. SIM-TO-REAL TRANSFER

The principal framework of sim-to-real transfer works as follows: the learner trains a policy in simulators of the environment, and then applies the obtained policy to the real world. We follow Chen et al. (2021) to formulate the sim-to-real transfer problem. The simulators are modeled as a set of finite-horizon LQGs with control parameters (e.g., physical parameters, control delays, etc.), where different parameters correspond to different dynamics. We denote this simulator class by E. To make the problem tractable, we impose the realizability assumption: the real-world environment Θ⋆ ∈ E is contained in the simulator set.

Now we describe the sim-to-real transfer paradigm formally. In the simulation phase, the learner is given the set E of parameterized simulators (LQG systems), and can interact with each simulator arbitrarily many times. However, the learner does NOT know which simulator represents the real-world environment, which may cause the learned policy to perform poorly in the real world. This challenge is commonly referred to as the sim-to-real gap. Mathematically, if the policy learned in the simulation phase is π(E), its sim-to-real gap is defined by

Gap(π(E)) = V^{π(E)}(Θ⋆) − V⋆(Θ⋆),    (6)

which is the difference between the cost of the simulation policy π(E) on the real-world model and the optimal cost in the real world.

2.4. ROBUST ADVERSARIAL TRAINING ALGORITHM

With our sim-to-real transfer framework defined above, we now formally define the robust adversarial training algorithm used in the simulator training procedure.

Definition 1 (Robust Adversarial Training Oracle). The robust adversarial training oracle returns a (history-dependent) policy π_RT such that

π_RT = argmin_π max_{Θ∈E} [ V^π(Θ) − V⋆(Θ) ],    (7)

where V⋆ is the optimal cost and V^π is the cost of π, both on the LQG model Θ. Note that the real-world model Θ⋆ is unknown to the robust adversarial training oracle.

This oracle aims to find a policy that minimizes the worst-case value gap. It can be implemented by many algorithms in min-max optimal control (Ma et al., 1999; Ma & Braatz, 2001) and robust RL (Morimoto & Doya, 2005; Iyengar, 2005; Xu & Mannor, 2010; Pinto et al., 2017a;b; Ho et al., 2018; Tessler et al., 2019; Mankowitz et al., 2019; Kuang et al., 2022; Goyal & Grand-Clement, 2022).
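To make the min-max objective in Definition 1 concrete, the following toy sketch instantiates it on a fully observed scalar analogue. This is a hypothetical setup for illustration only: expectations are replaced by long-run empirical averages, and the policy and model spaces are replaced by finite grids. It illustrates the objective itself, not the oracle implementations cited above.

```python
import numpy as np

# Toy scalar analogue of Definition 1: models are x' = a x + u + w with cost
# x^2 + r u^2, and candidate policies are linear gains u = -k x.

def avg_cost(a, r, k, T=4000, seed=0):
    """Empirical long-run average cost of gain k on model (a, r)."""
    rng = np.random.default_rng(seed)
    x, total = 0.0, 0.0
    for _ in range(T):
        u = -k * x
        total += x * x + r * u * u
        x = a * x + u + rng.standard_normal()
    return total / T

models = [(0.5, 1.0), (0.9, 1.0)]     # the simulator class E (two candidates)
gains = np.linspace(0.0, 0.9, 10)     # finite grid of policies

# per-model best cost, a proxy for V*(Θ)
best = {m: min(avg_cost(*m, k) for k in gains) for m in models}

# robust adversarial training: minimize the worst-case suboptimality over E
k_rt = min(gains, key=lambda k: max(avg_cost(*m, k) - best[m] for m in models))
worst_gap = max(avg_cost(*m, k_rt) - best[m] for m in models)
```

The returned gain k_rt plays the role of π_RT: it need not be optimal for any single model, but its suboptimality is small simultaneously for every model in the class.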

3. MAIN RESULTS

Before presenting our results, we introduce several standard notations and assumptions for LQGs.

Assumption 1. The real-world LQG system Θ⋆ = (A⋆, B⋆, C⋆) is open-loop stable, i.e., ρ(A⋆) < 1, and Φ(A⋆) := sup_{τ≥0} ∥(A⋆)^τ∥_2 / ρ(A⋆)^τ < +∞. We also assume that (A⋆, B⋆) and (A⋆, F(Θ⋆)) are controllable and (A⋆, C⋆) is observable.

For completeness, we provide the definitions of controllable, observable, and stable systems in Appendix A.2. Assumption 1 is common in previous works studying regret minimization or system identification for LQG (Lale et al., 2021; 2020b; Oymak & Ozay, 2019). The bounded-Φ(A⋆) condition is mild, as noted in Lale et al. (2020c); Oymak & Ozay (2019); Mania et al. (2019); it is satisfied whenever A⋆ is diagonalizable. We emphasize that the (A⋆, F(Θ⋆))-controllability assumption only simplifies the analysis (it can be removed by a more sophisticated analysis) and is not a necessary condition for our results.

In the framework of sim-to-real transfer, we have access to a parameterized simulator class E, so it is natural to assume E has bounded norm. The stability of the systems relies on the contractivity of the closed-loop control matrix A − BK (Lale et al., 2020c; 2021). Thus, we make the following assumption, which is also used in Lale et al. (2020c; 2021).

Assumption 2. The simulators Θ′ = (A′, B′, C′) ∈ E have contractive closed-loop control matrices: ∥K(Θ′)∥_2 ≤ N_K and ∥A′ − B′K(Θ′)∥_2 ≤ γ_1 for fixed constants N_K > 0 and 0 < γ_1 < 1. We also assume there exists a constant N_S such that for any Θ′ ∈ E, ∥A′∥_2, ∥B′∥_2, ∥C′∥_2 ≤ N_S.

As we study the partially observable setting with infinite-horizon average cost, we also require the belief states of any LQG model in E to be stable and convergent (Mania et al., 2019; Lale et al., 2021). Since the dynamics of the belief states follow the predictor form in (3), we assume the matrix A − F(Θ)C is stable, as follows.
Assumption 3. There exist constants (κ_2, γ_2) such that for any Θ′ = (A′, B′, C′) ∈ E, the matrix A′ − F(Θ′)C′ is (κ_2, γ_2)-strongly stable; that is, ∥F(Θ′)∥_2 ≤ κ_2 and there exist matrices L and G such that A′ − F(Θ′)C′ = G L G^{-1} with ∥L∥_2 ≤ 1 − γ_2 and ∥G∥_2 ∥G^{-1}∥_2 ≤ κ_2.

We assume x_0 ∼ N(0, I) for simplicity. Since the Kalman filter converges exponentially fast to its steady state (Caines & Mayne, 1970; Chan et al., 1984), we omit, without loss of generality, the error incurred by x_0 starting with covariance I instead of Σ(Θ⋆) (see, e.g., Appendix G of Lale et al. (2021)). Now we are ready to state our main theorem, which establishes a sharp upper bound for the sim-to-real gap of π_RT.

Theorem 1. Under Assumptions 1, 2, and 3, the sim-to-real gap of π_RT satisfies Gap(π_RT) ≤ Õ(√(δ_E H)), where δ_E (defined in Theorem 4) characterizes the complexity of the model class E.

Here we use Õ to highlight the dependency on the dimensions of the problem (i.e., m, n, p, H) and hide universal constants, logarithmic factors, and complicated instance-dependent constants. We present the high-level ideas of our proof in Section 4; the full proof is deferred to Appendix E. The following proposition shows that δ_E has at most logarithmic dependency on H.

Proposition 1. Under Assumptions 1, 2, and 3, the intrinsic complexity δ_E is always upper bounded by δ_E = Õ(poly(m, n, p)).

The proof is deferred to Appendix G.1. Note that the sim-to-real gap of any trivial stable control policy is at most O(H), since stable policies incur at most O(1) loss (compared with the optimal policy) at each step. Therefore, Theorem 1 together with Proposition 1 ensures that π_RT is a highly non-trivial policy that suffers only O(H^{-1/2}) loss per step. Moreover, we can show that δ_E becomes smaller if E has additional structure, such as low rank. See Appendices G.2 and G.3 for details.
To demonstrate the optimality of the upper bound in Theorem 1, we establish the following lower bound, which shows that the √H sim-to-real gap is unavoidable. The proof is given in Appendix H.

Theorem 2 (Lower Bound). Under Assumptions 1, 2, and 3, for any history-dependent policy π, there exist a model class E and a choice of Θ⋆ ∈ E such that Gap(π) ≥ Ω(√H).

4. ANALYSIS

We provide a sketch of the analysis of the sim-to-real gap of π_RT in this section. In a nutshell, we first reduce the problem of bounding the sim-to-real gap to an infinite-horizon regret minimization problem. We then show that there exists a history-dependent policy achieving a low regret in the infinite-horizon LQG problem, which, via the reduction, immediately implies that the sim-to-real gap of π_RT is small. Before the analysis, we first note that any simulator Θ ∈ E that does not satisfy Assumption 1 cannot be the real-world system; therefore, we can prune the simulator set E to remove such models.

4.1. THE REDUCTION

Now we introduce the reduction technique, which connects the sim-to-real gap defined in Eqn. (6) and the regret defined in Eqn. (2).

Lemma 3 (Reduction). Under Assumptions 1, 2, and 3, the sim-to-real gap of π_RT can be bounded by the H-step regret of any (history-dependent) policy π defined in Eqn. (2): Gap(π_RT) ≤ Regret(π; H) + D_h, where D_h does not depend on H.

Although the idea of reduction is also used in Chen et al. (2021), our reduction is very different from theirs. Their proof relies on the communication assumption, which ensures that the optimal bias function in infinite-horizon MDPs is uniformly bounded. In contrast, the optimal bias function defined in (4) is not naturally bounded, which requires a much more involved analysis. See Appendix D for a detailed proof. By Lemma 3, it suffices to construct a history-dependent policy π that has low regret. Since any history-dependent policy π can be viewed as an algorithm, it suffices to design an efficient regret minimization algorithm for infinite-horizon average-cost LQG systems and use it as π.

4.2. THE REGRET MINIMIZATION ALGORITHM

Motivated by the reduction in the previous subsection, we construct π as a sample-efficient algorithm, LQG-VTR (Algorithm 1). Its key steps are summarized below.

Model Selection It has been observed in many previous works (e.g., Simchowitz & Foster (2020); Lale et al. (2021; 2020c); Tsiamis et al. (2020)) that an inaccurate estimate of the system leads to actions that cause the belief state to explode (i.e., ∥x̂_{t|t,Θ̃}∥_2 grows linearly or even superlinearly in t). To alleviate this problem, we run a model selection procedure before the optimistic planning algorithm to stabilize the system. The model selection procedure rules out unlikely systems from the simulator set, which ensures that the inaccuracies do not blow up during the execution of LQG-VTR. To this end, we collect a few samples with random actions and use a modified system identification algorithm to estimate the system parameters (Lale et al., 2021; Oymak & Ozay, 2019). After the model selection procedure, we show that with high probability the belief states and observations stay bounded throughout the remaining steps (cf. Lemma 5 in Appendix C), so LQG-VTR does not terminate at Line 9.

Estimate the Model with Clipped History As the simulator class E is known to the agent, we use the value-targeted model regression procedure (Ayoub et al., 2020) to estimate the real-world model Θ⋆ at the end of each episode k. More concretely, suppose the agent has access to the regression dataset Z = {E_t}_{t=1}^{|Z|} at the end of episode k, where E_t is the t-th sample containing the belief state x̂_{t|t}, the action u_t, the observation y_t, the estimated bias function h⋆_Θ̃, and the regression target (x̂_{t+1|t+1}, y_{t+1}). Here Θ̃ is the optimistic model used in the t-th sample.
Then, inspired by the Bellman equation in (5), we can estimate the model by minimizing the following least-squares loss:

Θ′_{k+1} = argmin_{Θ∈U_1} Σ_{E_t∈Z} ( E_{Θ,u_t}[ h⋆_Θ̃( x̂′_{t+1|t+1}, y′_{t+1} ) | x̂_{t|t} ] − h⋆_Θ̃( x̂_{t+1|t+1}, y_{t+1} ) )²,    (8)

where (x̂′_{t+1|t+1}, y′_{t+1}) denote the random belief state and observation at step t + 1. However, computing the expectation in Eqn. (8) requires the full history at step t, which is unacceptable. Instead, letting f_Θ( x̂_{t|t}, u_t, Θ̃ ) ≈ E_{Θ,u_t}[ h⋆_Θ̃( x̂′_{t+1|t+1}, y′_{t+1} ) | x̂_{t|t} ] be the approximation of this expectation (see Section 4.4 for the formal definition), we use f to estimate Θ⋆ with the dataset Z:

Θ̂_{k+1} = argmin_{Θ∈U_1} Σ_{E_t∈Z} ( f_Θ( x̂_{t|t}, u_t, Θ̃ ) − h⋆_Θ̃( x̂_{t+1|t+1}, y_{t+1} ) )².    (9)

With this estimator, we can construct the following confidence set:

C( Θ̂_{k+1}, Z ) = { Θ ∈ U_1 : Σ_{E_t∈Z} ( f_Θ( x̂_{t|t}, u_t, Θ̃ ) − f_{Θ̂_{k+1}}( x̂_{t|t}, u_t, Θ̃ ) )² ≤ β }.    (10)

Algorithm 1 LQG-VTR
2: set the maximum state allowed M_x = X̄_1 (X̄_1 is defined in Lemma 5).
3: set the maximum number of episodes K by Eqn. (77).
4: set ψ by Lemma 9, β by Eqn. (119), l = O(log(Hn + Hp)).
5: set Z = Z_new = ∅, initial state x_0 ∼ N(0, I), episode k = 1.
6: Compute U_1 = ModelSelection(T_w, E) (Algorithm 2), Θ̃_1 = argmin_{Θ∈U_1} J⋆(Θ).
7: for step t = T_w + 1, …, H do
8:   if ∥x̂_{t|t,Θ̃_k}∥_2 > M_x then
9:     Take u_t = 0 for the remaining steps and halt the algorithm.
10:  Compute the optimal action under system Θ̃_k: u_t = −K(Θ̃_k) x̂_{t|t,Θ̃_k}.
11:  Take action u_t and observe y_{t+1}.
12:  Let l_clip := min(l, t) and define τ_t := (y_t, u_{t−1}, y_{t−1}, u_{t−2}, y_{t−2}, …, y_{t−l_clip+1}, u_{t−l_clip}).
13:  Add the sample E_t := (τ_t, x̂_{t|t,Θ̃_k}, Θ̃_k, u_t, x̂_{t+1|t+1,Θ̃_k}, y_{t+1}) to the set Z_new.
14:  if the importance score sup_{f_1,f_2∈F} ∥f_1 − f_2∥²_{Z_new} / ( ∥f_1 − f_2∥²_Z + ψ ) ≥ 1 and k < K then
15:    Add the history data Z_new to the set Z, and reset Z_new = ∅.
16:    Calculate Θ̂_{k+1} using Eqn. (9).
17:    Update the confidence set U_{k+1} = C(Θ̂_{k+1}, Z) by Eqn. (10).
18:    Compute Θ̃_{k+1} = argmin_{Θ∈U_{k+1}} J⋆(Θ); episode counter k = k + 1.

Update with Low-switching Cost Since the regret bound of LQG-VTR has a linear dependency on the number of times the algorithm switches its control policy (cf. Appendix E.5), we have to ensure the low-switching property (Auer et al., 2008; Bai et al., 2019) of our algorithm. We follow the idea of Kong et al. (2021); Chen et al. (2021) and maintain two datasets, Z and Z_new, representing the current data used in model regression and the new incoming data, respectively. The importance score (Line 14 of Algorithm 1) measures the importance of the data in Z_new with respect to the data in Z. We only synchronize the current dataset with the new dataset and update the current policy when this score is at least 1.
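For intuition, the importance-score trigger admits a closed form when the function class is specialized to a linear class, an assumption made only for this illustration (the actual F used by LQG-VTR is defined in Section 4.4). For f_θ(z) = θ·z with ∥θ∥_2 ≤ D, the supremum over pairs (f_1, f_2) becomes a generalized Rayleigh quotient over the two empirical Gram matrices:

```python
import numpy as np

# Line-14 trigger for a linear class f_theta(z) = theta @ z, ||theta||_2 <= D
# (a simplifying assumption): the supremum
#   sup ||f1 - f2||^2_{Znew} / (||f1 - f2||^2_Z + psi)
# equals the top generalized eigenvalue of (G_new, G_old + (psi / D^2) I).

def importance_score(feats_new, feats_old, psi, D=1.0):
    d = feats_new.shape[1]
    G_new = feats_new.T @ feats_new            # squared norm over Z_new
    G_old = feats_old.T @ feats_old            # squared norm over Z
    M = G_old + (psi / D**2) * np.eye(d)
    Li = np.linalg.inv(np.linalg.cholesky(M))  # whiten by the regularized Gram
    return np.linalg.eigvalsh(Li @ G_new @ Li.T)[-1]

rng = np.random.default_rng(0)
old = rng.standard_normal((200, 3))   # data already used for regression (Z)
new = rng.standard_normal((5, 3))     # a few incoming samples (Z_new)

low = importance_score(new, old, psi=1.0)    # few new points: no switch yet
high = importance_score(old, new, psi=1.0)   # Z_new dominant: switch triggered
```

While the new batch carries little information relative to Z, the score stays below 1 and the policy is kept; once Z_new dominates, the score exceeds 1 and the datasets are synchronized, which keeps the number of policy switches small.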

4.3. REGRET BOUND OF LQG-VTR

Theorem 4. Under Assumptions 1, 2, and 3, the regret (Eqn. (2)) of LQG-VTR is bounded by Õ(√(δ_E H)), where the intrinsic model complexity δ_E is defined as

δ_E := dim_E(F, 1/H) · log( N(F, 1/H) ) · ∥F∥_∞².

Here we use the function class F, defined in Section 4.4, to capture the complexity of E; dim_E(F, 1/H) denotes the 1/H-Eluder dimension of F, and N(F, 1/H) is the 1/H-covering number of F. The definitions of the Eluder dimension and the covering number are deferred to Appendix A.2. Under Assumptions 1, 2, and 3, we obtain a √H regret as desired because δ_E is at most Õ(poly(m, n, p)) (cf. Appendix G). Notably, if we did not adopt the history clipping technique, δ_E would be linear in H, rendering the regret bound vacuous. Compared with existing works (Simchowitz et al., 2020; Lale et al., 2020c; 2021), we achieve a √H regret bound for a general model class E (depending on the intrinsic complexity of E) under weaker assumptions and with a cleaner analysis. Theorem 1 is directly implied by Theorem 4 and the reduction in (75).

4.4. CONSTRUCTION OF THE FUNCTION CLASS F

In this subsection, we present the intuition behind and the construction of F. To begin with, recall that the optimal bias function h⋆_Θ̃ of the optimistic system Θ̃ is

h⋆_Θ̃( x̂_{t+1|t+1,Θ̃}, y_{t+1} ) = x̂_{t+1|t+1,Θ̃}^⊤ ( P̃ − C̃^⊤ Q C̃ ) x̂_{t+1|t+1,Θ̃} + y_{t+1}^⊤ Q y_{t+1},

where (P̃, C̃) are the parameters with respect to Θ̃. Given any underlying system Θ, the next-step observation y_{t+1,Θ} given the belief state x̂_{t|t,Θ} and action u is

y_{t+1,Θ} = Y( x̂_{t|t,Θ}, u, Θ ) := C A x̂_{t|t,Θ} + C B u + C w_t + C A ( x_t − x̂_{t|t,Θ} ) + z_{t+1}.

Here we use Y to denote the stochastic process that generates y_{t+1,Θ} under the system Θ. Thus, we can define the function f′ to be the one-step Bellman backup of the optimal bias function h⋆_Θ̃ under the transition of the underlying system Θ:

f′_Θ( τ_t, x̂_{t|t,Θ̃}, u, Θ̃ ) := E[ h⋆_Θ̃( x̂_{t+1|t+1,Θ̃}, y_{t+1,Θ} ) ],    y_{t+1,Θ} = Y( x̂_{t|t,Θ}, u, Θ ).

The next-step belief state x̂_{t+1|t+1,Θ̃} is determined by the Kalman filter as

x̂_{t+1|t+1,Θ̃} = ( I − L̃ C̃ )( Ã x̂_{t|t,Θ̃} + B̃ u ) + L̃ y_{t+1,Θ}.    (14)

However, computing x̂_{t|t,Θ} requires the full history τ_t, as shown below, which is unacceptable. To mitigate this problem, we compute the approximate belief state x̂^c_{t|t,Θ} with the clipped length-l history τ_l = {u_{t−l}, y_{t−l+1}, …, u_{t−1}, y_t}:

x̂_{t|t,Θ} = ( I − L C )( A x̂_{t−1|t−1,Θ} + B u_{t−1} ) + L y_t    (requires the full history τ_t to compute)
          = ( A − L C A )^l x̂_{t−l|t−l,Θ} + Σ_{s=1}^{l} ( A − L C A )^{s−1} ( ( I − L C ) B u_{t−s} + L y_{t−s+1} ),

and we define x̂^c_{t|t,Θ} to be the second term (the sum), i.e., the unrolled filter with the leading term dropped. Thanks to Assumption 3, we can show that ( A − L C A )^l x̂_{t−l|t−l,Θ} is a small term whose ℓ_2-norm is bounded by O( κ_2 (1 − γ_2)^l ), since ( I − L C ) A and A ( I − L C ) = A − F(Θ) C share the same spectral radius. Therefore, x̂^c_{t|t,Θ} is a good approximation of x̂_{t|t,Θ}, which lets us control the error of clipping the history.

With the above observation, we formally define F as an approximation obtained by replacing the full history with the clipped history. Let the domain of F be X := ( B^m_Ū × B^p_Ȳ )^l × B^n_{X̄_1} × B^m_Ū × E, where B^d_v = { x ∈ R^d : ∥x∥_2 ≤ v } for any (d, v) ∈ N × R. Here Ū, Ȳ, and X̄_1 will be specified in Lemma 5. We define F formally as follows:

F := { f_Θ : X → R | Θ ∈ E },    f_Θ( τ_l, x̂_{t|t,Θ̃}, u, Θ̃ ) := E[ h⋆_Θ̃( x̂_{t+1|t+1,Θ̃}, y^c_{t+1,Θ} ) ],    y^c_{t+1,Θ} = Y( x̂^c_{t|t,Θ}, u, Θ ).

Here x̂_{t+1|t+1,Θ̃} is determined as in Eqn. (14), with the only difference that the next observation y_{t+1,Θ} is replaced by y^c_{t+1,Θ}.
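The clipping argument can be checked numerically. The sketch below (with hypothetical system matrices) runs the steady-state Kalman filter once on the full history and once on only the last l input-output pairs (started from zero); the gap between the two beliefs shrinks geometrically in l, matching the O(κ_2 (1 − γ_2)^l) bound.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical stable system for illustration.
A = np.array([[0.9, 0.1],
              [0.0, 0.8]])
B = np.array([[0.0],
              [1.0]])
C = np.eye(2)

def steady_gain(A, C, iters=500):
    # Sigma = A Sigma A' - A Sigma C'(C Sigma C' + I)^{-1} C Sigma A' + I
    S = np.eye(2)
    for _ in range(iters):
        T = np.linalg.inv(C @ S @ C.T + np.eye(2))
        S = A @ S @ A.T - A @ S @ C.T @ T @ C @ S @ A.T + np.eye(2)
    return S @ C.T @ np.linalg.inv(C @ S @ C.T + np.eye(2))

L_gain = steady_gain(A, C)

def belief(us, ys, l=None):
    """Run xhat <- (I - LC)(A xhat + B u) + L y over the last l pairs
    (full history when l is None), starting the filter from zero."""
    if l is not None:
        us, ys = us[-l:], ys[-l:]
    xh = np.zeros(2)
    for u, y in zip(us, ys):
        xh = (np.eye(2) - L_gain @ C) @ (A @ xh + B @ u) + L_gain @ y
    return xh

# simulate a trajectory of 200 steps
x, us, ys = rng.standard_normal(2), [], []
for _ in range(200):
    u = rng.standard_normal(1)
    us.append(u)
    x = A @ x + B @ u + rng.standard_normal(2)
    ys.append(C @ x + rng.standard_normal(2))

full = belief(us, ys)
err = {l: np.linalg.norm(belief(us, ys, l) - full) for l in (5, 20, 50)}
```

The difference between the clipped and full beliefs is exactly (A − LCA)^l applied to the belief at the cut point, so a clipped window of length l = O(log(1/ε)) already gives an ε-accurate belief.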

5. CONCLUSION

In this paper, we make a first attempt to study the sim-to-real gap in a continuous domain with partial observations. We show that the output policy of the robust adversarial training algorithm enjoys a near-optimal sim-to-real gap, which depends on the intrinsic complexity of the simulator class. This work opens up several directions: Can we extend our results to nonlinear dynamical systems? Can we design better sim-to-real algorithms beyond robust adversarial training? We leave these questions to future investigations.

A MISSING PARTS

A.1 NOTATION TABLE

For the convenience of the reader, we summarize some notations that will be used.

Notation | Explanation
E, δ_E | model class and its complexity
Θ = (A, B, C) | dynamics parameters defined in (1)
V⋆(Θ), J⋆(Θ) | optimal cost of the finite-horizon and infinite-horizon settings for Θ
K, P, L, Σ | see (18), (17), (19), and (20)
F | AL (cf. (3))
γ_1, κ_2, γ_2 | strongly stable coefficients
X̄_1, Ȳ, Ū, X̄_2 | upper bounds of x̂_{t|t,Θ̃_k}, y_t, u_t, x̂_{t|t,Θ} in (25), (26), and (27)
N_S | max{∥A∥, ∥B∥, ∥C∥} ≤ N_S for all Θ ∈ E
N_U | max{∥Q∥_2, ∥R∥_2}
N_K, N_P, N_Σ | upper bounds of ∥K∥_2, ∥P∥_2, ∥Σ∥_2

A.2 DEFINITIONS

Definition 3 (Stability (Cohen et al., 2018; Lale et al., 2022)). An LQG system Θ = (A, B, C) is stable if ρ(A − BK(Θ)) < 1. It is strongly stable with parameters (κ, γ) if ∥K(Θ)∥_2 ≤ κ and there exist matrices L and H such that A − BK = H L H^{-1} with ∥L∥_2 ≤ 1 − γ and ∥H∥_2 ∥H^{-1}∥_2 ≤ κ. Note that any stable LQG system is also strongly stable for some parameters (Cohen et al., 2018), and a strongly stable system is also stable.

Definition 4 (Covering Number). We use N(F, ϵ) to denote the ϵ-covering number of a set F with respect to the ℓ_∞-norm, i.e., the minimum integer N such that there exists F′ ⊆ F with |F′| = N such that for any f ∈ F there exists f′ ∈ F′ satisfying ∥f − f′∥_∞ ≤ ϵ.

Definition 5 (ϵ-Independence). For a function class F defined on Z, we say z is ϵ-independent of {z_1, …, z_n} ⊆ Z if there exist f, f′ ∈ F satisfying Σ_{i=1}^n ( f(z_i) − f′(z_i) )² ≤ ϵ and f(z) − f′(z) ≥ ϵ.

Definition 6 (Eluder Dimension). For a function class F defined on Z, the ϵ-Eluder dimension is the length of the longest sequence {z_1, …, z_n} ⊆ Z such that there exists ϵ′ ≥ ϵ for which z_i is ϵ′-independent of {z_1, …, z_{i−1}} for every i ∈ [n].

A.3 PRELIMINARIES ABOUT THE OPTIMAL CONTROL AND THE KALMAN FILTER

It is known that the optimal policy for this problem is a linear feedback control policy, i.e., u_t = −K(Θ)x̂_{t|t,Θ}, where K(Θ) is the optimal control gain matrix and x̂_{t|t,Θ} is the belief state at step t (i.e., the estimate of the hidden state). Let P(Θ) be the unique solution to the discrete-time algebraic Riccati equation (DARE):

P(Θ) = A⊤P(Θ)A + C⊤QC − A⊤P(Θ)B(R + B⊤P(Θ)B)^{−1}B⊤P(Θ)A.   (17)

Then K(Θ) can be calculated by

K(Θ) = (R + B⊤P(Θ)B)^{−1}B⊤P(Θ)A.   (18)

The belief state x̂_{t|t,Θ} is defined as the conditional mean of x_t under system Θ, which is determined by the system parameter Θ and the history H_t. Moreover, assuming x̂_{0|−1,Θ} = 0, the belief state can be computed by the Kalman filter:

x̂_{t|t,Θ} = (I − L(Θ)C)x̂_{t|t−1,Θ} + L(Θ)y_t,   x̂_{t|t−1,Θ} = Ax̂_{t−1|t−1,Θ} + Bu_{t−1},   L(Θ) = Σ(Θ)C⊤(CΣ(Θ)C⊤ + I)^{−1},   (19)

where Σ(Θ) is the unique positive semidefinite solution to the following DARE:

Σ(Θ) = AΣ(Θ)A⊤ − AΣ(Θ)C⊤(CΣ(Θ)C⊤ + I)^{−1}CΣ(Θ)A⊤ + I.   (20)

We sometimes use L, P, K, Σ as shorthand for L(Θ), P(Θ), K(Θ), Σ(Θ) when the system Θ is clear from the context. The predictor form is another formulation of the LQG instance (1) (Kalman, 1960; Lale et al., 2020b; 2021):

x_{t+1} = (A − F(Θ)C)x_t + Bu_t + F(Θ)y_t,   y_t = Cx_t + e_t,

where F(Θ) = AL(Θ) and e_t denotes a zero-mean innovation process. In the steady state, we have e_t ∼ N(0, CΣ(Θ)C⊤ + I). The dynamics of x̂_{t|t−1,Θ} follow exactly the predictor form.
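As an illustration of the formulas above, the following toy scalar sketch (all system values a, b, c, q, r are assumed, not the paper's) solves both DAREs by fixed-point iteration, forms the gains K(Θ) and L(Θ), and performs one Kalman-filter update:

```python
# Hypothetical 1-D LQG instance: x' = a x + b u + w, y = c x + z.
a, b, c, q, r = 0.9, 1.0, 1.0, 1.0, 1.0

# DARE (17) for P, solved as a fixed point.
P = 0.0
for _ in range(1000):
    P = a * P * a + c * q * c - (a * P * b) ** 2 / (r + b * P * b)
K = (r + b * P * b) ** -1 * b * P * a      # optimal gain, Eqn. (18)

# DARE (20) for Sigma, solved the same way.
S = 0.0
for _ in range(1000):
    S = a * S * a - (a * S * c) ** 2 / (c * S * c + 1.0) + 1.0
L = S * c / (c * S * c + 1.0)              # Kalman gain, Eqn. (19)

# Residuals of both Riccati equations vanish at the fixed points.
assert abs(P - (a*P*a + c*q*c - (a*P*b)**2 / (r + b*P*b))) < 1e-9
assert abs(S - (a*S*a - (a*S*c)**2 / (c*S*c + 1.0) + 1.0)) < 1e-9

# One Kalman-filter step: predict, then correct with the observation y.
xhat, u, y = 0.5, 0.2, 1.3
x_pred = a * xhat + b * u                  # x-hat_{t|t-1}
xhat_next = (1.0 - L * c) * x_pred + L * y # x-hat_{t|t}
```

Fixed-point iteration converges here because the closed loop is stable; production code would use a dedicated DARE solver instead.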

A.4 DOMAIN RANDOMIZATION

Definition 7 (Domain Randomization Oracle (Chen et al., 2021)). The domain randomization oracle returns a (history-dependent) policy π_DR such that

π_DR = argmin_π E_{Θ∼d(E)}[V_π(Θ) − V⋆(Θ)],

where V⋆ is the optimal cost, V_π is the cost of π, and d(E) is a distribution over E. As shown in Theorem 4 of Chen et al. (2021), even with an additional smoothness assumption (Chen et al., 2021, Assumption 3), the performance of π_DR depends crucially on d(E). In a high-dimensional continuous domain, the probability that uniform randomization samples a model close to the real-world model is exponentially small. Thus, the learner needs to carefully choose the domain randomization distribution d(E) with strong prior knowledge of the real-world model. In contrast, the robust adversarial training oracle does not suffer from this issue; it only needs to solve (7) by some min-max optimal control or robust RL algorithm.
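A quick back-of-the-envelope computation illustrating the claim about uniform randomization (the numbers are illustrative only, not a bound from the paper): a uniform draw from [0, 1]^d lands within ℓ_∞-distance ϵ of a fixed target with probability at most (2ϵ)^d, which vanishes exponentially in the dimension.

```python
# Probability that one uniform sample in [0,1]^d falls in an eps-cube
# around the true model: at most (2*eps)^d when 2*eps < 1.
eps = 0.05
prev = 1.0
for d in range(1, 11):
    hit_prob = (2.0 * eps) ** d    # volume of the eps-cube
    assert hit_prob < prev         # strictly shrinking as d grows
    prev = hit_prob
assert prev < 1e-9                 # already ~1e-10 at d = 10
```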

B UNIFORM BOUNDED SIMULATOR SET

Under Assumptions 1, 2, and 3, it is possible to provide uniform upper bounds on the spectral norms of P(Θ), Σ(Θ), K(Θ), and L(Θ) after the model selection procedure. Let N_U := max(∥Q∥_2, ∥R∥_2). For any Θ = (A, B, C) ∈ E, we have ∥K∥_2 ≤ N_K by Assumption 2. By definition, P is the unique solution to the equation

P − (A − BK)⊤P(A − BK) = C⊤QC + K⊤RK,

where ρ(A − BK) < 1. Therefore, Lemma B.4 of Simchowitz et al. (2020) shows

P = Σ_{k=0}^{∞} ((A − BK)⊤)^k (C⊤QC + K⊤RK)(A − BK)^k.

Since ∥(A − BK)^k∥_2 ≤ (1 − γ_1)^k, we have

∥P∥_2 ≤ (N_S^2 N_U + N_K^2 N_U) Σ_{k=0}^{∞} (1 − γ_1)^{2k} ≤ N_P := N_U(N_S^2 + N_K^2) / (2γ_1 − γ_1^2).

Similarly, Σ is the unique solution to the equation

Σ − (A − FC)Σ(A − FC)⊤ = I + FF⊤,

where F = AL and A − FC is (κ_2, γ_2)-strongly stable under Assumption 3. Thus we know

∥Σ∥_2 ≤ N_Σ := κ_2^2(1 + κ_2^2) / (2γ_2 − γ_2^2).

Finally, we have ∥L∥_2 = ∥ΣC⊤(CΣC⊤ + I)^{−1}∥_2 ≤ N_L := N_Σ N_S since ∥(CΣC⊤ + I)^{−1}∥_2 ≤ 1. In this paper, we use N_P, N_Σ, N_K, N_L as the uniform upper bounds on the spectral norms of P(Θ), Σ(Θ), K(Θ), and L(Θ) for any Θ ∈ E.
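The series bound on ∥P∥_2 can be sanity-checked numerically in the scalar case (all values below are hypothetical; in one dimension the geometric-series bound is essentially tight):

```python
# Check ||P||_2 <= N_P = N_U (N_S^2 + N_K^2) / (2 gamma_1 - gamma_1^2)
# via the series P = sum_k (A-BK)^{2k} (C Q C + K R K) in the scalar case.
A, B, C, K, Q, R = 0.9, 1.0, 1.0, 0.4, 1.0, 1.0
NS = max(abs(A), abs(B), abs(C))
NU = max(abs(Q), abs(R))
NK = abs(K)
closed = A - B * K                 # 0.5, so take 1 - gamma_1 = |A - BK|
gamma1 = 1.0 - abs(closed)

P = sum(closed ** (2 * k) * (C * Q * C + K * R * K) for k in range(200))
NP = NU * (NS ** 2 + NK ** 2) / (2 * gamma1 - gamma1 ** 2)
assert 0.0 < P <= NP + 1e-9, (P, NP)
```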

C DETAILS FOR THE MODEL SELECTION PROCEDURE

In this section we provide a complete description of the model selection procedure. The main purpose of running a model selection procedure at the beginning of LQG-VTR is to obtain a more accurate estimate (Â, B̂, Ĉ, L̂) of the real-world model Θ⋆. This accurate estimate is very useful for showing that the belief states encountered in LQG-VTR stay bounded (see Lemma 5). We follow the warm-up procedure used in Lale et al. (2021) to estimate the Markov matrix M, and perform the model selection afterwards. The matrix M is

M = [CF, CĀF, …, CĀ^{H−1}F, CB, CĀB, …, CĀ^{H−1}B],

where Ā = A − FC = A − ALC. The regression model for M can be established as

y_t = Mφ_t + e_t + CĀ^H x_{t−H},   where   φ_t = [y⊤_{t−1}, …, y⊤_{t−H}, u⊤_{t−1}, …, u⊤_{t−H}]⊤.

Here e_t is a Gaussian noise, and CĀ^H x_{t−H} is an exponentially small bias term. Thanks to Assumption 3, Ā is a (κ_2, γ_2)-strongly stable matrix. Hence ∥Ā^H∥_2 = O(κ_2(1 − γ_2)^H) = O(1/H^2), and the bias term is negligible. Therefore, by estimating M with the random action set D_init, we arrive at a confidence set similar to that of Theorem 3.4 of Lale et al. (2021): there exists a unitary matrix T ∈ R^{n×n} such that with probability at least 1 − δ/4, the real-world model Θ⋆ is contained in the set C, where

C := {Θ = (A, B, C) : ∥Â − T⊤AT∥_2 ≤ β_A, ∥B̂ − T⊤B∥_2 ≤ β_B, ∥Ĉ − CT∥_2 ≤ β_C, ∥L̂ − T⊤L(Θ)∥_2 ≤ β_L}.   (23)

Here the unitary T is needed because any LQG system transformed by a unitary matrix induces exactly the same distribution over observation sequences as the original one (i.e., LQGs are equivalent under unitary transformations). Without loss of generality, we can assume T = I. The quantities β_A, β_B, β_C, β_L measure the width of the confidence set, and we have β_A, β_B, β_C, β_L ≤ √(C_w/T_w) for some instance-dependent parameter C_w according to Theorem 3.4 of Lale et al. (2021).
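The choice of the history length in the regression above is driven by the geometric decay of ∥Ā^H∥_2. A toy scalar check (assumed values, not the paper's system) shows how quickly the truncation bias becomes negligible:

```python
# Scalar predictor-form matrix Abar = A - A L C; its powers, which
# multiply the truncation bias C * Abar^H * x_{t-H}, decay geometrically.
A, L, C = 0.9, 0.5, 1.0
Abar = A - A * L * C               # 0.45
kappa2, gamma2 = 1.0, 1.0 - abs(Abar)

for H in range(1, 30):
    bias = abs(Abar) ** H          # ||Abar^H|| in the scalar case
    assert bias <= kappa2 * (1.0 - gamma2) ** H + 1e-12
assert abs(Abar) ** 25 < 1e-8      # already negligible at H = 25
```

A history length logarithmic in the horizon therefore suffices to drive the bias below any inverse-polynomial target.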
Now we discuss how to set the value of T_w so that the following bounded-state property holds after the model selection procedure; this property is crucial for bounding the regret of LQG-VTR. We define K̄ as an upper bound on the number of episodes of LQG-VTR.

Lemma 5 (Bounded states). With probability at least 1 − δ, for any step t ∈ [H] the belief state maintained by LQG-VTR satisfies ∥x̂_{t|t,Θ̃_t}∥_2 ≤ X̄_1, and accordingly ∥y_t∥_2 ≤ Ȳ and ∥u_t∥_2 ≤ Ū. Moreover, for any step t ∈ [H] and any system Θ = (A, B, C) ∈ E, the belief state measured under Θ with history (y_0, u_0, y_1, u_1, …, y_{t−1}, u_{t−1}, y_t) is bounded: ∥x̂_{t|t,Θ}∥_2 ≤ X̄_2. Here X̄_1, Ȳ, Ū, X̄_2 = O((√n + √p) log H) are defined in (25), (26), and (27). As a consequence, the algorithm does not terminate at Line 9 with probability 1 − δ.

Proof. Let S(k) and T(k) be the starting step and ending step of episode k. For simplicity, we write Θ̃_t = (Ã_t, B̃_t, C̃_t) for Θ̃_k whenever S(k) ≤ t ≤ T(k) (the first episode starts after the model selection procedure, so S(1) = T_w + 1). The core task is to show that after the model selection procedure (Algorithm 2), the belief state x̂_{t|t,Θ̃_t} under Θ̃_t stays bounded; then u_t also stays bounded since u_t = −K(Θ̃_t)x̂_{t|t,Θ̃_t}. Since Θ⋆ ∈ C with probability at least 1 − δ/4, we assume this event happens. Before going through the proof, we define constants w̄ and z̄ such that with probability 1 − δ/4, ∥w_t∥_2 ≤ w̄ and ∥z_t∥_2 ≤ z̄ for all steps t (Lale et al., 2020c); we assume this high-probability event happens for now.

We use an inductive argument to show that for any episode k and step t therein, it holds that ∥x̂_{t|t,Θ̃_t}∥_2 ≤ C_0(1 + 1/K̄)^k for some C_0. To guarantee that the action u_t does not blow up, we set the upper bound on ∥x̂_{t|t,Θ̃_t}∥_2 as X̄_1 := eC_0 ≥ C_0(1 + 1/K̄)^k for any k ≤ K < K̄. Thus, LQG-VTR terminates as soon as ∥x̂_{t|t,Θ̃_t}∥_2 exceeds this upper bound (Line 9 of Algorithm 1). For convenience, we refer to the model selection procedure as episode 0 (so T(0) = T_w). As the inductive hypothesis, suppose that ∥x̂_{t|t,Θ̃_t}∥_2 ≤ D_0 ≤ C_0(1 + 1/K̄)^k holds for some D_0 and all steps t ≤ T(k) + 1 when k ≥ 1.
We set D_0 = C_0 when k = 0, for a constant C_0 = O((√n + √p) log H) specified later. Under the inductive hypothesis, the algorithm does not terminate in the first k episodes, and for any t ≤ T(k) it holds that ∥u_t∥_2 ≤ N_K D_0. Thus for t ≤ T(k) + 1, by the definition of x_t and y_t we have

∥x_t∥_2 ≤ (Φ(A⋆)/(1 − ρ(A⋆))) · (N_S N_K D_0 + w̄),   ∥y_t∥_2 ≤ N_S ∥x_t∥_2 + z̄.   (30)

Suppose D_0 ≥ w̄Φ(A⋆)/(1 − ρ(A⋆)), so that

∥x_t∥_2 ≤ (1 + N_S N_K Φ(A⋆)/(1 − ρ(A⋆))) D_0,   ∥y_t∥_2 ≤ (1 + N_S + N_S^2 N_K Φ(A⋆)/(1 − ρ(A⋆))) D_0.

Hence, we can define three constants D_u, D_x, D_y such that ∥u_t∥_2 ≤ D_u D_0 for any t ≤ T(k), and ∥x_t∥_2 ≤ D_x D_0, ∥y_t∥_2 ≤ D_y D_0 for any t ≤ T(k) + 1. Note that D_u, D_x, D_y are all instance-dependent constants; concretely, it suffices to set

D_u = N_K,   D_x = 1 + N_S N_K Φ(A⋆)/(1 − ρ(A⋆)),   D_y = 1 + N_S + N_S^2 N_K Φ(A⋆)/(1 − ρ(A⋆)).

INDUCTION STAGE 1: FIX t = T(k) + 1 AND SHOW THAT x̂_{t|t,Θ̃_k} IS VERY CLOSE TO x̂_{t|t,Θ̃_{k+1}}

In stage 1 we derive four thresholds T_A, T_B, T_C, T_L serving as lower bounds on T_w that achieve the desired concentration property: as long as T_w ≥ max(T_A, T_B, T_C, T_L), stage 1 goes through thanks to the tight confidence set C defined in Eqn. (23). Now fix t = T(k) + 1 = S(k + 1). For any Θ ∈ U_1 we have

x̂_{t|t,Θ} = Σ_{s=1}^{t} Ā^{s−1}(B̄ u_{t−s} + L y_{t−s+1}) + Ā^{t} L y_0,   (31)

where Ā := (I − LC)A and B̄ := (I − LC)B. Note that ∥Ā^s∥_2 ≤ κ_2(1 − γ_2)^s according to Assumption 3. Therefore, for Θ = Θ̃_k and Θ = Θ̃_{k+1} (both chosen from the confidence set C by definition),

x̂_{t|t,Θ̃_k} − x̂_{t|t,Θ̃_{k+1}} = Σ_{s=1}^{t} Ā_k^{s−1}(B̄_k u_{t−s} + L̃_k y_{t−s+1}) + Ā_k^{t} L̃_k y_0 − Σ_{s=1}^{t} Ā_{k+1}^{s−1}(B̄_{k+1} u_{t−s} + L̃_{k+1} y_{t−s+1}) − Ā_{k+1}^{t} L̃_{k+1} y_0.

To move on, we split the difference into three terms:

x̂_{t|t,Θ̃_k} − x̂_{t|t,Θ̃_{k+1}} = Σ_{s=1}^{t} Ā_k^{s−1}[(B̄_k − B̄_{k+1}) u_{t−s} + (L̃_k − L̃_{k+1}) y_{t−s+1}]
 + Σ_{s=1}^{t} (Ā_k^{s−1} − Ā_{k+1}^{s−1})(B̄_{k+1} u_{t−s} + L̃_{k+1} y_{t−s+1})
 + Ā_k^{t} L̃_k y_0 − Ā_{k+1}^{t} L̃_{k+1} y_0.

We bound these three terms separately in the following. For the first term, observe that when T_w ≥ T_L, we have

∥Σ_{s=1}^{t} Ā_k^{s−1}(L̃_k − L̃_{k+1}) y_{t−s+1}∥_2 ≤ Σ_{s=1}^{t} κ_2(1 − γ_2)^{s−1} · (2√C_w/√T_L) · D_y D_0 ≤ (2κ_2 D_y √C_w/(γ_2 √T_L)) · D_0,   (32)

where the first inequality is due to the confidence set (Eqn. (23)), the induction hypothesis, and Assumption 3. Thus, as long as T_L ≥ (12κ_2 D_y √C_w K̄/γ_2)^2 = O(K̄^2), we know

∥Σ_{s=1}^{t} Ā_k^{s−1}(L̃_k − L̃_{k+1}) y_{t−s+1}∥_2 ≤ D_0/(6K̄).   (33)

Similarly, by checking the definitions of B̄_k and B̄_{k+1} we have

∥B̄_k − B̄_{k+1}∥_2 ≤ √C_w ((1 + N_L N_S)/√T_B + N_S^2/√T_L + N_L N_S/√T_C).

Therefore, as long as

T_B ≥ (36κ_2 D_u √C_w (1 + N_L N_S) K̄/γ_2)^2 = O(K̄^2),   T_L ≥ (36κ_2 D_u √C_w N_S^2 K̄/γ_2)^2 = O(K̄^2),   T_C ≥ (36κ_2 D_u √C_w N_L N_S K̄/γ_2)^2 = O(K̄^2),

we have ∥B̄_k − B̄_{k+1}∥_2 ≤ γ_2/(12κ_2 D_u K̄), and hence

∥Σ_{s=1}^{t} Ā_k^{s−1}(B̄_k − B̄_{k+1}) u_{t−s}∥_2 ≤ D_0/(6K̄),   (34)

following the same reasoning as Eqn. (32). To sum up, we require T_B, T_L, T_C ≥ O(K̄^2) to ensure that as long as T_w ≥ max(T_B, T_L, T_C), the first term is bounded by

∥Σ_{s=1}^{t} Ā_k^{s−1}[(B̄_k − B̄_{k+1}) u_{t−s} + (L̃_k − L̃_{k+1}) y_{t−s+1}]∥_2 ≤ D_0/(3K̄)   (35)

from Eqn. (33) and Eqn. (34). For the second term,

∥Σ_{s=1}^{t} (Ā_k^{s−1} − Ā_{k+1}^{s−1})(B̄_{k+1} u_{t−s} + L̃_{k+1} y_{t−s+1})∥_2 ≤ Σ_{s=1}^{t} ξ_k · (s − 1)κ_2^2(1 − γ_2)^{s−2} ((N_S + N_S^2 N_L)D_u + N_L D_y) D_0 (see Eqn. (148))   (37)
 ≤ ξ_k · (κ_2/γ_2)^2 C_1 ((N_S + N_S^2 N_L)D_u + N_L D_y) D_0,

where ξ_k := ∥Ā_k − Ā_{k+1}∥_2 and C_1 := max_{t≥0} t(1 − γ_2)^{t−1} is a constant. Similarly, we require T_A, T_C, T_L ≥ O(K̄^2) (see Eqn. (32) and Eqn. (33) for an example of how such thresholds are derived) to ensure that as long as T_w ≥ max(T_A, T_C, T_L), it holds that

ξ_k · (κ_2/γ_2)^2 C_1 ((N_S + N_S^2 N_L)D_u + N_L D_y) D_0 ≤ D_0/(3K̄).   (39)

Using the same argument, by setting an appropriate T_L we have that as long as T_w ≥ T_L, it holds that

∥Ā_k^{t} L̃_k y_0 − Ā_{k+1}^{t} L̃_{k+1} y_0∥_2 ≤ D_0/(3K̄).   (40)

As a consequence, combining Eqn.
(35), (39), and (40) gives

∥x̂_{t|t,Θ̃_k} − x̂_{t|t,Θ̃_{k+1}}∥_2 ≤ D_0/K̄,   (41)
∥x̂_{t|t,Θ̃_{k+1}}∥_2 ≤ D_0(1 + 1/K̄).   (42)

We have now proved the induction at the first step t = S(k + 1) of episode k + 1; it remains to show that ∥x̂_{t|t,Θ̃_{k+1}}∥_2 ≤ (1 + 1/K̄)D_0 for the rest of episode k + 1.

INDUCTION STAGE 2: SHOW THAT ∥x̂_{t|t,Θ̃_{k+1}}∥_2 ≤ (1 + 1/K̄)D_0 HOLDS FOR THE WHOLE EPISODE k + 1

Suppose LQG-VTR does not terminate during episode k + 1; then the algorithm follows the optimal control of Θ̃_{k+1}. Writing Θ̃_{k+1} = (Ã, B̃, C̃), we follow Eqn. (46) of Lale et al. (2020c) to decompose x̂_{t|t,Θ̃_{k+1}} for S(k + 1) + 1 ≤ t ≤ T(k + 1) + 1:

x̂_{t|t,Θ̃_{k+1}} = M̃ x̂_{t−1|t−1,Θ̃_{k+1}} + L̃C⋆(x̂_{t|t−1,Θ⋆} − x̂_{t|t−1,Θ̃}) + L̃z_t + L̃C⋆(x_t − x̂_{t|t−1,Θ⋆}),   (43)

where M̃ := Ã − B̃K̃ − L̃(C̃Ã − C̃B̃K̃ − C⋆Ã + C⋆B̃K̃). The main idea of induction stage 2 is to show that x̂_{t|t,Θ̃_{k+1}} remains bounded by (1 + 1/K̄)D_0 as long as x̂_{t−1|t−1,Θ̃_{k+1}} is bounded by (1 + 1/K̄)D_0, given the conclusion ∥x̂_{S(k+1)|S(k+1),Θ̃_{k+1}}∥_2 ≤ (1 + 1/K̄)D_0 of induction stage 1. To this end, we divide induction stage 2 into two phases: we first show that x̂_{t|t−1,Θ⋆} − x̂_{t|t−1,Θ̃} is small (phase 1), and then prove the above claim formally (phase 2). Before coming to the main proof, observe that ē_t := z_t + C⋆(x_t − x̂_{t|t−1,Θ⋆}) is a (N_S N_Σ + 1)-subGaussian noise, so with probability at least 1 − δ/4, ∥ē_t∥_2 is bounded by C_2(N_S N_Σ + 1)√(n log H) for any 1 ≤ t ≤ H and some constant C_2. Assume this high-probability event happens for now.

Phase 1: bound x̂_{t|t−1,Θ⋆} − x̂_{t|t−1,Θ̃}. We mainly follow the decomposition and induction techniques in Lale et al. (2020c) to finish phase 1. Note, however, that our analysis is considerably more involved than that of Lale et al. (2020c): they have only one commit episode after the pure exploration stage (in which the agent chooses random actions, just as in the model selection procedure of LQG-VTR), whereas we have multiple episodes.

Define ∆_t := x̂_{t+S(k+1)|t+S(k+1)−1,Θ⋆} − x̂_{t+S(k+1)|t+S(k+1)−1,Θ̃_{k+1}} for 1 ≤ t ≤ T(k + 1) − S(k + 1) + 1. For t = 1, we have

∆_1 = x̂_{S(k+1)+1|S(k+1),Θ⋆} − x̂_{S(k+1)+1|S(k+1),Θ̃_{k+1}} = A⋆x̂_{S(k+1)|S(k+1),Θ⋆} + B⋆u_{S(k+1)} − Ãx̂_{S(k+1)|S(k+1),Θ̃_{k+1}} − B̃u_{S(k+1)}.   (44)

Since we have assumed Θ⋆ ∈ C, we know ∥x̂_{S(k+1)|S(k+1),Θ⋆} − x̂_{S(k+1)|S(k+1),Θ̃_{k+1}}∥_2 is small by induction stage 1. Specifically, we can set T′_A, T′_B ≥ O((N_L N_S κ_3)^2/(1 − γ_1)^2) so that as long as T_w ≥ max(T′_A, T′_B), we have ∥∆_1∥_2 ≤ D_0(1 − γ_1)/(8N_L N_S κ_3) for a constant κ_3 defined later.

Having bounded ∥∆_1∥_2, we now bound each ∥∆_t∥_2 and prove Eqn. (47). For 1 < t ≤ T(k + 1) − S(k + 1) + 1, we follow Eqn. (49) and Eqn. (50) of Lale et al. (2020c) to perform the decomposition

∆_t = G_0^{t−1}∆_1 + G_1 Σ_{j=1}^{t−1} G_0^{t−1−j}(G_2^{j−1} x̂_{S(k+1)+1|S(k+1),Θ⋆} + Σ_{s=1}^{j−1} G_2^{j−s−1} G_3 ∆_s) + ẽ_t,   (45)

where ẽ_t is a noise term whose ℓ_2-norm is bounded by O(√(p log H)) with probability 1 − δ/4 for any episode k and step t; we again assume this high-probability event happens for now. We define Λ_Θ̃ := Ã − A⋆ − B̃K̃ + B⋆K̃ and thus

G_0 = (A⋆ + Λ_Θ̃)(I − L̃C̃),   G_1 = (A⋆ + Λ_Θ̃)L̃(C̃ − C⋆) − Λ_Θ̃,   G_2 = A⋆ − B⋆K̃ + B⋆K̃L̃(C̃ − C⋆),   G_3 = (A⋆ − B⋆K̃)L⋆ + B⋆K̃(L⋆ − L̃).

By an argument similar to Lale et al. (2020c), we define T_{g0}, T_{g2} so that G_0 and G_2 are (κ_3, γ_3)-strongly stable for some γ_3 ≤ min(γ_2/2, (1 − γ_1)/2) and κ_3 ≥ 1 as long as T_w ≥ max(T_{g0}, T_{g2}). This is possible because, as long as Θ̃ is close enough to Θ⋆, G_0 is strongly stable by Assumption 3 while G_2 is contractive in the sense that ∥G_2∥_2 ≤ (1 + γ_1)/2 by Assumption 2.

To be concrete, we show how to construct T_{g2} in the following; the construction of T_{g0} is analogous. Observe that

G_2 − (Ã − B̃K̃) = A⋆ − Ã + (B̃ − B⋆)K̃ + B⋆K̃L̃(C̃ − C⋆).

By an argument similar to that used for Eqn. (32) and Eqn. (33), we know

∥A⋆ − Ã + (B̃ − B⋆)K̃ + B⋆K̃L̃(C̃ − C⋆)∥_2 ≤ (1 − γ_1)/2,

as long as T_w ≥ T_{g2} := 36C_w N_S^2 N_K^2 N_L^2/(1 − γ_1)^2. This further implies

∥G_2∥_2 ≤ ∥Ã − B̃K̃∥_2 + (1 − γ_1)/2 ≤ (1 + γ_1)/2,

where the second inequality is by Assumption 3.

Now we prove that

∥∆_t∥_2 ≤ ((1 − γ_1)/(4N_L N_S)) D_0   (47)

holds for all 1 ≤ t ≤ T(k + 1) − S(k + 1) + 1. First of all, we know ∥∆_1∥_2 ≤ D_0(1 − γ_1)/(8N_L N_S κ_3) ≤ (1 − γ_1)D_0/(4N_L N_S) according to Eqn. (44) and κ_3 ≥ 1. For any fixed t, suppose ∥∆_s∥_2 ≤ (1 − γ_1)D_0/(4N_L N_S) for all 1 ≤ s ≤ t − 1; then the decomposition (45) implies

∥∆_t∥_2 = ∥G_0^{t−1}∆_1 + G_1 Σ_{j=1}^{t−1} G_0^{t−1−j}(G_2^{j−1} x̂_{S(k+1)+1|S(k+1),Θ⋆} + Σ_{s=1}^{j−1} G_2^{j−s−1} G_3 ∆_s) + ẽ_t∥_2
 ≤ ∥G_1 Σ_{j=1}^{t−1} G_0^{t−1−j}(G_2^{j−1} x̂_{S(k+1)+1|S(k+1),Θ⋆} + Σ_{s=1}^{j−1} G_2^{j−s−1} G_3 ∆_s)∥_2 + κ_3(1 − γ_3)^{t−1}∥∆_1∥_2 + ∥ẽ_t∥_2
 ≤ ∥G_1∥_2 · (t − 1)κ_3^2(1 − γ_3)^{t−1} · 2N_S(1 + N_K)D_0 + ∥G_1 Σ_{j=1}^{t−1} G_0^{t−1−j} Σ_{s=1}^{j−1} G_2^{j−s−1} G_3 ∆_s∥_2 + κ_3(1 − γ_3)^{t−1}∥∆_1∥_2 + ∥ẽ_t∥_2
 ≤ ∥G_1∥_2 · (t − 1)κ_3^2(1 − γ_3)^{t−1} · 2N_S(1 + N_K)D_0 + (κ_3/γ_3)^2 ∥G_1∥_2 ∥G_3∥_2 · ((1 − γ_1)/(4N_L N_S))D_0 + κ_3(1 − γ_3)^{t−1}∥∆_1∥_2 + ∥ẽ_t∥_2.   (48)

The first inequality follows from the strong stability of G_0 and the concentration of the noise ẽ_t. The second inequality is due to the strong stability of G_0 and G_2, together with ∥x̂_{S(k+1)+1|S(k+1),Θ⋆}∥_2 ≤ (1 + N_K)(N_S + N_S/K̄)D_0 ≤ 2N_S(1 + N_K)D_0 from induction stage 1. The last inequality uses the hypothesis that ∥∆_s∥_2 ≤ (1 − γ_1)D_0/(4N_L N_S) for all 1 ≤ s ≤ t − 1. When T_w ≥ T_{g1} for some T_{g1} ≥ 1, we know that

∥G_1∥_2 ≤ (N_S + √(C_w(1 + N_K)/T_{g1})) N_L √(C_w/T_{g1}) + √(C_w(1 + N_K)/T_{g1}) ≤ (√C_w N_L N_S + (√C_w N_L + 1)(1 + N_K))/√T_{g1}.

Since max_{t≥0}(t − 1)(1 − γ_3)^{t−1} and ∥G_3∥_2 are both (instance-dependent) constants, there exists an instance-dependent constant T_{g1} such that for T_w ≥ T_{g1}, it holds that

∥G_1∥_2 · (t − 1)κ_3^2(1 − γ_3)^{t−1} · 2N_S(1 + N_K)D_0 ≤ ((1 − γ_1)/(24N_L N_S))D_0,   (49)
∥G_1∥_2 · (κ_3/γ_3)^2 ∥G_3∥_2 · ((1 − γ_1)/(4N_L N_S))D_0 ≤ ((1 − γ_1)/(24N_L N_S))D_0.   (50)

Moreover, the bound on ∥∆_1∥_2 from Eqn. (44) implies that

κ_3(1 − γ_3)^{t−1}∥∆_1∥_2 ≤ ((1 − γ_1)/(8N_L N_S))D_0.   (51)

Note that we can choose C_0 so that D_0 is large enough to guarantee

((1 − γ_1)/(24N_L N_S))D_0 ≥ ∥ẽ_t∥_2 = O(√(p log H))

for any episode k and any step t in it. The proof of Eqn. (47) follows by combining Eqn. (48), (49), (50), and (51). As a final remark, T_{g0}, T_{g1}, T_{g2} have no dependency on H.

Phase 2: finish induction stage 2. We now come back to the main decomposition formula (43). Define T_M = O(1/(1 − γ_1)^2) so that ∥C̃ − C⋆∥_2 is small enough to guarantee ∥M̃∥_2 ≤ (1 + γ_1)/2 (see the definition of M̃ at the beginning of induction stage 2) as long as T_w ≥ T_M. Suppose we know ∥x̂_{t−1|t−1,Θ̃_{k+1}}∥_2 ≤ (1 + 1/K̄)D_0 for time step t − 1. Then as long as D_0 ≥ 4C_2 N_L(N_S N_Σ + 1)√(n log H)/(1 − γ_1), we have

∥x̂_{t|t,Θ̃_{k+1}}∥_2 ≤ ∥M̃∥_2(1 + 1/K̄)D_0 + ∥L̃C⋆∆_{t−S(k+1)}∥_2 + ∥L̃ē_t∥_2   (52)
 ≤ ((1 + γ_1)/2)(1 + 1/K̄)D_0 + ((1 − γ_1)/4)D_0 + ((1 − γ_1)/4)D_0   (53)
 ≤ (1 + 1/K̄)D_0.   (54)

Fortunately, we know that ∥x̂_{S(k+1)|S(k+1),Θ̃_{k+1}}∥_2 ≤ (1 + 1/K̄)D_0 according to induction stage 1. Applying the recursion above, we can show that ∥x̂_{t|t,Θ̃_{k+1}}∥_2 ≤ (1 + 1/K̄)D_0 holds for S(k + 1) ≤ t ≤ T(k + 1) + 1.

So far we have proved the induction from episode k to k + 1 under the condition that LQG-VTR does not terminate at Line 9 during episode k + 1. On the other hand, if the algorithm enters Line 9 at some step S(k + 1) + 1 ≤ t_0 ≤ T(k + 1) (it cannot be t_0 = S(k + 1) because induction stage 1 does not depend on whether the algorithm terminates), then the decomposition (43) for x̂_{t|t,Θ̃_{k+1}} and the decomposition (45) for ∆_t still hold for steps t < t_0. Therefore, some high-probability event must fail at step t_0 − 1 (i.e., ē_{t_0−1} and/or ẽ_{t_0−1} explode). This is impossible since we have assumed that all the high-probability events happen. Thus, under the condition that all the high-probability events happen, it suffices to choose the constant C_0 = O((√n + √p) log H) so that Eqn. (30), (51), and (52) all hold, and LQG-VTR does not terminate at Line 9 by the definition of X̄_1 = eC_0. Since these high-probability events jointly happen with probability 1 − δ, and the total number of episodes is bounded by K̄, we conclude that with probability at least 1 − δ,

∥x̂_{t|t,Θ̃_t}∥_2 ≤ C_0(1 + 1/K̄)^{K̄} ≤ eC_0 = X̄_1 = O((√n + √p) log H).

As a result, ∥u_t∥_2 ≤ Ū := N_K X̄_1 because u_t = −K(Θ̃_t)x̂_{t|t,Θ̃_t}. Moreover,

y_t = C⋆x_t + z_t = C⋆x̂_{t|t−1,Θ̃_{t−1}} + C⋆(x_t − x̂_{t|t−1,Θ̃_{t−1}}) + z_t
 = C⋆(Ã_{t−1}x̂_{t−1|t−1,Θ̃_{t−1}} + B̃_{t−1}u_{t−1}) + C⋆(x_t − x̂_{t|t−1,Θ⋆}) + C⋆(x̂_{t|t−1,Θ⋆} − x̂_{t|t−1,Θ̃_{t−1}}) + z_t.

Observing that ∥u_{t−1}∥_2 ≤ Ū, that the ℓ_2-norm of the noise term x_t − x̂_{t|t−1,Θ⋆} is bounded by C_2(N_S N_Σ + 1)√(n log H), and that the last term x̂_{t|t−1,Θ⋆} − x̂_{t|t−1,Θ̃_{t−1}} can be bounded through Eqn. (47), we have

∥y_t∥_2 ≤ Ȳ := N_S^2(X̄_1 + Ū) + C_2(N_S N_Σ + 1)√(n log H) + (1 − γ_1)X̄_1/(4N_L).

Note that x̂_{0|0,Θ} = L(Θ)y_0; by Eqn. (31),

∥x̂_{t|t,Θ}∥_2 ≤ κ_2(1 − γ_2)^t N_L Ȳ + Σ_{s=1}^{t} κ_2(1 − γ_2)^{s−1}((1 + N_L N_S)N_S Ū + N_L Ȳ) ≤ X̄_2 := κ_2((1 + N_L N_S)N_S Ū + (1 + γ_2)N_L Ȳ)/γ_2.

At the end of this section, we set the value of T_w so that the confidence set C is accurate enough to stabilize the LQG system:

T_w := max(T_A, T_B, T_C, T_L, T′_A, T′_B, T_{g0}, T_{g1}, T_{g2}, T_M) = Õ(K̄^2).

D PROOF OF LEMMA 3

Proof of Lemma 3. The key observation for proving the lemma is that for any Θ ∈ E, the difference between the optimal cost of the infinite-horizon setting over H steps and that of the H-step finite-horizon setting of Θ is bounded:

|V⋆(Θ) − (H + 1)J⋆(Θ)| ≤ D_h,

where D_h is independent of H. We now prove this statement. We first recall the optimal control of the finite-horizon LQG problem. The cost of any policy π in an H-step finite-horizon LQG problem is defined as

E_π[Σ_{h=0}^{H} (y_h⊤Qy_h + u_h⊤Ru_h)],

where the initial state x_0 ∼ N(0, I).
The optimal control depends on the belief state E[x_h | H_h], where H_h is the history of observations and actions up to time step h. It is well known that x_h is a Gaussian variable whose mean and covariance can be estimated by the Kalman filter. Define

φ_h := (A − AL_0C)⊤(A − AL_1C)⊤ ⋯ (A − AL_{h−1}C)⊤.

In the proof of Theorem 4.2 of Chan et al. (1984), we have Σ_{h|h−1} = φ_h⊤Σ_{0|−1}φ_h plus non-negative constant matrices. Theorem 4.1 of Caines & Mayne (1970) shows that for any Θ ∈ E and any 0 ≤ h ≤ H it holds that

∥Σ_{h|h−1}∥_2 ≤ κ_2^2(1 + κ_2^2)/(2γ_2 − γ_2^2) = N_Σ,

which helps us determine b_0 = (1 + N_Σ)√N_Σ. The existence of a_0 relies on the state transition matrix (Zhang et al., 2021)

φ′_h = (A − BK_H)(A − BK_{H−1}) ⋯ (A − BK_{h+1}).

As a minimal requirement for tackling the finite-horizon LQR control problem, we assume ∥φ′_h∥_2 is uniformly upper bounded (i.e., the system Θ is stable) for any Θ; in fact, one often assumes the stronger condition of exponential stability. According to Assumption 2, we can now establish the existence of a_0. As a result, we have

|V⋆(Θ⋆) − (H + 1)J⋆(Θ⋆)| = |Σ_{h=0}^{H} [trace(P_h L_h CΣ_{h|h−1}) − trace(PLCΣ)] + Σ_{h=0}^{H} [trace(C⊤QC(I − L_hC)Σ_{h|h−1}) − trace(C⊤QC(I − LC)Σ)]|
 ≤ Σ_{h=0}^{H} |trace((P_h − P)L_hCΣ_{h|h−1})| + Σ_{h=0}^{H} |trace(P(L_hCΣ_{h|h−1} − LCΣ))| + Σ_{h=0}^{H} |trace(C⊤QC(Σ_{h|h−1} − Σ + LCΣ − L_hCΣ_{h|h−1}))|   (72)
 ≤ Σ_{h=0}^{H} O(na_0γ_1^{H−h} + nκ_2(b_0 + c_0)(1 − γ_2)^h)   (73)
 ≤ D_h := O(na_0/(1 − γ_1) + nκ_2(b_0 + c_0)/γ_2),

where the second inequality uses trace(X) ≤ n∥X∥_2 for any X ∈ R^{n×n}. Now for any (history-dependent) policy π and any Θ ∈ E,

V_π(Θ) − V⋆(Θ) ≤ V_π(Θ) − (H + 1)J⋆(Θ) + D_h,

where the first difference on the right-hand side is Regret(π; H). Combined with the definition of π_RT in Eqn. (7), we have

Gap(π_RT) = V_{π_RT}(Θ⋆) − V⋆(Θ⋆) ≤ max_{Θ∈E}(V_{π_RT}(Θ) − V⋆(Θ)) ≤ Regret(π_RT; H) + D_h,

where the first inequality is by the minimax property of π_RT and Θ⋆ ∈ E.

E PROOF OF THEOREM 4

For all the conclusions in this section, the bounded-state property of LQG-VTR (Lemma 5) is a crucial condition. Let V be the event that ∥x̂_{t|t,Θ̃_t}∥_2 ≤ X̄_1 for all 1 ≤ t ≤ H (so that LQG-VTR does not terminate at Line 9); then by Lemma 5 we know P(V) ≥ 1 − δ. We provide the full proof of Theorem 4 in this section, deferring the proofs of technical lemmas to Appendix F.

E.1 BOUNDED ℓ ∞ NORM OF F

To begin with, we show that ∥F∥_∞ is well bounded under event V.

Lemma 6 (Bounded norm of F). Suppose event V happens. Then each sample E_t = (τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k, x̂_{t+1|t+1,Θ̃_k}, y_{t+1}) (Line 13 of Algorithm 1) satisfies (τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) ∈ X for any 1 ≤ t ≤ H (recall that X is the domain of f defined in Section 4.4). Moreover, there exists a constant D such that for any Θ ∈ E,

∥f_Θ∥_∞ := max_{E∈X} |f_Θ(E)| ≤ D = O((n + p) log^2 H),   ∥h⋆_Θ∥_∞ ≤ D.

E.2 THE OPTIMISTIC CONFIDENCE SET

Now that ∥F∥_∞ is bounded, we can show that the real-world model Θ⋆ is contained in the confidence set with high probability. Note that this lemma requires a more involved analysis than usual since we use a biased value-target regression (Ayoub et al., 2020).

Lemma 7 (Optimism). With probability at least 1 − 2δ, Θ⋆ ∈ U_k for every episode k ∈ [K].

E.3 THE CLIPPING ERROR

We also analyze the clipping error introduced by replacing f′ with f (see Section 4.4) in LQG-VTR.

Lemma 8. Suppose event V happens. Then there exists a constant ∆̄ such that for any system Θ = (A, B, C) ∈ E and any input (τ^l, x, u, Θ̃) ∈ X at time step t,

∆_f(τ_t, x, u, Θ̃) := |f′_Θ(τ_t, x, u, Θ̃) − f_Θ(τ^l, x, u, Θ̃)| ≤ ∆̄ = O(κ_2(1 − γ_2)^l (n + p) log^2 H),

where τ_t = {y_0, u_0, y_1, …, y_t} is the full history and τ^l = {u_{t−l}, y_{t−l+1}, …, u_{t−1}, y_t} is the clipped history.
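The mechanism behind this lemma can be illustrated on a toy scalar system (all values below are assumed, not the paper's): the belief computed from only the last l observation/action pairs differs from the full-history belief by an amount that shrinks geometrically in l.

```python
# Scalar Kalman-filter belief from full vs. clipped history.  The clipped
# belief starts from 0 at step T - l; the discrepancy is |Abar|^l times
# the (bounded) discarded belief, mirroring the kappa_2 (1-gamma_2)^l rate.
a, b, c, L = 0.9, 1.0, 1.0, 0.5
Abar = (1.0 - L * c) * a          # (I - LC)A, here 0.45
Bbar = (1.0 - L * c) * b          # (I - LC)B
T = 200
ys = [((7 * t) % 11 - 5) / 5.0 for t in range(T + 1)]  # deterministic "observations"
us = [((3 * t) % 13 - 6) / 6.0 for t in range(T)]      # deterministic "actions"

def belief(l):
    # x-hat_{T|T} computed from only the last l pairs (l = T: full history).
    x = 0.0
    for s in range(min(l, T), 0, -1):
        x = Abar * x + Bbar * us[T - s] + L * ys[T - s + 1]
    return x

full = belief(T)
for l in (2, 4, 8, 16, 32):
    err = abs(belief(l) - full)
    # error <= |Abar|^l * (bound on the discarded belief), here <= |Abar|^l * 10
    assert err <= abs(Abar) ** l * 10.0, (l, err)
```

With l logarithmic in the horizon, the clipping error is driven below any inverse-polynomial threshold, which is exactly how l is chosen in the regret proof.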

E.4 LOW SWITCHING PROPERTY

Finally, we observe that the episode switching protocol based on the importance score (Line 14 of Algorithm 1) ensures that the total number of episodes is only O(log H).

Lemma 9 (Low Switching Cost). Suppose event V happens and set ψ = 4D^2 + 1. Then the total number of episodes K of LQG-VTR (Algorithm 1) is bounded by K < K̄, where

K̄ := C_K dim_E(F, 1/H) log^2(DH)

for some constant C_K.

Proof. This lemma is implied by Lemma 6 and the proof of Lemma 9 in Chen et al. (2021). Note that it guarantees the number of episodes will NOT reach K̄, since K < K̄ rather than K ≤ K̄. This property is important: it ensures that LQG-VTR can always switch to a new episode immediately once the importance score exceeds 1.
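The counting argument behind this kind of low-switching guarantee can be illustrated with a simplified threshold rule (this is an analogy, not the paper's exact importance score): if each switch requires the accumulated score to exceed a fixed threshold ψ, and the total score over the horizon is O(log H), then the number of episodes is O(log H).

```python
import math

# Simplified low-switching demo: per-step scores shrink like 1/t, so the
# total accumulated score over H steps is ~log(H).  A new episode starts
# only when the running score exceeds psi = 1, so episodes = O(log H).
H = 100_000
psi = 1.0
scores = [1.0 / (t + 1) for t in range(H)]

episodes, acc = 1, 0.0
for s in scores:
    acc += s
    if acc > psi:          # switch immediately once the threshold is hit
        episodes += 1
        acc = 0.0

# Each completed episode consumes more than psi of the ~log(H) budget.
assert episodes <= math.ceil(math.log(H)) + 2, episodes
```

In the actual algorithm the per-step score is an elliptical-potential-style quantity whose total is bounded by the eluder dimension times polylog factors, which yields the K̄ above.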

E.5 PROOF OF THEOREM 4

We have now prepared all the crucial lemmas for the proof of the main theorem. Putting everything together, we provide the full proof of Theorem 4 below.

Proof of Theorem 4. Define

Regret′(H) := Σ_{t=T_w+1}^{H} (y_t⊤Qy_t + u_t⊤Ru_t − J⋆(Θ⋆));

then Regret(H) = E[Regret′(H)] + O(T_w). We assume that event V happens and that the optimistic property of U_k (Lemma 7) holds, deferring the failure of these two events to the end of the proof. As a consequence of V, the algorithm does not terminate at Line 9. Since the regret incurred in the model selection stage is O(T_w) = Õ(K̄^2), which is only polylogarithmic in H, it cannot be the dominating term of the regret. Denoting the full history at time step t by τ_t = {y_0, u_0, y_1, …, u_{t−1}, y_t}, we decompose the regret as

Regret′(H) = Σ_{t=T_w+1}^{H} (y_t⊤Qy_t + u_t⊤Ru_t − J⋆(Θ⋆))   (78)
 ≤ Σ_{t=T_w+1}^{H} (y_t⊤Qy_t + u_t⊤Ru_t − J⋆(Θ̃_k))   (by Lemma 7)   (79)
 = Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} [E_{w_t,z_{t+1}}[x̂^{u_t⊤}_{t+1|t+1,Θ̃_k}(P̃ − C̃⊤QC̃)x̂^{u_t}_{t+1|t+1,Θ̃_k} + y^{u_t⊤}_{t+1,Θ̃_k}Qy^{u_t}_{t+1,Θ̃_k}] − f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k)]
 + Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} [f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − x̂⊤_{t|t,Θ̃_k}(P̃ − C̃⊤QC̃)x̂_{t|t,Θ̃_k} − y_t⊤Qy_t],

where we denote Θ̃_k = (Ã, B̃, C̃) and P̃ = P(Θ̃_k). For any k ∈ [K], the last term (84) of the regret decomposition can be bounded as

Σ_{t=S(k)}^{T(k)} [f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − x̂⊤_{t|t,Θ̃_k}(P̃ − C̃⊤QC̃)x̂_{t|t,Θ̃_k} − y_t⊤Qy_t]   (85)
 = Σ_{t=S(k)}^{T(k)−1} [f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − x̂⊤_{t+1|t+1,Θ̃_k}(P̃ − C̃⊤QC̃)x̂_{t+1|t+1,Θ̃_k} − y⊤_{t+1}Qy_{t+1}]   (86)
 + f′_{Θ⋆}(τ_{T(k)}, x̂_{T(k)|T(k),Θ̃_k}, u_{T(k)}, Θ̃_k) − x̂⊤_{S(k)|S(k),Θ̃_k}(P̃ − C̃⊤QC̃)x̂_{S(k)|S(k),Θ̃_k} − y⊤_{S(k)}Qy_{S(k)}.   (87)

Observe that Eqn. (86) is a sum of martingale differences, and Eqn. (87) is bounded by 2D by Lemma 6. Summing over k, with probability 1 − δ the contribution of (86) is bounded by O(D√(H log(1/δ))), and the contribution of (87) is bounded by 2DK. Moreover, by definition the conditional expectation in the first term equals f′_{Θ̃_k}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k). Therefore, Eqn. (82) and (83) become

Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} [f′_{Θ̃_k}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k)].

Consider any episode 1 ≤ k ≤ K, and define Z_k as the dataset used for the regression at the end of episode k (Z_0 = ∅). The construction of the confidence set U_k (Eqn. (10)) shows that

∥f_{Θ̃_k} − f_{Θ⋆}∥²_{Z_{k−1}} ≤ 2β.   (93)

By the definition of the importance score (Line 14 of Algorithm 1), we know that for any S(k) ≤ t ≤ T(k),

Σ_{s=S(k)}^{t} (f_{Θ̃_k}(τ^l_s, x̂_{s|s,Θ̃_k}, u_s, Θ̃_k) − f_{Θ⋆}(τ^l_s, x̂_{s|s,Θ̃_k}, u_s, Θ̃_k))² ≤ 2β + ψ + 4D²,   (94)

where τ^l_s = {u_{s−l}, y_{s−l+1}, …, u_{s−1}, y_s} if l ≤ s, and τ^l_s denotes the full history at step s if l > s. Summing up Eqn. (93) and (94) implies that for any k and t,

Σ_{o=1}^{k} Σ_{s=S(o)}^{min(t,T(o))} (f_{Θ̃_k}(τ^l_s, x̂_{s|s,Θ̃_o}, u_s, Θ̃_o) − f_{Θ⋆}(τ^l_s, x̂_{s|s,Θ̃_o}, u_s, Θ̃_o))² ≤ 4β + ψ + 4D².   (95)

Invoking Lemma 26 of Jin et al. (2021) with G = F − F, g_t = f_{Θ̃_k} − f_{Θ⋆}, ω = 1/H, and μ_s(·) = 1[· = (τ^l_s, x̂_{s|s,Θ̃_o}, u_s, Θ̃_o)], we have

Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} |f_{Θ̃_k}(τ^l_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − f_{Θ⋆}(τ^l_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k)| ≤ O(dim_E(F, 1/H)√(βH) + D min(H, dim_E(F, 1/H))).

The clipping error between f′ and f can be bounded by Lemma 8:

Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} |f′_{Θ̃_k}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k)|
 ≤ Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} |f_{Θ̃_k}(τ^l_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − f_{Θ⋆}(τ^l_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k)| + Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} (∆_{f_{Θ̃_k}}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) + ∆_{f_{Θ⋆}}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k))
 ≤ O(dim_E(F, 1/H)√(βH) + κ_2(1 − γ_2)^l (n + p)H log² H).

Therefore, choosing l = O(log(Hn + Hp)), we obtain

Σ_{k=1}^{K} Σ_{t=S(k)}^{T(k)} |f′_{Θ̃_k}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k) − f′_{Θ⋆}(τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k)| ≤ O(dim_E(F, 1/H)√(βH)),

where β = O(D² log(N(F, 1/H)/δ)) is defined in Eqn. (119). Plugging this into Eqn. (92) combined with Eqn. (89), we can finally bound the regret as

Regret′(H) ≤ O(dim_E(F, 1/H)√(βH) + D√(H log(1/δ)) + 2DK)

as long as V happens, the optimistic property Θ⋆ ∈ U_k holds for every episode k, and the martingale difference (86) concentrates. The probability that any of these three events fails is at most 4δ, in which case Regret′(H) can be very large.
The tail bound for Gaussian variables (Abbasi-Yadkori & Szepesvári, 2011) indicates that for any q > 0 and t > 0,

P(∥w_t∥_2 ≥ q) ≤ 2n exp(−q²/(2n)),   P(∥v_t∥_2 ≥ q) ≤ 2p exp(−q²/(2p)).

Therefore, a union bound implies

P(∃t, max(∥w_t∥_2, ∥v_t∥_2) ≥ q) ≤ 2Hn exp(−q²/(2n)) + 2Hp exp(−q²/(2p)).

Note that as long as ∥w_t∥_2, ∥v_t∥_2 ≤ q for all 0 ≤ t ≤ H, we have Regret′(H) ≤ C_R(q² + Ū²) for some instance-dependent constant C_R, since ∥u_t∥_2 ≤ Ū always holds thanks to the terminating condition (Line 9 of LQG-VTR). Thus we know

P(Regret′(H) > C_R(q² + Ū²)) ≤ 2Hn exp(−q²/(2n)) + 2Hp exp(−q²/(2p)).

We set q = √(Hnp) and δ = C_R^{−1}(q² + Ū²)^{−1}, so that

Regret(H) = E[Regret′(H)] + O(T_w)   (104)
 ≤ O(dim_E(F, 1/H)√(βH)) + 4C_Rδ(q² + Ū²) + ∫_{C_R(q²+Ū²)}^{+∞} P(Regret′(H) > x) dx   (105)
 ≤ O(dim_E(F, 1/H)√(βH)).   (106)

The integral above is a constant because P(Regret′(H) > x) is exponentially small. The theorem finally follows since β = Õ(D² log N(F, 1/H)) and ∥F∥_∞ ≤ D = Õ(n + p).

F PROOF OF TECHNICAL LEMMAS IN APPENDIX E

F.1 PROOF OF LEMMA 6

Proof of Lemma 6. The first part of the lemma is implied by the definition of V. By the definition of X, we know that for any (τ^l, x, u, Θ̃) ∈ X and Θ ∈ E,

f_Θ(τ^l, x, u, Θ̃) = E_{e_t}[x⊤(P̃ − C̃⊤QC̃)x + ⋯ + (C̃Ãx̂^c_{t|t,Θ} + C̃B̃u)⊤Q(C̃Ãx̂^c_{t|t,Θ} + C̃B̃u)]   (111)–(113)
 + trace(L̃⊤(P̃ − C̃⊤QC̃)L̃(CΣC⊤ + I)) + trace(Q(CΣC⊤ + I)),   (114)

where P̃ = P(Θ̃), L̃ = L(Θ̃), and e_t = Cw_t + CA(x_t − x̂_{t|t,Θ}) + z_{t+1} is the innovation noise such that e_t ∼ N(0, CΣ(Θ)C⊤ + I) (Zheng et al., 2021; Lale et al., 2021).

Note that x̂^c_{t|t,Θ} = x̂_{t|t,Θ} − (A − LCA)^l x̂_{t−l|t−l,Θ}, so ∥x̂^c_{t|t,Θ}∥_2 ≤ (1 + κ_2(1 − γ_2)^l)X̄_2. Therefore, each factor in Eqn. (111)–(114) is bounded by O((√n + √p) log H), including ∥x̂^c_{t|t,Θ}∥_2 = O((√n + √p) log H). We can therefore find a constant D_1 such that ∥f_Θ∥_∞ ≤ D_1 = O((n + p) log² H) for any Θ ∈ E. By the definition of the optimal bias function in (4), we have

h⋆_Θ(x̂_{t|t,Θ}, y_t) = x̂⊤_{t|t,Θ}(P(Θ) − C⊤QC)x̂_{t|t,Θ} + y_t⊤Qy_t   (115)
 ≤ D_2 := (N_P + N_S²N_U)X̄_2² + N_U Ȳ².   (116)

It suffices to choose D := max(D_1, D_2) = O((n + p) log² H).

F.2 PROOF OF LEMMA 7

Proof of Lemma 7. As a starting point, Θ⋆ ∈ U_1 holds with probability 1 − δ/4 due to the warm-up procedure (see Appendix C). We assume event V happens for now and discuss its failure at the end of the proof. For each episode 1 ≤ k ≤ K and step t in the episode, define X_t := (τ_t, x̂_{t|t,Θ̃_k}, u_t, Θ̃_k), d_t := f′_{Θ⋆}(X_t) − f_{Θ⋆}(X_t), and Y_t := x̂⊤_{t+1|t+1,Θ̃_k}(P̃ − C̃⊤QC̃)x̂_{t+1|t+1,Θ̃_k} + y⊤_{t+1}Qy_{t+1} for P̃ = P(Θ̃_k), C̃ = C(Θ̃_k) (note that f_{Θ⋆} depends only on the l-step recent history τ^l instead of the full history τ_t). Let F_{t−1} be the filtration generated by (X_0, Y_0, X_1, Y_1, …, X_{t−1}, Y_{t−1}, X_t); then we know that f′_{Θ⋆}(X_t) = E[Y_t | F_{t−1}] by definition. Define Z_t := Y_t − f′_{Θ⋆}(X_t); then Z_t is a D/2-subGaussian random variable conditioned on F_{t−1} by Lemma 6, and it is F_t-measurable (hence F-adapted). For the dataset Z at the end of episode k, it holds that

∥Y − f∥²_Z − ∥Y − f*∥²_Z = ∥f* − f∥²_Z + 2⟨Z + d, f* − f⟩_Z,

where ⟨x, y⟩_Z := Σ_{e∈Z} x(e)y(e), ∥x − y∥²_Z := ⟨x − y, x − y⟩_Z, and f* := f_{Θ⋆}. Rearranging the terms gives

(1/2)∥f* − f∥²_Z = ∥Y − f∥²_Z − ∥Y − f*∥²_Z + E(f)

Published as a conference paper at ICLR 2023

for E(f) := −(1/2)∥f* − f∥²_Z + 2⟨Z + d, f − f*⟩_Z. Recall that f̂ := f_{Θ̃_{k+1}} = argmin_{Θ∈U_1}∥f_Θ − Y∥²_Z and f* ∈ U_1, so we have ∥f̂ − Y∥²_Z ≤ ∥f* − Y∥²_Z. Thus

(1/2)∥f* − f̂∥²_Z ≤ E(f̂).
In order to show that $\hat f$ is close to $f^*$, it suffices to bound $E(\hat f)$. For some fixed $\alpha>0$, let $\mathcal G(\alpha)$ be an $\alpha$-cover of $\mathcal F$ in terms of $\|\cdot\|_\infty$. Let $\tilde f \overset{\text{def}}{=} \arg\min_{f\in\mathcal G(\alpha)}\|\hat f - f\|_{\mathcal Z}$; then
$$E(\hat f) = E(\hat f) - E(\tilde f) + E(\tilde f) \le E(\hat f) - E(\tilde f) + \max_{f\in\mathcal G(\alpha)} E(f).$$
We now start to bound these terms. For any fixed $f\in\mathcal F$, we know that $2\langle Z, f-f^*\rangle_{\mathcal Z}$ is $D\|f-f^*\|_{\mathcal Z}$-subGaussian. Then it holds with probability $1-3\delta/4$, for any episode $k$ and $\lambda>0$, that
$$E(f) \le -\tfrac12\|f^*-f\|^2_{\mathcal Z} + \tfrac1\lambda\log\tfrac{4}{3\delta} + \tfrac{\lambda D^2\|f-f^*\|^2_{\mathcal Z}}{2} + 2\langle d, f-f^*\rangle_{\mathcal Z}.$$
Choosing $\lambda = 1/D^2$, we get
$$E(f) \le D^2\log\tfrac{4}{3\delta} + 2\langle d, f-f^*\rangle_{\mathcal Z} \le D^2\log\tfrac{4}{3\delta} + 2\Delta DH, \quad (117)$$
since $\langle d, f-f^*\rangle_{\mathcal Z} \le \|d\|_{\mathcal Z}\|f-f^*\|_{\mathcal Z} \le \Delta\sqrt H\cdot D\sqrt H = \Delta DH$. By a union bound argument, we know that with probability $1-3\delta/4$ the term $\max_{f\in\mathcal G(\alpha)} E(f)$ is bounded by $D^2\log(4|\mathcal G(\alpha)|/3\delta) + 2\Delta DH$.

For $E(\hat f) - E(\tilde f)$, we have
$$E(\hat f) - E(\tilde f) = \tfrac12\|\tilde f - f^*\|^2_{\mathcal Z} - \tfrac12\|\hat f - f^*\|^2_{\mathcal Z} + 2\langle Z+d, \hat f-\tilde f\rangle_{\mathcal Z}$$
$$\le \tfrac12\big\langle\tilde f - \hat f,\ \tilde f + \hat f - 2f^*\big\rangle_{\mathcal Z} + 2\big(\|Z\|_{\mathcal Z} + \|d\|_{\mathcal Z}\big)\|\hat f - \tilde f\|_{\mathcal Z}.$$
Note that by definition of $\tilde f$ it holds that $\|\hat f - \tilde f\|_{\mathcal Z} \le \alpha\sqrt H$. Together with $\|\hat f\|_{\mathcal Z}, \|\tilde f\|_{\mathcal Z}, \|f^*\|_{\mathcal Z} \le D\sqrt H$, we bound $E(\hat f) - E(\tilde f)$ as
$$E(\hat f) - E(\tilde f) \le 2D\alpha H + 2\Big(\Delta\sqrt H + \tfrac D2\sqrt{2H\log(3H(H+1)/\delta)}\Big)\alpha\sqrt H \le 2(D+\Delta)\alpha H + \alpha H\cdot D\sqrt{2\log(3H(H+1)/\delta)}. \quad (118)$$
Here we take advantage of the $D/2$-subGaussian property of $Z_t$, in that with probability $1-3\delta/4$, for any episode $k$,
$$\|Z\|_{\mathcal Z} \le \tfrac D2\sqrt{2|\mathcal Z|\log(2|\mathcal Z|(|\mathcal Z|+1)/\delta)} \le \tfrac D2\sqrt{2H\log(3H(H+1)/\delta)}.$$
Merging Eqn. (117) and (118) with another union bound, we get that with probability $1-\delta$, for any episode $k$,
$$\|f^*-\hat f\|^2_{\mathcal Z} \le 2D^2\log(2N_\alpha/\delta) + 4\Delta DH + 2\alpha H\Big(2(D+\Delta) + D\sqrt{2\log(6H(H+1)/\delta)}\Big),$$
where $N_\alpha$ is the $(\alpha, \|\cdot\|_\infty)$-covering number of $\mathcal F$. Finally, we set $\alpha = 1/H$ and
$$\beta = 2D^2\log(2N(\mathcal F, 1/H)/\delta) + 4\Delta DH + 4(D+\Delta) + 2D\sqrt{2\log(6H(H+1)/\delta)}. \quad (119)$$
It holds that $\Theta^\star\in\mathcal U_k$ for any $1\le k\le K$ with probability $1-\delta/4-3\delta/4 = 1-\delta$ as long as $\mathcal V$ happens. Since $\mathcal V$ happens with probability $1-\delta$, we know the optimistic property holds with probability $1-2\delta$.
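The quadratic decomposition used at the start of the proof is a pointwise algebraic identity once the targets are written as $Y = f^* + d + Z$; a minimal NumPy check with synthetic stand-ins for $f^*$, $f$, $Z$, and $d$ (all sizes and scales here are arbitrary, purely illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
H = 50  # hypothetical dataset size |Z|

f_star = rng.normal(size=H)         # f_{Theta*} evaluated on the dataset
f = rng.normal(size=H)              # a candidate f_Theta
Z = rng.normal(scale=0.1, size=H)   # conditionally sub-Gaussian noise Z_t
d = rng.normal(scale=0.01, size=H)  # clipping bias d_t = f' - f at Theta*
Y = f_star + d + Z                  # targets Y_t

# ||Y-f||^2 - ||Y-f*||^2 = ||f*-f||^2 + 2<Z+d, f*-f>
lhs = np.sum((Y - f) ** 2) - np.sum((Y - f_star) ** 2)
rhs = np.sum((f_star - f) ** 2) + 2 * np.dot(Z + d, f_star - f)
assert abs(lhs - rhs) < 1e-8
```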
By the decomposition rule for $f$ and $f'$ (Eqn. (111)-(114)), we know
$$f'_\Theta(\tau_t, x, u, \tilde\Theta) - f_\Theta(\tau^l, x, u, \tilde\Theta) \quad (120)$$
$$= 2e^{c\top}A^\top\tilde C^\top\tilde L^\top\big(\tilde P - \tilde C^\top Q\tilde C\big)\Big(\tilde L\tilde CA\hat x^c_{t|t,\Theta} + \tilde L\tilde CBu + \big(I - \tilde L\tilde C\big)\big(\tilde Ax + \tilde Bu\big)\Big) \quad (121)$$
$$+ e^{c\top}A^\top\tilde C^\top\tilde L^\top\big(\tilde P - \tilde C^\top Q\tilde C\big)\tilde L\tilde CAe^c \quad (122)$$
$$+ 2e^{c\top}A^\top\tilde C^\top Q\big(\tilde CA\hat x^c_{t|t,\Theta} + \tilde CBu\big) + e^{c\top}A^\top\tilde C^\top Q\tilde CAe^c. \quad (123)$$
Note that $\bar X_2 = O((\sqrt n+\sqrt p)\log H)$; thereby there exists a constant $\Delta$ such that
$$\big|f'_\Theta(\tau_t, x, u, \tilde\Theta) - f_\Theta(\tau^l, x, u, \tilde\Theta)\big| \le \Delta = O\big(\kappa_2(1-\gamma_2)^l(n+p)\log^2 H\big).$$

G THE INTRINSIC MODEL COMPLEXITY $\delta_{\mathcal E}$

In this section, we show how large the intrinsic model complexity $\delta_{\mathcal E}$ can be for different simulator classes $\mathcal E$. An important message is that $\delta_{\mathcal E}$ does NOT have any polynomial dependency on $H$.

G.1 GENERAL SIMULATOR CLASS

Without further structure, we can show that $\delta_{\mathcal E}$ is bounded by $\delta_{\mathcal E} = \tilde O\big(np^2(n+m+p)(n+p)^2(m+p)^2\big)$ due to Propositions 2 and 3 and the fact that $\|\mathcal F\|_\infty \le D = \tilde O(n+p)$.

Proposition 2. Under Assumptions 1, 2, and 3, the $1/H$-Eluder dimension of $\mathcal F$ is bounded by
$$\dim_E(\mathcal F, 1/H) = \tilde O\big(p^4 + p^3m + p^2m^2\big). \quad (126)$$

Proof. The bound on the Eluder dimension mainly comes from the fact that $f_\Theta(\tau^l, x, u, \tilde\Theta)$ can be regarded as a linear function between features of $\Theta$ and features of $\tilde\Theta$. We use a "feature" of a system $\Theta$ to represent a quantity that depends only on $\Theta$ (independent of $\tilde\Theta$), and vice versa. It is well known that linear function classes have bounded Eluder dimension. We formalize this idea below. In the proof of Lemma 8, we know that for any $(\tau^l, x, u, \tilde\Theta)\in\mathcal X$ and $\Theta\in\mathcal E$,
$$f_\Theta(\tau^l, x, u, \tilde\Theta) = \Big(\tilde L\tilde CA\hat x^c_{t|t,\Theta} + \tilde L\tilde CBu + \big(I-\tilde L\tilde C\big)\big(\tilde Ax + \tilde Bu\big)\Big)^\top\big(\tilde P - \tilde C^\top Q\tilde C\big)\times \quad (127)$$
$$\Big(\tilde L\tilde CA\hat x^c_{t|t,\Theta} + \tilde L\tilde CBu + \big(I-\tilde L\tilde C\big)\big(\tilde Ax + \tilde Bu\big)\Big) \quad (128)$$
$$+ \big(\tilde CA\hat x^c_{t|t,\Theta} + \tilde CBu\big)^\top Q\big(\tilde CA\hat x^c_{t|t,\Theta} + \tilde CBu\big) \quad (129)$$
$$+ \mathrm{trace}\Big(\tilde L^\top\big(\tilde P - \tilde C^\top Q\tilde C\big)\tilde L\big(C\Sigma C^\top + I\big)\Big) + \mathrm{trace}\Big(Q\big(C\Sigma C^\top + I\big)\Big).$$
Denote the input as $E \overset{\text{def}}{=} (\tau^l, x, u, \tilde\Theta)\in\mathcal X$, and define three feature mappings $\zeta: \mathcal X\to\mathbb R^{n\times n}$, $\phi: \mathcal X\to\mathbb R^n$ and $\Phi: \mathcal X\to\mathbb R^{p\times p}$ such that
$$\zeta(E) = \tilde P - \tilde C^\top Q\tilde C, \qquad \phi(E) = \big(I - \tilde L\tilde C\big)\big(\tilde Ax + \tilde Bu\big), \qquad \Phi(E) = \tilde L^\top\zeta(E)\tilde L + Q. \quad (130)$$

Proposition 3. Under Assumptions 1, 2, and 3, the logarithmic $1/H$-covering number of $\mathcal F$ is bounded by $\log N(\mathcal F, 1/H) = \tilde O\big(n^2 + nm + np\big)$.

Proof. Under Assumption 2, the spectral norms of the system dynamics of $\Theta = (A, B, C)\in\mathcal E$ are bounded by $\|A\|_2, \|B\|_2, \|C\|_2 \le N_S$. Therefore, we can construct an $\epsilon_0$-net $\mathcal G_{\epsilon_0}(\mathcal E)$ such that for any $\Theta = (A, B, C)$, there exists $\bar\Theta = (\bar A, \bar B, \bar C)\in\mathcal G_{\epsilon_0}(\mathcal E)$ satisfying $\max\big(\|A-\bar A\|_2, \|B-\bar B\|_2, \|C-\bar C\|_2\big) \le \epsilon_0$. By classic theory, we know $|\mathcal G_{\epsilon_0}(\mathcal E)| \le O\big((1 + 2\sqrt nN_S/\epsilon_0)^{n^2+nm+np}\big)$. To see this, observe that $\|\cdot\|_2 \le \|\cdot\|_F \le \sqrt{\min(n,m)}\,\|\cdot\|_2$ for any $n\times m$ matrix, so we can reduce the $\epsilon$-cover with respect to the spectral norm to an $\epsilon$-cover with respect to the Frobenius norm. To bound the covering number of $\mathcal F$, we check the gap $\|f_\Theta - f_{\bar\Theta}\|_\infty$.
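The norm comparison used to pass between spectral-norm and Frobenius-norm covers, $\|X\|_2 \le \|X\|_F \le \sqrt{\min(n,m)}\,\|X\|_2$, can be sanity-checked numerically (a throwaway sketch, not part of the proof; sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 7, 4
for _ in range(100):
    X = rng.normal(size=(n, m))
    spec = np.linalg.norm(X, 2)      # spectral norm (largest singular value)
    frob = np.linalg.norm(X, "fro")  # Frobenius norm
    assert spec <= frob + 1e-12
    assert frob <= np.sqrt(min(n, m)) * spec + 1e-12
```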
By Lemma 3.1 of Lale et al. (2020c), we know that for small enough $\epsilon_0$ there exist instance-dependent constants $C_\Sigma$, $C_L$ such that $\|\Sigma - \bar\Sigma\|_2 \le C_\Sigma\epsilon_0$ and $\|L - \bar L\|_2 \le C_L\epsilon_0$, where $\bar\Sigma = \Sigma(\bar\Theta)$, $\bar L = L(\bar\Theta)$. The constant $C_\Sigma$ is slightly different from Lemma 3.1 of their paper, because we do not assume $M \overset{\text{def}}{=} A - ALC$ is contractive (i.e., $\|M\|_2 < 1$). Rather, we know $M$ is $(\kappa_2, \gamma_2)$-strongly stable. Therefore, the linear mapping $\mathcal T: X \mapsto X - MXM^\top$ is invertible with $\|\mathcal T^{-1}\|_2 \le \kappa_2^2/(2\gamma_2 - \gamma_2^2)$ according to Lemma B.4 of Simchowitz & Foster (2020). As a result, we can set $C_\Sigma$ by replacing the $1/(1-\nu^2)$ term in the original constant of Lale et al. (2020c, Lemma 3.1) with $\kappa_2^2/(2\gamma_2 - \gamma_2^2)$.

For any $E = (\tau^l, x, u, \tilde\Theta)\in\mathcal X$, we wish to compute the difference $|f_\Theta(E) - f_{\bar\Theta}(E)|$. As a first step, we bound the difference between $\hat x^c_{t|t,\Theta}$ and $\hat x^c_{t|t,\bar\Theta}$. Note that for any $1\le s\le l$,
$$\big\|(A - LCA)^{s-1} - (\bar A - \bar L\bar C\bar A)^{s-1}\big\|_2 \le (s-1)\kappa_2^2(1-\gamma_2)^{s-2}\big(1 + N_S^2 + 2N_LN_S\big)\epsilon_0. \quad (136)$$
Similarly, we can bound $\|(B - LCB) - (\bar B - \bar L\bar C\bar B)\|_2$. Following the decomposition rule for $\hat x^c_{t|t,\Theta}$ over the clipped history and observing that $\|u_{t-s}\|_2 \le \bar U$ and $\|y_{t-s+1}\|_2 \le \bar Y$ for $1\le s\le l$, we have $\|\hat x^c_{t|t,\Theta} - \hat x^c_{t|t,\bar\Theta}\|_2 \le C_x\epsilon_0$, where $C_x$ is an instance-dependent constant (also depending on $C_\Sigma$, $C_L$). By Eqn. (134) and (135), it holds that
$$f_\Theta(E) - f_{\bar\Theta}(E) = \mathrm{trace}\Big(\Phi(E)\big(C\Sigma C^\top - \bar C\bar\Sigma\bar C^\top\big)\Big) + \mathrm{Diff}_1 + \mathrm{Diff}_2,$$
where $\mathrm{Diff}_1$ and $\mathrm{Diff}_2$ are given in Eqn. (151)-(154). Noting that $\|C\Sigma C^\top - \bar C\bar\Sigma\bar C^\top\|_2 \le (2N_SN_\Sigma + N_S^2C_\Sigma)\epsilon_0$ and $\|\Phi(E)\|_2 \le N_L^2(N_P + N_S^2N_U) + N_U$, we have
$$|f_\Theta(E) - f_{\bar\Theta}(E)| \le C_f\epsilon_0, \qquad C_f \overset{\text{def}}{=} p\|\Phi(E)\|_2\big(2N_SN_\Sigma + N_S^2C_\Sigma\big) + C_{d_1} + C_{d_2}.$$
This means $\|f_\Theta - f_{\bar\Theta}\|_\infty \le C_f\epsilon_0$. Therefore, the subset $\{f_{\bar\Theta} \mid \bar\Theta\in\mathcal G_{\epsilon_0}(\mathcal E)\}$ forms a $C_f\epsilon_0$-covering set of $\mathcal F$. Finally, it suffices to set $\epsilon_0 = C_f^{-1}H^{-1}$ to induce a $1/H$-covering set of $\mathcal F$:
$$N(\mathcal F, 1/H) \le \big|\mathcal G_{C_f^{-1}H^{-1}}(\mathcal E)\big| = O\Big(\big(1 + 2\sqrt nN_SC_fH\big)^{n^2+nm+np}\Big),$$
which finishes the proof.
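The Lipschitz behavior of $\Sigma$ and $L$ in the system parameters ($\|\Sigma-\bar\Sigma\|_2 \le C_\Sigma\epsilon_0$, $\|L-\bar L\|_2 \le C_L\epsilon_0$) can be observed numerically by solving the filtering DARE for a system and a nearby perturbation of it. A sketch with SciPy; the matrices, the tolerance constant, and the helper name `kalman_steady` are all our illustrative choices, not quantities from the paper:

```python
import numpy as np
from scipy.linalg import solve_discrete_are

def kalman_steady(A, C, W, V):
    """Steady-state (predictive) filtering covariance and Kalman gain
    via the dual discrete algebraic Riccati equation."""
    Sigma = solve_discrete_are(A.T, C.T, W, V)
    L = Sigma @ C.T @ np.linalg.inv(C @ Sigma @ C.T + V)
    return Sigma, L

A = np.array([[0.9, 0.1], [0.0, 0.8]])  # stable example dynamics
C = np.array([[1.0, 0.0]])
W, V = np.eye(2), np.eye(1)             # process / observation noise covariances

eps0 = 1e-4
Sigma, L = kalman_steady(A, C, W, V)
Sigma_p, L_p = kalman_steady(A + eps0 * np.ones_like(A), C, W, V)

# Both Sigma and L move by O(eps0), consistent with the Lipschitz bounds;
# the factor 100 is a loose illustrative stand-in for C_Sigma, C_L.
assert np.linalg.norm(Sigma - Sigma_p, 2) < 100 * eps0
assert np.linalg.norm(L - L_p, 2) < 100 * eps0
```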

G.2 LOW-RANK SIMULATOR CLASS

In many sim-to-real tasks, only a few control parameters affect the dynamics in the simulator class $\mathcal E$ (see, e.g., Table 1 of OpenAI et al. (2018)). The number of control parameters is often a constant that is independent of the dimension of the task. This means our simulator class $\mathcal E$ is a low-rank class in these scenarios:
$$\mathcal E = \big\{\Theta(t_1, t_2, \ldots, t_k) \mid (t_1, t_2, \ldots, t_k)\in\mathcal T\big\},$$
where $t_1, t_2, \ldots, t_k$ represent $k$ control parameters and $\mathcal T$ is their domain; the dynamics of the simulator $\Theta(t_1, \ldots, t_k)$ depend only on these parameters. It is a common situation that $\Theta(t_1, t_2, \ldots, t_k)$ is a continuous function of $(t_1, \ldots, t_k)$; for example, this holds when the control parameters are physical parameters such as friction or damping coefficients. A simple example is when $\Theta$ is a linear combination of $k$ base simulators: $\Theta = (A, B, C) = \sum_{i=1}^k t_i\Theta_i$, where $\Theta_i = (A_i, B_i, C_i)$ is a fixed base simulator. This can be achieved if we approximate the effect of the control parameters with some linear mappings on the states $x_t$. We can set different values of the control parameters $t_1, \ldots, t_k$ to generate new simulators. In such a continuous low-rank class, the log covering number $\log N(\mathcal F, 1/H)$ reduces to only $\tilde O(k)$ as long as $\Theta$ is continuous with respect to the control parameters; it is straightforward to see this by checking the proof of Proposition 3. Unfortunately, it is unclear whether the Eluder dimension of such a continuous low-rank class will be smaller, owing to the presence of the Kalman gain matrix $L$. Overall, the intrinsic model complexity $\delta_{\mathcal E}$ will decrease from $\tilde O\big(np^2(n+m+p)(n+p)^2(m+p)^2\big)$ to at most $\tilde O\big(kp^2(m+p)^2(n+p)^2\big)$ for a small constant $k$.
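The linear-combination example can be sketched directly; here the base simulators $\Theta_i$ are random placeholders and the helper name `simulator` is ours:

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, p, k = 4, 2, 3, 3  # illustrative dimensions and number of parameters

# k fixed base simulators Theta_i = (A_i, B_i, C_i)
bases = [(rng.normal(size=(n, n)), rng.normal(size=(n, m)),
          rng.normal(size=(p, n))) for _ in range(k)]

def simulator(t):
    """Theta(t_1,...,t_k) = sum_i t_i * Theta_i: a continuous low-rank class."""
    A = sum(ti * Ai for ti, (Ai, _, _) in zip(t, bases))
    B = sum(ti * Bi for ti, (_, Bi, _) in zip(t, bases))
    C = sum(ti * Ci for ti, (_, _, Ci) in zip(t, bases))
    return A, B, C

# Different control-parameter settings generate different simulators
A, B, C = simulator([0.5, 0.3, 0.2])
assert A.shape == (n, n) and B.shape == (n, m) and C.shape == (p, n)
```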

G.3 LOW-RANK SIMULATOR CLASS WITH REAL DATA

Many works studying sim-to-real transfer also fine-tune the simulated policy with a small amount of real data from the real-world model (Tobin et al., 2017; Rusu et al., 2017). Moreover, the observations are usually the perturbed input states with observation noise introduced by vision inputs (e.g., from cameras) (Tobin et al., 2017; Peng et al., 2018; OpenAI et al., 2018). This means the observation $y_t$ equals $x_t + z_t$ for random noise $z_t$, without any linear transformation $C$. With this real data and noisy observations of the state, we can estimate the steady-state covariance matrix $\Sigma(\Theta^\star)$ with random actions, since it is independent of the policy! The intrinsic complexity of the simulator set is further reduced for such a low-rank simulator set (e.g., the simulator set defined by Eqn. (157)) combined with real data. In such low-rank simulator classes, we know that $\Sigma(\Theta^\star)$ is fixed to an estimate $\hat\Sigma$ computed from the noisy observations $y_t = x_t + z_t$. By definition, we know $L(\Theta^\star)$ is then also fixed to $\hat L = \hat\Sigma(\hat\Sigma + I)^{-1}$. Therefore, the Kalman dynamics matrix $A - LCA = A - \hat LA$ belongs to a $k$-dimensional space, as $A$ lies in a $k$-dimensional space. As a result, the dimension of the feature mappings $\varphi^s_u(t_1, t_2, \ldots, t_k), \varphi^s_y(t_1, t_2, \ldots, t_k)$ (Eqn. (137)) reduces to $O(k^2l^k)$. The Eluder dimension of this low-rank simulator class reduces to $\tilde O(k^4l^{2k}) = \tilde O(k^4\log^{2k}(H))$ for a small constant $k$. Therefore, the intrinsic model complexity $\delta_{\mathcal E}$ for a low-rank simulator class fine-tuned with real data is $\tilde O\big((n+p)^2k^5\log^{2k}(H)\big)$, which implies that the robust adversarial training algorithms are very powerful, with a small sim-to-real gap, in such simulator classes.
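The policy-independence of the steady-state covariance under random actions can be illustrated numerically: with $u_t \sim \mathcal N(0, \sigma_u^2 I)$, the state covariance converges to the solution of a discrete Lyapunov equation, and with $y_t = x_t + z_t$ the observable covariance is just that limit plus $I$. A sketch with SciPy; the system matrices and noise scales are arbitrary examples:

```python
import numpy as np
from scipy.linalg import solve_discrete_lyapunov

A = np.array([[0.9, 0.1], [0.0, 0.8]])  # stable example dynamics
B = np.array([[1.0], [0.5]])
W = 0.1 * np.eye(2)   # process noise covariance (example value)
sigma_u = 1.0          # exploratory actions u_t ~ N(0, sigma_u^2 I)

# Steady-state state covariance: Sigma_x = A Sigma_x A^T + sigma_u^2 B B^T + W
Sigma_x = solve_discrete_lyapunov(A, sigma_u**2 * B @ B.T + W)

# The same limit obtained by iterating the covariance recursion
S = np.zeros((2, 2))
for _ in range(500):
    S = A @ S @ A.T + sigma_u**2 * B @ B.T + W

# With y_t = x_t + z_t, z_t ~ N(0, I), the observable covariance is
# Sigma_x + I, regardless of which (random) action policy generated the data.
assert np.linalg.norm(S - Sigma_x) < 1e-8
```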

H PROOF OF THEOREM 2

Proof of Theorem 2. Throughout this proof, we assume $H$ is sufficiently large and omit some (instance-dependent) constants, because we focus on the dependency on $H$. Since LQR is a special case of LQG in the sense that the learner can observe the hidden state in LQR, it suffices to consider the case where $\mathcal E$ is a set of LQRs. Since $C = I$ in LQRs, Assumption 3 holds naturally. Let $(A^\star, B^\star)$ be the parameters satisfying Assumptions 1 and 2, and let $\mathcal E = \{(A, B) : \|A - A^\star\|_\infty \le \epsilon, \|B - B^\star\|_\infty \le \epsilon\}$. Choosing $\epsilon \propto H^{-1/2}$ where $H$ is sufficiently large, by the same perturbation analysis as in Simchowitz & Foster (2020, Appendix B), we have that all $\Theta\in\mathcal E$ satisfy Assumptions 1 and 2. Fixing a policy $\pi$, we obtain Eqn. (158). Notably, Simchowitz & Foster (2020) only impose the strongly stable assumption, which is weaker than our assumption, but their proof (Lemma B.7 in Simchowitz & Foster (2020)) still holds under our assumptions. Combining (158) and (159), we conclude the proof of Theorem 2.



as we show in Section 4.4, which leads to an $O(H)$ intrinsic complexity (i.e., $\delta_{\mathcal E} = O(H)$). Then the sim-to-real gap of $\pi_{\mathrm{RT}}$ in Theorem 1 becomes $O(H)$, which is vacuous. Fortunately, we can use a clipped history (Line 12 of Algorithm 1) to compute an approximation $f_\Theta(\hat x_{t|t}, u, \tilde\Theta)$ of the expectation.

Algorithm 1 LQG-VTR
1: Initialize: set model selection period length $T_w$ by Eqn. (56).
2:

Model Selection
1: Input: model selection period length $T_w$ by Eqn. (56), simulator set $\mathcal E$.
2: Set $\sigma_u = 1$.
3: Execute actions $u_t \sim \mathcal N(0, \sigma_u^2 I)$ for $T_w$ steps, and gather the dataset $\mathcal D_{\mathrm{init}} = \{y_t, u_t\}_{t=1}^{T_w}$.
4: Set the truncation length $\bar H = O(\log(\kappa_2 H)/\log(1/\gamma_2))$.
5: Estimate the Markov parameter matrix $\hat{\mathcal G}$ with $\mathcal D_{\mathrm{init}}$, following Eqn. (11) of Lale et al. (2021).
6: Run SYSID($\bar H$, $\hat{\mathcal G}$, $n$) (Algorithm 2 of Lale et al. (2021)) to obtain estimates $\hat A, \hat B, \hat C, \hat L$ of the real-world system $\Theta^\star$.
7: Return $\mathcal E \cap \mathcal C$, where $\mathcal C$ is computed through Eqn. (23).
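The warm-up data collection in steps 2-3 can be sketched directly; below, a random stable system stands in for the real world $\Theta^\star$, and all matrices and the value of $T_w$ are placeholders (not the quantities from Eqn. (56)):

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, p, T_w = 4, 2, 3, 200  # placeholder dimensions and warm-up length

# Arbitrary stable example system standing in for the real world Theta*
A = 0.5 * np.eye(n)
B = rng.normal(size=(n, m))
C = rng.normal(size=(p, n))

sigma_u = 1.0
x = np.zeros(n)
D_init = []
for _ in range(T_w):
    u = rng.normal(scale=sigma_u, size=m)   # u_t ~ N(0, sigma_u^2 I)
    x = A @ x + B @ u + rng.normal(size=n)  # process noise w_t
    y = C @ x + rng.normal(size=p)          # observation noise z_t
    D_init.append((y, u))                    # gather {y_t, u_t}

assert len(D_init) == T_w
```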

$-\hat C^\top Q\hat C\big)\hat x_{t|t,\hat\Theta_k} - y_t^\top Qy_t$ (by the Bellman equation of system $\hat\Theta_k$)

PROOF OF LEMMA 8

Proof of Lemma 8. Define $e^c \overset{\text{def}}{=} (A - LCA)^l\hat x_{t-l|t-l,\Theta}$; then $\hat x_{t|t,\Theta} = \hat x^c_{t|t,\Theta} + e^c$. Further, we know $\|e^c\|_2 \le \kappa_2(1-\gamma_2)^l\bar X_2$ under event $\mathcal V$ and Assumption 3.
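The bound $\|e^c\|_2 \le \kappa_2(1-\gamma_2)^l\bar X_2$ comes from strong stability: powers of a $(\kappa,\gamma)$-strongly stable matrix satisfy $\|M^l\|_2 \le \kappa(1-\gamma)^l$, so the tail dropped by history clipping shrinks geometrically in $l$. A toy numerical check (the matrix $M = HDH^{-1}$ below is a hand-picked example, not from the paper):

```python
import numpy as np

kappa, gamma = 3.0, 0.3

# A (kappa, gamma)-strongly stable matrix M = H D H^{-1}:
# ||D||_2 <= 1 - gamma and ||H||_2 * ||H^{-1}||_2 <= kappa
H = np.array([[1.0, 1.0], [0.0, 1.0]])  # cond(H) ~ 2.62 <= kappa
D = (1 - gamma) * np.diag([1.0, -0.5])
M = H @ D @ np.linalg.inv(H)

for l in range(1, 30):
    Ml = np.linalg.matrix_power(M, l)
    # the clipped tail decays at this geometric rate
    assert np.linalg.norm(Ml, 2) <= kappa * (1 - gamma) ** l + 1e-12
```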

Define $\mathcal A \overset{\text{def}}{=} A - LCA$ and $\bar{\mathcal A} \overset{\text{def}}{=} \bar A - \bar L\bar C\bar A$, where $\|\mathcal A - \bar{\mathcal A}\|_2 \le (1 + N_S^2 + 2N_LN_S)\epsilon_0$ and $\|\mathcal A\|_2, \|\bar{\mathcal A}\|_2 \le \kappa_2(1-\gamma_2)$.

$$\mathrm{Diff}_1 = \big(\tilde CA\hat x^c_{t|t,\Theta} + \tilde CBu\big)^\top Q\big(\tilde CA\hat x^c_{t|t,\Theta} + \tilde CBu\big) \quad (151)$$
$$- \big(\tilde C\bar A\hat x^c_{t|t,\bar\Theta} + \tilde C\bar Bu\big)^\top Q\big(\tilde C\bar A\hat x^c_{t|t,\bar\Theta} + \tilde C\bar Bu\big), \quad (152)$$
$$\mathrm{Diff}_2 = \big(\tilde L\tilde CA\hat x^c_{t|t,\Theta} + \tilde L\tilde CBu\big)^\top\zeta(E)\big(\tilde L\tilde CA\hat x^c_{t|t,\Theta} + \tilde L\tilde CBu\big) \quad (153)$$
$$- \big(\tilde L\tilde C\bar A\hat x^c_{t|t,\bar\Theta} + \tilde L\tilde C\bar Bu\big)^\top\zeta(E)\big(\tilde L\tilde C\bar A\hat x^c_{t|t,\bar\Theta} + \tilde L\tilde C\bar Bu\big). \quad (154)$$
Checking each term in $\mathrm{Diff}_1$ and $\mathrm{Diff}_2$, we conclude that there exist instance-dependent constants $C_{d_1}, C_{d_2}$ such that $|\mathrm{Diff}_1| \le C_{d_1}\epsilon_0$ and $|\mathrm{Diff}_2| \le C_{d_2}\epsilon_0$.

Treating a history-dependent policy as an algorithm, the lower bound in Simchowitz & Foster (2020, Corollary 1) implies that
$$\min_\pi\max_{\Theta\in\mathcal E}\big[V^\pi(\Theta) - V^\star(\Theta)\big] \ge \Omega(\sqrt H).$$

$\|w_t\|_2 \le \bar w$, $\|z_t\|_2 \le \bar z$ for any $0\le t\le H$, and $\|u_t\|_2 \le \bar u$ for $0\le t\le T_w$. Since $w_t$, $z_t$, $u_t$

$\hat x_{t|t,\hat\Theta_k}, u_t, \hat\Theta_k\big) - \hat x^\top$

$t_1, t_2, \ldots, t_k) \mid (t_1, t_2, \ldots, t_k)\in\mathcal T\big\}$, where $t_1, t_2, \ldots, t_k$ represent $k$ control parameters and $\mathcal T$ is the domain of the control parameters. The dynamics of the simulator $\Theta$ depend only on these control parameters $t_1, t_2, \ldots, t_k$.


Define the initial covariance estimate $\Sigma_{0|-1} \overset{\text{def}}{=} I$; then the Kalman filter gives
$$\Sigma_{h+1|h} = A\Sigma_{h|h-1}A^\top - A\Sigma_{h|h-1}C^\top\big(C\Sigma_{h|h-1}C^\top + I\big)^{-1}C\Sigma_{h|h-1}A^\top + I$$
and
$$L_h = \Sigma_{h|h-1}C^\top\big(C\Sigma_{h|h-1}C^\top + I\big)^{-1}.$$
The optimal control is linear in the belief state $\mathbb E[x_h\mid\mathcal H_h]$, with an optimal control matrix $K_h$. This control matrix in turn depends on a Riccati difference equation, as follows. Define $P_H \overset{\text{def}}{=} C^\top QC$; the Riccati difference equation for $P_h$ takes the form
$$P_h = A^\top P_{h+1}A - A^\top P_{h+1}B\big(R + B^\top P_{h+1}B\big)^{-1}B^\top P_{h+1}A + C^\top QC.$$
The optimal control matrix is
$$K_h = -\big(R + B^\top P_{h+1}B\big)^{-1}B^\top P_{h+1}A.$$
The optimal cost of the finite-horizon LQG problem, $V^\star(\Theta)$, is then given by the corresponding sum of trace terms over $h$. For the infinite-horizon view of the same LQG instance $\Theta$, we denote the stable solutions to the two DAREs in (57) and (58) by $\Sigma$ and $P$ (i.e., we use $\Sigma$, $P$ and $L$ to represent $\Sigma(\Theta)$, $P(\Theta)$ and $L(\Theta)$); the optimal cost $J^\star(\Theta)$ admits the analogous expression. Now let us check the difference between $(H+1)J^\star(\Theta)$ and $V^\star(\Theta)$, which can be bounded by comparing $P_h$, $\Sigma_{h|h-1}$ and $L_h$ with their infinite-horizon limits.

The positive definite matrix $Q$ can be decomposed as $Q = \Gamma^\top\Gamma$, where $\Gamma\in\mathbb R^{p\times p}$ is also positive definite. Therefore, for the matrix $C^\top QC = (\Gamma C)^\top\Gamma C$, we can show that $(A, \Gamma C)$ is also observable, owing to $(A, C)$ being observable and $\Gamma$ positive definite. Since $(A, \Gamma C)$ is observable and $(A, B)$ is controllable, we know that the DARE for $P$ has a unique positive semidefinite solution, and the Riccati difference sequence converges exponentially fast to that unique solution (see, e.g., Chan et al. (1984, Theorem 4.2), Caines & Mayne (1970, Theorem 2.2), and the references therein); that is, there exists a uniform constant $a_0$ controlling the geometric convergence of $P_h$ to $P$. Similarly, $\Sigma_{h|h-1}$ also converges to $\Sigma$ exponentially fast, in that there exists an (instance-dependent) constant $b_0$ (Chan et al., 1984; Lale et al., 2021), which further implies that $L_h$ also converges to $L$ exponentially fast with some constant $c_0$, since $\Sigma_{0|-1} = I$ is positive definite (the techniques of Lemma 3.1 of Lale et al.
(2020c) can be used here). To show the existence of $b_0$, observe that the Riccati difference sequence for $\Sigma_{h|h-1}$ is generated by the dual DARE, so the same exponential-convergence results apply.

We can then move on to write $f$ in linear-feature form. Define two series of feature mappings $\varphi^s_u, \varphi^s_y$ for $1\le s\le l$ on $\mathcal E$, with $\varphi^s_u: \mathcal E\to\mathbb R^{p\times m}$ and $\varphi^s_y: \mathcal E\to\mathbb R^{p\times p}$, as in Eqn. (137). Consider the terms in Eqn. (134) and (135): each term can be written as the inner product of a feature of $\Theta$ and a feature of $E$. For example, $\mathrm{trace}\big(\Phi(E)(C\Sigma C^\top + I)\big)$ is the inner product of $\Phi(E)$ (a feature of $E$) and $C\Sigma C^\top + I$ (a feature of $\Theta$, since it only depends on $\Theta$). As a more complicated example, note that $\varphi^s_u(\Theta)$ and $\varphi^s_y(\Theta)$ only depend on $\Theta$ (so they can be regarded as features of $\Theta$), whereas $u_{t-s}$, $y_{t-s+1}$, and $\tilde L^\top\zeta(E)\tilde L$ only depend on $E$. Consider any $1\le s_1, s_2\le l$ in the summation; picking one such term as an example, it can be rewritten using the identities $\mathrm{trace}(A^\top B) = \langle\mathrm{vec}(A), \mathrm{vec}(B)\rangle$ and $\mathrm{vec}(AXB) = (B^\top\otimes A)\mathrm{vec}(X)$. Here $\otimes$ denotes the Kronecker product between two matrices, $\mathrm{vec}(X)$ for $X\in\mathbb R^{a\times b}$ denotes the (column-major) vectorization of $X$ (i.e., $\mathrm{vec}(X) = [X_{11}\, X_{21}\,\ldots\,X_{a1}\,X_{12}\,\ldots\,X_{ab}]^\top\in\mathbb R^{ab}$), and $\langle\cdot,\cdot\rangle$ is the inner product. In this way, we decompose the term (140) as the inner product of a feature of $\Theta$ and a feature of $E$. We then observe that such a decomposition is valid for any $1\le s_1, s_2\le l$. Therefore, we can decompose each term in Eqn. (134) and (135) as the inner product of features of $\Theta$ and $E$. Gathering all these features into the aggregated feature mappings $\mathrm{fea}_1$ for $\Theta$ and $\mathrm{fea}_2$ for $E$, we have
$$f_\Theta(E) = \big\langle\mathrm{fea}_1(\Theta), \mathrm{fea}_2(E)\big\rangle. \quad (142)$$
It is not hard to observe that the dimension of the aggregated feature mappings is bounded by
$$\dim(\mathrm{fea}_1(\Theta)) = \dim(\mathrm{fea}_2(E)) = \tilde O\big(p^4 + p^3m + p^2m^2\big) \quad (143)$$
since $l = O(\log(Hn + Hp))$. Noting that $\|\mathrm{fea}_1(\Theta)\|_2$ and $\|\mathrm{fea}_2(E)\|_2$ are both upper bounded over the domains $E\in\mathcal X$ and $\Theta\in\mathcal E$, Proposition 2 of Osband & Van Roy (2014) shows that
$$\dim_E(\mathcal F, 1/H) = \tilde O\big(p^4 + p^3m + p^2m^2\big). \quad (144)$$
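The two vectorization identities used in this decomposition can be checked numerically; a short NumPy sketch with column-major `vec` matching the definition above (matrix sizes are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
a, b = 3, 4
A = rng.normal(size=(a, b))
X = rng.normal(size=(b, b))
Bm = rng.normal(size=(b, a))

# Column-major vectorization: vec(X) = [X_11 X_21 ... X_a1 X_12 ... X_ab]^T
vec = lambda M: M.reshape(-1, order="F")

# vec(A X B) = (B^T kron A) vec(X)
lhs = vec(A @ X @ Bm)
rhs = np.kron(Bm.T, A) @ vec(X)
assert np.allclose(lhs, rhs)

# trace(A^T B) = <vec(A), vec(B)> -- the inner-product form used for f_Theta(E)
M1, M2 = rng.normal(size=(a, a)), rng.normal(size=(a, a))
assert np.isclose(np.trace(M1.T @ M2), vec(M1) @ vec(M2))
```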

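The exponential convergence of the Riccati difference sequence to its DARE limit, discussed above, can also be observed numerically; a sketch using SciPy's DARE solver on an arbitrary stable example system (the matrices, horizon, and costs are illustrative, not from the paper):

```python
import numpy as np
from scipy.linalg import solve_discrete_are

A = np.array([[0.9, 0.1], [0.0, 0.8]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.eye(1)  # stand-ins for C^T Q C and the control cost

P_inf = solve_discrete_are(A, B, Q, R)  # unique stabilizing DARE solution

# Backward Riccati difference equation starting from P_H = Q
P = Q.copy()
errs = []
for _ in range(200):
    K = np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)  # optimal gain
    P = A.T @ P @ (A - B @ K) + Q                      # Riccati recursion
    errs.append(np.linalg.norm(P - P_inf, 2))

# ||P_h - P||_2 shrinks geometrically (the role of the constant a_0)
assert errs[0] > errs[-1] and errs[-1] < 1e-8
```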
