PROVABLE SIM-TO-REAL TRANSFER IN CONTINUOUS DOMAIN WITH PARTIAL OBSERVATIONS

Abstract

Sim-to-real transfer, which trains RL agents in simulated environments and then deploys them in the real world, has been widely used to overcome the limitations of gathering samples in the real world. Despite the empirical success of sim-to-real transfer, its theoretical foundation is much less understood. In this paper, we study sim-to-real transfer in continuous domains with partial observations, where the simulated and real-world environments are modeled as linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm is capable of learning a policy from the simulated environment that is competitive with the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which may be of independent interest.

1. INTRODUCTION

Deep reinforcement learning has achieved great empirical success in various real-world decision-making problems, such as Atari games (Mnih et al., 2015), Go (Silver et al., 2016; 2017), and robotics control (Kober et al., 2013). In addition to the power of large-scale deep neural networks, these successes also critically rely on the availability of a tremendous amount of training data. For these applications, we have access to efficient simulators capable of generating millions to billions of samples in a short time. However, in many other applications, such as autonomous driving (Pan et al., 2017) and healthcare (Wang et al., 2018), interacting with the environment repeatedly and collecting a large amount of data is costly, risky, or even impossible.

A promising approach to the problem of data scarcity is sim-to-real transfer (Kober et al., 2013; Sadeghi & Levine, 2016; Tan et al., 2018; Zhao et al., 2020), which uses simulated environments to generate simulated data. These simulated data are used to train RL agents, which are then deployed in the real world. The trained agents, however, may perform poorly in real-world environments owing to the mismatch between the simulated and real-world environments. This mismatch is commonly referred to as the sim-to-real gap. To close this gap, researchers have proposed various methods, including (1) system identification (Kristinsson & Dumont, 1992), which builds a precise mathematical model of the real-world environment; (2) domain randomization (Tobin et al., 2017), which randomizes the simulation and trains an agent that performs well across the randomized simulated environments; and (3) robust adversarial training (Pinto et al., 2017b), which finds a policy that performs well even in a bad or adversarial environment. Despite their empirical successes, these methods have very limited theoretical guarantees. A recent work by Chen et al.
(2021) studies domain randomization algorithms for sim-to-real transfer, but it has two limitations. First, their results (Chen et al., 2021, Theorem 4) rely heavily on domain randomization being able to sample simulated models that are very close to the real-world model with at least constant probability, which is hardly the case in applications with continuous domains. Second, their theoretical framework does not capture partial observations. Yet sim-to-real transfer in continuous domains with partial observations is very common. Take dexterous in-hand manipulation (OpenAI et al., 2018) as an example: the domain consists of the angles of different joints, which is continuous, and training on the simulator only provides partial information due to observation noise (for example, the image generated by the agent's camera can be affected by the surrounding environment). We therefore ask the following question:

Can we provide a rigorous theoretical analysis of the sim-to-real gap in continuous domains with partial observations?

This paper answers the above question affirmatively. We study the sim-to-real gap of the robust adversarial training algorithm and address the aforementioned limitations. To summarize, our contributions are three-fold:

• We use finite-horizon LQGs to model the simulated and real-world environments, and formalize the problem of sim-to-real transfer in continuous domains with partial observations. Under this framework, the learner is assumed to have access to a simulator class E, each member of which represents a simulator with certain control parameters. We analyze the sim-to-real gap (Eqn. (6)) of the robust adversarial training algorithm trained on the simulator class E. Our results show that the sim-to-real gap of the robust adversarial training algorithm is Õ(√(δ_E H)). Here H is the horizon length of the real task, and δ_E denotes an intrinsic complexity of the simulator class E.
This result shows that the sim-to-real gap is small for simple simulator classes and short tasks, while it grows for complicated classes and long tasks. By establishing a nearly matching lower bound, we further show that this sim-to-real gap enjoys a near-optimal rate in terms of H.

• To bound the sim-to-real gap of the robust adversarial training algorithm, we develop a new reduction scheme that reduces the problem of bounding the sim-to-real gap to designing a sample-efficient algorithm for infinite-horizon average-cost LQGs. Our reduction scheme for LQGs differs from the one in Chen et al. (2021) for MDPs because the value function (or more precisely, the optimal bias function) is not naturally bounded in an LQG instance, which requires a more sophisticated analysis.

• To prove our results, we propose a new algorithm, namely LQG-VTR, for infinite-horizon average-cost LQGs with convex cost functions. Theoretically, we establish a regret bound of Õ(√(δ_E T)) for LQG-VTR, where T is the number of steps. To the best of our knowledge, this is the first "instance-dependent" result that depends on the intrinsic complexity of an LQG model class with convex cost functions, whereas previous works only provide worst-case regret bounds depending on the ambient dimensions (i.e., the dimensions of states, controls, and observations). At the core of LQG-VTR is a history clipping scheme, which uses a clipped history instead of the full history to estimate the model and make predictions. This scheme helps us reduce the intrinsic complexity δ_E exponentially, from O(T) to O(poly(log T)) (cf. Appendix G).

Our theoretical results also have two implications. First, the robust adversarial training algorithm is provably efficient (cf. Theorem 1) in continuous domains with partial observations; one can turn to robust adversarial training when domain randomization performs poorly (cf. Appendix A.4).
Second, for stable LQG systems, only a short clipped history needs to be remembered to make accurate predictions in the infinite-horizon average-cost setting.
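For concreteness, the textbook formulation of the partially observed linear system underlying an LQG instance, together with the minimax objective that robust adversarial training targets, can be sketched as follows. The notation here follows the standard convention and may differ from the paper's; in particular, the quadratic cost below is the classical special case, while the paper allows general convex costs.

```latex
\begin{align*}
x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, W), \\
y_t &= C x_t + v_t, & v_t &\sim \mathcal{N}(0, V), \\
J_\theta(\pi) &= \limsup_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\Big[\textstyle\sum_{t=1}^{T} x_t^\top Q\, x_t + u_t^\top R\, u_t\Big], \\
\widehat{\pi} &\in \operatorname*{argmin}_{\pi}\; \max_{\theta \in \mathcal{E}} J_\theta(\pi).
\end{align*}
```

The last line is the robust adversarial training objective: the learned policy must perform well even against the worst-case simulator in the class E, which is what makes its real-world performance controllable when the real model lies in (or near) E.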
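The history clipping idea behind LQG-VTR can be illustrated with a small sketch: instead of conditioning on the entire trajectory, the learner keeps only the most recent h observations. The function name, the zero padding, and the choice h = 8 below are our own illustrative choices, not the paper's.

```python
import numpy as np

def clipped_history(observations, h):
    """Keep only the last h observations, zero-padded on the left.

    Downstream model estimation and prediction then depend on h rather
    than on the full horizon T. The point of the clipping scheme is that,
    for stable systems, a short clipped history (h = O(poly(log T)))
    already suffices for accurate predictions.
    """
    recent = list(observations)[-h:]
    pad = [np.zeros_like(recent[0])] * (h - len(recent))
    return np.stack(pad + recent)  # shape: (h, obs_dim)

# A long trajectory of scalar observations ...
T = 1000
trajectory = [np.array([float(t)]) for t in range(T)]
# ... collapses to a fixed-size feature vector regardless of T.
features = clipped_history(trajectory, h=8)
```

Because the feature size is fixed at h, the effective model class the learner must search over no longer grows with T, which is the mechanism behind the exponential reduction of δ_E described above.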



* Equal Contribution.

