PROVABLE SIM-TO-REAL TRANSFER IN CONTINUOUS DOMAIN WITH PARTIAL OBSERVATIONS

Abstract

Sim-to-real transfer, which trains RL agents in simulated environments and then deploys them in the real world, has been widely used to overcome the limitations of gathering samples in the real world. Despite the empirical success of sim-to-real transfer, its theoretical foundation is much less understood. In this paper, we study sim-to-real transfer in a continuous domain with partial observations, where both the simulated and real-world environments are modeled by linear quadratic Gaussian (LQG) systems. We show that a popular robust adversarial training algorithm is capable of learning a policy from the simulated environment that is competitive with the optimal policy in the real-world environment. To achieve our results, we design a new algorithm for infinite-horizon average-cost LQGs and establish a regret bound that depends on the intrinsic complexity of the model class. Our algorithm crucially relies on a novel history clipping scheme, which might be of independent interest.
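For concreteness, a partially observed LQG system of the kind referenced above can be written in the following standard form (the notation here is the textbook convention and is assumed, not taken from this paper, whose exact symbols may differ):

```latex
% Standard infinite-horizon average-cost LQG with partial observations.
% State x_t, control u_t, observation y_t; process and observation noise
% w_t, v_t are i.i.d. Gaussian.
\begin{align*}
  x_{t+1} &= A x_t + B u_t + w_t, & w_t &\sim \mathcal{N}(0, W),\\
  y_t     &= C x_t + v_t,         & v_t &\sim \mathcal{N}(0, V),
\end{align*}
% The agent minimizes the long-run average quadratic cost, choosing u_t
% from the observation history (y_1, u_1, \ldots, y_t) alone:
\begin{align*}
  J(\pi) \;=\; \limsup_{T \to \infty} \frac{1}{T}\,
    \mathbb{E}\Big[\sum_{t=1}^{T} \big( x_t^\top Q x_t + u_t^\top R u_t \big)\Big].
\end{align*}
```

Under this model, the sim-to-real gap corresponds to a mismatch between the system matrices $(A, B, C, W, V)$ of the simulator and those of the real environment.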

1. INTRODUCTION

Deep reinforcement learning has achieved great empirical success in various real-world decision-making problems, such as Atari games (Mnih et al., 2015), Go (Silver et al., 2016; 2017), and robotics control (Kober et al., 2013). In addition to the power of large-scale deep neural networks, these successes also critically rely on the availability of a tremendous amount of training data. For these applications, we have access to efficient simulators that are capable of generating millions to billions of samples in a short time. However, in many other applications such as autonomous driving (Pan et al., 2017) and healthcare (Wang et al., 2018), interacting with the environment repeatedly to collect a large amount of data is costly, risky, or even impossible.

A promising approach to solving this problem of data scarcity is sim-to-real transfer (Kober et al., 2013; Sadeghi & Levine, 2016; Tan et al., 2018; Zhao et al., 2020), which uses simulated environments to generate simulated data. These simulated data are used to train RL agents, which are then deployed in the real world. The trained RL agents, however, may perform poorly in real-world environments owing to the mismatch between the simulated and real-world environments. This mismatch is commonly referred to as the sim-to-real gap. To close such a gap, researchers have proposed various methods, including (1) system identification (Kristinsson & Dumont, 1992), which builds a precise mathematical model of the real-world environment

