ON THE POWER OF PRE-TRAINING FOR GENERALIZATION IN RL: PROVABLE BENEFITS AND HARDNESS

Abstract

Generalization in Reinforcement Learning (RL) aims to train an agent over a set of training environments so that it performs well in an unseen target environment. In this work, we first point out that RL generalization is fundamentally different from generalization in supervised learning, and that fine-tuning on the target environment is necessary for good test performance. Therefore, we seek to answer the following question: how much can we expect pre-training over training environments to help with efficient and effective fine-tuning? On one hand, we give a surprising result showing that, asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, we show that pre-training can indeed be helpful in the non-asymptotic regime by designing a policy collection-elimination (PCE) algorithm and proving a distribution-dependent regret bound that is independent of the state-action space. We hope our theoretical results can provide insights into pre-training and generalization in RL.

1. INTRODUCTION

Reinforcement learning (RL) is concerned with sequential decision-making problems in which the agent interacts with the environment to maximize its cumulative reward. This framework has achieved tremendous success in various fields such as game playing (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019), resource management (Mao et al., 2016), recommendation systems (Shani et al., 2005; Zheng et al., 2018) and online advertising (Cai et al., 2017). However, many empirical applications of RL algorithms are restricted to the single-environment setting. That is, the RL policy is learned and evaluated in exactly the same environment. This learning paradigm can lead to overfitting in RL (Sutton, 1995; Farebrother et al., 2018), and may yield degenerate performance when the agent is deployed to an unseen (but similar) environment. The ability to generalize to test environments is important to the success of reinforcement learning algorithms, especially in real-world applications such as autonomous driving (Shalev-Shwartz et al., 2016; Sallab et al., 2017), robotics (Kober et al., 2013; Kormushev et al., 2013) and health care (Yu et al., 2021). In these real-world tasks, the environment can be dynamic, open-ended, and constantly changing. We hope the agent can learn meaningful skills in the training stage and remain robust to variation at test time. Furthermore, in applications such as robotics, where a simulator can efficiently and safely generate unlimited data, we can first train the agent in randomized simulator models and then transfer it to the real environment (Rusu et al., 2017; Peng et al., 2018; Andrychowicz et al., 2020). An RL algorithm with good generalization ability can greatly reduce the demand for real-world data and improve test-time performance.
Generalization in supervised learning has been widely studied for decades (Mitchell et al., 1986; Bousquet & Elisseeff, 2002; Kawaguchi et al., 2017). For a typical supervised learning task such as classification, given a hypothesis space H and a loss function ℓ, the agent aims to find a solution that is optimal on average. That is, we hope the solution is near-optimal, in expectation over the data distribution, compared with the optimal hypothesis h*, which is formally defined as h* = arg min_{h∈H} E[ℓ(h(X), Y)]. From this perspective, generalization in RL is fundamentally different. Once the agent is deployed in a test environment M sampled from distribution D, it is expected to achieve performance comparable to the optimal policy in M. In other words, we hope the learned policy performs near-optimally compared with the optimal value V*_M per instance, for the sampled test environment M.
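The gap between these two objectives can be made concrete with a toy example (not from the paper): a distribution over two one-step environments (2-armed bandits) in which no single fixed policy can be per-instance optimal, even though some fixed policy is optimal on average. The environment names and reward values below are illustrative assumptions.

```python
# Toy contrast between average-case optimality (supervised-learning style)
# and per-instance optimality (RL generalization). Two 2-armed bandit
# environments, drawn uniformly at random at test time.
envs = {
    "A": [1.0, 0.0],  # rewards of arm 0 and arm 1 in environment A
    "B": [0.0, 1.0],  # environment B reverses the arms
}

def avg_value(arm):
    """Expected reward of committing to one arm, over the uniform
    environment distribution -- the supervised-learning objective."""
    return sum(rewards[arm] for rewards in envs.values()) / len(envs)

# Both fixed arms achieve expected reward 0.5, so the best fixed
# (non-adaptive) policy gets value 0.5 on average.
best_fixed_arm = max([0, 1], key=avg_value)

# The RL-generalization benchmark: compete with the optimal policy of the
# *sampled* environment M, whose value V*_M is 1.0 in both instances.
for name, rewards in envs.items():
    v_star = max(rewards)                     # V*_M for this instance
    gap = v_star - rewards[best_fixed_arm]    # per-instance suboptimality
    print(name, v_star, gap)
```

The fixed policy suffers a per-instance gap of 1.0 in one of the two environments, even though it is optimal in the average sense. This is why, as argued above, identifying the sampled environment at test time (i.e., fine-tuning) is necessary for near-instance-optimal performance.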

