ON THE POWER OF PRE-TRAINING FOR GENERALIZATION IN RL: PROVABLE BENEFITS AND HARDNESS

Abstract

Generalization in Reinforcement Learning (RL) aims to train an agent on training environments so that it generalizes to the target environment. In this work, we first point out that RL generalization is fundamentally different from generalization in supervised learning, and that fine-tuning on the target environment is necessary for good test performance. Therefore, we seek to answer the following question: how much can pre-training over training environments help with efficient and effective fine-tuning? On the one hand, we give a surprising result showing that, asymptotically, the improvement from pre-training is at most a constant factor. On the other hand, we show that pre-training can indeed be helpful in the non-asymptotic regime, by designing a policy collection-elimination (PCE) algorithm and proving a distribution-dependent regret bound that is independent of the size of the state-action space. We hope our theoretical results can provide insight into understanding pre-training and generalization in RL.

1. INTRODUCTION

Reinforcement learning (RL) is concerned with sequential decision-making problems in which the agent interacts with the environment, aiming to maximize its cumulative reward. This framework has achieved tremendous success in various fields such as game playing (Mnih et al., 2013; Silver et al., 2017; Vinyals et al., 2019), resource management (Mao et al., 2016), recommendation systems (Shani et al., 2005; Zheng et al., 2018) and online advertising (Cai et al., 2017). However, many empirical applications of RL algorithms are restricted to the single-environment setting. That is, the RL policy is learned and evaluated in exactly the same environment. This learning paradigm can lead to overfitting in RL (Sutton, 1995; Farebrother et al., 2018), and may yield degenerate performance when the agent is deployed in an unseen (but similar) environment. The ability to generalize to test environments is important to the success of reinforcement learning algorithms, especially in real-world applications such as autonomous driving (Shalev-Shwartz et al., 2016; Sallab et al., 2017), robotics (Kober et al., 2013; Kormushev et al., 2013) and health care (Yu et al., 2021). In these real-world tasks, the environment can be dynamic, open-ended and constantly changing. We hope the agent can learn meaningful skills in the training stage and be robust to variation during the test stage. Furthermore, in applications such as robotics, where a simulator can efficiently and safely generate unlimited data, we can first train the agent in randomized simulator models and then transfer it to the real environment (Rusu et al., 2017; Peng et al., 2018; Andrychowicz et al., 2020). An RL algorithm with good generalization ability can greatly reduce the demand for real-world data and improve test-time performance.
Generalization in supervised learning has been widely studied for decades (Mitchell et al., 1986; Bousquet & Elisseeff, 2002; Kawaguchi et al., 2017). For a typical supervised learning task such as classification, given a hypothesis space $\mathcal{H}$ and a loss function $\ell$, the agent aims to find a solution that is optimal on average. That is, we hope the solution is near-optimal, in expectation over the data distribution, compared with the optimal hypothesis $h^*$, which is formally defined as $h^* = \arg\min_{h \in \mathcal{H}} \mathbb{E}\left[\ell(h(X), Y)\right]$. From this perspective, generalization in RL is fundamentally different. Once the agent is deployed in a test environment $\mathcal{M}$ sampled from a distribution $\mathcal{D}$, it is expected to achieve performance comparable to the optimal policy in $\mathcal{M}$. In other words, we hope the learned policy performs near-optimally compared with the optimal value $V^*_{\mathcal{M}}$ per instance, for the sampled test environment $\mathcal{M}$. Unfortunately, as discussed in many previous works (Malik et al., 2021; Ghosh et al., 2021), the instance-optimal solution in the target environment can be statistically intractable without additional assumptions. We formulate this intractability as a lower bound (Proposition 1), showing that it is impractical to directly obtain a near-optimal policy for the test environment $\mathcal{M}^*$ with high probability. This motivates us to ask: in what settings is the generalization problem in RL tractable? In RL generalization, the agent is often allowed to further interact with the test environment to improve its policy. For example, many previous results in robotics have demonstrated that fine-tuning in the test environment can greatly improve test performance for sim-to-real transfer (Rusu et al., 2017; James et al., 2019; Rajeswaran et al., 2016). Therefore, one natural way to formulate generalization is to allow further interaction with the target environment during the test stage.
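The gap between the average-case objective and the instance-optimal objective can be seen in a minimal numerical illustration (a toy pair of one-step environments; all names here are ours, not the paper's):

```python
import numpy as np

# Two one-step environments ("bandits") with two actions each.
# Rows: environments M1, M2; columns: expected reward of actions a0, a1.
values = np.array([[1.0, 0.0],   # in M1, action 0 is optimal
                   [0.0, 1.0]])  # in M2, action 1 is optimal
p = np.array([0.5, 0.5])         # environment distribution D (uniform)

# Average-case (supervised-learning-style) objective: a single fixed
# action maximizing expected value over D.
expected_value = p @ values       # E_{M~D}[V^a_M] for each action a
best_fixed = expected_value.max()  # = 0.5

# Instance-optimal benchmark: the optimum of each sampled environment.
instance_opt = (p * values.max(axis=1)).sum()  # = 1.0

print(best_fixed, instance_opt)
```

No fixed policy can close the 0.5 gap here: matching the instance optimum requires identifying which environment was sampled, which is exactly what test-time interaction provides.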
Specifically, suppose the agent interacts with an MDP $\mathcal{M} \sim \mathcal{D}$ in the test stage, and we measure the performance of the fine-tuning algorithm $\mathcal{A}$ by the expected regret over $K$ episodes, i.e. $\mathrm{Reg}_K(\mathcal{D}, \mathcal{A}) = \mathbb{E}_{\mathcal{M} \sim \mathcal{D}}\left[\sum_{k=1}^{K} \left(V^{\pi^*(\mathcal{M})}_{\mathcal{M}} - V^{\pi_k}_{\mathcal{M}}\right)\right]$. In this setting, can the information obtained from pre-training¹ help reduce the regret suffered during the test stage? In addition, when test-time fine-tuning is not allowed, to what extent can we expect pre-training to be helpful? As discussed above, we can no longer demand instance-optimality in this setting, but can only step back and pursue a near-optimal policy in expectation. Specifically, our goal is to be near-optimal with respect to the policy with maximum value in expectation, i.e. $\pi^*(\mathcal{D}) = \arg\max_{\pi \in \Pi} \mathbb{E}_{\mathcal{M} \sim \mathcal{D}}\left[V^{\pi}_{\mathcal{M}}\right]$, where $V^{\pi}_{\mathcal{M}}$ is the value function of policy $\pi$ in MDP $\mathcal{M}$. We seek to answer: is it possible to design a sample-efficient training algorithm that returns an $\epsilon$-optimal policy $\pi$ in expectation, i.e. $\mathbb{E}_{\mathcal{M} \sim \mathcal{D}}\left[V^{\pi^*(\mathcal{D})}_{\mathcal{M}} - V^{\pi}_{\mathcal{M}}\right] \leq \epsilon$?

Main contributions. In this paper, we theoretically study RL generalization in the above two settings. Our contributions can be summarized as follows:

• When fine-tuning is allowed, we study the benefit of pre-training for test-time performance. Since all the information we can gain from training is no more than the distribution $\mathcal{D}$ itself, we start with a somewhat surprising theorem showing the limitation of this benefit: there exist hard cases where, even if the agent has exactly learned the environment distribution $\mathcal{D}$ in the training stage, it cannot improve the test-time regret by more than a universal constant factor in the asymptotic setting ($K \to \infty$). In other words, knowing the distribution $\mathcal{D}$ provides no additional information in terms of the asymptotic regret. Our theorem is proved using the Radon transform and Lebesgue integral analysis to establish a global information limit, which we believe are novel techniques for the RL community.
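The regret $\mathrm{Reg}_K(\mathcal{D}, \mathcal{A})$ can be estimated by Monte Carlo: sample test environments from $\mathcal{D}$, run the fine-tuning algorithm in each, and average the per-episode value gaps. Below is a minimal sketch with a toy distribution over two-armed Bernoulli bandits and plain UCB standing in for $\mathcal{A}$ (all choices here are illustrative, not the paper's setting):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_env():
    # Toy D: each of two arms has a mean reward drawn uniformly from [0, 1].
    return rng.uniform(0.0, 1.0, size=2)

def ucb_finetune(means, K):
    """Run UCB for K episodes in environment `means`; return the value
    V^{pi_k}_M of the arm played in each episode."""
    counts = np.ones(2)                       # one forced pull per arm
    sums = rng.binomial(1, means).astype(float)
    played = []
    for t in range(K):
        ucb = sums / counts + np.sqrt(2 * np.log(t + 2) / counts)
        a = int(np.argmax(ucb))
        sums[a] += rng.binomial(1, means[a])
        counts[a] += 1
        played.append(means[a])
    return np.array(played)

def expected_regret(K=500, n_envs=200):
    total = 0.0
    for _ in range(n_envs):
        means = sample_env()
        total += means.max() * K - ucb_finetune(means, K).sum()
    return total / n_envs  # Monte Carlo estimate of Reg_K(D, A)

reg = expected_regret()
print(reg)
```

Here the estimate is far below the trivial linear bound of $K$ because UCB's regret grows sublinearly in $K$; the paper's question is how much knowledge of $\mathcal{D}$ can shrink it further.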
• Inspired by this lower bound, we focus on the non-asymptotic setting and study whether, and by how much, we can reduce the regret in this case. We propose an efficient pre-training and test-time fine-tuning algorithm called PCE (Policy Collection-Elimination). By maintaining a minimal policy set that generalizes well, it achieves a regret upper bound of $\tilde{O}\left(\sqrt{C(\mathcal{D}) K}\right)$ in the test stage, where $C(\mathcal{D})$ is a complexity measure of the distribution $\mathcal{D}$. This bound removes the polynomial dependence on the cardinality of the state-action space by leveraging the information obtained from pre-training. We give a fine-grained analysis of the value of $C(\mathcal{D})$ and show that our bound can be significantly smaller than state-action-space-dependent bounds in many settings.

• When the agent cannot interact with the test environment, we propose an efficient algorithm called OMERM (Optimistic Model-based Empirical Risk Minimization) to find a near-optimal policy in expectation. This algorithm is guaranteed to return an $\epsilon$-optimal policy with $O\left(\log\left(N_{\Pi}(\epsilon/(12H))\right)/\epsilon^2\right)$ sampled MDP tasks in the training stage, where $N_{\Pi}(\epsilon/(12H))$ measures the complexity of the policy class. This rate matches the classical generalization rates in many supervised learning results (Mohri et al., 2018; Kawaguchi et al., 2017).
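The collection-elimination idea behind PCE can be conveyed with a schematic sketch: pre-training produces a small set of candidate policies covering $\mathcal{D}$, and at test time the agent plays the surviving candidates and eliminates any whose confidence interval falls below the best lower confidence bound. The code below is only a simplified illustration of that loop (with Hoeffding-style widths and arm-like policies we made up), not the paper's PCE algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)

def collection_elimination(policies, rollout, K, delta=0.05):
    """Schematic elimination loop over a pre-trained candidate policy set.

    `rollout(pi)` returns one noisy episode return of policy `pi` in the
    sampled test MDP. Candidates whose upper confidence bound drops below
    the best lower confidence bound are eliminated.
    """
    active = list(range(len(policies)))
    sums = np.zeros(len(policies))
    counts = np.zeros(len(policies))
    for _ in range(K):
        for i in active:                      # play each surviving candidate
            sums[i] += rollout(policies[i])
            counts[i] += 1
        means = sums[active] / counts[active]
        width = np.sqrt(np.log(2 * len(policies) * counts[active] / delta)
                        / counts[active])
        best_lcb = np.max(means - width)
        active = [i for i, m, w in zip(active, means, width)
                  if m + w >= best_lcb]
        if len(active) == 1:
            break
    return active  # surviving near-optimal candidates for this test MDP

# Toy usage: "policies" are indices with fixed expected returns.
true_vals = [0.2, 0.5, 0.9]
rollout = lambda pi: true_vals[pi] + rng.normal(0.0, 0.1)
survivors = collection_elimination([0, 1, 2], rollout, K=500)
print(survivors)
```

The key point the sketch mirrors is that test-time regret scales with the size of the candidate set (captured by $C(\mathcal{D})$ in the paper's bound), not with the size of the underlying state-action space.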



¹We call the training stage "pre-training" when interactions with the test environment are allowed.

