PROVABLE BENEFITS OF REPRESENTATIONAL TRANSFER IN REINFORCEMENT LEARNING

Abstract

We study the problem of representational transfer in RL, where an agent first pretrains on a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy in a target task. We propose a new notion of task relatedness between source and target tasks, and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to the source tasks, we can discover a representation using which subsequent linear RL techniques quickly converge to a near-optimal policy, with only online access to the target task. The sample complexity is close to that of knowing the ground truth features in the target task, and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds in the absence of generative access, and validate our findings with an empirical evaluation on rich observation MDPs that require deep exploration.

1. INTRODUCTION

Leveraging historical experience acquired while learning past skills to accelerate the learning of a new skill is a hallmark of intelligent behavior. In this paper, we study this question in the context of reinforcement learning (RL). Specifically, we consider a setting where the learner is exposed to multiple tasks, and ask the following question: Can we accelerate RL by sharing representations across multiple related tasks? There is a rich empirical literature studying multiple approaches to this question and various paradigms for instantiating it. For instance, in a multi-task learning scenario, the learner has simultaneous access to different tasks and tries to improve the sample complexity by sharing data across them (Caruana, 1997). Other works study a transfer learning setting, where the learner has access to multiple source tasks during a pre-training phase, followed by a target task (Pan and Yang, 2009). The goal is to learn features and/or a policy which can be quickly adapted to succeed in the target task. More generally, the paradigms of meta-learning (Finn et al., 2017), lifelong learning (Parisi et al., 2019) and curriculum learning (Bengio et al., 2009) also consider related questions.

On the theoretical side, questions of representation learning have received increased recent emphasis owing to their practical significance, both in supervised learning and RL settings. In RL, a limited form of transfer learning across multiple downstream reward functions is enabled by several recent reward-free representation learning approaches (Jin et al., 2020a; Zhang et al., 2020; Wang et al., 2020; Du et al., 2019; Misra et al., 2020; Agarwal et al., 2020; Modi et al., 2021).
Inspired by recent treatments of representation transfer in supervised learning (Maurer et al., 2016; Du et al., 2020) and imitation learning (Arora et al., 2020), some works also study more general task collections in bandits (Hu et al., 2021; Yang et al., 2020, 2022) and RL (Hu et al., 2021; Lu et al., 2021). Almost all of these works study settings where the representation is frozen after pre-training on the source tasks, and a linear policy or optimal value function approximation is trained in the target task using the learned features. This setting, which we call representational transfer, is the main focus of our paper.

A crucial question in formalizing representational transfer is the notion of similarity between source and target tasks. Prior works in supervised learning make the stringent assumption that the covariates x follow the same underlying distribution in all tasks, and only the conditional P(y|x) can vary across tasks (Du et al., 2020). This assumption does not generalize nicely to RL, where state distributions are typically policy dependent, and prior attempts to extend it to RL (Lu et al., 2021) result in strong assumptions on the learning setup. Other works (Hu et al., 2021; Yang et al., 2020, 2022) focus on linear representations only, which limits the expressivity of the feature maps and does not adequately represent the empirical literature in the field.

Our contributions. In this context, our work makes the following contributions:

• We propose a new linear span assumption of task relatedness for representational transfer, where the target task dynamics can be expressed as a (state-dependent) linear span of the source task dynamics, in addition to the dynamics being low-rank under a shared representation. We give examples captured by this assumption, which generalizes all prior settings for representational transfer in RL. We do not make any linearity assumptions on our feature maps.

• When we have generative access to the source tasks, we provide a novel algorithm REPTRANSFER that successfully pretrains a representation for downstream online learning in any target task (i.e., no generative access in the target task) satisfying the linear span assumption, provided the source tasks satisfy a common latent reachability assumption. The regret bound for learning in the target task is close to that of learning in a linear MDP equipped with the ground truth features, the strongest possible yardstick in our setup. The additional terms in our regret largely arise from the distributional mismatch between source and target tasks, which is expected. We complement the theory with an empirical validation of REPTRANSFER on the challenging rich observation combination lock benchmarks (Misra et al., 2020), confirming our theoretical findings.

• Without generative access to the source tasks, we show the statistical hardness of representational transfer under the linear span assumption, and confirm this hardness in our empirical evaluation. We show that an additional assumption, that every observed state is reachable in every source task, suffices to allow fully online learning in the source tasks. The new task relatedness assumption, our reward-free learning result for low-rank MDPs, and our analysis of LSVI-UCB under average-case misspecification may be of independent interest.
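To make the linear span assumption concrete, the following is a minimal numerical sketch in a tabular toy problem. This is our own illustrative construction, not the paper's formal statement: the sizes, kernels, and weight function below are hypothetical, and the low-rank aspect of the assumption is omitted for brevity. The point it demonstrates is that a state-action-dependent convex combination of source transition kernels is itself a valid transition kernel, so the target is a bona fide MDP whenever the mixture weights are non-negative and sum to one.

```python
# Toy illustration of the linear-span task-relatedness assumption:
# each source task k has a tabular kernel T_k(s'|s,a), and the target
# kernel is T_tgt(s'|s,a) = sum_k alpha_k(s,a) * T_k(s'|s,a).

S, A, K = 3, 2, 2  # states, actions, source tasks (hypothetical sizes)

# Source kernels: T[k][s][a] is a distribution over next states.
T = [
    [[[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]],
     [[0.3, 0.3, 0.4], [0.5, 0.25, 0.25]],
     [[0.2, 0.5, 0.3], [0.6, 0.2, 0.2]]],
    [[[0.1, 0.1, 0.8], [0.4, 0.4, 0.2]],
     [[0.6, 0.2, 0.2], [0.2, 0.6, 0.2]],
     [[0.3, 0.4, 0.3], [0.1, 0.2, 0.7]]],
]

def alpha(s, a):
    """State-action-dependent mixture weights (arbitrary example)."""
    w = [1.0 + s, 1.0 + a]
    z = sum(w)
    return [x / z for x in w]

def target_kernel(s, a):
    """Target dynamics as a linear span of the source dynamics."""
    w = alpha(s, a)
    return [sum(w[k] * T[k][s][a][sp] for k in range(K)) for sp in range(S)]

# Each target row is still a valid next-state distribution.
for s in range(S):
    for a in range(A):
        row = target_kernel(s, a)
        assert abs(sum(row) - 1.0) < 1e-9 and all(p >= 0 for p in row)
print("target kernel rows are valid distributions")
```

In the paper's setting the source kernels additionally share a low-rank factorization under a common representation, which is what makes discovering that representation from the source tasks useful for the target task.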

2. RELATED WORK

In this section, we survey related works that obtain concrete PAC or regret guarantees, and defer a discussion of the empirical literature to the appendix.

Multi-task and Transfer Learning in Supervised Learning. The theoretical benefits of representation learning are well studied under conditions such as the i.i.d. task assumption (Maurer et al., 2016) and the diversity assumption (Du et al., 2020; Tripuraneni et al., 2020). Many of the works below adopt these frameworks and assumptions for sequential decision making problems.

Multi-task and Transfer Learning in Bandits and small-size MDPs. Several recent works study multi-task linear bandits with linear representations (ϕ(s) = As with unknown A) (Hu et al., 2021; Yang et al., 2020, 2022). The techniques developed in these works crucially rely on the linear structure and cannot be applied to nonlinear function classes. Lazaric et al. (2013) study spectral techniques for online sequential transfer learning. Brunskill and Li (2013) study multi-task RL under a fixed distribution over finitely many MDPs, while Brunskill and Li (2014) consider transfer in semi-MDPs by learning options. Lecarpentier et al. (2021) consider lifelong learning in Lipschitz MDPs. All these works consider small tabular models, while we focus on large-scale MDPs.

Multi-task and Transfer Learning in RL via representation learning. Beyond tabular MDPs, Arora et al. (2020) and D'Eramo et al. (2019) show benefits of representation learning in imitation learning and planning, but do not address exploration. Lu et al. (2021) study transfer learning in low-rank MDPs with general nonlinear representations, but make a generative model assumption on both the source tasks and the target task, along with other distributional and structural assumptions. We do not require generative access to the target task and make much weaker structural assumptions on source-target relatedness. Recently and independently, Cheng et al. (2022) also studied transfer learning in low-rank MDPs in the online learning setting, identical to the setting we study in Section 5. However, their analysis relies on an additional assumption that bounds the point-wise TV error by the population TV error, which we show is in fact not necessary (details in Appendix C).

Efficient Representation Learning in RL. Even in the single-task setting, efficient representation learning is an active area witnessing recent advances, both with exploration (Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021; Zhang et al., 2022) and without (Ren et al., 2021). Other papers study feature selection (e.g., Farahmand and Szepesvári, 2011; Jiang et al., 2015; Pacchiano et al., 2020; Cutkosky et al., 2021; Lee et al., 2021; Zhang et al., 2021) or sparse models (Hao et al., 2021a,b).

