PROVABLE BENEFITS OF REPRESENTATIONAL TRANSFER IN REINFORCEMENT LEARNING

Abstract

We study the problem of representational transfer in RL, where an agent first pretrains on a number of source tasks to discover a shared representation, which is subsequently used to learn a good policy in a target task. We propose a new notion of task relatedness between source and target tasks, and develop a novel approach for representational transfer under this assumption. Concretely, we show that given generative access to the source tasks, we can discover a representation, using which subsequent linear RL techniques quickly converge to a near-optimal policy, with only online access to the target task. The sample complexity is close to knowing the ground truth features in the target task, and comparable to prior representation learning results in the source tasks. We complement our positive results with lower bounds without generative access, and validate our findings with an empirical evaluation on rich observation MDPs that require deep exploration.

Notation: We denote the total variation distance between P_1 and P_2 by ∥P_1 − P_2∥_TV. Given a vector a, we define ∥a∥_B = √(a^⊤ B a). c_0, c_1, … are universal constants, and we write a ≲ b to mean a ≤ Cb for some universal constant C. Also, [K] = {1, …, K}, and λ_min(A) denotes the smallest eigenvalue of a matrix A. Please see Table 1 for a full list of notations and Table 2 for a list of algorithms.

1. INTRODUCTION

Leveraging historical experience acquired while learning past skills to accelerate the learning of a new skill is a hallmark of intelligent behavior. In this paper, we study this question in the context of reinforcement learning (RL). Specifically, we consider a setting where the learner is exposed to multiple tasks, and ask the following question: Can we accelerate RL by sharing representations across multiple related tasks? There is a rich empirical literature that studies multiple approaches to this question and various paradigms for instantiating it. For instance, in a multi-task learning scenario, the learner has simultaneous access to different tasks and tries to improve the sample complexity by sharing data across them (Caruana, 1997). Other works study a transfer learning setting, where the learner has access to multiple source tasks during a pre-training phase, followed by a target task (Pan and Yang, 2009). The goal is to learn features and/or a policy which can be quickly adapted to succeed in the target task. More generally, the paradigms of meta-learning (Finn et al., 2017), lifelong learning (Parisi et al., 2019), and curriculum learning (Bengio et al., 2009) also consider related questions.

On the theoretical side, questions of representation learning have received increased recent emphasis owing to their practical significance, in both supervised learning and RL settings. In RL, a limited form of transfer learning across multiple downstream reward functions is enabled by several recent reward-free representation learning approaches (Jin et al., 2020a; Zhang et al., 2020; Wang et al., 2020; Du et al., 2019; Misra et al., 2020; Agarwal et al., 2020; Modi et al., 2021).
Inspired by recent treatments of representational transfer in supervised learning (Maurer et al., 2016; Du et al., 2020) and imitation learning (Arora et al., 2020), some works also study more general task collections in bandits (Hu et al., 2021; Yang et al., 2020, 2022) and RL (Hu et al., 2021; Lu et al., 2021). Almost all of these works study settings where the representation is frozen after pre-training in the source tasks, and a linear policy or optimal value function approximation is trained in the target task using the learned features. This setting, which we call representational transfer, is the main focus of our paper.

A crucial question in formalizing representational transfer is the notion of similarity between source and target tasks. Prior works in supervised learning make the stringent assumption that the covariates x follow the same underlying distribution in all tasks, and only the conditionals P(y|x) can vary across tasks (Du et al., 2020). This assumption does not generalize nicely to RL, where state distributions are typically policy dependent, and prior attempts to extend it to RL (Lu et al., 2021) result in strong assumptions on the learning setup. Other works (Hu et al., 2021; Yang et al., 2020, 2022) focus on linear representations only, which limits the expressivity of the feature maps and does not adequately represent the empirical literature in the field.

Our contributions. In this context, our work makes the following contributions:

• We propose a new linear span assumption of task relatedness for representational transfer, where the target task dynamics can be expressed as a (state-dependent) linear span of the source task dynamics, in addition to the dynamics being low-rank under a shared representation. We give examples captured by this assumption, and it generalizes all prior settings for representational transfer in RL. We do not make any linearity assumptions on our feature maps.

• When we have generative access to the source tasks, we provide a novel algorithm, REPTRANSFER, that successfully pretrains a representation for downstream online learning in any target task (i.e., no generative access to the target task) satisfying the linear span assumption, provided the source tasks satisfy a common latent reachability assumption. The regret bound for learning in the target task is close to that of learning in a linear MDP equipped with the ground truth features, the strongest possible yardstick in our setup. The additional terms in our regret largely arise from the distributional mismatch between source and target tasks, which is expected. We complement the theory with an empirical validation of REPTRANSFER on the challenging rich observation combination lock benchmarks (Misra et al., 2020), confirming our theoretical findings.

• Without generative access to the source tasks, we show the statistical hardness of representational transfer under the linear span assumption, and confirm this hardness in our empirical evaluation. We show that an additional assumption, that every observed state is reachable in every source task, suffices to allow fully online learning in the source tasks.

The new task relatedness assumption, the reward-free learning result for low-rank MDPs, and our analysis of LSVI-UCB under average-case misspecification may be of independent interest.

2. RELATED WORK

In this section, we focus on surveying related works that obtain concrete PAC or regret guarantees, and defer a discussion of the empirical literature to the appendix.

Multi-task and Transfer Learning in Supervised Learning. The theoretical benefits of representation learning are well studied under conditions such as the i.i.d. task assumption (Maurer et al., 2016) and the diversity assumption (Du et al., 2020; Tripuraneni et al., 2020). Many of the works below successfully adopt these frameworks and assumptions for sequential decision making problems.

Multi-task and Transfer Learning in Bandits and Small MDPs. Several recent works study multi-task linear bandits with linear representations (ϕ(s) = As with unknown A) (Hu et al., 2021; Yang et al., 2020, 2022). The techniques developed in these works crucially rely on the linear structure and cannot be applied to nonlinear function classes. Lazaric et al. (2013) study spectral techniques for online sequential transfer learning. Brunskill and Li (2013) study multi-task RL under a fixed distribution over finitely many MDPs, while Brunskill and Li (2014) consider transfer in semi-MDPs by learning options. Lecarpentier et al. (2021) consider lifelong learning in Lipschitz MDPs. All of these works consider small tabular models, while we focus on large-scale MDPs.

Multi-task and Transfer Learning in RL via Representation Learning. Beyond tabular MDPs, Arora et al. (2020) and D'Eramo et al. (2019) show benefits of representation learning in imitation learning and planning, but do not address exploration. Lu et al. (2021) study transfer learning in low-rank MDPs with general nonlinear representations, but make a generative model assumption on both the source tasks and the target task, along with other distributional and structural assumptions. We do not require generative access to the target task and make much weaker structural assumptions on the source-target relatedness.
Recently and independently, Cheng et al. (2022) also studied transfer learning in low-rank MDPs in the online learning setting, identical to the setting we study in Section 5. However, their analysis relies on an additional assumption that bounds the point-wise TV error by the population TV error, which we show is in fact not necessary (details in Appendix C).

Efficient Representation Learning in RL. Even in the single-task setting, efficient representation learning is an active area witnessing recent advances, with exploration (Agarwal et al., 2020; Modi et al., 2021; Uehara et al., 2021; Zhang et al., 2022) or without (Ren et al., 2021). Other papers study feature selection (e.g., Farahmand and Szepesvári, 2011; Jiang et al., 2015; Pacchiano et al., 2020; Cutkosky et al., 2021; Lee et al., 2021; Zhang et al., 2021) or sparse models (Hao et al., 2021a,b).

3. PRELIMINARIES

Low-rank MDPs: A transition operator P⋆_h admits a low-rank decomposition of dimension d if there exist maps ϕ⋆_h : S × A → R^d and µ⋆_h : S → R^d such that P⋆_h(s′|s, a) = ϕ⋆_h(s, a)^⊤ µ⋆_h(s′), where ∥ϕ⋆_h(s, a)∥_2 ≤ 1 for all (s, a), and for any function g : S → [0, 1], ∥∫ g(s) dµ⋆_h(s)∥_2 ≤ √d. An MDP is a low-rank MDP if P⋆_h admits such a low-rank decomposition for all h = 0, 1, …, H − 1. Low-rank MDPs capture the latent variable model (Agarwal et al., 2020), where ϕ⋆(s, a) is a distribution over a discrete latent state space Z, and the block-MDP model (Du et al., 2019), where ϕ⋆(s, a) is a one-hot encoding vector. Note that the linear MDP model (Yang and Wang, 2020; Jin et al., 2020b) assumes a known ϕ⋆, which significantly simplifies algorithm design.

Transfer Learning: In contrast to the classic single-task learning setting, in this paper we explore the setting of transfer learning, where learning consists of two phases: (1) the pre-training phase, where the agent interacts with K − 1 source tasks with dynamics P⋆_k, and (2) the deployment phase, where the agent is deployed into the K-th target task and no longer has access to the source tasks. Performance is measured mainly by the regret incurred in the target task upon deployment, while we also desire small sample complexity in the source tasks. We denote d^π_{k;h} := d^π_{P⋆_k;h}.
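As a quick numerical illustration of the low-rank structure above, the following minimal sketch builds a tabular transition kernel from a factorization P(s′|s, a) = ϕ(s, a)^⊤ µ(s′); all sizes are illustrative and not tied to the paper's setup.

```python
import numpy as np

# Hypothetical tabular sizes (illustration only).
S, A, d = 6, 3, 2
rng = np.random.default_rng(0)

# Build a low-rank kernel P(s'|s,a) = phi(s,a)^T mu(s').
# Rows of phi are distributions over d latent factors, and rows of mu are
# conditional next-state distributions, so every P(.|s,a) is a valid
# probability vector (this matches the latent variable model special case).
phi = rng.dirichlet(np.ones(d), size=S * A)   # (S*A, d), rows sum to 1
mu = rng.dirichlet(np.ones(S), size=d)        # (d, S), rows sum to 1
P = phi @ mu                                  # (S*A, S)

assert np.allclose(P.sum(axis=1), 1.0)        # valid transition kernel
assert np.linalg.matrix_rank(P) <= d          # rank bounded by the factor dim
```

The rank bound is the defining property exploited by linear RL methods once ϕ is known or learned.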
In order for the pre-training phase to help with learning in the target task, we must make assumptions on the connections between tasks. In this work, we make the following fundamental structural assumption on all the tasks at hand, namely that they share the same underlying representation ϕ⋆.

Assumption 3.1 (Common representation) We assume that all tasks P⋆_k are low-rank MDPs with a shared representation ϕ⋆_h(s, a) but distinct µ⋆_{k;h}(s′), that is, P⋆_{k;h}(s′|s, a) = ϕ⋆_h(s, a)^⊤ µ⋆_{k;h}(s′).

Assumption 3.2 (Realizability) For any source task k ∈ [K − 1] and any h ∈ [H], we assume that the agent has access to realizable function classes Φ_h and Υ_{k;h}, such that ϕ⋆_h ∈ Φ_h and µ⋆_{k;h} ∈ Υ_{k;h}. For normalization, we assume that for all k, h, all ϕ ∈ Φ_h satisfy ∥ϕ(s, a)∥_2 ≤ 1, and for all µ ∈ Υ_{k;h} and any function g : S → [0, 1], ∥∫ g(s) dµ(s)∥_2 ≤ √d.

Assumptions for representational transfer. In addition to the standard assumptions in the source tasks, we make the following structural and relatedness assumptions on the source and target tasks.

Assumption 3.3 (Feature reachability in the source tasks) We assume that ψ := min_{k∈[K−1], h∈[H]} max_π λ_min(E_{π,P⋆_k}[ϕ⋆_h(s_h, a_h) ϕ⋆_h(s_h, a_h)^⊤]) is strictly positive.

Assumption 3.3 intuitively requires that no subspace of R^d is unreachable in the source tasks, as otherwise it is impossible to guarantee the quality of the learned representation in that subspace, which may contain high rewards in the target task. Note that no reachability is required in the target task. The next assumption quantifies the relatedness of the target task and the source tasks.

Assumption 3.4 (Relatedness: point-wise linear span) For any h ∈ [H] and s′ ∈ S_{h+1}, there is a vector α_h(s′) ∈ R^{K−1} such that µ⋆_{K;h}(s′) = Σ_{k=1}^{K−1} α_{k;h}(s′) µ⋆_{k;h}(s′), with α_max = max_{h,k,s′∈S} |α_{k;h}(s′)| and ᾱ = max_h Σ_{k=1}^{K−1} max_{s′∈S} |α_{k;h}(s′)|.
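To make Assumption 3.4 concrete, here is a small numerical sketch of its simplest special case, a state-independent convex combination of the source factors (all sizes and coefficients are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
S, d, K = 5, 3, 4  # states, feature dim, tasks (K-1 = 3 sources); toy sizes

# Source task factors mu_k(s'), one (d, S) matrix per source task.
mus = [rng.normal(size=(d, S)) for _ in range(K - 1)]

# Convex-combination special case: the same coefficients for every s'.
alpha = np.array([0.5, 0.3, 0.2])
mu_target = sum(a * m for a, m in zip(alpha, mus))

# For these coefficients, alpha_max = max_k |alpha_k| <= 1 and
# alpha_bar = sum_k max_{s'} |alpha_k(s')| = 1, as claimed in the text.
alpha_max = np.abs(alpha).max()
alpha_bar = np.abs(alpha).sum()
assert alpha_max <= 1.0 and np.isclose(alpha_bar, 1.0)

# The span property holds pointwise in s':
# mu_K(s') = sum_k alpha_k(s') mu_k(s').
for sp in range(S):
    recon = sum(alpha[k] * mus[k][:, sp] for k in range(K - 1))
    assert np.allclose(mu_target[:, sp], recon)
```

In the general assumption the coefficients α_{k;h}(s′) may vary with s′, which is what allows, e.g., the Block MDP example below.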
Assumption 3.4 ensures that if s′ is reachable from an (s, a) pair in the target task, then it must be reachable from the same (s, a) pair in at least one of the source tasks. This is intuitively also necessary for transfer learning, as s′ could be a high-rewarding state in the target. A special case is the convex combination, i.e., for any s′ ∈ S and h, α_{k;h}(s′) = p_k with p_k ≥ 0 and Σ_k p_k = 1, which implies α_max ≤ 1 and ᾱ = 1. On the other hand, if the target task largely focuses on observations that are quite rare under the source tasks, then ᾱ can grow large, and this is unavoidable in a transfer learning setting.

We remark that unlike prior work (Du et al., 2020; Tripuraneni et al., 2020; Lu et al., 2021), we do not make any assumption on the data generating distribution in either the source or the target tasks. Instead, our approach is end-to-end, i.e., we collect our own data from scratch for representation learning by performing strategic exploration in the source tasks (see Section C for a detailed discussion). In fact, in Theorem D.1 we show that the above assumptions do not permit successful transfer in the supervised learning setting, thus establishing an interesting separation between supervised and reinforcement learning. We conclude this section with a couple of examples where our assumptions are satisfied.

Example 3.1 (Mixture of source tasks). Perhaps the simplest example of our assumptions is where P⋆_K(s′|s, a) = Σ_{k=1}^{K−1} α_k P⋆_k(s′|s, a), with the coefficients independent of s′, α_k ≥ 0, and Σ_{k=1}^{K−1} α_k = 1. Such mixtures of base models have been considered in several prior works (Modi et al., 2020; Ayoub et al., 2020). While prior works study the case of arbitrary but known base models, we instead allow structured, unknown base models with a shared representation. Here ᾱ = 1.

Example 3.2 (Block MDPs with shared latent dynamics).
In this example, each MDP P⋆_k is a Block MDP (Du et al., 2019) with a shared latent space Z and a shared decoder ψ⋆ : S → Z. In a Block MDP, given a state-action pair (s, a), the decoder ψ⋆ maps s to a latent state z, the next latent state is sampled from the latent transition z′ ∼ P(·|z, a), and the next state is generated from an emission distribution s′ ∼ o(·|z′). Recall that in a Block MDP, for any s′ ∈ S, o(s′|z′) > 0 for exactly one z′ ∈ Z. We assume that the latent transition model P(z′|z, a) is shared across all the tasks, but the emission process differs across the MDPs. For instance, in a typical navigation example used to motivate Block MDPs, the latent dynamics might correspond to navigating a shared 2-D map, while the emission distributions capture different wall colors or lighting conditions across multiple rooms. Then Assumption 3.3 requires that the agent can visit the entire 2-D map, while Assumption 3.4 requires that the color/lighting conditions of the target task resemble those of at least one source task. The coefficients α for any s′ are nonzero only on the source tasks that can generate that observation.
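A minimal generative sketch of Example 3.2 follows: a shared latent transition matrix with task-specific emissions. The Gaussian emitter is a stand-in for the "wall colors / lighting" analogy and is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)
Z, A, H = 3, 2, 4          # latent states, actions, horizon (illustrative)

# Shared latent dynamics P(z'|z,a), common to all tasks.
P_latent = rng.dirichlet(np.ones(Z), size=(Z, A))    # shape (Z, A, Z)

def make_emitter(obs_dim, seed):
    """Task-specific emission o(.|z): a Gaussian around a per-task,
    per-latent-state mean (a stand-in for differing wall colors or
    lighting across tasks)."""
    means = np.random.default_rng(seed).normal(size=(Z, obs_dim))
    def emit(z, rng):
        return means[z] + 0.1 * rng.normal(size=obs_dim)
    return emit

def rollout(emit, policy, rng):
    """Sample one trajectory of rich observations from a task."""
    z, obs = 0, []
    for h in range(H):
        a = policy(h)
        z = rng.choice(Z, p=P_latent[z, a])   # shared latent step
        obs.append(emit(z, rng))              # task-specific emission
    return obs

# Two tasks: the same latent map, but different "rooms".
task1, task2 = make_emitter(8, seed=10), make_emitter(8, seed=11)
traj = rollout(task1, policy=lambda h: h % A, rng=rng)
assert len(traj) == H and traj[0].shape == (8,)
```

Here two tasks share `P_latent` exactly while their observation spaces can be arbitrarily different, matching the example's structure.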

4. TRANSFER LEARNING WITH GENERATIVE ACCESS TO SOURCE TASKS

We first study a setting where we assume generative model access to the source tasks, while having only online access to the target task.

Assumption 4.1 (Generative access to the source tasks) We assume that we have access to generative models for the K − 1 source tasks. Specifically, for any P⋆_k with k ∈ [K − 1], we can query any (s_h, a_h) pair, and the generative model returns a next-state sample s_{h+1} ∼ P⋆_{k;h}(s_h, a_h).

Having generative model access is not unrealistic, especially in applications where a high-quality simulation environment is available. Generative access also does not trivialize the challenge of efficient exploration in the source tasks, since there is a potentially infinite number of states and the ground truth representation ϕ⋆ is unknown. Prior works using generative access typically require either a known representation ϕ⋆ (so that one can perform D-optimal design to construct an exploratory state-action distribution (Agarwal et al., 2019)), or directly assume access to a diverse state-action distribution which provides coverage and from which one can sample. Neither ϕ⋆ nor such a diverse sampling distribution is given in our case.

Algorithm 1 Transfer learning with generative access (REPTRANSFER)
PRE-TRAINING PHASE
Input: exploratory policies {π_k}_{k=1}^{K−1}, size of cross-sampled datasets n, failure probability δ.
1: for all task pairs i, j ∈ [K − 1] s.t. i ≠ j do ▷ cross-sampling procedure
2:   For each h ∈ [H − 1], sample a dataset D_{ij;h} containing n i.i.d. (s, a, s′) tuples sampled as: (s̄, ā) ∼ d^{π_i}_{i;h−1}, s ∼ P⋆_{j;h−1}(·|s̄, ā), a ∼ U(A), s′ ∼ P⋆_{i;h}(·|s, a).   (1)
3: ∀h ∈ [H − 1], learn ϕ̂_h = Multi-task REPLEARN({∪_{j∈[K−1]} D_{kj;h}}_{k∈[K−1]}). (Algorithm 3)
DEPLOYMENT PHASE
Additional Input: number of deployment episodes T.
1: Set β = H√d + ᾱdH log(dHT/δ).
2: Run LSVI-UCB({ϕ̂_h}_{h=0}^{H−1}, r = r_K, T, β) in the target task P⋆_K (Algorithm 6).
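The cross-sampling step of the pre-training phase, Eq. (1), can be sketched as a simple data-collection loop. The callables below (`d_pi_i`, `step_i`, `step_j`) are hypothetical interfaces to the roll-in distribution and the two generative simulators, not names from the paper's implementation.

```python
import random

def cross_sample(step_i, step_j, d_pi_i, n, h, num_actions, rng):
    """Collect one cross-sampled dataset D_ij;h as in Eq. (1).

    Assumed (illustrative) interfaces:
      d_pi_i(t)        -> (s, a) sampled from the occupancy d^{pi_i}_{i;t}
      step_j(s, a, t)  -> s' ~ P*_{j;t}(.|s, a)  (task j's generative model)
      step_i(s, a, t)  -> s' ~ P*_{i;t}(.|s, a)  (task i's generative model)
    """
    data = []
    for _ in range(n):
        s_bar, a_bar = d_pi_i(h - 1)       # roll in with pi_i in task i
        s = step_j(s_bar, a_bar, h - 1)    # one step in task j's simulator
        a = rng.randrange(num_actions)     # a ~ Uniform(A)
        s_next = step_i(s, a, h)           # s' ~ P*_{i;h}(.|s, a)
        data.append((s, a, s_next))
    return data

# Toy tabular usage with deterministic stand-in dynamics.
rng = random.Random(0)
D = cross_sample(step_i=lambda s, a, t: (s + a) % 5,
                 step_j=lambda s, a, t: (s + 2 * a) % 5,
                 d_pi_i=lambda t: (t % 5, 1),
                 n=100, h=3, num_actions=4, rng=rng)
assert len(D) == 100 and all(0 <= a < 4 for _, a, _ in D)
```

Resetting task j's simulator to a state reached in task i is exactly the operation that requires generative access.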
We also note that in Section 5 we show that, without any additional assumptions, generative access to the source tasks is necessary.

Algorithm overview. Given the above setup, we now present our algorithm REPTRANSFER, detailed in Algorithm 1. REPTRANSFER takes a representation learning approach to the transfer learning problem and operates in two phases. During the pre-training phase, REPTRANSFER performs reward-free exploration in each of the source tasks to learn task-specific policies π_k which satisfy, for all h = 0, 1, …, H − 1, E_{π_k, P⋆_k}[ϕ⋆_h(s_h, a_h) ϕ⋆_h(s_h, a_h)^⊤] ⪰ λ_min I for some λ_min > 0. We call such policies λ_min-exploratory and use this definition in Lemma 4.1 and Theorem 4.1. Later in this section we present one particular algorithm that finds such π_k's with a specific λ_min, but our main analysis is modular to the choice of reward-free exploration algorithm. Given the exploratory policies {π_k}_{k∈[K−1]}, REPTRANSFER collects a joint dataset across the source tasks using these policies and cross-sampling across pairs of tasks (Eq. 1), and then learns a single representation ϕ̂. In the deployment phase, REPTRANSFER runs optimistic least squares value iteration (Algorithm 6) using the learned representation ϕ̂ in the target task.¹

The cross-sampling procedure and learning a shared representation. We now describe the cross-sampling procedure in detail. Given an exploratory policy π_k for each source task k, the next step is to sample fresh data under a cross-sampling procedure using the generative model. Consider a particular pair (k, j) of source environments. For each h ∈ [H], we first sample (s_{h−1}, a_{h−1}) ∼ d^{π_k}_{k;h−1}. Then, in the simulator of task j, we reset to (s_{h−1}, a_{h−1}) and perform one transition step to s_h, i.e., s_h ∼ P⋆_{j;h−1}(s_{h−1}, a_{h−1}).
Then we reset the simulator of task k to state s_h, sample a uniformly random action a_h, and perform another transition to s_{h+1}, i.e., s_{h+1} ∼ P⋆_{k;h}(s_h, a_h). Such a procedure is only possible under the generative model setting. Intuitively, cross-sampling ensures that our training data contains all possible states that can be encountered in the target task; failure modes without it can be found in the discussion following Theorem 5.1.

Given datasets, each of size n, collected using the cross-sampling procedure, we perform a maximum likelihood estimation (MLE) based representation learning procedure (Algorithm 3) jointly on all source tasks, with the goal of finding a representation ϕ̂_h that predicts the transition probabilities well across all source tasks. An existing MLE generalization analysis (Agarwal et al., 2020) guarantees that Multi-task REPLEARN (Algorithm 3) achieves the following total variation guarantee with probability at least 1 − δ:

Σ_{k=1}^{K−1} E_{ν_{k;h}} ∥ϕ̂_h(s, a)^⊤ µ̂_{k;h}(·) − ϕ⋆_h(s, a)^⊤ µ⋆_{k;h}(·)∥²_TV ≤ ζ_N := O((log(|Φ|/δ) + K log(|Υ|)) / N),   (2)

where ν_{k;h} denotes the generating distribution of the input datasets D_{k;h}, each of size N. For example, in REPTRANSFER we have D_{k;h} = ∪_{j=1}^{K−1} D_{kj;h} and N = n(K − 1).

¹ Given any dataset {(s, a, r, s′)}, feature ϕ, and reward r, LSVI learns a Q function backward, i.e., at step h via ŵ_h = argmin_w Σ_{(s,a,s′)} (w^⊤ ϕ(s, a) − V̂_{h+1}(s′))² + λ∥w∥², and sets V̂_h(s) = max_a (r(s, a) + ŵ_h^⊤ ϕ(s, a)) for all s. UCB, short for Upper Confidence Bound, refers to an exploration bonus added to basic LSVI.

Algorithm 2 Reward-free exploration in a source task (REWARDFREE)
2: Learn a model P̂ = {P̂_h = (ϕ̂_h, µ̂_h)}_{h=0}^{H−1} by running REWARDFREE REP-UCB (Algorithm 4) in P⋆ for N_REWARDFREE episodes.
3: Set β = dH log(dH N_LSVI-UCB /δ).
4: Return ρ̂ = LSVI-UCB({ϕ̂_h}_{h=0}^{H−1}, r = 0, N_LSVI-UCB, β, UNIFORMACTIONS = TRUE) by simulating in the learned model P̂ (Algorithm 6). Note this step requires no samples from P⋆.

Representational transfer.
Next we transfer this guarantee to the target task via Assumption 3.4, a key insight of our work. Specifically, we show that under cross-sampling and Assumption 3.4, ϕ̂ also linearly approximates the true transitions in the target task well, under the occupancy measure of any policy. Remarkably, this holds before the agent has ever interacted with the target task.

Lemma 4.1 Suppose Assumption 3.4 holds and that for all source tasks k ∈ [K − 1] we have a λ_min-exploratory policy π_k. Then, for any δ ∈ (0, 1), learning features ϕ̂ using the cross-sampling procedure of Algorithm 1 satisfies (2) w.p. 1 − δ. Furthermore, for any h = 0, 1, …, H − 1, there exists µ̂_h : S → R^d such that for any function g : S → [0, 1], ∥∫ g(s) dµ̂_h(s)∥_2 ≤ ᾱ√d, and

sup_π E_{π,P⋆_K} ∥ϕ̂_h(s_h, a_h)^⊤ µ̂_h(·) − ϕ⋆_h(s_h, a_h)^⊤ µ⋆_{K;h}(·)∥_TV ≤ ε_TV := √(|A| α³_max K ζ_n / λ_min).

Lemma 4.1 implies that ϕ̂ is a feature under which P⋆_K is an approximately linear MDP. Learning in approximately linear MDPs has been studied in Jin et al. (2020b), but under a much stronger ℓ_∞ error bound rather than the average misspecification here. Our result for downstream task learning under such average model misspecification presents the strongest result for learning in an approximately linear MDP, and may be of independent interest.

Regret bound for REPTRANSFER. Putting things together, we obtain the following regret bound for online learning in the target task using our learned representation ϕ̂.

Theorem 4.1 (Regret under generative source access) Suppose Assumptions 3.1-3.4 and 4.1 hold, and suppose the input policies π_k are λ_min-exploratory. Then, for any δ ∈ (0, 1), w.p. 1 − δ, REPTRANSFER when deployed in the target task has regret at most O(ᾱH²d^{1.5}√(T log(1/δ))), with at most Kn generative accesses per source task, where n = O(λ_min^{-1} A α³_max K T (log(|Φ|/δ) + K log |Υ|)).
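For concreteness, the backward LSVI-with-bonus recipe described in the footnote (ridge-regress w_h onto V̂_{h+1}, then act greedily on reward plus bonus) can be sketched as follows. The interface and toy data are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def lsvi_ucb_backward(data, phi, reward, actions, H, d, lam=1.0, beta=0.1):
    """Backward LSVI with a UCB bonus: at step h, solve the ridge
    regression w_h = argmin_w sum (w^T phi(s,a) - V_{h+1}(s'))^2 + lam*|w|^2,
    then set V_h(s) = max_a [r(s,a) + w_h^T phi(s,a) + beta*||phi(s,a)||_{Sigma^-1}].
    `data[h]` is a list of (s, a, s') tuples; `phi(s, a)` maps to R^d.
    """
    V_next = lambda s: 0.0
    for h in reversed(range(H)):
        Phi = np.array([phi(s, a) for s, a, _ in data[h]])   # (n, d)
        y = np.array([V_next(sp) for _, _, sp in data[h]])   # regression targets
        Sigma = Phi.T @ Phi + lam * np.eye(d)
        w = np.linalg.solve(Sigma, Phi.T @ y)                # ridge solution
        Sigma_inv = np.linalg.inv(Sigma)
        def V_h(s, w=w, Sinv=Sigma_inv):
            q = [reward(s, a) + w @ phi(s, a)
                 + beta * np.sqrt(phi(s, a) @ Sinv @ phi(s, a))  # UCB bonus
                 for a in actions]
            return float(min(max(q), H))                     # values clipped at H
        V_next = V_h
    return V_next                                            # estimated V_0

# Toy usage: one state (coded 0), two actions, features one-hot in the action.
phi = lambda s, a: np.eye(2)[a]
data = {h: [(0, a, 0) for a in (0, 1)] * 10 for h in range(2)}
V0 = lsvi_ucb_backward(data, phi, reward=lambda s, a: float(a),
                       actions=(0, 1), H=2, d=2)
assert 0.0 < V0(0) <= 2.0
```

In the deployment phase the same recipe is run with the learned features ϕ̂ and the bonus scale β from Algorithm 1.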
Remarkably, Theorem 4.1 shows that with the pre-trained features, we achieve the same regret bound in the target task as in the linear MDP setting with a known ϕ⋆ (Jin et al., 2020b), up to the additional ᾱ factor. The scaling factor ᾱ depends only on α itself and captures the hardness of transfer learning. For special cases such as the convex combination, i.e., α is state-independent and α_h ∈ Δ(K), we have ᾱ = 1. In the worst case, some dependence on the scale of α seems unavoidable, as we can have a state s′ such that µ_K(s′) = 1 and µ_1(s′) ≪ 1 with α_1(s′) ≫ 1. This corresponds to a state rarely observed in the source task but encountered often in the target, and our estimates of transitions involving this state can be highly unreliable if it is not seen in any other source, roughly scaling the error between target and source tasks as |α_1(s′)|. Obtaining formal lower bounds that capture a matching dependence on structural properties of α is an interesting question for future research.

Reward-free exploration in the source environments. So far we have assumed a reward-free exploration black box that can provide an exploratory policy π_k for each source task k. We now give a detailed algorithm that achieves this goal. Recall that our transfer learning algorithm relies, for each source task P⋆_k, on an exploratory policy π_k whose covariance with respect to ϕ⋆_h is lower bounded; this ensures good exploration in the underlying ground truth feature space of P⋆_k. Algorithm 2 achieves this goal by first invoking the REWARDFREE REP-UCB algorithm (Algorithm 4) to learn an estimated linear MDP model P̂_k of P⋆_k. REWARDFREE REP-UCB is a reward-free generalization of the recent REP-UCB algorithm (Uehara et al., 2021) that performs reward-free exploration in low-rank MDPs, and may be of independent interest.
Subsequently, we perform reward-free planning (e.g., LSVI-UCB with zero reward) within each P̂_k, which involves no further environment interactions.

Lemma 4.2 (Reward-free exploration) Fix any source task k ∈ [K − 1], and suppose Assumptions 3.2 and 3.3 hold. Then, for any δ ∈ (0, 1), w.p. 1 − δ, REWARDFREE (Algorithm 2) with N_LSVI-UCB = Θ(A³d⁶H⁸ψ^{−2}) and N_REWARDFREE = O(A³d⁴H⁶ log(|Φ||Υ|/δ) N²_LSVI-UCB) returns a λ_min-exploratory policy π_k with λ_min = Ω(A^{−3}d^{−5}H^{−7}ψ²). The sample complexity here is N_REWARDFREE episodes in the source task.

To the best of our knowledge, Lemma 4.2 is the first result that finds a full-rank policy cover in the low-rank MDP setting, and may be of independent interest. Wagenmaker et al. (2022)
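The λ_min-exploratory property certified by Lemma 4.2 can be checked empirically with a Monte-Carlo estimate of the feature covariance under rollouts of π. The sketch below assumes a hypothetical callable `sample_phi` that rolls out π in the source task and returns ϕ⋆(s_h, a_h); the toy usage replaces it with uniform draws over a basis.

```python
import numpy as np

def exploratory_lambda_min(sample_phi, n=2000, seed=0):
    """Estimate lambda_min(E_pi[phi(s_h,a_h) phi(s_h,a_h)^T]) from n
    Monte-Carlo rollouts. `sample_phi(rng)` is an assumed interface that
    performs one rollout and returns the feature vector at step h."""
    rng = np.random.default_rng(seed)
    d = sample_phi(rng).shape[0]
    cov = np.zeros((d, d))
    for _ in range(n):
        v = sample_phi(rng)
        cov += np.outer(v, v)           # accumulate empirical covariance
    cov /= n
    return float(np.linalg.eigvalsh(cov)[0])  # smallest eigenvalue

# Toy usage: features uniform over the standard basis of R^3, so the true
# covariance is I/3 and the smallest eigenvalue should be near 1/3.
lam = exploratory_lambda_min(lambda rng: np.eye(3)[rng.integers(3)])
assert 0.2 < lam < 0.45
```

A policy for which this estimate is bounded away from zero covers all directions of the (here known) feature space, which is precisely what the cross-sampling step needs.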

5. TRANSFER LEARNING WITH ONLINE ACCESS TO SOURCE TASKS

In the previous section, we showed that efficient transfer learning is possible under very weak structural assumptions, given generative access to the source tasks. A natural question is whether transfer learning is possible with only online access to the source tasks. Somewhat surprisingly, we show that this is impossible without significantly stronger assumptions.

Theorem 5.1 (Impossibility result) Let M_K be a K-task multi-set that satisfies: (1) all tasks are Block MDPs; (2) all tasks satisfy Assumptions 3.3 and 3.4; (3) the latent dynamics are exactly the same for all source and target tasks. For any pre-training algorithm A which outputs a feature φ̂ by interacting with the source tasks k ∈ [K − 1], there exists {P⋆_k}_{k∈[K]} ∈ M_K such that, with probability at least 1/2, A outputs a feature φ̂ such that for any policy taking the functional form π(s) = f({φ̂(s, a)}_{a∈A}, {r(s, a)}_{a∈A}), we have V⋆_K − V^π_K ≥ 1/2.

The theorem implies that a representation learned only from online access to the source tasks does not enable learning in downstream tasks if the downstream algorithm is restricted to using the representation as its only information about state-action pairs (e.g., running LSVI-UCB with φ̂). We briefly explain the intuition behind this lower bound. In a Block MDP, for any (s, a), we can model the ground-truth ϕ⋆ as a one-hot encoding e_{(z,a)} corresponding to the latent state-action pair (z, a), with z = ψ⋆(s) the encoded latent state. The key observation is that any permutation of ϕ⋆ is also a perfect feature for characterizing the Block MDP, since it corresponds to simply permuting the indices of the latent states. Therefore, without cross-referencing, the agent could learn different permutations in different source tasks, which would then conflict in the target task. A precise constructive proof of Theorem 5.1 can be found in Section D.
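The permutation ambiguity behind this construction can be illustrated numerically: relabeling latent states leaves each individual task's model intact, but two tasks that settle on different labelings disagree on every decoded state. The matrix and permutation below are purely illustrative.

```python
import numpy as np

# Three latent states; a decoder assigns each observation a latent label.
# Any relabeling of the latent states explains a single task equally well,
# which is the ambiguity exploited by Theorem 5.1.
Z = 3
perm = np.array([2, 0, 1])                 # a relabeling of latent states
P_latent = np.array([[0.7, 0.2, 0.1],      # shared latent dynamics P(z'|z)
                     [0.1, 0.8, 0.1],
                     [0.2, 0.2, 0.6]])

# Permuting rows and columns consistently yields an equally valid latent
# model for a task viewed in isolation (still a transition matrix):
P_permuted = P_latent[np.ix_(perm, perm)]
assert np.allclose(P_permuted.sum(axis=1), 1.0)

# But if source task 1 learns the identity labeling while source task 2
# learns `perm`, the two decoders disagree on which latent state any
# given observation belongs to, and the labelings clash in the target:
labels_task1 = np.arange(Z)
labels_task2 = perm[labels_task1]
assert not np.all(labels_task1 == labels_task2)   # inconsistent labelings
```

With generative access, cross-sampling pins down a single consistent labeling across tasks, which is exactly what online-only access cannot guarantee.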
Part of the reason the above example fails is that each source task has its own observed subset of raw states, which permits such permutations. In what follows, we show that, under an additional assumption on the reachability of raw states, a slight variant of the same algorithm (Algorithm 5 in Section G) achieves the same regret with only online access to the source tasks.

Assumption 5.1 (Reachability in the raw states) For all source tasks k ∈ [K − 1], any policy π, and h = 0, 1, …, H − 1, we have inf_{s∈S, a∈A} d^π_{k;h}(s, a) ≥ ψ_raw λ_min(E_{π,P⋆_k}[ϕ⋆_h(s_h, a_h) ϕ⋆_h(s_h, a_h)^⊤]).

Assumption 5.1 implies that for each source task, any policy that achieves a full-rank covariance matrix also achieves global coverage over the raw state-action space. In addition, in order to apply importance sampling (IS) to transfer the TV error from the source tasks to the target task, we need to assume that the target task distribution has bounded density. This is true, for example, when S is discrete.

Assumption 5.2 (Bounded density) For all (π, h, s, a), we have d^π_{K;h}(s, a) ≤ 1.

Theorem 5.2 (Regret with online access) Suppose Assumptions 3.1-3.4, 5.1, and 5.2 hold. W.p. 1 − δ, Algorithm 5 with appropriate parameters achieves regret in the target task of O(ᾱd^{1.5}H²√(T log(1/δ))), with at most poly(A, α_max, d, H, K, T, ψ^{−1}, ψ^{−1}_raw, log(|Φ||Υ|/δ)) online queries in the source tasks.

Assumption 5.1 is satisfied in a Block MDP when, for example, the emission function o(s|z) satisfies ∀s, ∃z s.t. o(s|z) ≥ c. That is, for any source task, any state in the state space can be generated by at least one latent state.

Figure 1: (a) A visualization of the rich observation comblock environment. Latent states (white and black) emit continuous high-dimensional observations. The reward is sparse (only in white states at step H). Each white state has 10 actions, one of which is a good action leading to the next two white states, while the other 9 lead to black states (the good action differs across white states). Once the agent is in a black state, it stays stuck in black until the end of the episode. Thus, a random exploration strategy has an exponentially small probability of hitting the goal (roughly (1/10)^H). (b) Top and bottom: number of episodes required to solve the target environment under the two settings from Sec. 6. An algorithm solves the target task if it achieves the optimal return (i.e., 1) for 5 consecutive iterations with 50 evaluation runs each. We report the mean and standard deviation (in brackets) over 5 random seeds; ∞ denotes that an algorithm cannot solve the target task within a fixed sample budget.

6. EXPERIMENTS

We empirically investigate the benefit of transfer learning in the Block MDP setting, on the challenging Rich Observation Combination Lock (comblock) benchmark. In this environment, one must recover the correct feature from rich observations and perform strategic exploration, or otherwise pay an exponential sample complexity. We include a visual overview in Fig. 1(a). This environment is uniquely challenging since it requires strategic exploration and latent state discovery at the same time, which defeats common deep RL methods (Misra et al., 2020) and theoretical RL approaches based on linear function approximation (and kernels) (Zhang et al., 2022). In particular, in this section we study the following questions: (i) what are the benefits of representational transfer using multiple source tasks, and (ii) whether generative access to the source tasks is needed. We design two sets of experiments with various source and target environment configurations. We defer the details of the experiments (the design of the vanilla comblock, hyperparameters, etc.) to Appendix J and include the essential design information in the following two sections.

Baselines. We denote by Source the smallest sample complexity of LSVI-UCB using learned features from any one of the source tasks; by O-REPTRANSFER, REPTRANSFER with only online access to the source tasks; by G-REPTRANSFER, REPTRANSFER with generative access to the source tasks; by Oracle, learning in the target task with ground truth features; and by Target, running BRIEE (Zhang et al., 2022), the SOTA Block MDP algorithm, in the target task with no pretraining.
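To ground the discussion, the latent dynamics of the comblock environment in Fig. 1(a) can be sketched minimally as follows. This is the latent layer only: the real benchmark additionally emits rich high-dimensional observations from each latent state and has two good states per level, which we collapse to one here for brevity.

```python
import random

class CombLock:
    """Minimal combination-lock sketch (latent dynamics only).

    One unknown good action per level; any other action drops the agent
    into an absorbing bad ("black") state, and reward is granted only if
    the agent stays on the good ("white") chain for all H steps.
    """
    def __init__(self, horizon, num_actions=10, seed=0):
        rng = random.Random(seed)
        self.H, self.A = horizon, num_actions
        self.good = [rng.randrange(num_actions) for _ in range(horizon)]

    def episode(self, policy):
        alive = True                      # alive = still in a white state
        for h in range(self.H):
            if alive and policy(h) != self.good[h]:
                alive = False             # black states are absorbing
        return 1.0 if alive else 0.0      # sparse reward at the end

env = CombLock(horizon=25)
assert env.episode(lambda h: env.good[h]) == 1.0     # optimal policy succeeds
# A uniformly random policy succeeds w.p. (1/10)^25: essentially never.
rng = random.Random(1)
assert env.episode(lambda h: rng.randrange(10)) == 0.0
```

The (1/10)^H success probability of random exploration is what makes strategic exploration, and hence a good representation, essential here.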

Comblock without partitioned observations

In this section we start with an easier setting: we use 5 source tasks, each with horizon H = 25, 3 latent states per level and 10 actions. The latent transition dynamics differ across source tasks, but note that for comblock the reachability assumption (cf. Assumption 3.3) is always satisfied. The emission distribution of all the source and target tasks is identical, so Assumption 5.1 holds here. To construct the target task, for each timestep h we choose the latent transition dynamics from one of the sources uniformly at random (thus Assumption 3.4 is satisfied). We record the number of episodes each method takes to solve the target environment in Table 1(b). We first observe that REPTRANSFER with either online or generative access can solve the target task (since Assumption 5.1 holds). Second, we observe that directly applying the learned feature from any single source task does not suffice to solve the target environment. This is because the representation learned from a single source task may collapse two latent states into one during encoding (e.g., if two latent states at the same time step have exactly identical latent transitions). See the visualization comparing the decoders learned from a single source task and by REPTRANSFER in Fig. 2(a). Third, the results show that REPTRANSFER saves an order of magnitude in target samples compared to training in the target environment from scratch using the SOTA Block MDP algorithm BRIEE. This set of results verifies the empirical benefits of representation learning from multiple tasks: it resolves ambiguity and speeds up downstream task learning.
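To make concrete why comblock requires deep exploration, here is a minimal latent-dynamics sketch in the spirit of the benchmark. This is our own simplification, not the actual benchmark code: we keep only a latent lock with 2 "live" states and 1 absorbing "dead" state per level, and omit the rich-observation emission entirely; all function names are hypothetical.

```python
import numpy as np

def make_comblock(H=25, A=10, seed=0):
    """Latent comblock sketch: each level has 2 'live' states and 1 absorbing
    'dead' state. Exactly one (random) action per live state leads to a live
    state at the next level; every other action falls into the dead state."""
    rng = np.random.default_rng(seed)
    return rng.integers(A, size=(H, 2))   # correct action for each live state

def rollout(good, A, policy, rng):
    H = good.shape[0]
    z = int(rng.integers(2))              # start in a random live state
    for h in range(H):
        a = policy(h, z)
        if z == 2 or a != good[h, z]:     # wrong action -> absorbing dead state
            z = 2
        else:
            z = int(rng.integers(2))      # correct action -> random live state
    return float(z != 2)                  # reward 1 iff the lock is "opened"

good = make_comblock()
rng = np.random.default_rng(1)
# Uniform exploration opens the lock w.p. ~ (1/A)^H -- hopeless for H = 25:
uniform = lambda h, z: int(rng.integers(10))
print(np.mean([rollout(good, 10, uniform, rng) for _ in range(1000)]))  # 0.0
# A policy that decodes the latent state and knows the lock always succeeds:
oracle = lambda h, z: int(good[h, z]) if z < 2 else 0
print(rollout(good, 10, oracle, rng))   # 1.0
```

The gap between the two policies is what makes the benchmark require both latent state decoding and strategic exploration at once.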
Comblock with partitioned observation space (Comblock-PO) In this section, following the intuition of our lower bound (Theorem 5.1), we construct a setting where the supports of the emission distributions of the source tasks are completely disjoint, while the emission distribution in the target task is a mixture of all source emissions and the latent dynamics are identical across tasks. Hence Assumption 3.4 holds while Assumption 5.1 fails, so by Theorem 5.1 we expect an algorithm without generative access to the source tasks to fail. Specifically, under disjoint emission supports, a representation can decode the latent state correctly for each source yet permute latent state labels across sources, causing decoding errors on the target. We record the number of target episodes each method takes to solve the target task in Table 1(b). We observe that indeed the online version fails while the generative version still succeeds. We show the visualizations of O-REPTRANSFER and G-REPTRANSFER in Fig. 2(b), and we note that the empirical results exactly match our theoretical results. This ablation verifies that generative access to the source tasks is needed.
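The label-permutation failure mode described above can be illustrated with a small sketch. This is our own toy example mirroring the disjoint-support construction (the observation and decoder names are made up): two decoders that are both perfectly valid on each source in isolation can nevertheless disagree on the target mixture.

```python
# Observations: sources have disjoint supports {R1, R2} and {B1, B2};
# the target emits from both halves with equal probability.
latent_of = {"R1": "z1", "R2": "z2", "B1": "z1", "B2": "z2"}  # ground truth
psi1 = {"R1": 0, "R2": 1, "B1": 0, "B2": 1}   # correct decoder
psi2 = {"R1": 0, "R2": 1, "B1": 1, "B2": 0}   # B-labels permuted

def consistent_on(support, psi):
    """A decoder is valid on a support if it separates the latent states there."""
    labels = {latent_of[s]: psi[s] for s in support}
    return len(set(labels.values())) == len(labels)

# Both decoders are valid on each source task in isolation...
for psi in (psi1, psi2):
    assert consistent_on(["R1", "R2"], psi)
    assert consistent_on(["B1", "B2"], psi)

# ...but on the target mixture, psi2 merges observations of z1 and z2:
assert psi2["R1"] == psi2["B2"]   # a z1-observation and a z2-observation collide
assert psi1["R1"] != psi1["B2"]   # the correct decoder keeps them apart
```

No amount of per-source online data distinguishes psi1 from psi2, which is exactly the ambiguity that generative access resolves.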

7. CONCLUSION

We study representational transfer among low-rank MDPs which share the same unknown representation. Under a reasonably flexible linear span task-relatedness assumption, we propose an algorithm that provably transfers the representation learned from source tasks to the target task. The regret in the target task matches the bound obtained with oracle access to the true representation, using only a polynomial number of samples from the source tasks. Our approach relies on generative model access in the source tasks, which we prove is unavoidable in the worst case under the linear span assumption. To complement the lower bound, we propose a stronger reachability assumption on the raw states, under which online access to the source tasks suffices for provably efficient representation transfer. Finding modalities other than generative access which avoid the lower bound, and a more extensive empirical evaluation beyond the proof-of-concept experiments here, are important directions for future research.

Reproducibility Statement: For theory, we include detailed and complete proofs of all of our claims in the appendices. For empirics, we include anonymized code in the supplement. Please refer to Appendix J for a comprehensive experimental setup and list of hyperparameters. All experiments were run on CPUs and no external datasets were used.

Appendices

A NOTATIONS 

B ADDITIONAL RELATED WORKS

The idea of learning transferable representations has been extensively explored in the empirical literature. Here we do not intend to provide a comprehensive survey of all existing works on this topic. Instead, we discuss a few representative approaches that may be of interest. Towards transfer learning across different environments, the progressive neural network (Rusu et al., 2016) is among the first neural-based attempts to learn a transferable representation for a sequence of downstream tasks while overcoming the challenge of catastrophic forgetting. It maintains the learned neural models for all previous tasks and introduces additional connections between the network of the current task and those of prior tasks to allow information reuse. However, a drawback common to such approaches is that the network size grows linearly with the number of tasks. Other approaches include directly learning a multi-task policy that performs well on a set of source tasks, with the hope that it will generalize to future tasks (Parisotto et al., 2015). Such an approach requires the tasks to be similar in their optimal policy, which is a much stronger assumption than ours. Slightly off-topic are works on "transfer learning" within the same environment but across different reward functions, a more restricted setting than the one considered in this paper. Several prior works design representation learning algorithms that aim to learn a representation that generalizes across multiple reward functions/goals (Dayan, 1993; Barreto et al., 2017; Touati and Ollivier, 2021; Blier et al., 2021). These are related to the REWARDFREE REP-UCB we developed in Section E. The key difference is that we combine representation learning with efficient exploration to derive an end-to-end polynomial sample complexity bound, whereas these prior works do not consider exploration and do not come with provable sample complexity bounds.
We refer interested readers to a recent survey (Zhu et al., 2020) for a comprehensive discussion of other empirical approaches.

C COMPARISONS TO CLOSELY RELATED WORKS

Recall that in addition to the commonly made shared representation assumption (Assumption 3.1), we make two additional structural assumptions: reachability (Assumption 3.3) and linear span of µ⋆ (Assumption 3.4). The reachability assumption is commonly made in prior works even in single-task RL, e.g. (Modi et al., 2021; Misra et al., 2020). It ensures that there are no redundant dimensions in the ground-truth representation, which is a reasonable requirement. The linear span assumption is closely related to the diversity assumptions made in prior works on transfer learning in both supervised learning (Tripuraneni et al., 2020) and reinforcement learning (Lu et al., 2021).

Lu et al. (2021):

In the prior work of Lu et al. (2021), which also studies transfer learning in low-rank MDPs with nonlinear function approximation, the following assumptions are needed:
1. shared representation (identical to our Assumption 3.1);
2. task diversity (similar to our Assumption 3.4);
3. generative model access to both the source and the target tasks. In contrast, we only require generative model access to the source tasks and allow online learning in the target task;
4. a somewhat strong coverage assumption stating that the data covariance matrix (under the generative data distribution) between arbitrary pairs of features ϕ, ϕ′ ∈ Φ must be full rank. In contrast, our analysis only requires coverage under the true feature ϕ⋆ in the source tasks;
5. the existence of an ideal distribution q on which the learned representation can extrapolate. We do not require an assumption of this nature; instead, we show that the data collected from our strategic reward-free exploration phase suffices for successful transfer;
6. uniqueness of each ϕ up to linear-transform equivalence: two representation functions ϕ and ϕ′ can yield similar estimation results if and only if they differ by an invertible linear transformation. In contrast, we do not make any structural assumptions on the function class Φ beyond realizability.
In summary, our work presents a theoretical framework that permits successful representation transfer under significantly weaker assumptions. We believe this is a solid step towards understanding transfer learning in RL.

Cheng et al. (2022):

A concurrent work of Cheng et al. (2022) studies exactly the same problem as ours. Both works study the setting where the agent performs reward-free exploration in the source tasks for representation learning and uses the learned representation in the target task. Both works achieve similar sample complexities in the source and target tasks, albeit using very different algorithms. However, in addition to the assumptions that we make, Cheng et al. (2022) have to make the following additional assumption.

Assumption C.1 For any two different models in the model class Φ × Ψ, say P1(s′|s, a) = ⟨ϕ1(s, a), µ1(s′)⟩ and P2(s′|s, a) = ⟨ϕ2(s, a), µ2(s′)⟩, there exists a constant C_R such that for all (s, a) ∈ S × A and h ∈ [H],
∥P1(·|s, a) − P2(·|s, a)∥TV ≤ C_R E_{(s,a)∼U(S,A)}[∥P1(·|s, a) − P2(·|s, a)∥TV].

This assumption ensures that the point-wise TV error is bounded as long as the population-level TV error is bounded, and it is used to transfer the MLE error from the source tasks to the target task. This type of assumption is strong in the sense that we typically expect C_R to scale with |S|. In contrast, our analysis (Lemma G.1) shows that this assumption is in fact not necessary when assuming only online access to the source tasks. The generative access to source tasks studied here, which enables transfer under weaker reachability assumptions, is not studied in their work. It is worth noting that Cheng et al. (2022) also study offline RL in the target task, which we do not cover, while we mainly focus on generative models in the source tasks and present a more complete picture by proving that generative model access in the source tasks is needed without additional assumptions. Compared to (Cheng et al., 2022), we also implement and experimentally evaluate our algorithm.
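To see why C_R in Assumption C.1 typically scales with |S|, consider the following tabular sketch (our own illustration, with made-up sizes): when two models differ at only a single state, the point-wise TV error stays constant while the population-level TV error shrinks like 1/|S|, forcing C_R = |S|.

```python
import numpy as np

S, A, Sp = 100, 1, 2                  # |S| states, 1 action, 2 next-states
P1 = np.full((S, A, Sp), 0.5)         # model 1: uniform next-state everywhere
P2 = P1.copy()
P2[0, 0] = [1.0, 0.0]                 # model 2 differs only at state 0

tv = 0.5 * np.abs(P1 - P2).sum(axis=-1)   # pointwise TV per (s, a)
pointwise_max = tv.max()                  # 0.5, at the single differing state
population_avg = tv.mean()                # 0.5 / S under the uniform distribution
C_R = pointwise_max / population_avg
print(C_R)   # ≈ S = 100
```

So the smallest C_R that makes Assumption C.1 hold for this pair of models is exactly |S|, which is why we view the assumption as strong.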

D IMPOSSIBILITY RESULTS

Here, we present an interesting result showing that the above assumptions are so weak that they do not even permit efficient transfer in supervised learning:

Theorem D.1 (Counter-example in supervised learning) Assume that we want to perform conditional density estimation, where P⋆_k(y|x) = ϕ⋆(x)⊤µ⋆_k(y). Under Assumption 3.1 (shared representation) and Assumption 3.4 (linear span), and assuming that in each source task one has access to a data generating distribution ρ_k(x) such that λ_min(E_{ρ_k}[ϕ⋆(x)ϕ⋆(x)⊤]) ≥ ψ (reachability), no algorithm can achieve E_{ρ_K}[∥P̂_K(y|x) − P⋆_K(y|x)∥TV] ≤ 1/2 on the target task using the feature learned from the source tasks with probability more than 1/2.

Proof of Theorem D.1. Consider the following example. X = R² and we have the following 3 sets:
S1 = B_{1/2}((−1, −1)), S2 = B_{1/2}((−2, −2)), S3 = B_{1/2}((0, 1)),
where B_a((x, y)) stands for the ball with radius a centered at (x, y). These are the supports of 3 tasks: tasks 1 and 2 are the source tasks, and task 3 is the target task. Assume that P⋆_k(x) is the uniform distribution on S_k. Suppose that the feature class Φ contains only two functions:
ϕ1(x) = (1, 0) if x1 ≤ 0 and x2 ≥ x1, and (0, 1) otherwise;
ϕ2(x) = (0, 1) if x2 ≤ 0 and x1 ≥ x2, and (1, 0) otherwise.
That is, each feature maps from R² to the set of binary encodings of dimension 2, i.e. {(1, 0), (0, 1)}. We further assume that µ⋆_k = (p1(y), p2(y)) for some distributions p1, p2, identical for all tasks k, with ∥p1 − p2∥TV = 1. We also assume that µ⋆_k is known to the learner a priori, i.e. Υ_k = {µ⋆_k} for all k ∈ [K], so all the learner needs to do is pick the correct ϕ out of the two candidates. Given the above setup, it is easy to verify that both Assumption 3.1 and Assumption 3.4 are satisfied, because the decision boundary of both ϕ1 and ϕ2 passes through the support of the source tasks, and all µ⋆_k's are identical.
However, ϕ1 and ϕ2 are equivalent on S1 and S2 in terms of their representation power, so no algorithm can pick the correct feature function with probability more than 1/2, regardless of the number of samples. Suppose ϕ1 is the true feature and the algorithm incorrectly chooses ϕ2. Then, for x ∈ S3 ∩ {x1 ≥ 0}, which has probability mass 1/2, P̂3(y|x) = p2 whereas P⋆3(y|x) = p1. Thus, the expected total variation distance between P̂_K and P⋆_K is 1/2. The above construction shows that our assumptions are not sufficient to permit reliable representation transfer, even in the supervised learning setting. Yet, surprisingly, these assumptions are sufficient in the RL setting, implying that transfer learning in RL is easier than transfer learning in SL. To understand this phenomenon, observe that in RL the marginal distribution over (s, a) is not independent of the conditional density P(s′|s, a) we wish to estimate. In particular, if one collects data in the source tasks in an online fashion by running a policy, ρ(s, a) is structurally restricted to be an occupancy distribution generated by the ground-truth transition P⋆(s′|s, a). Such a connection exists only in Markov chains, and our analysis utilizes this additional structure to establish the soundness of the learned representation. Also note, crucially, that we never learn a representation to capture d0, which would suffer from similar issues as the supervised learning setting, but is not necessary for sample-efficient RL. Next, we prove the impossibility result of Theorem 5.1, restated below as Theorem D.3. This result shows that one cannot achieve online learning in the source tasks without significantly stronger assumptions such as Assumption 5.1. Before that, we provide a preliminary version, showing that the learned φ̂ is not sufficient to fit the transition model in the target task, which motivates the construction in Theorem D.3.
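The construction in the proof of Theorem D.1 can be checked numerically. The sketch below is our own (the sampling routine and names are ours): it verifies that the two candidate features agree on both source supports but disagree on half of the target support.

```python
import numpy as np
rng = np.random.default_rng(0)

def sample_ball(center, r, n):
    """Uniform samples from a 2-D ball of radius r around `center`."""
    theta = rng.uniform(0, 2 * np.pi, n)
    rad = r * np.sqrt(rng.uniform(0, 1, n))
    return np.stack([center[0] + rad * np.cos(theta),
                     center[1] + rad * np.sin(theta)], axis=1)

# Encode (1,0) as label 0 and (0,1) as label 1.
phi1 = lambda x: np.where((x[:, 0] <= 0) & (x[:, 1] >= x[:, 0]), 0, 1)
phi2 = lambda x: np.where((x[:, 1] <= 0) & (x[:, 0] >= x[:, 1]), 1, 0)

S1 = sample_ball((-1, -1), 0.5, 10000)   # source task 1
S2 = sample_ball((-2, -2), 0.5, 10000)   # source task 2
S3 = sample_ball((0, 1), 0.5, 10000)     # target task

# On both source supports the two candidate features are indistinguishable...
assert (phi1(S1) == phi2(S1)).all()
assert (phi1(S2) == phi2(S2)).all()
# ...but they disagree on half of the target support (S3 ∩ {x1 > 0}):
print(np.mean(phi1(S3) != phi2(S3)))   # ≈ 0.5
```

Since ∥p1 − p2∥TV = 1, every disagreement point contributes full TV error, giving the claimed expected error of 1/2.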
Theorem D.2 (Impossibility Result: Model Learning) Let M_K be the set of K-task multi-sets satisfying: 1. all tasks are Block MDPs; 2. all tasks satisfy Assumption 3.3 and Assumption 3.4; 3. the latent dynamics are exactly the same for all source and target tasks. For any pre-training algorithm A, there exists {P⋆_k}_{k=1:K} ∈ M_K and an occupancy distribution ρ_K on the target task such that, with probability at least 1/2, A outputs a feature φ̂ for which, for any µ,
E_{ρK}[∥φ̂(s, a)⊤µ(·) − P⋆_K(·|s, a)∥TV] ≥ 1/2.

Proof of Theorem D.2. Consider a tabular MDP with 2 latent states z1, z2 and an observation space S = R1 ∪ R2 ∪ B1 ∪ B2, where in task 1 one can only observe R1 ∪ R2 and in task 2 one can only observe B1 ∪ B2. Correspondingly, o1(s|z) is supported only on R1 ∪ R2 (i.e., o1(Ri|zi) = 1), and similarly for task 2. Let the latent state transition be such that P(z1|z1, a) = 1 and P(z2|z2, a) = 1, i.e. only self-transitions, regardless of the action. Now, consider a 2-element feature class Ψ = {ψ1, ψ2} such that
ψ1 = {R1 → 1, R2 → 2, B1 → 1, B2 → 2}, ψ2 = {R1 → 1, R2 → 2, B1 → 2, B2 → 1}.
Denote ϕi(s, a) = e_{(ψi(s),a)} for i ∈ {1, 2}. Consider, for each task k, a 2-element class Υ_k of the form Υ_k = {(o_k(s|z1), o_k(s|z2)), (o_k(s|z2), o_k(s|z1))}. Notice that ϕ1 and ϕ2 are merely permutations of one another, so given data from any single task the two hypotheses cannot be distinguished by any means. Therefore, for any algorithm, if the ground truth ϕ⋆ is sampled uniformly at random from {ϕ1, ϕ2}, there is probability at least 1/2 that the algorithm chooses the wrong hypothesis. Suppose ϕ1 is the correct hypothesis and ϕ2 is the one the algorithm picks (i.e., φ̂ = ϕ2). Let task 3 be such that every latent state emits into R1 ∪ R2 and B1 ∪ B2 each with probability 1/2 (i.e., o3(Ri|zi) = o3(Bi|zi) = 0.5). This construction satisfies Assumption 3.3 and Assumption 3.4.
Then, within task 3, one encounters observations from both R1 and B2, which are mapped to latent states z1 and z2 respectively by the true decoder ϕ1, but are both mapped to latent state z1 by the learned decoder ϕ2; thus z1 and z2 become indistinguishable. Suppose ρK(z1) = ρK(z2) = 1/2. Then
E_{ρK}[∥φ̂(s, a)⊤µ(·) − P⋆(·|s, a)∥TV]
= (1/4)∥φ̂(R1)⊤µ(·) − ϕ⋆(R1)⊤µ⋆(·)∥TV + (1/4)∥φ̂(B1)⊤µ(·) − ϕ⋆(B1)⊤µ⋆(·)∥TV + (1/4)∥φ̂(R2)⊤µ(·) − ϕ⋆(R2)⊤µ⋆(·)∥TV + (1/4)∥φ̂(B2)⊤µ(·) − ϕ⋆(B2)⊤µ⋆(·)∥TV
= ∥o1 − o⋆1∥TV/4 + ∥o2 − o⋆1∥TV/4 + ∥o2 − o⋆2∥TV/4 + ∥o1 − o⋆2∥TV/4
≥ (1/4)∥o⋆1 − o⋆2∥TV + (1/4)∥o⋆1 − o⋆2∥TV = 1/2,
where the inequality groups the first term with the fourth and the second with the third and applies the triangle inequality, and the last equality comes from the fact that o3(·|z1) and o3(·|z2) have disjoint supports, which implies ∥o⋆1 − o⋆2∥TV = 1.

Now, we are ready to restate and prove Theorem 5.1.

Theorem D.3 (Impossibility Result: Optimal Policy Identification) Let M_K be the set of K-task multi-sets satisfying: 1. all tasks are Block MDPs; 2. all tasks satisfy Assumption 3.3 and Assumption 3.4; 3. the latent dynamics are exactly the same for all source and target tasks. For any pre-training algorithm A, there exists {P⋆_k}_{k=1:K} ∈ M_K such that, with probability at least 1/2, A outputs a feature φ̂ such that for any policy of the functional form π(s) = f({φ̂(s, a)}_{a∈A}, {r(s, a)}_{a∈A}), we have V⋆ − V^π ≥ 1/2.

Proof of Theorem D.3. Consider a tabular MDP with H = 2, two latent states z1, z2 for h = 1 and two latent states z3, z4 for h = 2. • For h = 1, let there be two actions a1, a2. Let the observation space be S = R1 ∪ R2 ∪ B1 ∪ B2, where in task 1 one can only observe R1 ∪ R2 and in task 2 one can only observe B1 ∪ B2. Correspondingly, o1(s|z) is supported only on R1 ∪ R2 (i.e., o1(Ri|zi) = 1), and similarly for task 2.
Let the latent state transition be such that P(z3|z1, a1) = P(z3|z2, a2) = 1 and P(z4|z1, a2) = P(z4|z2, a1) = 1. All rewards are 0 at h = 1. • For h = 2, all actions have reward 1 in state z3 and reward 0 in state z4. • The initial state distribution is d0(z1) = d0(z2) = 1/2.

Algorithm 3 Multi-task REPLEARN
1: Input: Datasets {D_k}_{k=1:K−1}, model classes Φ, Υ_k, k = 1, ..., K − 1.
2: Compute MLE (φ̂, µ̂_{1:K−1}) := argmax_{ϕ∈Φ, µ_k∈Υ_k} Σ_{k=1}^{K−1} E_{D_k}[log ϕ(s, a)⊤µ_k(s′)].
3: Return φ̂.

Now, consider a 2-element feature class Ψ = {ψ1, ψ2} for h = 1, such that
ψ1 = {R1 → 1, R2 → 2, B1 → 1, B2 → 2}, ψ2 = {R1 → 1, R2 → 2, B1 → 2, B2 → 1}.
Denote ϕi(s, a) = e_{(ψi(s),a)} for i ∈ {1, 2}. In addition, define Υ = {µ1, µ2} where µ1 = {z3 → (1, 0), z4 → (0, 1)} and µ2 = {z3 → (0, 1), z4 → (1, 0)}. Notice that ϕ1 and ϕ2 are merely permutations of one another, so given data from any single task the two hypotheses cannot be distinguished by any means. Therefore, for any algorithm, there is probability at least 1/2 that it chooses the wrong hypothesis. Suppose ϕ1 is the correct hypothesis and ϕ2 is the one the algorithm picks (i.e., φ̂ = ϕ2). Let task 3 be such that every latent state emits into R1 ∪ R2 and B1 ∪ B2 each with probability 1/2 (i.e., o3(Ri|zi) = o3(Bi|zi) = 0.5). This construction satisfies Assumption 3.3 and Assumption 3.4. Then, any policy that makes decisions based only on φ̂(s, a) and r(s, a) must output the same action for observations in R1 and B2, and likewise for B1 and R2. However, the optimal policy, which tries to reach z3 from either z1 or z2, picks a1 at R1 and B1 while picking a2 at R2 and B2; it therefore does not agree on R1 and B2, nor on R2 and B1. Thus, no policy of the above form is capable of capturing the optimal policy.
From the reward perspective, notice that d^π(z1) = d^π(z2) = 1/2 and d^π(R1) = d^π(R2) = d^π(B1) = d^π(B2) = 1/4. Since π(R1) = π(B2), the agent can collect reward at only one of R1 and B2 (but not both). Similarly, since π(R2) = π(B1), the agent can collect reward by reaching z3 at only one of R2 and B1 (but not both). This means that π has average reward at most 1/2. Since the optimal policy collects reward at all of R1, R2, B1, B2, it has average reward 1. This concludes the proof. Theorem D.2 and Theorem D.3 show that it is impossible to allow online learning in the source tasks without much stronger assumptions. In this paper, we show that our Assumption 5.1, which ensures reachability in the raw states, is sufficient to establish an end-to-end online transfer learning result. However, it is unclear whether Assumption 5.1 is necessary for online learning. We leave this as an important direction for future work.
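The value gap in the proof of Theorem D.3 can be verified by enumerating every policy that is measurable with respect to the wrong decoder's labels. The sketch below is our own encoding of the construction (names like `obs` and `reward` are ours):

```python
from itertools import product

# Target-task observations, their true latent state, and the label the wrong
# decoder psi2 assigns; all four observations have probability mass 1/4.
obs = {"R1": ("z1", 1), "R2": ("z2", 2), "B1": ("z1", 2), "B2": ("z2", 1)}
# Reaching z3 (reward 1) requires a1 from z1 and a2 from z2.
reward = {("z1", "a1"): 1, ("z1", "a2"): 0, ("z2", "a1"): 0, ("z2", "a2"): 1}

def value(policy):
    """Expected reward of a policy that maps psi2-labels to actions."""
    return sum(0.25 * reward[(z, policy[lab])] for z, lab in obs.values())

# Best policy that only sees psi2's labels:
best = max(value(dict(zip([1, 2], acts)))
           for acts in product(["a1", "a2"], repeat=2))
print(best)   # 0.5

# The optimal policy sees the observation itself and always reaches z3:
opt = sum(0.25 * max(reward[(z, a)] for a in ["a1", "a2"]) for z, _ in obs.values())
print(opt)    # 1.0
```

Every label-measurable policy achieves exactly 1/2, matching the V⋆ − V^π ≥ 1/2 gap claimed in the theorem.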

E REWARD-FREE REP-UCB

In this section, we adapt the Rep-UCB algorithm (Uehara et al., 2021) for reward-free exploration in a single task. We drop all task subscripts, as this section concerns a single task only (think of the task as any one source task). The original Rep-UCB algorithm was designed for infinite-horizon discounted MDPs, so we modify it to work in our undiscounted, finite-horizon setting. Our goal is to prove that Rep-UCB can learn a model that satisfies strong TV guarantees, i.e. Theorem E.1 and (5). Note that FLAMBE (Agarwal et al., 2020, Theorem 2) can be used for this directly, but at a worse (polynomial) sample complexity. Thus, we do a bit more work to derive a new model-learning algorithm for low-rank MDPs, based on Rep-UCB, that is more sample efficient in the source tasks. A finite-horizon analysis of Rep-UCB was done in BRIEE (Zhang et al., 2022), so here we just need to replace BRIEE's RepLearn subroutine with the MLE, which is how ϕ and µ are learned in Rep-UCB. Recall the notation of (Zhang et al., 2022). Data collection from π_{n−1}: for h = 1, 2, ..., H − 1,
ρ_{h,n}(s, a) = (1/n) Σ_{i=0}^{n−1} d^{π_i}_h(s, a),
s ∼ d^{π_{n−1}}_h, a ∼ Unif(A), s′ ∼ P⋆_h(s, a);
s̃ ∼ d^{π_{n−1}}_{h−1}, ã ∼ Unif(A), s̃′ ∼ P⋆_{h−1}(s̃, ã), ã′ ∼ Unif(A), s̃′′ ∼ P⋆_h(s̃′, ã′);
D_{h,n} = D_{h,n−1} ∪ {(s, a, s′)}, D′_{h,n} = D′_{h,n−1} ∪ {(s̃′, ã′, s̃′′)}.
For h = 0, only collect D_{0,n}.
5: Learn model via MLE: for all h = 0, 1, ..., H − 1,
P̂_{h,n} = (ϕ̂_{h,n}, µ̂_{h,n}) = argmax_{(ϕ_h, µ_h) ∈ M_h} E_{D_{h,n} ∪ D′_{h,n}}[log ϕ_h(s, a)⊤µ_h(s′)].
6: Update exploration bonus: for all h = 0, 1, ..., H − 1,
b̂_{h,n}(s, a) = α_n ∥ϕ̂_{h,n}(s, a)∥_{Σ̂^{−1}_{h,n}}, Σ̂_{h,n} = Σ_{(s,a,·)∈D_{h,n}} ϕ̂_{h,n}(s, a)ϕ̂_{h,n}(s, a)⊤ + λ_n I.
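The elliptical bonus of step 6 can be sketched as follows. This is our own illustration, not the paper's implementation; the feature matrix and parameter values are made up. The key behavior is that directions well covered by the data receive a small bonus while unvisited directions receive the maximum bonus α/√λ.

```python
import numpy as np
rng = np.random.default_rng(0)

d, n, lam, alpha = 5, 200, 1.0, 1.0
# Features of previously visited (s, a) pairs, spanning only the first d-1 dims:
Phi = np.zeros((n, d))
Phi[:, :d - 1] = rng.normal(size=(n, d - 1)) / np.sqrt(d)
Sigma = Phi.T @ Phi + lam * np.eye(d)       # regularized empirical covariance

def bonus(phi_sa):
    """alpha * ||phi(s,a)||_{Sigma^{-1}} -- the elliptical bonus of step 6."""
    return alpha * np.sqrt(phi_sa @ np.linalg.solve(Sigma, phi_sa))

covered = np.eye(d)[0]     # a direction the data has visited many times
fresh = np.eye(d)[-1]      # a direction never visited
print(bonus(covered))      # small
print(bonus(fresh))        # ≈ alpha / sqrt(lam) = 1.0, the maximum possible
```

This is exactly the signal that drives the algorithm toward poorly covered directions in the analysis that follows.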

7:

Learn policy π_n = argmax_π V^π_{P̂_n, b̂_n} and let V̂_n be its value.
8: Let n̂ = argmin_{n ≥ N/2} V̂_n.
9: Output: n̂, P̂_n̂.
Define
β_{h,n}(s, a) = (1/n) Σ_{i=0}^{n−1} E_{s̃∼d^{π_i}_{h−1}, ã∼Unif(A)}[P⋆_h(s | s̃, ã) Unif(a)],
γ_{h,n}(s, a) = (1/n) Σ_{i=0}^{n−1} d^{π_i}_h(s, a),
Σ_{ρ,ϕ,n} = n E_ρ[ϕ(s, a)ϕ(s, a)⊤] + λ_n I.
By using MLE (Uehara et al., 2021, Lemma 18) to learn models, with probability at least 1 − δ, for all n = 1, 2, ..., N and h = 0, 1, ..., H − 1, we have
max{ E_{ρ_{h,n}}[∥P̂_{h,n}(s, a) − P⋆_h(s, a)∥²_TV], E_{β_{h,n}}[∥P̂_{h,n}(s, a) − P⋆_h(s, a)∥²_TV] } ≤ ζ_n,
where ζ_n = O(log(|M|nH/δ)/n) and |M| = max_{h∈[H]} |Φ_h||Υ_h|. We also adopt the same choice of the parameters α_n, λ_n as BRIEE, which we assume from now on:
λ_n = Θ(d log(|M|nH/δ)), α_n = Θ(√(n|A|²ζ_n + λ_n d)).
As in Rep-UCB, we posit standard realizability and normalization assumptions on the (source) task of interest.
Assumption E.1 For any h = 0, 1, ..., H − 1, we have ϕ⋆_h ∈ Φ_h and µ⋆_h ∈ Υ_h. For any ϕ ∈ Φ_h, ∥ϕ(s, a)∥₂ ≤ 1. For all µ ∈ Υ_h and any function g : S → R, we have ∥∫_s g(s)dµ(s)∥₂ ≤ ∥g∥_∞ √d.
Lemma E.1 Let r be any reward function. Suppose we run Algorithm 4 with line 7 using reward r + b̂_n instead of just b̂_n. Then, for any δ ∈ (0, 1), w.p. at least 1 − δ, we have
Σ_{n=0}^{N−1} ( V^{π_n}_{P̂_n, r+b̂_n} − V^{π_n}_{P⋆, r} ) ≤ O( H²d²|A|^{1.5} √(N log(|M|NH/δ)) ).
Proof. Start from the third equation of Zhang et al. (2022, Theorem A.4). Following their proof until its last page, we arrive at the following: for any n = 1, 2, ..., N,

V^{π_n}_{P̂_n, r+b̂_n} − V^{π_n}_{P⋆, r} ≲ Σ_{h=0}^{H−2} E_{(s̃,ã)∼d^{π_n}_{P⋆,h}}[∥ϕ⋆_h(s̃, ã)∥_{Σ^{−1}_{γ_{h,n},ϕ⋆_h}}] √(|A|α²_n/d + λ_n d) + √(|A|α²_1 d/n) + (2H + 1) Σ_{h=0}^{H−2} E_{(s̃,ã)∼d^{π_n}_{P⋆,h}}[∥ϕ⋆_h(s̃, ã)∥_{Σ^{−1}_{γ_{h,n},ϕ⋆_h}}] √(n|A|ζ_n + λ_n d) + (2H + 1)√(|A|ζ_n).
By elliptical potential arguments, we have
Σ_{n=0}^{N−1} E_{(s̃,ã)∼d^{π_n}_{P⋆,h}}[∥ϕ⋆_h(s̃, ã)∥_{Σ^{−1}_{γ_{h,n},ϕ⋆_h}}] ≤ √(dN log(1 + N/(dλ_1))).
Thus, summing over n and noting that nζ_n, α_n, λ_n are increasing in n, we can combine the above to get
Σ_{n=0}^{N−1} ( V^{π_n}_{P̂_n, r+b̂_n} − V^{π_n}_{P⋆, r} )
≲ √(dN log(1 + N/(dλ_1))) ( H √(|A|α²_N/d + λ_N d) + H² √(N|A|ζ_N + λ_N d) )
≲ √(dN log(1 + N/(dλ_1))) ( H √(N|A|³ζ_N d + λ_N d²) + H² √(N|A|ζ_N + λ_N d) )
≲ √(dN log(1 + N/(dλ_1))) H² √( d|A|³ log(|M|NH/δ) + d³ log(|M|NH/δ) )
∈ O( H²d²|A|^{1.5} √(N log(|M|NH/δ)) ).
This gives the following useful corollary for reward-free exploration.
Lemma E.2 For any δ ∈ (0, 1), w.p. at least 1 − δ, we have
V̂_n̂ ≤ O( H²d²|A|^{1.5} √(log(|M|NH/δ)/N) ).
Proof. By definition of n̂, we have (N/2) V̂_n̂ ≤ Σ_{n=N/2}^{N−1} V̂^{π_n}_{P̂_n, b̂_n} ≤ Σ_{n=0}^{N−1} V̂^{π_n}_{P̂_n, b̂_n}, which is bounded by the previous lemma and the fact that V^{π_n}_{P⋆, r=0} = 0, since in Algorithm 4 the reward function is zero.
Conditioning on this event, we now show that the learned model P̂_n̂ has low TV error under any policy-induced distribution.
Theorem E.1 For any policy π, we have
Σ_{h=0}^{H−1} E_{d^π_{P⋆,h}}[∥P⋆_h(s, a) − P̂_{h,n̂}(s, a)∥TV] ≤ O( H³d²|A|^{1.5} √(log(|M|NH/δ)/N) ) := ε_TV.
Proof. In this proof, let P̂ = P̂_n̂, the model returned by the algorithm. Let r(s, a) = ∥P⋆_h(s, a) − P̂_h(s, a)∥TV ∈ [0, 2]. Then,
Σ_{h=0}^{H−1} ( E_{d^π_{P⋆,h}} − E_{d^π_{P̂,h}} )[r(s, a)] = V^π_{P⋆,r} − V^π_{P̂,r}
= Σ_{h=0}^{H−1} E_{d^π_{P̂,h}}[ ( E_{P⋆_h(s,a)} − E_{P̂_h(s,a)} ) V^π_{P⋆,r,h+1}(s′) ] (simulation lemma)
≤ 2H Σ_{h=0}^{H−1} E_{d^π_{P̂,h}}[∥P⋆_h(s, a) − P̂_h(s, a)∥TV].
Thus,
Σ_{h=0}^{H−1} E_{d^π_{P⋆,h}}[∥P⋆_h(s, a) − P̂_h(s, a)∥TV] ≤ (2H + 1) Σ_{h=0}^{H−1} E_{d^π_{P̂,h}}[∥P⋆_h(s, a) − P̂_h(s, a)∥TV]
≲ H ( Σ_{h=0}^{H−2} E_{d^π_{P̂,h}}[b̂_{h,n̂}(s, a)] + √(|A|ζ_{N/2}) ) (by (Zhang et al., 2022, Lemma A.1))
≤ H ( V^π_{P̂, b̂_n̂} + √(2|A| log(|M|NH/δ)/N) )
≤ H ( V̂^{π_n̂}_{P̂, b̂_n̂} + √(2|A| log(|M|NH/δ)/N) )
≲ H ( H²d²|A|^{1.5} √(log(|M|NH/δ)/N) + √(|A| log(|M|NH/δ)/N) ) (by Lemma E.2)
∈ O( H³d²|A|^{1.5} √(log(|M|NH/δ)/N) ).
This also gives us a guarantee on the TV distance between the visitation distributions induced by P⋆ and by P̂.
Lemma E.3 Suppose P̂ satisfies, for all h = 0, 1, ..., H − 1,
∀π : E_{d^π_{P⋆,h}}[∥P̂_h(s, a) − P⋆_h(s, a)∥TV] ≤ ε_h.
Then, for any h = 0, 1, ..., H − 1, we have
∀π : ∥d^π_{P̂,h} − d^π_{P⋆,h}∥TV ≤ Σ_{t=0}^{h−1} ε_t.
Note, for h = 0 the sum is empty, so the right-hand side is 0.
Proof. We proceed by induction over h = 0, 1, ..., H − 1. For the base case h = 0, no transition has been taken, so d^π_{P̂,0} = d^π_{P⋆,0}. Now let h ∈ {0, 1, ..., H − 2} be arbitrary, and suppose the claim holds at h (IH); we show it holds at h + 1. One key fact we use is that, for any signed measure µ, ∥µ∥TV = sup_{∥f∥∞≤1} |E_µ[f]|. Below we write f(s, π) = E_{a∼π(s)}[f(s, a)]. Then
∥d^π_{P̂,h+1} − d^π_{P⋆,h+1}∥TV = sup_{∥f∥∞≤1} | E_{d^π_{P̂,h+1}}[f(s, a)] − E_{d^π_{P⋆,h+1}}[f(s, a)] |
= sup_{∥f∥∞≤1} | E_{(s̃,ã)∼d^π_{P̂,h}, s∼P̂_h(s̃,ã)}[f(s, π_{h+1})] − E_{(s̃,ã)∼d^π_{P⋆,h}, s∼P⋆_h(s̃,ã)}[f(s, π_{h+1})] |
≤ sup_{∥f∥∞≤1} | ( E_{(s̃,ã)∼d^π_{P̂,h}} − E_{(s̃,ã)∼d^π_{P⋆,h}} ) E_{P̂_h(s̃,ã)}[f(s, π_{h+1})] | + sup_{∥f∥∞≤1} | E_{(s̃,ã)∼d^π_{P⋆,h}}[ ( E_{P̂_h(s̃,ã)} − E_{P⋆_h(s̃,ã)} )[f(s, π_{h+1})] ] |
≤ Σ_{t=0}^{h−1} ε_t + E_{(s̃,ã)∼d^π_{P⋆,h}}[ sup_{∥f∥∞≤1} | ( E_{P̂_h(s̃,ã)} − E_{P⋆_h(s̃,ã)} )[f(s, π_{h+1})] | ] (by IH and Jensen)
≤ Σ_{t=0}^{h−1} ε_t + ε_h, (by (4) and ∥f(s, π_{h+1})∥∞ ≤ 1)
as desired. Thus, combined with Theorem E.1, for h = 0, 1, ..., H − 1 and any policy π,
∥d^π_{P̂,h} − d^π_{P⋆,h}∥TV ≤ O( H³d²|A|^{1.5} √(log(|M|NH/δ)/N) ) = ε_TV.
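Lemma E.3's linear accumulation of TV error can be sanity-checked numerically. The sketch below is our own simplification to an action-free tabular Markov chain (random kernels, made-up sizes): it verifies that if every row of the transition kernel is within ε in TV, the occupancy distributions drift apart by at most h·ε after h steps.

```python
import numpy as np
rng = np.random.default_rng(0)

S, eps, H = 4, 0.05, 10

def rand_kernel():
    P = rng.uniform(size=(S, S))
    return P / P.sum(axis=1, keepdims=True)   # row-stochastic kernel

P = rand_kernel()
# Mix in eps of an arbitrary kernel so that every row satisfies TV(P_s, Q_s) <= eps:
Q = (1 - eps) * P + eps * rand_kernel()

d_p = d_q = np.full(S, 1 / S)                 # shared initial distribution
for h in range(1, H + 1):
    d_p, d_q = d_p @ P, d_q @ Q
    occ_tv = 0.5 * np.abs(d_p - d_q).sum()    # TV of step-h occupancies
    assert occ_tv <= h * eps + 1e-12          # Lemma E.3: error grows linearly
print(occ_tv)
```

The same telescoping argument, with the policy folded into the kernel, is what the induction in the proof formalizes.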
In other words, the sample complexity needed for a model error of ε_TV is O( H⁶d⁴|A|³ log(|M|NH/δ) / ε²_TV ). Note this is much better than FLAMBE's guarantee (Agarwal et al., 2020).

F REWARD-FREE EXPLORATION

In this section, we show that the mixture policy returned by Algorithm 2 has good coverage. Recall that Algorithm 2 contains two main steps. Step 1: Learn a model P̂. This was the focus of the previous section, where our modified REP-UCB method obtained a strong TV guarantee ((5)) using a number of episodes at most
N_REWARDFREE = O( H⁶d⁴|A|³ log(|M|NH/δ) / ε²_TV ).
Step 2: Run LSVI-UCB (Algorithm 6) in the learned model P̂ with reward at the e-th episode being b_{h,e} and UNIFORMACTIONS = TRUE. The optimistic bonus pushes the algorithm to explore directions that are not yet well covered by the mixture policy. With the elliptical potential argument, we can establish that this process terminates in a polynomial number of steps. We now focus on Step 2. Let π^{+1}_h denote rolling in π for h steps and taking a uniform action at step h + 1, thus inducing a distribution over (s_{h+1}, a_{h+1}). Abusing notation slightly, we use π^{+1}_{−1} for the policy that takes a single uniform action from the initial distribution d0.
Lemma F.1 Let δ ∈ (0, 1) and run REWARDFREE (Algorithm 2). Let Λ_{h,N} be the empirical covariance at the N-th iteration of LSVI-UCB (Algorithm 6). Then, w.p. at least 1 − δ, we have
sup_π Σ_{h=0}^{H−1} E_{π^{+1}_{h−1}, P̂}[∥ϕ̂_h(s_h, a_h)∥_{Λ^{−1}_{h,N}}] ≲ A d^{1.5} H³ √(log(dNH/δ)/N).
Proof. In this proof, we treat the empirical MDP P̂ as if it were P⋆, as that is the environment we are running in. Thus, we abuse notation and let P_{h,e} be the model-based perspective of the linear MDP, i.e. ϕ̂_h µ̂_{h,e}, where µ̂_{h,e} = Λ^{−1}_{h,e} Σ_{k=1}^{e−1} ϕ̂_h(s^k_h, a^k_h) δ(s^k_{h+1}). Also, in Algorithm 2 we set the reward to zero, but for the purpose of this analysis, suppose the reward function is precisely the (unscaled) bonus in LSVI-UCB, i.e. r_{h,e}(s_h, a_h) = b_{h,e}(s_h, a_h). This does not change the algorithm at all, since the β-scaling of the bonus dominates this reward in the definition of Q̂_{h,e}, but viewing the reward this way will make our analysis simpler.
Recall the high-level proof structure of the reward-free guarantee for linear MDPs (with known features ϕ) (Wang et al., 2020, Lemma 3.2). Step 1: Show that V̂_{h,e} ∈ V_h and, w.p. 1 − δ, for all h, e,
∀s_h, a_h : sup_{f∈V_h} | ( P̂_{h,e}(s_h, a_h) − P⋆_h(s_h, a_h) ) f | ≤ β b_{h,e}(s_h, a_h).
This step only uses self-normalized martingale bounds, so line 9 can use any martingale sequence of states and actions and the claim still holds, with the bonus b_{h,e} computed with the appropriate covariance under the data. Step 2: Show optimism conditioned on Step 1. Specifically, for all e = 1, 2, ..., N, we have E_{d0}[V⋆_0(s_0, r_e) − V̂_{0,e}(s_0)] ≤ 0. To show this, we need V̂_{h,e}(s_h) = Q̂_{h,e}(s_h, π^e_h(s_h)) ≥ Q̂_{h,e}(s_h, π⋆_h(s_h)) (this is for the unclipped case of V-optimism), which the algorithm satisfies since π^e_h is greedy w.r.t. Q̂_{h,e}. Step 3: Bound the sum Σ_e V̂_{h,e}, decomposed as a sum of expected bonuses where the expectation is under π^e. Step 3 is the only place where we use the fact that s^k_h, a^k_h are sampled by rolling out π^e. For Steps 1 and 2, please refer to the existing proofs in (Agarwal et al., 2019; Jin et al., 2020b; Wang et al., 2020).

Now we show

Step 3 for our modified algorithm with uniform actions. First, we establish a simulation lemma. For any episode e = 1, 2, ..., N and any s_0, recalling that the reward is b_{h,e}, we have
V̂_{0,e}(s_0) ≤ (1 + β) b_{0,e}(s_0, π^e_0(s_0)) + P̂_{0,e}(s_0, π^e(s_0)) V̂_{1,e}
≤ (1 + 2β) b_{0,e}(s_0, π^e_0(s_0)) + P⋆_0(s_0, π^e(s_0)) V̂_{1,e},
where the first inequality is due to the thresholding of the V̂_{h,e}'s and the second is due to Step 1. Continuing in this fashion, we have
E_{d0}[V̂_{0,e}(s_0)] ≤ (1 + 2β) Σ_{h=0}^{H−1} E_{π^e}[b_{h,e}(s_h, a_h)].
Summing over e = 1, 2, ..., N, we have
Σ_{e=1}^N E_{d0}[V̂_{0,e}(s_0)] ≲ β Σ_{h=0}^{H−1} Σ_{e=1}^N E_{π^e}[b_{h,e}(s_h, a_h)] ≤ Aβ Σ_{h=0}^{H−1} Σ_{e=1}^N E_{(π^e_{h−1})^{+1}}[b_{h,e}(s_h, a_h)].
For each h = 0, 1, ..., H − 1, apply Azuma's inequality to the martingale difference sequence ∆_e = E_{(π^e_{h−1})^{+1}}[b_{h,e}(s_h, a_h)] − b_{h,e}(s^e_h, a^e_h). Thus, we finally have
Σ_{e=1}^N E_{d0}[V̂_{0,e}(s_0)] ≲ AβH √(dN log(NH/δ)).
Consider any episode e = 1, 2, ..., N. By definition, Λ_{h,N} ⪰ Λ_{h,e}, so for all (s, a) we have pointwise b_{h,N}(s, a) ≤ b_{h,e}(s, a). Hence, for all s, we have V⋆_0(s; r_N) ≤ V⋆_0(s; r_e), and further using optimism,
N E_{d0}[V⋆_0(s_0; r_N)] ≤ Σ_{e=1}^N E_{d0}[V⋆_0(s_0; r_e)] ≤ Σ_{e=1}^N E_{d0}[V̂_{0,e}(s_0)] ≲ AβH √(dN log(NH/δ)).
Now consider any h and policy π, and consider rolling it out for h − 1 steps and then taking a random action. Then we have
E_{π^{+1}_{h−1}, P̂}[∥ϕ̂_h(s_h, a_h)∥_{Λ^{−1}_{h,N}}] ≤ E_{d0}[V^{π^{+1}_{h−1}}_0(s_0; r_N)] ≤ AβH √(d log(NH/δ)/N).
Summing over h incurs an extra H factor on the right. This concludes the proof.
Lemma F.2 (One-step back for linear MDPs) Suppose P_h = (ϕ_h, µ_h) is a linear MDP. Suppose ρ is any mixture of n policies, and let Σ_h := n E_ρ[ϕ_h(s_h, a_h)ϕ_h(s_h, a_h)⊤] + λI denote the unnormalized covariance. For any g : S × A → R, policy π, and h = 0, 1, ..., H − 2, we have
E_π[g(s_{h+1}, a_{h+1})] ≤ E_π[∥ϕ_h(s_h, a_h)∥_{Σ^{−1}_h}] √( nA E_{ρ^{+1}_h}[g(s_{h+1}, a_{h+1})²] + λd∥g∥²_∞ ).
Proof.
E_π[g(s_{h+1}, a_{h+1})] = E_π[⟨ϕ_h(s_h,a_h), ∫_{s_{h+1}} g(s_{h+1}, π_{h+1}) dµ_h(s_{h+1})⟩] ≤ E_π[∥ϕ_h(s_h,a_h)∥_{Σ_h^{-1}}] ∥∫_{s_{h+1}} g(s_{h+1}, π_{h+1}) dµ_h(s_{h+1})∥_{Σ_h},

where g(s, π) is shorthand for E_{a∼π(s)}[g(s, a)], and

∥∫_{s_{h+1}} g(s_{h+1}, π_{h+1}) dµ_h(s_{h+1})∥²_{Σ_h} = n E_ρ[(E_{s_{h+1}∼P_h(s_h,a_h)}[g(s_{h+1}, π_{h+1})])²] + λ ∥∫_{s_{h+1}} g(s_{h+1}, π_{h+1}) dµ_h(s_{h+1})∥²_2 ≤ n|A| E_{ρ+1_h}[g(s_{h+1}, a_{h+1})²] + λd ∥g∥²_∞.

Under reachability, we can show that small (squared) bonuses and spectral coverage, in the sense of lower-bounded eigenvalues, are essentially equivalent.

Lemma F.3. Let Σ be a symmetric positive definite matrix and define the bonus b_h(s,a) = ∥ϕ⋆_h(s,a)∥_{Σ^{-1}}. Then:
1. For any policy π, E_{d^π_h}[b²_h(s,a)] ≤ 1/λ_min(Σ). That is, coverage implies small squared bonuses.
2. Under reachability w.r.t. ϕ⋆ (Assumption 3.3), the converse holds: there exists a policy π such that E_{d^π_h}[b²_h(s,a)] ≥ ψ/λ_min(Σ). That is, small squared bonuses imply coverage.

Proof. The first claim follows directly from Cauchy-Schwarz. Indeed, for any policy π, we have

E_{d^π_h}[b²_h(s,a)] ≤ E_{d^π_h}[∥ϕ⋆_h(s,a)∥²_2] ∥Σ^{-1}∥_2 ≤ 1/λ_min(Σ).

For the second claim, Assumption 3.3 implies that there exists a policy π such that for all vectors v ∈ R^d with ∥v∥_2 = 1, we have E_{d^π_h}[(ϕ⋆_h(s,a)^⊤ v)²] ≥ ψ. Now decompose Σ = Σ_{i=1}^d λ_i v_i v_i^⊤, where (λ_i, v_i) are eigenvalue/eigenvector pairs with ∥v_i∥_2 = 1 and λ_1 ≥ λ_2 ≥ ... ≥ λ_d. Substituting this into the definition of the bonus,

E_{d^π_h}[b²_h(s,a)] = Σ_{i=1}^d (1/λ_i) E_{d^π_h}[(ϕ⋆_h(s,a)^⊤ v_i)²] ≥ (1/λ_d) E_{d^π_h}[(ϕ⋆_h(s,a)^⊤ v_d)²] ≥ ψ/λ_min(Σ).

We now prove our main lemma for reward-free exploration, Lemma 4.2.

Lemma 4.2 (Reward-free exploration). Fix any source task k ∈ [K−1] and suppose Assumptions 3.2 and 3.3 hold. Then, for any δ ∈ (0,1), w.p. 1−δ, REWARDFREE (Algorithm 2) with N_LSVI-UCB = Θ(A³d⁶H⁸ψ⁻²) and N_REWARDFREE = O(A³d⁴H⁶ log(|Φ||Υ|/δ) N²_LSVI-UCB) returns a λ_min-exploratory policy π_k, where λ_min = Ω(A⁻³d⁻⁵H⁻⁷ψ²).
The sample complexity here is N_REWARDFREE episodes in the source task.

Proof of Lemma 4.2. In this proof, let Λ⋆_h = N_LSVI-UCB E_{ρ+1_{h−1}}[ϕ⋆_h(s_h,a_h) ϕ⋆_h(s_h,a_h)^⊤] + λI and Λ̂_h = N_LSVI-UCB E_{ρ+1_{h−1}}[ϕ̂_h(s_h,a_h) ϕ̂_h(s_h,a_h)^⊤] + λI, where λ = dH log(N_LSVI-UCB/δ) ≥ 1. This setting of λ satisfies the precondition for the concentration of inverse covariances (Zanette et al., 2021, Lemma 39), which implies that w.p. at least 1−δ,

Λ̂^{-1}_h ⪯ 2 ( Σ_{e=1}^{N_LSVI-UCB} ϕ̂_h(s^e_h,a^e_h) ϕ̂_h(s^e_h,a^e_h)^⊤ + λI )^{-1} ⪯ 2 Λ̂^{-1}_{h,N_LSVI-UCB},

where we've also used the fact that λ ≥ 1, so (A + λI)^{-1} ⪯ (A + I)^{-1}. Under this event, for any π, we have

Σ_{h=1}^H E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_h}] ≲ Σ_{h=1}^H E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_{h,N_LSVI-UCB}}]. (6)

Now let h = 0, 1, ..., H−2 be arbitrary. By Assumption 3.3 there exists some policy π with coverage, so that

ψ/λ_min(Λ⋆_{h+1})
≤ E_π[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥²_{(Λ⋆_{h+1})^{-1}}] (by Lemma F.3)
≤ E_π[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥_{(Λ⋆_{h+1})^{-1}}] (by λ ≥ 1)
≤ E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_h}] √(A(2d + ε_TV N_LSVI-UCB)) + ε_TV (by Corollary F.1)
≤ E_{π,P̂}[∥ϕ̂_h(s,a)∥_{Λ̂^{-1}_h}] √(A(2d + 1)) + 1/N_LSVI-UCB (by ε_TV = 1/N_LSVI-UCB)
≲ A^{1.5} d² H³ √(log(dH N_LSVI-UCB/δ)/N_LSVI-UCB) + 1/N_LSVI-UCB (by (6) and Lemma F.1)
≲ A^{1.5} d² H³ √(log(dH N_LSVI-UCB/δ)/N_LSVI-UCB).

Recalling that λ = dH log(N_LSVI-UCB/δ), we have

λ_min( E_{ρ+1_h}[ϕ⋆_{h+1}(s,a) ϕ⋆_{h+1}(s,a)^⊤] ) = (λ_min(Λ⋆_{h+1}) − λ)/N_LSVI-UCB
≥ (1/N_LSVI-UCB) [ Cψ / ( A^{1.5} d² H³ √(log(dH N_LSVI-UCB/δ)/N_LSVI-UCB) ) − dH log(N_LSVI-UCB/δ) ]
≳ Cψ / (A^{1.5} d² H³ √N_LSVI-UCB) − dH/N_LSVI-UCB,

where we've omitted the log terms for simplicity in the ≳. Now we optimize N_LSVI-UCB to maximize this bound. For a, b > 0, to maximize a function of the form f(x) = a/√x − b/x, it's best to set x⋆ such that √x⋆ = 2b/a, resulting in the value f(x⋆) = a²/(4b). Set x = N_LSVI-UCB, a = Cψ/(A^{1.5} d² H³), b = dH.
Hence, we need to set N_LSVI-UCB = Θ(b²/a²) = Θ(A³d⁶H⁸ψ⁻²), which results in the λ_min lower bound

λ_min( E_{ρ+1_h}[ϕ⋆_{h+1}(s_{h+1},a_{h+1}) ϕ⋆_{h+1}(s_{h+1},a_{h+1})^⊤] ) = Ω(a²/b) = Ω( ψ² / (A³ d⁵ H⁷) ).

Finally, we used the fact that ε_TV = 1/N_LSVI-UCB, which is enforced by the choice of N_REWARDFREE in the lemma statement so as to satisfy (5).

The above proves coverage of ρ+1_h for h = 0, 1, ..., H−2. Finally, to argue for ρ+1_{−1}, which simply takes a random action at time h = 0, we can invoke Assumption 3.3 at h = 0 to get a policy π such that

E_{ρ+1_{−1}}[ϕ⋆_0(s_0,a_0) ϕ⋆_0(s_0,a_0)^⊤] ⪰ (1/A) E_π[ϕ⋆_0(s_0,a_0) ϕ⋆_0(s_0,a_0)^⊤] ⪰ (ψ/A) I.

Corollary F.1. Let λ, Λ⋆_h, Λ̂_h be defined as in the proof of Lemma 4.2. For any h = 0, 1, ..., H−2 and any policy π, we have

E_π[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥_{(Λ⋆_{h+1})^{-1}}] ≤ E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_h}] √(|A|(2d + ε_TV N_LSVI-UCB)) + ε_TV.

Intuitively, this means that coverage in the learned features implies coverage in the true features.

Proof. For shorthand, let N = N_LSVI-UCB. Apply Lemma F.2 (one-step back) in the learned model P̂ to the function (s,a) ↦ ∥ϕ⋆_{h+1}(s,a)∥_{(Λ⋆_{h+1})^{-1}}, which is bounded by λ^{-1/2} ≤ 1. We have

E_{π,P̂}[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥_{(Λ⋆_{h+1})^{-1}}]
≤ E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_h}] √( N|A| E_{ρ+1_h, P̂}[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥²_{(Λ⋆_{h+1})^{-1}}] + d )
≤ E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_h}] √( N|A| E_{ρ+1_h}[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥²_{(Λ⋆_{h+1})^{-1}}] + N|A|ε_TV + d )
≤ E_{π,P̂}[∥ϕ̂_h(s_h,a_h)∥_{Λ̂^{-1}_h}] √( d|A| + N|A|ε_TV + d ),

where we used the fact that

E_{ρ+1_h}[∥ϕ⋆_{h+1}(s_{h+1},a_{h+1})∥²_{(Λ⋆_{h+1})^{-1}}] = Tr( E_{ρ+1_h}[ϕ⋆_{h+1} ϕ⋆_{h+1}^⊤] ( N E_{ρ+1_h}[ϕ⋆_{h+1} ϕ⋆_{h+1}^⊤] + λI )^{-1} ) = (1/N) Tr(I − M) ≤ d/N,

where M = λ(Λ⋆_{h+1})^{-1} is a positive definite matrix. Thus, doing an initial change of measure from d^π_{h+1} to d^π_{P̂,h+1}, at an additive cost of ε_TV, concludes the proof.
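Two facts used in this section lend themselves to quick numerical sanity checks: claim 1 of Lemma F.3 (for features in the unit ball, the squared bonus is at most 1/λ_min(Σ)) and the closed-form maximizer of f(x) = a/√x − b/x from the proof of Lemma 4.2. The sketch below checks both on a randomly generated 2×2 instance; the specific matrix and constants are illustrative assumptions, not quantities from the paper.

```python
import random

random.seed(0)

# (i) Lemma F.3, claim 1: for ||phi||_2 <= 1, phi^T Sigma^{-1} phi <= 1/lambda_min(Sigma).
a, b_, c = 2.0 + random.random(), 0.3, 1.5 + random.random()  # Sigma = [[a, b_], [b_, c]], PD
det = a * c - b_ * b_
inv = [[c / det, -b_ / det], [-b_ / det, a / det]]            # Sigma^{-1}
lam_min = 0.5 * ((a + c) - ((a - c) ** 2 + 4 * b_ * b_) ** 0.5)
ok = True
for _ in range(1000):
    v = [random.uniform(-1, 1), random.uniform(-1, 1)]
    n = max(1.0, (v[0] ** 2 + v[1] ** 2) ** 0.5)              # clip phi to the unit ball
    phi = [v[0] / n, v[1] / n]
    b2 = (phi[0] * (inv[0][0] * phi[0] + inv[0][1] * phi[1])
          + phi[1] * (inv[1][0] * phi[0] + inv[1][1] * phi[1]))
    ok = ok and b2 <= 1.0 / lam_min + 1e-12
print(ok)

# (ii) Proof of Lemma 4.2: f(x) = a/sqrt(x) - b/x is maximized at x* = 4 b^2 / a^2,
# where f(x*) = a^2 / (4 b); a coarse grid around x* should not beat the closed form.
A_, B_ = 0.3, 2.0
f = lambda x: A_ / x ** 0.5 - B_ / x
x_star = 4.0 * B_ ** 2 / A_ ** 2
print(abs(f(x_star) - A_ ** 2 / (4 * B_)) < 1e-12
      and max(f(x_star * s) for s in (0.25, 0.5, 2.0, 4.0)) <= f(x_star))
```

Both checks print True: the bonus bound holds pointwise, and the grid search confirms the stationary point used to set N_LSVI-UCB.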

G REPRESENTATION TRANSFER

First, we prove Lemma 4.1, restated below.

Lemma 4.1. Suppose Assumption 3.4 holds and that for all source tasks k ∈ [K−1] we have a λ_min-exploratory policy π_k. Then, for any δ ∈ (0,1), learning features ϕ̂ using the cross-sampling procedure of Algorithm 1 satisfies (2) w.p. 1−δ. Furthermore, for any h = 0, 1, ..., H−1, there exists µ̃_h : S → R^d such that for any function g : S → [0,1], ∥∫ g(s) dµ̃_h(s)∥_2 ≤ ᾱ√d, and

sup_π E_{π,P⋆_K}[∥ϕ̂_h(s_h,a_h)^⊤ µ̃_h(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{K;h}(·)∥_TV] ≤ ε_TV := √(|A| α³_max K ζ_n / λ_min).

Proof of Lemma 4.1. Fix an arbitrary π. Denote µ̃_h(s′) = Σ_{k=0}^{K−1} α_{k;h}(s′) µ̂_{k;h}(s′). First, note that

max_{g:S→[0,1]} ∥∫ µ̃_h(s) g(s) d(s)∥_2 ≤ max_{g:S→[0,1]} Σ_{k=0}^{K−1} ∥∫ µ̂_{k;h}(s) α_{k;h}(s) g(s) d(s)∥_2 ≤ Σ_{k=0}^{K−1} max_s |α_{k;h}(s)| √d = ᾱ√d,

since ∥∫ µ̂_{k;h}(s) g(s) d(s)∥_2 ≤ √d by Assumption 3.2. For any h = 0, 1, ..., H−1, we have

E_{π,P⋆_K}[∥ϕ̂_h(s_h,a_h)^⊤ µ̃_h(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{K;h}(·)∥_TV]
= E_{π,P⋆_K}[ ∫_{s_{h+1}} | Σ_{k=1}^{K−1} α_{k;h}(s_{h+1}) ( ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(s_{h+1}) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(s_{h+1}) ) | ]
≤ E_{π,P⋆_K}[ ∫_{s_{h+1}} Σ_{k=1}^{K−1} |α_{k;h}(s_{h+1})| | ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(s_{h+1}) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(s_{h+1}) | ]
≤ α_max Σ_{k=1}^{K−1} E_{π,P⋆_K}[ ∥ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(·)∥_TV ].

First consider the case h = 0. At h = 0, the distribution under P⋆_K is the same as ν_{k,h}, and so we directly get that the above quantity is at most α_max ζ_n^{1/2} ≤ ε_TV, which proves the h = 0 case. Now consider any h = 1, 2, ..., H−1. To simplify notation, denote

err_{k;h}(s_h,a_h) = ∥ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(·)∥_TV,
w_{k;h} = ∫_{s_h} dµ⋆_{K;h−1}(s_h) E_{a_h∼π_h(s_h)}[err_{k;h}(s_h,a_h)],
Σ_{k,h} = E_{π_k,P⋆_k}[ϕ⋆_h(s_h,a_h) ϕ⋆_h(s_h,a_h)^⊤].

Note that λ_min(Σ_{k,h}) ≥ λ_min by assumption.
Now, continuing from where we left off, we take a one-step back as follows:

α_max Σ_{k=1}^{K−1} E_{π,P⋆_K}[err_{k;h}(s_h,a_h)]
= α_max Σ_{k=1}^{K−1} E_{π,P⋆_K}[⟨ϕ⋆_{h−1}(s_{h−1},a_{h−1}), w_{k;h}⟩]
≤ α_max Σ_{k=1}^{K−1} E_{π,P⋆_K}[∥ϕ⋆_{h−1}(s_{h−1},a_{h−1})∥_{Σ^{-1}_{k;h−1}}] ∥w_{k;h}∥_{Σ_{k;h−1}}.

By the λ_min guarantee on Σ_{k,h}, and Jensen's inequality to push the square inside,

≤ (α_max/√λ_min) Σ_{k=1}^{K−1} √( E_{s_{h−1},a_{h−1}∼π_k,P⋆_k} E_{s_h∼P⋆_{K;h−1}(s_{h−1},a_{h−1}), a_h∼π_h(s_h)}[err_{k;h}(s_h,a_h)²] )
≤ (A^{1/2} α_max/√λ_min) Σ_{k=1}^{K−1} √( E_{s_{h−1},a_{h−1}∼π_k,P⋆_k} E_{s_h∼P⋆_{K;h−1}(s_{h−1},a_{h−1}), a_h∼U(A)}[err_{k;h}(s_h,a_h)²] ).

By Assumption 3.4, the expectation over P⋆_{K;h−1} is a linear combination of expectations over the P⋆_{j;h−1}, so

≤ (A^{1/2} α^{3/2}_max/√λ_min) Σ_{k=1}^{K−1} √( Σ_{j=1}^{K−1} E_{s_{h−1},a_{h−1}∼π_k,P⋆_k} E_{s_h∼P⋆_{j;h−1}(s_{h−1},a_{h−1}), a_h∼U(A)}[err_{k;h}(s_h,a_h)²] )
≤ (A^{1/2} α^{3/2}_max K^{1/2}/√λ_min) √( Σ_{k=1}^{K−1} Σ_{j=1}^{K−1} E_{s_{h−1},a_{h−1}∼π_k,P⋆_k} E_{s_h∼P⋆_{j;h−1}(s_{h−1},a_{h−1}), a_h∼U(A)}[err_{k;h}(s_h,a_h)²] )
≤ A^{1/2} α^{3/2}_max K^{1/2} ζ_n^{1/2} / √λ_min,

where we applied Cauchy-Schwarz over k and used the MLE guarantee (2) in the last step.

2: Find policy cover π_k = REWARDFREE(P⋆_k, N_LSVI-UCB, N_REWARDFREE, δ). (Algorithm 2)
3: for source task k = 1, ..., K−1 do
4: For each h = 0, 1, ..., H−1, sample D_k as n i.i.d. (s_h, a_h, s_{h+1}) tuples from π_k.
5: For each h = 0, 1, ..., H−1, learn ϕ̂_h = Multi-task REPLEARN({D_{k;h}}_{k∈[K−1]}). (Algorithm 3)

DEPLOYMENT PHASE
Additional Input: number of deployment episodes T.
1: Set β = H√d + ᾱdH log(dHT/δ).
2: Run LSVI-UCB({ϕ̂_h}_{h=0}^{H−1}, r = r_K, T, β) in the target task P⋆_K (Algorithm 6).

Next we state an analogous lemma for when we don't have generative access to the source tasks, but instead assume Assumptions 5.1 and 5.2.

Lemma G.1. Suppose Assumptions 5.1 and 5.2 hold. Take the setup of Lemma 4.1, with the only difference being that ϕ̂ is learned as in Algorithm 5.
Then, the same guarantee of Lemma 4.1 holds with a slightly different right-hand side for the bound on the TV-error:

sup_π E_{π,P⋆_K}[∥ϕ̂_h(s_h,a_h)^⊤ µ̃_h(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{K;h}(·)∥_TV] ≤ α_max K^{1/2} ζ_n^{1/2} / (ψ_raw λ_min)^{1/2}.

Proof of Lemma G.1. Fix an arbitrary π. Denote µ̃_h(s′) = Σ_{k=0}^{K−1} α_{k;h}(s′) µ̂_{k;h}(s′). Then, some algebra with importance sampling gives us the bound

E_{π,P⋆_K}[∥ϕ̂_h(s_h,a_h)^⊤ µ̃_h(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{K;h}(·)∥_TV]
≤ E_{π,P⋆_K}[ ∫_{s_{h+1}} | Σ_{k=1}^{K−1} α_{k;h}(s_{h+1}) ( ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(s_{h+1}) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(s_{h+1}) ) | ]
≤ α_max Σ_{k=1}^{K−1} E_{π,P⋆_K}[∥ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(·)∥_TV]
≤ α_max K^{1/2} √( Σ_{k=1}^{K−1} E_{π,P⋆_K}[∥ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(·)∥²_TV] ).

By Assumptions 5.1 and 5.2, for any s, a, we have d^π_{K;h}(s,a)/d^{π_k}_{k;h}(s,a) ≤ 1/(ψ_raw λ_min(E_{π_k,P⋆_k}[ϕ⋆_h(s_h,a_h) ϕ⋆_h(s_h,a_h)^⊤])) ≤ 1/(ψ_raw λ_min), where we used the coverage-under-π_k assumption in the last inequality. In other words, for each k = 1, 2, ..., K−1, we have ∥ dd^π_{K;h} / dd^{π_k}_{k;h} ∥_∞ ≤ 1/(ψ_raw λ_min), hence we can importance sample:

≤ (α_max K^{1/2}/(ψ_raw λ_min)^{1/2}) √( Σ_{k=1}^{K−1} E_{π_k,P⋆_k}[∥ϕ̂_h(s_h,a_h)^⊤ µ̂_{k;h}(·) − ϕ⋆_h(s_h,a_h)^⊤ µ⋆_{k;h}(·)∥²_TV] )
≤ α_max K^{1/2} ζ_n^{1/2} / (ψ_raw λ_min)^{1/2}.

H PROOFS FOR DEPLOYMENT PHASE

H.1 ONLINE RL LEMMAS

Lemma H.1 (Self-normalized martingale). Consider a filtration {F_i}_{i=1,2,...} such that E[ε_i | F_{i−1}] = 0 and ε_i | F_{i−1} is sub-Gaussian with parameter σ². Let {X_i}_{i=1,2,...} be random variables in a Hilbert space H. Suppose a linear operator Σ_0 : H → H is positive definite, and for any t define Σ_t = Σ_0 + Σ_{i=1}^t X_i X_i^⊤. Then w.p. at least 1−δ, we have, for all t ≥ 1:

∥ Σ_{i=1}^t X_i ε_i ∥²_{Σ_t^{-1}} ≤ σ² log( det(Σ_t) det(Σ_0)^{-1} / δ² ).

Proof. Lemma A.8 of (Agarwal et al., 2019).

Lemma H.2. Let Λ_t = λI + Σ_{i=1}^t x_i x_i^⊤ for x_i ∈ R^d and λ > 0. Then Σ_{i=1}^t x_i^⊤ Λ_t^{-1} x_i ≤ d.

Proof.
Lemma D.1 of (Jin et al., 2020b) .
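The elliptical potential bound of Lemma H.2 is easy to check numerically. The sketch below builds Λ_t = λI + Σᵢ xᵢxᵢ^⊤ from random unit-ball vectors in d = 2 (the dimension, λ, and sample count are illustrative assumptions) and verifies Σᵢ xᵢ^⊤ Λ_t^{-1} xᵢ ≤ d; for d = 2 the inverse can be written in closed form.

```python
import random

def inv2(m):
    # closed-form inverse of a 2x2 matrix [[a, b], [c, d]]
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def quad(x, m):
    # x^T m x for a 2-vector x and a 2x2 matrix m
    y0 = m[0][0] * x[0] + m[0][1] * x[1]
    y1 = m[1][0] * x[0] + m[1][1] * x[1]
    return x[0] * y0 + x[1] * y1

random.seed(0)
lam, d, t = 1.0, 2, 200
xs = []
for _ in range(t):
    v = [random.uniform(-1, 1), random.uniform(-1, 1)]
    n = max(1.0, (v[0] ** 2 + v[1] ** 2) ** 0.5)  # clip x_i to the unit ball
    xs.append([v[0] / n, v[1] / n])

# Lambda_t = lam * I + sum_i x_i x_i^T
L = [[lam, 0.0], [0.0, lam]]
for x in xs:
    L[0][0] += x[0] * x[0]; L[0][1] += x[0] * x[1]
    L[1][0] += x[1] * x[0]; L[1][1] += x[1] * x[1]

Linv = inv2(L)
total = sum(quad(x, Linv) for x in xs)
print(total <= d)
```

The check prints True: indeed, Σᵢ xᵢ^⊤ Λ_t^{-1} xᵢ = Tr(Λ_t^{-1}(Λ_t − λI)) = d − λ Tr(Λ_t^{-1}) ≤ d, which is exactly the algebra behind the lemma.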

H.2 PROOF OF MAIN RESULTS

Let (x)_{≤y} refer to the clamping operator, i.e., (x)_{≤y} = min{x, y}, and let M_V be the maximum possible value in the MDP with the given reward function.

Algorithm 6 LSVI-UCB
1: Input: features {ϕ̂_h}_{h=0}^{H−1}, reward {r_h}_{h=0}^{H−1}, number of episodes N, bonus scaling parameter β, UNIFORMACTIONS = FALSE.
2: for episode e = 1, 2, ..., N do
3: Initialize V̂_{H,e}(s) = 0 for all s.
4: for step h = H−1, H−2, ..., 0 do
5: Learn the best predictor for V̂_{h+1,e}: Λ̂_{h,e} = Σ_{k=1}^{e−1} ϕ̂_h(s^k_h,a^k_h) ϕ̂_h(s^k_h,a^k_h)^⊤ + I, ŵ_{h,e} = Λ̂^{-1}_{h,e} Σ_{k=1}^{e−1} ϕ̂_h(s^k_h,a^k_h) V̂_{h+1,e}(s^k_{h+1}).
6: Set the bonus and value functions: b_{h,e}(s,a) = ∥ϕ̂_h(s,a)∥_{Λ̂^{-1}_{h,e}}, Q̂_{h,e}(s,a) = ŵ^⊤_{h,e} ϕ̂_h(s,a) + r_h(s,a) + β b_{h,e}(s,a), V̂_{h,e}(s) = (max_a Q̂_{h,e}(s,a))_{≤M_V}.
7: Set π^e_h(s) = argmax_a Q̂_{h,e}(s,a).
8: Execute π^e to collect a trajectory (s^e_h, a^e_h)_{h=0}^{H−1}.
9: If UNIFORMACTIONS = TRUE, discard the a^e_h and draw freshly sampled uniform actions independently for all h, i.e., a^e_h ∼ Unif(A).
10: Return: the uniform mixture ρ = Uniform({π^e}_{e=1}^N).

Previously, Jin et al. (2020b) analyzed LSVI-UCB under pointwise model misspecification. Here, we show that similar guarantees hold under a more general policy-distribution model misspecification ε_ms, captured by Assumption H.1.

Assumption H.1. Suppose that for every h = 0, 1, ..., H−1, there exists µ̃_h such that for any policy π, E_π[∥µ̃_h(·)^⊤ ϕ̂_h(s_h,a_h) − P⋆_h(· | s_h,a_h)∥_TV] ≤ ε_ms. We further assume that sup_{s,a,h} ∥µ̃_h(·)^⊤ ϕ̂(s,a)∥_TV ≤ M_µ and ∥f^⊤ µ̃_h∥_2 ≤ M_µ √d ∥f∥_∞ for all f : S → R, for some positive constant M_µ.

In other words, we only need the model to be accurate on average under the occupancy distributions realizable by policies. We also slightly generalize the regularization constant M_µ, which is set to 1 in the original linear MDP definition (Jin et al., 2020b). Later, we will instantiate the above assumption with our transferred µ̃_h(s′) = Σ_{k=1}^{K−1} α_{k;h}(s′) µ̂_{k;h}(s′); then for any s, a, we have

∥µ̃_h^⊤ ϕ̂_h(s,a)∥_TV = ∫_{s′} | Σ_{k=1}^{K−1} α_{k;h}(s′) µ̂_{k;h}(s′)^⊤ ϕ̂_h(s,a) | ≤ ∫_{s′} Σ_{k=1}^{K−1} |α_{k;h}(s′)| |µ̂_{k;h}(s′)^⊤ ϕ̂_h(s,a)| ≤ Σ_{k=1}^{K−1} max_{s′} |α_{k;h}(s′)| ≤ ᾱ, (by ∥µ̂_{k;h}^⊤ ϕ̂_h(s,a)∥_TV ≤ 1)

and

∥f^⊤ µ̃_h∥_2 = ∥ ∫_{s′} Σ_{k=1}^{K−1} α_{k;h}(s′) µ̂_{k;h}(s′) f(s′) ∥_2 ≤ Σ_{k=1}^{K−1} max_{s′} |α_{k;h}(s′)| ∥ ∫_{s′} µ̂_{k;h}(s′) f(s′) ∥_2 ≤ ᾱ√d ∥f∥_∞. (by ∥f^⊤ µ̂_{k;h}∥_2 ≤ √d ∥f∥_∞)

So we will set M_µ = ᾱ. Note that we only need the existence of µ̃_h here; µ̃_h(·)^⊤ ϕ̂_h(s,a) need not be a valid probability kernel, and may even be negative valued. In this section, we give a model-based analysis of LSVI. Similar approaches have been used in prior works, e.g., Lykouris et al. (2021); Agarwal et al. (2019); Zhang et al. (2022).
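The loop of Algorithm 6 can be sketched in a few dozen lines. The toy sketch below instantiates it on an assumed tabular MDP with one-hot features over (s, a), so that Λ̂_{h,e} is diagonal and the ridge solve is a per-coordinate division; the MDP, β, and M_V values are illustrative assumptions, not quantities from the paper.

```python
import random

# Minimal LSVI-UCB sketch (Algorithm 6) on a toy tabular MDP with one-hot
# features over (s, a). With one-hot features, Lambda_{h,e} is diagonal with
# entries 1 + count(s, a), so ridge regression and the bonus are closed-form.
S, A, H, N, beta = 3, 2, 4, 200, 1.0
M_V = H  # rewards in [0, 1], so values are at most H (the clamping level)

random.seed(1)
P = {(s, a): [random.random() for _ in range(S)] for s in range(S) for a in range(A)}
for key in P:  # normalize transition probabilities
    z = sum(P[key]); P[key] = [p / z for p in P[key]]
R = {(s, a): random.random() for s in range(S) for a in range(A)}

data = [[] for _ in range(H)]  # data[h] = list of (s, a, s') from past episodes
for e in range(N):
    # Backward pass: fit w_{h,e}, set bonus, Q, and clamped V (lines 4-7).
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    V = [[0.0] * S for _ in range(H + 1)]  # V[H] = 0
    for h in range(H - 1, -1, -1):
        cnt = {(s, a): 1.0 for s in range(S) for a in range(A)}  # + I regularizer
        tgt = {(s, a): 0.0 for s in range(S) for a in range(A)}
        for (s, a, s2) in data[h]:
            cnt[(s, a)] += 1.0
            tgt[(s, a)] += V[h + 1][s2]
        for s in range(S):
            for a in range(A):
                w = tgt[(s, a)] / cnt[(s, a)]        # ridge solution in this coordinate
                b = cnt[(s, a)] ** -0.5              # bonus ||phi||_{Lambda^{-1}}
                Q[h][s][a] = w + R[(s, a)] + beta * b
            V[h][s] = min(max(Q[h][s]), M_V)         # clamp at M_V (line 6)
    # Execute the greedy policy to collect one trajectory (line 8).
    s = 0
    for h in range(H):
        a = max(range(A), key=lambda x: Q[h][s][x])
        s2 = random.choices(range(S), weights=P[(s, a)])[0]
        data[h].append((s, a, s2))
        s = s2
```

After N episodes, `data` holds one transition per (episode, step) pair, and the last backward pass's clamped values stay within [0, M_V], mirroring Lemma H.3's first claim.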
For simplicity, we suppose that S is finite, but possibly exponentially large, as we suffer no dependence on |S|. The proof can be easily extended to infinite state spaces by replacing inner products with P by integrals. Consider the following quantity:

µ̂_{h,e} = Σ_{k=1}^{e−1} δ(s^k_{h+1}) ϕ̂_h(s^k_h,a^k_h)^⊤ (Λ̂_{h,e})^{-1} ∈ argmin_{µ∈R^{S×d}} Σ_{k=1}^{e−1} ∥µ ϕ̂_h(s^k_h,a^k_h) − δ(s^k_{h+1})∥² + ∥µ∥²_F,

where δ(s) is a one-hot encoding of the state s. In words, this is the best choice for linearly (in ϕ̂_h(s,a)) predicting E_{s′∼P⋆_h(s,a)}[δ(s′)] = P⋆_h(s′ | s,a). We highlight that this is just a quantity for analysis and is not computed in the algorithm. Finally, denote P̂_{h,e} = µ̂_{h,e} ϕ̂_h and P̃_h = µ̃_h ϕ̂_h. We will also sometimes use the shorthand P f(s,a) for E_{s′∼P(·|s,a)}[f(s′)]. For each h = 0, 1, ..., H−1, let V_h denote the class of functions

s ↦ ( max_a w^⊤ ϕ̂_h(s,a) + r_h(s,a) + β ∥ϕ̂_h(s,a)∥_{Λ^{-1}} )_{≤M_V}, with ∥w∥_2 ≤ N M_V, β ∈ [0, B], Λ ⪰ I symmetric.

The motivation behind this construction is that V_h satisfies the key property that all of the learned value functions V̂_{h,e} during Algorithm 6 are captured in this class.

Lemma H.3. For any h = 0, 1, ..., H−1:
1. sup_s V̂_{h,e}(s) ≤ M_V.
2. For any e = 1, 2, ..., N, we have V̂_{h,e} ∈ V_h.
3. For all f ∈ V_h, we have sup_s |f(s)| ≤ M_V.

Proof. Recall that V̂_{h,e}(s) = ( max_a ŵ^⊤_{h,e} ϕ̂_h(s,a) + r_h(s,a) + β b_{h,e}(s,a) )_{≤M_V}, where ŵ_{h,e} = Λ̂^{-1}_{h,e} Σ_{k=1}^{e−1} ϕ̂_h(s^k_h,a^k_h) V̂_{h+1,e}(s^k_{h+1}). From the thresholding, we have V̂_{h,e}(s) ≤ M_V. We can bound the norm of ŵ_{h,e} as follows:

∥ŵ_{h,e}∥ ≤ ∥Λ̂^{-1}_{h,e}∥_2 Σ_{k=1}^{e−1} |V̂_{h+1,e}(s^k_{h+1})| ≤ N sup_s V̂_{h+1,e}(s) ≤ N M_V.

We also required β ≤ B, and we regularized the covariance with I, so λ_min is at least 1. Hence V̂_{h,e} satisfies all the conditions for membership in V_h.
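The analysis quantity µ̂_{h,e} is just ridge regression from ϕ̂_h(s,a) onto one-hot next-state indicators δ(s′). The sketch below demonstrates this on an assumed toy chain with one-hot features over (s, a): the covariance is then diagonal, and row (s, a) of the induced model P̂(· | s, a) is the empirical transition frequency shrunk by the ridge term, so it approaches P⋆ but never sums to 1 exactly, consistent with the remark that µ̂ need not be a probability kernel.

```python
import random

random.seed(0)
S, A, n = 3, 2, 5000
# Toy ground-truth transitions (an assumption for illustration): a noisy chain.
P_true = {(s, a): [(1.0 if s2 == (s + a) % S else 0.0) * 0.8 + 0.1 for s2 in range(S)]
          for s in range(S) for a in range(A)}

count = {(s, a): 0.0 for s in range(S) for a in range(A)}
hits = {(s, a): [0.0] * S for s in range(S) for a in range(A)}
for _ in range(n):
    s, a = random.randrange(S), random.randrange(A)
    s2 = random.choices(range(S), weights=P_true[(s, a)])[0]
    count[(s, a)] += 1.0
    hits[(s, a)][s2] += 1.0

# With one-hot phi, the ridge solution mu_hat gives row (s, a):
#   P_hat(. | s, a) = hits(s, a) / (count(s, a) + 1),
# where the +1 comes from the identity regularizer.
P_hat = {k: [v / (count[k] + 1.0) for v in hits[k]] for k in count}
err = max(abs(P_hat[k][s2] - P_true[k][s2] / sum(P_true[k]))
          for k in P_hat for s2 in range(S))
print(err)
```

With ~800 samples per (s, a), `err` is small, while each row of `P_hat` sums to count/(count+1) < 1, a deliberately unnormalized "model" exactly as the analysis allows.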

Now we control the metric entropy of V_h in ℓ_∞, i.e., d(f_1, f_2) = sup_s |f_1(s) − f_2(s)| for f_i ∈ V_h.

Lemma H.4. Let ε > 0 be arbitrary and let N_ε be the smallest ε-net of V_h in ℓ_∞. Then log |N_ε| ≤ d log(1 + 6L/ε) + log(1 + 6B/ε) + d² log(1 + 18B²√d/ε²).

Proof. Let f_1, f_2 ∈ V_h. Then

|f_1(s) − f_2(s)| ≤ max_a | (w_1 − w_2)^⊤ ϕ̂_h(s,a) + β_1 ∥ϕ̂_h(s,a)∥_{Λ_1^{-1}} − β_2 ∥ϕ̂_h(s,a)∥_{Λ_2^{-1}} |
≤ ∥w_1 − w_2∥_2 + max_a |β_1 − β_2| ∥ϕ̂_h(s,a)∥_{Λ_1^{-1}} + β_2 max_a | ∥ϕ̂_h(s,a)∥_{Λ_1^{-1}} − ∥ϕ̂_h(s,a)∥_{Λ_2^{-1}} |
≤ ∥w_1 − w_2∥_2 + |β_1 − β_2| + B max_a | ∥ϕ̂_h(s,a)∥_{Λ_1^{-1}} − ∥ϕ̂_h(s,a)∥_{Λ_2^{-1}} | (by λ_min(Λ_1) ≥ 1)
≤ ∥w_1 − w_2∥_2 + |β_1 − β_2| + B √(∥Λ_1^{-1} − Λ_2^{-1}∥_F),

where we used that for any a, b ≥ 0, √a − √b = (a − b)/(√a + √b) ≤ √|a − b|. Proceeding as in Lemma 8.6 of the RL theory monograph (Agarwal et al., 2019), we have the result.

In this section, we use the following bonus scaling parameter:

β := O( √(dN) M_V ε_ms + M_V M_µ d √(log(dN M_V/δ)) ). (7)

The following high-probability event (E_model) is a key step in our proof. Essentially, Theorem H.1 guarantees that, for all functions in V_h, the model we learn is an accurate predictor of the expectation, up to a bonus and some vanishing terms. For all the following lemmas and theorems, suppose Assumption H.1 holds and the bonus scaling β is set as in (7). Throughout the section, ζ_h(τ_h) refers to an indicator function of the trajectory τ_h, where τ_h = (s_0, s_1, ..., s_h). As before, the expectations E_π[g(τ_h)] are with respect to the distribution of trajectories when π is executed in the environment P⋆.

Theorem H.1. Let δ ∈ (0,1). Then, w.p. 1−δ, for any time h, episode e, indicator functions ζ_1, ..., ζ_H, and policy π, we have

sup_{f∈V_h} | E_π[ (P̂_{h,e}(s_h,a_h) − P⋆_h(s_h,a_h)) f · ζ_h(τ_h) ] | ≤ β E_π[b_{h,e}(s_h,a_h) ζ_h(τ_h)] + ∥V_h∥_∞ ε_ms. (E_model)

Proof. Condition on the outcome of Lemma H.5, which implies that w.p.
1 -δ, for any h, e, π, ζ h , we have sup f ∈V h E π P h,e (s h , a h ) -P h (s h , a h ) f ζ h (τ h ) ≤ βE π [b e h (s h , a h )ζ h (τ h )] . Also, for any h, e, π, ζ h , by Assumption H.1, we have (w.p. 1) that sup f ∈V h E π P h (s h , a h ) -P ⋆ h (s h , a h ) f ζ h (τ h ) ≤ E π sup f ∈V h P h (s h , a h ) -P ⋆ h (s h , a h ) f ζ h (τ h ) ≤ E π sup f ∈V h P h (s h , a h ) -P ⋆ h (s h , a h ) f ≤ ∥V h ∥ ∞ ε ms . Combining these two yields the result, as sup f ∈V h E π P h,e (s h , a h ) -P ⋆ h (s h , a h ) f ζ h (τ h ) ≤ sup f ∈V h E π P h (s h , a h ) -P ⋆ h (s h , a h ) f ζ h (τ h ) + sup f ∈V h E π P h,e (s h , a h ) -P h (s h , a h ) f ζ h (τ h ) . Lemma H.5 Suppose Assumption H.1 and the bonus scaling β is set as in (7). For any δ ∈ (0, 1), w.p. at least 1 -δ, we have for any time h, episode e, and policy π, ∀s h , a h : sup f ∈V h P h,e (s h , a h ) -P h (s h , a h ) f ≤ βb h,e (s h , a h ). Proof. Consider any h, e, π. Define ε k h := -δ(s k h+1 ) + P ⋆ h (s k h+1 |s k h , a k h ), so that E[ε k h | H k-1 ] = 0 , where H k-1 contains the states and actions before episode k. In what follows, we slightly abuse notation, as P (s, a) ϕ T (s, a) will denote the outer product, and hence a R S×d quantity. µ h,e Λ h,e = e-1 k=1 δ(s k h+1 ) ϕ h (s k h , a k h ) T = e-1 k=1 P ⋆ h (s k h , a k h ) -P h (s k h , a k h ) ϕ h (s k h , a k h ) T + k=0 P h (s k h , a k h ) -ε k h ϕ h (s k h , a k h ) T = e-1 k=1 P ⋆ h (s k h , a k h ) -P h (s k h , a k h ) ϕ h (s k h , a k h ) T + µ h (Λ h,e -I) - e-1 k=0 ε k h ϕ h (s k h , a k h ) T . Rearranging, we have µ h,e -µ h = e-1 k=0 P ⋆ h (s k h , a k h ) -P h (s k h , a k h ) ϕ h (s k h , a k h ) T (Λ h,e ) -1 -µ h (Λ h,e ) -1 - e-1 k=0 ε k h ϕ h (s k h , a k h ) T (Λ h,e ) -1 . Now let f ∈ V h be arbitrary. 
For any s h , a h , multiply the above with ϕ h (s h , a h ) and multiply with f , we have P h,e (s h , a h ) -P h (s h , a h ) f = f T ( µ h,e -µ h ) ϕ h (s h , a h ) ≤ f T e-1 k=1 P ⋆ h (s k h , a k h ) -P h (s k h , a k h ) ϕ h (s k h , a k h ) T Λ -1 h,e ϕ h (s h , a h ) Term(a) + f T µ h Λ -1 h,e ϕ h (s h , a h ) Term(b) + f T e-1 k=1 ε k h ϕ h (s k h , a k h ) T Λ -1 h,e ϕ h (s h , a h ) Term(c) . We can deterministically bound Term (b) as follows, sup f ∈V h f T µ h Λ -1 h,e ϕ h (s h , a h ) = sup f ∈V h (Λ -1/2 h,e f T µ h ) T Λ -1/2 h,e ϕ h (s h , a h ) ≤ sup f ∈V h Λ -1/2 h,e 2 ∥f T µ h ∥ 2 b h,e (s h , a h ) ≤ ∥V h ∥ ∞ M µ √ db h,e (s h , a h ). (by Assumption H.1) This term will be lower order compared to the other two. We now derive the bound for Term (c) for any fixed f ∈ V h . Observe that f T e-1 k=1 ε k h ϕ h (s k h , a k h ) T Λ -1 h,e ϕ h (s h , a h ) = Λ -1/2 h,e e-1 k=1 ϕ h (s k h , a k h )(f T ε k h ) T Λ -1/2 h,e ϕ h (s h , a h ) ≤ e-1 k=1 ϕ h (s k h , a k h T ε k h ) Λ -1 h,e b h,e (s h , a h ). Now we argue w.p. 1 -δ, for any e, h we have e-1 k=1 ϕ h (s k h , a k h )(f T ε k h ) Λ -1 h,e ≤ 2∥V h ∥ ∞ 2 log(1/δ) + d log(N + 1) , which implies the claim about all s h , a h . Indeed, we can apply Lemma H.1. Checking the preconditions, E P ⋆ h (s h ,a h ) f T ε k h | H k-1 = 0, σ ≤ |f T ε k h | ≤ ∥f ∥ ∞ ∥ε k h ∥ 1 ≤ 2∥V h ∥ ∞ , det(Σ 0 ) = det I = 1, and det(Σ t ) = det(Λ h,e ) ≤ (e + 1) d since the largest eigenvalue is e + 1. So, w.p. at least 1 -δ, for all e, we have the above inequality. Thus, for any fixed f ∈ V h , w.p. 1 -δ, for all e, h we have, P h,e (s h , a h ) -P h (s h , a h ) f ≤ Term(a) + Term(b) + Term(c) ≤ 4∥V h ∥ ∞ (1 + M µ ) log(1/δ) + d log(N ) + √ dN ∥V h ∥ ∞ ε ms b h,e (s h , a h ) + ∥V h ∥ ∞ M µ √ d b h,e (s h , a h ) + 4∥V h ∥ ∞ log(1/δ) + d log(N ) b h,e (s h , a h ) ≲ √ dN ∥V h ∥ ∞ ε ms + ∥V h ∥ ∞ M µ log(1/δ) + d log(N ) b h,e (s h , a h ). Now we apply a covering argument. 
Namely, we union bound the above argument over every element of an ε_net-net of V_h. For any f ∈ V_h, let f̄ be its neighbor in the net, so ∥f̄ − f∥_∞ ≤ ε_net, and we have

| (P̂_{h,e}(s_h,a_h) − P̃_h(s_h,a_h)) f | ≤ | (P̂_{h,e}(s_h,a_h) − P̃_h(s_h,a_h)) f̄ | + | (P̂_{h,e}(s_h,a_h) − P̃_h(s_h,a_h)) (f̄ − f) |,

and | (P̂_{h,e}(s_h,a_h) − P̃_h(s_h,a_h)) (f̄ − f) | ≲ ∥f̄ − f∥_∞ (N + 1) ≲ ε_net N. Setting ε_net = 1/N, the metric entropy is of the order d log(N(M_V + B)) + log(BN) + d² log(BdN), and the error incurred by this epsilon net is a constant, which is of lower order. Thus, we have for all s_h, a_h:

sup_{f∈V_h} | (P̂_{h,e}(s_h,a_h) − P̃_h(s_h,a_h)) f | ≲ ( √(dN) ∥V_h∥_∞ ε_ms + ∥V_h∥_∞ M_µ √( log(1/δ) + d log(M_V) + d² log(BdN) ) ) b_{h,e}(s_h,a_h)
≲ ( √(dN) M_V ε_ms + M_V M_µ √( log(1/δ) + d log(M_V) + d² log(BdN) ) ) b_{h,e}(s_h,a_h).

Note that β scales as √(log B), so one can find a valid B by solving β ≤ B for B.

Lemma H.6. Let f ∈ V_h. For any δ ∈ (0,1), w.p. at least 1−δ, for any time h and episode e, we have for all s_h, a_h:

| f^⊤ Σ_{k=1}^{e−1} (P⋆_h(s^k_h,a^k_h) − P̃_h(s^k_h,a^k_h)) ϕ̂_h(s^k_h,a^k_h)^⊤ Λ̂^{-1}_{h,e} ϕ̂_h(s_h,a_h) | ≤ ( 4∥V_h∥_∞ (1 + M_µ) √( log(1/δ) + d log(N) ) + √(dN) ∥V_h∥_∞ ε_ms ) b_{h,e}(s_h,a_h).

Proof. First observe that

f^⊤ Σ_{k=1}^{e−1} (P⋆_h(s^k_h,a^k_h) − µ̃_h ϕ̂_h(s^k_h,a^k_h)) ϕ̂_h(s^k_h,a^k_h)^⊤ Λ̂^{-1}_{h,e} ϕ̂_h(s_h,a_h)
= ( Λ̂^{-1/2}_{h,e} Σ_{k=1}^{e−1} ϕ̂_h(s^k_h,a^k_h) f^⊤(P⋆_h(s^k_h,a^k_h) − µ̃_h ϕ̂_h(s^k_h,a^k_h)) )^⊤ Λ̂^{-1/2}_{h,e} ϕ̂_h(s_h,a_h)
≤ ∥ Σ_{k=1}^{e−1} ϕ̂_h(s^k_h,a^k_h) ε̃_k ∥_{Λ̂^{-1}_{h,e}} b_{h,e}(s_h,a_h),

where ε̃_k = (P⋆_h(s^k_h,a^k_h) − P̃_h(s^k_h,a^k_h)) f. Now we will argue that w.p. 1−δ, for all e, h,

∥ Σ_{k=1}^{e−1} ϕ̂_h(s^k_h,a^k_h) ε̃_k ∥_{Λ̂^{-1}_{h,e}} ≤ 4∥V_h∥_∞ (1 + M_µ) √( log(1/δ) + d log(N) ) + √(dN) ∥V_h∥_∞ ε_ms,

which will imply the claim for all s_h, a_h. Apply self-normalized martingale concentration (Lemma H.1) to X_i = ϕ̂_h(s^i_h,a^i_h) and ε_i = ε̃_i − E[ε̃_i | H_{i−1}], where the expectation is over (s^i_h, a^i_h) in the definition of ε̃_i.
To see sub-Gaussianity, bound the envelope, | ε k | ≤ ∥f ∥ ∞ ∥P ⋆ h (s k h , a k h )-µ h ϕ h (s k h , a k h )∥ T V ≤ ∥V h ∥ ∞ (1+M µ ), and thus σ ≤ |ε k | ≤ 2∥V h ∥ ∞ (1 + M µ ) . Now compute the determinants: det(Λ h,0 ) = 1 and since λ max (Λ h,e ) ≤ e + 1, we have that log det(Λ h,e ) ≤ d log(e + 1). Hence, w.p. at least 1 -δ, we have ∀e : e-1 k=1 ϕ h (s k h , a k h ) ( ε k -E [ ε k | H k-1 ]) Λ -1 h,e ≤ 2∥V h ∥ ∞ (1 + M µ ) 2 log(1/δ) + d log(N + 1). By Assumption H.1 applied to π k (the data-generating policy for episode k), we have |E [ ε k | H k-1 ]| ≤ ∥V h ∥ ∞ ε ms . Recall for any scalars c i and vectors x i , we have ∥ i c i x i ∥ ≤ i |c i |∥x i ∥ ≤ i c 2 i i ∥x i ∥ 2 . Thus, e-1 k=1 ϕ h (s k h , a k h )E [ ε k | H k-1 ] Λ -1 h,e ≤ e-1 k=1 ∥ ϕ h (s k h , a k h )∥ 2 Λ -1 h,e e-1 k=1 E [ ε k | H k-1 ] 2 ≤ √ d (e -1)∥V h ∥ ∞ ε ms . (by Lemma H.2) Combining these two bounds concludes the proof. Lemma H.7 (Optimism) Suppose (E model ) holds. Let ι = ∥V h ∥ ∞ ε ms . Then, for any episode e = 1, 2, ..., N , we have ∀h = 0, 1, ..., H -1 : E π ⋆ Q ⋆ h (s h , a h ) -Q h,e (s h , a h ) ζ h (τ h ) ≤ (H -h)ι, ∀h = 0, 1, ..., H -1 : E π ⋆ V ⋆ h (s h ) -V h,e (s h ) ζ h-1 (τ h-1 ) ≤ (H -h)ι, where ζ h (s h ) := I h,e (s h , π e h (s h )) ≤ M V ζ h (τ h ) = h h ′ =0 ζ h ′ (s h ′ ). Abusing notation, ζ -1 (•) is the constant function 1. In particular, we have that E d0 V ⋆ 0 (s 0 ) -V 0,e (s 0 ) ≤ Hι. Proof. Fix any episode e. We prove both claims via induction on h = H, H -1, H -2..., 1, 0. The base case holds trivially since V H,e and V ⋆ H are zero at every state by definition. Indeed, we have that for any π, including π ⋆ , that E π P ⋆ H-1 (s H-1 , a H-1 )(V ⋆ H -V H,e ) ζ H-1 (τ H-1 ) = E π [(0 -0) ζ H-1 (τ H-1 )] = 0. Now let's show the inductive step. Let h ∈ {H -1, H -2, ..., 1, 0} be arbitrary and suppose the inductive hypothesis. So suppose that V -optimism holds at h + 1 (we don't even need Q-optimism in the future), i.e. 
E π ⋆ P ⋆ h (s h , a h )(V ⋆ h+1 -V h+1,e ) ζ h (τ h ) = E π ⋆ V ⋆ h+1 (s h+1 ) -V h+1,e (s h+1 ) ζ h (τ h ) ≤ (H -h -1)ι (IH) Recalling that Q h,e (s h , a h ) = r h (s h , a h ) + P h,e (s h , a h ) V h+1,e + βb h,e (s h , a h ), we have E π ⋆ Q ⋆ h (s h , a h ) -Q h,e (s h , a h ) ζ h (τ h ) = E π ⋆ P ⋆ h (s h , a h )V ⋆ h+1 -P h,e (s h , a h ) V h+1,e -βb h,e (s h , a h ) ζ h (τ h ) ≤ E π ⋆ P ⋆ h (s h , a h ) -P h,e (s h , a h ) V h+1,e -βb h,e (s h , a h ) ζ h (τ h ) + (H -h -1)ι (by (IH)) ≤ E π ⋆ P h,e (s h , a h ) -P ⋆ h (s h , a h ) V h+1,e ζ h (τ h ) -E π ⋆ [βb h,e (s h , a h )ζ h (τ h )] + (H -h -1)ι ≤ ι + (H -h -1)ι = (H -h)ι, (by (E model ) and V h+1,e ∈ V h (Lemma H.3)) which proves the Q-optimism claim. Now let's prove V -optimism. E π ⋆ V ⋆ h (s h ) -V h,e (s h ) ζ h-1 (τ h-1 ) = E π ⋆ Q ⋆ h (s h , a h ) -Q h,e (s h , π e h (s h )) ≤M V ζ h-1 (τ h-1 ) = E π ⋆ [(Q ⋆ h (s h , a h ) -M V )ζ h-1 (τ h-1 )(1 -ζ h (s h ))] + E π ⋆ Q ⋆ h (s h , a h ) -Q h,e (s h , π e h (s h )) ζ h-1 (τ h-1 )ζ h (s h ) ≤ E π ⋆ Q ⋆ h (s h , a h ) -Q h,e (s h , π e h (s h )) ζ h (τ h ) ≤ E π ⋆ Q ⋆ h (s h , a ) -Q h,e (s h , π ⋆ h (s h )) ζ h (τ h ) ≤ (H -h)ι, by Q-optimism. Remark H.1 We did not require P h,e to be a valid transition! It is in general unbounded and can even have negative entries! Lemma H.8 (Simulation) For any episode e = 1, 2, ..., N , we have E d0 V 0,e (s 0 ) -V π e 0 (s 0 ) ≤ H-1 h=0 E π e b h,e (s h , a h ) + ( P h (s h , a h ) -P ⋆ h (s h , a h )) V h+1,e ] Proof. We progressively unravel the left hand side. For any s 0 , V 0,e (s 0 ) -V π e 0 (s 0 ) ≤ Q 0,e (s 0 , π e 0 (s 0 )) -Q π e 0 (s 0 , π e 0 (s 0 )) = b 0,e (s 0 , π e 0 (s)) + P 0,e (s 0 , π e 0 (s)) -P ⋆ 0 (s 0 , π e 0 (s 0 )) V 1,e + P ⋆ 0 (s 0 , π e 0 (s 0 )) V 1,e -V π e 1 , where the inequality is due to the thresholding on the value function. Now, perform this recursively on the P ⋆ 0 (s 0 , π e 0 (s 0 )) V 1,e -V π e 1 term. Doing this unravelling h times gives the result. 
Theorem H.2. Suppose Assumption H.1 holds and let β and the bonus be defined as in (7). Let δ ∈ (0,1). Then w.p. at least 1−δ, we have that the regret of LSVI is sublinear:

N V⋆ − Σ_{e=0}^{N−1} V^{π^e} ≤ O( dHN M_V √(log(HN/δ)) ε_ms + d^{1.5} H √N M_V M_µ log(dHN/δ) ),

where O hides log dependence.

Proof. We first condition on the high-probability event (E_model), which occurs w.p. at least 1−δ. Fix any arbitrary episode e. By optimism (Lemma H.7) and the simulation lemma (Lemma H.8),

E_{d_0}[V⋆_0(s_0) − V^{π^e}_0(s_0)] ≤ E_{d_0}[V̂_{0,e}(s_0) − V^{π^e}_0(s_0)] + Hι ≤ Σ_{h=0}^{H−1} E_{π^e}[ β b_{h,e}(s_h,a_h) + (P̂_{h,e}(s_h,a_h) − P⋆_h(s_h,a_h)) V̂_{h+1,e} ] + Hι.

Summing over e = 1, ..., N, applying (E_model) to the model-error terms, and bounding the summed bonuses via the elliptical potential lemma and Azuma's inequality,

Σ_{e=1}^N E_{d_0}[V⋆_0(s_0) − V^{π^e}_0(s_0)] ≤ 2HNι + 2β Σ_{h=0}^{H−1} Σ_{e=1}^N E_{π^e}[b_{h,e}(s_h,a_h)]
≲ HNι + βH √(dN log(N)) + βH √(N log(HN/δ))
≲ HNι + ( √(dN) M_V ε_ms + M_V M_µ d √(log(dHN/δ)) ) · H √(dN log(HN/δ))
= HNι + dHN M_V √(log(HN/δ)) ε_ms + d^{1.5} H √N M_V M_µ log(dHN/δ).

Since HNι = HN M_V ε_ms is of lower order relative to the first term, we can simply drop it. This concludes the proof.

Corollary H.1. By setting δ = 1/N, we have that the expected regret also has the same rate as above.

Proof. By the law of total probability, since regret is at most NH, E[Reg_N] ≤ E[Reg_N | (E_model)] + NH(1 − P((E_model))) ≤ E[Reg_N | (E_model)] + H. Since H is lower order, we have the same rate.

First, let's calculate the reward-free model-learning sample complexity, i.e., the number of samples required for the estimated model of source task k. Recall that we need this to be sufficiently large that ε_TV = 1/N_LSVI-UCB. As required by Lemma 4.2, we need,

I PROOF OF MAIN THEOREMS

N_REWARDFREE = O( A³d⁴H⁶ log(|Φ||Υ|/δ) N²_LSVI-UCB ) = O( A³d⁴H⁶ log(|Φ||Υ|/δ) (A³d⁶H⁸ψ⁻²)² ) = O( A⁹d¹⁶H²² ψ⁻⁴ log(|Φ||Υ|/δ) ).

Second, we calculate the cross-sampling sample complexity. Recall that n is the number of samples in each pairwise dataset. In order to reduce ε_ms to 1/√T, by Lemma 4.1 we need

ε_ms ≤ √( A α³_max K / λ_min ) ζ_n^{1/2} ≤ √( A α³_max K / λ_min ) √( (1/n)( log(|Φ|/δ) + K log|Υ| ) ) ≤ 1/√T, (by (2))

which implies that it suffices to take n ≥ λ^{-1}_min A α³_max K T ( log(|Φ|/δ) + K log|Υ| ). Incorporating the coverage result from Lemma 4.2 gives

n ≥ A⁴ α³_max d⁵ H⁷ K T ψ⁻² ( log(|Φ|/δ) + K log|Υ| ).

Since each task is in at most K−2 pairwise datasets, each of size n, the total pre-training sample complexity per task is at most

N_REWARDFREE + (K−2)·n = O( A⁹d¹⁶H²²ψ⁻⁴ log(|Φ||Υ|/δ) + A⁴ α³_max d⁵ H⁷ K² T ψ⁻² ( log(|Φ|/δ) + K log|Υ| ) ).

Now we prove Theorem 5.2, restated below.

Theorem 5.2 (Regret with online access). Suppose Assumptions 3.1-3.4, 5.1, and 5.2 hold. W.p. 1−δ, Algorithm 5 with appropriate parameters achieves a regret in the target of O( ᾱ d^{1.5} H² √T log(1/δ) ), with at most poly( A, α_max, d, H, K, T, ψ⁻¹, ψ⁻¹_raw, log(|Φ||Υ|/δ) ) online queries in the source tasks.

Proof of Theorem 5.2. We follow the same format as the proof of Theorem 4.1; the regret bound is identical. Now let's compute the pre-training sample complexity. The regret bound requires us to set ε_ms ≤ 1/√T. Here, our ε_ms comes from Lemma G.1, so we need

α_max K^{1/2} (ψ_raw λ_min)^{-1/2} √( (1/n)( log(|Φ|/δ) + K log|Υ| ) ) ≤ 1/√T,

which implies that it suffices to take n ≥ ( α²_max K T / (ψ_raw λ_min) ) ( log(|Φ|/δ) + K log|Υ| ). Plugging in the coverage of Lemma 4.2,

n ≥ ( α²_max K T / ψ_raw ) ( log(|Φ|/δ) + K log|Υ| ) ( A⁻³ d⁻⁵ H⁻⁷ ψ² )^{-1} = A³ α²_max d⁵ H⁷ K T ψ⁻¹_raw ψ⁻² ( log(|Φ|/δ) + K log|Υ| ).

... the observation space. In this setting, the size of the observation depends on the number of source environments K. Let the size of the original observation space be O = |O|; the size of the observation space for comblock-PO is then KO.
For the k-th source environment, where k ∈ [K], the environment first generates the O-dimensional observation vector as in the original comblock, and then embeds it into the ((k−1)O)-th through (kO)-th entries of the KO-dimensional observation vector, which is 0 everywhere else. Thus the observation spaces of the source environments are disjoint (hence the name partitioned observations). For the target environment, since the latent dynamics are the same, we only need to design the emission distribution: for each latent state s_{i;h}, we assign the emission distribution uniformly at random from one of the sources.
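The partitioned-observation construction above is a simple block embedding. The sketch below implements it; the function name, the 1-indexed `k` convention, and the toy sizes are illustrative assumptions.

```python
import random

def embed_obs(obs, k, K):
    """Embed an O-dimensional observation from source k (1-indexed) into block k
    of a K*O-dimensional vector, leaving every other entry zero, so the
    observation spaces of distinct sources are disjoint."""
    O = len(obs)
    out = [0.0] * (K * O)
    out[(k - 1) * O:k * O] = obs
    return out

random.seed(0)
O, K = 4, 3
obs = [random.random() for _ in range(O)]
x = embed_obs(obs, 2, K)  # observation from source 2 lands in the middle block
print(x[:O] == [0.0] * O and x[O:2 * O] == obs and x[2 * O:] == [0.0] * O)
```

The check prints True: only the block belonging to source 2 is populated, so observations from different sources can never coincide.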

J.4 IMPLEMENTATION DETAILS

Our implementation builds on BRIEE (Zhang et al., 2022). In the Multi-task REPLEARN stage, we require the learned features to predict the Bellman backups of all the sources simultaneously. Therefore, in each iteration we have k discriminators and k sets of linear weights (instead of 1 in BRIEE), where k is the number of source environments. For the deployment stage, we implement LSVI following Algorithm 6. To create the training dataset for Multi-task REPLEARN, for each environment pair (i, j) with i ≠ j, we collect 500 samples for each timestep h. For each pair (i, i), we collect 500 × (k−1) × k samples for each timestep h. Thus we ensure that the number of samples from cross transitions of different environments is the same as the number of samples from cross transitions of the same environment. For the online setting, we simply sample 1000 × (k−1) × k samples for each (i, i) cross transition to ensure that the total number of samples is the same for G-REPTRANSFER and O-REPTRANSFER. To sample the initial state-action pair (i.e., the (s, ã) pair in (1)), for 90% of the samples we follow the final policy from each source environment trained using BRIEE. For the remaining 10%, we follow the same policy to state s, and then take a uniform random action to get ã. With this sampling scheme we ensure that Assumption 3.3 is satisfied. In the setting of Section 6, we follow a simpler procedure to ensure that the samples are more balanced among the three states: we skip the first sampling step from environment i (i.e., sampling s given (s, ã)), and simply reset environment i to s, where s is one of the three states with equal probability, and generate the observation accordingly. Note that such a visitation distribution is also achievable in the online setting with a more nuanced sampling procedure, and in the experiments we use the same sampling procedure for both G-REPTRANSFER and O-REPTRANSFER for a fair comparison.
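The dataset bookkeeping above can be sketched as a small counting helper. The constants (500 per cross pair, 500 × (k−1) × k per same-environment pair, 1000 × (k−1) × k in the online setting) are taken from the text; the function itself, and the assumption that cross pairs require generative access, are illustrative.

```python
def sample_counts(k, online=False):
    """Per-timestep sample counts {(i, j): n} for k source environments.

    Generative access: every ordered pair (i, j) exists; cross pairs (i != j)
    get 500 samples, same-environment pairs (i, i) get 500 * (k - 1) * k.
    Online access (assumption: no generative resets, so no cross pairs):
    only (i, i) pairs exist, each with 1000 * (k - 1) * k samples.
    """
    counts = {}
    for i in range(k):
        for j in range(k):
            if i == j:
                counts[(i, j)] = (1000 if online else 500) * (k - 1) * k
            elif not online:
                counts[(i, j)] = 500
    return counts

gen = sample_counts(3)
print(gen[(0, 1)], gen[(0, 0)])  # cross pair vs. same-environment pair, k = 3
```

For k = 3 this prints 500 and 3000, making the per-timestep dataset sizes explicit before any environment interaction code is written.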

J.5 HYPERPARAMETERS

In this section, we record the hyperparameters we tried and the final hyperparameters we use for each baseline. The hyperparameters for REPTRANSFER in the two setups of Section 6 are in Table 3 and Table 4, respectively. The hyperparameters for BRIEE are in Table 5. We use the same set of hyperparameters for G-REPTRANSFER and O-REPTRANSFER.







Figure 2: (a): Visualization of the decoder learned in the source task (top) and by G-REPTRANSFER (bottom). (b): Visualization of the decoder learned by O-REPTRANSFER (top) and G-REPTRANSFER (bottom). For each baseline, the h-th column in the i-th image denotes the averaged decoded states over the 30 observations generated by latent state z_{i,h}, for i ∈ {0, 1, 2} and h ∈ [25], from the corresponding target environment. The optimal decoder should recover the latent states up to a permutation. In Fig. 2a (top), note that the features learned in the source task fail to solve the target because of the collapse at timestep 5: observations from both state 0 and state 1 are mapped to state 0. In the source task where this feature is trained, such a collapse can happen when states 0 and 1 have identical latent transitions (for a detailed discussion we refer to Misra et al. (2020)). In Fig. 2b (top), REPTRANSFER with only online access learns an incorrect decoder when the source tasks' observation spaces are disjoint, because the learned feature can decode each source task with a different permutation.

Algorithm 4 REWARDFREE REP-UCB
1: Input: Regularizer λ_n, bonus scaling α_n, model class M = Φ × Υ, number of episodes N.
2: Initialize π_0 as random and D_{h,0}, D'_{h,0} = ∅.
3: for episode n = 1, 2, ..., N do
4:

Transfer learning with online access
PRE-TRAINING PHASE
Input: num. LSVI-UCB episodes N_LSVI-UCB, num. model-learning episodes N_REWARDFREE, size of cross-sampled datasets n, failure probability δ.
1: for source task k = 1, ..., K - 1 do
2:

we prove Theorem 4.1 and Theorem 4.2.

Theorem 4.1 (Regret under generative source access). Suppose Assumptions 3.1-3.4 and 4.1 hold, and suppose the input policies π_k are λ_min-exploratory. Then, for any δ ∈ (0, 1), w.p. 1 - δ, REPTRANSFER deployed in the target task has regret at most Õ(ᾱ H² d^{1.5} √T log(1/δ)), with at most Kn generative accesses per source task, where n = O(λ_min^{-1} A α_max³ K T (log(|Φ|/δ) + K log |Υ|)).

Theorem 4.2 (Regret for REPTRANSFER with REWARDFREE subroutine). Suppose Assumptions 3.1-3.4 and 4.1 hold, and let δ ∈ (0, 1). Let {π_k}_{k=1}^{K-1} be exploratory policies learned by running REWARDFREE on each source task with N_LSVI-UCB and N_REWARDFREE set as in Lemma 4.2. Then, w.p. 1 - δ, running REPTRANSFER with {π_k}_{k=1}^{K-1} has regret in the target task of Õ(ᾱ H² d^{1.5} √T log(1/δ)), with at most O(A⁴ α_max³ d⁵ H⁷ K² T ψ^{-2} (log(|Φ|/δ) + K log |Υ|)) generative accesses per source task.

Proof of Theorem 4.1 and Theorem 4.2. For the regret bound, set M_V = H and M_µ = ᾱ and apply Theorem H.2. This choice of M_µ is valid by the argument following Assumption H.1. This gives a regret bound of Õ(d H² T ε_ms + ᾱ d^{1.5} H² √T log(1/δ)), where ε_ms can be made smaller than 1/√T, in which case the second term dominates. Now, we calculate the pre-training phase sample complexity in a source task.
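The arithmetic behind "the second term dominates" can be spelled out in one line; this is a sketch using the bound from Theorem H.2 as stated above, with ≲ hiding universal constants:

```latex
% Regret bound from Theorem H.2 with M_V = H, M_\mu = \bar\alpha:
\mathrm{Regret}(T)
  \lesssim d H^2 T \,\varepsilon_{\mathrm{ms}}
         + \bar\alpha\, d^{1.5} H^2 \sqrt{T}\,\log(1/\delta)
% collect enough pre-training samples that \varepsilon_{\mathrm{ms}} \le 1/\sqrt{T}:
  \lesssim d H^2 \sqrt{T}
         + \bar\alpha\, d^{1.5} H^2 \sqrt{T}\,\log(1/\delta)
% the second term dominates whenever \bar\alpha\, d^{0.5} \log(1/\delta) \ge 1:
  \lesssim \bar\alpha\, d^{1.5} H^2 \sqrt{T}\,\log(1/\delta).
```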

Figure 4: Visualization of decoders from source 2. Note that collapses happen at timesteps 1 and 10.

J.6.2 VISUALIZATIONS FROM SECTION 6

Input: MDP P⋆ with online access, num. LSVI-UCB episodes N_LSVI-UCB, num. model-learning episodes N_REWARDFREE, failure probability δ.


List of Notations
S, A, A: state and action spaces, and A = |A|.
∆(S): the set of distributions supported on S.
λ_min(A): smallest eigenvalue of matrix A.
e_j: one-hot encoding of j, i.e., 0 at each index except the one corresponding to j; the length of the vector is implied from context.
(x)_{≤y}: min{x, y}.
H: episode length of MDPs, a.k.a. the time horizon. We index steps as h = 0, 1, ..., H - 1.
Ground truth transition at time h for task k.
r: constants defined in the point-wise linear span assumption (Assumption 3.4).

Now apply a self-normalized elliptical potential bound to the first term, giving that

Azuma's inequality applied to the martingale difference ∆ := E^{π_e}[b_{h,e}(s_h, a_h)] - b_{h,e}(s_h^e, a_h^e), which has envelope bounded by 2, implies that w.p. 1 - δ,


Here, we only collect one dataset, so the total pre-training sample complexity is

J EXPERIMENT DETAILS

J.1 CONSTRUCTION OF COMBLOCK

In this section, we first introduce the vanilla combination lock (comblock) environment that is widely used as a benchmark for Block MDP algorithms. We provide a visualization of the comblock environment in Fig. 1(a). Concretely, the environment has horizon H, three latent states z_{i,h}, i ∈ {0, 1, 2}, for each timestep h, and 10 actions. Among the three latent states, we denote z_0 and z_1 as the good states, which lead to the final reward, and z_2 as the bad state. At the beginning of the task, the environment uniformly and independently samples 1 out of the 10 actions for each good state z_{0,h} and z_{1,h} at each timestep h, and we denote these actions a_{0,h}, a_{1,h} as the optimal actions (corresponding to each latent state). These optimal actions, together with the task itself, determine the dynamics of the environment. At each good latent state z_{0,h} or z_{1,h}, if the agent takes the optimal action, the environment transitions to either good state at the next timestep (i.e., z_{0,h+1} or z_{1,h+1}) with equal probability. Otherwise, if the agent takes any of the 9 remaining actions, the environment transitions to the bad state z_{2,h+1} deterministically, and the bad state transitions only to the bad state at the next timestep, also deterministically. The agent receives a reward in two situations: upon arriving at a good state at the last timestep, it receives a reward of 1; and upon the first-ever transition into the bad state, it receives an "anti-shaped" reward of 0.1 with probability 0.5. This design makes greedy algorithms without strategic exploration, such as policy optimization methods, fail easily. For the initial state distribution, the environment starts in z_{0,0} or z_{1,0} with equal probability. The dimension of the observation is 2^{⌈log(H+|S|+1)⌉}. For the emission distribution, given a latent state z_{i,h}, the observation is generated by first concatenating the one-hot vectors of the state and the timestep, adding i.i.d. N(0, 0.1) noise to each entry, and appending 0s at the end if necessary; finally, we apply a linear transformation to the observation with a Hadamard matrix. Note that without a good feature or strategic exploration, it takes 10^H actions in expectation to reach the final goal with random actions.

J.2 CONSTRUCTION OF TRANSFER SETUP IN SECTION 6

In this section, we introduce the detailed construction of the experiment in Section 6. For the source environments, we generate 5 random vanilla comblock environments as described in Section J.1. This ensures that the emission distribution is shared across the sources, while the latent dynamics differ because the optimal actions are selected independently at random. For the target environment, at each timestep h, we randomly take the optimal actions at h from one of the sources and set them to be the optimal actions of the target environment at timestep h, provided the selected optimal actions differ between the two good states; otherwise we keep resampling until they differ. Under this construction, since we fix the emission distribution, Assumption 3.4 is satisfied by setting α = 1 for the source environment from which we select the optimal actions and α = 0 for the other sources, at each timestep. To see how Assumption 5.1 is satisfied, recall that the comblock environment naturally satisfies Assumption 3.3, and the identical emissions imply that the conditional ratio of all observations between source and target is 1.
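The latent dynamics and rewards described above can be sketched in a few lines. This is an illustrative sketch of the latent-level comblock only (observations omitted); the class name and interface are our own, not the benchmark code.

```python
import numpy as np

class CombLock:
    """Minimal sketch of the vanilla comblock latent dynamics (no emissions)."""

    def __init__(self, horizon, num_actions=10, rng=None):
        self.H = horizon
        self.A = num_actions
        self.rng = rng if rng is not None else np.random.default_rng(0)
        # one optimal action per good state (0 and 1) per timestep
        self.opt = self.rng.integers(num_actions, size=(2, horizon))

    def reset(self):
        self.h = 0
        self.z = int(self.rng.integers(2))   # start in good state 0 or 1
        self.got_antishaped = False
        return self.z

    def step(self, a):
        reward = 0.0
        if self.z < 2 and a == self.opt[self.z, self.h]:
            next_z = int(self.rng.integers(2))   # move to a uniform good state
        else:
            if self.z < 2 and not self.got_antishaped:
                # anti-shaped reward of 0.1 w.p. 0.5 on the first bad transition
                reward = 0.1 * float(self.rng.random() < 0.5)
                self.got_antishaped = True
            next_z = 2                            # bad state is absorbing
        self.h += 1
        done = self.h == self.H
        if done and next_z < 2:
            reward = 1.0                          # goal reward at the last step
        self.z = next_z
        return next_z, reward, done
```

Playing the optimal action at every step collects the final reward of 1; a single wrong action traps the agent in the bad state, leaving at most the anti-shaped 0.1.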

J.3 CONSTRUCTION OF TRANSFER SETUP IN SECTION 6

Now we introduce the construction of the Comblock with Partitioned Observation (Comblock-PO) environment. Compared with the vanilla comblock environment, the major difference is that the observation spaces of the different tasks are disjoint (partitioned).

In this section, we provide a comprehensive visualization of the decoders for all baselines in the target environment. We observe that the behaviors of all baselines are similar across the 5 random seeds; thus, to avoid redundancy, we only show the visualization from 1 random seed. We provide an example in Fig. 3 of how to interpret the visualization: let the emission function of the target environment be o, and let the decoder we are evaluating be ϕ. To generate the blue block in Fig. 3, we sample 30 observations {s_n}_{n=1}^{30} from the target environment at z_{1,13}, the latent state 1 (the title of the subplot) at timestep 13 (the x-axis); concretely, s_n ∼ o(· | z_{1,13}). The blue block denotes the three-dimensional decoded latent state ẑ averaged over these 30 observations: ẑ = (1/30) Σ_{n=1}^{30} ϕ(s_n). We record the visualizations of the 5 sources in Fig. 3 to Fig. 7; O-REPTRANSFER in Fig. 8; G-REPTRANSFER in Fig. 9; and running BRIEE on the target in Fig. 10.
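Each block of the visualization is just a Monte-Carlo average of the decoder output over observations emitted from one fixed latent state, which can be sketched as follows. Here `phi` (the decoder) and `sample_obs` (an observation sampler for the fixed latent state) are assumed callables, not parts of the actual code.

```python
import numpy as np

def decoded_block(phi, sample_obs, n=30):
    """Average decoder output over n observations from one latent state.

    phi(s)       -> 3-dimensional decoded state for observation s
    sample_obs() -> one observation drawn from the fixed latent state
    """
    obs = [sample_obs() for _ in range(n)]
    # z_hat = (1/n) * sum_n phi(s_n); a perfect decoder yields a one-hot
    # vector (up to a permutation of the latent states).
    return np.mean([phi(s) for s in obs], axis=0)
```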

