OFFLINE REINFORCEMENT LEARNING FROM RANDOMLY PERTURBED DATA SOURCES

Abstract

Most existing offline reinforcement learning (RL) studies assume that the available dataset is sampled directly from the target task. However, in some practical applications, the available data often come from several related but heterogeneous environments, and a theoretical understanding of efficient learning from heterogeneous offline datasets remains lacking. In this work, we study the problem of offline RL with multiple data sources that are randomly perturbed versions of the target Markov decision process (MDP). A novel HetPEVI algorithm is first proposed, which simultaneously considers two types of uncertainties: sample uncertainties from the finite number of data samples per data source, and source uncertainties due to the finite number of data sources. In particular, the sample uncertainties from all data sources are jointly aggregated, while an additional penalty term is specially constructed to compensate for the source uncertainties. Theoretical analysis demonstrates the near-optimal performance of HetPEVI. More importantly, the costs and benefits of learning with randomly perturbed data sources are explicitly characterized: on one hand, an unavoidable performance loss occurs due to the indirect access to the target MDP; on the other hand, efficient learning is achievable as long as the sources collectively (instead of individually) provide good data coverage. Finally, we extend the study to linear function approximation and propose the HetPEVI-Lin algorithm, which provides additional efficiency guarantees beyond the tabular case.

1. INTRODUCTION

Offline reinforcement learning (RL) (Levine et al., 2020), a.k.a. batch RL (Lange et al., 2012), has received growing interest in recent years. It aims at training RL agents using datasets collected a priori and thus avoids expensive online interactions. Along with its tremendous empirical successes (Kidambi et al., 2020), recent studies have also established theoretical understandings of offline RL (Rashidinejad et al., 2021; Jin et al., 2021b; Duan et al., 2021; Uehara & Sun, 2021). Despite these advances, the majority of offline RL research focuses on learning via data collected exactly from the target task environment (Kumar et al., 2020). In practice, however, it is difficult to ensure that all such data come perfectly from one source environment. Instead, in many cases, it is more reasonable to assume that data are collected from different sources that are perturbed versions of the target task. For example, when training a chatbot (Jaques et al., 2020), the offline dialogue datasets typically consist of short conversations between different people, who naturally have varying language habits. The training objective is the common underlying language structure, e.g., basic grammar, which cannot be reflected in any individual dialogue but must be holistically learned from their aggregation. More examples can be found in healthcare (Tang & Wiens, 2021), autonomous driving (Sallab et al., 2017), and beyond. While a few empirical investigations under the offline meta-RL framework have been reported (Dorfman et al., 2021; Lin et al., 2022; Mitchell et al., 2021), theoretical understandings of effectively and efficiently learning the underlying task from datasets of multiple heterogeneous sources are largely lacking. Motivated by both practical and theoretical limitations, this work makes progress on the underexplored RL problem of learning the target task using data from heterogeneously perturbed data sources.
In particular, we study the problem of learning a target Markov decision process (MDP) from offline datasets sampled from multiple heterogeneously realized source MDPs. Several provably efficient designs are proposed, targeting both tabular and linear MDPs. To the best of our knowledge, this is the first work to propose provably efficient offline RL algorithms that handle perturbed data sources, which can benefit relevant applications and further shed light on the theoretical understanding of offline meta-RL. The contributions are summarized as follows: • We study a new offline RL problem where the datasets are collected from multiple heterogeneous source MDPs, with possibly different reward and transition dynamics, instead of directly from the target MDP. Motivated by practical applications, the data source MDPs are modeled as random perturbations of the target MDP. Compared with studies of offline RL using data directly from the target MDP (Rashidinejad et al., 2021; Jin et al., 2021b), we face the new challenge of jointly considering uncertainties caused by the finite number of data samples per source (referred to as the sample uncertainties) and by the finite number of data sources (referred to as the source uncertainties). • A novel HetPEVI algorithm is proposed, which generalizes the idea of pessimistic value iteration (Jin et al., 2021b) and uses carefully crafted penalty terms to address the sample and source uncertainties simultaneously. Specifically, the penalty term in HetPEVI contains two parts: one aggregating the sample uncertainties associated with each dataset, and the other compensating for the source uncertainties. Together, these two parts jointly characterize the uncertainty associated with the collected datasets.
• Theoretical analysis proves the effectiveness of HetPEVI together with a corresponding lower bound, which demonstrates for the first time that even with finitely many randomly perturbed MDPs and finite data samples from each of them, it is feasible to efficiently learn the target MDP. More importantly, the analysis reveals that learning with multiple perturbed data sources brings both costs and benefits. On one hand, due to indirect access to the target MDP, an unavoidable learning cost occurs. This cost scales only with the number of data sources and cannot be reduced by increasing the size of the datasets, which highlights the importance of the diversity of data sources. On the other hand, effective learning only requires that the datasets collectively (instead of individually) provide good coverage of the optimal policy, which may provide additional insights for practical data collection. • Moreover, we extend the study to linear function approximation, where offline data are collected from linear MDPs with a shared feature mapping but heterogeneously realized system dynamics. The HetPEVI-Lin algorithm is developed to jointly consider the sample and source uncertainties while incorporating the linear structure. Theoretical analysis demonstrates the effectiveness of HetPEVI-Lin and verifies the sufficiency of good collective coverage.
Related Works. With the empirical success of offline RL (Levine et al., 2020), its theoretical understanding has been gradually established in recent years. In particular, the principle of "pessimism" has been incorporated and proved efficient for offline RL (Jin et al., 2021b; Rashidinejad et al., 2021). Following this line, Xie et al. (2021b); Li et al. (2022); Shi et al. (2022) further fine-tune the designs for the tabular setting, and Zanette et al. (2021); Min et al. (2021); Yin et al. (2022); Xiong et al. (2022) for linear MDPs (Jin et al., 2020). These theoretical advances mainly focus on learning with data directly from the target task.
However, in practical studies of RL, growing interest has emerged in utilizing data from heterogeneous sources, e.g., offline meta-RL (Mitchell et al., 2021; Dorfman et al., 2021; Lin et al., 2022; Li et al., 2020b). This work is thus motivated to provide a theoretical understanding of how to extract information about the target task from multiple sources. A more detailed literature review covering both online and offline RL with single or heterogeneous environments is provided in Appendix A.1.

2. PROBLEM FORMULATION

Preliminaries of RL. We consider an RL problem characterized by an episodic MDP M := (H, S, A, P, r). In this tuple, H is the length of each episode, S is the state space, A is the action space, P is the transition kernel such that P_h(s'|s, a) gives the probability of transitioning to state s' when action a is taken at state s in step h, and r_h(s, a) ∈ [0, 1] is the deterministic reward of taking action a at state s in step h. Specifically, in each episode, starting from an initial state s_1, at each step h ∈ [H], the agent observes state s_h ∈ S, picks action a_h ∈ A, receives reward r_h(s_h, a_h), and then transitions to the next state s_{h+1} ∼ P_h(·|s_h, a_h). The episode ends after H steps. In the tabular RL setting, the state and action spaces are finite, with S := |S| and A := |A|. A policy π := {π_h(·|s) : (s, h) ∈ S × [H]} consists of distributions π_h(·|s) over the action space A, whose value functions are defined as V^π_h(s) := E_{π,M}[Σ_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) | s_h = s] and Q^π_h(s, a) := E_{π,M}[Σ_{h'=h}^{H} r_{h'}(s_{h'}, a_{h'}) | s_h = s, a_h = a] for each (s, a, h) ∈ S × A × [H], where the expectation E_{π,M}[·] is with respect to (w.r.t.) the random trajectory induced by policy π on MDP M. The optimal policy π* maximizes the value function, i.e., π* := arg max_π V^π_1(s_1). For convenience, we denote V*_h(s) := V^{π*}_h(s) and Q*_h(s, a) := Q^{π*}_h(s, a) for all (s, a, h) ∈ S × A × [H], and use π_h(s) to refer to the action chosen at state s in step h by a deterministic policy π.
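When the MDP is fully known, the optimal value functions and the optimal deterministic policy defined above can be computed by backward induction over h = H, ..., 1. A minimal sketch (illustrative helper names, ours rather than the paper's; NumPy assumed):

```python
import numpy as np

def optimal_values(P, r):
    """Backward induction for a finite-horizon tabular MDP.

    P: array of shape (H, S, A, S) with P[h, s, a, s'] = P_h(s'|s, a).
    r: array of shape (H, S, A), deterministic rewards in [0, 1].
    Returns (V, Q, pi) with V[h, s] = V*_h(s), Q[h, s, a] = Q*_h(s, a),
    and pi[h, s] the greedy (optimal deterministic) action.
    """
    H, S, A, _ = P.shape
    V = np.zeros((H + 1, S))            # V_{H+1} = 0 by convention
    Q = np.zeros((H, S, A))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):      # h = H, ..., 1 in 0-indexed form
        Q[h] = r[h] + P[h] @ V[h + 1]   # Bellman operator (B_h V_{h+1})(s, a)
        pi[h] = Q[h].argmax(axis=1)
        V[h] = Q[h].max(axis=1)
    return V[:H], Q, pi
```

The same backward pass is the skeleton that HetPEVI later modifies by subtracting a penalty before the maximization.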

2.1. THE LEARNING TARGET AND OFFLINE DATASETS

The Target Task. This work considers a target task modeled by an MDP M = (H, S, A, P, r), which can be any task of interest (e.g., chatbot training in Sec. 1). The goal is to find a good output policy π with a small sub-optimality gap on the target MDP M, measured as Gap(π; M) := V*_1(s_1) − V^π_1(s_1). An output policy π is called ε-optimal if Gap(π; M) ≤ ε.
Multiple Data Sources. The agent has access to datasets from L sources, i.e., D := {D_l : l ∈ [L]}. Each dataset D_l := {(s^k_{h,l}, a^k_{h,l}, r^k_{h,l}, s^k_{h+1,l}) : k ∈ [K], h ∈ [H]} consists of K tuples for each step h ∈ [H], sampled by a (possibly different) unknown behavior policy ρ_l on an unknown data source MDP M_l = (H, S, A, P_l, r_l). In particular, for each step h ∈ [H], the behavior policy ρ_l performs K samplings over the state-action space S × A following a distribution d^{ρ_l}_{h,l}(·), i.e., for each k ∈ [K], (s^k_{h,l}, a^k_{h,l}) ∼ d^{ρ_l}_{h,l}(·). Then, for a sampled pair (s^k_{h,l}, a^k_{h,l}), the reward r^k_{h,l} = r_{h,l}(s^k_{h,l}, a^k_{h,l}) and the next state s^k_{h+1,l} ∼ P_{h,l}(·|s^k_{h,l}, a^k_{h,l}) realized from the data source MDP M_l are collected and aggregated as a tuple (s^k_{h,l}, a^k_{h,l}, r^k_{h,l}, s^k_{h+1,l}) in the dataset D_l. To ease the presentation, all the sampled tuples are assumed to be independent of each other. Such independence can be ensured by applying the sub-sampling technique of Li et al. (2022) to trajectories induced by behavior policies while maintaining the order of the dataset size.
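The step-wise sampling model above can be mimicked in code. The following sketch (function and variable names are ours, not the paper's) draws K state-action pairs per step from the per-step sampling distribution and realizes reward and next state from the source MDP:

```python
import numpy as np

def collect_dataset(P_l, r_l, d_rho, K, rng):
    """Sample one dataset D_l following the step-wise sampling model.

    P_l: (H, S, A, S) transitions of source MDP M_l.
    r_l: (H, S, A) deterministic rewards of M_l.
    d_rho: (H, S, A) sampling distribution d^{rho_l}_{h,l} over S x A per step.
    Returns a list of H lists, each with K tuples (s, a, reward, s_next).
    """
    H, S, A, _ = P_l.shape
    dataset = []
    for h in range(H):
        flat = d_rho[h].reshape(-1)
        idx = rng.choice(S * A, size=K, p=flat)   # (s, a) ~ d^{rho_l}_{h,l}
        tuples = []
        for i in idx:
            s, a = divmod(i, A)
            s_next = rng.choice(S, p=P_l[h, s, a])
            tuples.append((s, a, r_l[h, s, a], s_next))
        dataset.append(tuples)
    return dataset
```

This data layout (a list over steps of sampled tuples) is what later algorithm sketches consume.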

2.2. THE TASK-SOURCE RELATIONSHIP

On one hand, motivated by Sec. 1, each data source MDP M_l may not exactly match the target task M. Concretely, while sharing the same episode length, state space, and action space, their transition and reward dynamics are not necessarily aligned, i.e., P_{h,l}(·|s, a) ≠ P_h(·|s, a) and r_{h,l}(s, a) ≠ r_h(s, a). On the other hand, in practical applications, while being heterogeneous, the data source MDPs are often still related to the target task (e.g., the dialogue datasets in Sec. 1). In particular, data sources in offline meta-RL are often assumed to be sampled from a certain distribution (Mitchell et al., 2021). Thus, the following relationship between the target MDP and the data source MDPs is assumed.
Assumption 1 (Task-source Relationship). Data source MDPs {M_l : l ∈ [L]} are independently sampled from an unknown distribution g such that for each (l, s, a, h) ∈ [L] × S × A × [H], the reward r_{h,l}(s, a) is a random variable with mean r_h(s, a), and the transition vector P_{h,l}(·|s, a) is a random vector with mean P_h(·|s, a). In particular, the random variables and random vectors {r_{h,l}(s, a), P_{h,l}(·|s, a) : (s, a, h, l) ∈ S × A × [H] × [L]} are independent of each other.
The requirement that rewards are random samples whose expectation matches the target model is commonly adopted in the bandits literature (Shi & Shen, 2021; Zhu & Kveton, 2022), and the same requirement on the transition vectors is a natural extension, where one representative example is to have them follow a Dirichlet distribution (Marchal & Arbel, 2017).
Miscellaneous. Notations without subscript l generally refer to the target MDP M, while subscript l is always added when discussing an individual source M_l. For a clear presentation, the notation c is used throughout the paper with varying values to represent a poly-logarithmic term of order O(poly(log(LHSA/δ))) for the tabular setting and O(poly(log(LH/δ))) for the linear setting, where δ ∈ (0, 1) is a confidence parameter.
Additionally, for x ∈ R, (x)_+ denotes max{x, 0}. For any function f : S → R, the transition operator and Bellman operator of the target MDP M at each step h ∈ [H] are defined, respectively, as (P_h f)(s, a) := E[f(s')|s, a] and (B_h f)(s, a) := r_h(s, a) + (P_h f)(s, a), where the expectation is w.r.t. the transition s' ∼ P_h(·|s, a). The expectation E_{π,M}[·] is over the trajectory induced by π on the MDP M starting from s_1, and d^π_h(s, a) denotes the probability of visiting the state-action pair (s, a) at step h under policy π on M starting from s_1.
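For intuition, Assumption 1 can be instantiated with the Dirichlet example mentioned above. The sketch below samples one perturbed source MDP; the concentration knob `conc` and the Beta-distributed rewards are our own illustrative choices (the paper only requires the stated means), and the target dynamics are assumed to lie strictly in (0, 1):

```python
import numpy as np

def sample_source_mdp(P, r, conc, rng):
    """Sample one perturbed source MDP (P_l, r_l) consistent with Assumption 1.

    Transitions: P_l(.|s,a) ~ Dirichlet(conc * P(.|s,a)), whose mean is
    exactly P(.|s,a) (the Dirichlet example of Marchal & Arbel, 2017).
    Rewards: r_l(s,a) ~ Beta(conc * r, conc * (1 - r)), whose mean is r(s,a).
    Larger `conc` means a smaller perturbation around the target.
    """
    H, S, A, _ = P.shape
    P_l = np.empty_like(P)
    for h in range(H):
        for s in range(S):
            for a in range(A):
                P_l[h, s, a] = rng.dirichlet(conc * P[h, s, a])
    r_l = rng.beta(conc * r, conc * (1.0 - r))
    return P_l, r_l
```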

Algorithm 1 HetPEVI

Input: Dataset D = {D_l : l ∈ [L]}; V̂_{H+1}(s) = 0, ∀s ∈ S
1: For each (l, s, a, h) ∈ [L] × S × A × [H], estimate r̂_{h,l}(s, a) = r_{h,l}(s, a)·1{N_{h,l}(s, a) ≠ 0} and P̂_{h,l}(s'|s, a) = N_{h,l}(s, a, s') / (N_{h,l}(s, a) ∨ 1)
2: For each (s, a, h) ∈ S × A × [H], aggregate r̂_h(s, a) = (1/L) Σ_{l∈[L]} r̂_{h,l}(s, a) and P̂_h(s'|s, a) = (1/L) Σ_{l∈[L]} P̂_{h,l}(s'|s, a)
3: for h = H, H−1, ..., 1 do
4:   Perform updates for all (s, a) ∈ S × A as Eqn. (1) with (B̂_h V̂_{h+1})(s, a) := r̂_h(s, a) + (P̂_h V̂_{h+1})(s, a) and Γ_h(s, a) = c Σ_{l∈[L]} √(H² / ((L² N_{h,l}(s, a)) ∨ L)) + c √(H²/L)
5: end for
Output: policy π̂ = {π̂_h(s) : (s, h) ∈ S × [H]}

3. THE HETPEVI ALGORITHM

In this section, the HetPEVI algorithm is introduced for the tabular MDP setting (presented in Algorithm 1). HetPEVI follows the principle of pessimistic value iteration (PEVI) (Rashidinejad et al., 2021; Jin et al., 2021b), which is first briefly introduced, after which the key challenges and our novel design for learning the target MDP from randomly perturbed data sources are illustrated. HetPEVI begins by counting the visitations of each tuple (s, a, h, s') ∈ S × A × [H] × S in each dataset l ∈ [L]. Specifically, N_{h,l}(s, a) and N_{h,l}(s, a, s') denote the numbers of visits to (s, a, h) and (s, a, h, s') in dataset D_l, respectively. Empirical estimates of rewards and transitions are then formed for all (l, s, a, h) ∈ [L] × S × A × [H] as r̂_{h,l}(s, a) = r_{h,l}(s, a)·1{N_{h,l}(s, a) ≠ 0} and P̂_{h,l}(s'|s, a) = N_{h,l}(s, a, s') / (N_{h,l}(s, a) ∨ 1). These individual estimates are then aggregated into overall estimates: r̂_h(s, a) = (1/L) Σ_{l∈[L]} r̂_{h,l}(s, a) and P̂_h(s'|s, a) = (1/L) Σ_{l∈[L]} P̂_{h,l}(s'|s, a). With these estimates, HetPEVI iterates backward from the last step to the first as
Q̂_h(s, a) = min{(B̂_h V̂_{h+1})(s, a) − Γ_h(s, a), H − h + 1}_+,  V̂_h(s) = max_{a∈A} Q̂_h(s, a),  π̂_h(s) = arg max_{a∈A} Q̂_h(s, a).  (1)
The essence of the above procedure is that instead of directly setting Q̂_h(s, a) as (B̂_h V̂_{h+1})(s, a) (as in the standard value iteration), a penalty term Γ_h(s, a) is subtracted, which serves the important role of keeping the estimates V̂_h(s) and Q̂_h(s, a) pessimistic. In particular, the penalty term Γ_h(s, a) should be carefully designed such that with high probability,
|(B̂_h V̂_{h+1})(s, a) − (B_h V̂_{h+1})(s, a)| ≤ Γ_h(s, a),  ∀(s, a, h) ∈ S × A × [H].  (2)
Technical Challenges. Previous offline RL studies (Jin et al., 2021b; Rashidinejad et al., 2021; Yin et al., 2022; Xie et al., 2021b) deal with a single target data source and thus only one type of uncertainty (finite data samples) to ensure Eqn. (2).
In this work, the agent needs to process multiple heterogeneous datasets, while none of them individually characterizes the learning target. As a result, the agent faces two coupled challenges. First, the uncertainties due to the finite sample sizes still need to be considered. These are referred to as the sample uncertainties, and the key difference here is that the agent now needs to jointly consider and aggregate the uncertainties associated with all data sources. Second, even with perfect knowledge of each data source, the target MDP may not be fully revealed. Thus, the agent also needs to consider the uncertainties from the limited number of data sources, which are referred to as the source uncertainties. To address these two uncertainties, the penalty term is designed with two parts: Γ_h(s, a) = Γ^α_h(s, a) + Γ^β_h(s, a), where Γ^α_h(s, a) aggregates the sample uncertainties from each data source while Γ^β_h(s, a) accounts for the source uncertainties. Both parts are elaborated on in the following. Penalties to Aggregate Sample Uncertainties. For the first part of the penalty, Γ^α_h(s, a), since the agent faces data from multiple heterogeneous sources, the penalty needs to jointly consider the individual uncertainties from all sources. In HetPEVI, the following penalty term is proposed, which originates from the Hoeffding inequality: Γ^α_h(s, a) = c Σ_{l∈[L]} √(H² / ((L² N_{h,l}(s, a)) ∨ L)). Note that instead of treating each source individually and directly summing up their sample uncertainties (as c Σ_{l∈[L]} (1/L) √(H² / (N_{h,l}(s, a) ∨ 1))), the adopted penalty is a joint measure of the sample uncertainties across all sources, which leads to an O(√L) speed-up in the performance guarantees. Penalties to Account for Source Uncertainties. The second part of the penalty, Γ^β_h(s, a), serves the important role of measuring the uncertainties due to the limited number of data sources.
With the observation that (B_{h,l} V̂_{h+1})(s, a) is a bounded random variable with mean (B_h V̂_{h+1})(s, a), the following penalty term is proposed: Γ^β_h(s, a) = c √(H²/L). Intuitively, it shrinks with the number of data sources L, as more sources provide more information about the target task. Remark 1. With the two-part penalty term, with high probability, for all (s, a, h) ∈ S × A × [H], it simultaneously holds that |(B̂_h V̂_{h+1})(s, a) − (B̃_h V̂_{h+1})(s, a)| ≤ Γ^α_h(s, a) and |(B̃_h V̂_{h+1})(s, a) − (B_h V̂_{h+1})(s, a)| ≤ Γ^β_h(s, a), where B̃_h := (1/L) Σ_{l∈[L]} B_{h,l} denotes the average of the source Bellman operators; these two bounds jointly ensure Eqn. (2). Remark 2. The adopted source-uncertainty penalty Γ^β_h(s, a) is intended to accommodate any unknown variance between sources. However, if there is prior knowledge of the variance, it is feasible to incorporate such information. In particular, if the rewards and transition vectors are generated via σ-sub-Gaussian distributions, the penalty can be designed as Γ^β_h(s, a) = c √(σ² H²/L).
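Putting the counts, the per-source estimates, the aggregation, and the two-part penalty together, Algorithm 1 can be sketched as follows (a minimal illustrative implementation, not the authors' code; the constant `c` stands in for the poly-logarithmic factor):

```python
import numpy as np

def het_pevi(datasets, S, A, H, c=1.0):
    """Minimal sketch of HetPEVI (Algorithm 1) for tabular MDPs.

    datasets: list over L sources; datasets[l][h] is a list of
    (s, a, reward, s_next) tuples, matching the sampling model of Sec. 2.1.
    Returns the pessimistic deterministic policy pi[h, s].
    """
    L = len(datasets)
    # Visitation counts and per-source empirical models.
    N = np.zeros((L, H, S, A))
    r_hat = np.zeros((L, H, S, A))
    P_hat = np.zeros((L, H, S, A, S))
    for l in range(L):
        for h in range(H):
            for (s, a, rew, s2) in datasets[l][h]:
                N[l, h, s, a] += 1
                r_hat[l, h, s, a] = rew          # deterministic reward
                P_hat[l, h, s, a, s2] += 1
    P_hat /= np.maximum(N, 1)[..., None]         # divide by N_{h,l}(s,a) v 1
    # Aggregate across sources.
    r_bar = r_hat.mean(axis=0)
    P_bar = P_hat.mean(axis=0)
    # Two-part penalty: sample uncertainties + source uncertainties.
    Gamma = (c * np.sqrt(H**2 / np.maximum(L**2 * N, L)).sum(axis=0)
             + c * np.sqrt(H**2 / L))
    # Pessimistic value iteration, Eqn. (1).
    V = np.zeros((H + 1, S))
    pi = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):
        Q = r_bar[h] + P_bar[h] @ V[h + 1] - Gamma[h]
        Q = np.clip(Q, 0.0, H - h)   # truncate to [0, H-h], i.e. H-h+1 in 1-indexed form
        pi[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return pi
```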

4. THEORETICAL ANALYSES OF HETPEVI

4.1. INDIVIDUALLY GOOD COVERAGE

Previous offline RL studies typically require that the behavior policy provide information on all state-action pairs that may be visited by the optimal policy on the target MDP, e.g., Rashidinejad et al. (2021); Xie et al. (2021b). Particularizing this requirement to each individual data source considered in this work leads to the following assumption, which requires that each individual dataset cover the optimal policy on the target MDP.
Assumption 2 (Individual Coverage). There exists a constant C* < ∞ such that for all (s, a, h, l) ∈ S × A × [H] × [L], it holds that d^{π*}_h(s, a) ≤ C* d^{ρ_l}_{h,l}(s, a).
Under this assumption, a minimax lower bound is first established as follows.
Theorem 1 (Lower Bound; Individual Coverage). For any C* ≥ 2, S ≥ 2, A ≥ 2, it holds that
inf_π sup_{M∈M, g∈G, {ρ_l : l∈[L]}∈B(C*)} E_{{M_l, D_l : l∈[L]}}[Gap(π; M)] = Ω(√(C* H³ S/(LK)) ∨ √(H²/L)),
where M := {M(H, S, A, P, r)} is the family of all possible target MDPs, G := {g : {M_l ∼ g : l ∈ [L]} satisfies Assumption 1} is the family of source distributions, and B(C*) is the set of behavior policies satisfying Assumption 2.
The first term in this lower bound matches the lower bound of standard offline RL with LK data samples drawn directly from the target MDP (Rashidinejad et al., 2021), representing the performance loss originating from finite data samples. The second term is unique to the setting studied in this work: it represents the loss caused by learning with finitely many randomly perturbed data sources. Notably, the second term only scales with the number of data sources L and cannot be mitigated by sampling more data from each source, i.e., by increasing K. On one hand, this observation verifies the intuition that using data sampled from multiple randomly perturbed data sources poses additional learning difficulties. On the other hand, it also highlights the importance of the diversity of data sources, i.e., it is more important to involve more sources than to collect more data from each source.
This is reasonable, as involving more data sources provides additional population coverage while also adding more data. With the information-theoretic lower bound established, the performance guarantee of HetPEVI is provided in the following theorem, which highlights its effectiveness and efficiency.
Theorem 2 (HetPEVI; Individual Coverage). Under Assumptions 1 and 2, w.p. at least 1 − δ, the output policy π of HetPEVI satisfies Gap(π; M) = Õ(√(C* H⁴ S/(LK)) + √(H⁴/L)).
This result first illustrates that even with finitely many randomly perturbed MDPs and finite data samples from each of them, it is still feasible to efficiently learn the target. Compared with the lower bound in Theorem 1, it can be observed that HetPEVI is optimal (up to logarithmic factors) in its dependency on L, K, C*, and S. The additional √H factor in the first term (from the sample uncertainties) can be removed by invoking a carefully designed Bernstein-type penalty term that incorporates variance information, which is deferred to Appendix E due to the space limitation. However, it is currently unclear how to alleviate the additional H factor in the second term (from the source uncertainties), which is left open for future investigation. Additionally, Thm. 2 indicates that to obtain an ε-optimal policy, HetPEVI requires the overall number of samples T = LK to be of order Õ(C* H⁴ S/ε²) and the number of available data sources L to be of order Õ(H⁴/ε²). The first requirement can be viewed as a sample-complexity requirement, while the second concerns source diversity.
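Ignoring constants and logarithmic factors, the two requirements at the end of Thm. 2 can be read off programmatically (a back-of-the-envelope sketch, not a statement of exact constants):

```python
import math

def hetpevi_requirements(eps, H, S, C_star):
    """Back-of-the-envelope reading of Thm. 2 (constants and polylogs dropped).

    Returns (T, L): the total sample budget T = L*K and the number of
    sources L needed for an eps-optimal policy, up to polylog factors.
    """
    T = C_star * H**4 * S / eps**2   # sample-complexity requirement
    L = H**4 / eps**2                # source-diversity requirement
    return math.ceil(T), math.ceil(L)
```

For instance, halving the target accuracy ε quadruples both the sample budget and the required number of sources.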

4.2. COLLECTIVELY GOOD COVERAGE

It can be noticed that Assumption 2 requires each data source to provide good coverage. With access to only one data source, as in previous offline RL studies, this requirement is intuitive, since information about the optimal policy must be obtained somewhere. However, the multiple available datasets in this work provide an opportunity to avoid this strong assumption via their aggregated information. In this section, the scenario where the datasets collectively provide good coverage is discussed. In particular, to characterize the collective coverage, the following assumption is proposed.
Assumption 3 (Collective Coverage). There exist constants L† > 0 and C† < ∞ such that for all (s, a, h) ∈ S × A × [H], it holds that |{l ∈ [L] : d^{π*}_h(s, a) ≤ C† d^{ρ_l}_{h,l}(s, a)}| ≥ L†.
It can be observed that Assumption 3 is weaker than Assumption 2, as the latter implies the former with L† = L and C† = C*. Moreover, instead of requiring that each data source individually cover the optimal policy, Assumption 3 leverages their collective coverage. In particular, for each state-action pair possibly visited by the optimal policy (i.e., d^{π*}_h(s, a) > 0), it suffices that L† (potentially varying) datasets cover it. In other words, different parts of the optimal policy can be covered by different datasets, which is a highly practical consideration. Under Assumption 3, the following performance guarantee of HetPEVI can be further obtained.
Theorem 3 (HetPEVI; Collective Coverage). Under Assumptions 1 and 3, w.p. at least 1 − δ, the output policy π of HetPEVI satisfies Gap(π; M) = Õ(√(C† H⁴ S/(LK)) + √((L + 1 − L†) H⁴/L)).
Compared with Theorem 2, Theorem 3 is more general and also more practically useful, as it implies the former and indicates that good collective coverage is sufficient for efficient learning. Moreover, from another perspective, the performance dependence on L† also highlights the importance of collecting high-quality datasets.
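To make Assumption 3 concrete, it can be checked numerically when the occupancy measures are known. The sketch below (hypothetical helper, our own naming) returns the worst-case number of covering sources over all (s, a, h), i.e., the largest L† valid for a given C†:

```python
import numpy as np

def collective_coverage(d_opt, d_rho, C_dagger):
    """Check Assumption 3: for each (s, a, h), count the sources l whose
    sampling distribution covers the optimal occupancy up to factor C_dagger.

    d_opt: (H, S, A) occupancy d^{pi*}_h(s, a) of the optimal policy.
    d_rho: (L, H, S, A) sampling distributions d^{rho_l}_{h,l}(s, a).
    Returns the minimum, over all (s, a, h), of the number of covering
    sources; Assumption 3 holds with (C_dagger, L_dagger) iff this
    value is at least L_dagger.
    """
    covers = d_opt[None, ...] <= C_dagger * d_rho    # (L, H, S, A) booleans
    return int(covers.sum(axis=0).min())
```

In the example below, two one-sided behavior policies each cover only half of the optimal occupancy, yet together they achieve L† = 1, exactly the "different parts covered by different datasets" scenario.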
In summary, learning from randomly perturbed datasets brings both costs and benefits. In particular, an unavoidable performance loss occurs due to the indirect access to the target MDP. This loss can only be reduced by increasing the diversity of data sources, i.e., a larger L, and cannot be mitigated by collecting a larger dataset from each data source, i.e., a larger K. Despite this additional loss, the access to multiple heterogeneous datasets provides an opportunity to leverage their aggregated information, which is concretely reflected in the fact that efficient learning only requires good collective (instead of individual) coverage.

5. EXTENSION TO OFFLINE LINEAR MDP

Lastly, we extend the study from tabular RL to incorporate function approximation. In particular, the following linear MDP model is considered.
Definition 1 (Linear MDP (Jin et al., 2020; 2021b)). An MDP M = (H, S, A, P, r) is a linear MDP with a feature map φ : S × A → R^d if there exist d unknown measures µ_h = (µ^{(1)}_h, ..., µ^{(d)}_h) over S and an unknown vector θ_h ∈ R^d such that P_h(s'|s, a) = ⟨φ(s, a), µ_h(s')⟩ and r_h(s, a) = ⟨φ(s, a), θ_h⟩ for all (s, a, s') ∈ S × A × S at each step h ∈ [H]. Without loss of generality, it is assumed that ∥φ(s, a)∥_2 ≤ 1 for all (s, a) ∈ S × A and max{∥µ_h(S)∥_2, ∥θ_h∥_2} ≤ √d for all h ∈ [H], where ∥µ_h(S)∥_2 := ∫_S ∥µ_h(s)∥_2 ds. For simplicity, we denote M = (H, S, A, φ, µ, θ).
With this definition, we consider the problem where the target MDP M = (H, S, A, φ, µ, θ) is a linear MDP. Correspondingly, the data source MDPs are also assumed to be linear MDPs, denoted as {M_l = (H, S, A, φ, µ_l, θ_l) : l ∈ [L]}. In particular, the data source MDPs are assumed to share the same feature dimension d and feature mapping φ as the target MDP; however, their system dynamics may differ. We note that such a shared feature mapping is commonly adopted in federated linear bandits (Huang et al., 2021; Li & Wang, 2022) and naturally extends to linear MDPs. Then, similarly to Assumption 1 in the tabular setting, the following task-source relationship is assumed for linear MDPs to model that the data sources are randomly perturbed versions of the target MDP.
Assumption 4 (Task-source Relationship; Linear MDP). Data source MDPs {M_l : l ∈ [L]} are independently sampled from an unknown distribution g such that for each (l, i, h) ∈ [L] × [d] × [H], the vector θ_{h,l} is a random vector with mean θ_h, and the measure µ^{(i)}_{h,l} is a random measure with mean µ^{(i)}_h. In particular, the random vectors and measures {θ_{h,l}, µ^{(i)}_{h,l} : (h, i, l) ∈ [H] × [d] × [L]} are independent of each other.
Note that when treating the tabular setting as a linear MDP with d = SA and φ(s, a) = e_{(s,a)}, where e_{(s,a)} denotes a canonical basis vector in R^d, Assumption 4 reduces to Assumption 1.
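This reduction can be verified numerically: with one-hot features, the linear-MDP parameters of Definition 1 are simply reshaped tabular quantities. A small illustrative check (helper name is ours):

```python
import numpy as np

def tabular_as_linear(P_h, r_h):
    """Recover the linear-MDP parameters of Definition 1 for one tabular step,
    using d = S*A and phi(s, a) = e_{(s,a)} as noted above.

    P_h: (S, A, S) one-step transitions; r_h: (S, A) rewards.
    Returns (mu, theta) with mu of shape (S, d) (rows mu_h(s')) and theta of
    shape (d,), so that P_h(s'|s, a) = <phi(s, a), mu_h(s')> and
    r_h(s, a) = <phi(s, a), theta_h>.
    """
    S, A, _ = P_h.shape
    d = S * A
    mu = P_h.reshape(d, S).T      # mu_h(s')[(s, a)] = P_h(s'|s, a)
    theta = r_h.reshape(d)
    return mu, theta
```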

5.1. HETPEVI-LIN FOR LINEAR MDPS

From the linear MDP definition, it can be shown that the Bellman operators of each data source MDP and the target MDP are all linear (but with different weights).

Algorithm 2 HetPEVI-Lin
Input: Dataset D = {D_l : l ∈ [L]}; V̂_{H+1}(s) = 0, ∀s ∈ S
1: Estimate ŵ_{h,l} ← Λ^{−1}_{h,l} Σ_{k∈[K]} φ(s^k_{h,l}, a^k_{h,l}) [r^k_{h,l} + V̂_{h+1}(s^k_{h+1,l})], ∀(h, l) ∈ [H] × [L]
2: Set ŵ_h ← Σ_{l∈[L]} ŵ_{h,l}/L, ∀h ∈ [H]
3: for h = H, H−1, ..., 1 do
4:   Perform updates for all (s, a) ∈ S × A as Eqn. (1) with (B̂_h V̂_{h+1})(s, a) = ŵ_h^⊤ φ(s, a) and Γ_h(s, a) = c √((dH²/L²) Σ_{l∈[L]} ∥φ(s, a)∥²_{Λ^{−1}_{h,l}}) + c √(dH²/L)
5: end for
Output: policy π̂ = {π̂_h(s) : (s, h) ∈ S × [H]}

Lemma 1. For any function f : S → R, there exist weights {w^f_{h,l} ∈ R^d : h ∈ [H], l ∈ [L]} and {w^f_h ∈ R^d : h ∈ [H]} such that for any (s, a, h) ∈ S × A × [H], it holds that (B_{h,l} f)(s, a) = ⟨φ(s, a), w^f_{h,l}⟩ for each l ∈ [L] and (B_h f)(s, a) = ⟨φ(s, a), w^f_h⟩.
Furthermore, the following lemma can be established for the sample-average MDP M̃ = (H, S, A, P̃, r̃), where P̃_h(·|s, a) = (1/L) Σ_{l∈[L]} P_{h,l}(·|s, a) and r̃_h(s, a) = (1/L) Σ_{l∈[L]} r_{h,l}(s, a). It states that M̃ is also a linear MDP and that the weights associated with its Bellman operator are the averages of the weights from the individual data source MDPs.
Lemma 2. The average MDP M̃ is a linear MDP of dimension d with feature mapping φ. Also, for any function f : S → R and any (s, a, h) ∈ S × A × [H], it holds that (B̃_h f)(s, a) = ⟨φ(s, a), w̃^f_h⟩ with weights w̃^f_h = Σ_{l∈[L]} w^f_{h,l}/L.
With the above observations, the HetPEVI-Lin algorithm is proposed (presented in Algorithm 2). First, for each step h ∈ [H] and each data source l ∈ [L], the weight ŵ_{h,l} is estimated via ridge regression:
ŵ_{h,l} = arg min_{w∈R^d} Σ_{k∈[K]} [r^k_{h,l} + V̂_{h+1}(s^k_{h+1,l}) − φ(s^k_{h,l}, a^k_{h,l})^⊤ w]² + (1/L)∥w∥²_2 = Λ^{−1}_{h,l} Σ_{k∈[K]} φ(s^k_{h,l}, a^k_{h,l}) [r^k_{h,l} + V̂_{h+1}(s^k_{h+1,l})],
with Λ_{h,l} := Σ_{k∈[K]} φ(s^k_{h,l}, a^k_{h,l}) φ(s^k_{h,l}, a^k_{h,l})^⊤ + I/L.
The agent then aggregates ŵ_h = Σ_{l∈[L]} ŵ_{h,l}/L, which provides an estimate of the weight w̃_h associated with the average MDP M̃. The PEVI update in Eqn. (1) is then performed with the empirical Bellman operator (B̂_h V̂_{h+1})(s, a) = ⟨φ(s, a), ŵ_h⟩ and a two-part penalty term Γ_h(s, a) = Γ^α_h(s, a) + Γ^β_h(s, a), illustrated in the following. Penalties to Aggregate Sample Uncertainties. The first part of the penalty term aggregates the sample uncertainties from all data sources and is designed as Γ^α_h(s, a) = c √((dH²/L²) Σ_{l∈[L]} ∥φ(s, a)∥²_{Λ^{−1}_{h,l}}). While this penalty shares a similar format with those in linear bandits (Li et al., 2010; Abbasi-Yadkori et al., 2011) and linear MDPs (Jin et al., 2020; 2021b), its novelty comes from jointly considering the uncertainties of different sources. In particular, this design avoids applying the self-normalized concentration (Abbasi-Yadkori et al., 2011) to each data source separately (which leads to an inferior dependency on L) by leveraging the statistical independence among the split datasets. Penalties to Account for Source Uncertainties. Besides the sample uncertainties, the following penalty is designed to measure the source uncertainties: Γ^β_h(s, a) = c √(dH²/L). Note that compared with the tabular design, an additional d factor appears, which comes from the covering argument over the d-dimensional feature space.
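One backward step of HetPEVI-Lin (the per-source ridge regressions, the weight aggregation, and the two-part penalty) can be sketched as follows (illustrative helper, not the authors' code; `c` again stands in for the poly-logarithmic factor):

```python
import numpy as np

def hetpevi_lin_step(features, targets, H, c=1.0):
    """One backward step of HetPEVI-Lin (Algorithm 2), sketched with NumPy.

    features: list over the L sources of (K, d) arrays of phi(s^k_{h,l}, a^k_{h,l}).
    targets: matching list of (K,) arrays of r^k_{h,l} + V_hat_{h+1}(s^k_{h+1,l}).
    Returns (w_hat, Gamma): the aggregated ridge weight w_hat_h and a
    function Gamma(phi) evaluating the two-part penalty at feature phi.
    """
    L = len(features)
    d = features[0].shape[1]
    Lambda_inv, w_sum = [], np.zeros(d)
    for Phi, y in zip(features, targets):
        Lam = Phi.T @ Phi + np.eye(d) / L         # Lambda_{h,l} with I/L ridge
        Lam_inv = np.linalg.inv(Lam)
        Lambda_inv.append(Lam_inv)
        w_sum += Lam_inv @ (Phi.T @ y)            # ridge solution w_hat_{h,l}
    w_hat = w_sum / L                             # aggregated weight w_hat_h

    def Gamma(phi):
        # Sample-uncertainty part + source-uncertainty part.
        quad = sum(phi @ Li @ phi for Li in Lambda_inv)
        return c * np.sqrt(d * H**2 / L**2 * quad) + c * np.sqrt(d * H**2 / L)

    return w_hat, Gamma
```

In a full backward pass, this step would be called for h = H, ..., 1, with `targets` rebuilt from the previously computed V̂_{h+1}.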

5.2. THEORETICAL ANALYSIS

Similar to the tabular setting, previous offline RL studies with linear MDPs often require the behavior policy to cover the optimal policy (Jin et al., 2021b; Xiong et al., 2022) (or, even stronger, to cover the entire feature space (Yin et al., 2022; Min et al., 2021)). Hence, following Sec. 4, we first discuss the scenario with good individual coverage, characterized by the following assumption.
Assumption 5 (Individual Coverage; Linear MDP). There exists a constant D* < ∞ such that for all (h, l) ∈ [H] × [L], it holds that E_{π*,M}[φ(s_h, a_h)φ(s_h, a_h)^⊤] ⪯ D* E_{ρ_l,M_l}[φ(s_h, a_h)φ(s_h, a_h)^⊤].
Theorem 4 (HetPEVI-Lin; Individual Coverage). Under Assumptions 4 and 5, and assuming the matrices {Σ_{l∈[L]} Λ^{−1}_{h,l} : h ∈ [H]} are invertible, w.p. at least 1 − δ, the output policy π of HetPEVI-Lin satisfies Gap(π; M) = Õ(√(D* d² H⁴/(KL)) + √(d H⁴/L)).
Similar to Thm. 2, the two terms in Thm. 4 originate from the finite data samples and the limited data sources, respectively. Note, however, that the gap guarantee in Thm. 4 does not depend on the number of states S (which appears in the tabular analysis); instead, it depends on the feature dimension d, thanks to the careful design that incorporates linear function approximation. As a result, to output an ε-optimal policy, the sample complexity requirement is T = KL = Õ(D* d² H⁴/ε²), while the source-diversity requirement is L = Õ(d H⁴/ε²). One important observation from the tabular setting is that efficient learning only requires good collective (instead of individual) coverage. To further verify this claim, the following collective coverage assumption, which shares a similar format with Assumption 3, is considered for linear MDPs.
Assumption 6 (Collective Coverage; Linear MDP). There exist constants L† > 0 and D† < ∞ such that for all h ∈ [H], it holds that |{l ∈ [L] : E_{π*,M}[φ(s_h, a_h)φ(s_h, a_h)^⊤] ⪯ D† E_{ρ_l,M_l}[φ(s_h, a_h)φ(s_h, a_h)^⊤]}| ≥ L†.
It can be observed that Assumption 5 implies Assumption 6 with $L^\dagger = L$ and $D^\dagger = D^*$. With this relatively weaker collective coverage assumption, the following theorem can be established.

Theorem 5 (HetPEVI-Lin; Collective Coverage). Under Assumptions 4 and 6, and assuming the matrices $\{\sum_{l \in [L]} \Lambda_{h,l}^{-1} : h \in [H]\}$ are invertible, w.p. at least $1 - \delta$, the output policy $\hat{\pi}$ of HetPEVI-Lin satisfies
$$\mathrm{Gap}(\hat{\pi}; \mathcal{M}) = \tilde{O}\left(\sqrt{\frac{D^\dagger d^2 H^4}{KL}} + \sqrt{\frac{d(L + 1 - L^\dagger) H^4}{L}}\right).$$
It can be observed that Thm. 5 shares a similar form as Thm. 3; both illustrate that as long as the datasets collectively cover the optimal policy, efficient learning is achievable.
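Assumptions 5 and 6 are positive-semidefinite domination conditions and can be checked numerically from feature second-moment matrices. Below is a minimal sketch with hypothetical helper names (not from the paper): `coverage_ratio` computes the smallest feasible $D$ via a whitened eigenvalue problem, and `covered_sources` counts how many sources satisfy the domination with a given $D^\dagger$, i.e., the quantity lower-bounded by $L^\dagger$ in Assumption 6.

```python
import numpy as np

def coverage_ratio(target_cov, source_cov, eps=1e-9):
    """Smallest D with target_cov <= D * source_cov (PSD order), or inf if none.

    Both arguments are d x d PSD feature second-moment matrices E[phi phi^T].
    (Hypothetical helper for illustration, not part of the paper's algorithm.)"""
    vals, vecs = np.linalg.eigh(source_cov)
    if np.any(vals < eps):
        # Pseudo-inverse square root on the non-null directions of the source.
        inv_sqrt = vecs @ np.diag([v ** -0.5 if v > eps else 0.0 for v in vals]) @ vecs.T
        # If the target puts mass on the source's null space, no finite D works.
        null_dirs = vecs[:, vals < eps]
        if np.any(np.abs(null_dirs.T @ target_cov @ null_dirs) > eps):
            return float("inf")
    else:
        inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T
    # D is the largest generalized eigenvalue of (target_cov, source_cov).
    return float(np.linalg.eigvalsh(inv_sqrt @ target_cov @ inv_sqrt).max())

def covered_sources(target_cov, source_covs, D_dagger):
    """Number of sources l with target_cov <= D_dagger * source_cov_l (Assumption 6)."""
    return sum(coverage_ratio(target_cov, S) <= D_dagger for S in source_covs)
```

A source whose covariance is singular in a direction that the optimal policy excites yields an infinite ratio, which is exactly the situation where only collective (not individual) coverage can save the learner.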

6. CONCLUSIONS

This work studied a novel problem of efficient offline RL with randomly perturbed data sources. In particular, motivated by practical applications, the available offline datasets are assumed to be collected from multiple randomly perturbed versions of the target MDP. The HetPEVI algorithm was proposed, where novel penalty terms were designed to jointly account for the uncertainties from the finite data samples and the limited number of data sources. Theoretical analyses proved its near-optimal performance. More importantly, the costs and benefits of offline RL via randomly perturbed data sources were explicitly characterized. On one hand, an additional unavoidable performance loss occurs due to the finite data sources, which cannot be reduced by involving more data samples from each source. On the other hand, as long as the datasets collectively (instead of individually) provide a good data coverage, efficient learning is achievable. Lastly, linear function approximation was considered, and the HetPEVI-Lin algorithm was developed with penalties carefully designed to account for both uncertainties and incorporate the linear structure. Its analysis again verifies the importance of source diversity and the sufficiency of collective coverage.

A ADDITIONAL DISCUSSIONS

A.1 RELATED WORKS

RL has seen much progress over the past few years, particularly in its theoretical understanding (see the recent monograph (Agarwal et al., 2019) for an overview). We discuss the most related papers in the following, with a particular focus on the theoretical aspect as well as the offline setting.

Online RL with One Environment. The majority of RL studies focus on the online setting (Sutton & Barto, 2018), where the agent consistently improves her policy via online interactions with the environment. In recent years, many advances have deepened the theoretical understanding of online RL in a fixed and stable environment, with provably efficient designs (Azar et al., 2017; Jin et al., 2018; 2020; 2021a; Zhang et al., 2020; Jiang et al., 2017; Dann et al., 2021).

Online RL with Heterogeneous Environments. The importance of handling heterogeneous environments is well recognized. In particular, federated RL (Zhuo et al., 2019) and general multi-agent RL (Zhang et al., 2021) study settings with multiple agents in the system, where each agent may view an agent-dependent environment (Jin et al., 2022). In contrast, meta-RL (Saemundsson et al., 2018; Gupta et al., 2018; Fallah et al., 2021; Chua et al., 2021) focuses on extracting knowledge from previous tasks and adapting it to new (potentially different) tasks. Also, multi-task RL (Teh et al., 2017) attempts to handle multiple tasks together, either via information shared among tasks (Brunskill & Li, 2013; Zhang & Wang, 2021; Lu et al., 2021; Yang et al., 2020; 2022) or by distinguishing the latent structure associated with each task via past observations (Hallak et al., 2015; Kwon et al., 2021a; b).

Offline RL with Single-source Datasets. Offline RL (Levine et al., 2020) attempts to avoid the potentially expensive interactions with the environment required in online RL and instead uses offline datasets collected previously.
Inspired by empirical advances (Yu et al., 2020; Kumar et al., 2020), the principle of "pessimism" has been incorporated and proved efficient for offline RL (Jin et al., 2021b; Rashidinejad et al., 2021). In particular, it is illustrated that a nearly optimal policy can be learned via a dataset collected by a behavior policy that covers the trajectories of the optimal policy. However, these results are mostly information-theoretic, as computationally intractable optimizations are required in general. Besides the above discussions, one particular line of research related to this work falls under the category of offline robust RL (Zhou et al., 2021; Yang et al., 2021; Panaganti et al., 2022; Panaganti & Kalathil, 2022; Si et al., 2020; Shi & Chi, 2022). In particular, offline robust RL learns via data sampled from one MDP and tries to output a policy that is robustly good for a family of MDPs, whereas our work attempts to learn the hidden target task from data sampled from a family of MDPs. Nevertheless, we note that it would be an interesting question to investigate whether having access to multiple data source MDPs in the target family would make offline robust RL easier. Another conceptually related work is Shrestha et al. (2020). In particular, it looks for similar state-action pairs with small distances in the dataset, which can be thought of as the available data sources in this work. Then, a Lipschitz continuity assumption is posed, which serves a similar role as Assumption 1 in establishing the connection between the desired task information and the available datasets. From this perspective, the first term in Theorem 3.1 of Shrestha et al. (2020) can be interpreted as coming from the source uncertainty while the second term is from the sample uncertainty. However, we also note that the Lipschitz continuity assumption is a worst-case consideration that cannot characterize the concentration gained from involving more data sources, which is the key to this work.
Offline RL with Multi-source Datasets. The aforementioned advances in offline RL mainly focus on learning from a single data source, which limits their applicability. In the offline domain, there has been growing interest in utilizing data from heterogeneous sources. The most related literature falls under the framework of "offline meta-RL" or "offline multi-task RL" (Mitchell et al., 2021; Dorfman et al., 2021; Lin et al., 2022; Li et al., 2020b; a; Yu et al., 2021). In particular, the target MDP can be viewed as a learning target for the "meta-training" process of offline meta-RL (Mitchell et al., 2021), which aims to extract information from the available data (of multiple sources). In addition to "meta-training", the empirically studied offline meta-RL systems often feature another step of "meta-testing", which further applies the learned information to a specific task. Thus, we believe this work may contribute to the (currently lacking) theoretical understanding of offline meta-RL systems, especially the meta-training process, which may also serve as the foundation for studies on the meta-testing process.

A.2 FUTURE WORKS

Coverage Assumptions. While the collective coverage assumption of Assumptions 3 and 6 is relatively weak, it is still of major interest to further explore how to perform offline RL (especially with heterogeneous data sources) under weaker assumptions. This direction is particularly interesting with multiple data sources, since the heterogeneous sources naturally enrich the data diversity.

Unknown Source Identities. This work considers the scenario where each data sample is known to belong to a particular source. One interesting direction is to investigate scenarios without such information, i.e., with unknown source identities. A potential solution is to first cluster the data samples and then adopt the algorithms proposed in this work. However, it is challenging to design provable clustering algorithms. One candidate clustering technique is developed in Kwon et al. (2021b) for the study of latent MDPs, which however relies on strong assumptions of prior knowledge about the source MDPs.

Personalization. As mentioned in the discussion of related works, this work can be viewed as targeting the "meta-training" process of offline meta-RL (Mitchell et al., 2021), which extracts common knowledge from the available data of multiple sources. While the extracted common knowledge has its own value, in many applications an additional step of personalization is performed to further use such knowledge to benefit a specific task, which is called the "meta-testing" process of offline meta-RL (Mitchell et al., 2021). Based on this work, it would be valuable to further study how to perform such a personalization step with theoretical guarantees.

B LOWER BOUND ANALYSIS

Lemma 3. For any $C^* \geq 2$, it holds that
$$\inf_{\hat{\pi}} \sup_{\mathcal{M} \in \mathbb{M},\, g \in \mathbb{G},\, \{\rho_l : l \in [L]\} \in \mathbb{B}(C^*)} \mathbb{E}_{\{\mathcal{M}_l, \mathcal{D}_l : l \in [L]\}}[\mathrm{Gap}(\hat{\pi}; \mathcal{M})] = \Omega\left(\sqrt{\frac{C^* S H^3}{LK}}\right).$$
Proof. We consider the case where $g$ always generates $\mathcal{M}_l = \mathcal{M}$, and $\rho_1 = \cdots = \rho_L = \rho$. Then, the problem degenerates to offline RL with one dataset directly from the target. Thus, the results for the case of $C^* \geq 2$ in Theorem 7 of Rashidinejad et al. (2021) can be applied to obtain the final result.

Lemma 4. For any $C^* \geq 2$, it holds that
$$\inf_{\hat{\pi}} \sup_{\mathcal{M} \in \mathbb{M},\, g \in \mathbb{G},\, \{\rho_l : l \in [L]\} \in \mathbb{B}(C^*)} \mathbb{E}_{\{\mathcal{M}_l, \mathcal{D}_l : l \in [L]\}}[\mathrm{Gap}(\hat{\pi}; \mathcal{M})] = \Omega\left(\sqrt{\frac{H^2}{L}}\right).$$
Proof. This lemma is established with the following construction.

Target MDP $\mathcal{M}$. We design the following family of target MDPs with two states denoted as $\mathcal{S} = \{s_g, s_b\}$ and two actions denoted as $\mathcal{A} = \{a_g, a_b\}$ for all steps ($s_1$ can be fixed to be $s_g$), which can be easily generalized to accommodate any number of states and actions:
$$\mathbb{M} = \Big\{ P_1(s_g|s_g, a_g) = p,\; P_1(s_b|s_g, a_g) = 1 - p;\; P_1(s_g|s_g, a_b) = P_1(s_b|s_g, a_b) = 0.5;\; P_h(s|s, a) = 1,\, \forall (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [2, H];\; r_h(s_g, a) = 1,\, r_h(s_b, a) = 0,\, \forall (a, h) \in \mathcal{A} \times [H] \Big\}.$$
A target MDP $\mathcal{M}$ with parameter $p$ in $\mathbb{M}$ is referred to as $\mathcal{M}(p)$.

Data source generation distribution $g$. For target $\mathcal{M}(p)$, the data source generation, denoted as $g(p)$, is designed as follows: with probability $p$, the generated MDP has $P_1(s_g|s_g, a_g) = 1$, $P_1(s_b|s_g, a_g) = 0$; otherwise it has $P_1(s_g|s_g, a_g) = 0$, $P_1(s_b|s_g, a_g) = 1$. The other parameters of the generated MDP are the same as in $\mathcal{M}$.

Behavior policy $\rho_l$. For all $l \in [L]$, the behavior policy $\rho_l$ is specified as $\rho_{h,l}(a|s) = \pi^*_h(a|s)$ for all $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$, which ensures Assumption 2 with any $C^* \geq 1$.
Then, with $p_1 := \frac{1}{2} + \delta$ and $p_2 := \frac{1}{2} - \delta$, it can be obtained that
$$\mathbb{E}_{\{\mathcal{M}_l \sim g(p_1), \mathcal{D}_l \sim \mathcal{M}_l : l \in [L]\}}[\mathrm{Gap}(\hat{\pi}; \mathcal{M}(p_1))] = (H - 1)\delta\, \mathbb{E}_{\{\mathcal{M}_l \sim g(p_1), \mathcal{D}_l \sim \mathcal{M}_l : l \in [L]\}}[\hat{\pi}_1(a_b)];$$
$$\mathbb{E}_{\{\mathcal{M}_l \sim g(p_2), \mathcal{D}_l \sim \mathcal{M}_l : l \in [L]\}}[\mathrm{Gap}(\hat{\pi}; \mathcal{M}(p_2))] = (H - 1)\delta\, \mathbb{E}_{\{\mathcal{M}_l \sim g(p_2), \mathcal{D}_l \sim \mathcal{M}_l : l \in [L]\}}[\hat{\pi}_1(a_g)],$$
which leads to (abbreviating $\mathbb{E}_{\{\mathcal{M}_l \sim g(p_i), \mathcal{D}_l \sim \mathcal{M}_l : l \in [L]\}}$ as $\mathbb{E}_{g(p_i)}$ and using $\hat{\pi}_1(a_g) = 1 - \hat{\pi}_1(a_b)$)
$$\mathbb{E}_{g(p_1)}[\mathrm{Gap}(\hat{\pi}; \mathcal{M}(p_1))] + \mathbb{E}_{g(p_2)}[\mathrm{Gap}(\hat{\pi}; \mathcal{M}(p_2))] = (H - 1)\delta\left(\mathbb{E}_{g(p_1)}[\hat{\pi}_1(a_b)] + \mathbb{E}_{g(p_2)}[1 - \hat{\pi}_1(a_b)]\right).$$
Furthermore, it holds that
$$\mathbb{E}_{g(p_1)}[\hat{\pi}_1(a_b)] + \mathbb{E}_{g(p_2)}[1 - \hat{\pi}_1(a_b)] \geq 1 - \mathrm{TV}\big(\mathbb{P}_{g(p_1)}, \mathbb{P}_{g(p_2)}\big) \geq 1 - \sqrt{\mathrm{KL}\big(\mathbb{P}_{g(p_1)} \,\|\, \mathbb{P}_{g(p_2)}\big)/2},$$
where $\mathbb{P}_{g(p_i)}$ denotes the distribution of $\{\mathcal{M}_l, \mathcal{D}_l : l \in [L]\}$ under $g(p_i)$. We can explicitly write down the ratio between the two probabilities as
$$\frac{\mathbb{P}_{g(p_1)}(\{\mathcal{M}_l, \mathcal{D}_l : l \in [L]\})}{\mathbb{P}_{g(p_2)}(\{\mathcal{M}_l, \mathcal{D}_l : l \in [L]\})} = \frac{(p_1)^{\kappa}(1 - p_1)^{L - \kappa}}{(p_2)^{\kappa}(1 - p_2)^{L - \kappa}},$$
where $\kappa = \sum_{l \in [L]} \mathbb{1}\{P_{1,l}(s_g|s_g, a_g) = 1\}$. As a result, since $p_1, p_2 \in [\frac{1}{4}, \frac{3}{4}]$, it holds that
$$\mathrm{KL}\big(\mathbb{P}_{g(p_1)} \,\|\, \mathbb{P}_{g(p_2)}\big) = \mathbb{E}_{g(p_1)}\left[\kappa \log\frac{p_1}{p_2} + (L - \kappa)\log\frac{1 - p_1}{1 - p_2}\right] = L p_1 \log\frac{p_1}{p_2} + L(1 - p_1)\log\frac{1 - p_1}{1 - p_2} \leq \frac{L(p_1 - p_2)^2}{p_2(1 - p_2)} \leq 16 L (p_1 - p_2)^2.$$
Finally, with $\delta = \frac{1}{16}\sqrt{\frac{2}{L}}$, it holds that
$$\mathbb{E}_{g(p_1)}[\hat{\pi}_1(a_b)] + \mathbb{E}_{g(p_2)}[1 - \hat{\pi}_1(a_b)] \geq 1 - \sqrt{\mathrm{KL}\big(\mathbb{P}_{g(p_1)} \,\|\, \mathbb{P}_{g(p_2)}\big)/2} \geq \frac{1}{2},$$
which concludes the proof, as
$$\mathbb{E}_{g(p_1)}[\mathrm{Gap}(\hat{\pi}; \mathcal{M}(p_1))] + \mathbb{E}_{g(p_2)}[\mathrm{Gap}(\hat{\pi}; \mathcal{M}(p_2))] \geq (H - 1) \cdot \frac{1}{2} \cdot \frac{1}{16}\sqrt{\frac{2}{L}} = \Omega\left(\sqrt{\frac{H^2}{L}}\right).$$
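Since $\kappa$ is a sufficient statistic for the source draws, the KL term above reduces to a KL divergence between two binomial laws, which can be sanity-checked numerically. A small sketch (illustrative constants; logarithmic factors and the constant $c$ are ignored):

```python
import math

def binomial_kl(p1, p2, L):
    """KL between the laws of kappa ~ Binomial(L, p1) and kappa ~ Binomial(L, p2).

    kappa counts the generated sources with P_{1,l}(s_g | s_g, a_g) = 1, which is
    a sufficient statistic in the lower-bound construction above."""
    kl_bernoulli = p1 * math.log(p1 / p2) + (1 - p1) * math.log((1 - p1) / (1 - p2))
    return L * kl_bernoulli

L = 100
delta = (1.0 / 16.0) * math.sqrt(2.0 / L)
p1, p2 = 0.5 + delta, 0.5 - delta

kl = binomial_kl(p1, p2, L)
# The chi-square-style bound used in the proof (valid for p1, p2 in [1/4, 3/4]):
assert kl <= 16 * L * (p1 - p2) ** 2
# Pinsker: TV <= sqrt(KL / 2) <= 1/2, so no policy can distinguish the two
# instances with advantage better than 1/2.
assert math.sqrt(kl / 2) <= 0.5
```

Note that the indistinguishability persists no matter how large $K$ is: the binomial law of $\kappa$ depends only on $L$, which is the formal reason the source-uncertainty term cannot be reduced by more samples per source.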

C UPPER BOUND ANALYSIS: OVERVIEW

In this section, we provide an overview of the steps shared by the proofs for each proposed algorithm. In other words, the following results are stated for all three proposed algorithms in general. The more challenging proofs specific to our designs are presented in the subsequent sections. The basic logic here follows the seminal work of Jin et al. (2021b) on provably efficient offline RL. The first step is to establish the validness of the pessimism. For different settings, various styles of pessimism are incorporated. Specifically, in this work, three penalty constructions are adopted in HetPEVI, HetPEVI-Adv and HetPEVI-Lin, respectively. As an abstraction, the validness of the pessimism induced by penalties $\{\Gamma_h(s, a) : (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]\}$ is defined in the following.

Definition 2 (Validness of Pessimism). For pessimistic value iterations using an estimated Bellman operator $\hat{\mathbb{B}}_h$ and penalties $\{\Gamma_h(s, a) : (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]\}$, a valid pessimism is induced if the following event happens:
$$\mathcal{E} := \left\{ \left|\hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) - \mathbb{B}_h \hat{V}_{h+1}(s, a)\right| \leq \Gamma_h(s, a),\; \forall (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H] \right\}.$$
With a valid pessimism, we can obtain the following lemma.

Lemma 5. Suppose that event $\mathcal{E}$ in Definition 2 happens; then it holds that
$$\zeta_h(s, a) := \mathbb{B}_h \hat{V}_{h+1}(s, a) - \hat{Q}_h(s, a) \in [0, 2\Gamma_h(s, a)],\; \forall (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H].$$
Proof. We can observe that for any $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$, with the adopted pessimistic value iteration, it holds that
$$\zeta_h(s, a) = \mathbb{B}_h \hat{V}_{h+1}(s, a) - \min\left\{\hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) - \Gamma_h(s, a),\, H - h + 1\right\}^+ \geq \min\left\{\mathbb{B}_h \hat{V}_{h+1}(s, a) - \hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) + \Gamma_h(s, a),\; \mathbb{B}_h \hat{V}_{h+1}(s, a)\right\} \geq 0,$$
where the last inequality is due to event $\mathcal{E}$. Similarly, for the other direction, it holds that
$$\zeta_h(s, a) = \mathbb{B}_h \hat{V}_{h+1}(s, a) - \min\left\{\hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) - \Gamma_h(s, a),\, H - h + 1\right\}^+ \leq \max\left\{\mathbb{B}_h \hat{V}_{h+1}(s, a) - \hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) + \Gamma_h(s, a),\; \mathbb{B}_h \hat{V}_{h+1}(s, a) - (H - h + 1)\right\} \leq 2\Gamma_h(s, a),$$
where the last inequality is due to event $\mathcal{E}$ and the fact that $(\mathbb{B}_h \hat{V}_{h+1})(s, a) \leq H - h + 1$.
Furthermore, the suboptimality gap on the target MDP $\mathcal{M}$ between the output policy $\hat{\pi}$ and the optimal policy $\pi^*$ can be bounded via the following lemma.

Lemma 6. Suppose that event $\mathcal{E}$ in Definition 2 happens; then it holds that
$$\mathrm{Gap}(\hat{\pi}; \mathcal{M}) \leq 2 \sum_{h \in [H]} \mathbb{E}_{\pi^*, \mathcal{M}}[\Gamma_h(s_h, a_h)].$$
Proof. It holds that
$$\mathrm{Gap}(\hat{\pi}; \mathcal{M}) \overset{(a)}{=} -\sum_{h \in [H]} \mathbb{E}_{\hat{\pi}, \mathcal{M}}[\zeta_h(s_h, a_h)] + \sum_{h \in [H]} \mathbb{E}_{\pi^*, \mathcal{M}}[\zeta_h(s_h, a_h)] + \sum_{h \in [H]} \mathbb{E}_{\pi^*, \mathcal{M}}\left[\hat{Q}_h(s_h, \pi^*_h(s_h)) - \hat{Q}_h(s_h, \hat{\pi}_h(s_h))\right] \overset{(b)}{\leq} 2 \sum_{h \in [H]} \mathbb{E}_{\pi^*, \mathcal{M}}[\Gamma_h(s_h, a_h)],$$
where equation (a) is from Lemma 3.1 in (Jin et al., 2021b) (provided as Lemma 23), and inequality (b) is due to Lemma 5 together with the fact that $\hat{Q}_h(s_h, \cdot)$ takes its maximum at $\hat{\pi}_h(s_h)$ in the proposed algorithms. In other words, $\mathrm{Gap}(\hat{\pi}; \mathcal{M})$ is bounded via the sum of the penalties $\Gamma_h(s_h, a_h)$ along the expected trajectory of the optimal policy $\pi^*$ on the target MDP $\mathcal{M}$.
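Lemmas 5 and 6 rely only on the abstract pessimistic backup $\hat{Q}_h = \min\{\hat{\mathbb{B}}_h\hat{V}_{h+1} - \Gamma_h,\, H-h+1\}^+$, independent of how $\hat{\mathbb{B}}_h$ and $\Gamma_h$ are constructed. A minimal tabular sketch of this backup (array shapes and names are illustrative, not the paper's implementation):

```python
import numpy as np

def pessimistic_value_iteration(r_hat, P_hat, Gamma, H):
    """Generic pessimistic value iteration (PEVI-style backup).

    r_hat[h][s, a]: estimated rewards; P_hat[h][s, a, s']: estimated transitions;
    Gamma[h][s, a]: penalties inducing the pessimism.
    Returns the pessimistic values and the greedy policy."""
    S, A = r_hat[0].shape
    V = np.zeros(S)  # V_hat_{H+1} = 0
    values, policy = [None] * H, [None] * H
    for h in reversed(range(H)):
        # Estimated Bellman backup minus penalty, truncated to [0, H - h].
        Q = r_hat[h] + P_hat[h] @ V - Gamma[h]
        Q = np.clip(Q, 0.0, H - h)
        policy[h] = Q.argmax(axis=1)
        V = Q.max(axis=1)
        values[h] = V
    return values, policy
```

With zero penalties this reduces to ordinary value iteration on the estimated model; large penalties drive the pessimistic values toward zero, which is the mechanism that keeps the output policy on well-covered state-action pairs.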

D THE HETPEVI ALGORITHM

D.1 GOOD EVENT

Lemma 7. The following event holds with probability at least $1 - \delta$ for HetPEVI:
$$\mathcal{G} := \Big\{ \text{(i) the penalties } \{\Gamma_h(s, a)\} \text{ induce a valid pessimism; (ii) } N_{h,l}(s, a) \geq c K d^{\rho_l}_{h,l}(s, a),\; \forall (l, s, a, h) \in [L] \times \mathcal{S} \times \mathcal{A} \times [H] \Big\}.$$
Proof. Part (i) is from Lemma 8, and part (ii) is obtained via Lemma 9.

Lemma 8. The penalty
$$\Gamma_h(s, a) = c\sqrt{\sum_{l \in [L]} \frac{H^2}{(L^2 N_{h,l}(s, a)) \vee L}} + c\sqrt{\frac{H^2}{L}}$$
in HetPEVI induces a valid pessimism with probability at least $1 - \delta$.

Proof. For a fixed $(s, a, h)$, it holds that
$$\hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) - \mathbb{B}_h \hat{V}_{h+1}(s, a) = \hat{r}_h(s, a) + \hat{P}_h \hat{V}_{h+1}(s, a) - r_h(s, a) - P_h \hat{V}_{h+1}(s, a)$$
$$= \sum_{l \in [L]: N_{h,l}(s, a) > 0} \frac{1}{L}\left[\frac{\sum_{k \in [K]: (s^k_{h,l}, a^k_{h,l}) = (s, a)} \hat{V}_{h+1}(s^k_{h+1,l})}{N_{h,l}(s, a)} - P_{h,l}\hat{V}_{h+1}(s, a)\right] + \left[\sum_{l \in [L]} \frac{1}{L}\left(r_{h,l}(s, a) + P_{h,l}\hat{V}_{h+1}(s, a)\right) - r_h(s, a) - P_h\hat{V}_{h+1}(s, a)\right] - \sum_{l \in [L]: N_{h,l}(s, a) = 0} \frac{1}{L}\left[r_{h,l}(s, a) + P_{h,l}\hat{V}_{h+1}(s, a)\right].$$
Then, via the Hoeffding inequality, it can be recognized that with probability at least $1 - \delta$,
$$\left|\hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a) - \mathbb{B}_h \hat{V}_{h+1}(s, a)\right| \leq c\sqrt{\sum_{l \in [L]: N_{h,l}(s, a) > 0} \frac{H^2}{L^2 N_{h,l}(s, a)}} + c\sqrt{\frac{H^2}{L}} + \sum_{l \in [L]: N_{h,l}(s, a) = 0} \frac{H}{L}$$
$$\leq c\sqrt{\sum_{l \in [L]: N_{h,l}(s, a) > 0} \frac{H^2}{L^2 N_{h,l}(s, a)} + \sum_{l \in [L]: N_{h,l}(s, a) = 0} \frac{H^2}{L}} + c\sqrt{\frac{H^2}{L}} = c\sqrt{\sum_{l \in [L]} \frac{H^2}{(L^2 N_{h,l}(s, a)) \vee L}} + c\sqrt{\frac{H^2}{L}}.$$

Lemma 9. With probability at least $1 - \delta$, it holds that $N_{h,l}(s, a) \geq c K d^{\rho_l}_{h,l}(s, a)$ for all $(l, s, a, h) \in [L] \times \mathcal{S} \times \mathcal{A} \times [H]$.

Proof. The proof can be done similarly as in Lemma B.3 of (Xie et al., 2021b).
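As a quick illustration, the penalty of Lemma 8 can be computed from the per-source visit counts alone. The sketch below sets $c = 1$ and drops logarithmic factors, so it is illustrative rather than the exact constant-carrying penalty:

```python
import numpy as np

def hetpevi_penalty(N, H, c=1.0):
    """HetPEVI penalty at one (s, a, h), following the Hoeffding-style bound above.

    N: length-L sequence of visit counts N_{h,l}(s, a) across the L sources.
    c hides logarithmic factors; c = 1 is for illustration only."""
    L = len(N)
    # Sample uncertainty: a source with N = 0 contributes H^2 / L via (L^2 * 0) v L = L.
    sample_term = np.sqrt(np.sum(H**2 / np.maximum(L**2 * np.asarray(N, dtype=float), L)))
    # Source uncertainty: independent of the per-source sample sizes.
    source_term = np.sqrt(H**2 / L)
    return c * (sample_term + source_term)
```

Increasing the per-source counts only shrinks the first term; the $\sqrt{H^2/L}$ source-uncertainty floor is untouched, matching the two uncertainty types discussed in the main text.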

D.2 SUBOPTIMALITY GAP

Proof of Theorems 2 and 3. By Lemmas 6 and 7, with probability at least $1 - \delta$, it holds that $\mathrm{Gap}(\hat{\pi}; \mathcal{M}) \leq 2\sum_{h \in [H]} \mathbb{E}_{\pi^*, \mathcal{M}}[\Gamma_h(s_h, a_h)]$. With the individual coverage assumption (i.e., Assumption 2), it holds that
$$\sum_{h \in [H]} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} d^{\pi^*}_h(s, a)\left[\sqrt{\sum_{l \in [L]} \frac{H^2}{(L^2 N_{h,l}(s, a)) \vee L}} + \sqrt{\frac{H^2}{L}}\right] \overset{(a)}{\leq} \sum_{h \in [H]} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} d^{\pi^*}_h(s, a)\left[\sqrt{\sum_{l \in [L]} \frac{H^2}{(L^2 K d^{\rho_l}_{h,l}(s, a)) \vee L}} + \sqrt{\frac{H^2}{L}}\right] \overset{(b)}{=} \tilde{O}\left(\sqrt{\frac{C^* H^4 S}{LK}} + \sqrt{\frac{H^4}{L}}\right),$$
where inequality (a) is from part (ii) of event $\mathcal{G}$ in Lemma 7, and equality (b) uses Assumption 2 and the Cauchy-Schwarz inequality. With the collective coverage assumption (i.e., Assumption 3), it can be similarly obtained that
$$\sum_{h \in [H]} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} d^{\pi^*}_h(s, a)\left[\sqrt{\sum_{l \in [L]} \frac{H^2}{(L^2 N_{h,l}(s, a)) \vee L}} + \sqrt{\frac{H^2}{L}}\right] \leq \sum_{h \in [H]} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} d^{\pi^*}_h(s, a)\left[\sqrt{\sum_{l \in [L]: d^{\rho_l}_{h,l}(s, a) > 0} \frac{H^2}{L^2 K d^{\rho_l}_{h,l}(s, a)} + \sum_{l \in [L]: d^{\rho_l}_{h,l}(s, a) = 0} \frac{H^2}{L}} + \sqrt{\frac{H^2}{L}}\right] \overset{(c)}{=} \tilde{O}\left(\sqrt{\frac{C^\dagger H^4 S}{LK}} + \sqrt{\frac{(L + 1 - L^\dagger) H^4}{L}}\right),$$
where step (c) uses Assumption 3 and the Cauchy-Schwarz inequality.

E THE HETPEVI-ADV ALGORITHM

E.1 ALGORITHM DESIGN

HetPEVI-Adv is designed as an enhanced version of HetPEVI with a Bernstein-type penalty term for the sample uncertainties. In particular, it shares the same procedure as HetPEVI except that the adopted penalty is
$$\Gamma^{\alpha}_h(s, a) = c\sqrt{\sum_{l \in [L]} \frac{\hat{\mathbb{V}}_{h,l}\hat{V}_{h+1}(s, a)}{L^2 N_{h,l}(s, a)}} + c\sqrt{\sum_{l \in [L]} \frac{H^2}{L^2 (N_{h,l}(s, a))^2}},$$
where $\hat{\mathbb{V}}_{h,l}\hat{V}_{h+1}(s, a)$ is the empirical variance of $\hat{V}_{h+1}(s')$ with $s' \sim \hat{P}_{h,l}(\cdot|s, a)$. Note that, compared with HetPEVI, the variance information is incorporated in HetPEVI-Adv.
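The Bernstein-type penalty replaces the worst-case $H^2$ range by the empirical variance of $\hat{V}_{h+1}$ under each source's empirical transitions. A hedged sketch of this computation (assumes all $N_{h,l}(s, a) > 0$; $c = 1$ and logarithmic factors dropped):

```python
import numpy as np

def hetpevi_adv_penalty(next_values, N, H, c=1.0):
    """Bernstein-type HetPEVI-Adv penalty at one (s, a, h) (illustrative sketch).

    next_values[l]: the V_hat_{h+1}(s') samples observed from source l at this
    (s, a); N[l] = N_{h,l}(s, a) = len(next_values[l]) > 0 is assumed.
    c hides logarithmic factors."""
    L = len(N)
    # Empirical conditional variances of V_hat_{h+1} under P_hat_{h,l}(.|s, a).
    var_hat = np.array([np.var(v) for v in next_values])
    N = np.asarray(N, dtype=float)
    main = np.sqrt(np.sum(var_hat / (L**2 * N)))          # variance-aware leading term
    lower_order = np.sqrt(np.sum(H**2 / (L**2 * N**2)))   # lower-order correction
    return c * (main + lower_order)
```

When $\hat{V}_{h+1}(s')$ is nearly constant across the observed next states, the leading term vanishes and only the $O(1/N)$ correction remains, which is how the Bernstein penalty can improve on the Hoeffding one.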

E.2 THEORETICAL ANALYSIS

Theorem 6. Under Assumptions 1 and 2, w.p. at least $1 - \delta$, the output policy $\hat{\pi}$ of HetPEVI-Adv satisfies
$$\mathrm{Gap}(\hat{\pi}; \mathcal{M}) = \tilde{O}\left(\sqrt{\frac{C^* H^3 S}{LK}} + \frac{C^* H^3 S}{\sqrt{L}\,K} + \sqrt{\frac{C^* H^7 S}{L^2 K}} + \sqrt{\frac{H^4}{L}}\right).$$
It is noted that the first three terms come from the finite samples and the last term from the finite sources. Furthermore, when $LK$ is sufficiently large, the first term dominates the second and third terms; thus, in this regime, it can be observed that HetPEVI-Adv has a tight performance loss with respect to the finite samples. However, the performance loss due to the finite data sources still has an additional $H$ factor compared with the lower bound, which is left open for further investigation.

E.3 GOOD EVENT

Lemma 10. Under Assumptions 1 and 2, with probability at least $1 - \delta$, the following good event $\mathcal{G}$ happens:
$$\mathcal{G} := \Big\{ \text{(i) the penalties } \{\Gamma_h(s, a) : (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]\} \text{ induce a valid pessimism; (ii) } V^*_h(s) - \hat{V}_h(s) = \tilde{O}\Big(\sqrt{\tfrac{C^* H^4 S}{KL}} + \sqrt{\tfrac{H^4}{L}}\Big),\; \forall (s, h) \in \mathcal{S} \times [H]; \text{ (iii) } N_{h,l}(s, a) \geq c K d^{\rho_l}_{h,l}(s, a),\; \forall (l, s, a, h) \in [L] \times \mathcal{S} \times \mathcal{A} \times [H]; \text{ (iv) } \hat{\mathbb{V}}_{h,l} f(s, a) \leq (\mathbb{V}_{h,l} f)(s, a) + c\sqrt{\tfrac{H^4}{N_{h,l}(s, a)}},\; \forall (l, s, a, h) \in [L] \times \mathcal{S} \times \mathcal{A} \times [H]; \text{ (vii) } \bar{\mathbb{V}}_h f(s, a) \leq (\mathbb{V}_h f)(s, a) + c\sqrt{\tfrac{H^4}{L}},\; \forall (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H] \Big\},$$
where $f$ is any fixed function $f: \mathcal{S} \to [-H, H]$.

Proof. Part (i) can be obtained similarly as Lemma 8, using the enhanced Bernstein inequality proved in Lemma 26; the remaining parts are established in the following lemmas.

Lemma 11. Under Assumptions 1 and 2, with probability at least $1 - \delta$, the following crude bound on the output value of HetPEVI-Adv holds:
$$V^*_h(s) - \hat{V}_h(s) = \tilde{O}\left(\sqrt{\frac{C^* H^4 S}{KL}} + \sqrt{\frac{H^4}{L}}\right),\; \forall (s, h) \in \mathcal{S} \times [H].$$
Proof. It can be observed that
$$\Gamma^{\alpha}_h(s, a) = c\sqrt{\sum_{l \in [L]} \frac{\hat{\mathbb{V}}_{h,l}\hat{V}_{h+1}(s, a)}{L^2 N_{h,l}(s, a)}} + c\sqrt{\sum_{l \in [L]} \frac{H^2}{L^2 (N_{h,l}(s, a))^2}} \leq c\sqrt{\sum_{l \in [L]} \frac{H^2}{L^2 N_{h,l}(s, a)}}.$$
Using this upper bound on $\Gamma^{\alpha}_h(s, a)$, the lemma can be obtained via similar steps as in the proof of Theorem 2.

Lemma 12. With probability at least $1 - \delta$, it holds that $N_{h,l}(s, a) \geq c K d^{\rho_l}_{h,l}(s, a)$ for all $(l, s, a, h) \in [L] \times \mathcal{S} \times \mathcal{A} \times [H]$.
Proof. The proof can be done similarly as in Lemma B.3 of (Xie et al., 2021b).

Lemma 13. With probability at least $1 - \delta$, for a fixed function $f: \mathcal{S} \to [-H, H]$, it holds that
$$\hat{\mathbb{V}}_{h,l} f(s, a) \leq (\mathbb{V}_{h,l} f)(s, a) + c\sqrt{\frac{H^4}{N_{h,l}(s, a)}},\; \forall (l, s, a, h) \in [L] \times \mathcal{S} \times \mathcal{A} \times [H].$$
Proof. With a union bound over $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$ and the Hoeffding inequality, with probability at least $1 - \delta$, it holds that
$$\hat{\mathbb{V}}_{h,l} f(s, a) - (\mathbb{V}_{h,l} f)(s, a) = \left(\hat{P}_{h,l} - P_{h,l}\right) f^2(s, a) + \left[\left(P_{h,l} - \hat{P}_{h,l}\right) f(s, a)\right] \cdot \left[\left(P_{h,l} + \hat{P}_{h,l}\right) f(s, a)\right] \leq c\sqrt{\frac{H^4}{N_{h,l}(s, a)}},$$
which concludes the proof.

Lemma 14. With probability at least $1 - \delta$, for a fixed function $f: \mathcal{S} \to [-H, H]$, it holds that
$$\bar{\mathbb{V}}_h f(s, a) \leq (\mathbb{V}_h f)(s, a) + c\sqrt{\frac{H^4}{L}},\; \forall (s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H].$$
Proof. With a union bound over $(s, a, h) \in \mathcal{S} \times \mathcal{A} \times [H]$ and the Hoeffding inequality, with probability at least $1 - \delta$, it holds that
$$\bar{\mathbb{V}}_h f(s, a) - (\mathbb{V}_h f)(s, a) = \left(\bar{P}_h - P_h\right) f^2(s, a) + \left[\left(\bar{P}_h - P_h\right) f(s, a)\right] \cdot \left[\left(\bar{P}_h + P_h\right) f(s, a)\right] \leq c\sqrt{\frac{H^4}{L}},$$
which concludes the proof.

E.4 SUBOPTIMALITY GAP

Lemma 15. It holds that $\sum_{h \in [H]} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} d^{\pi^*}_h(s, a)\,[\mathbb{V}_h V^*_{h+1}](s, a) \leq H^2$.
Proof. The proof can be found in Lemma C.4 in Xie et al. (2021b).

Lemma 16. It holds that $\sum_{l \in [L]} \mathbb{V}_{h,l} V^*_{h+1}(s, a) \leq L\, \bar{\mathbb{V}}_h V^*_{h+1}(s, a)$.
Proof. This lemma is a direct consequence of Lemma 28.

Lemma 17. With event $\mathcal{G}$ happening, it holds that
$$\sum_{h \in [H]} \sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} \sum_{l \in [L]} d^{\pi^*}_h(s, a)\, \mathbb{V}_{h,l} \hat{V}_{h+1}(s, a) \leq L H^2 + c\sqrt{\frac{C^* H^8 S L}{K}} + c\sqrt{H^8 L}.$$
Proof. First, the left-hand side can be decomposed as
$$\sum_{h, (s, a), l} d^{\pi^*}_h(s, a)\, \mathbb{V}_{h,l}\hat{V}_{h+1}(s, a) = \underbrace{\sum_{h, (s, a)} d^{\pi^*}_h(s, a)\, L\, \mathbb{V}_h V^*_{h+1}(s, a)}_{\text{term (I)}} + \underbrace{\sum_{h, (s, a)} d^{\pi^*}_h(s, a)\, L\left[\bar{\mathbb{V}}_h V^*_{h+1}(s, a) - \mathbb{V}_h V^*_{h+1}(s, a)\right]}_{\text{term (II)}} + \underbrace{\sum_{h, (s, a)} d^{\pi^*}_h(s, a)\left[\sum_{l \in [L]} \mathbb{V}_{h,l} V^*_{h+1}(s, a) - L\, \bar{\mathbb{V}}_h V^*_{h+1}(s, a)\right]}_{\text{term (III)}} + \underbrace{\sum_{h, (s, a), l} d^{\pi^*}_h(s, a)\left[\mathbb{V}_{h,l}\hat{V}_{h+1}(s, a) - \mathbb{V}_{h,l} V^*_{h+1}(s, a)\right]}_{\text{term (IV)}}.$$
Then, for term (I), with Lemma 15, it holds that term (I) $\leq L H^2$. For term (II), with event (vii) in Lemma 10, it holds that
$$\text{term (II)} \leq c \sum_{h, (s, a)} d^{\pi^*}_h(s, a)\, L \sqrt{\frac{H^4}{L}} = c\sqrt{H^6 L}.$$
For term (III), with Lemma 16, it holds that term (III) $\leq 0$. For term (IV), we can obtain that
$$\text{term (IV)} \overset{(a)}{\leq} 4H \sum_{h, (s, a)} \sum_{l \in [L]} d^{\pi^*}_h(s, a) \left\|\hat{V}_{h+1}(\cdot) - V^*_{h+1}(\cdot)\right\|_\infty \overset{(b)}{\leq} cH \sum_{h, (s, a)} \sum_{l \in [L]} d^{\pi^*}_h(s, a) \left(\sqrt{\frac{C^* H^4 S}{LK}} + \sqrt{\frac{H^4}{L}}\right) = c\sqrt{\frac{C^* H^8 S L}{K}} + c\sqrt{H^8 L},$$
where inequality (a) is from simple algebra based on the definition of the variance, and inequality (b) is from part (ii) of event $\mathcal{G}$ in Lemma 10.

Proof of Theorem 6. By Lemmas 6 and 10, with probability at least $1 - \delta$, it holds that $\mathrm{Gap}(\hat{\pi}; \mathcal{M}) \leq 2\sum_{h \in [H]} \mathbb{E}_{\pi^*, \mathcal{M}}[\Gamma_h(s_h, a_h)]$.
Furthermore, it can be obtained that
$$\Gamma^{\alpha}_h(s, a) = c\sqrt{\sum_{l \in [L]} \frac{\hat{\mathbb{V}}_{h,l}\hat{V}_{h+1}(s, a)}{L^2 N_{h,l}(s, a)}} + c\sqrt{\sum_{l \in [L]} \frac{H^2}{L^2 (N_{h,l}(s, a))^2}} \overset{(a)}{\leq} c\sqrt{\sum_{l \in [L]} \frac{\mathbb{V}_{h,l}\hat{V}_{h+1}(s, a)}{L^2 N_{h,l}(s, a)}} + c\sqrt{\sum_{l \in [L]} \frac{1}{L^2 N_{h,l}(s, a)}} + c\sqrt{\sum_{l \in [L]} \frac{H^4}{L^2 (N_{h,l}(s, a))^2}},$$
where inequality (a) is from event (iv) of Lemma 10 together with $H^2/N^{3/2} \leq \frac{1}{2}(1/N + H^4/N^2)$. Then, for each separate term, it holds that
$$\text{term (I)} := c\sum_{h \in [H]}\sum_{(s, a) \in \mathcal{S} \times \mathcal{A}} d^{\pi^*}_h(s, a)\sqrt{\sum_{l \in [L]} \frac{\mathbb{V}_{h,l}\hat{V}_{h+1}(s, a)}{L^2 N_{h,l}(s, a)}} \overset{(b)}{\leq} c\sum_{h, (s, a)} d^{\pi^*}_h(s, a)\sqrt{\sum_{l \in [L]} \frac{\mathbb{V}_{h,l}\hat{V}_{h+1}(s, a)}{L^2 K d^{\rho_l}_{h,l}(s, a)}} \leq c\sqrt{\frac{C^* H S}{L^2 K}}\sqrt{\sum_{h, (s, a), l} d^{\pi^*}_h(s, a)\mathbb{1}\{a = \pi^*_h(s)\}\,\mathbb{V}_{h,l}\hat{V}_{h+1}(s, a)}$$
$$\overset{(c)}{\leq} c\sqrt{\frac{C^* H S}{L^2 K}\left(L H^2 + \sqrt{\frac{C^* H^8 S L}{K}} + \sqrt{H^8 L}\right)} \leq c\sqrt{\frac{C^* H^3 S}{LK}} + c\,\frac{C^* H^3 S}{\sqrt{L}\,K} + c\sqrt{\frac{C^* H^7 S}{L^2 K}};$$
$$\text{term (II)} := c\sum_{h \in [H]}\sum_{(s, a)} d^{\pi^*}_h(s, a)\sqrt{\sum_{l \in [L]} \frac{1}{L^2 N_{h,l}(s, a)}} \overset{(d)}{\leq} c\sum_{h, (s, a)} d^{\pi^*}_h(s, a)\sqrt{\sum_{l \in [L]} \frac{1}{K L^2 d^{\rho_l}_{h,l}(s, a)}} \leq c\sqrt{\frac{C^*}{KL}}\sum_{h, (s, a)} \sqrt{d^{\pi^*}_h(s, a)}\,\mathbb{1}\{a = \pi^*_h(s)\} \leq c\sqrt{\frac{C^* H^2 S}{LK}};$$
$$\text{term (III)} := c\sum_{h \in [H]}\sum_{(s, a)} d^{\pi^*}_h(s, a)\sqrt{\sum_{l \in [L]} \frac{H^4}{L^2 (N_{h,l}(s, a))^2}} \overset{(e)}{\leq} c\sum_{h, (s, a)} d^{\pi^*}_h(s, a)\sqrt{\sum_{l \in [L]} \frac{H^4}{L^2 (K d^{\rho_l}_{h,l}(s, a))^2}} \leq c\sum_{h, (s, a)} \mathbb{1}\{a = \pi^*_h(s)\}\sqrt{\sum_{l \in [L]} \frac{(C^*)^2 H^4}{L^2 K^2}} \leq c\,\frac{C^* S H^3}{\sqrt{L}\,K},$$
where inequalities (b), (d) and (e) are from event (iii) of Lemma 10, inequality (c) is from Lemma 17, and the remaining steps use Assumption 2 and the Cauchy-Schwarz inequality. By aggregating these three terms and adding the sum of the source uncertainties, it can be observed that
$$\mathrm{Gap}(\hat{\pi}; \mathcal{M}) = \tilde{O}\left(\sqrt{\frac{C^* H^3 S}{LK}} + \frac{C^* H^3 S}{\sqrt{L}\,K} + \sqrt{\frac{C^* H^7 S}{L^2 K}} + \sqrt{\frac{H^4}{L}}\right).$$

F THE HETPEVI-LIN ALGORITHM

F.1 PROPERTIES OF LINEAR MDPS

Proof of Lemma 1. This proof is standard for studies of linear MDPs (Jin et al., 2020; 2021b). We include it here for completeness and to facilitate the analysis of the next lemma. Based on the Bellman equation, it holds that
$$(\mathbb{B}_{h,l} f)(s, a) = r_{h,l}(s, a) + (P_{h,l} f)(s, a) = \langle \phi(s, a), \theta_{h,l} \rangle + \int_{\mathcal{S}} f(s')\, \langle \phi(s, a), d\mu_{h,l}(s') \rangle,$$
which means that $(\mathbb{B}_{h,l} f)(s, a) = \langle \phi(s, a), w^f_{h,l} \rangle$ with $w^f_{h,l} = \theta_{h,l} + \int_{\mathcal{S}} f(s')\, d\mu_{h,l}(s')$. Similar arguments hold for $(\mathbb{B}_h f)(s, a) = \langle \phi(s, a), w^f_h \rangle$ with $w^f_h = \theta_h + \int_{\mathcal{S}} f(s')\, d\mu_h(s')$.

Proof of Lemma 2. First, it can be observed that the transitions and rewards of the sample-average MDP $\bar{\mathcal{M}}$ are linear, since
$$\bar{P}_h(s'|s, a) = \sum_{l \in [L]} P_{h,l}(s'|s, a)/L = \sum_{l \in [L]} \langle \phi(s, a), \mu_{h,l}(s') \rangle / L = \langle \phi(s, a), \bar{\mu}_h(s') \rangle; \qquad \bar{r}_h(s, a) = \sum_{l \in [L]} r_{h,l}(s, a)/L = \langle \phi(s, a), \bar{\theta}_h \rangle,$$
where $\bar{\mu}_h(s') := \sum_{l \in [L]} \mu_{h,l}(s')/L$ and $\bar{\theta}_h := \sum_{l \in [L]} \theta_{h,l}/L$. The norm constraints in Definition 1 can be verified as
$$\max\left\{\big\|\bar{\mu}_h(\mathcal{S})\big\|_2, \big\|\bar{\theta}_h\big\|_2\right\} = \max\left\{\Big\|\sum_{l \in [L]} \mu_{h,l}(\mathcal{S})/L\Big\|_2, \Big\|\sum_{l \in [L]} \theta_{h,l}/L\Big\|_2\right\} \leq \max\left\{\sum_{l \in [L]} \|\mu_{h,l}(\mathcal{S})\|_2/L, \sum_{l \in [L]} \|\theta_{h,l}\|_2/L\right\} \leq \sqrt{d},\; \forall h \in [H].$$
Moreover, it holds that
$$(\bar{\mathbb{B}}_h f)(s, a) = \bar{r}_h(s, a) + (\bar{P}_h f)(s, a) = \Big\langle \phi(s, a),\; \bar{\theta}_h + \int_{\mathcal{S}} f(s')\, d\bar{\mu}_h(s') \Big\rangle = \Big\langle \phi(s, a),\; \sum_{l \in [L]} \Big[\theta_{h,l} + \int_{\mathcal{S}} f(s')\, d\mu_{h,l}(s')\Big]/L \Big\rangle = \Big\langle \phi(s, a),\; \sum_{l \in [L]} w^f_{h,l}/L \Big\rangle = \langle \phi(s, a), \bar{w}^f_h \rangle,$$
which proves the claim.
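The per-source estimates $\hat{w}_{h,l} = \Lambda^{-1}_{h,l}\sum_k \phi(s^k_{h,l}, a^k_{h,l})[r^k_{h,l} + \hat{V}_{h+1}(s^k_{h+1,l})]$ used later in the analysis are standard ridge-regression solutions, which are then averaged across sources. A minimal sketch of this step (shapes and helper names are illustrative):

```python
import numpy as np

def fit_w_hat(phis, targets, lam=1.0):
    """Ridge estimate w_hat_{h,l} = Lambda^{-1} sum_k phi_k * y_k, with
    Lambda = sum_k phi_k phi_k^T + lam * I and y_k = r_k + V_hat_{h+1}(s'_k).
    This is one least-squares value iteration step for a single source."""
    phis = np.asarray(phis, dtype=float)  # K x d design matrix
    Lambda = phis.T @ phis + lam * np.eye(phis.shape[1])
    return np.linalg.solve(Lambda, phis.T @ np.asarray(targets, dtype=float))

def averaged_w_hat(per_source_ws):
    """HetPEVI-Lin averages the per-source estimates: w_hat_h = mean_l w_hat_{h,l}."""
    return np.mean(per_source_ws, axis=0)
```

On noiseless linear targets and a small regularizer, the ridge solution recovers the underlying weight vector, mirroring the role of $w_{h,l}$ in Lemma 1.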

F.2 GOOD EVENT

Lemma 18. The following event holds with probability at least $1 - \delta$ for HetPEVI-Lin:
$$\mathcal{G} := \Big\{ \text{(i) the penalties } \{\Gamma_h(s, a)\} \text{ induce a valid pessimism; (ii) } \big\|\Xi_{h,l} - \hat{\Xi}_{h,l}\big\|_2 \leq c\sqrt{\frac{1}{K}},\; \forall (l, h) \in [L] \times [H] \Big\},$$
where
$$\Xi_{h,l} := \mathbb{E}_{\rho_l, \mathcal{M}_l}\left[\phi(s_h, a_h)\phi(s_h, a_h)^\top\right], \qquad \hat{\Xi}_{h,l} := \frac{1}{K}\sum_{k \in [K]} \phi(s^k_{h,l}, a^k_{h,l})\phi(s^k_{h,l}, a^k_{h,l})^\top.$$
Proof. Part (i) holds according to Lemma 22, and part (ii) follows from Lemma 30.

In the following, for all $h \in [H]$, based on the assumption that the matrix $\sum_{l \in [L]} \Lambda^{-1}_{h,l}$ is invertible, we denote
$$\Upsilon_h := \Big(\sum_{l \in [L]} \Lambda^{-1}_{h,l}\Big)^{-1}, \qquad \Upsilon^{-1}_h := \sum_{l \in [L]} \Lambda^{-1}_{h,l}.$$
As shown in Lemma 19, the estimation error can be bounded by $\|\phi(s, a)\|_{\Upsilon^{-1}_h} \|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h}$, and thus we can instead control $\|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h}$. For this purpose, we can observe that
$$\|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h} = \langle \bar{w}_h - \hat{w}_h, (\Upsilon_h)^{1/2} X \rangle, \quad \text{where } X := \frac{(\Upsilon_h)^{1/2}(\bar{w}_h - \hat{w}_h)}{\|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h}} \in \mathbb{S}^{d-1}.$$
With Lemma 29, we can find an $\varepsilon$-covering $\mathcal{C}_\varepsilon$ of $\mathbb{S}^{d-1}$ with $|\mathcal{C}_\varepsilon| \leq (3/\varepsilon)^d$. Using Lemma 20 and a union bound, with probability at least $1 - \delta$, for all $(y, h) \in \mathcal{C}_\varepsilon \times [H]$, it holds that
$$\langle (\Upsilon_h)^{1/2} y, \bar{w}_h - \hat{w}_h \rangle \leq cH\sqrt{\frac{\log(H|\mathcal{C}_\varepsilon|/\delta)}{L^2} + \frac{d\lambda}{L}} \leq cH\sqrt{\frac{d\log(H/(\varepsilon\delta))}{L^2} + \frac{d\lambda}{L}}.$$
With this event happening, for any $x \in \mathbb{S}^{d-1}$, there exists $y \in \mathcal{C}_\varepsilon$ such that $\|x - y\|_2 \leq \varepsilon$, and further it holds that
$$\|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h} = \max_{x \in \mathbb{S}^{d-1}} \langle \bar{w}_h - \hat{w}_h, (\Upsilon_h)^{1/2} x \rangle = \max_{x \in \mathbb{S}^{d-1}} \min_{y \in \mathcal{C}_\varepsilon} \left[\langle \bar{w}_h - \hat{w}_h, (\Upsilon_h)^{1/2} y \rangle + \langle \bar{w}_h - \hat{w}_h, (\Upsilon_h)^{1/2}(x - y) \rangle\right] \leq cH\sqrt{\frac{d\log(H/(\varepsilon\delta))}{L^2} + \frac{d\lambda}{L}} + \varepsilon\,\|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h}.$$
Thus, with $\varepsilon = 1/2$, with probability at least $1 - \delta$, for all $h \in [H]$, it holds that
$$\|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h} \leq cH\sqrt{\frac{d}{L^2} + \frac{d\lambda}{L}}.$$
Thus, for any $(s, a) \in \mathcal{S} \times \mathcal{A}$, it holds that
$$\left|\bar{\mathbb{B}}_h \hat{V}_{h+1}(s, a) - \hat{\mathbb{B}}_h \hat{V}_{h+1}(s, a)\right| \leq \|\phi(s, a)\|_{\Upsilon^{-1}_h} \|\bar{w}_h - \hat{w}_h\|_{\Upsilon_h} \leq cH\sqrt{\frac{d}{L^2} + \frac{d\lambda}{L}}\, \|\phi(s, a)\|_{\Upsilon^{-1}_h}.$$
With a union bound over $h \in [H]$, the lemma is proved.

Lemma 20. For a fixed vector $x \in \mathbb{R}^d$ and a fixed $h \in [H]$, with probability at least $1 - \delta$, it holds that
$$|\langle x, \bar{w}_h - \hat{w}_h \rangle| \leq \left(\frac{H}{L}\sqrt{2\log(2/\delta)} + c\sqrt{\frac{dH^2\lambda}{L}}\right)\|x\|_{\Upsilon^{-1}_h}.$$
Proof.
It holds that
$$x^\top(\bar{w}_h - \hat{w}_h) = x^\top \Big(\sum_{l \in [L]} w_{h,l}/L - \sum_{l \in [L]} \hat{w}_{h,l}/L\Big) \overset{(a)}{=} \underbrace{\lambda\, x^\top \sum_{l \in [L]} \frac{1}{L}\Lambda^{-1}_{h,l} w_{h,l}}_{\text{term (I)}} + \underbrace{x^\top \sum_{l \in [L]} \frac{1}{L}\Lambda^{-1}_{h,l} \sum_{k \in [K]} \phi(s^k_{h,l}, a^k_{h,l})\left[\phi(s^k_{h,l}, a^k_{h,l})^\top w_{h,l} - r^k_{h,l} - \hat{V}_{h+1}(s^k_{h+1,l})\right]}_{\text{term (II)}},$$
where the definitions of $\Lambda_{h,l}$ and $\hat{w}_{h,l}$ are used in equation (a). For term (I), we have
$$\lambda\left|x^\top \sum_{l \in [L]} \frac{1}{L}\Lambda^{-1}_{h,l} w_{h,l}\right| \leq \frac{\lambda}{L}\|x\|_{\Upsilon^{-1}_h}\sqrt{\sum_{l \in [L]} w^\top_{h,l}\Lambda^{-1}_{h,l} w_{h,l}} \leq \frac{\lambda}{L}\|x\|_{\Upsilon^{-1}_h}\sqrt{4H^2 L d/\lambda} = c\sqrt{\frac{dH^2\lambda}{L}}\,\|x\|_{\Upsilon^{-1}_h},$$
which is based on the fact that $\|w_{h,l}\|_2 \leq 2H\sqrt{d}$ for all $(l, h) \in [L] \times [H]$ (which can be proved similarly to Lemma B.1 in Jin et al. (2020)) together with the observation $w^\top_{h,l}\Lambda^{-1}_{h,l}w_{h,l} \leq \|w_{h,l}\|^2_2 \|\Lambda^{-1}_{h,l}\|_2 \leq 4dH^2/\lambda$. For term (II), with $t := \frac{H}{L}\sqrt{2\, x^\top\Upsilon^{-1}_h x \log(2/\delta)}$ and conditioned on the state-action pairs at step $h$, we have
$$\mathbb{P}\left[\left|x^\top \sum_{l \in [L]} \frac{1}{L}\Lambda^{-1}_{h,l}\sum_{k \in [K]} \phi(s^k_{h,l}, a^k_{h,l})\left[\phi(s^k_{h,l}, a^k_{h,l})^\top w_{h,l} - r^k_{h,l} - \hat{V}_{h+1}(s^k_{h+1,l})\right]\right| \geq t\right] \leq 2\exp\left(-\frac{2t^2}{4H^2\sum_{l \in [L]}\sum_{k \in [K]} \frac{1}{L^2}\big(x^\top\Lambda^{-1}_{h,l}\phi(s^k_{h,l}, a^k_{h,l})\big)^2}\right)$$
$$\leq 2\exp\left(-\frac{x^\top\Upsilon^{-1}_h x\,\log(2/\delta)}{\sum_{l \in [L]} x^\top\Lambda^{-1}_{h,l}\big(\sum_{k \in [K]} \phi(s^k_{h,l}, a^k_{h,l})\phi(s^k_{h,l}, a^k_{h,l})^\top + \lambda I\big)\Lambda^{-1}_{h,l}x}\right) \leq 2\exp\left(-\frac{x^\top\Upsilon^{-1}_h x\,\log(2/\delta)}{x^\top\Upsilon^{-1}_h x}\right) = \delta,$$
which, combined with term (I), concludes the proof.

Proof of Lemma 21 (stated below). For a fixed $\phi(s, a) \in \mathbb{R}^d$ and a fixed $\hat{V}_{h+1}$, since $(\mathbb{B}_{h,l}\hat{V}_{h+1})(s, a)$ is bounded in $[0, H]$ and has expectation $(\mathbb{B}_h\hat{V}_{h+1})(s, a)$, it is an $H$-sub-Gaussian random variable (Vershynin, 2018).

We denote $\Sigma_h = \mathbb{E}_{\pi^*, \mathcal{M}}\left[\phi(s_h, a_h)\phi(s_h, a_h)^\top\right]$, which is positive semi-definite, as are $\Xi_{h,l}$ and $\hat{\Xi}_{h,l}$, for all $h \in [H]$. Correspondingly, we denote $\mathrm{Rank}_h = \mathrm{Rank}(\Sigma_h)$ for all $h \in [H]$.
Also, for a positive semi-definite matrix $\Sigma \in \mathbb{R}^{d \times d}$, we denote its ordered eigenvalues as $\gamma_1(\Sigma) \geq \gamma_2(\Sigma) \geq \cdots \geq \gamma_d(\Sigma) \geq 0$. With Assumption 5, when $K \geq c\max_{h,l} (D^*)^2/(\gamma_{\mathrm{Rank}_h}(\Sigma_h))^2$ and $\lambda = 1/L$, the desired bound follows from a chain of inequalities (a)-(d), where inequality (a) is from the Cauchy-Schwarz inequality, and inequality (b) is from Von Neumann's trace inequality (Mirsky, 1975) and removes the zero eigenvalues of $\Sigma_h$ from the sum. Furthermore, for $i \leq \mathrm{Rank}_h$ (implying $\gamma_i(\Sigma_h) > 0$), inequality (d) is from the following observation:
$$\gamma_i(\hat{\Xi}_{h,l}) \overset{(e)}{\geq} \gamma_i(\Xi_{h,l}) - \big\|\Xi_{h,l} - \hat{\Xi}_{h,l}\big\|_2 \overset{(f)}{\geq} \gamma_i(\Xi_{h,l}) - \frac{c}{\sqrt{K}} \overset{(g)}{\geq} \frac{\gamma_i(\Sigma_h)}{D^*} - \frac{c}{\sqrt{K}} \overset{(h)}{\geq} c\,\frac{\gamma_i(\Sigma_h)}{D^*},$$
where inequality (e) is from Weyl's inequality (see, e.g., Chapter 8 in (Wainwright, 2019)), inequality (f) is from part (ii) of event $\mathcal{G}$ in Lemma 18, inequality (g) is from Assumption 5 (which implies $\gamma_i(\Sigma_h) \leq D^* \gamma_i(\Xi_{h,l})$ via Weyl's inequality), and inequality (h) is due to $\gamma_i(\Sigma_h) > 0$ for all $i \in [\mathrm{Rank}_h]$ and the sufficiently large $K$. With Assumption 6, we can similarly obtain the collective-coverage counterpart.

G SUPPORTING LEMMAS

G.1 SUBOPTIMALITY DECOMPOSITION

The suboptimality gap between an output policy $\hat{\pi}$ from an offline RL algorithm and the optimal policy $\pi^*$ can be decomposed as follows. In addition, the following comparison between the variance of a mixture and the weighted individual variances is used. For random variables $\{X_l : l \in [L]\}$, weights $\{\alpha_l \geq 0 : \sum_{l \in [L]} \alpha_l = 1\}$ and a constant $C \geq 0$, denote
$$\mathrm{LHS} := C\Big(\sum_{l \in [L]} \alpha_l \mathbb{E}[X_l^2] - \Big(\sum_{l \in [L]} \alpha_l \mathbb{E}[X_l]\Big)^2\Big), \qquad \mathrm{RHS} := \sum_{l \in [L]} \alpha_l^2\, \mathbb{V}(X_l) = \sum_{l \in [L]} \alpha_l^2 \mathbb{E}[X_l^2] - \sum_{l \in [L]} \alpha_l^2 (\mathbb{E}[X_l])^2.$$
With the above expressions, we can further obtain that
$$\mathrm{LHS} - \mathrm{RHS} = \sum_{l \in [L]} \alpha_l(C - \alpha_l)\mathbb{E}[X_l^2] - C\Big(\sum_{l \in [L]} \alpha_l \mathbb{E}[X_l]\Big)^2 + \sum_{l \in [L]} \alpha_l^2 (\mathbb{E}[X_l])^2$$
$$= \sum_{l \in [L]} \alpha_l(C - \alpha_l)\mathbb{E}[X_l^2] + \sum_{l \in [L]} \alpha_l^2(1 - C)(\mathbb{E}[X_l])^2 - C\sum_{l \neq n} \alpha_l \alpha_n\, \mathbb{E}[X_l]\mathbb{E}[X_n]$$
$$\geq \sum_{l \in [L]} \alpha_l(C - \alpha_l)\mathbb{E}[X_l^2] + \sum_{l \in [L]} \alpha_l^2(1 - C)(\mathbb{E}[X_l])^2 - C\sum_{l \neq n} \alpha_l \alpha_n\, \frac{(\mathbb{E}[X_l])^2 + (\mathbb{E}[X_n])^2}{2}$$
$$= \sum_{l \in [L]} \alpha_l(C - \alpha_l)\mathbb{E}[X_l^2] + \sum_{l \in [L]} \alpha_l^2(1 - C)(\mathbb{E}[X_l])^2 - C\sum_{l \in [L]} \alpha_l(1 - \alpha_l)(\mathbb{E}[X_l])^2$$
$$= \sum_{l \in [L]} \alpha_l(C - \alpha_l)\mathbb{E}[X_l^2] + \sum_{l \in [L]} \alpha_l(\alpha_l - C)(\mathbb{E}[X_l])^2 = \sum_{l \in [L]} \alpha_l(C - \alpha_l)\,\mathbb{V}(X_l),$$
which leads to the lemma.
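The chain of (in)equalities above involves only the first two moments of the $X_l$, so it can be verified numerically. A small sketch under the stated assumptions ($\alpha$ on the simplex, $C \geq 0$):

```python
import numpy as np

rng = np.random.default_rng(0)

def check_variance_comparison(L=5, C=1.0, trials=200):
    """Numerically verify LHS - RHS >= sum_l alpha_l (C - alpha_l) V(X_l) for
    random means/variances and simplex weights alpha (assumed setup above)."""
    for _ in range(trials):
        alpha = rng.dirichlet(np.ones(L))
        means = rng.uniform(-2, 2, size=L)
        variances = rng.uniform(0, 3, size=L)
        second = variances + means**2                      # E[X_l^2]
        lhs = C * (np.sum(alpha * second) - np.sum(alpha * means) ** 2)
        rhs = np.sum(alpha**2 * variances)
        bound = np.sum(alpha * (C - alpha) * variances)
        if lhs - rhs < bound - 1e-9:
            return False
    return True
```

With $\alpha_l = 1/L$ and $C = 1$, the bound rearranges to $\sum_l \mathbb{V}(X_l) \leq L \cdot \mathbb{V}(\text{mixture})$, which is exactly the form used by Lemma 16.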



The assumption of deterministic rewards is standard in the theoretical analysis of RL (Jin et al., 2018; 2020), as the uncertainties in estimating rewards are dominated by those in estimating transitions.



with $\hat V_{H+1}(s) = 0$, $\forall s \in \mathcal S$, and the empirical Bellman operator $\hat{\mathbb B}_h$ is defined as $(\hat{\mathbb B}_h \hat V_{h+1})(s,a) := \hat r_h(s,a) + (\hat P_h \hat V_{h+1})(s,a)$, where $(\hat P_h \hat V_{h+1})(s,a)$ is the empirical version of $(P_h \hat V_{h+1})(s,a)$ using the estimated $\hat P_h(\cdot|s,a)$.
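As a concrete illustration, here is a minimal tabular sketch of the empirical Bellman backup. The arrays `r_hat`, `P_hat`, and `V_next` are hypothetical stand-ins for $\hat r_h$, $\hat P_h(\cdot|s,a)$, and $\hat V_{h+1}$; the names are illustrative, not the paper's implementation:

```python
import numpy as np

S, A = 3, 2                                     # numbers of states and actions
rng = np.random.default_rng(2)

r_hat = rng.uniform(size=(S, A))                # estimated rewards r_hat_h(s, a)
P_hat = rng.dirichlet(np.ones(S), size=(S, A))  # estimated P_hat_h(.|s, a), rows sum to 1
V_next = rng.uniform(0, 1, size=S)              # value estimate V_hat_{h+1}

# (B_hat_h V_hat_{h+1})(s, a) = r_hat_h(s, a) + sum_{s'} P_hat_h(s'|s, a) V_hat_{h+1}(s')
B_hat = r_hat + P_hat @ V_next
assert B_hat.shape == (S, A)
```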

Assumption 1 holds$\}$ is a family of data source generation distributions, and $\mathcal B(C) := \{\{\rho_l : l\in[L]\} : \text{Assumption 2 holds with } C^* = C\}$ is a family of behavior policies.

this line, Xie et al. (2021b); Li et al. (2022); Shi et al. (2022) further fine-tune the designs for the tabular setting, and Jin et al. (2021b); Zanette et al. (2021); Min et al. (2021); Yin et al. (2022); Xiong et al. (2022) for the linear MDP. For general function approximation, additional attempts are reported in Xie et al. (2021a); Uehara & Sun (2021).

$\bar r_h(s,a) = \sum_{l\in[L]} r_{h,l}(s,a)/L = \sum_{l\in[L]} \langle\phi(s,a), \theta_{h,l}\rangle/L = \langle\phi(s,a), \bar\theta_h\rangle$, where $\bar\mu_h(s') := \sum_{l\in[L]}\mu_{h,l}(s')/L$ and $\bar\theta_h := \sum_{l\in[L]}\theta_{h,l}/L$. Then, we can verify the constraints in Definition 1 as $\|\phi(s,a)\|_2 \le 1$, $\forall (s,a)\in\mathcal S\times\mathcal A$, and
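The identity above is just linearity of the inner product: the reward of the average MDP is the inner product of the shared feature with the averaged parameter $\bar\theta_h$. A minimal numeric check (the random $\phi$ and $\theta_{h,l}$ are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(3)
d, L = 5, 4
phi = rng.standard_normal(d)
phi /= max(1.0, np.linalg.norm(phi))    # enforce ||phi(s, a)||_2 <= 1
theta = rng.standard_normal((L, d))     # theta_{h,l} for each source l

avg_reward = np.mean(theta @ phi)       # (1/L) sum_l <phi, theta_{h,l}>
theta_bar = theta.mean(axis=0)          # averaged parameter theta_bar_h
assert np.isclose(avg_reward, phi @ theta_bar)
```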

With the penalties $\Gamma^\alpha_h(s,a)$ in HetPEVI-Lin, with probability at least $1-\delta$, it holds that for all $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$,
$$\big|(\bar{\mathbb B}_h \hat V_{h+1})(s,a) - (\hat{\mathbb B}_h \hat V_{h+1})(s,a)\big| \le \Gamma^\alpha_h(s,a).$$
Proof. For a fixed $h$ and a fixed function $\hat V_{h+1}(\cdot): \mathcal S \to \mathbb R$, by Lemma 2, there exists $\bar w_h \in \mathbb R^d$ such that $(\bar{\mathbb B}_h \hat V_{h+1})(s,a) = \langle\phi(s,a), \bar w_h\rangle$, $\forall (s,a)\in\mathcal S\times\mathcal A$. With $(\hat{\mathbb B}_h \hat V_{h+1})(s,a) := \langle\phi(s,a), \hat w_h\rangle$, it holds that
$$\big|(\bar{\mathbb B}_h \hat V_{h+1})(s,a) - (\hat{\mathbb B}_h \hat V_{h+1})(s,a)\big| = |\langle\phi(s,a), \bar w_h - \hat w_h\rangle| \le \|\phi(s,a)\|$$

which leads to the claim.

Lemma 21. For the penalties $\Gamma^\beta_h(s,a) = c\sqrt{dH^2/L}$ in HetPEVI-Lin, with probability at least $1-\delta$, it holds that for all $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$,
$$\big|(\bar{\mathbb B}_h \hat V_{h+1})(s,a) - (\mathbb B_h \hat V_{h+1})(s,a)\big| \le \Gamma^\beta_h(s,a).$$

$$P\left(\big|(\bar{\mathbb B}_h \hat V_{h+1})(s,a) - (\mathbb B_h \hat V_{h+1})(s,a)\big| \ge c\sqrt{\frac{H^2\log(2H/\delta)}{L}}\right) \le \delta, \quad \forall h\in[H].$$
Similarly as in Lemma 19, using a covering argument, it holds that
$$P\left(\big|(\bar{\mathbb B}_h \hat V_{h+1})(s,a) - (\mathbb B_h \hat V_{h+1})(s,a)\big| \ge c\sqrt{\frac{dH^2}{L}}\right) \le \delta, \quad \forall (s,a,h)\in\mathcal S\times\mathcal A\times[H],$$
which concludes the proof.

Lemma 22. The penalties $\{\Gamma_h(s,a) = \Gamma^\alpha_h(s,a) + \Gamma^\beta_h(s,a) : (s,a,h)\in\mathcal S\times\mathcal A\times[H]\}$ in HetPEVI-Lin induce a valid pessimism with probability at least $1-\delta$ with respect to the estimated Bellman operator $\hat{\mathbb B}_h$ defined as $(\hat{\mathbb B}_h \hat V_{h+1})(s,a) := \langle\phi(s,a), \hat w_h\rangle$.

Proof. This result can be obtained by combining Lemmas 19 and 21.

F.3 SUBOPTIMALITY GAP

Proof of Theorem 4. By Lemmas 6 and 18, with probability at least $1-\delta$, it holds that
$$\mathrm{Gap}(\hat\pi; M) \le 2\sum_{h\in[H]} \mathbb E_{\pi^*,M}\left[\Gamma_h(s_h,a_h)\right].$$

Lemma 23 (Suboptimality Decomposition; Lemma 3.1 in Jin et al. (2021b)). Let $\hat\pi = \{\hat\pi_h\}_{h\in[H]}$ be the policy such that $\hat V_h(s) = \langle\hat Q_h(s,\cdot), \hat\pi_h(\cdot|s)\rangle$. For any $\hat\pi$ and $s\in\mathcal S$, it holds that
$$\mathrm{Gap}(\hat\pi; M) = -\sum_{h\in[H]}\mathbb E_{\hat\pi,M}\big[\zeta_h(s_h,a_h)\big|s_1=s\big] + \sum_{h\in[H]}\mathbb E_{\pi^*,M}\big[\zeta_h(s_h,a_h)\big|s_1=s\big] + \sum_{h\in[H]}\mathbb E_{\pi^*,M}\Big[\big\langle\hat Q_h(s_h,\cdot), \pi^*_h(\cdot|s_h) - \hat\pi_h(\cdot|s_h)\big\rangle\Big|s_1=s\Big],$$
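The decomposition is an identity that holds for any estimates $\hat Q_h$, any $\hat\pi$ consistent with them, and any comparator policy, so it can be verified exactly on a small tabular MDP. The sketch below (all quantities randomly generated and hypothetical, with $\zeta_h := \mathbb B_h\hat V_{h+1} - \hat Q_h$) checks both sides by exact expectation propagation:

```python
import numpy as np

rng = np.random.default_rng(4)
S, A, H = 3, 2, 3

r = rng.uniform(size=(H, S, A))                    # true rewards
P = rng.dirichlet(np.ones(S), size=(H, S, A))      # true transitions P_h(.|s, a)
Q_hat = rng.uniform(0, H, size=(H, S, A))          # arbitrary value estimates
pi_hat = rng.dirichlet(np.ones(A), size=(H, S))    # output policy pi_hat
pi_star = rng.dirichlet(np.ones(A), size=(H, S))   # arbitrary comparator policy
V_hat = np.einsum('hsa,hsa->hs', Q_hat, pi_hat)    # V_hat_h(s) = <Q_hat_h(s,.), pi_hat_h(.|s)>

def value(pi):
    """Exact V_1^pi via backward induction."""
    V = np.zeros(S)
    for h in reversed(range(H)):
        Q = r[h] + P[h] @ V                        # Q_h^pi(s, a)
        V = np.einsum('sa,sa->s', Q, pi[h])
    return V

def occupancy(pi, s1):
    """Yield (h, d_h^pi) state distributions starting from s1."""
    d = np.zeros(S); d[s1] = 1.0
    for h in range(H):
        yield h, d
        d = np.einsum('s,sa,sat->t', d, pi[h], P[h])

s1 = 0
lhs = value(pi_star)[s1] - value(pi_hat)[s1]       # gap relative to the comparator

# zeta_h(s, a) = (B_h V_hat_{h+1})(s, a) - Q_hat_h(s, a), with V_hat_{H+1} = 0
V_next = np.vstack([V_hat[1:], np.zeros((1, S))])
zeta = r + np.einsum('hsat,ht->hsa', P, V_next) - Q_hat

rhs = 0.0
for h, d in occupancy(pi_hat, s1):
    rhs -= np.einsum('s,sa,sa->', d, pi_hat[h], zeta[h])
for h, d in occupancy(pi_star, s1):
    rhs += np.einsum('s,sa,sa->', d, pi_star[h], zeta[h])
    rhs += np.einsum('s,sa->', d, Q_hat[h] * (pi_star[h] - pi_hat[h]))

assert np.isclose(lhs, rhs)                        # the decomposition is exact
```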

Lemma 29 (Covering Number of a Euclidean Ball; Lemma D.5 of Jin et al. (2020)). There exists a set $\mathcal C_\varepsilon \subset \mathbb R^d$ with $|\mathcal C_\varepsilon| \le (1 + 2R/\varepsilon)^d$ such that for all $x \in \{x\in\mathbb R^d : \|x\|_2 \le R\}$, there exists a $y \in \mathcal C_\varepsilon$ with $\|x-y\|_2 \le \varepsilon$.

Lemma 30 (Lemma H.4 of Min et al. (2021)). Let $\psi: \mathcal S\times\mathcal A \to \mathbb R^d$ satisfy $\|\psi(s,a)\|_2 \le C$ for all $(s,a)\in\mathcal S\times\mathcal A$. For any $T > 0$ and $\kappa > 0$, define $\bar G_T = \sum_{\tau\in[T]}\psi(s_\tau,a_\tau)\psi(s_\tau,a_\tau)^\top + \kappa I_d$, where the $(s_\tau,a_\tau)$'s are i.i.d. samples from some distribution $\nu$ over $\mathcal S\times\mathcal A$. Then, for any $\delta\in(0,1)$, with probability at least $1-\delta$, it holds that
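Lemma 30 quantifies how fast the regularized sample Gram matrix $\bar G_T/T$ approaches its population counterpart $\mathbb E_\nu[\psi\psi^\top]$. A small Monte-Carlo sketch under a hypothetical bounded feature distribution (uniform on $[-1,1]^d$, so $\mathbb E[\psi\psi^\top] = I_d/3$), illustrating that the spectral-norm error shrinks as $T$ grows:

```python
import numpy as np

rng = np.random.default_rng(5)
d, kappa = 4, 1.0

def gram_error(T):
    """Spectral-norm distance between G_bar_T / T and E[psi psi^T]."""
    psi = rng.uniform(-1.0, 1.0, size=(T, d))   # bounded features, ||psi||_2 <= sqrt(d)
    G_bar = psi.T @ psi + kappa * np.eye(d)     # G_bar_T as in Lemma 30
    return np.linalg.norm(G_bar / T - np.eye(d) / 3.0, 2)

e_small, e_large = gram_error(50), gram_error(20000)
assert e_large < e_small   # more samples -> G_bar_T / T closer to E[psi psi^T]
```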

$$\mathbb E_{\pi^*,M}\left[\sqrt{dH^2/L^2}\,\|\phi(s_h,a_h)\|_{\Upsilon_h^{-1}}\right]$$


where the expectations $\mathbb E_{\hat\pi,M}$ and $\mathbb E_{\pi^*,M}$ are with respect to the trajectories induced by $\hat\pi$ and $\pi^*$, and $\zeta_h(s,a) := (\mathbb B_h\hat V_{h+1})(s,a) - \hat Q_h(s,a)$ is the model evaluation error at $(s,a,h)\in\mathcal S\times\mathcal A\times[H]$.

H.2 ENHANCED EMPIRICAL BERNSTEIN INEQUALITY

In the following, an enhanced version of the empirical Bernstein inequality is derived. First, the well-known Bernstein inequality is presented, which serves as a starting point of the later generalization.

Lemma 24 (Bernstein Inequality). Let $Z_1,\cdots,Z_n$ be independent random variables with values in $[0,1]$. Then, for any $t > 0$, it holds that
$$P\bigg(\sum_{i\in[n]}\big(Z_i - \mathbb E[Z_i]\big) \ge t\bigg) \le \exp\bigg(-\frac{t^2}{2\sum_{i\in[n]}\mathbb V(Z_i) + 2t/3}\bigg),$$
where $\mathbb V(Z_i)$ denotes the variance of random variable $Z_i$.

It is noted that the true variance is required in the original form of the Bernstein inequality, which is hard to obtain in practice. Thus, Maurer & Pontil (2009) established an empirical version of the Bernstein inequality, where the estimated variance is used instead of the true variance.

Lemma 25 (Empirical Bernstein Inequality; Theorem 4 in Maurer & Pontil (2009)). Let $Z, Z_1,\cdots,Z_n$ be i.i.d. random variables with values in $[0,1]$, and let $\delta > 0$. Then, with probability at least $1-\delta$, it holds that
$$\mathbb E[Z] - \frac{1}{n}\sum_{i\in[n]} Z_i \le \sqrt{\frac{2\hat{\mathbb V}(Z)\log(2/\delta)}{n}} + \frac{7\log(2/\delta)}{3(n-1)},$$
where $\hat{\mathbb V}(Z)$ denotes the empirical variance of $Z$ with samples $Z_1,\cdots,Z_n$. In the enhanced version below, $\hat{\mathbb V}(Z_l)$ denotes the empirical variance of $Z_l$ with its corresponding samples.

Proof. With $n_{\min} = \min_{l\in[L]}\{n_l\}$ and from Lemma 24, it holds that
Furthermore, with Lemma 27 and a union bound, with probability at least $1 - L\delta/(L+1)$, it holds that
If the above event happens, it holds that
Combining this bound of $t$ with the concentration inequality proved above, the lemma is proved.

Lemma 27 (Theorem 10 in Maurer & Pontil (2009)). Let $Z, Z_1,\cdots,Z_n$ be i.i.d. random variables with values in $[0,1]$, and let $\delta > 0$. Then, with probability at least $1-\delta$, it holds that
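For concreteness, the Maurer–Pontil radius of Lemma 25 can be implemented and checked by Monte Carlo: the empirical confidence radius $\sqrt{2\hat{\mathbb V}(Z)\log(2/\delta)/n} + 7\log(2/\delta)/(3(n-1))$ should fail to cover $\mathbb E[Z]$ from above in at most a $\delta$ fraction of trials. A minimal sketch with hypothetical Bernoulli samples:

```python
import numpy as np

rng = np.random.default_rng(6)

def emp_bernstein_radius(z, delta):
    """One-sided radius from Theorem 4 of Maurer & Pontil (2009), for z in [0, 1]."""
    n = len(z)
    v_hat = np.var(z, ddof=1)  # unbiased empirical variance
    return np.sqrt(2 * v_hat * np.log(2 / delta) / n) + 7 * np.log(2 / delta) / (3 * (n - 1))

n, delta, trials = 100, 0.1, 2000
mu = 0.3                       # true mean of the Bernoulli samples
violations = 0
for _ in range(trials):
    z = rng.binomial(1, mu, size=n).astype(float)
    violations += (mu - z.mean()) > emp_bernstein_radius(z, delta)

assert violations / trials <= delta  # bound holds with probability >= 1 - delta
```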

H.3 VARIANCE OF MIXTURE MODELS

Lemma 28. With random variable $X_l$ following distribution $P_l$ for each $l\in[L]$, assume the random variable $X$ follows the mixture distribution $P = \sum_{l\in[L]}\alpha_l P_l$, where $\alpha = [\alpha_1,\cdots,\alpha_L] \in \Delta_L$. Then, for any $C \ge \max_{l\in[L]}\alpha_l$, it holds that
$$C\,\mathbb V(X) \ge \sum_{l\in[L]}\alpha_l^2\,\mathbb V(X_l).$$
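The lemma can be checked numerically in closed form, since the mixture variance is $\mathbb V(X) = \sum_l \alpha_l(\mathbb V(X_l) + \mu_l^2) - (\sum_l \alpha_l\mu_l)^2$. A minimal sketch with hypothetical mixture weights and component moments, taking $C = \max_l \alpha_l$:

```python
import numpy as np

rng = np.random.default_rng(7)
L = 5
alpha = rng.dirichlet(np.ones(L))      # mixture weights in the simplex Delta_L
mu = rng.uniform(-1, 1, size=L)        # component means E[X_l]
var = rng.uniform(0.1, 2.0, size=L)    # component variances V(X_l)

# Mixture moments: E[X] = sum_l alpha_l mu_l, E[X^2] = sum_l alpha_l (var_l + mu_l^2)
mix_var = np.sum(alpha * (var + mu**2)) - np.sum(alpha * mu) ** 2
C = alpha.max()
assert C * mix_var >= np.sum(alpha**2 * var) - 1e-12  # C V(X) >= sum_l alpha_l^2 V(X_l)
```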

