THE PROVABLE BENEFITS OF UNSUPERVISED DATA SHARING FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Self-supervised methods have become crucial for advancing deep learning by leveraging data itself to reduce the need for expensive annotations. However, it remains unclear how to conduct self-supervised offline reinforcement learning (RL) in a principled way. In this paper, we address this issue by investigating the theoretical benefits of utilizing reward-free data in linear Markov Decision Processes (MDPs) in a semi-supervised setting. Further, we propose a novel algorithm, Provable Data Sharing (PDS), to utilize such reward-free data for offline RL. PDS adds penalties to the reward function learned from labeled data to prevent overestimation, ensuring a conservative algorithm. Our results on various offline RL tasks demonstrate that PDS significantly improves the performance of offline RL algorithms with reward-free data. Overall, our work provides a promising approach to leveraging the benefits of unlabeled data in offline RL while maintaining theoretical guarantees. We believe our findings will contribute to developing more robust self-supervised RL methods.

1. INTRODUCTION

Offline reinforcement learning (RL) is a promising framework for learning sequential policies from pre-collected datasets. It is highly preferred in many real-world problems where active data collection and exploration are expensive, unsafe, or infeasible (Swaminathan & Joachims, 2015; Shalev-Shwartz et al., 2016; Singh et al., 2020). However, labeling large datasets with rewards can be costly and require significant human effort (Singh et al., 2019; Wirth et al., 2017). In contrast, unlabeled data can be cheap and abundant, making self-supervised learning with unlabeled data an attractive alternative. While self-supervised methods have achieved great success in computer vision and natural language processing (Brown et al., 2020; Devlin et al., 2018; Chen et al., 2020), their potential benefits in offline RL are less explored. Several prior works (Ho & Ermon, 2016; Reddy et al., 2019; Kostrikov et al., 2019) have explored demonstration-based approaches that eliminate the need for reward annotations, but these approaches require samples to be near-optimal. Another line of work focuses on data sharing across datasets (Kalashnikov et al., 2021; Yu et al., 2021a), but it assumes that the dataset can be relabeled with oracle reward functions for the target task. These settings can be unrealistic in real-world problems where expert trajectories and reward labeling are expensive. Incorporating reward-free datasets into offline RL is therefore important but challenging due to the sequential and dynamic nature of RL problems. Prior work (Yu et al., 2022) has shown that learning to predict rewards can be difficult and that simply setting the reward to zero can achieve good results. However, it is unclear how reward-prediction methods affect performance and whether reward-free data can provably benefit offline RL. This naturally leads to the following question: How can we leverage reward-free data to improve the performance of offline RL algorithms in a principled way?
To answer this question, we conduct a theoretical analysis of the benefits of utilizing unlabeled data in linear MDPs. Our analysis reveals that although unlabeled data can provide information about the dynamics of the MDP, it cannot reduce the uncertainty over reward functions. Based on this insight, we propose a model-free method named Provable Data Sharing (PDS), which adds uncertainty penalties to the learned reward functions to maintain a conservative algorithm. By doing so, PDS can effectively leverage the benefits of unlabeled data for offline RL while ensuring theoretical guarantees. We demonstrate that PDS can cooperate with model-free offline methods while being simple and efficient. We conduct extensive experiments on various environments, including single-task domains like MuJoCo (Todorov et al., 2012) and Kitchen (Gupta et al., 2019) , as well as multi-task domains like AntMaze and Meta-World (Yu et al., 2020a) . The results show that PDS improves significantly over previous methods like UDS (Yu et al., 2022) and naive reward prediction methods. Our main contribution is the Provable Data Sharing (PDS) algorithm, a novel method for utilizing unsupervised data in offline RL that provides theoretical guarantees. PDS adds uncertainty penalties to the learned reward functions and can be easily integrated with existing offline RL algorithms. Our experimental results demonstrate that PDS can achieve superior performance on various locomotion, navigation, and manipulation tasks. Overall, our work provides a promising approach to leveraging the benefits of unlabeled data in offline RL while maintaining theoretical guarantees and contributing to the development of more robust self-supervised RL methods.

2. RELATED WORK

Offline Reinforcement Learning. Current offline RL methods (Levine et al., 2020) can be roughly divided into policy constraint-based, uncertainty estimation-based, and model-based approaches. Policy constraint methods aim to keep the policy close to the behavior policy under a probabilistic distance (Siegel et al., 2020; Yang et al., 2021; Kumar et al., 2020; Fujimoto & Gu, 2021; Yang et al., 2022; Fujimoto et al., 2019; Hu et al., 2022; Kostrikov et al., 2021; Ma et al., 2021; Wang et al., 2021). Uncertainty estimation-based methods attempt to quantify the confidence of Q-value predictions using dropout or ensemble techniques (An et al., 2021; Wu et al., 2021). Last, model-based methods incorporate uncertainty in the model space for conservative offline learning (Yu et al., 2020b; 2021b; Kidambi et al., 2020).

Offline Data Sharing. Prior works have demonstrated that data sharing across tasks can be beneficial when sophisticated data-sharing protocols are designed (Yu et al., 2021a). For example, previous studies have explored data-sharing strategies based on human effort (Kalashnikov et al., 2021), inverse RL (Reddy et al., 2019; Li et al., 2020), and estimated Q-values (Yu et al., 2021a). However, these works must assume that the dataset can be relabeled with oracle rewards for the target task, which is a strong assumption given the high cost of reward labeling. Effectively incorporating unlabeled data into offline RL algorithms is therefore essential. To address this issue, recent work (Yu et al., 2022) proposes simply assigning zero rewards to unlabeled data. In this work, we propose a principled way to leverage unlabeled data without the strong assumption of reward relabeling.

Reward Prediction. It is widely observed that reward shaping and intrinsic rewards can accelerate learning in online RL (Mataric, 1994; Ng et al., 1999; Wu & Tian, 2017; Song et al., 2019; Guo et al., 2016; Abel et al., 2021). There is also an extensive literature on automatically designing reward functions via inverse RL (Ng et al., 2000; Fu et al., 2017). However, less attention has been paid to the offline setting, where online interaction is not allowed and the trajectories may not be optimal.

3.1. LINEAR MDPS AND PERFORMANCE METRIC

We consider infinite-horizon discounted Markov Decision Processes (MDPs), defined by the tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, with state space $\mathcal{S}$, action space $\mathcal{A}$, discount factor $\gamma \in [0, 1)$, transition function $P : \mathcal{S} \times \mathcal{A} \to \Delta(\mathcal{S})$, and reward function $r : \mathcal{S} \times \mathcal{A} \to [0, r_{\max}]$. To make things concrete, we consider linear MDPs (Yang & Wang, 2019; Jin et al., 2020), where the transition kernel and expected reward function are linear with respect to a feature map.

Definition 3.1 (Linear MDP). We say an MDP $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$ is a linear MDP with known feature map $\phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d$ if there exist unknown measures $\mu = (\mu_1, \ldots, \mu_d)$ over $\mathcal{S}$ and an unknown vector $\theta \in \mathbb{R}^d$ such that
$$P(s' \mid s, a) = \langle \phi(s, a), \mu(s') \rangle, \qquad r(s, a) = \langle \phi(s, a), \theta \rangle$$
for all $(s, a, s') \in \mathcal{S} \times \mathcal{A} \times \mathcal{S}$. We assume $\|\phi(s, a)\|_2 \le 1$ for all $(s, a)$ and $\max\{\|\mu(\mathcal{S})\|_2, \|\theta\|_2\} \le \sqrt{d}$, where $\mu(\mathcal{S}) \equiv \int_{\mathcal{S}} \mu(s)\, ds$.

A policy $\pi : \mathcal{S} \to \Delta(\mathcal{A})$ specifies a decision-making strategy in which the agent chooses actions adaptively based on the current state, i.e., $a_t \sim \pi(\cdot \mid s_t)$. The value function $V^\pi : \mathcal{S} \to \mathbb{R}$ and the action-value function (Q-function) $Q^\pi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ are defined as
$$V^\pi(s) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s\Big], \qquad Q^\pi(s, a) = \mathbb{E}_\pi\Big[\sum_{t=0}^{\infty} \gamma^t r(s_t, a_t) \,\Big|\, s_0 = s, a_0 = a\Big],$$
where the expectation is with respect to the trajectory $\tau$ induced by policy $\pi$. We define the Bellman operator as $(\mathbb{B}f)(s, a) = \mathbb{E}_{s' \sim P(\cdot \mid s, a)}[r(s, a) + \gamma f(s')]$. We use $\pi^*$, $Q^*$, and $V^*$ to denote the optimal policy, optimal Q-function, and optimal value function, respectively. We have the Bellman optimality equations $V^*(s) = \max_{a \in \mathcal{A}} Q^*(s, a)$ and $Q^*(s, a) = (\mathbb{B}V^*)(s, a)$. Meanwhile, the optimal policy $\pi^*$ satisfies
$$\pi^*(\cdot \mid s) = \operatorname*{argmax}_{\pi}\, \langle Q^*(s, \cdot), \pi(\cdot \mid s) \rangle_{\mathcal{A}}, \qquad V^*(s) = \langle Q^*(s, \cdot), \pi^*(\cdot \mid s) \rangle_{\mathcal{A}},$$
where the maximum is taken over all functions mapping from $\mathcal{S}$ to distributions over $\mathcal{A}$. We aim to learn a policy that maximizes the expected cumulative reward.
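As a concrete illustration of the discounted return inside the definition of $V^\pi$, the sum $\sum_{t} \gamma^t r_t$ for a sampled trajectory can be computed by backward accumulation. This is a generic sketch for illustration (the function name is ours, not the paper's):

```python
def discounted_return(rewards, gamma):
    """Compute sum_t gamma^t * rewards[t] by accumulating backwards."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # Bellman-style backup: g_t = r_t + gamma * g_{t+1}
    return g
```

For example, rewards `[1, 1, 1]` with `gamma = 0.5` give `1 + 0.5 + 0.25 = 1.75`.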
Correspondingly, we define the performance metric as $\mathrm{SubOpt}(\pi; s) = V^{\pi^*}(s) - V^{\pi}(s)$.

3.2. PROVABLE OFFLINE ALGORITHMS

In this section, we consider pessimistic value iteration (PEVI; Jin et al., 2021) as the backbone algorithm, described in Algorithm 2. It is a representative model-free offline algorithm with theoretical guarantees. PEVI subtracts a pessimism bonus $\Gamma(\cdot, \cdot)$ from the standard Q-value estimate $\widehat{Q}(\cdot, \cdot) = (\widehat{\mathbb{B}}\widehat{V})(\cdot, \cdot)$ to reduce potential bias due to finite data, where $\widehat{\mathbb{B}}$ is an empirical estimate of $\mathbb{B}$ constructed from the dataset $\mathcal{D}$. Please refer to Appendix A.1 for details of the PEVI algorithm. We use the following notion of a $\xi$-uncertainty quantifier to formalize the idea of pessimism.

Definition 3.2 ($\xi$-Uncertainty Quantifier). We say $\Gamma : \mathcal{S} \times \mathcal{A} \to \mathbb{R}$ is a $\xi$-uncertainty quantifier for $\widehat{\mathbb{B}}$ and $\widehat{V}$ if, with probability at least $1 - \xi$, for all $(s, a) \in \mathcal{S} \times \mathcal{A}$, $|(\widehat{\mathbb{B}}\widehat{V})(s, a) - (\mathbb{B}\widehat{V})(s, a)| \le \Gamma(s, a)$.
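To make the pessimistic backup concrete, the following is a minimal illustrative sketch of PEVI for a linear MDP with a finite action set, using the bonus form $\Gamma(s, a) = \beta\sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)}$ detailed in Appendix A.1. This is our own sketch under simplifying assumptions (next-state values are evaluated only on states appearing in the dataset, and values are clipped at zero), not the paper's implementation:

```python
import numpy as np

def pevi_linear(D, phi, actions, gamma=0.99, lam=1.0, beta=1.0, n_iters=100):
    """Sketch of pessimistic value iteration for a linear MDP.

    D: list of (s, a, r, s_next) transitions; phi(s, a): feature vector in R^d.
    Returns the learned weights w and the pessimistic Q-function.
    """
    d = len(phi(D[0][0], D[0][1]))
    Phi = np.array([phi(s, a) for s, a, _, _ in D])          # (N, d) feature matrix
    Lam = lam * np.eye(d) + Phi.T @ Phi                      # regularized covariance
    Lam_inv = np.linalg.inv(Lam)
    r = np.array([t[2] for t in D])

    def Q(w, s, a):
        f = phi(s, a)
        # Pessimistic Q: linear estimate minus the elliptical bonus Gamma(s, a).
        return float(f @ w - beta * np.sqrt(f @ Lam_inv @ f))

    w = np.zeros(d)
    for _ in range(n_iters):
        # Greedy pessimistic value at each observed next state, clipped at 0.
        V_next = np.array([max(Q(w, sn, a) for a in actions) for _, _, _, sn in D])
        targets = r + gamma * np.maximum(V_next, 0.0)
        w = Lam_inv @ (Phi.T @ targets)                      # ridge regression backup
    return w, Q
```

With `beta = 0` the iteration reduces to ordinary least-squares value iteration; a positive `beta` uniformly lowers the Q-estimates in poorly covered directions.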

3.3. UNSUPERVISED DATA SHARING

We consider unsupervised data sharing in offline reinforcement learning. We first characterize the quality of a dataset with the notion of coverage coefficient (Uehara & Sun, 2021), defined as follows.

Definition 3.3. The coverage coefficient $C^\dagger$ of a dataset $\mathcal{D} = \{(s_\tau, a_\tau, r_\tau)\}_{\tau=1}^{N}$ is defined as
$$C^\dagger = \sup\Big\{C \,\Big|\, \frac{1}{N} \sum_{\tau=1}^{N} \phi(s_\tau, a_\tau)\phi(s_\tau, a_\tau)^\top \succeq C \cdot \mathbb{E}_{\pi^*}\big[\phi(s_t, a_t)\phi(s_t, a_t)^\top \mid s_0 = s\big],\ \forall s \in \mathcal{S}\Big\}.$$

The coverage coefficient $C^\dagger$ is common in the offline RL literature (Uehara & Sun, 2021; Jin et al., 2021; Rashidinejad et al., 2021); it measures the worst-case ratio between the empirical state-action feature distribution and the distribution induced by the optimal policy. Intuitively, it represents the quality of the dataset: an expert dataset has a high coverage coefficient, while a random dataset may have a low one. We denote the original labeled dataset by $\mathcal{D}_0$, with coverage coefficient $C_0^\dagger$ and size $N_0$, and the unlabeled dataset by $\mathcal{D}_1$, with coverage coefficient $C_1^\dagger$ and size $N_1$. Note that the unlabeled data may come from multiple sources, such as multi-task settings; we still write $\mathcal{D}_1$ for the combined dataset $\cup_{i=1}^{M} \mathcal{D}_i$ from $M$ tasks for simplicity.
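For intuition, the coverage coefficient in Definition 3.3 is the largest $C$ such that the empirical feature second-moment matrix dominates $C$ times the optimal policy's feature covariance, i.e., the smallest generalized eigenvalue of the matrix pair. A sketch under the assumption that both matrices are given and the optimal-policy covariance is positive definite (function name ours):

```python
import numpy as np

def coverage_coefficient(Sigma_data, Sigma_star):
    """Largest C with Sigma_data >= C * Sigma_star (PSD order).

    Equals the minimum eigenvalue of Sigma_star^{-1/2} Sigma_data Sigma_star^{-1/2};
    assumes Sigma_star is symmetric positive definite.
    """
    vals, vecs = np.linalg.eigh(Sigma_star)
    inv_sqrt = vecs @ np.diag(vals ** -0.5) @ vecs.T     # Sigma_star^{-1/2}
    S = inv_sqrt @ Sigma_data @ inv_sqrt
    return float(np.linalg.eigvalsh(S).min())
```

For example, if the dataset's feature moments are exactly twice those of the optimal policy, the coefficient is 2; if the dataset misses a direction the optimal policy uses, the coefficient drops toward 0.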

4. PROVABLE UNSUPERVISED DATA SHARING

How can we leverage reward-free data for offline RL? A naive approach is to learn the reward function from labeled data via the ridge regression
$$\widehat{\theta} = \operatorname*{argmin}_{\theta} \sum_{\tau=1}^{N_0} \big(f_\theta(s_\tau, a_\tau) - r_\tau\big)^2 + \nu \|\theta\|_2^2, \tag{7}$$
where $f_\theta(s_\tau, a_\tau) = \phi(s_\tau, a_\tau)^\top \theta$ in linear MDPs. We can then use the learned reward function to label the unsupervised data. However, this approach can lead to suboptimal performance due to overestimation of the predicted reward (Yu et al., 2022), which undermines the pessimism of offline algorithms. To address this issue, we propose a data-sharing algorithm called Provable Data Sharing (PDS). We start by analyzing the uncertainty in learned reward functions and add penalties for such uncertainty to leverage unlabeled data. In Section 4.2, we show that PDS has a provable performance bound consisting of two parts: the offline error, which is tightened compared to no data sharing thanks to the additional data, and the error from reward bias. The performance bound of PDS is provably better than no data sharing as long as the unlabeled dataset has mediocre size or quality. We also extend our algorithm from linear MDPs to general settings in Section 4.3 and propose using ensembles for reward uncertainty estimation. We demonstrate the effectiveness of PDS by integrating it with IQL (Kostrikov et al., 2021), and present Algorithm 3, which is simple and can be easily integrated with other model-free offline algorithms.
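The regression above is ordinary ridge regression and admits a closed-form solution. A minimal sketch (the function name is ours; `Phi` stacks the features $\phi(s_\tau, a_\tau)$ row-wise):

```python
import numpy as np

def fit_reward(Phi, r, nu=1.0):
    """Ridge regression: argmin_theta ||Phi @ theta - r||^2 + nu * ||theta||^2.

    Closed form: theta = (Phi^T Phi + nu I)^{-1} Phi^T r.
    """
    d = Phi.shape[1]
    A = Phi.T @ Phi + nu * np.eye(d)
    return np.linalg.solve(A, Phi.T @ r)
```

With enough labeled samples and a small regularizer, the estimate recovers the true linear reward parameter.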

4.1. PROVABLE DATA SHARING

To address the issue of potential overestimation of predicted rewards, we first analyze the uncertainty in learned reward functions. In linear MDPs, the reward function can be learned via linear regression, and the uncertainty of the parameters is characterized by an elliptical confidence region, as shown in Lemma 4.1. This confidence region is important: it allows a more accurate estimate of the reward function while keeping the overall algorithm pessimistic.

Lemma 4.1 (Abbasi-Yadkori et al., 2011). Let
$$\alpha = \sqrt{\nu} + r_{\max} \sqrt{2 \log\frac{1}{\delta} + d \log\Big(1 + \frac{N_0}{\nu d}\Big)}, \qquad \Lambda = \nu I + \sum_{\tau=1}^{N_0} \phi(s_\tau, a_\tau)\phi(s_\tau, a_\tau)^\top,$$
$$C(\delta) = \big\{\theta \in \mathbb{R}^d \,\big|\, \|\theta - \widehat{\theta}\|_{\Lambda} \le \alpha\big\}, \tag{8}$$
where $\widehat{\theta}$ is the minimizer in Equation (7). Then $P(\theta \in C(\delta)) \ge 1 - \delta$, where $\theta$ is the true parameter of the reward function.

Proof. Please refer to Theorem 2 in Abbasi-Yadkori et al. (2011) for a detailed proof.

Lemma 4.1 provides a useful insight: the uncertainty of the learned reward function in linear MDPs depends only on the quality and size of the labeled data. Based on this insight, we propose a two-phase algorithm with a provable performance bound, as shown in Theorem 4.3. The naive reward prediction method can compromise the pessimism of the algorithm, while UDS may incur a reward bias that is too large. Our algorithm consists of two phases: in the first phase, we construct a pessimistic reward estimator that finds the reward function in the confidence set leading to the lowest optimal value; in the second phase, we run standard offline RL with the resulting pessimistic reward function. Since finding the worst-case parameter in the confidence set is a bi-level optimization problem, we propose a simpler method that preserves the pessimism of the offline algorithm.
This method uses a pessimistic estimate, which keeps the algorithm pessimistic while avoiding the computational challenges of the bi-level optimization problem. Formally,
$$\widehat{r}(s, a) = \max\Big\{\phi(s, a)^\top \widehat{\theta} - \alpha \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)},\ 0\Big\}, \tag{9}$$
where $\Lambda = \nu I + \sum_{\tau=1}^{N_0} \phi(s_\tau, a_\tau)\phi(s_\tau, a_\tau)^\top$. We adopt the pessimistic estimate in Equation (9) because it lower-bounds every reward function in the confidence set $C(\delta)$, as guaranteed by the following lemma derived from the Cauchy–Schwarz inequality.

Lemma 4.2. For any $\theta \in C(\delta)$,
$$\big|\phi(s, a)^\top \theta - \phi(s, a)^\top \widehat{\theta}\big| \le \alpha \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)}. \tag{10}$$

When labeled data is scarce or there is a significant distributional shift between the labeled and unlabeled data, Equation (9) degenerates to 0, which is equivalent to the UDS algorithm (Yu et al., 2022).

Algorithm 1 Provable Data Sharing (Linear MDP)
1: Require: labeled dataset $\mathcal{D}_0 = \{(s_\tau, a_\tau, r_\tau)\}_{\tau=1}^{N_0}$, unlabeled dataset $\mathcal{D}_1 = \{(s_\tau, a_\tau)\}_{\tau=1}^{N_1}$.
2: Require: confidence parameters $\alpha, \beta, \delta$.
3: Learn the reward parameter $\widehat{\theta}$ from $\mathcal{D}_0$ using Equation (7).
4: Construct the confidence set $C(\delta)$ using Equation (8).
5: Construct the pessimistic reward over the confidence set,
$$\bar{\theta} \leftarrow \operatorname*{argmin}_{\theta \in C(\delta)} \widehat{V}_\theta, \tag{11}$$
where $\widehat{V}_\theta$ is the value function estimated by Algorithm 2 on $\mathcal{D}_0 \cup \mathcal{D}_1$ with rewards relabeled by parameter $\theta$.
6: Annotate the rewards in $\mathcal{D}_0 \cup \mathcal{D}_1$ with parameter $\bar{\theta}$.
7: Learn the policy from the annotated dataset using Algorithm 2: $(\widehat{V}, \widehat{\pi}) \leftarrow \mathrm{PEVI}(\mathcal{D}_0 \cup \mathcal{D}_1)$.
8: Return $\widehat{\pi}$.
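The pessimistic reward estimate above is a one-liner given $\widehat{\theta}$, $\Lambda$, and $\alpha$. A sketch of the linear-MDP case (function name ours):

```python
import numpy as np

def pessimistic_reward(phi, theta_hat, Lambda, alpha):
    """Predicted reward minus the elliptical confidence penalty, clipped at zero.

    Implements r_hat(s, a) = max(phi^T theta_hat - alpha * sqrt(phi^T Lambda^{-1} phi), 0).
    """
    penalty = alpha * np.sqrt(phi @ np.linalg.solve(Lambda, phi))
    return max(float(phi @ theta_hat - penalty), 0.0)
```

With abundant labeled data in the direction of `phi`, `Lambda` is large and the penalty is small; with little coverage the penalty dominates and the estimate degenerates to 0, recovering UDS-style zero rewards.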

4.2. THEORETICAL ANALYSIS

The following subsection analyzes how the provable data sharing (PDS) algorithm improves the performance bound by leveraging unlabeled data. Specifically, we present the following theorem.

Theorem 4.3 (Performance Bound for PDS). Suppose the datasets $\mathcal{D}_0, \mathcal{D}_1$ have positive coverage coefficients $C_0^\dagger, C_1^\dagger$ and the underlying MDP is a linear MDP. In Algorithm 1, set
$$\lambda = 1, \quad \nu = 1, \quad \alpha = 2\sqrt{d \zeta_2}\, r_{\max}, \quad \beta = \frac{c d \sqrt{\zeta_1}}{1 - \gamma}\, r_{\max}, \quad \zeta_1 = \log\frac{4d(N_0 + N_1)}{(1 - \gamma)\delta}, \quad \zeta_2 = \log\frac{2 d N_0}{\delta},$$
where $c > 0$ is an absolute constant and $\delta \in (0, 1)$ is the confidence parameter. Then with probability $1 - 2\delta$, the policy $\widehat{\pi}$ generated by PDS satisfies, for all $s \in \mathcal{S}$,
$$\mathrm{SubOpt}(\widehat{\pi}; s) \le \frac{2 c\, r_{\max}}{(1 - \gamma)^2} \sqrt{\frac{d^3 \zeta_1}{N_0 C_0^\dagger + N_1 C_1^\dagger}} + \frac{4 r_{\max}}{1 - \gamma} \sqrt{\frac{d^2 \zeta_2}{N_0 C_0^\dagger}}.$$

Proof. Please refer to Appendix B for a detailed proof.

The performance bound of PDS is composed of two terms. The first is the offline error, inherited from offline algorithms; this bound improves when additional unlabeled data of size $N_1$ and coverage $C_1^\dagger$ is available. The second term is the reward bias, which arises from uncertainty in the rewards. Notably, this term equals the performance bound of a linear bandit with rewards in the range $[0, r_{\max}/(1 - \gamma)]$. As the amount of unlabeled data approaches infinity, the uncertainty about the dynamics decreases to zero and the RL problem reduces to a linear bandit problem. The theorem demonstrates that PDS outperforms UDS, which suffers from a constant reward bias, and naive reward prediction methods, which lack pessimism and therefore offer no such guarantees. Moreover, we demonstrate the tightness of the bound by constructing an "adversarial" dataset that matches the bound's suboptimality (see Appendix F).
To better understand the benefits of unlabeled data, we define the suboptimality bound ratio (SBR) of an offline algorithm $\mathcal{A}$ as the ratio between the suboptimality bound of the policy learned with additional unlabeled data and that of the policy learned from labeled data alone:
$$\mathrm{SBR}(\mathcal{A}) = \frac{\overline{\mathrm{SubOpt}}\big(\pi_{\mathcal{A}}(\mathcal{D}_0, \mathcal{D}_1)\big)}{\overline{\mathrm{SubOpt}}\big(\pi_{\mathcal{A}}(\mathcal{D}_0, \emptyset)\big)},$$
where $\overline{\mathrm{SubOpt}}$ denotes the tight upper bound on suboptimality. The SBR measures the benefit of unlabeled data to the offline algorithm: a smaller SBR indicates a greater benefit from the unlabeled data. Applying this definition to PDS, we obtain the following corollary.

Corollary 4.4 (Informal). The SBR of PDS satisfies
$$\mathrm{SBR} \approx \underbrace{\sqrt{\frac{N_0 C_0^\dagger}{N_0 C_0^\dagger + N_1 C_1^\dagger}}}_{\text{finite sample term}} + \underbrace{\frac{2(1 - \gamma)}{c \sqrt{d}}}_{\text{asymptotic term}},$$
where $c$ is the constant in Theorem 4.3 and logarithmic factors are ignored.

When does unlabeled data improve the performance of offline algorithms? Corollary 4.4 allows us to analyze the relative performance of PDS under different conditions. The first term depends on the qualities and amounts of both the labeled and unlabeled datasets: if the unlabeled dataset has even a mediocre number of samples or data quality, the first term will be sufficiently small. The second term governs the asymptotic performance as the amount of unlabeled data approaches infinity, and it depends on the discount factor and the dimension of the problem; PDS improves over no-data-sharing algorithms asymptotically in larger problems or with longer horizons. For a more detailed discussion, please refer to Appendix E.
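To get a feel for Corollary 4.4, the following sketch evaluates the informal SBR approximation, assuming the square-root form of the finite-sample term implied by Theorem 4.3 (the function name is ours, and `c` defaults to 1 purely for illustration):

```python
import math

def sbr_estimate(N0, C0, N1, C1, d, gamma, c=1.0):
    """Informal SBR approximation from Corollary 4.4, ignoring log factors."""
    finite_sample = math.sqrt(N0 * C0 / (N0 * C0 + N1 * C1))
    asymptotic = 2.0 * (1.0 - gamma) / (c * math.sqrt(d))
    return finite_sample + asymptotic
```

As expected, adding more (or better-covering) unlabeled data shrinks the finite-sample term toward zero, while the asymptotic term is an irreducible floor set by $\gamma$ and $d$.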

4.3. PRACTICAL IMPLEMENTATION

This subsection outlines the practical implementation of PDS in general MDPs. We employ $L$ ensemble members $\theta_1, \ldots, \theta_L$ to estimate uncertainty, each learned using Equation (7). We use the following pessimistic reward estimate:
$$\widehat{r}(s, a) = \max\{\mu(s, a) - k \sigma(s, a),\ 0\}, \tag{15}$$
where
$$\mu(s, a) = \frac{1}{L} \sum_{i=1}^{L} f_{\theta_i}(s, a), \qquad \sigma(s, a) = \sqrt{\frac{1}{L} \sum_{i=1}^{L} \big(f_{\theta_i}(s, a) - \mu(s, a)\big)^2}$$
are the ensemble mean and standard deviation, respectively, and $k$ is a hyperparameter controlling the amount of pessimism. We can also use the minimum over the $L$ ensemble members for the pessimistic estimate, which is linked to Equation (15) following An et al. (2021) and Royston (1982):
$$\mathbb{E}\Big[\min_{j=1,\ldots,L} f_{\theta_j}(s, a)\Big] \approx \mu(s, a) - \Phi^{-1}\Big(\frac{L - \pi/8}{L - \pi/4 + 1}\Big)\, \sigma(s, a), \tag{16}$$
where $\Phi$ is the CDF of the standard Gaussian distribution. The appropriate value of $k$ for each domain can be difficult to determine. To address this issue, we observe that the amount of pessimism required for different domains is proportional to the difference in mean rewards between the labeled and unlabeled data. Leveraging this insight, we propose a simple and efficient mechanism that adjusts $k$ automatically:
$$\widehat{r}(s, a) = \max\Big\{\min_{j=1,\ldots,L} f_{\theta_j}(s, a) - k \sigma(s, a),\ 0\Big\}, \quad \text{where } k = a \cdot \frac{\max(\bar{\mu}_1 - \bar{\mu}_0,\ 0)}{|\bar{\mu}_0| + \epsilon}, \tag{17}$$
where $\bar{\mu}_0 = \frac{1}{N_0}\sum_{i=1}^{N_0} \mu(s_i, a_i)$ and $\bar{\mu}_1 = \frac{1}{N_1}\sum_{i=1}^{N_1} \mu(s_i, a_i)$ are the mean rewards of the labeled data and the (predicted) unlabeled data, respectively. We use $a = 25$ and $L = 10$ in all experiments. We can then plug in any model-free offline algorithm. Here we use IQL (Kostrikov et al., 2021) as the backbone offline algorithm, but we emphasize that PDS can be easily integrated with other model-free algorithms. The details of our algorithm are summarized in Algorithm 3 in Appendix A.2.
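The ensemble-based relabeling in Equation (17) can be sketched as follows. This is our own sketch: it assumes `k` grows when the predicted mean reward on unlabeled data exceeds the labeled mean (the direction suggested by the overestimation argument above), and the names `mu_labeled`, `a`, and `eps` correspond to $\bar{\mu}_0$, $a$, and $\epsilon$:

```python
import numpy as np

def ensemble_pessimistic_reward(preds_unlabeled, mu_labeled, a=25.0, eps=1e-8):
    """Pessimistic relabeling of unlabeled transitions via a reward ensemble.

    preds_unlabeled: (L, N1) array, reward predictions of L ensemble members.
    mu_labeled: scalar mean reward of the labeled dataset.
    """
    preds = np.asarray(preds_unlabeled)
    mu = preds.mean(axis=0)            # per-sample ensemble mean
    sigma = preds.std(axis=0)          # population std, matching sigma in Eq. (15)
    # Automatic pessimism weight: large when predictions look too optimistic.
    k = a * max(float(mu.mean()) - mu_labeled, 0.0) / (abs(mu_labeled) + eps)
    # Minimum over ensemble members minus k * sigma, clipped at zero (Eq. 17).
    return np.maximum(preds.min(axis=0) - k * sigma, 0.0)
```

When the predicted unlabeled rewards do not exceed the labeled mean, `k` is zero and the estimate reduces to the ensemble minimum; under a large optimistic gap the penalty pushes labels toward zero, recovering UDS-like behavior.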

5. EXPERIMENTS

In this section, we aim to evaluate the effectiveness of pessimistic reward estimation and answer the following questions: (1) How does PDS perform compared to naive reward prediction and unlabeled data sharing (UDS) in single-task locomotion and manipulation tasks? (2) How does PDS behave in multi-task offline RL settings compared to baselines? (3) What makes PDS effective? Single-task domains and datasets. To address Question (1), we empirically evaluate the PDS algorithm on the hopper, walker2d, and kitchen tasks from the D4RL benchmark suite (Fu et al., 2020). We use 50 labeled trajectories together with unlabeled data of varying size and quality. This experimental setup is motivated by real-world problems where labeled data is often scarce while additional unlabeled data is readily available. Comparisons. To ensure a fair comparison, we combine UDS with IQL (Kostrikov et al., 2021), the same underlying offline RL method as PDS. In addition to UDS, we train a naive reward prediction method and a sharing-all-true-rewards method (Oracle), and combine them with IQL. In all experiments, we set the hyperparameters $a = 25$ and $L = 10$ for our method.

5.1. EXPERIMENTAL RESULTS

Results of Question (1). We evaluated each method on the hopper, walker2d, and kitchen domains and found that PDS outperformed the other methods on most tasks and achieved competitive or better performance than the oracle method (Table 3 ). Notably, PDS performed well when the labeled and unlabeled datasets had different data qualities, which we attribute to its ability to capture the uncertainties induced by this distribution shift and maintain a pessimistic algorithm. The prediction method performed well when the unlabeled dataset had high quality, and UDS performed well when the unlabeled data had low quality. PDS combined the strengths of both methods and achieved superior performance. Results of Question (2). Multi-task settings exhibit greater distributional shifts between labeled and unlabeled data due to the differing task goals. We evaluated PDS and the other methods on the AntMaze and Meta-World domains (Tables 2 and 1 ) and found that PDS's performance was comparable to the oracle method and outperformed the other methods. UDS performed relatively poorly on the Meta-World dataset, possibly due to the high dataset quality, which made labeling with zeros induce a large reward bias. On the multi-task AntMaze domain, PDS outperformed both UDS and the naive prediction method, especially on the diverse dataset. These results aligned with our observation on single-task domains that PDS performs better when the distribution shift between datasets is larger.

Results of Question (3).

We conducted experiments on the hopper and walker2d tasks with various penalty weights $k$ to investigate the effect of uncertainty weights in PDS (Figure 1). The results show that PDS can interpolate between UDS and the reward prediction method, offering a better trade-off between the conservatism and generalization of reward estimators and thus better performance. PDS also reduces the variance of reward prediction and is close to the oracle method, indicating its ability to reduce the uncertainty arising from the variance of reward predictors while keeping the reward bias small, as shown in Figure 2. Discussion of PDS, UDS, and Reward Prediction: PDS can be seen as a generalization of both UDS and the reward prediction method: UDS sets the penalty weight $k$ to infinity, while the reward prediction method sets it to zero. However, UDS introduces a high reward bias, and the reward prediction method ruins the pessimism of offline algorithms. In contrast, PDS offers a trade-off between bias and pessimism by adaptively adjusting $k$. To verify that overestimation is the main factor in the suboptimal performance of the reward prediction method, we conduct additional ablation studies, as shown in Appendix G.

6. CONCLUSION

In this paper, we show that incorporating reward-free data into offline reinforcement learning can yield significant performance improvements. Our theoretical analysis reveals that unlabeled data provides additional information about the MDP's dynamics, reducing the problem to a linear bandit in the limit and thereby improving performance bounds. Building on these insights, we propose a new algorithm, PDS, which leverages this information by adding uncertainty penalties to learned rewards to ensure a conservative approach. Our method has provable guarantees in theory and achieves superior performance in practice. In future work, it may be interesting to explore how PDS can be further improved with representation learning methods and to extend our analysis to more general settings, such as generalized linear MDPs (Wang et al., 2019) and low-rank MDPs (Ayoub et al., 2020; Jiang et al., 2017).

A ALGORITHM DETAILS

A.1 PESSIMISTIC VALUE ITERATION (PEVI; Jin et al., 2021)

In this section, we describe the details of the PEVI algorithm. In linear MDPs, we construct $\widehat{\mathbb{B}}\widehat{V}$ and $\Gamma$ from $\mathcal{D}$ as follows, where $\widehat{\mathbb{B}}\widehat{V}$ is the empirical estimate of $\mathbb{B}\widehat{V}$. For a given dataset $\mathcal{D} = \{(s_\tau, a_\tau, r_\tau, s_{\tau+1})\}_{\tau=1}^{N}$, we define the regularized empirical mean squared Bellman error (MSBE) as
$$M(w) = \sum_{\tau=1}^{N} \big(r_\tau + \gamma \widehat{V}(s_{\tau+1}) - \phi(s_\tau, a_\tau)^\top w\big)^2 + \lambda \|w\|_2^2,$$
where $\lambda > 0$ is the regularization parameter. The minimizer $\widehat{w}$ has the closed form
$$\widehat{w} = \Lambda^{-1} \sum_{\tau=1}^{N} \phi(s_\tau, a_\tau) \cdot \big(r_\tau + \gamma \widehat{V}(s_{\tau+1})\big), \qquad \Lambda = \lambda I + \sum_{\tau=1}^{N} \phi(s_\tau, a_\tau)\phi(s_\tau, a_\tau)^\top.$$
We then simply let $\widehat{\mathbb{B}}\widehat{V} = \langle \phi, \widehat{w} \rangle$. Meanwhile, we construct $\Gamma$ from $\mathcal{D}$ as
$$\Gamma(s, a) = \beta \cdot \big(\phi(s, a)^\top \Lambda^{-1} \phi(s, a)\big)^{1/2},$$
where $\beta > 0$ is a scaling parameter. The overall PEVI procedure is summarized in Algorithm 2.

Algorithm 2 Pessimistic Value Iteration (PEVI)
1: Require: dataset $\mathcal{D} = \{(s_\tau, a_\tau, r_\tau, s_{\tau+1})\}_{\tau=1}^{N}$.
2: Initialization: set $\widehat{V}(\cdot) \leftarrow 0$ and construct $\Gamma(\cdot, \cdot)$.
3: while not converged do
4:   Construct $(\widehat{\mathbb{B}}\widehat{V})(\cdot, \cdot)$.
5:   Set $\widehat{Q}(\cdot, \cdot) \leftarrow (\widehat{\mathbb{B}}\widehat{V})(\cdot, \cdot) - \Gamma(\cdot, \cdot)$.
6:   Set $\widehat{\pi}(\cdot \mid \cdot) \leftarrow \operatorname{argmax}_{\pi} \mathbb{E}_\pi[\widehat{Q}(\cdot, \cdot)]$.
7:   Set $\widehat{V}(\cdot) \leftarrow \mathbb{E}_{\widehat{\pi}}[\widehat{Q}(\cdot, \cdot)]$.
8: end while
9: Return $\widehat{V}, \widehat{\pi}$.

A.2 IQL WITH PROVABLE DATA SHARING

In this section, we give a detailed description of our IQL+PDS algorithm.

Algorithm 3 IQL+PDS algorithm, General MDPs

Input: labeled dataset $\mathcal{D}_0$, unlabeled dataset $\mathcal{D}_1$. Input: parameters $\alpha, \beta, k, \tau$. Output: policy $\pi_\phi$.
1: Learn $L$ reward functions as in Equation (7).
2: Construct the pessimistic reward estimate as in Equation (17).
3: Relabel the unsupervised dataset $\mathcal{D}_1$ and combine it with the labeled dataset $\mathcal{D}_0$.
4: Initialize $\psi, \theta, \bar{\theta}, \phi$.
5: for each gradient step do
6:   $\psi \leftarrow \psi - \lambda_V \nabla_\psi L_V(\psi)$, where $L_V(\psi) = \mathbb{E}_{s,a}\big[L_2^\tau\big(Q_{\bar{\theta}}(s, a) - V_\psi(s)\big)\big]$
7:   $\theta \leftarrow \theta - \lambda_Q \nabla_\theta L_Q(\theta)$, where $L_Q(\theta) = \mathbb{E}_{s,a,r,s'}\big[\big(r + \gamma V_\psi(s') - Q_\theta(s, a)\big)^2\big]$
8:   $\bar{\theta} \leftarrow \alpha\theta + (1 - \alpha)\bar{\theta}$
9: end for
10: for each gradient step do
11:   $\phi \leftarrow \phi - \lambda_\pi \nabla_\phi L_\pi(\phi)$, where $L_\pi(\phi) = -\mathbb{E}_{s,a}\big[\exp\big(\beta(Q_{\bar{\theta}}(s, a) - V_\psi(s))\big) \log \pi_\phi(a \mid s)\big]$
12: end for

B PROOF OF THEOREM 4.3

Proof. From Equation (11) in Algorithm 1, we have
$$\widehat{V}_{\bar{\theta}} \le \widehat{V}_\theta, \quad \forall \theta \in C(\delta), \tag{20}$$
where $\bar{\theta}$ is the pessimistic estimate of $\theta$. Let $E_1$ be the event $\theta \in C(\delta)$; then $P(E_1) \ge 1 - \delta$ by Lemma 4.1. Let $E_2$ be the event that
$$\big|(\mathbb{B}\widehat{V})(s, a) - (\widehat{\mathbb{B}}\widehat{V})(s, a)\big| \le \Gamma(s, a) = \beta \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)}, \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A}; \tag{21}$$
then $P(E_2) \ge 1 - \delta$ by Lemma C.3. Conditioning on $E_1 \cap E_2$, we decompose
$$V^{\pi^*}_\theta - V^{\widehat{\pi}}_\theta = \big(V^{\pi^*}_\theta - V^{\pi^*}_{\bar{\theta}}\big) + \big(V^{\pi^*}_{\bar{\theta}} - \widehat{V}_{\bar{\theta}}\big) + \big(\widehat{V}_{\bar{\theta}} - V^{\widehat{\pi}}_\theta\big).$$
The third term is non-positive: pessimism in both the value estimate (Equation (21)) and the reward (Equations (20) and (21)) gives $\widehat{V}_{\bar{\theta}} \le V^{\widehat{\pi}}_{\bar{\theta}} \le V^{\widehat{\pi}}_\theta$. The first term is bounded by the reward uncertainty via Lemmas 4.2 and C.5, and the second by the offline error via Lemma C.1, yielding
$$V^{\pi^*}_\theta - V^{\widehat{\pi}}_\theta \le \frac{4 r_{\max}}{1 - \gamma} \sqrt{\frac{d^2 \zeta_2}{N_0 C_0^\dagger}} + \frac{2 c\, r_{\max}}{(1 - \gamma)^2} \sqrt{\frac{d^3 \zeta_1}{N_0 C_0^\dagger + N_1 C_1^\dagger}}.$$
By the union bound, the above inequality holds with probability at least $1 - 2\delta$.

C ADDITIONAL LEMMAS AND MISSING PROOFS

Lemma C.1. Under the event in Lemma C.3, we have, with probability at least $1 - \delta$, for all $\theta$ with $\|\theta\|_2^2 \le d$ and all $s \in \mathcal{S}$,
$$V^{\pi^*}_\theta(s) - \widehat{V}_\theta(s) \le \frac{2 c\, r_{\max}}{(1 - \gamma)^2} \sqrt{\frac{d^3 \zeta}{N C^\dagger}}.$$

Proof. Let $\delta(s, a) = r(s, a) + \gamma \mathbb{E}_{s' \sim P(\cdot \mid s, a)}[\widehat{V}(s')] - \widehat{Q}(s, a)$. From the definitions of $\widehat{Q}(s, a)$ and $\widehat{V}(s)$,
$$\delta(s, a) = (\mathbb{B}\widehat{V})(s, a) - (\widehat{\mathbb{B}}\widehat{V})(s, a) + \Gamma(s, a).$$
Under the condition of Lemma C.3, it holds that $0 \le \delta(s, a) \le 2\Gamma(s, a)$ for all $(s, a)$. By the extended value difference decomposition,
$$V^{\pi^*}_\theta(s) - \widehat{V}_\theta(s) = \mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{\infty} \gamma^t \delta(s_t, a_t) \,\Big|\, s_0 = s\Big] + \mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{\infty} \gamma^t \big\langle \widehat{Q}(s_t, \cdot),\, \pi^*(\cdot \mid s_t) - \widehat{\pi}(\cdot \mid s_t) \big\rangle_{\mathcal{A}} \,\Big|\, s_0 = s\Big].$$
The second term is non-positive since $\widehat{\pi}(\cdot \mid s) = \operatorname{argmax}_\pi \langle \widehat{Q}(s, \cdot), \pi(\cdot \mid s) \rangle$, hence
$$V^{\pi^*}_\theta(s) - \widehat{V}_\theta(s) \le \mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{\infty} \gamma^t \delta(s_t, a_t) \,\Big|\, s_0 = s\Big] \le 2\, \mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{\infty} \gamma^t \Gamma(s_t, a_t) \,\Big|\, s_0 = s\Big] = 2\beta\, \mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{\infty} \gamma^t \big(\phi(s_t, a_t)^\top \Lambda^{-1} \phi(s_t, a_t)\big)^{1/2} \,\Big|\, s_0 = s\Big].$$
By the Cauchy–Schwarz inequality,
$$\mathbb{E}_{\pi^*}\Big[\sum_{t=0}^{\infty} \gamma^t \big(\phi^\top \Lambda^{-1} \phi\big)^{1/2} \,\Big|\, s_0 = s\Big] = \frac{1}{1 - \gamma}\, \mathbb{E}_{d_{\pi^*}}\Big[\sqrt{\mathrm{Tr}\big(\phi(s, a)\phi(s, a)^\top \Lambda^{-1}\big)} \,\Big|\, s_0 = s\Big] \le \frac{1}{1 - \gamma} \sqrt{\mathrm{Tr}\big(\Sigma_{\pi^*, s}\, \Lambda^{-1}\big)},$$
where $\Sigma_{\pi^*, s} = \mathbb{E}_{d_{\pi^*}}[\phi(s, a)\phi(s, a)^\top \mid s_0 = s]$. Combining this with the definition of the coverage coefficient yields the claim.

Lemma C.3 ($\xi$-Quantifiers). Let $\lambda = 1$, $\beta = c \cdot d V_{\max} \sqrt{\zeta}$, and $\zeta = \log\big(2dN/((1 - \gamma)\xi)\big)$. Then $\Gamma = \beta \cdot \big(\phi(s, a)^\top \Lambda^{-1} \phi(s, a)\big)^{1/2}$ is a $\xi$-uncertainty quantifier with probability at least $1 - \xi$. That is, letting $E_2$ be the event that
$$\big|(\mathbb{B}\widehat{V})(s, a) - (\widehat{\mathbb{B}}\widehat{V})(s, a)\big| \le \Gamma = \beta \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)}, \quad \forall (s, a) \in \mathcal{S} \times \mathcal{A},$$
we have $P(E_2) \ge 1 - \xi$.

Proof.
We have
$$\mathbb{B}\widehat{V} - \widehat{\mathbb{B}}\widehat{V} = \phi(s, a)^\top (w - \widehat{w}) = \underbrace{\phi(s, a)^\top w - \phi(s, a)^\top \Lambda^{-1} \Big(\sum_{\tau=1}^{N} \phi_\tau \phi_\tau^\top\Big) w}_{(i)} + \underbrace{\phi(s, a)^\top \Lambda^{-1} \Big(\sum_{\tau=1}^{N} \phi_\tau \phi_\tau^\top w - \sum_{\tau=1}^{N} \phi_\tau \big(r_\tau + \gamma \widehat{V}(s_{\tau+1})\big)\Big)}_{(ii)}.$$
We bound (i) and (ii) respectively. For (i),
$$(i) = \phi(s, a)^\top w - \phi(s, a)^\top \Lambda^{-1} (\Lambda - \lambda I) w = \lambda\, \phi(s, a)^\top \Lambda^{-1} w \le \lambda\, \|\phi(s, a)\|_{\Lambda^{-1}}\, \|w\|_{\Lambda^{-1}} \le V_{\max} \sqrt{d\lambda}\, \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)},$$
where the first inequality follows from the Cauchy–Schwarz inequality and the second from $\|\Lambda^{-1}\|_{\mathrm{op}} \le \lambda^{-1}$ and Lemma C.4. For notational simplicity, let $\epsilon_\tau = r_\tau + \gamma \widehat{V}(s_{\tau+1}) - \phi_\tau^\top w$; then
$$|(ii)| = \Big|\phi(s, a)^\top \Lambda^{-1} \sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau\Big| \le \underbrace{\Big\|\sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau\Big\|_{\Lambda^{-1}}}_{(iii)} \cdot \sqrt{\phi(s, a)^\top \Lambda^{-1} \phi(s, a)}.$$
The term (iii) depends on the randomness of the data collection process for $\mathcal{D}$. To bound it, we resort to uniform concentration inequalities to upper bound
$$\sup_{V \in \mathcal{V}(R, B, \lambda)} \Big\|\sum_{\tau=1}^{N} \phi(s_\tau, a_\tau) \cdot \epsilon_\tau(V)\Big\|_{\Lambda^{-1}}, \qquad \mathcal{V}(R, B, \lambda) = \big\{V(\cdot\,; w, \beta, \Sigma) : \mathcal{S} \to [0, V_{\max}] \text{ with } \|w\| \le R,\ \beta \in [0, B],\ \Sigma \succeq \lambda I\big\},$$
where $V(s; w, \beta, \Sigma) = \max_a \big\{\phi(s, a)^\top w - \beta \sqrt{\phi(s, a)^\top \Sigma^{-1} \phi(s, a)}\big\}$. For all $\varepsilon > 0$, let $\mathcal{N}(\varepsilon; R, B, \lambda)$ be a minimal $\varepsilon$-cover of $\mathcal{V}(R, B, \lambda)$: for any $V \in \mathcal{V}(R, B, \lambda)$, there exists $V^\dagger \in \mathcal{N}(\varepsilon; R, B, \lambda)$ such that $\sup_{s \in \mathcal{S}} |V(s) - V^\dagger(s)| \le \varepsilon$. Let $R_0 = V_{\max}\sqrt{Nd/\lambda}$ and $B_0 = 2\beta$; it is easy to show that at each iteration $\widehat{V} \in \mathcal{V}(R_0, B_0, \lambda)$. From the definition of $\mathbb{B}$, we have
$$|\mathbb{B}V - \mathbb{B}V^\dagger| = \gamma \Big|\int \big(V(s') - V^\dagger(s')\big)\, \langle \phi(s, a), \mu(s') \rangle\, ds'\Big| \le \gamma \varepsilon,$$
and hence
$$\big|(r + \gamma V - \mathbb{B}V) - (r + \gamma V^\dagger - \mathbb{B}V^\dagger)\big| \le 2\gamma\varepsilon.$$
Let $\epsilon_\tau^\dagger = r(s_\tau, a_\tau) + \gamma V^\dagger(s_{\tau+1}) - (\mathbb{B}V^\dagger)(s_\tau, a_\tau)$. Then
$$(iii)^2 = \Big\|\sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau\Big\|^2_{\Lambda^{-1}} \le 2\Big\|\sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau^\dagger\Big\|^2_{\Lambda^{-1}} + 2\Big\|\sum_{\tau=1}^{N} \phi_\tau \big(\epsilon_\tau^\dagger - \epsilon_\tau\big)\Big\|^2_{\Lambda^{-1}} \le 2\Big\|\sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau^\dagger\Big\|^2_{\Lambda^{-1}} + 8\gamma^2 \varepsilon^2 \sum_{\tau=1}^{N} \big|\phi_\tau^\top \Lambda^{-1} \phi_\tau\big| \le 2\Big\|\sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau^\dagger\Big\|^2_{\Lambda^{-1}} + 8\gamma^2 \varepsilon^2 N^2/\lambda.$$
It remains to bound $\big\|\sum_{\tau=1}^{N} \phi_\tau \epsilon_\tau^\dagger\big\|^2_{\Lambda^{-1}}$.
From the assumption on the data collection process, it is easy to show that $\mathbb{E}_{\mathcal{D}}[\epsilon_\tau \mid \mathcal{F}_{\tau-1}] = 0$, where $\mathcal{F}_{\tau-1} = \sigma\big(\{(s_i,a_i)\}_{i=1}^{\tau} \cup \{(r_i,s_{i+1})\}_{i=1}^{\tau-1}\big)$ is the $\sigma$-algebra generated by the variables from the first $\tau$ steps. Moreover, since $|\epsilon_\tau| \le 2V_{\max}$, the $\epsilon_\tau$ are $2V_{\max}$-sub-Gaussian conditioning on $\mathcal{F}_{\tau-1}$. Then we invoke Lemma C.7 with $M_0 = \lambda I$ and $M_k = \lambda I + \sum_{\tau=1}^k \phi(s_\tau,a_\tau)\phi(s_\tau,a_\tau)^\top$. For a fixed function $V:\mathcal{S}\to[0,V_{\max}]$, we have
$$P_{\mathcal{D}}\bigg(\Big\|\sum_{\tau=1}^N \phi(s_\tau,a_\tau)\cdot\epsilon_\tau(V)\Big\|^2_{\Lambda^{-1}} > 8V_{\max}^2\cdot\log\Big(\frac{\det(\Lambda)^{1/2}}{\delta\cdot\det(\lambda I)^{1/2}}\Big)\bigg) \le \delta$$
for all $\delta\in(0,1)$. Note that $\|\phi(s,a)\|\le 1$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ by Definition 3.1. We have
$$\|\Lambda\|_{\mathrm{op}} = \Big\|\lambda I + \sum_{\tau=1}^N \phi(s_\tau,a_\tau)\phi(s_\tau,a_\tau)^\top\Big\|_{\mathrm{op}} \le \lambda + \sum_{\tau=1}^N \big\|\phi(s_\tau,a_\tau)\phi(s_\tau,a_\tau)^\top\big\|_{\mathrm{op}} \le \lambda + N,$$
where $\|\cdot\|_{\mathrm{op}}$ denotes the matrix operator norm. Hence it holds that $\det(\Lambda) \le (\lambda+N)^d$ and $\det(\lambda I) = \lambda^d$, which implies
$$P_{\mathcal{D}}\bigg(\Big\|\sum_{\tau=1}^N \phi(s_\tau,a_\tau)\cdot\epsilon_\tau(V)\Big\|^2_{\Lambda^{-1}} > 4V_{\max}^2\cdot\big(2\log(1/\delta) + d\log(1+N/\lambda)\big)\bigg)
\le P_{\mathcal{D}}\bigg(\Big\|\sum_{\tau=1}^N \phi(s_\tau,a_\tau)\cdot\epsilon_\tau(V)\Big\|^2_{\Lambda^{-1}} > 8V_{\max}^2\cdot\log\Big(\frac{\det(\Lambda)^{1/2}}{\delta\cdot\det(\lambda I)^{1/2}}\Big)\bigg) \le \delta.$$
This establishes the concentration bound for any fixed function $V$. Applying this bound together with a union bound over the cover, we have
$$P_{\mathcal{D}}\bigg(\sup_{V\in\mathcal{N}(\varepsilon)}\Big\|\sum_{\tau=1}^N \phi(s_\tau,a_\tau)\cdot\epsilon_\tau(V)\Big\|^2_{\Lambda^{-1}} > 4V_{\max}^2\cdot\big(2\log(1/\delta) + d\log(1+N/\lambda)\big)\bigg) \le \delta\cdot|\mathcal{N}(\varepsilon)|.$$
Recall that $\widehat V\in\mathcal{V}(R_0,B_0,\lambda)$, where
$$R_0 = V_{\max}\sqrt{Nd/\lambda},\qquad B_0 = 2\beta,\qquad \lambda = 1,\qquad \beta = c\cdot dV_{\max}\sqrt{\zeta}. \tag{44}$$
Here $c>0$ is an absolute constant, $\xi\in(0,1)$ is the confidence parameter, and $\zeta = \log(2dV_{\max}/\xi)$ is specified in Algorithm 2. Applying Lemma C.6 with $\varepsilon = dV_{\max}/N$, we have
$$\log|\mathcal{N}(\varepsilon)| \le d\cdot\log\big(1+4d^{-1/2}N^{3/2}\big) + d^2\cdot\log\big(1+32c^2 d^{1/2}N^2\zeta\big) \le d\cdot\log\big(1+4d^{1/2}N^2\big) + d^2\cdot\log\big(1+32c^2 d^{1/2}N^2\zeta\big).$$
By setting $\delta = \xi/|\mathcal{N}(\varepsilon)|$, we have that with probability at least $1-\xi$,
$$\Big\|\sum_{\tau=1}^N \phi(s_\tau,a_\tau)\cdot\epsilon_\tau(\widehat V)\Big\|^2_{\Lambda^{-1}} \le 8V_{\max}^2\cdot\Big(2\log(V_{\max}/\xi) + 4d^2\log\big(64c^2 d^{1/2}N^2\zeta\big) + d\log(1+N) + 4d^2\Big) \le 8V_{\max}^2 d^2\zeta\big(4 + \log(64c^2)\big). \tag{46}$$
Here the last inequality follows from simple algebraic inequalities.
We set $c\ge 1$ to be sufficiently large, which ensures that $36 + 8\log(64c^2) \le c^2/4$ on the right-hand side of Equation (46). By Equations (37) and (46), it holds that
$$|\text{(ii)}| \le \frac{c}{2}\cdot dV_{\max}\sqrt{\zeta}\cdot\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)} = \frac{\beta}{2}\cdot\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)}.$$
By Equations (20), (35), (36), and (47), for all $(s,a)\in\mathcal{S}\times\mathcal{A}$ it holds that
$$\big|(\mathbb{B}^\gamma\widehat V)(s,a) - (\widehat{\mathbb{B}}^\gamma\widehat V)(s,a)\big| \le \big(V_{\max}\sqrt{d} + \beta/2\big)\cdot\sqrt{\phi(s,a)^\top\Lambda^{-1}\phi(s,a)} \le \Gamma(s,a)$$
with probability at least $1-\xi$. Therefore, we conclude the proof of Lemma C.3.

Lemma C.4 (Bounded weight of value function). Let $V_{\max} = r_{\max}/(1-\gamma)$. For any function $V:\mathcal{S}\to[0,V_{\max}]$, we have
$$\|w\| \le V_{\max}\sqrt{d}, \qquad \|\widehat w\| \le V_{\max}\sqrt{\frac{Nd}{\lambda}}.$$

Proof. Since $w^\top\phi(s,a) = \langle\theta,\phi(s,a)\rangle + \gamma\int V(s')\,\langle\psi(s'),\phi(s,a)\rangle\,\mathrm{d}s'$, we have $w = \theta + \gamma\int V(s')\psi(s')\,\mathrm{d}s'$, and hence
$$\|w\| \le \|\theta\| + \gamma\Big\|\int V(s')\psi(s')\,\mathrm{d}s'\Big\| \le r_{\max}\sqrt{d} + \gamma V_{\max}\sqrt{d} = V_{\max}\sqrt{d}.$$
For $\widehat w$, we have
$$\begin{aligned}
\|\widehat w\| &= \Big\|\Lambda^{-1}\sum_{\tau=1}^N \phi_\tau\big(r_\tau + \gamma V(s_{\tau+1})\big)\Big\|
\le \sum_{\tau=1}^N \big\|\Lambda^{-1}\phi_\tau\big(r_\tau + \gamma V(s_{\tau+1})\big)\big\|
\le V_{\max}\sum_{\tau=1}^N \big\|\Lambda^{-1}\phi_\tau\big\|\\
&\le V_{\max}\sum_{\tau=1}^N \sqrt{\phi_\tau^\top\Lambda^{-1/2}\Lambda^{-1}\Lambda^{-1/2}\phi_\tau}
\le \frac{V_{\max}}{\sqrt{\lambda}}\sum_{\tau=1}^N \sqrt{\phi_\tau^\top\Lambda^{-1}\phi_\tau}
\le \frac{V_{\max}}{\sqrt{\lambda}}\sqrt{N\operatorname{Tr}\Big(\Lambda^{-1}\sum_{\tau=1}^N \phi_\tau\phi_\tau^\top\Big)}
\le V_{\max}\sqrt{\frac{Nd}{\lambda}}.
\end{aligned}$$

for all $t\ge 1$. Moreover, suppose that conditioning on $\mathcal{F}_{t-1}$, $\epsilon_t$ is a zero-mean and $\sigma$-sub-Gaussian random variable for all $t\ge 1$; that is,
$$\mathbb{E}[\epsilon_t\mid\mathcal{F}_{t-1}] = 0, \qquad \mathbb{E}\big[\exp(\lambda\epsilon_t)\mid\mathcal{F}_{t-1}\big] \le \exp\big(\lambda^2\sigma^2/2\big), \quad \forall\lambda\in\mathbb{R}.$$
Meanwhile, let $\{\phi_t\}_{t=1}^\infty$ be an $\mathbb{R}^d$-valued stochastic process such that $\phi_t$ is $\mathcal{F}_{t-1}$-measurable for all $t\ge 1$. Also, let $M_0\in\mathbb{R}^{d\times d}$ be a deterministic positive-definite matrix and $M_t = M_0 + \sum_{s=1}^t \phi_s\phi_s^\top$ for all $t\ge 1$. For all $\delta > 0$, it holds that
$$\Big\|\sum_{s=1}^t \phi_s\epsilon_s\Big\|^2_{M_t^{-1}} \le 2\sigma^2\cdot\log\Big(\frac{\det(M_t)^{1/2}\cdot\det(M_0)^{-1/2}}{\delta}\Big)$$
for all $t\ge 1$ with probability at least $1-\delta$.

Proof. See Theorem 1 of Abbasi-Yadkori et al. (2011) for a detailed proof.
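The self-normalized concentration bound above (Lemma C.7) lends itself to a quick empirical check. The following Monte Carlo sketch verifies that the bound holds far more often than the nominal $1-\delta$ for a fixed horizon; the dimension, horizon, and noise scale are illustrative choices, not values from the paper.

```python
import numpy as np

# Monte Carlo check of the self-normalized bound (Abbasi-Yadkori et al., 2011):
# ||sum_t phi_t eps_t||^2_{M_t^{-1}} <= 2 sigma^2 log(det(M_t)^{1/2} det(M_0)^{-1/2} / delta)
# should fail with probability at most delta. M_0 = I here.
rng = np.random.default_rng(3)
d, T, sigma, delta, trials = 4, 200, 1.0, 0.05, 500

failures = 0
for _ in range(trials):
    phi = rng.normal(size=(T, d))
    # Enforce ||phi_t|| <= 1, matching the feature-norm assumption.
    phi /= np.maximum(1.0, np.linalg.norm(phi, axis=1, keepdims=True))
    eps = sigma * rng.normal(size=T)          # sigma-sub-Gaussian noise
    M = np.eye(d) + phi.T @ phi               # M_T = M_0 + sum_t phi_t phi_t^T
    s = phi.T @ eps                           # self-normalized sum
    lhs = s @ np.linalg.solve(M, s)           # squared M^{-1}-weighted norm
    rhs = 2 * sigma**2 * np.log(np.sqrt(np.linalg.det(M)) / delta)
    failures += lhs > rhs
assert failures / trials <= delta
```

Because the lemma holds uniformly over all $t\ge 1$ while this sketch checks a single fixed horizon, the empirical failure rate is typically far below $\delta$.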

D EXPERIMENTAL SETTINGS D.1 MULTI-TASK ANTMAZE

We first divide the source dataset (e.g., antmaze-medium-diverse-v2) into a directed dataset or an undirected dataset. The trajectories in the undirected dataset are randomly and uniformly assigned to the different tasks. In contrast, each trajectory in the directed dataset is associated with the goal closest to its final state. Then, for each subtask (e.g., goal = (0.0, 16.0)), we relabel the corresponding directed or undirected dataset according to the goal. Finally, we keep the rewards in the target-task dataset (the labeled dataset) while setting the rewards in the other tasks' datasets to 0 (the unlabeled dataset). We visualize the directed and undirected datasets in Figure 3. We can see that the distribution shift issue is more severe in the directed dataset than in the undirected one, which further challenges unlabeled data sharing algorithms.

However, it is worth noting that the asymptotic term depends on the backbone offline algorithm we choose. For model-based algorithms like Uehara & Sun (2021), the performance bound scales as O(d), and thus the problem's dimension does not affect the asymptotic term. The dependence on the discount factor is also determined by the choice of the backbone algorithm. However, it is known that the lower bound for offline RL algorithms in linear MDPs scales as O(1/(1-γ)^1.5), so this asymptotic term must scale at least as O(1/(1-γ)^0.5) and thus negatively depends on the discount factor. That is, the larger the discount factor, the better the relative performance of PDS.

F DISCUSSION ON THE PERFORMANCE BOUND
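The splitting and relabeling procedure described above can be sketched as follows. This is a minimal illustration, not the actual pipeline: the trajectory format, goal list, and the 0.5 success radius are hypothetical stand-ins for the AntMaze conventions.

```python
import numpy as np

# Hypothetical goal positions for the subtasks (illustrative values).
goals = [(0.0, 16.0), (16.0, 16.0), (16.0, 0.0)]

def assign_task(traj, directed, rng):
    """Directed: pick the goal closest to the trajectory's final state.
    Undirected: assign a task uniformly at random."""
    if directed:
        final_xy = traj["xy"][-1]
        dists = [np.linalg.norm(final_xy - np.array(g)) for g in goals]
        return int(np.argmin(dists))
    return int(rng.integers(len(goals)))

def relabel(traj, task_id, target_task):
    """Sparse reward for the assigned goal; zero out rewards for non-target
    tasks to produce the unlabeled (reward-free) share."""
    goal = np.array(goals[task_id])
    near = np.linalg.norm(traj["xy"] - goal, axis=1) < 0.5  # hypothetical radius
    reward = near.astype(np.float64)  # +1 near the goal, else 0
    if task_id != target_task:
        reward[:] = 0.0
    return reward
```

For example, a trajectory ending near (0.0, 16.0) is assigned task 0 in the directed split, and its reward labels are kept only when task 0 is the target task.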

F.1 CONSTRUCTION OF ADVERSARIAL EXAMPLES

In this section, we show that there is an MDP and an "unfortunate" dataset such that the suboptimality of Algorithm 3 matches the performance bound in Theorem 4.3. We first focus on the case without data sharing; the same techniques can be used to match the reward-learning bound. We only need to show that all inequalities in Theorem 4.3 can become equalities. Suppose we have an MDP with one state and $N > d$ actions, $d$ of which are optimal actions with feature maps $\phi(s, a_i) = (\underbrace{0, \dots, 0}_{i-1 \text{ zeros}}, \sqrt{d}, 0, \dots)$, and let the optimal policy be uniform over these $d$ actions. This construction makes $\Sigma_{\pi^*,s} = I$, so that the inequalities in Equations (25)-(27) become equalities. Then we let the samples from the other actions match the confidence upper bound while the samples from the optimal actions match the confidence lower bound, and we also need all samples to be symmetric over the different dimensions (this is required by the Cauchy-Schwarz inequality used in the proof), so that the confidence bound inequalities in Lemma C.3 also become equalities. In this case, the suboptimality bound is matched and the bound in Theorem 4.3 is tight.
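The key property of this construction, $\Sigma_{\pi^*,s} = I$ under the uniform optimal policy, can be verified numerically. A minimal sketch, where the dimension $d$ is an arbitrary illustrative choice:

```python
import numpy as np

d = 5
# Features of the d optimal actions: sqrt(d) in coordinate i, zeros elsewhere.
phi = np.sqrt(d) * np.eye(d)  # row i is phi(s, a_i)

# Uniform optimal policy over the d actions:
# Sigma_{pi*,s} = (1/d) * sum_i phi_i phi_i^T = (1/d) * (d * I) = I.
sigma = (phi.T @ phi) / d
assert np.allclose(sigma, np.eye(d))
```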

F.2 REWARD BIAS OF UDS

In this section, we show that UDS suffers from a constant reward bias. That is, the reward bias does not vanish as long as the ratio between the labeled and unlabeled data sizes remains constant.

Proof.

G HOW BAD IS THE REWARD PREDICTION BASELINE ACTUALLY?

We conduct ablation studies for the reward prediction method from three aspects: model size, ensemble number, and early stopping, shown in Table 5, Table 6, and Table 7. For the model size, 256*3 denotes 256 hidden neurons with three hidden layers. As for early stopping, Epoch Number = 3 denotes traversing the entire training dataset 3 times. (We find that Epoch Number = 3 is enough to reduce the prediction error to a small range, and increasing epochs leads to overfitting.) We conduct experiments in a setting where the quality of the reward-labeled and reward-free data differs significantly. For example, walker2d-expert(50K)-random(0.1M) denotes 50K reward-labeled data from expert datasets and 0.1M reward-free data from random datasets. We use the default parameters for PDS in all comparisons, including model size 256*2, training epoch 3, and ensemble number 10. All experimental results adopt the normalized score metric averaged over five seeds with standard deviation. The experimental results show that fine-tuning the reward prediction method can improve its performance in some cases. Nevertheless, a well-tuned reward prediction method still performs poorly compared to PDS in general. We hypothesize that this is because the "test" (reward-free) dataset may have a large distributional shift from the "training" (reward-labeled) dataset, violating the i.i.d. assumption.
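As a rough illustration of the ensemble-based pessimism that PDS applies to learned rewards, the following sketch fits a bootstrap ensemble on labeled data and labels reward-free inputs with a mean-minus-$k\cdot$std estimate. The linear models, penalty weight, and synthetic data are hypothetical stand-ins for the actual MLP reward models and datasets.

```python
import numpy as np

rng = np.random.default_rng(1)
K, k_penalty, n_lab, d = 10, 2.0, 200, 6  # illustrative sizes, not paper values

# Synthetic labeled data: features and noisy linear rewards.
x_lab = rng.normal(size=(n_lab, d))
r_lab = x_lab @ rng.normal(size=d) + 0.1 * rng.normal(size=n_lab)

# Bootstrap ensemble of ridge regressors standing in for the reward MLPs.
weights = []
for _ in range(K):
    idx = rng.integers(n_lab, size=n_lab)     # bootstrap resample
    X, y = x_lab[idx], r_lab[idx]
    weights.append(np.linalg.solve(X.T @ X + np.eye(d), X.T @ y))
weights = np.stack(weights)                   # shape (K, d)

def pessimistic_reward(x):
    """Ensemble mean minus k times the ensemble std: a conservative label
    for a reward-free transition, in the spirit of PDS."""
    preds = weights @ x                       # (K,) ensemble predictions
    return preds.mean() - k_penalty * preds.std()
```

By construction the pessimistic label never exceeds the ensemble mean, so out-of-distribution inputs with high ensemble disagreement receive strongly penalized rewards.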



Figure 1: Impact of the penalty weight k on performance. We evaluate PDS on Hopper/Walker2d-Labeled (Expert/Random)-Unlabeled (Expert/Random) tasks with various k.

Figure 3: Visualization of multi-task antmaze-medium-diverse-v2 datasets. The purple dots denote transitions with reward 0. The yellow dots denote transitions near the goal of the sub-task, where the reward is +1.

(a) Door Open (b) Drawer Close

Figure 4: Visualization of subtasks in Meta-World.

By labeling all rewards in the unlabeled dataset as zeros, we have
$$\mathbb{E}_{\mathcal{D}_0+\mathcal{D}_1}\big[|r(s,a) - r_{UDS}(s,a)|\big] = \frac{N_0}{N_0+N_1}\cdot\mathbb{E}_{\mathcal{D}_0}\big[|r(s,a) - r(s,a)|\big] + \frac{N_1}{N_0+N_1}\cdot\mathbb{E}_{\mathcal{D}_1}\big[|r(s,a) - 0|\big] = \frac{N_1}{N_0+N_1}\cdot\mathbb{E}_{\mathcal{D}_1}\big[|r(s,a)|\big].$$
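This constant bias is easy to confirm numerically. A sketch with arbitrary synthetic rewards ($N_0$, $N_1$, and the reward distribution are illustrative choices):

```python
import numpy as np

# With N0 labeled and N1 zero-labeled samples, the mean absolute reward error
# of UDS equals N1/(N0+N1) * E_{D1}[|r|], regardless of sample size.
rng = np.random.default_rng(2)
n0, n1 = 300, 700
r0 = rng.uniform(0.0, 1.0, size=n0)       # labeled rewards (kept intact)
r1 = rng.uniform(0.0, 1.0, size=n1)       # true rewards of the unlabeled data

r_true = np.concatenate([r0, r1])
r_uds = np.concatenate([r0, np.zeros(n1)])  # UDS: zero out unlabeled rewards

bias = np.abs(r_true - r_uds).mean()
assert np.isclose(bias, n1 / (n0 + n1) * np.abs(r1).mean())
```

Scaling $N_0$ and $N_1$ by the same factor leaves the bias unchanged, which is exactly the claim above.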

Experiment results for multi-task robotic manipulation (Meta-World) experiments. Numbers are averaged across five seeds and we bold the best-performing method that does not have access to the true rewards.

Experiment results for AntMaze tasks with normalized score metric averaged with five random seeds.

Experimental results with normalized score metric averaged with five random seeds.

Multi-task domains and datasets. We investigate Question (2) by evaluating PDS on several multi-task domains. The first set of domains we consider is Meta-World (Yu et al., 2020a), where we adopt the same setup as in CDS (Yu et al., 2021a) and evaluate PDS on four tasks: door open, door close, drawer open, and drawer close. The second domain is the AntMaze task in D4RL, which consists of mazes of two sizes (medium and large) and includes 3 and 7 tasks, respectively, corresponding to different goal positions. For a detailed description of the experimental setting, please refer to Appendix D.

E MORE DISCUSSION ON THE RELATIVE PERFORMANCE OF PDS

Based on the discussion in the main text, we can summarize the factors that affect the relative performance of PDS as follows. Scenarios where PDS has better relative performance.

Ablation for the model size. We adopt various model sizes for the prediction network in the Reward Prediction baseline, while PDS adopts the default parameter model size 256*3.

Ablation for the early stopping. We adopt various Training Epochs for the prediction network in the Reward Prediction baseline, while PDS adopts the default parameter epoch number 3.

Ablation for the ensemble. PDS adopts the default parameter ensemble number 10 and we adopt the same ensemble number for the prediction network in the Reward Prediction baseline.

7. ACKNOWLEDGEMENTS

This work is supported in part by Science and Technology Innovation 2030 -"New Generation Artificial Intelligence" Major Project (No. 2018AAA0100904) and the National Natural Science Foundation of China (62176135).


for all $s\in\mathcal{S}$. Then we obtain the stated bound, where $\{\lambda_j(s)\}_{j=1}^d$ are the eigenvalues of $\Sigma_{\pi^*,s}$ for all $s\in\mathcal{S}$, and the second inequality follows from Equation (26). Meanwhile, by Definition 3.1, we have $\|\phi(s,a)\|\le 1$ for all $(s,a)\in\mathcal{S}\times\mathcal{A}$, and we apply Jensen's inequality for all $s\in\mathcal{S}$. As $\Sigma_{\pi^*,s}$ is positive semidefinite, we have $\lambda_j(s)\in[0,1]$ for all $s\in\mathcal{S}$ and all $j\in[d]$. Hence the bound holds for all $s\in\mathcal{S}$, where the second inequality follows from the fact that $\lambda_j(s)\in[0,1]$ for all $s\in\mathcal{S}$ and all $j\in[d]$, while the third inequality follows from the choice of the scaling parameter $\beta>0$. Then we have the conclusion in Lemma C.1.

Lemma C.2. Under the event in Lemma C.3, we have the corresponding bound with probability $1-\delta$, for all $\|\theta\|_2^2\le d$.

Proof. Similar to the proof of Lemma C.1, let $\delta(s,a) = r(s,a) + \gamma\,\mathbb{E}_{s'\sim P(\cdot|s,a)}\widehat V(s') - \widehat Q(s,a)$. Then, under the condition of Lemma C.3, it holds that $0\le\delta(s,a)\le 2\Gamma(s,a)$ for all $(s,a)$, and we have the result immediately.

Lemma C.5. For policy $\pi^*$ and any reward function parameter $\theta\in\mathcal{C}(\delta)$, we have the stated bound.

Proof. From the definition, the first inequality follows from the Cauchy-Schwarz inequality, and the second inequality uses the fact that $\theta\in\mathcal{C}(\delta)$. The second-to-last inequality then follows similarly to Equations (26) and (27), and the last inequality follows from the choice $\alpha = 2r_{\max}\sqrt{d\zeta_2}$. Note that such a choice of $\alpha$ is sufficient for Lemma 4.1 to hold, since the corresponding inequalities hold for sufficiently small $\delta>0$ and $d\ge 2$.

C.1 TECHNICAL LEMMAS

Lemma C.6 ($\varepsilon$-Covering Number (Jin et al., 2020)). For all $h\in[H]$ and all $\varepsilon>0$, we have the stated bound on the $\varepsilon$-covering number of the value function class.

Proof of Lemma C.6. See Lemma D.6 in Jin et al. (2020) for a detailed proof.

Lemma C.7 (Concentration of Self-Normalized Processes (Abbasi-Yadkori et al., 2011)). Let $\{\mathcal{F}_t\}_{t=0}^\infty$ be a filtration and $\{\epsilon_t\}_{t=1}^\infty$ be an $\mathbb{R}$-valued stochastic process such that $\epsilon_t$ is $\mathcal{F}_t$-measurable

