THE PROVABLE BENEFITS OF UNSUPERVISED DATA SHARING FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Self-supervised methods have become crucial for advancing deep learning by leveraging the data itself to reduce the need for expensive annotations. However, how to conduct self-supervised offline reinforcement learning (RL) in a principled way remains an open question. In this paper, we address this issue by investigating the theoretical benefits of utilizing reward-free data in linear Markov Decision Processes (MDPs) within a semi-supervised setting. Further, we propose a novel algorithm, Provable Data Sharing (PDS), to utilize such reward-free data for offline RL. PDS adds penalties to the reward function learned from labeled data to prevent overestimation, ensuring a conservative algorithm. Our results on various offline RL tasks demonstrate that PDS significantly improves the performance of offline RL algorithms given access to reward-free data. Overall, our work provides a promising approach to leveraging the benefits of unlabeled data in offline RL while maintaining theoretical guarantees. We believe our findings will contribute to the development of more robust self-supervised RL methods.

1. INTRODUCTION

Offline reinforcement learning (RL) is a promising framework for learning sequential policies from pre-collected datasets. It is highly preferable in many real-world problems where active data collection and exploration are expensive, unsafe, or infeasible (Swaminathan & Joachims, 2015; Shalev-Shwartz et al., 2016; Singh et al., 2020). However, labeling large datasets with rewards can be costly and require significant human effort (Singh et al., 2019; Wirth et al., 2017). In contrast, unlabeled data can be cheap and abundant, making self-supervised learning with unlabeled data an attractive alternative. While self-supervised methods have achieved great success in computer vision and natural language processing tasks (Brown et al., 2020; Devlin et al., 2018; Chen et al., 2020), their potential benefits in offline RL are less explored.

Several prior works (Ho & Ermon, 2016; Reddy et al., 2019; Kostrikov et al., 2019) have explored demonstration-based approaches to eliminate the need for reward annotations, but these approaches require samples to be near-optimal. Another line of work focuses on sharing data across different datasets (Kalashnikov et al., 2021; Yu et al., 2021a), but it assumes that the dataset can be relabeled with oracle reward functions for the target task. These settings can be unrealistic in real-world problems where expert trajectories and reward labeling are expensive.

Incorporating reward-free datasets into offline RL is important but challenging due to the sequential and dynamic nature of RL problems. Prior work (Yu et al., 2022) has shown that learning to predict rewards can be difficult, and that simply setting the reward to zero can achieve good results. However, it is unclear how reward-prediction methods affect performance and whether reward-free data can provably benefit offline RL. This naturally leads to the following question: How can we leverage reward-free data to improve the performance of offline RL algorithms in a principled way?
To answer this question, we conduct a theoretical analysis of the benefits of utilizing unlabeled data in linear MDPs. Our analysis reveals that although unlabeled data can provide information about the dynamics of the MDP, it cannot reduce the uncertainty over reward functions. Based on this insight, we propose a model-free method named Provable Data Sharing (PDS), which adds uncertainty penalties to the learned reward functions to keep the algorithm conservative. By doing so, PDS can effectively leverage the benefits of unlabeled data for offline RL while retaining theoretical guarantees. PDS is simple and efficient and can be combined with existing model-free offline methods.

We conduct extensive experiments on various environments, including single-task domains like MuJoCo (Todorov et al., 2012) and Kitchen (Gupta et al., 2019), as well as multi-task domains like AntMaze and Meta-World (Yu et al., 2020a). The results show that PDS improves significantly over previous methods like UDS (Yu et al., 2022) and naive reward prediction.

Our main contribution is the Provable Data Sharing (PDS) algorithm, a novel method for utilizing unsupervised data in offline RL with theoretical guarantees. PDS adds uncertainty penalties to the learned reward functions and can be easily integrated with existing offline RL algorithms. Our experiments demonstrate that PDS achieves superior performance on various locomotion, navigation, and manipulation tasks. Overall, our work provides a promising approach to leveraging the benefits of unlabeled data in offline RL while maintaining theoretical guarantees, contributing to the development of more robust self-supervised RL methods.
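The core idea of penalizing learned rewards by their uncertainty can be sketched with a bootstrapped ensemble of reward models. This is a minimal sketch under assumed design choices (ridge-regression models, mean-minus-spread penalty, and all names here are illustrative), not the paper's exact implementation.

```python
import numpy as np

class PessimisticRewardLabeler:
    """Sketch of the PDS idea: fit an ensemble of reward models on the
    labeled data, then relabel reward-free transitions with a penalized
    (pessimistic) estimate so the downstream offline RL algorithm does
    not exploit overestimated rewards."""

    def __init__(self, n_models=5, beta=1.0, ridge=1e-3, seed=0):
        self.n_models = n_models
        self.beta = beta      # weight on the reward-uncertainty penalty
        self.ridge = ridge
        self.rng = np.random.default_rng(seed)
        self.weights = []

    def fit(self, features, rewards):
        # Bootstrap an ensemble of ridge-regression reward models.
        n, d = features.shape
        self.weights = []
        for _ in range(self.n_models):
            idx = self.rng.integers(0, n, size=n)
            X, y = features[idx], rewards[idx]
            w = np.linalg.solve(X.T @ X + self.ridge * np.eye(d), X.T @ y)
            self.weights.append(w)

    def pessimistic_reward(self, features):
        # Mean prediction minus a multiple of the ensemble spread:
        # unlabeled data informs the dynamics, but the reward estimate
        # must still be penalized where the ensemble disagrees.
        preds = np.stack([features @ w for w in self.weights])
        return preds.mean(axis=0) - self.beta * preds.std(axis=0)
```

Since the ensemble spread is non-negative, the relabeled reward is never above the mean prediction, which matches the conservative spirit of the method; larger `beta` yields a more pessimistic labeler.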

2. RELATED WORK

Offline Reinforcement Learning. Current offline RL methods (Levine et al., 2020) can be roughly divided into policy constraint-based, uncertainty estimation-based, and model-based approaches. Policy constraint methods aim to keep the policy close to the behavior policy under a probabilistic distance (Siegel et al., 2020; Yang et al., 2021; Kumar et al., 2020; Fujimoto & Gu, 2021; Yang et al., 2022; Fujimoto et al., 2019; Hu et al., 2022; Kostrikov et al., 2021; Ma et al., 2021; Wang et al., 2021). Uncertainty estimation-based methods attempt to quantify the confidence of Q-value predictions using dropout or ensemble techniques (An et al., 2021; Wu et al., 2021). Finally, model-based methods incorporate uncertainty in the model space for conservative offline learning (Yu et al., 2020b; 2021b; Kidambi et al., 2020).

Offline Data Sharing. Prior works have demonstrated that data sharing across tasks can be beneficial when sophisticated data-sharing protocols are designed (Yu et al., 2021a). For example, previous studies have developed data-sharing strategies based on human effort (Kalashnikov et al., 2021), inverse RL (Reddy et al., 2019; Li et al., 2020), and estimated Q-values (Yu et al., 2021a). However, these works must assume the dataset can be relabeled with oracle rewards for the target task, which is a strong assumption given the high cost of reward labeling. Effectively incorporating unlabeled data into offline RL algorithms is therefore essential. To address this issue, recent work (Yu et al., 2022) proposes simply assigning zero rewards to unlabeled data. In this work, we propose a principled way to leverage unlabeled data without the strong assumption of reward relabeling.

Reward Prediction. It is widely observed that reward shaping and intrinsic rewards can accelerate learning in online RL (Mataric, 1994; Ng et al., 1999; Wu & Tian, 2017; Song et al., 2019; Guo et al., 2016; Abel et al., 2021). There is also an extensive body of work studying the automatic design of reward functions using inverse RL (Ng et al., 2000; Fu et al., 2017). However, less attention has been paid to the offline setting, where online interaction is not allowed and the trajectories may not be optimal.

3.1. LINEAR MDPS AND PERFORMANCE METRIC

We consider infinite-horizon discounted Markov Decision Processes (MDPs), defined by the tuple (S, A, P, r, γ), with state space S, action space A, discount factor γ ∈ [0, 1), transition function P : S × A → ∆(S), and reward function r : S × A → [0, r max ]. To make things more concrete,
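The definition is cut off above. For reference, the standard linear MDP assumption from the literature (Jin et al., 2020), which analyses of this kind typically adopt, can be stated as follows; the paper's exact formulation and notation may differ.

```latex
% Standard linear MDP assumption (Jin et al., 2020), stated for
% reference: a known feature map phi and unknown parameters (mu, theta)
% linearize both the transition function and the reward function.
\exists\, \phi : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d,\quad
\mu = (\mu^{(1)}, \ldots, \mu^{(d)}),\quad \theta \in \mathbb{R}^d,
\quad \text{such that for all } (s, a) \in \mathcal{S} \times \mathcal{A}:
\qquad
P(\cdot \mid s, a) = \langle \phi(s, a), \mu(\cdot) \rangle,
\qquad
r(s, a) = \langle \phi(s, a), \theta \rangle .
```

Here the $\mu^{(i)}$ are unknown signed measures over $\mathcal{S}$; linearity in the known features $\phi$ is what makes uncertainty over dynamics and rewards tractable to analyze separately.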

