SEMI-SUPERVISED OFFLINE REINFORCEMENT LEARNING WITH ACTION-FREE TRAJECTORIES

Abstract

Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, and reward at every timestep, along with unlabelled trajectories that contain only state and reward information. For this setting, we develop a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy-labels for the unlabelled data, followed by the use of any offline RL algorithm on the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful on several D4RL benchmarks (Fu et al., 2020): certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of the trajectories, drawn from the low-return regime. Finally, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., inverse dynamics model, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.

1. INTRODUCTION

One of the key challenges in deploying reinforcement learning (RL) agents is their prohibitive sample complexity in real-world applications. Offline RL can significantly reduce this sample complexity by exploiting logged demonstrations from auxiliary data sources (Levine et al., 2020). However, contrary to the curated benchmarks in use today, the nature of offline demonstrations in the real world can be highly varied. For example, the demonstrations could be misaligned due to frequency mismatch (Burns et al., 2022), the use of different sensors, actuators, or dynamics (Reed et al., 2022; Lee et al., 2022), or missing partial state (Ghosh et al., 2022; Rafailov et al., 2021; Mazoure et al., 2021) or reward information (Yu et al., 2022). Successful offline RL in the real world requires embracing these heterogeneous aspects for maximal data efficiency, similar to learning in humans.

Unlike traditional semi-supervised learning, our setup has a few key differences. First, we do not assume that the distributions of the labelled and unlabelled trajectories are necessarily identical. In realistic scenarios, we expect them to differ, with the unlabelled data having higher returns than the labelled data; e.g., videos of a human professional are easier to obtain than installing actuators for continuous control tasks. We replicate such varied data quality setups in some of our experiments; Figure 1.1 shows an illustration of the difference in returns between the labelled and unlabelled dataset splits for the hopper-medium-expert D4RL dataset. Second, our end goal goes beyond labeling the actions in the unlabelled trajectories: we intend to use the unlabelled data to learn a downstream policy that is better than the behavioral policies used to generate the offline datasets.
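The two trajectory types in this setting can be represented as follows. This is a minimal illustrative sketch; the class and field names are ours, not from the paper:

```python
# Minimal representation of labelled vs. unlabelled trajectories in the
# semi-supervised offline RL setting: labelled trajectories carry states,
# rewards, and actions; unlabelled ones carry only states and rewards.
from dataclasses import dataclass
from typing import Optional
import numpy as np

@dataclass
class Trajectory:
    states: np.ndarray                    # shape (T, state_dim)
    rewards: np.ndarray                   # shape (T,)
    actions: Optional[np.ndarray] = None  # shape (T, act_dim); None => unlabelled

    @property
    def is_labelled(self) -> bool:
        return self.actions is not None

    @property
    def total_return(self) -> float:
        # Undiscounted return, used to compare labelled/unlabelled data quality.
        return float(self.rewards.sum())
```

Note that rewards are available for both trajectory types; only the actions may be missing.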
Hence, there are two kinds of generalization challenges: generalizing from the labelled to the unlabelled data distribution, and then going beyond the offline data distributions to get closer to the expert distribution. Regular offline RL is concerned only with the latter. Finally, we are mainly interested in the case where a significant majority of the trajectories in the offline dataset are unlabelled. One motivating example for this setup is learning from videos or third-person demonstrations: there are tremendous amounts of internet videos that could potentially be used to train RL agents, yet they lack action labels and are of varying quality.

Our paper seeks to answer the following questions:

1. How can we utilize the unlabelled data to improve the performance of offline RL algorithms?
2. How does performance vary as a function of data-centric properties, such as the size and return distributions of the labelled and unlabelled datasets?
3. How do offline RL algorithms compare in this setup?

To answer these questions, we propose a meta-algorithmic pipeline to train policies in the semi-supervised setup described above. We call our pipeline Semi-Supervised Offline Reinforcement Learning (SS-ORL). SS-ORL consists of three simple and scalable steps: (1) train a multi-transition inverse dynamics model on the labelled data, which predicts actions from transition sequences; (2) fill in proxy-actions for the unlabelled data; and finally (3) train an offline RL agent on the combined dataset. Empirically, we instantiate SS-ORL with CQL (Kumar et al., 2020), DT (Chen et al., 2021), and TD3BC (Fujimoto & Gu, 2021) as the underlying offline RL algorithms, and conduct experiments on the D4RL datasets (Fu et al., 2020). We highlight a few predominant trends from our experimental findings below:

1. Given low-quality labelled data, SS-ORL agents can exploit unlabelled data that contains high-quality trajectories and thus improve performance.
The absolute performance of SS-ORL is close to, or even matches, that of oracle agents that have access to complete action information.
2. When the labelled data quality is high, utilizing unlabelled data does not bring significant benefits.
3. The choice between value-based and behavior cloning based methods can significantly affect performance in the semi-supervised setup. In our experiments, CQL and TD3BC are less sensitive to the missing actions than DT: they enjoy better absolute performance when the labelled data is of low quality, and their performance gap relative to the oracle agent is smaller. See Appendix H for more details.
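The three steps of the pipeline can be sketched as follows. This is a minimal, self-contained sketch under our own simplifications: a tiny numpy MLP trained with a hand-rolled gradient loop stands in for the inverse dynamics model, and the offline RL algorithm of step (3) is passed in as an arbitrary callable. All function and variable names are illustrative, not from the paper's code:

```python
# Sketch of the SS-ORL pipeline: (1) fit a multi-transition inverse
# dynamics model (IDM) on labelled data, (2) proxy-label the unlabelled
# trajectories, (3) hand the combined dataset to any offline RL trainer.
import numpy as np

def make_windows(states, k):
    """Stack k consecutive states s_t..s_{t+k-1} as one IDM input vector."""
    T = len(states) - k + 1
    return np.stack([states[t:t + k].ravel() for t in range(T)])

class InverseDynamics:
    """Tiny one-hidden-layer MLP predicting a_t from a window of k states."""
    def __init__(self, in_dim, act_dim, hidden=64, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(0.0, 0.1, (in_dim, hidden))
        self.W2 = rng.normal(0.0, 0.1, (hidden, act_dim))

    def _forward(self, x):
        h = np.tanh(x @ self.W1)
        return h @ self.W2, h

    def fit(self, x, a, lr=1e-2, epochs=200):
        for _ in range(epochs):
            pred, h = self._forward(x)
            err = pred - a                          # gradient of MSE loss
            gW2 = h.T @ err / len(x)
            gh = (err @ self.W2.T) * (1.0 - h ** 2)  # backprop through tanh
            gW1 = x.T @ gh / len(x)
            self.W2 -= lr * gW2
            self.W1 -= lr * gW1

    def predict(self, x):
        return self._forward(x)[0]

def ss_orl(labelled, unlabelled, k, train_offline_rl):
    """labelled/unlabelled: lists of dicts of arrays ('states', 'actions')."""
    # (1) Fit the IDM on state windows and aligned actions from labelled data.
    xs = np.concatenate([make_windows(tr["states"], k) for tr in labelled])
    acts = np.concatenate([tr["actions"][:len(tr["states"]) - k + 1]
                           for tr in labelled])
    idm = InverseDynamics(xs.shape[1], acts.shape[1])
    idm.fit(xs, acts)
    # (2) Fill in proxy-actions for every unlabelled trajectory.
    for tr in unlabelled:
        tr["actions"] = idm.predict(make_windows(tr["states"], k))
    # (3) Train any offline RL agent on the combined dataset.
    return train_offline_rl(labelled + unlabelled)
```

In practice the IDM and the downstream agent would be standard deep networks (e.g., the CQL, DT, or TD3BC implementations named above); the point here is only the data flow between the three steps.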

2. RELATED WORK

Offline RL The goal of offline RL is to learn effective policies from fixed datasets generated by unknown behavior policies. There are two main categories of model-free offline RL methods: value-based methods and behavior cloning (BC) based methods. Value-based methods attempt to learn value functions via temporal difference (TD) updates. One line of work aims to port existing off-policy value-based online RL methods to the offline setting, adding various types of regularization that encourage the learned policy to stay close to the behavior policy. Representative techniques include specifically tailored policy parameterizations (Fujimoto et al., 2019; Ghasemipour et al., 2021), divergence-based regularization of the learned policy (Wu et al., 2019; Jaques et al., 2019; Kumar et al., 2019), and regularized value function estimation (Nachum et al., 2019; Kumar et al., 2020; Kostrikov et al., 2021a; Fujimoto & Gu, 2021; Kostrikov et al., 2021b). Recently, a growing body of work has formulated offline RL as a supervised learning problem (Chen et al., 2021; Janner et al., 2021; Emmons et al., 2021). Compared with value-based methods, these approaches enjoy several appealing properties, including algorithmic simplicity and training stability. Generally speaking, they can be viewed as conditional behavior cloning methods (Bain & Sammut, 1995), where the conditioning variables encode related information such as target returns or goals.



Figure 1.1: An example of the labelled and unlabelled data distributions: returns of the labelled and unlabelled splits of the hopper-medium-expert D4RL dataset.

