SEMI-SUPERVISED OFFLINE REINFORCEMENT LEARNING WITH ACTION-FREE TRAJECTORIES

Abstract

Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing state, action, reward triplets at every timestep, and unlabelled trajectories that contain only state and reward information. For this setting, we develop a simple meta-algorithmic pipeline that learns an inverse-dynamics model on the labelled data to obtain proxy labels for the unlabelled data, followed by the use of any offline RL algorithm on the true- and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks (Fu et al., 2020), certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of the trajectories, drawn from the low-return regime. Finally, we perform a large-scale controlled empirical study investigating the interplay of data-centric properties of the labelled and unlabelled datasets with algorithmic design choices (e.g., inverse dynamics, offline RL algorithm) to identify general trends and best practices for training RL agents on semi-supervised offline datasets.

1. INTRODUCTION

One of the key challenges in deploying reinforcement learning (RL) agents is their prohibitive sample complexity in real-world applications. Offline RL can significantly reduce this sample complexity by exploiting logged demonstrations from auxiliary data sources (Levine et al., 2020). However, contrary to the curated benchmarks in use today, the nature of offline demonstrations in the real world can be highly varied. For example, the demonstrations could be misaligned due to frequency mismatch (Burns et al., 2022), use different sensors, actuators, or dynamics (Reed et al., 2022; Lee et al., 2022), or lack partial state (Ghosh et al., 2022; Rafailov et al., 2021; Mazoure et al., 2021) or reward information (Yu et al., 2022). Successful offline RL in the real world requires embracing these heterogeneous aspects for maximal data efficiency, similar to learning in humans.

Figure 1.1: An example of the labelled and unlabelled data distributions.

In this work, we propose a new semi-supervised setup for offline RL. Standard offline RL assumes trajectories to be sequences of observations, actions, and rewards. However, many data sources, such as videos or third-person demonstrations, lack direct access to actions. Hence, we propose a semi-supervised setup where an agent's offline dataset also consists of action-unlabelled trajectories in addition to the aforementioned (action-labelled) trajectories. Standard offline RL algorithms, such as Conservative Q-Learning (CQL; Kumar et al. (2020)) or Decision Transformer (DT; Chen et al. (2021)), cannot directly operate on such unlabelled trajectories. At the same time, naively throwing out the unlabelled trajectories can be wasteful, especially when they have high returns. Our goal in this work is to enable compute- and data-efficient learning with additional action-unlabelled trajectory logs.

Unlike traditional semi-supervised learning, our setup has a few key differences. First, we do not assume that the distributions of the labelled and unlabelled trajectories are necessarily identical. In realistic scenarios, we expect these to be different, with unlabelled data
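The meta-algorithmic pipeline described above (fit an inverse-dynamics model on the action-labelled trajectories, use it to proxy-label the action-free ones, then hand everything to an off-the-shelf offline RL algorithm) can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the inverse-dynamics model here is a linear least-squares fit on consecutive state pairs, and all function and variable names are hypothetical.

```python
import numpy as np

def fit_inverse_dynamics(states, next_states, actions):
    """Fit a linear inverse-dynamics model a_t ~ [s_t; s_{t+1}] @ W
    on action-labelled transitions via least squares."""
    X = np.concatenate([states, next_states], axis=1)  # shape (N, 2*ds)
    W, *_ = np.linalg.lstsq(X, actions, rcond=None)    # shape (2*ds, da)
    return W

def proxy_label(W, states, next_states):
    """Predict proxy actions for action-free transitions."""
    X = np.concatenate([states, next_states], axis=1)
    return X @ W

# Toy data: actions are a fixed linear function of (s_t, s_{t+1}),
# so the linear model can recover them exactly on this noiseless example.
rng = np.random.default_rng(0)
ds, da, n = 3, 2, 256
S = rng.normal(size=(n, ds))
S_next = rng.normal(size=(n, ds))
true_W = rng.normal(size=(2 * ds, da))
A = np.concatenate([S, S_next], axis=1) @ true_W

W = fit_inverse_dynamics(S, S_next, A)          # train on labelled transitions
A_hat = proxy_label(W, S[:10], S_next[:10])     # proxy-label "unlabelled" ones
assert np.allclose(A_hat, A[:10], atol=1e-6)
```

Once the unlabelled trajectories carry proxy actions, they can be pooled with the labelled set and passed to any offline RL algorithm (e.g., CQL or DT); in practice the inverse-dynamics model would be a neural network trained on the labelled transitions rather than a linear fit.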

