RETURN AUGMENTATION GIVES SUPERVISED RL TEMPORAL COMPOSITIONALITY

Abstract

Offline Reinforcement Learning (RL) methods that use supervised learning or sequence modeling (e.g., Chen et al. (2021a)) work by training a return-conditioned policy. A fundamental limitation of these approaches, as compared to value-based methods, is that they have trouble generalizing to behaviors with a higher return than any seen during training (Emmons et al., 2021). Value-based offline-RL algorithms like CQL use bootstrapping to combine training data from multiple trajectories to learn strong behaviors from sub-optimal data. We set out to endow RL via Supervised Learning (RvS) methods with this form of temporal compositionality. To do this, we introduce SUPERB, a dynamic programming algorithm for data augmentation that augments the returns in the offline dataset by combining rewards from intersecting trajectories. We show theoretically that SUPERB can improve sample complexity and enable RvS to find optimal policies in cases where it previously fell behind the performance of value-based methods. Empirically, we find that SUPERB improves the performance of RvS in several offline RL environments, surpassing the prior state-of-the-art RvS agents in AntMaze by orders of magnitude and offering performance competitive with value-based algorithms on the D4RL-gym tasks (Fu et al., 2020).

1. INTRODUCTION

The use of prior experiences to inform decision making is critical to our human ability to quickly adapt to new tasks. To build intelligent agents that match these capabilities, it is natural to seek algorithms that learn to act from preexisting datasets of experience. Research on this problem, formally known as offline reinforcement learning (RL), has focused on two main approaches. The first takes existing off-policy RL algorithms, such as those based on Q-learning, and alters them to reduce issues caused by distributional shift. The resulting algorithms use value pessimism and policy constraints to keep actions within the support of the data distribution while simultaneously optimizing for high returns (Fujimoto et al., 2019; Kumar et al., 2020a). The second, RL via Supervised Learning (RvS), draws inspiration from generative modeling and supervised learning to learn outcome-conditioned policy models and uses them to predict which actions should be taken in order to obtain a high return (Schmidhuber, 2019; Chen et al., 2021b; Emmons et al., 2021). RvS algorithms are appealing due to their simple training objective, robustness to hyperparameters, and strong performance, especially when trained in a multi-task setting. Recently, however, attention has been brought to their suboptimality in certain settings, such as stochastic environments (Paster et al., 2022; Villaflor et al., 2022; Eysenbach et al., 2022) and, as we remark in this work, offline settings where temporal compositionality¹ is required for good performance (see, e.g., Figure 1). Because return-conditioned RvS agents are trained using returns calculated from an offline dataset, they may fail to extrapolate to higher returns not present in any single trajectory but made possible by combining behaviors from multiple trajectories. While value-based methods use dynamic programming to compose behaviors across trajectories, RvS approaches fail to fully exploit the temporal structure inherent to sequential decision making (Sutton, 1988). In our work, we aim to endow RvS methods with this kind of temporal compositionality.

Traditionally, RvS agents are trained using return labels computed by summing over the future rewards of each trajectory individually. Our main insight is that we can apply the n-step temporal difference (TD) relation (Sutton & Barto, 2018) to offline data to augment the return labels used in training with returns that are made possible by composing different trajectories (a simplified sketch of this idea follows the list of contributions below). While Q-learning algorithms aim to learn the value of the optimal policy, our method uses a distributional value function to assist in generating a distribution of feasible return labels on which we can train a return-conditioned RvS agent. We propose to iteratively apply our method to augment returns, which we show can generate exponentially more labels for RvS agents to train on.

Our main contributions are:

1. We show that there are some environments where, both analytically and empirically, a return-conditioned RvS agent falls behind value-based agents that use bootstrapping, and propose a data augmentation method that bridges this gap by using the n-step TD relation to combine rewards from different trajectories.

2. We show that temporal compositionality can exponentially increase the number of training trajectories and is necessary to learn optimal policies in some offline RL datasets.

3. We evaluate our method on the D4RL offline RL suite. Our method dramatically improves the performance of RvS on AntMaze environments, where optimal policies must stitch together behaviors from partial demonstrations, from near zero to state-of-the-art. We also obtain performance competitive with value-based methods on the D4RL-gym tasks.
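To make the stitching intuition concrete, below is a minimal sketch of our own (not the paper's SUPERB algorithm) showing how a one-step version of the TD relation can propagate return labels across trajectories that intersect at a shared state. It assumes a small, deterministic dataset with discrete, hashable states; the function name `augment_returns` and the step layout `(state, action, reward, next_state)` are illustrative assumptions. SUPERB itself uses the n-step relation together with a distributional value function rather than the plain maximum taken here.

```python
from collections import defaultdict

def augment_returns(trajectories, gamma=1.0, n_iters=10):
    """Propagate return labels across intersecting trajectories.

    Each trajectory is a list of (state, action, reward, next_state) tuples,
    with next_state = None at episode termination. Dynamics are assumed
    deterministic, and the data is assumed to contain no positive-reward
    cycles (or gamma < 1), so the labels stay bounded.
    """
    best = defaultdict(lambda: float("-inf"))  # best augmented return seen from each state
    labels = [[0.0] * len(traj) for traj in trajectories]

    for _ in range(n_iters):  # repeat so labels can flow through chains of intersections
        for i, traj in enumerate(trajectories):
            for t in reversed(range(len(traj))):
                state, _action, reward, next_state = traj[t]
                if next_state is None or best[next_state] == float("-inf"):
                    future = 0.0  # terminal state, or a state never revisited in the data
                else:
                    future = best[next_state]  # stitch in the best continuation seen anywhere
                labels[i][t] = reward + gamma * future
                best[state] = max(best[state], labels[i][t])
    return labels
```

On a dataset like the one in Figure 1, propagation of this kind can recover a label such as 20 along (s_0, s_1, s_3, s_4) even though no single trajectory attains it. In stochastic environments, however, taking a plain maximum over-estimates achievable returns, which is one reason our method works with a distribution of feasible return labels instead.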

2.1. PRELIMINARIES

We consider the offline Reinforcement Learning (RL) setting where the environment is modeled as a Markov Decision Process (MDP), M = ⟨S, A, T, R, γ⟩, consisting of a state space, action space, transition function, reward function, and discount factor, respectively (Lange et al., 2012; Sutton & Barto, 2018). The agent is given a fixed dataset D of state-action-reward trajectories {τ = (s_1, a_1, r_1, s_2, a_2, r_2, ...)} produced by a (non-Markovian) "empirical policy" π_e acting in M, and is tasked with learning a policy π_θ : S → A that obtains a high return Σ_t γ^t r_t when executed in M.
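As a point of reference for what SUPERB later modifies, the following is a minimal sketch of how standard per-trajectory return-to-go labels are computed for RvS training; the names `Step` and `return_to_go` are our own illustrative choices, not from the paper.

```python
from typing import List, NamedTuple

class Step(NamedTuple):
    state: object
    action: object
    reward: float

def return_to_go(trajectory: List[Step], gamma: float = 1.0) -> List[float]:
    """Discounted sum of future rewards from each timestep, within one trajectory."""
    labels: List[float] = []
    running = 0.0
    for step in reversed(trajectory):  # accumulate rewards backwards in time
        running = step.reward + gamma * running
        labels.append(running)
    labels.reverse()
    return labels
```

A return-conditioned RvS policy is then trained by supervised learning to predict `trajectory[t].action` from `(trajectory[t].state, labels[t])`; SUPERB replaces these per-trajectory labels with augmented ones.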



¹ We define temporal compositionality as the composition of behaviors from different timesteps within a trajectory or from different trajectories.



Figure 1: Left: In this illustrative offline RL task, the dataset consists of trajectories gathered by two suboptimal policies: the blue/solid policy, which goes to the left and either gets a return of 12 or 15, and the red/dashed policy, which goes to the right and gets a return of 12. Middle: Typical RvS algorithms compute the return-to-go by summing up the rewards along each empirical trajectory. While the optimal trajectory (s_0, s_1, s_3, s_4) has a value of 20, no empirical trajectory has a value of 20. Right: Our method, SUPERB, combines rewards from multiple trajectories when calculating return-to-go labels and applies a label of 20 to each state and action along the optimal trajectory.

