OFFLINE Q-LEARNING ON DIVERSE MULTI-TASK DATA BOTH SCALES AND GENERALIZES

Abstract

The potential of offline reinforcement learning (RL) is that high-capacity models trained on large, heterogeneous datasets can lead to agents that generalize broadly, analogously to similar advances in vision and NLP. However, recent works argue that offline RL methods encounter unique challenges to scaling up model capacity. Drawing on the lessons from these works, we re-examine previous design choices and find that with appropriate choices: ResNets, cross-entropy based distributional backups, and feature normalization, offline Q-learning algorithms exhibit strong performance that scales with model capacity. Using multi-task Atari as a testbed for scaling and generalization, we train a single policy on 40 games with near-human performance using up to 80-million-parameter networks, finding that model performance scales favorably with capacity. In contrast to prior work, we extrapolate beyond dataset performance even when trained entirely on a large (400M transitions) but highly suboptimal dataset (51% human-level performance). Compared to return-conditioned supervised approaches, offline Q-learning scales similarly with model capacity and has better performance, especially when the dataset is suboptimal. Finally, we show that offline Q-learning with a diverse dataset is sufficient to learn powerful representations that facilitate rapid transfer to novel games and fast online learning on new variations of a training game, improving over existing state-of-the-art representation learning approaches.

1. INTRODUCTION

High-capacity neural networks trained on large, diverse datasets have led to remarkable models that can solve numerous tasks, rapidly adapt to new tasks, and produce general-purpose representations in NLP and vision (Brown et al., 2020; He et al., 2021). The promise of offline RL is to leverage these advances to produce policies with broad generalization, emergent capabilities, and performance that exceeds the capabilities demonstrated in the training dataset. Thus far, the only offline RL approaches that demonstrate broadly generalizing policies and transferable representations are heavily based on supervised learning (Reed et al., 2022; Lee et al., 2022). However, these approaches are likely to perform poorly when the dataset does not contain expert trajectories (Kumar et al., 2021b). Offline Q-learning performs well across dataset compositions in a variety of simulated (Gulcehre et al., 2020; Fu et al., 2020) and real-world domains (Chebotar et al., 2021; Soares et al., 2021); however, these are largely centered on small-scale, single-task problems where broad generalization and learning general-purpose representations are not expected. Scaling these methods up to high-capacity models on large, diverse datasets is the critical challenge. Prior works hint at the difficulties: on small-scale, single-task deep RL benchmarks, scaling model capacity can lead to instabilities or degrade performance (Van Hasselt et al., 2018; Sinha et al., 2020; Ota et al., 2021), explaining why decade-old tiny 3-layer CNN architectures (Mnih et al., 2013) are still prevalent. Moreover, works that have scaled architectures to millions of parameters (Espeholt et al., 2018; Teh et al., 2017; Vinyals et al., 2019; Schrittwieser et al., 2021) typically focus on online learning and employ many sophisticated techniques to stabilize learning, such as supervised auxiliary losses, distillation, and pre-training.
Thus, it is unclear whether offline Q-learning can be scaled to high-capacity models. In this paper, we demonstrate that with careful design decisions, offline Q-learning can scale to high-capacity models trained on large, diverse datasets from many tasks, leading to policies that not only generalize broadly, but also learn representations that effectively transfer to new downstream tasks and exceed the performance in the training dataset. Crucially, we make three modifications motivated by prior work in deep learning and offline RL. First, we find that a modified ResNet architecture (He et al., 2016) substantially outperforms typical deep RL architectures and follows a power-law relationship between model capacity and performance, unlike common alternatives. Second, a discretized representation of the return distribution with a distributional cross-entropy loss (Bellemare et al., 2017) substantially improves performance compared to standard Q-learning, which utilizes a mean squared error loss. Finally, feature normalization on the intermediate feature representations stabilizes training and prevents feature co-adaptation (Kumar et al., 2021a). To systematically evaluate the impact of these changes on scaling and generalization, we train a single policy to play 40 Atari games (Bellemare et al., 2013; Agarwal et al., 2020), similarly to Lee et al. (2022), and evaluate performance when the training dataset contains expert trajectories and when the data is sub-optimal. This problem is especially challenging because of the diversity of games, each with its own unique dynamics, rewards, visuals, and agent embodiment.
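The distributional cross-entropy backup is only named in this excerpt; as a minimal sketch, assuming a C51-style categorical parameterization (Bellemare et al., 2017) with a fixed support of atoms, the projected target and the cross-entropy loss could look like the following (helper names are illustrative, not from the paper):

```python
import numpy as np

def project_categorical(p_next, reward, gamma, z):
    """C51-style projection (Bellemare et al., 2017): shift the fixed
    support z by the Bellman update r + gamma * z, then redistribute the
    next-state probability mass p_next onto the original atoms."""
    v_min, v_max = z[0], z[-1]
    delta = z[1] - z[0]
    tz = np.clip(reward + gamma * z, v_min, v_max)  # shifted atom values
    b = (tz - v_min) / delta                        # fractional atom index
    lo = np.floor(b).astype(np.int64)
    hi = np.ceil(b).astype(np.int64)
    # If b lands exactly on an atom, lo == hi and both weights below would
    # be zero; nudge the indices apart so no probability mass is lost.
    lo[(hi > 0) & (lo == hi)] -= 1
    hi[(lo < len(z) - 1) & (lo == hi)] += 1
    m = np.zeros_like(z)
    np.add.at(m, lo, p_next * (hi - b))  # mass assigned to the lower atom
    np.add.at(m, hi, p_next * (b - lo))  # mass assigned to the upper atom
    return m

def distributional_ce_loss(logits, target_probs):
    """Cross-entropy between the projected target distribution and the
    predicted categorical distribution softmax(logits)."""
    shifted = logits - np.max(logits)
    log_p = shifted - np.log(np.sum(np.exp(shifted)))  # stable log-softmax
    return -np.sum(target_probs * log_p)

# Example: 11 atoms spanning [0, 10]; the next state puts all mass on a
# return of 5, so the target return 1 + 0.5 * 5 = 3.5 splits its mass
# evenly between the atoms at 3 and 4.
z = np.linspace(0.0, 10.0, 11)
p_next = np.zeros(11)
p_next[5] = 1.0
m = project_categorical(p_next, reward=1.0, gamma=0.5, z=z)
```

Replacing the squared-error regression target with this cross-entropy objective turns each backup into a classification-style update over return bins, which is one plausible reading of why it interacts better with large models.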
Furthermore, the sub-optimal data setting requires the learning algorithm to "stitch together" useful segments of sub-optimal trajectories to perform well. To investigate generalization of learned representations, we evaluate offline fine-tuning to never-before-seen games and fast online adaptation on new variants of training games (Section 5.2). With our modifications:

• Offline Q-learning learns policies that attain more than 100% human-level performance on most of these games, about 2x better than prior supervised learning (SL) approaches for learning from sub-optimal offline data (51% human-level performance).
• Akin to scaling laws in SL (Kaplan et al., 2020), offline Q-learning performance scales favorably with model capacity (Figure 6).
• Representations learned by offline Q-learning give rise to more than 80% better performance when fine-tuning on new games compared to representations from state-of-the-art return-conditioned supervised (Lee et al., 2022) and self-supervised methods (He et al., 2021; Oord et al., 2018).

By scaling Q-learning, we realize the promise of offline RL: learning policies that broadly generalize and exceed the capabilities demonstrated in the training dataset. We hope that this work encourages large-scale offline RL applications, especially in domains with large sub-optimal datasets.
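The feature normalization mentioned as the third design choice is likewise only named in this excerpt. A minimal sketch, under the assumption that it rescales the penultimate-layer features to unit L2 norm before the final head (an illustrative instantiation, not a confirmed detail of the method):

```python
import numpy as np

def normalize_features(phi, eps=1e-6):
    """Rescale each feature vector to unit L2 norm, bounding the scale of
    the representation fed to the downstream head (hypothetical helper
    illustrating the feature-normalization idea)."""
    return phi / (np.linalg.norm(phi, axis=-1, keepdims=True) + eps)

# A batch of 4 random 8-dimensional encoder features.
np.random.seed(0)
phi = np.random.randn(4, 8)
phi_n = normalize_features(phi)
```

Bounding the feature norm this way caps how large the intermediate representations can grow during training, which is consistent with the stated goal of stabilizing training and preventing feature co-adaptation.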

2. RELATED WORK

Prior works have sought to train a single generalist policy to play multiple Atari games simultaneously from environment interactions, either using off-policy RL with online data collection (Espeholt et al., 2018; Hessel et al., 2019a; Song et al., 2019) , or policy distillation (Teh et al., 2017; Rusu et al., 2015) from single-task policies. While our work also focuses on learning such a generalist multi-task



Figure 1: An overview of the training and evaluation setup. Models are trained offline with potentially sub-optimal data. We adapt CQL to the multi-task setup via a multi-headed architecture. The pre-trained visual encoder is reused in fine-tuning (the weights are either frozen or fine-tuned), whereas the downstream fully-connected layers are reinitialized and trained.

