CAN WIKIPEDIA HELP OFFLINE REINFORCEMENT LEARNING?

Abstract

Fine-tuning reinforcement learning (RL) models has been challenging because of a lack of large-scale off-the-shelf datasets as well as high variance in transferability among different environments. Recent work has tackled offline RL from the perspective of sequence modeling, with improved results following the introduction of the Transformer architecture. However, when the model is trained from scratch, it suffers from slow convergence. In this paper, we take advantage of this formulation of reinforcement learning as sequence modeling and investigate the transferability of sequence models pre-trained on other domains (vision, language) when fine-tuned on offline RL tasks (control, games). To this end, we also propose techniques to improve transfer between these domains. Results show consistent performance gains in terms of both convergence speed and reward on a variety of environments, accelerating training by 3-6x and achieving state-of-the-art performance in a variety of tasks using Wikipedia-pretrained and GPT-2 language models. We hope that this work not only sheds light on the potential of leveraging generic sequence modeling techniques and pre-trained models for RL, but also inspires future work on sharing knowledge between generative modeling tasks in completely different domains.

1. INTRODUCTION

Large pre-trained language models have shown impressive performance in natural language (Devlin et al., 2019; Radford et al., 2018) and vision (Dosovitskiy et al., 2021) tasks. Furthermore, Transformer-based autoregressive language models (Vaswani et al., 2017; Baevski & Auli, 2019; Radford et al., 2019) have been shown to be powerful sources of zero-shot and few-shot performance (Brown et al., 2020), with notably rapid adaptation in low-resource settings, demonstrating their easy adaptability and transferability to a number of tasks in their respective domains. Adapting autoregressive language models has also been extended to the multimodal setting (Tsimpoukelli et al., 2021) for tasks such as visual question answering. Concurrently, offline reinforcement learning (RL) has come to be seen as analogous to sequence modeling (Chen et al., 2021; Janner et al., 2021; Furuta et al., 2021), framed simply as supervised learning to fit return-augmented trajectories in an offline dataset. This relaxation, which does away with many of the complexities commonly associated with reinforcement learning (Watkins & Dayan, 1992; Kakade, 2001), allows us to take advantage of techniques popularized in sequence modeling tasks. Pre-training, in particular, is an essential technique for offsetting the higher compute costs of more expressive models such as Transformers. However, this concept is still relatively fresh in RL (Singh et al., 2020; Tirumala et al., 2020), due to the difficulty of parameterizing different scenes and tasks through a single network (Wang et al., 2018b; Jiang et al., 2019; Zeng et al., 2020) as well as the lack of large off-the-shelf datasets for pre-training (Cobbe et al., 2020; Zhu et al., 2020; Yu et al., 2020). Adopting pre-training as a default option for recent Transformer-based methods (Chen et al., 2021; Janner et al., 2021; Furuta et al., 2021) appears far off, at least if we look only within RL.
Unified under the umbrella of sequence modeling, we investigate whether Transformer-based pre-trained language models can be adapted to standard offline reinforcement learning tasks that have no relation to language. Given the setting of a single model pre-trained on natural language and fine-tuned on each offline RL task individually, we demonstrate drastic improvements in convergence speed and final policy performance. We also consider further techniques (e.g. extension of positional embeddings, encouragement of embedding similarity) to better exploit the features learned by the pre-trained language model, and demonstrate greater improvements. We show that pre-training on autoregressive modeling of natural language provides consistent performance gains over the Decision Transformer (Chen et al., 2021) on both the popular OpenAI Gym (Brockman et al., 2016) and Atari (Bellemare et al., 2013) offline RL benchmarks. We also note significantly faster convergence, with a 3-6x improvement over a vanilla Decision Transformer that turns hours of training into tens of minutes, indicating long-term computational efficiency benefits of language pre-training. Our findings point to the potential impact of large-scale pre-training for reinforcement learning, given its surprising efficacy when transferring from a distant sequence modeling domain such as natural language. Notably, unlike other work on multi-task offline RL, our model provides consistent results in terms of both reward and convergence regardless of environment and setting, indicating a foreseeable future in which pre-trained language models become a default starting point for offline RL.

Figure 1: Adapting pre-trained language models (e.g. from Wikipedia) to offline RL (e.g. in continuous control and games).
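To make the "offline RL as sequence modeling" framing concrete, the following sketch shows one common way (in the style of the Decision Transformer, which conditions on hindsight returns-to-go) to turn a logged trajectory into a flat token sequence for supervised sequence modeling. This is an illustrative assumption about the data layout, not the paper's actual implementation; the function names are hypothetical.

```python
def returns_to_go(rewards, gamma=1.0):
    """Hindsight return at each step: R_t = sum_{t' >= t} gamma^(t'-t) * r_{t'}.

    Computed with a single backward pass over the reward list.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def to_token_sequence(rewards, states, actions, gamma=1.0):
    """Interleave (return-to-go, state, action) triples into one flat sequence.

    An autoregressive model trained on such sequences learns to predict
    actions conditioned on desired returns and past states/actions.
    """
    rtg = returns_to_go(rewards, gamma)
    seq = []
    for g, s, a in zip(rtg, states, actions):
        seq += [g, s, a]
    return seq
```

At evaluation time, one would seed the sequence with a target return and the initial state, then sample actions autoregressively.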

2. BACKGROUND

Offline Reinforcement Learning. We consider a standard Markov Decision Process (MDP) with state space s ∈ S and action space a ∈ A, specified by an initial state distribution p(s_1), a dynamics distribution p(s_{t+1} | s_t, a_t), and a scalar reward function r(s, a). The goal of reinforcement learning (RL) is to find the optimal policy π*(a|s) that maximizes the γ-discounted expected return as the agent interacts with the environment:

$$\max_{\pi} \; \mathbb{E}_{s_{1:\infty},\, a_{1:\infty} \sim p,\, \pi} \left[ \sum_{t=1}^{\infty} \gamma^t \, r(s_t, a_t) \right]. \tag{1}$$

In offline RL, the objective remains the same, but it must be optimized with no interactive data collection, on a fixed set of trajectories τ_i with horizon N, each of the form

$$\tau = (r_1, s_1, a_1, r_2, s_2, a_2, \ldots, r_N, s_N, a_N). \tag{2}$$

Common approaches include value-based or model-based objectives with regularization (Fujimoto et al., 2019; Levine et al., 2020), and more recently, direct generative modeling of these trajectories conditioned on hindsight returns (Chen et al., 2021; Janner et al., 2021; Furuta et al., 2021).

Transformer model. In this subsection, we briefly review the Transformer architecture (Vaswani et al., 2017) used to model sequences. The Transformer is comprised of stacks of identical Transformer layers. Each of these layers takes in a set of n-dimensional vectors that are fed through two main building blocks, a multi-head self-attention sublayer and a feedforward MLP:

$$\mathrm{Attention}(x) = \mathrm{softmax}\!\left( \frac{Q(x) K(x)^{\top}}{\sqrt{n}} \right) V(x), \tag{3}$$

$$\mathrm{Feedforward}(x) = L_2(g(L_1(x))), \tag{4}$$

where Q, K, V are learned linear projections, L_1, L_2 are linear layers, and g is a nonlinearity.
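The two building blocks in equations (3) and (4) can be sketched directly in NumPy. This is a minimal single-head illustration under the assumption that g is a ReLU and that Q, K, V are plain (bias-free) linear maps; production implementations add multiple heads, masking, residual connections, and layer normalization.

```python
import numpy as np


def softmax(z, axis=-1):
    # Numerically stable softmax: subtract the row max before exponentiating.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(x, Wq, Wk, Wv):
    # Eq. (3): Attention(x) = softmax(Q(x) K(x)^T / sqrt(n)) V(x),
    # where x has shape (sequence_length, n).
    n = x.shape[-1]
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    return softmax(Q @ K.T / np.sqrt(n)) @ V


def feedforward(x, W1, b1, W2, b2):
    # Eq. (4): Feedforward(x) = L2(g(L1(x))), with g taken to be ReLU here.
    return np.maximum(x @ W1 + b1, 0.0) @ W2 + b2
```

A full Transformer layer applies `attention` and `feedforward` in sequence, each wrapped in a residual connection; in causal language models, the attention scores are additionally masked so each position attends only to earlier ones.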

