CONTEXTUAL TRANSFORMER FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Recently, the pretrain-then-tune paradigm for large-scale sequence models has made significant progress in Natural Language Processing and Computer Vision. However, this paradigm still faces intractable challenges in Reinforcement Learning (RL), including the lack of self-supervised large-scale pretraining methods based on offline data and of efficient fine-tuning/prompt-tuning for unseen downstream tasks. In this work, we explore how prompts can help sequence-modeling-based offline Reinforcement Learning (offline RL) algorithms. First, we propose prompt tuning for offline RL, where a sequence of context vectors is concatenated with the input to guide conditional generation. We can thus pretrain a model on an offline dataset with a supervised loss and learn a prompt that guides the policy toward the desired actions. Second, we extend this framework to the meta-RL setting and propose the Contextual Meta Transformer (CMT), which leverages the context shared among different tasks as a prompt to improve performance on unseen tasks. We conduct extensive experiments across three offline-RL settings: offline single-agent RL on the D4RL dataset, offline meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. The results validate the strong performance and generality of our methods.

1. INTRODUCTION

Reinforcement learning algorithms based on sequence modeling (Chen et al., 2021; Janner et al., 2021; Reed et al., 2022) shine in sequential decision-making tasks and form a promising new paradigm. Compared with classic RL methods, such as policy-based and value-based approaches (Sutton & Barto, 2018), optimizing policies from the sequence perspective has advantages in long-term credit assignment, partial observability, and more. Meanwhile, the significant generalization of large pretrained sequence models in natural language processing (Kenton & Toutanova, 2019; Brown et al., 2020) and computer vision (Liu et al., 2021b; Zhai et al., 2021) not only saves vast computation on downstream tasks but also alleviates their data requirements. Inspired by this, we ask: do pretraining techniques have similar power in RL? Since limited and expensive interactions impede the deployment of RL in many valuable applications (Levine et al., 2020), pretraining a large model that narrows the real-world gap through zero-shot generalization and improves data efficiency through few-shot learning is of great significance. Prior work (Team et al., 2021; Meng et al., 2021) pretrains a single model on diverse and abundant training tasks in the decision-making domain so that it generalizes to downstream tasks, demonstrating that pretraining enables RL agents to harness knowledge for generalization.

Earlier works on sequence-modeling RL provide a new perspective on offline RL. However, extending these methods to the pretraining domain still faces several challenges. One major challenge for generalization (Li et al., 2020b) is how to encode task-relevant information so as to enhance knowledge transfer among tasks. Since discovering the relationships among diverse tasks from data and making decisions conditioned on distinct tasks plays a significant role in generalization, this is not a trivial modification of existing methods.
Another problem is efficient self-supervised learning in offline RL. Specifically, the Decision Transformer (Chen et al., 2021) learns a return-conditioned policy from the data but ignores knowledge about world dynamics. The Trajectory Transformer (Janner et al., 2021), in turn, plans with a world model, but its high computational cost and decision latency can become a bottleneck for large-scale models and make fine-tuning to other tasks difficult. It is therefore necessary to introduce components that transfer the knowledge in a pretrained model while combining the advantages of a conditioned policy and a world model.

In this work, we propose a novel offline RL algorithm, the Contextual Meta Transformer (CMT), which aims to conquer multiple tasks and generalization at once in the offline setting from the perspective of sequence modeling. CMT provides a pretrain-and-prompt-tuning paradigm for offline RL. First, a model is pretrained on the offline dataset with a self-supervised method that encodes offline trajectories into policy prompts and uses these prompts to reconstruct the trajectories autoregressively. A better policy prompt is then learned by planning in the learned world model, yielding an improved policy that generates high-reward trajectories. In contrast to previous work, CMT learns the prompt that constructs the policy and guides the desired actions, rather than relying on a human-designed prompt or explicit planning with the world model at decision time. In the offline meta-learning setting, CMT extends the framework by simply concatenating a task prompt with the input sequence. With this simple modification, CMT can execute the proper policy for a specific task while sharing knowledge among tasks.
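The prompt-tuning idea above can be sketched in a few lines: a short sequence of trainable context vectors is prepended to the input embeddings of a frozen pretrained model, and only those vectors are updated to steer the policy. The snippet below is a minimal NumPy toy, not the paper's implementation; a linear readout with mean-pooled mixing stands in for the transformer backbone, and all dimensions, names, and the loss are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only, not from the paper).
d_model, prompt_len, seq_len, act_dim = 8, 3, 5, 2

# Frozen "pretrained backbone": a mean-pooling step that mixes
# information across positions (a crude stand-in for attention)
# followed by a linear readout. It is never updated.
W = rng.normal(size=(d_model, act_dim))

def frozen_model(x):
    ctx = x.mean(axis=0, keepdims=True)  # mixes prompt into every position
    return (x + ctx) @ W

def policy(prompt, token_embeds):
    """Prepend the prompt to the input tokens, run the frozen model,
    and read actions off the positions of the original tokens."""
    x = np.concatenate([prompt, token_embeds], axis=0)
    return frozen_model(x)[prompt_len:]

# Prompt tuning: the prompt is the only trainable parameter.
prompt = np.zeros((prompt_len, d_model))
tokens = rng.normal(size=(seq_len, d_model))   # embedded input trajectory
target = rng.normal(size=(seq_len, act_dim))   # desired actions

initial_loss = np.sum((policy(prompt, tokens) - target) ** 2)
lr = 0.05
for _ in range(200):
    residual = policy(prompt, tokens) - target
    # Exact squared-error gradient w.r.t. each prompt row: rows enter
    # only through the pooled mean, so they all share one gradient.
    grad_row = (2.0 / (prompt_len + seq_len)) * (residual @ W.T).sum(axis=0)
    prompt -= lr * grad_row  # broadcasts over the prompt rows
final_loss = np.sum((policy(prompt, tokens) - target) ** 2)
# W never changed; only the prompt moved, yet the loss decreased.
```

In the actual method the backbone is a transformer and the prompt is optimized against the learned world model rather than a fixed regression target, but the parameter-efficiency structure is the same: gradients flow only into the prepended context vectors.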
Our contributions are three-fold. First, we propose a novel offline RL algorithm based on prompt tuning, in which an offline trajectory is encoded as a prompt and an appropriate prompt is found that leads the executed policy to high reward in the online environment. Second, CMT is the first algorithm to solve offline meta-RL from a sequence-modeling perspective: the context trajectory, which captures the structure of the task, serves as a prompt that guides the policy on a specific unknown task. Third, empirical results on D4RL datasets and meta-MuJoCo tasks show that CMT achieves outstanding performance and strong generalization.

2. RELATED WORK

Offline Reinforcement Learning. Offline RL is gaining popularity as a data-driven RL approach that can effectively exploit large offline datasets. However, distribution shift and hyperparameter tuning in the offline setting seriously affect agent performance (Levine, 2021), and several schemes have been proposed to address them. Through action-space constraints, BCQ (Fujimoto et al., 2019), AWR (Peng et al., 2019), BRAC (Wu et al., 2019), and ICQ (Yang et al., 2021) reduce the extrapolation error caused by policy iteration. Noting the overestimation of values, CQL (Kumar et al., 2020) keeps estimates reasonable by optimizing a pessimistic expectation. UWAC (Wu et al., 2021) handles out-of-distribution (OOD) data by weighting the Q-value during training according to the estimated uncertainty of (s, a). MOPO (Yu et al., 2020) and MOReL (Kidambi et al., 2020) approach offline RL from the model-based perspective, ensuring rational control by penalizing uncertain regions. Decision Transformer (DT) (Chen et al., 2021) and Trajectory Transformer (TT) (Janner et al., 2021) recast RL as a sequential decision problem, extending Large-Language-Model-like (LLM-like) architectures to RL and inspiring many follow-up works. However, work on offline RL still lacks self-supervised large-scale pretraining methods and efficient prompt-tuning for unseen tasks, and CMT proposes a pretrain-and-tune paradigm to address both.

Pretrain and Sequence Modeling. Recently, much attention has been devoted to pretraining big models on large-scale unsupervised datasets and applying them to downstream tasks through fine-tuning.
In language processing tasks, transformer-based models such as BERT (Kenton & Toutanova, 2019) and GPT-3 (Brown et al., 2020) overcome the limitation that RNNs cannot be trained in parallel and improve the use of long-range information, achieving state-of-the-art results on NLP tasks such as translation and question answering. The CV field has likewise been inspired to recast its problems as sequence modeling, producing high-performance models such as the Swin Transformer (Liu et al., 2021b) and scaling ViT (Zhai et al., 2021). Since the trajectories in offline RL datasets are sequences with Markov structure, they too can be modeled with transformer-like architectures: the Decision Transformer (Chen et al., 2021) conditions the policy on the return-to-go (RTG), while the Trajectory Transformer (Janner et al., 2021) improves a behavior-cloning policy with beam search, each solving RL in the offline setting. Inspired by these works, we bring prompt tuning from NLP into the RL domain and propose a potential path toward pretraining a large-scale RL model and efficiently transferring its knowledge to downstream tasks.
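As a concrete illustration of the RTG conditioning used by DT-style models, the short sketch below computes returns-to-go from a reward sequence and notes how the tokens are interleaved. This is a generic sketch of the standard construction, not the CMT implementation, and the variable names are illustrative.

```python
import numpy as np

def returns_to_go(rewards):
    """RTG_t = sum of rewards from step t to the end of the episode:
    the target return the policy is conditioned on at each timestep."""
    return np.cumsum(rewards[::-1])[::-1]

rewards = np.array([1.0, 0.0, 2.0, 1.0])
rtg = returns_to_go(rewards)  # → [4., 3., 3., 1.]

# A DT-style model then consumes the interleaved token sequence
# (RTG_0, s_0, a_0, RTG_1, s_1, a_1, ...), predicting each action a_t
# from the tokens that precede it.
```

At evaluation time the initial RTG token is set to the desired target return and decremented by each observed reward, which is what makes the policy return-conditioned rather than purely imitative.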

