CONTEXTUAL TRANSFORMER FOR OFFLINE REINFORCEMENT LEARNING

Abstract

Recently, the pretrain-tuning paradigm in large-scale sequence models has made significant progress in Natural Language Processing and Computer Vision. However, such a paradigm is still hindered by intractable challenges in Reinforcement Learning (RL), including the lack of self-supervised large-scale pretraining methods based on offline data and of efficient fine-tuning/prompt-tuning on unseen downstream tasks. In this work, we explore how prompts can help sequence-modeling based offline Reinforcement Learning (offline-RL) algorithms. Firstly, we propose prompt tuning for offline RL, where a context vector sequence is concatenated with the input to guide the conditional generation. As such, we can pretrain a model on the offline dataset with a supervised loss and learn a prompt that guides the policy to play the desired actions. Secondly, we extend the framework to the Meta-RL setting and propose the Contextual Meta Transformer (CMT), which leverages the context among different tasks as the prompt to improve performance on unseen tasks. We conduct extensive experiments across three different offline-RL settings: offline single-agent RL on the D4RL dataset, offline Meta-RL on the MuJoCo benchmark, and offline MARL on the SMAC benchmark. The results validate the strong performance and generality of our methods.

1. INTRODUCTION

Reinforcement learning algorithms based on sequence modeling (Chen et al., 2021; Janner et al., 2021; Reed et al., 2022) shine in sequential decision-making tasks and form a promising new paradigm. Compared with classic RL methods, such as policy-based and value-based methods (Sutton & Barto, 2018), optimizing policies from the sequence perspective has advantages in long-term credit assignment, partial observability, etc. Meanwhile, the significant generalization of large pretrained sequence models in natural language processing (Kenton & Toutanova, 2019; Brown et al., 2020) and computer vision (Liu et al., 2021b; Zhai et al., 2021) not only conserves vast computation on downstream tasks but also alleviates the demand for large quantities of data. Inspired by this, we ask: does pretraining have similar power in RL? Since limited and expensive interactions impede the deployment of RL in many valuable applications (Levine et al., 2020), pretraining a large model to improve robustness to the real-world gap through zero-shot generalization, and to improve data efficiency through few-shot learning, is of great significance. Prior work (Team et al., 2021; Meng et al., 2021) pretrains a single model on diverse and abundant training tasks in the decision-making domain so that it generalizes to downstream tasks, demonstrating that pretraining enables RL agents to harness prior knowledge for generalization.

Earlier works on sequence-modeling RL provide a new perspective on offline RL. However, extending these methods to the pretraining regime still faces several challenges. One major challenge for generalization (Li et al., 2020b) is how to encode task-relevant information and thereby enhance knowledge transfer among tasks. Since discovering the relationships among diverse tasks from data and making decisions conditioned on distinct tasks plays a significant role in generalization, this is not a trivial modification of existing methods.
Another problem is efficient self-supervised learning in offline RL. Specifically, the Decision Transformer (Chen et al., 2021) leverages the data to learn a return-conditioned policy, which ignores knowledge about world dynamics. In addition, the Trajectory Transformer (Janner et al., 2021) plans with a learned world model, but its high computational cost and decision latency may become a bottleneck for large-scale models and make it hard to fine-tune on other tasks. Therefore, introducing key components to transfer the knowledge in a
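The prompt-tuning idea sketched above can be made concrete at the level of input construction: a learned context (prompt) sequence is simply prepended to the interleaved (return-to-go, state, action) token sequence before it is fed to the transformer. The following is a minimal illustrative sketch, not the paper's implementation; all names and dimensions are hypothetical.

```python
import numpy as np

def build_prompted_sequence(prompt, returns_to_go, states, actions):
    """Prepend a learnable prompt prefix to the interleaved trajectory
    tokens (R_1, s_1, a_1, R_2, s_2, a_2, ...), as in prompt tuning for
    sequence-modeling RL. All inputs are arrays of embedding vectors
    with a shared embedding dimension."""
    T, d = states.shape
    tokens = np.empty((3 * T, d))
    tokens[0::3] = returns_to_go  # return-to-go token at each timestep
    tokens[1::3] = states         # state token
    tokens[2::3] = actions        # action token
    # The prompt prefix conditions every subsequent prediction; during
    # prompt tuning only these K vectors would receive gradients.
    return np.concatenate([prompt, tokens], axis=0)

# Hypothetical dimensions for illustration.
d_model, T, K = 8, 4, 3  # embedding dim, horizon, prompt length
rng = np.random.default_rng(0)
prompt = rng.normal(size=(K, d_model))   # learnable context vectors
rtg = rng.normal(size=(T, d_model))
states = rng.normal(size=(T, d_model))
actions = rng.normal(size=(T, d_model))

seq = build_prompted_sequence(prompt, rtg, states, actions)
print(seq.shape)  # (K + 3*T, d_model) = (15, 8)
```

Because the pretrained transformer weights stay frozen under this scheme, adapting to a new downstream task amounts to optimizing only the K prompt vectors against the supervised action-prediction loss.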

