HYPER-DECISION TRANSFORMER FOR EFFICIENT ON-LINE POLICY ADAPTATION

Abstract

Decision Transformers (DT) have demonstrated strong performance in offline reinforcement learning settings, but quickly adapting to unseen novel tasks remains challenging. To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner. To achieve this goal, we propose to augment the base DT with an adaptation module whose parameters are initialized by a hyper-network. When encountering unseen tasks, the hyper-network takes a handful of demonstrations as inputs and initializes the adaptation module accordingly. This initialization enables HDT to efficiently adapt to novel tasks by fine-tuning only the adaptation module. We validate HDT's generalization capability on object manipulation tasks. We find that with a single expert demonstration and fine-tuning only 0.5% of the DT parameters, HDT adapts faster to unseen tasks than fine-tuning the whole DT model. Finally, we explore a more challenging setting where expert actions are not available, and we show that HDT outperforms state-of-the-art baselines in terms of task success rates by a large margin. Demos are available on our project page.¹

1. INTRODUCTION

Building an autonomous agent capable of generalizing to novel tasks has been a longstanding goal of artificial intelligence. Recently, large transformer models have shown strong generalization capability in language understanding when fine-tuned with limited data (Brown et al., 2020; Wei et al., 2021). Such success motivates researchers to apply transformer models to the regime of offline reinforcement learning (RL) (Chen et al., 2021; Janner et al., 2021). By scaling up the model size and leveraging large offline datasets from diverse training tasks, transformer models have been shown to be generalist agents that successfully solve multiple games with a single set of parameters (Reed et al., 2022; Lee et al., 2022). Despite their superior performance on the training tasks, directly deploying these pre-trained agents to novel unseen tasks still leads to suboptimal behaviors. One solution is to leverage a handful of expert demonstrations from the unseen tasks to help policy adaptation, which has been studied in the context of meta imitation learning (meta-IL) (Duan et al., 2017; Reed et al., 2022; Lee et al., 2022). To deal with the discrepancies between the training and testing tasks, these works focus on fine-tuning the whole policy model with either expert demonstrations or online rollouts from the test environments. However, with the advent of large pre-trained transformers, fine-tuning the whole model is computationally expensive, and it is unclear how to perform policy adaptation efficiently (Figure 1(a)). We aim to fill this gap by proposing a more parameter-efficient solution. Moreover, previous work falls short in a more challenging yet realistic setting where the target tasks only provide demonstrations without expert actions.
This is similar to the state-only imitation learning or Learning-from-Observation (LfO) settings (Torabi et al., 2019; Radosavovic et al., 2021), where expert actions are unavailable; we therefore term this setting meta Learning-from-Observation (meta-LfO). As a result, we aim to develop a more general method that addresses both the meta-IL and meta-LfO settings.

Figure 1: Efficient online policy adaptation of pre-trained transformer models with few-shot demonstrations. To facilitate data efficiency, we introduce a demonstration-conditioned adaptation module that leverages prior knowledge in the demonstrations and guides exploration. When adapting to novel tasks, we fine-tune only the adaptation module to maintain parameter efficiency.

The closest work to ours is Prompt-DT (Xu et al., 2022), which conditions the model's behavior in new environments on a few demonstrations used as prompts. While the method was originally evaluated for meta-IL, the flexibility of the prompt design also makes it applicable to meta-LfO. However, we find that Prompt-DT hardly generalizes to novel environments (as shown empirically in Section 4), since the performance of the in-context learning paradigm, e.g., prompting, is generally inferior to that of fine-tuning methods (Brown et al., 2020). The lack of an efficient adaptation mechanism also makes Prompt-DT vulnerable to unexpected failures in unseen tasks. To achieve the strong performance of fine-tuning-based methods while maintaining the efficiency of in-context learning methods, we propose the Hyper-Decision Transformer (HDT), built on large pre-trained Decision Transformers (DT) (Chen et al., 2021). HDT comprises three key components: (1) a multi-task pre-trained DT model, (2) adapter layers that can be updated when solving novel unseen tasks, and (3) a hyper-network that outputs the parameter initialization of the adapter layers based on demonstrations.
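The interaction between components (2) and (3) can be illustrated with a minimal sketch: a hyper-network maps a pooled demonstration embedding to the weights of one bottleneck adapter layer. All sizes, the fixed random projection standing in for the learned hyper-network, and the function names are illustrative assumptions, not the actual HDT implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

D_MODEL = 64       # hidden size of one DT transformer block (assumed)
D_BOTTLENECK = 4   # adapter bottleneck width (assumed)
D_DEMO = 32        # size of the pooled demonstration embedding (assumed)

def hyper_init(demo_embedding):
    """Toy stand-in for the hyper-network: map a demonstration embedding
    to the down-/up-projection weights of one bottleneck adapter layer."""
    # A real hyper-network would be learned; here a fixed random
    # projection merely illustrates the demo -> weights mapping.
    n_down = D_MODEL * D_BOTTLENECK
    w = 0.01 * rng.standard_normal((D_DEMO, 2 * n_down))
    flat = demo_embedding @ w
    w_down = flat[:n_down].reshape(D_MODEL, D_BOTTLENECK)
    w_up = flat[n_down:].reshape(D_BOTTLENECK, D_MODEL)
    return w_down, w_up

def adapter(h, w_down, w_up):
    """Bottleneck adapter (Houlsby et al., 2019 style): down-project,
    ReLU, up-project, residual connection; activation shape unchanged."""
    return h + np.maximum(h @ w_down, 0.0) @ w_up

demo = rng.standard_normal(D_DEMO)        # pooled features of one demo
w_down, w_up = hyper_init(demo)
h = rng.standard_normal((10, D_MODEL))    # token activations in one block
out = adapter(h, w_down, w_up)
assert out.shape == h.shape
```

Because the adapter output starts near the residual identity and its weights come from the demonstration, the base DT's behavior is preserved while task-specific information is injected before any gradient step is taken.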
The pre-trained DT model encodes shared information across diverse training tasks and serves as a base policy. To match the performance of fine-tuning methods, we introduce an adapter layer with a bottleneck structure into each transformer block (Houlsby et al., 2019). The parameters of the adapter layers can be updated to adapt to a new task, and they add only a small fraction of parameters to the base DT model. The adapter parameters are initialized by a single hyper-network conditioned on demonstrations with or without actions. In this way, the hyper-network extracts task-specific information from the demonstrations and encodes it into the adapter's initial parameters. We evaluate HDT in both the meta-IL (with actions) and meta-LfO (without actions) settings. In meta-IL, the adapter module can be fine-tuned directly in a supervised manner, as in few-shot imitation learning. In meta-LfO, the agent interacts with the environment and performs RL while conditioning on the expert states. We conduct extensive experiments on the Meta-World benchmark (Yu et al., 2020), which contains diverse manipulation tasks requiring fine-grained gripper control. We train HDT on 45 tasks and test its generalization capability on 5 held-out tasks featuring unseen objects, or seen objects with different reward functions. Our experimental results show that HDT achieves strong data and parameter efficiency when adapting to novel tasks.
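The parameter-efficiency claim follows from simple arithmetic: each transformer block contributes O(d²) parameters, while a bottleneck adapter contributes only 2·d·b. The sizes below are illustrative assumptions chosen to show the order of magnitude, not HDT's actual configuration.

```python
# Back-of-the-envelope count behind fine-tuning ~0.5% of parameters.
# All sizes are illustrative assumptions, not the real HDT config.
d_model, n_blocks, d_bn = 512, 6, 16

# One transformer block: attention projections (~4*d^2) plus a
# feed-forward MLP with 4x expansion (~8*d^2); biases ignored.
base_params = n_blocks * (4 * d_model**2 + 8 * d_model**2)

# One bottleneck adapter per block: down (d x b) + up (b x d).
adapter_params = n_blocks * (2 * d_model * d_bn)

ratio = adapter_params / base_params
print(f"adapter/base = {ratio:.4%}")
```

With these assumed sizes the ratio comes out to roughly 0.5%, matching the scale of the fraction of parameters HDT fine-tunes; only the adapter weights receive gradients while the base DT stays frozen.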



¹ Project Page: https://sites.google.com/view/hdtforiclr2023/home.

We list our contributions as follows:

1. We propose Hyper-Decision Transformer (HDT), a transformer-based model that generalizes to novel unseen tasks while maintaining strong data and parameter efficiency.
2. In the meta-IL setting with only one expert demonstration, HDT fine-tunes only a small fraction (0.5%) of the model parameters and adapts faster than baselines that fine-tune the whole model, demonstrating strong parameter efficiency.
3. In the meta-LfO setting with only 20-80 online rollouts, HDT can sample successful episodes and therefore outperforms baselines by a large margin in terms of success rates, demonstrating strong data efficiency.

2. RELATED WORK

Transformers in Policy Learning. Transformer (Vaswani et al., 2017) has shown success in natural language tasks thanks to its strong sequence-modeling capacity. As generating a policy in RL is

