HYPER-DECISION TRANSFORMER FOR EFFICIENT ON-LINE POLICY ADAPTATION

Abstract

Decision Transformers (DT) have demonstrated strong performance in offline reinforcement learning settings, but quickly adapting to unseen novel tasks remains challenging. To address this challenge, we propose a new framework, called Hyper-Decision Transformer (HDT), that can generalize to novel tasks from a handful of demonstrations in a data- and parameter-efficient manner. To achieve this goal, we augment the base DT with an adaptation module whose parameters are initialized by a hyper-network. When encountering unseen tasks, the hyper-network takes a handful of demonstrations as inputs and initializes the adaptation module accordingly. This initialization enables HDT to adapt efficiently to novel tasks by fine-tuning only the adaptation module. We validate HDT's generalization capability on object manipulation tasks. We find that with a single expert demonstration and fine-tuning only 0.5% of DT parameters, HDT adapts faster to unseen tasks than fine-tuning the whole DT model. Finally, we explore a more challenging setting where expert actions are not available, and we show that HDT outperforms state-of-the-art baselines in terms of task success rates by a large margin. Demos are available on our project page.¹

1. INTRODUCTION

Building an autonomous agent capable of generalizing to novel tasks has been a longstanding goal of artificial intelligence. Recently, large transformer models have shown strong generalization on language understanding tasks when fine-tuned with limited data (Brown et al., 2020; Wei et al., 2021). This success motivates researchers to apply transformer models to offline reinforcement learning (RL) (Chen et al., 2021; Janner et al., 2021). By scaling up the model size and leveraging large offline datasets from diverse training tasks, transformer models have been shown to be generalist agents that solve multiple games with a single set of parameters (Reed et al., 2022; Lee et al., 2022).

Despite their superior performance on the training tasks, directly deploying these pre-trained agents to novel unseen tasks still leads to suboptimal behaviors. One solution is to leverage a handful of expert demonstrations from the unseen tasks to aid policy adaptation, which has been studied in the context of meta imitation learning (meta-IL) (Duan et al., 2017; Reed et al., 2022; Lee et al., 2022). To deal with the discrepancies between training and testing tasks, these works focus on fine-tuning the whole policy model with either expert demonstrations or online rollouts from the test environments. However, with the advent of large pre-trained transformers, fine-tuning the whole model is computationally expensive, and it is unclear how to perform policy adaptation efficiently (Figure 1(a)). We aim to fill this gap by proposing a more parameter-efficient solution.

Moreover, previous work falls short in a more challenging yet realistic setting where the target tasks only provide demonstrations without expert actions. This setting is similar to state-only imitation learning or Learning-from-Observation (LfO) (Torabi et al., 2019; Radosavovic et al., 2021), where expert actions are unavailable; we therefore term it meta Learning-from-Observation (meta-LfO). As a result, we aim to develop a more general method that addresses both the meta-IL and meta-LfO settings.
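The adaptation scheme described above can be illustrated with a minimal numpy sketch: a hyper-network maps an encoding of a demonstration to the initial weights of a small bottleneck adapter attached to a frozen base layer, and only the adapter is then fine-tuned. All dimensions, names, and the single-layer "base model" here are illustrative stand-ins, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (not taken from the paper).
DEMO_DIM, HID_DIM, ADAPTER_DIM = 16, 32, 4  # adapter is deliberately tiny

# Frozen "base model": one hidden layer standing in for the pre-trained DT.
W_base = rng.normal(scale=0.1, size=(HID_DIM, HID_DIM))

# Hyper-network: a linear map from a demonstration encoding to the
# flattened parameters of a bottleneck adapter (down- and up-projection).
n_adapter_params = HID_DIM * ADAPTER_DIM * 2
W_hyper = rng.normal(scale=0.01, size=(n_adapter_params, DEMO_DIM))

def init_adapter(demo_encoding):
    """Hyper-network initializes the adapter from one demonstration."""
    flat = W_hyper @ demo_encoding
    down = flat[: HID_DIM * ADAPTER_DIM].reshape(HID_DIM, ADAPTER_DIM)
    up = flat[HID_DIM * ADAPTER_DIM :].reshape(ADAPTER_DIM, HID_DIM)
    return down, up

def forward(h, down, up):
    """Frozen base layer plus a residual bottleneck adapter on top."""
    base_out = np.tanh(h @ W_base)
    return base_out + (base_out @ down) @ up  # adapter is a residual path

def mse(h, target, down, up):
    return float(np.mean((forward(h, down, up) - target) ** 2))

# A demonstration encoding from an unseen task (random stand-in).
demo = rng.normal(size=DEMO_DIM)
down, up = init_adapter(demo)

# "Fine-tuning": gradient steps on the adapter only; W_base stays frozen.
h = rng.normal(size=HID_DIM)
target = rng.normal(size=HID_DIM)
loss_before = mse(h, target, down, up)
lr = 1e-2
for _ in range(200):
    base_out = np.tanh(h @ W_base)
    err = forward(h, down, up) - target
    # Gradients with respect to the adapter parameters only.
    grad_up = np.outer(base_out @ down, err)
    grad_down = np.outer(base_out, err @ up.T)
    up -= lr * grad_up
    down -= lr * grad_down
loss_after = mse(h, target, down, up)

print(f"adapter params: {n_adapter_params} of "
      f"{W_base.size + n_adapter_params} total")
print(f"loss: {loss_before:.4f} -> {loss_after:.4f}")
```

The key property the sketch demonstrates is that the demonstration determines the adapter's starting point (via the hyper-network) and that the subsequent updates touch only the small adapter, leaving the base weights untouched.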



¹Project Page: https://sites.google.com/view/hdtforiclr2023/home

