TOWARDS CONTROLLABLE POLICY THROUGH GOAL-MASKED TRANSFORMERS

Abstract

Offline goal-conditioned supervised learning (GCSL) can learn to achieve various goals from purely offline datasets without reward information, enhancing control over the policy. However, we argue that learning a composite policy that can switch seamlessly among different goals is essential for obtaining a controllable policy. This ability should be learnable if the dataset contains enough demonstrations of such switches; unfortunately, most existing datasets lack them either partially or entirely. Current GCSL approaches that use hindsight information concentrate primarily on reachability at the state or return level and might not behave as expected when the goal is changed within an episode. To this end, we present Goal-Masked Transformers (GMT), an efficient GCSL algorithm based on transformers with goal masking. GMT makes use of trajectory-level hindsight information, which is gathered automatically and can be adapted to various statistics of interest. Owing to the autoregressive nature of GMT, the goal can be changed, and the policy thereby controlled, at any time. We empirically evaluate GMT against baselines on MuJoCo continuous control benchmarks and Atari discrete control games with image states. We show that GMT can infer the missing switching processes from the given dataset and thus switch smoothly among different goals. As a result, GMT demonstrates its ability to control the policy and succeeds on all tasks with low variance, whereas existing GCSL methods can hardly succeed at goal-switching.1

1. INTRODUCTION

Runners can control and adjust their pace in a marathon by switching comfortably among poses suited to different goals. Agents can acquire such a switching ability through reinforcement learning (RL) or imitation learning (IL), but this generally requires environments that can start from arbitrary pose states, carefully tuned rewards, or massive offline demonstrations, all of which are notoriously difficult to obtain. In contrast, a human who knows the pace of each running stance can easily switch among poses to control speed without deliberately learning the switching process itself.

From another perspective, we formulate this problem as a goal-conditioned supervised learning (GCSL) (Ghosh et al., 2019) problem over a fixed offline dataset: treating pose or pace as a goal, can agents learn a composite policy that switches among these goals interchangeably? We refer to this as the goal-switching problem. Since the distribution of initial states shifts between training and evaluation, the problem can suffer from covariate shift: instead of a fixed set of start states as in the training set, any state might be the starting point of the switched goal during evaluation, leaving the agent unsure how to achieve it.

Goal-switching has widespread practical applications. In robotics, it is essential for a robot to transition to a different skill while performing another. In games, it can create immersive experiences by managing the performance and strategies of AI bots according to the game's progress.

Recent works on GCSL (Ghosh et al., 2019; Ding et al., 2019; Furuta et al., 2021; Eysenbach et al., 2022; Reed et al., 2022) mainly focus on learning how to achieve arbitrary goals. They tend to use either states (Emmons et al., 2021) or return-to-go (Chen et al., 2021) as goals. Models that use states
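To make the GCSL formulation concrete, the following is a minimal sketch of hindsight relabeling, the standard GCSL trick of turning an unlabeled offline trajectory into (state, goal, action) supervision by treating later states in the same trajectory as goals. This is an illustrative toy, not the paper's method: the function name `hindsight_relabel` and the 1-D toy trajectory are our own assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def hindsight_relabel(states, actions):
    """For each step t, relabel with a goal state reached strictly later
    in the same trajectory, yielding supervised (s_t, g, a_t) tuples on
    which a goal-conditioned policy pi(a | s, g) can be trained."""
    examples = []
    T = len(actions)
    for t in range(T):
        k = rng.integers(t + 1, T + 1)  # sample a future index t < k <= T
        examples.append((states[t], states[k], actions[t]))
    return examples

# Toy trajectory: 1-D states that increase over time, unit actions.
states = [np.array([float(i)]) for i in range(6)]   # s_0 .. s_5
actions = [np.array([1.0]) for _ in range(5)]       # a_0 .. a_4
data = hindsight_relabel(states, actions)

# Every relabeled goal comes from strictly later in the trajectory,
# so each example says "from s_t, action a_t made progress toward g".
assert len(data) == 5
assert all(g[0] > s[0] for s, g, a in data)
```

Note that this relabeling only teaches reachability toward goals observed downstream in a single trajectory; it contains no examples of abandoning one goal mid-episode for another, which is exactly the switching data the introduction argues is missing from most datasets.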



1 The code will be available as soon as possible.

