TOWARDS CONTROLLABLE POLICY THROUGH GOAL-MASKED TRANSFORMERS

Abstract

Offline goal-conditioned supervised learning (GCSL) can learn to achieve various goals from purely offline datasets without reward information, enhancing control over the policy. We argue, however, that learning a composite policy that can switch seamlessly among different goals is an essential step toward a controllable policy. This ability should be learnable if the dataset contains enough data about such switches. Unfortunately, most existing datasets partially or entirely lack such switching demonstrations. Current GCSL approaches that use hindsight information concentrate primarily on reachability at the state or return level, and they may not behave as expected when the goal is changed within an episode. To this end, we present Goal-Masked Transformers (GMT), an efficient GCSL algorithm based on transformers with goal masking. GMT makes use of trajectory-level hindsight information, which is gathered automatically and can be adapted to various statistics of interest. Owing to the autoregressive nature of GMT, the goal can be changed, and the policy thus controlled, at any time. We empirically evaluate GMT against baselines on MuJoCo continuous control benchmarks and Atari discrete control games with image states. We show that GMT can infer the missing switching processes from the given dataset and thus switch smoothly among different goals. As a result, GMT demonstrates its ability to control the policy and succeeds on all tasks with low variance, whereas existing GCSL methods can hardly succeed at goal-switching.¹

1. INTRODUCTION

Runners can control and adjust their pace in a marathon by switching comfortably between various poses for different goals. Similarly, agents can acquire such switching ability through reinforcement learning (RL) or imitation learning (IL). This process generally requires environments that can start from arbitrary pose states, carefully tuned rewards, or massive offline demonstrations, all of which are notoriously difficult to obtain. In comparison, by knowing the pace of each running stance, a human can easily switch between poses to control speed without intentionally learning such switching processes. From another perspective, we formulate this problem as a goal-conditioned supervised learning (GCSL) (Ghosh et al., 2019) problem given a fixed offline dataset: considering pose or pace as a goal, can agents learn a composite policy that can switch among these goals interchangeably over the dataset? We refer to this as the goal-switching problem. Since the distribution of initial states shifts between training and evaluation, this problem can face a covariate shift issue: instead of a fixed set of states in the training set, any state might be the start of the switched goal during evaluation, leaving agents without knowledge of how to achieve the goal. Goal-switching has widespread adoption in practical applications. In robotics, controlling a robot to transition to a different skill while performing another is essential. In games, it can induce immersive experiences by managing the performance and strategies of AI bots according to the game's progress. Recent works (Ghosh et al., 2019; Ding et al., 2019; Furuta et al., 2021; Eysenbach et al., 2022; Reed et al., 2022) on GCSL mainly focus on learning how to achieve arbitrary goals. They tend to use either states (Emmons et al., 2021) or return-to-go (Chen et al., 2021) as goals.

Models that use states as goals can be called state-conditioned models. These models generate actions by targeting future states or slices of successful demonstrations. There are also models that condition on return-to-go, the total reward an agent can receive from the current step until the end of an episode. These methods typically use a "relabeling" strategy to excavate a variety of goal signals by bootstrapping any of the aforementioned goals from a fixed dataset. However, each of these methods has flaws that lead to poor performance on the goal-switching problem. With return-to-go as the goal, switching can only happen by tweaking a normalized return-to-go, which is neither intuitive nor efficient. State-conditioned methods can in theory reach an arbitrary state; however, they tend to model the problem as a pure MDP, which leads to poor performance or failure when demonstrations of transiting between two states are missing from the dataset. We argue that the essence of the goal-switching problem is to enable the model to learn unseen transitions. Model-based approaches may be one remedy, allowing for unseen transitions by planning across a latent space (Jiang et al., 2022) or an explicitly learned world model (Micheli et al., 2022). These methods, however, require additional environment model learning with higher learning complexity.

This paper presents Goal-Masked Transformers (GMT), an efficient GCSL algorithm that requires neither explicit world model learning nor successful demonstrations to solve the goal-switching problem. In particular, we employ a causal transformer (Radford et al., 2019) to autoregressively describe trajectories with transitions consisting of goals, states, and actions. With such a setting, the goal can be changed at any moment, releasing the full potential of policy control.
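To make this trajectory representation concrete, here is a minimal Python sketch under our own simplifying assumptions: tokens are plain Python objects standing in for learned embeddings, and the helper name `build_sequence` is ours, not from the paper.

```python
def build_sequence(goals, states, actions):
    """Interleave per-step (goal, state, action) triples into a single
    token stream (g_0, s_0, a_0, g_1, s_1, a_1, ...).

    A causal transformer trained on this stream predicts each action a_t
    from all tokens up to and including s_t, so swapping in a new goal
    token mid-episode immediately conditions subsequent actions on it.
    """
    assert len(goals) == len(states) == len(actions)
    seq = []
    for g, s, a in zip(goals, states, actions):
        seq.extend([g, s, a])
    return seq


# Example: a 2-step trajectory.
tokens = build_sequence(["g0", "g1"], ["s0", "s1"], ["a0", "a1"])
print(tokens)  # ['g0', 's0', 'a0', 'g1', 's1', 'a1']
```

Because each goal token precedes its state and action within the step, changing the goal at step t affects every token the model generates from that point on, which is what enables mid-episode switching.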
However, since the goal is the same within each trajectory during training, the model tends to neglect changes of goal, leading to goal-switching failures during evaluation. In order to compel the model to pay more attention to goals, we introduce a masking mechanism that replaces the goal information with a [mask] token with some probability. As a result, we observe the promising outcome that agents can smoothly switch from one goal to another. Figure 1 presents an overview of GMT. Similar to previous GCSL works, we apply the "relabeling" strategy to increase the diversity and coverage of goals. Additionally, since there are various expressions of goals and numerous approaches to achieving them, we propose a simple yet effective approach that automatically aggregates and clusters the offline data into several goals according to the statistics of interest. In summary, our main contributions are as follows:

• We draw attention to the goal-switching problem and argue that solving it is an essential step toward a controllable policy. The goal-switching problem requires that agents adapt to goal changes within one episode. It is an exceedingly challenging generalization problem, especially when training on limited datasets.

• We propose Goal-Masked Transformers (GMT), a family of goal-conditioned algorithms based on causal transformers with a goal-masking mechanism and hindsight information to achieve a controllable policy. Through experiments, we demonstrate that GMT possesses goal-switching capabilities that are absent from current GCSL algorithms.

• We introduce an unsupervised approach to cluster trajectories into multiple goals from datasets without any goal information. Empirically, we find that this approach improves the stability and efficiency of the switching process.
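The goal-masking mechanism described above can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the paper's implementation: a string "[mask]" sentinel stands in for a learned mask embedding, goal tokens are assumed to sit at every third position of an interleaved (goal, state, action) stream, and the function name `mask_goals` is ours.

```python
import random

MASK = "[mask]"  # stand-in for a learned mask embedding


def mask_goals(sequence, p_mask, rng=None):
    """Replace each goal token with MASK with probability p_mask.

    Assumes the interleaved layout (g_0, s_0, a_0, g_1, s_1, a_1, ...),
    so goal tokens sit at indices 0, 3, 6, ... Randomly hiding goals
    during training forces the model to rely on the goal token when it
    is present, instead of ignoring it because it rarely changes.
    """
    rng = rng or random.Random(0)
    out = list(sequence)
    for i in range(0, len(out), 3):
        if rng.random() < p_mask:
            out[i] = MASK
    return out


seq = ["g0", "s0", "a0", "g1", "s1", "a1"]
print(mask_goals(seq, p_mask=1.0))  # ['[mask]', 's0', 'a0', '[mask]', 's1', 'a1']
```

At training time, a moderate `p_mask` makes the unmasked goal tokens informative rather than redundant; at evaluation time no masking is applied, so the model attends to whatever goal is currently supplied.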



¹ The code will be available as soon as possible.



Figure 1: Overview of GMT architecture.

