SKILL DECISION TRANSFORMER

Abstract

Recent work has shown that Large Language Models (LLMs) can be incredibly effective for offline reinforcement learning (RL) by representing the traditional RL problem as a sequence modelling problem (Chen et al., 2021; Janner et al., 2021). However, many of these methods only optimize for high returns and may not extract much information from a diverse dataset of trajectories. Generalized Decision Transformers (GDTs) (Furuta et al., 2021) have shown that utilizing future trajectory information, in the form of information statistics, can help extract more information from offline trajectory data. Building upon this, we propose the Skill Decision Transformer (Skill DT). Skill DT draws inspiration from hindsight relabelling (Andrychowicz et al., 2017) and skill discovery methods to discover a diverse set of primitive behaviors, or skills. We show that Skill DT can not only perform offline state-marginal matching (SMM), but can also discover descriptive behaviors that can be easily sampled. Furthermore, we show that through purely reward-free optimization, Skill DT is still competitive with supervised offline RL approaches on the D4RL benchmark.

1. INTRODUCTION

Reinforcement Learning (RL) has been incredibly effective in a variety of online scenarios such as games and continuous control environments (Li, 2017). However, these methods generally suffer from sample inefficiency, often requiring millions of interactions with an environment. In addition, efficient exploration is needed to avoid local minima (Pathak et al., 2017; Campos et al., 2020). Because of these limitations, there is interest in methods that can learn diverse and useful primitives without supervision, enabling better exploration and re-usability of learned skills (Eysenbach et al., 2018; Strouse et al., 2021; Campos et al., 2020). However, these online skill discovery methods still require interactions with an environment, where access may be limited. This requirement has sparked interest in offline RL, where a dataset of trajectories is provided. Some of these datasets (Fu et al., 2020) are composed of large and diverse trajectories of varying performance, making it non-trivial to make proper use of them; simply applying behavioral cloning (BC) leads to sub-optimal performance. Recently, approaches such as the Decision Transformer (DT) (Chen et al., 2021) and the Trajectory Transformer (TT) (Janner et al., 2021) utilize Transformer architectures (Vaswani et al., 2017) to achieve high performance on offline RL benchmarks. Furuta et al. (2021) showed that these methods are effectively doing hindsight information matching (HIM), where the policies are trained to estimate a trajectory that matches given target statistics of future information. That work also generalizes DT into an information-statistic conditioned policy, the Generalized Decision Transformer (GDT). This results in policies with different capabilities, such as supervised learning and State Marginal Matching (SMM) (Lee et al., 2019), simply by varying the information statistics.
In the work presented here, we take inspiration from the previously mentioned skill discovery methods and introduce the Skill Decision Transformer (Skill DT), a special case of GDT in which action predictions are conditioned on skill embeddings and on distributions of future skills. We show that Skill DT is not only able to discover a number of discrete behaviors, but is also able to effectively match target trajectory distributions. Furthermore, we empirically show that through pure unsupervised skill discovery, Skill DT can perform SMM on target trajectories and can match or exceed the performance of other state-of-the-art offline RL approaches on D4RL benchmarks (Fu et al., 2020). Skill DT also has the added benefit of using discrete skills, which make it easy to sample diverse behaviors.

2.1. SKILL DISCOVERY

Many skill discovery methods attempt to learn a latent skill conditioned policy π(a|s, z), where state s ∼ p(s) and skill z ∼ Z, that maximizes mutual information between S and Z (Gregor et al., 2016; Sharma et al., 2019; Eysenbach et al., 2018). Another way of learning meaningful skills is through variational inference, where z is learned via a reconstruction loss (Campos et al., 2020). Explore, Discover and Learn (EDL) (Campos et al., 2020) is an approach which discovers a discrete set of skills by encoding states via a VQ-VAE, p(z|s), and reconstructing them, p(s|z). We use a similar approach, but instead of reconstructing states, we utilize offline trajectories and optimize action reconstruction directly, p(a|s, z). Since our policy is autoregressive, our skill encoding takes temporal information into account, leading to more descriptive skill embeddings. Offline Primitive Discovery for Accelerating Offline Reinforcement Learning (OPAL) (Ajay et al., 2020) also discovers offline skills temporally, but instead uses a continuous distribution of skills. These continuous skills are then sampled by a hierarchical policy that is optimized by task rewards. Because our approach is completely unsupervised, we wish to be able to sample skills easily. To simplify this, we opt to use a discrete distribution of skills. This makes it trivial to query the highest performing behaviors, accomplished by simply iterating through the discrete skills.

2.2. STATE MARGINAL MATCHING

State marginal matching (SMM) (Lee et al., 2019) involves finding policies that minimize the distance between the marginal state distribution that the policy represents, p_π(s), and a target distribution p*(s). These objectives have an advantage over traditional RL objectives in that they do not require any rewards and are guided towards exploration (Campos et al., 2020). CDT has shown impressive SMM capabilities by conditioning actions on binned target state distributions in order to match them. However, using CDT in a real environment is difficult because target distributions must be provided, while Skill DT learns discrete skills that can be sampled easily. Also, CDT requires a low-dimensional state space, while Skill DT in theory can work on any type of input, as long as it can be encoded effectively into a vector.
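To make the binned target state distributions concrete, here is a minimal numpy sketch of how a CDT-style per-dimension state histogram might be computed. The function name, bin count, and fixed state bounds are illustrative assumptions, not the implementation used in the papers above.

```python
import numpy as np

def binned_state_distribution(states, n_bins=32, low=-1.0, high=1.0):
    """CDT-style target: per-dimension histogram of future states.

    states: (T, D) array of future states s_t..s_T; returns a
    (D, n_bins) array of normalized per-dimension bin counts.
    The state bounds (low, high) are assumed known here.
    """
    states = np.clip(states, low, high - 1e-8)
    bins = ((states - low) / (high - low) * n_bins).astype(int)  # (T, D) bin ids
    T, D = states.shape
    hist = np.zeros((D, n_bins))
    for d in range(D):
        np.add.at(hist[d], bins[:, d], 1.0)  # count bin occupancy per dimension
    return hist / T

rng = np.random.default_rng(0)
dist = binned_state_distribution(rng.uniform(-1, 1, size=(100, 3)))
```

Each row of `dist` sums to one, so the policy can be conditioned on a proper distribution per state dimension.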

3. PRELIMINARIES

In this work, we consider learning in environments modelled as Markov decision processes (MDPs), which can be described by the tuple (S, A, P, R), where S is the state space, A is the action space, P(s_{t+1}|s_t, a_t) is the state transition dynamics of the environment, and R(s_t, a_t) is the reward function.

3.1. GENERALIZED DECISION TRANSFORMER

The Decision Transformer (DT) (Chen et al., 2021) represents RL as a sequence modelling problem and uses a GPT architecture (Radford et al., 2018) to predict actions autoregressively. Specifically, DT takes in a sequence of returns-to-go (RTGs), states, and actions, where R_t = Σ_{t'=t}^{T} r_{t'} and trajectory τ = (R_0, s_0, a_0, ..., R_{|τ|}, s_{|τ|}, a_{|τ|}). DT uses the K previous tokens to predict a_t with a deterministic policy, optimized by a mean squared error loss between target and predicted actions. For evaluation, a target return R_target is provided and DT attempts to achieve the targeted return in the actual environment. Furuta et al. (2021) introduced a generalized version of DT, the Generalized Decision Transformer (GDT). GDT provides a simple interface for representing a variety of different objectives, configurable by different information statistics (for consistency, we represent variations of GDT with π^gdt). Given τ_t = (s_t, a_t, r_t, ..., s_T, a_T, r_T) and an information-statistics function I^φ:

Generalized Decision Transformer (GDT): π^gdt(a_t | I^φ(τ_0), s_0, a_0, ..., I^φ(τ_t), s_t)

Decision Transformer (DT): π^gdt_dt(a_t | I^φ_dt(τ_0), s_0, a_0, ..., I^φ_dt(τ_t), s_t), where I^φ_dt(τ_t) = Σ_{t'=t}^{T} γ^{t'} r_{t'} and γ is a discount factor

Categorical Decision Transformer (CDT): π^gdt_cdt(a_t | I^φ_cdt(τ_0), s_0, a_0, ..., I^φ_cdt(τ_t), s_t), where I^φ_cdt(τ_t) = histogram(s_t, ..., s_T)

CDT is the most similar to Skill DT: CDT captures future trajectory information using future state distributions, represented as histograms for each state dimension, essentially binning states and counting the bin ids per dimension. Skill DT instead utilizes learned skill embeddings to generate future skill distributions, represented as histograms over full embeddings. In addition, Skill DT makes use of the representation learnt by the skill embedding, using it in tandem with the skill distributions.
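As a concrete illustration of the DT information statistic, here is a short sketch of computing (discounted) returns-to-go over a reward sequence; this is a generic implementation, not the authors' code.

```python
import numpy as np

def returns_to_go(rewards, gamma=1.0):
    """Compute R_t = sum_{t'=t}^{T} gamma^(t'-t) * r_{t'} for each timestep."""
    rtg = np.zeros(len(rewards), dtype=float)
    running = 0.0
    for t in reversed(range(len(rewards))):  # accumulate from the end of the episode
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

print(returns_to_go([1.0, 2.0, 3.0]))  # [6. 5. 3.]
```

With `gamma=1.0` this yields the undiscounted RTG that DT conditions on; setting `gamma < 1` gives the discounted variant used in the I^φ_dt statistic above.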

4.1. FORMULATION

Our Skill DT architecture is very similar to the original Decision Transformer presented in Chen et al. (2021). While the classic DT uses summed future returns to condition trajectories, we instead make use of learned skill embeddings and future skill distributions, represented as a histogram of skill embedding indices, similar to the way the Categorical Decision Transformer (CDT) (Furuta et al., 2021) utilizes future state counts. One notable difference between Skill DT and the original Decision Transformer (Chen et al., 2021) and the GDT (Furuta et al., 2021) variants is that we omit actions from the input sequence. This is because we are interested in SMM through skills, where we want to extract as much information as possible from states. Formally, Skill DT represents a policy π_θ(a_t | Z_{t-K}, z_{t-K}, s_{t-K}, ..., Z_t, z_t, s_t), where K is the context length and θ are the learnable parameters of the model. States are encoded as skill embeddings ẑ_t, which are then quantized using a learned codebook of embeddings: z_t = argmin_n ||ẑ_t − z_n||²₂. The future skill distributions are represented as the normalized histogram of summed future one-hot encoded skill indices: Z_t ∝ Σ_{t'=t}^{T} one_hot(z_{t'}). Connecting this to GDT, our policy can be viewed as π^gdt_skill(a_t | I^φ_skill(τ_0), s_0, ..., I^φ_skill(τ_t), s_t), where I^φ_skill(τ_t) = (histogram(z_t, ..., z_T), z_t).
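The two quantities above, nearest-codebook quantization and the normalized future skill histogram Z_t, can be sketched in a few lines of numpy; the function names are illustrative.

```python
import numpy as np

def quantize(z_hat, codebook):
    """Nearest-codebook lookup: z_t = argmin_n ||z_hat_t - z_n||^2."""
    d = ((z_hat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (T, N) distances
    return d.argmin(axis=1)  # discrete skill index per timestep

def future_skill_distributions(indices, n_skills):
    """Z_t: normalized histogram of one-hot skill indices over t..T."""
    T = len(indices)
    Z = np.zeros((T, n_skills))
    counts = np.zeros(n_skills)
    for t in reversed(range(T)):  # accumulate future counts right-to-left
        counts[indices[t]] += 1.0
        Z[t] = counts / counts.sum()
    return Z

codebook = np.array([[0.0, 0.0], [1.0, 1.0]])
idx = quantize(np.array([[0.1, 0.0], [0.9, 1.2]]), codebook)
Z = future_skill_distributions(idx, n_skills=2)
```

In this toy example the first state quantizes to skill 0 and the second to skill 1, so Z_0 is the uniform distribution [0.5, 0.5] and Z_1 puts all mass on skill 1.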

4.1.1. HINDSIGHT SKILL RE-LABELLING

Hindsight experience replay (HER) is a method that has been effective in improving sample efficiency of goal-oriented agents (Andrychowicz et al., 2017; Rauber et al., 2017). The core concept revolves around goal relabelling, where trajectory goals are replaced by achieved goals rather than intended goals. This concept of re-labelling information has been utilized in a number of works (Ghosh et al., 2019; Zheng et al., 2022; Faccio et al., 2022) to iteratively learn and condition predictions on target statistics. The Bi-Directional Decision Transformer (BDT) (Furuta et al., 2021) utilizes an anti-causal transformer to encode trajectory information and passes it into a causal transformer action predictor, re-labelling trajectory information with the anti-causal transformer at every training iteration. Similarly, Skill DT re-labels future skill distributions at every training iteration. Because the skill encoder is updated continually and skill representations change during training, this re-labelling is required to ensure stability in action predictions.

VQ-VAE. Vector-quantized variational autoencoders (VQ-VAEs) have been effective in domains such as image generation (Razavi et al., 2019; Esser et al., 2020), planning (Ozair et al., 2021), and online skill discovery (Campos et al., 2020). Because of this, we use a VQ-VAE to quantize encoded states into a set of continuous skill embeddings. We encode states into vectors ẑ and quantize them to the nearest skill embeddings z. To ensure stability, we minimize the loss:

VQ_LOSS(z, ẑ) = MSE(z, ẑ)    (1)

where ẑ is the output of the MLP encoder and z is the nearest embedding in the VQ-VAE codebook. Optimizing this loss minimizes the distance between our skill encodings and their corresponding nearest VQ-VAE embeddings. This is analogous to clustering, where we minimize the distance between datapoints and their cluster centers. In practice, we optimize this loss using an exponential moving average, as detailed in Lai et al. (2022).

Causal Transformer. The Causal Transformer portion of Skill DT shares a similar architecture to that of the original DT (Chen et al., 2021), utilizing a GPT (Radford et al., 2018) model. It takes as input the last K states s_{t-K:t}, skill encodings z_{t-K:t}, and future skill embedding distributions Z_{t-K:t}. As mentioned above, the future skill embedding distributions are calculated by generating a histogram of skill indices from timestep t to T and normalizing it so that it sums to 1.
For states and skill embedding distributions, we use learned linear layers to create token embeddings. To capture temporal information, we also learn a timestep embedding that is added to each token. Note that we do not tokenize the skill embeddings, to ensure that no important skill embedding information is lost. Even though we do not add timestep embeddings to the skill embeddings, they still capture temporal behavior, because the attention mechanism (Vaswani et al., 2017) of the causal transformer attends them to temporally conditioned states and skill embedding distributions. The VQ-VAE and Causal Transformer components are shown visually in Fig. 1.
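The exponential-moving-average codebook optimization mentioned above can be sketched as follows. This is a generic EMA vector-quantization update in the style popularized by VQ-VAE implementations, with assumed hyperparameters; it is not the paper's exact implementation.

```python
import numpy as np

class EMACodebook:
    """Minimal EMA-updated VQ codebook (a sketch; hyperparameters assumed)."""

    def __init__(self, n_codes, dim, decay=0.99, eps=1e-5, seed=0):
        rng = np.random.default_rng(seed)
        self.codebook = rng.normal(size=(n_codes, dim))
        self.ema_count = np.ones(n_codes)    # EMA of per-code usage counts
        self.ema_sum = self.codebook.copy()  # EMA of assigned encoder outputs
        self.decay, self.eps = decay, eps

    def quantize(self, z_hat):
        # nearest-neighbour lookup: argmin_n ||z_hat - e_n||^2
        d = ((z_hat[:, None] - self.codebook[None]) ** 2).sum(-1)
        return d.argmin(axis=1)

    def update(self, z_hat, indices):
        onehot = np.eye(len(self.codebook))[indices]  # (B, N) assignments
        self.ema_count = self.decay * self.ema_count + (1 - self.decay) * onehot.sum(0)
        self.ema_sum = self.decay * self.ema_sum + (1 - self.decay) * (onehot.T @ z_hat)
        n = self.ema_count.sum()
        # Laplace smoothing avoids division by zero for unused codes
        count = (self.ema_count + self.eps) / (n + len(self.codebook) * self.eps) * n
        self.codebook = self.ema_sum / count[:, None]

cb = EMACodebook(4, 2)
z = np.random.default_rng(1).normal(size=(16, 2))
idx = cb.quantize(z)
cb.update(z, idx)
```

Each update pulls codebook vectors toward the running mean of the encoder outputs assigned to them, which mirrors the clustering intuition given for Eq. 1.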

4.3. TRAINING PROCEDURE

Training Skill DT is very similar to how other variants of GDT are trained (CDT, BDT, DT, etc.). First, before every training iteration, we re-label the skill distributions of every trajectory using our VQ-VAE encoder. Afterwards, we sample minibatches of sequence length K, with timesteps sampled uniformly: at every training iteration, we sample τ = (s_t, ..., s_{t+K}, a_t, ..., a_{t+K}).
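The training loop described above (re-label, sample a length-K sub-trajectory, regress actions with MSE) can be sketched as follows; `encoder`, `quantizer`, and `policy` are hypothetical stand-in callables, not the paper's API.

```python
import numpy as np

def training_iteration(trajectories, encoder, quantizer, policy, K, rng):
    """One Skill DT training step (a sketch with stand-in components)."""
    # 1) hindsight re-labelling: recompute each trajectory's skill indices
    for traj in trajectories:
        traj["skills"] = quantizer(encoder(traj["states"]))
    # 2) sample one sub-trajectory of length K, timestep chosen uniformly
    traj = trajectories[rng.integers(len(trajectories))]
    t = rng.integers(0, len(traj["states"]) - K + 1)
    s = traj["states"][t:t + K]
    a = traj["actions"][t:t + K]
    # 3) predict actions from states + skills and take an MSE loss
    a_hat = policy(s, traj["skills"][t:t + K])
    return float(np.mean((a_hat - a) ** 2))

# toy stand-ins: identity encoder, constant quantizer and policy
rng = np.random.default_rng(0)
trajs = [{"states": np.zeros((10, 3)), "actions": np.zeros((10, 2))}]
loss = training_iteration(
    trajs,
    encoder=lambda s: s,
    quantizer=lambda z: np.zeros(len(z), dtype=int),
    policy=lambda s, z: np.zeros((len(s), 2)),
    K=4, rng=rng)
```

In the real model, step 3 would also condition on the future skill distributions Z, and the MSE gradient would update both the VQ-VAE and the causal transformer, as shown in Fig. 2.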

5.1. TASKS AND DATASETS

For evaluating the performance of Skill DT, we use tasks and datasets from the D4RL benchmark (Fu et al., 2020). D4RL has been used as a standard for evaluating many offline RL methods (Kumar et al., 2020; Chen et al., 2021; Kostrikov et al., 2021; Zheng et al., 2022). We evaluate our method on MuJoCo Gym continuous control tasks, as well as two AntMaze tasks. Images of some of these environments can be seen in A.4.

5.2. EVALUATING SUPERVISED RETURN

Can Skill DT achieve near or competitive performance, using only trajectory information, compared to supervised offline RL approaches?

Table 1: Average normalized returns on Gym and AntMaze tasks. We take some results as reported in other works (Chen et al., 2021; Kumar et al., 2020; Kostrikov et al., 2021; Zheng et al., 2022), and calculate Skill DT's returns as an average over 4 seeds (for Gym) and 15 seeds (for AntMaze).


Skill DT outperforms the baselines on most tasks, but fails to beat them on the -replay tasks and antmaze-medium. However, Skill DT can consistently solve the antmaze-umaze tasks. Other offline skill discovery algorithms optimize hierarchical policies via supervised RL, utilizing the learned primitives to maximize rewards of downstream tasks (Ajay et al., 2020). Because we are interested in evaluating Skill DT without rewards, we instead rely on learning enough skills that high-performing trajectories are represented. To evaluate this in practice, we run rollouts for each unique skill and take the maximum reward achieved. Detailed Python pseudocode for this is provided in A.5. For a close skill-based comparison to Skill DT, we use a K-Means augmented Decision Transformer (K-Means DT). K-Means DT differs from Skill DT in that, instead of learning skill embeddings, we cluster states via K-Means and utilize the cluster centers as skill embeddings. Surprisingly, through pure unsupervised skill discovery, we achieve competitive results on MuJoCo continuous control environments compared to state-of-the-art offline RL algorithms (Kumar et al., 2020; Kostrikov et al., 2021; Chen et al., 2021). As shown in Table 1, Skill DT outperforms the other baselines on most of the tasks and DT on all of them, but performs worse on the antmaze-medium and -replay tasks. We hypothesize that Skill DT performs worse there because these tasks contain multimodal and diverse behaviors; with additional return context or online play, Skill DT may perform better in these environments, which we hope to explore in future work. Skill DT, like the original Decision Transformer (Chen et al., 2021), also struggles on harder exploration problems such as the antmaze-medium environments.
Methods that perform well on these tasks usually utilize dynamic programming, like the Trajectory Transformer (Janner et al., 2021), or hierarchical reinforcement learning, like OPAL (Ajay et al., 2020). Even though Skill DT performs marginally better than DT, there is still a lot of room for improvement in future work. Because Skill DT is a completely unsupervised algorithm, evaluating supervised return requires evaluating every learnt skill and taking the one that achieves the maximum reward. This means we rely entirely on Skill DT's ability to capture behaviors from high-performing trajectories in the offline dataset. We found that increasing the number of skills has less of an effect on performance in environments that have a large number of successful trajectories (the -medium environments). We hypothesize that these datasets contain unimodal behaviors, so Skill DT does not need many skills to capture descriptive information from the dataset. However, for multimodal datasets (such as the -replay environments), Skill DT's performance improves with an increasing number of skills. In general, using a larger number of skills can help performance, but the tradeoff is increased computation time, because each skill needs to be evaluated in the environment. These results are reported in Table 2. Images of learnt skills can be seen in A.4.
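The reward-free evaluation described above, rolling out each discrete skill and keeping the best return, can be sketched as follows. Here `env_rollout` and `skill_ids` are hypothetical stand-ins for conditioning Skill DT on one skill and running it in the environment; the paper's actual pseudocode is in A.5.

```python
import numpy as np

def best_skill_return(env_rollout, skill_ids, n_episodes=3):
    """Roll out every discrete skill and keep the best average return.

    env_rollout(skill_id) is a stand-in callable returning the episode
    return obtained when the policy is conditioned on that skill.
    """
    best_ret, best_skill = -np.inf, None
    for z in skill_ids:
        avg = float(np.mean([env_rollout(z) for _ in range(n_episodes)]))
        if avg > best_ret:
            best_ret, best_skill = avg, z
    return best_ret, best_skill

# toy stand-in environment where skill 2 happens to give the highest reward
ret, skill = best_skill_return(lambda z: float(z == 2), range(4))
```

The cost of this procedure grows linearly with the number of skills, which is the computation-time tradeoff noted above.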

6.2. SMM WITH LEARNED SKILLS

How well can Skill DT reconstruct target trajectories and perform SMM in a zero-shot manner? Ideally, if an algorithm is effective at SMM, it should be able to reconstruct a target trajectory in an actual environment: given a target trajectory, the algorithm should be able to roll out a similar one. The original DT can perform SMM well on offline trajectories. However, when attempting this in an actual environment, it is unable to reconstruct a target trajectory, because it cannot be conditioned on accurate future state trajectory information. Skill DT, similar to CDT, is able to perform SMM in an actual environment because it encodes future state information into skill embedding histograms. The practical process is fairly simple and detailed in Algorithm 2. In addition to state trajectories, the learned skill distributions of the reconstructed trajectory and the target trajectory should also be close. We investigate this by looking at target trajectories from antmaze-umaze-v2 and antmaze-umaze-diverse-v2. For a more challenging example, we handpicked a trajectory from antmaze-umaze-diverse that is unique in that it contains a loop. Even though the trajectory is unique, Skill DT is still able to roughly recreate it in a zero-shot manner (Fig. 3), with rollouts also including a loop. Additional results can be found in A.3. In order to evaluate Skill DT as a skill discovery method, we must show that behaviors are not only diverse but also descriptive, or, more intuitively, distinguishable. We visualize the diversity of learned behaviors by plotting each trajectory generated by a skill on both the antmaze-umaze and ant environments, shown below. To visualize Skill DT's ability to describe states, we show the projected skill embeddings and quantized skill embedding clusters (Fig. 5).
For a diversity metric, we utilize the Wasserstein distance between skill distributions (normalized to [0, 1]), similar to the method proposed in Furuta et al. (2021). We report this metric in Table 3.

Table 3: Wasserstein distance metric (computed between each skill and all others). In tasks with unimodal behaviors (-medium), Skill DT discovers skills whose trajectories are more similar to each other than in more complex tasks (-replay and antmaze).

Our approach is powerful because it is unsupervised, but it is also limited because of it. Since we do not have access to rewards, we rely on pure offline diversity to ensure that high-performing trajectories are learned and encoded into skills that can be sampled. However, this is not very effective for tasks that require dynamic programming or longer sequence prediction. Skill DT could benefit from borrowing concepts from hierarchical skill discovery (Ajay et al., 2020) to re-use learned skills on downstream tasks via an additional return-conditioned model. In addition, it would be interesting to explore an online component to the training procedure, similar to the work in Zheng et al. (2022).
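The diversity metric can be sketched with a discrete 1-D Wasserstein distance between normalized skill histograms, computed as a difference of CDFs on a shared support. This is our own illustration of the metric; the exact averaging scheme in the paper may differ.

```python
import numpy as np

def w1(p, q):
    """1-D Wasserstein distance between two normalized histograms
    on a shared unit-spaced support: sum of |CDF(p) - CDF(q)|."""
    return float(np.abs(np.cumsum(p) - np.cumsum(q)).sum())

def diversity(skill_dists):
    """Average pairwise W1 distance between each skill's distribution
    and all the others (higher = more distinguishable skills)."""
    n = len(skill_dists)
    total, pairs = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i != j:
                total += w1(skill_dists[i], skill_dists[j])
                pairs += 1
    return total / max(pairs, 1)

p = np.array([0.5, 0.5, 0.0])
q = np.array([0.0, 0.5, 0.5])
```

Identical distributions give a distance of zero, while fully disjoint one-hot distributions on adjacent bins give a distance of one, matching the [0, 1] normalization mentioned above.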

7. CONCLUSION

We proposed Skill DT, a variant of the Generalized Decision Transformer, to explore the capabilities of offline skill discovery with sequence modelling. We showed that a combination of LLM-style sequence modelling and hindsight relabelling can be incredibly useful for extracting information from diverse offline trajectories. On standard offline RL environments, we showed that Skill DT is capable of learning a rich set of behaviors and can perform zero-shot SMM through state-encoded skill embeddings. Skill DT can further be improved by adding an online component, a hierarchical component that utilizes returns, and improved exploration.

A APPENDIX

Skill DT, like other skill discovery methods, can use its state-to-skill encoder to guide its actions towards a particular goal. In this case, we are interested in recreating target trajectories as closely as possible. The detailed algorithm for few-shot skill reconstruction is given in Algorithm 5.



Figure 1: Skill Decision Transformer. States are encoded and clustered via VQ-VAE codebook embeddings. A Causal Transformer, similar to the original DT architecture, takes in a sequence of states, a latent skill distribution (represented as the normalized summed future counts of VQ-VAE encoding indices; details can be found in the "generate histogram" function in A.5), and the corresponding skill encoding of the state at timestep t.

Figure 2: Training procedure for Skill Decision Transformer. Sub-trajectories of states of length K are sampled from the dataset, encoded into latents, and discretized. All three variables are passed into the causal transformer to output actions. The VQ-VAE and Causal Transformer parameters are optimized directly using an MSE loss and the VQ-VAE regularization loss shown in Eq. 1.

(a) Target antmaze-umaze trajectory. (b) Target antmaze-umaze-diverse skill distribution. (c) Reconstructed trajectories. (d) Reconstructed skill distribution.

Figure 3: From the antmaze-umaze-diverse environment: the target trajectory is complex, with a loop and noisy movement. Reconstructed rollouts also contain a loop.

Figure 5: t-SNE projections of Ant-v2 states. Left: states are encoded into unquantized skill embeddings and projected via t-SNE. Right: states are encoded into quantized skill embeddings and projected via t-SNE.

Figure 7: From the Ant-v2 Environment: Skill distributions of a target trajectory and the reconstructed trajectory from rolling out in the environment. Because Ant-v2 is a simpler environment, we can see that the reconstructed skill distributions are very close to the target.

Figure 6: From the Antmaze-Umaze-Diverse environment: one of the longer and highest-performing trajectories in the dataset is reconstructed by Skill DT. The trajectory is not quite identical to the target, but it follows a similar path, hugging the edges of the maze just like the target.


Sample batch of trajectory states: τ = (s_t, ..., s_{t+K}, a_t, ..., a_{t+K})
for j = 1...J do
    ẑ_{t:t+K} = (e_φ(s_t), ..., e_φ(s_{t+K}))    ▷ Encode skills
    z_{t:t+K} = quantize(ẑ_{t:t+K})    ▷ Quantize skills with VQ-VAE
    â_{t:t+K} = f_θ(Z_t, z_t, s_t, ..., Z_{t+K}, z_{t+K}, s_{t+K})    ▷ Predict actions

Table 2: Best reward obtained from skills, for a varying number of skills.


ACKNOWLEDGEMENTS

Removed for blind review.

REPRODUCIBILITY STATEMENT

The code to reproduce every experiment in this paper is available at [removed].

A.4 TRAJECTORY SKILL VISUALIZATIONS

Because Skill DT is a purely unsupervised algorithm, to evaluate its performance in an actual environment we perform rollouts for each skill and evaluate each to determine which is best. To do this, we first populate a buffer of skills (here denoted z) and skill histograms Z. When we roll out in the actual environment, the causal transformer utilizes this buffer to make predictions. However, it updates the skill encodings it actually sees in the environment at each timestep: even though the policy is conditioned to follow a single skill, it may end up reaching states that are classified under another. Python pseudocode:

# accumulate future skill counts in reverse order
for i in range(trajectory_length - 1, -1, -1):
    counts = counts + one_hot(skill_indices[i])
    Z[i] = counts / counts.sum()  # normalized future skill histogram

