Q-LEARNING DECISION TRANSFORMER: LEVERAGING DYNAMIC PROGRAMMING FOR CONDITIONAL SEQUENCE MODELLING IN OFFLINE RL

Abstract

Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach with a transformer architecture, showing competitive performance against several benchmarks. However, DT lacks stitching ability, one of the critical abilities for offline RL to learn the optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset contains only sub-optimal trajectories. On the other hand, conventional RL approaches based on Dynamic Programming (such as Q-learning) do not have the same limitation; however, they suffer from unstable learning behaviours, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT) to address the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT uses the Dynamic Programming results to relabel the return-to-go in the training data and then trains the DT with the relabelled data. Our approach efficiently exploits the benefits of the two approaches, with each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, showing competitive performance gains.

1. INTRODUCTION

The transformer architecture employs a self-attention mechanism to extract relevant information from high-dimensional data. It achieves state-of-the-art performance in a variety of applications, including natural language processing (NLP) (Vaswani et al., 2017; Radford et al., 2018; Devlin et al., 2018) and computer vision (Ramesh et al., 2021). Its translation to the RL domain, the Decision Transformer (DT) (Chen et al., 2021), successfully applies the transformer architecture to offline reinforcement learning by shifting the focus to sequence modelling, and achieves good performance. It employs a goal-conditioned policy, which converts offline RL into a supervised learning task and avoids the stability issues related to bootstrapping for long-term credit assignment (Srivastava et al., 2019; Kumar et al., 2019b; Ghosh et al., 2019). More specifically, DT takes the sum of future rewards, the return-to-go (RTG), as the goal and learns a policy conditioned on the RTG and the state; it is therefore categorised as a reward conditioning approach.

Although DT shows very competitive performance on offline RL tasks, it fails to achieve one of the desired properties of offline RL agents: stitching. This property is the ability to combine parts of sub-optimal trajectories and produce an optimal one (Fu et al., 2020). Below, we show a simple example of how DT (and reward conditioning approaches in general) can fail to find the optimal path. Consider the task of finding the shortest path from the left-most state to the right-most state without stepping down into the fail state in Fig. 1. We set the reward to -1 at every time step and -10 for the action that leads to the fail state. The training data covers the optimal path, but neither training trajectory contains the entire optimal path; the agent needs to combine the two trajectories to recover it.
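The RTG conditioning signal described above is simply the suffix sum of a trajectory's rewards. As a minimal sketch (the reward sequence below is illustrative, matching the toy example's -1 per step):

```python
def returns_to_go(rewards):
    """RTG_t = sum of rewards from time step t to the end of the trajectory."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]

# A four-step trajectory with reward -1 at every step, as in the toy example.
print(returns_to_go([-1, -1, -1, -1]))  # [-4.0, -3.0, -2.0, -1.0]
```

At training time, DT is conditioned on these RTG labels; at evaluation time, the RTG is set to the desired (e.g. expert-level) return, so the quality of the labels in the dataset directly bounds what the policy can be asked to reproduce.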
The reward conditioning approach essentially finds a trajectory in the training data that yields the desired reward and takes the same actions as that trajectory. In this simple example, trajectory 2 has a meagre return, so the agent always follows the path of trajectory 1, even though trajectory 2 takes the optimal first action. In contrast to the reward conditioning approaches (DT), Q-learning (see footnote) does not suffer from this issue and finds the optimal path quickly in this simple example. Q-learning treats each time step separately and propagates the best future rewards backwards; hence it can learn from the optimal first action of trajectory 2. However, Q-learning has issues in long-horizon and sparse-reward scenarios. It must propagate the value function backwards towards the initial state, and it often struggles to learn across long time horizons and sparse reward tasks. This is especially true when Q-learning uses function approximation in an off-policy setting, as discussed in Section 11.3 of Sutton & Barto (1998).

Here, we devise a method that addresses the issues above by leveraging Q-learning to improve DT. Our approach differs from other offline RL algorithms, which typically propose a new single agent architecture to achieve better performance. Instead, we propose a framework that improves the quality of the offline dataset and thereby obtains better performance from existing offline RL algorithms. Our approach exploits the Q-learning estimates to relabel the RTG in the training data for the DT agent. The motivation comes from the fact that Q-learning learns the RTG values of the optimal policy, which suggests that relabelling the RTG in the training data with the learned RTG should resolve DT's stitching issue. However, Q-learning also struggles in situations where the states require backward propagation over many time steps. In these cases, we argue that DT helps, as it models the sequence of states and actions without backward propagation.
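The relabelling idea can be sketched as follows. This is a simplified, hypothetical illustration, not the paper's exact algorithm: it assumes a learned value estimate `values[t]` (e.g. from offline Q-learning) is available for each state in a trajectory, and it rewrites each RTG label to reflect the best return believed achievable from that state rather than the return the (possibly sub-optimal) behaviour policy actually collected.

```python
def relabel_rtg(rewards, values):
    """Backward pass over one trajectory: RTG_t = r_t + max(RTG_{t+1}, V(s_{t+1})).

    The max lets a learned value estimate override a poor observed continuation,
    so a good early action is no longer penalised by what happened afterwards.
    """
    T = len(rewards)
    rtg = [0.0] * T
    rtg[-1] = rewards[-1]
    for t in range(T - 2, -1, -1):
        rtg[t] = rewards[t] + max(rtg[t + 1], values[t + 1])
    return rtg

# Analogue of trajectory 2 in the toy example: a good first action followed by
# a poor continuation (-10). A value estimate (hypothetical numbers) that knows
# a better continuation exists lifts the early RTG label from -11 to -4.
print(relabel_rtg([-1, -10], [-4.0, -3.0]))  # [-4.0, -10]
```

With labels rewritten this way, the DT no longer sees trajectory 2's first action paired with a meagre RTG, so conditioning on a high target return can now select it, which is exactly the stitching behaviour DT lacks on the raw dataset.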
Our proposal (QDT) exploits the strengths of each of the two approaches to compensate for the other's weaknesses and achieve more robust performance. Our main evaluation results are summarised in Fig. 2. The left two plots (simple and maze2d environments) show that DT fails because it lacks stitching ability, while the right plot shows that CQL fails under sparse (delayed) rewards; QDT performs consistently well across all environments.



Footnote: In this paper, we use the terms Q-learning and Dynamic Programming interchangeably to indicate any RL algorithm relying on the Bellman-backup operation.



Figure 1: A simple example demonstrating the Decision Transformer's issue (lack of stitching ability): it fails to find the shortest path to the goal, whereas Q-learning finds the shortest path. The numbers on the arrows are the rewards on the path, and the numbers on the states are RTGs.

Figure 2: Evaluation results for Conservative Q-Learning (CQL), the Decision Transformer (DT) and the Q-learning Decision Transformer (QDT). The left two plots (simple and maze2d environments) show that DT performs poorly because it fails to stitch trajectories, and the right plot shows that CQL fails to learn in a sparse (delayed) reward scenario. In contrast, QDT achieves consistently good results across all environments.

