DECISION S4: EFFICIENT SEQUENCE-BASED RL VIA STATE SPACE LAYERS

Abstract

Recently, sequence learning methods have been applied to the problem of off-policy Reinforcement Learning, including the seminal work on Decision Transformers, which employs transformers for this task. Since transformers are parameter-heavy, cannot benefit from history longer than a fixed window size, and are not computed using recurrence, we set out to investigate the suitability of the S4 family of models, which are based on state-space layers and have been shown to outperform transformers, especially in modeling long-range dependencies. In this work we present two main algorithms: (i) an off-policy training procedure that works with trajectories while still maintaining the training efficiency of the S4 model, and (ii) an on-policy training procedure that is trained in a recurrent manner, benefits from long-range dependencies, and is based on a novel stable actor-critic mechanism. Our results indicate that our method outperforms multiple variants of decision transformers, as well as the other baseline methods, on most tasks, while reducing the latency, number of parameters, and training time by several orders of magnitude, making our approach more suitable for real-world RL.

1. INTRODUCTION

Robots are naturally described as being in an observable state, having a multi-dimensional action space, and striving to achieve a measurable goal. The complexity of these three elements, and the often non-differentiable links between them, such as the shift between the states given the action and the shift between the states and the reward (with the latter computed based on additional entities), make the use of Reinforcement Learning (RL) natural, see also (Kober et al., 2013; Ibarz et al., 2021). Off-policy RL has preferable sample complexity and is widely used in robotics research, e.g., (Haarnoja et al., 2018; Gu et al., 2017). However, with the advent of accessible physical simulations for generating data, learning complex tasks without a successful sample model is readily approached by on-policy methods (Siekmann et al., 2021), and the same holds for the task of adversarial imitation learning (Peng et al., 2021; 2022). The decision transformer of Chen et al. (2021) is a sequence-based off-policy RL method that considers sequences of tuples of the form (reward, state, action). Using the auto-regressive capability of transformers, it generates the next action given the desired reward and the current state. The major disadvantages of the decision transformer are the size of the architecture, which is a known limitation in these models, the inference runtime, which stems from the inability to compute the transformer recursively, and the fixed window size, which eliminates long-range dependencies. In this work, we propose a novel, sequence-based RL method that is far more efficient than the decision transformer and more suitable for capturing long-range effects. The method is based on the S4 sequence model, which was designed by Gu et al. (2021a).
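The decision-transformer input layout described above can be sketched as follows. This is a minimal illustration, not the implementation of Chen et al. (2021); the `Step` container, the function name, and the toy trajectory are our own assumptions. Each trajectory is flattened into interleaved (return-to-go, state, action) tokens, so that an auto-regressive model can be conditioned on the desired return and current state when producing the next action.

```python
# Hedged sketch of a DT-style trajectory tokenization (names are illustrative,
# not from the original paper).
from dataclasses import dataclass
import numpy as np

@dataclass
class Step:
    reward: float
    state: np.ndarray
    action: np.ndarray

def to_tokens(traj):
    """Interleave (return-to-go, state, action) along the sequence axis."""
    rewards = np.array([s.reward for s in traj])
    # Return-to-go at step t is the sum of rewards from t onward.
    rtg = rewards[::-1].cumsum()[::-1]
    tokens = []
    for g, step in zip(rtg, traj):
        tokens += [np.atleast_1d(g), step.state, step.action]
    return tokens

traj = [Step(1.0, np.zeros(3), np.ones(2)), Step(0.5, np.ones(3), np.zeros(2))]
toks = to_tokens(traj)
# rtg = [1.5, 0.5]; three tokens per step, so 6 token vectors in total.
assert len(toks) == 6 and toks[0][0] == 1.5
```

At inference time, such a model is queried with the target return and the current state, and the sampled continuation is read off as the next action.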
While the original S4 method is not amenable to on-policy applications, since it is designed to train on sequences rather than individual elements, we suggest a new learning method that combines off-policy training with on-policy fine-tuning. This scheme allows us to run on-policy algorithms while exploiting the advantages of S4. First, we train the model in an off-policy manner on sequences, via the convolutional view. This stage exploits the ability of S4 to operate extremely fast on sequences, thanks to the fact that the computation can be performed with an FFT instead of many recurrent steps. Then, at the fine-tuning stage, we use an on-policy algorithm. While pure on-policy training is difficult due to the instability and randomness that arise at the beginning of training, our method starts the on-policy training from a more stable point. From the technical perspective, our method applies recurrence during the training of the S4 model. As far as we can ascertain, such a capability has not been demonstrated for S4, although it was one of the advantages of the earlier HiPPO model (Gu et al., 2020), which has fixed (unlearned) recurrent matrices and a different parameterization, and is outperformed by S4. Furthermore, in Appendix E we show that the recurrent view of the diagonal state-space layer is unstable from both a theoretical and an empirical perspective, and we propose a method to mitigate this problem in on-policy RL. This observation provides a further theoretical explanation for why state-space layers empirically outperform RNNs. Moreover, we present a novel transfer-learning technique that involves training both the recurrent and convolutional views of S4, and we show its applicability for RL. We conduct experiments on multiple Mujoco (Todorov et al., 2012) benchmarks and show the advantage of our method over existing off-policy methods, including the decision transformer, and over similar on-policy methods.
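The equivalence that the training scheme above relies on can be sketched in a few lines. This is a minimal NumPy sketch of a single diagonal linear state-space layer, not the paper's implementation; the diagonal parameterization, sizes, and random initialization are illustrative assumptions. The recurrent view processes one element at a time (suited to on-policy rollouts), while the convolutional view materializes the kernel K_l = C A^l B once and applies a single FFT convolution over the whole trajectory (suited to fast off-policy training); both produce the same output.

```python
# Two equivalent views of a diagonal linear state-space layer:
#   recurrent view:      x_k = A x_{k-1} + B u_k ;  y_k = C x_k
#   convolutional view:  y = K * u,  with kernel K_l = C A^l B
import numpy as np

rng = np.random.default_rng(0)
N, L = 8, 64                                        # state size, sequence length
A = 0.9 * np.exp(1j * rng.uniform(-0.1, 0.1, N))    # stable diagonal A (|A| < 1)
B = rng.standard_normal(N) + 0j
C = rng.standard_normal(N) + 0j
u = rng.standard_normal(L)

# Recurrent view: O(L) sequential steps, one per incoming element.
x = np.zeros(N, dtype=complex)
y_rec = np.empty(L)
for k in range(L):
    x = A * x + B * u[k]
    y_rec[k] = (C @ x).real

# Convolutional view: build the kernel once, then one zero-padded FFT
# convolution over the entire input sequence.
K = np.array([(C * A**l) @ B for l in range(L)]).real    # K_l = C A^l B
y_conv = np.fft.irfft(np.fft.rfft(u, 2 * L) * np.fft.rfft(K, 2 * L))[:L]

assert np.allclose(y_rec, y_conv, atol=1e-6)
```

Keeping |A| < 1 is what makes the recurrent rollout stable; the instability analyzed in Appendix E concerns exactly the regime where this condition fails.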

2. RELATED WORK

Classic RL methods, such as dynamic programming (Veinott, 1966; Blackwell, 1962) and Q-learning variants (Schwartz, 1993; Hasselt, 2010; Rummery & Niranjan, 1994), are often outperformed by deep RL methods, starting with the seminal deep Q-learning method (Mnih et al., 2015) and followed by thousands of follow-up contributions. Some of the most prominent methods are AlphaGo (Silver et al., 2016), AlphaZero (Silver et al., 2018), and Pluribus (Brown & Sandholm, 2019), which outperform humans in Go; in chess, shogi, and Go; and in poker, respectively.

Sequence Models in RL. There are many RL methods that employ recurrent neural networks (RNNs), such as vanilla RNNs (Schäfer, 2008; Li et al., 2015) or LSTMs (Bakker, 2001; 2007). Recurrent models are suitable for RL tasks for two reasons. First, these models are fast at inference, which is necessary for a system that operates and responds to the environment in real time. Second, since the agent makes decisions recursively, based on the decisions made in the past, RL tasks are recursive in nature. However, these models often suffer from a lack of stability and struggle to capture long-term dependencies. The latter problem stems from two main reasons: (i) propagating gradients over long trajectories is computationally expensive, and (ii) this process encourages gradients to explode or vanish, which impairs the quality of the learning process. In this work, we tackle these two problems via the recent S4 layer of Gu et al. (2021a) and a stable implementation of the actor-critic mechanism.

Decision transformers (DT) (Chen et al., 2021) consider RL as a sequence modeling problem. Using transformers as the underlying models, state-of-the-art results are obtained on multiple tasks. DTs have drawn considerable attention, and several improvements have been proposed: Furuta et al. (2021) propose data-efficient algorithms that generalize the DT method and try to maximize the information gained from each trajectory to improve learning efficiency. Meng et al. (2021) explored the zero-shot and few-shot performance of a model trained in an offline manner on online tasks. Janner et al. (2021) employ beam search as a planning algorithm to produce the most likely sequence of actions. Reid et al. (2022) investigate the performance of pre-trained transformers on RL tasks and propose several techniques for applying transfer learning in this domain. Other contributions design a generalist agent, via a scaled transformer or by applying modern training procedures (Lee et al., 2022; Reed et al., 2022; Wen et al., 2022). The most relevant DT variant to our work is a recent contribution that applies DT in an on-policy manner, by fine-tuning a pre-learned DT that was trained in an off-policy manner (Zheng et al., 2022).

Actor-Critic methods. Learning off-policy algorithms over high-dimensional complex data is a central goal in RL. One of the main challenges of this problem is the instability of convergence, as

