How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

Abstract

Linear time-invariant state space models (SSMs) are a classical model class from engineering and statistics that has recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.

1. Introduction

The Structured State Space model (S4) is a recent deep learning model based on continuous-time dynamical systems that has shown promise on a wide variety of sequence modeling tasks (Gu et al., 2022a). It is defined as a particular linear time-invariant (LTI) state space model (SSM), which gives it multiple properties (Gu et al., 2021): as an SSM, S4 can be simulated as a discrete-time recurrence for efficiency in online or autoregressive settings, and as an LTI model, S4 can be converted into a convolution for parallelizability and computational efficiency at training time. These properties give S4 remarkable computational efficiency and performance, especially when modeling continuous signal data and long sequences.

Despite its potential, several aspects of the S4 model remain poorly understood. Most notably, Gu et al. (2022a) claim that the long-range abilities of S4 arise from instantiating it with a particular "HiPPO matrix" (Gu et al., 2020). However, this matrix was actually derived in prior work for a different (time-varying) setting, and the use of this matrix in S4 (a time-invariant SSM) did not have a mathematical interpretation. Consequently, the mechanism by which S4 truly models long-range dependencies is actually not known. Beyond this initialization, several other aspects of parameterizing and training S4 remain poorly understood. For example, S4 involves an important timescale parameter ∆ and suggests a method for parameterizing and initializing this parameter, but does not discuss its meaning or provide a justification.

This work aims to provide a comprehensive theoretical exposition of several aspects of S4. The major contribution of this work is a cleaner, more intuitive, and much more general formulation of the HiPPO framework. This result directly generalizes all previously known results in this line of work (Voelker et al., 2019; Gu et al., 2020; 2021; 2022a). As immediate consequences of this framework:
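The recurrence/convolution duality described above can be illustrated with a minimal sketch (not the actual S4 implementation, which uses a structured state matrix and learned parameters): a discrete LTI SSM x_{k+1} = A x_k + B u_k, y_k = C x_k can be stepped one input at a time, or its outputs can be computed all at once by convolving the input with the kernel K_i = C A^i B.

```python
import numpy as np

# Hypothetical toy dimensions and random parameters, for illustration only.
rng = np.random.default_rng(0)
N, L = 4, 16                                  # state size, sequence length
A = 0.3 * rng.standard_normal((N, N))         # state matrix (S4 would use a HiPPO init)
B = rng.standard_normal((N, 1))               # input matrix
C = rng.standard_normal((1, N))               # output matrix
u = rng.standard_normal(L)                    # input sequence

# 1) Recurrent mode: step the state once per input (online / autoregressive use).
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = A @ x + B * u[k]
    y_rec.append((C @ x).item())
y_rec = np.array(y_rec)

# 2) Convolutional mode: materialize the kernel K_i = C A^i B, then convolve
#    (parallelizable across the whole sequence at training time).
K = np.array([(C @ np.linalg.matrix_power(A, i) @ B).item() for i in range(L)])
y_conv = np.convolve(u, K)[:L]

assert np.allclose(y_rec, y_conv)             # the two modes agree
```

S4's efficiency comes from computing this kernel K fast via the structure of A, rather than by repeated matrix powers as in this naive sketch.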

