How to Train Your HiPPO: State Space Models with Generalized Orthogonal Basis Projections

Abstract

Linear time-invariant state space models (SSMs) are a classical model class from engineering and statistics that has recently been shown to be very promising in machine learning through the Structured State Space sequence model (S4). A core component of S4 involves initializing the SSM state matrix to a particular matrix called a HiPPO matrix, which was empirically important for S4's ability to handle long sequences. However, the specific matrix that S4 uses was actually derived in previous work for a particular time-varying dynamical system, and the use of this matrix as a time-invariant SSM had no known mathematical interpretation. Consequently, the theoretical mechanism by which S4 models long-range dependencies actually remains unexplained. We derive a more general and intuitive formulation of the HiPPO framework, which provides a simple mathematical interpretation of S4 as a decomposition onto exponentially-warped Legendre polynomials, explaining its ability to capture long dependencies. Our generalization introduces a theoretically rich class of SSMs that also lets us derive more intuitive S4 variants for other bases such as the Fourier basis, and explains other aspects of training S4, such as how to initialize the important timescale parameter. These insights improve S4's performance to 86% on the Long Range Arena benchmark, with 96% on the most difficult Path-X task.

1. Introduction

The Structured State Space model (S4) is a recent deep learning model based on continuous-time dynamical systems that has shown promise on a wide variety of sequence modeling tasks (Gu et al., 2022a). It is defined as a particular linear time-invariant (LTI) state space model (SSM), which gives it multiple properties (Gu et al., 2021): as an SSM, S4 can be simulated as a discrete-time recurrence for efficiency in online or autoregressive settings, and as an LTI model, S4 can be converted into a convolution for parallelizability and computational efficiency at training time. These properties give S4 remarkable computational efficiency and performance, especially when modeling continuous signal data and long sequences. Despite its potential, several aspects of the S4 model remain poorly understood. Most notably, Gu et al. (2022a) claim that the long-range abilities of S4 arise from instantiating it with a particular "HiPPO matrix" (Gu et al., 2020). However, this matrix was actually derived in prior work for a different (time-varying) setting, and the use of this matrix in S4 (a time-invariant SSM) had no mathematical interpretation. Consequently, the mechanism by which S4 truly models long-range dependencies is actually not known. Beyond this initialization, several other aspects of parameterizing and training S4 remain poorly understood. For example, S4 involves an important timescale parameter ∆ and suggests a method for parameterizing and initializing it, but does not discuss its meaning or provide a justification. This work aims to provide a comprehensive theoretical exposition of several aspects of S4. The major contribution of this work is a cleaner, more intuitive, and much more general formulation of the HiPPO framework. This result directly generalizes all previously known results in this line of work (Voelker et al., 2019; Gu et al., 2020; 2021; 2022a).
As immediate consequences of this framework:

• We prove a theoretical interpretation of S4's state matrix A, explaining S4's ability to capture long-range dependencies via decomposing the input with respect to an infinitely long, exponentially-decaying measure.

• We derive new HiPPO matrices and corresponding S4 variants that generalize to other well-behaved basis functions. For example, our new method S4-FouT produces truncated Fourier basis functions. This method thus automatically captures sliding Fourier transforms (e.g. the STFT and spectrograms), which are ubiquitous as a hand-crafted signal processing tool, and can also represent any local convolution, thus generalizing conventional CNNs.

• We provide an intuitive explanation of the timescale ∆, which has a precise interpretation as controlling the length of dependencies that the model captures. Our framework makes it transparent how to initialize ∆ for a given task, as well as how to initialize the other parameters (in particular, the last SSM parameter C) to make a deep SSM variance-preserving and stable.

Empirically, we validate our theory on synthetic function reconstruction and memorization tasks, showing that the empirical performance of state space models in several settings is predicted by the theory. For example, our new S4-FouT method, which can provably encode a spike function as its convolution kernel, performs best on a continuous memorization task compared to other SSMs and other models, when ∆ is initialized correctly. Finally, we show that the original S4 method is still best on very long range dependencies, achieving a new state of the art of 86% average on Long Range Arena, with 96% on the most difficult Path-X task that even the other SSM variants struggle with.

2. Framework

We present our improved framework for state space models and online reconstruction of signals. Section 2.1 discusses background on SSMs, including their connection to convolutions for time-invariant systems. Section 2.2 defines new subclasses of SSMs with special properties that can be used for online function reconstruction, simplifying and generalizing the original HiPPO framework. An extended background and related work section can be found in Appendix A.

2.1. State Space Models: A Continuous-time Latent State Model

The state space model (SSM) is defined by the differential equations (1) and (2): it maps a 1-D input signal u(t) to an N-D latent state x(t) before projecting to a 1-D output signal y(t).

x'(t) = A(t)x(t) + B(t)u(t)    (1)
y(t) = C(t)x(t) + D(t)u(t)    (2)
K(t) = C e^{tA} B,    y(t) = (K * u)(t)    (3)

We will generally assume D = 0 ∈ R and omit it for simplicity, unless explicitly mentioned. SSMs can in general have dynamics that change over time, i.e. the matrix A ∈ R^{N×N} and vectors B ∈ R^{N×1}, C ∈ R^{1×N} are functions of t in (1) and (2). However, when they are constant the system is linear time-invariant (LTI) and is equivalent to the convolutional system (3). The function K(t) is called the impulse response, which can also be defined as the output of the system when the input u(t) = δ(t) is the impulse or Dirac delta function. We will call these time-invariant state space models (TSSM). These are particularly important because the equivalence to a convolution makes TSSMs parallelizable and very fast to compute, which is critical for S4's efficiency. Our treatment of SSMs will consider the (A, B) parameters separately from C. We will refer to an SSM as either the tuple (A, B, C) (referring to (3)) or (A, B) (referring to Definition 1) when the context is unambiguous. We also drop the T in TSSM when the context is clearly time-invariant.

Definition 1. Given a TSSM (A, B), e^{tA}B is a vector of N functions which we call the SSM basis. The individual basis functions are denoted K_n(t) = e_n^T e^{tA} B, which satisfy x_n(t) = (u * K_n)(t) = ∫_{-∞}^{t} K_n(t-s) u(s) ds. Here e_n is the one-hot basis vector.

This definition is motivated by noting that the SSM convolutional kernel is a linear combination of the SSM basis controlled by the vector of coefficients C: K(t) = Σ_{n=0}^{N-1} C_n K_n(t). We note that Definition 1 has not appeared in prior works on deep SSMs, but is a new perspective taken by this work for understanding and visualizing these models.
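To make the recurrence/convolution equivalence concrete, here is a minimal numerical sketch (our own variable names, not the S4 implementation): a random discretized SSM computed as a step-by-step recurrence and as a convolution with the kernel K̄_k = C̄ Āᵏ B̄ produces identical outputs.

```python
import numpy as np

# Sketch: an LTI SSM (A, B, C) discretized with the bilinear transform at
# step size dt, computed (1) as a recurrence and (2) as a convolution.
rng = np.random.default_rng(0)
N, L, dt = 4, 32, 0.1
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))  # roughly stable
B = rng.standard_normal(N)
C = rng.standard_normal(N)
u = rng.standard_normal(L)

# Bilinear (Tustin) discretization
I = np.eye(N)
Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
Bb = np.linalg.solve(I - dt / 2 * A, dt * B)

# (1) Recurrent mode: x_k = Ab x_{k-1} + Bb u_k, y_k = C x_k
x = np.zeros(N)
y_rec = np.zeros(L)
for k in range(L):
    x = Ab @ x + Bb * u[k]
    y_rec[k] = C @ x

# (2) Convolutional mode: y = K * u with kernel K_k = C Ab^k Bb
K = np.array([C @ np.linalg.matrix_power(Ab, k) @ Bb for k in range(L)])
y_conv = np.convolve(K, u)[:L]

assert np.allclose(y_rec, y_conv, atol=1e-6)
```

The recurrence unrolls to y_k = Σ_j C Āᵏ⁻ʲ B̄ u_j, which is exactly the discrete convolution with this kernel; S4's contribution is computing such kernels efficiently for structured (A, B).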
Discrete SSM with Timescales. To be applied to a discrete input sequence (u_0, u_1, ...) instead of a continuous function u(t), (1) must be discretized by a step size ∆ that represents the resolution of the input. A poorly understood question from prior work is how to interpret and choose this ∆ parameter, especially when the input u_k does not actually arise from uniformly sampling an underlying continuous signal. S4 specifies to log-uniformly initialize ∆ in the range (∆_min, ∆_max) = (0.001, 0.1), but does not provide a concrete justification. In Section 3.3 we show a simpler interpretation of ∆ directly in terms of the length of dependencies in a discrete input sequence.

2.2. HiPPO: High-order Polynomial Projection Operators

S4 is defined as a TSSM where (A, B) is initialized with a particular formula (4). This was called the HiPPO matrix in (Gu et al., 2022a), but is actually just one of several such special matrices derived in (Gu et al., 2020). To disambiguate other variants of S4, we refer to the full S4 method using this HiPPO SSM as S4-LegS. Other cases considered in this work include LegT from prior work (5) and FouT, which we introduce in this work (6).

(HiPPO-LegS)  A_{nk} = -(2n+1)^{1/2} (2k+1)^{1/2} · { 1 if n > k;  (n+1)/(2n+1) if n = k;  0 if n < k },    B_n = (2n+1)^{1/2}    (4)

(HiPPO-LegT)  A_{nk} = -(2n+1)^{1/2} (2k+1)^{1/2} · { 1 if k ≤ n;  (-1)^{n-k} if k ≥ n },    B_n = (2n+1)^{1/2}    (5)

(HiPPO-FouT)  A_{nk} = { -2 if n = k = 0;  -2√2 if n = 0, k odd;  -2√2 if k = 0, n odd;  -4 if n, k both odd;  2πk if n - k = 1, k odd;  -2πn if k - n = 1, n odd;  0 otherwise },    B_n = { 2 if n = 0;  2√2 if n odd;  0 otherwise }    (6)

These matrices were originally motivated by the question of "online memorization" of an input signal. In the following, we present an improved version of the HiPPO framework that addresses this problem. The key idea is that for a suitably chosen SSM basis (A, B), at any time t, the current state x(t) can be used to approximately reconstruct the entire input u up to time t.
In Appendix A.1, we describe the full HiPPO framework as described in (Gu et al., 2020). In particular, suppose that the basis functions satisfy Definition 2.

Definition 2. We call an SSM (A(t), B(t)) an orthogonal SSM (OSSM) for the basis p_n(t,s) and measure ω(t,s) ≥ 0 if the functions K_n(t,s) = p_n(t,s) ω(t,s) satisfy, at all times t,

x_n(t) = ∫_{-∞}^{t} K_n(t,s) u(s) ds,    ∫_{-∞}^{t} p_n(t,s) p_m(t,s) ω(t,s) ds = δ_{n,m}.    (7)

In the case of a time-invariant OSSM (TOSSM), K_n(t,s) =: K_n(t-s) depends only on t-s, giving us Definition 1 with measure ω(t-s) := ω(t,s) and basis p_n(t-s) := p_n(t,s). To be more specific about terminology, p_n and ω are called the basis and measure for orthogonal SSMs (Definition 2), while the K_n are called the SSM basis kernels, which apply more generally to all SSMs (Definition 1). The distinction will be made clear from context, notation, and the word "kernel" referring to K_n. Note that for OSSMs, (p_n, ω) and K_n are uniquely determined by each other (Proposition 6 in Appendix C.2), so we can refer to an OSSM by either.

Defining p_n^{(t)}(s) = p_n(t,s) and similarly ω^{(t)}(s) = ω(t,s) for every fixed t, the bases p_n^{(t)} are orthonormal in the Hilbert space with inner product ⟨p, q⟩ = ∫ p(s) q(s) ω^{(t)}(s) ds. By equation (7), we have x_n(t) = ∫_{-∞}^{t} u(s) K_n(t,s) ds = ⟨u, p_n^{(t)}⟩_{ω^{(t)}}. Thus at all times t, the state vector x(t) is simply the set of projections of u|_{≤t} onto an orthonormal basis, so that u can be reconstructed from x(t). In the HiPPO framework, this reconstruction is called the online function approximation problem (Gu et al., 2020).

Proposition 1. Consider an OSSM that satisfies (7) and suppose that in the limit N → ∞, for a fixed time t, the p_n^{(t)} are complete on the support of ω. Then u(s) = Σ_{n=0}^{∞} x_n(t) p_n(t,s) for all s ≤ t.

HiPPO can thus be viewed as a framework for deriving specific SSMs that do satisfy (7). The original HiPPO method and its generalizations (Gu et al., 2020; 2021) primarily focused on the case when the p_n are orthogonal polynomials, and specifically looked for solutions to (7), which turn out to be SSMs. We have rephrased the HiPPO definition in Definition 2 to start directly from SSMs, and hence it is more general. (See Appendix A.1 for an overview of the original HiPPO setup.) We discuss the two most important cases previously introduced.

HiPPO-LegT. (5) is a TOSSM that approximates the Legendre polynomials.

Definition 3. Let I(t) be the indicator function for the unit interval [0,1]. Let L_n(t) be the Legendre polynomials rescaled to be orthonormal on [0,1], i.e., ∫ L_n(t) L_m(t) I(t) dt = δ_{n,m}.

Proposition 2. As N → ∞, the SSM (5) is a TOSSM with ω(t) = I(t), p_n(t) = L_n(t).

This particular system was the precursor to HiPPO and has also been variously called the Legendre Delay Network (LDN) or Legendre Memory Unit (LMU) (Voelker, 2019; Voelker et al., 2019).
The original motivation of this system was not the online function approximation formulation of HiPPO, but finding an optimal SSM approximation to the delay network with impulse response K(t) = δ(t-1), representing an output time-lagged by 1 time unit. This is visualized in Appendix C.4.4, Fig. 7. We state and provide an alternate proof of this result in Appendix C.4.4, Theorem 10.

HiPPO-LegS. Unlike the HiPPO-LegT case, which is an LTI system (1) (i.e. a TOSSM), the HiPPO-LegS matrix (4) was meant to be used in the time-varying system x'(t) = (1/t) A x(t) + (1/t) B u(t) (Gu et al., 2020). In contrast to HiPPO-LegT, which reconstructs onto the truncated Legendre polynomials in sliding windows [t-1, t], HiPPO-LegS reconstructs onto Legendre polynomials on "scaled" windows [0, t]; since the window changes across time, the system is not time-invariant. Specifically, we have:

Theorem 3. The SSM ((1/t) A, (1/t) B) for (A, B) in (4) is an OSSM with ω(t,s) = (1/t)·I(s/t), p_n(t,s) = L_n(s/t).

However, the S4 model applies the exact same formula (4) inside the time-invariant SSM (1), i.e. it dropped the 1/t term, which had no mathematical interpretation (see Appendix A.1 for more details). In other words, while ((1/t) A, (1/t) B) is an OSSM, it was not known whether the TSSM (A, B) is a TOSSM. Given that the performance of SSM models is very sensitive to these matrices A (Gu et al., 2022a; Gupta, 2022), it remained a mystery why this works. In Section 3 we prove that (4) actually does correspond to a TOSSM.

Naming convention. We use HiPPO-[SSM] to refer to a fixed OSSM (A, B) suitable for online function approximation, where [SSM] is a suffix (e.g. LegS, LegT) that abbreviates the corresponding basis functions (e.g. scaled Legendre, truncated Legendre). S4-[SSM] refers to the corresponding trainable layer (A, B, C) with randomly initialized C, trained with S4's representation and computational algorithm (Gu et al., 2022a).

Other SSMs.
Several variants of S4 have been introduced, including several simpler diagonal SSMs (DSS (Gupta, 2022) , S4D (Gu et al., 2022b) , S5 (Smith et al., 2022) ). Notably, these methods are all based on approximations of HiPPO-LegS, and our new theory explains why they perform well (Gu et al., 2022b) . However, they are not OSSMs, and in Section 4 we show several settings where the full S4 variants based on OSSMs outperform these variants.

3. Generalized HiPPO: General Orthogonal Basis Projections

In Section 3.1, we prove that the LTI HiPPO-LegS is actually a TOSSM and show closed formulas for its basis functions. In Section 3.2, we include more specific results on finite-window SSMs, introducing a new method HiPPO-FouT based on truncated Fourier functions and proving previously posed conjectures. Section 3.3 shows more general properties of TOSSMs, which establish guidelines for interpreting and initializing SSM parameters such as the timescale ∆. Our main, fully general result is Theorem 8 in Appendix C.2, which describes a very general way to derive OSSMs for various SSM basis functions K_n(t,s). This result can be instantiated in many ways to generalize all previous results in this line of work.

3.1. Explanation of S4-LegS

We showcase the generality of Theorem 8 by stating the following special case, containing a sub-class of time-varying OSSMs (which are themselves rich enough to explain both S4-LegS and HiPPO-LegS):

Corollary 3.1. Define σ(t,s) = exp(a(s) - a(t)) for any differentiable function a. The SSM (a'(t) A, a'(t) B) is an OSSM with ω(t,s) = I(σ(t,s)) a'(s) σ(t,s), p_n(t,s) = L_n(σ(t,s)).

We show the matrices (A, B) in (4) are deeply related to the Legendre polynomials L_n defined in Definition 3. In particular, as more specific corollaries of Corollary 3.1, we recover both the original time-varying interpretation of the matrix in (4) and the instantiation of LegS as a time-invariant system. If we set a'(t) = 1/t, then we recover the scale-invariant HiPPO-LegS OSSM in Theorem 3:

Corollary 3.2 (Scale-Invariant HiPPO-LegS, Theorem 3). The SSM ((1/t) A, (1/t) B) is an OSSM with measure ω(t,s) = (1/t)·I(s/t) and basis p_n(t,s) = L_n(s/t), where A and B are defined as in (4).

And if we set a'(t) = 1, this shows a new result for the time-invariant HiPPO-LegS TOSSM:

Corollary 3.3 (Time-Invariant HiPPO-LegS). The SSM (A, B) is a TOSSM with ω(t) = e^{-t}, p_n(t) = L_n(e^{-t}).

This explains why removing the 1/t factor from HiPPO-LegS still works: it is orthogonalizing onto the Legendre polynomials with an exponential "warping" or change of basis on the time axis (Fig. 1, Left).
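Corollary 3.3 can be checked numerically at finite state size; a minimal sketch (our own helper names, not S4 code) comparing the SSM basis e^{tA}B against the warped-Legendre prediction K_n(t) = L_n(e^{-t}) e^{-t}:

```python
import numpy as np
from scipy.linalg import expm
from numpy.polynomial.legendre import Legendre

def hippo_legs(N):
    # HiPPO-LegS (A, B) from equation (4)
    q = np.sqrt(2 * np.arange(N) + 1)
    A = -np.tril(np.outer(q, q), -1) - np.diag(np.arange(N) + 1.0)
    return A, q

N, t = 8, 0.7
A, B = hippo_legs(N)
K = expm(t * A) @ B  # SSM basis functions K_n(t) = e_n^T e^{tA} B

# Corollary 3.3 predicts K_n(t) = L_n(e^{-t}) e^{-t}, with L_n the Legendre
# polynomials rescaled to be orthonormal on [0, 1] (Definition 3), i.e.
# L_n(x) = sqrt(2n+1) * P_n(2x - 1).
x = np.exp(-t)
expected = np.array([np.sqrt(2 * n + 1) * Legendre.basis(n)(2 * x - 1) * x
                     for n in range(N)])
assert np.allclose(K, expected, atol=1e-6)
```

Because A is triangular, each K_n depends only on K_0, ..., K_n, and the identity holds exactly for every finite N, not just in the limit.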

3.2. Finite Window Time-Invariant Orthogonal SSMs

For the remainder of this section, we restrict to the time-invariant SSM setting (3). A second important instantiation of Theorem 8 covers cases with a discontinuity in the SSM basis functions K n (t), which requires infinite-dimensional SSMs to represent. The most important type of discontinuity occurs when K n (t) is supported on a finite window, so that these TSSMs represent sliding window transforms.

3.2.1. S4-FouT

Using the more general framework (Theorem 8), which does not necessarily require polynomials as basis functions, we derive a TOSSM that projects onto truncated Fourier functions.

Theorem 4. As N → ∞, the SSM (6) is a TOSSM with ω(t) = I(t), where the {p_n} are the truncated Fourier basis functions orthonormal on [0,1], ordered in the form {p_n}_{n≥0} = (1, c_0(t), s_0(t), ...), with s_m(t) = √2 sin(2πmt) and c_m(t) = √2 cos(2πmt) for m = 0, ..., N/2.

This SSM corresponds to Fourier series decompositions, a ubiquitous tool in signal processing, but represented as a state space model. The basis is visualized in Fig. 1 (middle) for state size N = 1024. A benefit of using these well-behaved basis functions is that we can leverage classic results from Fourier analysis. For example, taking linear combinations of the truncated Fourier basis can represent any function on [0,1], and thus S4-FouT can represent any local convolution (i.e. the layers of modern CNNs) (cf. Theorem 9 in Appendix C.4).
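As an illustration of this representational claim (not the S4-FouT computation itself; kernel and grid are our own choices), the following sketch projects an arbitrary kernel supported on [0,1] onto the orthonormal truncated Fourier basis and checks that reconstruction improves as the basis grows:

```python
import numpy as np

# Illustration only: project a local convolution kernel K supported on [0, 1]
# onto the basis {1, sqrt(2) cos(2*pi*m*t), sqrt(2) sin(2*pi*m*t)} and verify
# that the reconstruction error shrinks as more basis functions are kept.
t = np.linspace(0, 1, 4000, endpoint=False)
K = np.exp(-5 * t) * np.sin(6 * t)  # an arbitrary smooth local kernel

def basis(M):
    fns = [np.ones_like(t)]
    for m in range(1, M + 1):
        fns.append(np.sqrt(2) * np.cos(2 * np.pi * m * t))
        fns.append(np.sqrt(2) * np.sin(2 * np.pi * m * t))
    return np.stack(fns)

def recon_error(M):
    P = basis(M)
    coeffs = P @ K / len(t)          # inner products <K, p_n> on [0, 1]
    return np.sqrt(np.mean((coeffs @ P - K) ** 2))

errs = [recon_error(M) for M in (2, 8, 32)]
assert errs[0] > errs[1] > errs[2]
```

In the SSM picture, the coefficients correspond to the role played by C, which is exactly why a trained C over the FouT basis can realize an arbitrary local filter.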

3.2.2. Approximating Delay Networks

An interesting property of these finite-window TSSMs is that they can approximate delay functions, defined as systems with impulse response K(t) = δ(t-1). Any HiPPO method involving finite windows should have this capability, in particular the finite-window methods LegT and FouT:

Theorem 5. For the FouT system (A, B) in (6), let C be (twice) the vector of evaluations of the basis functions, C_n = 2·p_n(1), and let D = 1. For the LegT system (A, B) in (5), let C be the vector of evaluations of the basis functions, C_n = p_n(1) = (2n+1)^{1/2} (-1)^n, and let D = 0. Then the SSM kernel K(t) = C e^{tA} B + D δ(t) limits to K(t) → δ(t-1) as N → ∞.

Theorem 5 is visualized in Fig. 1 for FouT, and in Fig. 7 in Appendix C.4. Further, the result for LegT can be characterized even more tightly for finite N (cf. Theorem 10 in Appendix C.4). The above result provides theoretical justification for why S4-FouT excels at dense memorization tasks (see Section 4).
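A numerical sketch of the LegT case of Theorem 5 (our own discretization choices; at finite N the kernel is only an approximate spike): building (A, B) from equation (5) and C from the theorem, the kernel C e^{tA} B should peak near the lag t = 1.

```python
import numpy as np
from scipy.linalg import expm

# HiPPO-LegT (equation (5)): entry (n, k) is -sqrt(2n+1) sqrt(2k+1) times
# 1 for k <= n and (-1)^(n-k) for k >= n.
N = 64
n = np.arange(N)
q = np.sqrt(2 * n + 1)
sign = np.where(n[:, None] >= n[None, :], 1.0,
                (-1.0) ** (n[:, None] - n[None, :]))
A = -np.outer(q, q) * sign
B = q.copy()
C = q * (-1.0) ** n  # C_n = (2n+1)^{1/2} (-1)^n from Theorem 5

# Evaluate K(t) = C e^{tA} B on a grid by stepping with M = e^{dt A}.
dt, steps = 0.01, 151
M = expm(dt * A)
v = B.copy()
Ks = []
for _ in range(steps):
    Ks.append(C @ v)
    v = M @ v
t_peak = dt * int(np.argmax(Ks))  # location of the spike, expected near 1
```

For N = 64 the kernel is a narrow oscillating spike around t = 1, consistent with the delay-network interpretation of the LMU.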

3.3. Properties of Time-invariant Orthogonal SSMs

We describe several general properties of TOSSMs, which let us answer the following questions: what does ∆ intuitively represent, and how should it be set in an SSM model? So far, this has been done in an ad-hoc way. It turns out that for TOSSMs, these two questions are closely related and have intuitive interpretations. See Appendix C.5 for more details on other properties of TOSSMs related to their closure and normalization.

Timescales. As discussed in Section 2, converting from continuous to discrete time involves a parameter ∆ that represents the step size of the discretization. This is an unintuitive quantity when working directly with discrete data, especially if the data is not sampled from an underlying continuous process. We observe the following fact: for all standard discretization methods (e.g. Euler, backward Euler, generalized bilinear transform, zero-order hold (Gu et al., 2021)), the discretized system depends on (A, B) and ∆ only through their products (∆A, ∆B). This implies that the SSM (A, B) discretized at step size ∆ is computationally equivalent to the SSM (∆A, ∆B) discretized at step size 1. Therefore, ∆ can be viewed simply as a scalar scaling of the base SSM instead of a change to the rate of the input. In the context of TOSSMs, this just stretches the underlying basis and measure (scalar scaling). The most intuitive example is a finite-window TOSSM such as LegT or FouT: discretizing this system with step size ∆ is equivalent to considering the system (∆A, ∆B) with step size 1, which produces basis functions supported exactly on [0, 1/∆].

This interpretation of the timescale ∆ lends itself to simple discrete-time corollaries of the previous continuous-time results. For example, LegT and FouT represent sliding windows of 1/∆ elements in discrete time.

Corollary 3.4. By Theorem 5, as N → ∞, the discrete convolutional kernel K → e_{⌈∆^{-1}⌉}, i.e. the discrete delay network with lag 1/∆.

Corollary 3.5. For the HiPPO-FouT matrices (A, B), by Theorem 4, as N → ∞, the discrete convolutional kernel K (over the choice of C) can represent any local convolution of length ⌊∆^{-1}⌋.

This discussion motivates the following definition. Properly normalized TOSSMs (A, B) model dependencies of expected length 1, and ∆ modulates them to model dependencies of length 1/∆, allowing fine-grained control of the context size of a TOSSM.

Definition 4 (Timescale of TOSSM). Define E[ω] = (∫_0^∞ t ω(t) dt) / (∫_0^∞ ω(t) dt) to be the timescale of a TOSSM having measure ω(t). A TOSSM is timescale normalized if it has timescale 1.

By this definition, HiPPO-LegS is timescale normalized. This motivates S4's initialization of ∆ log-uniformly in (0.001, 0.1), covering a geometric range of sensible timescales (expected length 10 to 1000). In Section 4 we show that the timescale can be chosen more precisely when the lengths of dependencies are known.
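The fact that discretization depends on (A, B, ∆) only through (∆A, ∆B) is easy to check numerically; a minimal sketch with the bilinear discretization (the rule itself is standard, the variable names are ours):

```python
import numpy as np

def bilinear(A, B, dt):
    # Bilinear (Tustin) discretization of x' = Ax + Bu at step size dt
    I = np.eye(A.shape[0])
    Ab = np.linalg.solve(I - dt / 2 * A, I + dt / 2 * A)
    Bb = np.linalg.solve(I - dt / 2 * A, dt * B)
    return Ab, Bb

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
B = rng.standard_normal((5, 1))
dt = 1e-2

# Discretizing (A, B) at step dt equals discretizing (dt*A, dt*B) at step 1.
Ab1, Bb1 = bilinear(A, B, dt)
Ab2, Bb2 = bilinear(dt * A, dt * B, 1.0)
assert np.allclose(Ab1, Ab2) and np.allclose(Bb1, Bb2)
```

The same identity holds symbolically for Euler, backward Euler, and zero-order hold, since each formula only ever multiplies A and B by ∆.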

4. Experiments

We study the empirical tradeoffs of our proposed S4 variants. We compare several S4 variants based on the TOSSMs introduced in this work, as well as simpler diagonal SSMs called S4D that are not orthogonal SSMs (Gu et al., 2022b). Corresponding to our main contributions, we hypothesize that:

• S4-LegS excels at sparse memorization tasks because it represents very smooth convolution kernels that memorize the input against an infinitely-long measure (Corollary 3.3, Fig. 1). Conversely, it is less appropriate for short-range tasks with dense information because it smooths out the signal.

• S4-FouT excels at dense memorization tasks because it can represent spike functions that pick out past elements at particular ranges (Section 3.2.2). However, it is less appropriate for very long range tasks because it represents a finite (local) window.

• ∆ can be initialized precisely based on known time dependencies in a given task.

4.1. Long Range Arena

The Long Range Arena (LRA) benchmark is a suite of sequence classification tasks designed to stress-test sequence models on long sequences. We improve S4's previous state of the art by another 6 points (Table 1). Validating our hypothesis, S4-LegS is extremely strong at the hardest long-range task (Path-X), involving sparse dependencies of length 16384, which FouT cannot solve because it is a finite-window method. Compared to the original S4 model, the S4-LegS method in Table 1 is the same model but differs in some sensible hyperparameters; the main differences are (i) using a bidirectional instead of autoregressive model, since the tasks do not require causality, (ii) adopting a more standard cosine learning rate scheduler rather than decaying on validation plateau, and (iii) increasing weight decay regularization. On top of these general changes, the primary source of improvements on Path-X arises from applying the theory of timescales in Section 3.3. Fig. 2 illustrates the importance of setting ∆ correctly. Instead of the standard initialization (∆_min, ∆_max) = (0.001, 0.1), these results were obtained by lowering the initialization of ∆ by a factor of 10, in accordance with the known length of dependencies in Path-X: this gives ∆_min = 10^{-4}, so that 1/∆_min is on the order of (but does not exceed) the length of the task, L = 16384. Empirically, performance is best when spreading out the range of ∆ with a larger ∆_max that covers a wider range of timescales and can potentially learn features at different resolutions, which are combined by a multi-layer deep neural network.

4.2. Theory: Function Reconstruction, Timescales, Normalization

Fig. 3 confirms the HiPPO theory of online function reconstruction (Proposition 1) for the proposed TOSSMs LegS and FouT. A follow-up question is whether the theory is necessary to develop methods that have this functionality, or whether other SSMs can still learn this task when trained. We also show a diagonal variant of S4-LegS called S4D-Inv, introduced in (Gu et al., 2022b), which approximates S4-LegS but is still worse. Appendix B.1, Fig. 6 analyzes a synthetic Reconstruction Task against a uniform measure, which validates that S4-LegT and S4-FouT are far better than other SSM variants, particularly when ∆ is initialized properly based on the length of the task.

Figure 3: (New HiPPO methods.) Function reconstruction predicted by our general theory. An input signal of length 10000 is processed sequentially, maintaining a state vector of size only x(t) ∈ R^64, which is then used to approximately reconstruct the entire history of the input. (Left) HiPPO-LegS (as an LTI system) orthogonalizes on the Legendre polynomials warped by an exponential change of basis, smoothing them out. This basis is orthogonal with respect to an exponentially decaying measure. Matching the intuition, the reconstruction is very accurate for the recent past and degrades further out, but still maintains information about the full history of the input, endowing it with long-range modeling capacity. This is the same as S4. (Right) HiPPO-FouT orthogonalizes on the truncated Fourier basis, similar to the original HiPPO-LegT or LMU.

4.3. Memorization: the Delay (continuous copying) Task

Next, we study how the synthetic reconstruction ability transfers to other tasks. The Delay Task requires models to learn a sequence-to-sequence map whose output is the input lagged by a fixed time period (Fig. 4a). For recurrent models, this task can be interpreted as requiring a memory buffer that continually remembers the latest elements it sees. This capability was the original motivation for the Legendre Memory Unit, the predecessor to HiPPO-LegT, which was explicitly designed to solve this task because it can encode a spike kernel (Fig. 7). In Fig. 4b, we see that our new S4-FouT actually outperforms S4-LegT, and both outperform all other methods when the timescale ∆ is set correctly. We note that this task, with a lag of just 1000 time steps, is too hard for baselines such as an LSTM and Transformer, which empirically did not learn better than random guessing (RMSE 0.43).

Figure 4: (Delay Task.) A synthetic memorization task: definition (Fig. 4a) and results in RMSE (Fig. 4b). (Left) Setting ∆ appropriately makes a large difference. For FouT (A, B), which encode finite-window basis functions (Fig. 1), the model can see a history of length up to 2/∆. For example, setting ∆ too large means the model cannot see 1000 steps in the past, and it performs at chance. Performance is best at the theoretically optimal value ∆ = 2·10^{-3}, which can encode a spike kernel at a distance of exactly 1000 steps (Corollary 3.4). (Right) When ∆ is set optimally, the proposed S4-FouT method is the best SSM, as the theory predicts. When ∆ is not set optimally, other methods perform better, including the simple diagonal methods proposed in (Gu et al., 2022b).

5. Discussion

This work improves the HiPPO framework, generalizing it to any set of orthonormal basis functions as projection operators. This led to a better understanding of existing models (clarifying the mechanisms underlying the original S4 model) as well as new variants (SSMs producing Fourier basis functions). In addition, we used our new framework to give principled explanations of other components such as the timescale and initialization, leading to improved empirical results. The theoretical insights provided by this work have been used to improve and extend SSMs in several directions. We showed that S4 produces exponentially-decaying kernels according to precise formulas (Corollary C.6), and Li et al. (2022) designed alternative exponentially-decaying CNN kernels inspired by this property. Another line of work on diagonal approximations to S4 uses insights from our theory to simplify and improve S4. DSS (Gupta, 2022) introduced a particular diagonal approximation which was empirically effective, and S4D (Gu et al., 2022b) proved that it produces the same kernels asymptotically as S4 (Corollary C.6). S5 (Smith et al., 2022) extended this to multi-input multi-output (MIMO) SSMs and showed that our recommendations for the initialization of C and ∆ are important even in the MIMO setting. We believe that the insights in this work will be useful both for understanding the original S4 model and for producing better and simpler state space models.

Figure 1: Specific cases of HiPPO matrices A, B are derived so that at every time t, the history of u up to time t can be reconstructed linearly from x(t) (red), according to a measure (green). (Left) The HiPPO-LegT method orthogonalizes onto the Legendre polynomials against a time-invariant uniform measure, i.e. sliding windows. (Right) The original HiPPO-LegS method is not a time-invariant system. When used as a time-varying ODE x'(t) = (1/t) A x(t) + (1/t) B u(t), x(t) represents the projection of the entire history of u onto the Legendre polynomials. It was previously unknown how to interpret the time-invariant version of this ODE using the same (A, B) matrices.

A Related Work

We discuss in more detail the differences between this work and the previous results in this line of work.

A.1 HiPPO OSSMs with Orthogonal Polynomial Kernels

HiPPO is an online function reconstruction framework, theoretically motivated and described in (Gu et al., 2020) and expanded on in (Gu et al., 2021). By projecting sequence data onto polynomial bases, a function's history can be represented in a latent space. Every measure µ (with some mild restrictions) on the finite interval [−1,1] induces a unique sequence of orthogonal polynomials (OPs) p_0(x), p_1(x), ... satisfying deg(p_i) = i and

⟨p_i, p_j⟩_µ = ∫_{−1}^{1} p_i(x) p_j(x) dµ(x) = δ_{ij}  for all i, j,

where δ_{ij} = 1 if i = j and 0 otherwise. This sequence forms an OP family. For a function u, HiPPO gives a compressed representation of the history of u on the interval [t−θ(t), t] in the N coefficients given by (0 ≤ n < N):

x_n(t) = ∫_{t−θ(t)}^{t} u(s) p_n(t,s) µ(t,s) ds,    (8)

where p_n(t,s) and µ(t,s) are transformations of the OP family onto the interval [t−θ(t), t]. Specifically, one can choose

p_n(t,s) = p_n(2(s−t)/θ(t) + 1)  and  µ(t,s) = µ(2(s−t)/θ(t) + 1).    (9)

Note that Eq. (9) and Eq. (8) correspond exactly to Eq. (7). When θ(t) = θ for a fixed θ, this corresponds to the 'truncated' window from (Gu et al., 2020), while the case θ(t) = t, which considers the entire interval [0,t], is the 'scaled' window case from (Gu et al., 2020). This is illustrated in Fig. 5.

(Gu et al., 2020) only considered the above for the case µ(x) = 1, for which the corresponding OP family is the Legendre polynomials; θ(t) = θ and θ(t) = t give LegT and LegS respectively in (Gu et al., 2020). From the viewpoint of Definition 2, it is easy to see that choosing an OP family p_i(x) and its measure µ defines an OSSM (A(t), B(t)). However, both (Gu et al., 2020) and (Gu et al., 2021) start from Eq. (8) and show that by differentiating Eq. (8) with respect to t one can derive the corresponding SSM x′(t) = A(t)x(t) + B(t)u(t). Specifically, for the case µ(x) = 1 (i.e. Legendre) the above simplifies to:

1. For θ(t) = θ (i.e. LegT), one gets x′(t) = (1/θ)·A x(t) + (1/θ)·B u(t), where A and B are as in Eq. (5).
2. For θ(t) = t (i.e. LegS), one gets x′(t) = (1/t)·A x(t) + (1/t)·B u(t), where A and B are as in Eq. (4).
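As a concrete, unofficial sanity check of the LegS dynamics above, the sketch below integrates the time-varying ODE x′(t) = (1/t)·A x(t) + (1/t)·B u(t) and then reconstructs the input's history from the final coefficients using shifted Legendre polynomials. The bilinear (trapezoid) step is our choice for numerical stability, not necessarily the discretization used in the papers:

```python
import numpy as np
from numpy.polynomial.legendre import legval

N = 32
n = np.arange(N)
# HiPPO-LegS matrices (Eq. (4)): lower-triangular A, B_n = (2n+1)^{1/2}
A = np.tril(-np.sqrt(np.outer(2 * n + 1, 2 * n + 1)), -1) - np.diag(n + 1.0)
B = np.sqrt(2 * n + 1.0)

u = lambda s: np.sin(2 * np.pi * s)          # test input with u(0) = 0
dt, T = 2e-4, 1.0
x, I = np.zeros(N), np.eye(N)
t = dt
while t < T - 1e-12:
    tm = t + dt / 2                          # evaluate the system at the step midpoint
    At = A / tm
    # trapezoid (bilinear) step of x'(t) = (1/t) A x + (1/t) B u(t)
    x = np.linalg.solve(I - (dt / 2) * At,
                        (I + (dt / 2) * At) @ x + dt * (B / tm) * u(tm))
    t += dt

# reconstruct the whole history: u(s) ~ sum_n x_n(T) L_n(s/T),
# where L_n(y) = (2n+1)^{1/2} P_n(2y - 1) is the shifted normalized Legendre basis
s = np.linspace(0, T, 1001)
Ls = np.stack([np.sqrt(2 * k + 1) * legval(2 * s / T - 1, np.eye(N)[k])
               for k in range(N)])
err = np.sqrt(np.mean((x @ Ls - u(s)) ** 2))
```

With a state of only N = 32 coefficients, the reconstruction error over the entire interval [0, T] is small, illustrating the "scaled" window behavior.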

A.2 The mystery of S4

The S4 paper (Gu et al., 2022b) used a curious 'mixture' of LegS and LegT. Specifically, in its experiments, it used the ODE in Eq. (10), but instead of using A and B as in Eq. (5), it used A and B as in Eq. (4). However, the LegS A and B from Eq. (4) had been derived for θ(t) = t, not θ(t) = θ. In other words, there was no known mathematical justification for the ODE used in S4. One of our main results is to provide a solid mathematical justification for this ODE (see Section 3.1).

A.3 Why OP kernels?

A crucial insight of the HiPPO framework is that the coefficients x(t) are sufficient to recover u. This enables online predictions for end-to-end models. The intuition is that OPs form a complete basis for functions over [−1,1]. Specifically, given a C¹-smooth function u: [−1,1] → ℝ seen online, we wish to maintain a compressed representation of its history u(s)|_{s≤t} at every time t. For any infinite polynomial basis p_0, p_1, ..., we get the polynomial expansion of u:

u(t) = Σ_{n=0}^{∞} x_n(t) p_n(t).

The truncation of this expansion to N terms is:

û(t) = Σ_{n=0}^{N−1} x_n(t) p_n(t).    (12)

If the p_i are an OP family, then the approximation û(t) is guaranteed to be optimal: the error with respect to the measure µ goes to 0 as N → ∞, so û becomes a perfect reconstruction of u. Further, given the measure µ(x), it is known that the OP family corresponding to µ gives the best possible approximation among all degree-(N−1) polynomial approximations. The main insight of HiPPO (Gu et al., 2020) was to extend the framework above from the interval [−1,1] to [0,t] such that the approximation of Eq. (12) can be updated efficiently as t increases. In Appendix C.2, we expand the HiPPO framework to any set of differentiable orthogonal functions with respect to a given measure, and generalize the concepts behind LegS using the time-warping function σ. We use this to derive a general form for time-varying OSSMs, and give a mathematical interpretation of LegS's state matrix.
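The optimality of truncated OP expansions can be illustrated numerically. The sketch below (ours, with u(x) = e^x as an arbitrary smooth test function) projects onto the orthonormal Legendre polynomials on [−1,1] and shows the L² error shrinking rapidly as N grows:

```python
import numpy as np
from numpy.polynomial.legendre import leggauss, legval

nodes, weights = leggauss(100)        # Gauss-Legendre quadrature on [-1, 1]
u = np.exp(nodes)                     # a smooth test function

def trunc_error(N):
    # orthonormal Legendre basis: p_n(x) = sqrt((2n+1)/2) P_n(x)
    P = np.stack([np.sqrt((2 * k + 1) / 2) * legval(nodes, np.eye(N)[k])
                  for k in range(N)])
    coef = P @ (weights * u)          # coefficients x_n = <u, p_n>
    u_hat = coef @ P                  # truncated expansion with N terms
    return np.sqrt(np.sum(weights * (u - u_hat) ** 2))

errs = [trunc_error(N) for N in (2, 4, 8)]
```

For smooth functions the error decays faster than any polynomial rate, which is what makes a small number N of coefficients a useful compressed memory.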

A.4 HiPPO vs LSSL

As discussed above, HiPPO can be thought of as a framework for deriving state space models corresponding to specific polynomial bases. The original paper (Gu et al., 2020) did not explicitly draw the connection to state space models, and developed systems only for a few particular cases, which were called LegS (a time-varying system involving Legendre polynomials), LegT (a time-invariant system with the truncated Legendre polynomials), and LagT (involving Laguerre polynomials). A follow-up paper on Linear State Space Layers (LSSL) (Gu et al., 2021) generalized these results to all orthogonal polynomial families, and also generalized the flexibility of the time-varying component. They produced SSMs x′(t) = A(t)x(t) + B(t)u(t) where, at all times t, x(t) can be viewed as the projection of the history u(s)|_{s≤t} onto orthogonal polynomials p_n rescaled onto the interval [t−θ(t), t], where θ(t) is an arbitrary window length. (Indeed this is the form we outlined above.) This generalized all 3 cases of the original HiPPO paper.

A.5 Legendre Memory Unit (Legendre Delay Network)

The HiPPO-LegT matrix (5) was first introduced as the LMU (Voelker, 2019; Voelker et al., 2019). The original motivation was to produce a state space model that approximates the Delay Network, which can be defined as the LTI system that transforms u(t) into u(t−1), i.e. lags the input by 1 time unit. This can also be defined as the system with impulse response K(t) = δ(t−1), i.e. it convolves with the kernel with a δ spike at time 1. The connection between the Delay Network and Legendre polynomials was made in two steps. First, the transfer function of the ideal system is L[δ(t−1)](s) = e^{−s}, which must be approximated by a proper rational function to be represented as an SSM. Taking Padé approximants of this function yields "optimal" approximations by rational functions, which can then be distilled into an SSM (A,B,C) whose transfer function C(sI−A)^{−1}B matches it.
Second, the SSM basis e^{tA}B for this system can be computed and found to match Legendre polynomials. However, despite making this connection and writing out formulas for this SSM, (Voelker, 2019) did not provide a complete proof of either of these two connections. The preceding two steps that motivated the LDN can be informally written as the chain of transformations: (i) transfer function e^{−s} → (ii) SSM (A,B,C) → (iii) Legendre polynomials e^{tA}B. The HiPPO framework in a sense proceeded in the opposite direction. (Gu et al., 2020) started by defining the system that convolves with truncated Legendre polynomials, and with a particular differentiation technique showed that it could be written as a particular SSM, which they called HiPPO-LegT. This SSM turned out to be the same (up to a minor change in scaling) as the original (A,B) defined by the LMU, thus proving the second of the two steps relating this particular SSM to the Legendre polynomials. In this work, we show the final piece in this reverse chain of equivalences. In particular, we start from the LegT SSM (A,B,C) and directly prove that its transfer function produces Padé approximants of the exponential. Our proof introduces new techniques in an inductive argument that can be applied to HiPPO SSMs beyond the LegT case, and relates them to continued fraction expansions of the exponential. We also comment on a minor difference between the parameterization of HiPPO-LegT and the LMU. The LMU is originally defined as x′(t) = (1/θ)·A x(t) + (1/θ)·B u(t), where θ is a hyperparameter that controls the length of the window. However, such constant scaling of the SSM is also controlled by the step size ∆, as discussed in Section 3.3. Therefore θ is redundant with ∆, so the LegT matrices defined in (Gu et al., 2020) and in this work do not have a concept of θ.
Additionally, in this work we redefine the LegT matrices (A,B) to be scaled by a factor of 2 to make them properly timescale normalized, using the theory developed in Section 3.3.

A.6 Our framework

Compared to these works, our framework (Definition 2) simplifies and generalizes the concepts directly in terms of (time-varying) state space models. We define a more natural concept of orthogonal SSM, derive very general instantiations of it (Section 3.1), and flesh out its properties (Section 3.3). Our general result subsumes all prior cases, including all cases of the LSSL, as a direct corollary. Some concrete advantages include:
• It allows more flexible transformations of polynomial bases, such as including a change of basis inside the polynomials. The previously explained case of LegS is an instance of this, with basis functions L_n(e^{−t}) involving an exponential change of basis instead of vanilla polynomials.
• It can be applied to non-polynomial bases, such as the truncated Fourier basis FouT.
• It does not require considering multiple cases depending on where the basis functions are supported. Instead, we handle this by considering discontinuities in the basis functions.

A.7 Application in deep learning systems

While the preceding discussion covers theoretical interpretations of SSMs, S4 (Gu et al., 2022a) (and its predecessor LSSL (Gu et al., 2021)) are the application of these SSMs to deep learning. In comparison to prior works such as the LMU and HiPPO, which require a pre-determined system (A,B) and incorporate it naively into an RNN, LSSL and S4 use a full state space model (A,B,C) as a completely trainable deep learning layer. Doing this required resolving computational problems with the SSM, which was the main focus of S4. Specifically, the results in HiPPO (Gu et al., 2020) and LSSL (Gu et al., 2021) only guaranteed theoretical efficiency: they showed how the various computations can be done with a near-linear number of arithmetic operations, but did not guarantee any sort of numerical stability. The main theoretical contribution of S4 (Gu et al., 2022a) was to give a numerically stable algorithm. In this work, we make a distinction between HiPPO, which is the theoretical derivation and interpretation of particular SSMs (A,B), and S4, which is the incorporation of those SSMs as a trainable deep learning layer with a particular algorithm.

B Experiment Details and Additional Experiments

B.1 Synthetic Reconstruction Task

We construct a synthetic Reconstruction Task against a uniform measure. The input is a white noise sequence u ∈ ℝ^{4000}. We use a single-layer linear S4 model with state size N = 256 and H = 256 hidden units. Models are required to use their output at the last time step, a vector y_{4000} ∈ ℝ^{256}, to reconstruct the last 1000 elements of the input with a linear probe. Concretely, the objective is to minimize ‖u_{3000:4000} − W y_{4000}‖₂², where W ∈ ℝ^{1000×256} is a learned matrix. Models are trained with the Adam optimizer with learning rate 0.001 for 20 epochs. Fig. 6 shows that S4-LegT and S4-FouT, the methods that theoretically reconstruct against a uniform measure, are far better than the other methods. We include the new diagonal variants (S4D) proposed in (Gu et al., 2022b), which are simpler SSM methods that generally perform well but do not learn the right function on this task. We also include a method S4-(LegS/FouT) which combines the LegS and FouT measures by simply initializing half of the SSM kernels to each. Despite having fewer S4-FouT kernels, this still performs as well as the pure S4-FouT initialization.
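For concreteness, the probe objective can be sketched as follows. The random arrays are stand-ins for the model output and probe weights; only the shapes and the loss formula come from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
u = rng.standard_normal(4000)                    # white-noise input sequence
y_last = rng.standard_normal(256)                # stand-in for the model output y_4000
W = 0.01 * rng.standard_normal((1000, 256))      # the learned probe matrix
loss = np.sum((u[3000:4000] - W @ y_last) ** 2)  # ||u_{3000:4000} - W y_4000||_2^2
```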

B.2 Delay (Continuous Copying) Task

The Delay Task consists of input-output pairs where the input is a white noise signal of length 4000, bandlimited to 1000 Hz. The output is the same signal shifted by 1000 steps (Fig. 4a). We use single-layer linear SSMs with H = 4 hidden units and state size N = 1024. Models are trained with the Adam optimizer with learning rate 0.001 for 20 epochs.
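The data generation can be sketched as follows. The 8000 Hz sampling rate is our assumption; the text specifies only the length, the shift, and the 1000 Hz band limit:

```python
import numpy as np

rng = np.random.default_rng(0)
L, shift, fs, cutoff = 4000, 1000, 8000, 1000    # fs (Hz) is an assumed sampling rate

x = rng.standard_normal(L + shift)               # white noise, long enough to shift
X = np.fft.rfft(x)
freqs = np.fft.rfftfreq(L + shift, d=1 / fs)
X[freqs > cutoff] = 0                            # band-limit to `cutoff` Hz
x = np.fft.irfft(X, n=L + shift)

u = x[shift:]                                    # input: the last L samples
y = x[:-shift]                                   # target: the input delayed by `shift`
```

By construction y[i] = u[i − shift] wherever both are defined, which is the continuous-copying behavior the SSM must learn.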

B.3 Long Range Arena

The settings for LRA use the same hyperparameters as (Gu et al., 2022b), where a more detailed protocol can be found. To be self-contained, we recreate the same table of parameters in Table 2.

C Proof Details

We furnish the missing proofs from Section 2 in Appendix C.1. We will describe our general framework and results in Appendix C.2, and prove the results in Sections 3.1 to 3.3 in Appendices C.3 to C.5 respectively. 

C.1 Proofs from Background

This corresponds to results from Section 2. Recall that for OSSMs, (p,ω) and K uniquely determine each other, so we can refer to an OSSM by either. One direction is obvious: (p,ω) determines K via K_n(t,s) = p_n(t,s)·ω(t,s).

Proposition 6. If a set of kernel functions satisfies K_n(t,s) = p_n(t,s)·ω(t,s), where the functions p_n are complete and orthogonal w.r.t. ω (equation (7), right), then p and ω are unique.

Proof of Proposition 6. Suppose for the sake of contradiction that there is a second basis and measure q_n, µ such that q_n is complete and orthogonal w.r.t. µ, and K_n = q_n µ. By completeness, there are coefficients c_{ℓk} such that p_ℓ = Σ_k c_{ℓk} q_k. Then

∫ p_ℓ q_j µ = Σ_k c_{ℓk} ∫ q_k q_j µ = Σ_k c_{ℓk} δ_{kj} = c_{ℓj}.

But q_j µ = K_j = p_j ω, so ∫ p_ℓ q_j µ = ∫ p_ℓ p_j ω = δ_{ℓj}. So c_{ℓj} = δ_{ℓj}, which implies that p_ℓ = q_ℓ for all ℓ; since K_ℓ = p_ℓ ω = q_ℓ µ, it follows that µ = ω as well, as desired.

The main barrier to using Proposition 1 for function reconstruction is that SSMs are in general not OSSMs. For example, even though we will show that (4) is a TOSSM, and that unitary conjugation of a TOSSM is a TOSSM (Section 3.3), its diagonal matrix of eigenvalues is not a TOSSM. This both shows the existence of an SSM that is not an OSSM, and also implies that general conjugation does not preserve TOSSMs.

Proposition 7. There is no TOSSM with the diagonal state matrix A = diag{−1,−2,...}.

Proof of Proposition 7. The SSM kernels are K_n(t) = e^{−t(n+1)}·B_n. Assume B_n ≠ 0 so that the kernels are not degenerate. Suppose for the sake of contradiction that this was a TOSSM with measure ω(t). Then we must have

∫ K_n(s) K_m(s) ω(s)^{−1} ds = δ_{nm}.

Plugging in n = 1, m = 1 and n = 0, m = 2 gives

1 = ∫ e^{−2s} B_1 · e^{−2s} B_1 · ω(s)^{−1} ds = B_1² ∫ e^{−4s} ω(s)^{−1} ds,
0 = ∫ e^{−s} B_0 · e^{−3s} B_2 · ω(s)^{−1} ds = B_0 B_2 ∫ e^{−4s} ω(s)^{−1} ds.

Since B_0, B_1, B_2 ≠ 0, the integral ∫ e^{−4s} ω(s)^{−1} ds would have to be simultaneously nonzero and zero, a contradiction.
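The obstruction in Proposition 7 can also be seen numerically: the inner products depend only on n + m, so ⟨K_1, K_1⟩ and ⟨K_0, K_2⟩ integrate the same function for any weight. A small check (ours, with B_n = 1 and an arbitrary positive weight):

```python
import numpy as np

t = np.linspace(0, 50, 200001)
c = 0.7                                  # any rate giving a positive weight
w_inv = np.exp(-c * t)                   # a candidate omega(t)^{-1}
K = lambda n: np.exp(-(n + 1) * t)       # kernels with B_n = 1
g11 = np.trapz(K(1) * K(1) * w_inv, t)   # would have to equal 1 for a TOSSM
g02 = np.trapz(K(0) * K(2) * w_inv, t)   # would have to equal 0 for a TOSSM
# but both integrate the same function e^{-4t} * omega(t)^{-1}
```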

C.2 General theory

Consider a measure supported on [0,1] with density ω(t)I(t), where I(t) is the indicator function of the interval [0,1]. Let the measure be equipped with a set of orthonormal basis functions p_0, ..., p_{N−1}, i.e.

∫ p_j(s) p_k(s) ω(s) I(s) ds = δ_{jk},    (13)

where the integrals in this paper are over the range (−∞,∞) unless stated otherwise. This is sufficient to derive an OSSM based on the HiPPO technique. The generalized HiPPO framework demonstrates how to build (T)OSSMs utilizing time warping to shape the time interval and tilting to construct new sets of orthogonal basis functions. Given a general interval [ℓ,r], we will use the notation I[ℓ,r] to denote the indicator function for the interval [ℓ,r]; we drop the interval if ℓ = 0, r = 1. We will need the notion of a "time warping" function σ, as follows:

Definition 5. A time warping function is a map σ(t,·): (−∞,t] → [0,1] such that σ(t,t) = 1.

We will be using a special case of time-warping function, which we say has a discontinuity at t₀ for some t₀ ∈ (−∞,t]:

σ(t,s) = I[t₀,t](s)·σ(t,s),    (14)

such that

(∂/∂t) σ_s(t,s) = c(t)·σ_s(t,s).    (15)

We allow t₀ = −∞, in which case we think of the interval [t₀,t] as (−∞,t].

Before proceeding, let us clarify our notation. We will use σ_t and σ_s to denote the partial derivatives ∂σ/∂t and ∂σ/∂s respectively. We will drop the parameters (t,s) and write f instead of f(t,s) when clear from context, to reduce notational clutter. Further, we extend this notation to function composition, writing g(f) for (g∘f)(t,s), and to function products, writing fgh instead of f(t,s)g(t,s)h(t,s); finally, we shorten f·g·(h∘ϕ)(t,s) to fgh(ϕ).

We also define the tilting χ and show that, regardless of warping, we can construct a new orthogonal basis (note that the result holds for warping functions as in (14), not just those as in (15)).

Lemma C.1. For a set of functions {p_n}_{n=0}^{N−1} orthonormal over the measure ωI, the basis functions

q_k^t(σ(t,s)) = χ(t,s)·p_k(σ(t,s))

are orthogonal over the measure

µ(t,s) = ω(σ(t,s))·I[t₀,t](s)·σ_s(t,s)·χ(t,s)^{−2}

for any time-warping function σ satisfying (14) and any χ(t,s) that is non-zero on its support.

Proof. Consider the following sequence of equalities:

∫ p_j(σ) p_k(σ) ω(σ) I[t₀,t] σ_s ds = ∫_{t₀}^{t} p_j(σ) p_k(σ) ω(σ) σ_s ds = ∫_{σ(t,t₀)}^{σ(t,t)} p_j(y) p_k(y) ω(y) dy = δ_{jk}.

In the above, the second equality follows from the substitution y ← σ(t,s), hence dy = σ_s ds, and the final equality follows from (13). Then, since χ(t,s) is always non-zero, we have

∫ (χ p_j(σ))·(χ p_k(σ))·ω(σ)·I[t₀,t]·σ_s·χ^{−2} ds = δ_{jk},

as desired.

Without loss of generality, we can split χ into a product

χ(t,s) = 1/(ψ(σ(t,s))·ϕ(t,s))    (16)

of one part that depends on σ and another arbitrary component.

Time Warped HiPPO. Since we have an orthonormal basis and measure, we can try to derive the (T)OSSM. For a given input signal u(t), the HiPPO coefficients are defined as the projections

x_n(t) = ⟨u, χp_n⟩_µ = ∫ u(s)·χ·(p_n ω)(σ)·I[t₀,t]·σ_s·χ^{−2} ds,

i.e. the inner product of u with the tilted basis functions χp_n with respect to the measure µ as defined in Lemma C.1. For additional convenience, we use the decomposition χ = ψ^{−1}ϕ^{−1} from (16) to get:

x_n(t) = ∫ u(s)·(p_n ωψ)(σ)·I[t₀,t]·σ_s·ϕ ds.    (17)

The HiPPO technique is to differentiate this integral in t in a way such that the derivative can be related back to x_n(t) and the other x_k(t). We require that for every n there is a set of coefficients {γ_{nk}}_{k=0}^{N−1} such that

σ_t·(p_n ωψ)′(σ) = Σ_{k=0}^{N−1} γ_{nk}·(p_k ωψ)(σ),    (18)

and for the tilting component ϕ,

(d/dt) ϕ(t,s) = d(t)·ϕ(t,s).    (19)

Theorem 8. Consider a set of basis functions p_n orthogonal over ω, a time warping σ(t,s) as in (14) and (15), and a tilting χ as in (16) and (19), with the functions σ, p_n, ω, ψ obeying (18).
If dt₀/dt ≠ 0, further assume that for some vector A′ we have, as N → ∞,

u(t₀) = c·Σ_{k=0}^{N−1} A′_k·x_k(t) + d·u(t).    (20)

Then (A⁰ + (c(t)+d(t))I − cD(A′)ᵀ, B − dD) is an OSSM for basis functions χ·p_n(σ) with measure ω·I[t₀,t]·σ_s·χ^{−2}, where A⁰_{nk} = γ_{nk} with γ_{nk} as in (18), and

D_n = (p_n ωψ)(σ(t,t₀))·(σ_s ϕ)(t,t₀)·(dt₀/dt),  B_n = (p_n ωψ)(1)·(σ_s ϕ)(t,t).

Proof. Applying the Leibniz rule to (17), we get

x′_n(t) = x⁽⁰⁾_n(t) + x⁽¹⁾_n(t) + x⁽²⁾_n(t) + x⁽³⁾_n(t),

where

x⁽⁰⁾_n(t) = ∫ u(s)·σ_t·(p_n ωψ)′(σ)·I[t₀,t]·σ_s·ϕ ds,
x⁽¹⁾_n(t) = ∫ u(s)·(p_n ωψ)(σ)·I[t₀,t]·(∂/∂t)(σ_s ϕ) ds,

and the x⁽²⁾_n(t) + x⁽³⁾_n(t) terms capture the terms we get when differentiating I[t₀,t]. Let us consider each term separately.

The first term, x⁽⁰⁾_n(t), corresponds to the differentiation of the basis functions and measure. In order to relate it to {x_k(t)}, it suffices that σ_t·(p_n ωψ)′(σ) satisfies (18), which implies that when we vectorize this term, we get x⁽⁰⁾(t) = A⁰·x(t).

For the additional warping and tilting terms, consider x⁽¹⁾_n(t). Recall from (15) that ∂_t(σ_s) = c(t)·σ_s. Then the above and (19) imply ∂_t(σ_s ϕ) = c(t)·(σ_s ϕ) + d(t)·(σ_s ϕ), where c(t), d(t) are defined as in (15) and (19). We end up with x⁽¹⁾_n(t) = (c(t)+d(t))·x_n(t), leading to the vectorized form x⁽¹⁾(t) = (c(t)+d(t))·I·x(t).

We now need to handle

x⁽²⁾_n(t) + x⁽³⁾_n(t) = ∫ u(s)·(p_n ωψ)(σ)·[(∂/∂t) I[t₀,t]]·(σ_s ϕ) ds.    (22)

For the above, note that I[t₀,t](s) = H(s−t₀) − H(s−t), where H(x) is the Heaviside step function. It is known that H′(x) = δ(x), which implies

(∂/∂t) I[t₀,t] = δ(s−t) − (dt₀/dt)·δ(s−t₀).

Using the above in the RHS of (22), we separate out x⁽²⁾_n(t) and x⁽³⁾_n(t) as follows. First, define

x⁽²⁾_n(t) = ∫ u(s)·(p_n ωψ)(σ)·δ(s−t)·σ_s·ϕ ds = u(t)·(p_n ωψ)(σ(t,t))·(σ_s ϕ)(t,t) = u(t)·(p_n ωψ)(1)·(σ_s ϕ)(t,t).

In the last equality, we have used the fact that σ(t,t) = 1 by definition. It follows that in vectorized form we have x⁽²⁾(t) = B·u(t). Finally, define

x⁽³⁾_n(t) = −∫ u(s)·(p_n ωψ)(σ)·δ(s−t₀)·(dt₀/dt)·σ_s·ϕ ds = −u(t₀)·(p_n ωψ)(σ(t,t₀))·(σ_s ϕ)(t,t₀)·(dt₀/dt).

If dt₀/dt = 0, then we have D = 0 and hence x⁽³⁾(t) = 0 = −cD(A′)ᵀ·x(t) − dD·u(t). If dt₀/dt ≠ 0, then as N → ∞, from (20) the above comes out to

x⁽³⁾_n(t) = −(c·Σ_{k=0}^{N−1} A′_k·x_k(t) + d·u(t))·(p_n ωψ)(σ(t,t₀))·(σ_s ϕ)(t,t₀)·(dt₀/dt).

It follows that in vectorized form we have x⁽³⁾(t) = −cD(A′)ᵀ·x(t) − dD·u(t). The result follows after combining the terms.

We see that the behavior of the model is dictated by t₀. In particular, in this paper, we will consider two special cases.

Corollary C.2 (t₀ independent of t). The SSM (A⁰ + (c(t)+d(t))I, B) satisfying the conditions of Theorem 8 with t₀ independent of t is an OSSM for basis functions χ·p_n(σ) with measure ω·I[t₀,t]·σ_s·χ^{−2}, where A⁰_{nk} = γ_{nk} as in (18) and B_n = (p_n ωψ)(1)·(σ_s ϕ)(t,t).

Proof. Follows from Theorem 8: since t₀ is independent of t, dt₀/dt = 0 and hence D = 0.

Corollary C.3 (t₀ = t−θ). The SSM (A⁰ + (c(t)+d(t))I − cD(A′)ᵀ, B − dD) satisfying the conditions of Theorem 8 with t₀ = t−θ for a fixed θ is an OSSM with basis functions χ·p_n(σ) and measure ω·I[t₀,t]·σ_s·χ^{−2}, where A⁰_{nk} = γ_{nk} as in (18), D_n = (p_n ωψ)(σ(t,t−θ))·(σ_s ϕ)(t,t−θ), and B_n = (p_n ωψ)(1)·(σ_s ϕ)(t,t).

Proof. This follows directly from Theorem 8 by setting t₀ = t−θ (so that dt₀/dt = 1).
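Lemma C.1 can be checked numerically in a concrete instance (our sketch, not the paper's code): with ω = 1 (the Legendre case), χ = 1, and the exponential warping σ(t,s) = e^{s−t}, the warped basis L_j(e^{s−t}) should be orthonormal under the induced measure σ_s = e^{s−t} on (−∞,t]:

```python
import numpy as np
from numpy.polynomial.legendre import legval

def L(j, y):  # shifted normalized Legendre on [0, 1]
    return np.sqrt(2 * j + 1) * legval(2 * y - 1, np.eye(j + 1)[j])

t = 0.0
s = np.linspace(t - 40.0, t, 400001)     # truncation of the interval (-inf, t]
y = np.exp(s - t)                        # the warped coordinate sigma(t, s)
# Gram matrix of the warped basis under the induced measure sigma_s = e^{s-t}
G = np.array([[np.trapz(L(j, y) * L(k, y) * y, s) for k in range(5)]
              for j in range(5)])
```

The Gram matrix comes out approximately equal to the identity, as the change of variables y = σ(t,s) in the proof predicts.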

C.3.1 Explanation of S4-LegS

Consider the case when ψ = ω^{−1}, i.e. the measure is completely "tilted" away, and let

(∂/∂t) σ(t,s) = a(t)·σ(t,s) + b(t).    (23)

Let us consider the special case of (23) where b(t) = 0. This is most generally satisfied by σ(t,s) = exp(z(s) − a(t)), and the condition σ(t,t) = 1 forces z = a. Hence, we have

σ(t,s) = exp(a(s) − a(t)).    (24)

We now consider the following special case of Corollary C.2:

Corollary C.4. Let η ≥ 0. The SSM (−a′(t)·(A + (η+1)I), a′(t)·B), where t₀ is independent of t, is an OSSM for basis functions and measure

p̃_n(t,s) = (ω(σ)/σ^η)·p_n(σ),  ω̃(t,s) = I(σ)·a′(s)·σ^{2η+1}/ω(σ),

where σ satisfies (24), ϕ(t,s) = exp(ηa(s) − ηa(t)) = σ^η, A = (α_{nk}) such that

y·p′_n(y) = Σ_{k=0}^{n} α_{nk}·p_k(y),    (26)

and B_n = p_n(1).

Proof. We are given an orthonormal basis p_0, p_1, ..., p_{N−1} with respect to a measure ω. Note that a time-warping function σ satisfying (24) has σ_s = a′(s)·σ. We fix the tilting χ(t,s) = ω(σ)/σ^η, which follows by setting ψ = ω^{−1} and ϕ = σ^η. We show shortly that we satisfy the pre-conditions of Corollary C.2, which implies (with our choice of χ and σ) that we have an OSSM with basis functions p̃_n(t,s) = (ω(σ)/σ^η)·p_n(σ) and measure

ω̃(t,s) = ω(σ(t,s))·I[t₀,t]·σ_s(t,s)·χ(t,s)^{−2} = I[t₀,t]·a′(s)·σ^{2η+1}/ω(σ).

To complete the proof, we show that our choice of parameters satisfies the conditions of Corollary C.2 (by showing they satisfy the conditions of Theorem 8). We verify that σ and ϕ satisfy (15) and (19): noting that ∂_t(σ_s) = −a′(t)·σ_s and ∂_t(ϕ) = −ηa′(t)·ϕ, setting c(t) = −a′(t) and d(t) = −ηa′(t) is enough. Further, (24) and the fact that ψ = ω^{−1} imply that (18) is satisfied as long as

σ_t·(p_n ωψ)′(σ) = −a′(t)·σ·p′_n(σ) = −a′(t)·Σ_{k=0}^{n} α_{nk}·p_k(σ)

for some set of coefficients {α_{nk}}, which is exactly (26). This implies the γ_{nk} in Corollary C.2 satisfy γ_{nk} = −a′(t)·α_{nk}. Let A be the matrix with A_{nk} = α_{nk}, and note that A⁰ + (c(t)+d(t))I = −a′(t)·(A + (η+1)I) is exactly the first parameter of the SSM in Corollary C.2. Similarly, recall that in Corollary C.2, B_n = p_n(1)·(σ_s ϕ)(t,t) = p_n(1)·a′(t), where the final equality follows since in our case σ_s(t,t) = a′(t)·exp(a(t)−a(t)) = a′(t) and ϕ(t,t) = 1. Overloading notation and letting B_n = p_n(1), all conditions of Corollary C.2 hold, from which the claimed result follows.

We are particularly interested in the following two special cases of Corollary C.4.

Corollary C.5. The SSM (−(1/t)·(A+I), (1/t)·B) is an OSSM for basis functions p_n(s/t)·ω(s/t) with measure (1/t)·I(s/t)/ω(s/t), where A = (α_{nk}) as in (26) and B_n = p_n(1).

Proof. Letting a′(t) = 1/t implies that a(t) = ln t. Then this is a case of Corollary C.4 with time warping σ(t,s) = exp(ln s − ln t) = s/t. We set η = 0 in Corollary C.4, which sets ϕ = σ⁰ = 1 and gives the tilting χ = ϕ^{−1}ψ^{−1} = ω(σ). Then by Corollary C.4, we can use σ and χ to build an OSSM with basis functions (ω(σ)/σ^η)·p_n(σ) = ω(s/t)·p_n(s/t) and measure I(σ)·a′(s)·σ^{2η+1}/ω(σ) = (1/t)·I(σ)/ω(σ). The result follows.

Corollary C.6. The SSM (−(A+I), B) is an OSSM for basis functions p_n(e^{s−t})·ω(e^{s−t}) with measure ω̃ = I(e^{s−t})·e^{s−t}/ω(e^{s−t}), where A = (α_{nk}) as in (26) and B_n = p_n(1).

Proof. This is a case of Corollary C.4 where a′(t) = 1, σ = exp(s−t), and we pick η = 0, implying that ϕ = σ⁰ = 1 and χ = ϕ^{−1}ψ^{−1} = ω(σ). Utilizing Corollary C.4, we can use σ and χ to build an OSSM with basis functions (ω(σ)/σ^η)·p_n(σ) = ω(e^{s−t})·p_n(e^{s−t}) and measure I(σ)·a′(s)·σ^{2η+1}/ω(σ) = I(σ)·e^{s−t}/ω(e^{s−t}). This gives our final result.

Next we instantiate Corollary C.4 to prove Corollary 3.1. (Even though it is not strictly needed, we also instantiate Corollary C.5 and Corollary C.6 to prove Theorem 3 and Corollary 3.3, respectively.) To that end, we will need the following result:

Lemma C.7. Let the Legendre polynomials orthonormal over the interval [0,1] be denoted L_n. Then

y·L′_n(y) = n·L_n(y) + √(2n+1)·Σ_{k=0}^{n−1} √(2k+1)·L_k(y),    (27)

L′_n(y) = 2√(2n+1)·Σ_{0≤k≤n−1, n−k odd} √(2k+1)·L_k(y),    (28)

and

L_n(0) = (2n+1)^{1/2}·(−1)ⁿ,  L_n(1) = (2n+1)^{1/2}.    (29)

Proof. The Legendre polynomials satisfy the following orthogonality condition over [−1,1]:

∫_{−1}^{1} P_m(z) P_n(z) dz = (2/(2n+1))·δ_{mn}.

Let us denote the normalized Legendre polynomials orthonormal over [−1,1] as λ_n·P_n(z), where λ_n = √((2n+1)/2). To orthogonalize them over [0,1], let y = (1+z)/2; it follows that z = 2y−1 and dz = 2dy. Note that we then have

∫_{−1}^{1} P_m(z) P_n(z) dz = ∫_0^1 2·P_m(2y−1)·P_n(2y−1) dy,

which implies ∫_0^1 ((2n+1)/2)·2·P_m(2y−1)·P_n(2y−1) dy = δ_{mn}. Then, letting

L_n(y) = √2·λ_n·P_n(2y−1) = √(2n+1)·P_n(2y−1),    (30)

we have a set of functions over [0,1] such that ∫_0^1 L_m(y)·L_n(y) dy = δ_{mn}. From (Chihara, 2011, (2.8), (2.9)), note that P_n(−1) = (−1)ⁿ and P_n(1) = 1. This implies L_n(0) = √(2n+1)·P_n(−1) and L_n(1) = √(2n+1)·P_n(1), proving (29). Next, (30) implies

L′_n(y) = 2√(2n+1)·P′_n(2y−1) = 2√(2n+1)·P′_n(z).

From (Gu et al., 2020, equation (7)), P′_n(z) = Σ_{0≤k≤n−1, n−k odd} (2k+1)·P_k(z). Using (30) on the above, we get (28). We now consider

y·L′_n(y) = 2y·√(2n+1)·P′_n(z) = (1+z)·√(2n+1)·P′_n(z).

From (Gu et al., 2020, equation (8)), (z+1)·P′_n(z) = n·P_n(z) + Σ_{k=0}^{n−1} (2k+1)·P_k(z). Then the above becomes

y·L′_n(y) = √(2n+1)·(n·P_n(z) + Σ_{k=0}^{n−1} (2k+1)·P_k(z)).

Since (30) implies P_n(z) = L_n(y)/√(2n+1), we conclude

y·L′_n(y) = n·L_n(y) + √(2n+1)·Σ_{k=0}^{n−1} √(2k+1)·L_k(y),

which is (27).

We now re-state and prove Corollary 3.1:

Corollary C.8 (Corollary 3.1, restated). Let L_n be the Legendre polynomials orthonormal over the interval [0,1], and define σ(t,s) = exp(a(s)−a(t)). The SSM (a′(t)·A, a′(t)·B) is an OSSM with

ω̃(t,s) = I(σ(t,s))·a′(s)·σ(t,s),  p̃_n(t,s) = L_n(σ(t,s)),

where A and B are defined as in (4).

Proof. Our basis functions, the Legendre polynomials, are orthonormal with respect to the uniform measure, which allows us to invoke Corollary C.4 with ω = 1. Further, here we have t₀ = −∞ and η = 0. This gives the SSM (−a′(t)·(A⁰+I), a′(t)·B), where A⁰_{nk} = α_{nk} as in (26) and B_n = L_n(1). From (29), observe that B_n = (2n+1)^{1/2}. From (27), we have

α_{nk} = { (2n+1)^{1/2}·(2k+1)^{1/2} if k < n; n if k = n; 0 otherwise }.

We write A = −(A⁰+I). Indeed,

−(A⁰+I)_{nk} = −{ (2n+1)^{1/2}·(2k+1)^{1/2} if k < n; n+1 if k = n; 0 if k > n }.

Thus the A and B match those in (4), which completes our claim.

We now re-state and prove Theorem 3:

Corollary C.9 (Theorem 3, restated). Let L_n be the Legendre polynomials orthonormal over the interval [0,1]. Then the SSM ((1/t)·A, (1/t)·B) is an OSSM for basis functions L_n(s/t) and measure (1/t)·I[t₀,t], where A and B are defined as in (4).

Proof. Our basis functions, the Legendre polynomials, are orthonormal with respect to the uniform measure, which allows us to invoke Corollary C.5 with ω = 1. This gives

x′(t) = (1/t)·(−(A⁰+I)·x(t) + B·u(t)),

where A⁰_{nk} = α_{nk} as in (26) and B_n = L_n(1) = (2n+1)^{1/2} by (29). As in the proof of Corollary C.8, A = −(A⁰+I) and B match (4), which completes our claim.

We now restate and prove Corollary 3.3:

Corollary C.10 (Corollary 3.3, restated). Let L_n be the Legendre polynomials orthonormal over the interval [0,1]. Then the SSM (A,B) is a TOSSM for basis functions L_n(e^{−t}) with measure ω = I[t₀,t]·e^{−t}, where A, B are defined as in (4).

Proof. We consider our basis functions, the Legendre polynomials, orthonormal with respect to the uniform measure, the warping function σ = exp(s−t), and the tilting χ = ω(σ) = 1. We note that σ = exp(s−t) satisfies (24) with a′(t) = 1, which allows us to invoke Corollary C.6. Then

x′(t) = −(A⁰+I)·x(t) + B·u(t)

orthogonalizes against the basis functions L_n(e^{s−t}) with measure I[−∞,t]·e^{s−t}, where A⁰_{nk} = α_{nk} as in (26). Since these SSM kernels satisfy K_n(t,s) = K_n(t−s), we get the claimed time-invariant SSM form utilizing the same argument for A and B as in the proof of Corollary C.9.

This explains why removing the 1/t factor from HiPPO-LegS still works: it is orthogonalizing onto the Legendre polynomials with an exponential "warping".
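This interpretation can be verified numerically (our sketch, not the paper's code): the impulse-response kernels (e^{tA}B)_n of the time-invariant LegS SSM should match L_n(e^{−t})·e^{−t}. To keep the check self-contained, e^{tA}B is computed by integrating x′ = Ax with RK4 rather than calling a matrix-exponential routine:

```python
import numpy as np
from numpy.polynomial.legendre import legval

N = 8
n = np.arange(N)
# HiPPO-LegS (A, B) from Eq. (4)
A = np.tril(-np.sqrt(np.outer(2 * n + 1, 2 * n + 1)), -1) - np.diag(n + 1.0)
B = np.sqrt(2 * n + 1.0)

def expm_B(A, B, t, steps=20000):
    # x(t) = e^{tA} B via RK4 on x'(t) = A x(t), x(0) = B
    h, x = t / steps, B.astype(float).copy()
    for _ in range(steps):
        k1 = A @ x
        k2 = A @ (x + h / 2 * k1)
        k3 = A @ (x + h / 2 * k2)
        k4 = A @ (x + h * k3)
        x = x + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)
    return x

max_err = 0.0
for t in (0.1, 0.5, 1.0, 3.0):
    K = expm_B(A, B, t)                  # SSM kernels at lag t
    y = np.exp(-t)                       # exponentially warped coordinate in [0, 1]
    Ln = np.array([np.sqrt(2 * k + 1) * legval(2 * y - 1, np.eye(N)[k])
                   for k in range(N)])
    max_err = max(max_err, np.max(np.abs(K - Ln * y)))
```

The two sides agree to high precision at every lag checked, confirming that the time-invariant system decomposes its input onto exponentially-warped Legendre polynomials.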

C.4.1 LegT Derivation

Corollary C.11. Let L_n be the Legendre polynomials orthonormal over the interval [0,1], and let σ = 1 − (t−s)/θ for a constant θ. Then the SSM ((1/θ)·A, (1/θ)·B) is an OSSM for basis functions L_n(σ) with measure (1/θ)·I[t₀,t](σ), where A, B are defined as in (5).

Proof. Our plan is to apply Corollary C.3, for which we must show that the basis functions L_n, the time warping σ(t,s), and the tilting χ(t,s) = ψ^{−1}ϕ^{−1}(t,s) satisfy (18), (15), and (19), respectively. We first set some parameters: since ω = 1, we set ψ = ϕ = 1. Since σ_t = −1/θ, we have

σ_t·(L_n ωψ)′(σ) = −(1/θ)·L′_n(σ).

Together with (28), this shows the Legendre polynomials satisfy (18) with

γ_{nk} = (1/θ)·{ −2·(2n+1)^{1/2}·(2k+1)^{1/2} if k < n and n−k is odd; 0 otherwise }.    (31)

We also note that σ_s = 1/θ. It follows that (d/dt)σ_s = 0, satisfying (15) trivially with c(t) = 0. Similarly, since ϕ = 1, (19) is also satisfied trivially with d(t) = 0. Finally, we note that the L_n form a complete basis over [0,1]; hence, as N → ∞,

u(t−θ) = Σ_{k=0}^{N−1} x_k(t)·L_k(σ(t,t−θ)) = Σ_{k=0}^{N−1} x_k(t)·L_k(0).

The above defines A′ by setting A′_n = L_n(0) (as well as c = 1 and d = 0). Now, by Corollary C.3, we have an SSM (A⁰ − D(A′)ᵀ, B′), where D_n = (1/θ)·L_n(0), A⁰_{nk} = γ_{nk} (as in (31)), and B′_n = (1/θ)·L_n(1). From (29), we have D_n = (1/θ)·(2n+1)^{1/2}·(−1)ⁿ and B′_n = (1/θ)·(2n+1)^{1/2}. Thus,

(A⁰ − D(A′)ᵀ)_{nk} = (1/θ)·{ −(2n+1)^{1/2}·(2k+1)^{1/2}·(2+(−1)^{n−k}) if k < n and n−k is odd; −(2n+1)^{1/2}·(2k+1)^{1/2}·(−1)^{n−k} otherwise }.

The proof is complete by noting that A⁰ − D(A′)ᵀ = (1/θ)·A and B′ = (1/θ)·B.

(We note that ∫_0^1 L_k(1−t)·L_j(1−t) dt = ∫_0^1 L_k(t)·L_j(t) dt.) We first give a proof of Theorem 4. Then, we prove Theorem 9 as a function approximation result pertaining to S4-FouT.

C.4.2 Explanation of S4-FouT

Proof of Theorem 4. We seek to derive A and B ′ from (6) using Corollary C.3: We use the time-warping function σ(t,s) = 1-(t-s), which implies that we have σ s (t,s) = 1, (32) ∂ ∂t σ s (t,s) = 0 (33) Thus, we can take c(t) = 0 in ∂ ∂t σ s (t,s) = c(t)σ s (t,s). ( ) We then have χ(t,s) = 1 as we set ψ(t,s) = ϕ(t,s) = 1, (35) d dt ϕ(t,s) = 0. So, we can take d(t) = 0 in d dt ϕ(t,s) = d(t)ϕ(t,s). ( ) We also have ω(σ) = 1, and we order our bases in the form p n = (1,c 1 (t),s 1 (t),c 2 (t),s 2 (t),...)foot_2 , where the basis functions have derivatives: (1) ′ (σ) = 0; (c n ) ′ (σ) = -2πns n (σ); (s n ) ′ (σ) = 2πnc n (σ). Consequently, we can define γ nk as follows: γ nk =    2πn n-k = 1, k odd -2πk k-n = 1, n odd 0 otherwise . ( ) Further, the discontinuity is at t 0 = t-θ, θ = 1 which implies that dt0 dt = 1. We now seek to use the stored approximation to u at time t to compute u(t-1). First, denote the latent state x(t) with coefficients x = (x 1 (t),x c 1 (t),x s 1 (t),x c 2 (t),x s 2 (t),...) and define the functions v(s) and w(s) such that we have  v(s) = u(2t- ) Towards that end, we examine the sine and cosine coefficients of u and v as follows: ⟨v,c n ⟩ = v(s)c n (σ(t,s))I[t-1,t]ds = u(2t-s-1)c n (σ(t,s))I[t-1,t]ds = u(s ′ )c n (1-σ(t,s ′ ))I[t-1,t]ds ′ (43) = u(s ′ )c n (σ(t,s ′ ))I[t-1,t]ds ′ = ⟨u,c n ⟩. ⟨v,s n ⟩ = v(s)s n (σ(t,s))I[t-1,t]ds = u(2t-s-1)s n (σ(t,s))I[t-1,t]ds = u(s ′ )s n (1-σ(t,s ′ ))I[t-1,t]ds ′ (44) = -u(s ′ )s n (σ(t,s ′ ))I[t-1,t]ds ′ = -⟨u,s n ⟩. Here, for ( 43) and ( 44), we use the change of variables s ′ ← 2t-s-1, which gives us σ(t,s) = 1-(t-s) = 1-(1+t-s-1) = 1-[1-(t-(2t-s-1))] = 1-(1-(t-s ′ )) = 1-σ(t,s ′ ). Then, we use the fact that c n (1 -σ(t,s ′ )) = c n (σ(t,s ′ )) but s n (1 -σ(t,s ′ )) = -s n (σ(t,s ′ )). That is, both u and v have the same cosine coefficients but negated sine coefficients of each other. 
But, we know that both s_n(σ(t,t-1)) = s_n(0) = 0 and s_n(σ(t,t)) = s_n(1) = 0, and hence the reconstruction at the endpoints σ(t,t-1) = 0 and σ(t,t) = 1 depends only on the cosine coefficients, whence the reconstructions û and v̂ agree at both endpoints. Therefore, we have û(t,t) = v̂(t,t), implying that ŵ(t,t) = û(t,t). Note that w is continuous and periodic, and the basis {1, c_n, s_n}_n is complete for such functions; hence, as N → ∞, ŵ → w. Thus, at s = t, we have

û(t,t) = ŵ(t,t) = w(t) = (u(t)+v(t))/2 = (u(t)+u(t-1))/2,

which completes the proof of the claim in (42). Recall from (39) that we can express the stored approximation of u, given by û(t,s), as

û(t,s) = x^1(t) + Σ_k x^c_k(t)c_k(σ(t,s)) + Σ_k x^s_k(t)s_k(σ(t,s)).

For the value at s = t, the approximation û(t,t) is then given by

û(t,t) = x^1(t) + Σ_k x^c_k(t)c_k(1) + Σ_k x^s_k(t)s_k(1) = x^1(t) + √2 Σ_k x^c_k(t).

Due to (42), we know u(t-1) = 2û(t,t) - u(t), which combined with the above yields

u(t-1) = 2x^1(t) + 2√2 Σ_k x^c_k(t) - u(t).  (45)

Finally, with regard to Corollary C.3, as for Theorem 8: (34) satisfies (15), and (37) satisfies (19), with (38) satisfying (18) for A_0. Moreover, from (45), we can take c = 1, d = -1, and

A′_k = { 2  k = 0;  2√2  k odd;  0  otherwise }

to satisfy (20). Invoking Corollary C.3 now yields the following OSSM:foot_3

(A_0 + (c(t)+d(t))I - cD(A′)^⊤,  B - dD),

where A_0,nk = γ_nk, with D_n and B_n specified as follows:

D_n = { 1  n = 0;  √2  n odd;  0  otherwise }  (46)
B_n = { 1  n = 0;  √2  n odd;  0  otherwise }  (47)

Here, the values are derived from the expressions of Corollary C.3: D_n = (p_n ωψ)(σ(t,t-1))·(σ_s ϕ)(t,t-1) and B_n = (p_n ωψ)(1)·(σ_s ϕ)(t,t). Recall that we have p_n ∈ {1, c_n, s_n}, ω(t,s) = 1, and, from (32) and (35), σ_s(t,s) = 1 with ψ(t,s) = ϕ(t,s) = 1. Thus, (46) is because 1(0)·1 = 1 and s_n(0)·1 = 0 but c_n(0)·1 = √2; similarly, (47) is because 1(1)·1 = 1 and s_n(1)·1 = 0 but c_n(1)·1 = √2. Now, we have

[D(A′)^⊤]_nk = { 2  n = k = 0;  2√2  n = 0, k odd, or k = 0, n odd;  4  n, k odd;  0  otherwise }
[dD]_n = { -1  n = 0;  -√2  n odd;  0  otherwise }.

As c(t) = d(t) = 0, we define A ← A_0 - cD(A′)^⊤ and B ← B - dD, given by

A_nk = { -2  n = k = 0;  -2√2  n = 0, k odd, or k = 0, n odd;  -4  n, k odd;  2πn  n-k = 1, k odd;  -2πk  k-n = 1, n odd;  0  otherwise }
B_n = { 2  n = 0;  2√2  n odd;  0  otherwise }.
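The final A and B can be written down directly. Below is a sketch that builds them from the case analysis above, assuming the coefficient ordering (1, x^c_1, x^s_1, x^c_2, x^s_2, ...), so that odd indices correspond to cosine coefficients:

```python
import numpy as np

def hippo_fout(N):
    """Construct the N x N state matrix A and vector B from the proof of
    Theorem 4 (the FouT system), following its piecewise formulas."""
    A = np.zeros((N, N))
    B = np.zeros(N)
    for n in range(N):
        for k in range(N):
            if n == 0 and k == 0:
                A[n, k] = -2.0
            elif (n == 0 and k % 2 == 1) or (k == 0 and n % 2 == 1):
                A[n, k] = -2.0 * np.sqrt(2)
            elif n % 2 == 1 and k % 2 == 1:
                A[n, k] = -4.0
            elif n - k == 1 and k % 2 == 1:      # skew term gamma_{nk} from (38)
                A[n, k] = 2.0 * np.pi * n
            elif k - n == 1 and n % 2 == 1:
                A[n, k] = -2.0 * np.pi * k
        if n == 0:
            B[n] = 2.0
        elif n % 2 == 1:
            B[n] = 2.0 * np.sqrt(2)
    return A, B
```

Note that the five cases are mutually exclusive (the γ_nk entries always have one even index ≥ 2), so the order of the branches does not matter.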

C.4.3 Function Approximation Error

Theorem 9. Let K(t) be a differentiable kernel on [0,1], and let K̂(t) be its representation by the FouT system (Theorem 4) with state size N. If K is L-Lipschitz, then for ϵ > 0 and N ≥ (L/(πϵ))² + 2, we have ∥K - K̂∥ ≤ ϵ. If K has k derivatives bounded by L, then we can take N ≥ (L/(π^k ϵ))^{2/(2k-1)} + 2.

Proof of Theorem 9. First, the state size being N dictates that there are ⌊N/2⌋ basis functions of each type s_n and c_n. We fix a time t and denote by x^c_n and x^s_n the respective coefficients for the c_n and s_n bases corresponding to S4-Fou. Since {s_n, c_n}_{n≥0} forms an orthonormal basis, by Parseval's identity we have

∥K - K̂∥²₂ = Σ_{n=⌊N/2⌋}^∞ (x^c_n)² + (x^s_n)².  (48)

Thus, in order to bound the error, it suffices to bound the high-order coefficients, by integration by parts:

x^c_n = ⟨K, c_n⟩ = ∫₀¹ K(t)c_n(t)dt = [K(t)·(1/(2πn))s_n(t)]₀¹ - (1/(2πn))∫₀¹ K′(t)s_n(t)dt = -(1/(2πn))∫₀¹ K′(t)s_n(t)dt,

where the bracketed boundary term vanishes because s_n is periodic. Therefore

|x^c_n| ≤ (1/(2πn))|∫₀¹ K′(t)s_n(t)dt| ≤ (1/(2πn))∫₀¹ |K′(t)||s_n(t)|dt ≤ L/(2πn),

where we use the fact that K is L-Lipschitz. For x^s_n, a similar argument holds, and we get

|x^s_n| ≤ (1/(2πn))|∫₀¹ K′(t)c_n(t)dt| ≤ (1/(2πn))∫₀¹ |K′(t)||c_n(t)|dt ≤ L/(2πn).

Due to (48), this then implies that

∥K - K̂∥²₂ = Σ_{n≥⌊N/2⌋} |x^c_n|² + |x^s_n|² ≤ Σ_{n≥⌊N/2⌋} 2L²/(2πn)² = (2L²/(2π)²) Σ_{n≥⌊N/2⌋} 1/n² ≤ (2L²/(2π)²)·1/(⌊N/2⌋-1) ≤ L²/(π²(N-2)).  (49)

We use (49) to get the following estimate on ∥K - K̂∥:

∥K - K̂∥₂ ≤ L/(π√(N-2)).

Thus, it suffices for N to satisfy

L/(π√(N-2)) ≤ ϵ  ⟹  √(N-2) ≥ L/(πϵ)  ⟹  N ≥ (L/(πϵ))² + 2.

We now apply the same argument using the fact that K has bounded derivatives of order k. By iterating the integration by parts, we get

|x^s_n|, |x^c_n| ≤ (1/(2πn)^k)|∫₀¹ K^{(k)}(t)s_n(t)dt| ≤ (1/(2πn)^k)∫₀¹ |K^{(k)}(t)||s_n(t)|dt ≤ L/(2πn)^k.
Again, due to (48), this gives us the following estimate on the squared error:

∥K - K̂∥²₂ = Σ_{n≥⌊N/2⌋} |x^c_n|² + |x^s_n|² ≤ Σ_{n≥⌊N/2⌋} 2L²/(2πn)^{2k} = (2L²/(2π)^{2k}) Σ_{n≥⌊N/2⌋} 1/n^{2k} ≤ (2L²/(2π)^{2k})·1/(⌊N/2⌋-1)^{2k-1} ≤ L²/(π^{2k}(N-2)^{2k-1}).  (50)

If K has bounded derivatives of order k, then we use (50) to get the following estimate on ∥K - K̂∥:

∥K - K̂∥₂ ≤ (L/π^k)(N-2)^{-k+1/2}.

Again, it suffices for N to satisfy

(L/π^k)(N-2)^{-k+1/2} ≤ ϵ  ⟹  (N-2)^{k-1/2} ≥ L/(π^k ϵ)  ⟹  N ≥ (L/(π^k ϵ))^{2/(2k-1)} + 2.
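As a sanity check of the first bound, the sketch below projects a 1-Lipschitz kernel onto the truncated Fourier modes retained by a state of size N and compares the L² residual against L/(π√(N-2)); the triangle-wave test kernel is an arbitrary choice.

```python
import numpy as np

grid = np.linspace(0.0, 1.0, 8193)
h = grid[1] - grid[0]

def integrate(f):
    # trapezoidal rule on [0, 1]
    return h * (f.sum() - 0.5 * (f[0] + f[-1]))

K = np.abs(grid - 0.5)          # 1-Lipschitz test kernel, so L = 1
L, N = 1.0, 16

# Project onto the orthonormal basis {1, sqrt(2)cos(2 pi n t), sqrt(2)sin(2 pi n t)},
# keeping the modes n < floor(N/2) that a state of size N retains
K_hat = np.full_like(grid, integrate(K))
for n in range(1, N // 2):
    c_n = np.sqrt(2) * np.cos(2 * np.pi * n * grid)
    s_n = np.sqrt(2) * np.sin(2 * np.pi * n * grid)
    K_hat += integrate(K * c_n) * c_n + integrate(K * s_n) * s_n

err = np.sqrt(integrate((K - K_hat) ** 2))
bound = L / (np.pi * np.sqrt(N - 2))
assert err <= bound             # Theorem 9's Lipschitz bound holds
```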

C.4.4 Approximating Delay Networks

The original motivation for the LDN/LMU (Voelker, 2019; Voelker et al., 2019) worked backward from the transfer function of the desired delay impulse response K(t) = δ(t-1), and noticed that the SSMs given by Padé approximations to this transfer function were linked to Legendre polynomials. This was not fully proven; we state the result here and provide a full proof.

Theorem 10. For A, B, C, D in the LegT system described in Theorem 5, the transfer function L{K(t)}(s) is the [N-1/N] Padé approximant to e^{-s} = L{δ(t-1)}(s).

We remark that although LegT (LMU) is designed to be an "optimal" approximation to the delay function via Padé approximants, it actually produces a weaker spike function than FouT (Fig. 7 vs. Fig. 1) and empirically performs slightly worse on synthetic tasks testing this ability (Section 4.3). This may be because Padé approximation in the Laplace domain does not necessarily translate to localization in the time domain.

Finally, we prove Theorem 10. Note that this is a stronger version of the LegT portion of Theorem 5, while the FouT portion is a corollary of the proof of Theorem 4. We start by working out some calculations concretely to provide an example. The SSM corresponding to HiPPO-LegT (for N = 4) is

A = P^{1/2} [ -1  1 -1  1 ;  -1 -1  1 -1 ;  -1 -1 -1  1 ;  -1 -1 -1 -1 ] P^{1/2},  B = P^{1/2} 1,  C = Z^⊤ P^{1/2},

where P = diag{1+2n} and Z^⊤ = [1 -1 1 -1]. The transfer function is

C(sI - A)^{-1}B = Z^⊤(sP^{-1} - A)^{-1} 1.

(In the right-hand side, and for the rest of this part, we redefine A to be the ±1 matrix found above for convenience.)

Case N = 1. We have A = -1 and B = C = 1, and the transfer function is C(sI - A)^{-1}B = 1/(1+s).

Case N = 2. The transfer function is

C(sI - A)^{-1}B = [1 -1] (sP^{-1} - [ -1  1 ;  -1 -1 ])^{-1} [1 ; 1]
= [1 -1] [ s+1  -1 ;  1  s/3+1 ]^{-1} [1 ; 1]
= (1/(s²/3 + 4s/3 + 2)) [1 -1] [ 1+s/3  1 ;  -1  1+s ] [1 ; 1]
= (2 - 2s/3)/(s²/3 + 4s/3 + 2)
= (1 - s/3)/(1 + 2s/3 + s²/6).

It can be verified that this is indeed the [1/2] Padé approximant of exp(-s).

A General Recursion. We will now sketch out a method to relate these transfer functions recursively.
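The concrete cases above are easy to check numerically. The sketch below evaluates Z^⊤(sP^{-1} - A)^{-1}1, with A taken to be the ±1 sign matrix and Z assumed to alternate starting from +1 as in the text, and compares against the closed forms (and, for larger N, against e^{-s}, which the [N-1/N] Padé approximant matches closely for small s):

```python
import numpy as np

def legt_transfer(N, s):
    """Transfer function of HiPPO-LegT in the form Z^T (s P^{-1} - A)^{-1} 1,
    where A is the +-1 sign matrix from the text and P = diag(1 + 2n)."""
    A = np.empty((N, N))
    for n in range(N):
        for k in range(N):
            # -1 on and below the diagonal; alternating +1/-1 above it
            A[n, k] = -1.0 if k <= n else (1.0 if (k - n) % 2 == 1 else -1.0)
    Z = np.array([(-1.0) ** n for n in range(N)])        # [1, -1, 1, ...]
    Pinv = np.diag(1.0 / (1.0 + 2.0 * np.arange(N)))
    return Z @ np.linalg.solve(s * Pinv - A, np.ones(N))

# Case N = 1: 1 / (1 + s)
assert np.isclose(legt_transfer(1, 1.0), 0.5)
# Case N = 2 at s = 1: (1 - 1/3) / (1 + 2/3 + 1/6) = 4/11
assert np.isclose(legt_transfer(2, 1.0), 4.0 / 11.0)
# Larger N: close to e^{-s} for moderate s
assert abs(legt_transfer(4, 0.3) - np.exp(-0.3)) < 1e-4
```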
We will redefine Z to be the vector that ends in +1. The main idea is to write

A_n = [ A_{n-1}  Z_{n-1} ;  -1^⊤_{n-1}  -1 ],   sP^{-1}_n - A_n = [ sP^{-1}_{n-1} - A_{n-1}   -Z_{n-1} ;  1^⊤_{n-1}   1 + s/(2n+1) ].

Now we can use the block matrix inversion formula.foot_4 Ideally, this would produce a recurrence in which the desired transfer function Z^⊤_n(sP^{-1}_n - A_n)^{-1}1_n depends on Z^⊤_{n-1}(sP^{-1}_{n-1} - A_{n-1})^{-1}1_{n-1}. However, looking at the block matrix inversion formula, it becomes clear that there are also dependencies on terms like 1^⊤_{n-1}(sP^{-1}_{n-1} - A_{n-1})^{-1}1_{n-1} and Z^⊤_{n-1}(sP^{-1}_{n-1} - A_{n-1})^{-1}Z_{n-1}. The solution is to track all of these terms simultaneously. We will compute the four transfer functions

H_n(s) := [ H^{1z}_n(s)  H^{11}_n(s) ;  H^{zz}_n(s)  H^{z1}_n(s) ] := [ 1^⊤_n ;  Z^⊤_n ] (sP^{-1}_n - A_n)^{-1} [ Z_n  1_n ].

Lemma C.12. Instead of using the explicit block matrix inversion formula, it will be easier to work with the following factorization used to derive it (block LDU decompositionfoot_5 ):

(sP^{-1}_n - A_n)^{-1} = [ I_{n-1}  (sP^{-1}_{n-1} - A_{n-1})^{-1}Z_{n-1} ;  0  1 ] · [ (sP^{-1}_{n-1} - A_{n-1})^{-1}  0 ;  0  (1 + s/(2n+1) + H^{1z}_{n-1}(s))^{-1} ] · [ I_{n-1}  0 ;  -1^⊤_{n-1}(sP^{-1}_{n-1} - A_{n-1})^{-1}  1 ].

Now we compute

[ 1^⊤_n ; Z^⊤_n ] [ I_{n-1}  (sP^{-1}_{n-1} - A_{n-1})^{-1}Z_{n-1} ;  0  1 ] = [ 1^⊤_{n-1}  1 ;  -Z^⊤_{n-1}  1 ] [ I_{n-1}  (sP^{-1}_{n-1} - A_{n-1})^{-1}Z_{n-1} ;  0  1 ] = [ 1^⊤_{n-1}  1+H^{1z}_{n-1}(s) ;  -Z^⊤_{n-1}  1-H^{zz}_{n-1}(s) ]

and

[ I_{n-1}  0 ;  -1^⊤_{n-1}(sP^{-1}_{n-1} - A_{n-1})^{-1}  1 ] [ Z_n  1_n ] = [ I_{n-1}  0 ;  -1^⊤_{n-1}(sP^{-1}_{n-1} - A_{n-1})^{-1}  1 ] [ -Z_{n-1}  1_{n-1} ;  1  1 ] = [ -Z_{n-1}  1_{n-1} ;  1+H^{1z}_{n-1}(s)  1-H^{11}_{n-1}(s) ].

Now we can derive the full recurrence for all of these functions.
These satisfy the following recurrence:

H_n(s) = [ 1^⊤_n ; Z^⊤_n ] (sP^{-1}_n - A_n)^{-1} [ Z_n  1_n ]
= [ 1^⊤_{n-1}  1+H^{1z}_{n-1}(s) ;  -Z^⊤_{n-1}  1-H^{zz}_{n-1}(s) ] · [ (sP^{-1}_{n-1} - A_{n-1})^{-1}  0 ;  0  (1 + s/(2n+1) + H^{1z}_{n-1}(s))^{-1} ] · [ -Z_{n-1}  1_{n-1} ;  1+H^{1z}_{n-1}(s)  1-H^{11}_{n-1}(s) ]
= [ 1^⊤_{n-1} ; -Z^⊤_{n-1} ] (sP^{-1}_{n-1} - A_{n-1})^{-1} [ -Z_{n-1}  1_{n-1} ] + (1 + s/(2n+1) + H^{1z}_{n-1}(s))^{-1} [ 1+H^{1z}_{n-1}(s) ;  1-H^{zz}_{n-1}(s) ] [ 1+H^{1z}_{n-1}(s)  1-H^{11}_{n-1}(s) ].

In terms of the rescaled functions G, the entries satisfy the recurrences

G^{1z}_n(s) = 1 - G^{1z}_{n-1}(s) + G^{1z}_{n-1}(s)G^{1z}_{n-1}(s) / (G^{1z}_{n-1}(s) + s/(2(2n+1))),
G^{11}_n(s) = G^{11}_{n-1}(s) - G^{11}_{n-1}(s)G^{1z}_{n-1}(s) / (G^{1z}_{n-1}(s) + s/(2(2n+1))),
G^{zz}_n(s) = G^{zz}_{n-1}(s) - G^{zz}_{n-1}(s)G^{1z}_{n-1}(s) / (G^{1z}_{n-1}(s) + s/(2(2n+1))),
G^{z1}_n(s) = G^{z1}_{n-1}(s) - (-1)^{n-1} G^{11}_{n-1}(s)G^{zz}_{n-1}(s) / (G^{1z}_{n-1}(s) + s/(2(2n+1))).
We can analyze each term separately.

Case G^{1z}_n(s). This will be the most important term, as it determines the denominator of all the expressions. Simplifying the recurrence slightly gives

G^{1z}_n(s) = 1 - G^{1z}_{n-1}(s) + G^{1z}_{n-1}(s)² / (G^{1z}_{n-1}(s) + s/(2(2n+1))).

Writing G^{1z}_n(s) = P^{1z}_n(s)/Q^{1z}_n(s) as a ratio of polynomials, this results in the recurrence

Q^{1z}_n(s) = P^{1z}_{n-1}(s) + (s/(2(2n+1)))·Q^{1z}_{n-1}(s),
P^{1z}_n(s) = Q^{1z}_{n-1}(s) - (s/(2(2n+1)))·P^{1z}_{n-1}(s).

But this is exactly the fundamental recurrence formula for the continuants of the continued fraction e^s = 1 + s/(1 - (1/2)s/(1 + ⋯)). Note also that

G^{1z}_{n-1}(s) + s/(2(2n+1)) = (P^{1z}_{n-1}(s) + (s/(2(2n+1)))·Q^{1z}_{n-1}(s)) / Q^{1z}_{n-1}(s) = Q^{1z}_n(s)/Q^{1z}_{n-1}(s).

Going forward, we will also suppress the superscript of Q, writing Q_{n-1}(s) := Q^{1z}_{n-1}(s), as it will become evident that all terms have the same denominator Q_n(s).

Case G^{11}_n(s). First note that G^{11}_n(s) = G^{zz}_n(s), which is straightforward from the fact that their recurrences are identical. The recurrence is

G^{11}_n(s) = G^{11}_{n-1}(s) - G^{11}_{n-1}(s)G^{1z}_{n-1}(s) / (G^{1z}_{n-1}(s) + s/(2(2n+1)))
= (s/(2(2n+1)))·G^{11}_{n-1}(s) / (G^{1z}_{n-1}(s) + s/(2(2n+1)))
= (s/(2(2n+1)))·G^{11}_{n-1}(s)Q_{n-1}(s) / (P^{1z}_{n-1}(s) + (s/(2(2n+1)))·Q_{n-1}(s))
= (s/(2(2n+1)))·G^{11}_{n-1}(s)Q_{n-1}(s) / Q_n(s).

Therefore G^{11}_n(s)Q_n(s) = (s/(2(2n+1)))·G^{11}_{n-1}(s)Q_{n-1}(s), which telescopes over n.



Footnotes:
1. The results on orthogonal polynomials also work for the infinite interval [0,∞) via the Laguerre polynomials; we ignore this case for simplicity, but point out that Gu et al. (2020) handles it.
2. In this work we presented everything for the finite interval [0,1], but since [-1,1] is more standard for orthogonal polynomials and is what was used in Gu et al. (2020), we stick with [-1,1] in this section. It is easy to move from one to the other by an appropriate linear scaling of the argument.
3. Note that this is 0-indexed.
4. Recall that, like the coefficients, the matrices are 0-indexed.
5. https://en.wikipedia.org/wiki/Block_matrix#/Block_matrix_inversion
6. https://en.wikipedia.org/wiki/Schur_complement#/Background



Figure 1: (Left: LegS) We prove that the particular A matrix chosen in S4 produces Legendre polynomials under an exponential rescaling, resulting in smooth basis functions with a closed-form formula. (Middle, Right: FouT) We derive a new SSM that produces approximations to the truncated Fourier basis, perhaps the most intuitive and ubiquitous set of basis functions. This method generalizes sliding Fourier transforms and local convolutions (i.e., CNNs), and can also encode spike functions to solve classic memorization tasks.

Figure 2: (Validation curves on Path-X.) (Left) Setting ∆_min too small can solve the task, but is inconsistent. (Right) A good setting of ∆_min can consistently solve the task. Note that the timescale of each feature is up to 1/∆_min = 10⁴, which is on the order of (but not exceeding) the length of the task, L = 16384. Empirically, performance is best when spreading out the range of ∆ with a larger ∆_max that covers a wider range of timescales and can potentially learn features at different resolutions, which are combined by a multi-layer deep neural network. We also show a diagonal variant of S4-LegS called S4D-Inv, introduced in Gu et al. (2022b), which approximates S4-LegS but is still worse.

(a) (Delay Task) Models perform a mapping ℝ^4000 → ℝ^4000 where the target output is lagged by 1000 steps, with error measured by RMSE. The input is a white-noise signal bandlimited to 1000 Hz. We use single-layer SSMs with state size N = 1024.

Figure 5: (Prior HiPPO methods) Given an input function u(t) (black), HiPPO compresses it online into a state vector x(t) ∈ R N via equation (1). Specific cases of HiPPO matrices A,B are derived so

Figure 6: Log-MSE after training on the Reconstruction Task. (Left) When the timescales ∆ are set appropriately for this task, the methods that theoretically reconstruct against a uniform measure (LegT and FouT) are much better than the alternatives, achieving MSE orders of magnitude lower than other SSM initializations. (Right) Interestingly, when the timescales ∆ are not set correctly, these methods (LegT and FouT) actually perform worst, and the diagonal methods introduced in Gu et al. (2022b) perform best.


Figure 7: (HiPPO-LegT.) (Left) First 4 basis functions K_n(t) for state size N = 1024 (Proposition 2). (Right) Choosing a particular C produces a spike kernel or "delay network" (Theorem 10).


The polynomials Q_n(s) are the denominators of the Padé approximants. Note that, by the definition of P and Q, G^{1z}_n(s) = P^{1z}_n(s)/Q_n(s).

Case G^{z1}_n(s).

(Long Range Arena) Accuracy (std.) on the full suite of LRA tasks. Hyperparameters in Appendix B. ✗ denotes failure to learn better than random guessing, following the convention from Tay et al. (2021); Gu et al. (2022a).

The values of the best hyperparameters found for LRA. LR is learning rate and WD is weight decay. BN and LN refer to Batch Normalization and Layer Normalization.

Note that Corollary C.11 implies Proposition 2. More specifically, Proposition 2 follows by setting θ = 1 in Corollary C.11 and noticing that the OSSM there is actually a TOSSM. (Technically, we get basis functions L_n(1-t) for the measure I(1-t), but this is OK since ∫ p_j(y)p_k(y)ω(y)dy = ∫ p_j(y)p_k(y)ω(y)I(y)dy = δ_jk.)

G^{z1}_n(s) satisfies the formula given by its recurrence above. By the definition of P^{z1}, this is exactly satisfied by the Padé approximants, via the determinantal formula for continued fractions. This shows that the G^{1z}_{n-1}(s) are the Padé approximants of e^{-s}, as desired.
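The [1/2] case computed earlier (transfer function (1 - s/3)/(1 + 2s/3 + s²/6) for N = 2) can be cross-checked against a generic Padé construction from the Taylor coefficients of e^{-s}; scipy's pade helper is used here for convenience:

```python
import numpy as np
from math import factorial
from scipy.interpolate import pade

# Taylor coefficients of e^{-s}: 1, -1, 1/2, -1/6, ...
an = [(-1.0) ** j / factorial(j) for j in range(4)]
p, q = pade(an, 2)       # [1/2] approximant: numerator degree 1, denominator degree 2

for s in (0.1, 0.5, 1.0):
    closed_form = (1 - s / 3) / (1 + 2 * s / 3 + s ** 2 / 6)
    assert abs(p(s) / q(s) - closed_form) < 1e-12     # matches the N = 2 transfer function
# the approximant agrees with e^{-s} to high order near 0
assert abs(p(0.1) / q(0.1) - np.exp(-0.1)) < 1e-5
```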

C.5 Normalization and Timescales

Proposition 11 (Closure properties of TOSSMs). Consider a TOSSM (A, B) for basis functions p_n(t) and measure ω(t). Then the following are also TOSSMs, with the corresponding basis functions and measures:

1. Constant scaling changes the timescale: (cA, cB) is a TOSSM with basis p(ct) and measure cω(ct).

2. Identity shift tilts by an exponential: (A + cI, B) is a TOSSM with basis p(t)e^{-ct} and measure ω(t)e^{2ct}.

3. Unitary change of basis preserves the measure: (VAV*, VB) is a TOSSM with basis Vp(t) and measure ω(t).

Proof. We define p(t) to be the vector of basis functions for the OSSM (A, B), and recall that the SSM kernels are e^{tA}B = p(t)ω(t).

1. The SSM kernels are e^{t(cA)}(cB) = c·e^{(ct)A}B = c·p(ct)ω(ct). It remains to show that the p_n(ct) are orthonormal with respect to the measure cω(ct):

∫ p_j(ct)p_k(ct)·cω(ct) dt = ∫ p_j(t)p_k(t)ω(t) dt = δ_jk,

which follows immediately from the change-of-variables formula.

2. Using the commutativity of A and I, the SSM kernels are e^{t(A+cI)}B = e^{tA}e^{ctI}B = e^{ct}p(t)ω(t). It remains to show that the p_n(t)e^{-ct} are orthonormal with respect to the measure ω(t)e^{2ct}:

∫ p_j(t)e^{-ct}·p_k(t)e^{-ct}·ω(t)e^{2ct} dt = ∫ p_j(t)p_k(t)ω(t) dt = δ_jk.

3. The SSM kernels are e^{t(VAV*)}(VB) = V e^{tA}V*VB = V e^{tA}B = V p(t)ω(t). It remains to show that the basis functions Vp(t) are orthonormal with respect to ω(t). Note that orthonormality of a set of basis functions can be expressed as ∫ p(t)ω(t)p(t)^⊤ dt = I, so that ∫ Vp(t)ω(t)(Vp(t))* dt = V(∫ p(t)ω(t)p(t)^⊤ dt)V* = VV* = I.

Normalization. A standard aspect of training deep learning models concerns the scale or variance of activations. This has been the subject of much research, touching on deep learning theory for the dynamics of training, such as the exploding/vanishing gradient problem (Hochreiter, 1991), and a large number of normalization methods to ensure properly normalized activations, from the simple Xavier/He initializations (Glorot and Bengio, 2010; He et al., 2015) to BatchNorm and LayerNorm (Ioffe and Szegedy, 2015; Ba et al., 2016) to many modern variants and analyses of these (Davis et al., 2021).

The following proposition holds because, for a TOSSM, x(t) can be interpreted as a projection onto orthonormal functions in a Hilbert space (Proposition 1).

Proposition 12 (Normalization of TOSSMs). Consider an (infinite-dimensional) TOSSM. For any input u(t),

∥x(t)∥²₂ = ∥u∥²_ω = ∫_{-∞}^t u(s)² ω(t-s) ds.

Corollary C.13. For a TOSSM whose measure is a probability measure (i.e., ω integrates to 1) and any constant input u(t) = c, the state has norm ∥x(t)∥²₂ = c², and the output y(t) has mean 0 and variance c² if the entries of C have mean 0 and variance 1.

Note that the probability measure requirement can be satisfied by simply rescaling B. Corollary C.13 says that the TOSSM preserves the variance of its inputs, the critical condition for a properly normalized deep learning layer. Note that this initialization of C differs from that of a standard linear layer in deep neural networks, which usually rescales by a factor depending on the dimensionality, such as N^{-1/2} (Glorot and Bengio, 2010).
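The kernel identities used in parts 1 and 2 of the proof are plain matrix calculus and can be checked numerically on an arbitrary (not necessarily HiPPO) pair (A, B); this sketch is only an illustration of the closure mechanics:

```python
import numpy as np
from scipy.linalg import expm

rng = np.random.default_rng(0)
N = 5
A = rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
t, c = 0.7, 2.0

# Part 1: the kernel of (cA, cB) at time t is c times the kernel of (A, B) at time ct
lhs1 = expm(t * (c * A)) @ (c * B)
rhs1 = c * (expm((c * t) * A) @ B)
assert np.allclose(lhs1, rhs1)

# Part 2: the kernel of (A + cI, B) is e^{ct} times the kernel of (A, B)
lhs2 = expm(t * (A + c * np.eye(N))) @ B
rhs2 = np.exp(c * t) * (expm(t * A) @ B)
assert np.allclose(lhs2, rhs2)
```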

