LIQUID STRUCTURAL STATE-SPACE MODELS

Abstract

A proper parametrization of the state transition matrices of linear state-space models (SSMs) followed by standard nonlinearities enables them to efficiently learn representations from sequential data, establishing the state of the art on an extensive series of long-range sequence modeling benchmarks. In this paper, we show that we can improve further when the structured SSM, such as S4, is given by a linear liquid time-constant (LTC) state-space model. LTC neural networks are causal continuous-time neural networks with an input-dependent state transition module, which makes them learn to adapt to incoming inputs at inference. We show that by using a diagonal plus low-rank decomposition of the state transition matrix introduced in S4, and a few simplifications, the LTC-based structured state-space model, dubbed Liquid-S4, improves generalization across sequence modeling tasks with long-term dependencies such as image, text, audio, and medical time series, with an average performance of 87.32% on the Long Range Arena benchmark. On the full raw Speech Command recognition dataset, Liquid-S4 achieves 96.78% accuracy with a 30% reduction in parameter count compared to S4. The additional gain in performance is the direct result of the Liquid-S4 kernel structure, which takes into account the similarities of the input sequence samples during training and inference.

1. INTRODUCTION

Learning representations from sequences of data requires expressive temporal and structural credit assignment. In this space, the continuous-time neural network class of liquid time-constant networks (LTCs) (Hasani et al., 2021b) has shown theoretical and empirical evidence for their expressivity and their ability to capture the cause and effect of a given task from high-dimensional sequential demonstrations (Lechner et al., 2020a; Vorbach et al., 2021; Wang et al., 2022; Hasani et al., 2022; Yin et al., 2022). Liquid networks are nonlinear state-space models (SSMs) with an input-dependent state transition module that enables them to learn to adapt the dynamics of the model to incoming inputs at inference, as they are dynamic causal models (Friston et al., 2003). Their complexity, however, is bottlenecked by their differential-equation numerical solver, which limits their scalability to longer-term sequences. How can we take advantage of LTCs' generalization and causality capabilities and scale them to competitively learn long-range sequences without gradient issues, compared to advanced recurrent neural networks (RNNs) (Rusch & Mishra, 2021a; Erichson et al., 2021; Gu et al., 2020a), convolutional networks (CNNs) (Lea et al., 2016; Romero et al., 2021b; Cheng et al., 2022), and attention-based models (Vaswani et al., 2017)? In this work, we set out to leverage the elegant formulation of structured state-space models (S4) (Gu et al., 2022a) to obtain linear liquid network instances that possess the approximation capabilities of both S4 and LTCs. This is because structured SSMs have been shown to largely dominate advanced RNNs, CNNs, and Transformers across many data modalities such as text, sequences of pixels, audio, and time series (Gu et al., 2021; 2022a;b; Gupta, 2022).
Structured SSMs achieve such impressive performance by using three main mechanisms: 1) high-order polynomial projection operators (HiPPO) that compress the input history into the hidden state, 2) a normal plus low-rank parametrization of the state transition matrix, and 3) an efficient frequency-domain computation of the resulting convolution kernel.

2. RELATED WORKS

Learning Long-Range Dependencies with RNNs. When RNNs are trained by gradient descent (Rumelhart et al., 1986; Allen-Zhu & Li, 2019; Sherstinsky, 2020), they suffer from the vanishing/exploding gradients problem, which makes learning long-term dependencies in sequences difficult (Hochreiter, 1991; Bengio et al., 1994). This issue occurs both in discrete RNNs such as GRU-D with its continuous delay mechanism (Che et al., 2018) and Phased-LSTMs (Neil et al., 2016), and in continuous RNNs such as ODE-RNNs (Rubanova et al., 2019), GRU-ODE (De Brouwer et al., 2019), Log-ODE methods (Morrill et al., 2020), which compress the input time series by time-continuous path signatures (Friz & Victoir, 2010), neural controlled differential equations (Kidger et al., 2020), and liquid time-constant networks (LTCs) (Hasani et al., 2021b). Numerous solutions have been proposed to resolve these gradient issues and enable long-range dependency learning. Examples include discrete gating mechanisms in LSTMs (Hochreiter & Schmidhuber, 1997; Greff et al., 2016; Hasani et al., 2019) and GRUs (Chung et al., 2014), continuous gating mechanisms such as CfCs (Hasani et al., 2021a), Hawkes LSTMs (Mei & Eisner, 2017), IndRNNs (Li et al., 2018), state regularization (Wang & Niepert, 2019), unitary RNNs (Jing et al., 2019), dilated RNNs (Chang et al., 2017), long-memory stochastic processes (Greaves-Tunnell & Harchaoui, 2019), recurrent kernel networks (Chen et al., 2019), Lipschitz RNNs (Erichson et al., 2021), symmetric skew decomposition (Wisdom et al., 2016), infinitely many updates in iRNNs (Kag et al., 2019), coupled oscillatory RNNs (coRNNs) (Rusch & Mishra, 2021a), mixed-memory RNNs (Lechner & Hasani, 2021), and Legendre Memory Units (Voelker et al., 2019).

Learning Long-Range Dependencies with CNNs and Transformers.
RNNs are not the only solution for learning long-range dependencies. Continuous convolutional kernels such as CKConv (Romero et al., 2021b) and FlexConv (Romero et al., 2021a), and circular dilated CNNs (Cheng et al., 2022), have been shown to model long sequences efficiently and faster than RNNs. There has also been a large series of works showing the effectiveness of attention-based methods for modeling spatiotemporal data; a large list of these models is given in Table 6. These baselines have recently been largely outperformed by the structured state-space models (Gu et al., 2022a).

State-Space Models. SSMs are well-established frameworks to study deterministic and stochastic dynamical systems (Kalman, 1960). Their state and input transition matrices can be directly learned by gradient descent to model sequences of observations (Lechner et al., 2020b; Hasani et al., 2021b; Gu et al., 2021). In a seminal work, Gu et al. (2022a) showed that with a couple of fundamental algorithmic methods for memorization and computation of input sequences, SSMs turn into the most powerful sequence modeling framework to date, outperforming advanced RNNs, temporal and continuous CNNs (Cheng et al., 2022; Romero et al., 2021b;a), and a wide variety of Transformers (Vaswani et al., 2017), listed in Table 6, by a significant margin. The key to their numerical performance is their derivation of the higher-order polynomial projection (HiPPO) matrix (Gu et al., 2020a), obtained by a scaled Legendre measure (LegS) inspired by Legendre Memory Units (Voelker et al., 2019), to memorize input sequences. Their efficient runtime and memory follow from their normal plus low-rank representation. It was also shown recently that diagonal SSMs (S4D) (Gupta, 2022) can be as performant as S4 in learning long sequences when parametrized and initialized properly (Gu et al., 2022b;c).
Concurrent with our work, there is also a new variant of S4, introduced as simplified-S4 (S5) (Smith et al., 2022), that tensorizes the 1-D operations of S4 to obtain a more straightforward realization of SSMs. Here, we introduce Liquid-S4, which is obtained from a more expressive SSM, namely the liquid time-constant (LTC) representation (Hasani et al., 2021b), and achieves SOTA performance across many benchmarks.

3. SETUP AND METHODOLOGY

In this section, we first revisit the necessary background to formulate our liquid structured state-space models. We then set up and sketch our technical contributions.

3.1. BACKGROUND: STRUCTURED STATE-SPACE MODELS (S4)

We aim to design an end-to-end sequence modeling framework built from SSMs. A continuous-time SSM representation of a linear dynamical system is given by the following set of equations:

ẋ(t) = A x(t) + B u(t),   y(t) = C x(t) + D u(t)   (1)

Here, x(t) is an N-dimensional latent state, receiving a 1-dimensional input signal u(t) and computing a 1-dimensional output signal y(t). A (N×N), B (N×1), C (1×N), and D (1×1) are the system's parameters. For the sake of brevity, throughout our analysis we set D = 0, as it can be added after construction of our main results in the form of a skip connection (Gu et al., 2022a).

Discretization of SSMs. In order to create a sequence-to-sequence model similar to a recurrent neural network (RNN), we discretize the continuous-time representation of SSMs by the trapezoidal rule (bilinear transform) with sampling step δt as follows (Gu et al., 2022a):

x_k = Ā x_{k−1} + B̄ u_k,   y_k = C̄ x_k   (2)

This is obtained via the following modifications to the transition matrices:

Ā = (I − (δt/2) A)^{−1} (I + (δt/2) A),   B̄ = (I − (δt/2) A)^{−1} δt B,   C̄ = C   (3)

With this transformation, we construct a discretized seq-2-seq model that maps the input u_k to the output y_k via the hidden state x_k ∈ R^N. Ā is the hidden-state transition matrix; B̄ and C̄ are the input and output transition matrices, respectively.

Creating a Convolutional Representation of SSMs. The system described by Eq. 2 and Eq. 3 can be trained via gradient descent to model sequences, but only in a sequential manner, which is not scalable. To improve this, we can write the discretized SSM of Eq. 2 as a discrete convolutional kernel. To construct the convolutional kernel, we unroll the system of Eq. 2 in time, assuming a zero initial hidden state x_{−1} = 0:

x_0 = B̄u_0,   x_1 = ĀB̄u_0 + B̄u_1,   x_2 = Ā²B̄u_0 + ĀB̄u_1 + B̄u_2,   …
y_0 = C̄B̄u_0,   y_1 = C̄ĀB̄u_0 + C̄B̄u_1,   y_2 = C̄Ā²B̄u_0 + C̄ĀB̄u_1 + C̄B̄u_2,   …   (4)
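To make Eqs. 1–4 concrete, the following sketch (a toy numpy example with random parameters, not the S4 parametrization) bilinear-discretizes a small SSM, runs the recurrence of Eq. 2, and checks it against the unrolled sum of Eq. 4:

```python
import numpy as np

# Toy SSM with random (roughly stable) parameters; shapes follow Eq. 1.
rng = np.random.default_rng(0)
N, L, dt = 4, 16, 0.1

A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))  # state matrix (N x N)
B = rng.standard_normal((N, 1))                      # input matrix (N x 1)
C = rng.standard_normal((1, N))                      # output matrix (1 x N)
u = rng.standard_normal(L)                           # 1-D input sequence

# Eq. 3: trapezoidal (bilinear) discretization.
inv = np.linalg.inv(np.eye(N) - dt / 2 * A)
Abar = inv @ (np.eye(N) + dt / 2 * A)
Bbar = inv @ (dt * B)

# Eq. 2: sequential recurrence.
x = np.zeros((N, 1))
y_rec = []
for k in range(L):
    x = Abar @ x + Bbar * u[k]
    y_rec.append((C @ x).item())

# Eq. 4 unrolled: y_k = sum_{i=0}^{k} C Abar^i Bbar u_{k-i}.
y_unrolled = []
for k in range(L):
    acc = 0.0
    Ap = np.eye(N)
    for i in range(k + 1):
        acc += (C @ Ap @ Bbar).item() * u[k - i]
        Ap = Ap @ Abar
    y_unrolled.append(acc)

assert np.allclose(y_rec, y_unrolled)  # recurrence == unrolled convolution
```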
The mapping u_{0:k} → y_k can now be formulated explicitly as a convolutional kernel:

y_k = C̄Ā^kB̄u_0 + C̄Ā^{k−1}B̄u_1 + … + C̄ĀB̄u_{k−1} + C̄B̄u_k,   y = K̄ * u   (5)

K̄ ∈ R^L := K_L(C̄, Ā, B̄) := (C̄Ā^iB̄)_{i∈[L]} = (C̄B̄, C̄ĀB̄, …, C̄Ā^{L−1}B̄)   (6)

Eq. 5 is a non-circular convolutional kernel. Gu et al. (2022a) showed that K̄ can be computed very efficiently by a black-box Cauchy kernel computation pipeline.

Computing the S4 Kernel Efficiently. Gu et al. (2022a) showed that the S4 convolution kernel can be computed efficiently using the following elegant parameterization tricks:

• To obtain better representations in sequence modeling schemes by SSMs, instead of randomly initializing the transition matrix A, we can use the normal plus low-rank (NPLR) matrix below, called the HiPPO matrix (Gu et al., 2020a), which is obtained by the scaled Legendre measure (LegS) (Gu et al., 2021; 2022a):

(HiPPO Matrix)   A_{nk} = − { (2n+1)^{1/2} (2k+1)^{1/2}  if n > k;   n + 1  if n = k;   0  if n < k }   (7)

• The NPLR representation of this matrix is the following (Gu et al., 2022a):

A = V Λ V* − P Q^⊤ = V (Λ − (V*P)(V*Q)*) V*   (8)

Here, V ∈ C^{N×N} is a unitary matrix, Λ is diagonal, and P, Q ∈ R^{N×r} form the low-rank factorization. Eq. 7 is normal plus low rank with r = 1 (Gu et al., 2022a). With the decomposition in Eq. 8, we can represent A over the complex numbers in diagonal plus low-rank (DPLR) form (Gu et al., 2022a).

• The vectors B and P are initialized by B_n = (2n+1)^{1/2} and P_n = (n + 1/2)^{1/2} (Gu et al., 2022b). Both vectors are trainable.

• Furthermore, it was shown in Gu et al. (2022b) that with Eq. 8, the eigenvalues of A might lie in the right half of the complex plane, resulting in numerical instability. To resolve this, Gu et al. (2022b) recently proposed to use the parametrization Λ − PP* instead of Λ − PQ*.

• Computing the powers of Ā in a direct calculation of the S4 kernel K̄ is computationally expensive.
Instead of computing K̄ directly, S4 computes its spectrum, which reduces the problem of matrix powers to matrix inverse computation (Gu et al., 2022a). S4 then evaluates this convolution kernel efficiently via a black-box Cauchy kernel and recovers K̄ by an inverse Fourier transform (iFFT) (Gu et al., 2022a).
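As a minimal illustration of this frequency-domain idea (for a purely diagonal Ā, without the low-rank correction or the Cauchy/Woodbury machinery of the full S4 pipeline), one can evaluate the truncated generating function of the kernel at the roots of unity and recover K̄ with an inverse FFT:

```python
import numpy as np

# Diagonal toy SSM: K[i] = Cbar Abar^i Bbar with Abar = diag(lam).
rng = np.random.default_rng(1)
N, L = 4, 32
lam = 0.9 * np.exp(2j * np.pi * rng.random(N))  # eigenvalues inside unit disk
Bbar = rng.standard_normal(N) + 0j
Cbar = rng.standard_normal(N) + 0j

# Direct (expensive) kernel via matrix powers.
K_direct = np.array([np.sum(Cbar * lam**i * Bbar) for i in range(L)])

# Truncated generating function Khat(w) = sum_i K[i] w^i has the closed form
# sum_n c_n b_n (1 - (lam_n w)^L) / (1 - lam_n w); evaluate at roots of unity.
w = np.exp(-2j * np.pi * np.arange(L) / L)
Khat = np.array([np.sum(Cbar * Bbar * (1 - (lam * wk) ** L) / (1 - lam * wk))
                 for wk in w])

K_ifft = np.fft.ifft(Khat)  # recover the kernel with an inverse FFT

assert np.allclose(K_direct, K_ifft)
```

No matrix power ever needs to be formed explicitly: the closed-form generating function replaces the sum of powers, which is the essence of the spectral trick.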

3.2. LIQUID STRUCTURAL STATE-SPACE MODELS

In this work, we construct a convolutional kernel corresponding to a linearized version of LTCs (Hasani et al., 2021b); an expressive class of continuous-time neural networks that demonstrate attractive out-of-distribution generalizability and are dynamic causal models (Vorbach et al., 2021; Friston et al., 2003; Hasani et al., 2020). In their general form, the state of a liquid time-constant network at each time step is given by the following ODE (Hasani et al., 2021b):

dx(t)/dt = −[A + B ⊙ f(x(t), u(t), t, θ)] ⊙ x(t) + B ⊙ f(x(t), u(t), t, θ),   (9)

where the bracketed term [A + B ⊙ f(·)] is the liquid time-constant. In this expression, x(t) (N×1) is the hidden-state vector of size N, u(t) (m×1) is an input signal with m features, A (N×1) is a time-constant state-transition mechanism, B (N×1) is a bias vector, and ⊙ represents the Hadamard product. f(·) is a bounded nonlinearity parametrized by θ. Our objective is to show how the liquid time-constant (i.e., an input-dependent state transition mechanism) in state-space models can enhance their generalization capabilities by accounting for the covariance of the input samples. To do this, we linearize the LTC formulation of Eq. 9 to better connect the model to SSMs.

Linear Liquid Time-Constant State-Space Model. A linear LTC SSM can be presented by the following coupled bilinear (first-order bilinear Taylor approximation (Penny et al., 2005)) equation:

ẋ(t) = [A + J_N B u(t)] x(t) + B u(t),   y(t) = C x(t)   (10)

Similar to Eq. 1, x(t) is an N-dimensional latent state, receiving a 1-dimensional input signal u(t) and computing a 1-dimensional output signal y(t), with parameters A (N×N), B (N×1), and C (1×N). Note that D is set to zero for simplicity. In Eq. 10, J_N is an N×N unit matrix that adds B u(t) element-wise to A. This dynamical system allows the coefficient (state transition compartment) of the state vector x(t) to be input-dependent, which in turn allows us to realize more complex dynamics.
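As a toy illustration of the LTC dynamics of Eq. 9 (with assumed sizes and an assumed simple form f(x, u) = tanh(Wu + x) for the bounded nonlinearity; the weight vector W is hypothetical), the following forward-Euler simulation shows the input-dependent effective time constant A + B ⊙ f(·):

```python
import numpy as np

# Toy LTC cell, Eq. 9, integrated with forward Euler. All parameters are
# illustrative; A is shifted to keep the effective time constant positive.
rng = np.random.default_rng(2)
N, T, dt = 8, 200, 0.05
A = 1.0 + np.abs(rng.standard_normal(N))  # base time-constant vector (N,)
B = 0.3 * rng.standard_normal(N)          # bias vector (N,)
W = rng.standard_normal(N)                # assumed input weights inside f

def f(x, u):
    # bounded nonlinearity parametrized by theta = W (assumed simple form)
    return np.tanh(W * u + x)

x = np.zeros(N)
for t in range(T):
    u = np.sin(0.1 * t)                   # example scalar input signal
    fx = f(x, u)
    # Eq. 9: dx/dt = -[A + B (.) f] (.) x + B (.) f  (Hadamard products)
    x = x + dt * (-(A + B * fx) * x + B * fx)

assert np.all(np.isfinite(x))  # bounded f keeps the toy system well behaved
```

Because f is bounded, the decay rate A + B ⊙ f(·) varies with u(t) within a bounded band, which is exactly the "liquid" time-constant behavior the text describes.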
Discretization of Liquid-SSMs. We can use a forward Euler transformation to discretize Eq. 10 into:

x_k = (Ā + B̄ u_k) x_{k−1} + B̄ u_k,   y_k = C̄ x_k   (11)

The discretized parameters then correspond to Ā = I + δt A, B̄ = δt B, and C̄ = C, which are functions of the continuous-time coefficients A, B, and C and the discretization step δt. Given the properties of the transition matrices A and B and the range of δt, we could also use the more stable bilinear discretization of A and B from Eq. 3, as the forward Euler discretization and the bilinear transformation of A and B stay close to each other (Appendix D).

Creating a Convolutional Representation of Liquid-SSMs. Similar to Eq. 4, we first unroll the Liquid-SSM in time to construct its convolutional kernel. Assuming x_{−1} = 0, we have:

x_0 = B̄u_0,   y_0 = C̄B̄u_0
x_1 = ĀB̄u_0 + B̄u_1 + B̄²u_0u_1,   y_1 = C̄ĀB̄u_0 + C̄B̄u_1 + C̄B̄²u_0u_1   (12)
x_2 = Ā²B̄u_0 + ĀB̄u_1 + B̄u_2 + ĀB̄²u_0u_1 + ĀB̄²u_0u_2 + B̄²u_1u_2 + B̄³u_0u_1u_2
y_2 = C̄Ā²B̄u_0 + C̄ĀB̄u_1 + C̄B̄u_2 + C̄ĀB̄²u_0u_1 + C̄ĀB̄²u_0u_2 + C̄B̄²u_1u_2 + C̄B̄³u_0u_1u_2,   …

The resulting expressions of the Liquid-SSM at each time step consist of two types of weight configurations: 1) weights corresponding to the mapping of individual time instances of inputs independently, shown in black in Eq. 12, and 2) weights associated with all orders of auto-correlation of the input signal, shown in violet in Eq. 12. The first set of weights corresponds to the convolutional kernel of the simple SSM given by Eq. 5 and Eq. 6, whereas the second set leads to the design of an additional input-correlation kernel, which we call the liquid kernel. These kernels generate the following input-output mapping:

y_k = C̄Ā^kB̄u_0 + C̄Ā^{k−1}B̄u_1 + … + C̄ĀB̄u_{k−1} + C̄B̄u_k + Σ_{p=2}^{P} Σ_{u_i u_{i+1} … u_p ∈ Π(k+1, p)} C̄Ā^{(k+1−p−i)} B̄^p u_i u_{i+1} … u_p   (13)

for i ∈ Z and i ≥ 0, i.e., y = K̄ * u + K̄_liquid * u_correlations. Here, Π(k+1, p) represents the (k+1 choose p) permuted indices.
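The expansion in Eq. 12 can be checked numerically. The sketch below runs the scalar (N = 1) version of the recurrence in Eq. 11 with toy values and compares y_0, y_1, y_2 against their unrolled expressions, including the input-correlation terms:

```python
import numpy as np

# Scalar liquid SSM: x_k = (Abar + Bbar*u_k) x_{k-1} + Bbar*u_k, y_k = Cbar*x_k.
Ab, Bb, Cb = 0.8, 0.5, 1.3          # toy scalar Abar, Bbar, Cbar
u = [0.7, -0.4, 0.9]                # input samples u0, u1, u2

x = 0.0
ys = []
for uk in u:
    x = (Ab + Bb * uk) * x + Bb * uk
    ys.append(Cb * x)

# Eq. 12 expansions, term by term (correlation terms carry products u_i u_j).
y0 = Cb * Bb * u[0]
y1 = Cb * Ab * Bb * u[0] + Cb * Bb * u[1] + Cb * Bb**2 * u[0] * u[1]
y2 = (Cb * Ab**2 * Bb * u[0] + Cb * Ab * Bb * u[1] + Cb * Bb * u[2]
      + Cb * Ab * Bb**2 * u[0] * u[1] + Cb * Ab * Bb**2 * u[0] * u[2]
      + Cb * Bb**2 * u[1] * u[2] + Cb * Bb**3 * u[0] * u[1] * u[2])

assert np.isclose(ys[0], y0)
assert np.isclose(ys[1], y1)
assert np.isclose(ys[2], y2)
```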
For instance, let us assume we have a 1-dimensional input signal u(t) of length L = 100 on which we run the Liquid-SSM kernel, and we set the hyperparameter P = 4. This value is the maximum order of the correlation terms we want to take into account to output a decision. It means that the signal u_correlations in Eq. 13 will contain all (L+1 choose 2) second-order correlation terms u_i u_j, all (L+1 choose 3) third-order terms u_i u_j u_k, and all (L+1 choose 4) fourth-order terms u_i u_j u_k u_l. The kernel weights corresponding to this auto-correlation signal are given in Appendix A. This additional kernel takes the temporal similarities of incoming input samples into consideration. This way, Liquid-SSM gives rise to a more general sequence modeling framework.

Algorithm 1 LIQUID-S4 KERNEL — The S4 convolution kernel computation (lines 1-5) is used from Gu et al. (2022a) and Gu et al. (2022b); the liquid kernel computation is in lines 6-16.
Input: S4 parameters Λ, P, B, C ∈ C^N, step size ∆, liquid kernel order P, input sequence length L, liquid kernel sequence length L̃
Output: SSM convolution kernel K̄ = K_L(Ā, B̄, C̄) and SSM liquid kernel K̄_liquid = K_L̃(Ā, B̄, C̄), for A = Λ − P P* (Eq. 6)
1: C̃ ← (I − Ā^L)* C   ▷ Truncate the SSM generating function (SSMGF) to length L
2: [k_00(ω), k_01(ω); k_10(ω), k_11(ω)] ← [C̃, P]* (2/∆ · (1−ω)/(1+ω) − Λ)^{−1} [B, P]   ▷ Black-box Cauchy kernel
3: K̂(ω) ← 2/(1+ω) [k_00(ω) − k_01(ω)(1 + k_11(ω))^{−1} k_10(ω)]   ▷ Woodbury identity
4: K̂ ← {K̂(ω) : ω = exp(2πi k/L)}   ▷ Evaluate SSMGF at all roots of unity ω ∈ Ω_L
5: K̄ ← iFFT(K̂)   ▷ Inverse Fourier transform
6: if Mode == KB then   ▷ Liquid-S4 kernel as shown in Eq. 14
7:   for p in {2, …, P} do
8:     K̄_liquid=p ← (K̄_(L−L̃, L) ⊙ B^{p−1}_(L−L̃, L)) J_L̃   ▷ J_L̃ is a backward identity (flip) matrix
9:     K̄_liquid.append(K̄_liquid=p)
10:  end for
11: else if Mode == PB then   ▷ Liquid-S4 kernel of Eq. 14 with Ā reduced to identity
12:  for p in {2, …, P} do
13:    K̄_liquid=p ← C̃ ⊙ B^{p−1}_(L−L̃, L)
14:    K̄_liquid.append(K̄_liquid=p)
15:  end for
16: end if
The liquid convolutional kernel K̄_liquid is as follows:

K̄_liquid ∈ R^L̃ := K_L̃(C̄, Ā, B̄) := (C̄ Ā^(L̃−i−p) B̄^p)_{i∈[L̃], p∈[P]} = (C̄Ā^{L̃−2}B̄², …, C̄B̄^P)   (14)

How can we compute the Liquid-S4 kernel efficiently? K̄_liquid possesses a similar structure to the S4 kernel. In particular, we have:

Proposition 1. The Liquid-S4 kernel for each order p ∈ [P], K̄_liquid, can be computed by the anti-diagonal transformation (flip operation) of the product of the S4 convolution kernel, K̄ = (C̄B̄, C̄ĀB̄, …, C̄Ā^{L−1}B̄), and a vector B̄^{p−1} ∈ R^N.

The proof is given in the Appendix. Proposition 1 indicates that the Liquid-S4 kernel can be obtained from the precomputed S4 kernel via a Hadamard product of that kernel with the transition vector B̄ raised to the chosen liquid order. This is illustrated in Algorithm 1, lines 6 to 10, corresponding to a mode we call KB, which stands for Kernel × B.

Additionally, we introduce a simplified Liquid-S4 kernel that is easier to compute while being as expressive as, or even better performing than, the KB kernel. To obtain it, we replace the transition matrix Ā in the Liquid-S4 kernel of Eq. 14 with an identity matrix, only for the input correlation terms. This way, the Liquid-S4 kernel for a given liquid order p ∈ [P] reduces to the following expression:

(Liquid-S4-PB)   K̄_liquid=p ∈ R^L̃ := K_L̃(C̄, B̄) := (C̄ B̄^p)_{i∈[L̃], p∈[P]}   (15)

We call this kernel Liquid-S4-PB, as it is obtained by powers of the vector B̄. The computational steps to obtain this kernel are outlined in Algorithm 1, lines 11 to 15.

Computational Complexity of the Liquid-S4 Kernel. The computational complexity of the S4-LegS convolutional kernel solved via the Cauchy kernel is Õ(N + L), where N is the state size and L is the sequence length [Gu et al. (2022a), Theorem 3]. Liquid-S4, in both KB and PB modes, can be computed in Õ(N + L + p_max L̃). The added time complexity is tractable in practice.
This is because we usually select the liquid order p to be less than 10 (typically p_max = 3), and L̃, the number of terms used to compute the input correlation vector u_correlations, is typically two orders of magnitude smaller than the sequence length.
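Proposition 1 can be illustrated in the scalar (N = 1) case, taking L̃ = L for simplicity (a hedged sketch, not the full multi-dimensional KB implementation): the order-p liquid kernel entries C̄Ā^jB̄^p are exactly a flipped, truncated slice of the plain kernel scaled by B̄^{p−1}:

```python
import numpy as np

# Scalar demonstration of Proposition 1 with toy parameters.
Ab, Bb, Cb = 0.9, 0.6, 1.1   # scalar Abar, Bbar, Cbar
L, p = 10, 3                 # kernel length and liquid order (Ltilde = L here)

# Plain kernel (Eq. 6): K[i] = Cbar * Abar^i * Bbar.
K = np.array([Cb * Ab**i * Bb for i in range(L)])

# Direct liquid kernel of order p (Eq. 14): entries Cbar * Abar^(L-p-i) * Bbar^p.
K_liquid_direct = np.array([Cb * Ab**(L - p - i) * Bb**p
                            for i in range(L - p + 1)])

# Proposition 1: flip a truncated slice of K and scale by Bbar^(p-1).
K_liquid_flip = np.flip(K[: L - p + 1]) * Bb ** (p - 1)

assert np.allclose(K_liquid_direct, K_liquid_flip)
```

The same reuse-and-flip structure is what makes the KB mode cheap: the expensive part (the plain kernel) is computed once, and each liquid order only adds an elementwise product and a flip.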

4. EXPERIMENTS WITH LIQUID-S4

In this section, we present an extensive evaluation of Liquid-S4 on sequence modeling tasks with very long-term dependencies and compare its performance to a large series of baselines, ranging from advanced Transformers and convolutional networks to many variants of state-space models. In the following, we first outline the baseline models we compare against, then list the datasets we evaluated these models on, and finally present results and discussions.

Baselines. We consider a broad range of advanced models to compare Liquid-S4 with. These baselines include Transformer variants such as vanilla and Sparse Transformers, a Transformer model with local attention, Longformer, Linformer, Reformer, Sinkhorn, BigBird, Linear Transformer, and Performer. We also include architectures such as FNet, Nyströmformer, Luna-256, H-Transformer-1D, and circular dilated convolutional neural networks (CDIL). We then include a full series of state-space models and their variants, such as diagonal SSMs (DSS), S4, S4-LegS, S4-FouT, S4-(LegS/FouT), S4D-LegS, S4D-Inv, S4D-Lin, and the simplified structured state-space model (S5).

Datasets. We first evaluate Liquid-S4's performance on the well-studied Long Range Arena (LRA) benchmark (Tay et al., 2020b), where Liquid-S4 outperforms other S4 and S4D variants in every task with an average accuracy of 87.32%. The LRA benchmark includes six tasks with sequence lengths ranging from 1k to 16k. We then report Liquid-S4's performance compared to other S4 and S4D variants, as well as other models, on the BIDMC Vital Signs dataset (Pimentel et al., 2016; Goldberger et al., 2000). BIDMC uses bio-marker signals of length 4000 to predict heart rate (HR), respiratory rate (RR), and blood oxygen saturation (SpO2). We also experiment with the sCIFAR dataset, which consists of classifying flattened images, in the form of 1024-long sequences, into ten classes.
Finally, we perform raw Speech Command (SC) recognition with the full 35 labels, as conducted very recently in the updated S4 article (Gu et al., 2022a). Note that there is also a modified Speech Command dataset restricted to only ten output classes, which has been used in a couple of prior works; we return to it in Section 4.4.

4.1. Results on Long Range Arena

Table 6 depicts a comprehensive list of baselines benchmarked against each other on the six long-range sequence modeling tasks in LRA. We observe that Liquid-S4 instances (all use the PB kernel with a scaled Legendre (LegS) configuration) with a small liquid order p, ranging from 2 to 6, consistently outperform all baselines in all six tasks, establishing a new SOTA on LRA with an average performance of 87.32%. In particular, Liquid-S4 improves S4-LegS performance by more than 3% on ListOps, by 2.2% on character-level IMDB, and by 0.65% on 1-D pixel-level classification (CIFAR), while establishing the state of the art on the hardest LRA task with 96.66% accuracy. Liquid-S4 performs on par with improved S4 and S4D instances on both the AAN and Pathfinder tasks. The performance of SSM models is generally well beyond what advanced Transformers, RNNs, and convolutional networks achieve on LRA tasks, with the Liquid-S4 variants standing on top.

The impact of increasing the liquid order p. Figure 1 illustrates how increasing the liquid order p can improve performance on the ListOps and IMDB tasks from LRA (more results in the Appendix).

4.2. Results on BIDMC Vital Signs

Table 2 demonstrates the performance of a variety of classical and advanced baseline models on the BIDMC dataset for all three prediction tasks: heart rate (HR), respiratory rate (RR), and blood oxygen saturation (SpO2). We observe that Liquid-S4 with a PB kernel of order p = 3, p = 2, and p = 4, respectively, performs better than all S4 and S4D variants. It is worth noting that Liquid-S4 is built with the same parametrization as S4-LegS (which is the official S4 model reported in the updated S4 report (Gu et al., 2022a)). On RR, Liquid-S4 outperforms S4-LegS by a significant margin of 36%. On SpO2, Liquid-S4 performs 26.67% better than S4-LegS. On HR, Liquid-S4 outperforms S4-LegS by 8.7%.

Table 3: Performance on sCIFAR. Numbers indicate accuracy (standard deviation). The baseline models are from Table 9 of Gu et al. (2022b).

Model | Accuracy
Transformer (Trinh et al., 2018) | 62.2
FlexConv (Romero et al., 2021a) | 80.82
TrellisNet (Bai et al., 2018) | 73.42
LSTM | 63.01
r-LSTM (Trinh et al., 2018) | 72.2
UR-GRU (Gu et al., 2020b) | 74.4
HiPPO-RNN (Gu et al., 2020a) | 61.1
LipschitzRNN (Erichson et al., 2021) | 64.2
S4-LegS (Gu et al., 2022b) | 91.80 (0.43)
S4-FouT (Gu et al., 2022b) | 91.22 (0.25)
S4-(LegS/FouT) (Gu et al., 2022b) | 91.58 (0.17)
S4D-LegS (Gu et al., 2022b) | 89.92 (1.69)
S4D-Inv (Gu et al., 2022b) | 90.69 (0.06)
S4D-Lin (Gu et al., 2022b) | 90.42 (0.03)
S5 (Smith et al., 2022) | 89.66
Liquid-S4-KB (ours) | 91.86 (0.08)
Liquid-S4-PB (ours, p = 3) | 92.02 (0.14)

4.3. Results on Image Classification

Similar to the previous tasks, a Liquid-S4 network with PB kernel of order p = 3 outperforms all variants of S4 and S4D while being significantly better than Transformer and RNN baselines as summarized in Table 3 .

4.4. Results on Speech Commands

Table 4 demonstrates that Liquid-S4 with p = 2 achieves the best performance among all benchmarks on the 16 kHz testbed. Liquid-S4 also performs competitively on the half-frequency zero-shot experiment, although it does not realize the best performance there. While the task is solved to a great degree, the reason could be that the liquid kernel accounts for covariance terms, which might influence the learned representations in a way that hurts performance by a small margin in this zero-shot experiment. The hyperparameters are given in the Appendix. It is worth noting that there is a modified Speech Command dataset, namely SC10, that restricts the dataset to only ten output classes and is used in a couple of works (see, for example, (Kidger et al., 2020; Gu et al., 2021; Romero et al., 2021b;a)). Aligned with the updated results reported in (Gu et al., 2022a) and (Gu et al., 2022b), we choose not to break down this dataset and report the full-sized benchmark in the main paper. Nevertheless, we conducted an experiment with SC10 and show that even on the reduced dataset, with the same hyperparameters, we solve the task with a SOTA accuracy of 98.51%. The results are presented in Table 7.

5. CONCLUSIONS

We showed that the performance of structured state-space models can be considerably improved when they are formulated by a linear liquid time-constant kernel, namely Liquid-S4. Liquid-S4 kernels are obtainable with minimal effort, with their kernel computing the similarities between time lags of the input signals in addition to the main S4 diagonal plus low-rank parametrization. Liquid-S4 kernels with smaller parameter counts achieve SOTA performance on all six tasks of the Long Range Arena benchmark, on BIDMC heart rate, respiratory rate, and blood oxygen saturation, on sequential 1-D pixel-level image classification, and on Speech Command recognition. As a final note, our experimental evaluations suggest that for challenging multivariate time series and modeling complex signals with long-range dependencies, SSM variants such as Liquid-S4 dominate other baselines, while for image and text data, a combination of SSMs and attention might enhance model quality.

The kernel computation pipeline uses the PyKeOps package (Charlier et al., 2021) for large tensor computations without memory overflow. All reported results are validation accuracy (similar to Gu et al. (2022a)), computed with 2 to 3 different random seeds, except for the BIDMC dataset, which reports accuracy on the test set. In the tables, x stands for infeasible computation on a single GPU or not applicable, as stated in Table 10 of (Gu et al., 2022a). The hyperparameters for Liquid-S4 are the same as the ones reported for the full Speech Commands dataset in Table 5.

D ON THE DISCRETIZATION OF LIQUID-SSMS

How do we perform the discretization of A + B u(t)? The dynamical system presented in Eq. 10 is a continuous-time (CT) bilinear state-space model (SSM).
Ideally, we want the discretization of a CT bilinear SSM to 1) satisfy the first-order form of the model, and 2) preserve the bilinear model structure. This is challenging and only possible via a limited number of methods:

1) Forward Euler. The most straightforward approach is to use forward Euler with first-order error: ẋ = (x_{k+1} − x_k)/δt + O(δt). Plugging this into Eq. 10, we get Ā = I + δt A and B̄ = δt B. This discretization satisfies both conditions above. For it to be stable, for s = σ ± iω an eigenvalue of the continuous transition matrix A, and λ = 1 + sδt the corresponding eigenvalue of the discrete model, we need Re(s) ≤ 0 and |λ| = |1 + sδt| ≤ 1, thus (1 + σδt)² + ω²δt² ≤ 1. This condition implies that selecting a small enough δt ensures the system's stability, but for large δt the system might go unstable. One can show that, based on the properties of the transition matrices A and B and the range of selected δt, a bilinear transformation of the discrete matrices Ā and B̄ would be very close to that of our forward Euler discretization. This means that:

|Ā_Forward Euler − Ā_Approx bilinear| < γ,   0 < γ < δt²

2) Adams-Bashforth Method. The second-order Adams-Bashforth method applies the transformation x_{k+1} = x_k + (3δt/2) f(k) − (δt/2) f(k−1) + O(δt²), where f(k) is the right-hand side of Eq. 10 at time t = kδt. This method also satisfies the two conditions we required (Phan et al., 2012).

One must note that computing a bilinear transform (https://en.wikipedia.org/wiki/Bilinear_transform) of a continuous-time bilinear SSM while preserving the first-order structure of the model is an open problem. Ideally, we could apply this transformation to A + B u(t). However, it is challenging to preserve the first-order form of the equation while keeping the bilinear (liquid) structure of the model described in Eq. 10. In our case, we use the bilinear-transform forms of Ā, B̄, and C̄ presented in Eq. 3 for the discrete weights of the system of Eq. 11, as this approximation is close to that of forward Euler. This implies that the continuous system in Eq. 10 could be transformed directly to Eq. 11 by a forward Euler transformation. Furthermore, due to the range of δt and the properties of A and B, the bilinear-transformed matrices presented in Eq. 3 remain close to the direct forward Euler system.
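The closeness claim above can be checked numerically (toy stable A, assumed setup): the gap between the forward-Euler and bilinear transition matrices shrinks roughly quadratically with δt:

```python
import numpy as np

# Compare A_euler = I + dt*A with A_bilin = (I - dt/2 A)^(-1)(I + dt/2 A).
# Their difference is O(dt^2), so halving dt shrinks it by roughly 4x.
rng = np.random.default_rng(3)
N = 6
A = -np.eye(N) - 0.2 * rng.standard_normal((N, N))  # toy, roughly stable A

errs = []
for dt in (0.1, 0.05, 0.025):
    A_euler = np.eye(N) + dt * A
    A_bilin = np.linalg.inv(np.eye(N) - dt / 2 * A) @ (np.eye(N) + dt / 2 * A)
    errs.append(np.linalg.norm(A_euler - A_bilin))

assert errs[0] > errs[1] > errs[2]    # gap shrinks with dt
assert 3.0 < errs[0] / errs[1] < 5.0  # roughly quadratic (ratio near 4)
```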

E ON THE AUTOREGRESSIVE MODE OF PB KERNEL

In autoregressive (AR) mode, with PB (or any other conditioned kernel), we obtain A, B, and C regardless of the conditioning. More specifically, in the autoregressive mode of PB we can use Eq. 13. For computing the black parts of Eq. 13, we can reuse the AR mode of the plain S4 model and only compute the new (violet) parts. As the violet parts consist of p multiplications of input terms (and the corresponding matrices), computing the AR mode is feasible. This adds a complexity of O(p) to inference in the AR mode of PB, but because p is much smaller than L (the past sequence length), it can significantly speed up inference compared to the convolutional counterpart. Moreover, one property of LTCs that had not been studied before this work is their ability to account for the pairwise correlation of inputs, which became apparent once we unrolled the system's dynamics. We believe the PB kernel also possesses this pairwise-correlation property. Whether the kernel retains the expressivity and robustness attributes of LTCs remains to be investigated in future work.

F WHY DOES PB KERNEL OUTPERFORM KB KERNEL?

One possible reason why PB outperforms KB is that we limit the correlation terms by truncating at order p. This limitation arises from how S4 blocks are constructed as a stack of many 1-D blocks, which does not computationally allow us to exploit higher-order correlation terms due to their high computational cost. This truncation might also reduce the expressivity of the KB kernel, but not of PB, whose correlation terms do not depend on A. A potential solution would be a Liquid-S5 instance, in which we could apply a parallel scan, introduced in the concurrent work S5 (Smith et al., 2023), directly to the linear LTC system (a time-varying SSM). This is possible because the state transitions at each time step can be precomputed. We would then not need to truncate the kernel and would obtain all correlation terms for free. This is an exciting extension to Liquid-S4 that we are exploring in future work.
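To make the scan idea concrete, the following is a minimal NumPy sketch, under our own assumptions rather than an implementation from S5 or Liquid-S4: a time-varying linear recurrence x_t = A_t x_{t-1} + b_t (with A_t precomputable per step, e.g., a liquid transition A + B u_t) is expressed through an associative composition operator, so the prefix products could equally be evaluated with a parallel scan. The function names here are illustrative.

```python
import numpy as np

def combine(e2, e1):
    """Associative operator for linear recurrences x_t = A_t x_{t-1} + b_t.
    Applying e1 first and then e2 composes to (A2 @ A1, A2 @ b1 + b2)."""
    A2, b2 = e2
    A1, b1 = e1
    return (A2 @ A1, A2 @ b1 + b2)

def scan_time_varying_ssm(As, bs, x0):
    """Evaluate the time-varying recurrence via prefix compositions of
    `combine`. Shown sequentially for clarity; because `combine` is
    associative, the same prefixes admit a parallel scan."""
    n = As[0].shape[0]
    acc = (np.eye(n), np.zeros_like(bs[0]))  # identity element (I, 0)
    xs = []
    for A_t, b_t in zip(As, bs):
        acc = combine((A_t, b_t), acc)       # prefix (P_t, q_t)
        xs.append(acc[0] @ x0 + acc[1])      # x_t = P_t x0 + q_t
    return np.stack(xs)
```

The key design point is that the per-step transitions A_t never need to be known in closed form ahead of the scan, only precomputed from the inputs, which is why no truncation of correlation terms would be required.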



Figure 1: Performance vs Liquid Order in Liquid-S4 for A) ListOps, and B) IMDB datasets. More in Appendix. (n=3)

Figure 2: Performance vs Liquid Order in Liquid-S4. (n=3)

Performance on Long Range Arena Tasks. Numbers indicate validation accuracy (standard deviation). The accuracy of models denoted by * is reported from (Tay et al., 2020b). Methods denoted by ** are reported from (Gu et al., 2022a). The rest of the models' performance results are reported from the cited paper. See Appendix for accuracy on the test set.

Performance on BIDMC Vital Signs dataset. Numbers indicate RMSE on the test set. Models denoted by * are reported from (Gu et al., 2022b). The rest of the models' performance results are reported from the cited paper.

Performance on Raw Speech Command dataset with Full 35 Labels. Numbers indicate accuracy on the test set. The baseline models are reported from Table 11 of (Gu et al., 2022b).

Performance on Long Range Arena Tasks. Numbers for Liquid-S4 kernels indicate test accuracy (standard deviation). The rest of the models' performance results are reported from the cited paper. Liquid-S4 is used with its PB kernel.

Hyperparameters for obtaining best performing models. BN= Batch normalization, LN = Layer normalization, WD= Weight decay.

Performance on the Raw Speech Command dataset with the reduced ten classes (SC10). Numbers indicate validation accuracy. The accuracy of baseline models is reported from Table

ACKNOWLEDGMENTS

This research was supported in part by the AI2050 program at Schmidt Futures (Grant G-22-63172) and the United States Air Force Artificial Intelligence Accelerator under Cooperative Agreement Number FA8750-19-2-1000. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the United States Air Force or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes, notwithstanding any copyright notation herein. We are very grateful.


A EXAMPLE LIQUID-S4 KERNEL

$K_{\text{liquid}} * u_{\text{correlations}} = \left( CA^{(k-1)}B^2, \ldots, CB^2,\; CA^{(k-2)}B^3, \ldots, CB^3,\; CA^{(k-3)}B^4, \ldots, CB^4 \right) * u_{\text{correlations}}$  (16)

Here, $u_{\text{correlations}}$ is a vector of length $\binom{k+1}{2} + \binom{k+1}{3} + \binom{k+1}{4}$ containing the input-correlation terms of orders 2, 3, and 4, and $K_{\text{liquid}}$ is the kernel given in Eq. 16.
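As a sanity check on the stated vector length, here is a small sketch (our own illustrative construction, not code from the paper) that enumerates all order-q products of k+1 inputs; the count for each order q is exactly $\binom{k+1}{q}$.

```python
from itertools import combinations
from math import prod

def correlation_terms(u, orders=(2, 3, 4)):
    """Sketch of the u_correlations vector: all order-q products of the
    k+1 inputs in u, for each q in `orders`."""
    terms = []
    for q in orders:
        # every q-element subset of inputs contributes one product term
        terms += [prod(c) for c in combinations(u, q)]
    return terms
```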

B PROOF OF PROPOSITION 1

Proposition. The Liquid-S4 kernel for each order $p \in P$, $K_{\text{liquid}}$, can be computed by an anti-diagonal transformation (flip operation) of the product of the S4 convolution kernel, $K = \left( CB, CAB, \ldots, CA^{L-1}B \right)$, and the vector $B^{p-1}$.

Proof. This can be shown by unrolling the S4 convolution kernel, multiplying its components by $B^{p-1}$, and performing an anti-diagonal transformation to obtain the corresponding liquid kernel. For $p = 2$ (correlations of order 2), the S4 kernel is multiplied by $B$; the resulting kernel is $\left( CB^2, CAB^2, \ldots, CA^{L-1}B^2 \right)$. We obtain the liquid kernel by flipping this kernel, giving $\left( CA^{L-1}B^2, \ldots, CAB^2, CB^2 \right)$, which is convolved with the 2-term correlation terms ($p = 2$). Similarly, we can obtain the liquid kernels for higher liquid orders, which yields the statement of the proposition.
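The flip construction can be sketched numerically for a scalar (1-D state) SSM, where the elementwise product with $b^{p-1}$ is unambiguous; this is a minimal illustration of the proposition, not the paper's multi-dimensional implementation.

```python
import numpy as np

def s4_kernel_scalar(a, b, c, L):
    """Plain S4 convolution kernel K = (cb, cab, ..., ca^{L-1}b)
    for a scalar SSM with parameters a, b, c."""
    return np.array([c * a**k * b for k in range(L)])

def liquid_kernel_scalar(a, b, c, L, p):
    """Proposition 1: multiply each S4 kernel term by b^{p-1},
    then flip (anti-diagonal transformation)."""
    return np.flip(s4_kernel_scalar(a, b, c, L) * b**(p - 1))
```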

C HYPERPARAMETERS

Learning Rate. Liquid-S4 generally requires a smaller learning rate than S4 and S4D blocks.

Setting ∆t_max and ∆t_min. We set ∆t_max to 0.2 for all experiments, while ∆t_min was set, following the recommendations of (Gu et al., 2022c), proportional to 1/seq_length.

Causal vs. Bidirectional Modeling. Liquid-S4 works better when used as a causal model, i.e., without a bidirectional configuration.

d_state. We observed that the Liquid-S4 PB kernel performs best with smaller individual state sizes d_state. For instance, we achieve SOTA results on ListOps, IMDB, and Speech Commands with a state size of 7, significantly reducing the number of parameters required to solve these tasks.

Choice of Liquid-S4 Kernel. In all experiments, we choose our simplified PB kernel over the KB kernel due to its lower computational cost and better performance. We recommend the PB kernel.

Choice of the liquid order p. In all experiments, we start by setting p, the liquid order, to 2, meaning the liquid kernel is computed only for correlation terms of order 2. In principle, we observe that higher p values consistently enhance the representation-learning capacity of Liquid-S4 modules across all experiments. We recommend p = 3 as a default for experiments with Liquid-S4.
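The recommendations above can be collected into a small default-configuration helper; the function name and dictionary keys are illustrative assumptions, not identifiers from the Liquid-S4 codebase.

```python
def liquid_s4_defaults(seq_length):
    """Hypothetical default hyperparameters following the guidance above."""
    return {
        "dt_max": 0.2,               # fixed at 0.2 in all experiments
        "dt_min": 1.0 / seq_length,  # proportional to 1 / seq_length
        "d_state": 7,                # small state sizes suit the PB kernel
        "kernel": "PB",              # simplified PB preferred over KB
        "liquid_order_p": 3,         # recommended default liquid order
        "bidirectional": False,      # causal modeling works better
    }
```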

