EFFECTIVELY MODELING TIME SERIES WITH SIMPLE DISCRETE STATE SPACES

Abstract

Time series modeling is a well-established problem, which often requires that methods (1) expressively represent complicated dependencies, (2) forecast long horizons, and (3) efficiently train over long sequences. State-space models (SSMs) are classical models for time series, and prior works combine SSMs with deep learning layers for efficient sequence modeling. However, we find fundamental limitations with these prior approaches, proving their SSM representations cannot express autoregressive time series processes. We thus introduce SPACETIME, a new state-space time series architecture that improves all three criteria. For expressivity, we propose a new SSM parameterization based on the companion matrix (a canonical representation for discrete-time processes), which enables SPACETIME's SSM layers to learn desirable autoregressive processes. For long horizon forecasting, we introduce a "closed-loop" variation of the companion SSM, which enables SPACETIME to predict many future time-steps by generating its own layer-wise inputs. For efficient training and inference, we introduce an algorithm that reduces the memory and compute of a forward pass with the companion matrix. With sequence length ℓ and state-space size d, we go from Õ(dℓ) naïvely to Õ(d + ℓ). In experiments, our contributions lead to state-of-the-art results on extensive and diverse benchmarks, with best or second-best AUROC on 6 / 7 ECG and speech time series classification tasks, and best MSE on 14 / 16 Informer forecasting tasks. Furthermore, we find SPACETIME (1) fits AR(p) processes that prior deep SSMs fail on, (2) forecasts notably more accurately on longer horizons than prior state-of-the-art, and (3) speeds up training on real-world ETTh1 data by 73% and 80% relative wall-clock time over Transformers and LSTMs.

1. INTRODUCTION

Time series modeling is a well-established problem, with tasks such as forecasting and classification motivated by many domains such as healthcare, finance, and engineering (Shumway et al., 2000). However, effective time series modeling presents several challenges:

• First, methods should be expressive enough to capture complex, long-range, and autoregressive dependencies. Time series data often reflects higher order dependencies, seasonality, and trends, which govern how past samples determine future samples (Chatfield, 2000). This motivates many classical approaches that model these properties (Box et al., 1970; Winters, 1960), alongside expressive deep learning mechanisms such as attention (Vaswani et al., 2017) and fully connected layers that model interactions between every sample in an input sequence (Zeng et al., 2022).
• Second, methods should be able to forecast a wide range of long horizons over various data domains. Reflecting real world demands, popular forecasting benchmarks evaluate methods on 34 different tasks (Godahewa et al., 2021) and 24-960 time-step horizons (Zhou et al., 2021). Furthermore, as testament to accurately learning time series processes, forecasting methods should ideally also be able to predict future time-steps on horizons they were not explicitly trained on.
• Finally, methods should be efficient with training and inference. Many time series applications require processing very long sequences, e.g., classifying audio data with sampling rates up to 16,000 Hz (Warden, 2018). To handle such settings, where we still need large enough models that can expressively model this data, training and inference should ideally scale subquadratically with sequence length and model size in time and space complexity.

Figure 1: We learn time series processes as state-space models (SSMs) (top left). We represent SSMs with the companion matrix, which is highly expressive for discrete time series (top middle), and compute such SSMs efficiently as convolutions or recurrences via a shift + low-rank decomposition (top right). We use these SSMs to build SPACETIME, a new time series architecture broadly effective across tasks and domains (bottom).

Unfortunately, existing time series methods struggle to achieve all three criteria. Classical methods (e.g., ARIMA (Box et al., 1970), exponential smoothing (ETS) (Winters, 1960)) often require manual data preprocessing and model selection to identify expressive-enough models. Deep learning methods commonly train to predict specific horizon lengths, i.e., as direct multi-step forecasting (Chevillon, 2007), and we find this hurts their ability to forecast longer horizons (Sec. 4.2.2). They also face limitations achieving high expressivity and efficiency. Fully connected networks (FCNs) in Zeng et al. (2022) scale quadratically in O(ℓh) space complexity (with input length ℓ and forecast length h). Recent Transformer-based models reduce this complexity to O(ℓ + h), but do not always outperform the above FCNs on forecasting benchmarks (Liu et al., 2022; Zhou et al., 2021).

We thus propose SPACETIME, a deep state-space architecture for effective time series modeling. To achieve this, we focus on improving each criterion via three core contributions:

1. For expressivity, our key idea and building block is a linear layer that models time series processes as state-space models (SSMs) via the companion matrix (Fig. 1). We start with SSMs due to their connections to both classical time series analysis (Kalman, 1960; Hamilton, 1994) and recent deep learning advances (Gu et al., 2021a). Classically, many time series models such as ARIMA and exponential smoothing (ETS) can be expressed as SSMs (Box et al., 1970; Winters, 1960).
Meanwhile, recent state-of-the-art deep sequence models (Gu et al., 2021a) have used SSMs to outperform Transformers and LSTMs on challenging long-range benchmarks (Tay et al., 2020). Their primary innovations show how to formulate SSMs as neural network parameters that are practical to train. However, we find limitations with these deep SSMs for time series data. While we build on their advances, we prove that these prior SSM representations (Gu et al., 2021b; a; Gupta, 2022) cannot capture autoregressive processes fundamental for time series. We thus specifically propose the companion matrix representation for its expressive and memory-efficient properties. We prove that the companion matrix SSM recovers fundamental autoregressive (AR) and smoothing processes modeled in classical techniques such as ARIMA and ETS, while only requiring O(d) memory to represent an O(d²) matrix. Thus, SPACETIME inherits the benefits of prior SSM-based sequence models, but introduces improved expressivity to recover fundamental time series processes simply through its layer weights.

2. For forecasting long horizons, we introduce a new "closed-loop" view of SSMs. Prior deep SSM architectures either apply the SSM as an "open-loop" (Gu et al., 2021a), where fixed-length inputs necessarily generate same-length outputs, or use closed-loop autoregression where final layer outputs are fed through the entire network as next-time-step inputs (Goel et al., 2022). We describe issues with both approaches in Sec. 3.2, and instead achieve autoregressive forecasting in a deep network with only a single SSM layer. We do so by explicitly training the SSM layer to predict its next time-step inputs, alongside its usual outputs. This allows the SSM to recurrently generate its own future inputs that lead to desired outputs (i.e., those that match an observed time series), so we can forecast over many future time-steps without explicit data inputs.

3. For efficiency, we introduce an algorithm for efficient training and inference with the companion matrix SSM. We exploit the companion matrix's structure as a "shift plus low-rank" matrix, which allows us to reduce the time and space complexity for computing SSM hidden states and outputs from Õ(dℓ) to Õ(d + ℓ) in SSM state size d and input sequence length ℓ.

In experiments, we find SPACETIME consistently obtains state-of-the-art or near-state-of-the-art results, achieving best or second-best AUROC on 6 out of 7 ECG and audio speech time series classification tasks, and best mean-squared error (MSE) on 14 out of 16 Informer benchmark forecasting tasks (Zhou et al., 2021). SPACETIME also sets a new best average ranking across 34 tasks on the Monash benchmark (Godahewa et al., 2021). We connect these gains with improvements on our three effective time series modeling criteria. For expressivity, on synthetic ARIMA processes SPACETIME learns AR processes that prior deep SSMs cannot. For long horizon forecasting, SPACETIME consistently outperforms prior state-of-the-art on the longest horizons by large margins. SPACETIME also generalizes better to new horizons not used for training. For efficiency, on speed benchmarks SPACETIME obtains 73% and 80% relative wall-clock speedups over parameter-matched Transformers and LSTMs respectively, when training on real-world ETTh1 data.

2. PRELIMINARIES

Problem setting. We evaluate effective time series modeling with classification and forecasting tasks. For both tasks, we are given input sequences of $\ell$ "look-back" or "lag" time series samples $u_{t-\ell:t-1} = (u_{t-\ell}, \ldots, u_{t-1}) \in \mathbb{R}^{\ell \times m}$ for sample feature size $m$. For classification, we aim to classify the sequence as the true class $y$ out of possible classes $\mathcal{Y}$. For forecasting, we aim to correctly predict $H$ future time-steps over a "horizon" $y_{t,t+H-1} = (u_t, \ldots, u_{t+H-1}) \in \mathbb{R}^{H \times m}$.

State-space models for time series. We build on the discrete-time state-space model (SSM), which maps observed inputs $u_k$ to hidden states $x_k$, before projecting back to observed outputs $y_k$ via

$$x_{k+1} = A x_k + B u_k \tag{1}$$
$$y_k = C x_k + D u_k \tag{2}$$

where $A \in \mathbb{R}^{d \times d}$, $B \in \mathbb{R}^{d \times m}$, $C \in \mathbb{R}^{m' \times d}$, and $D \in \mathbb{R}^{m' \times m}$. For now, we stick to single-input single-output conventions where $m = m' = 1$, and let $D = 0$. To model time series in the single SSM setting, we treat $u$ and $y$ as copies of the same process, such that

$$y_{k+1} = u_{k+1} = C(A x_k + B u_k) \tag{3}$$

We can thus learn a time series SSM by treating $A, B, C$ as black-box parameters in a neural net layer, i.e., by updating $A, B, C$ via gradient descent s.t. with input $u_k$ and state $x_k$ at time-step $k$, following (3) predicts $\hat{y}_{k+1}$ that matches the next time-step sample $y_{k+1} = u_{k+1}$. This SSM framework and modeling setup is similar to prior works (Gu et al., 2021b; a), which adopt a similar interpretation of inputs and outputs being derived from the "same" process, e.g., for language modeling. Here we study and improve this framework for time series modeling. As extensions, in Sec. 3.1.1 we show how (1) and (2) express univariate time series with the right $A$ representation. In Sec. 3.1.2 we discuss the multi-layer setting, where layer-specific $u$ and $y$ now differ, and we only model first layer inputs and last layer outputs as copies of the same time series process.
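As a rough illustration of the recurrence in (1) and (2) with D = 0 and single-input single-output shapes, the following NumPy sketch steps an SSM over a short sequence and then forms the one-step-ahead prediction C(A x_k + B u_k). The helper name `ssm_scan` and the random weights are assumptions made for this example; they are not taken from the paper's code.

```python
import numpy as np

def ssm_scan(A, B, C, u, x0=None):
    """Run x_{k+1} = A x_k + B u_k and y_k = C x_k (D = 0) over a 1-D input u."""
    d = A.shape[0]
    x = np.zeros(d) if x0 is None else x0.copy()
    ys = []
    for u_k in u:
        ys.append(C @ x)        # y_k = C x_k
        x = A @ x + B * u_k     # x_{k+1} = A x_k + B u_k
    return np.array(ys), x      # outputs y_0..y_{len(u)-1} and the final state

# Hypothetical weights standing in for trained parameters.
rng = np.random.default_rng(0)
d = 3
A = 0.5 * rng.standard_normal((d, d))
B = rng.standard_normal(d)
C = rng.standard_normal(d)
u = rng.standard_normal(8)

y, x_last = ssm_scan(A, B, C, u)
# One-step-ahead prediction of the next sample: C (A x_k + B u_k).
y_next = C @ x_last
```

In training, `y_next` would be compared against the observed next sample; here the weights are random, so the values carry no meaning beyond showing the data flow.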

3. METHOD: SPACETIME

We now present SPACETIME, a deep architecture that uses structured state-spaces for more effective time-series modeling. SPACETIME is a standard multi-layer encoder-decoder sequence model, built as a stack of repeated layers that each parametrize multiple SSMs. We designate the last layer as the "decoder", and prior layers as "encoder" layers. Each encoder layer processes an input time series sample as a sequence-to-sequence map. The decoder layer then takes the encoded sequence representation as input and outputs a prediction (for classification) or output sequence (for forecasting). Below we expand on our contributions that allow SPACETIME to improve expressivity, long-horizon forecasting, and efficiency of time series modeling. In Sec. 3.1, we present our key building block, a layer that parametrizes the companion matrix SSM (companion SSM) for expressive and autoregressive modeling. In Sec. 3.2, we introduce a specific instantiation of the companion SSM to flexibly forecast over long horizons. In Sec. 3.3, we provide an efficient inference algorithm that allows SPACETIME to train and predict over long sequences in sub-quadratic time and space complexity.

3.1. THE MULTI-SSM SPACETIME LAYER

We discuss our first core contribution and key building block of our model, the SPACETIME layer, which captures the companion SSM's expressive properties, and prove that the SSM represents multiple fundamental processes. To scale up this expressiveness in a neural architecture, we then go over how we represent and compute multiple SSMs in each SPACETIME layer. We finally show how the companion SSM's expressiveness allows us to build in various time series data preprocessing operations in a SPACETIME layer via different weight initializations of the same layer architecture.

3.1.1. EXPRESSIVE STATE-SPACE MODELS WITH THE COMPANION MATRIX

For expressive time series modeling, our SSM parametrization represents the state matrix $A$ as a companion matrix. Our key motivation is that $A$ should allow us to capture autoregressive relationships between a sample $u_k$ and various past samples $u_{k-1}, u_{k-2}, \ldots, u_{k-n}$. Such dependencies are a basic yet essential premise for time series modeling; they underlie many fundamental time series processes, e.g., those captured by standard ARIMA models. For example, consider the simplest version of this, where $u_k$ is a linear combination of $p$ prior samples (with coefficients $\phi_1, \ldots, \phi_p$)

$$u_k = \phi_1 u_{k-1} + \phi_2 u_{k-2} + \ldots + \phi_p u_{k-p} \tag{4}$$

i.e., a noiseless, unbiased AR($p$) process in standard ARIMA time series analysis (Box et al., 1970). To allow (3) to express (4), we need the hidden state $x_k$ to carry information about past samples. However, while setting the state-space matrices as trainable neural net weights may suggest we can learn arbitrary task-desirable $A$ and $B$ via supervised learning, prior work showed this could not be done without restricting $A$ to specific classes of matrices (Gu et al., 2021b; Gupta, 2022).

Fortunately, we find that a class of relatively simple $A$ matrices suffices. We propose to set $A \in \mathbb{R}^{d \times d}$ as the $d \times d$ companion matrix, a square matrix of the form:

$$\text{(Companion Matrix)} \quad A = \begin{bmatrix} 0 & 0 & \cdots & 0 & a_0 \\ 1 & 0 & \cdots & 0 & a_1 \\ 0 & 1 & \cdots & 0 & a_2 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \cdots & 1 & a_{d-1} \end{bmatrix} \quad \text{i.e.,} \quad A_{i,j} = \begin{cases} 1 & \text{for } i - 1 = j \\ a_i & \text{for } j = d - 1 \\ 0 & \text{otherwise} \end{cases} \tag{5}$$

Then simply letting state dimension $d = p$, assuming initial hidden state $x_0 = 0$, and setting $a := [a_0 \; a_1 \; \ldots \; a_{d-1}]^T = 0$, $B = [1 \; 0 \; \ldots \; 0]^T$, $C = [\phi_1 \; \ldots \; \phi_p]$ allows the discrete SSM in (1, 2) to recover the AR($p$) process in (4). We next extend this result in Proposition 1, proving in App. B that setting $A$ as the companion matrix allows the SSM to recover a wide range of fundamental time series and dynamical system processes beyond the AR($p$) process.

Proposition 1. A companion state matrix SSM can represent ARIMA (Box et al., 1970), exponential smoothing (Winters, 1960; Holt, 2004), and controllable linear time-invariant systems (Chen, 1984).

As a result, by training neural network layers that parameterize the companion SSM, we provably enable these layers to learn the ground-truth parameters for multiple time series processes. In addi-

Computation. To compute the companion SSM, we could use the recurrence in (1). However, this sequential operation is slow on modern GPUs, which parallelize matrix multiplications. Luckily, as described in Gu et al. (2021a) we can also compute the SSM as a 1-D convolution. This enables parallelizable inference and training. To see how, note that given a sequence with at least $k$ inputs and hidden state $x_0 = 0$, the hidden state and output at time-step $k$ by induction are:

$$x_k = \sum_{j=0}^{k-1} A^{k-1-j} B u_j \quad \text{and} \quad y_k = \sum_{j=0}^{k-1} C A^{k-1-j} B u_j \tag{6}$$

We can thus compute hidden state $x_k$ and output $y_k$ as 1-D convolutions with "filters" as

$$F^x = (B, AB, A^2 B, \ldots, A^{\ell-1} B) \quad \text{(Hidden State Filter)} \tag{7}$$
$$F^y = (CB, CAB, CA^2 B, \ldots, CA^{\ell-1} B) \quad \text{(Output Filter)} \tag{8}$$
$$x_k = (F^x * u)[k] \quad \text{and} \quad y_k = (F^y * u)[k] \tag{9}$$

So when we have inputs available for each output (i.e., equal-sized input and output sequences) we can obtain outputs by first computing output filters $F^y$ (8), and then computing outputs efficiently with the Fast Fourier Transform (FFT). We thus compute each encoder SSM as a convolution.

For now we note two caveats. Having inputs for each output is not always true, e.g., with long horizon forecasting. Efficient inference also importantly requires that $F^y$ can be computed efficiently, but this is not necessarily trivial for time series: we may have long input sequences with large $k$. Fortunately we later provide solutions for both. In Sec. 3.2, we show how to predict output samples many time-steps ahead of our last input sample via a "closed-loop" forecasting SSM. In Sec. 3.3 we show how to compute both hidden state and output filters efficiently over long sequences via an efficient inference algorithm that handles the repeated powering of $A^k$.
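The two claims above, that a companion SSM with a = 0, B = [1 0 ... 0]^T and C = [ϕ_1 ... ϕ_p] reproduces a noiseless AR(p) process, and that the recurrence can be replaced by a convolution with the output filter, can be checked numerically. The sketch below is a minimal NumPy illustration; the AR(3) coefficients, the seed values, and the helper names `companion` and `output_filter` are assumptions for this example, and the filter is built naively rather than with the paper's efficient algorithm.

```python
import numpy as np

def companion(a):
    """d x d companion matrix: ones on the subdiagonal, vector a in the last column."""
    d = len(a)
    A = np.zeros((d, d))
    A[np.arange(1, d), np.arange(d - 1)] = 1.0
    A[:, -1] = a
    return A

def output_filter(A, B, C, length):
    """Output filter (CB, CAB, ..., C A^{length-1} B), built by repeated multiplication."""
    f, v = [], B.copy()
    for _ in range(length):
        f.append(C @ v)
        v = A @ v
    return np.array(f)

phi = np.array([0.5, -0.3, 0.1])      # hypothetical AR(3) coefficients
p = len(phi)
A = companion(np.zeros(p))            # a = 0 gives a pure shift matrix
B = np.eye(p)[0]                      # B = [1, 0, ..., 0]^T
C = phi                               # C = [phi_1, ..., phi_p]

# Generate a noiseless AR(3) series from p arbitrary seed values.
L = 32
u = np.zeros(L)
u[:p] = [1.0, 0.5, -0.25]
for k in range(p, L):
    u[k] = phi @ u[k - p:k][::-1]     # phi_1 u_{k-1} + ... + phi_p u_{k-p}

# Recurrence view: y_k = C x_k with x_0 = 0.
x = np.zeros(p)
y_rec = np.zeros(L)
for k in range(L):
    y_rec[k] = C @ x
    x = A @ x + B * u[k]

# Convolution view: y_k = sum_{j<k} F[k-1-j] u_j, via zero-padded FFTs.
F_y = output_filter(A, B, C, L)
n = 2 * L                             # padding avoids circular wrap-around
conv = np.fft.irfft(np.fft.rfft(F_y, n) * np.fft.rfft(u, n), n)[:L]
y_fft = np.concatenate([[0.0], conv[:L - 1]])   # y_0 = 0, then y_k = conv[k-1]
```

Both views give the same outputs, and from time-step p onward the output equals the observed sample u_k, which is the sense in which the layer weights recover the AR(p) process.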

3.1.3. BUILT-IN DATA PREPROCESSING WITH COMPANION SSMS

We now show how beyond autoregressive modeling, the companion SSM also enables SPACETIME layers to do standard data preprocessing techniques used to handle nonstationarities. Consider differencing and smoothing, two classical techniques to handle nonstationarity and noise:

$$u'_k = u_k - u_{k-1} \;\; \text{(1st-order differencing)} \qquad u'_k = \frac{1}{n} \sum_{i=0}^{n-1} u_{k-i} \;\; \text{($n$-order moving average smoothing)} \tag{10}$$

We explicitly build these preprocessing operations into a SPACETIME layer by simply initializing companion SSM weights. Furthermore, by specifying weights for multiple SSMs, we simultaneously perform preprocessing with various orders in one forward pass. We do so by setting $a = 0$ and $B = [1, 0, \ldots, 0]$

Challenges and limitations. For forecasting, a model must process an input lag sequence of length $\ell$ and output a forecast sequence of length $h$, where $h \neq \ell$ necessarily. Many state-of-the-art neural nets thus train by specifically predicting $h$-long targets given $\ell$-long inputs. However, in Sec. 4.2.2 we find this hurts transfer to new horizons in other models, as they only train to predict specific horizons. Alternatively, we could output horizons autoregressively through the network similar to stacked RNNs as in SASHIMI (Goel et al., 2022) or DeepAR (Salinas et al., 2020). However, we find this can still be relatively inefficient, as it requires passing states to each layer of a deep network.

Closed-loop SSM solution. Our approach is similar to autoregression, but only applied at a single SPACETIME layer. We treat the inputs and outputs as distinct processes in a multi-layer network, and add another matrix $K$ to each decoder SSM to model future input time-steps explicitly. Letting $\bar{u} = (\bar{u}_0, \ldots, \bar{u}_{\ell-1})$ be the input sequence to a decoder SSM and $u = (u_0, \ldots, u_{\ell-1})$ be the original input sequence, we jointly train $A, B, C, K$ such that $x_{k+1} = A x_k + B \bar{u}_k$, and

$$\hat{y}_{k+1} = C x_{k+1} \quad (\text{where } \hat{y}_{k+1} = y_{k+1} = u_{k+1}) \tag{11}$$
$$\hat{u}_{k+1} = K x_{k+1} \quad (\text{where } \hat{u}_{k+1} = \bar{u}_{k+1}) \tag{12}$$

We thus train the decoder SPACETIME layer to explicitly model its own next time-step inputs with $A, B, K$, and model its next time-step outputs (i.e., future time series samples) with $A, B, C$. For forecasting, we first process the lag terms via (11) and (12) as convolutions

$$x_k = \sum_{j=0}^{k-1} A^{k-1-j} B \bar{u}_j \quad \text{and} \quad \hat{u}_k = K \sum_{j=0}^{k-1} A^{k-1-j} B \bar{u}_j \tag{13}$$

for $k \in [0, \ell - 1]$. To forecast $h$ future time-steps, with last hidden state $x_\ell$ we first predict future input $\hat{u}_\ell$ via (12). Plugging this back into the SSM and iterating for $h - 1$ future time-steps leads to

$$x_{\ell+i} = (A + BK)^i x_\ell \quad \text{for } i = 1, \ldots, h - 1 \tag{14}$$
$$\Rightarrow \; (y_\ell, \ldots, y_{\ell+h-1}) = \left( C (A + BK)^i x_\ell \right)_{i \in [h-1]} \tag{15}$$

We can thus use Eq. 15 to get future outputs without sequential recurrence, using the same FFT operation as for Eq. 8, 9. This flexibly recovers $O(\ell + h)$ time complexity for forecasting $h$ future time-steps, assuming that powers $(A + BK)^h$ are taken care of. Next, we derive an efficient matrix powering algorithm to take care of this powering and enable fast training and inference in practice.
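To make the closed-loop step concrete, the sketch below compares a sequential rollout, in which the layer feeds itself the predicted input û = K x, against the closed form C(A + BK)^i x_ℓ. All weights are random placeholders (in the paper they are trained jointly), and `x_l` simply stands in for the hidden state reached after the lag window; both are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 4, 6

# Hypothetical decoder weights (placeholders for trained parameters).
a = np.array([0.1, -0.2, 0.05, 0.3])
A = np.zeros((d, d))
A[np.arange(1, d), np.arange(d - 1)] = 1.0    # shift part of the companion matrix
A[:, -1] = a                                   # last column holds a
B = np.eye(d)[0]
C = rng.standard_normal(d)
K = 0.2 * rng.standard_normal(d)
x_l = rng.standard_normal(d)                   # stand-in for the last hidden state

# (a) Sequential rollout: generate the next input from the current state.
x, y_seq = x_l.copy(), []
for _ in range(h):
    y_seq.append(C @ x)          # output at this step
    u_hat = K @ x                # predicted next input
    x = A @ x + B * u_hat        # equals (A + B K) x
y_seq = np.array(y_seq)

# (b) Closed form: C (A + B K)^i x_l for i = 0, ..., h-1.
M = A + np.outer(B, K)
y_closed = np.array([C @ np.linalg.matrix_power(M, i) @ x_l for i in range(h)])
```

The two arrays agree, which is why the horizon h can be chosen freely at test time: only the number of powers of A + BK changes, not the trained weights.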

3.3. EFFICIENT INFERENCE WITH THE COMPANION SSM

Algorithm 1 Efficient Output Filter $F^y$ Computation
Require: $A$ is a companion matrix parameterized by the last column $a \in \mathbb{R}^d$, $B \in \mathbb{R}^d$, $\tilde{C} = C(I - A^\ell) \in \mathbb{R}^d$, sequence length $\ell$.
1: Define $\mathrm{quad}(u, v) \in \mathbb{R}^\ell$ for vectors $u, v \in \mathbb{R}^d$: compute $q = u * v$ (linear convolution), zero-pad to length $\ell \lceil d/\ell \rceil$, split into $\lceil d/\ell \rceil$ chunks of size $\ell$ of the form $[q^{(1)}, \ldots, q^{(\lceil d/\ell \rceil)}]$ and return the length-$\ell$ Fourier transform of the sum $\mathcal{F}_\ell(q^{(1)} + \cdots + q^{(\lceil d/\ell \rceil)})$.
2: Compute the roots of unity $z = [\omega^0, \ldots, \omega^{\ell-1}]$ where $\omega = \exp(-2\pi i/\ell)$.
3: Compute
$F^y = \mathcal{F}_\ell^{-1}(\tilde{F}^y)$.

We finally discuss our third contribution, where we derive an algorithm for efficient training and inference with the companion SSM. To motivate this section, we note that prior efficient algorithms to compute powers of the state matrix $A$ were only proposed to handle specific classes of $A$, and do not apply to the companion matrix (Gu et al., 2021a; Goel et al., 2022; Gu et al., 2022).

Recall from Sec. 3.1.2 that for a sequence of length $\ell$, we want to construct the output filter $F^y = (CB, \ldots, CA^{\ell-1}B)$, where $A$ is a $d \times d$ companion matrix and $B, C$ are $d \times 1$ and $1 \times d$ matrices. Naïvely, we could use sparse matrix multiplications to compute powers $CA^jB$ for $j = 0, \ldots, \ell - 1$ sequentially. As $A$ has $O(d)$ nonzeros, this would take $O(\ell d)$ time. We instead derive an algorithm that constructs this filter in $O(\ell \log \ell + d \log d)$ time. The main idea is that rather than computing the filter directly, we can compute its spectrum (its discrete Fourier transform) more easily, i.e.,

$$\tilde{F}^y[m] := \mathcal{F}(F^y)[m] = \sum_{j=0}^{\ell-1} C A^j \omega^{mj} B = C(I - A^\ell)(I - A\omega^m)^{-1} B, \quad m = 0, 1, \ldots, \ell - 1,$$

where $\omega = \exp(-2\pi i/\ell)$ is the $\ell$-th root of unity. This reduces to computing the quadratic form of the resolvent $(I - A\omega^m)^{-1}$ on the roots of unity (the powers of $\omega$). Since $A$ is a companion matrix, we can write $A$ as a shift matrix plus a rank-1 matrix, $A = S + a e_d^T$. Thus Woodbury's formula reduces this computation to the resolvent of a shift matrix $(I - S\omega^m)^{-1}$, with a rank-1 correction. This resolvent can be shown analytically to be a lower-triangular matrix consisting of roots of unity, and its quadratic form can be computed by the Fourier transform of a linear convolution of size $d$. Thus one can construct $F^y$ by linear convolution and the FFT, resulting in $O(\ell \log \ell + d \log d)$ time. We validate in Sec. 4.2.3 that Algorithm 1 leads to a wall-clock time speedup of 2× compared to computing the output filter naïvely by powering $A$. In App. B.2, we prove the time complexity $O(\ell \log \ell + d \log d)$ and correctness of Algorithm 1. We also provide an extension to the closed-loop SSM, which can also be computed in subquadratic time as $A + BK$ is a shift plus rank-2 matrix.
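The spectral identity and the shift-plus-rank-1 reduction described above can be sanity-checked numerically. The sketch below is not Algorithm 1 and makes no attempt at its O(ℓ log ℓ + d log d) cost: it uses dense linear solves purely to verify that (i) the inverse FFT of C(I − A^ℓ)(I − Aω^m)^{-1}B returns the naive filter, and (ii) the resolvent of A can be obtained from the resolvent of the shift matrix S with a rank-1 (Sherman-Morrison) correction. The sizes, weights, and helper names are assumptions for this example.

```python
import numpy as np

def companion(a):
    """d x d companion matrix: ones on the subdiagonal, vector a in the last column."""
    d = len(a)
    A = np.zeros((d, d))
    A[np.arange(1, d), np.arange(d - 1)] = 1.0
    A[:, -1] = a
    return A

rng = np.random.default_rng(0)
d, L = 4, 16
a = np.array([0.1, -0.2, 0.05, 0.3])   # small entries keep powers of A bounded
A = companion(a)
B = rng.standard_normal(d)
C = rng.standard_normal(d)
I = np.eye(d)

# Naive filter: F[j] = C A^j B for j = 0, ..., L-1.
F_naive = np.array([C @ np.linalg.matrix_power(A, j) @ B for j in range(L)])

# Spectrum via the resolvent: C (I - A^L) (I - A w^m)^{-1} B on the roots of unity.
omega = np.exp(-2j * np.pi / L)
C_tilde = C @ (I - np.linalg.matrix_power(A, L))
spec = np.array([C_tilde @ np.linalg.solve(I - A * omega**m, B) for m in range(L)])
F_spec = np.fft.ifft(spec).real        # imaginary part is numerical noise

# Same spectrum from the shift resolvent plus a rank-1 correction (A = S + a e_d^T).
S = companion(np.zeros(d))
e_d = np.eye(d)[-1]

def resolvent_quad(m):
    w = omega**m
    Minv_B = np.linalg.solve(I - w * S, B)   # (I - wS)^{-1} B
    Minv_a = np.linalg.solve(I - w * S, a)   # (I - wS)^{-1} a
    corr = w * (e_d @ Minv_B) / (1.0 - w * (e_d @ Minv_a))
    return C_tilde @ (Minv_B + corr * Minv_a)

spec_sm = np.array([resolvent_quad(m) for m in range(L)])
```

The fast algorithm replaces the dense solves with FFT-based convolutions, exploiting the fact that (I − ωS)^{-1} is lower triangular with entries that are powers of ω.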

4. EXPERIMENTS

We test SPACETIME on a broad range of time series forecasting and classification tasks. In Sec. 4.1, we evaluate whether SPACETIME's contributions lead to state-of-the-art results on standard benchmarks. To help explain SPACETIME's performance and validate our contributions, in Sec. 4.2 we then evaluate whether these gains coincide with empirical improvements in expressiveness (Sec. 4.2.1), forecasting flexibility (Sec. 4.2.2), and training efficiency (Sec. 4.2.3).

4.1. MAIN RESULTS: TIME SERIES FORECASTING AND CLASSIFICATION

For forecasting, we evaluate SPACETIME on 40 forecasting tasks from the popular Informer (Zhou et al., 2021) and Monash (Godahewa et al., 2021) benchmarks, testing on horizons 8 to 960 timesteps long. For classification, we evaluate SPACETIME on seven medical ECG or speech audio classification tasks, which test on sequences up to 16,000 time-steps long. For all results, we report mean evaluation metrics over three seeds. ✗ denotes the method was computationally infeasible on allocated GPUs, e.g., due to memory constraints (same resources for all methods; see App. C for details). App. C also contains additional dataset, implementation, and hyperparameter details. Informer (forecasting). We report univariate time series forecasting results in Table 1 , comparing against recent state-of-the-art methods (Zeng et al., 2022; Zhou et al., 2022a) , related state-space models (Gu et al., 2021a) , and other competitive deep architectures. We include extended results on additional horizons and multivariate forecasting in App. D.2. We find SPACETIME obtains lowest MSE and MAE on 14 and 11 forecasting settings respectively, 3× more than prior state-of-the-art. SPACETIME also outperforms S4 on 15 / 16 settings, supporting the companion SSM representation.

Monash (forecasting). We also evaluate on 32 datasets in the Monash forecasting benchmark (Godahewa et al., 2021), spanning domains including finance, weather, and traffic. For space, we report results in Table 20 (App. D.3). We compare against 13 classical and deep learning baselines. SPACETIME achieves best RMSE on 7 tasks and sets new state-of-the-art average performance across all 32 datasets. SPACETIME's relative improvements also notably grow on long horizon tasks (Fig. 6).

ECG (multi-label classification). Beyond forecasting, we show that SPACETIME can also perform state-of-the-art time series classification. To classify sequences, we use the same sequence model architecture in Sec. 3.1. Like prior work (Gu et al., 2021a), we simply use the last-layer FFN to project from number of SSMs to number of classes, and mean pooling over length before a softmax to output class logits. In Table 2, we find that SPACETIME obtains best or second-best AUROC on five out of six tasks, outperforming both general sequence models and specialized architectures.

Speech Audio (single-label classification). We further test SPACETIME on long-range audio classification on the Speech Commands dataset (Warden, 2018). The task is classifying raw audio sequences of length 16,000 into 10 word classes. We use the same pooling operation for classification as in ECG. SPACETIME outperforms domain-specific architectures, e.g., WaveGan-D (Donahue et al., 2018) and efficient Transformers, e.g., Performer (Choromanski et al., 2020) (Table 3).
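The classification head described above (project from the number of SSMs to the number of classes, mean-pool over length, then softmax) can be sketched in a few lines. In this illustration a single linear map stands in for the last-layer FFN, and the shapes, weights, and function name `classify` are assumptions rather than the paper's implementation.

```python
import numpy as np

def classify(features, W, b):
    """features: (length, n_ssm) last-layer outputs; W: (n_ssm, n_classes); b: (n_classes,)."""
    scores_per_step = features @ W + b      # project each time-step to class scores
    pooled = scores_per_step.mean(axis=0)   # mean-pool over the length dimension
    z = pooled - pooled.max()               # subtract max for a numerically stable softmax
    return np.exp(z) / np.exp(z).sum()

rng = np.random.default_rng(0)
length, n_ssm, n_classes = 1000, 16, 10     # hypothetical sizes
features = rng.standard_normal((length, n_ssm))
W = rng.standard_normal((n_ssm, n_classes))
b = np.zeros(n_classes)
probs = classify(features, W, b)
```

Because pooling averages over the length dimension, the same head applies unchanged to sequences of any length.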

4.2. IMPROVEMENT ON CRITERIA FOR EFFECTIVE TIME SERIES MODELING

For further insight into SPACETIME's performance, we now validate that our contributions improve expressivity (4.2.1), forecasting ability (4.2.2), and efficiency (4.2.3) over existing approaches.

4.2.1. EXPRESSIVITY

To first study SPACETIME's expressivity, we test how well SPACETIME can fit controlled autoregressive processes. To validate our theory on SPACETIME's expressivity gains in Sec. 3.1, we compare against recent related SSM architectures such as S4 (Gu et al., 2021a) and S4D (Gu et al., 2022) . For evaluation, we generate noiseless synthetic AR(p) sequences. We test if models learn the true process by inspecting whether the trained model weights recover transfer functions specified by the AR coefficients (Oppenheim, 1999) . We use simple 1-layer 1-SSM models, with state-space size equal to AR p, and predict one time-step given p lagged inputs (the smallest sufficient setting). In Fig. 3 we compare the trained forecasts and transfer functions (as frequency response plots) of SPACETIME, S4, and S4D models on a relatively smooth AR(4) process and sharp AR(6) process. Our results support the relative expressivity of SPACETIME's companion matrix SSM. While all models accurately forecast the AR(4) time series, only SPACETIME recovers the ground-truth transfer functions for both, and notably forecasts the AR(6) process more accurately (Fig. 3c, d ).
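One way to read "recovering the transfer function from the weights" is to evaluate H(z) = C(zI − A)^{-1}B on the unit circle and compare it with the one-step predictor implied by the AR coefficients, Σ_i ϕ_i z^{-i}. The sketch below does this for a companion SSM set to the ground-truth AR weights; the AR(4) coefficients and frequency grid are assumptions for illustration, and this is not the paper's evaluation code.

```python
import numpy as np

phi = np.array([0.6, -0.2, 0.1, -0.05])   # hypothetical AR(4) coefficients
p = len(phi)

# Companion SSM with a = 0, B = e_1, C = phi (the AR(p) setting of Sec. 3.1.1).
A = np.zeros((p, p))
A[np.arange(1, p), np.arange(p - 1)] = 1.0
B = np.eye(p)[0]
C = phi

freqs = np.linspace(0.0, np.pi, 64)
z = np.exp(1j * freqs)                     # points on the unit circle

# Transfer function read off the SSM weights: H(z) = C (zI - A)^{-1} B.
H_ssm = np.array([C @ np.linalg.solve(zk * np.eye(p) - A, B) for zk in z])

# Transfer function specified by the AR coefficients: sum_i phi_i z^{-i}.
lags = np.arange(1, p + 1).astype(float)
H_ar = np.array([np.sum(phi * zk ** (-lags)) for zk in z])

magnitude_ssm = np.abs(H_ssm)              # what a frequency response plot would show
```

For a trained model, the same `H_ssm` computation would be applied to the learned A, B, C and compared against `H_ar`.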

4.2.2. LONG-HORIZON FORECASTING

To next study SPACETIME's improved long horizon forecasting capabilities, we consider two additional long horizon tasks. First, we test on much longer horizons than prior settings (cf. Table 1). Second, we test a new forecasting ability: how well methods trained to forecast one horizon transfer to longer horizons at test-time. For both, we use the popular Informer ETTh datasets. We compare SPACETIME with NLinear, the prior state-of-the-art on longer-horizon ETTh datasets and an FCN that learns a dense linear mapping between every lag input and horizon output (Zeng et al., 2022). We find SPACETIME outperforms NLinear on both long horizon tasks. On training to predict long horizons, SPACETIME consistently obtains lower MSE than NLinear on all settings (Table 4). On transferring to new horizons, SPACETIME models trained to forecast 192 time-step horizons transfer more accurately and consistently to forecasting longer horizons up to 576 time-steps (Fig. 4). This suggests SPACETIME's autoregressive forecasting more convincingly learns the time series process; rather than only fitting to the specified horizon, the same model can generalize to new horizons.

faster training time versus similarly recurrent LSTMs (Fig. 5).

5. CONCLUSION

We introduce SPACETIME, a state-space time series model. We achieve high expressivity by modeling SSMs with the companion matrix, long-horizon forecasting with a closed-loop SSM variant, and efficiency with a new algorithm to compute the companion SSM. We validate SPACETIME's proposed components on extensive time series forecasting and classification tasks.

6. ETHICS STATEMENT

A main objective of our work is to improve the ability to classify and forecast time series, which has real-world applications in many fields. These applications may have high stakes, such as classifying abnormalities in medical time series. In these situations, incorrect predictions may lead to harmful patient outcomes. It is thus critical to understand that while we aim to improve time series modeling towards these applications, we do not solve these problems. Further analysis and development into where models fail in time series modeling is necessary, including potential intersections with research directions such as robustness and model biases when aiming to deploy machine learning models in real world applications.

7. REPRODUCIBILITY

We include code to reproduce our main results in
(Box and Jenkins, 1968), exponential smoothing (Hyndman et al., 2008; Winters, 1960), autoregressive integrated moving average (ARIMA) (Box et al., 1970), and state-space models (Hamilton, 1994). In such approaches, the model is usually manually selected based on analyzing time series features (e.g., seasonality and order of non-stationarity), where the selected model is then fitted for each individual time series. While classical approaches may be more interpretable than recent deep learning techniques, the domain expertise and manual labor needed to successfully apply them renders them infeasible for the common setting of modeling thousands, or millions, of time series.

A.2 DEEP LEARNING APPROACHES

Recurrent models. Common deep learning architectures for modeling sequence data are the family of recurrent neural networks, which include GRUs (Chung et al., 2014), LSTMs (Hochreiter and Schmidhuber, 1997), and DeepAR (Salinas et al., 2020). However, due to the recurrent nature of RNNs, they are slow to train and may suffer from vanishing/exploding gradients, making them difficult to train (Pascanu et al., 2013).

Deep State Space models. Recent work has investigated combining the expressive strengths of SSMs with the scalable strengths of deep neural networks (Rangapuram et al., 2018; Gu et al., 2021a). Rangapuram et al. (2018) propose to train a global RNN that transforms input covariates to sequence-specific SSM parameters; however, one downside of this approach is that it inherits the drawbacks of RNNs. More recent approaches, such as LSSL (Gu et al., 2021b), S4 (Gu et al., 2021a), S4D (Gu et al., 2022), and S5 (Smith et al., 2022), directly parameterize the layers of a neural network with multiple linear SSMs, and overcome common recurrent training drawbacks by leveraging the convolutional view of SSMs. While deep SSM models have shown great promise in time series modeling, we show in our work, which builds off deep SSMs, that current deep SSM approaches are not able to capture autoregressive processes due to their continuous nature.

Neural differential equations as nonlinear state spaces. (Chen et al., 2018) 2021) numerical techniques to enable learning of stiff dynamical systems with Neural ODEs are investigated. The idea of parameterizing the vector field of a differential equation with a neural network, popularized by NDEs, can be traced back to earlier works (Funahashi and Nakamura, 1993; Zhang et al., 2014; Weinan, 2017).

Transformers. While RNNs and their variants have shown some success at time series modeling, a major limitation is their applicability to long input sequences.
Since RNNs are recurrent by nature, they require long traversal paths to access past inputs, which leads to vanishing or exploding gradients; as a result, they struggle to capture long-range dependencies. To counteract the long-range dependency problem with RNNs, a recent line of work considers Transformers for time series modeling. The motivation is that, due to the attention mechanism, a Transformer can directly model dependencies between any two points in the input sequence, independently of how far apart the points are. However, the high expressivity of the attention mechanism comes at the cost of time and space complexity that is quadratic in sequence length, making Transformers infeasible for very long sequences. As a result, many works consider specialized Transformer architectures with sparse attention mechanisms to bring down the quadratic complexity. For example, Beltagy et al. (2020) propose LogSparse self-attention, where a cell attends to a subset of past cells (as opposed to all cells) and closer cells are attended to more frequently, proportional to the log of their distance, which brings down complexity from $O(\ell^2)$ to $O(\ell(\log \ell)^2)$. Zhou et al. (2021) propose ProbSparse self-attention, which achieves $O(\ell \log \ell)$ time and memory complexity, together with a generative-style decoder to speed up inference. Liu et al. (2022) propose a pyramidal attention mechanism with time and space complexity linear in sequence length. Autoformer (Wu et al., 2021) suggests more specialization is needed in time series with a decomposition forecasting architecture, which extracts the long-term stationary trend from the seasonal series and utilizes an auto-correlation mechanism to discover period-based dependencies. Zhou et al. (2022b) argue that previous Transformer-based architectures do not capture global statistical properties, and that doing so requires an attention mechanism in the frequency domain.
Conformer (Gulati et al., 2020) stacks convolutional and self-attention modules into a shared layer to combine the strengths of local interactions from convolutional modules and global interactions from self-attention modules. Perceiver AR (Hawthorne et al., 2022) builds on the Perceiver architecture, which reduces the computational complexity of Transformers by performing self-attention in a latent space, and extends Perceiver's applicability to causal autoregressive generation. While these works have shown exciting progress on time series forecasting, their proposed architectures are specialized to handle specific time series settings (e.g., long input sequences, or seasonal sequences), and are commonly trained to output a fixed target horizon length (Zhou et al., 2021), i.e., as direct multi-step forecasting (DMS) (Chevillon, 2007). Thus, while effective at specific forecasting tasks, their setups are not obviously applicable to a broad range of time series settings (such as forecasting arbitrary horizon lengths, or generalizing to classification or regression tasks). Moreover, Zeng et al. (2022) showed that simpler alternatives to Transformers, such as data normalization plus a single linear layer (NLinear), can outperform these specialized Transformer architectures when similarly trained to predict the entire fixed forecasting horizon. Their results suggest that neither the attention mechanism nor the proposed modifications of these time series Transformers may be best suited for time series modeling. Instead, the success of these prior works may just come from learning to forecast the entire horizon with fully connected dependencies between prior time-step inputs and future time-step outputs, for which a fully connected linear layer is sufficient.

Other deep learning methods. Other works also investigate pure deep learning architectures with no explicit temporal components, and show these models can also perform well on time series forecasting. Oreshkin et al.
(2019) propose N-BEATS, a deep architecture based on backward and forward residual links. Even simpler, Zeng et al. (2022) investigate single linear layer models for time series forecasting. Both works show that simple architectures are capable of achieving high performance for time series forecasting. In particular, with just data normalization, the NLinear model in Zeng et al. (2022) obtained state-of-the-art performance on the popular Informer benchmark (Zhou et al., 2021). Given an input sequence of past lag terms and a target output sequence of future horizon terms, for every horizon output their model simply learns the fully connected dependencies between that output and every input lag sample. However, fully connected networks (FCNs) such as NLinear also carry efficiency downsides. Unlike Transformers and SSM-based models, the number of parameters for FCNs scales directly with input and output sequence length, i.e., $O(\ell h)$ for $\ell$ inputs and $h$ outputs. Meanwhile, SPACETIME shows that the SSM can improve the modeling quality of deep architectures, while maintaining a constant parameter count regardless of input or output length. Especially when forecasting long horizons, we achieve higher forecasting accuracy with smaller models.

B.1 EXPRESSIVITY RESULTS

Proposition 1. An SSM with a companion state matrix can represent

i. ARIMA (Box et al., 1970)
ii. Exponential smoothing
iii. Controllable LTI systems (Chen, 1984)

Proof of Proposition 1. We show each case separately. We either provide a set of algebraic manipulations to obtain the desired model from a companion SSM, or alternatively invoke standard results from signal processing and system theory.

i. We start with a standard ARMA(p, q) model

$$y_k = u_k + \sum_{i=1}^{q} \theta_i u_{k-i} + \sum_{i=1}^{p} \phi_i y_{k-i}.$$

We consider two cases:

Case (1): Outputs y are a shifted (lag-1) version of the inputs u:

$$\begin{aligned} y_{k+1} &= y_k + \sum_{i=1}^{q} \theta_i y_{k-i} + \sum_{i=1}^{p} \phi_i y_{k-i+1} \\ &= (1 + \phi_1)\, y_k + \sum_{i=1}^{q} (\theta_i + \phi_{i+1})\, y_{k-i} + \sum_{i=q+1}^{p} \theta_i y_{k-i} \end{aligned} \tag{16}$$

where, without loss of generality, we have assumed that p > q for notational convenience. The autoregressive system (16) is equivalent to

$$\begin{bmatrix} A & B \\ C & D \end{bmatrix} = \begin{bmatrix} 0 & 0 & \dots & 0 & 0 & 1 \\ 1 & 0 & \dots & 0 & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \dots & 0 & 0 & 0 \\ 0 & 0 & \dots & 1 & 0 & 0 \\ (1 + \phi_1) & (\theta_1 + \phi_2) & \dots & \theta_{d-1} & \theta_d & 0 \end{bmatrix}$$

in state-space form, with $x \in \mathbb{R}^d$ and $d = \max(p, q)$. Note that the state-space formulation is not unique.

Case (2): Outputs y are "shaped noise". The ARMA(p, q) formulation (classically) defines inputs u as white noise samples (other formulations with forecast residuals are also common), i.e., for all k, $p(u_k)$ is a normal distribution with mean zero and some variance. In this case, we can decompose the output as follows:

$$y^{\mathrm{ar}}_k = \sum_{i=1}^{p} \phi_i y_{k-i}, \qquad y^{\mathrm{ma}}_k = u_k + \sum_{i=1}^{q} \theta_i u_{k-i},$$

such that $y_k = y^{\mathrm{ar}}_k + y^{\mathrm{ma}}_k$. The resulting state-space models are:

$$\begin{bmatrix} A^{\mathrm{ar}} & B^{\mathrm{ar}} \\ C^{\mathrm{ar}} & D^{\mathrm{ar}} \end{bmatrix} = \begin{bmatrix} 0 & 0 & \dots & 0 & 0 & 1 \\ 1 & 0 & \dots & 0 & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \dots & 0 & 0 & 0 \\ 0 & 0 & \dots & 1 & 0 & 0 \\ \phi_1 & \phi_2 & \dots & \phi_{p-1} & \phi_p & 0 \end{bmatrix}$$

and

$$\begin{bmatrix} A^{\mathrm{ma}} & B^{\mathrm{ma}} \\ C^{\mathrm{ma}} & D^{\mathrm{ma}} \end{bmatrix} = \begin{bmatrix} 0 & 0 & \dots & 0 & 0 & 1 \\ 1 & 0 & \dots & 0 & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots & \vdots \\ 0 & 0 & \dots & 0 & 0 & 0 \\ 0 & 0 & \dots & 1 & 0 & 0 \\ \theta_1 & \theta_2 & \dots & \theta_{q-1} & \theta_q & 1 \end{bmatrix}.$$

Note that $A^{\mathrm{ar}} \in \mathbb{R}^{p \times p}$ and $A^{\mathrm{ma}} \in \mathbb{R}^{q \times q}$. More generally, our method can represent any ARMA process as the sum of two SPACETIME heads: one taking as input the time series itself, and one the driving signal u.
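As an illustration of Case (2), the following minimal NumPy sketch checks numerically that an ARMA(p, q) series equals the sum of two shift-matrix SSM heads, one fed the series itself and one fed the driving signal. The helper names (`shift_ssm`, `arma_reference`), the coefficient values, and the zero initial state are assumptions made for this example, not part of the paper's implementation.

```python
import numpy as np

def shift_ssm(c, d_skip, u):
    """Simulate x[k+1] = S x[k] + e_1 u[k], y[k] = c x[k] + d_skip * u[k], with x[0] = 0."""
    d = len(c)
    S = np.eye(d, k=-1)   # shift matrix: ones on the subdiagonal
    B = np.zeros(d)
    B[0] = 1.0            # first basis vector e_1
    x = np.zeros(d)
    y = np.empty(len(u))
    for k, u_k in enumerate(u):
        y[k] = c @ x + d_skip * u_k   # x holds [u[k-1], ..., u[k-d]]
        x = S @ x + B * u_k
    return y

def arma_reference(phi, theta, u):
    """Direct ARMA(p, q) recursion, assuming zero pre-history."""
    y = np.zeros(len(u))
    for k in range(len(u)):
        ar = sum(phi[i] * y[k - 1 - i] for i in range(len(phi)) if k - 1 - i >= 0)
        ma = sum(theta[i] * u[k - 1 - i] for i in range(len(theta)) if k - 1 - i >= 0)
        y[k] = u[k] + ma + ar
    return y

rng = np.random.default_rng(0)
phi = np.array([0.5, -0.2, 0.1])   # hypothetical AR coefficients
theta = np.array([0.4, 0.3])       # hypothetical MA coefficients
u = rng.standard_normal(50)        # white-noise driving signal
y = arma_reference(phi, theta, u)

# AR head reads the series itself (D = 0); MA head reads the driving signal (D = 1).
y_ssm = shift_ssm(phi, 0.0, y) + shift_ssm(theta, 1.0, u)
print(np.max(np.abs(y - y_ssm)))
```

The two heads share the same fixed shift matrix and input vector; only the output vector C and the skip term D differ, which mirrors the two block matrices above.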
ARIMA. ARIMA processes are ARMA(p, q) applied to differenced time series, for example first-order differencing $y_k = u_k - u_{k-1}$. Differencing corresponds to high-pass filtering of the signal y, and can thus be realized via a convolution (Strang and Nguyen, 1996). Any digital filter that can be expressed as a difference equation admits a state-space representation in companion form (Oppenheim, 1999), and hence can be learned by SPACETIME.

ii. Simple exponential smoothing (SES) (Brown, 1959)

$$y_k = \alpha y_{k-1} + \alpha(1 - \alpha) y_{k-2} + \dots + \alpha(1 - \alpha)^{p-1} y_{k-p} \tag{17}$$

is an AR process with a parametrization involving a single scalar $0 < \alpha < 1$ and can thus be represented in companion form as shown above.

iii. Let (A, B, C) be any controllable linear system. Controllability corresponds to invertibility of the Krylov matrix (Chen, 1984, Thm 6.1, p. 145)

$$K(A, B) = [B, AB, \dots, A^{d-1}B], \qquad K(A, B) \in \mathbb{R}^{d \times d}.$$

From $\mathrm{rank}(K) = d$, it follows that there exists $a \in \mathbb{R}^d$ such that

$$a_0 B + a_1 AB + \dots + a_{d-1} A^{d-1} B + A^d B = 0.$$

Thus

$$AK = [AB, A^2B, \dots, A^dB] = [\underbrace{AB, A^2B, \dots, A^{d-1}B}_{\text{column left shift of } K},\ \underbrace{-(a_0 B + a_1 AB + \dots + a_{d-1}A^{d-1}B)}_{\text{linear combination, columns of } K}] = K(S_f - a e_{d-1}^\top),$$

where $G = (S_f - a e_{d-1}^\top)$ is a companion matrix. Then

$$AK = KG \iff G = K^{-1}AK.$$

Therefore G is similar to A. We can then construct a companion form state space (G, B, C, D) from A using the relation above.

Proposition 2. No class of continuous-time LSSL SSMs can represent the noiseless AR(p) process.

Proof of Proposition 2. Recall from Sec. 3.1.1 that a noiseless AR(p) process is defined by

$$y_t = \sum_{i=1}^{p} \phi_i y_{t-i} = \phi_1 y_{t-1} + \dots + \phi_p y_{t-p}$$

with coefficients $\phi_1, \dots, \phi_p$. This is represented by the SSM

$$x_{t+1} = S x_t + B u_t \tag{19}$$
$$y_t = C x_t + D u_t \tag{20}$$

when $S \in \mathbb{R}^{p \times p}$ is the shift matrix, $B \in \mathbb{R}^{p \times 1}$ is the first basis vector $e_1$, $C \in \mathbb{R}^{1 \times p}$ is a vector of coefficients $\phi_1, \dots, \phi_p$, and $D = 0$, i.e., $B = [1\ 0\ \dots\ 0]^\top$ and $C = [\phi_1\ \dots\ \phi_p]$. We prove by contradiction that a continuous-time LSSL SSM cannot represent such a process.
Consider the following solutions to a continuous-time system and a system (18), both in autonomous form:

$$x^{\mathrm{cont}}_{t+1} = e^{A} x_t, \qquad x^{\mathrm{disc}}_{t+1} = S x_t.$$

It follows that

$$x^{\mathrm{cont}}_{t+1} = x^{\mathrm{disc}}_{t+1} \iff e^{A} = S \iff A = \log(S).$$

We have reached a contradiction by Culver (1966, Theorem 1), as S is singular by definition and thus its matrix logarithm does not exist.
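The two facts behind this contradiction can be checked numerically. The sketch below, which is only an illustration and not part of the proof, confirms that the shift matrix is singular (indeed nilpotent), and that for an arbitrary real matrix A the determinant of $e^A$ equals $e^{\mathrm{tr}(A)} > 0$, so $e^A$ is always invertible and cannot equal S. The truncated Taylor series `expm_taylor` is an assumption of this example, chosen to keep it dependency-free.

```python
import numpy as np

def expm_taylor(A, terms=60):
    """Matrix exponential via a truncated Taylor series (adequate for small, well-scaled A)."""
    out = np.eye(len(A))
    term = np.eye(len(A))
    for n in range(1, terms):
        term = term @ A / n
        out = out + term
    return out

p = 4
S = np.eye(p, k=-1)                     # shift matrix of the AR(p) SSM
det_S = np.linalg.det(S)                # zero: S is singular
S_pow = np.linalg.matrix_power(S, p)    # S is nilpotent: S^p is the zero matrix

rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((p, p))   # arbitrary continuous-time state matrix
det_expA = np.linalg.det(expm_taylor(A))

# det(e^A) = e^{tr(A)} is strictly positive, so e^A is invertible for every A.
print(det_S, det_expA, np.exp(np.trace(A)))
```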

B.2 EFFICIENCY RESULTS

We first prove that Algorithm 1 yields the correct output filter $F^y$. We then analyze its time complexity, showing that it takes time $O(\ell \log \ell + d \log d)$ for sequence length $\ell$ and state dimension $d$.

Theorem 1. Algorithm 1 returns the filter $F^y = (CB, \dots, CA^{\ell-1}B)$.

Proof. We follow the outline of the proof in Section 3.3. Instead of computing $F^y$ directly, we compute its spectrum (its discrete Fourier transform):

$$F^y[m] := \mathcal{F}(F^y) = \sum_{j=0}^{\ell-1} C A^j \omega^{mj} B = C(I - A^\ell)(I - A\omega^m)^{-1} B = \tilde{C}(I - A\omega^m)^{-1} B, \qquad m = 0, 1, \dots, \ell - 1,$$

where $\tilde{C} = C(I - A^\ell)$ and $\omega = \exp(-2\pi i/\ell)$ is the $\ell$-th root of unity. This reduces to computing the quadratic form of the resolvent $(I - A\omega^m)^{-1}$ on the roots of unity (the powers of $\omega$). Since A is a companion matrix, we can write A as a shift matrix plus a rank-1 matrix, $A = S + a e_d^\top$, where $e_d$ is the d-th basis vector $[0, \dots, 0, 1]$ and the shift matrix S is:

$$S = \begin{bmatrix} 0 & 0 & \dots & 0 & 0 \\ 1 & 0 & \dots & 0 & 0 \\ 0 & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ 0 & 0 & \dots & 1 & 0 \end{bmatrix}.$$

Thus Woodbury's matrix identity (i.e., the Sherman-Morrison formula) yields:

$$(I - A\omega^m)^{-1} = (I - \omega^m S - \omega^m a e_d^\top)^{-1} = (I - \omega^m S)^{-1} + \frac{(I - \omega^m S)^{-1}\, \omega^m a e_d^\top\, (I - \omega^m S)^{-1}}{1 - \omega^m e_d^\top (I - \omega^m S)^{-1} a}.$$

This is the resolvent of the shift matrix $(I - \omega^m S)^{-1}$, with a rank-1 correction. Hence

$$F^y[m] = \tilde{C}(I - \omega^m S)^{-1} B + \frac{\tilde{C}(I - \omega^m S)^{-1} a\; e_d^\top (I - \omega^m S)^{-1} B}{\omega^{-m} - e_d^\top (I - \omega^m S)^{-1} a}.$$

We now need to derive how to compute the quadratic form of a resolvent of the shift matrix efficiently. Fortunately, the resolvent of the shift matrix has a very special structure that closely relates to the Fourier transform. We show analytically that:

$$(I - \omega^m S)^{-1} = \begin{bmatrix} 1 & 0 & \dots & 0 & 0 \\ \omega^m & 1 & \dots & 0 & 0 \\ \omega^{2m} & \omega^m & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \omega^{(d-1)m} & \omega^{(d-2)m} & \dots & \omega^m & 1 \end{bmatrix}.$$

This is easy to verify by multiplying this matrix with $I - \omega^m S$ and checking that we obtain the identity matrix. Recall that multiplying with S on the left just shifts all the columns down by one index.
Therefore, writing R for the candidate inverse,

$$R = \begin{bmatrix} 1 & 0 & \dots & 0 & 0 \\ \omega^m & 1 & \dots & 0 & 0 \\ \omega^{2m} & \omega^m & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \omega^{(d-1)m} & \omega^{(d-2)m} & \dots & \omega^m & 1 \end{bmatrix},$$

and using that S commutes with R (R is a polynomial in S), we have

$$\begin{aligned} R\,(I - \omega^m S) &= R - \omega^m S R \\ &= R - \omega^m \begin{bmatrix} 0 & 0 & \dots & 0 & 0 \\ 1 & 0 & \dots & 0 & 0 \\ \omega^m & 1 & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \omega^{(d-2)m} & \omega^{(d-3)m} & \dots & 1 & 0 \end{bmatrix} \\ &= R - \begin{bmatrix} 0 & 0 & \dots & 0 & 0 \\ \omega^m & 0 & \dots & 0 & 0 \\ \omega^{2m} & \omega^m & \dots & 0 & 0 \\ \vdots & \vdots & \ddots & \vdots & \vdots \\ \omega^{(d-1)m} & \omega^{(d-2)m} & \dots & \omega^m & 0 \end{bmatrix} = I. \end{aligned}$$

Thus the resolvent of the shift matrix indeed has the form of a lower-triangular matrix containing the roots of unity.

Now that we have the analytic formula of the resolvent, we can derive its quadratic form, given some vectors $u, v \in \mathbb{R}^d$. Substituting in, we have

$$u^\top (I - \omega^m S)^{-1} v = u_1 v_1 + u_2 v_1 \omega^m + u_2 v_2 + u_3 v_1 \omega^{2m} + u_3 v_2 \omega^m + u_3 v_3 + \dots$$

Grouping terms by powers of $\omega$, we see that we want to compute $u_1 v_1 + u_2 v_2 + \dots + u_d v_d$, then $u_2 v_1 + u_3 v_2 + \dots + u_d v_{d-1}$, and so on. The term corresponding to $\omega^{km}$ is $q_k = \sum_i u_{i+k} v_i$, which is exactly the k-th element of a linear convolution of u with v (with v reversed, i.e., a cross-correlation). Writing $q = u * v$ for this sequence, $u^\top (I - \omega^m S)^{-1} v$ is just the Fourier transform of $u * v$. To deal with the case where $d > \ell$, we note that the powers of the roots of unity repeat, so we just need to extend the output of $u * v$ to a multiple of $\ell$, split it into chunks of size $\ell$, sum the chunks, and take the length-$\ell$ Fourier transform. This is exactly the procedure quad(u, v) defined in Algorithm 1.
Once we have derived the quadratic form of the resolvent $(I - \omega^m S)^{-1}$, simply plugging it into Woodbury's matrix identity (Equation (22)) yields Algorithm 1.

We analyze the algorithm's complexity.

Theorem 2. Algorithm 1 has time complexity $O(\ell \log \ell + d \log d)$ for sequence length $\ell$ and state dimension $d$.

Proof. We see that computing the quadratic form of the resolvent $(I - \omega^m S)^{-1}$ involves a linear convolution of size d and a Fourier transform of size $\ell$. The linear convolution can be done by performing an FFT of size 2d on both inputs, multiplying them pointwise, and then taking the inverse FFT of size 2d. This has time complexity $O(d \log d)$. The Fourier transform of size $\ell$ has time complexity $O(\ell \log \ell)$.

Remark. We see that the algorithm easily extends to the case where the matrix A is a companion matrix plus a low-rank matrix (of some rank k). We can write A as the sum of the shift matrix and a rank-(k + 1) matrix (since A itself is the sum of a shift matrix and a rank-1 matrix). Using the same strategy, we can use Woodbury's matrix identity for the rank-(k + 1) case. The running time will then scale as $O(k(\ell \log \ell + d \log d))$.
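The derivation above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`quad`, `companion_filter`, `naive_filter`) are chosen for this example, the companion matrix is taken as $A = S + a e_d^\top$ (so a is its last column), and $\tilde{C} = C(I - A^\ell)$ is formed with a dense matrix power for simplicity, which is not part of the stated complexity. The sketch compares the FFT-based filter against the naive filter $(CB, \dots, CA^{\ell-1}B)$, including a case with $d > \ell$.

```python
import numpy as np

def quad(u, v, ell):
    """Return u^T (I - w^m S)^{-1} v for m = 0..ell-1, where w = exp(-2j*pi/ell)."""
    d = len(u)
    n = 2 * d
    # q[k] = sum_i u[i+k] * v[i]: linear convolution of u with reversed v, via FFTs of size 2d.
    full = np.fft.irfft(np.fft.rfft(u, n) * np.fft.rfft(v[::-1], n), n)
    q = full[d - 1 : 2 * d - 1]
    # Powers of w repeat with period ell: zero-pad, split into chunks of size ell, and sum.
    chunks = -(-d // ell)  # ceil(d / ell)
    q = np.pad(q, (0, chunks * ell - d)).reshape(chunks, ell).sum(axis=0)
    return np.fft.fft(q)

def companion(a):
    """Companion matrix A = S + a e_d^T (shift matrix with last column a)."""
    A = np.eye(len(a), k=-1)
    A[:, -1] += a
    return A

def companion_filter(a, B, C, ell):
    """Filter (CB, ..., CA^{ell-1}B) from the spectrum given by the rank-1 corrected resolvent."""
    d = len(a)
    # Assumption of this sketch: a dense matrix power is used to form C_tilde.
    C_tilde = C @ (np.eye(d) - np.linalg.matrix_power(companion(a), ell))
    e_d = np.zeros(d)
    e_d[-1] = 1.0
    z = np.exp(2j * np.pi * np.arange(ell) / ell)  # w^{-m} for m = 0..ell-1
    spectrum = quad(C_tilde, B, ell) + quad(C_tilde, a, ell) * quad(e_d, B, ell) / (
        z - quad(e_d, a, ell)
    )
    return np.fft.ifft(spectrum).real

def naive_filter(a, B, C, ell):
    """Reference: repeated matrix-vector products."""
    A = companion(a)
    out = np.empty(ell)
    v = B.copy()
    for j in range(ell):
        out[j] = C @ v
        v = A @ v
    return out

rng = np.random.default_rng(0)
d = 8
a = rng.standard_normal(d)
a *= 0.9 / np.abs(a).sum()   # keep the spectral radius below 1 so the resolvent exists
B = rng.standard_normal(d)
C = rng.standard_normal(d)
errors = {
    ell: np.max(np.abs(companion_filter(a, B, C, ell) - naive_filter(a, B, C, ell)))
    for ell in (16, 5)       # covers both d < ell and d > ell
}
print(errors)
```

Note that `quad` reverses its second argument before convolving, which matches the grouping of terms $u_{i+k} v_i$ by powers of $\omega$ in the proof.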

B.3 COMPANION MATRIX STABILITY

Normalizing companion parameters for bounded gradients.

Proposition 3 (Bounded SPACETIME Gradients). Given s, the norm of the gradient of a SPACETIME layer is bounded for all k < s if

$$\sum_{i=0}^{d-1} |a_i| = 1.$$

Proof. Without loss of generality, we assume $x_0 = 0$. Since the solution at time s is

$$y_s = C \sum_{i=1}^{s-1} A^{s-i-1} B u_i,$$

we compute the gradient w.r.t. $u_k$ as

$$\frac{dy_s}{du_k} = C A^{s-k-1} B.$$

The largest eigenvalue of A satisfies

$$\max\{\mathrm{eig}(A)\} \le \max\Big\{1, \sum_{i=0}^{d-1} |a_i|\Big\} = 1,$$

where the inequality is a corollary of Gershgorin's theorem (Hirst and Macey, 1997, Theorem 1) and the equality uses $\sum_i |a_i| = 1$. This implies convergence of the operator $C A^{s-k-1} B$. Thus, the gradients are bounded.

We use the proposition above to ensure gradient boundedness in SPACETIME layers by normalizing a in every forward pass.
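A minimal sketch of this normalization, assuming it is implemented as a division by the $\ell_1$ norm of a (the exact form used in SPACETIME is not specified here) and that a is the last column of the companion matrix. The helper names `normalize_l1` and `companion` are hypothetical. The sketch compares the spectral radius of the companion matrix before and after normalization.

```python
import numpy as np

def normalize_l1(a, eps=1e-12):
    """Rescale a so that sum_i |a_i| = 1 (assumed form of the normalization)."""
    return a / max(np.abs(a).sum(), eps)

def companion(a):
    """Companion matrix: shift matrix with last column a."""
    A = np.eye(len(a), k=-1)
    A[:, -1] += a
    return A

def spectral_radius(A):
    return np.max(np.abs(np.linalg.eigvals(A)))

rng = np.random.default_rng(0)
a_raw = 5.0 * rng.standard_normal(16)   # hypothetical unnormalized parameters
a_norm = normalize_l1(a_raw)

rho_raw = spectral_radius(companion(a_raw))
rho_norm = spectral_radius(companion(a_norm))
print(rho_raw, rho_norm)   # rho_norm stays at or below 1, up to round-off
```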

C EXPERIMENT DETAILS

C.1 INFORMER FORECASTING

Dataset details. In Table 1, we evaluate all methods with datasets and horizon tasks from the Informer benchmark (Zhou et al., 2021). We use the datasets and horizons evaluated in recent works (Wu et al., 2021; Zhou et al., 2022b;a; Zeng et al., 2022), which evaluate on electricity transformer temperature time series (ETTh1, ETTh2, ETTm1, ETTm2) with forecasting horizons {96, 192, 336, 720}. We extend this comparison in Appendix D.2 to all datasets and forecasting horizons in Zhou et al. (2021), which also consider weather and electricity (ECL) time series data.

Training details. We train SPACETIME on all datasets for 50 epochs using the AdamW optimizer (Loshchilov and Hutter, 2017), cosine scheduling, and early stopping based on best validation standardized MSE. We performed a grid search over number of SSMs {64, 128} and weight decay {0, 0.0001}. Like prior forecasting works, we treat the input lag sequence as a hyperparameter, and train to predict each forecasting horizon with either 336 or 720 time-step-long input sequences for all datasets and horizons. For all datasets, we use a 3-layer SPACETIME network with 128 SSMs per layer. We train with learning rate 0.01, weight decay 0.0001, batch size 32, and dropout 0.25.

Hardware details. All experiments were run on a single NVIDIA Tesla P100 GPU.

C.2 MONASH FORECASTING

The Monash Time Series Forecasting Repository (Godahewa et al., 2021) provides an extensive benchmark suite for time series forecasting models, with over 30 datasets (including various configurations) spanning finance, traffic, weather and medical domains. We compare SPACETIME against 13 baselines provided by the Monash benchmark: simple exponential smoothing (SES) (Gardner Jr, 1985), Theta (Assimakopoulos and Nikolopoulos, 2000), TBATS (De Livera et al., 2011), ETS (Winters, 1960), DHR-ARIMA (Hyndman and Athanasopoulos, 2018), Pooled Regression (PR) (Trapero et al., 2015), CatBoost (Dorogush et al., 2018), FFNN, DeepAR (Salinas et al., 2020), N-BEATS (Oreshkin et al., 2019), WaveNet (Oord et al., 2016), and a vanilla Transformer (Vaswani et al., 2017). A complete list of the datasets considered and baselines, including test results (average RMSE across 3 seeded runs), is available in Table 20.

Training details. We optimize SPACETIME on all datasets using the Adam optimizer for 40 epochs, with a linear learning rate warmup phase of 20 epochs followed by cosine decay. We initialize the learning rate at 0.001, reach 0.004 after warmup, and decay to 0.0001. We do not use weight decay or dropout. We perform a grid search over number of layers {3, 4, 5, 6}, number of SSMs per layer {8, 16, 32, 64, 128} and number of channels (width of the model) {1, 4, 8, 16}. Hyperparameter tuning is performed for each dataset. We pick the model based on best validation RMSE performance.

Hardware details. All experiments were run on a single NVIDIA GeForce RTX 3090 GPU.
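The learning rate schedule described above can be sketched as a small function. This is an illustration under assumptions: the schedule is updated once per epoch, the cosine phase spans the remaining 20 of the 40 epochs, and the function name `lr_at_epoch` and its parameters are hypothetical rather than taken from the released code.

```python
import math

def lr_at_epoch(epoch, total=40, warmup=20, lr_init=1e-3, lr_peak=4e-3, lr_final=1e-4):
    """Linear warmup from lr_init to lr_peak, then cosine decay from lr_peak to lr_final."""
    if epoch < warmup:
        return lr_init + (lr_peak - lr_init) * epoch / warmup
    progress = (epoch - warmup) / max(1, total - warmup)
    return lr_final + 0.5 * (lr_peak - lr_final) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at_epoch(e) for e in range(41)]
print(schedule[0], schedule[20], schedule[40])
```

The schedule starts at the initial rate, peaks at the end of warmup, and reaches the final rate at the last epoch.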

C.3 TIME SERIES CLASSIFICATION

ECG classification (motivation and dataset description). Electrocardiograms (ECG) are commonly used as one of the first examination tools for assessing and diagnosing cardiovascular diseases, which are a major cause of mortality around the world (Amini et al., 2021). However, ECG interpretation remains a challenging task for cardiologists and general practitioners (Jablonover et al., 2014; Cook et al., 2020). Incorrect interpretation of an ECG can result in misdiagnosis and delayed treatment, which can be potentially life-threatening in critical situations such as emergency rooms, where an accurate interpretation is needed quickly. To mitigate these challenges, deep learning approaches are increasingly being applied to interpret ECGs. These approaches have been used for predicting the ECG rhythm class (Hannun et al., 2019), detecting atrial fibrillation (Attia et al., 2019b), rare cardiac diseases like cardiac amyloidosis (Goto et al., 2021), and a variety of other abnormalities (Attia et al., 2019a; Siontis et al., 2021). Deep learning approaches have shown preliminary promise in matching the performance of cardiologists and emergency residents in triaging ECGs, which would permit accurate interpretations in settings where specialists may not be present (Ribeiro et al., 2020; Hannun et al., 2019).

We use the publicly available PTB-XL dataset (Wagner et al., 2020a;b; Goldberger et al., 2000), which contains 21,837 12-lead ECG recordings of 10 seconds each, obtained from 18,885 patients. Each ECG recording is annotated by up to two cardiologists with one or more of the 71 ECG statements (labels). These ECG statements conform to the SCP-ECG standard (Secretary, 2009). Each statement belongs to one or more of the following three categories: diagnostic, form, and rhythm statements. The diagnostic statements are further organised in a hierarchy containing 5 superclasses and 24 subclasses.
This provides six sets of annotations for the ECG statements based on the different categories and granularities: all (all ECG statements), diagnostic (only diagnostic statements including both subclass and superclass statements), diagnostic subclass (only diagnostic subclass statements), diagnostic superclass (only diagnostic superclass statements), form (only form statements), and rhythm (only rhythm statements). These six sets of annotations form different prediction tasks which are referred to as all, diag, sub-diag, super-diag, form, and rhythm respectively. The diagnostic superclass task is multi-class classification, and the other tasks are multi-label classification.

ECG classification training details. To tune SPACETIME and S4, we performed a grid search over the learning rate {0.01, 0.001}, model dropout {0.1, 0.2}, number of SSMs per layer {128, 256}, and number of layers {4, 6}, and chose the parameters that resulted in the highest validation AUROC. The SSM state dimension was fixed to 64, with gated linear units as the non-linearity between stacked layers. We additionally apply layer normalization. We use a cosine learning rate scheduler, with a warmup period of 5 epochs. We train all models for 100 epochs.

D.1 EXPRESSIVITY ON DIGITAL FILTERS

[...] (40, 50) in the test set. The outputs are obtained by filtering x, i.e., $y = F(x)$, where F is in the family of digital filters. We use various common sequence-to-sequence layers or models as baselines: the original S4 diagonal plus low-rank (Gu et al., 2021a), a single-layer LSTM, a single 1d convolution (Conv1d), a dense linear layer (NLinear), and a single self-attention layer. All models are trained for 800 epochs with batch size 256, learning rate $10^{-3}$, and Adam. We repeat this experiment for digital filters of different orders (Oppenheim, 1999). The results are shown in Figure 8. SPACETIME learns to match the frequency response of the target filter, producing the correct output for inputs at test frequencies.
D.2 INFORMER FORECASTING

In Table 7, we also compare SPACETIME to alternative time series methods on the complete datasets and horizons used in the original Informer paper (Zhou et al., 2021). We compare against recent architectures which similarly evaluate on these settings, including ETSFormer (Woo et al., 2022), SCINet (Liu et al., 2021), and Yformer (Madhusudhanan et al., 2021), and other comparison methods found in the Informer paper, such as Reformer (Kitaev et al., 2020) and ARIMA. SPACETIME obtains best results on 20 out of 25 settings, the most of any method.

D.6 SPACETIME ABLATIONS

To better understand how the proposed SPACETIME SSMs lead to the improved empirical performance, we include ablations on the individual closed-loop forecasting SSM (Section 3.2) and preprocessing SSMs (Section 3.1.3).

D.6.1 CLOSED-LOOP FORECASTING SSM

To study how the closed-loop SSM improves long horizon forecasting accuracy, we remove the closed-loop SSM component in our default SPACETIME forecasting architecture (c.f. Appendix D.7), and compare the default SPACETIME with one without any closed-loop SSMs on Informer forecasting tasks. For models without closed-loop SSMs, we replace the last layer with the standard "open-loop" SSM framework (Section 3.1.2), and keep all other layers the same. Finally, for baseline comparison against another SSM without the closed-loop component, we compare against S4. In Table 10, we report standardized MSE on Informer ETT datasets. Adding the closed-loop SSM consistently improves forecasting accuracy, on average lowering relative MSE by 33.2%. Meanwhile, even without the closed-loop SSM, SPACETIME outperforms S4, again suggesting that the companion matrix parameterization is beneficial for autoregressive time series forecasting.

To study how the preprocessing SSM improves long horizon forecasting accuracy, we next compare how SPACETIME performs with and without the weight-initializing preprocessing SSMs introduced in Section 3.1.3. We compare the default SPACETIME architecture (Table 12) with (1) replacing the preprocessing SSMs with randomly initialized default companion SSMs, and (2) removing the preprocessing SSMs altogether. For the former, we preserve the number of layers, but now train the first-layer SSM weights. For the latter, there is one fewer layer, but the same number of trainable parameters (as we fix and freeze the weights for each preprocessing SSM).

In addition, as discussed in Sections 3.1.1 and 3.1.3, we can also fix a, B, or C to specific values to recover useful operations when computing the SSM outputs. We describe specific instantiations of the companion SSM used in our models below (with dimensionality referring to one SSM).

Shift SSM.
We fix the a vector in the companion state matrix $A \in \mathbb{R}^{d \times d}$ to the zero vector in $\mathbb{R}^d$, such that A is the shift matrix (see Eq. 21 for an example). This is a generalization of a 1-D "sliding window" convolution with fixed kernel size equal to the SSM state dimension d. To see how, note that if B is also fixed to the first basis vector $e_1 \in \mathbb{R}^{d \times 1}$, then this exactly recovers a 1-D convolution with kernel determined by C.

Differencing SSM. As a specific version of the preprocessing SSM discussed in Section 3.1.3, we fix $a = 0$, $B = e_1$, and set C to recover various orders of differencing when computing the SSM, i.e.,

$$C = [1\ \ 0\ \ 0\ \ 0\ \ 0\ \dots\ 0] \quad \text{(0-order differencing, i.e., an identity function)} \tag{24}$$
$$C = [1\ \ {-1}\ \ 0\ \ 0\ \ 0\ \dots\ 0] \quad \text{(1st-order differencing)} \tag{25}$$
$$C = [1\ \ {-2}\ \ 1\ \ 0\ \ 0\ \dots\ 0] \quad \text{(2nd-order differencing)} \tag{26}$$
$$C = [1\ \ {-3}\ \ 3\ \ {-1}\ \ 0\ \dots\ 0] \quad \text{(3rd-order differencing)}$$

In this work, we only use the above 0, 1st, 2nd, or 3rd-order differencing instantiations. With multiple differencing SSMs in a multi-SSM SPACETIME layer, we initialize differencing SSMs by running through the orders repeatedly in sequence. For example, given five differencing SSMs, the first four SSMs perform 0, 1st, 2nd, and 3rd-order differencing respectively, while the fifth performs 0-order differencing again.

Moving Average Residual (MA residual) SSM. As another version of the preprocessing SSM, we can fix $a = 0$, $B = e_1$, and set C such that the SSM outputs sample residuals from a moving average applied over the input sequence. For an n-order moving average, we compute outputs with C specified as

$$C = [1 - 1/n,\ -1/n,\ \dots,\ -1/n,\ 0,\ \dots,\ 0] \quad \text{(n-order moving average residual)} \tag{28}$$

For each MA residual SSM, we randomly initialize the order by uniform-randomly sampling an integer in the range [4, d], where d is again the state-space dimension size (recall $C \in \mathbb{R}^{1 \times d}$). We pick 4 as a heuristic which was not finetuned; we leave additional optimization here for further work.
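The fixed C vectors above can be sketched as follows. This is a hedged NumPy illustration: the helper names (`differencing_C`, `ma_residual_C`, `shift_ssm_output`) are hypothetical, and the output is computed in the convolutional view $y_k = \sum_j F_j u_{k-j}$ with kernel $F_j = C A^j B$, which for the shift matrix and $B = e_1$ reduces to $F_j = C_j$.

```python
import numpy as np
from math import comb

def differencing_C(order, d):
    """C for order-th differencing: alternating binomial coefficients, zero-padded to length d."""
    C = np.zeros(d)
    C[: order + 1] = [(-1) ** i * comb(order, i) for i in range(order + 1)]
    return C

def ma_residual_C(n, d):
    """C for the residual from an n-step moving average: [1 - 1/n, -1/n, ..., -1/n, 0, ..., 0]."""
    C = np.zeros(d)
    C[:n] = -1.0 / n
    C[0] += 1.0
    return C

def shift_ssm_output(C, u):
    """With A the shift matrix and B = e_1, the kernel is C itself, so y is a causal convolution."""
    return np.convolve(u, C)[: len(u)]

d = 8
S = np.eye(d, k=-1)                    # shift matrix (a = 0)
e1 = np.zeros(d)
e1[0] = 1.0
C3 = differencing_C(3, d)
kernel = np.array([C3 @ np.linalg.matrix_power(S, j) @ e1 for j in range(d)])  # equals C3

u = np.arange(12.0) ** 2               # toy input sequence
y1 = shift_ssm_output(differencing_C(1, d), u)   # y1[k] = u[k] - u[k-1] for k >= 1
y2 = shift_ssm_output(differencing_C(2, d), u)   # second difference for k >= 2
n = 4
y_ma = shift_ssm_output(ma_residual_C(n, d), u)  # u[k] minus the mean of the last n samples
print(C3, y1[:4], y2[:4], y_ma[n - 1 : n + 2])
```

The first few outputs of each sequence only see a partially filled window, so comparisons against `np.diff` or a moving average should skip those initial samples.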

D.7.2 TASK-SPECIFIC SPACETIME ARCHITECTURES

Here, we provide layer-level details on the SPACETIME networks used in this work. For each task, we describe the number of layers, number of SSMs per layer, state-space dimension (fixed for all SSMs in a network), and which SSMs are used in each layer. Expanding on this last detail, as previously discussed in Section 3.1.2, we can specify multiple SSMs in each SPACETIME layer, computing their outputs in parallel to produce a multidimensional output that is fed as the input to the next SPACETIME layer. The "types" of SSMs do not all have to be the same per layer, and we list the type (companion, shift, differencing, MA residual) and closed-loop designation (standard, closed-loop) of the SSMs in each layer below. For an additional visual overview of a SPACETIME network, please refer back to Figure 2.

Forecasting: Informer and Monash. We describe the architecture in Table 12. We treat the first SPACETIME layer as a "preprocessing" layer, which performs differencing and moving average residual operations on the input sequence. We treat the last SPACETIME layer as a "forecasting" layer, which autoregressively outputs future horizon predictions given the second-to-last layer's outputs as an input sequence.

Classification: ECG. We describe the architectures for each ECG classification task in Tables 13-18. For all models, we use state-space dimension d = 64. As described in the experiments, for classification we compute logits with a mean pooling over the output sequence, where pooling is computed over the sequence length.

Classification: Speech Audio. We describe the architecture for the Speech Audio task in Table 19. We use state-space dimension d = 1024. As described in the experiments, for classification we compute logits with a mean pooling over the output sequence, where pooling is computed over the sequence length.



Other formulations with forecast residuals are also common.

The whole algorithm needs to compute four such quadratic forms, hence it takes time $O(\ell \log \ell + d \log d)$.

A task can belong to multiple splits, resulting in overlapping splits. For example, a task can involve both long context as well as long forecasting horizon.



Figure 2: SPACETIME architecture and components. (Left): Each SPACETIME layer carries weights that model multiple companion SSMs, followed optionally by a nonlinear FFN. The SSMs are learned in parallel (1) and computed as a single matrix multiplication (2). (Right): We stack these layers into a SPACETIME network, where earlier layers compute SSMs as convolutions for fast sequence-to-sequence modeling and data preprocessing, while a decoder layer computes SSMs as recurrences for dynamic forecasting.

$F^y = \mathrm{quad}(\tilde{C}, B) + \mathrm{quad}(\tilde{C}, a) * \mathrm{quad}(e_d, B) / (z - \mathrm{quad}(e_d, a)) \in \mathbb{R}^\ell$, where $e_d = [0, \dots, 0, 1]$ is the d-th basis vector. 4: Return the inverse Fourier transform

Figure 3: AR(p) expressiveness benchmarks. SPACETIME captures AR processes more precisely than similar deep SSM models, forecasting future samples and learning ground-truth transfer functions more accurately.

C.1 Informer Forecasting
C.2 Monash Forecasting
C.3 Time Series Classification
D Extended experimental results
D.1 Expressivity on digital filters
D.2 Informer Forecasting
D.3 Monash Forecasting
D.4 ECG Classification
D.5 Efficiency Results
D.6 SPACETIME Ablations
D.7 SPACETIME Architectures

A RELATED WORK

A.1 CLASSICAL APPROACHES

Classical approaches in time series modeling include the Box-Jenkins method

(20) when $S \in \mathbb{R}^{p \times p}$ is the shift matrix, $B \in \mathbb{R}^{p \times 1}$ is the first basis vector $e_1$, $C \in \mathbb{R}^{1 \times p}$ is a vector of coefficients $\phi_1, \dots, \phi_p$, and $D = 0$, i.e.,

tion, as we only update $a \in \mathbb{R}^d$ (5), we can efficiently scale the hidden-state size to capture more expressive processes with only $O(d)$ parameters. Finally, by learning multiple such SSMs in a single layer, and stacking multiple such layers, we can further scale up expressivity in a deep architecture. To capture and scale up the companion SSM's expressive and autoregressive modeling capabilities, we model multiple companion SSMs in each SPACETIME layer's weights. SPACETIME layers are similar to prior work such as LSSLs, with A, B, C as trainable weights, and D added back as a skip connection. To model multiple SSMs, we add a dimension to each matrix. For s SSMs per SPACETIME layer, we specify weights $A \in \mathbb{R}^{s \times d \times d}$, $B \in \mathbb{R}^{d \times s}$, and $C \in \mathbb{R}^{s \times d}$. Each slice in the s dimension represents an individual SSM. We thus compute s outputs and hidden states

Univariate forecasting results on Informer ETT datasets. Best results in bold. SPACETIME results reported as means over three seeds. Additional datasets, horizons, and method comparisons in App. D.2.

ECG statement classification on PTB-XL (100 Hz version). Baseline AUROC from Strodthoff et al. (2021) (error bars in App. D.4).

Speech audio

Longer horizon forecasting on Informer ETTh datasets. Mean standardized MSE reported. SPACETIME obtains lower MSE when trained to forecast longer horizons.

Train wall-clock time.

Table 1 in the supplementary material. We provide training hyperparameters and dataset details for each benchmark in Appendix C, discussing the Informer forecasting benchmark in Appendix C.1, the Monash forecasting benchmark in Appendix C.2, and the ECG and speech audio classification benchmarks in Appendix C.3. We provide proofs for all propositions and algorithm complexities in Appendix B.

A.1 Classical Approaches
A.2 Deep Learning Approaches
B Proofs and Theoretical Discussion
B.1 Expressivity Results
B.2 Efficiency Results
B.3 Companion Matrix Stability

parametrizes the vector field of continuous-time autonomous systems. These models, termed Neural Differential Equations (NDEs), have seen extensive application to time series and sequences, first by Rubanova et al. (2019) and then by Kidger et al. (2020); Morrill et al. (2021); Massaroli et al. (2021) with the notable extension to Neural Controlled Differential Equations (Neural CDEs). Neural CDEs can be considered the continuous-time, nonlinear version of state space models and RNNs (Kidger, 2022). Rather than introducing nonlinearity between linear state space layers, Neural CDEs model nonlinear systems driven by a control input. The NDE framework has been further applied by Poli et al. (2019) to model graph time series via Neural Graph Differential Equations. In Queiruga et al. (2020), a continuous-depth ResNet generalization based on ODEs is proposed, and in Kim et al. (

Speech Commands training details. To train SPACETIME, we use the same hyperparameters used by S4: a learning rate of 0.01 with a plateau scheduler with patience 20, dropout of 0.1, 128 SSMs per layer, 6 layers, batch normalization, trained for 200 epochs.

Comparing sequence models on the task of approximating the input-output map defined by digital filters of different orders. Test RMSE on held-out inputs at unseen frequencies.

Closed-loop SSM Ablation We ablate the closed-loop SSM component in SPACETIME, comparing against the prior S4 SSM on four Informer time series forecasting tasks. Removing the closed-loop SSM consistently hurts forecasting accuracy for SPACETIME.

In Table 11, we report standardized MSE on Informer ETT datasets. We find that fixing the first-layer SSMs of a SPACETIME network to preprocessing SSMs consistently improves forecasting performance, achieving 4.55% lower MSE on average than the ablation with just trainable companion matrices. Including the preprocessing layer also improves MSE by 9.26% on average compared to removing the layer altogether. These results suggest that preprocessing SSMs are beneficial for time series forecasting, e.g., by performing classic time series modeling techniques on input data. Unlike other approaches, SPACETIME is able to flexibly and naturally incorporate these operations into its network layers via simple weight initializations of the same general companion SSM structure.
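One way to see how a fixed weight initialization can act as a classical preprocessing step is through the SSM's convolution kernel. The sketch below is a minimal NumPy illustration, not the authors' implementation. It assumes the recurrence x_{k+1} = A x_k + B u_k, y_k = C x_k + D u_k, fixes A to a shift matrix (a companion matrix with a = 0), sets B = e_1, and chooses C and D by hand so that the kernel matches first-order differencing or a 3-step moving average. The function name `ssm_kernel` and the sizes used are hypothetical.

```python
import numpy as np

def ssm_kernel(A, B, C, D, length):
    """Impulse response [D, CB, CAB, CA^2B, ...] of the assumed SSM."""
    kernel = np.empty(length)
    kernel[0] = D
    x = B.astype(float)
    for k in range(1, length):
        kernel[k] = C @ x     # C A^{k-1} B
        x = A @ x
    return kernel

d = 4
A = np.eye(d, k=-1)           # shift matrix: companion matrix with a = 0
B = np.zeros(d)
B[0] = 1.0                    # e_1

# First-order differencing: y_k = u_k - u_{k-1}
C_diff = np.array([-1.0, 0.0, 0.0, 0.0])
k_diff = ssm_kernel(A, B, C_diff, D=1.0, length=6)

# 3-step moving average: y_k = (u_k + u_{k-1} + u_{k-2}) / 3
C_ma = np.array([1 / 3, 1 / 3, 0.0, 0.0])
k_ma = ssm_kernel(A, B, C_ma, D=1 / 3, length=6)

# Apply a kernel as a causal convolution (inputs before time 0 are zero).
u = np.array([1.0, 2.0, 4.0, 7.0])
y_diff = np.convolve(u, k_diff)[: len(u)]
```

Because the shift matrix is nilpotent, the kernel has at most d + 1 nonzero entries, so these preprocessing SSMs behave as short finite-impulse-response filters.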

Preprocessing SSM Ablation We ablate the preprocessing SSM layer in SPACETIME, comparing

We provide the specific SPACETIME architecture configurations used for forecasting and classification tasks. Each configuration follows the general architecture presented in Section 3.1 and Figure 2, and consists of repeated Multi-SSM SPACETIME layers. We first provide additional details on specific instantiations of the companion SSMs we use in our models, e.g., how we instantiate preprocessing SSMs to recover specific techniques (Section 3.1.3). We then include the layer-specific details of the number and type of SSM used in each network.

D.7.1 SPECIFIC SSM PARAMETERIZATIONS

In Section 3.1.1, we described the general form of the companion SSM used in this work. By default, for any individual SSM we learn the a column in A and the vectors B, C as trainable parameters in a neural net module. We refer to these SSMs specifically as companion SSMs.

SPACETIME forecasting architecture. For all SSMs, we keep state-space dimension d = 128. Repeated Identity denotes repeating the input to match the number of SSMs in the next layer, i.e., 128 SSMs in this case. For each forecasting task, d ′ denotes time series samples' number of features, ℓ denotes the lag size (number of past samples given as input), and h denotes the horizon size (number of future samples to be predicted).

SPACETIME architecture for ECG SuperDiagnostic classification. For all SSMs, we keep state-space dimension d = 64. Input samples have d′ = 12 features and length ℓ = 1000 time-steps. The number of classes c = 5.

SPACETIME architecture for ECG SubDiagnostic classification. For all SSMs, we keep state-space dimension d = 64. Input samples have d′ = 12 features and length ℓ = 1000 time-steps. The number of classes c = 23.

SPACETIME architecture for ECG Diagnostic classification. For all SSMs, we keep state-space dimension d = 64. Input samples have d′ = 12 features and length ℓ = 1000 time-steps. The number of classes c = 44.

SPACETIME architecture for ECG Form classification. For all SSMs, we keep state-space dimension d = 64. Input samples have d′ = 12 features and length ℓ = 1000 time-steps. The number of classes c = 19.

SPACETIME architecture for ECG Rhythm classification. For all SSMs, we keep state-space dimension d = 64. Input samples have d′ = 12 features and length ℓ = 1000 time-steps. The number of classes c = 12.

SPACETIME architecture for ECG All classification. For all SSMs, we keep state-space dimension d = 64. Input samples have d′ = 12 features and length ℓ = 1000 time-steps. The number of classes c = 71.

SPACETIME architecture for Speech Audio classification. For all SSMs, we keep state-space dimension d = 1024. Input samples have d′ = 1 feature and length ℓ = 16000 time-steps. The number of classes c = 10.

8. ACKNOWLEDGEMENTS

We thank Albert Gu, Yining Chen, Ke Alexander Wang, and Rose Wang for helpful discussions and feedback. We also gratefully acknowledge the support of NIH under No. U54EB020405 (Mobilize), NSF under Nos. CCF1763315 (Beyond Sparsity), CCF1563078 (Volume to Velocity), and 1937301 (RTML); US DEVCOM ARL under No. W911NF-21-2-0251 (Interactive Human-AI Teaming); ONR under No. N000141712266 (Unifying Weak Supervision); ONR N00014-20-1-2480: Understanding and Applying Non-Euclidean Geometry in Machine Learning; N000142012275 (NEPTUNE); NXP, Xilinx, LETI-CEA, Intel, IBM, Microsoft, NEC, Toshiba, TSMC, ARM, Hitachi, BASF, Accenture, Ericsson, Qualcomm, Analog Devices, Google Cloud, Salesforce, Total, the HAI-GCP Cloud Credits for Research program, the Stanford Data Science Initiative (SDSI), and members of the Stanford DAWN project: Facebook, Google, and VMWare. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views, policies, or endorsements, either expressed or implied, of NIH, ONR, or the U.S. Government.


Multivariate signals. We additionally compare the performance of SPACETIME to state-of-the-art comparison methods on ETT multivariate settings. We focus on horizon length 720, the longest evaluated in prior works. In Table 8, we find SPACETIME is competitive with NLinear, which achieves the best performance among comparison methods. SPACETIME also notably outperforms S4 by large margins, supporting the companion matrix representation once more.

D.3 MONASH FORECASTING

We report the results across all datasets in Table 20. We also investigate the performance of models by aggregating datasets based on common characteristics. Concretely, we generate sets of tasks based on the following properties:
• Large dataset: the dataset contains more than 2000 effective training samples.
• Long context: the models are provided a context of length greater than 20 as input.
• Long horizon: the models are asked to forecast more than 20 steps into the future.
Figure 6 shows the average model ranking (x/13) in terms of test RMSE across splits. We contextualize SPACETIME results with the best classical and deep learning methods (TBATS and DeepAR). SPACETIME's relative performance is noticeably higher when context and forecasting horizons are longer, and when a larger number of samples is provided during training.
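The slice-and-rank aggregation described above can be sketched as follows. This is a minimal NumPy illustration with made-up placeholder numbers and generic model columns, not the Monash results or the authors' evaluation script. The function name `average_ranks` is hypothetical, and ties are broken by column order, which is a simplification.

```python
import numpy as np

def average_ranks(rmse, mask=None):
    """Average per-dataset rank of each model (rank 1 = lowest test RMSE).

    rmse: array of shape (n_datasets, n_models)
    mask: optional boolean array selecting a slice of datasets
    """
    ranks = rmse.argsort(axis=1).argsort(axis=1) + 1
    if mask is not None:
        ranks = ranks[mask]
    return ranks.mean(axis=0)

# Placeholder values: 4 datasets x 3 models (columns are generic models).
rmse = np.array([
    [0.5, 0.7, 0.6],
    [1.2, 1.0, 1.1],
    [0.3, 0.4, 0.2],
    [2.0, 1.5, 1.8],
])
n_train = np.array([5000, 300, 2500, 100])   # effective training samples
horizon = np.array([24, 8, 48, 12])          # forecast steps

overall = average_ranks(rmse)
large = average_ranks(rmse, mask=n_train > 2000)    # "Large dataset" slice
long_horizon = average_ranks(rmse, mask=horizon > 20)
```

The same mask pattern applies to the long-context slice by thresholding the context length at 20.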

Monash Model Rankings

Figure 6: Relative test RMSE rankings (*/13 models) across different slices of the 33 datasets in the Monash repository (Godahewa et al., 2021). SPACETIME achieves the best overall ranking across all tasks and is significantly more accurate on tasks involving long forecast horizons and larger numbers of training samples.

D.4 ECG CLASSIFICATION

In addition to our results table in the main paper, we also provide the mean and standard deviations of the two models we ran in-house (SPACETIME and S4) in Table 9.

Table 9: ECG statement classification on PTB-XL (100 Hz version). We report the mean and standard deviation over three random seeds for the three methods we ran in-house.

D.5 EFFICIENCY RESULTS

We additionally validate empirically that SPACETIME trains in near-linear time with respect to the horizon length. We also use synthetic data, scaling the horizon from 1 to 1000.
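A simple way to quantify "near-linear" from such a sweep is to fit the slope of log time against log horizon: a slope close to 1 indicates linear scaling. The sketch below is a minimal NumPy illustration using synthetic placeholder timings generated from a formula, not measurements from the paper. The function name `scaling_exponent` is hypothetical.

```python
import numpy as np

def scaling_exponent(lengths, seconds):
    """Least-squares slope of log(seconds) versus log(lengths).

    A slope near 1 suggests time grows linearly with length;
    a slope near 2 suggests quadratic growth.
    """
    slope, _intercept = np.polyfit(np.log(lengths), np.log(seconds), deg=1)
    return slope

horizons = np.array([1, 10, 100, 1000], dtype=float)

# Synthetic placeholder timings (not measured): exactly linear and quadratic.
linear_times = 2e-3 * horizons
quadratic_times = 1e-5 * horizons ** 2

linear_slope = scaling_exponent(horizons, linear_times)
quadratic_slope = scaling_exponent(horizons, quadratic_times)
```

With real wall-clock measurements, constant overheads at small horizons can pull the fitted slope below 1, so restricting the fit to the larger horizons is a reasonable design choice.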

