KOOPMAN NEURAL FORECASTER FOR TIME SERIES WITH TEMPORAL DISTRIBUTION SHIFTS

Abstract

Temporal distributional shifts, with underlying dynamics changing over time, frequently occur in real-world time series, and pose a fundamental challenge for deep neural networks (DNNs). In this paper, we propose a novel deep sequence model based on the Koopman theory for time series forecasting: Koopman Neural Forecaster (KNF) that leverages DNNs to learn the linear Koopman space and the coefficients of chosen measurement functions. KNF imposes appropriate inductive biases for improved robustness against distributional shifts, employing both a global operator to learn shared characteristics, and a local operator to capture changing dynamics, as well as a specially-designed feedback loop to continuously update the learnt operators over time for rapidly varying behaviors. We demonstrate that KNF achieves the superior performance compared to the alternatives, on multiple time series datasets that are shown to suffer from distribution shifts.

1. INTRODUCTION

Temporal distribution shifts frequently occur in real-world time-series applications, from forecasting stock prices to detecting and monitoring sensory measures, to predicting fashion trend based sales. Such distribution shifts over time may due to the data being generated in a highly-dynamic and non-stationary environment, abrupt changes that are difficult to predict, or constantly evolving trends in the underlying data distribution (Gama et al., 2014) . Temporal distribution shifts pose a fundamental challenge for time-series forecasting (Kuznetsov & Mohri, 2020) . There are two scenarios of distribution shifts. When the distribution shifts only occur between the training and test domains, meta learning and transfer learning approaches (Jin et al., 2021; Oreshkin et al., 2021) have been developed. The other scenario is much more challenging: distribution shifts occurring continuously over time. This scenario is closely related to "concept drift" (Lu et al., 2018) and non-stationary processes (Dahlhaus, 1997) but has received less attention from the deep learning community. In this work, we focus on the second scenario. To tackle temporal distribution shifts, various statistical estimation methods have been studied, including spectral density analysis (Dahlhaus, 1997) , sample reweighting (Bennett & Clarkson, 2022; McCarthy & Jensen, 2016) and Bayesian state-space models (West & Harrison, 2006) . However, these methods are limited to low capacity auto-regressive models and are typically designed for short-horizon forecasting. For large-scale complex time series data, deep learning models (Oreshkin et al., 2021; Woo et al., 2022; Tonekaboni et al., 2022; Zhou et al., 2022) now increasingly outperform traditional statistical methods. Yet, most deep learning approaches are designed for stationary timeseries data (with i.i.d. assumption), such as electricity usage, sales and air quality, that have clear seasonal and trend patterns. For distribution shifts, DNNs have been shown to be problematic in forecasting on data with varying distributions (Kouw & Loog, 2018; Wang et al., 2021) . DNNs are black-box models and often require a large number of samples to learn. For time series with continuous distribution shifts, the number of samples from a given distribution is small, thus DNNs would struggle to adapt to the changing distribution. Furthermore, the non-linear dependencies in a DNN are difficult to interpret or manipulate. Directly modifying the parameters based on the change in dynamics may lead to undesirable effects (Vlachas et al., 2020) . Therefore, if we can reduce non-linearity and simplify dynamics modeling, then we would be able to model time series in a much more interpretable and robust manner. Koopman theory (Koopman, 1931) provides convenient tools to simplify the dynamics modeling. It states that any nonlinear dynamics can be modeled by a linear Koopman operator acting on the space of measurement functions (Brunton et al., 2021) , thus the dynamics can be manipulated by simply modifying the Koopman matrix. In this paper, we propose a novel approach for accurate forecasting for time series with distribution shifts based on Koopman theory: Koopman Neural Forecaster (KNF). Our model has three main features: 1) we combine predefined measurement functions with learnable coefficients to introduce appropriate inductive biases into the model. 2) our model employs both global and local Koopman operators to approximate the forward dynamics: the global operator learns the shared characteristics; the local operator captures the local changing dynamics. 3) we also integrate a feedback loop to cope with distribution shifts and maintain the model's long-term forecasting accuracy. The feedback loop continuously updates the learnt operators over time based on the current prediction error. Leveraging Koopman theory brings multiple benefits to time series forecasting with distribution shifts: 1) using predefined measurement functions (e.g., exponential, polynomial) provide sufficient expressivity for the time series without requiring a large number of samples. 2) since the Koopman operator is linear, it is much easier to analyze and manipulate. For instance, we can perform spectral analysis and examine its eigenfunctions, reaching a better understanding of the frequency of oscillation. 3) Our feedback loop makes the Koopman operator adaptive to non-stationary environment. This is fundamentally different from previous works that learns a single and fixed Koopman operator (Han et al., 2020; Takeishi et al., 2017; Azencot et al., 2020) . In summary, our major contributions include: • Proposing a novel deep forecasting model based on Koopman theory for time-series data with temporal distributional shifts. • The proposed approach allows the Koopman matrix to both capture the global behaviors and evolve over time to adapt to local changing distributions. • Demonstrating state-of-the-art performance on highly non-stationary time series datasets, including M4, cryptocurrency return forecasting and sports player trajectory prediction. • Generating interpretable insights for the model behavior via eigenvalues and eigenfunctions of the Koopman operators.

2. RELATED WORK

DNNs for time-series forecasting. DNNs are shown to increasingly outperform traditional statistical methods (such as exponential smoothing (ETS) (Gardner Jr, 1985) and ARIMA (Ariyo et al., 2014) ) for time series forecasting. For example, Tonekaboni et al. (2022) ; Wang et al. (2019) proposed to use DNNs to learn the local and global representations of time series separately, showing high accuracy on sales and weather data. Woo et al. (2022) leverages inductive biases in different architectures and also specially-designed contrastive loss to learn disentangled seasonal and trend representations. Sen et al. (2019) utilized a global TCN to avoid normalization before training when there are wide variations in scale. But it focuses mainly on better modeling the relationships between time series rather than advances in modeling over time as ours. Transformer-based approaches are particularly effective in time series forecasting, particularly on datasets including electricity and traffic (Zhou et al., 2022; Wu et al., 2021; Zhou et al., 2021) , which are relatively stationary and have clear seasonality and trend dynamics. Robustness against temporal distribution shifts. Non-stationarity poses a great challenge for time series forecasting. To cope with varying distributions, one approach is to stationarize the input data. Kim et al. (2022) (Makridakis et al., 2020; Malinin et al., 2021) . Smyl (2020) combines ETS with a RNN, where the seasonality and (1) 1 ), ..., g n (v (n) t )]. G(V t ) is computed based on the set of predfined measurement functions G and their input values V t learnt by the encoder. The model utilizes a global operator K g and an adaptive local operator K l t learnt by a Transformer encoder to model the evolution of measurements. On the forecasting window, an additional adjustment operator K c t is learnt by a MLP module based on the prediction error on the lookback window. smoothing coefficients, are fitted concurrently with the RNN weights using the same gradient descent optimization. Oreshkin et al. (2019) proposes a deep residual architecture for univariant time series data, in which each block used MLP to learn coefficients of basis functions, such as polynomials and sine functions. Indeed, we show that our model outperforms both Smyl (2020) and Oreshkin et al. (2019) on the M4 financial time series dataset (Makridakis et al., 2020) . Koopman theory. Koopman theory (Koopman, 1931; Strogatz, 2018) shows that a nonlinear dynamical system can be represented as an infinite-dimensional linear Koopman operator acting on a space of measurement functions. The spectral decomposition of the linear Koopman operator can provide insights on the behaviors of nonlinear systems. Traditionally, dynamic mode decomposition (DMD) (Brunton et al., 2016) is commonly used for approximating the Koopman Operator. However, it is highly nontrivial to find appropriate measurement functions as well as the Koopman operator. Many works at the intersection of machine learning and Koopman theory employ DNNs to learn measurement functions from data (Han et al., 2020; Li et al., 2020; Morton et al., 2019; Li & Jiang, 2021; Takeishi et al., 2017; Fan et al., 2022) . Yeung et al. (2019) and Lusch et al. (2018) propose to use fully connected DNNs to learn the transformation between the observations and measurement functions that span a Koopman invariant subspace and the Koopman operator is approximated by a linear layer. Azencot et al. (2020) also designed an autoencoder architecture based on Koopman theory to forecast fluid dynamics. Different from these approaches, we allow the Koopman matrix to evolve over time for each time series to adapt to the changing distribution.

3.1. TIME-SERIES FORECASTING WITH KOOPMAN THEORY

Time series data {x t } T t=1 can be considered as observations of a dynamical system states -consider following discrete form: x t+1 = F (x t ), where x ∈ X ⊆ R d is the system state, and F is the underlying governing equation. We focus on multi-step forecasting task of predicting the future states given a sequence of past observations. Formally, we seek a function map f such that: f : (x t-q+1 , . . . , x t ) -→ (x t+1 , . . . , x t+h ), ( ) where q is the lookback window length and h is the forecasting window length. Koopman theory (Koopman, 1931) shows that any nonlinear dynamic system can be modeled by an infinite-dimensional linear Koopman operator acting on the space of all possible measurement functions. More specifically, there exists a linear infinite-dimensional operator K : G(X ) → G(X ) that acts on a space of real-valued measurement functions G(X ) := {g : X → R}. The Koopman operator maps between function spaces and advances the observations of the state to the next step: Kg(x t ) = g(F (x t )) = g(x t+1 ). We propose Koopman Neural Forecasting (KNF), a deep sequence model based on Koopman theory to forecast highly non-stationary time series, as shown in Fig. 1 . It instantiates an encoder-decoder architecture for each time series segment. The encoder takes in observations from multiple time steps as the underlying dynamics may contain higher-order time derivatives. Our model has three main features: 1) we use predefined measurement functions with learned coefficients to map time series to the functional space. 2) the model employs both a global Koopman operator to learn the shared characteristics and a local Koopman operator to capture the local changing dynamics. 3) we also integrate a feedback loop to update the learnt operators over time based on forecasting error, maintaining model's long-term forecasting performance.

3.2. LEVERAGING PREDEFINED MEASUREMENT FUNCTIONS

We define a set of measurement functions G := [g 1 , • • • , g n ] that spans the Koopman space, where each g i : R → R. For example, g 1 (x) = sin(x). These functions are canonical nonlinear functions and are often used to model complex dynamical systems, such as Duffing oscillator and fluid dynamics (Brunton et al., 2021; Kutz et al., 2016) . They also provide a sample-efficient approach to represent highly nonlinear behavior that may be difficult to learn for DNNs. We use an encoder to generate the coefficients of the measurement functions Ψ(X t ), such as the frequencies of sine functions. Let n be the number of measurement functions for each feature, d be the number of features in a time series and k be the number of steps encoded by the encoder Ψ : R d×k → R n×d×k every time. The lookback window length q is a multiple of k and we denote x tk:(t+1)k as X t ∈ R d×k for simplicity. As shown in the Eq.3 below, we first obtain a latent matrix V t = [v (1) t , v (2) t , • • • , v t ] ∈ R n×d . Every vector v i ∈ R d is a different linear transformation of the observations, where the weights are learnt by the encoder Ψ: V t [i, j] = l Ψ(X t )[i, j, l]X t [j, l]; 1 ≤ i ≤ n, 1 ≤ j ≤ d, 1 ≤ l ≤ k. (3) Our measurement functions are defined in the latent space rather than the observational space. We apply a set of predefined measurement functions G to the latent matrix V t : G(V t ) = [g 1 (v (1) t ), g 2 (v (2) t ), ..., g n (v (n) t )] ∈ R n×d In our implementation, we flatten G(V t ) into a vector and then finite Koopman operator should be a nd × nd matrix. Finally, we use a decoder Φ : R n×d → R k×d to reconstruct the observations from the measurements: Xt = Φ(G(V t )). Here, the encoder Ψ and the decoder Φ can be any DNN architecture, for which we use multi-layer perceptron (MLP). The set of measurement functions G contains polynomials, exponential functions, trigonometric functions as well as interaction functions. These predefined measurement functions are useful in imposing inductive biases into the model and help capture the non-linear behaviors of time series. The encoder model needs to approximate only the parameters of these functions without the need of directly learning non-stationary characteristics. Ablation studies (in Sec. 4.8) demonstrate that using predefined measurement functions significantly outperforms the model with learned measurement functions in the previous works.

3.3. GLOBAL AND LOCAL KOOPMAN OPERATORS

Dynamic mode decomposition (DMD) (Tu et al., 2013) is traditionally used to find the Koopman operator that best propagates the measurements. But for time series with temporal distribution shift, we need to compute spectral decomposition and learn a Koopman matrix for every sample (i.e. slice of a trajectory), which is computationally expensive. So we utilize DNNs to learn Koopman operators. In classic Koopman theory, the measurement vector G(V t ) is infinite-dimensional, which is impossible to learn. We assume that encoder is learning a finite approximation and G(V t ) forms a finite Koopmaninvariant subspace. Thus, the Koopman operator K that we need to find is the finite matrix that best advances the measurements forward in time. While the Koopman matrix should vary across samples and time in our case, it should also capture the global behaviors. Thus, we propose to use both a global operator K g and a local operator K l t to model the propagation of dynamics in the Koopman space, defined as below: KG(V t ) := (K g + K l t )G(V t ) = G(V t+1 ), t ≥ 0. The global operator K g is an nd × nd trainable matrix that is shared across all time series. We use the global operator to learn the shared behavior such as trend. The local Koopman operator K l t , on the other hand, is based on the measurement functions on the lookback window for each sample, shown in Fig. 1 . The local operator should capture the local dynamics specific to each sample. Since we generate the forecasts in an autoregressive way, the local operator depends on time t and varies across autoregressive steps, adapting to the distribution changes along prediction. We use a Transformer architecture with a single-head as the encoder, to capture the relationships between measurements at different steps. We use the attention weight matrix in the last layer as the local Koopman operator.

3.4. FEEDBACK LOOP

Suppose an abrupt distributional shift occurs in the middle of the look-back window, the model would still try to fit two distributions before and after the shift but a single proporgration matrix is never good enough to model multiple distributions. This will results in the inaccurate operator used for the forecasting window. To address it, we add an additional feedback closed-loop, in which we employ an MLP module Γ to learn the adjustment operator K c t based on the prediction errors in the lookback window. It is directly added to other operators when making predictions on the forecasting window, as shown in Fig. 1 . More specifically, we apply global and local operators recursively to the measurements at the first step in the lookback window to obtain predictions: Xt-q/k+i = Φ(( K g + K l t ) i G(V t-q/k )), 0 < i ≤ q/k. Then, the feedback module Γ uses the difference between the predictions on the lookback window and the observed data to generate additional adjustment operator K c t , which is a diagonal matrix: K c t = Γ( Xt-q/k:t -X t-q/k:t ) = Γ(x t-q:t -x t-q:t ) If the predictions deviate significantly from the ground truth within the lookback window, the operator K c t would learn the temporal change in the underlying dynamics and correspondingly adjust the other two operators. Thus, for forecasting, the sum of all three operators is used: Xt+i = Φ((K g + K l t + K c t ) i G(V t )), i > 0. In a word, the feedback module is designed to detect the distributional shifts in the lookback window and adapt the global+local operator to the latest distribution in the lookback window.

3.5. LOSS FUNCTIONS

KNF is trained in an end-to-end fashion, using superposition of three loss terms L = L rec + L back + L forw . Denote l as a distance metric for which we use the L2 loss. The first term is the reconstruction loss, to ensure the decoder Φ can reconstruct the time series from the measurements: L rec = l(X t , Φ(G(Ψ(X t )X t ))), t ≥ 0. ( ) The second term is the prediction loss on the lookback window to ensure the sum of global and local operators is the best-fit propagation matrix on the lookback window. L back = l(X t-q/k+i , Φ(( K g + K l t ) i G(Ψ(X t-q/k )X t-q/k )), 0 < i ≤ q/k. ( ) The third term is for prediction accuracy in the forecasting window to guide the feedback loop to learn the correct adjustment placed on the Koopman operator. L forw = l(X t+i , Φ((K g + K l t + K c t ) i G(V t )), i > 0. ( ) 4 EXPERIMENTS

4.1. DATASETS

We benchmark our method on three time series datasets: M4 (Makridakis et al., 2020 ), Crypto (Arik et al., 2022) and Basketball Player Trajectories (Li et al., 2021) . These time series are particularly chosen as they are difficult to forecast due to high nonstationarity and abrupt temporal changes, as analyzed in Sec. 4.2. M4. It contains 10000 highly nonstationary univariate time series with different frequencies from hourly to yearly and different categories from financials to demographics. The forecasting horizon varies across different frequencies. We directly compare KNF with the M4 competition winner (Smyl, 2020) , the second place (Montero-Manso et al., 2020) and the ensemble N-Beats-I+G (Oreshkin et al., 2019) that has achieved the competitive prediction performance on M4. Crypto.foot_0 This multivariate dataset contains 8 features on historical trades, such as open and close prices, for 14 cryptocurrencies. The original challenge is to predict 3-step ahead 15-minute relative future returns. Since we focus on long-term forecasting, we train all models to make 15-step predictions of 15-minute relative future returns. We use the original training set from the competition and do an 80%-10%-10% training-validation-test split. Player Trajectory.foot_1 This dataset contains basketball player movement trajectories from NBA games in 2016. We randomly sample 300 player trajectories for training and validation and 50 trajectories for testing. All models are trained to yield 30-step ahead predictions.

4.2. DISTRIBUTION SHIFTS AND FORECASTABILITY

As a way of motivating for the improvement scenarios of our method, we show that the three datasets we focus are much more difficult to predict than a commonly-used datasets like Electricity (Harries & Wales, 1999) with the following three metrics: (1) Forecastability (Goerg, 2013) : it is one minus the entropy of Fourier decomposition of the time series; (2) Lyapunov exponents (LEs) (Dingwell, 2006; Schölzel, 2019) : a measure of how sensitive a dynamical system is to initial conditions.(3) Trend: the slope of the linear regression fitted on the time series scaled by its own magnitude; and (4) Seasonality: we use ACF test (Witt et al., 1998) to test if there is clear seasonality. We report the mean forecastability, mean trend and the percentage of the slices that have seasonality in Table 1 , averaged over slices with length 20. We can see that seasonality is dominant for Electricity dataset, which also has significantly higher forecastability compared to the datasets we experiment with. Moreover, we show that KNFachieves the state-of-the-art prediction performance on those datasets that have high Lyapunov exponents, low forecastability, no clear trends and seasonality, including Crypto, Player Trajectories, M4-monthly, M4-weekly and M4-daily. Electricity Crypto Player Traj. 

4.3. BASELINES

For M4, we compare with the winner solution (Smyl, 2020) , the second best (Montero-Manso et al., 2020) and N-beats (Oreshkin et al., 2019) . On the Crypto and Player Trajectory datasets, we compare KNF with four different types of models. • Vector ARIMA (VARIMA) (Stock & Watson, 2001) : It is an extension of the classic ARIMA model for multivariate time series. • Multi-layer Perceptron (MLP) that maps the historic observations to the future and roll out autoregressively to yield long-term predictions. It is a commonly-used DL model for time series forecasting (Arik et al., 2022; Wang et al., 2021; Faloutsos et al., 2018) . • Random Forest (RF) applied autoregressively same as MLP. It is a traditional ML model that have achieved competitive performance on many time series benchmarks (Godahewa et al., 2021; Masini et al., 2021; Kane et al., 2014; Hyndman & Athanasopoulos, 2018) . • Long Expressive Memory (LEM) (Rusch et al., 2022) : An SOTA recurrent model for learning long-term sequential dependencies. • FedFormer (FedFormer) (Zhou et al., 2022) is a state-of-art transformer-based model, which has outperformed other transformer-based models, such as Informer (Zhou et al., 2021) and Autoformer (Wu et al., 2021) , on many datasets, including electricity, traffic, weather, etc. • Variational Beam Search (VBS) (Li et al., 2021) is a Bayesian online learning model proposed to detect and adapt to temporal distributional shifts. Since VBS is not designed for time series forecasting, we modify it by feeding its own prediction of next step back to the input (instead of the ground truth) to yield multi-step predictions.

4.4. TRAINING STRATEGIES

We find that the following two training strategies can improve the prediction accuracy of models by varying degrees. The first one is the reversible instance normalization ReVIN (Kim et al., 2022) , which normalize the input sequence and de-normalize the predictions at every autoregressive step for every instance. We use ReVIN for MLP and KNF since it can improve their prediction accuracy. An ablation study of ReVIN on KNF can be found in Table 4 . The second one is, temporal bundling TB (Brandstetter et al., 2022) that asks autoregressive models to generate multiple-step predictions instead of just one on every call, to reduce number of model calls and therefore error propagation speed. We observe this strategy can improve prediction accuracy of both MLP and RF. For all datasets, we use a sliding window approach to generate training samples. On Crypto and Player Trajectory datasets, we perform a grid search of hyperparameters, including learning rate, input length, hidden dimension, number of predictions made in each autoregressive step, etc. for all models. The hyperparameter tuning ranges can be found in the Appendix A.3 Table 6 . The default set of measurement functions used in KNF includes polynomials up to the order of four, one exponential function and trigonometric functions with the same number of input steps for each feature, as well as pairwise product interaction functions between features if the time series data is multivariate.

4.5. SETUP

We report the mean and standard deviation averaged across five runs. We follow the literature and use RMSE for evaluation on Player Trajectories and the weighted RMSE on Crypto, where the weight corresponds to the importance of each cryptocurrency same as in the competition. Since VBS is deterministic once its inverse temperature parameter is fixed, we do not report its standard deviation. On M4, we evaluate models with the sMAPE metric (Makridakis et al., 2020) foot_2 used in the original competition. Ensembling is used by most models in the M4-competition and N-Beats (Oreshkin et al., 2019) , so we ensemble five KNF with best hyperparameters but trained with random seeds. Since the evaluation in M4 competition is only based on a single submission, we also report the sMAPE of a single ensembled prediction without standard deviation. 2019) does not report the breakdown sMAPE of N-beats on weekly, daily and hourly, so we did not include them in the table. The number behind each frequency in the table is the number of prediction steps required in the competition. We can see that KNF achieve the state-of-the-art accuracy on Weekly, Daily and Monthly data that have longer forecasting horizons, and no clear trends and seasonality. This highlights the value proposition of KNF for accurate long-term forecasting accuracy especially for time-series with nonstationary characteristics and low forecastability. KNF also achieve competitive performance on Yearly, Quarterly and Hourly and the full results on M4 are in Appendix A.4.

4.7. RESULTS ON CRYPTO AND PLAYER TRAJECTORIES

Table 3 shows the Prediction RMSE on Crypto and Player Trajectories datasets. KNF consistently achieves the best performance on both, across different forecasting horizons. Fig. 2 exemplifies the predictions of KNFand best-performing baselines. We observe that KNF can capture both overall trends and small fluctuations in a superior way, yielding much higher accuracy. Fig. 2 right visualizes the predictions on a trajectory with changing direction. KNF correctly predicts the change of moving direction while FedFormer fails. Regarding baseline performances, RevIN and TB greatly improves MLP, and MLP+RevIN+TB is the best performing one on Crypto, and FedFormer performs better on Player Trajectory, that are less chaotic and irregular. Though VBS was shown to perform well on the online change point detection task, it does not appear to be as successful for time series forecasting based on its performance on these two datasets. We also include additional experimental results on the Traffic and Exchange Datasets (Zeng et al., 2023) in Appendix A.5.

4.8. ABLATION STUDY ON M4

We performed an ablation study of KNF on the M4-Weekly data to understand the contribution of each component in our model. Table 4 shows the sMAPE of ensembled predictions from five funs of each variant. We denote KNF-base as the basic backbone of our model that only has an encoder-decoder architecture and a local Koopman operator. KNF-base-G uses purely data-driven measurements, which is similar to the Koopman autoencoders proposed in (Yeung et al., 2019; Azencot et al., 2020) . This can be considered a baseline for our model. KNF-base-I uses predefined measurement functions, which significantly outperforms KNF-base-G. This demonstrates the benefits of leveraging hard-coded functions. Moreover, adding any of ReVIN, the global Koopman operator and the feedback loop also brings great improvement based on the results shown in Table 4 . To further demonstrate the effectiveness of the feedback loop, we visualize the predictions on M4weekly data from KNF and KNF with the feedback module removed after training. We can observe that the predictions from the model with the feedback loop removed start to deviate from the ground truth after a few steps. That means the feedback loop can cope with the temporal distributional shifts and thus improve the long-horizon forecasting accuracy. We perform additional ablation studies on measurement functions on M4-yearly data shown in Appendix A.1, demonstrating the necessity of each type of measurement functions.

4.9. INTERPRETABILITY RESULTS

Figure 4 : Reconstructions from KNF on the lookback window with only one of eigenfunctions of K g + K l t on a M4 time series. KNF can offer unique interpretability capabilities via spectral analysis of the Koopman operators. We perform eigen-decomposition on K g + K l t for KNF trained on M4-weekly data to investigate what individual eigenfunctions have learnt. Fig. 4 shows the predictions in the lookback window with different eigenfunctions. From the top to the bottom, the plot shows the target, the reconstruction from KNF with only the first eigenfunction in the lookback window and the reconstruction only using the second eigenfunction. We observe that the some eigenfunctions captures the trend while some other functions focus on the seasonality. To show that our model can also correctly learn the time series governed by known equations. We run experiments with our model on a simple oscillator system, given in Appendix A.2. We try our models with different sets of polynomial functions and train/test them on different initial conditions. Fig. 5 shows that our model can make accurate predictions on the test set with different dictionaries.

5. CONCLUSION

In this paper, we propose a novel model based on the Koopman theory, Koopman Neural Forecaster (KNF), designed for accurate long-term forecasting for non-stationary time series with temporal distribution shifts. KNF leverages predefined measurement functions to capture the nonlinear and nonstationary characteristics that may pose a great difficulty for neural nets to learn. It employs both a global operator to learn shared characteristics, and a local operator to capture changing dynamics. We also use a judiciously-designed feedback loop to continuously update the learnt operators over time for rapidly varying behaviors. We demonstrate that KNF achieves the state-of-the-art performance on wide range of time series datasets that are particularly known to suffer from distribution shifts. KNF achieves SOTA prediction performance on M4, Cryptos and NBA player trajectory datasets and provides interpretable results. Future work includes forecasting higher-dimensional dynamics and investigating distributional shifts between training and test domains.

A APPENDIX A.1 ABLATION STUDIES ON PREDEFINED MEASUREMENT FUNCTIONS

Table 5 shows the results from the ablation study of measurement function on M4-Yearly data. G represents generic, purely data-driven. P is polynomial function, E is the exponential function, S is the sine function. The integers in the first row are the number of each type of function. Basically, we gradually add more and different types of predefined measurement functions to purely data-drive model and observe improvements brought by the polynomial, exponential function, sine functions. We consider the following simple nonlinear system with discrete spectrum. Any multiplication of [φ 1 , φ 2 , φ 3 ] is still an eigenfunction. We want to show our model with only trainable global operator can correctly identify the Koopman invariant subspace from the synthetic data generated based on this system. We generate 36 time series with µ = -0.1 and λ = -1 and the initial values are uniformly sampled from [-1, 1] 2 . We tried four different basis dictionaries, including D 1 = {x 1 , x 2 , x 2 1 }, D 2 = {x 1 , x 2 , x 2 1 , x 2 2 }, D 3 = {x 1 , x 2 , x 2 1 , x 2 2 , x 3 1 , x 3 2 }, D 4 = {x 1 , x 2 , x 2 1 , x 2 2 , x 3 1 , x 3 2 , x 4 1 , x 4 2 } and we trained and tested the models on different initial conditions. The left figure in Fig. 5 shows eigenvalues learnt by KNF and some true eigenvalues. Since the learned Koopman matrix may not be unique given the flexibility in the coefficients of measurement functions, and the compositions of eigenfunctions are also eigenfunctions, there are many possible eigenvalues. But we can still see that most of the learned eigenvalues match the true eigenvalues. The right three figures in Fig. 5 shows that KNF can make accurate predictions with different dictionaries.

A.3 ADDITIONAL EXPERIMENTAL DETAILS

Table 6 shows the hyperparameter tuning ranges, including the learning rate, the hidden dimension and number of layers of deep neural network modules in both baselines and our model, number of predictions made at each autoregressive step, the length of input observations, the forecasting window size during training, and whether to use ReVIN. For different modules in a model, we tune hyperparameters, such as the number of layers/hidden dimensions separately. For instance, in FedFormer, the encoder and decoder may have different numbers of layers. As for the nonlinearity, we use ReLU for all layers.



https://www.kaggle.com/competitions/g-research-crypto-forecasting/data https://github.com/linouk23/NBA-Player-Movements sMAPE is the mean absolute error scaled by the magnitude of the predictions and target.



Figure 1: Architecture of the proposed KNF model. It encodes every multiple steps of observations into a measurement vector G(V t ) = [g 1 (v

Figure 2: Left: Example forecasts on Crypto dataset. Right: Example forecasts on NBA player trajectory dataset. We observe that KNF can be superior in capturing complex non-stationary patterns.

Figure 3: Predictions on M4-weekly data from original KNF and with its version without the feedback loop. We observe that the feedback loop can help providing robustness against temporal distributional shifts and maintain the long-term prediction accuracy.

ẋ2 = λ(x 2 -x 2 1 )This system has a minimal Koopman invariant subspace spanned by {x 1 , x 2 , x be spanned by three eigenfunctions [φ 1 , φ 2 , φ 3 ] = [x 1

The forecastability, Lyapunov exponents, trend and seasonality of the datasets that we used for our experiments, compared with a commonly-used Electricity benchmark. The boldface datasets are what we use for experiments in this paper.

sMAPE for KNF and baselines on M4 datasets. The numbers in parentheses are the number of prediction steps. KNF achieves the state-of-the-art prediction performance at Weekly, Daily and Monthly frequencies.

Prediction RMSE on Cryptos and Player Trajectories datasets.

Ablation study of KNF on M4-weekly data. It shows the importance of every component in the model architecture.

Ablation study of measurement function on M4-Yearly data. G represents generic, purely data-driven. P is polynomial function, E is the exponential function, S is the sine function. The integers in the first row are the number of each type of function.A.2 SIMPLE NONLINEAR SYSTEM WITH FINITE KOOPMAN SPACE

ACKNOWLEDGEMENT

We'd like to thank Tomas Pfister, Nate Yoder, and Jinsung Yoon for insightful discussions on this work. This work was supported in part by U.S. Department Of Energy, Office of Science, U. S. Army Research Office under Grant W911NF-20-1-0334, and NSF Grants #2134274, #2107256 and #2134178.

availability

//github.com/google-research/ google-research/tree/

