DEEPTIME: DEEP TIME-INDEX META-LEARNING FOR NON-STATIONARY TIME-SERIES FORECASTING

Abstract

Advances in I.T. infrastructure have led to the collection of ever longer time-series. Such sequences are typically non-stationary, exhibiting distribution shifts over time, a challenging scenario for forecasting due to the problems of covariate shift and conditional distribution shift. In this paper, we show that deep time-index models possess strong synergies with a meta-learning formulation of forecasting, displaying significant advantages over existing neural forecasting methods in tackling the problems arising from non-stationarity. These advantages include a stronger smoothness prior, avoidance of the covariate shift problem, and better sample efficiency. To this end, we propose DeepTime, a deep time-index model trained via meta-learning. Extensive experiments on real-world datasets in the long sequence time-series forecasting setting demonstrate that our approach achieves competitive results with state-of-the-art methods, and is highly efficient. Code is attached as supplementary material, and will be publicly released.

1. INTRODUCTION

Time-series forecasting has important applications across business and scientific domains, such as demand forecasting (Carbonneau et al., 2008), capacity planning and management (Kim, 2003), and anomaly detection (Laptev et al., 2017). With the advances of I.T. infrastructure, time-series are collected over longer durations and at higher sampling frequencies. This has led to time-series spanning tens of thousands to millions of time steps, on which we would like to perform forecasting. Such datasets face the unique challenge of non-stationarity, where long sequences undergo distribution shifts over time due to factors like concept drift. This has practical implications for forecasting models, which face a degradation in performance at test time (Kim et al., 2021) due to covariate shift and conditional distribution shift (see Appendix B for formal definitions).

Table 1: Time-index models are defined to be models whose predictions, ŷ_t, are purely functions of the current time-index features, τ_t, e.g. a relative time-index (1, 2, 3, ...) or datetime features (minute-of-hour, day-of-week, etc.). Historical-value models are models whose predictions of future time step(s), ŷ_{t+1}, are explicit functions of past observations, (y_t, y_{t-1}, ...), and optionally covariates, (z_{t+1}, z_t, z_{t-1}, ...), which can include exogenous time-series or even datetime features.

Time-index models: ŷ_t = f(τ_t). E.g.: DeepTime, Prophet, Gaussian process.
Historical-value models: ŷ_{t+1} = f(y_t, y_{t-1}, ..., z_{t+1}, z_t, ...). E.g.: N-HiTS, Autoformer, DeepAR.

In this work, we posit that deep time-index models exhibit strong synergies with a meta-learning formulation for tackling non-stationary forecasting, whereas existing neural forecasting methods, which are historical-value models, are unable to take full advantage of this formulation and remain susceptible to covariate shift. In the following, we discuss time-index models and their deep counterparts, highlighting how simple deep time-index models are unable to perform forecasting (i.e. extrapolate from historical training data), yet endowing them with a meta-learning formulation solves this problem. Thereafter, we demonstrate the advantages of deep time-index meta-learning for non-stationary forecasting and how it alleviates the issues faced by historical-value models, namely: (i) meta-learning is an effective solution for conditional distribution shift, (ii) time-index models avoid the problem of covariate shift, (iii) they have stronger sample efficiency, and (iv) they have a stronger smoothness prior.

Figure 2: (a) A naive deep time-index model. We visualize a reconstruction of the historical training data, as well as the forecasts. As can be seen, it overfits to the historical data and is unable to extrapolate. This model corresponds to (+Local) in Table 3 of our ablations. (b) DeepTime, our proposed approach, trained via a meta-learning formulation, successfully learns the appropriate function representation and is able to extrapolate. Visualized here is the last variable of the ETTm2 dataset.
Deep Time-index Models On the one hand, classical time-index methods (Taylor & Letham, 2018; Corani et al., 2021; Ord et al., 2017) rely on predefined parametric representation functions, y_t = f(τ_t) + ε_t, where ε_t represents idiosyncratic changes not accounted for by the model, and f could be a polynomial function to represent trend, a Fourier series to represent seasonality, or a composition of seasonal-trend components. While these functions are simple and easy to learn, choosing the representation function requires strong domain expertise or computationally heavy cross-validation. Furthermore, predefining the representation function is a strong assumption and may fail under distribution shifts. On the other hand, while deep time-index models (letting f be a deep neural network) present a deceptively clear path to approximating the representation function in a data-driven manner, they are too expressive. Trained via straightforward supervised learning on historical values without any inductive bias, they are unable to extrapolate to future time steps (visualized in Figure 2); a meta-learning formulation is required to do so, and this formulation has the added benefit of handling non-stationary forecasting.

Advantages of Deep Time-index Meta-learning Firstly, distribution shift in input statistics sharply degrades the prediction accuracy of deep learning models (Nado et al., 2020). Historical-value models, which take past observations as input, suffer from this as an effect of covariate shift. Time-index models easily sidestep this problem since they take time-index features as input. Next, meta-learning is an effective solution to the problem of conditional distribution shift: nearby time steps are assumed to follow a locally stationary distribution (Dahlhaus, 2012; Vogt, 2012) (see Figure 1), which is considered to be a task.
The base learner adapts to this locally stationary distribution, while the meta learner generalizes across task distributions. In principle, historical-value models are able to take advantage of the meta-learning formulation; however, they still suffer from covariate shift and from sample efficiency issues. Time-index models achieve greater sample efficiency in the meta-learning formulation. Like many existing state-of-the-art forecasting approaches, time-index models are direct multi-step (DMS) approaches. For a lookback window of length L and a forecast horizon of length H, a historical-value DMS model requires N + L + H − 1 time steps to construct a support set of size N, whereas a time-index model only requires N time steps. Not only does this marked increase in sample efficiency mean that time-index models can achieve an improved task generalization error bound (Appendix Q), they are also better able to adhere to the assumption of a locally stationary distribution, since using more time steps increases the risk of a non-stationary support set. Finally, time-index models have a stronger smoothness prior (Bengio et al., 2013), i.e. t ≈ t′ ⟹ τ_t ≈ τ_{t′} ⟹ f(τ_t) ≈ f(τ_{t′}), whereas the complicated parameterization of historical-value models provides no such inductive bias.

To this end, we propose DeepTime, a deep time-index model endowed with a meta-learning formulation. We leverage implicit neural representations (INRs) (Sitzmann et al., 2020b) as our choice of deep time-index model, and introduce a novel concatenated Fourier features layer to efficiently learn high frequency patterns. The meta-learning formulation is instantiated with a closed-form ridge regressor (Bertinetto et al., 2019) for efficiency. DeepTime overcomes the limitations of a naive deep time-index model by learning the appropriate inductive biases for extrapolation over the forecast horizon.
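The sample-efficiency comparison above (N + L + H − 1 versus N time steps to build a support set of size N) can be made concrete with a toy calculation; this is our own sketch, and the function names are not from the paper:

```python
def steps_needed_historical(N, L, H):
    """A historical-value DMS model consumes a full (lookback, horizon)
    window of L + H time steps per support sample, so N overlapping
    samples span N + L + H - 1 time steps."""
    return N + L + H - 1

def steps_needed_time_index(N):
    """A time-index model needs only the N observed (tau, y) pairs."""
    return N

# E.g. with a 96-step lookback and 24-step horizon:
print(steps_needed_historical(N=50, L=96, H=24))  # 169
print(steps_needed_time_index(N=50))              # 50
```

The fewer time steps a support set spans, the more plausible the locally stationary assumption becomes for that set.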
It is also more effective than existing historical-value methods at non-stationary time-series forecasting, by learning a global meta model shared across tasks and performing adaptation on a locally stationary task, while sidestepping the covariate shift problem. To summarize, the key contributions of our work are as follows:

• We introduce a novel forecasting-as-meta-learning framework for deep time-index models, enabling them to learn the appropriate representation function and tackle the problem of non-stationary forecasting. This is distinct from existing work which leverages meta-learning on historical-value models for adapting to new time-series datasets, where tasks are defined to be entire time-series (Grazzi et al., 2021).
• We propose DeepTime, leveraging an INR with concatenated Fourier features and a closed-form ridge regressor to achieve a highly efficient forecasting model.
• We conduct extensive experiments on the long sequence time-series forecasting (LSTF) datasets, demonstrating DeepTime to be extremely competitive. We perform ablation studies to better understand the contribution of each component of DeepTime, and finally show that it is highly efficient in terms of runtime and memory.

2. DEEPTIME

Problem Formulation In time-series forecasting, we consider a time-series dataset (y_1, y_2, ..., y_T), where y_t ∈ R^m is the m-dimensional observation at time t. Given a lookback window Y_{t−L:t} = [y_{t−L}; ...; y_{t−1}]^T ∈ R^{L×m} of length L, the goal of forecasting is to construct a point forecast over a horizon of length H, Y_{t:t+H} = [y_t; ...; y_{t+H−1}]^T ∈ R^{H×m}. We do so by learning a time-index model, f: R → R^m, f: τ_t ↦ ŷ_t, where τ_t is a time-index feature, which quickly adapts to the observations in the lookback window, (τ_{t−L:t}, Y_{t−L:t}), by minimizing a reconstruction loss L: R^m × R^m → R. We can then query it over the forecast horizon to obtain forecasts, Ŷ_{t:t+H} = f(τ_{t:t+H}). In the following, we first describe our forecasting-as-meta-learning framework for time-index models. We emphasize that this formulation falls within the standard time-series forecasting problem and requires no extra information. Next, we elaborate on our proposed model architecture, and how it uses a differentiable closed-form ridge regression module to efficiently tackle the forecasting-as-meta-learning problem. Pseudocode for DeepTime is available in Appendix E.

2.1. FORECASTING AS META-LEARNING

In time-index meta-learning, each lookback window and forecast horizon pair, (Y_{t−L:t}, Y_{t:t+H}), is a task. Each task yields a single support set and query set, which are the lookback window and forecast horizon respectively. Each time-index and time-series value pair, (τ_{t+i}, y_{t+i}), is an input-output sample, i.e. D_S = {(τ_{t−L}, y_{t−L}), ..., (τ_{t−1}, y_{t−1})} and D_Q = {(τ_t, y_t), ..., (τ_{t+H−1}, y_{t+H−1})}, where τ_{t+i} = (i + L)/(L + H − 1) is a [0, 1]-normalized time-index. The time-index model, f, is parameterized by ϕ and θ, the meta and base parameters respectively, and the bi-level optimization problem can be formalized as:

ϕ* = argmin_ϕ Σ_{t=L+1}^{T−H+1} Σ_{j=0}^{H−1} L(f(τ_{t+j}; θ*_t, ϕ), y_{t+j})    (1)

s.t. θ*_t = argmin_θ Σ_{j=−L}^{−1} L(f(τ_{t+j}; θ, ϕ), y_{t+j})    (2)

Here, the outer summation in Equation (1) over index t represents each lookback-horizon window, corresponding to a task in meta-learning, and the inner summation over index j represents each sample in the query set, or equivalently, each time step in the forecast horizon. The summation in Equation (2) over index j represents each sample in the support set, or each time step in the lookback window. This is illustrated in Figure 3a.

Figure 3: (a) A time-series dataset can be split into M tasks as given in the problem formulation. For a given task, the lookback window represents the support set, and the forecast horizon represents the query set. g_ϕ represents the meta model associated with the meta parameters; ϕ is shared between the lookback window and forecast horizon. Inputs to g_ϕ are not normalized due to notation constraints in this illustration. The ridge regressor performs the inner loop optimization, while the outer loop optimization is performed over samples from the horizon. (b) DeepTime has a simple overall architecture, comprising a random Fourier features layer, an MLP, and a ridge regressor.
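To make the task construction concrete, the following sketch builds one support/query pair from a series; this is our own minimal illustration, and the names and shapes are assumptions rather than the paper's code:

```python
import numpy as np

def make_task(y, t, L, H):
    """Build one meta-learning task: the lookback window [t-L, t) is the
    support set D_S and the horizon [t, t+H) is the query set D_Q, with
    time-indices normalized via tau_{t+i} = (i + L) / (L + H - 1)."""
    i = np.arange(-L, H)                 # relative offsets i = -L, ..., H-1
    tau = (i + L) / (L + H - 1)          # [0, 1]-normalized time-index
    support = (tau[:L], y[t - L:t])      # D_S: (tau, y) pairs of the lookback
    query = (tau[L:], y[t:t + H])        # D_Q: (tau, y) pairs of the horizon
    return support, query

y = np.sin(np.arange(200) / 10.0)        # toy univariate series
(tau_s, y_s), (tau_q, y_q) = make_task(y, t=100, L=96, H=24)
# tau_s starts at 0 (time step t-L) and tau_q ends at 1 (time step t+H-1)
```

Sliding t over the series yields the M tasks used in the outer summation of Equation (1).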
To understand how our meta-learning formulation helps to learn an appropriate function representation from data, we examine how the meta-learning process restricts the hypothesis class of the model f. The original hypothesis class of our model, or function representation, H_INR = {f(τ; θ, ϕ) | θ ∈ Θ, ϕ ∈ Φ}, is too large and provides no guarantee that training on the lookback window leads to good extrapolation. The meta-learning formulation allows DeepTime to restrict the hypothesis class of the representation function, from the space of all K-layered INRs to the space of K-layered INRs conditioned on the optimal meta parameters, H_DeepTime = {f(τ; θ, ϕ*) | θ ∈ Θ}, where the optimal meta parameters, ϕ*, are the minimizer of the forecasting loss (as specified in Equation (1)). Given this hypothesis class, local adaptation is performed over H_DeepTime given the lookback window, which is assumed to come from a locally stationary distribution, resolving the issue of non-stationarity.

2.2. MODEL ARCHITECTURE

Implicit Neural Representations The class of deep models which map coordinates to the value at that coordinate using a stack of multi-layer perceptrons (MLPs) is known as INRs (Sitzmann et al., 2020b; Tancik et al., 2020). We make use of them as they are a natural fit for time-index models, mapping a time-index to the value of the time-series at that time-index. A K-layered, ReLU (Nair & Hinton, 2010) INR is a function f_θ: R^c → R^m of the following form:

z^(0) = τ
z^(k+1) = max(0, W^(k) z^(k) + b^(k)), k = 0, ..., K − 1
f_θ(τ) = W^(K) z^(K) + b^(K)    (3)

where τ ∈ R^c is the time-index. Note that c = 1 for our proposed approach as specified in Section 2.1, but we use the notation τ ∈ R^c to allow for generalization to cases where datetime features are included. Tancik et al. (2020) introduced a random Fourier features layer which allows INRs to fit high frequency functions, by modifying the first layer to z^(0) = γ(τ) = [sin(2πBτ), cos(2πBτ)]^T, where each entry in B ∈ R^{d/2×c} is sampled from N(0, σ²), d is the hidden dimension size of the INR, and σ² is the scale hyperparameter. [•, •] is a row-wise stacking operation.

Concatenated Fourier Features While the random Fourier features layer endows INRs with the ability to learn high frequency patterns, one major drawback is the need to perform a hyperparameter sweep for each task and dataset to avoid over- or underfitting. We overcome this limitation with a simple scheme of concatenating multiple Fourier basis functions with diverse scale parameters, i.e. γ(τ) = [sin(2πB_1 τ), cos(2πB_1 τ), ..., sin(2πB_S τ), cos(2πB_S τ)]^T, where the elements of B_s ∈ R^{d/2×c} are sampled from N(0, σ_s²), and W^(0) ∈ R^{d×Sd}. We perform an analysis in Section 3.3 and show that the performance of our proposed Concatenated Fourier Features (CFF) does not significantly deviate from the setting with the optimal scale parameter obtained from a hyperparameter sweep.
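A minimal NumPy sketch of the CFF layer follows; the scale set and dimensions below are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

d, c = 64, 1                                    # hidden size and time-index dim
scales = [0.01, 0.1, 1.0, 10.0]                 # assumed sigma_s values, S = 4
# One B_s with entries ~ N(0, sigma_s^2) per scale, sampled once at init.
Bs = [rng.normal(0.0, s, size=(d // 2, c)) for s in scales]

def gamma(tau):
    """gamma(tau) = [sin(2 pi B_1 tau), cos(2 pi B_1 tau), ...,
    sin(2 pi B_S tau), cos(2 pi B_S tau)], concatenated feature-wise."""
    feats = []
    for B in Bs:
        proj = 2.0 * np.pi * tau @ B.T          # (n, d/2)
        feats += [np.sin(proj), np.cos(proj)]
    return np.concatenate(feats, axis=-1)       # (n, S*d)

tau = np.linspace(0.0, 1.0, 50).reshape(-1, 1)  # (n, c) time-indices
z0 = gamma(tau)                                 # first-layer features, (50, 256)
# The first weight matrix W^(0) then maps the S*d = 256 features back to d.
```

Because low and high scales are present simultaneously, the subsequent MLP can weight whichever frequency bands the data requires, removing the per-dataset sweep over σ.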
Differentiable Closed-form Solvers One key aspect of tackling forecasting as a meta-learning problem is efficiency. Optimization-based meta-learning approaches originally perform an expensive bi-level optimization procedure on the entire neural network model by backpropagating through inner gradient steps (Ravi & Larochelle, 2017; Finn et al., 2017). Since each forecast is now treated as an inner loop optimization problem, it needs to be sufficiently fast to be competitive with competing methods. We achieve this by restricting the inner loop optimization to the last layer of the INR. As a result, we can perform the inner loop optimization on this linear layer using the closed-form solution of a ridge regressor for the case of mean squared error loss. We note that our formulation is general, and any differentiable solver can be used instead (Bertinetto et al., 2019). This means that for a K-layered model, ϕ = {W^(0), b^(0), ..., W^(K−1), b^(K−1), λ} are the meta parameters and θ = {W^(K)} are the base parameters, following the notation of Equation (3). Let g_ϕ: R → R^d be the meta learner, where g_ϕ(τ) = z^(K). For task t with corresponding lookback-horizon pair (Y_{t−L:t}, Y_{t:t+H}), the support set features obtained from the meta learner are denoted Z_{t−L:t} = [g_ϕ(τ_{t−L}); ...; g_ϕ(τ_{t−1})]^T ∈ R^{L×d}, where [•; •] is a column-wise concatenation operation. The inner loop thus solves the optimization problem:

W^(K)*_t = argmin_W ||Z_{t−L:t} W − Y_{t−L:t}||² + λ||W||² = (Z_{t−L:t}^T Z_{t−L:t} + λI)^{−1} Z_{t−L:t}^T Y_{t−L:t}    (4)

Now, let Z_{t:t+H} = [g_ϕ(τ_t); ...; g_ϕ(τ_{t+H−1})]^T ∈ R^{H×d} be the query set features. Then, our predictions are Ŷ_{t:t+H} = Z_{t:t+H} W^(K)*_t. This closed-form solution is differentiable, which enables gradient updates on the parameters of the meta learner, ϕ. A bias term can be included in the closed-form ridge regressor by appending a scalar 1 to the feature vector g_ϕ(τ).
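Equation (4) and the query-time prediction can be sketched in a few lines; random features stand in for the trained meta learner g_ϕ, and the shapes and λ are illustrative assumptions:

```python
import numpy as np

def ridge_solve(Z, Y, lam):
    """Closed-form inner loop of Equation (4):
    W* = (Z^T Z + lam I)^{-1} Z^T Y."""
    d = Z.shape[1]
    return np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

rng = np.random.default_rng(0)
L_len, H, d, m, lam = 96, 24, 32, 1, 0.1
Z_support = rng.normal(size=(L_len, d))   # g_phi(tau) over the lookback window
Y_support = rng.normal(size=(L_len, m))   # observed values in the lookback

W = ridge_solve(Z_support, Y_support, lam)

Z_query = rng.normal(size=(H, d))         # g_phi(tau) over the forecast horizon
Y_pred = Z_query @ W                      # forecasts, shape (H, m)
```

In DeepTime the same computation would be expressed with differentiable tensor operations so gradients flow through the solve into ϕ; `np.linalg.solve` here is just for the sketch.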
The end result of training DeepTime on a dataset is the restricted hypothesis class H_DeepTime = {τ ↦ g_{ϕ*}(τ)^T W^(K) | W^(K) ∈ R^{d×m}}. This is illustrated in Figure 3b. Some confusion regarding DeepTime's categorization as a time-index model may arise from the above simplified equation for predictions, since forecasts are now a function of the lookback window due to the closed-form solution of W^(K)*_t. However, we highlight that DeepTime is a meta-learning algorithm on top of a deep time-index model: it comprises a learning algorithm, A: H × R^{L×m} → H, specified in Equation (2) (the inner loop optimization step), and the deep time-index model itself, f ∈ H_DeepTime. Thus, forecasts are of the form ŷ_{t+h} = A(f, Y_{t−L:t})(τ_{t+h}), and as can be seen, while the inner loop optimization step is a function of past observations, the adapted time-index model it yields is purely a function of time-index features. Further discussion can be found in Appendix D.

3. EXPERIMENTS

We evaluate DeepTime on both synthetic datasets, and a variety of real-world data. We ask the following questions: (i) Is DeepTime, trained on a family of functions following the same parametric form, able to perform extrapolation on unseen functions? (ii) How does DeepTime compare to other forecasting models on real-world data? (iii) What are the key contributing factors to the good performance of DeepTime?

3.1. EXPERIMENTS ON SYNTHETIC DATA

We first consider DeepTime's ability to extrapolate on functions specified by a parametric form: (i) the family of linear functions, y = ax + b, (ii) the family of cubic functions, y = ax³ + bx² + cx + d, and (iii) sums of sinusoids, Σ_j A_j sin(ω_j x + φ_j). Parameters of the functions (Rasmussen, 2003). For the univariate setting, we include additional univariate forecasting models: N-BEATS (Oreshkin et al., 2020), DeepAR (Salinas et al., 2020), Prophet (Taylor & Letham, 2018), and ARIMA. Baseline results are obtained from the respective papers. Table 2 and Table 9 (in Appendix J, for space) summarize the multivariate and univariate forecasting results respectively. DeepTime achieves state-of-the-art performance on 20 out of 24 settings in MSE, and 17 out of 24 settings in MAE on the multivariate benchmark, and also achieves competitive results on the univariate benchmark despite its simple architecture compared to the baselines, which comprise complex fully connected architectures and computationally intensive Transformer architectures.

3.3. ABLATION STUDIES

We perform an ablation study to understand how various training schemes and input features affect the performance of DeepTime. Table 3 presents these results. First, we observe that our meta-learning formulation is a critical component of the success of DeepTime. We note that DeepTime without meta-learning may not be a meaningful baseline, since the model outputs are always the same regardless of the input lookback window. Including datetime features helps alleviate this issue, yet we observe that the inclusion of datetime features generally leads to a degradation in performance. In the case of DeepTime, we observed that the inclusion of datetime features leads to a much lower training loss but a degradation in test performance: this is a case of meta-learning memorization (Yin et al., 2020) due to the tasks becoming non-mutually exclusive (Rajendran et al., 2020). Finally, we observe that the meta-learning formulation is indeed superior to training a model from scratch for each lookback window.

In Table 4 we perform an ablation study on various backbone architectures, while retaining the differentiable closed-form ridge regressor. We observe a degradation when the random Fourier features layer is removed, due to the spectral bias problem which neural networks face (Rahaman et al., 2019; Tancik et al., 2020). DeepTime outperforms the SIREN variant of INRs, which is consistent with observations in the INR literature. Finally, DeepTime outperforms the RNN variant, which is the model proposed in Grazzi et al. (2021). This is a direct comparison between IMS historical-value models and time-index models, and highlights the benefits of time-index models.

Table 5: Comparison of CFF against the optimal and pessimal scales as obtained from the hyperparameter sweep.
We also calculate the change in performance between CFF and the optimal and pessimal scales, where a positive percentage means CFF underperforms and a negative percentage means CFF outperforms, calculated as % change = (MSE_CFF − MSE_scale)/MSE_scale.

Lastly, we perform a comparison between the optimal and pessimal scale hyperparameters for the vanilla random Fourier features layer, against our proposed CFF. We first report the results for each scale hyperparameter of the vanilla random Fourier features layer in Table 13, Appendix N. As with the other ablation studies, the results reported in Table 13 are based on performing a hyperparameter sweep across the lookback length multiplier, selecting the optimal setting based on the validation set, and reporting the test set results. The optimal and pessimal scales are then simply the best and worst results from Table 13. Table 5 shows that CFF achieves extremely low deviation from the optimal scale across all settings, yet retains the upside of avoiding this expensive hyperparameter tuning phase. We also observe that tuning the scale hyperparameter is extremely important, as CFF obtains up to a 23.22% improvement in MSE over the pessimal scale hyperparameter.

3.4. COMPUTATIONAL EFFICIENCY

Finally, we analyse DeepTime's efficiency in both runtime and memory usage, with respect to both lookback window and forecast horizon lengths. The main computational bottleneck in DeepTime is the matrix inversion in the ridge regressor, canonically of O(n³) complexity. This is a major concern for DeepTime, as n is linked to the length of the lookback window. As mentioned in Bertinetto et al. (2019), the Woodbury matrix identity relates the two equivalent solutions, W* = (Z^T Z + λI)^{−1} Z^T Y and W* = Z^T (ZZ^T + λI)^{−1} Y, so the smaller of the two Gram matrices can always be inverted; inverting the d×d matrix yields an O(d³) complexity, where d is the hidden size hyperparameter, fixed to some value (see Appendix I). Figure 5 demonstrates that DeepTime is highly efficient, even when compared to efficient Transformer models recently proposed for the long sequence time-series forecasting task, as well as fully connected models.
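The equivalence underlying this trick is easy to verify numerically; this is our own check with arbitrary shapes:

```python
import numpy as np

rng = np.random.default_rng(0)
L_len, d, m, lam = 512, 32, 7, 0.5
Z = rng.normal(size=(L_len, d))           # support features, (L, d)
Y = rng.normal(size=(L_len, m))           # support targets, (L, m)

# Form that inverts a (d x d) matrix -> O(d^3) per solve.
W_small = np.linalg.solve(Z.T @ Z + lam * np.eye(d), Z.T @ Y)

# Form that inverts an (L x L) matrix -> O(L^3) per solve.
W_large = Z.T @ np.linalg.solve(Z @ Z.T + lam * np.eye(L_len), Y)

# Both yield the same ridge solution; pick whichever matrix is smaller.
assert np.allclose(W_small, W_large)
```

Since d is a fixed hyperparameter while the lookback length grows with the forecasting setting, solving in the d×d form keeps the inner loop cost independent of L.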

4. RELATED WORK

Neural Forecasting Neural forecasting (Benidis et al., 2020) methods have seen great success in recent times. One related line of research comprises Transformer-based methods for LSTF (Li et al., 2019; Zhou et al., 2021; Xu et al., 2021; Woo et al., 2022; Zhou et al., 2022), which aim not only to achieve high accuracy but also to overcome the vanilla attention's quadratic complexity. Fully connected methods (Oreshkin et al., 2020; Challu et al., 2022) have also shown success, with Challu et al. (2022) introducing hierarchical interpolation and multi-rate data sampling for the LSTF task. Meta-learning with a differentiable closed-form solver has been explored in time-series forecasting (Grazzi et al., 2021), but for the meta-forecasting setting, which adapts to new time-series datasets rather than tackling non-stationarity, using an IMS historical-value backbone model.

Time-index Models Time-index models take as input time-index features such as datetime features to predict the value of the time-series at that time step. They have been well explored as a special case of regression analysis (Hyndman & Athanasopoulos, 2018; Ord et al., 2017), and many different predictors have been proposed for the classical setting, including linear, polynomial, and piecewise linear trends, and dummy variables indicating holidays. Of note, Fourier terms have been used to model periodicity, or seasonal patterns; this is also known as harmonic regression (Young et al., 1999). Prophet (Taylor & Letham, 2018) is a popular classical approach which uses a structural time-series formulation specialized for business forecasting. Another classical approach of note is Gaussian Processes (Rasmussen, 2003; Corani et al., 2021), which are non-parametric models often requiring complex kernel engineering. Godfrey & Gashler (2017) introduced an initial attempt at using time-index based neural networks to fit a time-series for forecasting.
Yet, their work is more reminiscent of classical methods, manually specifying periodic and non-periodic activation functions, analogous to the representation functions.

Implicit Neural Representations INRs have recently gained popularity in the area of neural rendering (Tewari et al., 2021). They parameterize a signal as a continuous function, mapping a coordinate to the value at that coordinate. A key finding was that positional encodings (Mildenhall et al., 2020; Tancik et al., 2020) are critical for ReLU MLPs to learn high frequency details, while another line of work introduced periodic activations (Sitzmann et al., 2020b). Meta-learning via INRs has been explored for various data modalities, typically over images or for neural rendering tasks (Sitzmann et al., 2020a; Tancik et al., 2021; Dupont et al., 2021), using both hypernetworks and optimization-based approaches. Yüce et al. (2021) show that meta-learning on INRs is analogous to dictionary learning. In time-series, Jeong & Shin (2022) explored using INRs for anomaly detection, opting to make use of periodic activations and temporal positional encodings.

5. DISCUSSION

In this paper, we proposed DeepTime, a deep time-index based model trained via a meta-learning formulation to automatically learn a representation function from time-series data, rather than manually defining the representation function as in classical methods. The meta-learning formulation further enables DeepTime to be applied to non-stationary time-series by adapting to the locally stationary distribution. Importantly, we use a closed-form ridge regressor in the meta-learning formulation to ensure that predictions are computationally efficient. Our extensive empirical analysis shows that DeepTime, while being a much simpler model architecture than prevailing state-of-the-art methods, achieves competitive performance across forecasting benchmarks on real-world datasets. We perform substantial ablation studies to identify the key components contributing to the success of DeepTime, and also show that it is highly efficient.

Limitations & Future Work Despite having verified DeepTime's effectiveness, we expect some under-performance in cases where the lookback window contains significant anomalies, or an abrupt change point which violates the locally stationary assumption. Next, while out of scope for our current work, a limitation DeepTime faces is that it does not consider holidays and events. We leave the consideration of such features as a potential future direction, along with the incorporation of exogenous covariates and datetime features, whilst avoiding the meta-learning memorization problem. Finally, time-index models are a natural fit for missing value imputation, as well as other time-series intelligence tasks for irregular time-series; this is another interesting future direction for extending deep time-index models.

Haoyi Zhou, Shanghang Zhang, Jieqi Peng, Shuai Zhang, Jianxin Li, Hui Xiong, and Wancai Zhang. Informer: Beyond efficient transformer for long sequence time-series forecasting. In Proceedings of AAAI, 2021.

Tian Zhou, Ziqing Ma, Qingsong Wen, Xue Wang, Liang Sun, and Rong Jin. FEDformer: Frequency enhanced decomposed transformer for long-term series forecasting. arXiv preprint arXiv:2201.12740, 2022.

A EXTENDED RELATED WORK

Non-stationarity and Distribution Shift Switching state-space models (Ghahramani & Hinton, 2000) generalize and combine hidden Markov models and state-space models, where the dynamics in each regime are typically represented by a linear model (a linear dynamical system), and switches between regimes are controlled by the hidden transition probabilities of a Markov process. They can be applied to non-stationary time-series, but make the additional assumption that the time-series has a predefined number of regimes or generating distributions. Sequential Neural Processes (SNPs) (Singh et al., 2019) incorporate a temporal state-transition model of stochastic processes, extending the Neural Process framework to dynamic stochastic processes. Importantly, SNPs use a "black-box meta-learning" approach, while DeepTime uses an "optimization-based meta-learning" approach. Further differences include that the standard SNP setting requires knowledge of task boundaries, and multiple support/query sets per task. Du et al. (2021) tackled temporal covariate shift by distribution matching, an approach popularly used in domain adaptation. They introduce a Temporal Distribution Characterization module which divides a given time-series into regions with different distributions, and a Temporal Distribution Matching module which reduces distribution mismatch in the time-series. Their approach is built on top of an RNN architecture. Kim et al. (2021) introduced a learnable instance normalization method to tackle the covariate shift problem. Their approach is ad-hoc and can be attached to any existing architecture. Most relevant to our work are Non-stationary Transformers (Liu et al., 2022), which introduced an instance normalization method for the LSTF task. Rather than proposing a generic module, their approach is specialized for Transformer-based architectures and performs normalization at each layer to tackle the non-stationarity of intermediate representations, rather than just the inputs and outputs.

B NON-STATIONARITY AND DISTRIBUTION SHIFT

In our work, we tackle the problem of non-stationarity in time-series data, which has been well explored in the context of classical time-series analysis. We map this problem to the modern setting of deep learning for time-series forecasting. As mentioned in Section 1, long time-series datasets collected thanks to advances in I.T. infrastructure are plagued by the problem of non-stationarity. In particular, we tackle the problems of covariate shift and conditional distribution shift which arise from it.

Definition 1. (Covariate Shift) Given a stochastic process {Y_t}_{t=1}^T, let p(y_t, y_{t−1}, ..., y_{t−L+1}) be the unconditional joint distribution of a length-L segment. The stochastic process is said to experience covariate shift if any two segments are drawn from different distributions, i.e. p(y_t, y_{t−1}, ..., y_{t−L+1}) ≠ p(y_{t′}, y_{t′−1}, ..., y_{t′−L+1}), ∀ t ≠ t′.

Definition 2. (Conditional Distribution Shift) Given a stochastic process {Y_t}_{t=1}^T, let p(y_{t+1}|y_t, y_{t−1}, ..., y_{t−L+1}) be the conditional distribution of Y_{t+1} given a length-L segment of previous time steps. The stochastic process is said to experience conditional distribution shift if any two segments have different conditional distributions, i.e. p(y_{t+1}|y_t, y_{t−1}, ..., y_{t−L+1}) ≠ p(y_{t′+1}|y_{t′}, y_{t′−1}, ..., y_{t′−L+1}), ∀ t ≠ t′.

C CATEGORIZATION OF FORECASTING METHODS

Multi-step Forecasts Forecasting over a horizon (multiple time steps) can be achieved via two strategies, direct multi-step or iterative multi-step (Marcellino et al., 2006; Chevillon, 2007; Taieb et al., 2012), or even a mixture of both, though this has been less explored:

• Direct Multi-step (DMS): A DMS forecaster directly predicts forecasts for the entire horizon. For example, to achieve a multi-step forecast of H time steps, a DMS forecaster simply outputs H values in a single forward pass.
• Iterative Multi-step (IMS): An IMS forecaster iteratively predicts one step ahead, and consumes this forecast to make a subsequent prediction. This is performed iteratively, until the desired length is achieved.
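The two strategies can be contrasted with a minimal sketch. The one-step and multi-output predictors here are toy stand-ins (a "persistence" forecaster that repeats the last value), not the models used in the paper:

```python
def dms_forecast(predict_h, lookback, H):
    """Direct multi-step: a single call outputs all H future values."""
    return predict_h(lookback)  # returns a length-H sequence

def ims_forecast(predict_1, lookback, H):
    """Iterative multi-step: predict one step, feed it back, repeat."""
    window = list(lookback)
    forecasts = []
    for _ in range(H):
        y_next = predict_1(window)
        forecasts.append(y_next)
        window = window[1:] + [y_next]  # slide the window forward
    return forecasts

# toy predictors: repeat the last observed value
ims = ims_forecast(lambda w: w[-1], [1.0, 2.0, 3.0], H=4)
dms = dms_forecast(lambda w: [w[-1]] * 4, [1.0, 2.0, 3.0], H=4)
```

Note how IMS consumes its own forecasts, so errors can compound across the horizon, whereas DMS commits to all H values at once.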

D FURTHER DISCUSSION ON DEEPTIME AS A TIME-INDEX MODEL

We first reiterate our definitions of time-index and historical-value models from Section 1. Time-index models are models whose predictions are purely functions of the current time-index features. To perform forecasting (i.e. make predictions over some forecast horizon), time-index models make the predictions $\hat{y}_{t+h} = f(\tau_{t+h})$ for $h = 0, \ldots, H-1$. Historical-value models predict the time-series value of future time step(s) as a function of past observations, and optionally, covariates.

Time-index models: $\hat{y}_t = f(\tau_t)$

Historical-value models: $\hat{y}_{t+1} = f(y_t, y_{t-1}, \ldots, z_{t+1}, z_t, z_{t-1}, \ldots)$

Next, we further discuss some subtleties of how time-index models interact with past observations. Astute readers may have noticed that DeepTime is a function of the past observations. In particular, Equations (3) and (4) indicate that forecasts from DeepTime are in fact linear in the lookback window. However, we highlight that this is not in contradiction with our definitions of time-index and historical-value models. Here, we differentiate between the model, $f$, and the learning algorithm, $\mathcal{A}$. The learning algorithm $\mathcal{A} : \mathcal{H} \times \mathbb{R}^{L \times m} \to \mathcal{H}$ takes as input a model from the hypothesis class $\mathcal{H}$ and the past observations, returning a model minimizing the loss function $\mathcal{L}$. A time-index model is thus still only a function of time-index features, while the learning algorithm is a function of past observations, i.e. $f, f_0 \in \mathcal{H}$, $f : \mathbb{R}^c \to \mathbb{R}^m$, $f = \mathcal{A}(f_0, Y_{t-L:t})$. DeepTime, as a forecaster, is a deep time-index model endowed with a meta-learning algorithm. In order to perform forecasting, it performs an inner loop optimization step defined by the learning algorithm, as highlighted in Equation (2). For the special case where we use the closed-form ridge regressor, the inner loop learning algorithm reduces to a form which is linear in the lookback window. Still, the deep time-index model itself is only a function of time-index features.
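The distinction between the model $f$ and the learning algorithm $\mathcal{A}$ can be made concrete with a small sketch. Here $\mathcal{A}$ is a ridge regression fit of a linear time-index model on the lookback window, a deliberately simplified stand-in for DeepTime's inner loop; the feature map and names are illustrative:

```python
import numpy as np

def fit_time_index_model(y_lookback, lam=1e-3):
    """Learning algorithm A: consumes past observations, returns a model f.

    The returned f is purely a function of the time index tau."""
    L = len(y_lookback)
    tau = np.arange(L, dtype=float) / L           # relative time-index features
    X = np.stack([tau, np.ones(L)], axis=1)       # [tau, bias]
    W = np.linalg.solve(X.T @ X + lam * np.eye(2),
                        X.T @ np.asarray(y_lookback, dtype=float))
    return lambda t: np.array([t / L, 1.0]) @ W   # f(tau): time index -> value

# forecasting: query f at future time indices t = L, ..., L + H - 1
f = fit_time_index_model([0.0, 1.0, 2.0, 3.0])
forecast = [f(t) for t in range(4, 8)]
```

The key point: `forecast` depends on the lookback window only through the fitting step; once fitted, `f` takes nothing but a time index.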

F SYNTHETIC DATA

The training set for each synthetic data experiment consists of 1000 functions/tasks, while the test set contains 100 functions/tasks. We ensure that there is no overlap between the train and test sets. Sums of sinusoids Sinusoids come from a fixed set of frequencies, generated by sampling $\omega \sim U(0, 12\pi)$. We fix the size of this set to be five, i.e. $\Omega = \{\omega_1, \ldots, \omega_5\}$. Each function is then a sum of $J$ sinusoids, where $J \in \{1, 2, 3, 4, 5\}$ is randomly assigned. The function is thus $y = \sum_{j=1}^{J} A_j \sin(\omega_{r_j} x + \phi_j)$ for $x \in [0, 1]$, where the amplitudes and phase shifts are freely chosen via $A_j \sim U(0.1, 5)$ and $\phi_j \sim U(0, \pi)$, while each frequency is determined by an index $r_j \in \{1, 2, 3, 4, 5\}$ that randomly selects a frequency from the set $\Omega$.
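The generation procedure above can be sketched as follows. The variable names are ours, and the number of points per task is an assumption (the linear and cubic function classes use 400 points):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_frequency_set(size=5):
    # fixed set of frequencies shared across tasks, omega ~ U(0, 12*pi)
    return rng.uniform(0.0, 12 * np.pi, size=size)

def sample_task(Omega, n_points=400):
    # each task is a sum of J sinusoids with freely chosen amplitude/phase,
    # but with frequencies drawn from the shared set Omega
    J = rng.integers(1, 6)              # J in {1, ..., 5}
    x = np.linspace(0.0, 1.0, n_points)
    y = np.zeros_like(x)
    for _ in range(J):
        A = rng.uniform(0.1, 5.0)       # amplitude A_j ~ U(0.1, 5)
        phi = rng.uniform(0.0, np.pi)   # phase phi_j ~ U(0, pi)
        omega = rng.choice(Omega)       # index r_j selects from Omega
        y += A * np.sin(omega * x + phi)
    return x, y

Omega = sample_frequency_set()
tasks = [sample_task(Omega) for _ in range(3)]
```

Sharing `Omega` across tasks is what makes the function class learnable: the meta-learner can amortize over the fixed frequency set while amplitudes and phases vary per task.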

G DATASETS

ETTfoot_1 (Zhou et al., 2021) - Electricity Transformer Temperature provides measurements from an electricity transformer, such as load and oil temperature. We use the ETTm2 subset, consisting of measurements at a 15-minute frequency. ECLfoot_2 - Electricity Consuming Load provides measurements of electricity consumption for 321 households from 2012 to 2014. The data was collected at the 15-minute level, but is aggregated hourly. Exchangefoot_3 (Lai et al., 2018) - a collection of daily exchange rates with USD of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016. Trafficfoot_4 - a dataset from the California Department of Transportation providing the hourly road occupancy rates from 862 sensors on San Francisco Bay Area freeways. Weatherfoot_5 - provides measurements of 21 meteorological indicators, such as air temperature, humidity, etc., every 10 minutes for the year 2020, from the weather station of the Max Planck Institute for Biogeochemistry in Jena, Germany. In Table 7, along with some dataset statistics, we report the results of both statistical tests and the number of dimensions which meet the criteria for non-stationarity (rejecting the null hypothesis for the Chow test, and not rejecting the null hypothesis for the ADF test) over various significance levels. We observe that the real-world datasets exhibit high levels of non-stationarity across dimensions based on both tests.

O ABLATION STUDIES DETAILS

In this section, we list more details on the models compared in the ablation studies section. Unless otherwise stated, we perform the same hyperparameter tuning for all models in the ablation studies, and use the same standard hyperparameters, such as number of layers, layer size, etc.

O.1 ABLATION STUDY ON VARIANTS OF DEEPTIME

RR Removing the ridge regressor module refers to replacing it with a simple linear layer, Linear : $\mathbb{R}^d \to \mathbb{R}^m$, where Linear$(x) = Wx + b$, $x \in \mathbb{R}^d$, $W \in \mathbb{R}^{m \times d}$, $b \in \mathbb{R}^m$. This corresponds to a straightforward INR, which is trained across all lookback-horizon pairs in the dataset.

Local For models marked "Local", we similarly remove the ridge regressor module and replace it with a linear layer. However, the model is not trained across all lookback-horizon pairs in the dataset. Instead, for each lookback-horizon pair in the validation/test set, we fit the model to the lookback window via gradient descent, and then perform prediction on the horizon to obtain the forecasts. A new model is trained from scratch for each lookback-horizon window. We tune one extra hyperparameter, the number of epochs of gradient descent, for which we search through {10, 20, 30, 40, 50}.

Datetime Features As each dataset comes with a timestamp for each observation, we are able to construct datetime features from these timestamps. We construct the following features: 1. Quarter-of-year

O.2 ABLATION STUDY ON BACKBONE MODELS

For all models in this section, we retain the differentiable closed-form ridge regressor, to isolate the effects of the backbone model used.

MLP The random Fourier features layer is a mapping from coordinate space to latent space, $\gamma : \mathbb{R}^c \to \mathbb{R}^d$. To remove the effects of the random Fourier features layer, we simply replace it with a linear map, Linear : $\mathbb{R}^c \to \mathbb{R}^d$.

SIREN We replace the random Fourier features backbone with the SIREN model introduced by Sitzmann et al. (2020b). In this model, periodic activation functions, i.e. $\sin(x)$, are used, along with a specified weight initialization scheme.

RNN We use a 2-layer LSTM with a hidden size of 256. Inputs are the observations, $y_t$, consumed in an IMS fashion, predicting the next time step, $y_{t+1}$.
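The random Fourier features layer $\gamma : \mathbb{R}^c \to \mathbb{R}^d$ referenced above can be sketched as follows. We assume the multi-scale variant described in Appendix H, where the total feature size $d$ is divided equally among the scales; the function names, the $2\pi$ factor, and the particular scale values are illustrative, not the paper's exact code:

```python
import numpy as np

def make_random_fourier_features(c, d, scales, rng):
    """Build gamma: R^c -> R^d by concatenating sin/cos features per scale.

    Each scale s gets an equal share of d; its projection matrix is
    B_s ~ N(0, s^2), so small scales give smooth features and large
    scales give high-frequency ones."""
    per_scale = d // (2 * len(scales))  # each B_s yields sin AND cos features
    Bs = [rng.normal(0.0, s, size=(c, per_scale)) for s in scales]

    def gamma(tau):
        feats = []
        for B in Bs:
            proj = 2 * np.pi * tau @ B
            feats.append(np.sin(proj))
            feats.append(np.cos(proj))
        return np.concatenate(feats, axis=-1)

    return gamma

rng = np.random.default_rng(0)
gamma = make_random_fourier_features(c=1, d=64, scales=[0.01, 0.1, 1.0, 10.0], rng=rng)
z = gamma(np.linspace(0, 1, 5)[:, None])  # 5 time indices -> feature matrix
```

Replacing `gamma` with a plain linear map is exactly the "MLP" ablation above.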

P COMPUTATIONAL EFFICIENCY EXPERIMENTS DETAILS

Trans/In/Auto/ETS-former We use a model with 2 encoder and 2 decoder layers with a hidden size of 512, as specified in their original papers.

N-BEATS

We use an N-BEATS model with 3 stacks and 3 layers (relatively small compared to the 30 stacks and 4 layers used in their original paperfoot_7), with a hidden size of 512. Note that N-BEATS is a univariate model, and the values presented here are multiplied by a factor of m to account for the multivariate data. Another dimension of comparison is the number of parameters used in the model. As demonstrated in Table 14, for fully connected models like N-BEATS, the number of parameters scales linearly with the lookback window and forecast horizon lengths, while for Transformer-based models and DeepTime, the number of parameters remains constant.
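To see why the parameter counts scale this way, consider just the first and last linear layers of a simplified univariate fully connected forecaster versus a time-index network (the hidden size and the two-layer simplification are illustrative, not the exact architectures benchmarked):

```python
def fc_forecaster_params(L, H, hidden=512):
    # first layer maps the lookback window (L values) to the hidden size,
    # last layer maps the hidden size to the horizon (H values):
    # both weight matrices grow with L and H
    return (L * hidden + hidden) + (hidden * H + H)

def time_index_params(hidden=512):
    # input is a single time-index feature and output a single value,
    # so neither layer's size depends on L or H
    return (1 * hidden + hidden) + (hidden * 1 + 1)
```

Doubling both window lengths roughly doubles the fully connected model's parameter count, while the time-index model's count is a constant: longer windows only mean more (time-index, value) pairs to fit, not more weights.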

Q GENERALIZATION BOUND FOR OUR META-LEARNING FRAMEWORK

In this section, we derive a meta-learning generalization bound for DeepTime under the PAC-Bayes framework (Shalev-Shwartz & Ben-David, 2014). Our formulation follows Amit & Meir (2018) and assumes that all tasks share the same hypothesis space $\mathcal{H}$, sample space $\mathcal{Z}$, and loss function $\ell : \mathcal{H} \times \mathcal{Z} \to [0, 1]$. We observe $n$ tasks in the form of sample sets $D_1, \ldots, D_n$. The number of samples in each task is $H + L$. Each dataset $D_k$ is assumed to be generated i.i.d. from an unknown sample distribution $\mathcal{D}_k^{H+L}$, and each task's sample distribution $\mathcal{D}_k$ is in turn generated i.i.d. from an unknown meta distribution (the environment), $\mathcal{E}$. In particular, we have $D_k = (z_{k-L}, \ldots, z_k, \ldots, z_{k+H-1})$, where $z_t = (\tau_t, y_t)$. Here, $\tau_t$ is the time-index, and $y_t$ is the time-series value. For any forecaster $h(\cdot)$ parameterized by $\theta$, we define the loss function $\ell(h_\theta, z_t)$. We also define $P$ as the prior distribution over $\mathcal{H}$, and $Q$ as the posterior over $\mathcal{H}$ for each task. In the meta-learning setting, we assume a hyper-prior $\mathcal{P}$, a prior distribution over priors; the meta-learner observes a sequence of training tasks, and then outputs a distribution over priors, called the hyper-posterior $\mathcal{Q}$.

Theorem Q.1. Consider the meta-learning framework above. Given the hyper-prior $\mathcal{P}$, for any hyper-posterior $\mathcal{Q}$, any $c_1, c_2 > 0$, and any $\delta \in (0, 1]$, with probability at least $1 - \delta$ we have

$$er(\mathcal{Q}) \le \frac{c_1 c_2}{(1-e^{-c_1})(1-e^{-c_2})} \cdot \frac{1}{n}\sum_{k=1}^{n} \hat{er}(\mathcal{Q}, D_k) + \frac{c_1}{1-e^{-c_1}} \cdot \frac{KL(\mathcal{Q}\|\mathcal{P}) + \log\frac{1}{\delta}}{n c_1} + \frac{c_1 c_2}{(1-e^{-c_1})(1-e^{-c_2})} \cdot \frac{KL(\rho\|\pi) + \log\frac{1}{\delta}}{n(H+L)c_2}. \quad (5)$$

Proof. Our proof contains two steps. First, we bound the error within each observed task due to observing a limited number of samples. Then, we bound the error at the task-environment level due to observing a finite number of tasks. Both steps utilize Catoni's classical PAC-Bayes bound (Catoni, 2007) to measure the error, which we first state.

Theorem Q.2.
(Catoni's bound (Catoni, 2007)) Let $\mathcal{X}$ be a sample space, $P(X)$ a distribution over $\mathcal{X}$, and $\Theta$ a hypothesis space. Given a loss function $\ell(\theta, X) : \Theta \times \mathcal{X} \to [0, 1]$ and a collection of $M$ i.i.d. random variables $(X_1, \ldots, X_M)$ sampled from $P(X)$, let $\pi$ be a prior distribution over the hypothesis space. Then, for any $\delta \in (0, 1]$ and any real number $c > 0$, the following bound holds uniformly for all posterior distributions $\rho$ over the hypothesis space:

$$\mathbb{P}\left[ \mathbb{E}_{X \sim P(X),\, \theta \sim \rho}[\ell(\theta, X)] \le \frac{c}{1-e^{-c}} \left( \frac{1}{M}\sum_{i=1}^{M} \mathbb{E}_{\theta \sim \rho}[\ell(\theta, X_i)] + \frac{KL(\rho\|\pi) + \log\frac{1}{\delta}}{Mc} \right) \right] \ge 1 - \delta.$$

Finally, by employing the union bound, we bound the probability of the intersection of the events in Equation (11) and Equation (8). For any $\delta > 0$, set $\delta_0 = \frac{\delta}{2}$ and $\delta_k = \frac{\delta}{2n}$ for $k = 1, \ldots, n$. Then, with probability at least $1 - \delta$,

$$er(\mathcal{Q}) \le \frac{c_1 c_2}{(1-e^{-c_1})(1-e^{-c_2})} \cdot \frac{1}{n}\sum_{k=1}^{n} \hat{er}(\mathcal{Q}, D_k) + \frac{c_1}{1-e^{-c_1}} \cdot \frac{KL(\mathcal{Q}\|\mathcal{P}) + \log\frac{1}{\delta}}{n c_1} + \frac{c_1 c_2}{(1-e^{-c_1})(1-e^{-c_2})} \cdot \frac{KL(\rho\|\pi) + \log\frac{1}{\delta}}{n(H+L)c_2}.$$

Theorem Q.1 shows that the expected task generalization error is bounded by the empirical multi-task error plus two complexity terms. The first term represents the complexity of the environment, or equivalently, the time-series dataset, and converges to zero as we observe an infinitely long time-series ($n \to \infty$). The second term represents the complexity of the observed tasks, or equivalently, the lookback-horizon windows, and converges to zero when there is a sufficient number of time steps in each window ($H + L \to \infty$).
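The choice $\delta_0 = \delta/2$ and $\delta_k = \delta/(2n)$ in the union-bound step is exactly what makes the failure probabilities of the $n+1$ individual bounds sum to $\delta$:

```latex
\Pr[\text{any bound fails}] \;\le\; \delta_0 + \sum_{k=1}^{n}\delta_k
  \;=\; \frac{\delta}{2} + n\cdot\frac{\delta}{2n} \;=\; \delta .
```

Hence the intersection of all $n+1$ events holds with probability at least $1-\delta$.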



DMS methods directly predict the entire forecast horizon, and are contrasted with iterative multi-step (IMS) methods. Further discussion of DMS/IMS, and a taxonomy of forecasting methods, can be found in Appendix C.
foot_1: https://github.com/zhouhaoyi/ETDataset
foot_2: https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
foot_3: https://github.com/laiguokun/multivariate-time-series-data
foot_4: https://pems.dot.ca.gov/
foot_5: https://www.bgc-jena.mpg.de/wetter/
foot_6: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
foot_7: https://github.com/ElementAI/N-BEATS/blob/master/experiments/electricity/generic.gin



Figure 1: Non-stationary time-series degrade model performance due to covariate shift and conditional distribution shift. Such behaviors can be modelled as locally stationary processes, in which contiguous segments are assumed to be stationary. Meta-learning takes advantage of this assumption to adapt to these locally stationary distributions. Yet, existing methods which model the conditional distribution, $p(y_{t+1} \mid y_t, \ldots)$, are still susceptible to covariate shift since the meta model takes time-series values as input.

Figure 3: Illustration of DeepTime. A time-series dataset can be split into M tasks as given in the problem formulation. For a given task, the lookback window represents the support set, and the forecast horizon represents the query set. $g_\phi$ represents the meta model associated with the meta parameters $\phi$, which are shared between the lookback window and forecast horizon. Inputs to $g_\phi$ are not normalized due to notation constraints in this illustration. The ridge regressor performs the inner loop optimization, while outer loop optimization is performed over samples from the horizon. As illustrated in Figure 3b, DeepTime has a simple overall architecture, comprising a random Fourier features layer, an MLP, and a ridge regressor.

Figure 4: Predictions of DeepTime on three unseen functions for each function class. The orange line represents the split between lookback window and forecast horizon.

Figure5: Computational efficiency benchmark on the ETTm2 multivariate dataset, on a batch size of 32. Runtime is measured for one iteration (forward + backward pass). Left: Runtime/Memory usage as lookback length varies, horizon is fixed to 48. Right: Runtime/Memory usage as horizon length varies, lookback length is fixed to 48. Further model details can be found in Appendix P.

Linear Samples are generated from the function $y = ax + b$ for $x \in [-1, 1]$. Each function/task consists of 400 evenly spaced points between -1 and 1. The parameters of each function/task (i.e. $a, b$) are sampled from a normal distribution with mean 0 and standard deviation 50, i.e. $a, b \sim \mathcal{N}(0, 50^2)$. Cubic Samples are generated from the function $y = ax^3 + bx^2 + cx + d$ for $x \in [-1, 1]$, again over 400 points. Parameters of each task are sampled from a continuous uniform distribution with minimum value -50 and maximum value 50, i.e. $a, b, c, d \sim U(-50, 50)$.

ILIfoot_6 - Influenza-like Illness measures the weekly ratio of patients seen with ILI to the total number of patients, obtained by the Centers for Disease Control and Prevention of the United States between 2002 and 2021.

G.1 NON-STATIONARITY OF REAL-WORLD DATASETS

We first utilize Theorem Q.2 to bound the generalization error within each of the observed tasks. Let $k$ be the index of a task, and define the corresponding expected error, $er(\mathcal{Q}, D_k)$, and empirical error, $\hat{er}(\mathcal{Q}, D_k)$. Applying Theorem Q.2, for any $\delta_k \in (0, 1]$ and $c_2 > 0$, we have

$$\mathbb{P}\left[ er(\mathcal{Q}, D_k) \le \frac{c_2}{1-e^{-c_2}}\left( \hat{er}(\mathcal{Q}, D_k) + \frac{KL(\rho\|\pi) + \log\frac{1}{\delta_k}}{(H+L)c_2} \right) \right] \ge 1 - \delta_k.$$

Next, we bound the error due to observing a limited number of tasks from the environment. Similarly, we define the expected task error $er(\mathcal{Q}) = \mathbb{E}_{\mathcal{D} \sim \mathcal{E}}[er(\mathcal{Q}, \mathcal{D})]$, and Theorem Q.2 gives that an analogous bound holds for any $\delta_0 \in (0, 1]$ and $c_1 > 0$.

Multivariate forecasting benchmark on long sequence time-series forecasting. Best results are highlighted in bold, and second best results are underlined. We evaluate the performance of our proposed approach using two metrics, the mean squared error (MSE) and the mean absolute error (MAE). The datasets are split into train, validation, and test sets chronologically, following a 70/10/20 split for all datasets except ETTm2, which follows a 60/20/20 split, as per convention. The univariate benchmark selects the last index of the multivariate dataset as the target variable, following previous work (Xu et al., 2021). Preprocessing of the data is performed by standardization based on train set statistics. Hyperparameter selection is performed on only one value, the lookback length multiplier, $\mu$, which decides the length of the lookback window via $L = \mu H$. We search through the values $\mu \in \{1, 3, 5, 7, 9\}$, and select the best value based on the validation loss. Further implementation details on DeepTime are reported in Appendix H, and detailed hyperparameters are reported in Appendix I. Reported results for DeepTime are averaged over three runs, and standard deviations are reported in Appendix K.

Ablation study on variants of DeepTime. Starting from the original version, we add (+) or remove (-) some component from DeepTime. RR stands for the differentiable closed-form ridge regressor, removing it refers to replacing this module with a simple linear layer trained via gradient descent across all training samples (i.e. without meta-learning formulation). Local refers to training an INR from scratch via gradient descent for each lookback window (RR is not used here, and there is no training phase). Datetime refers to datetime features. Further model details can be found in Appendix O.1.

Ablation study on backbone models. DeepTime refers to our proposed approach, an INR with random Fourier features sampled from a range of scales. MLP refers to replacing the random Fourier features with a linear map from input dimension to hidden dimension. SIREN refers to an INR with periodic activations as proposed by Sitzmann et al. (2020b). RNN refers to an autoregressive recurrent neural network (inputs are the time-series values, $y_t$). All approaches include the differentiable closed-form ridge regressor. Further model details can be found in Appendix O.2.

Categorization of time-series forecasting methods over the dimensions of time-index vs historical-value methods, and DMS vs IMS methods.

Summary of real-world datasets, results of the Chow test, and the Augmented Dickey-Fuller (ADF) test. The statistical tests are performed on each dimension separately, since they are designed for univariate time-series. We report the number of dimensions which reject/fail to reject the null hypothesis, depending on which indicates non-stationarity. These are reported at significance levels of 0.1, 0.05, and 0.01. Larger values for the Chow test statistic indicate more non-stationarity, and larger (less negative) values for the ADF test statistic indicate more non-stationarity.

Real-world datasets used in long sequence time-series forecasting suffer from non-stationarity. We first verify this qualitatively by visualizing histograms of values across some dimensions for each dataset in Figure 6. This simple visualization already gives us a strong confirmation of the distribution mismatch between the training and testing phases. We further verify this quantitatively via two statistical tests, the Chow test and the Augmented Dickey-Fuller (ADF) test. The Chow test is a test of whether the true coefficients in two linear regressions on different data sets are equal. Rejecting the null hypothesis of equality of regression coefficients in the two periods indicates that the train and test regions are generated from different distributions. The ADF test tests the null hypothesis that a unit root is present in a time-series sample. Not rejecting the null hypothesis indicates that a unit root is present, and the series is thus non-stationary. These results are presented in Table 7.

Sensitivity analysis on the lookback window length. Results presented on the ETTm2 dataset across various values of the lookback length multiplier, µ. Best results are highlighted in bold.

Additional ablation study on variants of DeepTime. + Finetune refers to training an INR via gradient descent for each lookback window on top of having a training phase. Full MAML refers to performing the full meta-learning formulation on the whole model rather than just the last layer, using gradient-based optimization.

Results from hyperparameter sweep on the scale hyperparameter. Best scores are highlighted in bold, and worst scores are highlighted in bold red.

Each feature is initially an integer value, e.g. month-of-year can take on values in {0, 1, . . . , 11}, which we subsequently normalize to a [0, 1] range. Depending on the data sampling frequency, the appropriate features can be chosen. For the ETTm2 dataset, we used all features except second-of-minute, since it is sampled at a 15-minute frequency.
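A sketch of this normalization using only the standard library (the helper and the subset of features shown are ours; the feature names follow the text):

```python
from datetime import datetime, timedelta

def datetime_features(ts):
    """Map a timestamp to [0, 1]-normalized datetime features.

    Each feature is first an integer (e.g. month-of-year in {0, ..., 11}),
    then divided by its maximum value."""
    return {
        "quarter_of_year": ((ts.month - 1) // 3) / 3.0,  # {0,...,3}  -> [0, 1]
        "month_of_year": (ts.month - 1) / 11.0,          # {0,...,11} -> [0, 1]
        "day_of_week": ts.weekday() / 6.0,               # {0,...,6}  -> [0, 1]
        "hour_of_day": ts.hour / 23.0,                   # {0,...,23} -> [0, 1]
        "minute_of_hour": ts.minute / 59.0,              # {0,...,59} -> [0, 1]
    }

# e.g. features for four consecutive 15-minute timestamps
start = datetime(2020, 1, 1)
feats = [datetime_features(start + i * timedelta(minutes=15)) for i in range(4)]
```

For a 15-minute dataset like ETTm2, a second-of-minute feature would be constant at 0, which is why it is dropped.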

E DEEPTIME PSEUDOCODE

Algorithm 1 PyTorch-style pseudocode of the closed-form ridge regressor. mm: matrix multiplication; diagonal: returns the diagonal elements of a matrix; add_: in-place addition; linalg.solve: computes the solution of a square system of linear equations with a unique solution.

# X: inputs, shape: (n_samples, n_dim)
# Y: targets, shape: (n_samples, n_out)
# lambd: scalar value representing the regularization coefficient
n_samples, n_dim = X.shape
# add a bias term by concatenating an all-ones vector
ones = torch.ones(n_samples, 1)
X = torch.cat([X, ones], dim=-1)
# solve the regularized normal equations, (X^T X + lambd * I) W = X^T Y
A = torch.mm(X.t(), X)
A.diagonal().add_(lambd)
W = torch.linalg.solve(A, torch.mm(X.t(), Y))
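For readers without PyTorch, the same closed form can be checked with a NumPy re-implementation (a sketch following the listed operations, not the paper's code; note that, like the pseudocode, it also regularizes the bias coefficient):

```python
import numpy as np

def ridge_closed_form(X, Y, lambd=1.0):
    """Solve W = (X^T X + lambd * I)^{-1} X^T Y with an appended bias column."""
    n_samples, _ = X.shape
    Xb = np.concatenate([X, np.ones((n_samples, 1))], axis=1)
    A = Xb.T @ Xb
    A[np.diag_indices_from(A)] += lambd  # in-place, like diagonal().add_()
    return np.linalg.solve(A, Xb.T @ Y)

# sanity check: with lambd -> 0 this recovers ordinary least squares
# on noiseless data from y = 2x + 1
X = np.linspace(0.0, 1.0, 50)[:, None]
Y = 2 * X + 1
W = ridge_closed_form(X, Y, lambd=1e-8)  # W[0] ~ slope, W[1] ~ intercept
```

Solving the linear system is preferred over explicitly inverting the matrix, since it is both cheaper and numerically more stable.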

H DEEPTIME IMPLEMENTATION DETAILS

Optimization We train DeepTime with the Adam optimizer (Kingma & Ba, 2014), with a learning rate scheduler following a linear warm-up and cosine annealing scheme. Gradient clipping by norm is applied. The ridge regressor regularization coefficient, λ, is trained with a different, higher learning rate than the rest of the meta parameters. We use early stopping based on the validation loss, with a fixed patience hyperparameter (the number of epochs for which the loss deteriorates before stopping). All experiments are performed on an Nvidia A100 GPU.

Model The ridge regression regularization coefficient is a learnable parameter, constrained to positive values via a softplus function. We apply Dropout (Srivastava et al., 2014), then LayerNorm (Ba et al., 2016), after the ReLU activation function in each INR layer. The size of the random Fourier features layer is set independently of the layer size: we define the total size of the random Fourier features layer, and the number of dimensions is divided equally among the scales.
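The softplus constraint on λ can be sketched as follows: the raw learnable parameter is unconstrained during gradient descent, while the coefficient actually entering the ridge solve is its softplus, which is always positive (a sketch; the actual initialization and learning rates are configured elsewhere):

```python
import math

def softplus(x):
    # numerically stable softplus: log(1 + exp(x)),
    # computed as log1p(exp(-|x|)) + max(x, 0) to avoid overflow
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

# the raw parameter may take any real value during optimization...
raw_lambda = -3.0
# ...but the coefficient used in the ridge solve is always positive
lambd = softplus(raw_lambda)
```

Softplus is preferred over, say, `exp` because it grows only linearly for large inputs, keeping the gradient with respect to the raw parameter well-behaved.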

