ETSFORMER: EXPONENTIAL SMOOTHING TRANSFORMERS FOR TIME-SERIES FORECASTING

Abstract

Transformers have recently been actively studied for time-series forecasting. While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer from some fundamental limitations, e.g., they are generally not decomposable or interpretable, and are neither effective nor efficient for long-term forecasting. In this paper, we propose ETSformer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing methods to improve Transformers for time-series forecasting. Specifically, ETSformer leverages a novel level-growth-seasonality decomposed Transformer architecture which leads to more interpretable and disentangled decomposed forecasts. We further propose two novel attention mechanisms, the exponential smoothing attention and frequency attention, which are specially designed to overcome the limitations of the vanilla attention mechanism for time-series data. Extensive experiments on the long sequence time-series forecasting (LSTF) benchmark validate the efficacy and advantages of the proposed method. Code is attached in the supplementary material, and will be made publicly available.

1. INTRODUCTION

Transformer models have achieved great success in the fields of natural language processing (Vaswani et al., 2017; Devlin et al., 2019), computer vision (Carion et al., 2020; Dosovitskiy et al., 2021), and, more recently, time-series (Li et al., 2019; Wu et al., 2021; Zhou et al., 2021; Zerveas et al., 2021; Zhou et al., 2022). While the success of Transformer models has been widely attributed to the self-attention mechanism, alternative forms of attention, infused with the appropriate inductive biases, have been introduced to tackle the unique properties of their underlying task or data (You et al., 2020; Raganato et al., 2020). In time-series forecasting, decomposition-based architectures such as Autoformer and FEDformer (Wu et al., 2021; Zhou et al., 2022) have incorporated time-series specific inductive biases, leading to increased accuracy and more interpretable forecasts (by decomposing forecasts into seasonal and trend components). Their success has been driven by: (i) disentangling seasonal and trend representations via seasonal-trend decomposition (Cleveland & Tiao, 1976; Cleveland et al., 1990; Woo et al., 2022), and (ii) replacing the vanilla pointwise dot-product attention, which handles time-series patterns such as seasonality and trend inefficiently, with time-series specific attention mechanisms such as the Auto-Correlation mechanism and Frequency-Enhanced Attention. While these existing works introduce the promising direction of interpretable and decomposed time-series forecasting for Transformer-based architectures, they suffer from two drawbacks. Firstly, they suffer from entangled seasonal-trend representations, evidenced in Figure 1, where the trend forecasts exhibit periodic patterns which should only be present in the seasonal component, and the seasonal component does not accurately track the (multiple) periodicities present in the ground truth seasonal component.
This arises due to their decomposition mechanism, which detects trend via a simple moving average over the input signal and detrends the signal by removing the detected trend component, an arguably naive approach. This method has many known pitfalls (Hyndman & Athanasopoulos, 2018), such as the trend-cycle component not being available for the first and last few observations, and over-smoothing of rapid rises and falls. Secondly, their proposed replacements for the vanilla attention mechanism are not human-interpretable, as demonstrated in Section 3.3. Model inspection and diagnosis allows us to better understand the forecasts generated by our models, attributing predictions to each component to make better downstream decisions. For an attention mechanism focusing on seasonality, we would expect the cross-attention map visualization to produce clear periodic patterns which shift smoothly across decoder time steps. Yet, the Auto-Correlation mechanism from Autoformer does not exhibit this property, yielding similar attention weights across decoder time steps, while the Frequency-Enhanced Attention from FEDformer does not have such model interpretability capabilities due to its complicated frequency domain attention. To address these limitations, we look towards the more principled approach of level-growth-season decomposition from ETS methods (Hyndman et al., 2008) (further introduced in Appendix A). This principle further deconstructs trend into level and growth components. To extract the level and growth components, we also draw on the idea of exponential smoothing, in which more recent data is weighted more highly than older data, reflecting the view that the recent past is more relevant for making new predictions or identifying current trends; this replaces the naive moving average.
At the same time, we leverage the idea of extracting the most salient periodic components in the frequency domain via the Fourier transform, to extract the global seasonal patterns present in the signal. These principles yield a stronger decomposition strategy: first extract global periodic patterns as seasonality, then extract growth as the change in level in an exponentially smoothed manner. Motivated by the above, we propose ETSformer, an interpretable and efficient Transformer architecture for time-series forecasting which yields disentangled seasonal-trend forecasts. Instead of reusing the moving average operation for detrending, ETSformer overhauls the existing decomposition architecture by leveraging the level-growth-season principle, embedding it into a novel Transformer framework in a non-trivial manner. Next, we introduce interpretable and efficient attention mechanisms: Exponential Smoothing Attention (ESA) for trend, and Frequency Attention (FA) for seasonality. ESA assigns attention weights in an exponentially decaying manner, with high values to nearby time steps and low values to faraway time steps, thus specializing in extracting growth representations. FA leverages frequency domain representations to extract dominating seasonal patterns by selecting the Fourier bases with the K largest amplitudes. Both mechanisms have efficient implementations with O(L log L) complexity. Furthermore, we demonstrate human-interpretable visualizations of both mechanisms in Section 3.3. To summarize, our key contributions are as follows:

• We introduce a novel decomposition Transformer architecture, incorporating the time-tested level-growth-season principle for more disentangled, human-interpretable time-series forecasts.

• We introduce two new attention mechanisms, ESA and FA, which incorporate stronger time-series specific inductive biases. They achieve better efficiency than vanilla attention, and yield interpretable attention weights upon model inspection.

• The resulting method is a highly effective, efficient, and interpretable deep forecasting model. Extensive empirical analysis shows that ETSformer achieves performance competitive with state-of-the-art methods over 6 real-world datasets in both multivariate and univariate settings, and is highly efficient compared to competing methods.

2. ETSFORMER

Problem Formulation Let x_t ∈ R^m denote an observation of a multivariate time-series at time step t. Given a lookback window X_{t-L:t} = [x_{t-L}, . . . , x_{t-1}], we consider the task of predicting future values over a horizon, X_{t:t+H} = [x_t, . . . , x_{t+H-1}]. We denote by X̂_{t:t+H} the point forecast of X_{t:t+H}. Thus, the goal is to learn a forecasting function X̂_{t:t+H} = f(X_{t-L:t}) by minimizing some loss function L : R^{H×m} × R^{H×m} → R. In the following, we explain how ETSformer infuses level-growth-seasonal decomposition into the classical encoder-decoder Transformer architecture, specializing it for interpretable time-series forecasting. Our architecture design relies on three key principles: (1) the architecture leverages the stacking of multiple layers to progressively extract a series of level, growth, and seasonal representations from the intermediate latent residual; (2) it performs level-growth-seasonal decomposition of latent representations, extracting salient seasonal patterns while modeling level and growth components following an exponential smoothing formulation; (3) the final forecast is a composition of level, growth, and seasonal components, making it human-interpretable.

2.1. OVERALL ARCHITECTURE

Figure 2 illustrates the overall encoder-decoder architecture of ETSformer. At each layer, the encoder is designed to iteratively extract growth and seasonal latent components from the lookback window. The level is then extracted in a similar fashion to classical level smoothing in Equation (3). These extracted components are then fed to the decoder to further generate the final H-step ahead forecast via a composition of level, growth, and seasonal forecasts, defined as:

X̂_{t:t+H} = E_{t:t+H} + Linear( Σ_{n=1}^{N} (B^{(n)}_{t:t+H} + S^{(n)}_{t:t+H}) ),     (1)

where E_{t:t+H} ∈ R^{H×m}, and B^{(n)}_{t:t+H}, S^{(n)}_{t:t+H} ∈ R^{H×d} represent the level forecasts, and the growth and seasonal latent representations of each time step in the forecast horizon, respectively. The superscript represents the stack index, for a total of N encoder stacks. Note that Linear(•) : R^d → R^m operates element-wise along each time step, projecting the extracted growth and seasonal representations from latent to observation space.

2.1.1. INPUT EMBEDDING

Raw signals from the lookback window are mapped to latent space via the input embedding module, defined by Z^{(0)}_{t-L:t} = E^{(0)}_{t-L:t} = Conv(X_{t-L:t}), where Conv is a temporal convolutional filter with kernel size 3, input channel m, and output channel d. In contrast to prior work (Li et al., 2019; Wu et al., 2020; 2021; Zhou et al., 2021), the inputs of ETSformer do not rely on any manually designed dynamic time-dependent covariates (e.g., month-of-year, day-of-week) for either the lookback window or the forecast horizon. This is because the proposed Frequency Attention module (details in Section 2.2.2) is able to automatically uncover these seasonal patterns, which renders it more applicable to challenging scenarios without such discriminative covariates and reduces the need for feature engineering.
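As a concrete illustration, the embedding can be sketched as a kernel-size-3 temporal convolution in plain NumPy. Only the kernel size and the channel counts m, d come from the text; the "same" zero-padding scheme and the weight layout are our own assumptions.

```python
import numpy as np

def input_embedding(X, W, b):
    """Kernel-size-3 temporal convolution mapping a lookback window (L, m)
    to latent space (L, d). W: (d, m, 3) filters, b: (d,) bias.
    'Same' zero padding on the time axis is an assumption of this sketch."""
    L, m = X.shape
    Xp = np.pad(X, ((1, 1), (0, 0)))                  # zero-pad time axis
    windows = np.stack([Xp[t:t + 3] for t in range(L)])  # (L, 3, m)
    return np.einsum('lkm,dmk->ld', windows, W) + b   # (L, d)
```

With a centered identity filter (W[i, i, 1] = 1) the module reduces to a per-step linear map, which is a convenient sanity check.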

2.1.2. ENCODER

The encoder focuses on extracting a series of latent growth and seasonality representations in a cascaded manner from the lookback window. To achieve this goal, traditional methods rely on the assumption of additive or multiplicative seasonality, which has limited capability to express complex patterns beyond these assumptions. Inspired by (Oreshkin et al., 2019; He et al., 2016), we leverage residual learning to build an expressive, deep architecture that characterizes the complex intrinsic patterns. Each layer can be interpreted as sequentially analyzing the input signals. The extracted growth and seasonal signals are removed from the residual, which then undergoes a nonlinear transformation before moving to the next layer. Each encoder layer takes as input the residual from the previous encoder layer, Z^{(n-1)}_{t-L:t}, and emits Z^{(n)}_{t-L:t}, B^{(n)}_{t-L:t}, S^{(n)}_{t-L:t}, the residual, latent growth, and seasonal representations for the lookback window, via the Multi-Head Exponential Smoothing Attention (MH-ESA) and Frequency Attention (FA) modules (detailed description in Section 2.2). The following equations formalize the overall pipeline in each encoder layer; for ease of exposition, we use the notation := for a variable update.

Seasonal: S^{(n)}_{t-L:t} = FA_{t-L:t}(Z^{(n-1)}_{t-L:t}),    Z^{(n-1)}_{t-L:t} := Z^{(n-1)}_{t-L:t} - S^{(n)}_{t-L:t}

Growth:   B^{(n)}_{t-L:t} = MH-ESA(Z^{(n-1)}_{t-L:t}),    Z^{(n-1)}_{t-L:t} := LN(Z^{(n-1)}_{t-L:t} - B^{(n)}_{t-L:t})

Z^{(n)}_{t-L:t} = LN(Z^{(n-1)}_{t-L:t} + FF(Z^{(n-1)}_{t-L:t}))

LN is layer normalization (Ba et al., 2016), FF(x) = Linear(σ(Linear(x))) is a position-wise feedforward network (Vaswani et al., 2017), and σ(•) is the sigmoid function.

Level Module Given the latent growth and seasonal representations from each layer, we extract the level at each time step t in the lookback window in a similar way as the level smoothing equation in Equation (3).
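The layer pipeline can be sketched with stand-in modules; a minimal sketch where `fa`, `mh_esa`, `ff`, and `ln` are hypothetical callables standing in for the paper's FA, MH-ESA, feedforward, and layer-norm modules:

```python
import numpy as np

def encoder_layer(z, fa, mh_esa, ff, ln):
    """One encoder layer: peel off seasonality, then growth, then apply the
    position-wise feedforward, updating the residual z at each step.
    fa/mh_esa/ff/ln are placeholder callables, not the real modules."""
    s = fa(z)           # seasonal representation of the residual
    z = z - s           # de-seasonalize
    b = mh_esa(z)       # growth representation
    z = ln(z - b)       # remove growth, normalize
    z = ln(z + ff(z))   # nonlinear transformation with skip connection
    return z, b, s
```

Stacking N such layers gives the cascade of (Z^{(n)}, B^{(n)}, S^{(n)}) triples consumed by the level module and decoder.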
Formally, the adjusted level is a weighted average of the current (de-seasonalized) level and the level-growth forecast from the previous time step t-1. It can be formulated as:

E^{(n)}_t = α * (E^{(n-1)}_t - Linear(S^{(n)}_t)) + (1 - α) * (E^{(n)}_{t-1} + Linear(B^{(n)}_{t-1})),

where α ∈ R^m is a learnable smoothing parameter, * is an element-wise multiplication, and Linear(•) : R^d → R^m maps representations to observation space. Finally, the extracted level in the last layer, E^{(N)}_{t-L:t}, can be regarded as the corresponding level for the lookback window. We show in Appendix B.3 that this recurrent exponential smoothing equation can also be efficiently evaluated using the efficient A_ES algorithm (Algorithm 1) with an auxiliary term.
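The level recurrence can also be evaluated directly in O(L), before any A_ES-based speedup; a minimal NumPy sketch, where S and B are assumed already projected to observation space, and the supplied initial level and zero initial growth are our assumptions:

```python
import numpy as np

def level_smoothing(E_prev_layer, S, B, alpha, E0):
    """Direct recurrent evaluation of
    E_t = alpha*(E'_t - S_t) + (1-alpha)*(E_{t-1} + B_{t-1}).
    E_prev_layer, S, B: (L, m) arrays already in observation space;
    alpha: scalar or (m,); E0: (m,) initial level (an assumption)."""
    L, m = E_prev_layer.shape
    E = np.empty((L, m))
    prev_e, prev_b = E0, np.zeros(m)   # zero initial growth: an assumption
    for t in range(L):
        E[t] = alpha * (E_prev_layer[t] - S[t]) + (1 - alpha) * (prev_e + prev_b)
        prev_e, prev_b = E[t], B[t]    # B[t] becomes B_{t-1} at the next step
    return E
```

Setting alpha = 1 recovers the fully de-seasonalized level E'_t - S_t, a convenient boundary-case check.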

2.1.3. DECODER

The decoder is tasked with generating the H-step ahead forecasts. As shown in Equation (1), the final forecast is a composition of the level forecasts E_{t:t+H} and the growth and seasonal representations B^{(n)}_{t:t+H}, S^{(n)}_{t:t+H}. The Growth Damping (GD) and Frequency Attention modules take as inputs B^{(n)}_t and S^{(n)}_{t-L:t} to predict B^{(n)}_{t:t+H} and S^{(n)}_{t:t+H}, respectively:

Growth:   B^{(n)}_{t:t+H} = GD(B^{(n)}_t)

Seasonal: S^{(n)}_{t:t+H} = FA_{t:t+H}(S^{(n)}_{t-L:t})

To obtain the level in the forecast horizon, the Level Stack repeats the level at the last time step t along the forecast horizon. It can be defined as E_{t:t+H} = Repeat_H(E^{(N)}_t) = [E^{(N)}_t, . . . , E^{(N)}_t], with Repeat_H(•) : R^{1×m} → R^{H×m}.

Growth Damping To obtain the growth representation in the forecast horizon, we follow the idea of trend damping in Equation (4) to make robust multi-step forecasts. The trend representations can be formulated as:

GD(B^{(n)}_t)_j = Σ_{i=1}^{j} γ^i B^{(n)}_t,    B^{(n)}_{t:t+H} = [GD(B^{(n)}_t)_1, . . . , GD(B^{(n)}_t)_H],

where 0 < γ < 1 is a learnable damping parameter; in practice, we apply a multi-head version of trend damping making use of n_h damping parameters. Similar to the level forecast in the Level Stack, we only use the last trend representation in the lookback window, B^{(n)}_t, to forecast the trend representation in the forecast horizon.
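The damped growth forecast reduces to scaling the last growth vector by partial sums of powers of γ; a minimal single-head NumPy sketch (the multi-head variant simply repeats this with n_h separate γ values):

```python
import numpy as np

def growth_damping(b_t, gamma, H):
    """Damped growth forecast: GD(b_t)_j = sum_{i=1}^{j} gamma^i * b_t.
    b_t: (d,) last growth representation; 0 < gamma < 1; returns (H, d)."""
    damp = np.cumsum(gamma ** np.arange(1, H + 1))  # gamma + ... + gamma^j
    return damp[:, None] * b_t
```

As j grows the partial sums approach γ/(1-γ), so the growth contribution flattens out rather than extrapolating linearly, which is the robustness property the paper borrows from classical trend damping.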

2.2. EXPONENTIAL SMOOTHING ATTENTION AND FREQUENCY ATTENTION MECHANISM

Considering the ineffectiveness of existing attention mechanisms in handling time-series data, we develop the Exponential Smoothing Attention (ESA) and Frequency Attention (FA) mechanisms to extract latent growth and seasonal representations. ESA is a non-adaptive, learnable attention scheme with an inductive bias to attend more strongly to recent observations by following an exponential decay, while FA is a non-learnable attention scheme, that leverages Fourier transformation to select dominating seasonal patterns. A comparison between existing work and our proposed ESA and FA is illustrated in Figure 3 .

2.2.1. EXPONENTIAL SMOOTHING ATTENTION

Vanilla self-attention can be regarded as a weighted combination of an input sequence, where the weights are normalized alignment scores measuring the similarity between input contents (Tsai et al., 2019). Inspired by the exponential smoothing in Equation (3), we aim to assign a higher weight to recent observations. This can be regarded as a novel form of attention whose weights are computed from the relative time lag, rather than the input content. The ESA mechanism is defined as A_ES : R^{L×d} → R^{L×d}, where A_ES(V)_t ∈ R^d denotes the t-th row of the output matrix, representing the token corresponding to the t-th time step. Its exponential smoothing formula can be written as:

A_ES(V)_t = α V_t + (1 - α) A_ES(V)_{t-1} = Σ_{j=0}^{t-1} α(1 - α)^j V_{t-j} + (1 - α)^t v_0,

where 0 < α < 1 and v_0 are learnable parameters known as the smoothing parameter and initial state, respectively.

Efficient A_ES algorithm The straightforward implementation of the ESA mechanism, which constructs the attention matrix A_ES and performs a matrix multiplication with the input sequence (detailed algorithm in Appendix B.4),

A_ES(V) = [A_ES(V)_1; . . . ; A_ES(V)_L] = A_ES [v_0^T; V],

results in an O(L^2) computational complexity. Yet, we are able to achieve an efficient algorithm by exploiting the unique structure of the exponential smoothing attention matrix A_ES, which is illustrated in Appendix B.1. Each row of the attention matrix can be regarded as the previous row right-shifted with padding (ignoring the first column). Thus, a matrix-vector multiplication can be computed with a cross-correlation operation, which in turn has an efficient fast Fourier transform implementation (Mathieu et al., 2014). The full algorithm is described in Algorithm 1, Appendix B.2, achieving an O(L log L) complexity.
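The efficient evaluation can be sketched as an FFT-based causal convolution with the kernel w_j = α(1-α)^j, plus the (1-α)^t v_0 initial-state term. This is a minimal NumPy sketch of the idea, not the paper's official implementation:

```python
import numpy as np

def es_attention(V, alpha, v0):
    """Exponential smoothing attention in O(L log L) via FFT convolution.
    V: (L, d) values; alpha: scalar in (0, 1); v0: (d,) initial state."""
    L = V.shape[0]
    j = np.arange(L)
    kernel = alpha * (1 - alpha) ** j              # w_j = alpha*(1-alpha)^j
    n = 1 << (2 * L - 1).bit_length()              # FFT size >= 2L - 1
    out = np.fft.irfft(
        np.fft.rfft(kernel, n)[:, None] * np.fft.rfft(V, n, axis=0),
        n, axis=0)[:L]                             # causal convolution
    out += ((1 - alpha) ** (j + 1))[:, None] * v0  # (1-alpha)^t * v0 term
    return out
```

Because the convolution is linear and causal, the output matches the O(L) recurrence A_ES(V)_t = αV_t + (1-α)A_ES(V)_{t-1} exactly (up to floating-point error), which makes the FFT route easy to verify.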

Multi-Head Exponential Smoothing Attention (MH-ESA)

We use A_ES as a basic building block and develop the Multi-Head Exponential Smoothing Attention to extract latent growth representations. Formally, we obtain the growth representations by taking the successive difference of the residuals:

Ẑ^{(n)}_{t-L:t} = Linear(Z^{(n-1)}_{t-L:t}),

B^{(n)}_{t-L:t} = MH-A_ES(Ẑ^{(n)}_{t-L:t} - [v^{(n)}_0; Ẑ^{(n)}_{t-L:t-1}]),

B^{(n)}_{t-L:t} := Linear(B^{(n)}_{t-L:t}),

where MH-A_ES is a multi-head version of A_ES and v^{(n)}_0 is the initial state from the ESA mechanism.
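The bracketed shift-and-difference can be sketched in a couple of lines; a minimal NumPy version where the first token, which has no predecessor, is differenced against the initial state:

```python
import numpy as np

def successive_difference(z_hat, v0):
    """Build the MH-ESA input by differencing each residual token with its
    predecessor. z_hat: (L, d); v0: (d,) initial state for the first token."""
    shifted = np.vstack([v0[None, :], z_hat[:-1]])  # [v0; z_{t-L:t-1}]
    return z_hat - shifted
```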

2.2.2. FREQUENCY ATTENTION

The goal of identifying and extracting seasonal patterns from the lookback window is twofold. Firstly, it can be used to perform de-seasonalization on the input signals so that downstream components are able to focus on modeling the level and growth information. Secondly, we are able to extrapolate the seasonal patterns to build representations for the forecast horizon. The main challenge is to automatically identify seasonal patterns. Fortunately, the use of power spectral density estimation for periodicity detection has been well studied (Vlachos et al., 2005). Inspired by these methods, we leverage the discrete Fourier transform (DFT, details in Appendix C) to develop the FA mechanism to extract dominant seasonal patterns. Specifically, FA first decomposes input signals into their Fourier bases via a DFT along the temporal dimension, F(Z^{(n-1)}_{t-L:t}) ∈ C^{F×d} where F = ⌊L/2⌋ + 1, and selects the bases with the K largest amplitudes. An inverse DFT is then applied to obtain the seasonal pattern in the time domain. Formally, this is given by the following equations:

Φ_{k,i} = φ( F(Z^{(n-1)}_{t-L:t})_{k,i} ),    A_{k,i} = | F(Z^{(n-1)}_{t-L:t})_{k,i} |,

κ^{(1)}_i, . . . , κ^{(K)}_i = arg Top-K_{k ∈ {2,...,F}} A_{k,i},

S^{(n)}_{j,i} = Σ_{k=1}^{K} A_{κ^{(k)}_i, i} [ cos(2π f_{κ^{(k)}_i} j + Φ_{κ^{(k)}_i, i}) + cos(2π f̄_{κ^{(k)}_i} j + Φ̄_{κ^{(k)}_i, i}) ],

where Φ_{k,i}, A_{k,i} are the phase and amplitude of the k-th frequency for the i-th dimension, arg Top-K returns the arguments of the top K amplitudes, K is a hyperparameter, f_k is the Fourier frequency of the corresponding index, and f̄_k, Φ̄_{k,i} are the Fourier frequency and phase of the corresponding conjugates. Finally, the latent seasonal representation of the i-th dimension for the lookback window is S^{(n)}_{t-L:t,i} = [S^{(n)}_{t-L,i}, . . . , S^{(n)}_{t-1,i}]. For the forecast horizon, the FA module extrapolates beyond the lookback window via S^{(n)}_{t:t+H,i} = [S^{(n)}_{t,i}, . . . , S^{(n)}_{t+H-1,i}].
Since K is a hyperparameter typically chosen to be small, the complexity of the FA mechanism is likewise O(L log L).
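The mechanism can be sketched with NumPy's real FFT; a minimal single-sample version where the per-dimension loop and the handling of the Nyquist bin are simplifications of our own:

```python
import numpy as np

def frequency_attention(z, K, horizon=0):
    """Keep the K largest-amplitude non-DC Fourier bases per dimension and
    reconstruct/extrapolate them in the time domain.
    z: (L, d). Returns (lookback seasonality (L, d), horizon part (H, d))."""
    L, d = z.shape
    c = np.fft.rfft(z, axis=0)          # (F, d), F = L//2 + 1
    amp = np.abs(c)
    amp[0] = 0.0                        # skip DC, matching k in {2, ..., F}
    out = np.zeros((L + horizon, d))
    t = np.arange(L + horizon)
    for i in range(d):
        for k in np.argsort(amp[:, i])[-K:]:        # top-K frequency indices
            a = 2 * np.abs(c[k, i]) / L             # real-signal amplitude
            phi, f = np.angle(c[k, i]), k / L       # phase, Fourier frequency
            out[:, i] += a * np.cos(2 * np.pi * f * t + phi)
    return out[:L], out[L:]
```

Because the selected bases are evaluated at arbitrary time indices, extrapolation into the horizon comes for free, which is what lets the decoder reuse FA without a cross-attention mechanism.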

3. EXPERIMENTS

This section presents extensive empirical evaluations on the LSTF task over 6 real world multivariate datasets, ETT, ECL, Exchange, Traffic, Weather, and ILI, coming from a variety of application areas (details in Appendix E). Performance is evaluated via the mean squared error (MSE) and mean absolute error (MAE) metrics. For the main benchmark, datasets are split into train, validation, and test sets chronologically, following a 60/20/20 split for the ETT datasets and a 70/10/20 split for the other datasets. The multivariate benchmark makes use of all dimensions, while the univariate benchmark selects the last dimension of the datasets as the target variable, following previous work (Zhou et al., 2021; Wu et al., 2021). Data is pre-processed by performing standardization based on train set statistics. Further details on implementation and hyperparameters can be found in Appendix D. This is followed by an ablation study of the various contributing components, interpretability experiments on our proposed model, and finally an analysis of computational efficiency.

3.1. RESULTS

For the multivariate benchmark, baselines include recently proposed time-series/efficient Transformers, FEDformer, Autoformer, Informer, LogTrans (Li et al., 2019), and Reformer (Kitaev et al., 2020), and RNN variants, LSTnet (Lai et al., 2018) and ES-RNN (Smyl, 2020). Univariate baselines further include N-BEATS (Oreshkin et al., 2019), DeepAR (Salinas et al., 2020), ARIMA, Prophet (Taylor & Letham, 2018), and AutoETS (Bhatnagar et al., 2021). We obtain baseline results from the following papers: (Wu et al., 2021; Zhou et al., 2021), and further run AutoETS from the Merlion library (Bhatnagar et al., 2021). Table 1 summarizes the results of ETSformer against top performing baselines on a selection of datasets for the multivariate setting; full results are presented in Table 6 in Appendix G due to space constraints. Results for ETSformer are averaged over three runs (standard deviations in Appendix H). Overall, ETSformer achieves competitive performance, obtaining the best MSE on 14 out of 24 datasets/settings in the multivariate case, and ranking within the top 2 across all 24 datasets/settings.

3.2. ABLATION STUDY

We study the contribution of each major component of which the final forecast is composed: level, growth, and seasonality. Table 2 first presents the performance of the full model, and subsequently the performance of the resulting model when each component is removed. We observe that the composition of level, growth, and season provides the most accurate forecasts across a variety of application areas, and removing any one component results in a deterioration. In particular, estimation of the level of the time-series is critical. We also analyse the design of the MH-ESA, replacing it with a vanilla multi-head attention and with an FC layer performing token mixing; we observe that our trend attention formulation is indeed more effective.

3.3. INTERPRETABILITY

ETSformer generates interpretable forecasts which can be decomposed into disentangled level, growth, and seasonal components. We showcase this ability against baselines in Figure 1 on synthetic data containing (nonlinear) trend and seasonality patterns (details in Appendix F), since we are not able to obtain ground truth decompositions from real-world data. Forecast decompositions (without component ground truth) can be found in Appendix J. Furthermore, we report quantitative results over the test set in Table 4. ETSformer successfully forecasts interpretable level, trend (level + growth), and seasonal components, as observed in the trend and seasonality components closely tracking the ground truth patterns. Despite obtaining a good combined forecast, competing decomposition-based approaches struggle to disambiguate between trend and seasonality. Furthermore, ETSformer produces human-interpretable attention weights for both the FA and ESA mechanisms, visualized in Figure 4. The FA weights exhibit clear periodicity which can be used to identify the dominating seasonal patterns, while the ESA weights exhibit the exponentially decaying property prescribed by its inductive bias. This is contrasted with Autoformer's Auto-Correlation visualization in Figure 5, which does not follow periodicity properties despite being specialized to handle seasonality.

3.4. COMPUTATIONAL EFFICIENCY

Figure 6 charts ETSformer's empirical efficiency against that of competing Transformer-based approaches. ETSformer maintains competitive efficiency with competing quasilinear and linear complexity Transformers. This is especially so as the forecast horizon increases, due to ETSformer's unique decoder architecture which relies on its Trend Damping and Frequency Attention modules rather than a cross-attention mechanism. Of note, while FEDformer claims linear complexity, our empirical results show that it incurs significant overhead, especially in terms of runtime efficiency.
This slowdown arises from their (official) implementation still relying on a straightforward, unoptimized implementation.

4. RELATED WORK

Deep Forecasting LogTrans (Li et al., 2019) and AST (Wu et al., 2020) first introduced Transformer-based methods to reduce the computational complexity of attention. The LSTF benchmark was first introduced by Informer (Zhou et al., 2021), which extends the Transformer architecture with the ProbSparse attention and a distillation operation to achieve O(L log L) complexity. Similar to our work, which incorporates prior knowledge of time-series structure, Autoformer (Wu et al., 2021) introduces the Auto-Correlation attention mechanism, which focuses on sub-series based similarity and is able to extract periodic patterns. FEDformer (Zhou et al., 2022) extends this line of work by incorporating Frequency Enhanced structures. N-HiTS (Challu et al., 2022) introduced hierarchical interpolation and multi-rate data sampling, building on top of N-BEATS (Oreshkin et al., 2019) for the LSTF task. ES-RNN (Smyl, 2020) explored combining ETS methods with neural networks. However, it treats ETS as a pre- and post-processing step rather than baking it into the model architecture. Furthermore, it requires prior knowledge of seasonality patterns and was not proposed for LSTF, leading to high computation costs over long horizons.

Attention Mechanisms

The self-attention mechanism in Transformer models has recently received much attention, and its necessity has been greatly investigated in attempts to introduce more flexibility and reduce computational cost. Synthesizer (Tay et al., 2021) empirically studies the importance of dot-product interactions, and shows that a randomly initialized, learnable attention mechanism, with or without token-token dependencies, can achieve competitive performance with vanilla self-attention on various NLP tasks. You et al. (2020) utilize an unparameterized Gaussian distribution to replace the original attention scores, concluding that the attention distribution should focus on a certain local window, and achieve comparable performance. Raganato et al. (2020) replace attention with fixed, non-learnable positional patterns, obtaining competitive performance on NMT tasks. Lee-Thorp et al. (2021) replace self-attention with a non-learnable Fourier Transform and verify it to be an effective mixing mechanism.

5. DISCUSSION

Inspired by the classical exponential smoothing methods and emerging Transformer approaches for time-series forecasting, we propose ETSformer, a novel level-growth-season decomposition Transformer. ETSformer leverages the novel Exponential Smoothing Attention and Frequency Attention mechanisms, which are more effective at modeling time-series than vanilla self-attention, while achieving O(L log L) complexity, where L is the length of the lookback window. We performed extensive empirical evaluation, showing that ETSformer has extremely competitive accuracy and efficiency, while being highly interpretable.

Limitations & Future Work ETSformer currently only produces point forecasts. Probabilistic forecasting would be a valuable extension of our current work due to its importance in practical applications. Another useful direction which ETSformer does not currently consider is additional covariates, such as holiday indicators and other dummy variables, to capture holiday effects which cannot be modeled by the FA mechanism.

A CLASSICAL EXPONENTIAL SMOOTHING

We instantiate exponential smoothing methods (Hyndman et al., 2008) in the univariate forecasting setting. They assume that a time-series can be decomposed into seasonal and trend components, and that trend can be further decomposed into level and growth components. A commonly used model is the additive Holt-Winters' method (Holt, 2004; Winters, 1960), which can be formulated as:

Level:       e_t = α(x_t - s_{t-p}) + (1 - α)(e_{t-1} + b_{t-1})
Growth:      b_t = β(e_t - e_{t-1}) + (1 - β) b_{t-1}
Seasonal:    s_t = γ(x_t - e_t) + (1 - γ) s_{t-p}
Forecasting: x̂_{t+h|t} = e_t + h b_t + s_{t+h-p}     (3)

where p is the period of seasonality, and x̂_{t+h|t} is the h-steps ahead forecast. The above equations state that the h-steps ahead forecast is composed of the last estimated level e_t, incremented by h times the last growth factor b_t, plus the last available seasonal factor s_{t+h-p}. Specifically, the level smoothing equation is a weighted average of the seasonally adjusted observation (x_t - s_{t-p}) and the non-seasonal forecast, obtained by summing the previous level and growth (e_{t-1} + b_{t-1}). The growth smoothing equation is a weighted average between the successive difference of the (de-seasonalized) level, (e_t - e_{t-1}), and the previous growth, b_{t-1}. Finally, the seasonal smoothing equation is a weighted average between the difference of the observation and the (de-seasonalized) level, (x_t - e_t), and the previous seasonal index s_{t-p}. These three weighted averages are controlled by the smoothing parameters α, β, and γ, respectively. A widely used modification of the additive Holt-Winters' method is to allow the damping of trends, which has been proven to produce robust multi-step forecasts (Svetunkov, 2016; McKenzie & Gardner Jr, 2010). The forecast with a damped trend can be rewritten as:

x̂_{t+h|t} = e_t + (φ + φ^2 + . . . + φ^h) b_t + s_{t+h-p}     (4)

where the growth is damped by a factor of φ. If φ = 1, it degenerates to the vanilla forecast. For 0 < φ < 1, as h → ∞ the growth component approaches an asymptote given by φ b_t / (1 - φ).
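The recurrences and damped forecast above can be sketched as follows; a minimal NumPy version whose initialization (level = mean of the first period, zero initial growth, detrended first-period seasonal indices) is a common simple choice of ours and is not specified here:

```python
import numpy as np

def holt_winters_additive(x, p, alpha, beta, gamma, phi=1.0, h=1):
    """Additive Holt-Winters smoothing (Eq. 3) with a damped h-step forecast
    (Eq. 4). x: 1-D array; p: seasonal period; assumes h <= p.
    Initialization below is an assumption, not part of the equations."""
    e, b = x[:p].mean(), 0.0
    s = list(x[:p] - x[:p].mean())                # initial seasonal indices
    for t in range(p, len(x)):
        e_prev = e
        e = alpha * (x[t] - s[t - p]) + (1 - alpha) * (e_prev + b)  # level
        b = beta * (e - e_prev) + (1 - beta) * b                    # growth
        s.append(gamma * (x[t] - e) + (1 - gamma) * s[t - p])       # seasonal
    damp = np.cumsum(phi ** np.arange(1, h + 1))  # phi + phi^2 + ... + phi^j
    season = np.array([s[len(x) - p + i] for i in range(h)])
    return e + damp * b + season                  # forecasts for h'=1..h
```

On a constant series the level locks onto the constant, growth and seasonal indices stay at zero, and every damped forecast equals the constant, which is a quick correctness check.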

B EXPONENTIAL SMOOTHING ATTENTION B.1 EXPONENTIAL SMOOTHING ATTENTION MATRIX

A_ES =
⎡ (1-α)^1    α              0          0    . . .   0 ⎤
⎢ (1-α)^2    α(1-α)         α          0    . . .   0 ⎥
⎢ (1-α)^3    α(1-α)^2       α(1-α)     α    . . .   0 ⎥
⎢    .           .             .       .      .     . ⎥
⎣ (1-α)^L    α(1-α)^{L-1}   . . .   α(1-α)^j  . . .  α ⎦

B.2 EFFICIENT EXPONENTIAL SMOOTHING ATTENTION ALGORITHM

Algorithm 1 PyTorch-style pseudocode of efficient A_ES

B.3 EFFICIENT LEVEL SMOOTHING

E^{(n)}_t = α * (E^{(n-1)}_t - S^{(n)}_t) + (1 - α) * (E^{(n)}_{t-1} + B^{(n)}_{t-1})
= α * (E^{(n-1)}_t - S^{(n)}_t) + (1 - α) * B^{(n)}_{t-1} + (1 - α) * [α * (E^{(n-1)}_{t-1} - S^{(n)}_{t-1}) + (1 - α) * (E^{(n)}_{t-2} + B^{(n)}_{t-2})]
= α * (E^{(n-1)}_t - S^{(n)}_t) + α(1 - α) * (E^{(n-1)}_{t-1} - S^{(n)}_{t-1}) + (1 - α) * B^{(n)}_{t-1} + (1 - α)^2 * B^{(n)}_{t-2} + (1 - α)^2 * [α * (E^{(n-1)}_{t-2} - S^{(n)}_{t-2}) + (1 - α) * (E^{(n)}_{t-3} + B^{(n)}_{t-3})]
. . .
= (1 - α)^t (E^{(n)}_0 - S^{(n)}_0) + Σ_{j=0}^{t-1} α(1 - α)^j * (E^{(n-1)}_{t-j} - S^{(n)}_{t-j}) + Σ_{k=1}^{t} (1 - α)^k * B^{(n)}_{t-k}
= A_ES(E^{(n-1)}_{t-L:t} - S^{(n)}_{t-L:t})_t + Σ_{k=1}^{t} (1 - α)^k * B^{(n)}_{t-k}

Based on the above expansion of the level equation, we observe that E^{(n)}_t can be computed as a sum of two terms: the first is given by an A_ES term, and we note that the second term can also be calculated using the conv1d_fft algorithm, resulting in a fast implementation of level smoothing. Algorithm 2 describes the naive implementation of ESA, which first constructs the exponential smoothing attention matrix A_ES and performs the full matrix-vector multiplication. Efficient A_ES relies on Algorithm 3 to achieve an O(L log L) complexity by speeding up the matrix-vector multiplication. Due to the lower triangular structure of A_ES (ignoring the first column), performing a matrix-vector multiplication with it is equivalent to performing a convolution with the last row. Algorithm 3 describes the pseudocode for fast convolutions using fast Fourier transforms.
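The naive O(L^2) strategy can be sketched directly from the matrix above; multiplying it with the values prepended by the initial state reproduces the exponential smoothing recurrence row by row (a sketch of the idea, not the paper's Algorithm 2 verbatim):

```python
import numpy as np

def es_attention_matrix(L, alpha):
    """Construct the (L, L+1) exponential smoothing attention matrix A_ES.
    Column 0 weights the initial state v0; column j >= 1 weights V_j."""
    A = np.zeros((L, L + 1))
    for t in range(1, L + 1):
        A[t - 1, 0] = (1 - alpha) ** t            # initial-state weight
        for j in range(1, t + 1):
            A[t - 1, j] = alpha * (1 - alpha) ** (t - j)
    return A
```

Each row is the previous row shifted right by one (ignoring the first column), which is exactly the structure the FFT-based algorithm exploits.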

C DISCRETE FOURIER TRANSFORM

The DFT of a sequence sampled at regular intervals, x = (x_0, x_1, ..., x_{N-1}), is a sequence of complex numbers,

c_k = Σ_{n=0}^{N-1} x_n · exp(-i2πkn/N),  for k = 0, 1, ..., N-1,

where the c_k are known as the Fourier coefficients of their respective Fourier frequencies. Due to the conjugate symmetry of the DFT for real-valued signals, we simply consider the first ⌊N/2⌋ + 1 Fourier coefficients, and thus denote the DFT as F : R^N → C^{⌊N/2⌋+1}. The DFT maps a signal to the frequency domain, where each Fourier coefficient can be uniquely represented by its amplitude |c_k| and phase ϕ(c_k),

|c_k| = √(R{c_k}² + I{c_k}²),  ϕ(c_k) = tan⁻¹(I{c_k}/R{c_k}),

where R{c_k} and I{c_k} are the real and imaginary components of c_k, respectively. Finally, the inverse DFT maps the frequency domain representation back to the time domain,

x_n = F⁻¹(c)_n = (1/N) Σ_{k=0}^{N-1} c_k · exp(i2πkn/N).

D IMPLEMENTATION DETAILS

D.1 HYPERPARAMETERS

For all experiments, we use the same hyperparameters for the encoder layers, decoder stacks, model dimensions, feedforward layer dimensions, number of heads in multi-head exponential smoothing attention, and kernel size for input embedding, as listed in Table 5. We perform hyperparameter tuning via a grid search over the number of frequencies K, the lookback window size, and the learning rate, selecting the settings which perform best on the validation set based on MSE (on results averaged over three runs). The search range is reported in Table 5; the lookback window size search range was set to the horizon sizes of the respective datasets.

D.2 OPTIMIZATION

We use the Adam optimizer (Kingma & Ba, 2015) with β1 = 0.9, β2 = 0.999, and ϵ = 1e-08, and a batch size of 32. We schedule the learning rate with linear warmup over 3 epochs, and cosine annealing thereafter, for a total of 15 training epochs for all datasets. The minimum learning rate is set to 1e-30.
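The learning rate schedule described above, linear warmup followed by cosine annealing, can be sketched as follows. This is a per-epoch sketch with a hypothetical helper name `lr_at_epoch`; the exact step granularity used in training is an assumption.

```python
import math

def lr_at_epoch(epoch, base_lr, warmup_epochs=3, total_epochs=15, min_lr=1e-30):
    """Linear warmup over the first warmup_epochs, then cosine annealing
    from base_lr down to min_lr over the remaining epochs."""
    if epoch < warmup_epochs:
        # linear ramp from base_lr / warmup_epochs up to base_lr
        return base_lr * (epoch + 1) / warmup_epochs
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```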
For smoothing and damping parameters, we set the learning rate to be 100 times larger and do not use learning rate scheduling. Training was done on an Nvidia A100 GPU.
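Returning to the DFT of Appendix C, the sketch below checks the rfft conventions numerically and illustrates keeping only the K largest-amplitude frequencies, in the spirit of the frequency attention whose K is tuned in D.1. The helper `frequency_attention` is hypothetical and is not the model's actual implementation.

```python
import numpy as np

def frequency_attention(x, K):
    """Keep only the K largest-amplitude Fourier components of a real signal
    (a simplified sketch; any special handling of the zero-frequency term in
    the actual model is not reproduced here)."""
    c = np.fft.rfft(x)                   # first floor(N/2)+1 coefficients
    amp = np.abs(c)                      # amplitude |c_k|
    keep = np.argsort(amp)[-K:]          # indices of the top-K amplitudes
    mask = np.zeros_like(c)
    mask[keep] = c[keep]
    return np.fft.irfft(mask, n=len(x))  # back to the time domain
```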

D.3 REGULARIZATION

We apply two forms of regularization during the training phase.

Dropout. We apply dropout (Srivastava et al., 2014) with a rate of p = 0.2 across the model. Dropout is applied on the outputs of the Input Embedding, Frequency Self-Attention, and Multi-Head ES Attention blocks, in the Feedforward block (after activation and before normalization), on the attention weights, as well as on the damping weights.

Noise Injection. We utilize a composition of three noise distributions, applied in the following order: scale, shift, and jitter, activating with a probability of 0.5.
1. Scale: the time-series is scaled by a single random scalar value, obtained by sampling ϵ ∼ N(0, 0.2); each time step becomes x̃_t = ϵx_t.
2. Shift: the time-series is shifted by a single random scalar value, obtained by sampling ϵ ∼ N(0, 0.2); each time step becomes x̃_t = x_t + ϵ.
3. Jitter: i.i.d. Gaussian noise ϵ_t ∼ N(0, 0.2) is added to each time step; each time step becomes x̃_t = x_t + ϵ_t.
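A minimal sketch of this noise injection follows, assuming each of the three transforms fires independently with probability 0.5 (the text leaves the exact activation scheme ambiguous, so this per-transform coin flip is an assumption).

```python
import numpy as np

def inject_noise(x, rng, p=0.5, sigma=0.2):
    """Scale, shift, then jitter augmentation, each applied with probability p."""
    x = x.copy()
    if rng.random() < p:                      # scale by one random scalar
        x = rng.normal(0, sigma) * x
    if rng.random() < p:                      # shift by one random scalar
        x = x + rng.normal(0, sigma)
    if rng.random() < p:                      # jitter: i.i.d. noise per step
        x = x + rng.normal(0, sigma, size=x.shape)
    return x
```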

E DATASETS

ETT 1 Electricity Transformer Temperature (Zhou et al., 2021) is a multivariate time-series dataset comprising load and oil temperature data recorded every 15 minutes from electricity transformers. ETT consists of two variants, ETTm and ETTh, where ETTh is the hourly-aggregated version of ETTm, the original 15-minute-level dataset. ECL 2 Electricity Consuming Load measures the electricity consumption of 321 clients over two years; the original dataset was collected at the 15-minute level, but is pre-processed into an hourly-level dataset. Exchange 3 Exchange (Lai et al., 2018) tracks the daily exchange rates of eight countries (Australia, United Kingdom, Canada, Switzerland, China, Japan, New Zealand, and Singapore) from 1990 to 2016. Traffic 4 Traffic is an hourly dataset from the California Department of Transportation describing road occupancy rates on San Francisco Bay Area freeways. Weather 5 Weather measures 21 meteorological indicators, such as air temperature and humidity, every 10 minutes for the year of 2020.

F SYNTHETIC DATASET

The synthetic dataset is constructed as a combination of trend and seasonal components. Each instance in the dataset has a lookback window length of 720 and a forecast horizon length of 192. The trend follows a nonlinear, saturating pattern, b(t) = 1/(1 + exp(β_0(t - β_1))), where β_0 = -0.2, β_1 = 720. The seasonal pattern is a complex periodic pattern formed by a sum of sinusoids. Concretely, s(t) = A_1 cos(2πf_1 t) + A_2 cos(2πf_2 t), where f_1 = 1/10, f_2 = 1/13 are the frequencies and A_1 = A_2 = 0.15 are the amplitudes. During the training phase, we add an additional noise component, i.i.d. Gaussian noise with 0.05 standard deviation. Finally, the i-th instance of the dataset is x_i = [x_i(1), x_i(2), ..., x_i(720 + 192)], where x_i(t) = b(t) + s(t + i).

1 https://github.com/zhouhaoyi/ETDataset
2 https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
3 https://github.com/laiguokun/multivariate-time-series-data
4 https://pems.dot.ca.gov/
5 https://www.bgc-jena.mpg.de/wetter/
6 https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html
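The construction above can be sketched as follows (hypothetical helper `synthetic_instance`; the noise component is off by default, as in the test phase).

```python
import numpy as np

def synthetic_instance(i, lookback=720, horizon=192, noise=0.0, rng=None):
    """Generate the i-th synthetic instance: saturating trend b(t) plus a
    two-sinusoid seasonal pattern s(t + i), following Appendix F."""
    t = np.arange(lookback + horizon)
    b = 1.0 / (1.0 + np.exp(-0.2 * (t - 720)))        # beta0 = -0.2, beta1 = 720
    s = (0.15 * np.cos(2 * np.pi * (t + i) / 10)       # f1 = 1/10
         + 0.15 * np.cos(2 * np.pi * (t + i) / 13))    # f2 = 1/13
    x = b + s
    if noise > 0 and rng is not None:                  # training-phase jitter
        x = x + rng.normal(0, noise, size=x.shape)
    return x
```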



Figure 1: Seasonal-trend decomposed forecasts on synthetic data with ground truth seasonal and trend components. Top row: combined forecast. Middle row: trend component forecast. Bottom row: seasonal component forecast. ETSformer is compared to two competing decomposed Transformer baselines, Autoformer and FEDformer. As seen in the visualization, ETSformer exhibits a more disentangled seasonal-trend decomposition which accurately tracks the ground truth components. Not visualized here is ETSformer's unique ability to further separate the trend into level and growth components.

Figure 2: ETSformer model architecture.

It comprises N Growth + Seasonal (G+S) Stacks and a Level Stack. The G+S Stack consists of the Growth Damping (GD) and FA blocks.

Figure 3: Comparison between different attention mechanisms. (a) Full, (b) Sparse, and (c) Log-sparse Attentions are adaptive mechanisms, where the green circles represent attention weights adaptively calculated by a point-wise dot-product query, which depend on various factors including the time-series values and additional covariates (e.g., positional encodings, time features, etc.). (d) The Auto-Correlation mechanism considers sliding dot-product queries to construct attention weights for each rolled input series. We introduce (e) Exponential Smoothing Attention (ESA) and (f) Frequency Attention (FA). ESA directly computes attention weights based on the relative time lag, without considering the input content, while FA attends to patterns which dominate with large magnitudes in the frequency domain.

Figure 4: ETSformer attention weights visualization and learned seasonal dependencies on the ECL dataset. For the weights visualizations, each row represents the attention weights a time step in the forecast horizon places on each time step in the lookback window. FA learns a clear periodicity, which is highlighted in the learned dependencies, where the top 6 time steps attended to by the query time step are highlighted in red. ESA displays exponentially decaying weights representing growth.

Figure 5: Autoformer Auto-Correlation mechanism weights on the ECL dataset.

Figure 6: Computational Efficiency Analysis. Values reported are based on the training phase of the ETTm2 multivariate setting. Horizon is fixed to 48 for lookback window plots, and lookback is fixed to 48 for forecast horizon plots. For runtime efficiency, values refer to the time for one iteration. The " " marker indicates an out-of-memory error for those settings.

...forward FFT operation, incurring O(L log L) complexity, as well as their Frequency Enhanced Modules requiring a large number of trainable parameters.

Algorithm 1: PyTorch-style pseudocode of efficient A_ES

    # conv1d_fft: efficient convolution operation implemented with
    #   fast Fourier transforms (Algorithm 3); outer: outer product
    # V: value matrix, shape: L x d
    # v0: initial state, shape: d
    # alpha: smoothing parameter, shape: 1

    # obtain exponentially decaying weights
    # and compute weighted combination
    powers = arange(L)                            # L
    weight = alpha * (1 - alpha) ** flip(powers)  # L
    output = conv1d_fft(V, weight, dim=0)         # L x d

    # compute contribution from initial state
    init_weight = (1 - alpha) ** (powers + 1)     # L
    init_output = outer(init_weight, v0)          # L x d
    return init_output + output

B.3 LEVEL SMOOTHING VIA EXPONENTIAL SMOOTHING ATTENTION

E FURTHER DETAILS ON ESA IMPLEMENTATION

Algorithm 2: PyTorch-style pseudocode of naive A_ES

    # mm: matrix multiplication; outer: outer product
    # repeat: einops-style tensor operation
    # gather: gathers values along an axis specified by dim
    # V: value matrix, shape: L x d
    # v0: initial state, shape: d
    # alpha: smoothing parameter, shape: 1
    L, d = V.shape

    # obtain exponentially decaying weights
    powers = arange(L)                              # L
    weight = alpha * (1 - alpha).pow(flip(powers))  # L

    # perform a strided roll operation:
    # roll the matrix along the columns in a strided manner,
    # i.e. the first row is shifted right by L-1 positions,
    # the second row by L-2, ..., and the last row by 0
    weight = repeat(weight, 'L -> T L', T=L)        # L x L
    indices = repeat(arange(L), 'L -> T L', T=L)
    indices = (indices - (arange(L) + 1).unsqueeze(1)) % L
    weight = gather(weight, dim=-1, index=indices)

    # triangle masking to obtain the exponential smoothing attention matrix
    weight = triangle_causal_mask(weight)
    output = mm(weight, V)

    init_weight = (1 - alpha) ** (powers + 1)
    init_output = outer(init_weight, v0)
    return init_output + output

Algorithm 3: PyTorch-style pseudocode of conv1d_fft

    # next_fast_len: find the next fast input size for the FFT (zero-padding)
    # rfft: one-dimensional discrete Fourier transform for real input
    # x.conj(): element-wise complex conjugate
    # irfft: inverse of rfft
    # roll: roll array elements along a given axis
    # index_select: index the input tensor along dimension dim
    #   using the entries in index
    # V: value matrix, shape: L x d
    # weight: exponential smoothing attention vector, shape: L
    # dim: dimension to perform the convolution on

    # obtain lengths of the sequences to convolve
    N = V.size(dim)
    M = weight.size(dim)

    # Fourier transforms of the inputs
    fast_len = next_fast_len(N + M - 1)
    F_V = rfft(V, fast_len, dim=dim)
    F_weight = rfft(weight, fast_len, dim=dim)

    # multiplication and inverse transform
    F_V_weight = F_V * F_weight.conj()
    out = irfft(F_V_weight, fast_len, dim=dim)
    out = out.roll(-1, dim)

    # select the correct indices
    idx = arange(fast_len - N, fast_len)
    out = out.index_select(dim, idx)
    return out
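The FFT-based trick in Algorithm 3 can be checked in NumPy: multiplying by the lower triangular part of A_ES (ignoring the initial-state column) is the same as a causal cross-correlation with the flipped decay weights. The helper `conv1d_fft` below is a 1-D NumPy specialization written for this check; padding to N + M - 1 replaces `next_fast_len`, which is only a speed optimization.

```python
import numpy as np

def conv1d_fft(V, weight):
    """FFT-based causal weighting from Algorithm 3, specialized to 1-D:
    returns out[t] = sum_{j <= t} weight[M-1-(t-j)] * V[j]."""
    N, M = len(V), len(weight)
    fast_len = N + M - 1                # any length >= N+M-1 avoids wrap-around
    F_V = np.fft.rfft(V, fast_len)
    F_w = np.fft.rfft(weight, fast_len)
    out = np.fft.irfft(F_V * F_w.conj(), fast_len)  # circular cross-correlation
    out = np.roll(out, -1)
    return out[fast_len - N:]           # select the causal outputs
```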

ILI 6 Influenza-like Illness records the ratio of patients seen with ILI to the total number of patients on a weekly basis, obtained from the Centers for Disease Control and Prevention of the United States between 2002 and 2021.

Multivariate forecasting results over various forecast horizons. Best results are bolded, and second best results are underlined.

Ablation study on the various components (Level, Growth, Season) of ETSformer, averaged over multiple horizons {24, 96, 192, 336, 720} for ETTm2, ECL, and Traffic, and {24, 36, 48, 60} for ILI.

Ablation study on the effectiveness of the MH-ESA design.

MSE of decomposed forecasts over the synthetic dataset's test set (1000 samples).

Hyperparameters used in ETSformer.

I LAYER ANALYSIS

We provide additional analysis on the number of layers, as well as ablations in the observation space (i.e., removing the embedding layer so that there is no projection into representation space). We observe that learning deep representations leads to a significant increase in performance, and the optimal number of layers is around 2 to 3, before overfitting occurs.

Figure 7: Visualization of decomposed forecasts from ETSformer on real world datasets, ETTh1, ECL, and Weather. Note that the season is zero-centered, and the trend successfully tracks the level of the time-series. Due to the long sequence forecasting setting and damping, the growth component is not visually obvious, but notice that for the Weather dataset, the trend pattern has a strong downward slope initially (near time step 0) which is quickly damped.

