ETSFORMER: EXPONENTIAL SMOOTHING TRANSFORMERS FOR TIME-SERIES FORECASTING

Abstract

Transformers have recently been actively studied for time-series forecasting. While often showing promising results in various scenarios, traditional Transformers are not designed to fully exploit the characteristics of time-series data and thus suffer from some fundamental limitations, e.g., they are generally not decomposable or interpretable, and are neither effective nor efficient for long-term forecasting. In this paper, we propose ETSformer, a novel time-series Transformer architecture, which exploits the principle of exponential smoothing methods to improve Transformers for time-series forecasting. Specifically, ETSformer leverages a novel level-growth-seasonality decomposed Transformer architecture which leads to more interpretable and disentangled decomposed forecasts. We further propose two novel attention mechanisms, the exponential smoothing attention and the frequency attention, which are specially designed to overcome the limitations of the vanilla attention mechanism for time-series data. Extensive experiments on the long sequence time-series forecasting (LSTF) benchmark validate the efficacy and advantages of the proposed method. Code is attached in the supplementary material, and will be made publicly available.

1. INTRODUCTION

Transformer models have achieved great success in the fields of natural language processing (Vaswani et al., 2017; Devlin et al., 2019), computer vision (Carion et al., 2020; Dosovitskiy et al., 2021), and more recently, time-series (Li et al., 2019; Wu et al., 2021; Zhou et al., 2021; Zerveas et al., 2021; Zhou et al., 2022). While the success of Transformer models has been widely attributed to the self-attention mechanism, alternative forms of attention, infused with the appropriate inductive biases, have been introduced to tackle the unique properties of their underlying task or data (You et al., 2020; Raganato et al., 2020). In time-series forecasting, decomposition-based architectures such as Autoformer and FEDformer (Wu et al., 2021; Zhou et al., 2022) have incorporated time-series specific inductive biases, leading to increased accuracy and more interpretable forecasts (by decomposing forecasts into seasonal and trend components). Their success has been motivated by: (i) disentangling seasonal and trend representations via seasonal-trend decomposition (Cleveland & Tiao, 1976; Cleveland et al., 1990; Woo et al., 2022), and (ii) replacing the vanilla pointwise dot-product attention, which handles time-series patterns such as seasonality and trend inefficiently, with time-series specific attention mechanisms such as the Auto-Correlation mechanism and Frequency-Enhanced Attention. While these existing works introduce the promising direction of interpretable and decomposed time-series forecasting for Transformer-based architectures, they suffer from two drawbacks. Firstly, they suffer from entangled seasonal-trend representations, evidenced in Figure 1, where the trend forecasts exhibit periodical patterns which should only be present in the seasonal component, and the seasonal component does not accurately track the (multiple) periodicities present in the ground truth seasonal component.
This arises due to their decomposition mechanism, which detects trend via a simple moving average over the input signal and detrends the signal by removing the detected trend component, an arguably naive approach. This method has many known pitfalls (Hyndman & Athanasopoulos, 2018), such as the trend-cycle component not being available for the first and last few observations, and over-smoothing rapid rises and falls. Secondly, their proposed replacements for the vanilla attention mechanism are not human interpretable, as demonstrated in Section 3.3. Model inspection and diagnosis allows us to better understand the forecasts generated by our models, attributing predictions to each component to make better downstream decisions. For an attention mechanism focusing on seasonality, we would expect the cross-attention map visualization to produce clear periodic patterns which shift smoothly across decoder time steps. Yet, the Auto-Correlation mechanism from Autoformer does not exhibit this property, yielding similar attention weights across decoder time steps, while the Frequency-Enhanced Attention from FEDformer does not have such model interpretability capabilities due to its complicated frequency domain attention. To address these limitations, we look towards the more principled approach of level-growth-season decomposition from ETS methods (Hyndman et al., 2008) (further introduced in Appendix A). This principle further deconstructs trend into level and growth components. To extract the level and growth components, we also draw on the idea of exponential smoothing, replacing the naive moving average: more recent data is weighted more highly than older data, reflecting the view that the more recent past is more relevant for making new predictions or identifying current trends.
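To make the contrast concrete, the following minimal NumPy sketch (an illustration, not the paper's implementation) compares the two weighting schemes: a centered moving average weights all observations in its window uniformly and is undefined at the boundaries, whereas simple exponential smoothing weights the j-th most recent observation by α(1 − α)^j, so the implied weights decay towards the past.

```python
import numpy as np

def moving_average_trend(x, window):
    """Naive trend estimate: a moving average with uniform weights.
    In 'valid' mode the estimate is unavailable for the first and last
    few observations, one of the pitfalls noted above."""
    kernel = np.full(window, 1.0 / window)
    return np.convolve(x, kernel, mode="valid")

def exponential_smoothing(x, alpha):
    """Simple exponential smoothing: the level is a weighted average of all
    past observations, with weights decaying as alpha * (1 - alpha)^j."""
    level = x[0]
    for t in range(1, len(x)):
        level = alpha * x[t] + (1 - alpha) * level
    return level

# Implied weights over the 5 most recent observations (most recent first):
alpha = 0.5
es_weights = alpha * (1 - alpha) ** np.arange(5)  # decaying: 0.5, 0.25, ...
ma_weights = np.full(5, 1.0 / 5)                  # uniform:  0.2, 0.2, ...
```

Note that the exponential smoothing weights are strictly decreasing in age, whereas the moving average treats the oldest point in its window exactly like the newest.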
At the same time, we leverage the idea of extracting the most salient periodic components in the frequency domain via the Fourier transform, to extract the global seasonal patterns present in the signal. These principles together yield a stronger decomposition strategy: first extract global periodic patterns as seasonality, then extract growth as the change in level in an exponentially smoothed manner. Motivated by the above, we propose ETSformer, an interpretable and efficient Transformer architecture for time-series forecasting which yields disentangled seasonal-trend forecasts. Instead of reusing the moving average operation for detrending, ETSformer overhauls the existing decomposition architecture by leveraging the level-growth-season principle, embedding it into a novel Transformer framework in a non-trivial manner. Next, we introduce interpretable and efficient attention mechanisms: Exponential Smoothing Attention (ESA) for trend, and Frequency Attention (FA) for seasonality. ESA assigns attention weights in an exponentially decreasing manner, with high values to nearby time steps and low values to far away time steps, thus specialising in extracting growth representations. FA leverages frequency domain representations to extract dominating seasonal patterns by selecting the Fourier bases with the K largest amplitudes. Both mechanisms have efficient implementations with O(L log L) complexity. Furthermore, we demonstrate human interpretable visualizations of both mechanisms in Section 3.3. To summarize, our key contributions are as follows:

• We introduce a novel decomposition Transformer architecture, incorporating the time-tested level-growth-season principle for more disentangled, human-interpretable time-series forecasts.

• We introduce two new attention mechanisms, ESA and FA, which incorporate stronger time-series specific inductive biases.
They achieve better efficiency than vanilla attention, and yield interpretable attention weights upon model inspection.

• The resulting method is a highly effective, efficient, and interpretable deep forecasting model. We show, via extensive empirical analysis, that ETSformer achieves performance competitive with state-of-the-art methods over 6 real-world datasets in both multivariate and univariate settings, and is highly efficient compared to competing methods.
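The two inductive biases described above can be sketched in a few lines of NumPy. This is a simplified single-series illustration under our own naming, not the paper's batched O(L log L) implementations: `esa_weights` shows the exponentially decaying attention profile of ESA for one query, and `frequency_attention` shows FA's core idea of keeping only the K Fourier bases with the largest amplitudes.

```python
import numpy as np

def esa_weights(L, alpha):
    """Exponentially decaying attention weights for a single query: the j-th
    most recent time step receives weight alpha * (1 - alpha)^j, so nearby
    steps dominate and distant steps are heavily discounted."""
    j = np.arange(L)  # j = 0 is the most recent step
    return alpha * (1 - alpha) ** j

def frequency_attention(x, k):
    """Keep only the k Fourier bases with the largest amplitudes and
    reconstruct the signal from them, extracting the dominant seasonal
    patterns while discarding the rest of the spectrum."""
    spec = np.fft.rfft(x)
    amp = np.abs(spec)
    topk = np.argsort(amp)[-k:]        # indices of the k largest amplitudes
    filtered = np.zeros_like(spec)
    filtered[topk] = spec[topk]        # zero out all other frequency bins
    return np.fft.irfft(filtered, n=len(x))
```

For example, applied to a pure sinusoid sampled over whole periods, `frequency_attention` with k = 1 recovers the signal exactly, since all of its spectral energy sits in a single Fourier bin.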



Figure 1: Seasonal-trend decomposed forecasts on synthetic data with ground truth seasonal and trend components. Top row: combined forecast. Middle row: trend component forecast. Bottom row: season component forecast. ETSformer is compared to two competing decomposed Transformer baselines, Autoformer and FEDformer. As seen in the visualization, ETSformer exhibits a more disentangled seasonal-trend decomposition which accurately tracks the ground truth components. Not visualized here is ETSformer's unique ability to further separate trend into level and growth components.

