SCALEFORMER: ITERATIVE MULTI-SCALE REFINING TRANSFORMERS FOR TIME SERIES FORECASTING

Abstract

The performance of time series forecasting has recently been greatly improved by the introduction of transformers. In this paper, we propose a general multi-scale framework that can be applied to state-of-the-art transformer-based time series forecasting models (FEDformer, Autoformer, etc.). By iteratively refining a forecasted time series at multiple scales with shared weights, introducing architecture adaptations, and a specially-designed normalization scheme, we are able to achieve significant performance improvements, from 5.5% to 38.5% across datasets and transformer architectures, with minimal additional computational overhead. Via detailed ablation studies, we demonstrate the effectiveness of each of our contributions across the architecture and methodology. Furthermore, our experiments on various public datasets demonstrate that the proposed improvements outperform their corresponding baseline counterparts. Our code is publicly available at https://github.com/BorealisAI/scaleformer.

1. INTRODUCTION

Integrating information at different time scales is essential to accurately model and forecast time series (Mozer, 1991; Ferreira et al., 2006). From weather patterns that fluctuate both locally and globally, as well as throughout the day and across seasons and years, to radio carrier waves which contain relevant signals at different frequencies, time series forecasting models need to encourage scale awareness in learnt representations. While transformer-based architectures have become the mainstream and state-of-the-art for time series forecasting in recent years, advances have focused mainly on mitigating the standard quadratic complexity in time and space, e.g., via sparse attention (Li et al., 2019; Zhou et al., 2021) or structural changes (Xu et al., 2021; Zhou et al., 2022b), rather than on explicit scale-awareness. The essential cross-scale feature relationships are often learnt implicitly, and are not encouraged by architectural priors of any kind beyond the stacked attention blocks that characterize transformer models. Autoformer (Xu et al., 2021) and FEDformer (Zhou et al., 2022b) introduced some emphasis on scale-awareness by enforcing different computational paths for the trend and seasonal components of the input time series; however, this structural prior only covers two scales: low- and high-frequency components. Given their importance to forecasting, can we make transformers more scale-aware?

We enable this scale-awareness with Scaleformer. In our proposed approach, showcased in Figure 1, time series forecasts are iteratively refined at successive time scales, allowing the model to better capture the inter-dependencies and specificities of each scale. However, scale awareness by itself is not sufficient. Iterative refinement at different scales can cause significant distribution shifts between intermediate forecasts, which can lead to runaway error propagation. To mitigate this issue, we introduce cross-scale normalization at each step.
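The coarse-to-fine refinement loop can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: `forecaster` stands in for any hypothetical backbone (e.g. a transformer) mapping a downsampled history and the previous scale's forecast to a refined forecast, and average pooling / nearest-neighbour upsampling are assumed as the scale-change operators.

```python
import numpy as np

def downsample(x, factor):
    """Average-pool a 1-D series by the given factor (remainder truncated)."""
    n = (len(x) // factor) * factor
    return x[:n].reshape(-1, factor).mean(axis=1)

def upsample(x, factor):
    """Nearest-neighbour upsampling back to a finer scale."""
    return np.repeat(x, factor)

def multi_scale_forecast(history, horizon, forecaster, scales=(8, 4, 2, 1)):
    """Refine a forecast from the coarsest scale to the finest.

    At each scale s, the model sees the history downsampled by s and the
    previous (coarser) forecast upsampled to the current resolution.
    """
    forecast = np.zeros(horizon // scales[0])  # initial coarse guess
    for i, s in enumerate(scales):
        hist_s = downsample(history, s)
        prev = forecast if i == 0 else upsample(forecast, scales[i - 1] // s)
        forecast = forecaster(hist_s, prev)
    return forecast
```

With shared weights across scales (as in the paper), the same `forecaster` is reused at every iteration, which is what keeps the computational overhead of the framework small.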
Our approach re-orders model capacity to shift the focus onto scale awareness, but does not fundamentally alter the attention-driven paradigm of transformers. As a result, it can be readily adapted to work jointly with multiple recent time series transformer architectures, acting broadly orthogonally to their own contributions. Leveraging this, we operate with various transformer-based backbones (e.g. FEDformer, Autoformer, Informer, Reformer, Performer) to further probe the effect of our multi-scale method on a variety of experimental setups. Our contributions are as follows: (1) we introduce a novel iterative scale-refinement paradigm that can be readily adapted to a variety of transformer-based time series forecasting architectures. (2) To minimize distribution shifts between scales and windows, we introduce cross-scale normalization on the outputs of the transformer. (3) Using Informer and Autoformer, two state-of-the-art architectures, as backbones, we demonstrate empirically the effectiveness of our approach on a variety of datasets. Depending on the choice of transformer architecture, our multi-scale framework results in mean squared error reductions ranging from 5.5% to 38.5%. (4) Via a detailed ablation study of our findings, we demonstrate the validity of our architectural and methodological choices.
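Contribution (2), the cross-scale normalization, can be illustrated with a small sketch. This is an assumption-laden simplification of the scheme described above, not the exact formulation: it centres both the look-back window and the intermediate forecast by a statistic shared between them (here, the mean of their concatenation), so that successive scales see inputs from a comparable distribution and intermediate distribution shifts do not compound.

```python
import numpy as np

def cross_scale_normalize(x_hist, x_forecast):
    """Shift the look-back window and the intermediate forecast by a
    shared statistic so both lie in a comparable range.

    Returns the normalized pair and the shift, which can be added back
    to the final forecast to recover the original scale.
    """
    mu = np.concatenate([x_hist, x_forecast]).mean()
    return x_hist - mu, x_forecast - mu, mu
```

Because the shift is computed jointly from input and output, an intermediate forecast that drifts away from the input distribution is pulled back before being fed to the next scale, limiting runaway error propagation.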

2. RELATED WORKS

Time-series forecasting: Time-series forecasting plays an important role in many domains, including weather forecasting (Murphy, 1993), inventory planning (Syntetos et al., 2009), astronomy (Scargle, 1981), and economic and financial forecasting (Krollner et al., 2010). One of the specificities of time series data is the need to capture seasonal trends (Brockwell & Davis, 2009). There exists a vast variety of time-series forecasting models (Box & Jenkins, 1968; Hyndman et al., 2008; Salinas et al., 2020; Rangapuram et al., 2018; Bai et al., 2018; Wu et al., 2020). Early approaches such as ARIMA (Box & Jenkins, 1968) and exponential smoothing models (Hyndman et al., 2008) were followed by the introduction of neural-network-based approaches involving either Recurrent Neural Networks (RNNs) and their variants (Salinas et al., 2020; Rangapuram et al., 2018) or Temporal Convolutional Networks (TCNs) (Bai et al., 2018). More recently, time-series Transformers (Wu et al., 2020; Zerveas et al., 2021; Tang & Matteson, 2021) were introduced for the forecasting task, leveraging self-attention mechanisms to learn complex patterns and dynamics from time series data. Informer (Zhou et al., 2021) reduced the quadratic complexity in time and memory to O(L log L) by enforcing sparsity in the attention mechanism with ProbSparse attention. Yformer (Madhusudhanan et al., 2021) proposed a Y-shaped encoder-decoder architecture to take advantage of multi-resolution embeddings. Autoformer (Xu et al., 2021) used a cross-correlation-based attention mechanism to operate at the level of subsequences. FEDformer (Zhou et al., 2022b) employs a frequency transform to decompose the sequence into multiple frequency-domain modes to extract features, further improving the performance of Autoformer.

Multi-scale neural architectures: Multi-scale and hierarchical processing is useful in many domains, such as computer vision (Fan et al., 2021; Zhang et al., 2021; Liu et al., 2018), natural language processing (Nawrot et al., 2021; Subramanian et al., 2020; Zhao et al., 2021), and time series forecasting (Chen et al., 2022; Ding et al., 2020). The Multiscale Vision Transformer (Fan et al., 2021) was proposed for video and image recognition, connecting the seminal idea of multi-scale feature hierarchies with transformer models; however, it focuses on the spatial domain and is specially designed for computer vision tasks. Cui et al. (2016) proposed using different transformations of a time series, such as downsampling and smoothing, in parallel to the original signal to better capture temporal patterns and reduce the effect of random noise. Many different architectures have been proposed recently (Chung et al., 2016; Che et al., 2018; Shen et al., 2020; Chen et al., 2021) to improve RNNs in tasks such as language processing, computer vision, time-series analysis, and speech recognition. However, these methods mainly propose new RNN-based modules that are not directly applicable to transformers. The same direction has also been investigated in Transformer, TCN, and MLP models. Recent work by Du et al. (2022) proposed multi-scale segment-wise correlations as a multi-scale version of the self-attention mechanism. Our work is orthogonal to the above methods.

Figure 1: Intermediate forecasts by our model at different time scales. Iterative refinement of a time series forecast is a strong structural prior that benefits time series forecasting.

