MICN: MULTI-SCALE LOCAL AND GLOBAL CONTEXT MODELING FOR LONG-TERM SERIES FORECASTING

Abstract

Recently, Transformer-based methods have achieved surprising performance in the field of long-term series forecasting, but the attention mechanism for computing global correlations entails high complexity, and it does not allow for the targeted modeling of local features as CNN structures do. To solve these problems, we propose to combine local features and global correlations to capture the overall view of a time series (e.g., fluctuations, trends). To fully exploit the underlying information in the time series, a multi-scale branch structure is adopted to model different potential patterns separately. Each pattern is extracted with downsampling convolution and isometric convolution for local features and global correlations, respectively. In addition to being more effective, our proposed method, termed Multi-scale Isometric Convolution Network (MICN), is more efficient, with complexity linear in the sequence length when suitable convolution kernels are used. Our experiments on six benchmark datasets show that, compared with state-of-the-art methods, MICN yields 17.2% and 21.6% relative improvements for multivariate and univariate time series, respectively. Code is available at https://github.com/wanghq21/MICN.

1. INTRODUCTION

Research on time series forecasting is widely applied in the real world, including sensor network monitoring (Papadimitriou & Yu, 2006), weather forecasting, economics and finance (Zhu & Shasha, 2002), disease propagation analysis (Matsubara et al., 2014), and electricity forecasting. In particular, long-term time series forecasting is increasingly in demand in practice, so this paper focuses on the long-term forecasting task: predicting the values $X_{t+1}, X_{t+2}, \ldots, X_{t+T-1}, X_{t+T}$ for a future period based on the observations $X_1, X_2, \ldots, X_{t-1}, X_t$ from a historical period, where $T \gg t$.

As a classic CNN-based model, TCN (Bai et al., 2018) uses causal convolution to model temporal causality and dilated convolution to expand the receptive field. It integrates the local information of the sequence well and achieves competitive results in short- and medium-term forecasting (Sen et al., 2019; Borovykh et al., 2017). However, limited by the size of its receptive field, TCN often needs many layers to model the global relationships of a time series, which greatly increases the complexity of the network and the difficulty of training the model.

Transformers (Vaswani et al., 2017), based on the attention mechanism, show great power on sequential data in natural language processing (Devlin et al., 2019; Brown et al., 2020), audio processing (Huang et al., 2019), and even computer vision (Dosovitskiy et al., 2021; Liu et al., 2021b). They have also recently been applied to long-term series forecasting (Li et al., 2019b; Wen et al., 2022), where they model the long-term dependencies of sequences effectively, enabling leaps in the accuracy and length of time series forecasts (Zhu & Soricut, 2021; Wu et al., 2021b; Zhou et al., 2022). The learned attention matrix represents the correlations between different time points of the sequence and explains relatively well how the model makes future predictions from past information. However, attention has quadratic complexity, and many of the computations between token pairs are non-essential, so reducing its computational cost is an interesting research direction. Some notable models include LogTrans (Li et al., 2019b), Informer (Zhou et al., 2021), Reformer (Kitaev et al., 2020), Autoformer (Wu et al., 2021b), Pyraformer (Liu et al., 2021a), and FEDformer (Zhou et al., 2022).

However, time series are a special kind of sequence, and no unified modeling direction has emerged so far. In this paper, we combine the modeling perspective of CNNs with that of Transformers and build models from the realistic characteristics of the sequences themselves, i.e., local features and global correlations. Local features represent the characteristics of a sequence over a small period $T$, and global correlations are the correlations exhibited across many periods $T_1, T_2, \ldots, T_{n-1}, T_n$. For example, the temperature at a given moment is not only influenced by the specific changes during that day but may also be correlated with the overall trend over a longer period (e.g., a week or a month). We can identify the value at a time point more accurately by learning the overall characteristics of that period and the correlations among the many periods before it. Therefore, a good forecasting method should have the following two properties: (1) the ability to extract local features to measure short-term changes, and (2) the ability to model global correlations to measure the long-term trend.
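To make the receptive-field limitation concrete, here is a minimal PyTorch sketch (our own illustration, not the TCN authors' code; the kernel size and dilation schedule are assumptions) showing how slowly the receptive field of a dilated causal convolution stack grows:

```python
import torch
import torch.nn as nn

# Sketch of a TCN-style stack of dilated causal convolutions.
# Kernel size and dilation schedule are illustrative assumptions.
kernel_size = 3
dilations = [1, 2, 4, 8, 16]  # dilation doubles per layer, as in TCN

layers = []
for d in dilations:
    # Left-only padding keeps each output step causal (no future leakage)
    # and preserves the sequence length.
    layers.append(nn.ConstantPad1d(((kernel_size - 1) * d, 0), 0.0))
    layers.append(nn.Conv1d(1, 1, kernel_size, dilation=d))
stack = nn.Sequential(*layers)

x = torch.randn(1, 1, 128)            # (batch, channels, length)
print(stack(x).shape)                 # torch.Size([1, 1, 128])

# Receptive field of the whole stack: 1 + (k - 1) * sum(dilations).
# With k = 3 and five layers this is only 63 steps, so covering a long
# history requires many more layers, which is the limitation noted above.
print(1 + (kernel_size - 1) * sum(dilations))  # 63
```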
Based on this, we propose the Multi-scale Isometric Convolution Network (MICN). We use multiple branches with different convolution kernels to separately model the different potential patterns of the sequence. For each branch, we extract the local features of the sequence with a local module based on downsampling convolution, and on top of this we model the global correlations with a global module based on isometric convolution. Finally, a Merge operation fuses the information about different patterns from the several branches. This design reduces the time and space complexity to linear, eliminating many unnecessary and redundant calculations. MICN achieves state-of-the-art accuracy on six real-world benchmarks. The contributions are summarized as follows:

• We propose MICN, which is based on a convolution structure, to efficiently replace self-attention; it achieves linear computational complexity and memory cost.

• We propose a multi-branch framework to deeply mine the intricate temporal patterns of time series, which validates the necessity and effectiveness of modeling different patterns separately when the input data are complex and variable.

• We propose a local-global structure to implement information aggregation and long-term dependency modeling for time series, which outperforms the self-attention family and the Auto-Correlation mechanism. We adopt downsampling one-dimensional convolution for local feature extraction and isometric convolution for global correlation discovery; a minimal sketch of this structure is given after this list.

• Our empirical studies show that the proposed model improves on the performance of the state-of-the-art method by 17.2% and 21.6% for multivariate and univariate forecasting, respectively.
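The following is a minimal sketch of the local-global idea in PyTorch (our own paraphrase of the description above, not the released MICN code; the channel count, scales, and the zero-padding scheme for isometric convolution are assumptions):

```python
import torch
import torch.nn as nn

class LocalGlobalBranch(nn.Module):
    """One branch: downsampling convolution (local) + isometric convolution (global).

    A sketch of the local-global structure described above; the
    hyperparameters are illustrative, not the paper's configuration.
    """
    def __init__(self, channels: int, scale: int, seq_len: int):
        super().__init__()
        # Local module: strided 1-D convolution downsamples by `scale`,
        # aggregating short-term changes within each window.
        self.local = nn.Conv1d(channels, channels, kernel_size=scale,
                               stride=scale)
        self.short_len = seq_len // scale
        # Global module: isometric convolution, i.e. a convolution whose
        # kernel spans the whole downsampled sequence, so every output
        # step can see all previous steps (global correlations).
        self.global_conv = nn.Conv1d(channels, channels,
                                     kernel_size=self.short_len)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, seq_len)
        local = self.local(x)                      # (B, C, seq_len // scale)
        # Zero-pad (short_len - 1) steps on the left so the kernel stays
        # causal and the output keeps the downsampled length.
        padded = nn.functional.pad(local, (self.short_len - 1, 0))
        return self.global_conv(padded)            # (B, C, seq_len // scale)

# Multi-scale: one branch per candidate pattern scale.
branches = nn.ModuleList(
    LocalGlobalBranch(channels=8, scale=s, seq_len=96) for s in (4, 8, 12)
)
x = torch.randn(32, 8, 96)
outs = [b(x) for b in branches]   # one tensor per scale
print([o.shape for o in outs])    # lengths 24, 12, 8
```

In the full model, the branch outputs are restored to a common length before the Merge operation fuses them; the sketch omits that step for brevity.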

2. RELATED WORK

2.1. CNNS AND TRANSFORMERS

Convolutional neural networks (CNNs) are widely used in computer vision, natural language processing, and speech recognition (Sainath et al., 2013; Li et al., 2019a; Han et al., 2020). It is widely believed that this success is due to the convolution operation, which introduces certain inductive biases, such as translation invariance. CNN-based methods are usually modeled from a local perspective, and convolution kernels are very good at extracting local information from the input. By continuously stacking convolution layers, the receptive field can be extended to the entire input space, enabling the aggregation of the overall information.

Transformer (Vaswani et al., 2017) has achieved the best performance in many fields since its emergence, mainly due to the attention mechanism. Unlike models that extract local information directly from the input, the attention mechanism does not require stacking many layers to capture global information. Although its complexity is higher and learning is more difficult, it is more capable of learning long-term dependencies (Vaswani et al., 2017).

Although CNNs and Transformers model sequences from different perspectives, they both aim at efficient utilization of the overall information of the input. In this paper, we combine the advantages of both perspectives, modeling the local features and the global correlations of time series.
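To make the contrast between the two perspectives concrete, here is a small sketch (the sequence length, model dimension, and kernel size are illustrative assumptions, not values from the paper):

```python
import torch

L, d = 1024, 64                 # sequence length, model dimension (assumed)
q = torch.randn(L, d)
k = torch.randn(L, d)

# Full self-attention materializes an L x L score matrix:
# O(L^2) time and memory, but every step sees every other step in one layer.
scores = q @ k.T
print(scores.shape)             # torch.Size([1024, 1024])

# A convolution layer with kernel size s costs O(L * s) per channel,
# but a single layer only sees s neighboring steps, so reaching a global
# receptive field requires stacking many layers.
s = 3
print(L * s)                    # 3072 multiply-accumulates per channel
```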

