MICN: MULTI-SCALE LOCAL AND GLOBAL CONTEXT MODELING FOR LONG-TERM SERIES FORECASTING

Abstract

Recently, Transformer-based methods have achieved surprising performance in long-term series forecasting, but the attention mechanism used to compute global correlations entails high complexity, and Transformers do not model local features in a targeted way as CNN structures do. To solve these problems, we propose to combine local features and global correlations to capture the overall view of time series (e.g., fluctuations, trends). To fully exploit the underlying information in the time series, a multi-scale branch structure is adopted to model different potential patterns separately. Each pattern is extracted with down-sampled convolution and isometric convolution for local features and global correlations, respectively. In addition to being more effective, our proposed method, termed Multi-scale Isometric Convolution Network (MICN), is more efficient, with complexity linear in the sequence length given suitable convolution kernels. Our experiments on six benchmark datasets show that, compared with state-of-the-art methods, MICN yields 17.2% and 21.6% relative improvements for multivariate and univariate time series, respectively. Code is available at https://github.com/wanghq21/MICN.

1. INTRODUCTION

Research on time series forecasting is widely applied in the real world, including sensor network monitoring (Papadimitriou & Yu, 2006), weather forecasting, economics and finance (Zhu & Shasha, 2002), disease propagation analysis (Matsubara et al., 2014), and electricity forecasting. In particular, long-term time series forecasting is increasingly in demand in practice, so this paper focuses on the long-term forecasting task: predicting the values for a future period, X_{t+1}, X_{t+2}, ..., X_{t+T}, based on observations from a historical period, X_1, X_2, ..., X_t, where T ≫ t. As a classic CNN-based model, TCN (Bai et al., 2018) uses causal convolution to model temporal causality and dilated convolution to expand the receptive field. It integrates the local information of the sequence well and achieves competitive results in short- and medium-term forecasting (Sen et al., 2019; Borovykh et al., 2017). However, limited by the receptive field size, TCN often needs many layers to model the global relationships of a time series, which greatly increases the complexity of the network and the training difficulty of the model. Transformers (Vaswani et al., 2017), based on the attention mechanism, show great power on sequential data in fields such as natural language processing (Devlin et al., 2019; Brown et al., 2020), audio processing (Huang et al., 2019), and even computer vision (Dosovitskiy et al., 2021; Liu et al., 2021b). They have also recently been applied to long-term series forecasting tasks (Li et al., 2019b; Wen et al., 2022) and can model the long-term dependence of sequences effectively, allowing leaps and bounds in the accuracy and length of time series forecasts (Zhu & Soricut, 2021; Wu et al., 2021b; Zhou et al., 2022).
The learned attention matrix represents the correlations between different time points of the sequence and can explain relatively well how the model makes future predictions based on past information. However, it has quadratic complexity, and many of the computations between token pairs are non-essential, so reducing its computational complexity is an interesting research direction. Some notable models include LogTrans (Li et al., 2019b), Informer (Zhou et al., 2021), Reformer (Kitaev et al., 2020), Autoformer (Wu et al., 2021b), Pyraformer (Liu et al., 2021a), and FEDformer (Zhou et al., 2022). However, for time series, a special kind of sequence, no unified modeling direction has emerged so far. In this paper, we combine the modeling perspective of CNNs with that of Transformers to build models from the realistic features of the sequences themselves, i.e., local features and global correlations. Local features represent the characteristics of a sequence over a small period T, and global correlations are the correlations exhibited between many periods T_1, T_2, ..., T_n. For example, the temperature at a given moment is influenced not only by the specific changes during that day but may also be correlated with the overall trend over a longer period (e.g., a week or a month). We can identify the value at a time point more accurately by learning the overall characteristics of that period and the correlations among the many periods before it. Therefore, a good forecasting method should have the following two properties: (1) the ability to extract local features to measure short-term changes, and (2) the ability to model global correlations to measure the long-term trend. Based on this, we propose the Multi-scale Isometric Convolution Network (MICN). We use multiple branches with different convolution kernels to model different potential pattern information of the sequence separately.
For each branch, we extract the local features of the sequence using a local module based on downsampling convolution, and on top of this we model the global correlations using a global module based on isometric convolution. Finally, a Merge operation fuses information about the different patterns from the several branches. This design reduces the time and space complexity to linear, eliminating many unnecessary and redundant calculations. MICN achieves state-of-the-art accuracy on five real-world benchmarks. The contributions are summarized as follows:

• We propose MICN, based on a convolution structure, to efficiently replace self-attention; it achieves linear computational complexity and memory cost.

• We propose a multi-branch framework to deeply mine the intricate temporal patterns of time series, which validates the need for and validity of separate modeling when the input data is complex and variable.

• We propose a local-global structure to implement information aggregation and long-term dependency modeling for time series, which outperforms the self-attention family and the Auto-Correlation mechanism. We adopt downsampling one-dimensional convolution for local feature extraction and isometric convolution for global correlation discovery.

• Our empirical studies show that the proposed model improves on the performance of the state-of-the-art method by 17.2% and 21.6% for multivariate and univariate forecasting, respectively.

2.1. CNNS AND TRANSFORMERS

Convolutional neural networks (CNNs) are widely used in computer vision, natural language processing, and speech recognition (Sainath et al., 2013; Li et al., 2019a; Han et al., 2020). It is widely believed that this success is due to the convolution operation, which introduces certain inductive biases, such as translation invariance. CNN-based methods usually model from the local perspective, and convolution kernels are very good at extracting local information from the input. By continuously stacking convolution layers, the receptive field can be extended to the entire input space, enabling aggregation of the overall information. Transformer (Vaswani et al., 2017) has achieved the best performance in many fields since its emergence, mainly owing to the attention mechanism. Unlike modeling local information directly from the input, the attention mechanism does not require stacking many layers to extract global information. Although its complexity is higher and learning is more difficult, it is more capable of learning long-term dependencies (Vaswani et al., 2017). Although CNNs and Transformers model from different perspectives, they both aim at efficient utilization of the overall information of the input. In this paper, combining the modeling principles of CNNs and Transformers, we consider both local and global context: we first extract local features of the data and then model global correlations on that basis. Furthermore, our method achieves lower computational cost and complexity.

2.2. MODELING BOTH LOCAL AND GLOBAL CONTEXT

Both local and global relationships play an important role in sequence modeling. Some works have been conducted to study how to combine local and global modeling into a unified model to achieve high efficiency and interpretability. Two well-known architectures are: Conformer (Gulati et al., 2020) and Lite Transformer (Wu et al., 2020) . Conformer is a variant of Transformer and has achieved state-of-the-art performance in many speech applications. It adopts the attention mechanism to learn the global interaction, the convolution module to capture the relative-offset-based local features, and combines these two modules sequentially. However, Conformer does not analyze in detail what local and global features are learned and how they affect the final output. There is also no explanation why the attention module is followed by a convolution module. Another limitation of Conformer is the quadratic complexity with respect to the sequence length due to self-attention. Lite Transformer also adopts convolution to extract local information and self-attention to capture long-term correlation, but it separates them into two branches for parallel processing. A visual analysis of the feature weights extracted from the two branches is also presented in the paper, which can provide a good interpretation of the model results. However, the parallel structure of the two branches determines that there may be some redundancy in its computation, and it still has the limitation of quadratic complexity. Whether the convolution and self-attention are serialized to extract local and global relationships step by step or in parallel, it inevitably results in quadratic time and space complexity. Therefore, in this paper, we propose a new framework for modeling local features and global correlations of time series along with a new module instead of attention mechanism. 
We also use the convolution operation to extract its local information and then propose isometric convolution to model the global correlation between each segment of the local features. This modeling method not only avoids more redundant computations but also reduces the overall time and space complexity to linearity with respect to the sequence length.

3.1. MICN FRAMEWORK

The overall structure of MICN is shown in Figure 1. The long-term time series prediction task is to predict a future series of length O based on a past series of length I, which can be expressed as input-I-predict-O, where O is much larger than I. Inspired by traditional time series decomposition algorithms (Robert et al., 1990; Wu et al., 2021b), we design a multi-scale hybrid decomposition (MHDecomp) block to separate the complex patterns of the input series. We then use a Seasonal Prediction Block to predict the seasonal information and a Trend-cyclical Prediction Block to predict the trend-cyclical information. The two predictions are summed to obtain the final prediction Y_pred. We denote d as the number of variables in the multivariate time series and D as the hidden dimension of the series. The details are given in the following sections.
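This top-level flow can be sketched in PyTorch as follows. The sketch is illustrative only: the decomposition is reduced to a single moving average, and both prediction blocks are stand-in linear layers; all names and sizes are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Minimal sketch of the MICN top-level flow: decompose the input into
# trend-cyclical and seasonal parts, predict each separately, and sum.
class MICNFlow(nn.Module):
    def __init__(self, input_len, pred_len, kernel=25):
        super().__init__()
        self.kernel = kernel
        self.trend_pred = nn.Linear(input_len, pred_len)     # stand-in for Trend-cyclical block
        self.seasonal_pred = nn.Linear(input_len, pred_len)  # stand-in for Seasonal block

    def decompose(self, x):
        # x: (batch, length, vars); moving average keeps the length via padding
        pad = (self.kernel - 1) // 2
        xt = F.avg_pool1d(
            F.pad(x.transpose(1, 2), (pad, self.kernel - 1 - pad), mode="replicate"),
            self.kernel, stride=1).transpose(1, 2)
        return xt, x - xt  # trend-cyclical, seasonal

    def forward(self, x):
        x_t, x_s = self.decompose(x)
        y_t = self.trend_pred(x_t.transpose(1, 2)).transpose(1, 2)
        y_s = self.seasonal_pred(x_s.transpose(1, 2)).transpose(1, 2)
        return y_t + y_s  # (batch, pred_len, vars)

y = MICNFlow(input_len=96, pred_len=192)(torch.randn(4, 96, 7))
print(y.shape)  # torch.Size([4, 192, 7])
```

Note how the output length (O = 192) exceeds the input length (I = 96), matching the input-I-predict-O setting.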


Figure 1 : MICN overall architecture.

3.2. MULTI-SCALE HYBRID DECOMPOSITION

Previous series decomposition algorithms (Wu et al., 2021b) adopt the moving average to smooth out periodic fluctuations and highlight the long-term trends. For the input series X ∈ R^{I×d}, the process is:

X_t = AvgPool(Padding(X))_{kernel}
X_s = X - X_t,

where X_t, X_s ∈ R^{I×d} denote the trend-cyclical and seasonal parts, respectively. Using AvgPool(·) with the padding operation keeps the series length unchanged. But the kernel of AvgPool(·) is set manually, and the trend-cyclical and seasonal series obtained from different kernels often differ considerably. Therefore, we design a multi-scale hybrid decomposition block that uses several different AvgPool(·) kernels and can purposefully separate several different patterns of the trend-cyclical and seasonal parts. Different from the MOEDecomp block of FEDformer (Zhou et al., 2022), we use a simple mean operation to integrate these different patterns, because we cannot determine the weight of each pattern before learning its features; correspondingly, we place this weighting operation in the Merge part of the Seasonal Prediction block, after the features have been represented. Concretely, for the input series X ∈ R^{I×d}, the process is:

X_t = mean(AvgPool(Padding(X))_{kernel_1}, ..., AvgPool(Padding(X))_{kernel_n})
X_s = X - X_t,

where X_t, X_s ∈ R^{I×d} denote the trend-cyclical and seasonal parts, respectively. The different kernels are consistent with the multi-scale information in the Seasonal Prediction block. The effectiveness is demonstrated experimentally in Appendix B.1.
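A minimal sketch of the MHDecomp computation (the kernel sizes and tensor shapes here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def mh_decomp(x, kernels=(13, 17, 25)):
    """Multi-scale hybrid decomposition sketch.

    x: (batch, length, vars). Each kernel yields one moving-average trend;
    the trends are averaged with equal weight, and the seasonal part is
    the residual: X_s = X - X_t.
    """
    trends = []
    xc = x.transpose(1, 2)  # (batch, vars, length) for avg_pool1d
    for k in kernels:
        left = (k - 1) // 2
        padded = F.pad(xc, (left, k - 1 - left), mode="replicate")  # keep length
        trends.append(F.avg_pool1d(padded, kernel_size=k, stride=1))
    x_t = torch.stack(trends, dim=0).mean(dim=0).transpose(1, 2)  # simple mean over patterns
    return x_t, x - x_t

x = torch.randn(8, 96, 7)
x_t, x_s = mh_decomp(x)
print(x_t.shape, x_s.shape)  # both torch.Size([8, 96, 7])
```

The replicate padding keeps every moving average the same length as the input, so the residual X_s = X - X_t is well defined element-wise.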

3.3. TREND-CYCLICAL PREDICTION BLOCK

Autoformer (Wu et al., 2021b) concatenates the mean of the original series and then accumulates it with the trend-cyclical parts obtained from the inner series decomposition blocks, but it offers no explanation of this design and no proof of its effectiveness. In this paper, we use a simple linear regression strategy to predict the trend-cyclical part, demonstrating that simple modeling of trend-cyclical is also necessary for non-stationary series forecasting tasks (see Section 4.2). Concretely, for the trend-cyclical series X_t ∈ R^{I×d} obtained with the MHDecomp block, the process is:

Y_t^{regre} = regression(X_t),

where Y_t^{regre} ∈ R^{O×d} denotes the prediction of the trend part using the linear regression strategy. We use MICN-regre to denote the MICN model with this trend-cyclical prediction method. For comparison, we use the mean of X_t to cope with series whose trend-cyclical part stays constant:

Y_t^{mean} = mean(X_t),

where Y_t^{mean} ∈ R^{O×d} denotes the prediction of the trend part. We use MICN-mean to denote the MICN model with this trend-cyclical prediction method.
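The two strategies can be sketched as follows (the shapes and the single linear layer shared across variables are assumptions for illustration):

```python
import torch
import torch.nn as nn

# Sketch of the two trend-cyclical prediction strategies. The regression
# variant (MICN-regre) is a learned linear map from the I input steps to
# the O output steps, applied per variable; the mean variant (MICN-mean)
# simply repeats the historical mean over the O output steps.
I, O, d = 96, 192, 7
x_t = torch.randn(4, I, d)  # trend-cyclical part from MHDecomp

regression = nn.Linear(I, O)                                # shared across variables
y_regre = regression(x_t.transpose(1, 2)).transpose(1, 2)   # (4, O, d)
y_mean = x_t.mean(dim=1, keepdim=True).repeat(1, O, 1)      # (4, O, d)
print(y_regre.shape, y_mean.shape)
```

The mean variant has no learnable parameters, which is why it only suits series whose trend stays roughly constant.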

3.4. SEASONAL PREDICTION BLOCK

As shown in Figure 2, the Seasonal Prediction Block focuses on modeling the more complex seasonal part. After embedding the input sequence X_s, we adopt multi-scale isometric convolution to capture local features and global correlations, with branches of different scales modeling different underlying patterns of the time series. We then merge the results from the different branches to complete comprehensive information utilization of the sequence. It can be summarized as follows:

X_s^{emb} = Embedding(Concat(X_s, X_zero))
Y_{s,0} = X_s^{emb}
Y_{s,l} = MIC(Y_{s,l-1}), l ∈ {1, 2, ..., N}
Y_s = Truncate(Projection(Y_{s,N})),

where X_zero ∈ R^{O×d} denotes placeholders filled with zeros, X_s^{emb} ∈ R^{(I+O)×D} denotes the embedded representation of X_s, and Y_{s,l} ∈ R^{(I+O)×D} represents the output of the l-th MIC layer.

Embedding. The decoders of the latest Transformer-based models such as Informer (Zhou et al., 2021), Autoformer (Wu et al., 2021b), and FEDformer (Zhou et al., 2022) take as input the latter half of the encoder's input, of length I/2, together with placeholders of length O filled with scalars, which may lead to redundant calculations. To avoid this problem and adapt to the prediction length O, we replace the traditional encoder-decoder-style input with a simpler zero-filling strategy. Meanwhile, we follow the setting of FEDformer and embed the input with three parts:

X_s^{emb} = sum(TFE + PE + VE(Concat(X_s, X_zero))),

where X_s^{emb} ∈ R^{(I+O)×D}; TFE represents time-feature encoding (e.g., MinuteOfHour, HourOfDay, DayOfWeek, DayOfMonth, and MonthOfYear), PE represents positional encoding, and VE represents value embedding.

Multi-scale Isometric Convolution (MIC) Layer. A MIC layer contains several branches, with different scale sizes used to model potentially different temporal patterns.
In each branch, as shown in Figure 3, the local-global module extracts the local features and the global correlations of the sequence (see Appendix B.7 for a more detailed description). Concretely, after obtaining the corresponding single pattern by average pooling, the local module adopts one-dimensional convolution to implement downsampling:

Y_{s,l} = Y_{s,l-1}
Y_{s,l}^{local,i} = Conv1d(AvgPool(Padding(Y_{s,l}))_{kernel=i})_{kernel=i},

where Y_{s,l-1} denotes the output of the (l-1)-th MIC layer and Y_{s,0} = X_s^{emb}. i ∈ {I/4, I/8, ...} denotes the different scale sizes corresponding to the different branches in Figure 2. For Conv1d, we set stride = kernel = i, which compresses the local features. Y_{s,l}^{local,i} ∈ R^{((I+O)/i)×D} is the resulting short sequence of compressed local features. The global module is then designed to model the global correlations of the local module's output. A commonly used method for modeling global correlations is the self-attention mechanism, but in this paper we use isometric convolution, a variant of causal convolution, as an alternative. As shown in Figure 4, isometric convolution pads the sequence of length S with S-1 zero placeholders, and its kernel size equals S. This means we can use a large convolution kernel to measure the global correlations of the whole series. The current generative prediction approach adds placeholders to the input sequence, so the second half carries no actual sequence information; isometric convolution enables sequential inference over the sequence by fusing local feature information. Moreover, the kernel of isometric convolution is determined by all the training data, which introduces a global temporal inductive bias (translation equivariance, etc.) and achieves better generalization than self-attention (where the correlations are obtained from products between individual elements).
Meanwhile, we demonstrate that for a shorter sequence, isometric convolution is superior to self-attention; detailed experiments are given in Appendix B.3. To keep the sequence length constant, we upsample the result of the isometric convolution using transposed convolution. The global module can be formalized as follows:

Y_{s,l}^{',i} = Norm(Y_{s,l}^{local,i} + Dropout(Tanh(IsometricConv(Y_{s,l}^{local,i}))))
Y_{s,l}^{global,i} = Norm(Y_{s,l-1} + Dropout(Tanh(Conv1dTranspose(Y_{s,l}^{',i})_{kernel=i}))),

where Y_{s,l}^{',i} ∈ R^{((I+O)/i)×D} denotes the result after global correlation modeling, Y_{s,l-1} is the output of the (l-1)-th MIC layer, and Y_{s,l}^{global,i} ∈ R^{(I+O)×D} represents the result of this pattern (i.e., this branch). The merge process is:

Y_{s,l}^{merge} = Conv2d(Y_{s,l}^{global,i}, i ∈ {I/4, I/8, ...})
Y_{s,l} = Norm(Y_{s,l}^{merge} + FeedForward(Y_{s,l}^{merge})),

where Y_{s,l} ∈ R^{(I+O)×D} represents the result of the l-th MIC layer. To get the final prediction of the seasonal part, we use the projection and truncate operations:

Y_s = Truncate(Projection(Y_{s,N})),

where Y_{s,N} ∈ R^{(I+O)×D} represents the output of the N-th MIC layer, and Y_s ∈ R^{O×d} represents the final prediction of the seasonal part.
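As a minimal sketch (shapes and module names are assumptions, not the authors' code), isometric convolution can be realized as a 1-D convolution whose kernel size equals the sequence length S, applied after left-padding with S-1 zeros so the output length equals the input length:

```python
import torch
import torch.nn as nn

# Isometric convolution sketch: for a length-S sequence, pad S-1 zeros on
# the left and convolve with a kernel of size S. Every output step then
# sees only current and previous steps (causal), and the length is preserved.
S, D = 24, 16
x = torch.randn(2, D, S)                      # (batch, channels, length)

iso = nn.Conv1d(D, D, kernel_size=S)          # kernel spans the whole series
x_pad = nn.functional.pad(x, (S - 1, 0))      # left-pad with S-1 zeros
y = iso(x_pad)                                # padded length 2S-1, kernel S -> length S
print(y.shape)  # torch.Size([2, 16, 24])
```

Because the kernel weights are learned from all training data rather than computed pairwise per input, this is where the global temporal inductive bias discussed above comes from.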

4. EXPERIMENTS

Dataset. To evaluate the proposed MICN, we conduct extensive experiments on six popular real-world datasets, covering many aspects of life: energy, traffic, economics, and weather. We follow the standard protocol (Zhou et al., 2021) and split all datasets into training, validation, and test sets in chronological order, by the ratio 6:2:2 for the ETT datasets and 7:1:2 for the other datasets. More details about the datasets and implementation are described in Appendices A.1 and A.2.

Baselines. We include four Transformer-based models: FEDformer (Zhou et al., 2022), Autoformer (Wu et al., 2021b), Informer (Zhou et al., 2021), and LogTrans (Li et al., 2019b); two RNN-based models: LSTM (Hochreiter & Schmidhuber, 1997) and LSTNet (Lai et al., 2018b); and the CNN-based model TCN (Bai et al., 2018) as baselines. For the univariate setting, we mainly compare Transformer-based models. For the state-of-the-art model FEDformer, we compare against the better variant (FEDformer-f).

4.2. ABLATION STUDIES

Trend-cyclical Prediction Block. We attempt to verify the necessity of modeling the trend-cyclical part when using a decomposition-based structure. Previous methods such as Autoformer (Wu et al., 2021b) decompose the time series and then take the mean as the prediction of the trend information, which is then added to the other trend information obtained from the decomposition modules inside the model. However, the reasons and the rationality of this design are not argued in the relevant papers. In this paper, we use simple linear regression to predict the trend-cyclical part, and we also record the results of the mean prediction for comparison. Note that with the different trend-cyclical prediction blocks, the resulting models are named MICN-regre and MICN-mean.

Impact of input length

In time series forecasting tasks, the input length determines how much historical information the algorithm can utilize. In general, a model with a strong ability to model long-term temporal dependencies should perform better as the input length increases. Therefore, we conduct experiments with different input lengths and the same prediction length to validate our model. As shown in Figure 5, when the input length is relatively long, the performance of Transformer-based models becomes worse because of repeated short-term patterns, as stated in (Zhou et al., 2021). In contrast, the overall prediction performance of MICN gradually improves as the input length increases.

Our method is trained with the L2 loss, using the ADAM optimizer with an initial learning rate of 10^-4. The batch size is set to 32. Training is stopped early after three epochs if there is no loss degradation on the validation set. Mean squared error (MSE) and mean absolute error (MAE) are used as metrics. All experiments are repeated 3 times with different seeds, implemented in PyTorch, and conducted on an NVIDIA RTX A5000 24GB GPU. The hyper-parameter i is set to {12, 16}; a hyper-parameter sensitivity analysis is given in Appendix A.4. For a fairer comparison, we fix the input length to 96 for all datasets (36 for ILI). MICN contains 1 MIC layer. We use MICN-regre and MICN-mean to denote the different strategies of the trend-cyclical prediction block in the following.

A.3 FULL BENCHMARK ON THE ETT DATASETS

We build the benchmark on the four ETT datasets in Table 8 and Table 9. Meanwhile, MICN achieves almost equally strong performance whether i takes two or three values, indicating that the multi-branch structure is effective. To be more representative, we set i to {12, 16} in this paper.

A.5 SELECTION OF DIFFERENT CONVOLUTION MODES

As shown in Table 11, we also record the performance of two different convolution modes: stride = kernel and stride = kernel/2. The second mode makes more comprehensive use of local information, making the convolution more coherent. MICN achieves similar performance in both modes, which shows that MICN makes the most of the sequence information and that the model's performance depends on the structure we propose.
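The difference between the two modes can be illustrated with output lengths (the sizes below are hypothetical, not the paper's exact settings):

```python
import torch
import torch.nn as nn

# Output lengths of the downsampling convolution under the two modes.
# With stride = kernel the windows are disjoint; with stride = kernel // 2
# they overlap by half, using local information more coherently at the
# cost of a longer output sequence.
L, D, k = 96, 16, 12
x = torch.randn(1, D, L)

disjoint = nn.Conv1d(D, D, kernel_size=k, stride=k)       # stride = kernel
overlap = nn.Conv1d(D, D, kernel_size=k, stride=k // 2)   # stride = kernel/2
n_disjoint = disjoint(x).shape[-1]  # 96/12 = 8 windows
n_overlap = overlap(x).shape[-1]    # (96-12)/6 + 1 = 15 windows
print(n_disjoint, n_overlap)  # 8 15
```

Either way the downstream isometric convolution sees a short sequence, which is consistent with the finding that the two modes perform similarly.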

B ADDITIONAL MODEL ANALYSIS B.1 MULTI-SCALE HYBRID DECOMPOSITION

Autoformer harnesses decomposition as an inner block of deep models and achieves good performance. However, the patterns obtained by its decomposition are simple and cannot effectively deal with the complex and changeable properties of time series. As shown in Table 12, we replace the decomposition blocks in Autoformer with our proposed multi-scale hybrid decomposition block. For Exchange, we achieve similar performance because it has no obvious temporal pattern. The results verify that the multi-scale hybrid decomposition structure is more in line with the complex temporal patterns of real-world time series.

B.2 VISUALIZATION OF LEARNED TREND-CYCLICAL PARTS

As shown in Figure 6 and Figure 7, we plot the learned trend-cyclical parts. Modeling the trend-cyclical part separately yields better performance and a better grasp of the long-term progression. We also observe that the mean prediction is slightly better on the ETTm2 dataset. This is due to the complexity of its trend-cyclical information, which simple linear regression cannot capture and which may require a more advanced trend prediction method; moreover, the mean of its trend change is close to constant, so the mean prediction is better in this situation.

B.3 ISOMETRIC CONVOLUTION VS. MASKED SELF-ATTENTION

With the local module in MICN, we obtain a short sequence characterizing local features. On this basis, we propose isometric convolution in the global module to model the global correlations of the sequence, whereas previously the first choice was masked self-attention. We replace the isometric convolution in the global module of MICN with masked self-attention for training; the results are shown in Table 13 and Table 14. They verify that, for a short sequence, isometric convolution generally outperforms masked self-attention. To compare the two further, we conduct more experiments on the full benchmark with different kernel sizes. The results in Table 15 show that isometric convolution outperforms masked self-attention in most cases. We also note that in some cases masked self-attention is slightly more effective; we believe this is related to the corresponding datasets, which we will analyze in detail in the future. Moreover, different kernels have a relatively small impact on the final results, which indicates that our structure, rather than the model parameters, plays the major role in the performance.

The traditional method of merging branch structures is the concat operation on the hidden state. In this paper, we propose to adopt 2D convolution to merge the multiple branches so as to better measure the importance of each branch (the kernel represents the weights). As shown in Table 16, the better performance verifies the effectiveness of the proposed method.

For the downsampling convolution (kernel = stride = i), the complexity is O(i · D^2 · L/i) = O(L · D^2). For isometric convolution, the sequence length and kernel size are both L/i (stride = 1, padding = L/i - 1), so the complexity is O((L/i)^2 · D^2) = O(L^2 · D^2 / i^2). In this paper, we set i ∈ {L/4, L/8, ...} to be a factor of L, so the complexity of isometric convolution is O(c · D^2), where c is a constant.
In summary, the overall complexity is max(O(L · D^2), O(c · D^2)) = O(L · D^2), which is linear in the sequence length. The comparisons of the time complexity and memory usage in training, and of the inference steps in testing, are summarized in Table 17:

FEDformer (Zhou et al., 2022): O(L) training, O(L) testing
Autoformer (Wu et al., 2021b): O(L log L) training, O(L log L) testing
Informer (Zhou et al., 2021): O(L log L) training, O(L log L) testing
LogTrans (Li et al., 2019b): O(L log L) training, O(L^2) testing
Transformer (Vaswani et al., 2017): O(L^2) training, O(L^2) testing
LSTM (Hochreiter & Schmidhuber, 1997): O(L) training, O(L) testing

As the prediction length increases, our model takes slightly more time than Auto-Correlation. We speculate that this may be due to the convolution operations or the Tanh activation function. In general, our method is the most portable and valuable in practical applications.
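The linear-versus-constant behavior above can be checked numerically with a rough operation count (illustrative sizes, ignoring constant factors such as bias terms):

```python
# Sanity check of the complexity analysis. Downsampling conv: i * D^2
# multiply-accumulates per output step over L/i steps -> L * D^2 overall.
# Isometric conv on the length-L/i output: kernel L/i over L/i steps ->
# (L/i)^2 * D^2, which is constant when i scales with L (i in {L/4, L/8, ...}).
def downsample_cost(L, D, i):
    return i * D * D * (L // i)   # == L * D^2

def isometric_cost(L, D, i):
    s = L // i                    # short-sequence length
    return s * s * D * D

D = 64
for L in (96, 192, 384):
    i = L // 4                    # keeps the short length s = 4 fixed
    print(L, downsample_cost(L, D, i), isometric_cost(L, D, i))
# downsample cost grows linearly with L; isometric cost stays constant
```

Doubling L doubles the downsampling cost while the isometric cost is unchanged, matching the O(L · D^2) conclusion.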

B.6 MORE ANALYSIS OF THE TREND-CYCLICAL PREDICTION BLOCK

To further evaluate the trend-cyclical prediction, we conduct more experiments on the full benchmark. As shown in Table 18, "regre" is the simple regression prediction without the other modules and "mean" is the simple mean prediction without the other modules. MICN-regre is our proposed method with regression prediction in the trend-cyclical block, and MICN-mean uses mean prediction in the trend-cyclical block. FEDformer-mean is the original FEDformer using mean prediction, and FEDformer-regre uses regression prediction instead. We can conclude that simple regression and mean prediction alone fail to capture the complex temporal correlations. But for the Exchange data, which lacks periodicity, simple regression achieves competitive performance; this result is worth thinking about, and we will conduct more in-depth experiments in the future. Meanwhile, the results that MICN-regre outperforms FEDformer-regre and MICN-mean outperforms FEDformer-mean prove the validity of our proposed model. We also find that FEDformer-regre performs worse in most cases; this may be due to the more complex structure of FEDformer, which correspondingly requires a more sophisticated regression prediction.

B.7 THE DETAILED DESCRIPTION OF THE LOCAL-GLOBAL MODULE

We show the detailed description of the Local-global module in Figure 9 . For the input series, we adopt down-sampling convolution with different kernels to extract the local features of different temporal patterns and isometric convolution instead of masked self-attention to capture global correlations. Then we use up-sampling convolution to recover the length of the series. Finally, we merge the different branches to complete the modeling of different patterns. 
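Under assumed shapes, one branch of the Local-Global module might look like the following sketch (module names and sizes are illustrative, not the released code):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# One Local-Global branch: downsampling convolution compresses local
# features, isometric convolution models global correlations over the
# short sequence, and a transposed convolution restores the length.
class LocalGlobalBranch(nn.Module):
    def __init__(self, D, length, i):
        super().__init__()
        self.short = length // i                                     # short-sequence length
        self.local = nn.Conv1d(D, D, kernel_size=i, stride=i)        # downsample
        self.glob = nn.Conv1d(D, D, kernel_size=self.short)          # isometric (kernel = S)
        self.up = nn.ConvTranspose1d(D, D, kernel_size=i, stride=i)  # upsample

    def forward(self, x):  # x: (batch, D, length)
        local = self.local(x)                              # (batch, D, length/i)
        g = self.glob(F.pad(local, (self.short - 1, 0)))   # left-pad S-1: length preserved
        return self.up(torch.tanh(g))                      # (batch, D, length)

branch = LocalGlobalBranch(D=16, length=96, i=12)
out = branch(torch.randn(2, 16, 96))
print(out.shape)  # torch.Size([2, 16, 96])
```

A full MIC layer would run several such branches with different i and merge them (the paper uses Conv2d across branches), which this single-branch sketch omits.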



https://archive.ics.uci.edu/ml/datasets/ElectricityLoadDiagrams20112014
http://pems.dot.ca.gov
https://www.bgc-jena.mpg.de/wetter/
https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html



Figure 2: Seasonal Prediction Block.

Figure 3: Local-Global module architecture.

Figure 4: Isometric Convolution architecture vs. Masked self-attention architecture.

Then we propose to use Conv2d to merge the different patterns with different weights instead of the traditional Concat operation. The validity of this weighting approach is verified in Appendix B.4. The process is:

Figure 6: Visualization of the learned trend-cyclical part prediction Y_t and seasonal part prediction Y_s on the ETTm1 dataset under MICN-regre. Simple linear regression performs well.

Furthermore, we compare the running memory and time among Local-Global-based, Auto-Correlation-based, and self-attention-based models during the training phase. As shown in Figure 8, the proposed Local-Global module shows O(L · D^2) complexity and achieves better efficiency on long sequences.

Figure 8: Efficiency Analysis. We replace the Local-Global module in MICN with Auto-Correlation and self-attention, then record the memory and running time of an epoch with fixed input length 96 and increasing output length. Missing values for self-attention are due to out-of-memory.

Figure 10: Prediction cases from the univariate Electricity dataset under MICN.

Figure 11: Prediction cases from the univariate Electricity dataset under Autoformer.

Figure 15: Prediction cases from the univariate Traffic dataset under Informer.

Multivariate long-term series forecasting results with input length I = 96 and prediction length O ∈ {96, 192, 336, 720} (for ILI, the input length I = 36). A lower MSE or MAE indicates a better prediction, and the best results are highlighted in bold.


Univariate long-term series forecasting results with input length I = 96 and prediction length O ∈ {96, 192, 336, 720} (for ILI, the input length I = 36). A lower MSE or MAE indicates a better prediction, and the best results are highlighted in bold.

Univariate results We also show the univariate time-series forecasting results in Table 2. Notably, MICN achieves a 21.6% average MSE reduction compared to FEDformer. For the Weather dataset in particular, MICN gives a 53% relative MSE reduction under the predict-96 setting, 75% under predict-192, 44% under predict-336, and 56% under predict-720. This again verifies its greater time-series forecasting capacity. More results on the other ETT benchmarks are provided in Appendix A.3, and Appendix C.2 gives detailed showcases.

MICN-regre performs better than MICN-mean overall. Because all the datasets are non-stationary, simple modeling of the trend-cyclical part is necessary to give the model a holistic view of the trend direction. See Appendix B.2 for more visualization results and analysis.
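The two trend strategies can be illustrated with a minimal sketch. This is not the paper's implementation (MICN-regre uses a learned linear layer); `trend_regre` and `trend_mean` are hypothetical helpers that capture the contrast between extrapolating a fitted line and repeating the historical mean:

```python
import numpy as np

def trend_regre(history, horizon):
    """Extrapolate the trend-cyclical part with a simple linear
    regression on the time index (the MICN-regre idea, sketched)."""
    t = np.arange(len(history))
    slope, intercept = np.polyfit(t, history, deg=1)
    future_t = np.arange(len(history), len(history) + horizon)
    return slope * future_t + intercept

def trend_mean(history, horizon):
    """Repeat the historical mean as the trend forecast (MICN-mean)."""
    return np.full(horizon, history.mean())

# On a steadily rising series, regression continues the rise while
# the mean prediction stays flat:
hist = np.arange(10, dtype=float)   # 0, 1, ..., 9
regre_pred = trend_regre(hist, 3)   # approximately 10, 11, 12
mean_pred = trend_mean(hist, 3)     # 4.5, 4.5, 4.5
```

On a non-stationary series like this, the flat mean forecast misses the trend direction entirely, which is consistent with MICN-regre's advantage above.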

Comparison of simple linear regression prediction and mean prediction on multivariate datasets. The better results are highlighted in bold.

Local-Global structure vs. Auto-correlation, self-attention In this work, we propose the Local-Global module to model the underlying patterns of time series, including local features and global correlations, while the previous outstanding model Autoformer uses Auto-correlation. We replace the Auto-correlation module in the original Autoformer with our proposed Local-Global module (we set i ∈ {12, 16}) for training; the results are shown in Table 4. We also replace the Local-Global module in MICN-regre with the Auto-correlation and self-attention modules for training; the results are shown in Table 5. Both experiments demonstrate that modeling time series in terms of local features and global correlations is more effective and more realistic.

Ablation of Local-global structure in other models. We replace the Auto-Correlation in Autoformer with our local-global module and implement it in the multivariate Electricity, Exchange and Traffic. The better results are highlighted in bold.

Ablation of Local-global structure in our model. We replace the Local-Global module in MICN-regre with Auto-correlation and self-attention and implement it in the multivariate Electricity, Exchange and Traffic. The better results are highlighted in bold.

Robustness analysis of multivariate results. Different ε indicates different proportions of noise injection, and MICN-regre is used as the base model.

(4) Traffic² contains hourly data from the California Department of Transportation, describing road occupancy rates measured by different sensors on San Francisco Bay Area freeways. (5) Weather³ contains 21 meteorological indicators, recorded every 10 minutes for the whole of 2020. (6) ILI⁴ records weekly influenza-like illness (ILI) patient data from the Centers for Disease Control and Prevention of the United States between 2002 and 2021. Table 7 summarizes feature details (Sequence Length: Len, Dimension: Dim, Frequency: Freq).

The details of datasets.

Multivariate long-term forecasting results on ETT full benchmark. The best results are highlighted in bold.

Multivariate results with different parameters i in three datasets: Electricity, Exchange and Traffic.

MICN performance under different convolution modes. We implement it on three multivariate datasets: Electricity, Exchange and Traffic.

Ablation of the multi-scale hybrid decomposition block (MHDecomp). Autoformer-MHDecomp incorporates the multi-scale hybrid decomposition block into Autoformer.

Ablation of isometric convolution. We replace the Isometric convolution in MICN-regre with masked self-attention and implement it in the multivariate Electricity, Exchange and Traffic. The better results are highlighted in bold.
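The intuition behind this ablation is that isometric convolution enforces the same causal visibility as masked self-attention, but with a single shared kernel instead of pairwise attention scores. A rough single-channel NumPy sketch (the actual model uses learned multi-channel kernels; `isometric_conv` is an illustrative name):

```python
import numpy as np

def isometric_conv(x, kernel):
    """Isometric convolution, sketched for a 1-D series.

    A length-S input is zero-padded with S-1 values at the front and
    convolved with a length-S kernel, so output position t only sees
    inputs 0..t -- the same visibility a causal attention mask
    enforces, but with shared kernel weights.
    """
    S = len(x)
    assert len(kernel) == S
    padded = np.concatenate([np.zeros(S - 1), x])
    return np.array([padded[t:t + S] @ kernel for t in range(S)])

# With an all-ones kernel the output is the running (causal) sum,
# showing that no future position leaks into any output step:
x = np.array([1.0, 2.0, 3.0, 4.0])
out = isometric_conv(x, np.ones(4))   # 1, 3, 6, 10
```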

Comparison of Isometric convolution and masked self-attention in the univariate Electricity, Exchange and Traffic. We replace the Isometric convolution in MICN-regre with masked self-attention. The better results are highlighted in bold.

Comparison of the isometric convolution and the masked self-attention in MICN with different kernel sizes.

Comparison of different merging operations. The better results are highlighted in bold.

The complexity lies in the downsampling convolution and the isometric convolution of the Local-Global module. If we set the sequence length to L, the hidden state dimension to D, and the multi-scale convolution kernels to i, the overall complexity of the module is O(LD²), i.e., linear with respect to the sequence length.
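The downsampling step behind this linear scaling can be sketched as a stride-i reduction: a length-L series becomes length roughly L/i before the heavier isometric convolution runs, so each branch's cost stays proportional to L. This toy version uses a non-overlapping average in place of a learned strided convolution; `downsample_conv` is an illustrative name:

```python
import numpy as np

def downsample_conv(x, i):
    """Stand-in for a stride-i downsampling convolution: average over
    non-overlapping windows of size i, shrinking length L to L // i."""
    L = (len(x) // i) * i            # drop any ragged tail
    return x[:L].reshape(-1, i).mean(axis=1)

x = np.arange(12, dtype=float)
short = downsample_conv(x, 4)        # length 12 -> 3: 1.5, 5.5, 9.5
```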

Complexity analysis of different forecasting models.

Comparison of different linear complexity models on the univariate datasets. The better results are highlighted in bold.

To obtain more robust experimental results, we repeat each experiment three times with different random seeds. For easier comparison, the results shown in the main text use seed 2021. Table 21 shows the standard deviations.

Quantitative results with fluctuations under different prediction lengths O for multivariate forecasting. A lower MSE or MAE indicates a better performance.

ACKNOWLEDGEMENTS

This work was supported by the Sichuan Science and Technology Program (2023YFG0112), the National Key R&D Program of China (2020YFB0704502), the Sichuan Science and Technology Program (2022YFG0034), the Postdoctoral Interdisciplinary Innovation Fund (10822041A2137), and the Sichuan University and Yibin Cooperation Program (2020CDYB-30).

ANNEX

Published as a conference paper at ICLR 2023

The results are shown in Table 19 and Table 20. Concretely, we replace the isometric convolution in MICN-regre with the different masked attention mechanisms in Linformer and Fastformer. We implement the experiments on both multivariate and univariate datasets. We can conclude that Fastformer achieves relatively competitive performance, but our proposed model MICN performs best in general. These results provide further evidence of the effectiveness of MICN.

