TIMESNET: TEMPORAL 2D-VARIATION MODELING FOR GENERAL TIME SERIES ANALYSIS

Abstract

Time series analysis is of immense importance in extensive applications, such as weather forecasting, anomaly detection, and action recognition. This paper focuses on temporal variation modeling, the common key problem across these analysis tasks. Previous methods attempt to model variations directly from the 1D time series, which is extremely challenging due to the intricate temporal patterns. Based on the observation of multi-periodicity in time series, we ravel out the complex temporal variations into multiple intraperiod- and interperiod-variations. To tackle the limited representation capability of 1D time series, we extend the analysis of temporal variations into 2D space by transforming the 1D time series into a set of 2D tensors based on multiple periods. This transformation embeds the intraperiod- and interperiod-variations into the columns and rows of the 2D tensors respectively, making the 2D-variations easy to model with 2D kernels. Technically, we propose TimesNet with TimesBlock as a task-general backbone for time series analysis. TimesBlock discovers the multi-periodicity adaptively and extracts the complex temporal variations from the transformed 2D tensors with a parameter-efficient inception block. Our proposed TimesNet achieves consistent state-of-the-art performance in five mainstream time series analysis tasks: short- and long-term forecasting, imputation, classification, and anomaly detection.

1. INTRODUCTION

Time series analysis is widely used in extensive real-world applications, such as the forecasting of meteorological factors for weather prediction (Wu et al., 2021), imputation of missing data for data mining (Friedman, 1962), anomaly detection of monitoring data for industrial maintenance (Xu et al., 2021), and classification of trajectories for action recognition (Franceschi et al., 2019). Because of its immense practical value, time series analysis has received great interest (Lim & Zohren, 2021).

Different from other types of sequential data, such as language or video, time series is recorded continuously and each time point only saves some scalars. Since a single time point usually cannot provide sufficient semantic information for analysis, many works focus on the temporal variation, which is more informative and can reflect the inherent properties of time series, such as continuity, periodicity and trend. However, the variations of real-world time series always involve intricate temporal patterns, where multiple variations (e.g. rising, falling, fluctuation) mix and overlap with each other, making temporal variation modeling extremely challenging.

Especially in the deep learning communities, benefiting from the powerful non-linear modeling capacity of deep models, many works have been proposed to capture the complex temporal variations in real-world time series. One category of methods adopts recurrent neural networks (RNN) to model successive time points based on the Markov assumption (Hochreiter & Schmidhuber, 1997; Lai et al., 2018; Shen et al., 2020). However, these methods usually fail in capturing long-term dependencies, and their efficiency suffers from the sequential computation paradigm. Another category of methods utilizes convolutional neural networks along the temporal dimension (TCN) to extract the variation information (Franceschi et al., 2019; He & Zhao, 2019). Because of the locality of one-dimensional convolution kernels, they can only model the variations among adjacent time points, thereby still failing in long-term dependencies. Recently, Transformers with the attention mechanism have been widely used in sequential modeling (Brown et al., 2020; Dosovitskiy et al., 2021; Liu et al., 2021b). In time series analysis, many Transformer-based models adopt the attention mechanism or its variants to capture the pair-wise temporal dependencies among time points (Li et al., 2019; Kitaev et al., 2020; Zhou et al., 2021; 2022). But it is hard for the attention mechanism to find reliable dependencies directly from scattered time points, since the temporal dependencies can be deeply obscured in intricate temporal patterns (Wu et al., 2021).

Figure 1: Multi-periodicity and temporal 2D-variation of time series. Each period involves the intraperiod-variation and interperiod-variation. We transform the original 1D time series into a set of 2D tensors based on multiple periods, which can unify the intraperiod- and interperiod-variations.

In this paper, to tackle the intricate temporal variations, we analyze the time series from a new dimension of multi-periodicity. Firstly, we observe that real-world time series usually present multi-periodicity, such as daily and yearly variations for weather observations, and weekly and quarterly variations for electricity consumption. These multiple periods overlap and interact with each other, making the variation modeling intractable. Secondly, for each period, we find that the variation of each time point is not only affected by the temporal pattern of its adjacent area but also highly related to the variations of its adjacent periods. For clearness, we name these two types of temporal variations intraperiod-variation and interperiod-variation respectively. The former indicates short-term temporal patterns within a period.
The latter can reflect long-term trends of consecutive different periods. Note that for time series without clear periodicity, the variations are dominated by the intraperiod-variation, which is equivalent to the case of an infinite period length.

Since different periods lead to different intraperiod- and interperiod-variations, the multi-periodicity can naturally derive a modular architecture for temporal variation modeling, where we can capture the variations derived by a certain period in one module. Besides, this design disentangles the intricate temporal patterns, benefiting the temporal variation modeling. However, it is notable that the 1D time series can hardly present the two different types of variations explicitly and simultaneously. To tackle this obstacle, we extend the analysis of temporal variations into 2D space. Concretely, as shown in Figure 1, we can reshape the 1D time series into a 2D tensor, where each column contains the time points within a period and each row involves the time points at the same phase among different periods. Thus, by transforming the 1D time series into a set of 2D tensors, we can break the bottleneck of representation capability in the original 1D space and successfully unify the intraperiod- and interperiod-variations in 2D space, obtaining the temporal 2D-variations.

Technically, based on the above motivations, we go beyond previous backbones and propose TimesNet as a new task-general model for time series analysis. Empowered by TimesBlock, TimesNet can discover the multi-periodicity of time series and capture the corresponding temporal variations in a modular architecture. Concretely, TimesBlock can adaptively transform the 1D time series into a set of 2D tensors based on learned periods and further capture intraperiod- and interperiod-variations in the 2D space by a parameter-efficient inception block.
Experimentally, TimesNet achieves consistent state-of-the-art in five mainstream analysis tasks, including short- and long-term forecasting, imputation, classification and anomaly detection. Our contributions are summarized in three folds:

• Motivated by multi-periodicity and complex interactions within and between periods, we find out a modular way for temporal variation modeling. By transforming the 1D time series into 2D space, we can present the intraperiod- and interperiod-variations simultaneously.

• We propose TimesNet with TimesBlock to discover multiple periods and capture temporal 2D-variations from the transformed 2D tensors by a parameter-efficient inception block.

• As a task-general foundation model, TimesNet achieves consistent state-of-the-art in five mainstream time series analysis tasks.

3.1. TRANSFORM 1D-VARIATIONS INTO 2D-VARIATIONS

For a length-T time series with C recorded variates, X_1D ∈ R^{T×C}, we analyze its periodicity in the frequency domain:

$$\mathbf{A} = \mathrm{Avg}\big(\mathrm{Amp}(\mathrm{FFT}(\mathbf{X}_{1D}))\big). \quad (1)$$

Here, FFT(·) and Amp(·) denote the FFT and the calculation of amplitude values. A ∈ R^T represents the calculated amplitude of each frequency, averaged over the C dimensions by Avg(·). Note that the j-th value A_j represents the intensity of the frequency-j periodic basis function, corresponding to the period length ⌈T/j⌉. Considering the sparsity of the frequency domain and to avoid the noise brought by meaningless high frequencies (Chatfield, 1981; Zhou et al., 2022), we only select the top-k amplitude values and obtain the most significant frequencies {f_1, ..., f_k} with the unnormalized amplitudes {A_{f_1}, ..., A_{f_k}}, where k is a hyper-parameter. These selected frequencies also correspond to k period lengths {p_1, ..., p_k}. Due to the conjugacy of the frequency domain, we only consider the frequencies within {1, ..., ⌊T/2⌋}. We summarize Equation 1 as follows:

$$\mathbf{A}, \{f_1, \cdots, f_k\}, \{p_1, \cdots, p_k\} = \mathrm{Period}(\mathbf{X}_{1D}). \quad (2)$$

Based on the selected frequencies {f_1, ..., f_k} and corresponding period lengths {p_1, ..., p_k}, we can reshape the 1D time series X_1D ∈ R^{T×C} into multiple 2D tensors by the following equation:

$$\mathbf{X}^{i}_{2D} = \mathrm{Reshape}_{p_i, f_i}\big(\mathrm{Padding}(\mathbf{X}_{1D})\big), \quad i \in \{1, \cdots, k\}, \quad (3)$$

where Padding(·) extends the time series by zeros along the temporal dimension to make it compatible with Reshape_{p_i, f_i}(·), and p_i and f_i represent the number of rows and columns of the transformed 2D tensors respectively.
Note that X^i_2D ∈ R^{p_i×f_i×C} denotes the i-th reshaped time series based on frequency f_i, whose columns and rows represent the intraperiod-variation and interperiod-variation under the corresponding period length p_i respectively. Eventually, as shown in Figure 2, based on the selected frequencies and estimated periods, we obtain a set of 2D tensors {X^1_2D, ..., X^k_2D}, which indicates k different temporal 2D-variations derived by different periods. It is also notable that this transformation brings two types of locality to the transformed 2D tensors, that is, locality among adjacent time points (columns, intraperiod-variation) and among adjacent periods (rows, interperiod-variation). Thus, the temporal 2D-variations can be easily processed by 2D kernels.
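To make the transformation concrete, the following is a minimal NumPy sketch of the period-detection (Equations 1-2) and reshape (Equation 3) steps for a single univariate series. The names `find_periods` and `to_2d` are illustrative, not from the paper's code, and the real model operates on batched multivariate deep features rather than one 1D array.

```python
import numpy as np

def find_periods(x, k=2):
    # Toy version of Period(.): FFT amplitudes, then the top-k
    # frequencies and their corresponding period lengths.
    T = len(x)
    amp = np.abs(np.fft.rfft(x))
    amp[0] = 0.0                              # ignore the zero-frequency (DC) term
    top = np.argsort(amp)[-k:][::-1]          # most significant frequencies
    periods = [int(np.ceil(T / f)) for f in top]
    return list(top), periods

def to_2d(x, period):
    # Padding(.) then Reshape_{p_i, f_i}(.): zero-pad so the length is a
    # multiple of the period, then stack each period as one column.
    T = len(x)
    f = int(np.ceil(T / period))
    padded = np.pad(x, (0, f * period - T))
    return padded.reshape(f, period).T        # shape (p_i, f_i)

t = np.arange(96)
x = np.sin(2 * np.pi * t / 24)                # a clean cycle of period 24
freqs, periods = find_periods(x, k=1)
assert periods[0] == 24                       # dominant period recovered
assert to_2d(x, periods[0]).shape == (24, 4)  # 24 points per period, 4 periods
```

Moving along a column of the returned array traverses one period (intraperiod-variation); moving along a row jumps between the same phase of consecutive periods (interperiod-variation).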

3.2. TIMESBLOCK

As shown in Figure 3, we organize the TimesBlock in a residual way (He et al., 2016). Concretely, for the length-T 1D input time series X_1D ∈ R^{T×C}, we first project the raw inputs into deep features X^0_1D ∈ R^{T×d_model} by the embedding layer X^0_1D = Embed(X_1D). For the l-th layer of TimesNet, the input is X^{l-1}_1D ∈ R^{T×d_model} and the process can be formalized as:

$$\mathbf{X}^{l}_{1D} = \mathrm{TimesBlock}\big(\mathbf{X}^{l-1}_{1D}\big) + \mathbf{X}^{l-1}_{1D}. \quad (4)$$

As shown in Figure 3, for the l-th TimesBlock, the whole process involves two successive parts: capturing temporal 2D-variations and adaptively aggregating representations from different periods.
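The residual composition above can be sketched in a few lines; `timesnet_forward` is a hypothetical helper, with each block standing in for a full TimesBlock:

```python
import numpy as np

def timesnet_forward(x_embed, blocks):
    # Residual stacking: X^l = TimesBlock(X^{l-1}) + X^{l-1}.
    # `blocks` is a list of functions mapping features to same-shaped features.
    x = x_embed
    for block in blocks:
        x = block(x) + x
    return x

x0 = np.ones(4)                                # stands in for Embed(X_1D)
out = timesnet_forward(x0, [lambda v: 0.0 * v, lambda v: v])
assert out.tolist() == [2.0, 2.0, 2.0, 2.0]    # (1 + 0), then (1 + 1)
```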

Capturing temporal 2D-variations



Similar to Equation 2, we can extract periods for the deep features X^{l-1}_1D by Period(·). Based on the estimated periods, we transform the 1D time series into 2D space and obtain a set of 2D tensors, from which we can obtain informative representations by the parameter-efficient inception block conveniently. The process is formalized as follows:

$$\mathbf{A}^{l-1}, \{f_1, \cdots, f_k\}, \{p_1, \cdots, p_k\} = \mathrm{Period}\big(\mathbf{X}^{l-1}_{1D}\big),$$
$$\mathbf{X}^{l,i}_{2D} = \mathrm{Reshape}_{p_i, f_i}\big(\mathrm{Padding}(\mathbf{X}^{l-1}_{1D})\big), \quad i \in \{1, \cdots, k\},$$
$$\widehat{\mathbf{X}}^{l,i}_{2D} = \mathrm{Inception}\big(\mathbf{X}^{l,i}_{2D}\big), \quad i \in \{1, \cdots, k\},$$
$$\widehat{\mathbf{X}}^{l,i}_{1D} = \mathrm{Trunc}\Big(\mathrm{Reshape}_{1, (p_i \times f_i)}\big(\widehat{\mathbf{X}}^{l,i}_{2D}\big)\Big), \quad i \in \{1, \cdots, k\}, \quad (5)$$

where X^{l,i}_2D ∈ R^{p_i×f_i×d_model} is the i-th transformed 2D tensor. After the transformation, we process the 2D tensor by a parameter-efficient inception block (Szegedy et al., 2015) as Inception(·), which involves multi-scale 2D kernels and is one of the most well-acknowledged vision backbones. Then we transform the learned 2D representations X̂^{l,i}_2D back to 1D space X̂^{l,i}_1D ∈ R^{T×d_model} for aggregation, where we employ Trunc(·) to truncate the padded series of length (p_i × f_i) into the original length T. Note that, benefiting from the transformation of the 1D time series, the 2D kernels in the inception block can aggregate the multi-scale intraperiod-variation (columns) and interperiod-variation (rows) simultaneously, covering both adjacent time points and adjacent periods. Besides, we adopt a shared inception block for the different reshaped 2D tensors {X^{l,1}_2D, ..., X^{l,k}_2D} to improve parameter efficiency, which makes the model size invariant to the selection of the hyper-parameter k.

Adaptive aggregation Finally, we need to fuse the k different 1D-representations {X̂^{l,1}_1D, ..., X̂^{l,k}_1D} for the next layer.
Inspired by Auto-Correlation (Wu et al., 2021), the amplitudes A can reflect the relative importance of the selected frequencies and periods, thereby corresponding to the importance of each transformed 2D tensor. Thus, we aggregate the 1D-representations based on the amplitudes:

$$\widehat{A}^{l-1}_{f_1}, \cdots, \widehat{A}^{l-1}_{f_k} = \mathrm{Softmax}\big(A^{l-1}_{f_1}, \cdots, A^{l-1}_{f_k}\big),$$
$$\mathbf{X}^{l}_{1D} = \sum_{i=1}^{k} \widehat{A}^{l-1}_{f_i} \times \widehat{\mathbf{X}}^{l,i}_{1D}. \quad (6)$$

Since the variations within and between periods are already involved in multiple highly-structured 2D tensors, TimesBlock can fully capture multi-scale temporal 2D-variations simultaneously. Thus, TimesNet can achieve more effective representation learning than modeling the 1D time series directly.
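Putting Equations 5 and 6 together, one TimesBlock pass can be sketched as follows for a univariate series. This is a toy sketch under simplifying assumptions: `times_block` is an illustrative name, and `process_2d` stands in for the shared inception block (any function mapping a 2D array to a same-shaped array works here).

```python
import numpy as np

def softmax(a):
    w = np.exp(a - np.max(a))
    return w / w.sum()

def times_block(x, periods, amplitudes, process_2d):
    # For each selected period: pad, reshape to 2D (Reshape_{p_i, f_i}),
    # apply a 2D operation, reshape back, and truncate to length T (Eq. 5);
    # then fuse the k branches by softmax-normalized amplitudes (Eq. 6).
    T = len(x)
    branches = []
    for p in periods:
        f = int(np.ceil(T / p))                    # f_i columns (number of periods)
        padded = np.pad(x, (0, f * p - T))
        x2d = padded.reshape(f, p).T               # (p_i, f_i), as in the paper
        y2d = process_2d(x2d)
        branches.append(y2d.T.reshape(-1)[:T])     # Trunc(.) back to length T
    w = softmax(np.asarray(amplitudes, dtype=float))
    return sum(wi * b for wi, b in zip(w, branches))

x = np.sin(2 * np.pi * np.arange(96) / 24)
out = times_block(x, periods=[24, 12], amplitudes=[3.0, 1.0],
                  process_2d=lambda a: a)          # identity stand-in for Inception
assert out.shape == x.shape
assert np.allclose(out, x)  # identity branches weighted by a softmax reproduce x
```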

Generality in 2D vision backbones

Benefiting from the transformation of 1D time series into temporal 2D-variations, we can choose various computer vision backbones to replace the inception block for representation learning, such as the widely-used ResNet (He et al., 2016) and ResNeXt (Xie et al., 2017), the advanced ConvNeXt (Liu et al., 2022b), and attention-based models (Liu et al., 2021b). Thus, our temporal 2D-variation design also bridges the 1D time series to the booming 2D vision backbones, making time series analysis take advantage of the development of the computer vision community. In general, more powerful 2D backbones for representation learning bring better performance. Considering both performance and efficiency (Figure 4 right), we conduct the main experiments based on the parameter-efficient inception block as shown in Equation 5.

Published as a conference paper at ICLR 2023

4. EXPERIMENTS

To verify the generality of TimesNet, we extensively experiment on five mainstream analysis tasks, including short- and long-term forecasting, imputation, classification and anomaly detection.

Implementation Table 1 summarizes the benchmarks. More details about the datasets, experiment implementation and model configurations can be found in Appendix A.

4.1. MAIN RESULTS

As a foundation model, TimesNet achieves consistent state-of-the-art performance on five mainstream analysis tasks compared with other customized models (Figure 4 left). The full efficiency comparison is provided in Table 11 in the Appendix. Besides, by replacing the inception block with more powerful vision backbones, we can further promote the performance of TimesNet (Figure 4 right), confirming that our design enables time series analysis to take advantage of booming vision backbones.

4.2. SHORT-AND LONG-TERM FORECASTING

Setups Time series forecasting is essential in weather forecasting, traffic and energy consumption planning. To fully evaluate the model performance in forecasting, we adopt two types of benchmarks: long-term and short-term forecasting. For the long-term setting, we follow the benchmarks used in Autoformer (2021), including ETT (Zhou et al., 2021), Electricity (UCI), Traffic (PeMS), Weather (Wetterstation), Exchange (Lai et al., 2018) and ILI (CDC), covering five real-world applications. For the short-term setting, we adopt M4 (Spyros Makridakis, 2018), which contains yearly, quarterly and monthly collected univariate marketing data. Note that each dataset in the long-term setting only contains one continuous time series, from which we obtain samples by a sliding window, while M4 involves 100,000 different time series collected at different frequencies.

Results TimesNet shows great performance in both long-term and short-term settings (Tables 2-3). Concretely, TimesNet achieves state-of-the-art in more than 80% of cases in long-term forecasting (Table 13). For the M4 dataset, since the time series are collected from different sources, the temporal variations can be quite diverse, making forecasting much more challenging. Our model still performs best in this task, surpassing extensive advanced MLP-based and Transformer-based models.
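The sliding-window sampling used for the long-term benchmarks can be sketched as follows; `sliding_windows` is an illustrative helper, and the concrete input and prediction lengths per dataset are configuration choices not shown here.

```python
import numpy as np

def sliding_windows(series, input_len, pred_len, stride=1):
    # Cut one continuous series into (input, target) pairs: the model sees
    # `input_len` points and is asked to forecast the next `pred_len` points.
    samples = []
    last = len(series) - input_len - pred_len
    for start in range(0, last + 1, stride):
        x = series[start:start + input_len]
        y = series[start + input_len:start + input_len + pred_len]
        samples.append((x, y))
    return samples

series = np.arange(10)
pairs = sliding_windows(series, input_len=4, pred_len=2)
assert len(pairs) == 5                        # valid window starts: 0..4
assert pairs[0][0].tolist() == [0, 1, 2, 3]   # first input window
assert pairs[0][1].tolist() == [4, 5]         # its forecasting target
```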

4.3. IMPUTATION

Setups Real-world systems always work continuously and are monitored by automatic observation equipment. However, due to malfunctions, the collected time series can be partially missing, making downstream analysis difficult. Thus, imputation is widely used in practical applications. In this paper, we select datasets from the electricity and weather scenarios as our benchmarks, including ETT (Zhou et al., 2021), Electricity (UCI) and Weather (Wetterstation), where the data-missing problem happens commonly. To compare the model capacity under different proportions of missing data, we randomly mask the time points at ratios of {12.5%, 25%, 37.5%, 50%}.

Results Due to the missing time points, the imputation task requires the model to discover underlying temporal patterns from irregular and partially observed time series. As shown in Table 4, our proposed TimesNet still achieves consistent state-of-the-art in this difficult task, verifying the model capacity in capturing temporal variations from extremely complicated time series.
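The random masking used to build the imputation benchmark can be sketched as below; `mask_series` is a hypothetical helper (the benchmark masks points of multivariate deep sequences, while this toy works on one 1D array and marks masked points with NaN).

```python
import numpy as np

def mask_series(x, ratio, seed=0):
    # Randomly hide a fraction of time points, e.g. ratio=0.25 for the
    # 25% setting; returns the masked copy and the masked indices.
    rng = np.random.default_rng(seed)
    x = x.astype(float).copy()
    n_mask = int(round(len(x) * ratio))
    idx = rng.choice(len(x), size=n_mask, replace=False)
    x[idx] = np.nan
    return x, idx

x = np.arange(16, dtype=float)
masked, idx = mask_series(x, ratio=0.25)
assert np.isnan(masked).sum() == 4    # 25% of 16 points are masked
```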

4.4. CLASSIFICATION

Setups Time series classification can be used in recognition and medical diagnosis (Moody et al., 2011). We adopt sequence-level classification to verify the model capacity in high-level representation learning. Concretely, we select 10 multivariate datasets from the UEA Time Series Classification Archive (Bagnall et al., 2018), covering gesture, action and audio recognition, medical diagnosis by heartbeat monitoring and other practical tasks. Then, we pre-process the datasets following the descriptions in (Zerveas et al., 2021), where different subsets have different sequence lengths.

Results Notably, TimesNet surpasses DLinear, which performs well in some time series forecasting datasets. This is because DLinear only adopts a one-layer MLP model on the temporal dimension, which might be suitable for some autoregressive tasks with fixed temporal dependencies but degenerates a lot in learning high-level representations. In contrast, TimesNet unifies the temporal 2D-variations in 2D space, which is convenient for learning informative representations by 2D kernels, thereby benefiting the classification task that requires hierarchical representations.

4.5. ANOMALY DETECTION

Setups Detecting anomalies from monitoring data is vital to industrial maintenance. Since anomalies are usually hidden in large-scale data, making data labeling hard, we focus on unsupervised time series anomaly detection, which is to detect the abnormal time points. We compare models on five widely-used anomaly detection benchmarks: SMD (Su et al., 2019), MSL (Hundman et al., 2018), SMAP (Hundman et al., 2018), SWaT (Mathur & Tippenhauer, 2016) and PSM (Abdulaal et al., 2021), covering service monitoring, space & earth exploration, and water treatment applications. Following the pre-processing methods in Anomaly Transformer (2021), we split the dataset into consecutive non-overlapping segments by sliding window. In previous works, reconstruction is a classical task for unsupervised point-wise representation learning, where the reconstruction error is a natural anomaly criterion. For a fair comparison, we only change the base models for reconstruction and use the classical reconstruction error as the shared anomaly criterion for all experiments.

Results Table 5 demonstrates that TimesNet still achieves the best performance in anomaly detection, outperforming the advanced Transformer-based models FEDformer (2022) and Autoformer (2021). The canonical Transformer performs worse in this task (averaged F1-score 76.88%). This may stem from the fact that anomaly detection requires the model to find out rare abnormal temporal patterns (Lai et al., 2021), while the vanilla attention mechanism calculates the similarity between each pair of time points, which can be distracted by the dominant normal time points. Besides, by taking periodicity into consideration, TimesNet, FEDformer and Autoformer all achieve great performance. Thus, these results also demonstrate the importance of periodicity analysis, which can implicitly highlight the variations that violate periodicity, further benefiting anomaly detection.
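The shared reconstruction-error criterion can be sketched in a few lines. This is a minimal illustration, not the benchmark code: the threshold rule below (mean plus two standard deviations) is an assumption for the toy example, not the protocol used in the paper.

```python
import numpy as np

def anomaly_scores(x, x_rec):
    # Point-wise squared reconstruction error: large where the base model
    # fails to reconstruct the input, i.e. at abnormal time points.
    return (x - x_rec) ** 2

x = np.zeros(8)
x[5] = 10.0                                   # an injected spike
x_rec = np.zeros(8)                           # a model that reconstructs "normal"
scores = anomaly_scores(x, x_rec)
threshold = scores.mean() + 2 * scores.std()  # toy thresholding rule
assert np.flatnonzero(scores > threshold).tolist() == [5]
```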

Representation analysis

We attempt to explain model performance from the representation learning aspect. From Figure 6, we find that better performance in forecasting and anomaly detection corresponds to higher CKA similarity (2019), which is the opposite of the imputation and classification tasks. Note that lower CKA similarity means that the representations are more distinguishable among different layers, namely hierarchical representations. Thus, these results indicate the property of representations that each task requires. As shown in Figure 6, TimesNet can learn appropriate representations for different tasks, such as low-level representations for forecasting and for reconstruction in anomaly detection, and hierarchical representations for imputation and classification. In contrast, FEDformer (2022) performs well in the forecasting and anomaly detection tasks but fails in learning hierarchical representations, resulting in poor performance in imputation and classification. These results also verify the task-generality of our proposed TimesNet as a foundation model.

Temporal 2D-variations We provide a case study of temporal 2D-variations in Figure 7. We find that TimesNet can capture the multi-periodicities precisely. Besides, the transformed 2D tensor is highly structured and informative, where the columns and rows reflect the localities between time points and periods respectively, supporting our motivation for adopting 2D kernels for representation learning. See Appendix D for more visualizations.
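For reference, the linear variant of CKA used for such layer-similarity comparisons can be computed as below; this is a sketch of the standard linear-CKA formula on (samples x features) matrices, an assumption on our part since the paper does not spell out which CKA variant Figure 6 uses.

```python
import numpy as np

def linear_cka(X, Y):
    # Linear CKA between two representation matrices (samples x features):
    # center the features, then ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F).
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    num = np.linalg.norm(Y.T @ X, "fro") ** 2
    den = np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro")
    return num / den

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 16))
assert np.isclose(linear_cka(X, X), 1.0)            # identical layers: similarity 1
assert np.isclose(linear_cka(X, 2 * X + 1.0), 1.0)  # invariant to scaling and shift
```

High CKA between the first and last layer means the layers learn similar (low-level) representations; low CKA indicates hierarchical representations, matching the reading of Figure 6 above.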

5. CONCLUSION AND FUTURE WORK

This paper presents the TimesNet as a task-general foundation model for time series analysis. Motivated by the multi-periodicity, TimesNet can ravel out intricate temporal variations by a modular architecture and capture intraperiod-and interperiod-variations in 2D space by a parameter-efficient inception block. Experimentally, TimesNet shows great generality and performance in five mainstream analysis tasks. In the future, we will further explore large-scale pre-training methods in time series, which utilize TimesNet as the backbone and can generally benefit extensive downstream tasks.

A IMPLEMENTATION DETAILS

We provide the dataset descriptions and experiment configurations in Tables 6 and 7 (see Table 7 for details of d_min and d_max). This protocol makes the model powerful enough for multiple variates while keeping the model size compact. All the baselines that we reproduced are implemented based on the configurations of the original paper or official code. It is also notable that none of the previous methods were proposed for general time series analysis. For a fair comparison, we keep the input embedding and the final projection layer the same among different base models and only evaluate the capability of the base models. Especially for the forecasting task, we use an MLP on the temporal dimension to obtain the initialization of the predicted future. Since we focus on temporal variation modeling, we also adopt the Series Stationarization from Non-stationary Transformer (Liu et al., 2022a) to eliminate the effect of distribution shift.

For the metrics, we adopt the mean square error (MSE) and mean absolute error (MAE) for long-term forecasting and imputation. For anomaly detection, we adopt the F1-score, which is the harmonic mean of precision and recall. For short-term forecasting, following N-BEATS (Oreshkin et al., 2019), we adopt the symmetric mean absolute percentage error (SMAPE), mean absolute scaled error (MASE) and overall weighted average (OWA) as the metrics, where OWA is a special metric used in the M4 competition. These metrics can be calculated as follows:

$$\mathrm{SMAPE} = \frac{200}{H} \sum_{i=1}^{H} \frac{|\mathbf{X}_i - \widehat{\mathbf{X}}_i|}{|\mathbf{X}_i| + |\widehat{\mathbf{X}}_i|}, \qquad \mathrm{MAPE} = \frac{100}{H} \sum_{i=1}^{H} \frac{|\mathbf{X}_i - \widehat{\mathbf{X}}_i|}{|\mathbf{X}_i|},$$

$$\mathrm{MASE} = \frac{1}{H} \sum_{i=1}^{H} \frac{|\mathbf{X}_i - \widehat{\mathbf{X}}_i|}{\frac{1}{H-m}\sum_{j=m+1}^{H} |\mathbf{X}_j - \mathbf{X}_{j-m}|}, \qquad \mathrm{OWA} = \frac{1}{2} \left[ \frac{\mathrm{SMAPE}}{\mathrm{SMAPE}_{\text{Naïve2}}} + \frac{\mathrm{MASE}}{\mathrm{MASE}_{\text{Naïve2}}} \right],$$

where m is the periodicity of the data, and X, X̂ ∈ R^{H×C} are the ground truth and prediction of the future with H time points and C dimensions. X_i denotes the i-th future time point.
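Two of these metrics can be sketched directly from the formulas above; the function names are illustrative, and the MASE scale term is computed over the ground-truth horizon exactly as stated (OWA additionally needs the Naïve2 baseline scores and is omitted).

```python
import numpy as np

def smape(y, y_hat):
    # Symmetric MAPE in percent (the 200/H form used for M4).
    return 200.0 / len(y) * np.sum(np.abs(y - y_hat) / (np.abs(y) + np.abs(y_hat)))

def mase(y, y_hat, m):
    # MASE: mean absolute error scaled by the mean absolute
    # seasonal (period-m) difference of the ground truth.
    scale = np.mean(np.abs(y[m:] - y[:-m]))
    return np.mean(np.abs(y - y_hat)) / scale

y = np.array([10.0, 20.0, 30.0])
assert smape(y, y) == 0.0
assert smape(np.array([10.0]), np.array([30.0])) == 100.0
assert np.isclose(mase(y, y + 1.0, m=1), 0.1)   # scale = 10, mean error = 1
```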

B HYPER-PARAMETER SENSITIVITY

We introduce a hyper-parameter $k$ in Equation 1 to select the most significant frequencies. We provide the sensitivity analysis for this hyper-parameter in Figure 8. We can find that our proposed TimesNet presents stable performance under different choices of $k$ in all four tasks. Especially, we select the 37.5% mask ratio for the hyper-parameter experiments. For classification, we choose the two largest subsets, SpokenArabicDigits and FaceDetection, for evaluation. For short-term forecasting, we adopt the weighted average for the sensitivity analysis. Besides, from Figure 8, we can also make the following observations:

• For the low-level modeling tasks, such as forecasting and anomaly detection, the selection of $k$ affects the final performance more. This may be because $k$ directly affects the amount of information in the deep representations.

• For the high-level modeling tasks, such as classification and imputation, the model performance is more robust to the selection of $k$, since the key to these tasks is to extract hierarchical representations.

Giving consideration to both efficiency and performance, we set $k = 3$ for imputation, classification and anomaly detection, and $k = 5$ for short-term forecasting.
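The period-discovery step that $k$ controls can be sketched as follows. `topk_frequencies` is a hypothetical helper mirroring the procedure of Equation 1 (FFT along time, amplitudes averaged over batch and channels, the $k$ largest non-DC frequencies kept), not the paper's exact implementation.

```python
import torch

def topk_frequencies(x, k=3):
    """Pick the k dominant frequencies of a batch of series.

    x: [batch, length, channels]. Returns the corresponding period
    lengths and the amplitude of each selected frequency.
    """
    xf = torch.fft.rfft(x, dim=1)
    amp = xf.abs().mean(dim=0).mean(dim=-1)   # [length // 2 + 1]
    amp[0] = 0.0                              # mask out the DC component
    top_amp, top_freq = torch.topk(amp, k)
    periods = x.shape[1] // top_freq          # period = length / frequency
    return periods, top_amp
```

For a pure sine wave of period 24 over a length-96 window, the strongest frequency index is 4, so the recovered period is 96 / 4 = 24.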

C ABLATION STUDIES

To elaborate the properties of our proposed TimesNet, we conduct detailed ablations on the representation learning in 2D space, the model architecture and the adaptive aggregation.

2D space As shown in Table 8, replacing the inception block with more powerful vision blocks brings further performance promotion, such as ResNeXt (Xie et al., 2017), Swin Transformer (Liu et al., 2021b) and ConvNeXt (Liu et al., 2022b). It is also notable that using independent parameters brings improvement as well, while this makes the model size depend on the selection of the hyper-parameter $k$. Considering both efficiency and model performance, we choose the parameter-efficient inception block as our final solution. These results also verify that our design bridges 1D time series analysis with 2D computer vision backbones.

Model architecture We also conduct experiments on different architectures. Surprisingly, as shown in Table 9, we find that combining with the deep decomposition architecture of Autoformer (Wu et al., 2021) cannot bring further promotion. These results may come from the fact that, when the input series already presents clear periodicity, our design can capture the 2D-variations effectively. As for the case without clear periodicity, the model learns the most significant frequency as 1, so that the trend of the time series is covered by the intraperiod-variation modeling. These results also verify that our proposed TimesNet can handle time series without clear periodicity. Besides, in this paper, to take advantage of the deep representations, we place the transformation from 1D-variations to 2D-variations in every TimesBlock. Here, we compare our design with the case that only conducts the transformation on the raw input series. From Table 9, we can find that the performance of Transform raw data degenerates a lot (Avg F1-score: 85.49% → 84.85%), indicating the advantage of our design.
+ decomposition: 87.44 78.49 82.72 | 83.48 86.47 84.95 | 91.64 57.34 70.54 | 89.68 95.60 92.54 | 98.42 93.12 95.69 | Avg F1 85.29
Transform raw data: 86.83 79.17 82.82 | 85.23 86.47 85.84 | 91.92 57.60 70.82 | 87.68 95.81 91.57 | 97.64 89.14 93.20 | Avg F1 84.85
(each group lists P / R / F1 on one anomaly detection benchmark)

Adaptive aggregation As shown in Equation 6, following the design of Autoformer (2021), we adopt the amplitudes after the Softmax function as the aggregation weights of the processed tensors $\{\widehat{\mathbf{X}}_{\text{1D}}^{l,1}, \cdots, \widehat{\mathbf{X}}_{\text{1D}}^{l,k}\}$. Here we include two variants for comparison. The first is directly-sum, namely $\sum_{i=1}^{k} \widehat{\mathbf{X}}_{\text{1D}}^{l,i}$. The second removes the Softmax function, that is $\sum_{i=1}^{k} \widehat{\mathbf{A}}_{f_i}^{l-1} \times \widehat{\mathbf{X}}_{\text{1D}}^{l,i}$.
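The three aggregation variants compared above can be sketched in one function. `aggregate` is an illustrative helper of ours, not the paper's code: it fuses the $k$ processed tensors with Softmax-normalized amplitude weights (the paper's choice), a plain sum, or raw unnormalized amplitudes.

```python
import torch

def aggregate(outputs, amplitudes, mode="softmax"):
    """Fuse k processed tensors using one of the three compared variants.

    outputs: [k, ...] stacked per-period results; amplitudes: [k].
    """
    if mode == "softmax":      # paper's design: normalized amplitude weights
        w = torch.softmax(amplitudes, dim=0)
    elif mode == "sum":        # directly-sum variant (uniform unit weights)
        w = torch.ones_like(amplitudes)
    elif mode == "raw":        # Softmax removed: raw amplitude weights
        w = amplitudes
    else:
        raise ValueError(mode)
    w = w.view(-1, *([1] * (outputs.dim() - 1)))  # broadcast over trailing dims
    return (w * outputs).sum(dim=0)
```

With equal amplitudes, the Softmax variant averages the $k$ tensors, while directly-sum simply adds them, which changes the output scale with $k$.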

D MORE REPRESENTATION ANALYSIS

To give an intuitive understanding of 2D-variations, we visualize the transformed 2D tensors in Figure 9. From the visualization, we can obtain the following observations:

• The interperiod-variation can present the long-term trends of the time series. For example, in the first case of Exchange, the values in each row decrease from left to right, indicating the downtrend of the raw series. For the ETTh1 dataset, the values in each row are similar to each other, reflecting the globally stable variation of the raw series.

• For time series without clear periodicity, the temporal 2D-variations can still present an informative 2D structure. If the frequency is one, the intraperiod-variation is just the original variation of the raw series. Besides, the interperiod-variation can still present the long-term trend, benefiting the temporal variation modeling.

• The transformed 2D-variations demonstrate two types of locality. Firstly, for each column (intraperiod-variation), adjacent values are close to each other, presenting the locality among adjacent time points. Secondly, for each row (interperiod-variation), adjacent values are also close, corresponding to the locality among adjacent periods. Note that non-adjacent periods can be quite different from each other, which can be caused by a global trend, such as in the case from the Exchange dataset. These observations of locality also motivate us to adopt 2D kernels for representation learning.
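The folding that produces these 2D tensors can be written in a few lines of PyTorch. `to_2d` is a hypothetical helper illustrating the transformation described in the paper; zero-padding for lengths not divisible by the period is our assumption for the sketch.

```python
import torch
import torch.nn.functional as F

def to_2d(x, period):
    """Fold a 1D series [B, T, C] into a 2D tensor [B, C, rows, period].

    Each row then holds one full period, so columns capture
    intraperiod-variation and rows capture interperiod-variation.
    """
    B, T, C = x.shape
    rows = -(-T // period)                 # ceiling division
    pad = rows * period - T
    x = F.pad(x, (0, 0, 0, pad))           # zero-pad the time dimension
    return x.reshape(B, rows, period, C).permute(0, 3, 1, 2)
```

For example, a length-6 series folded with period 3 becomes a 2 x 3 grid whose second row starts at the fourth time point, matching the row/column structure described above.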

E MULTI-PERIODICITY OF TIME SERIES

As shown in Figure 10, we calculate the density of each period length for different datasets. We can find that real-world time series present multi-periodicity to some extent. For example, the Electricity dataset contains periods with length 12 and length 24.

Figure 10: Statistics of period length in the experimental datasets: (a) Electricity, (b) ETTh1, (c) Exchange, (d) Weather. We conduct FFT on the raw data and select the top-6 significant frequencies for each length-96 segment. Then, we record the corresponding period lengths and plot the normalized density for each period length.
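The statistic behind Figure 10 can be reproduced with a short NumPy routine. `period_density` is our sketch of the described procedure (FFT per length-96 segment, top-6 non-DC frequencies, periods tallied into a normalized density), not the paper's plotting code.

```python
import numpy as np

def period_density(series, seg_len=96, top_k=6):
    """Normalized histogram of dominant period lengths over fixed segments."""
    counts = {}
    for start in range(0, len(series) - seg_len + 1, seg_len):
        seg = series[start:start + seg_len]
        amp = np.abs(np.fft.rfft(seg))
        amp[0] = 0.0                          # drop the DC term
        for f in np.argsort(amp)[-top_k:]:    # top-k significant frequencies
            if f > 0:
                p = seg_len // int(f)         # frequency -> period length
                counts[p] = counts.get(p, 0) + 1
    total = sum(counts.values())
    return {p: c / total for p, c in counts.items()}
```

On a synthetic series with a single period of 24, the density concentrates on period length 24, mirroring the peaks visible in the figure.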

F SHOWCASES

To provide a clear comparison among different models, we provide showcases for the regression tasks, including imputation (Figure 11), long-term forecasting (Figure 12) and short-term forecasting (Figure 13). Especially in the imputation task, the MLP-based models degenerate a lot. This is because the input series has been randomly masked, while the MLP-based models adopt fixed model parameters to model the temporal dependencies among time points, thereby failing in this task. As shown in Table 11, our proposed TimesNet achieves the best performance in all five tasks. Among the top three models, TimesNet also achieves the greatest efficiency. Compared to the MLP-based models, our proposed TimesNet shows a significant advantage in performance. Benefiting from the utilization of 2D kernels and the parameter-efficient design, its parameter size is invariant when the input series changes. Compared to the Transformer-based models, TimesNet offers great efficiency in GPU memory, which is essential in long sequence modeling.


H MODEL PERFORMANCE IN MIXED DATASET

To verify the model capacity for large-scale pre-training, we evaluate the model performance when it is trained on a mixed dataset. Concretely, we mix the hourly-collected ETTh1 and ETTh2 with the 15-minute-collected ETTm1 and ETTm2 into one dataset. Note that this mixed dataset contains diverse temporal patterns and periodicities across data instances, making unified training challenging. From Table 12, we can find that TimesNet handles this mixed dataset well and generally promotes the model performance on the four independent subsets. Besides, we can also find that, except for TimesNet, mixed training may decrease the performance of the other baselines on some subsets, indicating that they cannot handle the complex periodicities in the mixed dataset. These results also verify the potential of TimesNet to perform as a general-purpose backbone for large-scale pre-training in time series.

I FULL RESULTS

Due to the space limitation of the main text, we place the full results of all experiments in the following: long-term forecasting in Table 13, short-term forecasting in Table 14, imputation in Table 16, classification in Table 17 and anomaly detection in Table 15.

Yearly (Table 14 row) SMAPE: 13.387 13.418 13.436 18.009 14.247 16.965 13.728 13.717 13.974 15.530 14.727 17.107 16.169 176.040 14.920

* The original paper of N-BEATS (Oreshkin et al., 2019) adopts a special ensemble method to promote the performance. For fair comparison, we remove the ensemble and only compare the pure forecasting models.

Table 15: Full results for the anomaly detection task. P, R and F1 represent the precision, recall and F1-score (%) respectively. F1-score is the harmonic mean of precision and recall. Higher values of P, R and F1 indicate better performance.



Figure 2: A univariate example to illustrate the 2D structure in time series. By discovering the periodicity, we can transform the original 1D time series into structured 2D tensors, which can be processed by 2D kernels conveniently. By conducting the same reshape operation on all variates of the time series, we can extend the above process to multivariate time series.


Figure 3: Overall architecture of TimesNet. TimesNet is stacked by TimesBlocks in a residual way. TimesBlocks can capture various temporal 2D-variations from k different reshaped tensors by a parameter-efficient inception block in 2D space and fuse them based on normalized amplitude values.

Figure 4: Model performance comparison (left) and generality in different vision backbones (right).

Figure 6: Representation analysis in four tasks. For each model, we calculate the centered kernel alignment (CKA) similarity (2019) between representations from the first and the last layers. A higher CKA similarity indicates more similar representations. TimesNet is marked by red stars.

All experiments are repeated three times, implemented in PyTorch (Paszke et al., 2019) and conducted on a single NVIDIA TITAN RTX 24GB GPU. To make the model handle the various dimensions of input series from different datasets, we select $d_{\text{model}}$ based on the input series dimension $C$ by $\min\{\max\{2^{\lceil \log C \rceil}, d_{\min}\}, d_{\max}\}$ (see Table 7 for the details of $d_{\min}$ and $d_{\max}$).
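The channel-adaptive width rule above fits in one line of Python. This is a sketch assuming a base-2 logarithm and illustrative default bounds; the actual per-dataset $d_{\min}$ and $d_{\max}$ values are listed in Table 7.

```python
import math

def select_d_model(c, d_min=32, d_max=512):
    """Model width: min(max(2^ceil(log2(C)), d_min), d_max) for C input channels.

    d_min/d_max defaults here are illustrative, not the paper's settings.
    """
    return min(max(2 ** math.ceil(math.log2(c)), d_min), d_max)
```

For example, 7 input variates round up to 8 but are clipped to the floor of 32, while hundreds of variates saturate at the ceiling of 512, which keeps the model compact across datasets.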

Figure 8: Sensitivity analysis of the hyper-parameter k in the imputation and anomaly detection tasks.

Figure 9: More showcases for temporal 2D-variations.


Figure 11: Visualization of ETTm1 imputation results given by the models under the 50% mask ratio setting. The black lines stand for the ground truth and the orange lines for the predicted values.

G MODEL EFFICIENCY ANALYSIS

To summarize model performance and efficiency, we calculate the relative performance rankings of the compared baselines. The rankings cover the models common to all five tasks: LSTM (1997) and LSSL (2022); TCN (2019); LightTS (2022) and DLinear (2023); Reformer (2020), Informer (2021), Pyraformer (2021a), Autoformer (2021), FEDformer (2022), Non-stationary Transformer (2022a), ETSformer (2022) and our proposed TimesNet, namely 13 models in total.

Summary of experiment benchmarks.

Long-term forecasting task. The past sequence length is set as 36 for ILI and 96 for the others. All the results are averaged over 4 different prediction lengths: {24, 36, 48, 60} for ILI and {96, 192, 336, 720} for the others. See Table 13 in the Appendix for the full results.

Short-term forecasting task on M4. The prediction lengths are in [6, 48] and the results are weighted averages over several datasets under different sample intervals. See Table 14 for the full results.

Imputation task. We randomly mask {12.5%, 25%, 37.5%, 50%} of the time points in length-96 time series. The results are averaged over the 4 different mask ratios. See Table 16 for the full results.
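The random masking used to build the imputation benchmark can be sketched as follows. `random_mask` is an illustrative helper of ours (masked points set to zero, whole time points masked across channels), not the paper's data pipeline.

```python
import numpy as np

def random_mask(x, ratio, seed=0):
    """Randomly mask a fraction of time points for the imputation task.

    x: [T, C] series; ratio would be one of {0.125, 0.25, 0.375, 0.5}.
    Returns the masked series and the boolean mask of observed points.
    """
    rng = np.random.default_rng(seed)
    mask = rng.random(x.shape[0]) >= ratio  # True = observed time point
    out = x.copy()
    out[~mask] = 0.0                        # zero out masked time points
    return out, mask
```

Averaging the resulting MSE/MAE over the four ratios then gives the summary numbers reported in the table.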

Anomaly detection task. We calculate the F1-score (as %) for each dataset. '*.' means the *former. A higher F1-score indicates better performance. See Table 15 for the full results.

* We replace the joint criterion in Anomaly Transformer (2021) with the reconstruction error for a fair comparison.

Dataset descriptions. The dataset size is organized in (Train, Validation, Test).

Experiment configuration of TimesNet. All the experiments use the ADAM (2015) optimizer with the default hyper-parameter configuration $(\beta_1, \beta_2) = (0.9, 0.999)$.

Ablations on the representation learning in 2D space, where we replace the parameter-efficient inception block with other well-acknowledged vision backbones. See Figure 4 for the efficiency comparison.

Inception (Shared) row: 87.76 82.63 85.12 | 82.97 85.42 84.18 | 91.50 57.80 70.85 | 88.31 96.24 92.10 | 98.22 92.21 95.21 | Avg F1 85.49
(each group lists P / R / F1 on one anomaly detection benchmark)

* In this paper, we adopt a parameter-efficient design that uses the same parameters for the k different transformed 2D tensors, namely Shared. For comparison, we also include the independent design that uses different parameters for different transformed 2D tensors, namely Ind.

Ablations on model architecture. + decomposition combines the deep decomposition architecture proposed by Autoformer (Wu et al., 2021) with TimesNet. Transform raw data refers to conducting the transformation on the original time series instead of on the deep features.

Ablations on adaptive aggregation. Table 10 demonstrates that our adaptive aggregation design performs the best.

Softmax row: 87.27 79.31 83.10 | 83.91 86.47 85.17 | 91.93 58.57 71.55 | 87.13 95.81 91.27 | 98.00 92.48 95.16 | Avg F1 85.25
(each group lists P / R / F1 on one anomaly detection benchmark)

Model efficiency comparison and their rankings in five tasks. The efficiency measurements are recorded on the imputation task of ETTh1 dataset. The rankings are organized in the order of long-and short-term forecasting, imputation, classification and anomaly detection. "/" indicates the out-of-memory situation. A smaller ranking means better performance.

Comparison between unified training and independent training for the imputation task. Each row lists MSE or MAE under the four mask ratios on each of the four subsets; the name of the first model and its leading value were lost in extraction and are left as-is.

(first model)
  Unified     MSE  [...] .048 0.060 0.078 | 0.023 0.027 0.030 0.034 | 0.066 0.086 0.114 0.133 | 0.042 0.049 0.055 0.065
              MAE  0.122 0.146 0.163 0.185 | 0.091 0.102 0.109 0.117 | 0.174 0.200 0.229 0.247 | 0.135 0.147 0.157 0.171
  Independent MSE  0.034 0.046 0.057 0.067 | 0.023 0.026 0.030 0.035 | 0.074 0.090 0.109 0.137 | 0.044 0.050 0.060 0.068
              MAE  0.124 0.144 0.161 0.174 | 0.092 0.101 0.108 0.119 | 0.182 0.203 0.222 0.248 | 0.138 0.149 0.163 0.173
FEDformer
  Unified     MSE  0.041 0.057 0.073 0.099 | 0.060 0.089 0.125 0.172 | 0.077 0.101 0.130 0.164 | 0.087 0.125 0.161 0.214
              MAE  0.143 0.169 0.192 0.224 | 0.166 0.205 0.244 0.287 | 0.196 0.228 0.258 0.289 | 0.204 0.246 0.283 0.326
  Independent MSE  0.035 0.052 0.069 0.089 | 0.056 0.080 0.110 0.156 | 0.070 0.106 0.124 0.165 | 0.095 0.137 0.187 0.232
              MAE  0.135 0.166 0.191 0.218 | 0.159 0.195 0.231 0.276 | 0.190 0.236 0.258 0.299 | 0.212 0.258 0.304 0.341
TimesNet
  Unified     MSE  0.019 0.023 0.028 0.037 | 0.018 0.020 0.022 0.025 | 0.035 0.046 0.057 0.075 | 0.032 0.036 0.040 0.047
              MAE  0.091 0.099 0.109 0.123 | 0.075 0.081 0.086 0.095 | 0.126 0.144 0.159 0.181 | 0.112 0.119 0.129 0.140

Full results for the short-term forecasting task on the M4 dataset. '*.' in the Transformers indicates the name of *former. Stationary means the Non-stationary Transformer. Models: TimesNet, N-HiTS, N-BEATS*, ETS., LightTS, DLinear, FED., Stationary, Auto., Pyra.

Quarterly
  SMAPE 10.100 10.202 10.124 13.376 11.364 12.145 10.792 10.958 11.338 15.449 11.360 13.207 13.313 172.808 11.122 65.999
  MASE  1.182 1.194 1.169 1.906 1.328 1.520 1.283 1.325 1.365 2.350 1.401 1.827 1.775 19.753 1.360 17.662
  OWA   0.890 0.899 0.886 1.302 1.000 1.106 0.958 0.981 1.012 1.558 1.027 1.266 1.252 15.049 1.001 9.436
Monthly
  SMAPE 12.670 12.791 12.677 14.588 14.014 13.514 14.260 13.917 13.958 17.642 14.062 16.149 20.128 143.237 15.626 64.664
  MASE  0.933 0.969 0.937 1.368 1.053 1.037 1.102 1.097 1.103 1.913 1.141 1.660 2.614 16.551 1.274 16.245
  OWA   0.878 0.899 0.880 1.149 0.981 0.956 1.012 0.998 1.002 1.511 1.024 1.340 1.927 12.747 1.141 9.879
Others
  SMAPE 4.891 5.061 4.925 7.267 15.880 6.709 4.954 6.302 5.485 24.786 24.460 23.236 32.491 186.282 7.186 121.844
  MASE  3.302 3.216 3.391 5.240 11.434 4.953 3.264 4.064 3.865 18.581 20.960 16.288 33.355 119.294 4.677 91.650
  OWA   1.035 1.040 1.053 1.591 3.474 1.487 1.036 1.304 1.187 5.538 5.879 5.013 8.679 38.411 1.494 27.273
Weighted Average
  SMAPE 11.829 11.927 11.851 14.718 13.525 13.639 12.840 12.780 12.909 16.987 14.086 16.018 18.200 160.031 13.961 67.156
  MASE  1.585 1.613 1.599 2.408 2.111 2.095 1.701 1.756 1.771 3.265 2.718 3.010 4.223 25.788 1.945 21.208
  OWA   0.851 0.861 0.855 1.172 1.051 1.051 0.918 0.930 0.939 1.480 1.230 1.378 1.775 12.642 1.023 8.021

Full results for the imputation task. We randomly mask 12.5%, 25%, 37.5% and 50% of the time points to compare the model performance under different missing degrees. '*.' in the Transformers indicates the name of *former. For each mask ratio, every model reports MSE and MAE.

ACKNOWLEDGMENTS

This work was supported by the National Key Research and Development Plan (2020AAA0109201), National Natural Science Foundation of China (62022050 and 62021002), Civil Aircraft Research Project (MZJ3-2N21), Beijing Nova Program (Z201100006820041), and CCF-Ant Group Green Computing Fund.

AVAILABILITY

https://github.com/thuml/TimesNet.


* means that there are some mismatches between our input-output settings and their papers. We adopt their official codes and only change the length of the input and output sequences for a fair comparison.

Avg (Table 16 average row; MSE / MAE pairs per model): 0.030 0.054 | 0.076 0.171 | 0.055 0.117 | 0.052 0.110 | 0.099 0.203 | 0.032 0.059 | 0.031 0.057 | 0.152 0.235 | 0.045 0.104 | 0.039 0.076 | 0.038 0.087 | 0.365 0.434 | 0.183 0.291 | 0.045 0.108

