TI-MAE: SELF-SUPERVISED MASKED TIME SERIES AUTOENCODERS

Abstract

Multivariate time series forecasting has become an increasingly popular topic across applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance on many long-term series forecasting tasks. However, several issues remain in existing methods. First, the training paradigm of contrastive learning is inconsistent with downstream prediction tasks, leading to inaccurate prediction results. Second, existing Transformer-based models, which resort to similar patterns in historical time series data to predict future values, generally induce severe distribution shift problems and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges existing representation learning and generative Transformer-based methods, reducing the gap between upstream and downstream forecasting tasks while maintaining full utilization of the original time series data. Experiments on several public real-world datasets demonstrate that our masked autoencoding framework can learn strong representations directly from raw data, yielding better performance on time series forecasting and classification tasks. The code will be made public after this paper is accepted.

1. INTRODUCTION

Time series modeling is urgently needed in many fields, such as time series classification (Dau et al., 2019), demand forecasting (Carbonneau et al., 2008), and anomaly detection (Laptev et al., 2017). Recently, long sequence time series forecasting (LSTF), which aims to predict the change of values over a long future period, has attracted significant interest from researchers. In previous work, most self-supervised representation learning methods on time series aim to learn transformation-invariant features via contrastive learning for downstream tasks. Although these methods perform well on classification tasks, a gap remains between their performance and that of supervised models on forecasting tasks. Apart from the inevitable distortion of time series caused by augmentation strategies borrowed from vision or language, the inconsistency between upstream contrastive learning and downstream forecasting tasks is likely another major cause of this problem. Besides, as the latest contrastive learning frameworks (Yue et al., 2022; Woo et al., 2022a) report, Transformer (Vaswani et al., 2017) performs worse than CNN-based backbones, which is also inconsistent with common experience. This motivates us to reveal the differences and relationships between existing contrastive learning and supervised methods on time series. As an alternative to contrastive learning, denoising autoencoders (Vincent et al., 2008) have also been used as an auxiliary task to learn intermediate representations from the data. Owing to the ability of Transformer to capture long-range dependencies, many existing methods (Zhou et al., 2021; Wu et al., 2021; Woo et al., 2022b) focus on reducing the time complexity and memory usage caused by the vanilla attention mechanism, e.g., via sparse attention or correlation, in order to process longer time series.
These Transformer-based models all follow the same training paradigm shown in Figure 1a, which learns similar patterns from input historical time series segments and predicts future values. This can be seen as a special continuous masking strategy that only masks the future time series and reconstructs it. However, this continuous masking strategy is usually accompanied by two severe problems. For one thing, continuous masking limits the learning ability of the model, which captures only the information of the visible sequence and some mapping relationship between the historical and future segments. Similar problems have been reported in vision tasks (Zhang et al., 2017). For another, continuous masking induces severe distribution shift, especially when the prediction horizon is longer than the input sequence. In reality, most time series data collected from real scenarios are non-stationary: their mean or variance changes over time. Similar problems were also observed in previous studies (Qiu et al., 2018; Oreshkin et al., 2020; Wu et al., 2021; Woo et al., 2022a). Most of them try to disentangle the input time series into a trend part and a seasonality part, in order to better capture periodic features and to make the model robust to outlier noise. Specifically, they obtain the trend of the input time series via a moving average, implemented by one average pooling layer with a fixed-size sliding window. They then capture seasonality features from the periodic sequence obtained by simply subtracting the trend from the original signal. To further clarify the mechanism of this disentanglement, we propose a simple but comprehensible description of a disentangled time series as

y(t) = Trend(t) + Seasonality(t) + Noise.  (1)

For illustration, we use a polynomial series Σ_n a_n t^n and a Fourier cosine series Σ_n b_n cos(nt) to describe the trend and seasonality parts of the original time series in Eq. (1), respectively.
Apparently, the seasonality part is stationary once we set a proper observation horizon (not less than the maximum period of the seasonality parts), while the moments of the trend part change continuously over time. Figure 2 illustrates that the size of the sliding window in the average pooling layer plays a vital role in the quality of the disentangled trend. Natural time series data generally have more complex periodic patterns, which means we have to employ longer sliding windows or other hierarchical disposals. In addition, when a moving average is used to capture the trend, both ends of the sequence must be padded for alignment, which causes inevitable data distortion at the head and tail. These observed phenomena suggest there are still unresolved issues in the current disentanglement. To address these issues, this paper proposes a novel Transformer-based framework named Ti-MAE, shown in Figure 3. Ti-MAE randomly masks out parts of the embedded time series data and learns an autoencoder to reconstruct them at the point level during training. Figure 1 shows the difference between random masking and the fixed continuous masking in end-to-end models: we adequately leverage the whole input sequence with different combinations of visible tokens. Random masking takes the overall distribution of the inputs into consideration and can therefore alleviate the distribution shift problem. Moreover, with the power of pre-training or representation learning embodied in the encoder-decoder structure, Ti-MAE provides a universal scheme for both forecasting and classification. The contributions of our work are summarized as follows: • We provide a novel perspective that bridges existing contrastive learning and generative Transformer-based models on time series and point out their inconsistency and deficiencies on downstream tasks.
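The moving-average disentanglement discussed above can be sketched in a few lines. This is only a minimal illustration of the mechanism, not the implementation of any cited model; the function name and the edge-replication padding are our own choices for the alignment padding mentioned in the text.

```python
import numpy as np

def decompose(x: np.ndarray, window: int):
    """Split a 1-D series into trend and seasonality via moving average.

    Both ends are padded by edge replication so the trend stays aligned
    with the input -- the source of the head/tail distortion discussed above.
    """
    pad = window // 2
    padded = np.pad(x, (pad, window - 1 - pad), mode="edge")
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")
    seasonality = x - trend
    return trend, seasonality

# A cosine plus linear trend, as in the illustration of Eq. (1)
t = np.arange(200)
x = np.cos(2 * np.pi * t / 25) + 0.05 * t
trend, season = decompose(x, window=75)
```

With a window of 75 (three full periods of the cosine), the interior of the trend estimate is nearly linear, while shorter windows leak seasonality into the trend, matching the Figure 2 observation.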
et al., 2021), which is in contrast to the Transformer's ability to capture long-range dependencies. Some of the latest works, such as ETSformer (Woo et al., 2022b) and FEDformer (Zhou et al., 2022), also rely heavily on disentanglement and extra introduced domain knowledge.

2.2. TIME SERIES REPRESENTATION LEARNING

Self-supervised representation learning has achieved good performance in the time series domain, especially with contrastive learning used to obtain a good intermediate representation. Lei et al. (2019) and Franceschi et al. (2019) used metric learning loss functions to preserve pairwise similarities in the time domain. CPC (van den Oord et al., 2018) first proposed contrastive predictive coding and InfoNCE, which treats data from the same sequence as positive pairs and noise data from the mini-batch as negative pairs. Different data augmentations on time series were proposed to capture transformation-invariant features at the semantic level (Eldele et al., 2021; Yue et al., 2022). CoST (Woo et al., 2022a) introduced extra inductive biases in the frequency domain through the DFT and separately processed the disentangled trend and seasonality parts of the original time series to encourage discriminative seasonal and trend representations. Almost all of these methods rely heavily on data augmentation or other domain knowledge such as hierarchy and disentanglement.

2.3. MASKED DATA MODELING

Masked language modeling is a widely adopted method for pre-training in NLP. BERT (Devlin et al., 2019) holds out a portion of the input sequence and predicts the missing content during training, which yields good representations for various downstream tasks. Masked image encoding methods are also used for learning image representations. Pathak et al. (2016) recovered a small portion of missing regions using convolution. Motivated by the huge successes in NLP, recent methods (Bao et al., 2021; Dosovitskiy et al., 2021) resort to Transformers to predict unknown pixels. MAE (He et al., 2021) masks a high proportion of image patches and reconstructs them with an asymmetric encoder-decoder, learning strong representations efficiently.

3. METHODOLOGY

3.1. PROBLEM DEFINITION

Let X = (x_1, x_2, ..., x_T) ∈ R^{T×m} be a multivariate time series instance of length T, where m is the dimension of each signal. Given a historical multivariate time series segment X_h ∈ R^{h×m} of length h, forecasting tasks aim to predict the values of the next k steps, X_f ∈ R^{k×n}, where n ≤ m. For classification tasks, we match each time series instance X to its categorical ground truth from a set of labels C.

3.2. MODEL ARCHITECTURE

The overall architecture of Ti-MAE is shown in Figure 3. As with all autoencoders, our framework has an encoder that maps the observed time series signal X ∈ R^{T×m} to a latent representation H ∈ R^{T×n}, and a decoder that reconstructs the original sequence from the encoder's embedding over timestamps. Motivated by the great success of other MAE-style approaches (He et al., 2021; Feichtenhofer et al., 2022; Hou et al., 2022), we also adopt an asymmetric design: the encoder operates only on the visible tokens after masking the input embedding, and a lighter decoder processes the encoded tokens padded with mask tokens and reconstructs the original time series at the point level. More details of each component are introduced as follows. Input embedding. Unlike other time series modeling methods, we do not adopt any multi-scale or complex convolution scheme such as dilated convolution. Given a time series segment, we directly use one 1-D convolutional layer to extract local temporal features on each timestamp across channels. Fixed sinusoidal positional embeddings are added to maintain position information. Different from other temporal data embedding approaches, we do not add any handcrafted task-specific or date-specific embeddings, so as to introduce as little inductive bias as possible. Masking. After tokenizing the original temporal data into tokens on timestamps, we randomly sample a subset of tokens without replacement following a uniform distribution and mask the remaining parts. It is hypothesized in (He et al., 2021; Feichtenhofer et al., 2022) that the masking ratio relates to the information density and redundancy of the data, which has an immense impact on the performance of the autoencoder. Generally speaking, natural language has higher information density due to its highly discrete word distribution, while images have heavy spatial redundancy.
Specifically, a single pixel in an image carries little semantic information, so a missing region can be reconstructed from neighboring pixels by interpolation with little understanding of the content. Thus, data with lower information density should receive a higher masking ratio to largely eliminate redundancy and prevent the model from focusing only on low-level semantic information. As a benchmark model in natural language, BERT (Devlin et al., 2019) uses a masking ratio of 15%, while MAE uses a ratio of 75% for images (He et al., 2021) and 90% for videos (Feichtenhofer et al., 2022). Similar to images, time series data also have local continuity, so we should adopt a high masking ratio during training. The optimal masking ratio of multivariate time series we observe is also around 75%. Ti-MAE Encoder. Our encoder is a stack of vanilla Transformer blocks with input embedding, but utilizes pre-norm instead of post-norm in each block, as shown in Figure 4. Like other MAE-style methods, Ti-MAE's encoder is applied only to visible tokens after embedding and random masking. This design significantly reduces time complexity and memory usage compared to full encoding. Ti-MAE Decoder. Our decoder also contains a stack of vanilla Transformer blocks applied on the union of the encoded visible tokens and learnable, randomly initialized mask tokens. Following (He et al., 2021), the decoder is designed to be smaller than the encoder. Notably, we add positional embeddings to all tokens after padding to supply the location information of the missing parts. The last layer of the decoder is a linear projection that reconstructs the input by predicting all the values at the point level. The training loss is the mean squared error (MSE) between the original time series data and the prediction over the masked regions. The encoder and decoder of Ti-MAE are both agnostic to the sequential data, with as little domain knowledge as possible.
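The pre-norm Transformer block shared by the encoder and decoder (Figure 4) can be sketched as follows. This is an illustrative PyTorch re-implementation under our own naming, not the authors' released code; the dimensions follow the defaults reported later (64 hidden units, 4 attention heads).

```python
import torch
import torch.nn as nn

class PreNormBlock(nn.Module):
    """One pre-norm Transformer block: LayerNorm is applied *before*
    attention and the MLP, with residual connections around both."""

    def __init__(self, d_model: int = 64, n_heads: int = 4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        h = self.ln1(z)                                  # pre-norm
        z = z + self.attn(h, h, h, need_weights=False)[0]  # residual MHSA
        z = z + self.mlp(self.ln2(z))                    # residual MLP
        return z

block = PreNormBlock()
out = block(torch.randn(2, 25, 64))  # e.g. 25 visible tokens per sequence
```

Note that the block preserves the token shape, so a stack of such blocks can be applied to the variable-length set of visible tokens produced by random masking.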
There is no date-specific embedding, hierarchy or disentanglement, in contrast to other architectures (Zhou et al., 2021; Wu et al., 2021; Yue et al., 2022; Woo et al., 2022a). Compared to masked autoencoders used in vision tasks, many hyper-parameter settings of Ti-MAE have been adjusted to better fit time series data. We keep point-level modeling rather than patch embedding for consistency between masked modeling and downstream forecasting tasks. Unlike (Shao et al., 2022), we directly generate future values from the decoder as the prediction, maintaining consistency between the training and inference stages.
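The per-timestamp random masking can be sketched MAE-style: shuffle token indices with uniform noise, keep the first (1 − ratio) fraction, and remember the restore order so the decoder can re-insert mask tokens at the right positions. Function and variable names here are illustrative, not taken from the paper's code.

```python
import torch

def random_masking(tokens: torch.Tensor, mask_ratio: float = 0.75):
    """Keep a random subset of time-step tokens.

    tokens: (batch, T, d) embedded time series. Returns the visible
    tokens, a 0/1 mask over all T positions (1 = masked), and the
    indices needed to restore the original ordering in the decoder.
    """
    B, T, d = tokens.shape
    len_keep = int(T * (1 - mask_ratio))
    noise = torch.rand(B, T)                    # uniform sampling, no replacement
    ids_shuffle = torch.argsort(noise, dim=1)   # random permutation per sequence
    ids_restore = torch.argsort(ids_shuffle, dim=1)
    ids_keep = ids_shuffle[:, :len_keep]
    visible = torch.gather(tokens, 1, ids_keep.unsqueeze(-1).expand(-1, -1, d))
    mask = torch.ones(B, T)
    mask[:, :len_keep] = 0                      # first len_keep (shuffled) are visible
    mask = torch.gather(mask, 1, ids_restore)   # unshuffle to original order
    return visible, mask, ids_restore

x = torch.randn(2, 100, 64)
visible, mask, ids_restore = random_masking(x, 0.75)
```

The encoder then runs only on `visible`, giving the complexity savings described above; the decoder pads mask tokens back in using `ids_restore` before adding positional embeddings.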

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

Datasets. We conduct extensive experiments on several public real-world datasets, covering time series forecasting and classification applications. (1) ETT (Electricity Transformer Temperature) (Zhou et al., 2021) consists of data collected from electricity transformers, recording six power load features and oil temperature. (2) Weather contains 21 meteorological indicators, such as humidity and pressure, recorded in 2020 from nearly 1,600 locations in the U.S. (3) Exchange (Lai et al., 2018) is a collection of exchange rates among eight different countries from 1990 to 2016. (4) ILI records weekly influenza-like illness (ILI) patient data from the Centers for Disease Control and Prevention of the United States between 2002 and 2021, describing the ratio of patients observed with ILI to the total number of patients. (5) The UCR archive (Dau et al., 2019) has 128 different datasets covering multiple domains such as object outlines, traffic and body posture. We follow the same protocol and split all forecasting datasets into training, validation and test sets, with a ratio of 6:2:2 for the ETT dataset and 7:1:2 for the other datasets. For classification, each dataset of the UCR archive has already been divided into training and test sets, where the test set is much larger than the training set so as to accord with actual scenarios. Baselines. We select two types of baselines with public official code: Transformer-based end-to-end models and representation learning methods. For time series forecasting tasks, we select four recent state-of-the-art representation learning models applied to time series: CoST (Woo et al., 2022a), TS2Vec (Yue et al., 2022), TNC (Tonekaboni et al., 2021) and MoCo (Chen et al., 2021), and four Transformer-based end-to-end models: FEDformer (Zhou et al., 2022), ETSformer (Woo et al., 2022b), Autoformer (Wu et al., 2021) and Informer (Zhou et al., 2021).
For time series classification tasks, we include more competitive unsupervised representation learning methods: TS2Vec, T-Loss (Franceschi et al., 2019), TS-TCC (Eldele et al., 2021), TST (Zerveas et al., 2021), TNC (Tonekaboni et al., 2021) and DTW (Chen et al., 2013). Implementation Details. The encoder and decoder of Ti-MAE both use 2 layers of vanilla Transformer blocks with 4-head self-attention. The hidden state dimension is set to 64, which is significantly lower than in other existing methods (e.g., 320, 512). Ti-MAE is trained with the MSE loss, using the Adam optimizer (Kingma & Ba, 2015) with an initial learning rate of 1e-3. We use a batch size of 64 and a sampling time of 30 in each iteration. We use mean squared error, MSE = (1/n) Σ_{i=1}^{n} (y_i − ŷ_i)^2, and mean absolute error, MAE = (1/n) Σ_{i=1}^{n} |y_i − ŷ_i|, as evaluation metrics for forecasting, and average accuracy with critical difference (CD) for classification. All models are implemented in PyTorch (Paszke et al., 2019) and trained/tested on a single Nvidia V100 32GB GPU.
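For reference, the two forecasting metrics can be computed directly; this is a trivial sketch of the formulas above, not code from the paper.

```python
import numpy as np

def mse(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean squared error: average of squared residuals."""
    return float(np.mean((y - y_hat) ** 2))

def mae(y: np.ndarray, y_hat: np.ndarray) -> float:
    """Mean absolute error: average of absolute residuals."""
    return float(np.mean(np.abs(y - y_hat)))

y = np.array([1.0, 2.0, 3.0])
y_hat = np.array([1.0, 2.5, 2.0])
# mse = (0 + 0.25 + 1.0) / 3 ≈ 0.4167; mae = (0 + 0.5 + 1.0) / 3 = 0.5
```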

4.2. TIME SERIES FORECASTING

To simulate different forecasting scenarios, we evaluate models under different future horizons, covering short-term and long-term forecasting cases. Tables 1 and 2 summarize the multivariate time series forecasting results on four datasets. The optimal masking ratio is around 75%; a lower or higher masking ratio degrades prediction performance. In Table 1, Ti-MAE consistently improves performance across all datasets and prediction horizons. Specifically, Ti-MAE achieves a MAE decrease of 15.7% on ETT, 42.3% on Weather, 45.5% on Exchange and 19.2% on ILI compared to representation learning frameworks. Notably, Ti-MAE does not require any extra regressor after pre-training, because its decoder can directly generate the future time series to be predicted given the input sequence and masking ratio. In Table 2, Ti-MAE († indicates the fine-tuned version) also shows comparable performance to other Transformer-based end-to-end supervised methods. It must be stressed that we pre-trained only one Ti-MAE model, while all the end-to-end supervised models must be trained separately for each setting. We then utilize its encoder (with frozen parameters) plus an additional linear projection layer for fine-tuning at different prediction horizons. A runtime analysis compared to other Transformer-based models can be found in the appendix. To further explore the impact of the main properties of Ti-MAE, we conduct extensive ablation experiments on Weather with an input sequence length of 200 and a prediction horizon of 100. Table 3 presents all the ablation results. Masking ratio. Figure 5 and Table 3a show the influence of the masking ratio. The optimal ratio is around 75%, which contrasts with BERT (Devlin et al., 2019) and video MAE (Feichtenhofer et al., 2022) but is similar to MAE for images (He et al., 2021).
A high masking ratio induces the model to process fewer tokens and learn high-level semantic information. Lower masking ratios perform worse even though the encoder sees more tokens, because a model trained with a low masking ratio may simply recover the values by interpolation or extrapolation, focusing on low-level semantic features locally. Input length. In Table 3d we compare different input sequence lengths in the training stage. Surprisingly, although lengthening the input in the pre-training stage can improve performance within limits, an overly long input sequence may degrade the results of our model, because there is a conflict between the complex periodic patterns in long sequences and the short-term prediction task downstream. Decoder Design. Tables 3e and 3f show the influence of the decoder width and depth. A shallow decoder is sufficient for reconstruction tasks. This is because time series data are not that complicated and thus need a lower decoding dimension to reduce redundancy. Such a lightweight decoder efficiently reduces computational complexity and memory usage.

4.3. TIME SERIES CLASSIFICATION

In the previous section, we improved the performance of our framework on forecasting tasks by reducing the inconsistency between upstream and downstream tasks compared to contrastive learning methods. Here, we evaluate the instance-level representation learning ability on classification tasks. The results on the 128 UCR archive datasets are summarized in Table 4. Compared to other representation learning methods, Ti-MAE achieves comparable or better performance. More details and the full results for each UCR dataset are listed in the appendix. Following (Yue et al., 2022), the Critical Difference diagram (Demsar, 2006) for Nemenyi tests conducted on all datasets is shown in Figure 6, where classifiers connected by a bold line do not differ significantly. This shows that Ti-MAE can learn good instance-level representations directly from raw time series data without any hierarchical tricks or data augmentation.

5. CONCLUSION

This paper proposes a novel self-supervised framework named Ti-MAE for time series representation learning, which randomly masks out tokenized time series and learns an autoencoder to reconstruct them at the point level. Ti-MAE bridges contrastive representation learning and generative Transformer-based methods and greatly improves forecasting performance by reducing the inconsistency between upstream and downstream tasks present in contrastive learning methods. Compared with the fixed continuous masking strategy used in existing Transformer-based models, Ti-MAE adequately leverages the whole input sequence and alleviates the distribution shift problem. The flexible masking ratio makes Ti-MAE adaptive to various prediction scenarios with different time steps. The experiments on real-world datasets and the ablation study demonstrate the effectiveness and scalability of our framework. Future work will extend Ti-MAE to different reconstruction targets according to task requirements.

A EXPERIMENTAL DETAILS

A.1 REPRODUCTION DETAILS FOR TI-MAE

The default settings of Ti-MAE are shown in detail in Table 5. We use one Conv1d layer with kernel = 3, stride = 1, padding = 1 to obtain the encoder input embedding, and then add a fixed positional encoding:

PE(pos, 2i) = sin(pos / 10000^{2i/d_model}),
PE(pos, 2i+1) = cos(pos / 10000^{2i/d_model}),  (2)

where d_model denotes the number of hidden states. After the encoder input embedding, we randomly mask out 75% of the tokens, and the remaining visible parts are fed into the encoder. The encoder and decoder of Ti-MAE both contain 2 Transformer blocks, as widely adopted in Devlin et al. (2019); Dosovitskiy et al. (2021), each of which consists of one vanilla self-attention layer with 4 heads and a point-wise feed-forward layer. As recommended in Dosovitskiy et al. (2021), we adopt pre-norm instead of post-norm for training stability. Equation 3 describes the whole process in the encoder:

Z^d_i = RandomMask(Conv1d(X_{l,n}) + PE(X_{l,n})),
Ẑ^d_i = Z^d_i + MHSA(LayerNorm(Z^d_i)),
Z̃^d_i = Ẑ^d_i + MLP(LayerNorm(Ẑ^d_i)),  (3)

where X_{l,n} denotes the input vectors of dimension n with length l, and Z^d_i denotes the intermediate representation of dimension d with length i. In the decoder, we first apply a linear layer to reduce the input dimension to d′ (64 → 32) for training efficiency. Given the positions to be reconstructed, zero-initialized mask tokens are padded to the encoded tokens together with the original positional encoding. A dropout layer (p = 0.1) is added at the bottom of the Transformer blocks to prevent over-fitting. The last linear projection layer of the decoder reconstructs the missing values at the point level.
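Eq. (2) can be written out directly. This is a minimal sketch of the standard fixed sinusoidal encoding; the function name and shapes are our own.

```python
import numpy as np

def sinusoidal_pe(length: int, d_model: int) -> np.ndarray:
    """Fixed positional encoding of Eq. (2): sin on even dims, cos on odd."""
    pe = np.zeros((length, d_model))
    pos = np.arange(length)[:, None]          # (length, 1)
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices
    angle = pos / np.power(10000.0, i / d_model)
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

pe = sinusoidal_pe(100, 64)
```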
Equation 4 describes the whole process of the decoder:

Z^{d′}_l = Padding(Linear(Z̃^d_i)) + PE(X_{l,d′}),
Ẑ^{d′}_l = Z^{d′}_l + MHSA(LayerNorm(Z^{d′}_l)),
Z̃^{d′}_l = Ẑ^{d′}_l + MLP(LayerNorm(Ẑ^{d′}_l)),
X̂_{l,n} = Projection(Z̃^{d′}_l),  (4)

where Z^{d′}_l denotes the intermediate representation of dimension d′ with length l, and X̂_{l,n} denotes our reconstruction target.

MoCo Chen et al. (2021) is a self-supervised contrastive learning framework widely used in the computer vision domain, which uses a dynamic queue to store a large number of positive and negative samples with consistency. We directly apply this framework to time series data using the official code from https://github.com/facebookresearch/moco. Hyper-parameters are the same as in Woo et al. (2022a). Autoformer Wu et al. (2021) is a novel end-to-end supervised model with a decomposition architecture for time series forecasting. By directly subtracting trend parts obtained from a moving average, it designs an auto-correlation mechanism as a replacement for self-attention to capture long-term dependencies from the seasonality parts. We use the open source code from https://github.com/thuml/Autoformer. Hyper-parameters remain at the default values in the code. Informer Zhou et al. (2021) is an efficient end-to-end supervised model for time series forecasting. It proposes a novel sparse attention to reduce time complexity and memory usage. We take the officially implemented code from https://github.com/zhouhaoyi/Informer2020. Hyper-parameters are set as suggested in their paper. ETSformer Woo et al. (2022b) proposes an interpretable Transformer architecture which decomposes forecasts into level, growth, and seasonality components, and employs both exponential smoothing attention and frequency attention to reduce computational complexity. We use the open source code from https://github.com/salesforce/ETSformer. Hyper-parameters are set as suggested in their paper. FEDformer Zhou et al.
(2022) proposes to combine Transformer with the seasonal-trend decomposition method, exploiting the fact that most time series tend to have a sparse representation in a well-known basis such as the Fourier basis, and develops a frequency enhanced Transformer. We use the official code from https://github.com/MAZiqing/FEDformer. Hyper-parameters remain at the default values in the code.

A.3 DETAILS ON BENCHMARK TASKS

For time series forecasting tasks, the evaluation settings of end-to-end supervised models and representation learning methods are slightly different. For the representation learning methods, we follow Yue et al. (2022) to evaluate their models. Specifically, we train a ridge regression on the learned representations to predict the future values. The regularization term α is selected by grid search from {0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000}. It is important to stress that Ti-MAE can directly generate future values from its decoder without any extra regressor (e.g., a 50% masking ratio means giving one half of the entries to predict the other half). As for the end-to-end models, we set the input sequence length to 96 and predict future time series at different horizons. Notably, for a fair comparison with other SOTA Transformer-based methods, including FEDformer and ETSformer, we fine-tuned Ti-MAE on the forecasting tasks. Specifically, we extract the encoder of Ti-MAE, freeze it after pre-training, and add an extra linear regressor for fine-tuning. For classification tasks, we directly obtain instance-level representations by average or max pooling over all timestamps, following Yue et al. (2022). To evaluate classification performance, we follow the same protocol as Franceschi et al. (2019), where an SVM classifier with an RBF kernel is trained on the obtained instance-level representations. The full results for each UCR dataset are provided in Tables 13 and 14. Notably, due to the flexible design of the Transformer block, we can utilize any layer of the encoder or the decoder of Ti-MAE to obtain intermediate representations; an extra class token is also an option if necessary. In our experiments, we simply gather the encoder embedding of Ti-MAE as instance-level representations for evaluation.
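The ridge evaluation protocol for the representation learning baselines might look like the sketch below. The synthetic features stand in for the learned representations, and the helper names are our own; only the α grid is taken from the text.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Hypothetical stand-in for learned representations and future targets
rng = np.random.default_rng(0)
reprs = rng.normal(size=(500, 64))                       # one vector per instance
targets = reprs @ rng.normal(size=64) + 0.1 * rng.normal(size=500)

# Grid search over the regularization strengths listed in the protocol
alphas = [0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]
search = GridSearchCV(Ridge(), {"alpha": alphas}, cv=3)
search.fit(reprs[:400], targets[:400])                   # fit on "training" split
pred = search.predict(reprs[400:])                       # forecast for held-out split
```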
To accelerate training, we perform equidistant sampling on different datasets to reduce the input length to less than 1024 for training efficiency. TS2Vec also reports an interesting phenomenon that using Transformer instead of a dilated CNN as the backbone largely degrades performance on classification tasks. We find similar problems, especially on morphological datasets. We suppose that some morphological datasets have almost no seasonality, while local morphological characteristics determine the classification; the positional encoding introduced in the encoder may destroy these morphological features. Simply removing the position embedding in the encoder when generating representations significantly affects classification performance. Table 6 shows the classification results on some morphological datasets with and without position embedding. The input time series and the output need not have the same dimensionality: the final linear projection layer in the decoder can easily project the input dimensionality to the desired output dimensionality. Table 11 shows the results of using a multivariate time series to predict the last univariate target. To evaluate the transferability of our framework, we generate a set of time series data with different trend and seasonality patterns, which follows

B.1 THE IMPACT OF MASKING RATIO AND SAMPLING STRATEGIES

y(t) = cos(α·t) + cos((α/2)·t) + cos((α/4)·t) + β·t + ε,  (5)

where the hyper-parameters α and β respectively control the seasonality and trend patterns, and the noise ε ∼ N(0, 0.1). We train Ti-MAE under the setting α = 300, β = 3 and evaluate the forecasting performance on other settings.
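The synthetic benchmark of Eq. (5) can be generated as below. The time grid (t ∈ [0, 1] with 1000 points) and the generator seed are our assumptions, since the paper does not specify them.

```python
import numpy as np

def synth_series(alpha: float, beta: float, n: int = 1000, seed: int = 0):
    """Generate y(t) per Eq. (5): cosines at frequencies alpha, alpha/2,
    alpha/4, a linear trend beta*t, and Gaussian noise N(0, 0.1)."""
    rng = np.random.default_rng(seed)
    t = np.linspace(0.0, 1.0, n)
    y = (np.cos(alpha * t) + np.cos(alpha / 2 * t) + np.cos(alpha / 4 * t)
         + beta * t + rng.normal(0.0, 0.1, n))
    return t, y

# Training configuration from the text: alpha = 300, beta = 3
t, y = synth_series(alpha=300, beta=3)
```

Varying α and β then produces the different trend/seasonality patterns used for the transfer evaluation (Figure 7).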



Weather dataset: https://www.ncei.noaa.gov/data/local-climatological-data/
ILI dataset: https://gis.cdc.gov/grasp/fluview/fluportaldashboard.html




Figure 1: Different masking strategies in generative Transformer-based models on time series, where blue areas signify the sequence fed into the encoder and green areas mean the sequence to be generated. Left: The training paradigm of existing Transformer-based forecasting models, which can be seen as a special continuous masking strategy (only masks the future time series and reconstructs it). Right: The random masking strategy applied in Ti-MAE, which produces different views fed into the encoder in each iteration, fully leveraging the whole input time series.

Figure 2: Example of disentanglement. Top Left: Simulated input cosine series with an added linear trend. Top Right: The true trend and seasonality parts of the input. Bottom Left: Disentangled trend part through average pooling with a sliding window size of 15. Bottom Right: Disentangled trend part through average pooling with a sliding window size of 75.

Figure 4: Ti-MAE encoder overview. Left: Encoder input embedding. Right: Details of one Transformer block used in both Ti-MAE encoder and decoder, where we utilize pre-norm instead of post-norm scheme.

Figure 5: The optimal masking ratio is around 75%. A lower or higher masking ratio degrades prediction performance.

Figure 6: Critical Difference (CD) diagram on UCR classification with a 95% confidence level.

Figure 7: Transferability of Ti-MAE on different trend and seasonality patterns.

Multivariate time series forecasting results compared to representation learning methods.

Multivariate time series forecasting results compared to end-to-end methods.

Sampling Times. Tables 3b and 3c study the influence of the number of sampling times in each iteration and of data augmentation on the Ti-MAE training stage. Ti-MAE works well with a proper number of sampling times per iteration and even without extra data augmentation, which differs from other existing representation learning methods on time series, especially contrastive learning models that rely heavily on data augmentation; Ti-MAE can directly learn adequate information from the masked data. Additionally, introducing extra data augmentation degrades performance due to inevitable distortions of the original data, in contrast to the results of MAE on images or videos. Random masking in each iteration generates a large number of different views without any distortion, so the model can make use of the visible tokens to capture more useful features.

Ablation experiments on Weather. The bold entries are identical across sub-tables and specify the default settings. Lower MSE and MAE represent better performance. This table format follows (Feichtenhofer et al., 2022).



Default settings of Ti-MAE.

For forecasting tasks, the results of CoST (Woo et al., 2022a), TS2Vec (Yue et al., 2022), TNC (Tonekaboni et al., 2021), MoCo (Chen et al., 2021), Autoformer (Wu et al., 2021), Informer (Zhou et al., 2021), ETSformer (Woo et al., 2022b) and FEDformer (Zhou et al., 2022) are all based on our reproduction. For classification tasks, the results of TS2Vec are based on our reproduction; the other classification results are taken directly from Yue et al. (2022).

CoST (Woo et al., 2022a) was recently proposed as a contrastive learning framework that learns disentangled seasonal-trend representations for time series forecasting. It combines time-domain and frequency-domain contrastive losses to learn discriminative trend and seasonal representations. We use the official public source code from https://github.com/salesforce/CoST.

TS2Vec (Yue et al., 2022) is a universal framework for learning representations of time series at an arbitrary semantic level by applying contrastive learning hierarchically over augmented context views. TS2Vec can obtain timestamp-level and instance-level representations for forecasting and classification simultaneously. We take the officially implemented code from https://github.com/yuezhihan/ts2vec.

TNC (Tonekaboni et al., 2021) is a self-supervised contrastive learning framework for time series in which positive samples come from neighboring similar signals. We use the official open-source code from https://github.com/sanatonek/TNCrepresentationlearning, and all hyper-parameter settings follow Woo et al. (2022a).

The classification results on morphological datasets with or without positional encoding

The impact of masking ratio on forecasting tasks.

The impact of different masking strategies with a 75% ratio on Weather.

Table 7 summarizes the impact of the masking ratio on different forecasting tasks under the 200-100 setting. The best masking ratio is generally around 75%, given the continuous nature of time series data. Table 8 studies the impact of different masking strategies with a 75% ratio on the Weather dataset under the 96-96 setting. Specifically, random masking means tokens are masked randomly; continuous masking means we mask only the future time series and reconstruct it from the history, the same as traditional forecasting methods; split masking means we both mask the future time series to reconstruct it from the history and mask the historical time series to reconstruct it from the future; periodic masking means tokens are masked periodically. Notably, under periodic masking, token runs of length four are sampled equidistantly to maintain the same masking ratio. Random masking achieves the best result because randomness can adequately exploit the whole time series with less inductive bias.
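The strategies compared in Table 8 differ only in how the boolean mask is laid out over the token sequence. A minimal NumPy sketch of three of them (split masking needs two complementary views, so it is omitted here; the function signature and `period_len` default are our own convention):

```python
import numpy as np

def make_mask(n_tokens, ratio, strategy, rng=None, period_len=4):
    """Return a boolean mask (True = masked / to be reconstructed)."""
    rng = rng or np.random.default_rng(0)
    n_masked = int(n_tokens * ratio)
    mask = np.zeros(n_tokens, dtype=bool)
    if strategy == "random":        # tokens masked uniformly at random
        mask[rng.choice(n_tokens, n_masked, replace=False)] = True
    elif strategy == "continuous":  # mask only the future segment
        mask[-n_masked:] = True
    elif strategy == "periodic":    # equidistant runs of `period_len` tokens
        n_runs = n_masked // period_len
        starts = np.linspace(0, n_tokens - period_len, n_runs).astype(int)
        for s in starts:
            mask[s:s + period_len] = True
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return mask
```

Only the random layout changes from iteration to iteration, which is what lets it generate many distinct training views from the same series.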

We report the running time in seconds for each stage of different Transformer-based methods, where we execute each setting three times (using 96 historical steps to predict 24, 48, 96, 288 and 672 future steps, respectively). All experiments are performed on a single Nvidia V100 GPU. Although many Transformer-based models have O(L log L) complexity, there is a large constant factor, since these methods generally need a bulk of pre-processing (e.g. Fourier transform, wavelet transform), which makes their overall training less efficient. In comparison, although our proposed Ti-MAE has O(L^2) complexity due to the vanilla attention mechanism, we pre-train the Ti-MAE encoder only once and can fine-tune it on different forecasting settings. Thus, the total running time of Ti-MAE is less than that of the other Transformer-based methods.

Running time (seconds) for Transformer-based methods at different stages.

We further study the impact of the different components of Ti-MAE, which verifies the effectiveness of the random masking strategy, the Transformer-based backbone, and other necessary parts.

Ablation study of Ti-MAE's components on the Exchange dataset (200-100 setting).

Forecasting results with different dimension compared to representation learning methods.

Table 8 and Figure 7 demonstrate the strong transferability of Ti-MAE under different trend and seasonality patterns.

The results of forecasting 400 time steps on simulated time series data with different trend and seasonality patterns.

Full classification results on 128 UCR datasets (part 1).

