TI-MAE: SELF-SUPERVISED MASKED TIME SERIES AUTOENCODERS

Abstract

Multivariate time series forecasting has become an increasingly popular topic across many applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance on many long-term series forecasting tasks. However, several issues remain in existing methods. First, the training paradigm of contrastive learning is inconsistent with downstream prediction tasks, leading to inaccurate prediction results. Second, existing Transformer-based models, which rely on similar patterns in historical time series data to predict future values, generally suffer from severe distribution shift and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges existing representation learning and generative Transformer-based methods, reducing the gap between upstream and downstream forecasting tasks while fully utilizing the original time series data. Experiments on several public real-world datasets demonstrate that our masked autoencoding framework can learn strong representations directly from raw data, yielding better performance on time series forecasting and classification tasks. The code will be made public after this paper is accepted.

1. INTRODUCTION

Time series modeling is urgently needed in many fields, such as time series classification (Dau et al., 2019), demand forecasting (Carbonneau et al., 2008), and anomaly detection (Laptev et al., 2017). Recently, long sequence time series forecasting (LSTF), which aims to predict the change of values over a long future period, has attracted significant interest from researchers. In previous work, most self-supervised representation learning methods on time series aim to learn transformation-invariant features via contrastive learning to be applied to downstream tasks. Although these methods perform well on classification tasks, there is still a gap between their performance and that of supervised models on forecasting tasks. Apart from the inevitable distortion of time series caused by augmentation strategies borrowed from vision or language, the inconsistency between upstream contrastive learning approaches and downstream forecasting tasks is also likely a major cause of this problem. Besides, as the latest contrastive learning frameworks (Yue et al., 2022; Woo et al., 2022a) report, the Transformer (Vaswani et al., 2017) performs worse than CNN-based backbones, which is also not consistent with our experience. This motivates us to examine the differences and relationships between existing contrastive learning and supervised methods on time series.

As an alternative to contrastive learning, denoising autoencoders (Vincent et al., 2008) have also been used as an auxiliary task to learn intermediate representations from the data. Owing to the ability of the Transformer to capture long-range dependencies, many existing methods (Zhou et al., 2021; Wu et al., 2021; Woo et al., 2022b) focus on reducing the time complexity and memory usage of the vanilla attention mechanism, e.g., via sparse attention or correlation, in order to process longer time series.
These Transformer-based models all follow the training paradigm shown in Figure 1a: they learn patterns from the input historical segment and predict future values, which can be viewed as a special continuous masking strategy that masks only the future segment and reconstructs it. However, this continuous masking strategy is usually accompanied by two severe problems. For one thing, continuous masking limits the learning ability of the model, which captures only the information of the visible sequence and some mapping between the historical and future segments. Similar problems have been reported in vision tasks (Zhang et al., 2017). For another, continuous masking induces severe distribution shift, especially when the prediction horizon is longer than the input sequence. In reality, most time series collected from real scenarios are non-stationary, with mean or variance changing over time. Similar problems were also observed in previous studies (Qiu et al., 2018; Oreshkin et al., 2020; Wu et al., 2021; Woo et al., 2022a). Most of them disentangle the input time series into a trend part and a seasonality part in order to better capture periodic features and make the model robust to outlier noise. Specifically, they compute a moving average, implemented as an average pooling layer with a fixed-size sliding window, to obtain the trend of the input time series. They then extract seasonality features from the periodic sequence obtained by subtracting the trend from the original signal. To clarify the mechanism of this disentanglement, we adopt a simple but comprehensible description of a disentangled time series:

y(t) = Trend(t) + Seasonality(t) + Noise.   (1)

For better illustration, we simply use a polynomial series Σ_n a_n t^n and a Fourier cosine series Σ_n b_n cos(nt) to describe, respectively, the trend and seasonality parts of the original time series in Eq. (1).
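As a concrete illustration of the moving-average disentanglement above, the following sketch splits a simulated series (a cosine with a linear trend, as in Figure 2) into trend and seasonality. It assumes NumPy; the edge-replication padding and the specific window sizes are our illustrative choices, not details fixed by the text.

```python
import numpy as np

def decompose(x: np.ndarray, window: int):
    """Split a 1-D series into trend and seasonality, following Eq. (1).

    The trend is a moving average (average pooling with a fixed sliding
    window); the seasonality is the residual after subtracting the trend.
    Edge replication is one common way to pad the sequence ends.
    """
    pad = window // 2
    padded = np.pad(x, (pad, window - 1 - pad), mode="edge")
    kernel = np.ones(window) / window
    trend = np.convolve(padded, kernel, mode="valid")  # same length as x
    seasonality = x - trend
    return trend, seasonality

# Simulated input: linear trend plus a cosine with period 25.
t = np.arange(200)
x = 0.05 * t + np.cos(2 * np.pi * t / 25)

# As in Figure 2, a short window (15) leaves periodic ripple in the
# trend estimate, while a longer window (75, a multiple of the period)
# averages the cosine out and recovers a nearly linear trend.
trend_15, season_15 = decompose(x, window=15)
trend_75, season_75 = decompose(x, window=75)
```

By construction, trend and seasonality sum back to the original series, mirroring the additive decomposition in Eq. (1) (with the noise term set to zero here).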
Apparently, the seasonality part is stationary once we set a proper observation horizon (not less than the maximum period of the seasonality parts), while the moments of the trend part change continuously over time. Figure 2 illustrates that the sliding window size of the average pooling layer plays a vital role in the quality of the disentangled trend part. Natural time series generally exhibit more complex periodic patterns, which means we have to employ longer sliding windows or other hierarchical designs. In addition, when a moving average is used to capture the trend parts, both ends of a sequence need to be



(a) End-to-end forecasting. (b) Random masking applied in Ti-MAE.

Figure 1: Different masking strategies in generative Transformer-based models on time series, where blue areas signify the sequence fed into the encoder and green areas indicate the sequence to be generated. Left: The training paradigm of existing Transformer-based forecasting models, which can be seen as a special continuous masking strategy (masking only the future time series and reconstructing it). Right: The random masking strategy applied in Ti-MAE, which produces a different view fed into the encoder in each iteration, fully leveraging the whole input time series.
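The random masking on the right of Figure 1 can be sketched as follows. This is a minimal NumPy illustration of MAE-style masking over embedded time points; the function name, the shapes, and the 75% mask ratio are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def random_mask(tokens: np.ndarray, mask_ratio: float,
                rng: np.random.Generator):
    """Randomly mask embedded time-series tokens, MAE-style.

    tokens: (L, D) sequence of embedded time points. Returns the visible
    subset fed to the encoder, the kept indices (needed to restore the
    original order for reconstruction), and a boolean mask where True
    marks a masked-out position.
    """
    L = tokens.shape[0]
    n_keep = int(L * (1 - mask_ratio))
    shuffle = rng.permutation(L)           # random order of positions
    keep_idx = np.sort(shuffle[:n_keep])   # visible positions, kept sorted
    mask = np.ones(L, dtype=bool)
    mask[keep_idx] = False
    return tokens[keep_idx], keep_idx, mask

rng = np.random.default_rng(0)
tokens = rng.normal(size=(96, 16))         # 96 time points, 16-dim embedding
visible, keep_idx, mask = random_mask(tokens, mask_ratio=0.75, rng=rng)
```

Because the permutation is redrawn each call, every training iteration sees a different visible subset of the same series, which is what lets the encoder exploit the whole input rather than a fixed historical prefix.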

Figure 2: Example of disentanglement. Top Left: Simulated input cosine series with an added linear trend. Top Right: The true trend and seasonality parts of the input. Bottom Left: Disentangled trend part through average pooling with a sliding window size of 15. Bottom Right: Disentangled trend part through average pooling with a sliding window size of 75.

