TI-MAE: SELF-SUPERVISED MASKED TIME SERIES AUTOENCODERS

Abstract

Multivariate time series forecasting has become an increasingly popular topic across various applications and scenarios. Recently, contrastive learning and Transformer-based models have achieved good performance on many long-term series forecasting tasks. However, several issues remain in existing methods. First, the training paradigm of contrastive learning is inconsistent with downstream prediction tasks, leading to inaccurate prediction results. Second, existing Transformer-based models, which rely on similar patterns in historical time series data to predict future values, generally suffer from severe distribution shift and do not fully leverage the sequence information compared to self-supervised methods. To address these issues, we propose a novel framework named Ti-MAE, in which the input time series are assumed to follow an integrated distribution. In detail, Ti-MAE randomly masks out embedded time series data and learns an autoencoder to reconstruct them at the point level. Ti-MAE adopts mask modeling (rather than contrastive learning) as the auxiliary task and bridges the connection between existing representation learning and generative Transformer-based methods, reducing the gap between upstream and downstream forecasting tasks while preserving the utilization of the original time series data. Experiments on several public real-world datasets demonstrate that our masked autoencoding framework can learn strong representations directly from raw data, yielding better performance on time series forecasting and classification tasks. The code will be made public after this paper is accepted.

1. INTRODUCTION

Time series modeling is urgently needed in many fields, such as time series classification (Dau et al., 2019), demand forecasting (Carbonneau et al., 2008), and anomaly detection (Laptev et al., 2017). Recently, long sequence time series forecasting (LSTF), which aims to predict the evolution of values over a long future period, has attracted significant interest from researchers. In previous work, most self-supervised representation learning methods on time series aim to learn transformation-invariant features via contrastive learning that can be applied to downstream tasks. Although these methods perform well on classification tasks, a gap remains between their performance and that of supervised models on forecasting tasks. Apart from the inevitable distortion of time series caused by augmentation strategies borrowed from vision or language, the inconsistency between upstream contrastive learning and downstream forecasting tasks is likely another major cause of this problem. Besides, as the latest contrastive learning frameworks (Yue et al., 2022; Woo et al., 2022a) reported, Transformer (Vaswani et al., 2017) performs worse than CNN-based backbones, which is also inconsistent with our experience. This motivates us to reveal the differences and relationships between existing contrastive learning and supervised methods on time series. As an alternative to contrastive learning, denoising autoencoders (Vincent et al., 2008) have also been used as an auxiliary task to learn intermediate representations from the data. Owing to the ability of Transformer to capture long-range dependencies, many existing methods (Zhou et al., 2021; Wu et al., 2021; Woo et al., 2022b) focus on reducing the time complexity and memory usage caused by the vanilla attention mechanism, e.g., via sparse attention or correlation mechanisms, in order to process longer time series.
These Transformer-based models all follow the same training paradigm, as Figure 1a shows, which learns similar patterns from input historical time series segments and predicts future time series values.
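The point-level random masking described in the abstract can be illustrated with a minimal sketch. This is our own simplified illustration, not the paper's implementation: the function name, the 75% masking ratio, and the toy sine-wave input are all assumptions made here for demonstration.

```python
import numpy as np

def random_point_mask(series, mask_ratio=0.75, seed=0):
    """Randomly mask a fraction of time steps, MAE-style.

    Returns the visible values, the indices of the kept time steps,
    and a boolean mask (True = masked) whose positions a decoder
    would be asked to reconstruct.
    """
    rng = np.random.default_rng(seed)
    n = series.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    # Shuffle time-step indices and keep the first n_keep as visible tokens.
    perm = rng.permutation(n)
    keep_idx = np.sort(perm[:n_keep])
    mask = np.ones(n, dtype=bool)
    mask[keep_idx] = False
    return series[keep_idx], keep_idx, mask

# Toy univariate series of 100 time steps, one channel.
series = np.sin(np.linspace(0.0, 6.28, 100)).reshape(100, 1)
visible, keep_idx, mask = random_point_mask(series, mask_ratio=0.75)
```

In this sketch only the 25 visible points would be fed to the encoder; the reconstruction loss is computed on the masked positions, which is what differentiates the auxiliary task from contrastive objectives.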

