A TRANSFORMER-BASED FRAMEWORK FOR MULTIVARIATE TIME SERIES REPRESENTATION LEARNING

Abstract

In this work we propose, for the first time, a transformer-based framework for unsupervised representation learning of multivariate time series. Pre-trained models can potentially be used for downstream tasks such as regression, classification, forecasting, and missing value imputation. We evaluate our models on several benchmark datasets for multivariate time series regression and classification and show that they exceed current state-of-the-art performance, even when the number of training samples is very limited, while at the same time offering computational efficiency. We show that unsupervised pre-training of our transformer models offers a substantial performance benefit over fully supervised learning, even without leveraging additional unlabeled data, i.e., by reusing the same data samples through the unsupervised objective.

1. INTRODUCTION

Multivariate time series (MTS) are an important type of data that is ubiquitous in a wide variety of domains, including science, medicine, finance, engineering and industrial applications. Despite the recent abundance of MTS data in the much-touted era of "Big Data", the availability of labeled data in particular is far more limited: extensive data labeling is often prohibitively expensive or impractical, as it may require much time and effort, special infrastructure or domain expertise. For this reason, in all aforementioned domains there is great interest in methods which can offer high accuracy by using only a limited amount of labeled data or by leveraging the existing plethora of unlabeled data. There is a large variety of modeling approaches for univariate and multivariate time series, with deep learning models recently challenging or replacing the state of the art in tasks such as forecasting, regression and classification (De Brouwer et al., 2019; Tan et al., 2020a; Fawaz et al., 2019b). However, unlike in domains such as Computer Vision or Natural Language Processing (NLP), the dominance of deep learning for time series is far from established: in fact, non-deep-learning methods such as TS-CHIEF (Shifaz et al., 2020), HIVE-COTE (Lines et al., 2018), and ROCKET (Dempster et al., 2020) currently hold the record on time series regression and classification dataset benchmarks (Tan et al., 2020a; Bagnall et al., 2017), matching or even outperforming sophisticated deep architectures such as InceptionTime (Fawaz et al., 2019a) and ResNet (Fawaz et al., 2019b). In this work, we investigate, for the first time, the use of a transformer encoder for unsupervised representation learning of multivariate time series, as well as for the tasks of time series regression and classification.
Transformers are an important, recently developed class of deep learning models, which were first proposed for the task of natural language translation (Vaswani et al., 2017) but have since come to monopolize state-of-the-art performance across virtually all NLP tasks (Raffel et al., 2019). A key factor for the widespread success of transformers in NLP is their aptitude for learning how to represent natural language through unsupervised pre-training (Brown et al., 2020; Raffel et al., 2019; Devlin et al., 2018). Besides NLP, transformers have also set the state of the art in several domains of sequence generation, such as polyphonic music composition (Huang et al., 2018). Transformer models are based on a multi-headed attention mechanism that offers several key advantages and renders them particularly suitable for time series data (see Appendix section A.4 for details). Inspired by the impressive results attained through unsupervised pre-training of transformer models in NLP, as our main contribution, in the present work we develop a generally applicable methodology (framework) that can leverage unlabeled data by first training a transformer model to extract dense vector representations of multivariate time series through an input denoising objective. The pre-trained model can subsequently be applied to several downstream tasks, such as regression, classification, imputation, and forecasting. Here, we apply our framework to the tasks of multivariate time series regression and classification on several public datasets and demonstrate that transformer models can convincingly outperform all current state-of-the-art modeling approaches, even when only a very limited number of training samples is available (on the order of hundreds of samples), an unprecedented success for deep learning models.
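The input denoising objective described above can be illustrated with a minimal sketch: hide a random subset of the input values, feed the corrupted series to the model, and compute a reconstruction loss only on the hidden positions. This is a simplified illustration (per-value Bernoulli masking with masked values set to zero, and a masked MSE loss); the exact masking scheme and loss weighting used in practice may differ, and all names here (`make_denoising_batch`, `masked_mse`, `mask_prob`) are illustrative rather than taken from the original work.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_denoising_batch(x, mask_prob=0.15):
    """Create a corrupted input and a boolean mask for a denoising objective.

    x: array of shape (seq_len, num_vars) -- one multivariate time series.
    A Bernoulli mask hides a fraction of the input values (set to 0 here);
    the model is trained to reconstruct only the hidden positions.
    """
    mask = rng.random(x.shape) < mask_prob   # True where input is hidden
    x_noised = np.where(mask, 0.0, x)        # masked values replaced by 0
    return x_noised, mask

def masked_mse(x_pred, x_true, mask):
    """Mean squared error computed only over the masked positions."""
    diff = (x_pred - x_true) ** 2
    return diff[mask].mean()

# toy example: 8 time steps, 3 variables
x = rng.standard_normal((8, 3))
x_noised, mask = make_denoising_batch(x, mask_prob=0.25)
# a model that recovered x exactly would incur zero loss
assert masked_mse(x, x, mask) == 0.0
```

Restricting the loss to masked positions prevents the trivial solution of copying the (visible) input to the output.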
Importantly, despite common preconceptions about transformers from the domain of NLP, where top-performing models have billions of parameters and require days to weeks of pre-training on many parallel GPUs or TPUs, we also demonstrate that our models, using at most hundreds of thousands of parameters, can be trained even on CPUs, while training them on GPUs makes them as fast as even the fastest and most accurate non-deep-learning approaches.

2. RELATED WORK

Unsupervised learning for multivariate time series: Recent work on unsupervised learning for multivariate time series has predominantly employed autoencoders, trained with an input reconstruction objective and implemented either as Multi-Layer Perceptrons (MLP) or RNN (most commonly, LSTM) networks. As interesting variations of the former, Kopf et al. (2019) and Fortuin et al. (2019) additionally incorporated Variational Autoencoding into this approach, but focused on clustering and the visualization of shifting sample topology with time. As an example of the latter, Malhotra et al. (2017) presented a multi-layered RNN sequence-to-sequence autoencoder, while Lyu et al. (2018) developed a multi-layered LSTM with an attention mechanism and evaluated both an input reconstruction (autoencoding) and a forecasting loss for unsupervised representation learning of Electronic Healthcare Record multivariate time series. As a novel take on autoencoding, and with the goal of dealing with missing data, Bianchi et al. (2019) employ a stacked bidirectional RNN encoder and a stacked RNN decoder to reconstruct the input, and at the same time use a user-provided kernel matrix as prior information to condition internal representations and encourage learning similarity-preserving representations of the input. They evaluate the method on the tasks of missing value imputation and classification of time series under increasing "missingness" of values. A distinct approach is followed by Zhang et al. (2019), who use a composite convolutional-LSTM network with attention and a loss which aims at reconstructing correlation matrices between the variables of the multivariate time series input; they use and evaluate their method only for the task of anomaly detection. Finally, Jansen et al. (2018) rely on a triplet loss and the idea of temporal proximity (the loss rewards similarity of representations between proximal segments and penalizes similarity between distal segments of the time series) for unsupervised representation learning of non-speech audio data. This idea is explored further by Franceschi et al. (2019), who combine the triplet loss with a deep causal dilated CNN, in order to make the method effective for very long time series.

Regression and classification of time series: Currently, non-deep-learning methods such as TS-CHIEF (Shifaz et al., 2020), HIVE-COTE (Lines et al., 2018), and ROCKET (Dempster et al., 2020) constitute the state of the art for time series regression and classification based on evaluations on public benchmarks (Tan et al., 2020a; Bagnall et al., 2017), followed by CNN-based deep architectures such as InceptionTime (Fawaz et al., 2019a) and ResNet (Fawaz et al., 2019b). ROCKET, which on average is the best-ranking method, is a fast method that involves training a linear classifier on top of features extracted by a flat collection of numerous and varied random convolutional kernels. HIVE-COTE and TS-CHIEF (itself inspired by Proximity Forest (Lucas et al., 2019)) are very sophisticated methods which incorporate expert insights on time series data and consist of large, heterogeneous ensembles of classifiers utilizing shapelet transformations, elastic similarity measures, spectral features, and random interval and dictionary-based techniques. However, these methods are highly complex, involve significant computational cost, cannot benefit from GPU hardware, and scale poorly to datasets with many samples and long time series; moreover, they have been developed for, and only been evaluated on, univariate time series.
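The random-kernel idea behind ROCKET can be sketched as follows: convolve the series with many random kernels and pool two features per kernel, the maximum activation and the proportion of positive values (PPV), then train a linear model on the resulting feature vector. This is a deliberately simplified sketch (univariate input, fixed kernel length, no random dilation or padding variation, which the actual method uses); the function names are illustrative, not the library's API.

```python
import numpy as np

rng = np.random.default_rng(42)

def random_kernels(num_kernels=100, length=9):
    """Draw random convolutional kernels: Gaussian weights (mean-centred
    per kernel) and a uniform bias, in the spirit of ROCKET."""
    weights = rng.standard_normal((num_kernels, length))
    weights -= weights.mean(axis=1, keepdims=True)
    biases = rng.uniform(-1.0, 1.0, size=num_kernels)
    return weights, biases

def transform(series, weights, biases):
    """Cross-correlate a univariate series with each kernel and pool two
    features: the maximum activation and the proportion of positive values."""
    feats = []
    for w, b in zip(weights, biases):
        # np.convolve flips the kernel, so reverse w to get cross-correlation
        conv = np.convolve(series, w[::-1], mode="valid") + b
        feats.extend([conv.max(), (conv > 0).mean()])
    return np.array(feats)

series = np.sin(np.linspace(0.0, 10.0, 100))
weights, biases = random_kernels()
features = transform(series, weights, biases)
# 2 features per kernel; a linear classifier is then trained on these
assert features.shape == (200,)
```

Because the kernels are never trained, the transform is cheap and trivially parallel; all learning happens in the linear classifier on top.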

