VISION TRANSFORMER FOR MULTIVARIATE TIME-SERIES CLASSIFICATION (VITMTSC)

Abstract

Multivariate Time-Series Classification (MTSC) is an important problem in many disciplines because of the proliferation of disparate data sources and sensors (economics, retail, health, etc.). Nonetheless, it remains difficult due to the high dimensionality and richness of data that is regularly updated. We present a Vision Transformer for Multivariate Time-Series Classification (VitMTSC) model that learns latent features from raw time-series data for classification tasks and is applicable to large-scale time-series data with millions of samples of variable lengths. To the best of our knowledge, this is the first application of the Vision Transformer (ViT) to MTSC. We demonstrate that our approach works on datasets ranging from a few thousand to millions of samples and achieves close to state-of-the-art (SOTA) results on open datasets. Using click-stream data from a major retail website, we demonstrate that our model can scale to millions of samples and vastly outperforms previous neural-network-based MTSC models in real-world applications. Our source code is publicly available at https://github.com/mtsc-research/vitmtsc to facilitate further research.

1. INTRODUCTION

Deep neural networks (DNNs) have shown significant effectiveness with both text (Lin et al., 2021) and images (Li et al., 2021). A crucial factor enabling this progress is the availability of standard DNN architectures that effectively encode raw data into meaningful representations, leading to good performance on new datasets and associated tasks with little effort. In image understanding, for instance, variants of residual convolutional networks, such as ResNet (He et al., 2015), exhibit outstanding performance on new image datasets and on somewhat different visual recognition problems, such as segmentation. Despite being the most prevalent type of data in the real world, tabular data has only recently become a focus of deep learning research, e.g., TabNet (Arik & Pfister, 2019) and TabTransformer (Huang et al., 2020). Time-series data is a special case of tabular data. An intrinsic characteristic of time-series is that observations within a time-series are statistically dependent on assumed generative processes (Löning et al., 2019). For example, the likelihood of a user streaming a particular movie X depends on whether they have streamed movie X before, as well as latent factors such as the time elapsed since the last streaming of movie X, the number of movies from the same genre/director/actor as X that the user has streamed, the cost of streaming, promotions on movie X, etc. Due to this dependence, time-series data does not readily fit into the traditional machine learning paradigm for tabular data, which implicitly assumes that observations are independent and identically distributed (i.i.d.). We consider a time-series, the input to the model, to be composed of (a) time-points at which it is observed, and (b) observations (or time-series datapoints, TS datapoints) at those time-points.
We denote a time-series object that contains N samples as X = [X_{t_1}, X_{t_2}, ..., X_{t_N}], where the elements of the sequence are observation(s) at time-points t_1, t_2, ..., t_N respectively. Time-series data may be (a) univariate time-series (UTS), in which a single variable is observed over time, or (b) multivariate time-series (MTS), in which two or more variables are recorded over time. We denote an individual multivariate time-series datapoint as X_{t_K} = [X^1_{t_K}, X^2_{t_K}, ..., X^M_{t_K}]^T for M distinct variables/features observed at each time-point. Note that in large commercial datasets, an individual categorical variable X^n_{t_K} can have millions of distinct values (refer to Table 2). In this paper, we focus on Multivariate Time-Series Classification (MTSC) (Ruiz et al., 2021), an area of machine learning concerned with assigning labels to MTS data. In non-DNN-based MTSC methods, the data is first converted to i.i.d. form via feature transformation, and then traditional classifiers are applied (Ruiz et al., 2021). Only a few available approaches for MTSC consider DNNs for this task (Fawaz et al., 2018) (refer to Section 2). One advantage of using DNNs for MTSC is the expected efficiency benefit, particularly for large datasets (Hestness et al., 2017).
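To make the notation concrete, the following minimal NumPy sketch (illustrative only, not the paper's data pipeline) represents one MTS object as an N×M array, where row k is the datapoint X_{t_k}:

```python
import numpy as np

# An MTS object X with N = 4 time-points and M = 3 variables per
# time-point: X[k] is the datapoint X_{t_k} in R^M.
N, M = 4, 3
rng = np.random.default_rng(0)
X = rng.normal(size=(N, M))

# X_{t_2} = [X^1_{t_2}, ..., X^M_{t_2}]^T, the observation at time-point t_2.
x_t2 = X[1]
assert x_t2.shape == (M,)

# A univariate time-series (UTS) is the special case M = 1.
uts = X[:, :1]
assert uts.shape == (N, 1)
```

A dataset of variable-length series would then be a collection of such arrays with differing N but a shared M.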
In addition, DNNs offer gradient descent (GD) based end-to-end learning for time-series data, which has numerous advantages: (a) readily encoding multiple data types, including images, tabular, and time-series data; (b) eliminating the requirement for feature engineering, which is a fundamental part of non-DNN MTSC approaches; (c) learning from streaming data; and, perhaps most crucially, (d) enabling representation learning, which supports many valuable applications including data-efficient domain adaptation (Goodfellow et al., 2016), generative modeling (Radford et al., 2015), semi-supervised learning (Dai et al., 2017), and unsupervised learning (Karhunen et al., 2015). We present Vision Transformer for Multivariate Time-Series Classification (VitMTSC), a ViT-based pure-transformer approach to MTSC, to model long-range contextual relationships in MTS data. The VitMTSC model accepts raw time-series data as input and is trained with GD-based optimization to facilitate flexible end-to-end learning. We evaluate VitMTSC on both publicly available and proprietary datasets. The majority of algorithms in the MTSC literature have been evaluated on the open-source UEA datasets (Ruiz et al., 2021). We evaluate VitMTSC on five UEA datasets (Table 1) and demonstrate that it achieves performance comparable to current SOTA methods (Table 3). Using click-stream data (Table 2) from a popular e-commerce site, we show that VitMTSC can meet the needs of modern datasets and performs much better than current DNN methods (Table 4) on both small and large real-world datasets (refer to Section 4 for experiment details).
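As a rough illustration of how a ViT-style model can consume raw MTS input end to end, the PyTorch sketch below tokenizes a multivariate series into fixed-length temporal patches, prepends a class token, and classifies from that token. All names and hyper-parameters here are our own illustrative assumptions, not the VitMTSC architecture itself:

```python
import torch
import torch.nn as nn

class ToyMTSTransformer(nn.Module):
    """Illustrative ViT-style classifier for MTS batches of shape (B, N, M)."""

    def __init__(self, n_vars, patch_len, d_model=64, n_heads=4,
                 n_layers=2, n_classes=2, max_patches=64):
        super().__init__()
        self.patch_len = patch_len
        # Linear patch embedding: each patch of patch_len time-points
        # (patch_len * n_vars raw values) becomes one token.
        self.embed = nn.Linear(patch_len * n_vars, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))
        self.pos = nn.Parameter(torch.zeros(1, max_patches + 1, d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, n_classes)

    def forward(self, x):                          # x: (B, N, M)
        b, n, _ = x.shape
        p = n // self.patch_len                    # assumes N divisible by patch_len
        tokens = x[:, :p * self.patch_len].reshape(b, p, -1)
        tokens = self.embed(tokens)                # (B, P, d_model)
        cls = self.cls.expand(b, -1, -1)
        z = torch.cat([cls, tokens], dim=1) + self.pos[:, :p + 1]
        z = self.encoder(z)
        return self.head(z[:, 0])                  # classify from the class token

model = ToyMTSTransformer(n_vars=3, patch_len=8)
logits = model(torch.randn(2, 32, 3))              # 2 samples, 32 steps, 3 vars
assert logits.shape == (2, 2)
```

Training would proceed with a standard cross-entropy loss and any GD-based optimizer; handling variable-length series (e.g., via padding masks) is omitted for brevity.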

2. RELATED WORK

Since AlexNet (Krizhevsky et al., 2012), deep convolutional neural networks (CNNs) have advanced the state of the art across many standard datasets for vision problems. At the same time, the most prominent architecture of choice in sequence-to-sequence modeling is the Transformer (Vaswani et al., 2017), which does not use convolutions but is based on multi-headed self-attention (MSA). The MSA operation is particularly effective at modeling long-range dependencies and allows the model to attend over all elements in the input sequence. This is in stark contrast to convolutions, where the corresponding "receptive field" is limited and grows linearly with the depth of the network. The success of attention-based models in Natural Language Processing (NLP) has inspired approaches in computer vision to integrate transformers into CNNs (Wang et al., 2017; Carion et al., 2020), as well as some attempts to replace convolutions completely (Parmar et al., 2018; Bello et al., 2019; Ramachandran et al., 2019). With the Vision Transformer (ViT) (Dosovitskiy et al., 2020), however, a pure-transformer-based architecture has recently outperformed its convolutional counterparts in image classification. Since then, the ViT model has also been applied to other domains, for example video classification (Arnab et al., 2021), and many more variants have been proposed¹. In the last two decades, time-series data has proliferated across a plethora of fields including economics (Wan & Si, 2017), health and medicine (Gharehbaghi et al., 2015), scene classification (Nwe et al., 2017), activity recognition (Chen et al., 2020), traffic analysis (Yang et al., 2014), click-stream analysis (Jo et al., 2018), and more.
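For reference, a single self-attention head (the core of the MSA operation; a NumPy sketch, not the ViT implementation itself) computes, for every position, a softmax-weighted average over all positions, which is why its receptive field is global regardless of network depth:

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.

    X: (N, d) sequence of N token embeddings. Each output row attends over
    ALL N inputs, unlike a convolution, whose receptive field grows with depth.
    """
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # (N, N) all-pairs scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (N, d_v)

rng = np.random.default_rng(0)
N, d = 5, 8
X = rng.normal(size=(N, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
assert out.shape == (N, d)
```

MSA simply runs several such heads in parallel on learned projections and concatenates their outputs.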
DNNs have been used to model time-series tasks such as forecasting (Benidis et al., 2020; Lim & Zohren, 2021; Torres et al., 2021), classification (Wang et al., 2016; Fawaz et al., 2018; Zou et al., 2019; Fawaz et al., 2019), anomaly detection (Blázquez-García et al., 2020; Choi et al., 2021), and data augmentation (Wen et al., 2021). Similarly, Transformers (Wen et al., 2022) have been used in time-series forecasting (Li et al., 2019; Zhou et al., 2020; 2022), anomaly detection (Xu et al., 2021; Tuli et al., 2022), and classification (Rußwurm et al., 2019; Zerveas et al., 2021; Yang et al., 2021). Historically, the problem of MTSC was addressed by non-linear transforms of the time-series, on which standard classification algorithms are then used. Bostrom & Bagnall (2017) used a modification of the shapelet transform to quickly find k shapelets of length n to represent the time-series; they then use the resulting k×n feature vector in a standard classification model. Similarly, BagOfPatterns (Lin et al., 2012) uses the frequency of each word (token/feature) in the series to represent the series as a histogram of words. Bag of Symbolic Fourier Approximation Symbols (BOSS) (Large et al., 2018) applies Symbolic Fourier Approximation to frequencies of words in the series. Word ExtrAction for time SEries cLassification (WEASEL) (Schäfer & Leser, 2017) extracts words with multiple sliding windows of different sizes and selects the most discriminating words according to the chi-squared test.
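The bag-of-words style transforms above share a common skeleton, which the toy sketch below illustrates (our own simplified discretization, not the cited methods' exact algorithms): slide a window over the series, discretize each window into a symbolic "word", and represent the series as a word histogram fed to a standard classifier:

```python
import numpy as np
from collections import Counter

def series_to_word_histogram(series, window=4, bins=3):
    """Toy bag-of-patterns: discretize each sliding window into a word."""
    # Quantile bin edges so symbols are roughly equiprobable.
    edges = np.quantile(series, np.linspace(0, 1, bins + 1)[1:-1])
    symbols = np.digitize(series, edges)             # symbol 0..bins-1 per point
    words = ["".join(map(str, symbols[i:i + window]))
             for i in range(len(series) - window + 1)]
    return Counter(words)                            # histogram of words

hist = series_to_word_histogram(np.sin(np.linspace(0.0, 6.28, 32)))
# One word per sliding window position: 32 - 4 + 1 = 29 windows in total.
assert sum(hist.values()) == 29
```

BOSS and WEASEL differ mainly in how the words are derived (Fourier coefficients, multiple window sizes) and how the resulting features are selected.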



¹ https://github.com/lucidrains/vit-pytorch

