VISION TRANSFORMER FOR MULTIVARIATE TIME-SERIES CLASSIFICATION (VITMTSC)

Abstract

Multivariate Time-Series Classification (MTSC) is an important problem in many disciplines (economics, retail, health, etc.) because of the proliferation of disparate data sources and sensors. Nonetheless, it remains difficult due to the high dimensionality and richness of data that is regularly updated. We present a Vision Transformer for Multivariate Time-Series Classification (VitMTSC) model that learns latent features from raw time-series data for classification tasks and is applicable to large-scale time-series data with millions of samples of variable lengths. To our knowledge, this is the first adaptation of the Vision Transformer (ViT) to MTSC. We demonstrate that our approach works on datasets ranging from a few thousand to millions of samples and achieves close to state-of-the-art (SOTA) results on open datasets. Using click-stream data from a major retail website, we show that our model scales to millions of samples and substantially outperforms previous neural-network-based MTSC models in real-world applications. Our source code is publicly available at https://github.com/mtsc-research/vitmtsc to facilitate further research.

1. INTRODUCTION

Deep neural networks (DNNs) have shown significant effectiveness with both text (Lin et al., 2021) and images (Li et al., 2021). A crucial factor enabling this progress is the availability of standard DNN architectures that effectively encode raw data into meaningful representations, leading to good performance on new datasets and associated tasks with little effort. In image understanding, for instance, variants of residual convolutional networks, such as ResNet (He et al., 2015), exhibit outstanding performance on new image datasets and on somewhat different visual recognition problems, such as segmentation. Despite being the most prevalent type of data in the real world, tabular data has only recently become a focus of deep learning research, for example TabNet (Arik & Pfister, 2019) and TabTransformer (Huang et al., 2020). Time-series data is a special case of tabular data. An intrinsic characteristic of time-series is that observations within a time-series are statistically dependent through an assumed generative process (Löning et al., 2019). For example, the likelihood of a user streaming a particular movie X depends on whether they have streamed movie X before, as well as on latent factors such as the time elapsed since the last streaming of movie X, the number of movies from the same genre/director/actor as X that the user has streamed, the cost of streaming, promotions on movie X, etc. Because of this dependence, time-series data does not readily fit the traditional machine learning paradigm for tabular data, which implicitly assumes that observations are independent and identically distributed (i.i.d.). We consider a time-series, the input to the model, to be composed of (a) the time-points at which it is observed, and (b) the observations (time-series datapoints, or TS datapoints) at those time-points.
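The statistical dependence between observations can be made concrete with a toy sketch (illustrative only, not from the paper): in an AR(1) process each observation depends on the previous one, so the lag-1 autocorrelation is high, whereas shuffling the series destroys that temporal dependence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy AR(1) process: x[t] = phi * x[t-1] + noise.
# Each observation depends on the previous one, so the series is NOT i.i.d.
n, phi = 5000, 0.9
x = np.zeros(n)
for t in range(1, n):
    x[t] = phi * x[t - 1] + rng.normal()

def lag1_autocorr(v):
    """Correlation between the series and itself shifted by one time-step."""
    v = v - v.mean()
    return float(np.dot(v[:-1], v[1:]) / np.dot(v, v))

print(lag1_autocorr(x))                   # close to phi = 0.9: strong dependence
print(lag1_autocorr(rng.permutation(x)))  # near 0: shuffling removes the dependence
```

Feature transformations used by non-DNN methods (discussed below) effectively try to fold such dependencies into per-sample features so that standard i.i.d. learners apply.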
We denote a time-series object that contains N samples as X = [X_{t_1}, X_{t_2}, ..., X_{t_N}], where the elements of the sequence are the observation(s) at time-points t_1, t_2, ..., t_N, respectively. Time-series data may be (a) univariate time-series (UTS), in which a single variable is observed over time, or (b) multivariate time-series (MTS), in which two or more variables are recorded over time. We denote an individual multivariate time-series datapoint as X_{t_K} = [X^1_{t_K}, X^2_{t_K}, ..., X^M_{t_K}]^T for the M distinct variables/features observed at each time-point. Note that in large commercial datasets, an individual categorical variable X^n_{t_K} can have millions of distinct values (refer to Table 2). In this paper, we focus on Multivariate Time-Series Classification (MTSC) (Ruiz et al., 2021), an area of machine learning concerned with assigning labels to MTS data. In non-DNN-based MTSC methods, the data is first converted to i.i.d. form via feature transformation, and then traditional
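As a concrete illustration of this notation (a toy sketch, not from the paper), an MTS with M variables observed at N time-points can be stored as an M x N array, where column K is the datapoint X_{t_K} = [X^1_{t_K}, ..., X^M_{t_K}]^T:

```python
import numpy as np

# Toy MTS with M = 3 variables observed at N = 4 time-points.
# Rows are variables, columns are time-points, so column K is the
# M-dimensional datapoint X_{t_K} (values here are arbitrary placeholders).
M, N = 3, 4
timepoints = np.array([1.0, 2.0, 3.0, 4.0])       # t_1, ..., t_N
X = np.arange(M * N, dtype=float).reshape(M, N)   # X[m, k] = X^{m+1}_{t_{k+1}}

x_t2 = X[:, 1]     # the MTS datapoint at time-point t_2
print(X.shape)     # (3, 4)
print(x_t2)        # [1. 5. 9.]
```

Variable-length series (as in the click-stream data mentioned above) would correspond to arrays with different N per sample, typically handled by padding or bucketing before batching.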

