OUT-OF-DISTRIBUTION REPRESENTATION LEARNING FOR TIME SERIES CLASSIFICATION

Abstract

Time series classification is an important problem in the real world. Due to the non-stationary property of time series, i.e., the distribution changes over time, it remains challenging to build models that generalize to unseen distributions. In this paper, we propose to view time series classification from the distribution perspective. We argue that the temporal complexity of a time series dataset can be attributed to unknown latent distributions that need to be characterized. To this end, we propose DIVERSIFY for out-of-distribution (OOD) representation learning on the dynamic distributions of time series. DIVERSIFY takes an iterative process: it first obtains the 'worst-case' latent distribution scenario via adversarial training, then reduces the gap between these latent distributions. We further show that this algorithm is theoretically supported. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that DIVERSIFY significantly outperforms other baselines and effectively characterizes the latent distributions. Code is available at https://github.com/microsoft/robustlearn.

1. INTRODUCTION

Time series classification is one of the most challenging problems in the machine learning and statistics community (Fawaz et al., 2019; Du et al., 2021). One important characteristic of time series is non-stationarity: its statistical features change over time. For years, there have been tremendous efforts for time series classification, such as hidden Markov models (Fulcher & Jones, 2014), RNN-based methods (Hüsken & Stagge, 2003), and Transformer-based approaches (Li et al., 2019; Drouin et al., 2022). We propose to model time series from the distribution perspective to handle their dynamically changing distributions; more precisely, to learn out-of-distribution (OOD) representations for time series that generalize to unseen distributions. The general OOD/domain generalization problem has been extensively studied (Wang et al., 2022; Lu et al., 2022; Krueger et al., 2021; Rame et al., 2022), where the key is to bridge the gap between known and unknown distributions. Despite existing efforts, OOD in time series remains less studied and more challenging. Compared to image classification, the distribution of time series data keeps changing over time, containing diverse distribution information that should be harnessed for better generalization. Figure 1 shows an illustrative example. OOD generalization in image classification often involves several domains whose domain labels are static and known (subfigure (a)), which can be employed to build OOD models. However, Figure 1(b) shows that in EMG time series data (Lobov et al., 2018), the distribution changes dynamically over time and its domain information is unavailable. If no attention is paid to exploring its latent distributions (i.e., sub-domains), predictions may fail in the face of diverse sub-domain distributions (subfigure (c)). This dramatically impedes existing OOD algorithms due to their reliance on domain information.
In this work, we propose DIVERSIFY, an OOD representation learning algorithm for time series classification that characterizes the latent distributions inside the data. Concretely, DIVERSIFY plays a min-max adversarial game: on one hand, it learns to segment the time series data into several latent sub-domains by maximizing the segment-wise distribution gap to preserve diversity, i.e., the 'worst-case' distribution scenario; on the other hand, it learns domain-invariant representations by reducing the distribution divergence between the obtained latent domains. Such latent distributions naturally exist in time series, e.g., activity data from multiple people follow different distributions. Additionally, our experiments show that even the data of one person exhibits such diversity: it can also be split into several latent distributions. Figure 1(d) shows that DIVERSIFY can effectively characterize the latent distributions (more results are in Sec. 3.5). To summarize, our contributions are four-fold:
- Novel perspective: We propose to view time series classification from the distribution perspective to learn OOD representations, which is more challenging than traditional image classification due to the existence of unidentified latent distributions.
- Novel methodology: DIVERSIFY is a novel framework to identify the latent distributions and learn generalized representations. Technically, we propose pseudo domain-class labels and adversarial self-supervised pseudo labeling to obtain the pseudo domain labels.
- Theoretical insights: We provide the theoretical insights behind DIVERSIFY to analyze its design philosophy and conduct experiments to verify the insights.
- Superior performance and insightful results: Qualitative and quantitative results using various backbones demonstrate the superiority of DIVERSIFY in several challenging scenarios: difficult tasks, significantly diverse datasets, and limited data.
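The min-max game above alternates between characterizing the 'worst-case' latent domains and aligning them. The following toy NumPy sketch illustrates the idea only; `diversify_sketch` is a hypothetical stand-in, using a k-means-style assignment in place of the paper's adversarial pseudo-labeling and a crude first-moment alignment in place of learned invariant representations.

```python
import numpy as np

def diversify_sketch(X, K=2, iters=10, seed=0):
    """Toy illustration of the two-step idea on window features X of shape (N, p).

    Step 1 (max): assign each window a pseudo latent-domain label so that the
    per-domain distributions are well separated (k-means-style stand-in for
    adversarial self-supervised pseudo labeling).
    Step 2 (min): reduce the gap between the obtained latent domains (here, by
    removing each domain's mean shift, a crude first-moment alignment).
    """
    rng = np.random.default_rng(seed)
    # Step 1: k-means-style latent-domain characterization.
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None] - centroids[None]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(K):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(0)
    # Step 2: align the latent domains by cancelling their mean shifts.
    Z = X.copy()
    for k in range(K):
        if np.any(labels == k):
            Z[labels == k] += X.mean(0) - X[labels == k].mean(0)
    return Z, labels
```

On data drawn from two shifted clusters, the returned labels separate the clusters while the aligned features `Z` have (near-)identical per-domain means; the real method instead learns both steps with neural feature extractors and adversarial training.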
More importantly, DIVERSIFY can successfully characterize the latent distributions within a time series dataset.

2. METHODOLOGY

A time-series training dataset D_tr can often be pre-processed with a sliding window¹ into N inputs: D_tr = {(x_i, y_i)}_{i=1}^N, where x_i ∈ X ⊂ R^p is a p-dimensional instance and y_i ∈ Y = {1, . . . , C} is its label. We use P_tr(x, y) on X × Y to denote the joint distribution of the training dataset. Our goal is to learn a generalized model from D_tr that predicts well on an unseen target dataset D_te, which is inaccessible during training. In our problem, the training and test datasets share the same input and output spaces but have different distributions, i.e., X_tr = X_te and Y_tr = Y_te, but P_tr(x, y) ≠ P_te(x, y). We aim to train a model h on D_tr that achieves minimum error on D_te.
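The sliding-window pre-processing that produces the instances (x_i, y_i) can be sketched as follows; the window length, stride, and the majority-vote labeling rule are illustrative choices, not values prescribed by the paper.

```python
import numpy as np

def sliding_windows(series, labels, win=128, stride=64):
    """Segment a (T, p) multivariate series into fixed-size windows.

    Each window becomes one instance x_i of shape (win, p); its label y_i is
    taken as the majority label among the window's time steps.
    """
    xs, ys = [], []
    for start in range(0, len(series) - win + 1, stride):
        xs.append(series[start:start + win])
        # Majority vote over the per-timestep labels inside the window.
        vals, counts = np.unique(labels[start:start + win], return_counts=True)
        ys.append(vals[counts.argmax()])
    return np.stack(xs), np.array(ys)
```

For a series of length T = 256 with win = 128 and stride = 64, this yields three overlapping windows, each a minimal instance as described in the footnote.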

2.1. MOTIVATION

What are domain and distribution shift in time series? A time series may consist of several unknown latent distributions (domains), even if the dataset is fully labeled. For instance, data collected by the sensors of three persons may belong to two different distributions due to inter-person dissimilarities. We term this spatial distribution shift. Surprisingly, we even find temporal distribution shifts within the data of a single subject.
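Temporal distribution shift is easy to observe even on synthetic data: summary statistics computed over successive segments of a non-stationary series drift over time. The snippet below is a toy NumPy illustration (not data from the paper), with the drift and noise parameters chosen arbitrarily.

```python
import numpy as np

rng = np.random.default_rng(0)

# A non-stationary series of length 3000: the mean drifts upward linearly
# and the noise scale grows over time.
T = 3000
t = np.arange(T)
x = rng.normal(loc=t / 1000.0, scale=1.0 + t / 3000.0)

# Split into three consecutive temporal segments and compare their statistics:
# the per-segment mean and standard deviation both drift, i.e., the segments
# follow different distributions even though they come from one "domain".
segments = x.reshape(3, 1000)
means = segments.mean(axis=1)  # drifts upward across segments
stds = segments.std(axis=1)    # grows across segments
```

A model that treats this series as a single stationary distribution would ignore exactly the kind of latent sub-domain structure DIVERSIFY is designed to characterize.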



¹ Sliding window is a common technique to segment one time series into fixed-size windows; each window is a minimal instance. We focus on fixed-size inputs due to their popularity in time series (Das et al., 1998).




Figure 1: Illustration of DIVERSIFY: (a) Domain generalization for image data requires known domain labels. (b) Domain labels are unknown for time series. (c) If we treat the time series data as one single domain, the sub-domains are misclassified. Different colors and shapes correspond to different classes and domains; axes represent data values. (d) Our DIVERSIFY effectively learns the latent distributions. The X-axis represents sample indices and the Y-axis represents values.

