OUT-OF-DISTRIBUTION REPRESENTATION LEARNING FOR TIME SERIES CLASSIFICATION

Abstract

Time series classification is an important problem in the real world. Due to the non-stationary property of time series, whose distribution changes over time, it remains challenging to build models that generalize to unseen distributions. In this paper, we propose to view time series classification from the distribution perspective. We argue that the temporal complexity of a time series dataset can be attributed to unknown latent distributions that need to be characterized. To this end, we propose DIVERSIFY for out-of-distribution (OOD) representation learning on the dynamic distributions of time series. DIVERSIFY takes an iterative process: it first obtains the 'worst-case' latent distribution scenario via adversarial training, then reduces the gap between these latent distributions. We further show that such an algorithm is theoretically supported. Extensive experiments are conducted on seven datasets with different OOD settings across gesture recognition, speech commands recognition, wearable stress and affect detection, and sensor-based human activity recognition. Qualitative and quantitative results demonstrate that DIVERSIFY significantly outperforms other baselines and effectively characterizes the latent distributions. Code is available at https://github.com/microsoft/robustlearn.

1. INTRODUCTION

Time series classification is one of the most challenging problems in the machine learning and statistics community (Fawaz et al., 2019; Du et al., 2021). One important property of time series is non-stationarity, meaning that its statistical features change over time. For years, there have been tremendous efforts in time series classification, such as hidden Markov models (Fulcher & Jones, 2014), RNN-based methods (Hüsken & Stagge, 2003), and Transformer-based approaches (Li et al., 2019; Drouin et al., 2022). We propose to model time series from the distribution perspective to handle their dynamically changing distributions; more precisely, to learn out-of-distribution (OOD) representations for time series that generalize to unseen distributions. The general OOD/domain generalization problem has been extensively studied (Wang et al., 2022; Lu et al., 2022; Krueger et al., 2021; Rame et al., 2022), where the key is to bridge the gap between known and unknown distributions. Despite existing efforts, OOD in time series remains less studied and more challenging. Compared to image classification, the distribution of time series data keeps changing over time, containing diverse distribution information that should be harnessed for better generalization. Figure 1 shows an illustrative example. OOD generalization in image classification often involves several domains whose domain labels are static and known (subfigure (a)), which can be employed to build OOD models. However, Figure 1(b) shows that in EMG time series data (Lobov et al., 2018), the distribution changes dynamically over time and its domain information is unavailable. If no attention is paid to exploring the latent distributions (i.e., sub-domains), predictions may fail in the face of diverse sub-domain distributions (subfigure (c)). This dramatically impedes existing OOD algorithms due to their reliance on domain information.
In this work, we propose DIVERSIFY, an OOD representation learning algorithm for time series classification that characterizes the latent distributions inside the data. Concretely, DIVERSIFY plays a min-max adversarial game: on one hand, it learns to segment the time series data into several latent sub-domains by maximizing the segment-wise distribution gap to preserve diversity, i.e., the 'worst-case' distribution scenario; on the other hand, it learns domain-invariant representations by reducing the distribution divergence between the obtained latent domains. Such latent distributions naturally exist in time series, e.g., the activity data of multiple people follow different distributions. Additionally, our experiments show that even the data of one person has such diversity: it can also be split into several latent distributions. Figure 1(d) shows that DIVERSIFY can effectively characterize the latent distributions (more results are in Sec. 3.5). To summarize, our contributions are four-fold. (1) Novel perspective: we propose to view time series classification from the distribution perspective to learn OOD representations, which is more challenging than traditional image classification due to the existence of unidentified latent distributions. (2) Novel methodology: DIVERSIFY is a novel framework to identify the latent distributions and learn generalized representations; technically, we propose pseudo domain-class labels and adversarial self-supervised pseudo-labeling to obtain the pseudo domain labels. (3) Theoretical insights: we provide the theoretical insights behind DIVERSIFY to analyze its design philosophy and conduct experiments to verify them. (4) Superior performance and insightful results: qualitative and quantitative results using various backbones demonstrate the superiority of DIVERSIFY in several challenging scenarios: difficult tasks, significantly diverse datasets, and limited data.
More importantly, DIVERSIFY can successfully characterize the latent distributions within a time series dataset.

2. METHODOLOGY

A time-series training dataset $\mathcal{D}^{tr}$ is often pre-processed using a sliding window into $N$ inputs: $\mathcal{D}^{tr} = \{(\mathbf{x}_i, y_i)\}_{i=1}^{N}$, where $\mathbf{x}_i \in \mathcal{X} \subset \mathbb{R}^p$ is a $p$-dimensional instance and $y_i \in \mathcal{Y} = \{1, \ldots, C\}$ is its label. We use $P^{tr}(\mathbf{x}, y)$ on $\mathcal{X} \times \mathcal{Y}$ to denote the joint distribution of the training dataset. Our goal is to learn a generalized model from $\mathcal{D}^{tr}$ that predicts well on an unseen target dataset $\mathcal{D}^{te}$, which is inaccessible during training. In our problem, the training and test datasets have the same input and output spaces but different distributions, i.e., $\mathcal{X}^{tr} = \mathcal{X}^{te}$, $\mathcal{Y}^{tr} = \mathcal{Y}^{te}$, but $P^{tr}(\mathbf{x}, y) \neq P^{te}(\mathbf{x}, y)$. We aim to train a model $h$ from $\mathcal{D}^{tr}$ to achieve minimum error on $\mathcal{D}^{te}$.
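As a concrete illustration, the sliding-window pre-processing that turns a raw series into the $N$ training instances can be sketched as follows (a NumPy sketch; the signal length and channel count are made-up example values):

```python
import numpy as np

def sliding_windows(series, window_size, step):
    """Split a (time, channels) series into overlapping fixed-size windows.
    Each window becomes one instance x_i of the training set D_tr."""
    T = series.shape[0]
    starts = range(0, T - window_size + 1, step)
    return np.stack([series[s:s + window_size] for s in starts])

# A toy 8-channel signal of length 1000, windowed with 50% overlap.
x = np.random.randn(1000, 8)
windows = sliding_windows(x, window_size=200, step=100)
print(windows.shape)  # (9, 200, 8)
```

A step equal to half the window size yields the 50% overlap between adjacent samples used later in the experiments (Appendix C).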

2.1. MOTIVATION

What are domain and distribution shift in time series? Time series may consist of several unknown latent distributions (domains), even if the dataset is fully labeled. For instance, data collected from the sensors of three persons may belong to two different distributions due to their dissimilarities. This can be termed spatial distribution shift. Surprisingly, we even find temporal distribution shifts in experiments (Figure 6): the distribution of a single person can also change over time. These shifts widely exist in time series, as suggested by (Zhang et al., 2021; Ragab et al., 2022). OOD generalization requires latent domain characterization. Due to the non-stationary property, naive approaches that treat a time series as one distribution fail to capture domain-invariant (OOD) features since they ignore the diversity inside the dataset. In Figure 1(c), we assume the training domain contains two sub-domains (circle and plus points). Directly treating it as one distribution via existing OOD approaches may produce the black margin; red star points are then misclassified into the green class when predicting on the OOD domain (star points) with the learned model. Thus, the multiple diverse latent distributions in time series should be characterized to learn better OOD features. A brief formulation of latent domain characterization. Following the above discussion, a time series may consist of $K$ unknown latent domains rather than a fixed one, i.e., $P^{tr}(\mathbf{x}, y) = \sum_{i=1}^{K} \pi_i P_i(\mathbf{x}, y)$, where $P_i(\mathbf{x}, y)$ is the distribution of the $i$-th latent domain with weight $\pi_i$ and $\sum_{i=1}^{K} \pi_i = 1$. There could be infinitely many ways to obtain the $P_i$'s, and our goal is to learn the 'worst-case' distribution scenario where the distribution divergence between each $P_i$ and $P_j$ is maximized. Why the 'worst-case' scenario? It maximally preserves the diverse information of each latent distribution, thus benefiting generalization. For an illustration, the obtained latent distributions are shown in Sec. 3.5.
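The mixture formulation above can be made concrete with a toy NumPy sketch. The Gaussian components and mixture weights below are invented purely for illustration; in real data the $P_i$ and $\pi_i$ are unknown and must be inferred:

```python
import numpy as np

# Toy illustration of P_tr(x) = sum_i pi_i * P_i(x): the observed training
# data is a mixture of K latent distributions with unknown weights pi_i.
rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])       # latent-domain weights, sum to 1
means = np.array([-2.0, 0.0, 3.0])   # P_i = N(means[i], 1) in this toy example

latent = rng.choice(len(pi), size=100000, p=pi)  # hidden domain per sample
x = rng.normal(means[latent], 1.0)               # observed data (domain unknown)

# The empirical share of each latent domain recovers pi.
frac = np.bincount(latent) / len(latent)
print(frac)
```

The learner only ever sees `x`; the index array `latent` is exactly the hidden domain information that DIVERSIFY attempts to reconstruct as pseudo domain labels.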

2.2. DIVERSIFY

In this paper, we propose DIVERSIFY to learn OOD representations for time series classification. The core of DIVERSIFY is to characterize the latent distributions and then minimize the distribution divergence between each pair. DIVERSIFY utilizes an iterative process: it first obtains the 'worst-case' distribution scenario from a given dataset, then bridges the distribution gaps between each pair of latent distributions. Figure 2 describes its main procedures, where steps 2~4 are iterative: 1. Pre-processing: adopt the sliding window to split the entire training dataset into fixed-size windows; we regard the data from one window as the smallest domain unit. 2. Fine-grained feature update: update the feature extractor using the proposed pseudo domain-class labels as supervision. 3. Latent distribution characterization: identify the domain label for each instance to obtain the latent distribution information, maximizing the distribution gaps to enlarge diversity. 4. Domain-invariant representation learning: utilize the pseudo domain labels from the last step to learn domain-invariant representations and train a generalizable model. Fine-grained Feature Update. Before characterizing the latent distributions, we perform a fine-grained feature update to obtain fine-grained representations. As shown in Figure 2 (blue), we propose a new concept, the pseudo domain-class label, to fully utilize the knowledge contained in both domains and classes; it serves as the supervision for the feature extractor. Features are thus fine-grained w.r.t. both domains and labels, instead of being attached only to domains or only to labels. At the first iteration, there is no domain label $d$ and we simply initialize $d = 0$ for all samples. We treat each category in each domain as a new class with label $s \in \{1, 2, \ldots, S\}$, where $S = K \times C$ and $K$ is the pre-defined number of latent distributions, which can be tuned in experiments.
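The pseudo domain-class label assignment described above is a simple arithmetic encoding; a minimal sketch (0-indexed for code convenience, whereas the paper writes labels starting from 1):

```python
import numpy as np

def domain_class_label(d, y, C):
    """Combine a pseudo domain label d in {0..K-1} and a class label
    y in {0..C-1} into a pseudo domain-class label s = d*C + y
    in {0..K*C-1}, one fine-grained class per (domain, class) pair."""
    return d * C + y

# K = 2 latent domains, C = 3 classes -> S = K * C = 6 fine-grained labels.
d = np.array([0, 0, 1, 1])
y = np.array([0, 2, 0, 2])
s = domain_class_label(d, y, C=3)
print(s)  # [0 2 3 5]

# At the first iteration d = 0 everywhere, so s reduces to the class label y.
print(domain_class_label(np.zeros_like(y), y, C=3))  # [0 2 0 2]
```

Because $s$ is bijective in $(d, y)$, supervising on $s$ forces the features to separate domains and classes simultaneously.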
We perform pseudo domain-class label assignment to get discrete values for supervision: $s = d \times C + y$. Let $h_f^{(2)}$, $h_b^{(2)}$, $h_c^{(2)}$ be the feature extractor, bottleneck, and classifier, respectively (superscripts denote the step number). The supervised loss is then computed with the cross-entropy loss $\ell$:

$\mathcal{L}_{super} = \mathbb{E}_{(\mathbf{x}, y) \sim P^{tr}} \, \ell(h_c^{(2)}(h_b^{(2)}(h_f^{(2)}(\mathbf{x}))), s).$

Latent Distribution Characterization. This step characterizes the latent distributions contained in one dataset. As shown in Figure 2 (green), we propose an adapted version of adversarial training to disentangle the domain labels from the class labels. However, no actual domain labels are provided, which hinders such disentanglement. Inspired by (Caron et al., 2018), we employ a self-supervised pseudo-labeling strategy to obtain domain labels. First, we obtain the centroid of each domain with class-invariant features:

$\hat{\mu}_k = \frac{\sum_{\mathbf{x}_i \in \mathcal{X}^{tr}} \delta_k(h_c^{(3)}(h_b^{(3)}(h_f^{(3)}(\mathbf{x}_i)))) \, h_b^{(3)}(h_f^{(3)}(\mathbf{x}_i))}{\sum_{\mathbf{x}_i \in \mathcal{X}^{tr}} \delta_k(h_c^{(3)}(h_b^{(3)}(h_f^{(3)}(\mathbf{x}_i))))},$

where $h_f^{(3)}$, $h_b^{(3)}$, $h_c^{(3)}$ are the feature extractor, bottleneck, and classifier, respectively. $\hat{\mu}_k$ is the initial centroid of the $k$-th latent domain and $\delta_k$ is the $k$-th element of the softmax output. Then, we obtain the pseudo domain labels via the nearest centroid classifier with a distance function $D$: $\hat{d}_i = \arg\min_k D(h_b^{(3)}(h_f^{(3)}(\mathbf{x}_i)), \hat{\mu}_k)$. Next, we recompute the centroids and update the pseudo domain labels:

$\mu_k = \frac{\sum_{\mathbf{x}_i \in \mathcal{X}^{tr}} \mathbb{I}(\hat{d}_i = k) \, h_b^{(3)}(h_f^{(3)}(\mathbf{x}_i))}{\sum_{\mathbf{x}_i \in \mathcal{X}^{tr}} \mathbb{I}(\hat{d}_i = k)}, \quad \hat{d}_i = \arg\min_k D(h_b^{(3)}(h_f^{(3)}(\mathbf{x}_i)), \mu_k),$

where $\mathbb{I}(a) = 1$ when $a$ is true and $0$ otherwise. After obtaining $\hat{d}$, we compute the loss of step 3:

$\mathcal{L}_{self} + \mathcal{L}_{cls} = \mathbb{E}_{(\mathbf{x}, y) \sim P^{tr}} \, \ell(h_c^{(3)}(h_b^{(3)}(h_f^{(3)}(\mathbf{x}))), \hat{d}) + \ell(h_{adv}^{(3)}(R_{\lambda_1}(h_b^{(3)}(h_f^{(3)}(\mathbf{x})))), y),$

where $h_{adv}^{(3)}$ is the discriminator for step 3, containing several linear layers and one classification layer, and $R_{\lambda_1}$ is the gradient reversal layer with hyperparameter $\lambda_1$ (Ganin et al., 2016).
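The centroid initialization and nearest-centroid refinement above can be sketched in NumPy. This is a simplified illustration operating on pre-computed bottleneck features `z` and domain-classifier `logits` (the networks $h_f, h_b, h_c$ are abstracted away, and squared Euclidean distance stands in for $D$):

```python
import numpy as np

def pseudo_domain_labels(z, logits, n_iter=1):
    """z: (N, F) bottleneck features; logits: (N, K) domain-classifier outputs.
    Centroids are first initialized from softmax-weighted features, then
    refined by hard nearest-centroid assignment, mirroring the two stages."""
    delta = np.exp(logits - logits.max(axis=1, keepdims=True))
    delta /= delta.sum(axis=1, keepdims=True)          # delta_k: soft weights
    mu = (delta.T @ z) / delta.sum(axis=0)[:, None]    # initial centroids mu_k
    for _ in range(n_iter):
        dist = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)  # (N, K)
        d_hat = dist.argmin(axis=1)                    # nearest-centroid labels
        for k in range(mu.shape[0]):                   # hard centroid update
            if (d_hat == k).any():
                mu[k] = z[d_hat == k].mean(axis=0)
    return d_hat, mu

# Two well-separated feature clusters; logits loosely correlated with them.
z = np.concatenate([np.zeros((5, 2)), np.full((5, 2), 10.0)])
logits = np.stack([-z[:, 0], z[:, 0]], axis=1)
d_hat, mu = pseudo_domain_labels(z, logits)
print(d_hat)  # [0 0 0 0 0 1 1 1 1 1]
```

In DIVERSIFY this assignment runs on class-invariant features produced by the adversarial branch, so the recovered labels reflect domains rather than classes.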
After this step, we obtain the pseudo domain label $\hat{d}$ for each $\mathbf{x}$. Domain-invariant Representation Learning. After obtaining the latent distributions, we learn domain-invariant representations for generalization. This step (purple in Figure 2) is simple: we borrow the idea from DANN (Ganin et al., 2016) and directly use adversarial training to optimize the classification loss $\mathcal{L}_{cls}$ and the domain classifier loss $\mathcal{L}_{dom}$ via the gradient reversal layer (GRL), a common technique that facilitates adversarial training by reversing gradients (Ganin et al., 2016):

$\mathcal{L}_{cls} + \mathcal{L}_{dom} = \mathbb{E}_{(\mathbf{x}, y) \sim P^{tr}} \, \ell(h_c^{(4)}(h_b^{(4)}(h_f^{(4)}(\mathbf{x}))), y) + \ell(h_{adv}^{(4)}(R_{\lambda_2}(h_b^{(4)}(h_f^{(4)}(\mathbf{x})))), \hat{d}),$

where $\ell$ is the cross-entropy loss and $R_{\lambda_2}$ is the gradient reversal layer with hyperparameter $\lambda_2$ (Ganin et al., 2016). We omit the details of GRL and adversarial training here since they are common techniques in deep learning; more details are presented in Appendix B.2. Training, Inference, and Complexity. We repeat these steps until convergence or a maximum number of epochs. Different from existing methods, the last two steps only optimize the last few independent layers, not the feature extractor. We perform inference with the modules from the last step. Most trainable parameters are shared between modules, so DIVERSIFY has the same model size as existing methods and reaches quick convergence in experiments (Figure F.5).

2.3. THEORETICAL INSIGHTS

We present theoretical insights showing that our approach is well motivated in theory. Proofs can be found in Appendix A.

Proposition 2.1. Let $\mathcal{X}$ be a space and $\mathcal{H}$ a class of hypotheses corresponding to this space. Let $Q$ and the collection $\{P_i\}_{i=1}^{K}$ be distributions over $\mathcal{X}$, and let $\{\phi_i\}_{i=1}^{K}$ be a collection of non-negative coefficients with $\sum_i \phi_i = 1$. Let $O$ be a set of distributions such that for every $S \in O$,

$d_{\mathcal{H}\Delta\mathcal{H}}\left(\textstyle\sum_i \phi_i P_i, S\right) \le \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j).$

Then, for any $h \in \mathcal{H}$,

$\varepsilon_Q(h) \le \lambda + \sum_i \phi_i \varepsilon_{P_i}(h) + \frac{1}{2} \min_{S \in O} d_{\mathcal{H}\Delta\mathcal{H}}(S, Q) + \frac{1}{2} \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j),$

where $\lambda$ is the error of an ideal joint hypothesis, $\varepsilon_P(h)$ is the error of a hypothesis $h$ on a distribution $P$, and $d_{\mathcal{H}\Delta\mathcal{H}}(P, Q)$ is the $\mathcal{H}\Delta\mathcal{H}$-divergence, which measures differences in distribution (Ben-David et al., 2010).

The first term of the bound, $\lambda$, is often neglected since it is small in practice. The second term, $\sum_i \phi_i \varepsilon_{P_i}(h)$, exists in almost all methods and can be minimized via supervision from class labels with the cross-entropy loss. Our main purpose is to minimize the last two terms. Here $Q$ corresponds to the unseen out-of-distribution target domain. The last term, $\frac{1}{2} \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j)$, is common in OOD theory and measures the maximum difference among source domains; this corresponds to step 4 of our approach. Finally, the third term, $\frac{1}{2} \min_{S \in O} d_{\mathcal{H}\Delta\mathcal{H}}(S, Q)$, explains why we exploit sub-domains in step 3. Since our goal is a model that performs well on an unseen target domain, we cannot access $Q$; to minimize this term, we can only enlarge the range of $O$. By the condition on $O$ above, this requires maximizing $\max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j)$, which corresponds to step 3 of our method: segmenting the time series data into several latent sub-domains by maximizing the segment-wise distribution gap to preserve diversity, i.e., the 'worst-case' distribution scenario.

3. EXPERIMENTS

We perform evaluations on four diverse time series classification tasks: gesture recognition, speech commands recognition, wearable stress & affect detection, and sensor-based activity recognition. Time series OOD algorithms are currently less studied and there are only two recent strong approaches for comparison: GILE (Qian et al., 2021) and AdaRNN (Du et al., 2021). We further compare with 7 general OOD methods from DomainBed (Gulrajani & Lopez-Paz, 2021): ERM, DANN (Ganin et al., 2016), CORAL (Sun & Saenko, 2016), Mixup (Zhang et al., 2018), GroupDRO (Sagawa et al., 2020), RSC (Huang et al., 2020), and ANDMask (Parascandolo et al., 2021). More details of these methods are in Sec. B.2 and B.3. For fairness, all methods (except GILE and AdaRNN) use a feature net with two blocks, each containing one convolution layer, one pooling layer, and one batch normalization layer, following (Wang et al., 2019). We also use Transformers (Vaswani et al., 2017) as a backbone. Detailed data pre-processing, architectures, and hyperparameters are in Appendix C.5 and D. Ablations with various backbones are in Figure 8 and Appendix F.4. Most OOD methods require domain labels to be known in training while ours does not, which is more challenging and practical. We adopt the training-domain-validation strategy: the training data are split 8:2 for training and validation. We tune all methods and report the average best performance over three trials for fairness. Note that the "target" in the experiments is unseen and only used for testing. $K$ in DIVERSIFY is treated as a hyperparameter and is tuned to record the best OOD performance. Per-segment accuracy is the evaluation metric. Time complexity and convergence are in Sec. F.5, showing quick convergence.

3.1. GESTURE RECOGNITION

First, we evaluate DIVERSIFY on the EMG for Gestures dataset (Lobov et al., 2018). It contains data of 36 subjects with 7 classes, from which we select 6 common classes for our experiments.
We randomly divide the 36 subjects into four domains (i.e., 0, 1, 2, 3). More details on EMG and the domain splits can be found in Sec. C.2 and C.6, respectively. EMG data are affected by many factors since they come from bioelectric signals. EMG data are also scene- and device-dependent: the same person may generate different data when performing the same activity with the same device at a different time (i.e., distribution shift across time (Wilson et al., 2020; Purushotham et al., 2016)) or with different devices at the same time. Thus, the EMG benchmark is challenging. Table 1 shows that with the same backbone, our method achieves the best average performance and is 4.3% better than the second-best method. DIVERSIFY even outperforms AdaRNN, which has a stronger backbone. Then, we adopt a regular speech recognition task, the Speech Commands dataset (Warden, 2018). It consists of one-second audio recordings of both background noise and spoken words such as 'left' and 'right'. It is collected from more than 2,000 persons and is thus more complicated. Following (Kidger et al., 2020), we use 34,975 time series corresponding to ten spoken words to produce a balanced classification problem. Since this dataset is collected from multiple persons, the training and test distributions are different, which is also an OOD problem with one training domain. There are many subjects and each subject records only a few audio clips; thus, we do not split each sample. Figure 3 shows the results on two different backbones.

3.2. SPEECH COMMANDS

Compared with GroupDRO, DIVERSIFY achieves over 1% improvement with a basic CNN backbone and over 0.6% improvement with the strong MatchBoxNet3-1-64 backbone (Majumdar & Ginsburg, 2020). This demonstrates the superiority of our method on a regular time-series benchmark containing massive distributions. Here, 0~4 on the x-axis denote the unseen test datasets.

3.3. WEARABLE STRESS AND AFFECT DETECTION

We further evaluate DIVERSIFY on a larger dataset, Wearable Stress and Affect Detection (WESAD) (Schmidt et al., 2018). WESAD is a public dataset containing physiological and motion data of 15 subjects with 63,000,000 instances. We utilize the sensor modalities of chest-worn devices, including electrocardiogram, electrodermal activity, electromyogram, respiration, body temperature, and three-axis acceleration. We split the 15 subjects into four domains (details are in Sec. C.6). The results in Figure 4 show that our method achieves the best performance compared to other state-of-the-art methods, with an improvement of over 8% on this larger dataset.

3.4. SENSOR-BASED HUMAN ACTIVITY RECOGNITION

Finally, we construct four diverse OOD settings by leveraging four sensor-based human activity recognition datasets: DSADS (Barshan & Yüksek, 2014), USC-HAD (Zhang & Sawchuk, 2012), UCI-HAR (Anguita et al., 2012), and PAMAP (Reiss & Stricker, 2012). These datasets are collected from different people and positions using accelerometers and gyroscopes, with 11,741,000 instances in total. (1) Cross-person generalization aims to learn generalized models across different persons. (2) Cross-position generalization aims to learn generalized models across different sensor positions. (3) Cross-dataset generalization aims to learn generalized models across different datasets. (4) One-Person-To-Another aims to learn generalized models for different persons from the data of a single person. Table 2 and Table 3 show the results on the four settings for HAR, where our method significantly outperforms the second-best baseline by 2.4%, 1.4%, 9.9%, and 5.8%, respectively. All results demonstrate the superiority of DIVERSIFY. More results using Transformers can be found in F.4. We make several further observations. (1) When the task is difficult: In the Cross-Person setting, USC-HAD may be the most difficult task. Although it has more samples, it contains 14 subjects with only two sensors at one position, which may make learning more difficult. The results support this: all methods perform poorly on this benchmark while ours shows the largest improvement. (2) When datasets are significantly more diverse: Compared to the Cross-Person and Cross-Position settings, Cross-Dataset may be more difficult since the datasets are totally different and samples are influenced by subjects, devices, sensor positions, and other factors. In this setting, our method is substantially better than the others. (3) Limited data: Compared with the Cross-Person setting, One-Person-To-Another is more difficult since it has fewer data samples.
In this case, enhancing diversity can bring a remarkable improvement and our method can boost the performance.

3.5. ANALYSIS

Ablation study. We present an ablation study to answer the following three questions. (1) Why obtain pseudo domain labels with class-invariant features in step 3? If we obtain pseudo domain labels with common features, the domain labels may be correlated with the class labels, which may introduce class information into the latent domain characterization. (Note that in the One-Person-To-Another setting, we only report the average accuracy of the four tasks on each dataset; since only one domain exists in the training dataset for this setting, DANN and CORAL cannot be implemented there.)

Varying backbones. Figure 8 shows the results using small, medium, and large backbones, respectively (we implement them with different numbers of layers).

4. RELATED WORK

A previous study (2020) also investigated DG without domain labels by clustering with style features for images; however, it does not apply to time series and is not end-to-end trainable. Disentanglement (Peng et al., 2019; Zhang et al., 2022b) tries to disentangle the domain and label information, but these methods also assume access to domain information. Single domain generalization is similar to our setting in that it also involves one training domain (Fan et al., 2021; Li et al., 2021; Wang et al., 2021; Zhu & Li, 2022). However, these works treated the single domain as one distribution and did not explore latent distributions. Multi-domain learning is similar to DG in that it also trains on multiple domains, but it tests on the training distributions. Deecke et al. (2022) proposed sparse latent adapters to learn from unknown domain labels, but their work does not consider the min-max worst-case distribution scenario and optimization. In domain adaptation, Wang et al. (2020) proposed the notion of domain index, which was later learned with variational models (Xu et al., 2023), but these works took a different modeling methodology since they did not consider min-max optimization. Mixture models (Rasmussen et al., 1999) represent the presence of subpopulations within an overall population, e.g., Gaussian mixture models.
Our approach has a similar formulation but does not use generative models. Subpopulation shift is a recent setting (Koh et al., 2021) that refers to the case where the training and test domains overlap but their relative proportions differ. Our problem does not belong to this setting since we assume that the distributions do not overlap. Distributionally robust optimization (DRO) (Delage & Ye, 2010) shares a similar paradigm with our work: it also seeks a distribution that has the worst performance within a range of the raw distribution. GroupDRO (Sagawa et al., 2020) studied DRO at the group level. However, we study the internal distribution shift instead of seeking a global distribution close to the original one.

5. LIMITATION AND DISCUSSION

DIVERSIFY could be further improved by pursuing the following avenues. 1) Estimating the number of latent distributions $K$ automatically: we currently treat it as a hyperparameter. 2) Seeking the semantics behind the latent distributions (e.g., Figure 6(a)): can adding more human knowledge yield better latent distributions? 3) Extending DIVERSIFY beyond classification, e.g., to forecasting problems. Moreover, we argue that dynamic distributions exist not only in time series but also in general machine learning data such as images and text (Deecke et al., 2022; Xu et al., 2023). Thus, it is of great interest to apply our approach to these domains to further improve their performance.

A THEORETICAL INSIGHTS

A.1 BACKGROUND

For a distribution $P$ with an ideal binary labeling function $h^*$ and a hypothesis $h$, we define the error $\varepsilon_P(h)$ in accordance with (Ben-David et al., 2010) as:

$\varepsilon_P(h) = \mathbb{E}_{\mathbf{x} \sim P} \, |h(\mathbf{x}) - h^*(\mathbf{x})|.$

We also give the definition of $\mathcal{H}$-divergence following (Ben-David et al., 2010). Given two distributions $P, Q$ over a space $\mathcal{X}$ and a hypothesis class $\mathcal{H}$,

$d_{\mathcal{H}}(P, Q) = 2 \sup_{h \in \mathcal{H}} |\mathrm{Pr}_P(I_h) - \mathrm{Pr}_Q(I_h)|,$

where $I_h = \{\mathbf{x} \in \mathcal{X} \mid h(\mathbf{x}) = 1\}$. We often consider the $\mathcal{H}\Delta\mathcal{H}$-divergence of (Ben-David et al., 2010), where the symmetric difference hypothesis class $\mathcal{H}\Delta\mathcal{H}$ is the set of functions characterizing disagreements between hypotheses.

Theorem A.1 (Theorem 2.1 in (Sicilia et al., 2021), modified from Theorem 2 in (Ben-David et al., 2010)). Let $\mathcal{X}$ be a space and $\mathcal{H}$ a class of hypotheses corresponding to this space. Suppose $P$ and $Q$ are distributions over $\mathcal{X}$. Then for any $h \in \mathcal{H}$, the following holds:

$\varepsilon_Q(h) \le \lambda + \varepsilon_P(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(Q, P),$

with $\lambda$ the error of an ideal joint hypothesis for $Q, P$.

Theorem A.1 provides an upper bound on the target error. $\lambda$ is a property of the dataset and hypothesis class and is often ignored. Theorem A.1 demonstrates the necessity of learning domain-invariant features.

A.2 PROOF OF PROPOSITION 2.1

Proposition 2.1. Let $\mathcal{X}$ be a space and $\mathcal{H}$ a class of hypotheses corresponding to this space. Let $Q$ and the collection $\{P_i\}_{i=1}^{K}$ be distributions over $\mathcal{X}$, and let $\{\phi_i\}_{i=1}^{K}$ be a collection of non-negative coefficients with $\sum_i \phi_i = 1$. Let $O$ be a set of distributions such that for every $S \in O$ the following holds: $d_{\mathcal{H}\Delta\mathcal{H}}(\sum_i \phi_i P_i, S) \le \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j)$. Then, for any $h \in \mathcal{H}$,

$\varepsilon_Q(h) \le \lambda + \sum_i \phi_i \varepsilon_{P_i}(h) + \frac{1}{2} \min_{S \in O} d_{\mathcal{H}\Delta\mathcal{H}}(S, Q) + \frac{1}{2} \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j),$

where $\lambda$ is the error of an ideal joint hypothesis.

Proof. On one hand, by Theorem A.1, we have

$\varepsilon_Q(h) \le \lambda_1 + \varepsilon_S(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, Q), \quad \forall h \in \mathcal{H}, \, \forall S \in O.$
On the other hand, by Theorem A.1, we have

$\varepsilon_S(h) \le \lambda_2 + \varepsilon_{\sum_i \phi_i P_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}\left(\textstyle\sum_i \phi_i P_i, S\right), \quad \forall h \in \mathcal{H}.$

Since $\varepsilon_{\sum_i \phi_i P_i}(h) = \sum_i \phi_i \varepsilon_{P_i}(h)$ and $d_{\mathcal{H}\Delta\mathcal{H}}(\sum_i \phi_i P_i, S) \le \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j)$, combining the two bounds gives

$\varepsilon_Q(h) \le \lambda + \sum_i \phi_i \varepsilon_{P_i}(h) + \frac{1}{2} d_{\mathcal{H}\Delta\mathcal{H}}(S, Q) + \frac{1}{2} \max_{i,j} d_{\mathcal{H}\Delta\mathcal{H}}(P_i, P_j), \quad \forall h \in \mathcal{H}, \, \forall S \in O,$

where $\lambda = \lambda_1 + \lambda_2$. Since this inequality holds for all $S \in O$, taking the minimum over $S \in O$ yields the claimed bound, which completes the proof.

B METHOD DETAILS

B.1 DOMAIN-INVARIANT REPRESENTATION LEARNING

Domain-invariant representation learning utilizes adversarial training, which involves a feature network, a domain discriminator, and a classification network. The domain discriminator tries its best to discriminate the domain labels of the data, while the feature network tries its best to generate features that confuse the domain discriminator, thereby obtaining domain-invariant representations. This adversarial process can be expressed in our setting as:

$\min_{h_b^{(4)}, h_c^{(4)}} \mathbb{E}_{(\mathbf{x}, y) \sim P^{tr}} \, \ell(h_c^{(4)}(h_b^{(4)}(h_f^{(4)}(\mathbf{x}))), y) - \ell(h_{adv}^{(4)}(h_b^{(4)}(h_f^{(4)}(\mathbf{x}))), \hat{d}),$

$\min_{h_{adv}^{(4)}} \mathbb{E}_{(\mathbf{x}, y) \sim P^{tr}} \, \ell(h_{adv}^{(4)}(h_b^{(4)}(h_f^{(4)}(\mathbf{x}))), \hat{d}).$

Optimizing these objectives directly requires alternating between $h_b^{(4)}, h_c^{(4)}$ and $h_{adv}^{(4)}$, which is cumbersome; it is preferable to optimize them simultaneously. The key difficulty is the negative sign in the first objective. The gradient reversal layer (GRL), a popular implementation of adversarial training across several domains (Ganin et al., 2016), solves this: it acts as the identity transformation during forward propagation, while during backpropagation it takes the gradient from the subsequent layer and changes its sign before passing it to the preceding layer. Thus the GRL can be ignored in the forward pass, and in the backward pass it reverses the sign of the gradient on $h_b^{(4)}$, which resolves the problem caused by the negative sign.
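The forward/backward behavior of the GRL can be sketched as below. This is a minimal hand-written NumPy illustration; in an autograd framework such as PyTorch, the same behavior would be implemented as a custom autograd function rather than explicit forward/backward methods:

```python
import numpy as np

class GradReverse:
    """Minimal sketch of a gradient reversal layer (Ganin et al., 2016)."""

    def __init__(self, lam=1.0):
        self.lam = lam  # lambda hyperparameter scaling the reversed gradient

    def forward(self, x):
        # Identity during forward propagation: the GRL is invisible here.
        return x

    def backward(self, grad_output):
        # During backpropagation, flip the sign (scaled by lambda) before
        # passing the gradient on to the preceding layer (the bottleneck).
        return -self.lam * grad_output

grl = GradReverse(lam=0.5)
x = np.array([1.0, -2.0])
out = grl.forward(x)                       # features pass through unchanged
grad = grl.backward(np.array([0.4, 0.4]))  # gradient reversed and scaled
print(out, grad)  # out = [1., -2.], grad = [-0.2, -0.2]
```

Placed between the bottleneck and the domain discriminator, this sign flip makes a single gradient-descent step simultaneously minimize the discriminator's loss in its own parameters and maximize it in the feature parameters.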

B.2 METHOD FORMULATION AND IMPLEMENTATION

While it is common to use probabilistic or Bayesian approaches when distributions are involved, we do not explicitly formulate the latent distributions: ours is not a generative or parametric method. In fact, the concept of a latent distribution is merely a notion to help understand our method. Our ultimate goal is to infer which distribution a segment belongs to for the best OOD performance. Thus, we do not care what a distribution exactly looks like, nor do we parameterize it, since that is not our focus. As long as we can obtain diverse latent distributions, our goal is achieved. In the actual implementation, the latent distributions are simply represented as domain labels: latent distribution $i$ is a domain $i$ to which certain time series segments belong, as stated in the introduction. We also acknowledge that parameterizing the latent distributions may yield better performance, which is left for future research.

B.3 COMPARISONS TO OTHER LATEST METHODS

Here, we offer more details on comparisons to the other latest methods used in the main paper. DANN (Ganin et al., 2016) utilizes adversarial training to make the discriminator unable to classify domains, yielding better domain-invariant features. It requires domain labels and splits data in advance, while ours is a universal method. CORAL (Sun & Saenko, 2016) utilizes covariance alignment in feature layers for better domain-invariant features; it also requires domain labels and splits data in advance. Mixup (Zhang et al., 2018) utilizes interpolation to generate more data for better generalization, whereas ours focuses on generalized representation learning. GroupDRO (Sagawa et al., 2020) seeks a global distribution with the worst performance within a range of the raw distribution for better generalization; ours studies the internal distribution shift instead of seeking a global distribution close to the original one. RSC (Huang et al., 2020) is a self-challenging training algorithm that forces the network to activate as many features as possible by manipulating gradients; it belongs to gradient-operation-based DG, while ours learns generalized features. ANDMask (Parascandolo et al., 2021) is another gradient-based optimization method belonging to special learning strategies, while ours focuses on representation learning. GILE (Qian et al., 2021) is a disentanglement method designed for cross-person human activity recognition; it is based on VAEs and requires domain labels. AdaRNN (Du et al., 2021) is a two-stage, non-differentiable method tailored for RNNs, with a specific algorithm designed for splitting. Ours is universal and differentiable, with better performance.

C DATASETS

C.1 DATASETS INFORMATION

Table 4 shows the statistical information on each dataset. USC-HAD (Zhang & Sawchuk, 2012) is composed of 14 subjects (7 male, 7 female, aged 21 to 49) executing 12 activities with a sensor tied to the front right hip. UCI-HAR (Anguita et al., 2012) is collected from 30 subjects performing 6 daily living activities with a waist-mounted smartphone. PAMAP (Reiss & Stricker, 2012) contains data of 18 activities, performed by 9 subjects wearing 3 sensors.

We now introduce how we preprocess the data and the final data dimensions used in the experiments. We mainly use the sliding window technique, a common technique in time-series classification, to split the data. As its name suggests, this technique takes a subset of data from a given array or sequence. Its two main parameters are the window size, the length of a subset, and the step size, the distance moved forward each time. For EMG, we set the window size to 200 and the step size to 100, so there is a 50% overlap between two adjacent samples. We normalize each sample with x' = (x - min X) / (max X - min X), where X contains all x. The final dimension is 8 × 1 × 200. For Speech Commands, we follow (Kidger et al., 2020). For WESAD, we use the same preprocessing as for EMG.

Next, we give details on all datasets in the Cross-person setting. For DSADS, we directly use the data split provided by the dataset providers. The final dimension is 45 × 1 × 125, where 45 = 5 × 3 × 3: five positions, three sensors per position, and three axes per sensor. For USC-HAD, the window size is 200 and the step size is 100; the final dimension is 6 × 1 × 200. For PAMAP, the window size is 200 and the step size is 100; the final dimension is 27 × 1 × 200. For UCI-HAR, we directly use the providers' split; the final dimension is 6 × 1 × 128.

For Cross-position, we directly use the samples obtained from DSADS in the Cross-person setting. Since each position corresponds to one domain, each sample is split into five samples along the first dimension, and the final dimension is 9 × 1 × 125. For Cross-dataset, we directly use the samples obtained in the Cross-person setting. To make all datasets share the same label space and input space, we select six common classes: WALKING, WALKING UPSTAIRS, WALKING DOWNSTAIRS, SITTING, STANDING, and LAYING. In addition, we down-sample the data and select two sensors from each dataset that belong to the same position. The final dimension is 6 × 1 × 50. For One-Person-To-Another, we randomly select four pairs of persons from DSADS, USC-HAD, and PAMAP respectively. The four tasks are 1 → 0, 3 → 2, 5 → 4, and 7 → 6, where each number corresponds to one subject. The final dimensions are 45 × 1 × 125, 6 × 1 × 200, and 27 × 1 × 200 for DSADS, USC-HAD, and PAMAP respectively. Note that samples in EMG, WESAD, and HAR all have more than one channel (the first dimension), which means they are all multivariate.
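The sliding-window segmentation and min-max normalization described above can be sketched as follows (variable names and shapes are illustrative, matching the EMG case of window 200 and step 100):

```python
import numpy as np

def sliding_windows(signal, window=200, step=100):
    """signal: (channels, length) array -> (n_windows, channels, 1, window)."""
    n = (signal.shape[1] - window) // step + 1
    # 50% overlap when step == window / 2, as in the EMG setting above.
    return np.stack([signal[:, i * step:i * step + window]
                     for i in range(n)])[:, :, None, :]

def minmax_normalize(x, lo, hi):
    """x' = (x - min X) / (max X - min X)."""
    return (x - lo) / (hi - lo)

emg = np.random.rand(8, 1000)          # 8-channel signal, 1000 time steps
samples = sliding_windows(emg)          # -> shape (9, 8, 1, 200)
samples = minmax_normalize(samples, emg.min(), emg.max())
```

With step = window / 2, adjacent samples share exactly half their time steps, matching the 50% overlap stated above.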

C.6 DETAILS ON DOMAIN SPLITS

We introduce how we split the data here. Since Speech Commands is a regular task, we simply randomly split the entire dataset into training, validation, and testing sets. We mainly focus on EMG, WESAD, and HAR, for which we construct domains for OOD tasks. We denote the subjects of a dataset by 0 to s_n - 1, where s_n is the number of subjects in the dataset. For example, there are 36 subjects in EMG and we use 0, 1, 2, ..., 35 to denote their data respectively. Table 5 shows the initial domain splits of EMG, WESAD, and all HAR datasets in the Cross-person setting; we simply aim to make each domain contain a similar number of samples within a dataset. As noted in the main paper, we also use 0, 1, 2, and 3 to denote different domains, but these indices have a different meaning from subject indices. When conducting experiments, we take one domain as the testing data and the others as the training data. Our method is not influenced by the splits of the training data since we do not need the domain labels.
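The leave-one-domain-out protocol described above can be sketched as follows (the data layout, a dict from domain id to samples, is an assumption for illustration):

```python
def leave_one_domain_out(data_by_domain, test_domain):
    """data_by_domain: dict mapping domain id -> list of samples.

    Returns (train, test): the held-out domain is used only for testing,
    and the remaining domains are merged for training.
    """
    test = list(data_by_domain[test_domain])
    train = [s for d, samples in data_by_domain.items()
             if d != test_domain for s in samples]
    return train, test

# Toy example with three domains:
domains = {0: ["a0", "a1"], 1: ["b0"], 2: ["c0", "c1"]}
train, test = leave_one_domain_out(domains, test_domain=1)
# train merges domains 0 and 2; test contains only domain 1
```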

D NETWORK ARCHITECTURE AND HYPERPARAMETERS

For the architecture, the model contains two blocks, each with one convolution layer, one pooling layer, and one batch normalization layer. A single fully-connected layer is used as the bottleneck layer, while another fully-connected layer serves as the classifier. All methods are implemented with PyTorch (Paszke et al., 2019). The maximum number of training epochs is set to 150. The Adam optimizer with weight decay 5 × 10^-4 is used. The learning rate for GILE is 10^-4, and the learning rate for the remaining methods is 10^-2 or 10^-3. (For Speech Commands with MatchBoxNet3-1-64, we also try a learning rate of 10^-4.) We tune hyperparameters for each method. For the pooling layer, we use MaxPool2d in PyTorch with kernel size (1, 2) and stride 2. For the convolution layer, we use Conv2d in PyTorch; different tasks have different kernel sizes, shown in Table 6.
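A minimal sketch of this backbone for the EMG input shape (8 × 1 × 200) is given below. The convolution kernel size (1, 9) and the layer widths are our own illustrative choices, since the paper tunes kernel sizes per task (Table 6); only the overall structure (two conv/pool/batch-norm blocks, a fully-connected bottleneck, and a fully-connected classifier) and the pooling configuration follow the description above.

```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self, in_ch=8, n_classes=6):
        super().__init__()
        def block(cin, cout):
            return nn.Sequential(
                nn.Conv2d(cin, cout, kernel_size=(1, 9)),     # kernel size is illustrative
                nn.MaxPool2d(kernel_size=(1, 2), stride=2),   # as stated above
                nn.BatchNorm2d(cout),
            )
        self.features = nn.Sequential(block(in_ch, 16), block(16, 32))
        self.bottleneck = nn.Linear(32 * 1 * 44, 256)  # single FC bottleneck
        self.classifier = nn.Linear(256, n_classes)    # FC classifier

    def forward(self, x):
        z = self.features(x).flatten(1)
        return self.classifier(self.bottleneck(z))

out = Net()(torch.randn(2, 8, 1, 200))  # logits of shape (batch, n_classes)
```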

E EVALUATION METRICS

We use the average accuracy on the testing dataset as the evaluation metric for all benchmarks. Average accuracy is the most common metric for DG; its formula and notation are given below.

F.2 PARAMETER SENSITIVITY

There are four main hyperparameters in our method: K, the number of latent sub-domains; λ1 for the adversarial part in step 3; λ2 for the adversarial part in step 4; and the local epochs and total rounds. For fairness, the product of local epochs and total rounds is kept fixed. We evaluate the parameter sensitivity of our method in Figure 11, where we vary one parameter and fix the others. From these results, we can see that our method achieves better performance over a wide range of values, demonstrating that it is robust. Figure 12 shows the H-divergence among domains with the initial splits and our splits on EMG; our splits have larger H-divergence and can thereby bring better generalization.
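The tuning procedure for K can be sketched as follows: K is grid-searched (the discussion section mentions the range [2, 10]), while the training budget rounds × local_epochs is held constant for fairness. `train_and_eval` is a hypothetical placeholder, not part of the paper's code, and the budget value is illustrative.

```python
TOTAL_BUDGET = 150  # illustrative; e.g., matching the maximum training epochs

def search_K(train_and_eval, Ks=range(2, 11), rounds=30):
    """Grid-search K under a fixed budget rounds * local_epochs."""
    local_epochs = TOTAL_BUDGET // rounds  # product kept constant for fairness
    best_K, best_acc = None, -1.0
    for K in Ks:
        acc = train_and_eval(K=K, rounds=rounds, local_epochs=local_epochs)
        if acc > best_acc:
            best_K, best_acc = K, acc
    return best_K, best_acc

# Toy stand-in objective, for demonstration only:
best_K, best_acc = search_K(
    lambda K, rounds, local_epochs: 1.0 - abs(K - 4) * 0.05)
```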

F.4 THE INFLUENCE OF ARCHITECTURES

To ensure that our method works with models of different sizes, we add experiments with more complex and simpler architectures. As shown in Table 7, where Small, Medium, and Large indicate different model sizes (our paper uses Medium), model size clearly influences the results, and our method still achieves the best performance. Small corresponds to a model with one convolutional layer, Medium to two convolutional layers, and Large to four convolutional layers. For most methods, more complex models bring better results.



There are recent approaches purely for time series, but not for OOD. There may be more recent OOD methods, but according to DomainBed (Gulrajani & Lopez-Paz, 2021), most approaches do not significantly outperform ERM; DANN, CORAL, and Mixup are also strong baselines. There may be no single optimal K for a dataset; we perform a grid search in [2, 10] to get the best performance. Figures 6(c)-6(d) and the experimental results above show that attending to shifts comprehensively brings larger divergence and better results. We do not use UCI-HAR in the cross-person setting since its baseline is already good enough.

CONCLUSION

We proposed DIVERSIFY to learn generalized representations for time series classification. DIVERSIFY employs an adversarial game that maximizes the 'worst-case' distribution scenario while minimizing the divergence between the latent distributions. We demonstrated its effectiveness in different applications. Notably, not only a mixed dataset but even a dataset from a single person can contain several latent distributions, and characterizing such latent distributions greatly improves generalization performance on unseen datasets.




Figure 1: Illustration of DIVERSIFY: (a) Domain generalization for image data requires known domain labels. (b) Domain labels are unknown for time series. (c) If we treat the time series data as one single domain, the sub-domains are misclassified. Different colors and shapes correspond to different classes and domains. Axes represent data values. (d) Finally, our DIVERSIFY can effectively learn the latent distributions. X-axis represents data numbers while Y-axis represents values.

Figure 2: The framework of DIVERSIFY.

Figure 3: Results on Speech commands with two different backbones.

Figure 4: Results on WESAD. Here, 0 ∼ 4 in x-axis denotes the unseen test dataset.

Figure 5: Ablation study of DIVERSIFY.

Figures 7(c) and 7(d) show that DIVERSIFY learns better domain-invariant representations than the latest method ANDMask. In summary, DIVERSIFY finds better latent domains to enhance generalization. More results are in Appendix F.1.

C.4 DETAILS ON DIFFERENT SETTINGS FOR HUMAN ACTIVITY RECOGNITION

We construct four settings representing different degrees of generalization. (1) Cross-person generalization: this setting uses the DSADS, USC-HAD, and PAMAP datasets to construct three benchmarks. Within each dataset, we randomly split the data into four groups and then use three groups as training data to learn a generalized model for the remaining group. (2) Cross-position generalization: this setting uses the DSADS dataset, and the data from each position denotes a different domain. Each sample contains three sensors with nine dimensions. We treat one position as the test domain and the others as training domains. (3) Cross-dataset generalization: this setting uses all four datasets, each corresponding to a different domain. Six common classes are selected, two sensors belonging to the same position are selected from each dataset, and the data is down-sampled to the same dimension. (4) One-Person-To-Another: this setting adopts the DSADS, USC-HAD, and PAMAP datasets. In each dataset, we randomly select four pairs of persons, where one is for training and the other for testing.

C.5 DATA PREPROCESSING

acc = ( Σ_{(x,y)∈D^te} I_y(y*) ) / #|Y^te|,   y* = arg max h(x).

I_y(y*) is an indicator function: if y = y*, it equals 1; otherwise it equals 0. #|·| represents the number of elements in a set, and h is the model to learn. Please note that X^te has a different distribution from X^tr for EMG and HAR, and that each x has been preprocessed so that each sample is a segment.
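The metric above can be sketched in a few lines: take y* = argmax h(x) for each test sample and average the indicator I_y(y*) over the test set (the array layout is an assumption for illustration):

```python
import numpy as np

def average_accuracy(logits, labels):
    """logits: (n, n_classes) model outputs h(x); labels: (n,) ground truth.

    Returns the mean of the indicator I_y(y*), i.e. the fraction of
    samples where the predicted class y* equals the true label y.
    """
    y_star = logits.argmax(axis=1)
    return (y_star == labels).mean()

logits = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.3, 0.7]])
labels = np.array([0, 1, 1, 1])
acc = average_accuracy(logits, labels)  # 3 of 4 correct -> 0.75
```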

Figure 9: Visualization of the t-SNE embeddings for classification on EMG. Different colors correspond to different classes while different shapes correspond to different domains.

Figure 12: H-divergence among domains with initial splits and our splits on EMG.

Results on EMG dataset. "Target" 0 ∼ 4 denotes unseen test distribution that is only for testing.

Accuracy on cross-person generalization. "Target" 0 ∼ 4 denotes the unseen test set.

Classification accuracy on cross-position, cross-dataset, and one-to-another generalization.

Results indicate that larger models tend to achieve better OOD generalization performance. Our method outperforms the others with all backbones, showing that DIVERSIFY delivers consistently strong OOD performance across architectures. More results with Transformer are in Appendix F.4; parameter sensitivity is in Appendix F.2.

Mi Zhang and Alexander A. Sawchuk. USC-HAD: a daily activity dataset for ubiquitous activity recognition using wearable sensors. In Proceedings of the 2012 ACM Conference on Ubiquitous Computing, pp. 1036-1043, 2012.

Wenyu Zhang, Mohamed Ragab, and Ramon Sagarna. Robust domain-free domain generalization with class-aware alignment. In ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 2870-2874. IEEE, 2021.

Information on datasets. Electromyography (EMG) is typical time-series data based on bioelectric signals. We use the EMG for gestures Data Set (Lobov et al., 2018), which contains raw EMG data recorded by a MYO Thalmic bracelet. The bracelet is equipped with eight sensors equally spaced around the forearm that simultaneously acquire myographic signals. Data from 36 subjects were collected while they performed a series of static hand gestures, with 40,000-50,000 recordings in each column. The dataset contains 7 classes, and we select 6 common classes for our experiments. We randomly divide the 36 subjects into four domains (i.e., 0, 1, 2, 3) without overlap, and each domain contains data from 9 persons.

Initial domain splits.

The kernel size of each benchmark.

ACKNOWLEDGEMENT

This work is supported by the National Key Research & Development Program of China (No. 2020YFC2007104), the Natural Science Foundation of China (No. 61972383), the Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA28040500), and the Science Research Foundation of the Joint Laboratory Project on Digital Ophthalmology and Vision Science (No. SZYK202201).


Published as a conference paper at ICLR 2023

Table 8 reports the results. We also try a Transformer (Vaswani et al., 2017) as the backbone for comparison. As shown in (Zhang et al., 2022a), Transformer often has better generalization ability than CNNs, which implies that improving over it is more difficult. From Table 8, we can see that every method with a Transformer backbone achieves a remarkable improvement on EMG. Compared to ERM, DANN brings almost no improvement, but ours still improves further and achieves the best performance. To further validate the advantage of our method, we perform experiments on a more difficult task, i.e., the first task of cross-dataset, where the distribution gaps are larger. As shown in Figure F.4, our method still achieves the best performance in this more difficult situation, while DANN even performs worse than ERM, which demonstrates the importance of more accurate sub-domain labels. Overall, for all architectures, our method achieves the best performance.

We also provide some analysis of time complexity and convergence. Since we only optimize the feature extractor in Step 2, our method does not cost too much time, and the results in Table 9 prove this empirically. The convergence results are shown in Figure F.5: our method converges. Although there are small fluctuations, such fluctuations exist widely in all domain generalization methods due to the different distributions of different samples.

