CONTRASTIVE LEARNING FOR UNSUPERVISED DOMAIN ADAPTATION OF TIME SERIES

Abstract

Unsupervised domain adaptation (UDA) aims at learning a machine learning model using a labeled source domain that performs well on a similar yet different, unlabeled target domain. UDA is important in many applications such as medicine, where it is used to adapt risk scores across different patient cohorts. In this paper, we develop a novel framework for UDA of time series data, called CLUDA. Specifically, we propose a contrastive learning framework to learn contextual representations in multivariate time series, so that these preserve label information for the prediction task. In our framework, we further capture the variation in the contextual representations between source and target domain via a custom nearest-neighbor contrastive learning. To the best of our knowledge, ours is the first framework to learn domain-invariant, contextual representation for UDA of time series data. We evaluate our framework using a wide range of time series datasets to demonstrate its effectiveness and show that it achieves state-of-the-art performance for time series UDA.

1. INTRODUCTION

Many real-world applications of machine learning are characterized by differences between the domains at training and deployment time (Hendrycks & Dietterich, 2019; Koh et al., 2021). Therefore, effective methods are needed that learn domain-invariant representations across domains. For example, it is well known that medical settings suffer from substantial domain shifts due to differences in patient cohorts, medical routines, reporting practices, etc. (Futoma et al., 2020; Zech et al., 2018). Hence, a machine learning model trained for one patient cohort may not generalize to other patient cohorts. This highlights the need for effective domain adaptation of time series. Unsupervised domain adaptation (UDA) aims to learn a machine learning model using a labeled source domain that performs well on a similar yet different, unlabeled target domain (Ganin et al., 2016; Long et al., 2018). So far, many methods for UDA have been proposed for computer vision (Chen et al., 2020a; Ganin et al., 2016; Huang et al., 2021; Kang et al., 2019; Long et al., 2018; Pei et al., 2018; Shu et al., 2018; Singh, 2021; Sun & Saenko, 2016; Tang et al., 2021; Tzeng et al., 2014; 2017; Xu et al., 2020; Zhu et al., 2020). These works can, in principle, be applied to time series (with some adjustment of their feature extractor); however, they are not explicitly designed to fully leverage time series properties. In contrast, comparatively few works have focused on UDA of time series. Here, previous works utilize a tailored feature extractor to capture the temporal dynamics of multivariate time series, typically through recurrent neural networks (RNNs) (Purushotham et al., 2017), long short-term memory (LSTM) networks (Cai et al., 2021), and convolutional neural networks (Liu & Xue, 2021; Wilson et al., 2020; 2021).
Some of these works minimize the domain discrepancy of learned features via adversarial-based methods (Purushotham et al., 2017; Wilson et al., 2020; 2021; Jin et al., 2022) or restrictions through metric-based methods (Cai et al., 2021; Liu & Xue, 2021). Another research stream has developed time series methods for transfer learning from the source domain to the target domain (Eldele et al., 2021; Franceschi et al., 2019; Kiyasseh et al., 2021; Tonekaboni et al., 2021; Yang & Hong, 2022; Yèche et al., 2021; Yue et al., 2022). These methods pre-train a neural network model via contrastive learning to capture the contextual representation of time series from an unlabeled source domain. However, these methods operate on a labeled target domain, which is different from UDA. To the best of our knowledge, there is no method for UDA of time series that captures and aligns the contextual representation across source and target domains. In this paper, we propose a novel framework for unsupervised domain adaptation of time series data based on contrastive learning, called CLUDA. Different from existing works, our CLUDA framework aims at capturing the contextual representation in multivariate time series as a form of high-level features. To accomplish this, we incorporate the following components: (1) We minimize the domain discrepancy between source and target domains through adversarial training. (2) We capture the contextual representation by generating positive pairs via a set of semantic-preserving augmentations and then learning their embeddings. For this, we make use of contrastive learning (CL). (3) We further align the contextual representation across source and target domains via a custom nearest-neighborhood contrastive learning. We evaluate our method using a wide range of time series datasets. (1) We conduct extensive experiments using the established benchmark datasets WISDM (Kwapisz et al., 2011), HAR (Anguita et al., 2013), and HHAR (Stisen et al., 2015).
Thereby, we show that our CLUDA improves accuracy on target domains by a considerable margin. (2) We further conduct experiments on two large-scale, real-world medical datasets, namely MIMIC-IV (Johnson et al., 2020) and AmsterdamUMCdb (Thoral et al., 2021). We demonstrate the effectiveness of our framework for our medical setting and confirm its superior performance over state-of-the-art baselines. In fact, medical settings are known to suffer from substantial domain shifts across health institutions (Futoma et al., 2020; Nestor et al., 2019; Zech et al., 2018). This highlights the relevance and practical need for adapting machine learning models between training and deployment domains. Contributions:
1. We propose a novel contrastive learning framework (CLUDA) for unsupervised domain adaptation of time series. To the best of our knowledge, ours is the first UDA framework that learns a contextual representation of time series to preserve label information.
2. We capture domain-invariant, contextual representations in CLUDA through a custom approach combining nearest-neighborhood contrastive learning and adversarial learning to align them across domains.
3. We demonstrate that our CLUDA achieves state-of-the-art performance. Furthermore, we show the practical value of our framework using large-scale, real-world medical data from intensive care units.

2. RELATED WORK

Contrastive learning: Contrastive learning aims to learn representations with self-supervision such that similar samples are embedded close to each other (positive pairs) while dissimilar samples are pushed away (negative pairs). Such representations have been shown to capture the semantic information of the samples by maximizing a lower bound on the mutual information between two augmented views (Bachman et al., 2019; Tian et al., 2020a; b). Several methods for contrastive learning have been developed so far (Oord et al., 2018; Chen et al., 2020b; Dwibedi et al., 2021; He et al., 2020), several of which are tailored to unsupervised representation learning of time series (Franceschi et al., 2019; Yèche et al., 2021; Yue et al., 2022; Tonekaboni et al., 2021; Kiyasseh et al., 2021; Eldele et al., 2021; Yang & Hong, 2022; Zhang et al., 2022). A detailed review is in Appendix A. Unsupervised domain adaptation: Unsupervised domain adaptation leverages a labeled source domain to predict the labels of a different, unlabeled target domain (Ganin et al., 2016). To achieve this, UDA methods typically aim to minimize the domain discrepancy and thereby decrease the upper bound on the target error (Ben-David et al., 2010). To minimize the domain discrepancy, existing UDA methods can be loosely grouped into three categories: (1) Adversarial-based methods reduce the domain discrepancy via domain discriminator networks, which enforce the feature extractor to learn domain-invariant feature representations. Examples are DANN (Ganin et al., 2016), CDAN (Long et al., 2018), ADDA (Tzeng et al., 2017), MADA (Pei et al., 2018), DIRT-T (Shu et al., 2018), and DM-ADA (Xu et al., 2020). (2) Contrastive methods reduce the domain discrepancy through the minimization of a contrastive loss, which aims to bring source and target embeddings of the same class together.
Here, the labels (i.e., class information) of the target samples are unknown; hence, these methods rely on pseudo-labels of the target samples generated by a clustering algorithm, which are noisy estimates of the actual labels. Examples are CAN (Kang et al., 2019), CLDA (Singh, 2021), GRCL (Tang et al., 2021), and HCL (Huang et al., 2021). (3) Metric-based methods reduce the domain discrepancy by enforcing restrictions through a certain distance metric (e.g., via regularization). Examples are DDC (Tzeng et al., 2014), Deep CORAL (Sun & Saenko, 2016), DSAN (Zhu et al., 2020), HoMM (Chen et al., 2020a), and MMDA (Rahman et al., 2020). However, previous works on UDA typically come from computer vision. There also exist works on UDA for videos (e.g., Sahoo et al., 2021); see Appendix A for details. Even though these methods can be applied to time series through tailored feature extractors, they do not fully leverage time series properties. In contrast, comparatively few works have been proposed for UDA of time series. Unsupervised domain adaptation for time series: A few methods have been tailored to unsupervised domain adaptation for time series data. Variational recurrent adversarial deep domain adaptation (VRADA) (Purushotham et al., 2017) was the first UDA method for multivariate time series that uses adversarial learning to reduce domain discrepancy. In VRADA, the feature extractor is a variational recurrent neural network (Chung et al., 2015), and VRADA then trains the classifier and the domain discriminator (adversarially) on the last latent variable of its variational recurrent neural network. Convolutional deep domain adaptation for time series (CoDATS) (Wilson et al., 2020) builds upon the same adversarial training as VRADA, but uses a convolutional neural network for the feature extractor instead. Time series sparse associative structure alignment (TS-SASA) (Cai et al., 2021) is a metric-based method.
Here, intra-variable and inter-variable attention mechanisms are aligned between the domains via the minimization of maximum mean discrepancy (MMD). Adversarial spectral kernel matching (AdvSKM) (Liu & Xue, 2021) is another metric-based method that aligns the two domains via MMD. Specifically, it introduces a spectral kernel mapping, whose output is used to minimize the MMD between the domains. Across all of the aforementioned methods, the aim is to align the features across source and target domains. Research gap: For UDA of time series, existing works merely align the features across source and target domains. Even when the source and target feature distributions overlap, this can mix source and target samples of different classes. In contrast, we propose to align the contextual representation, which preserves the label information. This facilitates a better alignment across domains for each class, leading to a better generalization over the unlabeled target domain. To achieve this, we develop a novel framework called CLUDA based on contrastive learning.

3. PROBLEM DEFINITION

We consider a classification task for which we aim to perform UDA of time series. Specifically, we have two distributions over time series: the source domain D_S and the target domain D_T. In our setup, we have labeled i.i.d. samples from the source domain given by S = {(x_i^s, y_i^s)}_{i=1}^{N_s} ~ D_S, where x_i^s is a sample of the source domain, y_i^s is the label for the given sample, and N_s is the overall number of i.i.d. samples from the source domain. In contrast, we have unlabeled i.i.d. samples from the target domain given by T = {x_i^t}_{i=1}^{N_t} ~ D_T, where x_i^t is a sample of the target domain and N_t is the overall number of i.i.d. samples from the target domain. In this paper, we allow for multivariate time series. Hence, each x_i (from either the source or the target domain) is a sample of a multivariate time series denoted by x_i = {x_{it}}_{t=1}^{T} ∈ R^{M×T}, where T is the number of time steps and x_{it} ∈ R^M holds the M observations at the corresponding time step. Our aim is to build a classifier that generalizes well over target samples T by leveraging the labeled source samples S. Importantly, labels for the target domain are not available during training. Instead, we later use the labeled target samples T_test = {(x_i^t, y_i^t)}_{i=1}^{N_test} ~ D_T only for evaluation. The above setting is directly relevant for practice (Futoma et al., 2020; Hendrycks & Dietterich, 2019; Koh et al., 2021; Zech et al., 2018). For example, medical time series from different health institutions differ in terms of patient cohorts, medical routines, reporting practices, etc., and are therefore subject to substantial domain shifts. As such, data from training and data from deployment should be considered different domains. Hence, in order to apply machine learning for risk scoring or other medical use cases, it is often helpful to adapt a machine learning model trained on one domain S to another domain T before deployment.
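To make the setup concrete, the following is a minimal NumPy sketch of the data layout. All sizes (N_s, N_t, M, T, number of classes) are hypothetical values chosen only for illustration, not taken from any of the datasets used in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
N_s, N_t = 100, 80     # number of source / target samples
M, T = 6, 128          # observations per time step and number of time steps
num_classes = 6

# Labeled i.i.d. source samples S = {(x_i^s, y_i^s)}.
x_s = rng.standard_normal((N_s, M, T))
y_s = rng.integers(0, num_classes, size=N_s)

# Unlabeled i.i.d. target samples T = {x_i^t}: no labels during training.
x_t = rng.standard_normal((N_t, M, T))
```

Each sample is an M×T array, i.e., a multivariate time series with M channels observed over T time steps.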

4. PROPOSED CLUDA FRAMEWORK

In this section, we describe the components of our framework to learn domain-invariant, contextual representation of time series. We start with an overview of our CLUDA framework, and then describe how we (1) perform domain adversarial training, (2) capture the contextual representation, and (3) align contextual representation across domains.

4.1. ARCHITECTURE

The neural architecture of our CLUDA for unsupervised domain adaptation of time series is shown in Fig. 1. In brief, our architecture is the following. (1) The feature extractor network F(·) takes the (augmented) time series x^s and x^t from both domains and creates the corresponding embeddings z^s and z^t, respectively. The classifier network C(·) is trained to predict the label y^s of time series from the source domain using the embeddings z^s. The discriminator network D(·) is trained to distinguish source embeddings z^s from target embeddings z^t. For such training, we introduce domain labels d = 0 for source instances and d = 1 for target instances. The details of how the classifier and the discriminator are trained are explained in Sec. 4.2. Note that we later explicitly compare our CLUDA against this base architecture based on "standard" domain adversarial learning. We refer to it as "w/o CL and w/o NNCL". (2) Our CLUDA further captures the contextual representation of the time series in the embeddings z^s and z^t. For this, our framework leverages the momentum-updated feature extractor network F̃(·) and the projector network Q(·) via contrastive learning for each domain. The details are described in Sec. 4.3. (3) Finally, CLUDA aligns the contextual representation across domains in the embedding space of z^s and z^t via nearest-neighbor CL. This is explained in Sec. 4.4. The overall training objective of CLUDA is given in Sec. 4.5.
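The data flow above can be sketched as follows. The linear maps below are toy stand-ins (the paper's actual feature extractor is a convolutional network; see Appendix C), used only to make the shapes and roles of F(·), C(·), and D(·) concrete:

```python
import numpy as np

rng = np.random.default_rng(1)
M, T, H, num_classes = 6, 128, 32, 6   # hypothetical sizes

# Toy stand-ins for the three networks.
W_F = rng.standard_normal((M, H)) * 0.1       # feature extractor F(.)
W_C = rng.standard_normal((H, num_classes))   # classifier C(.)
W_D = rng.standard_normal((H, 1))             # domain discriminator D(.)

def F(x):
    # x: (N, M, T) -> embeddings z: (N, H); mean-pool over time, then a linear map.
    return x.mean(axis=2) @ W_F

def C(z):
    # Class logits, used only on source embeddings (labels exist only there).
    return z @ W_C

def D(z):
    # Single domain logit: d = 0 (source) vs. d = 1 (target).
    return z @ W_D

x_s = rng.standard_normal((4, M, T))
x_t = rng.standard_normal((4, M, T))
z_s, z_t = F(x_s), F(x_t)
```

The discriminator sees embeddings from both domains, whereas the classifier is trained on source embeddings only.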

4.2. ADVERSARIAL TRAINING FOR UNSUPERVISED DOMAIN ADAPTATION

For the adversarial training, we minimize a combination of two losses: (1) Our prediction loss L_c trains the feature extractor F(·) and the classifier C(·). We train both jointly in order to correctly predict the labels from the source domain. For this, we define the prediction loss

L_c = (1/N_s) Σ_{i=1}^{N_s} L_pred(C(F(x_i^s)), y_i^s),

where L_pred(·, ·) is the cross-entropy loss. (2) Our domain classification loss L_disc is used to learn domain-invariant feature representations. Here, we use adversarial learning (Ganin et al., 2016). To this end, the domain discriminator D(·) is trained to minimize the domain classification loss, whereas the feature extractor F(·) is simultaneously trained to maximize it. This is achieved by the gradient reversal layer R(·) between F(·) and D(·), defined by R(x) = x and dR/dx = −I. Hence, we yield the domain classification loss

L_disc = (1/N_s) Σ_{i=1}^{N_s} L_pred(D(R(F(x_i^s))), d_i^s) + (1/N_t) Σ_{i=1}^{N_t} L_pred(D(R(F(x_i^t))), d_i^t).
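The gradient reversal layer can be illustrated in a few lines. This is a minimal NumPy sketch of its two passes only; a real implementation would hook into an autograd framework (e.g., as a custom autograd function) so that the sign flip happens automatically during backpropagation:

```python
import numpy as np

# Gradient reversal layer R(.): identity in the forward pass, sign-flipped
# gradient in the backward pass (R(x) = x, dR/dx = -I). Placed between F(.)
# and D(.), a single minimization step then trains D(.) to classify domains
# while pushing F(.) toward domain-invariant features.
def grl_forward(x):
    return x

def grl_backward(grad_output):
    return -grad_output

# Toy check of the two passes on a small feature vector.
x = np.array([1.0, -2.0, 3.0])
upstream_grad = np.array([0.5, -0.1, 0.2])
y = grl_forward(x)          # unchanged features flow to the discriminator
g = grl_backward(upstream_grad)  # reversed gradient flows back to F(.)
```

Because the forward pass is the identity, the discriminator is unaffected; only the gradients reaching the feature extractor are negated.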

4.3. CAPTURING CONTEXTUAL REPRESENTATIONS

In our CLUDA, we capture a contextual representation of the time series in the embeddings z^s and z^t, and then align the contextual representations of the two domains for unsupervised domain adaptation. With this approach, we improve upon earlier works in two ways: (1) We encourage our feature extractor F(·) to learn label-preserving information captured by the context. This observation was made earlier for unsupervised representation learning, yet outside of our time series setting (Bachman et al., 2019; Chen et al., 2020b; c; Ge et al., 2021; Grill et al., 2020; Tian et al., 2020a; b). (2) We further hypothesize that the discrepancy between the contextual representations of the two domains is smaller than the discrepancy between their feature spaces; therefore, the domain alignment task becomes easier. To capture the contextual representations of time series for each domain, we leverage contrastive learning. CL is widely used in unsupervised representation learning for downstream tasks in machine learning (Chen et al., 2020b; c; He et al., 2020; Mohsenvand et al., 2020; Shen et al., 2022; Yèche et al., 2021; Zhang et al., 2022). In plain words, CL aims to learn similar representations for two augmented views (positive pair) of the same sample in contrast to the views from other samples (negative pairs). This maximizes the mutual information between the two views and, therefore, captures the contextual representation (Bachman et al., 2019; Tian et al., 2020a; b). In our framework (see Fig. 1), we leverage contrastive learning in the form of momentum contrast (MoCo) (He et al., 2020) in order to capture the contextual representations from each domain. Accordingly, we apply semantic-preserving augmentations (Cheng et al., 2020; Kiyasseh et al., 2021; Yèche et al., 2021) to each sample of multivariate time series twice.
Specifically, in our framework, we sequentially apply the following functions with random instantiations: history crop, history cutout, channel dropout, and Gaussian noise (see Appendix C for details). After augmentation, we have two views of each sample, called query x_q and key x_k. These two views are then processed by the feature extractors to get their embeddings z_q = F(x_q) and z_k = F̃(x_k). Here, F̃(·) is the momentum-updated feature extractor of MoCo: the gradients are not backpropagated through F̃(·); instead, its weights θ_F̃ are updated via the momentum rule

θ_F̃ ← m · θ_F̃ + (1 − m) · θ_F,

where m ∈ [0, 1) is the momentum coefficient. The objective of MoCo-based contrastive learning is to project z_q via a projector network Q(·) and bring the projection Q(z_q) closer to its positive sample z_k, as opposed to the negative samples stored in a queue {z_kj}_{j=1}^{J}, which is a collection of z_k's from earlier batches. This generates a large set of negative pairs (queue size J ≫ batch size N), which, therefore, facilitates better contextual representations (Bachman et al., 2019; Tian et al., 2020a; b). After each training step, the batch of z_k's is stored in the queue of size J. As a result, for each domain, we have the contrastive loss

L_CL = − (1/N) Σ_{i=1}^{N} log [ exp(Q(z_qi) · z_ki / τ) / ( exp(Q(z_qi) · z_ki / τ) + Σ_{j=1}^{J} exp(Q(z_qi) · z_kj / τ) ) ],

where τ > 0 is the temperature scaling parameter and all embeddings are normalized. Since we have two domains (i.e., source and target), we also have two contrastive loss components, L_CL^s and L_CL^t, and two queues, queue^s and queue^t, respectively.
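A minimal NumPy sketch of the momentum update and the queue-based contrastive loss above. The sizes, the temperature, and the identity projector are illustrative assumptions, not the paper's tuned settings:

```python
import numpy as np

rng = np.random.default_rng(2)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

# Momentum update of the key-encoder weights:
# theta_key <- m * theta_key + (1 - m) * theta_query, with m in [0, 1).
def momentum_update(theta_key, theta_query, m=0.999):
    return m * theta_key + (1 - m) * theta_query

# MoCo-style InfoNCE loss with a queue of negatives.
def moco_loss(z_q, z_k, queue, proj, tau=0.1):
    q = l2_normalize(z_q @ proj)       # projector Q(.) applied to queries
    k = l2_normalize(z_k)
    neg = l2_normalize(queue)
    pos_logit = np.sum(q * k, axis=1) / tau            # (N,)
    neg_logits = q @ neg.T / tau                       # (N, J)
    logits = np.concatenate([pos_logit[:, None], neg_logits], axis=1)
    logits -= logits.max(axis=1, keepdims=True)        # numerical stability
    log_prob_pos = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob_pos.mean()

N, H, J = 8, 16, 64                    # batch, embedding dim, queue size
proj = np.eye(H)                       # identity projector, for the sketch only
z_q = rng.standard_normal((N, H))
z_k = rng.standard_normal((N, H))
queue = rng.standard_normal((J, H))    # z_k's from earlier batches
loss = moco_loss(z_q, z_k, queue, proj)
```

In the framework itself, this loss is computed once per domain (source and target), each with its own queue.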

4.4. ALIGNING THE CONTEXTUAL REPRESENTATION ACROSS DOMAINS

Our CLUDA framework further aligns the contextual representation across the source and target domains. To do so, we build upon ideas from nearest-neighbor contrastive learning (Dwibedi et al., 2021) for unsupervised representation learning, yet outside of our time series setting. To the best of our knowledge, ours is the first nearest-neighbor contrastive learning approach for unsupervised domain adaptation of time series. In our CLUDA framework, nearest-neighbor contrastive learning (NNCL) should facilitate accurate predictions of the classifier C(·) for the target domain. We achieve this by creating positive pairs between domains, whereby we explicitly align the representations across domains. For this, we pair z_qi^t with the nearest neighbor of z_ki^t from the source domain, denoted by NN_s(z_ki^t). We thus introduce our nearest-neighbor contrastive learning loss

L_NNCL = − (1/N_t) Σ_{i=1}^{N_t} log [ exp(z_qi^t · NN_s(z_ki^t) / τ) / Σ_{j=1}^{N_s} exp(z_qi^t · z_qj^s / τ) ],

where NN_s(·) retrieves the nearest neighbor of an embedding from the source queries {z_qi^s}_{i=1}^{N_s}.
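The NNCL loss above can be sketched as follows in NumPy. Retrieving the nearest neighbor by cosine similarity on normalized embeddings is an assumption made for illustration, and all sizes are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)

def l2_normalize(z):
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def nn_from_source(z_t_k, z_s_q):
    # NN_s(.): for each target key embedding, retrieve the most similar
    # source query embedding (cosine similarity on normalized embeddings).
    sims = l2_normalize(z_t_k) @ l2_normalize(z_s_q).T   # (N_t, N_s)
    return z_s_q[np.argmax(sims, axis=1)]

def nncl_loss(z_t_q, z_t_k, z_s_q, tau=0.1):
    z_t_q = l2_normalize(z_t_q)
    z_s_q = l2_normalize(z_s_q)
    nn = l2_normalize(nn_from_source(z_t_k, z_s_q))
    pos = np.sum(z_t_q * nn, axis=1) / tau               # numerator term
    all_logits = z_t_q @ z_s_q.T / tau                   # denominator over source queries
    return -(pos - np.log(np.exp(all_logits).sum(axis=1))).mean()

N_t, N_s, H = 8, 12, 16
loss = nncl_loss(rng.standard_normal((N_t, H)),
                 rng.standard_normal((N_t, H)),
                 rng.standard_normal((N_s, H)))
```

Since the retrieved nearest neighbor is itself one of the source queries, the positive term is always contained in the denominator, so the loss is non-negative.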

4.5. TRAINING

Overall loss: Overall, our CLUDA framework minimizes

L = L_c + λ_disc · L_disc + λ_CL · (L_CL^s + L_CL^t) + λ_NNCL · L_NNCL,

where the hyperparameters λ_disc, λ_CL, and λ_NNCL control the contribution of each component.
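As a one-line sketch of the overall objective (the λ values below are placeholders for illustration, not the tuned hyperparameters from Appendix C):

```python
# Weighted combination of the four loss components of CLUDA.
# The lambda defaults are hypothetical placeholders, not tuned values.
def total_loss(L_c, L_disc, L_cl_s, L_cl_t, L_nncl,
               lambda_disc=0.5, lambda_cl=0.1, lambda_nncl=0.05):
    return (L_c
            + lambda_disc * L_disc
            + lambda_cl * (L_cl_s + L_cl_t)
            + lambda_nncl * L_nncl)
```

Setting a λ to zero disables the corresponding component, which is exactly how the ablations (e.g., "w/o CL and w/o NNCL") can be realized.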

Implementation:

Appendix C provides all details of our architecture search for each component.

5. EXPERIMENTAL SETUP

Our evaluation is two-fold: (1) We conduct extensive experiments using established benchmark datasets, namely WISDM (Kwapisz et al., 2011), HAR (Anguita et al., 2013), and HHAR (Stisen et al., 2015). Here, the sensor measurements of each participant are treated as a separate domain, and we randomly sample 10 source-target domain pairs for evaluation. This setup has been used extensively in earlier works on UDA of time series (Wilson et al., 2020; 2021; Cai et al., 2021; Liu & Xue, 2021). Thereby, we show that our CLUDA improves accuracy on target domains by a considerable margin. (2) We show the applicability of our framework in a real-world setting with medical datasets: MIMIC-IV (Johnson et al., 2020) and AmsterdamUMCdb (Thoral et al., 2021). These are the largest publicly available intensive care unit datasets, and they have different origins (Boston, United States vs. Amsterdam, Netherlands). Therefore, they reflect patients with different characteristics, medical procedures, etc. Here, we treat each age group as a separate domain (Purushotham et al., 2017; Cai et al., 2021). Further details on the datasets and task specifications are in Appendix B. Baselines: (1) We report the performance of a model without UDA (w/o UDA) to show the overall contribution of UDA methods. For this, we only use the feature extractor F(·) and the classifier C(·) with the same architecture as in our CLUDA. This model is trained only on the source domain. (2) We implement the following state-of-the-art baselines for UDA of time series data: VRADA (Purushotham et al., 2017), CoDATS (Wilson et al., 2020), TS-SASA (Cai et al., 2021), and AdvSKM (Liu & Xue, 2021). In our results below, we omit TS-SASA, as it was repeatedly no better than random. (3) We additionally implement CAN (Kang et al., 2019), CDAN (Long et al., 2018), DDC (Tzeng et al., 2014), DeepCORAL (Sun & Saenko, 2016), DSAN (Zhu et al., 2020), HoMM (Chen et al., 2020a), and MMDA (Rahman et al., 2020).
These models were originally developed for computer vision, but we tailored their feature extractors to time series (see Appendix C).

6.1. PREDICTION PERFORMANCE ON BENCHMARK DATASETS

Figure 2a shows the average accuracy of each method for 10 source-target domain pairs on the WISDM, HAR, and HHAR datasets. On the WISDM dataset, our CLUDA outperforms the best baseline accuracy of CoDATS by 12.7 % (0.754 vs. 0.669). On the HAR dataset, our CLUDA outperforms the best baseline accuracy of CDAN by 18.9 % (0.944 vs. 0.794). On the HHAR dataset, our CLUDA outperforms the best baseline accuracy of CDAN by 21.8 % (0.759 vs. 0.623). Overall, CLUDA consistently improves upon the best UDA baseline performance by a large margin. In Appendix D, we provide the full list of UDA results for each source-target pair and additionally provide Macro-F1 scores, which confirm our findings from above. Insights: We further visualize the embeddings in Fig. 3 to study the domain discrepancy and how our CLUDA aligns the representation of time series. (a) The embeddings of w/o UDA show that there is a significant domain shift between source and target. This can be observed by the two clusters of each class (i.e., one per domain). (b) CDAN, as the best baseline, reduces the domain shift by aligning the features of source and target for some classes, yet it mixes the different classes of the different domains (e.g., the blue class of the source and the green class of the target overlap). (c) By examining the embeddings from our CLUDA, we confirm its effectiveness: (1) Our CLUDA pulls together the source (target) classes for the source (target) domain (due to the CL). (2) Our CLUDA further pulls both source and target domains together for each class (due to the alignment). We make the following observation when we consider the embedding visualizations of all baselines (see Appendix E). Overall, all baselines show certain improvements over w/o UDA in aligning the embedding distributions of the source and target domains (i.e., overlapping point clouds of source and target domains). Yet, when the class-specific embedding distributions are considered, source and target samples remain fairly separated.
Our CLUDA remedies this issue by actively pulling source and target samples of the same class together via its novel components. Ablation study: Both our CL and NNCL components yield consistent performance gains, as they capture and align the contextual representation of time series. The discriminator also helps achieve consistent performance gains, albeit of smaller magnitude. Finally, our full CLUDA works best in all experiments, thereby justifying our chosen architecture. Appendix F shows our detailed ablation study. We further conduct an ablation study to understand the importance of the selected CL method. For this, we implement two new variants of our CLUDA: (1) CLUDA with SimCLR (Chen et al., 2020b) and (2) CLUDA with NCL (Yèche et al., 2021). Both variants perform worse, showing the importance of choosing a CL method tailored to time series UDA. Details are in Appendix K.

6.2. PREDICTION PERFORMANCE ON MEDICAL DATASETS

We further demonstrate that our CLUDA achieves state-of-the-art UDA performance on the medical datasets. For this, Table 1 lists 12 UDA scenarios created from the MIMIC dataset. In 9 out of 12 UDA scenarios, CLUDA yields the best mortality prediction in the target domain, consistently outperforming the UDA baselines. When averaging over all scenarios, our CLUDA improves over the performance of the best baseline (AdvSKM) by 2.2 % (AUROC 0.773 vs. 0.756). Appendix H reports the ablation study for this experimental setup with similar findings as above. Appendix G repeats the experiments for another medical dataset: AmsterdamUMCdb (Thoral et al., 2021). Again, our CLUDA achieves state-of-the-art performance. We now provide a case study showing the application of our framework in medical practice. Here, we evaluate the domain adaptation between two health institutions. We intentionally chose this setting as medical applications are known to suffer from substantial domain shifts (e.g., due to different patient cohorts, medical routines, reporting practices, etc., across health institutions) (Futoma et al., 2020; Nestor et al., 2019; Zech et al., 2018). We treat MIMIC and AmsterdamUMCdb (AUMC) as separate domains and then predict health outcomes analogous to earlier works (Cai et al., 2021; Che et al., 2018; Ge et al., 2018; Ozyurt et al., 2021; Purushotham et al., 2017): decompensation, mortality, and length of stay (see Table 2). All details regarding the medical datasets and task definitions are given in Appendix B. Fig. 4 shows the performance across all three UDA tasks and in both directions (i.e., MIMIC → AUMC and AUMC → MIMIC). For better comparability in practice, we focus on the "performance gap": we interpret the performance in the source → source setting as a loose upper bound. We then report how much of the performance gap between no domain adaptation (w/o UDA) and this loose upper bound is filled by each method.
Importantly, our CLUDA consistently outperforms the state-of-the-art baselines. For instance, in decompensation prediction for AUMC, our CLUDA (AUROC 0.791) fills 47.6 % of the performance gap between no domain adaptation (AUROC 0.771) and loose upper bound from the source → source setting (AUROC 0.813). In contrast, the best baseline model of this task (HoMM) can only fill 16.7 % (AUROC 0.778). Altogether, this demonstrates the effectiveness of our proposed framework. Appendix I reports the detailed results for different performance metrics. Appendix J provides an ablation study. Both support our above findings.

7. DISCUSSION

Our CLUDA framework shows superior performance for UDA of time series when compared to existing works. Earlier works introduced several techniques for aligning source and target domains, mainly via adversarial training or metric-based losses. Even though they facilitate matching the source and target distributions (i.e., overlapping point clouds of the two domains), they do not explicitly facilitate matching the class-specific distributions across domains. To address this, our CLUDA builds upon two strategies, namely capturing and aligning contextual representations. (1) CLUDA learns class-specific representations for both domains from the feature extractor. This is achieved by CL, which captures label-preserving information from the context and thereby enables adversarial training to align the representations of each class across domains. Yet, the decision boundary of the classifier can still miss some of the target domain samples, since the classifier is prone to overfitting to the source domain in a high-dimensional representation space. To remedy this, (2) CLUDA further aligns the individual samples across domains. This is achieved by our NNCL, which brings each target sample closer to its most similar source-domain counterpart. Therefore, during evaluation, the classifier generalizes well to target representations that are similar to the source representations seen at training time.

Conclusion:

In this paper, we propose a novel framework for UDA of time series based on contrastive learning, called CLUDA. To the best of our knowledge, CLUDA is the first approach that learns domain-invariant, contextual representations of multivariate time series for UDA. Further, CLUDA achieves state-of-the-art performance for UDA of time series. Importantly, our two novel components, i.e., our custom CL and our NNCL, yield clear performance improvements. Finally, we expect our framework to be of direct practical value for medical applications where risk scores should be transferred across populations or institutions.

A RELATED WORK

Contrastive learning: Several methods for contrastive learning have been developed so far. For example, contrastive predictive coding (CPC) (Oord et al., 2018) predicts the next latent variable in contrast to negative samples from its proposal distribution. SimCLR stands for simple framework for CL of visual representations (Chen et al., 2020b). It maximizes the agreement between the embeddings of two augmented views of the same sample and treats all other samples in the same batch as negative samples. Nearest-neighbor CL (NNCL) (Dwibedi et al., 2021) creates positive pairs from other samples in the dataset. For this, it takes the embedding of the first augmented view and finds its nearest neighbor in a support set. Momentum contrast (MoCo) (He et al., 2020) refers to the embeddings of the two augmented views as query and key, and then constructs positive pairs as follows: key embeddings are generated by a momentum encoder and stored in a queue (whose size is larger than the batch size), while the stored key embeddings are further used to construct negative pairs for the other samples. Thereby, MoCo generates more negative pairs than the batch size would allow in SimCLR, which is often more efficient in practice. Contrastive learning for time series: Contrastive learning has been used to learn contextual representations of time series in unsupervised settings. As a result, several methods emerged: Scalable representation learning (SRL) (Franceschi et al., 2019), neighborhood contrastive learning (NCL) (Yèche et al., 2021), TS2Vec (Yue et al., 2022), and temporal neighborhood coding (TNC) (Tonekaboni et al., 2021) treat neighboring windows of the time series as positive pairs and use other windows to construct negative pairs. For this, SRL, NCL, and TS2Vec minimize a triplet loss, a contrastive loss, and a hierarchical contrastive loss, respectively, while TNC trains a discriminator network to predict neighborhood information.
There are also more specialized methods. For example, contrastive learning of cardiac signals (CLOCS) (Kiyasseh et al., 2021) leverages spatial invariance and constructs positive pairs from measurements of different sensors of the same subject. Temporal and contextual contrasting (TS-TCC) (Eldele et al., 2021) is a variant of CPC and maximizes the agreement between strong and weak augmentations of the same sample in an autoregressive model. Bilinear temporal-spectral fusion (BTSF) (Yang & Hong, 2022) constructs positive pairs via a dropout layer applied to the same sample twice and minimizes a triplet loss over temporal and spectral features. Time-frequency consistency (TF-C) (Zhang et al., 2022) maximizes the agreement between time and frequency embeddings of the same sample.

Unsupervised domain adaptation for videos: Several works have been tailored to unsupervised domain adaptation in the video domain. Similar to UDA in other domains, some works leverage adversarial training for the alignment of source and target domains. Specifically, the temporal attentive adversarial adaptation network (TA³N) (Chen et al., 2019) assigns different weights to the features of source and target during the domain alignment process, where the weights are determined by the entropy of the domain classifier. Adversarial bipartite graph (ABG) (Luo et al., 2020)

B DATASET DETAILS B.1 BENCHMARK DATASETS

We select three sensor datasets that are most commonly used in earlier works. In each dataset, participants perform various actions while wearing smartphones and/or smartwatches. Based on the sensor measurements, the task is to predict which action the participant is performing. Table 3 provides summary statistics for all datasets. Below, we provide additional information about each dataset.

WISDM. The dataset contains 3-axis accelerometer measurements from 30 participants. The measurements are collected at 20 Hz, and we use non-overlapping segments of 128 time steps to predict the type of activity of a participant. There are 6 types of activities: walking, jogging, sitting, standing, walking upstairs, and walking downstairs. This dataset is particularly challenging due to class imbalance across participants, i.e., some participants did not perform all the activities.

HAR. The dataset contains measurements of a 3-axis accelerometer, a 3-axis gyroscope, and 3-axis body acceleration from 30 participants. The measurements are collected at 50 Hz, and we use non-overlapping segments of 128 time steps to predict the type of activity of a participant. There are 6 types of activities: walking, walking upstairs, walking downstairs, sitting, standing, and lying down.

HHAR. The dataset contains 3-axis accelerometer measurements from 30 participants. The measurements are collected at 50 Hz, and we use non-overlapping segments of 128 time steps to predict the type of activity of a participant. There are 6 types of activities: biking, sitting, standing, walking, walking upstairs, and walking downstairs.

B.2 MEDICAL DATASETS

We use MIMIC-IV (Johnson et al., 2020) and AmsterdamUMCdb (Thoral et al., 2021). Both are de-identified, publicly available datasets from intensive care unit (ICU) stays, where the goal is to predict mortality. To date, MIMIC-IV is the largest public dataset for intensive care units with 49,351 ICU stays; AmsterdamUMCdb contains 19,840 ICU stays. However, both have a different origin (Boston, United States vs. Amsterdam, Netherlands) and thus reflect patients with different characteristics, medical procedures, etc. For the medical datasets, we follow the literature (Purushotham et al., 2017; Cai et al., 2021) and create 4 domains based on patients' age groups: 20-45, 46-65, 66-85, and 85+ years. We then apply UDA for each cross-domain scenario (i.e., from Group 1 → Group 4 to Group 4 → Group 3) to predict mortality. Table 4 shows the summary statistics of both medical datasets MIMIC and AUMC. Table 6 shows the number of patients and the number of samples for each split and each dataset. As a reminder, since we start making predictions at four hours after ICU admission, the same patient yields multiple samples when training/testing the models.

Pre-processing: We split the patients of each dataset into 3 parts for training/validation/testing (ratio: 70/15/15). We used a stratified split based on the mortality label. We proceeded analogously to previous works for pre-processing (Cai et al., 2021; Che et al., 2018; Ge et al., 2018; Harutyunyan et al., 2019; Ozyurt et al., 2021; Purushotham et al., 2017; 2018; Yèche et al., 2021). Each measurement was resampled to an hourly resolution, and missing measurements were filled via forward-filling imputation. We applied standard scaling to each measurement based on the statistics from the training set. The remaining missing measurements were filled with zero, which corresponds to mean imputation after scaling.
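The imputation and scaling steps above can be sketched as follows (a minimal numpy version; the function name and the toy statistics are illustrative, not taken from our implementation):

```python
import numpy as np

def preprocess(x, train_mean, train_std):
    """Forward-fill missing values (NaN) along time, standard-scale with
    training-set statistics, then zero-fill what remains (which equals
    mean imputation after scaling). x has shape (T, D)."""
    x = x.copy()
    T, D = x.shape
    # forward-fill along the time axis
    for t in range(1, T):
        nan = np.isnan(x[t])
        x[t, nan] = x[t - 1, nan]
    # standard scaling based on training-set statistics
    x = (x - train_mean) / train_std
    # measurements never observed so far remain NaN -> 0 (the scaled mean)
    x[np.isnan(x)] = 0.0
    return x
```

Note that forward-filling only uses past values, so no future information leaks into a prediction window.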
We followed best practices in benchmarking data from intensive care units (Harutyunyan et al., 2019; Purushotham et al., 2018). That is, for each of the tasks, we start making the prediction at four hours after ICU admission. In all our experiments, we used a maximum history length of T = 48 hours. Shorter sequences were pre-padded with zeros.

Tasks: We compare the performance of our framework across 3 different standard tasks from the literature (Cai et al., 2021; Che et al., 2018; Ge et al., 2018; Harutyunyan et al., 2019; Ozyurt et al., 2021; Purushotham et al., 2017; 2018). (1) Decompensation prediction refers to predicting whether the patient dies within the next 24 hours. (2) Mortality prediction refers to predicting whether the patient dies during his/her ICU stay. (3) Length of stay prediction refers to predicting the remaining hours of ICU stay for the given patient. This serves as a proxy of the overall health outcome. The distribution of the remaining length of ICU stay has a heavy tail (see Appendix A), which makes it challenging to model as a regression task. Therefore, we follow previous works (Harutyunyan et al., 2019; Purushotham et al., 2018), divide the range of values into 10 buckets, and perform an ordinal multiclass classification. For each task, we performed unsupervised domain adaptation in both directions: MIMIC (source) → AUMC (target) and AUMC (source) → MIMIC (target). Later, we also report the corresponding performance on the test samples from the source dataset (i.e., MIMIC → MIMIC and AUMC → AUMC). This way, we aim to provide insights into the extent to which the different UDA methods trade off performance in the source vs. target domain. It can also be loosely interpreted as an upper bound on the prediction performance.

Performance metrics: We report the following performance metrics. The tasks of predicting (1) decompensation and (2) mortality are binary classification problems.
For these tasks, we compare the area under the receiver operating characteristic curve (AUROC) and the area under the precision-recall curve (AUPRC). Results for AUPRC are in Appendix C due to space limitations. The task of predicting (3) length of stay is an ordinal multiclass classification problem. For this, we report Cohen's linear weighted kappa, which measures the correlation between the predicted and ground-truth classes.
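For reference, Cohen's linear weighted kappa can be computed from the confusion matrix as follows (a self-contained sketch; in practice, an off-the-shelf implementation such as scikit-learn's `cohen_kappa_score` with `weights='linear'` computes the same quantity):

```python
import numpy as np

def linear_weighted_kappa(y_true, y_pred, n_classes):
    """Cohen's kappa with linear disagreement weights w_ij = |i-j| / (K-1)."""
    # observed joint distribution of (true, predicted) classes
    O = np.zeros((n_classes, n_classes))
    for t, p in zip(y_true, y_pred):
        O[t, p] += 1
    O /= O.sum()
    # expected joint distribution under independent marginals
    E = np.outer(O.sum(axis=1), O.sum(axis=0))
    # linear disagreement weights
    i, j = np.indices((n_classes, n_classes))
    w = np.abs(i - j) / (n_classes - 1)
    return 1.0 - (w * O).sum() / (w * E).sum()
```

Perfect agreement yields a kappa of 1, and chance-level agreement yields a kappa of 0.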

SUMMARY STATISTICS FOR "LENGTH OF STAY"

Here, we provide additional summary statistics for the distribution of "length of stay". Figure 5 and Figure 6 show the length of stay distribution of all patients in the MIMIC and AUMC datasets, respectively. Further, Figure 7 and Figure 8 show the remaining length of stay distribution for all samples (i. e., all time windows considered for all patients) in MIMIC and AUMC, respectively. Recall that we divide the values of remaining length of stay into 10 buckets; the corresponding fraction of samples belonging to each bucket is reported in Figure 9 . The buckets are the following: one bucket for less than one day, one bucket each for days 1 through 7, one bucket for the interval between 7 and 14 days, and one bucket for more than 14 days.
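The bucketing described above can be sketched as a small helper (our reading of the scheme; in particular, the exact boundary between the day-7 bucket and the 7-14-day bucket is an assumption):

```python
def los_bucket(remaining_hours):
    """Map remaining length of stay (in hours) to one of 10 ordinal buckets:
    bucket 0: less than one day; buckets 1-7: day 1, ..., day 7;
    bucket 8: between 7 and 14 days; bucket 9: more than 14 days."""
    days = remaining_hours / 24.0
    if days < 1:
        return 0
    if days < 8:          # days 1..7 -> buckets 1..7
        return int(days)  # e.g., 1.5 days -> bucket 1
    if days < 14:
        return 8
    return 9
```

The resulting class index is then the target of the ordinal multiclass classification.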

C TRAINING DETAILS

In this section, we provide details on hyperparameter tuning. Table 7 lists the tuning range of all hyperparameters. To avoid repetition, we list hyperparameters that appear in all methods in the first rows of Table 7. For each dataset (benchmark or medical) and each task (i.e., decompensation, mortality, and length of stay prediction), we performed a grid search for hyperparameter tuning separately for each method. We implemented our CLUDA framework and all the baseline methods in PyTorch. For this, we carefully considered the original implementations and the benchmarking suites (Cai et al., 2021; Chen et al., 2020a; Kang et al., 2019; Liu & Xue, 2021; Long et al., 2018; Purushotham et al., 2017; Ragab et al., 2022; Rahman et al., 2020; Sun & Saenko, 2016; Tzeng et al., 2014; Wilson et al., 2020; Zhu et al., 2020). For training and testing, we used an NVIDIA GeForce GTX 1080 Ti with 11 GB GPU memory. We minimize the loss of each method via the Adam optimizer.

We implement the feature extractor F(·) via a temporal convolutional network (TCN) (Bai et al., 2018). We set its kernel size to 3 and its dilation factor to 2. For the benchmark datasets, we use 6 layers with 16 channels, whereas, for the medical datasets, we use 5 layers with 64 channels. This configuration remains the same across all methods so that differences in prediction performance can be attributed to their novel UDA approach.

We now explain how we decided the search range of the hyperparameters (e.g., learning rate, weight decay). A low learning rate is preferred so that the methods converge to a certain loss only after seeing all samples from each dataset. In particular, the medical dataset MIMIC has roughly 2.4M samples, and it requires ∼1.2K steps to iterate over all these samples with a batch size of 2048. With higher learning rates, the methods converge to a loss even before one iteration over the dataset. We observed that this leads to suboptimal prediction performance (i.e., lower AUROC, AUPRC, and kappa scores).
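As a plausibility check of the feature-extractor configuration, the receptive field of the TCN can be computed as follows (assuming the standard architecture of Bai et al. (2018) with two dilated convolutions per residual block and dilations 2^i; this is an assumption on our part, not a detail from our implementation):

```python
def tcn_receptive_field(n_layers, kernel_size=3, dilation_base=2, convs_per_block=2):
    """Receptive field of a stack of dilated causal convolutions:
    1 + sum over blocks of convs_per_block * (kernel_size - 1) * dilation_base**i."""
    return 1 + sum(convs_per_block * (kernel_size - 1) * dilation_base**i
                   for i in range(n_layers))
```

With kernel size 3 and dilation factor 2, the 6-layer benchmark configuration has a receptive field of 253 steps (covering the 128-step segments), and the 5-layer medical configuration has a receptive field of 125 steps (covering the 48-hour history).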
For the hyperparameters of the contrastive learning framework, we used the configuration of MoCo (He et al., 2020) as a starting point and explored a certain range to improve performance. For the feature extractor F(·) and the classifier C(·), we used the best hyperparameter configuration obtained by w/o UDA as a starting point. For the benchmark datasets, we trained all methods for a maximum of 5,000 training steps with a batch size of 128. For the medical datasets, we trained all methods for a maximum of 30,000 training steps with a batch size of 2048 (except AdvSKM, DDC, DSAN, and MMDA with a batch size of 1024 to fit into GPU memory). For early stopping and hyperparameter selection, we deliberately avoided the use of data from the labeled target domain. In our work, we aim to present performance results as close as possible to the real-world scenario of UDA in, e.g., medical practice. We applied early stopping based on the validation loss, which involves the labeled source domain and the unlabeled target domain (as in the overall loss of CLUDA in Sec. 4.5). For hyperparameter selection, we adopted the following two-way approach. For the hyperparameters of the model architecture (e.g., number of layers, hidden dimensions, etc.) and the training approach (e.g., learning rate, weight decay, etc.), with fixed weights of the loss components (e.g., λ_disc and λ_CL of our CLUDA, or the weight of the MMD loss of DDC), we considered the validation loss as for early stopping. However, for the hyperparameters weighting the loss components, validation losses are not comparable because the loss values are on different scales. (Trivially, one could disable some of the loss components, i.e., by setting the weights to 0, and hence get a lower validation loss. However, this would not result in a better performance on the target domain.) Therefore, to select the model across different loss weights, we choose the one with the highest performance metric (e.
g., accuracy, macro F1, or AUROC, depending on the setting) on the labeled validation set of the source domain as our proxy. This choice is informed by the theory of learning from different domains (Ben-David et al., 2010), in that the loss on the target domain is upper-bounded by the loss on the source domain plus some additional terms. Hence, we aim for a better bound on the target domain by choosing a better performance on the source domain. After model selection, we report the prediction results on the labeled test set from the target domain (and from the source domain in Sec. 6.3), which were never seen during training or model selection. We applied the same procedure to all the baseline methods in our paper to ensure a fair comparison. To report variability in the test performance of each method, we repeated each experiment with 10 different random seeds (i.e., 10 different random initializations) and then show error bars. Here, we compare the runtimes of each method. For this, we use MIMIC-III (the largest dataset in our experiments). We report average runtimes per 100 training steps since the total runtime (i.

AUGMENTATIONS

To capture the contextual representation of medical time series, we apply semantic-preserving augmentations (Cheng et al., 2020; Kiyasseh et al., 2021; Yèche et al., 2021) in our CLUDA framework. We list the augmentations and their optimal hyperparameters (search range in parentheses) below:
History crop: We mask out a minimum of 20% (10%-40%) of the initial time series with 50% (20%-50%) probability.
History cutout: We mask out a random 15% time window (5%-20%) of the time series with 50% (20%-70%) probability.
Channel dropout: We mask out each channel (i.e., type of measurement) independently with 10% (5%-30%) probability.
Gaussian noise: We apply Gaussian noise to each measurement independently with a standard deviation of 0.1 (0.05-0.2).
We apply these augmentations sequentially to each time series twice. As a result, we have two semantic-preserving augmented views of the same time series for our CLUDA framework. Of note, we trained all the baseline methods with and without the augmentations of time series. We always report their best results.

We perform activity prediction as a UDA task based on the benchmark datasets WISDM, HAR, and HHAR. For each dataset, we present the prediction results for 10 randomly selected source-target pairs. For each source-target pair, we repeat the experiments with 10 random initializations and report the mean values. Table 8 shows the accuracy on the target domains and the average accuracy for each dataset. Similarly, Table 9 shows the macro-F1 on the target domains and the average macro-F1 for each dataset. Overall, our CLUDA outperforms the UDA baselines by a large margin, as discussed in the main paper. Specifically, CLUDA achieves the best accuracy in 28 out of 30 UDA scenarios and the best macro-F1 in 27 out of 30 UDA scenarios. Thereby, the results confirm the effectiveness of our method.
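The four augmentations listed in this section can be sketched as follows (a minimal numpy version using the optimal hyperparameters above; the zero masking value and implementation details are our assumptions, not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(x, crop_p=0.5, crop_min=0.2, cutout_p=0.5, cutout_len=0.15,
            channel_drop_p=0.1, noise_std=0.1):
    """One semantic-preserving view of x with shape (T, D).
    All four steps are applied sequentially, as described in the text."""
    x = x.copy()
    T, D = x.shape
    # history crop: mask out (at least) crop_min of the initial time steps
    if rng.random() < crop_p:
        x[: int(crop_min * T)] = 0.0
    # history cutout: mask a random window of length cutout_len * T
    if rng.random() < cutout_p:
        start = rng.integers(0, T - int(cutout_len * T))
        x[start : start + int(cutout_len * T)] = 0.0
    # channel dropout: mask each channel independently
    x[:, rng.random(D) < channel_drop_p] = 0.0
    # Gaussian noise on each measurement
    return x + rng.normal(0.0, noise_std, size=x.shape)

# two augmented views of the same series for the contrastive objective
series = np.ones((48, 4))
view_1, view_2 = augment(series), augment(series)
```

Applying the pipeline twice to the same series yields the two views that form a positive pair in the CL objective.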

E EMBEDDING VISUALIZATION

In this section, we provide the t-SNE visualization (see Fig. 10) for the embeddings of each method on the HHAR dataset. When there is no domain adaptation (see Fig. 10a, w/o UDA), there is a significant domain shift between source and target. As a result, embeddings of one class in the target domain overlap with embeddings of another class in the source domain. Thereby, the classifier learned on the source domain cannot generalize well to the target domain. The UDA baselines mitigate the domain shift; however, they still mix several classes. In contrast, our CLUDA clearly pulls together the embeddings of the same class (even though they come from different domains) and thus facilitates better generalization in the target domain.

F ABLATION STUDY FOR UDA ON BENCHMARK DATASETS

We further conduct an ablation study on the benchmark datasets WISDM, HAR, and HHAR. We use the same variants of CLUDA as in the main paper (see Sec. 6.1): w/o CL and w/o NNCL, w/o CL, w/o NNCL, and w/o Discriminator. As in the main experiments, for each dataset, we present the prediction results for 10 randomly selected source-target pairs. For each source-target pair, we repeat the experiments with 10 random initializations and report the mean values. Table 10 shows the accuracy on the target domains and the average accuracy for each dataset. Similarly, Table 11 shows the macro-F1 on the target domains and the average macro-F1 for each dataset. Overall, our complete CLUDA outperforms all its variants by a significant margin, which confirms our chosen architecture.

G UDA ACROSS VARIOUS AGE GROUPS

Following earlier works (Purushotham et al., 2017; Cai et al., 2021), we conducted extensive experiments to compare the UDA performance of our CLUDA framework across various age groups. We consider the following groups: (1) Group 1: working-age adults (20 to 45 years old); (2) Group 2: old working-age adults (46 to 65 years old); (3) Group 3: elderly (66 to 85 years old); and (4) Group 4: seniors (85+ years old). Therefore, within each dataset (MIMIC and AUMC), we list the results of all combinations of source → target for mortality prediction (i.e., Group 1 → Group 2, Group 1 → Group 3, . . . , Group 4 → Group 3). Results are shown in Table 1 (in the main paper) for MIMIC and in Table 12 for AUMC. We further extend the experiments across datasets. That is, we pick the source domain as one age group from one dataset (e.g., Group 1 of MIMIC) and the target domain as one age group from the other dataset (e.g., Group 3 of AUMC). We again conducted the experiments for all combinations of age groups across the datasets. Results are shown in Table 13 (from MIMIC to AUMC) and Table 14 (from AUMC to MIMIC). We report the mean over 10 random initializations. For better readability, we omitted the standard deviation. Nevertheless, we highlight performance results in bold when the corresponding baselines are outperformed at a significance level. In addition, Tables 17 and 18 list the UDA performance across age groups from MIMIC to AUMC and from AUMC to MIMIC, respectively. In total, our ablation study counts 56 new experiments. We report the mean over 10 random initializations. For better readability, we omitted the standard deviation. Nevertheless, we highlight performance results in bold when the corresponding baselines are outperformed at a significance level. We make the following important findings. First, our CLUDA works best overall on the target domain, thereby justifying our chosen architecture.
Second, the models w/o CL and w/o NNCL perform significantly worse than our complete framework, which justifies our choice of incorporating both components. Third, we compare w/o Discriminator and our CLUDA. As demonstrated by our results, the discriminator is consistently responsible for better UDA on the target domain. Its performance improvement is significant, but the gain is smaller than for the other components. Overall, the ablation study with different variants of our CLUDA confirms the importance of each component in our framework. Specifically, our CLUDA improves the prediction performance over all of its variants in all tasks except one (mortality prediction from MIMIC to AUMC). For this task, it is important to note that the best-performing variant is w/o Discriminator, which has all the novel components of our CLUDA framework.

In our CLUDA framework, we capture the contextual representation in time series data by leveraging contrastive learning. Specifically, we adapt momentum contrast (MoCo) (He et al., 2020) for contrastive learning in our framework. This choice is motivated by earlier research from other domains (He et al., 2020; Chen et al., 2020c; Yèche et al., 2021; Dwibedi et al., 2021), where MoCo was found to yield more stable negative samples throughout each training step (due to the momentum-updated feature extractor) compared to other approaches such as SimCLR (Chen et al., 2020b). In principle, stability yields stronger negative samples for the contrastive learning objectives and, therefore, increases the mutual information between the positive pair (i.e., two augmented views of the same sample). Furthermore, MoCo allows storing the negative samples in a queue, facilitating a larger number of negative samples for the contrastive loss compared to SimCLR.
As shown earlier (Bachman et al., 2019; Tian et al., 2020a;b), the lower bound on the mutual information between the positive pair increases with a larger number of negative samples in CL. With that motivation, we opted for MoCo (He et al., 2020) in our CLUDA instead of SimCLR (Chen et al., 2020b). Nevertheless, we evaluate our choice through numerical experiments below. We further perform an ablation study where we repeat the experiments with SimCLR (instead of MoCo) for our case study from Sec. 6.3. Specifically, we provide results for decompensation prediction (see Table 30), mortality prediction (see Table 31), and length of stay prediction (see Table 32). The results confirm our choice of MoCo over SimCLR for capturing the contextual representation of time series. Specifically, our CLUDA improves upon CLUDA w/ SimCLR in all tasks by a large margin. Despite being inferior to our CLUDA, CLUDA w/ SimCLR achieves better UDA performance than the other baseline methods in decompensation prediction from MIMIC to AUMC, mortality prediction from AUMC to MIMIC, and length of stay prediction from MIMIC to AUMC. This shows the importance of leveraging contextual representations for unsupervised domain adaptation. Besides, it highlights that our CLUDA can be further improved in the future with recent advances in capturing the contextual representation of time series.
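For illustration, the MoCo-style contrastive objective with a queue of negative keys can be sketched as follows (a numpy sketch of the InfoNCE loss for a single sample; this is not the authors' implementation, and the temperature value is illustrative):

```python
import numpy as np

def info_nce(query, key_pos, queue, temperature=0.1):
    """MoCo-style contrastive loss for one sample.
    query, key_pos: (d,) L2-normalized embeddings of two augmented views
    (key_pos comes from the momentum encoder); queue: (K, d) past keys
    serving as negatives. Returns -log softmax of the positive logit."""
    l_pos = query @ key_pos                       # 1 positive logit
    l_neg = queue @ query                         # K negative logits
    logits = np.concatenate(([l_pos], l_neg)) / temperature
    logits -= logits.max()                        # numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())
```

A larger queue adds more negative logits to the denominator, which is exactly the mechanism behind the tighter mutual-information bound discussed above.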

K.2 CLUDA WITH NCL

We further compare our framework against neighborhood contrastive learning (NCL) (Yèche et al., 2021). For this, we replace the CL component (Sec. 4.3) of our CLUDA framework by NCL. NCL also leverages MoCo, as does our CLUDA. It considers different time segments of the same subject (within a certain time window) as positive pairs when constructing the CL objective. NCL is specifically designed for the transfer learning setting, where the model is pre-trained on the unlabeled source domain and later fine-tuned on a smaller amount of labeled target-domain data. When labels are absent during the pre-training stage, NCL has been shown to capture relevant signals in the embedding space for the downstream classification task. However, our UDA setting is different from the transfer learning setting in Yèche et al. (2021): (a) UDA assumes the existence of source domain labels, whereas transfer learning does not. (b) Transfer learning later leverages the labels of the target domain, whereas UDA does not require those labels. Since NCL's positive pairs may come from different classes (e.g., in healthcare, different time windows of a patient corresponding to different decompensation labels or, in sensor datasets, different time windows of a subject corresponding to different activities from walking to running), we conjecture that it adds additional noise to the classifier network, leading to an inferior prediction performance. Below, we perform an ablation study where we replace the CL component of our CLUDA with the CL of NCL. We keep all the other components the same (i.e., discriminator, classifier networks, and our NNCL component). To select hyperparameters for NCL, we performed a grid search analogous to the original implementation in Yèche et al. (2021).
We provide the results for decompensation prediction (see Table 33), mortality prediction (see Table 34), and length of stay prediction (see Table 35). The results confirm our conjecture that leveraging NCL in the UDA setting leads to an inferior prediction performance. Specifically, our CLUDA performs significantly better than CLUDA w/ NCL for all tasks and for both source and target domains. Notably, CLUDA w/ NCL performs even worse than w/o UDA. Our explanation is that, since we have the labels of the source domain during training, the objectives of NCL and the classifier network counteract each other. Therefore, our ablation study shows the need for tailoring the right contrastive learning objective to different problem settings (such as UDA vs. transfer learning). In sum, this confirms the effectiveness of our proposed framework architecture.

L DISCUSSION FOR VARIABLE-LENGTH TIME SERIES

Following earlier works of UDA for time series (Cai et al., 2021; Liu & Xue, 2021; Wilson et al., 2020; 2021), we defined the problem (see Sec. 3) in a way that each time series has a fixed length T. In case the lengths of time series differ too much within a dataset, or when the entire history of a time series needs to be considered, it may be preferable to account for variable-length time series. Here, we briefly discuss how our CLUDA can be adapted to variable-length time series inputs. We further believe that our discussion below may also be applicable to existing UDA baselines (with minor modifications). One can adapt our CLUDA primarily in two different ways. (a) Straightforward approach: One can configure a temporal convolutional network (TCN) (Bai et al., 2018) as the feature extractor to handle the longest time series and pre-pad the shorter ones with a certain value. Then, the output of the feature extractor is used analogously to our original CLUDA framework.
However, when very long and very short time series are present in the same dataset, we suspect the TCN may not capture meaningful representations for the short time series, as the pre-padded values (e.g., zeros) become dominant. If the lengths of time series vary by the order of the dilation factor (e.g., lengths of 1x, 2x, 4x, 8x time steps with a dilation factor of 2 in the TCN), one can extract the features for the shorter time series from earlier layers of the TCN (thereby, basically, accounting for the receptive field). This way, one can avoid the dominance of pre-padded values. (b) Tailored approach: Variable-length time series can be naturally modeled by a generative neural network such as a variational recurrent neural network (VRNN) (Chung et al., 2015) or a deep Markov model (DMM) (Krishnan et al., 2017). As such, one can leverage the latent variables of the generative model as input to the contrastive learning component of CLUDA. Here, we hypothesize that using the individual latent variables (of each time step) as input to CL would be (i) computationally too expensive and (ii) not so meaningful, since we apply augmentations to the entire time series and not to individual time steps. Therefore, we suggest an attention module, which processes the sequence of latent variables and outputs an aggregated latent representation of the entire time series. The output of the attention module can then be used in our CLUDA framework by multiple components, such as CL, NNCL, and the classifier network. To summarize, our suggestion as a short recipe: one can (1) get the latent variables from a generative model, (2) aggregate them via an attention module, and (3) use the result in place of the feature extractor's output in our original CLUDA framework.
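The attention module in step (2) could be realized, for example, as a simple scoring-based pooling over the per-step latents (a hypothetical numpy sketch; the scoring vector `w` stands in for a learned attention network and is purely illustrative):

```python
import numpy as np

def attention_pool(z, w):
    """Aggregate per-step latent variables z (shape (T, d)) into a single
    representation via a scoring vector w (shape (d,)).
    The output can play the role of the feature-extractor output."""
    scores = z @ w                        # (T,) relevance score per time step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()              # softmax over the time axis
    return weights @ z                    # (d,) attention-weighted average

# e.g., pooling 48 hourly VRNN latents of dimension 8 into one vector
z = np.random.default_rng(1).normal(size=(48, 8))
h = attention_pool(z, np.ones(8))
```

Because the pooling works for any sequence length T, the aggregated representation can feed CL, NNCL, and the classifier network regardless of the original series length.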



Codes are available at https://github.com/oezyurty/CLUDA . In some cases, this loose upper bound can be exceeded by UDA methods, for instance, when there are not enough samples in the target domain. Then, leveraging a larger, labeled dataset as the source domain can yield superior performance.



Figure 1: The complete CLUDA framework (best viewed in color). Some network components are shown twice (for source and target) to enhance readability. Source and target samples are augmented twice (colored in yellow). These augmented samples are processed by the feature extractor to yield the embeddings (colored in red). The embeddings are processed by four different components: classification network (Sec. 4.2), adversarial training (Sec. 4.2), CL (Sec. 4.3), and nearest-neighbor CL (Sec. 4.4). Dashed lines represent input pairs to each loss function.

Figure 2: UDA performance on benchmark datasets.

Figure 3: t-SNE visualization for the embeddings from the HHAR dataset. Each class is represented by a different color. Shape shows source and target domains (circle vs. cross).

Ablation study: We conduct an ablation study (see Fig. 2b) to better understand the importance of the different components in our framework. The variants are: (1) w/o UDA as baseline for comparison; (2) w/o CL and w/o NNCL, which solely relies on domain adversarial training and refers to the base architecture from Sec. 4.2; (3) w/o CL, which deactivates CL (from Sec. 4.3) for capturing contextual representations; (4) w/o NNCL, which deactivates NNCL (from Sec. 4.4) for aligning contextual representations across domains; and (5) w/o Discriminator, which deactivates the discriminator. Overall, the low prediction performance of the adversarial training from Sec. 4.2 (w/o CL and w/o NNCL) demonstrates the importance of capturing and aligning the contextual representations. Comparing w/o Discriminator shows that the largest part of our performance improvements (compared to w/o UDA) comes from our novel components introduced in Sec. 4.3 and Sec. 4.4 to capture

Figure 4: Case study. Reported is how much of the performance gap is filled by each method. Here, the performance gap [%] is the difference between no domain adaptation and the source → source setting as a loose upper bound on performance.

Figure 5: Length of stay distribution of MIMIC patients. For reasons of space, the distribution is cropped at a value of 500.

Figure 6: Length of stay distribution of AUMC patients. For reasons of space, the distribution is cropped at a value of 500.

Figure 7: Remaining length of stay distribution of all MIMIC samples. For reasons of space, the distribution is cropped at a value of 500.

Figure 8: Remaining length of stay distribution of all AUMC samples. For reasons of space, the distribution is cropped at a value of 500.

Figure 9: Histogram showing the distribution of remaining length of stay (MIMIC vs. AUMC). The buckets are the following: one bucket for less than one day, one bucket each for days 1 through 7, one bucket for the interval between 7 and 14 days, and one bucket for more than 14 days.

Figure 10: t-SNE visualization of the embeddings from each model on HHAR dataset. Each class is represented by a different color. Shape shows source and target domains (circle vs. cross).



3 UDA tasks between MIMIC and AUMC. Shown: Average performance over 10 random initializations.

creates a bipartite graph from the source and target videos of a given batch and leverages a graph neural network to fool the domain classifier. Temporal Co-attention Network (TCoN) (Pan et al., 2020) generates target-aligned source features via a co-attention matrix, which is adversarially trained against the domain classifier. Multi-Modal Domain Adaptation for Fine-Grained Action Recognition (Munro & Damen, 2020) further leverages multi-modal self-supervision during adversarial training. Shuffle and Attend (Choi et al., 2020) is another adversarial method, which additionally predicts the order of the video clips to alleviate the background shift across domains. Different from the previous works, Contrast and Mix (CoMix) (Sahoo et al., 2021) does not rely on adversarial training. Instead, CoMix generates synthetic videos by mixing the background of the source (target) domain and the motion of the target (source) domain via convex combination and applies contrastive learning, where videos sharing the same motion are treated as positive pairs. This work aligns the source (or target) domain with a synthetic domain, whereas our CLUDA framework directly aligns the two domains, without requiring an intermediate synthetic domain generation. Further, CoMix is not

Summary of the sensor datasets.

provides additional details for both datasets, MIMIC and AUMC. Both comprise 41 separate time series, which are then used to predict the outcomes of interest, i.e., decompensation, mortality, and length of stay, via unsupervised domain adaptation.

Summary of datasets.

Descriptions of medical time series and their summary statistics for MIMIC and AUMC

Number of patients and samples for each dataset and each split

i.e., total number of training steps) varies with the step at which early stopping is triggered in each run. For each method, the average runtimes (per 100 training steps) are the following: 44.83 seconds for w/o UDA, 122.81 seconds for VRADA, 81.06 seconds for CoDATS, 151.20 seconds for TS-SASA, 73.67 seconds for AdvSKM with a half batch size, 119.42 seconds for CAN, 83.93 seconds for CDAN, 59.92 seconds for DDC with a half batch size, 85.67 seconds for DeepCORAL, 62.38 seconds for DSAN with a half batch size, 83.81 seconds for HoMM, 68.92 seconds for MMDA with a half batch size, and 96.11 seconds for our CLUDA.

Hyperparameter tuning.

Activity prediction for each dataset between various subjects. Shown: mean Accuracy over 10 random initializations.

Activity prediction for each dataset between various subjects. Shown: mean MacroF1 over 10 random initializations.

Activity prediction for each dataset between various subjects. Shown: mean Accuracy over 10 random initializations.

Activity prediction for each dataset between various subjects. Shown: mean MacroF1 over 10 random initializations.

Mortality prediction between various age groups of AUMC. Shown: mean AUROC over 10 random initializations. Higher is better. Best value in bold. Second best results are underlined if stds overlap.

Mortality prediction between various age groups from MIMIC to AUMC. Shown: mean AUROC over 10 random initializations. Higher is better. Best value in bold. Second best results are underlined if stds overlap.

Mortality prediction between various age groups from AUMC to MIMIC. Shown: mean AUROC over 10 random initializations. Higher is better. Best value in bold. Second best results are underlined if stds overlap.

Overall, in this section we present 56 prediction tasks to compare the methods across various age groups in both datasets. Out of the 56 tasks, our CLUDA achieves the best performance in 36, where it significantly outperforms the other methods. In comparison, the best baseline methods, AdvSKM and DeepCORAL, achieve the best result in only 5 out of 56 tasks. This highlights the consistent and significant performance improvements achieved by our CLUDA across various domains.

H ABLATION STUDY FOR UDA ACROSS VARIOUS AGE GROUPS

We further conduct an ablation study to compare different variants of our CLUDA framework. Here, we build upon the previous experiments on various age groups. We use the same variants of CLUDA as in the main paper (see Sec. 6.1): w/o CL and NNCL, w/o CL, w/o NNCL, and w/o Discriminator. We repeat the results of w/o UDA and our CLUDA for better comparability. Table 15 and Table 16 list the UDA performance across age groups within MIMIC and AUMC, respectively.

Mortality prediction between various age groups of MIMIC. Shown: mean AUROC over 10 random initializations.

Mortality prediction between various age groups of AUMC. Shown: mean AUROC over 10 random initializations.

Mortality prediction between various age groups from MIMIC to AUMC. Shown: mean AUROC over 10 random initializations.

Mortality prediction between various age groups from AUMC to MIMIC. Shown: mean AUROC over 10 random initializations.

ABLATION STUDY FOR MEDICAL PRACTICE

Here, we additionally provide our ablation study for the case study presented in Sec. 6.3. Specifically, Table 24 (source: MIMIC) and Table 25 (source: AUMC) evaluate the decompensation prediction. Table 26 (source: MIMIC) and Table 27 (source: AUMC) evaluate the mortality prediction. Table 28 (source: MIMIC) and Table 29 (source: AUMC) evaluate the length of stay prediction.

Ablation study for decompensation prediction. Shown: AUROC (mean ± std) over 10 random initializations.

Ablation study for mortality prediction. Shown: AUROC (mean ± std) over 10 random initializations.

Ablation study for mortality prediction. Shown: AUROC (mean ± std) over 10 random initializations. Higher is better. Best value in bold. Black font: main results for UDA. Gray font: source → source.

Ablation study for length of stay prediction. Shown: KAPPA (mean ± std) over 10 random initializations.

Ablation study for length of stay prediction. Shown: KAPPA (mean ± std) over 10 random initializations.

Decompensation prediction. Shown: AUROC (mean ± std) over 10 random initializations.

Mortality prediction. Shown: AUROC (mean ± std) over 10 random initializations. Higher is better. Best value in bold. Black font: main results for UDA. Gray font: source → source.

CLUDA w/ SimCLR: 0.827 ± 0.001, 0.724 ± 0.004, 0.748 ± 0.002, 0.781 ± 0.002
CLUDA (ours): 0.836 ± 0.001, 0.739 ± 0.004, 0.750 ± 0.001, 0.789 ± 0.002

Length of stay prediction. Shown: KAPPA (mean ± std) over 10 random initializations.

CLUDA w/ SimCLR: 0.203 ± 0.001, 0.178 ± 0.006, 0.258 ± 0.005, 0.107 ± 0.003
CLUDA (ours): 0.216 ± 0.001, 0.202 ± 0.006, 0.276 ± 0.002, 0.129 ± 0.003
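The KAPPA metric for length of stay can be computed with scikit-learn's `cohen_kappa_score`. A minimal sketch with illustrative bucket labels; whether the unweighted or the linearly weighted variant is used is left open here:

```python
from sklearn.metrics import cohen_kappa_score

y_true = [0, 1, 3, 8, 9, 2, 2, 8]  # true length-of-stay buckets (illustrative)
y_pred = [0, 1, 2, 8, 9, 2, 3, 7]  # predicted buckets

kappa = cohen_kappa_score(y_true, y_pred)
# for ordinal buckets, the weighted variant gives partial credit to near misses
kappa_linear = cohen_kappa_score(y_true, y_pred, weights="linear")
```

Kappa corrects the observed agreement for chance agreement, so a value of 0 corresponds to chance-level predictions and 1 to perfect agreement.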

Decompensation prediction. Shown: AUROC (mean ± std) over 10 random initializations.

ACKNOWLEDGMENTS

Funding from the Swiss National Science Foundation (SNSF) via Grant 186932 is acknowledged.

I PREDICTION RESULTS OF MEDICAL PRACTICE

The main paper reported the average UDA performance between MIMIC and AUMC without the standard deviations of the results. Here, we provide the full results, including the gap filled (%) calculated for each method and the additional AUPRC metric for the decompensation and mortality predictions. Table 19 and Table 20 show the decompensation prediction results. Table 21 and Table 22 show the mortality prediction results. Table 23 shows the length of stay prediction results. The results confirm our findings from the main paper: overall, our CLUDA achieves the best performance in both source and target domains.

Published as a conference paper at ICLR 2023
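The additional AUPRC metric mentioned above can be computed with scikit-learn's `average_precision_score`, alongside AUROC. A minimal sketch with illustrative labels and risk scores:

```python
from sklearn.metrics import average_precision_score, roc_auc_score

y_true = [0, 0, 1, 0, 1, 1, 0, 1]                    # e.g., decompensation labels
y_score = [0.1, 0.3, 0.8, 0.2, 0.6, 0.9, 0.4, 0.7]   # predicted risk scores

auprc = average_precision_score(y_true, y_score)
auroc = roc_auc_score(y_true, y_score)
```

AUPRC is a useful complement to AUROC for rare outcomes such as decompensation, since it focuses on the precision-recall trade-off for the positive class.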

