ADDRESSING PARAMETER CHOICE ISSUES IN UNSUPERVISED DOMAIN ADAPTATION BY AGGREGATION

Abstract

We study the problem of choosing algorithm hyper-parameters in unsupervised domain adaptation, i.e., with labeled data in a source domain and unlabeled data in a target domain drawn from a different input distribution. We follow the strategy of computing several models using different hyper-parameters and subsequently computing a linear aggregation of these models. While several heuristics follow this strategy, methods are still missing that rely on thorough theories for bounding the target error. To this end, we propose a method that extends weighted least squares to vector-valued functions, e.g., deep neural networks. We show that the target error of the proposed algorithm is asymptotically not worse than twice the error of the unknown optimal aggregation. We also perform a large-scale empirical comparative study on several datasets, including text, images, electroencephalogram, body sensor signals and signals from mobile phones. Our method outperforms deep embedded validation (DEV) and importance weighted validation (IWV) on all datasets, setting a new state-of-the-art performance for solving parameter choice issues in unsupervised domain adaptation with theoretical error guarantees. We further study several competitive heuristics, all outperforming IWV and DEV on at least five datasets. However, our method outperforms each heuristic on at least five of seven datasets.

1. INTRODUCTION

The goal of unsupervised domain adaptation is to learn a model on unlabeled data from a target input distribution using labeled data from a different source distribution (Pan & Yang, 2010; Ben-David et al., 2010). If this goal is achieved, medical diagnostic systems can successfully be trained on unlabeled images using labeled images with a different modality (Varsavsky et al., 2020; Zou et al., 2020); segmentation models for natural images can be learned using only labeled data from computer simulations (Peng et al., 2018); natural language models can be learned from unlabeled biomedical abstracts by means of labeled data from financial journals (Blitzer et al., 2006); and industrial quality inspection systems can be learned on unlabeled data from new products using data from related products (Jiao et al., 2019; Zellinger et al., 2020). However, missing target labels combined with distribution shift makes parameter choice a hard problem (Sugiyama et al., 2007; You et al., 2019; Saito et al., 2021; Zellinger et al., 2021; Musgrave et al., 2021). Often, one ends up with a sequence of models, e.g., originating from different hyper-parameter configurations (Ben-David et al., 2007; Saenko et al., 2010; Ganin et al., 2016; Long et al., 2015; Zellinger et al., 2017; Peng et al., 2019).

Figure 1: Illustrative example (Shimodaira, 2000; Sugiyama et al., 2007; You et al., 2019). Left: Source distribution (solid) and target distribution (dashed). Right: A sequence of different linear models (dashed) is used to find the optimal linear aggregation of the models (solid). Model selection methods (Sugiyama et al., 2007; Kouw et al., 2019; You et al., 2019; Zellinger et al., 2021) cannot outperform the best single model in the sequence, confidence values as used in Zou et al. (2018) are not available, and approaches based on averages or tendencies of majorities of models (Saito et al., 2017) suffer from a high fraction of large-error models in the sequence. In contrast, our approach (dotted-dashed) is nearly optimal. In addition, the model computed by our method provably approaches the optimal linear aggregation for increasing sample size. For further details on this example we refer to Section C in the Supplementary Material.

In this work, we study the problem of constructing an optimal aggregation using all models in such a sequence. Our main motivation is that the error of such an optimal aggregation is clearly smaller than the error of the best single model in the sequence. Although methods with mathematical error guarantees have been proposed to select the best model in the sequence (Sugiyama et al., 2007; Kouw et al., 2019; You et al., 2019; Zellinger et al., 2021), methods for learning aggregations of the models are either heuristics or their theoretical guarantees are limited by severe assumptions (cf. Wilson & Cook (2020)). Typical aggregation approaches are (a) to learn an aggregation on source data only (Nozza et al., 2016); (b) to learn an aggregation on a set of (unknown) labeled target examples (Xia et al., 2013; Dai et al., 2007; III & Marcu, 2006; Duan et al., 2012); (c) to learn an aggregation on target examples (pseudo-)labeled based on confidence measures of the given models (Zhou et al., 2021; Ahmed et al., 2022; Sun, 2012; Zou et al., 2018; Saito et al., 2017); (d) to aggregate the models based on data-structure-specific transformations (Yang et al., 2012; Ha & Youn, 2021); and (e) to use specific (possibly not available) knowledge about the given models, such as information obtained at different time steps of their gradient-based optimization process (French et al., 2018; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Athiwaratkun et al., 2019; Al-Stouhi & Reddy, 2011) or the information that the given models are trained on different (source) distributions (Hoffman et al., 2018; Rakshit et al., 2019; Xu et al., 2018; Kang et al., 2020; Zhang et al., 2015).
One problem shared among all methods mentioned above is that they cannot guarantee a small error, even if the sample size grows to infinity. See Figure 1 for a simple illustrative example. In this work, we propose (to the best of our knowledge) the first algorithm for computing aggregations of vector-valued models for unsupervised domain adaptation with target error guarantees. We extend the importance weighted least squares algorithm (Shimodaira, 2000) and corresponding recently proposed error bounds (Gizewski et al., 2022) to linear aggregations of vector-valued models. The importance weights are the values of an estimated ratio between target and source density evaluated at the examples. Every method for density-ratio estimation can be used as a basis for our approach, e.g., Sugiyama et al. (2012); Kanamori et al. (2012) and references therein. Our error bound proves that the target error of the computed aggregation is asymptotically at most twice the target error of the optimal aggregation. In addition, we perform extensive empirical evaluations on several datasets with academic data (Transformed Moons), text data (Amazon Reviews (Blitzer et al., 2006)), images (MiniDomainNet (Peng et al., 2019; Zellinger et al., 2021)), electroencephalography signals (Sleep-EDF (Eldele et al., 2021; Goldberger et al., 2000)), body sensor signals (UCI-HAR (Anguita et al., 2013), WISDM (Kwapisz et al., 2011)), and sensor signals from mobile phones and smart watches (HHAR (Stisen et al., 2015)). We compute aggregations of models obtained from different hyper-parameter settings of 11 domain adaptation methods (e.g., DANN (Ganin et al., 2016) and Deep-Coral (Sun & Saenko, 2016)). Our method sets a new state of the art for methods with theoretical error guarantees, namely importance weighted validation (IWV) (Sugiyama et al., 2007) and deep embedded validation (DEV) (You et al., 2019), on all datasets.
We also study (1) classical least squares aggregation on source data only, (2) majority voting on target predictions, (3) averaging over model confidences, and (4) learning based on pseudo-labels. All of these heuristics outperform IWV and DEV on at least five of seven datasets, which is a result of independent interest. In contrast, our method outperforms each heuristic on at least five of seven datasets. Our main contributions are summarized as follows:

• We propose the (to the best of our knowledge) first algorithm for ensemble learning of vector-valued models in (single-source) unsupervised domain adaptation that satisfies a non-trivial target error bound.
• We prove that the target error of our algorithm is asymptotically (for increasing sample sizes) at most twice the target error of the unknown optimal aggregation.
• We outperform IWV and DEV, and therefore set a new state-of-the-art performance for resolving parameter choice issues under theoretical target error guarantees.
• We describe four heuristic baselines which all outperform IWV and DEV on at least five of seven datasets. Our method outperforms each heuristic on at least five of seven datasets.
• Our method tends to be more stable than others w.r.t. adding inaccurate models to the given sequence of models.

2. RELATED WORK

It is well known that aggregations of models in an ensemble often outperform individual models (Dong et al., 2020; Goodfellow et al., 2016). Traditional ensemble methods that have shown the advantage of aggregation are Boosting (Schapire, 1990; Breiman, 1998), Bootstrap Aggregating (bagging) (Breiman, 1994; 1996a) and Stacking (Wolpert, 1992; Breiman, 1996b). For example, averages of multiple models pre-trained on data from a distribution different from the target one have recently been shown to achieve state-of-the-art performance on ImageNet (Wortsman et al., 2022), and their good generalization properties can be related to flat minima (Hochreiter & Schmidhuber, 1994; 1997). However, most such methods do not take a present distribution shift into account. Although some ensemble learning methods do take distribution shift into account, in contrast to our work they either rely on labeled target data (Nozza et al., 2016; Xia et al., 2013; III & Marcu, 2006; Dai et al., 2007; Mayr et al., 2016), are restricted by fixing the aggregation weights to be equal (Razar & Samothrakis, 2019), make assumptions on the models in the sequence or on the corresponding process for learning the models (Yang et al., 2012; Ha & Youn, 2021; French et al., 2018; Laine & Aila, 2017; Tarvainen & Valpola, 2017; Athiwaratkun et al., 2019; Al-Stouhi & Reddy, 2011; Hoffman et al., 2018; Rakshit et al., 2019; Xu et al., 2018; Kang et al., 2020; Zhang et al., 2015), or learn an aggregation based on the heuristic approach of (pseudo-)labeling some target data based on confidence measures of models in the sequence (Zhou et al., 2021; Ahmed et al., 2022; Sun, 2012; Zou et al., 2018; Saito et al., 2017).
Another crucial difference is that none of the methods above can guarantee a small target error in the general setting described above (distribution shift, vector-valued models, different classes, single source domain), even if the sample size grows to infinity. Another branch of research comprises methods that aim at selecting the best model in the sequence. Although such methods with error bounds have been proposed for the general setting above (Sugiyama et al., 2007; You et al., 2019; Zellinger et al., 2021), they cannot overcome the limited performance of the best model in the given sequence (cf. Figure 1 and Section 6 in the Supplementary Material of Zellinger et al. (2021)). In contrast, our method can outperform the best model in the sequence, and our empirical evaluations show that this is indeed the case in practical examples. A recent kernel-based algorithm for univariate regression that is similar to ours can be found in Gizewski et al. (2022). However, in contrast to Gizewski et al. (2022), our method allows a much more general form of vector-valued models which are not necessarily obtained from regularized kernel least squares, and can therefore be applied to practical deep learning tasks. Our work employs technical tools developed in Caponnetto & De Vito (2007; 2005); in fact, we extend Caponnetto & De Vito (2007; 2005) to deal with importance weighted least squares. Finally, it is important to note Huang et al. (2006), where a core lemma of our proofs was proposed.

3. AGGREGATION BY IMPORTANCE WEIGHTED LEAST SQUARES

This section gives a summary of the main problem of this paper and our approach. For detailed assumptions and proofs, we refer to Section A of the Supplementary Material.

Notation and Setup. Let $\mathcal{X} \subset \mathbb{R}^{d_1}$ be a compact input space and $\mathcal{Y} \subset \mathbb{R}^{d_2}$ be a compact label space with inner product $\langle \cdot, \cdot \rangle_{\mathcal{Y}}$ such that the associated norm satisfies $\|y\|_{\mathcal{Y}} \le y_0$ for all $y \in \mathcal{Y}$ and some $y_0 > 0$. Following Ben-David et al. (2010), we consider two datasets: a labeled source dataset $(\mathbf{x}, \mathbf{y}) = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$ independently drawn according to some source distribution (probability measure) $p$ on $\mathcal{X} \times \mathcal{Y}$, and an unlabeled target dataset $\mathbf{x}' = (x'_1, \ldots, x'_m) \in \mathcal{X}^m$ with elements independently drawn according to the marginal distribution $q_X$ of some target distribution $q$ on $\mathcal{X} \times \mathcal{Y}$. The marginal distribution of $p$ on $\mathcal{X}$ is analogously denoted by $p_X$. We further denote by
$$R_q(f) = \int_{\mathcal{X} \times \mathcal{Y}} \|f(x) - y\|_{\mathcal{Y}}^2 \, dq(x, y)$$
the expected target risk of a vector-valued function $f: \mathcal{X} \to \mathcal{Y}$ w.r.t. the least squares loss.

Problem. Given a set $f_1, \ldots, f_l: \mathcal{X} \to \mathcal{Y}$ of models, the labeled source sample $(\mathbf{x}, \mathbf{y})$ and the unlabeled target sample $\mathbf{x}'$, the problem considered in this work is to find a model $\hat{f}: \mathcal{X} \to \mathcal{Y}$ with a minimal target error $R_q(\hat{f})$.

Main Assumptions

We rely (a) on the covariate shift assumption that the source conditional distribution $p(y|x)$ equals the target conditional distribution $q(y|x)$, and (b) on the bounded density ratio assumption that there is a function $\beta: \mathcal{X} \to [0, B]$ with $B > 0$ such that $dq_X(x) = \beta(x)\, dp_X(x)$.

Approach. Our goal is to compute the linear aggregation $f = \sum_{i=1}^{l} c_i f_i$ for $c_1, \ldots, c_l \in \mathbb{R}$ with minimal target risk $R_q\big(\sum_{i=1}^{l} c_i f_i\big)$. Our approach relies on the fact that
$$\operatorname*{arg\,min}_{c_1, \ldots, c_l \in \mathbb{R}} R_q\Big(\sum_{i=1}^{l} c_i f_i\Big) = \operatorname*{arg\,min}_{c_1, \ldots, c_l \in \mathbb{R}} \int_{\mathcal{X}} \Big\|\sum_{i=1}^{l} c_i f_i(x) - f_q(x)\Big\|_{\mathcal{Y}}^2 \, dq_X(x) \quad (1)$$
for the regression function given by $f_q(x) = \int_{\mathcal{Y}} y \, dq(y|x)$, see e.g. Cucker & Smale (2002, Proposition 1). Unfortunately, the right-hand side of Eq. (1) contains information about the labels via $f_q(x)$, which is not given in our setting of unsupervised domain adaptation. However, borrowing an idea from importance sampling, it is possible to estimate Eq. (1). More precisely, from the covariate shift assumption we get $f_p(x) = \int_{\mathcal{Y}} y \, dp(y|x) = f_q(x)$, and we can use the bounded density ratio $\beta$ to obtain
$$\operatorname*{arg\,min}_{c_1, \ldots, c_l \in \mathbb{R}} R_q\Big(\sum_{i=1}^{l} c_i f_i\Big) = \operatorname*{arg\,min}_{c_1, \ldots, c_l \in \mathbb{R}} \int_{\mathcal{X}} \beta(x) \Big\|\sum_{i=1}^{l} c_i f_i(x) - f_p(x)\Big\|_{\mathcal{Y}}^2 \, dp_X(x), \quad (2)$$
which extends importance weighted least squares (Shimodaira, 2000; Kanamori et al., 2009) to linear aggregations $\sum_{i=1}^{l} c_i f_i$ of vector-valued functions $f_1, \ldots, f_l$. The unique minimizer of Eq. (2) can be approximated based on the available data, analogously to classical least squares estimation, as detailed in Algorithm 1. In the following, we call Algorithm 1 Importance Weighted Least Squares Linear Aggregation (IWA).
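The change of measure underlying Eq. (2) can be sanity-checked numerically: reweighting source draws by the density ratio recovers target expectations. A minimal sketch (the discrete distributions and the function g below are illustrative choices, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Discrete input space with different source and target marginals.
xs = np.array([0, 1, 2])
p_X = np.array([0.6, 0.3, 0.1])   # source marginal
q_X = np.array([0.2, 0.3, 0.5])   # target marginal
beta = q_X / p_X                  # bounded density ratio dq_X / dp_X

g = np.array([1.0, -2.0, 3.0])    # an arbitrary function of x

# Exact target expectation E_q[g(x)].
target_exact = np.sum(q_X * g)

# Importance weighted Monte Carlo estimate using only source draws.
n = 200_000
draws = rng.choice(xs, size=n, p=p_X)
estimate = np.mean(beta[draws] * g[draws])

print(target_exact, estimate)  # the two values agree up to Monte Carlo error
```

The same reweighting applied to the squared loss turns the target risk minimization of Eq. (1) into the source-sample objective of Eq. (2).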

Relation to Model Selection

The optimal aggregation $f^* := \operatorname*{arg\,min}_{c_1, \ldots, c_l \in \mathbb{R}} R_q\big(\sum_{i=1}^{l} c_i f_i\big)$ defined in Eq. (2) is at least as good as any single model selection, since
$$R_q(f^*) = \min_{c_1, \ldots, c_l \in \mathbb{R}} R_q\Big(\sum_{i=1}^{l} c_i f_i\Big) \le \min_{c_1, \ldots, c_l \in \{0, 1\}} R_q\Big(\sum_{i=1}^{l} c_i f_i\Big) \le \min_{i = 1, \ldots, l} R_q(f_i). \quad (3)$$
However, the optimal aggregation $f^*$ cannot be computed based on finite datasets, and the next logical questions concern the accuracy of the approximation $\hat{f}$ in Algorithm 1.

Algorithm 1: Importance Weighted Least Squares Linear Aggregation (IWA).
Input: set $f_1, \ldots, f_l: \mathcal{X} \to \mathcal{Y}$ of models, labeled source sample $(\mathbf{x}, \mathbf{y})$ and unlabeled target sample $\mathbf{x}'$.
Output: linear aggregation $\hat{f} = \sum_{i=1}^{l} \hat{c}_i f_i$ with weights $\hat{c} = (\hat{c}_1, \ldots, \hat{c}_l) \in \mathbb{R}^l$.
Step 1: Use the unlabeled samples $\mathbf{x}$ and $\mathbf{x}'$ to approximate the density ratio $\frac{dq_X}{dp_X}$ by some function $\hat{\beta}: \mathcal{X} \to [0, B]$ using a classical algorithm, e.g. Sugiyama et al. (2012).
Step 2: Compute the weight vector $\hat{c} = \hat{G}^{-1} \hat{g}$ with empirical Gram matrix $\hat{G}$ and vector $\hat{g}$ defined by
$$\hat{G} = \Big(\frac{1}{m} \sum_{k=1}^{m} \langle f_i(x'_k), f_j(x'_k) \rangle_{\mathcal{Y}}\Big)_{i,j=1}^{l}, \qquad \hat{g} = \Big(\frac{1}{n} \sum_{k=1}^{n} \hat{\beta}(x_k) \langle y_k, f_i(x_k) \rangle_{\mathcal{Y}}\Big)_{i=1}^{l}.$$
Return: linear aggregation $\hat{f} = \sum_{i=1}^{l} \hat{c}_i f_i$.
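Step 2 of Algorithm 1 amounts to an $l \times l$ linear solve over empirical inner products. The following sketch shows one possible NumPy implementation; the toy models, the Gaussian data, and the use of the exact Gaussian density ratio in place of Step 1 are illustrative assumptions, not part of the paper's experiments:

```python
import numpy as np

def iwa_weights(models, x_src, y_src, x_tgt, beta_hat):
    """Step 2 of Algorithm 1: importance weighted least squares aggregation weights."""
    F_tgt = np.stack([f(x_tgt) for f in models])   # (l, m, d2) outputs on target
    F_src = np.stack([f(x_src) for f in models])   # (l, n, d2) outputs on source

    # Empirical Gram matrix on the unlabeled target sample:
    # G_ij = (1/m) sum_k <f_i(x'_k), f_j(x'_k)>_Y
    G = np.einsum('imd,jmd->ij', F_tgt, F_tgt) / F_tgt.shape[1]

    # Importance weighted vector on the labeled source sample:
    # g_i = (1/n) sum_k beta(x_k) <y_k, f_i(x_k)>_Y
    w = beta_hat(x_src)
    g = np.einsum('n,nd,ind->i', w, y_src, F_src) / F_src.shape[1]

    return np.linalg.solve(G, g)                   # aggregation weights c_hat

# Toy check: two vector-valued models whose sum generates the labels,
# source N(0,1) and target N(0.5,1), with the exact Gaussian density ratio.
f1 = lambda x: np.hstack([x, -x])
f2 = lambda x: np.hstack([2.0 * x, x])
x_src = np.random.default_rng(1).normal(0.0, 1.0, size=(20000, 1))
x_tgt = np.random.default_rng(2).normal(0.5, 1.0, size=(20000, 1))
y_src = f1(x_src) + f2(x_src)                      # labels realized by c = (1, 1)
beta = lambda x: np.exp(0.5 * x[:, 0] - 0.125)     # dq_X/dp_X for these Gaussians
c_hat = iwa_weights([f1, f2], x_src, y_src, x_tgt, beta)
print(c_hat)  # approaches [1., 1.] for large n and m
```

Note that $\hat{G}$ is computed on the target sample while $\hat{g}$ is importance weighted on the source sample, exactly mirroring the two averages in Step 2.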

4. TARGET ERROR BOUND FOR ALGORITHM 1

Let us start by introducing some further notation: $L^2(p)$ refers to the Lebesgue-Bochner space of functions from $\mathcal{X}$ to $\mathcal{Y}$, associated to a measure $p$ on $\mathcal{X}$, with corresponding inner product $\langle \cdot, \cdot \rangle_{L^2(p)}$ (this space consists of all $\mathcal{Y}$-valued functions whose $\mathcal{Y}$-norms are square integrable with respect to the given measure $p$). Moreover, let us introduce the (positive semi-definite) Gram matrix $\bar{G} = \big(\langle f_i, f_j \rangle_{L^2(q_X)}\big)_{i,j=1}^{l}$ and the vector $\bar{g} = \big(\langle \beta f_p, f_i \rangle_{L^2(p_X)}\big)_{i=1}^{l}$. We can assume that $\bar{G}$ is invertible (and thus positive definite), since otherwise some models are too similar to others and can be withdrawn from consideration (see Section D). Next, we recall that the minimizer of Eq. (2) is $c^* = (c^*_1, \ldots, c^*_l) = \bar{G}^{-1} \bar{g}$, see Lemma 4. However, neither $\bar{G}$ nor the vector $\bar{g}$ is accessible in practice, because there is no access to the target measure $q_X$. Guided by the law of large numbers, we approximate them by averages over the given data and thereby arrive at the formulas for $\hat{G}$ and $\hat{g}$ given in Algorithm 1. This leads to the approximation $\hat{f}$. Up to this point, we have only considered an intuitive perspective on the problem setting; we now formally state how the distance between the model $\hat{f}$ and the optimal linear model $f^* = \sum_{i=1}^{l} c^*_i f_i$, measured in terms of target risks, behaves with increasing sample sizes. This is the content of our main result:

Theorem 1. With probability $1 - \delta$ it holds that
$$R_q(\hat{f}) - R_q(f_q) \le 2\,\big(R_q(f^*) - R_q(f_q)\big) + C \log\tfrac{1}{\delta}\,\big(n^{-1} + m^{-1}\big) \quad (4)$$
for some coefficient $C > 0$ not depending on $m$, $n$ and $\delta$, and sufficiently large $m$ and $n$.

Before we give an outline of the proof (see Section A), let us briefly comment on the main message of Theorem 1. Observe that (Cucker & Smale, 2002, Proposition 1) $R_q(f) - R_q(f_q) = \|f - f_q\|^2_{L^2(q_X)}$ can be interpreted as the total target error made by Algorithm 1, sometimes called excess risk. Indeed, in the deterministic setting of labeling functions, $f_q$ equals the target labeling function and the excess risk equals the target error of Ben-David et al. (2010). Eq. (4) compares this error for the aggregation $\hat{f}$, computed by Algorithm 1, to the error for the optimal aggregation $f^*$. Note that the error of the optimal aggregation $f^*$ is unavoidable in the sense that it is determined by the decision of searching for linear aggregations of $f_1, \ldots, f_l$ only. However, if the models $f_1, \ldots, f_l$ are sufficiently different, then this error can be expected to be small. Theorem 1 tells us that the error of $\hat{f}$ approaches the one of $f^*$ with increasing target and source sample size. The rate of convergence is at least linear. Finally, we emphasize that Theorem 1 does not take into account the error of the density-ratio estimation. We refer to the recent work of Gizewski et al. (2022), who, for the first time, included such an error in the analysis of importance weighted least squares.

Let us now give a brief outline of the proof of Theorem 1. One key part concerns the existence of a Hilbert space $\mathcal{H}$ with associated inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ (a reproducing kernel space of functions $\mathcal{X} \to \mathcal{Y}$) which contains all given models $f_1, \ldots, f_l$ and the regression function $f_q = f_p$. The space $\mathcal{H}$ can be constructed from any given models that are bounded and continuous functions. Furthermore, Algorithm 1 does not need any knowledge of $\mathcal{H}$; it is a modeling assumption only needed for the proofs, so that we can apply many arguments developed in Caponnetto & De Vito (2007; 2005). $\mathcal{H}$ is also not necessarily generated by a prescribed kernel such as a Gaussian or linear kernel, and no further smoothness assumption is required, see Sections A and B in the Supplementary Material. Moreover, in this setting one can express the excess risk as $R_q(f) - R_q(f_q) = \|A(f - f_q)\|^2_{\mathcal{H}}$ for some bounded linear operator $A: \mathcal{H} \to \mathcal{H}$. This also allows us to formulate the entries of $\bar{G}$ and $\bar{g}$ in terms of the inner product $\langle \cdot, \cdot \rangle_{\mathcal{H}}$ instead. Using properties of the operators that appear in the construction of $\mathcal{H}$, in combination with Hoeffding-like concentration bounds in Hilbert spaces and bounds that measure, e.g., the deviation between empirical averages in the source and target domain (as done in Gretton et al. (2006, Lemma 4)), we can quantify the differences between the entries of $\bar{G}$ and $\hat{G}$ (and $\bar{g}$ and $\hat{g}$, respectively) in terms of $n$, $m$ and $\delta$. This leads to Eq. (4).
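The interpretation of the excess risk as a squared distance follows from the standard bias decomposition of the least squares risk, which may be worth spelling out:

```latex
\begin{align*}
R_q(f) &= \int_{\mathcal{X}\times\mathcal{Y}} \|f(x)-y\|_{\mathcal{Y}}^2 \, dq(x,y)\\
       &= \int_{\mathcal{X}} \|f(x)-f_q(x)\|_{\mathcal{Y}}^2 \, dq_X(x)
        + \int_{\mathcal{X}\times\mathcal{Y}} \|f_q(x)-y\|_{\mathcal{Y}}^2 \, dq(x,y)\\
       &= \|f-f_q\|_{L^2(q_X)}^2 + R_q(f_q),
\end{align*}
```

where the cross term vanishes because $f_q(x) = \int_{\mathcal{Y}} y \, dq(y|x)$ is the conditional mean of $y$ given $x$.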
Figure 2: Top: Accuracy of each individual model (x-axis), trained with DIRT (Shu et al., 2018) for different hyper-parameter choices, on HHAR (Stisen et al., 2015) over 3 seeds. Bottom: Scaled aggregation weights (y-axis) for individual models (x-axis) computed by IWA, SOR and DEV (average over 3 seeds). Instead of searching for the best model in the sequence, IWA effectively uses all models in the sequence and obtains a performance not reachable by any procedure selecting only one model.

5. EMPIRICAL EVALUATIONS

We now empirically evaluate the performance of our approach compared to classical ensemble learning baselines and state-of-the-art model selection methods. We structure our empirical evaluation as follows. First, we outline our experimental setup for unsupervised domain adaptation and introduce all domain adaptation methods for our analysis. Second, we describe the ensemble learning and model selection baselines, and third, we present the datasets used for our experiments. We conclude with our results and a detailed discussion thereof.

5.1. EXPERIMENTAL SETUP

To assess the performance of our ensemble learning Algorithm 1 (IWA), we perform numerous experiments with different domain adaptation algorithms on different datasets. By changing the hyper-parameters of each algorithm, we obtain sequences of models as results of applying these algorithms. The goal of our method is to find optimal models based on combinations of candidates from each sequence. As domain adaptation algorithms, we consider the AdaTime benchmark suite, which comprises a collection of 11 domain adaptation algorithms, and run our experiments on text, image and time-series data. We follow their evaluation setup and apply the following algorithms: Adversarial Spectral Kernel Matching (AdvSKM) (Liu & Xue, 2021), Deep Domain Confusion (DDC) (Tzeng et al., 2014), Correlation Alignment via Deep Neural Networks (Deep-Coral) (Sun et al., 2017), Central Moment Discrepancy (CMD) (Zellinger et al., 2017), Higher-order Moment Matching (HoMM) (Chen et al., 2020), Minimum Discrepancy Estimation for Deep Domain Adaptation (MMDA) (Rahman et al., 2020), Deep Subdomain Adaptation (DSAN) (Zhu et al., 2021), Domain-Adversarial Neural Networks (DANN) (Ganin et al., 2016), Conditional Adversarial Domain Adaptation (CDAN) (Long et al., 2018), A DIRT-T Approach to Unsupervised Domain Adaptation (DIRT) (Shu et al., 2018) and Convolutional deep Domain Adaptation model for Time-Series data (CoDATS) (Wilson et al., 2020). In addition to the sequence of models, IWA requires an estimate of the density ratio between source and target domain. To compute this quantity we follow Bickel et al. (2007) and You et al. (2019, Section 4.3) and train a classifier discriminating between source and target data. The output of this classifier is then used to approximate the density ratio, denoted as $\hat{\beta}$ in Algorithm 1.
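The discriminative density-ratio estimate described above can be sketched as follows: a classifier is trained to separate source inputs (label 0) from target inputs (label 1), and its odds, corrected for the sample-size imbalance, approximate $dq_X/dp_X$. The plain-NumPy logistic model below is an illustrative stand-in for whatever classifier is used in practice:

```python
import numpy as np

def density_ratio_classifier(x_src, x_tgt, steps=2000, lr=0.1):
    """Estimate beta = dq_X/dp_X by logistic source/target discrimination."""
    X = np.vstack([x_src, x_tgt])
    t = np.concatenate([np.zeros(len(x_src)), np.ones(len(x_tgt))])
    X1 = np.hstack([X, np.ones((len(X), 1))])        # append a bias feature
    w = np.zeros(X1.shape[1])
    for _ in range(steps):                           # plain gradient descent
        p = 1.0 / (1.0 + np.exp(-X1 @ w))
        w -= lr * X1.T @ (p - t) / len(X1)

    def beta_hat(x):
        x1 = np.hstack([x, np.ones((len(x), 1))])
        p = 1.0 / (1.0 + np.exp(-x1 @ w))
        # odds ratio, corrected for the source/target sample-size imbalance
        return (len(x_src) / len(x_tgt)) * p / (1.0 - p)

    return beta_hat

# Toy check: source N(0,1), target N(1,1); the true ratio is exp(x - 1/2).
rng = np.random.default_rng(0)
x_src = rng.normal(0.0, 1.0, size=(4000, 1))
x_tgt = rng.normal(1.0, 1.0, size=(4000, 1))
beta_hat = density_ratio_classifier(x_src, x_tgt)
print(beta_hat(np.array([[0.5]])))  # true value at x = 0.5 is exp(0) = 1
```

Any off-the-shelf probabilistic classifier can replace the hand-rolled logistic model; only the odds-ratio correction matters.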
Overall, to compute the results in our tables, we trained 16,680 models in approximately 1,500 GPU-hours on NVIDIA P100 16GB GPUs. For example, consider the top plot of Figure 2, where we compare the performance of Algorithm 1 to deep embedded validation (DEV) (You et al., 2019), the heuristic baseline source-only regression (SOR, see Section 5.2) and each individual model in the sequence. The bottom plot shows the scaled aggregation weights, i.e. how much each individual model contributes to the aggregated prediction of IWA, DEV, and SOR. In this example, the given sequence of models is obtained by applying the algorithm proposed in Shu et al. (2018) with different hyper-parameter choices to the Heterogeneity Human Activity Recognition dataset (Stisen et al., 2015). See Section D.3 in the Supplementary Material for the exact hyper-parameter values.

5.2. BASELINES

As representatives of the most prominent methods discussed in Section 1, we compare our method, IWA, to ensemble learning methods that use linear regression and majority voting as heuristics for model aggregation, and to model selection methods with theoretical error guarantees.

Heuristic Baselines

The first baseline is majority voting on target data (TMV). It aggregates the predictions of all models by counting the class predictions over all models and selects the class with the maximum prediction count as ensemble output. In addition, we implement three heuristic baselines which aggregate the vector-valued outputs, i.e. class probabilities, of all classifiers using weights learned via linear regression. The final ensemble prediction is then made by selecting the class with the highest probability. The three heuristic regression baselines differ in the input used for the regression. Source-only regression (SOR) trains a regression model on classifier predictions (of the given models) and labels from the source domain only. Target majority voting regression (TMR) uses the same voting procedure as explained above to generate pseudo-labels on the target domain, which are then used to train a linear regression model. In contrast, target confidence average regression (TCR) selects the highest average class probability over all classifiers to pseudo-label the target samples, which are then used for training the linear regression model.
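Two of these baselines, TMV and SOR, can be sketched compactly; the probability arrays below (shape: models x samples x classes) are placeholders:

```python
import numpy as np

def target_majority_vote(probs_tgt):
    """TMV: each model votes for its argmax class; ties broken by lowest class index."""
    votes = probs_tgt.argmax(axis=2)                       # (l, m) class votes
    counts = np.apply_along_axis(np.bincount, 0, votes, None, probs_tgt.shape[2])
    return counts.argmax(axis=0)                           # (m,) ensemble classes

def source_only_regression(probs_src, y_src_onehot, probs_tgt):
    """SOR: least squares weights fit on source predictions, applied to target."""
    l, n, k = probs_src.shape
    A = probs_src.transpose(1, 2, 0).reshape(n * k, l)     # stack sample/class dims
    b = y_src_onehot.reshape(n * k)
    c, *_ = np.linalg.lstsq(A, b, rcond=None)              # regression weights
    agg = np.einsum('i,imk->mk', c, probs_tgt)             # weighted aggregation
    return agg.argmax(axis=1)

# Toy usage: three models, two target samples, two classes.
probs_tgt = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.4, 0.6], [0.6, 0.4]],
])
print(target_majority_vote(probs_tgt))  # [0 1]
```

TMR and TCR differ only in how the pseudo-labels replacing `y_src_onehot` are produced on the target sample.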

Baselines with Theoretical Error Guarantees

We compare IWA to the model selection methods importance weighted validation (IWV) (Sugiyama et al., 2007) and deep embedded validation (DEV) (You et al., 2019), which select models according to their (importance weighted) target risk. Both methods assume knowledge of an estimated density ratio between the target and source domains. In our experiments we follow Bickel et al. (2007) and You et al. (2019) and estimate this ratio using a classifier that discriminates between source and target domain (see Supplementary Material Section D for more details).

5.3. DATASETS

We evaluate the previously mentioned methods on a diverse set of datasets, including language, image and time-series data. All datasets have a train, evaluation and test split, with results only presented on the held-out test sets. For additional details we refer to Appendices C and D.

TransformedMoons: This specific form of the twinning moons dataset is based on Zellinger et al. (2021). The source domain consists of two-dimensional input data points and their transformations to two opposing moon-shaped forms.

MiniDomainNet is a reduced version of DomainNet-2019 (Peng et al., 2019) consisting of six different image domains (Quickdraw, Real, Clipart, Sketch, Infograph, and Painting). In particular, MiniDomainNet (Zellinger et al., 2021) reduces the number of classes of DomainNet-2019 to the top-five largest representatives in the training set of each class across all six domains.

AmazonReviews is based on Blitzer et al. (2006) and consists of text reviews from four domains: books, DVDs, electronics, and kitchen appliances. Reviews are encoded as feature vectors of bag-of-words unigrams and bigrams with binary labels indicating the rankings. From the four categories we obtain twelve domain adaptation tasks where each category serves once as source domain and once as target domain.

UCI-HAR

The Human Activity Recognition (Anguita et al., 2013) dataset from the UC Irvine Repository contains data from three motion sensors (accelerometer, gyroscope and body-worn sensors) gathered using smartphones from 30 different subjects. It classifies their activities into several categories, namely walking, walking upstairs, walking downstairs, standing, sitting, and lying down. WISDM (Kwapisz et al., 2011) is a class-imbalanced dataset collected from accelerometer sensors, including GPS data, from 29 different subjects performing similar activities as in the UCI-HAR dataset. HHAR: The Heterogeneity Human Activity Recognition (Stisen et al., 2015) dataset investigates sensor-, device- and workload-specific heterogeneities using 36 smartphones and smartwatches, consisting of 13 different device models from four manufacturers.

Sleep-EDF

The Sleep-EDF dataset (Goldberger et al., 2000) contains electroencephalography (EEG) readings from 20 healthy subjects.

We rely on the AdaTime benchmark suite (Ragab et al., 2022) in most evaluations. The four time-series datasets above are originally included there. We extend AdaTime to support the other discussed datasets as well, and extend its domain adaptation methods.

5.4. RESULTS

We separate the applied methods into two groups, namely heuristics and methods with theoretical error guarantees. All tables show accuracies of source-only (SO) and target-best (TB) models, where source-only denotes training without domain adaptation and target-best denotes the best performing model obtained among all parameter settings. We highlight in bold the performance of the best performing method with theoretical error guarantees, and in italics the best performing heuristic. See Table 1 for results; the full tables can be found in the Supplementary Material, Section D.

Outperformance of theoretically justified methods: On all datasets, our method outperforms IWV and DEV, setting a new state of the art for solving parameter choice issues under theoretical guarantees.

Outperformance of heuristics: It is interesting to note that each heuristic outperforms IWV and DEV on at least five of seven datasets. Moreover, every heuristic outperforms the (average) target-best model (TB) in at least two cases, making it impossible for any model selection method to win in these cases. These facts highlight the quality of the predictions of our chosen heuristics. However, each heuristic is outperformed by our method on at least five of seven datasets.

Information in aggregation weights and robustness w.r.t. inaccurate models: In contrast to the other heuristic aggregation baselines, the aggregation weights $\hat{c}_1, \ldots, \hat{c}_l$ of our method tend to be larger for accurate models, see Section D.5. Another result is that our method tends to be less sensitive to a high number of inaccurate models than the baselines, see Section D.6. This serves as another reason for its high empirical performance.

6. CONCLUSION AND FUTURE WORK

We present a constructive, theory-based method for approaching parameter choice issues in the setting of unsupervised domain adaptation. Its theoretical approach relies on the extension of weighted least squares to vector-valued functions. The resulting aggregation method distinguishes itself by a wide scope of admissible model classes without strong assumptions, e.g. support vector machines, decision trees and neural networks. A broad empirical comparative study on benchmark datasets for language, images, body sensor signals and mobile phone signals underpins the theory-based optimality claim. It is left for future research to further refine the theory and its estimates, e.g., by exploiting concentration bounds from Gretton et al. (2006) or advanced density-ratio estimators from Sugiyama et al. (2012).

A NOTATION AND PROOF OF MAIN RESULT

The aim of this section is to give a full proof of our main result, Theorem 1 in the main paper. We start by introducing and summarizing the notation and the required concepts from functional analysis and measure theory, so that we can state and prove the required lemmas.

Summary of Notation

• Spaces: input space $\mathcal{X} \subset \mathbb{R}^{d_1}$ and label space $\mathcal{Y}$ with inner product $\langle \cdot, \cdot \rangle_{\mathcal{Y}}$. $\mathcal{Y}$ is assumed to be a separable Hilbert space such that the associated norm satisfies $\|y\|_{\mathcal{Y}} \le y_0$ for all $y \in \mathcal{Y}$ and some $y_0 > 0$. Note that this setting is more general than the one from the main text, where we assumed $\mathcal{Y} \subset \mathbb{R}^{d_2}$ (the simplification in the main text improves readability and respects space limits).
• Datasets and distributions: source dataset $(\mathbf{x}, \mathbf{y}) = ((x_1, y_1), \ldots, (x_n, y_n)) \in (\mathcal{X} \times \mathcal{Y})^n$ independently drawn according to the source distribution $p$ on $\mathcal{X} \times \mathcal{Y}$, and unlabeled target dataset $\mathbf{x}' = (x'_1, \ldots, x'_m) \in \mathcal{X}^m$ independently drawn according to the marginal distribution $q_X$ of the target distribution $q$ on $\mathcal{X} \times \mathcal{Y}$ (the corresponding marginal distribution of $p$ on $\mathcal{X}$ is similarly denoted as $p_X$).
• Source risk: $R_p(f) = \int_{\mathcal{X} \times \mathcal{Y}} \|f(x) - y\|_{\mathcal{Y}}^2 \, dp(x, y)$.
• Source regression function: $f_p(x) = \int_{\mathcal{Y}} y \, dp(y|x)$, a (vector-valued) integral in the sense of Lebesgue-Bochner.
• Target risk: $R_q(f) = \int_{\mathcal{X} \times \mathcal{Y}} \|f(x) - y\|_{\mathcal{Y}}^2 \, dq(x, y)$.
• Target regression function: $f_q(x) = \int_{\mathcal{Y}} y \, dq(y|x)$, a (vector-valued) integral in the sense of Lebesgue-Bochner.

Problem

• Given: a sequence $f_1,\ldots,f_l : \mathcal{X}\to\mathcal{Y}$ of models, the labeled source sample $(\mathbf{x},\mathbf{y})$ and the unlabeled target sample $\mathbf{x}'$.
• Aim: find an aggregation $f = \sum_{i=1}^l c_i f_i$ with minimal target risk $R_q(f)$.
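The aggregation problem above can be made concrete with a short numpy sketch (the function and variable names here are ours, not from the paper's code): an aggregation is just a weighted sum of vector-valued models, and the risk is estimated by a mean squared $\mathcal{Y}$-norm error.

```python
import numpy as np

def aggregate(models, c):
    """Linear aggregation f = sum_i c_i f_i of vector-valued models.

    `models` is a list of callables mapping an input to a numpy array in Y."""
    def f(x):
        return sum(ci * fi(x) for ci, fi in zip(c, models))
    return f

def empirical_risk(f, xs, ys):
    """Empirical risk (1/n) * sum_i ||f(x_i) - y_i||_Y^2."""
    residuals = np.stack([f(x) - y for x, y in zip(xs, ys)])
    return float(np.mean(np.sum(residuals ** 2, axis=-1)))

# toy check: two constant models aggregated with weights (0.5, 0.5)
f1 = lambda x: np.array([0.0, 0.0])
f2 = lambda x: np.array([2.0, 2.0])
f = aggregate([f1, f2], [0.5, 0.5])
xs = [np.zeros(2)] * 3
ys = [np.array([1.0, 1.0])] * 3
risk = empirical_risk(f, xs, ys)  # aggregation matches the labels exactly: 0.0
```

In practice the unknown quantity is of course the target risk $R_q(f)$; estimating the weights that minimize it from samples is exactly the subject of the following analysis.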

Main Assumptions

• Covariate shift: $p(y|x) = q(y|x)$ and thus $f_p = f_q$.
• Bounded density ratio: there is $\beta : \mathcal{X}\to[0,B]$ such that $dq_X(x) = \beta(x)\, dp_X(x)$.

The existence of the associated conditional probability measures is guaranteed by the fact that $\mathcal{X}\times\mathcal{Y}$ is Polish (a separable and complete metric space), cf. Dudley (2002, Theorem 10.2.2).

Notation from functional analysis/operator theory

Let $\mathcal{U}$ and $\mathcal{V}$ denote separable Hilbert spaces (i.e., they admit countable orthonormal bases) with inner products $\langle\cdot,\cdot\rangle_{\mathcal{U}}$ and $\langle\cdot,\cdot\rangle_{\mathcal{V}}$, respectively. Let us briefly recall some notions from functional analysis that we need in order to set up our theory; standard references include Teschl (2022a) and Teschl (2022b):

• $\mathcal{L}(\mathcal{U},\mathcal{V})$: space of bounded linear operators $\mathcal{U}\to\mathcal{V}$ with uniform norm $\|\cdot\|_{\mathcal{L}(\mathcal{U},\mathcal{V})}$; $\mathcal{L}(\mathcal{U})$: space of bounded linear operators $\mathcal{U}\to\mathcal{U}$.
• For $A\in\mathcal{L}(\mathcal{U},\mathcal{V})$, its adjoint is denoted by $A^*\in\mathcal{L}(\mathcal{V},\mathcal{U})$ (uniquely defined by $\langle Au, v\rangle_{\mathcal{V}} = \langle u, A^*v\rangle_{\mathcal{U}}$ for all $u\in\mathcal{U}$, $v\in\mathcal{V}$).
• If $A\in\mathcal{L}(\mathcal{U})$ and $A = A^*$, then $A$ is called self-adjoint.
• If $A\in\mathcal{L}(\mathcal{U})$ is self-adjoint and $\langle Au, u\rangle_{\mathcal{U}} \ge 0$ for all $u\in\mathcal{U}$, then $A$ is called positive. Equivalently: there exists a unique bounded self-adjoint $B =: \sqrt{A}\in\mathcal{L}(\mathcal{U})$ such that $B^2 = A$.
• Trace of an operator $A\in\mathcal{L}(\mathcal{U})$: $\mathrm{Tr}(A) = \sum_k \langle Ae_k, e_k\rangle_{\mathcal{U}}$ for any orthonormal basis $(e_k)_{k=1}^\infty$ of $\mathcal{U}$ (independent of the choice of basis). If $\mathrm{Tr}(A) < \infty$, then $A$ is called trace class.
• $\mathcal{L}_2(\mathcal{U})$: separable Hilbert space of Hilbert–Schmidt operators on $\mathcal{U}$ with scalar product $\langle A, B\rangle_{\mathcal{L}_2(\mathcal{U})} = \mathrm{Tr}(B^*A)$ and norm $\|A\|_{\mathcal{L}_2(\mathcal{U})} = \sqrt{\mathrm{Tr}(A^*A)} \ge \|A\|_{\mathcal{L}(\mathcal{U})}$.
• $A:\mathcal{U}\to\mathcal{V}$ is called Hilbert–Schmidt if $A^*A$ is trace class. Also here, $\|A\|_{\mathcal{L}(\mathcal{U},\mathcal{V})} \le \sqrt{\mathrm{Tr}(A^*A)}$.
• For a (probability) measure $q$ on $\mathcal{X}$ (or $\mathcal{Y}$) and appropriate functions $F:\mathcal{X}\to\mathcal{U}$ (e.g., strongly measurable with $\|F\|_{\mathcal{U}}$ integrable w.r.t. $q$), we denote the usual ($\mathcal{U}$-valued) Bochner integral of $F$ by $\int_{\mathcal{X}} F(x)\, dq(x)$.

We denote the associated $L^p$-spaces by $L^p(\mathcal{X}, q, \mathcal{U})$, or $L^p(q)$ for short, if the associated spaces are clear from the context.
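The bounded density ratio $\beta$ is typically unknown in practice. One common estimator, among the several surveyed by Sugiyama et al. (2012), trains a probabilistic source-vs-target classifier and applies Bayes' rule; the clipping reflects the bounded-ratio assumption. A minimal numpy-only sketch (all names and constants ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_logistic(X, d, lr=0.1, steps=2000):
    """Plain gradient-descent logistic regression; d = 1 marks target samples."""
    X = np.column_stack([np.ones(len(X)), X])  # bias feature
    w = np.zeros(X.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X @ w))
        w -= lr * X.T @ (p - d) / len(d)
    return w

def beta_hat(x, w, n_src, n_tgt, B=100.0):
    """Bayes' rule: q_X(x)/p_X(x) ~ (n_src/n_tgt) * P(tgt|x)/P(src|x),
    clipped to [0, B] as in the bounded density ratio assumption."""
    z = np.column_stack([np.ones(len(x)), x]) @ w  # classifier logit
    ratio = (n_src / n_tgt) * np.exp(z)            # odds = exp(logit)
    return np.clip(ratio, 0.0, B)

# toy 1-d example with a mean shift between source and target
x_src = rng.normal(0.0, 1.0, size=(500, 1))
x_tgt = rng.normal(2.0, 1.0, size=(500, 1))
w = fit_logistic(np.vstack([x_src, x_tgt]),
                 np.concatenate([np.zeros(500), np.ones(500)]))
b = beta_hat(np.array([[2.0], [0.0]]), w, 500, 500)
```

Points near the target mean receive a larger estimated ratio than points near the source mean, as expected.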

Assumptions on models

We assume that the regression function $f_p = f_q$ as well as the models $f_1,\ldots,f_l$ belong to a hypothesis space in the sense of Caponnetto & De Vito (2007): a separable Hilbert space $\mathcal{H}$ of functions $f:\mathcal{X}\to\mathcal{Y}$ such that:

• $\mathcal{H} \subseteq C(\mathcal{X},\mathcal{Y}) \subseteq L^2(p_X) \cap L^2(q_X)$;
• for all $x\in\mathcal{X}$ there is a Hilbert–Schmidt operator $K_x:\mathcal{Y}\to\mathcal{H}$ satisfying the reproducing property $f(x) = K_x^* f$ for all $f\in\mathcal{H}$;
• the function $(x,t)\mapsto \langle K_t v, K_x w\rangle_{\mathcal{H}}$ from $\mathcal{X}\times\mathcal{X}$ to $\mathbb{R}$ is measurable for all $v,w\in\mathcal{Y}$;
• there is $\kappa>0$ such that $\mathrm{Tr}(K_x^* K_x) \le \kappa$ for all $x\in\mathcal{X}$.

Moreover, we assume that the norms $\|f_k\|_{\mathcal{H}}$, $k=1,2,\ldots,l$, are under our control, so that we can fix a threshold $\gamma_l > 0$ and assume $\|f_k\|_{\mathcal{H}} \le \gamma_l$.

Further useful observations

We have $K_t^* K_x = K(t,x) \in \mathcal{L}_2(\mathcal{Y})$ for all $x,t\in\mathcal{X}$. Given $x\in\mathcal{X}$, the operator $T_x = K_x K_x^* \in \mathcal{L}_2(\mathcal{H})$ is a positive Hilbert–Schmidt operator, and the above assumptions ensure
$$\|T_x\|_{\mathcal{L}(\mathcal{H})} \le \|T_x\|_{\mathcal{L}_2(\mathcal{H})} = \|K(x,x)\|_{\mathcal{L}_2(\mathcal{Y})} \le \kappa.$$
Let $T_{q_X}:\mathcal{H}\to\mathcal{H}$ be $T_{q_X} = \int_{\mathcal{X}} T_x \, dq_X(x)$, where the integral converges in $\mathcal{L}_2(\mathcal{H})$ to a positive trace class operator with
$$\|T_{q_X}\|_{\mathcal{L}(\mathcal{H})} \le \|T_{q_X}\|_{\mathcal{L}_2(\mathcal{H})} \le \mathrm{Tr}(T_{q_X}) = \int_{\mathcal{X}} \mathrm{Tr}(T_x)\, dq_X(x) \le \kappa.$$
Following Proposition 1 in Caponnetto & De Vito (2007), the minimizer $f_q$ of the expected risk $R_q$ is a solution of the equation $T_{q_X} f_q = g$, where $g = \int_{\mathcal{X}} K_x f_q(x)\, dq_X(x) \in \mathcal{H}$, with the integral converging in $\mathcal{H}$. Next we define the empirical operators
$$T_{\mathbf{x}'} = \frac{1}{m}\sum_{j=1}^m K_{x'_j} K_{x'_j}^*, \qquad T_{\mathbf{x},\beta} = \frac{1}{n}\sum_{i=1}^n \beta(x_i) K_{x_i} K_{x_i}^*, \qquad g_{\mathbf{x},\mathbf{y},\beta} = \frac{1}{n}\sum_{i=1}^n \beta(x_i) K_{x_i} y_i.$$
In the sequel we adopt the convention that $C$ denotes a generic positive constant, which can vary from appearance to appearance and may depend only on basic parameters such as $p_X$, $q_X$, $\kappa$, $B$, $y_0$ and others introduced below, but not on $n$, $m$ or the error probability $\delta > 0$. We will need the following statements.

Lemma 1.
With probability at least $1-\delta$ we have
$$\|T_{q_X} - T_{\mathbf{x}'}\|_{\mathcal{L}(\mathcal{H})} \le \|T_{q_X} - T_{\mathbf{x}'}\|_{\mathcal{L}_2(\mathcal{H})} \le C \log^{\frac12}\tfrac{1}{\delta}\, m^{-\frac12},$$
$$\|T_{\mathbf{x}'} - T_{\mathbf{x},\beta}\|_{\mathcal{L}(\mathcal{H})} \le C \log^{\frac12}\tfrac{1}{\delta}\,\big(n^{-\frac12} + m^{-\frac12}\big),$$
$$\|T_{\mathbf{x},\beta} f_q - g_{\mathbf{x},\mathbf{y},\beta}\|_{\mathcal{H}} \le C \log^{\frac12}\tfrac{1}{\delta}\, n^{-\frac12},$$
where $C>0$ does not depend on $n$, $m$ and $\delta$.

The proof of Lemma 1 is based on Lemma 4 of Huang et al. (2006), which we formulate in our notation as follows.

Lemma 2 (Huang et al., 2006). Let $\phi$ be a map from $\mathcal{X}$ to $\mathcal{U}$ such that $\|\phi(x)\|_{\mathcal{U}} \le R$ for all $x\in\mathcal{X}$. Then with probability at least $1-\delta$ it holds that
$$\Big\| \frac{1}{m}\sum_{j=1}^m \phi(x'_j) - \frac{1}{n}\sum_{i=1}^n \beta(x_i)\phi(x_i) \Big\|_{\mathcal{U}} \le \Big(1 + \sqrt{2\log\tfrac{2}{\delta}}\Big)\, R\, \sqrt{\frac{B^2}{n} + \frac{1}{m}}.$$

Moreover, we will need a concentration inequality that follows from Pinelis (1992); see also Rosasco et al. (2010).

Lemma 3 (Concentration lemma). If $\xi_1,\xi_2,\ldots,\xi_n$ are zero-mean independent random variables with values in a separable Hilbert space $\mathcal{U}$, and for some $D>0$ one has $\|\xi_i\|_{\mathcal{U}} \le D$, $i=1,2,\ldots,n$, then the bound
$$\Big\| \frac{1}{n}\sum_{i=1}^n \xi_i \Big\|_{\mathcal{U}} \le \frac{D\sqrt{2\log\frac{2}{\delta}}}{\sqrt{n}}$$
holds true with probability at least $1-\delta$.

Proof of Lemma 1. Let us start by proving the first bound. Introduce the map $\xi:\mathcal{X}\to\mathcal{L}_2(\mathcal{H})$, $\xi(x) = K_x K_x^* - T_{q_X}$. From the bounds on $T_x$ and $T_{q_X}$ above it follows that
$$\|\xi(x)\|_{\mathcal{L}_2(\mathcal{H})} \le \|K_x K_x^*\|_{\mathcal{L}_2(\mathcal{H})} + \|T_{q_X}\|_{\mathcal{L}_2(\mathcal{H})} \le 2\kappa.$$
Moreover, we have $\int_{\mathcal{X}} \xi(x)\, dq_X(x) = \int_{\mathcal{X}} K_x K_x^*\, dq_X(x) - T_{q_X} = 0$. Therefore, for $x'_j$, $j=1,2,\ldots,m$, drawn i.i.d. from the marginal probability measure $q_X$, the corresponding operators $\xi_j = \xi(x'_j)$ can be treated as zero-mean independent random variables in $\mathcal{L}_2(\mathcal{H})$ satisfying the conditions of Lemma 3 with $D = 2\kappa$, and
$$\|T_{\mathbf{x}'} - T_{q_X}\|_{\mathcal{L}_2(\mathcal{H})} = \Big\| \frac{1}{m}\sum_{j=1}^m K_{x'_j}K_{x'_j}^* - T_{q_X} \Big\|_{\mathcal{L}_2(\mathcal{H})} = \Big\| \frac{1}{m}\sum_{j=1}^m \xi_j \Big\|_{\mathcal{L}_2(\mathcal{H})} \le \frac{2\kappa\sqrt{2\log\frac{2}{\delta}}}{\sqrt{m}}.$$
To obtain the second bound, for any $f\in\mathcal{H}$ we define the map $\phi = \phi_f:\mathcal{X}\to\mathcal{H}$ as $\phi_f(x) = K_x K_x^* f$. It is clear that $\|\phi_f(x)\|_{\mathcal{H}} \le \|K_x K_x^*\|_{\mathcal{L}(\mathcal{H})}\, \|f\|_{\mathcal{H}} \le \kappa\, \|f\|_{\mathcal{H}}$. Therefore, for the map $\phi = \phi_f$ the condition of Lemma 2 is satisfied with $R = \kappa\|f\|_{\mathcal{H}}$.
Then directly from that lemma, for any $f\in\mathcal{H}$ we have
$$\|T_{\mathbf{x}'}f - T_{\mathbf{x},\beta}f\|_{\mathcal{H}} = \Big\|\frac1m\sum_{j=1}^m \phi_f(x'_j) - \frac1n\sum_{i=1}^n \beta(x_i)\phi_f(x_i)\Big\|_{\mathcal{H}} \le \Big(1+\sqrt{2\log\tfrac2\delta}\Big)\sqrt{\frac{B^2}{n}+\frac1m}\;\kappa\|f\|_{\mathcal{H}} \le C\log^{\frac12}\tfrac1\delta\,\big(m^{-\frac12}+n^{-\frac12}\big)\|f\|_{\mathcal{H}},$$
which proves the second bound of Lemma 1. Consider now the map $F:\mathcal{X}\times\mathcal{Y}\to\mathcal{H}$ defined by $F(x,y) = \beta(x)K_x(f_p(x)-y)$. Recall that $\|K_x\|_{\mathcal{L}(\mathcal{Y},\mathcal{H})} \le \sqrt{\mathrm{Tr}(K_x^*K_x)} \le \sqrt{\kappa}$. Then we obtain
$$\|F(x,y)\|_{\mathcal{H}} \le \|K_x\|_{\mathcal{L}(\mathcal{Y},\mathcal{H})} \Big\| \int_{\mathcal{Y}} y'\, dp(y'|x) - y \Big\|_{\mathcal{Y}} |\beta(x)| \le 2 y_0 B\sqrt{\kappa}.$$
Moreover, for $p(x,y) = p(y|x)\,p_X(x)$ we have
$$\int_{\mathcal{X}\times\mathcal{Y}} F(x,y)\, dp(x,y) = \int_{\mathcal{X}} K_x\,\beta(x)\int_{\mathcal{Y}}\Big(\int_{\mathcal{Y}} y'\,dp(y'|x) - y\Big)\, dp(y|x)\, dp_X(x) = 0,$$
so that for $(x_i,y_i)$, $i=1,2,\ldots,n$, drawn i.i.d. from the measure $p(x,y)$, the corresponding values $F_i = F(x_i,y_i) = \beta(x_i)K_{x_i}(f_q(x_i)-y_i)$ are zero-mean independent random variables in $\mathcal{H}$ satisfying the conditions of Lemma 3 with $D = 2y_0 B\sqrt{\kappa}$, such that
$$\Big\|\frac1n\sum_{i=1}^n F_i\Big\|_{\mathcal{H}} = \Big\|\frac1n\sum_{i=1}^n\beta(x_i)K_{x_i}\big(f_q(x_i)-y_i\big)\Big\|_{\mathcal{H}} = \Big\|\frac1n\sum_{i=1}^n\beta(x_i)K_{x_i}K_{x_i}^* f_q - \frac1n\sum_{i=1}^n\beta(x_i)K_{x_i}y_i\Big\|_{\mathcal{H}} = \|T_{\mathbf{x},\beta}f_q - g_{\mathbf{x},\mathbf{y},\beta}\|_{\mathcal{H}} \le \frac{2y_0B\sqrt{\kappa}\sqrt{2\log\frac2\delta}}{\sqrt n}.$$
This gives the third bound of Lemma 1. $\square$

Aggregation for vector-valued functions

Next we construct a new approximant in the form of a linear combination of the approximants $f_1,f_2,\ldots,f_l$, computed for all tried parameter values:
$$f = \sum_{k=1}^l c_k f_k.$$
Since $f_1,f_2,\ldots,f_l$ belong to the RKHS $\mathcal{H}$, it is clear that $f\in\mathcal{H}$. Now we want to quantify how close we can get to $f_q$. Following Proposition 1 in Caponnetto & De Vito (2007), we have
$$R_q(f) - R_q(f_q) = \|f - f_q\|^2_{L^2(q_X)} = \big\|T_{q_X}^{1/2}(f-f_q)\big\|^2_{\mathcal{H}}.$$
Next we observe that the best approximation $f^*$ of the target regression function $f_q$ by linear combinations corresponds to the vector $c^* = (c_1^*,\ldots,c_l^*)$ of ideal coefficients that solves the linear system $\bar G c^* = \bar g$ with the Gram matrix $\bar G = \big(\langle T_{q_X}^{1/2} f_k, T_{q_X}^{1/2} f_u\rangle_{\mathcal{H}}\big)_{k,u=1}^l$ and the right-hand side vector $\bar g = \big(\langle T_{q_X}^{1/2} f_q, T_{q_X}^{1/2} f_k\rangle_{\mathcal{H}}\big)_{k=1}^l$.
Let us prove this observation in the next lemma. Note that the entries of $\bar G$ and $\bar g$ can equivalently be formulated in terms of $\langle\cdot,\cdot\rangle_{L^2(q_X)}$, as done in the main text. We use this formulation in the next lemma in order to be compatible with the main text (switching to the inner products in terms of $\mathcal{H}$ would not change the argument of the proof at all):

Lemma 4. The best $L^2(q_X)$-approximation $f^*$ of the target regression function $f_q$ by linear combinations corresponds to the vector $c^* = (c_1^*,\ldots,c_l^*) = \bar G^{-1}\bar g$.

Proof. Denote the squared approximation error by $E(c)$ and rewrite it appropriately:
$$E(c) = \Big\|\sum_{i=1}^l c_i f_i - f_q\Big\|^2_{L^2(q_X)} = \sum_{i,j=1}^l c_i c_j \langle f_i, f_j\rangle_{L^2(q_X)} - 2\sum_{i=1}^l c_i\langle f_i, f_q\rangle_{L^2(q_X)} + \langle f_q, f_q\rangle_{L^2(q_X)}.$$
Taking the derivative with respect to $c_i$ yields
$$\frac{\partial E(c)}{\partial c_i} = 2\Big(\sum_{j=1}^l c_j\langle f_i, f_j\rangle_{L^2(q_X)} - \langle f_i, f_q\rangle_{L^2(q_X)}\Big).$$
Setting these derivatives to zero (for all $i\in\{1,\ldots,l\}$) gives the claimed linear system. Noting that the Hessian equals $2\bar G$ (and is thus positive definite whenever $\bar G$ is invertible) ensures that $c^*$ is a global minimum of $E$. $\square$

But, of course, neither the Gram matrix $\bar G$ nor the vector $\bar g$ is accessible, because there is no access to the target measure $q_X$, so we switch to the empirical counterparts $\hat G$ and $\hat g$. The following lemma quantifies the error made by the empirical averages:

Lemma 5. With probability $1-\delta$ we have
$$\Big|\big\langle T_{q_X}^{1/2} f_u, T_{q_X}^{1/2} f_k\big\rangle_{\mathcal{H}} - \frac1m\sum_{j=1}^m\big\langle f_k(x'_j), f_u(x'_j)\big\rangle_{\mathcal{Y}}\Big| \le C\log^{\frac12}\tfrac1\delta\, m^{-\frac12},$$
$$\Big|\big\langle T_{q_X}^{1/2} f_k, T_{q_X}^{1/2} f_q\big\rangle_{\mathcal{H}} - \frac1n\sum_{i=1}^n\beta(x_i)\big\langle f_k(x_i), y_i\big\rangle_{\mathcal{Y}}\Big| \le C\log^{\frac12}\tfrac1\delta\,\big(n^{-\frac12}+m^{-\frac12}\big),$$
where $C>0$ does not depend on $n$, $m$ and $\delta$.

Proof.
Keeping in mind that $f_q, f_k \in \mathcal{H}$, we have
$$\big\langle T_{q_X}^{1/2} f_u, T_{q_X}^{1/2} f_k\big\rangle_{\mathcal{H}} = \langle T_{\mathbf{x}'} f_k, f_u\rangle_{\mathcal{H}} + \langle (T_{q_X}-T_{\mathbf{x}'})f_u, f_k\rangle_{\mathcal{H}} = \frac1m\sum_{j=1}^m \big\langle K_{x'_j}K^*_{x'_j}f_k, f_u\big\rangle_{\mathcal{H}} + \langle(T_{q_X}-T_{\mathbf{x}'})f_u, f_k\rangle_{\mathcal{H}}$$
$$= \frac1m\sum_{j=1}^m\big\langle K^*_{x'_j}f_k, K^*_{x'_j}f_u\big\rangle_{\mathcal{Y}} + \langle(T_{q_X}-T_{\mathbf{x}'})f_u, f_k\rangle_{\mathcal{H}} = \frac1m\sum_{j=1}^m\big\langle f_k(x'_j), f_u(x'_j)\big\rangle_{\mathcal{Y}} + \langle(T_{q_X}-T_{\mathbf{x}'})f_u,f_k\rangle_{\mathcal{H}}.$$
Moreover, from the first bound of Lemma 1, with probability $1-\delta$ we have
$$\big|\langle(T_{q_X}-T_{\mathbf{x}'})f_u, f_k\rangle_{\mathcal{H}}\big| \le C\,\|f_u\|_{\mathcal{H}}\|f_k\|_{\mathcal{H}}\,\log^{\frac12}\tfrac1\delta\, m^{-\frac12}.$$
Then
$$\Big|\big\langle T_{q_X}^{1/2} f_u, T_{q_X}^{1/2} f_k\big\rangle_{\mathcal{H}} - \frac1m\sum_{j=1}^m\big\langle f_k(x'_j), f_u(x'_j)\big\rangle_{\mathcal{Y}}\Big| \le C\log^{\frac12}\tfrac1\delta\, m^{-\frac12}.$$
Now we prove the second statement of Lemma 5. We have
$$\big\langle T_{q_X}^{1/2} f_k, T_{q_X}^{1/2} f_q\big\rangle_{\mathcal{H}} = \langle f_k, T_{q_X}f_q\rangle_{\mathcal{H}} = \langle f_k, T_{q_X}f_q - g_{\mathbf{x},\mathbf{y},\beta}\rangle_{\mathcal{H}} + \langle f_k, g_{\mathbf{x},\mathbf{y},\beta}\rangle_{\mathcal{H}}$$
$$= \frac1n\sum_{i=1}^n \beta(x_i)\langle f_k, K_{x_i}y_i\rangle_{\mathcal{H}} + \langle f_k, T_{q_X}f_q - g_{\mathbf{x},\mathbf{y},\beta}\rangle_{\mathcal{H}} = \frac1n\sum_{i=1}^n \beta(x_i)\big\langle K^*_{x_i}f_k, y_i\big\rangle_{\mathcal{Y}} + \langle f_k, T_{q_X}f_q - g_{\mathbf{x},\mathbf{y},\beta}\rangle_{\mathcal{H}}$$
$$= \frac1n\sum_{i=1}^n \beta(x_i)\big\langle f_k(x_i), y_i\big\rangle_{\mathcal{Y}} + \langle f_k, T_{q_X}f_q - g_{\mathbf{x},\mathbf{y},\beta}\rangle_{\mathcal{H}}.$$
From Lemma 1, with probability $1-\delta$ we have
$$\|T_{q_X}f_q - g_{\mathbf{x},\mathbf{y},\beta}\|_{\mathcal{H}} \le \|T_{q_X}f_q - T_{\mathbf{x}'}f_q\|_{\mathcal{H}} + \|T_{\mathbf{x}'}f_q - T_{\mathbf{x},\beta}f_q\|_{\mathcal{H}} + \|T_{\mathbf{x},\beta}f_q - g_{\mathbf{x},\mathbf{y},\beta}\|_{\mathcal{H}}$$
$$\le C\|f_q\|_{\mathcal{H}}\log^{\frac12}\tfrac1\delta\, m^{-\frac12} + C\|f_q\|_{\mathcal{H}}\log^{\frac12}\tfrac1\delta\,\big(n^{-\frac12}+m^{-\frac12}\big) + C\log^{\frac12}\tfrac1\delta\, n^{-\frac12}.$$
Then $\big|\langle f_k, T_{q_X}f_q - g_{\mathbf{x},\mathbf{y},\beta}\rangle_{\mathcal{H}}\big| \le C\,\|f_k\|_{\mathcal{H}}\,\log^{\frac12}\tfrac1\delta\,\big(n^{-\frac12}+m^{-\frac12}\big)$. Therefore,
$$\Big|\big\langle T_{q_X}^{1/2} f_k, T_{q_X}^{1/2} f_q\big\rangle_{\mathcal{H}} - \frac1n\sum_{i=1}^n\beta(x_i)\big\langle f_k(x_i), y_i\big\rangle_{\mathcal{Y}}\Big| \le C\log^{\frac12}\tfrac1\delta\,\big(n^{-\frac12}+m^{-\frac12}\big). \qquad\square$$

Towards our main generalization bound

Next we use similar arguments as in Theorem 4 of Gizewski et al. (2022) to obtain our main result, Theorem 1. Lemma 5 suggests approximating $\bar G$ and $\bar g$ by their empirical counterparts
$$\hat G = \Big(\frac1m\sum_{j=1}^m\big\langle f_k(x'_j), f_u(x'_j)\big\rangle_{\mathcal{Y}}\Big)_{k,u=1}^l, \qquad \hat g = \Big(\frac1n\sum_{i=1}^n \beta(x_i)\big\langle y_i, f_k(x_i)\big\rangle_{\mathcal{Y}}\Big)_{k=1}^l,$$
which can be computed effectively from the data samples.
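The empirical quantities $\hat G$ and $\hat g$ are exactly what the aggregation algorithm needs. As a minimal numpy sketch (variable names are ours; the paper's implementation uses `numpy.linalg.pinv` with an `rcond` threshold, see Appendix D.4), the weights can be computed as follows:

```python
import numpy as np

def aggregation_weights(F_tgt, F_src, Y_src, beta, rcond=0.1):
    """Estimate aggregation weights c from the empirical Gram system.

    F_tgt: (l, m, d) model outputs on the unlabeled target sample x'_1..x'_m
    F_src: (l, n, d) model outputs on the source inputs x_1..x_n
    Y_src: (n, d)    source labels y_1..y_n
    beta:  (n,)      density-ratio values beta(x_1..x_n)
    """
    m = F_tgt.shape[1]
    # G_hat[k, u] = (1/m) * sum_j <f_k(x'_j), f_u(x'_j)>_Y
    G = np.einsum('kjd,ujd->ku', F_tgt, F_tgt) / m
    # g_hat[k] = (1/n) * sum_i beta(x_i) * <y_i, f_k(x_i)>_Y
    g = np.einsum('i,kid,id->k', beta, F_src, Y_src) / len(Y_src)
    # regularized inversion of the (possibly ill-conditioned) Gram matrix
    return np.linalg.pinv(G, rcond=rcond) @ g

# toy check: two identical copies of the perfect model should split the weight
rng = np.random.default_rng(0)
d, n, m = 3, 200, 200
Y_src = rng.normal(size=(n, d))
F_src = np.stack([Y_src, Y_src])   # both models reproduce the labels exactly
F_tgt = rng.normal(size=(2, m, d))
F_tgt[1] = F_tgt[0]                # identical on the target sample, too
c = aggregation_weights(F_tgt, F_src, Y_src, beta=np.ones(n))
```

With two identical models the Gram matrix is rank deficient; the pseudo-inverse with an `rcond` cutoff handles this and assigns both models the same weight.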
Moreover, again from Lemma 5 we can argue that with probability $1-\delta$ it holds that
$$\|\bar g - \hat g\|_{\mathbb{R}^l} \le C\log^{\frac12}\tfrac1\delta\,\big(n^{-\frac12}+m^{-\frac12}\big), \qquad \|\bar G - \hat G\|_{\mathcal{L}(\mathbb{R}^l)} \le C\log^{\frac12}\tfrac1\delta\, m^{-\frac12}.$$
With the matrix $\hat G$ at hand, one can easily check whether or not it is well-conditioned and $\hat G^{-1}$ exists (otherwise one needs to remove models with similar performance). The norms $\|\hat G\|_{\mathcal{L}(\mathbb{R}^l)}$ and $\|\hat G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}$ can then be bounded independently of $m$ and $n$, because all entries of $\hat G$ can be bounded as follows:
$$|\hat G_{k,u}| \le \frac1m\sum_{j=1}^m \big|\big\langle f_k(x'_j), f_u(x'_j)\big\rangle_{\mathcal{Y}}\big| = \frac1m\sum_{j=1}^m \big|\big\langle K^*_{x'_j}f_k, K^*_{x'_j}f_u\big\rangle_{\mathcal{Y}}\big| = \frac1m\sum_{j=1}^m \big|\big\langle T_{x'_j}f_k, f_u\big\rangle_{\mathcal{H}}\big| \le \frac1m\sum_{j=1}^m \|T_{x'_j}\|_{\mathcal{L}(\mathcal{H})}\|f_k\|_{\mathcal{H}}\|f_u\|_{\mathcal{H}} \le \kappa\gamma_l^2,$$
where we used the reproducing property to obtain the first equality and the bound $\|T_x\|_{\mathcal{L}(\mathcal{H})}\le\kappa$ for the last inequality. Now assume that $m$ is so large that with probability $1-\delta$ we have
$$\|\bar G - \hat G\|_{\mathcal{L}(\mathbb{R}^l)} < \frac{1}{\|\hat G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}}.$$
Moreover, we can use the following simple manipulation:
$$\bar G^{-1} = \hat G^{-1}\big(\bar G\hat G^{-1}\big)^{-1} = \hat G^{-1}\Big(I - \big(I - \bar G\hat G^{-1}\big)\Big)^{-1} = \hat G^{-1}\Big(I - \big(\hat G-\bar G\big)\hat G^{-1}\Big)^{-1}.$$
Then the previous condition ensures that the Neumann series for $\big(I-(\hat G-\bar G)\hat G^{-1}\big)^{-1}$ converges, and we obtain the following bound:
$$\|\bar G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)} \le \frac{\|\hat G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}}{1 - \|\hat G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}\,\|\bar G-\hat G\|_{\mathcal{L}(\mathbb{R}^l)}} = O(1).$$
Now we are in the position to prove our main generalization bound (Eq. (4)) for unsupervised domain adaptation.

Proof of Theorem 1. We have already seen that the coefficients of the best approximation $f^*$ to $f_q$ are given by $c^* = (c_1^*,c_2^*,\ldots,c_l^*) = \bar G^{-1}\bar g$, while Algorithm 1 computes $\hat c = \hat G^{-1}\hat g$. Since
$$\bar G^{-1}(\hat g - \bar g) + \bar G^{-1}(\bar G - \hat G)\hat c = \bar G^{-1}\hat g - c^* + \hat c - \bar G^{-1}\hat g = \hat c - c^*,$$
the above estimates yield, with probability $1-\delta$,
$$\|\hat c - c^*\|_{\mathbb{R}^l} \le \|\bar G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}\Big(\|\hat g-\bar g\|_{\mathbb{R}^l} + \|\bar G-\hat G\|_{\mathcal{L}(\mathbb{R}^l)}\,\|\hat c\|_{\mathbb{R}^l}\Big) \le C\log^{\frac12}\tfrac1\delta\,\big(n^{-\frac12}+m^{-\frac12}\big).$$
Moreover, with $\hat f = \sum_{k=1}^l \hat c_k f_k$ denoting the aggregation computed by Algorithm 1,
$$R_q(\hat f) - R_q(f_q) = \big\|T_{q_X}^{1/2}(\hat f - f_q)\big\|^2_{\mathcal{H}} \le \Big(\big\|T_{q_X}^{1/2}(f^*-f_q)\big\|_{\mathcal{H}} + \big\|T_{q_X}^{1/2}(\hat f-f^*)\big\|_{\mathcal{H}}\Big)^2 \le 2\big\|T_{q_X}^{1/2}(f^*-f_q)\big\|^2_{\mathcal{H}} + 2\big\|T_{q_X}^{1/2}(\hat f-f^*)\big\|^2_{\mathcal{H}}$$
$$= 2\big(R_q(f^*)-R_q(f_q)\big) + 2\big\|T_{q_X}^{1/2}(\hat f-f^*)\big\|^2_{\mathcal{H}} \le 2\big(R_q(f^*)-R_q(f_q)\big) + 2\Big(\sum_{k=1}^l |c_k^*-\hat c_k|\,\big\|T_{q_X}^{1/2}f_k\big\|_{\mathcal{H}}\Big)^2$$
$$\le 2\big(R_q(f^*)-R_q(f_q)\big) + 2l\,\|c^*-\hat c\|^2_{\mathbb{R}^l}\,\max_k\big\|T_{q_X}^{1/2}f_k\big\|^2_{\mathcal{H}} \le 2\big(R_q(f^*)-R_q(f_q)\big) + 2\big\|T_{q_X}^{1/2}\big\|^2_{\mathcal{L}(\mathcal{H})}\, l\,\gamma_l^2\,\|c^*-\hat c\|^2_{\mathbb{R}^l}.$$
The statement of the theorem now follows by inserting the bound on $\|\hat c - c^*\|_{\mathbb{R}^l}$ and using again the inequality $(a+b)^2 \le 2(a^2+b^2)$. $\square$

On the dependence of the error bound on the number $l$ of models

An interesting question is how the bound in Eq. (4) depends on $l$. To this end, consider the second term in the last line of the estimate above: $\|T_{q_X}^{1/2}\|_{\mathcal{L}(\mathcal{H})}$ is the norm of a sampling-type operator and thus does not depend on the number of models, and the same holds for $\gamma_l$, which is just a uniform bound on all our models. To analyze $\|c^*-\hat c\|^2_{\mathbb{R}^l}$, consider the individual factors in the first inequality of the bound on $\|\hat c - c^*\|_{\mathbb{R}^l}$. To not overload notation, $C>0$ is used here for any absolute constant that is independent of $l$, $m$, $n$ and $\delta$.

• $\|\bar G-\hat G\|_{\mathcal{L}(\mathbb{R}^l)}$: the individual entries of $\bar G-\hat G$ are bounded in absolute value by Lemma 5. The proof arguments only involve norm bounds on the associated sampling operators, the uniform bound $\gamma_l$ on all the models, and the bound $B$ on $\beta$, so the absolute constant $C$ there is independent of $l$. By the definition of the matrix Frobenius norm, we thus get $\|\bar G-\hat G\|_{\mathcal{L}(\mathbb{R}^l)} \le C\, l\, \log^{\frac12}\tfrac1\delta\,(n^{-\frac12}+m^{-\frac12})$.
• $\|\hat g-\bar g\|_{\mathbb{R}^l}$: similar arguments lead to $\|\hat g-\bar g\|_{\mathbb{R}^l} \le C\sqrt{l}\,\log^{\frac12}\tfrac1\delta\,(n^{-\frac12}+m^{-\frac12})$.
• $\|\bar G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}$: it is natural to assume that there is some constant $c \ge \|\hat G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)}$; otherwise we can, e.g., orthogonalize our models and coefficients without changing the aggregation, but reducing the condition number (i.e., reducing $\|\hat G^{-1}\|$). It is also natural to assume that $m$ and $n$ are large enough that $l(n^{-\frac12}+m^{-\frac12}) < \frac{1}{2c}$. Then, inserting the above bound on $\|\bar G-\hat G\|_{\mathcal{L}(\mathbb{R}^l)}$ into the Neumann-series estimate, we can deduce that $\|\bar G^{-1}\|_{\mathcal{L}(\mathbb{R}^l)} \le 2c$.
• $\|\hat c\|_{\mathbb{R}^l}$: this quantity can also be assumed to be bounded independently of $l$, since it is computed from the data only.

Combining the previous points gives
$$\|c^*-\hat c\|^2_{\mathbb{R}^l} \le C\, l^2\,\log\tfrac1\delta\,\big(n^{-1}+m^{-1}\big),$$
which finally leads to the refined bound
$$R_q(\hat f) - R_q(f_q) \le 2\big(R_q(f^*)-R_q(f_q)\big) + C\, l^3\,\log\tfrac1\delta\,\big(n^{-1}+m^{-1}\big)$$
for sufficiently large $l$, $m$, $n$ and error probability $\delta>0$.

B CONSTRUCTION OF FUNCTION SPACES

Let us briefly discuss the construction of the function space required in the previous Section A, the reproducing kernel Hilbert space $\mathcal{H}$. As mentioned in the main text, explicit knowledge of $\mathcal{H}$ is not required; we only need to rely on its existence. First, any of our models $f$ can be regarded as an element of some reproducing kernel Hilbert space (RKHS) satisfying the assumptions of Section A. This is immediate if $f:\mathcal{X}\to\mathbb{R}$ is a real-valued continuous function and we take $k(x,y) = f(x)f(y)$ as the associated reproducing kernel. In the case $f:\mathcal{X}\to\mathcal{Y}$ with $\mathcal{Y}$ finite dimensional, it is not hard to see that a similar construction is possible, as this case can again be reduced to the construction of a kernel with real-valued output; see, e.g., Remark 1 in Caponnetto & De Vito (2007) for details. Overall we end up with a finite sequence of spaces $(\mathcal{H}_k)_{k=1}^{l+1}$ of functions living on the same domain $\mathcal{X}$ (we have $l+1$ spaces since we also account for the regression function), and the existence of an RKHS containing all given models and the regression function is not a real restriction. For example, in the case of real-valued functions, this assumption is automatically satisfied, as linear combinations of functions with the same domain stemming from a finite sequence of RKHSs again belong to an RKHS. This follows from a classical result by N. Aronszajn and R. Godement; see, e.g., Pereverzyev (2022, Theorem 1.4). There is also ongoing research on constructing function spaces (and especially associated reproducing kernels) for families of neural networks used in applications; see, e.g., Ma & Wu (2022) for ReLU networks, Fermanian et al. (2021) for recurrent networks, and Bietti & Mairal (2017; 2019) for convolutional neural networks. Incorporating these into our work may lead to refined generalization bounds that also reflect the nature of our models. We leave the details for future work.
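For concreteness, the rank-one construction mentioned above can be written out as follows (a sketch for real-valued models; the kernel names are ours):

```latex
% Each continuous model f_k : X -> R generates an RKHS H_k via the
% rank-one reproducing kernel
k_k(x, t) = f_k(x)\, f_k(t), \qquad
\mathcal{H}_k = \{\, c\, f_k : c \in \mathbb{R} \,\}.
% By the Aronszajn--Godement sum construction, the kernel
K(x, t) = \sum_{k=1}^{l} f_k(x)\, f_k(t) \;+\; f_q(x)\, f_q(t)
% generates an RKHS containing all models f_1, ..., f_l and the
% regression function f_q, since the RKHS of a sum of kernels
% contains each summand space.
```

This makes explicit why the existence of a common RKHS is not a restriction for real-valued continuous models.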

C DATASETS

This section provides an overview of all applied datasets from the language, image, and time-series domains.

Illustrative example For the illustrative example (Figure 1 in the main paper) we rely on the following setting, taken from Shimodaira (2000); Sugiyama et al. (2007); You et al. (2019): the data points are labelled by $y = \frac{\sin(\pi x)}{\pi x}$ with additive noise sampled from the normal distribution $\mathcal{N}\big(0, (\tfrac14)^2\big)$. Moreover, $p_X \sim \mathcal{N}(1, \tfrac14)$ and $q_X \sim \mathcal{N}(2, \tfrac12)$. The density ratio $\beta$ can be computed analytically and is bounded. We aggregate several linear models with our approach and compare the result to the optimal linear model, whose coefficients have been evaluated using a computer algebra system.

Academic Dataset We rely on the Transformed Moons dataset (Zellinger et al., 2021), which allows us to visualize and address low-dimensional input data. The dataset consists of two-dimensional input points forming two classes with a "moon-shaped" support. The shift from source to target domain is simulated by a transformation in input space, as depicted in Figure 3. The results are shown in the following table.

Language Dataset To evaluate our method on a language task, we rely on the Amazon Reviews dataset (Blitzer et al., 2006). This dataset consists of text reviews from four domains: books (B), DVDs (D), electronics (E), and kitchen appliances (K). Reviews are encoded as 5000-dimensional feature vectors of bag-of-words unigrams and bigrams with binary labels: label 0 if the product is rated 1 to 3 stars, and label 1 if it is rated 4 or 5 stars. From the four categories, we obtain twelve domain adaptation tasks, where each category serves once as source domain and once as target domain (e.g., see Table 15). We follow similar data splits as previous works (Chen et al., 2012; Louizos et al., 2016; Ganin et al., 2016). In particular, we use 4000 labeled source examples and 4000 unlabeled target examples for training, and over 1000 examples for testing.
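Returning to the illustrative example above, its data-generating process can be sketched in a few lines of numpy. Note the hedge: we follow the classical variance convention of Shimodaira (2000)/Sugiyama et al. (2007), with source marginal $\mathcal{N}(1,(\tfrac12)^2)$ and target marginal $\mathcal{N}(2,(\tfrac14)^2)$, under which the density ratio is indeed bounded; the exact parameterization is our assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

# source/target marginals; second value is the standard deviation (assumption)
SRC_MU, SRC_SD = 1.0, 0.5
TGT_MU, TGT_SD = 2.0, 0.25

def labels(x):
    """Noisy sinc labels y = sin(pi x)/(pi x) + N(0, (1/4)^2)."""
    return np.sinc(x) + rng.normal(0.0, 0.25, size=np.shape(x))

def gauss_pdf(x, mu, sd):
    return np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

def beta(x):
    """Analytic density ratio q_X(x)/p_X(x); bounded because SRC_SD > TGT_SD."""
    return gauss_pdf(x, TGT_MU, TGT_SD) / gauss_pdf(x, SRC_MU, SRC_SD)

x_src = rng.normal(SRC_MU, SRC_SD, size=200)   # labeled source inputs
y_src = labels(x_src)
x_tgt = rng.normal(TGT_MU, TGT_SD, size=200)   # unlabeled target inputs
```

Because the source marginal is wider than the target one, the ratio decays to zero in both tails, so the bounded-density-ratio assumption holds for this toy problem.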
Image Dataset Our third dataset is MiniDomainNet, which is based on the DomainNet-2019 dataset (Peng et al., 2019) consisting of six different image domains (Quickdraw: Q, Real: R, Clipart: C, Sketch: S, Infograph: I, and Painting: P). We follow Zellinger et al. (2021) and rely on the reduced version of DomainNet-2019, referred to as MiniDomainNet, which restricts the classes to the five largest representatives in the training set across all six domains. To further improve computation time, we rely on an ImageNet (Krizhevsky et al., 2012) pre-trained ResNet-18 (He et al., 2016) backbone. We therefore assume that the backbone has learned lower-level filters suitable for the "Real" image category, so we only need to adapt to the remaining five domains (e.g., Clipart, Sketch). This results in five domain adaptation tasks.

Time-Series Dataset

We based our time-series experiments on the four datasets included in the AdaTime benchmark suite (Ragab et al., 2022): UCI-HAR, WISDM, HHAR, and Sleep-EDF. Together, these four representative datasets span 20 cross-domain real-world scenarios, covering human activity recognition and sleep stage classification. The first dataset is the Human Activity Recognition (HAR) dataset (Anguita et al., 2013) from the UC Irvine Repository, denoted UCI-HAR, which contains data from three motion sensors (accelerometer, gyroscope and body-worn sensors) gathered via smartphones from 30 different subjects. It classifies their activities into several categories, namely walking, walking upstairs, walking downstairs, standing, sitting, and lying down. The WISDM dataset (Kwapisz et al., 2011) is a class-imbalanced dataset collected from accelerometer sensors (including GPS data) of 29 different subjects performing similar activities as in UCI-HAR. The Heterogeneity Human Activity Recognition (HHAR) dataset (Stisen et al., 2015) investigates sensor-, device- and workload-specific heterogeneities using 36 smartphones and smartwatches, covering 13 different device models from four manufacturers. Finally, the sleep stage classification setting aims to classify electroencephalography (EEG) signals into five stages, i.e., Wake (W), Non-Rapid Eye Movement stages (N1, N2, N3), and Rapid Eye Movement (REM). Analogous to Ragab et al. (2022) and Eldele et al. (2021), we adopt the Sleep-EDF-20 dataset obtained from PhysioBank (Goldberger et al., 2000), which contains EEG readings from 20 healthy subjects. For all datasets, each subject is treated as its own domain, and we adapt from a source subject to a target subject.

D EXPERIMENTAL SETUP

This section is meant to provide further details on the overall computational setting of our experiments. We start by giving an overview on the used computational resources for the specific datasets and the implementation tools. Next, we describe the network architectures for the individual datasets in greater detail. In the third subsection we elaborate on the construction of our models, and the fourth subsection is devoted to matrix inversion. Finally, in the last subsection, we describe the detailed empirical results and give the complete tables.

D.1 COMPUTATIONAL RESOURCES AND IMPLEMENTATIONS

Overall, to compute the results in our tables, we trained 16680 models with an approximate computational budget of 1500 GPU-hours on one high-performance computing station with 8×NVIDIA P100 16GB GPUs, 512GB RAM, and a 40-core Xeon(R) CPU E5-2698 v4 @ 2.20GHz running CentOS Linux 7.

D.2 ARCHITECTURES AND TRAINING SETUP

In this subsection, we provide details on the model architectures and the training setup for every dataset. Our base architectures follow the AdaTime benchmark suite, a large-scale evaluation of domain adaptation algorithms on time-series data. We extended the benchmark suite to support 11 state-of-the-art model architectures on multiple dataset types ranging from language and image to time-series data, covering Transformed Moons, Amazon Reviews, MiniDomainNet and the four time-series datasets (UCI-HAR, WISDM, HHAR, and Sleep-EDF), spanning in total 38 cross-domain real-world scenarios.

Transformed Moons For the Transformed Moons dataset we use two sequential blocks with fully-connected layers, 1D batch normalization, ReLU activations and dropout. The full architecture specification can be found in Table 3. The domain classifier (density ratio estimator) uses the same architecture. We train the class prediction models for 50 epochs and the domain classifier for 80 epochs with learning rate 0.001, weight decay 0.0001 and batch size 128 using the Adam optimizer (Kingma & Ba, 2014). We share the same base architecture and training setup across every domain adaptation method (e.g., DANN, HoMM, CMD). Additional hyper-parameters are reported in Table 9.

Amazon Reviews For the Amazon Reviews dataset we use two sequential blocks with fully-connected layers, 1D batch normalization, ReLU activations and dropout, analogous to the setup for Transformed Moons. We also use the same architecture for the domain classifier. We train the class prediction models for 50 epochs and the domain classifier for 80 epochs with learning rate 0.001, weight decay 0.0001 and batch size 128 using the Adam optimizer (Kingma & Ba, 2014). We share the same base architecture and training setup across every domain adaptation method (e.g., DANN, HoMM, CMD). Additional hyper-parameters are reported in Table 9.

MiniDomainNet Following the pre-trained setup from Peng et al. (2019), we use a frozen ResNet-18 backbone trained on ImageNet and operate all subsequent computations on the 512-dimensional extracted features. To alleviate overfitting effects on pre-computed features, we perform data augmentation on the images of each batch and forward each batch through the backbone. We incorporate zero padding before resizing the images to 256x256 to avoid image distortions. Furthermore, in alignment with the data augmentation techniques of Shorten & Khoshgoftaar (2019), we perform random resized cropping to 224x224 with a random viewport between 70% and 100% of the original image, random horizontal flipping, color jittering of 0.25% on each RGB channel, and a ±2 degree rotation. After the ResNet-18 backbone output, we add a projection layer and define the domain adaptation layers on which the domain adaptation methods align the representations. The backbone and projection layers form a common architecture across the different domain adaptation methods. Additional layers are added for the classification networks according to the requirements of the individual domain adaptation methods (e.g., CMD, HoMM). The number of layers/neurons in the upper layers of our architecture has been tuned to achieve the best performance in the source-only setup. See Table 5 for a detailed description of the architecture used. We perform experiments on all 5 domain adaptation tasks as defined in Section C for each of the previously listed methods, with 3 repetitions based on different random weight initializations. All class prediction models have been trained for 60 epochs and domain classifiers for 100 epochs with the Adam optimizer, a learning rate of 0.001, $\beta_1 = 0.9$, $\beta_2 = 0.999$, batch size 128 and weight decay 0.0001. Additional hyper-parameters are reported in Table 9.

AdaTime Unless stated otherwise, we follow the implementation and hyper-parameter settings reported in Ragab et al. (2022). We extended the AdaTime suite to comprise a collection of 11 domain adaptation algorithms. We trained domain adaptation models according to the following approaches (see also Tables 6, 7 and 8): Deep Domain Confusion (DDC) (Tzeng et al., 2014), Correlation Alignment via Deep Neural Networks (Deep-Coral) (Sun et al., 2017), Higher-order Moment Matching (HoMM) (Chen et al., 2020), Minimum Discrepancy Estimation for Deep Domain Adaptation (MMDA) (Rahman et al., 2020), Central Moment Discrepancy (CMD) (Zellinger et al., 2017), Deep Subdomain Adaptation (DSAN) (Zhu et al., 2021), Domain-Adversarial Neural Networks (DANN) (Ganin et al., 2016), Conditional Adversarial Domain Adaptation (CDAN) (Long et al., 2018), A DIRT-T Approach to Unsupervised Domain Adaptation (DIRT) (Shu et al., 2018), Convolutional deep Domain Adaptation model for Time-Series data (CoDATS) (Wilson et al., 2020), and Adversarial Spectral Kernel Matching (AdvSKM) (Liu & Xue, 2021). The backbone architecture of all models is a 1D-CNN network. It consists of three CNN blocks, each with a 1D convolutional layer followed by 1D batch normalization, a ReLU activation, 1D max pooling and dropout. In the first block, the kernel size of the convolutional layer is set according to the dataset, as reported in Ragab et al. (2022). After the convolutional blocks, we apply a 1D adaptive pooling layer. All methods are trained for 100 epochs on all datasets. The batch size is 32, except for Sleep-EDF, where we use a batch size of 128. All models are trained with the Adam optimizer (Kingma & Ba, 2014) and weight decay of $10^{-4}$. Additional hyper-parameters are reported in Table 10.
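The random resized cropping described for MiniDomainNet above can be sketched without any deep-learning framework; the following numpy-only function (our own simplified implementation, using nearest-neighbour resampling instead of interpolation) illustrates the 70–100% viewport sampling:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_resized_crop(img, out_size=224, scale=(0.7, 1.0)):
    """Crop a random viewport covering a `scale` fraction of the image area,
    then resize it to (out_size, out_size) via nearest-neighbour sampling."""
    h, w = img.shape[:2]
    frac = rng.uniform(*scale)                       # viewport area fraction
    ch, cw = int(h * np.sqrt(frac)), int(w * np.sqrt(frac))
    top = rng.integers(0, h - ch + 1)
    left = rng.integers(0, w - cw + 1)
    crop = img[top:top + ch, left:left + cw]
    rows = np.arange(out_size) * ch // out_size      # nearest-neighbour indices
    cols = np.arange(out_size) * cw // out_size
    return crop[rows][:, cols]

img = rng.random((256, 256, 3))   # stand-in for a zero-padded, resized image
aug = random_resized_crop(img)
```

In the actual experiments this step would be combined with horizontal flipping, color jittering and small rotations, as described above.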

D.3 MODEL SEQUENCE

Our algorithm, IWA, constructs an ensemble from a sequence of different classifiers, e.g., obtained from a sequence of possible hyper-parameter configurations of domain adaptation algorithms. To obtain this sequence of models, we train multiple models for every domain adaptation task across all datasets with different hyper-parameter choices. For the experiments on the language, image and time-series datasets, we multiply each hyper-parameter by a set of scaling factors λ ∈ {0, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, 10} to obtain a sequence.
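The scaling scheme above amounts to a simple grid construction; a short sketch (the base hyper-parameter values here are illustrative placeholders, not the paper's actual settings):

```python
# illustrative base values only; the real ones are method- and dataset-specific
base_hparams = {"learning_rate": 0.001, "trade_off": 1.0}

lambdas = [0, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, 10]

def scaled_configs(base, lams):
    """One configuration per scaling factor: every hyper-parameter of the
    base configuration is multiplied by the same factor lam."""
    return [{k: v * lam for k, v in base.items()} for lam in lams]

configs = scaled_configs(base_hparams, lambdas)  # 14 configs -> 14 models
```

Each configuration yields one trained model, so every domain adaptation method contributes a sequence of 14 candidate models to the aggregation.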

D.4 MATRIX INVERSION

Matrix inversion is a well-known numerical task, especially in cases of limited computing precision and ill-conditioned matrices. In our case, similar models in the given sequence can cause numerical instability due to limited compute precision; that is, occasionally a naive inversion of the matrix $\hat G$ in Algorithm 1 is numerically unstable. Various standard approaches can be applied to handle this common issue, including the exclusion of similar models and various regularization techniques. In our computational setup, we rely on the Python routine numpy.linalg.pinv, which is based on the eigendecomposition of $\hat G$ (coinciding with the singular value decomposition in our case due to positive semi-definiteness) and an eigenvalue-based regularization that truncates eigenvalues below a threshold rcond; see Strang (1980, pages 138–140) for details. The choice of rcond depends on the scale of the Gram matrix and can therefore be made using source data only. Based on evaluating our method on source data only (with the target domain fixed to be the source domain) for several choices of rcond, we obtain a stable choice of rcond = $10^{-1}$ for all datasets.

D.5 CORRELATION ANALYSIS

For a given sequence $f_1,\ldots,f_l$ of models, we compute the Pearson correlation coefficient between the aggregation weights $c_1,\ldots,c_l$ (see Algorithm 1) and the corresponding target accuracies $\frac1t\sum_{i=1}^t \mathbb{1}[y_i = f_1(x_i)], \ldots, \frac1t\sum_{i=1}^t \mathbb{1}[y_i = f_l(x_i)]$ on the target test data $(x_1,y_1),\ldots,(x_t,y_t)$, where $\mathbb{1}[P] = 1$ iff $P$ is true and $\mathbb{1}[P] = 0$ otherwise. In this context, a positive correlation coefficient means that the aggregation algorithm assigns higher weights to models performing better on target samples. We calculate these coefficients for our method IWA and the other linear regression baselines SOR, TCR, and TMR for all domain adaptation methods across all datasets and cross-domain scenarios.
Note that for this analysis we cannot compare to the remaining baseline TMV, as its count-based aggregation by majority voting does not involve the computation of aggregation weights. In Figures 4 and 6 we compare the resulting distribution of correlation coefficients for our method IWA to those of the heuristic baselines SOR, TCR, and TMR. Figure 5 compares the correlation coefficients of IWA across datasets. We find that IWA shows a stronger positive correlation between a model's target accuracy and its aggregation weight than the other methods.
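The correlation statistic used in this analysis is the plain sample Pearson coefficient; a minimal sketch with toy (invented) weight/accuracy values:

```python
import numpy as np

def pearson(a, b):
    """Sample Pearson correlation coefficient between two sequences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    a, b = a - a.mean(), b - b.mean()
    return float(a @ b / np.sqrt((a @ a) * (b @ b)))

weights = [0.05, 0.10, 0.30, 0.55]   # toy aggregation weights c_1..c_l
accs    = [0.61, 0.64, 0.72, 0.80]   # toy per-model target accuracies
r = pearson(weights, accs)           # strongly positive: better models get
                                     # larger aggregation weights
```

A coefficient near +1, as in this toy example, is the behaviour one hopes to see for a weight-based aggregation method.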

D.6 SENSITIVITY ANALYSIS: EFFECT OF ADDING INACCURATE MODELS

We study the sensitivity of our method with respect to adding inaccurate models to the given sequence of models. We define an inaccurate model as one having a target accuracy lower than 80% of the target accuracy of the model computed without domain adaptation (SO). In particular, we add +10, +50, and +100 inaccurate models to the given sequences of models. One inaccurate model is constructed as follows: First, a model is chosen uniformly at random from the given sequence of models. Second, the outputs of the chosen model are corrupted by adding, to half of the elements of its vector-valued output, random Gaussian noise with zero mean and unit variance. In Figure 7, we show the median performance over all domain adaptation methods for each dataset. We see that the performance of our method (IWA) is not very sensitive to an increase in the number of inaccurate models. This is in contrast to the heuristics (e.g., TMV, TMR, or TCR) and the model selection method DEV, which are highly sensitive to added inaccurate models. SOR also shows stable results; it is, however, clearly outperformed by our method on five out of six datasets. We conclude that IWA is not only overall the best performing method, but also the most stable choice with respect to inaccurate models in the given sequence.
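The corruption procedure above can be sketched as follows; the array shapes and the helper name make_inaccurate_model are illustrative assumptions, since the actual models in our experiments are neural networks.

```python
import numpy as np

def make_inaccurate_model(outputs, rng):
    """Corrupt a model's vector-valued outputs as described above.

    outputs : (n, d) array of a model's outputs on n samples; a stand-in
    for the real model outputs.
    """
    corrupted = outputs.copy()
    n, d = corrupted.shape
    # Add zero-mean, unit-variance Gaussian noise to half of the output dims.
    dims = rng.choice(d, size=d // 2, replace=False)
    corrupted[:, dims] += rng.normal(loc=0.0, scale=1.0, size=(n, len(dims)))
    return corrupted

rng = np.random.default_rng(0)
models = [rng.random((10, 4)) for _ in range(3)]  # sequence of model outputs
# First, choose a model uniformly at random; second, corrupt its outputs.
chosen = models[rng.integers(len(models))]
noisy = make_inaccurate_model(chosen, rng)
```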

D.7 DETAILED EMPIRICAL RESULTS

In this section, we add all result tables for the datasets described in the main paper.

Baselines. As addressed in the main paper, our method, IWA, is compared to ensemble learning methods that use linear regression and majority voting as heuristics for model aggregation, and to model selection methods with theoretical error guarantees. The heuristic baselines are majority voting on target data (TMV), source-only regression (SOR), target majority voting regression (TMR), and target confidence average regression (TCR). The model selection methods with theoretical error guarantees are importance weighted validation (IWV) (Sugiyama et al., 2007) and deep embedded validation (DEV) (You et al., 2019). The tables also provide columns for source-only (SO) performance and target-best (TB) performance. We highlight in bold the performance of the best performing method with theoretical error guarantees, and in italic the best performing heuristic.
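For intuition, the count-based aggregation underlying TMV can be sketched as follows; the example array and the tie-breaking behavior (argmax returns the smallest class label on ties) are our illustrative assumptions, not necessarily those of the baseline's exact implementation.

```python
import numpy as np

def majority_vote(preds):
    """Count-based aggregation: per-sample majority class over all models.

    preds : (l, t) integer class predictions of l models on t target samples.
    """
    n_classes = preds.max() + 1
    # Per-sample vote counts: column j holds the class histogram of sample j.
    votes = np.apply_along_axis(np.bincount, 0, preds, minlength=n_classes)
    return votes.argmax(axis=0)

# Three models predicting classes for three target samples.
preds = np.array([[0, 1, 2],
                  [0, 1, 1],
                  [1, 1, 2]])
majority_vote(preds)  # → array([0, 1, 2])
```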



Large scale benchmark experiments are available at https://github.com/Xpitfire/iwa. Contact: dinu@ml.jku.at, werner.zellinger@ricam.oeaw.ac.at.

The existence of the conditional probability density q(y|x) with q(x, y) = q(y|x)q_X(x) is guaranteed by the fact that X × Y is Polish, i.e., a separable and complete metric space, cf. Dudley (2002, Theorem 10.2.2). Y-valued integrals are defined in the sense of Lebesgue-Bochner.



Figure 1: Unsupervised domain adaptation problem (Shimodaira, 2000; Sugiyama et al., 2007; You et al., 2019). Left: Source distribution (solid) and target distribution (dashed). Right: A sequence of different linear models (dashed) is used to find the optimal linear aggregation of the models (solid). Model selection methods (Sugiyama et al., 2007; Kouw et al., 2019; You et al., 2019; Zellinger et al., 2021) cannot outperform the best single model in the sequence, confidence values as used in Zou et al. (2018) are not available, and approaches based on averages or tendencies of majorities of models (Saito et al., 2017) suffer from a high fraction of large-error models in the sequence. In contrast, our approach (dotted-dashed) is nearly optimal. In addition, the model computed by our method provably approaches the optimal linear aggregation for increasing sample size. For further details on this example we refer to Section C in the Supplementary Material.

(2015; Zellinger et al., 2017; Peng et al., 2019). In this work, we study the problem of constructing an optimal aggregation using all models in such a sequence. Our main motivation is that the error of such an optimal aggregation is clearly smaller than the error of the best single model in the sequence.

Figure 2: Top: Mean classification accuracy (y-axis) of our method (IWA), source-only regression (SOR), deep embedded validation (DEV) and individual models (green: source accuracy, orange: target accuracy) used in the aggregation for the HHAR dataset (Stisen et al., 2015) over 3 seeds. The individual models (x-axis) are trained with DIRT (Shu et al., 2018) for different hyper-parameter choices. Bottom: Scaled aggregation weights (y-axis) for individual models (x-axis) computed by IWA, SOR and DEV (average over 3 seeds). Instead of searching for the best model in the sequence, IWA effectively uses all models in the sequence and obtains a performance not reachable by any procedure selecting only one model.

Figure 3: Transformed Moons dataset. Source data is depicted as blue + and orange ×. Target data points are shown as black dots.

Transformed Moons: 11 methods × 14 parameters × 1 domain adaptation task × 3 seeds + 3 density estimator classifiers = 465 trained models
Amazon Reviews: 11 methods × 14 parameters × 12 domain adaptation tasks × 3 seeds + 12 × 3 density estimator classifiers = 5580 trained models
MiniDomainNet: 11 methods × 8 parameters × 5 domain adaptation tasks × 3 seeds + 5 × 3 density estimator classifiers = 1335 trained models
UCI-HAR: 11 methods × 14 parameters × 5 domain adaptation tasks × 3 seeds + 5 × 3 density estimator classifiers = 2325 trained models
Sleep-EDF: 11 methods × 14 parameters × 5 domain adaptation tasks × 3 seeds + 5 × 3 density estimator classifiers = 2325 trained models
HHAR: 11 methods × 14 parameters × 5 domain adaptation tasks × 3 seeds + 5 × 3 density estimator classifiers = 2325 trained models
WISDM: 11 methods × 14 parameters × 5 domain adaptation tasks × 3 seeds + 5 × 3 density estimator classifiers = 2325 trained models
In total: 465 + 5580 + 1335 + 4 × 2325 = 16680 trained models

All methods have been implemented in Python using the PyTorch (Paszke et al., 2017, BSD license) library. For monitoring the runs we used Weights & Biases (Biewald, 2020, MIT license). We use the Scikit-learn (Pedregosa et al., 2011) library for evaluation measures and toy datasets, the TQDM (da Costa-Luis, 2019) library, and TensorBoard (Abadi et al., 2015) for keeping track of the progress of our experiments. We built parts of our implementation on the codebases of Zellinger et al. (2021, MIT License) and Ragab et al. (2022, MIT License).
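The totals above can be verified directly:

```python
# Sanity check of the experiment counts listed above (pure arithmetic).
moons   = 11 * 14 * 1 * 3 + 3        # Transformed Moons
amazon  = 11 * 14 * 12 * 3 + 12 * 3  # Amazon Reviews
mini    = 11 * 8 * 5 * 3 + 5 * 3     # MiniDomainNet
adatime = 11 * 14 * 5 * 3 + 5 * 3    # each of UCI-HAR, Sleep-EDF, HHAR, WISDM
total = moons + amazon + mini + 4 * adatime  # 16680
```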

Figure 2 in the main paper suggests that there is a positive correlation between the target accuracy of each individual model (orange dashed line, Figure 2 top) and the respective aggregation weight (red bars, Figure 2 bottom), if the linear aggregation of the models is computed by our method IWA. In the following, we analyze whether this trend holds throughout all other experiments.

Figure 4: Boxplots of the correlation coefficients of IWA and the linear regression heuristic baselines SOR, TCR, and TMR over all datasets.

Figure 7: Sensitivity of methods for parameter choice issues w.r.t. adding inaccurate models to the given sequence of models; separate for each dataset (a-f), averaged over all domain adaptation methods, source-target pairs, and random seeds. Horizontal axes: Number of inaccurate models added to the initial sequence of models. Vertical axes: Target accuracy. Solid lines indicate medians and shaded areas indicate 50% confidence intervals.

…) 0.805(±0.006) 0.805(±0.006) 0.807(±0.009) 0.808(±0.012) 0.775(±0.023) 0.78(±0.035) 0.81(±0.007) 0.815(±0.007)
D → K 0.784(±0.013) 0.815(±0.008) 0.815(±0.008) 0.817(±0.011) 0.829(±0.014) 0.788(±0.007) 0.779(±0.012) 0.816(±0.011) 0.827(±0.014)
E → B 0.701(±0.019) 0.712(±0.014) 0.712(±0.014) 0.712(±0.015) 0.724(±0.014) 0.69(±0.03) 0.702(±0.022) 0.712(±0.012) 0.721(±0.014)
E → D 0.736(±0.005) 0.743(±0.014) 0.743(±0.014) 0.747(±0.012) 0.758(±0.012) 0.735(±0.053) 0.738(±0.019) 0.751(±0.008) 0.757(±0.012)
E → K 0.854(±0.016) 0.877(±0.011) 0.877(±0.011) 0.878(±0.012) 0.879(±0.014) 0.86(±0.009) 0.858(±0.007) 0.88(±0.013) 0.875(±0.012)
K → B 0.711(±0.014) 0.741(±0.004) 0.741(±0.004) 0.74(±0.004) 0.759(±0.007) 0.718(±0.012) 0.729(±0.003) 0.75(±0.004) 0.747(±0.007)
K → D 0.738(±0.006) 0.768(±0.011) 0.768(±0.011) 0.765(±0.011) 0.778(±0.013) 0.733(±0.013) 0.758(±0.019) 0.778(±0.012) 0.762(±0.012)
K → E 0.837(±0.013) 0.864(±0.008) 0.864(±0.008) 0.864(±0.007) 0.865(±0.009) 0.86(±0.013) 0.851(±0.008) 0.866(±0.011) 0.859(±0.011)
Avg. 0.767(±0.012) 0.792(±0.009) 0.792(±0.009) 0.793(±0.01) 0.8(±0.01) 0.776(±0.015) 0.778(±0.015) 0.797(±0.009)
0.779(±0.013) 0.801(±0.014) 0.801(±0.014) 0.803(±0.015) 0.807(±0.01) 0.79(±0.02) 0.689(±0.108) 0.805(±0.014) 0.805(±0.01)
B → E 0.752(±0.005) 0.788(±0.005) 0.788(±0.005) 0.783(±0.005) 0.798(±0.002) 0.775(±0.022) 0.764(±0.02) 0.787(±0.006) 0.796(±0.002)
B → K 0.768(±0.012) 0.799(±0.008) 0.799(±0.008) 0.799(±0.01) 0.815(±0.007) 0.786(±0.036) 0.798(±0.031) 0.801(±0.011) 0.816(±0.005)
D → B 0.782(±0.008) 0.795(±0.004) 0.795(±0.004) 0.795(±0.002) 0.799(±0.007) 0.797(±0.01) 0.729(±0.111) 0.801(±0.005) 0.802(±0.007)
D → E 0.771(±0.002) 0.804(±0.008) 0.804(±0.008) 0.804(±0.008) 0.816(±0.012) 0.79(±0.02) 0.792(±0.02) 0.808(±0.003) 0.813(±0.012)
D → K 0.786(±0.013) 0.804(±0.01) 0.804(±0.01) 0.802(±0.012) 0.821(±0.014) 0.789(±0.013) 0.801(±0.025) 0.807(±0.012) 0.83(±0.016)
E → B 0.702(±0.021) 0.716(±0.017) 0.716(±0.017) 0.718(±0.021) 0.718(±0.018) 0.71(±0.021) 0.707(±0.018) 0.721(±0.02) 0.711(±0.02)
E → D 0.725(±0.007) 0.743(±0.008) 0.743(±0.008) 0.741(±0.005) 0.744(±0.008) 0.723(±0.021) 0.735(±0.012) 0.749(±0.005) 0.732(±0.005)
E → K 0.865(±0.013) 0.883(±0.009) 0.883(±0.009) 0.882(±0.008) 0.886(±0.009) 0.874(±0.003) 0.876(±0.006) 0.883(±0.01) 0

Mean and standard deviation (after ±) of target classification accuracy on Amazon Reviews, Sleep-EDF, UCI-HAR, HHAR and WISDM datasets over three different random initialization of model weights and several domain adaptation tasks.

Mean and standard deviation (after ±) of target classification accuracy on Transformed Moons dataset over three different random initialization of model weights and 11 domain adaptation methods.

Model architecture for the Transformed Moons dataset. The values for neural network layers correspond to the number of output units.

Model architecture for the Amazon Reviews dataset. The values for neural network layers correspond to the number of output units.

Model architecture for the MiniDomainNet dataset. The values for neural network layers correspond to the number of output units.

Model backbone for the AdaTime suite. Kernel size, stride, and output channels of the convolutional layers are dataset dependent and are chosen according to Ragab et al. (2022).

Model architecture for the AdaTime dataset. Layer hyper-parameters are dataset dependent and are chosen according to Ragab et al. (2022).

Model architecture for the AdaTime dataset. Hyper-parameters are dataset dependent and are chosen according to Ragab et al. (2022).

In this way, we generate a sequence of 14 hyper-parameter choices. Due to computational limitations, in MiniDomainNet we use λ ∈ {0, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10}. All values are listed in Table 9 and Table 10.

Domain adaptation hyper-parameter sequences for experiments on the datasets Transformed Moons, Amazon Reviews, and MiniDomainNet. We multiply each hyper-parameter with a set of scaling factors λ ∈ {0, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, 10} to obtain a sequence. Due to computational limitations, for MiniDomainNet we use only λ ∈ {0, 0.0001, 0.001, 0.01, 0.1, 1, 5, 10}.
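As an illustrative sketch of how such a sequence is generated (the base value below is hypothetical), each base hyper-parameter is multiplied by every scaling factor λ:

```python
# Scaling factors used for Transformed Moons and Amazon Reviews.
lambdas = [0, 0.0001, 0.001, 0.01, 0.05, 0.1, 0.25, 0.5, 0.75, 1, 1.5, 2, 5, 10]

base = 0.1  # hypothetical base value of a domain adaptation hyper-parameter

# Multiplying the base value by each factor yields 14 hyper-parameter choices.
sequence = [lam * base for lam in lambdas]
```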

Domain adaptation hyper-parameters for experiments on the timeseries data.

Table 16 and Table 17 show all domain adaptation tasks for the Amazon Reviews dataset. Table 13 shows all domain adaptation tasks for the MiniDomainNet experiments. Table 21, Table 22, Table 23, Table 24, Table 25, Table 26, Table 27, and Table 28 show all domain adaptation task results for the time-series datasets.

Average target accuracies (and average standard deviations) for all 7 datasets (e.g., Sleep-EDF, MiniDomainNet, Amazon Reviews) taken over several domain adaptation tasks (e.g., 5 on Sleep-EDF, 5 on MiniDomainNet, 12 on Amazon Reviews), 11 domain adaptation methods (e.g., DANN, HoMM, CMD) and 3 repetitions with different random initialization of model weights. The input sequences of the approaches (e.g., DEV, IWA) consist of neural networks computed by runs of the domain adaptation methods with different hyper-parameters (e.g., 8 different values of λ for DANN).

Mean and standard deviation (after ±) of target classification accuracy on Amazon Reviews dataset over three different random initialization of model weights and 12 domain adaptation tasks.

Mean and standard deviation (after ±) of target classification accuracy on MiniDomainNet dataset over three different random initialization of model weights and five domain adaptation tasks.

Mean and standard deviation (after ±) of target classification accuracy on four time series datasets over three different random initialization of model weights and five domain adaptation tasks.

Mean and standard deviation (after ±) of target classification accuracy on Amazon Reviews (Part 1) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on Amazon Reviews (Part 2) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on Amazon Reviews (Part 3) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on MiniDomainNet dataset over three different random initialization of model weights and five domain adaptation tasks.

Mean and standard deviation (after ±) of target classification accuracy on MiniDomainNet (Part 1) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on MiniDomainNet (Part 2) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on Sleep-EDF (Part 1) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on Sleep-EDF (Part 2) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on UCI-HAR (Part 1) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on UCI-HAR (Part 2) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on HHAR (Part 1) over 3 repetitions with different random initialization of model weights.

Mean and standard deviation (after ±) of target classification accuracy on HHAR (Part 2) over 3 repetitions with different random initialization of model weights.
→ 6 0.731(±0.028) 0.731(±0.028) 0.736(±0.024) 0.632(±0.024) 0.703(±0.038) 0.7(±0.038) 0.728(±0.019) 0.703(±0.024)
1 → 6 0.862(±0.018) 0.893(±0.009) 0.893(±0.009) 0.896(±0.004) 0.822(±0.104) 0.864(±0.017) 0.864(±0.017) 0.886(±0.009) 0.911(±0.017)
2 → 7 0.509(±0.097) 0.454(±0.02) 0.454(±0.02) 0.457(±0.025) 0.51(±0.057) 0.499(±0.079) 0.539(±0.076) 0.46(±0.012) 0.565(±0.117)
3 → 8 0.798(±0.016) 0.801(±0.004) 0.801(±0.004) 0.803(±0.006) 0.793(±0.024) 0.799(±0.013) 0.802(±0.016) 0.812(±0.0) 0.822(±0.01)
4 → 5 0.862(±0.036) 0.932(±0.025) 0.932(±0.025) 0.938(±0.022) 0.648(±0.108) 0.906(±0.027) 0.887(±0.035) 0.936(±0.016) 0.96(±0.01)
Avg. 0.745(±0.039) 0.762(±0.017) 0.762(±0.017) 0.766(±0.016) 0.681(±0.063) 0.754(±0.035) 0.758(±0.036) 0.764(±0.011) 0.792(±0.036)

Mean and standard deviation (after ±) of target classification accuracy on WISDM (Part 1) over 3 repetitions with different random initialization of model weights.
±0.033) 0.711(±0.019) 0.711(±0.019) 0.744(±0.019) 0.778(±0.096)
20 → 30 0.872(±0.029) 0.885(±0.033) 0.885(±0.033) 0.885(±0.033) 0.872(±0.059) 0.872(±0.029) 0.872(±0.029) 0.853(±0.029) 0.885(±0.033)
35 → 31 0.619(±0.041) 0.698(±0.027) 0.698(±0.027)

Mean and standard deviation (after ±) of target classification accuracy on WISDM (Part 2) over 3 repetitions with different random initialization of model weights.
±0.077) 0.656(±0.077) 0.689(±0.069) 0.711(±0.051)
20 → 30 0.827(±0.033) 0.891(±0.029) 0.891(±0.029) 0.872(±0.044) 0.84(±0.04) 0.827(±0.033) 0.827(±0.033) 0.885(±0.033) 0.904(±0.038)
35 → 31 0.619(±0.086) 0.69(±0.071)

ACKNOWLEDGMENTS

The ELLIS Unit Linz, the LIT AI Lab, and the Institute for Machine Learning are supported by the Federal State Upper Austria. IARAI is supported by Here Technologies. We thank the projects AI-RL (FFG-881064), PRIMAL (FFG-873979), S3AI (FFG-872172), DL for GranularFlow (FFG-871302), AIRI FG 9-N (FWF-36284, FWF-36235), and ELISE (H2020-ICT-2019-3 ID: 951847). We further thank the Audi.JKU Deep Learning Center and TGW LOGISTICS GROUP GMBH. The research reported in this paper has been funded by the Federal Ministry for Climate Action, Environment, Energy, Mobility, Innovation and Technology (BMK), the Federal Ministry for Digital and Economic Affairs (BMDW), and the Province of Upper Austria in the frame of the COMET-Competence Centers for Excellent Technologies Programme and the COMET Module S3AI managed by the Austrian Research Promotion Agency FFG.

