A REPRODUCIBLE AND REALISTIC EVALUATION OF PARTIAL DOMAIN ADAPTATION METHODS

Anonymous authors
Paper under double-blind review

Abstract

Unsupervised Domain Adaptation (UDA) aims at classifying unlabeled target images by leveraging labeled source ones. In this work, we consider the Partial Domain Adaptation (PDA) variant, where there are extra source classes not present in the target domain. Most successful algorithms use model selection strategies that rely on target labels to find the best hyper-parameters and/or models along training. However, these strategies violate the main assumption in PDA: only unlabeled target domain samples are available. Moreover, there are also inconsistencies in the experimental settings (architecture, hyper-parameter tuning, number of runs) yielding unfair comparisons. The main goal of this work is to provide a realistic evaluation of PDA methods with the different model selection strategies under a consistent evaluation protocol. We evaluate 7 representative PDA algorithms on 2 different real-world datasets using 7 different model selection strategies. Our two main findings are: (i) without target labels for model selection, the accuracy of the methods decreases by up to 30 percentage points; (ii) only one method and model selection pair performs well on both datasets. Experiments were performed with our PyTorch framework, BenchmarkPDA, which we open source.

1. INTRODUCTION

Domain adaptation. Deep neural networks are highly successful in image recognition for in-distribution samples (He et al., 2016), a success intrinsically tied to the large amount of labeled training data. However, they tend not to generalize as well to images with backgrounds or colors not seen during training. Such a shift in the samples is referred to as domain shift in the literature. Unfortunately, enriching the training set with new samples from different domains is challenging, as labeling data is both an expensive and time-consuming task. Thus, researchers have focused on unsupervised domain adaptation (UDA), where we have access to unlabeled samples from a different domain, known as the target domain. The purpose of UDA is to classify these unlabeled samples by leveraging the knowledge given by the labeled samples from the source domain (Pan & Yang, 2010; Patel et al., 2015). In the standard UDA problem, the source and target domains are assumed to share the same classes. In this paper, we consider a more challenging variant of the problem called partial domain adaptation (PDA): the classes in the target domain Y_t form a subset of the classes in the source domain Y_s (Cao et al., 2018), i.e., Y_t ⊂ Y_s. The number of target classes is unknown as we do not have access to the labels. The extra source classes, not present in the target domain, make the PDA problem more difficult: simply aligning the source and target domains forces a negative transfer where target samples are matched to outlier source-only labels.

Realistic evaluations. Most recent PDA methods report an increase in target accuracy of up to 15 percentage points on average when compared to the baseline approach that uses only source domain samples. While these successes constitute important breakthroughs in the DA research literature, target labels are used for model selection, violating the main UDA assumption.
In their absence, the effectiveness of PDA methods remains unclear, and model selection constitutes a yet-to-be-solved problem, as we show in this work. Moreover, the hyper-parameter tuning is either unknown or lacks details, and sometimes requires labeled target data, which makes it challenging to apply PDA methods to new datasets. Recent work has highlighted the importance of model selection in the presence of domain shift. Gulrajani & Lopez-Paz (2021) showed that when domain generalization (DG) algorithms, whose goal is to generalize to a completely unseen domain, are evaluated in a consistent and realistic setting, no method outperforms the baseline ERM method by more than 1 percentage point. They argue that DG methods without a model selection strategy remain incomplete and should therefore be specified as part of the method. A similar recommendation was made by Saito et al. (2021) for domain adaptation. PDA methods have been designed using target labels at test time to select the best models. Parallel work (Saito et al., 2021; You et al., 2019) on model selection strategies for domain adaptation claimed to select the best models without using target labels. However, a realistic empirical study of these strategies in PDA is still lacking. In this work, we conduct extensive experiments to study the impact of model selection strategies on the performance of partial domain adaptation methods. We evaluate 7 different PDA methods over 7 different model selection strategies, 4 of which do not use target labels, and 2 different datasets, under the same experimental protocol for a fair comparison.

Table 1: Task accuracy average computed over three different seeds (2020, 2021, 2022) on Partial OFFICE-HOME and Partial VISDA. For each dataset and PDA method, we display the results of the worst and best performing model selection strategy that does not use target labels, as well as the ORACLE model selection strategy. All results can be found in Table 6.
We list below our major findings:

• The accuracy attained by models selected without target labels can decrease by up to 30 percentage points compared to the one reported using target labels (see Table 1 for a summary of results).
• Only 1 pair of PDA method and target label-free model selection strategy achieves accuracies comparable to when target labels are used, while still improving over a source-only baseline.
• The random seed plays an important role in the selection of hyper-parameters. Selected parameters are not stable across different seeds, and the standard deviation between accuracies on the same task can be up to 8.4%, even when relying on target labels for model selection.
• Under a more realistic scenario where some target labels are available, 100 random samples are enough to see only a drop of 1 percentage point in accuracy (when compared to using all target samples). However, the extreme case of using only one labeled target sample per class leads to a significant drop in performance.

Outline. In Section 2, we provide an overview of the different model selection strategies considered in this work. Then, in Section 3, we discuss the PDA methods that we consider. In Section 4, we describe the training procedures, hyper-parameter tuning and evaluation protocols used to evaluate all methods fairly. In Section 5, we discuss the results of the different benchmarked methods and the performance of the different model selection strategies. Finally, in Section 6, we give some recommendations for future work in partial domain adaptation.

2. MODEL SELECTION STRATEGIES: AN OVERVIEW

Model selection (choosing hyper-parameters, training checkpoints, neural network architectures) is a crucial part of training neural networks. In the supervised learning setting, a validation set is used to estimate the model's accuracy. However, in UDA such an approach is not possible as we only have unlabeled target samples. Several strategies have been designed to address this issue. Below, we discuss the ones used in this work.

Source Accuracy (S-ACC). Ganin & Lempitsky (2015) used the accuracy estimated on a small validation set from the source domain to perform model selection. While the source and target accuracies are related, there are no theoretical guarantees. You et al. (2019) showed that when the domain gap is large this approach fails to select competitive models.

Deep Embedded Validation (DEV). Sugiyama et al. (2007) and Long et al. (2018) perform model selection through Importance-Weighted Cross-Validation (IWCV). Under the assumption that the source and target domains follow a covariate shift, the target risk can be estimated from the source risk through importance weights that give increased importance to source samples that are closer to target samples. These importance weights correspond to the ratio of the target and source densities and are estimated using Gaussian kernels. Recently, You et al. (2019) proposed an improved variant, Deep Embedded Validation (DEV), that controls the variance of the estimator and estimates the importance weights with a discriminative model that distinguishes source samples from target samples, leading to a more stable and effective method.

Entropy (ENT). While minimizing the entropy of the target samples has been used in domain adaptation to improve accuracy by promoting tighter clusters, Morerio et al. (2018) showed that it can also be used for model selection. The intuition is that a lower entropy model corresponds to a highly confident model with discriminative target features and therefore reliable predictions.
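As a concrete illustration, the ENT criterion amounts to a few lines of code. The sketch below is our own minimal numpy version (the function name is ours, not from the original papers): checkpoints and hyper-parameters with the lowest mean prediction entropy on the unlabeled target set are preferred.

```python
import numpy as np

def entropy_score(probs, eps=1e-12):
    """Mean prediction entropy over unlabeled target samples (lower is better).

    probs: (n_samples, n_classes) array of softmax outputs on target data.
    """
    probs = np.clip(probs, eps, 1.0)  # avoid log(0)
    return float(-(probs * np.log(probs)).sum(axis=1).mean())

# A confident model (peaked predictions) scores lower than an unsure one.
confident = np.array([[0.98, 0.01, 0.01], [0.01, 0.98, 0.01]])
unsure = np.array([[0.40, 0.30, 0.30], [0.34, 0.33, 0.33]])
```

In a model selection loop, one would compute this score for every candidate checkpoint and keep the minimizer.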
Soft Neighborhood Density (SND). Saito et al. (2021) argue that a good UDA model will have a cluster structure where nearby target samples are in the same class. They claim that entropy is not able to capture this property and propose the Soft Neighborhood Density (SND) score to address it.

Target Accuracy (ORACLE). We also consider the target accuracy on all target samples. While we emphasize once again that its use is not realistic in unsupervised domain adaptation (hence why we refer to it as ORACLE), it has nonetheless been used to report the best accuracy achieved by the model along training in several previous works (Cao et al., 2018; Xu et al., 2019; Jian et al., 2020; Gu et al., 2021; Nguyen et al., 2022). Here, we use it as an upper bound for all the other model selection strategies and to check the reproducibility of previous works.

Small Labeled Target Set (1-SHOT and 100-RND). For real-world applications in an industry setting, it is unlikely that a model will be deployed without at the very least an estimate of its performance, for which target labels are required. Therefore, one can imagine a situation where a PDA method is used and a small set of labeled target samples is available. Thus, we compute the target accuracy with 1 labeled sample per class (1-SHOT) and with 100 random labeled target samples (100-RND) as model selection strategies. One could argue that the 100 random samples could have been used during training with semi-supervised domain adaptation methods. However, note that we do not know how many classes we have in the target domain, so it is hard to form a split under this uncertainty. For instance, 100-RND possibly represents fewer than 2 samples per class for one of our real-world datasets, as we do not know the number of classes, making a split of the target samples into training and validation sets impossible.
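For reference, our reading of the SND score of Saito et al. (2021) can be sketched as follows; this is an illustrative numpy re-implementation (names ours), not the official code. Target features are L2-normalized, self-similarity is excluded, the similarities are passed through a temperature-scaled softmax (we use temperature 0.05, see Appendix A.2), and the score is the mean entropy of the resulting soft neighborhood distributions; the model with the highest SND is selected.

```python
import numpy as np

def snd_score(features, tau=0.05):
    """Soft Neighborhood Density (higher is better), per our reading of
    Saito et al. (2021). features: (n, d) target embeddings."""
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    np.fill_diagonal(sim, -np.inf)            # exclude self-similarity
    sim = sim / tau                           # temperature scaling
    sim = sim - sim.max(axis=1, keepdims=True)  # numerically stable softmax
    p = np.exp(sim)
    p = p / p.sum(axis=1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)  # per-sample neighborhood entropy
    return float(ent.mean())
```

The intuition is that a model with dense target neighborhoods spreads each sample's soft neighborhood mass over many nearby points, yielding a higher score than a model whose features are scattered.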

3. PARTIAL DOMAIN ADAPTATION METHODS

In this section, we give a brief description of the PDA methods considered in our study. They can be grouped into two families: adversarial training and divergence minimization.

Adversarial training. To solve the UDA problem, Ganin et al. (2016) aligned the source and target domains with the help of a domain discriminator trained to distinguish the samples from the two domains, while the feature extractor is trained adversarially to fool it. However, when applied to the PDA problem this strategy leads to negative transfer and the model performs worse than a model trained only on source data. Cao et al. (2018) proposed PADA, which introduces a PDA-specific solution to adversarial domain adaptation: the contribution of the source-only class samples to the training of both the source classifier and the domain adversarial network is decreased. This is achieved through class weights that are calculated by simply averaging the classifier predictions over all target samples. As the source-only classes should not be predicted in the target domain, they should have lower weights. More recently, Jian et al. (2020) proposed BA3US, which augments the target mini-batch with source samples to transform the PDA problem into a vanilla DA problem. In addition, an adaptive weighted complement entropy objective is used to encourage incorrect classes to have uniform and low prediction scores.

Divergence minimization. Another standard direction to align the source and target distributions in the feature space of a neural network is to minimize a given divergence between the domain distributions. Xu et al. (2019) empirically found that target samples have low feature norms compared to source samples. Based on this insight, they proposed SAFN, which progressively adapts the feature norms of the two domains by minimizing the Maximum Mean Feature Norm Discrepancy (Gretton et al., 2012).
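To make PADA's weighting scheme concrete, here is an illustrative sketch (our own naming and minimal numpy version, not the official implementation): classifier predictions are averaged over the target samples and normalized by the largest value, so classes that are rarely predicted on the target, presumably source-only ones, receive small weights.

```python
import numpy as np

def pada_class_weights(target_probs):
    """Class weights from averaged target predictions (PADA-style sketch).

    target_probs: (n_target, n_source_classes) softmax outputs on the target.
    Source-only classes, rarely predicted on the target, get small weights.
    """
    w = target_probs.mean(axis=0)   # average prediction per source class
    return w / w.max()              # normalize so the largest weight is 1

# Toy example: 3 source classes, but the target only contains classes 0 and 1.
probs = np.array([[0.70, 0.25, 0.05],
                  [0.20, 0.75, 0.05],
                  [0.80, 0.15, 0.05]])
weights = pada_class_weights(probs)
```

These weights then down-scale the contribution of source-only class samples in both the classification and the adversarial losses.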
Other approaches are based on optimal transport (OT) (Bhushan Damodaran et al., 2018) with mini-batches (Peyré & Cuturi, 2019; Fatras et al., 2020; 2021b). In particular, Fatras et al. (2021a) developed JUMBOT, a mini-batch unbalanced optimal transport method that learns a joint distribution of the embedded samples and labels. The use of unbalanced OT is critical for the PDA problem as it allows transporting only a portion of the mass, limiting the negative transfer between distributions. Based on this work, Nguyen et al. (2022) investigated the partial OT variant (Chapel et al., 2020), a particular case of unbalanced OT, proposing M-POT. Finally, another line of work uses the Kantorovich-Rubinstein duality of optimal transport to perform the alignment, similarly to WGAN (Arjovsky et al., 2017). This is precisely the work of Gu et al. (2021), who proposed AR. In addition, source samples are reweighted in order to reduce the negative transfer from the source-only class samples. The Kantorovich-Rubinstein duality relies on a 1-Lipschitz function, which is approximated using adversarial training like in the PDA methods described above.
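To illustrate why unbalanced OT limits negative transfer, below is a toy numpy sketch of entropic unbalanced OT with KL-relaxed marginals, the basic building block behind JUMBOT-style methods. This is our own simplified illustration (names and cost matrix ours), not the authors' implementation, which operates on mini-batches with a joint embedding/label cost.

```python
import numpy as np

def sinkhorn_unbalanced(a, b, M, reg=0.05, reg_m=1.0, n_iter=500):
    """Entropic unbalanced OT with KL marginal penalties (toy version).

    a: (n,) source masses, b: (m,) target masses, M: (n, m) cost matrix.
    Returns the transport plan; unlike balanced OT, it may move less than
    the full mass, leaving expensive (outlier) points mostly untransported.
    """
    K = np.exp(-M / reg)
    u, v = np.ones_like(a), np.ones_like(b)
    fi = reg_m / (reg_m + reg)  # exponent from the KL proximal step
    for _ in range(n_iter):
        u = (a / (K @ v + 1e-300)) ** fi
        v = (b / (K.T @ u + 1e-300)) ** fi
    return u[:, None] * K * v[None, :]

# Two matched source/target pairs plus one far-away source point
# (playing the role of a source-only class).
M = np.array([[0.0, 1.0],
              [1.0, 0.0],
              [10.0, 10.0]])
a = np.full(3, 1.0 / 3.0)
b = np.full(2, 1.0 / 2.0)
plan = sinkhorn_unbalanced(a, b, M)
```

On this toy problem, the outlier source point ends up transporting almost no mass, which is precisely the behavior that prevents source-only classes from being matched to target samples.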

4. EXPERIMENTAL PROTOCOL

In this section, we discuss our choices regarding the training details, datasets and neural network architecture. We then discuss the hyper-parameter tuning used in this work. We summarize the PDA methods, model selection strategies and experimental protocol used in this work in Table 2. The main differences in the experimental protocols of the different published state-of-the-art (SOTA) methods are summarized in Table 3. To perform our experiments we developed a PyTorch (Paszke et al., 2019) framework, BenchmarkPDA. We make it available for other researchers to use and contribute to with new algorithms and model selection strategies: https://anonymous.4open.science/r/BenchmarkPDA-7F73

It is standard in the literature, when proposing a new method, to report directly the results of its competitors from the original papers (Cao et al., 2018; Xu et al., 2019; Jian et al., 2020; Gu et al., 2021; Nguyen et al., 2022). As a result, some methods differ, for instance, in the neural network architecture implementation (AR (Gu et al., 2021), SAFN (Xu et al., 2019)) or in the evaluation protocol (JUMBOT (Fatras et al., 2021a)). These changes often contribute to an increased performance of the newly proposed method, leaving previous methods at a disadvantage. Therefore, we chose to implement all methods with the same commonly used neural network architecture, optimizer, learning rate schedule and evaluation protocol. We discuss the details below.

4.1. METHODS, DATASETS, TRAINING AND EVALUATION DETAILS

Methods. We implemented 7 PDA methods by adapting the code from the official GitHub repositories of each method: Source Only, PADA (Cao et al., 2018), SAFN (Xu et al., 2019), BA3US (Jian et al., 2020), AR (Gu et al., 2021), JUMBOT (Fatras et al., 2021a), MPOT (Nguyen et al., 2022). We provide the links to the different official repositories in Appendix A.1. A comparison with previously reported results can be found in Table 4; we postpone the discussion to Section 5.

Datasets. We consider two standard real-world datasets used in DA. Our first dataset is OFFICE-HOME (Venkateswara et al., 2017), a difficult dataset for unsupervised domain adaptation (UDA). It has 15,500 images from four different domains: Art (A), Clipart (C), Product (P) and Real-World (R). For each domain, the dataset contains images of 65 object categories that are common in office and home scenarios. For the partial OFFICE-HOME setting, we follow Cao et al. (2018) and select the first 25 categories (in alphabetic order) in each domain as the partial target domain. We evaluate all methods in all 12 adaptation scenarios. VISDA (Peng et al., 2017) is a large-scale dataset for UDA. It has 152,397 synthetic images as the source domain and 55,388 real-world images as the target domain, with 12 object categories shared by the two domains. For the partial VISDA setting, we follow Cao et al. (2018) and select the first 6 categories, taken in alphabetic order, in each domain as the partial target domain. We evaluate the models in the two possible scenarios. We highlight that we are the first to investigate the performance of JUMBOT and MPOT on partial VISDA.

Model Selection Strategies

We consider the 7 different strategies for model selection described in Section 2: S-ACC, DEV, ENT, SND, ORACLE, 1-SHOT, 100-RND. We use them both for hyper-parameter tuning and for selecting the best model along training. Since S-ACC, DEV and SND require a source validation set, we divide the source samples into a training subset (80%) and a validation subset (20%). Regardless of the model selection strategy used, all methods are trained using the source training subset. This is in contrast with previous work that uses all source samples, but necessary to ensure a fair comparison of the model selection strategies. We refer to Appendix A.2 for additional details.

Architecture. Our network is composed of a feature extractor with a linear classification layer on top of it. The feature extractor is a ResNet50 (He et al., 2016), pre-trained on ImageNet (Deng et al., 2009), with its last linear layer removed and replaced by a linear bottleneck layer of dimension 256.

Optimizer. We use the SGD (Robbins & Monro, 1951) algorithm with a momentum of 0.9, a weight decay of 5e-4 and Nesterov acceleration. As the bottleneck and classifier layers are randomly initialized, we set their learning rates to be 10 times that of the pre-trained ResNet50 backbone. We schedule the learning rate with a strategy similar to the one in (Ganin et al., 2016): χ_i = χ_0 (1 + γ i)^(-ν), where i is the current iteration, χ_0 = 0.001, γ = 0.001 and ν = 0.75. While this schedule is slightly different from the one reported in previous work, it is the one implemented in the different official code implementations. We elaborate in Appendix A.3 on the differences and provide additional details. Finally, as for the mini-batch size, JUMBOT and M-POT were designed with stratified sampling, i.e., a balanced source mini-batch with the same number of samples per class. This helps reduce the negative transfer between domains and is crucial to their success.
On the other hand, it was shown that for some methods (e.g. BA3US) using a larger mini-batch than reported leads to decreased performance (Fatras et al., 2021a). As a result, we used the default mini-batch strategies for each method. JUMBOT and M-POT use stratified mini-batches of size 65 for OFFICE-HOME and 36 for VISDA. All other methods use a standard random uniform sampling strategy with a mini-batch size of 36.

Evaluation Protocol. For the hyper-parameters chosen with each model selection strategy, we run the methods for each task 3 times, each with a different seed (2020, 2021, 2022). We tried to control for the randomness across methods by setting the seeds at the beginning of training. Interestingly, as we discuss in more detail in Section 5, some methods demonstrated a non-negligible variance across the different seeds, showing that some hyper-parameters and methods are not robust to randomness.
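The learning-rate schedule described in the Optimizer paragraph above is straightforward to reproduce; a minimal sketch (function name ours) matching the stated constants:

```python
def lr_schedule(i, lr0=0.001, gamma=0.001, nu=0.75):
    """Backbone learning rate at iteration i: lr0 * (1 + gamma * i) ** (-nu).

    The randomly initialized bottleneck and classifier layers use 10x this value.
    """
    return lr0 * (1.0 + gamma * i) ** (-nu)
```

At iteration 0 the rate equals lr0 and it then decays polynomially, e.g. halving roughly every time (1 + γ·i) grows by a factor of 2^(1/ν).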

4.2. HYPER-PARAMETER TUNING

Previous works (Gulrajani & Lopez-Paz, 2021; Musgrave et al., 2021; 2022) perform random searches with the same number of runs for each method. In contrast, we perform hyper-parameter grid searches for each method. As a result, the hyper-parameter tuning budgets differ across methods, depending on the number of hyper-parameters and the chosen grid. While one could argue this leads to an unfair comparison of the methods, in practice, in most real-world applications, one will be interested in using the best method, and our approach captures precisely that. The hyper-parameter tuning would ideally be performed for each task of each dataset, but that would require significant computational resources without a clear added benefit. Instead, for each dataset, we perform the hyper-parameter tuning on a single task: A2C for OFFICE-HOME and S2R for VISDA. This same strategy was adopted in (Fatras et al., 2021a) and the hyper-parameters were found to generalize to the remaining tasks in the dataset. We conjecture that this may be due to the fact that information regarding the number of target-only classes is implicitly hidden in the hyper-parameters. See Appendix A.4 for more details regarding the hyper-parameters.

Several runs in our hyper-parameter search for JUMBOT, M-POT and BA3US were unsuccessful, with the optimization reaching its end without the model being trained at all. This poses a challenge to DEV, SND and ENT and is one of the failure modes accounted for in (Saito et al., 2021). Following their recommendations, for JUMBOT, M-POT and BA3US, before applying the model selection strategy, we discard models whose source domain accuracy is below a certain threshold thr, set with the heuristic thr = 0.9 · Acc, where Acc denotes the source domain accuracy of the Source-Only model. In our experiments, this leads to selecting models whose source accuracy is at least thr = 69.01% for the A2C task on OFFICE-HOME and thr = 89.83% for the S2R task on VISDA.
We chose this heuristic because the ablation studies of some methods showed that performing the adaptation slightly decreases the source accuracy (Bhushan Damodaran et al., 2018). Table 5 shows that our heuristic leads to improved results. Lastly, when choosing the hyper-parameters, we only consider the model at the end of training, discarding the intermediate checkpoint models, in order to select hyper-parameters that do not lead to overfitting at the end of training and generalize better to the other tasks. Following the above protocol, for each dataset we trained 468 models in total to find the best hyper-parameters. Then, to obtain the results with our neural network architecture on all tasks of each dataset, we trained an additional 1224 models for OFFICE-HOME and 156 models for VISDA. We additionally trained 231 models with the different neural network architectures for AR and SAFN. In total, 2547 models were trained for this study; we present the different results in the next section.
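The source-accuracy filtering heuristic above amounts to a one-line pre-filter applied before scoring candidates with DEV, SND or ENT. A sketch with hypothetical run records (field names ours):

```python
def filter_by_source_acc(runs, source_only_acc, ratio=0.9):
    """Keep only candidate runs whose source validation accuracy is at least
    thr = ratio * source_only_acc, before applying an unsupervised
    model selection score (heuristic from Section 4.2).

    runs: list of dicts with a 'source_acc' entry (hypothetical format).
    """
    thr = ratio * source_only_acc
    return [r for r in runs if r["source_acc"] >= thr]

# E.g. with a Source-Only source accuracy of 76.68%, thr is roughly 69.01%.
candidates = [{"name": "run_a", "source_acc": 0.75},
              {"name": "run_b", "source_acc": 0.40},   # degenerate run
              {"name": "run_c", "source_acc": 0.71}]
kept = filter_by_source_acc(candidates, source_only_acc=0.7668)
```

Degenerate runs, whose models never trained, are thus excluded before any unsupervised score can mistakenly rank them highly.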

5. PARTIAL DOMAIN ADAPTATION EXPERIMENTS

We start the results section by discussing the differences between our reproduced results and the published results from the different PDA methods. Then, we compare the performance of the different model selection strategies. Finally, we discuss the sensitivity of methods to the random seed.

5.1. REPRODUCIBILITY OF PREVIOUS RESULTS

We start by ensuring that our reimplementation of the PDA methods was done correctly by comparing our reproduced results with previously reported results in Table 4. As such, the model selection strategy used is ORACLE. On OFFICE-HOME, both PADA and JUMBOT achieved higher average task accuracy (1.6 and 1.7 percentage points, respectively) in our reimplementation, while for BA3US and MPOT we recover the accuracies reported in their respective papers. However, we saw a decrease in performance for both SAFN and AR of roughly 8 and 5 percentage points, respectively. This is to be expected due to the differences in the neural network architectures. While we use a linear bottleneck layer, SAFN uses a nonlinear bottleneck layer. As for AR, they make two significant changes: the linear classification head is replaced by a spherical logistic regression (SLR) layer (Gu et al., 2020) and the features are normalized (the 2-norm is set to a dataset-dependent value, another hyper-parameter that requires tuning) before feeding them to the classification head. While we account for the first change by comparing to the AR (w/ linear) results reported in (Gu et al., 2021), in our neural network architecture we do not normalize the features. These changes, the nonlinear bottleneck layer for SAFN and feature normalization for AR, significantly boost the performance of both methods. When comparing our reimplementations with the same neural network architectures, our SAFN reimplementation achieves a higher average task accuracy by 3 percentage points, while our AR reimplementation is now only 1 percentage point below. The fact that the AR reported results are from only one run, while ours are averaged across 3 distinct seeds, justifies the small remaining gap. Moreover, we report higher or on-par accuracy on 4 of the 12 tasks.
Given all the above and further discussion of the VISDA dataset results in Appendix B, our reimplementations are trustworthy and give validity to the results we discuss in the next sections.

5.2. RESULTS FOR MODEL SELECTION STRATEGIES

Model Selection Strategies (w/ vs w/o target labels). All average accuracies on the OFFICE-HOME and VISDA datasets can be found in Table 6. For all methods on OFFICE-HOME, we can see that the results for model selection strategies that do not use target labels are below the results given by ORACLE. For some pairs, the drop in performance can be significant, leading some methods to perform on par with the S. ONLY method. That is the case on OFFICE-HOME when DEV is paired with either BA3US, JUMBOT or MPOT. Even worse is MPOT with SND, as the average accuracy is more than 10 percentage points below that of S. ONLY with any model selection strategy. Overall on OFFICE-HOME, except for MPOT, all methods, when paired with either ENT or SND, give results that are at most 2 percentage points below those obtained when paired with ORACLE. A similar situation can be seen on the VISDA dataset, where the accuracy without target labels can drop by up to 25 percentage points. Yet again, some model selection strategies can lead to scores even worse than S. ONLY. That is the case for PADA, SAFN and BA3US. Contrary to OFFICE-HOME, all model selection strategies without target labels lead to at least one method with results on par with or worse than the S. ONLY method. Overall, no model selection strategy without target labels leads to scores on par with the ORACLE model selection strategy. Finally, PADA performs worse than S. ONLY for most model selection strategies, including the ones that use target labels. However, when combined with SND it performs better than with ORACLE on average, although still within the standard deviation. This is a consequence of the random seed dependence on VISDA mentioned before: as the hyper-parameters were chosen by performing just one run, we were simply "unlucky". In general, all of this confirms the standard assumption in the literature regarding the difficulty of the VISDA dataset.
Model Selection Strategies (w/ target labels). We recall that the ORACLE model selection strategy uses all the target samples to compute the accuracy, while 1-SHOT and 100-RND use only subsets: 1-SHOT has only one sample per class, for a total of 25 and 6 on OFFICE-HOME and VISDA, respectively, while 100-RND has 100 random target samples. Our results show that using only 100 random labeled target samples is enough to reasonably approximate the target accuracy, leading to only a small accuracy drop (one percentage point in almost all cases) for both datasets. Not surprisingly, the gap between the 1-SHOT and ORACLE model selection strategies is even bigger, leading in some instances to worse results than with a model selection strategy that uses no target labels. This poor performance of the 1-SHOT model selection strategy also highlights that semi-supervised domain adaptation (SSDA) methods are not a straightforward alternative to the 100-RND model selection strategy. While one could argue that the target labels could be leveraged during training as in SSDA methods, one still needs labeled target data to perform model selection. Moreover, our results suggest that we would need at least 3 samples per class for SSDA methods. In addition, knowing that we have a certain number of labeled samples per class provides information regarding which classes are target-only, one of the main assumptions in PDA. In that case, PDA methods could be tweaked. This warrants further study that we leave as future work. Finally, we also investigated a smaller labeled target set of 50 random samples (50-RND) instead of 100. The accuracies of methods using 50-RND were not as good as when using 100-RND. All results for pairs of methods and 50-RND can be found in Appendix B. The lower performance shows that the size of the labeled target set is an important factor, and we suggest using at least 100 random samples.
Model Selection Strategies (w/o target labels). Only the (BA3US, ENT) pair achieved average task accuracies within 3 percentage points of its ORACLE counterpart (i.e., (BA3US, ORACLE)), while still improving over the S. ONLY model. Our experiments show that there is no model selection strategy that performs well for all methods. That is why, to deploy models in a real-world scenario, we advise testing selected models on a small labeled target set (i.e., 100-RND) to assess their performance, as model selection strategies without target labels can perform poorly. Our conclusion is that model selection for PDA methods is still an open problem. We conjecture that this is also the case for domain adaptation, as the considered metrics were first developed for that setting. For future proposed methods, researchers should specify not only which model selection strategy should be used, but also which hyper-parameter search grid should be considered, in order to deploy them in a real-world scenario.

5.3. RANDOM SEED DEPENDENCE

Ideally, PDA methods should be robust to the choice of random seed. This is of particular importance when performing hyper-parameter tuning, since typically only one run per set of hyper-parameters is done (as was the case in our work as well). We investigate this robustness by averaging all the results presented over three different seeds (2020, 2021 and 2022) and reporting the standard deviations. This is in contrast with previous work where only a single run is reported (Fatras et al., 2021a; Gu et al., 2021). Other works (Cao et al., 2018; Xu et al., 2019; Jian et al., 2020) that report standard deviations do not specify whether the random seed is different across runs. Results for all tasks on the VISDA dataset are in Table 7 (where, for each method, we highlight the best and worst label-free model selection strategies in green and red, respectively) and on OFFICE-HOME in Appendix B due to space constraints. Our experiments show that some methods exhibit non-negligible instability under randomness, regardless of the model selection strategy. This is particularly true for BA3US when paired with DEV and 1-SHOT as model selection strategies: there are several tasks where the standard deviation is above 10%. While in this case the instability may stem from the poor performance of the model selection strategies, it is also visible when ORACLE is the model selection strategy used. For instance, M-POT has a standard deviation of 3.3% on the AP task of OFFICE-HOME, which corresponds to a variance of 11%. On VISDA this instability and seed dependence is even larger.

6. CONCLUSION

In this paper, we investigated how model selection strategies affect the performance of PDA methods. We performed a quantitative study with seven PDA methods and seven model selection strategies on two real-world datasets. Based on our findings, we provide the following recommendations: i) Labeled target samples should be used to test models before deploying them in a real-world scenario. While this breaks the main PDA assumption, it is impossible to confidently deploy PDA models selected without the use of target labels. Indeed, model selection strategies without target labels lead to a significant drop in performance in most cases in comparison to using a small validation set. We argue that the cost of labeling such a set outweighs the uncertainty in current model selection strategies. ii) The robustness of new PDA methods to randomness should be tested over at least three different seeds. We suggest using the seeds (2020, 2021, 2022) to allow for a fair comparison with our results. iii) An ablation study should be conducted when a novel architecture is proposed, to quantify the associated increase in performance. As our work focuses on a quantitative study of model selection methods and the reproducibility of state-of-the-art partial domain adaptation methods, we do not see any potential ethical concern. Future work will investigate new model selection strategies that can achieve results similar to model selection strategies that use labeled target samples.

One of our main claims regarding previous work is the use of target labels to choose the best model along training. This can be easily verified by inspecting the code. For PADA it can be seen on line 240 of the script "train_pada.py", for BA3US on line 116 of the script "run_partial.py", for M-POT on line 164 of the file "run_mOT.py", for SAFN in the "eval.py" file, and finally for AR on line 149 of the script "train.py".
For JUMBOT and M-POT, which are based on optimal transport, we used the optimal transport solvers from the POT library (Flamary et al., 2021).
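For intuition, the entropy-regularized solver at the heart of such methods can be sketched in a few lines of numpy. The toy implementation below is our illustration of the classic Sinkhorn iterations, not the stabilized and more general code that POT provides (e.g. `ot.sinkhorn`):

```python
import numpy as np

def sinkhorn(a, b, M, reg=0.1, n_iter=200):
    """Entropy-regularized OT plan between histograms a and b with cost M.

    Minimal sketch of the Sinkhorn-Knopp iterations; POT's ot.sinkhorn
    implements a log-stabilized version of the same fixed-point scheme.
    """
    K = np.exp(-M / reg)                  # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)                 # scale columns to match b
        u = a / (K @ v)                   # scale rows to match a
    return u[:, None] * K * v[None, :]    # transport plan

# uniform marginals over a random mini-batch cost matrix
rng = np.random.default_rng(0)
M = rng.random((4, 5))
a, b = np.full(4, 1 / 4), np.full(5, 1 / 5)
P = sinkhorn(a, b, M)
```

After the final row update the plan's row sums match a exactly, and the column sums converge to b as the iterations proceed.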

A.2 MODEL SELECTION

DEV requires learning a discriminative model to distinguish source samples from target samples; its neural network architecture must be specified, as well as the training details. We empirically observed the SVM-based variant to yield more stable results, so that is the one we used. In order to train the SVM discriminator, following (Saito et al., 2021), we take 3000 randomly chosen feature embeddings from the source samples used in training and 3000 from the target samples. We do an 80/20 split into training and test data. The SVM is trained with a linear kernel for a maximum of 4000 iterations. Of 5 different SVM models trained with decay values spaced evenly in log space between 10^-2 and 10^4, we choose the one with the highest accuracy (in distinguishing source from target features) on the test split. As for SND, it also requires specifying a temperature for the temperature scaling component of the strategy. We used the default value of 0.05 suggested in (Saito et al., 2021). Finally, we mention that the samples used for 100-RND were randomly selected and their list is made available together with the code. The samples used for 1-SHOT are the same as those used in semi-supervised domain adaptation. In Table 11, we show the accuracy per task on OFFICE-HOME averaged over three different seeds (2020, 2021, 2022) for all pairs of methods and model selection strategies. In Table 12, we compare previously reported results with ours on VISDA. While all proposed methods report results on OFFICE-HOME, only PADA and AR results are reported in the original papers for VISDA. Gu et al. (2021) (AR) also report results for BA3US. Analysing the results, we see a 9 percentage point decrease in average task accuracy for PADA, but our experiments show that there is a significant seed dependence, which we discuss in detail below. This is particularly important since Cao et al. (2018) (PADA) report results from a single run.
Comparing our best seeds for PADA on the SR and RS tasks, we achieve 58.01% and 67.9% accuracy versus the reported 53.53% and 76.5%. Moreover, we point out that the official code repository for PADA does not include the details needed to reproduce the VISDA experiments, so it is possible that minor tweaks (e.g., to the learning rate) are necessary. As for BA3US, our results are within one standard deviation, being better on the SR task and worse on the RS task. Finally, for AR we see a decrease in performance which, as the results on OFFICE-HOME show, can be explained by the differences in the neural network architecture.
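The SVM-based discriminator selection described in Section A.2 can be sketched as follows. The snippet is our illustration on toy Gaussian stand-ins for the feature embeddings, and it interprets the "decay" values as the inverse-regularization parameter C of scikit-learn's LinearSVC (an assumption on our part):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# toy stand-ins for the 3000 source / 3000 target feature embeddings
src_feats = rng.normal(0.0, 1.0, size=(300, 16))
tgt_feats = rng.normal(0.5, 1.0, size=(300, 16))
X = np.vstack([src_feats, tgt_feats])
y = np.concatenate([np.zeros(300), np.ones(300)])  # 0 = source, 1 = target
# 80/20 split into training and test data
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

best_acc, best_C = -1.0, None
for C in np.logspace(-2, 4, 5):          # 5 values evenly spaced in log space
    clf = LinearSVC(C=C, max_iter=4000)  # linear kernel, at most 4000 iterations
    clf.fit(X_tr, y_tr)
    acc = clf.score(X_te, y_te)          # accuracy at telling source from target
    if acc > best_acc:
        best_acc, best_C = acc, C
```

The discriminator with the highest test-split accuracy (`best_C`) is the one whose probabilities would then feed the DEV risk estimate.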



You et al. (2019) (DEV) use a multilayer perceptron, while Saito et al. (2021) (SND) use a Support Vector Machine in their reimplementation of DEV.

Hyper-parameters selected for the different methods for each model selection strategy on both OFFICE-HOME and VISDA.

Methods: S. ONLY, PADA, SAFN, BA3US, AR, JUMBOT, M-POT
Model selection strategies: S-ACC, ENT, DEV, SND, 1-SHOT, 100-RND, ORACLE
Architecture: ResNet50 backbone ⊕ linear bottleneck ⊕ linear classification head
Experimental protocol: 3 seeds on the 12 tasks of OFFICE-HOME and the 2 tasks of VISDA
Summary of our considered methods, model selection strategies, architecture and datasets.

Comparison between reported ( †) accuracies on partial OFFICE-HOME from published methods with our implementation using the ORACLE model selection strategy. * denotes different bottleneck architectures.

OFFICE-HOME (AC task)
  Naive:     52.60 63.10 44.48 52.30 26.75 17.67 49.01 16.72 30.63 32.12 49.67 5.01
  Heuristic: 58.45 63.10 60.96 56.24 45.79 55.16 49.01 45.61 30.63 46.27 49.67 49.67
VISDA (SR task)
  Naive:     39.06 36.99 1.14 35.89 54.53 11.99 75.04 55.33 36.11 52.82 53.26 0.83
  Heuristic: 67.50 34.94 38.76 47.23 54.53 66.42 75.04 55.33 85.36 52.82 53.26 52.82
Comparison between the naive model selection strategy and our heuristic approach. Accuracy on the AC task for OFFICE-HOME and the SR task for VISDA. Best results in bold.

Task accuracy averaged over seeds 2020, 2021, 2022 on Partial OFFICE-HOME and Partial VISDA for each PDA method and model selection strategy. For each method, we highlight the best and worst label-free model selection strategies in green and red, respectively.

Accuracy of different PDA methods based on different model selection strategies on the 2 Partial VISDA tasks. Average is done over three seeds (2020, 2021, 2022).

Supplementary material

Outline. The supplementary material of this paper is organized as follows:
• In Section A, we give more details on our experimental protocol.
• In Section B, we provide additional results from our experiments.

A ADDITIONAL DETAILS ON EXPERIMENTAL PROTOCOL

A.1 IMPLEMENTATIONS IN BENCHMARKPDA

In order to reimplement the different PDA methods, we adapted the code from the official repository associated with each paper. We list them in

A.3 OPTIMIZER

In general, all methods claim to adopt Nesterov's accelerated gradient as the optimization method, with a momentum of 0.9 and the weight decay set to 5 × 10^-4. The learning rate follows the annealing strategy of Ganin et al. (2016):

µ_p = µ_0 / (1 + α p)^β,

where p is the training progress linearly changing from 0 to 1, µ_0 = 0.01, α = 10 and β = 0.75. However, inspecting the official code repository of each PDA method, the learning schedule actually used is

µ_i = µ_0 / (1 + α i)^β,

where i is the iteration number in the training procedure, µ_0 = 0.01, α = 0.001 and β = 0.75. Only when the total number of iterations is 10000 do the two schedules match. In this work, we followed the latter since it is the one actually used. For OFFICE-HOME, all methods are trained for 5000 iterations, while for VISDA they are trained for 10000 iterations, with the exception of S. ONLY, which is trained for 1000 iterations on OFFICE-HOME and 5000 iterations on VISDA.
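The relation between the two schedules can be checked numerically. The sketch below (our illustration, using the parameter values quoted above) verifies that they coincide at every iteration when training lasts 10000 iterations, but diverge for a 5000-iteration budget:

```python
def lr_paper(p, mu0=0.01, alpha=10.0, beta=0.75):
    # Annealing schedule as stated in the papers; p is training progress in [0, 1].
    return mu0 / (1.0 + alpha * p) ** beta

def lr_code(i, mu0=0.01, alpha=0.001, beta=0.75):
    # Schedule found in the official repositories; i is the iteration number.
    return mu0 / (1.0 + alpha * i) ** beta

# With 10000 total iterations, 10 * (i / 10000) == 0.001 * i, so the two
# schedules agree at every step; with 5000 iterations they do not.
gap_10000 = max(abs(lr_paper(i / 10000) - lr_code(i)) for i in range(10001))
gap_5000 = max(abs(lr_paper(i / 5000) - lr_code(i)) for i in range(5001))
```

This is why, for the 5000-iteration OFFICE-HOME runs, the choice of schedule matters and we follow the one the official code actually implements.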

A.4 HYPER-PARAMETERS

In Table 9, we report the values used for each hyper-parameter in our grid search. We report in Table 10 the hyper-parameters chosen by each model selection strategy for each method on both datasets. In addition, for the reproducibility of AR with the architecture proposed in Gu et al. (2021), a feature normalization layer is added in the bottleneck, which requires specifying r, the value to which the ℓ2-norm of the features is set. This hyper-parameter is therefore included in the hyper-parameter grid search with the possible values {5, 10, 20}, which are the values used in the experiments in (Gu et al., 2021).
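Concretely, such a normalization layer rescales every feature vector to a fixed ℓ2-norm r. A minimal numpy sketch (our illustration, not the code of Gu et al. (2021)):

```python
import numpy as np

def feature_norm(x, r=10.0):
    # Rescale each row of x so its l2-norm equals r (r is the hyper-parameter
    # searched over {5, 10, 20}); the epsilon guards against zero vectors.
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return r * x / np.maximum(norms, 1e-12)

features = np.array([[3.0, 4.0], [0.5, 0.0]])
normalized = feature_norm(features, r=10.0)
```

Since every embedding then lies on a sphere of radius r, the scale of the features is fixed and only r needs to be tuned.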

B ADDITIONAL DISCUSSION OF RESULTS

In this section, we provide additional results that we could not add to the main paper due to space constraints.

