A REPRODUCIBLE AND REALISTIC EVALUATION OF PARTIAL DOMAIN ADAPTATION METHODS

Anonymous authors
Paper under double-blind review

Abstract

Unsupervised Domain Adaptation (UDA) aims at classifying unlabeled target images by leveraging labeled source ones. In this work, we consider the Partial Domain Adaptation (PDA) variant, where extra source classes are not present in the target domain. Most successful algorithms use model selection strategies that rely on target labels to find the best hyper-parameters and/or models during training. However, these strategies violate the main assumption in PDA: only unlabeled target domain samples are available. Moreover, there are also inconsistencies in the experimental settings (architecture, hyper-parameter tuning, number of runs), yielding unfair comparisons. The main goal of this work is to provide a realistic evaluation of PDA methods with the different model selection strategies under a consistent evaluation protocol. We evaluate 7 representative PDA algorithms on 2 different real-world datasets using 7 different model selection strategies. Our two main findings are: (i) without target labels for model selection, the accuracy of the methods decreases by up to 30 percentage points; (ii) only one method and model selection pair performs well on both datasets. Experiments were performed with our PyTorch framework, BenchmarkPDA, which we open source.

1. INTRODUCTION

Domain adaptation. Deep neural networks are highly successful in image recognition for in-distribution samples (He et al., 2016), a success intrinsically tied to large amounts of labeled training data. However, they tend not to generalize as well to images with backgrounds or colors not seen during training. Such a shift in the samples is referred to as domain shift in the literature. Unfortunately, enriching the training set with new samples from different domains is challenging, as labeling data is both an expensive and time-consuming task. Thus, researchers have focused on unsupervised domain adaptation (UDA), where we have access to unlabeled samples from a different domain, known as the target domain. The purpose of UDA is to classify these unlabeled samples by leveraging the knowledge given by the labeled samples from the source domain (Pan & Yang, 2010; Patel et al., 2015). In the standard UDA problem, the source and target domains are assumed to share the same classes. In this paper, we consider a more challenging variant of the problem called partial domain adaptation (PDA): the classes in the target domain Y_t form a subset of the classes in the source domain Y_s (Cao et al., 2018), i.e., Y_t ⊂ Y_s. The number of target classes is unknown, as we do not have access to the labels. The extra source classes, not present in the target domain, make the PDA problem more difficult: simply aligning the source and target domains forces a negative transfer where target samples are matched to outlier source-only labels.

Realistic evaluations. Most recent PDA methods report an increase in target accuracy of up to 15 percentage points on average when compared to the baseline approach that uses only source domain samples. While these successes constitute important breakthroughs in the DA research literature, target labels are used for model selection, violating the main UDA assumption.
In their absence, the effectiveness of PDA methods remains unclear, and model selection constitutes a yet unsolved problem, as we show in this work. Moreover, the hyper-parameter tuning is either unknown or lacks details, and sometimes requires labeled target data, which makes it challenging to apply PDA methods to new datasets. Recent work has highlighted the importance of model selection in the presence of domain shift. Gulrajani & Lopez-Paz (2021) showed that, when domain generalization (DG) algorithms, whose goal is to generalize to a completely unseen domain, are evaluated in a consistent and realistic setting, no method outperforms the baseline ERM method by more than 1 percentage point.

We list below our major findings:
• The accuracy attained by models selected without target labels can decrease by up to 30 percentage points compared to the accuracy reported when using target labels (see Table 1 for a summary of results).
• Only one pair of PDA method and target-label-free model selection strategy achieves accuracies comparable to those obtained when target labels are used, while still improving over a source-only baseline.
• The random seed plays an important role in the selection of hyper-parameters. Selected parameters are not stable across different seeds, and the standard deviation between accuracies on the same task can reach 8.4%, even when relying on target labels for model selection.
• Under a more realistic scenario where some target labels are available, 100 random labeled samples are enough to limit the drop in accuracy to 1 percentage point (when compared to using all target samples). However, the extreme case of using only one labeled target sample per class leads to a significant drop in performance.

Outline. In Section 2, we provide an overview of the different model selection strategies considered in this work. Then, in Section 3, we discuss the PDA methods that we consider.
In Section 4, we describe the training procedures, hyper-parameter tuning, and evaluation protocols used to evaluate all methods fairly. In Section 5, we discuss the results of the different benchmarked methods and the performance of the different model selection strategies. Finally, in Section 6, we give some recommendations for future work in partial domain adaptation.

2. MODEL SELECTION STRATEGIES: AN OVERVIEW

Model selection (choosing hyper-parameters, training checkpoints, neural network architectures) is a crucial part of training neural networks. In the supervised learning setting, a labeled validation set is used to estimate the model's accuracy. In UDA, however, such an approach is not directly possible, as the target samples are unlabeled. Several strategies have been designed to address this issue; below, we discuss the ones used in this work.

Source Accuracy (S-ACC). Ganin & Lempitsky (2015) used the accuracy estimated on a small validation set from the source domain to perform model selection. While the source and target accuracies are related, there are no theoretical guarantees, and You et al. (2019) showed that this approach fails to select competitive models when the domain gap is large.
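The S-ACC criterion reduces to picking the candidate with the highest held-out source accuracy. The following is a minimal sketch, not BenchmarkPDA code; the function names are illustrative, and candidate checkpoints are represented by their precomputed logits on the labeled source validation set:

```python
import numpy as np

def accuracy(logits, labels):
    """Top-1 accuracy of class logits against integer labels."""
    return float(np.mean(np.argmax(logits, axis=1) == labels))

def select_by_source_accuracy(checkpoint_logits, source_val_labels):
    """S-ACC model selection: among candidate checkpoints (given here as
    precomputed logits on a held-out labeled *source* validation set),
    return the index of the most accurate one and all accuracies.
    No target labels are used anywhere."""
    accs = [accuracy(l, source_val_labels) for l in checkpoint_logits]
    return int(np.argmax(accs)), accs

# Toy example: 3 "checkpoints", 4 source validation samples, 2 classes.
labels = np.array([0, 1, 0, 1])
ckpts = [
    np.array([[2., 0.], [2., 0.], [2., 0.], [2., 0.]]),  # predicts all 0
    np.array([[2., 0.], [0., 2.], [2., 0.], [0., 2.]]),  # all correct
    np.array([[0., 2.], [0., 2.], [0., 2.], [0., 2.]]),  # predicts all 1
]
best, accs = select_by_source_accuracy(ckpts, labels)  # best == 1
```

As You et al. (2019) point out, a checkpoint selected this way can still perform poorly on the target domain when the domain gap is large, since nothing in the criterion looks at target data.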

Importance-Weighted Cross-Validation (IWCV) and Deep Embedded Validation (DEV). Sugiyama et al. (2007) and Long et al. (2018) perform model selection through Importance-Weighted Cross-Validation (IWCV). Under the assumption that the conditional label distribution is shared across domains (covariate shift), the target risk equals the source risk reweighted by the density ratio w(x) = p_t(x)/p_s(x), so the importance-weighted risk on a labeled source validation set serves as the selection criterion. You et al. (2019) built on this idea with Deep Embedded Validation (DEV), which estimates the importance weights with a domain discriminator in the learned feature space and adds a control variate to reduce the variance of the risk estimate.
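The importance-weighted risk at the core of IWCV can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: density-ratio estimation is abstracted away (the `weights` argument is assumed to be given), and the function names are ours:

```python
import numpy as np

def iwcv_risk(losses, weights):
    """Importance-weighted validation risk (sketch of the IWCV criterion).

    losses  : per-sample losses of a model on labeled *source* validation data
    weights : estimated density ratios w(x) = p_t(x) / p_s(x) for those samples

    Under covariate shift, E_t[loss] = E_s[w(x) * loss], so the weighted
    source risk estimates the (unobservable) target risk.
    """
    losses = np.asarray(losses, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.mean(weights * losses))

def select_by_iwcv(candidate_losses, weights):
    """Return the index of the candidate with the lowest weighted risk."""
    risks = [iwcv_risk(l, weights) for l in candidate_losses]
    return int(np.argmin(risks)), risks

# Toy example: samples 0 and 2 resemble the target (high weight), so a
# model that is accurate on them wins even if it fails elsewhere.
weights = [2.0, 0.5, 1.0, 0.5]
losses_a = [0.1, 0.9, 0.1, 0.9]   # good on target-like samples
losses_b = [0.5, 0.5, 0.5, 0.5]   # uniformly mediocre
best, risks = select_by_iwcv([losses_a, losses_b], weights)  # best == 0
```

DEV refines this estimator: the weights come from a domain discriminator on the learned features, and a control variate shrinks the variance that plain importance weighting suffers from when the weights are heavy-tailed.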

Table 1: Task accuracy averaged over three different seeds (2020, 2021, 2022) on Partial OFFICE-HOME and Partial VISDA. For each dataset and PDA method, we display the results of the worst and best performing model selection strategies that do not use target labels, as well as the ORACLE model selection strategy. All results can be found in Table 6.

They argue that DG methods without a model selection strategy remain incomplete and that the strategy should therefore be specified as part of the method. A similar recommendation was made by Saito et al. (2021) for domain adaptation. PDA methods have been designed using target labels at test time to select the best models. Parallel work (Saito et al., 2021; You et al., 2019) on model selection strategies for domain adaptation claimed to select the best models without using target labels. However, a realistic empirical study of these strategies in PDA is still lacking. In this work, we conduct extensive experiments to study the impact of model selection strategies on the performance of partial domain adaptation methods. We evaluate 7 different PDA methods over 7 different model selection strategies, 4 of which do not use target labels, and 2 different datasets under the same experimental protocol for a fair comparison.
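The evaluation protocol (every PDA method crossed with every model selection strategy, dataset, and seed, then aggregated as mean and standard deviation) can be sketched as a plain grid loop. This is an outline of the bookkeeping only, not BenchmarkPDA itself; `train_and_select` is a hypothetical callback standing in for one full training run followed by model selection, returning the target accuracy of the chosen model:

```python
import itertools

def run_benchmark(methods, strategies, datasets, seeds, train_and_select):
    """Evaluate every (method, strategy, dataset) cell over the same seeds
    and record mean/std target accuracy, so all methods are compared
    under an identical experimental budget."""
    results = {}
    for method, strategy, dataset in itertools.product(methods, strategies, datasets):
        accs = [train_and_select(method, strategy, dataset, seed) for seed in seeds]
        mean = sum(accs) / len(accs)
        std = (sum((a - mean) ** 2 for a in accs) / len(accs)) ** 0.5
        results[(method, strategy, dataset)] = (mean, std)
    return results

# Toy stand-in: accuracy depends only on the seed, to show the aggregation.
def fake_run(method, strategy, dataset, seed):
    return 0.5 + 0.1 * seed

results = run_benchmark(["method_a", "method_b"], ["S-ACC", "DEV"],
                        ["dataset_1"], [0, 1, 2], fake_run)
```

Reporting the per-cell standard deviation is what surfaces the seed-sensitivity finding above: a strategy whose selected hyper-parameters vary across seeds shows a large std even when its mean looks competitive.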

