KEY DESIGN CHOICES FOR DOUBLE-TRANSFER IN SOURCE-FREE UNSUPERVISED DOMAIN ADAPTATION

Abstract

Fine-tuning and Domain Adaptation have emerged as effective strategies for efficiently transferring deep learning models to new target tasks. However, target domain labels are not accessible in many real-world scenarios. This has led to the development of Unsupervised Domain Adaptation (UDA) methods, which only employ unlabeled target samples. Furthermore, efficiency and privacy requirements may also prevent the use of source domain data during the adaptation stage. This particularly challenging setting, known as Source-free Unsupervised Domain Adaptation (SF-UDA), is still understudied. In this paper, we systematically analyze the impact of the main design choices in SF-UDA through a large-scale empirical study on 500 models and 74 domain pairs. We identify the normalization approach, pretraining strategy, and backbone architecture as the most critical factors. Based on our observations, we propose recipes to best tackle SF-UDA scenarios. Moreover, we show that SF-UDA also performs competitively beyond standard benchmarks and backbone architectures, performing on par with UDA at a fraction of the data and computational cost. Experimental data and code will be released.

1. INTRODUCTION

The recent success of deep neural networks (DNNs) in many tasks and domains often relies on the availability of large annotated datasets. This requirement can be mitigated by pre-training DNNs on a large dataset and then fine-tuning their weights with target task data (Huh et al., 2016; Yosinski et al., 2014; Chu et al., 2016). Furthermore, fine-tuning is usually simpler and faster than training the model from scratch, the dataset size can be smaller, and the final performance is typically higher (with some exceptions: see Kornblith et al., 2019). This approach is very convenient: the model requires a single expensive pre-training and can later be re-used for multiple downstream tasks. It is a good example of transfer learning (Zhuang et al., 2021), which leverages the information acquired on one task to improve accuracy on another task of interest. Two relevant examples of transfer learning are Domain Adaptation (DA), which, given different yet related tasks, exploits source domain(s) data to improve performance on different known target domain(s), and Domain Generalization (DG), which aims to generalize to unknown target(s). As opposed to fine-tuning, in which the pre-training and downstream tasks can differ significantly, DA and DG require stronger assumptions on the similarity between tasks, e.g., leveraging synthetic images to improve the classification of real images that share the same label space. DA is also related to Multi-task Learning (MTL) (Caruana, 1997; Ciliberto et al., 2017) and Multi-domain Learning (MDL) (Joshi et al., 2012): domains can be seen as tasks in MTL or MDL. However, in those settings an explicit domain label is provided and annotated examples are available for each task.
A particularly challenging and useful setting in practice is Unsupervised Domain Adaptation (UDA) (Tzeng et al., 2017; Ganin et al., 2016), in which labeled samples from a source domain are used together with unlabeled samples from the target domain to improve performance on the latter. This work focuses on Source-Free Unsupervised Domain Adaptation (SF-UDA) (Liang et al., 2020) for the image classification task. SF-UDA is a two-step sequential version of UDA in which the source-domain labeled data is only accessible in the first training phase. Adaptation to the new domain is carried out in a second stage where only the unlabeled data from the target domain is available. SF-UDA nicely matches applications where continual adaptation is required under computational and memory constraints, or where privacy policies prevent access to the source data. Since these techniques are usually applied to fine-tune and adapt models with pre-trained weights, there are two different transfers at play: (1) from the base task (used for pre-training) to the source domain, and (2) from the source domain to the target domain; we refer to this combined transfer as double-transfer. The objective of this work is two-fold. Firstly, we aim to study the impact of the main design choices in SF-UDA approaches, such as the backbone architecture, the pre-training dataset, and the way double-transfer is performed. Secondly, we aim to investigate the strengths and failure modes of SF-UDA methods, also comparing them with standard UDA approaches. Recent works (Liang et al., 2020; Ding et al., 2022) show that SF-UDA techniques achieve comparable performance with state-of-the-art UDA methods on common benchmarks. In contrast, Kim et al. (2022) report that recent UDA methods perform well on standard benchmarks because they overfit the task. Indeed, when employed in other settings (e.g., non-standard architectures or datasets) they result in worse accuracy than previous methods.
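The two-stage structure of SF-UDA can be illustrated with a toy sketch. Everything below is hypothetical: a nearest-centroid model stands in for the feature extractor and classifier, and the second stage uses simple pseudo-label self-training (loosely in the spirit of self-training methods such as SHOT, not an algorithm from this paper). The key property of SF-UDA is visible in the code: stage 2 touches only unlabeled target data.

```python
import numpy as np

rng = np.random.default_rng(0)

def fit_centroids(X, y, n_classes):
    # Stage 1: supervised training on labeled source data.
    return np.stack([X[y == c].mean(axis=0) for c in range(n_classes)])

def predict(X, centroids):
    # Assign each sample to its nearest class centroid.
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
    return d.argmin(axis=1)

def source_free_adapt(X_tgt, centroids, n_rounds=5):
    # Stage 2: adaptation with unlabeled target data only --
    # the labeled source set is no longer accessible here.
    for _ in range(n_rounds):
        pseudo = predict(X_tgt, centroids)          # pseudo-labels
        for c in range(len(centroids)):
            if (pseudo == c).any():
                centroids[c] = X_tgt[pseudo == c].mean(axis=0)
    return centroids

# Toy 2-class source domain and a covariate-shifted target domain.
X_src = np.concatenate([rng.normal(0, 1, (100, 2)), rng.normal(4, 1, (100, 2))])
y_src = np.array([0] * 100 + [1] * 100)
X_tgt = X_src + np.array([1.5, 1.5])                # same labels, shifted inputs
y_tgt = y_src

centroids = fit_centroids(X_src, y_src, n_classes=2)
acc_before = (predict(X_tgt, centroids) == y_tgt).mean()
centroids = source_free_adapt(X_tgt, centroids)
acc_after = (predict(X_tgt, centroids) == y_tgt).mean()
print(f"target accuracy before/after adaptation: {acc_before:.2f} / {acc_after:.2f}")
```

On this toy shift, the source-trained centroids misclassify part of the target, and the pseudo-label rounds move them onto the target clusters without revisiting source samples.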
We investigate with targeted experiments whether overfitting affects recent SF-UDA methods as well. We pursue these objectives through large-scale systematic experiments encompassing more than 500 different architectures on 6 separate domain adaptation datasets, totaling 23 domains and 74 domain shifts. We employ different probing and SF-UDA methods to better analyze the functioning of double-transfer methods, providing a ready-to-use recipe for effective system design. Our main findings are as follows:

• The choice of pre-training dataset and the resulting accuracy on the ImageNet top-1 benchmark directly impact domain generalization and SF-UDA performance, for both CNNs and Vision Transformers.

• Most SF-UDA methods fine-tune the model on the source domain before adaptation. However, we show that in some cases this causes severe performance degradation. Specifically, we identify the type of normalization layers as playing a critical role in this context.

• Besides fine-tuning, the normalization strategy heavily affects the failure rate of SF-UDA in general. We present a large-scale analysis of SF-UDA methods' failure rates, comparing architectures with Layer Normalization (LN) (Ba et al., 2016) and Batch Normalization (BN) (Ioffe & Szegedy, 2015) and attesting to the improved robustness of the former.

• SF-UDA methods, like SHOT (Liang et al., 2020), SCA (see Sec. 3.2), and NRC (Yang et al., 2021a), also perform well with architectures and datasets different from the usual benchmark ones and are competitive with state-of-the-art UDA methods.

The full list of architectures, results in csv format, the code, and pre-trained weights will be released¹.
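The contrast between BN and LN under domain shift can be made concrete with a small numerical sketch (a toy illustration, not an experiment from this paper): at inference, BN normalizes with mean/variance statistics stored from the source domain, while LN recomputes statistics per sample, so its output remains normalized even when the input distribution shifts.

```python
import numpy as np

rng = np.random.default_rng(42)

# "Source" features: BatchNorm stores their running mean/var during training.
src = rng.normal(loc=0.0, scale=1.0, size=(1000, 8))
bn_mean, bn_var = src.mean(axis=0), src.var(axis=0)

# "Target" features arrive with a distribution shift.
tgt = rng.normal(loc=3.0, scale=2.0, size=(1000, 8))

# BatchNorm at inference: normalize with the stored *source* statistics.
bn_out = (tgt - bn_mean) / np.sqrt(bn_var + 1e-5)

# LayerNorm: normalize each sample with its *own* statistics,
# independently of any stored source-domain state.
ln_out = (tgt - tgt.mean(axis=1, keepdims=True)) / tgt.std(axis=1, keepdims=True)

print("BN output mean under shift:", bn_out.mean())   # far from 0
print("LN output mean under shift:", ln_out.mean())   # ~0 by construction
```

The stale source statistics leave the BN output heavily off-center on target data, while the LN output stays zero-mean per sample; this mechanism is one intuition for the robustness gap analyzed above.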

2. BACKGROUND AND RELATED WORK

Let X be the input space, e.g., the image space, Z ⊆ R^D a representation space, i.e., the feature space, and Y = {1, . . . , C} the output space for multi-class classification. A feature extractor (backbone) is a function f_θ : X → Z with parameters θ, while a classifier is a function h_ϕ : Z → Y with parameters ϕ that assigns a label to any feature vector. We introduce two data distributions over X × Y: µ_S, which models the source domain, and µ_T, which models the target domain. Domain Generalization (DG). In this work, we consider the DG setting as the reference task to investigate transferability among domains. First introduced by Blanchard et al. (2011), its goal is to find a feature extractor f_θ and a classifier h_ϕ, from N i.i.d. samples of a given domain (the source µ_S), that will perform well on other unseen domains (in our case, just the target domain µ_T), that is: min_{θ,ϕ} E_{(x,y)∼µ_T} [1{h_ϕ(f_θ(x)) ≠ y}] given (x_i, y_i)_{i=1}^N ∼ µ_S^N, where 1{condition} is the indicator function. In this setting, some assumptions on the relationship between the tasks (µ_S and µ_T) are needed, but only data sampled from the source domain can be used for training, while target domain data is accessible at test time only. We remark that this is similar to the standard supervised-learning problem of generalization (Gen), with the exception that, here, the training and test distributions differ. Several methods specifically target this problem, e.g., (Volpi et al., 2018; Arjovsky et al., 2019; Ilse et al., 2020); see Wang et al. (2021) for a review.
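The DG objective above can be estimated empirically with a toy sketch. All components here are hypothetical stand-ins: f_theta is a fixed linear map playing the role of the feature extractor, h_phi a threshold classifier on the feature space, and the source/target distributions are Gaussians related by a covariate shift.

```python
import numpy as np

rng = np.random.default_rng(7)

def f_theta(x):
    # Toy feature extractor f_theta: a fixed linear map X -> Z.
    W = np.array([[1.0, 0.5], [0.5, 1.0]])
    return x @ W

def h_phi(z):
    # Toy classifier h_phi on the feature space Z.
    return (z[:, 0] > 1.0).astype(int)

def zero_one_risk(X, y):
    # Monte-Carlo estimate of E_{(x,y)~mu}[ 1{ h_phi(f_theta(x)) != y } ].
    return (h_phi(f_theta(X)) != y).mean()

# Samples from a source distribution mu_S and a covariate-shifted target mu_T.
n = 500
X_src = np.concatenate([rng.normal(-1, 0.5, (n, 2)), rng.normal(2, 0.5, (n, 2))])
X_tgt = X_src + np.array([2.0, 0.0])        # same labels, shifted inputs
y = np.array([0] * n + [1] * n)

print("source 0-1 risk:", zero_one_risk(X_src, y))
print("target 0-1 risk:", zero_one_risk(X_tgt, y))
```

A model fitted to µ_S attains low risk there, but the same indicator-loss estimate grows on the shifted µ_T; DG asks to minimize the latter while training only on the former.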



¹ https://www.github.com/anonymous_author/double_transfer/

