DIVIDE TO ADAPT: MITIGATING CONFIRMATION BIAS FOR DOMAIN ADAPTATION OF BLACK-BOX PREDICTORS

Abstract

Domain Adaptation of Black-Box Predictors (DABP) aims to learn a model on an unlabeled target domain supervised by a black-box predictor trained on a source domain. It requires access to neither the source-domain data nor the predictor parameters, thus addressing the data privacy and portability issues of standard domain adaptation methods. Existing DABP approaches mostly rely on knowledge distillation (KD) from the black-box predictor, i.e., training the model with its noisy target-domain predictions, which, however, inevitably introduces confirmation bias accumulated from the prediction noise and degrades performance. To mitigate such bias, we propose a new strategy, divide-to-adapt, that purifies cross-domain knowledge distillation by proper domain division. This is inspired by an observation we make for the first time in domain adaptation: the target domain usually contains easy-to-adapt and hard-to-adapt samples with different levels of domain discrepancy w.r.t. the source domain, and deep models tend to fit easy-to-adapt samples first. Leveraging easy-to-adapt samples with less noise can help KD alleviate the negative effect of prediction noise from black-box predictors. In this sense, the target domain can be divided into an easy-to-adapt subdomain with less noise and a hard-to-adapt subdomain at the early stage of training, and the adaptation is then achieved by semi-supervised learning. We further reduce the distribution discrepancy between subdomains and develop a weak-strong augmentation strategy to filter the predictor errors progressively. As such, our method is a simple yet effective solution to reduce error accumulation in cross-domain knowledge distillation for DABP. Moreover, we prove that the target error of DABP is bounded by the noise ratio of the two subdomains, i.e., the confirmation bias, which provides theoretical justification for our method.
Extensive experiments demonstrate that our method achieves state-of-the-art results on all DABP benchmarks, outperforming the existing best approach by 9.5% on VisDA-17, and is even comparable with standard domain adaptation methods that use the source-domain data.

1. INTRODUCTION

Unsupervised domain adaptation (UDA) (Pan & Yang, 2009) aims to transfer knowledge from a labeled source domain to an unlabeled target domain and has wide applications (Tzeng et al., 2015; Hoffman et al., 2018; Zou et al., 2021). However, UDA methods require access to the source-domain data, thus raising concerns about data privacy and portability. To address these concerns, Domain Adaptation of Black-box Predictors (DABP) (Liang et al., 2022) was introduced recently, which aims to learn a model with only the unlabeled target-domain data and a black-box predictor trained on the source domain, e.g., an API in the cloud, so as to avoid the privacy and safety issues arising from the leakage of data and model parameters.

A few efforts have been made to solve the DABP problem. One of them leverages knowledge distillation (Hinton et al., 2015) and trains the target model to imitate predictions from the source predictor (Liang et al., 2022). Another adopts learning with noisy labels (LNL) methods to select clean samples from the noisy target-domain predictions for model training (Zhang et al., 2021). Though inspiring, they have the following limitations. (i) Learning the noisy pseudo labels for knowledge distillation inevitably leads to confirmation bias (Tarvainen & Valpola, 2017), i.e., accumulated model prediction errors. (ii) The LNL-based methods aim to select a clean subset of the target domain to train the model, which limits the model's performance due to a decreased amount of usable training data. (iii) Existing DABP methods lack theoretical justification.

To address the aforementioned issues, this work proposes a simple yet effective strategy, divide-to-adapt, which suppresses the confirmation bias by purifying cross-domain knowledge distillation. Intuitively, the divide-to-adapt strategy divides the target domain into an easy-to-adapt subdomain with less prediction noise and a hard-to-adapt subdomain.
This is inspired by a popular observation: deep models tend to learn clean samples faster than noisy samples (Arpit et al., 2017). For domain adaptation, we make a similar discovery: deep models tend to learn easy-to-adapt samples faster than hard-to-adapt samples, and thus we can leverage the loss distribution of cross-domain knowledge distillation at the early stage of training for subdomain division. By taking the easy-to-adapt subdomain as a labeled set and the hard-to-adapt subdomain as an unlabeled set, we can solve the DABP problem with prevailing semi-supervised learning methods (Berthelot et al., 2019; Sohn et al., 2020). The divide-to-adapt strategy purifies the target domain progressively for knowledge distillation while fully utilizing the whole target dataset without wasting any samples.

To implement the above strategy, this paper proposes Black-Box ModEl AdapTation by DomAin Division (BETA), which introduces two key modules to suppress the confirmation bias progressively. Firstly, we divide the target domain into an easy-to-adapt subdomain and a hard-to-adapt subdomain by fitting the loss distribution with a Gaussian Mixture Model (GMM) and setting a threshold. The easy-to-adapt samples with less noise help purify the cross-domain knowledge distillation for DABP. Secondly, we propose mutually-distilled twin networks with weak-strong augmentation on the two subdomains to progressively mitigate error accumulation. The distribution discrepancy between the two subdomains is further aligned by an adversarial regularizer to enable prediction consistency on the target domain. A domain adaptation theory is further derived to provide justification for BETA.

We make the following contributions. (i) We propose a novel BETA framework for the DABP problem that iteratively suppresses the error accumulation of model adaptation from the black-box source-domain predictor. To the best of our knowledge, this is the first work that addresses the confirmation bias for DABP.
(ii) We theoretically show that the error of the target domain is bounded by the noise ratio of the hard-to-adapt subdomain, and empirically show that this error can be suppressed progressively by BETA. (iii) Extensive experiments demonstrate that our proposed BETA achieves state-of-the-art performance consistently on all benchmarks. It outperforms the existing best method by 9.5% on the challenging VisDA-17 and 2.0% on DomainNet.

2. RELATED WORK

Unsupervised Domain Adaptation. Unsupervised domain adaptation aims to adapt a model from a labeled source domain to an unlabeled target domain. Early UDA methods rely on feature projection (Pan et al., 2010a) and sample selection (Sugiyama et al., 2007) for classic machine learning models. With the development of deep representation learning, deep domain adaptation methods yield strong performance in challenging UDA scenarios. Inspired by the two-sample test, discrepancy minimization of feature distributions (Koniusz et al., 2017; Yang et al., 2021b; Xu et al., 2022a) is proposed to learn domain-invariant features (Cui et al., 2020a) based on statistical moment matching (Tzeng et al., 2014; Sun & Saenko, 2016). Domain adversarial learning further employs a domain discriminator to achieve the same goal (Ganin et al., 2016; Zou et al., 2019; Yang et al., 2020b) and achieves remarkable results. Other effective techniques for UDA include entropy minimization (Grandvalet & Bengio, 2005; Xu et al., 2021), contrastive learning (Kang et al., 2019), domain normalization (Wang et al., 2019; Chang et al., 2019), semantic alignment (Xie et al., 2018; Yang et al., 2021a), meta-learning (Liu et al., 2020), self-supervision (Saito et al., 2020), semi-supervised learning (Berthelot et al., 2021), curriculum learning (Zhang et al., 2017; Shu et al., 2019), intra-domain alignment (Pan et al., 2020), knowledge distillation (Yang et al., 2020a), and self-training (Chen et al., 2020; Zou et al., 2018). Despite their effectiveness, they require access to the source-domain data and therefore raise privacy and portability concerns.

Unsupervised Model Adaptation and DABP. Without accessing the source domain, unsupervised model adaptation, i.e., source-free UDA, has attracted increasing attention since it loosens the assumption and benefits more practical scenarios (Guan & Liu, 2021).
Early research provides a theoretical analysis of hypothesis transfer learning (Kuzborskij & Orabona, 2013), which motivates deep domain adaptation without source data (Liang et al., 2020a; Huang et al., 2021; Li et al., 2020; Xu et al., 2022b). Liang et al. propose to train the feature extractor by self-supervised learning and mutual information maximization with the classifier frozen (Liang et al., 2020a). This paper deals with a more challenging problem: leveraging only the labels from the model trained on the source domain for model adaptation. Few works have been conducted in this field: (Zhang et al., 2021) proposes a noisy label learning method based on sample selection, while (Liang et al., 2022) uses knowledge distillation with information maximization. In contrast, we propose to perform domain division and suppress confirmation bias for cross-domain knowledge distillation.

Confirmation Bias in Semi-Supervised Learning. Confirmation bias refers to the noise accumulation when the model is trained using incorrect predictions for semi-supervised or unsupervised learning (Tarvainen & Valpola, 2017). Such bias can cause the model to overfit the noisy feature space and then resist new changes (Arazo et al., 2020). In UDA, pseudo-labeling (Saito et al., 2017; Gu et al., 2020; Morerio et al., 2020) and knowledge distillation (Liang et al., 2020a; Kundu et al., 2019; Zhou et al., 2020) are effective techniques but can be degraded by confirmation bias. Especially for transfer tasks with distant domains, the pseudo labels for the target domain are very noisy and deteriorate the subsequent epochs of training. To alleviate the confirmation bias, several solutions have been proposed, including co-training (Qiao et al., 2018; Li et al., 2019), Mixup (Zhang et al., 2018; Chen et al., 2019), and data-augmented unlabeled examples (Cubuk et al., 2019). Our BETA is the first work that formulates and addresses the confirmation bias for DABP.

3. METHODOLOGY

The idea of our proposed BETA is to mitigate the confirmation bias for DABP by dividing the target domain into two subdomains with different adaptation difficulties. As shown in Figure 1, BETA relies on two designs to suppress error accumulation: a domain division module that purifies the target domain into a cleaner subdomain and turns DABP into a semi-supervised learning task, and a two-network mechanism (i.e., mutually-distilled twin networks) that further diminishes the self-training errors through information exchange. We first introduce the problem formulation and the key modules, and then detail the algorithmic instantiation.

3.1. PROBLEM FORMULATION

In DABP, we are given an unlabeled target domain $X_t \sim \mathcal{D}_T$ and a predictor $h_s$ trained on a labeled source domain $\mathcal{D}_S$, where a domain shift (Ben-David et al., 2007) exists between $\mathcal{D}_S$ and $\mathcal{D}_T$. The objective is to learn a mapping model $X_t \rightarrow Y_t$. Different from standard UDA (Pan et al., 2010b; Tzeng et al., 2014; Long et al., 2017), DABP prohibits the model from accessing the source-domain data $X_s, Y_s$ and the parameters of the source model $h_s$; only a black-box predictor trained on the source domain, i.e., an API, is available. Confronted with these constraints, we can only resort to the hard predictions of the target domain from the source predictor, i.e., $\tilde{Y}_t = h_s(X_t)$, in the DABP setting.

3.2. DOMAIN DIVISION

Different from the strategy of directly utilizing $X_t, \tilde{Y}_t$ for knowledge distillation (Liang et al., 2022), we propose to divide the target domain $X_t$ into an easy-to-adapt subdomain $X_e \sim \mathcal{D}_e$ and a hard-to-adapt subdomain $X_h \sim \mathcal{D}_h$, with $X_t = X_e \cup X_h$. Previous studies show that deep models are prone to fitting clean examples faster than noisy examples (Arpit et al., 2017). In domain adaptation, the target domain consists of samples with different similarities to the source-domain samples, and we find that deep models are prone to first fitting the easy-to-adapt samples that are more similar to the source domain. Based on this observation, we can obtain the two subdomains with different domain discrepancies from the training loss. As shown in Figure 2, two peaks appear in the loss distribution on the target-domain data, and each peak corresponds to one subdomain. We further calculate the A-distance (Ben-David et al., 2010) ($d_\mathcal{A}$) between each subdomain and the source domain, and the result shows that $\mathcal{D}_e$ has less domain discrepancy than $\mathcal{D}_h$. This can be observed more intuitively from the subdomain samples in the appendix.

Inspired by this observation, we first warm up the network, e.g., a CNN, for several epochs, and then obtain the loss distribution by calculating the per-sample cross-entropy loss for a K-way classification problem as
$\mathcal{L}_{ce}(x_i^t) = -\sum_{k=1}^{K} \tilde{y}_i^k \log h_t^k(x_i^t)$, (1)
where $h_t^k$ is the softmax probability for class k from the target model. As shown in Figure 1, the loss distribution appears to be bimodal, and the two peaks indicate the clean and noisy clusters, which can be fitted by a GMM using Expectation-Maximization (EM) (Li et al., 2019). In noisy label learning, the clean and noisy subset division is achieved by a Beta Mixture Model (BMM) (Arazo et al., 2019).
However, in DABP, the noisy pseudo labels obtained by $h_s$ are dominated by asymmetric noise (Kim et al., 2021), i.e., noise that does not follow a uniform distribution. In this case, the BMM leads to undesirably flat distributions and cannot work effectively for our task (Song et al., 2022). Asymmetric noise in $\tilde{Y}_t$ also causes the model to make overconfident predictions and generate near-zero losses, which hinders the domain division by the GMM. To better fit the losses of the target domain with asymmetric noise, the negative entropy is used as a regularizer in the warm-up phase, defined as
$\mathcal{L}_{ne} = \sum_{k=1}^{K} h_t^k(x_i^t) \log h_t^k(x_i^t)$.
After fitting the loss distribution to a two-component GMM via the Expectation-Maximization algorithm, the clean probability $\varrho_i^c$ is the posterior probability $p(c \mid l_i(x_i^t))$, where c is the Gaussian component with the smaller mean loss and $l_i(x_i^t)$ is the cross-entropy loss of $x_i^t$. Then the clean and noisy subdomains are divided by setting a threshold $\tau$ on the clean probabilities:
$X_e = \{(x_i, \tilde{y}_i) \mid (x_i, \tilde{y}_i) \in (X_t, \tilde{Y}_t),\ \varrho_i^c \geq \tau\}$, (3)
$X_h = \{(x_i, \tilde{p}_i) \mid x_i \in X_t,\ \varrho_i^c < \tau\}$,
where $\tilde{p}_i = h_s(x_i^t)$ is the softmax probability. Intuitively, the clean subdomain consists of easy-to-adapt samples, while the noisy subdomain consists of hard-to-adapt samples. Semi-supervised learning methods (Berthelot et al., 2019) can then be directly applied with $X_e$ as the labeled set and $X_h$ as the unlabeled set. Compared to sample selection (Zhang et al., 2021) and single distillation (Liang et al., 2022), domain division enables the utilization of all accessible samples via semi-supervised learning and dilutes the risk of confirmation bias by leveraging cleaner supervision signals for model adaptation.
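To make the division step concrete, the following sketch (ours, not the authors' released code) computes the per-sample cross-entropy of Eq. (1) over pre-computed softmax outputs, fits a two-component 1-D GMM to the losses with a hand-rolled EM loop (instead of an off-the-shelf library, to stay self-contained), and splits the target set by thresholding the clean posterior at τ; all function names are hypothetical.

```python
import numpy as np

def per_sample_ce(probs, pseudo_labels):
    """Per-sample cross-entropy (Eq. 1) w.r.t. hard pseudo labels.
    probs: (n, K) softmax outputs; pseudo_labels: (n,) class indices."""
    n = probs.shape[0]
    return -np.log(probs[np.arange(n), pseudo_labels] + 1e-12)

def fit_gmm_1d(losses, iters=50):
    """Two-component 1-D Gaussian mixture fitted by EM on the loss values."""
    mu = np.array([losses.min(), losses.max()])          # init at the extremes
    var = np.array([losses.var() + 1e-6] * 2)
    pi = np.array([0.5, 0.5])
    for _ in range(iters):
        # E-step: responsibility of each component for each loss value.
        dens = pi * np.exp(-(losses[:, None] - mu) ** 2 / (2 * var)) \
               / np.sqrt(2 * np.pi * var)
        resp = dens / (dens.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: update mixture weights, means, and variances.
        nk = resp.sum(axis=0) + 1e-12
        pi = nk / len(losses)
        mu = (resp * losses[:, None]).sum(axis=0) / nk
        var = (resp * (losses[:, None] - mu) ** 2).sum(axis=0) / nk + 1e-6
    return pi, mu, var

def divide_domain(losses, tau=0.8):
    """Split samples into easy-to-adapt (clean posterior >= tau) and the rest."""
    pi, mu, var = fit_gmm_1d(losses)
    c = int(np.argmin(mu))  # component with the smaller mean loss = "clean"
    dens = pi * np.exp(-(losses[:, None] - mu) ** 2 / (2 * var)) \
           / np.sqrt(2 * np.pi * var)
    clean_prob = dens[:, c] / (dens.sum(axis=1) + 1e-12)
    return clean_prob >= tau, clean_prob
```

In practice the losses would come from the warmed-up target model; here any bimodal 1-D array works, and the returned boolean mask corresponds to the easy-to-adapt subdomain of Eq. (3).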

3.3. MUTUALLY-DISTILLED TWIN NETWORKS WITH SUBDOMAIN AUGMENTATION

The easy-to-adapt subdomain is purified by domain division but still contains inevitably wrong labels. Overfitting to these wrong labels forces the model to generate fallaciously low losses for domain division, and hence the wrong predictions accumulate iteratively, which is the confirmation bias. Apart from domain division, we propose Mutually-distilled Twin Networks (MTN) to further mitigate such bias, inspired by the two-network design in (Qiao et al., 2018; Li et al., 2019), where the confirmation bias of self-training is diminished by training two networks to decontaminate the noise for each other. Specifically, we employ two identical networks initialized independently, where one network performs semi-supervised learning according to the domain division and pseudo labels of the other network. In this fashion, the two networks are trained mutually and receive extra supervision to filter the errors. In BETA, we further revamp this design with subdomain augmentation to increase the divergence of the domain divisions, enabling the two networks to obtain sufficiently different supervision from each other.

Suppose that the two networks $h_t^{\theta_1}, h_t^{\theta_2}$ with parameters $\theta_1, \theta_2$ generate two domain divisions $\{X_e^1, X_h^1\}$ and $\{X_e^2, X_h^2\}$, respectively. We take $h_t^{\theta_2}$ and $\{X_e^1, X_h^1\}$ as an example. Two augmentation strategies are tailored: weak augmentation (e.g., random cropping and flipping) and strong augmentation (i.e., RandAugment (Cubuk et al., 2020) and AutoAugment (Cubuk et al., 2019)). The samples from the easy-to-adapt subdomain are mostly correct, so we augment them with both strategies and obtain their soft pseudo labels as the convex combination, weighted by the clean probability $\varrho_i^c$, of the original pseudo label and the average prediction over all augmentations. In contrast, the hard-to-adapt subdomain is noisy, so we only apply weak augmentation to update its pseudo labels but use strong augmentations in the subsequent learning phase.
Furthermore, we employ the co-guessing strategy (Li et al., 2019) to refine the pseudo labels for $X_h$. The refined subdomains are derived as
$\tilde{X}_e^1 = \{(x_i, \tilde{y}'_i) \mid \tilde{y}'_i = \varrho_i^c \tilde{y}_i + (1 - \varrho_i^c) \frac{1}{M} \sum_{m=1}^{M} h_t^{\theta_2}(A_{w/s}^m(x_i)),\ (x_i, \tilde{y}_i) \in X_e^1\}$,
$\tilde{X}_h^1 = \{(x_i, \tilde{p}'_i) \mid \tilde{p}'_i = \frac{1}{2M} \sum_{m=1}^{M} [h_t^{\theta_1}(A_w^m(x_i)) + h_t^{\theta_2}(A_w^m(x_i))],\ (x_i, \tilde{p}_i) \in X_h^1\}$,
where $A_{w/s}^m(\cdot)$ denotes the m-th weak/strong augmentation function and M denotes the total number of augmentation views.
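The two refinement rules above can be sketched as follows (a simplified NumPy version under our own naming; the M augmented views are represented by rows of pre-computed softmax outputs rather than actual image augmentations):

```python
import numpy as np

def refine_easy_label(clean_prob, pseudo_onehot, view_probs):
    """Easy-to-adapt refinement: convex combination of the source pseudo label
    and the mean prediction over the augmented views (rows of view_probs),
    weighted by the GMM clean probability of the sample."""
    return clean_prob * pseudo_onehot + (1.0 - clean_prob) * view_probs.mean(axis=0)

def co_guess_hard_label(views_net1, views_net2):
    """Hard-to-adapt co-guessing: average both networks' predictions over
    weakly augmented views to refine the soft pseudo label."""
    stacked = np.concatenate([views_net1, views_net2], axis=0)  # (2M, K)
    return stacked.mean(axis=0)
```

Both outputs remain valid probability distributions because they are convex combinations of softmax outputs, which is what allows them to serve directly as soft targets in the subsequent semi-supervised phase.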

3.4. ALGORITHMIC INSTANTIATION

After the warm-up, domain division, and subdomain augmentation, we detail the algorithmic choices of the remaining modules and the learning objectives of our framework.

Hard knowledge distillation. For each epoch, we first perform knowledge distillation from the predictions of the source model $h_s$ using the relative entropy, i.e., the Kullback-Leibler divergence: $\mathcal{L}_{kd} = \mathbb{E}_{x_t \in X_t} D_{KL}(\tilde{y}_t \,\|\, h_t(x_t))$, where $D_{KL}(\cdot\|\cdot)$ denotes the KL-divergence and the pseudo label $\tilde{y}_t$ is obtained by the EMA prediction of $h_s(x_t)$. Different from DINE (Liang et al., 2022) that uses source model probabilities, we only leverage hard pseudo labels, which are more ubiquitous for API services.

Mutual information maximization. To prevent the model from favoring categories with more samples during knowledge distillation, we maximize the mutual information $\mathcal{L}_{mi} = H(\mathbb{E}_{x \in X_t} h_t(x)) - \mathbb{E}_{x \in X_t} H(h_t(x))$, where $H(\cdot)$ denotes the information entropy. This loss works jointly with $\mathcal{L}_{kd}$. Besides, after the semi-supervised learning, we use this loss to fine-tune the model so that it complies with the cluster assumption (Shu et al., 2018; Liang et al., 2022; Grandvalet & Bengio, 2005).

Domain division enabled semi-supervised learning. We choose MixMatch (Berthelot et al., 2019) as the semi-supervised learning method since it includes a mix-up procedure (Zhang et al., 2018) that can further diverge the two networks while refraining from overfitting. The mixed sets $\ddot{X}_e, \ddot{X}_h$ are obtained by $\ddot{X}_e = \mathrm{Mixup}(\tilde{X}_e, \tilde{X}_e \cup \tilde{X}_h)$ and $\ddot{X}_h = \mathrm{Mixup}(\tilde{X}_h, \tilde{X}_e \cup \tilde{X}_h)$. The loss function is then written as $\mathcal{L}_{dd} = \mathcal{L}_{ce}(\ddot{X}_e) + \mathcal{L}_{mse}(\ddot{X}_h) + \mathcal{L}_{reg}$, where $\mathcal{L}_{ce}$ denotes the cross-entropy loss, $\mathcal{L}_{mse}$ denotes the mean squared error, and the regularizer $\mathcal{L}_{reg}$ uses a uniform distribution $\pi_k$ to eliminate the effect of class imbalance:
$\mathcal{L}_{reg} = \sum_k \pi_k \log \frac{\pi_k}{\frac{1}{|\ddot{X}_e| + |\ddot{X}_h|} \sum_{x \in \ddot{X}_e \cup \ddot{X}_h} h_t(x)}$.

Subdomain alignment.
We assume that a distribution discrepancy exists between the easy-to-adapt and the hard-to-adapt subdomains, which leads to a performance gap between them. To close this gap, we add an adversarial regularizer by introducing a domain discriminator $\Omega(\cdot)$: $\mathcal{L}_{adv} = \mathbb{E}_{x \in \ddot{X}_e} \log \Omega(h_t(x)) + \mathbb{E}_{x \in \ddot{X}_h} \log(1 - \Omega(h_t(x)))$.

Overall objectives. Summarizing all the losses, the overall objective is formulated as $\mathcal{L} = \underbrace{(\mathcal{L}_{kd} - \mathcal{L}_{mi})}_{\text{step 1}} + \underbrace{(\mathcal{L}_{dd} - \gamma \mathcal{L}_{adv})}_{\text{step 2}}$, where $\gamma$ is a hyper-parameter empirically set to 0.1. In step 1, we perform distillation for the two networks independently and form tight clusters by maximizing mutual information; in step 2, the proposed BETA revises their predictions by mitigating the confirmation bias in a synergistic manner. The domain division is performed between the two steps.
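As a reference for the sign conventions of these terms, here is a small NumPy sketch (ours, operating on pre-computed softmax outputs and discriminator probabilities; names and shapes are assumptions, not the authors' code):

```python
import numpy as np

def entropy(p, axis=-1):
    """Shannon entropy of probability vectors along the given axis."""
    return -(p * np.log(p + 1e-12)).sum(axis=axis)

def kd_loss(pseudo_onehot, probs):
    """L_kd: KL divergence from the hard pseudo labels to the model's
    predictions, averaged over the target batch."""
    kl = (pseudo_onehot * (np.log(pseudo_onehot + 1e-12)
                           - np.log(probs + 1e-12))).sum(axis=1)
    return kl.mean()

def mutual_information(probs):
    """L_mi: H(mean prediction) - mean per-sample entropy; maximized so that
    predictions are globally class-balanced yet individually confident."""
    return entropy(probs.mean(axis=0)) - entropy(probs, axis=1).mean()

def class_balance_reg(probs):
    """L_reg: KL between a uniform prior pi and the mean model prediction,
    discouraging collapse onto a few classes."""
    k = probs.shape[1]
    pi = np.full(k, 1.0 / k)
    mean_pred = probs.mean(axis=0)
    return (pi * np.log(pi / (mean_pred + 1e-12))).sum()

def subdomain_adv_loss(disc_on_easy, disc_on_hard):
    """L_adv: binary log-likelihood of the discriminator Omega separating
    the mixed easy and hard subdomains."""
    return (np.log(disc_on_easy + 1e-12).mean()
            + np.log(1.0 - disc_on_hard + 1e-12).mean())

def overall_objective(l_kd, l_mi, l_dd, l_adv, gamma=0.1):
    """Step 1 (distillation, MI maximized) plus step 2 (domain-division SSL,
    adversarial term weighted by gamma)."""
    return (l_kd - l_mi) + (l_dd - gamma * l_adv)
```

Note the signs: L_mi and L_adv enter the objective with negative weight because they are maximized, which is why confident, class-balanced predictions drive `mutual_information` up while the combined objective goes down.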

3.5. THEORETICAL JUSTIFICATIONS

Existing theories on the UDA error bound (Ben-David et al., 2007) are based on the source-domain data, so they are not applicable to DABP models (Liang et al., 2022), which hinders the understanding of these models. To better explain why BETA benefits DABP, we derive an error bound based on existing UDA theories (Ben-David et al., 2010). Let $h$ denote a hypothesis, and let $y_e, y_h$ and $\hat{y}_e, \hat{y}_h$ denote the ground-truth labels and the pseudo labels of $X_e, X_h$, respectively. As BETA is trained on a mixture of the two subdomains with pseudo labels, its error can be formulated as a convex combination of the errors on the easy-to-adapt and hard-to-adapt subdomains:
$\epsilon_\alpha(h) = \alpha \epsilon_e(h, \hat{y}_e) + (1 - \alpha) \epsilon_h(h, \hat{y}_h)$,
where $\alpha$ is the trade-off hyper-parameter and $\epsilon_e(h, \hat{y}_e), \epsilon_h(h, \hat{y}_h)$ denote the expected errors on the two subdomains. We derive an upper bound on how close the error $\epsilon_\alpha(h)$ is to the oracle error of the target domain $\epsilon_t(h, y_t)$, where $y_t$ is the ground-truth labels of the target domain.

Theorem 1 Let $h$ be a hypothesis in class $H$. Then
$|\epsilon_\alpha(h) - \epsilon_t(h, y_t)| \leq \alpha(d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \lambda + \hat{\lambda}) + \rho_h$,
where the distribution discrepancy is $d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) = 2 \sup_{h, h' \in H} |\mathbb{E}_{x \sim \mathcal{D}_e}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim \mathcal{D}_h}[h(x) \neq h'(x)]|$, and $\rho_h$ denotes the noise rate of the pseudo labels $\hat{y}_h$. The ideal joint hypothesis is $h^* = \arg\min_{h \in H}(\epsilon_e(h) + \epsilon_h(h))$, which gives the ideal risk $\lambda = \epsilon_e(h^*) + \epsilon_h(h^*)$ and the pseudo risk $\hat{\lambda} = \epsilon_e(h^*, \hat{y}_e) + \epsilon_h(h^*, \hat{y}_h)$.

In the above theorem, the error is bounded by the distribution discrepancy between the two subdomains, the noise rate of $X_h$, and the risks. The ideal risk $\lambda$ is negligibly small (Ganin et al., 2016), and the pseudo risk $\hat{\lambda}$ is bounded by $\rho_h$ as shown in the appendix. Hence, the subdomain discrepancy and $\rho_h$ dominate the error bound.
Empirical results show that $d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h)$ is usually small for the two subdomains, and $\rho_h$ keeps dropping during training as shown in Figure 3(a), which consequently tightens the upper bound. The proof with detailed analysis is in the appendix.

Implementation details. We implement our method in PyTorch (Paszke et al., 2019) and report the average accuracy over three runs. To show the capacity of handling the confirmation bias, we further report the average accuracy across hard tasks whose source-only accuracies are below 65% (H. Avg.). We employ ResNet-50 as the backbone for Office-31, Office-Home, and DomainNet, and ResNet-101 for VisDA-17 (He et al., 2016), and add a new MLP-based classifier, as is common in existing UDA works (Long et al., 2017; Chen et al., 2019; Liang et al., 2022; Saito et al., 2018). The domain discriminator consists of fully-connected layers (2048-256-2) that perform binary subdomain classification (Long et al., 2018). The ImageNet pre-trained model is used as initialization. The model is optimized by mini-batch SGD with a learning rate of 1e-3 for the CNN layers and 1e-2 for the MLP classifier. Following DINE (Liang et al., 2022), we use the suggested training strategies, including momentum (0.9), batch size (64), and weight decay (1e-3).

4. EXPERIMENTS

The number of warm-up epochs is empirically set to 3, and the number of training epochs is 50, except 10 for VisDA-17. The hyper-parameters of MixMatch are kept the same as in the original paper (Berthelot et al., 2019) and are listed in the appendix. As the two networks of MTN perform similarly, we report the accuracy of the first network.

Baselines. For a fair comparison, we follow the protocol and training strategy for the source domain in DINE (Liang et al., 2022), and compare our BETA with state-of-the-art DABP methods. Specifically, LNL-KL (Zhang et al., 2021), LNL-OT (Asano et al., 2019), and DivideMix (Li et al., 2019) are noisy label learning methods. HD-SHOT and SD-SHOT first obtain the model using pseudo labels and then apply SHOT (Liang et al., 2020a) with self-training and the weighted cross-entropy loss, respectively. We also include state-of-the-art standard UDA methods for comparison, including CDAN (Long et al., 2018), MDD (Zhang et al., 2019), BSP (Chen et al., 2019), CST (Liu et al., 2021), SAFN (Xu et al., 2019), DTA (Lee et al., 2019), ATDOC (Liang et al., 2021), MCC (Jin et al., 2020), BA3US (Liang et al., 2020b), and JUMBOT (Fatras et al., 2021).

4.2. RESULTS

Performance comparison. We show the results on Office-31, Office-Home, VisDA-17, and DomainNet in Tables 1, 2, 3, and 4, respectively. The proposed BETA achieves the best performance on all benchmarks. On average, our method outperforms the state-of-the-art methods by 1.8%, 2.4%, 9.5%, and 2.0% on Office-31, Office-Home, VisDA-17, and DomainNet, respectively. The improvement is marginal on Office-31 as it is a relatively simple benchmark. In contrast, on the challenging VisDA-17, BETA gains a large improvement of 9.5%, even outperforming standard UDA methods, e.g., CDAN, BSP, SAFN, and MDD. This demonstrates that the suppression of confirmation bias by BETA can be as effective as domain alignment techniques.

Hard transfer tasks with distant domain shift. Since our method effectively mitigates the confirmation bias, it works particularly well on hard tasks with extremely noisy pseudo labels from the source-only model. For the hard tasks with source-only accuracy below 65% (i.e., H. Avg.), BETA outperforms the second-best method by 3.5%, 4.0%, and 11.6% on Office-31, Office-Home, and VisDA-17, respectively, beating the standard UDA methods. On DomainNet, the source-only model produces less than 50% accuracy on all transfer tasks, which leads to negative transfer for DINE, e.g., qdr→skt. In contrast, BETA achieves robust improvements on most tasks, outperforming DINE by 2.0% on average. This demonstrates that our method can handle transfer tasks with distant shifts, and that BETA alleviates the negative effect of error accumulation caused by poorly performing source-only models.

Figure 3: Quantitative results on the estimated confirmation bias, and hyper-parameter sensitivity.

4.3. ANALYSIS

Ablation study. We study the effectiveness of the key components in BETA, with results shown in Table 5. The semi-supervised loss enabled by domain division significantly improves the source-only model by 11.4%. The mutually-distilled twin networks, knowledge distillation, and mutual information contribute 0.6%, 1.1%, and 0.9% improvements, respectively. As the two subdomains are drawn from the same target domain and are quite similar, the distribution discrepancy term is not always effective.

Confirmation bias. We study the confirmation bias using the noise ratio of the two subdomains for knowledge distillation (K.D.) and BETA on Office-Home (Ar→Cl) to show the effectiveness of the domain division and MTN. As shown in Figure 3(a), the error rate of K.D. drops only for the first few epochs and then stops decreasing. In contrast, the error rate of BETA keeps decreasing for about 20 epochs since the confirmation bias is iteratively suppressed. The error gap between K.D. and ours on the hard-to-adapt and easy-to-adapt subdomains reaches around 10% and 3%, respectively, validating that our method reduces the error rate $\rho_h$ and minimizes the target error in Theorem 1.

Hyper-parameter sensitivity and MTN. We study the hyper-parameter $\tau$ on Office-Home (Cl→Pr) across three runs. We choose $\tau$ ranging from 0.3 to 0.9, as a too small $\tau$ leads to a noisy domain division while a very large $\tau$ leaves very few samples in the easy-to-adapt subdomain. As shown in Figure 3(b), the accuracy of BETA ranges from 78.2% to 78.8%, and the best result is achieved at $\tau = 0.8$. For the MTN module, the two networks of BETA show similar trends over different $\tau$, and one network consistently outperforms the other slightly. We further plot the Intersection over Union (IoU) between the two easy-to-adapt subdomains $X_e^1, X_e^2$ generated by domain division; it decreases as $\tau$ grows, meaning that a larger $\tau$ leads to greater divergence between the two domain divisions. The diverged domain divisions can better mitigate the error accumulation for MTN. Thus, the best result at $\tau = 0.8$ is a reasonable trade-off between the divergence of the two domain divisions and the size of the easy-to-adapt subdomain.

5. CONCLUSION

In this work, we propose to suppress confirmation bias for DABP. This is achieved by domain division that purifies the noisy labels in cross-domain knowledge distillation. We further develop mutually-distilled twin networks with subdomain augmentation and alignment to mitigate the error accumulation. Besides, we derive a theorem to show why mitigating confirmation bias helps DABP. Extensive experiments over different backbones and learning setups show that BETA effectively suppresses the noise accumulation and achieves state-of-the-art performance on all benchmarks.

A APPENDIX

A.1 PROOF OF THEOREM 1

We prove Theorem 1, which extends the learning theories of domain adaptation (Ben-David et al., 2010) to black-box domain adaptation and provides theoretical justification for our method. Denote by $X_t \sim \mathcal{D}_T$ the target domain with its sample distribution, and by $X_e \sim \mathcal{D}_e$ and $X_h \sim \mathcal{D}_h$ the easy-to-adapt clean subdomain and the hard-to-adapt noisy subdomain with their corresponding sample distributions, respectively. Denote by $y_e, y_h$ and $\hat{y}_e, \hat{y}_h$ the ground-truth labels and the pseudo labels of $X_e, X_h$, respectively, and let $h$ denote a hypothesis. As our method performs training on a mixture of the clean set and the noisy set with pseudo labels, its error can be formulated as a convex combination of the errors on the two sets:
$\epsilon_\alpha(h) = \alpha \epsilon_e(h, \hat{y}_e) + (1 - \alpha) \epsilon_h(h, \hat{y}_h)$,
where $\alpha$ is the trade-off hyper-parameter, and $\epsilon_e(h, \hat{y}_e), \epsilon_h(h, \hat{y}_h)$ are the expected errors on the easy-to-adapt clean set $X_e$ and the hard-to-adapt noisy set $X_h$, respectively, defined by
$\epsilon_e(h, \hat{y}_e) = \mathbb{E}_{x \sim \mathcal{D}_e}[|h(x) - \hat{y}_e|]$, (16)
$\epsilon_h(h, \hat{y}_h) = \mathbb{E}_{x \sim \mathcal{D}_h}[|h(x) - \hat{y}_h|]$.
We use the shorthand $\epsilon_e(h) = \epsilon_e(h, f_e)$ in the proof, where $f_e$ is the labeling function of $\mathcal{D}_e$. We then derive an upper bound on how close the error $\epsilon_\alpha(h)$ is to the oracle error of the target domain $\epsilon_t(h, y_t)$, where $y_t$ is the ground-truth labels of the target domain, as stated in Theorem 1:

Theorem 1 (restated) Let $h$ be a hypothesis in class $H$. Then
$|\epsilon_\alpha(h) - \epsilon_t(h, y_t)| \leq \alpha(d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \lambda + \hat{\lambda}) + \rho_h$,
where the distribution discrepancy is $d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) = 2 \sup_{h, h' \in H} |\mathbb{E}_{x \sim \mathcal{D}_e}[h(x) \neq h'(x)] - \mathbb{E}_{x \sim \mathcal{D}_h}[h(x) \neq h'(x)]|$, and $\rho_h$ denotes the noise rate of the pseudo labels $\hat{y}_h$. The ideal joint hypothesis is $h^* = \arg\min_{h \in H}(\epsilon_e(h) + \epsilon_h(h))$, which gives the ideal risk $\lambda = \epsilon_e(h^*) + \epsilon_h(h^*)$ and the pseudo risk $\hat{\lambda} = \epsilon_e(h^*, \hat{y}_e) + \epsilon_h(h^*, \hat{y}_h)$.
Proof:
$|\epsilon_\alpha(h) - \epsilon_t(h, y_t)| = |\alpha \epsilon_e(h, \hat{y}_e) + (1 - \alpha) \epsilon_h(h, \hat{y}_h) - \alpha \epsilon_e(h, y_e) - (1 - \alpha) \epsilon_h(h, y_h)|$ (19)
$\leq \alpha(|\epsilon_e(h, y_e) - \epsilon_h(h, y_h)| + |\epsilon_e(h, \hat{y}_e) - \epsilon_h(h, \hat{y}_h)|) + |\epsilon_h(h, \hat{y}_h) - \epsilon_h(h, y_h)|$ (20)
$= \alpha(\epsilon_a + \epsilon_b) + \epsilon_c$. (21)
We then bound $\epsilon_a, \epsilon_b, \epsilon_c$ by applying the triangle inequality for classification errors (Crammer et al., 2008), stated in Lemma 1.

Lemma 1 For any hypotheses $f_1, f_2, f_3$ in class $H$, $\epsilon(f_1, f_2) \leq \epsilon(f_1, f_3) + \epsilon(f_2, f_3)$.

For $\epsilon_a$:
$\epsilon_a = |\epsilon_e(h, y_e) - \epsilon_h(h, y_h)|$ (22)
$\leq |\epsilon_e(h, y_e) - \epsilon_e(h, h^*)| + |\epsilon_e(h, h^*) - \epsilon_h(h, h^*)| + |\epsilon_h(h, h^*) - \epsilon_h(h, y_h)|$ (23)
$\leq \epsilon_e(h^*) + |\epsilon_e(h, h^*) - \epsilon_h(h, h^*)| + \epsilon_h(h^*)$ (24)
$\leq \frac{1}{2} d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \lambda$. (25)

For $\epsilon_b$:
$\epsilon_b = |\epsilon_e(h, \hat{y}_e) - \epsilon_h(h, \hat{y}_h)| \leq \epsilon_e(h^*, \hat{y}_e) + |\epsilon_e(h, h^*) - \epsilon_h(h, h^*)| + \epsilon_h(h^*, \hat{y}_h)$ (26)
$\leq \frac{1}{2} d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + (\epsilon_e(h^*, \hat{y}_e) + \epsilon_h(h^*, \hat{y}_h))$ (27)
$= \frac{1}{2} d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \hat{\lambda}$. (28)

For $\epsilon_c$:
$\epsilon_c = |\epsilon_h(h, \hat{y}_h) - \epsilon_h(h, y_h)| \leq \epsilon_h(\hat{y}_h, y_h) = \rho_h$. (29)

Summing the bounds on $\epsilon_a, \epsilon_b, \epsilon_c$, we obtain the inequality in Theorem 1:
$|\epsilon_\alpha(h) - \epsilon_t(h, y_t)|$ (30)
$\leq \alpha[(\frac{1}{2} d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \lambda) + (\frac{1}{2} d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \hat{\lambda})] + \rho_h$ (31)
$= \alpha(d_{H\Delta H}(\mathcal{D}_e, \mathcal{D}_h) + \lambda + \hat{\lambda}) + \rho_h$. (32) □

Furthermore, the pseudo risk is bounded by the ideal risk and the noise rates of the clean set $\rho_e$ and the noisy set $\rho_h$:
$\hat{\lambda} = \epsilon_e(h^*, \hat{y}_e) + \epsilon_h(h^*, \hat{y}_h)$ (33)
$\leq (\epsilon_e(h^*, y_e) + \epsilon_e(y_e, \hat{y}_e)) + (\epsilon_h(h^*, y_h) + \epsilon_h(y_h, \hat{y}_h))$ (34)
$= \lambda + \epsilon_e(y_e, \hat{y}_e) + \epsilon_h(y_h, \hat{y}_h)$ (35)
$= \lambda + \rho_e + \rho_h$. (36)
Given a constant $\lambda$, when the easy-to-adapt subdomain is mostly correct, i.e., $\rho_e \approx 0$, the pseudo risk is bounded by the noise rate of the noisy set $\rho_h$.
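As a quick numerical sanity check of this final step (not a proof; the 0-1 losses and synthetic labels below are entirely our own construction), one can verify that the pseudo risk never exceeds the ideal risk plus the two noise rates, as in Eq. (36):

```python
import numpy as np

rng = np.random.default_rng(7)

def err(a, b):
    # Expected 0-1 disagreement between two labelings (empirical error).
    return float((a != b).mean())

n = 2000
y_e = rng.integers(0, 2, n)   # ground truth on the easy subdomain (synthetic)
y_h = rng.integers(0, 2, n)   # ground truth on the hard subdomain (synthetic)

# A near-ideal hypothesis h* and noisy pseudo labels, with arbitrary flip rates.
h_star_e = np.where(rng.random(n) < 0.95, y_e, 1 - y_e)
h_star_h = np.where(rng.random(n) < 0.90, y_h, 1 - y_h)
yhat_e = np.where(rng.random(n) < 0.90, y_e, 1 - y_e)
yhat_h = np.where(rng.random(n) < 0.60, y_h, 1 - y_h)

ideal_risk = err(h_star_e, y_e) + err(h_star_h, y_h)          # lambda
pseudo_risk = err(h_star_e, yhat_e) + err(h_star_h, yhat_h)   # lambda-hat
rho_e, rho_h = err(yhat_e, y_e), err(yhat_h, y_h)

# The per-sample triangle inequality for 0-1 errors gives Eq. (36).
assert pseudo_risk <= ideal_risk + rho_e + rho_h
```

Because the 0-1 disagreement satisfies the triangle inequality sample by sample, the check passes for any choice of labels and flip rates, mirroring why the bound holds in expectation.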

A.2 HYPER-PARAMETER SETTINGS

We show the hyper-parameters used in our experiments in Table 6, including τ for domain division, α for MixUp, λ_mse that controls the weight of L_mse, and the sharpening factor T. In semi-supervised learning, to prevent the noisy samples from causing error accumulation, we set λ_mse to 0. The MixUp coefficient follows a Beta distribution with α = 1.0, and the sharpening factor is T = 0.5. We use τ = 0.8 for Office-31 and Office-Home. On VisDA-17, since the model may not produce confident predictions on this large and challenging dataset, we set τ = 0.5 to ensure sufficient samples in the easy-to-adapt subdomain.

Table 6: Hyper-parameters for different datasets.
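To make the roles of τ and T concrete, here is a minimal sketch of the domain-division step (our simplified reading, not the released implementation): a two-component 1-D Gaussian mixture is fit to per-sample losses with EM, samples whose posterior under the low-loss component exceeds τ form the easy-to-adapt subdomain, and a sharpening helper with temperature T is included. Function names are ours.

```python
import numpy as np

def divide_domain(losses, tau=0.8, n_iter=50):
    """Fit a two-component 1-D GMM to per-sample losses via EM and return
    a boolean mask marking the easy-to-adapt (low-loss) samples."""
    x = np.asarray(losses, float)
    mu = np.array([x.min(), x.max()])                 # init means at extremes
    var = np.array([x.var(), x.var()]) + 1e-6
    pi = np.array([0.5, 0.5])
    for _ in range(n_iter):
        # E-step: per-sample responsibilities of each Gaussian component
        dens = pi / np.sqrt(2 * np.pi * var) * np.exp(
            -(x[:, None] - mu) ** 2 / (2 * var))
        r = dens / dens.sum(1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(0)
        mu = (r * x[:, None]).sum(0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(0) / nk + 1e-6
        pi = nk / len(x)
    clean = np.argmin(mu)          # low-loss component = easy-to-adapt
    return r[:, clean] > tau

def sharpen(p, T=0.5):
    """Temperature sharpening of a probability vector (T < 1 peaks it)."""
    p = np.asarray(p, float) ** (1.0 / T)
    return p / p.sum(-1, keepdims=True)
```

With well-separated loss modes, samples in the low-loss mode receive clean probability near 1 and pass the τ threshold; raising τ shrinks the easy-to-adapt subdomain, matching the sensitivity discussion in A.7.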

A.3 CONVERGENCE OF LOSSES

Figure 4 shows the convergence of the losses of BETA during training. The adversarial loss stays small since the two subdomains are drawn from the same domain, so the distribution divergence between them should be small. The mutual information is maximized, as shown in the curve of L_mi. The semi-supervised loss L_dd fluctuates while decreasing, since the two networks use the subdomains obtained by each other for semi-supervised learning, which reduces error accumulation.

A.7 HYPER-PARAMETER SENSITIVITY ON VISDA-17

To further validate hyper-parameter sensitivity, we conduct an additional experiment on VisDA-17. Similarly, we vary τ over [0.4, 0.7], with the results shown in Table 10. With varying τ, the proposed method still achieves significant improvements over the source-only model and the existing state-of-the-art method (DINE). Even the worst case (τ = 0.4) brings an improvement of 32.9% over the source-only model and outperforms DINE by 6.2%. In real-world applications, we recommend directly using the empirical value (0.6 ± 0.2), which performs well on all the datasets in this paper. Note that τ cannot be set too large, as this could leave too few samples in the easy-to-adapt subdomain, nor too small, as this could lead to a very noisy split of the two subdomains.

A.8 COMPARISON WITH RELATED METHODS

We compare our method with existing works that share partially similar ideas. Semi-supervised learning for domain adaptation is proposed in AdaMatch (Berthelot et al., 2021), and IntraDA (Pan et al., 2020) proposes to reduce intra-domain discrepancy for semantic segmentation. The differences from these methods lie in the problem formulation, the motivation, and the framework design.
In this paper, we aim to address the DABP problem, where the model can access neither the source-domain data nor the model parameters, while these works rely heavily on the source data. Without any labeled data, our method is motivated by a new observation and performs domain division to generate two subdomains; we then design the twin-network structure to further mitigate the confirmation bias during self-training. Some works also propose easy-to-hard strategies (Cui et al., 2020b; Shin et al., 2020; Shu et al., 2019). However, all of these works require the source-domain data for training and thus cannot be applied in our scenario. Besides, they rely on intermediate domain generation or curriculum learning, none of which leverages our observation that deep models tend to fit easy-to-adapt samples first. We further compare our method with DivideMix (Li et al., 2019), which draws a similar observation in noisy-label learning:

(a) BETA uses a GMM to divide the target domain and MixMatch with different augmentations for semi-supervised learning; this part is similar to DivideMix, which uses a GMM to divide the data and then applies MixMatch for semi-supervised learning.
(b) BETA applies knowledge distillation between models in parallel to the semi-supervised learning, purified by the subdomain division to suppress error accumulation during distillation.
(c) Subdomain alignment is proposed to align the internal domain shift.
(d) Subdomain augmentation is proposed to enhance structural regularization (i.e., mutual information and MixUp): strong-weak augmentation fully exploits the high-confidence samples in X_e, while single weak augmentation avoids introducing more noise to X_h. This enhances the L_mi in Eq. (8), which encourages the model to better comply with the cluster assumption and prevents partiality toward particular categories.
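The MixUp used in the semi-supervised step samples a mixing coefficient from Beta(α, α) with α = 1.0. A minimal sketch follows; folding the coefficient toward the first input is the DivideMix-style variant and is an assumption here, as the exact form is implementation-dependent.

```python
import numpy as np

def mixup(x1, y1, x2, y2, alpha=1.0, rng=None):
    """Convexly combine two inputs and their (one-hot or soft) labels.

    lam ~ Beta(alpha, alpha); taking max(lam, 1 - lam) keeps the mixed
    sample closer to (x1, y1), as in DivideMix.
    """
    rng = rng or np.random.default_rng()
    lam = rng.beta(alpha, alpha)
    lam = max(lam, 1.0 - lam)
    return lam * x1 + (1 - lam) * x2, lam * y1 + (1 - lam) * y2
```

With α = 1.0 the Beta distribution is uniform on [0, 1], so after the max the effective coefficient is uniform on [0.5, 1], giving mixed samples that stay closer to their anchor.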

Theory. BETA analyzes its algorithm design theoretically in the context of DABP: a new bound of DABP is derived to explain the rationale behind the optimization. DivideMix provides no corresponding analysis (N.A.).

Experiments. Experiments on Office-Home demonstrate that BETA outperforms DivideMix by 4.7% on average; for the hard tasks with distant domain shifts, BETA outperforms DivideMix by 6.0% on average. DivideMix, in contrast, is evaluated on LNL benchmarks.

A.9 STANDARD DEVIATION

For the experimental results in this paper, we run the code three times with random seeds. Due to the page limit, we only report the mean accuracy in the paper. Here we further provide the standard deviation (std) in Table 12, which shows that our method achieves a robust improvement on these benchmarks.

A.10 FEATURE VISUALIZATION

Figure 5 shows the feature distribution of the target domain at the 1st, 3rd, and 10th training epochs; each color indicates one category of VisDA. The clusters become tighter with clearer boundaries during training, though some intrinsic confusion among classes remains to be tackled in the future.

A.11 CODES AND DATASETS

We have attached the code in the supplementary materials. The README.md introduces the two steps: (i) train a source-only model, and (ii) train BETA using the hard predictions of the source-only model. The datasets should be prepared in the data folder from the official websites, and their licenses should be followed (Saenko et al., 2010; Venkateswara et al., 2017; Peng et al., 2017).

A.12 EFFECTIVENESS OF DOMAIN DIVISION

In Figure 6, we show the domain division results at the first epoch (after warm-up) on Office-Home (Art→Clipart). The three rows contain three categories: alarm clocks, candles, and TVs (monitors). The domain shift between Art and Clipart is very large, and the source-only accuracy is only 44.1%. Even so, the domain division module still accurately separates the clean easy-to-adapt subdomain from the hard-to-adapt subdomain. In the easy-to-adapt subdomain, the contours of objects are similar to those of the source domain, such as the alarm clocks. The domain shift between the easy-to-adapt subdomain and the source domain is smaller, as shown in the candle samples with a black background. For the TVs, the easy-to-adapt samples have very clear contours and are easy to recognize. In comparison, the hard-to-adapt subdomain is more challenging in terms of shape, color, and style. Our domain division strategy achieves an AUC of 0.814 for the binary classification of clean samples versus noisy samples, whose pseudo labels are generated by the source-only model, which makes the semi-supervised learning in BETA well founded. During training, the AUC increases to 0.828, further mitigating the confirmation bias progressively.
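The AUC quoted above can be computed from the division's clean probabilities and a clean/noisy indicator via the Mann-Whitney rank statistic. A small self-contained sketch (illustrative; it assumes untied scores, and the function name is ours):

```python
import numpy as np

def binary_auc(scores, labels):
    """AUC of `scores` for predicting binary `labels` (1 = clean),
    via the Mann-Whitney rank statistic; assumes untied scores."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels).astype(bool)
    # rank each score from 1 (lowest) to n (highest)
    ranks = np.empty(len(scores))
    ranks[np.argsort(scores)] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = labels.sum(), (~labels).sum()
    # fraction of (clean, noisy) pairs the clean sample outranks
    return (ranks[labels].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
```

An AUC of 1.0 means every clean sample receives a higher clean probability than every noisy one; 0.5 corresponds to a chance-level split.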



Figure 1: The mutually-distilled twin networks in BETA are initialized by the predictions from the source API. Then the divide-to-adapt strategy is applied for domain division, and the two subdomains with augmentation are leveraged for semi-supervised learning (MSE and CE loss) and domain adaptation (DA Loss, i.e., L adv and L mi ).

$d_{\mathcal{A}}(\mathcal{D}_S, \mathcal{D}_e) = 1.74$, $d_{\mathcal{A}}(\mathcal{D}_S, \mathcal{D}_h) = 1.88$

Figure 2: The loss distribution on A→W (Office-31).

Figure 5: The t-SNE visualization of the target domain on the VisDA-17 dataset at the 1st, 3rd, and 10th training epoch (left to right). Each color indicates one category of VisDA-17.

Figure 6: The domain division results on Office-Home (Art→Clipart).

Accuracies (%) on Office-31 for black-box model adaptation. H. Avg. denotes the average accuracy of the hard tasks whose source-only accuracies are below 65%.

Accuracies (%) on Office-Home for black-box model adaptation. (':' denotes 'transfer to')

Datasets. Office-31 (Saenko et al., 2010) is the most common benchmark for UDA, which consists of three domains (Amazon, Webcam, DSLR) in 31 categories. Office-Home (Venkateswara et al., 2017) consists of four domains (Art, Clipart, Product, Real World) in 65 categories, and its distant domain shifts render it more challenging. VisDA-17 (Peng et al., 2017) is a large-scale benchmark for synthetic-to-real object recognition, with a source domain of 152k synthetic images and a target domain of 55k real images from Microsoft COCO.

Accuracies (%) on VisDA-17 for black-box model adaptation.

Accuracies (%) on DomainNet for black-box model adaptation. The row indicates the source domain while the column indicates the target domain.

Ablation studies of learning objectives and MTN on Office-Home.

To demonstrate the effectiveness of the proposed L_dd, we further supplement the ablation study on VisDA-17. Note that MTN is applied to all runs in the ablation study. The results are shown in Table 9: each loss brings some improvement, but the largest improvement comes from L_dd. The combination of L_kd and L_dd leads to a slightly decreased accuracy, due to the very noisy labels of the source model that hinder knowledge distillation. Surprisingly, we find that our proposed L_dd together with information maximization alone achieves a new state-of-the-art (SOTA) performance of 85.1% on VisDA-17, outperforming the existing SOTA method (DINE) by 9.5%. Previously in the manuscript, all four losses were used for all datasets and experiments. Through this ablation, we see that the proposed L_dd brings the largest improvement, 36.2% over the source-only model. The performance of BETA can be further improved by fine-tuning the hyper-parameters.

Ablation study of four learning objectives on VisDA-17.
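The "information maximization" term above is, in the usual formulation (SHOT-style; a sketch under that assumption, not necessarily the paper's exact L_mi), the entropy of the marginal predicted distribution minus the mean per-sample entropy, maximized over the batch:

```python
import numpy as np

def mutual_info_objective(probs, eps=1e-8):
    """Information-maximization objective for softmax outputs `probs`
    of shape (batch, classes): H(mean prediction) - mean per-sample
    entropy. Larger = confident per sample yet diverse across classes."""
    probs = np.asarray(probs, float)
    marginal = probs.mean(axis=0)
    # entropy of the batch-averaged prediction (diversity across classes)
    h_marginal = -np.sum(marginal * np.log(marginal + eps))
    # average entropy of individual predictions (per-sample confidence)
    h_cond = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
    return h_marginal - h_cond
```

Maximizing this objective (i.e., minimizing its negative as a loss) pushes each prediction toward a confident one-hot vector while keeping the batch-level class usage balanced, which discourages partiality toward a few categories.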

Sensitivity study of hyper-parameter τ on VisDA-17.

The differences between our method and DivideMix.

Standard deviation of the results on all benchmarks.


Acknowledgements. This work is supported by the NTU Presidential Postdoctoral Fellowship and the "Adaptive Multimodal Learning for Robust Sensing and Recognition in Smart Cities" project fund at Nanyang Technological University, Singapore. This research is jointly supported by the National Research Foundation, Singapore under its AI Singapore Programme (AISG Award No: AISG2-PhD-2021-08-008). We thank Google TFRC for granting us access to Cloud TPUs. This work is also jointly supported by the NUS startup grant, the Singapore MOE Tier-1 grant, and the ByteDance grant.


Apart from closed-set UDA, we also demonstrate the effectiveness of our method on partial-set UDA tasks. To this end, we select the first 25 classes of Office-Home in alphabetical order as the target domain. As shown in Table 7, LNL-OT and LNL-KL lead to negative transfer due to the label shift. Compared to existing state-of-the-art methods, the proposed BETA achieves the best accuracy of 78.0%, and even outperforms some standard UDA methods (Fatras et al., 2021; Jin et al., 2020). The improvements on partial-set tasks are not large, as BETA is not tailored to address label shift. Moreover, the proposed BETA can be easily extended to semi-supervised domain adaptation and multi-source domain adaptation. For semi-supervised domain adaptation, we simply add the labeled target-domain samples to the easy-to-adapt subdomain, which runs BETA in a semi-supervised manner (Berthelot et al., 2021); the labeled samples help BETA build a cleaner division for the DABP problem. For multi-source domain adaptation, we can simply change the source API to an average prediction or a vote over multiple source APIs.

A.5 EXPERIMENTS UNDER CHALLENGING SCENARIOS

As the proposed method is partially based on semi-supervised learning and self-training, two factors might hinder the adaptation capacity of BETA: the number of training samples and the noise ratio. To study whether BETA still yields improvements in such extreme situations, we choose four hard tasks, Ar→Cl (44.1%), Cl→Ar (54.5%), Pr→Ar (52.8%), and Re→Cl (46.7%), and use only a very small subset (30 samples per class) of the original domain as the unlabeled target-domain data. The results in Table 8 demonstrate that our method still brings a large improvement using only a limited number of unlabeled samples under a high noise ratio. However, the improvement margin is smaller than in the original setting (with more samples).

