DIVIDE TO ADAPT: MITIGATING CONFIRMATION BIAS FOR DOMAIN ADAPTATION OF BLACK-BOX PREDICTORS

Abstract

Domain Adaptation of Black-Box Predictors (DABP) aims to learn a model on an unlabeled target domain supervised by a black-box predictor trained on a source domain. It requires access to neither the source-domain data nor the predictor parameters, thus addressing the data privacy and portability issues of standard domain adaptation methods. Existing DABP approaches mostly rely on knowledge distillation (KD) from the black-box predictor, i.e., training the model with its noisy target-domain predictions, which inevitably introduces confirmation bias accumulated from the prediction noise and degrades performance. To mitigate such bias, we propose a new strategy, divide-to-adapt, that purifies cross-domain knowledge distillation via proper domain division. This is inspired by an observation we make for the first time in domain adaptation: the target domain usually contains easy-to-adapt and hard-to-adapt samples with different levels of domain discrepancy w.r.t. the source domain, and deep models tend to fit easy-to-adapt samples first. Leveraging the less noisy easy-to-adapt samples helps KD alleviate the negative effect of prediction noise from the black-box predictor. Accordingly, the target domain can be divided into an easy-to-adapt subdomain with less noise and a hard-to-adapt subdomain at the early stage of training, and adaptation is then achieved by semi-supervised learning. We further reduce the distribution discrepancy between the two subdomains and develop a weak-strong augmentation strategy to filter out predictor errors progressively. As such, our method is a simple yet effective solution for reducing error accumulation in cross-domain knowledge distillation for DABP. Moreover, we prove that the target error of DABP is bounded by the noise ratio of the two subdomains, i.e., the confirmation bias, which provides a theoretical justification for our method.
Extensive experiments demonstrate that our method achieves state-of-the-art results on all DABP benchmarks, outperforming the previous best approach by 9.5% on VisDA-17, and is even comparable with standard domain adaptation methods that use the source-domain data.

1. INTRODUCTION

Unsupervised domain adaptation (UDA) (Pan & Yang, 2009) aims to transfer knowledge from a labeled source domain to an unlabeled target domain and has wide applications (Tzeng et al., 2015; Hoffman et al., 2018; Zou et al., 2021). However, UDA methods require access to the source-domain data, raising concerns about data privacy and portability. To address this, Domain Adaptation of Black-box Predictors (DABP) (Liang et al., 2022) was recently introduced: it aims to learn a model with only the unlabeled target-domain data and a black-box predictor trained on the source domain, e.g., an API in the cloud, thereby avoiding the privacy and safety issues caused by the leakage of data and model parameters.

A few efforts have been made to solve the DABP problem. One is to leverage knowledge distillation (Hinton et al., 2015) and train the target model to imitate predictions from the source predictor (Liang et al., 2022). Another is to adopt learning-with-noisy-labels (LNL) methods to select clean samples from the noisy target-domain predictions for model training (Zhang et al., 2021). Though inspiring, these approaches have the following limitations. (i) Learning from noisy pseudo labels for knowledge distillation inevitably leads to confirmation bias (Tarvainen & Valpola, 2017), i.e., accumulated model prediction errors. (ii) LNL-based methods aim to select a clean subset of the target domain to train the model, which limits performance due to the decreased amount of usable training data. (iii) Existing DABP methods lack theoretical justification.

To address these issues, this work proposes a simple yet effective strategy, divide-to-adapt, which suppresses confirmation bias by purifying cross-domain knowledge distillation. Intuitively, the divide-to-adapt strategy divides the target domain into an easy-to-adapt subdomain with less prediction noise and a hard-to-adapt subdomain.
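The vanilla KD baseline discussed above trains the target model to match the black-box predictor's soft outputs. A minimal sketch of such a distillation loss (NumPy only; the function names are ours, not the authors' implementation):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def kd_loss(student_logits, teacher_probs, eps=1e-12):
    """Cross-entropy between the black-box predictor's (possibly noisy)
    soft predictions and the student model's predictions, batch-averaged."""
    p = softmax(np.asarray(student_logits, dtype=np.float64))
    t = np.asarray(teacher_probs, dtype=np.float64)
    return float(-(t * np.log(p + eps)).sum(axis=-1).mean())
```

Note that when the teacher's prediction is wrong, minimizing this loss still fits the student to it; this is precisely the source of the confirmation bias that divide-to-adapt is designed to suppress.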
This is inspired by a popular observation: deep models tend to learn clean samples faster than noisy samples (Arpit et al., 2017). For domain adaptation, we make a similar discovery: deep models tend to learn easy-to-adapt samples faster than hard-to-adapt samples, so we can leverage the loss distribution of cross-domain knowledge distillation at the early stage of training for subdomain division. By taking the easy-to-adapt subdomain as a labeled set and the hard-to-adapt subdomain as an unlabeled set, we can solve the DABP problem with prevailing semi-supervised learning methods (Berthelot et al., 2019; Sohn et al., 2020). The divide-to-adapt strategy progressively purifies the target domain for knowledge distillation while fully utilizing the target dataset without discarding any samples.

To implement this strategy, we propose Black-Box ModEl AdapTation by DomAin Division (BETA), which introduces two key modules to progressively suppress confirmation bias. First, we divide the target domain into easy-to-adapt and hard-to-adapt subdomains by fitting the loss distribution with a Gaussian Mixture Model (GMM) and setting a threshold. The less noisy easy-to-adapt samples help purify the cross-domain knowledge distillation for DABP. Second, we propose mutually-distilled twin networks with weak-strong augmentation on the two subdomains to progressively mitigate error accumulation. The distribution discrepancy between the two subdomains is further aligned by an adversarial regularizer to enforce prediction consistency on the target domain. A domain adaptation theory is further derived to justify BETA.

We make the following contributions. (i) We propose a novel BETA framework for the DABP problem that iteratively suppresses the error accumulation of model adaptation from the black-box source-domain predictor. To the best of our knowledge, this is the first work that addresses confirmation bias for DABP.
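The GMM-based domain division described above can be sketched as follows; this is a simplified illustration under our own naming, using scikit-learn, and the paper's exact thresholding rule may differ:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def divide_domain(per_sample_loss, tau=0.5):
    """Fit a two-component GMM to per-sample distillation losses recorded
    at the early stage of training. The component with the lower mean loss
    is treated as the easy-to-adapt subdomain. Returns a boolean mask
    where True marks easy-to-adapt samples."""
    losses = np.asarray(per_sample_loss, dtype=np.float64).reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    easy_comp = int(np.argmin(gmm.means_.ravel()))  # low-loss component
    p_easy = gmm.predict_proba(losses)[:, easy_comp]
    return p_easy > tau
```

The mask then splits the target domain into a pseudo-labeled set (easy-to-adapt) and an unlabeled set (hard-to-adapt) for semi-supervised training.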
(ii) We theoretically show that the error of the target domain is bounded by the noise ratio of the hard-to-adapt subdomain, and empirically show that this error can be suppressed progressively by BETA. (iii) Extensive experiments demonstrate that our proposed BETA achieves state-of-the-art performance consistently on all benchmarks. It outperforms the existing best method by 9.5% on the challenging VisDA-17 and 2.0% on DomainNet.
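The weak-strong augmentation filtering mentioned above follows the spirit of FixMatch-style consistency training: pseudo-label each sample from its weakly augmented view, keep only confident predictions, and supervise the strongly augmented view. A hedged sketch of this filtering (our simplification, not the authors' code):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def weak_strong_loss(weak_probs, strong_logits, conf_thresh=0.95, eps=1e-12):
    """Pseudo-label from the weak view; discard low-confidence samples so
    predictor errors are filtered progressively; apply cross-entropy to
    the strong view's predictions on the surviving samples."""
    weak_probs = np.asarray(weak_probs, dtype=np.float64)
    pseudo = weak_probs.argmax(axis=-1)
    mask = weak_probs.max(axis=-1) >= conf_thresh
    if not mask.any():
        return 0.0
    p = softmax(np.asarray(strong_logits, dtype=np.float64))
    ce = -np.log(p[np.arange(len(pseudo)), pseudo] + eps)
    return float((ce * mask).sum() / mask.sum())
```

The confidence mask is what makes the filtering progressive: as the twin networks improve, more hard-to-adapt samples pass the threshold and contribute to training.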

2. RELATED WORK

Unsupervised Domain Adaptation. Unsupervised domain adaptation aims to adapt a model from a labeled source domain to an unlabeled target domain. Early UDA methods rely on feature projection (Pan et al., 2010a) and sample selection (Sugiyama et al., 2007) for classic machine learning models. With the development of deep representation learning, deep domain adaptation methods yield strong performance in challenging UDA scenarios. Inspired by the two-sample test, discrepancy minimization of feature distributions (Koniusz et al., 2017; Yang et al., 2021b; Xu et al., 2022a) is proposed to learn domain-invariant features (Cui et al., 2020a) based on statistical moment matching (Tzeng et al., 2014; Sun & Saenko, 2016). Domain adversarial learning further employs a domain discriminator to achieve the same goal (Ganin et al., 2016; Zou et al., 2019; Yang et al., 2020b) and achieves remarkable results. Other effective techniques for UDA include entropy minimization (Grandvalet & Bengio, 2005; Xu et al., 2021), contrastive learning (Kang et al., 2019), domain normalization (Wang et al., 2019; Chang et al., 2019), semantic alignment (Xie et al., 2018; Yang et al., 2021a), meta-learning (Liu et al., 2020), self-supervision (Saito et al., 2020), semi-supervised learning (Berthelot et al., 2021), curriculum learning (Zhang et al., 2017; Shu et al., 2019), intra-domain alignment (Pan et al., 2020), knowledge distillation (Yang et al., 2020a) and

