GENERATIVE MODEL BASED NOISE ROBUST TRAINING FOR UNSUPERVISED DOMAIN ADAPTATION

Abstract

Target domain pseudo-labelling has shown effectiveness in unsupervised domain adaptation (UDA). However, pseudo-labels of unlabeled target domain data are inevitably noisy due to the distribution shift between source and target domains. This paper proposes a generative model-based noise-robust training method (GeMo-NoRT), which eliminates domain shift while mitigating label noise. GeMo-NoRT incorporates a distribution-based class-wise feature augmentation (D-CFA) and a generative-discriminative classifier consistency (GDC), both based on the class-wise target distributions modelled by generative models. D-CFA minimizes the domain gap by augmenting the source data with distribution-sampled target features, and trains a noise-robust discriminative classifier by using target domain knowledge from the generative models. GDC regards all the class-wise generative models as a generative classifier and enforces a consistency regularization between the generative and discriminative classifiers. It exploits an ensemble of target knowledge from all the generative models to train a noise-robust discriminative classifier, and is theoretically linked to the Ben-David domain adaptation theorem for reducing the domain gap. Extensive experiments on Office-Home, PACS, and Digit-Five show that our GeMo-NoRT achieves state-of-the-art performance under single-source and multi-source UDA settings.

1. INTRODUCTION

Convolutional neural networks (CNNs) trained on large amounts of data have achieved remarkable success on a variety of computer vision tasks (Simonyan & Zisserman, 2014; Szegedy et al., 2015; He et al., 2016; Long et al., 2015a; He et al., 2019). However, when a well-trained CNN model is deployed in a new environment, its performance usually degrades drastically. This is because the test data (of the target domain) typically comes from a different distribution than the training data (of the source domains). Such distribution mismatch is also known as the domain gap. A popular solution to the domain gap issue is unsupervised domain adaptation (UDA) (Gretton et al., 2012; Long et al., 2015b; 2016; Tzeng et al., 2014; Balaji et al., 2019; Xu et al., 2018). UDA can be divided into two main sub-settings according to the number of source domains: single-source domain adaptation (SSDA) and multi-source domain adaptation (MSDA). Early works mainly focused on the single-source scenario (Gretton et al., 2012; Long et al., 2015b; Ganin et al., 2016; Tzeng et al., 2017). Nevertheless, in real-world applications, source domain data can be collected from various deployment environments, leading to a multi-source setting; MSDA has thus been receiving more attention recently. Most UDA methods, covering both SSDA and MSDA, reduce the domain gap by domain distribution alignment (Zhao et al., 2018; Xu et al., 2018; Peng et al., 2019; Li et al., 2021c). The latest methods (Wang et al., 2020; Li et al., 2021a) further utilize class information for class-wise alignment, with pseudo-labels used for the unlabeled target data. These methods have shown effectiveness for UDA. However, due to the domain gap, a model trained on the source domains cannot correctly classify all the target instances, so the target domain pseudo-labels are inevitably noisy.
If these noisy labels are directly used as supervision, their negative impact can be amplified and accumulated through iterations, which can even corrupt the training. The noise accumulation problem thus must be addressed. An intuitive solution to noise accumulation is to reduce the domain gap, which indeed reduces label noise. However, in practice, the domain gap cannot be thoroughly eliminated, so label noise can still exist. As such, it is necessary to train a label-noise-robust model for UDA. We address the noise-robust training from the perspective of probability: to reduce the negative impact of a noisy instance, we can maximize the joint probability of its feature f and pseudo-label ŷ, i.e., p(f, ŷ), which can be achieved by max_f p(f|ŷ)p(ŷ) (see Eq. 4). Eq. 4 essentially assumes that the pseudo-labels are intact but the features are corrupted. So we can generate the data (i.e., features) conditioned on pseudo-labels, or we can fix the pseudo-label and maximize the probability of the feature f given the pseudo-label ŷ, i.e., p(f|ŷ). Practically, we 'correct'/augment the feature f of a noisy instance to force it to better match the pseudo-label ŷ. This differs from other noise correction methods, which assume the features are intact but the labels are corrupted. To sum up, the keys to addressing UDA are: 1) reducing the domain gap and 2) solving the probability maximization problem. This paper proposes a generative model based noise robust training (GeMo-NoRT) method to alleviate the pseudo-label noise (by solving the probability maximization problem) and meanwhile to reduce the domain gap for UDA.
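To make the probability-maximization intuition above concrete, the following is a minimal sketch that uses a single class-conditional Gaussian as a toy stand-in for the class-wise generative models (the paper uses normalizing flows), and a 50/50 mixing weight `lam` that is an illustrative assumption rather than the method's actual hyper-parameter. It shows that mixing a noisy feature with a feature sampled from its pseudo-label's distribution increases p(f|ŷ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a class-wise generative model: an isotropic Gaussian
# N(mu, I) for pseudo-label class A. (The paper uses normalizing flows;
# a Gaussian keeps the sketch simple.)
mu = np.array([5.0, 5.0])

def log_p_given_class(f, mu):
    """Log-density of an isotropic unit-variance Gaussian, up to a constant."""
    return -0.5 * np.sum((f - mu) ** 2)

# A 'noisy' instance: its feature sits far from class A's distribution,
# so the pseudo-label A poorly matches the feature.
f = np.array([0.0, 0.0])

# D-CFA-style correction: mix f with a feature sampled from class A's
# distribution (lam is an assumed mixing weight, not the paper's value).
lam = 0.5
f_sampled = mu + rng.standard_normal(2)
f_mixed = lam * f + (1.0 - lam) * f_sampled

# The mixed feature matches the pseudo-label better: p(f|y_hat) rises.
print(log_p_given_class(f, mu) < log_p_given_class(f_mixed, mu))  # True
```

Because the sampled feature is drawn from the pseudo-label's own distribution, the mixture is pulled toward the region where p(f|ŷ) is high, which is the feature-level 'correction' the text describes.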
The key idea is to leverage generative models, which model the class-wise target distributions, to help train a noise-robust and domain-adaptive discriminative classifier. Specifically, GeMo-NoRT learns generative models to enable a target distribution based class-wise feature augmentation (D-CFA) and a generative-discriminative classifier consistency (GDC), serving as feature-level and classifier-level regularization, respectively. D-CFA learns the target-domain class-wise feature distributions using generative models such as normalizing flows (Durkan et al., 2019). The source domain data is then augmented with features sampled from the target-domain distributions such that the domain gap can be reduced. For the pseudo-labeled target domain data, our D-CFA also provides a simple yet effective approximate solution to the probability maximization problem to alleviate noise accumulation. As shown in Figure 1a, we transform/augment the original feature f by mixing it with the 'genuine' features (sampled from the target-domain distribution) of the class ŷ. The augmented feature is now more likely to be from the class ŷ and thus increases the joint probability. Note that the distribution-sampled features can be regarded as 'genuine' features of the class ŷ because the class-wise distribution models the whole population in a class, making the class label of sampled features highly reliable. This ensures that almost no extra label noise is introduced by the sampled features, which is critical for alleviating noise accumulation. GDC is a consistency regularization on the target domain data that matches the predictions of the discriminative classifier to those of the generative classifier (see Figure 1b), where the generative classifier is formed by all the off-the-shelf generative models used as a whole for class-wise probability prediction.



Figure 1: Illustration of our GeMo-NoRT. GeMo-NoRT includes (a) Distribution Based Class-wise Feature Augmentation (D-CFA) and (b) Generative and Discriminative Consistency (GDC). (a) D-CFA uses the mixed feature (the orange triangle) to replace the original misclassified feature (i.e., the red star that should be classified to class B but has a wrong pseudo-label of class A) for classifier training. The mixed feature is obtained by mixing the original feature with a feature sampled from the target distribution of class A. In this way, the mixed feature is closer to class A and thus more likely to have a correct label of class A. Such a mixed feature and the pseudo-label A better match each other to train a better classifier. In the next iteration, the classifier is free to assign a new pseudo-label (e.g., class B) to the original feature to stop noise accumulation. When a labeled source instance is mixed with sampled features, the mixed features serve as intermediate features between the source and target domains to bridge the domain gap in each class. (b) GDC is a consistency regularization on the target domain data that matches the predictions of the discriminative classifier to those of the generative classifier. This regularization exploits an ensemble of target knowledge from all the generative models to train a noise-robust and domain-adaptive discriminative classifier.
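As a rough sketch of the classifier-level consistency in (b): the class-wise generative models can be read together as a generative classifier via Bayes' rule with a uniform class prior, p(c|f) ∝ p(f|c), and a consistency loss penalizes disagreement with the discriminative classifier. The Gaussian densities, the KL-divergence loss, and all numbers below are illustrative assumptions standing in for the paper's normalizing flows and its actual consistency objective:

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# Toy class-wise generative models (Gaussians standing in for the paper's
# normalizing flows), one per class. Together they act as a generative
# classifier: with a uniform prior, p(c|f) is a softmax over log p(f|c).
class_means = np.array([[0.0, 0.0], [4.0, 4.0], [0.0, 4.0]])

def generative_posterior(f):
    log_lik = np.array([-0.5 * np.sum((f - m) ** 2) for m in class_means])
    return softmax(log_lik)

def kl(p, q, eps=1e-12):
    """KL(p || q), one common choice of consistency loss (an assumption here)."""
    return float(np.sum(p * np.log((p + eps) / (q + eps))))

# Discriminative classifier probabilities for a target feature
# (the logits are made-up numbers for illustration).
f = np.array([0.5, 0.2])
disc_probs = softmax(np.array([1.2, 0.1, 0.3]))

# GDC-style consistency: penalize disagreement between the two classifiers.
loss = kl(generative_posterior(f), disc_probs)
print(loss >= 0.0)  # True: KL divergence is non-negative
```

Minimizing such a loss pushes the discriminative classifier's target-domain predictions toward the ensemble of class-wise generative models, which is the regularization the caption describes.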

