PROGRESSIVE MIXUP AUGMENTED TEACHER-STUDENT LEARNING FOR UNSUPERVISED DOMAIN ADAPTATION

Abstract

Unsupervised Domain Adaptation (UDA) aims to transfer knowledge learned from a labeled source domain to an unlabeled target domain, mostly by learning a domain-invariant feature representation. Currently, the best performing UDA methods use category-level domain alignment to capture fine-grained information, resulting in significantly improved performance over global alignment. While successful, category-level UDA methods suffer from unreliable pseudo-labels for target data. Additionally, most UDA methods adapt directly from the source to the target domain without regard for the large domain discrepancy. In this paper, we propose a UDA approach with teacher-student learning in which the teacher network provides more reliable target pseudo-labels for the student during training. Furthermore, we use a progressive mixup augmentation strategy that generates intermediate samples which become increasingly target-dominant as training progresses. Aligning the source and intermediate domains allows the model to gradually transfer fine-grained domain knowledge from the source to the target domain while minimizing the negative impact of noisy target pseudo-labels. This progressive mixup augmented teacher-student (PMATS) training strategy, along with class subset sampling and clustering-based pseudo-label refinement, achieves state-of-the-art performance on two public UDA benchmark datasets: Office-31 and Office-Home.

1. INTRODUCTION

Unsupervised Domain Adaptation (UDA) has become a popular research topic because it is necessary for applying deep learning models to real-world scenarios. Often, a domain gap (Quinonero-Candela et al., 2008) exists between training data and real-world test data, and it degrades model performance at test time. Collecting and labeling data from various domains is impractical because it is both time-consuming and labor-intensive. Semi-supervised learning (Berthelot et al., 2019; Sohn et al., 2020) and unsupervised learning (Gidaris et al., 2018) have been studied to improve generalization to unseen and unlabeled data, but these approaches assume that the labeled and unlabeled data come from a similar domain. UDA specifically tackles the problem of data distribution shift, or domain shift, between the labeled source data and the unlabeled target data by transferring knowledge learned from the source domain to the target domain.

UDA methods can mostly be split into two categories. Adversarial adaptation methods (Ganin & Lempitsky, 2015; Tzeng et al., 2017), inspired by generative adversarial networks (GANs), introduce a domain discriminator to encourage domain confusion in the feature generator through domain-adversarial objectives. This minimizes the gap between source and target distributions by learning domain-invariant features. Statistical adaptation methods (Long et al., 2015; 2017) align source and target domain distributions by minimizing a statistical discrepancy measure, such as maximum mean discrepancy (MMD) or joint MMD (JMMD), between the two domains. Most domain adaptation methods in these two categories directly align source and target domain distributions (Ganin & Lempitsky, 2015; Tzeng et al., 2017; Long et al., 2015; 2017) without consideration for the large domain discrepancy, e.g., from synthetic images to real images (Peng et al., 2017) or from art to real images (Venkateswara et al., 2017).
Even when Na et al. (2021) and Hua & Guo (2020) attempt to address the large domain gap, they only create a small number of augmented intermediate domains (four or fewer) based on an arbitrarily fixed mixup ratio between source and target images.

Recent domain adaptation research (Zhu et al., 2020; Long et al., 2018; Pei et al., 2018; Kumar et al., 2018) has found that performing domain alignment at the category level, taking into account the class information of source and target samples, is more effective than naively learning a global domain shift. Since target data are unlabeled in the UDA setting, category-level alignment methods rely on producing pseudo-labels for target samples. These generated pseudo-labels are noisy and unreliable, which presents a problem for deep convolutional neural networks (CNNs), which lack robustness to such pseudo-labels (Morerio et al., 2020; Jiang et al., 2020). Replacing the CNN backbone with a vision transformer (Xu et al., 2021) has been shown to improve model robustness to noisy pseudo-labels.

In this paper, we address the two issues mentioned above, the large domain gap and noisy pseudo-labels, by proposing a progressive mixup training strategy with teacher-student learning. We construct an intermediate augmented domain that becomes progressively more target-like as training continues, so the intermediate domain has different characteristics during different periods of training. Initially, the intermediate domain is more source-like, with more reliable label information but lower correlation with the target domain. As training progresses, it becomes more target-like, with less reliable label information but higher similarity to the target domain. By gradually changing the intermediate domain from source-like to target-like, we are able to train one model that retains the benefits of both perspectives without using an ensemble of models as in Na et al. (2021).
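To make the idea of a gradually shifting intermediate domain concrete, the following is a minimal sketch and not the schedule used in this paper. It assumes a linear schedule for the target share of each mixed sample and assumes that the labels (source labels and target pseudo-labels) are mixed with the same ratio as the images. The names `target_ratio`, `progressive_mixup`, `lam_start`, and `lam_end` are hypothetical.

```python
import numpy as np


def target_ratio(step, total_steps, lam_start=0.0, lam_end=1.0):
    """Hypothetical linear schedule: target share of each mixed sample."""
    # Clamp progress to [0, 1] so steps beyond total_steps stay at lam_end.
    progress = min(max(step / total_steps, 0.0), 1.0)
    return lam_start + (lam_end - lam_start) * progress


def progressive_mixup(x_src, y_src, x_tgt, y_tgt_pseudo, lam):
    """Blend a source batch with a target batch; lam is the target share."""
    # Assumption: inputs and (soft) labels are mixed with the same ratio.
    x_mix = (1.0 - lam) * x_src + lam * x_tgt
    y_mix = (1.0 - lam) * y_src + lam * y_tgt_pseudo
    return x_mix, y_mix


# Early in training the mixed samples are source-dominant,
# late in training they are target-dominant.
early = target_ratio(step=10, total_steps=100)
late = target_ratio(step=90, total_steps=100)
```

With a schedule of this kind, early mixed samples inherit mostly clean source labels, while later samples rely more on target pseudo-labels, which matches the source-like to target-like progression described above.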
For category-level subdomain alignment, we use a teacher model to generate pseudo-labels for training our student model. Our teacher model, as a temporal ensemble (Tarvainen & Valpola, 2017) of the student model, produces less noisy target predictions by averaging together the predictions of many student models at previous time steps. Furthermore, we use a clustering-based pseudo-label refinement method to obtain more reliable soft pseudo-labels.

We evaluate the performance of our Progressive Mixup Augmented Teacher-Student (PMATS) algorithm on two different UDA benchmarks with varying degrees of domain shift. Experiments demonstrate the effectiveness of our approach, which achieves state-of-the-art performance on both datasets. The contributions of this paper are summarized as follows.
• We propose an algorithm that combines teacher-student learning with progressive mixup to bridge the source and target domains through a gradually shifting intermediate domain. This not only combines knowledge from both source-like and target-like perspectives, but also increases the model's robustness to noisy pseudo-labels.
• Target predictions from our classifier are further refined through spherical K-means clustering and a Gaussian Mixture Model (GMM) to produce more reliable soft pseudo-labels.
• We validate the effectiveness of our approach through extensive ablation studies and evaluation on two standard benchmarks.
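The teacher described above, a temporal ensemble of the student, is commonly realized in the mean-teacher formulation of Tarvainen & Valpola (2017) as an exponential moving average (EMA) of the student weights. The sketch below illustrates that update under stated assumptions: that the teacher is maintained by an EMA over weights, and that the decay value shown is only a placeholder. Neither is confirmed by the text above, and the function name `ema_update` is hypothetical.

```python
import numpy as np


def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    `teacher` and `student` are dicts mapping parameter names to arrays.
    Assumption: the decay value is illustrative, not taken from the paper.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy example: the teacher drifts slowly toward the student.
teacher_weights = {"w": np.zeros(2)}
student_weights = {"w": np.ones(2)}
for _ in range(2):
    teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)
```

Because each update keeps most of the previous teacher, the teacher aggregates many past student states, which is why its target predictions tend to be less noisy than those of any single student snapshot.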



LMMD) to align the subdomain distributions. The Contrastive Adaptation Network (CAN) (Kang et al., 2019) uses a similar discrepancy measure called the Contrastive Domain Discrepancy (CDD), which not only minimizes the intra-class subdomain discrepancy but also maximizes the inter-class subdomain discrepancy in a contrastive manner. Completely replacing the MMD-based loss, cross-domain contrastive learning (CDCL) (Wang et al., 2022) instead uses contrastive learning to align source and target distributions through a modified noise-contrastive es-

