GENERATIVE GRADUAL DOMAIN ADAPTATION WITH OPTIMAL TRANSPORT

Anonymous authors
Paper under double-blind review

Abstract

Unsupervised domain adaptation (UDA) adapts a model from a labeled source domain to an unlabeled target domain in a single step. Though widely applied, UDA faces a great challenge whenever the distribution shift between the source and the target is large. Gradual domain adaptation (GDA) mitigates this limitation by using intermediate domains to gradually adapt from the source to the target domain. However, how to leverage this paradigm when the given intermediate domains are missing or scarce remains an open problem. To approach this practical challenge, we propose Generative Gradual DOmain Adaptation with Optimal Transport (GOAT), an algorithmic framework that can generate intermediate domains in a data-dependent way. More concretely, we first generate intermediate domains along the Wasserstein geodesic between two given consecutive domains in a feature space, and then apply gradual self-training, a standard GDA algorithm, to adapt the source-trained classifier to the target along the sequence of intermediate domains. Empirically, we demonstrate that our GOAT framework improves the performance of standard GDA when the given intermediate domains are scarce, significantly broadening the real-world application scenarios of GDA.

1. INTRODUCTION

Modern machine learning models suffer from data distribution shifts across various settings and datasets [Gulrajani & Lopez-Paz, 2021; Sagawa et al., 2021; Koh et al., 2021; Hendrycks et al., 2021; Wiles et al., 2022], i.e., a trained model may face a significant performance degradation when the test data come from a distribution largely shifted from the training data distribution. Unsupervised domain adaptation (UDA) is a promising approach to address the distribution shift problem by adapting models from the training distribution (source domain) with labeled data to the test distribution (target domain) with unlabeled data [Ganin et al., 2016; Long et al., 2015; Zhao et al., 2018; Tzeng et al., 2017]. Typical UDA approaches include adversarial training [Ajakan et al., 2014; Ganin et al., 2016], distribution matching [Zhang et al., 2019; Tachet des Combes et al., 2020], optimal transport [Courty et al., 2016; 2017], and self-training (aka pseudo-labeling) [Liang et al., 2019; 2020; Zou et al., 2018; 2019]. However, as the distribution shift becomes large, these UDA algorithms have been observed to suffer from significant performance degradation [Kumar et al., 2020; Sagawa et al., 2021; Abnar et al., 2021; Wang et al., 2022a]. This empirical observation is consistent with theoretical analyses [Ben-David et al., 2010; Zhao et al., 2019], which indicate that the expected test accuracy of a trained model in the target domain degrades as the distribution shift grows.

To tackle a large data distribution shift, one may resort to an intuitive divide-and-conquer strategy: break the large shift into pieces of smaller shifts and resolve each piece with a classical UDA approach. Concretely, the distribution shift between the source and target can be divided into pieces with intermediate domains bridging the two.
In settings such as gradual domain adaptation (GDA) [Kumar et al., 2020; Abnar et al., 2021; Chen & Chao, 2021; Wang et al., 2022a; Gadermayr et al., 2018; Wang et al., 2020; Bobu et al., 2018; Wulfmeier et al., 2018], intermediate domains with unlabeled data are assumed to be available to the learner. In practice, however, such intermediate domains are often missing or scarce. To approach this challenge, we propose GOAT, guided by the insight that, in order to minimize the generalization error on the target domain, the sequence of intermediate domains should be placed uniformly along the Wasserstein geodesic between the source and target domains. At a high level, GOAT consists of the following steps: i) generate intermediate domains between each pair of consecutive given domains along the Wasserstein geodesic in a feature space; ii) apply gradual self-training (GST) [Kumar et al., 2020], a popular GDA algorithm, over the sequence of given and generated domains. Empirically, we conduct experiments on Rotated MNIST and Portraits [Ginosar et al., 2015], two benchmark datasets commonly used in the GDA literature. The experimental results show that GOAT significantly outperforms vanilla GDA, especially when the number of given intermediate domains is small. The empirical results also confirm the theoretical insights of Wang et al. [2022a]: (1) when the distribution shift between a pair of consecutive domains is small, one can generate more intermediate domains to further improve the performance of GDA; (2) there exists an optimal choice for the number of generated intermediate domains.
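As a sketch of step i), consider generating intermediate domains between two equal-sized unlabeled samples. With uniform weights and equal sample sizes, the optimal transport plan reduces to a one-to-one matching, and points on the Wasserstein-2 geodesic are obtained by linearly interpolating each matched pair (displacement interpolation). The helper below is an illustrative sketch under these assumptions, not the paper's implementation; it solves the matching exactly with SciPy's Hungarian solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def wasserstein_interpolation(x_src, x_tgt, ts):
    """Generate intermediate point clouds along the Wasserstein-2 geodesic
    between two equal-sized empirical distributions with uniform weights."""
    # Pairwise squared Euclidean transport costs between the two samples.
    cost = ((x_src[:, None, :] - x_tgt[None, :, :]) ** 2).sum(-1)
    # Optimal one-to-one matching (row indices are returned sorted).
    row, col = linear_sum_assignment(cost)
    x_tgt_matched = x_tgt[col]
    # Displacement interpolation: x_t = (1 - t) * x_src + t * x_tgt.
    return [(1.0 - t) * x_src + t * x_tgt_matched for t in ts]

# Toy example: interpolate halfway between two 2-D Gaussian clouds.
rng = np.random.default_rng(0)
src = rng.normal(loc=[0.0, 0.0], size=(64, 2))
tgt = rng.normal(loc=[4.0, 0.0], size=(64, 2))
mid, = wasserstein_interpolation(src, tgt, ts=[0.5])
print(mid.mean(0))  # mean lies roughly halfway between the two cloud means
```

In GOAT this interpolation is applied in a learned feature space rather than raw pixel space; for unequal sample sizes one would interpolate along a general transport plan instead of a permutation.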

2. PRELIMINARIES

Notation. We refer to the input and output spaces as X and Y, respectively, and we use X, Y to represent random variables taking values in X, Y. A domain corresponds to an underlying data distribution µ(X, Y) over X × Y. As we focus on unlabeled samples, we use µ(X) to denote the marginal distribution of X. Classifiers from the hypothesis class H : X → Y are trained with the loss function ℓ(·, ·). We use 1_n to refer to an n-dimensional vector of all 1s, and 1[·] stands for the indicator function.

Unsupervised Domain Adaptation (UDA). In UDA, we have a source domain and a target domain. During the training stage, the learner can access m labeled samples from the source domain and n unlabeled samples from the target domain. In the test stage, the trained model is evaluated by its prediction accuracy on samples from the target domain.

Gradual Domain Adaptation (GDA). Most UDA algorithms adapt models from the source to the target in a one-step fashion, which can be challenging when the distribution shift between the two is large. Instead, in the setting of GDA, there exists a sequence of T − 1 additional unlabeled intermediate domains bridging the source and target. We denote the underlying data distributions of these intermediate domains as µ_1(X, Y), ..., µ_{T−1}(X, Y), with µ_0(X, Y) and µ_T(X, Y) being the source and target domains, respectively. In this case, for each domain t ∈ {1, ..., T}, the learner has access to S_t, a set of n unlabeled data points drawn i.i.d. from µ_t(X). As in UDA, the goal of GDA is still to make accurate predictions on test data from the target domain, while the learner can train over m labeled source data and nT unlabeled data from {S_t}_{t=1}^T. To contrast the settings of UDA and GDA, Figure 1 provides a schematic comparison.
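The GDA setting above can be made concrete with a toy construction in the spirit of the Rotated MNIST example of Figure 1: domain t rotates the source inputs by a t/T fraction of the total angle. The helper name and the angles below are illustrative, not the paper's construction:

```python
import numpy as np

def make_gradual_domains(x_src, y_src, total_angle_deg=60.0, T=6):
    """Build a toy GDA problem from 2-D source data: domain t rotates the
    inputs by (t/T) of the total angle. Labels are carried along only to
    define ground truth; the learner would see them for t = 0 alone."""
    domains = []
    for t in range(T + 1):
        theta = np.deg2rad(total_angle_deg * t / T)
        rot = np.array([[np.cos(theta), -np.sin(theta)],
                        [np.sin(theta),  np.cos(theta)]])
        domains.append((x_src @ rot.T, y_src))
    return domains  # domains[0] = source mu_0, domains[T] = target mu_T

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 2)) + np.array([2.0, 0.0])
y = np.ones(100, dtype=int)
doms = make_gradual_domains(x, y, total_angle_deg=60.0, T=6)
```

Here the distribution shift between consecutive domains is a small 10° rotation, while the total source-to-target shift is 60°.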



Kumar et al. [2020] proposed a simple yet effective algorithm, Gradual Self-Training (GST), which applies self-training consecutively along the sequence of intermediate domains towards the target. For GST, Kumar et al. [2020] proved an upper bound on the target error, and Wang et al. [2022a] provided a significantly improved bound under relaxed assumptions, corroborating the effectiveness of gradual self-training in reducing the target-domain error with unlabeled intermediate domains.
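The GST loop can be sketched in a few lines: pseudo-label each successive domain with the current model, then refit on those pseudo-labels. The nearest-centroid model and the rotating-Gaussians data below are illustrative stand-ins, not the paper's setup:

```python
import numpy as np

def fit_centroids(x, y, k):
    # One prototype (class mean) per class: a minimal stand-in classifier.
    return np.stack([x[y == c].mean(axis=0) for c in range(k)])

def predict(centroids, x):
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

def gradual_self_training(x_src, y_src, unlabeled_domains, k=2):
    """Gradual self-training: pseudo-label each domain in sequence with
    the current model, then refit the model on those pseudo-labels."""
    model = fit_centroids(x_src, y_src, k)
    for x_t in unlabeled_domains:
        model = fit_centroids(x_t, predict(model, x_t), k)
    return model

# Toy problem: two Gaussian classes whose means rotate by 120 deg in total.
rng = np.random.default_rng(0)
y = np.repeat([0, 1], 100)
base = np.where(y[:, None] == 0, [2.0, 0.0], [-2.0, 0.0])
noise = rng.normal(scale=0.4, size=(200, 2))

def domain(angle_deg):
    t = np.deg2rad(angle_deg)
    rot = np.array([[np.cos(t), -np.sin(t)], [np.sin(t), np.cos(t)]])
    return (base + noise) @ rot.T

x_src = domain(0.0)
intermediate = [domain(a) for a in np.linspace(15.0, 120.0, 8)]
x_tgt = intermediate[-1]

direct = fit_centroids(x_src, y, 2)                 # source-only model
gradual = gradual_self_training(x_src, y, intermediate)
print((predict(direct, x_tgt) == y).mean())         # fails: shift too large
print((predict(gradual, x_tgt) == y).mean())        # near-perfect
```

The contrast mirrors the motivation above: direct adaptation of the source model fails once the total shift is large, while each 15° step is small enough for pseudo-labels to remain accurate, so the model tracks the shifting distribution all the way to the target.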

Figure 1: A schematic diagram comparing Direct Unsupervised Domain Adaptation (UDA) vs. Gradual Domain Adaptation (GDA), using an example of Rotated MNIST (60° rotation).

