PROGRESSIVE MIX-UP FOR FEW-SHOT SUPERVISED MULTI-SOURCE DOMAIN TRANSFER

Abstract

This paper targets a new and challenging setting of knowledge transfer from multiple source domains to a single target domain, where the labeled target data is few-shot or even one-shot. Traditional domain generalization or adaptation methods cannot work directly, since the target samples are too few to characterize the target distribution that the transfer should aim at. The multi-source setting further hinders the transfer task, because the source domains together introduce an excessive domain gap. To tackle this problem, we propose a progressive mix-up (P-Mixup) mechanism that introduces an intermediate mix-up domain and pushes both the source domains and the few-shot target domain to align with it. By further enforcing the mix-up domain to move progressively towards the source domains, we achieve domain transfer from the multiple source domains to the single one-shot target domain. Unlike traditional mix-up, P-Mixup uses a progressive and adaptive mix-up ratio, following the spirit of curriculum learning to better align the source and target domains. Moreover, P-Mixup combines pixel-level and feature-level mix-up to enrich the data diversity. Experiments on two benchmarks show that P-Mixup significantly outperforms the state-of-the-art methods, i.e., 6.0% and 8.6% improvements on Office-Home and DomainNet.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved great success on a wide variety of computer vision tasks (He et al., 2016; Ren et al., 2015). As problems become more complex, however, the learned DNN models consistently fall short in generalizing to test data drawn from distributions different from the training data. Such domain shift (Torralba & Efros, 2011) results in performance degradation, as models overfit to the training distributions. Domain adaptation (DA) (Xu et al., 2021; Zhu et al., 2021; Zhu & Li, 2021; Liu et al., 2023) has been extensively studied to address this challenge. Depending on the settings of the source and target domains, DA problems fall into different categories such as unsupervised domain adaptation (UDA) (Zhu & Li, 2022a), supervised domain adaptation (SDA) (Motiian et al., 2017), and multi-source domain adaptation (MSDA) (Zhao et al., 2018). UDA aims to adapt knowledge from a fully labeled source domain to an unlabeled target domain. SDA intends to transfer knowledge from a fully labeled source domain to a partially labeled target domain. MSDA generalizes UDA by adapting knowledge from multiple fully labeled source domains to an unlabeled target domain. The main difficulty in the MSDA problem is how to achieve a meaningful alignment between the labeled source domains and the unlabeled target domain. Although DA has achieved good results, the availability of plenty of unlabeled or labeled target samples cannot always be guaranteed in real-world scenarios. In this paper, we propose a challenging and realistic problem setting named Few-shot Supervised Multi-source Domain Transfer (FSMDT), which assumes that multiple labeled source domains are accessible but the target domain contains only a few samples (i.e., one labeled sample per class), as shown in Figure 1.
Different from existing domain adaptation problems such as UDA, SDA and MSDA, the target domain in our problem does not provide any unlabeled samples to assist model training. The most relevant problem settings to ours are SDA and MSDA. SDA (Tzeng et al., 2015; Koniusz et al., 2017; Motiian et al., 2017; Morsing et al., 2021) seeks to transfer knowledge from a single source domain to a partially labeled target domain. SDA methods cannot simply be applied to our problem, which involves multiple source domains, as the alignment among the source domains must be carefully addressed. In addition, existing MSDA methods (Duan et al., 2009; Sun et al., 2011; Zhao et al., 2018; Wang et al., 2020a; Zhou et al., 2021b; Ren et al., 2022) aim to learn domain-invariant representations by aligning the target domain to each of the source domains. However, these MSDA methods are not suitable for our FSMDT problem, as the target domain contains only a few labeled training samples, which cannot support the learning of domain invariance. Recently, multi-source few-shot domain adaptation (MFDA) (Yue et al., 2021) was proposed to address the application scenario where only a few samples in each source domain are annotated while the remaining source and target samples are unlabeled. Different from MFDA, our proposed FSMDT assumes that only a few target samples are available. MFDA methods would fail to learn discriminative representations on the target domain in FSMDT due to insufficient target samples.

Figure 1: Visual illustration of the FSMDT problem (left), traditional domain adaptation solutions (middle), and our P-Mixup method (right). (Diagram labels: Source1, Source2, Target, Boundary, Lack Data, Progress Mix-up.)

We propose a novel progressive mix-up scheme to tackle the challenges in the newly proposed FSMDT problem. Our scheme first creates an intermediate mix-up domain, which is initially set closer to the few-shot target domain. Rather than the commonly used image-level mix-up, we introduce a cross-domain bi-level mix-up, which involves both image-level mix-up and feature-level mix-up, to effectively enrich the data diversity. With the mix-up domain initially close to the target domain, the few-shot constraint on the target domain is alleviated.
Then, by enforcing the mix-up ratio to progressively favor the source domains while keeping the target domain close to the mix-up domain, we gradually transfer knowledge from the multiple source domains to the target domain in a curriculum learning fashion. Furthermore, by optimizing over multiple source domains in a meta-learning regime, we obtain a stable and robust solution to the FSMDT problem. Our main contributions are summarized as follows:
• We introduce a practical and challenging task, namely Few-shot Supervised Multi-source Domain Transfer (FSMDT), which aims to transfer knowledge from multiple labeled source domains to a target domain with only a few labeled samples.
• We propose a novel progressive mix-up scheme to help address the FSMDT problem, which creates an intermediate mix-up domain and gradually adapts the mix-up ratio to mitigate the domain shift between the target domain and the source domains.
• We conduct extensive experiments and show that our method successfully tackles the new FSMDT problem and surpasses state-of-the-art methods by large margins. In particular, it improves the accuracy by 6.0% and 6.8% over MSDA and SDA baselines on the Office-Home and DomainNet datasets, respectively.

(Tzeng et al., 2015; Motiian et al., 2017; Morsing et al., 2021) and multi-source domain adaptation (MSDA) (Sun et al., 2011; Zhao et al., 2018; Wang et al., 2020a)

Supervised Domain Adaptation trains models by exploiting a partially labeled target domain and a single, fully labeled source domain. Seminal work such as simultaneous deep transfer (SDT) (Tzeng et al., 2015) jointly learns domain-invariant features and aligns semantic information across domains by optimizing domain confusion and distribution matching objectives. The classification and contrastive semantic alignment (CCSA) method (Motiian et al., 2017) aligns distributions along the semantic manifold. To deal with the few-shot issue, CCSA reverts to point-wise surrogates of distribution distances and similarities. Recently, Morsing et al. (2021) exploit graph embedding to encode intra-class and inter-class information to better align the source and target domains. Different from SDA, we consider multiple source domains instead of a single source domain, which is more challenging because real data is not constrained to come from a single distribution.

2. RELATED WORK

Multi-Source Domain Adaptation aims to learn domain-invariant features across all domains, or to leverage auxiliary classifiers trained on the multiple source domains to ensemble a robust classifier for the target domain (Sun et al., 2011; Duan et al., 2009). Recently, the multi-source domain adversarial network (MDAN) (Zhao et al., 2018) theoretically analyzes the average-case generalization bounds for MSDA classification and regression problems. In addition, learning to combine for multi-source domain adaptation (LtC-MSDA) (Wang et al., 2020a) explores interactions among domains by building a knowledge graph of prototypes from various domains and investigates the information propagation among semantically adjacent representations. Despite their good performance, none of the above methods considers the practical scenario with only very few labeled target samples.

Domain Generalization aims to learn a model from multiple source domains that generalizes well to an unseen target domain. Existing DG methods can be roughly divided into three groups. Domain alignment based methods (Muandet et al., 2013; Li et al., 2018b) aim to learn domain-invariant features by aligning feature distributions across multiple source domains. Meta-learning based methods (Li et al., 2018a; Shu et al., 2021) divide the multiple source domains into meta-train and meta-test sets, and learn a model on the meta-train set with the intention of improving its performance on the meta-test set. Data augmentation based methods (Zhou et al., 2021a; 2020; Zhu & Li, 2022b) aim to improve the generalization of learned models by enriching the diversity of the source domains. Although domain generalization addresses an unseen target domain, which is a harder problem than our few-shot seen-target setting, it is not suitable for our FSMDT problem because it does not consider how to utilize the available few-shot samples in the target domain.

2.2. DATA AUGMENTATION BY MIX-UP

Mix-up (Zhang et al., 2018) is a data augmentation technique that has been widely applied in self-supervised learning, domain adaptation, and domain generalization. Dual mixup regularized learning (DMRL) (Wu et al., 2020) conducts class-level and domain-level mix-up strategies to learn a domain-invariant feature space. Recently, domain-augmented meta learning (DAML) (Shu et al., 2021) applies a multi-source mix-up strategy to augment the source domains. However, most methods interpolate samples with a mix-up ratio drawn from a pre-defined distribution, e.g., a beta distribution. Lately, MetaMixup (Mai et al., 2021) proposes a meta-learning based framework to dynamically update the mix-up ratio. However, it requires a special validation setting to learn the mix-up ratio, and it does not consider mix-up across multiple domains. In contrast, we consider cross-domain mix-up and propose a progressive mix-up scheme based on the cross-domain Wasserstein distance, which does not rely on extra validation settings.

2.3. FEW-SHOT LEARNING

Few-shot learning (Wang et al., 2020b) aims to learn a model that can be easily adapted to novel tasks with limited labeled data. To tackle this challenging problem, many methods have been proposed, which can be roughly divided into metric learning based methods (Snell et al., 2017; Sung et al., 2018), meta-learning based methods (Finn et al., 2017; Chen et al., 2021), optimization based methods (Lee et al., 2019; Ravi & Larochelle, 2017), and data augmentation based methods (Li et al., 2020; Xu & Le, 2022). Snell et al. (2017) propose the Prototypical Network (PTN) to learn a metric space for classification. Finn et al. (2017) design the Model-Agnostic Meta-Learning (MAML) framework, which learns the model over various tasks such that it can be easily adapted to novel tasks with a few labeled samples. Li et al. (2020) propose an adversarial feature generator based on conditional Wasserstein generative adversarial networks to enrich the diversity of the limited data available for novel tasks. Recently, a benchmark, namely Meta-Dataset (Triantafillou et al., 2020), was proposed for the Multi-Domain Few-Shot Learning (MDFS) problem (Dvornik et al., 2020; Liu et al., 2021). One of the state-of-the-art methods, Universal Representation Transformer (URT) (Liu et al., 2021), is designed to transfer the learned universal representation to a task-specific representation. Even though MDFS is very similar to our proposed problem, there is still a significant difference between MDFS and our FSMDT: FSMDT assumes that the target label space is contained in the multi-source label space, while MDFS assumes that the target label space is excluded from the multi-source label space.

3. METHOD

Unlike SDA, we aim to jointly leverage multiple source domains rather than a single source domain, together with few-shot labeled target samples, to adapt the multi-source knowledge to the target domain. The main challenge is the extremely limited number of target data points, which cannot provide a sufficient and stable target distribution and thus makes the transfer difficult. Inspired by Mixup (Zhang et al., 2018), we propose a progressive mix-up (P-Mixup) scheme that introduces an intermediate mix-up domain and enforces the distribution alignment of "source to mix-up" and "target to mix-up". Our scheme starts with a mix-up distribution close to the target domain and gradually drifts towards the source domains. In this way, the large domain gap is replaced by a milder intermediate gap, and the target-to-source alignment is achieved indirectly. We first introduce the preliminaries, then give details of the bi-level mix-up, and finally describe the proposed progressive mix-up scheme and summarize the overall pipeline of our algorithm.

3.1. PRELIMINARIES

In the Few-shot Supervised Multi-source Domain Transfer (FSMDT) problem, we have $M$ fully labeled source domains and a target domain with few-shot labeled data. The $i$-th source domain $D_{s,i} = \{(x_{s,i}^{j}, y_{s,i}^{j})\}_{j=1}^{N_{s,i}}$ contains $N_{s,i}$ labeled samples drawn from the source distribution $P_{s,i}(x, y)$, and the target domain $D_t = \{(x_t^{j}, y_t^{j})\}_{j=1}^{N_t}$ includes $N_t$ labeled samples selected from the target distribution $P_t(x, y)$. Here, $N_t \ll N_{s,i}$, i.e., $N_t$ can be as few as 1-shot per class. Moreover, $P_t(x, y) \neq P_{s,i}(x, y)$, and $P_{s,i}(x, y) \neq P_{s,j}(x, y)$ where $i \neq j$. The multiple source domains and the target domain share the same label space $Y = \{1, 2, \ldots, K\}$ with $K$ categories. We aim to learn an adaptive model $H$ on $\{D_{s,i}\}_{i=1}^{M}$ and $D_t$ that generalizes well to unseen samples from the target domain. In general, $H$ consists of two functions, i.e., $H = F \circ G$. Here $G : x \to g$ represents the feature extractor that maps the input sample $x$ into an embedding space, and $F : g \to f$ is the classifier that takes the embedding as input and predicts the category.

3.1.1. RECAP OF MIX-UP

Mix-up (Zhang et al., 2018) is one of the most popular data augmentation strategies to improve the generalization of the learned model by enriching the diversity of the original domain. The core idea of mix-up is to create virtual samples by randomly interpolating two samples in a convex fashion. Specifically, given two samples $(x_i, y_i)$ and $(x_j, y_j)$, the virtual sample $(\tilde{x}, \tilde{y})$ is defined as:

$$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \tag{1}$$

$$\tilde{y} = \lambda y_i + (1 - \lambda) y_j, \tag{2}$$

where the label $y$ is the one-hot label encoding and $\lambda$ is randomly sampled from a predefined distribution, e.g., a beta distribution.
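As an illustration of the interpolation recapped above, the following is a minimal sketch in plain Python. The helper names (`one_hot`, `mixup`), the toy three-pixel inputs, and the Beta(0.2, 0.2) sampling parameters are assumptions made for this example and are not taken from the paper.

```python
import random

def one_hot(label, num_classes):
    """Return a one-hot list for an integer class label."""
    return [1.0 if k == label else 0.0 for k in range(num_classes)]

def mixup(x_i, y_i, x_j, y_j, lam):
    """Convex interpolation of two samples and of their one-hot labels."""
    x_mix = [lam * a + (1.0 - lam) * b for a, b in zip(x_i, x_j)]
    y_mix = [lam * a + (1.0 - lam) * b for a, b in zip(y_i, y_j)]
    return x_mix, y_mix

# Toy inputs: flattened "images" with three pixels and K = 3 classes.
x_a, y_a = [0.0, 0.5, 1.0], one_hot(0, 3)
x_b, y_b = [1.0, 0.5, 0.0], one_hot(2, 3)

# lam ~ Beta(0.2, 0.2); the parameters are an illustrative assumption.
rng = random.Random(0)
lam = rng.betavariate(0.2, 0.2)
x_mix, y_mix = mixup(x_a, y_a, x_b, y_b, lam)
```

Because both the input and the label use the same $\lambda$, the mixed label remains a valid probability vector whose entries sum to one.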

3.2. CROSS-DOMAIN BI-LEVEL MIX-UP

Traditional mix-up is originally designed for self-supervised learning, i.e., it introduces data of a new class by interpolating data from two known classes, which increases the diversity of the training data. In the domain transfer problem, such mix-up becomes cross-domain, i.e., it interpolates a data point from a source domain with a data point from the target domain. Meanwhile, besides pixel-level mix-up, recent manifold based mix-up (Verma et al., 2019; Shu et al., 2021; Xu et al., 2020) shows that feature-level interpolation can also improve generalization and model robustness. We thus investigate both pixel-level and feature-level mix-up.

Figure 2: The flowchart of the proposed progressive mix-up. A mix-up domain (red) is introduced as initially closer to the target domain. By enforcing the mix-up ratio $\lambda$ to increase progressively based on the Wasserstein distances of source-to-mixup and target-to-mixup, we push the mix-up domain gradually closer to the source domains, thus achieving the alignment of the multiple source domains to the few-shot target domain. (Diagram labels: Epoch 1, Epoch i, Epoch N along a timeline; Source, Target, Mix-up; Align.)

Cross-Domain Image-Level Mix-up. Motivated by the success of mix-up in self-supervised learning, we apply it to our domain transfer task, where it creates new samples with new labels. We use it to substantially enrich the target domain distribution, as there are very few target samples. The source and target samples are linearly interpolated as:

$$\tilde{x}_{img} = \lambda x_{s,i} + (1 - \lambda) x_t, \tag{3}$$

$$\tilde{y}_{img} = \lambda y_{s,i} + (1 - \lambda) y_t, \tag{4}$$

where $\lambda$ is the mix-up ratio. Notice that this mix-up ratio can be adjusted during training, e.g., a larger $\lambda$ generates closer-to-source samples and a smaller $\lambda$ generates closer-to-target samples.

Cross-Domain Feature-Level Mix-up. On the learned feature representation manifold, mix-up at the feature level produces more intermediate virtual features, which increases the feature diversity and directly interacts with the learning of the classifier $F$. Given a pair of source and target features and their corresponding labels, $(g_{s,i}, y_{s,i})$ and $(g_t, y_t)$, we have

$$\tilde{g}_{feat} = \lambda g_{s,i} + (1 - \lambda) g_t, \tag{5}$$

$$\tilde{y}_{feat} = \lambda y_{s,i} + (1 - \lambda) y_t, \tag{6}$$

where $\lambda$ is the same mix-up ratio as the one used in image-level mix-up. With exactly the same $\lambda$, we argue that the image-level mix-up samples lie in the same feature space as the feature-level mix-up samples. Thus, we can jointly use the two in the training loss, i.e., image-level and feature-level mix-up samples with the same mixed label should yield the same classification result.
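The two mix-up levels can be contrasted with a small sketch. It assumes a toy feature extractor (one linear layer followed by ReLU, named `relu_linear` here) and a hypothetical 2x2 weight matrix; neither is the paper's architecture. Image-level mix-up interpolates the inputs and then encodes them, whereas feature-level mix-up encodes each input and then interpolates the embeddings, with one shared ratio and one shared soft label.

```python
def relu_linear(x, weights):
    """Toy feature extractor G: one linear layer followed by ReLU (hypothetical)."""
    return [max(0.0, sum(w * v for w, v in zip(row, x))) for row in weights]

def interpolate(a, b, lam):
    """Return lam * a + (1 - lam) * b element-wise."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]

def bilevel_mixup(x_s, y_s, x_t, y_t, lam, weights):
    """Build image-level and feature-level mix-up samples with one shared lam."""
    y_mix = interpolate(y_s, y_t, lam)            # same soft label for both levels
    x_img = interpolate(x_s, x_t, lam)            # image level: mix pixels first
    g_img = relu_linear(x_img, weights)           # ... then encode the mixed image
    g_feat = interpolate(relu_linear(x_s, weights),
                         relu_linear(x_t, weights), lam)  # feature level: encode, then mix
    return (g_img, y_mix), (g_feat, y_mix)

W = [[1.0, -1.0], [-1.0, 1.0]]                    # hypothetical weights
x_src, y_src = [1.0, 0.0], [1.0, 0.0]             # one source sample, class 0
x_tgt, y_tgt = [0.0, 1.0], [0.0, 1.0]             # one target sample, class 1
(img_feat, img_label), (feat_feat, feat_label) = bilevel_mixup(
    x_src, y_src, x_tgt, y_tgt, lam=0.25, weights=W)
```

With a nonlinear extractor the two embeddings generally differ even though they carry the same mixed label, which is one way to see why using both levels can add diversity beyond either one alone.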

3.3. PROGRESSIVE MIX-UP SCHEME

Previous works apply either fixed sampling or simple randomized sampling for the mix-up ratio $\lambda$, e.g., from a beta or Dirichlet distribution (Zhang et al., 2018; Wu et al., 2020; Xu et al., 2020; Shu et al., 2021). However, we find that the sampling of the mix-up ratio is crucial for domain transfer, since the ratio directly determines the intermediate mix-up domain. If the mix-up domain is constant or follows some fixed distribution, the alignment either remains constantly hard or is likely to under-fit, as also suggested by the recent MetaMixup (Mai et al., 2021).

To alleviate this, we examine the Wasserstein distances of "source-to-mixup", $d_w(G_s, G_{mix})$, and "target-to-mixup", $d_w(G_t, G_{mix})$, where $G_s$, $G_t$, $G_{mix}$ stand for the embeddings of the source, target and mix-up domains. We observe that during training, if the mix-up domain is initially closer to the few-shot target domain, the alignment is relatively simple, as $d_w(G_t, G_{mix})$ is already small while $d_w(G_s, G_{mix})$ can be effectively minimized because there is sufficient source domain data. When the mix-up ratio is then gradually increased towards the source domains, since the "target-to-mixup" distance is already kept small, we push the entire mix-up domain and the few-shot target domain towards the source domains, as illustrated in Figure 2. Such a progressively adjusted mix-up ratio, following the spirit of curriculum learning (Bengio et al., 2009), eases the initially large domain gap by starting close to the target and keeps the entire transfer process smooth.

Specifically, we introduce a weighting factor $q$ to depict the closeness to source as:

$$q = \exp\left(-\frac{d_w(G_s, G_{mix})}{\big(d_w(G_s, G_{mix}) + d_w(G_t, G_{mix})\big)\, T}\right), \tag{7}$$

where $T$ is a temperature factor set to 0.05. During training, since $G_{mix}$ is initialized closer to the target domain, $q$ is initially small. To adjust the ratio progressively, we apply this closeness on top of the previous-stage $\lambda$ in a moving average manner. Further, a linearly incremental component is introduced to enforce the gradual closeness to the source domains. The progressive mix-up is formulated as:

$$\lambda_n = \frac{n(1 - q)}{N} + q\,\lambda_{n-1}, \tag{8}$$

where $N$ is the total number of iterations and $n$ is the current iteration index. The initial weighting towards source is $\lambda_0 = 0$. To numerically stabilize the training procedure, we add a random perturbation, drawn from a uniform distribution $U$, on top of the current $\lambda_n$:

$$\tilde{\lambda}_n = \mathrm{Clamp}\big(U(\lambda_n - \sigma, \lambda_n + \sigma),\ \min = 0.0,\ \max = 1.0\big),$$

where $\sigma$ is a local perturbation range, which we empirically set to 0.2. $\tilde{\lambda}_n$ is thus stochastically sampled and clamped into the range $[0.0, 1.0]$, and serves as the mix-up ratio of iteration $n$.
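The ratio schedule described above can be traced with a short sketch in plain Python. The function names and the synthetic distance values fed into the loop are assumptions for illustration only; in the method itself the two distances would be Wasserstein distances measured between domain embeddings during training.

```python
import math
import random

def closeness_to_source(d_src, d_tgt, temperature=0.05):
    """Weighting factor q = exp(-d_src / ((d_src + d_tgt) * T))."""
    return math.exp(-d_src / ((d_src + d_tgt) * temperature))

def next_ratio(prev_lam, q, n, total_iters):
    """lambda_n = n * (1 - q) / N + q * lambda_{n-1}."""
    return n * (1.0 - q) / total_iters + q * prev_lam

def perturb(lam, sigma, rng):
    """Sample uniformly in [lam - sigma, lam + sigma], then clamp to [0, 1]."""
    return min(1.0, max(0.0, rng.uniform(lam - sigma, lam + sigma)))

rng = random.Random(0)
total_iters, lam = 10, 0.0                      # lambda_0 = 0
history = []
for n in range(1, total_iters + 1):
    # Hypothetical distances: the mix-up domain drifts from target to source.
    d_src = 1.0 - n / (total_iters + 1)
    d_tgt = n / (total_iters + 1)
    q = closeness_to_source(d_src, d_tgt)
    lam = next_ratio(lam, q, n, total_iters)
    history.append((lam, perturb(lam, sigma=0.2, rng=rng)))
```

Written as $(1 - q)\,\frac{n}{N} + q\,\lambda_{n-1}$, the update is a convex combination of the linear ramp $n/N$ and the previous ratio, so the unperturbed $\lambda_n$ stays in $[0, 1]$ whenever $\lambda_{n-1}$ does; only the perturbed value needs clamping.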

3.4. ARCHITECTURE AND LEARNING OBJECTIVES

The overall architecture is shown in Figure 3, which mainly consists of a feature generator $G$ and a classifier $F$. During training, since there are multiple labeled source domains and a single few-shot target domain, we follow canonical frameworks such as MAML (Finn et al., 2017) and organize our training in a meta-learning manner. Denoting the parameters of the model $F \circ G$ as $\theta$, the objective for classification is defined as:

$$L_{ce}^{T_i}(\theta) = -\sum_{(x, y) \in T_i} \sum_{k=1}^{K} y_k \log\big(\theta(x)_k\big),$$

where $T_i$ stands for a specific domain, e.g., one of the source domains or the target domain, $x$ is the input image, $y$ is the ground truth label, and $K$ is the number of classes. Notice that for cross-domain image-level mix-up, the input is the mix-up image $\tilde{x}_{img}$ and the label is the mix-up label $\tilde{y}_{img}$. For cross-domain feature-level mix-up, the mix-up feature $\tilde{g}_{feat}$ is fed into the classifier to compute the $L_{ce}$ loss, and the label $y$ is the mix-up label $\tilde{y}_{feat}$. Following MAML, we first conduct a meta-optimization step that pseudo-updates the model parameters by minimizing $\sum_{T_i \in p(T)} L_{ce}^{T_i}$:

$$\theta' = \theta - \alpha \nabla_{\theta} \sum_{T_i \in p(T)} L_{ce}^{T_i}(\theta).$$

Here $p(T)$ is a sampling distribution over the meta-train domains, e.g., since each benchmark in our experiments contains four domains, we uniformly sample two out of the three source domains. The remaining source domain, the target domain and the mix-up domains are used for meta-test to update the model as:

$$\theta = \theta - \beta \nabla_{\theta} \sum_{T_j \notin p(T)} L_{ce}^{T_j}(\theta'),$$

where $\alpha$ and $\beta$ are the update step sizes for meta-train and meta-test, respectively. To simplify the parameters, we set $\alpha = \beta = 0.001$. Notice that the mix-up domain contains two sub-domains, the image-level mix-up and the feature-level mix-up. Both of them are used in meta-test.

Protocols: To highlight the challenging few-shot target domain setting, we can no longer use the original protocols of the above two datasets.
We observe that even in the one-shot case, since the number of classes is large, e.g., 345 classes in DomainNet, utilizing all the classes can provide a sufficiently diversified target domain distribution. To strictly constrain the target distribution to be few-shot, for Office-Home, we randomly select 10 out of 65 classes, each with one sample, as the target. Similarly, we randomly select 15 out of 345 classes for DomainNet. The remaining samples in these selected classes are used as the test data. Such random sampling is conducted 5 times and the averaged result is reported.

Baselines: We compare with five main streams of methods: (1) Multi-Source Few-Shot Learning method, namely Universal Representation Transformer (URT) (Liu et al., 2021). (2) Supervised Domain Adaptation method, namely Classification and Contrastive Semantic Alignment (CCSA) (Motiian et al., 2017). (3) Multi-Source Domain Adaptation method, namely Multi-source Domain Adversarial Networks (MDAN) (Zhao et al., 2018). (4) Domain Generalization method, namely Domain-Augmented Meta Learning (DAML) (Shu et al., 2021). (5) Data Augmentation method, namely Mix-up (Zhang et al., 2018). Besides, we consider another general baseline, i.e., Empirical Risk Minimization (Koltchinskii, 2011) with/without the labeled target domain (ERM-w, ERM-w/o).

Evaluation Metrics: For each of the benchmarks, each domain is in turn regarded as the target domain while the remaining ones are considered as source domains. For each experiment, we report the mean average precision (mAP), i.e., the average precision averaged over all classes and over the 5 runs. We fix the random seeds to 1-5 when constructing the new domain and sampling target samples, so that the results of different methods can be fairly compared.

Implementation Details: Our implementation is based on PyTorch (Paszke et al., 2019). We use ResNet-18 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as the backbone network. We optimize the model using SGD with a momentum of 0.9 and a weight decay of $5 \times 10^{-4}$. The batch size is set to 50. The initial learning rate is set to 0.001. For all the compared methods and ours, we use the same basic image preprocessing and the same backbone.
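The few-shot target protocol described above (select a subset of classes, keep one labeled sample per class, test on the rest, repeat with seeds 1-5) can be sketched as follows. The function name `build_few_shot_split`, the dictionary layout, and the synthetic file names are assumptions for illustration; the authors' actual data pipeline is not given in the text.

```python
import random

def build_few_shot_split(samples_by_class, num_classes, shots, seed):
    """Pick `num_classes` classes; keep `shots` samples per class for training
    and hold out the remaining samples of those classes for testing."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(samples_by_class), num_classes)
    train, test = [], []
    for cls in chosen:
        items = list(samples_by_class[cls])
        rng.shuffle(items)
        train += [(item, cls) for item in items[:shots]]
        test += [(item, cls) for item in items[shots:]]
    return train, test

# Hypothetical target domain: 65 classes (as in Office-Home), 8 file names each.
target = {c: [f"class{c}_img{i}.jpg" for i in range(8)] for c in range(65)}

# Five repetitions with seeds 1-5; per-seed results would be averaged afterwards.
splits = [build_few_shot_split(target, num_classes=10, shots=1, seed=s)
          for s in range(1, 6)]
```

Seeding a local `random.Random` instance, rather than the global generator, keeps each split reproducible regardless of what other code has sampled before it.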

4.2. MAIN RESULTS

Office-Home: In Table 1, comparing ERM-w to ERM-w/o, we observe that a labeled target domain containing 10 images cannot directly improve the performance, which verifies that the setting is indeed challenging. The representative supervised domain adaptation method CCSA, in the third column, clearly outperforms the baseline ERM-w. MDAN, in the fourth column, performs worse than ERM-w, as there is no sufficient target distribution to support the adaptation. Across all the methods, our approach demonstrates clear advantages, i.e., compared to the second best, CCSA, the gain is as large as 6.02% on "Ave."

DomainNet: As shown in Table 2, our method shows a clear advantage over all the baselines in both the 10 labeled target samples and the 15 labeled target samples scenarios. Compared to the most competitive opponent, URT, our method achieves the best performance on three out of four tasks and surpasses it by 9.67% on "Ave." with 10 labeled target samples and by 14.88% on "Ave." with 15 labeled target samples. On this dataset, we find that ERM-w obtains the same level of performance as most of the baselines, e.g., the supervised domain adaptation method CCSA, which partially shows that this benchmark is more challenging because its domain gap is more severe than in the other two datasets. Overall, these results strongly demonstrate the effectiveness of our proposed progressive mix-up for improving domain transfer with extremely few labeled target domain samples.

4.3. ABLATION STUDY

We conduct a comprehensive ablation study to examine the effectiveness of our proposed core components in Table 3. The ERM-w baseline, which utilizes the target domain data but no mix-up, is shown in the first row. Feat-Mix denotes the cross-domain feature-level mix-up and Img-Mix indicates the cross-domain image-level mix-up. We include the general mix-up ratio sampling strategy (λ ∼ Beta(0, 1)) used in Mix-up (Zhang et al., 2018) as the major comparison. First, we observe that the methods in the bottom rows consistently outperform those in the middle rows by a large margin, highlighting the superiority of the proposed progressive mix-up strategy. Then, we look into the combination of modules within each sampling method. We observe that cross-domain image-level mix-up (Img-Mix) performs better than cross-domain feature-level mix-up (Feat-Mix), with more than 4.0% improvement on "Ave." If only one module is used, image-level mix-up is therefore the better choice. Without this restriction, combining image-level and feature-level mix-up further boosts the accuracy, because the combined mix-up enriches the data diversity more than either single choice.

4.4. EFFECT OF TARGET SAMPLE SHOTS

We investigate the effect of the number of target sample shots on our proposed P-Mixup with DomainNet, where "Clipart" is selected as the target domain. We increase the number of selected classes from 10 to 345. Correspondingly, the number of available target samples ranges from 10 to 345. As shown in Table 4, our method shows a clear advantage over all the baselines across the different numbers of selected classes. Specifically, even when we use the full set of classes, i.e., 345, our method still surpasses the most competitive opponent, DAML, by 3.27%. As the number of available labeled target samples decreases, our method still holds an obvious advantage, which further confirms that our method is more advantageous when the target sample shots are extremely few.

4.5. MIX-UP RATIO AND COMPUTATION ANALYSIS

As shown in the first subfigure of Figure 4, we validate the P-Mixup scheme by showing the mix-up ratio and the q value introduced in Equation 7 on Office-Home. The subfigure shows the trend of the proposed mix-up ratio λ along the training iterations. Generally, it is an increasing trend, as we gradually push the mix-up domain closer to the source domains. The second subfigure in Figure 4 shows 1 - q over iterations, which indicates the distance change between the source and target domains. As q depicts the closeness to source, we use 1 - q to represent the closeness to target. During the first 4000 iterations, the mix-up distribution is closer to the target than to the source, and the model gradually reduces the "target-to-mixup" distance. As a result, we observe that the 1 - q value gently becomes small. After the model has reduced the "target-to-mixup" distance, the mix-up distribution gradually moves closer to the source domains as λ goes up. Afterwards, the "target-to-mixup" distance continually decreases, showing that the source domains are continuously transferred onto the target domain and that our P-Mixup is indeed effective in mitigating the domain shift in FSMDT. The last two subfigures in Figure 4 show the training behavior and standard deviation (STD) values for all methods on Office-Home. We observe that our proposed P-Mixup consistently and significantly outperforms all the baselines in terms of training behavior and STD, which verifies the effectiveness of our P-Mixup.

5. CONCLUSIONS

In this work, we address a new and challenging problem, namely Few-shot Supervised Multi-source Domain Transfer (FSMDT), where multiple fully labeled source domains and extremely limited target samples are accessible. A progressive mix-up (P-Mixup) scheme is introduced to effectively mitigate the gap between the source and target domains, especially when the target domain has extremely few samples. We jointly consider image-level and feature-level cross-domain mix-up to sufficiently enrich the data diversity. A meta-learning optimization strategy is applied to support the multi-domain joint training with stable and robust convergence. Extensive experiments show that our method achieves significant performance gains over the state-of-the-art methods across two main domain adaptation benchmarks.

We explore the performance of our method under a different feature extractor by replacing ResNet-18 with ViT-B-16 (vit small patch 224) on the Office-Home dataset with 10-shot. As shown in Table 9, our method still consistently outperforms all the baselines under the vision transformer feature extractor.

A.7 MOVING DIRECTION OF MIX-UP DISTRIBUTION

We investigate the influence of the moving direction of the mix-up distribution on our P-Mixup by moving the mix-up distribution either from the source to the target domains ("Source-To-Target") or from the target to the source domains ("Target-To-Source"). In Table 10, we observe that the "Target-To-Source" direction consistently and significantly outperforms the "Source-To-Target" direction, with more than 4.0% improvement in average accuracy. To further explore the training behavior of our P-Mixup, we inspect the learned model at some of the intermediate training iterations, i.e., from iteration 500 to 5,000, up to the fully converged 10,000 iterations. As shown in Figure 6, we find that the "Target-To-Source" direction continuously improves the performance of the learned model on the target domain compared with the "Source-To-Target" direction.

A.8 STANDARD DEVIATION VALUE

We compute the standard deviation (STD) values for all the methods across all the benchmark datasets. More details can be found in Figures 7, 8, and 9.

A.9 COMPUTATION ANALYSIS

We explore the training behavior of our method by inspecting the learned model at intermediate training iterations, i.e., from iteration 2,000 to 5,000, up to the fully converged model at 10,000 iterations. We run the models on all four Office-Home protocols. As shown in Figure 5, our P-Mixup consistently and significantly outperforms all the baselines, which verifies its effectiveness. We also notice that other baseline methods, e.g., DAML and MDAN, obtain worse results than ERM-w, which demonstrates the difficulty of the proposed Few-shot Supervised Multi-source Domain Transfer (FSMDT) problem.

A.10 LIMITATION

Our method mainly relies on source-target progressive mix-up (P-Mixup) data augmentation, which progressively introduces an intermediate mix-up domain to mitigate the domain gap between source and target. P-Mixup focuses on the proposed FSMDT problem, which provides multiple labeled source domains and limited labeled target data, and aims at learning to generalize to unseen target domain data. To strictly constrain the target distribution to be few-shot, we consider the one-sample-per-class situation and limit the label space of the target domain to be significantly smaller than the label space of the source domains. We can see in Table 4 that the performance of our method decreases as the number of target classes increases. There are two main reasons. First, the classification task becomes more difficult as the number of target classes increases. Second, the diversity of the augmented data is restricted because P-Mixup is only applied between the source data and the limited few-shot target data. In contrast, the domain generalization method DAML conducts mix-up across all the classes from multiple source domains. To mitigate this gap, we can additionally introduce mix-up amongst source domains, i.e., source-to-source mix-up, into our overall framework to further enrich the data diversity.



Footnote: https://timm.fast.ai/



Figure 3: The training architecture. Both image-level and feature-level P-Mixup are applied for the cross-entropy loss. G is the feature extractor and F is the classifier.

Figure 4: From left to right, the first two figures illustrate the mix-up ratio λ and the q value introduced in Equation 7. The lower 1 - q is, the closer the target is to the sources. The last two figures describe the P-Mixup training behavior compared to the baselines and the standard deviation (STD) values for all methods.

Figure 6: Illustration of our P-Mixup training behavior under different moving strategies for the mix-up distribution on all four Office-Home protocols (averaged over 5 runs).

mAP (%) on Office-Home. Each row is named after the target domain, which contains 10 classes randomly selected from the label space. (A: Art, C: Clipart, P: Product, R: Real world)

mAP (%) on DomainNet. Each column is named after the target domain, which contains 10/15 classes randomly selected from the label space. (C: Clipart, P: Painting, R: Real, S: Sketch)

Ablation study on Office-Home. Each column is named after the target domain, which contains 10 classes randomly selected from the label space.

mAP(%) on DomainNet "Clipart" setting where "Clipart" is the target domain. n indicates the number of classes randomly selected from the label space.

Ablation study on Office-Home with 10-shot under different meta-learning splitting.

Impact of the moving direction of the mix-up distribution in our P-Mixup on all four Office-Home protocols (averaged over 5 runs). Columns: Moving Direction, Art, Clipart, Product, Real world, Ave.

ACKNOWLEDGEMENT

This research is supported by the U.S. Army Research Office Award under Grant Number W911NF-21-1-0109 and the Cisco Faculty Award.

AVAILABILITY

Source code is available at https://github.com/ronghangzhu/

A APPENDIX

The appendix provides additional experiments and justifications of the proposed progressive mix-up (P-Mixup) method. In the following sections, we first introduce the details of the benchmark datasets. Then, we describe the implementation and present the sensitivity study on the hyper-parameter σ used in our method. Next, we investigate the influence of the moving direction of the mix-up distribution on the proposed P-Mixup. Furthermore, we provide the standard deviation (STD) values and the training behavior for all the methods across all the benchmark datasets. Finally, we analyze the limitation of our method and provide insights into potential directions for future research.

The mix-up ratio update formula in our method is defined as: [equation missing], where λ_n is the progressive mix-up ratio at the n-th iteration and U is a uniform distribution with a local perturbation range σ.
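The role of the local perturbation range σ can be illustrated with a short Python sketch. The way the perturbation enters the update is an assumption here: the noise is drawn from U(-σ, σ), added to the current ratio, and the result is clipped to [0, 1]. The function name `perturb_ratio` is hypothetical, and the sketch does not reproduce the update formula of P-Mixup.

```python
import random


def perturb_ratio(lam: float, sigma: float, rng: random.Random) -> float:
    """Apply a local uniform perturbation to a mix-up ratio.

    Assumptions (not from the paper): the noise is additive, drawn from
    U(-sigma, sigma), and the perturbed ratio is clipped to [0, 1].
    """
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    noise = rng.uniform(-sigma, sigma)
    return min(1.0, max(0.0, lam + noise))


if __name__ == "__main__":
    rng = random.Random(0)
    # Grid of sigma values used in the sensitivity study.
    for sigma in (0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40):
        print(sigma, perturb_ratio(0.5, sigma, rng))
```

A larger σ lets the sampled ratio deviate further from its current value, which is one way to read the link between σ and the diversity of mixed samples.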

A.1 DATASETS

Table 5 shows the overall descriptions of the benchmark datasets, i.e., PACS (Li et al., 2017), Office-Home (Venkateswara et al., 2017), and DomainNet (Peng et al., 2019). (1

A.2 IMPLEMENTATION DETAILS

We implement our P-Mixup in PyTorch (Paszke et al., 2019). We adopt the ImageNet pre-trained ResNet-18 (He et al., 2016) as the feature extractor G and optimize it with SGD. We train the model for 10,000 iterations on Office-Home and 30,000 iterations on DomainNet. We update the mix-up ratio λ_n every 100 iterations. Following the idea of MAML (Finn et al., 2017), in every 3 iterations we randomly select 2 source domains as the meta-train domains, and the remaining source, target, and mix-up domains form the meta-test domains. The first 2 iterations are meta-train, and the last iteration is meta-test, which contains source, target, and their bi-level mixed samples. The learning rates α and β are set to 0.01. For all the baselines, we use the same basic image processing procedures and the same feature extractor as our P-Mixup.
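The 3-iteration cycle and the ratio update interval described above can be sketched in plain Python. This is only a scheduling sketch under stated assumptions: iterations are counted from 0, the domain names and the helpers `plan_cycle`, `phase`, and `should_update_ratio` are hypothetical, and no model, loss, or optimizer is included.

```python
import random
from typing import List, Tuple

# Example source domains, assuming "Real world" is the target domain.
SOURCES = ["Art", "Clipart", "Product"]


def plan_cycle(sources: List[str], rng: random.Random,
               n_meta_train: int = 2) -> Tuple[List[str], List[str]]:
    """Pick the meta-train domains for one 3-iteration cycle.

    The remaining source domain(s), the target domain, and the mix-up
    domain form the meta-test domains.
    """
    meta_train = rng.sample(sources, n_meta_train)
    meta_test = [d for d in sources if d not in meta_train]
    meta_test += ["target", "mixup"]
    return meta_train, meta_test


def phase(iteration: int) -> str:
    """First 2 iterations of each cycle are meta-train, the third is meta-test."""
    return "meta-test" if iteration % 3 == 2 else "meta-train"


def should_update_ratio(iteration: int, every: int = 100) -> bool:
    """The mix-up ratio is refreshed every `every` iterations (0-indexed count)."""
    return iteration > 0 and iteration % every == 0


if __name__ == "__main__":
    rng = random.Random(0)
    for it in range(6):
        if it % 3 == 0:
            meta_train, meta_test = plan_cycle(SOURCES, rng)
        print(it, phase(it), meta_train, meta_test)
```

Keeping the schedule separate from the optimization step makes it easy to change the cycle length or the update interval without touching the training code.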

A.3 SENSITIVITY STUDY ON σ

To analyze the sensitivity of our P-Mixup to the hyper-parameter σ, we conduct experiments on Office-Home for all four protocols. The value of σ is selected from {0.05, 0.10, 0.15, 0.20, 0.25, 0.30, 0.35, 0.40}. As shown in Table 6, the performance of P-Mixup slightly increases as σ grows within the range [0.05, 0.25], and is then relatively stable in the range [0.25, 0.40]. Overall, our P-Mixup is not sensitive to the value of σ.

Published as a conference paper at ICLR 2023

Evaluation on PACS is shown in Table 7. We observe that our method consistently and significantly outperforms all the baselines. Specifically, we obtain a 7.69% "Ave." performance gain compared with ERM-w and a 4.99% "Ave." improvement compared with the second-best method, DAML. A side observation is that all the baselines obtain relatively better performance than on the Office-Home and DomainNet datasets, which suggests that PACS could be less challenging, as 4 images from the target domain can notably boost the performance, i.e., when compared to ERM-w/o.

A.5 META-LEARNING ABLATION STUDY

We investigate the behavior of our method under different meta-train and meta-test splittings on Office-Home with 10-shot. As the Office-Home dataset contains 4 domains, for each task, 3 domains are selected as the source domains and the remaining one is the target domain. We also have the mix-up domain in each task. Due to the limited target data, we simplify the splitting by treating the target and mix-up domains as a whole, denoted as D_mix-up. As shown in Table 8, we increase the size of the meta-train set from 1 source domain to 3 source domains. Correspondingly, the remaining domains are adopted as the meta-test set. We find that different meta-learning splittings achieve similar performance, and the meta-train set with three source domains slightly outperforms the others.
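The candidate splittings in this ablation can be enumerated with a small Python sketch. It assumes that every subset of source domains of a given size is a valid meta-train set and that D_mix-up always belongs to the meta-test set; the helper `enumerate_splits` and the literal label "D_mix-up" are hypothetical names used only for illustration.

```python
from itertools import combinations
from typing import List, Tuple


def enumerate_splits(sources: List[str],
                     k: int) -> List[Tuple[List[str], List[str]]]:
    """List all (meta-train, meta-test) splits with k meta-train source domains.

    The meta-test set holds the remaining source domains plus "D_mix-up",
    the merged target and mix-up domains.
    """
    splits = []
    for meta_train in combinations(sources, k):
        meta_test = [d for d in sources if d not in meta_train]
        meta_test.append("D_mix-up")
        splits.append((list(meta_train), meta_test))
    return splits


if __name__ == "__main__":
    # Example: "Real world" is the target, so three source domains remain.
    sources = ["Art", "Clipart", "Product"]
    for k in (1, 2, 3):
        for meta_train, meta_test in enumerate_splits(sources, k):
            print(k, meta_train, meta_test)
```

With three source domains, sizes 1 and 2 each give three possible splits, while size 3 gives a single split whose meta-test set contains only D_mix-up.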

