JOINT ATTENTION-DRIVEN DOMAIN FUSION AND NOISE-TOLERANT LEARNING FOR MULTI-SOURCE DOMAIN ADAPTATION

Abstract

Multi-source Unsupervised Domain Adaptation (MUDA) transfers knowledge from multiple labeled source domains to an unlabeled target domain. Recently, endeavours have been made to establish connections among different domains and enable feature interaction. However, as these approaches essentially enhance category information, they fail to transfer domain-specific information. Moreover, little research has explored the connection between pseudo-label generation and the framework's learning capabilities, which is crucial for robust MUDA. In this paper, we propose a novel framework that significantly reduces the domain discrepancy and achieves new state-of-the-art performance. In particular, we first propose a Contrary Attention-based Domain Merge (CADM) module that enables interaction among features so as to mix domain-specific information rather than focus on category information. Secondly, to enable the network to correct pseudo labels during training, we propose an adaptive and reverse cross-entropy loss, which adaptively imposes constraints on the pseudo-label generation process. We conduct experiments on four benchmark datasets, showing that our approach can efficiently fuse all domains for MUDA while performing much better than prior methods.

1. INTRODUCTION

Deep neural networks (DNNs) have achieved excellent performance on various vision tasks under the assumption that training and test data come from the same distribution. However, different scenes have different illumination, viewing angles, and styles, which may cause the domain shift problem (Zhu et al., 2019; Tzeng et al., 2017; Long et al., 2016). This can eventually lead to a significant performance drop on the target task. Unsupervised Domain Adaptation (UDA) aims at addressing this issue by transferring knowledge from the source domain to the unlabeled target domain (Saenko et al., 2010). Early research mostly focused on Single-source Unsupervised Domain Adaptation (SUDA), which transfers knowledge from one source domain to the target domain. Accordingly, some methods align the feature distributions of the source and target domains (Tzeng et al., 2014), while others (Tzeng et al., 2017) learn domain invariants through adversarial learning. Liang et al. (2020) use the label information to maintain a robust training process. However, data is usually collected from multiple domains in real-world scenarios, which gives rise to a more practical task, i.e., Multi-source Unsupervised Domain Adaptation (MUDA) (Duan et al., 2012). MUDA leverages all of the available data and thus enables performance gains; nonetheless, it introduces the new challenge of reducing the domain shift between all source domains and the target domain. For this, some research (Peng et al., 2019) builds on SUDA, aiming to extract common domain-invariant features for all domains. Moreover, some works, e.g., Venkat et al. (2021); Zhou et al. (2021), focus on the classifiers' predictions to achieve domain alignment. Recently, some approaches (Li et al., 2021; Wen et al., 2020) take advantage of the MUDA setting to create connections between domains. Overall, since the main challenge of MUDA is to eliminate the differences between all domains, there are two main ways to achieve this.
One is to extract domain-invariant features across all domains, i.e., to filter out the domain-specific information of each domain. The other is to mix domain-specific information from different domains so that all domains share this mixed information and thus fuse into one domain. Previous approaches have mostly followed the former; however, filtering the domain-specific information for multiple domains can be difficult and often results in losing discrimination ability. For the latter, few methods have been proposed, and there is a lack of effective frameworks to achieve such domain fusion, which is the main problem addressed by our approach. Moreover, existing methods ignore the importance of generating reliable pseudo labels, as noisy pseudo labels can lead to the accumulation of prediction errors. Consequently, it is imperative to design a robust pseudo-label generation scheme for the MUDA framework. In this paper, we propose a novel framework that better reduces the domain discrepancy and shows new state-of-the-art (SoTA) MUDA performance, as shown in Fig. 1. Our method enjoys two pivotal technical breakthroughs. Firstly, we propose a Contrary Attention-based Domain Merge (CADM) module (Sec. 4.2), whose role is to perform the domain fusion.

Figure 1: Overview of our proposed framework, which consists of three primary elements: (1) The feature extractor extracts features from the various domains. (2) CADM implements message passing and fuses features from different domains; the same domain is represented by the same color, while different shapes represent different classes. (3) We propose the AR-CE loss and use the maintained memory to compute the soft label and pseudo label (hard label) for the target domain.
Self-attention (Vaswani et al., 2017; Dosovitskiy et al., 2020) can capture the higher-order correlation between features and emphasizes the most relevant feature information, e.g., the semantically closest information. In contrast, our CADM proposes contrary attention, enabling each domain to pay more attention to the semantically different domain-specific information of the other domains. By integrating this domain-specific information, each domain moves toward the other domains, resulting in domain fusion. Secondly, to enable the network to correct the pseudo labels during training, we take the pseudo-label generation as an optimization objective of the network by proposing an adaptive and reverse cross-entropy (AR-CE) loss (Sec. 4.3). It imposes optimizable constraints on pseudo-label generation, enabling the network to correct pseudo labels that tend to be wrong and reinforce pseudo labels that tend to be correct. We conduct extensive experiments on four benchmark datasets. The results show that our method achieves new state-of-the-art performance and has especially significant advantages on large-scale datasets. In summary, we make the following contributions: • We propose CADM to fuse the features of the source and target domains, bridging the gap between different domains and enhancing the discriminability of different categories. • We propose a loss function, the AR-CE loss, to diminish the negative impact of noisy pseudo labels during training. • Our method demonstrates new state-of-the-art (SoTA) MUDA performance on multiple benchmark datasets.

2. RELATED WORK

2.1. UNSUPERVISED DOMAIN ADAPTATION (UDA)

According to the number of source domains, UDA can be divided into Single-source Unsupervised Domain Adaptation (SUDA) and Multi-source Unsupervised Domain Adaptation (MUDA). In SUDA, some methods use metrics such as Maximum Mean Discrepancy (Long et al., 2017; Wang et al., 2018) to measure and optimize the domain shift. Some methods (Sankaranarayanan et al., 2018; Ma et al., 2019) learn domain-invariant representations through adversarial learning. Other methods are based on the label information: Liang et al. (2020) align domains implicitly by constraining the distribution of predictions, and, following them, Liang et al. (2021) introduce a label shifting strategy to improve the accuracy of low-confidence predictions. For MUDA, previous methods eliminate the domain shift while aligning all source domains by constructing multiple source-target pairs, where some (Peng et al., 2019; Zhu et al., 2019) are based on discrepancy metrics between distributions and others (Venkat et al., 2021; Ren et al., 2022) are based on adversarial learning. Some methods (Venkat et al., 2021; Nguyen et al., 2021a; Li et al., 2022; Nguyen et al., 2021b) explore the relation between different classifiers and develop different agreements to achieve domain alignment. Others (Wen et al., 2020; Ahmed et al., 2021) achieve adaptive transfer by a weighted combination of source domains. Recent methods (Wang et al., 2020; Li et al., 2021) mainly construct connections between different domains to enable feature interaction and explore the relation of category information across domains. Our approach also constructs connections between different domains; however, it mainly transfers domain-specific information to achieve the fusion of all domains.

2.2. ROBUST LOSS FUNCTION UNDER NOISY LABELS

A robust loss function is critical for UDA because the unlabeled target domain requires pseudo labels, which are often noisy. Previous work (Ghosh et al., 2017) demonstrates that some loss functions, such as Mean Absolute Error (MAE), are more robust to noisy labels than commonly used ones such as the Cross Entropy (CE) loss. Wang et al. (2019b) propose Symmetric Cross Entropy (SCE), which combines Reverse Cross Entropy (RCE) with the CE loss. Moreover, Wang et al. (2019a) show that directly adjusting the update process of the loss by the weight variance is effective. Some methods (Venkat et al., 2021; Zhou et al., 2021) mentioned in Section 2.1 explore learning strategies to handle predictions with high or low confidence, respectively. Differently, we design an adaptive loss function that enables the network to learn robust pseudo-label generation by self-correction.

3. PROBLEM SETTING

For the MUDA task, there are $N$ source distributions and one target distribution, denoted as $\{p_{s_j}(x, y)\}_{j=1}^{N}$ and $p_t(x, y)$. The labeled source-domain images $\{X_{s_j}, Y_{s_j}\}_{j=1}^{N}$ are drawn from the source distributions, where $X_{s_j} = \{x^i_{s_j}\}_{i=1}^{|X_{s_j}|}$ are the images in source domain $j$ and $Y_{s_j} = \{y^i_{s_j}\}_{i=1}^{|X_{s_j}|}$ are the corresponding ground-truth labels. The unlabeled target domain provides images $X_t = \{x^i_t\}_{i=1}^{|X_t|}$ drawn from the target distribution. In this paper, we uniformly set the number of samples in a batch to $B$, with $b = \frac{B}{N+1}$ samples from every domain (including the target domain). For a single sample $x^i_{s_j}$, the subscript $s_j$ denotes the $j$-th source domain, while the superscript $i$ indicates the $i$-th sample in that domain.
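A minimal sketch of this batch layout, with toy tensor sizes and a hypothetical helper name (`make_mixed_batch` is illustrative, not part of the paper's code):

```python
import torch

def make_mixed_batch(source_batches, target_batch):
    """Stack per-domain sub-batches (N sources + target) into one batch.

    Each sub-batch holds b = B / (N + 1) samples, so every domain is
    equally represented in the mini-batch, as in the problem setting.
    """
    return torch.cat(list(source_batches) + [target_batch], dim=0)

# Toy example: N = 2 source domains, b = 4 samples per domain, 3x32x32 images.
b, C, H, W = 4, 3, 32, 32
sources = [torch.randn(b, C, H, W) for _ in range(2)]
target = torch.randn(b, C, H, W)
batch = make_mixed_batch(sources, target)   # total batch size B = b * (N + 1) = 12
```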

4. METHOD

4.1. OVERALL FRAMEWORK

The general MUDA pipeline consists of two phases: 1) pre-training on the source domains; 2) training when transferring to the target domain. The base structure of the model consists of a feature extractor $F(\cdot)$ and a classifier $C(\cdot)$. Previous approaches mostly use different extractors or classifiers; unlike them, our approach aims to achieve a deep fusion of all domains, i.e., to treat them as one domain, and therefore $F$ and $C$ are shared among domains. In the pre-training phase, we use the source-domain data $\{X_{s_j}, Y_{s_j}\}_{j=1}^{N}$ to obtain the pre-trained $F(\cdot)$ and $C(\cdot)$. In the training phase for the target domain, we first feed the source-domain data $\{X_{s_j}, Y_{s_j}\}_{j=1}^{N}$ and target-domain data $X_t$ into $F(\cdot)$ to get the features $f$, and perform feature fusion by CADM to get the fused features $h$. Then, we use the fused features of the target domain for pseudo-label generation and feed the fused features of all domains into $C(\cdot)$ to get the prediction values. Finally, we compute the loss based on the labels. Since source training is standard, 'training' in this paper refers to target-domain training unless stated otherwise.

Figure 2: CADM overview: features from different domains are fused by CADM. For simplicity, we set $N$, $B$, and $D$ to 2, 6, and 5, respectively. Domain prototypes are mapped to Q, while all features are mapped to K and V. The domain style centroids are obtained from the contrary attention map $A'$ and V, and are then updated by EMA. Finally, the fused feature $h$ is obtained by the MLP.
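The two-phase pipeline above can be sketched as a shared-extractor, shared-classifier model. The internals below are placeholders (a linear layer stands in for the ResNet backbone, an identity for CADM), not the paper's actual modules:

```python
import torch
import torch.nn as nn

class MUDAPipeline(nn.Module):
    """Skeleton of the shared F(.) / CADM / shared C(.) forward pass."""
    def __init__(self, feat_dim=256, num_classes=10):
        super().__init__()
        self.extractor = nn.Linear(784, feat_dim)   # placeholder for the ResNet backbone F(.)
        self.cadm = nn.Identity()                   # placeholder for the CADM fusion module
        self.classifier = nn.Linear(feat_dim, num_classes)  # shared classifier C(.)

    def forward(self, x):
        f = self.extractor(x)        # per-sample features from all domains
        h = self.cadm(f)             # fused features (CADM in the real model)
        return self.classifier(h)    # shared-classifier prediction values

x = torch.randn(12, 784)             # one mixed batch (all domains together)
logits = MUDAPipeline()(x)           # shape (12, 10)
```

Both phases reuse this same forward; only the losses applied on top differ.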

4.2. CONTRARY ATTENTION-BASED DOMAIN MERGE MODULE

The proposed CADM enables the passing of domain-specific information to achieve a deep fusion of all domains. For each domain, we expect it to receive domain-specific information from the other domains via information transfer, moving it toward the other domains as shown in Fig. 1. To achieve this goal, the transfer of semantically distinct domain-specific information should be encouraged, because receiving semantically close domain-specific information primarily strengthens the original domain information and does not help eliminate the domain shift. Instead, receiving different domain-specific information contributes to the mixing of domain-specific information and 'pushes' the domain toward the other domains, thus achieving domain fusion. As a result, we construct contrary attention to focus on semantic differences among domains; the specific structure is shown in Fig. 2. Specifically, features $f$ are first extracted from the samples via $F$ and are then mapped into the latent space to obtain the query, key, and value features:

$$\mathbf{Q} = [\bar{f}_{s_1}, \ldots, \bar{f}_{s_N}, \bar{f}_t]\,U_q, \quad U_q \in \mathbb{R}^{D \times D}, \qquad (1)$$
$$[\mathbf{K}, \mathbf{V}] = [f^1_{s_1}, \ldots, f^1_{s_2}, \ldots, f^b_t]\,U_{kv}, \quad U_{kv} \in \mathbb{R}^{D \times 2D}, \qquad (2)$$

where $U_q$ and $U_{kv}$ are learnable projection matrices, $D$ is the feature dimension, $f^1_{s_1}$ is the feature obtained from the sample $x^1_{s_1}$, and $\bar{f}_{s_1}$ is the domain prototype representing the domain information of $s_1$. Letting $g$ denote any domain, the domain prototype is obtained as

$$\bar{f}_g = \frac{1}{b}\sum_{i=1}^{b} f^i_g. \qquad (3)$$

Given the query of each domain and the keys of all samples from Eqs. 1 and 2, we use the query feature $q_g$ in $\mathbf{Q}$ to compute the correlation with the samples of the different domains:

$$A_g = \mathrm{Softmax}\!\left(\frac{q_g \cdot \mathbf{K}^{T}}{\sqrt{D}}\right), \quad A_g \in \mathbb{R}^{1 \times B}. \qquad (4)$$

We use the self-attention mechanism to model the correlation because of its powerful ability to capture dependencies; $A_g$ is the resulting attention map.
The values in $A_g$ reflect the correlation between domain $g$ and the samples from the other domains, while, as mentioned before, it is more meaningful to focus on semantically different domain-specific information. Consequently, we 'reverse' $A_g$ to generate a new attention map, which we refer to as the contrary attention map:

$$A'_g = \frac{\mathbf{1} - A_g}{\mathrm{sum}(\mathbf{1} - A_g)}, \quad A'_g \in \mathbb{R}^{1 \times B}, \qquad (5)$$

where $\mathbf{1}$ is an all-one vector. The resulting $A'_g$ achieves the opposite effect of $A_g$: features that differ more from domain $g$ in their domain-specific information are assigned higher weights. Subsequently, we use $A'_g$ to perform feature fusion for domain $g$, which merges information from the features according to their degree of dissimilarity with domain $g$:

$$\Phi_g = A'_g \cdot \mathbf{V} = \sum_{\hat{g} \in G}\sum_{i=1}^{b} a'_{g\hat{g}_i} v^i_{\hat{g}}, \qquad (6)$$

where $a'_{g\hat{g}_i}$ is the contrary attention weight between the prototype of $g$ and the $i$-th feature in $\hat{g}$. $\Phi_g$ is obtained by weighting features from different domains and summing them according to their discrepancies from $g$. Given its emphasis on merging knowledge with distinct styles, we call $\Phi_g$ the domain style centroid with respect to $g$. In addition, to guarantee the robustness of $\Phi_g$, we maintain it with an exponential moving average (EMA):

$$(\bar{\Phi}_g)_n = (1 - \alpha)(\bar{\Phi}_g)_{n-1} + \alpha(\Phi_g)_n, \qquad (7)$$

where $n$ denotes the $n$-th mini-batch. The recorded $\bar{\Phi}_g$ is also used in the inference phase. Finally, we integrate the domain style centroid $\bar{\Phi}_g$ into the original features via a simple MLP:

$$h^i_g = \mathrm{MLP}\big([f^i_g, \bar{\Phi}_g]\big), \qquad (8)$$

where $h^i_g$ is the fused feature; $\bar{\Phi}_g$ is integrated into every feature in $g$, creating an overall movement of the domain. Through this process, the original features merge different domain-specific information by integrating the domain style centroid.
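The prototype, attention, and reversal steps (Eqs. 1-6) can be sketched as follows. $U_q$ and $U_{kv}$ are random stand-ins for the learnable projections, and the tensor sizes are toy values; this is an illustration of the computation, not the paper's implementation:

```python
import torch
import torch.nn.functional as F

def contrary_attention(f, num_domains):
    """Contrary attention map A' and domain style centroids Phi (a sketch).

    f: (B, D) features laid out as num_domains contiguous sub-batches of
    size b = B / num_domains (source domains first, target last).
    """
    B, D = f.shape
    b = B // num_domains
    U_q = torch.randn(D, D) / D ** 0.5        # stand-in for learnable U_q
    U_kv = torch.randn(D, 2 * D) / D ** 0.5   # stand-in for learnable U_kv

    prototypes = f.view(num_domains, b, D).mean(dim=1)   # per-domain mean, Eq. 3
    Q = prototypes @ U_q                                 # (N+1, D), Eq. 1
    K, V = (f @ U_kv).chunk(2, dim=-1)                   # each (B, D), Eq. 2

    A = F.softmax(Q @ K.T / D ** 0.5, dim=-1)            # (N+1, B), Eq. 4
    A_contrary = (1.0 - A) / (1.0 - A).sum(dim=-1, keepdim=True)  # Eq. 5
    Phi = A_contrary @ V                                 # style centroids, Eq. 6
    return A_contrary, Phi

f = torch.randn(18, 5)   # 3 domains (N+1), b = 6 samples each, D = 5 (toy)
A_c, Phi = contrary_attention(f, num_domains=3)
```

Note how the reversal in Eq. 5 keeps each row a valid distribution: since each row of $A$ sums to one, each row of $\mathbf{1}-A$ sums to $B-1$, and the normalization restores a unit sum with the weight ordering inverted.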
Meanwhile, to maintain a well-defined decision boundary during such domain fusion, we use a metric-based loss (Wen et al., 2016) to optimize the intra-class distance, pulling samples of the same category toward their class center:

$$\mathcal{L}_{Cr} = \sum_{g \in G}\sum_{i=1}^{b} \big\| h^i_g - \mu_{y^i_g} \big\|_2^2, \qquad (9)$$

where $y^i_g$ is the label of $h^i_g$ and $\mu_{y^i_g} \in \mathbb{R}^{1 \times D}$ is the $y^i_g$-th class center. For the robustness of the class centers, we maintain a memory $\mathcal{M}$ that stores features at the class level and updates them batch by batch during training, a technique widely applied in UDA (Saito et al., 2020). The class-level features can be regarded as class centers and are used in the computation of $\mathcal{L}_{Cr}$. Specifically, we traverse all the datasets to extract features and obtain the initial class centers before training. In the target domain, we use the classification result as the category since there is no ground-truth label:

$$\mu_k = \mathrm{Avg}\Big(\bigcup_{j=1}^{N}\big\{h^i_{s_j} \,\big|\, x_i \in X_{s_j},\ y_i = k\big\} \;\cup\; \big\{h^i_t \,\big|\, x_i \in X_t,\ \arg\max_j p_j(x_i) = k\big\}\Big), \qquad (10)$$

where $p_j(x)$ is the confidence of predicting $x$ as the $j$-th class. Then, after each backpropagation step, we update the memory $\mathcal{M}$ with the rule

$$\mu_k^{new} = (1 - \beta)\mu_k^{old} + \beta h, \qquad (11)$$

where $k$ is the label of $h$ and $\mu_k$ is the $k$-th class center in $\mathcal{M}$. Under the supervision of $\mathcal{L}_{Cr}$, the features maintain intra-class compactness during domain movement. Overall, through CADM, we establish connections among domains and enable their deep fusion; with $\mathcal{L}_{Cr}$, the model reaches a tradeoff between domain merging and a well-defined decision boundary.
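The center loss (Eq. 9) and the memory update rule (Eq. 11) can be sketched as below. Toy sizes throughout; the naive per-sample update loop is for clarity, not efficiency:

```python
import torch

def center_loss(h, labels, centers):
    """L_Cr (Eq. 9): summed squared distance of each fused feature to its class center."""
    return ((h - centers[labels]) ** 2).sum(dim=1).sum()

def ema_update_centers(centers, h, labels, beta=0.005):
    """Eq. 11: EMA update of the class-level memory M after a backprop step."""
    centers = centers.clone()
    for feat, k in zip(h, labels):
        centers[k] = (1 - beta) * centers[k] + beta * feat
    return centers

centers = torch.zeros(4, 8)                  # K = 4 classes, D = 8 (toy memory M)
h = torch.randn(6, 8)                        # fused features of one batch
labels = torch.tensor([0, 1, 2, 3, 0, 1])    # ground-truth or pseudo labels
loss = center_loss(h, labels, centers)
centers = ema_update_centers(centers, h, labels, beta=0.5)  # large beta for the toy demo
```

The small default $\beta$ (0.005 in the paper's setting) keeps the centers slow-moving, which matters because the same memory later drives pseudo-label generation.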

4.3. ADAPTIVE AND REVERSE CROSS ENTROPY LOSS

General UDA methods use a K-means-like algorithm to generate pseudo labels for the entire unlabeled target dataset. However, pseudo labels generated this way are often misassigned due to the lack of constraints, which introduces noise and impairs performance. To address this issue, we propose to assign gradients to the pseudo-label generation process and constrain it with a well-designed rule, so that the network can optimize the prediction values and the pseudo labels simultaneously. To begin, we obtain pseudo labels that carry gradient flow. Since the dynamically updated memory $\mathcal{M}$ is introduced in Section 4.2, we can obtain both soft and hard labels in every batch by computing the similarity between features and the class-level features in $\mathcal{M}$:

$$q(y = k \mid x) = \frac{e^{h \cdot \mu_k^{T}}}{\sum_{j=1}^{K} e^{h \cdot \mu_j^{T}}}, \qquad (12)$$
$$\hat{y} = \arg\max_k q(y = k \mid x), \qquad (13)$$

where $q(y \mid x)$ is the soft label and $\hat{y}$ is the hard label. The soft label makes gradient backpropagation possible. Previous experimental observations (Venkat et al., 2021; Wang et al., 2019b) demonstrate that the trained classifier is key to preventing the model from collapsing on the unlabeled target domain. Therefore, based on the classifier's output, we constrain the pseudo labels with a reverse cross-entropy loss:

$$\ell_{rce}(x) = -\sum_{k=1}^{K} p_k(x) \log q(y = k \mid x), \qquad (14)$$

where $k$ indexes the classes and $p_k(x)$ is the classifier's output for the $k$-th class. As shown in Eq. 14, unlike the cross-entropy loss, which uses labels to supervise the classifier's output, $\ell_{rce}$ uses the classifier's output to supervise the generated soft labels. The gradient can then be back-propagated through the soft label, preventing the accumulation of errors in pseudo-label generation. However, it is unfair to apply the same rule to pseudo labels that already tend to be accurate.
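A sketch of the differentiable label generation (Eqs. 12-13) and the reverse cross-entropy term (Eq. 14), with toy sizes and a constant placeholder for the classifier output:

```python
import torch
import torch.nn.functional as F

def soft_and_hard_labels(h, centers):
    """Eqs. 12-13: soft labels from similarity to the memory M, hard labels by argmax."""
    q = F.softmax(h @ centers.T, dim=-1)   # q(y=k|x), differentiable w.r.t. h
    return q, q.argmax(dim=-1)             # (soft labels, hard pseudo labels)

def reverse_ce(p, q, eps=1e-8):
    """Eq. 14: the classifier output p supervises the generated soft labels q."""
    return -(p * torch.log(q + eps)).sum(dim=-1)

h = torch.randn(6, 8, requires_grad=True)  # fused target features (toy)
centers = torch.randn(4, 8)                # class-level memory M, K = 4
q, y_hat = soft_and_hard_labels(h, centers)
p = torch.full((6, 4), 0.25)               # placeholder classifier probabilities
loss = reverse_ce(p, q).sum()
loss.backward()                            # gradient flows through the soft labels
```

The key point is that `q` stays inside the autograd graph, so minimizing the loss adjusts the features that produced the pseudo labels rather than just the classifier.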
Ideally, the loss should adapt to different pseudo labels. Therefore, we propose an adaptive factor $a(x)$ to adaptively adjust $\ell_{rce}$:

$$\delta_j(x, \hat{y}) = \frac{p_j(x)}{1 - p_{\hat{y}}(x)}, \qquad a(x) = \sum_{j \neq \hat{y}} \delta_j(x, \hat{y}) \log \delta_j(x, \hat{y}), \qquad (15)$$

where $\hat{y}$ is the generated hard label of the target-domain sample $x$. We treat the classifier's output over the classes other than $\hat{y}$ as a distribution whose $j$-th probability is $\delta_j(x, \hat{y})$ in Eq. 15; the entropy of this distribution then serves as the adaptive factor $a(x)$. The intuition is that a pseudo label arises from the feature extracted by the network, so its accuracy is closely related to the network's ability to distinguish the sample. Liang et al. (2021) demonstrate that when the model distinguishes a sample well, the classifier's output is close to a one-hot vector; in that case the value of $a(x)$ is larger than when the discriminatory ability is poor. Based on $a(x)$ and $\ell_{rce}$, we design the adaptive and reverse cross-entropy loss:

$$\ell_{arce}(x, \hat{y}) = \ell_{rce}(x)\,\frac{\exp(a(x))}{\tau}, \qquad \mathcal{L}_{AR\text{-}CE} = \sum_{i=1}^{B} \ell_{arce}(x_i, \hat{y}_i). \qquad (16, 17)$$

The exponential form and $\tau$ amplify its effect. Since $\mathcal{L}_{AR\text{-}CE}$ focuses on optimizing the pseudo labels, we freeze the gradient of the prediction values in Eq. 14; backpropagating $\mathcal{L}_{AR\text{-}CE}$ then optimizes the pseudo-label generation. Furthermore, the cross-entropy loss is

$$\mathcal{L}_{CE} = -\frac{1}{B}\sum_{i=1}^{B} \log p_{y_i}(x_i), \qquad (18)$$

where $y_i$ is the ground-truth label for a source-domain sample and the hard pseudo label $\hat{y}_i$ from Eq. 13 when $x_i$ is in the target domain.
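The adaptive factor and the full AR-CE term (Eqs. 15-17) can be sketched as below. Note that Eq. 15 is implemented exactly as written, i.e., $\sum \delta \log \delta$ without a leading minus (the sign-flipped entropy, which is largest when the residual distribution is concentrated); tensor sizes are toy values:

```python
import torch
import torch.nn.functional as F

def adaptive_factor(p, y_hat, eps=1e-8):
    """Eq. 15: a(x) from the classifier output restricted to classes j != y_hat."""
    p_yhat = p.gather(1, y_hat[:, None])              # p_{y_hat}(x), shape (B, 1)
    delta = p / (1.0 - p_yhat + eps)                  # delta_j(x, y_hat)
    term = delta * torch.log(delta + eps)
    mask = F.one_hot(y_hat, p.shape[-1]).bool()       # exclude j = y_hat from the sum
    return term.masked_fill(mask, 0.0).sum(dim=-1)    # a(x), shape (B,)

def arce_loss(p, q, y_hat, tau=1.0, eps=1e-8):
    """Eqs. 16-17: reverse CE weighted by exp(a(x)) / tau, summed over the batch."""
    p = p.detach()                                    # freeze prediction gradients (Eq. 14)
    rce = -(p * torch.log(q + eps)).sum(dim=-1)       # l_rce per sample
    a = adaptive_factor(p, y_hat, eps)
    return (rce * torch.exp(a) / tau).sum()

p = torch.softmax(torch.randn(6, 4), dim=-1)          # classifier probabilities (toy)
q = torch.softmax(torch.randn(6, 4), dim=-1)          # soft labels from the memory (toy)
y_hat = q.argmax(dim=-1)                              # hard pseudo labels, Eq. 13
loss = arce_loss(p, q, y_hat, tau=2.0)
```

Because `p` is detached, the only trainable path left runs through the soft labels `q`, which is precisely what lets the network correct its own pseudo labels.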
Ultimately, the loss function for the entire framework is

$$\mathcal{L} = \mathcal{L}_{CE} + \mathcal{L}_{Cr} + \mathcal{L}_{AR\text{-}CE}, \qquad (19)$$

where $\mathcal{L}_{CE}$ is the standard cross-entropy loss, $\mathcal{L}_{Cr}$ drives feature fusion, and $\mathcal{L}_{AR\text{-}CE}$ performs noise-tolerant learning and acts only on the target domain.

5. EXPERIMENTS

5.1. DATASETS AND IMPLEMENTATION

Office-31 (Saenko et al., 2010) is a fundamental domain adaptation dataset comprising three separate domains: Amazon (A), Webcam (W), and DSLR (D). Office-Caltech (Gong et al., 2012) extends Office-31 with a Caltech (C) domain, covering the ten classes shared by Office-31 and Caltech-256; Office-Home and DomainNet (Peng et al., 2019) serve as the medium-sized and large-scale benchmarks, respectively. For comparison with our method, we introduce the SoTA methods on MUDA, including DCTN (Xu et al., 2018), M3SDA (Peng et al., 2019), MFSAN (Zhu et al., 2019), MDDA (Zhao et al., 2020), DARN (Wen et al., 2020), Ltc-MSDA (Wang et al., 2020), SHOT++ (Liang et al., 2021), SImpAl (Venkat et al., 2021), T-SVDNet (Li et al., 2021), CMSS (Ahmed et al., 2021), DECISION (Ahmed et al., 2021), DAEL (Zhou et al., 2021), PTMDA (Ren et al., 2022), and DINE (Liang et al., 2022). For Office-31, Office-Caltech, and Office-Home, we use an ImageNet (Deng et al., 2009) pre-trained ResNet-50 (He et al., 2016) as the feature extractor; for DomainNet, we use ResNet-101 instead. In addition, a bottleneck layer (two fully-connected layers) is used as the classifier. We use the Adam optimizer with a learning rate of 1 × 10^{-6} and a weight decay of 5 × 10^{-4}, setting the learning rate of the extractor to ten times that of the classifier. The hyperparameters α and β are set to 0.05 and 0.005, respectively. The framework is implemented in PyTorch (Paszke et al., 2019).
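The optimizer setup described above can be written as two parameter groups. The linear modules below are placeholders for the actual backbone and bottleneck classifier:

```python
import torch

# Placeholders for F(.) and C(.); any nn.Module works the same way here.
extractor = torch.nn.Linear(784, 256)
classifier = torch.nn.Linear(256, 31)

# Base lr 1e-6 for the classifier, ten times that for the extractor,
# weight decay 5e-4 applied to both groups (Sec. 5.1 settings).
optimizer = torch.optim.Adam(
    [
        {"params": extractor.parameters(), "lr": 1e-5},
        {"params": classifier.parameters(), "lr": 1e-6},
    ],
    weight_decay=5e-4,
)
```

Per-parameter-group learning rates avoid juggling two optimizers while still letting the pre-trained backbone and the fresh classifier move at different speeds.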

5.2. RESULTS

We compare ADNT with different approaches on the four datasets; the results are shown in the following tables, where the domain pointed to by the arrow is the target domain and the others are the source domains. As shown in Table 1, the experiments on Office-31 give strong results: our method is second only to PTMDA and, among ResNet-50-based approaches, achieves the best results in all scenarios except transferring W and D to A. On the Office-Caltech dataset, our method not only outperforms all current approaches but also achieves an impressive 100% accuracy on the tasks of transferring to W and to D; it achieves the best results on all tasks except transferring to A, with a significant advantage over the ResNet-101-based method. As shown in Table 3, on the medium-sized Office-Home dataset, our proposed ADNT is the best performer, with a 0.9% lead over the best ResNet-50-based method. Furthermore, compared with the ResNet-101-based methods (DINE and SImpAl-101), our approach achieves leading performance with fewer parameters and a simpler model structure. As shown in Table 4, ADNT also demonstrates its strength on the largest and most challenging dataset: the overall result outperforms the current best method by 2.3%, which is highly impressive on DomainNet. Overall, our approach not only achieves the best results but also performs better on large datasets with more domains, especially DomainNet. The tasks with degraded performance, apart from gaps in model structure (compared to ResNet-101), mainly appear when transferring to less stylized domains (Amazon, Clipart, Quickdraw). Further analysis is presented in the Appendix.

5.3. ABLATION STUDY

To verify the effectiveness of the proposed components, we conduct ablation experiments on the Office-Home dataset, as shown in Table 5. To determine whether the attention-driven module actually achieves domain fusion, we experiment with domain fusion alone, without $\mathcal{L}_{Cr}$, shown as '+ CADM (w/o $\mathcal{L}_{Cr}$)' in the table. Additionally, since our proposed contrary attention differs from general attention by attending to the fusion of domain-specific information, a comparison with general attention can evaluate the viability of our strategy. We therefore replace the contrary attention map in CADM with a general attention map, i.e., omit Eq. 5; the result is shown as '+ ADM' in Table 5. The significant performance improvement demonstrates the effectiveness of our proposed CADM. We also use t-SNE (Van der Maaten & Hinton, 2008) to visualize the results with and without CADM, as shown in Fig. 3: with the designed CADM, the different distributions mix with each other, achieving a higher degree of fusion, while the decision boundaries between categories become more distinct. Table 5 also verifies the effectiveness of our AR-CE loss. Additionally, we conduct experiments using only the reverse cross-entropy loss, denoted as + $\mathcal{L}_{R\text{-}CE}$. As shown in Table 5, $\mathcal{L}_{R\text{-}CE}$ brings a performance improvement, and the adaptive factor amplifies it, demonstrating the positive and effective role of the adaptive factor. Due to the space limit, additional analytical experiments, including the visualization of the training process and an analysis of the hyperparameters, can be found in the appendix.

6. CONCLUSION

This paper proposes ADNT, a framework that combines a domain fusion module with a loss function robust to noisy labels. First, we construct a contrary attention-based domain merge module, which mixes different domain-specific information and thus fuses features from different domains. In addition, we design a robust loss function to avoid the error accumulation of pseudo-label generation. Extensive experiments on four datasets show the superior performance of ADNT, which strongly demonstrates the effectiveness of our approach.

A APPENDIX

A.1 ADDITIONAL EXPERIMENTS ABOUT CADM

Our proposed CADM aims to achieve domain fusion through the transfer of domain-specific information, where the coefficients in the contrary attention map $A'$ play an important role in focusing on semantically different domain-specific information. To verify the effect of contrary attention, we visualize the weights in $A'$ with Grad-CAM; for example, the column 'Painting' in Fig. 4 reflects the attention of the other domains to that sample in the Painting domain. Since our approach emphasizes attention to domain-specific information from other domains, it is reasonable for it to focus on texture, background, etc., rather than content. As shown in Fig. 4, the designed contrary attention does focus on domain-specific information such as texture, color, background, and other domain characteristics (e.g., handwritten text in Sketch) rather than content, which makes domain fusion possible. Meanwhile, the visualization on the Quickdraw domain shows that, since this domain is composed of simple lines, there is not enough texture or background information, and the attention can be limited. This also explains why our approach does not achieve the best performance on some tasks, beyond the difference in model structure: domains such as A in Office-31 and Qdr in DomainNet lack sufficient domain information to support contrary attention. Nevertheless, ADNT still performs at the top on these domains, which reflects the robustness of our method.

As shown in Fig. 5, during training the standard deviation of $A'$ gradually becomes smaller and eventually stabilizes, which indicates that the features from the different domains are distributed in the space in a uniform state.
However, the above results alone do not demonstrate that features from different domains are deeply fused, because the effect in Fig. 5 could also be achieved if the features were sufficiently separated from each other. Consequently, we map the fused features $h$ into a low-dimensional space and visualize the distribution during training, divided by domain and category. As shown in Fig. 6, at the start of training the fused features are distributed haphazardly in space, and the network lacks the capacity to process them. Then, as the model parameters update, the different domains gradually move toward each other and merge. Finally, the features of the same category become more compact while preserving the domain fusion. This demonstrates that, through CADM, the different domains embrace each other and achieve deep fusion, and the model learns to handle the fused features to excellent effect.

A.2 ADDITIONAL EXPERIMENTS ABOUT L AR-CE

First, as shown in the Fig. 7 , we show the specific structure of the pseudo-label generation and the gradient back-propagation. Through L AR-CE , the soft label generation can be optimized during gradient descent to weaken the influence from the noise in the pseudo-labels. Furthermore, the accuracy of the generated pseudo labels can reflect the effectiveness and stability of the pseudo-label generation. Therefore, we compare the average accuracy of pseudo labels generated with and without AR-CE loss on Office-31, Office-Caltech and Office-Home, which is shown in Fig. 7 . Specifically, we calculate the mean accuracy of the final five epochs during training. It can be seen that after the introduction of the AR-CE loss, the model can derive noisy labels through self-correction and achieve the high accuracy of pseudo labels. Finally, we have conducted additional experiments to vertify the effect of adaptive factors. Specifically, we counted 1000 samples with correct and wrong pseudo labels separately and calculated their distribution entropy as described in Sec. our proposed adaptive factor. Although it is difficult to be completely correct because the pseudo labels depend on the generation, our method can significantly achieve the correction of pseudo-label generation. As a result, the impact of noise on the model performance is reduced. 

A.3 MODEL ARCHITECTURE

Since the feature extractor F(·) in our method is a pre-trained ResNet-50, we explore the performance of our method with a larger network by conducting experiments on Office-31, Office-Caltech, and Office-Home using ResNet-101 as the feature extractor, as shown in Tables 6, 7, and 8. The results show that the performance does improve somewhat, though it drops on some tasks (especially those already close to 100% accuracy with ResNet-50). On the Office-Home dataset, there is a significant improvement. This indicates that data size is an important factor limiting the application of large models in our approach. Moreover, since we simply swap the model in this experiment, there is further room to optimize how the larger model is combined with our proposed ADNT.

As can be seen in Fig. 10, the value of the hyperparameter α is somewhat larger, while β takes an overall smaller value. This indicates that the domain style centroid is updated more vigorously, providing a gradient sufficient for the backpropagation of the network, whereas the memory M requires slower updates to maintain robustness because it is involved in the pseudo-label generation.



Figure 3: t-SNE: feature visualization on the Office-Home dataset for the task Ar, Cl, Rw to Pr. (a), (b), and (c) show the results of Source-only, CADM without $\mathcal{L}_{Cr}$, and CADM with $\mathcal{L}_{Cr}$, respectively. All source domains are shown in red, while the target domain is shown in blue.

Figure 4: Grad-CAM visualization. The first row shows the original images and the second row shows the visualizations.

Figure 5: The standard deviation of A ′ on the Office-Home dataset.

Figure 6: Visualization of the feature distribution during training for the transfer task Ar, Cl, Rw to Pr. (a), (b), (c), and (d) correspond to epochs 1, 10, 20, and 30, respectively. For a clearer illustration, we select features from the first ten classes of each domain, with different colors representing different classes. Solid circles represent the source domains, while squares represent the target domain. Best viewed enlarged.

Figure 7: The specific structure of the pseudo-label generation in our ADNT, where the Class Score is the output of classifier.

Figure 8: Visualization of the value of the distribution entropy for different samples, where the vertical coordinate is the number of samples and the horizontal coordinate is the value of the distribution entropy.

Figure 9: Mean accuracies (%) of pseudo labels on three datasets. (a), (b), (c) represent the result on Office-31, Office-Caltech and Office-Home respectively.

A.4 PARAMETER ANALYSIS

Exponential moving averages (EMA) are used in the experiments to maintain the domain style centroids and the class-level features, so the selection of the hyperparameters α and β requires additional analysis and verification. We therefore perform an experimental comparison of their values, presented in Fig. 10.

Figure 10: Sensitivity analysis of α and β.

Classification accuracies (%) on Office-31 dataset, " * " indicates that the method is based on ResNet-101.

Classification accuracies (%) on Office-Caltech dataset. " * " indicates that the method is based on ResNet-101.

Classification accuracies (%) on Office-Home dataset, " * " indicates that the method is based on ResNet-101.

Classification accuracies (%) on DomainNet dataset.

Classification accuracies (%) of ablation study on Office-Home dataset.

Classification accuracies (%) on Office-31 dataset

Classification accuracies (%) on Office-Caltech dataset

Classification accuracies (%) on Office-Home dataset

