FREE LUNCH FOR DOMAIN ADVERSARIAL TRAINING: ENVIRONMENT LABEL SMOOTHING

Abstract

A fundamental challenge for machine learning models is generalizing to out-of-distribution (OOD) data. Among various approaches, exploiting invariant features by Domain Adversarial Training (DAT) has received widespread attention. Despite its success, we observe training instability in DAT, mostly due to an over-confident domain discriminator and environment label noise. To address this issue, we propose Environment Label Smoothing (ELS), which encourages the discriminator to output soft probabilities, thereby reducing the confidence of the discriminator and alleviating the impact of noisy environment labels. We demonstrate, both experimentally and theoretically, that ELS can improve training stability, local convergence, and robustness to noisy environment labels. By incorporating ELS into DAT methods, we obtain state-of-the-art results on a wide range of domain generalization/adaptation tasks, particularly when the environment labels are highly noisy. The code is available at https://github.com/yfzhang114/Environment-Label-Smoothing. * Work done during an internship at Alibaba Group. † Work done at Alibaba Group, and

1. INTRODUCTION

Despite being empirically effective on visual recognition benchmarks (Russakovsky et al., 2015), modern neural networks are prone to learning shortcuts that stem from spurious correlations (Geirhos et al., 2020), resulting in poor generalization to out-of-distribution (OOD) data. A popular thread of methods, minimizing domain divergence by Domain Adversarial Training (DAT) (Ganin et al., 2016), has shown better domain transfer performance, suggesting that it has the potential to be an effective way to extract domain-invariant features. Despite its power for domain adaptation and domain generalization, DAT is known to be difficult to train and to converge (Roth et al., 2017; Jenni & Favaro, 2019; Arjovsky & Bottou, 2017; Sønderby et al., 2016). The main difficulty for stable training is maintaining healthy competition between the encoder and the domain discriminator. Recent work seeks to attain this goal by designing novel optimization methods (Acuna et al., 2022; Rangwani et al., 2022); however, most of them require additional optimization steps and slow down convergence.

In this work, we tackle the challenge from a different angle than previous works, namely the design of environment labels. Two important observations about the training instability of DAT motivate this work. (i) Environment label noise arises from the environment partition (Creager et al., 2021) and from training (Thanh-Tung et al., 2019). As shown in Figure 1, different domains of the VLCS benchmark show no significant difference in image style, and for some images it is impossible to tell which domain they belong to. Besides, as the encoder gets better, the features generated from different domains become more similar. However, regardless of their quality, features are still labeled differently. As shown in (Thanh-Tung et al., 2019), discriminators overfit these mislabelled examples and then generalize poorly. (ii) To the best of our knowledge, DAT methods all assign one-hot environment labels to each data sample for domain discrimination, so the output probabilities are highly confident. For DAT, a very confident domain discriminator leads to highly oscillatory gradients (Arjovsky & Bottou, 2017), which harms training stability. The first observation inspires us to make the training process robust to environment-label noise, and the second encourages us to have the discriminator estimate soft probabilities rather than make confident classifications. To this end, we propose Environment Label Smoothing (ELS), a simple method to tackle these obstacles for DAT. Next, we summarize the main methodological, theoretical, and experimental contributions.

Methodology: To the best of our knowledge, this is the first work to smooth environment labels for DAT. The proposed ELS yields three main advantages: (i) it does not require any extra parameters or optimization steps and yields faster convergence, better training stability, and more robustness to label noise, both theoretically and empirically; (ii) besides being efficient, ELS is also easy to implement, and can be incorporated into any DAT method in very few lines of code; (iii) ELS-equipped DAT methods attain superior generalization performance compared to their native counterparts.

Theories: The benefit of ELS is theoretically verified in the following aspects. (i) Training stability. We first connect DAT to Jensen-Shannon/Kullback-Leibler divergence minimization, where ELS is shown to extend the support of the training distributions and to relieve both the oscillatory-gradient and the gradient-vanishing phenomena, which results in stable and well-behaved training. (ii) Robustness to noisy labels. We theoretically verify that the negative effect caused by noisy labels can be reduced or even eliminated by ELS with a proper smoothing parameter. (iii) Faster non-asymptotic convergence speed. We analyze the non-asymptotic convergence properties of DANN. The results indicate that incorporating ELS can further speed up the convergence process. In addition, we also provide the empirical gap and analyze some commonly used DAT tricks.

Experiments: (i) Experiments are carried out on various benchmarks with different backbones, including image classification, image retrieval, natural language processing, genomics data, graph, and sequential data. ELS brings consistent improvement when incorporated into different DAT methods and achieves competitive or SOTA performance on various benchmarks, e.g., average accuracy on Rotating MNIST (52.1% → 62.1%), worst group accuracy on CivilComments (61.7% → 65.9%), test ID accuracy on RxRx1 (22.9% → 26.7%), and average accuracy on the Spurious-Fourier dataset (11.1% → 15.6%). (ii) Even if the environment labels are random or partially known, the performance of ELS + DANN does not degrade much and is superior to native DANN. (iii) Extensive analyses of training dynamics are conducted to verify the benefit of ELS empirically. (iv) We conduct thorough ablations on the hyper-parameter of ELS and give some useful suggestions for choosing the best smoothing parameter based on dataset information.

2. METHODOLOGY

For domain generalization tasks, there are $M$ source domains $\{D_i\}_{i=1}^{M}$. Let the hypothesis $h$ be the composition $h = \hat{h} \circ g$, where $g \in \mathcal{G}$ pushes forward the data samples to a representation space $\mathcal{Z}$ and $\hat{h} = (\hat{h}_1(\cdot), \dots, \hat{h}_M(\cdot)) \in \hat{\mathcal{H}} : \mathcal{Z} \to [0,1]^M$, with $\sum_{i=1}^{M} \hat{h}_i(\cdot) = 1$, is the domain discriminator with softmax activation function. The classifier is defined as $\hat{h}' \in \hat{\mathcal{H}}' : \mathcal{Z} \to [0,1]^C$, with $\sum_{i=1}^{C} \hat{h}'_i(\cdot) = 1$, where $C$ is the number of classes. The cost used for the discriminator can be defined as:

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g}(D_1, \dots, D_M) = \max_{\hat{h} \in \hat{\mathcal{H}}} \mathbb{E}_{x \in D_1} \log \hat{h}_1 \circ g(x) + \dots + \mathbb{E}_{x \in D_M} \log \hat{h}_M \circ g(x), \tag{1}$$

where $\hat{h}_i \circ g(x)$ is the predicted probability that $x$ belongs to $D_i$. Denote by $y$ the class label; then the overall objective of DAT is

$$\min_{\hat{h}', g} \max_{\hat{h}} \frac{1}{M} \sum_{i=1}^{M} \mathbb{E}_{x \in D_i} \left[ \ell(\hat{h}' \circ g(x), y) \right] + \lambda\, d_{\hat{h},g}(D_1, \dots, D_M), \tag{2}$$

where $\ell$ is the cross-entropy loss for classification tasks and MSE for regression tasks, and $\lambda$ is the tradeoff weight. We call the first term the empirical risk minimization (ERM) part and the second term the adversarial training (AT) part. Applying ELS, the target in Equ. (1) can be reformulated as

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g,\gamma}(D_1, \dots, D_M) = \max_{\hat{h} \in \hat{\mathcal{H}}} \mathbb{E}_{x \in D_1} \left[ \gamma \log \hat{h}_1 \circ g(x) + \frac{1-\gamma}{M-1} \sum_{j=1;\, j \neq 1}^{M} \log \left( \hat{h}_j \circ g(x) \right) \right] + \dots + \mathbb{E}_{x \in D_M} \left[ \gamma \log \hat{h}_M \circ g(x) + \frac{1-\gamma}{M-1} \sum_{j=1;\, j \neq M}^{M} \log \left( \hat{h}_j \circ g(x) \right) \right]. \tag{3}$$
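As an illustration of the smoothed target in Equ. (3), the following is a minimal sketch of the per-sample discriminator loss in plain Python. It is not the authors' implementation: the function names (`smoothed_env_label`, `els_discriminator_loss`) and the example logits are assumptions made for illustration, and the loss is written as the negative of the bracketed term in Equ. (3) so that it is minimized rather than maximized.

```python
import math

def smoothed_env_label(domain, num_domains, gamma):
    """Soft environment label: gamma on the true domain,
    (1 - gamma) / (M - 1) on each of the other M - 1 domains."""
    off_value = (1.0 - gamma) / (num_domains - 1)
    return [gamma if j == domain else off_value for j in range(num_domains)]

def softmax(logits):
    """Numerically stable softmax over a list of discriminator logits."""
    shift = max(logits)
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def els_discriminator_loss(logits, domain, gamma):
    """Cross-entropy between the smoothed environment label and the
    softmax output; the negative of one bracketed term in Equ. (3)."""
    probs = softmax(logits)
    target = smoothed_env_label(domain, len(logits), gamma)
    return -sum(t * math.log(p) for t, p in zip(target, probs))

# Hypothetical logits for one sample from domain 0, with M = 3 domains.
logits = [2.0, 0.5, -1.0]
hard_loss = els_discriminator_loss(logits, domain=0, gamma=1.0)  # one-hot label
soft_loss = els_discriminator_loss(logits, domain=0, gamma=0.8)  # smoothed label
```

With `gamma = 1.0` the loss reduces to the usual one-hot cross-entropy. For a discriminator that is already confident in the true domain, the smoothed loss is larger, which is what discourages over-confident outputs.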

3. THEORETICAL VALIDATION

In this section, we first assume the discriminator is optimized with no constraint, providing a theoretical interpretation of applying ELS. Then how ELS makes the training process more stable is discussed based on the interpretation and some analysis of the gradients. We next theoretically show that with ELS, the effect of label noise can be eliminated. Finally, to mitigate the impact of the no constraint assumption, the empirical gap, parameterization gap, and non-asymptotic convergence property are analyzed respectively. All omitted proofs can be found in the Appendix.

3.1. DIVERGENCE MINIMIZATION INTERPRETATION

In this subsection, the connection between ELS/one-sided ELS and divergence minimization is studied. We also explain theoretically the advantages brought by ELS and why GANs prefer one-sided ELS. We begin with the two-domain setting, which is used in domain adaptation and generative adversarial networks, and then extend the result to the multi-domain setting.

Proposition 1. Given two domain distributions $D_S, D_T$ over $\mathcal{X}$ and a hypothesis class $\mathcal{H}$, suppose $\hat{h} \in \hat{\mathcal{H}}$ is the optimal discriminator with no constraint, and denote the mixed distributions with hyper-parameter $\gamma \in [0.5, 1]$ as

$$D_{S'} = \gamma D_S + (1-\gamma) D_T, \qquad D_{T'} = \gamma D_T + (1-\gamma) D_S.$$

Then minimizing domain divergence by adversarial training with ELS is equal to minimizing $2 D_{JS}(D_{S'} \| D_{T'}) - 2 \log 2$, where $D_{JS}$ is the Jensen-Shannon (JS) divergence.

[...] manifolds (Arjovsky & Bottou, 2017; Roth et al., 2017). Adding noise from an arbitrary distribution to the data has been shown to extend the support of both distributions (Jenni & Favaro, 2019; Arjovsky & Bottou, 2017; Sønderby et al., 2016) and to protect the discriminator against measure-0 adversarial examples (Jenni & Favaro, 2019), which results in stable and well-behaved training. Environment label smoothing can be viewed as a kind of noise injection: e.g., in Proposition 1, $D_{S'} = D_T + \gamma (D_S - D_T)$, where the noise is $\gamma (D_S - D_T)$, and the two distributions are more likely to have overlapping supports.

ELS relieves the gradient vanishing phenomenon. As shown in Section 3.1, the adversarial target approximates the KL or JS divergence, and when the discriminator is not optimal, such an approximation is inaccurate. We show that in vanilla DANN, as the discriminator gets better, the gradient passed from the discriminator to the encoder vanishes (Proposition 4 and Proposition 5). Namely, either the approximation is inaccurate or the gradient vanishes, which makes adversarial training extremely hard (Arjovsky & Bottou, 2017). Incorporating ELS is shown to relieve the gradient vanishing phenomenon when the discriminator is close to the optimal one, and it stabilizes the training process.

[Figure: The sum of gradients provided to the encoder by the adversarial loss.]

ELS serves as a data-driven regularization and stabilizes the oscillatory gradients. Gradients of the encoder with respect to the adversarial loss remain highly oscillatory in native DANN, which is an important reason for the instability of adversarial training (Mescheder et al., 2018). Figure 2 shows the gradient dynamics throughout the training process, where the PACS dataset is used as an example. With ELS, the gradient brought by the adversarial loss is smoother and more stable. The benefit is theoretically supported in Section A.6, where applying ELS is shown to be similar to adding a regularization term on the discriminator parameters, which stabilizes the supplied gradients compared to the vanilla adversarial loss.
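The identity in Proposition 1 can be checked numerically on a toy example. The sketch below is an illustration only: it assumes discrete distributions on four points (the values of `p_s`, `p_t` and `gamma` are made up), plugs the unconstrained optimal discriminator $\hat{h}^* = p_{s'} / (p_{s'} + p_{t'})$ into the two-domain ELS objective, and compares the result with $2 D_{JS}(D_{S'} \| D_{T'}) - 2 \log 2$.

```python
import math

def mix(p, q, gamma):
    """Mixed distribution gamma * p + (1 - gamma) * q."""
    return [gamma * a + (1.0 - gamma) * b for a, b in zip(p, q)]

def kl(p, q):
    """KL divergence between discrete distributions (0 log 0 treated as 0)."""
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js(p, q):
    """Jensen-Shannon divergence between discrete distributions."""
    m = [(a + b) / 2.0 for a, b in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def els_objective(p_s, p_t, h, gamma):
    """Two-domain ELS objective for a discriminator h given pointwise."""
    total = 0.0
    for ps, pt, hz in zip(p_s, p_t, h):
        total += ps * (gamma * math.log(hz) + (1 - gamma) * math.log(1 - hz))
        total += pt * ((1 - gamma) * math.log(hz) + gamma * math.log(1 - hz))
    return total

# Toy source/target distributions with partly disjoint supports (assumed values).
p_s = [0.7, 0.2, 0.1, 0.0]
p_t = [0.0, 0.1, 0.3, 0.6]
gamma = 0.9

p_s_mix = mix(p_s, p_t, gamma)   # density of D_S'
p_t_mix = mix(p_t, p_s, gamma)   # density of D_T'
h_star = [a / (a + b) for a, b in zip(p_s_mix, p_t_mix)]

lhs = els_objective(p_s, p_t, h_star, gamma)
rhs = 2.0 * js(p_s_mix, p_t_mix) - 2.0 * math.log(2.0)
```

Note that the mixed densities are strictly positive wherever either original density is, so `h_star` stays inside (0, 1) even on points where `p_s` or `p_t` is zero; this is the support-extension effect discussed above.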

3.3. ELS MEETS NOISY LABELS

To analyze the benefits of ELS when noisy labels exist, we adopt the symmetric noise model (Kim et al., 2019). Specifically, given two environments with a high-dimensional feature $x$ and environment label $y \in \{0, 1\}$, assume that noisy labels $\tilde{y}$ are generated by a random noise transition with noise rate $e = P(\tilde{y} = 1 \mid y = 0) = P(\tilde{y} = 0 \mid y = 1)$. Denote $f := \hat{h} \circ g$, $\ell$ the cross-entropy loss, and $\tilde{y}^{\gamma}$ the smoothed noisy label; then minimizing the smoothed loss with noisy labels can be converted to

$$\begin{aligned} \min_f \mathbb{E}_{(x,\tilde{y}) \sim \tilde{D}} \left[ \ell(f(x), \tilde{y}^{\gamma}) \right] &= \min_f \mathbb{E}_{(x,\tilde{y}) \sim \tilde{D}} \left[ \gamma \ell(f(x), \tilde{y}) + (1-\gamma) \ell(f(x), 1-\tilde{y}) \right] \\ &= \min_f \mathbb{E}_{(x,y) \sim D} \left[ \ell(f(x), y^{\gamma^*}) \right] + (\gamma^* - \gamma - e + 2\gamma e)\, \mathbb{E}_{(x,y) \sim D} \left[ \ell(f(x), 1-y) - \ell(f(x), y) \right], \end{aligned} \tag{4}$$

where $\gamma^*$ is the optimal smoothing parameter that makes the classifier return the best performance on unseen clean data (Wei et al., 2022). The first term in Equ. (4) is the risk under the clean label. The influence of both noisy labels and ELS is reflected in the last term of Equ. (4): $\mathbb{E}_{(x,y) \sim D} [\ell(f(x), 1-y) - \ell(f(x), y)]$ is the opposite of the optimization process we expect. Without label smoothing ($\gamma = 1$), the weight is $\gamma^* - 1 + e$, and a high noise rate $e$ makes this harmful term contribute more to the optimization. On the contrary, by choosing the smoothing parameter $\gamma = \frac{\gamma^* - e}{1 - 2e}$, the second term is removed. For example, if $e = 0$, the best smoothing parameter is just $\gamma^*$.
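The decomposition in Equ. (4) can be verified with a few lines of arithmetic. The sketch below is illustrative only: the function names and the numeric values for the two per-sample losses, $\gamma^*$ and $e$ are assumptions. It evaluates the expected smoothed loss under symmetric label noise directly, compares it with the right-hand side of Equ. (4), and computes the $\gamma$ that cancels the noise term.

```python
def noisy_smoothed_loss(loss_true, loss_flip, gamma, e):
    """Expected smoothed loss when the label is flipped with probability e.
    loss_true = l(f(x), y), loss_flip = l(f(x), 1 - y)."""
    clean_case = gamma * loss_true + (1 - gamma) * loss_flip    # label kept
    flipped_case = gamma * loss_flip + (1 - gamma) * loss_true  # label flipped
    return (1 - e) * clean_case + e * flipped_case

def noise_term_weight(gamma, gamma_star, e):
    """Weight of E[l(f(x), 1 - y) - l(f(x), y)] in Equ. (4)."""
    return gamma_star - gamma - e + 2 * gamma * e

def cancelling_gamma(gamma_star, e):
    """Smoothing parameter that zeroes the noise term (requires e != 0.5).
    The result only lies in [0, 1] when gamma_star <= 1 - e."""
    return (gamma_star - e) / (1 - 2 * e)

# Assumed example values.
loss_true, loss_flip = 0.3, 1.6
gamma_star, e, gamma = 0.7, 0.2, 0.9

lhs = noisy_smoothed_loss(loss_true, loss_flip, gamma, e)
clean_risk = gamma_star * loss_true + (1 - gamma_star) * loss_flip
rhs = clean_risk + noise_term_weight(gamma, gamma_star, e) * (loss_flip - loss_true)

best_gamma = cancelling_gamma(gamma_star, e)
```

When `gamma` is set to `best_gamma`, the noisy smoothed loss coincides with the clean risk smoothed by `gamma_star`, which is the cancellation described in the text.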

3.4. EMPIRICAL GAP AND PARAMETERIZATION GAP

Propositions in Section 3.1 and Section 3.2 are based on two unrealistic assumptions: (i) infinite data samples, and (ii) the discriminator is optimized without a constraint, namely, over an infinite-dimensional space. In practice, only empirical distributions with finite samples are observed, and the discriminator is always constrained to smaller classes such as neural networks (Goodfellow et al., 2014) or reproducing kernel Hilbert spaces (RKHS) (Li et al., 2017a). Besides, as shown in (Arora et al., 2017; Schäfer et al., 2019), JS divergence has a large empirical gap, e.g., let $D_\mu, D_\nu$ be uniform Gaussian distributions $\mathcal{N}(0,$ [...]

A natural question arises: "Given finite samples for multi-domain AT over a finite-dimensional parameterized space, does the expectation over the empirical distribution converge to the expectation over the true distribution?" In this subsection, we seek to answer this question by analyzing the empirical gap and parameterization gap, which is $|d_{\hat{h},g}(D_1, \dots, D_M) - d_{\hat{h},g}(\hat{D}_1, \dots, \hat{D}_M)|$, where $\hat{D}_i$ is the empirical distribution of $D_i$ and $\hat{h}$ is constrained. We first show that if $\mathcal{H}$ is a hypothesis class of VC dimension $d$, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, the gap is less than $4\sqrt{(d \log(2n^*) + \log(2/\delta))/n^*}$, where $n^* = \min(n_1, \dots, n_M)$ and $n_i$ is the number of samples in $D_i$ (Appendix A.8). The above analysis is based on the $\mathcal{H}$-divergence and the VC dimension; we further analyze the gap when the discriminator is constrained to be Lipschitz continuous and build a connection between the gap and the model parameters. Specifically, suppose that each $\hat{h}_i$ is $L$-Lipschitz with respect to the parameters, and use $p$ to denote the number of parameters of $\hat{h}_i$. Then there is a universal constant $c$ such that when $n^* \geq c p M \log(Lp/\epsilon)/\epsilon$, with probability at least $1 - \exp(-p)$, the gap is less than $\epsilon$ (Appendix A.9). Although this analysis cannot by itself support the benefits of ELS, as far as we know, it is the first attempt to study the empirical and parameterization gap of multi-domain AT.
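To get a feel for the VC-dimension bound above, the following sketch evaluates $4\sqrt{(d \log(2n^*) + \log(2/\delta))/n^*}$ for given domain sizes. The function name and all numeric inputs (VC dimension, sample counts, $\delta$) are hypothetical values chosen for illustration; they are not taken from the experiments.

```python
import math

def empirical_gap_bound(vc_dim, domain_sizes, delta):
    """Evaluate 4 * sqrt((d * log(2 n*) + log(2 / delta)) / n*),
    where n* is the size of the smallest domain."""
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie in (0, 1)")
    n_star = min(domain_sizes)
    numerator = vc_dim * math.log(2 * n_star) + math.log(2.0 / delta)
    return 4.0 * math.sqrt(numerator / n_star)

# Hypothetical setting: VC dimension 10, three domains, confidence 95%.
small = empirical_gap_bound(vc_dim=10, domain_sizes=[1_000, 5_000, 2_000], delta=0.05)
large = empirical_gap_bound(vc_dim=10, domain_sizes=[100_000, 500_000, 200_000], delta=0.05)
```

Because the bound depends only on the smallest domain, adding samples to the larger domains does not tighten it; the bound shrinks only as the smallest domain grows.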

3.5. NON-ASYMPTOTIC CONVERGENCE

As mentioned in Section 3.4, the analyses in Section 3.1 and Section 3.2 assume that the optimal discriminator can be obtained, which implies both that the hypothesis set has infinite modeling capacity and that the training process can converge to the optimal result. If the objective of AT is convex-concave, then many works support global convergence (Nowozin et al., 2016; Yadav et al., 2017). However, the convex-concave assumption is too unrealistic to hold (Nie & Patel, 2020; Nagarajan & Kolter, 2017); namely, the updates of DAT are no longer guaranteed to converge. In this section, we focus on the local convergence behavior of DAT at points near the equilibrium. Specifically, we focus on non-asymptotic convergence, which has been shown to reveal the convergence of the dynamic system more precisely than asymptotic analysis (Nie & Patel, 2020).

We build a toy example to help us understand the convergence of DAT. Denote by $\eta$ the learning rate, $\gamma$ the parameter for ELS, and $c$ a constant. We summarize our theoretical results (detailed in Appendix A.10): (1) Simultaneous Gradient Descent (GD) DANN, which trains the discriminator and the encoder simultaneously, has no guarantee of non-asymptotic convergence. (2) If we train the discriminator $n_d$ times for every $n_e$ times we train the encoder, the resulting alternating GD DANN can converge with a sublinear convergence rate only when $\eta \leq \frac{4}{\sqrt{n_d n_e}\, c}$. Such results support the importance of alternating GD training, which is commonly used in DANN implementations (Gulrajani & Lopez-Paz, 2021). (3) Incorporating ELS into alternating GD speeds up the convergence rate by a factor of $\frac{1}{2\gamma - 1}$; that is, the model can converge when $\eta \leq \frac{4}{\sqrt{n_d n_e}\, c} \cdot \frac{1}{2\gamma - 1}$.

Remark. In the above analysis, we made some assumptions: e.g., in Section 3.5, we assume the algorithms are initialized in a neighborhood of a unique equilibrium point, and in Section 3.4 we assume that the NN is $L$-Lipschitz. These assumptions may not hold in practice, and they are computationally hard to verify. We therefore support our theoretical results empirically in the next section, verifying the benefits to convergence, training stability, and generalization.

4. EXPERIMENTS

To demonstrate the effectiveness of ELS, in this section we select a broad range of tasks (Table 1): image classification, image retrieval, natural language processing, genomics, graph, and sequential prediction tasks. Our aim is to include benchmarks with (i) various numbers of domains (from 3 to 120,084); (ii) various numbers of classes (from 2 to 18,530); (iii) various dataset sizes (from 3,200 to 448,000); (iv) various dimensionalities and backbones (Transformer, ResNet, MobileNet, GIN, RNN). See Appendix C for full details of all experimental settings, including dataset details, hyper-parameters, implementation details, and model structures. We conduct all the experiments on a machine with i7-8700K, 32G RAM, and four GTX2080ti. All experiments are repeated 3 times with different seeds, and the full experimental results can be found in the appendix.

4.1. NUMERICAL RESULTS ON DIFFERENT SETTINGS AND BENCHMARKS

Domain Generalization and Domain Adaptation on Image Classification Tasks. We first incorporate ELS into SDAT, which is a variant of the DAT method and achieves state-of-the-art performance on the Office-Home dataset. Table 2 and Table 4 show that with the simple smoothing trick, the performance of SDAT is consistently improved, and on many of the domain pairs the improvement is greater than 1%. Besides, ELS brings consistent improvement with ResNet-18, ResNet-50, and ViT backbones. The average domain generalization results on other benchmarks are shown in Table 3. We observe consistent improvements achieved by DANN+ELS compared to DANN, and the average accuracy on VLCS achieved by DANN+ELS (81.5%) clearly outperforms all other methods. See Appendix D.1 for Multi-Source Domain Generalization performance and for DG performance on Rotated MNIST and on Image Retrieval benchmarks.

Table 1: A summary of the evaluation benchmarks. Wg. acc. denotes worst group accuracy, 10 %/ acc. denotes 10th percentile accuracy. Surviving entry: GIN (Xu et al., 2018).

Table rows (backbone ResNet18):
Method                        A-W    D-W    W-D    A-D    D-A    W-A    Avg
ERM (Vapnik, 1999)            72.2   97.7   100.0  72.3   61.0   59.9   77.2
DANN (Ganin et al., 2016)     84 [...]

Table 3 rows (column headers not recovered; the last column is the VLCS average):
[...] 2019)                   85.7 ± 1.0   79.3 ± 1.1   97.6 ± 0.4   75.9 ± 1.0   84.6   97.6 ± 0.5   64.7 ± 1.1   69.7 ± 0.5   76.6 ± 0 [...]
[...] (Zhang et al., 2021b)   85.0 ± 1.2   81.4 ± 0.2   95.9 ± 0.3   80.9 ± 0.5   85.8   97.6 ± 0.6   66.5 ± 0.3   72.7 ± 0.6   74.4 ± 0.7   77.8
Fisher (Rame et al., 2021)    -            -            -            -            86.9   -            -            -            -            76.2
DDG (Zhang et al., 2021a)     88.9 ± 0.6   85.0 ± 1.9   97.2 ± 1.2   84.3 ± 0.7   88.9   99.1 ± 0.6   66.5 ± 0.3   73.3 ± 0.6   80.9 ± 0.6   80.0
DANN+ELS                      87.8 ± 0.8   83.8 ± 1.6   97.1 ± 0.4   81.4 ± 1.3   87.5   99.1 ± 0.3   73.2 ± 1.1   73.8 ± 0.9   79.9 ± 0.9   81.5
↑                             2.4          0.7          0.8          1.8          1.4    0.5          0            1            1.1          0.7

Domain Generalization with Partial Environment Labels. One of the main advantages brought by ELS is robustness to environment label noise. As shown in Figure 3 (a), when all environment labels are known (GT), DANN+ELS is slightly better than DANN. When only some environment labels are known (for example, 30% means that the environment labels of 30% of the training data are known and the others are annotated differently from the ground truth), DANN+ELS outperforms DANN by a large margin (more than 5% accuracy when only 20% correct environment labels are given). Besides, we further assume that the total number of environments is also unknown and that the environment number is generated randomly. M=2 in Figure 3 (a) means that we partition all the training data randomly into two domains, which are then used for training. With random environment partitions, DANN+ELS consistently beats DANN by a large margin, which verifies that the smoothness of the discrimination loss brings significant robustness to environment label noise for DAT.

Continuously Indexed Domain Adaptation. We compare DANN+ELS with state-of-the-art continuously indexed domain adaptation methods. Table 5 compares the accuracy of various methods. DANN shows inferior performance to CIDA. However, with ELS, DANN+ELS boosts the generalization performance by a large margin and beats the SOTA method CIDA (Wang et al., 2020). We also [...] 5 shows that the representative DA method (ADDA) performs poorly when asked to align domains with continuous indices. However, the proposed DANN+ELS can get a near-optimal decision boundary.

Table 4: Accuracy (%) on Office-Home for unsupervised DA (with ResNet-50 and ViT backbone). SDAT+ELS outperforms other SOTA DA techniques and improves SDAT consistently.
Method                        Backbone    A-C    A-P    A-R    C-A    C-P    C-R    P-A    P-C    P-R    R-A    R-C    R-P    Avg
[...] (He et al., 2016)       ResNet-50   34.9   50.0   58.0   37.4   41.9   46.2   38.5   31.2   60.4   53.9   41.2   59.9   46.1
DANN (Ganin et al., 2016)                 45 [...]

Generalization results on other structural datasets and Sequential Datasets. Table 6 shows the generalization results on NLP datasets, and Tables 7 and 14 show the results on genomics datasets. DANN+ELS brings large performance improvements on most of the evaluation metrics, e.g., 4.17% test worst-group accuracy on CivilComments, 3.79% test ID accuracy on RxRx1, and 3.13% test accuracy on OGB-MolPCBA. Generalization results on sequential prediction tasks are shown in Table 15 and Table 18, where DANN works poorly but DANN+ELS brings consistent improvement and beats all baselines on the Spurious-Fourier dataset.

4.2. INTERPRETATION AND ANALYSIS

To choose the best γ. Figure 3 (b) visualizes the best γ values in our experiments. For datasets like PACS and VLCS, where each domain is set as the target domain in turn and has its own best γ, we calculate the mean and standard deviation of all these γ values. Our main observation is that, as the number of domains increases, the optimal γ decreases, which is intuitive because more domains mean that the discriminator is more likely to overfit and thus needs a lower γ to solve the problem. Interestingly, in Figure 3 (b), PACS and VLCS both have 4 domains, but VLCS needs a higher γ. Figure 6 shows that images from different domains in PACS differ greatly in appearance and can be easily discriminated. In contrast, domains in VLCS do not show significant visual differences, and it is hard to tell which domain an image belongs to. The discrimination difficulty caused by this inter-domain distinction is another important factor affecting the selection of γ.

Table rows (column headers not recovered):
[...] (Sagawa et al., 2019)     70.7 ± 0.6   70.0 ± 0.6   54.7 ± 0.0   53.3 ± 0.0   54.2 ± 0.3   6.3 ± 0.2
CORAL (Sun & Saenko, 2016)      72.0 ± 0.3   71.1 ± 0.3   54.7 ± 0.0   52.9 ± 0.8   30.0 ± 0.2   6.1 ± 0.1
IRM (Arjovsky et al., 2019)     71.5 ± 0.3   70.5 ± 0.3   54.2 ± 0.8   52.4 ± 0.8   32.2 ± 0.8   5.3 ± 0.2
Reweight                        69.1 ± 0.5   68.6 ± 0.6   52.1 ± 0.2   52.0 ± 0.0   34.9 ± 1.2   9.1 ± 0.4
DANN (Ganin et al., 2016)       72 [...]

Table rows (column headers not recovered):
[...] (Arjovsky et al., 2019)   89.0 ± 0.7   65.9 ± 2.8   88.8 ± 0.7   66.3 ± 2.1
ERM (Vapnik, 1999)              92.3 ± 0.2   50.5 ± 1.9   92.2 ± 0.1   56.0 ± 3.6
DANN (Ganin et al., 2016)       87.0 ± 0.3   64.0 ± 2.0   87.0 ± 0.3   61.7 ± 2.2
DANN+ELS                        88.5 ± 0.4   65.9 ± 1.1   88.4 ± 0.4   66.0 ± 2.2
↑                               1.4          1.9          1.4          4.3

Table 7: Domain generalization performance on genomics dataset, RxRx1 (RxRx1-Wilds).
Algorithm                         Val Acc      Test ID Acc   Test Acc     Val Worst-Group Acc   Test ID Worst-Group Acc   Test Worst-Group Acc
ERM (Vapnik, 1999)                19.4 ± 0.2   35.9 ± 0.4    29.9 ± 0.4   -                     -                         -
Group DRO (Sagawa et al., 2019)   15.2 ± 0.1   28.1 ± 0.3    23.0 ± 0.3   -                     -                         -
IRM (Arjovsky et al., 2019)       5.6 ± 0.4    9.9 ± 1.4     8.2 ± 1.1    0.8 ± 0.2             1.9 ± 0.4                 1.5 ± 0.2
DANN (Ganin et al., 2016)         12.7 ± 0.2   22.9 ± 0.1    19.2 ± 0.1   1.0 ± 0.1             4.6 ± 0.4                 3.6 ± 0.0
DANN+ELS                          14.1 ± 0.1   26.7 ± 0.1    21.2 ± 0.2   1.1 ± 0.1             7.2 ± 0.3                 4.2 ± 0.1
↑                                 1.4          3.8           2            0.1                   2.6                       0.6

Annealing γ. To achieve better generalization performance and avoid troublesome parameter searches, we propose to gradually decrease γ as training progresses, specifically, $\gamma = 1.0 - \frac{M-1}{M} \cdot \frac{t}{T}$, where $t$ and $T$ are the current training step and the total number of training steps. Figure 3 (c) shows that annealing γ achieves comparable or even better generalization performance than a γ found by fine-grained search.

Empirical Verification of our theoretical results. We use the PACS dataset as an example to empirically support our theoretical results, namely verifying the benefits to convergence, training stability, and generalization. In Figure 4, 'A [...] as sources. Considering ELS, we can see that in all the experimental results, DANN+ELS with an appropriate γ attains higher training stability, faster and more stable convergence, and better performance compared to DANN. In comparison, the training dynamics of native DANN are highly oscillatory, especially in the middle and late stages of training.
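The annealing schedule for γ described in Section 4.2 can be sketched as follows. This is a minimal illustration of the formula $\gamma = 1.0 - \frac{M-1}{M} \cdot \frac{t}{T}$ only; the function name `annealed_gamma` and the example values of `M` and `T` are assumptions, not part of the released code.

```python
def annealed_gamma(step, total_steps, num_domains):
    """Linearly anneal gamma from 1.0 (one-hot labels) at step 0
    down to 1 / M (uniform labels) at the final step."""
    if total_steps <= 0 or num_domains < 2:
        raise ValueError("need total_steps > 0 and at least two domains")
    progress = min(max(step / total_steps, 0.0), 1.0)  # clamp to [0, 1]
    return 1.0 - (num_domains - 1) / num_domains * progress

# Hypothetical run: M = 4 source domains, T = 5000 training steps.
M, T = 4, 5000
schedule = [annealed_gamma(t, T, M) for t in (0, 1250, 2500, 5000)]
```

At the end of training the smoothed label places weight 1/M on every domain, so the discriminator target becomes uniform; in between, the value returned here would be passed as the smoothing parameter of the discriminator loss at each step.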

5. RELATED WORKS

Label Smoothing and Analysis. Label smoothing is a technique from the 1980s that was independently re-discovered by Szegedy et al. (2016). Recently, label smoothing has been shown to reduce the vulnerability of neural networks (Warde-Farley & Goodfellow, 2016) and to reduce the risk of adversarial examples in GANs (Salimans et al., 2016). Several works seek to study the effect of label smoothing theoretically or empirically. Chen et al. (2020) focus on studying the minimizer of the training error and finding the optimal smoothing parameter. Xu et al. (2020) analyze the convergence behavior of stochastic gradient descent with label smoothing. However, as far as we know, no study focuses on the effect of label smoothing on the convergence speed and training stability of DAT.

Domain Adversarial Training. Ganin et al. (2016) use a domain discriminator to distinguish the source and target domains; the gradients passed from the discriminator to the encoder are reversed by the Gradient Reversal Layer (GRL), which achieves the goal of learning domain-invariant features. Schoenauer-Sebag et al. (2019) and Zhao et al. (2018) extend the generalization bounds of DANN (Ganin et al., 2016) to multi-source domains and propose multi-source domain adversarial networks. Hu et al. (2021) incorporate prototypical features into DAT to achieve semantic domain alignment. Acuna et al. (2022) interpret the DAT framework through the lens of game theory and propose to replace gradient descent with high-order ODE solvers. Rangwani et al. (2022) find that enforcing the smoothness of the classifier leads to better generalization on the target domain and present Smooth Domain Adversarial Training (SDAT). The proposed method is orthogonal to existing DAT methods and yields excellent optimization properties theoretically and empirically.

Due to space limits, the related works on domain adaptation, domain generalization, and adversarial training in GANs are deferred to the appendix.

6. CONCLUSION

In this work, we propose a simple approach, ELS, to improve the training process of DAT methods from the perspective of environment label design, which is orthogonal to most existing DAT methods. Incorporating ELS into DAT methods is empirically and theoretically shown to improve robustness to noisy environment labels, speed up convergence, stabilize training, and yield better generalization performance. As far as we know, our work takes a first step towards utilizing and understanding label smoothing for environment labels. Although ELS is designed for DAT methods, reducing the effect of environment label noise and using a soft environment partition may benefit all DG/DA methods, which is a promising future direction.

A PROOFS OF THEORETICAL STATEMENTS

The commonly used notations and their corresponding descriptions are summarized in Table 8.

A.1 CONNECT ENVIRONMENT LABEL SMOOTHING TO JS DIVERGENCE MINIMIZATION

Table 8: Commonly used notations and their descriptions.
[...]                                  Data samples from source domain, target domain, and domain $i$.
$D^z_S, D^z_T, D^z_i$                  Feature distributions of $D_S, D_T, D_i$ respectively, also termed $g \circ D_S, g \circ D_T, g \circ D_i$.
$p^z_s, p^z_t, p^z_i$                  Density functions for $D^z_S, D^z_T, D^z_i$ respectively.
$z_s, z_t, z_i$                        Data samples from $D^z_S, D^z_T, D^z_i$.
$\mathcal{H}, \hat{\mathcal{H}}, \mathcal{G}$   Support sets for hypothesis, discriminator, and feature encoder.
$h, \hat{h}, \hat{h}^*, g$             Hypothesis, discriminator, the optimal discriminator, and feature encoder.

To complete the proofs, we begin by introducing some necessary definitions and assumptions.

Definition 1. ($\mathcal{H}$-divergence (Ben-David et al., 2006)). Given two domain distributions $D_S, D_T$ over $\mathcal{X}$, and a hypothesis class $\mathcal{H}$, the $\mathcal{H}$-divergence between $D_S, D_T$ is

$$d_{\mathcal{H}}(D_S, D_T) = 2 \sup_{h \in \mathcal{H}} \left| \mathbb{E}_{x \sim D_S}[h(x) = 1] - \mathbb{E}_{x \sim D_T}[h(x) = 1] \right|. \tag{5}$$

Definition 2. (Empirical $\mathcal{H}$-divergence (Ben-David et al., 2006)). For a symmetric hypothesis class $\mathcal{H}$, one can compute the empirical $\mathcal{H}$-divergence between two empirical distributions $\hat{D}_S$ and $\hat{D}_T$ by computing

$$\hat{d}_{\mathcal{H}}(\hat{D}_S, \hat{D}_T) = 2 \left( 1 - \min_{h \in \mathcal{H}} \left[ \frac{1}{m} \sum_{i=1}^{m} I[h(x_i) = 0] + \frac{1}{n} \sum_{i=1}^{n} I[h(x_i) = 1] \right] \right), \tag{6}$$

where $m, n$ are the numbers of data samples of $\hat{D}_S$ and $\hat{D}_T$ respectively, and $I[a]$ is the indicator function, which is 1 if predicate $a$ is true and 0 otherwise.

Vanilla DANN estimates the "min" part of Equ. (6) by a domain discriminator that models the probability that a given input is from the source domain or the target domain. Specifically, let the hypothesis $h$ be the composition $h = \hat{h} \circ g$, where $\hat{h} \in \hat{\mathcal{H}}$ is an additional hypothesis and $g \in \mathcal{G}$ pushes forward the data samples to a representation space $\mathcal{Z}$. DANN (Ben-David et al., 2006) seeks to approximate the $\mathcal{H}$-divergence of Equ. (6) by

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g}(D_S, D_T) = \max_{\hat{h} \in \hat{\mathcal{H}}} \mathbb{E}_{x_s \sim D_S} \log \hat{h} \circ g(x_s) + \mathbb{E}_{x_t \sim D_T} \log \left( 1 - \hat{h} \circ g(x_t) \right), \tag{7}$$
where the sigmoid activate function is ignored for simplicity, ĥ ○ g(x) is the prediction probability that x is belonged to D S and 1 -ĥ ○ g(x) is the prediction probability that x is belonged to D T . Applying environment label smoothing, the target can be reformulated to max ĥ∈ Ĥ d ĥ,g,γ (D S , D T ) = max ĥ∈ Ĥ E xs∼D S [γ log ĥ ○ g(x s ) + (1 -γ) log (1 -ĥ ○ g(x s ))] + E xt∼D T [(1 -γ) log ĥ ○ g(x t ) + γ log (1 -ĥ ○ g(x t ))] When γ ∈ {0, 1}, Equ. ( 8) is equal to Equ. ( 7) and no environment label smoothing is applied. Then we prove the proposition 1 Proposition 1. Suppose ĥ the optimal domain classifier with no constraint and mixed distri- butions { D S ′ = γD S + (1 -γ)D T D T ′ = γD T + (1 -γ)D S with hyper-parameter γ, then max ĥ∈ Ĥ d ĥ,g,γ (D S , D T ) = 2D JS (D S ′ ||D T ′ ) -2 log 2 , where D JS is the Jensen-Shanon (JS) divergence. Proof. Denote the injected source/target density as p z s ∶= g ○p s , p z t ∶= g ○p t , where p s , p t is the density of D S , D T respectively. We can rewrite Equ. (8) as: d ĥ,g,γ (D S , D T ) = ∫ Z p z s (z) log [γ log ĥ(z) + (1 -γ) log (1 -ĥ(z))] + p z t (z) [(1 -γ) log ĥ(z) + γ log (1 -ĥ(z))] We first take derivatives and find the optimal ĥ * :  ∂d ĥ,g,γ (D S , D T ) ∂ ĥ(z) = p z s (z) [γ 1 ĥ(z) + (1 -γ) -1 1 -ĥ(z) ] + p z t (z) [(1 -γ) log 1 ĥ(z) + γ -1 1 -ĥ(z) ] = 0 ⇒ p z s (z) [γ(1 -ĥ(z)) -(1 -γ) ĥ(z)]] + p z t (z) [(1 -γ)(1 -ĥ(z)) -γ ĥ(z)] = 0 ⇒ p z s (z) [γ -ĥ(z)] + p z t (z) [1 -γ -ĥ(z)] = 0 ⇒ ĥ * (z) = p z t (z) + γ(p z s (z) -p z t (z)) p z s (z) + p z t (z) max ĥ∈ Ĥ d ĥ,g,γ (D S , D T ) = ∫ Z p s [γ log [ p t + γ(p s -p t ) p s + p t ] + (1 -γ) log [ p s + γ(p t -p s ) p s + p t ]] + p t [(1 -γ) log [ p t + γ(p s -p t ) p s + p t ] + γ log [ p s + γ(p t -p s ) p s + p t ]] d z = ∫ Z p s log p s + ⎧ ⎪ ⎪ ⎨ ⎪ ⎪ ⎩ p s = γp s ′ +(γ-1)p t ′ 2γ-1 p t = γp t ′ +(γ-1)p s ′ 2γ-1 , and p s ′ + p t ′ = p s + p t . Then 1 in Equ. 
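(11) can be rearranged to

$$\begin{aligned}
& \frac{\gamma}{2\gamma-1} \left( p_{s'} \log \frac{p_{t'}}{p_{s'} + p_{t'}} + p_{t'} \log \frac{p_{s'}}{p_{s'} + p_{t'}} \right) + \frac{\gamma-1}{2\gamma-1} \left( p_{t'} \log \frac{p_{t'}}{p_{s'} + p_{t'}} + p_{s'} \log \frac{p_{s'}}{p_{s'} + p_{t'}} \right) \\
=\; & \frac{\gamma}{2\gamma-1} \left( p_{s'} \log \frac{p_{t'}}{p_{s'} + p_{t'}} + p_{s'} \log \frac{p_{s'}}{p_{t'}} - p_{s'} \log \frac{p_{s'}}{p_{t'}} + p_{t'} \log \frac{p_{s'}}{p_{s'} + p_{t'}} \right) + \frac{\gamma-1}{2\gamma-1} \left( p_{t'} \log \frac{p_{t'}}{p_{s'} + p_{t'}} + p_{s'} \log \frac{p_{s'}}{p_{s'} + p_{t'}} \right) \\
=\; & \left( p_{t'} \log \frac{p_{t'}}{p_{s'} + p_{t'}} + p_{s'} \log \frac{p_{s'}}{p_{s'} + p_{t'}} \right) - \frac{\gamma}{2\gamma-1} \left( p_{s'} \log \frac{p_{s'}}{p_{t'}} + p_{t'} \log \frac{p_{t'}}{p_{s'}} \right) \\
=\; & 2 \cdot \frac{1}{2} \left( p_{t'} \log \frac{2 p_{t'}}{p_{s'} + p_{t'}} + p_{s'} \log \frac{2 p_{s'}}{p_{s'} + p_{t'}} - 2 \log 2 \right) - \frac{\gamma}{2\gamma-1} \left( p_{s'} \log \frac{p_{s'}}{p_{t'}} + p_{t'} \log \frac{p_{t'}}{p_{s'}} \right) \\
=\; & 2 D_{JS}(D_{S'} \| D_{T'}) - 2 \log 2 - \frac{\gamma}{2\gamma-1} (p_{s'} - p_{t'}) \log \frac{p_{s'}}{p_{t'}}
\end{aligned} \tag{12}$$

② in Equ. (11) can be rearranged to

$$\gamma \left( p_s \log \frac{p_{s'}}{p_{t'}} + p_t \log \frac{p_{t'}}{p_{s'}} \right) = \gamma \log \frac{p_{s'}}{p_{t'}} \left( \frac{\gamma p_{s'} + (\gamma-1) p_{t'}}{2\gamma-1} - \frac{\gamma p_{t'} + (\gamma-1) p_{s'}}{2\gamma-1} \right) = \frac{\gamma}{2\gamma-1} (p_{s'} - p_{t'}) \log \frac{p_{s'}}{p_{t'}}$$

By plugging the rearranged ① and ② into Equ. (11), the two extra terms cancel and we get

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g,\gamma}(D_S, D_T) = 2 D_{JS}(D_{S'} \| D_{T'}) - 2 \log 2$$

A.2 CONNECT ONE-SIDED ENVIRONMENT LABEL SMOOTHING TO JS DIVERGENCE MINIMIZATION

Proposition 2. Given two domain distributions $D_S, D_T$ over $X$, where $D_S$ is the real data distribution and $D_T$ is the generated data distribution. The cost used for the discriminator is

$$\max_{h \in \mathcal{H}} d_h(D_S, D_T) = \max_{h \in \mathcal{H}} \mathbb{E}_{x_s \sim D_S} \log h(x_s) + \mathbb{E}_{x_t \sim D_T} \log (1 - h(x_t)),$$

where $h \in \mathcal{H}: X \to [0, 1]$. Suppose $h \in \mathcal{H}$ is the optimal discriminator with no constraint, and consider the mixed distributions $D_{S'} = \gamma D_S$ and $D_{T'} = D_T + (1-\gamma) D_S$. Then with one-sided environment label smoothing, $\max_{h \in \mathcal{H}} d_{h,\gamma}(D_S, D_T) = 2 D_{JS}(D_{S'} \| D_{T'}) - 2 \log 2$.

Proof. Applying one-sided environment label smoothing, the target can be reformulated as

$$\begin{aligned}
\max_{h \in \mathcal{H}} d_{h,\gamma}(D_S, D_T) &= \max_{h \in \mathcal{H}} \mathbb{E}_{x_s \sim D_S} \left[ \gamma \log h(x_s) + (1-\gamma) \log (1 - h(x_s)) \right] + \mathbb{E}_{x_t \sim D_T} \left[ \log (1 - h(x_t)) \right] \\
&= \max_{h \in \mathcal{H}} \int_X p_s(x) \left[ \gamma \log h(x) + (1-\gamma) \log (1 - h(x)) \right] + p_t(x) \log (1 - h(x)) \, dx
\end{aligned} \tag{16}$$

where $\gamma$ is a value slightly less than one, and $p_s(x), p_t(x)$ are the densities of $D_S, D_T$ respectively.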
By taking derivatives and finding the optimal $h$ we get $h^* = \frac{\gamma p_s(x)}{p_s(x) + p_t(x)}$. Plugging the optimal $h^*$ into the original target we get

$$\begin{aligned}
& \int_X p_s(x) \left[ \gamma \log \frac{\gamma p_s(x)}{p_s(x) + p_t(x)} + (1-\gamma) \log \frac{p_t(x) + (1-\gamma) p_s(x)}{p_s(x) + p_t(x)} \right] + p_t(x) \log \frac{p_t(x) + (1-\gamma) p_s(x)}{p_s(x) + p_t(x)} \, dx \\
=\; & \int_X p_s(x) \gamma \log \frac{\gamma p_s(x)}{p_s(x) + p_t(x)} + \left[ p_s(x)(1-\gamma) + p_t(x) \right] \log \frac{p_t(x) + (1-\gamma) p_s(x)}{p_s(x) + p_t(x)} \, dx \\
=\; & \int_X p_{s'}(x) \log \frac{p_{s'}(x)}{p_{s'}(x) + p_{t'}(x)} + p_{t'}(x) \log \frac{p_{t'}(x)}{p_{s'}(x) + p_{t'}(x)} \, dx \\
=\; & 2 D_{JS}(D_{S'} \| D_{T'}) - 2 \log 2,
\end{aligned} \tag{17}$$

where $D_{S'} = \gamma D_S$ and $D_{T'} = D_T + (1-\gamma) D_S$ are two mixed distributions and $p_{s'} = \gamma p_s$, $p_{t'} = p_t + (1-\gamma) p_s$ are their densities.

Our result offers an explanation of why GANs use only one-sided label smoothing rather than native label smoothing. If the density of real data in a region is near zero, $p_s(x) \to 0$, native environment label smoothing will be dominated by the generated sample densities alone, because $p_{s'} = p_t + \gamma (p_s - p_t) \approx (1-\gamma) p_t$ and $p_{t'} = p_s + \gamma (p_t - p_s) \approx \gamma p_t$. Namely, the discriminator will not align the distributions of generated and real samples, but will force the generator to produce samples that follow the fake mode $D_T$. In contrast, one-sided label smoothing preserves the real distribution density as far as possible, that is, $p_{s'} = \gamma p_s$, $p_{t'} \approx \gamma p_t$, which avoids divergence minimization between fake mode and fake mode and relieves mode collapse.
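The two-sided identity of Proposition 1 can be sanity-checked numerically on a small discrete example. The sketch below is illustrative only: the three-point distributions, the value of γ, and the helper names `smoothed_objective` and `js_divergence` are assumptions introduced here, and integrals over $Z$ are replaced by finite sums.

```python
import math

def smoothed_objective(p_s, p_t, h, gamma):
    """Discrete analogue of Equ. (8): densities become probability vectors."""
    total = 0.0
    for ps, pt, hz in zip(p_s, p_t, h):
        total += ps * (gamma * math.log(hz) + (1 - gamma) * math.log(1 - hz))
        total += pt * ((1 - gamma) * math.log(hz) + gamma * math.log(1 - hz))
    return total

def kl_divergence(p, q):
    return sum(a * math.log(a / b) for a, b in zip(p, q) if a > 0)

def js_divergence(p, q):
    mid = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, mid) + 0.5 * kl_divergence(q, mid)

# Hypothetical feature densities on a three-point space, and a smoothing level.
p_s = [0.7, 0.2, 0.1]
p_t = [0.1, 0.3, 0.6]
gamma = 0.9

# Optimal discriminator from the proof of Proposition 1.
h_star = [(pt + gamma * (ps - pt)) / (ps + pt) for ps, pt in zip(p_s, p_t)]

# Mixed distributions D_S' and D_T'.
p_s_mix = [gamma * ps + (1 - gamma) * pt for ps, pt in zip(p_s, p_t)]
p_t_mix = [gamma * pt + (1 - gamma) * ps for ps, pt in zip(p_s, p_t)]

lhs = smoothed_objective(p_s, p_t, h_star, gamma)
rhs = 2 * js_divergence(p_s_mix, p_t_mix) - 2 * math.log(2)

# Any other discriminator should score lower than h_star.
h_shifted = [hz + 0.05 for hz in h_star]
lower = smoothed_objective(p_s, p_t, h_shifted, gamma)
```

In this setup `lhs` and `rhs` agree up to floating-point error, and shifting the discriminator away from `h_star` lowers the objective, as expected for a maximizer.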

A.3 CONNECT MULTI-DOMAIN ADVERSARIAL TRAINING TO KL DIVERGENCE MINIMIZATION

Proposition 3. Given domain distributions $\{D_i\}_{i=1}^M$ over $X$, and a hypothesis class $\mathcal{H}$. Suppose $\hat{h} \in \hat{\mathcal{H}}$ is the optimal discriminator with no constraint and $D_{Mix} = \sum_{i=1}^M D_i$ is the mixed distribution. Then the multi-domain adversarial objective equals $\sum_{i=1}^M D_{KL}(D_i \| D_{Mix})$, and with environment label smoothing it equals $\sum_{i=1}^M D_{KL}(D_{i'} \| D_{Mix})$, where $D_{i'} = \gamma D_i + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M D_j$.

Proof. We restate the corresponding notations and definitions as follows. Given $M$ domains $\{D_i\}_{i=1}^M$, let the hypothesis $h$ be the composition $h = \hat{h} \circ g$, where $g \in \mathcal{G}$ pushes forward the data samples to a representation space $Z$ and the domain discriminator with softmax activation function is defined as $\hat{h} = (\hat{h}_1(\cdot), \dots, \hat{h}_M(\cdot)) \in \hat{\mathcal{H}}: Z \to [0, 1]^M$; $\sum_{i=1}^M \hat{h}_i(\cdot) = 1$. Denote by $g \circ D_i$ the feature distribution of $D_i$ encoded by encoder $g$. The cost used for the discriminator can be defined as

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g}(D_1, \dots, D_M) = \max_{\hat{h} \in \hat{\mathcal{H}}} \mathbb{E}_{z \sim g \circ D_1} \log \hat{h}_1(z) + \dots + \mathbb{E}_{z \sim g \circ D_M} \log \hat{h}_M(z), \quad \text{s.t.} \; \sum_{i=1}^M \hat{h}_i(z) = 1 \tag{18}$$

Denote by $p^z_i(z)$ the density of the feature distribution $g \circ D_i$. For simplicity, we ignore $\int_Z$. Applying a Lagrange multiplier and taking the first derivative with respect to each $\hat{h}_i$, we get, for $i = 1, \dots, M$,

$$\frac{\partial d_{\hat{h},g}}{\partial \hat{h}_i} = p^z_i(z) \frac{1}{\hat{h}_i(z)} - \lambda = 0 \;\Rightarrow\; \hat{h}_i(z) = \frac{p^z_i(z)}{\lambda} \;\overset{\text{①}}{\Rightarrow}\; \hat{h}^*_i(z) = \frac{p^z_i(z)}{p^z_1(z) + \dots + p^z_M(z)} \tag{19}$$

where $\lambda$ is the Lagrange variable and ① holds because of the constraint $\sum_{i=1}^M \hat{h}_i(z) = 1$. Denote by $D_{Mix} = \sum_{i=1}^M D_i$ the mixed distribution and by $p^z_{Mix} = \sum_{i=1}^M p^z_i$ its density. Then we have

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g}(D_1, \dots, D_M) = \int_Z p^z_1(z) \log \frac{p^z_1(z)}{p^z_{Mix}(z)} + p^z_2(z) \log \frac{p^z_2(z)}{p^z_{Mix}(z)} + \dots + p^z_M(z) \log \frac{p^z_M(z)}{p^z_{Mix}(z)} \, dz = \sum_{i=1}^M D_{KL}(D_i \| D_{Mix}), \tag{20}$$

where $D_{KL}$ is the KL divergence. With environment label smoothing, the target is

$$\max_{\hat{h} \in \hat{\mathcal{H}}} d_{\hat{h},g,\gamma}(D_1, \dots, D_M) = \max_{\hat{h} \in \hat{\mathcal{H}}} \mathbb{E}_{z \sim g \circ D_1} \left[ \gamma \log \hat{h}_1(z) + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq 1}^M \log \hat{h}_j(z) \right] + \dots + \mathbb{E}_{z \sim g \circ D_M} \left[ \gamma \log \hat{h}_M(z) + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq M}^M \log \hat{h}_j(z) \right], \quad \text{s.t.} \; \sum_{i=1}^M \hat{h}_i(z) = 1 \tag{21}$$

Applying the same operation as in Equ. (19), we get, for $i = 1, \dots, M$,

$$\frac{\partial d_{\hat{h},g,\gamma}}{\partial \hat{h}_i} = \gamma p^z_i(z) \frac{1}{\hat{h}_i(z)} + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M p^z_j(z) \frac{1}{\hat{h}_i(z)} - \lambda = 0 \;\Rightarrow\; \hat{h}^*_i(z) = \frac{\gamma p^z_i(z) + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M p^z_j(z)}{p^z_1(z) + \dots + p^z_M(z)} \tag{22}$$

Denote by $\{D_{i'} = \gamma D_i + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M D_j\}_{i=1}^M$ a set of mixed distributions and by $\{p_{i'}(z) = \gamma p^z_i(z) + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M p^z_j(z)\}_{i=1}^M$ the corresponding densities, so that $\hat{h}^*_i(z) = p_{i'}(z) / p^z_{Mix}(z)$. Plugging Equ. (22) into the target we get

$$\begin{aligned}
& \sum_{i=1}^M \left[ \int_Z \gamma p^z_i(z) \log \frac{p_{i'}(z)}{p^z_{Mix}(z)} + \frac{1-\gamma}{M-1} \sum_{k=1; k \neq i}^M p^z_i(z) \log \frac{p_{k'}(z)}{p^z_{Mix}(z)} \, dz \right] \\
=\; & \sum_{i=1}^M \left[ \int_Z \gamma p^z_i(z) \log \frac{p_{i'}(z)}{p^z_{Mix}(z)} + \frac{1-\gamma}{M-1} \sum_{k=1; k \neq i}^M p^z_k(z) \log \frac{p_{i'}(z)}{p^z_{Mix}(z)} \, dz \right] \\
=\; & \sum_{i=1}^M \left[ \int_Z \left( \gamma p^z_i(z) + \frac{1-\gamma}{M-1} \sum_{k=1; k \neq i}^M p^z_k(z) \right) \log \frac{p_{i'}(z)}{p^z_{Mix}(z)} \, dz \right] \\
=\; & \sum_{i=1}^M D_{KL}(D_{i'} \| D_{Mix}),
\end{aligned}$$

where the second line regroups the double sum by the index of the discriminator output inside the logarithm.

A.4 TRAINING STABILITY BROUGHT BY ENVIRONMENT LABEL SMOOTHING

Let $D_S, D_T$ be two distributions and $D^z_S, D^z_T$ their induced distributions projected by encoder $g: X \to Z$ over the feature space. We first show that if $D^z_S, D^z_T$ are disjoint or lie on low-dimensional manifolds, there is always a perfect discriminator between them.

Theorem 1. (Theorem 2.1 in (Arjovsky & Bottou, 2017).)
If two distributions $D^z_S, D^z_T$ have support contained in two disjoint compact subsets $\mathcal{M}$ and $\mathcal{P}$ respectively, then there is a smooth optimal discriminator $\hat{h}^*: Z \to [0, 1]$ that has accuracy 1 and $\nabla_z \hat{h}^*(z) = 0$ for all $z \in \mathcal{M} \cup \mathcal{P}$.

Theorem 2. (Theorem 2.2 in (Arjovsky & Bottou, 2017).) Assume two distributions $D^z_S, D^z_T$ have support contained in two closed manifolds $\mathcal{M}$ and $\mathcal{P}$ that do not perfectly align and do not have full dimension. Both $D^z_S, D^z_T$ are assumed to be continuous in their respective manifolds. Then there is a smooth optimal discriminator $\hat{h}^*: Z \to [0, 1]$ that has accuracy 1, and for almost all $z \in \mathcal{M} \cup \mathcal{P}$, $\hat{h}^*$ is smooth in a neighbourhood of $z$ and $\nabla_z \hat{h}^*(z) = 0$.

Namely, if the two distributions have supports that are disjoint or lie on low-dimensional manifolds, the optimal discriminator will be accurate on all samples and its gradient will be zero almost everywhere. We can then study the gradients passed to the generator through a discriminator.

Proposition 4. Denote by $g(\theta; \cdot): X \to Z$ a differentiable function that induces distributions $D^z_S, D^z_T$ with parameter $\theta$, and by $\hat{h}$ a differentiable discriminator. If Theorem 1 or 2 holds, given an $\epsilon$-optimal discriminator $\hat{h}$, that is $\sup_{z \in Z} \| \nabla_z \hat{h}(z) \|_2 + | \hat{h}(z) - \hat{h}^*(z) | < \epsilon$, and assuming the Jacobian matrix of $g(\theta; x)$ given $x$ is bounded by $\sup_{x \in X} [\| J_\theta(g(\theta; x)) \|_2] \leq C$, then we have

$$\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g}(D_S, D_T) \|_2 = 0 \tag{24}$$

$$\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g,\gamma}(D_S, D_T) \|_2 < 2(1-\gamma) C$$

Proof. Theorem 1 or 2 shows that in Equ. (8), $\hat{h}^*$ is locally one on the support of $D^z_S$ and zero on the support of $D^z_T$.
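Then, using Jensen's inequality, the triangle inequality, and the chain rule on these supports, the gradient passed to the generator through a discriminator given $x_s \sim D_S$ satisfies

$$\begin{aligned}
& \left\| \nabla_\theta \mathbb{E}_{x_s \sim D_S} \left[ \gamma \log \hat{h} \circ g(\theta; x_s) + (1-\gamma) \log (1 - \hat{h} \circ g(\theta; x_s)) \right] \right\|_2 \\
\leq\; & \mathbb{E}_{x_s \sim D_S} \left[ \| \nabla_\theta \gamma \log \hat{h} \circ g(\theta; x_s) \|_2 \right] + \mathbb{E}_{x_s \sim D_S} \left[ \| \nabla_\theta (1-\gamma) \log (1 - \hat{h} \circ g(\theta; x_s)) \|_2 \right] \\
\leq\; & \mathbb{E}_{x_s \sim D_S} \left[ \gamma \frac{\| \nabla_\theta \hat{h} \circ g(\theta; x_s) \|_2}{| \hat{h} \circ g(\theta; x_s) |} \right] + \mathbb{E}_{x_s \sim D_S} \left[ (1-\gamma) \frac{\| \nabla_\theta \hat{h} \circ g(\theta; x_s) \|_2}{| 1 - \hat{h} \circ g(\theta; x_s) |} \right] \\
\leq\; & \mathbb{E}_{x_s \sim D_S} \left[ \gamma \frac{\| \nabla_z \hat{h}(z) \|_2 \| J_\theta(g(\theta; x_s)) \|_2}{| \hat{h} \circ g(\theta; x_s) |} \right] + \mathbb{E}_{x_s \sim D_S} \left[ (1-\gamma) \frac{\| \nabla_z \hat{h}(z) \|_2 \| J_\theta(g(\theta; x_s)) \|_2}{| 1 - \hat{h} \circ g(\theta; x_s) |} \right] \\
<\; & \gamma \mathbb{E}_{x_s \sim D_S} \left[ \frac{\epsilon \| J_\theta(g(\theta; x_s)) \|_2}{| \hat{h}^* \circ g(\theta; x_s) - \epsilon |} \right] + (1-\gamma) \mathbb{E}_{x_s \sim D_S} \left[ \frac{\epsilon \| J_\theta(g(\theta; x_s)) \|_2}{| 1 - \hat{h}^* \circ g(\theta; x_s) + \epsilon |} \right] \\
\leq\; & \gamma \frac{\epsilon C}{1 - \epsilon} + (1-\gamma) C,
\end{aligned}$$

where the fifth line holds because $\hat{h}(z) \approx \hat{h}^*(z) - \epsilon$ when $\epsilon$ is small enough and $\| \nabla_z \hat{h}(z) \|_2 < \epsilon$. Similarly, the gradient given $x_t \sim D_T$ satisfies

$$\begin{aligned}
& \left\| \nabla_\theta \mathbb{E}_{x_t \sim D_T} \left[ (1-\gamma) \log \hat{h} \circ g(x_t) + \gamma \log (1 - \hat{h} \circ g(x_t)) \right] \right\|_2 \\
<\; & (1-\gamma) \mathbb{E}_{x_t \sim D_T} \left[ \frac{\epsilon \| J_\theta(g(\theta; x_t)) \|_2}{| \hat{h}^* \circ g(\theta; x_t) + \epsilon |} \right] + \gamma \mathbb{E}_{x_t \sim D_T} \left[ \frac{\epsilon \| J_\theta(g(\theta; x_t)) \|_2}{| 1 - \hat{h}^* \circ g(\theta; x_t) - \epsilon |} \right] \\
\leq\; & (1-\gamma) C + \gamma \frac{\epsilon C}{1 - \epsilon}
\end{aligned}$$

Here $\hat{h}(z) \approx \hat{h}^*(z) + \epsilon$ because $\hat{h}^*$ is locally zero on the support of $D^z_T$. Then we have

$$\begin{aligned}
\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g,\gamma}(D_S, D_T) \|_2 \leq\; & \lim_{\epsilon \to 0} \left\| \nabla_\theta \mathbb{E}_{x_s \sim D_S} \left[ \gamma \log \hat{h} \circ g(\theta; x_s) + (1-\gamma) \log (1 - \hat{h} \circ g(\theta; x_s)) \right] \right\|_2 \\
& + \left\| \nabla_\theta \mathbb{E}_{x_t \sim D_T} \left[ (1-\gamma) \log \hat{h} \circ g(x_t) + \gamma \log (1 - \hat{h} \circ g(x_t)) \right] \right\|_2 \\
<\; & \lim_{\epsilon \to 0} \underbrace{\gamma \frac{\epsilon C}{1 - \epsilon} + \gamma \frac{\epsilon C}{1 - \epsilon}}_{\text{①}} + \underbrace{(1-\gamma) C + (1-\gamma) C}_{\text{②}} = 2(1-\gamma) C,
\end{aligned}$$

where ① is equal to the gradient bound of native DANN in Equ. (7) times $\gamma$, namely $\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g}(D_S, D_T) \|_2 = 0$, which shows that as the discriminator gets better, the gradient of the encoder vanishes. With environment label smoothing, the bound on $\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g,\gamma}(D_S, D_T) \|_2$ becomes $2(1-\gamma) C$, which is nonzero for $\gamma < 1$ and alleviates the problem of vanishing gradients.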
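The mechanism behind Proposition 4 can be illustrated for a single source sample. The sketch below assumes, purely for illustration, that the discriminator is $\hat{h} = \sigma(t)$ for a scalar logit $t$ (the analysis above ignores the sigmoid); the function names are hypothetical. Under that assumption the derivative of the smoothed log-likelihood with respect to the logit is $\gamma - \sigma(t)$, which tends to $\gamma - 1$ rather than 0 as the discriminator saturates.

```python
import math

def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def smoothed_log_likelihood(t, gamma):
    """gamma * log(sigma(t)) + (1 - gamma) * log(1 - sigma(t)) for a source sample."""
    return -gamma * softplus(-t) - (1.0 - gamma) * softplus(t)

def grad_wrt_logit(t, gamma):
    """Closed-form derivative of smoothed_log_likelihood with respect to t."""
    return gamma - sigmoid(t)

def finite_difference(t, gamma, step=1e-5):
    hi = smoothed_log_likelihood(t + step, gamma)
    lo = smoothed_log_likelihood(t - step, gamma)
    return (hi - lo) / (2 * step)

# A confident (near-optimal) discriminator on a source sample: large logit.
saturated_logit = 20.0
grad_no_smoothing = grad_wrt_logit(saturated_logit, gamma=1.0)
grad_with_smoothing = grad_wrt_logit(saturated_logit, gamma=0.9)
```

With $\gamma = 1$ the signal reaching the encoder through the logit is essentially zero, while with $\gamma = 0.9$ it stays close to $-(1-\gamma)$. This only shows the saturation behaviour and does not reproduce the bound $2(1-\gamma)C$, which also involves the encoder Jacobian.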

A.5 TRAINING STABILITY ANALYSIS OF MULTI-DOMAIN SETTINGS

Let $\{D_i\}_{i=1}^M$ be a set of data distributions and $\{D^z_i\}_{i=1}^M$ their induced distributions projected by encoder $g: X \to Z$ over the feature space. Recall that the domain discriminator with softmax activation function is defined as $\hat{h} = (\hat{h}_1, \dots, \hat{h}_M) \in \hat{\mathcal{H}}: Z \to [0, 1]^M$, where $\hat{h}_i(z)$ denotes the probability that $z$ belongs to $D^z_i$. To verify the existence of each optimal discriminator $\hat{h}^*_i$, we can replace $D^z_S, D^z_T$ in Theorem 1 and Theorem 2 by $D^z_i$ and $\sum_{j=1; j \neq i}^M D^z_j$ respectively. Namely, if the distributions $D^z_i$ and $\sum_{j=1; j \neq i}^M D^z_j$ have supports that are disjoint or lie on low-dimensional manifolds, $\hat{h}^*_i$ can perfectly discriminate samples within and beyond $D^z_i$, and its gradient will be zero almost everywhere.

Proposition 5. Denote by $g(\theta; \cdot): X \to Z$ a differentiable function that induces distributions $\{D^z_i\}_{i=1}^M$ with parameter $\theta$, and by $\{\hat{h}_i\}_{i=1}^M$ the corresponding differentiable discriminators. If optimal discriminators for the induced distributions exist, given any $\epsilon$-optimal discriminator $\hat{h}_i$, that is $\sup_{z \in Z} \| \nabla_z \hat{h}_i(z) \|_2 + | \hat{h}_i(z) - \hat{h}^*_i(z) | < \epsilon$, and assuming the Jacobian matrix of $g(\theta; x)$ given $x$ is bounded by $\sup_{x \in X} [\| J_\theta(g(\theta; x)) \|_2] \leq C$, then we have

$$\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g}(D_1, \dots, D_M) \|_2 = 0 \tag{31}$$

$$\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g,\gamma}(D_1, \dots, D_M) \|_2 < M(1-\gamma) C$$

Proof.
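Following the proof of Proposition 4, we have

$$\begin{aligned}
& \lim_{\epsilon \to 0} \left\| \nabla_\theta \mathbb{E}_{x \in D_i} \left[ \gamma \log \hat{h}_i \circ g(x) + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M \log \hat{h}_j \circ g(x) \right] \right\|_2 \\
\leq\; & \lim_{\epsilon \to 0} \mathbb{E}_{x \sim D_i} \left[ \gamma \frac{\| \nabla_\theta \hat{h}_i \circ g(\theta; x) \|_2}{| \hat{h}_i \circ g(\theta; x) |} \right] + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M \mathbb{E}_{x \sim D_i} \left[ \frac{\| \nabla_\theta \hat{h}_j \circ g(\theta; x) \|_2}{| \hat{h}_j \circ g(\theta; x) |} \right] \\
<\; & \lim_{\epsilon \to 0} \gamma \mathbb{E}_{x \sim D_i} \left[ \frac{\epsilon \| J_\theta(g(\theta; x)) \|_2}{| \hat{h}^*_i \circ g(\theta; x) - \epsilon |} \right] + \frac{1-\gamma}{M-1} \sum_{j=1; j \neq i}^M \mathbb{E}_{x \sim D_i} \left[ \frac{\epsilon \| J_\theta(g(\theta; x)) \|_2}{| \hat{h}^*_j \circ g(\theta; x) + \epsilon |} \right] \\
\leq\; & \lim_{\epsilon \to 0} \left[ \gamma \frac{\epsilon C}{1 - \epsilon} + (1-\gamma) C \right] = (1-\gamma) C
\end{aligned} \tag{33}$$

where the third line holds because for $z \sim D^z_i$, $\hat{h}^*_i(z)$ is locally one and the other optimal discriminators $\hat{h}^*_j(z)$, $j \neq i$, $j \in [M]$, are all locally zero; thus we have $\hat{h}_i(z) \approx \hat{h}^*_i(z) - \epsilon$ and $\hat{h}_j(z) \approx \hat{h}^*_j(z) + \epsilon$. The term $\lim_{\epsilon \to 0} \frac{\epsilon C}{1 - \epsilon} = 0$ is the gradient passed to the generator by native multi-domain DANN (Equ. (1)). Environment label smoothing leads to another term, $(1-\gamma) C$, and avoids vanishing gradients. Considering all distributions, we have

$$\begin{aligned}
\lim_{\epsilon \to 0} \| \nabla_\theta d_{\hat{h},g,\gamma}(D_1, \dots, D_M) \|_2 \leq\; & \lim_{\epsilon \to 0} \left\| \nabla_\theta \mathbb{E}_{x \in D_1} \left[ \gamma \log \hat{h}_1 \circ g(x) + \frac{1-\gamma}{M-1} \sum_{j=2}^{M} \log \hat{h}_j \circ g(x) \right] \right\|_2 + \dots \\
& + \lim_{\epsilon \to 0} \left\| \nabla_\theta \mathbb{E}_{x \in D_M} \left[ \gamma \log \hat{h}_M \circ g(x) + \frac{1-\gamma}{M-1} \sum_{j=1}^{M-1} \log \hat{h}_j \circ g(x) \right] \right\|_2 = M(1-\gamma) C.
\end{aligned}$$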
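The multi-domain case can be illustrated in the same spirit. The sketch below assumes a softmax discriminator parameterized by a logit vector (an assumption made here for illustration; `smoothed_target` and the other names are hypothetical). For the smoothed objective of one domain, the gradient with respect to the logits is the smoothed target minus the softmax output, so it stays bounded away from zero when the discriminator is saturated.

```python
import math

def softmax(logits):
    shift = max(logits)
    exps = [math.exp(u - shift) for u in logits]
    total = sum(exps)
    return [e / total for e in exps]

def smoothed_target(domain, num_domains, gamma):
    """gamma on the true domain, (1 - gamma) / (M - 1) on every other domain."""
    off = (1.0 - gamma) / (num_domains - 1)
    return [gamma if k == domain else off for k in range(num_domains)]

def smoothed_log_likelihood(logits, domain, gamma):
    probs = softmax(logits)
    target = smoothed_target(domain, len(logits), gamma)
    return sum(t * math.log(p) for t, p in zip(target, probs))

def grad_wrt_logits(logits, domain, gamma):
    """Gradient of smoothed_log_likelihood: target minus softmax output."""
    probs = softmax(logits)
    target = smoothed_target(domain, len(logits), gamma)
    return [t - p for t, p in zip(target, probs)]

def l2_norm(vec):
    return math.sqrt(sum(v * v for v in vec))

# M = 4 domains; the discriminator is almost certain the sample is from domain 0.
saturated = [30.0, 0.0, 0.0, 0.0]
norm_no_smoothing = l2_norm(grad_wrt_logits(saturated, 0, gamma=1.0))
norm_with_smoothing = l2_norm(grad_wrt_logits(saturated, 0, gamma=0.9))
```

Here `norm_no_smoothing` is numerically zero, whereas `norm_with_smoothing` approaches $(1-\gamma)\sqrt{M/(M-1)}$, consistent with the extra $(1-\gamma)$ term appearing in Equ. (33).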

A.6 ELS STABILIZE THE OSCILLATORY GRADIENT

For clarity of the proof, the notation here differs slightly from that of other sections. Let $ec(i)$ be the cross-entropy loss for class $i$, let $g$ denote the encoder, and let $\{w_i\}_{i=1}^M$ be the classification parameters for all domains. The adversarial loss function for a given sample $x$ with domain index $i$ is then

$$\begin{aligned}
F(x, i) &= (1-\gamma) \, ec(i) + \frac{\gamma}{M-1} \sum_{j \neq i} ec(j) \\
&= ec(i) + \frac{\gamma}{M-1} \sum_j \left( ec(j) - ec(i) \right) \\
&= ec(i) + \frac{\gamma}{M-1} \sum_j \left( -\log \frac{\exp(w_j^\top g(x))}{\sum_k \exp(w_k^\top g(x))} + \log \frac{\exp(w_i^\top g(x))}{\sum_k \exp(w_k^\top g(x))} \right) \\
&= ec(i) + \frac{\gamma}{M-1} \sum_j \left( -w_j^\top g(x) + \log \Big( \sum_k \exp(w_k^\top g(x)) \Big) + w_i^\top g(x) - \log \Big( \sum_k \exp(w_k^\top g(x)) \Big) \right) \\
&= ec(i) + \frac{\gamma}{M-1} \sum_j (w_i - w_j)^\top g(x) \\
&= -w_i^\top g(x) + \log \Big( \sum_k \exp(w_k^\top g(x)) \Big) + \frac{\gamma}{M-1} \sum_j (w_i - w_j)^\top g(x)
\end{aligned}$$

We compute the gradient:

$$\frac{\partial F(x, i)}{\partial w_i} = -g(x) + \frac{\exp(w_i^\top g(x))}{\sum_k \exp(w_k^\top g(x))} g(x) + \frac{\gamma}{M-1} g(x) = \left( -1 + p(i) + \frac{\gamma}{M-1} \right) g(x),$$

where $p(i)$ denotes $\frac{\exp(w_i^\top g(x))}{\sum_k \exp(w_k^\top g(x))}$. When $\gamma$ is small (e.g., $\gamma \lesssim M(1 - p(i))$), the gradient is pulled further back towards 0. Similarly, for $w_j$ and $g(x)$, we have

$$\frac{\partial F(x, i)}{\partial w_j} = \frac{\exp(w_j^\top g(x))}{\sum_k \exp(w_k^\top g(x))} g(x) - \frac{\gamma}{M-1} g(x) = \left( p(j) - \frac{\gamma}{M-1} \right) g(x)$$

$$\frac{\partial F(x, i)}{\partial g(x)} = -w_i + \sum_j \frac{\exp(w_j^\top g(x))}{\sum_k \exp(w_k^\top g(x))} w_j + \frac{\gamma}{M-1} \sum_j (w_i - w_j) = -\left( 1 - \frac{\gamma}{M-1} \right) w_i + \sum_j \left( p(j) - \frac{\gamma}{M-1} \right) w_j,$$

so with a proper choice of $\gamma$ (e.g., $\gamma \lesssim \min_j M p(j)$), the gradients with respect to $w_j$ and $g(x)$ also shrink towards zero.

A.7 ENVIRONMENT LABEL SMOOTHING MEETS NOISY LABELS

In this subsection, we focus on binary classification settings and adopt the symmetric noise model (Kim et al., 2019). Some of our proofs follow (Wei et al., 2022), but different results and analyses are given. The symmetric noise model is widely accepted in the literature on learning with noisy labels and generates the noisy labels by randomly flipping the clean label to the other possible classes. Specifically, given two environments with high-dimensional feature $x$ and environment label $y \in \{0, 1\}$, the noisy labels $\tilde{y}$ are generated by a noise transition matrix $T$, where $T_{ij}$ denotes the probability of flipping the clean label $y = i$ to the noisy label $\tilde{y} = j$, i.e., $T_{ij} = P(\tilde{y} = j \mid y = i)$. Let $e = P(\tilde{y} = 1 \mid y = 0) = P(\tilde{y} = 0 \mid y = 1)$ denote the noise rate; the binary symmetric transition matrix becomes

$$T = \begin{pmatrix} 1 - e & e \\ e & 1 - e \end{pmatrix}$$

Suppose $(x, y)$ are drawn from a joint distribution $D$, but during training only samples with noisy labels $(x, \tilde{y}) \sim \tilde{D}$ are accessible. Denote $f := \hat{h} \circ g$ and let $\ell$ be the cross-entropy loss. Minimizing the smoothed loss with noisy labels can then be converted to

$$\min_f \mathbb{E}_{(x, \tilde{y}) \sim \tilde{D}} [\ell(f(x), \tilde{y}^\gamma)] = \min_f \mathbb{E}_{(x, \tilde{y}) \sim \tilde{D}} [\gamma \ell(f(x), \tilde{y}) + (1-\gamma) \ell(f(x), 1 - \tilde{y})] \tag{39}$$

Let $c_1 = \gamma$, $c_2 = 1 - \gamma$. According to the law of total probability, we have Equ.
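(39) is equal to

$$\begin{aligned}
\min_f \; & \mathbb{E}_{x, y=0} \left[ P(\tilde{y} = 0 \mid y = 0) \big( c_1 \ell(f(x), 0) + c_2 \ell(f(x), 1) \big) + P(\tilde{y} = 1 \mid y = 0) \big( c_1 \ell(f(x), 1) + c_2 \ell(f(x), 0) \big) \right] \\
+ \; & \mathbb{E}_{x, y=1} \left[ P(\tilde{y} = 0 \mid y = 1) \big( c_1 \ell(f(x), 0) + c_2 \ell(f(x), 1) \big) + P(\tilde{y} = 1 \mid y = 1) \big( c_1 \ell(f(x), 1) + c_2 \ell(f(x), 0) \big) \right]
\end{aligned}$$

Recalling that $e = P(\tilde{y} = 1 \mid y = 0) = P(\tilde{y} = 0 \mid y = 1)$, the above is equal to

$$\begin{aligned}
& \min_f \mathbb{E}_{x, y=0} \left[ (1-e) \big( c_1 \ell(f(x), 0) + c_2 \ell(f(x), 1) \big) + e \big( c_1 \ell(f(x), 1) + c_2 \ell(f(x), 0) \big) \right] \\
& \quad + \mathbb{E}_{x, y=1} \left[ e \big( c_1 \ell(f(x), 0) + c_2 \ell(f(x), 1) \big) + (1-e) \big( c_1 \ell(f(x), 1) + c_2 \ell(f(x), 0) \big) \right] \\
=\; & \min_f \mathbb{E}_{x, y=0} \left[ [(1-e) c_1 + e c_2] \ell(f(x), 0) + [(1-e) c_2 + e c_1] \ell(f(x), 1) \right] \\
& \quad + \mathbb{E}_{x, y=1} \left[ [e c_2 + (1-e) c_1] \ell(f(x), 1) + [e c_1 + (1-e) c_2] \ell(f(x), 0) \right] \\
=\; & \min_f \mathbb{E}_{x, y=0} \left[ [(1-e) c_1 + e c_2] \ell(f(x), 0) + [(1-e) c_2 + e c_1] \ell(f(x), 1) \right] \\
& \quad + \mathbb{E}_{x, y=1} \left[ [(1-e) c_1 + e c_2] \ell(f(x), 1) + [(1-e) c_2 + e c_1] \ell(f(x), 0) \right] \\
& \quad + \mathbb{E}_{x, y=1} \left[ [(e - e)(c_2 - c_1)] \ell(f(x), 1) - [(e - e)(c_2 - c_1)] \ell(f(x), 0) \right] \\
=\; & \min_f \mathbb{E}_{(x, y) \sim D} \left[ [(1-e) c_1 + e c_2] \ell(f(x), y) + [(1-e) c_2 + e c_1] \ell(f(x), 1 - y) \right] \\
=\; & \min_f \mathbb{E}_{(x, y) \sim D} \left[ (c_1 + c_2) \ell(f(x), y) \right] + [(1-e) c_2 + e c_1] \, \mathbb{E}_{(x, y) \sim D} \left[ \ell(f(x), 1 - y) - \ell(f(x), y) \right] \\
=\; & \min_f \mathbb{E}_{(x, y) \sim D} \left[ \ell(f(x), y) \right] + (1 - \gamma - e + 2\gamma e) \, \mathbb{E}_{(x, y) \sim D} \left[ \ell(f(x), 1 - y) - \ell(f(x), y) \right]
\end{aligned} \tag{41}$$

Assume $\gamma^*$ is the optimal smoothing parameter, i.e., the one for which the corresponding classifier returns the best performance on the unseen clean data distribution (Wei et al., 2022). Since $\mathbb{E}[\ell(f(x), y^{\gamma^*})] = \mathbb{E}[\ell(f(x), y)] + (1 - \gamma^*) \mathbb{E}[\ell(f(x), 1 - y) - \ell(f(x), y)]$, the above equation can be converted to

$$\min_f \mathbb{E}_{(x, y) \sim D} [\ell(f(x), y^{\gamma^*})] + (\gamma^* - \gamma - e + 2\gamma e) \, \mathbb{E}_{(x, y) \sim D} [\ell(f(x), 1 - y) - \ell(f(x), y)], \tag{42}$$

namely minimizing the smoothed loss with noisy labels is equal to optimizing two terms,

$$\min_f \underbrace{\mathbb{E}_{(x, y) \sim D} [\ell(f(x), y^{\gamma^*})]}_{\text{① Risk under clean label}} + \underbrace{(\gamma^* - \gamma - e + 2\gamma e) \, \mathbb{E}_{(x, y) \sim D} [\ell(f(x), 1 - y) - \ell(f(x), y)]}_{\text{② Reverse optimization}} \tag{43}$$

where ① is the risk under the clean label. The influence of both noisy labels and ELS is reflected in the last term of Equ. (43). Consider the reverse optimization term ②, which acts in the opposite direction of the optimization process we expect.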
Without label smoothing, the weight of ② is $\gamma^* - 1 + e$, and a high noise rate $e$ lets this harmful term contribute more to the optimization. In contrast, by choosing the smoothing parameter $\gamma = \frac{\gamma^* - e}{1 - 2e}$, ② is removed. For example, if the noise rate is zero, the best smoothing parameter is just $\gamma^*$.

A.8 EMPIRICAL GAP ANALYSIS ADOPTED FROM VAPNIK-CHERVONENKIS FRAMEWORK

Theorem 3. (Lemma 1 in (Ben-David et al., 2010).) Given Definition 1 and Definition 2, let $\mathcal{H}$ be a hypothesis class of VC dimension $d$. If the empirical distributions $\hat{D}_S$ and $\hat{D}_T$ each have at least $n$ samples, then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$,

$$d_{\mathcal{H}}(D_S, D_T) \leq \hat{d}_{\mathcal{H}}(\hat{D}_S, \hat{D}_T) + 4 \sqrt{\frac{d \log(2n) + \log \frac{2}{\delta}}{n}} \tag{44}$$

Denote by $\Lambda$ the convex hull, i.e., the set of mixture distributions $\Lambda = \{ \bar{D}_{Mix} : \bar{D}_{Mix} = \sum_{i=1}^M \pi_i D_i, \pi \in \Delta \}$, where $\Delta$ is the standard $(M-1)$-simplex. The convex hull assumption is commonly used in the domain generalization setting (Zhang et al., 2021a; Albuquerque et al., 2019; Zhang et al., 2022a), but none of these works focuses on the empirical gap. Note that $d_{\mathcal{H}}(\bar{D}_{Mix}, D_T)$ is intractable in the domain generalization setting because the unseen target domain $D_T$ is unavailable during training. We thus need to convert $d_{\mathcal{H}}(\bar{D}_{Mix}, D_T)$ to a tractable objective. Let $\bar{D}^*_{Mix} = \sum_{i=1}^M \pi^*_i D_i$, $(\pi^*_1, \dots, \pi^*_M) \in \Delta$, where $\pi^*_1, \dots, \pi^*_M = \arg\min_{\pi_1, \dots, \pi_M} d_{\mathcal{H}}(\bar{D}_{Mix}, D_T)$, so that $\bar{D}^*_{Mix}$ is the element within $\Lambda$ that is closest to the unseen target domain. Then we have

$$\begin{aligned}
d_{\mathcal{H}}(\bar{D}_{Mix}, D_T) &= 2 \sup_{h \in \mathcal{H}} \left| \mathbb{E}_{x \sim \bar{D}_{Mix}}[h(x) = 1] - \mathbb{E}_{x \sim D_T}[h(x) = 1] \right| \\
&= 2 \sup_{h \in \mathcal{H}} \left| \mathbb{E}_{x \sim \bar{D}_{Mix}}[h(x) = 1] - \mathbb{E}_{x \sim \bar{D}^*_{Mix}}[h(x) = 1] + \mathbb{E}_{x \sim \bar{D}^*_{Mix}}[h(x) = 1] - \mathbb{E}_{x \sim D_T}[h(x) = 1] \right| \\
&\leq d_{\mathcal{H}}(\bar{D}^*_{Mix}, D_T) + d_{\mathcal{H}}(\bar{D}_{Mix}, \bar{D}^*_{Mix})
\end{aligned}$$

The explanation follows (Zhang et al., 2021a): the first term corresponds to "To what extent can the convex combination of the source domain approximate the target domain".
The minimization of the first term requires diverse data or strong data augmentation, such that the unseen distribution lies within the convex combination of source domains. We dismiss this term in the following because it includes $D_T$ and cannot be optimized. Following Lemma 1 in (Albuquerque et al., 2019), the second term can be bounded by

$$d_{\mathcal{H}}(\bar{D}_{Mix}, \bar{D}^*_{Mix}) \leq \sum_{i=1}^M \sum_{j=1}^M \pi_i \pi^*_j d_{\mathcal{H}}(D_i, D_j) \leq \max_{i, j \in [M]} d_{\mathcal{H}}(D_i, D_j),$$

namely the second term can be bounded by the combination of pairwise H-divergences between source domains. The cost (Equ. (1)) used for multi-domain adversarial training can be seen as an approximation of such a target. We can now bound the empirical gap with the help of Theorem 3:

$$\sum_{i=1}^M \sum_{j=1}^M \pi_i \pi^*_j d_{\mathcal{H}}(D_i, D_j) \leq \sum_{i=1}^M \sum_{j=1}^M \pi_i \pi^*_j \left[ \hat{d}_{\mathcal{H}}(\hat{D}_i, \hat{D}_j) + 4 \sqrt{\frac{d \log(2 \min(n_i, n_j)) + \log \frac{2}{\delta}}{\min(n_i, n_j)}} \right]$$

$$\sum_{i=1}^M \sum_{j=1}^M \pi_i \pi^*_j d_{\mathcal{H}}(D_i, D_j) - \sum_{i=1}^M \sum_{j=1}^M \pi_i \pi^*_j \hat{d}_{\mathcal{H}}(\hat{D}_i, \hat{D}_j) \leq 4 \sqrt{\frac{d \log(2 n^*) + \log \frac{2}{\delta}}{n^*}}$$

where $n_i$ is the number of samples in $D_i$ and $n^* = \min(n_1, \dots, n_M)$.

A.9 EMPIRICAL GAP ANALYSIS ADOPTED FROM NEURAL NET DISTANCE

Proposition 6. (Adapted from Theorem A.2 in (Arora et al., 2017).) Let $\{D_i\}_{i=1}^M$ be a set of distributions and $\{\hat{D}_i\}_{i=1}^M$ be empirical versions with at least $n^*$ samples each. We assume that the set of discriminators with softmax activation function $\hat{h}(\theta; \cdot) = (\hat{h}_1(\theta_1; \cdot), \dots, \hat{h}_M(\theta_M; \cdot)) \in \hat{\mathcal{H}}: Z \to [0, 1]^M$; $\sum_{i=1}^M \hat{h}_i(\theta_i; \cdot) = 1$ are $L$-Lipschitz with respect to the parameters $\theta$, and use $p$ to denote the number of parameters in $\theta_i$. There is a universal constant $c$ such that when $n^* \geq \frac{c p M \log(Lp/\epsilon)}{\epsilon}$, we have with probability at least $1 - \exp(-p)$ over the randomness of $\{\hat{D}_i\}_{i=1}^M$,

$$\left| d_{\hat{h},g}(D_1, \dots, D_M) - d_{\hat{h},g}(\hat{D}_1, \dots, \hat{D}_M) \right| \leq \epsilon \tag{48}$$

Proof. For simplicity, we ignore the parameter $\theta_i$ when using $\hat{h}_i(\cdot)$.
According to the following triangle inequality, below we focus on the term $| \mathbb{E}_{z \sim g \circ D_1} \log \hat{h}_1(z) - \mathbb{E}_{z \sim g \circ \hat{D}_1} \log \hat{h}_1(z) |$ and other

Theorem 4. (Proposition 4.4.1 in (Bertsekas, 1999).) Let $F: \Omega \to \Omega$ be a continuously differentiable function on an open subset $\Omega$ of $\mathbb{R}^n$ and let $\theta^* \in \Omega$ be such that

1. $F_\eta(\theta^*) = \theta^*$, and
2. the absolute values of the eigenvalues of the Jacobian satisfy $|\lambda_i| < 1$, $\forall \lambda_i \in Sp(F'_\eta(\theta^*))$.

Then there is an open neighborhood $U$ of $\theta^*$ such that for all $\theta_0 \in U$, the iterates $\theta_{k+1} = F_\eta(\theta_k)$ converge locally to $\theta^*$. The rate of convergence is at least linear. More precisely, the error $\| \theta_k - \theta^* \|$ is in $O(|\lambda_{max}|^k)$ for $k \to \infty$, where $\lambda_{max}$ is the eigenvalue of $F'_\eta(\theta^*)$ with the largest absolute value. When $|\lambda_i| > 1$, $F$ will not converge, and when $|\lambda_i| = 1$, $F$ either converges with a sublinear convergence rate or cannot converge.

Finding fixed points of $F_\eta(\theta) = \theta + \eta h(\theta)$ is equivalent to finding solutions to the nonlinear equation $h(\theta) = 0$, and the Jacobian is given by

$$F'_\eta(\theta) = I + \eta h'(\theta),$$

where both $F'_\eta(\theta)$ and $h'(\theta)$ are not symmetric and can therefore have complex eigenvalues. The following theorem shows when a fixed point of $F$ satisfies the conditions of Theorem 4.

Theorem 5. (Lemma 4 in (Mescheder et al., 2017).) Assume $A \triangleq h'(\theta)|_{\theta = \theta^*}$ only has eigenvalues with negative real part and let $\eta > 0$. Then the eigenvalues of the matrix $I + \eta A$ lie in the unit ball if and only if

$$\eta < \frac{2a}{a^2 + b^2} = \frac{1}{|a|} \frac{2}{1 + \left( \frac{b}{a} \right)^2}; \quad \forall \lambda = -a + bi \in Sp(A)$$

Namely, both the maximum value of $a$ and $b/a$ determine the maximum possible learning rate. Although (Acuna et al., 2021) shows that domain adversarial training is in fact a three-player game among classifier, feature encoder, and domain discriminator, it also indicates that the complex eigenvalues with a large imaginary component originate from encoder-discriminator adversarial training. Hence we focus here only on the two-player zero-sum game between the feature encoder and the domain discriminator.
Interestingly, the non-asymptotic convergence analysis yields a result (Theorem 5) that is very similar to the one obtained from the Hurwitz condition (Corollary 1 in (Acuna et al., 2021): $\eta < \frac{-2a}{b^2 - a^2}$; $\forall \lambda = a + bi \in Sp(A)$ and $|a| < |b|$).
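The step-size condition of Theorem 5 is easy to check numerically for a single eigenvalue. The following is a minimal sketch; the function names and the sample values of $a$ and $b$ are assumptions chosen for illustration. It compares the modulus of $1 + \eta\lambda$ with 1 on either side of the threshold $2a/(a^2 + b^2)$.

```python
def max_stable_step(a, b):
    """Threshold on eta from Theorem 5 for an eigenvalue -a + bi of A, with a > 0."""
    return 2.0 * a / (a * a + b * b)

def modulus_after_step(a, b, eta):
    """Modulus of the matching eigenvalue 1 + eta * (-a + bi) of I + eta * A."""
    return abs(1.0 + eta * complex(-a, b))

# Hypothetical eigenvalue with a dominant imaginary part, as in adversarial games.
a, b = 1.0, 3.0
threshold = max_stable_step(a, b)

inside = modulus_after_step(a, b, 0.95 * threshold)
on_boundary = modulus_after_step(a, b, threshold)
outside = modulus_after_step(a, b, 1.05 * threshold)

# A larger imaginary part (larger b / a) forces a smaller learning rate.
threshold_more_rotation = max_stable_step(1.0, 10.0)
```

For these values the modulus is below 1 just under the threshold, equal to 1 at the threshold, and above 1 beyond it, and increasing $b$ from 3 to 10 shrinks the admissible learning rate.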

A.10.2 A SIMPLE ADVERSARIAL TRAINING EXAMPLE

According to Ali Rahimi's test-of-time award speech at NIPS 2017, simple experiments and simple theorems are the building blocks that help us understand more complicated systems. Along this line, we propose this toy example to understand the convergence of domain adversarial training. Denote by $D_S = \delta_{x_s}$, $D_T = \delta_{x_t}$ two Dirac distributions, where both $x_s$ and $x_t$ are real numbers. In this setting, both the encoder and the discriminator have exactly one parameter, $\theta_e$ and $\theta_d$ respectively. The DANN training objective in Equ. (7) is given by

$$d_\theta = f(\theta_d \theta_e x_s) + f(-\theta_d \theta_e x_t), \tag{56}$$

where $f(t) = \log (1 / (1 + \exp(-t)))$, and the unique equilibrium point of the training objective in Equ. (56) is given by $\theta^*_e = \theta^*_d = 0$. We then recall the update operators of simultaneous and alternating gradient descent. For the former, we have

$$F_\eta(\theta) = \begin{pmatrix} \theta_e - \eta \nabla_{\theta_e} d_\theta \\ \theta_d + \eta \nabla_{\theta_d} d_\theta \end{pmatrix}$$

For the latter, we have $F_\eta = F_{\eta,2}(\theta) \circ F_{\eta,1}(\theta)$, where $F_{\eta,1}, F_{\eta,2}$ are defined as

$$F_{\eta,1}(\theta) = \begin{pmatrix} \theta_e - \eta \nabla_{\theta_e} d_\theta \\ \theta_d \end{pmatrix}, \qquad F_{\eta,2}(\theta) = \begin{pmatrix} \theta_e \\ \theta_d + \eta \nabla_{\theta_d} d_\theta \end{pmatrix}$$

If we update the discriminator $n_d$ times after we update the encoder $n_e$ times, then the update operator is $F_\eta = F^{n_d}_{\eta,2}(\theta) \circ F^{n_e}_{\eta,1}(\theta)$. To understand the convergence of simultaneous and alternating gradient descent, we have to understand when the Jacobian of the corresponding update operator has only eigenvalues with absolute value smaller than 1.

For simultaneous gradient descent, the eigenvalues of the Jacobian at the equilibrium are $\lambda_{1/2} = 1 \pm \frac{\eta}{2} |x_s - x_t| i$, namely $F_\eta(\theta)$ never satisfies the second condition of Theorem 4 whatever $\eta$ is, which shows that this continuous system is generally not linearly convergent to the equilibrium point.

Proof.
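The Jacobian of $F_\eta(\theta) = \begin{pmatrix} \theta_e - \eta \nabla_{\theta_e} d_\theta \\ \theta_d + \eta \nabla_{\theta_d} d_\theta \end{pmatrix}$ is

$$\begin{aligned}
\nabla_\theta F_\eta(\theta) &= \nabla_\theta \begin{pmatrix} \theta_e - \eta \left( \theta_d x_s f'(\theta_d \theta_e x_s) - \theta_d x_t f'(\theta_d \theta_e x_t) \right) \\ \theta_d + \eta \left( \theta_e x_s f'(\theta_d \theta_e x_s) - \theta_e x_t f'(\theta_d \theta_e x_t) \right) \end{pmatrix} \\
&= \begin{pmatrix} 1 & -\eta \left( x_s f'(\theta_d \theta_e x_s) - x_t f'(\theta_d \theta_e x_t) \right) \\ \eta \left( x_s f'(\theta_d \theta_e x_s) - x_t f'(\theta_d \theta_e x_t) \right) & 1 \end{pmatrix} \\
&= \begin{pmatrix} 1 & -\frac{\eta}{2} (x_s - x_t) \\ \frac{\eta}{2} (x_s - x_t) & 1 \end{pmatrix},
\end{aligned}$$

where the last step uses $f'(0) = \frac{1}{2}$. Strictly, the derivative $\nabla_{\theta_e} \left[ \theta_e - \eta \left( \theta_d x_s f'(\theta_d \theta_e x_s) - \theta_d x_t f'(\theta_d \theta_e x_t) \right) \right]$ should have been $1 - \eta \left( \theta_d^2 x_s^2 f''(\theta_d \theta_e x_s) - \theta_d^2 x_t^2 f''(\theta_d \theta_e x_t) \right)$. Since the equilibrium point is $(\theta^*_e, \theta^*_d) = (0, 0)$, for points near the equilibrium we ignore high-order infinitesimal terms.

For alternating gradient descent, the eigenvalues of the Jacobian near the equilibrium are

$$\lambda_{1/2} = 1 - \frac{\alpha^2}{2} \pm \sqrt{\left( 1 - \frac{\alpha^2}{2} \right)^2 - 1},$$

where $\alpha = \frac{1}{2} \sqrt{n_d n_e} \, \eta |x_s - x_t|$. We have $|\lambda_{1/2}| = 1$ for $\eta \leq \frac{4}{\sqrt{n_e n_d} |x_s - x_t|}$ and $|\lambda_{1/2}| > 1$ otherwise. This result indicates that although alternating gradient descent does not converge linearly to the Nash equilibrium, it could converge with a sublinear convergence rate.

Proof. The Jacobians of the alternating gradient descent DANN operators (Equ. (58)) near the equilibrium are given by

$$\nabla_\theta F_{\eta,1}(\theta) = \begin{pmatrix} 1 & -\eta \left( x_s f'(\theta_d \theta_e x_s) - x_t f'(\theta_d \theta_e x_t) \right) \\ 0 & 1 \end{pmatrix} = \begin{pmatrix} 1 & -\frac{\eta}{2} (x_s - x_t) \\ 0 & 1 \end{pmatrix},$$

and similarly $\nabla_\theta F_{\eta,2}(\theta) = \begin{pmatrix} 1 & 0 \\ \frac{\eta}{2} (x_s - x_t) & 1 \end{pmatrix}$. As a result, the Jacobian of the combined update operator is

$$\nabla_\theta F_\eta(\theta) = \nabla_\theta F^{n_d}_{\eta,2}(\theta) \nabla_\theta F^{n_e}_{\eta,1}(\theta) = \begin{pmatrix} 1 & -\frac{\eta n_e}{2} (x_s - x_t) \\ \frac{\eta n_d}{2} (x_s - x_t) & -\frac{\eta^2 n_d n_e}{4} (x_s - x_t)^2 + 1 \end{pmatrix}.$$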
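An easy calculation shows that the eigenvalues of this matrix are

$$\lambda_{1/2} = 1 - \frac{n_e n_d}{8} \eta^2 (x_s - x_t)^2 \pm \sqrt{\left( 1 - \frac{n_e n_d}{8} \eta^2 (x_s - x_t)^2 \right)^2 - 1} \tag{68}$$

The Jacobians of the alternating gradient descent DANN+ELS operators near the equilibrium are given by

$$\nabla_\theta F_{\eta,1}(\theta) = \begin{pmatrix} 1 & -\frac{\eta (2\gamma - 1)}{2} (x_s - x_t) \\ 0 & 1 \end{pmatrix}, \qquad \nabla_\theta F_{\eta,2}(\theta) = \begin{pmatrix} 1 & 0 \\ \frac{\eta (2\gamma - 1)}{2} (x_s - x_t) & 1 \end{pmatrix}$$

As a result, the Jacobian of the combined update operator is

$$\nabla_\theta F_\eta(\theta) = \nabla_\theta F^{n_d}_{\eta,2}(\theta) \nabla_\theta F^{n_e}_{\eta,1}(\theta) = \begin{pmatrix} 1 & -\frac{\eta n_e (2\gamma - 1)}{2} (x_s - x_t) \\ \frac{\eta n_d (2\gamma - 1)}{2} (x_s - x_t) & -\frac{\eta^2 n_d n_e (2\gamma - 1)^2}{4} (x_s - x_t)^2 + 1 \end{pmatrix}. \tag{70}$$

An easy calculation shows that the eigenvalues of this matrix are

$$\lambda_{1/2} = 1 - \frac{n_e n_d}{8} \eta^2 (2\gamma - 1)^2 (x_s - x_t)^2 \pm \sqrt{\left( 1 - \frac{n_e n_d}{8} \eta^2 (2\gamma - 1)^2 (x_s - x_t)^2 \right)^2 - 1}$$

Similarly to the proof of Proposition 8, let $\alpha = \frac{2\gamma - 1}{2} \sqrt{n_d n_e} \, \eta |x_s - x_t|$; we get $\lambda_{1/2} = 1 - \frac{\alpha^2}{2} \pm \sqrt{\left( 1 - \frac{\alpha^2}{2} \right)^2 - 1}$. Only when $\alpha \leq 2$ are $\lambda_{1/2}$ on the unit circle, namely η ≤ 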
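The eigenvalue statements for the toy example can be reproduced with a few lines of arithmetic. The sketch below is illustrative: the helper names, the Dirac locations, and the step sizes are assumptions made here. It builds the near-equilibrium Jacobians given above and checks whether the spectral radius stays on the unit circle, using the quantity $\alpha$ as defined in the text.

```python
import cmath
import math

def eigenvalues_2x2(matrix):
    """Eigenvalues of a 2x2 matrix via its trace and determinant."""
    (a, b), (c, d) = matrix
    trace, det = a + d, a * d - b * c
    root = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + root) / 2.0, (trace - root) / 2.0

def spectral_radius(matrix):
    return max(abs(lam) for lam in eigenvalues_2x2(matrix))

def simultaneous_jacobian(eta, x_s, x_t):
    c = eta * (x_s - x_t) / 2.0
    return ((1.0, -c), (c, 1.0))

def alternating_jacobian(eta, x_s, x_t, n_e=1, n_d=1, gamma=1.0):
    """Combined Jacobian; gamma = 1 recovers plain DANN, gamma < 1 applies ELS."""
    c = eta * (2.0 * gamma - 1.0) * (x_s - x_t) / 2.0
    return ((1.0, -n_e * c), (n_d * c, 1.0 - n_e * n_d * c * c))

def alpha(eta, x_s, x_t, n_e=1, n_d=1, gamma=1.0):
    return (2.0 * gamma - 1.0) / 2.0 * math.sqrt(n_d * n_e) * eta * abs(x_s - x_t)

x_s, x_t = 1.0, -1.0  # hypothetical Dirac locations

rho_simultaneous = spectral_radius(simultaneous_jacobian(0.5, x_s, x_t))
rho_alt_small_step = spectral_radius(alternating_jacobian(0.5, x_s, x_t))
rho_alt_large_step = spectral_radius(alternating_jacobian(2.5, x_s, x_t))
rho_alt_large_step_els = spectral_radius(alternating_jacobian(2.5, x_s, x_t, gamma=0.8))
```

With these values, simultaneous updates always give a spectral radius above 1, and alternating updates stay on the unit circle only while $\alpha \leq 2$. The step size 2.5 violates that condition for plain DANN but satisfies it once $\gamma = 0.8$ shrinks $\alpha$ by the factor $2\gamma - 1$.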

B EXTENDED RELATED WORKS

Domain adaptation and domain generalization (Muandet et al., 2013; Sagawa et al., 2019; Li et al., 2018a; Blanchard et al., 2021; Li et al., 2018b; Zhang et al., 2021a; 2022c) aim to learn a model that can extrapolate well in unseen environments. Representative methods like AT method (Ganin et al., 2016) proposed the idea of learning domain-invariant representations as an adversarial game. This approach led to a plethora of methods including state-of-the-art approaches (Zhang et al., 2019; Acuna et al., 2021; 2022). In this paper, we propose a simple but effective trick, ELS, which benefits the generalization performance of these methods by using soft environment labels.

Adversarial Training in GANs is well studied, and many theoretical results on GANs motivate the analysis in this paper, e.g., the divergence minimization interpretation (Goodfellow et al., 2014; Nguyen et al., 2017), generalization of the discriminator (Arora et al., 2017; Thanh-Tung et al., 2019), training stability (Thanh-Tung et al., 2019; Schäfer et al., 2019; Arjovsky & Bottou, 2017; Arjovsky et al., 2017), Nash equilibrium (Farnia & Ozdaglar, 2020; Nagarajan & Kolter, 2017), and gradient descent in GAN optimization (Nagarajan & Kolter, 2017; Gidel et al., 2018; Chen et al., 2018). Multi-domain image generation is also related to this work; generalization to the JSD metric has been explored to address this challenge (Gan et al., 2017; Pu et al., 2018; Trung Le et al., 2019). However, most of them have to build M (M -1) domain spectrum into several separate domains and performs adaptation between multiple source and target domains.

For Rotating MNIST, the seven target domains contain images rotating by d ∈ {[0°, 45°), [45°, 90°), [90°, 135°), . . . , [315°, 360°)} degrees, respectively.

Circle Dataset (Wang et al., 2020) includes 30 domains indexed from 1 to 30, and Figure 5 (a) shows the 30 domains in different colors (from right to left is 1, . . . , 30 respectively).
Each domain contains data on a circle and the task is binary classification. Figure 5 (b) shows positive samples as red dots and negative samples as blue crosses. In our experiments, we use domains 1 to 6 as source domains and the rest as target domains.

C.1.2 IMAGE RETRIEVAL DATASETS

Experimental settings. Following previous generalizable person ReID methods, we use MobileNetV2 (Sandler et al., 2018) with a multiplier of 1.4 as the backbone network, which is pretrained on ImageNet (Deng et al., 2009). Images are resized to 256 × 128 and the training batch size N is set to 80. The SGD optimizer is used to train all the components with a learning rate of 0.01, a momentum of 0.9, and a weight decay of 5 × 10⁻⁴. The learning rate is warmed up in the first 10 epochs and decayed to 0.1× and 0.01× of its value at 40 and 70 epochs.

We evaluate the proposed method on person re-identification (ReID) tasks, which aim to find the correspondences between person images of the same identity across multiple camera views. The training datasets include CUHK02 (Li & Wang, 2013), CUHK03 (Li et al., 2014), Market1501 (Zheng et al., 2015), DukeMTMC-ReID (Zheng et al., 2017), and CUHK-SYSU PersonSearch (Xiao et al., 2016). The unseen test domains are VIPeR (Gray et al., 2007), PRID (Hirzer et al., 2011), QMUL GRID (Liu et al., 2012), and i-LIDS (Wei-Shi et al., 2009). Details of the training datasets are summarized in Table 10 and the test datasets are summarized in Table 11. All the assets (i.e., datasets and the codes for baselines) we use come with an MIT license, which requires that the copyright notice and the permission notice be included in all copies or substantial portions of the software.

GRID (Liu et al., 2012) contains 250 probe images and 250 true match images of the probes in the gallery. Besides, there are a total of 775 additional images that do not belong to any of the probes. We randomly take out 125 probe images. The remaining 125 probe images and 1025 (775 + 250) images in the gallery are used for testing.

i-LIDS (Wei-Shi et al., 2009) has two versions, images and sequences. The former is used in our experiments. It involves 300 different pedestrian pairs observed across two disjoint camera views 1 and 2 in public open space.
We randomly select 60 pedestrian pairs; two images per pair are randomly selected as the probe image and the gallery image, respectively.

PRID2011 (Hirzer et al., 2011) has single-shot and multi-shot versions. We use the former in our experiments. The single-shot version has two camera views A and B, which capture 385 and 749 pedestrians, respectively. Only 200 pedestrians appear in both views. During evaluation, 100 identities appearing in both views are randomly selected; the remaining 100 identities in view A constitute the probe set and the remaining 649 identities in view B constitute the gallery set.

VIPeR (Gray et al., 2007) contains 632 pedestrian image pairs. Each pair contains two images of the same individual seen from different camera views 1 and 2. Each image pair was taken from an arbitrary viewpoint under varying illumination conditions. To compare with other methods, we randomly select half of these identities from camera view 1 as probe images and their matched images in view 2 as gallery images.

We follow the single-shot setting. The average rank-k (R-k) accuracy and mean Average Precision (mAP) over 10 random splits are reported based on the evaluation protocol

C.1.3 NEURAL LANGUAGE DATASETS

CivilComments-Wilds (Koh et al., 2021) contains 448,000 comments on online articles taken from the Civil Comments platform. The input is a text comment and the task is to predict whether the comment was rated as toxic. For example, the comment "Maybe you should learn to write a coherent sentence so we can understand WTF your point is" is rated as toxic, whereas "I applaud your father. He was a good man! We need more like him." is not. The domain in the CivilComments-Wilds dataset is an 8-dimensional binary vector where each component corresponds to whether the comment mentions one of the 8 demographic identities {male, female, LGBTQ, Christian, Muslim, other religions, Black, White}.

Amazon-Wilds (Koh et al., 2021) contains 539,520 reviews from disjoint sets of users.
The input is the review text and the task is to predict the corresponding 1-to-5 star rating from reviews of Amazon products. Domain d identifies the user who wrote the review, and the training set has 3,920 domains. The 10th percentile of per-user accuracies is used for evaluation, which is a standard way to measure model performance on devices and users at various percentiles in an effort to encourage good performance across many devices.

MNIST ConvNet is detailed in Table 12.

DistilBERT. We use the implementation from (Wolf et al., 2019) and finetune a BERT-base-uncased model for neural language datasets.

EncoderSTN uses a four-layer convolutional neural network for the encoder and a three-layer MLP to make the prediction. The domain discriminator is a four-layer MLP. The encoder is incorporated with a Spatial Transformer Network (STN) (Jaderberg et al., 2015), which takes the image and the domain index as input and outputs a set of rotation parameters that are then applied to rotate the given image.

Graph Isomorphism Networks (GIN) (Xu et al., 2018) combined with virtual nodes are used for the OGB-MolPCBA dataset, as this is currently the model with the highest performance in the Open Graph Benchmark.

Deep ConvNets (Schirrmeister et al., 2017) for HHAR combine temporal and spatial convolution, which fits this data well; we use the implementation in the BrainDecode toolbox (Schirrmeister et al., 2017).
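The 10th-percentile per-user accuracy used for Amazon-Wilds can be sketched as follows. This is a minimal illustration, not the benchmark's reference implementation: the function names are hypothetical, and the percentile uses linear interpolation between order statistics, which is an assumed convention.

```python
from collections import defaultdict


def per_user_accuracies(y_true, y_pred, user_ids):
    """Return {user_id: accuracy}, computed over that user's reviews only."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for label, pred, user in zip(y_true, y_pred, user_ids):
        total[user] += 1
        correct[user] += int(label == pred)
    return {user: correct[user] / total[user] for user in total}


def percentile(values, q):
    """q-th percentile (0 <= q <= 100) with linear interpolation
    between order statistics (assumed convention)."""
    xs = sorted(values)
    if not xs:
        raise ValueError("percentile of an empty collection is undefined")
    pos = (len(xs) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(xs) - 1)
    return xs[lower] + (xs[upper] - xs[lower]) * (pos - lower)


def tenth_percentile_accuracy(y_true, y_pred, user_ids):
    """Accuracy below which roughly 10% of users (domains) fall."""
    accs = per_user_accuracies(y_true, y_pred, user_ids)
    return percentile(accs.values(), 10)
```

Because the statistic is taken over users rather than over reviews, a model that performs well on average but poorly on a small group of users is penalized.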

D.1 ADDITIONAL NUMERICAL RESULTS

Multi-Source Domain Generalization. IRM (Arjovsky et al., 2019) introduces specific conditions for an upper bound on the number of training environments required such that an invariant optimal model can be obtained, which stresses the importance of the number of training environments. In this paper, we reduce the training environments on Rotated MNIST from five to three. As shown in Table 17, as the number of training environments decreases, the performance of IRM falls sharply (e.g., the averaged accuracy drops from 97.5% to 91.8%), and the performance on the most challenging domains d = {0, 5} declines the most (94.9% → 80.9% and 95.2% → 91.1%). In contrast, both ERM and DANN+ELS retain high generalization performance, and DANN+ELS outperforms ERM in most domains.

Image Retrieval. We compare the proposed DANN+ELS with other methods on a typical DG-ReID setting. As shown in Table 16, we implement DANN with various hyper-parameters, but DANN always fails to converge on ReID benchmarks. As illustrated in Appendix Figure 8, we compare the training statistics with the baseline, where DANN is highly unstable and attains inferior results. However, equipped with ELS and following the same hyper-parameters as DANN, DANN+ELS attains good training stability and achieves either comparable or better performance when compared with recent state-of-the-art DG-ReID methods. See Appendix D.2 for the t-SNE visualization and comparison.

Table 14: Domain generalization performance on the OGB-MolPCBA dataset.

| Algorithm | Val Avg Acc | Test Avg Acc |
|---|---|---|
| ERM (Vapnik, 1999) | 27.8 ± 0.1 | 27.2 ± 0.3 |
| Group DRO (Sagawa et al., 2019) | 23.1 ± 0.6 | 22.4 ± 0.6 |
| CORAL (Sun & Saenko, 2016) | 18.4 ± 0.2 | 17.9 ± 0.5 |
| IRM (Arjovsky et al., 2019) | 15.8 ± 0.2 | 15.6 ± 0.3 |
| DANN (Ganin et al., 2016) | 15.0 ± 0.6 | 14.1 ± 0.5 |
| DANN+ELS | 18.0 ± 0.3 | 17.2 ± 0.3 |
| ↑ | 3.0 | 3.1 |

| Algorithm | | |
|---|---|---|
| ERM (Vapnik, 1999) | 9.7 ± 0.3 | 9.3 ± 0.1 |
| IRM (Arjovsky et al., 2019) | 9.3 ± 0.1 | 57.6 ± 0.8 |
| SD (Pezeshki et al., 2021) | 10.2 ± 0.1 | 9.2 ± 0.0 |
| VREx (Krueger et al., 2021) | 9.7 ± 0.2 | 65.3 ± 4.8 |
| DANN (Ganin et al., 2016) | 9.7 ± 0.1 | 11.1 ± 1.5 |
| DANN+ELS | 10.7 ± 0.6 | 15.6 ± 2.8 |
| ↑ | 1.0 | 4.5 |

D.2 ADDITIONAL ANALYSIS AND INTERPRETATION

t-SNE visualization. We compare the proposed DANN+ELS with MetaBIN and ERM through t-SNE visualization. We observe a distinct division of different domains in Figure 7 (a) and Figure 7 (d), which indicates that ERM learns a domain-specific feature space. MetaBIN performs better than ERM, and the proposed DANN+ELS learns more domain-invariant representations while keeping the discriminative capability for ReID tasks.



The constraint on $\|\nabla_z \hat{h}(z)\|_2$ is because the optimal discriminator has zero gradients almost everywhere, and $|\hat{h}(z) - \hat{h}^*(z)|$ is a constraint on the prediction accuracy.

There might be some confusion here because in Section A.4 we use θ as the parameters of encoder h. The usage is just for simplicity and does not mean that h, g have the same parameters.

One may argue that neural networks are non-linear, but Theorem 4.5 from (Khalil, 1996) shows that one can "linearize" any non-linear system near equilibrium and analyze the stability of the linearized system to comment on the local stability of the original system.

pairwise critics, which is expensive when M is large. $\chi^2$ GAN (Tao et al., 2018) first attempts to tackle the challenge and only needs $M - 1$ critics.



Figure 1: A motivating example of ELS with 3 domains on the VLCS dataset.

$\frac{1}{d} I)$, and $\hat{D}_\mu, \hat{D}_\nu$ be empirical versions of $D_\mu, D_\nu$ with $n$ examples. Then we have $|d_{JS}(D_\mu, D_\nu) - d_{JS}(\hat{D}_\mu, \hat{D}_\nu)| = \log 2$ with high probability. Namely, the empirical divergence cannot reflect the true distribution divergence.
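The $\log 2$ gap can be checked numerically. The sketch below assumes, as the conclusion of the statement suggests, that both populations follow the same continuous distribution (so the true JS divergence is 0) while two finite sample sets share no points. A one-dimensional standard normal is used here purely as a stand-in; the function names are illustrative.

```python
import math
import random


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as aligned lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence (natural log), bounded above by log 2."""
    m = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def empirical_js(samples_a, samples_b):
    """JS divergence between the two empirical (point-mass) distributions."""
    support = sorted(set(samples_a) | set(samples_b))
    p = [samples_a.count(x) / len(samples_a) for x in support]
    q = [samples_b.count(x) / len(samples_b) for x in support]
    return js_divergence(p, q)


rng = random.Random(0)
n = 50
# Both sample sets come from the SAME distribution, so the true divergence is 0.
a = [rng.gauss(0.0, 1.0) for _ in range(n)]
b = [rng.gauss(0.0, 1.0) for _ in range(n)]
js_hat = empirical_js(a, b)  # disjoint supports push this to log 2
```

Since continuous samples almost surely never coincide, the two empirical distributions have disjoint supports and their JS divergence saturates at its maximum regardless of how close the underlying distributions are.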

Figure 3: (a) Generalization performance of DANN+ELS compared to DANN with partially correct environment labels on the PACS dataset (P as target domain). (b) The best γ for each dataset. Civil is the CivilComments dataset and OGB is the OGB-MolPCBA dataset. (c) Average generalization accuracy on the PACS dataset with different smoothing policies.

Figure 4: Training statistics on the PACS dataset. Alternating GD with $n_d = 5$, $n_e = 1$ is used. All other parameter settings are the same; only the default hyperparameters are used, without fine-grained hyperparameter search.


hyper-parameter $\gamma$. Then minimizing domain divergence by adversarial training with one-sided environment label smoothing is equivalent to minimizing $2D_{JS}(D_{S'} \| D_{T'}) - 2\log 2$, where $D_{JS}$ is the Jensen-Shannon (JS) divergence.

${}_{i=1} D_i$, and $\{D_{i'} = \gamma D_i + \frac{1-\gamma}{M-1}\sum_{j=1; j\neq i}^{M} D_j\}_{i=1}^{M}$ with hyper-parameter $\gamma \in [0.5, 1]$. Then minimizing domain divergence by adversarial training without and with environment label smoothing is equivalent to minimizing $\sum_{i=1}^{M} D_{KL}(D_i \| D_{Mix})$ and $\sum_{i=1}^{M} D_{KL}(D_{i'} \| D_{Mix})$, respectively, where $D_{KL}$ is the Kullback-Leibler (KL) divergence.
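The mixture $D_{i'}$ above corresponds to training the discriminator against soft environment labels. A minimal sketch of such a loss follows, assuming the true domain receives weight $\gamma$ and the remaining $1-\gamma$ is spread uniformly over the other $M-1$ domains; the function names are illustrative and not taken from the released code.

```python
import math


def smoothed_env_label(domain, num_domains, gamma):
    """Soft target: gamma on the true domain, (1 - gamma)/(M - 1) elsewhere."""
    off_value = (1.0 - gamma) / (num_domains - 1)
    return [gamma if j == domain else off_value for j in range(num_domains)]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of discriminator logits."""
    m = max(logits)
    log_norm = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_norm for x in logits]


def els_discriminator_loss(logits, domain, gamma):
    """Cross-entropy between the smoothed label and the discriminator output.
    With gamma = 1 this reduces to the usual hard-label cross-entropy."""
    target = smoothed_env_label(domain, len(logits), gamma)
    log_probs = log_softmax(logits)
    return -sum(t * lp for t, lp in zip(target, log_probs))
```

For example, with three domains and very confident logits `[10.0, 0.0, 0.0]` for domain 0, the hard-label loss is close to zero, whereas the smoothed loss with `gamma = 0.8` is large, so the discriminator is discouraged from producing over-confident outputs.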

A.10.3 SIMULTANEOUS GRADIENT DESCENT DANN

Proposition 7. The unique equilibrium point of the training objective in Equ. (56) is given by $\theta_e^* = \theta_d^* = 0$. Moreover, the Jacobian of $F_\eta(\theta) = \begin{pmatrix} \theta_e - \eta\nabla_{\theta_e} d_\theta \\ \theta_d + \eta\nabla_{\theta_d} d_\theta \end{pmatrix}$ at the equilibrium point has the two eigenvalues

, $\theta_e^2, \theta_d^2, \theta_e\theta_d$. We can thus obtain the derivation of the second line. From the eigenvalues of the second-order matrix $A$, the eigenvalues of $\nabla_\theta F_\eta(\theta)$ are $1 \pm \frac{\eta}{2}|x_s - x_t|\,i$. Obviously $|\lambda| > 1$, which completes the proof of the proposition.

A.10.4 ALTERNATING GRADIENT DESCENT DANN

Proposition 8. The unique equilibrium point of the training objective in Equ. (56) is given by $\theta_e^* = \theta_d^* = 0$. Moreover, if we update the discriminator $n_d$ times after we update the encoder $n_e$ times, the Jacobian of $F_\eta = F_{\eta,2}(\theta) \circ F_{\eta,1}(\theta)$ (Equ. (58)) has eigenvalues

result in Proposition 8, which is $\eta \leq \frac{4}{\sqrt{n_d n_e}\,|x_s - x_t|}$, the additional factor $\frac{1}{2\gamma - 1} > 1$ enables us to choose a larger learning rate and to converge to a small-error solution in fewer iterations.
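The two statements can be put into numbers with a short sketch. It assumes the simultaneous-update eigenvalues are $1 \pm \frac{\eta}{2}|x_s - x_t|\,i$ and that ELS multiplies the Proposition 8 learning-rate bound by $\frac{1}{2\gamma - 1}$, which is how the text above reads; the helper names and the sample values are hypothetical.

```python
import math


def simultaneous_gd_eigen_modulus(eta, x_s, x_t):
    """|1 + i * (eta / 2) * |x_s - x_t||, which exceeds 1 for any eta > 0."""
    return abs(complex(1.0, 0.5 * eta * abs(x_s - x_t)))


def alternating_lr_bound(n_d, n_e, x_s, x_t):
    """Assumed Proposition 8 bound: eta <= 4 / (sqrt(n_d * n_e) * |x_s - x_t|)."""
    return 4.0 / (math.sqrt(n_d * n_e) * abs(x_s - x_t))


def alternating_lr_bound_with_els(n_d, n_e, x_s, x_t, gamma):
    """Same bound scaled by 1 / (2 * gamma - 1), assumed valid for 0.5 < gamma <= 1."""
    if not 0.5 < gamma <= 1.0:
        raise ValueError("gamma must lie in (0.5, 1]")
    return alternating_lr_bound(n_d, n_e, x_s, x_t) / (2.0 * gamma - 1.0)


# Illustrative values: n_d = 5, n_e = 1, x_s = 1, x_t = -1.
plain = alternating_lr_bound(5, 1, 1.0, -1.0)
smoothed = alternating_lr_bound_with_els(5, 1, 1.0, -1.0, gamma=0.75)
```

With $\gamma = 0.75$ the admissible learning rate doubles, while simultaneous updates remain unstable at every positive learning rate because the eigenvalue modulus always exceeds 1.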

C.1.4 GENOMICS AND GRAPH DATASETS

RxRx1-wilds (Koh et al., 2021) comprises 125,510 images of cells that have been genetically perturbed by siRNA, obtained by fluorescent microscopy. The output y indicates which of the 1,139 genetic treatments (including no treatment) the cells received, and d specifies 51 batches in which the imaging experiment was run.

OGB-MolPCBA (Koh et al., 2021) is a multi-label classification dataset, which comprises 437,929 molecules with 120,084 different structural scaffolds. The input is a molecular graph, the label is a 128-dimensional binary vector where each component corresponds to a biochemical assay result, and the domain d specifies the scaffold (i.e., a cluster of molecules with similar structure). The training and test sets contain molecules with disjoint scaffolds; the training set has molecules from over 40,000 scaffolds. We evaluate models by averaging the Average Precision (AP) across the 128 assays.

C.1.5 SEQUENTIAL DATA

Spurious-Fourier (Gagnon-Audet et al., 2022) is a binary classification dataset (y ∈ {low-frequency peak (L), high-frequency peak (H)}), which is composed of one-dimensional signals. Domains d ∈ {10%, 80%, 90%} contain signal-label pairs, where the label is a noisy function of the low and high frequencies such that low-frequency peaks bear a varying correlation of d with the label and high-frequency peaks bear an invariant correlation of 75% with the label.

HHAR (Gagnon-Audet et al., 2022) is a 6-activity classification dataset (y ∈ {Stand, Sit, Walk, Bike, Stairs Up, Stairs Down}), which is composed of recordings of 3-axis accelerometer and 3-axis gyroscope data. Specifically, the input x is a recording of 500 time-steps of a 6-dimensional signal sampled at 100 Hz. Domain d consists of five smart device models: d ∈ {Nexus 4, Galaxy S3, Galaxy S3 Mini, LG Watch, Samsung Galaxy Gears}.
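The OGB-MolPCBA metric described above (AP averaged over assays) can be sketched in plain Python. This is an illustration under stated assumptions rather than the benchmark's evaluator: labels equal to `None` are treated as missing, assays without any positive label are skipped, and ties in scores are broken by input order.

```python
def average_precision(labels, scores):
    """AP for one assay: mean of precision@rank taken at each positive item.
    Returns None when the assay has no positive label."""
    pairs = [(s, y) for s, y in zip(scores, labels) if y is not None]
    pairs.sort(key=lambda pair: -pair[0])  # highest score first (stable sort)
    hits = 0
    precision_sum = 0.0
    for rank, (_, y) in enumerate(pairs, start=1):
        if y == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else None


def mean_average_precision(label_matrix, score_matrix):
    """Average AP over assays; rows are molecules, columns are assays."""
    num_assays = len(label_matrix[0])
    aps = []
    for k in range(num_assays):
        ap = average_precision([row[k] for row in label_matrix],
                               [row[k] for row in score_matrix])
        if ap is not None:
            aps.append(ap)
    if not aps:
        raise ValueError("no assay with at least one positive label")
    return sum(aps) / len(aps)
```

For the real dataset the matrices would have 128 columns, one per assay.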
C.2 BACKBONE STRUCTURES

Most of the backbones are ResNet-50/ResNet-18 and we follow the same setting as the reference works. Here we briefly introduce some special backbones used in our experiments, i.e., ConvNet for Rotated MNIST, EncoderSTN for Rotating MNIST, DistilBERT for Neural Language datasets, and GIN for OGB-MolPCBA.

Figure 5: Results on the Circle dataset with 30 domains. (a) shows domain index by color, (b) shows label index by color, where red dots and blue crosses are positive and negative data sample. Source domains contain the first 6 domains and others are target domains.

denotes Graph Isomorphism Networks, and CRNN (Gagnon-Audet et al., 2022) denotes convolutional recurrent neural networks.

The domain adaptation accuracies (%) on Office-31. ↑ denotes the improvement of a method with ELS compared to that w/o ELS.

The domain generalization accuracies (%) on VLCS, and PACS. ↑ denotes improvement of DANN+ELS compared to DANN.

Rotating MNIST accuracy (%) at the source domain and each target domain. $X^\circ$ denotes the domain whose images are rotated by $[X^\circ, X^\circ + 45^\circ]$.

Domain generalization performance on neural language datasets. The backbone is DistilBERT-base-uncased and all results are reported over 3 random seeds.

Notations.

Distributions for source domain, target domain, and domain i.

$\hat{D}_S, \hat{D}_T, \hat{D}_i$: Empirical distributions for source domain, target domain, and domain i.

$p_s, p_t, p_i$: Density functions for source domain, target domain, and domain i.

$x_s, x_t, x_i$

For simplicity, we use $p_s, p_t$ to denote $p^z_s(z), p^z_t(z)$, respectively, and omit the $\int_Z$. Plugging Equ. (10) into Equ. (8), we get

$\frac{\gamma(p_t - p_s)}{p_s + p_t} + p_t \log \frac{p_t + \gamma(p_s - p_t)}{p_s + p_t}$

$p_{s'} = p_s + (1-\gamma)p_t$, $p_{t'} = p_t + (1-\gamma)p_s$ two distribution densities that are the convex combinations of $p_s, p_t$, we have

Domain generalization performance on the Spurious-Fourier dataset.

Generalization performance on sequential benchmarks. ↑ denotes improvement of DANN+ELS compared to DANN.

Acknowledgement

This work was partially funded by the National Natural Science Foundation of China (Grant No. 62276256, 62076078), the Beijing Nova Program under Grant Z211100002121108, and the National Natural Science Foundation of China (62236010, 61721004, and U1803261) 

Appendix

$| d_{\hat{h},g}(D_1, \ldots, D_M) - d_{\hat{h},g}(\hat{D}_1, \ldots,$

Therefore, when $n^* \geq \frac{cpM\log(Lp/\epsilon)}{\epsilon}$ for a large enough constant $c$, we can union bound over all $\theta_1 \in \Phi$. With probability at least $1 - \exp(-p)$, for all $\theta_1 \in \Phi$, we have

Namely we have

The result verifies that, for multi-domain adversarial training, the expectation over the empirical distribution converges to the expectation over the true distribution for all discriminators given enough data samples.

A.10 CONVERGENCE THEORY

In this subsection, we first provide some preliminaries for the convergence analysis of domain adversarial training. We then show that simultaneous gradient descent DANN is not stable near the equilibrium, whereas alternating gradient descent DANN can converge with a sublinear convergence rate, which supports the importance of training the encoder and the discriminator separately. Finally, when combined with environment label smoothing, alternating gradient descent DANN is shown to attain a faster convergence speed.

A.10.1 PRELIMINARIES

The asymptotic convergence analysis is defined as applying the "ordinary differential equation (ODE) method" to analyze the convergence properties of dynamic systems. Consider a discrete-time system characterized by the gradient descent update

$$\theta_{t+1} = \theta_t + \eta h(\theta_t),$$

where $h(\cdot): \mathbb{R} \to \mathbb{R}$ is the gradient and $\eta$ is the learning rate. The key tool for asymptotic convergence analysis is the Hurwitz condition (Khalil, 1996): if the Jacobian of the dynamic system $A \triangleq h'(\theta)|_{\theta=\theta^*}$ at a stationary point $\theta^*$ is Hurwitz, namely the real part of every eigenvalue of $A$ is negative, then the continuous gradient dynamics are asymptotically stable.

Given the same discrete-time system and Jacobian $A$, to ensure non-asymptotic convergence, we need to provide an appropriate range of $\eta$ by solving $|1 + \eta\lambda_i(A)| < 1, \forall \lambda_i \in Sp(A)$, where $Sp(A)$ is the spectrum of $A$. Namely, we obtain a constraint on the learning rate, which allows us to evaluate the minimum number of iterations for an $\epsilon$-error solution and reveals the convergence performance of the dynamic system more precisely than the asymptotic analysis (Nie & Patel, 2020).
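The learning-rate range implied by the non-asymptotic condition can be computed directly. The sketch below assumes the update $\theta_{t+1} = \theta_t + \eta h(\theta_t)$ and the stability condition $|1 + \eta\lambda_i(A)| < 1$; for $\lambda = a + bi$ with $a < 0$ this is equivalent to $0 < \eta < -2a/|\lambda|^2$. Function names are illustrative.

```python
def max_stable_learning_rate(eigenvalues):
    """Supremum of eta with |1 + eta * lam| < 1 for every eigenvalue lam.
    Returns 0.0 when some eigenvalue has a non-negative real part,
    i.e. when no positive learning rate satisfies the condition."""
    bound = float("inf")
    for lam in eigenvalues:
        lam = complex(lam)
        if lam.real >= 0.0:
            return 0.0
        bound = min(bound, -2.0 * lam.real / abs(lam) ** 2)
    return bound


def is_stable(eta, eigenvalues):
    """Check the discrete-time condition |1 + eta * lam| < 1 directly."""
    return all(abs(1.0 + eta * complex(lam)) < 1.0 for lam in eigenvalues)
```

For eigenvalues `[-1, -4]` the bound is `0.5`, set by the eigenvalue with the largest magnitude. A purely imaginary eigenvalue such as `1j` yields `0.0`, meaning no positive learning rate satisfies the condition.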

C ADDITIONAL EXPERIMENTAL SETUPS

C.1 DATASET DETAILS AND EXPERIMENTAL SETTINGS

In this subsection, we introduce all the used datasets and the hyper-parameters for reproducing the experimental results in this work. We have uploaded the codes for all experiments in the supplementary materials to make sure that all the results are reproducible. All the main hyperparameters for reproducing the experimental results in this work are shown in Table 9 .

C.1.1 IMAGE CLASSIFICATION DATASETS

Experimental settings. For DG and multi-source DG tasks, all the baselines are implemented using the codebase of DomainBed (Gulrajani & Lopez-Paz, 2021), and we use ConvNet as the encoder for Rotated MNIST (detailed in Appendix D.1 of (Gulrajani & Lopez-Paz, 2021)) and ResNet-50 for the remaining datasets. The model selection that we use is test-domain validation, one of the three selection methods in (Gulrajani & Lopez-Paz, 2021). That is, we choose the model maximizing the accuracy on a validation set that follows the same distribution as the test domain. For DA tasks, all baseline implementations and hyper-parameters follow (Wang & Hou). For Continuously Indexed Domain Adaptation tasks, all baselines are implemented using PyTorch with the same architecture as (Wang et al., 2020). Note that although our theoretical analysis on non-asymptotic convergence is based on alternating Gradient Descent, current DA methods mainly build on the Gradient Reversal Layer (GRL). For a fair comparison, in our experiments on domain adaptation benchmarks, we also use GRL by default and leave the analysis to future work.

Rotated MNIST (Ghifary et al., 2015) consists of 70,000 digits in MNIST with different rotation angles, where the domain is determined by the degrees d ∈ {0, 15, 30, 45, 60, 75}.

PACS (Li et al., 2017b) includes 9,991 images with 7 classes y ∈ {dog, elephant, giraffe, guitar, horse, house, person} from 4 domains d ∈ {art, cartoons, photos, sketches}.

VLCS (Torralba & Efros, 2011) is composed of 10,729 images with 5 classes y ∈ {bird, car, chair, dog, person} from domains d ∈ {Caltech101, LabelMe, SUN09, VOC2007}.

Office-31 (Saenko et al., 2010) contains 4,110 images of 31 object categories in three domains: d ∈ {Amazon, DSLR, Webcam}.

Office-Home (Venkateswara et al., 2017) consists of 15,500 images from 65 classes and 4 domains: d ∈ {Art (Ar), Clipart (Cl), Product (Pr), Real World (Rw)}.

Rotating MNIST (Wang et al., 2020) is adapted from regular MNIST, ranging from digits with mild rotation to significantly rotated digits. In our experiments, $[0^\circ, 45^\circ)$ is set as the source domain and the others are unlabeled target domains. The chosen baselines include Adversarial Discriminative Domain Adaptation (ADDA) (Tzeng et al., 2017) and CIDA (Wang et al., 2020). ADDA merges data with different domain indices into one source and one target domain. DANN divides the continuous

| Algorithm | 0 | 15 | 30 | 45 | 60 | 75 | Avg |
|---|---|---|---|---|---|---|---|
| ERM (Vapnik, 1999) | 95.3 ± 0.2 | 98.7 ± 0.1 | 98.9 ± 0.1 | 98.7 ± 0.2 | 98.9 ± 0.0 | 96.2 ± 0.2 | 97.8 |
| IRM (Arjovsky et al., 2019) | 94.9 ± 0.6 | 98.7 ± 0.2 | 98.6 ± 0.1 | 98.6 ± 0.2 | 98.7 ± 0.1 | 95.2 ± 0.3 | 97.5 |
| DANN (Ganin et al., 2016) | 95.9 ± 0.1 | 98.6 ± 0.1 | 98.7 ± 0.2 | 99.0 ± 0.1 | 98.7 ± 0.0 | 96.5 ± 0.3 | 97.9 |
| ARM (Zhang et al., 2021b) | 95.9 ± 0.4 | 99.0 ± 0.1 | 98.8 ± 0.1 | 98.9 ± 0.1 | 99. | | |

| Algorithm | 0 | 15 | 30 | 45 | 60 | 75 | Avg |
|---|---|---|---|---|---|---|---|
| ERM (Vapnik, 1999) | 96.0 ± 0.3 | 98.8 ± 0.4 | 98.7 ± 0.1 | 98.8 ± 0.3 | 99.1 ± 0.1 | 96.7 ± 0.3 | 98.0 |
| IRM (Arjovsky et al., 2019) | 80.9 ± 3.2 | 94.7 ± 0.9 | 94.3 ± 1.3 | 94.3 ± 0.8 | 95.5 ± 0.5 | 91.1 ± 3.1 | 91.8 |
| DANN (Ganin et al., 2016) | 96 | | | | | | |

