FREE LUNCH FOR DOMAIN ADVERSARIAL TRAINING: ENVIRONMENT LABEL SMOOTHING

Abstract

A fundamental challenge for machine learning models is how to generalize to out-of-distribution (OOD) data. Among various approaches, exploiting invariant features via Domain Adversarial Training (DAT) has received widespread attention. Despite its success, we observe training instability in DAT, mostly due to an over-confident domain discriminator and environment label noise. To address this issue, we propose Environment Label Smoothing (ELS), which encourages the discriminator to output soft probabilities, thus reducing the confidence of the discriminator and alleviating the impact of noisy environment labels. We demonstrate, both experimentally and theoretically, that ELS improves training stability, local convergence, and robustness to noisy environment labels. By incorporating ELS into DAT methods, we obtain state-of-the-art results on a wide range of domain generalization/adaptation tasks, particularly when the environment labels are highly noisy.

1. INTRODUCTION

Despite being empirically effective on visual recognition benchmarks (Russakovsky et al., 2015), modern neural networks are prone to learning shortcuts that stem from spurious correlations (Geirhos et al., 2020), resulting in poor generalization on out-of-distribution (OOD) data. A popular line of methods, minimizing domain divergence via Domain Adversarial Training (DAT) (Ganin et al., 2016), has shown better domain transfer performance, suggesting that it is a promising candidate for extracting domain-invariant features. Despite its power for domain adaptation and domain generalization, DAT is known to be difficult to train and to converge (Roth et al., 2017; Jenni & Favaro, 2019; Arjovsky & Bottou, 2017; Sønderby et al., 2016).

2. METHODOLOGY

For domain generalization tasks, there are M source domains {D_i}_{i=1}^M. Let the hypothesis h be the composition h = ĥ ∘ g, where g ∈ G maps data samples to a representation space Z, and ĥ = (ĥ_1(⋅), …, ĥ_M(⋅)) ∈ Ĥ : Z → [0, 1]^M with ∑_{i=1}^M ĥ_i(⋅) = 1 is the domain discriminator with a softmax activation function. The classifier is defined as ĥ′ ∈ Ĥ′ : Z → [0, 1]^C with ∑_{i=1}^C ĥ′_i(⋅) = 1, where C is the number of classes. The overall cost combines a task loss with the discriminator objective in Eq. (1) below, where ℓ is the cross-entropy loss for classification tasks (MSE for regression tasks) and λ is the tradeoff weight. We call the first term the empirical risk minimization (ERM) part and the second term the domain adversarial part.



* Work done during an internship at Alibaba Group. † Work done at Alibaba Group, and now affiliated with Twitter.



Figure 1: A motivating example of ELS with 3 domains on the VLCS dataset.

max_{ĥ∈Ĥ} d_{ĥ,g}(D_1, …, D_M) = max_{ĥ∈Ĥ} E_{x∼D_1} log ĥ_1 ∘ g(x) + ⋯ + E_{x∼D_M} log ĥ_M ∘ g(x),   (1)

where ĥ_i ∘ g(x) is the predicted probability that x belongs to D_i. Denoting by y the class label, the overall objective of DAT is

min_{g, ĥ′} ∑_{i=1}^{M} E_{x∼D_i}[ℓ(ĥ′ ∘ g(x), y)] + λ d_{ĥ,g}(D_1, …, D_M).
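The discriminator objective in Eq. (1) can be made concrete with a minimal pure-Python sketch (the function names here are illustrative, not from the paper): for each domain D_i, it averages the log-probability that the discriminator assigns samples from D_i to domain i.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def discriminator_objective(logits_by_domain):
    """Eq. (1): sum over domains i of E_{x in D_i} log p_i(x), where
    p = softmax(discriminator logits). logits_by_domain[i] holds the
    M-dimensional discriminator logits for each sample drawn from D_i."""
    total = 0.0
    for i, domain_logits in enumerate(logits_by_domain):
        # Mean log-probability that samples from D_i are assigned to domain i.
        total += sum(math.log(softmax(z)[i]) for z in domain_logits) / len(domain_logits)
    return total
```

Since the discriminator maximizes this quantity, a perfectly confident discriminator drives it toward its maximum of 0, which is exactly the over-confidence regime discussed in the next section.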

overfits these mislabelled examples and then generalizes poorly. (ii) To the best of our knowledge, DAT methods all assign one-hot environment labels to each data sample for domain discrimination, so the output probabilities become highly confident. For DAT, a very confident domain discriminator leads to highly oscillatory gradients (Arjovsky & Bottou, 2017), which harms training stability. The first observation inspires us to make the training process robust to environment-label noise, and the second encourages the discriminator to estimate soft probabilities rather than confident classifications. To this end, we propose Environment Label Smoothing (ELS), a simple method that tackles both obstacles for DAT. Next, we summarize the main methodological, theoretical, and experimental contributions.

Methodology: To the best of our knowledge, this is the first work to smooth environment labels for DAT. The proposed ELS yields three main advantages: (i) it requires no extra parameters or optimization steps, and yields faster convergence, better training stability, and more robustness to label noise, both theoretically and empirically; (ii) despite its efficiency, ELS is also easy to implement: it can be incorporated into any DAT method in a few lines of code; (iii) ELS-equipped DAT methods attain superior generalization performance compared to their native counterparts.

Theories: The benefit of ELS is theoretically verified in the following aspects. (i) Training stability. We first connect DAT to Jensen-Shannon/Kullback-Leibler divergence minimization, where ELS is shown to extend the support of the training distributions and to relieve both the oscillatory-gradient and gradient-vanishing phenomena, resulting in stable and well-behaved training. (ii) Robustness to noisy labels. We theoretically verify that the negative effect caused by noisy labels can be reduced or even eliminated by ELS with a proper smoothing parameter. (iii) Faster non-asymptotic convergence. We analyze the non-asymptotic convergence properties of DANN; the results indicate that incorporating ELS further speeds up convergence. In addition, we also provide the empirical gap and analyze some commonly used DAT tricks.

Experiments: (i) Experiments are carried out on various benchmarks with different backbones, covering image classification, image retrieval, natural language processing, genomics data, graphs, and sequential data. ELS brings consistent improvement when incorporated with different DAT methods and achieves competitive or state-of-the-art performance on various benchmarks, e.g., average accuracy on Rotating MNIST (52.1% → 62.1%), worst-group accuracy on CivilComments (61.7% → 65.9%), test ID accuracy on RxRx1 (22.9% → 26.7%), and average accuracy on the Spurious-Fourier dataset (11.1% → 15.6%). (ii) Even when the environment labels are random or only partially known, the performance of ELS + DANN does not degrade much and remains superior to native DANN. (iii) Extensive analyses of the training dynamics empirically verify the benefit of ELS. (iv) We conduct thorough ablations on the ELS hyper-parameter and give useful suggestions for choosing the best smoothing parameter based on dataset information.
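As a concrete illustration, the smoothing step described above can be sketched in a few lines of pure Python. The helper names and the convention of spreading the off-target mass uniformly over the other domains are assumptions made here for illustration, not necessarily the paper's exact formulation:

```python
import math

def smoothed_environment_target(domain_index, num_domains, gamma=0.1):
    """Environment Label Smoothing: replace the one-hot environment label
    with a soft target that puts 1 - gamma on the true domain and spreads
    gamma uniformly over the remaining num_domains - 1 domains (one common
    smoothing convention; gamma is the smoothing parameter)."""
    off = gamma / (num_domains - 1)
    target = [off] * num_domains
    target[domain_index] = 1.0 - gamma
    return target

def smoothed_cross_entropy(probs, domain_index, gamma=0.1):
    """Discriminator loss against the smoothed target instead of a
    one-hot environment label. probs are softmax outputs over domains."""
    target = smoothed_environment_target(domain_index, len(probs), gamma)
    return -sum(q * math.log(p) for q, p in zip(target, probs))
```

With gamma = 0 this reduces to the standard one-hot cross-entropy; with gamma > 0 a fully confident prediction no longer minimizes the loss, which is how ELS discourages an over-confident discriminator.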

