DOMAIN-FREE ADVERSARIAL SPLITTING FOR DOMAIN GENERALIZATION

Abstract

Domain generalization utilizes several source domains to train a learner that generalizes to an unseen target domain, tackling the domain shift issue; it has drawn much attention in the machine learning community. This paper aims to learn to generalize well to an unseen target domain without relying on knowledge of the number of source domains or on domain labels. We unify adversarial training and meta-learning in a novel framework, Domain-Free Adversarial Splitting (DFAS). In this framework, we model domain generalization as a learning problem that enforces the learner to generalize well for any train/val splitting of the training dataset. To achieve this goal, we propose a min-max optimization problem that can be solved by an iterative adversarial training process. In each iteration, the training dataset is adversarially split into train/val subsets to maximize the domain shift between them under the current learner, and then the learner is updated on this splitting, using a meta-learning approach, to generalize well from the train-subset to the val-subset. Extensive experiments on three benchmark datasets, under three different settings for the source and target domains, show that our method achieves state-of-the-art results, and an ablation study confirms its effectiveness. We also derive a generalization error bound for a theoretical understanding of our method.

1. INTRODUCTION

Deep learning has achieved great success in image recognition (He et al., 2016; Krizhevsky et al., 2012; Simonyan & Zisserman, 2014). However, deep learning methods mostly succeed when the training and test data are sampled from the same distribution (the i.i.d. assumption). This assumption is often violated in real-world applications, since the equipment/environments that generate the data often differ between training and test datasets. When there exists a distribution difference (domain shift (Torralba & Efros, 2011)) between training and test datasets, the performance of the trained model, i.e., the learner, will significantly degrade. To tackle the domain shift issue, domain adaptation (Pan & Yang, 2010; Daume III & Marcu, 2006; Huang et al., 2007) learns a learner transferable from the source domain to the target domain. Domain adaptation methods align the distributions of different domains either in feature space (Long et al., 2015; Ganin et al., 2016) or in raw pixel space (Hoffman et al., 2018), which relies on unlabeled data from the target domain at training time. However, in many applications it is unrealistic to access unlabeled target data; this prevents the use of domain adaptation in such settings and motivates research on domain generalization. Domain generalization (DG) (Blanchard et al., 2011; Muandet et al., 2013) commonly uses several source domains to train a learner that can generalize to an unseen target domain. The underlying assumption is that there exists a latent domain-invariant feature space across the source domains and the unseen target domain. To learn domain-invariant features, (Muandet et al., 2013; Ghifary et al., 2015; Li et al., 2018b) explicitly align the distributions of different source domains in feature space.
(Balaji et al., 2018; Li et al., 2019b; 2018a; Dou et al., 2019) split the source domains into meta-train and meta-test sets to simulate domain shift and train the learner with a meta-learning approach. (Shankar et al., 2018; Carlucci et al., 2019; Zhou et al., 2020; Ryu et al., 2020) augment images or features to enhance the generalization capability of the learner. Conventional domain generalization methods assume that domain labels are available, but in more realistic scenarios the domain labels may be unknown (Wang et al., 2019). To handle this domain-free setting, Carlucci et al. (2019) combines supervised learning and self-supervised learning by solving jigsaw puzzles of the training images. Matsuura & Harada (2020) divides samples into several latent domains via clustering and trains a domain-invariant feature extractor via adversarial training. Huang et al. (2020) discards the dominant activated features, forcing the learner to activate the remaining features that correlate with labels. Another line of work (Volpi et al., 2018; Qiao et al., 2020) tackles the single-source setting, in which the training set comprises a single domain and the training and test data are from different domains. In this work, we focus on a general learning scenario of domain generalization. First, we do not know the domain label of each sample and do not assume that there are several domains in the training dataset. Second, we do not assume that the training and test data are from different domains (e.g., styles). Note that previous domain-free DG methods (Matsuura & Harada, 2020) commonly evaluate on datasets (e.g., PACS) composed of several domains, even though they do not use domain labels in training.
In our domain-free setting, since we neither assume nor know the domains in the training dataset, we model domain generalization as a learning problem in which the learner should generalize well for any train/val splitting, i.e., synthetic source/target domains, of the training dataset. This explicitly enforces that the trained learner be generalizable for any possible domain shift within the training dataset. Because enumerating all splittings is infeasible, we formulate this goal as an adversarial splitting model, a min-max optimization problem. In this min-max problem, we adversarially split the training dataset into train/val subsets by maximizing the domain shift between them under the given learner, and then update the learner by minimizing the prediction error on the val-subset, given the splitting, using a meta-learning approach. By optimizing this min-max problem, we enforce the learner to generalize well even under the worst-case splitting. We also investigate L_2-normalization of features in our domain generalization method. We surprisingly find that L_2-normalization can improve the performance of the learner and mitigate gradient explosion in the meta-learning process of DG, and we theoretically analyze the underlying reasons for this finding. The proposed domain generalization approach is dubbed Domain-Free Adversarial Splitting (DFAS). To verify the effectiveness of our method, we conduct extensive experiments on the benchmark datasets PACS, Office-Home and CIFAR-10 under different settings with multiple/single source domains. In experiments in which the training data are from several source domains, our method achieves state-of-the-art results on both PACS and Office-Home. We also find that our method significantly outperforms baselines in experiments in which the training data are from a single source domain, on PACS and CIFAR-10. We further confirm the effectiveness of our method by an ablation study.
Based on domain adaptation theory, we also derive an upper bound of the generalization error on unseen target domain. We analyze that the terms in this upper bound are implicitly minimized by our method. This theoretical analysis partially explains the success of our method.

2. RELATED WORKS

We summarize and compare with related domain generalization (DG) methods from two perspectives: DG with domain labels and DG without domain labels.

DG with domain labels. When domain labels are available, there are three categories of methods for DG. First, (Muandet et al., 2013; Ghifary et al., 2015; Li et al., 2018b; Piratla et al., 2020) learn domain-invariant features by aligning feature distributions or by common/specific feature decomposition. Second, (Li et al., 2019a; Balaji et al., 2018; Li et al., 2019b; 2018a; Dou et al., 2019; Du et al., 2020a; b) are based on meta-learning: they split the given source domains into meta-train and meta-test domains and train the learner in an episodic training paradigm. Third, (Shankar et al., 2018; Carlucci et al., 2019; Zhou et al., 2020; Wang et al., 2020) augment fake-domain data to enhance the generalization capability of the learner. Our method most closely relates to the second category. Differently, however, we consider the DG problem in the domain-free setting and adversarially split the training dataset to synthesize domain shift in a principled min-max optimization method, instead of using the leave-one-domain-out splitting of these methods.

DG without domain labels. When domain labels are unavailable, to enhance the generalization ability of the learner, Wang et al. (2019) extracts robust feature representations by projecting out superficial patterns like color and texture. Carlucci et al. (2019) proposes to solve jigsaw puzzles of the training images. Matsuura & Harada (2020) divides samples into several latent domains via clustering and learns domain-invariant features via adversarial training of a feature extractor and a domain discriminator. Huang et al. (2020) discards the dominant activated features, forcing the learner to activate the remaining features that correlate with labels. Volpi et al. (2018) and Qiao et al. (2020) propose adversarial data augmentation to tackle the setting in which the training set comprises a single domain. In methodology, these methods either explicitly force the learner to extract robust features (Wang et al., 2019; Matsuura & Harada, 2020; Huang et al., 2020) or augment new data to increase the training data (Carlucci et al., 2019; Qiao et al., 2020; Volpi et al., 2018). In contrast, our method is a novel meta-learning approach for DG that introduces adversarial splitting of the training dataset during training, without relying on data/domain augmentation.

3. METHOD

In our setting, since we neither assume nor know the domains in the training dataset, the training data could be sampled independently from several underlying source domains or from just a single source domain. We denote the training dataset by S = {(x_i, y_i)}_{i=1}^N. Our goal is to train a learner on S that generalizes well to an unseen target domain. In the following sections, we introduce the details of our proposed model in Sect. 3.1, followed by its optimization method in Sect. 3.2. We also investigate L_2-normalization for domain generalization in Sect. 3.3. Theoretical analysis of our method is presented in Sect. 4. Experimental results are reported in Sect. 5, and Sect. 6 concludes the paper.

3.1. DOMAIN-FREE ADVERSARIAL SPLITTING MODEL

As mentioned in Sect. 1, we model DG as a learning problem that enforces the learner to generalize well for any train/val splitting of the training dataset. The learner is trained using a meta-learning approach (Finn et al., 2017). To formulate our idea mathematically, we first introduce some notation. We denote f as a function/learner (f could be a deep neural network, e.g., ResNet (He et al., 2016)) that outputs the classification score of the input image, l as the loss such as cross-entropy, and S_t and S_v as the train-subset and val-subset respectively, such that S = S_t ∪ S_v and S_t ∩ S_v = ∅. The formulated optimization problem for domain generalization is

min_w (1/|Γ_ξ|) Σ_{S_v ∈ Γ_ξ} L(θ(w); S_v) + R(w)   s.t.   θ(w) = argmin_θ L(θ; S_t, w), S_t = S − S_v,   (1)

where Γ_ξ = {S_v : S_v ⊂ S, |S_v| = ξ} is the set of all possible val-subsets of S with size ξ, S_t = S − S_v is the train-subset paired with each S_v, L(θ(w); S_v) = (1/|S_v|) Σ_{(x,y)∈S_v} l(f_{θ(w)}(x), y) is the loss on S_v, θ(w) is the parameters of f, L(θ; S_t, w) is L(θ; S_t) with θ initialized by w, and R(w) is a regularization term. In the optimization model of Eq. (1), the parameter θ(w) of the learner trained on S_t is treated as a function of the initial parameter w. To force the learner trained on S_t to generalize well on S_v, we directly minimize the loss L(θ(w); S_v) on the val-subset S_v, dubbed the generalization loss, w.r.t. the parameter θ(w) trained on S_t. Solving Eq. (1) forces the learner to generalize well from any train-subset to the corresponding val-subset.

Since |Γ_ξ| may be extremely large, it is infeasible to enumerate all possible train/val splittings. Thus, we propose the following adversarial splitting model instead:

min_w max_{S_v ∈ Γ_ξ} L(θ(w); S_v) + R(w)   s.t.   θ(w) = argmin_θ L(θ; S_t, w), S_t = S − S_v.   (2)

In the min-max problem of Eq. (2), the train/val (S_t/S_v) splitting is optimized to maximize the generalization loss, increasing the domain shift between train and val subsets by finding the hardest splitting for the current learner, while w is optimized by minimizing the generalization loss of the learner over that splitting. Solving the adversarial splitting model of Eq. (2) enforces the learner to be generalizable even for the worst-case splitting. We therefore expect the trained learner to be robust to domain shifts within the training dataset. For the regularization term R(w), we set it to be the training loss on S_t (i.e., R(w) = L(w; S_t)), which additionally constrains that the learner with parameter w should be effective on S_t (Li et al., 2018a). The effect of the hyper-parameter ξ is discussed in Appendix A.2.

In conventional adversarial machine learning, adversarial training is imposed on adversarial samples and the learner to increase the robustness of the learner to adversarial corruption (Goodfellow et al., 2015). In our optimization model of Eq. (2), in contrast, adversarial training is conducted on the data splitting and the learner, to force the learner to be robust to the domain shift between train/val subsets. Our model thus bridges adversarial training and meta-learning; it is a general learning framework for domain generalization and a complement to adversarial machine learning.
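To make the bilevel structure of Eq. (2) concrete, the following toy numpy sketch (ours, not the paper's code) evaluates the generalization loss L(θ(w); S_v) for every splitting of a tiny least-squares problem and takes the worst one. It is only illustrative: the real method never enumerates Γ_ξ, but instead uses the alternating scheme of Sect. 3.2.

```python
import numpy as np
from itertools import combinations

# Toy instance of Eq. (2): a linear least-squares "learner" is adapted by one
# inner gradient step on S_t and evaluated on S_v. Data, model and sizes are
# illustrative stand-ins for the paper's networks.
rng = np.random.default_rng(0)
N, d, xi, alpha = 8, 3, 4, 0.1
X, y = rng.normal(size=(N, d)), rng.normal(size=N)

def loss(theta, idx):
    """Mean squared error of the linear learner on the subset idx."""
    return 0.5 * np.mean((X[idx] @ theta - y[idx]) ** 2)

def grad(theta, idx):
    return X[idx].T @ (X[idx] @ theta - y[idx]) / len(idx)

def gen_loss(w, val_idx):
    """L(theta(w); S_v), with theta(w) = w - alpha * gradient on S_t."""
    train_idx = np.setdiff1d(np.arange(N), val_idx)
    theta = w - alpha * grad(w, train_idx)     # one inner gradient step
    return loss(theta, val_idx)

w = np.zeros(d)
# Inner maximization of Eq. (2): the hardest val-subset of size xi.
splits = [np.array(c) for c in combinations(range(N), xi)]
worst = max(gen_loss(w, s) for s in splits)
```

With N = 8 and ξ = 4 there are already C(8, 4) = 70 splittings, which illustrates why Eq. (2) replaces the average over Γ_ξ in Eq. (1) with a max that can be approximated without enumeration.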

3.2. OPTIMIZATION

This section focuses on the optimization of Eq. (2). Since Eq. (2) is a min-max optimization problem, we alternately update S_v and w, fixing one while updating the other. We must also handle the inner loop for the optimization of θ(w) in the bi-level problem of Eq. (2). We next discuss these updating steps in detail. The convergence and computational cost of this algorithm are discussed in Appendices A.3 and A.4 respectively.

Inner loop for optimization of θ(w). We adopt a finite number of gradient-descent steps to approximate the minimizer θ(w) of the inner objective L(θ; S_t) with initial value w. This approximation technique was introduced in the machine learning community several years ago (Sun & Tappen, 2011; Finn et al., 2017; Fan et al., 2018). For convenience of computation, following (Li et al., 2018a; Dou et al., 2019), we conduct gradient descent for only one step:

θ(w) = w − α ∇_θ L(θ; S_t)|_{θ=w},   (3)

where α is the step size of the inner optimization; its effect is discussed in Appendix A.2.

Optimizing w with fixed S_v. For convenience, we denote g_w^t = ∇_θ L(θ; S_t)|_{θ=w}. Fixing S_v (S_t is then fixed), w can also be updated by gradient descent, i.e.,

w ← w − η ∇_w [L(w − α g_w^t; S_v) + R(w)],   (4)

where η is the step size of the outer optimization.

Finding the hardest splitting S_v with fixed w. Fixing w, to find the S_v ∈ Γ_ξ that maximizes L(w − α g_w^t; S_v), we take the first-order Taylor expansion L(w − α g_w^t; S_v) ≈ L(w; S_v) − α⟨g_w^t, g_w^v⟩, where g_w^v = ∇_θ L(θ; S_v)|_{θ=w} and ⟨·,·⟩ denotes the inner product. From the definitions of L, g_w^t and g_w^v, the optimization problem max_{S_v∈Γ_ξ} {L(w; S_v) − α⟨g_w^t, g_w^v⟩} can be written as

max_{S_v∈Γ_ξ} (1/|S_v|) Σ_{(x,y)∈S_v} [ l(f_w(x), y) − α⟨∇_w l(f_w(x), y), g_w^t⟩ ].

This problem is equivalent to the following splitting formulation:

max_{S_v, A} Σ_{(x,y)∈S_v} [ l(f_w(x), y) − α⟨∇_w l(f_w(x), y), A⟩ ]   s.t.   A = g_w^t, S_v ∈ Γ_ξ,   (5)

where we have introduced an auxiliary variable A. Eq. (5) can be solved by alternately updating S_v and A. Given A, we compute and rank the values of l(f_w(x), y) − α⟨∇_w l(f_w(x), y), A⟩ for all (x, y) ∈ S and select the ξ samples with the largest values to constitute S_v. Given S_v (S_t is then given), we update A by A = g_w^t = (1/|S_t|) Σ_{(x,y)∈S_t} ∇_w l(f_w(x), y). The details and convergence of this alternating iteration are discussed in Appendix C. Since computing the gradient w.r.t. all parameters is time- and memory-consuming, we only compute the gradient w.r.t. the parameters of the final layer of the learner f.
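A minimal numpy sketch of this alternating solver might look as follows. The per-sample losses and gradients come from a toy linear regression model standing in for the learner's final layer, and all variable names are ours.

```python
import numpy as np

# Alternating solver for the splitting problem of Eq. (5), sketched with a
# toy linear model: rank l(f_w(x), y) - alpha * <grad_w l(f_w(x), y), A>,
# take the top-xi samples as S_v, then refresh A = g_t on the new S_t.
rng = np.random.default_rng(1)
N, d, xi, alpha = 10, 4, 5, 0.1
X, y = rng.normal(size=(N, d)), rng.normal(size=N)
w = rng.normal(size=d)

residual = X @ w - y
l = 0.5 * residual ** 2                        # per-sample loss l(f_w(x), y)
grad_l = residual[:, None] * X                 # per-sample gradient of l w.r.t. w

A = grad_l.mean(axis=0)                        # initialize A with full-set gradient
for _ in range(10):                            # alternate S_v- and A-updates
    scores = l - alpha * grad_l @ A            # value of each sample in Eq. (5)
    val_idx = np.argsort(-scores)[:xi]         # largest-xi scores constitute S_v
    train_idx = np.setdiff1d(np.arange(N), val_idx)
    A_new = grad_l[train_idx].mean(axis=0)     # A = g_t on the new S_t
    if np.allclose(A_new, A):                  # splitting has stabilized
        break
    A = A_new
```

Restricting the gradients to the final layer, as the paper does, keeps `grad_l` small, so the ranking step costs little more than a forward pass over S.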

3.3. L 2 -NORMALIZATION FOR EXTRACTED FEATURE

L_2-normalization has been used in face recognition (Liu et al., 2017; Wang et al., 2018) and domain adaptation (Saito et al., 2019; Gu et al., 2020), but is rarely investigated in domain generalization. We investigate it for domain generalization in this paper. We surprisingly find in experiments that L_2-normalization not only improves the performance of the learner (see Sect. 5.4), but also mitigates the gradient explosion (see Sect. 5.5) that occurs frequently during the training of meta-learning for DG (Finn et al., 2017; Dou et al., 2019). We next present the details of L_2-normalization in our method and analyze why it mitigates gradient explosion.

Feature L_2-normalization. L_2-normalization is used as a component of our learner f. Specifically, we decompose f into a feature extractor f_e (e.g., the convolutional layers of ResNet), a transform f_n representing L_2-normalization, and a classifier f_c, i.e., f = f_c ∘ f_n ∘ f_e. The feature of an input image x extracted by f_e is fed to f_n to output a unit vector z, i.e., z = f_n(f_e(x)) = f_e(x)/‖f_e(x)‖. The classifier f_c consists of unit weight vectors W = [w_1, w_2, ..., w_K], where K is the number of classes and ‖w_k‖ = 1, ∀k. f_c takes z as input and outputs the classification score vector σ_{m,s}(W^T z), where σ_{m,s}(·) is the marginal softmax function defined by

[σ_{m,s}(W^T z)]_k = exp(s(w_k^T z − m I_{k=y})) / Σ_{k'=1}^K exp(s(w_{k'}^T z − m I_{k'=y})),   k = 1, 2, ..., K,   (6)

where y is the label of x, [·]_k indicates the k-th element, I_{a} is the indicator function that returns 1 if a is true and 0 otherwise, and m and s are hyper-parameters indicating the margin and radius respectively.

Analysis of mitigating gradient explosion. We find that L_2-normalization mitigates gradient explosion in the training of meta-learning for domain generalization. For simplicity, we analyze the gradient norm of the loss w.r.t. the parameters of f_c in the meta-learning process of DG, with f_e fixed. Without loss of generality, we consider the case that K = 2 (i.e., binary classification), s = 1 and m = 0. In this case, we have the following proposition.

Proposition 1. Under the above setting, if the input feature of f_c is L_2-normalized, the gradient norm of the loss w.r.t. the parameters of f_c in the meta-learning process of DG is bounded.

Sketch of proof. Given a feature z, the loss of binary classification is L(w; z) = −y log(σ(w^T z)) − (1 − y) log(1 − σ(w^T z)), where σ is the sigmoid function. Let w' = w − α∇_w L(w; z); then ∇_w L(w'; z) = (I − αH)∇_{w'} L(w'; z), where H is the Hessian of L(·; z) at w. Since ‖z‖ = 1, ∇_{w'} L(w'; z) = (σ(w'^T z) − y)z has norm at most 1, and H = σ'(w^T z) z z^T has spectral norm at most 1/4, so ‖I − αH‖ is bounded; hence the meta-gradient norm is bounded.

According to Proposition 1, L_2-normalization can mitigate gradient explosion under the above setting. The analysis of the gradient norm of the loss w.r.t. the parameters of both f_c and f_e in the meta-learning process is much more complex and is left for future work.
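The normalized head of Sect. 3.3 can be sketched in a few lines of numpy; this is our illustrative implementation of f_n, f_c and the marginal softmax of Eq. (6), with function names of our own choosing and purely illustrative values of s and m.

```python
import numpy as np

# L2-normalized classifier head: f_n normalizes the feature to a unit vector
# z, f_c holds unit weight vectors w_k, and Eq. (6) applies radius s and
# subtracts margin m from the true class's logit before the softmax.
def l2_normalize(v, axis=0, eps=1e-12):
    return v / (np.linalg.norm(v, axis=axis, keepdims=True) + eps)

def marginal_softmax(W, feat, y, s=7.5, m=0.2):
    z = l2_normalize(feat)                  # z = f_e(x) / ||f_e(x)||, so ||z|| = 1
    Wn = l2_normalize(W, axis=0)            # columns w_k with ||w_k|| = 1
    logits = s * (Wn.T @ z)                 # s * w_k^T z for every class k
    logits = logits.copy()
    logits[y] -= s * m                      # margin only on the label's logit
    e = np.exp(logits - logits.max())       # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(2)
W = rng.normal(size=(8, 3))                 # 8-dim features, K = 3 classes
p = marginal_softmax(W, rng.normal(size=8), y=1)
```

Because both z and the w_k are unit vectors, every logit lies in [−s − sm, s], which is the boundedness that the proof sketch above exploits.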

4. THEORETICAL ANALYSIS

This section presents a theoretical understanding of our method. We first derive a generalization error bound on the target domain in Theorem 1 for the general setting of meta-learning for DG. Then, based on Theorem 1, we theoretically explain the success of our method. Without loss of generality, we consider the binary classification problem. We denote H as the set of all possible f, i.e., H = {f_w : w ∈ R^M}, where M is the number of parameters. For any S_v ∈ Γ_ξ and S_t = S − S_v, we let H_{S_t} = {f_{θ(w)} : θ(w) = argmin_θ L(θ; S_t, w), w ∈ R^M}. We denote ε_Q^Ψ(f) = E_{(x,y)∼Q}[I_{Ψ(f(x)) ≠ y}] as the generalization error on the distribution Q of the unseen target domain, ε̂_{S_v}^Ψ(f) = (1/|S_v|) Σ_{(x,y)∈S_v} I_{Ψ(f(x)) ≠ y} as the empirical error on S_v, VC(H) as the VC-dimension of H, and Ψ(·) as the prediction rule, such as the Bayes optimal predictor. Based on domain adaptation theory (Ben-David et al., 2007; 2010) and inspired by the analysis in (Saito et al., 2019), we have the following theorem.

Theorem 1. Let γ be a constant and assume E_Q[I_{l(f(x),y)>γ}] ≥ E_P[I_{l(f(x),y)>γ}]. Then, given any S_v ∈ Γ_ξ and S_t = S − S_v, for any δ ∈ (0, 1), with probability at least 1 − 2δ, we have, ∀f ∈ H_{S_t},

ε_Q^{Ψ_l}(f) ≤ ε̂_{S_v}^{Ψ_l}(f) + B(S_v) + 2 sqrt( (8/ξ)(C_2 + log(4/δ)) ) + C_3,   (7)

where
B(S_v) = C_1 − inf_{f∈H_{S_t}} (1/|S_v|) Σ_{(x,y)∈S_v} I_{l(f(x),y)>γ},
C_1 = sup_{S_v∈Γ_ξ} sup_{f∈H_{S−S_v}} E_Q[I_{l(f(x),y)>γ}],
C_2 = sup_{S_v∈Γ_ξ} VC(H_{S−S_v}^{Ψ_l}) log( 2eξ / VC(H_{S−S_v}^{Ψ_l}) ),
C_3 ≥ sup_{S_v∈Γ_ξ} inf_{f∈H_{S−S_v}} { ε_P^{Ψ_l}(f) + ε_Q^{Ψ_l}(f) },
H_{S−S_v}^{Ψ_l} = {Ψ_l ∘ f : f ∈ H_{S−S_v}}, and Ψ_l is a loss-related indicator defined by Ψ_l(f(x)) = 1 if l(f(x), y) > γ and 0 otherwise.

The proof is given in Appendix D. The assumption E_Q[I_{l(f(x),y)>γ}] ≥ E_P[I_{l(f(x),y)>γ}] in Theorem 1 is realistic because the data of Q are not accessed at training time, and the learner trained on data of P should have a smaller classification loss on P than on Q. In Theorem 1, C_1, C_2 and C_3 are constants w.r.t. f.

In Eq. (7), the generalization error ε_Q^{Ψ_l}(f) on Q is bounded by the empirical error ε̂_{S_v}^{Ψ_l}(f) on S_v, the term B(S_v) that measures the discrepancy between P and Q, and the last two constant terms of Eq. (7). To obtain a lower ε_Q^{Ψ_l}(f), we need to minimize ε̂_{S_v}^{Ψ_l}(f) and B(S_v). Minimizing B(S_v) w.r.t. S_v is equivalent to

max_{S_v∈Γ_ξ} inf_{f∈H_{S_t}} (1/|S_v|) Σ_{(x,y)∈S_v} I_{l(f(x),y)>γ}.   (10)

Intuitively, Eq. (10) means finding an S_v ∈ Γ_ξ such that the infimum ratio of examples in S_v having loss greater than γ is maximized. This min-max problem of Eq. (10) for computing the error bound bears a similar idea to our min-max problem. Our adversarial splitting model in Eq. (2) can implicitly realize the goal of Eq. (10) and meanwhile ensure a lower ε̂_{S_v}^{Ψ_l}(f) for any S_v. The maximization in Eq. (10) corresponds to our adversarial splitting that finds the hardest val-subset S_v for the learner in Eq. (2). The infimum in Eq. (10) corresponds to the minimization of the loss in Eq. (2) on S_v w.r.t. the learner parameterized by θ(w). Instead of using the indicator function I as in Eq. (10), in our model of Eq. (2) we choose a differentiable classification loss for easier optimization.

5. EXPERIMENTS

We verify the effectiveness of our method in three types of experimental settings: Multi Source with Domain Shift (MSDS), in which the training data are from several source domains and there exists domain shift between training and test data; Single Source with Domain Shift (SSDS), in which the training data are from a single source domain and there exists domain shift between training and test data; and Same Source and Target Domain (SSTD), in which the training and test data are from the same single domain. The source code will be released online. We conduct experiments on three benchmark datasets. PACS (Li et al., 2017) contains four domains, art painting (A), cartoon (C), photo (P) and sketch (S), sharing seven classes. Office-Home (Venkateswara et al., 2017), a dataset widely used in domain adaptation and recently utilized in domain generalization, consists of four domains, Art (Ar), Clipart (Cl), Product (Pr) and Real World (Rw), sharing 65 classes. Both of these datasets are used for experiments in the MSDS and SSDS settings. CIFAR-10 (Krizhevsky et al., 2009) is used for the SSTD setting.

5.1. TYPE I: MULTI SOURCE WITH DOMAIN SHIFT (MSDS)

In the MSDS setting, following (Carlucci et al., 2019), we use leave-one-domain-out cross-validation, i.e., training on three domains and testing on the remaining unseen domain, on PACS and Office-Home. Note that the domain labels are not used during training. We adopt ResNet18 and ResNet50 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015). For each of them, the last fully-connected layer is replaced by a bottleneck layer, and the corresponding network is taken as the feature extractor f_e. Full implementation details are reported in Appendix B. We compare our method with several state-of-the-art methods, including D-SAM (D'Innocente & Caputo, 2018), JiGen (Carlucci et al., 2019), MASF (Dou et al., 2019), MMLD (Matsuura & Harada, 2020), MetaReg (Balaji et al., 2018) and RSC (Huang et al., 2020). The results on PACS and Office-Home are reported in Table 1 and Table 2 respectively. Our DFAS achieves state-of-the-art results with both ResNet18 and ResNet50 on both PACS (85.4%, 89.0%) and Office-Home (64.3%, 70.3%), outperforming RSC by 0.3% and 1.2% on PACS with ResNet18 and ResNet50 respectively, and by 1.2% on Office-Home with ResNet18 (these methods do not report Office-Home results with ResNet50). Compared with the Baseline that directly aggregates the three source domains to train a learner with a standard fully-connected layer as classifier f_c, DFAS improves performance by 5.0% and 5.7% on PACS with ResNet18 and ResNet50 respectively, and by 2.1% on Office-Home with both ResNet18 and ResNet50. In Table 1, on PACS, DFAS significantly outperforms the Baseline in almost all tasks, except when P is the target domain. Notably, in the task where domain S, whose style is extremely different from the other three domains, is the target domain, DFAS boosts the accuracy of the Baseline by 12.0% and 15.3% with ResNet18 and ResNet50 respectively. This indicates that our method can generalize well when the domain shift is large. Office-Home is challenging for DG since it has more classes than the other datasets. As shown in Table 2, DFAS outperforms the Baseline stably in almost all tasks on Office-Home. These performance improvements demonstrate the effectiveness of our method when the training data are from multiple source domains and the unseen target domain differs from the source domains.

5.2. TYPE II: SINGLE SOURCE WITH DOMAIN SHIFT (SSDS)

We conduct this type of experiment on PACS with ResNet18. We train the learner on one domain and test on each of the remaining three domains, resulting in 12 tasks in total. Implementation details are given in Appendix B. Our method is compared with related methods, including the Baseline that directly trains a learner with a standard fully-connected layer as classifier f_c on the source domain, JiGen (Carlucci et al., 2019) and SagNet (Nam et al., 2019). Results are reported in Table 3. DFAS outperforms the Baseline and SagNet by 4.2% and 4.7% respectively, and outperforms the Baseline in 10 of the 12 tasks. The performance boosts are large in tasks where domain S is the source domain. These improvements demonstrate the effectiveness of our method when the training data are from a single source domain and the unseen target domain differs from it.

5.3. TYPE III: SAME SOURCE AND TARGET DOMAIN (SSTD)

We also apply our DG method to the common recognition task in which the training and test data are from the same domain, i.e., SSTD, on the CIFAR-10 dataset. To investigate the effect of training size, we sample different numbers of training data from the provided training set (i.e., the source domain). Implementation details are in Appendix B. As shown in Table 4, DFAS outperforms the Baseline, JiGen and MMLD by 2.4%, 3.6% and 4.3% respectively on average (the results of JiGen and MMLD are obtained by running their code on CIFAR-10). DFAS outperforms the Baseline and the compared methods for all training-set sizes, and the performance boost is generally larger when the number of training data is smaller. This may be because the learner is more prone to overfitting with a smaller training set, and DFAS is designed to extract better generalizable features.

5.4. ABLATION STUDY

To further verify the effectiveness of each component of our method, we conduct additional ablation experiments on the PACS dataset based on ResNet18 in both the MSDS and SSDS settings. The results are reported in Table 5 and Table 6. Due to space limits, we add more ablation experiments in Appendix A.1 to further compare different splittings, including adversarial splitting, domain-label-based splitting and random splitting.

5.5. MITIGATING GRADIENT EXPLOSION BY L_2-NORMALIZATION

To show that L_2-normalization can mitigate gradient explosion, we run the same experiments independently 50 times each, with and without L_2-normalization, and count the occurrences of gradient explosion, reported in Table 7. From Table 7, we observe that L_2-normalization does mitigate gradient explosion.

6. CONCLUSION

In this paper, we unify adversarial training and meta-learning in the novel Domain-Free Adversarial Splitting (DFAS) framework to tackle the general domain generalization problem. Extensive experiments show the effectiveness of the proposed method. We are interested in deeper theoretical understanding and more applications of our method in future work.

A ANALYSIS

A.1 COMPARISON OF ADVERSARIAL SPLITTING AND DOMAIN-LABEL-BASED SPLITTING

In this section, we compare our adversarial splitting with the domain-label-based splitting commonly used in meta-learning-based DG methods. Due to variations of style, poses, sub-classes, etc., the internal inconsistency within a dataset is complicated. Domain labels partially capture this inconsistency but cannot cover all possible internal inconsistency. Our adversarial splitting method does not rely on domain labels. It iteratively finds the train/val splitting hardest for the learner, to maximize the inconsistency, and trains the learner to generalize well for this hardest splitting, in an adversarial training manner. This strategy more flexibly explores the possible inconsistency within the training dataset, adaptively to the learner, and can potentially enhance the generalization ability of the learner. We first empirically show in Table 8 that the domain-label-based splitting (denoted as Label-split) is not as hard for the learner as our adversarial splitting (Adv-split). In Table 8, we report the values of the objective function in Eq. (5) for Adv-split and Label-split, fixing the learner with different network parameters w_i at different epochs (1st, 2nd, 5th and 10th) of the training process. A larger value indicates that the splitting is harder for the learner (i.e., the network). It can be observed that Label-split is not as hard for the learner as Adv-split. We also conduct experiments on PACS in the MSDS setting to fairly compare different splittings: adversarial splitting (Adv-split), domain-label-based splitting (Label-split) and random splitting (Rand-split). The results, reported in Table 9, show that adversarial splitting outperforms both random splitting and domain-label-based splitting when the training data are from multiple domains.
When the training data are from only a single domain, our adversarial splitting also performs well (as in Table 3 ). However, domain-label-based splitting cannot be used in this setting, since there is no domain label available. 

A.2 EFFECT OF HYPER-PARAMETERS

Effect of hyper-parameter ξ. In Fig. 1(a), we show the performance of our method when varying the hyper-parameter ξ, i.e., the size of the val-subset S_v in the adversarial splitting of the training dataset. The best result is obtained when ξ = |S|/2, and the results are similar when ξ/|S| ranges from 0.3 to 0.7. Effect of hyper-parameter α. We evaluate the effect of α in the MSDS setting on the PACS dataset in Fig. 1(b). The accuracy is stable for values of α in the large range of 1e-6 to 1e-4. A small α results in a small step size for parameter updating in the meta-learning framework and limits the benefits of meta-learning and adversarial splitting. A larger α results in a larger step size for the gradient-descent-based network update, which may fail to decrease the training loss from the optimization perspective. Effect of hyper-parameter m. The effect of m is evaluated in the MSDS setting on the PACS dataset in Fig. 1(c), which shows that the result is not sensitive to the value of m.

A.3 CONVERGENCE

We verify the convergence of DFAS with the error and loss curves of different tasks in Fig. 2.

A.4 COMPUTATIONAL COST OF ADVERSARIAL SPLITTING AND RANDOM SPLITTING

We compare the computational cost of adversarial splitting and random splitting in this section. Since we only update the worst-case splitting once per epoch, instead of at each step of updating the parameters, the computational cost is only slightly higher than that of random splitting. To show this, we compare the total training time of adversarial splitting and random splitting for the same number of steps (20000), as in Table 10. From Table 10, the training time of Adv-split is only 5.6% (0.33/5.90) higher than that of Rand-split.

A.5 FEATURE VISUALIZATION

We visualize the feature space learned by our method DFAS and by Baseline (shown in Fig. 3), using t-SNE (Maaten & Hinton, 2008). DFAS yields better separation of classes and better alignment of the distributions of source and unseen target domains, which possibly explains the accuracy improvements achieved by DFAS.

B IMPLEMENTATION DETAILS

For the setting of MSDS, we use ResNet18 and ResNet50 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015). For each of them, the last fully-connected layer is replaced by a bottleneck layer, and the corresponding network is taken as the feature extractor f_e. The dimension of the bottleneck layer is set to 512 when the backbone is ResNet18, as in (Saito et al., 2019), and 256 when the backbone is ResNet50, as in (Gu et al., 2020). Following (Gu et al., 2020), s is set to 7.5 for PACS and 10.0 for Office-Home, and m is set to 0.2 for PACS and 0.1 for Office-Home. ξ is set to |S|/2. SGD with momentum 0.9 is used to update the parameters of the learner. The learning rate of the classifier and the bottleneck layer is 10 times that of the convolutional layers, as widely adopted in domain adaptation (Long et al., 2015; Ganin et al., 2016). Following (Ganin et al., 2016), the learning rate of the convolutional layers is adjusted by η = 0.001 / (1 + 10p)^0.75, where p is the training progress linearly changing from 0 to 1. The learning rate α of the inner-loop optimization is set to 10^-5. The parameters are updated for 20000 steps and the hardest val-subset is updated every 200 steps. The batch size is set to 64. The running mean and running variance of the Batch Normalization (BN) layers are fixed to the values pre-trained on ImageNet during training, as discussed in (Du et al., 2020a). Due to the memory limit, when conducting experiments based on ResNet50, we adopt the first-order approximation (Finn et al., 2017) that stops the gradient of g_w^t in Eq. (4) to reduce memory and computational cost. For the setting of SSDS, we conduct experiments based on ResNet18 on PACS; the implementation details are the same as for MSDS. For the setting of SSTD, we conduct experiments based on ResNet18 on CIFAR-10, with the hyper-parameters s and m set to 8.0 and 0.2, respectively.
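As a concrete illustration of the optimizer setup described above, the sketch below builds SGD parameter groups with a 10x learning rate for the head and applies the schedule η = 0.001 / (1 + 10p)^0.75. The module shapes and group names here are placeholders, not the paper's actual networks.

```python
import torch

def lr_schedule(p, eta0=0.001):
    # eta = eta0 / (1 + 10 p)^0.75, with p the training progress in [0, 1]
    # (Ganin et al., 2016).
    return eta0 / (1.0 + 10.0 * p) ** 0.75

backbone = torch.nn.Conv2d(3, 8, 3)   # stands in for the convolutional layers
head = torch.nn.Linear(8, 7)          # stands in for bottleneck + classifier
optimizer = torch.optim.SGD(
    [{"params": backbone.parameters(), "lr_mult": 1.0},
     {"params": head.parameters(), "lr_mult": 10.0}],
    lr=lr_schedule(0.0), momentum=0.9)

def set_lr(p):
    # The head uses 10x the learning rate of the convolutional layers.
    for group in optimizer.param_groups:
        group["lr"] = group["lr_mult"] * lr_schedule(p)

set_lr(0.5)
print([g["lr"] for g in optimizer.param_groups])
```

`set_lr(p)` would be called once per step with the current training progress p.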
Other implementation details are the same as for MSDS, except that, in the BN layers, the running mean and running variance are updated. We implement experiments using PyTorch (Paszke et al., 2019) on a single NVIDIA Tesla P100 GPU.

C.1 ALGORITHM FOR SOLVING EQ. (11)

To solve

max_{S_v, A} Σ_{(x,y)∈S_v} [ l(f_w(x), y) − α⟨∇_w l(f_w(x), y), A⟩ ]  s.t.  A = g_w^t,  S_v ∈ Γ_ξ,   (11)

we design an alternative iteration algorithm in Sect. 3.2. Specifically, we initialize A with the gradient of a sample randomly selected from S. Then we alternatively update S_v and A. Given A, S_v is updated by solving

max_{S_v} Σ_{(x,y)∈S_v} [ l(f_w(x), y) − α⟨∇_w l(f_w(x), y), A⟩ ]  s.t.  S_v ⊂ S,  |S_v| = ξ,   (12)

where the constraints are derived from the definition of Γ_ξ. Equation (12) indicates that the optimal S_v consists of the ξ samples that have the largest values of l(f_w(x), y) − α⟨∇_w l(f_w(x), y), A⟩. Thus we compute and rank the values of l(f_w(x), y) − α⟨∇_w l(f_w(x), y), A⟩ for all (x, y) ∈ S and select the ξ largest to constitute S_v. Given S_v (S_t is then given), we update A to satisfy the constraint A = g_w^t in Eq. (11), i.e.,

A = g_w^t = (1/|S_t|) Σ_{(x,y)∈S_t} ∇_w l(f_w(x), y).
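The alternation above can be sketched with synthetic per-sample losses and gradients; the random numbers below merely stand in for l(f_w(x), y) and ∇_w l(f_w(x), y) computed from an actual learner.

```python
import numpy as np

rng = np.random.default_rng(0)
n, xi, alpha = 10, 4, 1e-5

# Synthetic stand-ins for the per-sample losses l(f_w(x), y) and the
# (flattened) per-sample gradients grad_w l(f_w(x), y).
losses = rng.random(n)
grads = rng.normal(size=(n, 5))

# Initialize A with the gradient of a randomly selected sample.
A = grads[rng.integers(n)]
for _ in range(20):
    # Given A: S_v = the xi samples with the largest l - alpha * <grad, A>.
    scores = losses - alpha * grads @ A
    S_v = set(int(i) for i in np.argsort(scores)[-xi:])
    # Given S_v: A = g_w^t, the mean gradient over S_t = S \ S_v.
    S_t = [i for i in range(n) if i not in S_v]
    A = grads[S_t].mean(axis=0)

print("hardest S_v:", sorted(S_v))
```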

C.2 CONVERGENCE IN EXPERIMENTS

We show empirically the convergence of this alternative iteration algorithm in Fig. 4, with the values of the objective function in Eq. (11). Figure 4 shows that the value of the objective function converges after only a few iterations. We also check whether the splitting changes once the value of the objective function has converged. To do this, we count the ratio of changed sample indexes in S_v at each iteration, as in Table 11. Table 11 shows that the splitting no longer changes once the value of the objective function has converged.

We present a toy example in this section to check whether this algorithm can find the optimal solution. The toy example is a 2-dimensional classification problem, as shown in Fig. 5, where different colors indicate different classes. A fully-connected layer without bias is used as the network (learner). We split the data of the first class (blue points) with our algorithm. The candidate solutions with the corresponding objective function values are given in Table 12. The solutions in the iteration process of our algorithm are reported in Table 13. The solutions converge to (1,3,4), which is the optimal solution in Table 12. This indicates that our algorithm can find the optimal splitting for the toy example.

From the definition of H^Ψ_{S_t}, for any f ∈ H_{S_t}, there exists an h_f ∈ H^Ψ_{S_t} such that h_f = Ψ ∘ f. Applying Theorem A-1, with probability at least 1 − δ, we have ∀f ∈ H_{S_t},

|ε^Ψ_P(f) − ε̂^Ψ_{S_v}(f)| = |ε_P(h_f) − ε̂_{S_v}(h_f)| ≤ sqrt( (8/|S_v|) ( VC(H^Ψ_{S_t}) log(2e|S_v| / VC(H^Ψ_{S_t})) + log(4/δ) ) ).   (18)

Lemma A-2. For any S_v ∈ Γ_ξ and S_t = S − S_v, let ε^Ψ_P(g) = inf_{f∈H_{S_t}} ε^Ψ_P(f) and ε̂^Ψ_{S_v}(h) = inf_{f∈H_{S_t}} ε̂^Ψ_{S_v}(f). Then ∀δ ∈ (0, 1), with probability at least 1 − δ, we have

ε^Ψ_P(g) ≥ ε̂^Ψ_{S_v}(h) − sqrt( (8/|S_v|) ( VC(H^Ψ_{S_t}) log(2e|S_v| / VC(H^Ψ_{S_t})) + log(4/δ) ) ).   (19)

Proof: From the definition of g and h, we have ε̂^Ψ_{S_v}(g) ≥ ε̂^Ψ_{S_v}(h).
∀δ ∈ (0, 1), with probability at least 1 − δ, we have

ε^Ψ_P(g) − ε̂^Ψ_{S_v}(h) = ε^Ψ_P(g) − ε̂^Ψ_{S_v}(g) + ε̂^Ψ_{S_v}(g) − ε̂^Ψ_{S_v}(h)
≥ ε^Ψ_P(g) − ε̂^Ψ_{S_v}(g)
≥ − sqrt( (8/|S_v|) ( VC(H^Ψ_{S_t}) log(2e|S_v| / VC(H^Ψ_{S_t})) + log(4/δ) ) ).

Thus, Eq. (19) holds.

D.3 PROOF OF THEOREM 1

Proof: We denote by H^{Ψ_l} the hypothesis space such that ∀h ∈ H^{Ψ_l},

h(x) = Ψ_l(f(x)) = { 1 if l(f(x), y) > γ; 0 otherwise },

for f ∈ H. Then

d_{H^{Ψ_l}}(P, Q) = 2 sup_{h∈H^{Ψ_l}} |E_P[h = 1] − E_Q[h = 1]|
= 2 sup_{f∈H} |E_P[Ψ_l(f(x)) = 1] − E_Q[Ψ_l(f(x)) = 1]|
= 2 sup_{f∈H} |E_P[I_{l(f(x),y)>γ}] − E_Q[I_{l(f(x),y)>γ}]|,

where λ*(S_v) ≥ inf_{f∈H_{S−S_v}} { ε^{Ψ_l}_P(f) + ε^{Ψ_l}_Q(f) }. Letting C_3 = sup_{S_v∈Γ_ξ} λ*(S_v) ≥ sup_{S_v∈Γ_ξ} inf_{f∈H_{S−S_v}} { ε^{Ψ_l}_P(f) + ε^{Ψ_l}_Q(f) }, we have

ε^{Ψ_l}_Q(f) ≤ ε^{Ψ_l}_P(f) + C_1 − inf_{f∈H_{S_t}} E_P[I_{l(f(x),y)>γ}] + C_3.   (25)

Applying Lemma A-1 to the first term on the right side of Eq. (25), ∀δ ∈ (0, 1), with probability at least 1 − δ, we have ∀f ∈ H_{S_t},

ε^{Ψ_l}_P(f) ≤ ε̂^{Ψ_l}_{S_v}(f) + sqrt( (8/|S_v|) ( VC(H^{Ψ_l}_{S_t}) log(2e|S_v| / VC(H^{Ψ_l}_{S_t})) + log(4/δ) ) ).   (26)

Applying Lemma A-2 to the third term on the right side of Eq. (25), ∀δ ∈ (0, 1), with probability at least 1 − δ, we have

inf_{f∈H_{S_t}} E_P[I_{l(f(x),y)>γ}] ≥ inf_{f∈H_{S_t}} (1/|S_v|) Σ_{(x,y)∈S_v} I_{l(f(x),y)>γ} − sqrt( (8/|S_v|) ( VC(H^{Ψ_l}_{S_t}) log(2e|S_v| / VC(H^{Ψ_l}_{S_t})) + log(4/δ) ) ).   (27)

Combining Eqs. (25), (26), (27) and using the union bound, for any δ ∈ (0, 1), with probability at least 1 − 2δ, we have ∀f ∈ H_{S_t},

ε^{Ψ_l}_Q(f) ≤ ε̂^{Ψ_l}_{S_v}(f) + B(S_v) + 2 sqrt( (8/|S_v|) ( VC(H^{Ψ_l}_{S_t}) log(2e|S_v| / VC(H^{Ψ_l}_{S_t})) + log(4/δ) ) ) + C_3,

where B(S_v) = C_1 − inf_{f∈H_{S_t}} (1/|S_v|) Σ_{(x,y)∈S_v} I_{l(f(x),y)>γ}. Using the fact that |S_v| = ξ and letting C_2 = sup_{S_v∈Γ_ξ} VC(H^{Ψ_l}_{S−S_v}) log(2eξ / VC(H^{Ψ_l}_{S−S_v})), we have

ε^{Ψ_l}_Q(f) ≤ ε̂^{Ψ_l}_{S_v}(f) + B(S_v) + 2 sqrt( (8/ξ) ( C_2 + log(4/δ) ) ) + C_3.



where H is the Hessian matrix. The gradient norm satisfies

‖∇_w L(w′; z)‖ ≤ ‖I − αH‖ ‖∇_{w′} L(w′; z)‖ ≤ (1 + |α| ‖H‖) ‖∇_{w′} L(w′; z)‖.

Since ∇_w L(w; z) = (p − y)z and H = p(1 − p) zz^T, where p = σ(w^T z), we have ‖H‖ = sup_{u:‖u‖=1} ‖Hu‖ ≤ sup_{u:‖u‖=1} ‖zz^T u‖ ≤ ‖z‖^2 and ‖∇_w L(w; z)‖ ≤ ‖z‖. If ‖z‖ = 1, we have ‖∇_w L(w′; z)‖ ≤ 1 + |α|.
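A quick numerical check of this bound for logistic regression can be sketched as follows: for random w and z with ‖z‖ = 1, we take one inner step w′ = w − α∇_w L(w; z) and verify that the meta-gradient (I − αH)∇_{w′}L(w′; z) has norm at most 1 + |α|. The dimension and α below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.3

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def grad(w, z, y):
    # Logistic-loss gradient: (sigma(w^T z) - y) z.
    return (sigmoid(w @ z) - y) * z

def hessian(w, z):
    # Logistic-loss Hessian: p (1 - p) z z^T with p = sigma(w^T z).
    p = sigmoid(w @ z)
    return p * (1 - p) * np.outer(z, z)

for _ in range(1000):
    w, z = rng.normal(size=5), rng.normal(size=5)
    z /= np.linalg.norm(z)                  # enforce ||z|| = 1
    y = float(rng.integers(2))
    w_inner = w - alpha * grad(w, z, y)     # one inner-loop step
    # Meta-gradient d/dw L(w - alpha grad(w); z) = (I - alpha H) grad(w_inner).
    meta = (np.eye(5) - alpha * hessian(w, z)) @ grad(w_inner, z, y)
    assert np.linalg.norm(meta) <= 1 + abs(alpha)
print("bound holds for all samples")
```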

Figure 1: Effect of the hyper-parameters ξ, α and m. We use ResNet18, and domain A is taken as the target domain, on the PACS dataset in the setting of MSDS.

In Fig. 2(a) and Fig. 2(b), we show the classification error curves on the target domains (A and Ar, respectively) in the setting of MSDS. In Fig. 2(c), we show the training loss of DFAS on task A in the MSDS setting. These training curves indicate that DFAS converges in the training process. We also observe, in Fig. 2(a) and Fig. 2(b), that DFAS has better stability than Baseline.

Figure 2: Curves of target errors and losses of task A (PACS) and Ar (Office-Home) during training based on ResNet50 in the setting of MSDS.

Figure 3: The t-SNE visualization of extracted features, using our proposed DFAS (a-b) and Baseline (c-d) on PACS dataset in MSDS setting. In (a) and (c), the different colors indicate different domains. In (b) and (d), the different colors indicate different classes.
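A typical way to produce such a visualization is scikit-learn's t-SNE; the sketch below uses random arrays as stand-ins for the extracted features (shapes and parameters are illustrative, not the paper's exact setup).

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
features = rng.normal(size=(200, 256))   # stand-in for extracted features

# Project the high-dimensional features to 2-D for plotting.
emb = TSNE(n_components=2, init="pca", random_state=0).fit_transform(features)
print(emb.shape)
```

The 2-D embedding `emb` would then be scatter-plotted, colored by domain or by class as in Fig. 3.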

Figure 4: Convergence of the alternative iterations for finding the hardest S v . (a) and (b) respectively show the values of objective function in Eq. (11) in two different runs with different initializations.

= 2 sup_{f∈H} ( E_Q[I_{l(f(x),y)>γ}] − E_P[I_{l(f(x),y)>γ}] )
≤ 2 sup_{f∈H} E_Q[I_{l(f(x),y)>γ}] − 2 inf_{f∈H} E_P[I_{l(f(x),y)>γ}].   (22)

In the fourth equation, we utilize the assumption that E_Q[I_{l(f(x),y)>γ}] ≥ E_P[I_{l(f(x),y)>γ}].

The meta-learning approach for DG is to find a function in H_{S_t} that minimizes the classification loss on S_v. Note that, although the training samples in S may be sampled from several distributions, they can still be seen as i.i.d. samples from a mixture of these distributions. We denote by P = Σ_{d=1}^{D} β_d P_d this mixture distribution, with β_d the sampling ratio of the d-th source domain.

Results of MSDS experiment on PACS based on ResNet18 and ResNet50.

Results of MSDS experiment on Office-Home based on ResNet18 and ResNet50.

Results of SSDS experiment on PACS based on ResNet18.

Results of SSTD experiment on CIFAR-10 based on ResNet18.

Additional ablation results on PACS in MSDS setting. In Tables 5 and 6, L2-norm denotes the feature L2-normalization defined in Sect. 3.3. Rand-split denotes the random splitting strategy that randomly splits the train/val subsets at each step of updating the network parameters.

Additional ablation results on PACS in SSDS setting. These performance improvements demonstrate that the feature L2-normalization is useful in domain generalization.

The number of occurrences of gradient explosion.

Values of objective function in Eq. (5) of Adv-split and Label-split.

Results of different splittings.

Total training time (hours) of the adversarial splitting (Adv-split) and random splitting (Rand-split).

Ratio of changed sample indexes in S v at each iteration.

Candidate solutions (sample indexes) with objective function values of the toy example.

C_1 = sup_{S_v∈Γ_ξ} sup_{f∈H_{S−S_v}} E_Q[I_{l(f(x),y)>γ}]. Applying Theorem A-2, for any f ∈ H_{S_t}, we have

ε^{Ψ_l}_Q(f) ≤ ε^{Ψ_l}_P(f) + C_1 − inf_{f∈H_{S_t}} E_P[I_{l(f(x),y)>γ}] + λ*(S_v),


We also give the code of the toy example below, so the reader may rerun it to verify the behavior. The original script is reproduced here as a simplified, runnable numpy sketch: the plotting and logging details are omitted, the synthetic points below stand in for the data of Fig. 5, and a squared loss on a linear learner without bias stands in for the original loss.

```python
import itertools
import numpy as np

# Simplified stand-in for the toy data (6 points of the first class, indexes 0-5).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 2))
y = np.ones(6)
w = rng.normal(size=2)            # linear learner without bias
alpha, xi = 0.1, 3

# Per-sample losses l and gradients grad_w l (squared loss as a stand-in).
losses = (X @ w - y) ** 2
grads = 2 * (X @ w - y)[:, None] * X

def objective(S_v):
    # Value of Eq. (11) for a candidate val-subset S_v.
    S_t = [i for i in range(6) if i not in S_v]
    A = grads[S_t].mean(axis=0)   # A = g_w^t over the train-subset
    return sum(losses[i] - alpha * grads[i] @ A for i in S_v)

# Brute force over all candidate splittings of size xi.
Solutions = list(itertools.combinations(range(6), xi))
Values = [objective(s) for s in Solutions]
optimal_solution = Solutions[int(np.argmax(Values))]
print("optimal solution:", optimal_solution)

# Our alternating algorithm: initialize A from one sample, then iterate.
A = grads[0]
for _ in range(10):
    scores = losses - alpha * grads @ A
    S_v = tuple(sorted(int(i) for i in np.argsort(scores)[-xi:]))
    A = grads[[i for i in range(6) if i not in S_v]].mean(axis=0)
print("our optimal solution:", S_v)
```

D PROOF OF THEOREM 1

We first introduce the VC-dimension-based generalization bound and domain adaptation theory in Appendix D.1, then present two lemmas in Appendix D.2, and finally give the proof of Theorem 1 in Appendix D.3.

D.1 VC-DIMENSION BASED GENERALIZATION BOUND AND DOMAIN ADAPTATION THEORY

VC-dimension based generalization bound.

Theorem A-1. (Abu-Mostafa et al., 2012) Let S be a set of training data i.i.d. sampled from distribution P. For any δ ∈ (0, 1), with probability at least 1 − δ, we have, for all h (h: X → {0, 1}) in hypothesis space H,

|ε_P(h) − ε̂_S(h)| ≤ sqrt( (8/|S|) ( VC(H) log(2e|S| / VC(H)) + log(4/δ) ) ),

where ε_P(h) = E_{(x,y)∼P}[I_{h(x)≠y}] and ε̂_S(h) = (1/|S|) Σ_{(x,y)∈S} I_{h(x)≠y}.
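To get a feel for the magnitude of such a bound, one can plug in hypothetical numbers; the sample size, VC-dimension and δ below are purely illustrative.

```python
import math

def vc_bound(n, d_vc, delta):
    # sqrt((8/n) * (d_vc * log(2 e n / d_vc) + log(4/delta)))
    return math.sqrt((8.0 / n) * (d_vc * math.log(2 * math.e * n / d_vc)
                                  + math.log(4.0 / delta)))

# Illustrative numbers: |S| = 1000 samples, VC-dimension 50, delta = 0.05.
print(round(vc_bound(1000, 50, 0.05), 3))
```

As expected, the bound shrinks as the sample size n grows and loosens as the VC-dimension grows.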

Domain adaptation theory.

Theorem A-2. (Ben-David et al., 2007; 2010) For any h in hypothesis space H, we have

ε_Q(h) ≤ ε_P(h) + (1/2) d_H(P, Q) + λ*,

where λ* = inf_{h∈H} [ε_P(h) + ε_Q(h)] and d_H(P, Q) = 2 sup_{h∈H} |E_P[h = 1] − E_Q[h = 1]| is the H-divergence.

D.2 LEMMAS

Lemma A-1. For any S_v ∈ Γ_ξ and S_t = S − S_v, ∀δ ∈ (0, 1), with probability at least 1 − δ, we have ∀f ∈ H_{S_t},

|ε^Ψ_P(f) − ε̂^Ψ_{S_v}(f)| ≤ sqrt( (8/|S_v|) ( VC(H^Ψ_{S_t}) log(2e|S_v| / VC(H^Ψ_{S_t})) + log(4/δ) ) ),

where ε^Ψ_P(f) = ε_P(Ψ ∘ f) and ε̂^Ψ_{S_v}(f) = ε̂_{S_v}(Ψ ∘ f).

