VARIATIONAL INVARIANT LEARNING FOR BAYESIAN DOMAIN GENERALIZATION

Abstract

Domain generalization addresses the out-of-distribution problem, which is challenging due to both domain shift and the uncertainty caused by the inaccessibility of data from the target domains. In this paper, we propose variational invariant learning, a probabilistic inference framework that jointly models domain invariance and uncertainty. We introduce variational Bayesian approximation into both the feature representation and classifier layers to facilitate invariant learning for better generalization across domains. Within this probabilistic modeling framework, we introduce a domain-invariant principle that explores invariance across domains in a unified way, and incorporate it into the variational Bayesian layers of neural networks to achieve domain-invariant representations and a domain-invariant classifier. We empirically demonstrate the effectiveness of our proposal on four widely used cross-domain visual recognition benchmarks. Ablation studies confirm the benefits of our proposal, and on all benchmarks our variational invariant learning consistently delivers state-of-the-art performance.

1. INTRODUCTION

Domain generalization (Muandet et al., 2013), as an out-of-distribution problem, aims to train a model on several source domains and have it generalize well to unseen target domains. The major challenge stems from the large distribution shift between the source and target domains, which is further complicated by the prediction uncertainty (Malinin & Gales, 2018) introduced by the inaccessibility of data from the target domains during training. Previous approaches focus on learning domain-invariant features using novel loss functions (Muandet et al., 2013; Li et al., 2018a) or specific architectures (Li et al., 2017a; D'Innocente & Caputo, 2018). Meta-learning based methods achieve similar goals by leveraging an episodic training strategy (Li et al., 2017b; Balaji et al., 2018; Du et al., 2020). Most of these methods are built on deep neural network backbones (Krizhevsky et al., 2012; He et al., 2016). However, while deep neural networks have achieved remarkable success in various vision tasks, their performance is known to degenerate considerably when test samples fall outside the training data distribution (Nguyen et al., 2015; Ilse et al., 2019), due to their poorly calibrated behavior (Guo et al., 2017; Kristiadi et al., 2020). As an attractive solution, Bayesian learning naturally represents prediction uncertainty (Kristiadi et al., 2020; MacKay, 1992), possesses better generalizability to out-of-distribution examples (Louizos & Welling, 2017), and provides an elegant formulation for transferring knowledge across different datasets (Nguyen et al., 2018). Further, approximate Bayesian inference has been demonstrated to improve prediction uncertainty (Blundell et al., 2015; Louizos & Welling, 2017; Atanov et al., 2019), even when applied only to the last network layer (Kristiadi et al., 2020). These properties make it appealing to introduce Bayesian learning into the challenging and unexplored scenario of domain generalization.
In this paper, we propose variational invariant learning (VIL), a Bayesian inference framework that jointly models domain invariance and uncertainty for domain generalization. We apply variational Bayesian approximation to the last two network layers, for both the representations and the classifier, by placing prior distributions over their weights. This adapts Bayesian neural networks to domain generalization, enjoying the representational power of deep neural networks while facilitating better generalization. To further improve robustness to domain shift, we introduce a domain-invariant principle under the Bayesian inference framework, which enables us to explore domain invariance for both the feature representations and the classifier in a unified way. We evaluate our method on four widely used benchmarks for cross-domain visual object classification. Our ablation studies demonstrate the effectiveness of the variational Bayesian domain-invariant features and classifier for domain generalization, and the results show that our method achieves the best performance on all four benchmarks.

2. METHODOLOGY

We explore Bayesian inference for domain generalization. In this task, samples from the target domains are never seen during training and usually lie outside the data distribution of the source domains, which leads to uncertainty when making predictions on the target domains. Bayesian inference offers a principled way to represent this predictive uncertainty in neural networks (MacKay, 1992; Kristiadi et al., 2020). We first briefly review approximate Bayesian inference, and then introduce our variational invariant learning for domain generalization.

2.1. APPROXIMATE BAYESIAN INFERENCE

Given a dataset {x^(i), y^(i)}_{i=1}^{N} of N input-output pairs and a model parameterized by weights θ with a prior distribution p(θ), Bayesian neural networks aim to infer the true posterior distribution p(θ|x, y). As exact inference of the true posterior is computationally intractable, Hinton & Camp (1993) and Graves (2011) recommended learning a variational distribution q(θ) to approximate p(θ|x, y) by minimizing the Kullback-Leibler (KL) divergence between them:

θ* = arg min_θ D_KL[q(θ) || p(θ|x, y)].    (1)

This optimization is equivalent to minimizing the loss function

L_Bayes = −E_{q(θ)}[log p(y|x, θ)] + D_KL[q(θ) || p(θ)],    (2)

which is the negative of the evidence lower bound (ELBO) (Blei et al., 2017).
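As a concrete sketch of Eq. (2) — not the authors' implementation — the negative ELBO for a factorized Gaussian posterior and a Gaussian prior can be evaluated in closed form for the KL term and by Monte Carlo for the expected negative log-likelihood. The function names and the NumPy setting are our own:

```python
import numpy as np

def kl_factorized_gaussians(mu_q, sig_q, mu_p, sig_p):
    # Closed-form KL(q || p) for factorized Gaussians, summed over all weights.
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p)**2) / (2.0 * sig_p**2)
                        - 0.5))

def negative_elbo(nll_samples, mu_q, sig_q, mu_p, sig_p):
    # L_Bayes = E_q[-log p(y|x, theta)] (Monte Carlo average over sampled
    # weights) + KL(q(theta) || p(theta)).
    return float(np.mean(nll_samples)) + kl_factorized_gaussians(
        mu_q, sig_q, mu_p, sig_p)
```

In practice the NLL samples would come from forward passes with weights drawn from q(θ); here they are treated as given to keep the sketch self-contained.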

2.2. VARIATIONAL DOMAIN-INVARIANT LEARNING

In domain generalization, let D = {D_i}_{i=1}^{|D|} = S ∪ T be a set of domains, where S and T denote the source and target domains respectively. S and T do not overlap but share the same label space. For each domain D_i ∈ D, we can define a joint distribution p(x_i, y_i) in the input space X and the output space Y. We aim to learn a model f : X → Y on the source domains S that generalizes well to the target domains T. The fundamental problem in domain generalization is robustness to the domain shift between source and target domains; that is, we aim to learn a model invariant to the distributional shift between them. In this work, we focus on the invariance property across domains rather than exploring general invariance properties (Nalisnick & Smyth, 2018). We therefore introduce a formal definition of domain invariance, which is easily incorporated as a criterion into the Bayesian framework to achieve domain-invariant learning. Provided that all domains in D lie in the same domain space, for any input sample x_s in domain D_s we assume there exists a domain-transform function g_ζ(·), parameterized by ζ ∼ q(ζ), that projects x_s to other domains D_ζ; a different ζ leads to a different post-transformation domain D_ζ. Usually the exact form of g_ζ(·) is not necessarily known. Under this assumption, we introduce the definition of domain invariance, which we will incorporate into the Bayesian layers of neural networks for domain-invariant learning.

Definition 2.1 (Domain Invariance). Let x_s be a given sample from domain D_s ∈ D, and let x_ζ = g_ζ(x_s) be a transformation of x_s in another domain D_ζ, where ζ ∼ q(ζ). Let p_θ(y|x) denote the output distribution of input x under model θ. The model θ is domain-invariant if

p_θ(y_s|x_s) = p_θ(y_ζ|x_ζ), ∀ζ ∼ q(ζ).    (3)

Here, we use y to represent the output of a neural layer with input x, which can be either the prediction vector from the last layer or the feature vector from the convolutional layers. To make the domain-invariant principle easier to implement, we extend Eq. (3) to an expectation form:

p_θ(y_s|x_s) = E_{q(ζ)}[p_θ(y_ζ|x_ζ)].    (4)

Based on this definition, we use the Kullback-Leibler divergence between the two terms in Eq. (4), D_KL[p_θ(y_s|x_s) || E_{q(ζ)}[p_θ(y_ζ|x_ζ)]], to quantify the domain invariance of the model; it is zero when the model is domain-invariant.
As in most cases there is no analytical form of the domain-transform function, and only a few samples from D_ζ are available, E_{q(ζ)}[p_θ(y_ζ|x_ζ)] is intractable. Thus, we derive the following upper bound of the divergence:

D_KL[p_θ(y_s|x_s) || E_{q(ζ)}[p_θ(y_ζ|x_ζ)]] ≤ E_{q(ζ)}[D_KL[p_θ(y_s|x_s) || p_θ(y_ζ|x_ζ)]],    (5)

which can be approximated by Monte Carlo sampling. We define the complete objective function of our variational invariant learning by combining Eq. (5) with Eq. (2). However, in Bayesian inference the likelihood is obtained by taking the expectation over the distribution of the parameters θ, i.e., p_θ(y|x) = E_{q(θ)}[p(y|x, θ)], which is also intractable in Eq. (5). As the KL divergence is a convex function (Nalisnick & Smyth, 2018), we further bound Eq. (5) from above:

E_{q(ζ)}[D_KL[p_θ(y_s|x_s) || p_θ(y_ζ|x_ζ)]] ≤ E_{q(ζ)}E_{q(θ)}[D_KL[p(y_s|x_s, θ) || p(y_ζ|x_ζ, θ)]],    (6)

which is tractable with an unbiased Monte Carlo approximation. The complete derivations of Eq. (5) and Eq. (6) are provided in Appendix A. It is worth noting that the domain-transformation distribution q(ζ) is implicit and inexpressible in practice, and only a limited number of domains are available; the problem is exacerbated because the target domains are unseen during training, which further limits the number of available domains. Moreover, in most domain generalization datasets, for a given sample x_s from domain D_s there is no corresponding transformation of x_s in the other domains, which prevents the expectation with respect to q(ζ) from being directly tractable in general. We therefore resort to an empirically tractable implementation and adopt an episodic setting as in Li et al. (2019). In each episode, we choose one domain from the source domains S as the meta-source domain D_s and use the rest as the meta-target domains {D_t}_{t=1}^{T}.
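The inequality in Eq. (5) is an instance of Jensen's inequality, since the KL divergence is convex in its second argument. A small numerical sketch (our own, using categorical distributions as stand-ins for the predictive distributions) confirms that the mixture KL never exceeds the expected KL:

```python
import numpy as np

def kl_categorical(p, q):
    # KL divergence between two categorical distributions (strictly positive).
    return float(np.sum(p * np.log(p / q)))

rng = np.random.default_rng(0)
p_s = rng.dirichlet(np.ones(5))              # predictive dist. on meta-source
p_zeta = rng.dirichlet(np.ones(5), size=32)  # dists. over transformed domains

# Left side: KL against the mixture E_zeta[p_theta(y_zeta | x_zeta)].
lhs = kl_categorical(p_s, p_zeta.mean(axis=0))
# Right side: the expected KL, i.e. the tractable upper bound of Eq. (5).
rhs = float(np.mean([kl_categorical(p_s, q) for q in p_zeta]))
assert lhs <= rhs
```

The gap between the two sides shrinks as the transformed-domain distributions become more similar to one another, which is consistent with the bound being tight for a domain-invariant model.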
To achieve variational invariant learning in the Bayesian framework, we use samples from the meta-target domains in the same category as x_s to approximate samples of g_ζ(x_s). We then obtain a general loss function for domain-invariant learning:

L_I = (1/T) Σ_{t=1}^{T} (1/N) Σ_{i=1}^{N} E_{q(θ)}[D_KL[p(y_s|x_s, θ) || p(y_t^i|x_t^i, θ)]],    (7)

where {x_t^i}_{i=1}^{N} are samples from D_t in the same category as x_s. More details and an illustration of the domain-invariant loss function can be found in Appendix B. With the aforementioned loss functions, we develop the loss function of variational invariant learning for domain generalization:

L_VIL = L_Bayes + λ L_I.    (8)

Our variational invariant learning combines the Bayesian framework, which introduces uncertainty into the network and is beneficial for out-of-distribution problems (Daxberger & Hernández-Lobato, 2019), with a domain-invariant loss function L_I, which is designed on predictive distributions to make the model generalize better to unseen target domains. For Bayesian learning, it has been demonstrated that being just "a bit" Bayesian in the last layer of a neural network can already represent prediction uncertainty well (Kristiadi et al., 2020). This indicates that applying the Bayesian treatment only to the last layers already brings substantial benefits of Bayesian inference. Although applying Bayesian inference to more layers can improve performance, it also increases the computational cost. Further, from the perspective of domain invariance, making both the classifier and the feature extractor more robust to domain shift also leads to better performance (Li et al., 2019). Thus, there is a trade-off between the benefits of variational Bayesian domain invariance and computational efficiency.
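The empirical form of Eq. (7), for one sampled θ, averages same-class KL divergences over meta-target domains and samples. A minimal sketch (our own layout: p_s is the meta-source predictive distribution, p_t stacks the meta-target ones, all assumed strictly positive):

```python
import numpy as np

def invariance_loss(p_s, p_t):
    # Empirical L_I of Eq. (7) on predictive distributions:
    # mean over T meta-target domains and N same-class samples of
    # KL(p(y_s|x_s) || p(y_t^i|x_t^i)).
    # p_s: shape (C,); p_t: shape (T, N, C).
    kl = np.sum(p_s * np.log(p_s / p_t), axis=-1)  # broadcasts over (T, N)
    return float(kl.mean())
```

The loss is exactly zero when every meta-target prediction matches the meta-source prediction, i.e. when the model is domain-invariant in the sense of Definition 2.1.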
Instead of applying the Bayesian principle to all layers of the neural network, in this work we apply it only to the classifier layer ψ and the last feature-extraction layer φ. In this case, L_Bayes in Eq. (2) becomes the ELBO with respect to ψ and φ jointly. As they are independent, L_Bayes is expressed as

L_Bayes = −E_{q(ψ)}E_{q(φ)}[log p(y|x, ψ, φ)] + D_KL[q(ψ) || p(ψ)] + D_KL[q(φ) || p(φ)].    (9)

This variational inference objective allows us to explore domain-invariant representations and a domain-invariant classifier in a unified way. The detailed derivation of Eq. (9) is provided in Appendix A.

Domain-Invariant Classifier

To establish the domain-invariant classifier, we directly incorporate the proposed domain-invariant principle into the last layer of the network, which gives rise to

L_I(ψ) = (1/T) Σ_{t=1}^{T} (1/N) Σ_{i=1}^{N} E_{q(ψ)}[D_KL[p(y_s|z_s, ψ) || p(y_t^i|z_t^i, ψ)]],    (10)

where z denotes the feature representation of input x, and the subscripts s and t indicate the meta-source and meta-target domains as in Eq. (7). Since p(y|z, ψ) is a Bernoulli distribution, the KL divergence in Eq. (10) is convenient to compute.

Domain-Invariant Representations

To also make the representations domain-invariant, we have

L_I(φ) = (1/T) Σ_{t=1}^{T} (1/N) Σ_{i=1}^{N} E_{q(φ)}[D_KL[p(z_s|x_s, φ) || p(z_t^i|x_t^i, φ)]],    (11)

where φ are the parameters of the feature extractor. Since the feature extractor is also a Bayesian layer, the distribution p(z|x, φ) is a factorized Gaussian if the posterior of φ is as well. We illustrate this as follows. Let x be the input feature of a Bayesian layer φ with a factorized Gaussian posterior; the posterior of the activation z of the Bayesian layer is then also a factorized Gaussian (Kingma et al., 2015):

q(φ_{i,j}) = N(μ_{i,j}, σ²_{i,j}) ∀ φ_{i,j} ∈ φ  ⇒  p(z_j|x, φ) = N(γ_j, δ²_j),
γ_j = Σ_{i=1}^{N} x_i μ_{i,j},  δ²_j = Σ_{i=1}^{N} x²_i σ²_{i,j},    (12)

where z_j denotes the j-th element of z, likewise for x_i, and φ_{i,j} denotes the element at position (i, j) in φ. Based on this property of the Bayesian framework, we assume that the posterior of our variational invariant feature extractor is a factorized Gaussian, which simplifies the calculation of the KL divergence in Eq. (11). Note that with the domain-invariant representations, z in Eq. (10) corresponds to samples from the feature-representation distributions: z_s ∼ p(z_s|x_s, φ) and z_t ∼ p(z_t|x_t, φ).
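The moment propagation in Eq. (12) is a one-liner per moment for a linear Bayesian layer. A minimal sketch under our own naming (mu_w and sig_w hold the per-weight posterior mean and standard deviation):

```python
import numpy as np

def propagate_gaussian(x, mu_w, sig_w):
    # For a Bayesian linear layer with q(phi_ij) = N(mu_ij, sigma_ij^2),
    # the activation z_j is Gaussian with
    #   gamma_j   = sum_i x_i   * mu_ij
    #   delta_j^2 = sum_i x_i^2 * sigma_ij^2
    gamma = x @ mu_w
    delta2 = (x ** 2) @ (sig_w ** 2)
    return gamma, delta2
```

With both p(z_s|x_s, φ) and p(z_t^i|x_t^i, φ) in this factorized Gaussian form, the KL divergence in Eq. (11) reduces to the closed-form Gaussian KL per dimension.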

2.3. OBJECTIVE FUNCTION

The objective function of our variational invariant learning is defined as:

L_VIL = L_Bayes + λ_ψ L_I(ψ) + λ_φ L_I(φ),    (13)

where λ_ψ and λ_φ are hyperparameters that control the domain-invariant terms. We adopt Monte Carlo sampling and obtain the empirical objective function for variational invariant learning:

L_VIL = (1/L) Σ_{l=1}^{L} (1/M) Σ_{m=1}^{M} [−log p(y_s|x_s, ψ^(l), φ^(m))] + D_KL[q(ψ) || p(ψ)] + D_KL[q(φ) || p(φ)]
      + λ_ψ (1/T) Σ_{t=1}^{T} (1/N) Σ_{i=1}^{N} (1/L) Σ_{l=1}^{L} D_KL[p(y_s|z_s, ψ^(l)) || p(y_t^i|z_t^i, ψ^(l))]
      + λ_φ (1/T) Σ_{t=1}^{T} (1/N) Σ_{i=1}^{N} (1/M) Σ_{m=1}^{M} D_KL[p(z_s|x_s, φ^(m)) || p(z_t^i|x_t^i, φ^(m))],    (14)

where x_s and z_s denote an input and its feature from D_s, respectively, and x_t^i and z_t^i are from D_t as in Eq. (7). The posteriors are set to factorized Gaussian distributions, i.e., q(ψ) = N(μ_ψ, σ²_ψ) and q(φ) = N(μ_φ, σ²_φ). We adopt the reparameterization trick to draw Monte Carlo samples (Kingma & Welling, 2014): ψ^(l) = μ_ψ + ε^(l) ∗ σ_ψ, where ε^(l) ∼ N(0, I); the samples φ^(m) are drawn in the same way. In our implementation, to increase the flexibility of the prior distributions in the Bayesian layers, we place a scale mixture of two Gaussians as the priors p(ψ) and p(φ) (Blundell et al., 2015):

π N(0, σ²_1) + (1 − π) N(0, σ²_2),    (15)

where σ_1, σ_2 and π are hyperparameters chosen by cross-validation.
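Two of the ingredients above are easy to sketch concretely (our own NumPy rendering, not the authors' code): the reparameterized weight draw ψ^(l) = μ_ψ + ε^(l) ∗ σ_ψ, and the log density of the scale-mixture prior in Eq. (15). The default π = 0.5, σ_1 = 0.1 and σ_2 = 1.5 below follow the hyperparameter values reported in the experiments:

```python
import numpy as np

def reparameterize(mu, sigma, rng):
    # psi = mu + eps * sigma with eps ~ N(0, I); the sample is a
    # differentiable function of (mu, sigma), enabling gradient training.
    return mu + rng.standard_normal(mu.shape) * sigma

def log_scale_mixture_prior(w, pi=0.5, sigma1=0.1, sigma2=1.5):
    # log of pi * N(w; 0, sigma1^2) + (1 - pi) * N(w; 0, sigma2^2),
    # summed over all weights, computed stably with logaddexp.
    def log_normal(w, s):
        return -0.5 * np.log(2.0 * np.pi * s ** 2) - w ** 2 / (2.0 * s ** 2)
    return float(np.sum(np.logaddexp(np.log(pi) + log_normal(w, sigma1),
                                     np.log(1.0 - pi) + log_normal(w, sigma2))))
```

The narrow component (σ_1) pulls many weights tightly toward zero while the wide component (σ_2) gives the prior a heavier tail, as discussed in the hyperparameter ablation.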

3. RELATED WORK

One solution for domain generalization is to generate more source domain data to increase the probability of covering the data in the target domains (Shankar et al., 2018; Volpi et al., 2018).

Parameters σ_1 and σ_2 in Eq. (15) are set to 0.1 and 1.5. The model with the highest validation-set accuracy is employed for evaluation on the target domain. All code will be made publicly available.

4.2. ABLATION STUDY

We conduct an ablation study to investigate the effectiveness of our variational invariant learning for domain generalization. The experiments are performed on the PACS dataset. Since the major contributions of this work are the Bayesian treatment and the domain-invariant principle, we evaluate their effect by individually incorporating them into the classifier ψ (the last layer) and the feature extractor φ (the penultimate layer). The results are shown in Table 1. The "✓" and "×" in the "Bayesian" column denote whether the classifier ψ and the feature extractor φ are Bayesian or deterministic layers; in the "Invariant" column they denote whether the domain-invariant loss is applied to the classifier and the feature extractor. Note that even without the Bayesian treatment the predictive distribution is a Bernoulli distribution, which also admits the domain-invariant loss; we therefore include this case for a comprehensive comparison. In Table 1, the first four rows demonstrate the benefit of the Bayesian treatment. The first row (a) serves as the baseline model: a vanilla deep convolutional network without any Bayesian treatment or domain-invariant loss, whose backbone is also a ResNet-18 pretrained on ImageNet. The Bayesian treatment, whether for the classifier (b) or the feature extractor (c), clearly improves performance, especially in the "Art-painting" and "Sketch" domains; this is further demonstrated in (d), where we employ the Bayesian classifier and feature extractor simultaneously. The benefit of the domain-invariant principle for the classifier is demonstrated by comparing (e) to (a) and (f) to (b): the settings with domain invariance consistently outperform those without it. A similar trend is observed when applying the domain-invariant principle to the feature extractor, as shown by comparing (g) to (c).
Overall, our variational invariant learning (h) achieves the best performance of all variants, demonstrating its effectiveness for domain generalization. Note that without the Bayesian formalism the feature distributions p(z|x) are unknown, leaving L_I(φ) intractable; we therefore do not conduct the experiment with only the domain-invariant loss on both the classifier and the feature extractor. To further demonstrate the domain-invariant property of our method, we visualize the features learned by the different settings in Table 1 using t-SNE (Maaten & Hinton, 2008); the results are shown in Fig. 1. Comparing the two figures in each column indicates that our domain-invariant principle, imposed on either the representation or the classifier, further enlarges the inter-class distances. At the same time, it reduces the distance between samples of the same class from different domains; this is even more apparent in the intra-class distance between samples from the source and target domains. As a result, the inter-class distances in the target domain become larger, improving performance. It is worth noting that the domain-invariant principle on the classifier in Fig. 1 (f) and on the feature extractor in Fig. 1 (g) both improve the domain-invariant features; our variational invariant learning in Fig. 1 (h) therefore performs better by combining their benefits. We also experiment with more layers in the feature extractor, see Table 2. "Bayesian φ′" and "Invariant φ′" denote whether the additional feature-extraction layer φ′ has the Bayesian property and the domain-invariant property; the classifiers have both properties in all cases in Table 2. The first row is the setting with only one variational invariant layer in the feature extractor. When introducing another Bayesian layer φ′ without the domain-invariant property into the model, as shown in the second row of Table 2, the average performance improves slightly.
If we introduce both Bayesian learning and domain-invariant learning into φ′, as shown in the third row, the overall performance declines slightly. One reason might be the information loss in feature representations caused by excessive use of domain-invariant learning. In addition, due to Bayesian inference and Monte Carlo sampling, more variational invariant layers lead to higher memory usage and more computation, which is another reason we apply variational invariant learning only to the last feature-extraction layer and the classifier.

4.3. STATE-OF-THE-ART COMPARISON

In this section, we compare our method with several state-of-the-art methods on four datasets. The results are reported in Tables 3, 4 and 5. The baselines on PACS (Table 3), Office-Home (Table 4), and rotated MNIST and Fashion-MNIST (Table 5) are all based on the same vanilla deep convolutional ResNet-18 network without any Bayesian treatment, the same as row (a) in Table 1. On PACS, as shown in Table 3, our variational invariant learning method achieves the best overall performance. On each domain, our performance is competitive with the state-of-the-art, and we exceed all other methods on the "Cartoon" domain. On Office-Home, as shown in Table 4, we again achieve the best recognition accuracy. It is worth mentioning that on the most challenging "Art" and "Clipart" domains, our variational invariant learning also delivers the highest performance, with a good improvement over previous methods. L2A-OT and DSON outperform the proposed model on some domains of PACS and Office-Home. L2A-OT learns a generator to synthesize data from pseudo-novel domains to augment the source domains. The pseudo-novel domains often have similar characteristics to the source data, so this pays off when the target data also share these characteristics and the pseudo domains are more likely to cover the target domain, e.g., "Product" and "Real World" in Office-Home and "Photo" in PACS. When the test domain differs from all of the training domains, performance suffers, e.g., "Clipart" in Office-Home and "Sketch" in PACS. Our method generates domain-invariant representations and classifiers, resulting in competitive results across all domains and overall. DSON mixes batch and instance normalization for domain generalization. This tactic is effective on PACS, but less competitive on Office-Home.
We attribute this to the larger number of categories on Office-Home, where instance normalization is known to make features less discriminative with respect to object categories (Seo et al., 2019) . Our domain-invariant network makes feature distributions and predictive distributions similar across domains, resulting in good performance on both PACS and Office-Home. On the Rotated MNIST and Fashion-MNIST datasets, following the experimental settings in (Piratla et al., 2020), we evaluate our method on the in-distribution and out-of-distribution sets. As shown in Table 5 , our VIL achieves the best performance on both sets of the two datasets, surpassing other methods. Moreover, our method especially improves the classification performance on the out-of-distribution sets, demonstrating its strong generalizability to unseen domains, which is also consistent with the findings in Fig. 1 .

5. CONCLUSION

In this work, we propose variational invariant learning (VIL), a variational Bayesian learning framework for domain generalization. We introduce Bayesian neural networks into the model, which better represent uncertainty and enhance generalization to out-of-distribution data.

A DERIVATIONS

A.1 DERIVATIONS OF THE UPPER BOUNDS

Since E_{q(ζ)}[p_θ(y_ζ|x_ζ)] is intractable, we derive the upper bound in Eq. (5), which is achieved via Jensen's inequality:

D_KL[p_θ(y_s|x_s) || E_{q(ζ)}[p_θ(y_ζ|x_ζ)]]
= E_{p_θ(y_s|x_s)}[log p_θ(y_s|x_s)] − E_{p_θ(y_s|x_s)}[log E_{q(ζ)}[p_θ(y_ζ|x_ζ)]]
≤ E_{p_θ(y_s|x_s)}[log p_θ(y_s|x_s)] − E_{p_θ(y_s|x_s)}E_{q(ζ)}[log p_θ(y_ζ|x_ζ)]
= E_{q(ζ)}[D_KL[p_θ(y_s|x_s) || p_θ(y_ζ|x_ζ)]].    (16)

In Bayesian inference, computing the likelihood p_θ(y|x) = ∫ p(y|x, θ) q(θ) dθ = E_{q(θ)}[p(y|x, θ)] is notoriously difficult. Thus, using the fact that the KL divergence is a convex function, we obtain the upper bound in Eq. (6), achieved via Jensen's inequality similarly to Eq. (16):

E_{q(ζ)}[D_KL[p_θ(y_s|x_s) || p_θ(y_ζ|x_ζ)]]
= E_{q(ζ)}[D_KL[E_{q(θ)}[p(y_s|x_s, θ)] || E_{q(θ)}[p(y_ζ|x_ζ, θ)]]]
≤ E_{q(ζ)}E_{q(θ)}[D_KL[p(y_s|x_s, θ) || p(y_ζ|x_ζ, θ)]].    (17)

A.2 DERIVATION OF THE VARIATIONAL BAYESIAN APPROXIMATION FOR THE REPRESENTATION (φ) AND CLASSIFIER (ψ) LAYERS

We consider the model with two Bayesian layers, φ and ψ, as the last layer of the feature extractor and the classifier respectively. The prior distribution of the model is p(φ, ψ), and the true posterior distribution is p(φ, ψ|x, y). Following the setting in Section 2.1, we learn a variational distribution q(φ, ψ) to approximate the true posterior by minimizing the KL divergence from q(φ, ψ) to p(φ, ψ|x, y). By applying Bayes' rule, p(φ, ψ|x, y) ∝ p(y|x, φ, ψ) p(φ, ψ), the optimization is equivalent to minimizing:

L_Bayes = ∫∫ q(φ, ψ) log [q(φ, ψ) / (p(φ, ψ) p(y|x, φ, ψ))] dφ dψ
        = D_KL[q(φ, ψ) || p(φ, ψ)] − E_{q(φ,ψ)}[log p(y|x, φ, ψ)].

Since φ and ψ are independent,

L_Bayes = −E_{q(ψ)}E_{q(φ)}[log p(y|x, ψ, φ)] + D_KL[q(ψ) || p(ψ)] + D_KL[q(φ) || p(φ)].

B DETAILS OF DOMAIN-INVARIANT LOSS IN VIL TRAINING

We split the training phase of VIL into episodes. In each episode, as shown in Fig. 2, we randomly choose a source domain as the meta-source domain D_s, and the remaining source domains {D_t}_{t=1}^{T} are treated as the meta-target domains. From D_s we randomly select a batch of samples x_s. For each x_s, we then select N samples {x_t^i}_{i=1}^{N}, in the same category as x_s, from each meta-target domain D_t. All of these samples are sent to the variational invariant feature extractor φ to obtain the representations z_s and {z_t^i}_{i=1}^{N}, which are then sent to the variational invariant classifier ψ to obtain the predictions y_s and {y_t^i}_{i=1}^{N}. We obtain the domain-invariant loss for the feature extractor, L_I(φ), by averaging the KL divergences between z_s and each z_t^i as in Eq. (11). The domain-invariant loss for the classifier, L_I(ψ), is calculated in a similar way on y_s and {y_t^i}_{i=1}^{N} as in Eq. (10).
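The episode construction above can be sketched in a few lines. The data layout below (a mapping from domain name to per-class sample lists) is a hypothetical simplification of ours, not the authors' data pipeline:

```python
import random

def make_episode(domains, n_samples, rng=None):
    # Build one VIL episode: pick a meta-source domain at random, then pair
    # each of its per-class samples with same-class samples drawn from every
    # meta-target domain. `domains` maps name -> {class_label: [sample, ...]}.
    rng = rng or random.Random()
    names = sorted(domains)
    meta_source = rng.choice(names)
    meta_targets = [d for d in names if d != meta_source]
    episode = []
    for label, samples in domains[meta_source].items():
        x_s = rng.choice(samples)
        x_t = {t: rng.sample(domains[t][label],
                             min(n_samples, len(domains[t][label])))
               for t in meta_targets}
        episode.append((label, x_s, x_t))
    return meta_source, episode
```

In an actual training loop, the x_s batch and the same-class meta-target samples would then be passed through φ and ψ to evaluate L_I(φ) and L_I(ψ).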

C ABLATION STUDY FOR HYPERPARAMETERS

We also ablate the hyperparameters λ φ , λ ψ and π on PACS with cartoon as the target domain. 

D DETAILED ABLATION STUDY ON PACS

In addition to the aforementioned experiments, we conduct supplementary experiments with other settings on PACS to further demonstrate the effectiveness of Bayesian inference and the domain-invariant loss, as shown in Table 6. The evaluated components are the same as in Table 1. For easier comparison, we repeat the contents of Table 1 in Table 6 and add three further settings with IDs (i), (j) and (k). Note that since the distribution of the features z is unknown without a Bayesian feature extractor φ, the settings with L_I(φ) and a non-Bayesian feature extractor are intractable. Comparing (i) with (d), we find that applying Bayesian inference to the last layer of the feature extractor improves the overall performance and the classification accuracy on three of the four domains. Moreover, comparing (j) and (k) to (g) shows the benefits of introducing variational Bayes and the domain-invariant loss into the classifier on most of the domains and on average.

E EXTRA VISUALIZATION RESULTS

To further observe and analyze the benefits of the individual components of VIL for domain-invariant learning, we visualize the features of all categories from the target domain only in Fig. 4, and the features of a single category from all domains in Fig. 5. As in Fig. 1, the visualization is conducted on the PACS dataset with "art-painting" as the target domain. The chosen category in Fig. 5 is "horse".




Figure 1: Visualization of feature representations. The eight sub-figures correspond to the eight settings in Table 1 (identified by ID). Colors denote domains, while shapes indicate classes. The target domain (violet) is "art-painting". The top row shows the Bayesian treatment enlarges the inter-class distance for all domains, considerably. The bottom row, compared with the top-row figures in the same column, shows the domain-invariant principle enlarges the inter-class distance in the target domain by reducing the intra-class distances between the source and target domains.

Figure 2: Illustration of the domain-invariant loss in the training phase of VIL. S denotes the source domains, T denotes the target domains, and D = S ∪ T. x, z and y denote the inputs, features and outputs of samples in each domain. L_I(φ) and L_I(ψ) denote the domain-invariant loss functions for the representations and the classifier.

Results are shown in Fig. 3 (a), (b) and (c). We obtain Fig. 3 (a) by fixing λ_ψ to 100 and varying λ_φ, Fig. 3 (b) by fixing λ_φ to 1 and varying λ_ψ, and Fig. 3 (c) by varying π while fixing the other settings as in Section 4.1. λ_φ and λ_ψ balance the influence of Bayesian learning and domain-invariant learning; their optimal values are 1 and 100. If the values are too small, the model tends to overfit to the source domains, as the performance on target data drops more noticeably than on validation data. In contrast, values that are too large harm the overall performance of the model, with an obvious decrease in accuracy on both validation and target data. Moreover, π balances the two components of the scale-mixture prior of our Bayesian model. According to Blundell et al. (2015), the two components yield a prior density with a heavier tail while many weights concentrate tightly around zero; both are important. Performance is best when π is 0.5, as shown in Fig. 3 (c), which demonstrates the effectiveness of the two components in the scale-mixture prior.

Figure 3: Performance on "Cartoon" domain in PACS with different hyperparameters λ φ , λ ψ and π. The red line denotes the accuracy on validation data while the blue line denotes accuracy on target data. The optimal value of λ φ , λ ψ and π are 1, 100 and 0.5 respectively.

Fig. 4 provides a more intuitive view of the benefits of the Bayesian framework and domain-invariant learning in our method for enlarging the inter-class distance in the target domain. The conclusion is similar to that of Fig. 1. From the figures in the first row, it is clear that the Bayesian framework, whether in the classifier (b) or the feature extractor (c), increases the inter-class distance compared with the baseline method (a). With both of them (d), the separation improves further.

Figure 4: Visualization of feature representations of the target domain. Different colors denote different categories. The sub-figures follow the same experimental settings as in Table 1 and Fig. 1. Visualizing only the target-domain feature representations shows the benefits of the individual components for target-domain recognition more intuitively. The target domain is "art-painting", as in Fig. 1. We reach a conclusion similar to Section 4.2: Bayesian inference enlarges the inter-class distance for all domains, and the domain-invariant principle reduces the intra-class distance between the source and target domains.

Figure 5: Visualization of feature representations of one category. All samples are from the "horse" category with colors denoting different domains. The target domain is "art-painting" (violet). The top row shows Bayesian inference benefits domain generalization by gathering features from different domains to the same manifold. The figures in each column indicate domain-invariant learning reduces the intra-class distance between domains, resulting in better target domain performance.

Fig. 5 provides a deeper insight into the intra-class feature distributions of the same category from different domains. By introducing Bayesian inference into the model, the features reveal the manifold of the category, as shown in the first row ((b), (c) and (d)), which makes recognition easier. Indeed, the visualization of features from multiple categories shows similar properties, as in Fig. 1. As shown in each column, introducing domain-invariant learning into the model leads to a better mixture of features from different domains. The resulting domain-invariant representation makes the model more generalizable to unseen domains.

al., 2018; Volpi et al., 2018). Shankar et al. (2018) augmented the data by perturbing the input images with adversarial gradients generated by an auxiliary classifier. Qiao et al. (2020) proposed a more challenging scenario of domain generalization, named single domain generalization, which has only one source domain, and designed an adversarial domain augmentation method to create "fictitious" yet "challenging" data. Our method introduces variational Bayesian approximation into both the feature extractor and classifier of the neural network, in conjunction with the newly introduced domain-invariant principle for domain generalization. The resultant variational invariant learning combines the representational power of deep neural networks and variational Bayesian inference.

PACS (Li et al., 2017a) consists of 9,991 images of seven classes from four domains: photo, art-painting, cartoon and sketch. We follow the "leave-one-out" protocol in (Li et al., 2017a; 2018b; Carlucci et al., 2019), where the model is trained on any three of the four domains, which we call source domains, and tested on the last (target) domain.

Office-Home (Venkateswara et al., 2017) also has four domains: art, clipart, product and real-world. There are about 15,500 images of 65 categories for object recognition in office and home environments. We use the same experimental protocol as for PACS.

Rotated MNIST and Fashion-MNIST were introduced for evaluating domain generalization in (Piratla et al., 2020). For a fair comparison, we follow their recommended settings and randomly select a subset of 2,000 images from MNIST and 10,000 images from Fashion-MNIST, which is considered to have been rotated by 0°. The subset of images is then rotated by 15° through 75° in intervals of 15°, creating five source domains. The target domains are created by rotation angles of 0° and 90°.
We use these two datasets to demonstrate generalizability by comparing the performance on in-distribution and out-of-distribution data.

For all four benchmarks, we use ResNet-18 (He et al., 2016) pretrained on ImageNet (Deng et al., 2009) as the CNN backbone. During training, we use Adam optimization (Kingma & Ba, 2014) with the learning rate set to 0.0001, and train for 10,000 iterations. In each iteration we choose one source domain as the meta-source domain. The batch size is 128. To fit the memory footprint, we choose a maximum number of samples per category per target domain to implement the domain-invariant learning, i.e., sixteen for the PACS, Rotated MNIST and Fashion-MNIST datasets, and four for the Office-Home dataset. We choose λφ and λψ based on the performance on the validation set; their influence is summarized in Fig. 3 in Appendix C. The optimal values of λφ and λψ are 0.1 and 100, respectively. Parameters σ1 and σ2 in Eq. (
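The rotated-digit protocol and the per-iteration meta-source sampling described above can be sketched as follows. This is a minimal illustration only: the `rotate` callable, function names, and toy "images" are our own placeholders, not the paper's code:

```python
import random

SOURCE_ANGLES = [15, 30, 45, 60, 75]   # five source domains
TARGET_ANGLES = [0, 90]                # held-out target domains

def build_domains(base_images, rotate):
    """Create one domain per rotation angle from the 0-degree subset.

    `rotate` is any callable mapping (image, angle) -> rotated image,
    e.g. a wrapper around an image-processing library.
    """
    sources = {a: [rotate(img, a) for img in base_images] for a in SOURCE_ANGLES}
    targets = {a: [rotate(img, a) for img in base_images] for a in TARGET_ANGLES}
    return sources, targets

def pick_meta_source(source_domains, rng=random):
    """Each training iteration treats one source domain as the meta-source."""
    return rng.choice(sorted(source_domains))

# Toy run with trivial 'images' (ints) and a no-op rotation.
sources, targets = build_domains(list(range(4)), rotate=lambda img, a: img)
meta = pick_meta_source(sources)
print(meta in SOURCE_ANGLES)  # True
```

In an actual run, `base_images` would be the 2,000-image MNIST (or 10,000-image Fashion-MNIST) subset, and evaluation would use the 0° and 90° target domains only.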

Ablation study on PACS. All the individual components of our variational invariant learning benefit domain generalization performance. More comparisons can be found in Appendix D.

Ablation with two variational invariant layers in the feature extractor on PACS. Bayesian φ and Invariant φ denote whether the additional variational invariant layer in the feature extractor has the Bayesian property and the domain-invariant property, respectively. More Bayesian layers benefit the performance, while excessive domain-invariant learning harms it.

Comparison on PACS. Our method achieves the best performance on the "Cartoon" domain, is competitive on the other three domains, and obtains the best overall mean accuracy.

Comparison on Office-Home. Our variational invariant learning achieves the best performance on the "Art" and "Clipart" domains, while being competitive on the "Product" and "Real" domains. Again we report the best overall mean accuracy.

Comparison on Rotated MNIST and Fashion-MNIST. In-distribution performance is evaluated on the test sets of MNIST and Fashion-MNIST with rotation angles of 15°, 30°, 45°, 60° and 75°, while the out-of-distribution performance is evaluated on test sets with angles of 0° and 90°. Our VIL achieves the best performance on both the in-distribution and out-of-distribution test sets.

To handle the domain shift between source and target domains, we propose a domain-invariant principle under the variational inference framework, which is incorporated by establishing a domain-invariant feature extractor and classifier. Our variational invariant learning combines the representational power of deep neural networks with the uncertainty modeling ability of Bayesian learning, showing great effectiveness for domain generalization. Extensive ablation studies demonstrate the benefits of Bayesian inference and the domain-invariant principle for domain generalization. Our variational invariant learning sets a new state-of-the-art on four domain generalization benchmarks.
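The variational Bayesian layers summarized above can be illustrated with a mean-field Gaussian layer trained via the reparameterization trick. The sketch below is a generic NumPy illustration of this standard construction, not the paper's implementation; the class and method names are ours:

```python
import numpy as np

rng = np.random.default_rng(0)

class VariationalLinear:
    """Linear layer with a factorized Gaussian posterior over its weights.

    Sampling fresh weights on every forward pass is what lets such
    Bayesian layers represent prediction uncertainty.
    """
    def __init__(self, d_in, d_out):
        self.mu = rng.normal(0.0, 0.1, (d_in, d_out))   # posterior means
        self.log_var = np.full((d_in, d_out), -6.0)     # posterior log-variances

    def forward(self, x):
        # Reparameterization trick: w = mu + sigma * eps, with eps ~ N(0, I),
        # keeps the sampling step differentiable w.r.t. mu and log_var.
        eps = rng.standard_normal(self.mu.shape)
        w = self.mu + np.exp(0.5 * self.log_var) * eps
        return x @ w

    def kl_to_standard_normal(self):
        # Closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over weights;
        # this term regularizes the posterior toward the prior in the ELBO.
        var = np.exp(self.log_var)
        return 0.5 * np.sum(var + self.mu ** 2 - 1.0 - self.log_var)

layer = VariationalLinear(8, 3)
out = layer.forward(rng.standard_normal((5, 8)))
print(out.shape)  # (5, 3)
```

In practice a training objective would combine the task loss on sampled predictions with the KL term (and, in our setting, the domain-invariant terms weighted by λφ and λψ).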

More detailed ablation study on PACS. Compared to Table 1, we add three more settings with IDs (i), (j) and (k). All the individual components of our variational invariant learning benefit domain generalization performance.

Appendix

We also visualize the features on the Rotated MNIST and Fashion-MNIST datasets, as shown in Fig. 6. Different shapes denote different categories. Red samples denote features from the in-distribution set and blue samples denote features from the out-of-distribution set. Compared with the baseline, our method reduces the intra-class distance between samples from the in-distribution set and the out-of-distribution set, and clusters the out-of-distribution samples of the same categories better, especially on the rotated Fashion-MNIST dataset.

