DOMAIN GENERALISATION VIA DOMAIN ADAPTATION: AN ADVERSARIAL FOURIER AMPLITUDE APPROACH

Abstract

We tackle the domain generalisation (DG) problem by posing it as a domain adaptation (DA) task: we adversarially synthesise the worst-case 'target' domain and adapt a model to that worst-case domain, thereby improving the model's robustness. To synthesise data that is challenging yet semantics-preserving, we generate Fourier amplitude images and combine them with source-domain phase images, exploiting the widely held conjecture from signal processing that amplitude spectra mainly determine image style, while phase data mainly capture image semantics. To synthesise a worst-case domain for adaptation, we train the classifier and the amplitude generator adversarially. Specifically, we exploit the maximum classifier discrepancy (MCD) principle from DA, which relates target-domain performance to the discrepancy of classifiers in the model hypothesis space. Through Bayesian hypothesis modeling, we express the model hypothesis space effectively as a posterior distribution over classifiers given the source domains, making adversarial MCD minimisation feasible. On the DomainBed benchmark, including the large-scale DomainNet dataset, the proposed approach yields significantly improved domain generalisation performance over the state of the art.

1. INTRODUCTION

Contemporary machine learning models perform well when training and testing data are identically distributed. In practice, however, it is often impossible to obtain an unbiased sample of real-world data for training, and therefore distribution shift inevitably exists between training and deployment. Performance can degrade dramatically under such domain shift (Koh et al., 2021), and this is often the cause of poor performance in real-world deployments (Geirhos et al., 2020). This important issue has motivated a large amount of research into domain generalisation (DG) (Zhou et al., 2021a), which addresses training models with increased robustness to distribution shift. These DG approaches span a diverse set of strategies, including architectural innovations (Chattopadhyay et al., 2020), novel regularisation (Balaji et al., 2018), alignment (Sun & Saenko, 2016) and learning (Li et al., 2019) objectives, and data augmentation (Zhou et al., 2021b) to make the available training data more representative of potential testing data. However, the problem remains essentially unsolved, especially as measured by recent carefully designed benchmarks (Gulrajani & Lopez-Paz, 2021). Our approach is related to existing lines of work on data-augmentation solutions to DG (Zhou et al., 2021b; Shankar et al., 2018), which synthesise more data for model training, and on alignment-based approaches to Domain Adaptation (Sun & Saenko, 2016; Saito et al., 2018), which adapt a source model to an unlabeled target set but cannot address the DG problem where the target set is unavailable. We improve on both by providing a unified framework for stronger data synthesis and domain alignment. Our framework combines two key innovations: a Bayesian approach to maximum classifier discrepancy, and a Fourier-analysis approach to data augmentation. We start from the perspective of maximum classifier discrepancy (MCD) from domain adaptation (Ben-David et al., 2007; 2010; Saito et al., 2018).
MCD bounds the target-domain error as a function of the discrepancy between multiple source-domain classifiers. It is not obvious how to apply MCD to the DG problem, where we have no access to target-domain data. A key insight is that MCD provides a principled objective that we can maximise in order to synthesise a worst-case target domain, and minimise in order to train a model that is adapted to that worst-case domain. Specifically, we take a Bayesian approach that learns a distribution over source-domain classifiers, with which we can compute MCD. This simplifies the model by eliminating the need for the adversarial classifier training of previous applications of MCD (Saito et al., 2018), which leaves us free to adversarially train the worst-case target domain. To enable challenging worst-case augmentations without the risk of altering image semantics, our augmentation strategy operates in the Fourier amplitude domain. It synthesises amplitude images, which are combined with phase images from source-domain data to produce images that are substantially different in style (amplitude) while retaining the original semantics (phase). Our overall strategy, termed Adversarial Generation of Fourier Amplitude (AGFA), is illustrated in Fig. 1. In summary, we make the following main contributions: (1) We provide a novel and principled perspective on DG by drawing upon the MCD principle from DA. (2) We provide AGFA, an effective algorithm for DG based on variational Bayesian learning of the classifier and Fourier-based synthesis of the worst-case domain for robust learning. (3) Our empirical results show clear improvement over the previous state of the art on the rigorous DomainBed benchmark.

2. PROBLEM SETUP AND BACKGROUND

We follow the standard setup for the Domain Generalisation (DG) problem. As training data, we are given labeled data S = {(x, y) | (x, y) ∼ D_i, i = 1, ..., N} where x ∈ X and y ∈ Y = {1, ..., C}. Although the source data S consists of different domains {D_i}_{i=1}^N with domain labels available, we simply take their union without using the originating domain labels. This is because in practice the number of domains (N) is typically small, and it is rarely possible to estimate a meaningful population distribution for the empirical S from a few different domains. What distinguishes DG from the closely related (unsupervised) Domain Adaptation (DA) is that the target domain (T), on which the model's prediction performance is measured, is unknown in DG, whereas in DA the input data x from the target domain are revealed (without class labels y). Below we briefly summarise the MCD principle and Ben-David's theorem, one of the key theorems in DA, as we exploit them to tackle DG.

Ben-David's theorem and the MCD principle in DA. In unsupervised DA, Ben-David's theorem (Ben-David et al., 2010; 2007) provides an upper bound for the target-domain generalisation error of a model (hypothesis). We focus on the tighter-bound version, which states that for any classifier h in the hypothesis space H = {h | h : X → Y}, the following holds (without the sampling error term):

e_T(h) ≤ e_S(h) + sup_{h,h'∈H} |d_S(h, h') − d_T(h, h')| + e*(H; S, T),   (1)

where e_S(h) := E_{(x,y)∼S}[I(h(x) ≠ y)] is the error rate of h(·) on the source domain S, d_S(h, h') := E_{x∼S}[I(h(x) ≠ h'(x))] denotes the discrepancy between two classifiers h and h' on S (similarly for e_T(h) and d_T(h, h')), and e*(H; S, T) := min_{h∈H} (e_S(h) + e_T(h)). Thus we can provably reduce the target-domain generalisation error by simultaneously minimising the three terms in the upper bound, namely the source-domain error e_S(h), the classifier discrepancy, and the minimal source-target error.
Previous approaches (Saito et al., 2018; Kim et al., 2019) focus on the first two terms: they minimise the source error e_S(h) to form a source-confined hypothesis space H|_S (the subset of classifiers in H that perform well on S), and minimise the maximum classifier discrepancy (MCD) loss on T:

MCD(H|_S; T) := sup_{h,h'∈H|_S} d_T(h, h') = sup_{h,h'∈H|_S} E_{x∼T}[I(h(x) ≠ h'(x))].   (2)

This suggests the MCD learning principle: we need to minimise both the error on S (so as to form the source-confined hypothesis space H|_S) and the MCD loss on T. Note, however, that the last term e* is not considered in (Saito et al., 2018; Kim et al., 2019), mainly due to the difficulty of estimating the target-domain error. We will incorporate e* in our DG algorithm as described in the next section. We conclude the section by briefly reviewing how the MCD learning principle was exploited in previous works. In (Saito et al., 2018), two classifier networks h(x) = g(φ(x)) and h'(x) = g'(φ(x)) are explicitly introduced. The classification heads g, g' and the feature extractor φ are cooperatively updated to minimise the error on S (thus implicitly obtaining H|_S), and they are updated adversarially on T: the MCD loss is maximised with respect to g and g', and minimised with respect to φ. In (Kim et al., 2019), a Gaussian process (GP) classifier is built on the feature space φ(x), in which H|_S is attained by GP posterior inference. Minimisation of the MCD term is then accomplished by maximum margin learning, which essentially enforces minimal overlap between the two largest posterior modes. Note that the strategy of (Saito et al., 2018) requires adversarial optimisation, and hence it is less suitable for our DG algorithm, which will require adversarial generator learning: having two adversarial learning components would make training difficult, since we would need to find two nested equilibrium (saddle) points. We instead adopt the Bayesian hypothesis modeling approach of (Kim et al., 2019). In the next section, we describe our approach in greater detail.
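As a concrete illustration of the discrepancy quantities above, the following sketch estimates d(h, h') and the MCD from finite samples of classifier predictions. The helper names `discrepancy` and `mcd` are our own illustrative choices, not part of the paper's implementation, and the supremum is approximated by a maximum over a finite set of sampled classifiers:

```python
import numpy as np

def discrepancy(preds_h, preds_h2):
    """Empirical d(h, h'): fraction of inputs on which two classifiers disagree."""
    return float(np.mean(np.asarray(preds_h) != np.asarray(preds_h2)))

def mcd(all_preds):
    """Empirical MCD: approximate sup_{h,h'} d(h, h') by the maximum pairwise
    discrepancy over a finite set of sampled classifiers' predictions."""
    best = 0.0
    for i in range(len(all_preds)):
        for j in range(i + 1, len(all_preds)):
            best = max(best, discrepancy(all_preds[i], all_preds[j]))
    return best
```

In DA this quantity is evaluated on unlabeled target inputs, since d(h, h') compares classifier outputs only and never touches ground-truth labels.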

3. ADVERSARIAL GENERATION OF FOURIER AMPLITUDE (AGFA)

Defining and optimising a hypothesis space. Our DG approach aims to minimise the MCD loss MCD(H|_S; T) defined in (2). The first challenge is that the target domain data T is not available in DG. Before we address it, we clarify the optimisation problem (i.e., what is the MCD loss optimised over?) and how the hypothesis spaces (H and H|_S) are represented. The MCD loss is a function of the hypothesis space H (or H|_S), not of an individual classifier h in it. Hence, minimising the MCD loss amounts to choosing the best hypothesis space H. To this end, we need to parametrise the hypothesis space (so as to frame the problem as continuous optimisation), and our choice is a Bayesian linear classifier with a deterministic feature extractor. We consider the conventional feed-forward neural-network classifier: a feature extractor network φ_θ(x) ∈ R^d (with weight parameters θ) followed by a linear classification head W = [w_1, ..., w_C] (C-way classification, each w_j ∈ R^d), where class prediction is done by the softmax likelihood:

P(y = j | x, θ, W) ∝ e^{w_j^⊤ φ_θ(x)},  j = 1, ..., C.   (3)

So each configuration (θ, W) specifies a particular classifier h. To parametrise the hypothesis space H (∋ h), ideally we could consider a parametric family of distributions over (θ, W): each distribution P_β(θ, W), specified by the parameter β, corresponds to a particular hypothesis space H, and each sample (θ, W) ∼ P_β(θ, W) corresponds to a particular classifier h ∈ H. Although this is conceptually simple, to obtain a tractable model in practice we define θ to be deterministic parameters and only W to be stochastic. A reasonable choice for P(W), without any prior knowledge, is the standard Gaussian,

P(W) = ∏_{j=1}^C N(w_j; 0, I).   (4)

Now we can represent a hypothesis space as H = {P(y|x, θ, W) | W ∼ P(W)}. Thus H is parametrised by θ, and with θ fixed (H fixed), each sample W from P(W) instantiates a classifier h ∈ H.
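To make this parametrisation concrete, the sketch below draws a classification head W from the standard Gaussian P(W) and evaluates the softmax likelihood (3) on extracted features. The function names are our own illustrative choices, with the feature extractor output φ_θ(x) treated as a given array:

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def sample_classifier(rng, d, C):
    """One draw W = [w_1, ..., w_C] ~ P(W) = prod_j N(w_j; 0, I).
    With theta fixed, each such sample instantiates a classifier h in H."""
    return rng.standard_normal((d, C))

def predict(phi_x, W):
    """Softmax likelihood (3): P(y = j | x) ∝ exp(w_j^T phi_theta(x)).
    phi_x: (n, d) batch of features; W: (d, C) head."""
    return softmax(phi_x @ W)
```

Repeatedly calling `sample_classifier` with the same features yields the ensemble of classifiers over which discrepancy is later measured.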
The main benefit of this Bayesian hypothesis space modeling is that we can induce the source-confined hypothesis space H|_S (i.e., the set of classifiers that perform well on the source domain) in a principled manner via the posterior,

P(W|S, θ) ∝ P(W) · ∏_{(x,y)∈S} P(y|x, θ, W).

The posterior places most of its probability density on those samples (classifiers) W that attain high likelihood scores on S (under the given θ) while being smooth due to the prior. To ensure that the source domain S is indeed explained well by the model, we further impose high data likelihood on S as a constraint on θ:

θ ∈ Θ_S where Θ_S := {θ | log P(S|θ) ≥ L_th},   (5)

where L_th is a (constant) threshold that guarantees sufficient fidelity of the model in explaining S. It is then reasonable to represent H|_S by the support of P(W|S, θ) for θ ∈ Θ_S, postulating that H|_S exclusively contains smooth classifiers h that perform well on S. Formally, the source-confined hypothesis space can be parametrised as:

H|_S(θ) = {P(y|x, θ, W) | W ∼ P(W|S, θ)} for θ ∈ Θ_S,   (6)

where we use the notation H|_S(θ) to emphasise its dependency on θ. Intuitively, the hypothesis space H|_S is identified by choosing the feature space (i.e., choosing θ ∈ Θ_S), and individual classifiers h ∈ H|_S are realised by Bayesian posterior samples W ∼ P(W|S, θ) (inferred on the chosen feature space). Since the posterior P(W|S, θ) in (6) and the marginal likelihood log P(S|θ) in (5) do not admit closed forms in general, we adopt variational inference to approximate them. We defer the detailed derivations (Sec. 3.1) for now, and return to the MCD minimisation problem now that we have defined the hypothesis space representation.

Optimising a worst-case target domain. For the DG problem, we cannot directly apply the MCD learning principle, since the target domain T is unknown during the training stage.
Our key idea is to consider the worst-case scenario in which the target domain T maximises the MCD loss. This naturally forms a minimax-type optimisation:

min_{θ∈Θ_S} max_T MCD(H|_S(θ); T).   (7)

To solve the saddle-point optimisation (7), we adopt an adversarial learning strategy with a generator network (Goodfellow et al., 2014). The generator for T has to synthesise samples x of T that satisfy three conditions: (C1) the generated samples maximally baffle the classifiers in H|_S so that they have least consensus in prediction (for the inner maximisation); (C2) T still retains the same semantic class information as the source domain S (for the definition of DG); and (C3) the generated samples in T need to be distinguishable by their classes.

Parametrising domains. To meet these conditions, we generate target-domain images via their Fourier frequency spectra. Specifically, we build a generator network that synthesises amplitude images in the Fourier frequency domain. The synthesised amplitude images are then combined with phase images sampled from the source domain S to construct new samples x ∈ T by inverse Fourier transform. This is motivated by signal processing, where it is widely believed that the frequency phase spectra capture the semantic information of signals, while the amplitudes take charge of non-semantic (e.g., style) aspects of the signals (Oppenheim & Lim, 1981). Denoting the amplitude generator network as G_ν(ε), with parameters ν and random noise input ε ∼ N(0, I), our target samples (x, y) ∼ T are generated as follows:

1. (x_S, y_S) ∼ S (sample an image and its class label from S)
2. A_S∠P_S = F(x_S) (Fourier transform to obtain amplitude and phase for x_S)
3. A = G_ν(ε), ε ∼ N(0, I) (generate an amplitude image from G)
4. x = F^{-1}(A∠P_S), y = y_S (construct target data with the synthesised A)

Here F(·) is the 2D Fourier transform, F(u, v) = F(x) = ∫∫ x(h, w) e^{-i(hu+wv)} dh dw, and A∠P stands for the polar representation of the Fourier frequency responses (complex numbers) with amplitude image A and phase image P. That is, A∠P = A · e^{i·P} = A · (cos P + i sin P) with i = √-1, where all operations are element/pixel-wise. Note that we set y = y_S in step 4, since the original phase (semantic) information P_S is retained in the synthesised x.

Algorithm summary. Finally, the worst-case target MCD learning can be solved by adversarial learning, implemented as an alternating optimisation:

(Fix ν)  min_{θ∈Θ_S} MCD(H|_S(θ); T(ν))   (8)
(Fix θ)  max_ν MCD(H|_S(θ); T(ν))

We write T(ν) to emphasise the functional dependency of the target images on the generator parameters ν. Note that although the MCD loss in DA can be computed without target-domain labels (recall the definition (2)), in our DG case the class labels for the generated target data are available, as induced from the phase P_S (i.e., y = y_S in step 4). Thus we can modify the MCD loss by incorporating the target class labels. In the following we provide concrete derivations using variational posterior inference, and propose a modified MCD loss that takes into account the induced target class labels.
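The amplitude/phase recombination in steps 2-4 can be sketched directly with NumPy's FFT routines. In the paper the new amplitude A comes from the generator G_ν; here `swap_amplitude` is our own illustrative helper that takes an arbitrary amplitude array:

```python
import numpy as np

def swap_amplitude(x_src, amp_new):
    """Steps 2-4 above: keep the phase P_S of a source image (its semantics)
    and replace the amplitude, i.e. x = F^{-1}(A ∠ P_S)."""
    spectrum = np.fft.fft2(x_src)
    phase = np.angle(spectrum)                  # P_S: semantic content
    recombined = amp_new * np.exp(1j * phase)   # A ∠ P_S = A (cos P + i sin P)
    x_new = np.fft.ifft2(recombined)
    # For a conjugate-even amplitude, the imaginary part is numerical noise.
    return np.real(x_new)
```

As a sanity check, reusing the image's own amplitude must reconstruct the image exactly, since A_S∠P_S is then the original spectrum.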

3.1. CONCRETE DERIVATIONS USING VARIATIONAL INFERENCE

Source-confined hypothesis space by variational inference. The posterior P(W|S, θ) does not admit a closed form, and we approximate it by the Gaussian variational density Q_λ(W) = ∏_{j=1}^C N(w_j; m_j, V_j), where λ := {m_j, V_j}_{j=1}^C constitutes the variational parameters. To enforce Q_λ(W) ≈ P(W|S, θ), we optimise the evidence lower bound (ELBO),

ELBO(λ, θ; S) := Σ_{(x,y)∈S} E_{Q_λ(W)}[log P(y|x, W, θ)] − KL(Q_λ(W) || P(W)),

which is a lower bound of the marginal data likelihood log P(S|θ) (see Appendix A.3 for derivations). Hence maximising ELBO(λ, θ; S) with respect to λ tightens the posterior approximation Q_λ(W) ≈ P(W|S, θ), while maximising it with respect to θ leads to high data likelihood log P(S|θ). The latter has the very effect of imposing the constraint θ ∈ Θ_S in (8), since one can equivalently transform the constrained optimisation into a regularised (Lagrangian) form (Boyd & Vandenberghe, 2004).

Optimising the MCD loss. The next step is to minimise the MCD loss MCD(H|_S(θ); T) with the current target domain T generated by the generator network G_ν, i.e., solving (8). We follow the maximum margin learning strategy from (Kim et al., 2019), where the idea is to enforce prediction consistency across different classifiers (i.e., posterior samples) W ∼ Q_λ(W) on x ∼ T by separating the highest class score from the second highest by a large margin. To understand the idea, let j* be the model's predicted class label for x ∼ T, i.e., the class with the highest score, j* = arg max_j w_j^⊤ φ(x) as per (3). (We drop the subscript in φ_θ(x) for notational simplicity.) Let j† be the second most probable class, i.e., j† = arg max_{j≠j*} w_j^⊤ φ(x). Our model's class prediction would change if w_{j*}^⊤ φ(x) < w_{j†}^⊤ φ(x) for some W ∼ Q_λ(W), which leads to discrepancy between classifiers.
To avoid such overtaking, we need to ensure that the (plausible) minimal value of w ⊤ j * ϕ(x) is greater than the (plausible) maximal value of w ⊤ j † ϕ(x). Since the score (logit) f j (x) := w ⊤ j ϕ(x) is Gaussian under Q λ (W ), namely f j (x) ∼ N (µ j (x), σ j (x) 2 ) where µ j (x) = m ⊤ j ϕ(x), σ 2 j (x) = ϕ(x) ⊤ V j ϕ(x), the prediction consistency is achieved by enforcing: µ j * (x) -ασ j * (x) > µ j † (x) + ασ j † (x) , where we can choose α = 1.96 for 2.5% rare one-sided chance. By introducing slack variables ξ(x) ≥ 0, µ j * (x) -ασ j * (x) ≥ 1 + max j̸ =j * µ j (x) + ασ j (x) -ξ(x). ( ) Satisfying the constraints amounts to fulfilling the desideratum of MCD minimisation, essentially imposing prediction consistency of classifiers. Note that we add the constant 1 in the right hand side of (13) for the normalisation purpose to prevent the scale of µ and σ from being arbitrary small. The constraints in (13) can be translated into the following MCD loss (as a function of θ): MCD(θ; T ) := E x∼T 1 + T 2 µ j (x) + ασ j (x) -T 1 µ j (x) -ασ j (x) + ( ) where T k is the operator that selects the top-k element, and (a) + = max(0, a). Modified MCD loss. The above MCD loss does not utilise the target domain class labels y = y S that are induced from the phase information P S (Recall the target domain data generation steps 1 ∼ 4 above). To incorporate the supervised data {(x, y)} ∈ T in the generated target domain, we modify (x) 2 ) for j = 1, 2, 3, at some input x ∈ T . We assume that the true (induced) class label y = 2. (Left) Since the mean logit for class 2, µ 2 (x) is the maximum among others, the prediction is marginally correct (from softmax). Beyond that, the logit of the worst plausible hypothesis for class 2, µ 2 (x) -1.96σ 2 (x) is greater than that of the runner-up class 1, µ 1 (x) + 1.96σ 1 (x) by some positive margin (green arrow), meaning there is little chance of prediction overtaking (so, consistent); equivalently, the SMCD loss is small. 
(Middle) Prediction is marginally correct, but prediction overtaking is plausible, indicated by the negative margin (red arrow); the SMCD loss is large. (Right) Incorrect marginal prediction (to class 1) with more severe negative margin (red arrow); the SMCD loss is even larger. the MCD loss as follows: First, instead of separating the margin between the two largest logit scores as in the MCD, we maximise the margin between the logit for the given class y and the largest logit among the classes other than y. That is, we replace the constraints (13) with the following: µ y (x) -ασ y (x) ≥ 1 + max j̸ =y µ j (x) + ασ j (x) -ξ(x), where y is the class label (induced from the phase information) for the generated instance x. See Fig. 2 for illustration of the idea. Consequently, our new MCD loss (coined supervised MCD or SMCD for short) is defined as follows: SMCD(θ; T ) := E (x,y)∼T 1 + max j̸ =y µ j (x) + ασ j (x) -µ y (x) -ασ y (x) + . ( ) Here the variational parameters λ is treated as constant since the only role of λ is to maximise the ELBO. It should be noted that ( 16) essentially aims at maximising the logit for the given class y (the last term), or equivalently, classification error minimisation on T , and at the same time minimising the logit for the runner-up class (the middle max term). Surprisingly, the former amounts to minimising the minimal source-target error term e * (H; S, T ) in the generalisation bound (1), which we have left out so far. That is, e * (H; S, T ) = min h∈H e S (h) + e T (h) ≈ min h∈H |S e T (h), and the last term of the SMCD loss leads to θ that makes e T (h) small for all h ∈ H |S (θ). Moreover, minimising the logit for the runner-up class (the middle max term of the SMCD) has the effect of margin maximisation. Algorithm summary. Our AGFA algorithm can be understood as MCD-based DA with adversarial amplitude generated target domain. It entails the following alternating optimisation (η > 0 is the trade-off hyperparameter for SMCD): 1. 
min λ,θ -ELBO(λ, θ; S) + ηSMCD(θ; T ) (model learning + VI; ν fixed) 2. max ν SMCD(θ; T ) (adversarial generator learning; θ, λ fixed) Our algorithm is summarised in Alg. 1 (in Appendix) and illustrated schematically in Fig. 1 . At test time, we can apply the classifier (3) with the learned θ and any sample W ∼ Q λ (W ) to target domain inputs to predict class labels. In our experiments, we take the posterior means w j = m j instead of sampling from Q λ (W ).
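The variational quantities that drive the two alternating steps translate compactly into code. Below is a minimal NumPy sketch of a Monte Carlo ELBO estimate and of the SMCD loss (16), assuming the diagonal covariance restriction on V_j stated in Appendix A.1; all function names and array conventions are our own, not the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def elbo(phi, y, m, log_v, rng, n_samples=50):
    """Monte Carlo ELBO: sum_S E_Q[log P(y|x,W,theta)] - KL(Q || N(0,I)),
    with diagonal-Gaussian Q_lambda(W); m, log_v have shape (d, C)."""
    d, C = m.shape
    ll = 0.0
    for _ in range(n_samples):
        W = m + np.exp(0.5 * log_v) * rng.standard_normal((d, C))  # reparameterisation
        p = softmax(phi @ W)
        ll += np.log(p[np.arange(len(y)), y]).sum()
    ll /= n_samples
    v = np.exp(log_v)
    kl = 0.5 * np.sum(v + m ** 2 - 1.0 - log_v)  # KL(N(m, v) || N(0, I)) elementwise
    return ll - kl

def smcd_loss(phi, y, m, v, alpha=1.96):
    """SMCD loss (16): mu_j = m_j^T phi, sigma_j^2 = phi^T V_j phi (diagonal V_j).
    Hinge on the pessimistic true-class logit vs. the optimistic logit of the
    strongest rival class, with a unit margin."""
    mu = phi @ m
    sigma = np.sqrt((phi ** 2) @ v)
    n = len(y)
    rival = mu + alpha * sigma
    rival[np.arange(n), y] = -np.inf          # exclude the true class
    true_low = (mu - alpha * sigma)[np.arange(n), y]
    return float(np.maximum(1.0 + rival.max(axis=1) - true_low, 0.0).mean())
```

In the alternating scheme, step 1 would descend on `-elbo(...) + eta * smcd_loss(...)` with respect to the model, and step 2 would ascend on `smcd_loss(...)` with respect to the generator producing the target batch `phi`.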

3.2. FURTHER CONSIDERATIONS

Post-synthesis mixup of generated amplitude images. In our adversarial learning, the amplitude generator network G_ν synthesises target-domain image samples whose amplitude spectra are highly challenging to the current model. Although we retain the phase information from the source domains, unconstrained amplitude images can potentially alter the semantic content destructively (e.g., a constant zero amplitude image would zero out the image content), rendering the image impossible to classify. To address this, instead of using the generator's output A = G_ν(ε) directly, we combine it with the source-domain amplitude image corresponding to the phase image by simple mixup. That is, letting A_S be the amplitude spectrum corresponding to the phase P_S, we alter A as:

A ← λA + (1 − λ)A_S, where λ ∼ Uniform(0, α).   (17)

This post-synthesis mixup addresses our desideratum C3 discussed before, namely that the generated samples for the target domain need to be distinguishable by class for the DG problem to be solvable. By pulling synthesised amplitude images closer to the amplitude manifold of the source data, post-synthesis mixup ensures the model can solve the classification problem. Dense model averaging (SWAD). We found that DG training becomes more stable, and the target-domain test performance more consistent, when we use the dense model averaging strategy SWAD (Cha et al., 2021). We adopt SWAD model averaging for the variational and model parameters (λ, θ), while the generator network is not averaged. Amplitude image structures. From the definition of the Fourier transform, the frequency-domain function of a real-valued image must be conjugate-even, i.e., F(−u, −v) = F*(u, v) (complex conjugate). This implies that amplitude images are symmetric under point reflection about the origin. Conversely, if the amplitude images are symmetric (with the phase inherited from a real-valued image), the inverse Fourier transform returns real-valued signals. Thus when generating amplitude images, we only generate the non-redundant part (frequencies) of the amplitude images.
Also, the amplitude should be non-negative. We keep these constraints in mind when designing the generator network.
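The mixup (17) and the amplitude constraints above can be sketched as follows. The paper enforces the constraints by generating only the non-redundant half of the spectrum; the averaging projection in `symmetrise_amplitude` is an alternative way, of our own choosing, to illustrate the same symmetry, and both function names are ours:

```python
import numpy as np

def postmix_amplitude(a_gen, a_src, alpha_mix, rng):
    """Post-synthesis mixup (17): A <- lam*A + (1-lam)*A_S with
    lam ~ Uniform(0, alpha_mix), pulling the generated amplitude back
    toward the source amplitude manifold."""
    lam = rng.uniform(0.0, alpha_mix)
    return lam * a_gen + (1.0 - lam) * a_src

def symmetrise_amplitude(a):
    """Project an arbitrary generated array onto valid amplitude images:
    non-negative and even, A(-u, -v) = A(u, v) with indices taken modulo
    the image size, matching the spectrum of a real-valued image."""
    a = np.abs(a)
    reflected = np.roll(a[::-1, ::-1], 1, axis=(0, 1))  # a[(-u) % H, (-v) % W]
    return 0.5 * (a + reflected)
```

With α = 0 the mixup degenerates to the source amplitude A_S, which matches the observation in Sec. 5.2 that training then becomes close to ERM learning.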

4. RELATED WORK

MCD. Several studies have used the MCD principle for domain adaptation, aligning a source model to unlabeled target data (Saito et al., 2018; Kim et al., 2019; Lu et al., 2020). We uniquely exploit the MCD principle for the DG problem, in the absence of target data, by using MCD both to synthesise worst-case target-domain data and to adapt the model to that synthesised domain. Augmentation approaches to DG. Several DG approaches have been proposed based on data augmentation. Existing approaches either define augmentation heuristics (Zhou et al., 2021b; Xu et al., 2021) or exploit domain adversarial learning, i.e., confusing a domain classifier (Shankar et al., 2018; Zhou et al., 2020). Our adversarial learning is based on the much stronger (S)MCD principle, which confuses a category classifier. This provides much harder examples for robust learning, while our Fourier amplitude synthesis ensures the examples are actually recognisable. Alignment approaches to DG. Several approaches to DG are based on aligning multiple source domains (Sun & Saenko, 2016; Ganin et al., 2016; Li et al., 2018c; b), under the assumption that a feature common to all source domains will be good for a held-out target domain. Differently, we use the MCD principle to robustify our source-trained model by aligning it with the synthesised worst-case target domain.

5. EXPERIMENTS

We test our approach on the DomainBed benchmark (Gulrajani & Lopez-Paz, 2021), including: PACS (Li et al., 2017), VLCS (Fang et al., 2013), OfficeHome (Venkateswara et al., 2017), TerraIncognita (Beery et al., 2018), and DomainNet (Peng et al., 2019). For each dataset, we adopt the standard leave-one-domain-out source/target domain splits. The overall training/test protocols are similar to (Gulrajani & Lopez-Paz, 2021; Cha et al., 2021). We use ResNet-50 (He et al., 2016) as our feature extractor backbone, initialised with weights pretrained on ImageNet (Deng et al., 2009). For the generator network, we found that a linear model with noise dimension 100 performed best. Our model is trained with the Adam optimiser (Kingma & Ba, 2015) on machines with a single Tesla V100 GPU each. The hyperparameters introduced by our model (e.g., the SMCD trade-off η) and the general ones (e.g., learning rate, SWAD regime hyperparameters, maximum number of iterations) are chosen by grid search on the validation set according to the DomainBed protocol (Gulrajani & Lopez-Paz, 2021); for instance, η = 0.1 for all datasets. The implementation details, including the chosen hyperparameters, can be found in Appendix A.1.

5.1. MAIN RESULTS

The test accuracies averaged over target domains are summarised in Table 1. Our approach performs best for all datasets among the competitors, and the difference from the second best model (SWAD) is significant (about a 1.1% margin). We particularly contrast with two recent approaches: SWAD (Cha et al., 2021), which adopts dense model averaging with the simple ERM loss, and FACT (Xu et al., 2021), which uses Fourier amplitude mixup as a means of data augmentation with additional student-teacher regularisation. First, SWAD (Cha et al., 2021) is the second best model in Table 1, implying that the simple ERM loss combined with dense model averaging that seeks flat minima is quite effective, as also observed previously (Gulrajani & Lopez-Paz, 2021). FACT (Xu et al., 2021) utilises Fourier amplitude spectra similarly to our approach, but its main focus is data augmentation, producing more training images by amplitude mixup of source-domain images. FACT also adopted so-called teacher co-regularisation, which forces the orders of the class prediction logits to be consistent between teacher and student models on the amplitude-mixup data. To disentangle the impact of these two components of FACT, we ran a model called Amp-Mixup, which is simply FACT without teacher co-regularisation. Teacher co-regularisation yields further improvement in the average accuracy (FACT > Amp-Mixup in the last column of Table 1), verifying the claim in (Xu et al., 2021), although FACT is slightly worse than Amp-Mixup on VLCS and TerraIncognita. We also modified the FACT and Amp-Mixup models by incorporating SWAD model averaging (FACT+SWAD and Amp-Mixup+SWAD in the table). Clearly they perform even better in combination with SWAD. Since Amp-Mixup+SWAD can be seen as dropping the teacher regularisation and adopting the SWAD (regularisation) strategy instead, we can say that SWAD is a more effective regulariser than the student-teacher scheme.
Nevertheless, despite utilising amplitude-mixup augmentation, FACT and Amp-Mixup show little improvement over the ERM loss even when the SWAD strategy is used. This signifies the effect of the adversarial Fourier-based target-domain generation in our approach, which exhibits significant improvement over ERM and SWAD.

5.2. FURTHER ANALYSIS

Sensitivity to η (SMCD strength). We analyse the sensitivity of the target-domain generalisation performance to the SMCD trade-off hyperparameter η by running our algorithm with different values of η. The results are shown in Fig. 3. Note that η = 0 ignores the SMCD loss term (so the generator has no influence on model training), which corresponds to the ERM approach. The test accuracy of the proposed approach remains significantly better than ERM/SWAD for all tested η, with moderate variations around the best value. See Appendix A.2 for results on individual target domains. Sensitivity to α (post-synthesis mixup strength). We mix up the generated amplitude images and the source-domain amplitude images as in (17) to make the adversarial target-domain classification task solvable. The task becomes easier for small α (less impact of the generated amplitudes), and vice versa. Note that α = 0 ignores the generated amplitude images completely in the post-mixup, and the training becomes close to ERM learning, the only difference being that we still apply basic augmentations (e.g., flip, rotation, color jittering). As shown in Fig. 4, the target test performance is not very sensitive around the best selected hyperparameters. See also the ablation results on the impact of post-mixup below. Impact of SMCD (vs. unsupervised MCD). We verify the positive effect of the proposed supervised MCD loss (SMCD in (16)), which exploits the induced target-domain class labels, compared to the conventional (unsupervised) MCD loss (14), which does not use the target class labels. The results in Table 2 support our claim that exploiting target class labels induced from the phase information is quite effective, improving the target generalisation performance. Impact of post-synthesis mixup. We argued that our post-synthesis mixup of the generated amplitude images makes the class prediction task easier for the generated target domain, ensuring the solvability of the DG problem.
To verify this, we compare the two models, with and without the post-mixup strategy, in Table 2. The model trained with post-mixup performs better. Impact of SWAD. We adopted the SWAD model averaging scheme (Cha et al., 2021) to improve generalisation performance. We verify its impact in Table 2, where the model without SWAD has lower target test accuracy, signifying the importance of SWAD model averaging. Impact of amplitude generation. The amplitude image generation in our adversarial MCD learning allows us to separate the phase and amplitude images and exploit the class labels induced by the phase information. However, one may be curious how the model would perform if we instead generated full images adversarially, without phase/amplitude separation. That is, we adopt a pixel-based adversarial image generator, and in turn replace our SMCD with the conventional MCD loss (since no class labels are inducible in this strategy). We consider two generator architectures: linear (from 100-dim input noise to full image pixels) and nonlinear (a fully connected network with one hidden layer of 100 units), where the former performs slightly better. Table 2 shows that this pixel-based target image generation underperforms our amplitude generation.

6. CONCLUSION

We addressed the domain generalisation problem from the perspective of maximum classifier discrepancy: improving robustness by synthesising a worst-case target domain for learning, and training the model to be robust to that domain with the (S)MCD objective. To approximate style and content separation for synthesis, the worst-case domain is synthesised in Fourier amplitude space. Our results provide a clear improvement over the state of the art on the challenging DomainBed benchmark suite.

A APPENDIX

The Appendix consists of the following contents:

• Implementation Details (Sec. A.1)
• Full Results (Sec. A.2)
• Derivation of ELBO in Variational Inference (Sec. A.3)
• Results on ResNet-18 Backbone (Sec. A.4.1)

A.1 IMPLEMENTATION DETAILS

We adopt the ResNet50 (He et al., 2016) architecture (removing the final classification layer) as the feature extractor network. For the amplitude generator network, we tested several fully-connected architectures with different numbers of hidden layers and hidden units, and the simple linear network performed the best. The input noise dimension for the generator is chosen as 100. The covariance matrices of the variational parameters are restricted to be diagonal. The number of MC samples from Q_λ(W) in the ELBO optimisation is chosen as 50. The optimisation hyperparameters are chosen by the same strategy as (Cha et al., 2021): we employ the Adam optimiser (Kingma & Ba, 2015) with learning rate 5 × 10^-5, and no dropout or weight decay. The batch size was 32 (for each training domain) in ERM/SWAD (Cha et al., 2021), but we halve it in our model since the remaining half is constructed by the adversarial target generation. The standard basic data augmentation is also applied to the input images. Following the suggestion of (Cha et al., 2021), we run our model for up to 5000 iterations on all datasets except DomainNet, though the algorithm may stop before the maximum number of iterations if the SWAD termination condition is met (see Sec. A.1.1 below). Since DomainNet is a large-scale dataset, an even larger number of iterations is required to go through the entire data at least several times. In (Cha et al., 2021), they used 15000 iterations, which roughly corresponds to 3 to 10 data epochs. Since we halve the number of input images in the batch, in order to match the number of training epochs of (Cha et al., 2021), we increase this to 30000 iterations for DomainNet. The details of the SWAD implementation follow in the next section.

A.1.1 SWAD MODEL AVERAGING

We adopt the SWAD model averaging strategy (Cha et al., 2021) to obtain a more robust model that is less affected by overfitting. We apply SWAD to the feature extractor network parameters θ and the variational parameters λ, but not to the adversarial generator network. Since SWAD is an important component of our model, we provide more details here. SWAD is motivated by stochastic weight averaging (SWA) (Izmailov et al., 2018); however, unlike SWA's model averaging at every epoch, SWAD performs dense model averaging at every (batch) iteration. A key component of the SWAD algorithm is determining the model averaging regime, the interval of iterations over which the model averaging is performed. This regime is aimed at avoiding overfitting, and is known as overfit-aware model averaging. The regime is specified by the start and end iteration numbers, t_s and t_e, respectively, and we take the model average over iterations t ∈ [t_s, t_e], that is,

θ_SWAD = (1 / (t_e − t_s + 1)) Σ_{t=t_s}^{t_e} θ_t,   λ_SWAD = (1 / (t_e − t_s + 1)) Σ_{t=t_s}^{t_e} λ_t,   (19)

where θ_t and λ_t are the model parameters after iteration t. Here (θ_SWAD, λ_SWAD) are the final model parameters returned by the training algorithm.

Now we describe how the regime is determined. Ideally, we expect the intermediate models during the interval [t_s, t_e] to be overfit-free, having high generalisation performance. To this end, we evaluate the model on the validation set (held out from the source domain training data), and denote the validation loss of the t-th model by l^t_val. The start iteration of the regime, t_s, is determined as the first iteration at which the validation loss is not improved for the next N_s iterations (e.g., N_s = 3). That is,

t_s = min{ t − N_s + 1 | l^{t−N_s+1}_val ≤ l^{t'}_val for all t' ∈ (t − N_s + 1, t] }.   (20)

Once we find t_s, we compute the (average) starting validation loss, l_val = (1/N_s) Σ_{t=t_s}^{t_s+N_s−1} l^t_val, which is used as a reference when we decide the end iteration t_e. Once we enter the regime, we perform model averaging at every iteration. To determine when to stop, we inspect the validation losses to see whether the model starts overfitting. Specifically, if the validation losses consecutively exceed l_val by some margin, we regard this as an overfit signal. That is,

t_e = min{ t − N_e | l^t_val, l^{t−1}_val, ..., l^{t−N_e+1}_val > r · l_val },   (21)

where r and N_e are user-driven hyperparameters (e.g., r = 1.3, N_e = 6). The pseudo-code of our AGFA algorithm with the SWAD strategy is summarised in Alg. 1.

Algorithm 1: AGFA Algorithm with SWAD Model Averaging.
Input: Source data S, SMCD trade-off η, post-mixup α, learning rate γ, and SWAD hyperparameters N_s, N_e, r.
Initialise: θ (feature extractor), λ (variational parameters), and ν (generator); (flag) SWAD-Regime-Entered ← FALSE; (iteration) t ← 0.
Repeat:
  0. Sample a minibatch S_B = {(x^S_i, y^S_i)}_{i=1}^n from S.
  1. Prepare {(A^S_i, P^S_i)}_{i=1}^n by Fourier transform A^S_i ∠ P^S_i = F(x^S_i).
  2. Generate amplitude images A^G_i = G_ν(ε_i), ε_i ∼ N(0, I), for i = 1, ..., n.
  3. Post-mixup: A^G_i ← λ A^G_i + (1 − λ) A^S_i, λ ∼ Uniform(0, α).
  4. Construct a target batch T_B = {(x^T_i, y^T_i)}_{i=1}^n: x^T_i = F^{-1}(A^G_i ∠ P^S_i), y^T_i = y^S_i.
  5. Evaluate L_model := −ELBO(λ, θ; S_B) + η SMCD(θ; T_B).
  6. Update the model and variational parameters: (λ, θ) ← (λ, θ) − γ ∇_{(λ,θ)} L_model.
  7. Evaluate L_gen := −SMCD(θ; T_B).
  8. Update the generator network: ν ← ν − γ ∇_ν L_gen.
  9. (SWAD procedure) t ← t + 1, (λ_t, θ_t) ← (λ, θ).
     If SWAD-Regime-Entered == FALSE:
       If l^{t−N_s+1}_val = min_{0 ≤ t' < N_s} l^{t−t'}_val:
         t_s ← t − N_s + 1;  l_val ← (1/N_s) Σ_{t'=0}^{N_s−1} l^{t−t'}_val;  SWAD-Regime-Entered ← TRUE.
     Else:
       If r · l_val < min_{0 ≤ t' < N_e} l^{t−t'}_val:  t_e ← t − N_e.
Return θ_SWAD = (1/(t_e − t_s + 1)) Σ_{t=t_s}^{t_e} θ_t,  λ_SWAD = (1/(t_e − t_s + 1)) Σ_{t=t_s}^{t_e} λ_t.
There are three hyperparameters in SWAD, (N_s, N_e, r); following (Cha et al., 2021), we use N_s = 3, N_e = 6, r = 1.3 for all datasets in DomainBed except r = 1.2 for VLCS. One technical issue is that evaluating the validation loss at every iteration is computationally demanding. Similarly to (Cha et al., 2021), we compute the validation loss only at every V-th iteration (e.g., V = 50 for VLCS, V = 500 for DomainNet, and V = 100 for the rest), although the model averaging is still performed at every iteration. Accordingly, equations (19), (20), and (21) need to be modified: essentially, all iteration numbers in those equations become multiples of V. The model averaging in Alg. 1 is implemented via running (online) averages and (FIFO) queue data structures, similarly to (Cha et al., 2021), which does not incur significant extra computational overhead.
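The overfit-aware averaging regime described above can be sketched as a small helper. `SWADAverager` is an illustrative reconstruction (parameters as flat lists, one validation-loss evaluation per call), not the authors' code; the hyperparameter names follow the text.

```python
from collections import deque

class SWADAverager:
    """Minimal sketch of SWAD's overfit-aware dense model averaging."""

    def __init__(self, n_s=3, n_e=6, r=1.3):
        self.n_s, self.n_e, self.r = n_s, n_e, r
        self.window = deque(maxlen=max(n_s, n_e))  # recent (loss, params)
        self.in_regime = False
        self.sum, self.count = None, 0
        self.l_val = None  # reference validation loss l_val

    def update(self, params, val_loss):
        """Call once per iteration; returns True when t_e is reached."""
        self.window.append((val_loss, list(params)))
        if not self.in_regime:
            # Enter the regime when the oldest of the last n_s losses is the
            # minimum, i.e. no improvement for n_s evaluations (eq. 20).
            if len(self.window) >= self.n_s:
                last = list(self.window)[-self.n_s:]
                if last[0][0] <= min(l for l, _ in last):
                    self.in_regime = True
                    self.l_val = sum(l for l, _ in last) / self.n_s
                    for _, p in last:  # average from t_s onwards
                        self._accumulate(p)
            return False
        self._accumulate(list(params))
        # Stop when the last n_e losses all exceed r * l_val (eq. 21).
        last = list(self.window)[-self.n_e:]
        return len(last) == self.n_e and min(l for l, _ in last) > self.r * self.l_val

    def _accumulate(self, p):
        self.sum = p if self.sum is None else [a + b for a, b in zip(self.sum, p)]
        self.count += 1

    def average(self):
        """Running average over the regime (eq. 19)."""
        return [a / self.count for a in self.sum]
```

In practice the validation loss would only be evaluated every V-th iteration as noted above; the sketch treats every call as one such evaluation.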

A.2 FULL RESULTS

The full results (test errors on individual target domains) on the DomainBed datasets are summarised in the tables below. We also show the full results of the sensitivity analysis in Table 8 (the SMCD loss trade-off η) and Table 9 (the post-synthesis mixup strength α). Moreover, we visualise in Fig. 5 the ablation study results for the four different modelling choices: 1) impact of SMCD (vs. conventional unsupervised MCD), 2) impact of post-synthesis mixup, 3) impact of SWAD, and 4) impact of amplitude generation (vs. pixel-based image generation). For the pixel-based image generation, we consider two generator architectures: linear (from 100-dim input noise to full image pixels) and nonlinear (a fully connected network with one hidden layer of 100 units).

Visualisation of generated adversarial images. We visualise in Fig. 6 some synthesised amplitude images and constructed target domain images from the learned model on the PACS dataset. Although the generated amplitude images visually look like random noise, they appear to have the effect of attenuating high-frequency spectra (shown as darker pixels in the fifth column) when combined with the source domain amplitude images by post-mixup. The images constructed from the generated amplitude images alone, without post-mixup (sixth column), look much like edge detection maps, whereas the post-mixup constructions (seventh column) remain visually similar to the original source domain images, promoting DG solvability.

A.3 DERIVATION OF ELBO IN VARIATIONAL INFERENCE

We derive the evidence lower bound (ELBO) in (11) in the main paper. To enforce Q_λ(W) ≈ P(W|S, θ), we minimise their KL divergence:

KL( Q_λ(W) || P(W|S, θ) ) = E_{Q_λ(W)} [ log ( Q_λ(W) / P(W|S, θ) ) ]   (22)
= E_{Q_λ(W)} [ log ( Q_λ(W) P(S|θ) / ( P(S|W, θ) P(W) ) ) ]   (23)
= log P(S|θ) − E_{Q_λ(W)} [ log P(S|W, θ) ] + E_{Q_λ(W)} [ log ( Q_λ(W) / P(W) ) ]   (24)
= log P(S|θ) − E_{Q_λ(W)} [ log P(S|W, θ) ] + KL( Q_λ(W) || P(W) )   (25)
= log P(S|θ) − Σ_{(x,y)∈S} E_{Q_λ(W)} [ log P(y|x, W, θ) ] + KL( Q_λ(W) || P(W) ).   (26)

Since the KL divergence is non-negative, re-arranging (26) yields:

log P(S|θ) ≥ Σ_{(x,y)∈S} E_{Q_λ(W)} [ log P(y|x, W, θ) ] − KL( Q_λ(W) || P(W) ),

and the right-hand side constitutes the ELBO.
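The ELBO above is optimised in the paper via MC sampling from Q_λ(W) (50 samples, diagonal covariance, per Sec. A.1). A minimal sketch of such an estimator follows; the standard normal prior, the reparameterisation trick, and the function name `elbo_mc` are illustrative choices, since the paper only specifies that the posterior covariance is diagonal.

```python
import numpy as np

def elbo_mc(log_lik_fn, mu, log_sigma, n_samples=50, rng=None):
    """Monte-Carlo ELBO for a diagonal Gaussian posterior
    Q_lambda(W) = N(mu, diag(sigma^2)) and prior P(W) = N(0, I).

    log_lik_fn(w) should return sum_{(x,y) in S} log P(y | x, w, theta).
    """
    rng = np.random.default_rng(0) if rng is None else rng
    sigma = np.exp(log_sigma)
    # KL( N(mu, diag(sigma^2)) || N(0, I) ) in closed form.
    kl = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)
    # MC estimate of E_Q[log P(S | W, theta)] via the reparameterisation trick.
    eps = rng.standard_normal((n_samples, mu.size))
    ws = mu + sigma * eps
    exp_ll = np.mean([log_lik_fn(w) for w in ws])
    return exp_ll - kl
```

The expected log-likelihood term is stochastic while the KL term is computed exactly, which is the usual variance-reduction choice for Gaussian families.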

A.4.1 RESULTS ON RESNET-18 BACKBONE

To test our approach on backbone networks other than ResNet-50, we run experiments with the ResNet-18 backbone on the PACS dataset. The results are summarised in Table 10 . Compared to the recent approaches MixStyle (Zhou et al., 2021b) and EFDMix (Zhang et al., 2022) , our approach AGFA again shows higher performance even with the smaller ResNet-18 backbone. 



Some recent work, such as (Vedantam et al., 2021), has, however, empirically studied the potential risk of looseness of the bound in certain scenarios. This condition naturally originates from the solvability of the DG problem.



Figure 1: Overall training flow of the proposed approach (AGFA). We generate target-domain data by synthesising Fourier amplitude images that are trained adversarially. See the main text in Sec. 3 for details.

Figure 2: Illustration of the SMCD loss on three different hypothesis spaces H_|S, shown in three panels. For the C = 3-way classification case, each panel shows the class logit scores (Gaussian random) f_j(x) ∼ N(µ_j(x), σ_j(x)^2) for j = 1, 2, 3, at some input x ∈ T. We assume that the true (induced) class label is y = 2. (Left) Since the mean logit for class 2, µ_2(x), is the maximum, the prediction is marginally correct (from softmax). Beyond that, the logit of the worst plausible hypothesis for class 2, µ_2(x) − 1.96σ_2(x), is greater than that of the runner-up class 1, µ_1(x) + 1.96σ_1(x), by some positive margin (green arrow), meaning there is little chance of prediction overtaking (so, consistent); equivalently, the SMCD loss is small. (Middle) The prediction is marginally correct, but prediction overtaking is plausible, as indicated by the negative margin (red arrow); the SMCD loss is large. (Right) Incorrect marginal prediction (class 1) with a more severe negative margin (red arrow); the SMCD loss is even larger.
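The consistency margin depicted in Figure 2 can be computed directly from the per-class logit means and standard deviations. The sketch below is an illustrative numeric check (the factor 1.96 is the two-sided 95% Gaussian quantile used in the caption); `smcd_margin` is our own name, not the paper's loss function.

```python
import numpy as np

def smcd_margin(mu, sigma, y):
    """Margin between the worst plausible logit of the true class y and the
    best plausible logit of the strongest competitor, as in Fig. 2.

    mu, sigma: per-class logit means and std-devs under the hypothesis
    posterior. Positive margin: the prediction is consistent (little chance
    of overtaking); negative: overtaking is plausible and the SMCD loss
    would be large.
    """
    lower_true = mu[y] - 1.96 * sigma[y]
    rivals = [j for j in range(len(mu)) if j != y]
    upper_rival = max(mu[j] + 1.96 * sigma[j] for j in rivals)
    return lower_true - upper_rival
```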

Figure 3: Sensitivity to η (SMCD trade-off) on PACS and OfficeHome.

Figure 5: Ablation study of four different modeling choices: SMCD, post-mixup, SWAD, and amplitude generation (instead of pixel-based target image generation).

Figure 6: Visualisation of the generated amplitude and constructed images. The columns are (from left to right): 1) original image, 2) phase and 3) amplitude spectra after Fourier transform, 4) generated amplitude image, 5) post-mixup of 3) and 4), 6) constructed image from the phase in 2) and the generated amplitude image in 4) (by inverse Fourier transform), and 7) constructed image from the phase in 2) and the post-mixup amplitude in 5).

Figure 7: Comparison between pixel-based and our Fourier-based generated target images. Whereas the pixel-based generation is visually uninformative and looks like pure random noise, our Fourier-based generation contains salient object edge information that is closely related to class semantics.

Within this source-confined hypothesis space (denoted by H_|S), the terms e_S(h) and d_S(h, h′) in the bound are expected to be close to 0 for all h, h′ ∈ H_|S, and the bound of (1) effectively reduces to what is called the maximum classifier discrepancy (MCD) loss,

Table 1, where the results for individual target domains are reported in Appendix A.2. The proposed approach performs the best.

Table 1: Average accuracies on DomainBed datasets. Note: † indicates that the results are excerpted from the published papers or (Gulrajani & Lopez-Paz, 2021). Our own runs are reported without †. Note that FACT (Xu et al., 2021) adopted a slightly different data/domain split protocol from DomainBed's, explaining the discrepancy on PACS.

Ablation study: 1) unsupervised MCD (instead of SMCD), 2) without post-mixup, 3) without SWAD, and 4) pixel-based target image generation (instead of amplitude generation).



Average accuracies on PACS. Note: † indicates that the results are excerpted from the published papers or (Gulrajani & Lopez-Paz, 2021). Our own runs are reported without †. FACT (Xu et al., 2021) adopted a slightly different data/domain split from DomainBed's, explaining the discrepancy.

Average accuracies on VLCS. The same interpretation as Table 3.

Average accuracies on OfficeHome. The same interpretation as Table 3.

Average accuracies on DomainNet. The same interpretation as Table 3.

Sensitivity analysis on the SMCD loss trade-off η on PACS and OfficeHome.

Sensitivity analysis on the post-mixup trade-off α on PACS and OfficeHome.

Single source domain generalisation results on PACS with (a) ResNet-18 and (b) ResNet-50 backbones. Each column shows test accuracies averaged over the remaining three target domains. Results on ERM, MixStyle (Zhou et al., 2021b), and EFDMix (Zhang et al., 2022) are excerpted from (Zhang et al., 2022).


Table 10: Average accuracies on PACS with ResNet-18 backbone. Results on ERM, Mixup (Zhang et al., 2018), MixStyle (Zhou et al., 2021b), and EFDMix (Zhang et al., 2022) are excerpted from (Zhang et al., 2022).

