PGRAD: LEARNING PRINCIPAL GRADIENTS FOR DOMAIN GENERALIZATION

Abstract

Machine learning models often fail when facing out-of-distribution (OOD) domains, a challenge known as domain generalization (DG). In this work, we develop a novel DG training strategy, which we call PGrad, that learns a robust gradient direction to improve models' generalization ability on unseen domains. The proposed gradient aggregates the principal directions of a sampled roll-out optimization trajectory that captures the training dynamics across all training domains. PGrad's gradient design forces DG training to ignore domain-dependent noise signals and updates all training domains with a robust direction covering the main components of parameter dynamics. We further improve PGrad via a bijection-based computational refinement and directional plus length-based calibrations. Our theoretical analysis connects PGrad to the spectral analysis of the Hessian in neural network training. Experiments on the DomainBed and WILDS benchmarks demonstrate that our approach effectively enables robust DG optimization and leads to smoothly decreasing loss curves. Empirically, PGrad achieves competitive results across seven datasets, demonstrating its efficacy across both synthetic and real-world distributional shifts.

1. INTRODUCTION

Deep neural networks have shown remarkable generalization ability on test data following the same distribution as their training data. Yet, high-capacity models are incentivized to exploit any correlation in the training data that leads to more accurate predictions. As a result, these models risk becoming overly reliant on "domain-specific" correlations that may harm model performance on out-of-distribution (OOD) test cases. A typical example is a camels-versus-cows classification task (Beery et al., 2018; Shi et al., 2021), where camel pictures in training are almost always shown in a desert environment while cow pictures mostly have green grassland backgrounds. A typical machine learning model trained on this dataset will perform worse than random guessing on test pictures with cows in deserts or camels in pastures. The network has learned to use the background texture as one deciding factor when we want it to learn to recognize animal shapes. Unfortunately, the model overfits to specific traps that are highly predictive of some training domains but fail on OOD target domains. Recent domain generalization (DG) research efforts deal with such a challenge. They are concerned with how to learn a machine learning model that can generalize to an unseen test distribution when given multiple different but related training domains. Recent literature covers a wide spectrum of DG methods, including invariant representation learning, meta-learning, data augmentation, ensemble learning, and gradient manipulation (more details in Section 2.4). Despite the large body of recent DG literature, the authors of (Gulrajani & Lopez-Paz, 2021) showed that empirical risk minimization (ERM) provides a competitive baseline on many real-world DG benchmarks. ERM does not explicitly address distributional shifts during training. Instead, ERM calculates the gradient from each training domain and updates the model with the average gradient.
However, one caveat of ERM is that its average-gradient-based model update preserves domain-specific noise during optimization. This observation motivates the core design of our method. We propose a novel training strategy that learns a robust gradient direction for DG optimization, and we call it PGrad. PGrad samples an optimization trajectory in the high-dimensional parameter space by updating the current model sequentially across training domains. It then constructs a local coordinate system to explain the parameter variations in the trajectory. Via singular value decomposition (SVD), we derive an aggregated vector that covers the main components of parameter dynamics and use it as a new gradient direction to update the target model. This novel vector, which we name the principal gradient, reduces domain-specific noise in the DG model update and prevents the model from overfitting to particular training domains. To decrease the computational complexity of the SVD, we construct a bijection between the parameter space and a low-dimensional space through transpose mapping. Hence, the computational complexity of PGrad relates to the number of sampled training domains and does not depend on the size of the model parameters. This paper makes the following contributions: (1) PGrad places no explicit assumption on either the joint or the marginal distributions. (2) PGrad is model-agnostic and scalable to various model architectures, since its computational cost only relates to the number of training domains. (3) We theoretically show the connection between PGrad and Hessian approximation, and also prove that PGrad improves training efficiency by learning a gradient in a smaller subspace constructed from the learning trajectory. (4) Our empirical results demonstrate the competitive performance of PGrad across seven datasets covering both synthetic and real-world distributional shifts.

2. METHOD

Domain generalization (Wang et al., 2021; Zhou et al., 2021) assumes no access to instances from future unseen domains. In domain generalization, we are given a set of training domains D_tr = {D_i}_{i=1}^{n} and test domains T_te = {T_j}_{j=1}^{m}. Each domain D_i (or T_j) is associated with a joint distribution P^{D_i}_{X×Y} (or P^{T_j}_{X×Y}), where X represents the input space and Y the output space. Moreover, each training domain D_i is characterized by a set of i.i.d. samples {(x^i_k, y^i_k)}. For any two different domains sampled from either D_tr or T_te, the joint distributions differ: P^{D_i}_{X×Y} ≠ P^{D_j}_{X×Y} and, most importantly, P^{D_i}_{X×Y} ≠ P^{T_j}_{X×Y}. We consider the prediction task from the input x ∈ X to the output y ∈ Y. Provided with a model family whose parameter space is Θ ⊂ R^d and a loss function L : Θ × (X × Y) → R+, the goal is to find an optimal Θ*_te on the test domains so that:

Θ*_te = argmin_{Θ∈Θ} E_{T_j∼T_te} E_{(x,y)∼P^{T_j}_{X×Y}} L[Θ, (x, y)].    (1)

In the DG setup, any prior about T_te, such as its inputs or outputs, is unavailable during the training phase. Despite not considering domain discrepancies from training to testing, ERM is still a competitive method for domain generalization tasks (Gulrajani & Lopez-Paz, 2021). ERM naively groups data from all training domains D_tr together and obtains its optimal parameter Θ*_tr via the following, as an approximation to Θ*_te:

Θ*_tr = argmin_{Θ∈Θ} E_{D_i∼D_tr} E_{(x,y)∼P^{D_i}_{X×Y}} L[Θ, (x, y)].    (2)

In the rest of the paper, we omit the subscript in Θ_tr and use Θ for simplicity (during DG training, only the training domains D_tr are available for model learning). When optimizing with ERM across multiple training domains, the update of Θ follows:

Θ^{t+1} = Θ^t − (γ/n) Σ_{i=1}^{n} ∇_{Θ^t} L_{D_i},    (3)

where ∇_{Θ^t} L_{D_i} = ∇ E_{(x,y)∼P^{D_i}_{X×Y}} L[Θ^t, (x, y)] is the gradient of the loss on domain D_i with respect to the current parameters Θ^t, and γ is the learning rate. The gradient determines the learning path of a model.
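For reference, the averaged-gradient ERM update above can be sketched in a few lines of numpy; the per-domain toy gradients below are hypothetical stand-ins for backprop on each domain's data:

```python
import numpy as np

def erm_update(theta, domain_grads, lr=0.1):
    """ERM step: average the per-domain gradients, then descend.

    theta: parameter vector, shape (d,)
    domain_grads: list of n gradients, each shape (d,)
    """
    avg_grad = np.mean(domain_grads, axis=0)
    return theta - lr * avg_grad

# toy check: two domains whose gradients partially conflict
theta = np.zeros(3)
g1 = np.array([1.0, 1.0, 0.0])   # domain 1 signal
g2 = np.array([1.0, -1.0, 0.0])  # domain 2 signal (conflicts on dim 1)
theta_new = erm_update(theta, [g1, g2], lr=0.1)
# the shared component (dim 0) moves; the conflicting component cancels
```

Note how averaging cancels conflicting components but, as the next paragraph argues, it also keeps any noise direction that happens to survive the average.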
When using ERM in the DG setting, each model update uses an average gradient, which may introduce and preserve domain-specific noise. For instance, one training domain may include trapping signals such as cows always appearing in pastures and camels always in deserts (as mentioned earlier). When investigating across multiple training domains, we envision that such domain-specific noise signals will not be the main components of parameter variations across all domains. This motivates us to design PGrad as follows.

2.1. PGRAD: PRINCIPAL GRADIENT BASED MODEL UPDATES

We extend ERM with a robust gradient estimation that we call PGrad. We visualize an overview in Figure 1 to better explain how it works. Briefly speaking, given the current model parameter vector, we sample a trajectory of parameters by training sequentially on each training domain. Next, we build a local principal coordinate system based on the parameters obtained from the sampled trajectory. The chosen gradient direction is then built as a linear combination of the orthogonal axes of the local principal coordinates. Our design also forces the learned gradient to filter out domain-specific noise and to follow a direction that maximally benefits all training domains D_tr; we refer to this extracted direction as the principal gradient. In the following, we cover the details of trajectory sampling, local coordinate construction, direction and length calibration, plus noise suppression for our principal gradient design.

Trajectory Sampling. Denote the current parameter vector as Θ^t. We first sample a trajectory S through the parameter space Θ by sequentially optimizing the model on each training domain:

Θ^t_0 = Θ^t,   Θ^t_i = Θ^t_{i-1} − α ∇_{Θ^t_{i-1}} L_{D_i},   i ∈ {1, …, n}.    (4)

We refer to the process of choosing an order of training domains to optimize over as trajectory sampling. Different ordering arrangements of the training domains generate different trajectories.

Principal Coordinate System Construction. We now have a sampled trajectory S = {Θ^t_i}_{i=0}^{n} ∈ R^{(n+1)×d} derived from Θ^t. Note: the inclusion of the starting location Θ^t_0 as part of the trajectory is necessary; see the proof in Appendix (A.3). We then construct a local principal coordinate system to explain the variations in S. We look for orthogonal unit axes V = [v^T_z]_{z=0}^{n} ∈ R^{(n+1)×d} that maximally capture the variations of the trajectory; each v_z ∈ R^d is a unit vector of size d, the same dimension as the parameters Θ^t:

max_{v_z} Variance(S v_z),   s.t.  V V^T = I_{n+1}.    (5)
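The trajectory sampling step can be sketched in numpy; the quadratic toy losses and the `domain_grad_fns` callables below are hypothetical stand-ins for backprop through each domain's batch:

```python
import numpy as np

def sample_trajectory(theta, domain_grad_fns, alpha=0.01):
    """Roll out Theta^t_0 .. Theta^t_n with one SGD step per training domain.

    domain_grad_fns: list of callables, each returning the gradient of one
    domain's loss at the given parameter vector.
    """
    traj = [theta.copy()]                 # include the start point Theta^t_0
    for grad_fn in domain_grad_fns:
        theta = theta - alpha * grad_fn(theta)
        traj.append(theta.copy())
    return np.stack(traj)                 # shape (n + 1, d)

# toy quadratic losses L_i(theta) = 0.5 * ||theta - c_i||^2  =>  grad = theta - c_i
centers = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]
fns = [lambda th, c=c: th - c for c in centers]
S = sample_trajectory(np.zeros(2), fns, alpha=0.5)
```

Shuffling the order of `domain_grad_fns` before each roll-out gives a different trajectory, which is exactly the freedom the variants in Section 2.3 exploit.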
The above objective is the classic principal component analysis formulation and can be solved with singular value decomposition (SVD). Eq. (5) has the following closed-form solution:

λ_z, v_z = SVD_z( (1/n) Ŝ^T Ŝ ),    (6)

where λ_z and v_z denote the z-th largest eigenvalue and its corresponding eigenvector, and Ŝ denotes the centered trajectory obtained by removing the mean from S. In Eq. (6), the computational bottleneck lies in the SVD, whose complexity is O(d^3) since Ŝ^T Ŝ ∈ R^{d×d}. Here d denotes the size of the parameter vector and is fairly large for most state-of-the-art (SOTA) deep learning architectures. This prohibits computing the eigenvalues and eigenvectors of Eq. (6) for SOTA deep learning models. Hence, we construct a bijection as follows to lower the computational complexity to O((n+1)^3):

Ŝ Ŝ^T e_z = λ_z e_z  ⟹  Ŝ^T Ŝ (Ŝ^T e_z) = λ_z (Ŝ^T e_z)  ⟹  v_z = Ŝ^T e_z.    (7)

Eq. (7) indicates that if λ_z, e_z are the z-th largest eigenvalue and corresponding eigenvector of Ŝ Ŝ^T, then the z-th largest eigenvalue and corresponding eigenvector of Ŝ^T Ŝ are λ_z and Ŝ^T e_z (i.e., v_z = Ŝ^T e_z, up to normalization). This property introduces a bijection from the eigenvectors of Ŝ Ŝ^T ∈ R^{(n+1)×(n+1)} to those of Ŝ^T Ŝ ∈ R^{d×d}. Since n, the number of training domains, is much smaller than d, the eigen-decomposition of Ŝ Ŝ^T ∈ R^{(n+1)×(n+1)} is much cheaper.

Directional Calibration. With the derived orthogonal axes V = [v^T_z]_{z=0}^{n} from Eq. (7), we construct a local principal coordinate system with each axis aligned with one eigenvector v_z. These principal coordinate axes are ordered by the magnitude of their eigenvalues: v_i explains more of the variation of the sampled trajectory S than v_j whenever i < j, and all axes are unit vectors. In addition, these vectors are unoriented: either a positive or a negative multiple of an eigenvector is still an eigenvector.
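The Gram-matrix shortcut of Eq. (7) can be sketched in numpy: eigen-decompose the small (n+1)×(n+1) matrix Ŝ Ŝ^T and map each eigenvector back to parameter space via v_z = Ŝ^T e_z. The random trajectory below is synthetic, and the helper name is ours:

```python
import numpy as np

def principal_axes(S):
    """Eigenpairs of S_hat^T S_hat obtained from the small Gram matrix
    S_hat S_hat^T, avoiding any d x d decomposition."""
    S_hat = S - S.mean(axis=0)                 # center the trajectory
    gram = S_hat @ S_hat.T                     # (n+1, n+1), cheap for small n
    lams, E = np.linalg.eigh(gram)             # eigh returns ascending order
    order = np.argsort(lams)[::-1]             # re-sort descending
    lams, E = lams[order], E[:, order]
    V = S_hat.T @ E                            # columns are v_z = S_hat^T e_z
    V /= np.maximum(np.linalg.norm(V, axis=0, keepdims=True), 1e-12)
    return lams, V                             # V columns are unit axes

# sanity check against the direct d x d decomposition on a tiny example
rng = np.random.default_rng(0)
S = rng.normal(size=(4, 10))                   # n + 1 = 4 points in d = 10
lams, V = principal_axes(S)
```

On the toy example, the nonzero eigenvalues of the 4×4 Gram matrix match those of the 10×10 matrix Ŝ^T Ŝ, which is the content of the bijection.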
Our main goal now is to obtain a robust gradient direction by aggregating information from V. First, we calibrate the direction of each eigenvector so that it points in a direction that can improve DG prediction accuracy. Learning an ideal direction is impossible without a reference. The choice of the reference is flexible, as long as it points in a direction that climbs up the loss surface; we want the reference to guide the principal gradient in the right direction for gradient-descent-based algorithms. For simplicity, we use the difference between the starting point Θ^t_0 and the end point Θ^t_n of the trajectory S as our reference: ∇_r = Θ^t_0 − Θ^t_n. For each coordinate axis, we revise its direction so that the resulting vector w_z is positively related to the reference ∇_r in terms of the inner product:

w_z = r_z v_z,   r_z = 1 if ⟨v_z, ∇_r⟩ ≥ 0, and −1 otherwise.    (8)

Constructing the Principal Gradient. The relative importance of each w_z is conveyed by the corresponding eigenvalue λ_z: a larger λ_z implies higher variance when projecting the trajectory S along the w_z direction. We weight each axis by its eigenvalue and aggregate the axes into a weighted sum. This gives the principal gradient vector:

∇_p = Σ_{z=0}^{n} (λ_z / ||λ||_2) w_z,   λ = [λ_0, λ_1, …, λ_n].    (9)

Other aggregations besides Eq. (9) are possible. For instance, another choice of weight could be λ_z / ||λ||_1, or simply λ_z, since the eigenvalues of a positive semi-definite matrix are non-negative. Gradient normalization has been widely recommended for improving training stability (You et al., 2017; Yu et al., 2017). Our design in Eq. (9) automatically achieves L2 normalization, because

||∇_p||_2^2 = Σ_{z=0}^{n} (λ_z^2 / ||λ||_2^2) ||w_z||_2^2 = 1.

Length Calibration. As training updates continue, a fixed-length gradient operator may become too rigid, causing fluctuations in the loss.
We therefore propose to calibrate the norm of ∇_p with a reference, achieving adaptive length tuning. Specifically, we multiply the aggregated gradient from Eq. (9) by the L2 norm of ∇_r:

∇_p = Σ_{z=0}^{n} (λ_z ||∇_r||_2 / ||λ||_2) w_z.

With this length calibration via ||∇_r||_2, the norm of the proposed gradient is constrained by the multiplier and is automatically tuned during the training process.

Noise Suppression. Most w_z axes correspond to small eigenvalues and may relate only to domain-specific noise signals. They may help the accuracy on a specific training domain D_i but mostly hurt the overall accuracy on D_tr. We therefore define the principal gradient as follows and use it to train the DG model via gradient descent (where k is a hyperparameter):

∇_p = Σ_{z=0}^{k} (λ_z ||∇_r||_2 / ||λ[:k]||_2) w_z,   Θ^{t+1} = Θ^t − γ ∇_p.
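Putting the pieces together, here is a hedged numpy sketch of one PGrad step on a given trajectory: small-Gram eigen-decomposition, sign calibration against ∇_r, top-k aggregation, and length calibration. The function name and the random toy trajectory are ours, not the authors'; a real implementation would operate on flattened network parameters:

```python
import numpy as np

def principal_gradient(S, k=None):
    """Assemble the principal gradient from a trajectory S of shape (n+1, d):
    top-k axes, direction-calibrated by grad_r = Theta_0 - Theta_n,
    eigenvalue-weighted, and length-calibrated by ||grad_r||_2."""
    n_plus_1, d = S.shape
    k = n_plus_1 if k is None else k
    S_hat = S - S.mean(axis=0)
    lams, E = np.linalg.eigh(S_hat @ S_hat.T)          # small (n+1)x(n+1) problem
    order = np.argsort(lams)[::-1]
    lams, E = np.clip(lams[order], 0, None), E[:, order]
    V = S_hat.T @ E                                    # v_z = S_hat^T e_z
    V /= np.maximum(np.linalg.norm(V, axis=0, keepdims=True), 1e-12)
    grad_r = S[0] - S[-1]                              # reference direction
    signs = np.where(V.T @ grad_r >= 0, 1.0, -1.0)
    W = V * signs                                      # directional calibration
    lam_k = lams[:k]
    # eigenvalue weighting + length calibration by ||grad_r||_2
    return W[:, :k] @ (lam_k * np.linalg.norm(grad_r) / np.linalg.norm(lam_k))

rng = np.random.default_rng(1)
S = rng.normal(size=(4, 6))                            # toy (n+1, d) trajectory
g = principal_gradient(S, k=2)
```

Since the retained axes are orthonormal, the aggregated vector inherits the norm of the reference, ||∇_p||_2 = ||∇_r||_2, and its inner product with ∇_r is non-negative by construction.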

2.2. THEORETICAL ANALYSIS

In Appendix (A.5.1), we prove that (1/n) Ŝ^T Ŝ in Eq. (6) gives the mean of all training domains' domain-specific Fisher information matrices (FIMs). Since the FIM is the negative of the Hessian under mild conditions, PGrad essentially performs spectral analysis on an approximated Hessian matrix. Moreover, in Appendix (A.5.2), we show that PGrad improves the training efficiency of neural networks by recovering a subspace of the original over-parameterized space Θ. This subspace is built from the top eigenvectors of the approximated Hessian. We visualize the evolution of the eigenvalue distributions in Figure 8. Our theoretical analysis connects to the machine learning literature that performs spectral analysis of the Hessian matrix (Gur-Ari et al., 2018) and connects the top subspace spanned by its eigenvectors to training efficiency and generalization in neural networks (Wei & Schwab, 2019; Ghorbani et al., 2019; Pascanu et al., 2014). For example, (Hochreiter & Schmidhuber, 1997) showed that small eigenvalues in the Hessian spectrum indicate flat directions. Another work (Sagun et al., 2017) empirically demonstrated that the spectrum of the Hessian contains both a bulk component with many small eigenvalues and a few top components with much larger positive eigenvalues. Later, (Gur-Ari et al., 2018) pointed out that the gradient of neural networks quickly converges to the top subspace of the Hessian.

2.3. VARIATIONS OF PGRAD

There exist many ways to construct a sampled trajectory, creating multiple variations of PGrad.
• PGrad-F: The vanilla trajectory sampling method samples a trajectory of length n + 1 by sequentially visiting each D_i in a fixed order. See the appendix for the results of this rigid variation.
• PGrad: We can randomly shuffle the domain order during training and then sample a trajectory. This random-order-based strategy is the default version of PGrad.
• PGrad-B: We can split each training batch into B smaller batches and construct a longer sampled trajectory of length nB + 1.
• PGrad-BMix: Our method is model- and data-agnostic; it is therefore complementary to and can be combined with many other DG strategies. As one example, we combine the random-order-based PGrad-B and MixUp (Zhang et al., 2017) into PGrad-BMix in our empirical analysis.
In PGrad and PGrad-F, the principal gradient's trajectory covers each training domain in D_tr exactly once. There are two possible limitations. (1) If the number of training domains n is tiny, a length-(n+1) trajectory will not provide enough information to achieve robustness. In the extreme case of n = 1, we obtain only one axis w_z, which reduces to ERM. (2) The current design can only eliminate discrepancies between different domains. Notable intra-domain variations also exist, because empirically approximating the expected loss may include noise due to data sampling, and batch-based optimization may induce additional bias. Based on this intuition, we propose a new design for sampling a trajectory: evenly splitting the samples {x^i_k, y^i_k} of each training domain D_i into B small batches. This strategy yields nB pseudo training domains. Such a design brings two benefits: (1) We can sample a longer trajectory S, as its length changes from n + 1 to nB + 1.
(2) Our design splits each domain's data into B batches and treats each batch as if it came from a distinct training domain. By learning the principal gradient from these nB pseudo-domains, we also address the intra-domain noise issue. We name this new design PGrad-B. Appendix (A.1) includes a figure comparing vanilla trajectory sampling with this extended trajectory sampling.
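The PGrad-B pseudo-domain construction described above can be sketched as follows; the helper name and toy batches are hypothetical:

```python
import numpy as np

def make_pseudo_domains(domain_batches, B, seed=0):
    """Split each domain's batch into B smaller batches and shuffle the
    resulting n*B pseudo-domains into one roll-out order (PGrad-B style).

    domain_batches: list of n arrays, one batch per training domain.
    Returns a list of n*B smaller batches defining the trajectory order.
    """
    rng = np.random.default_rng(seed)
    pseudo = []
    for batch in domain_batches:
        pseudo.extend(np.array_split(batch, B))   # B pieces per domain
    order = rng.permutation(len(pseudo))          # random visiting order
    return [pseudo[i] for i in order]

# toy example: n = 3 domains with 6 examples each, split into B = 3 pieces
batches = [np.arange(6) + 10 * i for i in range(3)]
pseudo = make_pseudo_domains(batches, B=3)
# 9 pseudo-domains; a trajectory rolled out over them has length 9 + 1
```

Rolling out one optimization step per pseudo-domain then yields the longer length-(nB+1) trajectory from which the principal gradient is extracted.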

2.4. CONNECTING TO RELATED WORKS

We can categorize existing DG methods into four broad groups. Invariant element learning. Learning invariant mechanisms shared across training domains provides a promising path toward DG. Recent literature has equipped various deep learning components, especially representation modules, with the invariance property to achieve DG (Li et al., 2018d;e; Muandet et al., 2013). The central idea is to minimize the distance or maximize the similarity between representation distributions P(f(X)|D) across training domains, so that prediction is based on statistically indistinguishable representations. Adversarial methods (Li et al., 2018d) and moment matching (Peng et al., 2019; Zellinger et al., 2017) are two promising approaches for distributional alignment. A recent line of work explores the connection between invariance and causality. IRM (Arjovsky et al., 2019) learns an invariant linear classifier that is simultaneously optimal for all training domains. In the linear case and under some constraints, the invariance of the classifier induces causality. Ahuja et al. further extend IRM by posing it as finding the Nash equilibrium (Ahuja et al., 2020) and by adding information bottleneck constraints to seek theoretical explanations (Ahuja et al., 2021). However, later work (Kamath et al., 2021) shows that even when capturing the correct invariances, IRM still tends to learn a suboptimal predictor. Compared to this stream of work, our method places no assumptions on either the marginal or the joint distributions. Instead, PGrad explores a promising gradient direction and is model- and data-agnostic. Optimization methods. One line of optimization-based DG work relates to group distributionally robust optimization (a.k.a. DRO) (Sagawa et al., 2019). Group DRO aims to tackle domain generalization by minimizing the worst-case training loss over all training distributions (rather than the average loss).
The second set of optimization DG methods is optimization-based meta-learning, which uses bilevel optimization for DG to achieve properties like global inter-class alignment (Dou et al., 2019) or local intra-class distinguishability (Li et al., 2018b). One recent work (Li et al., 2019) synthesizes virtual training and testing domains to imitate the episodic training of few-shot learning. Gradient manipulation. Gradient directions drive the updates of neural networks throughout training and are vital elements of generalization. In DG, the main goal is to learn a gradient direction that benefits all training domains (plus unseen domains). Gradient surgery (Mansilla et al., 2021) proposes to use the sign function as a signal to measure element-wise gradient alignment. Similarly, the authors of (Chattopadhyay et al., 2020) presented And-mask, which learns a binary gradient mask to zero out gradient components that have inconsistent signs across training domains. SAND-mask (Shahtalebi et al., 2021) added a tanh function to the mask generation to measure gradient consistency, extending And-mask by promoting gradient agreement. Fish (Shi et al., 2021) and Fishr (Rame et al., 2021) are two recent DG works motivated by gradient matching. They require the parallel calculation of the domain gradient from every training domain w.r.t. the current parameter vector. Fish maximizes the inner product of domain-level gradients; Fishr uses the variance of the gradient covariance as a regularizer to align the per-domain Hessian matrices. Our method PGrad differs from gradient matching by learning a robust gradient direction. Besides, our method efficiently approximates the Hessian with the training domains' Fisher information matrices. Appendix (A.4) includes a detailed analysis comparing parallel versus sequential domain-level training.
Furthermore, we adapt PGrad to parallel training and compare it against PGrad with sequential training and against ERM to justify our analysis; see the visualizations in Figure 6. We then show that gradient alignment is not necessarily a sufficient indicator of generalization ability in Figure 7. Others. Besides the categories above, other strategies have recently been adopted to tackle domain generalization. Data augmentation (Xu et al., 2021; Zhang et al., 2021; 2017; Volpi et al., 2018) generates new training samples or representations from training domains to prevent overfitting; it can endow a target model with desirable properties such as linearity via Mixup (Zhang et al., 2017) or object focus (Xu et al., 2021). Other strategies, like contrastive learning (Kim et al., 2021), representation disentangling (Piratla et al., 2020), and semi-supervised learning (Li et al., 2021), have also been developed for the DG challenge.

3. EXPERIMENTS

We conduct empirical experiments to answer the following questions:
Q1. Does PGrad successfully handle both synthetic and real-life distributional shifts?
Q2. Can PGrad handle various architectures (ResNet and DenseNet), data types (scene and satellite images), and tasks (classification and regression)?
Q3. Compared to existing baselines, does PGrad enable smoothly decreasing loss curves and generate smoother parameter trajectories?
Q4. Can PGrad act as a practical complementary approach that combines with other DG strategies?
Q5. How do the bottom eigenvectors of the roll-out trajectories affect the model's training dynamics and generalization ability?

3.1. DOMAINBED BENCHMARK

Setup and baselines. The DomainBed benchmark (Gulrajani & Lopez-Paz, 2021) is a popular suite designed for rigorous comparisons of domain generalization methods. DomainBed datasets focus on distribution shifts induced by synthetic transformations; we conduct extensive experiments on it to compare with SOTA methods. The testbed implements consistent experimental protocols across various datasets, and we use five datasets from DomainBed (excluding the two MNIST-related datasets) in our experiments. See the data details in Table 1. DomainBed offers a diverse set of algorithms for comparison. Following the categories summarized in Section 2.4, we compare with invariant element learning works: IRM (Arjovsky et al., 2019), MMD (Li et al., 2018d), DANN (Ganin et al., 2016), and CORAL (Sun & Saenko, 2016). Among optimization methods, we use GroupDRO (Sagawa et al., 2019) and MLDG (Li et al., 2018b). The most closely related works are those based on gradient manipulation, and we compare with Fish (Shi et al., 2021) and Fishr (Rame et al., 2021). Of the representation augmentation methods, we pick two popular works: MixUp (Zhang et al., 2017) and ARM (Zhang et al., 2021). DomainNet's additional model parameters in the final classification layer lead to memory constraints on our hardware at the default batch size of 32; we therefore use a lower batch size of 24. For our method variation PGrad-B, we set B = 3 for all datasets except DomainNet, where we use B = 2. We default to Adam (Kingma & Ba, 2017) as the optimizer to roll out a trajectory. All experiments use the DomainBed default architecture, where we finetune a pretrained ResNet50 (He et al., 2016). Results analysis. We aggregate results on each dataset by taking the average prediction accuracy over all domains; the results are summarized in Table 2. The per-domain prediction accuracy on each dataset is available in Appendix (A.7). We summarize our observations: 1)
ERM remains a strong baseline among all methods, and gradient-alignment methods provide promising results compared to other categories. 2) PGrad ranks first out of the 11 methods by average accuracy. Concretely, PGrad outperforms ERM on all datasets except DomainNet, gaining 1.8% on VLCS, 2.8% on OfficeHome, and 2.9% on TerraIncognita, with no improvement on DomainNet. 3) Our variation PGrad-B outperforms PGrad on almost all datasets except VLCS (where it is similar to PGrad). This showcases that intra-domain noise suppression can benefit OOD generalization: a longer trajectory enables PGrad to learn more robust principal directions. 4) The combined variation PGrad-BMix outperforms MixUp across all datasets. On average (see the last column of Table 2), PGrad-BMix is the best-performing strategy. This observation indicates that our method can be effectively combined with other DG categories to further improve generalization. Tuning k for noise suppression. As pointed out in Section 2.1, we achieve domain-specific noise suppression by aggregating only the coordinate axes {w_z}_{z=0}^{k} when learning the principal gradient ∇_p. To investigate the effect of k, we run experiments with different values of k for both PGrad and PGrad-B. The results on the PACS dataset are collected in Table 3. Note that for the default version of PGrad, the maximum number of training domains is n = 3; therefore, the length of the PGrad trajectory is upper bounded by 4. Table 3 shows that the generalization accuracy initially improves and then drops as k increases. If we use k = n + 1 (i.e., k = 4 for PGrad), domain-specific noise is included and aggregated from the principal coordinates W, and performance decreases compared with k = 3. The same pattern can be observed for PGrad-B (note: the length of its trajectory is upper bounded by nB + 1 = 10). Training loss curve analysis.
Learning and explaining a model update's behavior is an important step toward understanding its generalization ability. To answer Q3, we look for insights by visualizing domain-wise training losses as updates proceed. To account for randomness, we plot average results together with the standard deviation over 9 runs. The results for ERM and PGrad are visualized in Figure 2. Compared to ERM, our method PGrad shows smoother decreasing losses as training proceeds. Moreover, all training domains benefit from each update in PGrad. In contrast, ERM's decreasing loss on one domain can come with increases on other domains, especially late in training. We hypothesize this is because domain-specific noise takes a more dominant role as ERM training progresses. PGrad can effectively suppress domain-specific noise and optimize all training losses in unison without significant conflict. Moreover, the loss variances across training domains are stably minimized, achieving a similar effect to V-REx (Krueger et al., 2021) without an explicit variance penalty. In Appendix (A.2), we visualize four training trajectories trained with PGrad and ERM. ERM trajectories proceed over-optimistically at the beginning and turn sharply in late training; PGrad moves cautiously at every step and consistently in one direction.

3.2. WILDS BENCHMARK

WILDS is a curated benchmark of 10 datasets covering real-life distribution shifts in the wild, such as poverty mapping and land use classification. We apply our method to its two vision applications. Our goal is to investigate the scalability of PGrad under different model architectures and metrics and, more importantly, its performance when facing real-world OOD shifts. We conduct experiments on WILDS to explore both Q1 and Q2. For each dataset, we use the recommended metrics and model architecture. A summary of the datasets, metrics, and model architectures is provided in Table 4.
(I) The POVERTYMAP dataset is collected for poverty evaluation; the input x ∈ X is an 8-channel multispectral satellite image, and the output y ∈ Y is a real-valued asset wealth index. The dataset includes satellite images from 23 different countries and covers both urban and rural areas in each country. We use 13 countries as training domains, pick 5 other countries for model selection, and use the remaining 5 countries for testing. We calculate the Pearson correlation r between the predicted index and the ground truth, and report the average r across the 5 test domains and the two area types. (II) The objective of the FMOW dataset is to categorize land use based on RGB satellite images spanning 16 years and 5 geographical regions. The training domains contain images from the first 11 years, with the middle 3 years as validation domains and the last 2 years as test domains. We report the average region accuracy on both validation and test domains to evaluate our method under the geographical distribution-shift challenge. The training details and the hyperparameters we used can be found in Appendix (A.9). We compare ours with SOTA methods including IRM (Arjovsky et al., 2019), Coral (Sun & Saenko, 2016), and Fish (Shi et al., 2021). We repeat each experiment with three random seeds and report both the recommended metrics and their standard deviations on each dataset.

A SUPPLEMENTARY FOR PGrad: LEARNING ROBUST GRADIENTS FOR DOMAIN GENERALIZATION

This appendix includes the following contents:
• We include figures describing the different trajectory sampling methods in A.1.
• To understand how PGrad changes the optimization path of the model's parameters, we show the TSNE projection of parameters during training in A.2. In Figure 4, we visualize the projection for different datasets and test domains.
• The inclusion of the starting point during trajectory sampling is important; we analyze the reason for this effect in A.3.
• PGrad differs from related work in that it adopts a sequential training strategy when sampling a trajectory. We include a detailed discussion comparing the parallel training used in gradient-matching methods like Fish with our sequential training strategy in A.4. We empirically demonstrate that gradient alignment is not necessarily a sufficient indicator of generalization ability; see Figure 7.
• Our method makes two major theoretical contributions. First, instead of being driven by gradient alignment, PGrad learns a gradient flow from the tangent space spanned by the eigenvectors of the Hessian matrix; see A.5.1 for the detailed analysis. Second, in A.5.2, we show that PGrad has a deep connection to the efficient training of neural networks by projecting the high-dimensional parameter space onto a low-dimensional subspace. To further understand the evolution of the eigenvalue distributions over time, we visualize their average log values over 1k training-step intervals in Figure 8.
• A critical question is how the bottom eigenvectors affect the training dynamics and generalization ability. We design experiments showing that the bottom eigenvectors span the normal subspace perpendicular to the tangent space of the loss landscape. Updating the model along any direction lying within the normal subspace makes no change to the training loss but hurts generalization. See A.6 for details.
• Finally, in A.7, A.8, and A.9, we show the per-domain prediction accuracy of our PGrad variants and the experimental details on both benchmarks.

A.1 ILLUSTRATIVE EXPLANATION OF THE PROPOSED TRAJECTORY SAMPLING METHODS

We compare the three proposed trajectory sampling methods in Figure 3. Whereas vanilla trajectory sampling only considers inter-domain variations, the long trajectory sampling variant splits each per-domain batch into smaller batches to also eliminate intra-domain discrepancies arising from data collection, data sampling, etc.

Figure 4: Two-dimensional projection of the parameters' trajectories on different datasets. We use ResNet50 as the backbone and apply TSNE for projection. Both PGrad and ERM start from a similar random initialization. Increasing path thickness represents the later training phase. To reinforce the visual effect, we smooth the curve within a window of size 8.

A.2 REAL OPTIMIZATION TRAJECTORY VISUALIZATION

We visualize the optimization paths of both our method PGrad and the ERM baseline with TSNE projection. Concretely, we save the model's parameters to a memory buffer every 100 training steps and project them onto the xy-plane after 5,000 training steps. The trajectories for different datasets and test domains are shown in Figure 4. ERM moves over-confidently with a large step size at the beginning, then turns sharply in late training. Our method PGrad 'thinks fast' but 'moves cautiously': it samples a roll-out optimization trajectory and aggregates its principal directions, and its training curves stay smooth even late in training.

A.3 TRAJECTORY SAMPLING

The inclusion of the starting point Θ_0^t in trajectory sampling is important; otherwise, the learned principal gradient skips the gradient information from the first training domain at every update. We show the derivation below. To simplify notation, we write ∇_i for ∇_{Θ_{i-1}^t} L_{D_i} and assume there are only two training domains.

Sampled trajectory with starting point Θ_0^t:

S = {Θ_0^t → (Θ_0^t - ∇_1) → (Θ_0^t - ∇_1 - ∇_2)}

Trajectory centering:

Ŝ = {(2∇_1 + ∇_2)/3 → (-∇_1 + ∇_2)/3 → (-∇_1 - 2∇_2)/3}

Sampled trajectory without starting point Θ_0^t:

S = {Θ_1^t → (Θ_1^t - ∇_2)}

Trajectory centering:

Ŝ = {(1/2)∇_2 → -(1/2)∇_2}

After trajectory centering, the gradient information from the first training domain, ∇_1, is skipped entirely if the trajectory is sampled without the starting point Θ_0^t. To learn from all training domains, we therefore sample with the starting point included.
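The two-domain derivation above can be checked numerically. The sketch below builds both trajectories from random vectors (g1, g2 stand in for ∇_1, ∇_2) and confirms that centering without the starting point discards all information about ∇_1:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta0 = rng.normal(size=d)      # starting point Θ_0^t
g1 = rng.normal(size=d)          # ∇_1: gradient on the first training domain
g2 = rng.normal(size=d)          # ∇_2: gradient on the second training domain

# Trajectory WITH the starting point: Θ_0, Θ_0 - ∇_1, Θ_0 - ∇_1 - ∇_2
S_with = np.stack([theta0, theta0 - g1, theta0 - g1 - g2])
S_with_centered = S_with - S_with.mean(axis=0)

# Trajectory WITHOUT the starting point: Θ_0 - ∇_1, Θ_0 - ∇_1 - ∇_2
S_wo = np.stack([theta0 - g1, theta0 - g1 - g2])
S_wo_centered = S_wo - S_wo.mean(axis=0)

# With the starting point, the centered rows match the closed form above.
assert np.allclose(S_with_centered[0], (2 * g1 + g2) / 3)
assert np.allclose(S_with_centered[1], (-g1 + g2) / 3)
assert np.allclose(S_with_centered[2], (-g1 - 2 * g2) / 3)

# Without it, every centered row is ±∇_2/2: ∇_1 has vanished entirely.
assert np.allclose(S_wo_centered[0], g2 / 2)
assert np.allclose(S_wo_centered[1], -g2 / 2)
```

Both centered trajectories contain ∇_2, but only the one that includes Θ_0^t retains a component along ∇_1.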

A.4 COMPARISON BETWEEN THE PARALLEL TRAINING AND SEQUENTIAL TRAINING

In this subsection, we detail the comparison between our method PGrad and two other gradient-based methods: Fish and Fishr. Both are inspired by gradient alignment: they align the gradients from different domains and add the alignment as a penalty to the loss function. In their vanilla implementations, both Fish and Fishr calculate the per-domain gradient w.r.t. the current parameter Θ_0^t; we refer to this training paradigm as 'parallel training'. In contrast, our method learns a robust gradient direction from an optimization trajectory sampled by sequential training. We explain why parallel training is not a proper choice for learning the principal direction in PGrad. Principal directions capture the dominant changing directions. If we applied parallel training in PGrad, the centering of the trajectory would suppress the shared pattern and reinforce domain-specific noise; see Figure 5(b). Sequential training, instead, keeps enlarging shared gradient patterns with each multi-step update. We visualize and compare the two cases in Figure 5. Moreover, sequential training is more efficient than parallel training. In Figure 5, we intuitively explain that sequential training reinforces the learning of a clean direction while parallel training significantly suppresses it. We now design experiments to test this hypothesis. We adapt PGrad-B to learn the principal direction through parallel training. We fix all random mechanisms and compare PGrad-B under sequential training, PGrad-B under parallel training, and ERM. Test accuracy as a function of training epoch is visualized in Figure 6. In the left panel, we use C, P, R as training domains; in the right panel, we use A, P, R. We attribute the performance drop of PGrad-B (parallel) to the enlarged noise component and suppressed clean direction. The observations are consistent with the above analysis.
Another interesting observation is that curves from PGrad-B are smoother than those of ERM. We highlight the novelty of our method by showing that PGrad achieves domain generalization without constraining gradient alignment. Specifically, we define a gradient alignment measurement as the mean cosine gradient similarity across all training-domain pairs:

(1/n²) Σ_{i≠j} ⟨∇(D_i), ∇(D_j)⟩ / (||∇(D_i)|| ||∇(D_j)||).

A.5.1 CONNECTION TO THE SPECTRAL ANALYSIS OF THE HESSIAN

The Hessian matrix, widely used for analyzing a model's training behaviour, is defined as the second-order derivative of the loss. Similar to NTK (Jacot et al., 2018), we treat the loss L as a functional acting on the parameters Θ:

L(Θ) = E_{x,y} L[Θ, (x, y)] ≈ (1/|x × y|) Σ_i L[Θ, (x_i, y_i)].

The Taylor expansion around a parameter Θ is

L(Θ') = L(Θ) + (Θ' - Θ)^T ∇_Θ L + (1/2)(Θ' - Θ)^T H (Θ' - Θ) + O(||Θ' - Θ||³),

where H_{i,j} = ∂²L / (∂Θ_i ∂Θ_j). The Hessian matrix H captures local geometric properties of the loss landscape around Θ. Computing the second-order gradient exactly is impractical, especially for modern neural network architectures. Under certain mild regularity conditions and with a log-likelihood loss function (Schervish, 2012), we can approximate H with the outer product of the gradient. Specifically,

I = ∇_Θ L ⊗ ∇_Θ L = -∂²L/∂Θ² = -H, (17)

where I is also known as the Fisher information matrix (FIM). We now explain how our method PGrad automatically approximates and aggregates the eigenvalues of the Hessian matrix through the proposed training procedure. We sample a trajectory S = {Θ_0, Θ_1, ..., Θ_n} ∈ R^{(n+1)×d}. In the following, we show that the trajectory centering operation is equivalent to taking the average of the training domains' Hessian approximations.

Lemma A.1 The centered trajectory Ŝ is a linear transformation of the domain-specific gradients, and its rows can be interpreted as domain-wise gradient vectors starting from the same initialization.
The shared initialization is the trajectory center of S. The first half of the lemma is easy to show: any vector within a convex hull can be recovered by a linear combination of its edges; see Appendix A.3 for the derivation in a simple case. It implies that the centered trajectory Ŝ contains information as rich as the training domains' gradient matrix. For the sampled trajectory S, the center is calculated as

S_o = (1/(n+1)) Σ_{i=0}^n Θ_i.

We can re-interpret the sampled trajectory in a local coordinate system centered at S_o. Since the update steps are small enough, we have:

Ŝ = [Θ_0 - S_o, Θ_1 - S_o, ..., Θ_n - S_o] = [∇̂_0, ∇̂_1, ..., ∇̂_n].

The new gradients {∇̂_i}_{i=0}^n share the pseudo-initialization S_o. We then proceed to show that the covariance of the centered trajectory gives the mean of the training domains' FIMs; PGrad uses the training domains' average FIM to approximate the true expected FIM:

(1/n) Ŝ^T Ŝ = (1/n) [∇̂_0, ∇̂_1, ..., ∇̂_n] ⊗ [∇̂_0, ∇̂_1, ..., ∇̂_n] (18)
= (1/n) Σ_i ∇̂_i ⊗ ∇̂_i = (1/n) Σ_i I_i = -(1/n) Σ_i H_i,

where I_i and H_i are the domain-specific FIM and Hessian matrix, respectively. The approximation is a covariance matrix, which is positive semi-definite (PSD) and has eigenvalue-eigenvector pairs {(λ_z, v_z)}_{z=0}^n with λ_0 > λ_1 > ... > λ_n. The eigenvalue λ_z is the curvature of the loss in the direction of v_z in the neighborhood of S_o. The training behaviour of a neural network is determined by the distribution of these eigenvalues; in particular, first-order optimization methods slow down significantly when {λ_z}_{z=0}^n are highly spread out (Bottou & Bousquet, 2007; Ghorbani et al., 2019). This property inspires us to zero out insignificant directions and use only the directions with large curvature to derive our gradient direction ∇_p. In contrast, Fishr (Rame et al., 2021) uses the current parameter value as the initialization and approximates the per-domain Hessian matrices in parallel with their gradient variances.
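The identity in Eq. (18) — the covariance of the centered trajectory equals the average of the per-gradient outer products, and is PSD — can be checked numerically on random vectors:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 4, 6
S = rng.normal(size=(n, d))               # rows: pseudo-gradients sharing center S_o
S = S - S.mean(axis=0)                    # trajectory centering

cov = S.T @ S / n                         # (1/n) Ŝ^T Ŝ
outer_sum = sum(np.outer(g, g) for g in S) / n   # (1/n) Σ_i ∇̂_i ⊗ ∇̂_i
assert np.allclose(cov, outer_sum)

# The matrix is PSD: all eigenvalues are (numerically) non-negative.
eigvals = np.linalg.eigvalsh(cov)
assert eigvals.min() > -1e-10
```

The eigenvalues of this d×d matrix are exactly the curvature estimates {λ_z} discussed above, and the eigenvectors are the candidate principal directions.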
It defines the Fishr regularization as the squared Euclidean distance between gradient variance matrices, bringing the domain-wise Hessians closer; its goal is to align the second-order gradients of the training domains. The high computational cost of parallel training constrains Fishr to operate only on the last classification layer in practice. Compared with other gradient manipulation works that emphasize alignment or matching, PGrad uses the average of the FIMs to approximate the underlying Hessian matrix under the DG setup. The averaging also reduces the noise variance by a factor of 1/n, where n is the number of training domains; this property also explains why PGrad-B achieves better generalization than PGrad. Second, instead of matching eigenvalues of per-domain Hessians, we learn a robust gradient flow as a combination of the eigenvectors reflecting the dominant changes and zero out the remaining directions, which are empirically shown to be toxic for generalization (see Table 3). Our approximation of the Hessian matrix and the introduced bijection allow us to operate efficiently on the high-dimensional parameter space.

A.5.2 CONNECTION TO THE LEARNING BEHAVIOUR OF THE NEURAL NETWORKS

Modern SOTA neural networks are usually over-parameterized (Allen-Zhu et al., 2019; Zou & Gu, 2019), and improving the training efficiency of high-dimensional neural networks is an active research direction (Li et al., 2018a). Recent works (Gur-Ari et al., 2018; Gressmann et al., 2020) have shown that deep neural networks can be optimized in a subspace of much smaller dimensionality than their native parameter space; we show how PGrad connects to this line of work. Previous work (Li et al., 2018c; Gur-Ari et al., 2018; Gressmann et al., 2020) proved that the functional induced by a neural network varies most along only some specific directions. We therefore focus on recovering the low-dimensional subspace where the loss function L varies the most on average, and project the native parameter space onto that subspace. We formalize the discussion as follows. Given a direction v, the directional derivative of the loss L at Θ is defined as

∂L_v(Θ) = [∇_Θ L(Θ)]^T v,

and we can measure the expected scale (or length) of the directional derivative as

E_Θ |∂L_v(Θ)|² = E_Θ [v^T ∇_Θ L(Θ) ⊗ v^T ∇_Θ L(Θ)] = v^T E_Θ [∇_Θ L(Θ) ⊗ ∇_Θ L(Θ)] v.

However, the distribution of Θ is unavailable, so we use the parameters from different training domains for an empirical approximation:

E_Θ |∂L_v(Θ)|² ≈ v^T ((1/n) Σ_i ∇̂_i ⊗ ∇̂_i) v = (1/n) v^T Ŝ^T Ŝ v.

We demonstrate that learning the principal gradient direction enables us to find a low-rank, noise-resistant updating space by showing the following lemma.

Lemma A.2 Suppose we search for a k-dimensional linear projection M ∈ R^{k×d} of the original parameter space Θ ∈ R^d that keeps the largest directional derivatives with respect to the loss L. If the eigenvalue-eigenvector pairs of the outer product matrix E_Θ[∇_Θ L(Θ) ⊗ ∇_Θ L(Θ)] are {(λ_z, v_z)}_{z=0}^n with λ_0 > λ_1 > ... > λ_n, then

Span{M_0, M_1, ..., M_k} = Span{v_0, v_1, ..., v_k}. (13)

Lacking access to data from the test distribution during training makes model selection harder for DG than for other supervised learning tasks. In the main paper, we follow the popular setup of sampling 20% of the data from each training domain and grouping them into a validation set; validation accuracy is then used as the indicator of the optimal model. We show the prediction accuracy of PGrad and its extensions on each domain in Tables 7-11.
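Lemma A.2's claim — that the directions maximizing the empirical quadratic form (1/n) v^T Ŝ^T Ŝ v are spanned by the top eigenvectors — can be illustrated on a random centered trajectory (a synthetic sketch, not the paper's experiment):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 8, 5
S = rng.normal(size=(n, d))
S = S - S.mean(axis=0)                    # centered trajectory Ŝ
M = S.T @ S / n                           # quadratic form for E|∂L_v|^2

eigvals, eigvecs = np.linalg.eigh(M)      # eigenvalues in ascending order
v_top = eigvecs[:, -1]                    # top eigenvector

def expected_sq_derivative(v):
    v = v / np.linalg.norm(v)
    return float(v @ M @ v)

# No random unit direction beats the top eigenvector (Rayleigh quotient).
best_random = max(expected_sq_derivative(rng.normal(size=d)) for _ in range(100))
assert expected_sq_derivative(v_top) >= best_random - 1e-12
# And the maximum value equals the largest eigenvalue.
assert np.isclose(expected_sq_derivative(v_top), eigvals[-1])
```

This is just the standard Rayleigh-quotient argument: restricting updates to the top-k eigenvectors keeps the directions along which the loss varies most on average.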

A.8 HYPERPARAMETER SEARCH

We follow a training and evaluation protocol similar to the DomainBed experiments (Gulrajani & Lopez-Paz, 2021). We list our hyperparameter search space in Table 6. We design a narrow search space to prevent undersampling and potential performance degeneration from a small number of random trials. We use a fixed batch size of 32, which ensures there are adequate samples to provide precise gradient directions when sampling with our long, high-entropy method; for example, there are still about 11 samples in each small batch when B = 3. We also found that the outer learning rate γ = 0.1 works consistently well on each dataset. DomainNet's additional model parameters in the final classification layer lead to memory constraints on our hardware at the default batch size of 32; we therefore use a lower batch size of 24 and increase the training steps to 6,000.

A.9 EXPERIMENTAL DETAILS ON WILDS BENCHMARK

A summary of the two vision datasets in the WILDS benchmark is shown in Table 4. For the POVERTYMAP task, the inputs are 8-channel satellite images; we therefore modified the first convolutional layer of ResNet18 (He et al., 2016) to accommodate multispectral input. To sample a trajectory of length n+1 in the high-dimensional parameter space Θ, we randomly sample n training domains and sequentially update the parameters starting from Θ^t on each of them. We use the Adam optimizer for trajectory construction with a default learning rate of 1e-3 and no weight decay. We set the outer learning rate γ = 0.1/n, which adjusts the step size based on the number of sampled training domains. We train the model with the proposed PGrad for 200 epochs and select the optimal model based on the average Pearson coefficient on the validation domains. For the land use classification task FMOW, we follow the exact same training and evaluation protocol as Fish (Shi et al., 2021).
We find a good starting point by updating the ERM objective with an Adam optimizer for 24,000 iterations with a learning rate of 1e-4. After pretraining, we proceed with our proposed PGrad and keep tuning the model for 10 epochs with outer learning rate γ = 0.01/n. After training is completed, we report the worst regional accuracy on the test domains; this measurement is designed by WILDS (Koh et al., 2021) to test models' generalization ability under both temporal and regional distribution shift. The numerical results for both datasets are available in Table 5.



In the rest of this paper, we use the terms "domain" and "distribution" interchangeably. Note: we leave hyperparameter tuning details and some ablation analysis results to Appendix (A.7 to A.9). We follow the exact same training protocols as in Fish (Shi et al., 2021). On FMOW, we pretrain the model with ERM to a suboptimal starting point and then proceed with PGrad.



Figure 1: Overview of our PGrad training strategy. With the current parameter Θ^t, we first obtain a roll-out trajectory Θ^t → Θ_1^t → Θ_2^t → Θ_3^t by sequentially optimizing across all training domains D_tr = {D_i}_{i=1}^3. Then PGrad updates Θ^t by extracting the principal gradient direction ∇_p of the trajectory. A target model's generalization is evaluated on unseen (OOD) test domains T_j.
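The strategy in Figure 1 can be sketched end to end in a few dozen lines. This is a minimal NumPy toy, not the paper's implementation: the quadratic per-domain losses, the inner/outer learning rates, the SVD-based extraction, and the exact direction/length calibration are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
d, n_domains, k = 10, 3, 2

# Stand-ins for training domains: quadratic losses with different optima.
optima = [rng.normal(size=d) for _ in range(n_domains)]

def domain_grad(theta, i):
    return theta - optima[i]          # gradient of 0.5 * ||theta - theta_i*||^2

def avg_loss(theta):
    return float(np.mean([0.5 * np.linalg.norm(theta - o) ** 2 for o in optima]))

def pgrad_step(theta, lr_in=0.1, lr_out=0.5):
    # 1) Roll-out trajectory: sequential updates across shuffled domains.
    traj = [theta]
    cur = theta
    for i in rng.permutation(n_domains):
        cur = cur - lr_in * domain_grad(cur, i)
        traj.append(cur)
    S = np.stack(traj)
    S_hat = S - S.mean(axis=0)        # trajectory centering
    # 2) Principal directions of the centered trajectory via SVD.
    _, _, Vt = np.linalg.svd(S_hat, full_matrices=False)
    net_move = theta - S[-1]          # net roll-out displacement (a descent proxy)
    grad_p = np.zeros(d)
    for z in range(k):                # aggregate the top-k principal directions,
        v = Vt[z]                     # sign-calibrated against the net movement
        grad_p += v if v @ net_move >= 0 else -v
    # Length calibration (assumed): match the scale of the net roll-out movement.
    grad_p *= np.linalg.norm(net_move) / (np.linalg.norm(grad_p) + 1e-12)
    return theta - lr_out * grad_p

theta = rng.normal(size=d)
losses = [avg_loss(theta)]
for _ in range(50):
    theta = pgrad_step(theta)
    losses.append(avg_loss(theta))
assert losses[-1] < losses[0]         # the aggregated direction makes progress
```

Even with only the top-2 directions retained, the update decreases the average training loss across domains while discarding the lowest-variance (noise) components of the trajectory.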

Figure 2: Visualizing domain-wise training losses on VLCS. Curves are averages over 9 runs, surrounded by ±σ standard deviation bands. For comparison, the loss curves start from 1,000 epochs.

• We empirically compare gradient alignment and generalization in Figure 7. • To justify the analysis of PGrad under sequential versus parallel training, we design experiments comparing PGrad-B (sequential), PGrad-B (parallel), and ERM. The observations are consistent with the intuitive analysis: using ERM as the baseline, PGrad-B (sequential) enlarges the clean direction while PGrad-B (parallel) enlarges the noise. See Figure 6 in A.4.

Figure 3: Comparison of the three trajectory sampling methods, assuming the number of training domains n = 3. The top-left box shows fixed-order trajectory sampling (PGrad-F). The top-right box shows random-order trajectory sampling, the default PGrad. The bottom box represents PGrad-B, a version of long trajectory sampling with B = 2.

(a): Sequential training will reinforce robust direction

Figure 5: Comparison of sequential and parallel training for learning the principal gradient. Stars represent domain-specific optimal minima, Θ^{k-1} is the starting point, {Θ_i^k}_{i=1}^3 forms a trajectory, and the ellipse captures the current principal coordinate chart.

Figure 6: Comparison of PGrad-B (sequential), PGrad-B (parallel), and ERM on the OfficeHome dataset.

In Figure 7, we visualize test-domain accuracy and training-domain gradient alignment as functions of the training epoch. The figure implies that gradient alignment is not a sufficient indicator of generalization ability: the test accuracy of PGrad is lower bounded by that of Fish, yet Fish aligns the training domains' gradients better than PGrad. Secondly, the right panel shows that the smoothness of the gradient alignment curve correlates positively with test accuracy. Starting from 3,000 epochs, the alignment curve of PGrad becomes smooth, and the model achieves higher prediction accuracy on the test domain; the starting phase of Fish reveals the same pattern.
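The alignment measure plotted in Figure 7 — the mean pairwise cosine similarity (1/n²) Σ_{i≠j} ⟨∇(D_i), ∇(D_j)⟩ / (||∇(D_i)|| ||∇(D_j)||) — can be computed directly from per-domain gradients; the function name below is ours:

```python
import numpy as np

def gradient_alignment(domain_grads):
    """Mean pairwise cosine similarity across training-domain gradients,
    using the 1/n^2 normalization from the definition in the text."""
    G = [np.asarray(g, dtype=float) for g in domain_grads]
    G = [g / np.linalg.norm(g) for g in G]
    n = len(G)
    total = sum(float(G[i] @ G[j]) for i in range(n) for j in range(n) if i != j)
    return total / n ** 2
```

Note that the diagonal i = j terms are excluded, so with n = 2 perfectly aligned gradients the measure is 2/n² = 0.5 rather than 1; orthogonal gradients give 0.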

Figure 7: Visualization of test domain accuracy and training domain gradient alignment.

Figure 9: In the left panel, we learn the principal gradient always from the bottom eigenvectors; the middle panel starts with the top eigenvectors and switches to the bottom ones after 200 epochs; the right panel always uses the top eigenvectors. The vertical line indicates when the intervention happened. The y-axis is the training loss; the x-axis is the training epoch.

A summary of the DomainBed datasets, metrics, and architectures we used.

Test accuracy (%) on five datasets from the DomainBed benchmark. We group 20% of the data from each training domain to construct the validation set for model selection.

Analysis of the effect of varying k. The experiments are performed on the PACS dataset. We highlight the first- and second-best results.

A summary of the WILDS datasets, metrics, and architectures we used.

Results on WILDS benchmark. Top two results are highlighted.

PGrad achieves state-of-the-art results on both datasets, and the variation PGrad-B further improves the performance. On POVERTYMAP, PGrad demonstrates a strong correlation between the predicted wealth index and the ground truth by achieving the highest Pearson coefficient on both validation and testing domains. Its low standard deviation indicates PGrad is stable across different random seeds. PGrad-B achieves better domain generalization by enabling more extended trajectory sampling. Similarly, on the FMOW data, PGrad improves over baseline IRM and Coral, and is on par with Fish. PGrad-B achieves the best accuracy. These experimental results demonstrate that the proposed methods are effective across different model architectures and can successfully handle the real-life distributional shift.

Sample space for hyperparameters.

Per-domain prediction accuracy (%) on VLCS dataset

Per-domain prediction accuracy (%) on the PACS dataset.

Per-domain prediction accuracy (%) on the OfficeHome dataset.

Per-domain prediction accuracy (%) on the TerraIncognita dataset.

Per-domain prediction accuracy (%) on the DomainNet dataset.


Published as a conference paper at ICLR 2023.

Figure 8: The length of the trajectories is nB + 1 = 10. We plot the distribution of the top-9 eigenvalues, since the smallest one falls below numerical precision. We calculate the contribution of each component by normalizing with their sum. To smooth the results, we take the log of the average across every thousand training steps. The ∇_p is learned by aggregating the top-4 eigenvectors.
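The per-component contribution described above (each eigenvalue normalized by the sum, then logged for plotting) can be sketched as follows; the eigenvalues here are made-up stand-ins, not values from the paper:

```python
import numpy as np

def eig_contributions(eigvals):
    """Fraction of total curvature carried by each eigenvalue."""
    eigvals = np.asarray(eigvals, dtype=float)
    return eigvals / eigvals.sum()

# e.g. top-9 eigenvalues from one training interval (illustrative values)
vals = np.array([9.0, 4.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.1, 0.1])
contrib = eig_contributions(vals)
assert np.isclose(contrib.sum(), 1.0)   # contributions form a distribution
log_contrib = np.log(contrib)           # log values, as plotted in Figure 8
```

Averaging `contrib` over each thousand-step interval before taking the log gives the smoothed curves shown in the figure.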

A.6 WHAT ARE THE TAIL EIGENVECTORS?

To answer the last question we posed in Section 3, we analyze the effect of the bottom eigenvectors on the training dynamics. Ablation studies in Table 3 indicate that including bottom eigenvalues in the principal gradient hurts generalization ability. We add new experiments to clarify that the bottom eigenvectors are noise signals with 'special' properties. Concretely, we design three different strategies to update the model with PGrad:

• always from the bottom eigenvectors;
• starting from the top eigenvectors and switching to the bottom ones midway;
• always from the top eigenvectors.

The training loss stays constant in case (1), and in case (2) after the switch, even when we set the step size to be meaningfully large; it keeps decreasing in setup (3). These results imply that the bottom components span the subspace perpendicular to the tangent space of the loss landscape: moving along them does not change the training loss but is not helpful for generalization. As in the original setup, we calibrate the direction and length for optimization purposes. We show the training loss curves in Figure 9.

A.7 EXPERIMENTAL SETUP DETAILS FOR DOMAINBED

Following the training protocol described in DomainBed, we run experiments on each domain with a random hyperparameter search to reduce selection bias. The DomainBed experiments (Gulrajani & Lopez-Paz, 2021) select the best model from random samples of the hyperparameter search space and repeat this search process 3 times. Specifically, for each domain we draw 2 random hyperparameter combinations from a relatively narrow range and repeat the search 3 times to report confidence intervals. For a dataset with n domains, we train n models, each holding out one domain T_te for testing and using the rest as training domains D_tr. This design leads to a total of 6n experiments per dataset for each method.
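The A.6 observation — stepping along a bottom eigenvector leaves the training loss unchanged, while stepping along a top eigenvector moves it — can be reproduced on a toy loss whose Hessian is deliberately low-rank (our construction, not the paper's network):

```python
import numpy as np

rng = np.random.default_rng(4)
d, r = 6, 2                             # loss varies only in an r-dim subspace
A = rng.normal(size=(r, d))             # rows span the "tangent" subspace

def loss(theta):
    return 0.5 * float(np.linalg.norm(A @ theta) ** 2)

theta = rng.normal(size=d)
H = A.T @ A                             # Hessian: rank r, bottom d-r eigenvalues = 0
eigvals, eigvecs = np.linalg.eigh(H)    # ascending eigenvalue order
v_bottom, v_top = eigvecs[:, 0], eigvecs[:, -1]

# A meaningfully large step along a bottom eigenvector: loss is unchanged,
# because v_bottom lies in the null space of A (the "normal" subspace).
assert np.isclose(loss(theta + 5.0 * v_bottom), loss(theta))

# A gradient step scaled by the top curvature strictly reduces the loss.
g = H @ theta
eta = 1.0 / eigvals[-1]
assert loss(theta - eta * g) < loss(theta)
```

This mirrors cases (1) and (3) above: directions in the normal subspace cannot change the training loss, so any movement along them can only perturb the model in ways invisible to training yet harmful to generalization.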

