DOMAIN-ADJUSTED REGRESSION OR: ERM MAY ALREADY LEARN FEATURES SUFFICIENT FOR OUT-OF-DISTRIBUTION GENERALIZATION Anonymous

Abstract

A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. Our findings also imply that given a small amount of data from the target distribution, retraining only the last linear layer will give excellent performance. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. Evaluated on finetuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.

1. INTRODUCTION

The historical motivation for deep learning focuses on the ability of deep neural networks to automatically learn rich, hierarchical features of complex data (LeCun et al., 2015; Goodfellow et al., 2016). Simple Empirical Risk Minimization (ERM), with appropriate regularization, results in high-quality representations which surpass carefully hand-selected features on a wide variety of downstream tasks. Despite these successes, or perhaps because of them, the dominant focus of late is on the shortcomings of this approach: recent work points to the failure of networks trained with ERM to generalize under even moderate distribution shift (Recht et al., 2019; Miller et al., 2020). A common explanation for this phenomenon is reliance on "spurious correlations" or "shortcuts", where a network makes predictions based on structure in the data which generalizes on average in the training set but may not persist in future test distributions (Poliak et al., 2018; Geirhos et al., 2019; Xiao et al., 2021). Many proposed solutions implicitly assume that this problem is due to the entire neural network: they suggest an alternate objective to be minimized over a deep network in an end-to-end fashion (Sun & Saenko, 2016; Ganin et al., 2016; Arjovsky et al., 2019). These objectives are complex, poorly understood, and difficult to optimize. Indeed, the efficacy of many such objectives was recently called into serious question (Zhao et al., 2019; Rosenfeld et al., 2021; Gulrajani & Lopez-Paz, 2021). Though a neural network is often viewed as a deep feature embedder with a final linear predictor applied to the features, it is still unclear, and to our knowledge has not been directly asked or tested, whether these issues are primarily because of (i) learning the wrong features or (ii) learning good features but failing to find the best-generalizing linear predictor on top of them.
We begin with a simple experiment (Figure 1) to try to distinguish between these two possibilities: we train a deep network with ERM on several domain generalization benchmarks, where the task is to learn a predictor using a collection of distinct training domains and then perform well on a new, unseen domain. After training, we freeze the features and separately learn a linear classifier on top of them. Crucially, when training this classifier (i.e., retraining just the last linear layer), we give it an unreasonable advantage by optimizing on both the train and test domains; henceforth we refer to this as "cheating". Since we use just a linear classifier, this process establishes a lower bound on what performance we could plausibly achieve using standard ERM features. We then separately cheat while training the full network end-to-end, simulating the idealized setting with no distribution shift. Note that in neither case do we train on the test points; our cheating entails training on (different) samples from the test domain, which are assumed unavailable in domain generalization. Notably, we find that simple (cheating) logistic regression on frozen deep features learned via ERM results in enormous improvements over current state of the art, on the order of 10-15%.

Figure 1: Accuracy via "cheating": dagger (†) denotes access to test domain at train-time. Each letter is a domain. Dark blue is approximate SOTA, orange is our proposed DARE objective, light grey represents cheating while retraining the linear classifier only. All three methods use the same features, attained without cheating. Dark grey is "ideal" accuracy, cheating while training the entire deep network. Surprisingly, cheating only for the linear classifier rivals cheating for the whole network. Cheating accuracy on pretrained features (light blue) makes clear that this effect is due to finetuning on the train domains, and not simply overparameterization (i.e., a very large number of features).
In fact, it usually performs comparably to the full cheating method, which learns both features and classifier end-to-end with test domain access, sometimes even outperforming it. Put another way, cheating while training the entire network rarely does significantly better than cheating while training just the last linear layer. One possible explanation for this is that the pretrained model is so overparameterized as to effectively be a kernel with universal approximation power; in this case, the outstanding performance of a cheating linear classifier on top of these features would be unsurprising. However, we find that this cheating method does not ensure good performance on pretrained features, which implies that we are not yet in such a regime and that the effect we observe is indeed due to finetuning via ERM. Collectively, these results suggest that training modern deep architectures with ERM and established in-distribution training and regularization practices may be "good enough" for out-of-distribution generalization and that the current bottleneck lies primarily in learning a simple, robust predictor.

Motivated by these findings, we propose a new objective, which we call Domain-Adjusted Regression (DARE). The DARE objective is convex and it learns a linear predictor on frozen features. Unlike invariant prediction (Peters et al., 2016), which projects out feature variation such that a single predictor performs acceptably on very different domains, DARE performs a domain-specific adjustment to unify the environmental features in a canonical latent space. Based on the presumption that standard ERM features are good enough (made formal in Section 4), DARE enjoys strong theoretical guarantees: under a new model of distribution shift which captures ideas from invariant/non-invariant latent variable models, we precisely characterize the adversarial risk of the DARE solution against a natural perturbation set, and we prove that this risk is minimax.
We further provide the first finite-environment convergence guarantee to the minimax risk, improving over existing results which merely demonstrate a threshold in the number of observed environments at which the solution is discovered (Rosenfeld et al., 2021; Chen et al., 2021; Wang et al., 2022). Finally, we show how our objective can be modified to leverage access to unlabeled samples at test-time. We use this to derive a method for provably effective "just-in-time" unsupervised domain adaptation, for which we provide a finite-sample excess risk bound. Evaluated on finetuned features, we find that DARE compares favorably to existing methods, consistently achieving equal or better performance. We also find that methods which previously underperformed on these benchmarks do much better in this frozen feature setting, often besting ERM. This suggests that these approaches are beneficial for linear prediction (the setting in which they are understood and justified) but using them to train deep networks may result in worse features.

2. ERM LEARNS SURPRISINGLY USEFUL FEATURES

Our experiments are motivated by a simple question: is the observed failure of deep neural networks to generalize out-of-distribution more appropriately attributed to inadequate feature learning or inadequate robust prediction? Both could undoubtedly be improved, but we are concerned with which currently serves as the primary bottleneck. It is typical to train the entire network end-to-end and then evaluate on a test distribution; in reality, this measures the quality of the interaction of the features and the classifier, not of either one individually. Using datasets and methodology from DOMAINBED (Gulrajani & Lopez-Paz, 2021), we finetune a ResNet-50 (He et al., 2016) with ERM on the training domains to extract features. Next, we cheat while learning a linear classifier on top of these frozen features by optimizing on both the train and test domains. We compare this cheating linear classifier to a full cheating network trained on all domains end-to-end. If it is the case that ERM learns features which do not generalize, we should expect that the cheating linear classifier will not substantially improve over the current state of the art and will perform significantly worse than cheating end-to-end, since the latter method can adapt the features to better suit the test domain. Instead, we find that simply giving the linear predictor access to all domains while training makes up the vast majority of the gap between current state of the art (ERM with heavy regularization) and the ideal setting where we train a network on all domains. In other words, ERM produces features which are informative enough that a linear classifier on top of these frozen features is, in principle, capable of generalizing almost as well as if we had access to the test domain when training the entire network. Figure 2 in the Appendix depicts the evaluation methodology described above, along with a more detailed explanation.
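The probe protocol above can be sketched on synthetic data. This is a toy stand-in, not the DomainBed pipeline: the "features" are Gaussian draws whose label rule is invariant only after removing each domain's mean shift, and all function names are ours. In the paper, the inputs would instead be frozen ResNet-50 features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d = 32

def make_domain(n, b):
    """Toy 'frozen features' for one domain: mean shift b, label invariant
    given the centered first coordinate."""
    X = rng.normal(size=(n, d)) + b
    y = ((X[:, 0] - b[0]) + 0.1 * rng.normal(size=n) > 0).astype(int)
    return X, y

train = [make_domain(500, rng.normal(size=d)) for _ in range(3)]
b_test = 2.0 * rng.normal(size=d)
X_cheat, y_cheat = make_domain(500, b_test)  # extra samples from the test domain
X_eval, y_eval = make_domain(500, b_test)    # disjoint evaluation set

X_tr = np.vstack([X for X, _ in train])
y_tr = np.concatenate([y for _, y in train])

# Standard probe: linear classifier trained on the training domains only.
plain = LogisticRegression(max_iter=2000).fit(X_tr, y_tr)
# "Cheating" probe: also sees (different) samples from the test domain.
cheat = LogisticRegression(max_iter=2000).fit(
    np.vstack([X_tr, X_cheat]), np.concatenate([y_tr, y_cheat]))

acc_plain = plain.score(X_eval, y_eval)
acc_cheat = cheat.score(X_eval, y_eval)
print(f"plain probe: {acc_plain:.2f}  cheating probe: {acc_cheat:.2f}")
```

Both probes use the same frozen features; only the data seen by the linear layer differs, mirroring the light-grey bars in Figure 1.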
We conjecture that this phenomenon occurs more broadly: our findings suggest that for many applications, existing features learned via ERM may be "good enough" for out-of-distribution generalization in deep learning. Based on this idea, we posit that future work would benefit from modularity, working to improve representation learning and robust classification/regression separately. There are several distinct advantages to this approach:

• More robust comparisons and better reproducibility: Current methods have myriad degrees of freedom which makes informative comparisons difficult; evaluating feature learning and robust regression separately eliminates many of these sources of experimental variation.

• Less compute overhead: Training large networks to learn both features and a classifier is expensive. Benchmarks could include files with the weights for ready-to-use deep feature embedders trained with various objectives; these models can be much larger and trained on much more data than would be feasible for many. Using these features, academic researchers could thus make faster progress on better classifiers with less compute.

• Modular theoretical guarantees: Conditioning on the frozen features, we can use more classical analyses to provide guarantees for the simpler parametric classifiers learned on top of these features. For example, Bansal et al. (2021) derive a generalization bound for simpler classifiers which is agnostic to the complexity of the features on which they are trained.

We conclude by emphasizing that while the predictor in our experiments is linear, future methods need not be. Rather, we are highlighting that there may be no need for complex, expensive, highly variable regularization of a deep network when much simpler approaches suffice.

3. THE DOMAIN-ADJUSTED REGRESSION OBJECTIVE

The goal of many prior methods in deep domain adaptation or generalization is to learn a single network which does well on all environments simultaneously, often by throwing away non-invariant components (Peng et al., 2019; Arjovsky et al., 2019). While invariance is a powerful framework, there are some clear drawbacks, such as the need to throw away possibly informative features. In settings where we expect the distribution shift to be less than worst-case, this would be unnecessarily conservative. Instead, we reframe the problem by thinking of each training domain as a distinct transformation from a shared canonical representation space. In this framing, we can "adjust" each domain in order to undo these transformations, aligning the representations to learn a single robust predictor in this unified space. Specifically, we propose to whiten each domain's features to have zero mean and identity covariance, which can be thought of as a "domain-specific batchnorm" but using the full feature covariance. This idea is not totally new: some prior works align the moments between domains to improve generalization. The difference is that DARE does not learn a single featurizer which aligns the moments, but rather aligns the moments of already learned features; the latter approach maintains useful variation between domains which would be eliminated by the former. DARE is closer to methods which learn separate batchnorm parameters per domain over a deep network, possibly adjusting at test-time (Seo et al., 2019; Chang et al., 2019; Segù et al., 2020); these methods perform well, but they are entirely heuristic-based, difficult to optimize, and come with no formal guarantees. Our theoretical analysis thus serves as preliminary justification for the observed benefits of such methods, which have so far lacked serious grounding.
To begin, define E as the set of observed training environments, each of which is defined by a distribution p_e(x, y) over features x ∈ R^d and class labels y ∈ [k]. For each e ∈ E, denote the mean and covariance of the features as µ_e, Σ_e. Our first step is to adjust the features via the whitening transformation Σ_e^{-1/2}(x − µ_e). Unfortunately, we cannot undo the test transformation because we have no way of knowing what the test mean will be. Interestingly, this is not a problem provided the predictor satisfies a simple constraint. Suppose that the predictor's output on the mean representation of each environment is a multiple of the all-ones vector. As softmax is invariant to a constant offset, this enforces that the environment mean has no effect on the final probability distribution, and thus there is no need to adjust for the test-time mean. We therefore enforce this constraint during training, with the hope that the same invariance will approximately hold at test-time. Formally, the DARE objective finds a matrix β ∈ R^{d×k} which solves

min_β Σ_{e∈E} E_{p_e}[ℓ(β^T Σ_e^{-1/2} x, y)]   subject to   softmax(β^T Σ_e^{-1/2} µ_e) = (1/k) 1  ∀e ∈ E,    (1)

where ℓ is the multinomial logistic loss and we omit the bias β_0 for brevity. For binary classification, β is a vector and the softmax is replaced with the logistic function; the constraint is then equivalent to requiring the mean of the logits to be 0. Thus, the DARE objective explicitly regresses on the adjusted features, while the constraint enforces that each environment mean has no effect on the output distribution to encourage predictions to also be invariant to test-time transformations. The astute reader will point out that we also do not know the correct whitening matrix for the test data; instead, we adjust using our best guess for the test covariance. Denoting this estimate as Σ̂, our prediction on a new sample x is f(x; β) = softmax(β^T Σ̂^{-1/2} x).
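The per-domain adjustment itself is a few lines of linear algebra. Below is a minimal NumPy sketch (the helper name is ours, not the authors' code) using the symmetric inverse square root of the sample covariance, with tiny eigenvalues zeroed out in the spirit of the pseudoinverse convention used later:

```python
import numpy as np

def whiten(X, eps=1e-10):
    """Whitening adjustment Sigma_e^{-1/2} (x - mu_e) for one domain's features."""
    mu = X.mean(axis=0)
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    # Symmetric inverse square root; eigenvalues below eps are dropped (pseudoinverse).
    inv_sqrt = vecs @ np.diag(np.where(vals > eps, vals, np.inf) ** -0.5) @ vecs.T
    return (X - mu) @ inv_sqrt

rng = np.random.default_rng(1)
A = rng.normal(size=(5, 5))
X = rng.normal(size=(2000, 5)) @ A.T + 3.0   # one correlated, mean-shifted "domain"
Z = whiten(X)
# Z now has sample mean ~0 and sample covariance ~identity.
```

Applying this separately to each training domain is exactly the "domain-specific batchnorm with full covariance" described above.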
We prove that this prediction is minimax so long as our guess is "sufficiently close", and in practice we find that simply averaging the training domain adjustments performs well. Table 2 in the Appendix shows this average is actually quite close to the sample covariance of the (unseen) test domain, explaining the good performance. Unlike many prior methods, DARE does not enforce invariance of the features themselves. Rather, it aligns the representations such that different domains share similar optimal predictors. To support this claim, Table 3 displays the cosine similarity between optimal linear classifiers for individual domains; we observe a large increase in average similarity as a result of the feature adjustment. Further, in Section 5 we demonstrate that the DARE objective is in fact minimizing the worst-case risk under a constrained set of possible distribution shifts, making it less conservative than methods which require complete feature invariance. Though the objective (1) assumes a particular form of invariance, this specific choice is by no means a requirement for the approach we take. We justify this by noting that the minimax predictor can always be written as the solution to an objective of this form, depending only on what we choose to assume holds constant:

Proposition 3.1 (informal). Assume the minimax-optimal predictor f* lies in our hypothesis class F. Let S : F × P(X × Y) → R^d be any "sufficient statistic" which encodes all invariances, i.e., S(f*, p_e) = c ∀e for some constant c, and for all f ∈ F with lower risk than f* on at least one environment, ∃e, e' such that S(f, p_e) ≠ S(f, p_{e'}). Then as |E| → ∞,

f* = argmin_{f∈F} (1/|E|) Σ_{e∈E} E_{p_e}[ℓ(f(x), y)]   s.t.   S(f, p_e) = c  ∀e ∈ E.

This generic objective recovers IRM by defining S(f, p_e) = E_{p_e}[∇ℓ(f(x), y)], c = 0 (the "sufficient statistic" is the environment loss gradient), or CORAL with S(f, p_e) = (E_{p_e}[x], E_{p_e}[(x − µ)(x − µ)^T]) (the feature mean and covariance) and letting c be arbitrary.
DARE instead allows for domain-specific transformations T_e, and we define S(f, p_e) = softmax(E_{p_e}[f(T_e^{-1}(x))]); here T_e^{-1} is the whitening operation detailed in the next section.

Implementation in practice. Due to its convexity, the DARE objective is extremely simple to optimize. In practice, we finetune a deep network over the training data with ERM and then extract the features. Next, treating the frozen features of the training data as direct observations x, we minimize the empirical Lagrangian form:

L^λ_cls(β) := (1/|E|) Σ_{e∈E} [ (1/n_e) Σ_{i=1}^{n_e} ℓ(β^T Σ̂_e^{-1/2} x_i, y_i) + λ ℓ(β^T Σ̂_e^{-1/2} µ̂_e, k^{-1} 1) ],

where µ̂_e, Σ̂_e are the usual sample estimates and n_e is the number of samples in environment e. We find that the solution is incredibly robust to the choice of λ, but it is natural to wonder whether each of the components above is necessary for the performance gains we observe. We ablate both the whitening operation and the use of the constraint (Appendix E) and see performance drops in both cases. We also consider estimating the test-time mean rather than enforcing invariance, but this results in substantially worse accuracy: the environment feature means vary quite a bit, so the estimate is usually inaccurate. For regression, we consider the same setup but minimize mean squared error on targets y ∈ R. Here, the DARE solution is constrained such that the mean output has no effect on the prediction, meaning each domain's mean prediction should be zero:

min_β Σ_{e∈E} E_{p_e}[(β^T Σ_e^{-1/2} x − y)^2]   subject to   β^T Σ_e^{-1/2} µ_e = 0  ∀e ∈ E.    (3)

We discuss this in more detail in Appendix A.1, along with an interesting connection to anchor regression (Rothenhäusler et al., 2021). Another benefit to our approach is that the adjustments for each domain do not depend on labels, so given unlabeled samples from the test domain we can often do even better. Unlike methods which use those samples while training (or even for test-time training), this adjustment can be done just-in-time, without updating any parameters!
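Since the Lagrangian is a smooth convex function of β, an off-the-shelf solver suffices. The sketch below (our code, with hypothetical names and synthetic environments standing in for frozen network features) minimizes the empirical classification Lagrangian with SciPy's L-BFGS:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import log_softmax

def inv_sqrt_cov(X, eps=1e-6):
    """Sample estimate of Sigma_e^{-1/2} (ridged for numerical stability)."""
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return vecs @ np.diag((vals + eps) ** -0.5) @ vecs.T

def fit_dare(envs, k, lam=10.0):
    """Minimize the empirical DARE Lagrangian over beta in R^{d x k} (a sketch)."""
    d = envs[0][0].shape[1]
    # Precompute whitened features and whitened environment means.
    pre = []
    for X, y in envs:
        W = inv_sqrt_cov(X)
        pre.append((X @ W, X.mean(axis=0) @ W, y))

    def lagrangian(beta_flat):
        beta = beta_flat.reshape(d, k)
        total = 0.0
        for Xw, muw, y in pre:
            # Multinomial logistic loss on the adjusted features.
            total += -log_softmax(Xw @ beta, axis=1)[np.arange(len(y)), y].mean()
            # Penalty: cross-entropy between softmax at the whitened mean
            # and the uniform vector (1/k) * 1.
            total += lam * -log_softmax(muw @ beta).mean()
        return total / len(pre)

    beta = minimize(lagrangian, np.zeros(d * k), method="L-BFGS-B").x.reshape(d, k)
    return beta, lagrangian

# Three synthetic environments: shifted means, label rule on centered features.
rng = np.random.default_rng(2)
d, k = 6, 3
w_true = rng.normal(size=(d, k))
def env(b):
    X = rng.normal(size=(200, d)) + b
    return X, np.argmax((X - b) @ w_true + 0.5 * rng.normal(size=(200, k)), axis=1)
envs = [env(rng.normal(size=d)) for _ in range(3)]
beta, lagrangian = fit_dare(envs, k)
```

In real use, `envs` would hold the frozen finetuned features of each training domain, and prediction would apply softmax after whitening with the averaged training adjustments.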
We name this task Just-in-Time Unsupervised Domain Adaptation (JIT-UDA), and we provide finite-sample risk bounds for the DARE objective in this setting.

4. A NEW MODEL OF DISTRIBUTION SHIFT

The DARE objective is based on the intuition that all domains jointly share a representation space and that they arise as unique transformations from this space. To capture this notion mathematically, we model the joint distribution p_e(z, y) over latents z ∈ R^d and label y ∈ {0, 1}, along with an environment-specific transformation to observations x ∈ R^d for each domain. In the fully general case, this transformation can take an arbitrary form and can be written as x = T_e(z, y). Our primary assumption is that p_e(y | z) is constant for all domains. Our goal is thus to invert each transformation T_e such that we are learning an invariant conditional p(y | T_e^{-1}(x)) (throughout we assume T_e is invertible). One can also view this model as a generalization of covariate shift: where the usual assumption is constancy of p(y | x), the inverse transformation gives a richer model which can more realistically capture real-world variation across domains. It is important to note that this model generalizes (and has strictly weaker requirements than) both IRM and domain-invariant representation learning, which can be recovered by assuming T_e is the same for all environments. For a typical deep learning analysis we would model T_e as a non-linear generative map from latents to high-dimensional observations. However, our finding that ERM features are good enough suggests that modeling the learned features as a simple function of the "true latents" is not unreasonable. In other words, we now consider x to represent the frozen features of the trained network, and we expect this network to have already "undone" the majority of the complexity, resulting in observations which are a simple function of the ground truth. Accordingly, we consider the following model:

z = z_0 + b_e,   y = 1{β*^T z + η ≥ 0},   x = A_e z.    (2)

Here, z_0 ∼ p_e(z_0), which we allow to be any domain-specific distribution; we assume only that its mean is zero (such that E[z] = b_e) and that its covariance exists.
We fix β* ∈ R^d for all domains and model η as logistic noise. Finally, A_e ∈ R^{d×d}, b_e ∈ R^d are domain-specific. For regression, we model the same generative process for x and z, while for the response we have y ∈ R and η is zero-mean independent noise: y = β*^T z + η. We remark that the reason for separate definitions of p_e(z_0) and b_e is that our robustness guarantees are agnostic to the distribution of latent residuals p_e(z_0); the only aspect of the latent distribution p(z) which affects our bounds is the environment mean b_e.

Connection to Invariant Prediction. Statistical models of varying and invariant (latent) features have recently become a popular tool for analyzing the behavior of deep representation learning algorithms (Rosenfeld et al., 2021; Chen et al., 2021; Wald et al., 2021). We see that Equation (2) can model such a setting by assuming, e.g., that a subspace of the column span of A_e is constant for all e, while the remaining subspace can vary. In such a case, the features x are expressed as the sum of a varying and an invariant component, and any minimax representation must remove the varying component, throwing away potentially useful information. Instead, DARE realigns these components so that we can use them at test-time. We illustrate this with the following running example:

Example 1 (Invariant Latent Subspace Recovery). Consider the model (2) with z ∼ N(0, I_{d_1+d_2}). Define Π := blockdiag(I_{d_1}, 0_{d_2}) and assume A_e = blockdiag(Σ^{1/2}, Σ_e^{1/2}) for all e, where Σ ∈ R^{d_1×d_1} is constant for all domains but Σ_e ∈ R^{d_2×d_2} varies. Here we have a simple latent variable model: the features have the decomposition x = A_e z = Σ^{1/2} Π z + Σ_e^{1/2} (I − Π) z, where the first component is constant across environments and the second varies. It will be instructive at this point to analyze what an invariant prediction algorithm such as IRM would do in this setting.
Here, the IRM constraint would enforce learning a featurizer Φ such that Φ(x) has an invariant relationship with the target. Under the above model, the solution is Φ(x) = Σ^{1/2} Π A_e^{-1} x = Σ^{1/2} Π z, retaining only the component lying in span(Π). Crucially, removing these features actually results in worse performance on the training environments; IRM only does so because using these features could harm test performance in the worst case. However, projecting out the varying component in this manner is unnecessarily conservative, as we may expect that for a future distribution, Σ_e will not be too different from what we have seen before. Instead of projecting out this component, DARE performs a more nuanced alignment of environment subspaces, and it is thus able to take advantage of this additional information. We will see shortly the resulting benefits.
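A small simulation makes the contrast concrete. The construction below is ours (symmetric PSD blocks, so that Σ_e^{-1/2} = A_e^{-1} exactly); it checks the claim, echoed by Table 3, that per-domain whitening aligns the optimal per-domain regressors even though the raw-feature regressors differ across domains:

```python
import numpy as np

rng = np.random.default_rng(5)
d1, d2, n = 2, 6, 5000
d = d1 + d2
beta_star = rng.normal(size=d)

def rand_psd(dim):
    """Well-conditioned random symmetric PSD matrix (eigenvalues in [0.3, 3])."""
    Q = np.linalg.qr(rng.normal(size=(dim, dim)))[0]
    return (Q * rng.uniform(0.3, 3.0, size=dim)) @ Q.T

def whiten(X):
    vals, vecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return X @ vecs @ np.diag(vals ** -0.5) @ vecs.T

def ols(X, y):
    return np.linalg.lstsq(X, y, rcond=None)[0]

def cos(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

Sigma_shared = rand_psd(d1)  # invariant block, shared by all domains

def domain():
    A = np.block([[Sigma_shared, np.zeros((d1, d2))],
                  [np.zeros((d2, d1)), rand_psd(d2)]])  # A_e = blockdiag(...)
    z = rng.normal(size=(n, d))
    X = z @ A                                  # x = A_e z (A symmetric)
    y = z @ beta_star + 0.1 * rng.normal(size=n)
    return X, y

(X1, y1), (X2, y2) = domain(), domain()
raw = cos(ols(X1, y1), ols(X2, y2))                   # OLS on raw features
adj = cos(ols(whiten(X1), y1), ols(whiten(X2), y2))   # OLS on adjusted features
print(f"cosine similarity raw: {raw:.3f}  adjusted: {adj:.3f}")
```

After whitening, both domains' least-squares solutions recover (approximately) the same β*, so their cosine similarity approaches 1; on raw features the solutions are A_e^{-1}β*, which differ across domains.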

5. THEORETICAL ANALYSIS

Before we can analyze the DARE objective, we observe that there is a possible degeneracy in Equation (2), since two identical observational distributions p(x, y) can have different regression vectors β*. We therefore begin with a simple non-degeneracy assumption:

Assumption 5.1. Write the SVD of A_e as U_e S_e V_e^T. We assume V_e = V ∀e.

One special case where this holds is when A_e is constant for all domains; this is very similar to the "additive intervention" setting of Rothenhäusler et al. (2021), since only p_e(z_0), b_e can vary. We assume WLOG that V = I, as any other value can be subsumed by the other parameters. We further let E[z_0 z_0^T] = I WLOG by the same reasoning (see Appendix A.2 for a discussion of these conditions). With Assumption 5.1, we can uniquely recover A_e = U_e S_e via the eigendecomposition of the covariance Σ_e = A_e A_e^T = U_e S_e^2 U_e^T. We therefore use the notation Σ_e^{1/2} to refer to A_e recovered in this way. We allow covariances to have zero eigenvalues, in which case we write the matrix inverse to implicitly refer to the pseudoinverse. As is standard in domain generalization analysis, unless stated otherwise we assume full distribution access to the training domains, though standard concentration inequalities could easily be applied.

A remark on our assumptions. Assumption 5.1 is not trivial. Domain generalization is an exceptionally difficult problem, and showing anything meaningful requires some assumption of consistency between train and test. Our assumptions are only as strong as necessary to prove our results, but future work could relax them, resulting in weaker performance guarantees. Experiments in Appendix E demonstrate that our covariance estimation is indeed accurate, and the fact that our method exceeds state of the art even with this strong constraint (and does worse without the constraint, see Figure 4) is further evidence that our assumptions are reasonable.
We begin by deriving the solution to Equations (1) and (3). Recall that the DARE constraint requires that the mean representation of each domain has no effect on our prediction. To enforce this, the DARE solution must project out the subspace in which the means vary. Given a set of E training environments, define B as the d × E matrix whose columns are the environmental mean parameters b_e. Throughout this section, we make use of the matrix Π, defined as the orthogonal projection onto the nullspace of B^T:

Π := I − BB^† = U_Π S_Π U_Π^T ∈ R^{d×d}.

This matrix projects onto the DARE constraint set, and it turns out to be all that is necessary to state the solution:

Theorem 5.2 (Closed-form solution to the DARE population objective). Under model (2), the solution to the DARE population objective (3) for linear regression is Πβ*. If z is Gaussian, then the solution for logistic regression (1) is αΠβ* for some α ∈ (0, 1].

In Example 1, we saw how invariant prediction will discard a large subspace of the representation and why this is undesirable. Instead, DARE undoes the environment transformation and regresses directly on z = T_e^{-1}(x). Because we are performing a separate transformation for each environment, we are aligning the varying subspaces rather than throwing them away, allowing us to make use of additional information. Though the DARE solution also recovers β* up to a projection, it is using the adjusted features; DARE therefore only removes what cannot be aligned. In particular, whereas the invariant projection has rank d_1 in Example 1, the DARE projection Π = I would have full rank; this retains strictly more information. Indeed, in the ideal setting where we have a good estimate of Σ_e (e.g., under mild distribution shift or when solving JIT-UDA), we can make the Bayes-optimal prediction E[y | x] = β*^T A_e^{-1} x. Thus we see a clear advantage that DARE enjoys over invariant prediction.
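The projection in Theorem 5.2 is easy to check numerically. A quick sketch (dimensions and data are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(3)
d, E = 8, 3
B = rng.normal(size=(d, E))               # columns are the environment means b_e
Pi = np.eye(d) - B @ np.linalg.pinv(B)    # orthogonal projection onto null(B^T)

beta_star = rng.normal(size=d)
beta_dare = Pi @ beta_star                # closed-form DARE solution (regression case)

# The constraint holds exactly: every environment mean direction is annihilated,
# so the mean representation has no effect on the prediction.
print(np.abs(beta_dare @ B).max())
```

Note that in Example 1 the latent means are all zero, so B = 0 and Π = I, matching the full-rank claim above.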

5.1. THE ADVERSARIAL RISK OF DARE

Moving forward, we denote the DARE solution β*_Π := Πβ*, with β*_{I−Π} defined analogously. We next study the behavior of the DARE solution under worst-case distribution shift. We consider the setting where an adversary directly observes our choices of Σ̂, β̂ and chooses new environmental parameters A_e, b_e so as to cause the greatest possible loss. Specifically, we study the square loss, defining the excess test risk of a predictor as

R_e(β̂) := E_{p_e}[(β̂^T Σ̂^{-1/2} x − β*^T z)^2]

(we leave the dependence on Σ̂ implicit). For logistic regression we therefore analyze the squared error with respect to the log-odds. With some abuse of notation, we also reference excess risk when only using a particular subspace, i.e.,

R^Π_e(β̂) := E_{p_e}[(β̂^T Π Σ̂^{-1/2} x − β*^T Π z)^2].

Because we guess Σ̂ before observing any data from this new distribution, ensuring success is impossible in the general case. Instead, we consider a set of restrictions on the adversary which will make the problem tractable. Define the error in our test domain adjustment as ∆ := Σ_e^{1/2} Σ̂^{-1/2} − I; observe that if Σ̂ = Σ_e, then ∆ = 0. Our first assumption says that the effect of our adjustment error with respect to the interaction between the subspaces Π and I − Π is bounded:

Assumption 5.3. For a fixed constant B ≥ 0, ‖(I − Π)∆Πβ*‖ ≤ B‖Πβ*‖.

Remark 5.4. To see specific cases when this would hold, consider the decomposition of ∆ according to its components in the subspaces Π and I − Π:

U_Π^T ∆ U_Π = [ ∆_1   ∆_12
               ∆_21   ∆_2 ].

A few settings automatically satisfy Assumption 5.3 with B = 0, since U_Π^T ∆ U_Π will be block-diagonal. In particular, this will be the case if all domains share an invariant subspace, e.g., if ΠA_eΠ is constant as in Example 1.
Below, we show that exact recovery of this subspace occurs once we observe rank(I − Π) environments; this matches (actually, it is one less than) the linear environment complexity of most invariant predictors (Rosenfeld et al., 2021), and therefore Assumption 5.3 with B = 0 is no stronger than assuming a linear number of environments.

Assumption 5.5. Using only covariates in the non-varying subspace, the risk of the ground-truth regressor β* is less than that of the trivial zero predictor: R^Π_e(β*) < R^Π_e(0).

The need for this restriction should be immediate: if our adjustment error were so large that this did not hold, even the oracle regression vector would do worse than simply always predicting ŷ = 0. Assumption 5.5 is satisfied, for example, if ‖∆Π‖ < 1, which again is guaranteed if there is an invariant subspace. Note that we make no restriction on the risk in the subspace I − Π: the adversary is allowed any amount of variation in directions where we have already seen variation in the mean terms b_e, but introduction of new variation is assumed bounded. This is a no-free-lunch necessity: if we have never seen a particular type of variation, we cannot possibly know how to use it at test-time. With these restrictions on the adversary, our main result derives the supremum of the excess test risk of the DARE solution under adversarial distribution shift. Furthermore, we prove that this risk is minimax: making no more restrictions on the adversary other than a global bound on the mean, the DARE solution achieves the best performance we could possibly hope for at test-time:

Theorem 5.6 (DARE risk and minimaxity). Denote by A_ρ the set of possible test environments, containing all parameters (A_e, b_e) subject to Assumptions 5.3 and 5.5 and a bound on the mean: ‖b_e‖ ≤ ρ. For logistic or linear regression, let β̂ be the minimizer of the corresponding DARE objective as in Theorem 5.2. Then,

sup_{(A_e, b_e) ∈ A_ρ} R_e(β̂) = (1 + ρ^2)(‖β*_{I−Π}‖^2 + 2B‖β*_Π‖ ‖β*_{I−Π}‖).
Furthermore, the DARE solution is minimax: β̂ ∈ argmin_{β ∈ R^d} sup_{(A_e, b_e) ∈ A_ρ} R^e(β). A special case in which our assumptions hold is when all domains share an invariant subspace and we predict using only that subspace, but this is often too conservative. There are settings where allowing for (limited) new variation can improve our predictions, and Theorem 5.6 shows that DARE should outperform invariant prediction in such settings.
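The projection structure of the linear DARE solution (β̂ = Πβ* with Π = I − BB†, derived in the appendix) is easy to sanity-check numerically. In the sketch below, β* and the observed shift vectors b_e (stacked as columns of B) are random stand-ins, not quantities from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
d, E = 6, 3

# Hypothetical ground-truth regressor and E observed mean-shift vectors b_e.
beta_star = rng.normal(size=d)
B = rng.normal(size=(d, E))

# Pi = I - B B^+ projects onto the nullspace of B^T (the intersection of the
# nullspaces of the b_e); beta_dare = Pi beta_star then satisfies B^T beta = 0.
Pi = np.eye(d) - B @ np.linalg.pinv(B)
beta_dare = Pi @ beta_star
```

The two properties worth checking are that Π is idempotent (a true projection) and that the resulting predictor exactly satisfies the DARE mean constraint.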

5.2. THE ENVIRONMENT COMPLEXITY OF DARE

An important new measure of domain generalization algorithms is their environment complexity, which describes how the test risk behaves as a function of the number of (possibly random) domains we observe. In contrast to Example 1, for this analysis we assume an invariant subspace Π outside of which both A_e and b_e can vary arbitrarily; we formalize this as a prior over the b_e whose covariance Σ_b has the same span as I − Π. Our next result demonstrates that DARE achieves the same threshold as prior methods, but we also prove the first finite-environment convergence guarantee, quantifying how quickly the risk of the DARE predictor approaches that of the minimax-optimal predictor.

Theorem 5.7 (Environment complexity of DARE). Define the smallest gap between consecutive eigenvalues: ξ(Σ) := min_{i ∈ [d−1]} λ_i − λ_{i+1}.

1. If E ≥ rank(Σ_b), then DARE recovers the minimax-optimal predictor almost surely: β̂ = β*_Π.

2. Otherwise, if E ≥ r(Σ_b), then with probability ≥ 1 − δ,

R^e(β̂) ≤ R^e(β*_Π) + O( (‖Σ_b‖ / ξ(Σ_b)) ( √(r(Σ_b)/E) + max{ √(log(1/δ)/E), log(1/δ)/E } ) ),

where O(·) hides dependence on ‖∆‖ and r(Σ) := Tr(Σ)/‖Σ‖ is the effective rank.

For coherence we present the first item as a probabilistic statement, but it holds deterministically so long as there are rank(Σ_b) linearly independent observations of b_e.

Remark 5.8. Prior analyses of invariant prediction methods only show a discontinuous threshold, where the minimax predictor is discovered after seeing a fixed number of environments, usually linear in the non-invariant latent dimension. But one should expect that if the variation is not too large then we can do better, and indeed Theorem 5.7 shows that if the effective rank of Σ_b is sufficiently small, the risk of the DARE predictor approaches that of the minimax predictor as O(E^{-1/2}).
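The threshold in item 1 can be observed directly in simulation: once the number of environments reaches rank(Σ_b), the estimated projection Π̂_E (onto the nullspace of the observed shifts) coincides with Π exactly. A minimal sketch, with an assumed 3-dimensional varying subspace:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k = 8, 3                              # ambient dim; rank of varying subspace

# b_e ~ N(0, Sigma_b) with span(Sigma_b) = span(I - Pi): here Sigma_b = U U^T.
U = np.linalg.qr(rng.normal(size=(d, k)))[0]
Pi_true = np.eye(d) - U @ U.T            # projection onto the invariant subspace

def pi_hat(E):
    """Projection onto the nullspace of B^T after observing E shifts b_e."""
    B = U @ rng.normal(size=(k, E))
    return np.eye(d) - B @ np.linalg.pinv(B)

# Exact recovery kicks in at E = rank(Sigma_b) = 3, as in item 1.
errs = {E: np.linalg.norm(pi_hat(E) - Pi_true, 2) for E in (1, 2, 3, 5)}
```

With fewer than rank(Σ_b) environments the estimated projection still removes too few directions, so the spectral-norm error stays bounded away from zero; at E = 3 it drops to numerical zero.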

5.3. APPLYING DARE TO JIT-UDA

So far, we have only considered a setting with no knowledge of the test domain. As discussed in Section 3, we would expect that estimating the adjustment via unlabeled samples will improve performance. Prior works have extensively explored how to leverage access to unlabeled test samples for improved generalization, but while some suggest ways of using unlabeled samples at test time, they are not truly "just-in-time", nor have the advantages been formally quantified. Our final theorem investigates the provable benefits of using the empirical moments instead of enforcing invariance, giving a finite-sample convergence guarantee for the unconstrained DARE objective in the JIT-UDA setting:

Theorem 5.9 (JIT-UDA, shortened). Let Σ_S, Σ_T be the covariances of the source and target distributions, respectively. Define m(Σ) := λ_max(Σ)/λ_min(Σ)³. Assume we observe n_S = Ω(m(Σ_S)d²) source samples and analogously n_T target samples. Then after solving the DARE objective, with probability at least 1 − 3d^{-1}, the excess squared risk of our predictor on the new environment is bounded as

R_T(β̂) = O( d² ‖µ_T‖² ( m(Σ_S)/n_S + m(Σ_T)/n_T ) ).

Experimentally, we found that DARE does not outperform methods specifically intended for UDA, possibly because n ≲ d², but we believe this is a promising direction for future research, since it does not require unlabeled samples at train time and can incorporate new data on the fly. Most algorithms implemented in the DOMAINBED benchmark are only applicable to deep networks; many apply complex regularization to either the learned features or the network itself. We instead compare to three popular algorithms which work for linear classifiers: ERM (Vapnik, 1999), IRM (Arjovsky et al., 2019), and GroupDRO (Sagawa et al., 2020). We evaluate all approaches on four datasets: Office-Home (Venkateswara et al., 2017), PACS (Li et al., 2017), VLCS (Fang et al., 2013), and DomainNet (Peng et al., 2019).
We find that DARE consistently matches or surpasses prior methods. A detailed description of the evaluation and comparison methodology can be found in Appendix D and evaluations of other end-to-end methods in Appendix E.
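For concreteness, here is a minimal sketch of the penalized (Lagrangian) linear-regression form of the DARE objective: whiten each domain with a shrunk covariance, regress on the centered adjusted features, and penalize the adjusted mean projection. The shrinkage ρ and penalty λ values, and the synthetic data, are illustrative choices only; the actual experiments use the logistic version of the objective.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse matrix square root of a symmetric PSD matrix."""
    w, V = np.linalg.eigh(S)
    return V @ np.diag(w ** -0.5) @ V.T

def dare_linear(domains, lam=100.0, rho=0.1):
    """Closed-form minimizer of
    sum_e E[(b^T S_e^{-1/2}(x - mu_e) - y)^2] + lam * (b^T S_e^{-1/2} mu_e)^2."""
    d = domains[0][0].shape[1]
    G, h = np.zeros((d, d)), np.zeros(d)
    for X, y in domains:
        n = len(X)
        mu = X.mean(axis=0)
        W = inv_sqrt((1 - rho) * X.T @ X / n + rho * np.eye(d))  # shrunk whitening
        Z = (X - mu) @ W                                         # adjusted features
        m = W @ mu                                               # adjusted mean
        G += Z.T @ Z / n + lam * np.outer(m, m)
        h += Z.T @ y / n
    return np.linalg.solve(G, h)

rng = np.random.default_rng(2)
domains = []
for shift in (0.0, 2.0):                 # two toy environments with shifted means
    X = rng.normal(size=(500, 5)) + shift
    domains.append((X, X[:, 0] - X[:, 1] + 0.1 * rng.normal(size=500)))
beta = dare_linear(domains)
```

Taking λ large approximates the hard mean-invariance constraint; λ = 0 recovers the unconstrained whitened regression used in the JIT-UDA analysis.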

6. EXPERIMENTS

Prior methods now consistently outperform ERM. Interestingly, the previously observed gap between ERM and the alternatives disappears in this setting with a linear predictor. Gulrajani & Lopez-Paz (2021) report IRM and GroupDRO performing much worse on DomainNet (5-10% lower accuracy on some domains), yet they surpass ERM when using frozen features. This suggests that these objectives are more difficult to optimize over a deep network, and that when they do work, it is most likely because they learn a better linear classifier. This further motivates work on methods for learning simpler robust predictors.

7. RELATED WORK

A popular approach to domain generalization matches the domains in feature space, either by aligning moments (Sun & Saenko, 2016) or with an adversarial loss (Ganin et al., 2016) , though these methods are known to be inadequate in general (Zhao et al., 2019) . DARE differs from these approaches in that the constraint requires only that the feature mean projection onto the vector β be invariant. Domain-invariant projections (Baktashmotlagh et al., 2013) were recently analyzed by Chen & Bühlmann (2021) , though notably under fully observed features and only for domain adaptation. There has been intense recent focus on invariant prediction, based on ideas from causality (Peters et al., 2016) and catalyzed by IRM (Arjovsky et al., 2019) . Though the goal of such methods is minimax-optimality under major distribution shift, later work identifies critical failure modes of this approach (Rosenfeld et al., 2021; Kamath et al., 2021) . These methods eliminate features whose information is not invariant, which is often overly conservative. DARE instead allows for limited new variation by aligning the non-invariant subspaces, enabling stronger theoretical guarantees. Some prior works "normalize" each domain by learning separate batchnorm parameters but sharing the rest of the network. This was initially suggested for UDA (Li et al., 2016; Bousmalis et al., 2016; Chang et al., 2019) , which is not directly comparable to DARE since it requires unlabeled test data. This idea has also been applied to domain generalization (Seo et al., 2019; Segù et al., 2020) but in an ad-hoc manner. Because of the difficulty in training the network end-to-end, there is no consistent method for optimizing or validating the objective-in particular, all deep domain generalization methods were recently called into question when Gulrajani & Lopez-Paz (2021) gave convincing evidence that nothing beats ERM when evaluated fairly. 
Nevertheless, our analysis provides an initial justification for these methods, suggesting that this idea is worth exploring further.

LIMITATIONS

Our experiments demonstrate that traditional training already learns excellent features for generalizing to new distributions. This suggests that we should focus on methods learning to use these features, but it also means that these features are likely to contain just as much spurious or sensitive information as the original inputs. Thus, even if a method which uses frozen features performs better under shift (such as on minority subpopulations), care must still be taken to account for possible biases in the resulting classifier.

REPRODUCIBILITY STATEMENT

A major advantage to evaluating only a linear predictor is that we can release the exact model weights which were used to extract the frozen features on which all algorithms were trained. Code for our experiments will be released along with these models.

A DISCUSSION

A.1 CONNECTION TO ANCHOR REGRESSION

The DARE objective for linear regression is written

min_β Σ_{e∈E} E_{p^e}[(β^T Σ_e^{-1/2}(x − µ_e) − y)²]  s.t.  β^T Σ_e^{-1/2} µ_e = 0 ∀e ∈ E.  (3)

The idea of adjusting for domain projections has similarities to Anchor Regression (Rothenhäusler et al., 2021), an objective which linearly regresses separately on the projection and rejection of the data onto the span of a set of anchor variables. These variables represent some (known) measure of variability across the data, and the resulting solution enjoys robustness to pointwise-additive shifts in the underlying SCM. If we define the anchor variable to be a one-hot vector indicating a sample's environment, the Anchor Regression objective minimizes

(1/|E|) Σ_{e∈E} E_{p^e}[ ℓ(β^T(x − µ_e), y − µ_{y,e}) + γ ℓ(β^T µ_e, µ_{y,e}) ],

where ℓ is the squared loss and µ_{y,e} := E_{p^e}[y]. Here we see that Anchor Regression is "adjusting" in a sense, by regressing on the residuals, though the objective also regresses the mean prediction onto the target mean. Unfortunately, this requires access to the target mean, which is unavailable in logistic regression due to the lack of a good estimator for E[log(p(y=1|x)/p(y=−1|x))]. Nevertheless, if we (i) assume the feature covariance is fixed for all environments and (ii) assume the target mean is zero for all environments, we observe that the above objective becomes equivalent to the Lagrangian form of DARE for linear regression (3). Alternatively, we could imagine combining the two by keeping Anchor Regression's use of separate target means while adding DARE's feature-covariance whitening, which would give

(1/|E|) Σ_{e∈E} E_{p^e}[ ℓ(β^T Σ_e^{-1/2}(x − µ_e), y − µ_{y,e}) + γ ℓ(β^T Σ_e^{-1/2} µ_e, µ_{y,e}) ].

However, this still requires us to estimate the target mean, so it is unclear if or how this objective could be applied to the task of classification.
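Under assumptions (i) and (ii), the claimed equivalence is a pure change of variables (β ↦ Σ^{1/2}β), which can be checked numerically. The data below is synthetic, with the shared covariance and zero target means imposed by construction:

```python
import numpy as np

rng = np.random.default_rng(3)
d, n, gamma = 4, 200, 3.0

def psd_sqrt(S):
    w, V = np.linalg.eigh(S)
    return V @ np.diag(np.sqrt(w)) @ V.T

Sigma = np.diag([1.0, 2.0, 3.0, 4.0])   # shared feature covariance (assumption (i))
S_half = psd_sqrt(Sigma)
S_inv_half = np.linalg.inv(S_half)

# Two toy environments: known means mu_e, centered targets (assumption (ii)).
envs = []
for _ in range(2):
    mu = rng.normal(size=d)
    X = rng.normal(size=(n, d)) @ S_half + mu
    y = rng.normal(size=n)
    envs.append((X, mu, y))

def anchor_obj(beta):
    """Anchor Regression with one-hot environment anchors and mu_{y,e} = 0."""
    return np.mean([np.mean(((X - mu) @ beta - y) ** 2) + gamma * (mu @ beta) ** 2
                    for X, mu, y in envs])

def dare_lagrangian(beta):
    """Lagrangian DARE for linear regression with the shared covariance."""
    return np.mean([np.mean(((X - mu) @ S_inv_half @ beta - y) ** 2)
                    + gamma * (mu @ S_inv_half @ beta) ** 2 for X, mu, y in envs])

beta = rng.normal(size=d)
gap = abs(anchor_obj(beta) - dare_lagrangian(S_half @ beta))
```

The two objectives agree pointwise after reparameterization, so they share minimizers up to the same linear map.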

A.2 ON THE CONDITIONS ASSUMED WITHOUT LOSS OF GENERALITY

It may not be immediately clear why it is reasonable to assume both V = I and Σ = I WLOG simultaneously, so we clarify this point here. It is crucial to observe that β* and A_e do not need to be directly identifiable, because we only care about the predictive distribution β*^T ε. We only need A_e to be identifiable from x up to equivalence of this distribution. So if, for example, we recover some invertible transformation Â_e = A_e M, this is not at all a problem, because we also learn the corresponding β̂ = M^T β* such that β̂^T Â_e^{-1} x = β*^T M M^{-1} A_e^{-1} x = β*^T ε. In particular, suppose instead that V ≠ I and E[ε_0 ε_0^T] = Σ_0. Then we can simply reparameterize as ε_0 → Σ_0^{-1/2} ε_0, b_e → Σ_0^{-1/2} b_e, β* → Σ_0^{1/2} β*, A_e → A_e V^{-1} Σ_0^{1/2}. It is easy to see that this results in the same observed distribution over (x, y), and further that learning β* to predict on ε_0 is the same as learning Σ_0^{1/2} β* to predict on Σ_0^{-1/2} ε_0. So now we have reduced this to a setting where E[ε_0 ε_0^T] = I but perhaps V ≠ I. However, when V ≠ I it represents precisely the unidentifiable transformation M above, which does not pose a problem for prediction because it will not change in future environments.

B.1 NOTATION

We use capital letters to denote matrices and lowercase to denote vectors or scalars, where the latter should be clear from context. ‖·‖ refers to the usual Euclidean norm for vectors, or the spectral norm for matrices. For a matrix M, we use λ_max(M) to mean its maximum eigenvalue; the minimum is defined analogously. We write the pseudo-inverse as M†. For a collection of samples {x_i}_{i=1}^n, we frequently make use of the sample mean, µ̂ := (1/n) Σ_i x_i, and the sample covariance, Σ̂ := (1/n) Σ_i (x_i − µ̂)(x_i − µ̂)^T. The notation ≲ means less than or equal to up to constant factors.

Proof. The logistic regression model can be rewritten y | z = 1{β*^T z + ε > 0}, where ε is drawn from a standard logistic distribution. If we are restricted to not use z_{S^c}, these dimensions can be modeled simply as an additional noise term. Thus, our new model is y | z = 1{β*_S^T z_S + ε + τ > 0}, where τ := β*_{S^c}^T z_{S^c} ∼ p is symmetric zero-mean noise, independent of z_S. Because we are now modeling the other dimensions as noise, moving forward we will drop the S subscript, writing simply β*^T z. Define F, f as the CDF and PDF of the distribution of ε′ := ε + τ. Then the MLE population objective can be written

L(β) = −E_z[E_{ε′∼f}[1{β*^T z + ε′ > 0} log σ(β^T z) + 1{−(β*^T z + ε′) > 0} log σ(−β^T z)]].

For a fixed z, note that E_{ε′}[1{β*^T z + ε′ > 0}] = P(ε′ ≥ −β*^T z) = F(β*^T z) (since f is symmetric), and therefore taking the derivative of this objective we get

∇_β L(β) = −∇_β E_z[F(β*^T z) log σ(β^T z) + F(−β*^T z) log σ(−β^T z)] = E_z[z (F(−β*^T z) σ(β^T z) − F(β*^T z) σ(−β^T z))].

Because f is symmetric, we have F(z) = 1 − F(−z), giving ∇_β L(β) = E_z[z (σ(β^T z) − F(β*^T z))]. Consider the directional derivative of the loss in the direction β*, at the point β = αβ*:

β*^T ∇_β L(αβ*) = E_z[β*^T z (σ(αβ*^T z) − F(β*^T z))].
Because F is the CDF of a logistic distribution convolved with p, by Fubini's theorem we have

F(z) = ∫_{−∞}^{z} f(u) du = ∫_{−∞}^{z} ∫_{−∞}^{∞} p(τ) σ′(u − τ) dτ du = ∫_{−∞}^{∞} p(τ) ∫_{−∞}^{z−τ} σ′(ω) dω dτ = ∫_{−∞}^{∞} p(τ) σ(z − τ) dτ = E_{τ∼p}[σ(z − τ)].

Further, because p is symmetric, this is equal to (1/2)(E_{τ∼p}[σ(z − τ)] + E_{τ∼p}[σ(z + τ)]). Thus, we have

β*^T ∇_β L(αβ*) = E_z[β*^T z · E_{τ∼p}[σ(αβ*^T z) − (1/2)(σ(β*^T z − τ) + σ(β*^T z + τ))]].

We first consider the case where α = 1. When β*^T z > 0, the term inside the expectation is positive for all τ ≠ 0, and vice-versa when β*^T z < 0 (this can be verified by writing the difference as a function of β*^T z and τ, and observing that all the terms are non-negative except for a factor of e^{β*^T z} − 1). It follows that at the point β = β*, −β* is a descent direction. Furthermore, since the objective is continuous in α, we can follow this direction by reducing α (that is, moving in the direction −β*) until the directional derivative vanishes. Next, consider α = 0. Then the directional derivative is

β*^T ∇_β L(0) = (1/2) E_z[β*^T z · E_{τ∼p}[1 − (σ(β*^T z − τ) + σ(β*^T z + τ))]].

Here, when β*^T z > 0 the inner term is negative, and vice-versa for β*^T z < 0, implying that the directional derivative is now negative. Because the objective is convex, it follows that the optimal choice of α lies in (0, 1], being equal to 1 when τ = 0 almost surely. It remains to show that the optimal vector has no other component orthogonal to β*; in other words, that the solution is precisely αβ*. For isotropic Gaussian z, we have for any δ perpendicular to β* that E[δ^T z | β*^T z] = 0. Therefore, the gradient in the direction δ is

E_z[δ^T z (σ(αβ*^T z) − F(β*^T z))] = E_{β*^T z}[E_{δ^T z | β*^T z}[δ^T z] (σ(αβ*^T z) − F(β*^T z))] = 0.

Since β* and all orthogonal directions form a complete basis, it follows that ∇L(αβ*) = 0 and therefore that αβ* is the optimal solution.
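The α ∈ (0, 1] shrinkage is easy to reproduce empirically: fit logistic regression to data generated with an extra independent symmetric noise term τ and check that the estimate is a shrunken multiple of β* with essentially no orthogonal component. The noise scale and plain gradient-descent optimizer below are illustrative choices, not part of the proof:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d = 100_000, 3
beta_star = np.array([1.0, -2.0, 0.5])

z = rng.normal(size=(n, d))                 # isotropic Gaussian covariates
tau = 2.0 * rng.logistic(size=n)            # omitted symmetric noise (illustrative)
y = (z @ beta_star + rng.logistic(size=n) + tau > 0).astype(float)

# Gradient descent on the (misspecified) logistic MLE, ignoring tau.
beta = np.zeros(d)
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(z @ beta)))
    beta -= 2.0 * z.T @ (p - y) / n

alpha = (beta @ beta_star) / (beta_star @ beta_star)
ortho = np.linalg.norm(beta - alpha * beta_star)
```

The fitted vector points along β* but with a strictly smaller scale, exactly the αβ* behavior the lemma predicts; removing τ recovers α ≈ 1.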
Though we prove this lemma only for Gaussian z, we found empirically that the result approximately holds whenever z is dimension-wise independent and symmetric about the origin. We believe this is a consequence of the Central Limit Theorem: our proof relies on conditional expectations of inner products with z, which converge in distribution to Gaussians as the dimensionality of z grows. We observe that Li & Duan (1989) prove a similar result via a much simpler argument, under a general "linear conditional expectation" condition which resembles the property we exploit (the zero-mean conditional expectation of orthogonal inner products with isotropic Gaussians). Their result is more general, but it allows for any value of α, including negative values; in that case, we would actually be recovering the opposite of the ground-truth effects, which is clearly insufficient for test-time prediction. Heagerty & Zeger (2000) give an analytical closed form for the solution under a probit model with Gaussian noise; since this is not a logistic model, the Gaussian noise represents significantly less model misspecification, which explains why the exact closed form is recoverable.

Proof. Observe that under the constraint, regressing on the centered observations is equivalent to regressing on the non-centered observations (since the mean must have no effect on the output), so these two objectives must have the same solutions and the same minimizers. We therefore consider the solution to the DARE objective on non-centered observations. It is immediate that the unconstrained solution to the DARE population objective on non-centered data is β* for both linear and logistic regression. For linear regression, we observe that because the adjusted covariates in each environment have identity covariance, the excess training risk of a predictor β is exactly ‖β − β*‖². Therefore, the solution can be rewritten

min_β ‖β − β*‖²  s.t.  β^T Σ_e^{-1/2} µ_e = 0 ∀e ∈ E.

Recalling that Σ_e^{-1/2} µ_e = b_e, the constraint can be written in matrix form as B^T β = 0, and thus we see that the solution is the ℓ_2-norm projection of β* onto the nullspace of B^T (i.e., the intersection of the nullspaces of the b_e). By definition, this is given by (I − BB†)β* = Πβ*. To derive the closed form for logistic regression, write the spectral decomposition B = UDV^T, and consider regressing on U^T ε instead of ε. As the predictor only affects the objective through its linear projection, the solution to this objective will be U^T times the solution to the original objective (that is, for all vectors v, (U^T β)^T U^T v = β^T v). We will denote parameters for the rotated objective with a tilde, e.g., ṽ := U^T v. The constraint in Equation (1) is equivalent to requiring that the mean projection is a constant vector c·1, and with the inclusion of a bias term we can WLOG consider c = 0. Thus, the constraint can be written B̃^T β̃ = VDβ̃ = 0 ⟺ Dβ̃ = 0. We can therefore see that this constraint is the same as requiring that the dimensions of β̃ corresponding to the non-zero dimensions of D are 0. Noting that U^T ε ∼ N(0, I), we now apply Lemma B.1 to see that the solution will be β̃ = α(I − DD†)β̃* for some α ∈ (0, 1]. Finally, as argued above, we can recover the solution to the original objective by rotating back, giving the solution β̂ = Uβ̃ = αU(I − DD†)U^T β* = αΠβ*.

B.4 PROOF OF THEOREM 5.6

Theorem B.2 (Theorem 5.6, restated). For any ρ ≥ 0, denote by A_ρ the set of possible test environments containing all parameters (A_e, b_e) subject to Assumptions 5.3 and 5.5 and a bound on the mean: ‖b_e‖ ≤ ρ. For logistic or linear regression, let β̂ be the minimizer of the corresponding DARE objective (1) or (3). Then,

sup_{(A_e, b_e) ∈ A_ρ} R^e(β̂) = (1 + ρ²)(‖β*‖² + 2B‖β*_Π‖‖β*_{I−Π}‖).

Furthermore, the DARE solution is minimax: β̂ ∈ argmin_{β ∈ R^d} sup_{(A_e, b_e) ∈ A_ρ} R^e(β).

Proof. Recall that in an environment e, E_e[y | x] = β*^T Σ_e^{-1/2} x.
So, for an environment e and predictor β̂, we have the following excess risk decomposition:

R^e(β̂) = E_e[(β̂^T Σ̂^{-1/2} x − β*^T Σ_e^{-1/2} x)²] = E_e[(β̂^T Σ̂^{-1/2}(x − µ_e) − β*^T Σ_e^{-1/2}(x − µ_e))²] + E_e[(β̂^T Σ̂^{-1/2} µ_e − β*^T Σ_e^{-1/2} µ_e)²] =: T_1 + T_2.

Observe that term T_1 does not depend on the mean b_e. Term T_2 simplifies to

T_2 = ((Σ_e^{1/2} Σ̂^{-1/2} β̂ − β*)^T b_e)² =: (v^T b_e)²,

and so we can write a supremum over T_2 as sup_{A_ρ} T_2 = sup_{A_ρ} (v^T b_e)² = ρ² sup_{A_ρ} ‖v‖². Next, observe that T_1 simplifies to

T_1 = β̂^T Σ̂^{-1/2} Σ_e Σ̂^{-1/2} β̂ + ‖β*‖² − 2β̂^T Σ̂^{-1/2} Σ_e^{1/2} β* = ‖Σ_e^{1/2} Σ̂^{-1/2} β̂ − β*‖² = ‖v‖².

So, returning to the full loss and recalling that T_1 is independent of b_e, we have sup_{A_ρ} R^e(β̂) = sup_{A_ρ} T_1 + T_2 = (1 + ρ²) sup_{A_ρ} ‖v‖². Of course, the ideal for a given environment e would be to set β̂ := Σ̂^{1/2} Σ_e^{-1/2} β*, which gives v = 0, but we have to choose a single β̂ for all possible environments parameterized by (A_e, b_e) ∈ A_ρ. We will show that the choice β̂ := αΠβ* is minimax-optimal under this set for any α ∈ (0, 1]. Leaving the supremum over adversary choices implicit, we can rewrite the squared norm of v as

v^T v = ((∆ + I)β̂ − β*)^T((∆ + I)β̂ − β*) = β̂^T ∆^T ∆ β̂ + ‖β̂ − β*‖² + 2β̂^T ∆^T(β̂ − β*).

By Lemma C.4, we can define an environment by defining the block terms of U_Π^T ∆ U_Π directly. Consider the choice ∆_1 = λ β̂_{I−Π} β̂_{I−Π}^T / ‖β̂_{I−Π}‖², ∆_2 = ∆_12 = ∆_21 = 0. This choice is in A_ρ for any λ ∈ R since it is block-diagonal and ‖∆_2‖ = 0. Now we can write

v^T v = λ² ‖β̂_{I−Π}‖² + ‖β̂ − β*‖² + 2λ β̂_{I−Π}^T(β̂_{I−Π} − β*_{I−Π}) ≥ λ² ‖β̂_{I−Π}‖² − 2λ ‖β̂_{I−Π}‖ ‖β̂ − β*‖,

via Cauchy-Schwarz and the triangle inequality. So, taking λ → ∞ means that the minimax risk is unbounded unless β̂_{I−Π} = 0, i.e., β̂ = β̂_Π. For the remainder of the proof we consider only this case. With this restriction, we have

‖v‖² = ‖(∆ + I)β̂_Π − β*‖² = ‖(∆ + I)β̂_Π − β*_Π − β*_{I−Π}‖² = ‖(∆ + I)β̂_Π − β*_Π‖² + ‖β*_{I−Π}‖² − 2β*_{I−Π}^T ∆ β̂_Π.
Assumption 5.3 implies that

|β*_{I−Π}^T ∆ β̂_Π| = |β*^T U_Π(I − S_Π) U_Π^T ∆ U_Π S_Π U_Π^T β̂| = |β*^T U_Π(I − S_Π) [ ∆_1 ∆_12 ; ∆_21 ∆_2 ] S_Π U_Π^T β̂| ≤ B ‖β*_Π‖ ‖β*_{I−Π}‖.

Consider the DARE solution β̂ = αβ*_Π for α ∈ (0, 1]. Then, using the equivalent supremized set from Assumption 5.5 and Lemma C.3, the worst-case risk of this choice is

sup_{A_ρ} R^e(β̂) = (1 + ρ²) sup_{‖∆β*_Π‖² < ‖β*_Π‖²} ( ‖(α∆ + (α − 1)I)β*_Π‖² + ‖β*_{I−Π}‖² + 2B ‖β*_Π‖ ‖β*_{I−Π}‖ )   (4)
= (1 + ρ²)( ‖β*_Π‖² + ‖β*_{I−Π}‖² + 2B ‖β*_Π‖ ‖β*_{I−Π}‖ ).

It remains to show that any other choice of β̂ results in greater risk. Observe that the last two terms of (4) are constant with respect to ∆, so we focus on the first term. That is, we show that any other choice results in

sup_{‖∆β*_Π‖² < ‖β*_Π‖²} ‖(∆ + I)β̂_Π − β*_Π‖² > ‖β*_Π‖²

(except for β̂ = 0, which we address at the end). For any choice of β̂_Π, decompose the vector into its projection and rejection onto β*_Π as β̂_Π = αβ*_Π + δ, with δ^T β*_Π = 0. The adversary can choose ∆_2 = λδδ^T, which lies in A_ρ for any λ. Then taking λ → ∞ causes the risk to grow without bound; it follows that we must have δ = 0.

Next, consider any choice α ∉ (0, 1]. If α > 2 or α < 0, choosing ∆ = 0 makes this term (α − 1)²‖β*_Π‖² > ‖β*_Π‖². If 2 ≥ α > 1, choosing ∆ = (1/α)I makes this term α²‖β*_Π‖² > ‖β*_Π‖². Finally, if α = 0 then this term is equal to ‖β*_Π‖²; however, this value is the supremum of the adversarial risk of the DARE solution, and it cannot actually be attained.

B.5 PROOF OF THEOREM 5.7

Theorem B.3 (Theorem 5.7, restated).

1. If E ≥ rank(Σ_b), then DARE recovers the minimax-optimal predictor almost surely: β̂ = β*_Π.

2. Otherwise, if E ≥ r(Σ_b), then with probability ≥ 1 − δ,

R^e(β̂) ≤ R^e(β*_Π) + O( (‖Σ_b‖ / ξ(Σ_b)) ( √(r(Σ_b)/E) + max{ √(log(1/δ)/E), log(1/δ)/E } ) ),

where O(·) hides dependence on ‖∆‖.

Proof. Define Π̂_E as the projection onto the nullspace of B^T after having seen E environments. Item 1 is immediate: as soon as we observe E = rank(Σ_b) linearly independent b_e, we have span(B) = span(Σ_b) and therefore Π̂_E = Π (this occurs almost surely for any absolutely continuous distribution). To prove item 2, we will write the solution learned after seeing E environments as β̂_E := Π̂_E β*. We can write the excess risk of the ground-truth minimax predictor β*_Π as

R^e(β*_Π) = E[((β*^T Π Σ̂^{-1/2} Σ_e^{1/2} − β*^T) ε)²],

and likewise we have

R^e(β̂_E) = E[((β̂_E^T Σ̂^{-1/2} Σ_e^{1/2} − β*^T) ε)²] = E[((β*^T Π̂_E Σ̂^{-1/2} Σ_e^{1/2} − β*^T) ε)²].

Taking the difference and writing v := Σ̂^{-1/2} Σ_e^{1/2} ε,

R(β̂_E) − R(β*_Π) = E[(β*^T Π̂_E v − β*^T ε)² − (β*^T Π v − β*^T ε)²] ≤ ‖(Π − Π̂_E) E[vv^T] (2I − Π − Π̂_E)‖ + 2‖(Π − Π̂_E) E[vε^T]‖ ≤ 2( ‖(Π − Π̂_E) E[vv^T]‖ + ‖(Π − Π̂_E) E[vε^T]‖ ).

Now we note that E[vv^T] = Σ̂^{-1/2} Σ_e Σ̂^{-1/2} and E[vε^T] = Σ̂^{-1/2} Σ_e^{1/2}. These matrices are bounded in norm by ‖∆ + I‖² and ‖∆ + I‖ respectively, and are constant with respect to the training environments we sample. It follows that we can bound the risk difference as R(β̂_E) − R(β*_Π) ≲ ‖Π̂_E − Π‖, and all that remains is to bound the term ‖Π̂_E − Π‖.
Combining Theorems 4 and 5 of Koltchinskii & Lounici (2017) with the triangle inequality, we have that when r(Σ_b) ≤ E, with probability ≥ 1 − δ,

‖Σ̂ − Σ_b‖ ≲ ‖Σ_b‖ ( max{ √(log(1/δ)/E), log(1/δ)/E } + max{ √(r(Σ_b)/E), r(Σ_b)/E } ).

Since r(Σ_b) ≤ E, the first term of the second max dominates. Further, Corollary 3 of Yu et al. (2014), a variant of the Davis-Kahan theorem, gives us

‖Π̂_E − Π‖ ≲ ‖Σ̂ − Σ_b‖ / ξ(Σ_b).

Combining these facts gives the result.

B.6 PROOF OF THEOREM 5.9

Theorem B.4 (Theorem 5.9, fully stated). Assume the data follows model (2) with ε ∼ N(0, I). Observing n_S samples from a source distribution S with covariance Σ_S, we use half to estimate Σ̂_S and the other half to learn parameters β̂ which minimize the unconstrained (λ = 0), uncentered DARE regression objective. At test time, given n_T samples {x_i}_{i=1}^{n_T} from the target distribution T with mean and covariance µ_T, Σ_T, we predict f(x; β̂) = β̂^T Σ̂_T^{-1/2} x. Define m(Σ) := λ_max(Σ)/λ_min(Σ)³, and assume n_S = Ω(m(Σ_S)d²), n_T = Ω(m(Σ_T)d²). Then with probability at least 1 − 3d^{-1}, the excess squared error of our predictor on the new environment is bounded as

R_T(f) = O( (1 + ‖µ_T‖²) ( (E[η²] / (1 − O(√(d/n_S)))) (d/n_S) + d²( m(Σ_S)/n_S + m(Σ_T)/n_T ) ) ).

Proof. We begin by bounding the error of our solution, β̂ − β*. Observe that with our estimate Σ̂_S of the source environment moments, we are solving linear regression with targets β*^T ε_i + η_i and covariates x̃_i = Σ̂_S^{-1/2} x_i = Σ̂_S^{-1/2} Σ_S^{1/2} ε_i. Thus, if we had access to the true gradient of the modified least-squares objective (which is not the same as assuming n_S → ∞, because in that case we would have Σ̂_S → Σ_S), the solution would be

E[x̃x̃^T]^{-1} E[x̃y] = ( Σ̂_S^{-1/2} Σ_S^{1/2} (I + µ_T µ_T^T) Σ_S^{1/2} Σ̂_S^{-1/2} )^{-1} Σ̂_S^{-1/2} Σ_S^{1/2} (I + µ_T µ_T^T) β* = Σ̂_S^{1/2} Σ_S^{-1/2} β*.
Moving forward, we denote this solution to the modified objective as β̄ := Σ̂_S^{1/2} Σ_S^{-1/2} β*. To show a rate of convergence to the OLS solution, we need to lower-bound the minimum eigenvalue λ_min(M) of the design matrix M := ∆_S^T ∆_S, where ∆_S := Σ_S^{1/2} Σ̂_S^{-1/2}; this will suffice since it implies finite-sample convergence of the OLS estimator to the population solution. The well-known bound for sub-Gaussian random vectors tells us that with probability ≥ 1 − δ_1,

‖β̂ − β̄‖ ≲ λ_max(M^{-1}) σ_y² ( √(d/n_S) + √(log(1/δ_1)/n_S) ),

and moving forward we condition on this event. Now let γ_S := √(d/n_S), with γ_T defined analogously. Since M ≻ 0, it follows that λ_max(M^{-1}) = λ_min(∆_S^T ∆_S)^{-1}, and further, λ_min(∆_S^T ∆_S) ≥ 1 − ‖∆_S^T ∆_S − I‖. By Lemma C.1, we have that with probability ≥ 1 − d^{-1}, ‖∆_S^T ∆_S − I‖ ≲ γ_S. This implies that our solution's error can be bounded as

‖β̂ − β*‖ = ‖(β̂ − β̄) + (β̄ − β*)‖ ≤ (σ_y² / (1 − O(γ_S))) ( √(d/n_S) + √(log(1/δ_1)/n_S) ) + ‖Σ̂_S^{1/2} Σ_S^{-1/2} − I‖ ‖β*‖.

We assume ‖β*‖ = 1 so we can avoid carrying it through the rest of the proof. Bounding the second term with Lemma C.2 gives

‖β̂ − β*‖ ≲ (σ_y² / (1 − O(γ_S))) ( √(d/n_S) + √(log(1/δ_1)/n_S) ) + γ_S √(d · m(Σ_S)).

On the target distribution, our excess risk with a predictor β is

R^e(β) = E[(β^T Σ̂_T^{-1/2} x − β*^T Σ_T^{-1/2} x)²] = E[((∆_T^T β − β*)^T ε)²] = (∆_T^T β − β*)^T (I + µ_T µ_T^T)(∆_T^T β − β*) ≤ (1 + ‖µ_T‖²) ‖∆_T^T β − β*‖²,

where ∆_T := Σ_T^{1/2} Σ̂_T^{-1/2}. Now, we have

‖∆_T^T β̂ − β*‖ ≤ ‖∆_T^T(β̂ − β*)‖ + ‖(∆_T^T − I)β*‖ ≤ (1 + ‖∆_T − I‖) ‖β̂ − β*‖ + ‖∆_T − I‖.

Once again invoking Lemma C.2, with probability ≥ 1 − d^{-1}, ‖∆_T − I‖ ≲ γ_T √(d · m(Σ_T)), and using this plus the previous result, the triangle inequality, and (a + b)² ≤ 2(a² + b²) gives

‖∆_T^T β̂ − β*‖² ≲ (1 + ‖∆_T − I‖)² ‖β̂ − β*‖² + ‖∆_T − I‖² ≲ (1 + γ_T √(d · m(Σ_T)))² ‖β̂ − β*‖² + γ_T² d · m(Σ_T) ≲ ‖β̂ − β*‖² + γ_T² d · m(Σ_T),

where the lower bound on n_T enforces γ_T √(d · m(Σ_T)) ≤ 1.
It follows that the excess risk can be bounded as

R^e(β̂) ≤ (1 + ‖µ_T‖²) ‖∆_T^T β̂ − β*‖² ≲ (1 + ‖µ_T‖²) ( ‖β̂ − β*‖² + γ_T² d · m(Σ_T) ).

Letting δ_1 = 1/d, combining all of the above via a union bound, and eliminating lower-order terms, we get

R^e(β̂) ≲ (1 + ‖µ_T‖²) ( (σ_y² / (1 − O(γ_S))) (d/n_S) + γ_S² d · m(Σ_S) + γ_T² d · m(Σ_T) ) = (1 + ‖µ_T‖²) ( (σ_y² / (1 − O(√(d/n_S)))) (d/n_S) + d²( m(Σ_S)/n_S + m(Σ_T)/n_T ) )

with probability ≥ 1 − 3d^{-1}.
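The √(d/n) whitening-error rate from Lemma C.1, which drives the per-distribution terms above, can be observed directly in simulation. For simplicity the true Σ below is the identity, so Σ̂^{-1/2} Σ Σ̂^{-1/2} reduces to Σ̂^{-1}:

```python
import numpy as np

rng = np.random.default_rng(5)
d = 10

def whitening_err(n):
    """Spectral-norm error of Sigma-hat^{-1/2} Sigma Sigma-hat^{-1/2} vs I."""
    X = rng.normal(size=(n, d))              # true Sigma = I
    S = X.T @ X / n                          # sample second-moment matrix
    w, V = np.linalg.eigh(S)
    S_inv_half = V @ np.diag(w ** -0.5) @ V.T
    return np.linalg.norm(S_inv_half @ S_inv_half - np.eye(d), 2)

# 100x more samples should shrink the error by roughly sqrt(100) = 10x.
e_small = np.mean([whitening_err(500) for _ in range(5)])
e_large = np.mean([whitening_err(50_000) for _ in range(5)])
```

Averaging a few trials smooths the randomness; the observed ratio tracks the √(d/n) prediction.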

C TECHNICAL LEMMAS

Lemma C.1. Suppose we observe n = Ω(d + log(1/δ)) samples X ∼ N(µ, Σ) with Σ ≻ 0, and estimate the inverse covariance matrix Σ^{-1} with the inverse of the sample covariance matrix, Σ̂^{-1}. Then with probability ≥ 1 − δ, it holds that

‖Σ̂^{-1/2} Σ Σ̂^{-1/2} − I‖ ≲ √((d + log(1/δ))/n).

Proof. As Σ̂^{-1/2} Σ Σ̂^{-1/2} and Σ^{1/2} Σ̂^{-1} Σ^{1/2} have the same spectrum, it suffices to bound the latter. Observe that Σ^{1/2} Σ̂^{-1} Σ^{1/2} − I = Σ^{1/2}(Σ̂^{-1} − Σ^{-1})Σ^{1/2}. Now applying Theorem 10 of Kereta & Klock (2021) with A = B = Σ^{1/2} yields the result.

Lemma C.2. Assume the conditions of Lemma C.1. Then under the same event as Lemma C.1, it holds that

‖Σ^{1/2} Σ̂^{-1/2} − I‖ ≲ ‖Σ‖^{1/2} ‖Σ^{-1}‖^{3/2} √d γ + ‖Σ^{-1}‖² γ², where γ = √((d + log(1/δ))/n).

In particular, if δ = d^{-1}, then ‖Σ^{1/2} Σ̂^{-1/2} − I‖ ≲ √(‖Σ‖ ‖Σ^{-1}‖³ d²/n).

Proof. We begin by deriving a bound for ‖Σ̂^{-1/2} − Σ^{-1/2}‖. Define E := Σ̂^{-1} − Σ^{-1}. Observe that

‖E‖ = ‖Σ^{-1/2} Σ^{1/2} E Σ^{1/2} Σ^{-1/2}‖ = ‖Σ^{-1/2} (Σ^{1/2} Σ̂^{-1} Σ^{1/2} − I) Σ^{-1/2}‖ ≤ ‖Σ^{-1}‖ ‖Σ^{1/2} Σ̂^{-1} Σ^{1/2} − I‖,

and now apply Lemma C.1 (since Σ̂^{-1/2} Σ Σ̂^{-1/2} and Σ^{1/2} Σ̂^{-1} Σ^{1/2} have the same spectrum), giving ‖E‖ ≲ ‖Σ^{-1}‖ γ. Let UDU^T be the eigendecomposition of Σ, and define the matrix [√·, α]_{i,j} = 1/(√D_ii + √D_jj) as in Carlsson (2018). The Daleckii-Krein theorem (Daleckii & Krein, 1965) tells us that

Σ̂^{-1/2} − Σ^{-1/2} = U([√·, α] ∘ E)U^T + O(‖E‖²).

Note that max_{i,j} |[√·, α]_{i,j}| = 1/(2√λ_min(Σ)) implies ‖[√·, α]‖ ≲ √d ‖Σ^{-1}‖^{1/2}, and therefore, by sub-multiplicativity of the spectral norm under the Hadamard product,

‖Σ̂^{-1/2} − Σ^{-1/2}‖ ≲ √d ‖Σ^{-1}‖^{1/2} ‖E‖ + ‖E‖² ≲ ‖Σ^{-1}‖^{3/2} √d γ + ‖Σ^{-1}‖² γ².

Finally, noting that ‖Σ^{1/2} Σ̂^{-1/2} − I‖ = ‖Σ^{1/2}(Σ̂^{-1/2} − Σ^{-1/2})‖ ≤ ‖Σ^{1/2}‖ ‖Σ̂^{-1/2} − Σ^{-1/2}‖ completes the main proof. To see the second claim, note that n ≥ 2‖Σ^{-1}‖ implies ‖Σ^{-1}‖^{3/2} √d γ ≥ ‖Σ^{-1}‖² γ², meaning the first of the two terms dominates.

Proof of Lemma C.3. First,

R^Π_{p^e}(β*) = E_{p^e}[(β*^T Π Σ̂^{-1/2} x − β*^T Π Σ_e^{-1/2} x)²] = E_{p^e}[((β*^T Π Σ̂^{-1/2} Σ_e^{1/2} − β*^T Π) ε)²] = E_{p^e}[(β*^T Π ∆^T ε)²] = ‖∆ Π β*‖².
Next, the excess risk of the null vector β = 0 in the same subspace is

R^Π_{p^e}(0) = E_{p^e}[(β*^T Π Σ_e^{-1/2} x)²] = E_{p^e}[(β*^T Π ε)²] = ‖Πβ*‖².

Lemma C.4. For a fixed Σ̂, choosing an environmental covariance Σ_e is equivalent to directly choosing the error terms ∆_1, ∆_2, ∆_12, ∆_21.

Proof. For a fixed Σ̂, due to the unique definition of the square root as a result of Assumption 5.1, the map Σ_e → Σ_e^{1/2} Σ̂^{-1/2} − I is one-to-one. Recall that we can write U_Π^T ∆ U_Π = [ ∆_1 ∆_12 ; ∆_21 ∆_2 ], which defines a one-to-one map from ∆ to the error terms. Since the composition of bijective functions is bijective, the claim follows.

D IMPLEMENTATION DETAILS

Figure 2 depicts the overall pipeline for evaluating the different approaches we describe in Section 2. Our baselines are three distinct pipelines, each slightly different:

1. The first pipeline (dark blue in Figure 2) is the standard method of evaluating ERM on a new domain. We train the entire network (the combination of a feature embedder comprising all layers except the last linear layer, and the last-layer linear classifier) end-to-end on the training domains, simultaneously learning the features and the linear classifier via backpropagation. When evaluated on a new domain, this achieves state-of-the-art accuracy (Gulrajani & Lopez-Paz, 2021), but it still performs quite poorly.

2. The second pipeline (red) is very similar to the first, but we include the test domain among the domains on which the network is trained end-to-end. Because this means that the test distribution is one of the training domains, this simulates an "ideal" setting where no distribution shift occurs from train time to test time. As such, it is unsurprising that this approach performs substantially better (though it still leaves a bit to be desired, which raises a separate question about failures of in-distribution learning with multiple domains).
As this approach requires cheating, it is unrealistic to expect current methods to even begin to approach this baseline, but it gives a good sense of the best performance we could hope for.

3. The final pipeline (grey) is our main experimental contribution. Here, the features (all but the last layer) are learned without cheating, as in the first pipeline. Next, we freeze the features and cheat only while learning a linear classifier on top of these features. Not only does this method do significantly better than the first baseline, it actually performs almost as well as, and sometimes better than, the second pipeline. Thus, we find that standard features learned via ERM without cheating are already good enough for generalization, and that the main bottleneck to reaching the accuracy of the second baseline lies solely in learning a good linear classifier. This has several important advantages over current methods which attempt to change the entire end-to-end process, which we explicate in Section 2 in the main body.
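Once frozen features are extracted, the third pipeline reduces to a linear probe. The sketch below stands in for the real embedder's output with synthetic "features" and fits the multinomial logistic head by gradient descent; the shapes, data, and optimizer are illustrative, not the experimental settings:

```python
import numpy as np

rng = np.random.default_rng(6)

def linear_probe(feats, labels, steps=500, lr=0.5):
    """Pipeline 3, sketched: fit only a linear head on frozen features."""
    n, d = feats.shape
    k = labels.max() + 1
    W = np.zeros((d, k))
    Y = np.eye(k)[labels]
    for _ in range(steps):
        logits = feats @ W
        P = np.exp(logits - logits.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)
        W -= lr * feats.T @ (P - Y) / n     # gradient of mean cross-entropy
    return W

# Frozen "features" from a training domain and a shifted test domain.
f_train = rng.normal(size=(300, 8)); y_train = (f_train[:, 0] > 0).astype(int)
f_test = rng.normal(size=(300, 8)) + 1.0; y_test = (f_test[:, 0] > 0).astype(int)

# The probe "cheats" only in the head: it sees test-domain features too.
W = linear_probe(np.vstack([f_train, f_test]), np.concatenate([y_train, y_test]))
acc = ((f_test @ W).argmax(axis=1) == y_test).mean()
```

Because the label here depends only on one feature coordinate in both domains, a linear head fit on the pooled features recovers good test-domain accuracy despite the mean shift, mirroring the finding that the bottleneck is the linear classifier rather than the features.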

D.2 NUMERICAL COMPARISONS

The features used for each trial come from the default hyperparameter sweep of DOMAINBED. For computational reasons, we used fewer random hyperparameter choices per trial, so the reported accuracies are not directly comparable to those of the benchmark. Nevertheless, our results are meaningful because all methods are evaluated consistently under the same methodology. All algorithms use the same set of features for each trial, so their performances are highly dependent. To account for this, we perform a one-sided paired t-test between algorithms to determine the best performers. The fact that DARE consistently bests ERM is somewhat surprising: it is intended to guard against a worst-case distribution shift and depends on the quality of our guess $\Sigma$, so we would in general expect worse performance on some datasets.
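The paired comparison described above can be reproduced as follows. The accuracy arrays are illustrative placeholders, not numbers from our tables; only the test itself (one-sided, paired by trial, significance level 0.1) matches our procedure.

```python
import numpy as np
from scipy import stats

# Per-trial test accuracies of two algorithms on the SAME features,
# so the samples are paired by trial. Numbers are illustrative only.
acc_dare = np.array([0.781, 0.764, 0.792])
acc_erm = np.array([0.770, 0.759, 0.781])

# One-sided paired t-test: H1 is that DARE's mean accuracy is higher.
t_stat, p_value = stats.ttest_rel(acc_dare, acc_erm, alternative="greater")
significant = p_value < 0.1  # significance level used in our tables
```

Pairing matters here: an unpaired test would treat the shared-features correlation as noise and lose power, whereas the paired test compares per-trial differences directly.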

D.3 CODE DETAILS

All features were learned by finetuning a ResNet-50 using the default settings and hyperparameter sweeps of the DOMAINBED benchmark (Gulrajani & Lopez-Paz, 2021). We extracted features from 3 trials, with 5 random hyperparameter choices per trial, picking the run with the best training-domain validation accuracy. We used the default random splits of 80% train / 20% test for each domain. Using frozen features, the cheating linear classifier was trained by minimizing the multinomial logistic loss with full-batch SGD with momentum for 3000 steps. We did not use a validation set.

For the main experiments, all algorithms were trained using full-batch L-BFGS (Liu & Nocedal, 1989). We used the exact same optimization hyperparameters for all methods, with one exception: the default learning rate of 1 was unstable when optimizing IRM, frequently diverging, so we lowered the learning rate until IRM consistently converged (at a learning rate of 0.2). Since the IRM penalty only makes sense with both a feature embedder and a classification vector, we used an additional linear layer for IRM, making its objective non-convex. Presumably due to their convexity for linear classifiers, all other methods were unaffected by this change. For all methods, we halted optimization once 20 epochs passed with no increase in training-domain validation accuracy; the maximum validation accuracy typically occurred within the first 5 epochs.

For stability when whitening (and because the number of samples per domain was often less than the feature dimension), in estimating $\Sigma_e$ for each environment we shrank the sample covariance towards the identity. Specifically, we define $\hat\Sigma_e = (1-\rho)\frac{1}{n}\sum_{i=1}^n x_i x_i^T + \rho I$, with $\rho = 0.1$, and $\rho = 0.01$ for DomainNet due to its much larger size. We found that increasing $\lambda$ beyond ${\sim}1$ had little to no effect on the accuracy, loss, or penalty of the DARE solution (see Appendix E for ablations). However, we did observe that a very large value of $\lambda$ (e.g., $10^5$ or higher) could result in poor conditioning of the objective, with the L-BFGS optimizer taking several epochs before the loss began to decrease.
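The shrinkage estimate and the corresponding whitening step can be sketched as follows; the data is synthetic, and $\rho = 0.1$ matches the value used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

def shrunk_covariance(x, rho=0.1):
    """Sample second-moment matrix shrunk toward the identity, as used
    when the per-domain sample count is below the feature dimension."""
    n = len(x)
    return (1 - rho) * (x.T @ x) / n + rho * np.eye(x.shape[1])

def whitener(sigma):
    """Inverse matrix square root Sigma^{-1/2} via eigendecomposition."""
    vals, vecs = np.linalg.eigh(sigma)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

# Few samples relative to dimension: the raw covariance would be singular,
# but shrinkage keeps every eigenvalue at least rho.
x = rng.normal(size=(30, 64))
sigma_e = shrunk_covariance(x)
w = whitener(sigma_e)

# Whitened features have exactly identity covariance under sigma_e.
z = x @ w
```

Without the $\rho I$ term, a 64-dimensional covariance estimated from 30 samples has rank at most 30, so its inverse square root would not exist.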

E ADDITIONAL EXPERIMENTS

Here we present additional experimental results. All reported accuracies represent the mean of three trials, and all error bars (where present) display 90% confidence intervals, calculated as $\pm 1.645\,\sigma/\sqrt{n}$. Note, though, that the results are not independent across methods, so simply checking whether the error bars overlap is overly conservative for identifying methods which perform better.
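The error-bar computation is as follows; the accuracies are illustrative numbers, and the use of the sample standard deviation (rather than a population estimate) is our assumption here.

```python
import numpy as np

def ci90_halfwidth(accs):
    """90% normal-approximation confidence interval half-width,
    i.e. 1.645 * sigma / sqrt(n) over per-trial accuracies."""
    accs = np.asarray(accs, dtype=float)
    return 1.645 * accs.std(ddof=1) / np.sqrt(len(accs))

# Three trials (illustrative values, not results from the paper).
halfwidth = ci90_halfwidth([0.78, 0.76, 0.80])
```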

E.1 ACCURACY OF THE TEST COVARIANCE WHITENER

As we do not have access to the true test-time covariance, we estimate the adjustment with the average of the train-time adjustments. Table 2 reports the normalized squared Frobenius norm of our estimate's error, $\|\hat W - W\|_F^2 / \|W\|_F^2$, where $W$ is the sample-covariance adjustment of the test domain and $\hat W$ is our averaging estimate. We find that this normalized error is consistently small, meaning our estimated adjustment is reasonably close to the "true" adjustment, relative to its spectrum (we put "true" in quotation marks because the best we can do is estimate it via the sample covariance). This helps explain why our averaging adjustment performs so well in practice, though we expect future methods could improve on this estimate (particularly on the last domain of PACS).
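This metric can be computed as below. The per-domain covariances are synthetic stand-ins, and normalizing by $\|W\|_F^2$ reflects our reading of "normalized" above.

```python
import numpy as np

rng = np.random.default_rng(0)

def inv_sqrt(sigma):
    """Whitening adjustment Sigma^{-1/2} for a positive-definite matrix."""
    vals, vecs = np.linalg.eigh(sigma)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

def random_spd(d, scale=0.5):
    """Synthetic positive-definite covariance (stand-in for a domain)."""
    a = rng.normal(size=(d, d))
    return a @ a.T / d + scale * np.eye(d)

d = 16
train_sigmas = [random_spd(d) for _ in range(3)]
test_sigma = random_spd(d)

W_test = inv_sqrt(test_sigma)                                  # "true" adjustment
W_hat = np.mean([inv_sqrt(s) for s in train_sigmas], axis=0)   # averaged estimate

# Normalized squared Frobenius error of the averaged whitener.
err = np.linalg.norm(W_hat - W_test, "fro") ** 2 / np.linalg.norm(W_test, "fro") ** 2
```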

E.2 ALIGNMENT OF DOMAIN-SPECIFIC OPTIMAL CLASSIFIERS

As discussed in Section 3, DARE does not project out varying subspaces but rather aligns them such that the adjusted domains have similar optimal classifiers simultaneously. To verify that this is actually happening, we learn the optimal linear classifier for each domain individually and evaluate the inter-domain cosine similarity of these vectors for each class. We see that without adjustment, different domains' optimal linear decision boundaries have normals with small alignment on average, which explains why trying to learn a single linear classifier which does well on all domains simultaneously performs poorly in most cases. After alignment, the individually optimal classifiers are more aligned, which allows a single classifier to perform better on all domains. Furthermore, this is done without throwing away the varying component (as would be done by invariant prediction (Peters et al., 2016)), allowing DARE to use more information and resulting in higher test accuracy.

E.4 ABLATIONS

Figure 3: Demonstration of the effect of whitening. NoSigmaDARE is the exact same algorithm as DARE but with no covariance whitening. In almost all cases, covariance whitening plus guessing at test-time results in better performance. We expect that under much larger distribution shift this pattern may reverse.

Figure 4: Effect of the penalty term $\lambda$ on the two algorithms which use it. $\lambda = 0$ corresponds to no constraint, and the lower performance demonstrates that this invariance requirement is essential to the quality of the learned classifier. For $\lambda > 0$, DARE accuracy is extremely robust, effectively constant for all $\lambda \geq 1$; in practice we also found the penalty term itself to always be ${\sim}0$. In contrast, IRM accuracy appears to decrease with increasing $\lambda$, implying that the observed benefit of IRM primarily comes from domain reweighting, as in our GroupERM method.

Figure 5: Effect of final feature-bottleneck dimensionality on cheating accuracy. Reducing the dimensionality reduces the accuracy of all methods to varying degrees, though in some cases it actually increases test accuracy.
We observe that the main pattern persists, though the gap between cheating on finetuned features and traditional ERM shrinks as the dimensionality is reduced substantially. To reduce dimensionality of the pretrained features we use a random projection; unsurprisingly, the quality dramatically falls as the number of dimensions is reduced.
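The alignment diagnostic from E.2 can be sketched as follows. The per-domain classifier weights here are synthetic stand-ins for the individually learned logistic-regression solutions, constructed as a shared direction plus domain-specific noise.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)

def mean_pairwise_cosine(vectors):
    """Average cosine similarity over all pairs of domain-specific
    classifier normals (one vector per domain, for a fixed class)."""
    sims = [
        float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))
        for u, v in combinations(vectors, 2)
    ]
    return float(np.mean(sims))

# Synthetic stand-ins mimicking per-domain optimal classifiers:
# before adjustment the domain-specific component dominates, after
# adjustment the shared component dominates.
shared = rng.normal(size=32)
before = [shared + 3.0 * rng.normal(size=32) for _ in range(4)]  # weak alignment
after = [shared + 0.3 * rng.normal(size=32) for _ in range(4)]   # strong alignment

sim_before = mean_pairwise_cosine(before)
sim_after = mean_pairwise_cosine(after)
```

A single shared classifier can only do well on all domains simultaneously when the per-domain normals are well aligned, which is exactly what `sim_after > sim_before` captures.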



On a few domains the linear method sees a gap of ${\sim}5\%$ accuracy from the idealized setting. We emphasize that our use of a simple linear predictor serves only as a lower bound on the performance achievable with ERM features. Though we label these as assumptions, they are properly interpreted as restrictions on an adversary: we consider an "uncertainty set" comprising all possible domains subject to these requirements.



Fix test parameters $A_e, b_e$ and guess $\Sigma$. Suppose we minimize the DARE regression objective (3) on environments whose means $b_e$ are Gaussian vectors with covariance $\Sigma_b$, with $\mathrm{span}(\Sigma_b) = \mathrm{span}(I - \Pi)$. After seeing $E$ training domains: 1. If $E \geq \mathrm{rank}(\Sigma_b)$, then DARE recovers the minimax-optimal predictor almost surely:

STATEMENT AND PROOF OF LEMMA B.1, AND DISCUSSION OF RELATED RESULTS

Lemma B.1. Assume our data follows a logistic regression model with regression vector $\beta^*$ and covariates $z \sim \mathcal{N}(0, I)$: $\log\frac{p(y=1 \mid z)}{p(y=-1 \mid z)} = \beta^{*T} z$. Then the solution to the dimension-constrained logistic regression problem $\arg\min_\beta \mathbb{E}_{z,y}[-\log \sigma(y\beta^T z)]$ s.t. $\beta_{S^c} = 0$, where $S \subseteq [d]$ indexes a subset of the dimensions, is equal to $\alpha \beta^*_S$ for some $\alpha \in (0, 1]$.
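Lemma B.1 can be checked numerically. The sketch below uses a hypothetical $\beta^*$ and subset $S$: it samples from the logistic model, fits a logistic regression restricted to the dimensions in $S$ by gradient descent, and recovers a vector close to a positive scaling $\alpha \beta^*_S$ with $\alpha < 1$ (strictly below 1 here because the omitted dimensions carry signal).

```python
import numpy as np

rng = np.random.default_rng(0)

# Ground-truth regression vector; we constrain the fit to S = {0, 1}.
beta_star = np.array([2.0, 1.0, 1.5, -1.0])
S = [0, 1]

# Sample from log p(1|z)/p(-1|z) = beta*^T z with z ~ N(0, I).
n = 20000
z = rng.normal(size=(n, 4))
p = 1 / (1 + np.exp(-(z @ beta_star)))
y = np.where(rng.random(n) < p, 1.0, 0.0)  # encode labels in {0, 1}

# Dimension-constrained logistic regression: optimize over S only.
w = np.zeros(len(S))
for _ in range(2000):
    q = 1 / (1 + np.exp(-(z[:, S] @ w)))
    w -= 0.5 * z[:, S].T @ (q - y) / n

# Lemma B.1 predicts w ~ alpha * beta*_S for some alpha in (0, 1].
alpha_hat = (w @ beta_star[S]) / (beta_star[S] @ beta_star[S])
cos_sim = w @ beta_star[S] / (np.linalg.norm(w) * np.linalg.norm(beta_star[S]))
```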

PROOF OF THEOREM 5.2

Theorem 5.2 (Closed-form solution to the DARE population objective). Under model (2), the solution to the DARE population objective (3) for linear regression is $\Pi\beta^*$. If $z$ is Gaussian, then the solution for logistic regression (1) is $\alpha\Pi\beta^*$ for some $\alpha \in (0, 1]$.

PROOF OF THEOREM 5.7

Theorem B.3 (Theorem 5.7, restated). Fix test environment parameters $A_e, b_e$ and our guess $\Sigma$. Suppose we minimize the DARE regression objective (3) on environments whose means $b_e$ are Gaussian vectors with covariance $\Sigma_b$, with $\mathrm{span}(\Sigma_b) = \mathrm{span}(I - \Pi)$. After seeing $E$ training domains:

and further define $\Delta_S := \Sigma_S^{1/2}\Sigma^{-1/2}$, with $\Delta_T$ defined analogously. A classical result tells us that the OLS solution is distributed as $\mathcal{N}\big(\beta, \frac{\sigma_y^2}{n_S} M^{-1}\big)$, where $M := \Delta_S^T \Delta_S$ is the modified covariance and $\sigma_y^2 := \mathbb{E}[\eta^2]$.

Let $\Sigma$ be fixed. Then Assumption 5.5 is satisfied if and only if the stated condition on $\Delta\beta^*$ holds; the claim follows by rewriting the risk terms. Recall that $\Delta = \Sigma_e^{1/2}\Sigma^{-1/2} - I$. Writing out the excess risk of the ground truth $\beta^*$ in the subspace $\Pi$,

Figure 2: Depiction of evaluation pipeline. Standard training on train domains leads to poor performance. Cheating while training the full network (features and classifier) does substantially better. However, cheating on just the linear classifier does almost as well, implying that the features learned without cheating are already good enough for massive improvements over SOTA.

Performance of linear predictors on top of fixed features learned via ERM. Each letter is a domain. Because all algorithms use the same set of features for each trial, results are not independent. Therefore, bold indicates the highest mean according to one-sided paired t-tests at $p = 0.1$ significance. If not the overall highest, underline indicates a higher mean than ERM under the same test.



Mean normalized adjustment estimation error for each dataset and each train-domain/test-domain split.

Mean cosine similarity between linear classification vectors which are individually optimal for their respective domains (learned via logistic regression). For each dataset and each train-domain/test-domain split we report the average similarity across all class normal vectors and all domain pairs.

E.3 RESULTS ON ADDITIONAL END-TO-END METHODS

Though ERM represents approximate state-of-the-art accuracy, for completeness we include a few additional end-to-end methods (including Fish and CORAL) to confirm that this fact still holds under our experimental conditions. We also include additional results on ColoredMNIST.

Performance of other algorithms under our experimental conditions. Our results confirm that ERM remains approximate state of the art among end-to-end methods.

Performance on ColoredMNIST.

