NEAR-OPTIMAL LINEAR REGRESSION UNDER DISTRIBUTION SHIFT

Abstract

Transfer learning is essential when sufficient labeled data comes from the source domain while data from the target domain is scarce or absent. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover different settings with covariate shift or model shift, and allow data generated from either linear or general nonlinear models. We show that affine minimax rules are within an absolute constant of the minimax risk even among nonlinear rules, for various source/target distributions.

1. INTRODUCTION

The success of machine learning crucially relies on the availability of labeled data. The labeling process usually requires considerable human labor and can be very expensive and time-consuming, especially for large datasets like ImageNet (Deng et al., 2009). On the other hand, models trained on one dataset, despite performing well on test data from the same distribution they were trained on, are often sensitive to distribution shifts: they do not adapt well to related but different distributions. Even a small distributional shift can result in substantial performance degradation (Recht et al., 2018; Lu et al., 2020). Transfer learning has been an essential paradigm for tackling the challenges associated with insufficient labeled data (Pan & Yang, 2009; Weiss et al., 2016; Long et al., 2017). The main idea is to make use of a source domain with abundant labeled data (e.g., ImageNet) to learn a model that performs well on a target domain (e.g., medical images) where few or no labels are available. Despite the lack of labeled data, we may still use unlabeled data from the target domain, which is usually much easier to obtain and can provide helpful information about the target domain. Although this approach has been integral to many applications, many fundamental questions are open even in very basic settings. In this work, we focus on linear regression under distribution shift and ask the fundamental question of how to optimally learn a linear model for a target domain, using labeled data from a source domain and unlabeled data (and possibly some labeled data) from the target domain. For various settings, including covariate shift (i.e., when p(x) changes) and model shift (i.e., when p(y|x) changes), we develop estimators that achieve near-minimax risk (up to universal constant factors) among all linear estimation rules.
Here linear estimators refers to all estimators that depend linearly on the label vector; these include almost all popular estimators in linear regression, such as ridge regression and its variants. When the input covariances of the source and target domains commute, we prove that our estimators achieve near-minimax risk among all possible estimators. A key insight from our results is that, when covariate shift is present, we need to apply data-dependent regularization that adapts to changes in the input distribution. For linear regression, this is given by the input covariances of the source and target tasks, which can be estimated from unlabeled data. Our experiments verify that our estimator significantly improves over ridge regression and similar heuristics.

1.1. RELATED WORK

Different types of distribution shift are introduced in Storkey (2009) and Quionero-Candela et al. (2009). Specifically, covariate shift occurs when the marginal distribution P(X) changes from source to target domain (Shimodaira, 2000; Huang et al., 2007). Wang et al. (2014) and Wang & Schneider (2015) tackle model shift (changes in P(Y|X)) provided the change is smooth as a function of X. Sun et al. (2011) design a two-stage reweighting method that handles both covariate shift and model shift. Other methods, such as change of representation, adaptation through priors, and instance pruning, are proposed in Jiang & Zhai (2007). In this work, we focus on the above two kinds of distribution shift. For modeling target shift (P(Y)) and conditional shift (P(X|Y)), Zhang et al. (2013) exploit multi-layer adaptation via a location-scale transformation on X. Transfer learning and domain adaptation are the sub-fields of machine learning that cope with distribution shift. Prior work roughly falls into the following categories: 1) importance reweighting, mostly used under covariate shift (Shimodaira, 2000; Huang et al., 2007; Cortes et al., 2010); 2) a fruitful line of work that explores robust/causal features or domain-invariant representations, through invariant risk minimization (Arjovsky et al., 2019), distributionally robust minimization (Sagawa et al., 2019), human annotation (Srivastava et al., 2020), adversarial training (Long et al., 2017; Ganin et al., 2016), or minimizing domain discrepancy under some distance metric (Pan et al., 2010; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Wang & Schneider, 2014); 3) several approaches that seek gradual domain adaptation (Gopalan et al., 2011; Gong et al., 2012; Glorot et al., 2011; Kumar et al., 2020) through self-training or a gradual change in the training distribution.
Near-minimax estimators were introduced in Donoho (1994) for linear regression with Gaussian noise. For a more general setting, Juditsky et al. (2009) estimate linear functionals using convex programming. Blaker (2000) compares ridge regression with a minimax linear estimator under weighted squared error. Kalan et al. (2020) consider a setting similar to this work, but focus on lower bounds for linear models and one-hidden-layer neural networks under distribution shift. Further results on generalization lower bounds under distribution shift in various settings appear in (David et al., 2010; Hanneke & Kpotufe, 2019; Ben-David et al., 2010; Zhao et al., 2019).

2. PRELIMINARY

We formalize the setting considered in this paper for transfer learning under distribution shift.

Notation and setup. Let p_S(x) and p_T(x) be the marginal distributions of x in the source and target domain, with associated covariance matrices Σ_S and Σ_T. We assume sufficient unlabeled data to estimate Σ_T accurately. We observe n_S and n_T labeled samples from the source and target domain respectively. Data is scarce in the target domain: n_S >> n_T, and n_T may be 0. Specifically, X_S = [x_1 | x_2 | ... | x_{n_S}]^T ∈ R^{n_S×d}, with each x_i, i ∈ [n_S], drawn from p_S; the noise vector is z = [z_1, z_2, ..., z_{n_S}]^T with z_i ~ N(0, σ²); and y_S = [y_1, y_2, ..., y_{n_S}]^T ∈ R^{n_S} with y_i = f*(x_i) + z_i (X_T ∈ R^{n_T×d} and y_T ∈ R^{n_T} are defined similarly). Denote by Σ̂_S = X_S^T X_S / n_S the empirical covariance matrix. (Throughout the paper we assume the data is centered: E_{p_S}[x] = E_{p_T}[x] = 0.) The positive part of a number x is denoted (x)_+. We consider both linear (f*(x) = x^T β*) and general nonlinear ground-truth models. When the optimal linear model changes from the source to the target domain, we add a subscript for distinction, i.e., β*_S and β*_T. We use bold symbols (x) for vectors, lower-case letters (x) for scalars, and capital letters (A) for matrices.

Minimax (linear) risk. In this work, we focus on designing linear estimators β̂ = A y_S for the parameter β* ∈ B. An estimator is evaluated by its excess risk on the target domain, with the worst-case β* over B: L_B(β̂) = max_{β* ∈ B} E_{y_S} ||Σ_T^{1/2} (β̂(y_S) - β*)||². The minimax linear risk and the minimax risk among all estimators are defined respectively as R_L(B) := min_{β̂ linear in y_S} L_B(β̂) and R_N(B) := min_{β̂} L_B(β̂). The subscripts "N" and "L" are mnemonics for "non-linear" and "linear" estimators: R_N is the optimal risk with no restriction on the class of estimators, while R_L only considers estimators β̂ linear in y_S.
A minimax linear estimator and a minimax estimator are estimators that attain R_L and R_N respectively, up to universal multiplicative constants. Throughout we take B = {β : ||β||_2 ≤ r}. When there is no ambiguity, we abbreviate β̂(y_S) as β̂.

Our meta-algorithm. This paper considers several settings with distribution shift. Our methods are unified under the following meta-algorithm. Step 1: Find an unbiased sufficient statistic β̂_SS for the unknown parameter. Step 2: Find β̂_MM, a linear operator applied to β̂_SS, that minimizes L_B(β̂_MM). For each setting, we show that β̂_MM achieves the minimax linear risk R_L (asymptotically or in fixed design). Furthermore, under some conditions, the minimax risk R_N is lower bounded by a universal constant times L_B(β̂_MM).

Outline. In the sections below, we tackle the problem in different settings. In Section 3 we design algorithms under covariate shift alone: 1) n_T = 0 and f*(x) is linear (Section 3.1); 2) n_T = 0 and f*(x) is a general nonlinear function (Section 3.2); 3) n_T > 0 and f* is linear (Section 3.3). Finally, we address model shift for linear models (β*_S ≠ β*_T) in Section 4.
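To make Step 1 of the meta-algorithm concrete in the fixed-design linear setting, the sufficient statistic β̂_SS = Σ̂_S^{-1} X_S^T y_S / n_S can be computed directly. Below is a minimal numpy sketch; the function name and the simulated distributions are ours, not the paper's.

```python
import numpy as np

# Step 1 of the meta-algorithm (linear model, fixed design):
# the unbiased sufficient statistic beta_SS = Sigma_S_hat^{-1} X_S^T y_S / n_S.
def sufficient_statistic(X_S, y_S):
    n_S = X_S.shape[0]
    Sigma_S_hat = X_S.T @ X_S / n_S               # empirical source covariance
    return np.linalg.solve(Sigma_S_hat, X_S.T @ y_S / n_S)

# Illustrative simulation (our own choices): x_i ~ N(0, I), y_i = x_i^T beta* + z_i.
rng = np.random.default_rng(0)
n_S, d, sigma = 5000, 5, 0.1
beta_star = rng.normal(size=d)
X_S = rng.normal(size=(n_S, d))
y_S = X_S @ beta_star + sigma * rng.normal(size=n_S)
beta_SS = sufficient_statistic(X_S, y_S)          # ~ N(beta*, sigma^2/n_S * Sigma_S_hat^{-1})
err = float(np.linalg.norm(beta_SS - beta_star))
```

Step 2 then applies a matrix C to beta_SS; the choice of C is the subject of Section 3.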

3. MINIMAX ESTIMATOR WITH COVARIATE SHIFT

In this section, we consider the setting with only covariate shift. That is, only Σ_S (the marginal distribution p_S(x)) changes to Σ_T (p_T(x)), while f* = E[y|x] (the conditional distribution p(y|x)) is shared. We first consider the case when f* is a linear map x → x^T β*, and then the problem with approximation error.

3.1. COVARIATE SHIFT WITH LINEAR MODELS

We observe n_S samples from the source domain: y_S = X_S β* + z, z ~ N(0, σ² I), and no labeled samples from the target domain. Our goal is to find the minimax linear estimator β̂_MM(y_S) = A y_S, for some linear map A, that attains R_L(B). Following our meta-algorithm, let β̂_SS = (1/n_S) Σ̂_S^{-1} X_S^T y_S, an unbiased sufficient statistic for β*:

β̂_SS = (1/n_S) Σ̂_S^{-1} X_S^T y_S = (1/n_S) Σ̂_S^{-1} X_S^T X_S β* + (1/n_S) Σ̂_S^{-1} X_S^T z = β* + (1/n_S) Σ̂_S^{-1} X_S^T z ~ N(β*, (σ²/n_S) Σ̂_S^{-1}).

The fact that β̂_SS(y_S) is a sufficient statistic is proven in Claim 3.7 for a more general case, using the Fisher-Neyman factorization theorem. Here we treat X_S as fixed, so randomness comes only from the noise z. We prove that the minimax linear estimator takes the form β̂_MM = C β̂_SS and then design algorithms that compute the optimal C.

Claim 3.1. The minimax linear estimator is of the form β̂_MM = C β̂_SS for some C ∈ R^{d×d}.

Warm-up: commuting covariance matrices. To derive the minimax linear estimator, we first consider the simple case when Σ_T and Σ̂_S are simultaneously diagonalizable. Applying Pinsker's theorem (Johnstone, 2011) gives:

Theorem 3.2 (Linear minimax risk with covariate shift). Suppose the observations follow the model y_S = X_S β* + z, z ~ N(0, σ² I_{n_S}). If Σ_T = U diag(t) U^T and Σ̂_S := X_S^T X_S / n_S = U diag(s) U^T, then the minimax linear risk is

R_L(B) := min_{β̂ = A y_S} max_{β* ∈ B} E ||Σ_T^{1/2}(β̂ - β*)||² = Σ_i (σ²/n_S)(t_i/s_i)(1 - λ/√t_i)_+,

where B = {β : ||β|| ≤ r} and λ = λ(r) is determined by (σ²/n_S) Σ_{i=1}^d (1/s_i)(√t_i/λ - 1)_+ = r². The minimax linear estimator is given by β̂_MM = Σ_T^{-1/2} U (I - diag(λ/√t))_+ U^T Σ_T^{1/2} β̂_SS, where β̂_SS = (1/n_S) Σ̂_S^{-1} X_S^T y_S.

Since r is unknown in practice, we can simply treat either r, or λ directly, as a tuning parameter. We compare the role of λ with that of ridge regression: β̂^λ_RR = argmin_β (1/(2n_S)) ||X_S β - y_S||² + (λ/2)||β||² = (Σ̂_S + λI)^{-1} X_S^T y_S / n_S.
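The closed form in Theorem 3.2 is easy to implement when both covariances are diagonal in the same basis (U = I): λ is the root of a monotone scalar equation, found here by bisection. This is a hedged sketch under that diagonal assumption; function names are ours.

```python
import numpy as np

def solve_lambda(t, s, sigma2, n_S, r2, iters=200):
    # Bisection on the monotone normalization equation of Theorem 3.2:
    # (sigma2/n_S) * sum_i (1/s_i) * (sqrt(t_i)/lam - 1)_+ = r2.
    def g(lam):
        return sigma2 / n_S * np.sum((np.sqrt(t) / lam - 1).clip(min=0) / s)
    lo, hi = 1e-12, float(np.sqrt(t).max())   # g is decreasing; g(hi) = 0 <= r2
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > r2:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def minimax_shrinkage(beta_SS, t, s, sigma2, n_S, r):
    # Coordinatewise shrinkage (1 - lam/sqrt(t_i))_+ in the shared eigenbasis
    # (taken here to be the standard basis, i.e. U = I).
    lam = solve_lambda(t, s, sigma2, n_S, r ** 2)
    return (1 - lam / np.sqrt(t)).clip(min=0) * beta_SS

def linear_minimax_risk(t, s, sigma2, n_S, r):
    # R_L(B) = (sigma2/n_S) * sum_i (t_i/s_i) * (1 - lam/sqrt(t_i))_+.
    lam = solve_lambda(t, s, sigma2, n_S, r ** 2)
    return sigma2 / n_S * np.sum(t / s * (1 - lam / np.sqrt(t)).clip(min=0))
```

For a general shared eigenbasis U, one would rotate β̂_SS into the eigenbasis, shrink coordinatewise as above, and rotate back, matching the stated form of β̂_MM.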
For both algorithms, λ balances bias and variance: λ = 0 gives an unbiased estimator, and a large λ gives a (near) zero estimator with no variance. The difference is that our estimator shrinks signal directions based on the values t_i: it tends to sacrifice directions where t_i is small. Ridge regression, by contrast, shrinks according to s_i. A natural counterpart is a ridge variant that also regularizes based on t: β̂^λ_RR,T = argmin_β ||Σ_T^{1/2}(β - β̂_SS)||² + λ||β||² = (Σ_T + λI)^{-1} Σ_T β̂_SS. We compare their performance in the experimental section.

Non-commuting covariance matrices. For non-commuting covariate shift, we follow the same procedure. Our estimator is obtained by optimizing over C in β̂_MM = C β̂_SS:

R_L(B) ≡ min_{β̂ = A y_S} max_{β* ∈ B} E ||Σ_T^{1/2}(β̂ - β*)||²
= min_{β̂ = C β̂_SS} max_{||β*|| ≤ r} { ||Σ_T^{1/2}(C - I)β*||² + (σ²/n_S) Tr(Σ_T^{1/2} C Σ̂_S^{-1} C^T Σ_T^{1/2}) }   (Claim 3.1)
= min_{τ, C} { r² τ + (σ²/n_S) Tr(Σ_T^{1/2} C Σ̂_S^{-1} C^T Σ_T^{1/2}) },  s.t. (C - I)^T Σ_T (C - I) ⪯ τ I.   (3)

Unlike the commutative case, this problem has no closed-form solution, but it is still tractable:

Proposition 3.3. Problem (3) is a convex program and thus solvable.

We achieve near-optimal minimax risk among all estimators under some conditions:

Theorem 3.4 (Near-minimaxity of linear estimators). When Σ_S and Σ_T commute, or Σ_T has rank 1, the best linear estimator from (2) or (3) achieves near-optimal minimax risk: L_B(β̂_MM) = R_L(B) ≤ 1.25 R_N(B).

Note that R_N ≤ R_L by definition. Therefore 1) our estimator β̂_MM is near-optimal, and 2) our lower bound for R_N is tight. Lower bounds (without matching upper bounds) for the general non-commuting problem are presented in Kalan et al. (2020); we improve their result in the commutative case and provide a matching algorithm. Their lower bound scales as (d/n_S) min_i t_i/s_i for large r, while ours becomes (1/n_S) Σ_i t_i/s_i.
Our lower bound is always larger and thus tighter, and potentially arbitrarily larger when max_i t_i/s_i and min_i t_i/s_i differ greatly. We defer the proof to the appendix.

Remark 3.1 (Benefit of the minimax linear estimator). Consider ridge regression estimators β̂^λ_RR = argmin_β (1/(2n_S)) ||X_S β - y_S||² + (λ/2)||β||². There is an example in which R_L(B) ≤ O(d^{-1/4} L_B(β̂^λ_RR)) even with the optimal hyperparameter λ.

Remark 3.2 (Incorporating the randomness of source and target features). For clarity of presentation, in the main text we assume access to Σ_T. In practice, Σ_T must be estimated from finitely many unlabeled target samples. In Appendix C.1 we show that our estimator remains near-optimal given d unlabeled target samples, under some standard light-tail assumptions. Theorem 3.4 compares our estimator with the optimal nonlinear estimator using the same data X_S from the source domain. In Appendix C we compare our estimator with a stronger notion of linear estimator that has infinite access to p_S, and show that our estimator is still within a multiplicative factor of it.

3.2. LINEAR MINIMAX ESTIMATOR WITH APPROXIMATION ERROR

Now we consider observations coming from a nonlinear model: y_S = f*(X_S) + z. Let β*_S = argmin_β E_{x~p_S, z~N(0,σ²)}[(f*(x) + z - β^T x)²], and similarly for β*_T. Notice that even with f* unchanged across domains, the input distribution affects the best linear model. The approximation error is a_S(x) = f*(x) - x^T β*_S, and similarly for a_T. Define the reweighting vector w ∈ R^{n_S} by w_i = p_T(x_i)/p_S(x_i). We form an unbiased estimator via

β̂_LS = argmin_β Σ_i (p_T(x_i)/p_S(x_i)) (β^T x_i - y_i)² = (X_S^T diag(w) X_S)^{-1} (X_S^T diag(w) y_S).

Claim 3.5. β̂_LS is asymptotically unbiased and normally distributed: √n_S (β̂_LS - β*_T) →_d N(0, Σ_T^{-1} E_{x~p_T}[(p_T(x)/p_S(x))(a_T(x)² + σ²) x x^T] Σ_T^{-1}).

Denote m(x) = a_T(x) + z. We want to minimize the worst-case risk:

min_{β̂ = C β̂_LS} max_{β*_T ∈ B} E ||Σ_T^{1/2}(β̂ - β*_T)||²
→_d min_C max_{||β*_T|| ≤ r} { ||Σ_T^{1/2}(C - I)β*_T||² + (1/n_S) Tr(C Σ_T^{-1} E_{p_T}[(p_T(x)/p_S(x)) m(x)² x x^T] Σ_T^{-1} C^T Σ_T) }
= min_C { ||(C - I)^T Σ_T (C - I)||_2 r² + (1/n_S) Tr(C Σ_T^{-1} E_{p_T}[(p_T(x)/p_S(x)) m(x)² x x^T] Σ_T^{-1} C^T Σ_T) }.

Therefore our estimator is β̂_MM ← Ĉ β̂_LS, where Ĉ solves the empirical version

Ĉ ← argmin_{τ, C} { r² τ + (1/n_S) ⟨ (1/n_S) Σ_i (p_T(x_i)²/p_S(x_i)²)(y_i - x_i^T β̂_LS)² x_i x_i^T, Σ_T^{-1} C^T Σ_T C Σ_T^{-1} ⟩ }   (4)
s.t. (C - I)^T Σ_T (C - I) ⪯ τ I.

Claim 3.6. Let B = {β : ||β|| ≤ r}, and let f* ∈ F for some compact symmetric function class (f ∈ F ⇔ -f ∈ F). Then the linear minimax estimator is of the form C β̂_LS for some C. When Ĉ solves Eqn. (4), L_B(β̂_MM) asymptotically matches R_L(B), the linear minimax risk.

By reducing from y_S to β̂_LS we eliminate n_S - d dimensions, and this claim says that X_S^T y_S is sufficient for predicting β*_T. We note that f* is more general than a linear function, so the lower bound can only be larger than the R_N(B) defined in the previous section.
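The importance-weighted least-squares step above is a small computation. The sketch below implements β̂_LS via the weighted normal equations; for illustration we use a linear ground truth (so β̂_LS should recover β* for any positive weights) and a stand-in weight vector, since the true density ratio p_T/p_S depends on the chosen distributions. All names and the simulation are ours.

```python
import numpy as np

def reweighted_ls(X_S, y_S, w):
    # argmin_beta sum_i w_i (beta^T x_i - y_i)^2, with w_i = p_T(x_i)/p_S(x_i):
    # solves the weighted normal equations (X^T W X) beta = X^T W y.
    Xw = X_S * w[:, None]
    return np.linalg.solve(Xw.T @ X_S, Xw.T @ y_S)

rng = np.random.default_rng(1)
n_S, d, sigma = 20000, 4, 0.1
beta_star = np.array([1.0, -2.0, 0.5, 3.0])
X_S = rng.normal(size=(n_S, d))
y_S = X_S @ beta_star + sigma * rng.normal(size=n_S)
w = rng.uniform(0.5, 2.0, size=n_S)     # stand-in for the density ratio p_T/p_S
beta_LS = reweighted_ls(X_S, y_S, w)
```

With a genuinely nonlinear f*, the same call yields an asymptotically unbiased estimate of β*_T, with the inflated asymptotic covariance given in Claim 3.5.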

3.3. UTILIZE SOURCE AND TARGET LABELED DATA JOINTLY

In some scenarios we also have a moderate amount of labeled data from the target domain, and it is then important to utilize the source and target labeled data jointly. Let y_S = X_S β* + z_S and y_T = X_T β* + z_T. Treating X_S and X_T as deterministic, Σ̂_S^{-1} X_S^T y_S / n_S ~ N(β*, (σ²/n_S) Σ̂_S^{-1}) and Σ̂_T^{-1} X_T^T y_T / n_T ~ N(β*, (σ²/n_T) Σ̂_T^{-1}). Therefore, conditioned on the observations y_S, y_T, a sufficient statistic for β* is β̂_SS := (n_S Σ̂_S + n_T Σ̂_T)^{-1} (X_S^T y_S + X_T^T y_T).

Claim 3.7. β̂_SS is an unbiased sufficient statistic of β* given samples y_S, y_T, with β̂_SS ~ N(β*, σ² (n_S Σ̂_S + n_T Σ̂_T)^{-1}).

Algorithm: First form the statistic β̂_SS = (n_S Σ̂_S + n_T Σ̂_T)^{-1} (X_S^T y_S + X_T^T y_T). Then find the best linear function of β̂_SS:

β̂_MM = argmin_{C, τ} { r² τ + σ² Tr((n_S Σ̂_S + n_T Σ̂_T)^{-1} C^T Σ_T C) },  s.t. (C - I)^T Σ_T (C - I) ⪯ τ I.

Proposition 3.8. The minimax linear estimator β̂_MM is of the form C β̂_SS for some C. With C chosen by our proposed algorithm, and when Σ̂_S commutes with Σ̂_T and Σ_T, we achieve the minimax risk: R_L(B) ≤ 1.25 R_N(B).
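The joint sufficient statistic of Claim 3.7 pools the two designs: since n_S Σ̂_S = X_S^T X_S and n_T Σ̂_T = X_T^T X_T, it is the ordinary least-squares solution on the concatenated data. A minimal numpy sketch (names and simulation ours):

```python
import numpy as np

def joint_sufficient_statistic(X_S, y_S, X_T, y_T):
    # beta_SS = (n_S Sigma_S_hat + n_T Sigma_T_hat)^{-1} (X_S^T y_S + X_T^T y_T),
    # using n_S Sigma_S_hat = X_S^T X_S and n_T Sigma_T_hat = X_T^T X_T.
    G = X_S.T @ X_S + X_T.T @ X_T
    return np.linalg.solve(G, X_S.T @ y_S + X_T.T @ y_T)

rng = np.random.default_rng(2)
d, n_S, n_T, sigma = 3, 4000, 200, 0.1
beta_star = np.array([1.0, 0.0, -1.0])
X_S = rng.normal(size=(n_S, d))
X_T = rng.normal(size=(n_T, d)) * np.array([2.0, 1.0, 0.5])  # shifted covariates
y_S = X_S @ beta_star + sigma * rng.normal(size=n_S)
y_T = X_T @ beta_star + sigma * rng.normal(size=n_T)
beta_SS = joint_sufficient_statistic(X_S, y_S, X_T, y_T)
```

The algorithm then shrinks β̂_SS by the matrix C solving the convex program above, exactly as in Section 3.1.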

4. NEAR MINIMAX ESTIMATOR WITH MODEL SHIFT

The general setting of transfer learning in linear regression involves both model shift and covariate shift; namely, the generative model of the labels may differ across domains: y_S = X_S β*_S + z_S and y_T = X_T β*_T + z_T, with β*_S = β*_T + δ. First consider a sufficient statistic (β̂_S, β̂_T) for (β*_T, δ). Here β̂_S = Σ̂_S^{-1} X_S^T y_S / n_S ~ N(β*_T + δ, (σ²/n_S) Σ̂_S^{-1}) and β̂_T = Σ̂_T^{-1} X_T^T y_T / n_T ~ N(β*_T, (σ²/n_T) Σ̂_T^{-1}). Then consider the best linear estimator on top of it: β̂ = A_1 β̂_S + A_2 β̂_T. Write ∆ = {δ : ||δ|| ≤ γ} and L_{B,∆}(β̂) := max_{β*_T ∈ B, δ ∈ ∆} E ||Σ_T^{1/2}(β̂ - β*_T)||². Then

R_L(B, ∆) := min_{β̂ = A_1 β̂_S + A_2 β̂_T} L_{B,∆}(β̂)
≤ min_{A_1, A_2} max_{||β*_T|| ≤ r, ||δ|| ≤ γ} { 2||Σ_T^{1/2}(A_1 + A_2 - I)β*_T||² + 2||Σ_T^{1/2} A_1 δ||² + (σ²/n_S) Tr(A_1 Σ̂_S^{-1} A_1^T Σ_T) + (σ²/n_T) Tr(A_2 Σ̂_T^{-1} A_2^T Σ_T) }   (5; by AM-GM)
= min_{A_1, A_2} { 2||Σ_T^{1/2}(A_1 + A_2 - I)||²_2 r² + 2||Σ_T^{1/2} A_1||²_2 γ² + (σ²/n_S) Tr(A_1 Σ̂_S^{-1} A_1^T Σ_T) + (σ²/n_T) Tr(A_2 Σ̂_T^{-1} A_2^T Σ_T) } =: min_{A_1, A_2} r_{B,∆}(A_1, A_2).

Therefore we optimize over this upper bound and reformulate the problem as a convex program:

(Â_1, Â_2) ← argmin_{A_1, A_2, a, b} { 2ar² + 2bγ² + (σ²/n_S) Tr(A_1 Σ̂_S^{-1} A_1^T Σ_T) + (σ²/n_T) Tr(A_2 Σ̂_T^{-1} A_2^T Σ_T) }
s.t. (A_1 + A_2 - I)^T Σ_T (A_1 + A_2 - I) ⪯ aI,  A_1^T Σ_T A_1 ⪯ bI.   (★)

Our estimator is given by β̂_MM = Â_1 β̂_S + Â_2 β̂_T. Since β̂_MM optimizes a relaxation of the linear minimax objective, it is important to understand how well β̂_MM performs on the original objective:

Claim 4.1. R_L(B, ∆) ≤ L_{B,∆}(β̂_MM) ≤ 2 R_L(B, ∆).

Finally, we show that with this relaxation we still achieve a near-optimal estimator even among all nonlinear rules.

Theorem 4.2. When Σ_T commutes with Σ̂_S, we have L_{B,∆}(β̂_MM) := max_{β*_T ∈ B, δ ∈ ∆} E ||Σ_T^{1/2}(β̂_MM - β*_T)||² ≤ 27 R_N(B, ∆), where R_N(B, ∆) := min_{β̂(y_S, y_T)} max_{β*_T ∈ B, δ ∈ ∆} E ||Σ_T^{1/2}(β̂ - β*_T)||² is the minimax risk.

Proof sketch of Theorem 4.2. For ease of understanding, we provide a simple proof sketch when Σ_S = Σ_T are diagonal. We first define the hardest hyperrectangular subproblem. Let B(τ) = {β : |β_i| ≤ τ_i} be a subset of B, and similarly for ∆(ζ).
We show that R_L(B, ∆) = max_{τ ∈ B, ζ ∈ ∆} R_L(B(τ), ∆(ζ)), and clearly R_N(B, ∆) ≥ max_{τ ∈ B, ζ ∈ ∆} R_N(B(τ), ∆(ζ)). Meanwhile, we show that when the sets are hyperrectangles, the minimax (linear) risk decomposes into 1-d problems: R_L(B(τ), ∆(ζ)) = Σ_i R_L(τ_i, ζ_i). Each R_L(τ_i, ζ_i) is the linear minimax risk of estimating β_i from x ~ N(β_i + δ_i, 1) and y ~ N(β_i, 1), where |β_i| ≤ τ_i and |δ_i| ≤ ζ_i. This 1-d problem has a closed-form solution for the linear risk, and the minimax risk can be lower bounded using Le Cam's two-point lemma. We show R_L(τ_i, ζ_i) ≤ 13.5 R_N(τ_i, ζ_i), and therefore:

(1/2) L_{B,∆}(β̂_MM) ≤ R_L(B, ∆)   (Claim 4.1)
= max_{τ ∈ B, ζ ∈ ∆} R_L(B(τ), ∆(ζ))   (Lemma B.2)
= max_{τ ∈ B, ζ ∈ ∆} Σ_i R_L(τ_i, ζ_i)   (Prop B.4.a)
≤ max_{τ ∈ B, ζ ∈ ∆} 13.5 Σ_i R_N(τ_i, ζ_i)   (Lemma B.6)
= 13.5 max_{τ ∈ B, ζ ∈ ∆} R_N(B(τ), ∆(ζ))   (Prop B.4.b)
≤ 13.5 R_N(B, ∆).
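To build intuition for the trade-off encoded by program (★), consider the scalar case d = 1 with Σ̂_S = Σ̂_T = Σ_T = 1, where A_1, A_2 reduce to scalars a_1, a_2 and the relaxed objective becomes an explicit quadratic, minimized by solving its first-order conditions. This is our own simplification for illustration, not the paper's algorithm.

```python
import numpy as np

def best_scalar_weights(r, gamma, sigma2, n_S, n_T):
    # Scalar relaxed objective (d = 1, all covariances = 1):
    #   2 r^2 (a1 + a2 - 1)^2 + 2 gamma^2 a1^2 + (sigma2/n_S) a1^2 + (sigma2/n_T) a2^2.
    # Setting its gradient to zero gives a 2x2 linear system in (a1, a2).
    A = np.array([[4 * r**2 + 4 * gamma**2 + 2 * sigma2 / n_S, 4 * r**2],
                  [4 * r**2, 4 * r**2 + 2 * sigma2 / n_T]])
    b = np.array([4 * r**2, 4 * r**2])
    return np.linalg.solve(A, b)     # (a1, a2)
```

As the model shift γ grows, the weight a_1 on the source statistic shrinks toward 0 and the estimator leans on the target data, mirroring the trend in Figure 1(c) where ridge-source degrades with γ while ridge-target is unaffected.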

5. EXPERIMENTS

Our estimators are provably near-optimal for the worst-case β*. However, it remains unknown whether they outperform other baselines on average. With synthetic data, we explore performance for random β* and investigate the conditions under which our gains are largest.

Setup. We set n_S = 2000, d = 50, σ = 1, r = √d. For each setting, we sample β*_T from a standard normal distribution and rescale it to norm r. We assume Σ_T is known. We compare our estimator with ridge regression (S-ridge) and a variant of ridge regression transformed to the target domain (T-ridge): β̂^λ_RR,T = argmin_β ||Σ_T^{1/2}(β - β̂_SS)||² + λ||β||² = (Σ_T + λI)^{-1} Σ_T β̂_SS.

Covariate shift. To understand the effect of covariate shift on our algorithm, we consider three types of settings, each with a unique varying factor that influences performance: 1) covariate eigenvalue shift with a shared eigenspace; 2) covariate eigenspace shift with fixed eigenvalues; 3) changing signal strength. We also use an additional 200 labeled examples from the target domain as a validation set, solely for hyperparameter tuning.

Model shift. Next we consider the problem with model shift. We sample a random δ with norm γ varying from 0 to r = √d and observe data generated by y_S = X_S(β*_T + δ) + z_S ∈ R^2000, z_S ~ N(0, I), and y_T = X_T β*_T + z_T ∈ R^500, z_T ~ N(0, I). We compare our estimator with two baselines: "ridge-source" denotes ridge regression using only source data, and "ridge-target" denotes ridge regression using only target data. Figure 1 demonstrates the better performance of our estimator in all circumstances. From (a) we see that with more discrepancy between Σ_S and Σ_T, our estimator tends to perform better. Panel (b) shows that our estimator is better when the signal is relatively stronger. From (c) we see that as the model shift γ/r increases, ridge-source degrades and is eventually outperformed by ridge-target, which is unaffected by γ. Our estimator also degrades slightly, since the source data becomes less useful, but it remains the best throughout. Around γ/r ≈ 0.2, our method's percentage improvement over the better of ridge-source and ridge-target is largest.

A OMITTED PROOF FOR MINIMAX ESTIMATOR WITH COVARIATE SHIFT

A.1 PINSKER'S THEOREM AND COVARIATE SHIFT WITH LINEAR MODELS

Theorem A.1 (Pinsker's theorem). Suppose the observations follow the sequence model y_i = θ*_i + ε_i z_i, ε_i > 0, i ∈ [d], and Θ is an ellipsoid in R^d: Θ = Θ(a, C) = {θ : Σ_i a_i² θ_i² ≤ C²}. Then the minimax linear risk is

R_L(Θ) := min_{θ̂ linear} max_{θ* ∈ Θ} E ||θ̂(y) - θ*||² = Σ_i ε_i² (1 - a_i/µ)_+,

where µ = µ(C) is determined by Σ_{i=1}^d ε_i² a_i (µ - a_i)_+ = C². The linear minimax estimator is given coordinatewise by θ̂*_i(y) = c*_i y_i = (1 - a_i/µ)_+ y_i, and it is Bayes for a Gaussian prior π_C with independent components θ_i ~ N(0, τ*_i²), where τ*_i² = ε_i² (µ/a_i - 1)_+.

Our Theorem 3.2 connects our parameter β* to the θ* in Pinsker's theorem. First we show that restricting from a linear map of the n_S-dimensional observations y_S to a linear map of the d-dimensional statistic β̂_SS is without loss, i.e., Claim 3.1.

Proof of Claim 3.1. We show that if β̂(y_S) := A y_S is a minimax linear estimator, each row vector of A ∈ R^{d×n_S} lies in the column span of X_S. Write A = A_1 X_S^T + A_2 W^T, where W ∈ R^{n_S×(n_S-d)} has columns forming an orthonormal basis of the orthogonal complement of the column space of X_S. Equivalently, we want to show that we may take A_2 = 0. We have

R_L(B) ≡ min_{β̂ = A y_S} max_{β* ∈ B} E ||Σ_T^{1/2}(β̂ - β*)||²
= min_{A_1, A_2} max_{β* ∈ B} E ||Σ_T^{1/2}((A_1 X_S^T + A_2 W^T) y_S - β*)||²
= min_{A_1, A_2} max_{β* ∈ B} E ||Σ_T^{1/2}(A_1 X_S^T (X_S β* + z) + A_2 W^T z - β*)||²   (since W^T X_S = 0)
= min_{A_1, A_2} max_{β* ∈ B} { ||Σ_T^{1/2}(A_1 X_S^T X_S - I)β*||² + E ||Σ_T^{1/2} A_1 X_S^T z||² + E ||Σ_T^{1/2} A_2 W^T z||² + 2 E⟨Σ_T^{1/2} A_1 X_S^T z, Σ_T^{1/2} A_2 W^T z⟩ }   (other cross terms vanish since E[z] = 0)
= min_{A_1, A_2} max_{β* ∈ B} { ||Σ_T^{1/2}(A_1 X_S^T X_S - I)β*||² + E ||Σ_T^{1/2} A_1 X_S^T z||² + E ||Σ_T^{1/2} A_2 W^T z||² },

where the last equality holds because

E⟨Σ_T^{1/2} A_1 X_S^T z, Σ_T^{1/2} A_2 W^T z⟩ = E Tr(Σ_T^{1/2} A_1 X_S^T z z^T W A_2^T Σ_T^{1/2}) = Tr(Σ_T^{1/2} A_1 X_S^T E[z z^T] W A_2^T Σ_T^{1/2}) = σ² Tr(Σ_T^{1/2} A_1 X_S^T W A_2^T Σ_T^{1/2}) = 0.

Clearly, at the min-max point we can take A_2 = 0 without loss of generality.
Formally, the proof of Theorem 3.2 is as follows.

Proof of Theorem 3.2. To use Pinsker's theorem, we transform the problem to match its setting. Let ȳ_T = U^T Σ_T^{1/2} Σ̂_S^{-1} X_S^T y_S / n_S = θ*_T + z̄_T, where θ*_T = U^T Σ_T^{1/2} β* and z̄_T ~ N(0, (σ²/n_S) diag([t_i/s_i]_{i=1}^d)). The set for θ*_T is Θ = {θ : ||Σ_T^{-1/2} U θ|| ≤ r}, i.e., Θ = {θ : Σ_i θ_i²/t_i ≤ r²}. By Pinsker's theorem, θ̂(ȳ_T)_i = (1 - 1/(µ√t_i))_+ (ȳ_T)_i is the best linear estimator for θ*_T, where µ = µ(r) solves (σ²/n_S) Σ_{i=1}^d (√t_i/s_i)(µ - 1/√t_i)_+ = r². Translating back to the original problem, the best estimator for Σ_T^{1/2} β* is U(I - (1/µ) diag([1/√t_i]_{i=1}^d)) U^T Σ_T^{1/2} Σ̂_S^{-1} X_S^T y_S / n_S; setting λ = 1/µ recovers the statement.

Proof of Proposition 3.3. It suffices to show the feasible set of (3) is convex. Given feasible points (C_1, τ_1) and (C_2, τ_2), let C_α := αC_1 + (1 - α)C_2 and τ_α := ατ_1 + (1 - α)τ_2; we show (C_α - I)^T Σ_T (C_α - I) ⪯ τ_α I for any α ∈ [0, 1]. First, notice (C_1 - C_2)^T Σ_T (C_1 - C_2) ⪰ 0. Next,

(C_α - I)^T Σ_T (C_α - I) = α(C_1 - I)^T Σ_T (C_1 - I) + (1 - α)(C_2 - I)^T Σ_T (C_2 - I) - α(1 - α)(C_1 - C_2)^T Σ_T (C_1 - C_2) ⪯ α(C_1 - I)^T Σ_T (C_1 - I) + (1 - α)(C_2 - I)^T Σ_T (C_2 - I) ⪯ τ_α I.

Benefit of our estimator. Compared to ridge regression, our estimator can be better by a factor of order d^{1/4}:

Proof of Remark 3.1. We consider diagonal covariance matrices Σ̂_S = diag(s), Σ_T = diag(t), and σ = 1. First we calculate the expected risk of ridge regression: β̂^λ_RR = (X_S^T X_S/n_S + λI)^{-1} X_S^T y_S/n_S ~ N((Σ̂_S + λI)^{-1} Σ̂_S β*, (1/n_S)(Σ̂_S + λI)^{-2} Σ̂_S), so

L_B(β̂^λ_RR) = max_{β* ∈ B} E_{y_S} ||Σ_T^{1/2}(β̂^λ_RR(y_S) - β*)||² = max_{β* ∈ B} ||Σ_T^{1/2}((Σ̂_S + λI)^{-1} Σ̂_S - I)β*||² + Tr((1/n_S)(Σ̂_S + λI)^{-2} Σ̂_S Σ_T) = max_i r² t_i λ²/(s_i + λ)² + Σ_i (1/n_S) t_i s_i/(s_i + λ)².

Compare this to our risk: R_L(B) = Σ_i (1/n_S)(t_i/s_i)(1 - 1/(√t_i µ))_+, where (1/n_S) Σ_{i=1}^d (√t_i/s_i)(µ - 1/√t_i)_+ = r². Let r² = √d/n_S, s_i = 1 for all i, t_i = 1 for i ∈ [d_0], and t_i = d^{-1/2} for d_0 < i ≤ d, where d_0 = √d/(d^{1/4} - 1) ≈ d^{1/4}. Then µ = d^{1/4} and R_L(B) = d^{1/4}/n_S.
In this case,

min_λ { max_i r² t_i λ²/(s_i + λ)² + Σ_i (1/n_S) t_i s_i/(s_i + λ)² } = min_λ { (√d/n_S) λ²/(1 + λ)² + Σ_i (1/n_S) t_i/(1 + λ)² } ≥ min_λ { (√d/n_S) λ²/(1 + λ)² + (√d/n_S)/(1 + λ)² } ≥ √d/(2n_S).

Therefore min_λ L_B(β̂^λ_RR) ≥ d^{1/4} R_L(B)/2.

Near-minimax risk. Even among all nonlinear estimators, our estimator is within a factor 1.25 of the minimax risk:

Proof of Theorem 3.4. First note that for both linear and nonlinear estimators, it suffices to use β̂_SS instead of the original observations y_S; see Lemma A.2 and its corollary. It therefore suffices to reformulate the problem as follows. When Σ_S and Σ_T commute, we obtain the following Gaussian sequence model. Recall Σ̂_S = U diag(s) U^T, Σ_T = U diag(t) U^T. Let θ* = U^T Σ_T^{1/2} β* and y = U^T Σ_T^{1/2} β̂_SS ~ N(θ*, (σ²/n_S) diag(t/s)). Minimizing ||Σ_T^{1/2}(β̂(y_S) - β*)|| over linear estimators is equivalent to minimizing ||U(θ̂(y) - θ*)|| = ||θ̂(y) - θ*|| over linear estimators. The constraint θ* = U^T Σ_T^{1/2} β* with ||β*|| ≤ r is equivalent to ||Σ_T^{-1/2} U θ*|| ≤ r, i.e., Σ_i θ*_i²/t_i ≤ r², an axis-aligned ellipsoid. The result then follows directly from Corollary 4.26 of Johnstone (2011). Note that this result is a special case of Theorem 4.2, for which a detailed proof is provided in Section B, so we omit further details here. For the case when Σ_T = a a^T is rank 1, the objective function becomes R_L(B) = min_{β̂ linear} max_{β* ∈ B} E (a^T(β̂(y_S) - β*))². The result then follows from Corollary 1 of Donoho (1994), which reduces the problem to the hardest one-dimensional subproblem, which is tractable. In the proof above, we equated the best nonlinear estimator on y_S with the best nonlinear estimator on β̂_SS. The reasoning is as follows: Lemma A.2 (Sufficient

A.3 OMITTED PROOF WITH APPROXIMATION ERROR

Unbiased estimator for β*_T.

Proof of Claim 3.5. We have

β̂_LS - β*_T = (X_S^T diag(w) X_S)^{-1}(X_S^T diag(w) y) - β*_T = (X_S^T diag(w) X_S)^{-1}(X_S^T diag(w)(X_S β*_T + a_T + z)) - β*_T = (X_S^T diag(w) X_S)^{-1}(X_S^T diag(w)(a_T + z)).

Notice E_{x~p_S}[x a_T(x) p_T(x)/p_S(x)] = E_{x~p_T}[x a_T(x)] = 0. This is the KKT condition for the minimizer of l(β) := E_{x~p_T}(f*(x) - β^T x)² at β*_T: ∇_β l(β*_T) = 0 implies E_{x~p_T}[x(f*(x) - x^T β*_T)] = 0, i.e., E_{x~p_T}[x a_T(x)] = 0. Next we have

E_{x_i~p_S}[X_S^T diag(w) X_S] = E_{x_i~p_S} Σ_{i=1}^{n_S} (p_T(x_i)/p_S(x_i)) x_i x_i^T = E_{x_j~p_T} Σ_{j=1}^{n_S} x_j x_j^T = n_S Σ_T.

Therefore β̂_LS - β*_T →_d N(0, (1/n_S) Σ_T^{-1} E_{x~p_T}[(p_T(x)/p_S(x))(a_T(x)² + σ²) x x^T] Σ_T^{-1}).

Proof of Claim 3.6. Recall X_S = [x_1 | x_2 | ... | x_{n_S}]^T ∈ R^{n_S×d}, with each x_i drawn from p_S; a_T = [a_T(x_1), a_T(x_2), ..., a_T(x_{n_S})]^T ∈ R^{n_S}; y = [y(x_1), y(x_2), ..., y(x_{n_S})]^T ∈ R^{n_S}; noise z = y - f*(X_S); and w = [p_T(x_i)/p_S(x_i)]_i. To prove the claim, we only need to show that the minimax linear estimator Ay is of the form A_1 X_S^T diag(w), i.e., the rows of A lie in the row span of X_S^T diag(w). Write A = A_1 X_S^T diag(w) + A_2 W^T, where W ∈ R^{n_S×(n_S-d)} has orthonormal columns (W^T W = I_{n_S-d}) forming the orthogonal complement of the column span of diag(w) X_S, so that X_S^T diag(w) W = 0. Then

R_L(B) ≡ min_A max_{β*_T ∈ B, a_T ∈ F} E_{x_i~p_S, z}[||Σ_T^{1/2}(Ay - β*_T)||²]
= min_A max_{β*_T ∈ B, a_T ∈ F} E ||Σ_T^{1/2}((AX_S - I)β*_T + A a_T + A z)||²
= min_A max_{β*_T ∈ B, a_T ∈ F} { ||Σ_T^{1/2}((E[AX_S] - I)β*_T + E[A a_T])||² + E ||Σ_T^{1/2}(AX_S - E[AX_S])β*_T||² + E ||Σ_T^{1/2}(A a_T - E[A a_T])||² + E ||Σ_T^{1/2} A z||² }.

Also, notice E_{x_i~p_S}[X_S^T diag(w) a_T] = n_S E_{x~p_T}[x a_T(x)] = 0.
Therefore, plugging A = A_1 X_S^T diag(w) + A_2 W^T into R_L(B), we have

R_L(B) = min_{A_1, A_2} max_{β*_T ∈ B, f* ∈ F} { ||Σ_T^{1/2}((A_1 E_{p_S}[X_S^T diag(w) X_S] - I)β*_T + A_2 E[W^T a_T])||² + E ||Σ_T^{1/2} A_1 (X_S^T diag(w) X_S - E[X_S^T diag(w) X_S])β*_T||² + E ||Σ_T^{1/2} A_2 (W^T a_T - E[W^T a_T])||² + σ² E ||Σ_T^{1/2} A_1 X_S^T diag(w)||² + σ² E ||Σ_T^{1/2} A_2||² }
= min_{A_1, A_2} max_{β*_T ∈ B, f* ∈ F} { ||Σ_T^{1/2}((A_1 n_S Σ_T - I)β*_T + A_2 E[W^T a_T])||² + E ||Σ_T^{1/2} A_1 (X_S^T diag(w) X_S - n_S Σ_T)β*_T||² + E ||Σ_T^{1/2} A_2 (W^T a_T - E[W^T a_T])||² + σ² E ||Σ_T^{1/2} A_1 X_S^T diag(w)||² + σ² E ||Σ_T^{1/2} A_2||² }.

We can treat E[W^T a_T] and W^T a_T - E[W^T a_T] separately. First, at the min-max point, if E[W^T a_T] = 0, the minimizer A_2 must be 0, since A_2 appears only in the third and last terms, both nonnegative. If E[W^T a_T] ≠ 0, the cross term in the bias must be nonnegative at the maximum; otherwise, since both f* and -f* are in F, we could replace (a_T, β*_T) by (-a_T, -β*_T) and increase the loss. Clearly, in this case A_2 must also be 0 at the min-max point.

Sufficient statistic.

Proof of Claim 3.7. Denote β̂_S := Σ̂_S^{-1} X_S^T y_S / n_S ~ N(β*, (σ²/n_S) Σ̂_S^{-1}) and β̂_T := Σ̂_T^{-1} X_T^T y_T / n_T ~ N(β*, (σ²/n_T) Σ̂_T^{-1}). We use the Fisher-Neyman factorization theorem to derive the sufficient statistic. The likelihood of observing (β̂_S, β̂_T) under parameter β* is

p(β̂_S, β̂_T; β*) = c exp(-(n_S/(2σ²))(β̂_S - β*)^T Σ̂_S (β̂_S - β*) - (n_T/(2σ²))(β̂_T - β*)^T Σ̂_T (β̂_T - β*)) = c g(β*, T(β̂_S, β̂_T)) h(β̂_S, β̂_T),

where g(β*, T(β̂_S, β̂_T)) = exp(-(1/2)(β* - β̂_SS)^T ((n_S/σ²) Σ̂_S + (n_T/σ²) Σ̂_T)(β* - β̂_SS)) and c is some constant. It follows that T(β̂_S, β̂_T) = β̂_SS is a sufficient statistic for β*.

Proof of Proposition 3.8. By a similar argument as before, and since z_S and z_T are independent, the optimal estimator has the form β̂ = A Σ̂_S^{-1} X_S^T y_S / n_S + B Σ̂_T^{-1} X_T^T y_T / n_T ~ N((A + B)β*, (σ²/n_S) A Σ̂_S^{-1} A^T + (σ²/n_T) B Σ̂_T^{-1} B^T). Then

R_L(B) = min_{A, B} max_{β* ∈ B} E_z ||Σ_T^{1/2}(β̂ - β*)||² = min_{A, B} max_{β* ∈ B} { ||Σ_T^{1/2}(A + B - I)β*||² + σ² Tr(((1/n_S) A Σ̂_S^{-1} A^T + (1/n_T) B Σ̂_T^{-1} B^T) Σ_T) } = min_{A, B} { ||Σ_T^{1/2}(A + B - I)||²_op r² + σ² Tr(((1/n_S) A Σ̂_S^{-1} A^T + (1/n_T) B Σ̂_T^{-1} B^T) Σ_T) }.

Taking gradients with respect to A and B respectively gives

∇_A (||Σ_T^{1/2}(A + B - I)||²_op r²) + (2σ²/n_S) Σ_T A Σ̂_S^{-1} = 0 = ∇_B (||Σ_T^{1/2}(A + B - I)||²_op r²) + (2σ²/n_T) Σ_T B Σ̂_T^{-1}.

The first terms are equal, so (1/n_S) A Σ̂_S^{-1} = (1/n_T) B Σ̂_T^{-1}; hence the optimal β̂ is of the form C(X_S^T y_S + X_T^T y_T) for some matrix C, finishing the proof.

B OMITTED PROOF WITH MODEL SHIFT

$$L_{\mathcal{B},\Delta}(\hat\beta) \le r_{\mathcal{B},\Delta}(\hat\beta) \le 2L_{\mathcal{B},\Delta}(\hat\beta). \qquad (9)$$

The first inequality is straightforward by the same AM-GM reasoning as in the derivation of (5). For the second inequality, we take a closer look at (5). Notice that when the maximum over $\beta_T^*\in\mathcal{B}, \delta\in\Delta$ is achieved, the cross term has to be nonnegative; otherwise one could flip the sign of $\beta_T^*$ to make the value larger. Therefore, at the maximum,

$$\big\|\Sigma_T^{1/2}(A_1+A_2-I)\beta_T^*\big\|^2 + \big\|\Sigma_T^{1/2}A_1\delta\big\|^2 \le \big\|\Sigma_T^{1/2}\big((A_1+A_2-I)\beta_T^* + A_1\delta\big)\big\|^2,$$

and the remaining terms are all nonnegative. Therefore $r_{\mathcal{B},\Delta}(\hat\beta) \le 2L_{\mathcal{B},\Delta}(\hat\beta)$.

Now let $\hat\beta^* = \arg\min_{\hat\beta = A_1\bar y_S + A_2\bar y_T} L_{\mathcal{B},\Delta}(\hat\beta)$. We have:

$$R_L(\mathcal{B},\Delta) = L_{\mathcal{B},\Delta}(\hat\beta^*) \overset{(a)}{\le} L_{\mathcal{B},\Delta}(\hat\beta_{MM}) \overset{(9)}{\le} r_{\mathcal{B},\Delta}(\hat\beta_{MM}) \overset{(b)}{\le} r_{\mathcal{B},\Delta}(\hat\beta^*) \overset{(9)}{\le} 2L_{\mathcal{B},\Delta}(\hat\beta^*) = 2R_L(\mathcal{B},\Delta).$$

Inequality (a) is by the definition of $\hat\beta^*$, while (b) follows from the definition of $\hat\beta_{MM}$.

B.1 LOWER BOUND WITH MODEL SHIFT

In order to derive the lower bound, we abstract the problem into the following more general one:

Problem 1. For an arbitrary diagonal matrix $D \in \mathbb{R}^{d\times d}$ and two $\ell_2$-compact, solid, orthosymmetric, and quadratically convex sets $\Theta, \Delta \subset \mathbb{R}^d$, let

$$\mathcal{P}_{\Theta,\Delta,D} = \left\{ \mathcal{N}\left( \begin{bmatrix} D\theta + \delta \\ \theta \end{bmatrix}, \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} \right) \,\middle|\, \theta\in\Theta,\ \delta\in\Delta \right\},$$

and let $R_L(\Theta,\Delta,D)$ and $R_N(\Theta,\Delta,D)$ be the corresponding minimax linear risk and minimax risk for estimating $\theta$. Here $r_P(\hat\theta) := \mathbb{E}_{x\sim P}\|\hat\theta(x) - \theta(P)\|_2^2$. We want to derive a uniform lower bound for $R_N$ in terms of $R_L$, i.e., $R_N \ge \mu^* R_L$, where $\mu^*$ is universal and does not depend on the choices of $D$, $\Theta$, or $\Delta$.

Before proving the lower bound, we establish its connection to our problem:

Remark B.1. Suppose $\Sigma_S = U\mathrm{diag}(s)U^\top$ and $\Sigma_T = U\mathrm{diag}(t)U^\top$ share the same eigenspace. Recall our samples $a \sim \mathcal{N}(\Sigma_S^{1/2}(\beta_T^*+\delta), \sigma^2 I)$ and $b \sim \mathcal{N}(\Sigma_T^{1/2}\beta_T^*, \sigma^2 I)$. Our goal of uniformly lower bounding $R_N(r,\gamma)$ by $R_L(r,\gamma)$ is essentially Problem 1, where

$$R_L(r,\gamma) := \min_{\hat\beta\ \mathrm{linear}}\ \max_{\|\beta_T^*\|\le r,\ \|\delta\|\le\gamma} \mathbb{E}\big\|\Sigma_T^{1/2}(\hat\beta(a,b) - \beta_T^*)\big\|^2, \qquad R_N(r,\gamma) := \min_{\hat\beta}\ \max_{\|\beta_T^*\|\le r,\ \|\delta\|\le\gamma} \mathbb{E}\big\|\Sigma_T^{1/2}(\hat\beta(a,b) - \beta_T^*)\big\|^2.$$

Proof of Remark B.1. Our target considers samples drawn from the distributions above, i.e.,

$$\begin{bmatrix} a \\ b \end{bmatrix} \sim \mathcal{N}\left( \begin{bmatrix} U\mathrm{diag}(s^{1/2})U^\top(\beta_T^*+\delta) \\ U\mathrm{diag}(t^{1/2})U^\top\beta_T^* \end{bmatrix}, \begin{bmatrix} \sigma^2 I & 0 \\ 0 & \sigma^2 I \end{bmatrix} \right) \iff \begin{bmatrix} U^\top a/\sigma \\ U^\top b/\sigma \end{bmatrix} \sim \mathcal{N}\left( \frac{1}{\sigma}\begin{bmatrix} \mathrm{diag}(s^{1/2})U^\top(\beta_T^*+\delta) \\ \mathrm{diag}(t^{1/2})U^\top\beta_T^* \end{bmatrix}, \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} \right),$$

with $\|\beta_T^*\|\le r$, $\|\delta\|\le\gamma$. Let $\bar a = U^\top a/\sigma$, $\bar b = U^\top b/\sigma$, $\bar\Theta = \{\theta : \|\mathrm{diag}(t^{-1/2})\theta\|\le r\}$, $\bar\Delta = \{\delta : \|\mathrm{diag}(s^{-1/2})\delta\|\le\gamma\}$, $\bar\theta = U^\top\Sigma_T^{1/2}\beta_T^*$, $\bar\delta = U^\top\Sigma_S^{1/2}\delta$, and $D = \mathrm{diag}(s^{1/2}t^{-1/2})$ (the factor $1/\sigma$ simply rescales the radii). We get:

$$\begin{bmatrix} \bar a \\ \bar b \end{bmatrix} \sim P_{\bar\theta,\bar\delta,D} := \mathcal{N}\left( \begin{bmatrix} D\bar\theta + \bar\delta \\ \bar\theta \end{bmatrix}, \begin{bmatrix} I & 0 \\ 0 & I \end{bmatrix} \right), \qquad \bar\theta\in\bar\Theta,\ \bar\delta\in\bar\Delta.$$

Let $\mathcal{P}_{\bar\Theta,\bar\Delta,D} := \{P_{\bar\theta,\bar\delta,D} \mid \bar\theta\in\bar\Theta, \bar\delta\in\bar\Delta\}$. Since $U$ is an invertible matrix, observing $(U^\top a/\sigma, U^\top b/\sigma)$ instead of $(a,b)$ has no effect on the performance of the best estimator. Also, $\bar\Theta$ and $\bar\Delta$ are axis-aligned ellipsoids and thus solid, orthosymmetric, and quadratically convex. Therefore our problem is essentially reduced to Problem 1.

Lemma B.2.
Let $\Theta(\tau) = \{\theta \in \Theta : |\theta_i| \le \tau_i,\ \forall i\}$ and similarly $\Delta(\zeta) = \{\delta \in \Delta : |\delta_i| \le \zeta_i,\ \forall i\}$, where $D$ is some diagonal matrix. Then

$$R_L(\Theta,\Delta,D) = \sup_{\tau\in\Theta,\,\zeta\in\Delta} R_L(\Theta(\tau),\Delta(\zeta),D), \qquad R_N(\Theta,\Delta,D) \ge \sup_{\tau\in\Theta,\,\zeta\in\Delta} R_N(\Theta(\tau),\Delta(\zeta),D).$$

Write samples drawn from some $P_{\theta,\delta,D} \in \mathcal{P}_{\Theta,\Delta,D}$ as $(x,y)$: $x \sim \mathcal{N}(D\theta+\delta, I)$, $y \sim \mathcal{N}(\theta, I)$.

Lemma B.3. The minimax linear estimator $\hat\theta : (x,y) \mapsto Ax + By$ can be taken coordinatewise, i.e., of the form $\hat\theta_{a,b}(x,y)_i = a_i x_i + b_i y_i$ for some $a, b \in \mathbb{R}^d$. Namely, $R_L(\Theta,\Delta,D) = \inf_{\hat\theta_{a,b}} \max_{P\in\mathcal{P}_{\Theta,\Delta,D}} r_P(\hat\theta_{a,b})$.

Proof. According to the proof of Proposition B.4.a, by discarding off-diagonal terms, the maximum risk of any linear estimator $\hat\theta_{A,B}$ over any hyperrectangles $\Theta(\tau), \Delta(\zeta)$ is reduced:

$$\max_{\theta\in\Theta(\tau),\,\delta\in\Delta(\zeta)} r_{P_{\theta,\delta,D}}(\hat\theta_{A,B}) \ \ge\ \max_{\theta\in\Theta(\tau),\,\delta\in\Delta(\zeta)} r_{P_{\theta,\delta,D}}(\hat\theta_{\mathrm{diag}(A),\mathrm{diag}(B)}).$$

We now prove this inequality. For any $\bar\tau \in \Theta(\tau)$, $\bar\zeta \in \Delta(\zeta)$, let $V(\bar\tau,\bar\zeta) = \{(\theta,\delta) \mid (\theta_i,\delta_i) \in \{(\bar\tau_i, \bar\zeta_i), (-\bar\tau_i, -\bar\zeta_i)\}\}$ be the subset of vertices of $\Theta(\bar\tau)\times\Delta(\bar\zeta)$, and let $\pi(\bar\tau,\bar\zeta)$ be the uniform distribution on this finite set. Due to the symmetry of this distribution, we have

$$\mathbb{E}_{\pi(\bar\tau,\bar\zeta)}\,\theta_i = 0, \quad \mathbb{E}_{\pi(\bar\tau,\bar\zeta)}\,\delta_i = 0, \quad \mathbb{E}_{\pi(\bar\tau,\bar\zeta)}\,\theta_i\theta_j = \mathbb{1}_{i=j}\,\bar\tau_i^2, \quad \mathbb{E}_{\pi(\bar\tau,\bar\zeta)}\,\delta_i\delta_j = \mathbb{1}_{i=j}\,\bar\zeta_i^2, \quad \mathbb{E}_{\pi(\bar\tau,\bar\zeta)}\,\theta_i\delta_j = \mathbb{1}_{i=j}\,\bar\tau_i\bar\zeta_i, \qquad i,j \in [d].$$
We utilize this distribution to lower bound the maximum (in fact, the maximum is attained on the vertex set $V(\bar\tau,\bar\zeta)$):

$$
\begin{aligned}
\max_{\theta\in\Theta(\tau),\,\delta\in\Delta(\zeta)} r_{P_{\theta,\delta,D}}(\hat\theta_{A,B}) 
&\ge \mathbb{E}_{\pi(\bar\tau,\bar\zeta)}\, r_{P_{\theta,\delta,D}}(\hat\theta_{A,B}) 
= \mathbb{E}_{\pi(\bar\tau,\bar\zeta)} \|(AD+B-I)\theta + A\delta\|^2 + \mathrm{Tr}(AA^\top) + \mathrm{Tr}(BB^\top) \\
&\ge \sum_i \big((A_{ii}D_{ii}+B_{ii}-1)\bar\tau_i + A_{ii}\bar\zeta_i\big)^2 + \sum_i \big(A_{ii}^2 + B_{ii}^2\big) \\
&= \|(\mathrm{diag}(A)D+\mathrm{diag}(B)-I)\theta + \mathrm{diag}(A)\delta\|^2 + \mathrm{Tr}(\mathrm{diag}(A)^2) + \mathrm{Tr}(\mathrm{diag}(B)^2) \quad \big(\forall (\theta,\delta)\in V(\bar\tau,\bar\zeta)\big) \\
&= \max_{(\theta,\delta)\in V(\bar\tau,\bar\zeta)} \|(\mathrm{diag}(A)D+\mathrm{diag}(B)-I)\theta + \mathrm{diag}(A)\delta\|^2 + \mathrm{Tr}(\mathrm{diag}(A)^2) + \mathrm{Tr}(\mathrm{diag}(B)^2).
\end{aligned}
$$

Taking the supremum over $\bar\tau\in\Theta(\tau)$, $\bar\zeta\in\Delta(\zeta)$ yields the claimed inequality. Next, since the optimum on the minimizer side is always attained by diagonal $A, B$, each axis can be treated separately, thus finishing the proof of part a. The nonlinear part b is a straightforward extension of Proposition 4.16 from Johnstone (2011).

Theorem B.5 (Restated Le Cam two-point theorem, Wainwright (2019)). Let $\mathcal{P}$ be a family of distributions, and $\theta : \mathcal{P} \to \Theta$ some associated parameter. Let $\rho : \Theta\times\Theta \to \mathbb{R}_+$ be some metric on $\Theta$, and $\Phi : \mathbb{R}_+ \to \mathbb{R}_+$ a monotone nondecreasing function with $\Phi(0) = 0$. For any $\alpha \in (0,1)$,

$$\inf_{\hat\theta}\ \sup_{P\in\mathcal{P}} \mathbb{E}\big[\Phi(\rho(\hat\theta, \theta(P)))\big] \ \ge \max_{\substack{P_1,P_2\in\mathcal{P} \\ \|P_1^n - P_2^n\|_{TV}\le\alpha}} \frac{1}{2}\,\Phi\Big(\frac{1}{2}\rho(\theta(P_1),\theta(P_2))\Big)(1-\alpha).$$

Lemma B.6. Consider the class of distributions $\mathcal{P}_{\tau,\zeta,s} = \{P_{\theta,\delta,s} := \mathcal{N}([s\theta+\delta,\ \theta]^\top, I_2) : |\theta|\le\tau, |\delta|\le\zeta\}$. Define $R_L(\tau,\zeta,s) = \min_{\hat\theta\ \mathrm{linear}} \max_{|\theta|\le\tau,|\delta|\le\zeta} \mathbb{E}_{x\sim P_{\theta,\delta,s}} (\hat\theta(x)-\theta)^2$ and $R_N(\tau,\zeta,s) = \min_{\hat\theta} \max_{|\theta|\le\tau,|\delta|\le\zeta} \mathbb{E}_{x\sim P_{\theta,\delta,s}} (\hat\theta(x)-\theta)^2$. We have $R_L(\tau,\zeta,s) \le \tfrac{27}{2}\, R_N(\tau,\zeta,s)$ for all $\zeta, s > 0$, $\tau > 0$.
Proof of Lemma B.6. We first calculate an upper bound on $R_L$ and then connect it to a lower bound on $R_N$. For a linear estimator $\hat\theta_{a,b}(x) = ax_1 + bx_2$,

$$R_L(\tau,\zeta,s) = \min_{a,b}\ \big(|as+b-1|\tau + |a|\zeta\big)^2 + a^2 + b^2 \ \le\ \min_{a,b}\ 2(as+b-1)^2\tau^2 + 2a^2\zeta^2 + a^2 + b^2.$$

By some detailed calculation, the right-hand side equals

$$\frac{2\tau^2(2\zeta^2+1)}{2\tau^2(s^2+2\zeta^2+1) + 2\zeta^2+1} \ \le\ \min\Big\{1,\ 2\tau^2,\ \frac{1+4\zeta^2}{s^2+1}\Big\}.$$

Next, we use the Le Cam two-point theorem to lower bound $R_N(\tau,\zeta,s)$, where the metric $\rho$ is the Euclidean distance and $\Phi$ is the square function. Therefore

$$R_N(\tau,\zeta,s) \ \ge \max_{|\theta_i|\le\tau,\,|\delta_i|\le\zeta,\, i\in\{1,2\}} \frac{1}{2}\Big(\frac{1}{2}(\theta_1-\theta_2)\Big)^2 (1-\alpha) \quad \text{s.t.}\ \big\|\mathcal{N}([s\theta_1+\delta_1, \theta_1]^\top, I_2) - \mathcal{N}([s\theta_2+\delta_2, \theta_2]^\top, I_2)\big\|_{TV} \le \alpha.$$

Since the total variation distance is related to the Kullback–Leibler divergence by Pinsker's inequality, $\|\cdot,\cdot\|_{TV} \le \sqrt{\tfrac{1}{2}D_{KL}(\cdot\|\cdot)}$, it is sufficient to replace the constraint by $D_{KL} \le 2\alpha^2$; since for unit-covariance Gaussians $D_{KL} = \tfrac{1}{2}\|\mu_1-\mu_2\|^2$, this reads

$$\big(s\theta_1+\delta_1-(s\theta_2+\delta_2)\big)^2 + (\theta_1-\theta_2)^2 \le 4\alpha^2 \quad \Leftrightarrow \quad \max_{|c|\le 2\tau,\,|d|\le 2\zeta} \frac{c^2}{8}(1-\alpha) \quad \text{s.t.}\ (sc+d)^2 + c^2 \le 4\alpha^2,$$

where $c = \theta_1-\theta_2$ and $d = \delta_1-\delta_2$. Recall $R_L \le \min\{1,\ 2\tau^2,\ \tfrac{1+4\zeta^2}{s^2+1}\}$. We first note that $c^2 \le 4\tau^2$; when $\tau$ is small enough that the constraint is inactive, taking $c = 2\tau$ and $\alpha \to 0$ gives $R_N \ge \tau^2/2 \ge \tfrac{1}{4}R_L$. In the following we look at the other cases, where the bound on $c^2$ is smaller. When $2\zeta \ge sc$, we set $d = -sc$ and $c^2 = 4\alpha^2$. Letting $\alpha = 2/3$, for large $\tau$ we get $\frac{c^2(1-\alpha)}{8} = \frac{2}{27} \ge \frac{2}{27}R_L$. When $2\zeta \le sc$, we set $d = -2\zeta$ and require $(sc-2\zeta)^2 + c^2 \le 4\alpha^2$. We have

$$(sc-2\zeta)^2 + c^2 = s^2c^2 + 4\zeta^2 - 4\zeta sc + c^2 \le s^2c^2 + 4\zeta^2 - 8\zeta^2 + c^2 = (s^2+1)c^2 - 4\zeta^2.$$

Therefore, setting $c^2 = \frac{4\alpha^2 + 4\zeta^2}{s^2+1}$ satisfies the original inequality. Again taking $\alpha = 2/3$, we have $c^2 \ge \frac{8}{9}\cdot\frac{1+4\zeta^2}{s^2+1} \ge \frac{8}{9}R_L$. Therefore in this case $R_N \ge \frac{2}{27}R_L$ as well.

C DISCUSSIONS ON RANDOM DESIGN UNDER COVARIATE SHIFT

In the main text, we present results where $X_S$ is viewed as fixed and $\Sigma_T$ is known. In this section, we view both source and target input data as random, and generalize the results of Section 3 to the setting where training is on finite observations and testing is on the (worst-case) population loss, under light-tail assumptions on the input data samples.
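The "detailed calculation" above minimizes a quadratic in $(a,b)$ and can be verified numerically by solving the first-order conditions (a sketch; the particular values of $\tau, \zeta, s$ are illustrative):

```python
import numpy as np

def upper_bound_min(tau, zeta, s):
    """Minimize f(a,b) = 2 tau^2 (a s + b - 1)^2 + (2 zeta^2 + 1) a^2 + b^2
    by solving the 2x2 linear system given by the first-order conditions."""
    M = np.array([[4 * tau**2 * s**2 + 2 * (2 * zeta**2 + 1), 4 * tau**2 * s],
                  [4 * tau**2 * s,                            4 * tau**2 + 2]])
    v = np.array([4 * tau**2 * s, 4 * tau**2])
    a, b = np.linalg.solve(M, v)
    return 2 * tau**2 * (a * s + b - 1)**2 + (2 * zeta**2 + 1) * a**2 + b**2

def closed_form(tau, zeta, s):
    # 2 tau^2 (2 zeta^2 + 1) / (2 tau^2 (s^2 + 2 zeta^2 + 1) + 2 zeta^2 + 1)
    num = 2 * tau**2 * (2 * zeta**2 + 1)
    den = 2 * tau**2 * (s**2 + 2 * zeta**2 + 1) + 2 * zeta**2 + 1
    return num / den

cases = [(0.5, 1.0, 2.0), (3.0, 0.1, 0.7), (1.0, 2.0, 0.0)]
for tau, zeta, s in cases:
    print(np.isclose(upper_bound_min(tau, zeta, s), closed_form(tau, zeta, s)))  # True
```

The numerical minimum agrees with the displayed closed form in every case, including the degenerate $s = 0$ setting.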

C.1 RANDOM DESIGN ON TARGET COVARIANCE MATRIX

In Section 3, we consider the case when $\Sigma_T$ is known exactly. This can be viewed as the fixed design setting, where training and testing are on the same set of data. In this section, our analysis accounts for the estimation error from observing only finitely many unlabeled target samples. Let $X_T = [x_1, \cdots, x_{n_U}]^\top \in \mathbb{R}^{n_U\times d}$ be $n_U$ data samples with $x_i \sim p_T$ (here $U$ stands for unlabeled data, to distinguish these from the $n_T$ labeled target samples); we use the unlabeled target samples to conduct estimation, and let $\hat\Sigma_T = X_T^\top X_T/n_U$. Let $\hat L_{\mathcal{B}}$ denote the worst-case excess risk measured on the observed target samples:

$$\hat L_{\mathcal{B}}(\hat\beta) = \max_{\beta^*\in\mathcal{B}}\ \mathbb{E}_{y_S}\ \frac{1}{n_U}\big\|X_T(\hat\beta(y_S) - \beta^*)\big\|^2.$$

To find the best linear estimator minimizing $\hat L_{\mathcal{B}}$, our proposed algorithm becomes:

$$\hat C \leftarrow \arg\min_{\tau, C}\ \Big\{ r^2\tau + \frac{\sigma^2}{n_S}\mathrm{Tr}\big(\hat\Sigma_T^{1/2} C \hat\Sigma_S^{-1} C^\top \hat\Sigma_T^{1/2}\big) \Big\} \quad \text{s.t.}\ (C-I)^\top\hat\Sigma_T(C-I) \preceq \tau I, \qquad (10)$$

and we set $\hat\beta = \hat C\,\hat\Sigma_S^{-1}X_S^\top y_S/n_S$. We want to show that, in spite of the estimation error due to the replacement of $\Sigma_T$ by $\hat\Sigma_T$, our generated $\hat\beta$ performs well on the worst-case population risk $L_{\mathcal{B}}(\hat\beta) := \max_{\beta^*}\mathbb{E}_{y_S}\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta(y_S) - \beta^*)\big)^2$ and achieves the minimax linear risk (up to a constant multiplicative factor). In this section we assume that the data distribution is light-tailed:

Definition C.1 ($\rho^2$-subgaussian distribution). We call a distribution $p$ $\rho^2$-subgaussian when the random vector $\bar x \sim \bar p$ is $\rho^2$-subgaussian, where $\bar p$ is the whitening of $p$: $x \sim p$ is equivalent to $x = \Sigma^{1/2}\bar x$ with $\bar x \sim \bar p$, and $\Sigma$ is the covariance matrix of $p$. Notice that the subgaussian parameter is defined on the whitened data, so $\rho$ does not depend on how large $\|\Sigma\|_{op}$ is.

Theorem C.2. Fix a failure probability $\delta \in (0,1)$. Suppose the target distribution $p_T$ is $\rho^2$-subgaussian, and the sample size in the target domain satisfies $n_U \gtrsim \rho^4(d + \log\frac{1}{\delta})$. Let $\hat\beta : y_S \mapsto \hat C\,\hat\Sigma_S^{-1}X_S^\top y_S/n_S$, where $\hat C$ is defined by Eqn. (10).
Then with probability at least $1-\delta$ over the unlabeled samples from the target domain, and for each fixed $X_S$ from the source domain, our learned estimator $\hat\beta(y_S)$ satisfies:

$$L_{\mathcal{B}}(\hat\beta) \le \Big(1 + O\big(\sqrt{\rho^4(d+\log(1/\delta))/n_U}\big)\Big) R_L(\mathcal{B}).$$

Specifically, when $\Sigma_T$ commutes with $\hat\Sigma_S$ or is rank one, we have:

$$L_{\mathcal{B}}(\hat\beta) \le \Big(1.25 + O\big(\sqrt{\rho^4(d+\log(1/\delta))/n_U}\big)\Big) R_N(\mathcal{B}).$$

Similarly, all other results in the paper can be extended to random design with finite samples $X_T$.

Proof of Theorem C.2. The proof relies on the two technical Claims C.3 and C.4. Let $\hat\beta_R$ be the optimal linear estimator for $L_{\mathcal{B}}$, i.e., $L_{\mathcal{B}}(\hat\beta_R) = \min_{\hat\beta\ \mathrm{linear\ in}\ y_S} L_{\mathcal{B}}(\hat\beta) = R_L(\mathcal{B})$. Writing $\epsilon = O\big(\sqrt{\rho^4(d+\log(1/\delta))/n_U}\big)$, we have:

$$L_{\mathcal{B}}(\hat\beta) \overset{\text{(Claim C.4)}}{\le} (1+\epsilon)\,\hat L_{\mathcal{B}}(\hat\beta) \overset{\text{(def. of }\hat\beta)}{\le} (1+\epsilon)\,\hat L_{\mathcal{B}}(\hat\beta_R) \overset{\text{(Claim C.4)}}{\le} (1+\epsilon)^2 L_{\mathcal{B}}(\hat\beta_R) = (1+O(\epsilon))\,R_L(\mathcal{B}).$$

In the main text and the subsection above, the worst-case excess risk is upper bounded by $1.25 R_N$, achieved relative to the best estimator that uses the same set of training data $(X_S, y_S)$. Here we would like to take into consideration the randomness of $X_S$, and compare the worst-case excess risk of our estimator against a stronger notion of linear estimator. For this purpose, we consider estimators that are linear functionals of $y_R := \Sigma_S^{1/2}\beta^* + z \in \mathbb{R}^d$, $z \sim \mathcal{N}(0, \frac{\sigma^2}{n_S}I_d)$ (this $\sigma^2/n_S$ is the correct scaling, since $X_S^\top X_S/n_S$ is comparable to $\Sigma_S$). We consider the minimax linear estimator with $y_R$ and with access to $\Sigma_S$, and we compare our estimator against this oracle linear estimator. This oracle is not computable in practice since $\Sigma_S$ must be estimated, but we will show that our estimator is within an absolute multiplicative constant in minimax risk of the oracle linear estimator.
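The sample-size requirement $n_U \gtrsim \rho^4(d+\log\frac{1}{\delta})$ reflects standard covariance concentration for subgaussian data: after whitening, the empirical covariance approximates the identity in operator norm at rate roughly $\sqrt{d/n_U}$. A quick illustration (a sketch with illustrative sizes; not code from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 20

def whitened_cov_error(n_U):
    # x = Sigma^{1/2} xbar with xbar ~ N(0, I_d); after whitening, the
    # relative operator-norm error is || Xbar^T Xbar / n_U - I ||_op.
    Z = rng.normal(size=(n_U, d))
    return np.linalg.norm(Z.T @ Z / n_U - np.eye(d), 2)

errs = {n: whitened_cov_error(n) for n in (100, 1000, 10000)}
print(errs)  # error shrinks roughly like sqrt(d / n_U)
```

Since the subgaussian parameter is defined on the whitened data (Definition C.1), this relative error is exactly the quantity that enters the $O(\cdot)$ factor in Theorem C.2.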
To recap the notation and setup:

$$\hat L_{\mathcal{B}}(\hat\beta) := \max_{\beta^*}\ \mathbb{E}_{y_S}\ \frac{1}{n_U}\big\|X_T(\hat\beta(y_S)-\beta^*)\big\|^2, \quad L_{\mathcal{B}}(\hat\beta) := \max_{\beta^*}\ \mathbb{E}_{y_S}\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta(y_S)-\beta^*)\big)^2, \quad L_{\mathcal{B},R}(\hat\beta) := \max_{\beta^*}\ \mathbb{E}_{y_R}\mathbb{E}_{x\sim p_T}\big(x^\top(\hat\beta(y_R)-\beta^*)\big)^2.$$

Our target is to find the best linear estimator for $\hat L_{\mathcal{B}}$ (trained with $X_T$) and prove that its performance on the population (worst-case) excess risk $L_{\mathcal{B}}(\hat\beta)$ is not much worse than the minimax linear risk trained on $y_R$ with access to $\Sigma_S$.

Theorem C.5. Fix a failure probability $\delta \in (0,1)$. Suppose both the target and source distributions $p_S$ and $p_T$ are $\rho^2$-subgaussian, and the sample sizes in the source and target domains satisfy $n_S, n_U \gtrsim \rho^4(d + \log\frac{1}{\delta})$. Let $\hat C$ be the solution of Eqn. (10), and set $\hat\beta(y_S) \leftarrow \hat C\,\hat\Sigma_S^{-1}X_S^\top y_S/n_S$. Then with probability at least $1-\delta$ over the unlabeled samples from the target domain and the labeled samples $X_S$ from the source domain, our estimator yields a worst-case expected excess risk within an absolute multiplicative constant of the minimax linear risk of the oracle estimator based on $y_R$ and $\Sigma_S$.

D MORE EMPIRICAL RESULTS

We include some more empirical studies. In the main text, our results use a small noise level; here we show some more results with larger noise, as well as the case of varied eigenspaces.

Setup. We choose $n_S = 2000$, $d = 50$. Let $X_S \in \mathbb{R}^{2000\times 50}$ be generated randomly from the Gaussian distribution $\mathcal{N}(0, \Sigma_S)$. We also generate a small validation dataset from the target domain: $X_{CV} \in \mathbb{R}^{500\times 50}$, sampled from $\mathcal{N}(0, \Sigma_T)$, with $y_{CV} = f^*(X_{CV}) + z_{CV}$, $z_{CV} \sim \mathcal{N}(0, \sigma^2 I)$. We choose $\lambda_i(\Sigma_S) \propto i$ and $\lambda_i(\Sigma_T) \propto 1/i$, and the eigenspaces of both $\Sigma_S$ and $\Sigma_T$ are random orthonormal matrices ($\|\Sigma_S\|_F^2 = \|\Sigma_T\|_F^2 = d$). The ground-truth model is a one-hidden-layer ReLU network, $f^*(x) = \frac{1}{d}a^\top(Wx)_+$, where $W$ and $a$ are randomly generated from the standard Gaussian distribution. We observe noisy labels $y_S = f^*(X_S) + z$, where $z_i \sim \mathcal{N}(0, \sigma^2)$.
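The setup just described can be sketched in NumPy as follows (a minimal reconstruction of the data-generating process; the hidden width `h` and the noise level `sigma` are not specified in the text and are chosen here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n_S, n_CV, d, sigma = 2000, 500, 50, 0.1
h = 64                                   # hidden width (illustrative)

def random_orthonormal(d):
    Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
    return Q

# Eigenvalues lambda_i(Sigma_S) ∝ i and lambda_i(Sigma_T) ∝ 1/i,
# normalized so that ||Sigma||_F^2 = d.
i = np.arange(1, d + 1)
lam_S = i / np.sqrt(np.sum(i**2) / d)
lam_T = (1.0 / i) / np.sqrt(np.sum((1.0 / i)**2) / d)
U_S, U_T = random_orthonormal(d), random_orthonormal(d)
Sigma_S = U_S @ np.diag(lam_S) @ U_S.T
Sigma_T = U_T @ np.diag(lam_T) @ U_T.T

# One-hidden-layer ReLU teacher: f*(x) = (1/d) a^T relu(W x).
W = rng.normal(size=(h, d))
a = rng.normal(size=h)
f_star = lambda X: np.maximum(X @ W.T, 0) @ a / d

X_S = rng.normal(size=(n_S, d)) @ np.linalg.cholesky(Sigma_S).T
X_CV = rng.normal(size=(n_CV, d)) @ np.linalg.cholesky(Sigma_T).T
y_S = f_star(X_S) + sigma * rng.normal(size=n_S)
y_CV = f_star(X_CV) + sigma * rng.normal(size=n_CV)

print(np.linalg.norm(Sigma_S, 'fro')**2, np.linalg.norm(Sigma_T, 'fro')**2)  # both ≈ d
```

The normalization makes the two covariances directly comparable in scale while giving them opposite eigenvalue decay, which is what produces the covariate shift in the experiments.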



Footnotes:
1. $A \in \mathbb{R}^{d\times n}$ may depend in an arbitrary way on $X_S$, $n_S$, or $\Sigma_T$; the estimator is linear in the observation $y_S$.
2. With samples $y_S$, a statistic $t = T(y_S)$ is sufficient for the underlying parameter $\beta^*$ if the conditional probability distribution of the data $y_S$, given the statistic $t = T(y_S)$, does not depend on the parameter $\beta^*$.
3. Throughout the paper, $\hat\Sigma_S^{-1}$ can be replaced by the pseudo-inverse, and our algorithm also applies when $n < d$.
4. Note that our method can also be order-wise better than ordinary least squares, which is the special case of ridge regression with $\lambda = 0$.
5. We leave this result to the appendix since performance appears invariant to this factor.
6. A random vector $x$ is called $\rho^2$-subgaussian if for any fixed unit vector $v$ of the same dimension, the random variable $v^\top x$ is $\rho^2$-subgaussian, i.e., $\mathbb{E}\big[e^{s\cdot v^\top(x-\mathbb{E}[x])}\big] \le e^{s^2\rho^2/2}$ for all $s \in \mathbb{R}$.



and $y_T = X_T\beta_T^* + z_T$. Denote by $\delta := \beta_S^* - \beta_T^*$ the model shift. We are interested in the minimax linear estimator when $\|\delta\| \le \gamma$ and $\|\beta_T^*\| \le r$. Thus our problem becomes finding the minimax estimator for $\beta_T^* \in \mathcal{B} = \{\beta : \|\beta\| \le r\}$ from $y_S, y_T$. Algorithm:

Figure 1: Performance comparisons. (a) The x-axis $\alpha$ defines the spread of the eigenspectrum of $\Sigma_S$: $s_i \propto 1/i^\alpha$, $t_i \propto 1/i$. (b) The x-axis is the normalized signal strength $\|\Sigma_T\beta^*\|/r$. (c) The x-axis is the model shift, measured by $\gamma/r$. Performance with standard error bars is from 40 runs.

A.2 OMITTED PROOF FOR NONCOMMUTING COVARIANCE MATRICES

Convex program. Our estimator for $\beta^*$ can be obtained through convex programming.

Proof of Proposition 3.3. First note that the objective function is quadratic in $C$ and linear in $\tau$; therefore we only need to prove that the constraint set $S = \{(C,\tau) \mid (C-I)^\top\Sigma_T(C-I) \preceq \tau I\}$ is convex. Take $(C_1,\tau_1), (C_2,\tau_2) \in S$, i.e., $(C_i-I)^\top\Sigma_T(C_i-I) \preceq \tau_i I$ for $i\in\{1,2\}$. We need to show that for any $\alpha\in[0,1]$, the convex combination $(C_\alpha, \tau_\alpha) := (\alpha C_1 + (1-\alpha)C_2,\ \alpha\tau_1 + (1-\alpha)\tau_2)$ also lies in $S$. This follows since for any vector $v$, the map $C \mapsto v^\top(C-I)^\top\Sigma_T(C-I)v = \|\Sigma_T^{1/2}(C-I)v\|^2$ is convex in $C$, so $v^\top(C_\alpha-I)^\top\Sigma_T(C_\alpha-I)v \le \alpha\tau_1\|v\|^2 + (1-\alpha)\tau_2\|v\|^2 = \tau_\alpha\|v\|^2$.
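The convexity of the constraint set $S$ can also be checked numerically: any convex combination of random feasible pairs remains feasible (a sketch; the sampling scheme and dimensions are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
A = rng.normal(size=(d, d))
Sigma_T = A @ A.T + np.eye(d)        # an arbitrary positive definite Sigma_T

def make_feasible():
    # Pick a random C and the smallest feasible tau for it:
    # tau = lambda_max((C - I)^T Sigma_T (C - I)).
    C = rng.normal(size=(d, d))
    M = (C - np.eye(d)).T @ Sigma_T @ (C - np.eye(d))
    return C, np.linalg.eigvalsh(M).max()

ok = True
for _ in range(20):
    (C1, t1), (C2, t2) = make_feasible(), make_feasible()
    alpha = rng.uniform()
    Ca = alpha * C1 + (1 - alpha) * C2
    ta = alpha * t1 + (1 - alpha) * t2
    Ma = (Ca - np.eye(d)).T @ Sigma_T @ (Ca - np.eye(d))
    # Feasibility: ta * I - Ma must be PSD (up to numerical tolerance).
    ok &= np.linalg.eigvalsh(ta * np.eye(d) - Ma).min() >= -1e-7
print(ok)   # True
```

Each sampled pair sits exactly on the boundary of $S$ (the chosen $\tau$ is the smallest feasible one), so the check exercises the nontrivial part of the constraint.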

Definition B.1 (Orthosymmetry). A set $\Theta$ is said to be solid and orthosymmetric if $\theta \in \Theta$ and $|\zeta_i| \le |\theta_i|$ for all $i$ implies that $\zeta \in \Theta$. If a solid, orthosymmetric $\Theta$ contains a point $\tau$, then it contains the entire hyperrectangle that $\tau$ defines: $\Theta(\tau) \equiv \{\theta : |\theta_i| \le \tau_i,\ \forall i\} \subset \Theta$.

Proof of Claim 4.1. First notice that any estimator $\hat\beta$ satisfies

Let $R_L(\Theta, \Delta, D)$ and $R_N(\Theta, \Delta, D)$ be the minimax linear risk and minimax risk, respectively, for estimating $\theta$ within the distribution class $\mathcal{P}_{\Theta,\Delta,D}$:

$$R_L(\Theta,\Delta,D) = \min_{\hat\theta\ \mathrm{linear}}\ \max_{P\in\mathcal{P}_{\Theta,\Delta,D}} r_P(\hat\theta), \qquad R_N(\Theta,\Delta,D) = \min_{\hat\theta}\ \max_{P\in\mathcal{P}_{\Theta,\Delta,D}} r_P(\hat\theta).$$

$r_{P_{\theta,\delta,D}}(\hat\theta_{A,B})$. Therefore all four terms have to be equal, thus finishing the proof.

Notice that $\Theta(\tau)$ and $\Delta(\zeta)$ are hyperrectangles in $\mathbb{R}^d$. Therefore we can decompose the problem into per-coordinate two-dimensional problems:

Proposition B.4. Under the same setting as Problem 1: a) $R_L(\Theta(\tau),\Delta(\zeta),D) = \sum_i R_L(\tau_i,\zeta_i,D_{ii})$; if $\hat\theta_{A,B}(x,y) = Ax+By$ is a minimax linear estimator over $\mathcal{P}_{\Theta(\tau),\Delta(\zeta),D}$, then necessarily $A, B$ must be diagonal. b) $R_N(\Theta(\tau),\Delta(\zeta),D) = \sum_i R_N(\tau_i,\zeta_i,D_{ii})$.

Proof of Proposition B.4.a. First recall our notation:

$$r_{P_{\theta,\delta,D}}(\hat\theta_{A,B}) = \mathbb{E}_{(x,y)\sim P_{\theta,\delta,D}} \big\|\hat\theta_{A,B}(x,y) - \theta\big\|^2 = \mathbb{E}_{x\sim\mathcal{N}(D\theta+\delta,I),\, y\sim\mathcal{N}(\theta,I)} \|Ax + By - \theta\|^2 = \|(AD+B-I)\theta + A\delta\|^2 + \mathrm{Tr}(AA^\top) + \mathrm{Tr}(BB^\top).$$

Our objective is $R_L(\Theta(\tau),\Delta(\zeta),D) := \min_{A,B}\ \max_{\theta\in\Theta(\tau),\,\delta\in\Delta(\zeta)} r_{P_{\theta,\delta,D}}(\hat\theta_{A,B})$.

$$
\begin{aligned}
&\mathbb{E}_{\pi(\bar\tau,\bar\zeta)} \|(AD+B-I)\theta + A\delta\|^2 + \mathrm{Tr}(AA^\top) + \mathrm{Tr}(BB^\top) \\
&= \mathrm{Tr}\big((AD+B-I)\,\mathbb{E}[\theta\theta^\top]\,(AD+B-I)^\top\big) + \mathrm{Tr}\big(A\,\mathbb{E}[\delta\delta^\top]\,A^\top\big) + 2\,\mathrm{Tr}\big((AD+B-I)\,\mathbb{E}[\theta\delta^\top]\,A^\top\big) + \mathrm{Tr}(AA^\top) + \mathrm{Tr}(BB^\top) \\
&= \mathrm{Tr}\big((AD+B-I)^\top(AD+B-I)\,\mathrm{diag}(\bar\tau^2)\big) + \mathrm{Tr}\big(A^\top A\,\mathrm{diag}(\bar\zeta^2)\big) + 2\,\mathrm{Tr}\big((AD+B-I)^\top A\,\mathrm{diag}(\bar\tau\bar\zeta)\big) + \mathrm{Tr}(AA^\top) + \mathrm{Tr}(BB^\top) \\
&= \sum_i \big\|(AD+B-I)_{:,i}\,\bar\tau_i + A_{:,i}\,\bar\zeta_i\big\|^2 + \mathrm{Tr}(AA^\top) + \mathrm{Tr}(BB^\top).
\end{aligned}
$$

$$\big[(as + b - 1)\theta + a\delta\big]^2 + a^2 + b^2$$

$R_{L,R}(\mathcal{B}) := \min_{\hat\beta\ \mathrm{linear\ in}\ y_R} L_{\mathcal{B},R}(\hat\beta)$.

Proof of Theorem C.5. For each matrix $C \in \mathbb{R}^{d\times d}$, we first conduct a bias–variance decomposition and rewrite each worst-case risk of the linear estimator in terms of the matrix $C$. When $\hat\beta(y_S) = C\,\hat\Sigma_S^{-1}X_S^\top y_S/n_S$, we have:

Figure 2: (a) The x-axis $\alpha$ defines the spread of the eigenspectrum of $\Sigma_S$: $s_i \propto 1/i^\alpha$, $t_i \propto 1/i$. (b) The x-axis is the normalized signal strength $\|\Sigma_T\beta^*\|/r$. (c) The x-axis is the covariate shift due to eigenspace shift, measured by $\|U_S - U_T\|_F$.

From Figure 2(c), we see no particular relationship between the performance of each algorithm and the eigenspace shift.

Figure 3: The x-axis is the noise level $\sigma$ and the y-axis is the excess risk (including approximation error).

(… a sufficient statistic is enough to achieve a best estimator). Consider the statistical problem of estimating $\beta^* \in \mathcal{B}$ from observations $y \in \mathcal{Y}$, with $\mathcal{B}$ $\ell_2$-compact. If $S(y)$ is a sufficient statistic for $\beta^*$, then a best estimator achieving $\min_{\hat\beta}\max_{\mathcal{B}} \ell(\hat\beta, \beta^*)$ is of the form $\hat\beta = f(S(y))$ for some function $f$, for any loss $\ell : \mathcal{Y} \to [0,\infty)$.


From Theorem 3.4, we know $R_L(\mathcal{B}) \le 1.25\,R_N(\mathcal{B})$ when $\Sigma_T$ is a rank-1 matrix or commutes with $\hat\Sigma_S$, which finishes the whole proof.

Claim C.3 (Restated Claim A.6 from Du et al. (2020)). Fix a failure probability $\delta \in (0,1)$, and assume $n \gtrsim \rho^4(d + \log(1/\delta))$. Then with probability at least $1 - \delta/10$ over the inputs $x_1, \ldots, x_n$, if $x_i \sim p$ and $p$ is a $\rho^2$-subgaussian distribution, the stated covariance concentration bound holds. (When the sample-size condition is not satisfied, the result still holds after replacing the $O(\cdot)$ factor accordingly; for cleaner presentation, we assume $n$ is large enough and simplify the results.)

With the help of Claim C.3, we directly get:

Claim C.4. Fix a failure probability $\delta \in (0,1)$, and assume $n_U \gtrsim \rho^4(d + \log(1/\delta))$. Then for any estimator $\hat\beta$, the stated comparison between $\hat L_{\mathcal{B}}(\hat\beta)$ and $L_{\mathcal{B}}(\hat\beta)$ holds with high probability $1 - \delta/10$ over the random samples $X_T$.

Proof of Claim C.4. Recall the definitions of $\hat L_{\mathcal{B}}$ and $L_{\mathcal{B}}$; for any estimator $\hat\beta$ the claimed inequality then follows, which finishes the proof.

Claim C.6. Fix a failure probability $\delta \in (0,1)$, and assume $n_U, n_S \gtrsim \rho^4(d + \log(1/\delta))$, with $X_S \in \mathbb{R}^{n_S\times d}$ and $X_T \in \mathbb{R}^{n_U\times d}$ drawn respectively from $p_S$ and $p_T$, which are both $\rho^2$-subgaussian. Then for any matrix $C \in \mathbb{R}^{d\times d}$, the first stated inequality holds with high probability $1 - \delta/10$ over the random samples $X_T$, and the second with high probability $1 - \delta/10$ over the random samples $X_S$.

Proof of Claim C.6. We omit the proof of the first inequality, since it is exactly the same as the proof of Claim C.4. For the second inequality, an analogous argument proves the right-hand side; the left-hand side follows with the same proof technique. Now let $\hat C$ be the minimizer of $l(C)$, and $C_R$ the minimizer of $l_R(C)$.

Estimating the weights $p_T(x)/p_S(x)$. Since the generated data samples are Gaussian, the absolute value of the density ratio can be computed in closed form. However, this absolute value has an exponential factor and can amplify the noise level. Meanwhile, when one multiplies both $X_S$ and $y_S$ by 10, the ground truth $\beta$ does not change, but the absolute value of $p_T(x)/p_S(x)$ changes drastically. This discrepancy highlights the importance of relative magnitudes (among samples) instead of absolute values Kanamori et al.
(2009). To obtain a relative score, we first estimate the absolute value of $p_T(x)/p_S(x)$ via $l(x) := x^\top(\hat\Sigma_S^{-1} - \hat\Sigma_T^{-1})x$. We then uniformly assign a weight to each sample from the 10 discrete values $1, 2, 3, \cdots, 10$ based on its score $l(x)$, and rescale the reweighting vector appropriately. We implement our method (Eqn. 4) using the estimated weights as above; refer to Figure 3 for the results. The baselines we choose are ordinary least squares ("OLS" in Figure 3), ridge regression ("Ridge"), and classical weighted least squares (Kanamori et al., 2009) ("Reweighting"; $\hat\beta_{LS}$ in our main text). For both ridge regression and our method, we tune hyperparameters through cross-validation. All results are presented from 40 runs, where the randomness comes from $f^*$ and the eigenspaces of $\Sigma_S$, $\Sigma_T$.
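The reweighting scheme described above can be sketched as follows (a minimal reconstruction; the decile binning into weights $1,\ldots,10$ and the final weighted least squares step follow the text, while the dimensions and the diagonal covariances are illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d, sigma = 1000, 10, 0.1
beta = rng.normal(size=d)

# Source / target covariances (illustrative diagonal choice).
lam_S = np.linspace(1.0, 2.0, d)
lam_T = np.linspace(2.0, 1.0, d)
X_S = rng.normal(size=(n, d)) * np.sqrt(lam_S)
y_S = X_S @ beta + sigma * rng.normal(size=n)

# Score l(x) = x^T (Sigma_S^{-1} - Sigma_T^{-1}) x, a monotone proxy
# for log p_T(x)/p_S(x) in the Gaussian case.
score = np.einsum('ij,j,ij->i', X_S, 1 / lam_S - 1 / lam_T, X_S)

# Bin the scores into deciles and assign weights 1..10, then rescale.
bins = np.quantile(score, np.linspace(0, 1, 11)[1:-1])
w = 1.0 + np.digitize(score, bins)       # integer weights in {1,...,10}
w = w * n / w.sum()                       # rescale to mean 1

# Weighted least squares with the binned weights.
XtWX = X_S.T @ (w[:, None] * X_S)
beta_hat = np.linalg.solve(XtWX, X_S.T @ (w * y_S))
print(np.linalg.norm(beta_hat - beta))
```

Binning into a small set of relative weight levels keeps the reweighting insensitive to the overall scale of the data, which is exactly the issue with using the absolute density ratio.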

