FIRST STEPS TOWARD UNDERSTANDING THE EXTRAPOLATION OF NONLINEAR MODELS TO UNSEEN DOMAINS

Abstract

Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of nonlinear models: under what conditions on the distributions and function class can models be guaranteed to extrapolate to new test distributions? The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models for structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or each subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form f(x) = Σ_i f_i(x_i), where f_i is an arbitrary function on the subset of features x_i, can extrapolate to unseen distributions if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.

1. INTRODUCTION

In real-world applications, machine learning models are often deployed on domains that are not seen at training time. For example, we may train machine learning models for medical diagnosis on data from hospitals in Europe and then deploy them to hospitals in Asia. Thus, we need to understand the extrapolation of models to new test distributions: how a model trained on one distribution behaves on another, unseen distribution. This extrapolation of neural networks is central to various robustness questions such as domain generalization (Gulrajani & Lopez-Paz (2020); Ganin et al. (2016); Peters et al. (2016) and references therein) and adversarial robustness (Goodfellow et al., 2014; Kurakin et al., 2018), and also plays a critical role in nonlinear bandits and reinforcement learning, where the distribution is constantly changing during training (Dong et al., 2021; Agarwal et al., 2019; Lattimore & Szepesvári, 2020; Sutton & Barto, 2018).

This paper focuses on the following mathematical abstraction of this extrapolation question:

Under what conditions on the source distribution P, target distribution Q, and function class F do we have that any functions f, g ∈ F that agree on P are also guaranteed to agree on Q? Here we measure the agreement of two functions on P by the ℓ_2 distance between f and g under the distribution P, that is, ‖f − g‖_P ≜ E_{x∼P}[(f(x) − g(x))^2]^{1/2}. The function f can be thought of as the learned model and g as the ground-truth function, so that ‖f − g‖_P is the error on the source distribution P.

This question is well understood for a linear function class F. Essentially, if the covariance of Q can be bounded from above by the covariance of P (in any direction), then the error on Q is guaranteed to be bounded by the error on P. We refer the reader to Lei et al. (2021); Mousavi Kalan et al. (2020) and references therein for recent advances along this line. By contrast, theoretical results on the extrapolation of nonlinear models are rather limited. Classical results have long settled the case where P and Q have bounded density ratios (Ben-David & Urner, 2014; Sugiyama et al., 2007). A bounded density ratio implies that the support of Q must be a subset of the support of P, and thus arguably these results do not capture the extrapolation behavior of models outside the training domain. Without the bounded density ratio assumption, there were few prior positive results characterizing the extrapolation power of neural networks. Ben-David et al. (2010) show that the model can extrapolate when the H∆H-distance between training and test distributions is small. However, it remains unclear for which distributions and function classes the H∆H-distance can be bounded.¹ In general, the question is challenging partly because of the existence of a strong impossibility result.
As soon as the support of Q is not contained in the support of P (and they satisfy some non-degeneracy condition), it turns out that even two-layer neural networks cannot extrapolate: there are two-layer neural networks f and g that agree on P perfectly but behave very differently on Q (see Proposition 5 for a formal statement). The impossibility result suggests that any positive result on the extrapolation of nonlinear models requires more fine-grained structure in the relationship between P and Q (which is common in practice; Koh et al. (2021); Sagawa et al. (2022)) as well as in the function class F. The structure in the domain shift between P and Q may also need to be compatible with the assumption on the function class F.

This paper makes some first steps towards proving that certain families of nonlinear models can extrapolate to a new test domain with structured shift. We consider a setting where the joint distribution of the data does not have much overlap across P and Q (and thus the bounded density ratio assumption does not hold), whereas the marginal distributions of each coordinate of the data do overlap. Such a scenario may practically happen when the features (coordinates of the data) exhibit different correlations on the source and target distributions. For example, consider the task of predicting the probability of a lightning storm from basic meteorological information such as precipitation, temperature, etc. We learn models from some cities on the west coast of the United States and deploy them to the east coast. In this case, the joint test distribution of the features may not necessarily have much overlap with the training distribution: the correlation between precipitation and temperature could be vastly different across regions, e.g., the rainy season coincides with the winter's low temperature on the west coast, but not so much on the east coast.
However, each individual feature's marginal distribution is much more likely to overlap between the source and target: the possible ranges of temperature on the east and west coasts are similar. Concretely, we assume that the features x ∈ R^{s_1+s_2} have Gaussian distributions and can be divided into two subsets x_1 ∈ R^{s_1} and x_2 ∈ R^{s_2} such that each subset of features x_i (i ∈ {1, 2}) has the same marginal distribution on P and Q. Moreover, we assume that x_1 and x_2 are not exactly correlated on P: the covariance of the features x on distribution P has a strictly positive minimum eigenvalue. As argued before, restrictive assumptions on the function class F are still necessary (for almost any P and Q without the bounded density ratio property). Here, we assume that F consists of all functions of the form f(x) = f_1(x_1) + f_2(x_2) for arbitrary functions f_1 : R^{s_1} → R and f_2 : R^{s_2} → R. The function class F does not contain all two-layer neural networks (so the impossibility result does not apply), but still consists of a rich set of functions where each subset of features independently contributes to the prediction through an arbitrary nonlinear transformation. We show that under these assumptions, if any two models approximately agree on P, they must also approximately agree on Q; formally, ∀f, g ∈ F, ‖f − g‖_Q ≲ ‖f − g‖_P (Theorem 4).

We also prove a variant of the result above where we divide the feature vector x ∈ R^d into d coordinates, denoted by x = (x_1, . . . , x_d) where x_i ∈ R. The function class consists of all combinations of nonlinear transformations of the x_i's, that is, F = {Σ_{i=1}^d f_i(x_i)}. Assuming the coordinates of x are pairwise Gaussian with a non-degenerate covariance matrix, the nonlinear model f ∈ F can extrapolate to any distribution Q that has the same marginals as P (Theorem 3). These results can be viewed as first steps in analyzing extrapolation beyond linear models. Compared with the work of Lei et al.
(2021) on linear models, our assumptions on the covariance of P are qualitatively similar. We additionally require P and Q to have overlapping marginal distributions because this is necessary even for the extrapolation of one-dimensional functions on a single coordinate. Our results work for a more expressive family of nonlinear functions, namely F = {Σ_{i=1}^d f_i(x_i)}, than the linear functions.

We also present a result for the case where the x_i's are discrete variables, which demonstrates the key intuition and may be of independent interest. Suppose we have two discrete random variables x_1 and x_2. In this case, the joint distributions P and Q can both be represented by matrices (as visualized in Figure 1), and the marginal distributions are the column and row sums of this joint probability matrix. We prove that extrapolation occurs when (1) the support of the marginals of Q is covered by P, and (2) the density matrix of P is non-clusterable: we cannot find a subset of rows and columns such that the support of P in these rows and columns lies entirely within their intersections. In Figure 1, we visualize a few interesting cases. First, the distributions P_1, P_2 visualized in Figures 1a and 1b, respectively, do not satisfy our conditions. Fundamentally, there are two models that agree on the support of distribution P_1 (or P_2) but still differ substantially off the support. In contrast, our result predicts that models trained on distribution P_3 in Figure 1c can extrapolate to the distribution Q in Figure 1d, despite the fact that the support of P_3 is sparse and the supports of P_3 and Q have very little overlap. We also note that the failures of P_1 and P_2 demonstrate the non-triviality of our results: the overlapping marginal assumption by itself does not guarantee extrapolation, and condition (2), or analogously the minimum eigenvalue condition in the Gaussian case, is critical for extrapolation.
Our proof technique for the theorems above is to view ‖h‖²_P as K_P(h, h) for some kernel K_P : F × F → R, where h is a shorthand for the error function f − g. Note that the kernel takes two functions in F as inputs and captures the relationship between them. Hence, the extrapolation of F (i.e., proving ‖h‖_Q ≲ ‖h‖_P for all h ∈ F) reduces to the relationship between the kernels (i.e., whether K_Q(h, h) ≲ K_P(h, h) for all h ∈ F), which is in turn governed by properties of the eigenspaces of the kernels K_P, K_Q. Thanks to the special structure of our model class F = {Σ_i f_i(x_i)}, we can analytically relate the eigenspaces of the kernels K_P, K_Q to more explicit and interpretable properties of the data distributions P and Q.

2. PROBLEM SETUP

We use P and Q to denote the source and target distributions over the space of features X = X_1 × · · · × X_k, respectively. We measure the extrapolation of a model class F ⊆ R^X from P to Q by the F-restricted error ratio, or F-RER for short:²

τ(P, Q, F) ≜ sup_{f,g∈F} E_Q[(f(x) − g(x))^2] / E_P[(f(x) − g(x))^2].   (1)

When τ(P, Q, F) is small, if two models f, g ∈ F approximately agree on P (meaning that E_P[(f(x) − g(x))^2] is small), they must approximately agree on Q because E_Q[(f(x) − g(x))^2] ≤ τ(P, Q, F) E_P[(f(x) − g(x))^2]. The F-restricted error ratio monotonically increases as we enlarge the model class F, and eventually τ(P, Q, F) becomes the density ratio between Q and P if F contains all functions with bounded output. To go beyond the bounded density ratio assumption, in this paper we focus on the structured model class F = {Σ_{i=1}^k f_i(x_i) : E_P[f_i(x_i)^2] < ∞, ∀i ∈ [k]}, where f_i : X_i → R is an arbitrary function. Since F is closed under subtraction, we can simplify Eq. (1) to τ(P, Q, F) = sup_{f∈F} E_Q[f(x)^2] / E_P[f(x)^2]. For simplicity, we omit the dependency on P, Q, F when the context is clear.

If the model class F includes the ground-truth labeling function, τ(P, Q, F) upper bounds the ratio between the error on distribution Q and the error on distribution P (formally stated in Proposition 1), which provides a robustness guarantee for the trained model. This is because when g corresponds to the ground-truth label, E_P[(f(x) − g(x))^2] becomes the ℓ_2 error of the model f.

Proposition 1. Let τ be the F-RER defined in Eq. (1). For any distributions P, Q and model class F, if there exists a model f* ∈ F that can represent the true labeling function y : X → R on both P and Q, in the sense that E_{(P+Q)/2}[(y(x) − f*(x))^2] ≤ ε_F, then we have ∀f ∈ F, E_Q[(y(x) − f(x))^2] ≤ (8τ + 4) ε_F + 4τ E_P[(y(x) − f(x))^2].

The proof of this proposition is deferred to Appendix A.1.

Relationship to the H∆H-distance.
Compared with the H∆H-distance (Ben-David et al., 2010), d_{H∆H}(P, Q) = 2 sup_{f,g∈F} |Pr_{x∼P}[f(x) ≠ g(x)] − Pr_{x∼Q}[f(x) ≠ g(x)]|, our differences are: (1) we consider the ℓ_2 loss instead of the classification loss, and (2) τ focuses on the ratio of losses whereas d_{H∆H} focuses on their absolute difference. As we will see later, these differences bring the mathematical simplicity needed to prove concrete conditions for model extrapolation.

Additional notations. Let I[E] be the indicator function that equals 1 if the condition E is true and 0 otherwise. For an integer n, let [n] be the set {1, 2, · · · , n}. For a vector x ∈ R^d, we use [x]_i to denote its i-th coordinate. Similarly, [M]_{i,j} denotes the (i, j)-th entry of a matrix M. We use M^n to denote the element-wise n-th power of the matrix M (i.e., [M^n]_{i,j} = ([M]_{i,j})^n). Let I_d ∈ R^{d×d} be the identity matrix, 1_d ∈ R^d the all-1 vector, and e_{i,d} the i-th standard basis vector. We omit the subscript d when the context is clear. For a square matrix P ∈ R^{d×d}, we use diag(P) ∈ R^{d×d} to denote the matrix obtained by masking out all off-diagonal entries of P. For a list σ_1, · · · , σ_d, let diag({σ_1, · · · , σ_d}) ∈ R^{d×d} be the diagonal matrix whose diagonal entries are σ_1, · · · , σ_d. For a symmetric matrix M ∈ R^{d×d}, let λ_1(M) ≤ λ_2(M) ≤ · · · ≤ λ_d(M) be its eigenvalues in ascending order, and λ_max(M), λ_min(M) the maximum and minimum eigenvalues, respectively. Similarly, we use σ_1(M), · · · , σ_{min(d_1,d_2)}(M) to denote the singular values of M ∈ R^{d_1×d_2}.
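As a sanity check on the definition, the ratio inside the supremum in Eq. (1) can be estimated by Monte Carlo for any fixed pair of models. A minimal sketch (the distributions, models, and sample size below are hypothetical choices for illustration): P is a strongly correlated bivariate Gaussian, Q has the same standard-normal marginals but independent coordinates, and f, g are additive models that nearly agree on P.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Source P: strongly correlated Gaussian features.
# Target Q: same standard-normal marginals, but independent coordinates
# (a shift in the joint distribution, not in the marginals).
cov_P = np.array([[1.0, 0.9], [0.9, 1.0]])
xP = rng.multivariate_normal([0.0, 0.0], cov_P, size=n)
xQ = rng.normal(size=(n, 2))

# Two additive models f(x) = f1(x1) + f2(x2) that nearly agree on P:
# their difference is 0.05*(x1 - x2), which has small variance under P.
f = lambda x: np.sin(x[:, 0]) + 0.5 * x[:, 1]
g = lambda x: np.sin(x[:, 0]) + 0.5 * x[:, 1] + 0.05 * (x[:, 0] - x[:, 1])

err_P = np.mean((f(xP) - g(xP)) ** 2)  # ~0.0025 * 0.2
err_Q = np.mean((f(xQ) - g(xQ)) ** 2)  # ~0.0025 * 2.0
ratio = err_Q / err_P                  # Monte Carlo lower bound on tau(P, Q, F)
```

Here the ratio comes out around 10, so the particular pair (f, g) already certifies τ(P, Q, F) ≳ 10 even though the marginals of P and Q match exactly.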

3. MAIN RESULTS

In this section, we present our main results. Section 3.1 discusses the case where the features take discrete values. In Sections 3.2 and 3.3, we extend our analysis to two other settings with real-valued features.

3.1. FEATURES WITH DISCRETE VALUES

For better exposition, we discuss the case x = (x_1, x_2) here and defer the general case to Appendix A.5. We assume that x_i takes values in {1, 2, · · · , r_i} for i ∈ [2]. Hence, the density of distribution P can be written as a matrix of dimension r_1 × r_2. We measure (approximate) clusterability by the eigenvalues of the Laplacian matrix of a bipartite graph associated with the density matrix P ∈ R^{r_1×r_2}, which is known to capture the clustering structure of the graph (Chung, 1996; Alon, 1986). Let G_P be a weighted bipartite graph whose adjacency matrix equals the density matrix P ∈ R^{r_1×r_2}; concretely, U = {u_1, · · · , u_{r_1}} and V = {v_1, · · · , v_{r_2}} are the sets of vertices, and the weight between u_i and v_j is P(x_1 = i, x_2 = j). To define the signless Laplacian of G_P, let d_1 ∈ R^{r_1} and d_2 ∈ R^{r_2} be the row and column sums of the weight matrix P (in other words, the degrees of the vertices), and D_1 = diag(d_1) ∈ R^{r_1×r_1}, D_2 = diag(d_2) ∈ R^{r_2×r_2} the diagonal matrices induced by d_1, d_2, respectively. The signless Laplacian K_P and normalized signless Laplacian K̃_P are:

K_P = [[D_1, P], [P^⊤, D_2]],   K̃_P = diag(K_P)^{−1/2} K_P diag(K_P)^{−1/2}.   (5)

Compared with the standard Laplacian matrix, the off-diagonal entries of the signless Laplacian K_P are positive and equal to the absolute values of the corresponding entries of the standard Laplacian matrix. In the following theorem, we bound the F-RER from above by the eigenvalues of K̃_P and the density ratio of the marginal distributions of Q and P.

Theorem 2. For any distributions P, Q over discrete random variables x_1, x_2, and the model class F = {f_1(x_1) + f_2(x_2)} where f_i : X_i → R is an arbitrary function, the F-RER can be bounded above by

τ(P, Q, F) ≤ 2 λ_2(K̃_P)^{−1} max_{i∈[2], t∈[r_i]} Q(x_i = t) / P(x_i = t).

Compared with prior works that assume a bounded density ratio on the entire distribution (e.g., Ben-David & Urner (2014); Sugiyama et al.
(2007)), we only require a bounded density ratio between the marginal distributions. In other words, the model class f(x) = f_1(x_1) + f_2(x_2) can extrapolate to distributions with a larger support (see Figure 1c). In contrast, an unstructured model (i.e., f(x) is an arbitrary function of the entire input x) can behave arbitrarily on data points outside the support of P. Qualitatively, Theorem 2 proves sufficient conditions under which the structured model class can extrapolate to unseen distributions (as visualized in Figure 1). In particular, Theorem 2 implies that for non-trivial extrapolation, that is, τ(P, Q, F) < ∞, we need (a) max_{i∈[2], t∈[r_i]} Q(x_i = t)/P(x_i = t) < ∞ (i.e., the support of the marginals of Q is covered by P), and (b) λ_2(K̃_P) > 0. To interpret condition (b), note that Cheeger's inequality implies that λ_2(K̃_P) > 0 if and only if the bipartite graph G_P is connected (Chung, 1996; Alon, 1986)³, that is, there do not exist non-empty strict subsets of vertices U′ ⊂ U, V′ ⊂ V such that P(x_1 ∈ U′, x_2 ∉ V′) = 0 and P(x_1 ∉ U′, x_2 ∈ V′) = 0. Equivalently, we cannot shuffle the rows and columns of P to form a block-diagonal matrix where each block is a strict sub-matrix of P. In other words, the density matrix P is non-clusterable, as discussed in Section 1.

Proof sketch of Theorem 2. In the following we present a proof sketch of Theorem 2. We start with a high-level proof strategy and then instantiate it in the setting of Theorem 2. Suppose we can find a set of (not necessarily orthogonal) basis functions {b_1, · · · , b_r} with b_i : X → R such that any model f ∈ F can be represented as a linear combination of the basis, that is, f = Σ_{i=1}^r v_i b_i. Since the model family F is closed under subtraction, we have

τ(P, Q, F) = sup_{f∈F} ‖f‖²_Q / ‖f‖²_P = sup_{v∈R^r} E_Q[(Σ_{i=1}^r v_i b_i(x))^2] / E_P[(Σ_{i=1}^r v_i b_i(x))^2] = sup_{v∈R^r} (Σ_{i,j=1}^r [v]_i [v]_j E_Q[b_i(x) b_j(x)]) / (Σ_{i,j=1}^r [v]_i [v]_j E_P[b_i(x) b_j(x)]).   (7)
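Condition (b) can be checked numerically. The sketch below (a hypothetical 4 × 4 example in the spirit of Figure 1) builds the normalized signless Laplacian of Eq. (5) from a density matrix and computes λ_2: a block-diagonal (clusterable) density gives λ_2 = 0, while a sparse but connected one gives λ_2 > 0.

```python
import numpy as np

def lambda2_signless(P):
    """lambda_2 of the normalized signless Laplacian K~_P (Eq. 5) of the
    bipartite graph whose weight matrix is the density matrix P."""
    d1, d2 = P.sum(axis=1), P.sum(axis=0)
    K = np.block([[np.diag(d1), P], [P.T, np.diag(d2)]])
    Dh = np.diag(np.concatenate([d1, d2]) ** -0.5)
    return np.sort(np.linalg.eigvalsh(Dh @ K @ Dh))[1]

# Clusterable density (block-diagonal, like P1 in Figure 1a):
# the bipartite graph is disconnected.
P_block = np.kron(np.eye(2), np.full((2, 2), 0.125))

# Sparse but non-clusterable density (in the spirit of P3 in Figure 1c):
# support on the diagonal and one shifted diagonal, forming a single cycle.
P_cycle = np.zeros((4, 4))
for i in range(4):
    P_cycle[i, i] = 0.125
    P_cycle[i, (i + 1) % 4] = 0.125

print(lambda2_signless(P_block))  # ~0: extrapolation can fail
print(lambda2_signless(P_cycle))  # > 0: Theorem 2 gives a finite bound
```

For P_cycle the associated bipartite graph is a single 8-cycle with equal weights, so λ_2 = 1 − cos(π/4) ≈ 0.29 can also be verified by hand.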
If we define the kernel matrices [K_P]_{i,j} = E_P[b_i(x) b_j(x)] and [K_Q]_{i,j} = E_Q[b_i(x) b_j(x)] (we use the same notation for the kernel matrix and the signless Laplacian because, with a specific choice of basis b_i, the kernels K_P, K_Q equal the signless Laplacians of the bipartite graphs G_P, G_Q), it follows that

sup_{v∈R^r} (Σ_{i,j=1}^r v_i v_j E_Q[b_i(x) b_j(x)]) / (Σ_{i,j=1}^r v_i v_j E_P[b_i(x) b_j(x)]) = sup_{v∈R^r} (v^⊤ K_Q v) / (v^⊤ K_P v).   (8)

Hence, upper bounding τ(P, Q, F) reduces to bounding the eigenvalues of the kernel matrices K_P, K_Q. Since the model has the structure f(x) = f_1(x_1) + f_2(x_2), we can construct the basis explicitly. For any i ∈ [2], t ∈ [r_i], with a slight abuse of notation, let b_{i,t}(x) = I[x_i = t]. One can verify that the set {b_{i,t}}_{i∈[2], t∈[r_i]} is indeed a complete basis. As a result, the kernel matrix K_P can be computed directly from its definition:

E_P[b_{i,t}(x) b_{j,s}(x)] = P(x_i = t, x_j = s) when i ≠ j, and P(x_i = t) I[s = t] when i = j,

which is exactly the signless Laplacian defined in Eq. (5). To prove the theorem, we need to upper bound the eigenvalues of K_Q. Since the eigenvalues of the normalized signless Laplacian K̃_Q are universally upper bounded by 2 for every distribution Q, we first write K_P, K_Q in terms of K̃_P, K̃_Q. Formally, let D_P = diag(K_P) and D_Q = diag(K_Q); then

sup_{v∈R^r} (v^⊤ K_Q v) / (v^⊤ K_P v) = sup_{v∈R^r} (v^⊤ D_Q^{1/2} K̃_Q D_Q^{1/2} v) / (v^⊤ D_P^{1/2} K̃_P D_P^{1/2} v) ≤ (λ_max(K̃_Q) / λ_min(K̃_P)) sup_{v∈R^r} ‖D_Q^{1/2} v‖²_2 / ‖D_P^{1/2} v‖²_2.   (10)

However, this naive bound is vacuous because λ_min(K̃_P) = 0 for every P. In fact, K_P and K_Q share the eigenvalue 0 and the corresponding eigenvector u ∈ R^{r_1+r_2} with [u]_t = (−1)^{I[t>r_1]}. Therefore we can restrict to the subspace orthogonal to the direction u, and then λ_min(K̃_P) becomes λ_2(K̃_P) in Eq. (10).
Finally, by basic algebra we also have λ_max(K̃_Q) ≤ 2 and sup_{v∈R^r} ‖D_Q^{1/2} v‖²_2 / ‖D_P^{1/2} v‖²_2 ≤ max_{i∈[2], t∈[r_i]} Q(x_i = t)/P(x_i = t), which completes the proof sketch. The full proof of Theorem 2 is deferred to Appendix A.3.
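The reduction in the proof sketch is concrete enough to verify numerically. The sketch below (with a hypothetical sparse source P and a uniform target Q that shares its marginals) computes the exact F-RER as the supremum of the Rayleigh quotient v^⊤ K_Q v / v^⊤ K_P v over the range of K_P, and compares it against the bound of Theorem 2.

```python
import numpy as np

def signless_laplacian(P):
    d1, d2 = P.sum(axis=1), P.sum(axis=0)
    return np.block([[np.diag(d1), P], [P.T, np.diag(d2)]])

# Hypothetical source P: sparse cyclic support; target Q: uniform on all
# 16 cells. Both have uniform marginals, so the marginal density ratio is 1.
P = np.zeros((4, 4))
for i in range(4):
    P[i, i] = 0.125
    P[i, (i + 1) % 4] = 0.125
Q = np.full((4, 4), 1 / 16)

KP, KQ = signless_laplacian(P), signless_laplacian(Q)

# tau = sup_v v'K_Q v / v'K_P v, restricted to the range of K_P
# (the shared null direction u = (1,...,1,-1,...,-1) satisfies K_Q u = 0).
w, V = np.linalg.eigh(KP)
keep = w > 1e-10
B = V[:, keep] / np.sqrt(w[keep])          # K_P^{-1/2} on its range
tau = np.linalg.eigvalsh(B.T @ KQ @ B).max()

# Theorem 2's bound: 2 / lambda_2(K~_P) times the marginal ratio (= 1 here).
Dh = np.diag(np.diag(KP) ** -0.5)
lam2 = np.sort(np.linalg.eigvalsh(Dh @ KP @ Dh))[1]
bound = 2 / lam2                            # tau ~ 3.41, bound ~ 6.83 here
```

So even though supp(P) covers only half the cells of supp(Q), the additive model class provably transfers with at most a constant-factor blow-up in error.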

3.2. FEATURES WITH REAL VALUES

In this section we extend our analysis to the case where x_1, x_2, · · · , x_d are real-valued random variables. Recall that our model has the structure f(x) = Σ_{i=1}^d f_i(x_i), where f_i is an arbitrary one-dimensional function. When d = 2, we can view this setting as a direct extension of the setting in Section 3.1 where each x_i takes infinitely many possible values (instead of finitely many), and thus the Laplacian "matrix" becomes infinite-dimensional. Nonetheless, we can still bound the F-RER from above, as stated in the following theorem.

Theorem 3. For any distributions P, Q over variables x = (x_1, · · · , x_d) with matching marginals, assume that (x_i, x_j) has a two-dimensional Gaussian distribution for every i, j ∈ [d]. Let x̃ = (x̃_1, · · · , x̃_d) be the normalized input, where x̃_i ≜ (x_i − E_P[x_i]) Var(x_i)^{−1/2} has zero mean and unit variance for every i ∈ [d], and let Σ̃_P ≜ E_P[x̃ x̃^⊤] be the covariance matrix of x̃. Then τ(P, Q, F) ≤ d / λ_min(Σ̃_P).

For better exposition, we first focus on the case where every x_i has zero mean and unit variance, hence x̃ = x and Σ̃_P = Σ_P ≜ E_P[x x^⊤]. Compared with linear models, Theorem 3 proves that the structured nonlinear model class f(x) = Σ_{i=1}^d f_i(x_i) can extrapolate under similar conditions: for linear models F_linear ≜ {v^⊤ x : v ∈ R^d} we have

τ(P, Q, F_linear) = sup_{v∈R^d} ‖v^⊤ x‖²_Q / ‖v^⊤ x‖²_P = sup_{v∈R^d} (v^⊤ E_Q[x x^⊤] v) / (v^⊤ E_P[x x^⊤] v) ≤ λ_max(Σ_Q) λ_min(Σ_P)^{−1} = λ_max(Σ̃_Q) λ_min(Σ̃_P)^{−1},

which equals the bound in Theorem 3 up to factors that do not depend on the covariances Σ_P, Σ_Q. We emphasize that we only assume that the marginal on every pair of features (x_i, x_j) is Gaussian, which does not imply Gaussianity of the joint distribution of x. In fact, there exist non-Gaussian distributions that satisfy our assumption.

Proof sketch of Theorem 3. On a high level, we treat the features x_i as discrete random variables and follow the same proof strategy as in Theorem 2. For better exposition, we first assume that x_i has zero mean and unit variance for every i ∈ [d], hence Σ̃_P = Σ_P ≜ E_P[x x^⊤].
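The linear-model baseline in the comparison above has a closed form: sup_v (v^⊤ Σ_Q v)/(v^⊤ Σ_P v) = λ_max(Σ_P^{−1/2} Σ_Q Σ_P^{−1/2}). A minimal sketch with hypothetical covariances (correlation ρ = 0.9 on P, independent coordinates on Q, both with unit marginal variances, so the bound 1/λ_min(Σ_P) is attained exactly):

```python
import numpy as np

rho = 0.9
Sigma_P = np.array([[1.0, rho], [rho, 1.0]])
Sigma_Q = np.eye(2)  # same unit marginal variances, zero correlation

# sup_v (v' Sigma_Q v)/(v' Sigma_P v) = lambda_max(Sigma_P^{-1/2} Sigma_Q Sigma_P^{-1/2})
w, V = np.linalg.eigh(Sigma_P)
P_inv_half = V @ np.diag(w ** -0.5) @ V.T
tau_linear = np.linalg.eigvalsh(P_inv_half @ Sigma_Q @ P_inv_half).max()
# Since Sigma_Q = I, this equals 1/lambda_min(Sigma_P) = 1/(1 - rho) = 10.
```

The worst-case direction is v ∝ (1, −1): it has variance 1 − ρ = 0.1 under P but variance 1 under Q, exactly the anticorrelated direction that P barely explores.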
First we consider the simplified case d = 2. Because x_1, x_2 are continuous, the normalized signless Laplacian K̃_P is infinite-dimensional and has the form K̃_P = [[I, A], [A^⊤, I]], where A is an infinite-dimensional "matrix" indexed by real numbers x_1, x_2 ∈ R with entries [A]_{x_1,x_2} = P(x_1, x_2)/√(P(x_1) P(x_2)), and I is the identity "matrix". Recall that the proof of Theorem 2 gives

τ(P, Q, F) ≤ 2 λ_2(K̃_P)^{−1} max_{i∈[2], t} Q(x_i = t) / P(x_i = t).

By the assumption that P, Q have matching marginals, we get max_{i∈[2], t} Q(x_i = t)/P(x_i = t) = 1. As a result, we only need to lower bound the second smallest eigenvalue of K̃_P. To this end, we first write A in its singular value decomposition A = U Λ V^⊤, where U^⊤ U = I, V^⊤ V = I, and Λ = diag({σ_n}_{n≥0}) with σ_0 ≥ σ_1 ≥ · · · . Then we get

K̃_P = [[I, A], [A^⊤, I]] = [[U, 0], [0, V]] [[I, Λ], [Λ, I]] [[U, 0], [0, V]]^⊤.

Since the matrix [[I, Λ], [Λ, I]] consists of four diagonal sub-matrices, we can shuffle its rows and columns to form a block-diagonal matrix with 2 × 2 blocks [[1, σ_n], [σ_n, 1]] for n = 0, 1, 2, · · · . As a result, the eigenvalues of K̃_P are 1 ± σ_0, 1 ± σ_1, · · · . Because 1 = σ_0 ≥ σ_1 ≥ · · · ≥ 0, the smallest and second smallest eigenvalues of K̃_P are 1 − σ_0 = 0 and 1 − σ_1, respectively, meaning that λ_2(K̃_P) = 1 − σ_1. By the assumption that (x_1, x_2) follows a Gaussian distribution, the "matrix" A is a Gaussian kernel, whose eigenvalues and eigenfunctions can be computed analytically: Theorem 11 proves that σ_1 = |E_P[x_1 x_2]| when x_1, x_2 have zero mean and unit variance. Consequently, λ_2(K̃_P) = 1 − σ_1 = 1 − |E_P[x_1 x_2]| = λ_min(Σ_P).

Now we briefly discuss the case d = 3; the general case d > 3 is proved similarly. When d = 3, the normalized kernel has the form K̃_P = [[I, A, B], [A^⊤, I, C], [B^⊤, C^⊤, I]].
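The singular values σ_n = |ρ|^n of the Gaussian kernel A (used above via Theorem 11; they follow from Mehler's expansion of the bivariate Gaussian density) can be spot-checked numerically by discretizing the operator on a grid. A Nyström-style sketch; the grid and the correlation value ρ = 0.5 are arbitrary illustrative choices:

```python
import numpy as np

rho = 0.5
xs = np.linspace(-6.0, 6.0, 481)
h = xs[1] - xs[0]

phi = lambda x: np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)  # N(0,1) density

# Joint density of a standard bivariate Gaussian with correlation rho.
X, Y = np.meshgrid(xs, xs, indexing="ij")
joint = np.exp(-(X ** 2 - 2 * rho * X * Y + Y ** 2) / (2 * (1 - rho ** 2))) \
        / (2 * np.pi * np.sqrt(1 - rho ** 2))

# Discretization of the operator [A]_{x1,x2} = P(x1,x2)/sqrt(P(x1)P(x2)):
# multiply by the grid spacing h so the matrix approximates the L^2 operator.
A = joint / np.sqrt(np.outer(phi(xs), phi(xs))) * h
s = np.linalg.svd(A, compute_uv=False)
# Expect sigma_n ~ rho^n: s[0] ~ 1, s[1] ~ 0.5, s[2] ~ 0.25, ...
```

Consistently with the proof sketch, the second singular value matches |E_P[x_1 x_2]| = ρ, so λ_2(K̃_P) = 1 − ρ = λ_min(Σ_P) in this example.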
By the assumption that x_1, x_2 have zero mean and unit variance with a joint Gaussian distribution, the matrix A is symmetric because [A]_{x_1,x_2} = P(x_1, x_2)/√(P(x_1) P(x_2)) = P(x_2, x_1)/√(P(x_2) P(x_1)) = [A]_{x_2,x_1}. Similarly, the matrices B, C are symmetric. In addition, Theorem 11 shows that the eigenfunctions of the Gaussian kernel do not depend on the value of E_P[x_i x_j]. Hence, the matrices A, B, C share the same eigenspace and can be diagonalized simultaneously:

K̃_P = [[I, A, B], [A, I, C], [B, C, I]] = [[U, 0, 0], [0, U, 0], [0, 0, U]] [[I, Λ_A, Λ_B], [Λ_A, I, Λ_C], [Λ_B, Λ_C, I]] [[U, 0, 0], [0, U, 0], [0, 0, U]]^⊤.

By reshuffling rows and columns, the eigenvalues of K̃_P are the union of the eigenvalues of the following matrices {K̃^{(n)}_P}_{n=0,1,2,···}:

K̃^{(n)}_P ≜ [[1, σ_n(A), σ_n(B)], [σ_n(A), 1, σ_n(C)], [σ_n(B), σ_n(C), 1]],  n = 0, 1, 2, · · · .

5. RELATED WORKS

The most closely related work is Ben-David et al. (2010), which uses the H∆H-distance to measure the maximum discrepancy of two models f, g ∈ F on distributions P, Q. However, it remains an open question to determine when the H∆H-distance is small for concrete nonlinear model classes and distributions. On the technical side, the quantity τ is an analog of the H∆H-distance for regression problems, and we provide concrete examples where τ is upper bounded even when the distributions P, Q have significantly different supports. Other closely related settings are domain adaptation (Ganin & Lempitsky, 2015; Ghifary et al., 2016; Ganin et al., 2016) and domain generalization (Gulrajani & Lopez-Paz, 2020; Peters et al., 2016), where the algorithms actively improve the extrapolation of the learned model, either by using unlabeled data from the test domain (Sun & Saenko, 2016; Li et al., 2020a;b; Zhang et al., 2019) or by learning a model that is invariant across different domains (Arjovsky et al., 2019; Peters et al., 2016; Gulrajani & Lopez-Paz, 2020). There are also algorithms that learn features whose distributions on the source and target domains have a small discrepancy, measured by the maximum mean discrepancy (Donahue et al., 2014; Long et al., 2015) or by Wasserstein distances (Shen et al., 2018; Courty et al., 2017). In comparison, this paper studies whether a model trained on one distribution (without any implicit bias or unlabeled data from the test domain) extrapolates to new distributions. There are also prior works that theoretically analyze algorithms that use additional (unlabeled) data from the test distribution, such as self-training (Wei et al., 2020; Chen et al., 2020), contrastive learning (Shen et al., 2022; HaoChen et al., 2022), and label propagation (Cai et al., 2021).

6. CONCLUSIONS

In this paper, we propose to study domain shifts between P and Q with the structure that each feature's marginal distribution has good overlap between the source and target domains, while the joint distribution of the features may have a much bigger shift. As a first step toward understanding the extrapolation of nonlinear models, we prove sufficient conditions for models of the form f(x) = Σ_{i=1}^k f_i(x_i) to extrapolate, where f_i is an arbitrary function of a single feature. Even though the assumptions on the shift and function class are stylized, to the best of our knowledge this is the first analysis of how nonlinear models extrapolate in concrete settings where the source and target distributions do not have shared support. There remain many interesting open questions, which we leave as future work: (a) Our current proof can only deal with a restricted nonlinear model family of the special form f(x) = Σ_{i=1}^k f_i(x_i). Can we extend it to a more general model class? (b) In this paper, we focus on regression tasks with the ℓ_2 loss for mathematical simplicity, whereas the majority of prior works focus on classification problems. Do similar results also hold for classification?



Footnotes:
1. In fact, the H∆H-distance likely cannot be bounded when the function class contains two-layer neural networks and the supports of the training and test distributions do not overlap: when there exists a function that can distinguish the source and target domains, the H∆H-divergence will be large.
2. For simplicity, we set 0/0 = 0.
3. Cheeger's inequality measures the clustering structure of a graph by the eigenvalues of its standard Laplacian. However, the signless Laplacian and standard Laplacian have the same eigenvalues for bipartite graphs (Cvetković et al. (2007); Grone et al. (1990)).



Figure 1: Visualization of three different training distributions P_1, P_2, P_3 and a test distribution Q, where the orange blocks mark the support. (a) and (b): distributions P_1, P_2 that do not satisfy our conditions and cannot extrapolate. (c): a distribution P_3 that satisfies our conditions even though the support of P_3 is sparse. (d): the test distribution.


ACKNOWLEDGMENT

The authors would like to thank Yuanhao Wang, Yuhao Zhou, Hong Liu, Ananya Kumar, Jason D. Lee, and Kendrick Shen for helpful discussions. The authors would also like to thank the support from NSF CIF 2212263.


Published as a conference paper at ICLR 2023

Theorem 11 implies that σ_n(A) = ([Σ_P]_{1,2})^n, σ_n(B) = ([Σ_P]_{1,3})^n, and σ_n(C) = ([Σ_P]_{2,3})^n. Consequently, we get K̃^{(n)}_P = Σ_P^n (the element-wise n-th power of Σ_P). The theorem then follows directly by noticing that λ_min(Σ_P^n) ≥ λ_min(Σ_P) for n ≥ 1 (Lemma 13). Finally, the general case where x_i has arbitrary mean and variance can be reduced to the case where x_i has zero mean and unit variance (Lemma 8). The full proof of Theorem 11 is deferred to Appendix A.6.
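The fact that λ_min(Σ_P^n) ≥ λ_min(Σ_P) for element-wise powers of a correlation matrix (Lemma 13) is an instance of the Schur product theorem: the Hadamard product of two PSD matrices is PSD, and element-wise powers preserve the unit diagonal. It can be spot-checked numerically; the random correlation matrix below is a hypothetical test case.

```python
import numpy as np

rng = np.random.default_rng(1)

def random_correlation(d):
    # Sample a random positive-definite matrix and rescale to unit diagonal.
    G = rng.normal(size=(d, d + 2))
    S = G @ G.T
    s = 1.0 / np.sqrt(np.diag(S))
    return S * np.outer(s, s)

Sigma = random_correlation(5)
lmin = np.linalg.eigvalsh(Sigma).min()

for n in [2, 3, 4]:
    # Element-wise (Hadamard) n-th power keeps the unit diagonal; by the
    # Schur product theorem its smallest eigenvalue cannot drop below lmin.
    assert np.linalg.eigvalsh(Sigma ** n).min() >= lmin - 1e-10
```

Note that `Sigma ** n` in numpy is the element-wise power, matching the paper's notation M^n, not the matrix power.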

3.3. TWO FEATURES WITH MULTI-DIMENSIONAL GAUSSIAN DISTRIBUTION

Now we extend Theorem 3 to the case where x_1 ∈ R^{d_1} and x_2 ∈ R^{d_2} are two subsets of features with dimensions d_1, d_2 > 1, respectively, and the input x = (x_1, x_2) has a Gaussian distribution. Recall that the model class is F = {f_1(x_1) + f_2(x_2)} for arbitrary functions f_1 : R^{d_1} → R and f_2 : R^{d_2} → R. In this case, we can still upper bound the F-RER by the eigenvalues of the covariance matrix Σ_P, as stated in the following theorem.

Theorem 4. For any distributions P, Q over variables x = (x_1, x_2) such that x = (x_1, x_2) has a Gaussian distribution on both P and Q with zero mean and matching marginals on x_1 and x_2, the F-RER τ(P, Q, F) can be bounded from above in terms of λ_min(Σ_P).

We defer the formal statement and proof of Theorem 4 to Appendix A.7. Compared with Theorem 3, where the features x_1, · · · , x_d are one-dimensional, our condition on the covariance is almost the same: λ_min(Σ_P) > 0. However, the model class considered in Theorem 4 is more powerful because it captures nonlinear interactions between features within the same subset. As a compromise, the assumption on the marginals of P and Q is stronger: Theorem 4 requires matching marginals on each subset of the features, whereas Theorem 3 only requires matching marginals on each individual feature.

Remarks. Our current techniques can only handle the case where the input is divided into k = 2 subsets. This is because for k ≥ 3 we would need to diagonalize multiple multi-dimensional Gaussian kernels simultaneously using the same set of eigenfunctions, as required in the proof of Theorem 3. However, these multi-dimensional Gaussian kernels do not share the same eigenfunctions because the rotation matrices U, V depend on the covariance E_P[x_1 x_2^⊤]. Hence, the proof strategy for Theorem 3 fails for the case k ≥ 3.

4. LOWER BOUNDS

In this section, we prove a lower bound as motivation for considering structured distribution shifts. The following proposition shows that, in the worst case, models learned on P cannot extrapolate to Q when the support of distribution Q is not contained in the support of P.

Proposition 5. Let the model class F be the family of two-layer neural networks with ReLU activation. Suppose for simplicity that all inputs have unit norm (i.e., ‖x‖_2 = 1). If Q has non-zero probability mass on a set of points well-separated from the support of P (the precise separation condition is stated in Appendix A.8), then we can construct a model f ∈ F such that ‖f‖_P = 0 but ‖f‖_Q is arbitrarily large.

A complete proof of this proposition is deferred to Appendix A.8. On a high level, we prove this proposition by constructing a two-layer neural network g_t that represents a bump function around any given input t ∈ S^{d−1}. As a result, when t is a point in supp(Q) \ supp(P), the model g_t(x) has zero ℓ_2 norm on P but a positive ℓ_2 norm on Q. This construction is inspired by Dong et al. (2021, Theorem 5.1).
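The bump construction in the proof sketch is simple enough to write down. A minimal sketch (the point t, margin delta, and scale are hypothetical choices): a single ReLU unit g_t(x) = c · ReLU(⟨t, x⟩ − (1 − delta)), which for unit-norm inputs vanishes whenever ‖x − t‖² ≥ 2·delta and is positive at x = t with magnitude c · delta.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def bump_net(t, delta, scale):
    """One-unit two-layer ReLU network g_t(x) = scale * relu(<t,x> - (1-delta)).
    For unit-norm x, <t,x> >= 1 - delta iff ||x - t||^2 <= 2*delta, so g_t
    vanishes outside a small cap around t but is positive at t itself."""
    def g(x):
        return scale * relu(x @ t - (1.0 - delta))
    return g

d = 3
t = np.array([0.0, 0.0, 1.0])        # a point in supp(Q) \ supp(P)
g = bump_net(t, delta=0.1, scale=100.0)

print(g(t))                          # ~10: grows linearly with `scale`
x_far = np.array([1.0, 0.0, 0.0])    # unit-norm point far from t
print(g(x_far))                      # 0.0: the bump misses supp(P)
```

If every point of supp(P) lies outside the cap {x : ‖x − t‖² ≤ 2·delta}, then ‖g_t‖_P = 0 exactly, while ‖g_t‖_Q > 0 can be made arbitrarily large by increasing the scale, which is the content of Proposition 5.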

