FIRST STEPS TOWARD UNDERSTANDING THE EXTRAPOLATION OF NONLINEAR MODELS TO UNSEEN DOMAINS

Abstract

Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of nonlinear models: under what conditions on the distributions and the function class can models be guaranteed to extrapolate to new test distributions? The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models under structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or each subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may shift much more. We prove that the family of nonlinear models of the form f(x) = Σ_i f_i(x_i), where f_i is an arbitrary function on the subset of features x_i, can extrapolate to unseen distributions if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.

1. INTRODUCTION

In real-world applications, machine learning models are often deployed on domains that are not seen at training time. For example, we may train machine learning models for medical diagnosis on data from hospitals in Europe and then deploy them to hospitals in Asia. Thus, we need to understand the extrapolation of models to new test distributions: how a model trained on one distribution behaves on another, unseen distribution. This extrapolation of neural networks is central to various robustness questions such as domain generalization (Gulrajani & Lopez-Paz (2020); Ganin et al. (2016); Peters et al. (2016) and references therein) and adversarial robustness (Goodfellow et al., 2014; Kurakin et al., 2018), and also plays a critical role in nonlinear bandits and reinforcement learning, where the distribution is constantly changing during training (Dong et al., 2021; Agarwal et al., 2019; Lattimore & Szepesvári, 2020; Sutton & Barto, 2018).

This paper focuses on the following mathematical abstraction of this extrapolation question:

Under what conditions on the source distribution P, target distribution Q, and function class F do we have that any functions f, g ∈ F that agree on P are also guaranteed to agree on Q? Here we measure the agreement of two functions on P by the ℓ₂ distance between f and g under the distribution P, that is, ‖f − g‖_P ≜ (E_{x∼P}[(f(x) − g(x))²])^{1/2}. The function f can be thought of as the learned model and g as the ground-truth function, so that ‖f − g‖_P is the error on the source distribution P.

This question is well understood for the linear function class F. Essentially, if the covariance of Q can be bounded from above by the covariance of P (in every direction), then the error on Q is guaranteed to be bounded by the error on P. We refer the reader to Lei et al. (2021); Mousavi Kalan et al. (2020) and references therein for more recent advances along this line.

By contrast, theoretical results for the extrapolation of nonlinear models are rather limited. Classical results have long settled the case where P and Q have bounded density ratios (Ben-David & Urner, 2014; Sugiyama et al., 2007). A bounded density ratio implies that the support of Q must be a subset of the support of P, and thus arguably these results do not capture the extrapolation behavior of models outside the training domain. Without the bounded density ratio assumption, there were few prior positive results characterizing the extrapolation power of neural networks. Ben-David et al. (2010) show that the model can extrapolate when the H∆H-distance between the training and test distributions is small. However, it remains unclear for which distributions and function classes the H∆H-distance can be bounded.¹ In general, the question is challenging partly because of the existence of a strong impossibility result.
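The linear case can be verified with a few lines of algebra: for a linear discrepancy h = f − g with weights w and zero-mean data, ‖h‖²_Q = wᵀ Cov(Q) w ≤ κ wᵀ Cov(P) w = κ ‖h‖²_P whenever Cov(Q) ⪯ κ · Cov(P). The following sketch (an illustration, not from the paper) checks this numerically:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5

# Zero-mean source P with identity covariance; target Q with a random PSD covariance.
A = rng.standard_normal((d, d))
cov_Q = A @ A.T / d
cov_P = np.eye(d)

# Smallest kappa with cov_Q <= kappa * cov_P: the largest eigenvalue of cov_Q.
kappa = np.linalg.eigvalsh(cov_Q)[-1]

# Arbitrary linear models f(x) = u.x and g(x) = v.x; h = f - g is linear with weights w.
w = rng.standard_normal(d)

# For zero-mean data, E[h(x)^2] = w' Cov w, so the extrapolation bound is exact algebra.
err_P = w @ cov_P @ w
err_Q = w @ cov_Q @ w
print(err_Q, kappa * err_P)  # err_Q <= kappa * err_P
```

This is exactly why the covariance-domination condition suffices for linear models: no sampling is even needed, the bound holds eigenvalue by eigenvalue.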
As soon as the support of Q is not contained in the support of P (and the two satisfy some non-degeneracy condition), it turns out that even two-layer neural networks cannot extrapolate: there are two-layer neural networks f and g that agree on P perfectly but behave very differently on Q (see Proposition 5 for a formal statement). This impossibility result suggests that any positive result on the extrapolation of nonlinear models requires more fine-grained structure on the relationship between P and Q (which is common in practice; Koh et al. (2021); Sagawa et al. (2022)), as well as on the function class F. The structure in the domain shift between P and Q may also need to be compatible with the assumption on the function class F.

This paper makes some first steps towards proving that a certain family of nonlinear models can extrapolate to a new test domain with structured shift. We consider a setting where the joint distribution of the data does not have much overlap across P and Q (and thus the bounded density ratio assumption does not hold), whereas the marginal distributions of each coordinate of the data do overlap. Such a scenario may arise in practice when the features (coordinates of the data) exhibit different correlations on the source and target distributions. For example, consider the task of predicting the probability of a lightning storm from basic meteorological information such as precipitation, temperature, etc. We learn models from some cities on the west coast of the United States and deploy them to the east coast. In this case, the joint test distribution of the features may not necessarily have much overlap with the training distribution: the correlation between precipitation and temperature could be vastly different across regions, e.g., the rainy season coincides with the winter's low temperatures on the west coast, but not so much on the east coast.
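The impossibility phenomenon is easy to reproduce even in one dimension. In this hypothetical sketch (not the paper's Proposition 5, just an analogous construction), f(x) = ReLU(x − 1) and the zero network g are both two-layer networks that agree exactly on a source distribution P supported on [−1, 1], yet differ substantially on a target Q supported outside that interval:

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(z, 0.0)

# Two "two-layer networks" in one dimension:
#   f(x) = ReLU(x - 1)  (a single hidden ReLU unit), and g(x) = 0 (the zero network).
f = lambda x: relu(x - 1.0)
g = lambda x: 0.0 * x

# P supported on [-1, 1]: f and g agree exactly there, since x - 1 <= 0 on the support.
xP = rng.uniform(-1.0, 1.0, size=100_000)
# Q supported on [1, 3], entirely outside the support of P.
xQ = rng.uniform(1.0, 3.0, size=100_000)

err_P = np.sqrt(np.mean((f(xP) - g(xP)) ** 2))
err_Q = np.sqrt(np.mean((f(xQ) - g(xQ)) ** 2))
print(err_P, err_Q)  # err_P is exactly 0; err_Q is bounded away from 0
```

Since the source error is exactly zero while the target error is a constant, no bound of the form ‖f − g‖_Q ≤ C‖f − g‖_P can hold for this function class without further assumptions.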
However, each individual feature's marginal distribution is much more likely to overlap between the source and target: the possible ranges of temperature on the east and west coasts are similar. Concretely, we assume that the features x ∈ R^{s1+s2} have Gaussian distributions and can be divided into two subsets x_1 ∈ R^{s1} and x_2 ∈ R^{s2} such that each set of features x_i (i ∈ {1, 2}) has the same marginal distribution on P and Q. Moreover, we assume that x_1 and x_2 are not perfectly correlated on P: the covariance of the features x on the distribution P has a strictly positive minimum eigenvalue.

As argued before, restrictive assumptions on the function class F are still necessary (for almost any P and Q without the bounded density ratio property). Here, we assume that F consists of all functions of the form f(x) = f_1(x_1) + f_2(x_2) for arbitrary functions f_1 : R^{s1} → R and f_2 : R^{s2} → R. The function class F does not contain all two-layer neural networks (so the impossibility result does not apply), but still consists of a rich set of functions where each subset of features contributes independently to the prediction, with arbitrary nonlinear transformations. We show that under these assumptions, if any two models approximately agree on P, they must also approximately agree on Q: formally, for all f, g ∈ F, ‖f − g‖_Q ≲ ‖f − g‖_P (Theorem 4).

We also prove a variant of the result above where we divide the feature vector x ∈ R^d into d coordinates, denoted x = (x_1, . . . , x_d) with x_i ∈ R. The function class consists of all combinations of nonlinear transformations of the x_i's, that is, F = {Σ_{i=1}^d f_i(x_i)}. Assuming the coordinates of x are pairwise Gaussian with a non-degenerate covariance matrix, the nonlinear models f ∈ F can extrapolate to any distribution Q that has the same marginals as P (Theorem 3).
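This setting can be probed empirically. In the hypothetical simulation below (an illustration of the setting, not the paper's proof), P and Q are bivariate Gaussians with identical standard-normal marginals but opposite correlations, so their joint densities barely overlap; random additive functions h(x) = h_1(x_1) + h_2(x_2) stand in for the discrepancy f − g, and ‖h‖_Q stays within a modest constant factor of ‖h‖_P:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

def sample(rho, n):
    # Bivariate Gaussian with N(0, 1) marginals and correlation rho.
    cov = np.array([[1.0, rho], [rho, 1.0]])
    return rng.multivariate_normal([0.0, 0.0], cov, size=n)

# Same marginals, very different joint: correlation +0.9 on P, -0.9 on Q.
xP = sample(0.9, n)
xQ = sample(-0.9, n)

# Random additive functions h(x) = h1(x1) + h2(x2), where each h_i is a random
# cubic polynomial (a stand-in for an "arbitrary" nonlinear f_i - g_i).
for _ in range(5):
    c1, c2 = rng.standard_normal(4), rng.standard_normal(4)
    h = lambda x: np.polyval(c1, x[:, 0]) + np.polyval(c2, x[:, 1])
    err_P = np.sqrt(np.mean(h(xP) ** 2))
    err_Q = np.sqrt(np.mean(h(xQ) ** 2))
    print(f"||h||_Q / ||h||_P = {err_Q / err_P:.3f}")
```

The ratio stays bounded because the covariance of the features on P is well-conditioned (minimum eigenvalue 1 − 0.9 = 0.1 > 0), matching the positive-minimum-eigenvalue condition in the theorem; shrinking that eigenvalue toward zero lets the ratio blow up.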



¹ In fact, the H∆H-distance likely cannot be bounded when the function class contains two-layer neural networks and the supports of the training and test distributions do not overlap: when there exists a function that can distinguish the source and target domains, the H∆H divergence will be large.

