FIRST STEPS TOWARD UNDERSTANDING THE EXTRAPOLATION OF NONLINEAR MODELS TO UNSEEN DOMAINS

Abstract

Real-world machine learning applications often involve deploying neural networks to domains that are not seen at training time. Hence, we need to understand the extrapolation of nonlinear models: under what conditions on the distributions and function class can models be guaranteed to extrapolate to new test distributions? The question is very challenging because even two-layer neural networks cannot be guaranteed to extrapolate outside the support of the training distribution without further assumptions on the domain shift. This paper makes some initial steps towards analyzing the extrapolation of nonlinear models under structured domain shift. We primarily consider settings where the marginal distribution of each coordinate of the data (or each subset of coordinates) does not shift significantly across the training and test distributions, but the joint distribution may have a much bigger shift. We prove that the family of nonlinear models of the form f(x) = ∑_i f_i(x_i), where f_i is an arbitrary function on the subset of features x_i, can extrapolate to unseen distributions if the covariance of the features is well-conditioned. To the best of our knowledge, this is the first result that goes beyond linear models and the bounded density ratio assumption, even though the assumptions on the distribution shift and function class are stylized.
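To make the abstract's setting concrete, the following is a toy numerical sketch of our own (not from the paper): we fit an additive model f(x) = mean + ∑_i f_i(x_i) by per-coordinate histogram regression on a source distribution P with independent coordinates, then evaluate it on a target Q whose coordinatewise marginals match P but whose joint distribution is heavily shifted. The ground-truth function g, the bin count, and the copy-based construction of Q are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, bins = 20000, 3, 20

def g(X):
    # Ground-truth additive function g(x) = g_0(x_0) + g_1(x_1) + g_2(x_2).
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + np.abs(X[:, 2] - 0.5)

# Source P: independent Uniform[0, 1] coordinates.
Xp = rng.uniform(size=(n, d))
yp = g(Xp)

edges = np.linspace(0.0, 1.0, bins + 1)

def bin_index(col):
    return np.clip(np.digitize(col, edges) - 1, 0, bins - 1)

# Fit one histogram regressor per coordinate in a single greedy pass
# (sufficient here because the coordinates are independent under P).
tables = []
resid = yp - yp.mean()
for i in range(d):
    idx = bin_index(Xp[:, i])
    tbl = np.array([resid[idx == b].mean() for b in range(bins)])
    tables.append(tbl)
    resid = resid - tbl[idx]

def f(X):
    out = np.full(len(X), yp.mean())
    for i in range(d):
        out = out + tables[i][bin_index(X[:, i])]
    return out

# Target Q: identical coordinatewise marginals, very different joint --
# on half the rows, coordinate 1 is an exact copy of coordinate 0.
Xq = rng.uniform(size=(n, d))
Xq[: n // 2, 1] = Xq[: n // 2, 0]

rmse_P = np.sqrt(np.mean((f(Xp) - g(Xp)) ** 2))
rmse_Q = np.sqrt(np.mean((f(Xq) - g(Xq)) ** 2))
# Despite the joint shift, the additive fit extrapolates: rmse_Q stays
# comparable to rmse_P (both dominated by histogram discretization bias).
```

The point of the construction is that f − g is itself additive, so its size is controlled coordinatewise by the marginals, which Q shares with P.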

1. INTRODUCTION

In real-world applications, machine learning models are often deployed on domains that are not seen at training time. For example, we may train machine learning models for medical diagnosis on data from hospitals in Europe and then deploy them to hospitals in Asia. Thus, we need to understand the extrapolation of models to new test distributions: how a model trained on one distribution behaves on another, unseen distribution. This extrapolation of neural networks is central to various robustness questions such as domain generalization (Gulrajani & Lopez-Paz (2020); Ganin et al. (2016); Peters et al. (2016) and references therein) and adversarial robustness (Goodfellow et al., 2014; Kurakin et al., 2018), and it also plays a critical role in nonlinear bandits and reinforcement learning, where the distribution is constantly changing during training (Dong et al., 2021; Agarwal et al., 2019; Lattimore & Szepesvári, 2020; Sutton & Barto, 2018).

This paper focuses on the following mathematical abstraction of this extrapolation question:

Under what conditions on the source distribution P, target distribution Q, and function class F do we have that any functions f, g ∈ F that agree on P are also guaranteed to agree on Q?

Here we measure the agreement of two functions on P by the ℓ₂ distance between f and g under the distribution P, that is, ||f − g||_P := (E_{x∼P}[(f(x) − g(x))²])^{1/2}. The function f can be thought of as the learned model and g as the ground-truth function, so ||f − g||_P is the error on the source distribution P. This question is well understood for a linear function class F: essentially, if the covariance of Q is bounded from above by a constant multiple c of the covariance of P (in every direction), then the error on Q is guaranteed to be at most √c times the error on P.
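The linear-class guarantee can be checked numerically. The sketch below is our own construction (not from the paper): it builds a target covariance Σ_Q dominated by c·Σ_P and verifies that ||f − g||_Q ≤ √c · ||f − g||_P for the difference of two linear models on zero-mean data.

```python
import numpy as np

rng = np.random.default_rng(0)
d, c = 5, 2.0

# Well-conditioned source covariance: Sigma_P >= I in the PSD order.
A = rng.normal(size=(d, d))
Sigma_P = A @ A.T + np.eye(d)

# Target covariance dominated by c * Sigma_P:
# Sigma_Q = c * Sigma_P - 0.5 * I is PSD (since c * Sigma_P >= 2I)
# and satisfies w^T Sigma_Q w <= c * w^T Sigma_P w in every direction w.
Sigma_Q = c * Sigma_P - 0.5 * np.eye(d)

# For linear f, g on zero-mean x, the error is a quadratic form in
# w = w_f - w_g:  ||f - g||_P^2 = E_P[(w^T x)^2] = w^T Sigma_P w.
for _ in range(1000):
    w = rng.normal(size=d)
    err_P = np.sqrt(w @ Sigma_P @ w)
    err_Q = np.sqrt(w @ Sigma_Q @ w)
    assert err_Q <= np.sqrt(c) * err_P + 1e-9
```

The loop checks the bound on random directions; the PSD domination itself can also be verified directly from the eigenvalues of c·Σ_P − Σ_Q.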

