NEAR-OPTIMAL LINEAR REGRESSION UNDER DISTRIBUTION SHIFT

Abstract

Transfer learning is an essential technique when sufficient labeled data is available in the source domain but little or none in the target domain. We develop estimators that achieve minimax linear risk for linear regression problems under distribution shift. Our algorithms cover several settings with covariate shift or model shift, and we consider data generated from either linear or general nonlinear models. We show that affine minimax rules are within an absolute constant of the minimax risk, even among nonlinear rules, for various source/target distributions.

1. INTRODUCTION

The success of machine learning crucially relies on the availability of labeled data. The data labeling process usually requires much human labor and can be very expensive and time-consuming, especially for large datasets like ImageNet (Deng et al., 2009). On the other hand, models trained on one dataset, despite performing well on test data from the same distribution they were trained on, are often sensitive to distribution shift, i.e., they do not adapt well to related but different distributions. Even a small distribution shift can result in substantial performance degradation (Recht et al., 2018; Lu et al., 2020).

Transfer learning has been an essential paradigm to tackle the challenges associated with insufficient labeled data (Pan & Yang, 2009; Weiss et al., 2016; Long et al., 2017). The main idea is to make use of a source domain with abundant labeled data (e.g., ImageNet) to learn a model that performs well on a target domain (e.g., medical images) where few or no labels are available. Despite the lack of labels, we may still use unlabeled data from the target domain, which is usually much easier to obtain and can provide helpful information about the target domain. Although this approach has been integral to many applications, many fundamental questions remain open even in very basic settings.

In this work, we focus on linear regression under distribution shift and ask the fundamental question of how to optimally learn a linear model for a target domain, using labeled data from a source domain together with unlabeled data (and possibly some labeled data) from the target domain. For various settings, including covariate shift (i.e., when p(x) changes) and model shift (i.e., when p(y|x) changes), we develop estimators that achieve near-minimax risk (up to universal constant factors) among all linear estimation rules.
Here, linear estimators refer to all estimators that depend linearly on the label vector; these include almost all popular estimators in linear regression, such as ridge regression and its variants. When the input covariances of the source and target domains commute, we prove that our estimators achieve near-minimax risk among all possible estimators. A key insight from our results is that, when covariate shift is present, we need to apply data-dependent regularization that adapts to the change in the input distribution. For linear regression, this regularization can be constructed from the input covariances of the source and target tasks, which can be estimated using unlabeled data. Our experiments verify that our estimator significantly improves over ridge regression and similar heuristics.
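To make the role of data-dependent regularization concrete, the following sketch (our own illustration, not the paper's estimator) compares ordinary ridge regression with a generalized ridge penalty built from the source and target covariances. For any linear predictor with coefficients β̂, the target excess risk is (β̂ − β*)ᵀ Σ_T (β̂ − β*); the particular penalty matrix Σ_S Σ_T⁻¹ below is a hypothetical choice for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_s, sigma = 5, 200, 0.5
beta_star = rng.normal(size=d)

# Source and target input covariances (diagonal here, so they commute).
cov_s = np.diag([1.0, 1.0, 1.0, 0.1, 0.1])
cov_t = np.diag([0.1, 0.1, 0.1, 1.0, 1.0])

# Labeled source data: y = x^T beta* + noise.
X_s = rng.normal(size=(n_s, d)) @ np.sqrt(cov_s)
y_s = X_s @ beta_star + sigma * rng.normal(size=n_s)

def target_excess_risk(beta_hat):
    # E_{x ~ p_T}[(x^T beta_hat - x^T beta*)^2] = (beta_hat - beta*)^T Sigma_T (beta_hat - beta*)
    diff = beta_hat - beta_star
    return diff @ cov_t @ diff

lam = 1.0

# Ordinary ridge: an isotropic penalty lam * I that ignores the shift.
ridge = np.linalg.solve(X_s.T @ X_s + lam * np.eye(d), X_s.T @ y_s)

# Generalized ridge sketch (hypothetical penalty): shrink less along directions
# the target distribution weights heavily, more along directions it ignores.
gen = np.linalg.solve(X_s.T @ X_s + lam * cov_s @ np.linalg.inv(cov_t),
                      X_s.T @ y_s)

print(target_excess_risk(ridge), target_excess_risk(gen))
```

Both estimators are linear in the label vector y_s, so both fall inside the class over which the minimax analysis is carried out; they differ only in the penalty matrix.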

1.1. RELATED WORK

Different types of distribution shift are introduced in (Storkey, 2009; Quionero-Candela et al., 2009). Specifically, covariate shift occurs when the marginal distribution P(X) changes from the source to the target domain (Shimodaira, 2000; Huang et al., 2007). Wang et al. (2014); Wang & Schneider (2015) tackle model shift (P(Y|X)), provided the change is smooth as a function of X. Sun et al. (2011) design a two-stage reweighting method based on both covariate shift and model shift. Other methods, such as change of representation, adaptation through priors, and instance pruning, are proposed in (Jiang & Zhai, 2007). In this work, we focus on the above two kinds of distribution shift. For modeling target shift (P(Y)) and conditional shift (P(X|Y)), Zhang et al. (2013) exploit the benefit of multi-layer adaptation through a location-scale transformation on X.

Transfer learning and domain adaptation are sub-fields of machine learning that cope with distribution shift. Prior work roughly falls into the following categories. 1) Importance reweighting is mostly used under covariate shift (Shimodaira, 2000; Huang et al., 2007; Cortes et al., 2010). 2) One fruitful line of work focuses on exploring robust/causal features or domain-invariant representations through invariant risk minimization (Arjovsky et al., 2019), distributionally robust minimization (Sagawa et al., 2019), human annotation (Srivastava et al., 2020), adversarial training (Long et al., 2017; Ganin et al., 2016), or by minimizing domain discrepancy measured by some distance metric (Pan et al., 2010; Long et al., 2013; Baktashmotlagh et al., 2013; Gong et al., 2013; Zhang et al., 2013; Wang & Schneider, 2014). 3) Several approaches seek gradual domain adaptation (Gopalan et al., 2011; Gong et al., 2012; Glorot et al., 2011; Kumar et al., 2020) through self-training or a gradual change in the training distribution.

Near-minimax estimation is introduced in Donoho (1994) for linear regression problems with Gaussian noise. For a more general setting, Juditsky et al. (2009) estimate linear functionals using convex programming. Blaker (2000) compares ridge regression with a minimax linear estimator under weighted squared error. Kalan et al. (2020) consider a setting similar to this work, a minimax estimator under distribution shift, but focus on computing lower bounds for linear models and one-hidden-layer neural networks under distribution shift. Further results on generalization lower bounds for distribution shift under various settings are derived in (David et al., 2010; Hanneke & Kpotufe, 2019; Ben-David et al., 2010; Zhao et al., 2019).
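As a concrete illustration of the importance-reweighting idea under covariate shift (our own sketch, not code from any of the cited works), the snippet below fits weighted least squares with the density ratio w(x) = p_T(x)/p_S(x). For the two Gaussians chosen here the ratio is available in closed form; in practice it must be estimated from samples.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 2, 500
beta_star = np.array([1.0, -2.0])

# Source inputs x ~ N(0, I); target inputs x ~ N(mu, I): a simple covariate shift.
mu = np.array([2.0, 0.0])
X = rng.normal(size=(n, d))
y = X @ beta_star + 0.1 * rng.normal(size=n)

# Density ratio w(x) = p_T(x) / p_S(x) = exp(x . mu - ||mu||^2 / 2)
# for the two unit-covariance Gaussians above (closed form; normally estimated).
w = np.exp(X @ mu - 0.5 * mu @ mu)

# Importance-weighted least squares: argmin_b sum_i w_i (y_i - x_i^T b)^2.
W = np.diag(w)
beta_iw = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
print(beta_iw)
```

Since the linear model is well-specified here, both weighted and unweighted least squares are consistent; the reweighting matters when the model is misspecified, because it biases the fit toward the region where the target distribution puts its mass.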

2. PRELIMINARY

We formalize the setting considered in this paper for transfer learning under distribution shift.

Notation and setup. Let p_S(x) and p_T(x) be the marginal distributions of x in the source and target domains, with associated covariance matrices Σ_S and Σ_T. We assume we have sufficient unlabeled data to estimate Σ_T accurately. We observe n_S and n_T labeled samples from the source and target domains, respectively; labeled data is scarce in the target domain: n_S ≫ n_T, and n_T can be 0. Specifically, X_S = [x_1 | x_2 | ··· | x_{n_S}]^⊤ ∈ R^{n_S × d}, with x_i, i ∈ [n_S], drawn from p_S; noise z = [z_1, z_2, ···, z_{n_S}]^⊤ with z_i ~ N(0, σ²); and y_S = [y_1, y_2, ···, y_{n_S}]^⊤ ∈ R^{n_S}, with each y_i = f*(x_i) + z_i (X_T ∈ R^{n_T × d} and y_T are defined analogously for the target domain).
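The setup above can be instantiated numerically. Below is a minimal sketch under our own choices of dimension, sample sizes, and covariances (all variable names are ours): draw X_S and y_S from a linear ground-truth model f*(x) = x^⊤β* with Gaussian noise, and estimate Σ_T from plentiful unlabeled target samples.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_s, n_t_unlabeled, sigma = 3, 100, 1000, 0.2

# Ground-truth linear model f*(x) = x^T beta*.
beta_star = rng.normal(size=d)

# Source inputs x_i ~ p_S with covariance Sigma_S.
A = rng.normal(size=(d, d))
Sigma_S = A @ A.T / d
X_S = rng.multivariate_normal(np.zeros(d), Sigma_S, size=n_s)

# Labels y_i = f*(x_i) + z_i with z_i ~ N(0, sigma^2).
z = sigma * rng.normal(size=n_s)
y_S = X_S @ beta_star + z

# Plentiful unlabeled target data gives an accurate estimate of Sigma_T.
B = rng.normal(size=(d, d))
Sigma_T = B @ B.T / d
X_T_unlabeled = rng.multivariate_normal(np.zeros(d), Sigma_T, size=n_t_unlabeled)
Sigma_T_hat = X_T_unlabeled.T @ X_T_unlabeled / n_t_unlabeled

print(np.linalg.norm(Sigma_T_hat - Sigma_T))
```

This mirrors the asymmetry in the setup: labels are only drawn in the source domain, while the target domain contributes unlabeled inputs used solely to estimate its covariance.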




