DOMAIN-ADJUSTED REGRESSION OR: ERM MAY ALREADY LEARN FEATURES SUFFICIENT FOR OUT-OF-DISTRIBUTION GENERALIZATION

Anonymous

Abstract

A common explanation for the failure of deep networks to generalize out-of-distribution is that they fail to recover the "correct" features. We challenge this notion with a simple experiment which suggests that ERM already learns sufficient features and that the current bottleneck is not feature learning, but robust regression. Our findings also imply that given a small amount of data from the target distribution, retraining only the last linear layer will give excellent performance. We therefore argue that devising simpler methods for learning predictors on existing features is a promising direction for future research. Towards this end, we introduce Domain-Adjusted Regression (DARE), a convex objective for learning a linear predictor that is provably robust under a new model of distribution shift. Rather than learning one function, DARE performs a domain-specific adjustment to unify the domains in a canonical latent space and learns to predict in this space. Under a natural model, we prove that the DARE solution is the minimax-optimal predictor for a constrained set of test distributions. Further, we provide the first finite-environment convergence guarantee to the minimax risk, improving over existing analyses which only yield minimax predictors after an environment threshold. Evaluated on finetuned features, we find that DARE compares favorably to prior methods, consistently achieving equal or better performance.
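To make the "domain-specific adjustment to unify the domains in a canonical latent space" concrete, the following is a minimal illustrative sketch, not the paper's actual objective: it assumes the adjustment is per-domain whitening of the feature covariance, so that each training domain's features are mapped into a shared space with identity covariance before a single linear predictor is fit. The function name `whiten` and the synthetic two-domain setup are our own illustrative choices.

```python
import numpy as np

def whiten(Z, eps=1e-6):
    """Map one domain's features into a canonical space with ~identity covariance.

    The whitening transform here is an illustrative assumption about the
    form of the per-domain adjustment, not the paper's exact construction.
    """
    Zc = Z - Z.mean(axis=0)                      # center the domain's features
    cov = Zc.T @ Zc / len(Zc)                    # empirical feature covariance
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T  # cov^{-1/2}
    return Zc @ W

rng = np.random.default_rng(0)
# Two synthetic training domains whose features differ by a linear transform.
A = rng.normal(size=(4, 4))
Z1 = rng.normal(size=(500, 4))
Z2 = rng.normal(size=(500, 4)) @ A

# After the domain-specific adjustment, both domains live in the same
# canonical space (identity covariance), so one linear predictor can be
# trained jointly on the adjusted features.
U1, U2 = whiten(Z1), whiten(Z2)
```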

1. INTRODUCTION

The historical motivation for deep learning focuses on the ability of deep neural networks to automatically learn rich, hierarchical features of complex data (LeCun et al., 2015; Goodfellow et al., 2016). Simple Empirical Risk Minimization (ERM), with appropriate regularization, results in high-quality representations which surpass carefully hand-selected features on a wide variety of downstream tasks. Despite these successes, or perhaps because of them, the dominant focus of late is on the shortcomings of this approach: recent work points to the failure of networks trained with ERM to generalize under even moderate distribution shift (Recht et al., 2019; Miller et al., 2020). A common explanation for this phenomenon is reliance on "spurious correlations" or "shortcuts", where a network makes predictions based on structure in the data which generalizes on average in the training set but may not persist in future test distributions (Poliak et al., 2018; Geirhos et al., 2019; Xiao et al., 2021). Many proposed solutions implicitly assume that this problem is due to the entire neural network: they suggest an alternate objective to be minimized over a deep network in an end-to-end fashion (Sun & Saenko, 2016; Ganin et al., 2016; Arjovsky et al., 2019). These objectives are complex, poorly understood, and difficult to optimize. Indeed, the efficacy of many such objectives was recently called into serious question (Zhao et al., 2019; Rosenfeld et al., 2021; Gulrajani & Lopez-Paz, 2021). Though a neural network is often viewed as a deep feature embedder with a final linear predictor applied to the features, it is still unclear (and to our knowledge has not been directly asked or tested) whether these issues are primarily because of (i) learning the wrong features or (ii) learning good features but failing to find the best-generalizing linear predictor on top of them.
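The distinction between (i) and (ii) can be probed with the last-layer retraining procedure the abstract alludes to: freeze the ERM-trained feature extractor and refit only a linear head on a small labeled sample from the target distribution. The sketch below is a minimal stand-in, not the paper's experimental setup: the `frozen_features` function simulates a pretrained network's penultimate-layer embedding with a fixed random projection, and the synthetic target-domain data are our own illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def frozen_features(x):
    # Stand-in for a frozen, ERM-pretrained feature extractor:
    # a fixed random projection followed by a tanh nonlinearity.
    W = np.random.default_rng(42).normal(size=(x.shape[1], 16))
    return np.tanh(x @ W)

# Small labeled sample from a (shifted) target domain: two classes
# whose inputs differ by a mean shift.
n, d = 200, 8
y = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, d)) + 2.0 * y[:, None]

# Retrain ONLY the linear head on top of the frozen features; the
# extractor itself is never updated.
head = LogisticRegression().fit(frozen_features(x), y)
acc = head.score(frozen_features(x), y)
```

If the frozen features are already sufficient, as hypothesis (ii) suggests, this cheap linear fit recovers strong target-domain accuracy without any end-to-end retraining.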
We begin with a simple experiment (Figure 1) to try to distinguish between these two possibilities: we train a deep network with ERM on several domain generalization benchmarks, where the task is to learn a predictor using a collection of distinct training domains and then perform well on a new,

