UNDERSTANDING WHY GENERALIZED REWEIGHTING DOES NOT IMPROVE OVER ERM

Abstract

Empirical risk minimization (ERM) is known to be non-robust in practice to distributional shift, where the training and test distributions differ. A suite of approaches, such as importance weighting and variants of distributionally robust optimization (DRO), has been proposed to solve this problem. But a line of recent work has empirically shown that these approaches do not significantly improve over ERM in real applications with distribution shift. The goal of this work is to obtain a comprehensive theoretical understanding of this intriguing phenomenon. We first posit the class of generalized reweighting (GRW) algorithms as a broad category of approaches that iteratively update model parameters based on iterative reweighting of the training samples. We show that when overparameterized models are trained under GRW, the resulting models are close to those obtained by ERM. We also show that adding small regularization, which does not greatly affect the empirical training accuracy, does not help. Together, our results show that a broad category of what we term GRW approaches is not able to achieve distributionally robust generalization. Our work thus has the following sobering takeaway: to make progress towards distributionally robust generalization, we either have to develop non-GRW approaches, or perhaps devise novel classification/regression loss functions that are adapted to GRW approaches.

1. INTRODUCTION

It has now been well established that empirical risk minimization (ERM) can empirically achieve high test performance on a variety of tasks, particularly with modern overparameterized models where the number of parameters is much larger than the number of training samples. This strong performance of ERM, however, has been shown to degrade under distributional shift, where the training and test distributions are different (Hovy & Søgaard, 2015; Blodgett et al., 2016; Tatman, 2017). There are two broad categories of distribution shift: domain generalization, where the test distribution contains samples from new domains that did not appear during training; and subpopulation shift, where the training set contains several subgroups and the test distribution weighs these subgroups differently, as in fair machine learning. Various approaches have been proposed to learn models robust to distributional shift. The most classical is importance weighting (IW) (Shimodaira, 2000; Fang et al., 2020), which reweights training samples; for subpopulation shift, these weights are typically set so that each subpopulation has the same overall weight in the training objective. The approach most widely used today is distributionally robust optimization (DRO) (Duchi & Namkoong, 2018; Hashimoto et al., 2018), which assumes that the test distribution belongs to a certain uncertainty set of distributions close to the training distribution, and trains on the worst-case distribution in that set. Many variants of DRO have been proposed and are used in practice (Sagawa et al., 2020a; Zhai et al., 2021a;b). While these approaches have been developed for the express purpose of improving ERM under distribution shift, a line of recent work has empirically shown the negative result that, when used to train overparameterized models, these methods do not improve over ERM.
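As a concrete illustration of the importance-weighting scheme described above, the following is a minimal sketch (the function names `iw_weights` and `weighted_risk` are our own, not from the paper) of how per-sample weights are set for subpopulation shift so that every subgroup contributes equally to the training objective:

```python
# Sketch of importance weighting (IW) for subpopulation shift: each sample
# is weighted inversely to the size of its subgroup, so all subgroups carry
# equal total weight in the (reweighted) empirical risk.
import numpy as np

def iw_weights(group_ids):
    """Per-sample weights, normalized to sum to 1 across the training set."""
    group_ids = np.asarray(group_ids)
    groups, counts = np.unique(group_ids, return_counts=True)
    n_groups = len(groups)
    count_of = dict(zip(groups.tolist(), counts.tolist()))
    # weight of sample i: 1 / (n_groups * |group containing i|)
    return np.array([1.0 / (n_groups * count_of[g]) for g in group_ids])

def weighted_risk(losses, w):
    """Reweighted empirical risk: sum_i w_i * loss_i."""
    return float(np.dot(w, np.asarray(losses)))
```

With groups `[0, 0, 0, 1]`, the majority group's three samples together receive total weight 0.5, the same as the single minority sample, so the minority group is upweighted rather than drowned out.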
For IW, Byrd & Lipton (2019) observed that its effect under stochastic gradient descent (SGD) diminishes over training epochs, and ultimately does not improve over ERM. For variants of DRO, Sagawa et al. (2020a) found that these methods overfit very easily, i.e. their test performance drops to the same low level as ERM's after sufficiently many epochs if no regularization is applied. Gulrajani & Lopez-Paz (2021) and Koh et al. (2021) compared these methods with ERM on a number of real-world applications, and found that in most cases none of these methods improves over ERM. This line of empirical results has also been bolstered by some recent theoretical results. Sagawa et al. (2020b) constructed a synthetic dataset where a linear model trained with IW is provably not robust to subpopulation shift. Xu et al. (2021) further proved that under gradient descent (GD) with a sufficiently small learning rate, a linear classifier trained with either IW or ERM converges to the same max-margin classifier, so upon convergence the two are no different. These previous theoretical results are limited to linear models and to specific approaches such as IW where sample weights are fixed during training. They are not applicable to more complex models, or to more general approaches where the sample weights change iteratively, which includes most DRO variants. Towards placing the empirical results on a stronger theoretical footing, we define the class of generalized reweighting (GRW) algorithms, which dynamically assign weights to the training samples and iteratively minimize the weighted average of the sample losses. By allowing the weights to vary across iterations, we cover not just static importance weighting, but also the DRO approaches outlined earlier; of course, the GRW class is much broader than just these instances.

Main contributions.
We prove that GRW and ERM have (almost) equivalent implicit biases, in the sense that the points they converge to are very close to each other, under a much more general setting than those used in previous work. Thus, GRW cannot improve over ERM because it does not yield a significantly different model. We are the first to extend this line of theoretical results (i) to wide neural networks, (ii) to reweighting methods with dynamic weights, (iii) to regression tasks, and (iv) to methods with L2 regularization. We note that these extensions are technically non-trivial, as they require the result that wide neural networks can be approximated by their linearized counterparts to hold uniformly throughout the iterative process of GRW algorithms. Moreover, we fix an error in the proof of a previous paper (Lee et al., 2019) (see Appendix E), which may be of independent interest. Overall, the important takeaway is that distributionally robust generalization (DRG) cannot be directly achieved by the broad class of GRW algorithms (which includes popular approaches such as importance weighting and most DRO variants). Progress towards this important goal thus requires either going beyond GRW algorithms, or devising novel loss functions that are adapted to GRW approaches. In Section 6 we discuss some promising future directions, as well as the case of non-overparameterized models and early stopping. Finally, we want to emphasize that while the models we use in our results (linear models and wide neural networks) differ from practical models, they are the models most widely used in existing theory papers, and our results based on them explain the baffling observations made in previous empirical work, as well as provide valuable insights into how to improve distributionally robust generalization.
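To make the GRW template concrete, the following is a minimal sketch of one GRW iteration on a linear model with squared loss. The multiplicative weight update shown (upweighting high-loss samples) is one illustrative choice of dynamic reweighting, chosen by us for this sketch; static importance weighting (fixed weights) and Group DRO-style updates are other instances of the same template.

```python
# Sketch of generalized reweighting (GRW): at each step, (1) update the
# sample weights q_t as a function of the current losses, then (2) take a
# gradient step on the q_t-weighted empirical risk.
import numpy as np

def grw_train(X, y, lr=0.1, eta=0.5, steps=200):
    n, d = X.shape
    theta = np.zeros(d)
    q = np.ones(n) / n                      # start from uniform weights
    for _ in range(steps):
        preds = X @ theta
        losses = 0.5 * (preds - y) ** 2     # per-sample squared loss
        # GRW step 1: dynamic weight update (illustrative choice:
        # multiplicative update favoring high-loss samples)
        q = q * np.exp(eta * losses)
        q = q / q.sum()
        # GRW step 2: gradient step on the weighted empirical risk
        grad = X.T @ (q * (preds - y))
        theta = theta - lr * grad
    return theta
```

On data that a linear model can fit exactly, this loop drives every per-sample loss to zero, at which point the weights stop moving, which is an informal preview of why, in the overparameterized regime, the reweighting ends up not mattering.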

2. PRELIMINARIES

Let the input space be X ⊆ ℝ^d and the output space be Y ⊆ ℝ.¹ We assume that X is a subset of the unit L2 ball of ℝ^d, so that any x ∈ X satisfies ‖x‖₂ ≤ 1. We have a training set {z_i = (x_i, y_i)}_{i=1}^n sampled i.i.d. from an underlying distribution P over X × Y. Denote X = (x_1, …, x_n) ∈ ℝ^{d×n} and Y = (y_1, …, y_n) ∈ ℝ^n. For any function g : X → ℝ^m, we overload notation and use g(X) = (g(x_1), …, g(x_n)) ∈ ℝ^{m×n} (except when m = 1, in which case g(X) is defined as a column vector).

Let the loss function be ℓ : Y × Y → [0, 1]. ERM trains a model by minimizing its expected risk R(f; P) = E_{z∼P}[ℓ(f(x), y)] via minimizing the empirical risk R̂(f) = (1/n) Σ_{i=1}^n ℓ(f(x_i), y_i). Under distributional shift, the model is evaluated not on the training distribution P, but on a different test distribution P_test, so that we care about the expected risk R(f; P_test). A large family of methods designed for such distributional shift is distributionally robust optimization (DRO), which minimizes the expected risk over the worst-case distribution Q ≪ P² in a ball w.r.t. a divergence D around the training distribution P. Specifically, DRO minimizes the expected DRO risk defined as:

R_{D,ρ}(f; P) = sup_{Q≪P} { E_Q[ℓ(f(x), y)] : D(Q ‖ P) ≤ ρ }    (1)

for some ρ > 0. Examples include CVaR, χ²-DRO (Hashimoto et al., 2018), and DORO (Zhai et al., 2021a), among others.

---
¹ Our results can be easily extended to the multi-class scenario (see Appendix B).
² For distributions P and Q, Q is absolutely continuous w.r.t. P, written Q ≪ P, if for any event A, P(A) = 0 implies Q(A) = 0.
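As one concrete instance of the DRO risk in Eq. (1), the empirical CVaR objective at level α places all the worst-case mass on the α-fraction of samples with the highest losses. The sketch below uses an integer-k discretization for simplicity (the exact CVaR places fractional weight on the boundary sample); the function name `cvar_risk` is ours, not from the paper.

```python
# Sketch of the empirical CVaR-at-level-alpha objective: the average loss
# over the worst alpha-fraction of the training samples. This is the value
# of sup_Q E_Q[loss] when Q is constrained to the CVaR uncertainty set.
import numpy as np

def cvar_risk(losses, alpha=0.2):
    """Average loss over the (roughly) alpha-fraction of worst samples."""
    losses = np.sort(np.asarray(losses, dtype=float))[::-1]  # descending
    k = max(1, int(np.ceil(alpha * len(losses))))
    return float(losses[:k].mean())
```

For example, with losses (1, 2, 3, 4, 5) and α = 0.4, the worst 40% are the two largest losses, giving a CVaR risk of 4.5, strictly larger than the ERM risk of 3; at α = 1 the CVaR risk reduces to the ordinary empirical risk.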

