HOW DOES OVERPARAMETRIZATION AFFECT PERFORMANCE ON MINORITY GROUPS?

Abstract

The benefits of overparameterization for the overall performance of modern machine learning (ML) models are well known. However, the effect of overparameterization at the more granular level of data subgroups is less understood. Recent empirical studies demonstrate encouraging results: (i) when groups are not known, overparameterized models trained with empirical risk minimization (ERM) perform better on minority groups; (ii) when groups are known, ERM on data subsampled to equalize group sizes yields state-of-the-art worst-group accuracy in the overparameterized regime. In this paper, we complement these empirical studies with a theoretical investigation of the risk of overparameterized random feature models on minority groups. In a setting in which the regression functions for the majority and minority groups are different, we show that overparameterization always improves minority group performance.

1. INTRODUCTION

Traditionally, the goal of machine learning (ML) is to optimize the average or overall performance of ML models. The relentless pursuit of this goal eventually led to the development of deep neural networks, which achieve state-of-the-art performance in many application areas. A prominent trend in the development of such modern ML models is overparameterization: the models are so complex that they are capable of perfectly interpolating the training data. There is a large body of work showing that overparameterization improves the performance of ML models in a variety of settings (e.g., ridgeless least squares (Hastie et al., 2019), random feature models (Belkin et al., 2019; Mei and Montanari, 2019b), and deep neural networks (Nakkiran et al., 2019)).

However, as ML models find their way into high-stakes decision-making processes, other aspects of their performance (besides average performance) are coming under scrutiny. One aspect that is particularly relevant to the fairness and safety of ML models is their performance on traditionally disadvantaged demographic groups. There is a troubling line of work showing that ML models that perform well on average may perform poorly on minority groups of training examples. For example, Buolamwini and Gebru (2018) show that commercial gender classification systems, despite achieving low classification error on average, tend to misclassify dark-skinned people. In the same spirit, Wilson et al. (2019) show that pedestrian detection models, despite performing admirably on average, have trouble recognizing dark-skinned pedestrians.

The literature also examines the effect of model size on the generalization error of the worst group. Sagawa et al. (2020) find that increasing model size beyond the threshold of zero training error can have a negative impact on test error for minority groups because the model learns spurious correlations. They show that subsampling the majority groups is far more successful than upweighting the minority groups in reducing worst-group error. Pham et al.
(2021) conduct more extensive experiments to investigate the influence of model size on worst-group error under various neural network architectures and model parameter initialization configurations. They discover that increasing model size either improves or does not harm worst-group test performance across all settings. Idrissi et al. (2021) recommend using simple methods, i.e., subsampling and reweighting for balanced classes or balanced groups, before venturing into more complicated procedures. They suggest that newly developed robust optimization approaches for worst-group error control (Sagawa et al., 2019; Liu et al., 2021) can be computationally demanding, and that there is no strong (statistically significant) evidence of an advantage over those simple methods.

In this paper, we provide theoretical justification for the empirical results in Sagawa et al. (2020); Pham et al. (2021); Idrissi et al. (2021) by studying how overparameterization affects the performance of ML models on minority groups in an idealized regression problem. Our investigation shows that overparameterization generally improves or stabilizes the performance of ML models on minority groups. Our main contributions are:

1. We develop a simple two-group model for studying the effects of overparameterization on (sub)groups. This model has parameters controlling signal strength, majority group fraction, overparameterization ratio, discrepancy between the two groups, and error term variance, which together display a rich set of possible effects.

2. We develop a comprehensive picture of the limiting risk of empirical risk minimization in a high-dimensional asymptotic setting (see Section 3).

3. We show that majority group subsampling provably improves minority group performance in the overparameterized regime. Some of the technical tools that we develop in the proofs may be of independent interest.
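To make the group-balanced subsampling procedure referenced above concrete, here is a minimal sketch (the function name and the assumption that the minority group is labeled g = 0 are ours): it keeps every minority sample and a uniformly random, equally sized subset of the majority group.

```python
import numpy as np

def subsample_majority(X, y, g, rng=None):
    """Group-balanced subsampling: keep all minority samples (g == 0)
    and an equally sized random subset of the majority group (g == 1)."""
    rng = np.random.default_rng(rng)
    minority = np.flatnonzero(g == 0)
    majority = np.flatnonzero(g == 1)
    keep_maj = rng.choice(majority, size=minority.size, replace=False)
    idx = np.concatenate([minority, keep_maj])
    return X[idx], y[idx], g[idx]
```

ERM is then run on the subsampled (and hence group-balanced) data; no reweighting is needed.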

2.1. DATA GENERATING PROCESS

Let X ⊂ R^d be the feature space and Y ⊂ R be the output space. To keep things simple, we consider a two-group setup. Let P_0 and P_1 be probability distributions on X × Y. We consider P_0 and P_1 as the distributions of samples from the minority and majority groups, respectively. In the minority group, the samples (x, y) ∈ X × Y are distributed as

x ∼ P_X,    y | x = β_0^⊤ x + ε,    ε ∼ N(0, τ^2),    (2.1)

where P_X is the marginal distribution of the features, β_0 ∈ R^d is a vector of regression coefficients, and τ^2 > 0 is the noise level. The normality of the error term in (2.1) is not important; our theoretical results remain valid even if the error term is non-Gaussian. In the majority group, the marginal distribution of the features is identical, but the conditional distribution of the output is different:

y | x = β_1^⊤ x + ε,    ε ∼ N(0, τ^2),    (2.2)

where β_1 ∈ R^d is the vector of regression coefficients for the majority group. We note that this difference between the majority and minority groups is a form of concept drift or posterior drift: the marginal distribution of the features is identical, but the conditional distribution of the output is different. We focus on this setting because it not only simplifies our derivations, but also isolates the effects of concept drift between subpopulations through the difference δ ≜ β_1 − β_0. If the covariate distributions of the two groups were different, then an overparameterized model might be able to distinguish between the two groups, thus effectively modeling the groups separately. In that sense, by assuming that the covariates are equally distributed, we consider the worst case.

Let g_i ∈ {0, 1} denote the group membership of the i-th training sample. The training data {(x_i, y_i, g_i)}_{i=1}^n consists of a mixture of samples from the majority and minority groups:

g_i ∼ Ber(π),    (x_i, y_i) | g_i ∼ P_{g_i},    (2.3)

where π ∈ [1/2, 1] is the (expected) proportion of samples from the majority group in the training data.
We denote by n_1 the sample size of the majority group in the training data.
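The two-group data-generating process of Section 2.1 is easy to simulate; the following is a minimal sketch in which we additionally assume a standard Gaussian marginal P_X and pick illustrative parameter values (the function name and all numbers are ours).

```python
import numpy as np

def sample_two_group_data(n, d, beta0, beta1, pi=0.8, tau=1.0, rng=None):
    """Draw n samples from the two-group mixture: g_i ~ Ber(pi) picks the
    majority (g = 1) or minority (g = 0) regression vector; the covariate
    law P_X (here N(0, I_d)) is shared by both groups."""
    rng = np.random.default_rng(rng)
    g = rng.binomial(1, pi, size=n)          # group memberships
    X = rng.standard_normal((n, d))          # common marginal P_X
    betas = np.where(g[:, None] == 1, beta1, beta0)   # per-sample coefficients
    y = np.einsum("ij,ij->i", X, betas) + tau * rng.standard_normal(n)
    return X, y, g

# illustrative parameters: delta = beta1 - beta0 = 0.5 * ones(d)
d = 5
beta0 = np.ones(d)
beta1 = beta0 + 0.5
X, y, g = sample_two_group_data(1000, d, beta0, beta1, pi=0.8, tau=0.5, rng=0)
```

With pi = 0.8, roughly 80% of the samples carry the majority-group regression function, matching the mixture in the display above.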

2.2. RANDOM FEATURE MODELS

Here, we consider a random feature regression model (Rahimi and Recht, 2007; Montanari et al., 2020)

f(x, a, Θ) = Σ_{j=1}^N a_j σ(θ_j^⊤ x / √d),    (2.4)
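A minimal sketch of the model in (2.4), assuming a ReLU activation σ and Gaussian inputs and first-layer weights (all parameter values below are illustrative, not from the paper): in the overparameterized regime N > n, one natural choice for the second-layer weights a is the minimum-ℓ2-norm interpolator of the training data.

```python
import numpy as np

def random_features(X, Theta):
    """Feature map sigma(theta_j^T x / sqrt(d)) with ReLU sigma; returns (n, N)."""
    d = X.shape[1]
    return np.maximum(X @ Theta.T / np.sqrt(d), 0.0)

def rf_predict(X, a, Theta):
    """f(x, a, Theta) = sum_j a_j * sigma(theta_j^T x / sqrt(d)), as in (2.4)."""
    return random_features(X, Theta) @ a

rng = np.random.default_rng(0)
n, d, N = 50, 5, 200                    # N > n: overparameterized regime
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Theta = rng.standard_normal((N, d))     # random, untrained first-layer weights
Z = random_features(X, Theta)
a = np.linalg.pinv(Z) @ y               # minimum-norm interpolating solution
```

Since the feature matrix Z generically has full row rank when N > n, this fit interpolates the training responses exactly.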





