HOW DOES OVERPARAMETRIZATION AFFECT PERFORMANCE ON MINORITY GROUPS?

Abstract

The benefits of overparameterization for the overall performance of modern machine learning (ML) models are well known. However, the effect of overparameterization at a more granular level of data subgroups is less understood. Recent empirical studies demonstrate encouraging results: (i) when groups are not known, overparameterized models trained with empirical risk minimization (ERM) perform better on minority groups; (ii) when groups are known, ERM on data subsampled to equalize group sizes yields state-of-the-art worst-group-accuracy in the overparameterized regime. In this paper, we complement these empirical studies with a theoretical investigation of the risk of overparameterized random feature models on minority groups. In a setting in which the regression functions for the majority and minority groups are different, we show that overparameterization always improves minority group performance.

1. INTRODUCTION

Traditionally, the goal of machine learning (ML) is to optimize the average or overall performance of ML models. The relentless pursuit of this goal eventually led to the development of deep neural networks, which achieve state-of-the-art performance in many application areas. A prominent trend in the development of such modern ML models is overparameterization: the models are so complex that they are capable of perfectly interpolating the data. There is a large body of work showing that overparameterization improves the performance of ML models in a variety of settings, e.g. ridgeless least squares (Hastie et al., 2019), random feature models (Belkin et al., 2019; Mei and Montanari, 2019b) and deep neural networks (Nakkiran et al., 2019).

However, as ML models find their way into high-stakes decision-making processes, other aspects of their performance (besides average performance) are coming under scrutiny. One aspect that is particularly relevant to the fairness and safety of ML models is their performance on traditionally disadvantaged demographic groups. There is a troubling line of work showing that ML models that perform well on average may perform poorly on minority groups of training examples. For example, Buolamwini and Gebru (2018) show that commercial gender classification systems, despite achieving low classification error on average, tend to misclassify dark-skinned people. In the same spirit, Wilson et al. (2019) show that pedestrian detection models, despite performing admirably on average, have trouble recognizing dark-skinned pedestrians.

The literature also examines the effect of model size on the generalization error of the worst group. Sagawa et al. (2020) find that increasing model size beyond the threshold of zero training error can have a negative impact on test error for minority groups because the model learns spurious correlations.
They show that subsampling the majority groups is far more successful than upweighting the minority groups in reducing worst-group error. Pham et al. (2021) conduct more extensive experiments to investigate the influence of model size on worst-group error under various neural network architectures and model parameter initialization configurations. They discover that increasing model size either improves or does not harm the worst-group test performance across all settings. Idrissi et al. (2021) recommend using simple methods, i.e. subsampling and reweighting for balanced classes or balanced groups, before venturing into more complicated procedures. They suggest that newly developed robust optimization approaches for worst-group error control (Sagawa et al., 2019; Liu et al., 2021) could be computationally demanding, and that there is no strong (statistically significant) evidence of an advantage over those simple methods.

In this paper, we provide theoretical justification for the empirical results in Sagawa et al. (2020), Pham et al. (2021) and Idrissi et al. (2021) by studying how overparameterization affects the performance of ML models on minority groups in an idealized regression problem. Our investigation shows that overparameterization generally improves or stabilizes the performance of ML models on minority groups. Our main contributions are:

1. We develop a simple two-group model for studying the effects of overparameterization on (sub)groups. This model has parameters controlling signal strength, majority group fraction, overparameterization ratio, discrepancy between the two groups, and error term variance that display a rich set of possible effects.
2. We develop a comprehensive picture of the limiting risk of empirical risk minimization in a high-dimensional asymptotic setting (see Section 3).
3. We show that majority group subsampling provably improves minority group performance in the overparameterized regime.
Some of the technical tools that we develop in the proofs may be of independent interest.

2. PROBLEM SETUP

2.1. DATA GENERATING PROCESS

Let $\mathcal{X} \subset \mathbb{R}^d$ be the feature space and $\mathcal{Y} \subset \mathbb{R}$ be the output space. To keep things simple, we consider a two-group setup. Let $P_0$ and $P_1$ be probability distributions on $\mathcal{X} \times \mathcal{Y}$. We consider $P_0$ and $P_1$ as the distributions of samples from the minority and majority groups, respectively. In the minority group, the samples $(x, y) \in \mathcal{X} \times \mathcal{Y}$ are distributed as

$$x \sim P_X, \qquad y \mid x = \beta_0^\top x + \varepsilon, \qquad \varepsilon \sim N(0, \tau^2), \tag{2.1}$$

where $P_X$ is the marginal distribution of the features, $\beta_0 \in \mathbb{R}^d$ is a vector of regression coefficients, and $\tau^2 > 0$ is the noise level. The normality of the error term in (2.1) is not important; our theoretical results remain valid even if the error term is non-Gaussian. In the majority group, the marginal distribution of the features is identical, but the conditional distribution of the output is different:

$$y \mid x = \beta_1^\top x + \varepsilon, \qquad \varepsilon \sim N(0, \tau^2), \tag{2.2}$$

where $\beta_1 \in \mathbb{R}^d$ is a vector of regression coefficients for the majority group. We note that this difference between the majority and minority groups is a form of concept drift or posterior drift: the marginal distribution of the features is identical, but the conditional distribution of the output is different. We focus on this setting because it not only simplifies our derivations, but also isolates the effects of concept drift between subpopulations through the difference $\delta \triangleq \beta_1 - \beta_0$. If the covariate distributions of the two groups are different, then an overparameterized model may be able to distinguish between the two groups, thus effectively modeling the groups separately. In that sense, by assuming that the covariates are equally distributed, we consider the worst case.

Let $g_i \in \{0, 1\}$ denote the group membership of the $i$-th training sample. The training data $\{(x_i, y_i, g_i)\}_{i=1}^n$ consists of a mixture of samples from the majority and minority groups:

$$g_i \sim \mathrm{Ber}(\pi), \qquad (x_i, y_i) \mid g_i \sim P_{g_i}, \tag{2.3}$$

where $\pi \in [\tfrac{1}{2}, 1]$ is the (expected) proportion of samples from the majority group in the training data. We denote by $n_1$ the sample size of the majority group in the training data.
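The sampling scheme (2.1)-(2.3) can be sketched in a few lines of NumPy. This is only an illustrative sketch: the function names `sample_sphere` and `sample_two_group` are hypothetical, and the choice of $P_X$ as the uniform distribution on the sphere of radius $\sqrt{d}$ is an assumption borrowed from Assumption 3.4 later in the paper (the setup in this section leaves $P_X$ generic).

```python
import numpy as np

def sample_sphere(rng, n, d):
    """Draw n points uniformly from the sphere of radius sqrt(d) in R^d."""
    v = rng.standard_normal((n, d))
    return np.sqrt(d) * v / np.linalg.norm(v, axis=1, keepdims=True)

def sample_two_group(rng, n, d, beta0, beta1, pi, tau):
    """Sample (X, y, g) from the mixture (2.3); g = 1 marks the majority group."""
    g = rng.binomial(1, pi, size=n)                  # group labels, Ber(pi)
    X = sample_sphere(rng, n, d)                     # assumed choice of P_X
    coef = np.where(g[:, None] == 1, beta1, beta0)   # per-sample coefficients
    y = np.sum(X * coef, axis=1) + tau * rng.standard_normal(n)
    return X, y, g

rng = np.random.default_rng(0)
d = 5
beta0 = np.eye(d)[0]   # minority coefficients (illustrative)
beta1 = -beta0         # majority coefficients, angle 180 degrees
X, y, g = sample_two_group(rng, n=200, d=d, beta0=beta0, beta1=beta1, pi=0.8, tau=0.0)
```

Setting `tau=0.0` in the usage lines makes the group-wise linear relationships exact, which is convenient for inspecting the output; any positive value adds Gaussian noise as in (2.1) and (2.2).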

2.2. RANDOM FEATURE MODELS

Here, we consider a random feature regression model (Rahimi and Recht, 2007; Montanari et al., 2020)

$$f(x, a, \Theta) = \sum_{j=1}^N a_j \sigma(\theta_j^\top x/\sqrt{d}), \tag{2.4}$$

where $\sigma(\cdot)$ is a non-linear activation function and $N$ is the number of random features considered in the model. The random feature model is similar to a two-layer neural network, but the weights $\theta_j$ of the hidden layer are fixed at some (usually random) initial values. In other words, a random feature model fits a linear regression model on the response $y$ using the non-linear random features $\sigma(\theta_j^\top x/\sqrt{d})$ instead of the original features $x_j$. This type of model has recently been used (Montanari et al., 2020) to provide theoretical explanations for some of the behaviors seen in large neural network models.

We note that at first we analyzed the usual linear regression model, hoping to understand the effect of overparameterization. Specifically, we considered the model

$$f(x, a) = a^\top x. \tag{2.5}$$

However, we found that the linear model suffers in terms of prediction accuracy for the minority group in the overparameterized regime. In particular, in the case of a high signal-to-noise ratio, the prediction accuracy for the minority group worsens with the overparameterization of the model. This is rather unsatisfactory, as the theoretical findings do not echo the empirical findings in Pham et al. (2021), suggesting that a linear model is not adequate for understanding the deep learning phenomena in this case. We point the readers to Appendix A for the theoretical analysis of overparameterized linear models.
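As a small illustration of (2.4), the sketch below evaluates a random feature model with a frozen hidden layer. The helper names `relu`, `random_features` and `predict` are hypothetical, and the Gaussian draws for `Theta`, `a` and `X` are placeholder values chosen only to make the example runnable.

```python
import numpy as np

def relu(u):
    return np.maximum(u, 0.0)

def random_features(X, Theta, sigma=relu):
    """Map an (n, d) matrix X to the (n, N) matrix with entries sigma(theta_j^T x_i / sqrt(d))."""
    d = X.shape[1]
    return sigma(X @ Theta.T / np.sqrt(d))

def predict(X, a, Theta, sigma=relu):
    """Evaluate f(x, a, Theta) from (2.4) for every row x of X."""
    return random_features(X, Theta, sigma) @ a

rng = np.random.default_rng(0)
d, N = 4, 6
Theta = rng.standard_normal((N, d))  # hidden-layer weights, never trained
a = rng.standard_normal(N)           # output weights, the only fitted parameters
X = rng.standard_normal((3, d))
preds = predict(X, a, Theta)
```

Because `Theta` is held fixed, the prediction is linear in `a`, which is what makes the fitting problem in the next subsection a least squares problem.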

2.3. TRAINING OF THE RANDOM FEATURE MODEL

We consider two ways in which a learner may fit the random feature model: empirical risk minimization (ERM), which does not require group annotations, and subsampling, which does. The availability of group annotations is highly application dependent, and there is a plethora of prior work considering either scenario (Hashimoto et al., 2018; Zhai et al., 2021; Pham et al., 2021; Sagawa et al., 2019; 2020; Idrissi et al., 2021).

2.3.1. EMPIRICAL RISK MINIMIZATION (ERM)

The most common way to fit a predictive model is to minimize the empirical risk on the training data:

$$\hat{a} \in \arg\min_{a \in \mathbb{R}^N} \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \{y_i - f(x_i, a, \Theta)\}^2 = \arg\min_{a \in \mathbb{R}^N} \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \Big\{y_i - \sum_{j=1}^N a_j \sigma(\theta_j^\top x_i/\sqrt{d})\Big\}^2, \tag{2.6}$$

where we recall that the weights $\theta_j$ of the first layer are randomly assigned. The optimization problem (2.6) has a unique solution when the number of random features (neurons) is less than the sample size ($N < n$), and we call this regime underparameterized. When the number of neurons is greater than the sample size ($N > n$), which we call the overparameterized regime, $\hat{a}$ is not unique in (2.6), as there are multiple $a \in \mathbb{R}^N$ that interpolate the training data, i.e., $y_i = \sum_{j=1}^N a_j \sigma(\theta_j^\top x_i/\sqrt{d})$ for $i \in [n]$, resulting in zero training error. In such a situation we set $\hat{a}$ to the specific $a \in \mathbb{R}^N$ which (1) interpolates the training data and (2) has minimum $\ell_2$-norm. This particular solution is known as the minimum norm interpolant (Hastie et al., 2019; Montanari et al., 2020) and is formally defined as

$$\hat{a} \in \arg\min\Big\{\|a\|_2 : y_i = \sum_{j=1}^N a_j \sigma(\theta_j^\top x_i/\sqrt{d}),\ i \in [n]\Big\} = (Z^\top Z)^\dagger Z^\top y, \tag{2.7}$$

where $Z \in \mathbb{R}^{n \times N}$ with $Z_{i,j} = \sigma(\theta_j^\top x_i/\sqrt{d})$ and $A^\dagger$ is the Moore-Penrose inverse of $A$. The minimum norm interpolant (2.7) has an alternative interpretation: it is the limiting solution of the ridge regression problem at vanishing regularization strength,

$$\hat{a} = \lim_{\lambda \to 0+} \hat{a}_\lambda, \qquad \hat{a}_\lambda \in \arg\min_{a \in \mathbb{R}^N} \frac{1}{n} \sum_{i=1}^n \frac{1}{2} \Big\{y_i - \sum_{j=1}^N a_j \sigma(\theta_j^\top x_i/\sqrt{d})\Big\}^2 + \frac{N\lambda}{d} \|a\|_2^2. \tag{2.8}$$

In fact, the conclusion in (2.8) continues to hold in the underparameterized ($N < n$) regime. Hence, we combine the two regimes ($N < n$ and $N > n$) and take (2.8) as the ERM solution. In Hastie et al. (2019) it is known as the ridgeless solution. Finally, given an estimator $\hat{a}$ of the parameter $a$ in the random feature model (2.4), the response of an individual with feature vector $x$ is predicted as

$$\hat{y}(x) = f(x, \hat{a}, \Theta) = \sum_{j=1}^N \hat{a}_j \sigma(\theta_j^\top x/\sqrt{d}). \tag{2.9}$$
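The minimum norm interpolant (2.7) can be computed with a standard least squares routine. The sketch below is illustrative only: the dimensions, the ReLU activation, the spherical draws and the random placeholder responses `y` are all assumptions made for the example. It checks that the fitted weights interpolate the training data and that another interpolant, obtained by adding a null-space direction of $Z$, has a larger norm.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, N = 30, 10, 90  # N > n: overparameterized regime

def sphere(m):
    """Draw m points uniformly from the sphere of radius sqrt(d) in R^d."""
    v = rng.standard_normal((m, d))
    return np.sqrt(d) * v / np.linalg.norm(v, axis=1, keepdims=True)

X, Theta = sphere(n), sphere(N)
Z = np.maximum(X @ Theta.T / np.sqrt(d), 0.0)  # ReLU random features, shape (n, N)
y = rng.standard_normal(n)                      # placeholder responses

# For an underdetermined system, lstsq returns the minimum l2-norm solution.
a_hat = np.linalg.lstsq(Z, y, rcond=None)[0]
max_train_residual = np.max(np.abs(Z @ a_hat - y))

# Any other interpolant differs from a_hat by a direction in the null space of Z.
P_null = np.eye(N) - np.linalg.pinv(Z) @ Z
a_other = a_hat + P_null @ rng.standard_normal(N)
```

Mathematically, `np.linalg.pinv(Z) @ y` gives the same vector as $(Z^\top Z)^\dagger Z^\top y$; working with the pseudoinverse of `Z` directly avoids forming the rank-deficient matrix $Z^\top Z$.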

2.3.2. MAJORITY GROUP SUBSAMPLING

It is known that a model trained by ERM often exhibits poor performance on minority groups (Sagawa et al., 2019). One promising way to circumvent the issue is majority group subsampling: a randomly drawn subset of training sample points is discarded from the majority group so that the majority and minority groups have equal sample sizes in the remaining training data. This emulates the effect of reweighted ERM, where the samples from minority groups are upweighted in the ERM objective. In the underparameterized case, reweighting is preferred over subsampling due to its superior statistical efficiency. In contrast, in the overparameterized case, reweighting may not have the intended effect on the performance of minority groups (Byrd and Lipton, 2019), but subsampling does (Sagawa et al., 2020; Idrissi et al., 2021). Thus we consider subsampling as a way of improving performance on minority groups in the overparameterized regime. We note that subsampling requires knowledge of the groups/sensitive attributes, so its applicability is limited to problems in which the group identities are observed in the training data.
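Majority group subsampling as described above amounts to an index selection step before fitting. A minimal sketch follows, assuming group labels coded as in (2.3) with `g == 1` for the majority; the function name `subsample_majority` is hypothetical.

```python
import numpy as np

def subsample_majority(g, rng):
    """Return sorted indices that keep every minority sample (g == 0) and a
    uniformly random, equally sized subset of the majority samples (g == 1)."""
    minority = np.flatnonzero(g == 0)
    majority = np.flatnonzero(g == 1)
    n_keep = min(minority.size, majority.size)
    kept = rng.choice(majority, size=n_keep, replace=False)
    return np.sort(np.concatenate([minority, kept]))

rng = np.random.default_rng(0)
g = np.array([1, 1, 0, 1, 1, 1, 0, 1, 1, 1])  # 8 majority, 2 minority samples
idx = subsample_majority(g, rng)               # 4 indices: 2 per group
```

The returned indices can then be used to select rows of the feature matrix and the response vector before running ERM on the balanced subset.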

3. RANDOM FEATURE REGRESSION MODEL

In this paper we mainly focus on the minority group performance, which we measure as the mean squared prediction error for the minority samples in the test data:

$$R_0(\hat{a}) = E_{x \sim P_X}\big[\{f(x, \hat{a}, \Theta) - \beta_0^\top x\}^2\big] = E_{x \sim P_X}\Big[\Big\{\sum_{j=1}^N \hat{a}_j \sigma(\theta_j^\top x/\sqrt{d}) - \beta_0^\top x\Big\}^2\Big], \tag{3.1}$$

where $x$ is independent of the training data and has the same distribution as the $x_i$'s. For a generic quantity $a$, the notation $E_a$ means that the expectation is taken over the randomness in $a$. One may alternatively study $E_{(x,y) \sim P_0}[\{f(x, \hat{a}, \Theta) - y\}^2]$ as the mean squared prediction error for the minority group, but for a fixed noise level $\tau$ it has the same behavior as (3.1), as suggested by the following identity:

$$E_{(x,y) \sim P_0}\big[\{f(x, \hat{a}, \Theta) - y\}^2\big] = E_{x \sim P_X}\big[\{f(x, \hat{a}, \Theta) - \beta_0^\top x\}^2\big] + \tau^2.$$

To study the minority risk we first decompose it into several parts and study each of them separately. The first decomposition follows.

Lemma 3.1. We define the bias

$$B(\beta_0, \delta, \tau) = E_{x \sim P_X}\big[\{E_\epsilon[f(x, \hat{a}, \Theta)] - \beta_0^\top x\}^2\big]$$

and the variance

$$V(\beta_0, \delta, \tau) = E_{x \sim P_X}\big[\mathrm{Var}_\epsilon\{f(x, \hat{a}, \Theta)\}\big],$$

where $\epsilon = (\epsilon_1, \dots, \epsilon_n)^\top$ are the errors in the data generating model, i.e. $\epsilon_i = y_i - x_i^\top \beta_{g_i}$, and $\mathrm{Var}_\epsilon$ denotes the variance with respect to $\epsilon$. The minority risk decomposes as

$$E_{x \sim P_X, \epsilon}\big[\{f(x, \hat{a}, \Theta) - \beta_0^\top x\}^2\big] = B(\beta_0, \delta, \tau) + V(\beta_0, \delta, \tau). \tag{3.2}$$

The proof follows by expanding the square in

$$[f(x, \hat{a}, \Theta) - \beta_0^\top x]^2 = \big[\{f(x, \hat{a}, \Theta) - E_\epsilon[f(x, \hat{a}, \Theta)]\} + \{E_\epsilon[f(x, \hat{a}, \Theta)] - \beta_0^\top x\}\big]^2$$

and noting that the cross term has zero mean with respect to $\epsilon$. We now provide an interpretation for the variance term.

The variance term

Firstly we notice that the variance $V(\beta_0, \delta, \tau)$ captures the average variance of the prediction rule $f(x, \hat{a}, \Theta)$ in terms of the errors $\epsilon$ in the training data. To look closely we define $z = \sigma(\Theta x/\sqrt{d})$ as the random feature vector for a new covariate $x$ and realize that the variance of the predictor with respect to $\epsilon$, and thus $V(\beta_0, \delta, \tau)$, does not depend on the values of $\beta_0$ or $\delta$, as shown in the following display:

$$\mathrm{Var}_\epsilon\{f(x, \hat{a}, \Theta)\} = \mathrm{Var}_\epsilon\{z^\top \hat{a}\} = \mathrm{Var}_\epsilon\{z^\top (Z^\top Z)^\dagger Z^\top y\} = \mathrm{Var}_\epsilon\{z^\top (Z^\top Z)^\dagger Z^\top \epsilon\}.$$

Hence we drop $\beta_0$ and $\delta$ from the notation and write the variance term as $V(\tau)$. Since the $\epsilon_i$'s have variance $\tau^2$, the exact dependence of the variance term on $\tau$ is $V(\tau) = \tau^2 V(\tau = 1)$. This also implies that the variance term $V(\tau)$ teases out the contribution of the data noise level $\tau$ in the minority risk.

Figure 1. Left: variances for varying $\gamma$. Middle: biases for different $\gamma$ and $\theta$, where $\theta$ is the angle between $\beta_0$ and $\beta_1$; here the majority group proportion $\pi$ is set at 0.8. Right: biases for different $\pi$'s and $\gamma$'s when $\theta$ is set at $180^\circ$. The solid lines are the theoretical predictions and the points with error bars represent the empirical predictions over 10 random runs.

The bias term

We notice that the bias term $B(\beta_0, \delta, \tau)$ in decomposition (3.2) is the average bias in prediction for the minority group. We further notice that it does not depend on the noise level $\tau$ (neither $E_\epsilon[f(x, \hat{a}, \Theta)]$ nor $\beta_0^\top x$ depends on $\tau$) and denote it as $B(\beta_0, \delta)$ by dropping $\tau$. Next, we separate out the contributions of $\beta_0$ and $\delta$ in the bias term via a decomposition. For this purpose we recall that $Z = \{z_{i,j}\}_{i \in [n], j \in [N]} \in \mathbb{R}^{n \times N}$ is the matrix of random features and define the following: (1) $X = [x_1, \dots, x_n]^\top$ is the covariate matrix of the training data, (2) $X_1$ is the covariate matrix consisting only of the samples from the majority group, and (3) $Z_1$ is the random feature matrix for the majority group. The decomposition for the bias term follows:

$$B(\beta_0, \delta) = E_{x \sim P_X}\big[\{(z^\top (Z^\top Z)^\dagger Z^\top X - x^\top)\beta_0\}^2 + \{z^\top (Z^\top Z)^\dagger Z_1^\top X_1 \delta\}^2 + 2 (z^\top (Z^\top Z)^\dagger Z^\top X - x^\top)\beta_0\, \delta^\top X_1^\top Z_1 (Z^\top Z)^\dagger z\big]. \tag{3.3}$$

Next, we make some assumptions on the vectors $\beta_0$ and $\delta$.

Assumption 3.2. There exist some $F_\beta, F_\delta > 0$ and $F_{\beta,\delta} \in \mathbb{R}$ such that the following holds: $\|\beta_0\|_2^2 = F_\beta^2$, $\|\delta\|_2^2 = F_\delta^2$ and $\langle \beta_0, \delta \rangle = F_{\beta,\delta}$.
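The observation in the discussion of the variance term, that the fluctuation of the prediction around its mean over $\epsilon$ does not involve $\beta_0$ or $\delta$, can be checked numerically for a fixed design. The sketch below is illustrative: the dimensions, ReLU activation and spherical draws are assumptions, and the closed form $\tau^2 \|w\|_2^2$ with $w = (Z^\dagger)^\top z$ is the standard variance of a linear combination of i.i.d. noise terms, stated here for illustration rather than taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, N, tau = 40, 8, 120, 0.5

def sphere(m):
    """Draw m points uniformly from the sphere of radius sqrt(d) in R^d."""
    v = rng.standard_normal((m, d))
    return np.sqrt(d) * v / np.linalg.norm(v, axis=1, keepdims=True)

def relu(u):
    return np.maximum(u, 0.0)

X, Theta, x_new = sphere(n), sphere(N), sphere(1)
g = rng.binomial(1, 0.8, size=n)                  # 1 = majority group
Z = relu(X @ Theta.T / np.sqrt(d))                # training features, (n, N)
z = relu(x_new @ Theta.T / np.sqrt(d)).ravel()    # features of the test point
w = np.linalg.pinv(Z).T @ z                       # prediction is f(x) = w^T y
eps = tau * rng.standard_normal(n)                # one fixed noise draw

def noise_part(beta0, delta):
    """f(x) - E_eps[f(x)] for the given coefficients and the fixed noise eps."""
    mean_y = X @ beta0 + g * (X @ delta)          # E[y | X, g]
    return w @ (mean_y + eps) - w @ mean_y

dev_a = noise_part(rng.standard_normal(d), rng.standard_normal(d))
dev_b = noise_part(np.zeros(d), np.zeros(d))
var_closed_form = tau**2 * (w @ w)                # Var_eps{f(x)} given X, Theta, x
```

Both `dev_a` and `dev_b` equal `w @ eps`, whatever coefficients are used, and the conditional variance scales as $\tau^2$, in line with $V(\tau) = \tau^2 V(\tau = 1)$.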
In Assumption 3.2, $F_\beta$ and $F_\delta$ are the $\ell_2$ signal strengths of $\beta_0$ and $\delta$. The decomposition in the next lemma separates out the contributions of $\beta_0$ and $\delta$ in $B(\beta_0, \delta)$.

Lemma 3.3. Define the following:

$$B_\beta = E_{x \sim P_X}\big[\|z^\top (Z^\top Z)^\dagger Z^\top X - x^\top\|_2^2 / d\big], \qquad B_\delta = E_{x \sim P_X}\big[\|z^\top (Z^\top Z)^\dagger Z_1^\top X_1\|_2^2 / d\big],$$

$$C_{\beta,\delta} = 2 E_{x \sim P_X}\big[(z^\top (Z^\top Z)^\dagger Z^\top X - x^\top) X_1^\top Z_1 (Z^\top Z)^\dagger z\big].$$

Then we have

$$B(\beta_0, \delta) = F_\beta^2 B_\beta + F_\delta^2 B_\delta + F_{\beta,\delta} C_{\beta,\delta}.$$

In Lemma 3.3 we see that $F_\delta^2 B_\delta + F_{\beta,\delta} C_{\beta,\delta}$ quantifies the contribution of the model misspecification in the two-group model, and we call this the misspecification error term. For the other terms in the decomposition we utilize the results of Montanari et al. (2020), who studied the effect of overparameterization on the overall performance (i.e., in a single-group model); the misspecification error term therefore does not appear in their analysis. Our theoretical contribution in this paper is the study of the exact asymptotics of the misspecification error terms.

Before we present the asymptotic results we introduce some assumptions and definitions. First we make a distributional assumption on the covariates $x_i$ and the random weights $\theta_j$, which allows an easier analysis of the minority risk.

Assumption 3.4. We assume that $\{x_i\}_{i=1}^n$ and $\{\theta_j\}_{j=1}^N$ are i.i.d. $\mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$, i.e., uniformly distributed over the surface of a $d$-dimensional Euclidean ball of radius $\sqrt{d}$ centered at the origin.

Next we assume that the activation function $\sigma$ has some properties, which are satisfied by commonly used activation functions, e.g. ReLU and sigmoid activations.

Assumption 3.5. The activation function $\sigma : \mathbb{R} \to \mathbb{R}$ is weakly differentiable with weak derivative $\sigma'$, and for some constants $c_0, c_1 > 0$ it holds that $|\sigma(u)|, |\sigma'(u)| \le c_0 e^{c_1 |u|}$.

Both Assumptions 3.4 and 3.5 also appear in Montanari et al. (2020). Below we define some quantities which we require to describe the asymptotic results.

Definition 3.6.

1. For the constants

$$\mu_0 = E\{\sigma(G)\}, \qquad \mu_1 = E\{G\sigma(G)\}, \qquad \mu_\star^2 = E\{\sigma(G)^2\} - \mu_0^2 - \mu_1^2, \tag{3.4}$$

where the expectation is taken with respect to $G \sim N(0, 1)$, we assume that $0 < \mu_0, \mu_1, \mu_\star < \infty$ and define $\xi = \mu_1/\mu_\star$.

2. Recall that $N/d \to \psi_1 > 0$, $n/d \to \psi_2 > 0$ and $n_1/n \to \pi \in [1/2, 1]$. We set $\psi = \min\{\psi_1, \psi_2\}$ and define

$$\chi = -\frac{[(\psi\xi^2 - \xi^2 - 1)^2 + 4\xi^2\psi]^{1/2} + (\psi\xi^2 - \xi^2 - 1)}{2\xi^2}.$$

3. We furthermore define the following:

$$E_0^\star = -\chi^5\xi^6 + 3\chi^4\xi^4 + (\psi_1\psi_2 - \psi_1 - \psi_2 + 1)\chi^3\xi^6 - 2\chi^3\xi^4 - 3\chi^3\xi^2 + (\psi_1 + \psi_2 - 3\psi_1\psi_2 + 1)\chi^2\xi^4 + 2\chi^2\xi^2 + \chi^2 + 3\psi_1\psi_2\chi\xi^2 - \psi_1\psi_2,$$

$$E_1^\star = \psi_2\chi^3\xi^4 - \psi_2\chi^2\xi^2 + \psi_1\psi_2\chi\xi^2 - \psi_1\psi_2,$$

$$E_2^\star = \chi^5\xi^6 - 3\chi^4\xi^4 + (\psi_1 - 1)\chi^3\xi^6 + 2\chi^3\xi^4 + 3\chi^3\xi^2 + (-\psi_1 - 1)\chi^2\xi^4 - 2\chi^2\xi^2 - \chi^2.$$

Equipped with the assumptions and the definitions, we are now ready to state the asymptotics for each of the terms in the minority group prediction error. The following lemma, which states the asymptotic results for $B_\beta$ and $V(\tau)$, has been proven in Montanari et al. (2020).

Lemma 3.7 (Theorem 5.7, Montanari et al. (2020)). Let Assumptions 3.2, 3.4 and 3.5 hold. Define $B^\star = E_1^\star/E_0^\star$ and $V^\star = E_2^\star/E_0^\star$, where $E_0^\star$, $E_1^\star$ and $E_2^\star$ are defined in Definition 3.6. Then the following hold:

$$\lim_{d \to \infty} E[B_\beta] = B^\star, \qquad \lim_{d \to \infty} E[V(\tau)] = \tau^2 V^\star,$$

where the expectation $E$ is taken over $\{x_i\}_{i=1}^n$, $\{\theta_j\}_{j=1}^N$, $\beta_0$ and $\delta$.

Trend in the variance term

We again recall that the variance term $V(\tau)$ teases out the contribution of the noise $\epsilon$ in the minority group prediction error. Here, the trend that we are most interested in is the effect of overparameterization: how does the asymptotic $V^\star$ behave as a function of $\gamma = \psi_1/\psi_2 = \lim_{d \to \infty} (N/n)$ when $\psi_2 = \lim_{d \to \infty} (n/d)$ is held fixed and $\gamma > 1$? Although it is difficult to understand such trends from the mathematical definition of $V^\star$, we notice in Figure 1 (left) that the variance decreases in the overparameterized regime ($\gamma > 1$) with increasing model size ($\gamma$).
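The limits $B^\star$ and $V^\star$ in Lemma 3.7 are closed-form functions of $(\xi, \psi_1, \psi_2)$ and can be evaluated directly. The sketch below transcribes Definition 3.6 term by term; the function names `chi` and `bias_variance_limits` are hypothetical, and the inputs `xi=1.0`, `psi1=2.0`, `psi2=1.0` are arbitrary illustrative values rather than settings used in the paper.

```python
import numpy as np

def chi(xi, psi1, psi2):
    """chi from Definition 3.6 (item 2), with psi = min(psi1, psi2)."""
    psi = min(psi1, psi2)
    t = psi * xi**2 - xi**2 - 1.0
    return -(np.sqrt(t**2 + 4.0 * xi**2 * psi) + t) / (2.0 * xi**2)

def bias_variance_limits(xi, psi1, psi2):
    """Return (B_star, V_star) = (E1/E0, E2/E0) as defined in Lemma 3.7."""
    c = chi(xi, psi1, psi2)
    p = psi1 * psi2
    e0 = (-c**5 * xi**6 + 3 * c**4 * xi**4
          + (p - psi1 - psi2 + 1) * c**3 * xi**6
          - 2 * c**3 * xi**4 - 3 * c**3 * xi**2
          + (psi1 + psi2 - 3 * p + 1) * c**2 * xi**4
          + 2 * c**2 * xi**2 + c**2 + 3 * p * c * xi**2 - p)
    e1 = psi2 * c**3 * xi**4 - psi2 * c**2 * xi**2 + p * c * xi**2 - p
    e2 = (c**5 * xi**6 - 3 * c**4 * xi**4 + (psi1 - 1) * c**3 * xi**6
          + 2 * c**3 * xi**4 + 3 * c**3 * xi**2
          - (psi1 + 1) * c**2 * xi**4 - 2 * c**2 * xi**2 - c**2)
    return e1 / e0, e2 / e0

chi_val = chi(xi=1.0, psi1=2.0, psi2=1.0)
B_star, V_star = bias_variance_limits(xi=1.0, psi1=2.0, psi2=1.0)
```

Sweeping `psi1` at fixed `psi2` traces the limiting variance as a function of $\gamma = \psi_1/\psi_2$, which is the kind of curve described for the left panel of Figure 1.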
Next we study the asymptotics of $B_\delta$ and $C_{\beta,\delta}$ which, as discussed in Lemma 3.3, quantify the part of the minority group prediction risk that comes from the model misspecification in the two-group model (2.3).

Lemma 3.8 (Misspecification error terms). Let Assumptions 3.2, 3.4 and 3.5 hold. Following the definitions of $B^\star$ and $V^\star$ in Lemma 3.7, we further define $\Psi_2^\star = B^\star - 1 + 2(\chi + \psi)$, where $\chi$ and $\psi$ are defined in Definition 3.6. Then

$$\lim_{d \to \infty} E[B_\delta] = M_1^\star \triangleq \pi(1 - \pi)V^\star + \pi^2 \Psi_2^\star, \qquad \lim_{d \to \infty} E[C_{\beta,\delta}] = M_2^\star \triangleq \pi(B^\star - 1 + \Psi_2^\star),$$

where, as in Lemma 3.7, the expectation $E$ is taken over $\{x_i\}_{i=1}^n$, $\{\theta_j\}_{j=1}^N$, $\beta_0$ and $\delta$.

Below we describe the observed trends in the bias term of the minority risk.

Trends in the bias term

Here, we study the asymptotic trend of the bias term $B(\beta_0, \delta)$. We see from the bias decomposition in Lemma 3.3 and the term-by-term asymptotics in Lemmas 3.7 and 3.8 that the bias term asymptotically converges to $F_\beta^2 B^\star + F_\delta^2 M_1^\star + F_{\beta,\delta} M_2^\star$. We mainly study the trend of this bias term in the overparameterized regime with respect to three parameters: (1) the overparameterization parameter $\gamma$, (2) the majority group proportion $\pi$ and (3) the angle between the vectors $\beta_0$ and $\beta_1$, which is denoted by $\theta$. The middle and right panels in Figure 1 give an overall understanding of the trend of the bias term with respect to these parameters. In both panels we set $\|\beta_1\|_2 = \|\beta_0\|_2 = 1$. In the middle panel, we see that for a fixed angle $\theta$, the bias generally decreases with overparameterization. In contrast, for a fixed level of overparameterization $\gamma$, the bias increases as the angle $\theta$ increases. This is not surprising: intuitively, as the angle $\theta$ grows, the separation between $\beta_0$ and $\beta_1$ becomes more prominent. As a consequence, the accuracy of the model worsens, and this is reflected in the plot of the bias term.
A similar trend can also be observed when the majority group proportion $\pi$ varies for a fixed angle $\theta$ (rightmost panel of Figure 1). Here we set $\beta_1 = -\beta_0$, i.e., the angle $\theta = 180^\circ$. As expected, the bias increases with $\pi$ for a fixed $\gamma$ because of the higher imbalance in the population, whereas for a fixed $\pi$ the bias decreases with increasing $\gamma$. Thus, in summary, we can conclude that overparameterization generally improves the worst-group performance of the model for fixed instances of $\theta$ and $\pi$.

The following theorem summarizes the asymptotic results in Lemmas 3.7 and 3.8 and states the asymptotics of the minority group prediction error.

Theorem 3.9. Let Assumptions 3.2, 3.4 and 3.5 hold. Following the definitions of $B^\star$, $V^\star$, $M_1^\star$ and $M_2^\star$ in Lemmas 3.7 and 3.8, we have the term-by-term asymptotics

$$\lim_{d \to \infty} E[B_\beta] = B^\star, \quad \lim_{d \to \infty} E[V(\tau)] = \tau^2 V^\star, \quad \lim_{d \to \infty} E[B_\delta] = M_1^\star, \quad \lim_{d \to \infty} E[C_{\beta,\delta}] = M_2^\star,$$

and the minority group prediction error has the following asymptotics:

$$E[R_0(\hat{a})] = F_\beta^2 E[B_\beta] + F_\delta^2 E[B_\delta] + F_{\beta,\delta} E[C_{\beta,\delta}] + \tau^2 E[V(\tau = 1)] \to F_\beta^2 B^\star + F_\delta^2 M_1^\star + F_{\beta,\delta} M_2^\star + \tau^2 V^\star.$$

A complete proof of the theorem is provided in Appendix B.

Trend in minority group prediction error for ERM

To understand the trends in the minority group prediction error we combine the trends in the bias and the variance terms. We again recall that we are interested in the effect of overparameterization, i.e. growing $\gamma$ when $\gamma > 1$. In the discussions after Lemmas 3.7 and 3.8 (and in Figure 1) we notice that both the bias and variance terms decrease or do not increase with growing overparameterization for any choice of the majority group proportion $\pi$ and the angle $\theta$ between $\beta_0$ and $\beta_1$. Since the minority risk decomposes into bias and variance terms (3.2), similar trends hold for the overall minority risk. In other words, overparameterization improves or does not harm the minority risk. These trends agree with the empirical findings in Pham et al. (2021).

3.1. LIMITING RISK FOR MAJORITY GROUP SUBSAMPLING

Though overparameterization of ML models does not harm the minority risk for ERM, ERM-trained models generally still incur a large minority risk (Pham et al., 2021). To improve the risk on the minority group, Sagawa et al. (2020) and Idrissi et al. (2021) recommend using majority group subsampling before fitting ERM and show that it achieves state-of-the-art minority risks. Here, we complement their empirical findings with a theoretical study of the minority risk for random feature models in the two-group regression setup (2.3). We recall from subsection 2.3.2 that subsampling discards a randomly chosen subsample of size $n_1 - n_0$ from the majority group (of size $n_1$) to match the sample size ($n_0$) of the minority group, and then fits an ERM over the remaining sample points. Here, we repurpose the asymptotic results for ERM (Theorem 3.9) to describe the asymptotic behavior of subsampling by carefully reviewing the changes in the parameters $\psi_1$, $\psi_2$ and $\pi$. First we notice that the total sample size after subsampling is $n_S = 2n_0$ (both the majority and minority groups have sample size $n_{0,S} = n_{1,S} = n_0$), which means the new majority sample proportion is $\pi_S = n_{1,S}/n_S = 1/2$. We further notice that $\psi_1$ remains unchanged, i.e., $\psi_{1,S} = \psi_1$, whereas $\psi_2 = \lim_{d \to \infty} n/d$ changes to

$$\psi_{2,S} = \lim_{d \to \infty} n_S/d = \lim_{d \to \infty} 2n_0/d = \lim_{d \to \infty} 2(n_0/n) \times (n/d) = 2(1 - \pi)\psi_2.$$

The subsampling setup is underparameterized when $n_S = 2n_0 > N$, or equivalently $\gamma_S = \psi_{1,S}/\psi_{2,S} = \gamma/\{2(1 - \pi)\} < 1$, and overparameterized when $\gamma_S > 1$. The rest of the setup in subsampling remains exactly the same as in the ERM setup, and the results in Lemmas 3.7, 3.8 and Theorem 3.9 continue to hold with the new parameters $\psi_{1,S}$, $\psi_{2,S}$ and $\pi_S$.

ERM vs subsampling

We compare the asymptotic risks of ERM and majority group subsampling in the overparameterized setting. Figure 2 shows the trend of the error difference between subsampling and ERM.
Similar to the previous discussion, we denote by $\theta$ the angle between $\beta_1$ and $\beta_0$, and set $\|\beta_1\|_2 = \|\beta_0\|_2 = 1$. In the left panel, we fix the majority group proportion $\pi = 0.8$ and vary $\theta$. It is generally observed that subsampling improves the worst-group performance over ERM, which is consistent with the empirical findings in Sagawa et al. (2020) and Idrissi et al. (2021). Moreover, the improvement is prominent when the group regression coefficients are well separated, i.e., $\theta$ is large. Intuitively, when $\theta$ is very large, the distinction between the majority and minority groups is more prominent. As a consequence, the effect of under-representation of the minority group becomes more relevant, which hurts the worst-group performance of ERM, whereas subsampling alleviates this effect by equalizing the group sizes. On the other hand, the effect of under-representation becomes less severe when $\beta_0$ and $\beta_1$ are close, i.e., $\theta$ is small. In this case, we see that subsampling does not improve the worst-group performance significantly. In fact, for larger values of $\gamma$, subsampling performs slightly worse than ERM. This is again not surprising: for smaller separation, the population structure of the two groups becomes more homogeneous, so full-sample ERM should deliver better performance than subsampling. A similar trend is also observed in the right panel, where we fix $\theta = 180^\circ$ and vary $\pi$. Again, the improvement due to subsampling is most prominent for larger values of $\pi$, which correspond to greater imbalance in the data. Unlike the previous case, subsampling always helps in terms of worst-group performance, but the improvement becomes less noticeable with decreasing $\pi$ as overparameterization grows.
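The parameter changes induced by subsampling in this subsection ($\psi_{1,S} = \psi_1$, $\psi_{2,S} = 2(1 - \pi)\psi_2$, $\pi_S = 1/2$) can be collected in a small helper. The function name `subsampled_params` and the numerical inputs are hypothetical and serve only as a worked example.

```python
def subsampled_params(psi1, psi2, pi):
    """Map the ERM parameters (psi1, psi2, pi) to their values after majority
    group subsampling: psi1 is unchanged, psi2 shrinks by the factor 2 * (1 - pi),
    and the majority proportion becomes 1/2."""
    psi2_s = 2.0 * (1.0 - pi) * psi2
    return {"psi1": psi1, "psi2": psi2_s, "pi": 0.5, "gamma": psi1 / psi2_s}

# Worked example: gamma = psi1 / psi2 = 2 before subsampling, pi = 0.75.
params = subsampled_params(psi1=6.0, psi2=3.0, pi=0.75)
```

In this example the overparameterization ratio doubles from $\gamma = 2$ to $\gamma_S = \gamma/\{2(1 - \pi)\} = 4$, so a problem that is overparameterized under ERM stays overparameterized after subsampling.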

4. RELATED WORK

Overparameterized ML models

The benefits of overparameterization for the average or overall performance of ML models are well studied. This phenomenon is known as "double descent", and it asserts that the (overall) risk of ML models decreases as model complexity increases past a certain point. This behavior has been studied empirically (Advani and Saxe, 2017; Belkin et al., 2018; Nakkiran et al., 2019; Yang et al., 2020) and theoretically (Hastie et al., 2019; Montanari et al., 2020; Mei and Montanari, 2019b; Deng et al., 2020). All these works focus on the average performance of ML models, while we focus on the performance of ML models on minority groups. The most closely related paper in this line of work is Pham et al. (2021), which empirically studies the effect of model overparameterization on minority risk and finds that overparameterization generally helps or does not harm minority group performance.

Improving performance on minority groups

There is a long line of work on improving the performance of models on minority groups. There are methods based on reweighting/subsampling (Shimodaira, 2000; Cui et al., 2019) and (group) distributionally robust optimization (DRO) (Hashimoto et al., 2018; Duchi et al., 2020; Sagawa et al., 2019). In the overparameterized regime, methods based on reweighting are not very effective (Byrd and Lipton, 2019), while subsampling has been empirically shown to improve performance on the minority groups (Sagawa et al., 2020; Idrissi et al., 2021). Our theoretical results confirm the efficacy of subsampling for improving performance on the minority groups and demonstrate that it benefits from overparameterization. The group fairness literature proposes many definitions of fairness (Chouldechova and Roth, 2018), some of which require similar accuracy on the minority and majority groups (Hardt et al., 2016). Methods for achieving group fairness typically perform ERM subject to some fairness constraints (Agarwal et al., 2018). The interplay between these constraints and overparameterization is beyond the scope of this work.

5. SUMMARY AND DISCUSSION

In this paper, we studied the performance of overparameterized ML models on minority groups. We set up a two-group model and derived the limiting risk of random feature models in the minority group (see Theorem 3.9). Our theoretical findings generally show that overparameterization improves or does not harm the minority risk of ERM, which complements the findings in Pham et al. (2021). We also show theoretically that majority group subsampling is an effective way of improving the performance of overparameterized models on minority groups. This confirms the empirical results on subsampling overparameterized neural networks given in Sagawa et al. (2020) and Idrissi et al. (2021).

What about classification?

We observe that the trend of the minority group generalization error in the random feature regression model is qualitatively distinct from its classification counterpart (see Appendix D for the model details). Figure 3 demonstrates that although overparameterization always reduces the majority error, whether or not it benefits the minority group depends on the angle between the coefficients of the two groups: overparameterization helps the minority group generalization error if the angle is acute, while it harms the minority error if the angle is obtuse. This is different from the trends in Figure 1. The disagreement can be explained in part by the fact that the signal-to-noise ratio in classification problems is difficult to tune explicitly because the signal and noise are coupled by the model's setup. The effect of overparameterization on minority groups under the random feature classification model remains an open question. Recent developments of the convex Gaussian min-max theorem (CGMT) and related techniques (Montanari et al., 2020; Bosch et al., 2022) could be technically useful.

Practical implications

One of the main conclusions of this paper is that overparameterization generally helps or does not harm the worst-group performance of ERM. In other words, using overparameterized models is unlikely to magnify disparities across groups. However, we warn practitioners that overparameterization should not be confused with a method for improving minority group performance. Dedicated methods such as group distributionally robust optimization (Sagawa et al., 2019) and subsampling (Sagawa et al., 2020; Idrissi et al., 2021) are far more effective. In particular, subsampling improves worst-group performance and benefits from overparameterization, as demonstrated in prior empirical studies (Sagawa et al., 2020; Idrissi et al., 2021) and in this paper.
A RIDGELESS REGRESSION

Similar to Section 3, here we analyze the linear regression model in the overparameterized regime. The learner fits a linear model $f(x) \triangleq \beta^\top x$ to the training data, where $\beta \in \mathbb{R}^d$ is a vector of regression coefficients. We note that although the linear model is well-specified for each group, it is misspecified for the mixture of two groups. The learner fits a linear model to the training data in one of two ways: (1) empirical risk minimization (ERM) and (2) subsampling.

Empirical risk minimization (ERM)

The most common way of fitting a linear model is ordinary least squares (OLS):

$$\hat\beta_{\mathrm{OLS}} \in \arg\min_{\beta\in\mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}\,(y_i - \beta^\top x_i)^2 = \arg\min_{\beta\in\mathbb{R}^d} \frac{1}{2n}\,\|y - X\beta\|_2^2,$$

where the rows of $X \in \mathbb{R}^{n\times d}$ are the $x_i$'s and the entries of $y \in \mathbb{R}^n$ are the $y_i$'s. We note that ERM does not use the sensitive attribute (even during training), so it is suitable for problems in which the training data does not include the sensitive attribute. In the overparameterized case ($n < d$), $\hat\beta_{\mathrm{OLS}}$ is not well-defined because there are many linear models that can interpolate the training data. In this case, we consider the minimum $\ell_2$ norm (min-norm) linear model with zero training error:

$$\hat\beta_{\min} \in \arg\min\{\|\beta\|_2 \mid y = X\beta\} = (X^\top X)^\dagger X^\top y,$$

where $(X^\top X)^\dagger$ denotes the Moore-Penrose pseudoinverse of $X^\top X$. The min-norm least squares estimator is also known as the ridgeless least squares estimator because $\hat\beta_{\min} = \lim_{\lambda\searrow 0}\hat\beta_\lambda$, where

$$\hat\beta_\lambda \in \arg\min_{\beta\in\mathbb{R}^d} \frac{1}{n}\,\|y - X\beta\|_2^2 + \lambda\|\beta\|_2^2.$$

We note that if $X$ has full column rank (i.e. $X^\top X$ is non-singular), then $\hat\beta_{\min}$ is equivalent to $\hat\beta_{\mathrm{OLS}}$.

Majority group subsampling

It is known that models trained by ERM may perform poorly on minority groups (Sagawa et al., 2019). One promising way of improving performance on minority groups is majority group subsampling: randomly discarding training samples from the majority group until the two groups are equally represented in the remaining training data. This achieves an effect similar to that of upweighting the loss on minority groups in the underparameterized regime. In the underparameterized case ($n > d$), reweighting is more statistically efficient (it does not discard training samples), so subsampling is rarely used. On the other hand, in the overparameterized case, reweighting may not have the intended effect on the performance of overparameterized models in minority groups (Byrd and Lipton, 2019), but Sagawa et al. (2020) show that subsampling does.
Thus we also consider subsampling here as a way of improving performance on minority groups. We note that subsampling, like reweighting, requires knowledge of the sensitive attribute, so its applicability is limited to problems in which the sensitive attribute is observed during training.
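The two fitting procedures described above can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' code: the function names, the group encoding (`g == 1` for the majority group) and the toy data are assumptions made for the example.

```python
import numpy as np

def min_norm_least_squares(X, y):
    """Minimum l2-norm least squares fit, beta = X^+ y = (X^T X)^+ X^T y."""
    return np.linalg.pinv(X) @ y

def subsample_majority(X, y, g, rng):
    """Randomly discard majority rows (g == 1) until both groups have equal size.

    Assumes the majority group is at least as large as the minority group.
    """
    idx_min = np.flatnonzero(g == 0)
    idx_maj = np.flatnonzero(g == 1)
    keep_maj = rng.choice(idx_maj, size=len(idx_min), replace=False)
    keep = np.sort(np.concatenate([idx_min, keep_maj]))
    return X[keep], y[keep], g[keep]

# Toy overparameterized example (n < d) with 8 minority and 32 majority samples.
rng = np.random.default_rng(0)
n, d = 40, 100
g = np.array([0] * 8 + [1] * 32)
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

beta_min = min_norm_least_squares(X, y)         # ERM in the overparameterized regime
X_sub, y_sub, g_sub = subsample_majority(X, y, g, rng)
beta_sub = min_norm_least_squares(X_sub, y_sub)  # subsampling followed by the same fit
```

Note that group labels are only needed by `subsample_majority`; the ERM fit never touches `g`.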

A.2 LIMITING RISK OF ERM AND SUBSAMPLING

We are concerned with the performance of the fitted models on the minority group. In this paper, we measure performance on the minority group with the mean squared prediction error on a test sample from the minority group:

$$R_0(\beta) \triangleq \mathbb{E}\big[((\beta - \beta_0)^\top x)^2 \mid X\big] = \mathbb{E}\big[(\beta - \beta_0)^\top \Sigma\,(\beta - \beta_0) \mid X\big]. \tag{A.1}$$

We note that (A.1) is conditional on $X$; i.e. the expectation is with respect to the error terms in the training data and the features of the test sample. Although it is hard to evaluate $R_0$ exactly for finite $n$ and $d$, it is possible to approximate it with its limit in a high-dimensional asymptotic setup. In this section, we consider an asymptotic setup in which $n, d \to \infty$ in such a way that $d/n \to \gamma \in (0, \infty)$. If $\gamma < 1$, the problem is underparameterized; if $\gamma > 1$, it is overparameterized. To keep things simple, we first present our results in the special case of isotropic features. We start by formally stating the assumptions on the distribution of the feature vector $P_X$.

Assumption A.1. The feature vector $x \sim P_X$ has independent entries with zero mean, unit variance, and finite $(8 + \epsilon)$-th moment for some $\epsilon > 0$.
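As a rough illustration of (A.1) in the isotropic case, where the risk reduces to a squared Euclidean distance, the following sketch simulates two-group data and evaluates the minority risk of a min-norm fit. The data model used here (Gaussian features, $y_i = x_i^\top \beta_{g_i} + \epsilon_i$ with noise standard deviation `tau`) and all function names are assumptions for the example; a single fit gives one noisy draw, and averaging over noise draws would approximate the conditional expectation.

```python
import numpy as np

def simulate_two_groups(n, d, pi, beta0, beta1, tau, rng):
    """Draw n samples; g = 1 marks the majority group (probability pi)."""
    g = (rng.random(n) < pi).astype(int)
    X = rng.standard_normal((n, d))                  # isotropic features (assumed Gaussian)
    coef = np.where(g[:, None] == 1, beta1, beta0)   # per-sample coefficients, shape (n, d)
    y = np.sum(X * coef, axis=1) + tau * rng.standard_normal(n)
    return X, y, g

def minority_risk_isotropic(beta_hat, beta0):
    """(beta_hat - beta0)^T Sigma (beta_hat - beta0) with Sigma = I_d."""
    diff = np.asarray(beta_hat, dtype=float) - np.asarray(beta0, dtype=float)
    return float(diff @ diff)

rng = np.random.default_rng(0)
d, n = 50, 200                       # gamma = d / n = 0.25 (underparameterized)
beta0 = np.ones(d) / np.sqrt(d)      # minority coefficients
beta1 = -beta0                       # majority coefficients at angle 180 degrees
X, y, g = simulate_two_groups(n, d, 0.8, beta0, beta1, tau=0.5, rng=rng)
beta_hat = np.linalg.pinv(X) @ y
risk = minority_risk_isotropic(beta_hat, beta0)
```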

A.3 EMPIRICAL RISK MINIMIZATION

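We start by decomposing the minority risk (A.1) of the OLS estimator $\hat\beta_{\mathrm{OLS}}$ and the ridgeless least squares estimator $\hat\beta_{\min}$. Let $n_0$ and $n_1$ be the number of training examples from the minority and the majority groups respectively. Without loss of generality, we arrange the training samples so that the first $n_0$ examples are those from the minority group:

$$X = \begin{bmatrix} X_0 \\ X_1 \end{bmatrix}, \qquad y = \begin{bmatrix} y_0 \\ y_1 \end{bmatrix}.$$

Lemma A.2 (ERM minority risk decomposition). Let $\hat\beta$ be either $\hat\beta_{\mathrm{OLS}}$ or $\hat\beta_{\min}$. We have

$$\begin{aligned} R_0(\hat\beta) &= \mathbb{E}\big[(\hat\beta - \beta_0)^\top(\hat\beta - \beta_0) \mid X\big] \\ &= \beta_0^\top (I_d - \Pi_X)\beta_0 + \frac{\tau^2}{n}\mathrm{Tr}(\hat\Sigma^\dagger) + \delta^\top \Big( \begin{bmatrix} 0_{n_0\times d} \\ X_1 \end{bmatrix}^\top \frac{X}{n} \Big) (\hat\Sigma^\dagger)^2 \Big( \frac{X^\top}{n} \begin{bmatrix} 0_{n_0\times d} \\ X_1 \end{bmatrix} \Big) \delta \\ &\quad + 2\,\delta^\top \Big( \begin{bmatrix} 0_{n_0\times d} \\ X_1 \end{bmatrix}^\top \frac{X}{n} \Big) (\hat\Sigma^\dagger)^2\, \hat\Sigma\, \beta_0, \end{aligned} \tag{A.2}$$

where $\Pi_X \triangleq X^\dagger X$ is the projector onto $\mathrm{ran}(X^\top)$, $\delta \triangleq \beta_0 - \beta_1$, and $\hat\Sigma \triangleq \frac{1}{n} X^\top X$ is the sample covariance matrix of the features in the training data.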

Inductive bias

We recognize the first term on the right side of (A.2) as a squared bias term. This term reflects the inductive bias of ridgeless least squares: the estimator is orthogonal to $\ker(X)$, so it cannot capture the part of $\beta_0$ in $\ker(X)$. We note that this term is only non-zero in the overparameterized regime: $\hat\beta_{\mathrm{OLS}}$ has no inductive bias.

Lemma A.3 (Hastie et al. (2019), Lemma 2). In addition to Assumption A.1, assume $\|\beta_0\|_2^2 = s_0$ for all $n, d$. We have

$$\beta_0^\top (I_d - \Pi_X)\beta_0 \xrightarrow{p} s_0 \Big\{ \Big(1 - \frac{1}{\gamma}\Big) \vee 0 \Big\}$$

as $n, d \to \infty$, $d/n \to \gamma$.

Variance

The second term on the right side of (A.2) is a variance term. The limit of this term in the high-dimensional asymptotic setting is known.

Lemma A.4 (Hastie et al. (2019), Theorem 1, Lemma 3). Under Assumption A.1, we have

$$\frac{\tau^2}{n}\mathrm{Tr}(\hat\Sigma^\dagger) \xrightarrow{p} \begin{cases} \tau^2 \dfrac{\gamma}{1-\gamma} & \gamma < 1, \\[2mm] \tau^2 \dfrac{1}{\gamma-1} & \gamma > 1, \end{cases}$$

as $n, d \to \infty$, $d/n \to \gamma$.

Approximation error

The third and fourth terms in (A.2) reflect the approximation error of $\hat\beta_{\mathrm{OLS}}$ and $\hat\beta_{\min}$, which arises because the linear model is misspecified for the mixture of two groups. Unlike the inductive bias and variance terms, these terms do not appear in prior studies of the average/overall risk of the ridgeless least squares estimator (Hastie et al., 2019).

Lemma A.5. In addition to Assumption A.1, assume $\|\delta\|_2^2 = r$ and $\delta^\top \beta_0 = c$ for all $n, d$. As $d \to \infty$ we have

$$\delta^\top \Big( \begin{bmatrix} 0_{n_0\times d} \\ X_1 \end{bmatrix}^\top \frac{X}{n} \Big) (\hat\Sigma^\dagger)^2 \Big( \frac{X^\top}{n} \begin{bmatrix} 0_{n_0\times d} \\ X_1 \end{bmatrix} \Big) \delta \xrightarrow{p} \begin{cases} r\,\dfrac{\pi\gamma}{1-\gamma} + r\,\dfrac{\pi^2(1-2\gamma)}{1-\gamma} & \gamma < 1, \\[2mm] r\,\dfrac{\pi}{\gamma-1} + r\,\dfrac{\pi^2(\gamma-2)}{\gamma(\gamma-1)} & \gamma > 1, \end{cases}$$

and

$$\delta^\top \Big( \begin{bmatrix} 0_{n_0\times d} \\ X_1 \end{bmatrix}^\top \frac{X}{n} \Big) (\hat\Sigma^\dagger)^2\, \hat\Sigma\, \beta_0 \xrightarrow{p} c\,\pi\,(\{1/\gamma\} \wedge 1).$$

Before moving on, we note that (the limit of) the approximation error term is the only term that depends on the majority fraction $\pi$. Although the inductive bias may increase with $\gamma$ in the overparameterized regime ($\gamma > 1$) when the SNR ($= \|\beta_0\|_2^2/\tau^2$) is large, the approximation error terms always decrease with growing overparameterization. In fact they tend to zero as $\gamma \to \infty$. We plot (the limit of) the minority mean squared prediction error (MSPE) in Figure 4.

In both overparameterized and underparameterized regimes, the MSPE increases as $\pi$ increases. This is expected: as the fraction of training samples from the minority group decreases, we expect ERM to train a model that aligns more closely with the regression function of the majority group. We also notice that in the overparameterized regime ($\gamma > 1$), when the SNR ($\|\beta_0\|_2^2/\tau^2$) is high, the MSPE increases with growing $\gamma$ if the two groups are aligned (the angle between $\beta_0$ and $\beta_1$ is $\theta = 0$). This trend is also observed in Hastie et al. (2019).
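The limits stated in Lemmas A.3, A.4 and the quadratic approximation-error limit of Lemma A.5 are simple closed forms, so they can be transcribed directly. The sketch below is only such a transcription: the function names are hypothetical, and the cross term of Lemma A.5 is deliberately left out.

```python
def inductive_bias_limit(s0, gamma):
    """Lemma A.3: s0 * max(1 - 1/gamma, 0)."""
    return s0 * max(1.0 - 1.0 / gamma, 0.0)

def variance_limit(tau2, gamma):
    """Lemma A.4: tau^2 * gamma / (1 - gamma) if gamma < 1, tau^2 / (gamma - 1) if gamma > 1."""
    if gamma == 1:
        raise ValueError("the limit is not defined at the interpolation threshold gamma = 1")
    if gamma < 1:
        return tau2 * gamma / (1.0 - gamma)
    return tau2 / (gamma - 1.0)

def approximation_quadratic_limit(r, pi, gamma):
    """Lemma A.5, quadratic term in delta (r = ||delta||^2, pi = majority fraction)."""
    if gamma == 1:
        raise ValueError("the limit is not defined at the interpolation threshold gamma = 1")
    if gamma < 1:
        return r * (pi * gamma / (1.0 - gamma) + pi**2 * (1.0 - 2.0 * gamma) / (1.0 - gamma))
    return r * (pi / (gamma - 1.0) + pi**2 * (gamma - 2.0) / (gamma * (gamma - 1.0)))

# Example: the quadratic approximation term shrinks as gamma grows past the threshold.
values = [approximation_quadratic_limit(1.0, 0.8, gamma) for gamma in (2.0, 5.0, 50.0)]
```

As a sanity check on the transcription, letting $\gamma \to 0$ in the underparameterized branch gives $r\pi^2$, the squared distance between $\beta_0$ and the population least squares solution $\beta_0 - \pi\delta$.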

A.4 INADEQUACIES OF RIDGELESS LEAST SQUARES

There is one notable disagreement between the asymptotic risk of the ridgeless least squares estimator and empirical practice in modern ML (Pham et al., 2021): the risk of the ridgeless least squares estimator increases as the overparameterization ratio $\gamma$ increases, while the accuracy of modern ML models generally improves with overparameterization. This has led ML practitioners to train overparameterized models whose risks exhibit a double-descent phenomenon (Belkin et al., 2018). Inspecting the asymptotic risk of ridgeless least squares reveals that the increase in risk (as $\gamma$ increases) is due to the inductive bias term (see Lemma A.3). For high SNR problems ($s_0 \gg \tau^2$), the increase of the inductive bias term dominates the decrease of the variance term (see Lemma A.4), which leads the overall risk to increase. This is a consequence of the fact that the problem dimension (the dimension of the inputs) and the degree of overparameterization are tied in ridgeless least squares. In order to elucidate behavior (in the risk) that more closely matches empirical observations, we study the random features model, which allows us to keep the problem dimension fixed while increasing the overparameterization by increasing the number of random features.
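To make the trade-off concrete, the following short calculation uses only the limits of Lemmas A.3 and A.4 and sets the approximation error aside (i.e. it assumes $\delta = 0$). For $\gamma > 1$, write

$$R(\gamma) = s_0\Big(1 - \frac{1}{\gamma}\Big) + \frac{\tau^2}{\gamma - 1}, \qquad R'(\gamma) = \frac{s_0}{\gamma^2} - \frac{\tau^2}{(\gamma-1)^2}.$$

Hence $R'(\gamma) > 0$ if and only if $(\gamma - 1)/\gamma > \tau/\sqrt{s_0}$. With $\mathrm{SNR} = s_0/\tau^2 > 1$ this is equivalent to

$$\gamma > \gamma^\star = \frac{\sqrt{\mathrm{SNR}}}{\sqrt{\mathrm{SNR}} - 1},$$

so the risk first decreases on $(1, \gamma^\star)$ and then increases towards $s_0$ as $\gamma \to \infty$. For example, $\mathrm{SNR} = 10$ gives $\gamma^\star \approx 1.46$. When $\mathrm{SNR} \le 1$ the inequality never holds and $R$ is decreasing on all of $(1, \infty)$.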

A.5 MAJORITY GROUP SUBSAMPLING

We note that the (limiting) minority risk curve of majority group subsampling is the (limiting) minority risk curve of ERM after a change of variables. Indeed, it is not hard to check that discarding training samples from the majority group until the groups are balanced reduces the total sample size by a factor of $2(1 - \pi)$ (recall $\pi \in [\frac{1}{2}, 1]$). In other words, if the (group-imbalanced) training data consists of $n$ samples, then the (group-balanced) training data will have $2(1 - \pi)n$ training samples. In the $n, d \to \infty$, $d/n \to \gamma$ limit, this is equivalent to increasing $\gamma$ by a factor of $\frac{1}{2(1-\pi)}$. The limiting results for ERM can be reused with the following changes: under subsampling (1) $\pi$ changes to $1/2$, (2) $\gamma$ changes to $\gamma/(2 - 2\pi)$, and (3) the rest of the parameters remain the same. In Figure 5, where we plot the differences between subsampling and ERM minority risk, we observe that subsampling generally improves minority group performance over ERM. This aligns with the empirical findings in Sagawa et al. (2020) and Idrissi et al. (2021). The only instance in which subsampling has worse minority risk than ERM is when the angle between $\beta_0$ and $\beta_1$ is zero, i.e. $\beta_0 = \beta_1$. This agrees with intuition: when there is no difference between the distributions of the two groups, subsampling discards majority-group samples that are valuable for learning the minority-group predictor, resulting in an inferior predictor for the minority group.

To calculate $nF_{1,1}$ we notice that

$$nF_{1,1} = \frac{n}{d}\,\mathbb{E}\Big[\mathrm{Tr}\Big\{ \frac{x_1 x_1^\top}{n} (\hat\Sigma^\dagger)^2 \frac{x_1 x_1^\top}{n} \Big\}\Big] = \frac{1}{\gamma}\,\mathbb{E}\Big[ \frac{x_1^\top x_1}{n} \cdot \frac{x_1^\top (\hat\Sigma^\dagger)^2 x_1}{n} \Big].$$

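Denoting by $\xrightarrow{L_1}$ convergence in $L_1$, we notice that $\frac{x_1^\top x_1}{n} \xrightarrow{L_1} \gamma$, since $\mathbb{E}[(\|x_1\|_2^2/n - \gamma)^2] \to 0$. We now calculate the limit of $\frac{1}{n} x_1^\top (\hat\Sigma^\dagger)^2 x_1$. Noticing that

$$\frac{1}{n} x_1^\top (\hat\Sigma^\dagger)^2 x_1 = \lim_{z\to 0+} \frac{1}{n} x_1^\top (\hat\Sigma + zI_d)^{-2} x_1,$$

we write $\hat\Sigma + zI_d = (x_1 x_1^\top/n) + A_z$, where $A_z = X_{-1}^\top X_{-1}/n + zI_d$. Using the Woodbury decomposition

$$\{(x_1 x_1^\top/n) + A_z\}^{-1} = A_z^{-1} - \frac{A_z^{-1}(x_1 x_1^\top/n)A_z^{-1}}{1 + \frac{1}{n} x_1^\top A_z^{-1} x_1}$$

we obtain that

$$\frac{1}{n} x_1^\top (\hat\Sigma + zI_d)^{-2} x_1 = \frac{\frac{1}{n} x_1^\top A_z^{-2} x_1}{\big(1 + \frac{1}{n} x_1^\top A_z^{-1} x_1\big)^2}.$$

In Lemma A.6 we show that, as $d \to \infty$,

$$\mathbb{E}\Big[\frac{1}{n} x_1^\top A_z^{-1} x_1\Big] \to \begin{cases} \dfrac{\gamma}{1-\gamma} & \gamma < 1, \\[2mm] \dfrac{1}{\gamma-1} & \gamma > 1, \end{cases} \qquad \mathbb{E}\Big[\frac{1}{n} x_1^\top A_z^{-2} x_1\Big] \to \begin{cases} \dfrac{\gamma}{(1-\gamma)^3} & \gamma < 1, \\[2mm] \dfrac{\gamma}{(\gamma-1)^3} & \gamma > 1, \end{cases}$$

and $\mathrm{var}[\frac{1}{n} x_1^\top A_z^{-1} x_1], \mathrm{var}[\frac{1}{n} x_1^\top A_z^{-2} x_1] \to 0$, which implies that

$$nF_{1,1} \xrightarrow{L_1} \begin{cases} \dfrac{\gamma}{1-\gamma} & \gamma < 1, \\[2mm] \dfrac{1}{\gamma(\gamma-1)} & \gamma > 1. \end{cases}$$

Noticing that

$$\frac{1}{d}\mathbb{E}[\mathrm{Tr}\{\hat\Sigma \hat\Sigma^\dagger\}] = \frac{1}{d}\mathbb{E}[\mathrm{Tr}\{\hat\Sigma (\hat\Sigma^\dagger)^2 \hat\Sigma\}] = nF_{1,1} + n(n-1)F_{1,2},$$

we obtain the convergence of $n(n-1)F_{1,2}$ as

$$n(n-1)F_{1,2} \xrightarrow{L_1} \begin{cases} \dfrac{1-2\gamma}{1-\gamma} & \gamma < 1, \\[2mm] \dfrac{\gamma-2}{\gamma(\gamma-1)} & \gamma > 1, \end{cases}$$

which finally yields

$$\frac{1}{d}\mathbb{E}\Big[\mathrm{Tr}\Big\{ \frac{X_1^\top X_1}{n} (\hat\Sigma^\dagger)^2 \frac{X_1^\top X_1}{n} \Big\}\Big] = n_1 F_{1,1} + n_1(n_1 - 1)F_{1,2} \asymp \pi\, nF_{1,1} + \pi^2 n(n-1)F_{1,2} \xrightarrow{L_1} \begin{cases} \dfrac{\pi\gamma}{1-\gamma} + \dfrac{\pi^2(1-2\gamma)}{1-\gamma} & \gamma < 1, \\[2mm] \dfrac{\pi}{\gamma-1} + \dfrac{\pi^2(\gamma-2)}{\gamma(\gamma-1)} & \gamma > 1. \end{cases}$$

Lemma A.6. As $z \to 0+$ and $d \to \infty$ we have

$$\mathbb{E}\Big[\frac{1}{n} x_1^\top A_z^{-1} x_1\Big] \to \begin{cases} \dfrac{\gamma}{1-\gamma} & \gamma < 1, \\[2mm] \dfrac{1}{\gamma-1} & \gamma > 1, \end{cases} \qquad \mathbb{E}\Big[\frac{1}{n} x_1^\top A_z^{-2} x_1\Big] \to \begin{cases} \dfrac{\gamma}{(1-\gamma)^3} & \gamma < 1, \\[2mm] \dfrac{\gamma}{(\gamma-1)^3} & \gamma > 1, \end{cases}$$

and $\mathrm{var}[\frac{1}{n} x_1^\top A_z^{-1} x_1], \mathrm{var}[\frac{1}{n} x_1^\top A_z^{-2} x_1] \to 0$.

Proof of Lemma A.6. To prove the results about variances we first establish that for any symmetric matrix $B = ((b_{ij})) \in \mathbb{R}^{d\times d}$ it holds that

$$\mathrm{var}(x_1^\top B x_1) \le c\,\mathrm{Tr}\{B^2\}, \tag{A.4}$$

for some $c > 0$. Writing $x_1 = (u_1, \ldots, u_d)^\top$ we see that

$$\mathrm{var}(x_1^\top B x_1) = \mathrm{var}\Big(\sum_{i,j} u_i u_j b_{ij}\Big) = \sum_{i,j,k,l} b_{ij} b_{kl}\,\mathrm{cov}(u_i u_j, u_k u_l).$$

Noticing that $\mathrm{cov}(u_i u_j, u_k u_l) = \mathbb{E}[u_i u_j u_k u_l] - \mathbb{E}[u_i u_j]\mathbb{E}[u_k u_l]$ we see that

1. If one of $i, j, k, l$ is distinct from the others then $\mathrm{cov}(u_i u_j, u_k u_l) = 0$.
2. If $i = j$ and $k = l$ and $i \ne k$ then $\mathbb{E}[u_i u_j u_k u_l] - \mathbb{E}[u_i u_j]\mathbb{E}[u_k u_l] = \mathbb{E}[u_i^2 u_k^2] - \mathbb{E}[u_i^2]\mathbb{E}[u_k^2] = 0$.
3. If $i = j = k = l$ then $\mathrm{cov}(u_i u_j, u_k u_l) = \mathrm{var}(u_1^2)$.
4. If $\{i = k$ and $j = l$ and $i \ne j\}$ or $\{i = l$ and $j = k$ and $i \ne j\}$ then we have $\mathbb{E}[u_i u_j u_k u_l] - \mathbb{E}[u_i u_j]\mathbb{E}[u_k u_l] = \mathrm{var}(u_i u_j) = \mathrm{var}(u_1 u_2)$.

Gathering the terms we notice that

$$\mathrm{var}(x_1^\top B x_1) = \sum_i \mathrm{var}(u_i^2)\, b_{ii}^2 + 2\sum_{i \ne j} b_{ij}^2\, \mathrm{var}(u_i u_j) \le c \sum_{i,j} b_{ij}^2 = c\,\mathrm{Tr}\{B^2\},$$

where $c = 2\{\mathrm{var}(u_1^2) \vee \mathrm{var}(u_1 u_2)\}$. Using (A.4) we notice that

$$\lim_{z\to 0+} \mathrm{var}(x_1^\top A_z^{-j} x_1/n) \le \frac{c}{n^2}\mathrm{Tr}\{(\hat\Sigma_{-1}^\dagger)^{2j}\} \asymp \frac{c}{n^2}\mathrm{Tr}\{(\hat\Sigma^\dagger)^{2j}\} \to 0.$$

We notice that $\mathbb{E}[\frac{1}{n} x_1^\top A_z^{-1} x_1] = \frac{1}{n}\mathbb{E}[\mathrm{Tr}\{A_z^{-1}\}] \asymp \frac{1}{n}\mathrm{Tr}\{\hat\Sigma^\dagger\}$, which appears in the variance term in Lemma A.4, and we conclude that

$$\mathbb{E}\Big[\frac{1}{n} x_1^\top A_z^{-1} x_1\Big] \to \begin{cases} \dfrac{\gamma}{1-\gamma} & \gamma < 1, \\[2mm] \dfrac{1}{\gamma-1} & \gamma > 1, \end{cases}$$

as $d \to \infty$. To calculate $\mathbb{E}[\frac{1}{n} x_1^\top A_z^{-2} x_1] = \frac{1}{n}\mathbb{E}[\mathrm{Tr}\{A_z^{-2}\}] \asymp \frac{1}{n}\mathrm{Tr}\{(\hat\Sigma^\dagger)^2\}$ we consider the two regimes case by case. For $\gamma < 1$ we see that

$$\frac{1}{n}\mathrm{Tr}\{(\hat\Sigma^\dagger)^2\} = \gamma\,\frac{1}{d}\mathrm{Tr}\{\hat\Sigma^{-2}\} \to \gamma \int \frac{1}{s^2}\, dF_\gamma(s),$$

where $F_\gamma$ is the Marchenko-Pastur law with ratio $\gamma$ and $\mu_\gamma$ denotes the associated measure. The Stieltjes transform of $\mu_\gamma$ is

$$s_\gamma(z) = \int \frac{1}{x - z}\, d\mu_\gamma(x) = \frac{1 - \gamma - z - \sqrt{(1 - \gamma - z)^2 - 4\gamma z}}{2\gamma z}, \qquad z \in \mathbb{C} \setminus [(1 - \sqrt{\gamma})^2, (1 + \sqrt{\gamma})^2].$$

For $|z| < (1 - \sqrt{\gamma})^2$,

$$s_\gamma(z) = \int \frac{1}{x - z}\, d\mu_\gamma(x) = \int \frac{1}{x} \cdot \frac{1}{1 - z/x}\, d\mu_\gamma(x) = \int \frac{1}{x} \sum_{k=0}^\infty (z/x)^k\, d\mu_\gamma(x).$$

Hence, we have $\lim_{z\to 0-} \frac{d}{dz} s_\gamma(z) = \mathbb{E}[1/X^2]$ for $X \sim \mu_\gamma$. Now, for $z \to 0-$,

$$\begin{aligned} 1 - \gamma - z - \sqrt{(1 - \gamma - z)^2 - 4\gamma z} &= 1 - \gamma - z - \big[1 + \gamma^2 + z^2 - 2\gamma - 2z + 2\gamma z - 4\gamma z\big]^{1/2} \\ &= 1 - \gamma - z - \big[(1 - \gamma)^2 + z^2 - 2z - 2\gamma z\big]^{1/2} \\ &\asymp 1 - \gamma - z - (1 - \gamma)\Big[1 + \frac{z^2 - 2z - 2\gamma z}{2(1 - \gamma)^2} - \frac{(z^2 - 2z - 2\gamma z)^2}{8(1 - \gamma)^4}\Big] \\ &\asymp 1 - \gamma - z - (1 - \gamma)\Big[1 + \frac{z^2 - 2z - 2\gamma z}{2(1 - \gamma)^2} - \frac{z^2(1 + \gamma)^2}{2(1 - \gamma)^4}\Big] \\ &= \frac{z^2(1 + \gamma)^2}{2(1 - \gamma)^3} - \frac{z^2 - 4\gamma z}{2(1 - \gamma)} \\ &= \frac{2z^2\gamma}{(1 - \gamma)^3} + \frac{2\gamma z}{1 - \gamma}, \end{aligned}$$

which implies

$$s_\gamma(z) \asymp \frac{z}{(1 - \gamma)^3} + \frac{1}{1 - \gamma}.$$

This establishes that

$$\frac{1}{n}\mathrm{Tr}\{(\hat\Sigma^\dagger)^2\} \to \gamma \lim_{z\to 0-} \frac{d}{dz} s_\gamma(z) = \frac{\gamma}{(1 - \gamma)^3}.$$

For $\gamma > 1$ we notice that

$$\frac{1}{n}\mathrm{Tr}\{(\hat\Sigma^\dagger)^2\} = \frac{1}{n}\sum_{i=1}^n \frac{1}{s_i^2},$$

where the $s_i$'s are the non-zero eigenvalues of $X^\top X/n$. This is also equal to $\frac{1}{\gamma^2 n}\sum_{i=1}^n \frac{1}{t_i^2}$, where the $t_i$'s are the eigenvalues of the invertible matrix $XX^\top/d$. Hence we have

$$\frac{1}{n}\mathrm{Tr}\{(\hat\Sigma^\dagger)^2\} = \frac{1}{\gamma^2}\cdot\frac{1}{n}\mathrm{Tr}\{(XX^\top/d)^{-2}\} \to \frac{1}{\gamma^2} \lim_{z\to 0-} \frac{d}{dz} s_{1/\gamma}(z) = \frac{\gamma}{(\gamma - 1)^3}.$$

We combine the limits for $\gamma < 1$ and $\gamma > 1$ to write

$$\mathbb{E}\Big[\frac{1}{n} x_1^\top A_z^{-2} x_1\Big] \to \begin{cases} \dfrac{\gamma}{(1-\gamma)^3} & \gamma < 1, \\[2mm] \dfrac{\gamma}{(\gamma-1)^3} & \gamma > 1, \end{cases}$$

as $d \to \infty$.

B PROOF OF THEOREM 3.9

Before diving into the main proof we give a rough sketch of the whole proof. Essentially, the proof has three major components.

1. In the first part, we establish the limiting risk where we treat $\beta_0$ and $\delta$ as uncorrelated random variables. (Section B.1)
2. In the second part of the proof we establish the necessary bias-variance decomposition under orthogonality of the fixed parameters $\beta_0$ and $\delta$. (Section B.2)
3. In the final part, we use the results of the previous two parts to establish the limiting risk result for $\beta_0$ and $\delta$ in general position. (Sections B.3 and B.4)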

B.1 LIMITING RISK FOR UNCORRELATED PARAMETERS

We start this section with the definitions of some important quantities.

Definition B.1. Let the functions $\nu_1, \nu_2 : \mathbb{C}_+ \to \mathbb{C}_+$ be uniquely defined by the following conditions: (i) $\nu_1, \nu_2$ are analytic on $\mathbb{C}_+$. (ii) For $\mathrm{Im}(\zeta) > 0$, $\nu_1(\zeta)$ and $\nu_2(\zeta)$ satisfy the equations

$$\nu_1 = \psi_1\Big(-\zeta - \nu_2 - \frac{\xi^2 \nu_2}{1 - \xi^2 \nu_1 \nu_2}\Big)^{-1}, \qquad \nu_2 = \psi_2\Big(-\zeta - \nu_1 - \frac{\xi^2 \nu_1}{1 - \xi^2 \nu_1 \nu_2}\Big)^{-1}.$$

(iii) $(\nu_1(\zeta), \nu_2(\zeta))$ is the unique solution of these equations with $|\nu_1(\zeta)| \le \psi_1/\mathrm{Im}(\zeta)$, $|\nu_2(\zeta)| \le \psi_2/\mathrm{Im}(\zeta)$ for $\mathrm{Im}(\zeta) > C$, with $C$ a sufficiently large constant.

Let $\chi := \nu_1(i(\psi_1\psi_2\lambda)^{1/2}) \cdot \nu_2(i(\psi_1\psi_2\lambda)^{1/2})$, and

$$\begin{aligned} E_0(\xi, \psi_1, \psi_2, \lambda) &= -\chi^5\xi^6 + 3\chi^4\xi^4 + (\psi_1\psi_2 - \psi_1 - \psi_2 + 1)\chi^3\xi^6 - 2\chi^3\xi^4 - 3\chi^3\xi^2 \\ &\quad + (\psi_1 + \psi_2 - 3\psi_1\psi_2 + 1)\chi^2\xi^4 + 2\chi^2\xi^2 + \chi^2 + 3\psi_1\psi_2\chi\xi^2 - \psi_1\psi_2, \\ E_1(\xi, \psi_1, \psi_2, \lambda) &= \psi_2\chi^3\xi^4 - \psi_2\chi^2\xi^2 + \psi_1\psi_2\chi\xi^2 - \psi_1\psi_2, \\ E_2(\xi, \psi_1, \psi_2, \lambda) &= \chi^5\xi^6 - 3\chi^4\xi^4 + (\psi_1 - 1)\chi^3\xi^6 + 2\chi^3\xi^4 + 3\chi^3\xi^2 + (-\psi_1 - 1)\chi^2\xi^4 - 2\chi^2\xi^2 - \chi^2. \end{aligned}$$

We then define

$$B_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda) = \frac{E_1(\xi, \psi_1, \psi_2, \lambda)}{E_0(\xi, \psi_1, \psi_2, \lambda)}, \tag{B.1}$$

$$V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda) = \frac{E_2(\xi, \psi_1, \psi_2, \lambda)}{E_0(\xi, \psi_1, \psi_2, \lambda)}. \tag{B.2}$$

The quantities defined above can be easily computed numerically. For our purposes, however, we need to focus on the limiting case $\lambda \to 0$. It can be shown that when $\lambda \to 0$, the expressions of $B_{\mathrm{ridge}}$ and $V_{\mathrm{ridge}}$ reduce to the forms of $B^\star$ and $V^\star$ defined in Lemma 3.7, respectively. For details we refer to Section 5.2 in Mei and Montanari (2019a). Now we are ready to present the proof of the main result. Recall that

$$\hat a(\lambda) := \arg\min_{a\in\mathbb{R}^N} \frac{1}{n}\sum_{i=1}^n \Big( y_i - \sum_{j=1}^N a_j\, \sigma(\theta_j^\top x_i/\sqrt{d}) \Big)^2 + \frac{N\lambda}{d}\,\|a\|_2^2.$$

By standard linear algebra it immediately follows that

$$\hat a(\lambda) = \frac{1}{\sqrt{d}}\,(Z^\top Z + \lambda\psi_{1,d}\psi_{2,d} I_N)^{-1} Z^\top y, \tag{B.3}$$

where $Z = \frac{1}{\sqrt{d}}\,\sigma(X\Theta^\top/\sqrt{d})$, $\psi_{1,d} = \frac{N}{d}$ and $\psi_{2,d} = \frac{n}{d}$. Also, define the squared error loss $R_{RF}(x, X, \Theta, \lambda) := (x^\top\beta_0 - f(x; \hat a(\lambda); \Theta))^2$, where $x$ is a new feature point.
For brevity, we denote by $\Gamma$ the tuple $(X, \Theta, \beta_0, \delta)$. Recall that we are interested in the out-of-sample risk $\mathbb{E}_\Gamma \mathbb{E}_x[R_{RF}(x, X, \Theta, \lambda)]$, where $x$ is a new feature point from the minority group. The main recipe of our proof is the following:

• We first carry out the bias-variance decomposition of the expected out-of-sample risk.
• We analyze the bias and variance terms separately using techniques from random matrix theory.
• Finally, we obtain the asymptotic limit of the expected risk by using the asymptotic limits of the bias and variance terms.
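The closed form (B.3) can be checked numerically against a direct solve of the ridge objective it comes from. The sketch below is an illustration under stated assumptions, not the authors' implementation: the activation is taken to be ReLU, the random weights $\theta_j$ are drawn uniformly on the sphere of radius $\sqrt{d}$, and all function names are hypothetical.

```python
import numpy as np

def relu(t):
    return np.maximum(t, 0.0)

def fit_random_features_ridge(X, y, Theta, lam, sigma=relu):
    """a_hat(lambda) = d^{-1/2} (Z^T Z + lam * psi1 * psi2 * I_N)^{-1} Z^T y, as in (B.3)."""
    n, d = X.shape
    N = Theta.shape[0]
    Z = sigma(X @ Theta.T / np.sqrt(d)) / np.sqrt(d)   # n x N feature matrix
    psi1, psi2 = N / d, n / d
    return np.linalg.solve(Z.T @ Z + lam * psi1 * psi2 * np.eye(N), Z.T @ y) / np.sqrt(d)

def predict(x_new, a_hat, Theta, sigma=relu):
    """f(x; a; Theta) = sum_j a_j * sigma(theta_j^T x / sqrt(d))."""
    d = Theta.shape[1]
    return sigma(x_new @ Theta.T / np.sqrt(d)) @ a_hat

rng = np.random.default_rng(0)
n, d, N, lam = 30, 10, 50, 0.1
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)
Theta = rng.standard_normal((N, d))
Theta *= np.sqrt(d) / np.linalg.norm(Theta, axis=1, keepdims=True)  # rows on the sphere of radius sqrt(d)

a_hat = fit_random_features_ridge(X, y, Theta, lam)

# Direct minimizer of (1/n) * ||y - F a||^2 + (N * lam / d) * ||a||^2 for comparison.
F = relu(X @ Theta.T / np.sqrt(d))
a_direct = np.linalg.solve(F.T @ F / n + (N * lam / d) * np.eye(N), F.T @ y / n)
```

The two solutions agree because $Z^\top Z = F^\top F/d$ and $\psi_{1,d}\psi_{2,d} = Nn/d^2$, so multiplying the normal equations of the objective by $n/d$ yields exactly (B.3).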

BIAS VARIANCE DECOMPOSITION

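The expected risk at a new point $x$ coming from the minority group can be decomposed in the following way:

$$\begin{aligned} \mathbb{E}_\Gamma \mathbb{E}_x[R_{RF}(x, X, \Theta, \lambda)] &= \mathbb{E}_\Gamma \mathbb{E}_x\{x^\top\beta_0 - f(x; \hat a(\lambda); \Theta)\}^2 \\ &= \mathbb{E}_\Gamma \mathbb{E}_x\{x^\top\beta_0 - \mathbb{E}_\epsilon f(x; \hat a(\lambda); \Theta)\}^2 + \mathbb{E}_\Gamma \mathbb{E}_x \mathrm{Var}_\epsilon\{f(x; \hat a(\lambda); \Theta)\}, \end{aligned}$$

where the decomposition follows from Lemma 3.3, which uses the uncorrelatedness of $\beta_0$ and $\delta$. Now we define the matrix

$$\Psi := (Z^\top Z + \lambda\psi_{1,d}\psi_{2,d} I_N)^{-1}. \tag{B.5}$$

In the next couple of sections we will study the asymptotic behavior of the bias and variance terms by analyzing several terms involving $\Psi$. In order to obtain asymptotic limits of those terms, we will borrow results from random matrix theory in Mei and Montanari (2019a).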

DECOMPOSITION OF BIAS TERM

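We focus on the bias term

$$B(\delta, \lambda) := \mathbb{E}_\Gamma \mathbb{E}_x\{x^\top\beta_0 - \mathbb{E}_\epsilon f(x; \hat a(\lambda); \Theta)\}^2 = \mathbb{E}_\Gamma \mathbb{E}_x\Big[x^\top\beta_0 - \sigma(\Theta x/\sqrt{d})^\top \frac{1}{\sqrt{d}}\big(\Psi Z^\top X\beta_0 + \Psi Z_1^\top X_1\delta\big)\Big]^2. \tag{B.6}$$

It turns out that it is rather difficult to directly analyze the bias term $B(\delta, \lambda)$. The main difficulty is that the terms involving $\beta_0$ and $\delta$ are coupled in the bias term, which prevents us from applying well-known random matrix theory results directly. Hence, in order to decouple the terms we look at the following decomposition:

$$B(\delta, \lambda) = \underbrace{B(\delta, \lambda) - B(0, \lambda)}_{(I)} + \underbrace{B(0, \lambda)}_{(II)}.$$

It is fairly simple to see that term (II) is only a function of $\beta_0$. Next, we will show that term (I) does not contain $\beta_0$. To this end, we define the matrices $\tilde U, U \in \mathbb{R}^{N\times N}$ as follows:

$$\tilde U_{ij} = \sigma(\theta_i^\top x/\sqrt{d})\,\sigma(\theta_j^\top x/\sqrt{d}), \qquad U_{ij} = \mathbb{E}(\tilde U_{ij}) \quad \forall (i, j) \in [N]\times[N]. \tag{B.7}$$

Next, due to Assumption 3.2, it easily follows that

$$\begin{aligned} B(\delta, \lambda) - B(0, \lambda) &= \mathbb{E}_\Gamma \mathbb{E}_x\Big[x^\top\beta_0 - \sigma(\Theta x/\sqrt{d})^\top \frac{1}{\sqrt{d}}\big(\Psi Z^\top X\beta_0 + \Psi Z_1^\top X_1\delta\big)\Big]^2 - \mathbb{E}_\Gamma \mathbb{E}_x\Big[x^\top\beta_0 - \sigma(\Theta x/\sqrt{d})^\top \frac{1}{\sqrt{d}}\Psi Z^\top X\beta_0\Big]^2 \\ &= \frac{1}{d}\,\mathbb{E}_\Gamma \mathbb{E}_x\big[\delta^\top X_1^\top Z_1 \Psi \tilde U \Psi Z_1^\top X_1 \delta\big] \\ &= \frac{1}{d}\,\mathbb{E}_\Gamma \mathbb{E}_x \mathrm{Tr}\big[X_1^\top Z_1 \Psi \tilde U \Psi Z_1^\top X_1 \delta\delta^\top\big] \\ &= \frac{F_\delta^2}{d^2}\,\mathbb{E}\,\mathrm{Tr}\big[X_1^\top Z_1 \Psi U \Psi Z_1^\top X_1\big]. \end{aligned} \tag{B.8}$$

The above display shows that term (I) does not contain $\beta_0$. In the subsequent discussion, we focus on obtaining the limits of terms (I) and (II) separately. Next, by the property of the trace, we have the following:

$$\begin{aligned} \mathrm{Tr}(X_1^\top Z_1 \Psi U \Psi Z_1^\top X_1) &= \mathrm{Tr}\Big[\Big(\sum_{i=n_0+1}^{n} x_i z_i^\top\Big) \Psi U \Psi \Big(\sum_{j=n_0+1}^{n} z_j x_j^\top\Big)\Big] \\ &= \sum_{i=n_0+1}^{n}\sum_{j=n_0+1}^{n} \mathrm{Tr}\big[\Psi U \Psi z_j x_j^\top x_i z_i^\top\big] \\ &= \sum_{i=n_0+1}^{n}\sum_{j=n_0+1}^{n} x_j^\top x_i\, \mathrm{Tr}\big[\Psi U \Psi z_j z_i^\top\big] \\ &= \sum_{i=n_0+1}^{n} x_i^\top x_i\, \mathrm{Tr}\big[\Psi U \Psi z_i z_i^\top\big] + \sum_{\substack{n_0+1 \le i, j \le n \\ i \ne j}} x_j^\top x_i\, \mathrm{Tr}\big[\Psi U \Psi z_j z_i^\top\big]. \end{aligned} \tag{B.9}$$

Thus, by exchangeability of the terms $\{\mathrm{Tr}[\Psi U \Psi z_i z_i^\top]\}_{i\in[n]}$ and $\{x_j^\top x_i\, \mathrm{Tr}[\Psi U \Psi z_j z_i^\top]\}_{i \ne j}$ we have

$$\frac{1}{d^2}\,\mathbb{E}\,\mathrm{Tr}\big[X_1^\top Z_1 \Psi U \Psi Z_1^\top X_1\big] = n_1 f(1, 1) + n_1(n_1 - 1) f(1, 2), \tag{B.10}$$

where $f(i, j) = \mathbb{E}\big[x_j^\top x_i\, \mathrm{Tr}\{\Psi U \Psi z_j z_i^\top\}\big]/d^2$ and $1 \le i, j \le 2$.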
Hence, together with Equation (B.8) and Equation (B.10), it follows that B(δ, λ) -B(0, λ) = F 2 δ {n 1 f (1, 1) + n 1 (n 1 -1)f (1, 2)}. (B.11) Furthermore, from Equation (8.10), Equation (8.25), Lemma 9.3 and Lemma 9.4 of Mei and Montanari (2019a), we get B(0, λ) = B ridge (ξ, ψ 1 , ψ 2 , λ/µ 2 ⋆ )F 2 β + o d (1), (B.12) where B ridge (ξ, ψ 1 , ψ 2 , λ/µ 2 ⋆ ) is defined as in Equation (B.1). Thus, it only remains to calculate the limit of B(δ, λ) -B(0, λ) in order to obtain the limit of the bias term B(δ, λ).

ANALYSIS OF VARIANCE TERM

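Now we briefly defer the calculation of term (I) in Equation (B.11) and shift our focus to the variance term. The reason behind this is that it is hard to directly calculate the limits of the quantities appearing on the right-hand side of Equation (B.11). We obtain these limiting quantities with the help of the limiting variance. Note that for the variance term, we have

$$\begin{aligned} V(\lambda) = \mathbb{E}_\Gamma \mathbb{E}_x \mathrm{Var}_\epsilon\{f(x; \hat a(\lambda); \Theta)\} &= \mathbb{E}_\Gamma \mathbb{E}_x \mathrm{Var}_\epsilon\Big[\sigma(\Theta x/\sqrt{d})^\top \frac{1}{\sqrt{d}}\Psi Z^\top \epsilon\Big] \\ &= \frac{1}{d}\,\mathbb{E}_\Gamma \mathbb{E}_x \mathrm{Tr}\big[\Psi \tilde U \Psi Z^\top Z\big]\tau^2 \\ &= \frac{\tau^2}{d}\,\mathbb{E}\,\mathrm{Tr}\big[\Psi U \Psi Z^\top Z\big] = \Psi_3 \tau^2, \end{aligned} \tag{B.13}$$

where $\Psi_3 := \mathbb{E}\,\mathrm{Tr}[\Psi U \Psi Z^\top Z]/d$. Again by an exchangeability argument, we have the following:

$$\begin{aligned} n f(1, 1) &= \mathbb{E}\Big[\frac{1}{d^2}\sum_{i=1}^n x_i^\top x_i\, \mathrm{Tr}\big(\Psi U \Psi z_i z_i^\top\big)\Big] \\ &= \mathbb{E}\Big[\frac{1}{d}\sum_{i=1}^n \mathrm{Tr}\big(\Psi U \Psi z_i z_i^\top\big)\Big] \quad (\text{as } x_i^\top x_i/d = 1) \\ &= \mathbb{E}\Big[\frac{1}{d}\mathrm{Tr}\big(\Psi U \Psi Z^\top Z\big)\Big] = \Psi_3. \end{aligned} \tag{B.14}$$

Thus, in light of Equation (B.11) and Equation (B.14), we reemphasize that it is essential to understand the limiting behavior of $f(i, j)$ in order to obtain the limit of the expected out-of-sample risk.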

OBTAINING LIMIT OF VARIANCE TERM:

To this end, we define the matrix $A \in \mathbb{R}^{M\times M}$, $M = N + n$, with parameters $q = (s_1, s_2, t_1, t_2, p) \in \mathbb{R}^5$ in the following way:

$$A = A(q) := \begin{bmatrix} s_1 I_N + s_2 Q & Z^\top + pW^\top \\ Z + pW & t_1 I_n + t_2 H \end{bmatrix},$$

where $Q = \frac{1}{d}\Theta\Theta^\top$, $H = \frac{1}{d}XX^\top$, $W = \frac{\mu_1}{d}X\Theta^\top$. For $\xi \in \mathbb{C}_+$, we define the log-determinant of $A$ as

$$G_d(\xi; q) = \frac{1}{d}\sum_{i=1}^M \mathrm{Log}\,\lambda_i(A - \xi I_M).$$

Here $\mathrm{Log}$ is the complex logarithm with branch cut on the negative real axis and $\{\lambda_i(A)\}_{i\in[M]}$ denotes the eigenvalues of $A$ in non-increasing order.
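For small sizes, the block matrix $A(q)$ and the log-determinant can be formed explicitly, which is a convenient way to sanity-check the definition. The following sketch is illustrative only: the function names are hypothetical, the activation is assumed to be ReLU (for which $\mu_1 = \mathbb{E}[\sigma(G)G] = 1/2$ with $G \sim N(0, 1)$), and NumPy's complex `log` is used because its principal branch has the cut on the negative real axis.

```python
import numpy as np

def build_A(q, Z, Q, H, W):
    """Assemble the (N + n) x (N + n) block matrix A(q); Z and W are n x N."""
    s1, s2, t1, t2, p = q
    N, n = Q.shape[0], H.shape[0]
    top = np.hstack([s1 * np.eye(N) + s2 * Q, Z.T + p * W.T])
    bottom = np.hstack([Z + p * W, t1 * np.eye(n) + t2 * H])
    return np.vstack([top, bottom])

def log_determinant_G(xi, q, Z, Q, H, W, d):
    """(1/d) * sum_i Log(lambda_i(A) - xi), using the principal branch of the logarithm."""
    eigenvalues = np.linalg.eigvalsh(build_A(q, Z, Q, H, W))   # A is real symmetric
    return np.sum(np.log(eigenvalues.astype(complex) - xi)) / d

rng = np.random.default_rng(0)
d, n, N, mu1 = 4, 2, 3, 0.5
X = rng.standard_normal((n, d))
Theta = rng.standard_normal((N, d))
Z = np.maximum(X @ Theta.T / np.sqrt(d), 0.0) / np.sqrt(d)
Q = Theta @ Theta.T / d
H = X @ X.T / d
W = mu1 * X @ Theta.T / d

q = (0.1, 0.2, 0.3, 0.4, 0.5)
xi = 0.7j
A = build_A(q, Z, Q, H, W)
G = log_determinant_G(xi, q, Z, Q, H, W, d)
```

Exponentiating $d \cdot G_d(\xi; q)$ recovers $\det(A - \xi I_M)$, which gives a quick consistency check on the construction.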

Define the quantity $\tilde\Psi_3 := \frac{1}{d}\mathrm{Tr}(\Psi U \Psi Z^\top Z)$. From Equation (B.14), it trivially follows that $\mathbb{E}(\tilde\Psi_3) = n f(1, 1)$. The key trick lies in replacing the kernel matrix $U$ by the matrix $\Lambda = \mu_1^2 Q + \mu_\star^2 I_N$. By (Mei and Montanari, 2019a, Lemma 9.4), we know that the asymptotic error incurred in the expected value of $\tilde\Psi_3$ by replacing $U$ with $\Lambda$ is $o_d(1)$. To elaborate, we first define $\Psi_3 := \frac{1}{d}\mathrm{Tr}(\Psi \Lambda \Psi Z^\top Z)$. Then by (Mei and Montanari, 2019a, Lemma 9.4), we have

$$\mathbb{E}|\tilde\Psi_3 - \Psi_3| = o_d(1). \tag{B.15}$$

Thus, instead of $\tilde\Psi_3$, we will focus on $\Psi_3$. By (Mei and Montanari, 2019a, Proposition 8.2) we know

$$\Psi_3 = -\mu_\star^2\, \partial_{s_1, t_1} G_d(i(\psi_1\psi_2\lambda)^{1/2}; 0) - \mu_1^2\, \partial_{s_2, t_1} G_d(i(\psi_1\psi_2\lambda)^{1/2}; 0). \tag{B.16}$$

Also by (Mei and Montanari, 2019a, Proposition 8.5), for any fixed $u \in \mathbb{R}_+$, we have the following:

$$\lim_{d\to\infty} \mathbb{E}\,\big\|\nabla_q^2 G_d(iu; 0) - \nabla_q^2 g(iu; 0)\big\|_{op} = 0. \tag{B.17}$$

As $\Psi_3$ is a bilinear form of $\nabla_q^2 G_d(i(\psi_1\psi_2\lambda)^{1/2}; 0)$ (see Equation (B.16)), together with (Mei and Montanari, 2019a, Equation (8.26)), it follows that

$$\lim_{d\to\infty} \mathbb{E}\,\big|\Psi_3 - V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2)\big| = 0, \tag{B.18}$$

where $V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2)$ is as defined in Equation (B.2). Thus, Equation (B.15) and Equation (B.18) together yield

$$\lim_{d\to\infty} \mathbb{E}\,\big|\tilde\Psi_3 - V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2)\big| = 0. \tag{B.19}$$

Hence, by Equation (B.13) we get the following:

$$V(\lambda) = V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2)\,\tau^2 + o_d(1). \tag{B.20}$$

OBTAINING LIMIT OF BIAS TERM:

Now we revisit the term $B(\delta, \lambda) - B(0, \lambda)$. We will study the terms $n_1 f(1, 1)$ and $n_1(n_1 - 1) f(1, 2)$ in Equation (B.11) separately. First, we focus on the term $n_1 f(1, 1)$. By Equation (B.14) and Equation (B.19), we have

$$n f(1, 1) = V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2) + o_d(1). \tag{B.21}$$

Next, in order to understand the limiting behavior of $f(1, 2)$, we define

$$\tilde\Psi_2 := \frac{1}{d}\mathrm{Tr}\big(\Psi U \Psi Z^\top H Z\big), \qquad \Psi_2 := \frac{1}{d}\mathrm{Tr}\big(\Psi \Lambda \Psi Z^\top H Z\big).$$

Again due to (Mei and Montanari, 2019a, Lemma 9.4), we have $\mathbb{E}|\tilde\Psi_2 - \Psi_2| = o_d(1)$. Hence, we only focus on $\Psi_2$. A calculation similar to (B.9) shows that

$$\mathbb{E}(\Psi_2) = n f(1, 1) + n(n - 1) f(1, 2) + o_d(1).$$
(B.22)

Let $g(\xi; q)$ be the analytic function defined in (Mei and Montanari, 2019a, Equation (8.19)). Using (Mei and Montanari, 2019a, Proposition 8.5), we get

$$\mathbb{E}(\Psi_2) = -\mu_\star^2\, \partial_{s_1, t_2} g(i(\psi_1\psi_2\lambda)^{1/2}; 0) - \mu_1^2\, \partial_{s_2, t_2} g(i(\psi_1\psi_2\lambda)^{1/2}; 0) + o_d(1) =: \Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1) + o_d(1).$$

This along with (B.21) and (B.22) shows that

$$n(n - 1) f(1, 2) = \Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1) - V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2) + o_d(1). \tag{B.23}$$

Thus, finally using Equation (B.11), Equation (B.12) and the fact that $n_1/n \to \pi$, we conclude that

$$B(\delta, \lambda) = F_\beta^2\, B_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2) + F_\delta^2\big[\pi(1 - \pi) V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2) + \pi^2 \Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1)\big] + o_d(1). \tag{B.24}$$

Combining this with Equation (B.20), we get

$$\mathbb{E}_{X, \Theta, \beta_0, \delta}[R_{RF}(x, X, \Theta, \lambda)] = F_\beta^2\, B_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2) + F_\delta^2\big[\pi(1 - \pi) V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2) + \pi^2 \Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1)\big] + \tau^2 V_{\mathrm{ridge}}(\xi, \psi_1, \psi_2, \lambda/\mu_\star^2). \tag{B.25}$$

RIDGELESS LIMIT

Finally, for the ridgeless limit we need to take $\lambda \to 0+$ on both sides of Equation (B.25). Following similar calculations as in the proof of (Mei and Montanari, 2019a, Theorem 5.7), or more specifically using (Mei and Montanari, 2019a, Lemma 12.1), we ultimately get

$$\lim_{\lambda\to 0+}\lim_{d\to\infty} \mathbb{E}_{X, \Theta, \beta_0, \delta}[R_{RF}(x, X, \Theta, \lambda)] = F_\beta^2 B^\star + F_\delta^2 M_1^\star + \tau^2 V^\star.$$

This completes the proof.

Apart from the cross-correlation term, the first two terms of the equation do not depend on the interaction between $\beta_0$ and $\delta$. To be precise, the first two terms are functions of $\beta_0$ and $\delta$ individually and do not depend on the orthogonality of $\beta_0$ and $\delta$. Thus, an analysis similar to Section B.2 yields that

$$B(\beta_0, \delta) = F_\beta^2 B_\beta + F_\delta^2 B_\delta + 2\,\mathbb{E}_{x\sim P_X}\big[(z^\top (Z^\top Z)^\dagger Z^\top X - x^\top)\beta_0 \delta^\top X_1^\top Z_1 (Z^\top Z)^\dagger z\big].$$

Thus, it only remains to analyze the cross-covariance term

$$C_{\beta,\delta} \triangleq 2\,\mathbb{E}_{x\sim P_X}\big[(z^\top (Z^\top Z)^\dagger Z^\top X - x^\top)\beta_0 \delta^\top X_1^\top Z_1 (Z^\top Z)^\dagger z\big].$$

To get the desired form as in Lemma 3.3, we again introduce the random variables $\tilde\beta_0, \tilde\delta, \tilde X, \tilde X_1, \tilde Z, \tilde Z_1$ and $\tilde z$ as in Section B.2.
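Also, define the quantity

$$C_{\tilde\beta,\tilde\delta} \triangleq 2\,\mathbb{E}_{x\sim P_X}\big[(z^\top (\tilde Z^\top \tilde Z)^\dagger \tilde Z^\top \tilde X - \tilde x^\top)\tilde\beta_0 \tilde\delta^\top \tilde X_1^\top \tilde Z_1 (\tilde Z^\top \tilde Z)^\dagger z\big].$$

Following a similar argument as in Section B.2, we have $C_{\beta,\delta} = \mathbb{E}_{O\sim H_d}\{C_{\tilde\beta,\tilde\delta}\}$. Next, note that

$$C_{\beta,\delta} = 2\,\mathbb{E}\big[\mathrm{Tr}\big(\tilde X_1^\top \tilde Z_1 (\tilde Z^\top \tilde Z)^\dagger z\,(z^\top (\tilde Z^\top \tilde Z)^\dagger \tilde Z^\top \tilde X - \tilde x^\top)\tilde\beta_0 \tilde\delta^\top\big)\big].$$

Recall that we need to analyze

$$C_{\beta,\delta} \triangleq 2\,\mathbb{E}_{x\sim P_X}\big[(z^\top (Z^\top Z)^\dagger Z^\top X - x^\top)\beta_0 \delta^\top X_1^\top Z_1 (Z^\top Z)^\dagger z\big]$$

to fully characterize the risk asymptotically. To begin with, the vector $\delta$ can be written as a direct sum of two components $\delta_1$ and $\delta_2$, where $\delta_2$ is the orthogonal projection of $\delta$ onto the orthogonal complement of $\mathrm{span}(\{\beta_0\})$. In particular we have $\delta = \delta_1 + \delta_2$, and $\delta_2 = (I_d - \beta_0\beta_0^\top/\|\beta_0\|_2^2)\delta$. Thus, there exists $\mu \in \mathbb{R}$ such that $\delta_1 = \mu\beta_0$, and $C_{\beta,\delta}$ can be decomposed in the following way:

$$\begin{aligned} C_{\beta,\delta} &= \underbrace{2\mu\,\mathbb{E}_{x\sim P_X}\big[\beta_0^\top X^\top Z (Z^\top Z)^\dagger z z^\top (Z^\top Z)^\dagger Z_1^\top X_1 \beta_0\big]}_{C^{(1,1)}_{\beta,\delta}} - \underbrace{2\mu\,\mathbb{E}_{x\sim P_X}\big[\beta_0^\top X_1^\top Z_1 (Z^\top Z)^\dagger z x^\top \beta_0\big]}_{C^{(1,2)}_{\beta,\delta}} \\ &\quad + \underbrace{2\,\mathbb{E}_{x\sim P_X}\big[(z^\top (Z^\top Z)^\dagger Z^\top X - x^\top)\beta_0 \delta_2^\top X_1^\top Z_1 (Z^\top Z)^\dagger z\big]}_{C^{(2)}_{\beta,\delta}}. \end{aligned}$$

As $\beta_0^\top \delta_2 = 0$, following the arguments in Section B.2 we have $C^{(2)}_{\beta,\delta} = 0$. To analyze the terms $C^{(1,1)}_{\beta,\delta}$ and $C^{(1,2)}_{\beta,\delta}$, we will again analyze their corresponding ridge equivalents

$$C^{(1,1)}_{\beta,\delta}(\lambda) := \frac{2\mu}{d}\,\mathbb{E}_x\big[\beta_0^\top X^\top Z \Psi \tilde U \Psi Z_1^\top X_1 \beta_0\big], \qquad C^{(1,2)}_{\beta,\delta}(\lambda) := \frac{2\mu}{\sqrt{d}}\,\mathbb{E}_x\big[\beta_0^\top X_1^\top Z_1 \Psi u x^\top \beta_0\big],$$

where $\Psi$ and $\tilde U$ are defined in Equation (B.5) and Equation (B.7) respectively and $u = \sqrt{d}\,z$. We focus on these terms separately. To start with, by an application of the exchangeability argument we note that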

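$$\mathbb{E}(C^{(1,1)}_{\beta,\delta}(\lambda)) = \frac{2\mu F_\beta^2}{d^2}\,\mathbb{E}\,\mathrm{Tr}\big[X^\top Z \Psi U \Psi Z_1^\top X_1\big] = 2\mu F_\beta^2\big[n_1 f(1, 1) + n_1(n_1 - 1) f(1, 2) + n_0 n_1 f(1, 2)\big].$$

Using Equation (B.21) and Equation (B.23) it can be deduced that

$$\mathbb{E}(C^{(1,1)}_{\beta,\delta}(\lambda)) = 2\mu F_\beta^2\, \pi\, \Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1) + o_d(1). \tag{B.29}$$

Let us denote by $V$ the matrix $\mathbb{E}(u x^\top)$. This shows that

$$\mathbb{E}(C^{(1,2)}_{\beta,\delta}(\lambda)) = \frac{2\mu}{\sqrt{d}}\,\mathbb{E}\big[\beta_0^\top X_1^\top Z_1 \Psi V \beta_0\big].$$

Again by the exchangeability argument we get

$$\mathbb{E}(C^{(1,2)}_{\beta,\delta}(\lambda)) = \frac{n_1}{n}\cdot\frac{2\mu}{\sqrt{d}}\,\mathbb{E}\big[\beta_0^\top X^\top Z \Psi V \beta_0\big].$$

Let us define $T_1 := \frac{1}{\sqrt{d}}\,\mathbb{E}[\beta_0^\top X^\top Z \Psi V \beta_0]$. By the arguments of Section 9.1 in Mei and Montanari (2019a), we can also conclude that

$$T_1 = \frac{F_\beta^2}{2}\,\partial_p g(i(\psi_1\psi_2\lambda)^{1/2}; 0) + o_d(1),$$

where the function $g$ is defined in (Mei and Montanari, 2019a, Equation (8.19)). Thus we have

$$\mathbb{E}(C^{(1,2)}_{\beta,\delta}(\lambda)) = \mu F_\beta^2\, \pi\, \partial_p g(i(\psi_1\psi_2\lambda)^{1/2}; 0) + o_d(1).$$

This along with Equation (B.29) yields the following:

$$\mathbb{E}(C_{\beta,\delta}) = \mu\pi F_\beta^2\big[2\Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1) - \partial_p g(i(\psi_1\psi_2\lambda)^{1/2}; 0)\big] + o_d(1) = \mu\pi F_\beta^2 (B^\star - 1 + \Psi_2^\star) + o_d(1).$$

Finally, by simple algebra it follows that $\mu = \frac{F_\delta}{F_\beta}\cos\phi_{\beta,\delta}$, where $\phi_{\beta,\delta} = \arccos\big(\frac{\langle\beta, \delta\rangle}{F_\beta F_\delta}\big)$. This shows that $\mu F_\beta^2 = F_\beta F_\delta \cos(\phi_{\beta,\delta}) = \beta_0^\top\delta = F_{\beta,\delta}$. Now recalling Equation (B.28) we have

$$\lim_{d\to\infty} \mathbb{E}[C_{\beta,\delta}] = \pi(B^\star - 1 + \Psi_2^\star).$$

This finishes the proof of Lemma 3.8.

B.4 FINAL FORM OF LIMITING RISK

Finally, gathering all the results from Sections B.1, B.2 and B.3, we have

$$\lim_{\lambda\to 0+}\lim_{d\to\infty} \mathbb{E}(R_0(\hat a)) = F_\beta^2 B^\star + F_\delta^2 M_1^\star + F_{\beta,\delta} M_2^\star + \tau^2 V^\star,$$

with $C^\star = \pi(B^\star - 1 + \Psi_2^\star)$. This concludes the proof of Theorem 3.9.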

C UNIFORM DISTRIBUTION ON SPHERE AND HAAR MEASURE

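In this section we will discuss some useful results related to the uniform distribution on the sphere and the Haar measure $H_d$ on $O(d)$.

Lemma C.1. Let $U \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$ and $O \in O(d)$. Then $OU \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$. Also, if $O \sim H_d$ and is independent of $U$, then also $OU \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$.

Proof. Note that $U \overset{d}{=} \sqrt{d}\,G/\|G\|_2$, where $G \sim N(0, I_d)$. Next define $\tilde G := OG$. By the property of Gaussian random vectors we have $\tilde G \overset{d}{=} G$. Now the result follows from the following:

$$OU \overset{d}{=} \sqrt{d}\,OG/\|G\|_2 = \sqrt{d}\,\tilde G/\|\tilde G\|_2 \overset{d}{=} U.$$

Now let $O \sim H_d$ be a random matrix independent of $U$. Thus conditional on $O$, we have $OU \mid O \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$. Thus, unconditionally we have $OU \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$.

Lemma C.2. Let $\beta, \delta \in \mathbb{R}^d$ be such that $\beta^\top\delta = F_{\beta,\delta}$. Also, define the random vectors $\tilde\beta = O\beta$ and $\tilde\delta = O\delta$, where $O \sim H_d$. Then the following are true: $\mathbb{E}(\tilde\beta\tilde\delta^\top) = \frac{F_{\beta,\delta}}{d} I_d$, $\mathbb{E}(\tilde\beta\tilde\beta^\top) = \frac{\|\beta\|_2^2}{d} I_d$ and $\mathbb{E}(\tilde\delta\tilde\delta^\top) = \frac{\|\delta\|_2^2}{d} I_d$.

Proof. Let $O = (o_1, o_2, \ldots, o_d)^\top$, i.e., $o_i$ is the $i$th row of $O$. Let $T = \mathbb{E}(\tilde\beta\tilde\delta^\top) = \mathbb{E}(O\beta\delta^\top O^\top)$. Also, by the property of the Haar measure, we have $\Pi O \overset{d}{=} O$ for any permutation matrix $\Pi$. This shows that $\{o_i\}_{i=1}^d$ are exchangeable. As a consequence we have

$$\mathbb{E}(o_i o_i^\top) = \frac{\sum_{k=1}^d \mathbb{E}(o_k o_k^\top)}{d} = \frac{\mathbb{E}(O^\top O)}{d} = \frac{1}{d} I_d \quad \text{for all } i \in [d].$$

Now we are equipped to compute the matrix $T$. Note that

$$T_{ii} = \mathbb{E}(o_i^\top \beta\delta^\top o_i) = \mathbb{E}\{\delta^\top o_i o_i^\top \beta\} = \delta^\top\beta/d = F_{\beta,\delta}/d.$$

Also, for $i \ne j$, we similarly get $T_{ij} = \mathbb{E}\{\delta^\top o_j o_i^\top \beta\}$. Now again by the property of the Haar measure we know $O \overset{d}{=} (I_d - 2e_i e_i^\top)O$, where $e_i$ denotes the $i$th canonical basis vector of $\mathbb{R}^d$. This implies that $o_j o_i^\top \overset{d}{=} -o_j o_i^\top$. Using this property of $H_d$, we also get $\mathbb{E}(o_j o_i^\top) = 0$. Thus, we get $T_{ij} = 0$ and this concludes the proof for uncorrelatedness. Next, we define $V_\beta := \mathbb{E}(\tilde\beta\tilde\beta^\top)$. By a similar argument, it also follows that

$$(V_\beta)_{i,j} = \frac{\|\beta\|_2^2}{d}\,\Delta_{ij},$$

where $\Delta_{ij}$ is the Kronecker delta function. The result for $\delta$ can be shown following exactly the same recipe.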
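The identity $\mathbb{E}(O\beta\delta^\top O^\top) = (\beta^\top\delta/d)\, I_d$ derived above can be checked by Monte Carlo. The sketch below samples Haar-distributed orthogonal matrices through a QR decomposition of a Gaussian matrix with the usual sign correction on the diagonal of $R$. The function name, the test vectors and the number of draws are illustrative assumptions.

```python
import numpy as np

def sample_haar_orthogonal(d, rng):
    """Draw O from the Haar measure on O(d) via QR of a Gaussian matrix with sign correction."""
    G = rng.standard_normal((d, d))
    Qmat, R = np.linalg.qr(G)
    return Qmat * np.sign(np.diag(R))   # rescale column j by sign(R_jj)

rng = np.random.default_rng(0)
d, draws = 4, 20000
beta = np.array([1.0, 2.0, 0.0, -1.0])
delta = np.array([0.5, 1.0, 1.0, 0.0])

T_hat = np.zeros((d, d))
for _ in range(draws):
    O = sample_haar_orthogonal(d, rng)
    T_hat += np.outer(O @ beta, O @ delta)
T_hat /= draws                           # Monte Carlo estimate of E(O beta delta^T O^T)

T_theory = (beta @ delta / d) * np.eye(d)
```

With these vectors $\beta^\top\delta/d = 0.625$, so `T_hat` should be close to $0.625\, I_4$ up to Monte Carlo error.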

D RANDOM FEATURE CLASSIFICATION MODEL

In this section, we consider an overparameterized classification problem. We consider a two-group setup: $g \in \{0, 1\}$. Without loss of generality, we assume $\pi > \frac{1}{2}$ (so the $g = 1$ group is the majority group). The data generating process of the training examples $\{(x_i, y_i)\}_{i=1}^n$ is

$$g_i \sim \mathrm{Ber}(\pi), \qquad x_i \sim N(0, I), \qquad y_i \mid x_i, g_i \leftarrow \begin{cases} +1, & \text{w.p. } f(x_i^\top\beta_0)\mathbf{1}\{g_i = 0\} + f(x_i^\top\beta_1)\mathbf{1}\{g_i = 1\}, \\ -1, & \text{w.p. } 1 - f(x_i^\top\beta_0)\mathbf{1}\{g_i = 0\} - f(x_i^\top\beta_1)\mathbf{1}\{g_i = 1\}, \end{cases} \tag{D.1}$$

where $\beta_0, \beta_1 \in \mathbb{R}^d$ are the vectors of coefficients of the minority and majority groups respectively, and $f(t) = (1 + e^{-t})^{-1}$ is the sigmoid function. Consider a random feature classification model; that is, for a newly generated sample $(x_{n+1}, y_{n+1})$, the classifier predicts a label $\hat y_{n+1}$ for the new sample as

$$\hat y_{n+1} = \mathrm{sign}\Big(\sum_{j=1}^N a_j\, \sigma(\theta_j^\top x_{n+1}/\sqrt{d})\Big).$$



Figure 1: Trends for bias and variance in the minority ERM risk for varying overparameterization $\gamma$. Left: variances for varying $\gamma$. Middle: biases for different $\gamma$ and $\theta$, where $\theta$ is the angle between $\beta_0$ and $\beta_1$. Here the majority group proportion $\pi$ is set at 0.8. Right: biases for different $\pi$'s and $\gamma$'s when $\theta$ is set at $180^\circ$. The solid lines are the theoretical predictions and the points with error bars represent the empirical predictions over 10 random runs.

Figure 2: Difference in minority risk between subsampling and ERM. Left: the differences in minority risks for several values of $\theta$ (the angle between $\beta_0$ and $\beta_1$) and overparameterization $\gamma$, where the majority proportion $\pi$ is set at 0.8. Right: the minority risk differences for several values of $\pi$ and $\gamma$ when $\theta$ is set at $180^\circ$.

Figure 3: Classification error against overparameterization level $\gamma$ of the majority and minority group for various angles $\theta$ between the coefficients of the two groups under the random feature classification model.

Figure 4: Minority group prediction error for the ridgeless least squares ERM in the isotropic features case. The left plot considers the setup with varying $\pi$ when the angle between $\beta_0$ and $\beta_1$ is set at $\theta = 180^\circ$. The right plot sets $\pi = 0.8$ and considers different values of $\theta$. Here the SNR is set at $\|\beta_0\|_2^2/\tau^2 = \|\beta_1\|_2^2/\tau^2 = 10$.

E Γ E x Var ϵ f (x; â(λ); Θ) follows from Lemma 3.3, which basically uses the uncorrelatedness of β 0 and δ. Now we define the matrix Ψ := (Z ⊤ Z + λψ 1,d ψ 2,d I N ) -1 .

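$\psi_1, \psi_2, \lambda, \mu_\star, \mu_1) - \partial_p g(i(\psi_1\psi_2\lambda)^{1/2}; 0)] + o_d(1) = \mu\pi F_\beta^2 (B^\star - 1 + \Psi_2^\star) + o_d(1)$. Finally, by simple algebra it follows that $\mu = \frac{F_\delta}{F_\beta}\cos\phi_{\beta,\delta}$, where $\phi_{\beta,\delta} = \arccos\big(\frac{\langle\beta, \delta\rangle}{F_\beta F_\delta}\big)$. This shows that $\mu F_\beta^2 = F_\beta F_\delta \cos(\phi_{\beta,\delta}) = \beta_0^\top\delta = F_{\beta,\delta}$. Now recalling Equation (B.28) we have $\lim_{d\to\infty} \mathbb{E}[C_{\beta,\delta}] = \pi(B^\star - 1 + \Psi_2^\star)$. This finishes the proof of Lemma 3.8.

B.4 FINAL FORM OF LIMITING RISK

Finally, gathering all the results from Sections B.1, B.2 and B.3, we have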

Let $U \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$ and $O \in O(d)$. Then $OU \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$. Also, if $O \sim H_d$ and is independent of $U$, then also $OU \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$.

Proof. Note that $U \overset{d}{=} \sqrt{d}\,G/\|G\|_2$, where $G \sim N(0, I_d)$. Next define $\tilde G := OG$. By the property of Gaussian random vectors we have $\tilde G \overset{d}{=} G$. Now the result follows from the following: let $O \sim H_d$ be a random matrix independent of $U$. Thus conditional on $O$, we have $OU \mid O \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$. Thus, unconditionally we have $OU \sim \mathrm{Unif}\{S^{d-1}(\sqrt{d})\}$.

Lemma C.2. Let $\beta, \delta \in \mathbb{R}^d$ be such that $\beta^\top\delta = F_{\beta,\delta}$. Also, define the random vectors $\tilde\beta = O\beta$ and $\tilde\delta = O\delta$, where $O \sim H_d$. Then the following are true: Let $O = (o_1, o_2, \ldots, o_d)^\top$, i.e., $o_i$ is the $i$th row of $O$. Let $T = \mathbb{E}(\tilde\beta\tilde\delta^\top) = \mathbb{E}(O\beta\delta^\top O^\top)$. Also, by the property of the Haar measure, we have $\Pi O \overset{d}{=} O$ for any permutation matrix $\Pi$. This shows that $\{o_i\}_{i=1}^d$ are exchangeable. As a consequence we have

,δ /d. Also, for $i \neq j$, we similarly get $T_{ij} = E\{\delta^\top o_j o_i^\top \beta\}$. Now again by the invariance property of the Haar measure we know $O \overset{d}{=} (I_d - 2 e_i e_i^\top) O$, where $e_i$ denotes the $i$th canonical basis vector of $\mathbb{R}^d$. This implies that $o_j o_i^\top \overset{d}{=} -o_j o_i^\top$. Using this property of $H_d$, we also get E(o j o ⊤
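As a numerical sanity check of the rotation-averaging argument above, the following sketch estimates $T = E(O \beta \delta^\top O^\top)$ by Monte Carlo and compares it with $(\beta^\top \delta / d) I_d$, the matrix with diagonal entries $F_{\beta,\delta}/d$ and vanishing off-diagonal entries. The helper `haar_orthogonal` (QR of a Gaussian matrix with a sign correction), the particular vectors and the number of draws are illustrative assumptions and are not part of the original analysis.

```python
import numpy as np

def haar_orthogonal(d, rng):
    """Draw O from the Haar measure on O(d) via QR of a Gaussian matrix."""
    g = rng.standard_normal((d, d))
    q, r = np.linalg.qr(g)
    # Rescale each column by the sign of the matching diagonal entry of R,
    # which removes the sign ambiguity of QR and makes the law of q Haar.
    return q * np.sign(np.diag(r))

def rotated_cross_moment(beta, delta, n_draws=20000, seed=0):
    """Monte Carlo estimate of T = E[(O beta)(O delta)^T] with O ~ Haar."""
    rng = np.random.default_rng(seed)
    d = beta.shape[0]
    total = np.zeros((d, d))
    for _ in range(n_draws):
        o = haar_orthogonal(d, rng)
        total += np.outer(o @ beta, o @ delta)
    return total / n_draws

# Hypothetical parameter vectors chosen only for illustration.
beta = np.array([1.0, 2.0, 0.0, -1.0])
delta = np.array([0.5, 1.0, 1.0, 0.0])
d = beta.shape[0]

T_hat = rotated_cross_moment(beta, delta)
T_theory = (beta @ delta / d) * np.eye(d)  # (F_{beta,delta} / d) * I_d
max_dev = np.abs(T_hat - T_theory).max()
print("largest entrywise deviation:", max_dev)
```

The deviation should shrink at the usual Monte Carlo rate as `n_draws` grows; the check does not replace the exchangeability and sign-flip arguments, it only illustrates them.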

Table of notations

Thus, conditioning on x and z and applying Lemma C.2, the result follows, i.e.,

Using Equation (B.21) and Equation (B.23) it can be deduced that

(λ)) = $2 \mu F_\beta^2 \pi \Psi_2^\star(\xi, \psi_1, \psi_2, \lambda, \mu_\star, \mu_1) + o_d(1)$. (B.29)

Figure 5: Minority error difference between subsampling and ERM for the ridgeless least squares in the isotropic features case. The left plot varies π with the angle between $\beta_0$ and $\beta_1$ fixed at θ = 180°. The right plot sets π at 0.8 and considers different values of θ. Here the SNR is set at $\|\beta_0\|_2^2/\tau^2 = \|\beta_1\|_2^2/\tau^2 = 5$.

A.6 PROOFS OF LEMMAS A.2 AND A.5

A.6.1 PROOF OF LEMMA A.2

Denoting $U = (X^\top X/n)^\dagger X^\top/n$ we note that the estimation error $\hat\beta - \beta_0$ is

we have the following decomposition of the ERM minority risk

By the linearity of the expectation and trace operators, the term II is equal to

By properties of the Moore-Penrose pseudoinverse, the terms III and IV are equal to

respectively. The term V is 0 since $E[\epsilon] = 0$. Hence we complete the proof. □

A.6.2 PROOF OF LEMMA A.5

We first prove for the cross term that

To see this, we notice that for any orthogonal matrix $O$ it holds

where $X$ is replaced by $X_O = XO$, and the replacement in $X$ changes the covariance matrix to $\hat\Sigma_O$ and its Moore-Penrose inverse to

We now let $O$ be uniformly distributed over the set of all $d \times d$ orthogonal matrices (distributed according to the Haar measure), which results in

To calculate the term

we notice that

and the terms $\mathrm{Tr}\{\hat\Sigma (\hat\Sigma^\dagger)^2 (\frac{1}{n} x_i x_i^\top)\}$ are identically distributed. We appeal to an exchangeability argument to conclude

where we notice that $\hat\Sigma \hat\Sigma^\dagger$ is a projection matrix and conclude

If $d < n$, i.e. $\gamma < 1$, then $\hat\Sigma$ is invertible and it holds $\hat\Sigma \hat\Sigma^\dagger = I_d$. If $d > n$ then there are exactly $n$ non-zero eigenvalues in $\hat\Sigma \hat\Sigma^\dagger$ and it holds $\mathrm{Tr}\{\hat\Sigma \hat\Sigma^\dagger\} = n$. Combining the results we obtain

⊤ δ] we first use a similar argument to (A.3) and obtain

We rewrite

where we define
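Two facts used in these proofs are easy to verify numerically. The sketch below checks that the projection built from the sample covariance and its Moore-Penrose inverse has trace $\min(n, d)$, and that applying $U$ to a response vector gives the minimum-norm least squares solution. It assumes that the covariance matrix in the proof is the sample covariance $X^\top X/n$ appearing in the definition of $U$; the Gaussian design, the sample sizes and the explicit `rcond` cutoff are illustrative choices.

```python
import numpy as np

def ridgeless_summary(n, d, seed=0):
    """Return Tr(Sigma Sigma^+) and the gap between U y and pinv(X) y."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))   # isotropic Gaussian design (assumption)
    y = rng.standard_normal(n)        # arbitrary responses, used only to compare estimators

    sigma_hat = X.T @ X / n                            # sample covariance, d x d
    sigma_pinv = np.linalg.pinv(sigma_hat, rcond=1e-10)
    U = sigma_pinv @ X.T / n                           # U = (X^T X / n)^+ X^T / n

    beta_hat = U @ y                                   # ridgeless estimator
    beta_min_norm = np.linalg.pinv(X, rcond=1e-10) @ y # minimum-norm least squares

    trace_proj = np.trace(sigma_hat @ sigma_pinv)      # trace of the projection
    gap = np.abs(beta_hat - beta_min_norm).max()
    return trace_proj, gap

# Underparameterized (d < n) and overparameterized (d > n) examples.
trace_under, gap_under = ridgeless_summary(n=60, d=20)
trace_over, gap_over = ridgeless_summary(n=20, d=60)
print("d < n:", trace_under, gap_under)
print("d > n:", trace_over, gap_over)
```

In the first case the trace equals $d$ because the sample covariance is invertible, and in the second it equals $n$, the number of non-zero eigenvalues.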

B.2 BIAS-VARIANCE DECOMPOSITION UNDER ORTHOGONALITY OF PARAMETERS

In this section we will demonstrate that the same bias-variance decomposition in Lemma 3.3 continues to hold under a rather weaker assumption than Assumption 3.2. Specifically, in this section we only assume that the parameter vectors $\beta_0$ and $\delta$ are orthogonal, i.e., $\beta_0^\top \delta = 0$. The following lemma shows that under this orthogonality condition, the desired bias-variance decomposition still holds.

Lemma B.2. Define the following:

Proof. Similar to Equation (B.4), here also we have

The key thing here is to note the following

This motivates us to consider the change of variables in Equation (B.27), where $O$ is sampled from the Haar measure $H_d$ on $O(d)$ and is independent of $X$, $\Theta$, $\epsilon$. Also, note that due to Lemma C.1 and Lemma C.2, the random vectors $\tilde\beta_0$ and $\tilde\delta$ satisfy the setup of Section B.1. Thus the result follows immediately by noting the fact that

where the last inequality follows from taking $\lambda \to 0$ in Equation (B.11) and Equation (B.12).

B.3 LIMITING RISK FOR GENERAL β 0 AND δ

Unlike the previous section, this section studies the asymptotic risk $R_0(\hat a)$ when the parameters $\beta_0$ and $\delta$ are in general position. Specifically, in this section we relax the assumption that $\beta_0^\top \delta = 0$. Thus, the only difficulty arises in analysing the cross-covariance term3) as it does not vanish in the absence of the orthogonality assumption. Thus, essentially it boils down to showing the results in Lemma 3.3 and Lemma 3.8.

B.3.1 PROOF OF LEMMA 3.3

We begin with the proof of Lemma 3.3. We recall the right-hand side of Equation (3.3), i.e.,

where σ(·) is a non-linear activation function and $N$ is the number of random features considered in the model. In the overparameterized setting ($n \le N$), we train the random feature classification model by solving the following hard-margin SVM problem:

where

We wish to study the disparity between the asymptotic risks (i.e., test-time classification errors) of $\hat a$ on the minority and majority groups, that is, comparing

Therefore, γ encodes the level of overparameterization.

In the simulation for Figure 3, we let σ(·) be the ReLU activation function and $\theta_{i,j}$ be IID standard normal. Moreover, we let π = 0.95, $\beta_0 = 10 e_1$, $\beta_1 = 10\cos(\theta) e_1 + 10\sin(\theta) e_2$, $n = 400$, $d = 200$, $N = \gamma n$, where $e_1$ and $e_2$ are the first two standard basis vectors of $\mathbb{R}^d$. We tune hyperparameters θ ∈ {0.5, 2, 2.5, 3, 3.5, 4, 4.5, 5, 5.5, 6}, then report test errors averaged over 20 replicates.
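The simulation design for Figure 3 can be sketched as follows. The block only builds the coefficient vectors, the group indicator and the ReLU random feature matrix with the stated values of π, $n$, $d$ and $N = \gamma n$; it does not reproduce the label model or the hard-margin SVM fit, which are not specified in this part of the text. Several details are assumptions made for illustration: the covariates are isotropic Gaussian, π is read as the majority proportion, the pre-activations are scaled by $1/\sqrt{d}$, and the function name `make_rf_design` is hypothetical.

```python
import numpy as np

def make_rf_design(gamma, theta, n=400, d=200, pi=0.95, seed=0):
    """Build the random feature design for one (gamma, theta) configuration."""
    rng = np.random.default_rng(seed)
    N = int(round(gamma * n))                 # number of random features, N = gamma * n

    # Group coefficient vectors: beta0 = 10 e1, beta1 = 10 cos(theta) e1 + 10 sin(theta) e2.
    beta0 = np.zeros(d)
    beta0[0] = 10.0
    beta1 = np.zeros(d)
    beta1[0] = 10.0 * np.cos(theta)
    beta1[1] = 10.0 * np.sin(theta)

    X = rng.standard_normal((n, d))           # assumption: isotropic Gaussian covariates
    is_minority = rng.random(n) >= pi         # assumption: pi is the majority proportion

    Theta = rng.standard_normal((N, d))       # IID standard normal random weights
    # ReLU random features; the 1/sqrt(d) scaling is an assumption.
    Z = np.maximum(X @ Theta.T / np.sqrt(d), 0.0)
    return Z, is_minority, beta0, beta1

theta = np.pi / 2                             # example angle between the two groups
Z, is_minority, beta0, beta1 = make_rf_design(gamma=2.0, theta=theta)
cos_angle = beta0 @ beta1 / (np.linalg.norm(beta0) * np.linalg.norm(beta1))
print(Z.shape, is_minority.mean(), cos_angle)
```

A classifier trained on `Z` would then be evaluated separately on the rows with `is_minority` true and false to obtain the two group-wise errors.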

