MAX-MARGIN WORKS WHILE LARGE MARGIN FAILS: GENERALIZATION WITHOUT UNIFORM CONVERGENCE

Abstract

A major challenge in modern machine learning is theoretically understanding the generalization properties of overparameterized models. Many existing tools rely on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Nagarajan & Kolter (2019b) show that in certain simple linear and neural-network settings, any uniform convergence bound will be vacuous, leaving open the question of how to prove generalization in settings where UC fails. Our main contribution is proving novel generalization bounds in two such settings, one linear, and one non-linear. We study the linear classification setting of Nagarajan & Kolter (2019b), and a quadratic ground truth function learned via a two-layer neural network in the non-linear regime. We prove a new type of margin bound showing that above a certain signal-to-noise threshold, any near-max-margin classifier will achieve almost no test loss in these two settings. Our results show that being near-max-margin is important: while any model that achieves at least a (1 − ϵ)-fraction of the max-margin generalizes well, a classifier achieving half of the max-margin may fail terribly. Our analysis provides insight on why memorization can coexist with generalization: we show that in this challenging regime where generalization occurs but UC fails, near-max-margin classifiers contain both some generalizable components and some overfitting components that memorize the data. The presence of the overfitting components is enough to preclude UC, but the near-extremal margin guarantees that sufficient generalizable components are present. We leverage near-max-margins in a unified way for both the linear and nonlinear settings, and we hope that this approach will be useful more broadly in overparameterized settings.
In the challenging regime of generalization without UC, good learned models contain some generalizable signal components and some overfitting components that memorize the data. Our main technique is to show that any near-max-margin solution has to contain both signal components and overfitting components. The overfitting component causes UC to fail but, fortunately, has a reduced influence on a random test example, whereas the signal component has a similar influence on training and test examples.

1. INTRODUCTION

A central challenge of machine learning theory is understanding the generalization of overparameterized models. While in many real-world settings deep networks achieve low test loss, their high capacity makes theoretical analysis with classical tools difficult, or sometimes impossible (Zhang et al., 2017; Nagarajan & Kolter, 2019b). Most classical theoretical tools are based on uniform convergence (UC), a property that, when it holds, guarantees that the test loss will be close to the training loss, uniformly over a class of candidate models. Many generalization bounds for neural networks are built on this property, e.g. Neyshabur et al. (2015; 2017b; 2018); Harvey et al. (2017); Golowich et al. (2018). The seminal work of Nagarajan & Kolter (2019b) gives theoretical and empirical evidence that UC cannot hold in natural overparameterized linear and neural network settings. The impossibility results of Nagarajan and Kolter are strong: they rule out even UC on the smallest reasonable algorithm-dependent family of models, that is, the models output by the learning algorithm on typical datasets. In particular, they prove that in an overparameterized linear classification problem, models found by gradient descent will achieve small test loss, but any UC bound over these models will be vacuous. In a two-layer neural network setting, Nagarajan & Kolter (2019b) empirically demonstrate the same phenomenon for the 0/1 loss. Many margin-based generalization bounds do not technically fit into the category of UC bounds defined by Nagarajan and Kolter, but still may be intrinsically limited for similar reasons.

Figure 1: Thresholds for Uniform Convergence and Generalization. κ = n/(dσ²) captures the signal-to-noise ratio. Below κ_gen, near-max-margin solutions do not always generalize (Prop. 3.3); above κ_gen, near-max-margin solutions generalize (Thms 3.1, 3.2); below κ_uc, UC and polynomial margin bounds are impossible (Props 3.4, 3.5, 3.6).

Classical margin-based generalization bounds (see e.g.
Shalev-Shwartz & Ben-David (2014); Kakade et al. (2009)) and related margin bounds for neural networks (Wei & Ma, 2019a; 2020; Bartlett et al., 2017; Golowich et al., 2018) scale inversely polynomially in the margin size, and are typically proved via uniform convergence on a surrogate loss (e.g. the hinge loss or ramp loss) that upper bounds the 0/1 misclassification loss. Nagarajan and Kolter's results show that any UC bound on the ramp loss is vacuous in an overparameterized linear setting, suggesting (though not proving) that classical margin bounds may not be useful. Muthukumar et al. (2021) shows empirically that such margin bounds are vacuous in broader linear settings. In light of this, it is very important to develop theoretical tools to analyze generalization in settings where uniform convergence cannot yield meaningful bounds. In this paper we establish novel margin-based generalization bounds in regimes where UC provably fails. These bounds guarantee generalization in the extremal case where the model has a near-maximal margin, and thus we call them extremal margin bounds. Indeed, near-max-margin solutions are achievable by minimizing the logistic loss with weak ℓ2-regularization (Wei et al., 2019), and minimizing the unregularized logistic loss with gradient descent converges to a stationary point of the max-margin objective (Lyu & Li, 2019; Lyu et al., 2021). In linear settings, SGD converges to the max-margin solution (Nacson et al., 2019). Our results consider two settings: the linear setting of Nagarajan & Kolter (2019b), and a commonly studied quadratic problem learned by a two-layer neural network (Wei et al., 2019; Frei et al., 2022b). In Theorems 3.1 and 3.2, we prove that above a certain signal-to-noise threshold κ_gen, near-max-margin solutions will generalize. Below this threshold, max-margin solutions may not generalize (Proposition 3.3). Below a second, higher threshold κ_uc, uniform convergence fails (Proposition 3.4).
In Figure 1 we illustrate these three regions; the main significance of our results is in the challenging middle region between κ_gen and κ_uc, where generalization occurs but UC fails. Additionally, in this regime where UC fails, we show that classical margin bounds can only yield loose guarantees, even for the max-margin solution (Propositions 3.5 and 3.6). We prove this by showing the existence of models that achieve a large but not near-maximal margin (e.g., half the max-margin), but do not generalize at all. This phase transition between good-margin and near-max-margin cannot be captured by classical margin bounds, where the generalization guarantee decays inversely polynomially in the margin. Our extremal margin bounds are fundamentally different from classical margin bounds and are not based on uniform convergence. Prior works have also studied the challenging regime where uniform convergence does not work. In a linear regression setting, Zhou et al. (2020) and Koehler et al. (2021) show that the test loss can be uniformly bounded for all low-norm solutions that perfectly fit the data (this uses the data-dependent interpolation condition to improve upon UC bounds); nevertheless, Yang et al. (2021) shows that such bounds are still loose on the min-norm solution. Negrea et al. (2020) suggests an alternative framework based on uniform convergence over a less complex family of surrogate models; they use this technique to show generalization in a linear setting and in another high-dimensional problem amenable to analysis. To our knowledge, our results are the first instance of theoretically proving generalization in a neural network setting (that is not in the NTK regime) where UC provably fails. More broadly, a variety of new generalization bounds have been derived in hopes of explaining generalization in deep learning.
While none of these bounds have been explicitly proven to succeed in regimes where UC fails, they leverage additional properties of the training data or the optimization process and thus are not directly susceptible to the critiques of Nagarajan & Kolter (2019b). Among these are works that leverage properties such as Lipschitzness of the model on the training data (Arora et al., 2018; Nagarajan & Kolter, 2019a; Wei & Ma, 2019a; b), use algorithmic stability (Mou et al., 2018; Li et al., 2019a; Chatterjee & Zielinski, 2022), or information-theoretic perspectives (Negrea et al., 2019; Haghifam et al., 2021). Finally, a body of work seeks to draw connections between optimization and generalization in deep learning by studying implicit regularization effects of the optimization algorithm (see e.g. (Gunasekar et al., 2017; Li et al., 2017; Gunasekar et al., 2018a; b; Woodworth et al., 2020; Damian et al., 2021; HaoChen et al., 2020; Li et al., 2019b; Wei et al., 2020) and related references). Most relevant in this literature is the aforementioned work connecting gradient descent and max-margin solutions.

2. PRELIMINARIES

Our work achieves results in two settings. The first is a linear setting previously studied by Nagarajan & Kolter (2019b), where both the ground truth and the trained model are linear. In the second, nonlinear setting, studied before by Wei et al. (2019); Frei et al. (2022b), the ground truth is quadratic, and the trained model is a two-layer neural network. In both settings, the data is drawn from a product distribution on features involved in the ground truth labeling function, and "junk" features orthogonal to the signal. We formalize the two settings below.

Linear setting
▶ Data Distribution. Fix some ground truth unit vector direction μ ∈ R^d. Let x = z + ξ, where z ∼ Uniform({μ, −μ}) and ξ is uniform on the sphere of radius √(d−1)·σ in the (d−1)-dimensional subspace orthogonal to the direction μ. Let y = μ^T x, such that y = 1 with probability 1/2 and −1 with probability 1/2. We denote this distribution of (x, y) on R^d × {−1, 1} by D_{μ,σ,d}.
▶ Model. We learn a model w ∈ R^d that predicts ŷ = sign(f_w(x)), where f_w(x) = w^T x.

Setting for Two-Layer Neural Network Model with Quadratic "XOR" Ground Truth
▶ Data Distribution. Fix some orthogonal ground truth unit vector directions μ1 and μ2 in R^d. Let x = z + ξ, where z ∼ Uniform({μ1, −μ1, μ2, −μ2}) and ξ is uniform on the sphere of radius √(d−2)·σ in the (d−2)-dimensional subspace orthogonal to the directions μ1 and μ2. Let y = (μ1^T x)² − (μ2^T x)² (see Figure 2 (left)). We denote this distribution of (x, y) on R^d × {−1, 1} by D_{μ1,μ2,σ,d}. We call this the XOR problem because y = XOR((μ1 + μ2)^T x, (−μ1 + μ2)^T x). For instance, if μ1 = e1 and μ2 = e2, then y = x1² − x2². As can be seen in Figure 2 (left), this distribution is not linearly separable, and so one must use a nonlinear model to learn in this setting.
▶ Model. Fix a ∈ {−1, 1}^m so that Σ_i a_i = 0.
The model is a two-layer neural network with m hidden units and activation function ϕ, parameterized by W ∈ R^{m×d}. W (which will be learned) represents the weights of the first layer, and a (which is fixed) represents the second-layer weights. The model predicts f_W(x) = Σ_{i=1}^m a_i·ϕ(w_i^T x), where w_i ∈ R^d denotes the i'th row of W. We work with activations ϕ of the form ϕ(z) = max(0, z)^h for h ∈ [1, 2), and require that m is divisible by 4.¹

We define a problem class of distributions to be a set of data distributions. In this paper, we work with the linear problem class Ω^{linear}_{σ,d} := {D_{μ,σ,d} : μ ∈ R^d, ∥μ∥ = 1}, and the quadratic problem class Ω^{XOR}_{σ,d} := {D_{μ1,μ2,σ,d} : μ1 ⊥ μ2 ∈ R^d, ∥μ1∥ = ∥μ2∥ = 1}. Here ∥·∥ denotes the ℓ2 norm. We will sometimes abuse notation and say that x ∼ D instead of saying that (x, y) ∼ D. Before proceeding, we make some comments on the parameter settings and compare to related work.
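As a concrete illustration, the two data distributions and the two-layer model above can be sampled and evaluated in a few lines of NumPy. This is our own sketch (all function names are ours, not the paper's):

```python
# Illustrative samplers for D_{mu,sigma,d} (linear) and D_{mu1,mu2,sigma,d} (XOR),
# plus the two-layer relu^h model. Our own sketch of the paper's setup.
import numpy as np

def sample_junk(rng, n, d, k, sigma, signal_dirs):
    """Junk noise uniform on the sphere of radius sqrt(d-k)*sigma, orthogonal to
    the k signal directions (orthonormal columns of signal_dirs, shape d x k)."""
    g = rng.standard_normal((n, d))
    g -= g @ signal_dirs @ signal_dirs.T          # project out the signal span
    g /= np.linalg.norm(g, axis=1, keepdims=True)
    return np.sqrt(d - k) * sigma * g

def sample_linear(rng, n, d, sigma, mu):
    z = rng.choice([-1.0, 1.0], size=n)           # signal component is +/- mu
    xi = sample_junk(rng, n, d, 1, sigma, mu[:, None])
    x = z[:, None] * mu + xi
    return x, np.sign(x @ mu)                     # y = mu^T x = z

def sample_xor(rng, n, d, sigma, mu1, mu2):
    signs = rng.choice([-1.0, 1.0], size=n)
    which = rng.integers(2, size=n)               # pick +/-mu1 or +/-mu2
    z = np.where(which[:, None] == 0, signs[:, None] * mu1, signs[:, None] * mu2)
    xi = sample_junk(rng, n, d, 2, sigma, np.stack([mu1, mu2], axis=1))
    x = z + xi
    y = (x @ mu1) ** 2 - (x @ mu2) ** 2           # +1 near +/-mu1, -1 near +/-mu2
    return x, np.sign(y)

def two_layer(W, a, x, h=1.5):
    """f_W(x) = sum_i a_i * relu(w_i^T x)^h for a single example x."""
    return a @ np.maximum(0.0, W @ x) ** h
```

Because the junk component is projected exactly orthogonal to the signal directions, the label formulas hold exactly on sampled data, which makes the samplers convenient for sanity-checking the results below.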

Large dimension assumption.

In both the linear and non-linear settings, our focus is an overparameterized regime where the dimension d is at least a constant factor times larger than n, the number of training samples. Such an assumption is mild relative to the assumptions made in related work, which require d = ω(n) (see e.g. Cao et al. (2021); Wang & Thrampoulidis (2020); Muthukumar et al. (2021); Shamir (2022); Chatterji & Long (2021) on linear models; for neural networks, the closest related works of Frei et al. (2022a) and Cao et al. (2022) assume that d ≥ n² or stronger). When the dimension is sufficiently large (in particular, at least ω(n)), with high probability, the max-margin solution coincides with the min-norm regression solution (see Hsu et al. (2021)), meaning the max-margin solution can be analyzed via a closed-form expression. Our work is fundamentally different from the work on linear classification which operates in the d = ω(n) regime, because in our setting, when d = Θ(n), these two solutions do not coincide.

Assumption on Data Covariance. Many works on linear classification study more general data models which allow arbitrary decay of the eigenvalues of the covariance matrix (e.g. Muthukumar et al. (2021); Wang & Thrampoulidis (2020); Cao et al. (2021)), or variance in the signal direction, that is, x^T μ ≠ y (e.g. Shamir (2022)). We work with a simpler distribution, which is still challenging because it defies existing analyses built on UC or closed-form solutions.

2.1. BACKGROUND AND DEFINITIONS ON UNIFORM CONVERGENCE

In this subsection, we provide some definitions from Nagarajan & Kolter (2019b) on algorithm-dependent UC bounds. We also provide some definitions and background on margin bounds. For a loss function L : R × R → R and a hypothesis h mapping from a domain X to R, we define the test loss on a distribution D to be L_D(h) := E_{(x,y)∼D} L(h(x), y). For a set of examples S = {(x_i, y_i)}_{i∈[n]}, we define L_S(h) := E_{i∈[n]} L(h(x_i), y_i) to be the empirical loss. Unless otherwise specified, we will use L to denote the 0/1 loss, which equals 1 if and only if the signs of the two labels disagree, that is, L(y, y′) = 1(sign(y) ≠ sign(y′)). Typically in machine learning one considers a global hypothesis class G that an algorithm may explore (e.g., the set of all two-layer neural networks). A uniform convergence bound, defined below, may hold over a smaller subset H of G, e.g. the subset of networks with bounded norm.

Definition 2.1 (Uniform Convergence Bound). A uniform convergence bound with parameter ϵ_unif for a distribution D, a set of hypotheses H, and loss L is a bound that guarantees that

Pr_{S∼D^n}[sup_{h∈H} |L_D(h) − L_S(h)| ≥ ϵ_unif] ≤ 1/4. (2.1)

A uniform convergence bound can be customized to algorithms by choosing H to depend on the implicit bias of an algorithm. For instance, if an algorithm A favors low-norm solutions, one could choose H to be the set of all classifiers with bounded norm. Of course, if H is too small, it may not be useful for proving generalization, because A will never output a solution in H. We formalize the notion of choosing a useful algorithm-dependent set H as follows.

Definition 2.2 (Useful Hypothesis Class). A hypothesis class H is useful with respect to an algorithm A and a distribution D if Pr_{S∼D^n}[A(S) ∈ H] ≥ 3/4.

Remark 2.3.
Our definition of a uniform convergence bound on a useful hypothesis class is essentially equivalent to the definition of an algorithm-dependent uniform convergence bound in Nagarajan & Kolter (2019b). We introduce new terminology since we use it later in our results on margin bounds. More generally, we can have generalization bounds that do not yield the same generalization guarantee for all elements of H. Instead, their guarantee scales with some property of the hypothesis h and the sample S. We call these data-dependent bounds. Such bounds are useful if the favorable property is satisfied with high probability by the algorithm of interest. One specific type of data-dependent bound depends on the margin achieved by the classifier on the training sample. We recall the definition of a margin:

Definition 2.4 (Margin). The margin γ(h, S) of a classifier h on a sample S equals min_{(x,y)∈S} y·h(x).

In certain parameterized hypothesis classes it is useful to define a normalized margin. If f_W is h-homogeneous, that is, f_{cW}(x) = c^h·f_W(x) for any positive scalar c, we define the normalized margin

γ̄(f_W, S) := γ(f_W, S)/∥W∥^h = γ(f_{W/∥W∥}, S), (2.2)

where we define the norm ∥W∥ to equal √(E_{i∈[m]}[∥w_i∥²]), where w_i is the i'th row of W. We will use γ*(S) to denote the maximum normalized margin. When we are discussing the linear problem, we let γ*(S) be the max-margin over all vectors w ∈ R^d with norm 1, that is, γ*(S) := sup_{w : ∥w∥2 ≤ 1} γ(f_w, S). In the XOR problem, we use γ*(S) to denote the max-margin over all weight matrices W ∈ R^{m×d} with norm 1, that is, γ*(S) := sup_{W : ∥W∥ ≤ 1} γ(f_W, S). Most classical margin bounds prove that the generalization gap can be bounded by a term that scales inversely linearly or quadratically in the margin (Koltchinskii & Panchenko, 2002; Kakade et al., 2009). More generally, we will call a margin bound in which the generalization guarantee scales with 1/γ(f_W, S)^p for a constant p a polynomial margin bound.
Such bounds usually rely on proving uniform convergence for a continuous loss that upper bounds the 0/1 loss. As we will show in the next section, such bounds are also intrinsically limited in regimes where UC fails on the 0/1 loss. In contrast, in our work we prove bounds for classifiers that achieve near-maximal margins.

Definition 2.5. A classifier h is a (1 − ϵ)-max-margin solution for S if γ(h, S) ≥ (1 − ϵ)γ*(S). We refer to a bound that holds for (1 − ϵ)-max-margin solutions as an extremal margin bound.
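To make Definitions 2.4 and 2.5 concrete, here is a small sketch (our own helper names, not the paper's) of the empirical margin, the normalized margin of an h-homogeneous two-layer network, and the near-max-margin test:

```python
# Sketch of Definitions 2.4-2.5: margin, normalized margin, (1-eps)-max-margin.
import numpy as np

def margin(f, X, y):
    """gamma(h, S) = min over (x_i, y_i) in S of y_i * f(x_i)."""
    return np.min(y * np.array([f(x) for x in X]))

def normalized_margin_two_layer(W, a, X, y, h=1.5):
    """Normalize by ||W||^h with ||W|| = sqrt(mean_i ||w_i||^2); by
    h-homogeneity (f_{cW} = c^h f_W) this equals the margin of W/||W||."""
    norm_W = np.sqrt(np.mean(np.sum(W ** 2, axis=1)))
    f = lambda x: a @ np.maximum(0.0, W @ x) ** h
    return margin(f, X, y) / norm_W ** h

def is_near_max_margin(gamma, gamma_star, eps):
    """Definition 2.5: a (1-eps)-max-margin solution."""
    return gamma >= (1 - eps) * gamma_star
```

A quick consistency check of the homogeneity argument: rescaling W by any c > 0 multiplies both the raw margin and ∥W∥^h by c^h, so the normalized margin is unchanged.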

3. MAIN RESULTS

In the following section, we state our main results for the linear and quadratic problems, and provide intuition for our findings. As illustrated in Figure 1, and in more detail in Figure 2 (right), our results show different possibilities for a near-max-margin solution depending on the size of κ := n/(dσ²), a signal-to-noise parameter, where σ, d are as in Section 2. When κ is smaller than some threshold κ_gen, we are not guaranteed to have learning: even a near-max-margin solution may not generalize. When κ exceeds κ_gen by an absolute constant and when σ² ≪ 1, our results show that any near-max-margin solution generalizes well. Finally, we show that if κ is smaller than a second threshold κ_uc, then uniform convergence approaches will fail to guarantee generalization. All of our results additionally include an overparameterization condition that d ≥ cn for a constant c, as pictured in Figure 2 (right). The exact thresholds κ_gen and κ_uc depend on the problem class of interest, but in both the linear setting and the nonlinear setting we study, we show that κ_uc > κ_gen. Thus we observe a regime where uniform convergence fails, but generalization still occurs for near-max-margin solutions. For the linear problem, we define the universal constants

κ_gen^{linear} := 0 and κ_uc^{linear} := 1. (3.1)

For the XOR problem with activation relu^h, for h ∈ [1, 2), we define the constants

κ_gen^{XOR,h} := the solution κ to 2^{1/h}·√(2/κ) = √(κ/(4 + κ)) + √(16/(κ(4 + κ))), and κ_uc^{XOR,h} := 4. (3.2)

The constants are pictured in Figure 2 (right) as a function of h. Observe that for h ∈ (1, 2), we have κ_gen^{XOR,h} < κ_uc^{XOR,h} and κ_gen^{XOR,h} > 0. When h = 1 and the activation is relu, we have κ_gen^{XOR,h} = κ_uc^{XOR,h}, and thus we do not expect a regime where uniform convergence fails but max-margin solutions generalize. We elaborate more intuitively on why h > 1 allows for generalization without UC in Section A.
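Under our reading of the defining equation in (3.2) (the radicals are easy to mis-transcribe, so treat this as an assumption), the threshold κ_gen^{XOR,h} can be computed numerically by a simple bisection; a useful sanity check is that at h = 1 it recovers κ = 4 = κ_uc^{XOR,1}, matching the h = 1 discussion above:

```python
# Numerically solve for kappa_gen^{XOR,h} from the defining equation (3.2), as
# we read it: 2^(1/h) * sqrt(2/k) = sqrt(k/(4+k)) + sqrt(16/(k*(4+k))).
# At h = 1 the solution should be k = 4 = kappa_uc^{XOR,1}.
from math import sqrt

def kappa_gen_xor(h, lo=1e-9, hi=1e3, tol=1e-10):
    def gap(k):
        lhs = 2 ** (1 / h) * sqrt(2 / k)
        rhs = sqrt(k / (4 + k)) + sqrt(16 / (k * (4 + k)))
        return lhs - rhs
    # gap is positive for tiny k and negative for large k, so bisect on the sign
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if gap(mid) > 0:
            lo = mid
        else:
            hi = mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)
```

With this reading, the threshold decreases from 4 toward 0 as h grows from 1 to 2, consistent with the text's claim that 0 < κ_gen^{XOR,h} < κ_uc^{XOR,h} = 4 for h ∈ (1, 2).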
Our first theorem states that when κ > κ_gen, any near-max-margin solution generalizes.

Theorem 3.1 (Extremal-Margin Generalization for Linear Problem). Let δ > 0. There exist constants ϵ = ϵ(δ) and c = c(δ) such that the following holds. For any n, d, σ and D ∈ Ω^{linear}_{σ,d} satisfying κ_gen^{linear} + δ ≤ κ ≤ 1/δ and d/n ≥ c, with probability 1 − 3e^{−n} over the randomness of a training set S ∼ D^n, for any w ∈ R^d that is a (1 − ϵ)-max-margin solution (as in Definition 2.5), we have L_D(f_w) ≤ e^{−n/(36dσ⁴)} + e^{−n/8}.

Attentive readers may observe that since κ_gen^{linear} = 0, Theorem 3.1 can guarantee asymptotic generalization for some sequences of parameters (n_i, d_i, σ_i)_{i≥1} even when κ_i = n_i/(d_i σ_i²) = o_{i→∞}(1), as long as σ_i² decays fast enough. In Theorem C.4 in the appendix, we state a more detailed version of this theorem which gives the exact dependence of c and ϵ on δ, yielding precise results for κ = o(1).

We prove a similar generalization result for the XOR problem learned by a two-layer neural network.

Theorem 3.2 (Extremal-Margin Generalization for XOR on Neural Network). Let h ∈ (1, 2), and let δ > 0. There exist constants ϵ = ϵ(δ) and c = c(δ) such that the following holds. For any n, d, σ and D ∈ Ω^{XOR}_{σ,d} satisfying κ = n/(dσ²) ≥ κ_gen^{XOR,h} + δ and d/n ≥ c, with probability 1 − 3e^{−n/c} over the training set S ∼ D^n, for any two-layer neural network with activation function relu^h and weight matrix W that is a (1 − ϵ)-max-margin solution (as in Definition 2.5), we have L_D(f_W) ≤ e^{−1/(cσ²)}.

This theorem guarantees meaningful results whenever σ is small enough. To see this, note that the assumptions of the theorem require that d/n ∈ [c, 1/(σ²(κ_gen^{XOR,h} + δ))]. If σ is small enough (in terms of δ), this interval is non-empty. Further, the generalization guarantee is good if σ is small enough (since exp(−1/(cσ²)) tends to 0 as σ approaches 0). For instance, consider a setting where d ≫ n and σ² = n/d. Then our theorem guarantees that L_D(f_W) ≪ 1.

Key intuitions for generalization theorems. We demonstrate the gist of the analysis for the linear problem with some simplifications. It turns out that two special solutions merit particular attention: (i) the good solution w_g = μ that generalizes perfectly, and (ii) the bad overfitting solution w_b ≈ (1/(√(nd)·σ)) Σ_j y_j ξ_j that memorizes the "junk" dimensions of the data, and satisfies ξ_i^T w_b ≈ (1/(√(nd)·σ))·y_i·∥ξ_i∥² = y_i·√(dσ²/n) for all i.² We examine the margins of the two solutions:

γ(w_g, S) = 1 and γ(w_b, S) ≈ √(dσ²/n). (3.3)

At first glance, one might conclude that when γ(w_g, S) < γ(w_b, S), the max-margin solution will be w_b, which does not generalize. However, our key observation is that any (near-)max-margin solution w always contains a mixture of both w_g and w_b.
When the w_g component is small but non-trivial and the w_b component is large, the solution can simultaneously generalize and contain a large enough overfitting component to preclude UC. More concretely, consider the margin of a linear mixture w = αw_g + βw_b satisfying α² + β² = 1, so that ∥w∥2 = 1. It is easy to see that the margin on the training set is

γ(w, S) = αγ(w_g, S) + βγ(w_b, S). (3.4)

Meanwhile, the margin on a test example x is only slightly affected by w_b:

γ(w, x) ≈ αγ(w_g, S) ± β·w_b^T x ≈ αγ(w_g, S) ± βγ(w_b, S)·√(n/d).

The effect w_b^T x of the bad solution on a test sample is smaller than its effect γ(w_b, S) on the training set by a √(n/d) factor because x is a high-dimensional random vector, and thus mostly orthogonal to w_b. Therefore, even if the margin on the training set mostly stems from the bad overfitting solution, that is, αγ(w_g, S) < βγ(w_b, S), the model may still generalize as long as αγ(w_g, S) ≥ βγ(w_b, S)·√(n/d). The optimal α, β satisfying α² + β² = 1 that maximize the margin turn out to be proportional to the respective margins, α ∝ γ(w_g, S) and β ∝ γ(w_b, S), yielding a max-margin of √(γ(w_g, S)² + γ(w_b, S)²) ≈ √((dσ² + n)/n). In other words, we should expect reasonable generalization of near-max-margin solutions as long as γ(w_g, S)/γ(w_b, S) > (n/d)^{1/4}, which by eq. (3.3) occurs when n/(dσ⁴) ≫ 1. In Appendix A, we describe the challenges that arise when adapting these intuitions to the nonlinear setting, and our techniques for overcoming them. Before proceeding to our lower bounds, observe that a typical margin bound for the linear setting would yield |L_D(w) − L_S(w)| ≤ 2∥x∥/(√n·γ(f_w, S)) ≈ 2√d·σ·√(1 + 1/κ)/√n, which is at least 2 for κ ≤ κ_uc^{linear} = 1.

² More precisely, we will choose w_b to be the rescaled min-norm vector satisfying ξ_i^T w_b = y_i for all i. This distinction is important in the case when d is only a constant factor larger than n, and the solution (1/(√(nd)·σ)) Σ_j y_j ξ_j does not necessarily correctly classify the training data.
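These back-of-the-envelope estimates are easy to check numerically. The following small simulation is our own construction, using the simplified w_b rather than the exact rescaled min-norm vector of footnote 2:

```python
# Numerical check of the key intuitions: gamma(w_g, S) = 1,
# gamma(w_b, S) ~ sqrt(d sigma^2 / n), and w_b has a much smaller
# effect on a fresh test point than on the training set.
import numpy as np

rng = np.random.default_rng(0)
n, d, sigma = 200, 4000, 0.25            # kappa = n/(d sigma^2) = 0.8 < kappa_uc
mu = np.zeros(d); mu[0] = 1.0

z = rng.choice([-1.0, 1.0], size=n)      # signal component is +/- mu
xi = rng.standard_normal((n, d)); xi[:, 0] = 0.0
xi *= np.sqrt(d - 1) * sigma / np.linalg.norm(xi, axis=1, keepdims=True)
X, y = z[:, None] * mu + xi, z

w_g = mu                                           # generalizing solution
w_b = (y @ xi) / (np.sqrt(n * d) * sigma)          # overfitting solution

gamma_g = np.min(y * (X @ w_g))                    # exactly 1 by construction
gamma_b = np.min(y * (X @ w_b))                    # roughly sqrt(d sigma^2 / n)

# mixture with alpha, beta proportional to the two margins
norm = np.hypot(gamma_g, gamma_b)
alpha, beta = gamma_g / norm, gamma_b / norm
w = alpha * w_g + beta * w_b
train_margin = np.min(y * (X @ w))                 # = alpha + beta * gamma_b

# a fresh test point: w_b is nearly orthogonal to its junk component
x_test = rng.standard_normal(d); x_test[0] = 0.0
x_test = mu + np.sqrt(d - 1) * sigma * x_test / np.linalg.norm(x_test)
print(gamma_g, gamma_b, train_margin, abs(w_b @ x_test))
```

With these parameters, w_b's inner product with the fresh test point is far below its margin on the training set, illustrating the √(n/d) dilution that the argument above relies on.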
We now proceed to present our lower bounds, which show when near-max-margin solutions may not generalize, and when UC bounds and polynomial margin bounds are impossible. If κ < κ_gen, it is possible that a near-max-margin solution does not generalize at all. Since κ_gen = 0 in the linear setting, we only state this result for the XOR problem.

Proposition 3.3 (Region where Max-Margin Generalization not Guaranteed). Let h ∈ (1, 2), and let ϵ > 0. There exists a constant c = c(ϵ) such that the following holds. For any n, d, σ and D ∈ Ω^{XOR}_{σ,d} satisfying κ ≤ κ_gen^{XOR,h} − ϵ and d/n ≥ c, with probability 1 − 3e^{−n/c} over S ∼ D^n, there exists some W with ∥W∥ = 1 and γ(f_W, S) ≥ (1 − ϵ)γ*(S) such that L_D(f_W) = 1/2.

Theorem 3.2 and Prop. 3.3 demonstrate that in the XOR problem, there is a threshold in κ above which generalization occurs. If κ is above this threshold, we achieve generalization when σ² ≪ 1. The next proposition states that when κ < κ_uc, any algorithm-dependent uniform convergence bound will be vacuous, that is, its generalization guarantee will be arbitrarily close to 1. We state our results for the linear and XOR neural network settings together; we state the more complicated XOR result in full and then mention how the linear result differs.

Proposition 3.4 (UC Bounds are Vacuous). Fix any h ∈ (1, 2) and δ > 0. For any n, d, σ and D ∈ Ω^{XOR}_{σ,d}, if κ_gen^{XOR,h} + δ ≤ κ ≤ κ_uc^{XOR,h} − δ, there exist strictly positive constants ϵ = ϵ(δ) and c = c(δ) such that the following holds. Let A be any algorithm that outputs a (1 − ϵ)-max-margin two-layer neural network f_W for any S ∈ (R^d × {1, −1})^n. Let H be any hypothesis class that is useful for A and D (as in Definition 2.2). Suppose that ϵ_unif is a uniform convergence bound for D and H, that is, Pr_{S∼D^n}[sup_{h∈H} |L_D(h) − L_S(h)| ≥ ϵ_unif] ≤ 1/4. Then if d/n ≥ c and n > c, we must have ϵ_unif ≥ 1 − δ.
A similar result holds for the linear problem with κ_gen^{linear} + δ < κ < κ_uc^{linear} − δ and any D ∈ Ω^{linear}_{σ,d}. In this case we achieve the guarantee that ϵ_unif ≥ 1 − e^{−n/(36dσ²)} − e^{−n/8}. Prop. 3.4 is proved using the same technique as in Nagarajan & Kolter (2019b): we show that with high probability over S ∼ D^n, the hypothesis A(S) has good generalization, but on an "opposite" dataset ψ(S) with the junk components reversed, the empirical error of A(S) is close to 1. This large gap between empirical error and generalization forces ϵ_unif to be large. Further extending this technique, we can also show the limitations of classical polynomial margin bounds, which achieve a bound that scales inversely polynomially with γ(h, S). We show that with high probability over S ∼ D^n, the hypothesis A(ψ(S)) has a large margin on the set S (a constant fraction times the max-margin), but poor generalization on D. Since any polynomial margin bound cannot predict much better generalization for the max-margin solution than for a solution with a constant fraction of the max-margin, we conclude that any such margin bound is far from showing good generalization for the max-margin solution. One subtlety to this approach is that here (unlike in the work of Nagarajan & Kolter (2019b)), the "opposite" dataset ψ(S) is defined to be the dataset with the signal features reversed.³ Thus we can only show the limitations of polynomial margin bounds that are useful for both D and for its "opposite" distribution ψ(D), which has the opposite ground-truth vector(s); this is a slightly stronger assumption than in the work of Nagarajan & Kolter (2019b).

The following results state that if κ < κ_uc, then certain types of margin bounds cannot yield better than constant test loss even on the max-margin solution.

Proposition 3.5 (Polynomial Margin Bounds Fail for Linear Problem). Fix δ > 0. For any n, d, σ and D ∈ Ω^{linear}_{σ,d} such that κ_gen^{linear} + δ < κ < κ_uc^{linear} − δ and d/n ≥ c, the following holds. Let A be any algorithm so that A(S) outputs a (1 − ϵ)-max-margin solution f_w for any S ∈ (R^d × {1, −1})^n. Let H be any hypothesis class that is useful for A (as in Definition 2.2) on both D^{linear}_{μ,σ} and D^{linear}_{−μ,σ}. Suppose that there exists a polynomial margin bound of integer degree p: that is, there is some G that satisfies, for D̃ ∈ {D, ψ(D)},

Pr_{S∼D̃^n}[sup_{h∈H} (L_D̃(h) − L_S(h)) ≥ G/γ(h, S)^p] ≤ 1/4.

³ Formally, if D = D^{linear}_{μ,σ}, then we define ψ(D) := D^{linear}_{−μ,σ}. If D = D^{XOR}_{μ1,μ2,σ}, then ψ(D) := D^{XOR}_{μ2,μ1,σ}.
Then with probability 1/2 − 3e^{−n} over S ∼ D^n, the margin bound is weak even on the max-margin solution, that is,

G/γ*(S)^p ≥ max{1/c, (1 − e^{−κ/(36σ²)} − e^{−n/8} − 3κ/c)^p},

which is more than an absolute constant. This proposition says that no polynomial margin bound will be able to show that the test error of the max-margin solution is less than an absolute constant. We know, however, from Theorem 3.1 that in this same regime, the test error of the max-margin solution can be arbitrarily small for small enough σ. Thus no polynomial margin bound can predict this behavior. The attentive reader may again notice that if κ → 0 as n and d grow, but generalization occurs, any such margin bound is vacuous, in that G/γ*(S)^p → 1. In Prop ??, we prove a more precise version, yielding the exact dependence of c and ϵ on the gap between κ and the boundaries κ_uc^{linear} and κ_gen^{linear}.

We achieve a similar result in the XOR setting.

Proposition 3.6 (Polynomial Margin Bounds Fail for XOR on Neural Network). Fix an integer p ≥ 1, and any ϵ > 0. There exists c = c(p, ϵ) such that the following holds for any n, d, σ and D ∈ Ω^{XOR}_{σ,d} with κ_gen^{XOR,h} + ϵ < κ < κ_uc^{XOR,h} − ϵ, d/n ≥ c and n ≥ c. Let H be any hypothesis class such that for D̃ ∈ {D, ψ(D)}, Pr_{S∼D̃^n}[all (1 − ϵ)-max-margin two-layer neural networks f_W for S lie in H] ≥ 3/4. Suppose that there exists a polynomial margin bound of degree p: that is, there is some G that satisfies, for D̃ ∈ {D, ψ(D)},

Pr_{S∼D̃^n}[sup_{h∈H} (L_D̃(h) − L_S(h)) ≥ G/γ(h, S)^p] ≤ 1/4.

Then with probability 1/2 − 3e^{−n/c} over S ∼ D^n, the generalization guarantee is no better than 1/c on the max-margin solution, that is, G/γ*(S)^p ≥ 1/c.

Remark 3.7. The polynomial margin impossibility result is slightly weaker for the XOR problem. Namely, the hypothesis class H we consider is larger in the XOR problem: it must contain, with probability 3/4, any near-max-margin solution, instead of just the one output by A.

The combination of our generalization results and our margin impossibility results suggests a phase transition in how the margin size affects generalization. If the margin is near-maximal, Theorems 3.1 and 3.2 show that we achieve generalization. Meanwhile, the proofs of Props. 3.5 and 3.6 suggest that solutions achieving a constant factor of the maximum margin may not generalize. The proofs of all of our results concerning the linear problem are given in Section C. The proofs for the XOR problem are in Section D.
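The ψ(S) construction behind these proofs is also easy to simulate: reverse the junk components of the training points, and a classifier that leans on the junk direction flips from fitting S perfectly to misclassifying almost all of ψ(S). A self-contained sketch (our own toy construction, with hand-picked mixture weights):

```python
# Sketch of the "opposite dataset" argument behind Prop. 3.4: a classifier that
# leans mostly on the junk-memorizing direction w_b fits S but fails badly on
# psi(S), the copy of S with junk components negated.
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 200, 4000, 0.25
mu = np.zeros(d); mu[0] = 1.0

z = rng.choice([-1.0, 1.0], size=n)
xi = rng.standard_normal((n, d)); xi[:, 0] = 0.0
xi *= np.sqrt(d - 1) * sigma / np.linalg.norm(xi, axis=1, keepdims=True)
X, y = z[:, None] * mu + xi, z
X_opp = z[:, None] * mu - xi                 # psi(S): junk reversed, labels kept

w_b = (y @ xi) / (np.sqrt(n * d) * sigma)    # junk-memorizing direction
w = 0.3 * mu + 0.95 * w_b                    # mostly-overfitting classifier

train_err = np.mean(y * (X @ w) <= 0)        # ~ 0: fits S
opp_err = np.mean(y * (X_opp @ w) <= 0)      # ~ 1: fails on psi(S)
print(train_err, opp_err)
```

Any hypothesis class containing this classifier thus exhibits a near-1 gap between empirical and test error on one of the two datasets, which is exactly what forces ϵ_unif to be large.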

4. CONCLUSION

In this work, we give novel generalization bounds in settings where uniform convergence provably fails. We use a unified approach of leveraging the extremal margin in both a linear classification setting and a non-linear two-layer neural network setting. Our work provides insight on why memorization can coexist with generalization. Going beyond our results, it is important to find broader tools for understanding the regime near the boundary of generalization and no generalization. We conclude with several concrete open directions in this vein. One question is how to prove generalization without UC when d < n, but the model itself (e.g. a neural network) is overparameterized, and thus can still overfit to the point of UC failing. A second direction asks whether we can prove similar results in the non-linear network setting for the solution found by gradient descent, if this solution is not a near-max-margin solution. Indeed, in a non-convex landscape, it is not guaranteed that gradient descent will find the max-margin solution.



¹ The assumption that m is divisible by 4 is for convenience, and can be removed if m is large enough.

⁴ We believe considering such types of margin bounds is natural. Indeed, for most problems, the designer of the generalization bound would not know in advance the ground truth distribution, but might know that the data comes from some problem class, e.g., linearly separable distributions, or linearly separable distributions with a sparse ground truth vector. In such cases, they would likely have to design a generalization bound that holds for data coming from two distributions with opposite ground truths. If we make this same stronger two-distribution assumption in Prop. 3.4, we can additionally rule out one-sided UC bounds, which only upper bound L_D − L_S.



Figure 2: Left: Quadratic XOR Problem. Middle: κ_gen^{XOR,h}







ACKNOWLEDGMENTS

We thank Jason Lee for helpful discussions. MW acknowledges the support of NSF Grant CCF-1844628 and a Sloan Research Fellowship. TM is supported by NSF IIS 2045685.

