A LAW OF ADVERSARIAL RISK, INTERPOLATION, AND LABEL NOISE

Abstract

In supervised learning, it has been shown that label noise in the data can be interpolated without penalties on test accuracy. We show that interpolating label noise induces adversarial vulnerability, and prove the first theorem showing the relationship between label noise and adversarial risk for any data distribution. Our results are almost tight if we do not make any assumptions on the inductive bias of the learning algorithm. We then investigate how different components of this problem affect this result including properties of the distribution. We also discuss non-uniform label noise distributions; and prove a new theorem showing uniform label noise induces nearly as large an adversarial risk as the worst poisoning with the same noise rate. Then, we provide theoretical and empirical evidence that uniform label noise is more harmful than typical real-world label noise. Finally, we show how inductive biases amplify the effect of label noise and argue the need for future work in this direction.

1. INTRODUCTION

Label noise is ubiquitous in data collected from the real world. Such noise can be a result of both malicious intent as well as human error. The well-known work of Zhang et al. (2017) observes that training overparameterised neural networks with gradient descent can memorize large amounts of label noise without increased test error. Recently, Bartlett et al. (2020) investigated this phenomenon and termed it benign overfitting: perfect interpolation of the noisy training dataset still leads to satisfactory generalization for overparameterized models. A long series of works (Donhauser et al., 2022; Hastie et al., 2022; Muthukumar et al., 2020) focused on providing generalization guarantees for models that interpolate data under uniform label noise. This suggests that noisy training data does not hurt the test error of overparameterized models. However, when deploying machine learning systems in the real world, it is not enough to guarantee low test error. Adversarial vulnerability is a practical security threat (Kurakin et al., 2016; Sharif et al., 2016; Eykholt et al., 2018) for deploying machine learning algorithms in critical environments. An adversarially vulnerable classifier, that is accurate on the test distribution, can be forced to err on carefully perturbed inputs even when the perturbations are small. This has motivated a large body of work towards improving the adversarial robustness of neural networks (Goodfellow et al., 2014; Papernot et al., 2016; Tramèr et al., 2018; Sanyal et al., 2018; Cisse et al., 2017) . Despite the empirical advances, the theoretical guarantees on robust defenses are still poorly understood. Consider the setting of uniformly random label noise. Under certain distributional assumptions, Sanyal et al. (2021) claim that with moderate amount of label noise, when training classifiers to zero training error, the adversarial risk is always large, even when the test error is low. Experimentally, this is supported by Zhu et al. (2021) , who showed that common methods for reducing adversarial risk like adversarial training in fact does not memorise label noise. However, it is not clear whether their distributional assumptions are realistic, or if their result is tight. To deploy machine learning models responsibly, it is important to understand the extent to which a common phenomenon like label noise can negatively impact adversarial robustness. In this work, we improve upon previous theoretical results, proving a stronger result on how label noise guarantees adversarial risk for large enough sample size. On the other hand, existing experimental results (Sanyal et al., 2021) seem to suggest that neural networks suffer from large adversarial risk even in the small sample size regime. Our results show that this phenomenon cannot be explained without further assumptions on the data distributions, the learning algorithm, or the machine learning model. While specific biases of machine learning models and algorithms (referred to as inductive bias) have usually played a "positive" role in machine learning literature (Vaswani et al., 2017; van Merriënboer et al., 2017; Mingard et al., 2020) , we show how some biases can make the model more vulnerable to adversarial risks under noisy interpolation. Apart from the data distribution and the inductive biases, we also investigate the role of the label noise model. Uniform label noise, also known as random classification noise (Angluin and Laird, 1988) , is a natural choice for modeling label noise, but it is neither the most realistic nor the most adversarial noise model. Yet, our results show that when it comes to guaranteeing a lower bound on adversarial risk for interpolating models, uniform label noise model is not much weaker than the optimal poisoning adversary. Our experiments indicate that natural label noise (Wei et al., 2022) is not as bad for adversarial robustness as uniform label noise. Finally, we also attempt to understand the conditions under which such benign (natural) label noise arises. Overview First, we introduce notation necessary to understand the rest of the paper. Then, we prove a theoretical result (Theorem 2) on adversarial risk caused by label noise, significantly improving upon previous results (Theorem 1 from Sanyal et al. (2021) ). In fact, our Theorem 2 gives the first theoretical guarantee that adversarial risk is large for all compactly supported input distributions and all interpolating classifiers, in the presence of label noise. Our theorem does not rely on the particular function class or the training method. Then, in Section 3, we show Theorem 2 is tight without further assumptions, but does not accurately reflect empirical observations on standard datasets. Our hypothesis is that the experimentally shown effect of label noise depends on properties of the distribution and the inductive bias of the function class. In Section 4, we prove (Theorem 5) that uniform label noise is on the same order of harmfmul as worst case data poisoning, given a slight increase in dataset size and adversarial radius. We also run experiments in Figure 3 , showing that mistakes done by human labelers are more benign than the same rate of uniform noise. Finally, in Section 5, we show that the inductive bias of the function class makes the impact of label noise on adversarial vulnerability much stronger and provide an example in Theorem 7.

2. GUARANTEEING ADVERSARIAL RISK FOR NOISY INTERPOLATORS

Our setting Choose a norm ∥•∥ on R d , for example ∥•∥ 2 or ∥•∥ ∞ . For x ∈ R d , let B r (x) denote the ∥•∥-ball of radius r around x. Let µ be a distribution on R d and let f * : C → {0, 1} be a measurable ground truth classifier. Then we can define the adversarial risk of any classifier f with respect to f * , µ, given an adversary with perturbation budget ρ > 0 under the norm ∥•∥, as R Adv,ρ (f, µ) = P x∼µ [∃z ∈ B ρ (x), f * (x) ̸ = f (z))] . Next, consider a training set ((z 1 , y 1 ), . . . , (z m , y m )) in R d ×{0, 1}, where the z i are independently sampled from µ, and each y i equals f * (z i ) with probability 1 -η, where η > 0 is the label noise rate. Let f be any classifier which correctly interpolates the training set. We now state the main theoretical result of Sanyal et al. (2021) so that we can compare our result with it. Theorem 1 ( Sanyal et al. (2021) ). Suppose that there exist c 1 ≥ c 2 > 0, ρ > 0, and a finite set ζ ⊂ R d satisfying µ   s∈ζ B ρ/2 (s)   ≥ c 1 and ∀s ∈ ζ, µ B ρ/2 (s) ≥ c 2 |ζ| (2) Further, suppose that each of these balls contains points from a single class. Then for δ > 0, when the number of samples m ≥ |ζ| ηc2 log |ζ| δ , with probability 1 -δ R Adv,ρ (f, µ) ≥ c 1 . This is the first guarantee for adversarial risk caused by label noise in the literature. However, Theorem 1 has two extremely strong assumptions: • The input distribution has mass c 1 in a union of balls, each of which has mass at least c 2 ; • Each ball only contains points from a single class. It is not clear why such balls would exist for real-world datasets, or even MNIST or CIFAR-10. In Appendix F, we give some evidence against the second assumption in particular. In Theorem 2, we remove these assumptions and show that our guarantees hold for all compactly supported input distributions, with comparable guarantees on adversarial risk. Let C be a compact subset of R d . An important quantity in our theorem will be the covering number Theorem 2. Let C ⊂ R d satisfy µ(C) > 0, and let N = N (ρ/2; C, ∥•∥) be its covering number. For δ > 0, when the number of samples satisfies m ≥ 8N µ(C)η log 2N δ . with probability 1 -δ we have that N = N (ρ/2; C, ∥•∥) of C in the metric ∥•∥. R Adv,ρ (f, µ) ≥ 1 4 µ(C) for any classifier f that interpolates the training set. The compact C can be chosen freely, allowing us to make tradeoffs between the required number of samples m and the lower bound on the adversarial risk. As the chosen C expands in volume, the lower bound on the adversarial risk µ (C) also increases. However, this also increases the required number of samples for the theorem to kick in, which depends on its covering number N . The tradeoff curve depends on the distribution µ; we discuss this in Section 3. Note that Theorem 2 is easier to interpret than Theorem 1, as it holds for any compact C as opposed to a finite set ζ of dense balls. Our result avoids the unwieldy assumptions, and in fact gives a slightly stronger guarantee than Theorem 1. When Equation (2) holds, note that we can choose the compact C = s∈ζ B ρ/2 (s) from Theorem 1 yielding N = |ζ| and µ (C) = c 1 . Thus, under similar settings as the previous result, our theorem requires the number of samples m = Ω |ζ| ηc1 , which is smaller than m = Ω |ζ| ηc2 required in Theorem 1. We leave the proof of Theorem 2 to Appendix A, but we provide a brief sketch of the ideas. Proof sketch We want to prove that a large portion of points from µ have a mislabeled point nearby when m is large enough. The expected number of label noise training points is ηm; however a priori those could be anywhere in the support of µ. The key idea is that we can always find a set of ∥•∥-balls covering a lot of measure, with each of the balls having a large enough density of µ. We prove this in Lemma 8 and provide an illustration in Figure 1 . The blue dotted circles in Figure 1 are the ∥•∥-balls; they do not cover the entire space but cover a significant portion of the entire density. Then, if we take a lot of ∥•∥-balls with large density of a single class, we can prove that label noise induces an opposite-labeled point in each of the chosen balls given m large enough. Concretely, the probability for a single chosen ball to not be adversarially vulnerable is on the order of 1 -η 2N µ(C)m , and summing this up over the O(N ) chosen balls goes to zero when m is large. By the union bound, each of these balls is then adversarially vulnerable, summing up to a constant adversarial risk. (2021) . Adversarial error increases with increasing label noise η (xaxis) at a rate much faster than predicted by Theorem 2. Here, ϵ is the perturbation magnitude (ρ in the current paper). The label noise is synthetically injected in the training set with probability η. We considered the binary classification case for simplicity. The proof in Appendix A lower bounds the adversarial risk on a single true class. Thus, by summing up the risks for each class, we lose only a constant factor on the guaranteed adversarial risk in the multi-class case. For compactly supported µ, we can take C to be the support of µ to prove a general statement. Corollary 3. Let N be the covering number of supp(µ) with balls of radius ρ/2. For δ > 0, when the number of samples satisfies m ≥ 8N η log 2N δ . with probability 1 -δ we have that R Adv,ρ (f, µ) ≥ 1 4 . ( ) This is easier to understand than Theorem 2: if interpolating a dataset with label noise, the number of samples required to guarantee constant adversarial risk scales with the covering number of the support of the distribution. Remark 1. The proof in Appendix A actually proves Equation ( 4) by first proving a stronger fact: if µ| C is the normalized restricton of µ on C, then R Adv,ρ (f, µ| C ) ≥ 1 4 . ( ) We can use this to give better guarantees when f * is not robust. If a region of supp(µ) is already adversarially vulnerable using the true classifier f * , we can omit it from C, and just add the guarantee from Theorem 2 to the original adversarial risk to get a stronger lower bound on R Adv,ρ (f, µ).

3. PRACTICAL IMPLICATIONS ON SAMPLE SIZE

In this section, we discuss the limitations of results like Theorem 2. When we allow arbitrary interpolating classifiers, we show that Theorem 2 paints an accurate picture of the interaction of label noise, interpolation, and adversarial risk. However, this particular theoretical framework cannot explain the strong effect of label noise on adversarial risk in practice (see Figure 2 ). We argue that this requires a better understanding of the inductive biases of the hypothesis class and the optimization algorithm.

Required sample size for Theorem 2

The number of required samples m in Theorem 2 can be very large, depending on the density and the covering number of the chosen compact C. Consider ∥•∥ to be the ∥•∥ ∞ norm, as is customary in adversarial robustness research (Goodfellow et al., 2014) . Then the balls B ρ are small hypercubes in R d . If we choose C to be the hypercube [0, 1] d , the covering number scales exponentially in dimension: N = N (ρ; [0, 1] d , ∥•∥ ∞ ) ≃ 1 ρ d . A rough back-of-the-envelope calculation indicates that this can scale badly even for standard datasets such as MNIST (d = 784) or CIFAR-10 (d = 3072), since in Theorem 2 we need m ≳ N µ(C)η . This amounts an impossibly large sample size (m ≳ 10 784 ) for ρ = 0.2 to explain the effect already observed with in m = 50000 MNIST training samples in Figure 2a . Hence our result often does not guarantee any adversarial risk if the number of samples m is small. In general, the covering number of a dataset is not polynomial in the dimension, except if the data has special properties in the given metric. For example, if the data distribution is supported on a subspace of R d of ambient dimension k < d, we can pick a C for which the covering number in ∥•∥ 2 will depend only on k and not on d. However, this is still not sufficient to explain the behaviour in Figures 2b and 2c . If our result indeed kicks in for some ambient dimension k with DenseNet121 on CIFAR10, then for a given adversarial risk (say 80%), the power law dependence would imply ρ2 ρ1 k ≈ η1 η2 , where label noise rate η i yields adversarial error 80% with perturbation budget ρ i . With another back-of-the-envelope calculation using Figure 2c , we set η 1 , η 2 = 0.7, 0.1 and ρ 1 , ρ 2 = 0.001, 0.004. This yields an ambient dimension k < 2, which is unrealistic for CIFAR-10. These calculations suggest the possibility that a tighter bound than Theorem 2 might exist. However, the large sample size is not just a limitation of Theorem 2. In fact, we show that if arbitrary classifiers and distributions are allowed, the adversarial risk cannot be lower bounded for m = poly(d). Our result is tight It is a priori possible that the true dependence of adversarial risk on label noise kicks in for much lower sample size regimes than in Theorem 2. This might suggest that the lower bound on sample complexity can be improved. We show this is not the case and in fact our bound is sharp. In particular, we design a simple distribution on R d such that there exist classifiers which correctly and robustly interpolate datasets with the number of samples m exponential in d. Proposition 4. Let µ be the uniform distribution on S d-1 = x 1 , . . . , x d ∈ R d : x 2 1 + . . . + x 2 d = 1 , and let the ground truth classifier f * be a threshold function on x 1 : f * (x) = 1 x1> 1 2 . Consider any adversarial radius ρ <foot_0 4 in the Euclidean metric. Then, for any label noise η < 1: with high probability, there exists a classifier f that interpolates m = ⌊1.01 d ⌋ samples from the label noise distribution, such that R Adv,ρ (f, µ) = o d (1).

Proof sketch

The main ingredient of the proof is the concentration of measure on S d-1 , which makes the training samples far apart in the Euclidean metric. We leave the full proof to Appendix B. Similar statements in the clean data setting have appeared before, e.g. in Bubeck and Sellke (2021) . Note that Proposition 4 shows a construction where Theorem 2 cannot guarantee an adversarial risk lower bound with sample size m sub-exponential in d. Hence, the covering number of any substantial portion of S d-1 is exponential in the dimension d. This unintentionally proves the well known fact that the covering number of the sphere S d-1 in the Euclidean metric is exponential. 1 Optimizing C can avoid large sample size While Proposition 4 shows the tightness of Theorem 2 in the worst case, it is possible a smaller sample size requirement is sufficient under certain conditions. In particular, if we can pick a compact C with small covering number, such that the measure µ(C) is large, then Theorem 2 allows for a small sample size while guaranteeing a large adversarial risk. Example Take an adversarial radius ρ > 0 in the ∥•∥ ∞ metric and choose r ∈ (0, 1 2 ), let µ = (1 -r)µ 1 + rµ 2 be the average of two measures, µ 1 and µ 2 , with µ 1 the uniform distribution on [0, 1] d , and µ 2 the uniform distribution on a smaller hypercube [0, ρ] d . The first choice C = [0, 1] d as in Corollary 3 has covering number on the order of ρ -d . Theorem 2 is then vacuous until m ≳ ρ -d /η, which is very large in high dimensions. Note that this is necessary to achieve the lower bound of adversarial risk of 1 4 . However, if we only want to guarantee an adversarial risk of r 4 , instead we can use C = [0, ρ] d . For this, the covering number is 1 and we can use Theorem 2 for m = O 1 η . This suggests that while the required sample size for the maximal adversarial risk is possibly very large, it can be much smaller, depending on the distribution, for guaranteeing a smaller adversarial risk. Formally, to get the "best possible" m in Theorem 2 for a certain adversarial risk lower bound r, we should solve the following optimization problem over subsets of supp(µ): min µ(C)≥4r N (ρ/2, C, ∥•∥ ∞ ) log N (ρ/2, C, ∥•∥ ∞ ) µ(C) . The above optimization problem comes from substituting r into the adversarial risk placeholder in Theorem 2. Equation ( 8) provides a complexity measure to get tighter lower bounds on the adversarial vulnerability induced by uniform label noise. It is not known whether the optimization is tractable in general. However, the concept of having to solve an optimization problem in order to get a tight lower bound is common in the literature. Some examples are the representation dimension (Beimel et al., 2019) and the SQ dimension (Feldman, 2017) . To conclude, this section shows that in real world data, the required sample size for guaranteeing large adversarial risk from interpolating label noise is significantly smaller than what an off-the-shelf application of Theorem 2 might suggest. However, we also proved that it is not possible to obtain tighter bounds without further assumptions on the data or the model.

4. NON-UNIFORM LABEL NOISE

In previous sections, we discussed guaranteeing a lower bound on adversarial error for noisy interpolators in Section 2. In Section 3, we discussed the tightness of the said bound. However, all of these results assumed that the label noise is distributed uniformly on the points in the training set, which corresponds to the popular Random Classification Noise (Angluin and Laird, 1988) model. However, an uniform noise model is not very realistic (Hedderich et al., 2021; Wei et al., 2022) ; and it is thus sensible to also investigate how our results change under non-uniform label noise models. Uniform noise is almost as harmful as poisoning The worst-case non-uniform label noise model is data poisoning, where an adversary can choose the labels of a subset of the training set of a fixed size (see Biggio and Roli (2018) for a survey). It is well known that flipping the label of a constant number of points in the training set can significantly increase the error of a logistic regression model (Jagielski et al., 2018) or an SVM classifier (Biggio et al., 2011) . On the contrary, the test error of a neural networks has been surprisingly difficult to hurt by data poisoning attacks which flip a small fraction of the labels. Lu et al. (2022) show that, on some datasets, the effect of adversarial data poisoning on test accuracy is, in fact, comparable to the effect of uniform label noise. We phrase the main result of this section informally in Theorem 5, using standard game-theoretic metaphors for data poisoning, and defer the formal version to Appendix C. Let again µ be a distribution on R d and f * a correct binary classifier, and let η be the label noise rate. Consider a game in which an adversary flips the labels of a subset of the training set, and tries to maximize the minimum adversarial risk among all interpolators of the noisy (after flipping labels) training set. We will compare the performances of two adversaries: • Uniform, who samples T points uniformly from the distribution, and flips the label of each of the T points in the sampled training set with probability η; • Poisoner, who inserts N = ηm arbitrary points from supp(µ) with flipped labels into the training set and then samples the remaining m -N points uniformly, with correct labels. Here T and m are the respective training set sizes. If T ∼ 1 η N = m, then the two adversaries flip the same number of labels in expectation. In that sense, both of these adversaries have the same budget. However, the Poisoner can choose which points to flip and thus intuitively, in this regime, the Poisoner will get a higher adversarial risk than the Uniform. Surprisingly, we can prove the Uniform is not much worse if T ∼ m log m. Theorem 5 (Informal statement of Theorem 11). Denote the adversarial risks of the Uniform and the Poisoner adversaries by R Unif and R Poison respectively. For any ρ > 0, we have that R Unif 2ρ ≥ 1 2 R Poison ρ (9) as long as R Poison ρ = Ω(1) and T ≳ m log m. Roughly speaking, the above theorem shows that if the Uniform adversary is given double the adversarial radius and a log factor increase on the training set size, then Uniform can guarantee an adversarial risk of the same magnitude as the Poisoner. The full statement and the proof of the theorem are given in Appendix C but we provide a brief sketch here. Proof sketch The Poisoner will choose N points to flip, adversarially poisoning every point in the N corresponding ρ-balls. As in Theorem 2, we can use Lemma 8 to show that a subset of the balls with density Ω(1/N ) covers half of the adversarially vulnerable region. Then Uniform samples T points, and we expect to hit each of the balls in the chosen subset. Because of the doubled radius, each sampled point makes the the whole ρ-ball vulnerable. The log factor comes from the same reason as in the standard coupon collector (balls and bins) problem; if we have N bins with hitting probabilities Ω(1/N ), then we need Ω(N log N ) tries to hit each bin at least once. Some label noise models are benign Different label noise models with the same expected label noise rate can have very different effects on the adversarial risk. In the previous sections, we showed that uniform label noise is almost as bad as the worst possible noise model with the same label noise rate. This raises the question whether all noise models are as harmful as the uniform label noise model. We answer the question in the negative especially for data distributions that have a long tailed structure: many far-apart low-density subpopulations in the support of the distribution µ. For this, we show a simple data distribution µ in Proposition 6, where: • Uniform label noise with probability η guarantees adversarial risk on the order of η; • A different label noise model, with expected label noise rate η, which affects only the long tail of the distribution µ can be interpolated with o(1) adversarial risk. We argue that this is neither an unrealistic distributional assumption nor an impractical noise model. In fact, most standard image datasets, like SUN (Xiao et al., 2010) , PASCAL (Everingham et al., 2010) , and CelebA (Liu et al., 2015) have a long tail (Zhu et al., 2014; Sanyal et al., 2022) . Moreover, it is natural to assume that mistakes in the datasets are more likely to occur in the long tail, where the data points are atypical. In Feldman (2020) , it was argued that noisy labels on the long tail are one of the reasons for why overparameterized neural networks remember the training data. Formally, we prove the following regarding the benign noise model for a long-tailed distribution. Proposition 6. Let A < B be integers with A much smaller than B. Let µ be a mixture model on R supported on a disjoint union of A + B intervals, such that half of the mass is on the first A intervals and half of the mass is on the last B intervals: µ = 1 2A A i=1 Unif i, i + 1 2 + 1 2B B j=1 Unif A + j, A + j + 1 2 Let the ground truth label be zero everywhere. Sample two datasets D 1 , D 2 of size m from µ using two different label noise distributions: For D 1 , flip the label of each sample x ∈ [0, A + B] independently with probability η. For D 2 , flip the label of each sample x ∈ [A, A + B] independently with probability 2η, and leave the labels of the other samples unchanged. Then, for any ρ, δ ∈ 0, 1 2 , for the number of samples m = Θρ (A) (ignoring log terms), we have that with probability 1 -δ: • For any f which interpolates D 1 , the adversarial risk is large: R Adv,ρ (f, µ) = Ω ρ (1). • There exists f which interpolates D 2 , such that R Adv,ρ (f, µ) = O ρ A B . A similar distribution was previously used as a representative long tailed distribution in the context of privacy and fairness in Sanyal et al. (2022) . Our result can also be extended to more complicated long-tailed distributions with a similar strategy. Proposition 6 implies that the the first noise model (for D 1 ) induces Ω(1) adversarial risk on all interpolators. On the other hand, for the second noise model i.e. for D 2 , it is possible to obtain interpolators with adversarial risk on the order of A B . Thus, for distributions where A ≪ B, this implies the existence of almost robust interpolators despite having the same label noise rate. Real-world noise is more benign than uniform label noise To support our argument that real world noise models are, in fact, more benign than uniform noise models, we consider the noise Figure 3 : Figures 3a and 3b plots adversarial risk against perturbation budget on CIFAR10 and CIFAR100 datasets respectively, with three label noise models. Clean denotes the original CIFAR-10/100 labels. Human denotes the human-generated labels from (Wei et al., 2022) . Uniform refers to uniformly random label noise with the same rate as Human. Figure 3c induced by human annotators in Wei et al. (2022) . They propose a new version of the CIFAR10/100 dataset (Krizhevsky et al., 2009) where each image is labelled by three human annotators. Known as CIFAR-10/100-n, each example's label is decided by a majority vote on the three annotated labels. We train ResNet34 models till interpolation on these two datasets. The label noise rate, after the majority vote, is ≈ 9% in CIFAR10 and ≈ 40% in CIFAR100. We repeat the same experiment for uniform label noise with the same noise rates, and also without any label noise. Each of these models' adversarial error is evaluated with an ℓ ∞ PGD adversary plotted in Figures 3a and 3b . Figures 3a and 3b show that, for both CIFAR10 and CIFAR100, uniform label noise is indeed worse for adversarial risk than human-generated label noise. For CIFAR-10, the model that interpolates human-generated label noise is almost as robust as the model trained on clean data. This supports our argument that real-world label noise is more benign, for adversarial risk, than uniform label noise. An important direction for future research is understanding what types of label noise models are useful mathematical proxies for realistic label noise. We shed some light on this question using the idea of memorisation score (Feldman and Zhang, 2020) . Informally, memorisation score quantifies the atypicality of a sample; it measures the increase in the loss on a data point when the learning algorithm does not observe it during training compared to when it does. A high memorisation score indicates that the point is unique in the training data and thus, likely, lies in the long tail of the distribution. In Figure 3c , we plot the average memorisation score of each class of CIFAR10 in brown, and the average for images that were mislabeled by the human annotator in blue. It is clearly evident that the mislabeled images have a higher memorisation score. This supports our hypothesis (also in Feldman ( 2020)) that, in the real world, examples in the long tail, are more likely to be mislabeled.

5. THE ROLE OF INDUCTIVE BIAS

We have seen, in Section 3, that without further assumptions the theoretical guarantees in Theorem 2 only hold for very large training sets. In this section, we discuss how the inductive bias of the hypothesis class or the learning algorithm can lower the sample size requirement and point to recent work that indicates the presence of such inductive biases in neural networks.

Inductive bias can hurt robustness even further

There is ample empirical evidence (Ortiz-Jimenez et al., 2020; Kalimeris et al., 2019; Shamir et al., 2021) that neural networks exhibit an inductive bias that is different from what is required for robustness. Shah et al. (2020) also provides empirical evidence that neural networks exhibit a certain inductive bias, that they call simplicity bias, that hurts adversarial robustness. Ortiz-Jiménez et al. (2022) show that this is also responsible for a phenomenon known as catastrophic overfitting. Here, we show a simple example to illustrate the role of inductive bias. Consider a binary classification problem on a data distribution µ and a dataset S m,η of m points, sampled i.i.d. from µ such that the label of each example is flipped with probability η. Theorem 7. For any ρ > 0, there exists a distribution µ on R 2 and two hypothesis classes H and F, such that for any label noise rate η ∈ (0, 1 /2) and dataset size m = Θ 1 η , in expectation we have that: for all h ∈ H that interpolate S m,η , R Adv,ρ (h, µ) ≥ Ω (1) ; (10) whereas there exists an f ∈ F that interpolates S m,η and R Adv,ρ (f ; µ) = O (ρ). The classes H and F are precisely defined in Appendix E where a formal proof is given as well, but we provide a proof sketch here. The data distribution µ in Theorem 7 is uniform on the set [0, W ] × {0}, that is, the data is just supported on the first coordinate where W ≫ ρ > 0. The ground truth is a threshold function on the first coordinate. The constructed f ∈ F simply labels everything according to the ground truth classifier (which is a threshold function on the first coordinate) except the mislabeled data points; where it constructs infinitesimally small intervals around the point on the first coordinate. Note that this construction is similar to the one in Proposition 4. By design, it interpolates the training set and its expected adversarial risk is upper bounded by 2mηρ. Each hypothesis in the hypothesis class H can be thought of as a union of T-shaped decision regions. The region inside the T-shaped regions are classified as 1 and the rest as 0. Note that the "head" of a T-shaped region make the region on the data manifold (first coordinate) directly below them adversarially vulnerable. The width of the T can be interpreted as the inductive bias of the learning algorithm. The decision boundaries of neural networks usually lie on the data manifold (Somepalli et al., 2022) ; and the network behaves more smoothly off the data manifold. A natural consequence of this is that the head of the Ts are large. This is not the exact explanation of inductive bias in neural networks, but rather an illustrative example for what might be happening in practice. In Appendix G, we provide experimental evidence to show this type of behaviour for neural networks. There are two important properties of this simple example relevant for understanding adversarial vulnerability of neural networks. First, the adversarial examples constructed here are off-manifold: they do not lie on the manifold of the data. This has been observed in prior works (Hendrycks and Gimpel, 2016; Sanyal et al., 2018; Khoury and Hadfield-Menell, 2018) . Second, our examples implicitly exhibit the dimpled manifold phenomenon recently described in Shamir et al. (2021) . Is Theorem 2 about the wrong function class? When fitting deep neural networks to real datasets, the results of Theorem 2 still hold even when the number of samples m is much smaller than required, as can be seen in Figure 2 . We think that proving guarantees on adversarial risk in the presence of label noise is within reach for simple neural network settings. Towards this goal, we propose a conjecture in a similar vein to Bubeck et al. (2020) : Conjecture 1. Let f : R d → R be a neural network with a single hidden layer with k neurons. Under the same conditions as in Theorem 2, for the number of samples m = Ω( 1 η poly(k, d)), R Adv,ρ (f, µ) ≥ const. ( ) for a distribution µ supported on [0, 1] d . In short, we conjecture that neural networks exhibit inductive biases which hurt robustness when interpolating label noise. Dohmatob (2021) show a similar result for a large class of neural networks. However, the Bayes optimal error in their setting is a positive constant (as opposed to zero in our setting), and we assume uniform label noise in the training set. Understanding these properties is important for training on real-world data, where label noise is not a possibility but rather a norm. By the union bound, P [x is in a ρ-ball around at least one of the z i ] (36) ≤ m P [∥x -z 1 ∥ 2 ≤ ρ] (37) ≤ m P ∥x∥ 2 + ∥z 1 ∥ 2 -2⟨x, z 1 ⟩ ≤ ρ 2 (38) = m P ⟨x, z 1 ⟩ ≥ 1 -ρ 2 /2 . ( ) As µ is rotationally invariant, ⟨x, z 1 ⟩ is distributed the same as x 1 . We have proved R Adv,ρ (f, µ) ≤ m P x 1 ≥ 1 - ρ 2 2 + P x 1 ≥ 1 2 -ρ . ( ) We can bound P[x 1 ≥ t] for t > 0 as follows: let g 1 , . . . , g d be i.i.d. standard N (0, 1) random variables. P [x 1 ≥ t] = P g 1 g 2 1 + . . . + g 2 d ≥ t (41a) = P g 2 1 ≥ t 2 (g 2 1 + . . . + g 2 d ) (41b) = P 1 -t 2 t 2 g 2 1 ≥ g 2 2 + . . . + g 2 d (41c) ≤ P 1 -t 2 t 2 g 2 1 ≥ d -1 2 + P g 2 2 + . . . + g 2 d ≤ d -1 2 , where the last inequality is because a ≥ b implies a ≥ c or b ≤ c. As 0 < ρ < 1 4 , we can take t = 1 4 in both probabilities in Equation ( 40). We now use the often-cited chi-square bounds from Lemma 1 in Laurent and Massart (2000) . P g 2 2 + . . . + g 2 d ≤ (d -1) -2 (d -1)s ≤ exp (-s) (42a) P g 2 1 ≥ 1 + 2 √ s + 2s ≤ exp (-s) Then for s = d 40 , it's easy to see that both probabilities in Equation (41d) are less than the corresponding probabilities in Equation (42b) and Equation (42a).  R Adv,ρ (f, µ) = P x∼µ [∃z ∈ B ρ (x), f * (x) ̸ = f (z))] . For interpolating classifiers with minimal test error, there are two sources of adversarial risk: decision boundaries and label noise. In this paper, we are specifically interested in the latter. The theorem is easiest to formalize in the case where the decision boundary contribution to the adversarial risk is negligible. For example, this is the case when the classes are separable with a large margin, or in real-world datasets when there is not much data which humans would label ambiguously. Therefore, instead of working with the adversarial risk, we introduce the separable proxy for the adversarial risk. Let the label noised points be S = {s 1 , . . . , s N }, and let f interpolate the training set, and otherwise minimize the test error. RAdv,ρ (S, µ) def = P x∼µ [∃1 ≤ k ≤ N, s k ∈ B ρ (x)] ≤ R Adv,ρ (f, µ). ( ) Proof. To prove the adversarial risk for D 1 , we simply invoke Theorem 2. Consider the compact C = A i=1 i, i + 1 2 . The covering number for C is N = A 2ρ and its probability mass µ (C) = 1 2 . By Theorem 2, for m ≥ 16A ρη log A ρδ , with probability greater than 1 -δ we have that R Adv,ρ (f, µ) ≥ 1 8 . This proves the first part. For the second part, using Hoeffding's inequality, as long as m ≥ 16 log 2 δ , we have that with probability at least 1-δ, the number of samples in [A, A + B] is less than 3m 4 . Therefore, the number of mislabeled samples in that interval is also less than 3m 4 with the same probability. Now consider the interpolator that is zero everywhere except at the mislabeled points. The maximum adversarial risk of this interpolator is the probability mass of the union of the intervals [A + j, A + j + 1 2 ] each of the mislabeled points lie in. This probability mass is at most 3mρ 8B . Setting, m = 16A ρη log A ρδ , we obtain R Adv,ρ (f, µ) = O ρ A B .

E PROOF OF INDUCTIVE BIAS

Theorem 7. For any ρ > 0, there exists a distribution µ on R 2 and two hypothesis classes H and F, such that for any label noise rate η ∈ (0, 1 /2) and dataset size m = Θ 1 η , in expectation we have that: for all h ∈ H that interpolate S m,η , R Adv,ρ (h, µ) ≥ Ω (1) ; (10) whereas there exists an f ∈ F that interpolates S m,η and R Adv,ρ (f ; µ) = O (ρ). Proof. For any ρ ≥ 0, W ≫ ρ, construct a distribution µ on [0, W ] × {0} as follows. Distribute the covariates uniformly randomly in [0, W 2 -2ρ] [ W 2 + 2ρ, W ] and then label then with the ground truth labelling function f * (x) = 1{x 1 ≥ W 2 } where x = [x 1 , x 2 ] is the two-dimensional covariate. Next, we construct an m dimensional dataset and flip each label independently with probability 1 -η. We denote this set with S m,η . The hypothesis class F is the class of one-dimensional thresholds on the first coordinate of the input space (ignores the second coordinate entirely). Define the following interpolating classifier f ∈ F : R 2 → {0, 1} as follows f (x) = y 1 if x is in S m,η 1{x 1 ≥ W 2 } otherwise . As the sampling of the covariates and the label noise are independent events, E Sm,η [# of mislabeled points in S m,η ] = mη. Then the expected measure of the set of points adversarially vulnerable by an adversary of perturbation magnitude ρ on the classifier h, as defined above, is upper bounded by 2ρmη. Using the fact that the total measure of the domain is W and that m = Θ 1 η , we get that E Sm,η [R Adv,ρ (f ; µ)] ≤ 2ρmη W = O(ρ). Next, consider the hypothesis class H defined as follows. Given a set of points Z = {z 1 , . . . , z k } ∈ [0, W ] k and γ > ρ, define the hypothesis h Z,γ (x) =    1 ∃z ∈ Z | 1{x 2 < ρ} ∧ 1{x 1 = z} 1 ∃z ∈ Z | 1{x 2 < ρ} ∧ 1{|x 1 -z| ≤ γ} 0 otherwise. If S is the set of mislabeled 1s in S m,η , then for any interpolating classifier h Z,γ , it holds that S ⊆ Z. Next, by construction, for every point z ∈ Z, it holds that all points x ∈ [z -γ, z + γ] can adversarially perturbed in the x 2 component to obtain the label 1. Thus the total measure of the adversarially vulnerable set of points is greater than the number of mislabeled points, whose original label is zero, multipled with 2γ, which is 2mηγ. Thus, we have that for any h ∈ H that interpolates S m,η , E Sm,η [R Adv,ρ (f ; µ)] ≥ min 2γmη W , 1 2 = Ω(γ). We proved the adversarial risk bounds only in expectation over the training set; but note that both of the bounds in Theorem 7 can be transformed into high probability bounds using concentration inequalities. For simplicity, we did not treat the second bound in the theorem as a learning problem. However, it is possible to show that there exists a learning algorithm that uses a similar number of samples to output f ∈ F such that the adversarial risk is O (ρ).

F SEPARATION AND AVERAGE DISTANCES OF DIFFERENT CLASSES

For a classification dataset, it is not easy to check if a given class can be covered by balls of some radius in a given metric, such that each ball contains only points from a single class. However, we can estimate the distance distribution between points inside a class and between points of different classes, If those two distributions are similar, and especially if the minimum distances are roughly the same, then it seems likely that it's not possible for the balls to contain only points from a single class. The plots in Figure 6 strongly suggest that the points of different classes are not extremely "far off" in the ∥•∥ 2 or ∥•∥ ∞ metrics, compared to the distances between points inside a class.

G EXPERIMENTS REGARDING INDUCTIVE BIAS

We show that the the structure of the T-shaped classifier used in Theorem 7 is visible when fitting neural networks to label noise on a tiny dataset. We sample a three dimensional dataset of points (X, Y, Z) with labels in {0, 1} as follows: • X is sampled uniformly from the segment [0, 1]; • Y is sampled from a normal distribution N (0, 0.1); • Z is sampled from a normal distribution N (0, 0.001); • The label is 1 if X > 0.5, and 0 otherwise. We sample 50 points from this distribution to create the clean dataset. To create the noisy dataset, we randomly flip 10% of the labels to generate the noisy dataset. Then we train a one-hidden layer MLP with 1000 hidden units using the ADAM optimizer with a learning rate of 0.01. The decision boundary after running this for 350 epochs with a batch size of 20 is plotted in Appendix G. All our models interpolate the dataset (both clean and noisy). We plot the decision region in the XY plane for three different values of the Z dimension in Appendix G for models trained on the noisy dataset as well as the clean dataset. The first row in a box corresponds to the model trained on the noisy dataset and the second row corresponds to the model trained on the clean dataset. The maroon circles inside the plots are balls of radius 0.04, drawn around the points with label noise, indicating the region of adversarial vulnerability induced by the points in that plane. As visualizations like these are often susceptible to variance due to random seeds, we report for three different seeds, denoted as Run 1, Run 2, and Run 3. To interpret the T-like structure from Figure 7 in these experiments, note that ρ = 0.04, so the "head" of the 'T' is in the XY plane for Z = 0.04. Further, the Z = -0.04 is essentially the head of the 'T' for the other class. In the first rows in all of the boxes (i.e. the noisy dataset), note that "heads of the Ts" are almost entirely within the decision region of one of the classes. This indicates that all points on the XY plane at Z = 0 can be perturbed along the Z-axis with a perturbation less than 0.04 to change its predicted label, yielding an adversarial risk of 100%. However, in the Z = 0 plane, the region of vulnerability is the union of the maroon balls, which is significantly smaller than what is what induced by perturbation along the Z-dimension. This suggests that our intuition with the T-shaped classifiers may be relevant in practice for neural networks.



See Proposition 4.16 in https://www.stats.ox.ac.uk/ ˜rebeschi/teaching/AFoL/20/ material/lecture04.pdf



The covering number N is the minimum number of ∥•∥-balls of radius ρ/2 such that their union contains C. For any distribution µ on R d , denote by µ (C) = P x∼µ [x ∈ C] the mass of the distribution µ contained in the compact C.

Figure 1: Depending on the covering number of C, a small number of ∥•∥balls of sufficient density cover a lot of the measure of C. Label noise makes every point drawn from the covered set adversarially vulnerable.

Figure 2: FromSanyal et al. (2021). Adversarial error increases with increasing label noise η (xaxis) at a rate much faster than predicted by Theorem 2. Here, ϵ is the perturbation magnitude (ρ in the current paper). The label noise is synthetically injected in the training set with probability η.

Figure3: Figures3a and 3bplots adversarial risk against perturbation budget on CIFAR10 and CIFAR100 datasets respectively, with three label noise models. Clean denotes the original CIFAR-10/100 labels. Human denotes the human-generated labels from(Wei et al., 2022). Uniform refers to uniformly random label noise with the same rate as Human. Figure3cplots the average memorisation score per class for: Human mislabeled examples i.e. examples with Human noisy labels and All samples. The horizontal line is the average across all classes. The higher score of human mislabeled examples indicates that those examples belong to the long tail of the distribution.

Figure 4: Visualization of a portion of the distribution µ and the hypothesis class H used in Theorem 7. The crosses are the mislabeled examples and the circles are correctly labelled examples. All the circles are adversarially vulnerable to upwards perturbations of magnitude less than ρ.

, as d goes to infinity, R Adv,ρ (f, µ) ≤ m exp(-d/40) + exp(-d/40) (43) ≤ exp(-d/80) + exp(-d/40) → 0. (44) C POISONING THEOREM Recall the definition of the adversarial risk R Adv,ρ from Section 2:

Figure 6: Separation and average distances of different classes in CIFAR-10.

ACKNOWLEDGMENTS

Amartya Sanyal acknowledges the ETH AI Center for the postdoctoral fellowship. We thank Mislav Balunović for discussing and checking the proof of Theorem 2, Domagoj Bradač for discussing Lemma 8, and Jacob Clarysse and Florian Tramèr for general feedback.

A PROOF OF THEOREM 2

Here we prove the following statement: Theorem 2. Let C ⊂ R d satisfy µ(C) > 0, and let N = N (ρ/2; C, ∥•∥) be its covering number. For δ > 0, when the number of samples satisfies m ≥ 8N µ(C)η log 2N δ . with probability 1 -δ we have thatfor any classifier f that interpolates the training set.For notational convenience, we replace ρ by 2ρ in all places for the proof below.Proof. Without loss of generality, let C 0 = { x ∈ C : f * (x) = 0 } have probability µ(C 0 ) ≥ 1 2 µ(C). Let µ 0 = µ| C0 , normalized so that µ 0 (C 0 ) = 1.By Chernoff, with probability 1 -exp -µ(C)m 16 ≥ 1 -δ 2 , at least m 0 = ⌊ µ(C)m 4 ⌋ of the samples z i are in C 0 . Without loss of generality, let z 1 , . . . , z m0 be those samples. ThenLet s 1 , . . . , s N be the centers of a minimum ρ-covering of C 0 .The plan is the following: we will lower bound i≤m0, f (zi)̸ =0 B 2ρ (z i ) by the union of some B ρ (s k ), which will have large µ 0 -measure in total. Moreover, each of the chosen B ρ (s k ) will have large enough µ 0 -measure. For this, we use the following general lemma:Lemma 8. Let s 1 , . . . , s N be the centers of some balls B r (s i ) in R d , and take any measure ν. Then, for any constant 0 < α < 1, there exists a subset S ⊆ { 1, . . . , N } of the balls such that:Informally, the first condition says that the union of the chosen subset has a constant fraction of the measure of the union. The second condition says that each of the chosen balls has Ω(1/N ) of the measure of the union of all balls.Proof. Without loss of generality, let the balls be ordered by measure:a) The original cover of C.b) The dense greedy subcover.Figure 5 : Illustration of Lemma 8 and Corollary 9. Given a cover of N balls, we can pick a subcover of balls covering at least half of the measure, with each ball having measure at least 1 2N .We take the greedy subset S = { 1, . . . , K }, whereThe first inequality follows because for all i ≥ K + 1 it holds ν (B r (s K )) < α ν N i=1 B r (s i ) , and the second is because N -K N ≤ 1. ■We can apply the above to the situation in the proof of Theorem 2 with α = 1 2 . Without loss of generality, order the covering s 1 , . . . , s N by the µ 0 -measure of the corresponding balls:We now show that the chosen balls are dense enough to get samples in the training set with high probability.Lemma 10. With probabilityProof. We haveand because the label corruption is independent from everything, we also haveTherefore,In the last line we used 1 + x ≤ e x , the fact that m 0 = 1 4 µ(C)m, and the expression for m from Theorem 2. Now we can use the union bound to prove the lemma:■ Finally, using both Lemma 8 and Lemma 10, we can finish:B PROOF OF PROPOSITION 4 Proof. Let the m = 1.01 d ≤ exp(d/80) samples be z 1 , . . . , z m with labels y 1 , . . . , y m ∈ {0, 1}. Almost surely the z i are distinct. Define the interpolating classifier f : R d → {0, 1} asWe want to show f is robust. Draw x = (x 1 , . . . , x d ) uniformly on S d-1 . There are only two ways x can contribute to the adversarial risk R Adv,ρ (f, µ):• x is close to a training sample z i with label noise;• x is close to the "decision boundary"Hence, remembering Equation (1),≤ P [x is in a ρ-ball around at least one of theThe proxy adversarial risk RPoison for the poisoned case is easy to define:In fact, by a compactness argument, the sup above can be replaced by max, but this is not important for the theorem we want to prove.The uniform version RUnif is a random variable. Its value depends on the random training set and which points get their labels flipped. We define it as follows: Sample a training set of T random points from µ independently, and let S be a random subset of the training set, with each point taken with probability η. Then RUnif ρ,T,η := RAdv,ρ (S, µ).(49)We are now ready to state our formal theorem. Theorem 11. Let η > 0 be the label noise rate, and let T, m be positive integers representing the training set sizes for Uniform and Poisoner. Fix N = ⌊ηm⌋ to be be the number of labels the poisoner can flip. For δ > 0, with probability 1 -δ, for any ρ > 0,forTo be precise, the Uniform adversary takes at least the following number of samples:The proof is quite similar in spirit to the proof of Theorem 2.Proof. Let the Poisoner pick points s 1 , . . . , s N ∈ R d . We want to prove that with high probability,We show that, with high probability, a sample of T points from µ, with uniform label noise with probability η, will result in many of the balls B ρ (s k ) having a mislabeled point in them.Order the points s 1 , . . . ,By Lemma 8 with α = 1 2 , we know that there exists K such thatandAs in Appendix A, we proceed to show that with high probability, a random sample of T points from µ will hit each of the balls B(s k , ρ) for k ≤ K, because the chosen balls are dense enough to get hit when T ≳ m log m. This is enough to provesince any two points in the same B ρ (s k ) are within distance 2ρ of each other.Let z 1 , . . . , z T be a random dataset of T points from µ. We have a lemma similar to Lemma 10:Proof. We haveand because the label corruption is independent from everything, we also haveTherefore,Here in the last line we used 1 + x ≤ e x and Equation ( 51).Now we can use the union bound to prove the lemma:

■

Combining Equation (53) and Equation (54), we have the desiredAs this holds for any s 1 , . . . , s N , and the left-hand-side does not depend on the chosen points s k , we can take the supremum over all possible choices of s 1 , . . . , s N to prove the theorem.

D PROOF OF PROPOSITION 6

Proposition 6. Let A < B be integers with A much smaller than B. Let µ be a mixture model on R supported on a disjoint union of A + B intervals, such that half of the mass is on the first A intervals and half of the mass is on the last B intervals:Unif A + j, A + j + 1 2Let the ground truth label be zero everywhere. Sample two datasets D 1 , D 2 of size m from µ using two different label noise distributions: For D 1 , flip the label of each sample x ∈ [0, A + B] independently with probability η. For D 2 , flip the label of each sample x ∈ [A, A + B] independently with probability 2η, and leave the labels of the other samples unchanged. Then, for any ρ, δ ∈ 0, 1 2 , for the number of samples m = Θρ (A) (ignoring log terms), we have that with probability 1 -δ:• For any f which interpolates D 1 , the adversarial risk is large: R Adv,ρ (f, µ) = Ω ρ (1).• There exists f which interpolatesPublished as a conference paper at ICLR 2023 

