HOW BENIGN IS BENIGN OVERFITTING?

Abstract

We investigate two causes of adversarial vulnerability in deep neural networks: bad data and (poorly) trained models. When trained with SGD, deep neural networks essentially achieve zero training error, even in the presence of label noise, while also exhibiting good generalization on natural test data, a phenomenon referred to as benign overfitting (Bartlett et al., 2020; Chatterji & Long, 2020). However, these models are vulnerable to adversarial attacks. We identify label noise as one cause of adversarial vulnerability, and provide theoretical and empirical evidence in support of this. Surprisingly, we find several instances of label noise in datasets such as MNIST and CIFAR-10, and that robustly trained models incur training error on some of these examples, i.e. they do not fit the noise. However, removing noisy labels alone does not suffice to achieve adversarial robustness. We conjecture that sub-optimal representation learning is also partly responsible for adversarial vulnerability. By means of simple theoretical setups, we show how the choice of representation can drastically affect adversarial robustness.

1. INTRODUCTION

Modern machine learning methods achieve very high accuracy on a wide range of tasks, e.g. in computer vision and natural language processing. However, especially in vision tasks, they have been shown to be highly vulnerable to small adversarial perturbations that are imperceptible to the human eye (Dalvi et al., 2004; Biggio & Roli, 2018; Goodfellow et al., 2014). This vulnerability poses serious security concerns when these models are deployed in real-world tasks (cf. Papernot et al., 2017; Schönherr et al., 2018; Hendrycks et al., 2019b; Li et al., 2019a). A large body of research has been devoted to crafting defences to protect neural networks from adversarial attacks (e.g. Goodfellow et al., 2014; Papernot et al., 2015; Tramèr et al., 2018; Madry et al., 2018; Zhang et al., 2019). However, such defences have usually been broken by subsequent attacks (Athalye et al., 2018; Tramer et al., 2020). This arms race between attacks and defences suggests that creating a truly robust model requires a deeper understanding of the source of this vulnerability. Our goal in this paper is not to propose new defences, but to provide better answers to the question: what causes adversarial vulnerability? In doing so, we also seek to understand how existing methods designed to achieve adversarial robustness overcome some of the hurdles pointed out by our work. We identify two sources of adversarial vulnerability that, to the best of our knowledge, have not been properly studied before: a) memorization of label noise, and b) improper representation learning.

Overfitting Label Noise: Starting with the celebrated work of Zhang et al. (2016), it has been observed that neural networks trained with SGD are capable of memorizing large amounts of label noise. Recent theoretical work (e.g.
Liang & Rakhlin, 2018; Belkin et al., 2018a;b; Hastie et al., 2019) has also sought to explain why fitting training data perfectly does not lead to a large drop in test accuracy, as the classical notion of overfitting might suggest. This is commonly referred to as memorization or interpolation. We show through simple theoretical models, as well as experiments on standard datasets, that there are scenarios where label noise causes significant adversarial vulnerability, even when high natural (test) accuracy can be achieved. Surprisingly, we find that label noise is not at all uncommon in datasets such as MNIST and CIFAR-10 (see Figure 1).

Sub-optimal Representation Learning: Tsipras et al. (2019) have argued that the trade-off between robustness and accuracy might be unavoidable. However, their setting involves a distribution that is not robustly separable by any classifier. In such a situation there is indeed a trade-off between robustness and accuracy. In this paper, we focus on settings where robust classifiers exist, which is a more realistic scenario for real-world data. At least for vision, one may well argue that "humans" are robust classifiers, and as a result we would expect that classes are well-separated at least in some representation space. In fact, Yang et al. (2020) show that classes are already well-separated in the input space. In such situations, there is no need for robustness to be at odds with accuracy. A more plausible scenario, which we posit and provide theoretical evidence for in Theorem 2, is that depending on the choice of representations, the trade-off may exist or can be avoided. Recent empirical work (Sanyal et al., 2020a; Mao et al., 2020) has also established that modifying the training objective to favour certain properties in the learned representations can automatically lead to improved robustness.
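As a concrete illustration of how memorized label noise forces adversarial vulnerability, consider the following minimal sketch (our own toy construction, not the paper's theoretical setup): a 1-nearest-neighbour classifier interpolates every training label by definition, so a single flipped label carves out a misclassified region, and clean inputs near it acquire small adversarial perturbations.

```python
import numpy as np

# Two well-separated 1-D classes (gap of 1.6 between the clusters).
X = np.array([-1.2, -1.0, -0.8, 0.8, 1.0, 1.2])
y_true = np.array([0, 0, 0, 1, 1, 1])

# Inject label noise: flip the label of the training point at x = -1.0.
y_noisy = y_true.copy()
y_noisy[1] = 1

def predict_1nn(x):
    """1-NN interpolates the training set, memorizing the noisy label."""
    return y_noisy[np.argmin(np.abs(X - x))]

x_clean = -0.85                       # correctly classified as class 0
delta = 0.15                          # tiny compared with the class gap
print(predict_1nn(x_clean))           # 0
print(predict_1nn(x_clean - delta))   # 1: perturbing toward the memorized
                                      # noisy point flips the prediction
```

The perturbation needed to fool the classifier (0.15) is an order of magnitude smaller than the separation between the classes, even though the distribution itself is robustly separable.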
However, we show in Section 3.2 that some training algorithms can create an apparent trade-off even when the trade-off is not fundamental to the problem. On a related note, recent works have suggested that adversarially robust learning may require more "complex" decision boundaries, and as a result more data (Shah et al.; Schmidt et al., 2018; Yin et al., 2019; Nakkiran, 2019; Madry et al., 2018). However, the question of decision boundaries in neural networks is subtle, as the network learns a feature representation as well as a decision boundary on top of it. We develop concrete theoretical examples in Theorems 2 and 3 to establish that choosing one feature representation over another may lead to visually more complex decision boundaries in the input space, though these are not necessarily more complex in terms of statistical learning theoretic notions such as VC dimension.
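To see how the choice of representation alone can create or dissolve an apparent trade-off, here is a hedged toy sketch in the spirit of Theorem 2 (the features and margins below are our own illustrative choices): two features are equally predictive on natural data but have very different margins, so a linear classifier reading the fragile feature is accurate yet breaks under tiny perturbations, while one reading the robust feature is unaffected.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1000
y = rng.choice([-1.0, 1.0], size=n)

# Two representations of the same label signal:
x_robust = 1.0 * y     # margin 1.0 around the decision threshold
x_fragile = 0.05 * y   # margin only 0.05, yet equally predictive

# Both give perfect natural accuracy with the linear rule sign(x):
assert np.mean(np.sign(x_robust) == y) == 1.0
assert np.mean(np.sign(x_fragile) == y) == 1.0

# Worst-case additive perturbation of size eps = 0.1, pushed toward
# the decision boundary: the fragile representation collapses while
# the robust one is untouched.
eps = 0.1
robust_acc = np.mean(np.sign(x_robust - eps * y) == y)    # 1.0
fragile_acc = np.mean(np.sign(x_fragile - eps * y) == y)  # 0.0
```

A natural-accuracy benchmark cannot distinguish the two representations; only the perturbed evaluation reveals that one of them has placed every point within epsilon of the boundary.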

Summary of Theoretical Contributions

1. We provide simple sufficient conditions on the data distribution under which any classifier that fits the training data with label noise perfectly is adversarially vulnerable.
2. There exist data distributions and training algorithms which, when trained with (some fraction of) random label noise, have the following property: (i) using one representation, it is possible



Figure 1: Label noise in CIFAR-10 and MNIST. The text above each image indicates its training-set label.

Our experiments show that robust training methods like adversarial training (AT) (Madry et al., 2018) and TRADES (Zhang et al., 2019) produce models that incur training error not only on at least some of the noisy examples, but also on atypical examples from their classes (Zhang & Feldman, 2020). Viewed

