ON ACHIEVING OPTIMAL ADVERSARIAL TEST ERROR

Abstract

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

1. INTRODUCTION

Imperceptibly altering the input data in a malicious fashion can dramatically decrease the accuracy of neural networks (Szegedy et al., 2014). To defend against such adversarial attacks, maliciously altered training examples can be incorporated into the training process, encouraging robustness in the final network. Attacks used during this adversarial training, such as FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2019), and the C&W attack (Carlini & Wagner, 2016), are optimization-based procedures that search for damaging perturbations around the inputs, and have been shown to improve robustness. While many other defenses have been proposed (Guo et al., 2017; Dhillon et al., 2018; Xie et al., 2017), adversarial training remains the standard approach (Athalye et al., 2018). Despite many advances, a large gap persists between the accuracies achievable on non-adversarial and adversarial test sets. For instance, in Madry et al. (2019), a wide ResNet model achieved 95% accuracy on CIFAR-10 with standard training, but only 46% accuracy on CIFAR-10 images with PGD perturbations bounded by 8/255 in each coordinate, even with the benefit of adversarial training.

In this work we seek to better understand the optimal adversarial predictors we are trying to achieve, as well as how adversarial training can help us get there. While several recent works have analyzed properties of optimal adversarial zero-one classifiers (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b), in the present work we build on these analyses to characterize optimal adversarial convex surrogate loss classifiers.
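As a rough illustration of the optimization-based attacks mentioned above (not the idealized optimal adversary analyzed in this paper), PGD repeatedly takes a signed-gradient ascent step on the loss and projects the perturbation back into the allowed ball. The sketch below applies an l-infinity PGD attack to a simple logistic model; the model, and the parameter names `eps`, `alpha`, and `steps`, are chosen here purely for illustration.

```python
import numpy as np

def pgd_attack(x, y, w, b, eps=0.1, alpha=0.02, steps=10):
    """Illustrative l-infinity PGD attack on a logistic model
    sigmoid(w.x + b): maximize the logistic loss over the eps-ball."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        z = np.dot(w, x + delta) + b
        p = 1.0 / (1.0 + np.exp(-z))
        grad = (p - y) * w              # gradient of the loss w.r.t. the input
        delta += alpha * np.sign(grad)  # signed-gradient ascent step
        delta = np.clip(delta, -eps, eps)  # project back into the eps-ball
    return x + delta
```

For a correctly classified point, the returned perturbation stays within the eps-ball by construction, while the loss at the perturbed input can only be at least as large as at the clean input.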
Even though some prior works have suggested shifting away from convex losses in the adversarial setting because they are not adversarially calibrated (Bao et al., 2020; Awasthi et al., 2021a; c; 2022a; b), we show that the use of convex losses is not an issue as long as a threshold is appropriately chosen. We also show that, under idealized settings, adversarial training can achieve the optimal adversarial test error. In prior work, guarantees on the adversarial test error have been elusive, except in the specialized case of linear regression (Donhauser et al., 2021; Javanmard et al., 2020; Hassani & Javanmard, 2022). Our analysis is in the Neural Tangent Kernel (NTK) or near-initialization regime, where recent work has shown that analyzing gradient descent can be more tractable (Jacot et al., 2018; Du et al., 2018). Among such works, our analysis is closest to Ji et al. (2021), which provides a general test error analysis, but for standard (non-adversarial) training. A recent work (Rice et al., 2020) suggests that early stopping helps with adversarial training, as otherwise the network enters a robust overfitting phase in which the adversarial test error quickly rises while the adversarial training error continues to decrease. The present work uses a form of early stopping, and so operates in the earlier regime where there is little to no overfitting.
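To make the role of early stopping concrete, the sketch below trains a logistic model adversarially (inner maximization via a one-step signed-gradient perturbation, outer minimization via gradient descent) and stops once the adversarial validation error stops improving. This is only a toy illustration of the min-max-with-early-stopping template, not the shallow-network, optimal-adversary setting analyzed in the paper; all names (`adv_examples`, `patience`, etc.) are invented here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def adv_examples(X, y, w, b, eps):
    """One-step signed-gradient (FGSM-style) perturbation of each input."""
    p = sigmoid(X @ w + b)
    return X + eps * np.sign((p - y)[:, None] * w)

def adv_train(X, y, Xv, yv, eps=0.1, lr=0.5, epochs=200, patience=10):
    """Toy adversarial training with early stopping on adversarial
    validation error (illustrative sketch only)."""
    w, b = np.zeros(X.shape[1]), 0.0
    best = (np.inf, w.copy(), b)
    bad = 0
    for _ in range(epochs):
        Xa = adv_examples(X, y, w, b, eps)    # inner maximization
        p = sigmoid(Xa @ w + b)
        w = w - lr * (Xa.T @ (p - y)) / len(y)  # outer minimization
        b = b - lr * np.mean(p - y)
        Xva = adv_examples(Xv, yv, w, b, eps)
        err = np.mean((sigmoid(Xva @ w + b) > 0.5) != yv)
        if err < best[0]:
            best, bad = (err, w.copy(), b), 0
        else:
            bad += 1
            if bad >= patience:               # early stopping
                break
    return best[1], best[2]
```

On well-separated data the adversarial validation error plateaus quickly, and the returned parameters are those from before the plateau, mirroring the "stop before robust overfitting" recipe described above.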

