ON ACHIEVING OPTIMAL ADVERSARIAL TEST ERROR

Abstract

We first elucidate various fundamental properties of optimal adversarial predictors: the structure of optimal adversarial convex predictors in terms of optimal adversarial zero-one predictors, bounds relating the adversarial convex loss to the adversarial zero-one loss, and the fact that continuous predictors can get arbitrarily close to the optimal adversarial error for both convex and zero-one losses. Applying these results along with new Rademacher complexity bounds for adversarial training near initialization, we prove that for general data distributions and perturbation sets, adversarial training on shallow networks with early stopping and an idealized optimal adversary is able to achieve optimal adversarial test error. By contrast, prior theoretical work either considered specialized data distributions or only provided training error guarantees.

1. INTRODUCTION

Imperceptibly altering the input data in a malicious fashion can dramatically decrease the accuracy of neural networks (Szegedy et al., 2014). To defend against such adversarial attacks, maliciously altered training examples can be incorporated into the training process, encouraging robustness in the final network. Attacks used during this adversarial training, such as FGSM (Goodfellow et al., 2015), PGD (Madry et al., 2019), and the C&W attack (Carlini & Wagner, 2016), are optimization-based procedures that search for harmful perturbations around the inputs, and their use has been shown to improve robustness. While many other defenses have been proposed (Guo et al., 2017; Dhillon et al., 2018; Xie et al., 2017), adversarial training remains the standard approach (Athalye et al., 2018). Despite many advances, a large gap persists between the accuracies achievable on non-adversarial and adversarial test sets. For instance, in Madry et al. (2019), a wide ResNet model achieved 95% accuracy on CIFAR-10 with standard training, but only 46% accuracy on CIFAR-10 images with PGD perturbations bounded by 8/255 in each coordinate, even with the benefit of adversarial training. In this work we seek to better understand the optimal adversarial predictors we are trying to achieve, as well as how adversarial training can help us get there. While several recent works have analyzed properties of optimal adversarial zero-one classifiers (Bhagoji et al., 2019; Pydi & Jog, 2020; Awasthi et al., 2021b), in the present work we build on these analyses to characterize optimal adversarial convex surrogate loss classifiers.
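To make the attacks mentioned above concrete, the following is a minimal, hypothetical sketch (not from the paper) of FGSM and PGD for a linear classifier with the logistic loss under an ℓ∞ perturbation budget; all function names and the model choice are ours, chosen only so the inner maximization is easy to inspect.

```python
import numpy as np

def loss(w, x, y):
    # logistic loss of a linear classifier on (x, y); labels y are in {-1, +1}
    return np.log1p(np.exp(-y * np.dot(w, x)))

def input_grad(w, x, y):
    # gradient of the logistic loss with respect to the input x
    s = -y / (1.0 + np.exp(y * np.dot(w, x)))
    return s * w

def fgsm(w, x, y, eps):
    # FGSM: a single signed-gradient step of size eps (stays in the l_inf ball)
    return x + eps * np.sign(input_grad(w, x, y))

def pgd(w, x, y, eps, step, iters):
    # PGD: iterated signed-gradient ascent on the loss, with projection
    # back onto the l_inf ball of radius eps around the clean input x
    x_adv = x.copy()
    for _ in range(iters):
        x_adv = x_adv + step * np.sign(input_grad(w, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv
```

For this linear model the loss is monotone in the margin, so PGD with enough iterations saturates the budget; for deep networks the inner maximization is nonconcave and PGD is only a heuristic, which is why the paper's analysis assumes an idealized optimal adversary.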
Even though some prior works have suggested shifting away from convex losses in the adversarial setting because they are not adversarially calibrated (Bao et al., 2020; Awasthi et al., 2021a; c; 2022a; b), we show that the use of convex losses is not an issue as long as a threshold is appropriately chosen. We also show that, under idealized settings, adversarial training can achieve the optimal adversarial test error. In prior work, guarantees on the adversarial test error have been elusive except in the specialized case of linear regression (Donhauser et al., 2021; Javanmard et al., 2020; Hassani & Javanmard, 2022). Our analysis is in the Neural Tangent Kernel (NTK) or near-initialization regime, where recent work has shown that analyzing gradient descent can be more tractable (Jacot et al., 2018; Du et al., 2018). Among many such works, our analysis is closest to that of Ji et al. (2021), which provides a general test error analysis, but for standard (non-adversarial) training. A recent work (Rice et al., 2020) suggests that early stopping helps with adversarial training, as otherwise the network enters a robust overfitting phase in which the adversarial test error quickly rises while the adversarial training error continues to decrease. The present work uses a form of early stopping, and so operates in the earlier regime where there is little to no overfitting. In fact, due to technical reasons, our analysis is further restricted to an even earlier portion of this phase, as we remain within the near-initialization/NTK regime. As noted in prior work, adversarial training, compared with standard training, seems to have more fragile test-time performance and quickly enters a phase of severe overfitting, but we do not consider this issue here.

1.1. OUR CONTRIBUTIONS

In this work, we prove structural results on the nature of predictors that are close to, or even achieve, the optimal adversarial test error. In addition, we prove that adversarial training on shallow ReLU networks can get arbitrarily close to the optimal adversarial test error over all measurable functions. This theoretical guarantee requires the use of optimal adversarial attacks during training, meaning we have access to an oracle that returns, within the allowed set of perturbations, the data point which maximizes the loss. We also use early stopping so that we remain in the near-initialization regime and ensure low model complexity. The main technical contributions are as follows.

1. Optimal adversarial predictor structure (Section 3). We prove fundamental results about optimal adversarial predictors by relating the global adversarial convex loss to global adversarial zero-one losses (cf. Lemma 3.1). We show that optimal adversarial convex loss predictors are directly related to optimal adversarial zero-one loss predictors (cf. Lemma 3.2). In addition, for predictors whose adversarial convex loss is almost optimal, we show that when an appropriate threshold is chosen, their adversarial zero-one loss is also almost optimal (cf. Theorem 3.3). This theorem translates bounds on adversarial convex losses, such as those in Section 4, into bounds on adversarial zero-one losses when optimal thresholds are chosen. Using our structural results on optimal adversarial predictors, we prove that continuous functions can get arbitrarily close to the optimal test error attained by measurable functions (cf. Lemma 3.4).

2. Adversarial training (Section 4). Under idealized settings, we show that adversarial training leads to optimal adversarial predictors. (a) Generalization bound. We prove a near-initialization generalization bound for the adversarial risk (cf. Lemma 4.4). To do so, we provide a Rademacher complexity bound for linearized functions around initialization (cf. Lemma 4.5). The overall bound scales directly with the parameters' distance from initialization and with 1/√n, where n is the number of training points. The bound also includes a perturbation term which depends on the width of the network and in the worst case scales like τ^{1/4}, where τ bounds the ℓ2 norm of the perturbations. (b) Optimization bound. We show that using an optimal adversarial attack during gradient descent training results in a network which is adversarially robust on the training set, in the sense that it is not much worse than an arbitrary reference network (cf. Lemma 4.6). Comparing to a reference network, instead of just ensuring low training error (as in prior work), is key to obtaining a good generalization analysis, as the optimal adversarial test error may be high.
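The idealized optimal adversary is in general an oracle, but for a linear predictor with an ℓ2 perturbation ball of radius τ the inner maximization has a closed form (the worst-case margin is y⟨w, x⟩ − τ‖w‖), which lets the overall pipeline be sketched end to end. The sketch below is ours and uses a linear model purely for tractability; the paper analyzes shallow ReLU networks, and the "early stopping" here is simply returning the best iterate on a held-out set.

```python
import numpy as np

def adv_logistic_loss(w, X, Y, tau):
    # adversarial logistic loss of a linear predictor: for an l2 ball of
    # radius tau the inner max is exact, giving margin y<w,x> - tau*||w||
    margins = Y * (X @ w) - tau * np.linalg.norm(w)
    return float(np.mean(np.log1p(np.exp(-margins))))

def adversarial_train(X, Y, Xv, Yv, tau, lr=0.1, iters=200):
    # gradient descent on the adversarial loss (the inner max is solved
    # exactly at every step), with early stopping implemented by keeping
    # the iterate with the best held-out adversarial loss
    rng = np.random.default_rng(0)
    w = 0.01 * rng.standard_normal(X.shape[1])  # small near-init start
    best_w, best_val = w.copy(), adv_logistic_loss(w, Xv, Yv, tau)
    for _ in range(iters):
        norm = max(np.linalg.norm(w), 1e-12)
        margins = Y * (X @ w) - tau * norm
        coef = -1.0 / (1.0 + np.exp(margins))              # d loss / d margin
        dmdw = Y[:, None] * X - tau * (w / norm)[None, :]  # d margin / d w
        w = w - lr * np.mean(coef[:, None] * dmdw, axis=0)
        val = adv_logistic_loss(w, Xv, Yv, tau)
        if val < best_val:
            best_val, best_w = val, w.copy()
    return best_w
```

On data that is robustly separable with margin larger than τ, this drives every worst-case margin positive, i.e. zero adversarial training error; comparing against a reference predictor rather than demanding zero error, as in the actual analysis, is what handles distributions where the optimal adversarial error is bounded away from zero.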



Figure 1: A plot of the (robust/standard) zero-one (training/test) loss throughout training for an adversarially trained network. We ran the code of Rice et al. (2020), using a constant step size of 0.01. The present work is set within the early phase of training, where we can get arbitrarily close to the optimal adversarial test error.

