Adversarial Boot Camp: label-free certified robustness in one epoch

Abstract

Machine learning models are vulnerable to adversarial attacks. One approach to addressing this vulnerability is certification, which focuses on models that are guaranteed to be robust for a given perturbation size. A drawback of recent certified models is that they are stochastic: they require multiple computationally expensive model evaluations with random noise added to a given image. In our work, we present a deterministic certification approach which results in a certifiably robust model. This approach is based on an equivalence between training with a particular regularized loss and the expected values of Gaussian averages. We achieve certified models on ImageNet-1k by retraining a model with this loss for one epoch, without the use of label information.

1. Introduction

Neural networks are very accurate on image classification tasks, but they are vulnerable to adversarial perturbations, i.e. small changes to the model input leading to misclassification (Szegedy et al., 2014). Adversarial training (Madry et al., 2018) improves robustness, at the expense of a loss of accuracy on unperturbed images (Zhang et al., 2019). Model certification (Lécuyer et al., 2019; Raghunathan et al., 2018; Cohen et al., 2019) is a complementary approach to adversarial training, which provides a guarantee that a model's prediction is invariant to perturbations up to a given norm. Given an input x, the model f is certified to ℓ2 norm r at x if it gives the same classification on f(x + η) for all perturbations η with norm up to r:

    argmax f(x + η) = argmax f(x),  for all ‖η‖₂ ≤ r    (1)

Cohen et al. (2019) and Salman et al. (2019) certify models by defining a "smoothed" model, f_smooth, which is the expected Gaussian average of the initial model f at a given input example x,

    f_smooth(x) ≈ E_η[f(x + η)]    (2)

where the perturbation is sampled from a Gaussian, η ∼ N(0, σ²I). Cohen et al. (2019) used a probabilistic argument to show that models defined by (2) can be certified to a given radius by making a large number of stochastic model evaluations. Certified models can classify either by averaging the model outputs (Salman et al., 2019) or by taking the mode, the most popular classification given by the ensemble (Cohen et al., 2019). Both works approximate the model f_smooth stochastically, using a Gaussian ensemble, which consists of evaluating the base model f multiple times on the image perturbed by noise. Like all ensemble models, these stochastic models require multiple inferences, which is more costly than performing inference a single time. In addition, these stochastic models require training the base model f from scratch, exposing it to Gaussian noise during training, in order to improve the accuracy of f_smooth. Salman et al. (2019) additionally expose the model to adversarial attacks during training. In the case of certified models, there is a trade-off between certification and accuracy: the certified models lose accuracy on unperturbed images.
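The stochastic smoothing procedure above can be sketched as follows. This is a minimal illustration, not the authors' code: `smooth_predict`, `certified_radius`, and all parameter values are names chosen for this example. The mode rule and the radius formula follow Cohen et al. (2019); the mean rule follows Salman et al. (2019).

```python
import numpy as np
from statistics import NormalDist

def smooth_predict(f, x, sigma=0.25, n_samples=100, rule="mean", rng=None):
    """Monte Carlo estimate of the smoothed classifier's prediction,
    f_smooth(x) = E_eta[f(x + eta)], with eta ~ N(0, sigma^2 I).
    `f` maps a batch of inputs to per-class scores."""
    rng = np.random.default_rng() if rng is None else rng
    noise = rng.normal(0.0, sigma, size=(n_samples,) + x.shape)
    scores = f(x[None, ...] + noise)        # shape (n_samples, n_classes)
    if rule == "mean":                      # average the ensemble (Salman et al.)
        return int(scores.mean(axis=0).argmax())
    votes = scores.argmax(axis=1)           # mode: most popular class (Cohen et al.)
    return int(np.bincount(votes).argmax())

def certified_radius(p_a, p_b, sigma):
    """l2 radius certified by the smoothing theorem of Cohen et al. (2019),
    given the top-class probability p_a and runner-up probability p_b."""
    phi_inv = NormalDist().inv_cdf          # inverse standard-normal CDF
    return sigma / 2 * (phi_inv(p_a) - phi_inv(p_b))
```

For instance, with the toy linear model f(X) = X and an input firmly in class 0, both decision rules return class 0, and `certified_radius(0.9, 0.1, 1.0)` evaluates to roughly 1.28. Note the cost the text describes: each call to `smooth_predict` makes `n_samples` evaluations of the base model f.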

