(CERTIFIED!!) ADVERSARIAL ROBUSTNESS FOR FREE!

Abstract

In this paper we show how to achieve state-of-the-art certified adversarial robustness to ℓ2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. (2020) by combining a pretrained denoising diffusion probabilistic model with a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within an ℓ2 norm of ε = 0.5, an improvement of 14 percentage points over the prior certified SoTA using any approach, and an improvement of 30 percentage points over prior denoised smoothing results. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine-tuning or retraining of model parameters.

1. INTRODUCTION

Evaluating the robustness of deep learning models to norm-bounded adversarial perturbations has been shown to be difficult (Athalye et al., 2018; Uesato et al., 2018). Certified defenses, such as those based on bound propagation (Gowal et al., 2018; Mirman et al., 2018) or randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019), offer provable guarantees that a model's predictions are robust to norm-bounded adversarial perturbations for a large fraction of examples in the test set.

The current state-of-the-art approaches for certifying robustness to adversarial perturbations bounded in the ℓ2 norm rely on randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019). These defenses take a majority vote over the labels predicted by a "base classifier" under random Gaussian perturbations of the input: if the correct class is output sufficiently often, then the defense's output on the original un-noised input is guaranteed to be robust to ℓ2-norm bounded adversarial perturbations.

Denoised smoothing (Salman et al., 2020) is a certified defense that splits this one-step process into two. After randomly perturbing an input, the defense first applies a denoiser model that aims to remove the added noise, followed by a standard classifier that guesses a label given this noised-then-denoised input. This enables applying randomized smoothing to pretrained black-box base classifiers, as long as the denoiser can produce clean images close to the base classifier's original training distribution.

We observe that the recent line of work on denoising diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021), which achieve state-of-the-art results on image generation, is a perfect match for the denoising step in a denoised smoothing defense. A forward diffusion process takes a source data distribution (e.g., images from some data distribution) and adds Gaussian noise until the distribution converges to a high-variance isotropic Gaussian.
Denoising diffusion models are trained to invert this process. Thus, we can use a diffusion model as a denoiser that recovers high-quality denoised inputs from inputs perturbed with Gaussian noise. In this paper, we combine state-of-the-art, publicly available diffusion models as denoisers with standard pretrained state-of-the-art classifiers. We show that the resulting denoised smoothing defense obtains significantly better certified robustness results, for perturbations of ℓ2 norm ≤ 2 on ImageNet and ≤ 0.5 on CIFAR-10, than the "custom" denoisers trained in prior work (Salman et al., 2020), and indeed than any certifiably robust defense (even those that do not rely on denoised smoothing). Code to reproduce our experiments is available at: https://github.com/ethz-privsec/diffusion_denoised_smoothing.

Adversarial examples (Biggio et al., 2013; Szegedy et al., 2014) are inputs x′ = x + δ constructed by taking some input x (with true label y ∈ Y) and adding a perturbation δ (assumed to be imperceptible and hence label-preserving) so that a given classifier f misclassifies the perturbed input, i.e., f(x + δ) ≠ y. The "smallness" of δ is quantified by its Euclidean norm, and we constrain ‖δ‖2 ≤ ε. Even when considering exceptionally small perturbation budgets (e.g., ε = 0.5), modern classifiers often have near-0% accuracy (Carlini & Wagner, 2017).

Randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019) is a technique to certify the robustness of arbitrary classifiers against adversarial examples under the ℓ2 norm. Given an input x and base classifier f, randomized smoothing considers a smooth version of f defined as:

    g(x) := argmax_c Pr_{δ∼N(0,σ²I)} [f(x + δ) = c]     (1)

Cohen et al. (2019) prove that the smooth classifier g is robust to perturbations of ℓ2 radius R, where the radius R grows with the classifier's "margin" (i.e., the difference in probabilities assigned to the most likely and second most-likely classes).
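The majority-vote prediction behind the smoothed classifier g can be sketched in a few lines. This is an illustrative Monte Carlo approximation only; the function names, noise level, and sample count m are placeholders, not the exact setup used in the paper:

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma, m=10, num_classes=10, rng=None):
    """Approximate g(x) = argmax_c Pr[f(x + delta) = c], delta ~ N(0, sigma^2 I),
    by a majority vote over m noisy copies of the input x."""
    rng = np.random.default_rng(rng)
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(m):
        delta = rng.normal(0.0, sigma, size=x.shape)  # isotropic Gaussian noise
        votes[base_classifier(x + delta)] += 1        # vote of f on one noisy copy
    return int(np.argmax(votes))                      # most frequent predicted class
```

In practice the base classifier runs on a batch of noisy copies at once, but the vote itself is exactly this counting step.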
Denoised smoothing (Salman et al., 2020) is an instantiation of randomized smoothing in which the base classifier f is composed of a denoiser followed by a standard classifier f_clf:

    f(x + δ) := f_clf(denoise(x + δ)) .

Given a very good denoiser (i.e., denoise(x + δ) ≈ x with high probability for δ ∼ N(0, σ²I)), we can expect the base classifier's accuracy on noisy images to be similar to the clean accuracy of the standard classifier f_clf. Salman et al. (2020) instantiate their denoised smoothing technique by training custom denoiser models with Gaussian noise augmentation, combined with off-the-shelf pretrained classifiers.

Denoising Diffusion Probabilistic Models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021) are a form of generative model that learn to reverse time on a diffusion process of the form

    x_t ∼ √(1 − β_t) · x_{t−1} + √β_t · ω_t ,   ω_t ∼ N(0, I) ,

with x_0 coming from the data distribution, and the β_t being fixed (or learned) variance parameters. The diffusion process transforms images from the target data distribution into purely random noise over time; the reverse process then synthesizes images from the data distribution starting from random Gaussian noise. In this paper we will not make use of diffusion models in the typical way; instead, it suffices to understand a single property of how they are trained.
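The forward diffusion process above can be simulated directly; a hedged sketch (not the authors' implementation, and the β_t schedule here is an arbitrary placeholder):

```python
import numpy as np

def diffuse_forward(x0, betas, rng=None):
    """Run the forward process x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * w_t,
    w_t ~ N(0, I), and return the whole trajectory [x_0, x_1, ..., x_T]."""
    rng = np.random.default_rng(rng)
    xs = [np.asarray(x0, dtype=float)]
    for beta in betas:
        w = rng.standard_normal(xs[-1].shape)  # fresh Gaussian noise each step
        xs.append(np.sqrt(1.0 - beta) * xs[-1] + np.sqrt(beta) * w)
    return xs
```

With enough steps the signal term √(1 − β_t) repeatedly shrinks x while the noise term accumulates, which is the convergence to an isotropic Gaussian described in the text.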

Given a clean training image x ∈ [−1, 1]^{w·h·c}, a diffusion model selects a timestep t ∈ N+ from some fixed schedule and then samples a noisy image x_t of the form

    x_t := √α_t · x + √(1 − α_t) · N(0, I) ,

where the factor α_t is a constant derived from the timestep t that determines the amount of noise added to the image (the noise magnitude increases monotonically with t). The diffusion model is then trained (loosely speaking) to minimize the discrepancy between x and denoise(x_t; t); that is, to predict what the original (un-noised) image should look like after applying the noising step at timestep t.¹
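A minimal sketch of this one-shot noising step. We assume here, as in standard DDPMs (Ho et al., 2020), that α_t is the cumulative product ∏_{s≤t}(1 − β_s), which makes a single sample from this formula equivalent to running the per-step forward process for t steps; the helper names are ours:

```python
import numpy as np

def alpha_bar(betas, t):
    """Cumulative signal-retention factor: prod_{s<=t} (1 - beta_s)."""
    return float(np.prod(1.0 - np.asarray(betas[:t])))

def noisy_sample(x, alpha_bar_t, rng=None):
    """Sample x_t = sqrt(alpha_t) * x + sqrt(1 - alpha_t) * eps, eps ~ N(0, I),
    i.e. jump straight from a clean image to its timestep-t noisy version."""
    rng = np.random.default_rng(rng)
    eps = rng.standard_normal(np.asarray(x).shape)
    return np.sqrt(alpha_bar_t) * x + np.sqrt(1.0 - alpha_bar_t) * eps
```

As t grows, alpha_bar shrinks toward 0, so the sample contains less signal and more noise, matching the monotonic noise schedule described above.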



¹ State-of-the-art diffusion models are actually trained to predict the noise rather than the denoised image directly (Ho et al., 2020; Nichol & Dhariwal, 2021).



As the probability in Equation 1 cannot be efficiently computed when the base classifier f is a neural network, Cohen et al. (2019) instantiate this defense by sampling a small number m of noise instances (e.g., m = 10) and taking a majority vote over the outputs of the base classifier f on the m noisy versions of the input. To compute a lower bound on this defense's robust radius R, they estimate the probabilities Pr[f(x + δ) = c] for each class label c by sampling a large number N of noise instances δ (e.g., N = 100,000). See Cohen et al. (2019) for details.
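This certification step can be illustrated as follows, using Cohen et al.'s radius formula R = σ · Φ⁻¹(p_A) with a Clopper-Pearson lower confidence bound on the top-class probability p_A. This is a simplified sketch: it does not reproduce the full CERTIFY procedure (e.g., the separate selection and estimation sample sets), and the abstention rule shown is our shorthand:

```python
import numpy as np
from scipy.stats import binomtest, norm

def certified_radius(num_top, num_samples, sigma, alpha=0.001):
    """Lower-bound p_A = Pr[f(x + delta) = top class] from num_top successes
    in num_samples Gaussian noise draws, then return the certified l2 radius
    R = sigma * Phi^{-1}(p_A_lower).  Returns 0.0 (abstain) if p_A_lower <= 1/2."""
    # One-sided lower bound at level alpha via a two-sided exact interval.
    p_lower = binomtest(num_top, num_samples).proportion_ci(
        confidence_level=1 - 2 * alpha, method="exact").low
    return sigma * float(norm.ppf(p_lower)) if p_lower > 0.5 else 0.0
```

The radius grows with both the estimated margin (through Φ⁻¹) and the smoothing noise level σ, which is why certifying larger radii requires classifying under heavier noise.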

