(CERTIFIED!!) ADVERSARIAL ROBUSTNESS FOR FREE!

Abstract

In this paper we show how to achieve state-of-the-art certified adversarial robustness to ℓ2-norm bounded perturbations by relying exclusively on off-the-shelf pretrained models. To do so, we instantiate the denoised smoothing approach of Salman et al. (2020) by combining a pretrained denoising diffusion probabilistic model and a standard high-accuracy classifier. This allows us to certify 71% accuracy on ImageNet under adversarial perturbations constrained to be within an ℓ2 norm of ε = 0.5, an improvement of 14 percentage points over the prior certified state of the art using any approach, or an improvement of 30 percentage points over denoised smoothing. We obtain these results using only pretrained diffusion models and image classifiers, without requiring any fine-tuning or retraining of model parameters.

1. INTRODUCTION

Evaluating the robustness of deep learning models to norm-bounded adversarial perturbations has been shown to be difficult (Athalye et al., 2018; Uesato et al., 2018). Certified defenses, such as those based on bound propagation (Gowal et al., 2018; Mirman et al., 2018) or randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019), offer provable guarantees that a model's predictions are robust to norm-bounded adversarial perturbations for a large fraction of examples in the test set.

The current state-of-the-art approaches to certify robustness to adversarial perturbations bounded in the ℓ2 norm rely on randomized smoothing (Lecuyer et al., 2019; Cohen et al., 2019). These defenses take a majority vote over the labels predicted by a "base classifier" under random Gaussian perturbations of the input: if the correct class is output sufficiently often, then the defense's output on the original un-noised input is guaranteed to be robust to ℓ2-norm bounded adversarial perturbations.

Denoised smoothing (Salman et al., 2020) is a certified defense that splits this one-step process into two. After randomly perturbing an input, the defense first applies a denoiser model that aims to remove the added noise, followed by a standard classifier that guesses a label given this noised-then-denoised input. This enables applying randomized smoothing to pretrained black-box base classifiers, as long as the denoiser can produce clean images close to the base classifier's original training distribution.

We observe that the recent line of work on denoising diffusion probabilistic models (Sohl-Dickstein et al., 2015; Ho et al., 2020; Nichol & Dhariwal, 2021), which achieve state-of-the-art results on image generation, is a perfect match for the denoising step in a denoised smoothing defense. A forward diffusion process takes a source data distribution (e.g., natural images) and adds Gaussian noise until the distribution converges to a high-variance isotropic Gaussian.
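The noise-denoise-classify-vote pipeline described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `denoise` and `classify` are hypothetical stand-ins for a pretrained diffusion denoiser and base classifier, and the full defense would additionally compute the statistical certificate of Cohen et al. (2019) rather than only the majority-vote prediction.

```python
import numpy as np

def smoothed_predict(x, denoise, classify, sigma, n_samples=100, rng=None):
    """Denoised-smoothing prediction: perturb the input with Gaussian noise,
    denoise it, classify the result, and take a majority vote over labels.
    `denoise` and `classify` are stand-ins for pretrained models."""
    rng = np.random.default_rng(rng)
    votes = {}
    for _ in range(n_samples):
        x_noisy = x + rng.normal(0.0, sigma, size=x.shape)  # random Gaussian perturbation
        x_denoised = denoise(x_noisy)                       # remove the added noise
        label = classify(x_denoised)                        # standard classifier on the cleaned input
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)                        # majority vote over predicted labels
```

As a toy usage example, with an identity "denoiser" and a thresholding "classifier", `smoothed_predict(np.full(4, 1.0), lambda z: z, lambda z: int(z.mean() > 0), sigma=0.25)` returns the stable label 1 despite the injected noise.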
Denoising diffusion models are trained to invert this process. Thus, we can use a diffusion model as a denoiser that recovers high-quality denoised inputs from inputs perturbed with Gaussian noise.

In this paper, we combine state-of-the-art, publicly available diffusion models as denoisers with standard pretrained state-of-the-art classifiers. We show that the resulting denoised smoothing defense obtains significantly better certified robustness results, for perturbations of ℓ2 norm ≤ 2 on ImageNet and ≤ 0.5 on CIFAR-10, than the "custom" denoisers trained in prior work (Salman et al., 2020), and indeed than any certifiably robust defense (even those that do not rely on denoised smoothing). Code to reproduce our experiments is available at: https://github.com/ethz-privsec/diffusion_denoised_smoothing.

* Joint first authors
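To use a diffusion model as the denoiser, the smoothing noise level σ must be matched to a diffusion timestep whose accumulated noise has the same magnitude. A hedged sketch of that mapping, assuming the standard linear β-schedule of Ho et al. (2020) with illustrative default parameters (the exact schedule depends on the pretrained model used):

```python
import numpy as np

def find_timestep(sigma, num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Find the DDPM timestep t* whose effective noise level matches a given
    smoothing sigma, i.e. (1 - alpha_bar_t) / alpha_bar_t ~= sigma^2.
    Schedule parameters here are illustrative defaults, not the paper's exact values."""
    betas = np.linspace(beta_start, beta_end, num_steps)   # linear beta schedule
    alpha_bar = np.cumprod(1.0 - betas)                    # cumulative signal retention
    # effective noise variance at each step, in the diffusion's scaled coordinates
    eff = (1.0 - alpha_bar) / alpha_bar
    return int(np.argmin(np.abs(eff - sigma ** 2)))        # closest-matching timestep
```

Larger smoothing levels map to later timesteps, since the forward process accumulates noise monotonically; one would then scale the noisy input by √(ᾱ_t*) and run a single denoising step at t*.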

