DENSEPURE: UNDERSTANDING DIFFUSION MODELS FOR ADVERSARIAL ROBUSTNESS

ABSTRACT

Diffusion models have recently been employed to improve certified robustness through their denoising process. However, a theoretical understanding of why diffusion models are able to improve certified robustness is still lacking, preventing further improvement. In this study, we close this gap by analyzing the fundamental properties of diffusion models and establishing the conditions under which they can enhance certified robustness. This deeper understanding allows us to propose a new method, DensePure, designed to improve the certified robustness of a pretrained model (i.e., a classifier). Given an (adversarial) input, DensePure consists of multiple runs of denoising via the reverse process of the diffusion model (with different random seeds) to get multiple reversed samples, which are then passed through the classifier, followed by majority voting over the inferred labels to make the final prediction. This design of using multiple runs of denoising is informed by our theoretical analysis of the conditional distribution of the reversed sample. Specifically, when the data density of a clean sample is high, its conditional density under the reverse process of the diffusion model is also high; thus, sampling from this conditional distribution can purify the adversarial example and return the corresponding clean sample with high probability. By using the highest-density point in the conditional distribution as the reversed sample, we identify the robust region of a given instance under the diffusion model's reverse process. We show that this robust region is a union of multiple convex sets and is potentially much larger than the robust regions identified in previous works. In practice, DensePure approximates the label of the high-density region in the conditional distribution and can thereby enhance certified robustness.
We conduct extensive experiments to demonstrate the effectiveness of DensePure by evaluating its certified robustness for a standard model via randomized smoothing. We show that DensePure is consistently better than existing methods on ImageNet, with a 7% improvement on average. Project page: https://densepure.github.io/.

1. INTRODUCTION

Diffusion models have been shown to be a powerful image generation tool (Ho et al., 2020; Song et al., 2021b) owing to their iterative diffusion and denoising processes. These models have achieved state-of-the-art performance on sample quality (Dhariwal & Nichol, 2021; Vahdat et al., 2021) as well as effective mode coverage (Song et al., 2021a). A diffusion model usually consists of two processes: (i) a forward diffusion process that converts data to noise by gradually adding noise to the input, and (ii) a reverse generative process that starts from noise and generates data by denoising one step at a time (Song et al., 2021b). Given the natural denoising property of diffusion models, empirical studies have leveraged them for adversarial purification (Nie et al., 2022; Wu et al., 2022; Carlini et al., 2022). For instance, Nie et al. (2022) proposed DiffPure, which employs diffusion models for adversarial purification. They empirically show that by carefully choosing the amount of Gaussian noise added during the diffusion process, adversarial perturbations can be removed while preserving the true label semantics. Despite these significant empirical results, there is no provable guarantee of the achieved robustness. A concurrent work (Carlini et al., 2022) instantiated the randomized smoothing approach with a diffusion model to offer a provable guarantee of model robustness against L2-norm bounded adversarial examples. However, it does not provide a theoretical understanding of why and how diffusion models contribute to such nontrivial certified robustness.

Our Approach. We are the first to theoretically analyze the fundamental properties of diffusion models to understand why and how they enhance certified robustness. This deeper understanding allows us to propose a new method, DensePure, to improve the certified robustness of any given classifier more effectively using diffusion models.
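To make the forward process described above concrete: in a DDPM-style diffusion model (Ho et al., 2020), the marginal of the noised sample has a closed form, x_t = sqrt(alpha_bar_t) x_0 + sqrt(1 - alpha_bar_t) eps with eps ~ N(0, I). The following is a minimal NumPy sketch under an illustrative linear beta schedule; the function names and schedule are our own, not part of any released code.

```python
import numpy as np

def make_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Illustrative linear variance schedule as in Ho et al. (2020);
    alpha_bar_t = prod_{s <= t} (1 - beta_s)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def forward_diffuse(x0, t, alpha_bars, rng):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(alpha_bar_t) x_0, (1 - alpha_bar_t) I)
    in a single step, using the closed-form marginal."""
    a = alpha_bars[t]
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(a) * x0 + np.sqrt(1.0 - a) * eps
```

As t grows, alpha_bar_t shrinks toward zero, so x_t approaches pure Gaussian noise; the reverse generative process then denoises from such a sample one step at a time.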
An illustration of the DensePure framework is provided in Figure 1; it consists of a pretrained diffusion model and a pretrained classifier. DensePure incorporates two steps: (i) using the reverse process of the diffusion model to obtain a sample of the posterior data distribution conditioned on the adversarial input; and (ii) repeating the reverse process multiple times with different random seeds to approximate the label of the high-density region in the conditional distribution via a simple majority-vote strategy. In particular, given an adversarial input, we repeatedly feed it into the reverse process of the diffusion model to get multiple reversed examples, and feed them into the classifier to calculate their labels. We then apply a majority vote on the set of labels to obtain the final predicted label. DensePure is inspired by our theoretical analysis, in which we show that the reverse process of the diffusion model provides a conditional distribution of the reversed sample given an adversarial input, and that sampling from this conditional distribution can enhance certified robustness. Specifically, we prove that a high data density around the clean sample is a sufficient condition for the conditional density of the reversed samples to also be high. Therefore, in DensePure, samples from the conditional distribution recover the ground-truth label with high probability. To make understanding and rigorous analysis convenient, we use the highest-density point in the conditional distribution as the deterministic reversed sample for the classifier prediction. We show that the robust region for a given sample under the diffusion model's reverse process is the union of multiple convex sets, each surrounding a region with the ground-truth label.
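The two-step procedure above can be sketched in a few lines. Here `denoise(x, seed)` stands in for one run of the diffusion model's reverse process from the adversarial input and `classify(x)` for the pretrained classifier; both are assumed callables supplied by the caller, not APIs from this paper.

```python
from collections import Counter

def densepure_predict(x_adv, denoise, classify, n_runs=10, base_seed=0):
    """Sketch of DensePure inference: run the reverse process n_runs times
    with different random seeds, classify each reversed sample, and return
    the majority-vote label."""
    labels = [classify(denoise(x_adv, base_seed + i)) for i in range(n_runs)]
    return Counter(labels).most_common(1)[0][0]  # most frequent label wins
```

Because each reverse run uses a fresh seed, the loop draws independent samples from the conditional distribution of the reversed sample, and the majority vote approximates the label of its high-density region.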
Compared with the robust region of previous work (Cohen et al., 2019), which focuses on only one region with the ground-truth label, such a union of multiple convex sets has the potential to provide a much larger robust region, resulting in higher certified robustness. Moreover, this characterization implies that the size of the robust region is affected by the relative density of, and the distance between, data regions with the ground-truth label and those with other labels. We conduct extensive experiments on the ImageNet and CIFAR-10 datasets under different settings to evaluate the certified robustness of DensePure. In particular, we follow the setting of Carlini et al. (2022) and rely on randomized smoothing to certify robustness against adversarial perturbations bounded in the L2-norm. We show that DensePure achieves a new state-of-the-art certified robustness with a standard pretrained model, without further tuning of any model parameters (e.g., smooth augmentation (Cohen et al., 2019)). On ImageNet, it achieves consistently higher certified accuracy than existing methods, a 7% improvement on average, for every σ at every radius ϵ.

Technical Contributions. In this paper, we take the first step toward understanding why and how diffusion models contribute to certified robustness. We make contributions on both the theoretical and empirical fronts: (1) in theory, we prove that an adversarial example can be recovered back to the original clean sample with high probability via the reverse process of a diffusion model; (2) in theory, we characterize the robust region for each point by further taking the highest-density point in the conditional



Figure 1: Pipeline of DensePure.
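For reference, the randomized-smoothing certificate used in the experiments (Cohen et al., 2019) yields an L2 radius R = (σ/2)(Φ⁻¹(p_A) − Φ⁻¹(p_B)), where p_A lower-bounds the probability of the top class under Gaussian noise N(0, σ²I) and p_B upper-bounds that of the runner-up class. A minimal stdlib sketch of this formula (the function name is ours):

```python
from statistics import NormalDist

def cohen_radius(p_a, p_b, sigma):
    """L2 certified radius of Cohen et al. (2019):
    R = (sigma / 2) * (Phi^{-1}(p_a) - Phi^{-1}(p_b)).
    Larger margins between the top two class probabilities, or larger
    smoothing noise sigma, yield a larger certified radius."""
    phi_inv = NormalDist().inv_cdf  # inverse standard normal CDF
    return 0.5 * sigma * (phi_inv(p_a) - phi_inv(p_b))
```

In practice p_A and p_B are estimated from many noisy samples with a confidence correction; this sketch only shows the radius computation itself, which is the quantity reported per σ and ϵ in the experiments.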

