EXPECTED PERTURBATION SCORE FOR ADVERSARIAL DETECTION

Abstract

Adversarial detection aims to determine whether a given sample is adversarial based on the discrepancy between the natural and adversarial distributions. Unfortunately, estimating or comparing two data distributions is extremely difficult, especially in high-dimensional spaces. Recently, the gradient of the log probability density (a.k.a., the score) w.r.t. the sample has been used as an alternative statistic. However, we find that the score is unreliable for identifying adversarial samples because a single sample carries insufficient information. In this paper, we propose a new statistic called the expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations. Specifically, to obtain adequate information about one sample, we perturb it by adding various noises to capture its multi-view observations. We theoretically prove that EPS is a proper statistic for computing the discrepancy between two samples under mild conditions. In practice, we can use a pre-trained diffusion model to estimate the EPS of each sample. Finally, we propose an EPS-based adversarial detection (EPS-AD) method, in which we develop an EPS-based maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples. To verify the validity of the proposed method, we also prove that the EPS-based MMD between natural and adversarial samples is larger than that among natural samples. Empirical studies on CIFAR-10 and ImageNet across different network architectures, including ResNet, WideResNet, and ViT, show the superior adversarial detection performance of EPS-AD compared to existing methods.

1. INTRODUCTION

Deep neural networks (DNNs) are known to be sensitive to adversarial samples, which are generated by adding imperceptible perturbations to the input but may mislead the model into making unexpected predictions (Szegedy et al., 2014; Goodfellow et al., 2015). Adversarial samples threaten widespread machine learning systems (Li & Vorobeychik, 2014; Ozbulak et al., 2019), raising an urgent need for advanced techniques to improve model robustness. Among such techniques, adversarial training introduces adversarial data into training to improve robustness but suffers from significant performance degradation and high computational complexity (Sriramanan et al., 2021; Laidlaw et al., 2021; Wong et al., 2020); adversarial purification relies on generative models to purify adversarial data before classification, but still has to compromise on unsatisfactory natural and adversarial accuracy (Shi et al., 2021; Yoon et al., 2021; Nie et al., 2022). In contrast, another class of defense methods, called adversarial detection, detects and rejects adversarial examples. Such methods are friendly to existing machine learning systems due to their lossless natural accuracy, and can help identify security-compromised input sources (Abusnaina et al., 2021). Adversarial detection aims to tell whether a test sample is adversarial, for which the key is to find the discrepancy between the adversarial and natural distributions. However, existing adversarial detection approaches primarily train a tailored detector for specific attacks (Feinman et al., 2017; Ma et al., 2018; Lee et al., 2018) or for a specific classifier (Deng et al., 2021), largely overlooking the modeling of the adversarial and natural distributions, which results in limited performance against unseen or transferable attacks. Unfortunately, it is non-trivial to estimate or compare two data distributions, especially in a high-dimensional space (e.g., an image-based space).
One alternative approach is to estimate the gradient of the log probability density with respect to the sample, i.e., the score. This statistic has emerged as a powerful means for adversarial defense (Yoon et al., 2021; Nie et al., 2022) and diffusion models (Song & Ermon, 2019; Song et al., 2021; Kingma et al., 2021; Huang et al., 2021). However, how to effectively exploit the score function for adversarial detection has not been well studied. Recently, Yoon et al. (2021) purify adversarial samples by gradually removing adversarial noise from the (attacked) samples with the score function. During the purification process, they employ the norm of scores (between being-purified adversarial samples and natural samples) to set a threshold for determining at which timestep to stop purifying. They empirically find that natural samples usually have lower score norms than adversarial samples across purification timesteps. Intuitively, the score can be interpreted as the momentum of the sample towards the high-density areas of natural data (Song & Ermon, 2019). From this point of view, a lower score norm indicates that the sample is closer to the high-density areas of natural data, i.e., a higher probability that the sample follows the natural distribution. To further understand this, we examine the score norms of natural samples and adversarial samples at different purification timesteps. According to Figure 1, most natural samples have lower score norms than adversarial samples at the same timestep, but the norms are very sensitive to the timestep, with significant overlap across all timesteps. This suggests that the score of one sample is useful but not effective enough for identifying adversarial samples. In this paper, we propose a new statistic called the expected perturbation score (EPS), which represents the expected score of a given sample after multiple perturbations.
In EPS, we consider multiple levels of noise perturbations to diversify one sample, allowing us to capture multi-view observations of the sample and thus extract adequate information from the data. Our theoretical analysis shows that EPS is a valid statistic for distinguishing between natural and adversarial samples under mild conditions. Thus, we propose an EPS-based adversarial detection method (EPS-AD), as illustrated in Figure 2. Specifically, given a pre-trained score-based diffusion model, EPS-AD consists of three steps: 1) we simultaneously add multiple perturbations to a set of natural images and an upcoming image following the forward diffusion process up to a time step T*; 2) we obtain their EPSs via the score model; 3) we compute the maximum mean discrepancy (MMD) between any test sample and the natural samples based on EPS. We provide both empirical and theoretical analyses to demonstrate the effectiveness of EPS-AD. Empirically, we achieve superior performance on both CIFAR-10 and ImageNet across many network architectures, including ResNet, WideResNet, and ViT, compared to existing methods. Especially for the extremely low attack intensity ϵ = 1/255 on ImageNet, we achieve an area under the receiver operating characteristic curve (AUROC) above 95% against 12 attacks over ResNet-50. Theoretically, we prove that the MMD between EPSs of natural samples is smaller than that between natural and adversarial samples. We summarize our main contributions as follows:

• We propose a new and reliable statistic called the expected perturbation score (EPS) to capture sufficient information about a sample from its multi-view observations obtained by adding various perturbations. We theoretically prove that EPS is a proper statistic for computing the discrepancy between two distributions under mild conditions.
• Relying on the proposed EPS, we exploit the maximum mean discrepancy (MMD) as a metric to measure the discrepancy between the test sample and natural samples and then develop a novel adversarial detection method called EPS-AD. We theoretically show that the EPS-based MMD between natural and adversarial samples is larger than that among natural samples, which verifies the validity of the proposed adversarial detection method.

2. PRELIMINARIES

We start by recalling the setting of score-based continuous-time diffusion models, and then present the concept of maximum mean discrepancy.

Adversarial data generation. Given a well-trained classifier f on a data set D = {(x_i, l_i)}_{i=1}^n, with x_i being a sample from the input space X and l_i its ground-truth label from a label set C = {1, ..., C}, the adversarial data x̂ regarding x with perturbation budget ϵ is given by

x̂ = argmax_{x′ ∈ B(x, ϵ)} ℓ(f(x′), l),

where B(x, ϵ) = {x′ ∈ X | d(x, x′) ≤ ϵ}, d is some distance (e.g., the ℓ2 or ℓ∞ distance), and ℓ is some loss function. For simplicity, we denote x̂ = x + ϵ as the adversarial data regarding x.

Continuous-time diffusion models. Following Song et al. (2021), let p(x) be the unknown data distribution. Diffusion models first construct a forward diffusion process {x_t}_{t=0}^{T_diff}, indexed by a continuous time variable t ∈ [0, T_diff], which can be modeled by a stochastic differential equation (SDE) with positive time increments:

dx = f(x, t)dt + g(t)dw,    (2)

where x_0 := x ∼ p(x), f(·, t): R^d → R^d is a vector-valued function, g(·): R → R is a scalar function independent of x, and w is a standard Wiener process. Let p_t(x) be the marginal distribution of x_t with p_0(x) = p(x). If f(x, t) and g(t) are designed such that p_{T_diff}(x) ≈ N(0, I_d), then by reversing the diffusion process from t = T_diff to t = 0, we can reconstruct samples x_0 ∼ p_0(x). The reverse process is given by the reverse-time SDE:

dx = [f(x, t) − g(t)² ∇_x log p_t(x)] dt + g(t)dw̄,    (3)

where w̄ is a standard reverse-time Wiener process and dt is an infinitesimal negative time step. Throughout the paper, we consider the VP-SDE following the setting of Song et al. (2021), where f(x, t) := −(1/2)β(t)x and g(t) := √β(t), with β(t) being the linear noise schedule, i.e., β(t) := β_min + (β_max − β_min) t / T_diff for t ∈ [0, T_diff].
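To make the VP-SDE forward process concrete, the sketch below computes the mean scale γ_t and standard deviation σ_t of the Gaussian transition kernel p_0t(x_t | x_0) induced by the linear schedule, and draws a perturbed sample. The values β_min = 0.1 and β_max = 20 are the common VP-SDE defaults and are an assumption here, not taken from the text.

```python
import numpy as np

# Assumed VP-SDE defaults; the paper only states that beta(t) is linear.
BETA_MIN, BETA_MAX, T_DIFF = 0.1, 20.0, 1.0

def beta(t):
    """Linear noise schedule beta(t) = beta_min + (beta_max - beta_min) t / T_diff."""
    return BETA_MIN + (BETA_MAX - BETA_MIN) * t / T_DIFF

def kernel_coeffs(t):
    """Return (gamma_t, sigma_t) of the transition kernel N(gamma_t x_0, sigma_t^2 I)."""
    # closed-form integral of beta(s) over [0, t] for the linear schedule
    integral = BETA_MIN * t + 0.5 * (BETA_MAX - BETA_MIN) * t ** 2 / T_DIFF
    gamma_t = np.exp(-0.5 * integral)
    sigma_t = np.sqrt(1.0 - np.exp(-integral))
    return gamma_t, sigma_t

def perturb(x0, t, rng):
    """Sample x_t ~ p_0t(x_t | x_0) = N(gamma_t x_0, sigma_t^2 I)."""
    gamma_t, sigma_t = kernel_coeffs(t)
    return gamma_t * x0 + sigma_t * rng.standard_normal(x0.shape)
```

Note that γ_t² + σ_t² = 1 for every t, which is the variance-preserving property that gives the VP-SDE its name.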
Reconstructing samples from the Gaussian distribution requires the score of the marginal distribution, i.e., ∇_x log p_t(x), in the reverse process Eq. (3). To estimate the score function ∇_x log p_t(x), one effective solution is to train a score model s_θ(x, t) on samples with score matching (Hyvärinen & Dayan, 2005; Song & Ermon, 2019; Vincent, 2011). The training objective is:

θ* = argmin_θ E_t { λ(t) E_{x_0∼p_0(x_0)} E_{x_t∼p_0t(x_t|x_0)} [ ‖s_θ(x_t, t) − ∇_{x_t} log p_0t(x_t | x_0)‖²₂ ] },    (4)

where λ(t) is a positive weighting function, and p_0t(x_t | x_0) is the transition kernel from x_0 to x_t, which can be derived from the forward SDE in Eq. (2).

Maximum mean discrepancy. Following Gretton et al. (2012) and Borgwardt et al. (2006), let X ⊂ R^d be a separable metric space and p, q be Borel probability measures on X. Given two sets of independent identically distributed (IID) observations S_X = {x^(i)}_{i=1}^n and S_Y = {y^(i)}_{i=1}^m from distributions p and q, respectively, the maximum mean discrepancy (MMD) measures the closeness between these two distributions and is defined as:

MMD(p, q; F) := sup_{f∈F} |E[f(X)] − E[f(Y)]|,    (5)

where F is a class of functions f: X → R, and X ∼ p, Y ∼ q are two random variables. To better control the richness of the MMD function class F, Borgwardt et al. (2006) propose to choose F as the unit ball in a universal reproducing kernel Hilbert space, yielding the kernel-based MMD:

MMD(p, q; H_k) := sup_{f∈H_k, ‖f‖_{H_k}≤1} |E[f(X)] − E[f(Y)]| = ‖μ_p − μ_q‖_{H_k} = √( E[k(X, X′) + k(Y, Y′) − 2k(X, Y)] ),    (6)

where k: X × X → R is the kernel of a reproducing kernel Hilbert space H_k, and μ_p := E[k(·, X)] and μ_q := E[k(·, Y)] are the kernel mean embeddings of p and q, respectively (Gretton et al., 2012; Borgwardt et al., 2006; Jitkrittum et al., 2017; Liu et al., 2020; Gao et al., 2021).
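The kernel-based MMD above admits a simple plug-in estimator: replace the expectations by empirical means over the two observation sets. A minimal numpy sketch of the (biased) squared-MMD estimator with a Gaussian kernel:

```python
import numpy as np

def gaussian_kernel(X, Y, sigma=1.0):
    """Pairwise Gaussian kernel k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def mmd2_biased(X, Y, sigma=1.0):
    """Biased estimate of MMD^2 = E[k(X,X')] + E[k(Y,Y')] - 2 E[k(X,Y)]."""
    return (gaussian_kernel(X, X, sigma).mean()
            + gaussian_kernel(Y, Y, sigma).mean()
            - 2.0 * gaussian_kernel(X, Y, sigma).mean())
```

The estimate is (numerically) zero when the two samples coincide and grows as the samples drift apart, which is the behavior EPS-AD relies on later.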

3. PROPOSED METHOD

In this section, we first define the adversarial detection problem in Section 3.1. Then we present the definition of the expected perturbation score (EPS) and provide theoretical analysis showing that EPS distinguishes well between natural and adversarial data when natural data follow a Gaussian distribution (Section 3.2). Motivated by this, we develop a new adversarial detection method, called EPS-AD, as shown in Figure 2. Particularly, we estimate the EPS of each sample using a pre-trained score-based diffusion model and then adopt the EPS-based maximum mean discrepancy (MMD) between the test sample and natural samples as a characteristic of the test sample (Section 3.3). Moreover, we provide theoretical analysis showing that the MMD between EPSs of natural samples is smaller than that between natural and adversarial samples (Section 3.4).

3.1. PROBLEM SETTING

In this paper, we aim to address the following adversarial detection problem.

Problem 1. (Adversarial detection) Let X ⊂ R^d be a separable metric space, p be a Borel probability measure on X, S_X = {x^(i)}_{i=1}^n be IID observations from the distribution p, and f(·): R^d → C be a ground-truth labeling mapping with C = {1, . . . , C} being a label set. Assuming that the attacker has access to some well-trained classifier f on S_X and to IID observations S′_X from the distribution p, we wish to know whether each new sample in S_Y = {y^(i)}_{i=1}^m, crafted from S′_X, follows the distribution p.

Note that the definition of adversarial detection in Problem 1 differs from that of the two-sample test (Grosse et al., 2017). Particularly, Problem 1 aims to determine whether each example in S_Y is sampled from the distribution p, while the two-sample test asks whether two populations S_X and S_Y come from the same distribution, focusing on the closeness between the two populations.

3.2. EXPECTED PERTURBATION SCORE

As aforementioned, the score of one sample, e.g., ∇_x log p(x), is not effective enough for identifying adversarial samples. To capture more information from one sample, we propose a new statistic, the expected perturbation score (EPS), formulated as below.

Definition 1. (Expected perturbation score) Let X ⊂ R^d be a separable metric space and p be a Borel probability measure on X. Given a perturbation process with transition distribution p_0t(x_t | x_0), the expected perturbation score (EPS) of a sample x is given by:

S(x) = E_{t∼U(0,T)} [ ∇_{x_t} log p_t(x_t) ],    (7)

where p_t(x) is the marginal probability distribution of x_t, x_0 := x is the initial sample at t = 0, and T is the last perturbation time step.

Note that the perturbation process transition distribution p_0t(x_t | x_0) can be any distribution, such as the commonly used Gaussian or uniform distribution. Moreover, we consider multiple levels of noise in the definition of EPS, with the aim of diversifying the single sample, which enables us to capture multi-view observations and thus extract more information from the data. Built upon Definition 1, we derive the following theorem to take a closer look at S(x) for natural and adversarial data when the perturbation transition distribution p_0t(x_t | x_0) and p(x) are Gaussian.

Theorem 1.
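The expectation over t in Definition 1 can be approximated by Monte Carlo: sample timesteps uniformly, perturb the input through the transition kernel, evaluate the score, and average. A minimal sketch, assuming a toy setting where the marginal score is known in closed form; in practice `score_fn` would be a trained score model s_θ. The constant schedule β(t) = 1 and the N(0, I) toy data are illustration-only assumptions.

```python
import numpy as np

def expected_perturbation_score(x0, score_fn, perturb_fn, T, n_steps, rng):
    """Monte-Carlo estimate of S(x) = E_{t~U(0,T)}[score of x_t at time t].

    score_fn(x_t, t) stands in for the marginal score (in practice a trained
    score model); perturb_fn(x0, t, rng) samples x_t ~ p_0t(. | x_0).
    """
    ts = rng.uniform(0.0, T, size=n_steps)
    return np.mean([score_fn(perturb_fn(x0, t, rng), t) for t in ts], axis=0)

# Toy setting (assumed): natural data N(0, I) and constant beta(t) = 1, so
# gamma_t = e^{-t/2} and sigma_t^2 = 1 - e^{-t}. Every marginal p_t is then
# again N(0, I), whose score is simply -x_t.
def toy_perturb(x0, t, rng):
    g, s = np.exp(-0.5 * t), np.sqrt(1.0 - np.exp(-t))
    return g * x0 + s * rng.standard_normal(x0.shape)

def toy_score(xt, t):
    return -xt
```

For a natural sample at the data mean, the estimate concentrates around zero as the number of sampled timesteps grows, consistent with conclusion 1 of Theorem 1 below.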
Assuming that the distribution of natural data is p(x) = N(μ_x, σ_x²I), and given a perturbation transition kernel p_0t(x_t | x_0) = N(x_t; γ_t x_0, σ_t²I) with γ_t and σ_t being the time-dependent noise schedule, the following three conclusions hold for S(x):
1) For ∀ x ∼ p(x), S(x) ∼ N(0, σ_S²I);
2) For ∀ y ∼ p(x) and adversarial sample ŷ = y + ϵ with perturbation ϵ, S(ŷ) ∼ N(−μ_S, σ_S²I);
3) For ∀ x, y ∼ p(x) and adversarial sample ŷ = y + ϵ, we have

S(x) − S(y) →_d N(0, 2σ_S²I),    (8)
S(x) − S(ŷ) →_d N(μ_S, 2σ_S²I),    (9)

where μ_S = E_{t∼U(0,T)}[μ_t] with μ_t = ϵ / (γ_t²σ_x² + σ_t²), and σ_S² = E_{t∼U(0,T)}[σ̃_t²] with σ̃_t² = 1 / (γ_t²σ_x² + σ_t²).

Theorem 1 tells us: 1) the first two conclusions show that the mean of the EPS of an adversarial sample differs from that of a natural sample by the additional term μ_S; 2) the third conclusion indicates that the EPS of a natural sample is closer to that of other natural samples than to that of adversarial samples, and this discrepancy becomes more pronounced when γ_t and σ_t are small. These findings motivate us to employ S(x) for adversarial detection.

Why multiple scores? Note that μ_t and σ̃_t² decrease as the timestep t increases, due to the growth of the denominator γ_t²σ_x² + σ_t². However, a smaller variance σ_S² and a larger mean μ_S are both required for good adversarial detection. If we only consider the score at one particular timestep t (i.e., removing the expectation from the definition of EPS), the variance σ_S² and mean μ_S of the discrepancy fluctuate so much that detecting adversarial samples becomes very sensitive to the timestep t (as validated in Section 4.4). To alleviate this issue, we take the expectation over the timestep, i.e., average multiple scores. In this way, the distribution of the discrepancy between natural and adversarial samples becomes more stable with respect to the timestep, which makes it easier to obtain a superior solution.
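The Gaussian statements in Theorem 1 can be checked numerically. The sketch below simulates perturbation scores in one dimension under an assumed constant schedule β(t) = 1 (so γ_t = e^{-t/2}, σ_t² = 1 − e^{-t}; the specific values of μ_x, σ_x, and ϵ are also illustration-only), and compares empirical means and variances with the closed-form μ_S and σ_S². Following the paper's proof of conclusion 2, the adversarial offset enters the perturbed sample additively.

```python
import numpy as np

rng = np.random.default_rng(0)
mu_x, sigma_x, eps, T, n = 1.0, 0.5, 0.3, 1.0, 400_000  # assumed toy values

t = rng.uniform(0.0, T, n)
gamma = np.exp(-0.5 * t)
sig2 = 1.0 - np.exp(-t)
var_t = gamma ** 2 * sigma_x ** 2 + sig2      # variance of x_t for natural data

def perturbation_score(x0, shift=0.0):
    """Marginal score -(x_t - gamma_t mu_x) / var_t at a perturbed sample."""
    xt = gamma * x0 + np.sqrt(sig2) * rng.standard_normal(n) + shift
    return -(xt - gamma * mu_x) / var_t

x0 = mu_x + sigma_x * rng.standard_normal(n)
s_nat = perturbation_score(x0)                 # conclusion 1: mean 0
s_adv = perturbation_score(x0, shift=eps)      # conclusion 2: mean -mu_S

mu_S = np.mean(eps / var_t)                    # theoretical mean shift
sigma2_S = np.mean(1.0 / var_t)                # theoretical variance
```

The empirical mean of `s_nat` is near zero, the mean of `s_adv` is near −μ_S, and the variance of `s_nat` matches σ_S², mirroring conclusions 1 and 2.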

3.3. ADVERSARIAL DETECTION WITH EXPECTED PERTURBATION SCORE

Estimation of the expected perturbation score. Note that EPS (i.e., S(x)) in Eq. (7) requires knowing the score function ∇_x log p_t(x), which can be estimated by training a neural network with score matching (Hyvärinen & Dayan, 2005; Song & Ermon, 2019; Vincent, 2011; Kingma et al., 2021). To this end, we model the perturbation process transition as a Gaussian distribution p_0t(x_t | x_0) = N(x_t; γ_t x_0, σ_t²I), where γ_t = e^{−(1/2)∫₀ᵗ β(s)ds} and σ_t² = 1 − e^{−∫₀ᵗ β(s)ds}, with β(t) for t ∈ [0, 1000] being the time-dependent noise schedule. By optimizing Eq. (4), with sufficient data and model capacity, score matching ensures that the optimal solution to Eq. (4) equals ∇_x log p_t(x) for almost all x and t (Song et al., 2021). As a result, the score ∇_x log p_t(x) can be approximated by s_θ(x_t, t). In practice, we use a pre-trained diffusion model for this estimation. MMD for characterizing the expected perturbation score. Using the norm of the estimated EPS S(x) as a characterization for adversarial detection is straightforward. Nevertheless, the norm of S(x) only describes the magnitude of the EPS vector and ignores the rich information carried by its direction. To derive more useful information from S(x), it is critical to design a distance metric between the EPS of an upcoming sample and the EPSs of natural samples. Benefiting from the superior performance of maximum mean discrepancy (MMD) in measuring the distance between two distributions (Long et al., 2015; Zhu et al., 2019; Gao et al., 2021), we resort to it for characterizing EPS. The basic idea of MMD is that two distributions are identical if and only if all their moments are identical, and when they are not, the moment that induces the largest distance between the two distributions serves as the measure of their discrepancy (Smola et al., 2007; Gong et al., 2022).
Formally, we define the empirical distribution of the natural samples {x^(i)}_{i=1}^n as P_X = (1/n) Σ_{i=1}^n δ_{x^(i)} and that of a test sample y as Q_Y = δ_y, where δ is the Dirac delta function (Alt, 2006). Then we estimate the distance between these two distributions as

MMD_b²[P_X, Q_Y; H_k] = (1/n²) Σ_{i,j=1}^n k(x^(i), x^(j)) − (2/n) Σ_{i=1}^n k(x^(i), y) + k(y, y),    (10)

where k(x, y) = κ(S(x), S(y)), with κ being the kernel of a reproducing kernel Hilbert space H_k, such as the Gaussian kernel κ(x, y) = exp(−‖x − y‖² / (2σ²)). In practice, we adopt the deep-kernel MMD following Liu et al. (2020).
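The detection statistic of Eq. (10) specializes the MMD estimator to one test point: an n-vs-1 comparison in EPS space. A minimal sketch with a plain Gaussian kernel on EPS features (the paper uses a deep kernel; this simpler choice is an assumption for illustration):

```python
import numpy as np

def kappa(a, b, sigma=1.0):
    """Gaussian kernel on EPS features: exp(-||a - b||^2 / (2 sigma^2))."""
    return np.exp(-np.sum((a - b) ** 2, axis=-1) / (2.0 * sigma ** 2))

def eps_ad_statistic(S_nat, S_test, sigma=1.0):
    """Biased MMD^2 between n natural EPSs and a single test EPS (Eq. (10)).

    S_nat: (n, d) array of natural-sample EPSs; S_test: (d,) test EPS.
    Larger values suggest the test sample is adversarial.
    """
    term_xx = kappa(S_nat[:, None, :], S_nat[None, :, :], sigma).mean()
    term_xy = kappa(S_nat, S_test[None, :], sigma).mean()
    term_yy = 1.0  # kappa(y, y) = exp(0) = 1
    return term_xx - 2.0 * term_xy + term_yy
```

Because Eq. (10) is a squared RKHS norm, the statistic is non-negative, and it grows as the test EPS moves away from the cloud of natural EPSs.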

3.4. THEORETICAL ANALYSIS FOR EPS-AD

Note that after performing the same perturbation process, the first and third terms in Eq. (10) are identical for every upcoming sample, so we focus on the cross-term J = (2/n) Σ_{i=1}^n k(x^(i), y). Next, we analyze this term for an adversarial sample.

Corollary 1. Consider the Gaussian kernel κ(x, y) = exp(−‖x − y‖² / (2σ²)) and the assumptions in Theorem 1. For ∀ 0 < η < 1, the probability P{k(x, ŷ) > η} is given by

P{k(x, ŷ) > η} = ∫₀^C χ_d²(x; ‖μ_S‖²) dx,

where μ_S denotes the mean of S(x) − S(ŷ), C is a constant determined by η and σ, and χ_d²(·; ‖μ_S‖²) is the probability density function of the noncentral chi-squared distribution with d degrees of freedom and noncentrality parameter ‖μ_S‖² (Abdel-Aty, 1954).

Corollary 1 indicates that the cross-term J will be larger when ‖μ_S‖ is close to zero for a given η. Combining Eq. (8) and Eq. (9), we conclude that natural data have a larger J than adversarial data with higher probability, due to the additional term E_t[ϵ / (γ_t²σ_x² + σ_t²)], suggesting that the MMD between EPSs of natural samples is smaller than that between natural and adversarial samples.
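The event in Corollary 1 is easy to check numerically: with a Gaussian kernel, k(x, ŷ) > η is equivalent to ‖S(x) − S(ŷ)‖² < 2σ² log(1/η), a noncentral chi-squared event under Eq. (9). The sketch below estimates this probability by Monte Carlo (the specific η, σ_S², and μ_S values in the usage are illustration-only assumptions):

```python
import numpy as np

def prob_kernel_above(eta, mu_S, sigma2_S, sigma_k=1.0, n=200_000, seed=0):
    """Monte-Carlo estimate of P{k(x, yhat) > eta} under Eq. (9), i.e.
    S(x) - S(yhat) ~ N(mu_S, 2 sigma_S^2 I).

    With a Gaussian kernel, k > eta  <=>  ||S(x) - S(yhat)||^2 is below
    2 sigma_k^2 log(1/eta), a noncentral chi-squared event.
    """
    rng = np.random.default_rng(seed)
    d = len(mu_S)
    delta = mu_S + np.sqrt(2.0 * sigma2_S) * rng.standard_normal((n, d))
    threshold = 2.0 * sigma_k ** 2 * np.log(1.0 / eta)
    return np.mean(np.sum(delta ** 2, axis=1) < threshold)
```

The probability shrinks as ‖μ_S‖ grows, matching the corollary's conclusion that natural samples (μ_S = 0, per Eq. (8)) yield a larger cross-term J.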

4.1. EXPERIMENTAL SETTINGS

Datasets and network architectures. We evaluate our method on CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009). We implement three widely used architectures as classifiers: WideResNet (Zagoruyko & Komodakis, 2016; Gowal et al., 2021) for CIFAR-10, and ResNet (He et al., 2016) and ViT (Dosovitskiy et al., 2021) for ImageNet. For diffusion models, we use the pre-trained models of Score SDE (Song et al., 2021) for CIFAR-10 and Guided Diffusion (Dhariwal & Nichol, 2021) for ImageNet. Attack methods. Following Deng et al. (2021), we evaluate our adversarial detection method under various attack methods. We consider the commonly used ℓ2 and ℓ∞ threat models, including PGD (Madry et al., 2018), FGSM (Goodfellow et al., 2015), BIM (Kurakin et al., 2018), MIM (Dong et al., 2018), TIM (Dong et al., 2019), CW (Carlini & Wagner, 2017), and DI-MIM (Xie et al., 2019). Moreover, we apply two adaptive attack methods, AutoAttack (AA) (Croce & Hein, 2020) and the Minimum-Margin attack (MM) (Gao et al., 2022). To show the superiority of our method, we consider relatively low attack intensities, i.e., ℓ2-ball and ℓ∞-ball constraints with ϵ = 4/255, and iterative attacks run for 5 steps with step size ϵ/5, unless stated otherwise. Baselines. We compare our method with several state-of-the-art adversarial detection methods, including kernel density (KD) (Feinman et al., 2017), local intrinsic dimensionality (LID) (Ma et al., 2018), Mahalanobis distance (MD) (Lee et al., 2018), and LiBRe (Deng et al., 2021). Besides, we construct two new adversarial detection baselines based on diffusion models: 1) S-N, which uses the score norm of raw images, i.e., ‖s_θ(x, t)‖₂; and 2) EPS-N, which uses the norm of the EPS, i.e., ‖S(x)‖₂. In contrast, our proposed EPS-AD further computes the maximum mean discrepancy of EPSs. Evaluation metric.
We evaluate the performance of adversarial detection approaches with the area under the receiver operating characteristic curve (AUROC), a widely used statistic for assessing the discriminatory capacity of distribution models (Jiménez-Valverde, 2012). Considering the computational cost of applying 12 attacks to the classifier, especially for ImageNet, following Gao et al. (2021) we randomly select two disjoint subsets as adversarial and natural samples (each containing 500 samples) and compute the AUROC over these two subsets. Notably, our method is applicable to different data set sizes, as verified in the Appendix. Moreover, we set T* = 20 in S(x) on CIFAR-10 and T* = 50 on ImageNet for both our EPS-AD and EPS-N, and set t* = 5 in s_θ(x, t) on CIFAR-10 and t* = 20 on ImageNet for S-N.
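AUROC as used here has a simple rank interpretation: the probability that a randomly chosen adversarial sample receives a higher detection score than a randomly chosen natural sample, counting ties as 1/2. A minimal sketch of that estimator:

```python
import numpy as np

def auroc(scores_adv, scores_nat):
    """AUROC of a detector whose score should be larger on adversarial
    samples: the probability that a random adversarial sample outscores a
    random natural one, with ties counted as 1/2."""
    adv = np.asarray(scores_adv, dtype=float)[:, None]
    nat = np.asarray(scores_nat, dtype=float)[None, :]
    return (adv > nat).mean() + 0.5 * (adv == nat).mean()
```

Perfect separation gives 1.0, a reversed detector gives 0.0, and a chance-level detector gives 0.5.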

4.2. DETECTING ON KNOWN ATTACKS

We start by comparing EPS-AD with state-of-the-art adversarial detection methods that are trained with seen adversarial examples, against ℓ2 and ℓ∞ threat models. Since adversarial detection becomes more challenging when the attack intensity is low, we also evaluate across a range of attack intensities. We demonstrate other attacks and more results on WideResNet-70-16 (Gowal et al., 2021) in the Appendix. Clearly, our EPS-AD achieves much higher AUROC than other methods. Critically, we observe that EPS-AD preserves almost non-degraded AUROC once the attack intensity ϵ surpasses 2/255 for both ℓ2 and ℓ∞ attacks, which shows its stability when detecting challenging adversarial samples. In addition, we report quantitative results for adversarial detection under the attack intensity ϵ = 4/255 in Table 1. The results show that, by employing EPS and measuring the MMD of EPSs, EPS-AD consistently outperforms existing methods against various attacks in terms of AUROC. We also see that, by simply applying the norm of EPS, EPS-N already achieves superior adversarial detection performance, which demonstrates the effectiveness of EPS. Results on ImageNet. We report the adversarial detection performance against ℓ2 and ℓ∞ attacks on ImageNet over ResNet-50 (He et al., 2016) in Table 1 and Figure 4. We defer the results on a widely used ViT architecture, DeiT-S (Dosovitskiy et al., 2021), to the Appendix. We observe that our approach consistently outperforms the baselines under various attacks, especially for detecting PGD, BIM, and FGSM-ℓ2 attacks. These results reveal that EPS-AD is effective even on a large-scale data set. Moreover, EPS-N exhibits poor results compared to EPS-AD when detecting ℓ2 attacks (e.g., FGSM-ℓ2 and BIM-ℓ2), since the norm of EPS ignores the rich information contained in the EPS vector and is thus not effective enough on a large-scale data set. This degradation is more pronounced for S-N.

4.3. DETECTING ON UNSEEN ATTACKS AND TRANSFERABLE ATTACKS

Given the poor performance of adversarial detection baselines against unseen attacks and transferable attacks, we evaluate our method in these settings. Detecting on unseen attacks. To detect unseen attacks, we train the KD, LID, and MD detectors on CIFAR-10 using only FGSM and FGSM-ℓ2 adversarial examples and evaluate their performance under other attacks. Comparing Tables 2 and 1, we find that the detection performance of MD and LID worsens. An explanation is that their detectors are trained via logistic regression on vectorized features extracted from the seen samples, resulting in limited generalization to unseen attacks. In contrast, the diffusion-based detection methods show superior performance since they directly model the distribution of natural data. Detecting on transferable attacks. To validate transferability, we train the KD, LID, MD, and LiBRe detectors with ResNet-50 but detect adversarial examples crafted on a surrogate ResNet-101 model. Comparing Tables 3 and 1, the non-diffusion-based methods (e.g., KD, LID, MD, and LiBRe) drop significantly against transferable attacks. By contrast, our EPS-AD achieves significantly better transferability, since it does not rely on a specific classifier but rather models the distribution of natural data, indicating its versatility in various attack scenarios.

4.4. ABLATION STUDY ON IMPACT OF TIMESTEP

We conduct experiments on ImageNet over ResNet-50 to show the effect of the timestep. To this end, we set the total timestep T = 100, which is sufficient for both EPS-AD and EPS-N to achieve a good solution. As shown in Figure 4 (f), we draw two observations: 1) our EPS-AD and EPS-N are insensitive to the total timestep T, while S-N fluctuates greatly with the timestep t; 2) as the total timestep T increases, EPS-AD and EPS-N exhibit progressively better performance; however, this gain gradually diminishes once T exceeds the optimal value. This is because, for larger diffusion times, the mean μ_S in Eq. (9) gradually approaches zero, resulting in a smaller discrepancy between the natural and adversarial distributions. Nie et al. (2022) also confirm this phenomenon (their Theorem 1).

5. CONCLUSION

In 

APPENDIX OF EPS-AD

In the appendix, we provide detailed proofs of the theorem and corollary, descriptions of related works, and more details and experimental results of the proposed EPS-AD. We organize the supplementary material into the following sections. In Section A, we describe related works on adversarial detection. In Section B, we derive the proofs of the theorem and corollary. In Section C, we present the pseudo-code of EPS-AD. In Section D, we present the detailed implementation of our experiments. In Section E, we show the impact of EPS in our method. In Section F, we study the impact of adding perturbations to samples in our method. In Section G, we report more comparison results, more ablation studies, and more applications of the proposed EPS-AD.

A RELATED WORK

Diffusion models. Diffusion models have emerged as powerful generative models in many synthesis tasks (Song & Ermon, 2019; Dhariwal & Nichol, 2021; Saharia et al., 2022; Rombach et al., 2022; Ramesh et al., 2022). Many researchers have since exploited diffusion models for adversarial purification to improve model robustness (Nie et al., 2022; Yoon et al., 2021), where the score becomes a powerful means. Yet only a few apply them to adversarial detection. Recently, Yoon et al. (2021) employ the score model to purify adversarial examples and use the norm of scores as the stopping condition of the purification. They also demonstrate some results on using the score norm to distinguish adversarial samples from natural ones. However, this criterion is not effective enough for adversarial detection. In our work, we comprehensively consider multiple scores of perturbed samples, where these perturbed samples originate from the same sample, and exploit the rich information in these scores for adversarial detection. We empirically compare our approach to these methods and find that it outperforms them by a large margin. Adversarial attack. Numerous studies attack neural networks by slightly modifying the input data to trigger misclassifications. We enumerate a series of such works in what follows. The Fast Gradient Sign Method (FGSM) (Goodfellow et al., 2015) simply adds small noise along the gradient of the loss function. To further adjust the direction of the increment, Kurakin et al. (2018) propose the Basic Iterative Method (BIM), which extends FGSM to iteratively take multiple steps. Madry et al. (2018) propose Projected Gradient Descent (PGD) by combining the iterative method with random initialization of the adversarial example. Meanwhile, Dong et al. (2018) propose the Momentum Iterative Method (MIM) by adding a momentum term to BIM to achieve a more stable attack. Besides, Dong et al.
(2019) propose the Translation-Invariant Method (TIM), which optimizes the perturbation over translated images to obtain more transferable attacks and can be incorporated into FGSM and BIM. To further improve attack transferability, Wang & He (2021) propose variance tuning to guide the gradient update, namely VMI-FGSM; Zhang et al. (2022) conduct feature-level attacks with more accurate neuron importance estimations, called the Neuron Attribution-based (NAA) attack. Besides the non-targeted attack methods mentioned above, there are also many methods that perturb data toward one target label. For example, Carlini & Wagner (2017) perform an attack by incorporating the iterative mechanism of BIM, called CW. There are also approaches that combine multiple attacks to perturb data, such as AutoAttack (Croce & Hein, 2020) and the Minimum-Margin (MM) attack (Gao et al., 2022), a faster version of AutoAttack. These various attacks may cause serious consequences in security-critical tasks, raising an urgent requirement for advanced techniques to achieve robust models. Adversarial detection. To ensure the safety of machine learning systems, adversarial detection has attracted increasing attention. The most common idea is to filter out adversarial samples from test data using a trained binary classifier. Recently, statistics on the hidden-layer features of DNNs have been widely considered for adversarial detection. Feinman et al. (2017) train a logistic regression classifier using the Kernel Density (KD) of features in the last hidden layer, together with Bayesian Uncertainty (BU). Ma et al. (2018) consider the local intrinsic dimensionality (LID) of DNN features as the characteristic for detection. Lee et al. (2018) use a Mahalanobis distance-based score to detect adversarial examples. In addition, Deng et al. (2021) train a Bayesian neural network by adding uniform noise to the samples.
However, these methods train a tailored detector for specific attacks or for a specific classifier, which largely overlooks the modeling of the data distribution and leads to limited performance under unknown attacks.
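For concreteness, the one-step FGSM update described above, x_adv = x + ε · sign(∇_x L(x, y)), can be sketched as follows. The softmax-regression classifier with an analytic gradient is a stand-in for a real network (an assumption for illustration only, not the attack setup used in the paper):

```python
import numpy as np

def fgsm_attack(x, y, W, b, eps):
    """One-step FGSM on a softmax-regression classifier:
    x_adv = clip(x + eps * sign(grad_x cross_entropy(x, y)))."""
    logits = x @ W + b                       # (d,) @ (d, k) -> (k,)
    p = np.exp(logits - logits.max())
    p /= p.sum()                             # softmax probabilities
    onehot = np.zeros_like(p)
    onehot[y] = 1.0
    grad_x = W @ (p - onehot)                # analytic gradient of CE loss w.r.t. x
    x_adv = x + eps * np.sign(grad_x)        # step along the gradient sign
    return np.clip(x_adv, 0.0, 1.0)          # stay in the valid pixel range
```

BIM/PGD simply repeat this update for several steps with a smaller step size (and, for PGD, a random start), projecting back into the ε-ball after each step.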

B PROOFS IN SECTION 3

B.1 PROOF OF THEOREM 1

Theorem 1 Assume that the distribution of natural data is $p(x) = \mathcal{N}(\mu_x, \sigma_x^2 I)$, and consider a perturbation transition kernel $p_{0t}(x_t \mid x_0) = \mathcal{N}(x_t; \gamma_t x_0, \sigma_t^2 I)$ with $\gamma_t$ and $\sigma_t$ being the time-dependent noise schedule. Then the following three conclusions hold for $S(x)$: 1) For $\forall\, x \sim p(x)$, $S(x) \sim \mathcal{N}(0, \sigma_S^2 I)$; 2) For $\forall\, y \sim p(x)$ and adversarial sample $\hat{y} = y + \epsilon$ with perturbation $\epsilon$, $S(\hat{y}) \sim \mathcal{N}(-\mu_S, \sigma_S^2 I)$; 3) For $\forall\, x, y \sim p(x)$ and adversarial sample $\hat{y} = y + \epsilon$, we have

$$S(x) - S(y) \xrightarrow{d} \mathcal{N}(0, 2\sigma_S^2 I), \qquad (12)$$

$$S(x) - S(\hat{y}) \xrightarrow{d} \mathcal{N}(\mu_S, 2\sigma_S^2 I), \qquad (13)$$

where $\mu_S = \mathbb{E}_t\!\left[\frac{\epsilon}{\gamma_t^2 \sigma_x^2 + \sigma_t^2}\right]$ and $\sigma_S^2 = \mathbb{E}_t\!\left[\frac{1}{\gamma_t^2 \sigma_x^2 + \sigma_t^2}\right]$.

Proof. 1) Based on the distribution $p(x)$, i.e., $x_0 = x \sim \mathcal{N}(\mu_x, \sigma_x^2 I)$, we can write $x_0 = \mu_x + \sigma_x z$ with $z \sim \mathcal{N}(0, I)$; based on the perturbation transition kernel $p_{0t}(x_t \mid x_0) = \mathcal{N}(x_t; \gamma_t x_0, \sigma_t^2 I)$, we have $x_t = \gamma_t x_0 + \sigma_t z'$ with $z' \sim \mathcal{N}(0, I)$. Combining the distributions of $x$ and $x_0$, we have $x_t = \gamma_t \mu_x + \sqrt{\gamma_t^2 \sigma_x^2 + \sigma_t^2}\, z$, i.e.,

$$x_t \sim \mathcal{N}\!\left(\gamma_t \mu_x,\ (\gamma_t^2 \sigma_x^2 + \sigma_t^2) I\right). \qquad (14)$$

For $x_t \sim p_t(x) = \mathcal{N}(\gamma_t \mu_x, (\gamma_t^2 \sigma_x^2 + \sigma_t^2) I)$, we calculate the score

$$\nabla_x \log p_t(x) = -\frac{x_t - \gamma_t \mu_x}{\gamma_t^2 \sigma_x^2 + \sigma_t^2} \sim \mathcal{N}\!\left(0,\ \frac{1}{\gamma_t^2 \sigma_x^2 + \sigma_t^2} I\right). \qquad (15)$$

Taking the expectation over $t$, we obtain the distribution of $S(x)$:

$$S(x) \sim \mathcal{N}(0, \sigma_S^2 I), \quad \text{where } \sigma_S^2 = \mathbb{E}_t\!\left[\frac{1}{\gamma_t^2 \sigma_x^2 + \sigma_t^2}\right]. \qquad (16)$$

2) Based on $y \sim p(x)$ and $\hat{y} = y + \epsilon$, we obtain $\hat{y}_0 = \hat{y} \sim \mathcal{N}(\mu_x + \epsilon, \sigma_x^2 I)$. Then, we have

$$\nabla_{\hat{y}} \log p_t(\hat{y}) = -\frac{y_t + \epsilon - \gamma_t \mu_x}{\gamma_t^2 \sigma_x^2 + \sigma_t^2} \sim \mathcal{N}\!\left(-\frac{\epsilon}{\gamma_t^2 \sigma_x^2 + \sigma_t^2},\ \frac{1}{\gamma_t^2 \sigma_x^2 + \sigma_t^2} I\right), \qquad (17)$$

where the last step is based on $p_t(y) = p_t(x) = \mathcal{N}(\gamma_t \mu_x, (\gamma_t^2 \sigma_x^2 + \sigma_t^2) I)$. Taking the expectation over $t$, we obtain the distribution of $S(\hat{y})$:

$$S(\hat{y}) \sim \mathcal{N}(-\mu_S, \sigma_S^2 I), \quad \text{where } \mu_S = \mathbb{E}_t\!\left[\frac{\epsilon}{\gamma_t^2 \sigma_x^2 + \sigma_t^2}\right] \text{ and } \sigma_S^2 = \mathbb{E}_t\!\left[\frac{1}{\gamma_t^2 \sigma_x^2 + \sigma_t^2}\right]. \qquad (18)$$

3) According to the additive property of the Gaussian distribution, combining Eq. (16) and Eq. (18), we obtain the third conclusion.
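Because the score of a Gaussian $p_t$ is available in closed form, Theorem 1 is easy to sanity-check numerically. The sketch below uses an assumed toy $(\gamma_t, \sigma_t)$ schedule and verifies that the EPS of natural samples is centred at zero while the EPS of perturbed samples is shifted by roughly $-\mu_S$ (up to the $\gamma_t$ factor absorbed in the proof):

```python
import numpy as np

# Monte-Carlo sanity check of Theorem 1 in a toy Gaussian setting:
# natural data x ~ N(mu_x, sigma_x^2 I); perturbation kernel x_t = gamma_t x_0 + sigma_t z.
# For this Gaussian p_t, the score is closed-form:
#   grad log p_t(x_t) = -(x_t - gamma_t mu_x) / (gamma_t^2 sigma_x^2 + sigma_t^2).

rng = np.random.default_rng(0)
d, n = 16, 20000
mu_x, sigma_x = np.zeros(d), 1.0
ts = [(0.99, 0.1), (0.95, 0.2), (0.90, 0.3)]   # assumed (gamma_t, sigma_t) values

def eps_statistic(x0):
    """EPS: the score of the perturbed sample, averaged (expected) over timesteps t."""
    scores = []
    for gamma, sigma in ts:
        xt = gamma * x0 + sigma * rng.normal(size=x0.shape)
        s2 = gamma**2 * sigma_x**2 + sigma**2
        scores.append(-(xt - gamma * mu_x) / s2)
    return np.mean(scores, axis=0)

x0 = mu_x + sigma_x * rng.normal(size=(n, d))   # natural samples
eps_adv = 0.5 * np.ones(d)                      # a fixed adversarial perturbation
S_nat = eps_statistic(x0)                       # conclusion 1): mean approx. 0
S_adv = eps_statistic(x0 + eps_adv)             # conclusion 2): mean shifted to approx. -mu_S
```

With these values the empirical mean of `S_adv` sits near $-0.5$ per dimension, matching $-\mathbb{E}_t[\gamma_t \epsilon / (\gamma_t^2\sigma_x^2+\sigma_t^2)]$ for $\epsilon = 0.5$.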

B.2 PROOF OF COROLLARY 1

Corollary 1 Considering the Gaussian kernel $\kappa(x, y) = \exp(-\|x - y\|^2 / 2\sigma^2)$ and the assumptions in Theorem 1, for $\forall\, 0 < \eta < 1$, the probability $P\{k(x, \hat{y}) > \eta\}$ is given by

$$P\{k(x, \hat{y}) > \eta\} = \int_0^C \chi_d^2\!\left(x; \|\mu_S\|^2\right) dx, \qquad (19)$$

where $\mu_S$ denotes the mean of $S(x) - S(\hat{y})$, $C$ is a constant for given $\eta$ and $\sigma$, and $\chi_d^2(\cdot\,; \|\mu_S\|^2)$ is the probability density function of the noncentral chi-squared distribution with $d$ degrees of freedom and noncentrality parameter $\|\mu_S\|^2$ (Abdel-Aty, 1954).

Proof. Based on the Gaussian kernel $\kappa(x, y)$, we have

$$P\{k(x, \hat{y}) > \eta\} = P\{\kappa(S(x), S(\hat{y})) > \eta\} = P\!\left\{\exp\!\left(-\|S(x) - S(\hat{y})\|^2 / 2\sigma^2\right) > \eta\right\} \qquad (20)$$

$$= P\!\left\{\|S(x) - S(\hat{y})\|^2 < -2\sigma^2 \ln \eta\right\}. \qquad (21)$$

Let $\xi = S(x) - S(\hat{y})$; then $\xi_i \sim \mathcal{N}((\mu_S)_i, \sigma_S^2)$, and thus

$$P\{k(x, \hat{y}) > \eta\} = P\!\left\{\sum_{i=1}^d \xi_i^2 < -2\sigma^2 \ln \eta\right\} = P\!\left\{\sum_{i=1}^d \left(\frac{\xi_i}{\sigma_S}\right)^2 < \frac{-2\sigma^2 \ln \eta}{\sigma_S^2}\right\}. \qquad (22)$$

Note that $\xi_i / \sigma_S \sim \mathcal{N}((\mu_S)_i, 1)$; based on the definition of the noncentral chi-squared distribution (Abdel-Aty, 1954), we have

$$\sum_{i=1}^d \left(\frac{\xi_i}{\sigma_S}\right)^2 \sim \chi_d^2\!\left(\|\mu_S\|^2\right). \qquad (23)$$

Letting $C = \frac{-2\sigma^2 \ln \eta}{\sigma_S^2}$, we obtain the conclusion $P\{k(x, \hat{y}) > \eta\} = \int_0^C \chi_d^2(x; \|\mu_S\|^2)\, dx$.

C PSEUDO-CODE OF EPS-AD

1. Compute the EPSs of the natural samples, $\{S(x^{(i)})\}_{i=1}^{n}$, using Eq. (7).
2. Compute the EPS of the upcoming sample, $S(x)$, using Eq. (7).
3. Compute the MMD between $\{S(x^{(i)})\}_{i=1}^{n}$ and $S(x)$ using Eq. (10).
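The three steps above can be sketched end-to-end. The plain Gaussian-kernel MMD estimator below is a simplification of the paper's Eq. (10), which uses a trained deep kernel; the bandwidth is an assumed value:

```python
import numpy as np

def gaussian_kernel(a, b, bw):
    """k(u, v) = exp(-||u - v||^2 / (2 * bw^2)) for all pairs of rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-np.sum(diff**2, axis=-1) / (2 * bw**2))

def mmd_to_set(nat_eps, test_eps, bw=4.0):
    """Biased (V-statistic) MMD^2 between the set of natural EPSs and one test EPS."""
    y = test_eps[None, :]
    k_xx = gaussian_kernel(nat_eps, nat_eps, bw).mean()
    k_xy = gaussian_kernel(nat_eps, y, bw).mean()
    k_yy = 1.0  # k(y, y) = exp(0) for the Gaussian kernel
    return k_xx - 2.0 * k_xy + k_yy
```

A larger MMD flags the test sample as adversarial; thresholding this statistic yields the detector whose AUROC is reported in the experiments.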

D MORE DETAILS FOR EXPERIMENT SETTINGS D.1 IMPLEMENTATION DETAILS OF OUR METHOD

Our adversarial detection method is built upon diffusion models. Specifically, for CIFAR-10 we consider the pre-trained diffusion model of Score SDE following Song et al. (2021) and choose the vp/cifar10_ddpmpp_deep_continuous checkpoint from the score_sde library; for ImageNet, we consider the pre-trained diffusion model of Guided Diffusion following Dhariwal & Nichol (2021) and use the 256 × 256 diffusion (unconditional) checkpoint from the guided-diffusion library. For classifiers, we use pre-trained WideResNet-28-10 (Zagoruyko & Komodakis, 2016) and WideResNet-70-16 (Gowal et al., 2021) for CIFAR-10, and ResNet-50, ResNet-101 (He et al., 2016) and DeiT-S (Dosovitskiy et al., 2021) for ImageNet. For attacks, we consider 8 attack intensities ϵ ∈ {1/255, ..., 8/255}, with iterative attacks run for 5 steps using step size ϵ/5, under 12 different ℓ2 and ℓ∞ attack methods to generate adversarial examples: PGD, PGD-ℓ2 (Madry et al., 2018), FGSM, FGSM-ℓ2 (Goodfellow et al., 2015), BIM, BIM-ℓ2 (Kurakin et al., 2018), MIM (Dong et al., 2018), TIM (Dong et al., 2019), CW (Carlini & Wagner, 2017), DI MIM (Xie et al., 2019), and two adaptive attacks, AutoAttack (AA) (Croce & Hein, 2020) and the Minimum-Margin (MM) attack (Gao et al., 2022), a faster version of AA. For evaluation, we choose the area under the receiver operating characteristic curve (AUROC) as the metric for adversarial detection. Throughout all our experiments, we only use FGSM and FGSM-ℓ2 (ϵ = 1/255) to train a deep kernel that performs detection against all the attacks, following Liu et al. (2020); the deep kernel could also be trained on a general public dataset, which we leave for future work. Note that our method is suitable for detecting all the ℓ2 and ℓ∞ adversarial samples. We conduct our experiments with Python 3.7 and PyTorch 1.7.1 on a server with 1× RTX 3090 GPU.
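For reference, the perturbation kernel $p_{0t}(x_t \mid x_0) = \mathcal{N}(\gamma_t x_0, \sigma_t^2 I)$ of the VP-SDE behind the Score SDE checkpoints can be sketched as follows. The linear β-schedule endpoints below are the commonly used defaults and should be read as assumptions, not as the exact values of the checkpoints we use:

```python
import numpy as np

# VP-SDE perturbation kernel p_0t(x_t | x_0) = N(gamma_t x_0, sigma_t^2 I)
# with gamma_t = exp(-0.5 * int_0^t beta(s) ds) and sigma_t^2 = 1 - gamma_t^2.
BETA_MIN, BETA_MAX = 0.1, 20.0  # assumed linear-schedule defaults

def noise_schedule(t):
    """Return (gamma_t, sigma_t) for continuous time t in [0, 1]."""
    log_gamma = -0.5 * (BETA_MIN * t + 0.5 * t**2 * (BETA_MAX - BETA_MIN))
    gamma = np.exp(log_gamma)
    sigma = np.sqrt(1.0 - gamma**2)
    return gamma, sigma

def perturb(x0, t, rng):
    """Draw x_t ~ p_0t(x_t | x_0) for one diffusion timestep t."""
    gamma, sigma = noise_schedule(t)
    return gamma * x0 + sigma * rng.normal(size=x0.shape)
```

The schedule is variance-preserving by construction ($\gamma_t^2 + \sigma_t^2 = 1$), so perturbed samples stay on a comparable scale across timesteps.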

D.2 IMPLEMENTATION DETAILS OF BASELINES

We choose three standard adversarial detection approaches, KD (Feinman et al., 2017), LID (Ma et al., 2018) and MD (Lee et al., 2018), as baselines for both CIFAR-10 and ImageNet, as well as LiBRe (Deng et al., 2021) on ImageNet only, which trains a Bayesian neural network by adding uniform noises to the samples. Besides, we construct two additional diffusion-based detection methods: 1) S-N, which uses the score norm of the raw image as the characteristic, i.e., ||s_θ(x, t)||_2; 2) EPS-N, which uses the norm of the EPS as the characteristic, i.e., ||S(x)||_2.

KD & LID & MD. We implement KD, LID and MD following their respective codebases. These three methods train a logistic regressor to distinguish natural, noisy and adversarial samples. To show their best performance for adversarial detection, we set the noise scale of the ℓ2 distance in KD, LID and MD to 40, 1, 10 on CIFAR-10 and 20, 1, 10 on ImageNet, respectively. Besides, for KD, we set the bandwidth to 10 on CIFAR-10 and 20 on ImageNet.

LiBRe. We implement LiBRe on ImageNet following its codebase and evaluate its performance on ResNet-50 and ResNet-101 for comparison.
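For reference, the Mahalanobis-distance characteristic of the MD baseline (class-conditional means with a tied covariance fitted on penultimate-layer features) can be sketched as:

```python
import numpy as np

def fit_mahalanobis(feats, labels, num_classes):
    """Fit class means and a tied (shared) covariance on feature vectors,
    in the spirit of MD (Lee et al., 2018)."""
    means = np.stack([feats[labels == c].mean(axis=0) for c in range(num_classes)])
    centered = feats - means[labels]
    cov = centered.T @ centered / len(feats)
    # Small ridge term for numerical stability before inversion.
    precision = np.linalg.inv(cov + 1e-6 * np.eye(feats.shape[1]))
    return means, precision

def mahalanobis_score(f, means, precision):
    """Detection confidence: negative squared Mahalanobis distance
    to the closest class mean (higher = more in-distribution)."""
    diffs = means - f
    d2 = np.einsum('ci,ij,cj->c', diffs, precision, diffs)
    return -d2.min()
```

In the full MD method this confidence is computed at several layers and fed, together with noisy and adversarial samples, into the logistic regressor mentioned above.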

D.3 IMPLEMENTATION DETAILS OF FIGURE 1

For the results in Figure 1, we calculate the score norms of natural and adversarial samples at different purification timesteps using a score model pre-trained on ImageNet, where the adversarial samples are crafted by FGSM with ϵ = 1/255. We do not perturb these samples again before feeding them into the score model. For better visualization, we normalize the score norms by the maximum score norm at each timestep.

E IMPACT OF EPS FOR EPS-AD

In our method EPS-AD, we use a diffusion-based score model to calculate the characteristic of each sample, i.e., its EPS, which has the same dimension as the sample. In this experiment, we investigate the impact of EPS in our method: we remove the calculation of EPS and instead use the raw sample as the characteristic. Table 4 shows the adversarial detection performance of our method against 6 attacks under attack intensities ϵ ∈ {2/255, 4/255} on ImageNet over ResNet-50, compared to the variant without EPS. EPS-AD without EPS exhibits a significant performance drop (≈ 28% ↓), suggesting the superiority of our proposed EPS in distinguishing between adversarial and natural samples.

F IMPACT OF ADDING PERTURBATIONS OVER SAMPLES FOR EPS-AD

Adding perturbations to the samples is critical for our proposed EPS-AD. To investigate the impact of this operation, we conduct ablation studies on ImageNet. Table 5 reports the adversarial detection performance of EPS-AD under ϵ = 1/255 on ImageNet over ResNet-50, compared to the variant without adding perturbations. We observe that the adversarial detection performance is significantly improved by adding perturbations: our method gains about 1.86% ↑ on average over 12 attacks, with a maximum of 4.84% ↑ against BIM-ℓ2. This coincides with the conclusion in Theorem 1 that adding perturbations helps distinguish between adversarial and natural samples.

G MORE RESULTS OF ADVERSARIAL DETECTION

To further evaluate the effectiveness of our proposed EPS-AD, in Subsection G.1 we conduct more comparison experiments on detecting additional adversarial attacks on the ImageNet and CIFAR-10 datasets; to demonstrate the generalization of EPS-AD, we also conduct more experiments on detecting unseen attacks and transferable attacks. In Subsection G.2, we further study the mechanism of EPS-AD by reporting the impact of timesteps, set size and low attack intensity, the transferability across datasets, and a real application to detecting face anti-spoofing samples. We provide extra results of all compared experiments in Tables 6, 7 and 8 and Figures 5, 6 and 7, as supplements with additional attack methods corresponding to the main body.

G.1 MORE COMPARISON EXPERIMENTS

More comparison results on the basic setup. In Table 6, we provide additional attack results, including MIM, TIM, DI MIM, PGD-ℓ2 and MM, on CIFAR-10 and ImageNet. We observe that EPS-AD maintains its dominant position under these attacks, while EPS-N and S-N exhibit poor performance on ImageNet because these two methods use only the norm of the vector and thus overlook the rich information that can be derived from the vector itself. Moreover, Figures 5 and 6 show that most adversarial detection methods suffer under extremely low attack intensity (e.g., ϵ = 1/255); in contrast, our EPS-AD still delivers promising detection performance (refer to Table 14 for more quantitative results).

More comparison results on unseen and transferable attacks. We also compare our EPS-AD with KD, LID, MD and LiBRe under 6 additional unseen or transferable attacks (MIM, TIM, DI MIM, PGD-ℓ2, MM, VMI-FGSM (Wang & He, 2021)) to further evaluate the effectiveness of our method. As shown in Table 7 and Table 8, our approach consistently exhibits superior generalization compared to the other baselines.

More comparison results on CIFAR-10 over the robust WideResNet-70-16. We further compare our method with the baselines using an adversarially trained classifier on CIFAR-10, i.e., WideResNet-70-16 (Gowal et al., 2021), against various attacks. Owing to the specificity of the vision-transformer-based architecture, we compare our method with three baselines (LID, S-N and EPS-N) in that setting; from Table 11, our method exhibits consistent superiority over the other baselines, suggesting the versatility of EPS-AD across different architectures.
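The AUROC values reported in these tables can be computed directly from raw detection scores (e.g., the MMD statistic) without any threshold sweep, via the Mann-Whitney rank statistic; a minimal sketch:

```python
import numpy as np

def auroc(scores_nat, scores_adv):
    """Exact AUROC of a detector whose score is higher for adversarial inputs,
    computed via the Mann-Whitney U statistic (ties count as 0.5)."""
    greater = (scores_adv[:, None] > scores_nat[None, :]).sum()
    ties = (scores_adv[:, None] == scores_nat[None, :]).sum()
    return (greater + 0.5 * ties) / (len(scores_adv) * len(scores_nat))
```

An AUROC of 1.0 means the two score populations are perfectly separated; 0.5 corresponds to random guessing.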

G.2 MORE DISCUSSIONS OF EPS-AD

Ablation of timestep. The ablation study of the timestep is provided in Table 12. We observe that when the timestep T stays in [10, 100], both EPS-AD and EPS-N obtain usable AUROC in detecting the FGSM-ℓ2 attack, which verifies that our method is insensitive to the timestep. Moreover, our approach achieves its optimum performance at T* = 20 on CIFAR-10 and T* = 50 on ImageNet.

Impact of set size. Previous adversarial detection methods usually measure the discrepancy well only with a large amount of data (Gao et al., 2021). To show the effectiveness of our proposed EPS-AD, we further ablate the effect of set size by conducting experiments on subsets of 100 to 500 samples from CIFAR-10 with WideResNet-28-10 and from ImageNet with ResNet-50. The performance of our three methods, S-N, EPS-N and EPS-AD, is shown in Table 13 and Figure 7. EPS-AD consistently outperforms EPS-N and S-N for both small and large set sizes. Moreover, EPS-AD is robust to changes of the set size, while EPS-N and S-N fluctuate with it, especially on the CIFAR-10 dataset.

Computational efficiency of EPS-AD. Given a fixed-size model, the computational cost of our method (EPS-AD) mainly depends on two factors: the resolution of the input images and the total diffusion timestep T. In practice, our EPS-AD performs adversarial detection efficiently, especially on low-resolution images, and yields promising performance compared to existing methods. To evaluate the efficiency of EPS-AD, we randomly choose 500 images each from CIFAR-10 and ImageNet for detecting FGSM-ℓ2 adversarial samples on a single RTX 3090 GPU. The average time costs per image for CIFAR-10 (T = 20) and ImageNet (T = 50) are 0.038s and 2.386s, respectively.
To further demonstrate the effect of the total diffusion timestep on efficiency, we provide results under different diffusion timesteps against the FGSM-ℓ2 attack in Table 12. From the table, our EPS-AD shows superior adversarial detection performance for 20 ≤ T ≤ 100 on ImageNet and takes 0.954s when T = 20, which is much more efficient than T = 50. Moreover, our EPS-AD achieves superior or comparable performance on ImageNet and CIFAR-10 compared with existing methods (i.e., KD, LID and MD) even with T as small as 20. Note that the computational efficiency of our method can be further improved by applying an efficient sampling strategy (Lu et al., 2022), a low-resolution diffusion model (Dhariwal & Nichol, 2021) or a sparse diffusion timestep (e.g., sampling with a time interval of 2/1000 during the diffusion process); we leave these techniques for future work.

Detecting under low attack intensity. To further reveal the superiority of our EPS-AD, we conduct an experiment under an extremely low attack intensity (ϵ = 1/255) on ImageNet. In Table 14, we observe that our EPS-AD achieves a significant advantage in detecting adversarial samples crafted with extremely low attack intensity, demonstrating its effectiveness.

Detecting adversarial samples across datasets. We further exploit the transferability across different datasets. To this end, we utilize a score-based diffusion model pre-trained on ImageNet to detect adversarial samples from CIFAR-10. Specifically, we randomly select two disjoint subsets of CIFAR-10 as adversarial and natural samples (each containing 500 samples) and use the ImageNet-pre-trained score model to calculate the AUROC; we denote this setting EPS-AD*. Table 15 reports the detection performance of 6 methods against 12 attacks under ϵ = 2/255 on CIFAR-10 over WideResNet-28-10.
We observe that EPS-AD* still exhibits superior detection performance compared to the KD, LID and MD baselines, and achieves comparable performance to the other diffusion-based methods that use a score model pre-trained on CIFAR-10.



Code repositories referenced above:
score_sde: https://github.com/yang-song/score_sde
guided-diffusion: https://github.com/openai/guided-diffusion
deep-kernel two-sample test: https://github.com/fengliu90/DK-for-TST
KD: https://github.com/rfeinman/detecting-adversarial-samples
LID: https://github.com/xingjunm/lid_adversarial_subspace_detection
MD: https://github.com/pokaxpoka/deep_Mahalanobis_detector
LiBRe: https://github.com/thudzj/ScalableBDL



Figure 1: An illustration of the score norms of 200 randomly sampled natural images and adversarial images on ImageNet, in which most natural images have lower score norms than adversarial images at the same timestep; however, the criterion is very sensitive to the timestep due to the significant overlap.

Figure 2: Overview of the proposed EPS-AD. EPS denotes the expected score after multiple perturbations of a sample using a pre-trained score model. Specifically, we simultaneously add perturbations to a set of natural images {x (i) 0 } and a test image x0 following the diffusion process with a time step T * to get perturbed images, from which we obtain their EPSs S(x) via the score model and calculate the MMD between EPS of the test sample and EPSs of natural samples.

Figure 3: Comparison with adversarial detection methods on CIFAR-10 in terms of AUROC under ϵ∈{1/255, • • •, 8/255}. Sub-figures (a) -(f) share the same legend presented in sub-figure (a).

Figure 4: Results of adversarial detection on ImageNet. Sub-figures (a)-(e) report the AUROC on different attacks under ϵ∈{1/255, • • •, 8/255} and share the same legend presented in sub-figure (a). Sub-figure (f) reports the AUROCs of different timesteps in {1, 2, 5, 10, 20, 50, 100}.

To broadly evaluate the adversarial detection performance, we compare our method with other baselines on different attacks under different attack intensities. Moreover, to show the best performance of KD, LID and MD, we test the detection performance on their corresponding adversarial examples.

Results on CIFAR-10. Figure 3 shows our adversarial detection performance against 6 attacks under different attack intensities ϵ∈{1/255, • • •, 8/255} on CIFAR-10 over WideResNet-28-10 (Zagoruyko & Komodakis, 2016) compared to other baselines. We present other attacks and more results on WideResNet-70-16 (Gowal et al., 2021) in the Appendix. Our EPS-AD clearly has much higher AUROC than the other methods. Critically, we observe that EPS-AD preserves almost non-degraded AUROC when the attack intensity ϵ surpasses 2/255 against ℓ2 and ℓ∞ attacks, which shows the stability of EPS-AD when detecting challenging adversarial samples.

Figure 6: More Results of adversarial detection on ImageNet. Sub-figures (a) -(f) report the AUROC on different attacks under ϵ∈{1/255, • • •, 8/255} and share the same legend in sub-figure (c).

Figure 7: Impact of different set sizes and diffusion time step. Sub-figures (a) and (b) report the AUROC on FGSM-ℓ 2 attack under ϵ = 4/255. Sub-figure (c) reports the AUROCs of different diffusion time in {1, 2, 5, 10, 20, 50, 100} on CIFAR-10 dataset.

Comparison of different adversarial detection methods on CIFAR-10 and ImageNet in terms of AUROC under ϵ = 4/255. The bold number indicates the best results.

Comparison of AUROC for detecting unseen attacks on CIFAR-10, where "FGSM (seen)" denotes the seen adversarial attack used for the training of KD, LID and MD.

Comparison of AUROC for detecting transferable attacks on ImageNet, where KD, LID, MD and LiBRe are trained with adversarial examples with ResNet-50 but detect the adversarial examples crafted with ResNet-101.

To ensure the reproducibility of experimental results, we provide an exhaustive implementation in Section D, and the code will be released upon acceptance.

Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In British Machine Vision Conference, 2016.

Jianping Zhang, Weibin Wu, Jen-tse Huang, Yizhan Huang, Wenxuan Wang, Yuxin Su, and Michael R. Lyu. Improving adversarial transferability via neuron attribution-based attacks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 14993-15002, 2022.

Output: MMD between the EPS of the upcoming sample $x_0$ and the EPSs of the natural samples $\{x_0^{(i)}\}_{i=1}^n$.

Impact of EPS with ResNet-50 on ImageNet under ϵ = 2/255 and ϵ = 4/255.

Impact of adding perturbations with ResNet-50 on ImageNet under ϵ = 1/255.

More results of different adversarial detection methods on CIFAR-10 and ImageNet in terms of AUROC under ϵ = 4/255 (MIM, TIM, DI MIM, PGD-ℓ2, MM).

Tables 9 and 10: we observe that the diffusion-based detection methods are much better than the other baselines trained with specific adversarial samples. One possible reason is that it is difficult for adversarial samples to deceive robust classifiers, which means such adversarial samples are ineffective for training effective detectors.

More results of AUROC for detecting the unseen attacks (MIM, TIM, DI MIM, PGD-ℓ 2 , MM) on CIFAR-10. "FGSM (seen)" denotes the seen adversarial attack used for the training of KD, LID and MD.

More results of AUROC for detecting the transferable attacks (MIM, TIM, DI MIM, PGD-ℓ2, MM, VMI-FGSM) on ImageNet, where KD, LID, MD and LiBRe are trained with adversarial examples crafted with ResNet-50 but detect adversarial examples crafted with ResNet-101.

Comparison of AUROC when using the adversarially trained WideResNet-70-16 as the classifier on CIFAR-10 under ϵ = 2/255. Due to memory and resource constraints, we omit the detection results on AutoAttack for KD, LID and MD.

Comparison of AUROC when using the adversarially trained WideResNet-70-16 as the classifier on CIFAR-10 under ϵ = 4/255.

Comparison of AUROC for using DeiT-S as classifier on ImageNet under ϵ = 4/255.

Impact of timestep with WideResNet-28-10 on CIFAR-10 and ResNet-50 on ImageNet against FGSM-ℓ 2 under ϵ = 4/255.

Impact of data set size with WideResNet-28-10 on CIFAR-10 and ResNet-50 on ImageNet against FGSM-ℓ 2 under ϵ = 4/255.

