EXPECTED PERTURBATION SCORE FOR ADVERSARIAL DETECTION

Abstract

Adversarial detection aims to determine whether a given sample is adversarial based on the discrepancy between the natural and adversarial distributions. Unfortunately, estimating or comparing two data distributions is extremely difficult, especially in high-dimensional spaces. Recently, the gradient of the log probability density (a.k.a. the score) w.r.t. the sample has been used as an alternative statistic. However, we find that the score is unreliable for identifying adversarial samples because a single sample provides insufficient information. In this paper, we propose a new statistic called the expected perturbation score (EPS), which is essentially the expected score of a sample after various perturbations. Specifically, to obtain adequate information about a sample, we perturb it with various noises to capture multi-view observations. We theoretically prove that, under mild conditions, EPS is a proper statistic for computing the discrepancy between two samples. In practice, a pre-trained diffusion model can be used to estimate the EPS of each sample. Finally, we propose an EPS-based adversarial detection (EPS-AD) method, in which we develop an EPS-based maximum mean discrepancy (MMD) metric to measure the discrepancy between a test sample and natural samples. To verify the validity of our proposed method, we also prove that the EPS-based MMD between natural and adversarial samples is larger than that among natural samples. Empirical studies on CIFAR-10 and ImageNet across different network architectures, including ResNet, WideResNet, and ViT, show the superior adversarial detection performance of EPS-AD compared with existing methods.
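As a rough illustration of the MMD metric mentioned above, the following NumPy sketch computes a (biased) squared-MMD estimate with a Gaussian kernel between two feature sets. The stand-in features, sample sizes, and kernel bandwidth are illustrative assumptions, not the paper's implementation; in EPS-AD the inputs would be EPS features produced by a pre-trained diffusion model.

```python
import numpy as np

def rbf_kernel(a, b, bandwidth):
    # Gaussian kernel matrix between the rows of a (n, d) and b (m, d).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * bandwidth ** 2))

def mmd_squared(x, y, bandwidth=4.0):
    """Biased estimator of squared MMD: E[k(x,x')] + E[k(y,y')] - 2 E[k(x,y)]."""
    kxx = rbf_kernel(x, x, bandwidth).mean()
    kyy = rbf_kernel(y, y, bandwidth).mean()
    kxy = rbf_kernel(x, y, bandwidth).mean()
    return kxx + kyy - 2.0 * kxy

# Toy stand-ins: "natural" features vs. a mean-shifted "adversarial" set.
rng = np.random.default_rng(0)
nat = rng.standard_normal((64, 8))
nat2 = rng.standard_normal((64, 8))        # a second natural draw
adv = rng.standard_normal((64, 8)) + 1.0   # shifted distribution
```

With these toy draws, `mmd_squared(nat, adv)` comes out larger than `mmd_squared(nat, nat2)`, mirroring the paper's claim that the MMD between natural and adversarial samples exceeds that among natural samples.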

1. INTRODUCTION

Deep neural networks (DNNs) are known to be sensitive to adversarial samples, which are generated by adding imperceptible perturbations to the input but may mislead the model into unexpected predictions (Szegedy et al., 2014; Goodfellow et al., 2015). Adversarial samples threaten widespread machine learning systems (Li & Vorobeychik, 2014; Ozbulak et al., 2019), which raises an urgent need for advanced techniques to improve model robustness. Among existing defenses, adversarial training introduces adversarial data into training to improve robustness but suffers from significant performance degradation and high computational complexity (Sriramanan et al., 2021; Laidlaw et al., 2021; Wong et al., 2020); adversarial purification relies on generative models to purify adversarial data before classification, but still compromises on unsatisfactory natural and adversarial accuracy (Shi et al., 2021; Yoon et al., 2021; Nie et al., 2022). In contrast, another class of defense methods, called adversarial detection, detects and rejects adversarial examples; such methods are friendly to existing machine learning systems because natural accuracy is preserved, and they can help identify security-compromised input sources (Abusnaina et al., 2021).

Adversarial detection aims to tell whether a test sample is adversarial, for which the key is to find the discrepancy between the adversarial and natural distributions. However, existing adversarial detection approaches primarily train a tailored detector for specific attacks (Feinman et al., 2017; Ma et al., 2018; Lee et al., 2018) or for a specific classifier (Deng et al., 2021), and they largely overlook modeling the adversarial and natural distributions, which limits their performance against unseen or transferable attacks. Unfortunately, it is non-trivial to estimate or compare two data distributions, especially in a high-dimensional space (e.g., an image space).
One alternative approach is to estimate the gradient of the log probability density (a.k.a. the score) w.r.t. the sample and use it as the detection statistic.
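The EPS statistic built on this score can be sketched as a Monte-Carlo average: perturb a sample with Gaussian noise at several scales and average the scores of the perturbed copies. In the sketch below, `score_fn` is a hypothetical stand-in for a pre-trained diffusion/score network (the paper's estimator); the toy Gaussian score used to exercise it, the noise scales, and the draw count are illustrative assumptions.

```python
import numpy as np

def expected_perturbation_score(x, score_fn, sigmas, n_draws=16, seed=0):
    """Monte-Carlo estimate of the expected perturbation score (EPS):
    average the score of x after Gaussian perturbations at each scale.
    `score_fn(x_pert, sigma)` is a hypothetical stand-in for a
    pretrained diffusion/score network."""
    rng = np.random.default_rng(seed)
    scores = []
    for sigma in sigmas:
        for _ in range(n_draws):
            x_pert = x + sigma * rng.standard_normal(x.shape)
            scores.append(score_fn(x_pert, sigma))
    return np.mean(scores, axis=0)

# Toy check: if x ~ N(0, I) is perturbed by N(0, sigma^2 I), the perturbed
# density is N(0, (1 + sigma^2) I), whose exact score is -x / (1 + sigma^2).
toy_score = lambda x_pert, sigma: -x_pert / (1.0 + sigma ** 2)
eps = expected_perturbation_score(np.zeros(4), toy_score, sigmas=[0.1, 0.5])
```

For the toy Gaussian centered at the origin, the EPS of `x = 0` should be close to the zero vector, since the averaged scores of symmetric perturbations cancel.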

