STOCHASTIC SECURITY: ADVERSARIAL DEFENSE USING LONG-RUN DYNAMICS OF ENERGY-BASED MODELS

Abstract

The vulnerability of deep networks to adversarial attacks is a central problem for deep learning from the perspective of both cognition and security. The current most successful defense method is to train a classifier using adversarial images created during learning. Another defense approach involves transformation or purification of the original input to remove adversarial signals before the image is classified. We focus on defending naturally-trained classifiers using Markov Chain Monte Carlo (MCMC) sampling with an Energy-Based Model (EBM) for adversarial purification. In contrast to adversarial training, our approach is intended to secure highly vulnerable pre-existing classifiers. To our knowledge, no prior defensive transformation is capable of securing naturally-trained classifiers, and our method is the first to validate a post-training defense approach that is distinct from current successful defenses which modify classifier training. The memoryless behavior of long-run MCMC sampling will eventually remove adversarial signals, while metastable behavior preserves consistent appearance of MCMC samples after many steps to allow accurate long-run prediction. Balancing these factors can lead to effective purification and robust classification. We evaluate adversarial defense with an EBM using the strongest known attacks against purification. Our contributions are 1) an improved method for training EBMs with realistic long-run MCMC samples for effective purification, 2) an Expectation-Over-Transformation (EOT) defense that resolves ambiguities for evaluating stochastic defenses and from which the EOT attack naturally follows, and 3) state-of-the-art adversarial defense for naturally-trained classifiers and competitive defense compared to adversarial training on CIFAR-10, SVHN, and CIFAR-100. Our code and pre-trained models are available at https://github.com/point0bar1/ebm-defense.

1. MOTIVATION AND CONTRIBUTIONS

Deep neural networks are highly sensitive to small input perturbations. This sensitivity can be exploited to create adversarial examples that undermine robustness by causing trained networks to produce defective results from input changes that are imperceptible to the human eye (Goodfellow et al., 2015). The adversarial scenarios studied in this paper are primarily untargeted white-box attacks on image classification networks. White-box attacks have full access to the classifier (in particular, to classifier gradients) and are the strongest attacks against the majority of defenses.

Many white-box methods have been introduced to create adversarial examples. Strong iterative attacks such as Projected Gradient Descent (PGD) (Madry et al., 2018) can reduce the accuracy of a naturally-trained classifier to virtually 0. Currently the most robust form of adversarial defense is to train a classifier on adversarial samples in a procedure known as adversarial training (AT) (Madry et al., 2018). Another defense strategy, which we will refer to as adversarial preprocessing (AP), uses defensive transformations to purify an image and remove or nullify adversarial signals before classification (Song et al. (2018); Guo et al. (2018); Yang et al. (2019), and others). AP is an attractive strategy compared to AT because it has the potential to secure vulnerable pre-existing classifiers. Defending naturally-trained classifiers is the central focus of this work. Athalye et al. (2018) revealed that many preprocessing defenses can be overcome with minor adjustments to the standard PGD attack. Both stochastic behavior from preprocessing and the computational difficulty of end-to-end backpropagation can be circumvented to attack the classifier through the defensive transformation. In this paper we carefully address Athalye et al. (2018) to evaluate AP with an EBM using attacks with the greatest known effectiveness and efficiency.

Figure 1: Left: Visualization of calculating our stochastic logits F_H(x) from (5). The input image x is replicated H times and parallel Langevin updates with a ConvNet EBM are performed on each replicate to generate {x_h}_{h=1}^H. Purified samples are sent in parallel to our naturally-trained classifier network f(x) and the resulting logits {f(x_h)}_{h=1}^H are averaged to produce F_H(x). The logits F_H(x) give an approximation of our true classifier logits F(x) in (4) that can be made arbitrarily precise by increasing H. Right: Graphical diagram of the Langevin dynamics (3) that we use for T(x). Images are iteratively updated with a gradient from a naturally-trained EBM (1) and Gaussian noise Z_k.

Langevin sampling using an EBM with a ConvNet potential (Xie et al., 2016) has recently emerged as a method for AP (Du & Mordatch, 2019; Grathwohl et al., 2020). However, the proposed defenses are not competitive with AT (see Table 1 and Croce & Hein (2020)). In the present work we demonstrate that EBM defense of a naturally-trained classifier can be stronger than standard AT (Madry et al., 2018) and competitive with state-of-the-art AT (Zhang et al., 2019; Carmon et al., 2019). Our defense tools are a classifier trained with labeled natural images and an EBM trained with unlabeled natural images. For prediction, we perform Langevin sampling with the EBM and send the sampled images to the naturally-trained classifier. An intuitive visualization of our defense method is shown in Figure 1. Langevin chains constitute a memoryless trajectory that removes adversarial signals, while metastable sampling behaviors preserve image classes over long-run trajectories. Balancing these two factors leads to effective adversarial defense. Our main contributions are:

• A simple but effective adjustment to improve the convergent learning procedure from Nijkamp et al. (2020). Our adjustment enables realistic long-run sampling with EBMs learned from complex datasets such as CIFAR-10.

• An Expectation-Over-Transformation (EOT) defense that prevents the possibility of a stochastic defense breaking due to random variation in prediction instead of an adversarial signal. The EOT attack (Athalye et al., 2018) naturally follows from the EOT defense.

• Experiments showing state-of-the-art defense for naturally-trained classifiers and competitive defense compared to state-of-the-art AT.
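To make the purification-and-averaging procedure of Figure 1 concrete, the following is a minimal NumPy sketch, not the paper's implementation: a toy quadratic energy and a hypothetical linear classifier stand in for the trained ConvNet EBM and classifier network, but the Langevin update x_{k+1} = x_k - (eps^2/2) ∇U(x_k) + eps·Z_k and the H-replicate logit average F_H(x) follow the structure described above.

```python
import numpy as np

def langevin_purify(x, grad_energy, n_steps=50, eps=0.5, rng=None):
    """Run n_steps of Langevin dynamics:
    x_{k+1} = x_k - (eps^2 / 2) * grad U(x_k) + eps * Z_k,  Z_k ~ N(0, I)."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x, dtype=float)
    for _ in range(n_steps):
        x = x - 0.5 * eps**2 * grad_energy(x) + eps * rng.standard_normal(x.shape)
    return x

def stochastic_logits(x, grad_energy, logits_fn, H=8, seed=0, **kwargs):
    """F_H(x): average classifier logits over H independent Langevin replicates.
    As H grows, this Monte Carlo estimate approaches the true logits F(x)."""
    rng = np.random.default_rng(seed)
    replicate_logits = [
        logits_fn(langevin_purify(x, grad_energy, rng=rng, **kwargs))
        for _ in range(H)
    ]
    return np.mean(replicate_logits, axis=0)

# Toy stand-ins (hypothetical, for illustration only): the energy
# U(x) = ||x||^2 / 2 places the data mode at the origin, so sampling
# drifts a perturbed input back toward it; the "classifier" is linear.
grad_energy = lambda x: x
logits_fn = lambda x: np.array([x.sum(), -x.sum()])

x_adv = 10.0 * np.ones(4)  # stand-in for an attacked input far from the mode
logits = stochastic_logits(x_adv, grad_energy, logits_fn, H=16)
```

An EOT attack against this defense would backpropagate through the same replicate average rather than through a single stochastic draw, which is why the attack follows naturally from the defense.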

