PROVABLY ROBUST CLASSIFICATION OF ADVERSARIAL EXAMPLES WITH DETECTION

Abstract

Adversarial attacks against deep networks can be defended against either by building robust classifiers or by creating classifiers that can detect the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not been borne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, thus necessitating verifiable performance for detection mechanisms. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detection" class into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify or detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under ℓ∞ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Specifically, tests on the MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than 63.63% and 67.92%, for 55.6% and 66.37% natural error, for ε = 8/255 and 16/255 on the CIFAR-10 dataset, respectively.

1. INTRODUCTION

Despite the popularity and success of deep neural networks in many applications, their performance declines sharply in adversarial settings. Small adversarial perturbations have been shown to greatly deteriorate the performance of neural network classifiers, which creates a growing concern for utilizing them in safety-critical applications where robust performance is key. In adversarial training, different methods with varying levels of computational complexity aim at robustifying the network by finding such adversarial examples at each training step and adding them to the training dataset. While such methods exhibit empirical robustness, they lack verifiable guarantees, as it cannot be proven that a more rigorous adversary, e.g., one that computes adversarial perturbations by brute-force enumeration, will not be able to cause the classifier to misclassify. It is thus desirable to provably verify the performance of robust classifiers without restricting the adversary to inexact solvers, while constraining perturbations to an admissible set, e.g., an ℓ∞ norm-bounded ball. Progress has been made by 'complete methods' that use Satisfiability Modulo Theory (SMT) or Mixed-Integer Programming (MIP) to provide exact robustness bounds; however, such approaches are expensive and difficult to scale to large networks, as exhaustive enumeration is required in the worst case (Tjeng et al., 2017; Ehlers, 2017; Xiao et al., 2018). 'Incomplete methods', on the other hand, proceed by computing a differentiable upper bound on the worst-case adversarial loss, and similarly for the verification violations, with lower computational complexity and improved scalability. Such upper bounds, if easy to compute, can be utilized during training, and yield provably robust networks with tight bounds.
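To make the inner maximization of adversarial training concrete, the sketch below generates an ℓ∞-bounded adversarial example with the one-step fast gradient sign method (FGSM) on a toy linear classifier. The model, weights, and step size are illustrative assumptions only, not the method of this paper.

```python
import numpy as np

def fgsm_linear(x, y, W, b, eps):
    """One-step FGSM attack on a linear classifier with cross-entropy loss.

    Perturbs x by eps * sign(grad_x loss), which maximizes the first-order
    loss increase over the l_inf ball of radius eps.
    """
    logits = W @ x + b
    p = np.exp(logits - logits.max())
    p /= p.sum()                      # softmax probabilities
    grad_logits = p.copy()
    grad_logits[y] -= 1.0             # d(cross-entropy)/d(logits)
    grad_x = W.T @ grad_logits        # chain rule back to the input
    return x + eps * np.sign(grad_x)

# toy usage: a 2-class linear model on a 3-dimensional input (made-up values)
W = np.array([[1.0, -2.0, 0.5], [-1.0, 2.0, -0.5]])
b = np.zeros(2)
x = np.array([0.2, 0.1, -0.3])
x_adv = fgsm_linear(x, y=0, W=W, b=b, eps=0.1)
assert np.all(np.abs(x_adv - x) <= 0.1 + 1e-12)  # perturbation stays in the ball
```

Because the loss of a linear model is convex in the input, this single step is guaranteed to increase the loss; for deep networks, multi-step projected variants (e.g., PGD) are used instead.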
In particular, bound propagation via various methods such as differentiable geometric abstractions (Mirman et al., 2018), convex polytope relaxation (Wong & Kolter, 2018), and more recently in (Salman et al., 2019; Balunovic & Vechev, 2020; Gowal et al., 2018; Zhang et al., 2020), together with other techniques such as semidefinite relaxation (Fazlyab et al., 2019; Raghunathan et al., 2018) and dual solutions via additional verifier networks (Dvijotham et al., 2018), fall within this category. In particular, Interval Bound Propagation (IBP), a simple layer-by-layer bound propagation mechanism, was shown to be very effective by Gowal et al. (2018): despite its light computational complexity, it exhibits state-of-the-art robustness verification. Additionally, combining IBP in a forward bounding pass with a linear relaxation based backward bounding pass (CROWN) (Zhang et al., 2020) leads to improved robustness, although it can be up to 3-10 times slower.

As an alternative to robust classification, detection of adversarial examples can also provide robustness against adversarial attacks, where suspicious inputs are flagged and the classifier "rejects/abstains" from assigning a label. There has been some work on the detection of out-of-distribution examples (Bitterwolf et al., 2020); however, the situation in the literature on the detection of adversarial examples is quite different. Most techniques that attempt to detect adversarial examples, either by training explicit classifiers to do so or by simply formulating "hand-tuned" detectors, still largely look to identify and exploit statistical properties of adversarial examples that appear in practice (Smith & Gal, 2018; Roth et al., 2019). However, to provide a fair evaluation, a defense must be evaluated under attackers that attempt to fool both the classifier and the detector, while addressing particular characteristics of a given defense, e.g., gradient obfuscation, non-differentiability, randomization, and simplifying the attacker's objective for increased efficiency. A non-exhaustive list of recent detection methods entails randomization and sparsity-based defenses (Xiao et al., 2019; Roth et al., 2019; Pang et al., 2019b), confidence and uncertainty-based detection (Smith & Gal, 2018; Stutz et al., 2020; Sheikholeslami et al., 2020), transformation-based defenses (Bafna et al., 2018; Yang et al., 2019), ensemble methods (Verma & Swami, 2019; Pang et al., 2019a), generative adversarial training (Yin et al., 2020), and many more. Unfortunately, existing defenses have largely proven to perform poorly against adaptive attacks (Athalye et al., 2018; Tramer et al., 2020), necessitating provable guarantees on detectors as well.

Recently, Laidlaw & Feizi (2019) proposed joint training of a classifier and detector; however, their method also does not provide any provable guarantees.

Our contribution. In this work, we propose a new method for jointly training a provably robust classifier and detector. Specifically, by introducing an additional "abstain/detection" class into a classifier, we show that existing certified defense mechanisms can be modified so that, building on the detection capability of the network, the classifier can effectively choose to either robustly classify or detect adversarial attacks. We extend the lightweight Interval Bound Propagation (IBP) method to account for our new robust objective, enabling verification of the network for provable performance guarantees. Our proposed robust training objective is also effectively upper bounded, enabling its incorporation into the training procedure and leading to tight, provably robust performance. While additional tightening of the bound propagation may be possible for tighter verification, to the best of our knowledge, our approach is the first to extend certification techniques by considering detection while providing provable verification. By stabilizing the training, as also done in similar IBP-based methods, experiments on MNIST and CIFAR-10 empirically show that the proposed method can successfully leverage its detection capability, and improves on traditional IBP used in isolation, especially for large perturbation sizes.

2. BACKGROUND AND RELATED WORK

Let us consider an L-layer feed-forward neural network, trained for a K-class classification task. Given input x, it passes through a sequential model, with h_l denoting the mapping at layer l, recursively parameterized by

z_l = h_l(z_{l-1}) = σ_l(W_l z_{l-1} + b_l),   l = 1, ..., L,   W_l ∈ R^{n_l × n_{l-1}},  b_l ∈ R^{n_l}   (1)

where σ_l(·) is a monotonic activation function, z_0 = x denotes the input, and z_L ∈ R^K is the pre-activation, unnormalized K-dimensional output vector (n_L = K, with σ_L(·) the identity operator).
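As a concrete illustration of IBP, the following numpy sketch propagates the ℓ∞ ball [x - ε, x + ε] layer by layer through the affine network of Eq. (1), applying the monotonic activation (ReLU here) to both bounds; the layer sizes and random weights are made-up placeholders, not a trained model.

```python
import numpy as np

def ibp_bounds(x, eps, layers):
    """Interval bound propagation through the network of Eq. (1).

    layers: list of (W, b) tuples; ReLU follows every layer except the
    last, whose pre-activations are the logits z_L. Returns elementwise
    lower/upper bounds on the logits over the l_inf ball around x.
    """
    lb, ub = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        center = (lb + ub) / 2.0
        radius = (ub - lb) / 2.0
        new_center = W @ center + b
        new_radius = np.abs(W) @ radius        # worst case over the input box
        lb, ub = new_center - new_radius, new_center + new_radius
        if i < len(layers) - 1:                # monotonic activation: apply to both bounds
            lb, ub = np.maximum(lb, 0.0), np.maximum(ub, 0.0)
    return lb, ub

# toy 2-layer network: 3 inputs -> 4 hidden units -> 3 logits
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((3, 4)), np.zeros(3))]
x, eps = np.array([0.1, -0.2, 0.3]), 0.05
lb, ub = ibp_bounds(x, eps, layers)
assert np.all(lb <= ub)
```

The bounds are sound but generally loose, since each interval ignores correlations between coordinates; this looseness is what tighter relaxations such as CROWN trade extra computation to reduce.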


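To illustrate how per-logit interval bounds could support a "robustly classify or detect" certificate, the sketch below checks, given precomputed lower/upper logit bounds over the perturbation ball (e.g., from IBP) and a designated abstain logit, whether every input in the ball must be assigned either the true class or the abstain class. The decision rule here is a simplified reading for illustration, not the paper's exact training objective.

```python
import numpy as np

def certified_or_abstain(lb, ub, y_true, abstain):
    """Return True if no input in the ball can be labeled outside {y_true, abstain}.

    A 'bad' class j can only win if its upper bound reaches the best
    guaranteed competitor, i.e., the larger of the lower bounds of the
    true class and the abstain class; otherwise argmax is y_true or abstain.
    """
    guaranteed = max(lb[y_true], lb[abstain])
    for j in range(len(lb)):
        if j in (y_true, abstain):
            continue
        if ub[j] >= guaranteed:   # some input in the ball might select class j
            return False
    return True

# example: 3 regular classes plus an abstain logit at index 3 (made-up bounds)
lb = np.array([1.0, -0.5, -1.0, 0.8])
ub = np.array([2.0,  0.7, -0.2, 1.5])
print(certified_or_abstain(lb, ub, y_true=0, abstain=3))  # prints True
```

The check is conservative: it certifies whenever every non-abstain wrong class is provably dominated, which is sufficient but not necessary for robustness, mirroring the incomplete-verification flavor of IBP itself.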