PROVABLY ROBUST CLASSIFICATION OF ADVERSARIAL EXAMPLES WITH DETECTION

Abstract

Adversarial attacks against deep networks can be defended against either by building robust classifiers or by creating classifiers that can detect the presence of adversarial perturbations. Although it may intuitively seem easier to simply detect attacks rather than build a robust classifier, this has not borne out in practice even empirically, as most detection methods have subsequently been broken by adaptive attacks, necessitating verifiable performance for detection mechanisms as well. In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, we show that by introducing an additional "abstain/detect" class into a classifier, we can modify existing certified defense mechanisms to allow the classifier to either robustly classify or detect adversarial attacks. We extend the common interval bound propagation (IBP) method for certified robustness under ℓ∞ perturbations to account for our new robust objective, and show that the method outperforms traditional IBP used in isolation, especially for large perturbation sizes. Specifically, tests on the MNIST and CIFAR-10 datasets exhibit promising results, for example with provable robust error less than 63.63% and 67.92%, at 55.6% and 66.37% natural error, for ε = 8/255 and ε = 16/255 on the CIFAR-10 dataset, respectively.

1. INTRODUCTION

Despite the popularity and success of deep neural networks in many applications, their performance declines sharply in adversarial settings. Small adversarial perturbations have been shown to greatly deteriorate the performance of neural network classifiers, which creates a growing concern for utilizing them in safety-critical applications where robust performance is key. In adversarial training, different methods with varying levels of computational complexity aim at robustifying the network by finding such adversarial examples at each training step and adding them to the training dataset. While such methods exhibit empirical robustness, they lack verifiable guarantees, as it is not provable that a more rigorous adversary, e.g., one that performs brute-force enumeration to compute adversarial perturbations, will not be able to cause the classifier to misclassify. It is thus desirable to provably verify the performance of robust classifiers without restricting the adversary to inexact solvers, while restraining perturbations to an admissible set, e.g., an ℓ∞ norm-bounded ball. Progress has been made by 'complete methods' that use Satisfiability Modulo Theory (SMT) or Mixed-Integer Programming (MIP) to provide exact robustness bounds; however, such approaches are expensive and difficult to scale to large networks, as exhaustive enumeration is required in the worst case (Tjeng et al., 2017; Ehlers, 2017; Xiao et al., 2018). 'Incomplete methods', on the other hand, proceed by computing a differentiable upper bound on the worst-case adversarial loss, and similarly for the verification violations, with lower computational complexity and improved scalability. Such upper bounds, if easy to compute, can be utilized during training, and yield provably robust networks with tight bounds.
In particular, bound propagation via various methods such as differentiable geometric abstractions (Mirman et al., 2018), convex polytope relaxation (Wong & Kolter, 2018), and more recently (Salman et al., 2019; Balunovic & Vechev, 2020; Gowal et al., 2018; Zhang et al., 2020), together with other techniques such as semidefinite relaxation (Fazlyab et al., 2019; Raghunathan et al., 2018) and dual solutions via additional verifier networks (Dvijotham et al., 2018), fall within this category. In particular, the recent use of Interval Bound Propagation (IBP) as a simple layer-by-layer bound propagation mechanism was shown to be very effective by Gowal et al. (2018), exhibiting state-of-the-art robustness verification despite its light computational complexity. Additionally, combining IBP in a forward bounding pass with a linear relaxation based backward bounding pass (CROWN) (Zhang et al., 2020) leads to improved robustness, although it can be 3-10 times slower. As an alternative to robust classification, detection of adversarial examples can also provide robustness against adversarial attacks, where suspicious inputs are flagged and the classifier "rejects/abstains" from assigning a label. There has been some work on detection of out-of-distribution examples (Bitterwolf et al., 2020); however, the situation in the literature on the detection of adversarial examples is quite different. Most techniques that attempt to detect adversarial examples, either by training explicit classifiers to do so or by formulating "hand-tuned" detectors, still largely look to identify and exploit statistical properties of adversarial examples that appear in practice (Smith & Gal, 2018; Roth et al., 2019).
However, to provide a fair evaluation, a defense must be evaluated under attackers that attempt to fool both the classifier and the detector, while addressing the particular characteristics of a given defense, e.g., gradient obfuscation, non-differentiability, randomization, and simplifying the attacker's objective for increased efficiency. A non-exhaustive list of recent detection methods includes randomization- and sparsity-based defenses (Xiao et al., 2019; Roth et al., 2019; Pang et al., 2019b), confidence- and uncertainty-based detection (Smith & Gal, 2018; Stutz et al., 2020; Sheikholeslami et al., 2020), transformation-based defenses (Bafna et al., 2018; Yang et al., 2019), ensemble methods (Verma & Swami, 2019; Pang et al., 2019a), generative adversarial training (Yin et al., 2020), and many more. Unfortunately, existing defenses have largely proven to perform poorly against adaptive attacks (Athalye et al., 2018; Tramer et al., 2020), necessitating provable guarantees on detectors as well. Recently, Laidlaw & Feizi (2019) proposed joint training of a classifier and detector; however, it also does not provide any provable guarantees.
Our contribution. In this work, we propose a new method for jointly training a provably robust classifier and detector. Specifically, by introducing an additional "abstain/detect" class into a classifier, we show that existing certified defense mechanisms can be modified so that, building on the detection capability of the network, the classifier can effectively choose to either robustly classify or detect adversarial attacks. We extend the lightweight Interval Bound Propagation (IBP) method to account for our new robust objective, enabling verification of the network for provable performance guarantees. Our proposed robust training objective is also effectively upper bounded, enabling its incorporation into the training procedure and leading to tight, provably robust performance.
While further tightening of the bound propagation may additionally be possible for tighter verification, to the best of our knowledge our approach is the first method to extend certification techniques by considering detection while providing provable verification. With stabilized training, as also used in similar IBP-based methods, experiments on MNIST and CIFAR-10 empirically show that the proposed method successfully leverages its detection capability and improves upon traditional IBP used in isolation, especially for large perturbation sizes.

2. BACKGROUND AND RELATED WORK

Let us consider an L-layer feed-forward neural network, trained for a K-class classification task. Given input x, it passes through a sequential model, with h_l denoting the mapping at layer l, recursively parameterized by

z_l = h_l(z_{l-1}) = σ_l(W_l z_{l-1} + b_l),  l = 1, ..., L,  W_l ∈ R^{n_{l-1} × n_l}, b_l ∈ R^{n_l},  (1)

where σ_l(·) is a monotonic activation function, z_0 denotes the input, and z_L ∈ R^K is the pre-activation unnormalized K-dimensional output vector (n_L = K and σ_L(·) the identity operator), referred to as the logits. Robust classifiers can be obtained by minimizing the worst-case (adversarial) classification loss, formally trained by the following min-max optimization (Madry et al., 2017)

minimize_θ E_{(x,y)∼D} [ max_{δ∈∆_ε} ℓ(f_θ(x + δ), y) ],  (2)

where θ denotes the network parameters, the vector f_θ(x) = z_L is the logit output for input x, ℓ(·) is the misclassification loss, e.g., ℓ_xent(·) defined as the cross-entropy loss, and ∆_ε denotes the set of permissible perturbations, e.g., the ℓ∞ norm-bounded ball

∆_ε := {δ : ||δ||_∞ ≤ ε}.  (3)

2.1. VERIFICATION OF NEURAL NETWORKS

Verifying robustness of a classifier at an input pair (x, y) amounts to solving, for i = 1, ..., K, i ≠ y,

p*_i = min_{z_L ∈ Z_L} c_{y,i}^T z_L, where Z_L := {z_L | z_l = h_l(z_{l-1}), l = 1, ..., L, z_0 = x + δ, δ ∈ ∆_ε},  (4)

where c_{y,i} = e_y − e_i for i = 1, 2, ..., K, i ≠ y, and e_i is the i-th standard canonical basis vector. If p*_i > 0 ∀i ≠ y, then the classifier is verifiably robust at the point (x, y), as this guarantees that z_i ≤ z_y ∀i ≠ y for all admissible perturbations δ ∈ ∆_ε. The feasible set Z_L is generally nonconvex, rendering p*_i intractable to obtain. Any convex relaxation of Z_L, however, provides a lower bound on p*_i and can alternatively be used for verification. As outlined in Section 1, various relaxation techniques have been proposed in the literature. Specifically, IBP (Mirman et al., 2018; Gowal et al., 2018) proceeds by bounding the activation z_l of each layer by propagating an element-wise bounding box using interval arithmetic, for networks with monotonic activation functions.
Despite its simplicity and relatively small computational complexity (bound propagation for a given input using IBP costs the equivalent of two forward passes), it can provide tight bounds once the network is trained accordingly. Specifically, starting from the input layer, z_0 can be bounded for the perturbation class δ ∈ ∆_ε as z̲_0 = x − ε1 and z̄_0 = x + ε1, and z_l for the following layers can be bounded as

z̲_l = σ_l( W_l (z̲_{l-1} + z̄_{l-1})/2 − |W_l| (z̄_{l-1} − z̲_{l-1})/2 ),
z̄_l = σ_l( W_l (z̲_{l-1} + z̄_{l-1})/2 + |W_l| (z̄_{l-1} − z̲_{l-1})/2 ),

where | · | is the element-wise absolute-value operator. The verification problem over the relaxed feasible set Ẑ_L := {z_L | z̲_{L,i} ≤ z_{L,i} ≤ z̄_{L,i}}, where Z_L ⊆ Ẑ_L, is then easily solved as

p*_i = min_{z_L ∈ Z_L} c_{y,i}^T z_L ≥ min_{z_L ∈ Ẑ_L} c_{y,i}^T z_L = z̲_{L,y} − z̄_{L,i}.
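The IBP recursion above can be sketched in a few lines. The following is a minimal NumPy illustration (function and variable names are ours, not from the paper's code); it propagates a center/radius interval through affine layers followed by a monotonic activation:

```python
import numpy as np

def ibp_forward(x, eps, layers, act=lambda t: np.maximum(t, 0.0)):
    """Propagate an element-wise interval [lb, ub] through affine layers with a
    monotonic activation, as in IBP. `layers` is a list of (W, b) pairs; the
    last layer is treated as linear (identity activation on the logits)."""
    lb, ub = x - eps, x + eps
    for i, (W, b) in enumerate(layers):
        mu, r = (lb + ub) / 2.0, (ub - lb) / 2.0   # interval center and radius
        center = W @ mu + b
        radius = np.abs(W) @ r                     # |W| r bounds the spread
        lb, ub = center - radius, center + radius
        if i < len(layers) - 1:                    # monotonic activation
            lb, ub = act(lb), act(ub)
    return lb, ub
```

For any admissible δ with ||δ||∞ ≤ eps, the true logits are guaranteed to lie inside the returned box, which is exactly what the verification margin z̲_{L,y} − z̄_{L,i} exploits.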

2.2. ROBUST TRAINING OF VERIFIABLE NETWORKS

It has been shown that convex relaxation of Z_L can also provide a tractable upper bound on the inner maximization in Eq. 2. While this holds for various relaxation techniques, focusing on IBP let us define

J^IBP_{ε,θ}(x, y) := [J^IBP_1, J^IBP_2, ..., J^IBP_K], where J^IBP_i := min_{z_L ∈ Ẑ_L} c_{y,i}^T z_L,  (5)

with (θ, ε) implicitly influencing Ẑ_L (dropped for brevity), and upper bound the inner maximization in Eq. 2 as

max_{δ∈∆_ε} ℓ_xent(f_θ(x + δ), y) ≤ ℓ_xent(−J^IBP_{ε,θ}(x, y), y), where ℓ_xent(z, c) := −log( exp(z_c) / Σ_i exp(z_i) ).  (6)

Using this tractable upper bound on the robust optimization, the network can now be trained by

minimize_θ Σ_{(x,y)∈D} (1 − κ) ℓ_xent(−J^IBP_{ε,θ}(x, y), y) + κ γ ℓ_xent(f_θ(x), y),  (7)

where γ trades natural versus robust accuracy, and κ is scheduled through a ramp-down process to stabilize the training and tightening of IBP (Gowal et al., 2018) (where γ = 1 is selected therein).
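The worst-case cross-entropy bound of Eqs. 5-6 can be computed directly from the logit bounds. This is a hedged NumPy sketch (the function name is ours); `lb_logits` and `ub_logits` are assumed to come from an IBP forward pass:

```python
import numpy as np

def ibp_robust_xent(lb_logits, ub_logits, y):
    """Upper bound on the worst-case cross-entropy: J_i = lb[y] - ub[i] for
    i != y (and J_y = 0, since the correct class competes with itself); the
    bound is the cross-entropy of the 'worst-case logits' -J at class y."""
    J = lb_logits[y] - ub_logits      # J_i = z_lb[y] - z_ub[i]
    J[y] = 0.0
    worst = -J                        # worst-case logit vector fed to softmax
    worst -= worst.max()              # numerical stabilization
    return -np.log(np.exp(worst[y]) / np.exp(worst).sum())
```

Since the worst-case logits lower the true-class score and raise every other score relative to the clean logits, this bound is never smaller than the natural cross-entropy.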

3. VERIFIABLE CLASSIFICATION WITH DETECTION

In this paper, we propose a new method for jointly training a provably robust classifier and detector. Specifically, let us augment the classifier by introducing an additional "abstain/detect" class. This can be readily done by extending the K-class classification task to a (K + 1)-class classification, with the (K + 1)-th class dedicated to the detection task, and the maximum-scoring class finally chosen as the classification output. The classifier is then trained such that adversarial examples, or ideally any other example that the network would misclassify, are classified into this abstain class, denoted by a, thus preventing incorrect classification. Formally, the classifier can be denoted as in Eq. 1, with the only difference that the final output is (K + 1)-dimensional, i.e., z_L ∈ R^{K+1}; this is achieved simply by substituting the last fully-connected weight matrix W_L of dimension n_{L-1} × K with one of dimension n_{L-1} × (K + 1), and similarly for b_L.

3.1. VERIFICATION PROBLEM FOR CLASSIFICATION WITH ABSTAIN/DETECTION

It is desirable to provably verify the performance of the joint classification/detection. In contrast to existing robust classifiers, however, on a perturbed image x + δ the classification/detection task is considered successful if the input is classified either as the correct class y or as the abstain class a, as both cases prevent misclassification of the adversarially perturbed input as a wrong class. On clean natural images, however, classification/detection is considered successful only if the input is classified as the correct class y, and abstaining is counted as misclassification. In order to certify performance in adversarial settings, it now suffices to verify that the network satisfies the following for a given input pair (x, y), all δ ∈ ∆_ε, and i = 1, ..., K, i ≠ y:

max{c_{y,i}^T z, c_{a,i}^T z} ≥ 0 ∀z ∈ Z_L := {z_L | z_l = h_l(z_{l-1}), l = 1, ..., L, z_0 = x + δ, δ ∈ ∆_ε},  (8)

where c_{y,i} := e_y − e_i and c_{a,i} := e_a − e_i, a denotes the "abstain" class, and the dependence of Z_L on (x, y, ε, θ) is omitted for brevity. Verification can be done effectively by seeking a counterexample

π*_i := min_{z∈Z_L} max{c_{y,i}^T z, c_{a,i}^T z}.  (9)

If π*_i ≥ 0 ∀i ≠ y, the specification is satisfied and the performance is verified. Similar to previous verification methods, to overcome the non-convexity of the optimization in Eq. 9, one can lower bound the problem by expanding the feasible set to a convex Ẑ_L ⊇ Z_L, as stated in Theorem 1 and proved in Appendix A.1.

Theorem 1: For any convex Ẑ_L s.t. Z_L ⊆ Ẑ_L, Eq. 9 can be bounded by the convex relaxation

max_{0≤η≤1} min_{z∈Ẑ_L} (η c_{a,i} + (1 − η) c_{y,i})^T z ≤ min_{z∈Z_L} max{c_{y,i}^T z, c_{a,i}^T z}.  (10)

Although Theorem 1 holds for any convex relaxation of Z_L, for the IBP relaxation of Gowal et al. (2018) it can be further simplified by substituting z = W_L z_{L-1} + b_L, thus not propagating the intervals through the last layer for tighter bounding, and solved analytically as follows. Theorem 2: The optimization in Eq.
9 can be lower-bounded by the convex optimization

Ĵ_i(x, y) = max_{0≤η≤1} min_{z_{L-1}∈Ẑ_{L-1}} (ω_1 + η ω_2)^T z_{L-1} + η ω_3 + ω_4 ≤ min_{z∈Z_L} max{c_{y,i}^T z, c_{a,i}^T z},  (11)

in which ω_1 := W_L^T c_{y,i}, ω_2 := W_L^T (c_{a,i} − c_{y,i}), ω_3 := b_L^T (c_{a,i} − c_{y,i}), ω_4 := b_L^T c_{y,i}, and the convex set Ẑ_{L-1} is a convex relaxation of Z_{L-1} on the hidden values at layer L − 1. Furthermore, Ĵ_i(x, y) can be computed analytically via Algorithm 1.

Algorithm 1: Analytic solution of Eq. 11
1: Input: bounds z̲_{L-1}, z̄_{L-1}, and vectors c_{y,i}, c_{a,i}
2: ω_1 = W_L^T c_{y,i}, ω_2 = W_L^T (c_{a,i} − c_{y,i}), ω_3 := b_L^T (c_{a,i} − c_{y,i}), ω_4 := b_L^T c_{y,i}
3: ζ = [ζ_1, ..., ζ_{n_{L-1}}] := −ω_1 / ω_2 (element-wise), and the vector of indices s that sorts ζ, i.e., ζ_{s_1} ≤ ... ≤ ζ_{s_{n_{L-1}}}
4: u̲_1 = Π_s(ω_1 ∘ z̲_{L-1}), ū_1 = Π_s(ω_1 ∘ z̄_{L-1}), u̲_2 := Π_s(ω_2 ∘ z̲_{L-1}), ū_2 := Π_s(ω_2 ∘ z̄_{L-1}), where the operators ∘ and Π_s(·) denote element-wise multiplication and permutation according to the indices s, respectively
5: m = min_{ζ_{s_j} ≥ 0} j and M = max_{ζ_{s_j} ≤ 1} j, for j = 1, ..., n_{L-1}
6: for η = 0, ζ_{s_m}, ζ_{s_{m+1}}, ..., ζ_{s_{M−1}}, ζ_{s_M}, 1 do
7:   compute g(η) = Σ_{j=1}^{n_{L-1}} [ 1{ω_{1,j} + η ω_{2,j} ≤ 0}(ū_{1,j} + η ū_{2,j}) + 1{ω_{1,j} + η ω_{2,j} ≥ 0}(u̲_{1,j} + η u̲_{2,j}) ] + η ω_3 + ω_4
8: return max g(η) over the computed values.
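Algorithm 1 can be sketched as below. This is our own simplified implementation: instead of the sorting and cumulative-sum bookkeeping, it evaluates g(η) directly at every breakpoint ζ_j in [0, 1] plus the endpoints, which finds the same maximum (g is concave piecewise-linear in η) at a higher O(n²) cost:

```python
import numpy as np

def g(eta, w1, w2, w3, w4, lb, ub):
    """Objective of Eq. 11 at fixed eta: min over the box [lb, ub] of
    (w1 + eta*w2)^T z, plus the affine terms eta*w3 + w4."""
    c = w1 + eta * w2
    # each coordinate attains the min at lb when c_j >= 0, at ub when c_j < 0
    return np.where(c >= 0, c * lb, c * ub).sum() + eta * w3 + w4

def max_over_eta(w1, w2, w3, w4, lb, ub):
    """Sketch of Algorithm 1: the maximum of the concave piecewise-linear g
    lies at a breakpoint zeta_j = -w1_j / w2_j or at an endpoint of [0, 1]."""
    with np.errstate(divide="ignore", invalid="ignore"):
        zeta = -w1 / w2
    cands = [0.0, 1.0] + [z for z in zeta if np.isfinite(z) and 0.0 < z < 1.0]
    vals = [g(e, w1, w2, w3, w4, lb, ub) for e in cands]
    i = int(np.argmax(vals))
    return cands[i], vals[i]
```

If the returned value is non-negative for every i ∉ {y, a}, the specification of Eq. 8 is certified at (x, y).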

4. TRAINING A VERIFIABLE ROBUST CLASSIFICATION WITH DETECTION

In order to train a robust classifier with detection, let us start by formalizing the objective of an adversarial attacker. Naturally, an adaptive attacker's objective is to craft a perturbation δ that simultaneously evades detection and causes misclassification. Formally, this can be tackled by seeking δ such that the loss corresponding to the winner of the two classes y and a (the higher logit leading to the smaller cross-entropy loss) is maximized, i.e.,

max_{δ∈∆_ε} min{ ℓ_xent(f_θ(x + δ), y), ℓ_xent(f_θ(x + δ), a) }.  (12)

This small alteration to the cost in Eq. 13 does not change the minimization "winner" between the true class y and the rejection class a in Eqs. 12 and 13, since

z_a ≤ z_y ⇒ ℓ_xent(f_θ(x + δ), y) ≤ ℓ_xent(f_θ(x + δ), a) and ℓ_xent\a(f_θ(x + δ), y) ≤ ℓ_xent\y(f_θ(x + δ), a),
z_y ≤ z_a ⇒ ℓ_xent(f_θ(x + δ), a) ≤ ℓ_xent(f_θ(x + δ), y) and ℓ_xent\y(f_θ(x + δ), a) ≤ ℓ_xent\a(f_θ(x + δ), y),

yet it favorably influences the training process. That is so since, for δ such that, for instance, z_a < z_y, minimizing L^abstain_robust(x, y; θ) during training reduces to minimizing ℓ_xent\a(f_θ(x + δ), y), which in turn further increases z_y while decreasing the logit value z_a; and similarly, increases z_a while decreasing z_y if z_y < z_a. Intuitively, however, the true objective of the classifier augmented with detection on adversarial examples is to increase both z_y and z_a while reducing z_j ∀j ≠ a, y, thus preventing any gap between the boundaries of classes a and y, which could potentially lead to successful adaptive attacks. Hence, minimizing Eq. 12 would be in contrast with the true underlying objective, and Eq. 13 simply prevents the raised issue.
Upon defining L_natural(x, y; θ) := ℓ_xent(f_θ(x), y) and L_robust(x, y; θ) := max_{δ∈∆_ε} ℓ_xent(f_θ(x + δ), y), we then define the overall training loss as

L = L_robust(x, y; θ) + λ_1 L^abstain_robust(x, y; θ) + λ_2 L_natural(x, y; θ),  (14)

where L_natural(x, y; θ) captures the misclassification loss on natural (clean) examples, L_robust(x, y; θ) denotes that of adversarial examples without considering the rejection class, i.e., similar to that of Gowal et al. (2018), and the parameters (λ_1, λ_2) trade off clean and adversarial accuracy. To train a robust classifier, we proceed by minimizing the overall loss in Eq. 14, by first upper-bounding L_robust(x, y; θ) and L^abstain_robust(x, y; θ).

4.1. UPPERBOUNDING THE TRAINING LOSS

Using Theorem 2, and restricting 0 < η̲ ≤ η ≤ η̄ < 1, let us now define J^{η̲,η̄}_i(x, y), where trivially

J^{η̲,η̄}_i(x, y) := max_{η̲≤η≤η̄} min_{z_{L-1}∈Ẑ_{L-1}} (ω_1 + η ω_2)^T z_{L-1} + η ω_3 + ω_4 ≤ Ĵ_i(x, y),  (15)

which can also be solved analytically similarly to Theorem 2. By generalizing the findings in Wong & Kolter (2018); Mirman et al. (2018), we can upper bound the robust optimization problem using our dual problem in Eq. 15, according to the following theorem, which we prove in Appendix A.4.

Theorem 3: For any data point (x, y), any ε > 0, and any 0 ≤ η̲ ≤ η̄ ≤ 1, the adversarial loss L^abstain_robust(x, y; θ) in Eq. 13 can be upper bounded as

L^abstain_robust(x, y; θ) ≤ L̂^abstain_robust(x, y; θ) := ℓ_xent\a(−J_{ε,θ}(x, y), y) = ℓ_xent\y(−J_{ε,θ}(x, y), a),  (16)

where J_{ε,θ}(x, y) is a (K + 1)-dimensional vector, valued at index i as [J_{ε,θ}(x, y)]_i = J^{η̲,η̄}_i(x, y). Note that the maximization over η for obtaining J^{η̲,η̄}_i(x, y) can be done either by bisection (concave maximization) or by following Alg. 1 with m = min_{ζ_{s_ν} ≥ η̲} ν and M = max_{ζ_{s_ν} ≤ η̄} ν.

Remark 1. Setting η̲ = η̄ = 0 forces η = 0, which reduces J^{η̲,η̄}_i(x, y) in Eq. 15 to that in Eq. 5, i.e., J^IBP_i(x, y) = J^{η̲,η̄}_i(x, y)|_{η̲=η̄=0}, also bounding the loss term L_robust(x, y; θ) as

L_robust(x, y; θ) ≤ L̂_robust(x, y; θ) := ℓ_xent(−J^IBP_{ε,θ}(x, y), y).  (17)

Remark 2. While setting η̲ = 0 and η̄ = 1 gives tighter bounds (and is thus used for the verification counterpart in Theorem 2), strictly setting 0 < η̲ ≤ η̄ < 1 empirically yields better generalization of the network. This can be intuitively understood by rewriting ω_1 + η ω_2 = W_L^T (η c_{a,i} + (1 − η) c_{y,i}), which is a convex combination of the verification constraints for the correct and the abstain class. Thus η ≠ 0, 1 leads to minimizing a combination of both terms, preventing gaps between the two classes.
Also, higher values of η increase the influence of the term corresponding to the abstain class, and vice versa; tuning these parameters can promote or discourage abstaining depending on how desirable such an outcome is. Utilizing the upper bounds in Eq. 16 and Eq. 17, we can proceed to train the network by minimizing the tractable upper bound on the overall loss

min_θ L ≤ min_θ ℓ_xent(−J^IBP_{ε,θ}(x, y), y) + λ_1 ℓ_xent\y(−J_{ε,θ}(x, y), a) + λ_2 ℓ_xent(f_θ(x), y).  (18)

Note that setting λ_1 = 0 and γ = λ_2, together with the incorporation of a ramp-down process via the parameter κ as detailed in Section 5, reduces the training in Eq. 18 to that of Gowal et al. (2018) without detection.

Complexity. Given IBP bounds on z_{L-1}, the solution to Eq. 16 is available analytically (after a sort whose cost is negligible compared with a forward pass), so computing Eq. 18 imposes the same computational complexity as IBP, which is twice the normal training procedure, as it requires propagating the upper and lower bounds via a forward pass.

5. EXPERIMENTS

Empirical performance of the proposed robust classification with detection on the MNIST and CIFAR-10 datasets is reported in this section, and is compared with state-of-the-art alternatives. The training procedure is stabilized as detailed next.

5.1. STABILIZING THE TRAINING PROCEDURE

We incorporate the following mechanisms to stabilize the training procedure in our tests; the first two have been previously used in (Gowal et al., 2018) and (Zhang et al., 2020) as well.

Ramp-down of κ: To stabilize the trade-off between nominal and verified accuracy, let us introduce the parameter κ into the overall loss, trading the natural and robust losses as

L = (1 − κ) [ L̂_robust(x, y; θ) + λ_1 L̂^abstain_robust(x, y; θ) ]  (robust loss)  + κ λ_2 L_natural(x, y; θ)  (natural loss).  (19)

Setting κ = 0.5 renders the optimization identical to that in Eq. 18. During training, however, we incorporate a ramp-down procedure where κ starts at κ_start = 1, thus initially training the model to fit the natural data, and is slowly decreased to κ_end = 0.5, similar to Gowal et al. (2018).

Ramp-up of ε: It is very important during training to start at ε = 0 and gradually increase it to ε_train; setting ε_train larger than ε_test can also improve generalization.

Ramp-down of η̲ and η̄: Setting 0 < η̲ and η̄ < 1 helps with better generalization. Furthermore, setting large η̲ and η̄ promotes the abstain class in the loss term L̂^abstain_robust by increasing the weight of ω_2 in Eq. 15. Thus, we can further stabilize the training process through a ramp-down procedure where these parameters start at η̲ = η̲_start and η̄ = η̄_start, and are gradually reduced to η̲ = η̲_end and η̄ = η̄_end, with η̲_end < η̲_start and η̄_end < η̄_start. Furthermore, although the term L̂_robust(x, y; θ) could in theory be excluded from the training process, as the term L_natural(x, y; θ) prevents the degenerate solution of always classifying all images into the abstain class, its inclusion empirically helps the stability of the training process.
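The three schedules above share the same shape: hold the start value, interpolate, then hold the end value. A minimal sketch, assuming a linear ramp as in Gowal et al. (2018) (the function name and step parameters are ours; the paper does not commit to a specific interpolation):

```python
def ramp(step, start_val, end_val, ramp_start, ramp_end):
    """Linear schedule usable for the kappa ramp-down, the epsilon ramp-up,
    and the eta ramp-down: hold start_val until ramp_start, interpolate
    linearly, then hold end_val from ramp_end onward."""
    if step <= ramp_start:
        return start_val
    if step >= ramp_end:
        return end_val
    t = (step - ramp_start) / (ramp_end - ramp_start)
    return start_val + t * (end_val - start_val)
```

For example, `ramp(step, 1.0, 0.5, ...)` realizes the κ schedule, while `ramp(step, 0.0, eps_train, ...)` realizes the ε ramp-up.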

5.2. EMPIRICAL RESULTS ON MNIST AND CIFAR10

The classification networks are identical to the large network in Gowal et al. (2018), also detailed in Table 2, and are trained by minimizing the loss in Eq. 18 with the above stabilizing schemes. The selection of parameters for each dataset is detailed in Appendix B. Since most recent detector networks have shown very low performance against adaptive attacks and lack provable performance (Tramer et al., 2020), we only compare against other provably robust classification methods, while focusing on the different decomposition of the reported natural and robust accuracy between the two. As the numbers in Table 1 suggest, the proposed detection/classification network shows improved robustness over the other methods, including IBP in isolation (without the detection capability), especially against larger perturbations on the CIFAR-10 dataset, which is intuitively pleasing: as larger perturbations are naturally more distinguishable, the detection capability of the network is successfully leveraged to improve adversarial robustness. Let us now take a closer look at the performance by focusing on the detection capability.

Effectiveness of the detection class. By nature, the proposed classifier "adaptively chooses" between (robust) correct classification and detection of adversarial or difficult inputs during training. This gives rise to two phenomena: (1) In verifiably robust methods, natural image accuracy declines as robustness improves. In the proposed approach, however, a considerable number of misclassified natural inputs are in fact abstained on, which in certain applications is more favorable than assigning them to a wrong class, as classifiers without detection capability would: compare 30.5% abstain and 25.6% wrong-class misclassification (other than abstain and the correct class) in IBP-with-detection with 53.7% wrong-class misclassification in IBP, on natural CIFAR-10 images in networks trained for ε = 8/255.
(2) Regardless of the training procedure, the proposed classifier with detection can still be verified using the verification in Eq. 4 to obtain its guaranteed robustness considering only the correct class. Comparing this verification percentage with that of Eq. 11 thus highlights the effectiveness of the abstain class in detecting perturbed images and increasing robustness: for instance, using our method a 76.07% maximum robust error successfully decreases to 63.63% by considering the detection capability, on CIFAR-10 trained for ε = 8/255, compared to 69.92% for IBP without detection. It is important to note that, unlike robust classification, the proposed joint classification/detection successfully leverages the detection capability to decrease the verified error rate by rejecting some adversarial examples, which makes direct comparison of these values difficult. However, since there exists no other verifiable detection scheme, such a comparison is made here to show the effect of successful detection. See Fig. 1 for a decomposition of the performance metrics of the proposed network on the CIFAR-10 dataset, demonstrating the effectiveness of the abstain class in detecting "difficult" natural images while also increasing the robustness certificate on adversarial inputs.

5.3. NATURAL VERSUS ADVERSARIAL ERROR TRADEOFF

Reporting a single set point on the Pareto frontier, as in Table 1, gives limited understanding of how different methods trade off natural versus robust error. To address this, a more detailed study of this trade-off in IBP-based robust classification with and without detection is discussed here. In order to get the best performance for IBP-based robust training without detection (that is, λ_1 = 0), and since it is not known whether varying κ_end or λ_2 will lead to better performance, we vary both. Results are plotted in Figs. 2 and 3 (presented in the Appendix due to space limitations). As shown, the classifier enhanced with detection capability is better able to trade natural and robust accuracy, attaining higher robustness by trading a small decrease in natural accuracy. This, together with the fact that the decrease in natural accuracy is also partly handled by abstaining on natural images that would otherwise have been misclassified (as one of the original K classes), demonstrates the effective utilization of the detection capability in the proposed method.

Here, the operator Π_s(·) denotes the permutation of its arguments according to s, such that ζ̂_i = ζ_{s_i} ∀i and ζ̂ is sorted in increasing order. We can also rewrite the problem by summing over the indices in the sorting set s instead, as

max_{0≤η≤1} Σ_{j=1}^{n_{L-1}} 1{ω_{1,s_j} + η ω_{2,s_j} ≤ 0} [ω_1 ∘ z̄_{L-1} + η ω_2 ∘ z̄_{L-1}]_{s_j} + 1{ω_{1,s_j} + η ω_{2,s_j} ≥ 0} [ω_1 ∘ z̲_{L-1} + η ω_2 ∘ z̲_{L-1}]_{s_j} + η ω_3 + ω_4.

Since each of these subproblems is maximized at the boundary of its feasible set, the overall maximization essentially reduces to evaluating the following objective at the (M − m + 3) points η = 0, ζ̂_m, ζ̂_{m+1}, ..., ζ̂_{M−1}, ζ̂_M, 1:

g(η) = Σ_{j=1}^{n_{L-1}} 1{ω_{1,j} + η ω_{2,j} ≤ 0}(ū_{1,j} + η ū_{2,j}) + 1{ω_{1,j} + η ω_{2,j} ≥ 0}(u̲_{1,j} + η u̲_{2,j}) + η ω_3 + ω_4.

Values of g(η) can be efficiently computed by forward and backward cumulative sums of u̲_1, ū_1, u̲_2, and ū_2, so the overall complexity is dominated by the sorting, at O(n_{L-1} log(n_{L-1})) in an efficient implementation.

A.3 DESCRIPTION OF ALGORITHM 1

Here is a step-by-step walk-through of Algorithm 1, with insight into how each step is performed.

This attack is indeed an adaptive attack, as it aims at circumventing detection while trying to cause misclassification (Tramer et al., 2020). Perturbations are sought by maximizing this objective using PGD with 200 steps for MNIST and 500 steps for CIFAR-10 (Madry et al., 2017), with 10 random restarts. It is interesting to note that the attack success rates achieved in Table 1 are well below the verified robust error, further implying the effectiveness of incorporating the detection mechanism: the true robustness of the system against practical adaptive PGD attacks is considerably stronger in comparison to robust classification without detection.
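The adaptive attacker's objective from Eq. 12, i.e., the minimum of the two cross-entropy losses that PGD maximizes, can be sketched as follows (function names are ours; the PGD loop itself is omitted):

```python
import numpy as np

def xent(logits, c):
    """Standard cross-entropy of a logit vector at class c."""
    z = logits - logits.max()
    return -np.log(np.exp(z[c]) / np.exp(z).sum())

def adaptive_attack_objective(logits, y, abstain):
    """Adaptive attacker's objective (Eq. 12): the attack succeeds only if it
    simultaneously evades the true class y and the abstain class, so it
    maximizes the smaller of the two cross-entropy losses."""
    return min(xent(logits, y), xent(logits, abstain))
```

Pushing the probability mass toward a wrong class (neither y nor abstain) raises both losses at once, and hence the objective; pushing it toward either y or the abstain class keeps the objective small.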



Code is available at https://github.com/boschresearch/robust_classification_with_detection



where ℓ_xent(z, c) denotes the cross-entropy loss for class c = y and c = a, and I = {1, 2, ..., K, a} denotes the class index set with K + 1 elements. Let us now define

L^abstain_robust(x, y; θ) := max_{δ∈∆_ε} min{ ℓ_xent\a(f_θ(x + δ), y), ℓ_xent\y(f_θ(x + δ), a) },  (13)

whose objective is closely related to that of the adversarial objective in Eq. 12, with a small difference: the loss terms ℓ_xent\a and ℓ_xent\y are defined as

ℓ_xent\a(z, y) := −log( exp(z_y) / Σ_{i∈I\{a}} exp(z_i) ), and ℓ_xent\y(z, a) := −log( exp(z_a) / Σ_{i∈I\{y}} exp(z_i) ).
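These restricted losses simply drop one class from the partition sum of the softmax. A minimal sketch (our naming; `excluded` must differ from `target`):

```python
import numpy as np

def xent_excl(logits, target, excluded):
    """Cross-entropy restricted to the index set I \\ {excluded}, mirroring
    xent\\a and xent\\y of Eq. 13: the excluded logit is dropped from the
    partition sum before the negative log-likelihood is taken."""
    keep = [i for i in range(logits.size) if i != excluded]
    z = logits[keep] - logits[keep].max()   # shift for numerical stability
    return -np.log(np.exp(z[keep.index(target)]) / np.exp(z).sum())
```

By construction the loss is invariant to the excluded logit and to adding a constant to all logits, which is exactly the translational invariance used in the proof of Theorem 3.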

Figure 1: Decomposition of accuracy and verified accuracy on the CIFAR-10 dataset: the detection capability of the network can increase robustness by adaptively abstaining on adversarial inputs, while also abstaining on some natural images rather than misclassifying them.

Now let us define u̲_1 := Π_s(ω_1 ∘ z̲_{L-1}), ū_1 := Π_s(ω_1 ∘ z̄_{L-1}), u̲_2 := Π_s(ω_2 ∘ z̲_{L-1}), ū_2 := Π_s(ω_2 ∘ z̄_{L-1}); we get

max_{0≤η≤1} Σ_{j=1}^{n_{L-1}} 1{η ≤ ζ̂_j and ω_{2,s_j} > 0} or {η ≥ ζ̂_j and ω_{2,s_j} < 0} (ū_{1,j} + η ū_{2,j}) + 1{η ≥ ζ̂_j and ω_{2,s_j} > 0} or {η ≤ ζ̂_j and ω_{2,s_j} < 0} (u̲_{1,j} + η u̲_{2,j}) + η ω_3 + ω_4.  (28)

In order to break the maximization objective into piecewise-linear programming subproblems, let us first identify the (indices of the) ζ_{s_i} values that fall in the feasible set 0 ≤ η ≤ 1, via m = min_{ζ_{s_ν} ≥ 0} ν and M = max_{ζ_{s_ν} ≤ 1} ν. The overall maximization can now be reduced to piecewise subproblems over the sets ζ̂_ν ≤ η ≤ ζ̂_{ν+1} for m − 1 ≤ ν ≤ M, as

max_{max{0, ζ̂_ν} ≤ η ≤ min{1, ζ̂_{ν+1}}} Σ_{j=1}^{n_{L-1}} 1{η ≤ ζ̂_j and ω_{2,s_j} > 0} or {η ≥ ζ̂_j and ω_{2,s_j} < 0} (ū_{1,j} + η ū_{2,j}) + 1{η ≥ ζ̂_j and ω_{2,s_j} > 0} or {η ≤ ζ̂_j and ω_{2,s_j} < 0} (u̲_{1,j} + η u̲_{2,j}) + η ω_3 + ω_4.

1. Form the vectors ω_1, ω_2, ω_3, ω_4 from the last-layer values as ω_1 = W_L^T c_{y,i}, ω_2 = W_L^T (c_{a,i} − c_{y,i}), ω_3 := b_L^T (c_{a,i} − c_{y,i}), and ω_4 := b_L^T c_{y,i}.

Figure 3: Natural versus robust error tradeoff for IBP (λ_1 = 0) and IBP-with-detection (λ_1 > 0) on the MNIST dataset for perturbation sizes ε = 0.3 and ε = 0.4. Points closer to the origin are better. IBP-with-detection effectively utilizes its detection capability to adaptively trade natural and robust performance, leading to improved certified robustness against adversarial perturbations.

Table 1: The verified, standard (clean), and PGD attack errors for models trained on MNIST and CIFAR-10. IBP with detection is to be compared with IBP (without detection capability) to emphasize the successful utilization of the detection capability of the network in increasing its verifiable as well as empirical performance. For a more detailed decomposition of the standard and robust error terms see Fig. 1. + As reported in Zhang et al. (2020), achieving the 68.44% IBP verified error requires an extra PGD adversarial training loss, without which the verified error is 72.91% (LP/MIP verified) or 73.52% (IBP verified); thus our result should be compared to 73.52%. * Best reported numbers for IBP are computed using mixed-integer programming (MIP), which are strictly smaller than IBP verified error rates; see Tables 3 and 4 in Gowal et al. (2018). For fair comparison, we report IBP verified error rates from Table 3 therein. ** Best reported results from the literature may use different network architectures, and empirical PGD error rates may have been computed under different settings, e.g., number of steps and restarts. *** Numbers in the IBP rows of this table are the best between (Zhang et al., 2020) and our experiments, while results from (Gowal et al., 2018) are reported as the best literature record for IBP.

It is important to note that IBP with detection allows us to obtain additional regions on this Pareto frontier that traditional robust classifiers without detection cannot reach, and could potentially provide gains on top of what is achievable by various other improvement techniques, such as tighter relaxations and bound propagation methods.

To do this, let us use $s$ to denote the $n_{L-1}$-ary tuple of indices that sorts $\zeta$. That is,
$$
\tilde{\zeta} = [\tilde{\zeta}_1, \dots, \tilde{\zeta}_{n_{L-1}}] := \Pi_s(\zeta) := [\zeta_{s_1}, \dots, \zeta_{s_{n_{L-1}}}] \quad \text{s.t.} \quad \zeta_{s_1} \le \dots \le \zeta_{s_{n_{L-1}}}.
$$
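As a concrete illustration of the sorting operator $\Pi_s$ (a toy example of ours, not from the paper), `numpy.argsort` yields exactly the index tuple $s$:

```python
import numpy as np

# Toy break-point vector zeta; argsort returns the indices s_1, ..., s_n
# such that zeta[s] is ascending, i.e. zeta_tilde = Pi_s(zeta).
zeta = np.array([0.7, -0.2, 1.4, 0.3])
s = np.argsort(zeta)
zeta_tilde = zeta[s]
```

Here `s` is `[1, 3, 0, 2]` and `zeta_tilde` is the sorted vector `[-0.2, 0.3, 0.7, 1.4]`.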

A APPENDIX

A.1 PROOF OF THEOREM 1

Since $Z_L \subseteq \hat{Z}_L$, it trivially holds that
$$
\min_{z \in \hat{Z}_L} \max\{c_{y,i}^\top z,\; c_{a,i}^\top z\} \;\le\; \min_{z \in Z_L} \max\{c_{y,i}^\top z,\; c_{a,i}^\top z\}. \tag{20}
$$
The lower bound is now a convex minimization, which can be rewritten as
$$
\min_{z \in \hat{Z}_L} \max\{c_{y,i}^\top z,\; c_{a,i}^\top z\} \;=\; \min_{\tau,\, z \in \hat{Z}_L} \tau \quad \text{s.t.} \quad c_{y,i}^\top z \le \tau,\; c_{a,i}^\top z \le \tau.
$$
Defining the slack variables $\eta_a \ge 0$ and $\eta_y \ge 0$ for the inequality constraints, the Lagrangian can be written as
$$
L(\tau, z, \eta_a, \eta_y) = \tau + \eta_a\,(c_{a,i}^\top z - \tau) + \eta_y\,(c_{y,i}^\top z - \tau),
$$
and minimizing $L(\tau, z, \eta_a, \eta_y)$ with respect to the primal variable $\tau$ yields $\eta_a + \eta_y = 1$. Defining $\eta := \eta_a = 1 - \eta_y$, and using the fact that the dual maximization always serves as a lower bound on the primal, we get
$$
\max_{0 \le \eta \le 1}\; \min_{z \in \hat{Z}_L} \big(\eta\, c_{a,i} + (1-\eta)\, c_{y,i}\big)^\top z \;\le\; \min_{z \in \hat{Z}_L} \max\{c_{y,i}^\top z,\; c_{a,i}^\top z\}.
$$
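The weak-duality step can be checked numerically. The snippet below (a minimal sanity check of ours, assuming an interval relaxation as the set $\hat{Z}_L$ and random $c_{y,i}$, $c_{a,i}$) verifies that the dual value never exceeds $\max\{c_{y,i}^\top z, c_{a,i}^\top z\}$ at any feasible $z$:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 6
z_lo = rng.normal(size=n)                    # interval bounds standing in for Z_hat_L
z_hi = z_lo + rng.uniform(0.1, 1.0, size=n)
c_y, c_a = rng.normal(size=n), rng.normal(size=n)

def box_min(w):
    """min over z in [z_lo, z_hi] of w·z, attained coordinate-wise."""
    return float(np.sum(np.where(w > 0, w * z_lo, w * z_hi)))

# Dual lower bound: max over eta in [0, 1] of min_z (eta*c_a + (1-eta)*c_y)·z
dual = max(box_min(eta * c_a + (1 - eta) * c_y) for eta in np.linspace(0, 1, 1001))

# Weak duality says the bound is below max{c_y·z, c_a·z} for every feasible z;
# we check the box corners plus random interior points.
zs = [np.array(c) for c in product(*zip(z_lo, z_hi))]
zs += [rng.uniform(z_lo, z_hi) for _ in range(1000)]
primal_estimate = min(max(c_y @ z, c_a @ z) for z in zs)
assert dual <= primal_estimate + 1e-9
```

Any feasible $z$ upper-bounds the primal minimum, so the assertion holds regardless of the sampled points.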

A.2 PROOF OF THEOREM 2

Following the statement of Theorem 1, and by substituting $z$, the objective can be reordered so that the minimization w.r.t. $z_{L-1}$ is solved in closed form (this is under the setting of most networks with positive activations, where the lower bound $\underline{z}_l$ is always non-negative), and can be rewritten with "$\circ$" denoting elementwise multiplication. Thus, due to the concavity of the dual, the optimal $\eta$ can be found by evaluating the objective in between the break points, which are given by the vector $\zeta = [\zeta_1, \dots, \zeta_{n_{L-1}}]$ with its $j$-th element defined as $\zeta_j = -\omega_{1,j}/\omega_{2,j}$. The resulting procedure is:

1. Form the vectors $\omega_1$, $\omega_2$ and the scalars $\omega_3$, $\omega_4$ from the last-layer weights and biases.
2. Compute $\zeta = -\omega_1/\omega_2$ elementwise and get the vector of indices $s$ that sorts it, i.e., $\tilde{\zeta} = \Pi_s(\zeta)$.
3. Form the elementwise products of $(\omega_1, \omega_2)$ with $(\underline{z}_{L-1}, \bar{z}_{L-1})$, and sort them according to the index set $s$.
4. Get the lowest and highest indices $(m, M)$ such that the sorted $\zeta$ values at those indices are in the feasible set, between 0 and 1.
5. Iterate over the feasible values of $\eta = 0, \tilde{\zeta}_m, \dots, \tilde{\zeta}_M, 1$, evaluating $g(\eta)$ at each.
6. Return the maximum value of $g(\eta)$ over the evaluated points.

A.4 PROOF OF THEOREM 3

Let us start by splitting the feasible set into disjoint sets. The loss function $\mathrm{xent}_{\backslash a}$ is the cross-entropy loss defined on the logits with the abstain entry removed and class $y$, and thus, following Wong & Kolter (2018), given its translational invariance it equals the same loss with $z_{L,a}\mathbf{1}$ subtracted from the logits, with $\mathbf{1}$ denoting the $(K+1)$-dimensional vector of all ones. Given the invariance of $\mathrm{xent}_{\backslash a}$ with respect to $z_{L,a}$, it can finally be upper-bounded by taking the upper bound at all indices $i = 1, \dots, K$, $i \ne a, y$, and the lower bound at index $i = y$. Note that for $i = y$ the value $[z_L - z_{L,y}\mathbf{1}]_i = 0$, and a lower bound on the other entries $i = 1, \dots, K$, $i \ne a, y$ can be obtained, where the first equality holds by the definition of $\hat{Z}^y$, the second inequality is due to Theorem 2, and the third inequality is given by Eq. 15. Thus, for $z \in \hat{Z}^y_{L-1}$ the loss term is now upper-bounded by $L^{\mathrm{abstain}}_{\mathrm{robust}}(x, y; \theta) \le \mathrm{xent}_{\backslash a}(-J_{\epsilon,\theta}(x, y), y)$.

Table 2: Network architecture. Similar to the Large network used in (Gowal et al., 2018).

Similarly, it can be shown that for $z \in \hat{Z}^a_{L-1}$ the loss term is upper-bounded by $\mathrm{xent}_{\backslash y}(-J_{\epsilon,\theta}(x, y), a)$. The equality $\mathrm{xent}_{\backslash y}(-J_{\epsilon,\theta}(x, y), a) = \mathrm{xent}_{\backslash a}(-J_{\epsilon,\theta}(x, y), y)$ trivially follows from the fact that $[J_{\epsilon,\theta}(x, y)]_i = 0$ for $i = a, y$. Thus, since $\hat{Z}_{L-1} = \hat{Z}^y_{L-1} \cup \hat{Z}^a_{L-1}$, the proof is complete.

B APPENDIX: EXPERIMENT SET UP

Training parameters and schedules are similar to (Gowal et al., 2018) and (Zhang et al., 2020), and are outlined in detail here. For training the classifier network with the architecture given in Table 2, for both datasets, the Adam optimizer with a learning rate of 5 × 10^-4 is used. Unless stated otherwise, κ is scheduled by a linear ramp-down process, starting at 1, which after a warm-up period is ramped down to the value κ_end = 0.5. The value of ε during training is simultaneously scheduled by a linear ramp-up, starting at 0 and ramped up to the final value ε_train reported in Table 1. Networks are trained on a single NVIDIA Tesla V100S GPU.

• For MNIST, the network is trained for 100 epochs with a batch size of 100 (a total of 60K steps). A warm-up period of 3 epochs (2K steps) is used (normal classification training with no robust loss), followed by a ramp-up duration of 18 epochs (10K steps), and the learning rate is decayed by ×10 at epochs 25 and 42. No data augmentation is used. Furthermore, a fixed selection of η = 0.9 and η = 0.1 during training is used for this dataset, with no ramp-down. Reported numbers in Table 1 correspond to λ1 = 1 and λ2 = 2 for ε = 0.3, and λ1 = 0.6 and λ2 = 1 for ε = 0.4, respectively.

• For CIFAR-10, the network is trained for 3200 epochs with a batch size of 1600 (a total of 100K steps). A warm-up period of 320 epochs (10K steps) is used (normal classification training with no robust loss), followed by a ramp-up duration of 1600 epochs (50K steps), and the learning rate is decayed by ×10 at epochs 2600 and 3040 (60K and 90K steps). Random translations and flips, and normalization of each image channel (using the channel statistics from the training set), are used during training. Furthermore, during training for all ε values we have selected η_start = 1.0 and η_end = 0.9.
Additionally, η_end = 0.1 is used during training, with η_start = 0.1 for ε = 2/255 (no ramp-down), η_start = 0.3 for ε = 8/255, η_start = 0.4 for ε = 12/255, and η_start = 0.5 for ε = 16/255. The intuition behind this parameter selection lies in Remark 2: large η values promote the abstain option more, so for large ε we also start with a larger η_start. Reported numbers in Table 1 correspond to λ1 = 1 for all ε values, and λ2 = 3.0 for ε = 2/255, λ2 = 2.9 for ε = 8/255, and λ2 = 3.1 for ε = 16/255, to ensure similar natural accuracy for a fair comparison against other methods.
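The linear ramp schedules above (the κ ramp-down, the ε ramp-up, and the η ramps) all share one shape: hold a start value through warm-up, interpolate linearly over the ramp, then hold the end value. This is an illustrative sketch of ours; the step-based parameterization and function name are assumptions, not the released training code:

```python
def linear_ramp(step, v_start, v_end, warmup_steps, ramp_steps):
    """Hold v_start during warm-up, interpolate linearly to v_end over
    ramp_steps, then hold v_end. Covers the kappa ramp-down (1 -> 0.5),
    the epsilon ramp-up (0 -> eps_train), and the eta schedules."""
    if step <= warmup_steps:
        return v_start
    t = min(1.0, (step - warmup_steps) / float(ramp_steps))
    return v_start + t * (v_end - v_start)

# MNIST-style kappa schedule: 2K warm-up steps, then a 10K-step ramp from 1 to 0.5.
kappa_mid = linear_ramp(7000, 1.0, 0.5, 2000, 10000)  # halfway through the ramp
```

Halfway through the ramp (step 7000 here) the schedule returns 0.75, and after step 12000 it stays at κ_end = 0.5.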

