A DISTRIBUTIONAL ROBUSTNESS CERTIFICATE BY RANDOMIZED SMOOTHING

Abstract

The robustness of deep neural networks against adversarial example attacks has received much attention recently. In this work we focus on the certified robustness of smoothed classifiers, and propose to use the worst-case population loss over noisy inputs as a robustness metric. Under this metric, we exploit duality to provide a tractable upper bound serving as a robustness certificate. To improve robustness, we further propose a noisy adversarial learning procedure to minimize the upper bound following the robust optimization framework. The smoothness of the loss function makes the problem easy to optimize even for non-smooth neural networks. We show how our robustness certificate compares with others and the improvement over previous works. Experiments on a variety of datasets and models verify that, in terms of empirical accuracies, our approach exceeds state-of-the-art certified/heuristic methods in defending against adversarial examples.

1. INTRODUCTION

Deep neural networks (DNNs) are known to be vulnerable to adversarial example attacks: by feeding the DNN slightly perturbed inputs, the attacker alters the prediction output. Such attacks can be fatal in performance-critical systems such as autonomous vehicles or automated tumor diagnosis. A DNN is robust when it can resist such attacks, i.e., as long as the range of the perturbation is not too large (usually imperceptible to humans), the model produces the expected output regardless of the specific perturbation. Various approaches have been proposed for improving the robustness of DNNs, with or without a performance guarantee. Although a number of approaches have been proposed for certified robustness, it remains unclear how robustness should be defined. For example, works including Cohen et al. (2019); Pinot et al. (2019); Li et al. (2019); Lecuyer et al. (2019) propose smoothed classifiers to ensure that inputs with adversarial perturbation are classified into the same class as the inputs without. However, since randomized noise is injected into both inputs, it cannot be guaranteed that the inputs are classified into the correct class. It is possible that the adversarially perturbed input shares the label of the original one, which is itself wrongly classified by the DNN; in this case the robustness guarantee is meaningless. Further, the robustness guarantee is provided at the instance level, i.e., within a certain perturbation range, the modification of an input instance cannot affect the prediction output. But a DNN is a statistical model to be evaluated on the input distribution rather than on a single instance. Instead of counting the number of input instances meeting the robustness definition, it is desirable to evaluate the robustness of a DNN over the input distribution.
We introduce the distributional risk as a DNN robustness metric, and propose a noisy adversarial learning (NAL) procedure based on distributionally robust optimization, which provides a provable guarantee. Assume a base classifier f maps an instance x_0 to the corresponding label y. It has been found that when fed with a perturbed instance x (within an ℓ_2 ball centered at x_0), a smoothed classifier g(x) = E_Z[f(x + z)] with z ∼ Z = N(0, σ²I) provably returns the same label as g(x_0) does. However, we argue that such a robustness guarantee cannot ensure that g(x_0) is correctly classified as y, resulting in unsatisfying performance in practice. Instead, we evaluate robustness as the worst-case loss over the distribution of noisy inputs. For simplicity, we jointly express the input instance and the label as x_0 ∼ P_0, where P_0 is the distribution of the original input. With ℓ(·) as the loss function, we evaluate DNNs by the worst-case distributional risk sup_S E_S[ℓ(θ; s)], where the classifier is parameterized by θ ∈ Θ, and s = x + z ∼ S with S a distribution within a certain distance from P_0. We prove such a loss is upper bounded by a data-dependent certificate, which can be optimized by the noisy adversarial training procedure:

minimize_{θ∈Θ} sup_S E_S[ℓ(θ; s)].   (1)

Compared to previous robustness certificates via smoothed classifiers, our method provides a provable guarantee w.r.t. the ground-truth input distribution. Letting the optimized θ parameterize g(·) and f(·) respectively, we further show that the smoothed classifier g(·) provides a better robustness certificate than f(·), due to a tighter bound on the worst-case loss. The key is that, for mild perturbations, we adopt a Lagrangian relaxation of the usual loss ℓ(θ; x + z) as the robust surrogate; the surrogate is strongly concave in x and hence easy to optimize. Our approach enjoys a convergence guarantee similar to the method in Sinha et al. (2018), but unlike Sinha et al. (2018), our approach does not require ℓ to be smooth, and thus applies to arbitrary neural networks. The advantage of the smoothed classifier also lies in a tighter robustness certificate than the base classifier. The intuition is that, in the inner maximization step, instead of seeking one direction which maximizes the loss at a single point, our approach performs gradient ascent along the direction which maximizes the total loss of examples sampled from the neighborhood of the original input. The noisy adversarial training procedure thus produces smoothed classifiers robust against the neighborhood of the worst-case adversarial examples, with a certified bound. Highlights of our contribution are as follows. First, we review the drawbacks of previous definitions of robustness, and propose to evaluate robustness by the worst-case loss over the input distribution. Second, we derive a data-dependent upper bound for the worst-case loss, constituting a robustness certificate. Third, by minimizing the robustness certificate in the training loop, we propose noisy adversarial learning for enhancing model robustness, in which the smoothness property entails the computational tractability of the certificate. Through both theoretical analysis and experimental results, we verify that our certified DNNs enjoy better accuracies than state-of-the-art defenses against adversarial example attacks.
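To make the smoothed classifier g(x) = E_Z[f(x + z)] concrete, the sketch below approximates the expectation by Monte Carlo averaging. The linear-softmax base classifier is a hypothetical stand-in for illustration only, not the paper's model; the weight matrix, noise level, and sample count are made-up values.

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def base_classifier(x):
    # Hypothetical base classifier f: fixed linear scores followed by softmax.
    W = np.array([[1.0, -1.0], [-1.0, 1.0]])
    return softmax(W @ x)

def smoothed_classifier(x, sigma=0.5, r=1000, seed=0):
    # Monte Carlo estimate of g(x) = E_z[f(x + z)], z ~ N(0, sigma^2 I).
    rng = np.random.default_rng(seed)
    z = rng.normal(0.0, sigma, size=(r, x.shape[0]))
    probs = np.stack([base_classifier(x + zi) for zi in z])
    return probs.mean(axis=0)

x0 = np.array([0.5, -0.5])
g = smoothed_classifier(x0)
print(g, g.argmax())  # averaged class probabilities of the smoothed classifier
```

More noise samples r give a closer approximation of the expectation, at proportionally higher inference cost.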

2. RELATED WORK

Works proposed to defend against adversarial example attacks fall into the following categories. In empirical defences, there is no guarantee on how the DNN model would perform. Certified defences are certifiably robust against any adversarial input within an ℓ_p-norm perturbation range from the original input. A line of works constructs computationally tractable relaxations for computing an upper bound on the worst-case loss over all valid attacks. The relaxations include linear programming (Wong & Kolter (2018)), mixed integer programming (Tjeng et al. (2018)), semidefinite programming (Raghunathan et al. (2018)), and convex relaxation (Namkoong & Duchi (2017); Salman et al. (2019b)). But those deterministic methods are not scalable. Some works such as Dvijotham et al. (2018) formulate the search for the largest perturbation range as an optimization problem and solve its dual problem.

3. PROPOSED APPROACH

We first define closeness between distributions, based on which we constrain how far the input distribution may be perturbed. We then introduce our definition of robustness for smoothed classifiers. Our main theorem gives a tractable robustness certificate that is easy to optimize, and we present our algorithm for improving the robustness of smoothed classifiers. All proofs are collected in the appendices for conciseness.

3.1. A DISTRIBUTIONAL ROBUSTNESS CERTIFICATE

Definition 1 (Wasserstein distance). Wasserstein distances define a notion of closeness between distributions. Let (X, A, P) be a probability space with X ⊂ R^d, and let the transportation cost c : X × X → [0, ∞) be nonnegative, lower semi-continuous, and satisfy c(x, x) = 0. Let P and Q be two probability measures supported on X, and let Π(P, Q) denote the collection of all measures on X × X with marginals P and Q on the first and second factors respectively, i.e., π(A, X) = P(A) and π(X, A) = Q(A) for all A ∈ A and π ∈ Π(P, Q). The Wasserstein distance between P and Q is

W_c(P, Q) := inf_{π∈Π(P,Q)} E_π[c(x, y)].

For example, the squared ℓ_2 cost c(x, x_0) = ‖x − x_0‖_2^2 satisfies the aforementioned conditions.

Distributional robustness. Assume the original input x_0 is drawn from the distribution P_0, and the perturbed input x is drawn from the distribution P. Randomized Gaussian noise z ∼ Z = N(0, σ²I) is added to each input before it is fed to the classifier. Instead of regarding the noise as part of the smoothed classifier, in the analysis we treat s = x + z as a noisy input drawn from a distribution S. Since z ∈ R^d, we set X = R^d to admit s ∈ X, as Lecuyer et al. (2019); Cohen et al. (2019); Salman et al. (2019a) do. Since the perturbed input should be visually indistinguishable from the original one, we define the robustness region as P = {P : W_c(P, P_0) ≤ ρ, P ∈ P(X)}, where ρ > 0. Within such a region, we evaluate robustness as the worst-case population loss over noisy inputs: sup_{S∈P} E_S[ℓ(θ; s)]. Essentially, we evaluate the robustness of a smoothed classifier based on its performance on the worst-case adversarial example distribution; a smaller loss indicates a higher level of robustness. We compare this definition against others in Section 4. However, such a robustness metric is impossible to measure in practice since P is unknown.
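Since the infimum in Definition 1 runs over all couplings, any particular coupling yields an upper bound on W_c. The sketch below illustrates this with the squared ℓ_2 cost; the toy 2-D samples and the identity pairing are made-up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def transport_cost(x, y):
    # Squared l2 transportation cost c(x, y) = ||x - y||_2^2.
    return np.sum((x - y) ** 2)

# Toy samples: P is a shifted copy of P_0. Pairing sample i of P_0 with
# sample i of P is one particular coupling, so the average cost under this
# pairing upper-bounds the Wasserstein distance W_c(P, P_0).
n, shift = 2000, np.array([0.3, 0.0])
x0 = rng.normal(0.0, 1.0, size=(n, 2))   # samples from P_0
x = x0 + shift                            # paired samples from P
upper_bound = np.mean([transport_cost(a, b) for a, b in zip(x, x0)])
print(upper_bound)  # equals ||shift||^2 = 0.09 under this coupling
```

For a pure translation this coupling is in fact optimal, so the bound is tight here; in general it is only an upper bound.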
Even if P could be acquired, it may be a non-convex region, which renders the constrained optimization objective intractable. Hence we resort to a Lagrangian relaxation of the problem with a dual variable γ. As the main theorem of this work, we provide an upper bound on the worst-case population loss for any level of robustness ρ. We further show that for small enough ρ, the upper bound is tractable and easy to optimize.

Theorem 1. Let ℓ : Θ × X → R and the transportation cost c : X × X → R_+ be continuous. Let x_0 be an input drawn from the input distribution P_0, let x be the adversarial example following the distribution P, and let z ∼ Z = N(0, σ²I) be additive noise of the same shape as x. Denote the sum of x and z by s = x + z ∼ S, and define the robust surrogate

φ_γ(θ; x_0) = sup_{x∈X} E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)].

For any γ, ρ > 0 and σ, we have

sup_{S: W_c(S,P_0)≤ρ} E_S[ℓ(θ; s)] ≤ γρ + E_{P_0}[φ_γ(θ; x_0)].   (3)

The proof is given in Appendix A.1. Note that the right-hand side takes expectations over P_0 and Z respectively: given a particular input x_0 and a noise sample z, we seek an adversarial example which maximizes the surrogate loss. Typically, P_0 is impossible to obtain, and thus in practice we approximate it by an empirical distribution such as the training data distribution. Since Thm. 1 provides an upper bound on the worst-case population loss, it offers a principled adversarial training approach which minimizes the upper bound instead of the actual loss, i.e.,

minimize_{θ∈Θ} E_{P_0}[φ_γ(θ; x_0)].   (4)

In the following we show the above loss function has a form which is tractable for arbitrary neural networks, owing to a smoothed loss function. Hence Thm. 1 provides a tractable robustness certificate depending on the data.

Properties of the smoothed classifier. We show the optimization objective of Eq. 4 has a form which is tractable for any neural network, in particular for non-smooth ones with ReLU activation layers. More importantly, the smoothness of the classifier enables the adversarial training procedure to converge as desired using common optimization techniques such as stochastic gradient descent. The smoothness of the loss function comes from the smoothed classifier with randomized noise. Specifically,

Theorem 2. Assume ℓ : Θ × X → [0, M] is a bounded loss function. The loss function of the smoothed classifier can be expressed as ℓ̂(θ; x) := E_Z[ℓ(θ; x + z)], z ∼ Z = N(0, σ²I). Then ℓ̂ is (2M/σ²)-smooth w.r.t. the ℓ_2 norm, i.e., ℓ̂ satisfies

‖∇_x ℓ̂(θ; x) − ∇_x ℓ̂(θ; x')‖_2 ≤ (2M/σ²) ‖x − x'‖_2.   (5)

The proof is in Appendix A.2. It mainly exploits the randomized noise, which has a smoothing effect on the loss function. For DNNs with non-smooth layers, the smoothed classifier compensates and turns the loss into a smooth one, which is the key property behind the strong concavity of E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] and therefore ensures the tractability of the robustness certificate.

Corollary 1. For any c : X × X → R_+ ∪ {∞} that is 1-strongly convex in its first argument, and ℓ̂ : x → E_Z[ℓ(θ; x + z)] being (2M/σ²)-smooth, the function E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] is strongly concave in x for any γ ≥ 2M/σ².

The proof is in Appendix A.3. Note that we require the transportation cost c to be 1-strongly convex in its first argument; the squared ℓ_2 cost satisfies this condition. Before showing how the strong concavity plays a part in the convergence, we illustrate our algorithm first.
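As a quick sanity check on the constant in Theorem 2, the 1-D example below smooths a (non-smooth, bounded) step loss ℓ(x) = M·1[x > 0], whose Gaussian smoothing has the closed form M·Φ(x/σ), and verifies numerically that the curvature of the smoothed loss never exceeds 2M/σ². The step loss and the parameter values are illustrative choices, not from the paper.

```python
from statistics import NormalDist

M, sigma = 1.0, 0.5
Phi = NormalDist().cdf

def smoothed_loss(x):
    # For the non-smooth step loss l(x) = M * 1[x > 0], the smoothed loss
    # E_z[l(x + z)] with z ~ N(0, sigma^2) equals M * Phi(x / sigma).
    return M * Phi(x / sigma)

# Estimate the second derivative of the smoothed loss on a grid and compare
# it with the 2M / sigma^2 smoothness constant from Theorem 2.
h = 1e-4
curvatures = []
for i in range(-200, 201):
    x = i / 100.0
    d2 = (smoothed_loss(x + h) - 2 * smoothed_loss(x) + smoothed_loss(x - h)) / h**2
    curvatures.append(abs(d2))

bound = 2 * M / sigma**2
print(max(curvatures), bound)  # observed curvature stays well below the bound
```

For this loss the true maximal curvature is (M/σ²)·max|u·φ(u)| ≈ 0.24·M/σ², comfortably inside the 2M/σ² guarantee.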

3.2. NOISY ADVERSARIAL LEARNING ALGORITHM

Problem 4 provides an explicit way to improve the robustness of a smoothed classifier parameterized by θ. We correspondingly design a noisy adversarial learning algorithm to obtain a classifier whose robustness can be guaranteed. In the algorithm, we replace the ideal input distribution P_0 by the empirical distribution, and sample z a number of times to substitute the expectation with a sample average. Assuming a total of n training instances x_0^i, i ∈ [n], and sampling z_ij ∼ N(0, σ²I) for the i-th instance r times, the objective is:

minimize_{θ∈Θ} (1/nr) Σ_{i=1}^n Σ_{j=1}^r sup_{x∈X} [ℓ(θ; x + z_ij) − γ c(x + z_ij, x_0^i)].   (6)

The details of the algorithm are given in Alg. 1. In the inner maximization step (lines 3-6), we adopt projected gradient descent (PGD; Madry et al. (2017); Kurakin et al. (2018)) to approximate the maximizer, following convention. The hyperparameters include the number of iterations K and the learning rate η_1. Within each iteration, we sample the Gaussian noise r times, from which we compute an average perturbation direction for each update. The more noise samples, the closer the average is to the expectation, at the sacrifice of higher computational expense. Similarly, a larger K indicates stronger adversarial attacks and higher model robustness, but also incurs higher computational complexity. Hence choosing appropriate values of r and K is important in practice.

Algorithm 1: Noisy adversarial learning (NAL)
1: for t = 0, 1, ..., T − 1 do
2:   for i = 1, ..., n do
3:     for k = 0, ..., K − 1 do
4:       Δx_k^i = (1/r) Σ_{j=1}^r [∇_x ℓ(θ_t; x_k^i + z_ij) − γ ∇_x c(x_k^i + z_ij, x_0^i)], where z_ij ∼ N(0, σ²I)
5:       x_{k+1}^i = x_k^i + η_1 Δx_k^i
6:     end for
7:   end for
8:   θ_{t+1} = θ_t − η_2 (1/nr) Σ_{i=1}^n ∇_θ Σ_{j=1}^r ℓ(θ_t; x_K^i + z_ij)
9: end for

After training is done, we obtain the classifier parameter θ. In the inference phase, we sample a number of noise vectors z ∼ N(0, σ²I) to add to each testing instance; the noisy testing examples are fed to the classifier to obtain the prediction outputs. Convergence.
An important property of the smoothed classifier is the strong concavity of the robust surrogate loss, which is the key to the convergence proof; details can be found in Appendix A.4. As long as the loss ℓ̂ is smooth on the parameter space Θ, NAL has a convergence rate of O(1/√T), similar to Sinha et al. (2018), but NAL does not need to replace the non-smooth ReLU layers with Sigmoid or ELU to guarantee robustness.
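The loop structure of Algorithm 1 can be sketched end-to-end on a toy problem. The linear model, squared loss, analytic gradients, single training pair, and all hyperparameter values below are illustrative assumptions, not the paper's setup; the sketch only mirrors the inner noise-averaged ascent (lines 3-6) and the outer descent (line 8).

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup: one training pair (x0, y) and a linear model with squared loss
# l(theta; s) = (theta . s - y)^2. Purely a sketch of Algorithm 1 (NAL).
x0 = np.array([1.0, -1.0])
y = 1.0
theta = np.zeros(2)
gamma, sigma = 2.0, 0.1          # penalty and noise level (illustrative)
K, r = 4, 4                      # PGD iterations, noise samples per step
eta1, eta2, T = 0.05, 0.05, 200  # inner/outer step sizes, outer iterations

def loss_grad_x(theta, s):       # gradient of the loss w.r.t. the input s
    return 2 * (theta @ s - y) * theta

def loss_grad_theta(theta, s):   # gradient of the loss w.r.t. theta
    return 2 * (theta @ s - y) * s

for t in range(T):
    # Inner maximization (lines 3-6): ascend on x with the noise-averaged
    # gradient of the surrogate l(theta; x+z) - gamma * ||x + z - x0||^2.
    x = x0.copy()
    for k in range(K):
        z = rng.normal(0.0, sigma, size=(r, 2))
        dx = np.mean([loss_grad_x(theta, x + zj) - 2 * gamma * (x + zj - x0)
                      for zj in z], axis=0)
        x = x + eta1 * dx
    # Outer minimization (line 8): descend on theta over noisy adversarial s.
    z = rng.normal(0.0, sigma, size=(r, 2))
    g = np.mean([loss_grad_theta(theta, x + zj) for zj in z], axis=0)
    theta = theta - eta2 * g

clean_loss = (theta @ x0 - y) ** 2
print(theta, clean_loss)  # the clean loss shrinks as training proceeds
```

With γ = 2 the surrogate is strongly concave in x here (the loss curvature 2‖θ‖² stays below 2γ), matching the condition of Corollary 1 in spirit.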

4. A TIGHTER BOUND

We compare our work with the state-of-the-art robustness definitions and certificates in this section.

4.1. ADVERSARIAL TRAINING

Our approach improves the distributional robustness certificate proposed by Sinha et al. (2018). In Sinha et al. (2018), a classifier f maps an input instance x_0 ∼ P_0 to the corresponding label y. They perturb x_0 to x' within the same robustness region as ours: P = {P' : W_c(P', P_0) ≤ ρ, P' ∈ P(X)}, where ρ > 0. But their worst-case population loss is defined on the base classifier without noise: sup_{P': W_c(P',P_0)≤ρ} E_{P'}[ℓ(θ; x')]. We show that, given the same classifier parameter θ, our worst-case loss is smaller than that of Sinha et al. (2018), suggesting a better robustness certificate.

Theorem 3. Under the same notation and conditions as Thm. 1, we have

sup_{S∈P} E_S[ℓ(θ; s)] ≤ inf_{γ≥0} {γρ + E_{P_0}[φ_γ(θ; x_0)]} ≤ inf_{γ≥0} {γρ + E_{P_0} sup_{x'∈X} [ℓ(θ; x') − γ c(x', x_0)]} = sup_{P': W_c(P',P_0)≤ρ} E_{P'}[ℓ(θ; x')].   (7)

The proof is given in Appendix A.5. Not only is our worst-case loss smaller, but our tractable upper bound is also smaller than the certificate of Sinha et al. (2018). If the outer minimization is applied to both sides of the inequality, our approach obtains a smaller loss when both classifiers share the same neural architecture.

4.2. SMOOTHED CLASSIFIERS

Works including Lecuyer et al. (2019); Cohen et al. (2019); Pinot et al. (2019); Li et al. (2019) and others guarantee the robustness of a DNN classifier by injecting randomized noise into the input at the inference phase. Most of them do not concern the training phase, but merely provide a deterministic relationship between the robustness certificate and the additive noise. Specifically, let x_0 ∈ X be the original input and x its perturbation within a given range ‖x − x_0‖_2 ≤ ε. The smoothed classifier g(x) returns class c_i with probability p_i. For instance x_0, robustness is defined by the largest perturbation radius R which does not alter the instance's prediction, i.e., g(x) is classified into the same category as g(x_0). This radius depends on the largest and second largest probabilities among the p_i, denoted p_A and p_B respectively. For example, the results in Cohen et al. (2019) have shown that R = (σ/2)(Φ^{-1}(p_A) − Φ^{-1}(p_B)), where Φ^{-1} is the inverse of the standard Gaussian CDF, and p_A, p_B here stand for a lower bound on p_A and an upper bound on p_B respectively. The previous robustness definition only guarantees that g(x) is classified into the same class as g(x_0), but ignores the fact that g(x_0) itself may be wrongly classified, so the definition is imprecise. To compensate, Li et al. (2019) propose stability training with noise (STN) and Cohen et al. (2019) adopt training with noise, both of which learn smoothed classifiers mapping noisy inputs to correct labels. However, neither guarantees that g(x_0) is correctly labeled. In fact, we find the robustness mainly comes from STN/training with noise, rather than from the noise addition at inference. In Fig. 1, we observe that model performance indeed improves when tested with noise. However, the classifier trained without additive noise (triangle) degrades significantly compared with STN/training with noise (diamond/circle). This is evidence that a classifier can hardly defend against adversarial attacks when trained without, but tested with, additive noise. We therefore conclude that the smoothed classifier improves robustness only if the base classifier is robust. We consider robustness to be the ability of a DNN to classify adversarial examples into the correct classes, and such an ability should be evaluated on the population of adversarial examples, not on a single instance.

5. EXPERIMENT

Baselines, datasets and models. Testing accuracies under different levels of adversarial attacks are chosen as the metric. We compare the empirical performance of NAL with representative baselines including WRM (Sinha et al. (2018)), SmoothAdv (Salman et al. (2019a)), STN (Li et al. (2019)) and TRADES (Zhang et al. (2019)). Since WRM requires the loss function to be smooth, we follow the convention of adapting the ReLU activation layers to ELU layers. SmoothAdv combines adversarial training with the smoothed classifier and claims superiority over Cohen et al. (2019); hence we omit Cohen et al. (2019) from the comparison. TRADES is an adversarial training algorithm which won 1st place in the NeurIPS 2018 Adversarial Vision Challenge. Experiments are conducted on the datasets MNIST, CIFAR-10, and Tiny ImageNet, and on models including a three-layer CNN, ResNet-18, VGG-16, and their corresponding variants with ReLU replaced by ELU for a fair comparison with WRM. The cross-entropy loss is chosen for ℓ, and c(x, x_0) = ‖x − x_0‖_2^2 is selected as the cost function.

Training hyperparameters. Since NAL and WRM bound the adversarial perturbations by the Wasserstein distance ρ, which differs from the ℓ_2-norm perturbation range ε in SmoothAdv and TRADES, we need to establish an equivalence between the perturbation ranges of the different methods. Following the convention of Sinha et al. (2018), we choose different γs, and for each γ we generate adversarial examples x̂ by PGD with 15 iterations. We compute ρ as the expected transportation cost between the generated adversarial examples and the original inputs over the training set:

ε² = ρ̂(θ) = E_{P_0} E_Z[c(x̂ + z, x_0)].   (8)

And ε can be computed accordingly. The corresponding values of γ and ε used in the experiments are given in Table 1 as well.

Attack parameters. To evaluate the empirical accuracies of the different methods, we adopt the PGD attack (Kurakin et al. (2018); Madry et al. (2017)) as the adversarial attack, following the convention of Li et al. (2019); Sinha et al. (2018); Zhang et al. (2019), etc. We set the number of iterations in the PGD attack to K_attack = 20 and the learning rate to η = 2ε_attack/K_attack, where ε_attack is the ℓ_2 attack radius.
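The evaluation attack above (ℓ_2 PGD with step size η = 2ε_attack/K_attack) can be sketched as follows. The quadratic target loss and its gradient are hypothetical stand-ins for a model's loss; only the step-size rule and the ℓ_2 projection follow the setup described above.

```python
import numpy as np

def pgd_l2_attack(grad_fn, x0, eps, k_attack=20):
    # l2 PGD evaluation attack: take normalized gradient-ascent steps of
    # size eta = 2 * eps / k_attack, projecting back onto the eps-ball
    # around x0 after each step.
    eta = 2 * eps / k_attack
    x = x0.astype(float).copy()
    for _ in range(k_attack):
        g = grad_fn(x)
        gn = np.linalg.norm(g)
        if gn > 0:
            x = x + eta * g / gn           # ascend the loss
        delta = x - x0
        dn = np.linalg.norm(delta)
        if dn > eps:                        # project onto the l2 ball
            x = x0 + (eps / dn) * delta
    return x

# Hypothetical target loss l(x) = ||x - t||^2 standing in for the model loss.
t = np.array([1.0, 1.0])
loss = lambda x: np.sum((x - t) ** 2)
grad = lambda x: 2 * (x - t)

x0 = np.zeros(2)
x_adv = pgd_l2_attack(grad, x0, eps=0.5)
print(loss(x0), loss(x_adv))  # the attack raises the loss within the eps-ball
```

The normalized-step variant keeps the per-iteration movement at η regardless of gradient magnitude, which is a common choice for ℓ_2 PGD.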

Certificate.

To better understand how close the upper bound is to the true distributional risk, we plot our certificate γρ + E_{P_test}[φ_γ(θ; x_0)] against each level of robustness ρ, together with the out-of-sample (test) worst-case performance sup_{S∈P} E_S[ℓ(θ; s)] for NAL (Fig. 2(a)). Since the worst-case loss is hard to evaluate directly, we solve its Lagrangian relaxation for different values of γ_adv. For each γ_adv, we compute the average distance to adversarial examples in the test set as ρ̂_test(θ) := E_{P_test} E_Z[c(x̂ + z, x_0)], where P_test is the test data distribution and x̂ = argmax_x E_Z[ℓ(θ; x + z) − γ_adv c(x + z, x_0)] is the adversarial perturbation of x_0. The worst-case loss is then given by the point (ρ̂_test(θ), E_{P_test} E_Z[ℓ(θ; x̂ + z)]). We observe that ρ̂_test(θ) tends to increase with a higher noise level; hence we need to keep the noise at an appropriate level to make our certificate tractable.

Cost without noise. To find out whether NAL works when noise is removed from the cost, we design a verification experiment on CIFAR-10 (ResNet-18) by letting c(x, x_0) = ‖x − x_0‖_2^2 and injecting noise only into ℓ. We set γ = 1.5, σ = 0.1, K = 4, r = 4. As Fig. 2(b) shows, the accuracy of the model excluding noise from the cost is far inferior, which shows that the randomized noise is an inherent part of the design.

Sample number and PGD iterations. We also study the impact of the noise sample number r and the number of PGD iterations K on model robustness, with CIFAR-10 (ResNet-18) as an example. The results in Table 2 show that while model performance improves with K, it does not necessarily increase with more noise samples. We did not test with more noise samples due to the high computational cost. Considering both computation overhead and accuracy, we choose K = 4, r = 4 by default in the experiments, which is likely to deliver sufficiently good performance. Due to space constraints, complete experimental results are in Appendix B.2. Penalty and noise level.
We vary the values of γ and σ in the experiments to find out their impact. From the results in Fig. 2(c), (d) and Fig. 3, we observe that γ = 0.25 yields the best performance for MNIST, and γ = 1.5 is best for CIFAR-10 and Tiny ImageNet, considering all levels of adversarial attacks. For complete results on γ, one can refer to Appendix B.3. Likewise, the best value of σ also depends on the dataset, as shown by the experimental results in Appendix B.2.

Comparison with baselines. Finally, we compare the empirical accuracies with the baselines; the results are presented in Fig. 2(c), (d) and Fig. 3. For WRM, the experiments are conducted on the modified DNN structure to ensure smoothness. NAL has superior performance in almost all cases, except that: 1) the clean accuracies (denoted by ℓ_2 attack radius = 0) of NAL on CIFAR-10 and Tiny ImageNet are inferior to STN; 2) on MNIST, the performance of NAL is no worse than the baselines but does not exceed them by a large margin. For 1), we find that STN mostly has far worse performance than the other schemes when the attack radius is > 0, which echoes the proposition in Salman et al. (2019a) that adversarial training brings higher robustness than stability training. Hence it can be explained by the inherent tradeoff between clean accuracy and robustness (Zhang et al. (2019)) that STN has higher clean accuracies than the others. In fact, NAL shows a better tradeoff between accuracy and robustness than the baselines, indicated by its relatively flat accuracy curves. For 2), we believe MNIST has a simpler decision boundary than the other two datasets and hence allows larger perturbations (smaller γ), so the performance boost by NAL is not significant. Indeed, when γ is larger, the performance of NAL exceeds the baselines by a large margin (Appendix B.3).

A PROOFS

A.1 PROOF OF THEOREM 1

Proof. We express the worst-case loss in its dual form with dual variable γ. By weak duality, we have

sup_{S∈P} E_S[ℓ(θ; s)] ≤ inf_{γ≥0} sup_{S∈P} {E_S[ℓ(θ; s)] − γ W_c(S, P_0) + γρ},   (9)

the right-hand side of which can be rewritten in integral form as

inf_{γ≥0} sup_{S∈P} {∫ ℓ(θ; s) dS(s) − γ W_c(S, P_0) + γρ}.

Note that for any π ∈ Π(S, P_0) and any integrable f, ∫ f(s) dS(s) = ∫ f(s) dπ(s, x_0). By the definition of the Wasserstein distance,

inf_{γ≥0} sup_{S∈P} {∫ ℓ(θ; s) dS(s) − γ W_c(S, P_0) + γρ}
= inf_{γ≥0} sup_{S∈P} {∫ ℓ(θ; s) dπ(s, x_0) − γ inf_{π∈Π(S,P_0)} ∫ c(s, x_0) dπ(s, x_0) + γρ}
= inf_{γ≥0} sup_{S∈P} {sup_{π∈Π(S,P_0)} ∫ [ℓ(θ; s) − γ c(s, x_0)] dπ(s, x_0) + γρ}.   (11)

Since s = x + z and z is independent of x and x_0, we obtain

∫ [ℓ(θ; s) − γ c(s, x_0)] dπ(s, x_0) = ∫∫ [ℓ(θ; x + z) − γ c(x + z, x_0)] dZ(z) dπ(x, x_0).   (12)

Taking the supremum over x,

∫ E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] dπ(x, x_0) ≤ ∫ sup_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] dπ(x, x_0).

After taking the supremum over x, the integrand depends only on x_0, so integrating against π reduces to integrating against its marginal P_0:

∫ sup_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] dπ(x, x_0) = ∫ sup_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] dP_0(x_0) = E_{P_0} sup_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)].

Because the distribution of z is fixed and z is independent of x, the supremum over S has been replaced by the supremum over x. Therefore, Eq. 11 can be bounded as

inf_{γ≥0} sup_{S∈P} {sup_{π∈Π(S,P_0)} ∫ [ℓ(θ; s) − γ c(s, x_0)] dπ(s, x_0) + γρ} ≤ inf_{γ≥0} {E_{P_0} sup_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] + γρ}.

Plugging the above into Eq. 9, we get

sup_{S∈P} E_S[ℓ(θ; s)] ≤ inf_{γ≥0} {E_{P_0} sup_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] + γρ} = inf_{γ≥0} {E_{P_0}[φ_γ(θ; x_0)] + γρ} ≤ E_{P_0}[φ_γ(θ; x_0)] + γρ   (16)

for any given γ ≥ 0, which completes the proof.

A.2 PROOF OF THEOREM 2

Proof. Showing that ℓ̂ is (2M/σ²)-smooth is equivalent to showing that ∇ℓ̂ is (2M/σ²)-Lipschitz. Applying a Taylor expansion of ∇ℓ̂ at x with δ = x_0 − x, we have ∇ℓ̂(x_0) = ∇ℓ̂(x) + ∇²ℓ̂(x + τδ)δ for some 0 < τ < 1. Since ‖∇ℓ̂(x + δ) − ∇ℓ̂(x)‖_2 = ‖∇²ℓ̂(x + τδ)δ‖_2, it suffices to show that ‖∇²ℓ̂(x + τδ)‖_2 is bounded. Taking the first- and second-order derivatives of ℓ̂(x), we have

∇ℓ̂(x) = (1/((2π)^{d/2} σ^{d+2})) ∫_{R^d} ℓ(t)(t − x) exp(−‖x − t‖²/(2σ²)) dt,

and

∇²ℓ̂(x) = (1/((2π)^{d/2} σ^{d+2})) ∫_{R^d} ℓ(t) exp(−‖x − t‖²/(2σ²)) [−I + (1/σ²)(t − x)(t − x)^T] dt.   (19)

We bound the right-hand side of Eq. 19 in two halves. The first half satisfies

‖(1/((2π)^{d/2} σ^{d+2})) ∫_{R^d} ℓ(t) exp(−‖x − t‖²/(2σ²)) (−I) dt‖_2 = ‖(1/σ²) ℓ̂(x)(−I)‖_2 ≤ (1/σ²) |ℓ̂(x)| ≤ M/σ².

For the second half, since 0 ≤ ℓ(t) ≤ M, the matrix

(1/((2π)^{d/2} σ^{d+4})) ∫_{R^d} ℓ(t) exp(−‖x − t‖²/(2σ²)) (t − x)(t − x)^T dt

is positive semidefinite and dominated by (M/σ⁴) E_Z[zz^T] = (M/σ⁴)(σ² I) = (M/σ²) I, so its ℓ_2 norm is at most M/σ². Combining the two halves, we get

‖∇²ℓ̂(x + τδ)‖_2 ≤ 2M/σ².

A.3 PROOF OF COROLLARY 1

Proof. Since ℓ̂ is (2M/σ²)-smooth, we have ∇²_x ℓ̂(θ; x) ⪯ (2M/σ²) I. Since c is 1-strongly convex in its first argument, ∇²_x E_Z[c(x + z, x_0)] = E_Z[∇²_x c(x + z, x_0)] ⪰ I (for the squared ℓ_2 cost, E_Z[c(x + z, x_0)] = c(x, x_0) + dσ², and the additive constant does not affect the Hessian). Therefore,

∇²_x E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] ⪯ (2M/σ² − γ) I,

and hence strong concavity holds for γ ≥ 2M/σ².

A.4 CONVERGENCE PROOF

We start with the required assumptions, which roughly quantify the robustness we provide.

Assumption 1. The loss ℓ̂ : Θ × X → [0, M] satisfies the Lipschitzian smoothness conditions

‖∇_θ ℓ̂(θ; x) − ∇_θ ℓ̂(θ'; x)‖_* ≤ L_θθ ‖θ − θ'‖,
‖∇_x ℓ̂(θ; x) − ∇_x ℓ̂(θ; x')‖_* ≤ L_xx ‖x − x'‖,
‖∇_θ ℓ̂(θ; x) − ∇_θ ℓ̂(θ; x')‖_* ≤ L_θx ‖x − x'‖,
‖∇_x ℓ̂(θ; x) − ∇_x ℓ̂(θ'; x)‖_* ≤ L_xθ ‖θ − θ'‖.

Let ‖·‖_* be the dual norm to ‖·‖; we abuse notation by using the same norm ‖·‖ on Θ and X. We have already proved, by Theorem 2, that the second condition of Assumption 1 holds with L_xx = 2M/σ². Therefore, if ℓ̂ satisfies the other three conditions, we can adopt a proof procedure similar to that of Theorem 2 in Sinha et al. (2018) to prove the convergence of Algorithm 1.

A.5 PROOF OF THEOREM 3

Proof. By Eq. 16 we have sup_{S∈P} E_S[ℓ(θ; s)] ≤ inf_{γ≥0} {γρ + E_{P_0}[φ_γ(θ; x_0)]}, where E_{P_0}[φ_γ(θ; x_0)] = E_{P_0} sup_{x∈X} E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)]. Since ℓ(θ; x + z) − γ c(x + z, x_0) is strongly concave in x + z, as in Sinha et al. (2018), Jensen's inequality gives, for any fixed x,

E_Z[ℓ(θ; x + z) − γ c(x + z, x_0)] ≤ ℓ(θ; E_Z[x + z]) − γ c(E_Z[x + z], x_0) = ℓ(θ; x) − γ c(x, x_0).

Hence the following inequality holds:

E_{P_0}[φ_γ(θ; x_0)] ≤ E_{P_0} sup_{x∈X} [ℓ(θ; x) − γ c(x, x_0)].

Finally, Eq. 7 follows by concatenating the inequalities, which completes the proof.

A.6 CONNECTIONS BETWEEN ROBUSTNESS CERTIFICATES

Proposition 1. Let p_A, p_B denote the largest and second largest probabilities returned by the smoothed classifier g(x_0), and let R = (σ/2)(Φ^{-1}(p_A) − Φ^{-1}(p_B)). Choose ℓ as the cross-entropy loss in the smoothed loss function ℓ̂(x) = E_Z[ℓ(θ; x + z)], z ∼ Z = N(0, σ²I). If

ℓ̂(θ; x) ≤ −log Φ(Φ^{-1}(p_B) + ‖x − x_0‖_2 / σ)   (34)

holds, and the ground-truth label is y = c_A, then g(x) is robust against any x such that ‖x − x_0‖_2 ≤ R.

Proof. By Theorem 1 of Cohen et al. (2019), it suffices to show that the condition in Eq. 34 implies ‖x − x_0‖_2 ≤ R. With ℓ the cross-entropy loss, ℓ̂(x) = E_Z[ℓ(θ; x + z)] = E_Z[−log f^{(y)}(x + z)], where f^{(y)} denotes the probability that f assigns to class y. Applying Jensen's inequality to the convex function −log(·),

ℓ̂(x) = E_Z[−log f^{(y)}(x + z)] ≥ −log E_Z[f^{(y)}(x + z)].

As y = c_A, we have E_Z[f^{(y)}(x + z)] = P(f(x + z) = c_A). By Eq. 34,

−log P(f(x + z) = c_A) ≤ E_Z[−log f^{(y)}(x + z)] ≤ −log Φ(Φ^{-1}(p_B) + ‖x − x_0‖_2 / σ),   (37)

and hence

P(f(x + z) = c_A) ≥ Φ(Φ^{-1}(p_B) + ‖x − x_0‖_2 / σ).   (38)

By the proof of Theorem 1 in Cohen et al. (2019), P(f(x + z) = c_A) = Φ(Φ^{-1}(p_A) − ‖x − x_0‖_2 / σ), which leads to

Φ^{-1}(p_A) − ‖x − x_0‖_2 / σ ≥ Φ^{-1}(p_B) + ‖x − x_0‖_2 / σ.

Therefore, ‖x − x_0‖_2 ≤ (σ/2)(Φ^{-1}(p_A) − Φ^{-1}(p_B)) = R. To sum up, if Eq. 34 holds and x_0 is correctly classified, g(x) is robust within an ℓ_2 ball of radius R.
One can tell the loss on a single instance is weakly associated with the robustness of the model, and the condition of g(x) being robust is quite stringent. It is not practical to sum up the single-instance loss to gauge the model robustness either.
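The radius used in Proposition 1 is a simple closed-form expression, computable with the standard library's inverse normal CDF. The probability values below are made-up inputs for illustration.

```python
from statistics import NormalDist

def certified_radius(p_a, p_b, sigma):
    # R = (sigma / 2) * (Phi^{-1}(p_A) - Phi^{-1}(p_B)): the l2 radius within
    # which the smoothed classifier's top prediction provably cannot change.
    inv = NormalDist().inv_cdf
    return (sigma / 2.0) * (inv(p_a) - inv(p_b))

# With p_A = Phi(1) and p_B = Phi(-1), the radius equals sigma itself.
p = NormalDist().cdf(1.0)
print(certified_radius(p, 1 - p, sigma=0.5))  # approximately sigma = 0.5
```

The radius grows with the gap between the top two class probabilities and shrinks to zero as they approach each other, matching the intuition that confident smoothed predictions are the certifiably robust ones.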

B EXPERIMENTS

B.1 BASELINE SETTINGS

We provide the training settings for the baselines.

B.2 RESULTS WITH VARYING σ

We compare NAL with SmoothAdv and STN under the same experimental setting but different σs. In Table 4, NAL achieves the best performance overall at σ = 0.1. We believe the best σ value differs across experimental settings: for example, NAL and STN obtain their best performance at σ = 0.1, whereas SmoothAdv performs best at σ = 0.05. For the same σ, NAL has superior performance to the other two baselines, except that, when σ = 0.05, SmoothAdv is more robust than NAL for ℓ_2 attack radius ≥ 0.75. This is mainly because SmoothAdv achieves its best performance at σ = 0.05; moreover, the regime where model accuracy degrades below 0.5 is not our main concern. In Table 5, we show NAL's accuracy over a variety of σ, K, r values. We find that σ = 0.12 is generally better than larger values. Under the same σ, we choose K ∈ {2, 4, 6, 8} and r ∈ {1, 4}. We found the model cannot converge with (K, r) = (2, 1), and thus do not present those results.

B.3 RESULTS WITH VARYING γ

We show the impact of γ on MNIST and CIFAR-10. On MNIST, γ takes values in {0.25, 1.5, 3} and σ is chosen as 0.05. On CIFAR-10, γ ∈ {0.25, 1.5, 5} and σ is set to 0. Compared with the ELU model, the ReLU model presents faster convergence. The robustness performance of both models is presented in Table 6. It is clear that in the testing phase, the ReLU model also obtains better performance. Hence NAL generally yields better performance on ReLU models than on ELU models.



CONCLUSION

Our work views the robustness of a smoothed classifier from a different perspective, i.e., the worst-case population loss over the input distribution. We provide a tractable upper bound (certificate) for this loss and devise a noisy adversarial learning approach to obtain a tight certificate. Compared with previous works, our certificate is practically meaningful and offers superior empirical robustness performance.



4.2 SMOOTHED CLASSIFIERS

Works including Lecuyer et al. (2019); Cohen et al. (2019); Pinot et al. (2019); Li et al. (2019) propose smoothed classifiers to certify robustness.

Figure 1: Accuracies of models trained on MNIST under different levels of $\ell_2$ attacks. Undefend denotes a naturally trained model. Solid lines represent models tested with additive noise, and dotted lines models tested without. σ = 0.1 means adding Gaussian noise $\mathcal{N}(0, 0.1^2 I)$.
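Testing "with additive noise" amounts to evaluating the smoothed classifier by Monte Carlo: the prediction is a majority vote of the base classifier over noisy copies of the input. A minimal sketch follows; the toy base classifier and its parameters are hypothetical stand-ins, not the CNN used in the figure.

```python
import numpy as np

rng = np.random.default_rng(0)

def smoothed_predict(f, x0, sigma=0.1, n=500):
    """Monte Carlo estimate of the smoothed classifier g(x0): majority
    vote of the base classifier f over x0 + z, z ~ N(0, sigma^2 I)."""
    votes = {}
    for _ in range(n):
        z = rng.normal(0.0, sigma, size=x0.shape)
        c = f(x0 + z)
        votes[c] = votes.get(c, 0) + 1
    return max(votes, key=votes.get)

# Toy base classifier: thresholds the mean "pixel" value (hypothetical).
f = lambda x: int(x.mean() > 0)
x0 = np.full(16, 0.5)
assert smoothed_predict(f, x0, sigma=0.1) == 1  # vote is stable under noise
```

In practice $n$ is chosen large enough that the vote estimate is statistically reliable; here a few hundred draws suffice for the toy example.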

The baselines include WRM (Sinha et al. (2018)), SmoothAdv (Salman et al. (2019a)), STN (Li et al. (2019)), and TRADES (Zhang et al. (2019)). Since WRM requires the loss function to be smooth, we follow its convention and replace the ReLU activation layers with ELU layers. SmoothAdv combines adversarial training with the smoothed classifier and claims to be superior to Cohen et al. (2019); hence we omit Cohen et al. (2019) from the comparison. TRADES is an adversarial training algorithm that won 1st place in the NeurIPS 2018 Adversarial Vision Challenge. Experiments are conducted on the MNIST, CIFAR-10, and Tiny ImageNet datasets, with models including a three-layer CNN, ResNet-18, VGG-16, and their corresponding variants with ReLU replaced by ELU for fair comparison with WRM. The cross-entropy loss is chosen for $\ell$, and $c(x, x_0) = \|x - x_0\|_2^2$ is selected as the cost function.

Training hyperparameters.
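With the cross-entropy loss and the squared $\ell_2$ cost fixed, the core of noisy adversarial learning is the inner maximization of the smoothed penalized loss. The sketch below runs it on a toy logistic model; the model, labels, and every hyperparameter here are purely illustrative stand-ins, not the actual NAL implementation on the networks above.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy differentiable "model": binary logistic loss l(theta; x) for a fixed
# linear parameter theta and true label y = 1. Purely illustrative.
theta = np.array([1.0, -2.0])
def loss(x):       # l(theta; x) = log(1 + exp(-theta^T x))
    return np.log1p(np.exp(-theta @ x))
def loss_grad(x):  # d l / d x = -theta / (1 + exp(theta^T x))
    return -theta / (1.0 + np.exp(theta @ x))

def noisy_inner_max(x0, gamma=1.0, sigma=0.1, K=4, steps=15, lr=0.1):
    """Gradient ascent on the penalized smoothed objective
       (1/K) * sum_k l(theta; x + z_k) - gamma * ||x - x0||_2^2,
    a sketch of the inner maximization behind phi_gamma(theta; x0)."""
    x = x0.astype(float).copy()
    for _ in range(steps):
        z = rng.normal(0.0, sigma, size=(K, x0.size))
        g = np.mean([loss_grad(x + zk) for zk in z], axis=0)
        g = g - 2.0 * gamma * (x - x0)  # gradient of the cost penalty
        x = x + lr * g
    return x

x0 = np.array([0.5, 0.5])
x_adv = noisy_inner_max(x0)
# The ascent should not decrease the penalized loss relative to x0 itself.
assert loss(x_adv) - 1.0 * np.sum((x_adv - x0) ** 2) >= loss(x0) - 1e-6
```

Because the noise averaging smooths the objective, plain gradient ascent behaves well here even though a ReLU network's loss would not be smooth; this is the point of optimizing the smoothed loss.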

Figure 2: (a) gives the distance between the robustness certificate (yellow) and the worst-case performance on testing data (pink) with an example on MNIST. The gap between the two lines indicates the tightness of our certificate (Eq. 3). (b) compares the performance of two models trained with different $c(\cdot)$s. The classifier trained with the noise included in the cost has better performance overall. (c) compares the performance of NAL with WRM on MNIST, CNN (ELU) under different γs. NAL overall has better performance than WRM. (d) compares NAL with SmoothAdv, TRADES, and STN on MNIST, CNN at γ = 0.25 and the corresponding ε. NAL does not show significant improvement when γ is small.

Since NAL and WRM bound the adversarial perturbations by the Wasserstein distance ρ, which differs from the $\ell_2$-norm perturbation range ε in SmoothAdv and TRADES, we need to establish an equivalence between the perturbation ranges of the different methods. Following the convention of Sinha et al. (2018), we choose different γs, and for each γ we generate adversarial examples $x$ by PGD with 15 iterations. We compute ρ as the expected transportation cost between the generated adversarial examples and the original inputs over the training set:
$$\varepsilon^2 = \rho(\theta) = \mathbb{E}_{P_0}\mathbb{E}_Z\big[c(x+z, x_0)\big]. \quad (8)$$
ε can then be computed accordingly. The corresponding values of γ and ε used in the experiments are given in Table 1 as well.
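Eq. 8 can be estimated by Monte Carlo over the training set. The small sketch below uses synthetic inputs: the data and the 0.3 perturbation distance are made up for illustration, and in practice the adversarial examples would come from the PGD step just described.

```python
import numpy as np

rng = np.random.default_rng(2)

def epsilon_from_rho(x_adv, x0, sigma=0.1, K=100):
    """Monte Carlo estimate of Eq. 8: eps^2 = rho = E_{P0} E_Z[c(x+z, x0)]
    with c(x, x0) = ||x - x0||_2^2, averaging K noise draws per input."""
    costs = []
    for xa, xo in zip(x_adv, x0):
        z = rng.normal(0.0, sigma, size=(K, xa.size))
        costs.append(np.mean(np.sum((xa + z - xo) ** 2, axis=-1)))
    return float(np.sqrt(np.mean(costs)))

# Synthetic stand-in for PGD outputs: each "adversarial" input sits at
# l2 distance 0.3 from its original, in d = 16 dimensions (made up).
d, n = 16, 8
x0 = rng.normal(size=(n, d))
x_adv = x0 + (0.3 / np.sqrt(d)) * np.ones(d)
eps = epsilon_from_rho(x_adv, x0, sigma=0.1)
# E||x_adv + z - x0||^2 = 0.3^2 + d * sigma^2 = 0.25, so eps is near 0.5.
assert abs(eps - 0.5) < 0.05
```

Note that the noise term contributes $d\sigma^2$ to $\varepsilon^2$, so the equivalent ε is larger than the raw PGD perturbation distance.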

NAL outperforms baselines on CIFAR-10, VGG-16 and Tiny ImageNet, ResNet-18. (a) and (c) are trained on ELU models under different γs; for the same γ, NAL exceeds WRM. (b) and (d) are trained on ReLU models with γ = 1.5 and the corresponding ε; NAL yields the highest robustness under different levels of attack, while STN has the highest clean accuracy.

By Proposition 1 in Sinha et al. (2018),
$$\inf_{\gamma \ge 0}\Big\{\mathbb{E}_{P_0}\big[\sup_{x\in\mathcal{X}}\, \ell(\theta;x) - \gamma c(x,x_0)\big] + \gamma\rho\Big\} = \sup_{P:\, W_c(P,P_0)\le\rho} \mathbb{E}_P[\ell(\theta;x)].$$

We set (K, r) = (4, 4) for all experiments. Figs. 4(a) and 5(a) compare NAL with WRM on models with ELU, whereas the rest of Figs. 4 and 5 show the comparison with SmoothAdv, TRADES, and STN on regular models. NAL has superior performance to the baselines in almost all cases.

Figure 4: NAL versus baselines on MNIST, CNN. (a) NAL versus WRM for different γs. (b) γ = 0.25. (c) γ = 1.5. (d) γ = 3. Equivalent εs are used in SmoothAdv and TRADES.

Figure 6: The comparison between the ReLU model (pink) and the ELU model (yellow) on CIFAR-10, ResNet-18 with γ = 1.5 and σ = 0.1. The ReLU model converges faster than the ELU model.

Other works (Mirman et al. (2018); Singh et al. (2018)) apply abstract interpretation to train provably robust neural networks. Our work is orthogonal to these works.

Table 1 gives the training hyperparameters in NAL; the batch size is chosen as 128. The hyperparameters used in the baselines are supplied in Appendix B.1.

Table 1: Hyperparameters and perturbation ranges on different datasets.

Testing accuracies of NAL (CIFAR-10, ResNet-18) over a variety of r and K. Under each setting, the model with the highest clean accuracy ($\ell_2$ attack radius = 0) is chosen for testing. Numbers in bold represent the best performance in defending against the attack.

The learning rate η 1 is adjusted according to different γs and εs. The noise level (σ) is the same for all methods.

Baseline hyperparameter settings. γ and ε are chosen from Table 1.

B.2 RESULTS WITH VARYING σ AND (K, r)

Different methods with different levels of noise on CIFAR-10, ResNet-18, γ = 1.5 and (K, r) = (4, 4). The best performance at the same noise level is in bold.

The table shows that a larger K admits better robustness, whereas r does not have a comparable impact.

NAL with different σs and (K, r) on CIFAR-10, ResNet-18 when γ = 1.16. The best performance under the same noise level is in bold.

Testing accuracies for the ReLU model and the ELU model on CIFAR-10, ResNet-18.

