Adversarial Boot Camp: label free certified robustness in one epoch

Abstract

Machine learning models are vulnerable to adversarial attacks. One approach to addressing this vulnerability is certification, which focuses on models that are guaranteed to be robust for a given perturbation size. A drawback of recent certified models is that they are stochastic: they require multiple computationally expensive model evaluations with random noise added to a given image. In our work, we present a deterministic certification approach which results in a certifiably robust model. This approach is based on an equivalence between training with a particular regularized loss, and the expected values of Gaussian averages. We achieve certified models on ImageNet-1k by retraining a model with this loss for one epoch without the use of label information.

1. Introduction

Neural networks are very accurate on image classification tasks, but they are vulnerable to adversarial perturbations, i.e. small changes to the model input leading to misclassification (Szegedy et al., 2014) . Adversarial training (Madry et al., 2018) improves robustness, at the expense of a loss of accuracy on unperturbed images (Zhang et al., 2019) . Model certification (Lécuyer et al., 2019; Raghunathan et al., 2018; Cohen et al., 2019) is complementary approach to adversarial training, which provides a guarantee that a model prediction is invariant to perturbations up to a given norm. Given an input x, the model f is certified to 2 norm r at x if it gives the same classification on f (x + η) for all perturbation η with norm up to r, arg max f (x + η) = arg max f (x), for all η 2 ≤ r (1) Cohen et al. (2019) and Salman et al. (2019) certify models by defining a "smoothed" model, f smooth , which is the expected Gaussian average of our initial model f at a given input example x, f smooth (x) ≈ E η [f (x + η)] (2) where the perturbation is sampled from a Gaussian, η ∼ N (0, σ 2 I). Cohen et al. (2019) used a probabilistic argument to show that models defined by (2) can be certified to a given radius by making a large number of stochastic model evaluations. Certified models can classify by first averaging the model, (Salman et al., 2019) , or by taking the mode, the most popular classification given by the ensemble (Cohen et al., 2019) . Cohen et al. and Salman et al. approximate the model f smooth stochastically, using a Gaussian ensemble, which consists of evaluating the base model f multiple times on the image perturbed by noise. Like all ensemble models, these stochastic models require multiple inferences, which is more costly than performing inference a single time. In addition, these stochastic models require training the base model f from scratch, by exposing it to Gaussian noise, in order to improve the accuracy of f smooth . Salman et al. (2019) additionally expose the model to adversarial attacks during training. In the case of certified models, there is a trade-off between certification and accuracy: the certified models lose accuracy on unperturbed images. In this work, we present a deterministic model f smooth (x) given by ( 2). Unlike the stochastic models, we do not need to train a new model from scratch with Gaussian noise data augmentations, instead, we can fine tune an accurate baseline model for one epoch, using a loss designed to encourage (2) to hold. The result is a certified deterministic model which is faster to train, and faster to perform inference. In addition, the certification radius for our model is improved, compared to previous work, on both the CIFAR-10 and ImageNet databases, see Figure 1 . Moreover, the accuracy on unperturbed images is improved on ImageNet with a Top-5 accuracy of 86.1% for our deterministic model versus 85.3% for Cohen et al. (2019) and 78.8% for Salman et al. (2019) . To our knowledge, this is the first deterministic Gaussian smoothing certification technique. The main appeal of our approach is a large decrease in the number of function evaluations for computing certifiably robust models, and a corresponding decrease in compute time at inference. Rather than stochastically sampling many times from a Gaussian at inference time, our method is certifiably robust with only a single model query. This greater speed in model evaluation is demonstrated in Table 2 . Moreover, the deterministic certified model can be obtained by retraining a model for one epoch, without using labels. The speed and flexibility of this method allows it to be used to make empirically robust models certifiably robust. To see this, we test our method on adversarially trained models from Madry et al. (2018) , increasing the certified radius of the adversarially robust model, see Table 4 . Cohen et al. (2019) and Salman et al. (2019) . Adversarial training from Madry et al. (2018) . 

2. Related work

The issue of adversarial vulnerability arose in the works of Szegedy et al. (2014) and Goodfellow et al. (2015) , and has spawned a vast body of research. The idea of training models to be robust to adversarial attacks was widely popularized in Madry et al. (2018) . This method, known as adversarial training, trains a model on images corrupted by gradientbased adversarial attacks, resulting in robust models. In terms of certification, early work by Cheng et al. (2017) provided a method of computing maximum perturbation bounds for neural networks, and reduced to solving a mixed integer optimization problem. Weng et al. (2018a) introduced non-trivial robustness bounds for fully connected networks, and provided tight robustness bounds at low computational cost. Weng et al. (2018b) proposed a metric that has theoretical grounding based on Lipschitz continuity of the classifier model and is scaleable to state-of-the-art ImageNet neural network classifiers. Zhang et al. (2018) proposed a general framework to certify neural networks based on linear and quadratic bounding techniques on the activation functions, which is more flexible than its predecessors. Training a neural network with Gaussian noise has been shown to be equivalent to gradient regularization (Bishop, 1995) . This helps improve robustness of models; however, recent work has used additive noise during training and evaluation for certification purposes. Lécuyer et al. (2019) first considered adding random Gaussian noise as a certifiable defense in a method called PixelDP. In their method, they take a known neural network architecture and add a layer of random noise to make the model's output random. The expected classification is in turn more robust to adversarial perturbations. Furthermore, their defense is a certified defense, meaning they provide a lower bound on the amount of adversarial perturbations for which their defence will always work. In a following work, Li et al. (2018) provided a defence with improved certified robustness. The certification guarantees given in these two papers are loose, meaning the defended model will always be more robust than the certification bound indicates. In contrast, Cohen et al. (2019) provided a defence utilizing randomized Gaussian smoothing that leads to tight robustness guarantees under the 2 norm. Moreover Cohen et al. used Monte Carlo sampling to compute the radius in which a model's prediction is unchanged; we refer to this method as RandomizedSmoothing. In work building on Cohen et al., Salman et al. (2019) developed an adversarial training framework called SmoothAdv and defined a Lipschitz constant of averaged models. Yang et al. (2020) generalize previous randomized smoothing methods by providing robustness guarantees in the 1 , 2 , and ∞ norms for smoothing with several non-Gaussian distributions.

3. Deterministic Smoothing

Suppose we are given a dataset consisting of paired samples (x, y) ∈ X × Y where x is an example with corresponding true classification y. The supervised learning approach trains a model f : X -→ R N c which maps images to a vector whose length equals the number of classes. Suppose f is the initial model, and let f smooth be the averaged model given by equation ( 2). Cohen et al. (2019) find a Gaussian smoothed classification model f smooth by sampling η ∼ N (0, σ 2 I) independently n times, performing n classifications, and then computing the most popular classification. In the randomized smoothing method, the initial model f is trained on data which is augmented with Gaussian noise to improve accuracy on noisy images. We take a different approach to Gaussian smoothing. Starting from an accurate pretrained model f , we now discard the training labels, and iteratively retrain a new model, f smooth using a quadratic loss between the model f and the new model's predictions, with an additional gradient regularization term. We have found that discarding the original one-hot labels and instead using model predictions helps make the model smoother. To be precise, our new models is a result of minimizing the loss which we call HeatSmoothing, E x 1 2 softmax f smooth (x) -softmax (f (x)) 2 2 + σ 2 2 ∇ x f smooth (x) 2 2 (3) The smoothing achieved by the new models is illustrated schematically in Figure 5 .

3.1. Related regularized losses

Gradient regularization is known to be equivalent to Gaussian smoothing (Bishop, 1995; LeCun et al., 1998) . Our deterministic smoothed model arises by training using the HeatSmoothing loss (3), which is designed so to ensure that (2) holds for our model. Our results is related to the early results on regularized networks (Bishop, 1995; LeCun et al., 1998) : that full gradient regularization is equivalent to Gaussian smoothing. Formally this is stated as follows. Theorem 1. (Bishop, 1995) Training a feed-forward neural-network model using the quadratic (or mean-squared error) loss, with added Gaussian noise of mean 0 and variance σ 2 to the inputs, is equivalent to training with E x f (x) -y 2 + σ 2 ∇f (x) 2 (4) up to higher order terms. The equivalence is normally used to go from models augmented with Gaussian noise to regularized models. In our case, we use the result in the other direction: we train a regularized model in order to produce a model which is equivalent to evaluating with noise. In practice, this means that rather than adding noise to regularize models for certifiable robustness, we explicitly perform a type of gradient regularization, in order to produce a model which performs as if Gaussian noise was added. See Figure 4 in Appendix D for an illustration of the effect of this gradient regularization. The gradient regularization term in the HeatSmoothing loss (3), is also related to adversarial training. Tikhonov regularization is used to produced adversarially robust models (Finlay and Oberman, 2019) . However in adversarial training, the gradient of the loss is used, rather that the gradient of the full model. Also, our loss does not use information from the true labels. The reason for these differences is due to the fact that we wish to have a model that approximates the Gaussian average of our initial model f (see Appendix A). Furthermore, minimizing the gradient-norm of the loss of the output gives us a smooth model in all directions, rather than being robust to only adversarial perturbations.

3.2. Algorithmic Details

We have found that early on in training, the value 1 2 f smooth (x) -f k (x) 2 2 may be far greater than the σ 2 2 ∇ x f smooth (x) 2 2 term. So we introduced a softmax of the vectors in the distance-squared term to reduce the overall magnitude of this term. We perform the training minimization of (3) for one epoch. The pseudo-code for our neural network weight update is given by Algorithm 1 1 Note that the ∇ x f smooth (x) 2 2 term in (3) requires the computation of a Jacobian matrix norm. In high dimensions this is computationally expensive. To approximate this term, we make use of the Johnson-Lindenstrauss lemma (Johnson and Lindenstrauss, 1984; Vempala, 2005) followed by the finite difference approximation from Finlay and Oberman (2019) . We are able to approximate ∇ x f smooth (x) 2 2 by taking the average of the product of the Jacobian matrix and Gaussian noise vectors. Jacobian-vector products can be easily computed via reverse mode automatic differentiation, by moving the noise vector w inside: w • (∇ x v(x)) = ∇ x (w • v(x)) (5) Further computation expense is reduced by using finite-differences to approximate the norm of the gradient. Once the finite-difference is computed, we detach this term from the automatic differentiation computation graph, further speeding training. More details of our implementation of these approximation techniques, and the definition of the term ĝ which is a regularization of the gradient, are presented in Appendix B. A model v set to "train" mode Current model f set to "eval" mode σ, standard deviation of Gaussian smoothing κ, number of Gaussian noise replications (default= 10) δ, finite difference step-size (default= 0.1) Update : learning-rate according to a pre-defined scheduler. for i ∈ {1, . . . N b} do Compute : f smooth (x (i) ), f (x (i) ) ∈ R N c J i = 1 2 f smooth (x (i) ) -f (x (i) ) 2 2 ∈ R for j ∈ {1, . . . κ} do Generate w = 1 √ N c (w 1 , . . . , w N c ), w 1 , . . . , w N c ∈ N (0, 1) Compute the normalized gradient ĝ via (18) Detach x (i) from the computation graph J i ← J i + σ 2 2δ 2 w • f smooth (x (i) + δĝ) -w • f smooth (x (i) ) 2 end J ← 1 N b N b i=1 J i end Update the weights of v by running backpropagation on J at the current learning rate.

3.3. Theoretical details

We appeal to partial differential equations (PDE) theory for explaining the equivalence between gradient regularization and Gaussian convolution (averaging) of the modelfoot_1 . The idea is that the gradient term which appears in the loss leads to a smoothing of the new function (model). The fact that the exact form of the smoothing corresponds to Gaussian convolution is a mathematical result which can be interpreted probabilistically or using techniques from analysis. Briefly, we detail the link as follows. Einstein (1906) showed that the function value of an averaged model under Brownian motion is related to the heat equation (a PDE); the theory of stochastic differential equations makes this rigorous (Karatzas and Shreve, 1998) . Moreover, solutions of the heat equation are given by Gaussian convolution with the original model. Crucially, in addition solutions of the heat equation can be interpreted as iterations of a regularized loss problem (called a variational energy) like that of equation 3. The minimizer of this variational energy (3) satisfies an equation which is formally equivalent to the heat equation (Gelfand et al., 2000) . Thus, taking these facts together, we see that a few steps of the minimization of the loss in (3) yield a model which approximately satisfies the heat equation, and corresponds to a model smoothed by Gaussian convolution. See Figure 4 for an illustration of a few steps of the training procedure. This result is summarized in the following theorem. Theorem 2. (Strauss, 2007) Let f be a bounded function, x ∈ R d , and η ∼ N 0, σ 2 I . Then the following are equivalent: 1. E η [f (x + η)], the expected value of Gaussian averages of f at x. 2. f * N (0, σ 2 I) (x), the convolution of f with the density of the N (0, σ 2 I) distribution evaluated at x.

3.. The solutions of the heat equation,

∂ ∂t f (x, t) = σ 2 2 ∆ x f (x, t) (6) at time t = 1, with initial condition f (x, 0) = f (x). In Appendix A, we use Theorem 2 to show the equivalence of training with noise and iteratively training (3). To assess how well our model approximates the Gaussian average of the initial model, we compute the certified 2 radius for averaged models introduced in Cohen et al. (2019) . A larger radius implies a better approximation of the Gaussian average of the initial model. We compare our models with stochastically averaged models via certified accuracy. This is the fraction of the test set which a model correctly classifies at a given radius while ignoring abstained classifications. Throughout, we always use the same σ value for certification as for training. In conjunction with the certification technique of Cohen et al., we also provide the following theorem, which describes a bound based on the Lipschitz constant of a Gaussian averaged model. We refer to this bound as the L-bound, which demonstrates the link between Gaussian averaging and adversarial robustness. Theorem 3 (L-bound). Suppose f smooth is the convolution (average) of f : R d → [0, 1] N c with a Gaussian kernel of variance σ 2 I, f smooth (x) = f * N (0, σ 2 I) (x) Then any perturbation δ which results in a change of rank of the k-th component of f smooth (x) must have norm bounded as follows: δ 2 ≥ σ(π/2) 1/2 (f smooth (x) (k) -f smooth (x) (k+1) ) (7) where f smooth (x) (i) is the i th largest value in the vector f smooth (x) ∈ [0, 1] N c . See Appendix C for proof. This bound is equally applicable to deterministic or stochastically averaged models. In stochastically averaged models f smooth (x) is replaced by the stochastic approximation of E η∼N (0,σ 2 I) [f (x + η)].

3.4. Adversarial Attacks

To test how robust our model is to adversarial examples, we calculate the minimum 2 adversarial via our L-bound and we attack our models using the projected gradient descent (PGD) (Kurakin et al., 2017; Madry et al., 2018) and decoupled direction and norm (DDN) (Rony et al., 2019) methods. These attacks are chosen because there is a specific way they can be applied to stochastically averaged models (Salman et al., 2019) . In the 2 setting of both attacks, it is standard to take the step g = α ∇ δt L (f (x + δ t ), y) ∇ δt L (f (x + δ t ), y) 2 (8) in the iterative algorithm. Here, x is an input example with corresponding true class y; δ t denotes the adversarial perturbation at its current iteration; L denotes the cross-entropy Loss function (or KL Divergence); ε is the maximum perturbation allowed; and α is the step-size. In the stochastically averaged model setting, the step is given by g n = α n i=1 ∇ δt L (f (x + δ t + η i ), y) n i=1 ∇ δt L (f (x + δ t + η i ), y) 2 where η 1 , . . . , η n iid ∼ N (0, σ 2 I). For our deterministically averaged models, we implement the update (8). This is because our models are deterministic, meaning there is no need to sample noise at evaluation time. For stochastically averaged models (Cohen et al., 2019; Salman et al., 2019) , we implement the update (9). 

4. Experiments & Results

We now execute our method on the ImageNet-1k dataset (Deng et al., 2009) with the ResNet-50 model architecture. The initial model f was trained on clean images for 29 epochs with the cross-entropy loss function. Due to a lack of computing resources, we had to modify the training procedure (3) and Algorithm 1 to obtain our smoothed model f smooth . This new training procedure amounts to minimizing the loss function 1 2 softmax f smooth (x + η) -softmax (f (x)) 2 2 + σ 2 2 ∇ x f smooth (x + η) 2 2 for only 1 epoch of the training set using stochastic gradient descent at a fixed learning rate of 0.01 and with σ = 0.25. This is needed because the output vectors in the ImageNet setting are of length 1,000. Using softmax in the calculation of the 2 distance metric prevents the metric from dominating the gradient-penalty term and the loss blowing up. Furthermore, we add noise η ∼ N 0, σ 2 I to half of the training images.

4.1. Comparison to Stochastic Methods via Certified Radii

We compare our results to a pretrained RandomizedSmoothing ResNet-50 model with σ = 0.25 provided by Cohen et al. (2019) . We also compare to a pretrained SmoothAdv ResNet-50 model trained with 1 step of PGD and with a maximum perturbation of ε = 0.5 with σ = 0.25 provided by Salman et al. (2019) . To assess certified accuracy, we run the Certify algorithm from Cohen et al. (2019) with n 0 = 25, n = 1, 000, σ = 0.25 for the stochastically trained models. We realize that this may not be an optimal number of noise samples, but it was the most our computational resources could handle. For the HeatSmoothing model, we run the same certification algorithm but without running SamplingUnderNoise to compute ĉA . For completeness, we also certify the baseline model f 0 . Certification results on 5,000 ImageNet test images are presented in Figure 1b . We see that our model is indeed comparable to the stochastic methods presented in earlier paper, despite the fact that we only needed one training epoch. Note that CIFAR-10 results are presented in Appendix E. 

5. Conclusion

Randomized smoothing is a well-known method to achieve a Gaussian average of some initial neural network. This is desirable to guarantee that a model's predictions are unchanged given perturbed input data. In this work, we used a regularized loss to obtain deterministic Gaussian averaged models. By computing 2 certified radii, we showed that our method is comparable to previously-known stochastic methods. This is confirmed by attacking our models, which results in adversarial distances similar to those seen with stochastically smoothed models. We also developed a new lower bound on perturbations necessary to throw off averaged models, and used it as a measure of model robustness. Lastly, our method is less computationally expensive in terms of inference time (see Table 2 ).

A Solving the heat equation by training with a regularized loss function

Theorem 2 tells us that training a model with added Gaussian noise is equivalent to training a model to solve the heat equation. We can discretize the heat equation ( 6) to obtain f k+1 -f k h = σ 2 2 ∆f k+1 for k = 0, . . . , n T -1, where n T is the fixed number of timesteps, h = 1/n T , and f 0 = f , our initial model. Notice how, using the Euler-Lagrange equation, we can express f k+1 in (11) as the variational problem f k+1 = argmin v 1 2 R d v(x) -f k (x) 2 + hσ 2 2 ∇ x v(x) 2 2 ρ(x)dx ( ) where ρ is the density from which our clean data comes form. Therefore, this is equivalent to solving f k+1 = argmin v E x v(x) -f k (x) 2 + hσ 2 2 ∇ x v(x) 2 2 (13) Note that the minimizer of the objective of ( 12) satisfies v -f k = hσ 2 2 ∆v which matches (11) if we set f k+1 = v. In the derivation of ( 14), we take for granted the fact that empirically, ρ is approximately uniform and is therefore constant. In the end, we iteratively compute ( 13) and obtain models f 1 , . . . , f n T , setting v = f n T , our smoothed model. Something to take note of is that our model outputs be vectors whose length corresponds to the total number of classes; therefore, the objective function in (13) will not be suitable for vector-valued outputs f k (x) and v(x). We instead use the following update f k+1 = argmin v E x 1 2 v(x) -f k (x) 2 2 + hσ 2 2 ∇ x v(x) 2 2 B Approximating the gradient-norm regularization term By the Johnson-Lindenstrauss Lemma (Johnson and Lindenstrauss, 1984; Vempala, 2005) , ∇ x v(x) 2 2 has the following approximation, ∇ x v(x) 2 2 ≈ κ i=1 ∇ x (w i • v(x)) 2 2 ≈ κ i=1 (w i • v (x + δĝ i )) -(w i • v(x)) δ 2 ( ) where w i = 1 √ K (w i1 , . . . , w iK ) T ∈ R K , w ij iid ∼ N (0, 1) and ĝi is given by ĝi = ∇x(wi•v(x)) ∇x(wi•v(x)) 2 if ∇ x (w i • v(x)) = 0 0 otherwise In practice, we set δ = 0.1, κ = 10, and K = N c, the total number of classes.

C Proof of Theorem 3

Proof. Suppose the loss function is Lipschitz continuous with respect to model input x, with Lipschitz constant L. Let 0 be such that if (x) < 0 , the model is always correct. Then by Proposition 2.2 in Finlay and Oberman (2019) , a lower bound on the minimum magnitude of perturbation δ necessary to adversarially perturb an image x is given by δ 2 ≥ max { 0 -(x), 0} L (19) By Lemma 1 of Appendix A in Salman et al. (2019) , our averaged model 



Code and links to trained models are posted in Supplemental Materials We sometimes interchange the terms Gaussian averaging and Gaussian convolution; they are equivalent, as shown in Theorem 2.



Figure 1: Certified accuracy as a function of 2 radius.

HeatSmoothing Neural Network Weight UpdateInput: Minibatch of input examples x (mb) = x (1) , . . . , x(N b)

Figure 2: Attack curves: % of images successfully attacked as a function of 2 adversarial distance.

smooth (x) = f * N (0, σ 2 I) (x) has Lipschitz constant L = 1 σ 2 π . Replacing L in (19) with 1 σ and setting 0 = f smooth (x) (k) , (x) = f smooth (x) (k+1)gives us the proof.D Illustration of regularized training

Figure 4: Illustration of gradient regularization (4) in the binary classification setting. The lighter line represents classification boundary for original model with large gradients, and the darker line represents classification boundary of the smoothed model. The symbols indicate the classification by the original model: a single red circle is very close to many blue stars. The smoothed model has a smoother classification boundary which flips the classification of the outlier.

Figure 5: Illustration of performing the iterative model update (15) for 8 timesteps in the binary classification setting. The dashed black line represents our decision boundary. The blue line represents our current classification model. The blue stars and red circles represent our predicted classes using the current model iteration. Consider the datapoint at x = -0.5. In the initial model f 0 , the adversarial distance is ≈ 0.10. In model f 5 , the adversarial distance is increased to ≈ 0.35.



Average classification inference time (seconds)

2 adversarial distance metrics on ImageNet-1k

2 certified radii summary statistics for robust models on

4.2. Comparison to Stochastic Methods via Adversarial Attacks

Next, we attack our four models using PGD and DDN. For the stochastic models, we do 25 noise samples to compute the loss. We run both attacks with max 2 perturbation of = 4.0 until top-5 misclassification is achieved or until 20 steps are reached. Results on 1,000 ImageNet test images are presented in Table 3 and Figures 2a and 2b . We see that our model is comparable to the stochastic models, but does not outperform them. In Figure 2a , it is clear that the model presented in Salman et al. (2019) performs best, since this model was trained on PGD-corrupted images. Note that CIFAR-10 results are presented in Appendix E.

4.3. Comparing Models Using Their Classifications

Recall that the goal of this work is to use deterministic methods to obtain an averaged model equivalent to that of Cohen et al. (2019) . One way of measuring the similarity is to compare the predictions of both models. To do this, we consider a pretrained stochastically averaged model with σ = 0.25 and our deterministic model trained using (10). We randomly select 1, 000 ImageNet test images and assess the difference between evaluating our model in a single forward pass vs. the stochastic method presented in Cohen et al.. In Figure 3a , we see how often a model's isnge forward pass top-1 prediction matches with predictions of this model using Cohen et al.'s Predict algorithm, conditioning on the single forward pass prediction being correct. In Figure 3b , we then see how often our deterministic model agrees with Cohen et al.'s stochastic model's prediction. In these plots, for stochastic prediction, we fix the number of Gaussian replications n = 100. We see that our model's predictions are the same when doing a single forward pass and doing a stochastic prediction with high probability. We also see that our model's deterministic predictions match a high-performing stochastically averaged model's predictions with high probability.

4.4. Certifying Robust Models

So far, we have showed that we can take a non-robust baseline model and make it certifiably robust by retraining for one epoch with a regularized loss (10). A natural question arises: can we use this method to make robust models certifiably robust? To test this, we begin with an adversarially trained model (Madry et al., 2018) . This pretrained model was downloaded from Madry's "Robustness" GitHub repository and was trained with images corrupted by the L 2 PGD attack with maximum perturbation size = 3.0. We certify this model by retraining it with (10) for one epoch using stochastic gradient descent with fixed learning rate 0.01. In Table 4 , we compute the 2 certified radius from Cohen et al. (2019) for these models using 1,000 ImageNet-1k test images with σ = 0.25. The certified radii for the model trained with the loss function (10) are significantly higher than those of the adversarially trained model from Madry et al. (2018) . We test our method on the CIFAR-10 dataset (Krizhevsky et al., 2009) Cohen et al., 2019) . We also trained a SmoothAdv model (Salman et al., 2019) for 4 steps of PGD with the maximum perturbation set to ε = 0.5. To assess certified accuracy, we run the Certify algorithm from Cohen et al. (2019) with n 0 = 100, n = 10, 000, σ = 0.1 for the stochastically trained models. For the HeatSmoothing model, we run the same certification algorithm, but without running SamplingUnderNoise to compute ĉA . For completeness, we also certify the baseline model f 0 . Certification plots are presented in Figure 1a . In this plot, we see that our model's 2 certified accuracy outperforms the stochastic models. Next, we attack our four models using PGD and DDN. For the stochastic models, we do 100 noise samples to compute the loss. We run both attacks with 20 steps and maximum perturbation ε = 4.0 to force top-1 misclassification. Results are presented in Table 5 and Figures 6a and 6b . In Table 5 , we see that HeatSmoothing outperforms the stochastic models in terms of robustness. The only exception is robustness to mean PGD perturbations. This is shown in Figures 6a. Our 

