ENHANCING CERTIFIED ROBUSTNESS OF SMOOTHED CLASSIFIERS VIA WEIGHTED MODEL ENSEMBLING

Abstract

Randomized smoothing has achieved state-of-the-art certified robustness against l_2-norm adversarial attacks. However, it remains unresolved how to find the optimal base classifier for randomized smoothing. In this work, we employ a Smoothed WEighted ENsembling (SWEEN) scheme to improve the performance of randomized smoothed classifiers. We establish an ensembling-generality result showing that SWEEN can achieve optimal certified robustness, and our theoretical analysis further proves that the optimal SWEEN model can be obtained from training under mild assumptions. We also develop an adaptive prediction algorithm to reduce the prediction and certification cost of SWEEN models. Extensive experiments show that SWEEN models outperform the upper envelope of their corresponding candidate models by a large margin. Moreover, SWEEN models constructed from a few small candidate models can achieve performance comparable to a single large model with a notable reduction in training time.

1. INTRODUCTION

Deep neural networks have achieved great success in image classification tasks. However, they are vulnerable to adversarial examples: small, imperceptible perturbations of the original inputs that cause misclassification (Biggio et al., 2013; Szegedy et al., 2014). To tackle this problem, researchers have proposed various defense methods to train classifiers robust to adversarial perturbations. These defenses can be roughly categorized into empirical defenses and certified defenses. One of the most successful empirical defenses is adversarial training (Kurakin et al., 2017; Madry et al., 2018), which optimizes the model by minimizing the loss over adversarial examples generated during training. Empirical defenses produce models robust to certain attacks without a theoretical guarantee: most are heuristic and subsequently broken by more sophisticated adversaries (Carlini & Wagner, 2017; Athalye et al., 2018; Uesato et al., 2018; Tramer et al., 2020). Certified defenses, either exact or conservative, are introduced to mitigate this deficiency of empirical defenses. In the context of l_p norm-bounded perturbations, exact methods report whether an adversarial example exists within an l_p ball of radius r centered at a given input x. Exact methods are usually based on satisfiability modulo theories (Katz et al., 2017; Ehlers, 2017) or mixed-integer linear programming (Lomuscio & Maganti, 2017; Fischetti & Jo, 2017), which are computationally inefficient and not scalable (Tjeng et al., 2019).
Conservative methods are more computationally efficient, but might mistakenly flag a safe data point as vulnerable to adversarial examples (Raghunathan et al., 2018a; Wong & Kolter, 2018; Wong et al., 2018; Gehr et al., 2018; Mirman et al., 2018; Weng et al., 2018; Zhang et al., 2018; Raghunathan et al., 2018b; Dvijotham et al., 2018b; Singh et al., 2018; Wang et al., 2018b; Salman et al., 2019b; Croce et al., 2019; Gowal et al., 2018; Dvijotham et al., 2018a; Wang et al., 2018a). However, both types of defenses fail to scale to practical networks that perform well on modern machine learning problems (e.g., the ImageNet classification task (Deng et al., 2009)).

Recently, a new certified defense technique called randomized smoothing has been proposed (Lecuyer et al., 2019; Cohen et al., 2019). A (randomized) smoothed classifier is constructed from a base classifier, typically a deep neural network, and outputs the most probable class given by its base classifier under random noise perturbations of the input. Randomized smoothing is scalable because it is agnostic to the architecture of the base classifier, and it has achieved state-of-the-art certified l_2-robustness. In theory, randomized smoothing applies to any classifier. However, naively applying randomized smoothing to standard-trained classifiers leads to poor robustness, and it remains unresolved how a base classifier should be trained so that the corresponding smoothed classifier has good robustness properties. Recently, Salman et al. (2019a) employ adversarial training to train base classifiers and substantially improve the performance of randomized smoothing, which indicates that techniques originally proposed for empirical defenses can be useful in finding good base classifiers for randomized smoothing. In this paper, we take a step towards finding suitable base models for randomized smoothing via model ensembling. The idea of model ensembling has been used in various empirical defenses against adversarial examples and shows promising results for robustness (Liu et al., 2018; Strauss et al., 2018; Pang et al., 2019; Wang et al., 2019; Meng et al., 2020; Sen et al., 2020). Moreover, an ensemble can combine the strengths of its candidate models to achieve superior clean accuracy (Hansen & Salamon, 1990; Krogh & Vedelsby, 1994). Thus, we believe ensembling several smoothed models can improve both robustness and accuracy. Specifically for randomized smoothing, the smoothing operator commutes with the ensembling operator: ensembling several smoothed models is equivalent to smoothing an ensembled base model.
This property makes the combination natural and efficient. We therefore construct an ensembled base model from pre-trained candidate models and optimize the ensemble weights for randomized smoothing. We refer to the final model as a Smoothed WEighted ENsembling (SWEEN) model. Moreover, SWEEN does not restrict how the individual candidate classifiers are trained, and is thus compatible with most previously proposed training algorithms for randomized smoothing. Our contributions are summarized as follows:

1. We propose SWEEN to substantially improve the performance of smoothed models. Theoretical analysis establishes ensembling generality and an optimization guarantee: SWEEN can achieve optimal certified robustness w.r.t. the defined γ-robustness index, an extension of previously proposed criteria of certified robustness (Lemma 1), and SWEEN can be easily trained to a near-optimal risk with a surrogate loss (Theorem 2).

2. We develop an adaptive prediction algorithm for the weighted ensemble, which effectively reduces the prediction and certification cost of the smoothed ensemble classifier.

3. We evaluate our proposed method through extensive experiments. On all tasks, SWEEN models consistently outperform the upper envelopes of their respective candidate models in terms of approximated certified accuracy by a large margin. In addition, SWEEN models can achieve comparable or superior performance to a large individual model using a few candidates, with a notable reduction in total training time.

2. RELATED WORK

In the past few years, numerous defenses have been proposed to build classifiers robust to adversarial examples. Our work mainly concerns randomized smoothing and model ensembling.

Randomized smoothing. Randomized smoothing constructs a smoothed classifier from a base classifier by convolving the base classifier with a noise distribution. It is first proposed as a heuristic defense (Liu et al., 2018; Cao & Gong, 2017). Lecuyer et al. (2019) first prove robustness guarantees for randomized smoothing utilizing tools from differential privacy. Subsequently, a stronger robustness guarantee is given by Li et al. (2018). Cohen et al. (2019) provide a tight robustness bound for isotropic Gaussian noise in the l_2 robustness setting. The theoretical properties of randomized smoothing under various norms and noise distributions have been further discussed in the literature (Blum et al., 2020; Kumar et al., 2020; Yang et al., 2020; Lee et al., 2019; Teng et al., 2019; Zhang et al., 2020). Recently, a series of works (Salman et al., 2019a; Zhai et al., 2020) develop practical algorithms to train a base classifier for randomized smoothing. Our work improves the performance of smoothed classifiers via weighted ensembling of pre-trained base classifiers.

Model ensembling. Model ensembling has been widely studied and applied in machine learning as a technique to improve the generalization performance of a model (Hansen & Salamon, 1990; Krogh & Vedelsby, 1994). Krogh & Vedelsby (1994) show that ensembles constructed from accurate and diverse networks perform better. Recently, simple averaging of multiple neural networks has been successful in ILSVRC competitions (He et al., 2016; Krizhevsky et al., 2017; Simonyan & Zisserman, 2015). Model ensembling has also been used in defenses against adversarial examples (Liu et al., 2018; Strauss et al., 2018; Pang et al., 2019; Wang et al., 2019; Meng et al., 2020; Sen et al., 2020). Wang et al. (2019) show that a jointly trained ensemble of noise-injected ResNets can improve clean and robust accuracies. Recently, Meng et al. (2020) find that ensembling diverse weak models can be quite robust to adversarial attacks. Unlike the above works, which are empirical or heuristic, we employ ensembling within randomized smoothing to provide a theoretical robustness certification.

3. PRELIMINARIES

Notation. Let Y = {1, 2, ..., M}. We slightly overload notation, also letting k refer to the M-dimensional one-hot vector whose k-th entry is 1, for k = 1, ..., M; the choice will be clear from context. Let ∆_k = {(p_1, p_2, ..., p_k) : p_i ≥ 0, Σ_{i=1}^k p_i = 1} be the k-dimensional probability simplex for k ∈ N_+, and let ∆ = ∆_M. For an M-dimensional function f, we use f_i to refer to its i-th entry, i = 1, 2, ..., M. We use N(0, σ²I) to denote the d-dimensional Gaussian distribution with mean 0 and covariance matrix σ²I. We use Φ^{-1} to denote the inverse of the standard Gaussian CDF, and Γ to denote the gamma function. We use R_* to denote the set of non-negative real numbers. For x, a, b ∈ R with a ≤ b, we define clip(x; a, b) = min{max{x, a}, b}. We use Ω(·) to denote Big-Omega notation that suppresses multiplicative constants.

Neural network and classifier

Consider a classification problem from X ⊆ R^d to classes Y. Assume the input space X has finite diameter D = sup_{x_1,x_2∈X} ‖x_1 − x_2‖_2 < ∞. The training set {(x_i, y_i)}_{i=1}^n is drawn i.i.d. from the data distribution D. We call f a probability function or a classifier if it is a mapping from R^d to ∆ or to Y, respectively. For a probability function f, its induced classifier f* is defined by f*(x) = arg max_{1≤i≤M} f_i(x). For simplicity, we will not distinguish between f and f* when there is no ambiguity, and hence all definitions and properties for classifiers automatically apply to probability functions as well. f(·; θ) denotes a neural network parameterized by θ ∈ Θ. Here Θ can include hyper-parameters, so the architectures of the f(·; θ) need not be identical.

Certified robustness

We call x + δ an adversarial example for a classifier f if f correctly classifies x but f(x + δ) ≠ f(x). Usually ‖δ‖_2 is small enough that x + δ and x appear almost identical to the human eye. The (l_2-)robust radius of f is defined as r(x, y; f) = inf_{f(x+δ)≠y} ‖δ‖_2, which is the radius of the largest l_2 ball centered at x within which f consistently predicts the true label y of x. Note that r(x, y; f) = 0 if f(x) ≠ y. As mentioned before, we can extend the above definitions to the case where f is a probability function by considering the induced classifier f*. A certified robustness method tries to find a lower bound r_c(x, y; f) of r(x, y; f); we call r_c a certified radius of f.

Randomized smoothing. Let f be a probability function or a classifier. The (randomized) smoothed function of f is defined as g(x) = E_{δ∼N(0,σ²I)}[f(x + δ)]. The (randomized) smoothed classifier of f is then defined as g*. Cohen et al. (2019) first provide a tight robustness guarantee for smoothed classifiers built from classifiers, summarized in the following theorem:

Theorem 1 (Cohen et al. (2019)). For any classifier f, denote its smoothed function by g. Then r(x, y; g) ≥ (σ/2)[Φ^{-1}(g_y(x)) − Φ^{-1}(max_{k≠y} g_k(x))].

Later on, Salman et al. (2019a) and Zhai et al. (2020) extend Theorem 1 to probability functions.
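As a concrete illustration, the bound of Theorem 1 is straightforward to compute from the smoothed class probabilities; a minimal sketch (the function name and 0-indexed labels are ours, not from the paper):

```python
from statistics import NormalDist

def certified_radius(probs, y, sigma):
    """Certified l2 radius from Theorem 1 (Cohen et al., 2019):
    (sigma / 2) * [Phi^{-1}(g_y(x)) - Phi^{-1}(max_{k != y} g_k(x))].
    `probs` is the smoothed probability vector g(x); `y` is the true label
    (0-indexed here)."""
    phi_inv = NormalDist().inv_cdf
    p_top = probs[y]
    runner_up = max(p for k, p in enumerate(probs) if k != y)
    if p_top <= runner_up:      # induced classifier is wrong at x: radius 0
        return 0.0
    return sigma / 2 * (phi_inv(p_top) - phi_inv(runner_up))
```

For example, with g(x) = (0.9, 0.06, 0.04), true label 0, and σ = 0.5, the certified radius is about 0.71; the radius grows as the gap between the top probability and the runner-up widens.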

4. SWEEN: SMOOTHED WEIGHTED ENSEMBLING

In this section, we describe the SWEEN framework we use. We also present some theoretical results for SWEEN models. The proofs of the results in this section can be found in Appendix A.

4.1. SWEEN: OVERVIEW

To be specific, we adopt a data-dependent weighted average of neural networks as the base model for smoothing. Suppose we have pre-trained neural networks f(·; θ_1), ..., f(·; θ_K) as ensemble candidates. A weighted ensemble model is then f_ens(·; θ, w) = Σ_{k=1}^K w_k f(·; θ_k), where θ = (θ_1, ..., θ_K) ∈ Θ^K and w ∈ ∆_K is the ensemble weight. For a specific f_ens, the corresponding SWEEN model is defined as the smoothed function of f_ens, denoted by g_ens. We have

g_ens(x; θ, w) = E_δ[Σ_{k=1}^K w_k f(x + δ; θ_k)] = Σ_{k=1}^K w_k E_δ[f(x + δ; θ_k)] = Σ_{k=1}^K w_k g(x; θ_k),

where g(·; θ) is the smoothed function of f(·; θ). This means that g_ens is the weighted sum of the smoothed functions of the candidate models under the same weight w; more briefly, randomized smoothing and weighted ensembling commute. Thus, ensembling under the randomized smoothing framework can provide benefits in both accuracy and robustness. To find the optimal SWEEN model, we minimize a surrogate loss of g_ens over the training set to obtain appropriate weights. These data-dependent weights make the ensemble robust to the presence of biased candidate models, since such candidates are assigned small weights.
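The commutativity above can be checked numerically: with a shared set of noise samples, the Monte Carlo estimate of the ensemble of smoothed models coincides with that of the smoothed ensemble. A toy sketch with two random linear-softmax "candidates" standing in for trained networks (all names and shapes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d, M, sigma, n_samples = 4, 3, 0.5, 1000

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Two toy candidate probability functions f(.; theta_k): R^d -> simplex,
# standing in for pre-trained networks.
W1 = rng.normal(size=(d, M))
W2 = rng.normal(size=(d, M))
f1 = lambda z: softmax(z @ W1)
f2 = lambda z: softmax(z @ W2)

w = np.array([0.7, 0.3])                    # ensemble weight in the simplex
x = rng.normal(size=d)
noise = rng.normal(scale=sigma, size=(n_samples, d))   # shared delta ~ N(0, sigma^2 I)

# (a) smooth each candidate, then take the weighted ensemble
g1 = f1(x + noise).mean(axis=0)
g2 = f2(x + noise).mean(axis=0)
ens_of_smoothed = w[0] * g1 + w[1] * g2

# (b) ensemble the base models first, then smooth the ensemble
f_ens = lambda z: w[0] * f1(z) + w[1] * f2(z)
smoothed_ens = f_ens(x + noise).mean(axis=0)
```

By linearity of the Monte Carlo average, (a) and (b) agree up to floating-point error, and the result is again a probability vector.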

4.2. CERTIFIED ROBUSTNESS OF SWEEN MODELS

For a smoothed function g, the certified radius at (x, y) provided by Theorem 1 is r_c(x, y; g) = clip((σ/2)[Φ^{-1}(g_y(x)) − Φ^{-1}(max_{k≠y} g_k(x))]; 0, D). We now formally define the γ-robustness index as a criterion of certified robustness.

Definition 1 (γ-robustness index). For γ : R_* → R_* and a smoothed function g, the γ-robustness index of g is defined as I_γ(g) = E_{(x,y)∼D} γ(r_c(x, y; g)).

The γ-robustness index generalizes many frequently-used criteria of certified robustness for smoothed classifiers.

Proposition 1. Let γ_1(r) = 1{r ≥ R}, γ_2(r) = r, and γ_3(r) = (π^{d/2}/Γ(d/2 + 1)) r^d. Then the γ_1-robustness index is the certified accuracy at radius R (Cohen et al., 2019); the γ_2-robustness index is the average certified radius (Zhai et al., 2020); and the γ_3-robustness index is the average volume of the certified region.

We note that criteria considering the volume of the certified region are in a sense more comprehensive than those considering only the certified radii, as they take the input dimension into account. Now consider F = {f(·; θ) : R^d → ∆ | θ ∈ Θ}, the set of neural networks parametrized over Θ. The corresponding set of smoothed functions is G = {g(x; θ) = E_{δ∼N(0,σ²I)}[f(x + δ; θ)] | θ ∈ Θ}. Suppose θ_1, ..., θ_K are drawn i.i.d. from a fixed probability distribution p on Θ. The set of SWEEN models is then

F̂_θ = {φ(x) = Σ_{k=1}^K w_k g(x; θ_k) | w_k ≥ 0, Σ_{k=1}^K w_k = 1}.

Similar to Rahimi & Recht (2008), we consider mixtures of the form φ(x) = ∫_Θ w(θ) g(x; θ) dθ. For a mixture φ, we define ‖φ‖_p := sup_θ |w(θ)/p(θ)|, and define

F_p = {φ(x) = ∫_Θ w(θ) g(x; θ) dθ | ‖φ‖_p < ∞, w(θ) ≥ 0, ∫_Θ w(θ) dθ = 1}.

Note that every φ ∈ F_p is a smoothed probability function, so intuitively F_p is quite a rich set. The following result shows that, with high probability, the best γ-robustness index a SWEEN model can attain is close to the optimal γ-robustness index over F_p.
Thus, the ensembling generality also holds for the γ-robustness index we defined for robustness.

Lemma 1. Suppose γ is a Lipschitz function and η > 0 is given. For any ε > 0 and sufficiently large K, with probability at least 1 − η over θ_1, ..., θ_K drawn i.i.d. from p, there exists φ̂ ∈ F̂_θ which satisfies

I_γ(φ̂) > sup_{φ∈F_p} I_γ(φ) − ε.

Moreover, if there exists φ_0 ∈ F_p such that I_γ(φ_0) = sup_{φ∈F_p} I_γ(φ), then K = Ω(1/ε⁴) suffices.

In practice, the defined robustness index I_γ(·) may be hard to optimize directly, in which case we choose a surrogate loss function l : R^M × Y → R to approximate it. The optimization of the ensemble weight w of a SWEEN model over a training set {(x_i, y_i)}_{i=1}^n can then be formulated as

min_{w∈∆_K} (1/n) Σ_{i=1}^n l(Σ_{k=1}^K w_k g(x_i; θ_k), y_i).

However, this process typically involves Monte Carlo simulation, since we only have access to f(·; θ_k), k = 1, ..., K. We define the risk and empirical risk w.r.t. the surrogate loss l.

Definition 2 (Risk and empirical risk). For a surrogate loss function l : R^M × Y → R, the risk of a probability function φ is defined as R[φ] = E_{(x,y)∼D} l(φ(x), y). If φ(x) = Σ_{k=1}^K w_k g(x; θ_k) ∈ F̂_θ, then for a training set {(x_i, y_i)}_{i=1}^n and sample size s, the empirical risk of φ is defined as

R_emp[φ] = (1/n) Σ_{i=1}^n l(Σ_{k=1}^K w_k [(1/s) Σ_{j=1}^s f(x_i + δ_ijk; θ_k)], y_i),

where the δ_ijk are drawn i.i.d. from N(0, σ²I), 1 ≤ i ≤ n, 1 ≤ j ≤ s, 1 ≤ k ≤ K.

Solving for w is thus reduced to finding the minimizer of R_emp. When the loss function l is convex, this is a low-dimensional convex optimization problem, so we can obtain the global empirical risk minimizer using standard convex optimization algorithms. Furthermore, we have:

Theorem 2. Suppose that for all y ∈ Y, l(·, y) is a Lipschitz function with constant L and is uniformly bounded. Given η > 0.
For any ε > 0 and sufficiently large K, if n = Ω(K²/ε²) and s = Ω(log(Kn)/ε²), then with probability at least 1 − η over the training dataset {(x_i, y_i)}_{i=1}^n drawn i.i.d. from D, the parameters θ_1, ..., θ_K drawn i.i.d. from p, and the noise samples drawn i.i.d. from N(0, σ²I), the empirical risk minimizer φ̂ over F̂_θ satisfies

R[φ̂] − inf_{φ∈F_p} R[φ] < ε.

Moreover, if there exists φ_0 ∈ F_p such that R[φ_0] = inf_{φ∈F_p} R[φ], then K = Ω(1/ε⁴) suffices.

Theorem 2 guarantees that, for large enough K, n, and s, the gap between the risk of the empirical risk minimizer φ̂ and inf_{φ∈F_p} R[φ] can be made arbitrarily small with high probability. Note that we can solve for φ̂ to any given precision when l is convex. Moreover, Theorem 2 shows the attainability of the φ̂ in Lemma 1 when l approximates the γ-robustness index well. While the number of candidate models and the number of training samples need to be large to ensure these theoretical properties, we will show in Section 5 that the performance of SWEEN models in practical settings is already good.
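When l is the cross-entropy loss, the weight optimization above is convex in w, so projected gradient descent over the simplex reaches the global minimizer. A self-contained sketch (the simplex-projection routine and the toy data are ours, not from the paper):

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def fit_ensemble_weights(G, y, steps=500, lr=0.5):
    """Minimize the empirical cross-entropy of the weighted ensemble over the
    simplex. G[k, i] is candidate k's smoothed probability vector at training
    point i (assumed already estimated); y[i] is the label. The objective
    -mean_i log(sum_k w_k G[k, i, y_i]) is convex in w, so projected gradient
    descent finds the global minimizer."""
    K, n, _ = G.shape
    w = np.full(K, 1.0 / K)
    p_true = G[:, np.arange(n), y]            # (K, n): prob. of the true class
    for _ in range(steps):
        mix = w @ p_true                      # ensemble prob. of the true class
        grad = -(p_true / mix).mean(axis=1)   # gradient of the CE loss w.r.t. w
        w = project_simplex(w - lr * grad)
    return w

# Toy demo (synthetic): candidate 0 is accurate, candidate 1 is not.
n = 40
y_demo = np.zeros(n, dtype=int)
G_demo = np.empty((2, n, 3))
G_demo[0] = np.tile([0.8, 0.1, 0.1], (n, 1))
G_demo[1] = np.tile([0.2, 0.4, 0.4], (n, 1))
w_demo = fit_ensemble_weights(G_demo, y_demo)
```

On the toy data, the optimizer pushes essentially all of the weight onto the accurate candidate, illustrating how biased candidates receive small weights.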

4.3. ADAPTIVE PREDICTION ALGORITHM

A major drawback of ensembling is the high execution cost during inference, which for smoothed classifiers consists of prediction and certification costs. The evaluation of smoothed classifiers relies on Monte Carlo simulation, which is computationally expensive. For instance, Cohen et al. (2019) use 100 Monte Carlo samples for prediction and 100,000 samples for certification. If we use 100 candidate models to construct a SWEEN model, certifying a single data point requires 10,010,000 local evaluations (10,000 for prediction and 10,000,000 for certification). Inoue (2019) observes that ensembling does not improve predictions for inputs predicted with high probability, even when they are mispredicted, and proposes an adaptive ensemble prediction algorithm to reduce the execution cost of unweighted ensemble models. We modify this algorithm to make it applicable to weighted ensemble models; the details are given in Appendix B.1. For a data point, classifiers are evaluated in descending order of their weights. Whenever an early-exit condition is satisfied, we stop the evaluation and return the current prediction.
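A simplified version of the early-exit idea can be sketched as follows: since each candidate's output lies in the simplex, the unevaluated candidates can shift any class score by at most their remaining total weight, so evaluation can stop once the current leader's margin exceeds that remaining weight. This is an illustrative rule, not the exact algorithm of Appendix B.1:

```python
import numpy as np

def adaptive_predict(models, weights, x):
    """Early-exit weighted ensemble prediction (simplified sketch). `models[k](x)`
    returns candidate k's estimated smoothed probability vector. Candidates are
    evaluated in descending weight order; we stop once the leader's margin
    exceeds the total weight of the unevaluated candidates."""
    order = np.argsort(weights)[::-1]
    acc = 0.0
    remaining = float(np.sum(weights))
    evaluated = 0
    for k in order:
        acc = acc + weights[k] * np.asarray(models[k](x))
        remaining -= weights[k]
        evaluated += 1
        top2 = np.sort(acc)[-2:]
        if top2[1] - top2[0] > remaining:   # margin beats all remaining mass
            break
    return int(np.argmax(acc)), evaluated

# Demo: a dominant, confident first candidate triggers an exit after one model.
pred_a, used_a = adaptive_predict(
    [lambda z: np.array([0.99, 0.01]), lambda z: np.array([0.50, 0.50])],
    np.array([0.9, 0.1]), x=None)
```

With the dominant first candidate above, a single evaluation suffices; with evenly weighted, less confident candidates, the rule falls back to evaluating the whole ensemble.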

5. EXPERIMENTS

In this section, we design extensive experiments on CIFAR-10, SVHN, and ImageNet to evaluate the performance of SWEEN models.

Solving for the ensemble weight. From Section 4 we know that the empirical risk minimizer can be obtained by solving a convex optimization problem. However, this first requires approximating the values of the smoothed functions of the candidate models at every data point, which can be very costly when the number of candidates or training samples is large. Hence, we use Gaussian data-augmented training to solve for the ensemble weight. More precisely, we freeze the parameters of the candidate models and minimize the cross-entropy loss of the SWEEN model on Gaussian-augmented data from the evaluation set. Empirically, we find this approach much faster, with comparable results.
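A minimal sketch of this weight-fitting procedure, assuming frozen candidates and a softmax parameterization of the simplex weights (full-batch gradient descent here for simplicity; the toy constant-output candidates are stand-ins for trained models):

```python
import numpy as np

rng = np.random.default_rng(1)

def train_sween_weights(candidates, X, y, sigma, steps=300, lr=1.0):
    """Fit SWEEN ensemble weights by minimizing cross-entropy on Gaussian-
    augmented data with candidate parameters frozen (a full-batch sketch of
    the procedure described above). The simplex constraint is handled by a
    softmax over free logits `a`."""
    K, n = len(candidates), len(X)
    a = np.zeros(K)
    for _ in range(steps):
        w = np.exp(a) / np.exp(a).sum()
        noise = rng.normal(scale=sigma, size=X.shape)        # fresh augmentation
        P = np.stack([f(X + noise) for f in candidates])     # (K, n, M)
        p_true = P[:, np.arange(n), y]                       # (K, n)
        mix = w @ p_true                                     # ensemble prob. of true class
        g = -(p_true / mix).mean(axis=1)                     # dCE/dw
        a -= lr * w * (g - w @ g)                            # chain rule through softmax
    return np.exp(a) / np.exp(a).sum()

# Toy demo: two frozen "candidates" with constant outputs; candidate 0 is better.
f_good = lambda z: np.tile([0.85, 0.15], (z.shape[0], 1))
f_bad = lambda z: np.tile([0.30, 0.70], (z.shape[0], 1))
X_demo = np.zeros((20, 4))
y_demo = np.zeros(20, dtype=int)
w_demo = train_sween_weights([f_good, f_bad], X_demo, y_demo, sigma=0.5)
```

As expected, almost all weight concentrates on the more accurate candidate, mirroring the robustness to biased candidates discussed in Section 4.1.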

5.1. SETUP

Certification. Following previous works, we report the approximated certified accuracy (ACA), i.e., the fraction of the test set that can be approximately certified to be robust at radius r (see Cohen et al. (2019) for details). We also report the average certified radius (ACR) following Zhai et al. (2020); the ACR equals the area under the radius-accuracy curve (see Figure 1). All results are certified using the algorithms of Cohen et al. (2019) with N = 100,000 samples and failure probability α = 0.001.
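Given certified radii for a test set, the ACA at a radius r and the ACR are simple statistics, and the identity "ACR = area under the radius-accuracy curve" can be verified numerically (the radii below are made up for illustration):

```python
import numpy as np

# Hypothetical certified radii r_c(x_i, y_i; g) over a small test set; an entry
# of 0 means the point was misclassified or the certification abstained.
radii = np.array([0.0, 0.25, 0.5, 0.5, 1.0, 1.25])

def approx_certified_accuracy(radii, r):
    """ACA at radius r: fraction of the test set certified beyond radius r
    (radius-0 entries count as uncertified)."""
    return float(np.mean(radii > r))

def average_certified_radius(radii):
    """ACR (Zhai et al., 2020): the mean certified radius."""
    return float(np.mean(radii))

# ACR equals the area under the radius-accuracy curve r -> ACA(r):
h = 1e-4
grid = np.arange(0.0, radii.max() + 1.0, h)
area = float(sum(approx_certified_accuracy(radii, r) for r in grid) * h)
```

The Riemann sum of the curve agrees with the mean radius, which is the standard identity E[X] = ∫ P(X > r) dr for a non-negative random variable.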

5.2. RESULTS

Standard training on CIFAR-10. Table 1 displays the performance of two kinds of SWEEN models under noise levels σ ∈ {0.25, 0.50, 1.00}. The performance of a single ResNet-110 is included for comparison, and we also report the upper envelopes of the ACA and ACR of the corresponding candidate models as UE. In Figure 1, we display the radius-accuracy curves for the SWEEN models and all their candidate models under σ = 0.50 on CIFAR-10; full-size figures are included in Appendix D. The results show that SWEEN models significantly boost performance compared to their candidate models. According to Figure 1, the SWEEN-7 model consistently outperforms all its candidates in terms of the ACA at all radii. The ACR of the SWEEN-7 model is 0.678, much higher than the 0.574 of the upper envelope of its candidates. This confirms the theoretical analysis in Section 4: SWEEN combines the strengths of its candidate models and attains superior performance. Besides, SWEEN is effective even when only a few small candidate models are available. The SWEEN-3 model using ResNet-20, ResNet-26, and ResNet-32 achieves higher ACA than the ResNet-110 at all radii on all noise levels, while its total training time and number of parameters are 36% and 30% less than those of the ResNet-110, respectively. The improvements can be further amplified by increasing the number and size of candidate models: for instance, the ACR of the SWEEN-7 model is at least 13% higher than that of the ResNet-110 in Table 1. The above results verify the effectiveness of SWEEN for randomized smoothing. The SWEEN-3 model achieves results comparable to the ResNet-110 while being more efficient: from Tables 2 and 3, it takes 33.9 hours to reach an ACR of 0.727 for σ = 0.5, using three small, easy-to-train candidate models.
Meanwhile, it takes 49.4 hours for the ResNet-110 to achieve similar performance on CIFAR-10. This 32% speed-up reveals the efficiency of applying SWEEN on top of previous training methods.

Other experimental results. Due to space constraints, we only report the main results here. The results of further experiments (e.g., results on SVHN, results of the adaptive prediction algorithm, results of SWEEN models with identical architectures, and results of SWEEN against real attacks) can be found in Appendix C.

6. CONCLUSIONS

In this work, we introduced Smoothed WEighted ENsembling (SWEEN) to improve randomized smoothed classifiers in terms of both accuracy and robustness. We showed that SWEEN can achieve optimal certified robustness w.r.t. our defined γ-robustness index. Furthermore, the optimal SWEEN model w.r.t. a surrogate loss can be obtained from training under mild assumptions. We also developed an adaptive prediction algorithm to accelerate the prediction and certification process. Our extensive experiments showed that a properly designed SWEEN model consistently outperformed all its candidate models by a significant margin. Moreover, SWEEN models using a few small, easy-to-train candidates could match or exceed a large individual model in performance with a notable reduction in total training time. Our theoretical and empirical results confirm that SWEEN is a viable tool for improving the performance of randomized smoothing models.

A PROOFS

A.1 PROOF OF LEMMA 1

Define

F̄_p = {φ(x) = ∫_Θ w(θ) g(x; θ) dθ | ‖φ‖_p < ∞, w(θ) ≥ 0},
F̄_θ = {φ(x) = Σ_{k=1}^K w_k g(x; θ_k) | w_k ≥ 0}.

We have F_p ⊆ F̄_p and F̂_θ ⊆ F̄_θ.

Lemma 2. Let µ be any probability measure on R^d. For φ : R^d → R^M, define the norm ‖φ‖²_µ = ∫_{R^d} ‖φ(x)‖²_2 dµ(x). Fix φ ∈ F̄_p. Then for any η > 0, with probability at least 1 − η over θ_1, ..., θ_K drawn i.i.d. from p, there exists φ̂(x) = Σ_{k=1}^K c_k g(x; θ_k) ∈ F̄_θ which satisfies

‖φ̂ − φ‖_µ ≤ (‖φ‖_p/√K)(1 + √(2 log(1/η))).

Proof. Since φ ∈ F̄_p, we can write φ(x) = ∫_Θ w(θ) g(x; θ) dθ with w(θ) ≥ 0. Construct φ_k = β_k g(·; θ_k), k = 1, 2, ..., K, where β_k = w(θ_k)/p(θ_k). Then E[φ_k] = φ and ‖φ_k‖_µ = √(∫_{R^d} β_k² ‖g(x; θ_k)‖²_2 dµ(x)) ≤ |β_k| ≤ ‖φ‖_p. We then define

u(θ_1, ..., θ_K) = ‖(1/K) Σ_{k=1}^K φ_k − φ‖_µ.

First, by Jensen's inequality and the fact that ‖φ_k‖_µ ≤ ‖φ‖_p, we have

E[u(θ)] ≤ √(E[u²(θ)]) = √(E‖(1/K) Σ_{k=1}^K (φ_k − E[φ_k])‖²_µ) = √((1/K)(E‖φ_k‖²_µ − ‖E[φ_k]‖²_µ)) ≤ ‖φ‖_p/√K.

Next, for θ_1, ..., θ_K and θ̃_i (replacing the i-th coordinate), we have

|u(θ_1, ..., θ_K) − u(θ_1, ..., θ̃_i, ..., θ_K)| ≤ ‖(1/K) Σ_{k=1}^K φ_k − (1/K)(Σ_{k≠i} φ_k + φ̃_i)‖_µ = ‖φ_i − φ̃_i‖_µ/K ≤ 2‖φ‖_p/K.

Now we can use

McDiarmid's inequality to bound u(θ), which gives

P[u(θ) − ‖φ‖_p/√K ≥ ε] ≤ P[u(θ) − E[u(θ)] ≥ ε] ≤ exp(−Kε²/(2‖φ‖²_p)).

The lemma follows by setting η equal to the right-hand side and solving for ε.

Lemma 3. Let µ be any probability measure on R^d, and define the norm ‖φ‖²_µ = ∫_{R^d} ‖φ(x)‖²_2 dµ(x) as above. Fix φ ∈ F_p. Then for any η > 0 and K ≥ M‖φ‖²_p (1 + √(2 log(1/η)))², with probability at least 1 − η over θ_1, ..., θ_K drawn i.i.d. from p, there exists φ̂(x) = Σ_{k=1}^K c_k g(x; θ_k) ∈ F̂_θ which satisfies

‖φ̂ − φ‖_µ < 2‖φ‖_p^{1/2} (M/K)^{1/4} (1 + √(2 log(1/η)))^{1/2}.

Proof. Fix φ ∈ F_p ⊆ F̄_p. By Lemma 2, for any η > 0, with probability at least 1 − η over θ_1, ..., θ_K drawn i.i.d. from p, there exists φ̂(x) = Σ_{k=1}^K c_k g(x; θ_k) ∈ F̄_θ which satisfies

‖φ̂ − φ‖_µ ≤ (‖φ‖_p/√K)(1 + √(2 log(1/η))) =: B(K).

Denote C = Σ_{k=1}^K c_k, and define s(t) = Σ_{i=1}^M t_i as the sum of all elements of t ∈ R^M. Then s(g(x; θ)) = 1 for all x ∈ R^d and θ ∈ Θ. Thus,

s(φ(x)) = Σ_{i=1}^M φ_i(x) = Σ_{i=1}^M ∫_Θ w(θ) g_i(x; θ) dθ = ∫_Θ w(θ) Σ_{i=1}^M g_i(x; θ) dθ = ∫_Θ w(θ) dθ = 1,
s(φ̂(x)) = Σ_{i=1}^M φ̂_i(x) = Σ_{i=1}^M Σ_{k=1}^K c_k g_i(x; θ_k) = Σ_{k=1}^K c_k Σ_{i=1}^M g_i(x; θ_k) = Σ_{k=1}^K c_k = C.

The remainder of the argument normalizes the coefficients by C, using the bound B(K) on ‖φ̂ − φ‖_µ to control |C − 1|, which yields the stated estimate.

Lemma 4. Let c : ∆ × Y → R be Lipschitz in its first argument with constant L, and fix φ ∈ F_p. Then, under the conditions of Lemma 3, with probability at least 1 − η over θ_1, ..., θ_K drawn i.i.d. from p, there exists φ̂ ∈ F̂_θ such that

|E[c(φ̂(x), y)] − E[c(φ(x), y)]| ≤ 2L‖φ‖_p^{1/2} (M/K)^{1/4} (1 + √(2 log(1/η)))^{1/2}.

Proof. |E[c(φ̂(x), y)] − E[c(φ(x), y)]| ≤ E|c(φ(x), y) − c(φ̂(x), y)| ≤ L E‖φ(x) − φ̂(x)‖_2 ≤ L √(E‖φ(x) − φ̂(x)‖²_2) = L‖φ − φ̂‖_{D|x}. The desired result follows from Lemma 3.

Lemma 5 (Corollary of Proposition 1 in Zhai et al. (2020)). Given any p_1, p_2, ..., p_M satisfying p_1 ≥ p_2 ≥ ... ≥ p_M ≥ 0 and p_1 + p_2 + ... + p_M = 1, the derivative of clip((σ/2)[Φ^{-1}(p_1) − Φ^{-1}(p_2)]; 0, D) with respect to p_1 and p_2 is bounded.

Now we can prove Lemma 1.

Proof of Lemma 1. Let φ_0 ∈ F_p be such that I_γ(φ_0) > sup_{φ∈F_p} I_γ(φ) − ε/2. From Lemma 5 we know that q(p, y) := clip((σ/2)[Φ^{-1}(p_y) − Φ^{-1}(max_{k≠y} p_k)]; 0, D) is Lipschitz in its first argument. Since γ is Lipschitz, c(p, y) := γ(q(p, y)) is also Lipschitz in its first argument with some constant L.
Applying Lemma 4, we have that for K ≥ M‖φ_0‖²_p (1 + √(2 log(1/η)))², with probability at least 1 − η over θ_1, ..., θ_K drawn i.i.d. from p, there exists φ̂ ∈ F̂_θ which satisfies

I_γ(φ_0) − I_γ(φ̂) = E_{(x,y)∼D}[c(φ_0(x), y)] − E_{(x,y)∼D}[c(φ̂(x), y)] < 2L‖φ_0‖_p^{1/2} (M/K)^{1/4} (1 + √(2 log(1/η)))^{1/2}.

When K > 256L⁴‖φ_0‖²_p M (1 + √(2 log(1/η)))²/ε⁴, we have

sup_{φ∈F_p} I_γ(φ) − I_γ(φ̂) = (sup_{φ∈F_p} I_γ(φ) − I_γ(φ_0)) + (I_γ(φ_0) − I_γ(φ̂)) < ε/2 + ε/2 = ε.

If I_γ(φ_0) = sup_{φ∈F_p} I_γ(φ), then ‖φ_0‖_p is independent of ε, so K = Ω(1/ε⁴) suffices.

A.2 PROOF OF THEOREM 2

First we introduce some results from statistical learning theory.

Definition 3 (Gaussian complexity). Let µ be a probability distribution on a set X and suppose that x_1, ..., x_n are independent samples selected according to µ. Let F be a class of functions mapping from X to R. The Gaussian complexity of F is

G_n[F] = E[sup_{f∈F} |(2/n) Σ_{i=1}^n ξ_i f(x_i)| | x_1, ..., x_n],

where ξ_1, ..., ξ_n are independent N(0, 1) random variables.

Definition 4 (Rademacher complexity). With x_1, ..., x_n as above, the Rademacher complexity of F is

R_n[F] = E[sup_{f∈F} |(2/n) Σ_{i=1}^n σ_i f(x_i)| | x_1, ..., x_n],

where σ_1, ..., σ_n are independent uniform {±1}-valued random variables.

Lemma 6 (Part of Lemma 4 in Bartlett & Mendelson (2001)). There is an absolute constant β such that for every class F and every integer n, R_n(F) ≤ β G_n(F).

Lemma 7 (Corollary of Theorem 8 in Bartlett & Mendelson (2001)). Consider a loss function c : A × Y → [0, 1]. Let F be a class of functions mapping from X to A and let (x_i, y_i)_{i=1}^n be independently selected according to the probability measure µ. Then, for any integer n and any 0 < η < 1, with probability at least 1 − η over samples of length n, every f in F satisfies

E_{(x,y)∼µ}[c(f(x), y)] ≤ (1/n) Σ_{i=1}^n c(f(x_i), y_i) + R_n[c ∘ F] + √(8 log(2/η)/n),

where c ∘ F = {(x, y) → c(f(x), y) − c(0, y) | f ∈ F}.

Lemma 8 (Corollary of Theorem 14 in Bartlett & Mendelson (2001)). Let A = R^M and let F be a class of functions mapping from X to A. Suppose that there are real-valued classes F_1, ..., F_M such that F is a subset of their Cartesian product. Assume further that c : A × Y → R is such that, for all y ∈ Y, c(·, y) is a Lipschitz function with constant L which passes through the origin and is uniformly bounded. Then G_n(c ∘ F) ≤ 2L Σ_{i=1}^M G_n(F_i).
Now we prove the following lemma:

Lemma 9. Let c, F, (x_i, y_i)_{i=1}^n, and c ∘ F be as in Lemma 7. Then, for any integer n and any 0 < η < 1, with probability at least 1 − η over samples of length n, every f in F satisfies

(1/n) Σ_{i=1}^n c(f(x_i), y_i) ≤ E_{(x,y)∼µ}[c(f(x), y)] + R_n[c ∘ F] + √(8 log(2/η)/n).

Proof. We have

(1/n) Σ_{i=1}^n c(f(x_i), y_i) − E_{(x,y)∼µ}[c(f(x), y)] ≤ sup_{h∈c∘F} (Ê_n h − E h) + Ê_n c(0, y) − E c(0, y).

When an (x_i, y_i) pair changes, the random variable sup_{h∈c∘F}(Ê_n h − E h) changes by at most 2/n, so by McDiarmid's inequality, with probability at least 1 − η/2,

sup_{h∈c∘F}(Ê_n h − E h) ≤ E sup_{h∈c∘F}(Ê_n h − E h) + √(2 log(2/η)/n).

A similar argument, together with the fact that E[Ê_n c(0, y)] = E c(0, y), shows that with probability at least 1 − η,

(1/n) Σ_{i=1}^n c(f(x_i), y_i) ≤ E_{(x,y)∼µ}[c(f(x), y)] + E sup_{h∈c∘F}(Ê_n h − E h) + √(8 log(2/η)/n).

It remains to show that E sup_{h∈c∘F}(Ê_n h − E h) ≤ R_n[c ∘ F]. Let (x'_1, y'_1), ..., (x'_n, y'_n) be drawn i.i.d. from µ, independent of (x_i, y_i)_{i=1}^n. Then

E sup_{h∈c∘F}(Ê_n h − E h) = E sup_{h∈c∘F} E[Ê_n h − (1/n) Σ_{i=1}^n h(x'_i, y'_i) | (x_i, y_i)] ≤ E sup_{h∈c∘F} [Ê_n h − (1/n) Σ_{i=1}^n h(x'_i, y'_i)] = E sup_{h∈c∘F} (1/n)(Σ_{i=1}^n h(x_i, y_i) − Σ_{i=1}^n h(x'_i, y'_i)) ≤ 2 E sup_{h∈c∘F} |(1/n) Σ_{i=1}^n σ_i h(x_i, y_i)| ≤ R_n[c ∘ F].

Lemma 10. Define the sample-exact empirical risk R_se[φ] = (1/n) Σ_{i=1}^n l(φ(x_i), y_i) for φ ∈ F̂_θ. For any 0 < η < 1, with probability at least 1 − η over the training set, every φ ∈ F̂_θ satisfies

|R_se[φ] − R[φ]| ≤ 2βLMK/√n + √(8 log(4/η)/n).

Combining Lemmas 4 and 10 yields a sampling-free analogue of Theorem 2.

Proof. Let φ̂ be the minimizer of R_se over F̂_θ and φ* the minimizer of R over F̂_θ. Combining Lemmas 4 and 10, we derive that for any φ ∈ F_p, with probability at least 1 − 2η over the training dataset and the choice of the parameters θ_1, ..., θ_K,

R[φ̂] − R[φ] = (R[φ̂] − R_se[φ̂]) + (R_se[φ̂] − R_se[φ*]) + (R_se[φ*] − R[φ*]) + (R[φ*] − R[φ])
< (2βLMK + 2√(2 log(4/η)))/√n + 0 + (2βLMK + 2√(2 log(4/η)))/√n + 2L‖φ‖_p^{1/2} (M/K)^{1/4} (1 + √(2 log(1/η)))^{1/2}
= (4βLMK + 4√(2 log(4/η)))/√n + 2L‖φ‖_p^{1/2} (M/K)^{1/4} (1 + √(2 log(1/η)))^{1/2}.

Lemma 11. Let µ be a probability distribution on ∆. For any η > 0, with probability at least 1 − η over x_1, ..., x_s drawn i.i.d. from µ, it holds that

‖(1/s) Σ_{i=1}^s x_i − E_{x∼µ}[x]‖_2 ≤ (1/√s)(1 + √(2 log(1/η))).

Proof. Define u(x_1, ..., x_s) = ‖(1/s) Σ_{i=1}^s x_i − E[x]‖_2.
By Jensen's inequality, we have
$$\mathbb{E}[u(x)] \le \sqrt{\mathbb{E}[u^2(x)]} = \sqrt{\mathbb{E}\Big\|\frac{1}{s}\sum_{i=1}^s x_i - \mathbb{E}[x]\Big\|_2^2} = \sqrt{\frac{1}{s}\big(\mathbb{E}\|x\|_2^2 - \|\mathbb{E}[x]\|_2^2\big)} \le \frac{1}{\sqrt{s}}.$$
Next, for $x_1, \cdots, x_s$ and $\tilde{x}_k$, we have
$$|u(x_1, \cdots, x_s) - u(x_1, \cdots, \tilde{x}_k, \cdots, x_s)| \le \Big\|\frac{1}{s}\sum_{i=1}^s x_i - \frac{1}{s}\Big(\sum_{i=1, i\ne k}^s x_i + \tilde{x}_k\Big)\Big\|_2 = \frac{\|x_k - \tilde{x}_k\|_2}{s} \le \frac{2}{s}.$$
Now we can use McDiarmid's inequality to bound $u(x)$, which gives
$$\mathbb{P}\Big[u(x) - \frac{1}{\sqrt{s}} \ge \varepsilon\Big] \le \mathbb{P}[u(x) - \mathbb{E}u(x) \ge \varepsilon] \le \exp\Big(-\frac{s\varepsilon^2}{2}\Big).$$
The result follows by setting $\eta$ equal to the right-hand side and solving for $\varepsilon$.

Now we are ready to prove Theorem 2.

Proof of Theorem 2. Let $\phi_0 \in F_p$ be such that $R[\phi_0] < \inf_{\phi \in F_p} R[\phi] + \frac{\varepsilon}{4}$. By Lemma 11, with probability at least $1-\frac{\eta}{3}$, the inequalities
$$\Big\|\frac{1}{s}\sum_{j=1}^s f(x_i + \delta_{ijk}; \theta_k) - g(x_i; \theta_k)\Big\|_2 \le \frac{1 + \sqrt{2\log\frac{3Kn}{\eta}}}{\sqrt{s}}, \qquad 1 \le i \le n,\ 1 \le k \le K,$$
hold simultaneously. So with probability at least $1-\frac{\eta}{3}$, for every $\phi = \sum_{k=1}^K w_k\, g(x; \theta_k) \in \hat{F}_\theta$, it holds that
$$|R_{emp}[\phi] - R_{se}[\phi]| = \Big|\frac{1}{n}\sum_{i=1}^n\Big[l\Big(\sum_{k=1}^K w_k\Big[\frac{1}{s}\sum_{j=1}^s f(x_i + \delta_{ijk}; \theta_k)\Big], y_i\Big) - l\Big(\sum_{k=1}^K w_k\, g(x_i; \theta_k), y_i\Big)\Big]\Big|$$
$$\le \frac{L}{n}\sum_{i=1}^n\Big\|\sum_{k=1}^K w_k\Big[\frac{1}{s}\sum_{j=1}^s f(x_i + \delta_{ijk}; \theta_k) - g(x_i; \theta_k)\Big]\Big\|_2 \le \frac{L}{n}\sum_{i=1}^n\sum_{k=1}^K w_k\Big\|\frac{1}{s}\sum_{j=1}^s f(x_i + \delta_{ijk}; \theta_k) - g(x_i; \theta_k)\Big\|_2$$
$$\le \frac{L}{n}\sum_{i=1}^n\sum_{k=1}^K w_k\,\frac{1 + \sqrt{2\log\frac{3Kn}{\eta}}}{\sqrt{s}} = \frac{L\big(1 + \sqrt{2\log\frac{3Kn}{\eta}}\big)}{\sqrt{s}} \triangleq \varepsilon_1.$$
By Lemma 10, with probability at least $1-\frac{\eta}{3}$, for every $\phi \in \hat{F}_\theta$, it holds that
$$|R_{se}[\phi] - R[\phi]| \le \frac{2\beta LM\sqrt{K}}{\sqrt{n}} + \sqrt{\frac{8\log\frac{12}{\eta}}{n}} \triangleq \varepsilon_2.$$
Let $\phi^*$ be the minimizer of $R$ over $\hat{F}_\theta$. By Lemma 4, with probability at least $1-\frac{\eta}{3}$, for $K \ge M\|\phi_0\|_p^2\big(1 + \sqrt{2\log\frac{3}{\eta}}\big)^2$,
$$R[\phi^*] - R[\phi_0] < 2L\|\phi_0\|_p\sqrt[4]{\frac{M}{K}}\Big(1 + \sqrt{2\log\frac{3}{\eta}}\Big)^{\frac{1}{2}} \triangleq \varepsilon_3.$$
So with probability at least $1-\eta$, it holds that
$$R[\hat{\phi}] - \inf_{\phi \in F_p} R[\phi] = (R[\hat{\phi}] - R_{se}[\hat{\phi}]) + (R_{se}[\hat{\phi}] - R_{emp}[\hat{\phi}]) + (R_{emp}[\hat{\phi}] - R_{emp}[\phi^*])$$
$$\quad + (R_{emp}[\phi^*] - R_{se}[\phi^*]) + (R_{se}[\phi^*] - R[\phi^*]) + (R[\phi^*] - R[\phi_0]) + (R[\phi_0] - \inf_{\phi \in F_p} R[\phi])$$
$$< \varepsilon_2 + \varepsilon_1 + 0 + \varepsilon_1 + \varepsilon_2 + \varepsilon_3 + \frac{\varepsilon}{4} = 2\varepsilon_1 + 2\varepsilon_2 + \varepsilon_3 + \frac{\varepsilon}{4}.$$
When
$$K > \frac{256L^4\|\phi_0\|_p^2 M\big(1 + \sqrt{2\log\frac{1}{\eta}}\big)^2}{\varepsilon^4}, \qquad n > \frac{64\big(2\beta LM\sqrt{K} + \sqrt{8\log\frac{12}{\eta}}\big)^2}{\varepsilon^2}, \qquad s > \frac{64L^2\big(1 + \sqrt{2\log\frac{3Kn}{\eta}}\big)^2}{\varepsilon^2},$$
we have $R[\hat{\phi}] - \inf_{\phi \in F_p} R[\phi] < \varepsilon$. If $R[\phi_0] = \inf_{\phi \in F_p} R[\phi]$, so that $\|\phi_0\|_p$ is independent of $\varepsilon$, then $K = \Omega\big(\frac{1}{\varepsilon^4}\big)$ suffices.
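Lemma 11 can be checked numerically. The sketch below is only an illustration: it assumes a Dirichlet distribution as $\mu$ on the simplex $\Delta$ and hypothetical values $s = 100$, $\eta = 0.1$, and estimates how often the deviation $\|\frac{1}{s}\sum_i x_i - \mathbb{E}[x]\|_2$ exceeds the bound $\frac{1}{\sqrt{s}}(1 + \sqrt{2\log\frac{1}{\eta}})$; by Lemma 11 this should happen with probability at most $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
s, eta, trials = 100, 0.1, 1000
bound = (1 + np.sqrt(2 * np.log(1 / eta))) / np.sqrt(s)

# mu: a Dirichlet distribution on the probability simplex (an assumption
# made for illustration; Lemma 11 holds for any distribution on Delta)
alpha = np.array([2.0, 3.0, 5.0])
true_mean = alpha / alpha.sum()  # E[x] for Dirichlet(alpha)

violations = 0
for _ in range(trials):
    x = rng.dirichlet(alpha, size=s)  # s i.i.d. samples on the simplex
    dev = np.linalg.norm(x.mean(axis=0) - true_mean)
    violations += dev > bound
failure_rate = violations / trials  # Lemma 11: at most eta in expectation
```

In practice the bound is quite loose for well-concentrated distributions, so the observed failure rate is typically far below $\eta$.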

B ALGORITHMS

B.1 ADAPTIVE PREDICTION ALGORITHM

The entire adaptive prediction algorithm is shown in Algorithm 1. It is modified from Inoue (2019) to generalize to weighted ensembles. The exit condition is the weighted version of the confidence-level-based early-exit condition in Inoue (2019). The algorithm accelerates the evaluation of the ensemble function $f$ while leaving the smoothing operation $g(x) = \mathbb{E}_\delta[f(x+\delta)]$ unchanged, so it does not affect the Monte Carlo estimation and certification procedure of smoothed classifiers.

Algorithm 1 Adaptive prediction for weighted ensembling
1: Input: ensembling weights $w \in \mathbb{R}^K$, candidate model parameters $\theta \in \Theta^K$, significance level $\alpha$, threshold $T$, data point $x$
2: Compute $z = \Phi^{-1}(1 - \frac{\alpha}{2})$
3: Set $\pi$ as the permutation of indices that sorts $w$ in descending order and $i \leftarrow 0$
4: repeat
5:   Set $i \leftarrow i + 1$
6:   Compute the $\pi_i$-th local prediction $p_{\pi_i} \leftarrow (p_{\pi_i,1}, \cdots, p_{\pi_i,M}) \in \Delta$
7:   Compute $\bar{p}_{i,k} \leftarrow \frac{\sum_{j=1}^i w_{\pi_j} p_{\pi_j,k}}{\sum_{j=1}^i w_{\pi_j}}$ for $k = 1, 2, \cdots, M$
8:   Compute $k_i \leftarrow \arg\max_k \bar{p}_{i,k}$
9: until $\bar{p}_{1,k_1} > T$; or $\bar{p}_{i,k_i} > \frac{1}{2} + z\,\frac{\sqrt{\sum_{j=1}^i w_{\pi_j}^2}}{\sum_{j=1}^i w_{\pi_j}}\sqrt{\frac{\sum_{j=1}^i w_{\pi_j}(p_{\pi_j,k_i} - \bar{p}_{i,k_i})^2}{\sum_{j=1}^i w_{\pi_j}}}$ with $i > 1$; or $i = K$
10: return $k_i$ and $\bar{p}_{i,k}$, $k = 1, 2, \cdots, M$

B.2 DETAILED SWEEN ALGORITHM

Algorithm 2 SWEEN
1: Input: training set $p_{train}$, evaluation set $p_{eval}$, ensembling weights $w \in \mathbb{R}^K$, candidate model parameters $\theta = \{\theta_1, \dots, \theta_K\} \in \Theta^K$
2: Initialize $\theta_1, \dots, \theta_K$, $w$
3: for $i = 1$ to $K$ do
4:   Train candidate model $\theta_i$ using $p_{train}$
5: end for
6: Construct the SWEEN model $g_{sween}(\cdot; \theta, w) = \sum_{k=1}^K w_k\, g(\cdot; \theta_k)$
7: Train $w$ using $p_{eval}$
8: return $w$ and $\theta_1, \dots, \theta_K$

C SUPPLEMENTARY MATERIAL FOR EXPERIMENTS

C.1 DETAILED SETTINGS AND HYPER-PARAMETERS

We perform all experiments on CIFAR-10 and SVHN with a single GeForce GTX 1080 Ti GPU. For the experiments on ImageNet, we use eight V100 GPUs. To train the SWEEN models on CIFAR-10 and SVHN, we divide the training set into two parts: one for training the candidate models and the other for solving the weights. We use 2,000 images for solving weights on CIFAR-10 and 3,000 images on SVHN. For ImageNet, we use the whole training set to train candidate models and 1/1000 of the training set to solve the weights.
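The early-exit loop of Algorithm 1 can be sketched in NumPy as follows. This is a minimal illustration, not the authors' implementation: `local_preds` stands in for the per-candidate local predictions of line 6 (precomputed here for simplicity), and the variance-based exit test follows one plausible reading of the condition in line 9.

```python
import numpy as np
from statistics import NormalDist

def adaptive_predict(local_preds, w, T=0.95, alpha=0.05):
    """Evaluate candidates in descending-weight order and exit early once the
    running weighted average is confident enough (cf. Algorithm 1).

    local_preds: (K, M) array, row k = probability vector of candidate k
    w:           (K,) ensembling weights
    Returns (predicted class, averaged probabilities, #candidates evaluated).
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    order = np.argsort(w)[::-1]  # high-weight candidates first
    for i in range(1, len(w) + 1):
        idx = order[:i]
        wi = w[idx]
        avg = wi @ local_preds[idx] / wi.sum()  # running weighted average
        k = int(np.argmax(avg))
        if i == 1 and avg[k] > T:  # threshold exit (line 9, first clause)
            break
        if i > 1:
            # weighted standard error of the top-class score (one plausible
            # reading of the confidence-level exit condition in line 9)
            var = wi @ (local_preds[idx, k] - avg[k]) ** 2 / wi.sum()
            se = z * np.sqrt(wi @ wi) / wi.sum() * np.sqrt(var)
            if avg[k] > 0.5 + se:
                break
    return k, avg, i
```

For instance, with weights $(0.5, 0.3, 0.2)$ and a first candidate already above the threshold, only one of the three candidates is evaluated.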
For Gaussian data augmentation training, all models are trained for 400 epochs using SGD on CIFAR-10 and SVHN; the models on ImageNet are trained for 90 epochs. The learning rate is initialized to 0.01 and decayed by a factor of 0.1 at the 150th and 300th epochs. For MACER training, we use the same hyper-parameters as Zhai et al. (2020), i.e., k = 16, β = 16.0, γ = 8.0, with λ = 12.0 for σ = 0.25 and λ = 4.0 for σ = 0.50. We train the models for 440 epochs; the learning rate is initialized to 0.01 and decayed by a factor of 0.1 at the 200th and 400th epochs.
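The two training ingredients above, Gaussian data augmentation and the step-wise learning-rate decay, can be sketched as small framework-agnostic helpers (the actual training loop, model, and optimizer are omitted; the CIFAR-10 milestones are used as defaults):

```python
import numpy as np

def gaussian_augment(x, sigma, rng):
    # Randomized-smoothing training: perturb inputs with N(0, sigma^2 I) noise
    return x + sigma * rng.standard_normal(x.shape)

def step_decay_lr(epoch, base_lr=0.01, milestones=(150, 300), gamma=0.1):
    # lr starts at base_lr and is multiplied by gamma at each milestone epoch
    return base_lr * gamma ** sum(epoch >= m for m in milestones)
```

In a training loop, each batch would be replaced by `gaussian_augment(batch, sigma, rng)` before the forward pass, and `step_decay_lr(epoch)` would set the optimizer's learning rate at the start of each epoch.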

C.2 RESULTS ON SVHN

To further evaluate our method, we also experiment on SVHN. The results in Table 5 show that SWEEN models outperform the upper envelopes of their corresponding candidate models as well. Figure 2 plots the results on CIFAR-10 and SVHN for comparison. The SWEEN-3 and SWEEN-7 models both use candidate models with diverse architectures. For a more comprehensive evaluation, we also experiment with SWEEN models whose candidates share an identical architecture. For each σ ∈ {0.25, 0.5, 1.0}, we train 8 ResNet-110 models with different random seeds on CIFAR-10 via the standard training and use them to construct SWEEN models. The results are shown in Table 6 and Figure 3. We can see that SWEEN is still effective in this scenario and significantly boosts the performance compared to the candidate models. We also run experiments on ImageNet using models with an identical architecture but different random initializations: we train 3 ResNet-50 models via the standard training and use them to construct the SWEEN model. Table 7 shows the results. The improvement of SWEEN is substantial compared with the AVG and UE results. To alleviate the higher execution cost introduced by SWEEN, we apply the previously mentioned adaptive prediction algorithm to speed up the certification. Experiments are conducted on the SWEEN-7 models trained via the standard training on CIFAR-10, and the results are summarized in Table 8. The adaptive prediction successfully reduces the number of candidate evaluations, while the performance of the adaptive prediction models is only slightly worse than that of their vanilla counterparts.
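The "solving weights" step that turns trained candidates into a SWEEN model can be sketched as fitting a convex combination on held-out predictions. The softmax parameterization and plain gradient descent below are illustrative assumptions, not necessarily the optimizer or loss used in the paper:

```python
import numpy as np

def fit_sween_weights(cand_probs, labels, steps=500, lr=0.5):
    """cand_probs: (K, n, M) candidate class probabilities on held-out data;
    labels: (n,) integer labels. Returns simplex weights w minimizing the
    negative log-likelihood of the weighted ensemble (softmax-parameterized
    gradient descent keeps w nonnegative and summing to one)."""
    K, n, M = cand_probs.shape
    theta = np.zeros(K)
    true_p = cand_probs[:, np.arange(n), labels]  # (K, n) prob of true class
    for _ in range(steps):
        w = np.exp(theta)
        w /= w.sum()
        p = w @ true_p                      # (n,) ensemble true-class prob
        grad_w = -(true_p / p).mean(axis=1) # d NLL / d w_k
        grad_theta = w * (grad_w - w @ grad_w)  # chain rule through softmax
        theta -= lr * grad_theta
    w = np.exp(theta)
    return w / w.sum()
```

On synthetic data where one candidate is clearly more accurate, the fitted weights concentrate on it, which matches the intuition that SWEEN should never do worse than its best candidate on the held-out objective.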



In this paper, "candidate model" and "candidate" refer to an individual model used in an ensemble, while "base model" refers to a model to which randomized smoothing is applied.



Figure 1: Radius-accuracy curves under σ = 0.50. All models are trained via the standard training. (Left) The SWEEN-7 model and all its candidate models. (Middle) The SWEEN-3 model and all its candidate models. (Right) The SWEEN-7 model, the SWEEN-3 model and the ResNet-110.



Figure 2: Comparing SWEEN models to the upper envelopes of their corresponding candidate models. All models are trained via the standard training. (Left) The SWEEN-3 model on CIFAR-10. (Middle) The SWEEN-7 model on CIFAR-10. (Right) The SWEEN-7 model on SVHN.

Figure 3: Radius-accuracy curves of SWEEN models and their candidate models. All candidate models are using the ResNet-110 architecture and trained via the standard training. (Left) σ = 0.25. (Middle) σ = 0.50. (Right) σ = 1.00.

This section plots the scatter diagram of the certified accuracy of test data points for SWEEN-7 versus ResNet-110 in Figure 4. Most of the points lie below the line y = x, implying that SWEEN-7 outperforms ResNet-110.

Figure 5: Radius-accuracy curves for SWEEN models with candidate models trained by MACER on CIFAR-10. (Left) σ = 0.25. (Right) σ = 0.50.

Figure 6: Radius-accuracy curves w.r.t. the SWEEN-7 model and all its candidate models under σ = 0.50. All models are trained via the standard training.

Figure 7: Radius-accuracy curves w.r.t. the SWEEN-3 model and all its candidate models under σ = 0.50. All models are trained via the standard training.

Figure 8: Radius-accuracy curves w.r.t. the SWEEN-7 model, the SWEEN-3 model and the ResNet-110 under σ = 0.50. All models are trained via the standard training.

Model setup. We train different network architectures on CIFAR-10 and SVHN to serve as candidates for ensembling, including LeNet (LeCun et al., 1989), AlexNet (Krizhevsky et al., 2017),

ACA (%) and ACR on CIFAR-10. All models are trained via the standard training. UE stands for the upper envelope, which shows the largest ACA and ACR among the candidate models.

Training time, #parameters and #FLOPs for models under σ = 0.50 via MACER training. All the experiments are run on a single NVIDIA 1080 Ti GPU.

Table 3: ACA (%) and ACR on CIFAR-10. All models are trained via MACER training. UE stands for the upper envelope of candidate models. Since SWEEN is compatible with previous training algorithms, we adopt MACER training for the SWEEN-3 model. The results are summarized in Table 3. For the ACA and ACR of the ResNet-110 model, we use the original numbers from Zhai et al. (2020).

Table 5: Certified accuracy (%) and ACR on SVHN. All models are trained via the standard training. UE stands for the upper envelope of candidate models.

Table 6: ACA (%) and ACR on CIFAR-10. All candidate models are ResNet-110s trained via the standard training. UE stands for the upper envelope, which shows the largest ACA and ACR among the candidate models. AVG stands for the average ACA or ACR of the candidate models.

Table 7: ACA (%) and ACR on ImageNet. All candidate models are ResNet-50s trained via the standard training. The SWEEN model here contains 3 ResNet-50s. UE stands for the upper envelope, which shows the largest ACA and ACR among the candidate models. AVG stands for the average ACA or ACR of the candidate models.

ACA (%) and ACR on CIFAR-10. All models are trained via the standard training. * means the upper envelope of candidate models.

Thus, with probability at least $1-\eta$ over $\theta_1, \dots, \theta_K$ drawn i.i.d. from $p$, the claimed bound holds.

Lemma 4. Suppose $l(\cdot, \cdot)$ is $L$-Lipschitz in its first argument. Fix $\phi \in F_p$. Then for any $\eta > 0$ and $K \ge M\|\phi\|_p^2\big(1 + \sqrt{2\log\frac{1}{\eta}}\big)^2$, with probability at least $1-\eta$ over $\theta_1, \dots, \theta_K$ drawn i.i.d. from $p$, there exists $\hat{\phi} \in \hat{F}_\theta$ which satisfies
$$R[\hat{\phi}] - R[\phi] < 2L\|\phi\|_p\sqrt[4]{\frac{M}{K}}\Big(1 + \sqrt{2\log\frac{1}{\eta}}\Big)^{\frac{1}{2}}.$$

We can prove the following result:

Theorem 3. Let $A = \mathbb{R}^M$ and let $F$ be a class of functions mapping from $X$ to $A$. Suppose that there are real-valued classes $F_1, \dots, F_M$ such that $F$ is a subset of their Cartesian product. Assume further that the loss function $c: A \times Y \to \mathbb{R}$ is such that, for all $y \in Y$, $c(\cdot, y)$ is a Lipschitz function with constant $L$ and is uniformly bounded. Let $\{(x_i, y_i)\}_{i=1}^n$ be independently selected according to the probability measure $\mu$. Then, for any integer $n$ and any $0 < \eta < 1$, there is a probability of at least $1-\eta$ that every $f \in F$ has
$$\Big|\mathbb{E}_{(x,y)\sim\mu}[c(f(x), y)] - \frac{1}{n}\sum_{i=1}^n c(f(x_i), y_i)\Big| \le 2\beta L\sum_{i=1}^M G_n(F_i) + \sqrt{\frac{8\log\frac{4}{\eta}}{n}},$$
where $\beta$ is a constant.

Proof. From Lemmas 7 and 9, each applied at level $\frac{\eta}{2}$, we have that with probability at least $1-\eta$ over samples of length $n$, every $f$ in $F$ satisfies
$$\Big|\mathbb{E}_{(x,y)\sim\mu}[c(f(x), y)] - \frac{1}{n}\sum_{i=1}^n c(f(x_i), y_i)\Big| \le R_n[c \circ F] + \sqrt{\frac{8\log\frac{4}{\eta}}{n}};$$
the claim follows by applying Lemmas 6 and 8.

Lemma 10. Let $c(\cdot, \cdot)$, $\beta$ be as in Theorem 3, and let $(x_i, y_i)_{i=1}^n$ be independently selected according to the probability measure $D$. For any integer $n$ and any $0 < \eta < 1$, there is a probability of at least $1-\eta$ that every $\phi \in \hat{F}_\theta$ satisfies
$$|R_{se}[\phi] - R[\phi]| \le \frac{2\beta LM\sqrt{K}}{\sqrt{n}} + \sqrt{\frac{8\log\frac{4}{\eta}}{n}}.$$

Proof. Denote by $\hat{F}_\theta(i)$ the class of $i$-th output coordinates of functions in $\hat{F}_\theta$. We have that $\hat{F}_\theta \subseteq \prod_{i=1}^M \hat{F}_\theta(i)$, where $\prod$ stands for a Cartesian product operation. The Gaussian complexities of the $\hat{F}_\theta(i)$'s can be bounded as $G_n(\hat{F}_\theta(i)) \le \sqrt{\frac{K}{n}}$. The desired result follows by applying Theorem 3 to $\hat{F}_\theta$, $\hat{F}_\theta(1), \cdots, \hat{F}_\theta(M)$ and $D$.

Next, we give the definition of the semi-empirical risk. The prefix "semi-" indicates that it is empirical with respect to the training set but not with respect to the smoothing operation.

Definition 5. (Semi-empirical risk). For a surrogate loss function $l(\cdot, \cdot)$, the semi-empirical risk of $\phi$ on the training set $\{(x_i, y_i)\}_{i=1}^n$ is
$$R_{se}[\phi] \triangleq \frac{1}{n}\sum_{i=1}^n l(\phi(x_i), y_i).$$

We can use Lemmas 4 and 10 to prove the following result:

Theorem 4. Suppose for all $y \in Y$, $l(\cdot, y)$ is a Lipschitz function with constant $L$ and is uniformly bounded. Fix $\phi \in F_p$. Then for any $\eta > 0$, with probability at least $1-2\eta$ over the training dataset $\{(x_i, y_i)\}_{i=1}^n$ drawn i.i.d. from $D$ and the parameters $\theta_1, \dots, \theta_K$ drawn i.i.d. from $p$, the semi-empirical risk minimizer $\hat{\phi}$ over $\hat{F}_\theta$ satisfies
$$R[\hat{\phi}] - R[\phi] < \frac{4\beta LM\sqrt{K} + 4\sqrt{2\log\frac{4}{\eta}}}{\sqrt{n}} + 2L\|\phi\|_p\sqrt[4]{\frac{M}{K}}\Big(1 + \sqrt{2\log\frac{1}{\eta}}\Big)^{\frac{1}{2}},$$
where $\beta$ is a constant.

C.6 SWEEN VERSUS ADVERSARIAL ATTACKS

We further investigate the performance of SWEEN models against AutoAttack (Croce & Hein, 2020), an ensemble of four diverse attacks designed to reliably evaluate robustness. Similar to Salman et al. (2019a), we use 128 samples to estimate the smoothed classifier. The results in Table 9 show that SWEEN improves empirical robustness as well. We also plot the radius-accuracy curves for SWEEN models with candidate models trained by MACER on CIFAR-10 in Figure 5.
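The 128-sample estimate of the smoothed classifier mentioned above follows the standard Monte Carlo recipe, sketched below with a stand-in base classifier `f` (hypothetical; any callable mapping an input to a vector of class scores would do):

```python
import numpy as np

def smoothed_predict(f, x, sigma, m=128, rng=None):
    # g(x) = E_delta[f(x + delta)], delta ~ N(0, sigma^2 I),
    # estimated by averaging f over m noisy copies of x
    if rng is None:
        rng = np.random.default_rng()
    noise = sigma * rng.standard_normal((m,) + x.shape)
    probs = np.mean([f(x + d) for d in noise], axis=0)
    return int(np.argmax(probs)), probs
```

For empirical attack evaluation, the attack would query `smoothed_predict` (or a differentiable surrogate of it) in place of the base classifier; the certification procedure itself uses a larger sample count and a statistical test rather than this plain average.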

