ON THE CERTIFIED ROBUSTNESS FOR ENSEMBLE MODELS AND BEYOND

Anonymous

Abstract

Recent studies show that deep neural networks (DNNs) are vulnerable to adversarial examples, which aim to mislead DNNs into making arbitrarily incorrect predictions. To defend against such attacks, both empirical and theoretical defense approaches have been proposed for single ML models. In this work, we aim to explore and characterize the robustness conditions for ensemble ML models. We prove that diversified gradients and large confidence margins are sufficient and necessary conditions for certifiably robust ensemble models under the model-smoothness assumption. We also show that an ensemble model can achieve higher certified robustness than a single base model based on these conditions. To our best knowledge, this is the first work providing tight conditions for ensemble robustness. Inspired by our analysis, we propose the lightweight Diversity-Regularized Training (DRT) for ensemble models. We derive the certified robustness of DRT-based ensembles such as the standard Weighted Ensemble and Max-Margin Ensemble following the sufficient and necessary conditions. Besides, to efficiently handle model smoothness, we leverage adapted randomized model smoothing to obtain the certified robustness of different ensembles in practice. We show that the certified robustness of the ensembles, in turn, verifies the necessity of DRT. To compare different ensembles, we prove that when the adversarial transferability among base models is high, Max-Margin Ensemble can achieve higher certified robustness than Weighted Ensemble, and vice versa. Extensive experiments show that ensemble models trained with DRT achieve state-of-the-art certified robustness under various settings. Our work will shed light on future analysis of robust ensemble models.

1. INTRODUCTION

Deep neural networks (DNNs) have been widely applied in various applications, such as image classification (Krizhevsky, 2012; He et al., 2016), face recognition (Taigman et al., 2014; Sun et al., 2014), and natural language processing (Vaswani et al., 2017; Devlin et al., 2019). However, it is well-known that DNNs are vulnerable to adversarial examples (Szegedy et al., 2013; Carlini & Wagner, 2017; Xiao et al., 2018), which raises great concerns especially when they are deployed in safety-critical applications such as autonomous driving and facial recognition. To defend against such attacks, several empirical defenses have been proposed (Papernot et al., 2016b; Buckman et al., 2018; Madry et al., 2018); however, many of them have been broken again by strong adaptive attackers (Athalye et al., 2018; Tramer et al., 2020). On the other hand, certified defenses (Wong & Kolter, 2018; Cohen et al., 2019) have been proposed to provide certified robustness guarantees for given ML models, so that no attack can break the model under certain conditions. For instance, randomized smoothing has been proposed as an effective defense providing certified robustness (Lecuyer et al., 2019; Li et al., 2019; Cohen et al., 2019; Yang et al., 2020). Compared with other certified robustness approaches such as linear bound propagation (Weng et al., 2018; Mirman et al., 2018) and interval bound propagation (Gowal et al., 2019), randomized smoothing provides a way to smooth a given DNN efficiently and does not depend on the neural network architecture. However, existing defenses mainly focus on the robustness of a single ML model, and it is unclear whether an ensemble ML model could provide additional robustness. In this work, we aim to characterize the conditions for a robust ensemble and answer this question from both theoretical and empirical perspectives.
In particular, we analyze the standard Weighted Ensemble (WE) and Max-Margin Ensemble (MME) protocols, and prove the necessary and sufficient conditions for robust ensemble models under mild model-smoothness assumptions. Under these conditions, we can see that an ensemble model would be more robust than each single base model. An intuitive illustration of their certified robust radii is in Fig. 1. Our analysis shows that diversified gradients and large confidence margins of base models lead to higher certified robustness for ensemble models. Inspired by our analysis, we propose Diversity-Regularized Training (DRT), a lightweight regularization-based ensemble training approach. We derive certified robustness for both WE and MME trained with DRT, and realize the model-smoothness assumption via randomized smoothing. We analyze different smoothing protocols and prove that Ensemble Before Smoothing provides higher certified robustness. We further prove that when the adversarial transferability among base models is high, MME is more robust than WE. We evaluate DRT on a wide range of datasets including MNIST, CIFAR-10, and ImageNet. Extensive experiments show that DRT achieves higher certified robustness than the state-of-the-art baselines with training cost similar to that of a single model. Furthermore, when we combine DRT with existing robust models as the base models, DRT achieves the highest certified robustness to our best knowledge. We summarize our main contributions as follows: 1) We provide the necessary and sufficient conditions for robust ensemble models, including Weighted Ensemble (WE) and Max-Margin Ensemble (MME), under model-smoothness assumptions. We prove that an ensemble model is more robust than a single base model under the model-smoothness assumption. Our analysis shows that diversified gradients and large confidence margins of base models are the keys to robust ensembles.
2) Based on our analysis, we propose DRT, a lightweight regularization-based training approach containing both a Gradient Diversity Loss and a Confidence Margin Loss. 3) We derive certified robustness for ensemble models trained with DRT. The analysis of certified robustness further reveals the importance of DRT. Under mild conditions, we further prove that when the adversarial transferability among base models is high, MME is more robust than WE. 4) We conduct extensive experiments to evaluate the effectiveness of DRT on various datasets, which show that DRT achieves the best certified robustness with training time similar to that of a single ML model.

Related work. DNNs are known to be vulnerable to adversarial examples (Szegedy et al., 2013). To defend against such attacks, several empirical defenses have been proposed (Papernot et al., 2016b; Madry et al., 2018). For ensemble models, existing work mainly focuses on empirical robustness (Pang et al., 2019; Li et al., 2020; Srisakaokul et al., 2018), where robustness is measured by accuracy under existing attacks and no certifiable robustness guarantee can be provided or enhanced; or certifies the robustness of a vanilla weighted ensemble (Zhang et al., 2019; Liu et al., 2020) using either LP-based verification (Zhang et al., 2018) or randomized smoothing, but without diversity enforcement. In this paper, we aim to prove that gradient diversity and base-model confidence margin are two key factors for certified ensemble robustness, and based on these key factors, we propose a training approach to enhance the certified robustness of model ensembles.

Randomized smoothing (Cohen et al., 2019) has been proposed to provide certified robustness for a single ML model. It achieves state-of-the-art certified robustness on the large-scale ImageNet and CIFAR-10 datasets under the L_2 norm. Several approaches have further improved it by: (1) choosing different smoothing distributions for different L_p norms (Dvijotham et al., 2019; Zhang et al., 2020; Yang et al., 2020), and (2) training more robust smoothed classifiers, using data augmentation (Cohen et al., 2019), unlabeled data (Carmon et al., 2019), adversarial training (Salman et al., 2019), regularization (Li et al., 2019; Zhai et al., 2019), and denoising (Salman et al., 2020). However, to our knowledge, there is no work studying how to customize randomized smoothing for ensemble models. In this paper, we compare and select a good randomized smoothing strategy to improve the certified robustness of the ensemble. We mainly focus on certified robustness under the L_2 norm. Though randomized smoothing suffers from difficulties when it comes to the L_∞ norm (Yang et al., 2020; Kumar et al., 2020), our analysis of certified robustness and the training approach DRT can be extended to other L_p norms.

2. DIVERSITY-REGULARIZED TRAINING

In this section, we first provide the robustness conditions for the standard Weighted Ensemble and Max-Margin Ensemble. Using these robustness conditions, we can compare the certified robustness of the ensemble models and a single base model. The comparison shows that under model-smoothness assumptions, the ensemble models are more robust in terms of their certified robustness. Motivated by the key factors in the robustness conditions, we then propose Diversity-Regularized Training.

Notations. Throughout the paper, we consider the classification task with C classes. We first define the classification scoring function f : R^d → Δ^C, which maps the input to a confidence vector; f(x)_i represents the confidence for the i-th class. We mainly focus on the confidence after normalization, i.e., f(x) ∈ Δ^C = {p ∈ R^C_{≥0} : ‖p‖_1 = 1} lies in the probability simplex. To characterize the confidence margin between two classes, we define f^{y_1/y_2}(x) := f(x)_{y_1} − f(x)_{y_2}. The corresponding prediction F : R^d → [C] is defined by F(x) := arg max_{i∈[C]} f(x)_i. We are also interested in the runner-up prediction F^{(2)}(x) := arg max_{i∈[C]: i≠F(x)} f(x)_i. In this paper, we mainly consider model robustness against L_2-bounded perturbations.

Definition 1 (r-Robustness). For a prediction function F : R^d → [C] and input x_0, if every instance x ∈ {x_0 + δ : ‖δ‖_2 ≤ r} satisfies F(x) = F(x_0), we say model F is r-robust (at point x_0).

We map existing certified robustness (Cohen et al., 2019) to r-robustness in Appendix B.1.

2.1. ROBUSTNESS CONDITIONS FOR ENSEMBLE MODELS

An ensemble model contains N base models {f_i}_{i=1}^N, where F_i and F_i^{(2)} are their top and runner-up predictions respectively. The ensemble prediction is denoted by M : R^d → [C], which is computed from the outputs of the base models following certain ensemble protocols. In this paper, we consider both Weighted Ensemble (WE) and Max-Margin Ensemble (MME).

Definition 2 (Weighted Ensemble (WE)). Given N base models {f_i}_{i=1}^N and a weight vector {w_i}_{i=1}^N ∈ R^N_+, the Weighted Ensemble M_WE : R^d → [C] is constructed such that for any input x_0:

    M_WE(x_0) := arg max_{i∈[C]} Σ_{j=1}^N w_j f_j(x_0)_i.    (1)

Definition 3 (Max-Margin Ensemble (MME)). Given N base models {f_i}_{i=1}^N, for input x_0, the Max-Margin Ensemble M_MME : R^d → [C] is defined by M_MME(x_0) := F_c(x_0), where

    c = arg max_{i∈[N]} f_i(x_0)_{F_i(x_0)} − f_i(x_0)_{F_i^{(2)}(x_0)}.

WE sums up the weighted confidence scores of the base models {f_i}_{i=1}^N with weight vector {w_i}_{i=1}^N and predicts the class with the highest value; it is a commonly used protocol (Zhang et al., 2019; Liu et al., 2020). MME chooses the base model with the largest confidence margin between its top and runner-up classes, a direct extension of max-margin training (Huang et al., 2008).
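The two protocols can be sketched directly from Definitions 2 and 3. This is an illustrative reimplementation (function names are ours, not from the paper's code), where each base model outputs a confidence vector over C classes:

```python
def weighted_ensemble(confidences, weights):
    """WE: predict the class with the highest weighted sum of confidences."""
    C = len(confidences[0])
    scores = [sum(w * f[c] for f, w in zip(confidences, weights)) for c in range(C)]
    return max(range(C), key=lambda c: scores[c])

def max_margin_ensemble(confidences):
    """MME: follow the base model with the largest top-vs-runner-up margin."""
    def margin(f):
        top, runner = sorted(f, reverse=True)[:2]
        return top - runner
    best = max(confidences, key=margin)
    return max(range(len(best)), key=lambda c: best[c])

f1 = [0.6, 0.3, 0.1]   # base model 1: predicts class 0 with margin 0.3
f2 = [0.2, 0.7, 0.1]   # base model 2: predicts class 1 with margin 0.5
print(weighted_ensemble([f1, f2], [1.0, 1.0]))  # 1 (summed scores: 0.8 vs 1.0 vs 0.2)
print(max_margin_ensemble([f1, f2]))            # 1 (f2's margin 0.5 > f1's 0.3)
```

Note that the two protocols can disagree: with weights [3.0, 1.0], WE would follow f1 and predict class 0 on the same inputs.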

2.1.1. GENERAL ROBUSTNESS CONDITIONS

For WE, since it predicts the class with the highest aggregated confidence, we can directly state its sufficient and necessary condition for certified robustness.

Proposition 1 (Robustness Condition for WE). Consider an input x_0 ∈ R^d with ground-truth label y_0 ∈ [C], and a Weighted Ensemble M_WE constructed from base models {f_i}_{i=1}^N with weights {w_i}_{i=1}^N. Suppose M_WE(x_0) = y_0. Then the ensemble M_WE is r-robust at point x_0 if and only if for any x ∈ {x_0 + δ : ‖δ‖_2 ≤ r},

    min_{y_i∈[C]} Σ_{j=1}^N w_j f_j^{y_0/y_i}(x) ≥ 0.

For MME, however, the model prediction is decided by the base model with the largest confidence margin. This "maximum" is a discrete operator and poses challenges, especially in the multi-class setting: we cannot assert that the model predicts the true label by looking only at the margins between the true label and the other labels, unless we carefully filter out possible violating cases. In the following theorem, through careful analysis of the layout of confidence scores (e.g., enumerating the cases where y_0 is the top class, the runner-up class, or one of the other classes; see Lemmas B.1 and B.2 for details), we present a succinct yet sufficient and necessary condition for MME robustness.

Theorem 1 (Robustness Condition for MME). Consider an input x_0 ∈ R^d with ground-truth label y_0 ∈ [C]. Let M_MME be an MME defined over base models {f_i}_{i=1}^N. Suppose: (1) M_MME(x_0) = y_0; (2) for any x ∈ {x_0 + δ : ‖δ‖_2 ≤ r} and any base model i ∈ [N], either F_i(x) = y_0 or F_i^{(2)}(x) = y_0. Then the ensemble M_MME is r-robust at point x_0 if and only if for any x ∈ {x_0 + δ : ‖δ‖_2 ≤ r},

    max_{i∈[N]} min_{y_i∈[C]: y_i≠y_0} f_i^{y_0/y_i}(x) ≥ max_{i∈[N]} min_{y'_i∈[C]: y'_i≠y_0} f_i^{y'_i/y_0}(x).    (3)

In the theorem above, the y_i's and y'_i's are associated with each base model f_i and are respectively minimized over all C classes except class y_0. We defer the proof to Appendix B.2.
This theorem along with the intermediate lemmas in the proof serves as the foundation for our subsequent analysis.

2.1.2. DIVERSIFIED GRADIENTS AND LARGE CONFIDENCE MARGIN CONDITIONS

The conditions in Proposition 1 and Theorem 1 are rather general and quantify over all x ∈ {x_0 + δ : ‖δ‖_2 ≤ r}, which is challenging to verify for neural networks due to their non-convexity. In this section, we adapt the above conditions for DNNs based on the confidence scores and gradients of the base models at the input x_0, showing that diversified gradients and large confidence margins are the sufficient and necessary conditions for ensemble robustness.

Definition 4 (β-Smoothness). A differentiable function f : R^d → R^C is β-smooth if for any x, y ∈ R^d and any output dimension j ∈ [C],

    ‖∇_x f(x)_j − ∇_y f(y)_j‖_2 / ‖x − y‖_2 ≤ β.

The definition of β-smoothness is inherited from the optimization literature, and is equivalent to the curvature bound in the certified robustness literature (Singla & Feizi, 2020). Note that smaller β indicates a smoother model, and when β = 0 the model is linear. For Weighted Ensemble, we have the following robustness conditions.

Theorem 2 (Gradient and Confidence Margin Conditions for WE Robustness). Given an input x_0 ∈ R^d with ground-truth label y_0 ∈ [C], let M_WE be a WE defined over base models {f_i}_{i=1}^N with weights {w_i}_{i=1}^N, where M_WE(x_0) = y_0 and all base models f_i are β-smooth.

• (Sufficient Condition) M_WE is r-robust at point x_0 if for any y_i ≠ y_0,

    ‖Σ_{j=1}^N w_j ∇_x f_j^{y_0/y_i}(x_0)‖_2 ≤ (1/r) Σ_{j=1}^N w_j f_j^{y_0/y_i}(x_0) − βr Σ_{j=1}^N w_j.    (4)

• (Necessary Condition) If M_WE is r-robust at point x_0, then for any y_i ≠ y_0,

    ‖Σ_{j=1}^N w_j ∇_x f_j^{y_0/y_i}(x_0)‖_2 ≤ (1/r) Σ_{j=1}^N w_j f_j^{y_0/y_i}(x_0) + βr Σ_{j=1}^N w_j.    (5)

The proof follows directly from our general robustness conditions and a Taylor expansion at x_0. For Max-Margin Ensemble with two base models, we derive the following robustness conditions.

Theorem 3 (Gradient and Confidence Margin Conditions for MME Robustness). Given an input x_0 ∈ R^d with ground-truth label y_0 ∈ [C], let M_MME be an MME defined over base models {f_1, f_2}, where M_MME(x_0) = y_0 and both f_1 and f_2 are β-smooth.
• (Sufficient Condition) M_MME is r-robust at point x_0 if for any y_1, y_2 ∈ [C] with y_1 ≠ y_0 and y_2 ≠ y_0,

    ‖∇_x f_1^{y_0/y_1}(x_0) + ∇_x f_2^{y_0/y_2}(x_0)‖_2 ≤ (1/r)(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)) − 2βr.    (6)

• (Necessary Condition) Suppose that for any x ∈ {x_0 + δ : ‖δ‖_2 ≤ r} and i ∈ {1, 2}, either F_i(x) = y_0 or F_i^{(2)}(x) = y_0. If M_MME is r-robust at point x_0, then for any y_1, y_2 ∈ [C] with y_1 ≠ y_0 and y_2 ≠ y_0,

    ‖∇_x f_1^{y_0/y_1}(x_0) + ∇_x f_2^{y_0/y_2}(x_0)‖_2 ≤ (1/r)(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)) + 2βr.    (7)

The proof combines the proof procedure of Theorem 1 with a Taylor expansion at x_0. We remark that for MME, it is challenging to extend the theorem to the case of N > 2 base models: the "maximum" operator in MME prevents expressing the robustness condition in the succinct form of a continuous function, so the Taylor expansion cannot be applied. The general tendency, however, should be the same. For comparison, we also derive the robustness conditions for a single model in Appendix B.3.

Comparison of ensemble and single-model robustness. The preceding theorems enable comparing the certified robustness of an ensemble and of a single ML model.

Corollary 1 (Comparison of Ensemble and Single-Model Robustness). Given an input x_0 ∈ R^d with ground-truth label y_0 ∈ [C], suppose we have two β-smooth base models {f_1, f_2} which are both r-robust at point x_0. For any Δ ∈ [0, 1):

• (Weighted Ensemble) Define the Weighted Ensemble M_WE over base models {f_1, f_2} with weights {w_1, w_2}, and suppose M_WE(x_0) = y_0.
If for any label y_i ≠ y_0, the base models' smoothness satisfies β ≤ Δ · min{f_1^{y_0/y_i}(x_0), f_2^{y_0/y_i}(x_0)}/(c^2 r^2), and the gradient cosine similarity satisfies cos⟨∇_x f_1^{y_0/y_i}(x_0), ∇_x f_2^{y_0/y_i}(x_0)⟩ ≤ cos θ, then M_WE is at least R-robust at x_0 with

    R = r · (1 − Δ)/(1 + Δ) · (1 − C_WE(1 − cos θ))^{−1/2},    (8)

where C_WE = min_{y_i ≠ y_0} 2 w_1 w_2 f_1^{y_0/y_i}(x_0) f_2^{y_0/y_i}(x_0) / (w_1 f_1^{y_0/y_i}(x_0) + w_2 f_2^{y_0/y_i}(x_0))^2, and c = max{(1 − Δ)/(1 + Δ) · (1 − C_WE(1 − cos θ))^{−1/2}, 1}.

• (Max-Margin Ensemble) Define the Max-Margin Ensemble M_MME over base models {f_1, f_2}, and suppose M_MME(x_0) = y_0. If for any labels y_1 ≠ y_0 and y_2 ≠ y_0, the base models' smoothness satisfies β ≤ Δ · min{f_1^{y_0/y_1}(x_0), f_2^{y_0/y_2}(x_0)}/(c^2 r^2), and the gradient cosine similarity satisfies cos⟨∇_x f_1^{y_0/y_1}(x_0), ∇_x f_2^{y_0/y_2}(x_0)⟩ ≤ cos θ, then M_MME is at least R-robust at x_0 with

    R = r · (1 − Δ)/(1 + Δ) · (1 − C_MME(1 − cos θ))^{−1/2},    (9)

where C_MME = min_{y_1, y_2 ≠ y_0} 2 f_1^{y_0/y_1}(x_0) f_2^{y_0/y_2}(x_0) / (f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0))^2, and c = max{(1 − Δ)/(1 + Δ) · (1 − C_MME(1 − cos θ))^{−1/2}, 1}.

The above corollary reveals the connection between ensemble and single-model robustness. As we can see, when cos θ < 1 − 4Δ/(C(1 + Δ)^2) (C is either C_WE or C_MME), we have R > r. When the base models are smooth enough (β → 0^+, so Δ → 0^+), the right-hand side 1 − 4Δ/(C(1 + Δ)^2) of this condition tends to 1^−. As long as the cosine similarity of the base models' gradients is not close to 1, the condition is easily satisfied, i.e., the ensemble achieves higher certified robustness than the base models. Furthermore, the diversity of gradients, measured by cosine similarity, is important for improving ensemble robustness. We defer the proof to Appendix B.4. Next, we discuss the implications of our theoretical analysis.

Key factors for the certified robustness of an ensemble. We observe that a smaller magnitude of the joint gradients (e.g., ‖Σ_{j=1}^N w_j ∇_x f_j^{y_0/y_i}(x_0)‖_2 in Theorem 2, or ‖∇_x f_1^{y_0/y_1}(x_0) + ∇_x f_2^{y_0/y_2}(x_0)‖_2 in Theorem 3) yields a smaller LHS in Equations (4) to (7). Since these robustness conditions have the form LHS ≤ RHS, the conditions become easier to satisfy at the current radius r, i.e., the certified robust radius can be improved. Therefore, a smaller magnitude of the joint gradients leads to higher certified ensemble robustness.

Inspired by the law of cosines, for any two vectors a, b,

    ‖a + b‖_2 = (‖a‖_2^2 + ‖b‖_2^2 + 2‖a‖_2 ‖b‖_2 cos⟨a, b⟩)^{1/2},

a smaller magnitude of the joint gradients can be achieved by a smaller gradient magnitude of the base models or a larger diversity (in terms of smaller cosine similarity) between the gradients of the base models. Therefore, constraining the magnitude of the joint gradient amounts to improving gradient diversity and reducing the base models' gradient magnitudes, and both contribute to improved ensemble robustness. We also observe that large confidence margins, such as Σ_{j=1}^N w_j f_j^{y_0/y_i}(x_0) (Theorem 2) and f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0) (Theorem 3), directly enlarge the RHS in Equations (4) to (7). Again, the robustness conditions become easier to satisfy and a larger robust radius r can be achieved. Thus, increasing the confidence margins leads to higher ensemble robustness.

Comparison between ensemble and single-model robustness. As the discussion following Corollary 1 reveals, when the base models are smooth enough, the ensemble model is more robust than its base models. Moreover, we prove that the certified robustness of ensembles is positively correlated with the base-model (gradient) diversity, which is aligned with existing empirical observations (Tramèr et al., 2017; Pang et al., 2019).
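To make this interplay concrete, the sketch below (illustrative names and values, not the paper's code) computes the joint gradient magnitude from the base models' gradients via the law of cosines, and plugs it into the Theorem 2 sufficient condition. Writing g = ‖Σ_j w_j ∇_x f_j^{y_0/y_i}(x_0)‖_2, m = Σ_j w_j f_j^{y_0/y_i}(x_0), and W = Σ_j w_j, the condition g ≤ m/r − βrW holds iff βWr^2 + gr − m ≤ 0, so the largest certifiable r is the positive root of this quadratic:

```python
import math

def joint_gradient_norm(a, b):
    """Law of cosines: ||a + b||_2 from the norms and cosine similarity of a, b."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos = sum(x * y for x, y in zip(a, b)) / (na * nb)
    return math.sqrt(na ** 2 + nb ** 2 + 2 * na * nb * cos)

def we_radius(g, m, W, beta):
    """Largest r with beta*W*r^2 + g*r - m <= 0 (Theorem 2, sufficient side)."""
    if beta == 0:                  # linear base models: condition reduces to g <= m/r
        return m / g if g > 0 else math.inf
    return (-g + math.sqrt(g * g + 4 * beta * W * m)) / (2 * beta * W)

grad1, grad2 = [1.0, 0.5], [-0.9, 0.6]   # diverse margin gradients (cosine < 0)
g_diverse = joint_gradient_norm(grad1, grad2)
g_aligned = joint_gradient_norm(grad1, grad1)
print(g_diverse < g_aligned)                         # True: diversity shrinks ||a + b||
print(we_radius(g_diverse, m=0.5, W=2.0, beta=0.1) >
      we_radius(g_aligned, m=0.5, W=2.0, beta=0.1))  # True: and enlarges the radius
```

The monotonicity shown here is exactly the "key factors" argument: lowering the joint gradient norm (via diversity) or raising the margin m enlarges the certifiable radius.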

2.2. DIVERSITY-REGULARIZED TRAINING

Inspired by the above key factors for certified ensemble robustness, we propose Diversity-Regularized Training (DRT). In particular, let x_0 be a training sample; DRT adds the following two regularization terms to the objective function to minimize:

• Gradient Diversity Loss (GD Loss): L_GD(x_0)_{ij} = ‖∇_x f_i^{y_0/y_i^{(2)}}(x_0) + ∇_x f_j^{y_0/y_j^{(2)}}(x_0)‖_2.    (10)

• Confidence Margin Loss (CM Loss): L_CM(x_0)_{ij} = f_i^{y_i^{(2)}/y_0}(x_0) + f_j^{y_j^{(2)}/y_0}(x_0).    (11)

In Equations (10) and (11), y_0 is the ground-truth label of x_0, and y_i^{(2)} (resp. y_j^{(2)}) is the runner-up class of base model F_i (resp. F_j). Intuitively, for each model pair (i, j) with i, j ∈ [N] and i ≠ j, the GD Loss encourages the joint gradient, i.e., the sum of the gradient vectors of models i and j, to be small. Note that the gradient computed here is in fact the gradient of the confidence margin between classes. As our theorems reveal, it is this margin gradient, rather than the raw gradient itself, that matters, which improves on previous understandings of gradient diversity (Pang et al., 2019; Demontis et al., 2019). The GD Loss encourages both large gradient diversity and small base-model gradient magnitudes in a naturally balanced way, encoding the interplay between gradient magnitude and direction diversity. In comparison, solely regularizing the base models' gradient magnitudes would hurt benign accuracy, while solely regularizing gradient diversity is hard to realize due to the boundedness of cosine similarity. The CM Loss encourages a large margin between the true and runner-up classes for the base models. Both regularization terms are directly motivated by our analysis, and the detailed implementation can be found in Section 4.
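As a concrete sketch, the two losses for one model pair reduce to the following (hypothetical helper names; the per-model margin gradients are assumed to be precomputed, e.g. by autodiff):

```python
import math

def gd_loss(grad_i, grad_j):
    """GD Loss: L2 norm of the pair's joint margin gradient."""
    return math.sqrt(sum((gi + gj) ** 2 for gi, gj in zip(grad_i, grad_j)))

def cm_loss(conf_i, conf_j, y0):
    """CM Loss: runner-up-minus-true-class margins summed over the pair.
    Minimizing this pushes both base models toward large confidence margins.
    When a model's top prediction is y0, the max over non-y0 classes is
    exactly its runner-up class confidence."""
    def runner_up_margin(conf):
        runner = max(c for k, c in enumerate(conf) if k != y0)
        return runner - conf[y0]
    return runner_up_margin(conf_i) + runner_up_margin(conf_j)

print(gd_loss([1.0, 0.0], [-1.0, 0.0]))                     # 0.0: perfectly diverse gradients
print(round(cm_loss([0.7, 0.2, 0.1], [0.6, 0.3, 0.1], 0), 4))  # -0.8
```

Both terms are differentiable in the model parameters, which is what makes DRT a lightweight regularizer rather than a separate certification procedure.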

3. ROBUSTNESS FOR SMOOTHED ML ENSEMBLES

To compute the certified robustness of the different ensembles based on Theorems 2 and 3, we need to ensure model smoothness, which is challenging. Thus, in this section, we apply adapted randomized model smoothing to compute the certified robustness of general ensembles based on our conditions. We focus on the Ensemble Before Smoothing (EBS) strategy: first construct the ensemble M from the base models, then smooth M's prediction; M can be either M_WE or M_MME. We also consider the alternative strategy of smoothing the base models first and then ensembling. The analysis of these two strategies, which proves that EBS is more robust, is deferred to Appendix C.
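Operationally, EBS means the Monte Carlo vote is taken over the ensemble's prediction on noisy inputs, not over each base model separately. A minimal sketch (hypothetical toy base models; WE used for concreteness):

```python
import random
from collections import Counter

def we_predict(x, base_models, weights):
    """Ensemble first: WE prediction on a single (possibly noisy) input."""
    C = len(base_models[0](x))
    scores = [sum(w * f(x)[c] for f, w in zip(base_models, weights)) for c in range(C)]
    return max(range(C), key=lambda c: scores[c])

def smoothed_predict(x, predict, sigma, n_samples, rng):
    """Then smooth: majority vote of the ensemble over Gaussian perturbations."""
    votes = Counter(predict([xi + rng.gauss(0.0, sigma) for xi in x])
                    for _ in range(n_samples))
    return votes.most_common(1)[0][0]

# toy 1-D, 2-class base models (illustrative only)
def f1(x): return [0.8, 0.2] if x[0] > 0 else [0.2, 0.8]
def f2(x): return [0.7, 0.3] if x[0] > 0 else [0.3, 0.7]

pred = smoothed_predict([1.0], lambda x: we_predict(x, [f1, f2], [1.0, 1.0]),
                        sigma=0.5, n_samples=200, rng=random.Random(0))
print(pred)  # 0: most perturbations keep x[0] > 0
```

In practice the vote count also feeds a confidence interval that yields the certified radius (Theorem B.1); the sketch omits that statistical step.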

3.1. CERTIFIED ROBUSTNESS OF ENSEMBLES VIA RANDOMIZED SMOOTHING

To derive the certified robustness of both MME and WE, we first define statistical robustness and confidence for single and ensemble models.

Definition 5 ((ε, p)-Statistical Robustness). Given a random variable ε and a model F : R^d → [C], at point x_0 with ground-truth label y_0, we call F (ε, p)-statistically robust if Pr(F(x_0 + ε) = y_0) ≥ p.

Note that based on Theorem B.1, when ε ∼ N(0, σ^2 I_d), if F is (ε, p)-statistically robust at point x_0, the smoothed model G^F over F is (σΦ^{−1}(p))-robust at point x_0.

Definition 6 ((ε, λ, p)-WE Confidence). Let M_WE be a Weighted Ensemble defined over base models {f_i}_{i=1}^N with weights {w_i}_{i=1}^N. If at point x_0 with ground truth y_0 and random variable ε we have

    Pr( max_{y_j∈[C]: y_j≠y_0} Σ_{i=1}^N w_i f_i(x_0 + ε)_{y_j} ≤ λ Σ_{i=1}^N w_i (1 − f_i(x_0 + ε)_{y_0}) ) = 1 − p,

we call M_WE (ε, λ, p)-WE confident at point x_0.

Definition 7 ((ε, λ, p)-MME Confidence). Let M_MME be a Max-Margin Ensemble over {f_i}_{i=1}^N. If at point x_0 with ground truth y_0 and random variable ε we have

    Pr( ∀i ∈ [N]: max_{y_j∈[C]: y_j≠y_0} f_i(x_0 + ε)_{y_j} ≤ λ(1 − f_i(x_0 + ε)_{y_0}) ) = 1 − p,

we call M_MME (ε, λ, p)-MME confident at point x_0.

Note that the confidence of every single model lies in the probability simplex, and λ reflects the fraction of the confidence mass beyond the true class, (1 − f_i(x_0 + ε)_{y_0}), that a wrong prediction class may take. Now we are ready to present the certified robustness of the different ensembles.

Theorem 4 (Certified Robustness for WE). Let ε be a random variable supported on R^d. Let M_WE be a Weighted Ensemble defined over {f_i}_{i=1}^N with weights {w_i}_{i=1}^N, and suppose M_WE is (ε, λ_1, p)-WE confident. Let x_0 ∈ R^d be the input with ground-truth label y_0 ∈ [C]. Assume {f_i(x_0 + ε)_{y_0}}_{i=1}^N, the confidence scores of the base models for label y_0, are i.i.d. and follow a symmetric distribution with mean µ and variance s^2, where µ > (1 + λ_1^{−1})^{−1}. Then

    Pr(M_WE(x_0 + ε) = y_0) ≥ 1 − p − (‖w‖_2^2 / ‖w‖_1^2) · s^2 / (2(µ − (1 + λ_1^{−1})^{−1})^2).

Theorem 5 (Certified Robustness for MME). Let ε be a random variable supported on R^d. Let M_MME be a Max-Margin Ensemble defined over {f_i}_{i=1}^N, and suppose M_MME is (ε, λ_2, p)-MME confident. Let x_0 ∈ R^d be the input with ground-truth label y_0 ∈ [C]. Assume {f_i(x_0 + ε)_{y_0}}_{i=1}^N are i.i.d. and follow a symmetric distribution with mean µ, where µ > (1 + λ_2^{−1})^{−1}. Define s_f^2 = Var(min_{i∈[N]} f_i(x_0 + ε)_{y_0}). Then

    Pr(M_MME(x_0 + ε) = y_0) ≥ 1 − p − s_f^2 / (2(µ − (1 + λ_2^{−1})^{−1})^2).

The condition µ > (1 + λ^{−1})^{−1} guarantees the normal performance of the model, and is the counterpart of the standard setup p_A > p_B in (Cohen et al., 2019). For comparison, we also derive the certified robustness of a single model in Proposition D.1. We defer the proofs to Appendix D.1.

Based on the theoretical analysis above, we draw additional implications on the connections between certified robustness and the different losses. For the Confidence Margin Loss, which increases the confidence margin of the ensemble by enlarging that of the base models, Theorems 4 and 5 show that a small λ (λ_1 in WE and λ_2 in MME) results in a large Pr(M(x_0 + ε) = y_0), i.e., large certified robustness. The Standard Training Loss, which increases the base models' confidence in the true class, can be viewed as increasing the average confidence score µ, and its effectiveness is likewise revealed by Theorems 4 and 5.
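The Theorem 4 bound, 1 − p − (‖w‖_2^2/‖w‖_1^2) · s^2 / (2(µ − (1 + λ_1^{−1})^{−1})^2), is easy to evaluate numerically. The sketch below uses illustrative values (not measurements from the paper); note how equal weights over N models shrink the variance term by a factor of N:

```python
def we_confidence_lower_bound(p, lam1, mu, s, weights):
    """Lower bound on Pr(M_WE(x0 + eps) = y0) from Theorem 4 (as reconstructed)."""
    w1 = sum(weights)                      # ||w||_1
    w2_sq = sum(w * w for w in weights)    # ||w||_2^2
    gap = mu - 1.0 / (1.0 + 1.0 / lam1)    # mu - (1 + lambda1^{-1})^{-1}, must be > 0
    assert gap > 0, "requires mu > (1 + lambda1^{-1})^{-1}"
    return 1.0 - p - (w2_sq / w1 ** 2) * s ** 2 / (2 * gap ** 2)

# three equal-weight base models vs. a single model (illustrative values)
three = we_confidence_lower_bound(p=0.05, lam1=0.5, mu=0.8, s=0.1, weights=[1, 1, 1])
single = we_confidence_lower_bound(p=0.05, lam1=0.5, mu=0.8, s=0.1, weights=[1])
print(round(three, 4))   # 0.9423
print(three > single)    # True: averaging over N models tightens the bound
```

This also makes the role of λ_1 visible: decreasing lam1 widens the gap term and raises the bound, matching the discussion of the CM Loss above.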

3.2. COMPARISON FOR THE CERTIFIED ROBUSTNESS OF ENSEMBLES

The unified form of certified robustness above allows us to compare it across different ensembles.

Corollary 2 (Comparison of Certified Robustness). Let ε be a random variable supported on R^d. Over base models {f_i}_{i=1}^N, let M_MME be the Max-Margin Ensemble and M_WE the Weighted Ensemble with weights {w_i}_{i=1}^N. Let x_0 ∈ R^d be the input with ground-truth label y_0 ∈ [C]. Assume {f_i(x_0 + ε)_{y_0}}_{i=1}^N, the confidence scores of the base models for label y_0, are i.i.d. and follow a symmetric distribution with mean µ and variance s^2, where µ > max{(1 + λ_1^{−1})^{−1}, (1 + λ_2^{−1})^{−1}}. Define s_f^2 = Var(min_{i∈[N]} f_i(x_0 + ε)_{y_0}) and assume s_f < s.

• When λ_1/λ_2 < (λ_2^{−1} · (s/s_f) · (µ − (1 + λ_2^{−1})^{−1}) + 1 − µ)^{−1}, for any weights {w_i}_{i=1}^N, M_WE has higher certified robustness than M_MME.

• When λ_1/λ_2 > (λ_2^{−1} · (s/(√N s_f)) · (µ − (1 + λ_2^{−1})^{−1}) + 1 − µ)^{−1}, for any weights {w_i}_{i=1}^N, M_MME has higher certified robustness than M_WE.

Here, the certified robustness is given by Theorems 4 and 5; Appendix D.2 contains the detailed proofs. Note that since λ_1 is the weighted average and λ_2 the maximum over the λ's of all base models, λ_1/λ_2 reflects the adversarial transferability (Papernot et al., 2016a) among the base models under the same p: if the transferability is high, the confidence scores of the base models are similar (the λ's are similar), and thus λ_1 is large, resulting in a large λ_1/λ_2. On the other hand, when the transferability is low, the confidence scores are diverse (the λ's are diverse), and thus λ_1 is small, resulting in a small λ_1/λ_2. Based on this analysis, MME is more robust when the transferability is high, while WE is more robust when the transferability is low. In Appendix D.3, we also prove that under certain distributions of f_i(x_0 + ε)_{y_0}, when N is sufficiently large, MME is always more robust. Appendix D.4 contains the numerical evaluations.
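The comparison can also be made directly by evaluating the two lower bounds of Theorems 4 and 5 under shared (µ, p). All values below are hypothetical (λ_1 < λ_2 models low transferability, and equal WE weights give ‖w‖_2^2/‖w‖_1^2 = 1/N):

```python
def lower_bound(p, lam, mu, var_term):
    """Shared form of the Theorem 4 / Theorem 5 lower bounds (as reconstructed):
    1 - p - var_term / (2 * (mu - (1 + 1/lam)^{-1})^2)."""
    gap = mu - 1.0 / (1.0 + 1.0 / lam)
    assert gap > 0
    return 1.0 - p - var_term / (2 * gap ** 2)

mu, p, N = 0.8, 0.05, 3
s, s_f = 0.10, 0.08                # s_f < s, as assumed in Corollary 2

we_bound = lower_bound(p, lam=0.5, mu=mu, var_term=(s ** 2) / N)   # lambda1 (weighted average)
mme_bound = lower_bound(p, lam=0.6, mu=mu, var_term=s_f ** 2)      # lambda2 >= lambda1 (maximum)
print(we_bound > mme_bound)  # True: with diverse lambdas (low transferability), WE wins
```

Raising λ_1 toward λ_2 (high transferability) shrinks the WE gap term while leaving the MME bound unchanged, flipping the comparison in MME's favor, consistent with the corollary.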

4. EXPERIMENTAL EVALUATION

In order to make a fair comparison with existing work (Cohen et al., 2019; Salman et al., 2019) , we evaluate our approach on different datasets: MNIST (LeCun et al., 2010) , CIFAR-10 (Krizhevsky, 2012), and ImageNet (Deng et al., 2009) . We show that by training MME/WE with DRT, our model can achieve the state-of-the-art certified robustness.

4.1. EXPERIMENTAL SETUP

Baselines: We mainly consider two state-of-the-art baselines for certified robustness: 1) Gaussian smoothing (Cohen et al., 2019), which trains a smoothed classifier by applying Gaussian augmentation; 2) SmoothAdv (Salman et al., 2019), which integrates adversarial training on a soft approximation of the smoothed classifier. Comparisons with more baselines can be found in Appendix E.4.

Model structures:

For each base model in our ensemble, we follow the same configuration as the baselines: LeNet (LeCun et al., 1998) for MNIST, and ResNet-110 and ResNet-50 (He et al., 2016) for the CIFAR-10 and ImageNet datasets, respectively.

Training: We smooth the N base models of an ensemble following the baselines (Cohen et al., 2019; Salman et al., 2019). For each input x_0 with ground truth y_0, we use x_0 + ε with ε ∼ N(0, σ^2 I_d) as the training input for each base model. We call two base models f_i, f_j a valid model pair at (x_0, y_0) if both F_i(x_0 + ε) and F_j(x_0 + ε) predict y_0. For every valid model pair, we apply the GD Loss and CM Loss with weight parameters ρ_1 and ρ_2. The final training loss of an ensemble is:

    L = Σ_{i∈[N]} L_std(x_0 + ε, y_0)_i + ρ_1 Σ_{(i,j)} L_GD(x_0 + ε)_{ij} + ρ_2 Σ_{(i,j)} L_CM(x_0 + ε)_{ij},

where the sums over (i, j) range over all valid model pairs, i.e., i, j ∈ [N], i ≠ j, F_i(x_0 + ε) = y_0, and F_j(x_0 + ε) = y_0. The standard training loss L_std(x_0 + ε, y_0)_i of each base model f_i is either the cross-entropy loss as in (Cohen et al., 2019; Yang et al., 2020) or the adversarial training loss as in (Salman et al., 2019). More training details are in Appendix E.

Robustness Certification: During certification, we apply the MME or WE ensemble protocol to the trained base models {f_i}_{i=1}^N to obtain the ensemble M, and then smooth M with noise ε ∼ N(0, σ^2 I_d). We report the standard certified accuracy under different L_2 radii r as our evaluation metric (Cohen et al., 2019) (more implementation details in Appendix E).
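The assembly of the training objective can be sketched as follows (hypothetical helper names; the per-model standard losses and per-pair GD/CM losses are assumed to be computed elsewhere, e.g. by autodiff over the noisy input):

```python
import itertools

def drt_objective(std_losses, preds, gd, cm, y0, rho1, rho2):
    """Sum of standard losses plus GD/CM regularizers over valid model pairs,
    i.e. pairs (i, j) where both base models predict y0 on the noisy input."""
    total = sum(std_losses)
    for i, j in itertools.combinations(range(len(preds)), 2):
        if preds[i] == y0 and preds[j] == y0:          # valid model pair
            total += rho1 * gd[(i, j)] + rho2 * cm[(i, j)]
    return total

preds = [0, 0, 1]   # model 2 misclassifies the noisy input: only pair (0, 1) is valid
loss = drt_objective(std_losses=[0.3, 0.4, 0.9], preds=preds,
                     gd={(0, 1): 0.2}, cm={(0, 1): -0.5},
                     y0=0, rho1=1.0, rho2=0.5)
print(round(loss, 4))  # 1.55 = 0.3 + 0.4 + 0.9 + 1.0*0.2 + 0.5*(-0.5)
```

Restricting the regularizers to valid pairs keeps the diversity pressure on models that are already correct, rather than pulling misclassifying models further away.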

4.2. EXPERIMENTAL RESULTS

In our experiments, we consider ensemble models consisting of three base models on the MNIST, CIFAR-10, and ImageNet datasets. We observe that when MME or WE (using different base models) is trained with DRT, it achieves state-of-the-art certified robustness. The evaluation results on MNIST are shown in Table 1. We observe that the certified accuracy improves slightly when applying MME or WE compared with a single base model (aligned with Corollary 1). After training with DRT, the improvements become significant, and the DRT-trained ensemble models achieve the highest certified accuracy under every radius r. In particular, the DRT ensemble surpasses the base model's certified accuracy by around 7% at the large radius r = 2.50. We further compare the certified robustness of WE and MME in Appendix D.4. On CIFAR-10, the evaluation results are shown in Table 2. Similarly, the DRT-based ensemble models achieve the best certified robustness under different radii r (more experimental details are in Appendix E.2). Note that the DRT-based ensemble with Gaussian-smoothed base models achieves results comparable to SmoothAdv with less training time (a detailed efficiency analysis is in Appendix E.2). We defer the results on ImageNet to Appendix E.3 and the discussion of hyper-parameters under different settings to Appendix E.

5. CONCLUSION

In this paper, we explore and characterize the robustness conditions for ensemble ML models theoretically, and propose DRT for training robust ensembles in practice. Our analysis justifies the regularization-based training approach DRT and explains why an ensemble model can achieve higher robustness than a single model. In particular, we show that a small magnitude of joint gradients and large confidence margins are the key factors contributing to the high certified robustness of an ensemble. We further compare the certified robustness of two types of ensembles, Weighted Ensemble and Max-Margin Ensemble, under the randomized smoothing regime. Extensive experiments show that ensemble models trained with DRT achieve higher certified robustness than existing approaches.

A TABLE OF THEORETICAL RESULTS

For a quick index for the theoretical results, we refer the readers to Table 3 .

B FORMAL DEFINITIONS AND PROOFS OF ROBUSTNESS CONDITIONS

In this appendix, we discuss the connection between r-robustness and certified robustness given by randomized smoothing, and the detailed proofs of the robustness conditions in Section 2.

B.1 r-ROBUSTNESS AND RANDOMIZED SMOOTHING

In this subsection, we discuss the connection between $r$-robustness and the certified robustness given by randomized smoothing (Cohen et al., 2019). In randomized smoothing, each input's prediction is given by the most probable prediction after adding noise. Formally, let $\epsilon \sim \mathcal{N}(0, \sigma^2 I_d)$ be a Gaussian random variable. From the model prediction $F$, we define the smoothed classifier $G^F : \mathbb{R}^d \to [C]$ where $G^F(x) = \arg\max_{j \in [C]} g^F(x)_j$ and
$$g^F(x)_j := \mathbb{E}_{\epsilon \sim \mathcal{N}(0,\sigma^2 I_d)}\, \mathbb{1}[F(x+\epsilon) = j] = \Pr_{\epsilon \sim \mathcal{N}(0,\sigma^2 I_d)}\big(F(x+\epsilon) = j\big).$$
Intuitively, the confidence score for each class is the probability of predicting that class under noised inputs.

Theorem B.1 (Simplified; Certified Robustness via Randomized Smoothing; Cohen et al. (2019)). At point $x_0$, let $\epsilon \sim \mathcal{N}(0, \sigma^2 I_d)$. The smoothed model $G^F$ is $r$-robust, where
$$r := \sigma \Phi^{-1}\big(g^F(x_0)_{G^F(x_0)}\big) \tag{19}$$
and $\Phi^{-1}$ is the inverse of the Gaussian CDF.

We remark that a tighter certified radius is
$$r' := \frac{\sigma}{2}\Big(\Phi^{-1}\big(g^F(x_0)_{G^F(x_0)}\big) - \Phi^{-1}\big(g^F(x_0)_{G^F_{(2)}(x_0)}\big)\Big) \geq r,$$
where $G^F_{(2)}(x_0)$ is the runner-up class; for ease of sampling, Equation (19) is used more often in the literature. In Sections 3 and 4, we use Equation (19) for analysis and empirical evaluation, and these results can be generalized to the tighter radius $r'$ easily.
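Both certified radii in Theorem B.1 are straightforward to evaluate from the smoothed model's class probabilities. A minimal sketch (function names are ours), using Python's `statistics.NormalDist` for $\Phi^{-1}$:

```python
from statistics import NormalDist

def certified_radius(p_top, sigma):
    """Equation (19): r = sigma * Phi^{-1}(p_top), valid when the smoothed
    model's top-class probability p_top exceeds 1/2."""
    assert p_top > 0.5, "smoothed top-class probability must exceed 1/2"
    return sigma * NormalDist().inv_cdf(p_top)

def certified_radius_tight(p_top, p_runner_up, sigma):
    """Tighter radius r' = sigma/2 * (Phi^{-1}(p_top) - Phi^{-1}(p_runner_up));
    equals r when p_runner_up = 1 - p_top, and exceeds it otherwise."""
    nd = NormalDist()
    return sigma / 2 * (nd.inv_cdf(p_top) - nd.inv_cdf(p_runner_up))
```

With two classes, $p_{\text{runner-up}} = 1 - p_{\text{top}}$ and the two radii coincide; with more classes the runner-up probability can be smaller, which is exactly where $r' \geq r$ gains.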

B.2 GENERAL ROBUSTNESS CONDITIONS

Proposition 1 (Robustness Condition for WE). Consider an input $x_0 \in \mathbb{R}^d$ with ground-truth label $y_0 \in [C]$, and an ensemble model $\mathcal{M}_{\mathrm{WE}}$ constructed from base models $\{f_i\}_{i=1}^N$ with weights $\{w_i\}_{i=1}^N$. Suppose $\mathcal{M}_{\mathrm{WE}}(x_0) = y_0$. Then the ensemble $\mathcal{M}_{\mathrm{WE}}$ is $r$-robust at point $x_0$ if and only if for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$,
$$\min_{y_i \in [C]: y_i \neq y_0} \sum_{j=1}^N w_j f_j^{y_0/y_i}(x) \geq 0.$$

Proof of Proposition 1. According to the definition of $r$-robustness, $\mathcal{M}_{\mathrm{WE}}$ is $r$-robust if and only if for any point $x := x_0+\delta$ with $\|\delta\|_2 \leq r$ we have $\mathcal{M}_{\mathrm{WE}}(x_0+\delta) = y_0$, i.e., for any other label $y_i \neq y_0$, the confidence score for label $y_0$ is greater than or equal to the confidence score for label $y_i$: $\sum_{j=1}^N w_j f_j(x)_{y_0} \geq \sum_{j=1}^N w_j f_j(x)_{y_i}$ for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$. Since this should hold for any $y_i \neq y_0$, we obtain the necessary and sufficient condition $\min_{y_i \in [C]: y_i \neq y_0} \sum_{j=1}^N w_j f_j^{y_0/y_i}(x) \geq 0$.

Theorem 1 (Robustness Condition for MME). Consider an input $x_0 \in \mathbb{R}^d$ with ground-truth label $y_0 \in [C]$. Let $\mathcal{M}_{\mathrm{MME}}$ be an MME defined over base models $\{f_i\}_{i=1}^N$. Suppose: (1) $\mathcal{M}_{\mathrm{MME}}(x_0) = y_0$; (2) for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$ and any base model $i \in [N]$, either $F_i(x) = y_0$ or $F_i^{(2)}(x) = y_0$. Then the ensemble $\mathcal{M}_{\mathrm{MME}}$ is $r$-robust at point $x_0$ if and only if for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$,
$$\max_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x) \geq \max_{i \in [N]} \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_i'/y_0}(x). \tag{3}$$
The theorem states the sufficient and necessary robustness condition for MME. We divide the two directions into the following two lemmas and prove them separately. We mainly use the following alternative form of Equation (3) in the lemmas and their proofs:
$$\max_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x) + \min_{i \in [N]} \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_0/y_i'}(x) \geq 0.$$
Lemma B.1 (Sufficient Condition for MME). Let $\mathcal{M}_{\mathrm{MME}}$ be an MME defined over base models $\{f_i\}_{i=1}^N$.
For any input $x_0 \in \mathbb{R}^d$, the Max-Margin Ensemble $\mathcal{M}_{\mathrm{MME}}$ predicts $\mathcal{M}_{\mathrm{MME}}(x_0) = y_0$ if
$$\max_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x_0) + \min_{i \in [N]} \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_0/y_i'}(x_0) \geq 0.$$

Proof of Lemma B.1. For brevity, for $i \in [N]$ we denote $y_i := F_i(x_0)$ and $y_i' := F_i^{(2)}(x_0)$, each base model's top class and runner-up class at point $x_0$. Suppose $\mathcal{M}_{\mathrm{MME}}(x_0) \neq y_0$. Then, according to the ensemble definition (see Definition 3), there exists $c \in [N]$ such that $\mathcal{M}_{\mathrm{MME}}(x_0) = F_c(x_0) = y_c$, and
$$\forall i \in [N], i \neq c, \quad f_c^{y_c/y_c'}(x_0) > f_i^{y_i/y_i'}(x_0). \tag{21}$$
Because $y_c \neq y_0$, we have $f_c(x_0)_{y_0} \leq f_c(x_0)_{y_c'}$, so that $f_c^{y_c/y_0}(x_0) \geq f_c^{y_c/y_c'}(x_0)$. Now consider any model $f_i$ with $i \in [N]$; we show that there exists $y^* \neq y_0$ such that $f_i^{y_i/y_i'}(x_0) \geq f_i^{y_0/y^*}(x_0)$:
• If $y_i = y_0$, let $y^* := y_i'$; trivially $f_i^{y_i/y_i'}(x_0) = f_i^{y_0/y^*}(x_0)$.
• If $y_i \neq y_0$ and $y_i' \neq y_0$, let $y^* := y_i'$; then $f_i^{y_i/y_i'}(x_0) = f_i^{y_i/y^*}(x_0) \geq f_i^{y_0/y^*}(x_0)$.
• If $y_i \neq y_0$ but $y_i' = y_0$, let $y^* := y_i$; then $f_i^{y_i/y_i'}(x_0) = f_i^{y_i/y_0}(x_0) \geq f_i^{y_0/y_i}(x_0) = f_i^{y_0/y^*}(x_0)$.
Combining the above findings with Equation (21), we have:
$$\forall i \in [N], i \neq c, \quad \exists y_c^* \in [C], y_c^* \neq y_0, \ \exists y_i^* \in [C], y_i^* \neq y_0: \quad f_c^{y_c^*/y_0}(x_0) > f_i^{y_0/y_i^*}(x_0). \tag{22}$$
Therefore, its negation — $\exists i \in [N], i \neq c$, $\forall y_c^* \in [C], y_c^* \neq y_0$, $\forall y_i^* \in [C], y_i^* \neq y_0$: $f_c^{y_0/y_c^*}(x_0) + f_i^{y_0/y_i^*}(x_0) \geq 0$ — implies $\mathcal{M}(x_0) = y_0$. Since Equation (22) holds for any $y_c^*$ and $y_i^*$, this condition is equivalent to
$$\exists i \in [N], i \neq c: \quad \min_{y_c \in [C]: y_c \neq y_0} f_c^{y_0/y_c}(x_0) + \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_0/y_i'}(x_0) \geq 0.$$
The existential quantifier over $i$ can be replaced by a maximum:
$$\min_{y_c \in [C]: y_c \neq y_0} f_c^{y_0/y_c}(x_0) + \max_{i \in [N]} \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_0/y_i'}(x_0) \geq 0.$$
It is implied by
$$\max_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x_0) + \min_{i \in [N]} \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_0/y_i'}(x_0) \geq 0.$$
Thus, Equation (3) is a sufficient condition for $\mathcal{M}_{\mathrm{MME}}(x_0) = y_0$.

Lemma B.2 (Necessary Condition for MME). For any input $x_0 \in \mathbb{R}^d$, if for every base model $i \in [N]$ either $F_i(x_0) = y_0$ or $F_i^{(2)}(x_0) = y_0$, then the Max-Margin Ensemble predicting $\mathcal{M}_{\mathrm{MME}}(x_0) = y_0$ implies
$$\max_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x_0) + \min_{i \in [N]} \min_{y_i' \in [C]: y_i' \neq y_0} f_i^{y_0/y_i'}(x_0) \geq 0.$$

Proof of Lemma B.2. As before, for $i \in [N]$ we denote $y_i := F_i(x_0)$ and $y_i' := F_i^{(2)}(x_0)$, each base model's top class and runner-up class at point $x_0$. Suppose Equation (3) is not satisfied; this means
$$\exists c \in [N], \ \exists y_c^* \in [C], y_c^* \neq y_0, \ \forall i \in [N], \ \exists y_i^* \in [C], y_i^* \neq y_0: \quad f_c^{y_c^*/y_0}(x_0) > f_i^{y_0/y_i^*}(x_0).$$
• If $y_c = y_0$, then $f_c^{y_c^*/y_0}(x_0) \leq 0$, which implies $f_i^{y_0/y_i^*}(x_0) < 0$, and hence $F_i(x_0) \neq y_0$ for every $i$ (so $y_i' = y_0$ by the assumption). Moreover, we know that
$$f_i^{y_i/y_i'}(x_0) = f_i^{y_i/y_0}(x_0) \geq f_i^{y_i^*/y_0}(x_0) > f_c^{y_0/y_c^*}(x_0) \geq f_c^{y_0/y_c'}(x_0) = f_c^{y_c/y_c'}(x_0),$$
so the ensemble selects some $F_i$ with a larger margin than $f_c$ rather than $F_c$, i.e., $\mathcal{M}(x_0) \neq y_0$.
• If $y_c \neq y_0$, then $f_c^{y_c/y_0}(x_0) \geq f_c^{y_c^*/y_0}(x_0) > f_i^{y_0/y_i^*}(x_0)$. If $F_i(x_0) = y_0$, then $f_i^{y_0/y_i^*}(x_0) \geq f_i^{y_0/y_i'}(x_0) = f_i^{y_i/y_i'}(x_0)$. Thus $f_c^{y_c/y_c'}(x_0) = f_c^{y_c/y_0}(x_0) > f_i^{y_i/y_i'}(x_0)$, i.e., the margin of $f_c$ exceeds that of every correctly-predicting base model. As a result, $\mathcal{M}(x_0) = F_c(x_0) \neq y_0$.
In both cases we have shown $\mathcal{M}_{\mathrm{MME}}(x_0) \neq y_0$, i.e., Equation (3) is a necessary condition for $\mathcal{M}(x_0) = y_0$.

Proof of Theorem 1. Lemmas B.1 and B.2 are exactly the two directions (sufficient and necessary condition) of $\mathcal{M}_{\mathrm{MME}}$ predicting label $y_0$ at a point $x$. Therefore, if the condition (Equation (3)) holds for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$, the ensemble $\mathcal{M}_{\mathrm{MME}}$ is $r$-robust at point $x_0$, and vice versa.

For comparison, here we list the trivial robustness condition for a single model.

Fact B.1 (Robustness Condition for a Single Model). A model $F$ with $F(x_0) = y_0$ is $r$-robust at point $x_0$ if and only if for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$, $\min_{y_i \in [C]: y_i \neq y_0} f^{y_0/y_i}(x) \geq 0$. The fact is apparent given that the model predicts the class with the highest confidence.

B.3 GRADIENT AND CONFIDENCE MARGIN-BASED CONDITION

We can concretize the preceding general robustness conditions using the gradients and confidence margins of the base models, leveraging Taylor expansion.

Theorem 2 (Gradient and Confidence Margin Condition for WE Robustness). Given an input $x_0 \in \mathbb{R}^d$ with ground-truth label $y_0 \in [C]$, let $\mathcal{M}_{\mathrm{WE}}$ be a WE defined over base models $\{f_i\}_{i=1}^N$ with weights $\{w_i\}_{i=1}^N$, where $\mathcal{M}_{\mathrm{WE}}(x_0) = y_0$ and all base models $f_i$ are $\beta$-smooth.
• (Sufficient Condition) $\mathcal{M}_{\mathrm{WE}}$ is $r$-robust at point $x_0$ if for any $y_i \neq y_0$,
$$\Big\|\sum_{j=1}^N w_j \nabla_x f_j^{y_0/y_i}(x_0)\Big\|_2 \leq \frac{1}{r}\sum_{j=1}^N w_j f_j^{y_0/y_i}(x_0) - \beta r \sum_{j=1}^N w_j.$$
• (Necessary Condition) If $\mathcal{M}_{\mathrm{WE}}$ is $r$-robust at point $x_0$, then for any $y_i \neq y_0$,
$$\Big\|\sum_{j=1}^N w_j \nabla_x f_j^{y_0/y_i}(x_0)\Big\|_2 \leq \frac{1}{r}\sum_{j=1}^N w_j f_j^{y_0/y_i}(x_0) + \beta r \sum_{j=1}^N w_j.$$

Proof of Theorem 2. From Taylor expansion with the Lagrange remainder and the $\beta$-smoothness assumption on the base models, we have
$$\sum_{j=1}^N w_j f_j^{y_0/y_i}(x_0) - r\Big\|\sum_{j=1}^N w_j \nabla_x f_j^{y_0/y_i}(x_0)\Big\|_2 - \frac{1}{2}r^2\sum_{j=1}^N 2\beta w_j \leq \min_{x: \|x-x_0\|_2 \leq r} \sum_{j=1}^N w_j f_j^{y_0/y_i}(x) \leq \sum_{j=1}^N w_j f_j^{y_0/y_i}(x_0) - r\Big\|\sum_{j=1}^N w_j \nabla_x f_j^{y_0/y_i}(x_0)\Big\|_2 + \frac{1}{2}r^2\sum_{j=1}^N 2\beta w_j, \tag{23}$$
where the terms $-\frac{1}{2}r^2\sum_{j=1}^N 2\beta w_j$ and $+\frac{1}{2}r^2\sum_{j=1}^N 2\beta w_j$ bound the Lagrange remainder (each margin $f_j^{y_0/y_i}$ is $(2\beta)$-smooth as a difference of $\beta$-smooth functions). From Proposition 1, the sufficient and necessary condition of WE's $r$-robustness is $\sum_{j=1}^N w_j f_j^{y_0/y_i}(x) \geq 0$ for any $y_i \in [C]$ such that $y_i \neq y_0$, and any $x = x_0+\delta$ with $\|\delta\|_2 \leq r$. Plugging this into Equation (23), we get the theorem.

Theorem 3 (Gradient and Confidence Margin Condition for MME Robustness). Given an input $x_0 \in \mathbb{R}^d$ with ground-truth label $y_0 \in [C]$, let $\mathcal{M}_{\mathrm{MME}}$ be an MME defined over base models $\{f_1, f_2\}$, where $\mathcal{M}_{\mathrm{MME}}(x_0) = y_0$ and both $f_1$ and $f_2$ are $\beta$-smooth.
• (Sufficient Condition) If for any $y_1, y_2 \in [C]$ such that $y_1 \neq y_0$ and $y_2 \neq y_0$,
$$\big\|\nabla_x f_1^{y_0/y_1}(x_0) + \nabla_x f_2^{y_0/y_2}(x_0)\big\|_2 \leq \frac{1}{r}\big(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)\big) - 2\beta r,$$
then $\mathcal{M}_{\mathrm{MME}}$ is $r$-robust at point $x_0$.
• (Necessary Condition) Suppose for any $x \in \{x_0+\delta : \|\delta\|_2 \leq r\}$ and any $i \in \{1, 2\}$, either $F_i(x) = y_0$ or $F_i^{(2)}(x) = y_0$. If $\mathcal{M}_{\mathrm{MME}}$ is $r$-robust at point $x_0$, then for any $y_1, y_2 \in [C]$ such that $y_1 \neq y_0$ and $y_2 \neq y_0$,
$$\big\|\nabla_x f_1^{y_0/y_1}(x_0) + \nabla_x f_2^{y_0/y_2}(x_0)\big\|_2 \leq \frac{1}{r}\big(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)\big) + 2\beta r.$$

Proof of Theorem 3. We prove the sufficient condition and the necessary condition separately.
• (Sufficient Condition) From Lemma B.1, since there are only two base models, we can simplify the sufficient condition for $\mathcal{M}_{\mathrm{MME}}(x) = y_0$ as
$$\min_{y_1 \in [C]: y_1 \neq y_0} f_1^{y_0/y_1}(x) + \min_{y_2' \in [C]: y_2' \neq y_0} f_2^{y_0/y_2'}(x) \geq 0.$$
In other words, for any $y_1 \neq y_0$ and $y_2 \neq y_0$,
$$f_1^{y_0/y_1}(x) + f_2^{y_0/y_2}(x) \geq 0. \tag{24}$$
With Taylor expansion and the smoothness assumption, we have
$$\min_{x: \|x-x_0\|_2 \leq r} f_1^{y_0/y_1}(x) + f_2^{y_0/y_2}(x) \geq f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0) - r\big\|\nabla_x f_1^{y_0/y_1}(x_0) + \nabla_x f_2^{y_0/y_2}(x_0)\big\|_2 - \frac{1}{2}\cdot 4\beta r^2.$$
Plugging this into Equation (24) yields the sufficient condition. In the above equation, the term $-\frac{1}{2}\cdot 4\beta r^2$ bounds the Lagrange remainder; the factor $4\beta$ comes from the fact that $f_1^{y_0/y_1}(x) + f_2^{y_0/y_2}(x)$ is $(4\beta)$-smooth, being a sum of differences of $\beta$-smooth functions.

• (Necessary Condition)

From Lemma B.2, the necessary condition for $\mathcal{M}_{\mathrm{MME}}(x) = y_0$ similarly simplifies to: for any $y_1 \neq y_0$ and $y_2 \neq y_0$, $f_1^{y_0/y_1}(x) + f_2^{y_0/y_2}(x) \geq 0$. Again from Taylor expansion, we have
$$\min_{x: \|x-x_0\|_2 \leq r} f_1^{y_0/y_1}(x) + f_2^{y_0/y_2}(x) \leq f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0) - r\big\|\nabla_x f_1^{y_0/y_1}(x_0) + \nabla_x f_2^{y_0/y_2}(x_0)\big\|_2 + \frac{1}{2}\cdot 4\beta r^2.$$
Plugging this into Equation (24) yields the necessary condition. In the above equation, the term $+\frac{1}{2}\cdot 4\beta r^2$ bounds the Lagrange remainder, and the factor $4\beta$ appears for the same reason as before.

To compare the robustness of ensemble models and single models, we show the corresponding conditions for single-model robustness.
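The sufficient condition of Theorem 2 can also be read as an implied certifiable radius: the largest $r$ satisfying $g \leq m/r - \beta r W$ (with $g$ the joint gradient norm, $m$ the weighted margin, and $W = \sum_j w_j$) is the positive root of the quadratic $\beta W r^2 + g r - m = 0$. A small sketch (the helper name is ours, and this only certifies the radius implied by one threatened class $y_i$; the final radius is the minimum over all $y_i \neq y_0$):

```python
import math

def max_certified_radius(grad_norm, margin, beta, weight_sum=1.0):
    """Largest r with grad_norm <= margin / r - beta * r * weight_sum,
    i.e., the positive root of beta*W*r^2 + grad_norm*r - margin = 0.

    grad_norm : ||sum_j w_j grad f_j^{y0/yi}(x0)||_2
    margin    : sum_j w_j f_j^{y0/yi}(x0)  (must be positive to certify)
    beta      : smoothness constant of the base models
    weight_sum: W = sum_j w_j
    """
    if margin <= 0:
        return 0.0  # nothing can be certified without a positive margin
    if beta == 0:
        return margin / grad_norm  # linear case: condition is g <= m / r
    bW = beta * weight_sum
    return (-grad_norm + math.sqrt(grad_norm ** 2 + 4 * bW * margin)) / (2 * bW)
```

The returned radius shrinks as the joint gradient norm or the smoothness constant grows, matching the qualitative reading of Theorem 2: diversified (small-norm) joint gradients and large margins certify larger radii.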

Proposition B.1 (Gradient and Confidence Margin Conditions for Single-Model Robustness).

Given an input $x_0 \in \mathbb{R}^d$ with ground-truth label $y_0 \in [C]$, suppose the model satisfies $F(x_0) = y_0$ and is $\beta$-smooth.
• (Sufficient Condition) If for any $y_1 \in [C]$ such that $y_1 \neq y_0$, $\|\nabla_x f^{y_0/y_1}(x_0)\|_2 \leq \frac{1}{r} f^{y_0/y_1}(x_0) - \beta r$, then $F$ is $r$-robust at point $x_0$.
• (Necessary Condition) If $F$ is $r$-robust at point $x_0$, then for any $y_1 \in [C]$ such that $y_1 \neq y_0$, $\|\nabla_x f^{y_0/y_1}(x_0)\|_2 \leq \frac{1}{r} f^{y_0/y_1}(x_0) + \beta r$.

Proof of Proposition B.1. The proposition is apparent given the following inequality from Taylor expansion,
$$f^{y_0/y_1}(x_0) - r\|\nabla_x f^{y_0/y_1}(x_0)\|_2 - \beta r^2 \leq \min_{x: \|x-x_0\|_2 \leq r} f^{y_0/y_1}(x) \leq f^{y_0/y_1}(x_0) - r\|\nabla_x f^{y_0/y_1}(x_0)\|_2 + \beta r^2,$$
and the necessary and sufficient robustness condition in Fact B.1.

B.4 COMPARE CERTIFIED ROBUSTNESS OF ENSEMBLE MODEL AND SINGLE MODELS

Corollary 1 (Comparison of Ensemble and Single-Model Robustness). Given an input $x_0 \in \mathbb{R}^d$ with ground-truth label $y_0 \in [C]$, suppose we have two $\beta$-smooth base models $\{f_1, f_2\}$ which are both $r$-robust at point $x_0$. For any $\Delta \in [0, 1)$:
• (Weighted Ensemble) Define the Weighted Ensemble $\mathcal{M}_{\mathrm{WE}}$ with base models $\{f_1, f_2\}$ and suppose $\mathcal{M}_{\mathrm{WE}}(x_0) = y_0$. If for any label $y_i \neq y_0$, the base models' smoothness satisfies $\beta \leq \Delta \cdot \min\{f_1^{y_0/y_i}(x_0), f_2^{y_0/y_i}(x_0)\}/(c^2 r^2)$ and the gradient cosine similarity satisfies $\cos\langle\nabla_x f_1^{y_0/y_i}(x_0), \nabla_x f_2^{y_0/y_i}(x_0)\rangle \leq \cos\theta$, then $\mathcal{M}_{\mathrm{WE}}$ with weights $\{w_1, w_2\}$ is at least $R$-robust at point $x_0$ with
$$R = r \cdot \frac{1-\Delta}{1+\Delta}\big(1 - C_{\mathrm{WE}}(1-\cos\theta)\big)^{-1/2}, \quad \text{where} \quad C_{\mathrm{WE}} = \min_{y_i: y_i \neq y_0} \frac{2w_1 w_2 f_1^{y_0/y_i}(x_0) f_2^{y_0/y_i}(x_0)}{\big(w_1 f_1^{y_0/y_i}(x_0) + w_2 f_2^{y_0/y_i}(x_0)\big)^2}, \quad c = \max\Big\{\frac{1-\Delta}{1+\Delta}\big(1 - C_{\mathrm{WE}}(1-\cos\theta)\big)^{-1/2}, 1\Big\}.$$
• (Max-Margin Ensemble) Define the Max-Margin Ensemble $\mathcal{M}_{\mathrm{MME}}$ with base models $\{f_1, f_2\}$ and suppose $\mathcal{M}_{\mathrm{MME}}(x_0) = y_0$. If for any labels $y_1 \neq y_0$ and $y_2 \neq y_0$, the base models' smoothness satisfies $\beta \leq \Delta \cdot \min\{f_1^{y_0/y_1}(x_0), f_2^{y_0/y_2}(x_0)\}/(c^2 r^2)$ and the gradient cosine similarity satisfies $\cos\langle\nabla_x f_1^{y_0/y_1}(x_0), \nabla_x f_2^{y_0/y_2}(x_0)\rangle \leq \cos\theta$, then $\mathcal{M}_{\mathrm{MME}}$ is at least $R$-robust at point $x_0$ with
$$R = r \cdot \frac{1-\Delta}{1+\Delta}\big(1 - C_{\mathrm{MME}}(1-\cos\theta)\big)^{-1/2}, \quad \text{where} \quad C_{\mathrm{MME}} = \min_{y_1, y_2: y_1, y_2 \neq y_0} \frac{2 f_1^{y_0/y_1}(x_0) f_2^{y_0/y_2}(x_0)}{\big(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)\big)^2}, \quad c = \max\Big\{\frac{1-\Delta}{1+\Delta}\big(1 - C_{\mathrm{MME}}(1-\cos\theta)\big)^{-1/2}, 1\Big\}.$$

Proof of Corollary 1. We first prove the corollary for the Weighted Ensemble. Fix an arbitrary $y_i \neq y_0$ and, for brevity, write $f_k := f_k^{y_0/y_i}(x_0)$ and $\nabla f_k := \nabla_x f_k^{y_0/y_i}(x_0)$ for $k \in \{1, 2\}$. We have
$$\begin{aligned}
\|w_1 \nabla f_1 + w_2 \nabla f_2\|_2 &= \sqrt{w_1^2\|\nabla f_1\|_2^2 + w_2^2\|\nabla f_2\|_2^2 + 2w_1 w_2 \langle\nabla f_1, \nabla f_2\rangle} \\
&\leq \sqrt{w_1^2\|\nabla f_1\|_2^2 + w_2^2\|\nabla f_2\|_2^2 + 2w_1 w_2 \|\nabla f_1\|_2 \|\nabla f_2\|_2 \cos\theta} \\
&\overset{(i.)}{\leq} \sqrt{w_1^2\Big(\frac{f_1}{r} + \beta r\Big)^2 + w_2^2\Big(\frac{f_2}{r} + \beta r\Big)^2 + 2w_1 w_2\Big(\frac{f_1}{r} + \beta r\Big)\Big(\frac{f_2}{r} + \beta r\Big)\cos\theta} \\
&= \frac{1}{r}\sqrt{w_1^2(f_1 + \beta r^2)^2 + w_2^2(f_2 + \beta r^2)^2 + 2w_1 w_2 (f_1 + \beta r^2)(f_2 + \beta r^2)\cos\theta} \\
&\overset{(ii.)}{\leq} \frac{1}{r}\Big(1 + \frac{\Delta}{c^2}\Big)\sqrt{w_1^2 f_1^2 + w_2^2 f_2^2 + 2w_1 w_2 f_1 f_2 \cos\theta} \\
&= \frac{1}{r}\Big(1 + \frac{\Delta}{c^2}\Big)\sqrt{(w_1 f_1 + w_2 f_2)^2 - 2(1-\cos\theta) w_1 f_1 w_2 f_2} \\
&\overset{(iii.)}{\leq} \frac{1}{r}\Big(1 + \frac{\Delta}{c^2}\Big)\sqrt{1 - (1-\cos\theta)C_{\mathrm{WE}}}\,(w_1 f_1 + w_2 f_2),
\end{aligned}$$
where $(i.)$ follows from the necessary condition in Proposition B.1 (both base models are $r$-robust); $(ii.)$ uses the condition on $\beta$, which gives $\beta r^2 \leq \Delta f_k/c^2$; and $(iii.)$ bounds $2w_1 f_1 w_2 f_2 \geq C_{\mathrm{WE}}(w_1 f_1 + w_2 f_2)^2$ by the definition of $C_{\mathrm{WE}}$. Now we define
$$K := \frac{1-\Delta}{1+\Delta}\big(1 - C_{\mathrm{WE}}(1-\cos\theta)\big)^{-1/2}.$$
All we need to do is to prove that $\mathcal{M}_{\mathrm{WE}}$ is robust within radius $Kr$. To do so, following Equation (4), we upper bound $\|w_1 \nabla f_1 + w_2 \nabla f_2\|_2$ by $\frac{1}{Kr}(w_1 f_1 + w_2 f_2) - \beta K r (w_1 + w_2)$:
$$\begin{aligned}
\|w_1 \nabla f_1 + w_2 \nabla f_2\|_2 &\leq \frac{1}{r}\Big(1 + \frac{\Delta}{c^2}\Big)\sqrt{1 - (1-\cos\theta)C_{\mathrm{WE}}}\,(w_1 f_1 + w_2 f_2) \\
&\leq \frac{1}{r}(1 + \Delta)\sqrt{1 - (1-\cos\theta)C_{\mathrm{WE}}}\,(w_1 f_1 + w_2 f_2) \\
&= \frac{1}{r} \cdot \frac{1-\Delta}{\frac{1-\Delta}{1+\Delta}\big(1 - (1-\cos\theta)C_{\mathrm{WE}}\big)^{-1/2}}\,(w_1 f_1 + w_2 f_2) \\
&= \frac{1}{Kr}(1-\Delta)(w_1 f_1 + w_2 f_2) \\
&\leq \frac{1}{Kr}\Big(w_1 f_1 + w_2 f_2 - \Delta\min\{f_1, f_2\}(w_1 + w_2)\Big).
\end{aligned}$$
Notice that $\Delta\min\{f_1, f_2\} \geq \beta c^2 r^2$ from the condition on $\beta$, so
$$\|w_1 \nabla f_1 + w_2 \nabla f_2\|_2 \leq \frac{1}{Kr}\big(w_1 f_1 + w_2 f_2 - \beta c^2 r^2 (w_1 + w_2)\big) = \frac{1}{Kr}(w_1 f_1 + w_2 f_2) - \beta K r (w_1 + w_2)\cdot\frac{c^2}{K^2} \leq \frac{1}{Kr}(w_1 f_1 + w_2 f_2) - \beta K r (w_1 + w_2),$$
where the last step uses $c \geq K$. From Equation (4), the corollary for the Weighted Ensemble is proved.
Now we prove the corollary for the Max-Margin Ensemble. Similarly, for any $y_1, y_2$ such that $y_1 \neq y_0$ and $y_2 \neq y_0$, we have
$$\big\|\nabla_x f_1^{y_0/y_1}(x_0) + \nabla_x f_2^{y_0/y_2}(x_0)\big\|_2 \leq \frac{1}{r}\Big(1 + \frac{\Delta}{c^2}\Big)\sqrt{1 - (1-\cos\theta)C_{\mathrm{MME}}}\,\big(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)\big).$$

Now we define

$$K' := \frac{1-\Delta}{1+\Delta}\big(1 - C_{\mathrm{MME}}(1-\cos\theta)\big)^{-1/2}.$$
Again, from the condition on $\beta$ we have $\Delta\min\{f_1^{y_0/y_1}(x_0), f_2^{y_0/y_2}(x_0)\} \geq \beta c^2 r^2$, and
$$\big\|\nabla_x f_1^{y_0/y_1}(x_0) + \nabla_x f_2^{y_0/y_2}(x_0)\big\|_2 \leq \frac{1}{K'r}\big(f_1^{y_0/y_1}(x_0) + f_2^{y_0/y_2}(x_0)\big) - 2\beta K' r.$$
From Equation (6), the ensemble is $(K'r)$-robust at point $x_0$, i.e., the corollary for the Max-Margin Ensemble is proved.

DISCUSSION

Optimizing Weighted Ensemble. As we can observe from Corollary 1, we can adjust the weights $\{w_1, w_2\}$ of the Weighted Ensemble to change $C_{\mathrm{WE}}$ and hence the certified robust radius (Equation (8)). This raises the question of which weights achieve the highest certified robust radius. Since a larger $C_{\mathrm{WE}}$ results in a larger radius, we need to choose
$$(w_1^{\mathrm{OPT}}, w_2^{\mathrm{OPT}}) = \arg\max_{w_1, w_2} \min_{y_i: y_i \neq y_0} \frac{2w_1 w_2 f_1^{y_0/y_i}(x_0) f_2^{y_0/y_i}(x_0)}{\big(w_1 f_1^{y_0/y_i}(x_0) + w_2 f_2^{y_0/y_i}(x_0)\big)^2}.$$

Under review as a conference paper at ICLR 2021

Since this quantity is scale-invariant, we can fix $w_1$ and optimize over $w_2$ to get the optimal weights. In particular, if there are only two classes, we have the closed-form solution
$$(w_1^{\mathrm{OPT}}, w_2^{\mathrm{OPT}}) = \arg\max_{w_1, w_2} \frac{2w_1 w_2 f_1^{y_0/y_1}(x_0) f_2^{y_0/y_1}(x_0)}{\big(w_1 f_1^{y_0/y_1}(x_0) + w_2 f_2^{y_0/y_1}(x_0)\big)^2} = \big\{\big(k \cdot f_2^{y_0/y_1}(x_0),\, k \cdot f_1^{y_0/y_1}(x_0)\big) : k \in \mathbb{R}^+\big\},$$
and the corresponding $C_{\mathrm{WE}}$ achieves its maximum $1/2$. For the special case of the average weighted ensemble, we get the corresponding certified robust radius by setting $w_1 = w_2$ and plugging the resulting
$$C_{\mathrm{WE}} = \min_{y_i: y_i \neq y_0} \frac{2 f_1^{y_0/y_i}(x_0) f_2^{y_0/y_i}(x_0)}{\big(f_1^{y_0/y_i}(x_0) + f_2^{y_0/y_i}(x_0)\big)^2} \in (0, 1/2]$$
into Equation (8).

Comparison between ensemble and single-model robustness. We expand the discussion in Section 2.1. The similar forms of $R$ in the corollary allow us to discuss the Weighted Ensemble and Max-Margin Ensemble together. Specifically, we let $C$ be either $C_{\mathrm{WE}}$ or $C_{\mathrm{MME}}$; then
$$R = r \cdot \frac{1-\Delta}{1+\Delta}\big(1 - C(1-\cos\theta)\big)^{-1/2}.$$
When $R > r$, both ensembles have higher certified robustness than the base models, so we solve this condition for $\cos\theta$:
$$R > r \iff \Big(\frac{1-\Delta}{1+\Delta}\Big)^2 > 1 - C(1-\cos\theta) \iff \cos\theta < 1 - \frac{4\Delta}{C(1+\Delta)^2}.$$
Notice that $C \in (0, 1/2]$. From this condition, we can easily observe that when the gradient cosine similarity is smaller, it is more likely that the ensemble has higher certified robustness than the base models.
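The radius gain of Corollary 1 and the colinearity threshold above are easy to evaluate numerically. A small sketch (function names are ours; $C$ stands for either $C_{\mathrm{WE}}$ or $C_{\mathrm{MME}}$):

```python
def ensemble_radius(r, delta, C, cos_theta):
    """Certified radius of the ensemble from Corollary 1:
    R = r * (1-delta)/(1+delta) * (1 - C*(1-cos_theta))**(-1/2)."""
    return r * (1 - delta) / (1 + delta) * (1 - C * (1 - cos_theta)) ** -0.5

def cos_threshold(delta, C):
    """Gradient cosine similarity below which R > r, i.e., the ensemble
    certifiably beats its base models: 1 - 4*delta / (C * (1+delta)**2)."""
    return 1 - 4 * delta / (C * (1 + delta) ** 2)
```

For example, with perfectly smooth models ($\Delta = 0$) and orthogonal margin gradients ($\cos\theta = 0$, $C = 1/2$), the ensemble radius improves by a factor of $\sqrt{2}$; at exactly the threshold cosine similarity, $R = r$ and the gain vanishes.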
When the models are smooth enough, the condition on $\beta$ lets us take $\Delta$ close to zero. As a result, $1 - \frac{4\Delta}{C(1+\Delta)^2}$ is close to $1$. Thus, unless the gradients of the base models are (or are close to) colinear, it always holds that the ensemble (either WE or MME) has higher certified robustness than the base models.

Remark. The theorem appears somewhat counter-intuitive: picking the best smoothed model in terms of certified robustness cannot give strong certified robustness for the ensemble. As long as the base models have different certified robust radii (i.e., the $r_i$'s are different), the certified robust radius $r$ of the ensemble is strictly inferior to that of the best base model (i.e., $\max_i r_i$). Furthermore, if there exists a base model with a wrong prediction (i.e., $r_i \leq 0$), the certified robust radius $r$ is strictly smaller than half of the best base model's.

Proof of Theorem C.1. Without loss of generality, we assume $r_1 > r_2 > \cdots > r_N$. Let the perturbation added to $x_0$ have $L_2$ length $\delta$. When $\delta \leq r_N$, since picking any model always gives the right prediction, the ensemble is robust. When $r_N < \delta \leq \frac{r_1+r_N}{2}$, the highest robust radius with a wrong prediction is $\delta - r_N$, and we can still guarantee that model $f_1$ has robust radius at least $r_1 - \delta$ from the smoothness of the function $x \mapsto g^{F_1}(x)_{G^{F_1}(x_0)}$ (Salman et al., 2019). Since $r_1 - \delta \geq \frac{r_1-r_N}{2} \geq \delta - r_N$, the ensemble will agree with $f_1$ or another base model with a correct prediction and still give the right prediction. When $\delta > \frac{r_1+r_N}{2}$, suppose $f_N$ is a linear model that predicts only two labels (which achieves the tight robust-radius bound according to Cohen et al. (2019)); then $f_N$ can have robust radius $\delta - r_N$ for the wrong prediction. At the same time, for any other model $f_i$ which is linear and predicts correctly, the robust radius is at most $r_i - \delta$. Since $r_i - \delta \leq r_1 - \delta < \frac{r_1-r_N}{2} < \delta - r_N$, the ensemble can give a wrong prediction.
In summary, as shown above, the certified robust radius can be at most $r$. For any radius $\delta > r$, there exist base models which lead the ensemble $H^{\mathcal{M}}(x_0 + \delta e)$ to predict a label other than $y_0$.

C.2 COMPARISON OF TWO STRATEGIES

In this subsection, we compare the two ensemble strategies when the ensembles are constructed from two base models.

Corollary C.1 (Smoothing Strategy Comparison). Given $\mathcal{M}_{\mathrm{MME}}$, a Max-Margin Ensemble constructed from base models $\{f_a, f_b\}$, let $\epsilon \sim \mathcal{N}(0, \sigma^2 I_d)$. Let $G^{\mathcal{M}_{\mathrm{MME}}}$ be the EBS ensemble and $H^{\mathcal{M}_{\mathrm{MME}}}$ be the EAS ensemble. Suppose at point $x_0$ with ground-truth label $y_0$, $G^{F_a}(x_0) = G^{F_b}(x_0) = y_0$, and write $p_a := g^{F_a}(x_0)_{y_0} > 0.5$ and $p_b := g^{F_b}(x_0)_{y_0} > 0.5$. Let $\delta$ be their probability difference for class $y_0$, i.e., $\delta := |g^{F_a}(x_0)_{y_0} - g^{F_b}(x_0)_{y_0}|$, and let $p_{\min}$ be the smaller probability for class $y_0$ between them, i.e., $p_{\min} := \min\{g^{F_a}(x_0)_{y_0}, g^{F_b}(x_0)_{y_0}\}$. We denote by $p$ the probability of choosing the correct class when the base models disagree with each other, and by $p_{ab}$ the probability of both base models agreeing on the correct class:
$$p := \Pr\big(\mathcal{M}_{\mathrm{MME}}(x_0+\epsilon) = y_0 \mid F_a(x_0+\epsilon) \neq F_b(x_0+\epsilon) \text{ and } (F_a(x_0+\epsilon) = y_0 \text{ or } F_b(x_0+\epsilon) = y_0)\big),$$
$$p_{ab} := \Pr\big(F_a(x_0+\epsilon) = F_b(x_0+\epsilon) = y_0\big).$$
We have:
1. If $p > 1/2 + \big(2 + 4(p_{\min} - p_{ab})/\delta\big)^{-1}$, then $r_G > r_H$.
2. If $p \leq 1/2$, then $r_H \geq r_G$.
Here, $r_G$ is the certified robust radius of $G^{\mathcal{M}_{\mathrm{MME}}}$ computed from Equation (29), and $r_H$ is the certified robust radius of $H^{\mathcal{M}_{\mathrm{MME}}}$ computed from Equation (30).

Remark. Since $p$ is the probability that the ensemble chooses the correct prediction between the two base-model predictions, with the Max-Margin Ensemble we expect $p > 1/2$ with a non-trivial margin. The quantities $p_{\min} - p_{ab}$ and $\delta$ both measure the base models' diversity in terms of the predicted label distribution, and generally they should be close. As a result, $1/2 + (2 + 4(p_{\min} - p_{ab})/\delta)^{-1} \approx 1/2 + 1/6 = 2/3$, and case (1) should be much more likely to happen than case (2). Therefore, EBS usually yields a higher robustness guarantee. We remark that a similar tendency also holds with multiple base models.

Proof of Corollary C.1. The two radii are
$$r_G := \frac{\sigma}{2} \cdot 2\Phi^{-1}\big(\Pr(\mathcal{M}_{\mathrm{MME}}(x_0+\epsilon) = y_0)\big), \qquad r_H := \frac{\sigma}{2}\big(\Phi^{-1}(p_a) + \Phi^{-1}(p_b)\big).$$
Noticing that $\Pr(\mathcal{M}_{\mathrm{MME}}(x_0+\epsilon) = y_0) = p_{ab} + p(p_a + p_b - 2p_{ab})$, we can rewrite $r_G$ as
$$r_G = \frac{\sigma}{2} \cdot 2\Phi^{-1}\big(p_{ab} + p(p_a + p_b - 2p_{ab})\big).$$
1. When $p > 1/2 + (2 + 4(p_{\min} - p_{ab})/\delta)^{-1}$: assume without loss of generality $p_a \geq p_b$, so $p_{\min} = p_b$ and $\delta = p_a - p_b$. Since
$$p > \frac{1}{2} + \frac{1}{2 + \frac{4(p_{\min} - p_{ab})}{\delta}} = \frac{1}{2} + \frac{\delta}{2\delta + 4(p_b - p_{ab})} = \frac{p_a + p_b + \delta - 2p_{ab}}{2(p_a + p_b - 2p_{ab})} = \frac{p_a - p_{ab}}{p_a + p_b - 2p_{ab}},$$
we have $p_{ab} + p(p_a + p_b - 2p_{ab}) > p_a$. Therefore, $r_G > \sigma\Phi^{-1}(p_a)$. Whereas $r_H \leq \frac{\sigma}{2} \cdot 2\Phi^{-1}(p_a) = \sigma\Phi^{-1}(p_a)$. So $r_G > r_H$.
2. When $p \leq 1/2$: $p_{ab} + p(p_a + p_b - 2p_{ab}) \leq p_{ab} + \frac{1}{2}(p_a + p_b - 2p_{ab}) = (p_a + p_b)/2$. Therefore, $r_G \leq \sigma\Phi^{-1}((p_a + p_b)/2)$. Notice that $\Phi^{-1}$ is convex on $[1/2, 1)$, so $\Phi^{-1}(p_a) + \Phi^{-1}(p_b) \geq 2\Phi^{-1}((p_a + p_b)/2)$, i.e., $r_H \geq r_G$.
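The two radii compared in Corollary C.1 can be evaluated numerically. A small sketch (function names are ours; `p_a` and `p_b` are the base models' smoothed top-class probabilities, and `sigma` defaults to 1):

```python
from statistics import NormalDist

def ebs_radius(p_a, p_b, p_ab, p, sigma=1.0):
    """EBS ('ensemble before smoothing') radius: smooth the MME ensemble,
    whose correct-prediction probability is p_ab + p*(p_a + p_b - 2*p_ab)."""
    return sigma * NormalDist().inv_cdf(p_ab + p * (p_a + p_b - 2 * p_ab))

def eas_radius(p_a, p_b, sigma=1.0):
    """EAS ('ensemble after smoothing') radius:
    r_H = sigma/2 * (Phi^{-1}(p_a) + Phi^{-1}(p_b))."""
    nd = NormalDist()
    return sigma / 2 * (nd.inv_cdf(p_a) + nd.inv_cdf(p_b))
```

For instance, with $p_a = 0.9$, $p_b = 0.8$, $p_{ab} = 0.75$, the threshold of case (1) is $1/2 + (2 + 4 \cdot 0.05/0.1)^{-1} = 0.75$, so a disagreement-resolution probability such as $p = 0.95$ makes EBS strictly better, while $p \leq 1/2$ favors EAS, matching the corollary.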

D PROOFS AND EXPERIMENTS OF ROBUSTNESS FOR SMOOTHED ML ENSEMBLE

In this appendix, we provide the detailed proofs and discussions for the results in Sections 3.1 and 3.2. Moreover, to provide more intuitive understanding, we show both numerical and realistic experiments for the theorems.

D.1 CERTIFIED ROBUSTNESS VIA RANDOMIZED SMOOTHING

First, using the notion of $(\epsilon, p)$-Statistical Robust (Definition 5), we prove the certified robustness of the single model and the ensembles under the i.i.d. assumption on the confidence scores. As noted in Section 3.1 and Appendix B.1, when $\epsilon$ follows some distribution such as $\mathcal{N}(0, \sigma^2 I_d)$, we can translate the statistical robustness guarantee into an $r$-robustness guarantee for the smoothed classifier. The following lemma is frequently used in our proofs:

Lemma D.1. Suppose the random variable $X$ satisfies $\mathbb{E}X > 0$, $\mathrm{Var}(X) < \infty$, and for any $x \in \mathbb{R}^+$, $\Pr(X \geq \mathbb{E}X + x) = \Pr(X \leq \mathbb{E}X - x)$. Then
$$\Pr(X \leq 0) \leq \frac{\mathrm{Var}(X)}{2(\mathbb{E}X)^2}.$$

Proof of Lemma D.1. Apply Chebyshev's inequality to the random variable $X$ and notice that $X$ is symmetric; the lemma follows directly.

D.1.1 CERTIFIED ROBUSTNESS FOR SINGLE MODEL

As a starting point, we first show a direct proposition stating the certified robustness guarantee of a single model.

Definition D.1 ($(\epsilon, \lambda, p)$-Single Confident). Given a classification model $F$, if at point $x_0$ with ground-truth label $y_0$ and random variable $\epsilon$ we have
$$\Pr\Big(\max_{y_j \in [C]: y_j \neq y_0} f(x_0+\epsilon)_{y_j} \leq \lambda\big(1 - f(x_0+\epsilon)_{y_0}\big)\Big) = 1 - p,$$
we call $F$ $(\epsilon, \lambda, p)$-single confident at point $x_0$.

Proposition D.1 (Certified Robustness for a Single Model). Let $\epsilon$ be a random variable, and let $F$ be a classification model which is $(\epsilon, \lambda_3, p)$-single confident. Let $x_0 \in \mathbb{R}^d$ be the input with ground truth $y_0 \in [C]$. Suppose $f(x_0+\epsilon)_{y_0}$ follows a symmetric distribution with mean $\mu$ and variance $s^2$, where $\mu > (1 + \lambda_3^{-1})^{-1}$. We have
$$\Pr\big(F(x_0+\epsilon) = y_0\big) \geq 1 - p - \frac{s^2}{2\big(\mu - (1 + \lambda_3^{-1})^{-1}\big)^2}.$$

Proof of Proposition D.1. We consider the distribution of the quantity $Y := f(x_0+\epsilon)_{y_0} - \lambda_3(1 - f(x_0+\epsilon)_{y_0})$. Since the model $F$ is $(\epsilon, \lambda_3, p)$-single confident, with probability $1-p$,
$$Y \leq f(x_0+\epsilon)_{y_0} - \max_{y_j \in [C]: y_j \neq y_0} f(x_0+\epsilon)_{y_j}.$$
We note that since $\mathbb{E}Y = (1+\lambda_3)\mu - \lambda_3$ and $\mathrm{Var}(Y) = (1+\lambda_3)^2 s^2$, from Lemma D.1,
$$\Pr(Y \leq 0) \leq \frac{s^2}{2\big(\mu - (1 + \lambda_3^{-1})^{-1}\big)^2}.$$
Thus,
$$\Pr\big(F(x_0+\epsilon) = y_0\big) = 1 - \Pr\big(F(x_0+\epsilon) \neq y_0\big) = 1 - \Pr\Big(f(x_0+\epsilon)_{y_0} - \max_{y_j \in [C]: y_j \neq y_0} f(x_0+\epsilon)_{y_j} < 0\Big) \geq 1 - p - \Pr(Y \leq 0) \geq 1 - p - \frac{s^2}{2\big(\mu - (1 + \lambda_3^{-1})^{-1}\big)^2}.$$

D.1.2 CERTIFIED ROBUSTNESS FOR ENSEMBLES

Now we are ready to prove the certified robustness of the Weighted Ensemble and the Max-Margin Ensemble (Theorems 4 and 5). In the following, we first define the statistical margins for both WE and MME and point out their connection to the notion of $(\epsilon, p)$-Statistical Robust. Then, we reason about the expectation, variance, and tail bounds of the statistical margins. Finally, we derive the certified robustness from the statistical margins.

Definition D.2 ($\tilde{X}_1$; Statistical Margin for WE $\mathcal{M}_{\mathrm{WE}}$). Let $\mathcal{M}_{\mathrm{WE}}$ be a Weighted Ensemble defined over base models $\{f_i\}_{i=1}^N$ with weights $\{w_i\}_{i=1}^N$. Suppose $\mathcal{M}_{\mathrm{WE}}$ is $(\epsilon, \lambda_1, p)$-WE-confident. We define the random variable $\tilde{X}_1$, which depends on the random variable $\epsilon$:
$$\tilde{X}_1(\epsilon) := (1+\lambda_1)\sum_{j=1}^N w_j f_j(x_0+\epsilon)_{y_0} - \lambda_1\|w\|_1.$$

Definition D.3 ($\tilde{X}_2$; Statistical Margin for MME $\mathcal{M}_{\mathrm{MME}}$). Let $\mathcal{M}_{\mathrm{MME}}$ be a Max-Margin Ensemble defined over base models $\{f_i\}_{i=1}^N$. Suppose $\mathcal{M}_{\mathrm{MME}}$ is $(\epsilon, \lambda_2, p)$-MME-confident. We define the random variable $\tilde{X}_2$, which depends on the random variable $\epsilon$:
$$\tilde{X}_2(\epsilon) := (1+\lambda_2)\Big(\max_{i \in [N]} f_i(x_0+\epsilon)_{y_0} + \min_{i \in [N]} f_i(x_0+\epsilon)_{y_0}\Big) - 2\lambda_2.$$

We have the following observation:

Lemma D.2. For the Weighted Ensemble, $\Pr(\mathcal{M}_{\mathrm{WE}}(x_0+\epsilon) = y_0) \geq 1 - p - \Pr(\tilde{X}_1(\epsilon) < 0)$. For the Max-Margin Ensemble, $\Pr(\mathcal{M}_{\mathrm{MME}}(x_0+\epsilon) = y_0) \geq 1 - p - \Pr(\tilde{X}_2(\epsilon) < 0)$.

Proof of Lemma D.2. (1) For the Weighted Ensemble, we define the random variable
$$X_1(\epsilon) := \min_{y_i \in [C]: y_i \neq y_0} \sum_{j=1}^N w_j f_j^{y_0/y_i}(x_0+\epsilon).$$
Since $\mathcal{M}_{\mathrm{WE}}$ is $(\epsilon, \lambda_1, p)$-WE-confident, from Definition 6, with probability $1-p$ we have
$$X_1(\epsilon) \geq \sum_{j=1}^N w_j\big(f_j(x_0+\epsilon)_{y_0} - \lambda_1(1 - f_j(x_0+\epsilon)_{y_0})\big) = (1+\lambda_1)\sum_{j=1}^N w_j f_j(x_0+\epsilon)_{y_0} - \lambda_1\|w\|_1 = \tilde{X}_1(\epsilon).$$
Therefore, $\Pr(\mathcal{M}_{\mathrm{WE}}(x_0+\epsilon) = y_0) = \Pr(X_1(\epsilon) \geq 0) \geq 1 - p - \Pr(\tilde{X}_1(\epsilon) < 0)$.
(2) For the Max-Margin Ensemble, we define the random variable
$$X_2(\epsilon) := \max_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x_0+\epsilon) + \min_{i \in [N]} \min_{y_i \in [C]: y_i \neq y_0} f_i^{y_0/y_i}(x_0+\epsilon).$$
Similarly, since $\mathcal{M}_{\mathrm{MME}}$ is $(\epsilon, \lambda_2, p)$-MME-confident, from Definition 7, with probability $1-p$ we have
$$X_2(\epsilon) \geq \max_{i \in [N]}\big(f_i(x_0+\epsilon)_{y_0} - \lambda_2(1 - f_i(x_0+\epsilon)_{y_0})\big) + \min_{i \in [N]}\big(f_i(x_0+\epsilon)_{y_0} - \lambda_2(1 - f_i(x_0+\epsilon)_{y_0})\big) = (1+\lambda_2)\Big(\max_{i \in [N]} f_i(x_0+\epsilon)_{y_0} + \min_{i \in [N]} f_i(x_0+\epsilon)_{y_0}\Big) - 2\lambda_2 = \tilde{X}_2(\epsilon).$$
Moreover, from Lemma B.1, we know $\Pr(\mathcal{M}(x_0+\epsilon) = y_0) \geq \Pr(X_2(\epsilon) \geq 0) \geq 1 - p - \Pr(\tilde{X}_2(\epsilon) < 0)$.

As a result, to quantify the statistical robustness of the two types of ensembles, we can analyze the distributions of the statistical margins $\tilde{X}_1$ and $\tilde{X}_2$.

Lemma D.3 (Expectation and Variance of $\tilde{X}_1$ and $\tilde{X}_2$). Let $\tilde{X}_1$ and $\tilde{X}_2$ be defined as in Definitions D.2 and D.3, respectively. Assume $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are i.i.d. and follow a symmetric distribution with mean $\mu$ and variance $s^2$. Define $s_f^2 = \mathrm{Var}(\min_{i \in [N]} f_i(x_0+\epsilon)_{y_0})$. We have
$$\mathbb{E}\tilde{X}_1(\epsilon) = (1+\lambda_1)\|w\|_1\mu - \lambda_1\|w\|_1, \qquad \mathrm{Var}\,\tilde{X}_1(\epsilon) = (1+\lambda_1)^2 s^2\|w\|_2^2,$$
$$\mathbb{E}\tilde{X}_2(\epsilon) = 2(1+\lambda_2)\mu - 2\lambda_2, \qquad \mathrm{Var}\,\tilde{X}_2(\epsilon) \leq 4(1+\lambda_2)^2 s_f^2.$$

Proof of Lemma D.3.
$$\mathbb{E}\tilde{X}_1(\epsilon) = (1+\lambda_1)\sum_{j=1}^N w_j\,\mathbb{E}f_j(x_0+\epsilon)_{y_0} - \lambda_1\|w\|_1 = (1+\lambda_1)\|w\|_1\mu - \lambda_1\|w\|_1;$$
$$\mathrm{Var}\,\tilde{X}_1(\epsilon) = (1+\lambda_1)^2\sum_{j=1}^N w_j^2\,\mathrm{Var}\big(f_j(x_0+\epsilon)_{y_0}\big) = (1+\lambda_1)^2 s^2\|w\|_2^2.$$
According to the symmetric distribution property of $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$, we have
$$\mathbb{E}\tilde{X}_2(\epsilon) = \mathbb{E}\Big[(1+\lambda_2)\Big(\max_{i \in [N]} f_i(x_0+\epsilon)_{y_0} + \min_{i \in [N]} f_i(x_0+\epsilon)_{y_0}\Big)\Big] - 2\lambda_2 = 2(1+\lambda_2)\mu - 2\lambda_2.$$
Also due to the symmetry, we have $\mathrm{Var}\big(\min_{i \in [N]} f_i(x_0+\epsilon)_{y_0}\big) = \mathrm{Var}\big(\max_{i \in [N]} f_i(x_0+\epsilon)_{y_0}\big) = s_f^2$. As a result, $\mathrm{Var}\,\tilde{X}_2(\epsilon) \leq (1+\lambda_2)^2 \cdot 4s_f^2$.

From Lemma D.3, together with Lemma D.1, we are now ready to derive the statistical robustness lower bounds for WE and MME.

Theorem 4 (Certified Robustness for WE). Let $\epsilon$ be a random variable supported on $\mathbb{R}^d$. Let $\mathcal{M}_{\mathrm{WE}}$ be a Weighted Ensemble defined over $\{f_i\}_{i=1}^N$ with weights $\{w_i\}_{i=1}^N$, and suppose $\mathcal{M}_{\mathrm{WE}}$ is $(\epsilon, \lambda_1, p)$-WE confident. Let $x_0 \in \mathbb{R}^d$ be the input with ground-truth label $y_0 \in [C]$.
Assume $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$, the confidence scores across base models for label $y_0$, are i.i.d. and follow a symmetric distribution with mean $\mu$ and variance $s^2$, where $\mu > (1+\lambda_1^{-1})^{-1}$. We have
$$\Pr\big(\mathcal{M}_{\mathrm{WE}}(x_0+\epsilon) = y_0\big) \geq 1 - p - \frac{\|w\|_2^2}{\|w\|_1^2} \cdot \frac{s^2}{2\big(\mu - (1+\lambda_1^{-1})^{-1}\big)^2}. \tag{14}$$

Theorem 5 (Certified Robustness for MME). Let $\epsilon$ be a random variable. Let $\mathcal{M}_{\mathrm{MME}}$ be a Max-Margin Ensemble defined over $\{f_i\}_{i=1}^N$, and suppose $\mathcal{M}_{\mathrm{MME}}$ is $(\epsilon, \lambda_2, p)$-MME confident. Let $x_0 \in \mathbb{R}^d$ be the input with ground-truth label $y_0 \in [C]$. Assume $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$, the confidence scores across base models for label $y_0$, are i.i.d. and follow a symmetric distribution with mean $\mu$, where $\mu > (1+\lambda_2^{-1})^{-1}$. Define $s_f^2 = \mathrm{Var}(\min_{i \in [N]} f_i(x_0+\epsilon)_{y_0})$. We have
$$\Pr\big(\mathcal{M}_{\mathrm{MME}}(x_0+\epsilon) = y_0\big) \geq 1 - p - \frac{s_f^2}{2\big(\mu - (1+\lambda_2^{-1})^{-1}\big)^2}. \tag{15}$$

Proof of Theorems 4 and 5. Combining Lemmas D.1 to D.3, we get the theorems.

Remark. Theorems 4 and 5 provide statistical robustness lower bounds for both types of ensembles, which are shown to be translatable into certified robustness. For the Weighted Ensemble, noticing that $\tilde{X}_1$ is a weighted sum of several independent variables, we can further apply McDiarmid's inequality to get another bound,
$$\Pr\big(\mathcal{M}_{\mathrm{WE}}(x_0+\epsilon) = y_0\big) \geq 1 - p - \exp\Big(-2\,\frac{\|w\|_1^2}{\|w\|_2^2}\big(\mu - (1+\lambda_1^{-1})^{-1}\big)^2\Big),$$
which is tighter than Equation (14) when $\|w\|_1^2/\|w\|_2^2$ is large. For the average weighted ensemble, $\|w\|_1^2/\|w\|_2^2 = N$; thus, when $N$ is large, this bound is tighter. Both theorems are applicable under the i.i.d. assumption on the confidence scores. The other assumption, $\mu > \max\{(1+\lambda_1^{-1})^{-1}, (1+\lambda_2^{-1})^{-1}\}$, ensures that both ensembles have a higher probability of predicting the true class than the other classes, i.e., the ensembles have non-trivial clean accuracy.
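The bounds of Theorems 4 and 5 (and the McDiarmid variant from the remark) can be compared numerically for given parameters. A small sketch (function names are ours):

```python
import math

def we_bound(p, mu, s, lam1, w, use_mcdiarmid=False):
    """Lower bound on Pr(M_WE(x0+eps)=y0) from Theorem 4; the McDiarmid
    variant from the remark is tighter when ||w||_1^2 / ||w||_2^2 is large."""
    w1 = sum(abs(v) for v in w)       # ||w||_1
    w2sq = sum(v * v for v in w)      # ||w||_2^2
    gap = mu - 1 / (1 + 1 / lam1)     # mu - (1 + lam1^{-1})^{-1}, must be > 0
    if use_mcdiarmid:
        return 1 - p - math.exp(-2 * (w1 ** 2 / w2sq) * gap ** 2)
    return 1 - p - (w2sq / w1 ** 2) * s ** 2 / (2 * gap ** 2)

def mme_bound(p, mu, s_f, lam2):
    """Lower bound on Pr(M_MME(x0+eps)=y0) from Theorem 5;
    s_f is the standard deviation of the minimum confidence score."""
    gap = mu - 1 / (1 + 1 / lam2)
    return 1 - p - s_f ** 2 / (2 * gap ** 2)
```

For average weights over $N$ base models, $\|w\|_2^2/\|w\|_1^2 = 1/N$, so the WE bound's subtracted term shrinks with $N$ while the MME bound depends on $s_f$ instead, which is exactly the trade-off exploited by Corollary 2.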

D.2 COMPARISON OF CERTIFIED ROBUSTNESS

We first show and prove an important lemma. Then, based on the lemma and Theorems 4 and 5, we derive the comparison corollary.

Lemma D.4. For $\mu, \lambda_1, \lambda_2, C > 0$, when $\max\{\lambda_1/(1+\lambda_1),\ \lambda_2/(1+\lambda_2)\} < \mu \le 1$ and $C < 1$, we have
$$\frac{\mu-(\lambda_2^{-1}+1)^{-1}}{\mu-(\lambda_1^{-1}+1)^{-1}} < C \iff \frac{\lambda_1}{\lambda_2} < \lambda_2^{-1}\Bigg(\Big(C^{-1}\big(\mu-\tfrac{\lambda_2}{1+\lambda_2}\big)+1-\mu\Big)^{-1}-1\Bigg).$$

Proof of Lemma D.4.
$$\frac{\mu-(\lambda_2^{-1}+1)^{-1}}{\mu-(\lambda_1^{-1}+1)^{-1}} < C \iff \frac{1}{\lambda_2^{-1}+1}-\frac{C}{\lambda_1^{-1}+1} > \mu(1-C) \iff \frac{\lambda_1/\lambda_2}{\lambda_2^{-1}+\lambda_1/\lambda_2} < \frac{C^{-1}}{\lambda_2^{-1}+1}-\mu(C^{-1}-1)$$
$$\iff \frac{\lambda_1}{\lambda_2}\Big(1-\mu+C^{-1}\big(\mu-\tfrac{1}{\lambda_2^{-1}+1}\big)\Big) < \lambda_2^{-1}\Big(C^{-1}\big(\tfrac{1}{\lambda_2^{-1}+1}-\mu\big)+\mu\Big) \iff \frac{\lambda_1}{\lambda_2} < \frac{\lambda_2^{-1}\Big(C^{-1}\big(\tfrac{1}{\lambda_2^{-1}+1}-\mu\big)+\mu\Big)}{C^{-1}\big(\mu-\tfrac{1}{\lambda_2^{-1}+1}\big)+1-\mu}$$
$$\iff \frac{\lambda_1}{\lambda_2} < \lambda_2^{-1}\Bigg(\Big(C^{-1}\big(\mu-\tfrac{\lambda_2}{1+\lambda_2}\big)+1-\mu\Big)^{-1}-1\Bigg),$$
where the last step uses $\frac{\lambda_2}{1+\lambda_2} = \frac{1}{\lambda_2^{-1}+1}$ and the observation that, writing $D = C^{-1}\big(\mu-\tfrac{1}{\lambda_2^{-1}+1}\big)+1-\mu$, the numerator equals $\lambda_2^{-1}(1-D)$.
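The equivalence in Lemma D.4 can be checked numerically. The values below (µ = 0.7, λ2 = 0.9, C = 0.5, and the two probe values of λ1) are hypothetical; they merely satisfy the lemma's preconditions.

```python
mu, lam2, C = 0.7, 0.9, 0.5               # hypothetical values meeting the preconditions
A = lambda lam: lam / (1 + lam)           # equals (1/lam + 1)^-1
assert max(A(0.30), A(lam2)) < mu <= 1 and C < 1

# RHS of the lemma: threshold on lam1/lam2
D = (mu - A(lam2)) / C + 1 - mu
T = (1 / lam2) * (1 / D - 1)

ratio = lambda lam1: (mu - A(lam2)) / (mu - A(lam1))   # LHS of the iff

assert 0.30 / lam2 < T < 0.35 / lam2      # T separates the two probe values
assert ratio(0.30) < C                    # lam1/lam2 below T: LHS holds
assert ratio(0.35) >= C                   # lam1/lam2 above T: LHS fails
print(T)
```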

Now we can show and prove the comparison corollary.

Corollary 2 (Comparison of Certified Robustness). Let $\epsilon$ be a random variable supported on $\mathbb{R}^d$. Over base models $\{f_i\}_{i=1}^N$, let $\mathcal{M}_{\mathrm{MME}}$ be the Max-Margin Ensemble, and $\mathcal{M}_{\mathrm{WE}}$ the Weighted Ensemble with weights $\{w_i\}_{i=1}^N$. Let $x_0\in\mathbb{R}^d$ be the input with ground-truth label $y_0\in[C]$. Assume $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$, the confidence scores across base models for label $y_0$, are i.i.d. and follow a symmetric distribution with mean $\mu$ and variance $s^2$, where $\mu > \max\{(1+\lambda_1^{-1})^{-1}, (1+\lambda_2^{-1})^{-1}\}$. Define $s_f^2 = \mathrm{Var}\big(\min_{i\in[N]} f_i(x_0+\epsilon)_{y_0}\big)$ and assume $s_f < s$.

• When
$$\frac{\lambda_1}{\lambda_2} < \lambda_2^{-1}\Bigg(\Big(\tfrac{s}{s_f}\big(\mu-(1+\lambda_2^{-1})^{-1}\big)+1-\mu\Big)^{-1}-1\Bigg), \quad (16)$$
for any weights $\{w_i\}_{i=1}^N$, $\mathcal{M}_{\mathrm{WE}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{MME}}$.

• When
$$\frac{\lambda_1}{\lambda_2} > \lambda_2^{-1}\Bigg(\Big(\tfrac{s}{\sqrt{N}s_f}\big(\mu-(1+\lambda_2^{-1})^{-1}\big)+1-\mu\Big)^{-1}-1\Bigg), \quad (17)$$
for any weights $\{w_i\}_{i=1}^N$, $\mathcal{M}_{\mathrm{MME}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{WE}}$.

Here, the certified robustness is given by Theorems 4 and 5.

Proof of Corollary 2. (1) According to Lemma D.4 (with $C = s_f/s$), Equation (16) implies
$$\frac{\mu-(\lambda_2^{-1}+1)^{-1}}{\mu-(\lambda_1^{-1}+1)^{-1}} < \frac{s_f}{s} \implies \sqrt{\frac{\|w\|_2^2}{\|w\|_1^2}}\cdot\frac{\mu-(\lambda_2^{-1}+1)^{-1}}{\mu-(\lambda_1^{-1}+1)^{-1}} < \frac{s_f}{s} \implies \frac{\|w\|_2^2}{\|w\|_1^2}\cdot\frac{s^2}{\big(\mu-(1+\lambda_1^{-1})^{-1}\big)^2} < \frac{s_f^2}{\big(\mu-(1+\lambda_2^{-1})^{-1}\big)^2},$$
where the second step uses $\|w\|_2 \le \|w\|_1$. According to Theorems 4 and 5, the RHS of Equation (14) is larger than the RHS of Equation (15), i.e., $\mathcal{M}_{\mathrm{WE}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{MME}}$.

(2) According to Lemma D.4 (with $C = \sqrt{N}s_f/s$), Equation (17) implies
$$\frac{\mu-(\lambda_2^{-1}+1)^{-1}}{\mu-(\lambda_1^{-1}+1)^{-1}} > \frac{\sqrt{N}s_f}{s} \implies \sqrt{\frac{\|w\|_2^2}{\|w\|_1^2}}\cdot\frac{\mu-(\lambda_2^{-1}+1)^{-1}}{\mu-(\lambda_1^{-1}+1)^{-1}} > \frac{s_f}{s} \implies \frac{\|w\|_2^2}{\|w\|_1^2}\cdot\frac{s^2}{\big(\mu-(1+\lambda_1^{-1})^{-1}\big)^2} > \frac{s_f^2}{\big(\mu-(1+\lambda_2^{-1})^{-1}\big)^2},$$
where the second step uses $\|w\|_2^2 \ge \|w\|_1^2/N$. According to Theorems 4 and 5, the RHS of Equation (15) is larger than the RHS of Equation (14), i.e., $\mathcal{M}_{\mathrm{MME}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{WE}}$.

Remark.
As we can observe in the proof, there is a gap between Equation (16) and Equation (17): when $\lambda_1/\lambda_2$ lies between the RHS of Equation (16) and the RHS of Equation (17), it is undetermined which ensemble protocol has higher robustness. This uncertainty is caused by the adjustable weights $\{w_i\}_{i=1}^N$ of the Weighted Ensemble. If we only consider the average ensemble, the gap is closed:
$$\frac{\lambda_1}{\lambda_2} \;\gtrless\; \lambda_2^{-1}\Bigg(\Big(\tfrac{s}{\sqrt{N}s_f}\big(\mu-(1+\lambda_2^{-1})^{-1}\big)+1-\mu\Big)^{-1}-1\Bigg),$$
where "$>$" means $\mathcal{M}_{\mathrm{MME}}$ is more robust and "$<$" means $\mathcal{M}_{\mathrm{WE}}$ is more robust. In the corollary, we assume that $s_f < s$. Note that $s^2$ is the variance of a single variable while $s_f^2$ is the variance of the minimum of $N$ i.i.d. variables. For common symmetric distributions, as $N$ increases, $s_f$ shrinks at the order of $O(1/N^B)$ for some $B\in(0,2]$; thus, as long as $N$ is large, the assumption always holds. An exception occurs when the random variables follow an exponential distribution, for which $s_f$ does not shrink as $N$ increases. However, since these random variables are confidence scores in $[0,1]$, they cannot follow an exponential distribution.

D.3 A CONCRETE CASE: UNIFORM DISTRIBUTION

As shown by Saremi & Srivastava (2020) (Remark 2.1), when the input dimension $d$ is large, the Gaussian noise $\epsilon \sim \mathcal{N}(0,\sigma^2 I_d)$ is approximately distributed as $\mathrm{Unif}(\sigma\sqrt{d}\,\mathbb{S}^{d-1})$, i.e., $x_0+\epsilon$ is nearly uniformly distributed on the $(d-1)$-sphere of radius $\sigma\sqrt{d}$ centered at $x_0$. Motivated by this, we study the case where the confidence scores $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are also uniformly distributed. Under this additional assumption, we can make the certified robustness for the single model and both ensembles more concrete.
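The thin-shell behavior behind this approximation can be observed directly: for large $d$, the norm of a Gaussian sample concentrates tightly around $\sigma\sqrt{d}$. A minimal illustration (the dimension d = 2000 is an arbitrary choice, not a value from the paper):

```python
import math, random

random.seed(0)
d, sigma = 2000, 1.0                      # arbitrary illustrative dimension
ratios = []
for _ in range(5):
    eps = [random.gauss(0, sigma) for _ in range(d)]
    ratios.append(math.sqrt(sum(x * x for x in eps)) / (sigma * math.sqrt(d)))
print(ratios)                             # every ratio lands close to 1
```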

D.3.1 CERTIFIED ROBUSTNESS FOR SINGLE MODEL

Proposition D.2 (Certified Robustness for Single Model under Uniform Distribution). Let $\epsilon$ be a random variable supported on $\mathbb{R}^d$. Let $F$ be a classification model which is $(\epsilon,\lambda_3,p)$-single confident. Let $x_0\in\mathbb{R}^d$ be the input with ground-truth label $y_0\in[C]$. Suppose $f(x_0+\epsilon)_{y_0}$ is uniformly distributed in $[a,b]$. We have
$$\Pr\big(F(x_0+\epsilon)=y_0\big) \ge 1-p-\mathrm{clip}\Big(\frac{1/(1+\lambda_3^{-1})-a}{b-a}\Big), \qquad \text{where } \mathrm{clip}(x) = \max(\min(x,1),0).$$

Proof of Proposition D.2. We consider the distribution of the quantity $Y := f(x_0+\epsilon)_{y_0} - \lambda_3\big(1-f(x_0+\epsilon)_{y_0}\big)$. Since the model $F$ is $(\epsilon,\lambda_3,p)$-single confident, with probability $1-p$,
$$Y \le f(x_0+\epsilon)_{y_0} - \max_{y_j\in[C]:y_j\ne y_0} f(x_0+\epsilon)_{y_j}.$$
At the same time, because $f(x_0+\epsilon)_{y_0}$ follows $\mathcal{U}([a,b])$, $Y = (1+\lambda_3)f(x_0+\epsilon)_{y_0}-\lambda_3$ follows $\mathcal{U}\big([(1+\lambda_3)a-\lambda_3,\ (1+\lambda_3)b-\lambda_3]\big)$. Therefore,
$$\Pr(Y\le 0) = \mathrm{clip}\Big(\frac{\lambda_3-(1+\lambda_3)a}{(1+\lambda_3)(b-a)}\Big).$$
As a result,
$$\Pr\Big(f(x_0+\epsilon)_{y_0}-\max_{y_j\in[C]:y_j\ne y_0} f(x_0+\epsilon)_{y_j} \le 0\Big) \le p+\mathrm{clip}\Big(\frac{\lambda_3-(1+\lambda_3)a}{(1+\lambda_3)(b-a)}\Big),$$
which is exactly
$$\Pr\big(F(x_0+\epsilon)=y_0\big) \ge 1-p-\mathrm{clip}\Big(\frac{\lambda_3-(1+\lambda_3)a}{(1+\lambda_3)(b-a)}\Big) = 1-p-\mathrm{clip}\Big(\frac{1/(1+\lambda_3^{-1})-a}{b-a}\Big).$$
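Proposition D.2's closed form can be validated by simulation. The sketch below uses hypothetical values a = 0.2, b = 0.8, λ3 = 0.5 (illustrative choices only):

```python
import random

random.seed(0)
a, b, lam3 = 0.2, 0.8, 0.5                # hypothetical values
clip = lambda x: max(min(x, 1.0), 0.0)

# Y = (1 + lam3) * f_y0 - lam3 with f_y0 ~ Uniform[a, b]
trials = 200_000
emp = sum((1 + lam3) * random.uniform(a, b) - lam3 <= 0 for _ in range(trials)) / trials

closed = clip((1 / (1 + 1 / lam3) - a) / (b - a))
print(emp, closed)                        # the two probabilities agree closely
```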

D.3.2 CERTIFIED ROBUSTNESS FOR ENSEMBLES

Still, we define $\tilde{X}_1(\epsilon)$ and $\tilde{X}_2(\epsilon)$ according to Definitions D.2 and D.3. Under the uniform distribution assumption, we have the following lemma.

Lemma D.5 (Expectation and Variance of $\tilde{X}_1$ and $\tilde{X}_2$ under Uniform Distribution). Let $\tilde{X}_1$ and $\tilde{X}_2$ be defined by Definition D.2 and Definition D.3 respectively. Assume that under the distribution of $\epsilon$, the base models' confidence scores for the true class, $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$, are i.i.d. and uniformly distributed in the range $[a,b]$. We have
$$\mathbb{E}\,\tilde{X}_1(\epsilon) = \tfrac{1}{2}(1+\lambda_1)\|w\|_1(a+b)-\lambda_1\|w\|_1, \qquad \mathrm{Var}\,\tilde{X}_1(\epsilon) = \tfrac{1}{12}(1+\lambda_1)^2\|w\|_2^2(b-a)^2,$$
$$\mathbb{E}\,\tilde{X}_2(\epsilon) = (1+\lambda_2)(a+b)-2\lambda_2, \qquad \mathrm{Var}\,\tilde{X}_2(\epsilon) \le (1+\lambda_2)^2\,\frac{4}{N+1}\Big(\frac{2}{N+2}-\frac{1}{N+1}\Big)(b-a)^2.$$

Proof of Lemma D.5. We start by analyzing $\tilde{X}_1$. From the definition $\tilde{X}_1(\epsilon) := (1+\lambda_1)\sum_{j=1}^N w_j f_j(x_0+\epsilon)_{y_0}-\lambda_1\|w\|_1$, where $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are i.i.d. variables following $\mathcal{U}([a,b])$,
$$\mathbb{E}\,\tilde{X}_1(\epsilon) = (1+\lambda_1)\|w\|_1\frac{a+b}{2}-\lambda_1\|w\|_1 = \tfrac{1}{2}(1+\lambda_1)\|w\|_1(a+b)-\lambda_1\|w\|_1,$$
$$\mathrm{Var}\,\tilde{X}_1(\epsilon) = (1+\lambda_1)^2\sum_{j=1}^N w_j^2\cdot\frac{(b-a)^2}{12} = \tfrac{1}{12}(1+\lambda_1)^2\|w\|_2^2(b-a)^2.$$
Now we analyze the expectation of $\tilde{X}_2$. By the symmetry of the uniform distribution,
$$\mathbb{E}\,\tilde{X}_2(\epsilon) = (1+\lambda_2)\cdot 2\,\mathbb{E}\,f_i(x_0+\epsilon)_{y_0}-2\lambda_2 = (1+\lambda_2)(a+b)-2\lambda_2.$$
To reason about the variance, we need the following fact.

Fact D.1. Let $x_1,x_2,\dots,x_n$ be independent and uniformly distributed random variables; specifically, $x_i\sim\mathcal{U}([a,b])$ for each $1\le i\le n$. Then we have
$$\mathrm{Var}\Big(\min_{1\le i\le n}x_i\Big) = \mathrm{Var}\Big(\max_{1\le i\le n}x_i\Big) = \frac{1}{n+1}\Big(\frac{2}{n+2}-\frac{1}{n+1}\Big)(b-a)^2.$$

Observing that each i.i.d. $f_i(x_0+\epsilon)_{y_0}$ is exactly identical to $x_i$ in Fact D.1, and using $\mathrm{Var}(A+B)\le 2\,\mathrm{Var}(A)+2\,\mathrm{Var}(B)$, we have
$$\mathrm{Var}\Big(\max_{i\in[N]}f_i(x_0+\epsilon)_{y_0}+\min_{i\in[N]}f_i(x_0+\epsilon)_{y_0}\Big) \le \frac{4}{N+1}\Big(\frac{2}{N+2}-\frac{1}{N+1}\Big)(b-a)^2.$$
Therefore, $\mathrm{Var}\,\tilde{X}_2(\epsilon) \le (1+\lambda_2)^2\,\frac{4}{N+1}\big(\frac{2}{N+2}-\frac{1}{N+1}\big)(b-a)^2$.

Proof of Fact D.1. From the symmetry of the uniform distribution, $\mathrm{Var}(\min_{1\le i\le n}x_i) = \mathrm{Var}(\max_{1\le i\le n}x_i)$, so we only consider $Y := \min_{1\le i\le n}x_i$. Its CDF $F$ and PDF $f$ can be computed directly:
$$F(y) = 1-\Pr\Big(\min_i x_i\ge y\Big) = 1-\Big(\frac{b-y}{b-a}\Big)^n, \qquad f(y) = F'(y) = \frac{n(b-y)^{n-1}}{(b-a)^n}, \qquad y\in[a,b].$$
Hence,
$$\mathbb{E}\,Y = \int_a^b y f(y)\,dy = a+\frac{b-a}{n+1}, \qquad \mathbb{E}\,Y^2 = \int_a^b y^2 f(y)\,dy = a^2+\frac{2(b-a)a}{n+1}+\frac{2(b-a)^2}{(n+1)(n+2)}.$$
As a result,
$$\mathrm{Var}\,Y = \mathbb{E}Y^2-(\mathbb{E}Y)^2 = \frac{1}{n+1}\Big(\frac{2}{n+2}-\frac{1}{n+1}\Big)(b-a)^2.$$

Now, suppose $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are i.i.d. and uniformly distributed in $[a,b]$. If $\mathcal{M}_{\mathrm{WE}}$ is $(\epsilon,\lambda_1,p)$-WE confident and $\frac{a+b}{2} > \frac{1}{1+\lambda_1^{-1}}$, then (Theorem D.1)
$$\Pr\big(\mathcal{M}_{\mathrm{WE}}(x_0+\epsilon)=y_0\big) \ge 1-p-\frac{d_w K_1^2}{12}, \qquad \text{where } d_w = \frac{\|w\|_2^2}{\|w\|_1^2},\quad K_1 = \frac{b-a}{\frac{a+b}{2}-\frac{1}{1+\lambda_1^{-1}}}.$$
If $\mathcal{M}_{\mathrm{MME}}$ is $(\epsilon,\lambda_2,p)$-MME confident and $\frac{a+b}{2} > \frac{1}{1+\lambda_2^{-1}}$, then (Theorem D.2)
$$\Pr\big(\mathcal{M}_{\mathrm{MME}}(x_0+\epsilon)=y_0\big) \ge 1-p-\frac{c_N K_2^2}{4}, \qquad \text{where } c_N = \frac{2}{N+1}\Big(\frac{2}{N+2}-\frac{1}{N+1}\Big),\quad K_2 = \frac{b-a}{\frac{a+b}{2}-\frac{1}{1+\lambda_2^{-1}}}.$$
For the comparison (Corollary D.1), assume $\mu := \frac{a+b}{2} > \max\big\{\frac{1}{1+\lambda_1^{-1}},\frac{1}{1+\lambda_2^{-1}}\big\}$.

• When
$$\frac{\lambda_1}{\lambda_2} < \lambda_2^{-1}\Bigg(\Big((N+1)\sqrt{\tfrac{N+2}{6N}}\Big(\mu-\frac{1}{1+\lambda_2^{-1}}\Big)+1-\mu\Big)^{-1}-1\Bigg),$$
$\mathcal{M}_{\mathrm{WE}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{MME}}$.

• When
$$\frac{\lambda_1}{\lambda_2} > \lambda_2^{-1}\Bigg(\Big(\frac{N+1}{N}\sqrt{\tfrac{N+2}{6}}\Big(\mu-\frac{1}{1+\lambda_2^{-1}}\Big)+1-\mu\Big)^{-1}-1\Bigg), \quad (37)$$
$\mathcal{M}_{\mathrm{MME}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{WE}}$.
• When
$$N > \frac{6}{\Big(1-\frac{1}{\mu(1+\lambda_2^{-1})}\Big)^2}-2,$$
for any $\lambda_1$, $\mathcal{M}_{\mathrm{MME}}$ has higher or equal certified robustness compared to $\mathcal{M}_{\mathrm{WE}}$. Here, the certified robustness is given by Theorems D.1 and D.2.

Proof of Corollary D.1. First, we notice that a uniform distribution with mean $\mu$ can be any distribution $\mathcal{U}([a,b])$ with $(a+b)/2 = \mu$, so we replace $\mu$ by $(a+b)/2$. Then (1) and (2) follow from Lemma D.4, similarly to the proof of Corollary 2. (3) Since
$$N > \frac{6}{\Big(1-\frac{1}{\mu(1+\lambda_2^{-1})}\Big)^2}-2 \implies \Big(\sqrt{\tfrac{N+2}{6}}\Big(\mu-\frac{1}{1+\lambda_2^{-1}}\Big)+1-\mu\Big)^{-1} < 1 \implies \Big(\frac{N+1}{N}\sqrt{\tfrac{N+2}{6}}\Big(\mu-\frac{1}{1+\lambda_2^{-1}}\Big)+1-\mu\Big)^{-1} < 1,$$
the RHS of Equation (37) is smaller than $0$. Thus, for any $\lambda_1$, since $\lambda_1/\lambda_2 > 0$, Equation (37) is satisfied. According to (2), $\mathcal{M}_{\mathrm{MME}}$ has higher certified robustness than $\mathcal{M}_{\mathrm{WE}}$.

Remark. Compared to the general corollary (Corollary 2), under the uniform distribution we have an additional finding: when $N$ is sufficiently large, the Max-Margin Ensemble always has higher certified robustness than the Weighted Ensemble. This is due to the more efficient variance reduction of the Max-Margin Ensemble. As shown in Lemma D.5, the quantity $\mathrm{Var}\,\tilde{X}(\epsilon)/\big(\mathbb{E}\,\tilde{X}(\epsilon)\big)^2$ is $\Omega(1/N)$ for the Weighted Ensemble but $O(1/N^2)$ for the Max-Margin Ensemble. As a result, when $N$ becomes larger, the Max-Margin Ensemble has higher certified robustness. We use the uniform assumption here to give an illustration in a specific regime. Since the assumption may not hold exactly in practice, an interesting future direction is to generalize the analysis to other distributions, such as the Gaussian distribution that corresponds to locally linear classifiers. Results for these distributions may be derived from their specific concentration bounds for the maximum/minimum of i.i.d. random variables, as discussed at the end of Appendix D.2.
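Fact D.1's closed-form variance of the minimum order statistic can be verified by Monte Carlo. Below, n = 5 and [a, b] = [0, 1] are arbitrary illustrative choices:

```python
import random, statistics

random.seed(0)
n, a, b = 5, 0.0, 1.0                     # arbitrary illustrative choices
# Fact D.1: Var(min of n i.i.d. Uniform[a, b]) = (1/(n+1)) (2/(n+2) - 1/(n+1)) (b-a)^2
closed = (1 / (n + 1)) * (2 / (n + 2) - 1 / (n + 1)) * (b - a) ** 2

mins = [min(random.uniform(a, b) for _ in range(n)) for _ in range(200_000)]
v = statistics.variance(mins)
print(v, closed)                          # the two agree closely
```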

D.4 NUMERICAL EXPERIMENTS

To validate and give more intuitive explanations for our theorems, we present some numerical experiments.

D.4.1 ENSEMBLE COMPARISON FROM NUMERICAL SAMPLING

As discussed in Section 3.2, λ1/λ2 reflects the transferability across base models. It is challenging to obtain a sufficient number of different ensembles at various transferability levels while keeping all other variables controlled. Therefore, we simulate the transferability of ensembles numerically by varying λ1/λ2 (see the definitions of λ1 and λ2 in Definitions 6 and 7) and sampling the confidence scores $\{f_i(x_0+\epsilon)_{y_0}\}$ and $\{\max_{j\in[C]:j\ne y_0} f_i(x_0+\epsilon)_j\}$ under the chosen λ1 and λ2. For each level of λ1/λ2, with these samples, we compute the certified robust radius r using randomized smoothing (Theorem B.1) and compare the radius difference between the Weighted Ensemble and the Max-Margin Ensemble. According to Corollary 2 in Section 3.2, we should observe that, as the transferability λ1/λ2 increases, the Max-Margin Ensemble gradually becomes better than the Weighted Ensemble. Figure 2 verifies this trend: with the increase of λ1/λ2, the MME model tends to achieve a higher certified radius than the WE model. Moreover, we notice that under the same λ1/λ2, with a larger number of base models N, MME tends to be relatively better compared with WE. This is because we sample the confidence scores uniformly, and under the uniform distribution MME tends to be better than WE when the number of base models N becomes large, according to Corollary D.1. The concrete settings of λ1, λ2 and the sampling intervals of the confidence scores are detailed in the caption of Figure 2.
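The sampling procedure above can be sketched as follows. This is a simplified illustration only: a Cohen-style radius $r = \sigma\Phi^{-1}(p)$ stands in for Theorem B.1, the margins follow Definitions D.2 and D.3 with average weights, and the numbers (λ1 = 0.80, λ2 = 0.95, N = 3, confidences in [0.3, 1.0]) echo the caption of Figure 2 but are otherwise arbitrary.

```python
import random
from statistics import NormalDist

random.seed(0)
sigma, N, samples = 1.0, 3, 2000
lam1, lam2 = 0.80, 0.95
Phi_inv = NormalDist().inv_cdf

def radius(p):
    """Cohen-style certified radius sigma * Phi^-1(p); 0 when p <= 1/2."""
    p = min(p, 1 - 1e-4)                  # keep Phi^-1 finite
    return sigma * Phi_inv(p) if p > 0.5 else 0.0

wins_we = wins_mme = 0
for _ in range(samples):
    f = [random.uniform(0.3, 1.0) for _ in range(N)]
    # WE margin with average weights (tilde-X1 > 0)
    wins_we += (1 + lam1) * sum(f) / N - lam1 > 0
    # MME margin (tilde-X2 > 0)
    wins_mme += (1 + lam2) * (max(f) + min(f)) - 2 * lam2 > 0

r_we, r_mme = radius(wins_we / samples), radius(wins_mme / samples)
print(r_we, r_mme)
```

Sweeping λ1 while holding λ2 fixed, as in Figure 2, would trace out how the sign of the radius difference changes with λ1/λ2.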

D.4.2 ENSEMBLE COMPARISON FROM CERTIFIED ROBUSTNESS PLOTTING

In Corollary D.1, we derive the concrete certified robustness for both ensembles and the single model under the i.i.d. and uniform distribution assumptions. In fact, from the corollary we can directly compute the certified robust radius without sampling, as long as we assume the added noise is Gaussian. In Figure 3, we plot such certified robust radii for the single model, the WE, and the MME. Concretely, in the figure we assume that the true-class confidence score for each base model is i.i.d. and uniformly distributed in [a, b]. The Weighted Ensemble is (ε, λ1, 0.01)-WE confident; the Max-Margin Ensemble is (ε, λ2, 0.01)-MME confident; and the single model is (ε, λ3, 0.01)-single confident. We guarantee that λ1 ≤ λ3 ≤ λ2 to simulate the scenario where the ensembles are built on the same set of base models, for a fair comparison. We directly apply the results from our analysis (Theorem D.1, Theorem D.2, Proposition D.2) to get the statistical robustness for the single model and both ensembles. Then, we leverage Theorem B.1 to get the certified robust radius (with σ = 1.0, N = 100,000 Monte Carlo samples, and failure probability α = 0.001, aligned with the realistic setting). The x-axis is the number of base models N and the y-axis is the certified robustness. Since N is not applicable to the single model, we plot the single model's curve as a horizontal red dashed line. From the figure, we observe that when the number of base models N becomes larger, both ensembles perform much better than the single model. We remark that when N is small, the ensembles have 0 certified robustness, mainly because our theoretical bounds for ensembles are not tight enough for small N. Furthermore, we observe that the Max-Margin Ensemble gradually surpasses the Weighted Ensemble when N is large, which conforms to Corollary D.1.
Note that the left subfigure has smaller transferability λ1/λ2 and the right subfigure has larger transferability λ1/λ2; this again conforms to Corollary 2 and the discussion in Section 3.2, in that the Weighted Ensemble is relatively more robust in the left subfigure.

We also study the correlation between the transferability λ1/λ2 and whether the Weighted Ensemble or the Max-Margin Ensemble is more certifiably robust using realistic data. By varying the hyper-parameters of DRT, we find a setting where, over the same set of base models, the Weighted Ensemble and the Max-Margin Ensemble have similar certified robustness, i.e., for about half of the test set samples WE is more robust, and for the other half MME is more robust. We collect 1,000 test set samples in total. Then, for each test set sample, we compute the transferability λ1/λ2 and whether WE or MME has the higher certified robust radius. We remark that λ1 and λ2 are difficult to estimate in practice, so we use the average confidence ratio as a proxy:

• For WE, $\hat\lambda_1 = \mathbb{E}\left[\dfrac{\max_{y_j\in[C]:y_j\ne y_0}\sum_{i=1}^N w_i f_i(x_0+\epsilon)_{y_j}}{\sum_{i=1}^N w_i\big(1-f_i(x_0+\epsilon)_{y_0}\big)}\right]$.

• For MME, $\hat\lambda_2 = \mathbb{E}\left[\max_{i\in[N]}\dfrac{\max_{y_j\in[C]:y_j\ne y_0} f_i(x_0+\epsilon)_{y_j}}{1-f_i(x_0+\epsilon)_{y_0}}\right]$.

Now we study the correlation between $X := \hat\lambda_1/\hat\lambda_2 - \text{RHS of Equation (17)}$ and $Y := \mathbb{1}[\text{MME has higher certified robustness}]$. To do so, we draw the ROC curve where a threshold on $X$ performs binary classification of $Y$. The curve and the AUC score are shown in Figure 4. From the ROC curve, we find that $X$ and $Y$ are clearly positively correlated, since AUC = 0.66 > 0.5, which again verifies Corollary 2. We remark that besides $X$, other factors such as non-symmetric or non-i.i.d. confidence score distributions may also play a role.

E EXPERIMENT DETAILS

Evaluation metric: We use the certified test set accuracy at each radius r as our evaluation metric, defined as the fraction of test set samples for which the smoothed classifier can certify robustness within the L2 ball of radius r. Since computing the exact value of this metric is intractable, we report the approximate certified test accuracy (Cohen et al., 2019) obtained through a Monte Carlo procedure. For each sample, the robustness certification holds with probability at least 1 − α. Following the literature, we choose α = 0.001, n0 = 100 for Monte Carlo sampling during the prediction phase, and n = 100,000 for Monte Carlo sampling during the certification phase.
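The certification step can be sketched as below. One deliberate simplification: a Hoeffding lower confidence bound replaces the exact Clopper-Pearson interval used in the literature, and the counts passed in are hypothetical.

```python
import math
from statistics import NormalDist

def certify(n_correct, n, sigma=1.0, alpha=0.001):
    """Certified L2 radius from n Monte Carlo noisy predictions, n_correct of
    which voted for the top class; returns None when nothing can be certified."""
    p_hat = n_correct / n
    # Hoeffding lower confidence bound, valid with probability >= 1 - alpha
    p_lower = p_hat - math.sqrt(math.log(1 / alpha) / (2 * n))
    if p_lower <= 0.5:
        return None                       # abstain
    return sigma * NormalDist().inv_cdf(p_lower)

print(certify(99_000, 100_000))           # hypothetical counts: a positive radius
print(certify(60, 100))                   # too few samples: abstain (None)
```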

E.1 MNIST

Baseline models' hyper-parameter configuration: We choose the number of noise samples per instance m = 2 and the Gaussian smoothing parameter σ ∈ {0.25, 0.5, 1.0} for all training methods. For SmoothAdv, we use a 10-step L2 PGD attack with perturbation scale δ = 1.0, without pretraining or unlabelled data augmentation. We reproduced results similar to those in the original paper using their open-sourced code. Training details: We use the LeNet architecture and train each base model for 90 epochs. For the optimizer, we use SGD with momentum and initial learning rate α = 0.01. The learning rate is decayed every 30 epochs with decay ratio γ = 0.1, and the batch size is 256. For DRT experiments, we start training with a small learning rate α = 5 × 10⁻⁴ and finetune the base models for another 90 epochs. During training, we find that overly large regularization weights may cause training to collapse on MNIST. We therefore use small DRT hyper-parameters ρ1, ρ2 and report the one with the best certified accuracy at each radius r. Certified accuracy curve: Figures 6 and 7 show the certified accuracy curves for different base model types and smoothing parameters σ over the range of the radius r. While simply applying MME or WE improves the certified accuracy slightly, DRT boosts the certified accuracy at each radius r by a significant margin. DRT hyper-parameters: We investigate the DRT hyper-parameters ρ1 ∈ {0.1, 0.2, 0.3, 0.5, 1.0} and ρ2 ∈ {0.5, 1.0, 2.0, 5.0} for each smoothing parameter σ ∈ {0.25, 0.5, 1.0}. We report the detailed results for every hyper-parameter setting in Tables 4 to 6 and bold the numbers with the highest certified accuracy at each radius r among all the tables with different σ's.
From the experiments, we found that the GD loss's weight ρ1 has the major influence on the ensemble model's functionality: with a larger ρ1, the model achieves slightly worse certified accuracy at small radii r but better certified accuracy at large r. We cannot choose too large a ρ1 in the small-σ cases (e.g., σ = 0.25); otherwise, the training procedure collapses. We show the DRT-based model's approximate certified accuracy with different ρ1 in Figure 5. Alternatively, we found that the CM loss's weight ρ2 has a positive influence on the model's performance: the larger the ρ2, the better the certified accuracy. Choosing a large ρ2 does not harm the model's functionality much, but the improvement becomes marginal. For MNIST, (σ, ρ1, ρ2) ∈ {(0.25, 0.1, 0.2), (0.5, 0.5, 5.0), (1.0, 1.0, 5.0)} are good combinations. Efficiency analysis: We regard the execution time per mini-batch as our efficiency criterion. For MNIST with batch size 256, DRT with the Gaussian smoothing base model requires only 1.04s per mini-batch to achieve results comparable to the SmoothAdv method, which requires 1.86s. Moreover, DRT with the SmoothAdv base model requires 2.52s per training batch but achieves much better results. The evaluation is on a single NVIDIA GeForce GTX 1080 Ti GPU.

E.2 CIFAR-10

For SmoothAdv, we use a 10-step L2 PGD attack with perturbation scale δ = 1.0, without pretraining or unlabelled data augmentation, and reproduced results similar to those reported in the baselines' papers. Certified accuracy curve: Figures 8 and 9 show the certified accuracy curves for different base model types and smoothing parameters σ over the range of the radius r. We see the same trends: applying the MME/WE mechanism gives a slight improvement, and DRT makes this improvement significant.
DRT hyper-parameters: We studied the DRT hyper-parameters ρ1 ∈ {0.1, 0.2, 0.5, 1.0, 1.5} and ρ2 ∈ {0.5, 2.0, 5.0} for each σ ∈ {0.25, 0.5, 1.0}; the detailed results are in Tables 7 to 9. We bold the numbers with the highest certified accuracy at each radius r among all the tables with different σ's. The results support similar conclusions about the choice of ρ1 and ρ2; (σ, ρ1, ρ2) ∈ {(0.25, 0.1, 0.5), (0.5, 1.0, 5.0), (1.0, 1.5, 5.0)} are good choices on the CIFAR-10 dataset. Efficiency analysis: We again use the execution time per mini-batch as our efficiency criterion. For CIFAR-10 with batch size 256, DRT with the Gaussian smoothing base model requires 3.82s per mini-batch to achieve results competitive with the 10-step PGD-attack-based SmoothAdv method, which requires 6.39s. All models are trained in parallel on 4 NVIDIA GeForce GTX 1080 Ti GPUs.

E.3 IMAGENET

For ImageNet, we use the ResNet-50 architecture and train each base model for 90 epochs with the SGD-momentum optimizer. The initial learning rate is α = 0.1, decayed every 30 epochs with decay ratio γ = 0.1. We tried Gaussian smoothing parameters σ ∈ {0.50, 1.00} and use the best hyper-parameter configuration for each σ in the baseline models. We explored the DRT hyper-parameters ρ1 ∈ {0.5, 1.0, 1.5}, ρ2 ∈ {1.0, 2.0, 5.0} in our experiments and started with learning rate α = 5 × 10⁻³ during DRT finetuning. Table 10 shows the results.



https://github.com/Hadisalman/smoothing-adversarial/



Figure 1: Illustration of a robust ensemble.

Fact B.1 (Robustness Condition for Single Model). Consider an input x0 ∈ R^d with ground-truth label y0 ∈ [C]. Suppose a model F satisfies F(x0) = y0. Then, the model F is r-robust at point x0 if and only if F(x) = y0 for any x with ‖x − x0‖2 < r.

Proof of Corollary C.1. For convenience, define p a := g Fa (x 0 ) y0 , p b := g F b (x 0 ) y0 , where p a = p b + δ and p min = p b . From Proposition C.1 and Theorem C.1, we have

Theorem D.2 (Certified Robustness for MME under Uniform Distribution). Let $\mathcal{M}_{\mathrm{MME}}$ be a Max-Margin Ensemble over $\{f_i\}_{i=1}^N$. Let $x_0\in\mathbb{R}^d$ be the input with ground-truth label $y_0\in[C]$. Let $\epsilon$ be a random variable supported on $\mathbb{R}^d$. Under the distribution of $\epsilon$, suppose $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are i.i.d. and uniformly distributed in $[a,b]$.

Figure 2: Signed certified robust radius difference between MME and WE by λ 1 /λ 2 under different numbers of base models N . Here we fix λ 2 to be 0.95 and uniformly sample λ 1 ∈ [0.8, 0.95). The confidence score for the true class on each base model is uniformly sampled from [a, b], where a is sampled from [0.3, 1.0) and b is sampled from [a, 1.0) uniformly for each instance. Blue points correspond to the negative radius difference (i.e., WE has larger radius than MME) and Red points correspond to the positive radius difference (i.e., MME has larger radius than WE).

Figure 3 panel settings — left: [a, b] = [0.2, 0.3], λ1 = 0.29, λ2 = 0.31, λ3 = 0.30; right: [a, b] = [0.3, 0.4], λ1 = 0.48, λ2 = 0.50, λ3 = 0.49.

Figure 3: Comparison of certified robustness (in terms of certified robust radius) of Max-Margin Ensemble, Weighted Ensemble, and single model under concrete numerical settings. The y-axis is the certified robustness and the x-axis is the number of base models. The confidence score for the true class is uniformly distributed in [a, b]. The Weighted Ensemble (shown by blue line) is ( , λ 1 , 0.01)-WE confident; the Max-Margin Ensemble (shown by green line) is ( , λ 2 , 0.01)-MME confident; and the single model (shown by red line) is ( , λ 3 , 0.01)-MME confident.

Figure 4: ROC curve of the 1[MME has higher certified robustness] classification task with the threshold variable X.

Figure 5: Effect of ρ 1 : Comparison of approximate certified accuracy of DRT models on MNIST with different GD Loss's weight ρ 1 .

We use the ResNet-110 architecture for each base model and train them for 150 epochs. During training, we use SGD with momentum and initial learning rate α = 0.1, decayed every 50 epochs with ratio γ = 0.1. For DRT experiments, we start training with learning rate α = 5 × 10⁻³ and finetune our base models for another 150 epochs.

The certified accuracy under different radius r for MNIST dataset.

The certified accuracy under different radius r for CIFAR-10 dataset.

Main Theoretical Results.

Similarly, we use Lemma D.1 to derive the statistical robustness lower bounds for WE and MME. We omit the proofs since they are direct applications of Lemma D.5, Lemma D.1, and Lemma D.2. Theorem D.1 (Certified Robustness for WE under Uniform Distribution). Let $\mathcal{M}_{\mathrm{WE}}$ be a Weighted Ensemble defined over $\{f_i\}_{i=1}^N$ with weights $\{w_i\}_{i=1}^N$. Let $x_0\in\mathbb{R}^d$ be the input with ground-truth label $y_0\in[C]$. Let $\epsilon$ be a random variable supported on $\mathbb{R}^d$. Under the distribution of $\epsilon$, suppose $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are i.i.d. and uniformly distributed in $[a,b]$.

D.3.3 COMPARISON

Now, under the uniform distribution, we can also compare the certified robustness. Corollary D.1 (Comparison of Certified Robustness under Uniform Distribution). Over base models $\{f_i\}_{i=1}^N$, let $\mathcal{M}_{\mathrm{MME}}$ be the Max-Margin Ensemble, and $\mathcal{M}_{\mathrm{WE}}$ the Weighted Ensemble with weights $\{w_i\}_{i=1}^N$. Let $x_0\in\mathbb{R}^d$ be the input with ground-truth label $y_0\in[C]$. Let $\epsilon$ be a random variable supported on $\mathbb{R}^d$. Under the distribution of $\epsilon$, suppose $\{f_i(x_0+\epsilon)_{y_0}\}_{i=1}^N$ are i.i.d. and uniformly distributed with mean $\mu$. Suppose $\mathcal{M}_{\mathrm{WE}}$ is $(\epsilon,\lambda_1,p)$-WE confident, and $\mathcal{M}_{\mathrm{MME}}$ is $(\epsilon,\lambda_2,p)$-MME confident.

DRT-(ρ 1 , ρ 2 ) model's certified accuracy under different radius r on MNIST dataset. Smoothing parameter σ = 0.25.

DRT-(ρ 1 , ρ 2 ) model's certified accuracy under different radius r on MNIST dataset. Smoothing parameter σ = 0.50.

DRT-(ρ1, ρ2) model's certified accuracy under different radius r on MNIST dataset. Smoothing parameter σ = 1.00. Each row lists (ρ1, ρ2) followed by certified accuracies (%) at increasing radii r:
( , .0): 94.2 91.9 88.6 84.5 79.6 72.5 63.7 53.9 44.9 36.4 27.3
(0.2, 5.0): 94.2 91.6 88.9 84.4 79.3 72.5 63.3 54.3 45.9 36.9 28.7
(0.5, 2.0): 92.6 91.3 87.7 83.1 77.5 71.1 62.4 53.3 45.3 36.7 29.3
(0.5, 5.0): 92.5 91.2 88.0 83.4 78.5 71.1 62.3 53.7 45.3 37.8 29.5
(1.0, 5.0): 92.1 90.0 86.4 81.4 76.3 69.7 61.1 54.0 46.4 38.4 31.0

DRT-(ρ 1 , ρ 2 ) model's certified accuracy under different radius r on CIFAR-10 dataset. Smoothing parameter σ = 0.25.

DRT-(ρ1, ρ2) model's certified accuracy under different radius r on CIFAR-10 dataset. Smoothing parameter σ = 0.50. Each row lists (ρ1, ρ2) followed by certified accuracies (%) at increasing radii r:
( , .0): 62.2 56.3 50.3 43.4 37.5 26.9 24.7 19.3
(0.5, 5.0): 61.9 56.2 50.2 43.4 37.9 31.8 25.0 19.6
(1.0, 5.0): 61.5 56.0 50.1 43.3 37.5 32.2 25.6 19.9

DRT-(ρ 1 , ρ 2 ) model's certified accuracy under different radius r on CIFAR-10 dataset. Smoothing parameter σ = 1.00.

C ANALYSIS OF ENSEMBLE SMOOTHING STRATEGIES

In Section 3 we mainly use the adapted randomized model smoothing strategy named Ensemble Before Smoothing (EBS). We also consider Ensemble After Smoothing (EAS). Through the following analysis, we will show that Ensemble Before Smoothing is generally better than Ensemble After Smoothing, which justifies our choice of strategy. We formally define the Ensemble Before Smoothing strategy as below:

Definition C.1 (Strategy: Ensemble Before Smoothing (EBS)). Let $\mathcal{M}$ be an ensemble model over base models $\{f_i\}_{i=1}^N$. Let $\epsilon$ be a random variable. The EBS ensemble $G^{\mathcal{M}}$ predicts the most frequent class of $\mathcal{M}$ under the noise distribution: $G^{\mathcal{M}}(x_0) := \arg\max_{c\in[C]} \Pr_\epsilon\big(\mathcal{M}(x_0+\epsilon)=c\big)$.

We define the Ensemble After Smoothing strategy accordingly:

Definition C.2 (Strategy: Ensemble After Smoothing (EAS)). Let $\mathcal{M}$ be an ensemble model over base models $\{f_i\}_{i=1}^N$. Let $\epsilon$ be a random variable. The EAS ensemble $H^{\mathcal{M}}$ follows the smoothed base model with the highest probability for its predicted class. Here, $c$ is the index of the smoothed base model selected.

Remark. In EBS, we first construct a model ensemble $\mathcal{M}$ based on the base models using the WE or MME protocol, then apply randomized smoothing on top of the classifier. The classifier predicts the most frequent class of $\mathcal{M}$ when the input follows the distribution $x_0+\epsilon$. In EAS, we use $\epsilon$ to construct smoothed classifiers for the base models respectively. Then, for a given input $x_0$, the ensemble agrees with the base model that has the highest probability for its predicted class.
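The two strategies can be contrasted with a toy sketch. Everything below is hypothetical (tiny hand-made "models", a plain average as the ensemble, Monte Carlo smoothing by majority vote); it only illustrates the order of the ensemble and smoothing operations, not the paper's actual implementation.

```python
import random
from collections import Counter

random.seed(0)

# Three toy "base models" mapping a scalar input to two class scores.
models = [lambda x: [0.6 + 0.1 * x, 0.4 - 0.1 * x],
          lambda x: [0.7, 0.3],
          lambda x: [0.55, 0.45]]

def predict(scores):
    return max(range(len(scores)), key=scores.__getitem__)

def ensemble(x):                       # M: average of the base models' scores
    outs = [m(x) for m in models]
    return [sum(o[c] for o in outs) / len(outs) for c in range(2)]

def smooth(f, x, sigma=0.5, n=2000):   # Monte Carlo randomized smoothing of f
    votes = Counter(predict(f(x + random.gauss(0, sigma))) for _ in range(n))
    label, count = votes.most_common(1)[0]
    return label, count / n

def ebs(x):                            # EBS: build M first, then smooth M
    return smooth(ensemble, x)[0]

def eas(x):                            # EAS: smooth each base model, then
    results = [smooth(m, x) for m in models]      # follow the most confident one
    return max(results, key=lambda lp: lp[1])[0]

print(ebs(0.0), eas(0.0))
```

On this toy input both strategies agree; the analysis in this appendix concerns the certified radii the two constructions admit, where they differ.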

C.1 CERTIFIED ROBUSTNESS

In this subsection, we characterize the certified robustness when using both strategies.

C.1.1 ENSEMBLE BEFORE SMOOTHING

Proposition C.1 (Certified Robustness for Ensemble Before Smoothing). Let $G^{\mathcal{M}}$ be an ensemble constructed by the EBS strategy, where the random variable $\epsilon \sim \mathcal{N}(0, \sigma^2 I_d)$. The proposition is a direct application of Theorem B.1.

C.1.2 ENSEMBLE AFTER SMOOTHING

Theorem C.1 (Certified Robustness for Ensemble After Smoothing). Let $H^{\mathcal{M}}$ be an ensemble constructed by the EAS strategy over base models $\{f_i\}_{i=1}^N$.

We studied the effects of the GD Loss and the Confidence Margin Loss separately by setting ρ1 = 0 or ρ2 = 0 and tuning only the other parameter. We conducted this ablation study on the CIFAR-10 dataset with an ensemble of Gaussian-smoothed base models; the results are shown in Table 13. We observed that both the GD Loss (GDL) and the Confidence Margin Loss (CML) have positive effects on the certified accuracy, with GDL playing the major role at larger radii. When combining these two regularization losses as our DRT loss, the ensemble model achieves the best certified accuracy across all radii.

