ON THE ROBUSTNESS OF RANDOMIZED ENSEMBLES TO ADVERSARIAL PERTURBATIONS

Anonymous authors
Paper under double-blind review

Abstract

Randomized ensemble classifiers (RECs), where one classifier is randomly selected during inference, have emerged as an attractive alternative to traditional ensembling methods for realizing adversarially robust classifiers with limited compute requirements. However, recent works have shown that existing methods for constructing RECs are more vulnerable than initially claimed, casting major doubts on their efficacy and prompting fundamental questions such as: "When are RECs useful?", "What are their limits?", and "How do we train them?". In this work, we first demystify RECs by deriving fundamental results regarding their theoretical limits, necessary and sufficient conditions for them to be useful, and more. Leveraging this new understanding, we propose a new boosting algorithm (BARRE) for training robust RECs, and empirically demonstrate its effectiveness at defending against strong ℓ∞ norm-bounded adversaries across various network architectures and datasets. Our code is submitted as part of the supplementary material and will be publicly released on GitHub.

¹ When compared to the high clean accuracy achieved in a non-adversarial setting.

1. INTRODUCTION

Defending deep networks against adversarial perturbations (Szegedy et al., 2013; Biggio et al., 2013; Goodfellow et al., 2014) remains a difficult task. Several proposed defenses (Papernot et al., 2016; Pang et al., 2019; Yang et al., 2019; Sen et al., 2019; Pinot et al., 2020) have been subsequently "broken" by stronger adversaries (Carlini & Wagner, 2017; Athalye et al., 2018; Tramèr et al., 2020; Dbouk & Shanbhag, 2022), whereas strong defenses (Cisse et al., 2017; Tramèr et al., 2018; Cohen et al., 2019), such as adversarial training (AT) (Goodfellow et al., 2014; Zhang et al., 2019; Madry et al., 2018), achieve unsatisfactory levels of robustness¹. A popular belief in the adversarial community is that single-model defenses, e.g., AT, lack the capacity to defend against all possible perturbations, and that constructing an ensemble of diverse, often smaller, models should be more cost-effective (Pang et al., 2019; Kariyappa & Qureshi, 2019; Pinot et al., 2020; Yang et al., 2020b; 2021; Abernethy et al., 2021; Zhang et al., 2022). Indeed, recent deterministic robust ensemble methods, such as MRBoost (Zhang et al., 2022), have been successful at achieving higher robustness than AT baselines using the same network architecture, at the expense of 4× more compute (see Fig. 1). In fact, Fig. 1 indicates that one can simply adversarially train larger deep nets that match both the robustness and compute requirements of MRBoost models, rendering state-of-the-art boosting techniques obsolete for designing classifiers that are both robust and efficient. In contrast, randomized ensembles, where one classifier is randomly selected during inference, offer a unique way of ensembling that can operate with limited compute resources. However, the recent work of Dbouk & Shanbhag (2022) has cast major concerns regarding their efficacy, as they successfully compromised the state-of-the-art randomized defense of Pinot et al.
(2020) by large margins using their proposed ARC adversary. Furthermore, there is an apparent lack of proper theory on the robustness of randomized ensembles, as fundamental questions such as "when does randomization help?" or "how do we find the optimal sampling probability?" remain unanswered.

Contributions. In this work, we first provide a theoretical framework for analyzing the adversarial robustness of randomized ensemble classifiers (RECs). Our theoretical results enable us to better understand randomized ensembles, revealing interesting and useful answers regarding their limits, necessary and sufficient conditions for them to be useful, and efficient methods for finding the optimal sampling probability. Next, guided by our theoretical results, we propose BARRE, a new boosting algorithm for training robust randomized ensemble classifiers that achieves state-of-the-art robustness. We validate the effectiveness of BARRE via comprehensive experiments across multiple network architectures and datasets, thereby demonstrating that RECs can achieve similar robustness to AT and MRBoost at a fraction of the computational cost (see Fig. 1).

2. BACKGROUND AND RELATED WORK

Adversarial Robustness. Deep neural networks are known to be vulnerable to adversarial perturbations (Szegedy et al., 2013; Biggio et al., 2013). In an attempt to robustify deep nets, several defense methods have been proposed (Katz et al., 2017; Madry et al., 2018; Cisse et al., 2017; Zhang et al., 2019; Yang et al., 2020b; Zhang et al., 2022; Tjeng et al., 2018; Xiao et al., 2018; Raghunathan et al., 2018; Yang et al., 2020a). While some heuristic-based empirical defenses have later been broken by better adversaries (Carlini & Wagner, 2017; Athalye et al., 2018; Tramèr et al., 2020), strong defenses, such as adversarial training (AT) (Goodfellow et al., 2014; Madry et al., 2018; Zhang et al., 2019), remain unbroken but achieve unsatisfactory levels of robustness.

Ensemble Defenses. Building on the massive success of classic ensemble methods in machine learning (Breiman, 1996; Freund & Schapire, 1997; Dietterich, 2000), robust ensemble methods (Kariyappa & Qureshi, 2019; Pang et al., 2019; Sen et al., 2019; Yang et al., 2020b; 2021; Abernethy et al., 2021; Zhang et al., 2022) have emerged as a natural solution to compensate for the unsatisfactory performance of existing single-model defenses, such as AT. Earlier works (Kariyappa & Qureshi, 2019; Pang et al., 2019; Sen et al., 2019) relied on heuristic-based techniques for inducing diversity within the ensembles, and have subsequently been shown to be weak (Tramèr et al., 2020; Athalye et al., 2018). Recent methods, such as RobBoost (Abernethy et al., 2021) and MRBoost (Zhang et al., 2022), formulate the design of robust ensembles from a margin-boosting perspective, achieving state-of-the-art robustness among deterministic ensemble methods. This achievement comes at a massive (4–5×) increase in compute requirements, as each inference requires executing all members of the ensemble, rendering them unsuitable for safety-critical edge applications (Guo et al., 2020; Sehwag et al., 2020; Dbouk & Shanbhag, 2021).
Randomized ensembles (Pinot et al., 2020), where one classifier is chosen randomly during inference, offer a more compute-efficient alternative. However, their ability to defend against strong adversaries remains unclear (Dbouk & Shanbhag, 2022; Zhang et al., 2022). In this work, we show that randomized ensemble classifiers can be effective at defending against adversarial perturbations, and propose a boosting algorithm for training such ensembles, thereby achieving high levels of robustness with limited compute requirements.

Randomized Defenses. A randomized defense, where the defender adopts a random strategy for classification, is intuitive: if the adversary does not know the exact policy the defender will use for a given input, then one expects it to struggle, on average, to fool such a defense. Theoretically, Bayesian Neural Nets (BNNs) (Neal, 2012) have been shown to be robust (in the large data limit) to gradient-based attacks (Carbone et al., 2020), whereas Pinot et al. (2020) have shown that a randomized ensemble classifier (REC) with higher robustness exists for every deterministic classifier. However, realizing strong and practical randomized defenses remains elusive, as BNNs are computationally prohibitive and existing methods (Xie et al., 2018; Dhillon et al., 2018; Yang et al., 2019) often end up being compromised by adaptive attacks (Athalye et al., 2018; Tramèr et al., 2020). Even BAT, the method proposed by Pinot et al. (2020) for robust RECs, was recently broken by Zhang et al. (2022) and Dbouk & Shanbhag (2022). In contrast, our work first demystifies randomized ensembles by deriving fundamental results regarding the limits of RECs, necessary and sufficient conditions for them to be useful, and efficient methods for finding the optimal sampling probability. Empirically, our proposed boosting algorithm (BARRE) can successfully train robust RECs, achieving state-of-the-art robustness for RECs.

3. PRELIMINARIES & PROBLEM SETUP

Notation. Let F = {f_1, ..., f_M} be a collection of M arbitrary C-ary classifiers f_i : ℝ^d → [C]. A soft classifier, denoted by f̃ : ℝ^d → ℝ^C, can be used to construct a hard classifier f(x) = arg max_{c∈[C]} [f̃(x)]_c, where [v]_c = v_c. We use the notation f(·|θ) to represent parametric classifiers, where f is a fixed mapping and θ ∈ Θ represents the learnable parameters. Let ∆_M = {v ∈ [0,1]^M : Σ_i v_i = 1} be the probability simplex of dimension M − 1. Given a probability vector α ∈ ∆_M, we construct a randomized ensemble classifier (REC) f_α such that f_α(x) = f_i(x) with probability α_i. In contrast, traditional ensembling methods construct a deterministic ensemble classifier (DEC) from the soft classifiers as follows: f̄(x) = arg max_{c∈[C]} [Σ_{i=1}^M f̃_i(x)]_c.

Denote by z = (x, y) ∈ ℝ^d × [C] a feature-label pair that follows some unknown distribution D. Let S ⊂ ℝ^d be a closed and bounded set representing the attacker's perturbation set. A typical choice of S in the adversarial community is the ℓ_p ball of radius ϵ: B_p(ϵ) = {δ ∈ ℝ^d : ∥δ∥_p ≤ ϵ}. For a classifier f_i ∈ F and data-point z = (x, y), define S_i(z) = {δ ∈ S : f_i(x + δ) ≠ y} to be the set of valid adversarial perturbations to f_i at z.

Definition 1. For any (potentially random) classifier f, define the adversarial risk η:

η(f) = E_{z∼D} max_{δ∈S} E_f [1{f(x + δ) ≠ y}]  (1)

The adversarial risk measures the robustness of f on average in the presence of an adversary (attacker) restricted to the set S. For the special case S = {0}, the adversarial risk reduces to the standard risk of f:

η_0(f) = E_{z∼D}[E_f[1{f(x) ≠ y}]] = P{f(x) ≠ y}  (2)

The more commonly reported robust accuracy of f, i.e., its accuracy against adversarially perturbed inputs, can be computed directly from η(f). The same can be said for the clean accuracy and η_0(f).
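The REC construction above is simple to state in code. A minimal sketch of randomized inference follows, using three hypothetical toy linear rules on ℝ² as stand-ins for the classifiers in F (none of these rules come from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy "classifiers" f_i : R^2 -> {0, 1}, for illustration only.
classifiers = [
    lambda x: int(x[0] + x[1] > 0.0),   # f_1
    lambda x: int(x[0] - x[1] > 0.0),   # f_2
    lambda x: int(x[0] > 0.25),         # f_3
]
alpha = np.array([0.5, 0.3, 0.2])       # sampling probability in Delta_3

def rec_predict(x, classifiers, alpha, rng):
    """f_alpha(x) = f_i(x) with probability alpha_i (the REC definition)."""
    i = rng.choice(len(classifiers), p=alpha)
    return classifiers[i](x)

# At this input, f_2 predicts 1 while f_1 and f_3 predict 0, so the REC
# outputs 1 with probability alpha_2 = 0.3.
x = np.array([0.1, -0.3])
preds = [rec_predict(x, classifiers, alpha, rng) for _ in range(1000)]
```

Note that the randomness is per inference: two queries with the same x may return different labels, which is exactly what complicates the attacker's optimization.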
When working with an REC f_α, the adversarial risk can be expressed as:

η(f_α) ≡ η(α) = E_{z∼D} max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y}  (3)

where we use the notation η(α) whenever the collection F is fixed. Let {e_i}_{i=1}^M ⊂ {0,1}^M be the standard basis vectors of ℝ^M; we then employ the notation η(f_i) = η(f_{e_i}) ≡ η(e_i) = η_i.

4. THE ADVERSARIAL RISK OF A RANDOMIZED ENSEMBLE CLASSIFIER

In this section, we develop our main theoretical findings regarding the adversarial robustness of any randomized ensemble classifier. Detailed proofs of all statements and theorems can be found in Appendix B. 

4.1. PROPERTIES OF η

We start with the following statement:

Proposition 1. For any F = {f_i}_{i=1}^M, perturbation set S ⊂ ℝ^d, and data distribution D, the adversarial risk η is a piece-wise linear convex function of α ∈ ∆_M. Specifically, there exist K ∈ ℕ configurations U_k ⊆ {0,1}^M (k ∈ [K]) and a p.m.f. p ∈ ∆_K such that:

η(α) = Σ_{k=1}^K p_k · max_{u∈U_k} u^⊤α  (4)

Before we explain the intuition behind Proposition 1, we first make the following observations:

Generality. Proposition 1 makes no assumptions about the classifiers F, i.e., it applies even to the enigmatic deep nets. While the majority of theoretical results in the literature are restricted to ℓ_p-bounded adversaries, Proposition 1 holds for any closed and bounded perturbation set S. This is crucial, as real-world attacks are often not restricted to ℓ_p balls around the input (Liu et al., 2018; Duan et al., 2020). This generality is inherited by all of our results, as they build on Proposition 1.

Analytic Form. Proposition 1 allows us to re-write the adversarial risk in (3) using the analytic form in (4), which is much simpler to analyze and work with. In fact, the analytic form in (4) enables us to derive our main theoretical results in Sections 4.2 & 4.3, which include tight fundamental bounds on η.

Optimal Sampling. The convexity of η implies that any local minimum α* is also a global minimum. The probability simplex is a closed convex set, so a global minimum, which need not be unique, is always achievable. Since η is piece-wise linear, there always exists a finite set of candidate solutions for α*. For M ≤ 3, we efficiently enumerate all candidates in Section 4.2, eliminating the need for any sophisticated search method. For larger M, however, enumeration becomes intractable. In Section 4.4, we construct an optimal algorithm for finding α* by leveraging the classic sub-gradient method (Shor, 2012) for optimizing sub-differentiable functions.

Intuition.
Consider a data-point z ∈ ℝ^d × [C]. For any δ ∈ S and α ∈ ∆_M, we have the per-sample risk:

r(z, δ, α) = Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y} = u^⊤α  (5)

where u ∈ {0,1}^M is such that u_i = 1 if and only if δ is adversarial to f_i at z. Since u is independent of α, we thus obtain a many-to-one mapping from δ ∈ S to u ∈ {0,1}^M. Therefore, for any α and z, we can always decompose the perturbation set S = G_1 ∪ ... ∪ G_n into n ≤ 2^M subsets such that ∀δ ∈ G_j: r(z, δ, α) = α^⊤u_j for some binary vector u_j independent of α. Let U = {u_j}_{j=1}^n be the collection of these vectors; then we can write:

max_{δ∈S} r(z, δ, α) = max_{δ∈G_1∪...∪G_n} r(z, δ, α) = max_{j∈[n]} max_{δ∈G_j} r(z, δ, α) = max_{u∈U} u^⊤α  (6)

The main idea behind the equivalence in (6) is that we can represent any configuration of classifiers, data-point, and perturbation set using a unique set of binary vectors U. For example, Fig. 2 pictorially depicts this equivalence using a case of M = 3 classifiers in ℝ² with S = B_2(ϵ). This equivalence is the key behind Proposition 1, since the point-wise max term in (6) is piece-wise linear and convex in α ∈ ∆_M. Finally, Proposition 1 follows from the pigeon-hole principle and the linearity of expectation.
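The equivalence in (6) can be checked numerically. In the sketch below, each data point is represented directly by a hypothetical set U_j of reachable fooling patterns u ∈ {0,1}^M (in practice these would be produced by attacking the members of F; the sets here are made up for illustration), and the empirical risk is the average of the point-wise maxima:

```python
import numpy as np

# U_sets[j] lists the binary fooling patterns reachable by some delta in S
# at data point z_j (hypothetical patterns, M = 3 classifiers).
U_sets = [
    np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0]]),  # z_1: fool f_1 OR f_2, never both
    np.array([[0, 0, 0], [1, 1, 1]]),             # z_2: one delta fools all members
    np.array([[0, 0, 0]]),                        # z_3: no attack succeeds
]

def adversarial_risk(alpha, U_sets):
    """eta(alpha) = mean_j max_{u in U_j} u^T alpha: piece-wise linear, convex."""
    return float(np.mean([max(U @ alpha) for U in U_sets]))

eta_mix = adversarial_risk(np.array([0.5, 0.5, 0.0]), U_sets)  # risk of the REC
eta_1 = adversarial_risk(np.array([1.0, 0.0, 0.0]), U_sets)    # risk of f_1 alone
```

On this toy instance the equiprobable mixture of f_1 and f_2 achieves risk 1/2, strictly below the individual risk 2/3, illustrating how disjoint attack sets (z_1) are precisely where randomization pays off.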

4.2. SPECIAL CASE OF TWO CLASSIFIERS

With only two classifiers, we can leverage the analytic form of η in (4) and enumerate all possible classifier/data-point configurations around S by enumerating all configurations U_k ⊆ {0,1}². Specifically, Fig. 3 visualizes all K = 5 such unique configurations, which allows us to write ∀α ∈ ∆₂:

η(α) = p_1 · max{α_1, α_2} + p_2 · 1 + p_3 · α_1 + p_4 · α_2 + p_5 · 0  (7)

where p ∈ ∆₅ is the p.m.f. of "binning" any data-point z into one of the five configurations under the data distribution z ∼ D. Using (7), we obtain the following result:

Theorem 1. For any two classifiers f_1 and f_2 with individual adversarial risks η_1 and η_2, respectively, subject to a perturbation set S ⊂ ℝ^d and data distribution D, if:

P{z ∈ R_1} > |η_1 − η_2|  (8)

where:

R_1 = {z ∈ ℝ^d × [C] : S_1(z) ≠ ∅, S_2(z) ≠ ∅, S_1(z) ∩ S_2(z) = ∅}  (9)

then the optimal sampling probability α* = [1/2 1/2]^⊤ uniquely minimizes η(α), resulting in η(α*) = (1/2)(η_1 + η_2 − P{z ∈ R_1}). Otherwise, α* ∈ {e_1, e_2} minimizes η(α), where the e_i's are the standard basis vectors of ℝ².

Theorem 1 provides a complete description of how randomized ensembles operate when M = 2. We discuss its implications below:

Interpretation. Theorem 1 states that randomization is guaranteed to help when the condition in (8) is satisfied, i.e., when the probability P{z ∈ R_1} of data-points z for which it is possible to find adversarial perturbations that fool f_1 or f_2 but not both (see configuration 1 in Fig. 3) is greater than the absolute difference |η_1 − η_2| of the individual classifiers' adversarial risks. Consequently, if the adversarial risks of the classifiers are heavily skewed, i.e., |η_1 − η_2| is large, then randomization is less likely to help, since condition (8) becomes harder to satisfy. This, in fact, is the case for the BAT defense (Pinot et al., 2020), since it generates two classifiers with η_1 < 1 and η_2 = 1.
Theorem 1 indicates that adversarial defenses should strive to achieve η_1 ≈ η_2 for randomization to be effective. In practice, it is very difficult to make P{z ∈ R_1} large compared to η_1 and η_2, due to the transferability of adversarial perturbations.

Optimality Condition. The condition in (8) is in fact necessary and sufficient for η(α*) < min{η_1, η_2}. That is, a randomized ensemble of f_1 and f_2 is guaranteed to achieve smaller adversarial risk than either f_1 or f_2 if and only if (8) holds. This also implies that it is impossible to have a non-trivial unique global minimizer other than α* = [1/2 1/2]^⊤, which provides further theoretical justification for why the BAT defense (Pinot et al., 2020) does not work, where α* = [0.9 0.1]^⊤ was claimed to be a unique optimum (obtained via sweeping α).

Simplified Search. Theorem 1 eliminates the need to sweep α in order to find the optimal sampling probability α* when working with M = 2 classifiers, as done in (Pinot et al., 2020; Dbouk & Shanbhag, 2022). We only need to evaluate η([1/2 1/2]^⊤) and check whether it is smaller than min{η_1, η_2} to choose the optimal sampling probability. In Appendix C.1, we extend this result to M = 3. Interestingly, Vorobeychik & Li (2014) derive a similar result for M = 2 for a different problem, in which an adversary attempts to reverse engineer the defender's classifier via queries.

Theoretical Limit. From Theorem 1, we can directly obtain a tight bound on the adversarial risk:

Corollary 1. For any two classifiers f_1 and f_2 with individual adversarial risks η_1 and η_2, respectively, perturbation set S, and data distribution D:

min_{α∈∆₂} η(α) = η(α*) ≥ min{(1/2) max{η_1, η_2}, min{η_1, η_2}}  (10)

In other words, it is impossible for an REC with M = 2 classifiers to achieve a risk smaller than the RHS of (10). In the next section, we derive a more general version of this bound for arbitrary M.
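The simplified search implied by Theorem 1 collapses the M = 2 design space into a single check on the configuration probabilities. A sketch (the probability vector p below is hypothetical, not measured on any dataset):

```python
import numpy as np

def optimal_two_classifier_rec(p):
    """Theorem 1 for M = 2. Input p = (p_1, ..., p_5): probabilities of the five
    classifier/data-point configurations of Fig. 3, with p_1 = P{z in R_1}."""
    p1, p2, p3, p4, p5 = p
    eta1 = p1 + p2 + p3              # individual adversarial risk of f_1
    eta2 = p1 + p2 + p4              # individual adversarial risk of f_2
    if p1 > abs(eta1 - eta2):        # condition (8): randomization helps
        alpha = np.array([0.5, 0.5])
        eta = 0.5 * (eta1 + eta2 - p1)
    else:                            # degenerate case: best single classifier
        alpha = np.array([1.0, 0.0]) if eta1 <= eta2 else np.array([0.0, 1.0])
        eta = min(eta1, eta2)
    return alpha, eta

# Hypothetical setup where disjoint attacks are common (large p_1):
alpha_a, eta_a = optimal_two_classifier_rec([0.4, 0.1, 0.1, 0.05, 0.35])
# Skewed setup where condition (8) fails (p_1 = 0):
alpha_b, eta_b = optimal_two_classifier_rec([0.0, 0.1, 0.1, 0.3, 0.5])
```

In the first instance the REC attains risk 0.375, below both η_1 = 0.6 and η_2 = 0.55; in the second, the ensemble degenerates to the better single classifier.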

4.3. TIGHT FUNDAMENTAL BOUNDS

A fundamental question remains: given an ensemble F of M classifiers with adversarial risks η_1, ..., η_M, what is the tightest bound we can provide on the adversarial risk η(α) of a randomized ensemble constructed from F? The following theorem answers this question:

Theorem 2. For a perturbation set S, data distribution D, and collection of M classifiers F with individual adversarial risks η_i (i ∈ [M]) such that 0 < η_1 ≤ ... ≤ η_M ≤ 1, we have ∀α ∈ ∆_M:

min_{k∈[M]} η_k/k ≤ η(α) ≤ η_M  (11)

Both bounds are tight in the sense that if all that is known about the setup F, D, and S is {η_i}_{i=1}^M, then there exist no tighter bounds. Furthermore, the upper bound is always met if α = e_M, and the lower bound (if achievable) can be met with α = [1/m ... 1/m 0 ... 0]^⊤, where m = arg min_{k∈[M]} {η_k/k}.

Upper bound: The upper bound in (11) holds due to the convexity of η (Proposition 1) and the fact that ∆_M = H({e_i}_{i=1}^M), where H(X) is the convex hull of the set of points X.

Implications of upper bound: Intuitively, we expect that a randomized ensemble cannot be worse than its worst performing member (in this case f_M). A direct implication is that if all the members have similar robustness, η_i ≈ η_j ∀i, j, then randomized ensembling is guaranteed to either improve or match that robustness. In contrast, deterministic ensemble methods that average logits (Zhang et al., 2022; Abernethy et al., 2021; Kariyappa & Qureshi, 2019) do not satisfy even this upper bound (see Appendix C.2). In other words, there are no worst-case performance guarantees with deterministic ensembling, even if all the classifiers are robust.

Lower bound: The main idea behind the proof of the lower bound in (11) is to show that ∀α ∈ ∆_M:

η(α) ≥ Σ_{i=1}^M (η_i − η_{i−1}) · max_{j∈{i,...,M}} {α_j} = h(α) ≥ min_{α∈∆_M} h(α) = h(α*) = η_m/m  (12)

where η_0 := 0, m = arg min_{k∈[M]} {η_k/k}, and h can be interpreted as the adversarial risk of an REC constructed from an optimal set of classifiers F′ with the same individual risks as F. We make the following observations:

Implications of lower bound: The lower bound in (11) provides a fundamental limit on the adversarial risk of RECs, viz., it is impossible for any REC constructed from M classifiers with sorted risks {η_i}_{i=1}^M to achieve an adversarial risk smaller than min_{k∈[M]} {η_k/k} = η_m/m. This limit is not always achievable, and it generalizes the bound in (10), which holds for M = 2. Theorem 2 states that if the limit is achievable, then the corresponding optimal sampling probability is α* = [1/m ... 1/m 0 ... 0]^⊤. Note that this does not imply that the optimal sampling probability is always equiprobable sampling for every F! Additionally, the lower bound in (11) provides guidelines for robustifying individual classifiers so that randomized ensembling enhances the overall adversarial risk. Given classifiers f_1, ..., f_m obtained via any sequential ensemble training algorithm, a good rule of thumb for the classifier obtained in training iteration m + 1 is to have:

η_m ≤ η_{m+1} ≤ (1 + 1/m) η_m  (13)

Note that only for m = 1 does (13) become a necessary condition: if η_2 > 2η_1, then f_1 alone will always achieve better risk than an REC of f_1 and f_2. If a training method generates classifiers f_1, ..., f_M with risks η_1 < 1 and η_i = 1 ∀i ∈ {2, ..., M}, i.e., only the first classifier is somewhat robust and the remaining M − 1 classifiers are compromised (as is the case for BAT), the lower bound in (11) reduces to:

η(α) ≥ min{η_1, 1/M}  (14)

implying the necessary condition M ≥ ⌈η_1⁻¹⌉ for RECs constructed from F to achieve better risk than f_1. Note: the fact that this condition is violated by (Pinot et al., 2020) hints at the existence of strong attacks that can break it (Zhang et al., 2022; Dbouk & Shanbhag, 2022).
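The bounds of Theorem 2 depend only on the sorted individual risks, so they are directly computable. A sketch (the risk values are illustrative, not from the paper's experiments):

```python
import numpy as np

def rec_risk_bounds(etas):
    """Tight bounds of Theorem 2 for individual risks eta_1 <= ... <= eta_M:
    min_k eta_k / k  <=  eta(alpha)  <=  eta_M  for every alpha in Delta_M."""
    etas = np.sort(np.asarray(etas, dtype=float))
    ratios = etas / np.arange(1, len(etas) + 1)
    m = int(np.argmin(ratios)) + 1          # m = argmin_k eta_k / k
    lower, upper = float(ratios[m - 1]), float(etas[-1])
    # Candidate optimal sampling: equiprobable over the m most robust members
    # (achieves the lower bound only when that bound is attainable).
    alpha_star = np.zeros(len(etas))
    alpha_star[:m] = 1.0 / m
    return lower, upper, alpha_star

lower, upper, alpha_star = rec_risk_bounds([0.5, 0.55, 0.9, 1.0])
```

On this instance the minimizing ratio is η_4/4 = 0.25, so no REC built from these four members can have risk below 0.25, and the candidate sampling vector is equiprobable over all four. For M = 2 the lower bound reduces to min{η_1, η_2/2}, matching Corollary 1.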

4.4. OPTIMAL SAMPLING

In this section, we leverage Proposition 1 to extend the results of Section 4.2 and provide a theoretically optimal and efficient optimal sampling probability (OSP) algorithm (Algorithm 1) for M > 3. In practice, we do not know the true data distribution D. Instead, we are provided a training set z_1, ..., z_n, assumed to be sampled i.i.d. from D. Given the training set and a fixed collection of classifiers F, we wish to find the optimal sampling probability:

α* = arg min_{α∈∆_M} η(α) = arg min_{α∈∆_M} (1/n) Σ_{j=1}^n max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x_j + δ) ≠ y_j}  (15)

Note that the empirical adversarial risk η is also piece-wise linear and convex in α, and hence all our theoretical results apply naturally. In order to numerically solve (15), we first require access to an adversarial attack oracle (attack) for RECs that solves, ∀S, F, z, and α:

attack(F, S, α, z) = arg max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y}  (16)

Using the oracle attack, Algorithm 1 iteratively updates its solution given the adversarial error-rate of each classifier over the training set. The projection operator Π_{∆_M} in Line (15) of Algorithm 1 ensures that the solution is a valid p.m.f.; Wang & Carreira-Perpinán (2013) provide a simple and exact method for computing Π_{∆_M}. Finally, we state the following result on the optimality of OSP:

Theorem 3. The OSP algorithm output α_T satisfies:

0 ≤ η(α_T) − η(α*) ≤ (∥α^(1) − α*∥₂² + M a² Σ_{t=1}^T t⁻²) / (2a Σ_{t=1}^T t⁻¹) → 0 as T → ∞  (17)

for all initial conditions α^(1) ∈ ∆_M and a > 0, where α* is a global minimum.

Theorem 3 follows from a direct application of the classic convergence result of the projected sub-gradient method for constrained convex minimization (Shor, 2012). The optimality of OSP relies on the existence of an attack oracle for (16), which may not always exist.
However, attack algorithms such as ARC (Dbouk & Shanbhag, 2022) were found to yield good results in the common setting of differentiable classifiers and ℓ_p-restricted adversaries.

Algorithm 1 (OSP), core loop:
6:  g ← 0, a_t ← a/t
7:  for j ∈ {1, ..., n} do
8:    δ_j ← attack(F, S, α^(t), z_j)
9:    ∀i ∈ [M]: g_i ← g_i + 1{f_i(x_j + δ_j) ≠ y_j}
10: end for
11: g ← g/n                      ▷ sub-gradient of η(α^(t))
12: η^(t) ← g^⊤α^(t)             ▷ η(α^(t))
13: if η^(t) ≤ η_best then t_best ← t, η_best ← η^(t)
14: /* projection-update step */
15: α^(t+1) ← Π_{∆_M}(α^(t) − a_t g)

Algorithm 2 (BARRE), m-th boosting iteration:
5:  θ_m ← θ_{m−1}, F ← F ∪ {f(·|θ_m)}, α ← [1/m ... 1/m]^⊤
6:  for e ∈ {1, ..., E} do
7:    for mini-batch {z_b}_{b=1}^B do
8:      compute ∀b ∈ [B]: δ_b ← attack(F, S, α, z_b)
9:      update θ_m via SGD: θ_m ← θ_m − (ρ/B) Σ_{b=1}^B ∇_{θ_m} ℓ(f(x_b + δ_b | θ_m), y_b)
      ...
13:   end for
14: end for
15: return F, α
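Algorithm 1 can be sketched end-to-end once an attack oracle is available. In the toy sketch below the oracle is replaced by exact enumeration over precomputed fooling patterns (a stand-in for a real attack such as ARC), and the simplex projection Π_{∆_M} follows the exact method of Wang & Carreira-Perpinán (2013); all numbers are illustrative:

```python
import numpy as np

def project_simplex(v):
    """Exact Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u + (1.0 - css) / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = (1.0 - css[rho]) / (rho + 1.0)
    return np.maximum(v + theta, 0.0)

def osp(U_sets, M, T=200, a=1.0):
    """Projected sub-gradient sketch of OSP (Algorithm 1), with the attack
    oracle replaced by enumeration over each sample's fooling patterns."""
    alpha = np.ones(M) / M
    best_alpha, best_eta = alpha, np.inf
    for t in range(1, T + 1):
        # sub-gradient: per sample, the pattern the attacker would pick
        g = np.mean([U[np.argmax(U @ alpha)] for U in U_sets], axis=0)
        eta = float(g @ alpha)              # empirical risk at current alpha
        if eta <= best_eta:
            best_alpha, best_eta = alpha.copy(), eta
        alpha = project_simplex(alpha - (a / t) * g)
    return best_alpha, best_eta

# Toy data: attacks on f_1 and f_2 are disjoint on 40% of points,
# transfer on 10%, and fail on 50% (hypothetical numbers).
U_sets = ([np.array([[1, 0], [0, 1]])] * 4
          + [np.array([[1, 1]])] * 1
          + [np.array([[0, 0]])] * 5)
alpha_star, eta_star = osp(U_sets, M=2)
```

Here η(α) = 0.4·max(α_1, α_2) + 0.1, whose minimum 0.3 is attained at equiprobable sampling, and the sketch recovers exactly that.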

5. A ROBUST BOOSTING ALGORITHM FOR RANDOMIZED ENSEMBLES

Inspired by BAT (Pinot et al., 2020) and MRBoost (Zhang et al., 2022), we leverage our results in Section 4 to propose BARRE: a unified Boosting Algorithm for Robust Randomized Ensembles, described in Algorithm 2. Given a dataset {z_j}_{j=1}^n and an REC attack algorithm attack, BARRE iteratively trains a set of parametric classifiers f(·|θ_1), ..., f(·|θ_M) such that the adversarial risk of the corresponding REC is minimized. The first iteration of BARRE reduces to standard AT (Madry et al., 2018). Doing so typically guarantees that the first classifier achieves the lowest adversarial risk and that η(α*) ≤ η_1, i.e., Theorem 3 ensures the REC is no worse than single-model AT. In each iteration m ≥ 2, BARRE initializes the m-th classifier f(·|θ_m) with θ_m = θ_{m−1}. The training procedure alternates between updating the parameters θ_m via SGD using adversarial samples of the current REC and solving for the optimal sampling probability α* ∈ ∆_m via OSP. Including f(·|θ_m) in the attack (Line (8)) is crucial, as it ensures that the robustness of f(·|θ_m) is not completely compromised, thereby improving the bounds in Theorem 2. Note that for iterations m ≤ 3, we replace the OSP procedure in Line (12) with a simplified search over a finite set of candidate solutions (see Section 4.2 and Appendix C.1).

5.1. EXPERIMENTAL RESULTS

In this section, we validate the effectiveness of BARRE in constructing robust RECs.

Setup. Per standard practice, we focus on defending against ℓ∞ norm-bounded adversaries. We report results for three network architectures with different complexities: ResNet-20 (He et al., 2016), MobileNetV1 (Howard et al., 2017), and ResNet-18, across the CIFAR-10 and CIFAR-100 datasets (Krizhevsky et al., 2009). Computational complexity is measured via the number of floating-point operations (FLOPs) required per inference. To ensure a fair comparison across baselines, we use the same hyper-parameter settings, detailed in Appendix D.1.

Attack Algorithm. For all robust evaluations, we adopt the state-of-the-art ARC algorithm (Dbouk & Shanbhag, 2022), which can be used for both RECs and single models. Specifically, we use a slightly modified version that achieves better results in the equiprobable sampling setting (see Appendix D.3). For training with BARRE, we adopt adaptive PGD (Zhang et al., 2022) for better generalization performance (see Appendix D.4).

BARRE vs. Other Methods. Due to the lack of dedicated randomized ensemble defenses, we establish baselines by constructing RECs from both MRBoost and independently adversarially trained (IAT) models. We use OSP (Algorithm 1) to find the optimal sampling probability for each REC. All RECs share the same first classifier f_1, which is adversarially trained. Doing so ensures a fair comparison and guarantees that none of the methods can be worse than AT. Table 2 provides strong evidence that BARRE outperforms both IAT and MRBoost on both CIFAR-10 and CIFAR-100. Interestingly, we find that MRBoost ensembles can be quite ill-suited for RECs. This can be seen for MobileNetV1, where the optimal sampling probability obtained was α* = [0.25 0.25 0.25 0.25 0]^⊤, i.e., the REC completely disregards the last classifier. In contrast, BARRE-trained RECs utilize all members of the ensemble.

6. CONCLUSION

We have demonstrated both theoretically and empirically that robust randomized ensemble classifiers (RECs) are realizable. Theoretically, we derive the robustness limits of RECs, necessary and sufficient conditions for them to be useful, and efficient methods for finding the optimal sampling probability. Empirically, we propose BARRE, a new boosting algorithm for constructing robust RECs and demonstrate its effectiveness at defending against strong ℓ ∞ norm-bounded adversaries.

A THEORETICAL RATIONALE FOR BARRE

In this section, we expand on Section 5 and provide a more detailed rationale behind the steps of BARRE. Given a dataset {z_j}_{j=1}^n and an REC attack algorithm attack, BARRE iteratively trains a set of parametric classifiers f(·|θ_1), ..., f(·|θ_M) in a boosting fashion. Note that the optimality of OSP (Theorem 3) implies that a BARRE-trained REC is guaranteed to achieve performance at least as good as that of the best performing member of the ensemble. This explains the choice of boosting for BARRE: since the first iteration reduces to standard AT, the most effective method to date for generating robust classifiers, BARRE-trained RECs are guaranteed to perform no worse than AT. In each iteration m ≥ 2, BARRE initializes the m-th classifier f(·|θ_m) with θ_m = θ_{m−1}. The training procedure alternates between updating the parameters θ_m via SGD using adversarial samples of the current REC and solving for the optimal sampling probability α* ∈ ∆_m via OSP. Including f(·|θ_m) in the attack (Line (8)) is crucial, as it ensures that the robustness of f(·|θ_m) is not completely compromised, thereby improving the bounds in Theorem 2. Furthermore, the rationale behind the sequence of steps in BARRE can be better understood using Theorem 1 (for the case of M = 2). Theorem 1 states that the optimal REC adversarial risk is η(α*) = (1/2)(η_1 + η_2 − P{z ∈ R_1}) (assuming (8) is met); therefore, it is equally important to minimize both η's and to maximize P{z ∈ R_1}. BARRE does so by initially adversarially training a robust classifier f_1 (minimizing η_1), then training f_2 (initialized from f_1 to minimize η_2) on the adversarial examples of the REC of f_1 and f_2. Doing so increases P{z ∈ R_1} while keeping η_2 as low as possible.

B OMITTED PROOFS AND DERIVATIONS B.1 PROOF OF PROPOSITION 1

We provide the proof of Proposition 1 (restated below):

Proposition (Restated). For any F = {f_i}_{i=1}^M, perturbation set S ⊂ ℝ^d, and data distribution D, the adversarial risk η is a piece-wise linear convex function of α ∈ ∆_M. Specifically, there exist K ∈ ℕ configurations U_k ⊆ {0,1}^M (k ∈ [K]) and a p.m.f. p ∈ ∆_K such that:

η(α) = Σ_{k=1}^K p_k · max_{u∈U_k} u^⊤α  (18)

Proof. Consider one data-point z ∈ ℝ^d × [C]. For any δ ∈ S and α ∈ ∆_M we have:

r(z, δ, α) = Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y} = u^⊤α  (19)

where u ∈ {0,1}^M is such that u_i = 1 if and only if δ is adversarial to f_i at z. Since u is independent of α, we obtain a many-to-one mapping from δ ∈ S to u ∈ {0,1}^M. Therefore, for any α and z, we can always decompose the perturbation set S = G_1 ∪ ... ∪ G_n into n ≤ 2^M subsets such that ∀δ ∈ G_j: r(z, δ, α) = α^⊤u_j for some binary vector u_j independent of α. Let U = {u_j}_{j=1}^n be the collection of these vectors; then we can write:

max_{δ∈S} r(z, δ, α) = max_{δ∈G_1∪...∪G_n} r(z, δ, α) = max_{j∈[n]} max_{δ∈G_j} r(z, δ, α) = max_{u∈U} u^⊤α  (20)

The vectors {u_j}_{j=1}^n define a unique classifier and data-point configuration that is independent of the sampling probability. The function max_δ r is thus convex and piece-wise linear in α.

Partitioning the data-point space R ⊆ ℝ^d × [C] into K subsets R = R_1 ∪ ... ∪ R_K such that all the data-points z ∈ R_k share the same configuration U_k, we obtain:

η(α) = E_{z∼D} max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y}
     = ∫_{z∈R} p_z(z) · max_{δ∈S} r(z, δ, α) dz
     = Σ_{k=1}^K ∫_{z∈R_k} p_z(z) · max_{δ∈S} r(z, δ, α) dz
     = Σ_{k=1}^K ∫_{z∈R_k} p_z(z) · max_{u∈U_k} u^⊤α dz
     = Σ_{k=1}^K max_{u∈U_k} u^⊤α · ∫_{z∈R_k} p_z(z) dz
     = Σ_{k=1}^K p_k · max_{u∈U_k} u^⊤α  (21)

where the total size K of the partition is finite (at most exponential in M) and p ∈ ∆_K is such that p_k = P{z ∈ R_k}. Finally, η is convex and piece-wise linear in α, since a sum of convex, piece-wise linear functions is also convex and piece-wise linear.

B.2 PROOF OF THEOREM 1

First, we state and prove a useful lemma:

Lemma 1. Let h : R → R be a convex piece-wise linear (hence sub-differentiable) function of the form:

h(x) = max{a_1 x + b_1, a_2 x + b_2} + a_3 x + b_3   (22)

with a_1 < a_2. We wish to minimize h over x ∈ [c, d], where c ≤ y ≤ d and y = (b_2 − b_1)/(a_1 − a_2) is the intersection point of the two linear pieces. Then the optimal value x* minimizing h in (22) is given by:

x* = y, if a_3 ∈ (−a_2, −a_1);  x* = c, if a_3 ≥ −a_1;  x* = d, if a_3 ≤ −a_2.   (23)

Note: only in the first case is the solution unique.

Proof. From constrained convex optimization (Boyd et al., 2004; Shor, 2012), we know that x* minimizes h over [c, d] if there exists a sub-gradient g ∈ ∂h(x*) such that g · (x − x*) ≥ 0 ∀x ∈ [c, d]. For x ≠ y, h is differentiable with ∇h = a_3 + a_1 (if x < y) or ∇h = a_3 + a_2 (if x > y); at x = y the sub-differential is ∂h(y) = {a_3 + βa_1 + (1 − β)a_2 : β ∈ [0, 1]}.

If a_3 ∈ (−a_2, −a_1), then ∃β ∈ [0, 1] such that a_3 + βa_1 + (1 − β)a_2 = 0, and thus 0 ∈ ∂h(y), which is a sufficient condition for global minimality; hence x* = y. Furthermore, x* = y is unique, since ∀x ≠ y we have ∇h = a_1 + a_3 < 0 (if x < y) or ∇h = a_2 + a_3 > 0 (if x > y), which in both cases implies that any z ≠ y violates the optimality condition for some x ∈ [c, d].

If a_3 ∉ (−a_2, −a_1), then either a_3 ≥ −a_1 or a_3 ≤ −a_2. If a_3 ≥ −a_1, then a_1 + a_3 = ∇h(c) ≥ 0, which implies (a_1 + a_3)(x − c) ≥ 0 ∀x ∈ [c, d]; hence x* = c. Otherwise, a_3 ≤ −a_2 implies a_2 + a_3 = ∇h(d) ≤ 0, so (a_2 + a_3)(x − d) ≥ 0 ∀x ∈ [c, d]; hence x* = d.

We now provide the proof of Theorem 1 (restated below): Theorem (Restated).
For any two classifiers f_1 and f_2 with individual adversarial risks η_1 and η_2, respectively, subject to a perturbation set S ⊂ R^d and data distribution D, if:

P{z ∈ R_1} > |η_1 − η_2|   (24)

where:

R_1 = {z ∈ R^d × [C] : S_1(z) ≠ ∅, S_2(z) ≠ ∅, S_1(z) ∩ S_2(z) = ∅}   (25)

then the optimal sampling probability α* = (1/2, 1/2)^⊤ uniquely minimizes η(α), resulting in η(α*) = (η_1 + η_2 − P{z ∈ R_1})/2. Otherwise, α* ∈ {e_1, e_2} minimizes η(α), where the e_i's are the standard basis vectors of R^2.

Proof. For M = 2, the adversarial risk η can be re-written ∀α ∈ ∆_2 as:

η(α) = p_1 · max{α_1, α_2} + p_2 · 1 + p_3 · α_1 + p_4 · α_2 + p_5 · 0   (26)

where p_k = P{z ∈ R_k}, and the regions {R_k}_{k=1}^5 partition the input space R^d × [C] as follows:

R_1 = {z ∈ R^d × [C] : S_1(z) ≠ ∅, S_2(z) ≠ ∅, S_1(z) ∩ S_2(z) = ∅}
R_2 = {z ∈ R^d × [C] : S_1(z) ∩ S_2(z) ≠ ∅}
R_3 = {z ∈ R^d × [C] : S_1(z) ≠ ∅, S_2(z) = ∅}
R_4 = {z ∈ R^d × [C] : S_1(z) = ∅, S_2(z) ≠ ∅}
R_5 = {z ∈ R^d × [C] : S_1(z) = S_2(z) = ∅}   (27)

Writing α_1 = α and α_2 = 1 − α, we have ∀α ∈ [0, 1]:

η((α, 1 − α)^⊤) = h(α) = p_1 · max{α, 1 − α} + (p_3 − p_4) · α + p_2 + p_4   (28)

and we wish to find α* ∈ [0, 1] minimizing h(α). Applying Lemma 1 with:

a_1 = −p_1, b_1 = p_1, a_2 = p_1, b_2 = 0, a_3 = p_3 − p_4, b_3 = p_2 + p_4   (29)

and utilizing η_1 = η(e_1) = p_1 + p_2 + p_3 and η_2 = η(e_2) = p_1 + p_2 + p_4, yields the main result.
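Theorem 1's case analysis can be checked numerically against a brute-force search over the sampling probability. The sketch below uses hypothetical region probabilities p_1, ..., p_5 (purely illustrative values, not from any experiment):

```python
import numpy as np

# Hypothetical region probabilities p_1..p_5 for the partition in Eq. (27);
# these values are illustrative and must sum to 1.
p1, p2, p3, p4, p5 = 0.30, 0.10, 0.10, 0.05, 0.45

def eta(a):
    # Eq. (26) with alpha = (a, 1 - a)
    return p1 * max(a, 1 - a) + p2 + p3 * a + p4 * (1 - a)

# Closed form from Theorem 1, using eta_1 = p1+p2+p3 and eta_2 = p1+p2+p4.
eta1, eta2 = p1 + p2 + p3, p1 + p2 + p4
if p1 > abs(eta1 - eta2):        # condition (24): P{z in R_1} > |eta_1 - eta_2|
    a_star, val = 0.5, 0.5 * (eta1 + eta2 - p1)
else:                            # fall back to the better single classifier
    a_star = 1.0 if eta1 <= eta2 else 0.0
    val = min(eta1, eta2)

# Brute-force check over a fine grid of sampling probabilities.
brute = min(eta(a) for a in np.linspace(0.0, 1.0, 10001))
assert abs(eta(a_star) - val) < 1e-9
assert abs(brute - val) < 1e-4
```

With the values above, condition (24) holds and the equiprobable REC is strictly better than either classifier alone.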

B.3 PROOF OF COROLLARY 1

We provide the proof of Corollary 1 (restated below):

Corollary. For any two classifiers f_1 and f_2 with individual adversarial risks η_1 and η_2, respectively, perturbation set S, and data distribution D:

min_{α∈∆_2} η(α) = η(α*) ≥ min{(1/2) max{η_1, η_2}, min{η_1, η_2}}   (30)

Proof. From Theorem 1, we have:

η(α*) = min{(1/2)(η_1 + η_2 − P{z ∈ R_1}), min{η_1, η_2}}   (31)

Using the tight upper bound P{z ∈ R_1} ≤ min{η_1, η_2} and noting that η_1 + η_2 − min{η_1, η_2} = max{η_1, η_2}, we obtain the main result.

B.4 PROOF OF THEOREM 2

B.4.1 USEFUL LEMMAS

We first state and prove a few useful lemmas that are vital for proving Theorem 2. While some lemmas are standard and have been proven elsewhere, we nonetheless state their proofs for completeness.

Lemma 2. Let h : R^n → R be a convex function, and let H(X) ⊂ R^n be the convex hull of X = {x_1, ..., x_d} ⊂ R^n. Then there exists x_m ∈ X such that:

max_{u∈H(X)} h(u) = h(x_m)   (32)

Proof. Let u be an arbitrary vector in H(X), that is, ∃α ∈ ∆_d : u = Σ_{i=1}^d α_i x_i (33). Let m ∈ [d] be such that h(x_m) ≥ h(x_i) ∀i ∈ [d]. From the convexity of h, we upper bound h(u) as follows:

h(u) = h(Σ_{i=1}^d α_i x_i) ≤ Σ_{i=1}^d α_i h(x_i) ≤ Σ_{i=1}^d α_i h(x_m) = h(x_m) Σ_{i=1}^d α_i = h(x_m)   (34)

Thus, (32) holds for any u ∈ H(X).

Lemma 3 (Redistribution Lemma). For all p, q such that 0 ≤ p ≤ q ≤ 1, all α ∈ ∆_M, and all I, J ⊆ [M] such that I ⊄ J and J ⊄ I, we have:

p · max_{i∈I} {α_i} + q · max_{j∈J} {α_j} ≥ p · max_{i∈I∪J} {α_i} + (q − p) · max_{j∈J} {α_j} + p · max_{k∈I∩J} {α_k}   (35)

(with the convention that a max over the empty set is 0).

Proof. Let i* ∈ I and j* ∈ J be maximizers, i.e., α_{i*} = max_{i∈I} α_i and α_{j*} = max_{j∈J} α_j. Then:

p · max_{i∈I} {α_i} + q · max_{j∈J} {α_j} = p · α_{i*} + q · α_{j*}
= p · (α_{i*} + α_{j*}) + (q − p) · α_{j*}
(a) = p · (max_{i∈I∪J} {α_i} + min{α_{i*}, α_{j*}}) + (q − p) · α_{j*}
(b) ≥ p · max_{i∈I∪J} {α_i} + (q − p) · α_{j*} + p · max_{k∈I∩J} {α_k}
= p · max_{i∈I∪J} {α_i} + (q − p) · max_{j∈J} {α_j} + p · max_{k∈I∩J} {α_k}   (36)

where (a) holds because the maximum over I ∪ J equals the larger of α_{i*} and α_{j*}, and (b) holds since the smaller of the two maximizers cannot be smaller than the maximum over the smaller set I ∩ J.

Lemma 4. Let {f_i}_{i=1}^M be an arbitrary collection of C-ary classifiers with individual adversarial risks η_i such that 0 < η_1 ≤ ... ≤ η_M ≤ 1. For any data distribution D and perturbation set S, we have ∀α ∈ ∆_M:

η(α) ≥ Σ_{i=1}^M (η_i − η_{i−1}) · max_{j∈{i,...,M}} {α_j}   (37)

where η_0 := 0.

Proof.
From Proposition 1, we know that ∃K ∈ N, p ∈ ∆_K, and U_k ⊆ {0, 1}^M ∀k ∈ [K] such that:

η(α) = Σ_{k=1}^K p_k · max_{u∈U_k} u^⊤α   (38)

Let L_k ⊆ [M] represent the set of classifier indices that are active in configuration U_k, that is, m ∈ L_k ⟺ ∃v ∈ U_k such that v_m = 1. We then lower bound η as follows:

η(α) = Σ_{k=1}^K p_k · max_{u∈U_k} u^⊤α ≥ Σ_{k=1}^K p_k · max_{i∈L_k} {α_i} = η′(α)   (39)

The bound holds since, for each i ∈ L_k, some u ∈ U_k has u_i = 1, and a sum of non-negative terms is at least as large as any one summand. It is noteworthy that the RHS quantity η′(α) can be interpreted as the adversarial risk of an auxiliary set of classifiers F′ with the same individual risks {η_i}, such that for any z ∈ R^d × [C] the classifiers share no common adversarial perturbation, i.e., ∩_{i=1}^M S′_i(z) = ∅, and:

η′_i = η′(e_i) = Σ_{k : i∈L_k} p_k = η(e_i) = η_i   (40)

Assume the conditions of Lemma 3 are met by two terms in η′, i.e., ∃k_1, k_2 ∈ [K] such that L_{k_1} ⊄ L_{k_2}, L_{k_2} ⊄ L_{k_1}, and p_{k_1} ≤ p_{k_2}. Applying the bound in Lemma 3 yields:

η′(α) − Σ_{k∈[K]\{k_1,k_2}} p_k · max_{i∈L_k} {α_i} = p_{k_1} · max_{i∈L_{k_1}} {α_i} + p_{k_2} · max_{i∈L_{k_2}} {α_i}
≥ p_{k_1} · max_{i∈L_{k_1}∪L_{k_2}} {α_i} + (p_{k_2} − p_{k_1}) · max_{j∈L_{k_2}} {α_j} + p_{k_1} · max_{k∈L_{k_1}∩L_{k_2}} {α_k}
= η′′(α) − Σ_{k∈[K]\{k_1,k_2}} p_k · max_{i∈L_k} {α_i}   (41)

where η′′(α) is the modified ensemble adversarial risk. The application of Lemma 3 can be understood as "re-distributing" the classifiers' adversarial vulnerabilities while preserving the individual risk identities ∀i ∈ [M]:

η_i = η′(e_i) = η′′(e_i)   (42)

The main idea of this proof is to keep applying Lemma 3 to the modified ensemble adversarial risks (whenever possible) to obtain ever better lower bounds. The process stops when the conditions are no longer met, and we obtain an adversarial risk h(α):

η′(α) ≥ η′′(α) ≥ ... ≥ h(α) = Σ_{k=1}^L q_k · max_{j∈J_k} {α_j}   (43)

Without loss of generality, we assume the {J_k} are distinct and q_k ≠ 0.
Furthermore, since the conditions of Lemma 3 cannot be met by any two sets in {J_k}, the sets must be nested (up to a re-ordering of the indices):

J_L ⊂ J_{L−1} ⊂ ... ⊂ J_1 ⊆ [M]   (46)

We now make the following observations:

1. Due to (46), we have L ≤ M, and for all i ∈ [M], ∃m_i ∈ [L] such that η_i = Σ_{k : i∈J_k} q_k = Σ_{k=1}^{m_i} q_k (47).
2. Since the {η_i} are sorted, m_{i+1} = m_i + 1 if η_i < η_{i+1}, and m_{i+1} = m_i otherwise.
3. J_1 = [M], since η_1 ≠ 0.
4. For any two consecutive sets J_k and J_{k+1}, we can always find n ≥ 1 indices from [M] such that J_k = J_{k+1} ∪ {i_1, ..., i_n}. The indices i_1, ..., i_n are consecutive, share the same m_i (i.e., η_{i_l} is the same for all l ∈ [n]), and directly precede J_{k+1}, i.e., max_{l∈[n]} {i_l} = min_{j∈J_{k+1}} {j} − 1.

We first prove the lemma for the special case of distinct risks, i.e., η_i < η_{i+1} ∀i.

Special Case. If the risks are distinct, then we must have L = M, with every two consecutive sets J_k and J_{k+1} differing by exactly one index: J_k = J_{k+1} ∪ {k}, with J_{M+1} = ∅. Furthermore, η_i − η_{i−1} = q_i ∀i ∈ [M] with η_0 = 0. Thus we can write:

h(α) = Σ_{k=1}^M q_k · max_{j∈J_k} {α_j} = Σ_{i=1}^M (η_i − η_{i−1}) · max_{j∈{i,...,M}} {α_j}   (48)

General Case. In general, we have L ≤ M distinct risks η_{i_1} < ... < η_{i_L} and M − L repeated risks, where i_1 = 1. Thus q_k = η_{i_k} − η_{i_{k−1}} ∀k ∈ [L], with η_{i_0} = η_0 = 0 by definition. Using observations 3 and 4, we have J_k = {u_k, ..., M} for some index u_k ∈ [M], with u_1 = 1, and u_{k+1} − u_k − 1 ≥ 0 is the number of consecutive repeated risks equal to η_{i_k}. Let {J′_k} be the M − L index sets of the form {i, ..., M} that do not appear among the {J_k}. Then:

h(α) = Σ_{k=1}^L q_k · max_{j∈J_k} {α_j} = Σ_{k=1}^L (η_{i_k} − η_{i_{k−1}}) · max_{j∈{u_k,...,M}} {α_j} + Σ_{k=1}^{M−L} 0 · max_{j∈J′_k} {α_j} (a) = Σ_{i=1}^M (η_i − η_{i−1}) · max_{j∈{i,...,M}} {α_j}   (49)

where (a) holds due to the fact that η_i − η_{i−1} = 0 for all the M − L merged terms.
Lemma 5. Given a sequence {γ_i}_{i=0}^M such that 0 = γ_0 < γ_1 ≤ ... ≤ γ_M ≤ 1, the vector α* = (1/m, ..., 1/m, 0, ..., 0)^⊤ ∈ ∆_M (with m non-zero entries) is a solution to the following minimization problem:

min_{α∈∆_M} h(α) = min_{α∈∆_M} Σ_{i=1}^M (γ_i − γ_{i−1}) · max_{j∈{i,...,M}} {α_j} = γ_m/m   (50)

where m is such that γ_m/m ≤ γ_i/i ∀i ∈ [M].

where (a) holds because the minimum of a linear function over the convex hull of a set of points X is attained at one of the points in X. Thus, to solve the original optimization problem, we only need to evaluate M! linear functions at M vectors each, and pick the vector achieving the smallest value. Finally, we now show that the search space can be significantly reduced from M! × M to M candidate solutions. Let ∆_M^n be an arbitrary subset of ∆_M whose associated sorted indices are i_1^n, i_2^n, ..., i_M^n, and let P_n = {p_k^n}_k be the associated extreme points. We first note that, ∀k ∈ [M], g(p_k^n) = (0, ..., 0, 1/k, 0, ..., 0)^⊤, where j_k^n = max{i_1^n, ..., i_k^n} is the position of the non-zero entry. Therefore, we have ∀n, k:

h(p_k^n) = γ^⊤ g(p_k^n) = γ_{j_k^n}/k   (59)

Equation (59) reveals that, amongst all vectors p_k^n with fixed k, the smallest risk is always achieved by the subset n whose associated index j_k^n is smallest, since the robust errors are assumed sorted. Furthermore, the smallest value j_k^n can achieve is k, since it is the largest among k distinct indices from [M]. Therefore, let ∆_M^m be the subset whose sorting indices are i_k = k, i.e., α ∈ ∆_M^m implies α_1 ≥ ... ≥ α_M. For this subset, we always have j_k^m = max{1, ..., k} = k, which implies ∀n ∈ [M!] and ∀k ∈ [M]:

h(p_k^n) = γ_{j_k^n}/k ≥ γ_k/k = h(p_k^m)   (60)

where p_k^m = (1/k, ..., 1/k, 0, ..., 0)^⊤. Combining (58) and (60), we obtain:

min_{α∈∆_M} h(α) = min_{k∈[M]} γ^⊤ g(p_k^m) = min_{k∈[M]} γ_k/k = γ_{k*}/k*   (61)

which is achieved by α* = (1/k*, ..., 1/k*, 0, ..., 0)^⊤.
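Lemma 5 lends itself to a direct numerical check: the equalized vector α* = (1/m, ..., 1/m, 0, ..., 0)^⊤ should achieve the value γ_m/m, and random probability vectors should never do better. A small sketch with hypothetical risk values:

```python
import numpy as np

# Hypothetical sorted risks 0 < g_1 <= ... <= g_M <= 1 (gamma_0 = 0).
g = [0.2, 0.35, 0.5, 0.9]
M = len(g)

def h(alpha):
    # h(alpha) = sum_i (g_i - g_{i-1}) * max_{j >= i} alpha_j  (Eq. 50)
    prev, total = 0.0, 0.0
    for i in range(M):
        total += (g[i] - prev) * max(alpha[i:])
        prev = g[i]
    return total

# Lemma 5: the minimizer equalizes the first m entries, m = argmin_k g_k / k.
m = min(range(1, M + 1), key=lambda k: g[k - 1] / k)
alpha_star = [1.0 / m] * m + [0.0] * (M - m)
assert abs(h(alpha_star) - g[m - 1] / m) < 1e-12

# Random probability vectors never beat alpha_star.
rng = np.random.default_rng(0)
for _ in range(5000):
    assert h(rng.dirichlet(np.ones(M))) >= h(alpha_star) - 1e-9
```

Here m = 3, so uniform sampling over the first three classifiers is optimal even though a fourth classifier is available.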

B.4.2 MAIN PROOF

We now restate and prove Theorem 2:

Theorem (Restated). For a perturbation set S, data distribution D, and collection of M classifiers F with individual adversarial risks η_i (i ∈ [M]) such that 0 < η_1 ≤ ... ≤ η_M ≤ 1, we have ∀α ∈ ∆_M:

min_{k∈[M]} η_k/k ≤ η(α) ≤ η_M   (62)

Both bounds are tight in the sense that, if all that is known about the setup F, D, and S is {η_i}_{i=1}^M, then there exist no tighter bounds. Furthermore, the upper bound is always met if α = e_M, and the lower bound (if achievable) can be met if α = (1/m, ..., 1/m, 0, ..., 0)^⊤, where m = arg min_{k∈[M]} {η_k/k}.

Proof. We first prove the upper bound and then the lower bound.

Upper bound: From Proposition 1, we have that η is convex in α ∈ ∆_M. Using ∆_M = H({e_1, ..., e_M}) and applying Lemma 2, we get ∀α ∈ ∆_M:

η(α) ≤ max_{α∈∆_M} η(α) = max_{i∈[M]} η(e_i) = η_M   (63)

This establishes the upper bound in (62). The bound is tight, since η(e_M) = η_M is achievable.

Lower bound: From Lemmas 4 and 5, we establish ∀α ∈ ∆_M:

η(α) ≥ Σ_{i=1}^M (η_i − η_{i−1}) · max_{j∈{i,...,M}} {α_j} = h(α) ≥ min_{α∈∆_M} h(α) = h(α*) = η_m/m   (64)

Let S ⊂ R^d be any closed and bounded set containing at least M distinct vectors {δ_j}_{j=1}^M ⊆ S. Let D be any valid distribution over R = R^d × [C] such that ∀i ∈ [M]: P{z ∈ T_i} = η_i, P{z ∈ T_{M+1}} = 1, and ∅ = T_0 ⊂ T_1 ⊆ T_2 ⊆ ... ⊆ T_M ⊆ T_{M+1} ⊂ R. Finally, we construct classifiers f_i (∀i ∈ [M]) to satisfy the following assignment ∀z ∈ T_{M+1}:

f_i(x + δ) = y ∀δ ∈ S \ {δ_i};  and  f_i(x + δ_i) = y if (x, y) ∉ T_i, f_i(x + δ_i) = y′ ≠ y otherwise   (65)

i.e., the i-th classifier's decision f_i(x + δ) is incorrect only if δ = δ_i and z ∈ T_i. Given the above construction, we establish:

η(α) = E_{z∼D} max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y}
(a) = ∫_{T_{M+1}} p_z(z) · max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y} dz
(b) = Σ_{k=1}^M ∫_{T_k \ T_{k−1}} p_z(z) · max_{δ∈S} Σ_{i=1}^M α_i 1{f_i(x + δ) ≠ y} dz
(c) = Σ_{k=1}^M ∫_{T_k \ T_{k−1}} p_z(z) · max_{j∈{k,...,M}} {α_j} dz
= Σ_{k=1}^M max_{j∈{k,...,M}} {α_j} · ∫_{T_k \ T_{k−1}} p_z(z) dz
(d) = Σ_{k=1}^M (η_k − η_{k−1}) · max_{j∈{k,...,M}} {α_j} = h(α)   (66)

First, we state the classic result on the convergence of the projected sub-gradient method for convex minimization (Shor, 2012):

Lemma 6 (Projected Sub-gradient Method). Let h : R^d → R be a convex and sub-differentiable function, and let C ⊂ R^d be a convex set. For iterations t = 1, ..., T, define the projected sub-gradient method:

x^(t+1) = Π_C(x^(t) − a_t g^(t))   (67)
h_best^(t+1) = min{h_best^(t), h(x^(t+1))}   (68)

where a_t = a/t for some positive a > 0, x^(1) ∈ C is an arbitrary initial guess, h_best^(1) = h(x^(1)), and g^(t) ∈ ∂h(x^(t)) is a sub-gradient of h at x^(t). Let t_best designate the best iteration index thus far. Then, if h has norm-bounded sub-gradients, ∥g∥_2 ≤ G for all g ∈ ∂h(x) and x ∈ C, we have:

h_best^(t) − h* ≤ (∥x^(1) − x*∥_2^2 + G^2 Σ_{k=1}^t a_k^2) / (2 Σ_{k=1}^t a_k) → 0 as t → ∞   (69)

where h* = h(x*) = min_{x∈C} h(x)   (70)

We then prove Theorem 3 (restated below) via a direct application of Lemma 6:

Theorem (Restated). The OSP algorithm output α_T satisfies:

0 ≤ η(α_T) − η(α*) ≤ (∥α^(1) − α*∥_2^2 + M a^2 Σ_{t=1}^T t^{−2}) / (2a Σ_{t=1}^T t^{−1}) → 0 as T → ∞   (71)

for any initial condition α^(1) ∈ ∆_M and a > 0, where α* is a global minimum.

Proof. The ensemble empirical adversarial risk η is convex and sub-differentiable (Proposition 1), and it is minimized over the convex set ∆_M. At each iteration t of OSP, the vector g obtained at line (12) is a sub-gradient of η at α^(t) and is norm-bounded with G = √M; therefore, Lemma 6 applies.

where f is the deterministic ensemble classifier constructed via the rule:

f(x) = arg max_{c∈[C]} Σ_{i=1}^M [f̂_i(x)]_c   (84)

Consider the following setup: 1. two binary linear classifiers in R^2: f_i(x) = 1 if w_i^⊤ x ≥ 0, and 2 otherwise (85), which can be obtained from the "soft" classifiers f̂_i(x) = (w_i^⊤ x, −w_i^⊤ x)^⊤ (86) using f_i(x) = arg max_{c∈{1,2}} [f̂_i(x)]_c, where w_1 = (1, 1)^⊤ and w_2 = (1, −1)^⊤; 2.
a Ber(p) data distribution D over two data-points in R^2 × [2]: z_1 = (x_1, y_1) = ((−1, 2)^⊤, 1) and z_2 = (x_2, y_2) = ((−1, −2)^⊤, 1); 3. the ℓ_2 norm-bounded perturbation set S = {δ : ∥δ∥ ≤ ε} for some 0 < ε < 1/√2.

We first note that for binary linear classifiers and ℓ_2 norm-bounded adversaries:
• the shortest distance between a point x and the decision boundary of a linear classifier f with weight w and bias b is ζ = |w^⊤x + b| / ∥w∥   (89)
• the optimal adversarial perturbation against f at x is δ = −sign(w^⊤x + b) · εw/∥w∥, which maximally reduces the margin   (90)

We can now evaluate the adversarial risks of each classifier:

η_1 = p · max_{∥δ∥≤ε} 1{w_1^⊤(x_1 + δ) < 0} + (1 − p) · max_{∥δ∥≤ε} 1{w_1^⊤(x_2 + δ) < 0} = p · 1{1 − √2·ε < 0} + (1 − p) · 1{−3 < 0} = 1 − p

where we use ε < 1/√2 so that 1 − √2·ε > 0. By symmetry, we also get η_2 = p. The average ensemble classifier f constructed from f_1 and f_2 is given by the rule:

f(x) = 1 if x_1 ≥ 0 (the first coordinate), and 2 otherwise   (91)

whose adversarial risk can be computed as follows:

η = p · max_{∥δ∥≤ε} 1{x_{1,1} + δ_1 < 0} + (1 − p) · max_{∥δ∥≤ε} 1{x_{2,1} + δ_1 < 0} = p · 1{−1 < 0} + (1 − p) · 1{−1 < 0} = p + 1 − p = 1

which is strictly greater than max{p, 1 − p} ∀p ∈ (0, 1). Therefore, we have constructed an example where deterministic ensembling is always worse than using either of the individual classifiers, which proves that deterministic ensemble classifiers do not satisfy the upper bound.
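The counterexample above can be verified mechanically. The sketch below encodes the worst-case ℓ_2 attack on a binary linear classifier in closed form (for a point of label 1, f is fooled iff w^⊤x − ε∥w∥ < 0) and checks the claimed error patterns:

```python
import numpy as np

# Setup from the construction above: w1 = (1, 1), w2 = (1, -1), two points of
# label 1, and an l2 budget 0 < eps < 1/sqrt(2).
eps = 0.5
w1, w2 = np.array([1.0, 1.0]), np.array([1.0, -1.0])
pts = [np.array([-1.0, 2.0]), np.array([-1.0, -2.0])]   # x1, x2 (both label 1)

def fooled(w, x):
    # Binary linear classifier f(x) = 1 iff w^T x >= 0, true label 1.
    # The worst-case l2 perturbation reduces w^T x by eps * ||w||, so the
    # adversary wins iff w^T x - eps * ||w|| < 0 (margin smaller than eps).
    return w @ x - eps * np.linalg.norm(w) < 0

# Individual classifiers: each errs on exactly one of the two points.
assert [fooled(w1, x) for x in pts] == [False, True]    # eta_1 = 1 - p
assert [fooled(w2, x) for x in pts] == [True, False]    # eta_2 = p

# The averaged ensemble has weight w1 + w2 = (2, 0), i.e. f(x) = 1 iff x_1 >= 0,
# and errs on BOTH points: eta = 1 > max{p, 1 - p} for all p in (0, 1).
assert [fooled(w1 + w2, x) for x in pts] == [True, True]
```

The same assertions hold for any ε ∈ (0, 1/√2), as in the proof.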

D ADDITIONAL EXPERIMENTS AND COMPARISONS

D.1 EXPERIMENTAL SETUP

In this section, we describe the complete experimental setup used in all our experiments.

Training. All models are trained for 100 epochs via SGD with a batch size of 256, momentum 0.9, a weight decay of 5 × 10^−4, and an initial learning rate of 0.1, decayed by 0.1 at the 50th and 75th epochs. We employ the recently proposed margin-maximizing cross-entropy (MCE) loss from Zhang et al. (2022). We use 10 attack iterations during training with ε = 8/255 and a step size β = 2/255. For IAT, each classifier is independently adversarially trained from a different random initialization, using a standard PGD adversary. For MRBoost, we use the authors' public GitHub implementation to reproduce all their results. For BARRE, we use an adaptive PGD (APGD) adversary (discussed in detail in Section D.4) as our training attack algorithm. We apply OSP for T_o = 10 iterations every E_o = 10 epochs. To avoid catastrophic overfitting (Rice et al., 2020), we always save the best-performing checkpoint during training. Since all the ensemble methods considered reduce to adversarial training for the first iteration, we use a shared adversarially trained first classifier; doing so ensures a fair comparison between the different ensemble methods. For both the CIFAR-10 and CIFAR-100 datasets, we adopt standard data augmentation (random crops and flips). Per standard practice, we apply input normalization as part of the model, so that the adversary operates on physical images x ∈ [0, 1]^d.

Evaluation. For all our robust evaluations, we adopt the state-of-the-art ARC algorithm (Dbouk & Shanbhag, 2022), which applies to both RECs and single models. Specifically, we use 20 iterations of ARC, with an attack strength ε = 8/255 and approximation parameter G = 2. Following the recommendations of Dbouk & Shanbhag (2022), we use a step size of 2/255 when evaluating single models (M = 1) and a step size of 8/255 when evaluating RECs (M ≥ 2).
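For concreteness, the PGD adversary used during training can be sketched as follows. To keep the example self-contained we attack a linear softmax classifier, whose input gradient is available in closed form; a deep network would use automatic differentiation instead. All names and the model format are illustrative assumptions:

```python
import numpy as np

def pgd_linf(W, b, x, y, eps=8 / 255, step=2 / 255, iters=10):
    # l_inf PGD with random start: ascend the cross-entropy of a linear
    # softmax classifier logits = W z + b; the input gradient is W^T (p - e_y).
    delta = np.random.uniform(-eps, eps, size=x.shape)
    for _ in range(iters):
        logits = W @ (x + delta) + b
        p = np.exp(logits - logits.max())
        p /= p.sum()
        p[y] -= 1.0                      # p - onehot(y)
        g = W.T @ p                      # gradient of the loss w.r.t. the input
        delta = np.clip(delta + step * np.sign(g), -eps, eps)
        delta = np.clip(x + delta, 0.0, 1.0) - x   # keep the image in [0, 1]^d
    return x + delta

# Illustrative usage on random data (10 classes, 8 input dimensions).
rng = np.random.default_rng(0)
W, b = rng.normal(size=(10, 8)), np.zeros(10)
x = rng.uniform(0.2, 0.8, size=8)
x_adv = pgd_linf(W, b, x, y=3)
assert np.max(np.abs(x_adv - x)) <= 8 / 255 + 1e-9
assert (x_adv >= 0).all() and (x_adv <= 1).all()
```

The defaults mirror the hyperparameters stated above (10 iterations, ε = 8/255, step β = 2/255), and the final clip implements the [0, 1]^d physical-image constraint.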

D.2 INDIVIDUAL MODEL ROBUSTNESS

In Tables 3 & 4, we provide the clean and robust accuracies of all the individual classifiers constructed via the different ensemble methods on CIFAR-10 and CIFAR-100, respectively. Robust accuracy is measured using ARC. As expected, only ensembles produced via IAT consist of classifiers achieving near-identical robust and natural accuracies. In contrast, ensembles produced via MRBoost or BARRE exhibit a degradation in individual classifier robust accuracy as the ensemble size grows. Moreover, since MRBoost was not initially designed for randomized ensemble classifiers, this degradation can be rather severe, as seen for MobileNetV1 in both Tables 3 & 4. This explains why, for such ensembles, the optimal sampling probability obtained for the constructed REC completely disregards the last classifier, as highlighted in Section 5.1.

D.3 ATTACKS FOR RANDOMIZED ENSEMBLES

Given a data-point z = (x, y) and a potentially random classifier f, the goal of an adversary is to find an adversarial perturbation that maximizes the single-point expected adversarial risk:

δ* = arg max_{δ : ∥δ∥_p ≤ ε} E_f[1{f(x + δ) ≠ y}] = arg max_{δ : ∥δ∥_p ≤ ε} P{f(x + δ) ≠ y}   (93)

where we adopt an ℓ_p norm-bounded adversary for the remainder of this section. Projected gradient descent (PGD) (Madry et al., 2018) is perhaps the most popular attack algorithm for solving (93) in the case of differentiable deterministic classifiers. Specifically, given a surrogate loss function l, such as the cross-entropy loss, PGD finds an adversarial δ iteratively via:

δ^(k) = Π_ε^p(δ^(k−1) + η · μ_p(∇_x l(f(x + δ^(k−1)), y)))   (94)

where μ_p is the ℓ_p steepest-ascent direction operator, and Π_ε^p is the projection operator onto the ℓ_p ball of radius ε. In order to adapt PGD for evaluating randomized ensemble classifiers, Pinot et al. (2020) first proposed adaptive PGD (APGD-L) using the expectation-over-transformation (EOT) method (Athalye et al., 2018).
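The EOT-style adaptive attack can be sketched in the same self-contained setting (linear softmax models with closed-form gradients); since M is finite, the expectation over the REC's sampling distribution is computed exactly rather than estimated by sampling. Names and the model format are assumptions:

```python
import numpy as np

def apgd_linf(models, alpha, x, y, eps=8 / 255, step=2 / 255, iters=20):
    # Adaptive PGD against a REC: ascend the EXPECTED loss
    # sum_i alpha_i * l(f_i(x + delta), y). Because M is finite, the
    # expectation over the sampling distribution alpha is exact (no sampling).
    delta = np.zeros_like(x)
    for _ in range(iters):
        g = np.zeros_like(x)
        for a_i, (W, b) in zip(alpha, models):
            logits = W @ (x + delta) + b
            p = np.exp(logits - logits.max())
            p /= p.sum()
            p[y] -= 1.0
            g += a_i * (W.T @ p)         # alpha_i-weighted input gradient
        delta = np.clip(delta + step * np.sign(g), -eps, eps)
        delta = np.clip(x + delta, 0.0, 1.0) - x
    return x + delta

# Illustrative usage: an equiprobable REC of two linear models.
rng = np.random.default_rng(1)
models = [(rng.normal(size=(10, 8)), np.zeros(10)) for _ in range(2)]
x = rng.uniform(0.2, 0.8, size=8)
x_adv = apgd_linf(models, (0.5, 0.5), x, y=0)
assert np.max(np.abs(x_adv - x)) <= 8 / 255 + 1e-9
```

Taking the expectation at the loss level corresponds to APGD-L; APGD-S would instead average the softmax outputs before computing a single loss.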



Footnotes: the normalizing constant 1/M does not affect the classifier output; that is, α different from e_1 or e_2; ignoring the negligible memory overhead of storing α.



Figure 1: The efficacy of employing randomized ensembles (⋆) for achieving robust and efficient inference compared to AT (•) and deterministic ensembling MRBoost (♦) on CIFAR-10. Robustness is measured using the standard ℓ ∞ norm-bounded adversary with radius ϵ = 8 /255.

Figure 2: Illustration of the equivalence in (6) using an example of three classifiers in R 2 . The shaded areas represent regions in the attacker-restricted input space where each classifier makes an error. All classifiers correctly classify x. The set U uniquely captures the interaction between z and f 1 , f 2 , & f 3 inside S.

Figure 3: Enumeration of all K = 5 unique configurations with two classifiers and a data-point around a set S. Note that since α i ≥ 0 ∀i, the 0 vector is redundant in U k for k ∈ [4], which explains why K = 5 and not more.

(64) where m = arg min_{k∈[M]} {η_k/k} and α* = (1/m, ..., 1/m, 0, ..., 0)^⊤. This establishes the lower bound in (62). The bound is tight since, for fixed 0 < η_1 ≤ ... ≤ η_M ≤ 1, we can construct F, S, and D such that η(α) = h(α) and ∀i ∈ [M]: η(e_i) = h(e_i) = η_i, as shown next.

where: (a) holds because P{z ∈ T_{M+1}} = 1; (b) holds because T_{M+1} can be partitioned into the M + 1 sets T_1 ∪ (T_2 \ T_1) ∪ ... ∪ (T_{M+1} \ T_M), and the max term is 0 for all z ∈ T_{M+1} \ T_M; (c) holds by construction of F and S; and (d) holds since η_i = P{z ∈ T_i} and T_i ⊆ T_{i+1}.

B.5 PROOF OF THEOREM 3


Algorithm 1 (BARRE) — Input: number of classifiers M, perturbation set S, training set {z_j}_{j=1}^n, learning rate ρ, mini-batch size B, number of epochs E, OSP frequency E_o, OSP iteration count T_o. Output: robust randomized ensemble classifier (F, α). The listing initializes θ_0 ∈ Θ and F ← ∅, then for m ∈ {1, ..., M} trains the next classifier and periodically updates α via the OSP step α ← Π_{∆_M}(α − a_t g), returning α^(t_best).
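The OSP update can be sketched as a standard projected sub-gradient method over the simplex, where the sub-gradient at α averages, over the training points, the maximizing configuration vector u from Proposition 1. The input format and names below are illustrative, not the paper's implementation:

```python
import numpy as np

def project_simplex(v):
    # Euclidean projection onto the probability simplex (sort-based method).
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def osp(err_configs, M, T=200, a=0.5):
    # Projected sub-gradient sketch of OSP. err_configs[j] lists the binary
    # error vectors U for training point j (Proposition 1); a sub-gradient of
    # the empirical risk at alpha is the average maximizing u over the points.
    alpha = np.ones(M) / M
    best, best_val = alpha.copy(), np.inf
    for t in range(1, T + 1):
        g, val = np.zeros(M), 0.0
        for U in err_configs:
            u = max(U, key=lambda u: float(np.dot(u, alpha)))   # active piece
            g += np.asarray(u, dtype=float)
            val += float(np.dot(u, alpha))
        g /= len(err_configs)
        val /= len(err_configs)
        if val < best_val:
            best, best_val = alpha.copy(), val              # track t_best
        alpha = project_simplex(alpha - (a / t) * g)        # Eq. (67)
    return best, best_val

# Toy check: one point attacked on exactly one of two classifiers at a time,
# so eta(alpha) = max(alpha_1, alpha_2), minimized at alpha = (1/2, 1/2).
best, best_val = osp([[(1, 0), (0, 1)]], M=2)
assert abs(best_val - 0.5) < 1e-9
```

The a_t = a/t step-size schedule matches Lemma 6, so Theorem 3's convergence guarantee applies to this sketch.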


Table 1 demonstrates that BARRE can successfully construct RECs achieving competitive robustness (within ∼0.5%) compared to MRBoost-trained deterministic ensembles, across three different network architectures on CIFAR-10. The benefit of randomization can be seen for M ≥ 2, as we obtain massive 2−4× savings in compute requirements. Note that both methods have the same memory footprint. These observations are further corroborated by the CIFAR-100 experiments in Appendix D.5.

Table 1: Comparison between BARRE and MRBoost across different network architectures and ensemble sizes on CIFAR-10. Robust accuracy is measured against an ℓ∞ norm-bounded adversary using ARC with ε = 8/255.

Table 2: Comparison between BARRE and other methods at constructing robust randomized ensemble classifiers across various network architectures and datasets. Robust accuracy is measured against an ℓ∞ norm-bounded adversary using ARC with ε = 8/255.

Table 3: Natural and robust accuracies (A_nat / A_rob, in %) of the individual classifiers of all ensemble methods trained on CIFAR-10 (from Table 2). Robust accuracy is measured against an ℓ∞ norm-bounded adversary using ARC with ε = 8/255.

Method | f1 (A_nat / A_rob) | f2 | f3 | f4
IAT | … / ….21 | 79.05 / 46.60 | 78.44 / 46.11 | 78.76 / 46.74
MRBoost | 80.11 / 44.52 | 77.54 / 42.03 | 77.94 / 39.36 | 68.89 / 33.40
BARRE | 80.15 / 44.56 | 79.43 / 42.67 | 79.56 / 39.65 | 79.60 / 38.28

Table 4: Natural and robust accuracies of the individual classifiers of all ensemble methods trained on CIFAR-100 (from Table 2). Robust accuracy is measured against an ℓ∞ norm-bounded adversary using ARC with ε = 8/255.

Table 5: Comparing the success of different attack algorithms at fooling various RECs using ℓ∞ norm-bounded attacks with ε = 8/255 on CIFAR-10. All RECs are constructed with equiprobable sampling.


Proof. We know that h is a piece-wise linear convex function over a closed and convex set, which implies the existence of a global minimizer. Define the mapping g : ∆_M → [0, 1]^M such that ∀i ∈ [M]:

[g(α)]_i = max_{j∈{i,...,M}} {α_j} − max_{j∈{i+1,...,M}} {α_j}

(with the convention that the second max is 0 for i = M). A simple re-arrangement (Abel summation) then yields h(α) = γ^⊤ g(α). Define the decomposition of the probability simplex ∆_M = ∪_{n=1}^{M!} ∆_M^n, where ∀n ∈ [M!], ∃ indices i_1, i_2, ..., i_M such that ∀α ∈ ∆_M^n: α_{i_1} ≥ α_{i_2} ≥ ... ≥ α_{i_M}. In other words, ∆_M^n is the set of all probability vectors that share the same sorting indices; since there are M! ways to arrange M numbers, the size of the decomposition is M!. We now make the following observations:

1. Each ∆_M^n is convex.
2. ∆_M^n = H(P_n), where P_n = {p_k^n}_{k∈[M]}, p_k^n places mass 1/k on each of the indices i_1, ..., i_k and 0 elsewhere, and H(X) is the convex hull of the set of points X. Quick proof: since ∆_M^n is convex (observation 1) and contains P_n, we have H(P_n) ⊆ ∆_M^n. Conversely, for any α ∈ ∆_M^n, define λ_k = k(α_{i_k} − α_{i_{k+1}}) for k < M and λ_M = M · α_{i_M}. This induces a valid convex coefficient vector λ ∈ ∆_M, and it is easy to verify that α_{i_l} = Σ_k λ_k [p_k^n]_{i_l} for all indices i_l ∈ [M], by construction of λ and P_n.
3. ∀n, the function g is linear over ∆_M^n, since the sorting order (and hence every suffix maximum) is fixed on ∆_M^n.

Combining observations 1, 2 & 3, we can re-write the original optimization problem as:

min_{α∈∆_M} h(α) = min_{n∈[M!]} min_{α∈H(P_n)} γ^⊤ g(α) (a) = min_{n∈[M!]} min_{k∈[M]} γ^⊤ g(p_k^n)   (58)

C ADDITIONAL THEORETICAL RESULTS

C.1 SPECIAL CASE OF THREE CLASSIFIERS

In this section, we derive a simplified search strategy for finding the optimal sampling probability in the special case of M = 3, akin to Section 4.2. Similar to (7), we can enumerate all possible classifier/data-point configurations around S, which allows us to write η(α) ∀α ∈ ∆_3 as a weighted combination of the configuration terms, where p ∈ ∆_12 (Eq. (72)). Using (72), we obtain the following result:

Theorem 4. Define A ⊂ ∆_3 to be the set of the following vectors: the three standard basis vectors; the three permutations of (1/2, 1/2, 0)^⊤; the three permutations of (1/2, 1/4, 1/4)^⊤; and the uniform vector (1/3, 1/3, 1/3)^⊤. Then for any three classifiers f_1, f_2, and f_3, perturbation set S ⊂ R^d, and data distribution D, we have:

min_{α∈∆_3} η(α) = min_{α∈A} η(α)   (74)

The set A is optimal, in the sense that there exists no smaller set A′ for which (74) holds.

Theorem 4 simplifies the search for the optimal sampling probability significantly: it suffices to evaluate η at exactly 10 candidate solutions, captured by A, and pick the best-performing one. The minimality of A also guarantees that the search cannot be shortened further, since every candidate solution in A must be evaluated.

Proof. We use the same technique as in the proof of Lemma 5. We can decompose ∆_3 into 6 subsets ∆_3^1, ..., ∆_3^6, such that each subset contains the vectors that share the same sorting indices. These subsets are convex, and each can be represented as the convex hull of three vectors. Due to the symmetry of the problem, we focus on the subset ∆_3^1 = {α ∈ ∆_3 : α_1 ≥ α_2 ≥ α_3}. Notice that for any α ∈ ∆_3^1, all the terms in (72) become linear in α except for the term max{α_2 + α_3, α_1}. Therefore, we can further decompose ∆_3^1 into two convex subsets:

∆_3^{1,1} = {α ∈ ∆_3^1 : α_1 ≥ α_2 + α_3}  and  ∆_3^{1,2} = {α ∈ ∆_3^1 : α_1 ≤ α_2 + α_3}

so that η is linear over each subset (but not over their union).

Claim: we have:

∆_3^{1,1} = H({(1, 0, 0)^⊤, (1/2, 1/2, 0)^⊤, (1/2, 1/4, 1/4)^⊤})  and  ∆_3^{1,2} = H({(1/2, 1/2, 0)^⊤, (1/2, 1/4, 1/4)^⊤, (1/3, 1/3, 1/3)^⊤})   (76)

Since both ∆_3^{1,1} and ∆_3^{1,2} are convex, it is enough to show that every α in each subset is a convex combination of the corresponding three vertices for (76) to hold. For all α ∈ ∆_3^{1,1}, define:

λ_1 = α_1 − α_2 − α_3, λ_2 = 2(α_2 − α_3), λ_3 = 4α_3

Then we always have α = λ_1 (1, 0, 0)^⊤ + λ_2 (1/2, 1/2, 0)^⊤ + λ_3 (1/2, 1/4, 1/4)^⊤, where it is easy to verify that λ_i ≥ 0 and λ_1 + λ_2 + λ_3 = α_1 + α_2 + α_3 = 1.
The same can be shown for any α ∈ ∆_3^{1,2}, using the coefficients:

μ_1 = 2(α_2 − α_3), μ_2 = 4(α_1 − α_2), μ_3 = 3(α_2 + α_3 − α_1)

which establishes the claim in (76). Using (76) and the linearity of η on each subset, the minimum of η over ∆_3^1 is attained at one of the four vertices above. Finally, repeating this procedure for the remaining 5 subsets ∆_3^2, ..., ∆_3^6 establishes (74). To show that the set A is minimal, we provide 10 constructions of η via the p vector in (72) such that the i-th vector α ∈ A is a unique (amongst A) global optimum of η characterized by the i-th p vector.
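The search strategy of Theorem 4 is straightforward to implement: evaluate η at each candidate and keep the best. In the sketch below, the 10-vector candidate set is instantiated from the vertices appearing in the proof (the basis vectors and the permutations of (1/2, 1/2, 0) and (1/2, 1/4, 1/4), plus the uniform vector); this instantiation is our reading of the proof, so treat it as an assumption:

```python
import itertools
import numpy as np

def best_sampling(eta, A):
    # Theorem 4's search: evaluate eta at each candidate in A, keep the best.
    vals = [eta(np.array(a)) for a in A]
    i = int(np.argmin(vals))
    return A[i], vals[i]

# Plausible instantiation of the 10-vector candidate set for M = 3: the
# permutations of (1,0,0), (1/2,1/2,0), (1/2,1/4,1/4), plus the uniform vector.
base = [(1.0, 0.0, 0.0), (0.5, 0.5, 0.0), (0.5, 0.25, 0.25)]
A = sorted({q for v in base for q in itertools.permutations(v)} | {(1/3, 1/3, 1/3)})
assert len(A) == 10

# Example: if eta(alpha) = max_i alpha_i, the uniform vector is optimal.
alpha_best, val = best_sampling(lambda a: float(np.max(a)), A)
assert alpha_best == (1/3, 1/3, 1/3)
```

In practice, eta would be the empirical adversarial risk of the REC evaluated on held-out data; here it is a stand-in function for illustration.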

C.2 WORST CASE PERFORMANCE OF DETERMINISTIC ENSEMBLES

In Section 4.3, we showed via Theorem 2 that the adversarial risk of any randomized ensemble classifier is upper bounded by that of the worst-performing classifier in the ensemble F. In this section, we show that the same cannot be said of deterministic ensemble classifiers. That is, there exist an ensemble F, data distribution D, and perturbation set S such that the deterministic ensemble classifier f achieves an adversarial risk strictly greater than max_{i∈[M]} η_i.

Note that the discrete nature of randomized ensembles allows for an exact computation of the expectation in (95). Recently, Zhang et al. (2022) proposed a stronger version of adaptive PGD, where the expectation is taken at the softmax level (APGD-S). Using APGD-S, Zhang et al. (2022) were able to compromise the BAT defense. Independently, Dbouk & Shanbhag (2022) studied the effectiveness of EOT-based adaptive attacks for evaluating the robustness of RECs and concluded that such methods are fundamentally ill-suited for the task. Instead, they proposed the ARC attack (Algorithm 2 of Dbouk & Shanbhag (2022)), which iteratively updates the perturbation by estimating the direction towards the decision boundary of each classifier, using an adaptive step-size method.

In this section, we propose a small modification to ARC (ARC-R) that proves considerably more effective in the equiprobable setting. Specifically, instead of looping over the classifiers in a deterministic order based on the sampling probability vector, we use a randomized-order loop. This ensures that ARC is never biased towards certain classifiers. In fact, Table 5 demonstrates that ARC-R is better than APGD-L (Pinot et al., 2020).

As highlighted in Section 5.1, we find that ARC, despite being the strongest adversary, leads to poor performance when adopted as the training attack in BARRE. In this section, we investigate this phenomenon by studying the performance of BARRE under two different training attacks: APGD (Zhang et al., 2022) and ARC (Dbouk & Shanbhag, 2022).
Specifically, we train two RECs on CIFAR-10 using the ResNet-20 architecture. Both RECs share the same first classifier f_1, which is adversarially trained using standard PGD. The second classifier f_2 is trained via either APGD or ARC. Figure 4 plots the evolution of both robust and clean accuracies of the two RECs across the 100 training epochs of f_2, measured on the test set. In both cases, robust accuracy is evaluated via the stronger ARC adversary. On clean images, BARRE with ARC yields significantly more accurate RECs than BARRE with APGD. However, this comes at the expense of robust accuracy: the REC obtained via BARRE with ARC is much more vulnerable than its APGD counterpart. We hypothesize that the adversarial samples generated via ARC during training do not generalize well to the test set, which explains why the REC obtained via BARRE with ARC achieves much higher robust accuracy on the training set. Thus, for better generalization performance, we adopt adaptive PGD during training in all our experiments.

D.5 ADDITIONAL RESULTS

In this section, we complete the CIFAR-10 results reported in Table 1, showcasing the benefit of randomization. Specifically, Table 6 provides further evidence that BARRE can train RECs of competitive robustness compared to MRBoost-trained deterministic ensembles, while requiring significantly less compute.

