CERTIFIED TRAINING: SMALL BOXES ARE ALL YOU NEED

Abstract

To obtain deterministic guarantees of adversarial robustness, specialized training methods are used. We propose SABR, a novel such certified training method, based on the key insight that propagating interval bounds for a small but carefully selected subset of the adversarial input region is sufficient to approximate the worst-case loss over the whole region while significantly reducing approximation errors. In an extensive empirical evaluation, we show that SABR outperforms existing certified defenses in terms of both standard and certifiable accuracies across perturbation magnitudes and datasets, pointing to a new class of certified training methods promising to alleviate the robustness-accuracy trade-off.

1. INTRODUCTION

As neural networks are increasingly deployed in safety-critical domains, formal robustness guarantees against adversarial examples (Biggio et al., 2013; Szegedy et al., 2014) are becoming ever more important. However, despite significant progress, obtaining deterministic guarantees still requires specialized training methods that improve certifiability at the cost of severely reduced accuracy. Given an input region defined by an adversary specification, both training and certification methods compute a network's reachable set by propagating a symbolic over-approximation of this region through the network (Singh et al., 2018; 2019a; Gowal et al., 2018a). Depending on the propagation method, both the computational complexity and the approximation tightness can vary widely. For certified training, an over-approximation of the worst-case loss is computed from this reachable set and then optimized (Mirman et al., 2018; Wong et al., 2018). Surprisingly, the least precise propagation methods yield the highest certified accuracies, as more precise methods induce harder optimization problems (Jovanovic et al., 2021). However, the large approximation errors incurred by these imprecise methods lead to over-regularization and thus poor accuracy. Combining precise worst-case loss approximations with a tractable optimization problem is thus the core challenge of certified training. In this work, we tackle this challenge and propose a novel certified training method, SABR (Small Adversarial Bounding Regions), based on the following key insight: by propagating small but carefully selected subsets of the adversarial input region with imprecise methods (i.e., BOX), we can obtain both well-behaved optimization problems and precise approximations of the worst-case loss.
This yields less over-regularized networks, allowing SABR to improve on state-of-the-art certified defenses in terms of both standard and certified accuracies across settings, thereby pointing to a new class of certified training methods.

Main Contributions Our main contributions are:

• A novel certified training method, SABR, reducing over-regularization to improve both standard and certified accuracy (§3).
• A theoretical investigation motivating SABR by deriving new insights into the growth of BOX relaxations during propagation (§4).
• An extensive empirical evaluation demonstrating that SABR outperforms all state-of-the-art certified training methods in terms of both standard and certifiable accuracies on MNIST, CIFAR-10, and TINYIMAGENET (§5).

2. BACKGROUND

In this section, we provide the necessary background for SABR.

Adversarial Robustness
Consider a classification model h : ℝ^{d_in} → ℝ^c that, given an input x ∈ X ⊆ ℝ^{d_in}, predicts numerical scores y := h(x) for every class. We say that h is adversarially robust on an ℓ_p-norm ball B^{ϵ_p}_p(x) of radius ϵ_p if it consistently predicts the target class t for all perturbed inputs x′ ∈ B^{ϵ_p}_p(x). More formally, we define adversarial robustness as

arg max_j h(x′)_j = t,  ∀x′ ∈ B^{ϵ_p}_p(x) := {x′ ∈ X | ∥x − x′∥_p ≤ ϵ_p}.  (1)

Neural Network Verification
To verify that a neural network h is adversarially robust, several verification techniques have been proposed. A simple but effective such method is verification with the BOX relaxation (Mirman et al., 2018), also called interval bound propagation (IBP) (Gowal et al., 2018b). Conceptually, we first compute an over-approximation of a network's reachable set by propagating the input region B^{ϵ_p}_p(x) through the neural network and then check whether all outputs in the reachable set yield the correct classification. This propagation sequentially computes a hyper-box relaxation (each dimension is described as an interval) of a layer's output, given a hyper-box input. As an example, consider an L-layer network h = f_L ∘ σ ∘ f_{L−1} ∘ … ∘ f_1, with linear layers f_i and ReLU activation functions σ. Given an input region B^{ϵ_p}_p(x), we over-approximate it as a hyper-box centred at x̄_0 := x with radius δ_0 := ϵ_p, such that each input dimension j satisfies x_{0,j} ∈ [x̄_{0,j} − δ_{0,j}, x̄_{0,j} + δ_{0,j}]. Given a linear layer f_i(x_{i−1}) = W x_{i−1} + b =: x_i, we obtain the hyper-box relaxation of its output with centre x̄_i = W x̄_{i−1} + b and radius δ_i = |W| δ_{i−1}, where |·| denotes the elementwise absolute value. A ReLU activation ReLU(x_{i−1}) := max(0, x_{i−1}) can be over-approximated by propagating the lower and upper bounds separately, resulting in an output hyper-box with x̄_i = (u_i + l_i)/2 and δ_i = (u_i − l_i)/2, where l_i = ReLU(x̄_{i−1} − δ_{i−1}) and u_i = ReLU(x̄_{i−1} + δ_{i−1}).

Proceeding this way for all layers, we obtain lower and upper bounds on the network output y and can check whether the output score of the target class exceeds that of all other classes by computing the upper bound on the logit differences y^Δ_i := y_i − y_t and then checking whether y^Δ_i < 0, ∀i ≠ t. We illustrate this propagation process for a one-layer network in Fig. 1. There, the blue shapes show an exact propagation of the input region and the red shapes their hyper-box relaxation. Note how, already after the first linear and ReLU layers (third row), the relaxation (red) contains many points not reachable via exact propagation (blue), despite being the smallest hyper-box containing the exact region. These so-called approximation errors accumulate quickly, leading to an increasingly imprecise abstraction, as can be seen by comparing the two shapes after an additional linear layer (last row). To verify that this network classifies all inputs in [−1, 1]² to class 1, we have to show that the upper bound of the logit difference y_2 − y_1 is less than 0. While the concrete maximum of −0.3 ≥ y_2 − y_1 (black ×) is indeed less than 0, showing that the network is robust, the BOX relaxation only yields the bound 0.6 ≥ y_2 − y_1 (red ×) and is thus too imprecise to prove robustness.

Beyond BOX, more precise verification approaches track more relational information at the cost of increased computational complexity (Palma et al., 2022; Wang et al., 2021). A recent example is MN-BAB (Ferrari et al., 2022), which improves on BOX in two key ways: First, instead of propagating axis-aligned hyper-boxes, it uses much more expressive polyhedra, allowing linear layers to be captured exactly and ReLU layers much more precisely.
Second, if the result is still too imprecise, the verification problem is recursively split into easier subproblems by introducing a case distinction between the two linear segments of the ReLU function. This is called the branch-and-bound (BaB) approach (Bunel et al., 2020). We refer the interested reader to Ferrari et al. (2022) for more details.

Training for Robustness
For neural networks to be certifiably robust, special training is necessary. Given a data distribution (x, t) ∼ D, standard training generally aims to find a network parametrization θ that minimizes the expected cross-entropy loss (see App. B.1):

θ_std = arg min_θ E_D[L_CE(h_θ(x), t)],  with  L_CE(y, t) = ln(1 + Σ_{i≠t} exp(y_i − y_t)).  (2)

When training for robustness, we instead wish to minimize the expected worst-case loss around the data distribution, leading to the min-max optimization problem

θ_rob = arg min_θ E_D[ max_{x′ ∈ B^{ϵ_p}_p(x)} L_CE(h_θ(x′), t) ].  (3)

Unfortunately, solving the inner maximization problem is generally intractable. It is therefore commonly under- or over-approximated, yielding adversarial and certified training, respectively. For notational clarity, we henceforth drop the subscript p.

Adversarial Training
Adversarial training optimizes a lower bound on the inner objective of Eq. (3) by first computing concrete examples x′ ∈ B^ϵ(x) maximizing the loss term and then optimizing the network parameters θ for these samples. Typically, x′ is computed by initializing x′_0 uniformly at random in B^ϵ(x) and then updating it over N projected gradient descent (PGD) steps (Madry et al., 2018):

x′_{n+1} = Π_{B^ϵ(x)}( x′_n + α sign(∇_{x′_n} L_CE(h_θ(x′_n), t)) ),

with step size α and projection operator Π. While networks trained this way typically exhibit good empirical robustness, they remain hard to formally verify and are sometimes vulnerable to stronger or different attacks (Tramèr et al., 2020; Croce & Hein, 2020).
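As a concrete illustration of the PGD update above, the following sketch attacks a small linear classifier. The weights, radius, and step size are hypothetical, and the input gradient of L_CE is computed analytically for y = W x rather than via automatic differentiation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear classifier y = W x (illustrative weights only).
W = np.array([[1.0, -1.0],
              [0.5, 2.0]])

def ce_loss(x, t):
    # L_CE(y, t) = ln(1 + sum_{i != t} exp(y_i - y_t)), as in Eq. (2).
    y = W @ x
    return np.log1p(np.exp(np.delete(y, t) - y[t]).sum())

def ce_grad(x, t):
    # Analytic input gradient of ce_loss for the linear model y = W x.
    y = W @ x
    z = np.exp(y - y[t])
    z[t] = 0.0
    dLdy = z / (1.0 + z.sum())
    dLdy[t] = -z.sum() / (1.0 + z.sum())
    return W.T @ dLdy

def pgd(x, t, eps, alpha=0.05, steps=40):
    # PGD (Madry et al., 2018): random start in B_eps(x), signed-gradient
    # ascent steps, each followed by projection back onto the l_inf ball.
    x_adv = x + rng.uniform(-eps, eps, size=x.shape)
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(ce_grad(x_adv, t))
        x_adv = x + np.clip(x_adv - x, -eps, eps)  # projection Pi
    return x_adv

x0 = np.array([0.3, -0.2])
x_adv = pgd(x0, t=0, eps=0.25)  # loss-maximizing point inside B_0.25(x0)
```

For a linear model the loss is convex in x, so PGD drives x_adv towards a corner of the ℓ_∞ ball; for deep networks the attack only yields a (possibly loose) lower bound on the worst-case loss.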
Certified Training
Certified training optimizes an upper bound on the inner maximization objective in Eq. (3), obtained via a bound propagation method. These methods compute an upper bound u_{y^Δ} on the logit differences y^Δ := y − y_t, as described above, to obtain the robust cross-entropy loss L_CE,rob(B^ϵ(x), t) = L_CE(u_{y^Δ}, t). We will use BOX to refer to the verification and propagation approach, and IBP to refer to the corresponding training method. Surprisingly, using the imprecise BOX relaxation (Mirman et al., 2018; Gowal et al., 2018b; Shi et al., 2021) consistently produces better results than methods based on tighter abstractions (Zhang et al., 2020; Balunovic & Vechev, 2020; Wong et al., 2018). Jovanovic et al. (2021) trace this back to the optimization problems induced by the more precise methods becoming intractable to solve. While the heavily regularized, IBP-trained networks are amenable to certification, they suffer from severely reduced (standard) accuracies. Overcoming this robustness-accuracy trade-off remains a key challenge of robust machine learning.
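The BOX propagation from §2 and the robust cross-entropy loss above can be sketched in a few lines. The two-layer network below is hypothetical (not the network of Fig. 1), and the logit-difference bound is the naive interval one; practical implementations fold y_i − y_t into the last linear layer for tighter bounds.

```python
import numpy as np

def linear_box(c, r, W, b):
    # Linear layer: the centre maps exactly, the radius through |W|.
    return W @ c + b, np.abs(W) @ r

def relu_box(c, r):
    # ReLU: propagate the lower and upper bounds separately.
    l = np.maximum(c - r, 0.0)
    u = np.maximum(c + r, 0.0)
    return (u + l) / 2, (u - l) / 2

def robust_ce(c, r, t):
    # Upper bounds u on the logit differences y_i - y_t (i != t) from the
    # output box, then the robust loss L_CE(u, t).
    u = np.delete((c + r) - (c[t] - r[t]), t)
    return np.log1p(np.exp(u).sum())

# Hypothetical 1-hidden-layer network, input region [-1, 1]^2.
W1, b1 = np.array([[1.0, -0.5], [0.5, 1.0]]), np.array([0.1, -0.1])
W2, b2 = np.array([[1.0, 0.5], [-0.5, 1.0]]), np.zeros(2)

c, r = np.zeros(2), np.ones(2)
c, r = linear_box(c, r, W1, b1)
c, r = relu_box(c, r)
c, r = linear_box(c, r, W2, b2)
loss = robust_ce(c, r, t=0)  # here the bound on y_2 - y_1 is positive,
                             # so BOX cannot certify this region
```

Optimizing `robust_ce` instead of the standard cross-entropy is, in essence, IBP training; the looseness of the bound is exactly the over-regularization discussed above.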

3. METHOD -SMALL REGIONS FOR CERTIFIED TRAINING

To train networks that are not only robust and amenable to certification but also retain comparatively high standard accuracies, we propose the novel certified training method SABR (Small Adversarial Bounding Regions). We leverage the key insight that computing an over-approximation of the worst-case loss over a small but carefully selected subset of the input region B^ϵ(x) often yields a good proxy for the worst-case loss over the whole region while significantly reducing approximation errors. We illustrate this intuition in Fig. 2. Existing certified training methods always consider the whole input region (dashed box in the input panel). Propagating such large regions through the network yields quickly growing approximation errors and thus very imprecise over-approximations of the actual worst-case loss (compare the reachable set in red and green to the dashed box in the output panel), causing significant over-regularization (large blue arrow). Adversarial training methods, in contrast, only consider individual points in the input space (× in Fig. 2) and often fail to capture the actual worst-case loss. This leads to insufficient regularization (small blue arrow in the output panel) and yields networks that are not amenable to certification and potentially not robust. We tackle this problem by propagating small, adversarially chosen subsets of the input region (solid box in the input panel of Fig. 2), which we call the propagation region. This leads to significantly reduced approximation errors (see the solid box in the output panel), inducing a level of regularization in between certified and adversarial training methods (medium blue arrow) and allowing us to train networks that are both robust and accurate.
More formally, we define an auxiliary objective for the robust optimization problem Eq. (3) as

L_SABR = max_{x* ∈ B^τ(x′)} L_CE(h_θ(x*), t),  (4)

where we replace the maximum over the whole input region B^ϵ(x) with that over a carefully selected subset B^τ(x′). While choosing x′ = Π_{B^{ϵ−τ}(x)} arg max_{x* ∈ B^ϵ(x)} L_CE(h_θ(x*), t) would recover the original robust training problem (Eq. (3)), both computing the maximum loss over a given input region (Eq. (4)) and finding a point that realizes this loss are generally intractable. Instead, we instantiate SABR by combining approximate approaches for the two key components: (a) a method for choosing the location x′ and size τ of the propagation region, and (b) a method for propagating the thus selected region. Note that we thus generally do not obtain a sound over-approximation of the loss on B^ϵ(x). Depending on the size of the propagated region B^τ(x′), SABR can be seen as a continuous interpolation between adversarial training for infinitesimally small regions (τ = 0) and standard certified training for the full input region (τ = ϵ).

Selecting the Propagation Region
SABR aims to find and propagate a small subset of the adversarial input region B^ϵ(x) that contains the inputs leading to the worst-case loss.
To this end, we parametrize the propagation region as an ℓ_p-norm ball B^τ(x′) with centre x′ and radius τ ≤ ϵ − ∥x − x′∥_p. We first choose τ = λϵ by scaling the original perturbation radius ϵ with the subselection ratio λ ∈ (0, 1]. We then select x′ as follows: we conduct a PGD attack, choosing the preliminary centre x* as the sample with the highest loss, and then ensure that the obtained region is fully contained in the original one by projecting x* onto B^{ϵ−τ}(x) to obtain x′. We illustrate this process in Fig. 3.

Propagation Method
Having found the propagation region B^τ(x′), we can use any symbolic propagation method to compute an over-approximation of its worst-case loss. We choose BOX propagation (DIFFAI (Mirman et al., 2018) or IBP (Gowal et al., 2018b)) to obtain well-behaved optimization problems (Jovanovic et al., 2021). There, choosing small propagation regions (τ ≪ 1) can significantly reduce the incurred over-approximation errors, as we show in §4.
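The selection step can be sketched as follows for ℓ_∞ perturbations (a minimal sketch; the PGD result x_star is assumed given, e.g., from an attack as described in §2):

```python
import numpy as np

def select_region(x, x_star, eps, lam):
    # SABR region selection: shrink the radius to tau = lam * eps, then
    # project the adversarial point x_star onto B_{eps-tau}(x) so that the
    # propagation region B_tau(x_prime) lies fully inside B_eps(x).
    tau = lam * eps
    x_prime = x + np.clip(x_star - x, tau - eps, eps - tau)
    return x_prime, tau

x = np.zeros(2)
eps = 8 / 255
x_star = x + eps  # suppose PGD ends at a corner of B_eps(x)
x_prime, tau = select_region(x, x_star, eps, lam=0.4)
```

By construction, ∥x′ − x∥_∞ + τ ≤ ϵ, so the region B^τ(x′) propagated with BOX never leaves the original adversarial region.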

4. UNDERSTANDING SABR: ROBUST LOSS AND GROWTH OF SMALL BOXES

In this section, we aim to uncover the reasons behind SABR's success. Towards this, we first analyse the relationship between robust loss and over-approximation size before investigating the growth of the BOX approximation with propagation region size.

Robust Loss Analysis

Certified training typically optimizes an over-approximation of the worst-case cross-entropy loss L_CE,rob, computed via the softmax of the upper bound on the logit differences y^Δ := y − y_t. When training with the BOX relaxation and assuming w.l.o.g. the target class t = 1, we obtain the logit differences y^Δ ∈ [ȳ^Δ − δ^Δ, ȳ^Δ + δ^Δ] and thus the robust cross-entropy loss

L_CE,rob(x) = ln(1 + Σ_{i=2}^n e^{ȳ^Δ_i + δ^Δ_i}).

We observe that samples with a high (> 0) worst-case misclassification margin ȳ^Δ + δ^Δ := max_i (ȳ^Δ_i + δ^Δ_i) dominate the overall loss and permit the per-sample loss term to be approximated as

ȳ^Δ + δ^Δ < L_CE,rob < ln(n) + ȳ^Δ + δ^Δ.  (5)

Further, we note that the BOX relaxations of many functions preserve the box centres, i.e., x̄_i = f(x̄_{i−1}). Only unstable ReLUs, i.e., ReLUs containing 0 in their input bounds, introduce a slight shift; however, these are empirically few in certifiably trained networks (see Table 4). These observations allow us to decompose the robust loss into an accuracy term ȳ^Δ, corresponding to the misclassification margin of the adversarial example x′ at the centre of the propagation region, and a robustness term δ^Δ, bounding the difference to the actual worst-case loss. These terms generally represent conflicting objectives, as local robustness requires the network to disregard high-frequency features (Ilyas et al., 2019). Therefore, robustness and accuracy are balanced to minimize the optimization objective Eq. (5). Consequently, reducing the regularization induced by the robustness term will bias the optimization process towards standard accuracy. Next, we investigate how SABR reduces exactly this regularization strength by propagating smaller regions.

BOX Growth
To investigate how BOX approximations grow as they are propagated, let us again consider an L-layer network h = f_L ∘ σ ∘ f_{L−1} ∘ … ∘ f_1, with linear layers f_i and ReLU activation functions σ.
Given a BOX input with radius δ_{i−1} and centre distribution x̄_{i−1} ∼ D, we define the per-layer growth rate κ_i as the ratio of expected output radius to input radius, κ_i = E_D[δ_i] / δ_{i−1}. For linear layers with weight matrix W, we obtain the output radius δ_i = |W| δ_{i−1} and thus a constant growth rate, corresponding per output dimension j to the ℓ_1 norm of the weight row, ∥W_{j,•}∥_1. Empirically, we find most linear and convolutional layers to exhibit growth rates between 10 and 100 (see Table 9 in App. D.4). For ReLU layers x_i = σ(x_{i−1}), computing the growth rate is more challenging, as it depends on the location and size of the inputs. Shi et al. (2021) assume the input BOX centres x̄_{i−1} to be symmetrically distributed around 0, i.e., P_D(x̄_{i−1}) = P_D(−x̄_{i−1}), and obtain a constant growth rate of κ_i = 0.5. While this assumption holds at initialization, we observe that trained networks tend to have more inactive than active ReLUs (see Table 4), indicating asymmetric distributions with more negative inputs (see Fig. 4). We now investigate this more realistic setting. We first consider the two limit cases where the input radius δ_{i−1} approaches 0 or ∞. When δ_{i−1} ≈ 0, active neurons stay stably active, yielding δ_i = δ_{i−1}, and inactive neurons stay stably inactive, yielding δ_i = 0; we thus obtain a growth rate equal to the portion of active neurons. In the other extreme, δ_{i−1} → ∞, all neurons become unstable with |x̄_{i−1}| ≪ δ_{i−1}, yielding δ_i ≈ 0.5 δ_{i−1} and thus a constant growth rate of κ_i = 0.5. To analyze the behavior in between these extremes, we assume pointwise asymmetry favouring negative inputs, i.e., p(x̄_{i−1} = −z) > p(x̄_{i−1} = z), ∀z ∈ ℝ_{>0}. In this setting, we find that output radii grow strictly super-linearly in the input size: Theorem 4.1 (Hyper-Box Growth).
Let y := σ(x) = max(0, x) be a ReLU function and consider box inputs with radius δ_x and asymmetrically distributed centres x ∼ D such that P_D(x = −z) > P_D(x = z), ∀z ∈ ℝ_{>0}. Then, the mean output radius δ_y grows super-linearly in the input radius δ_x. More formally:

∀δ_x, δ′_x ∈ ℝ_{≥0}:  δ′_x > δ_x  ⟹  E_D[δ′_y] > E_D[δ_y] + (δ′_x − δ_x) · (∂/∂δ_x) E_D[δ_y].

We defer the proof to App. A and illustrate this behaviour in Fig. 5 for the box centre distribution x ∼ N(µ = −1.0, σ = 0.5). There, we clearly observe that the actual super-linear growth (purple) outpaces a linear approximation (orange). While even the qualitative behaviour depends on the exact centre distribution and the input box size δ_x, we can solve special cases analytically: for example, a piecewise uniform centre distribution yields quadratic growth on its support (see App. A). Multiplying all layer-wise growth rates, we obtain the overall growth rate κ = Π_{i=1}^L κ_i, which is exponential in the network depth and super-linear in the input radius. When not specifically training with the BOX relaxation, we empirically observe that the large growth factors of the linear layers dominate the shrinking effect of the ReLU layers, leading to quick exponential growth in network depth. Further, for both SABR- and IBP-trained networks, the super-linear growth in the input radius empirically manifests as exponential behaviour (see Figs. 8 and 9). Using SABR, we thus expect the regularization induced by the robustness term to decrease super-linearly, and empirically even exponentially, with the subselection ratio λ, explaining the significantly higher accuracies compared to IBP.
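Theorem 4.1 can be checked numerically for the centre distribution x ∼ N(µ = −1.0, σ = 0.5) from Fig. 5 (a Monte-Carlo sketch; the sample count and the two radii are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
centers = rng.normal(-1.0, 0.5, size=100_000)  # asymmetric: mostly negative

def mean_output_radius(delta):
    # Expected ReLU output radius over box inputs [c - delta, c + delta].
    u = np.maximum(centers + delta, 0.0)
    l = np.maximum(centers - delta, 0.0)
    return ((u - l) / 2).mean()

r_small = mean_output_radius(0.5)
r_large = mean_output_radius(1.0)
# Super-linear growth: doubling the input radius more than doubles the
# expected output radius under this asymmetric centre distribution.
```

Intuitively, a larger input box turns more of the predominantly inactive neurons unstable, so the expected output radius grows faster than linearly in δ_x.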

5. EVALUATION

In this section, we first compare SABR to existing certified training methods before investigating its behavior in an ablation study.

Experimental Setup
We implement SABR in PyTorch (Paszke et al., 2019) and use MN-BAB (Ferrari et al., 2022) for certification. We conduct experiments on MNIST (LeCun et al., 2010), CIFAR-10 (Krizhevsky et al., 2009), and TINYIMAGENET (Le & Yang, 2015) for the challenging ℓ_∞ perturbations, using the same 7-layer convolutional architecture, CNN7, as prior work (Shi et al., 2021), unless indicated otherwise. We choose training hyperparameters similar to prior work (Shi et al., 2021) and provide detailed information in App. C. We compare SABR to state-of-the-art certified training methods in Table 1 and Fig. 6, reporting the best results achieved with a given method on any architecture.

5.1. MAIN RESULTS

In Fig. 6, we show certified over standard accuracy (the upper right-hand corner is best) and observe that SABR dominates all other methods, achieving both the highest certified and the highest standard accuracy across all settings. As existing methods typically perform well either at large or at small perturbation radii (see Table 1 and Fig. 6), we believe the strong performance of SABR across perturbation radii to be particularly promising. Methods striving to balance accuracy and regularization by bridging the gap between provable and adversarial training (Balunovic & Vechev, 2020; Palma et al., 2022) perform only slightly worse than SABR at small perturbation radii (CIFAR-10, ϵ = 2/255), but much worse at large radii, e.g., attaining only 27.5% and 27.9% certifiable accuracy for CIFAR-10 at ϵ = 8/255 compared to SABR's 35.1%. Similarly, methods focusing purely on certified accuracy by directly optimizing over-approximations of the worst-case loss (Gowal et al., 2018b; Zhang et al., 2020) tend to perform well at large perturbation radii (MNIST, ϵ = 0.3 and CIFAR-10, ϵ = 8/255), but poorly at small ones: e.g., on CIFAR-10 at ϵ = 2/255, SABR improves natural accuracy to 79.2%, up from 66.8% and 71.5%, and certified accuracy even more significantly to 62.8%, up from 52.9% and 54.0%. On the particularly challenging TINYIMAGENET, SABR again dominates all existing certified training methods, improving certified and standard accuracy by almost 3%. To summarize, SABR improves strictly on all existing certified training methods across all commonly used benchmarks, with relative improvements exceeding 25% in some cases. In contrast to certified training methods, Zhang et al. (2022b) propose SORTNET, a generalization of recent architectures (Zhang et al., 2021; 2022a; Anil et al., 2019) with inherent ℓ_∞-robustness properties.
While SORTNET performs well at very high perturbation magnitudes (ϵ = 8/255 for CIFAR-10), it is dominated by SABR in all other settings. Further, its robustness can only be obtained against one perturbation type at a time.

Certification Method and Propagation Region Size

To analyze the interaction between the precision of the certification method and the size of the propagation region, we train a range of models with subselection ratios λ varying from 0.0125 to 1.0 and analyze them with verification methods of increasing precision (BOX, DEEPPOLY, MN-BAB). Further, we compute adversarial accuracies using a 50-step PGD attack (Madry et al., 2018) with 5 random restarts and the targeted logit-margin loss (Carlini & Wagner, 2017). We illustrate the results in Fig. 7 and observe that standard and adversarial accuracies increase with decreasing λ, as regularization decreases. For λ = 1, i.e., IBP training, we observe little difference between the verification methods. However, as we decrease λ, the BOX-verified accuracy decreases quickly, despite BOX relaxations being used during training. In contrast, using the most precise method, MN-BAB, we initially observe increasing certified accuracies, as the reduced regularization yields more accurate networks, before the level of regularization becomes insufficient for certification. While DEEPPOLY loses precision less quickly than BOX, it cannot benefit from the more accurate networks. This indicates that the increased accuracy, enabled by the reduced regularization, may rely on complex neuron interactions captured only by MN-BAB. These trends hold across perturbation magnitudes (Figs. 7a and 7b) and become even more pronounced for narrower networks (Fig. 7c), which are more easily over-regularized. This qualitatively different behavior depending on the precision of the certification method highlights the importance of recent advances in neural network verification for certified training. Even more importantly, these results clearly show that provably robust networks do not necessarily require the level of regularization introduced by IBP training.

Loss Analysis
In Fig. 8, we compare the robust loss of a SABR- and an IBP-trained network across different propagation region sizes (all centred around the original sample), depending on the bound propagation method used. We first observe that, when propagating the full input region (λ = 1), the SABR-trained network yields a much higher robust loss than the IBP-trained one. However, when comparing the respective training subselection ratios, λ = 0.05 for SABR and λ = 1.0 for IBP, SABR yields significantly smaller training losses. Even more importantly, the difference between the robust and the standard loss is significantly lower, which, recalling §4, directly corresponds to reduced regularization for robustness and allows the SABR-trained network to reach a much lower standard loss. Finally, we observe that the losses clearly grow super-linearly with increasing propagation region size (note the logarithmic scaling of the y-axis) when using the BOX relaxation, agreeing well with our theoretical results in §4. While the more precise DEEPPOLY (DP) bounds yield significantly reduced robust losses for the SABR-trained network, the IBP-trained network does not benefit at all, again highlighting its over-regularization. See App. C for extended results.

Gradient Alignment

To analyze whether SABR training is indeed better aligned with standard accuracy and empirical robustness, as indicated by our theory in §4, we conduct the following experiment for CIFAR-10 and ϵ = 2/255: we train one network using SABR with λ = 0.05 and one with IBP, corresponding to λ = 1.0. For both, we compute the gradients ∇_θ of their respective robust training losses L_rob and of the cross-entropy loss L_CE applied to unperturbed (Std.) and adversarial (Adv.) samples. We then report the mean cosine similarity between these gradients across the whole test set in Table 3. We clearly observe that the SABR loss is much better aligned with the cross-entropy loss of both unperturbed and adversarial samples, corresponding to standard accuracy and empirical robustness, respectively.

ReLU Activation States

The portion of ReLU activations that are (stably) active, inactive, or unstable has been identified as an important characteristic of certifiably trained networks (Shi et al., 2021). We evaluate these metrics for IBP-, SABR-, and adversarially (PGD) trained networks on CIFAR-10 at ϵ = 2/255, using the BOX relaxation to compute intermediate bounds, and report the average over all layers and test-set samples in Table 4. We observe that, when evaluated on concrete points, the SABR-trained network has around 37% more active ReLUs than the IBP-trained one and almost as many as the PGD-trained one, indicating a significantly smaller level of regularization. While, when evaluated on the whole input region, the SABR-trained network has around three times as many unstable ReLUs as the IBP-trained network, it has 20 times fewer than the PGD-trained one, highlighting its improved certifiability.

6. RELATED WORK

Verification Methods
Deterministic verification methods analyse a given network by using abstract interpretation (Gehr et al., 2018; Singh et al., 2018; 2019a), or by translating verification into an optimization problem which they then solve using linear programming (LP) (Palma et al., 2021; Müller et al., 2022; Wang et al., 2021; Zhang et al., 2022c), mixed-integer linear programming (MILP) (Tjeng et al., 2019; Singh et al., 2019b), or semidefinite programming (SDP) (Raghunathan et al., 2018; Dathathri et al., 2020). However, as neural network verification is generally NP-complete (Katz et al., 2017), many of these methods trade precision for scalability, yielding so-called incomplete certification methods, which might fail to prove robustness even when it holds. In this work, we analyze our SABR-trained networks with deterministic methods.

Certified Training
DIFFAI (Mirman et al., 2018) and IBP (Gowal et al., 2018b) minimize a sound over-approximation of the worst-case loss computed using the BOX relaxation. The idea of propagating subsets of the adversarial input region has been explored in the settings of adversarial patches (Chiang et al., 2020) and geometric perturbations (Balunovic et al., 2019), where the number of subsets required to cover the whole region is linear or constant in the input dimensionality. However, these methods are not applicable to the ℓ_p-perturbation setting we consider, where this scaling is exponential.

7. CONCLUSION

We introduced SABR (Small Adversarial Bounding Regions), a novel certified training method based on the key insight that propagating small but carefully selected subsets of the input region combines small approximation errors, and thus little over-regularization, with well-behaved optimization problems. This allows SABR-trained networks to outperform all existing certified training methods on all commonly used benchmarks in terms of both standard and certified accuracy. Even more importantly, SABR lays the foundation for a new class of certified training methods promising to alleviate the robustness-accuracy trade-off and to enable the training of networks that are both accurate and certifiably robust.

8. ETHICS STATEMENT

As SABR improves both certified and standard accuracy compared to existing approaches, it could help make real-world AI systems more robust to both malicious and random interference. Thus, any positive or negative societal effects of these systems could be amplified. Further, while we achieve state-of-the-art results on all considered benchmark problems, this does not (necessarily) indicate sufficient robustness for safety-critical real-world applications, but could give practitioners a false sense of security when using SABR-trained models.

9. REPRODUCIBILITY STATEMENT

We publish our code, all trained models, and detailed instructions on how to reproduce our results at https://github.com/eth-sri/sabr, providing an anonymized version to the reviewers. Further, we provide proofs for our theoretical contributions in App. A and a detailed description of all hyperparameter choices as well as a discussion of the used data sets including all preprocessing steps in App. C.






Figure 1: Comparison of exact (blue) and BOX (red) propagation through a one-layer network. We show the concrete points maximizing the logit difference y_2 − y_1 as a black × and the corresponding relaxation as a red ×.

Propagating a hyper-box with centre x̄_{i−1} and radius δ_{i−1} through an affine layer x_i = W x_{i−1} + b, we obtain the hyper-box relaxation of its output with centre x̄_i = W x̄_{i−1} + b and radius δ_i = |W| δ_{i−1}, where |·| denotes the elementwise absolute value.
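This affine BOX propagation can be sketched in a few lines (a minimal NumPy sketch; the function name is ours):

```python
import numpy as np

def box_affine(center, radius, W, b):
    """Propagate the hyper-box {x : |x - center| <= radius} through the
    affine layer y = W x + b: the output box has centre W @ center + b
    and radius |W| @ radius, with |.| the elementwise absolute value."""
    return W @ center + b, np.abs(W) @ radius

# Soundness check on a 2D example: every corner of the input box
# must land inside the output box.
W = np.array([[1.0, -2.0], [0.5, 1.0]])
b = np.array([0.1, -0.3])
c_in, r_in = np.array([1.0, 2.0]), np.array([0.25, 0.5])
c_out, r_out = box_affine(c_in, r_in, W, b)
for sx in (-1, 1):
    for sy in (-1, 1):
        corner = c_in + np.array([sx, sy]) * r_in
        assert np.all(np.abs(W @ corner + b - c_out) <= r_out + 1e-12)
```

Note that the output radius depends only on |W| and the input radius, which is what makes BOX both cheap and imprecise: correlations between dimensions are discarded at every layer.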

Figure 2: Illustration of SABR training. Instead of propagating a BOX approximation (dashed box) of the whole input region (red and green shapes in input space), SABR propagates a small subset of this region (solid box), selected to contain the adversarial example (black ×) and thus the misclassified region (red). The smaller BOX accumulates much fewer approximation errors during propagation, leading to a significantly smaller output relaxation, which induces much less regularization (medium blue) than training with the full region (large blue), but more than training with just the adversarial example (small blue).

Figure 3: Illustration of propagation region selection process.
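The selection process can be sketched as follows (our own minimal sketch following the description above; the function name, the parameter `lam`, and the clamping order are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def select_propagation_region(x, x_adv, eps, lam, lo=0.0, hi=1.0):
    """Select the small box SABR propagates: radius tau = lam * eps,
    centred as close to the adversarial example x_adv as possible while
    staying inside the original eps-box around x and the valid input
    range [lo, hi]."""
    tau = lam * eps
    center = np.clip(x_adv, x - eps + tau, x + eps - tau)  # stay inside eps-box
    center = np.clip(center, lo + tau, hi - tau)           # stay inside [lo, hi]
    return center, tau

# An adversarial example near the boundary of the eps-box gets pulled
# back just enough for the tau-box around it to fit inside.
center, tau = select_propagation_region(np.array([0.5]), np.array([0.62]), eps=0.1, lam=0.5)
```

The resulting box always contains points close to the adversarial example, so the subsequent BOX propagation still covers the (suspected) misclassified region while accumulating far fewer approximation errors.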

Figure 4: Input distribution for last ReLU layer depending on training method.

Figure 6: Certified over standard accuracy for different certified training methods. The upper right-hand corner is best.

Figure 7: Standard, adversarial and certified accuracy depending on the certification method (BOX, DEEPPOLY, and MN-BAB) for the first 1000 test set samples of CIFAR-10.

Figure 8: Standard (Std.) and robust cross-entropy loss, computed with BOX (Box) and DEEPPOLY (DP), for an IBP- and a SABR-trained network over evaluation subselection ratios λ.

Comparison of the standard (Acc.) and certified (Cert. Acc.) accuracy for different certified training methods on the full MNIST, CIFAR-10, and TINYIMAGENET test sets. We use MN-BAB (Ferrari et al., 2022) for certification and report other results from the relevant literature.

Comparison of natural (Nat.) and certified (Cert.) accuracy [%] to SORTNET (Zhang et al., 2022b).

Cosine similarity between ∇ θ L rob for IBP and SABR and ∇ θ L CE for ad-

Average percentage of active, inactive, and unstable ReLUs for concrete points and boxes depending on training method.
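The ReLU classification underlying these statistics follows directly from pre-activation bounds (a minimal sketch; `relu_states` is an illustrative helper, not from the paper's code):

```python
import numpy as np

def relu_states(lower, upper):
    """Fractions of ReLUs that are active (l >= 0), inactive (u <= 0),
    or unstable (l < 0 < u), given pre-activation bounds (l, u).
    For a concrete point, l == u and no neuron is unstable."""
    active = lower >= 0
    inactive = upper <= 0
    unstable = ~active & ~inactive
    return active.mean(), inactive.mean(), unstable.mean()

# Example pre-activation box bounds for four neurons:
lower = np.array([-1.0, 0.5, -0.2, 1.0])
upper = np.array([-0.1, 1.0, 0.3, 2.0])
```

Only unstable neurons introduce approximation errors under the BOX relaxation, which is why their share is a useful diagnostic for comparing training methods.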

Wong et al. (2018) instead use the DEEPZ relaxation (Singh et al., 2018), approximated using Cauchy random matrices. Wong & Kolter (2018) compute worst-case losses by back-substituting linear bounds using fixed relaxations. CROWN-IBP (Zhang et al., 2020) uses a similar back-substitution approach but leverages the minimal-area relaxations introduced by Zhang et al. (2018) and Singh et al. (2019a) to bound the worst-case loss, while computing intermediate bounds using the less precise but much faster BOX relaxation. Shi et al. (2021) show that they can obtain the same accuracies with much shorter training schedules by combining IBP training with a special initialization. COLT (Balunovic & Vechev, 2020) combines propagation using the DEEPZ relaxation with adversarial search. IBP-R (Palma et al., 2022) combines adversarial training with much larger perturbation radii and a ReLU-stability regularization based on the BOX relaxation. We compare favorably to all (recent) methods above in our experimental evaluation (see §5). Müller et al. (2021) combine certifiable and accurate networks to allow for more efficient trade-offs between robustness and accuracy.

Li et al. (2019), Lécuyer et al. (2019), and Cohen et al. (2019) construct locally Lipschitz classifiers by introducing randomness into the inference process, allowing them to derive probabilistic robustness guarantees. Extended in a variety of ways (Salman et al., 2019; Yang et al., 2020), these methods can obtain strong robustness guarantees with high probability (Salman et al., 2019) at the cost of significantly (100×) increased runtime during inference. We focus our comparison on deterministic methods. Zhang et al. (2021) propose a novel architecture which inherently exhibits ℓ∞-Lipschitzness properties, allowing them to efficiently derive corresponding robustness guarantees. Zhang et al. (2022a) build on this work by improving the challenging training process. Finally, Zhang et al. (2022b) generalize this concept in SORTNET.

ACKNOWLEDGEMENTS

We would like to thank our anonymous reviewers for their constructive comments and insightful questions. This work has been done as part of the EU grant ELSA (European Lighthouse on Secure and Safe AI, grant agreement no. 101070617) and the SERI grant SAFEAI (Certified Safe, Fair and Robust Artificial Intelligence, contract no. MB22.00088). Views and opinions expressed are however those of the authors only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the European Commission can be held responsible for them. The work has received funding from the Swiss State Secretariat for Education, Research and Innovation (SERI).

