CERTIFIED TRAINING: SMALL BOXES ARE ALL YOU NEED

Abstract

To obtain deterministic guarantees of adversarial robustness, specialized training methods are used. We propose SABR, a novel such certified training method, based on the key insight that propagating interval bounds for a small but carefully selected subset of the adversarial input region is sufficient to approximate the worst-case loss over the whole region while significantly reducing approximation errors. We show in an extensive empirical evaluation that SABR outperforms existing certified defenses in terms of both standard and certifiable accuracies across perturbation magnitudes and datasets, pointing to a new class of certified training methods promising to alleviate the robustness-accuracy trade-off.

1. INTRODUCTION

As neural networks are increasingly deployed in safety-critical domains, formal robustness guarantees against adversarial examples (Biggio et al., 2013; Szegedy et al., 2014) are becoming ever more important. However, despite significant progress, specialized training methods that improve certifiability at the cost of severely reduced accuracies are still required to obtain deterministic guarantees. Given an input region defined by an adversary specification, both training and certification methods compute a network's reachable set by propagating a symbolic over-approximation of this region through the network (Singh et al., 2018; 2019a; Gowal et al., 2018a). Depending on the propagation method, both the computational complexity and the approximation tightness can vary widely. For certified training, an over-approximation of the worst-case loss is computed from this reachable set and then optimized (Mirman et al., 2018; Wong et al., 2018). Surprisingly, the least precise propagation methods yield the highest certified accuracies, as more precise methods induce harder optimization problems (Jovanovic et al., 2021). However, the large approximation errors incurred by these imprecise methods lead to over-regularization and thus poor accuracy. Combining precise worst-case loss approximations with a tractable optimization problem is thus the core challenge of certified training. In this work, we tackle this challenge and propose a novel certified training method, SABR (Small Adversarial Bounding Regions), based on the following key insight: by propagating small but carefully selected subsets of the adversarial input region with imprecise methods (i.e., BOX), we can obtain both well-behaved optimization problems and precise approximations of the worst-case loss.
This yields less over-regularized networks, allowing SABR to improve on state-of-the-art certified defenses in terms of both standard and certified accuracies across settings, thereby pointing to a new class of certified training methods.
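To make the key insight above concrete, the following is a minimal geometric sketch (with a hypothetical helper name, not the authors' implementation): given a candidate propagation point x_adv inside the adversarial region B_ϵ(x), e.g. found by an attack, we shrink the region to a box of radius τ = λϵ and clamp its centre so that the small box stays entirely inside the original region.

```python
import numpy as np

def select_small_box(x, eps, x_adv, lam):
    """Hypothetical sketch: shrink the l_inf region B_eps(x) to a box of
    radius tau = lam * eps centred near x_adv, with the centre clamped so
    that the small box remains contained in the original region."""
    tau = lam * eps
    lo, hi = x - eps + tau, x + eps - tau  # feasible range for the new centre
    centre = np.clip(x_adv, lo, hi)
    return centre, tau

# Toy example: 2-D input, eps = 0.1, shrink ratio lam = 0.4.
x = np.array([0.5, 0.5])
x_adv = np.array([0.62, 0.41])  # e.g. produced by a PGD-style attack
c, tau = select_small_box(x, eps=0.1, x_adv=x_adv, lam=0.4)

# Containment: the small box [c - tau, c + tau] lies inside [x - eps, x + eps].
assert np.all(c - tau >= x - 0.1 - 1e-9) and np.all(c + tau <= x + 0.1 + 1e-9)
```

Propagating this much smaller box with an imprecise method like BOX incurs far smaller approximation errors, while still covering the part of the input region around the (approximate) worst-case input.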

Main Contributions Our main contributions are:

• A novel certified training method, SABR, reducing over-regularization to improve both standard and certified accuracy (§3).
• A theoretical investigation motivating SABR by deriving new insights into the growth of BOX relaxations during propagation (§4).
• An extensive empirical evaluation demonstrating that SABR outperforms all state-of-the-art certified training methods in terms of both standard and certifiable accuracies on MNIST, CIFAR-10, and TINYIMAGENET (§5).

2. BACKGROUND

In this section, we provide the necessary background for SABR.

Adversarial Robustness Consider a classification model h : R^din → R^c that, given an input x ∈ X ⊆ R^din, predicts numerical scores y := h(x) for every class. We say that h is adversarially robust on an ℓ_p-norm ball B^ϵp_p(x) of radius ϵ_p if it consistently predicts the target class t for all perturbed inputs x′ ∈ B^ϵp_p(x). More formally, we define adversarial robustness as:

arg max_j h(x′)_j = t, ∀x′ ∈ B^ϵp_p(x) := {x′ ∈ X | ∥x − x′∥_p ≤ ϵ_p}. (1)

Neural Network Verification To verify that a neural network h is adversarially robust, several verification techniques have been proposed. A simple but effective such method is verification with the BOX relaxation (Mirman et al., 2018), also called interval bound propagation (IBP) (Gowal et al., 2018b). Conceptually, we first compute an over-approximation of a network's reachable set by propagating the input region B^ϵp_p(x) through the neural network and then check whether all outputs in the reachable set yield the correct classification. This propagation sequentially computes a hyper-box (each dimension is described as an interval) relaxation of a layer's output, given a hyper-box input. As an example, consider an L-layer network h = f_L ∘ σ ∘ f_{L−1} ∘ … ∘ σ ∘ f_1, with linear layers f_i and ReLU activation functions σ. Given an input region B^ϵp_p(x), we over-approximate it as a hyper-box centered at x̄_0 := x and with radius δ_0 := ϵ_p, such that the i-th dimension of the input satisfies x_{0,i} ∈ [x̄_{0,i} − δ_{0,i}, x̄_{0,i} + δ_{0,i}]. Given a linear layer f_i(x_{i−1}) = W x_{i−1} + b =: x_i, we obtain the hyper-box relaxation of its output with centre x̄_i = W x̄_{i−1} + b and radius δ_i = |W| δ_{i−1}, where |·| denotes the elementwise absolute value. A ReLU activation ReLU(x_{i−1}) := max(0, x_{i−1}) can be over-approximated by propagating the lower and upper bounds separately, resulting in an output hyper-box with x̄_i = (u_i + l_i)/2 and δ_i = (u_i − l_i)/2, where l_i = ReLU(x̄_{i−1} − δ_{i−1}) and u_i = ReLU(x̄_{i−1} + δ_{i−1}).

Proceeding this way for all layers, we obtain lower and upper bounds on the network output y and can check whether the output score of the target class is greater than that of all other classes by computing the upper bound on the logit differences y^∆_i := y_i − y_t and then checking whether y^∆_i < 0, ∀i ≠ t.

We illustrate this propagation process for a one-layer network in Fig. 1. There, the blue shapes show the exact propagation of the input region and the red shapes their hyper-box relaxation. Note how, after the first linear and ReLU layers (third row), the relaxation (red) already contains many points not reachable via exact propagation (blue), despite it being the smallest hyper-box containing the exact region. These so-called approximation errors accumulate quickly, leading to an increasingly imprecise abstraction, as can be seen by comparing the two shapes after an additional linear layer (last row). To verify that this network classifies all inputs in [−1, 1]^2 to class 1, we have to show that the upper bound of the logit difference y_2 − y_1 is less than 0. While the concrete maximum y_2 − y_1 = −0.3 (black ×) is indeed less than 0, showing that the network is robust, the BOX relaxation only yields the bound y_2 − y_1 ≤ 0.6 (red ×) and is thus too imprecise to prove it.

Figure 1: Comparison of exact (blue) and BOX (red) propagation through a one-layer network. We show the concrete points maximizing the logit difference y_2 − y_1 as a black × and the corresponding relaxation as a red ×. (Panel equations, as rendered in the extracted text: B^1_∞(x_0) = [−1, 1]^2; x_1 = [0.5, 0.3; 0.2, 0.5] x_0 + [0.4, 0.4]; x_2 = ReLU(x_1); y = [0.7, −0.3; −0.3, 0.7] x_2 + [0.4, −0.4].)

Beyond BOX, more precise verification approaches track more relational information at the cost of increased computational complexity (Palma et al., 2022; Wang et al., 2021). A recent example is MN-BAB (Ferrari et al., 2022), which improves on BOX in two key ways: First, instead of propagating axis-aligned hyper-boxes, it uses much more expressive polyhedra, allowing linear layers to be captured exactly and ReLU layers much more precisely. Second, if the result is still too imprecise, the verification problem is recursively split into easier ones by introducing a case distinction between the two linear segments of the ReLU function. This is called the branch-and-bound (BaB) approach (Bunel et al., 2020). We refer the interested reader to Ferrari et al. (2022) for more details.

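The BOX propagation rules above can be sketched in a few lines of numpy. The network weights below are illustrative (randomly drawn for this sketch, not the network from Fig. 1); the soundness check at the end verifies that every concrete output of the network lies inside the certified output box.

```python
import numpy as np

def box_linear(c, r, W, b):
    # Affine layer: centre x̄_i = W x̄_{i-1} + b, radius δ_i = |W| δ_{i-1}.
    return W @ c + b, np.abs(W) @ r

def box_relu(c, r):
    # ReLU: propagate lower and upper bounds separately, re-centre the box.
    lo, hi = np.maximum(c - r, 0.0), np.maximum(c + r, 0.0)
    return (hi + lo) / 2, (hi - lo) / 2

def box_bounds(x, eps, layers):
    # Propagate the hyper-box B_eps(x) through alternating linear/ReLU layers.
    c, r = x.astype(float), np.full_like(x, eps, dtype=float)
    for i, (W, b) in enumerate(layers):
        c, r = box_linear(c, r, W, b)
        if i < len(layers) - 1:  # ReLU after every layer except the last
            c, r = box_relu(c, r)
    return c - r, c + r  # elementwise lower/upper output bounds

# Illustrative 2-2-2 network with random weights.
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((2, 2)), rng.standard_normal(2)) for _ in range(2)]
x, eps = np.zeros(2), 1.0
lo, hi = box_bounds(x, eps, layers)

def forward(z):  # exact network evaluation for comparison
    for i, (W, b) in enumerate(layers):
        z = W @ z + b
        if i < len(layers) - 1:
            z = np.maximum(z, 0.0)
    return z

# Soundness: sampled concrete outputs must lie inside the certified box.
for _ in range(1000):
    z = forward(x + rng.uniform(-eps, eps, size=2))
    assert np.all(lo - 1e-9 <= z) and np.all(z <= hi + 1e-9)

# Robustness check for target class t: ub(y_i - y_t) <= hi[i] - lo[t] < 0.
t = 0
certified = all(hi[i] - lo[t] < 0 for i in range(2) if i != t)
```

Note that bounding the logit difference as hi[i] − lo[t] is a simple interval subtraction; tighter bounds can be obtained by folding the difference into the last linear layer before propagation.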
