SCALING THE CONVEX BARRIER WITH ACTIVE SETS
Published as a conference paper at ICLR 2021

Abstract

Tight and efficient neural network bounding is of critical importance for the scaling of neural network verification systems. A number of efficient specialised dual solvers for neural network bounds have been presented recently, but they are often too loose to verify more challenging properties. This lack of tightness is linked to the weakness of the employed relaxation, which is usually a linear program of size linear in the number of neurons. While a tighter linear relaxation for piecewise linear activations exists, it comes at the cost of exponentially many constraints and thus currently lacks an efficient customised solver. We alleviate this deficiency via a novel dual algorithm that realises the full potential of the new relaxation by operating on a small active set of dual variables. Our method recovers the strengths of the new relaxation in the dual space: tightness and a linear separation oracle. At the same time, it shares the benefits of previous dual approaches for weaker relaxations: massive parallelism, GPU implementation, low cost per iteration and valid bounds at any time. As a consequence, we obtain better bounds than off-the-shelf solvers in only a fraction of their running time and recover the speed-accuracy trade-offs of looser dual solvers if the computational budget is small. We demonstrate that this results in significant formal verification speed-ups.

1. INTRODUCTION

Verification requires formally proving or disproving that a given property of a neural network holds over all inputs in a specified domain. We consider properties in their canonical form (Bunel et al., 2018), which requires us to either: (i) prove that no input results in a negative output (property is true); or (ii) identify a counter-example (property is false). The search for counter-examples is typically performed by efficient methods such as random sampling of the input domain (Webb et al., 2019) or projected gradient descent (Carlini & Wagner, 2017). In contrast, establishing the veracity of a property requires solving a suitable convex relaxation to obtain a lower bound on the minimum output. If the lower bound is positive, the given property is true. If the bound is negative and no counter-example is found, either: (i) we make no conclusions regarding the property (incomplete verification); or (ii) we further refine the counter-example search and lower bound computation within a branch-and-bound framework until we reach a concrete conclusion (complete verification). The main bottleneck of branch and bound is the computation of the lower bound for each node of the enumeration tree via convex optimization. While earlier works relied on off-the-shelf solvers (Ehlers, 2017; Bunel et al., 2018), it was quickly established that such an approach does not scale up gracefully with the size of the neural network. This has motivated researchers to design specialized dual solvers (Dvijotham et al., 2019; Bunel et al., 2020a), thereby providing initial evidence that verification can be realised in practice. However, the convex relaxation considered by these dual solvers is itself very weak (Ehlers, 2017), hitting what is now commonly referred to as the "convex barrier" (Salman et al., 2019). In practice, this implies that either several properties remain undecided in incomplete verification, or that they take several hours to be verified exactly.
Multiple works have tried to overcome the convex barrier for piecewise linear activations (Raghunathan et al., 2018; Singh et al., 2019). Here, we focus on the single-neuron Linear Programming (LP) relaxation by Anderson et al. (2020). Unfortunately, its tightness comes at the price of exponentially many (in the number of variables) constraints. Therefore, existing dual solvers (Dvijotham et al., 2018; Bunel et al., 2020a) are not easily applicable, limiting the scaling of the new relaxation. We address this problem by presenting a specialized dual solver for the relaxation by Anderson et al. (2020), which realises its full potential by meeting the following desiderata:

• By keeping an active set of dual variables, we obtain a sparse dual solver that recovers the strengths of the original primal problem (Anderson et al., 2020) in the dual domain. In line with previous dual solvers, our approach yields valid bounds anytime, leverages convolutional network structure and enjoys massive parallelism within a GPU implementation, resulting in better bounds in an order of magnitude less time than off-the-shelf solvers (Gurobi Optimization, 2020).

• We present a unified dual treatment that includes both a linearly sized LP relaxation (Ehlers, 2017) and the tighter formulation. As a consequence, our solver provides a wide range of speed-accuracy trade-offs: (i) it is competitive with dual approaches on the looser relaxation (Dvijotham et al., 2018; Bunel et al., 2020a); and (ii) it yields much tighter bounds if a larger computational budget is available. Owing to this flexibility, we show that our dual algorithm yields large complete verification gains compared to primal approaches (Anderson et al., 2020) and previous dual algorithms.

2. PRELIMINARIES: NEURAL NETWORK RELAXATIONS

We denote vectors by bold lower case letters (for example, $\mathbf{x}$) and matrices by upper case letters (for example, $W$). We use $\odot$ for the Hadamard product, $[\![a, b]\!]$ for integer ranges, $\mathbb{1}_a$ for the indicator vector on condition $a$, and brackets for intervals ($[l_k, u_k]$) and vector or matrix entries ($x[i]$ or $W[i,j]$). In addition, given $W \in \mathbb{R}^{m \times n}$ and $x \in \mathbb{R}^m$, we employ $W \boxdot x := \sum_i \mathrm{col}_i(W) \odot x = (W\mathbf{1}) \odot x$ and $W \circledast x := \sum_i \mathrm{col}_i(W)^T x = x^T W \mathbf{1}$ as shorthands, where $\mathrm{col}_i(W)$ denotes the $i$-th column of matrix $W$.

Let $\mathcal{C}$ be the network input domain. Similarly to Dvijotham et al. (2018) and Bunel et al. (2020a), we assume that linear minimisation over $\mathcal{C}$ can be performed in closed form. Our goal is to compute bounds on the scalar output of a piecewise-linear feedforward neural network. The tightest possible lower bound can be obtained by solving the following optimization problem:

$$\min_{x, \hat{x}} \ \hat{x}_n \quad \text{s.t.} \quad x_0 \in \mathcal{C}, \tag{1a}$$
$$\hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \qquad k \in [\![0, n-1]\!], \tag{1b}$$
$$x_k = \sigma(\hat{x}_k) \qquad k \in [\![1, n-1]\!], \tag{1c}$$

where the activation function $\sigma(\hat{x}_k)$ is piecewise-linear, $\hat{x}_k, x_k \in \mathbb{R}^{n_k}$ denote the outputs of the $k$-th linear layer (fully-connected or convolutional) and activation function respectively, $W_k$ and $b_k$ denote its weight matrix and bias, and $n_k$ is the number of activations at layer $k$. We will focus on the ReLU case ($\sigma(x) = \max(x, 0)$), as common piecewise-linear functions can be expressed as a composition of ReLUs (Bunel et al., 2020b).

Problem (1) is non-convex due to the non-linearity of the activation function (1c). As solving it is NP-hard (Katz et al., 2017), it is commonly approximated by a convex relaxation (see §4). The quality of the corresponding bounds, which is fundamental in verification, depends on the tightness of the relaxation. Unfortunately, tighter relaxations usually correspond to slower bounding procedures. We first review a popular ReLU relaxation in §2.1, and then consider a tighter one in §2.2.
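Appendix E details how the intermediate bounds used throughout the paper are computed; one common cheap option is interval arithmetic over the linear layers (in the spirit of Gowal et al., 2018, cited in §4). The following minimal numpy sketch (function and variable names are ours, not from the paper's codebase) propagates a box through one linear layer and a ReLU:

```python
import numpy as np

def interval_linear(l, u, W, b):
    """Propagate elementwise bounds [l, u] through x -> W x + b
    using the centre/radius form: the output radius is |W| @ radius."""
    c, r = (u + l) / 2.0, (u - l) / 2.0
    c_out, r_out = W @ c + b, np.abs(W) @ r
    return c_out - r_out, c_out + r_out

def relu_bounds(l_hat, u_hat):
    """Post-activation bounds for x = max(x_hat, 0)."""
    return np.maximum(l_hat, 0.0), np.maximum(u_hat, 0.0)

# toy layer with x_0 in [-1, 1]^2
W, b = np.array([[1.0, -1.0], [2.0, 1.0]]), np.array([0.0, -1.0])
l_hat, u_hat = interval_linear(np.array([-1.0, -1.0]), np.array([1.0, 1.0]), W, b)
# l_hat = [-2, -4], u_hat = [2, 2]
```

Such interval bounds are generally looser than those obtained by solving the relaxations below, but they suffice to define the box constraints that the relaxations rely on.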

2.1. PLANET RELAXATION

The so-called Planet relaxation (Ehlers, 2017) has enjoyed widespread use due to its amenability to efficient customised solvers (Dvijotham et al., 2018; Bunel et al., 2020a) and is the "relaxation of choice" for many works in the area (Bunel et al., 2020b; Lu & Kumar, 2020). Here, we describe it in its non-projected form $\mathcal{M}_k$, the LP relaxation of the Big-M Mixed Integer Programming (MIP) formulation (Tjeng et al., 2019). Applying $\mathcal{M}_k$ to problem (1) results in:

$$\begin{aligned}
\min_{x, \hat{x}, z} \ & \hat{x}_n \quad \text{s.t.} \quad x_0 \in \mathcal{C},\\
& \hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \qquad k \in [\![0, n-1]\!],\\
& \left.\begin{array}{l}
x_k \geq \hat{x}_k, \quad x_k \leq \hat{u}_k \odot z_k, \quad x_k \leq \hat{x}_k - \hat{l}_k \odot (1 - z_k),\\
(x_k, \hat{x}_k, z_k) \in [l_k, u_k] \times [\hat{l}_k, \hat{u}_k] \times [0, 1]
\end{array}\right\} := \mathcal{M}_k \qquad k \in [\![1, n-1]\!],
\end{aligned} \tag{2}$$

where $\hat{l}_k, \hat{u}_k$ and $l_k, u_k$ are intermediate bounds on pre-activation variables $\hat{x}_k$ and post-activation variables $x_k$, respectively. These constants play an important role in the structure of $\mathcal{M}_k$ and, together with the relaxed binary constraints on $z$, define box constraints on the variables. We detail how to compute intermediate bounds in appendix E.

Projecting out the auxiliary variables $z$ results in the Planet relaxation (cf. appendix B.1 for details), which replaces (1c) by its convex hull. Problem (2), which is linearly sized, can be easily solved via commercial black-box LP solvers (Bunel et al., 2018). This does not scale up well with the size of the neural network, motivating the need for specialised solvers. Customised dual solvers have been designed by relaxing constraints (1b), (1c) (Dvijotham et al., 2018) or by replacing (1c) with the Planet relaxation and employing Lagrangian Decomposition (Bunel et al., 2020a). Both approaches yield bounds very close to optimality for problem (2) in only a fraction of the runtime of off-the-shelf solvers.
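To make $\mathcal{M}_k$ concrete for a single ReLU, the toy sketch below (ours; it uses scipy's LP solver, not the paper's implementation) maximises the post-activation value $x$ at a fixed pre-activation $\hat{x}$ subject to the Big-M constraints. Projecting out $z$ recovers the upper edge of the Planet triangle, $\hat{u}(\hat{x} - \hat{l})/(\hat{u} - \hat{l})$:

```python
import numpy as np
from scipy.optimize import linprog

def bigm_max_postact(x_hat, l_hat, u_hat):
    """max x over M_k for one neuron at fixed pre-activation x_hat:
    x >= x_hat, x >= 0, x <= u_hat * z, x <= x_hat - l_hat * (1 - z), z in [0, 1].
    Variables are (x, z); linprog minimises, so the objective is negated."""
    c = np.array([-1.0, 0.0])
    A_ub = np.array([[1.0, -u_hat],      # x - u_hat * z <= 0
                     [1.0, -l_hat]])     # x - l_hat * z <= x_hat - l_hat
    b_ub = np.array([0.0, x_hat - l_hat])
    bounds = [(max(x_hat, 0.0), None), (0.0, 1.0)]
    return linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds).x[0]

y = bigm_max_postact(x_hat=0.0, l_hat=-1.0, u_hat=1.0)  # ≈ 0.5, the triangle's upper edge
```

For an ambiguous ReLU ($\hat{l} < 0 < \hat{u}$), the optimal $z$ interpolates between the two linear pieces, which is exactly the source of the relaxation's looseness.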

2.2. A TIGHTER RELAXATION

A much tighter approximation of problem (1) than the Planet relaxation ( §2.1) can be obtained by representing the convex hull of the composition of (1b) and (1c) rather than the convex hull of (1c) alone. A formulation of this type was recently introduced by Anderson et al. (2020) .

Let us define $\check{L}_{k-1}, \check{U}_{k-1} \in \mathbb{R}^{n_k \times n_{k-1}}$ as:

$$\check{L}_{k-1}[i,j] = l_{k-1}[j]\, \mathbb{1}_{W_k[i,j] \geq 0} + u_{k-1}[j]\, \mathbb{1}_{W_k[i,j] < 0}, \qquad \check{U}_{k-1}[i,j] = u_{k-1}[j]\, \mathbb{1}_{W_k[i,j] \geq 0} + l_{k-1}[j]\, \mathbb{1}_{W_k[i,j] < 0}.$$

Additionally, let us introduce $2^{W_k} = \{0, 1\}^{n_k \times n_{k-1}}$, the set of all possible binary masks of weight matrix $W_k$, and $\mathcal{E}_k := 2^{W_k} \setminus \{\mathbf{0}, \mathbf{1}\}$, which excludes the all-zero and all-one masks. The new representation results in the following primal problem:

$$\begin{aligned}
\min_{x, \hat{x}, z} \ & \hat{x}_n \quad \text{s.t.} \quad x_0 \in \mathcal{C},\\
& \hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \qquad k \in [\![0, n-1]\!],\\
& \left.\begin{array}{l}
(x_k, \hat{x}_k, z_k) \in \mathcal{M}_k,\\
x_k \leq (W_k \odot I_k)\, x_{k-1} + z_k \odot b_k - \big((W_k \odot I_k \odot \check{L}_{k-1})\mathbf{1}\big) \odot (1 - z_k)\\
\qquad\quad + \big((W_k \odot (1 - I_k) \odot \check{U}_{k-1})\mathbf{1}\big) \odot z_k \qquad \forall I_k \in \mathcal{E}_k
\end{array}\right\} := \mathcal{A}_k \qquad k \in [\![1, n-1]\!].
\end{aligned} \tag{3}$$

Both $\mathcal{M}_k$ and $\mathcal{A}_k$ yield valid MIP formulations for problem (1) when imposing integrality constraints on $z$. However, the LP relaxation of $\mathcal{A}_k$ yields tighter bounds. In the worst case, this tightness comes at the cost of exponentially many constraints: one for each $I_k \in \mathcal{E}_k$. On the other hand, given a set of primal assignments $(x, z)$ that are not necessarily feasible for problem (3), one can efficiently compute the most violated constraint (if any) at that point. The mask associated with such a constraint can be computed in linear time (Anderson et al., 2020) as:

$$I_k[i,j] = \mathbb{1}_{\left( (1 - z_k[i])\, \check{L}_{k-1}[i,j] + z_k[i]\, \check{U}_{k-1}[i,j] - x_{k-1}[j] \right) W_k[i,j] \geq 0}. \tag{4}$$

We point out that $\mathcal{A}_k$ slightly differs from the original formulation of Anderson et al. (2020), which does not explicitly include pre-activation bounds $\hat{l}_k, \hat{u}_k$ (which we treat via $\mathcal{M}_k$). While this was implicitly addressed in practical applications (Botoeva et al., 2020), not doing so has a strong negative effect on bound tightness, possibly to the point of yielding looser bounds than problem (2). In appendix F, we provide an example in which this is the case and extend the original derivation by Anderson et al. (2020) to recover $\mathcal{A}_k$ as in problem (3).

Owing to the exponential number of constraints, problem (3) cannot be solved as it stands. As outlined by Anderson et al. (2020), the availability of a linear-time separation oracle (4) offers a natural primal cutting plane algorithm, which can be implemented in off-the-shelf solvers: solve the Big-M LP (2), then iteratively add the most violated constraints from $\mathcal{A}_k$ at the optimal solution. When applied to the verification of small neural networks via off-the-shelf MIP solvers, this leads to substantial gains with respect to the looser Big-M relaxation (Anderson et al., 2020).
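The separation oracle (4) vectorises directly; a numpy sketch under our own naming conventions:

```python
import numpy as np

def check_bounds(W, l_prev, u_prev):
    """Elementwise matrices L_check, U_check in R^{n_k x n_{k-1}}: pick the
    lower (resp. upper) input bound where the weight is non-negative."""
    pos = W >= 0
    L = np.where(pos, l_prev[None, :], u_prev[None, :])
    U = np.where(pos, u_prev[None, :], l_prev[None, :])
    return L, U

def most_violated_mask(W, L, U, x_prev, z):
    """Mask I_k of the most violated constraint of A_k at (x_prev, z), as in eq. (4)."""
    score = ((1.0 - z)[:, None] * L + z[:, None] * U - x_prev[None, :]) * W
    return (score >= 0).astype(np.int64)

W = np.array([[1.0, -2.0]])  # one output neuron, two inputs
L, U = check_bounds(W, l_prev=np.array([-1.0, -1.0]), u_prev=np.array([1.0, 1.0]))
I = most_violated_mask(W, L, U, x_prev=np.array([0.5, 0.5]), z=np.array([1.0]))
```

The cost is a single elementwise pass over the weight matrix, i.e., linear in the number of weights, which is what makes cutting-plane and active-set schemes over $\mathcal{A}_k$ tractable.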

3. AN EFFICIENT DUAL SOLVER FOR THE TIGHTER RELAXATION

Inspired by the success of dual approaches on looser relaxations (Bunel et al., 2020a; Dvijotham et al., 2019), we show that the formal verification gains by Anderson et al. (2020) (see §2.2) scale to larger networks if we solve the tighter relaxation in the dual space. Due to the particular structure of the relaxation, a customised solver for problem (3) needs to meet a number of requirements.

Fact 1. In order to replicate the success of previous dual algorithms on looser relaxations, we need a solver for problem (3) with the following properties: (i) sparsity: a memory cost linear in the number of network activations, in spite of the exponentially many constraints; (ii) tightness: the bounds should reflect the quality of those obtained in the primal space; (iii) anytime: low cost per iteration and valid bounds at each step.

The anytime requirement motivates dual solutions: any dual assignment yields a valid bound due to weak duality. Unfortunately, as shown in appendix A, neither of the two dual derivations by Bunel et al. (2020a) and Dvijotham et al. (2018) readily satisfies all desiderata at once. Therefore, we need a completely different approach. Let us introduce dual variables $\alpha, \beta$ and functions thereof:

$$\begin{aligned}
f_k(\alpha, \beta) &= \alpha_k - W_{k+1}^T \alpha_{k+1} - \sum_{I_k} \beta_{k, I_k} + \sum_{I_{k+1}} (W_{k+1} \odot I_{k+1})^T \beta_{k+1, I_{k+1}},\\
g_k(\beta) &= \sum_{I_k \in \mathcal{E}_k} \big((W_k \odot (1 - I_k) \odot \check{U}_{k-1})\mathbf{1}\big) \odot \beta_{k, I_k} + \hat{u}_k \odot \beta_{k, 0} + \hat{l}_k \odot \beta_{k, 1}\\
&\quad + \sum_{I_k \in \mathcal{E}_k} \big((W_k \odot I_k \odot \check{L}_{k-1})\mathbf{1}\big) \odot \beta_{k, I_k} + \sum_{I_k \in \mathcal{E}_k} b_k \odot \beta_{k, I_k},
\end{aligned} \tag{5}$$

where $\sum_{I_k}$ is a shorthand for $\sum_{I_k \in 2^{W_k}}$. Starting from primal (3), we relax all constraints in $\mathcal{A}_k$ except the box constraints (see §2.1). We obtain the following dual problem (derivation in appendix C), where functions $f_k, g_k$ appear in inner products with primal variables $x_k, z_k$:

$$\begin{aligned}
\max_{(\alpha, \beta) \geq 0} \ & d(\alpha, \beta) \quad \text{where:} \quad d(\alpha, \beta) := \min_{x, z} \mathcal{L}(x, z, \alpha, \beta),\\
\mathcal{L}(x, z, \alpha, \beta) &= \sum_{k=1}^{n-1} b_k^T \alpha_k - \sum_{k=0}^{n-1} f_k(\alpha, \beta)^T x_k - \sum_{k=1}^{n-1} g_k(\beta)^T z_k\\
&\quad + \sum_{k=1}^{n-1} \Big[ \sum_{I_k \in \mathcal{E}_k} \beta_{k, I_k}^T (W_k \odot I_k \odot \check{L}_{k-1}) \mathbf{1} + \beta_{k, 1}^T (\hat{l}_k - b_k) \Big]\\
\text{s.t.} \quad & x_0 \in \mathcal{C}, \quad (x_k, z_k) \in [l_k, u_k] \times [0, 1] \quad k \in [\![1, n-1]\!].
\end{aligned} \tag{6}$$
This is again a challenging problem: the exponentially many constraints in the primal (3) are now associated to an exponential number of variables. Nevertheless, we show that the requirements of Fact 1 can be met by operating on a restricted version of dual (6). To this end, we present Active Set, a specialised solver for the relaxation by Anderson et al. (2020) that is sparse, anytime and yields bounds reflecting the tightness of the new relaxation. Starting from the dual of problem (2), our solver iteratively adds variables to a small active set of dual variables β B and solves the resulting reduced version of problem (6). We first describe our solver on a fixed β B and then outline how to iteratively modify the active set ( §3.2). Pseudo-code can be found in appendix D.

3.1. ACTIVE SET SOLVER

We want to solve a version of problem (6) for which the sums over the $I_k$ masks of each layer $k$ are restricted to $\mathcal{B}_k \subseteq \mathcal{E}_k$, with $\mathcal{B} = \cup_k \mathcal{B}_k$. By keeping $\mathcal{B} = \emptyset$, we recover a novel dual solver for the Big-M relaxation (2) (explicitly described in appendix B), which is employed as initialisation. Setting $\beta_{k, I_k} = 0$ for all $I_k \in \mathcal{E}_k \setminus \mathcal{B}_k$ in (5), (6) and removing these variables from the formulation, we obtain:

$$\begin{aligned}
f_{\mathcal{B},k}(\alpha, \beta_\mathcal{B}) &= \alpha_k - W_{k+1}^T \alpha_{k+1} - \sum_{I_k \in \mathcal{B}_k \cup \{\mathbf{0}, \mathbf{1}\}} \beta_{k, I_k} + \sum_{I_{k+1} \in \mathcal{B}_{k+1} \cup \{\mathbf{0}, \mathbf{1}\}} (W_{k+1} \odot I_{k+1})^T \beta_{k+1, I_{k+1}},\\
g_{\mathcal{B},k}(\beta_\mathcal{B}) &= \sum_{I_k \in \mathcal{B}_k} \big((W_k \odot (1 - I_k) \odot \check{U}_{k-1})\mathbf{1}\big) \odot \beta_{k, I_k} + \hat{u}_k \odot \beta_{k, 0} + \hat{l}_k \odot \beta_{k, 1}\\
&\quad + \sum_{I_k \in \mathcal{B}_k} \big((W_k \odot I_k \odot \check{L}_{k-1})\mathbf{1}\big) \odot \beta_{k, I_k} + \sum_{I_k \in \mathcal{B}_k} b_k \odot \beta_{k, I_k},
\end{aligned} \tag{7}$$

along with the reduced dual problem:

$$\begin{aligned}
\max_{(\alpha, \beta_\mathcal{B}) \geq 0} \ & d_\mathcal{B}(\alpha, \beta_\mathcal{B}) \quad \text{where:} \quad d_\mathcal{B}(\alpha, \beta_\mathcal{B}) := \min_{x, z} \mathcal{L}_\mathcal{B}(x, z, \alpha, \beta_\mathcal{B}),\\
\mathcal{L}_\mathcal{B}(x, z, \alpha, \beta_\mathcal{B}) &= \sum_{k=1}^{n-1} b_k^T \alpha_k - \sum_{k=0}^{n-1} f_{\mathcal{B},k}(\alpha, \beta_\mathcal{B})^T x_k - \sum_{k=1}^{n-1} g_{\mathcal{B},k}(\beta_\mathcal{B})^T z_k\\
&\quad + \sum_{k=1}^{n-1} \Big[ \sum_{I_k \in \mathcal{B}_k} \beta_{k, I_k}^T (W_k \odot I_k \odot \check{L}_{k-1}) \mathbf{1} + \beta_{k, 1}^T (\hat{l}_k - b_k) \Big]\\
\text{s.t.} \quad & x_0 \in \mathcal{C}, \quad (x_k, z_k) \in [l_k, u_k] \times [0, 1] \quad k \in [\![1, n-1]\!].
\end{aligned} \tag{8}$$

We can maximise $d_\mathcal{B}(\alpha, \beta_\mathcal{B})$, which is concave and non-smooth, via projected supergradient ascent or variants thereof, such as Adam (Kingma & Ba, 2015). In order to obtain a valid supergradient, we need to perform the inner minimisation over the primals. Thanks to the structure of problem (8), the optimisation decomposes over the layers. For $k \in [\![1, n-1]\!]$, we can perform the minimisation in closed form by driving the primals to their upper or lower bounds depending on the signs of their coefficients:

$$x_k^* = \mathbb{1}_{f_{\mathcal{B},k}(\alpha, \beta_\mathcal{B}) \geq 0} \odot u_k + \mathbb{1}_{f_{\mathcal{B},k}(\alpha, \beta_\mathcal{B}) < 0} \odot l_k, \qquad z_k^* = \mathbb{1}_{g_{\mathcal{B},k}(\beta_\mathcal{B}) \geq 0}. \tag{9}$$

The subproblem corresponding to $x_0$ is different, as it involves a linear minimisation over $x_0 \in \mathcal{C}$:

$$x_0^* \in \operatorname{argmin}_{x_0} \ -f_{\mathcal{B},0}(\alpha, \beta_\mathcal{B})^T x_0 \quad \text{s.t.} \quad x_0 \in \mathcal{C}. \tag{10}$$

We assumed in §2 that (10) can be performed efficiently. We refer the reader to Bunel et al. (2020a) for descriptions of the minimisation when $\mathcal{C}$ is an $\ell_\infty$ or $\ell_2$ ball, as is common for adversarial examples.
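For illustration, when $\mathcal{C}$ is an $\ell_\infty$ ball of radius $\epsilon$ around an input $\bar{x}$, both inner minimisations admit one-line solutions. A numpy sketch (names ours; it follows the sign convention of (9), where each coordinate moves towards its upper bound when its coefficient is non-negative):

```python
import numpy as np

def inner_min_box(f_k, l_k, u_k):
    """Closed-form minimiser over the box [l_k, u_k], as in eq. (9): each
    coordinate goes to its upper bound where f_k >= 0, else to its lower bound."""
    return np.where(f_k >= 0, u_k, l_k)

def inner_min_linf(f_0, x_bar, eps):
    """Linear minimisation over the l_inf ball of radius eps around x_bar,
    with the same convention: each coordinate moves in the direction of f_0."""
    return x_bar + eps * np.sign(f_0)

x_star = inner_min_box(np.array([1.0, -2.0]), np.array([-1.0, -1.0]), np.array([2.0, 3.0]))
x0_star = inner_min_linf(np.array([1.0, -2.0]), np.zeros(2), 0.1)
```

Both operations are elementwise, which is what allows the inner minimisation to be batched over all neurons and all verification subproblems on a GPU (§3.3).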
Given $(x^*, z^*)$ as above, the supergradient of $d_\mathcal{B}(\alpha, \beta_\mathcal{B})$ is a subset of the one for $d(\alpha, \beta)$, which is given by:

$$\begin{aligned}
\nabla_{\alpha_k} d(\alpha, \beta) &= W_k x_{k-1}^* + b_k - x_k^*,\\
\nabla_{\beta_{k,0}} d(\alpha, \beta) &= x_k^* - \hat{u}_k \odot z_k^*,\\
\nabla_{\beta_{k,1}} d(\alpha, \beta) &= x_k^* - (W_k x_{k-1}^* + b_k) + \hat{l}_k \odot (1 - z_k^*),\\
\nabla_{\beta_{k,I_k}} d(\alpha, \beta) &= x_k^* - (W_k \odot I_k)\, x_{k-1}^* + \big((W_k \odot I_k \odot \check{L}_{k-1})\mathbf{1}\big) \odot (1 - z_k^*) - b_k \odot z_k^*\\
&\quad - \big((W_k \odot (1 - I_k) \odot \check{U}_{k-1})\mathbf{1}\big) \odot z_k^* \qquad I_k \in \mathcal{B}_k,
\end{aligned} \tag{11}$$

for each $k \in [\![1, n-1]\!]$. At each iteration, after taking a step in the supergradient direction, the dual variables are projected onto the non-negative orthant by clipping negative values.
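Schematically, each dual iteration is thus a supergradient step followed by a projection that clips negative entries. The toy sketch below (ours, on a generic concave objective rather than $d_\mathcal{B}$) illustrates the update rule:

```python
import numpy as np

def projected_supergradient_ascent(supergrad, beta0, lr=0.1, steps=200):
    """Maximise a concave function over the non-negative orthant:
    step along a supergradient, then clip negative entries to zero."""
    beta = np.maximum(beta0, 0.0)
    for _ in range(steps):
        beta = np.maximum(beta + lr * supergrad(beta), 0.0)
    return beta

# toy concave objective d(beta) = -(beta[0]-2)^2 - (beta[1]+1)^2,
# whose maximiser over beta >= 0 is (2, 0)
grad = lambda beta: np.array([-2.0 * (beta[0] - 2.0), -2.0 * (beta[1] + 1.0)])
beta_opt = projected_supergradient_ascent(grad, np.zeros(2))
```

In practice the paper replaces the plain step with Adam-style updates, but the projection by clipping is unchanged.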

3.2. EXTENDING THE ACTIVE SET

We initialise the dual (6) with a tight bound on the Big-M relaxation by solving for $d_\emptyset(\alpha, \beta_\emptyset)$ in (8). To satisfy the tightness requirement in Fact 1, we then need to include constraints (via their Lagrangian multipliers) from the exponential family of $\mathcal{A}_k$ into $\mathcal{B}_k$. Our goal is to tighten the bounds as much as possible while keeping the active set small to save memory and compute. The active set strategy is defined by a selection criterion for the $I_k^*$ to be added to $\mathcal{B}_k$, and by the frequency of addition. In practice, we add the variables maximising the entries of the supergradient $\nabla_{\beta_{k, I_k}} d(\alpha, \beta)$ after a fixed number of dual iterations. We now provide motivation for both choices.

Selection criterion

The selection criterion needs to be computationally efficient. Thus, we proceed greedily, focusing only on the immediate effect at the current iteration. Let us map a restricted set of dual variables $\beta_\mathcal{B}$ to a set of dual variables $\beta$ for the full dual (6) by setting the variables not in the active set to zero: $\beta_{\mathcal{E} \setminus \mathcal{B}} = 0$, and $\beta = \beta_\mathcal{B} \cup \beta_{\mathcal{E} \setminus \mathcal{B}}$. Then, for each layer $k$, we add the set of variables $\beta_{k, I_k^*}$ maximising the corresponding entries of the supergradient of the full dual problem (6):

$$\beta_{k, I_k^*} \in \operatorname{argmax}_{\beta_{k, I_k}} \left\{ \nabla_{\beta_{k, I_k}} d(\alpha, \beta)^T \mathbf{1} \right\}.$$

Therefore, we use the subderivatives as a proxy for the short-term improvement on the full dual objective $d(\alpha, \beta)$. Under a primal interpretation, our selection criterion involves a call to the separation oracle (4) by Anderson et al. (2020).

Proposition 1. $\beta_{k, I_k^*}$ (as defined above) represents the Lagrangian multipliers associated with the most violated constraints from $\mathcal{A}_k$ at $(x^*, z^*) \in \operatorname{argmin}_{x,z} \mathcal{L}_\mathcal{B}(x, z, \alpha, \beta_\mathcal{B})$, the primal minimiser of the current restricted Lagrangian.

Proof. See appendix D.1.

Frequency

Finally, we need to decide the frequency at which to add variables to the active set.

Fact 2. Assume we obtained a dual solution $(\alpha^\dagger, \beta_\mathcal{B}^\dagger) \in \operatorname{argmax} d_\mathcal{B}(\alpha, \beta_\mathcal{B})$ using Active Set on the current $\mathcal{B}$. Then $(x^*, z^*) \in \operatorname{argmin}_{x,z} \mathcal{L}_\mathcal{B}(x, z, \alpha^\dagger, \beta_\mathcal{B}^\dagger)$ is not necessarily an optimal primal solution for the primal of the current restricted dual problem (Sherali & Choi, 1996).

The primal of $d_\mathcal{B}(\alpha, \beta_\mathcal{B})$ (the restricted primal) is the problem obtained by setting $\mathcal{E}_k \leftarrow \mathcal{B}_k$ in problem (3). While the primal cutting plane algorithm by Anderson et al. (2020) calls the separation oracle (4) at the optimal solution of the current restricted primal, Fact 2 shows that our selection criterion leads to a different behaviour even at dual optimality for $d_\mathcal{B}(\alpha, \beta_\mathcal{B})$. Therefore, as we have no theoretical incentive to reach (approximate) subproblem convergence, we add variables after a fixed, tunable number of supergradient iterations.
Furthermore, we can add more than one variable "at once" by running the oracle (4) repeatedly for a number of iterations. We conclude this section by pointing out that, while recovering primal optima is possible in principle (Sherali & Choi, 1996) , doing so would require dual convergence on each restricted dual problem (8). As the main advantage of dual approaches (Dvijotham et al., 2018; Bunel et al., 2020a) is their ability to quickly achieve tight bounds (rather than formal optimality), adapting the selection criterion to mirror the primal cutting plane algorithm would defeat the purpose of Active Set.
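The interplay of §3.1 and §3.2 is an outer loop that alternates a fixed number of supergradient iterations with active-set growth. The schematic Python below is entirely ours: `step_fn` and `oracle_fn` stand in for the supergradient computation (11) and the oracle (4), and the objective is a toy separable function rather than $d_\mathcal{B}$:

```python
import numpy as np

def active_set_solve(step_fn, oracle_fn, duals, n_outer=4, n_inner=60, lr=0.1):
    """Outer loop of an Active-Set-style dual solver (schematic).

    duals    : dict id -> non-negative array (the current active set B).
    step_fn  : duals -> dict of supergradients for the active variables.
    oracle_fn: duals -> id of a new variable to activate, or None.
    """
    for _ in range(n_outer):
        for _ in range(n_inner):  # supergradient iterations on the reduced dual
            grads = step_fn(duals)
            for k in duals:
                duals[k] = np.maximum(duals[k] + lr * grads[k], 0.0)
        new = oracle_fn(duals)    # grow the active set after a fixed number of steps
        if new is not None and new not in duals:
            duals[new] = np.zeros(1)
    return duals

# toy separable dual: d(beta) = -sum_k (beta_k - target_k)^2, grown one variable at a time
targets = {0: 1.0, 1: 2.0, 2: 3.0}
step = lambda d: {k: -2.0 * (d[k] - targets[k]) for k in d}
oracle = lambda d: len(d) if len(d) < len(targets) else None
duals = active_set_solve(step, oracle, {0: np.zeros(1)})
```

Note that newly activated variables start at zero, so the current bound is unchanged at the moment of activation and can only improve afterwards; this is what preserves the anytime property of Fact 1.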

3.3. IMPLEMENTATION DETAILS, TECHNICAL CHALLENGES

Analogously to previous dual algorithms (Dvijotham et al., 2018; Bunel et al., 2020a), our approach can leverage the massive parallelism offered by modern GPU architectures in three different ways. First, we execute in parallel the computations of lower and upper bounds relative to all the neurons of a given layer. Second, in complete verification, we can batch over the different Branch and Bound (BaB) subproblems. Third, as most of our solver relies on standard linear algebra operations employed during the forward and backward passes of neural networks, we can exploit the highly optimised implementations commonly found in modern deep learning frameworks. An exception is what we call "masked" forward/backward passes: operations of the form $(W_k \odot I_k) x_k$ or $(W_k \odot I_k)^T x_{k+1}$, which are needed whenever dealing with constraints from $\mathcal{A}_k$. In our solver, they appear if $\mathcal{B}_k \neq \emptyset$ (see equations (8), (11)). Masked passes require a customised lower-level implementation for a proper treatment of convolutional layers, detailed in appendix G.
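For a fully-connected layer, a masked pass is elementary (the convolutional case, which needs the lower-level treatment of appendix G, is the challenging one). A numpy sketch with our own function names:

```python
import numpy as np

def masked_forward(W, I, x):
    """Masked forward pass (W ⊙ I) x: zero out the weights excluded by the mask."""
    return (W * I) @ x

def masked_backward(W, I, y):
    """Masked backward pass (W ⊙ I)^T y."""
    return (W * I).T @ y

W = np.array([[1.0, 2.0], [3.0, 4.0]])
I = np.array([[1.0, 0.0], [0.0, 1.0]])  # an example binary mask
out = masked_forward(W, I, np.array([1.0, 1.0]))
```

For convolutions, the difficulty is that each mask $I_k$ breaks weight sharing: the effective kernel differs per spatial location, so the operation can no longer be expressed as a single standard convolution call.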

4. RELATED WORK

In addition to those described in §2, many other relaxations have been proposed in the literature. In fact, all bounding methods are equivalent to solving some convex relaxation of a neural network. This holds for conceptually different ideas such as bound propagation (Gowal et al., 2018), specific dual assignments (Wong & Kolter, 2018), and dual formulations based on Lagrangian Relaxation (Dvijotham et al., 2018) or Decomposition (Bunel et al., 2020a). The degree of tightness varies greatly: from looser relaxations associated with closed-form methods (Gowal et al., 2018; Weng et al., 2018; Wong & Kolter, 2018) to tighter formulations based on Semi-Definite Programming (SDP) (Raghunathan et al., 2018). The speed of closed-form approaches results from simplifying the triangle-shaped feasible region of the Planet relaxation (§2.1) (Singh et al., 2018; Wang et al., 2018). On the other hand, tighter relaxations are more expressive than the linearly-sized LP by Ehlers (2017). The SDP formulation by Raghunathan et al. (2018) can represent interactions between activations in the same layer. Similarly, Singh et al. (2019) tighten the Planet relaxation by considering the convex hull of the union of polyhedra relative to k ReLUs of a given layer at once. Alternatively, tighter LPs can be obtained by considering the ReLU together with the affine operator before it: standard MIP techniques (Jeroslow, 1987) lead to a formulation that is quadratic in the number of variables (see appendix F.2). The relaxation by Anderson et al. (2020) detailed in §2.2 is a more convenient representation of the same set. By projecting out the auxiliary z variables, Tjandraatmadja et al. (2020) recently introduced another formulation equivalent to the one by Anderson et al. (2020), with half as many variables and a linear factor more constraints compared to what is described in §2.2.
Therefore, the relationship between the two formulations mirrors the one between the Planet and Big-M relaxations (see appendix B.1). Our dual derivation and the Active Set algorithm can be adapted to operate on the projected relaxations. Specialised dual solvers significantly improve in bounding efficiency with respect to off-the-shelf solvers for both LP (Bunel et al., 2020a) and SDP formulations (Dvijotham et al., 2019) . Therefore, the design of similar solvers for other tight relaxations is an interesting line of future research. We contribute with a specialised dual solver for the relaxation by Anderson et al. (2020) ( §3) . In what follows, we demonstrate empirically that by seamlessly transitioning from the Planet relaxation to the tighter formulation, we can obtain large incomplete and complete verification improvements.

5. EXPERIMENTS

We empirically demonstrate the effectiveness of our method in two settings. On incomplete verification (§5.1), we assess the speed and quality of bounds compared to other bounding algorithms. On complete verification (§5.2), we examine whether our speed-accuracy trade-offs correspond to faster exact verification. Our implementation is based on PyTorch (Paszke et al., 2017) and is available at https://github.com/oval-group/scaling-the-convex-barrier.

5.1. INCOMPLETE VERIFICATION

We evaluate incomplete verification performance by upper bounding the robustness margin (the difference between the ground truth logit and the other logits) to adversarial perturbations (Szegedy et al., 2014) on the CIFAR-10 test set (Krizhevsky & Hinton, 2009). If the upper bound is negative, we can certify the network's vulnerability to adversarial perturbations. We replicate the experimental setting from Bunel et al. (2020a). The networks correspond to the small network architecture from Wong & Kolter (2018). Here, we present results for a network trained via standard SGD and cross-entropy loss, with no modification to the objective for robustness. Perturbations for this network lie in an $\ell_\infty$ norm ball with radius $\epsilon_{\text{ver}} = 5/255$ (which is hence lower than the radii commonly employed for robustly trained networks). In appendix I, we provide additional CIFAR-10 results on an adversarially trained network using the method by Madry et al. (2018), and on MNIST (LeCun et al., 1998), for a network adversarially trained with the algorithm by Wong & Kolter (2018).

We compare both against previous dual iterative methods and against Gurobi (Gurobi Optimization, 2020), the commercial black-box solver employed by Anderson et al. (2020). For Gurobi-based baselines, Planet means solving the Planet (Ehlers, 2017) relaxation of the network, while Gurobi cut starts from the Big-M relaxation and adds constraints from $\mathcal{A}_k$ in a cutting-plane fashion, as in the original primal algorithm by Anderson et al. (2020). We run both on 4 CPU threads. Amongst dual iterative methods, run on an Nvidia Titan Xp GPU, we compare with BDD+, the recent proximal-based solver by Bunel et al. (2020a), operating on a Lagrangian Decomposition dual of the Planet relaxation.

Figure 1: The width at a given value represents the proportion of problems for which this is the result. Comparing Active Set with 1650 steps to Gurobi 1 Cut, tighter bounds are achieved in a smaller runtime.
As we operate on (a subset of) the data by Bunel et al. (2020a), we omit both their supergradient-based approach and the one by Dvijotham et al. (2018), as they both perform worse than BDD+ (Bunel et al., 2020a). For the same reason, we omit cheaper (and looser) methods, like interval propagation (Gowal et al., 2018) and the one by Wong & Kolter (2018). Active Set denotes our solver for problem (3), described in §3.1. By keeping $\mathcal{B} = \emptyset$, Active Set reduces to Big-M, a solver for the non-projected Planet relaxation (appendix B), which can be seen as Active Set's initialiser. In line with previous bounding algorithms (Bunel et al., 2020a), we employ Adam updates (Kingma & Ba, 2015) for supergradient-type methods due to their faster empirical convergence. Finally, we complement the comparison with Gurobi-based methods by running Active Set on 4 CPU threads (Active Set CPU). Further details, including hyper-parameters, can be found in appendix I.

Figure 1 shows the distribution of runtime and the bound improvement with respect to Gurobi cut for the SGD-trained network. For Gurobi cut, we only add the single most violated cut from $\mathcal{A}_k$ per neuron, due to the cost of repeatedly solving the LP. We tuned BDD+ and Big-M, the dual methods operating on the weaker relaxation (2), to have the same average runtime. They obtain bounds comparable to Gurobi Planet in an order of magnitude less time. Initialised from 500 Big-M iterations, at 600 iterations Active Set already achieves better bounds on average than Gurobi cut in around 1/20th of the time. With a computational budget twice as large (1050 iterations) or four times as large (1650 iterations), the bounds significantly improve over Gurobi cut in still a fraction of the time. As we empirically demonstrate in appendix I, the tightness of the Active Set bounds is strongly linked to our active set strategy (§3.2).
Remarkably, even though our method is specifically designed to take advantage of GPU acceleration, executing it on CPU proves strongly competitive with Gurobi cut, producing better bounds in less time on the benchmark of Figure 1. Figure 2 shows pointwise comparisons for a subset of the methods of Figure 1, on the same data. Figure 2a shows the gap to the (Gurobi) Planet bound for BDD+ and our Big-M solver. Surprisingly, our Big-M solver achieves on average better bounds than BDD+ in the same time. Figure 2b shows the improvement over Planet bounds for Big-M and Active Set. The latter achieves markedly better bounds than Big-M in the same time, demonstrating the benefit of operating (at least partly) on the tighter dual (6).

5.2. COMPLETE VERIFICATION

We next evaluate performance on complete verification, verifying the adversarial robustness of a network to perturbations in $\ell_\infty$ norm on a subset of the dataset by Lu & Kumar (2020), replicating the experimental setting from Bunel et al. (2020a). The dataset associates a different perturbation radius $\epsilon_{\text{verif}}$ to each CIFAR-10 image, so as to create challenging verification properties. Its difficulty makes the dataset an appropriate testing ground for tighter relaxations like the one by Anderson et al. (2020) (§2.2). Further details, including network architectures, can be found in appendix I. Here, we aim to solve the non-convex problem (1) directly, rather than an approximation as in §5.1. In order to do so, we use BaBSR, a branch and bound algorithm from Bunel et al. (2020b). Branch and Bound works by dividing the problem domain into subproblems (branching) and bounding the local minimum over those domains. Any domain that provably cannot contain the global minimum is pruned away, whereas the others are kept and branched over. In BaBSR, branching is carried out by splitting an unfixed ReLU into its passing and blocking phases. The ReLU which induces the maximum change in the domain's lower bound when made unambiguous is selected for splitting. A fundamental component of a BaB method is the bounding algorithm, which is, in general, the computational bottleneck (Lu & Kumar, 2020). Therefore, we compare the effect on final verification time of using the different bounding methods of §5.1 within BaBSR. In addition, we evaluate MIP $\mathcal{A}_k$, which encodes problem (1) as a Big-M MIP (Tjeng et al., 2019) and solves it in Gurobi by adding cutting planes from $\mathcal{A}_k$, analogously to the original experiments by Anderson et al. (2020). Finally, we also compare against ERAN (Singh et al., 2020), a state-of-the-art complete verification toolbox; results on the dataset by Lu & Kumar (2020) are taken from the recent VNN-COMP competition (VNN-COMP, 2020).
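The branch-and-bound loop described above can be sketched in a few lines on a one-dimensional toy problem (ours; here the "network" is the scalar function $f(x) = (x^2 - 1)^2$ and the bounding step is naive interval arithmetic, standing in for the convex relaxations of §2):

```python
import heapq

def f(x):
    return (x * x - 1.0) ** 2  # toy non-convex objective, minimum 0 at x = ±1

def interval_lb(l, u):
    """Naive interval-arithmetic lower bound for f over [l, u] (the 'bounding' step)."""
    sq_lo = 0.0 if l <= 0.0 <= u else min(l * l, u * u)
    sq_hi = max(l * l, u * u)
    t_lo, t_hi = sq_lo - 1.0, sq_hi - 1.0
    return 0.0 if t_lo <= 0.0 <= t_hi else min(t_lo * t_lo, t_hi * t_hi)

def branch_and_bound(l, u, tol=1e-6):
    best_ub = min(f(l), f(u), f((l + u) / 2))  # incumbent from cheap point evaluations
    heap = [(interval_lb(l, u), l, u)]
    while heap:
        lb, l, u = heapq.heappop(heap)
        if lb > best_ub - tol:                 # prune: cannot contain the global minimum
            continue
        m = (l + u) / 2                        # branch: split the domain at the midpoint
        best_ub = min(best_ub, f(m))
        heapq.heappush(heap, (interval_lb(l, m), l, m))
        heapq.heappush(heap, (interval_lb(m, u), m, u))
    return best_ub

opt = branch_and_bound(-2.0, 2.0)
```

In neural network verification the domain splits are over ReLU phases rather than input intervals, and the quality of `interval_lb` (here deliberately crude) is exactly what the bounding algorithms compared below provide.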
We use 100 iterations for Active Set, 100 iterations for BDD+ and 180 iterations for Big-M. For dual iterative algorithms, we solve 300 subproblems at once for the base network and 200 for the deep and wide networks (see §3.3). Additionally, dual variables are initialised from their parent node's bounding computation. As in Bunel et al. (2020a), the time limit is kept at one hour. Due to the difference in computational cost between algorithms operating on the tighter relaxation by Anderson et al. (2020) and the other bounding algorithms (for Active Set, this is partly due to the masked forward/backward pass described in appendix G), we also experiment with a stratified version of the bounding within BaBSR. We devise a set of heuristics to determine

6. DISCUSSION

The vast majority of neural network bounding algorithms focuses on (solving or loosening) a popular triangle-shaped relaxation, referred to as the "convex barrier" for verification. Relaxations that are tighter than this convex barrier have been recently introduced, but their complexity hinders applicability. We have presented Active Set, a sparse dual solver for one such relaxation, and empirically demonstrated that it yields significant formal verification speed-ups. Our results show that scalable tightness is key to the efficiency of neural network verification and instrumental in the definition of a more appropriate "convex barrier". We believe that new customised solvers for similarly tight relaxations are a crucial avenue for future research in the area, possibly beyond piecewise-linear networks. Finally, as it is inevitable that tighter bounds will come at a larger computational cost, future verification systems will be required to recognise a priori whether tight bounds are needed for a given property. A possible solution to this problem could rely on learning algorithms.

A LIMITATIONS OF PREVIOUS DUAL APPROACHES

In this section, we show that previous dual derivations (Bunel et al., 2020a; Dvijotham et al., 2018) violate Fact 1 and are therefore not efficiently applicable to problem (3), motivating our own derivation in section 3. We start from the approach by Dvijotham et al. (2018), which relies on relaxing the equality constraints (1b), (1c) of the original non-convex problem (1). Dvijotham et al. (2018) prove that this relaxation corresponds to solving the convex problem (2), which is equivalent to the Planet relaxation (Ehlers, 2017) to which the original proof refers. As we would like to solve the tighter problem (3), the derivation is not directly applicable. Relying on intuition from convex analysis applied to duality gaps (Lemaréchal, 2001), we conjecture that relaxing the composition (1c) ∘ (1b) might tighten the primal problem the relaxation is equivalent to, obtaining the following dual:

$$\max_{\mu} \min_{x} \; W_n x_{n-1} + b_n + \sum_{k=1}^{n-1} \mu_k^T \left( x_k - \max\{W_k x_{k-1} + b_k, 0\} \right) \quad \text{s.t. } l_k \le x_k \le u_k \;\; k \in [1, n-1], \; x_0 \in \mathcal{C}. \tag{12}$$

Unfortunately, dual (12) requires an LP (the inner minimisation over x, which in this case does not decompose over layers) to be solved exactly both to obtain a supergradient and every time a valid bound is needed. This is markedly different from the original dual by Dvijotham et al. (2018), which has an efficient closed form for the inner problems. The derivation by Bunel et al. (2020a), instead, operates by substituting (1c) with its convex hull and solving the resulting Lagrangian Decomposition dual. The Decomposition dual for the convex hull of (1c) ∘ (1b) (i.e., A_k) takes the following form:

$$\max_{\rho} \min_{x, z} \; W_n x_{A,n-1} + b_n + \sum_{k=1}^{n-1} \rho_k^T \left( x_{B,k} - x_{A,k} \right) \quad \text{s.t. } x_0 \in \mathcal{C}, \;\; \left( x_{B,k}, \; W_k x_{A,k-1} + b_k, \; z_k \right) \in \mathcal{A}_{dec,k} \;\; k \in [1, n-1],$$

where $\mathcal{A}_{dec,k}$ corresponds to $\mathcal{A}_k$ with the following substitutions: $x_k \to x_{B,k}$ and $\hat{x}_k \to W_k x_{A,k-1} + b_k$.
It can be easily seen that the inner problems (the inner minimisation over x_{A,k}, x_{B,k} for each layer k > 0) are exponentially-sized LPs. Again, this differs from the original dual on the Planet relaxation (Bunel et al., 2020a), which has an efficient closed form for the inner problems.

B DUAL INITIALISATION

Algorithm 1 Big-M solver
1: function BIGM_COMPUTE_BOUNDS({W_k, b_k, l_k, u_k, l̂_k, û_k}_{k=1..n})
2:   Initialise duals α^0, β^0_M using interval propagation bounds (Gowal et al., 2018)
3:   for t ∈ [1, T−1] do
4:     x*, z* ∈ argmin_{x,z} L_M(x, z, α^t, β^t_M) using (16)-(17)
5:     Compute the supergradient using (18)
6:     α^{t+1}, β^{t+1}_M ← Adam's update rule (Kingma & Ba, 2015)
7:     α^{t+1}, β^{t+1}_M ← max(α^{t+1}, 0), max(β^{t+1}_M, 0)   (dual projection)
8:   end for
9:   return min_{x,z} L_M(x, z, α^T, β^T_M)
10: end function

As shown in section 3, our Active Set solver yields a dual solver for the Big-M relaxation (2) if the active set B is kept empty throughout execution. As indeed B = ∅ for the first Active Set iterations (see algorithm 2 in section D), the Big-M solver can be thought of as a dual initialisation. Furthermore, we demonstrate experimentally in §5 that, when used as a stand-alone solver, our Big-M solver is competitive with previous dual algorithms for problem (2). The goal of this section is to explicitly describe the Big-M solver, which is summarised in algorithm 1. We point out that, in the notation of restricted variable sets from section 3.1, β_M := β_∅. We now describe the equivalence between the Big-M and Planet relaxations (section B.1), before presenting the dual the solver operates on (section B.2) and the solver itself (section B.3).
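The structure of algorithm 1 — closed-form inner minimisation, supergradient step, projection onto the non-negative orthant — can be illustrated on a toy one-dimensional dual (this sketch replaces Adam with plain decreasing step sizes, and all names are ours, not the paper's code):

```python
import numpy as np

def dual_ascent(inner_min, supergrad, lam0, steps=500, lr0=0.5):
    """Projected supergradient ascent on a Lagrangian dual: solve the
    inner minimisation in closed form, step along a supergradient, and
    project the duals back onto the non-negative orthant. Any iterate
    yields a valid lower bound (the dual function value)."""
    lam = np.maximum(lam0, 0.0)
    best = -np.inf
    for t in range(steps):
        x_star, d_val = inner_min(lam)       # closed-form inner problem
        best = max(best, d_val)              # valid bound at any time
        lam = np.maximum(lam + lr0 / (1 + t) * supergrad(x_star, lam), 0.0)
    return best

# toy instance: min x  s.t.  x >= 1,  x in [0, 2]; primal optimum = 1
def inner_min(lam):
    # L(x, lam) = x + lam * (1 - x): linear in x, so the box minimiser
    # sits at an endpoint depending on the sign of (1 - lam)
    x = 0.0 if (1 - lam) >= 0 else 2.0
    return x, x + lam * (1 - x)

bound = dual_ascent(inner_min, lambda x, lam: 1 - x, np.array(0.0))
```

As in the Big-M solver, every iterate produces a valid (here, lower) bound, and the iterates approach the optimal dual value (1 in this toy problem) from below.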

B.1 EQUIVALENCE TO PLANET

As previously shown (Bunel et al., 2018), the Big-M relaxation (M_k, when considering the k-th layer only) in problem (2) is equivalent to the Planet relaxation by Ehlers (2017). Then, due to strong duality, our Big-M solver (section B.2) and the solvers by Bunel et al. (2020a); Dvijotham et al. (2018) will all converge to the bounds given by the solution of problem (2). In fact, the Decomposition-based method (Bunel et al., 2020a) directly operates on the Planet relaxation, while Dvijotham et al. (2018) prove that their dual is equivalent to doing so. On the k-th layer, the Planet relaxation takes the following form:

$$\mathcal{P}_k := \begin{cases} x_k \ge 0, \;\; x_k \ge \hat{x}_k, \;\; x_k \le \hat{u}_k \odot \dfrac{\hat{x}_k - \hat{l}_k}{\hat{u}_k - \hat{l}_k} & \text{if } \hat{l}_k \le 0 \text{ and } \hat{u}_k \ge 0, \\ x_k = 0 & \text{if } \hat{u}_k \le 0, \\ x_k = \hat{x}_k & \text{if } \hat{l}_k \ge 0. \end{cases}$$

It can be seen that P_k = Proj_{x,x̂}(M_k), where Proj_{x,x̂} denotes the projection onto the (x, x̂) hyperplane. In fact, as z_k does not appear in the objective of the primal formulation (2) but only in the constraints, this means assigning it the value that allows the largest possible feasible region. This is trivial for passing or blocking ReLUs. For the ambiguous case, instead, Figure 4 (on a single ReLU) shows that z_k = (x̂_k − l̂_k)/(û_k − l̂_k) is the correct assignment.

B.2 BIG-M DUAL

As evident from problem (3), A_k ⊆ M_k. If we relax all constraints in M_k (except, again, the box constraints), we obtain a dual with a strict subset of the variables in problem (6). The Big-M dual is a specific instance of the Active Set dual (8) where B = ∅, and it takes the following form:

$$\max_{(\alpha, \beta) \ge 0} d_M(\alpha, \beta_M), \quad \text{where: } d_M(\alpha, \beta_M) := \min_{x, z} \mathcal{L}_M(x, z, \alpha, \beta_M), \tag{15}$$

$$\begin{aligned} \mathcal{L}_M(x, z, \alpha, \beta_M) = \; & -\sum_{k=0}^{n-1} \Big( \alpha_k - W_{k+1}^T \alpha_{k+1} - \big( \beta_{k,0} + \beta_{k,1} - W_{k+1}^T \beta_{k+1,1} \big) \Big)^T x_k \\ & + \sum_{k=1}^{n-1} b_k^T \alpha_k - \sum_{k=1}^{n-1} \big( \beta_{k,0} \odot \hat{u}_k + \beta_{k,1} \odot \hat{l}_k \big)^T z_k + \sum_{k=1}^{n-1} (\hat{l}_k - b_k)^T \beta_{k,1} \\ & \text{s.t. } x_0 \in \mathcal{C}, \;\; (x_k, z_k) \in [l_k, u_k] \times [0, 1] \;\; k \in [1, n-1]. \end{aligned}$$

Figure 4: M_k plotted on the (z_k, x_k) plane, under the assumption that l̂_k ≤ 0 and û_k ≥ 0.

B.3 BIG-M SOLVER

We initialise dual variables from interval propagation bounds (Gowal et al., 2018): this can be easily done by setting all dual variables except α_n to 0. Then, we can maximize d_M(α, β_M) via projected supergradient ascent, exactly as described in section 3.1 for a generic active set B. All the computations in the solver follow from keeping B = ∅ in §3.1; we explicitly report them here for the reader's convenience. Let us define the following shorthand for the primal coefficients:

$$f_{M,k}(\alpha, \beta_M) = \alpha_k - W_{k+1}^T \alpha_{k+1} - \big( \beta_{k,0} + \beta_{k,1} - W_{k+1}^T \beta_{k+1,1} \big), \qquad g_{M,k}(\beta_M) = \beta_{k,0} \odot \hat{u}_k + \beta_{k,1} \odot \hat{l}_k.$$

The minimisation of the Lagrangian L_M(x, z, α, β_M) over the primals, for k ∈ [1, n−1], is then:

$$x^*_k = \mathbb{1}_{f_{M,k}(\alpha, \beta_M) \ge 0} \odot u_k + \mathbb{1}_{f_{M,k}(\alpha, \beta_M) < 0} \odot l_k, \qquad z^*_k = \mathbb{1}_{g_{M,k}(\beta_M) \ge 0}. \tag{16}$$

For k = 0, instead (assuming, as in §3.1, that this can be done efficiently):

$$x^*_0 \in \operatorname*{argmin}_{x_0} \; -f_{M,0}(\alpha, \beta_M)^T x_0 \quad \text{s.t. } x_0 \in \mathcal{C}. \tag{17}$$

The supergradient over the Big-M dual variables α_k, β_{k,0}, β_{k,1} is computed exactly as in §3.1 and is again a subset of the supergradient of the full dual problem (6). We report it for completeness. For each k ∈ [1, n−1]:

$$\nabla_{\alpha_k} d(\alpha, \beta) = W_k x^*_{k-1} + b_k - x^*_k, \qquad \nabla_{\beta_{k,0}} d(\alpha, \beta) = x^*_k - z^*_k \odot \hat{u}_k,$$
$$\nabla_{\beta_{k,1}} d(\alpha, \beta) = x^*_k - (W_k x^*_{k-1} + b_k) + (1 - z^*_k) \odot \hat{l}_k. \tag{18}$$
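Since the Lagrangian is linear in each primal variable, the inner minimisation decomposes elementwise over the box constraints. A vectorised sketch for a single layer (illustrative names, not the paper's code):

```python
import numpy as np

def inner_argmin(f_k, g_k, l_k, u_k):
    """Closed-form minimiser of the linear Lagrangian over the box:
    the terms -f_k^T x_k and -g_k^T z_k decompose over coordinates,
    so each coordinate sits at whichever box endpoint minimises it."""
    x_star = np.where(f_k >= 0, u_k, l_k)   # minimises -f_k^T x over [l_k, u_k]
    z_star = np.where(g_k >= 0, 1.0, 0.0)   # minimises -g_k^T z over [0, 1]
    return x_star, z_star

# small illustrative instance
f = np.array([1.0, -2.0, 0.0])
g = np.array([0.5, -0.5])
l, u = np.array([-1.0, -1.0, -1.0]), np.array([2.0, 3.0, 4.0])
x_star, z_star = inner_argmin(f, g, l, u)
```

This per-coordinate structure is what makes each dual iteration cheap and trivially parallelisable over neurons, layers, and verification subproblems.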

C DUAL DERIVATIONS

We now derive problem (6), the dual of the full relaxation by Anderson et al. (2020) described in equation (3). The Active Set (equation (8)) and Big-M (equation (15)) duals can be obtained by removing β_{k,I_k} ∀ I_k ∈ E_k \ B_k and β_{k,I_k} ∀ I_k ∈ E_k, respectively. We employ the following Lagrangian multipliers:

$$x_k \ge \hat{x}_k \; \Rightarrow \; \alpha_k, \qquad x_k \le \hat{u}_k \odot z_k \; \Rightarrow \; \beta_{k,0}, \qquad x_k \le \hat{x}_k - \hat{l}_k \odot (1 - z_k) \; \Rightarrow \; \beta_{k,1},$$
$$x_k \le (W_k \odot I_k)\, x_{k-1} + z_k \odot b_k - (W_k \odot I_k)\check{L}_{k-1} \odot (1 - z_k) + (W_k \odot (1 - I_k))\check{U}_{k-1} \odot z_k \; \Rightarrow \; \beta_{k,I_k},$$

and obtain, as a Lagrangian (using $\hat{x}_k = W_k x_{k-1} + b_k$):

$$\begin{aligned} \mathcal{L}(x, z, \alpha, \beta) = \; & W_n x_{n-1} + b_n + \sum_{k=1}^{n-1} \alpha_k^T \left( W_k x_{k-1} + b_k - x_k \right) + \sum_{k=1}^{n-1} \beta_{k,0}^T \left( x_k - z_k \odot \hat{u}_k \right) \\ & + \sum_{k=1}^{n-1} \beta_{k,1}^T \left( x_k - (W_k x_{k-1} + b_k) + (1 - z_k) \odot \hat{l}_k \right) \\ & + \sum_{k=1}^{n-1} \sum_{I_k \in E_k} \beta_{k,I_k}^T \left( x_k - (W_k \odot I_k)\, x_{k-1} - b_k \odot z_k + (W_k \odot I_k)\check{L}_{k-1} \odot (1 - z_k) - (W_k \odot (1 - I_k))\check{U}_{k-1} \odot z_k \right). \end{aligned}$$

Let us use $\sum_{I_k}$ as shorthand for $\sum_{I_k \in E_k \cup \{0, 1\}}$. If we collect the terms with respect to the primal variables and employ dummy variables $\alpha_0 = 0$, $\beta_0 = 0$, $\alpha_n = I$, $\beta_n = 0$, we obtain:

$$\begin{aligned} \mathcal{L}(x, z, \alpha, \beta) = \; & -\sum_{k=0}^{n-1} \Big( \alpha_k - W_{k+1}^T \alpha_{k+1} - \sum_{I_k} \beta_{k,I_k} + \sum_{I_{k+1}} (W_{k+1} \odot I_{k+1})^T \beta_{k+1,I_{k+1}} \Big)^T x_k \\ & - \sum_{k=1}^{n-1} \Big( b_k \odot \sum_{I_k \in E_k} \beta_{k,I_k} + \hat{l}_k \odot \beta_{k,1} + \hat{u}_k \odot \beta_{k,0} + \sum_{I_k \in E_k} \big( (W_k \odot I_k)\check{L}_{k-1} + (W_k \odot (1 - I_k))\check{U}_{k-1} \big) \odot \beta_{k,I_k} \Big)^T z_k \\ & + \sum_{k=1}^{n-1} b_k^T \alpha_k + \sum_{k=1}^{n-1} \Big( \sum_{I_k \in E_k} \big( (W_k \odot I_k)\check{L}_{k-1} \big)^T \beta_{k,I_k} + (\hat{l}_k - b_k)^T \beta_{k,1} \Big), \end{aligned}$$

which corresponds to the form shown in problem (6).

D IMPLEMENTATION DETAILS FOR ACTIVE SET METHOD

From a high-level perspective, our Active Set solver proceeds by repeatedly solving modified instances of problem (6), where the exponential set E_k is replaced by a fixed (small) set of active variables B. The full solver procedure is summarised in algorithm 2.

Algorithm 2 Active Set solver
1: function ACTIVESET_COMPUTE_BOUNDS({W_k, b_k, l_k, u_k, l̂_k, û_k}_{k=1..n})
2:   Initialise duals α^0, β^0_0, β^0_1 using algorithm 1
3:   Set β_{k,I_k} = 0, ∀ I_k ∈ E_k
4:   B = ∅
5:   for nb_additions do
6:     for t ∈ [1, T−1] do
7:       x*, z* ∈ argmin_{x,z} L_B(x, z, α^t, β^t_B) using (9), (10)
8:       if t ≤ nb_vars_to_add then
9:         For each layer k, add the output of oracle (4) called at (x*, z*) to B_k
10:      end if
11:      Compute the supergradient using (11)
12:      α^{t+1}, β^{t+1}_B ← Adam's update rule (Kingma & Ba, 2015)
13:      α^{t+1}, β^{t+1}_B ← max(α^{t+1}, 0), max(β^{t+1}_B, 0)   (dual projection)
14:    end for
15:  end for
16:  return min_{x,z} L_B(x, z, α^T, β^T_B)
17: end function

We now prove that the variable set maximising the supergradient corresponds to the output of oracle (4), i.e., to the most violated constraint from A_k.

Proof. In the following, (x*, z*) denotes the points introduced in the statement. Recall the definition of ∇_{β_{k,I_k}} d(α, β) in equation (9), which applies beyond the current active set:

$$\nabla_{\beta_{k,I_k}} d(\alpha, \beta) = x^*_k - (W_k \odot I_k)\, x^*_{k-1} + (W_k \odot I_k)\check{L}_{k-1} \odot (1 - z^*_k) - z^*_k \odot b_k - (W_k \odot (1 - I_k))\check{U}_{k-1} \odot z^*_k, \quad I_k \in E_k.$$

We want to compute $I^*_k \in \operatorname{argmax}_{I_k} \{\nabla_{\beta_{k,I_k}} d(\alpha, \beta)^T \mathbf{1}\}$, that is:

$$I^*_k \in \operatorname*{argmax}_{I_k \in E_k} \left( x^*_k - (W_k \odot I_k)\, x^*_{k-1} + (W_k \odot I_k)\check{L}_{k-1} \odot (1 - z^*_k) - z^*_k \odot b_k - (W_k \odot (1 - I_k))\check{U}_{k-1} \odot z^*_k \right)^T \mathbf{1}.$$

By removing the terms that do not depend on I_k, we obtain:

$$\max_{I_k \in E_k} \left( -(W_k \odot I_k)\, x^*_{k-1} + (W_k \odot I_k)\check{L}_{k-1} \odot (1 - z^*_k) + (W_k \odot I_k)\check{U}_{k-1} \odot z^*_k \right)^T \mathbf{1}.$$

Let us denote the i-th rows of W_k and I_k by w_{i,k} and i_{i,k}, respectively, and define E_k[i] = 2^{w_{i,k}} \ {0, 1}. The optimisation decomposes over each such row: we thus focus on the optimisation problem for the supergradient's i-th entry. Collecting the mask, we get:

$$\max_{i_{i,k} \in E_k[i]} \sum_j \left( (1 - z^*_k[i])\, \check{L}_{k-1}[i, j] + z^*_k[i]\, \check{U}_{k-1}[i, j] - x^*_{k-1}[j] \right) W_k[i, j]\, I_k[i, j].$$

As the solution to the problem above is obtained by setting I*_k[i, j] = 1 if its coefficient is positive and I*_k[i, j] = 0 otherwise, we can see that the optimal I_k corresponds to calling oracle (4) by Anderson et al. (2020) on (x*, z*). Hence, in addition to being the mask associated to β_{k,I*_k}, the variable set maximising the supergradient, I*_k corresponds to the most violated constraint from A_k at (x*, z*).
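The row-wise selection just derived amounts to thresholding the supergradient coefficients. A small sketch (function and argument names are ours; `L_chk`/`U_chk` stand for Ľ_{k−1}, Ǔ_{k−1}, assumed already arranged entrywise):

```python
import numpy as np

def oracle_mask(W, x_prev, z, L_chk, U_chk):
    """Linear-time separation oracle: set I*[i, j] = 1 exactly when the
    coefficient of I_k[i, j] in the supergradient sum is positive."""
    coeff = ((1 - z)[:, None] * L_chk + z[:, None] * U_chk
             - x_prev[None, :]) * W
    return (coeff > 0).astype(float)

# one output neuron, three inputs (all values illustrative)
W = np.array([[1.0, -2.0, 0.5]])
x_prev = np.array([0.13, -0.27, 0.41])
z = np.array([0.6])
L_chk = np.array([[-1.0, 0.5, -1.0]])
U_chk = np.array([[1.0, -0.5, 1.0]])
mask = oracle_mask(W, x_prev, z, L_chk, U_chk)
```

The cost is a single masked-matrix evaluation per layer, which is what makes cutting-plane selection affordable inside the supergradient loop of algorithm 2.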

E INTERMEDIATE BOUNDS

A crucial quantity in both ReLU relaxations (M_k and A_k) are the intermediate pre-activation bounds l̂_k, û_k. In practice, they are computed by solving a relaxation C_k (which might be M_k, A_k, or something looser) of (1) over subsets of the network (Bunel et al., 2020a). For l̂_i, this means solving the following problem (separately, for each entry l̂_i[j]):

$$\min_{x, \hat{x}, z} \; \hat{x}_i[j] \quad \text{s.t. } x_0 \in \mathcal{C}, \;\; \hat{x}_{k+1} = W_{k+1} x_k + b_{k+1} \;\; k \in [0, i-1], \;\; (x_k, \hat{x}_k, z_k) \in \mathcal{C}_k \;\; k \in [1, i-1]. \tag{19}$$

As (19) needs to be solved twice for each neuron (lower and upper bounds, changing the sign of the last layer's weights) rather than once as in (3), depending on the computational budget, C_k might be looser than the relaxation employed for the last-layer bounds (in our case, A_k). In all our experiments, we compute intermediate bounds as the tightest between those of the method by Wong & Kolter (2018) and those of interval propagation (Gowal et al., 2018). Once pre-activation bounds are available, post-activation bounds can simply be computed as l_k = max(l̂_k, 0), u_k = max(û_k, 0).
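The interval propagation fallback can be sketched in a few lines (our own minimal implementation; as an illustration, the layer weights below are those of the example network in appendix F):

```python
import numpy as np

def interval_propagation(layers, l0, u0):
    """Entrywise interval bounds (in the style of Gowal et al., 2018):
    split each weight matrix by sign, push the box through the affine
    layer to get pre-activation bounds, then clip at 0 for the ReLU."""
    l, u, pre = l0, u0, []
    for W, b in layers:
        Wp, Wn = np.maximum(W, 0.0), np.minimum(W, 0.0)
        l_hat, u_hat = Wp @ l + Wn @ u + b, Wp @ u + Wn @ l + b
        pre.append((l_hat, u_hat))
        l, u = np.maximum(l_hat, 0.0), np.maximum(u_hat, 0.0)  # post-activation
    return pre

# the two hidden layers of the appendix F example network, x_0 in [-1, 1]^2
layers = [(np.array([[1., -1.], [1., -1.]]), np.array([-1., 1.])),
          (np.array([[-1., 2.], [-2., 1.]]), np.array([-2., 0.]))]
pre = interval_propagation(layers, np.array([-1., -1.]), np.array([1., 1.]))
```

On the first layer this gives l̂_1 = [−3, −1], û_1 = [1, 3], and on the second û_2[1] = 3 = M^+_2[1], matching the values discussed in appendix F.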

F PRE-ACTIVATION BOUNDS IN A k

We now highlight the importance of an explicit treatment of pre-activation bounds in the context of the relaxation by Anderson et al. (2020). In §F.1 we show through an example that, without a separate treatment of pre-activation bounds, the original relaxation (here denoted Ã_k) can be looser than the less computationally expensive M_k relaxation. We then (§F.2) justify our specific pre-activation bounds treatment by extending the original proof by Anderson et al. (2020).

Published as a conference paper at ICLR 2021

The original formulation by Anderson et al. (2020) is the following:

$$\tilde{\mathcal{A}}_k := \begin{cases} x_k \ge W_k x_{k-1} + b_k \\ x_k \le (W_k \odot I_k)\, x_{k-1} + z_k \odot b_k - (W_k \odot I_k)\check{L}_{k-1} \odot (1 - z_k) + (W_k \odot (1 - I_k))\check{U}_{k-1} \odot z_k \quad \forall I_k \in 2^{W_k} \\ (x_k, \hat{x}_k, z_k) \in [l_k, u_k] \times [\hat{l}_k, \hat{u}_k] \times [0, 1] \end{cases} \tag{20}$$

The difference with respect to A_k as defined in equation (3) lies exclusively in the treatment of pre-activation bounds. While A_k explicitly employs generic l̂_k, û_k in the constraint set via M_k, Ã_k implicitly sets l̂_k, û_k to the value dictated by interval propagation bounds (Gowal et al., 2018) via the constraints for I_k = 0 and I_k = 1 from the exponential family. In fact, setting I_k = 0 and I_k = 1, we obtain the following two constraints:

$$x_k \le \hat{x}_k - M^-_k \odot (1 - z_k), \qquad x_k \le M^+_k \odot z_k, \quad \text{where:}$$
$$M^-_k := \min_{x_{k-1} \in [l_{k-1}, u_{k-1}]} W_k x_{k-1} + b_k = W_k \check{L}_{k-1} + b_k, \qquad M^+_k := \max_{x_{k-1} \in [l_{k-1}, u_{k-1}]} W_k x_{k-1} + b_k = W_k \check{U}_{k-1} + b_k, \tag{21}$$

which correspond to the upper bounding ReLU constraints in M_k if we set l̂_k → M^-_k, û_k → M^+_k. While l̂_k, û_k are (potentially) computed by solving an optimisation problem over the entire network (problem (19)), the optimisation for M^-_k, M^+_k involves only the layer preceding the current one. Therefore, the constraints in (21) might be much looser than those in M_k. In practice, the effect of l̂_k[i], û_k[i] on the resulting set is so significant that M_k might yield better bounds than Ã_k, even on very small networks. We now provide a simple example.

Figure 5: the example network architecture: a two-dimensional input x_0 (layer 0), two two-neuron hidden layers x_1, x_2 (layers 1, 2), and a scalar output x_3 (layer 3); the edge weights appear in problem (22).

Figure 5 illustrates the network architecture. The size of the network is the minimal required to reproduce the phenomenon: M_k and Ã_k coincide for single-neuron layers (Anderson et al., 2020), and l̂_k = M^-_k, û_k = M^+_k on the first hidden layer (hence, a second layer is needed). Let us write the example network as a (not yet relaxed, as in problem (1)) optimization problem for the lower bound on the output node x_3:

$$l_3 = \min_{x, \hat{x}} \; [2 \;\; {-1}]\, x_2 \tag{22a}$$
$$\text{s.t. } x_0 \in [-1, 1]^2 \tag{22b}$$
$$\hat{x}_1 = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix} x_0 + \begin{bmatrix} -1 \\ 1 \end{bmatrix}, \quad x_1 = \max(0, \hat{x}_1) \tag{22c}$$
$$\hat{x}_2 = \begin{bmatrix} -1 & 2 \\ -2 & 1 \end{bmatrix} x_1 + \begin{bmatrix} -2 \\ 0 \end{bmatrix}, \quad x_2 = \max(0, \hat{x}_2) \tag{22d}$$
$$x_3 = [2 \;\; {-1}]\, x_2 \tag{22e}$$

Let us compute pre-activation bounds with C_k = M_k (see problem (19)). For this network, the final output lower bound is tighter if the employed relaxation is M_k rather than Ã_k (hence, in this case, M_k ⊂ Ã_k). Specifically: l̂_{3,Ã_k} = −1.2857, l̂_{3,M_k} = −1.2273. In fact:

• In order to compute l_1 and u_1, the post-activation bounds of the first layer, it suffices to solve a box-constrained linear program for l̂_1 and û_1 (which at this layer coincide with interval propagation bounds), and to clip them to be non-negative. This yields l_1 = [0 0]^T, u_1 = [1 3]^T.

• When computing M^+_2[1] = max_{x_1 ∈ [l_1, u_1]} [−2 1] x_1 = 3, we are assuming that x_1[0] = l_1[0] and x_1[1] = u_1[1]. These two assignments are in practice conflicting, as they imply different values for x_0. Specifically, x_1[1] = u_1[1] requires x_0 = [u_0[0]  l_0[1]] = [1  −1], but this would also imply x_1[0] = u_1[0], yielding x̂_2[1] = 1 ≠ 3. Therefore, explicitly solving an LP relaxation of the network for the value of û_2[1] will tighten the bound. Using M_k, the LP for this intermediate pre-activation bound is:

$$\hat{u}_2[1] = \max_{x, \hat{x}, z} \; [-2 \;\; 1]\, x_1 \tag{23a}$$
$$\text{s.t.}$$
$$x_0 \in [-1, 1]^2, \;\; z_1 \in [0, 1]^2, \;\; x_1 \in \mathbb{R}^2_{\ge 0} \tag{23b}$$
$$\hat{x}_1 = \begin{bmatrix} 1 & -1 \\ 1 & -1 \end{bmatrix} x_0 + \begin{bmatrix} -1 \\ 1 \end{bmatrix} \tag{23c}$$
$$x_1 \ge \hat{x}_1 \tag{23d}$$
$$x_1 \le \hat{u}_1 \odot z_1 = \begin{bmatrix} 1 \\ 3 \end{bmatrix} \odot z_1 \tag{23e}$$
$$x_1 \le \hat{x}_1 - \hat{l}_1 \odot (1 - z_1) = \hat{x}_1 - \begin{bmatrix} -3 \\ -1 \end{bmatrix} \odot (1 - z_1) \tag{23f}$$

yielding û_2[1] = 2.25 < 3 = M^+_2[1]. An analogous reasoning holds for M^-_2[1] and l̂_2[1].

• In M_k, we therefore added the following two constraints:
$$x_2[1] \le \hat{x}_2[1] - \hat{l}_2[1](1 - z_2[1]), \qquad x_2[1] \le \hat{u}_2[1]\, z_2[1], \tag{24}$$
which in Ã_k correspond to the weaker:
$$x_2[1] \le \hat{x}_2[1] - M^-_2[1](1 - z_2[1]), \qquad x_2[1] \le M^+_2[1]\, z_2[1]. \tag{25}$$
As the last-layer weight corresponding to x_2[1] is negative (W_3[0, 1] = −1), these constraints influence the computation of l̂_3.

• In fact, the constraints in (24) are both active when optimizing for l̂_{3,M_k}, whereas their counterparts for l̂_{3,Ã_k} in (25) are not. The only active upper constraint at neuron x_2[1] for the Anderson relaxation is x_2[1] ≤ x_1[1], corresponding to the constraint from Ã_2 with I_2[1, •] = [0 1]. Evidently, its effect is not sufficient to counterbalance the effect of the tighter constraints (24) for I_2[1, •] = [1 1] and I_2[1, •] = [0 0], yielding a weaker lower bound for the network output.
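Problem (23) is small enough to check numerically. A sketch using `scipy.optimize.linprog` (the variable ordering and matrix layout are ours; maximisation is encoded as minimisation of the negated objective):

```python
import numpy as np
from scipy.optimize import linprog

# variables: v = [x0_0, x0_1, x1_0, x1_1, z1_0, z1_1]
# maximise [-2 1] x_1  <=>  minimise 2*x1_0 - x1_1
c = np.array([0., 0., 2., -1., 0., 0.])
A_ub = np.array([
    [1., -1., -1., 0., 0., 0.],   # x̂1[0] <= x1[0]                      (23d)
    [1., -1., 0., -1., 0., 0.],   # x̂1[1] <= x1[1]                      (23d)
    [0., 0., 1., 0., -1., 0.],    # x1[0] <= û1[0] z1[0]                 (23e)
    [0., 0., 0., 1., 0., -3.],    # x1[1] <= û1[1] z1[1]                 (23e)
    [-1., 1., 1., 0., 3., 0.],    # x1[0] <= x̂1[0] - l̂1[0](1 - z1[0])  (23f)
    [-1., 1., 0., 1., 0., 1.],    # x1[1] <= x̂1[1] - l̂1[1](1 - z1[1])  (23f)
])
b_ub = np.array([1., -1., 0., 0., 2., 2.])
bounds = [(-1, 1), (-1, 1), (0, None), (0, None), (0, 1), (0, 1)]
res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds)
u_hat_2_1 = -res.fun
```

This recovers û_2[1] = 2.25 < 3 = M^+_2[1], the value stated above: the LP couples the two first-layer neurons through x_0, which interval propagation cannot do.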

F.2 DERIVATION OF A k

Having motivated an explicit pre-activation bounds treatment for the relaxation by Anderson et al. (2020), we now extend the original proof for Ã_k (equation (20)) to obtain our formulation A_k (as defined in equation (3)). For simplicity, we will operate on a single neuron x_k[i]. A self-contained way to derive Ã_k is to apply Fourier-Motzkin elimination to a standard MIP formulation referred to as the multiple choice formulation (Anderson et al., 2019), which is defined as follows:

$$S_{k,i} = \begin{cases} (x_{k-1}, x_k[i]) = (x^0_{k-1}, x^0_k[i]) + (x^1_{k-1}, x^1_k[i]) \\ x^0_k[i] = 0 \ge w_{i,k}^T x^0_{k-1} + b_k[i](1 - z_k[i]) \\ x^1_k[i] = w_{i,k}^T x^1_{k-1} + b_k[i]\, z_k[i] \ge 0 \\ l_{k-1}(1 - z_k[i]) \le x^0_{k-1} \le u_{k-1}(1 - z_k[i]) \\ l_{k-1}\, z_k[i] \le x^1_{k-1} \le u_{k-1}\, z_k[i] \\ z_k[i] \in [0, 1] \end{cases} \tag{26}$$

where w_{i,k} denotes the i-th row of W_k, and x^1_{k-1} and x^0_{k-1} are copies of the previous layer's variables. Applying (26) to the entire neural network results in a quadratic number of variables (relative to the number of neurons). The formulation can be obtained from well-known techniques in the MIP literature (Jeroslow, 1987): it is the union of the two polyhedra for a passing and a blocking ReLU, operating in the space of x_{k-1}. Anderson et al. (2019) show that Ã_k = Proj_{x_{k-1}, x_k, z_k}(S_k). If pre-activation bounds l̂_k, û_k (computed as described in section E) are available, we can naturally add them to (26) as follows:

$$\bar{S}_{k,i} = \begin{cases} (x_{k-1}, x_k[i], z_k[i]) \in S_{k,i} \\ \hat{l}_k[i](1 - z_k[i]) \le w_{i,k}^T x^0_{k-1} + b_k[i](1 - z_k[i]) \le \hat{u}_k[i](1 - z_k[i]) \\ \hat{l}_k[i]\, z_k[i] \le w_{i,k}^T x^1_{k-1} + b_k[i]\, z_k[i] \le \hat{u}_k[i]\, z_k[i] \end{cases} \tag{27}$$

We now prove that this formulation yields A_k when projecting out the copies of the activations.

Proposition. Sets S̄_k from equation (27) and A_k from problem (3) are equivalent, in the sense that A_k = Proj_{x_{k-1}, x_k, z_k}(S̄_k).

Proof.
In order to prove the equivalence, we rely on Fourier-Motzkin elimination, as in the original Anderson relaxation proof (Anderson et al., 2019). Following the lines of the original proof, we start from (26) and eliminate x^1_{k-1}, x^0_k[i] and x^1_k[i] by exploiting the equalities. We then rewrite all the inequalities as upper or lower bounds on x^0_{k-1}[0] in order to eliminate this variable. As in Anderson et al. (2019), we assume w_{i,k}[0] > 0; the proof generalizes by using Ľ and Ǔ for w_{i,k}[0] < 0, whereas if the coefficient is 0 the variable is easily eliminated. We get the following system:

$$x^0_{k-1}[0] = \frac{1}{w_{i,k}[0]}\Big( w_{i,k}^T x_{k-1} - \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i]\, z_k[i] - x_k[i] \Big) \tag{28a}$$
$$x^0_{k-1}[0] \le -\frac{1}{w_{i,k}[0]}\Big( \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i](1 - z_k[i]) \Big) \tag{28b}$$
$$x^0_{k-1}[0] \le \frac{1}{w_{i,k}[0]}\Big( w_{i,k}^T x_{k-1} - \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i]\, z_k[i] \Big) \tag{28c}$$
$$l_{k-1}[0](1 - z_k[i]) \le x^0_{k-1}[0] \le u_{k-1}[0](1 - z_k[i]) \tag{28d}$$
$$x^0_{k-1}[0] \le x_{k-1}[0] - l_{k-1}[0]\, z_k[i] \tag{28e}$$
$$x^0_{k-1}[0] \ge x_{k-1}[0] - u_{k-1}[0]\, z_k[i] \tag{28f}$$
$$x^0_{k-1}[0] \le \frac{1}{w_{i,k}[0]}\Big( w_{i,k}^T x_{k-1} - \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + (b_k[i] - \hat{l}_k[i])\, z_k[i] \Big) \tag{28g}$$
$$x^0_{k-1}[0] \ge \frac{1}{w_{i,k}[0]}\Big( w_{i,k}^T x_{k-1} - \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + (b_k[i] - \hat{u}_k[i])\, z_k[i] \Big) \tag{28h}$$
$$x^0_{k-1}[0] \ge \frac{1}{w_{i,k}[0]}\Big( (\hat{l}_k[i] - b_k[i])(1 - z_k[i]) - \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] \Big) \tag{28i}$$
$$x^0_{k-1}[0] \le \frac{1}{w_{i,k}[0]}\Big( (\hat{u}_k[i] - b_k[i])(1 - z_k[i]) - \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] \Big) \tag{28j}$$

where only inequalities (28g) to (28j) are not present in the original proof. We therefore focus on the part of the Fourier-Motzkin elimination that deals with them, and invite the reader to refer to Anderson et al. (2019) for the others. The combination of these new inequalities yields trivial constraints. For instance:

$$(28i) + (28g) \implies \hat{l}_k[i] \le w_{i,k}^T x_{k-1} + b_k[i] = \hat{x}_k[i],$$

which holds by the definition of pre-activation bounds.
Let us recall that x_k[i] ≥ 0 and x_k[i] ≥ x̂_k[i], the latter constraint resulting from (28a) + (28b). Then, it can be easily verified that the only combinations of interest (i.e., those that do not result in constraints that are trivial by definition or implied by other constraints) are those containing the equality (28a). In particular, combining inequalities (28g) to (28j) with inequalities (28d) to (28f) generates constraints that are (after algebraic manipulations) superfluous with respect to those in (30). We are now ready to show the system resulting from the elimination:

$$x_k[i] \ge 0 \tag{30a}$$
$$x_k[i] \ge \hat{x}_k[i] \tag{30b}$$
$$x_k[i] \le w_{i,k}[0]\, x_{k-1}[0] - w_{i,k}[0]\, l_{k-1}[0](1 - z_k[i]) + \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i]\, z_k[i] \tag{30c}$$
$$x_k[i] \le w_{i,k}[0]\, u_{k-1}[0]\, z_k[i] + \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i]\, z_k[i] \tag{30d}$$
$$x_k[i] \ge w_{i,k}[0]\, x_{k-1}[0] - w_{i,k}[0]\, u_{k-1}[0](1 - z_k[i]) + \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i]\, z_k[i] \tag{30e}$$
$$x_k[i] \ge w_{i,k}[0]\, l_{k-1}[0]\, z_k[i] + \sum_{j \ge 1} w_{i,k}[j]\, x^0_{k-1}[j] + b_k[i]\, z_k[i] \tag{30f}$$
$$l_{k-1}[0] \le x_{k-1}[0] \le u_{k-1}[0] \tag{30g}$$
$$x_k[i] \ge \hat{l}_k[i]\, z_k[i] \tag{30h}$$
$$x_k[i] \le \hat{u}_k[i]\, z_k[i] \tag{30i}$$
$$x_k[i] \le \hat{x}_k[i] - \hat{l}_k[i](1 - z_k[i]) \tag{30j}$$
$$x_k[i] \ge \hat{x}_k[i] - \hat{u}_k[i](1 - z_k[i]) \tag{30k}$$

Constraints (30a) to (30g) are those resulting from the original derivation of Ã_k (see Anderson et al. (2019)); the others result from the inclusion of pre-activation bounds in (27). Of these, (30h) is implied by (30a) if l̂_k[i] ≤ 0, and by the definition of pre-activation bounds (together with (30b)) if l̂_k[i] > 0. Analogously, (30k) is implied by (30b) if û_k[i] ≥ 0, and by (30a) otherwise. By noting that no auxiliary variable is left in (30i) and (30j), we can conclude that these will not be affected by the remaining part of the elimination procedure. Therefore, the rest of the proof (the elimination of x^0_{k-1}[1], x^0_{k-1}[2], ...) proceeds as in Anderson et al. (2019), leading to A_{k,i}.
Repeating the proof for each neuron i at layer k, we get A k = Proj x k-1 ,x k ,z k (S k ).

G MASKED FORWARD AND BACKWARD PASSES

Crucial to the practical efficiency of our solvers is representing the various operations as standard forward/backward passes over a neural network, so as to leverage the engineering efforts behind popular deep learning frameworks such as PyTorch (Paszke et al., 2017). While this is trivial for the Big-M solver (appendix B), the Active Set method (§3.1) requires a specialised operator that we call a "masked" forward/backward pass; here, we provide the details of our implementation. The masked forward and backward passes respectively take the forms

$$(W_k \odot I_k)\, x_k, \qquad (W_k \odot I_k)^T x_{k+1},$$

and they are needed when dealing with the exponential family of constraints from the relaxation by Anderson et al. (2020). A crucial property of the operator is that the mask I_k may take a different value for each input/output combination. While this is straightforward to implement for fully connected layers, we need to be more careful when handling convolutional operators, which rely on re-applying the same weights (the kernel) to many different parts of the image. A naive solution is to convert convolutions into equivalent linear operators, but this has a high cost in terms of performance, as it involves much redundancy. A convolutional operator can instead be represented via a matrix-matrix multiplication if the input is unfolded and the filter appropriately reshaped; the multiplication output can then be reshaped to the correct convolutional output shape. Given a filter w ∈ R^{c×k_1×k_2}, an input x ∈ R^{i_1×i_2×i_3} and the convolutional output conv_w(x) = y ∈ R^{c×o_2×o_3}, we need the following definitions:

$$[\,\cdot\,]_{I,O} : I \to O, \qquad \{\,\cdot\,\}_j : \mathbb{R}^{d_1 \times \dots \times d_n} \to \mathbb{R}^{d_1 \times \dots \times d_{j-1} \times d_{j+1} \times \dots \times d_n},$$
$$\text{unfold}_w(\cdot) : \mathbb{R}^{i_1 \times i_2 \times i_3} \to \mathbb{R}^{k_1 k_2 \times o_2 o_3}, \qquad \text{fold}_w(\cdot) : \mathbb{R}^{k_1 k_2 \times o_2 o_3} \to \mathbb{R}^{i_1 \times i_2 \times i_3}, \tag{31}$$

where the brackets simply reshape the vector from shape I to O, while the braces sum over the j-th dimension. unfold decomposes the input image into the (possibly overlapping) o_2 o_3 blocks the sliding kernel operates on, taking padding and striding into account; fold brings the output of unfold back to the original input space. Let us define the following reshaped versions of the filter and the convolutional output:

$$W_R = [w]_{\mathbb{R}^{c \times k_1 \times k_2},\, \mathbb{R}^{c \times k_1 k_2}}, \qquad y_R = [y]_{\mathbb{R}^{c \times o_2 \times o_3},\, \mathbb{R}^{c \times o_2 o_3}}.$$

The standard forward/backward convolution (neglecting the convolutional bias, which can be added at the end of the forward pass) can then be written as:

$$\text{conv}_w(x) = [W_R\, \text{unfold}_w(x)]_{\mathbb{R}^{c \times o_2 o_3},\, \mathbb{R}^{c \times o_2 \times o_3}}, \qquad \text{back\_conv}_w(y) = \text{fold}_w(W_R^T y_R).$$

We need to mask the convolution with a different mask for each input-output pair: this means employing a mask I ∈ R^{c×k_1k_2×o_2o_3}. Therefore, assuming vectors are broadcast to the correct output shape, we can write the masked forward and backward passes as:

$$\text{conv}_{w,I}(x) = \left[ \{W_R \odot I \odot \text{unfold}_w(x)\}_2 \right]_{\mathbb{R}^{c \times o_2 o_3},\, \mathbb{R}^{c \times o_2 \times o_3}}, \qquad \text{back\_conv}_{w,I}(y) = \text{fold}_w\left( \{W_R \odot I \odot y_R\}_1 \right).$$

Owing to the avoided redundancy with respect to the equivalent linear operation (e.g., copying of the kernel matrix, zero-padding in the linear weight matrix), this implementation of the masked forward/backward pass reduces both the memory footprint and the number of floating point operations (FLOPs) associated with the passes by a factor (i_1 i_2 i_3)/(k_1 k_2). In practice, this ratio can be significant: on the incomplete verification networks (§5.1) it ranges from 16 to 64.
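A numpy sketch of the masked forward pass (stride 1, no padding; the k_1 k_2 dimension is generalised to c_in k_1 k_2 to handle multi-channel inputs; all function names are ours, not the paper's PyTorch code):

```python
import numpy as np

def unfold(x, k1, k2):
    """Decompose x (cin, h, w) into its sliding k1 x k2 blocks:
    returns a (cin*k1*k2, o2*o3) matrix (stride 1, no padding)."""
    cin, h, w = x.shape
    o2, o3 = h - k1 + 1, w - k2 + 1
    cols = [x[:, i:i + k1, j:j + k2].reshape(-1)
            for i in range(o2) for j in range(o3)]
    return np.stack(cols, axis=1), o2, o3

def masked_conv(w, x, mask):
    """Masked forward pass: a convolution whose kernel mask may differ
    for every input/output location. w: (cout, cin, k1, k2);
    mask: (cout, cin*k1*k2, o2*o3) with entries in {0, 1}."""
    cout, _, k1, k2 = w.shape
    W_R = w.reshape(cout, -1)                 # reshaped filter (cout, cin*k1*k2)
    cols, o2, o3 = unfold(x, k1, k2)          # unfolded input (cin*k1*k2, o2*o3)
    out = (W_R[:, :, None] * mask * cols[None]).sum(axis=1)  # the {.}_2 sum
    return out.reshape(cout, o2, o3)
```

With an all-ones mask this reduces to the standard convolution, which gives a simple correctness check; a per-location mask costs no more than that dense case.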

H STRATIFIED BOUNDING FOR BRANCH AND BOUND

As seen in the complete verification results of Figure 3 (§5.2), the use of a tighter bounding algorithm results in the verification of a larger number of properties. In general, tighter bounds come at a larger computational cost, which might negatively affect performance on "easy" verification properties, where a small number of domain splits with loose bounds suffices to verify the property (hence, the tightness is not needed). As a general complete verification system needs to be efficient a priori, on both easy and hard properties, we employ a stratified bounding system for use within Branch-and-Bound-like complete verification algorithms. We design a set of heuristics to determine whether the children of a given subproblem (i.e., the subproblems arising after splitting on it) will require tight bounds. The graph describing parent-child relations between subproblems is referred to as the BaB search tree. Given a problem p, we individually mark it as difficult (with all its future offspring) if it meets all the conditions in the following set:

• The lower bound on the minimum over the subproblem, p_LB, should be relatively "close" to the decision threshold (0, in the standard form by Bunel et al. (2018)): that is, p_LB ≤ c_LB. We argue that tighter bounds are worth computing only if they are likely to result in a crossing of the decision threshold, thus cutting away a part of the BaB search tree.

• The depth p_depth in the BaB search tree should not be below a certain threshold c_depth. This ensures a certain number of splits is performed before adopting tighter bounds.

• The running average of the lower bound improvement from parent to child, p_impr, should fall below a threshold c_impr. This reflects the idea that if splitting seems effective on a given section of the BaB tree, it is perhaps wiser to invest the computational budget in more splits (with cheaper bounding computations, more splits can be performed in a given time) than in tighter bounds. Empirically, this is the criterion with the largest effect on performance.

Additionally, we do not employ tighter bounds unless the verification problem itself (in addition to the individual subproblems) is marked as "hard". We do so when the number of un-pruned domains reaches a certain threshold c_hard. The set of criteria above needs to address the following trade-off: one should postpone the use of tighter bounds until the difficulty of the verification task is certain to require it; at the same time, if the switch to tighter bounds is excessively delayed, one might end up computing tight bounds on a large number of subproblems (as sections of the BaB tree were not pruned in time), hurting performance on harder tasks. In practice, we set the criteria as follows for all our experiments: c_depth = 0, c_LB = 0.5, c_impr = 10^-1, c_hard = 200. As seen in Figure 3, this results in a reasonable trade-off between losing efficiency on the harder properties (wide model) and gaining it on the easier ones (base and deep models).
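The criteria above combine into a single predicate per subproblem; a direct transcription (function and argument names are ours, defaults as in the experiments):

```python
def needs_tight_bounds(p_lb, p_depth, p_impr, problem_is_hard,
                       c_lb=0.5, c_depth=0, c_impr=1e-1):
    """Mark a subproblem (and hence its future offspring) as 'difficult',
    i.e. switch it to the tighter bounding, only if the overall problem
    is hard and all three subproblem criteria from this appendix hold."""
    return bool(problem_is_hard
                and p_lb <= c_lb        # close to the decision threshold
                and p_depth >= c_depth  # enough splits already performed
                and p_impr < c_impr)    # splitting no longer improving much
```

Because the predicate is monotone in its inputs, once a branch is marked difficult all its descendants remain so, matching the "with all its future offspring" behaviour described above.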

I EXPERIMENTAL APPENDIX

We conclude the appendix by presenting supplementary experiments and additional details with respect to the presentation in the main paper.

I.1 HYPERPARAMETERS

Dual updates are performed via Adam (Kingma & Ba, 2015), which showed stronger empirical convergence. For Big-M, replicating the findings by Bunel et al. (2020a) on their supergradient method, we linearly decrease the step size from 10^-2 to 10^-4. Active Set is initialized with 500 Big-M iterations, after which the step size is reset and linearly scaled from 10^-3 to 10^-6. We found the addition of variables to the active set to be effective before convergence: we add variables every 450 iterations, without re-scaling the step size again. Every addition consists of 2 new variables (see pseudo-code in appendix D), which was found to be a good compromise between fast bound improvement and computational cost. On complete verification, we re-employed the same hyper-parameters for both Big-M and Active Set, except the number of iterations; for Big-M, this was tuned to employ roughly the same time per bounding computation as BDD+. As the complete verification problem is formulated as a minimisation (as in problem (1)), in Branch and Bound we need a lower and an upper bound on the minimum of each sub-problem. The lower bound is computed by running the bounding algorithm, while the upper bound on the minimum is obtained by evaluating the network at the last primal point output by the bounding algorithm during the lower bound computation (running the bounding algorithm to obtain an upper bound would result in a much looser value, as it would imply upper bounding a version of problem (1) with maximisation instead of minimisation).

I.2 DATASET DETAILS

We now provide further details on the employed datasets. For incomplete verification, the architecture was introduced by Wong & Kolter (2018) and re-employed by Bunel et al. (2020a). It corresponds to the "Wide" complete verification architecture, found in Table 2. Due to the additional computational cost of bounds obtained via the tighter relaxation (3), we restricted the experiments to the first 3450 CIFAR-10 test set images for the SGD-trained network (Figures 1, 2), and to the first 4513 images for the network trained via the method by Madry et al. (2018) (Figures 6, 7).

For complete verification, we employed a subset of the adversarial robustness dataset presented by Lu & Kumar (2020) and used by Bunel et al. (2020a), where the set of properties per network has been restricted to 100. The dataset provides, for a subset of the CIFAR-10 test set, a verification radius ε_ver defining the small region over which to look for adversarial examples (input points for which the output of the network is not the correct class), and a (randomly sampled) non-correct class to verify against. The verification problem is formulated as the search for an adversarial example, carried out by minimizing the difference between the ground truth logit and the target logit: if the minimum is positive, no counter-example exists, and the network is robust. The ε_ver radius was tuned to meet a certain "problem difficulty" via binary search, employing a Gurobi-based bounding algorithm (Lu & Kumar, 2020). The networks are robust on all the properties we employed. Three different network architectures of different sizes are used: a "Base" network with 3172 ReLU activations, and two networks with roughly twice as many activations, a "Deep" network and a "Wide" network. Details can be found in Table 2.

Table 2: For each complete verification experiment, the network architecture used and the number of verification properties tested, a subset of the dataset by Lu & Kumar (2020). Each layer but the last is followed by a ReLU activation function.
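The verification objective described above can be sketched at a single input point. This is a hedged illustration (the function name is ours): a positive margin at the global minimum over the ε_ver ball certifies robustness, while evaluating at sampled points can only falsify the property.

```python
import numpy as np

def verification_margin(logits, true_class, target_class):
    """Difference between the ground-truth logit and the target logit
    at one input point; the property asks this to be positive over the
    whole eps_ver ball."""
    return float(logits[true_class] - logits[target_class])
```

For example, with logits [1.0, 3.0, 0.5], true class 1 and target class 2, the margin is 2.5: no counter-example at this point.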

I.3 ADVERSARIALLY-TRAINED INCOMPLETE VERIFICATION

Figure 6: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from A_k per neuron; higher is better. Results for the network adversarially trained with the method by Madry et al. (2018), from Bunel et al. (2020a). The width at a given value represents the proportion of problems for which this is the result.

As on the SGD-trained network, Active Set yields better bounds than Gurobi 1 cut in a fraction of its running time, even though the performance gap is on average smaller than on the SGD-trained network.

I.4 SENSITIVITY TO SELECTION CRITERION AND FREQUENCY

In section 3.2, we describe how to iteratively modify B, the active set of dual variables on which our Active Set solver operates. In short, Active Set adds the variables corresponding to the output of oracle (4), invoked at the primal minimiser of L_B(x, z, α, β_B), at a fixed frequency ω. We now investigate the empirical sensitivity of Active Set to both the selection criterion and the frequency of addition.

We test against Ran. Selection, a version of Active Set for which the variables to add are selected at random, by uniformly sampling from the binary I_k masks. As expected, Figure 8 shows that a good selection criterion is key to the efficiency of Active Set: random variable selection only marginally improves upon the Planet relaxation bounds, whereas the improvement becomes significant with our selection criterion from §3.2.

In addition, we investigate the sensitivity of Active Set (AS) to the variable addition frequency ω. To do so, we cap the maximum number of cuts at 7 for all runs, and vary ω while keeping the time budget fixed (we test on three different time budgets). Figure 9 compares the results for ω = 450 (Active Set), which were presented in §5.1, with the bounds obtained by setting ω = 300 and ω = 600 (respectively, AS ω = 300 and AS ω = 600). Our solver proves to be relatively robust to ω across all three budgets, with the difference in obtained bounds decreasing with the number of iterations. Moreover, early cut addition tends to yield better bounds in the same time, suggesting that our selection criterion is effective before subproblem convergence.
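The addition schedule used in this ablation can be sketched as follows; the function is an illustrative assumption, returning the iterations at which the oracle is invoked: every ω steps, up to the cap of 7 cut additions.

```python
def addition_iterations(n_iters, omega, max_cuts=7):
    """Iterations at which variables are added to the active set:
    every omega steps, capped at max_cuts additions overall."""
    return list(range(omega, n_iters + 1, omega))[:max_cuts]
```

With ω = 450 and a 1000-iteration budget, variables are added at iterations 450 and 900; longer runs hit the 7-cut cap at iteration 3150.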

I.5 MNIST INCOMPLETE VERIFICATION

We conclude the experimental appendix by presenting incomplete verification results (the experimental setup mirrors the one employed in section 5.1 and appendix I.3) on the MNIST dataset (LeCun et al., 1998). We report results on the "wide" MNIST network from Lu & Kumar (2020), whose architecture is identical to the "wide" network in Table 2 except for the first layer, which has only one input channel to reflect the MNIST input specification (the total number of ReLU activation units is 4804). As for the networks employed in the complete verification experiments (§5.2), and differently from the incomplete verification experiments in section 5.1 and appendix I.3, the network was adversarially trained with the method by Wong & Kolter (2018). We compute the robustness margin to ε_ver = 0.15 on the first 2960 images of the MNIST test set. All hyper-parameters are kept to the values employed for the CIFAR-10 networks, except the Big-M step size, which was linearly decreased from 10^-1 to 10^-3, and the weight of the proximal terms for BDD+, which was linearly increased from 1 to 50.

As seen on the CIFAR-10 networks, Figures 10, 11 show that Active Set yields comparable or better bounds than Gurobi 1 cut in less average runtime. However, more iterations are required to reach the same relative bound improvement over Gurobi 1 cut (2500, as opposed to 600 in Figures 1, 6). Furthermore, the smaller average gap between the bounds of Gurobi Planet and Gurobi 1 cut (especially with respect to Figure 1) suggests that the relaxation by Anderson et al. (2020) is less effective on this MNIST benchmark.



As dual variables β_{k,I_k} are indexed by I_k, B = ∪_k B_k implicitly defines an active set of variables β_B. Adding a single I*_k mask to B_k extends β_B by n_k variables: one for each neuron at layer k. If we want to perform an element-wise product a ⊙ b between a ∈ R^(d1×d2×d3) and b ∈ R^(d1×d3), the operation is implicitly performed as a ⊙ b̄, where b̄ ∈ R^(d1×d2×d3) is an extended version of b obtained by copying along the missing dimension.
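The implicit extension along the missing dimension is exactly NumPy-style broadcasting; a minimal sketch (shapes chosen arbitrarily for illustration):

```python
import numpy as np

d1, d2, d3 = 2, 3, 4
a = np.arange(d1 * d2 * d3, dtype=float).reshape(d1, d2, d3)
b = np.ones((d1, d3))

# b is extended to shape (d1, d2, d3) by copying along the missing
# middle dimension before the element-wise product.
out = a * b[:, None, :]
assert out.shape == (d1, d2, d3)
```

On GPU frameworks such as PyTorch the same broadcasting happens implicitly, which is what the footnote's notation relies on.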



(a) Comparison of runtime (left) and gap to Gurobi Planet bounds (right). For the latter, lower is better. (b) Comparison of runtime (left) and difference with the Gurobi Planet bounds (right). For the latter, higher is better.

Figure 2: Pointwise comparison for a subset of the methods on the data presented in Figure 1. Darker colour shades mean higher point density (on a logarithmic scale). The oblique dotted line corresponds to equality.

return min_{x,z} L_B(x, z, α_T, β_{B,T})
end function

We conclude this section by proving the primal interpretation of the selection criterion for adding a new set of variables to B.

D.1 ACTIVE SET SELECTION CRITERION

We map a restricted set of dual variables β_B to a set of dual variables β for the full dual (6) by setting the variables not in the active set to 0: β_{B̄} = 0, and β = β_B ∪ β_{B̄}.

Proposition. Let β_{k,I*_k} be the set of dual variables maximising the corresponding entries of the supergradient of the full dual problem (6): β_{k,I*_k} ∈ argmax_{β_{k,I_k}} {∇_{β_{k,I_k}} d(α, β)^T 1}. Then β_{k,I*_k} represents the Lagrangian multipliers associated with the most violated constraints from A_k at (x*, z*) ∈ argmin_{x,z} L_B(x, z, α, β_B), the primal minimiser of the current restricted Lagrangian.
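The proposition reduces mask selection to an argmax over constraint violations at the current primal minimiser. A hedged, simplified sketch (the function name and the flat `violations` array, holding one supergradient entry per candidate mask and neuron, are illustrative assumptions, not the paper's data layout):

```python
import numpy as np

def select_mask(violations):
    """Index of the candidate mask I_k whose constraints from A_k are
    most violated at the current primal minimiser, where `violations`
    has shape (n_masks, n_neurons) and holds the supergradient entries
    of the full dual; the criterion sums over the layer's neurons."""
    return int(np.argmax(violations.sum(axis=1)))
```

In other words, instead of maximising the dual directly over exponentially many masks, one only scores candidate masks by their violation at (x*, z*).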

Figure 5: Example network architecture in which M_k ⊂ A_k, with pre-activation bounds computed with C_k = M_k. For the bold nodes (the two hidden layers), a ReLU activation follows the linear function. The numbers in parentheses indicate multiplicative weights; the others are additive biases (if any).

Comparison of runtime (left) and gap to Gurobi Planet bounds (right). For the latter, lower is better.

Comparison of runtime (left) and improvement from the Gurobi Planet bounds (right). For the latter, higher is better.

Figure 7: Pointwise comparison for a subset of the methods on the data presented in Figure 6. Darker colour shades mean higher point density (on a logarithmic scale). The oblique dotted line corresponds to equality.

Figure 8: Sensitivity of Active Set to the selection criterion (see §3.2). Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from A_k per neuron; higher is better. Results for the SGD-trained network from Bunel et al. (2020a). The width at a given value represents the proportion of problems for which this is the result.

Figure 10: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from A_k per neuron; higher is better. MNIST results for a network adversarially trained with the method by Wong & Kolter (2018), from Lu & Kumar (2020). The width at a given value represents the proportion of problems for which this is the result.

Figure 11: Pointwise comparison for a subset of the methods on the data presented in Figure 10. Comparison of runtime (left) and improvement from the Gurobi Planet bounds (right); for the latter, higher is better. Darker colour shades mean higher point density (on a logarithmic scale). The oblique dotted line corresponds to equality.

Figure 1: Upper plot: distribution of runtime in seconds. Lower plot: difference with the bounds obtained by Gurobi with a cut from A_k per neuron; higher is better. Results for the SGD-trained network from Bunel et al. (2020a).

Table 1: We compare average solving time, average number of solved sub-problems and the percentage of timed-out properties on data from Lu & Kumar (2020). The best dual iterative method is highlighted in bold.

The stratified bounding system decides whether a given subproblem is easy (so that looser bounds are sufficient) or whether we need to operate on the tighter relaxation. Instances of this approach are Big-M + Active Set and Gurobi Planet + Gurobi 1 cut. Further details are provided in appendix H. Figure 3 and Table 1 show that Big-M performs competitively with BDD+. Active Set verifies a larger share of properties than the methods operating on the looser formulation (2), demonstrating the benefit of tighter bounds (§5.1) in complete verification. On the other hand, the poor performance of MIP + A_k and of Gurobi Planet + Gurobi 1 cut, tied to the scaling limitations of off-the-shelf solvers, shows that tighter bounds are effective only if they can be computed efficiently. Nevertheless, the difference in performance between the two Gurobi-based methods confirms that customised Branch and Bound solvers (BaBSR) are preferable to generic MIP solvers, as observed by Bunel et al. (2020b) on the looser Planet relaxation. Moreover, the stratified bounding system allows us to retain the speed of Big-M on easier properties, without excessively sacrificing Active Set's gains on the harder ones. Finally, while ERAN verifies 2% more properties than Active Set on two networks, BaBSR (with any dual bounding algorithm) is faster on most of the properties. BaBSR-based results could be further improved by employing the learned branching strategy presented by Lu & Kumar (2020): in this work, we focused on the bounding component of branch and bound.

I.1 EXPERIMENTAL SETTING, HYPER-PARAMETERS

All the experiments and bounding computations (including intermediate bounds) were run on a single Nvidia Titan Xp GPU, except for Gurobi-based methods and "Active Set CPU". These were run on i7-6850K CPUs, utilising 4 cores for the incomplete verification experiments, and 6 cores for the more demanding complete verification experiments. The experiments were run under Ubuntu 16.04.2 LTS. Complete verification results for ERAN, the method by Singh et al. (2020), are taken from the recent VNN-COMP competition (VNN-COMP, 2020). These were executed by Singh et al. (2020) on a 2.6 GHz Intel Xeon CPU E5-2690 with 512 GB of main memory, utilising 14 cores.

Gurobi-based methods make use of LP incrementalism (warm-starting) when possible. In the experiments of §5.1, where each image involves the computation of 9 different output upper bounds, we warm-start each LP from the LP of the previous neuron. For "Gurobi 1 cut", which involves two LPs per neuron, we first solve all Big-M LPs, then proceed with the LPs containing a single cut. Hyper-parameter tuning for incomplete verification was done on a small subset of the CIFAR-10 test set, on the SGD-trained network from Figures 1, 2. BDD+ is run with the hyper-parameters found by Bunel et al. (2020a) on the same datasets, for both incomplete and complete verification. For all supergradient-based methods (Big-M, Active Set), we employed the Adam update rule (Kingma & Ba, 2015).

In addition to the SGD-trained network in §5.1, we now present results for the same architecture, trained with the adversarial training method by Madry et al. (2018) for robustness to perturbations of ε_train = 8/255. Each adversarial sample for the training was obtained using 50 steps of projected gradient descent. For this network, we measure the robustness margin to perturbations with

ACKNOWLEDGMENTS

ADP was supported by the EPSRC Centre for Doctoral Training in Autonomous Intelligent Machines and Systems, grant EP/L015987/1, and by an IBM PhD fellowship. HSB was supported by a Tencent studentship through the University of Oxford.

