LOSS LANDSCAPE MATTERS: TRAINING CERTIFIABLY ROBUST MODELS WITH FAVORABLE LOSS LANDSCAPE

Abstract

In this paper, we study the problem of training certifiably robust models. Certifiable training minimizes an upper bound on the worst-case loss over the allowed perturbations, and thus the tightness of the upper bound is an important factor in building certifiably robust models. However, many studies have shown that Interval Bound Propagation (IBP) training, despite using much looser bounds, outperforms methods that use tighter bounds. We identify another key factor that influences the performance of certifiable training: smoothness of the loss landscape. We consider linear relaxation-based methods and find significant differences in the loss landscape across these methods. Based on this analysis, we propose a certifiable training method that utilizes a tighter upper bound and has a landscape with favorable properties. The proposed method achieves performance comparable to state-of-the-art methods under a wide range of perturbations.

1. INTRODUCTION

Despite the success of deep learning in many applications, the existence of adversarial examples, imperceptibly modified inputs designed to fool a neural network (Szegedy et al., 2013; Biggio et al., 2013), hinders the application of deep learning to safety-critical domains. There has been increasing interest in building models that are robust to adversarial attacks (Goodfellow et al., 2014; Papernot et al., 2016; Kurakin et al., 2016; Madry et al., 2018; Tramèr et al., 2017; Zhang et al., 2019a; Xie et al., 2019). However, most defense methods evaluate their robustness with adversarial accuracy against predefined attacks such as the PGD attack (Madry et al., 2018) or the C&W attack (Carlini & Wagner, 2017), and thus these defenses can be broken by new attacks (Athalye et al., 2018). To this end, many training methods have been proposed to build certifiably robust models that are guaranteed to be robust to adversarial perturbations (Hein & Andriushchenko, 2017; Raghunathan et al., 2018b; Wong & Kolter, 2018; Dvijotham et al., 2018; Mirman et al., 2018; Gowal et al., 2018; Zhang et al., 2019b). These methods develop an upper bound on the worst-case loss over valid adversarial perturbations and minimize it to train a certifiably robust model. Certifiable training methods can be mainly categorized into two types: linear relaxation-based methods and bound propagation methods. Linear relaxation-based methods use relatively tighter bounds, but are slow, hard to scale to large models, and memory-inefficient (Wong & Kolter, 2018; Wong et al., 2018; Dvijotham et al., 2018). On the other hand, bound propagation methods, represented by Interval Bound Propagation (IBP), are fast and scalable due to the use of simple but much looser bounds (Mirman et al., 2018; Gowal et al., 2018). One would expect that training with tighter bounds would lead to better performance, but IBP outperforms linear relaxation-based methods in many cases, despite using much looser bounds.
These observations on the performance of certifiable training methods raise the following questions: Why does training with tighter bounds not result in better performance? What other factors may influence the performance of certifiable training? How can we improve the performance of certifiable training methods with tighter bounds? In this paper, we provide empirical and theoretical analysis to answer these questions. First, we demonstrate that IBP (Gowal et al., 2018) has a more favorable loss landscape than other linear relaxation-based methods, and thus it often leads to better performance even with much looser bounds. To account for this difference, we present a unified view of IBP and linear relaxation-based methods and find that the relaxed gradient approximation (which will be defined in Definition 1) of each method plays a crucial role in its optimization behavior. Based on the analysis of the loss landscape and the optimization behavior, we propose a new certifiable training method that has a favorable landscape with tighter bounds. The performance of the proposed method is comparable to that of state-of-the-art methods under a wide range of perturbations. We summarize the contributions of this study as follows:

• We provide empirical and theoretical analysis of the loss landscape of certifiable training methods and find that smoothness of the loss landscape is important for building certifiably robust models.

• We propose a certifiable training method with tighter bounds and a favorable loss landscape, obtaining comparable performance with state-of-the-art methods under a wide range of perturbations.

2. RELATED WORK

Earlier studies on training certifiably robust models were limited to 2-layer networks (Hein & Andriushchenko, 2017; Raghunathan et al., 2018a). To scale to larger networks, a line of work proposed using a linear relaxation of the nonlinear activations to formulate a robust optimization; a dual problem is then considered, and a dual feasible solution is used to simplify the computation further. By doing so, Wong & Kolter (2018) built a method that can scale to a 4-layer network, and later, Wong et al. (2018) used Cauchy random projections to scale to much larger networks. However, these methods are still slow and memory-inefficient. Dvijotham et al. (2018) proposed a method called predictor-verifier training (PVT), which uses a verifier network to optimize the dual solution. This is similar to our proposed method, but ours does not require any additional network. Xiao et al. (2018) proposed adding a regularization term to adversarial training to induce ReLU stability, but this is less effective than other certified defenses. We also encourage our model to avoid unstable ReLUs, but we train the model with an upper bound on the worst-case loss and investigate ReLU stability from the loss landscape perspective. Mirman et al. (2018) proposed propagating a geometric bound (called a domain) through the network to yield an outer approximation in logit space. This can be done with an efficient layerwise computation that exploits interval arithmetic. Over the outer domain, one can compute the worst-case loss to be minimized during training. Gowal et al. (2018) used a special case of domain propagation called Interval Bound Propagation (IBP), which uses the simplest domain, the interval domain (or interval bound). In IBP, the authors introduced a different objective function, heuristic scheduling of the hyperparameters, and elision of the last layer to stabilize training and improve performance.
Both approaches, linear relaxation-based methods and bound propagation methods, use an upper bound on the worst-case loss. Bound propagation methods exploit much looser upper bounds, but they enjoy an unexpected benefit in many cases: better robustness than linear relaxation-based methods. Balunovic & Vechev (2019) hypothesized that the complexity of the loss computation makes the optimization more difficult, which could be a reason why IBP outperforms linear relaxation-based methods, and proposed a new optimization procedure with the existing linear relaxation. In this paper, we further investigate the causes of the difficulties in the optimization. Recently, Zhang et al. (2019b) proposed CROWN-IBP, which uses the linear relaxation from a verification method called CROWN (Zhang et al., 2018) in conjunction with IBP to train a certifiably robust model. Although beyond our focus here, there is another line of work on randomized smoothing (Li et al., 2018; Lecuyer et al., 2019; Cohen et al., 2019; Salman et al., 2019), which can probabilistically certify robustness with arbitrarily high probability by using a smoothed classifier; however, it requires a large number of samples for inference. There are many other works on certifiable verification (Weng et al., 2018; Singh et al., 2018a;b; 2019; Zhang et al., 2018; Boopathy et al., 2019; Lyu et al., 2020), but our work focuses on certifiable training.

3. BACKGROUND

First, we provide a brief overview of certifiable training methods. Then, we consider IBP (Gowal et al., 2018) as a special case of linear relaxation-based methods. This unified view on certifiable training methods helps us to comprehensively analyze the differences between the two approaches: bound propagation and linear relaxation. We present the details of the IBP in Appendix B.

3.1. NOTATIONS AND CERTIFIABLE TRAINING

We consider a c-class classification problem with a neural network f(x; θ) with the layerwise operations z^(k) = h^(k)(z^(k-1)) (k = 1, …, K) and the input z^(0) = x in the input space X. The corresponding probability function is denoted by p_f = softmax ∘ f : X → [0, 1]^c with subscript f. We denote a subnetwork with k operations as h^[k] = h^(k) ∘ ⋯ ∘ h^(1). For a linear operation h^(k), we use W^(k) and b^(k) to denote the weight and the bias for the layer. We consider the robustness of the classifier against the norm-bounded perturbation set B(x, ε) = {x′ ∈ X : ‖x′ − x‖ ≤ ε} with the perturbation level ε. Here, we mainly focus on the ℓ∞-norm bounded set. To compute the margin between the true class y for the input x and the other classes, we define a c × c matrix C(y) = I − 1 e(y)^T with (C(y)z^(K))_m = z^(K)_m − z^(K)_y (m = 0, …, c − 1). For the last linear layer, the weights W^(K) and the bias b^(K) are merged with C(y), that is, W^(K) ← C(y)W^(K) and b^(K) ← C(y)b^(K), yielding the margin score function s(x, y; θ) = C(y)f(x; θ) = f(x; θ) − f_y(x; θ)1, which satisfies p_s = p_f. Then we can define the worst-case margin score s*(x, y, ε; θ) = max_{x′∈B(x,ε)} s(x′, y; θ), where max is element-wise maximization. With an upper bound s̄ on the worst-case margin score, s̄ ≥ s*, we can provide an upper bound on the worst-case loss over valid adversarial perturbations as follows: L(s̄(x, y, ε; θ), y) ≥ max_{x′∈B(x,ε)} L(f(x′; θ), y) for the cross-entropy loss L (Wong & Kolter, 2018). Therefore, we can formulate certifiable training as a minimization of the upper bound,

min_θ L(s̄(x, y, ε; θ), y),    (1)

instead of directly solving min_θ max_{x′∈B(x,ε)} L(f(x′; θ), y), which is infeasible. Note that adversarial training (Madry et al., 2018) uses a strong iterative gradient-based attack (PGD) to provide a lower bound on the worst-case loss to be minimized, but it cannot provide a certifiably robust model.
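As a concrete illustration of the margin construction above, the following sketch builds C(y) = I − 1e(y)^T for a toy 3-class problem and applies it to a logit vector (the function name `margin_matrix` is ours, not from the paper):

```python
import numpy as np

def margin_matrix(y, c):
    """C(y) = I - 1 e(y)^T, so that (C(y) z)_m = z_m - z_y.
    A sketch of the margin construction described above."""
    return np.eye(c) - np.outer(np.ones(c), np.eye(c)[y])

# Example: logits z^(K) for a 3-class problem with true class y = 1.
z = np.array([2.0, 5.0, 1.0])
C = margin_matrix(1, 3)
s = C @ z            # margin score: s_m = z_m - z_y
# s = [-3., 0., -4.]; all entries <= 0 exactly when the true class wins.
```

A verified model would then certify robustness at x whenever an upper bound on these margins stays nonpositive for all non-true classes.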
Whenever possible, we will simplify the notations by omitting variables such as x, y, , and θ.

3.2. LINEAR RELAXATION-BASED METHODS

For a subnetwork h^[k], given the pre-activation upper/lower bounds u and l for each nonlinear activation function h in h^[k], linear relaxation-based methods (Wong & Kolter, 2018; Wong et al., 2018; Zhang et al., 2019b) use a relaxation of the activation function by two elementwise linear function bounds, h̲ and h̄, that is, h̲(z) ≤ h(z) ≤ h̄(z) for l ≤ z ≤ u. We denote the function bounds as h̲(z) = a̲ ⊙ z + b̲ and h̄(z) = ā ⊙ z + b̄ for some a̲, b̲, ā, and b̄, where ⊙ denotes the elementwise (Hadamard) product. Using all the function bounds h̲ and h̄ for the nonlinear activations in conjunction with the linear operations in h^[k], the i-th (scalar) activation h^[k]_i(·) ∈ ℝ can be upper bounded by a linear function g^T(·) + b over B(x, ε), as in Zhang et al. (2018). This can be equivalently explained with the dual relaxation viewpoint in Wong & Kolter (2018). Further details are provided in Appendix C. Now we are ready to upper bound the activation h^[k]_i over B(x, ε).

Definition 1 (Linear Relaxation with Relaxed Gradient Approximation). For each neuron activation h^[k]_i, a linear relaxation method computes an upper approximation of the activation over B(x, ε) by using g ∈ ℝ^d and b ∈ ℝ as follows:

max_{x′∈B(x,ε)} h^[k]_i(x′) ≤ max_{x′∈B(x,ε)} g^T x′ + b = g^T x + ε‖g‖_* + b.    (2)

We call g the relaxed gradient approximation of h^[k]_i over B(x, ε).

Similarly, we can obtain the corresponding lower bound. Inductively using these upper/lower bounds on the output of the subnetwork, we can obtain the bounds for the next subnetwork h^[k+1] and then for the whole network s. The final bound s̄ on the whole network s can then be used in the objective (1). The tightness of the bounds s̄ and L(s̄, y) highly depends on how the linear bounds h̲ and h̄ in each layer are chosen.

Unified view of IBP and linear relaxation-based methods. IBP can also be considered as a linear relaxation-based method using zero-slope (a̲ = ā = 0) linear bounds, h̄(z) = u⁺ and h̲(z) = l⁺, where v⁺ = max(v, 0) and v⁻ = min(v, 0). Thus, the bounds of a nonlinear activation depend only on the pre-activation bounds u and l for the activation layer, substantially reducing the feedforward/backpropagation computations. CROWN-IBP (Zhang et al., 2019b) applies different linear relaxation schemes to the subnetworks and the whole network. It uses the same linear bounds as IBP for the subnetworks h^[k] for k < K, and for the whole network s = h^[K] itself it uses h̄(z) = (u⁺/(u⁺ − l⁻)) ⊙ (z − l⁻) and h̲(z) = 1[u⁺ + l⁻ > 0] ⊙ z. Moreover, CROWN-IBP uses an interpolation between the IBP bound and the CROWN-IBP bound with the mixing weight β, with the following objective:

L((1 − β) s̄_IBP(x, y, ε; θ) + β s̄_CROWN-IBP(x, y, ε; θ), y).    (3)

Convex Adversarial Polytope (CAP) (Wong & Kolter, 2018; Wong et al., 2018) uses the linear bounds h̄(z) = (u⁺/(u⁺ − l⁻)) ⊙ (z − l⁻) and h̲(z) = (u⁺/(u⁺ − l⁻)) ⊙ z for all subnetworks h^[k] and the entire network.
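The closed-form maximization in (2) can be sketched numerically for the ℓ∞ case, where the dual norm ‖·‖_* is the ℓ1 norm. The values and the function name `linear_upper_bound` below are illustrative, not the authors' code:

```python
import numpy as np

def linear_upper_bound(g, b, x, eps):
    """Upper bound of g^T x' + b over the l_inf ball B(x, eps), using the
    dual-norm identity in (2): the dual of l_inf is l_1 (a sketch)."""
    return g @ x + eps * np.abs(g).sum() + b

g = np.array([1.0, -2.0])
x = np.array([0.5, 0.5])
eps = 0.1
ub = linear_upper_bound(g, b=0.3, x=x, eps=eps)
# 1*0.5 - 2*0.5 + 0.1*(1 + 2) + 0.3 = 0.1

# Sanity check by brute force over the corners of the ball: the maximum of a
# linear function over a box is attained at a corner.
corners = [x + eps * np.array(s) for s in [(1, 1), (1, -1), (-1, 1), (-1, -1)]]
brute = max(g @ c + 0.3 for c in corners)
```

The maximizer picks x′_i = x_i + ε·sign(g_i) coordinate-wise, which is exactly why the ε‖g‖_1 term appears.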
As CAP utilizes these linear bounds for each neuron, it is slow and memory-inefficient. It can easily be shown that tighter relaxations on nonlinear activations yield a tighter bound on the worst-case margin score s*. To specify the linear relaxation variable φ ≡ {(a̲, ā, b̲, b̄)} used in the relaxation, we use the notation s̄(x, y, ε; θ, φ). CROWN-IBP and CAP generally yield a much tighter bound than IBP. These relaxation schemes are illustrated in Figure 6 in Appendix D.
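For a single unstable ReLU with pre-activation bounds l < 0 < u, the three relaxation schemes above can be written down as (slope, intercept) pairs and checked for soundness on a grid. This is an illustrative sketch under that assumption, not the authors' implementation:

```python
import numpy as np

def relu_relaxations(l, u):
    """Linear upper/lower bounds (slope, intercept) for ReLU on an unstable
    interval l < 0 < u, following the schemes described above (sketch)."""
    a_bar = u / (u - l)                    # u^+ / (u^+ - l^-) when l < 0 < u
    upper = (a_bar, -a_bar * l)            # line through (l, 0) and (u, u)
    return {
        "IBP":       {"upper": (0.0, u), "lower": (0.0, 0.0)},   # u^+ and l^+
        "CROWN-IBP": {"upper": upper, "lower": (float(u + l > 0), 0.0)},
        "CAP":       {"upper": upper, "lower": (a_bar, 0.0)},
    }

l, u = -1.0, 3.0
schemes = relu_relaxations(l, u)
z = np.linspace(l, u, 101)
relu = np.maximum(z, 0)
for name, sch in schemes.items():
    au, bu = sch["upper"]
    al, bl = sch["lower"]
    assert np.all(au * z + bu >= relu - 1e-9), name   # sound upper bound
    assert np.all(al * z + bl <= relu + 1e-9), name   # sound lower bound
```

Note how IBP's bounds are constants (zero slope), while CROWN-IBP and CAP share the tight chordal upper bound and differ only in the lower-bound slope.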

4. WHAT FACTORS INFLUENCE THE PERFORMANCE OF CERTIFIABLE TRAINING?

One would expect that a tighter upper bound on the worst-case loss in (1) is beneficial in certifiable training. However, several previous works have shown that this is not the case: IBP performs better than linear relaxation-based methods in many cases while utilizing a much looser bound. We investigate the loss landscape and the optimization behavior of IBP and other linear relaxation-based methods, and find that the non-smoothness of the relaxed gradient approximation of linear relaxations negatively affects their performance. Detailed settings of the following analyses are presented in Appendix A.

[Figure 1: (Left) Learning curves for the scheduled value of ε, with the loss variation along the gradient descent direction (the vertical line indicates when the ramp-up ends). (Right) Loss landscapes along the gradient descent direction at each training step in the later phase of the ramp-up period (epochs 50-130); thin lines show sample landscapes at each step and thick lines show the median values. Our method shows tight bounds like CROWN-IBP, while its landscape is as favorable as IBP's, achieving the best performance among these four methods (see Table 1).]

4.1. LOSS LANDSCAPE OF CERTIFIABLE TRAINING

We empirically show that models that have tighter bounds, CROWN-IBP (Zhang et al., 2019b) and CAP (Wong & Kolter, 2018), tend to have non-smooth loss landscapes, which hinder optimization during training. We examine the learning curves of IBP and these linear relaxation-based methods. For a simple analysis, we avoid considering the mixture of the two logits in (3) and use β = 1 to consider the CROWN-IBP logit only. Figure 1 (left) shows the learning curves on CIFAR-10 under ε_train = 8/255. We use ε-scheduling with a warm-up (regular training) for the first 10 epochs and a ramp-up during epochs 10-130, in which we linearly increase the perturbation radius ε from 0 to the target perturbation ε_train. Thus, the training loss may increase even during learning. In the early phase of the ramp-up period, in which the models are trained with small ε, CAP and CROWN-IBP have lower losses than IBP, as expected, because they use much tighter relaxation bounds than IBP. In particular, CAP has much tighter bounds than the others because CAP uses tighter relaxations for each subnetwork. This is consistent with the known results that CAP tends to outperform the others at small perturbations, such as ε_train = 2/255 on CIFAR-10 (see Table 1 for details). However, at the end of the training, when the perturbation reaches its maximum target value (ε_train), the opposite result is observed: CAP and CROWN-IBP perform worse than IBP. To understand this inconsistency, we measure the variation of the loss along the gradient direction as in Santurkar et al. (2018), which is represented as the shaded region in Figure 1 (left). We find that linear relaxation-based methods have large variations, while IBP maintains a small variation throughout the entire training phase. It is known that a smooth loss landscape with a small loss variation induces stable and fast optimization with well-behaved gradients (Santurkar et al., 2018).
Therefore, even though CAP and CROWN-IBP show robustness in the early phase of training, the non-smooth loss landscape in the ramp-up period might have hindered the optimization, yielding less robust models. As will be discussed in the following section, we find that the loss variation is highly related to the relaxed gradient approximation g used in linear relaxation. We further explore the loss landscape in the local region of the parameter space, from the current parameter θ^(now) toward the next parameter θ^(next) along the gradient, in Figure 1 (right). We plot the landscapes for the later phase of the ramp-up period (epochs 50-130), during which large perturbations are used. IBP has flatter landscapes compared to the others, whereas CROWN-IBP has landscapes with large curvature along the gradient; thus it tends to move towards a sharp local minimum, where it may remain stuck. Therefore, it may overfit to be robust to small perturbations, but is not robust to the target perturbation ε_train. Next, we establish a relationship between the optimization procedure and linear relaxation. Figure 2 (top) shows the directional deviation between two successive loss gradient steps in terms of cosine similarity during training. Simultaneously, Figure 2 (bottom) shows the ratio of unstable ReLUs, for which the pre-activation bounds l and u span zero. We observe that the cosine similarity value is low when the number of unstable ReLUs is large (for example, in the early stage of CAP and the middle stage of CROWN-IBP). In particular, in the middle of the ramp-up period, CROWN-IBP has a large number of unstable ReLUs and exhibits abrupt changes in gradient steps. It often has deviation angles larger than 90°, leading to parameter updates in the opposite direction of the previous one, bouncing in the basin of a local minimum. This is consistent with the results shown in Figure 1.
Moreover, since the gradient directions are not well-aligned, the model may not enjoy the advantages of momentum-based optimizers and may be sensitive to the learning rate. To summarize, a large number of unstable ReLUs, i.e., high nonlinearity, leads to an unfavorable landscape that can negatively affect the optimization process.
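The two diagnostics used in Figure 2, cosine similarity between successive gradient steps and the fraction of unstable ReLUs, can be sketched as follows. The numbers are toy values chosen for illustration, not real training data:

```python
import numpy as np

def cosine(v1, v2):
    """Cosine similarity between two successive (flattened) loss gradients."""
    return v1 @ v2 / (np.linalg.norm(v1) * np.linalg.norm(v2) + 1e-12)

def unstable_ratio(l, u):
    """Fraction of ReLUs whose pre-activation bounds span zero (l < 0 < u)."""
    return np.mean((l < 0) & (u > 0))

# Toy pre-activation bounds for four neurons and two successive gradients:
l = np.array([-0.5, 0.2, -1.0, -0.10])
u = np.array([ 0.5, 1.0,  2.0, -0.05])
g_prev = np.array([1.0, 0.0])
g_next = np.array([-1.0, 0.1])

r = unstable_ratio(l, u)    # 2 of 4 neurons span zero -> 0.5
c = cosine(g_prev, g_next)  # negative: the update reverses direction (> 90 degrees)
```

A negative cosine between consecutive updates is the "bouncing" behavior described above; tracking these two quantities per epoch reproduces the style of Figure 2.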

4.2. SMOOTHNESS OF RELAXED GRADIENT APPROXIMATION

In this section, we investigate the loss landscape further from a theoretical perspective to answer the question: "What makes some landscapes more favorable than others?" We find that the relaxed gradient approximation of a linear relaxation affects the smoothness of the landscape. First, we need some mild smoothness assumptions that are natural when the network parameters θ_1 and θ_2 are close to each other, especially when they are two consecutive parameters from an SGD update.

Assumption 1. Given a linear relaxation method, we make the following assumptions on the bias b(x; θ) in the linear relaxation and the probability function p(x; θ):
(1) ‖∇_θ b(x; θ_1) − ∇_θ b(x; θ_2)‖ ≤ L^b_θθ ‖θ_1 − θ_2‖ for all θ_1, θ_2 and x.
(2) ‖p(x; θ_1) − p(x; θ_2)‖ ≤ L^p_θ ‖θ_1 − θ_2‖ for all θ_1, θ_2 and x.

With the above assumptions, we can provide an upper bound on the loss gradient difference for linear relaxation-based methods to measure the non-smoothness of the loss landscape as follows:

Theorem 1. Given input x ∈ X and perturbation radius ε, let M be max_{x′∈B(x,ε)} ‖x′‖. For a linear relaxation-based method with the upper bound s̄_m(x; θ) = max_{x′∈B(x,ε)} g^(m)(x; θ)^T x′ + b^(m)(x; θ), if b^(m) satisfies Assumption 1 (1) for each m and p_s̄ satisfies Assumption 1 (2), then

‖∇_θ L(s̄(x; θ_1)) − ∇_θ L(s̄(x; θ_2))‖ ≤ max_m [ 2ε‖∇_θ g^(m)(x; θ_{1,2})‖ + M‖∇_θ g^(m)(x; θ_1) − ∇_θ g^(m)(x; θ_2)‖ + L^(m)‖θ_1 − θ_2‖ ]    (4)

for any θ_1, θ_2, where L^(m) = L^{b^(m)}_θθ + L^{p_s̄}_θ ‖∇_θ s̄(x; θ_{1,2})‖ and θ_{1,2} can be any of θ_1 and θ_2.

According to Theorem 1, the relaxed gradient approximations g^(m) in the linear relaxation play a major role in shaping the loss landscape. The smoother the relaxed gradient approximations are, the smoother the loss landscape is.
Especially for IBP, which uses the zero-slope relaxed gradient approximation g^(m) ≡ 0 for all m, the loss gradient difference is upper bounded by only the last term, max_m L^(m)‖θ_1 − θ_2‖, which is relatively small for a single gradient step. On the other hand, for other linear relaxation-based methods using non-zero relaxed gradient approximations g^(m) ≠ 0, the gradient updates used in the training are more unstable than those of IBP. This is consistent with the empirical results shown in Figure 1, where there are significant differences between the loss variations of IBP and the others.
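A one-dimensional toy example illustrates why a non-zero relaxed gradient approximation can break smoothness: the ε‖g‖_* term in (2) contributes a sign(g) factor to the parameter gradient, which jumps discontinuously whenever a component of g crosses zero. This is our simplification for intuition, not the paper's derivation:

```python
import numpy as np

# Illustrative 1-D sketch (not the paper's network): let g(theta) = theta, so
# s_bar(theta) = g(theta)*x + eps*|g(theta)|, and
# d s_bar / d theta = x + eps*sign(theta), which jumps by 2*eps at theta = 0.
x, eps = 0.5, 0.3

def grad_sbar(theta):
    return x + eps * np.sign(theta)

t1, t2 = 1e-6, -1e-6                        # arbitrarily close parameters...
jump = abs(grad_sbar(t1) - grad_sbar(t2))   # ...but the gradient gap is 2*eps
# With IBP's zero-slope relaxation (g = 0), there is no eps*|g| term, so the
# corresponding gradient gap vanishes as theta1 -> theta2.
```

The gap 2ε persists no matter how small ‖θ_1 − θ_2‖ is, mirroring the first term of (4), which does not scale with ‖θ_1 − θ_2‖.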

5. PROPOSED METHOD

Our analyses so far suggest that tightness of the upper bound on the worst-case loss and smoothness of the loss landscape are important for building a certifiably robust model. Therefore, we aim to design a new certifiable training method that improves both factors (favorable landscape and tighter bound).

More favorable landscape via fewer a̲ = 1. We observe that CROWN-IBP (β = 1) tends to have more unstable ReLUs and a less smooth landscape than the others. What in the objective of CROWN-IBP leads to these results? To answer this question, we investigate variants of CROWN-IBP with different settings of the lower-bound slope a̲ for unstable ReLUs. For each setting, we sample a̲ ∈ {0, 1} with different (p, q), where P(a̲ = 1 | |l| > |u|) = p and P(a̲ = 1 | |l| ≤ |u|) = q, for each neuron with pre-activation bounds l and u. We use a̲ = 1[u⁺ + l⁻ > 0] for the other, stable ReLUs. For the other elements of the linear relaxation variable φ = {(a̲, ā, b̲, b̄)}, we fix ā = u⁺/(u⁺ − l⁻), b̄ = −u⁺l⁻/(u⁺ − l⁻), and b̲ = 0 for each activation node, because these are the optimal choices for tightening the bound (see Appendix C.2 for details). Figure 3 shows that the model tends to have more unstable ReLUs as the number of slopes satisfying a̲ = 1 increases. This observation implies that a smaller portion of slopes with a̲ = 1 is required to obtain a more favorable landscape. However, reducing the portion of a̲ = 1 is not enough to achieve robustness unless tightness is guaranteed: by manually adjusting a̲, the variants of CROWN-IBP achieve favorable landscapes, but they show looser upper bounds, which lead to worse performance. Further investigation of the variants of CROWN-IBP is presented in Appendix E. Therefore, it is necessary to search for values of a̲ that can achieve both tightness and a favorable landscape.

Tighter bound via optimization. Now, we aim to reduce the number of slopes satisfying a̲ = 1 and to tighten the upper bound in (1) simultaneously.
We can achieve both by minimizing the upper bound over the linear relaxation variable φ as follows:

L(s̄(x, y, ε; θ), y) ≥ min_φ L(s̄(x, y, ε; θ, φ), y) ≥ max_{x′∈B(x,ε)} L(f(x′; θ), y).    (5)

This can be equivalently understood as solving the dual optimization in CAP rather than using a dual feasible solution. However, solving the dual optimization is computationally prohibitive for the linear relaxation of CAP. To resolve this problem, we use the same linear relaxation as IBP for the subnetworks of s except for s itself, similar to CROWN-IBP. Further, we efficiently compute a surrogate â of the minimizer a̲* = arg min_{a̲} L(s̄(x, y, ε; θ, φ), y) using a one-step projected gradient update of the relaxation variable a̲. Specifically, we have

â = Π_{[0,1]^n}[a_0 − η sign(∇_{a̲} L(s̄(x, y, ε; θ, φ), y))]    (6)

with an initial point a_0 ∼ U[0, 1]^n and η ≥ 1, yielding the final objective L(s̄(x, y, ε; θ, φ̂), y), where φ̂ = {(â, ā, b̲, b̄)}.
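The one-step projected update in (6) can be sketched as follows. In practice the gradient with respect to a̲ would come from automatic differentiation; here we pass a hypothetical gradient vector for illustration:

```python
import numpy as np

def one_step_slope_update(a0, grad, eta=1.0):
    """Surrogate minimizer a_hat of (6): one signed-gradient step on the
    lower-bound slopes, projected back onto [0, 1]^n (sketch; `grad` stands
    in for the autodiff gradient of the loss w.r.t. the slopes)."""
    return np.clip(a0 - eta * np.sign(grad), 0.0, 1.0)

rng = np.random.default_rng(0)
a0 = rng.uniform(0.0, 1.0, size=5)             # a_0 ~ U[0, 1]^n
grad = np.array([0.2, -0.7, 0.0, 1.3, -0.1])   # hypothetical d loss / d slope
a_hat = one_step_slope_update(a0, grad, eta=1.0)
# With eta >= 1, each coordinate saturates: grad > 0 -> 0, grad < 0 -> 1.
```

Note that with η ≥ 1 the update effectively rounds each slope to {0, 1} according to the gradient sign, matching the goal of choosing binary lower-bound slopes that both tighten the bound and reduce the a̲ = 1 count.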

6. EXPERIMENTS

In this section, we demonstrate that the proposed method satisfies two key criteria required for building certifiably robust models: 1) tightness of the upper bound on the worst-case loss, and 2) smoothness of the loss landscape. Subsequently, we evaluate the performance of the method by comparing it with other certifiable training methods. Details on the experimental settings are in Appendix A.

Tightness. To validate that the proposed method (OURS) has tighter bounds than other relaxations, we analyze various linear relaxation methods in Figure 4. We define a tightness measure as the sum of the worst-case margins over the classes m, Σ_{m=0}^{c−1} s̄_m(x, y, ε; θ), obtained from (2). Then, we evaluate multiple methods on a single fixed model pre-trained with the proposed training method. The compared methods are, from left to right, OURS, CROWN-IBP (Zhang et al., 2019b), CAP-IBP, and RANDOM. All methods use the same IBP relaxation for subnetworks, but use different linear relaxation variables a̲ for the whole network s. CROWN-IBP, CAP-IBP, and RANDOM use a̲ = 1[u⁺ + l⁻ > 0], a̲ = u⁺/(u⁺ − l⁻), and a̲ ∼ U[0, 1]^n, respectively. We fix the other variables ā, b̲, and b̄ as in Section 5. In both figures, our method shows the lowest value on average, which indicates that a single gradient step in (6) is sufficient to obtain tighter bounds compared to other relaxation methods. See Appendix P for the equivalent tightness violin plots of other models.

Smoothness. Figure 1 shows that the proposed method has small loss variations along the gradient, as with IBP, whereas CROWN-IBP (β = 1) has a wide range of loss values. This is because CROWN-IBP (β = 1) has more unstable ReLUs than our method, as shown in Figure 2. As mentioned above, the number of slopes with a̲ = 1 is closely related to the number of unstable ReLUs, and Figure 5 shows that our method successfully reduces the number of slopes with a̲ = 1.
Further, we conduct analysis on smoothness of the loss landscape with the loss gradient change (the left term in (4)) in Appendix H.

Robustness

We evaluate the performance of the proposed method and compare it to that of state-of-the-art certifiable training methods: IBP (Gowal et al., 2018), CROWN-IBP (β = 1) (Zhang et al., 2019b), and CAP (Wong et al., 2018), as in Section 4.1. On MNIST, we follow Zhang et al. (2019b) and use ε_train ≥ ε_test, whereas for CAP we use the same ε_train = ε_test, which yields better results. We use three evaluation metrics: standard (clean) error, 100-step PGD error, and verified error. For the verified error, we evaluate with the bound s̄ of each method. The results are consistent with the analysis shown in Figure 1: CAP and CROWN-IBP (β = 1) have lower loss than IBP at small ε, but their loss landscapes are less smooth than IBP's, leading to worse performance at large ε. Moreover, CAP cannot be trained on MNIST when ε_train = 0.4. As this case is also not reported in Wong et al. (2018), it seems that CAP is hard to make robust to ε_train ≥ 0.4. On the other hand, the proposed method shows consistent performance over a wide range of ε_test values, achieving the best performance in most cases, since it has tighter bounds and a favorable landscape and does not overfit to a local minimum during the ε-scheduling. We also compare our method with other prior work (Xiao et al., 2018; Mirman et al., 2018; Balunovic & Vechev, 2019) in Appendix K, and conduct additional experiments on the hyperparameters in Appendix L, M, and N. Unlike standard training, certifiable training requires ε-scheduling. It is implicitly assumed that a set of weights that makes the network robust to a small ε is a good initial point for learning robustness to a large ε_train. However, linear relaxation-based methods with tighter bounds start with a lower loss at small ε, but, with an unfavorable loss landscape, they cannot explore a sufficiently large area of the parameter space. Hence, they overfit to be robust to a small perturbation and do not generalize to a large perturbation.
CAP and CROWN-IBP (β = 1) are typical examples that demonstrate this overfitting. It may over-regularize the weight norm and decrease the model capacity (Wong et al., 2018; Zhang et al., 2019b). The tightness of the proposed method improves the performance for a small ε, while its smoothness helps the optimization process, which also leads to better performance for a large ε. To conclude, the proposed method achieves decent performance under a wide range of perturbations, as shown in Table 1.

7. CONCLUSION

In this work, we have investigated the loss landscape of certifiable training and found that the smoothness of the loss landscape is an important factor in building certifiably robust models. To this end, we proposed a method that satisfies two criteria: tightness of the upper bound on the worst-case loss and smoothness of the loss landscape. We then empirically demonstrated that the proposed method achieves robustness comparable to state-of-the-art methods under a wide range of perturbations. We believe that with an improved understanding of the loss landscape, better certifiably robust models can be built.

Datasets and Architectures

In the experiments, we use three datasets (MNIST, CIFAR-10, and SVHN) and the model architectures (Small, Medium, and Large) in Gowal et al. (2018) and their variants (Small* and Large*) as follows:

• Small: Conv(•,16,4,2) - Conv(16,32,4,1) - Flatten - FC(•,100) - FC(100,c)
• Small*: Conv(•,16,4,2) - Conv(16,32,4,2) - Flatten - FC(•,100) - FC(100,c)

We train with the objective

κL(f(x; θ), y) + (1 − κ)L((1 − β)s̄_IBP(x, y, ε; θ) + βs̄_MODEL(x, y, ε; θ), y),

where κ is the mixing weight between the natural loss and the robust loss, and β is the mixing weight between the two bounds obtained with IBP and the given relaxation method (e.g., CROWN-IBP).

A.1 SETTINGS IN SECTION 4.1

Figure 1. We conduct the experiment in Figure 1 on the CIFAR-10 dataset with the Medium architecture over all four methods. We train the model with ε_train = 8/255 for 200 epochs using ε-scheduling with 10 warm-up epochs and 120 ramp-up epochs. We use the Adam optimizer with learning rate 0.001 and reduce the learning rate by 50% every 10 epochs after ε-scheduling ends. To demonstrate the instability of each training method, we describe the variation of the loss along the gradient direction as in Santurkar et al. (2018): we take steps of different lengths in the direction of the gradient and measure the loss values obtained at each step. For the sake of consistency, we fix a Cauchy random matrix when evaluating CAP to obtain deterministic loss landscapes, not introducing randomness. The loss variation is computed with L(s̄(θ(t))), where L(s̄(θ)) ≡ L(s̄(x, y, ε; θ), y) and θ(t) ≡ θ_0 − tη∇_θ L(s̄(θ_0)) for t ∈ [0, 5], where θ_0 (= θ(0)) is the current model parameters and η is the learning rate. For the step length t, we sample ten points from the range [0, 5] on a log scale. In Figure 1 (right), θ^(now) = θ(0) and θ^(next) = θ(1).
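The landscape-probing procedure described above can be sketched with a toy quadratic loss standing in for L(s̄(θ)). The function names are ours, and the quadratic is only a stand-in to exercise the probe:

```python
import numpy as np

def loss_variation(theta0, grad, loss_fn, lr, num=10):
    """Probe L(theta(t)) with theta(t) = theta0 - t*lr*grad for t sampled on a
    log scale in (0, 5], as in the landscape plots (sketch; `loss_fn` would be
    the certified loss in practice)."""
    ts = np.logspace(-2, np.log10(5.0), num)   # ten log-scale samples in (0, 5]
    return ts, np.array([loss_fn(theta0 - t * lr * grad) for t in ts])

# Toy quadratic loss to exercise the probe:
loss_fn = lambda th: float(th @ th)
theta0 = np.array([1.0, -2.0])
grad = 2 * theta0                               # exact gradient of th @ th
ts, losses = loss_variation(theta0, grad, loss_fn, lr=0.1)
# For this toy quadratic, the probed loss decreases monotonically over t in (0, 5].
```

A smooth landscape (as reported for IBP) would show a narrow spread of these probed losses across steps, while a non-smooth one (CAP, CROWN-IBP) would show a wide band.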

Figure 2 (top)

In Figure 2 , with the same model used in Figure 1 , we plot cosine similarity between two successive loss gradient steps during training as follows: cos(∇ θ L(s(θ(0))), ∇ θ L(s(θ(1)))), where cos(v 1 , v 2 ) is the cosine value of the angle between two vectors v 1 and v 2 .

A.2 SETTINGS IN TABLE 1

For MNIST, we use the same hyper-parameters as in Appendix C of Zhang et al. (2019b). We train for 200 epochs (10 warm-up epochs and 50 ramp-up epochs) on the Large model with a batch size of 100. We decay the learning rate, 0.0005, by 10× at epochs 130 and 190. As mentioned in Zhang et al. (2019b), we also found the same issue when training with a small ε_train (see Appendix N for details). To alleviate the issue, we use ε_train = min(0.4, ε_test + 0.1) for each ε_test, as in Table 2 of Zhang et al. (2019b). For CIFAR-10, we train for 400 epochs (20 warm-up epochs and 240 ramp-up epochs) on the Medium model with a batch size of 128. We decay the learning rate, 0.003, by 2× every 10 epochs after the ramp-up period. For SVHN, we train for 200 epochs (10 warm-up epochs and 120 ramp-up epochs) on the Large model with a batch size of 128 (OURS with a batch size of 80 to avoid running out of memory). We decay the learning rate, 0.0003, by 2× every 10 epochs after the ramp-up period. Only for SVHN, we normalize each channel with mean (0.438, 0.444, 0.473) and standard deviation (0.198, 0.201, 0.197). In Table 1, we use κ-scheduling from 1 to 0; for the corresponding results with κ-scheduling from 0 to 0, we refer the reader to Table 5. We modify the source code for CAP to match our settings, for example by introducing the warm-up period and linear ε-scheduling. We avoid using the results reported in the literature and aim for a fair comparison under the same settings, with only minor differences: because CAP does not support channel-wise normalization, we could not use input normalization, and due to the memory limit of CAP, we use a smaller batch size of 32 and try other, smaller architectures. We found that CAP often achieves better results with smaller architectures (similar to the results in Table 3 of Wong et al. (2018)). Thus, we report its performance with Large*, Medium, and Small* on MNIST, CIFAR-10, and SVHN, respectively.
Throughout the experiments, CAP uses the fixed κ = 0.

B INTERVAL BOUND PROPAGATION (IBP)

IBP (Gowal et al., 2018) starts from the interval bound I^(0) ≡ {z : l^(0) ≤ z ≤ u^(0)} = B(x, ε) in the input space, with upper bound u^(0) = x + ε1 and lower bound l^(0) = x − ε1, where 1 is a column vector of ones. The interval bound I^(k−1) ≡ {z : l^(k−1) ≤ z ≤ u^(k−1)} is then propagated iteratively through the network. For an element-wise monotonically increasing nonlinear activation h^(k) with pre-activation bounds u^(k−1) and l^(k−1),

u^(k) = h^(k)(u^(k−1)) and l^(k) = h^(k)(l^(k−1)), (9)

and for a linear function h^(k) (k = 1, ..., K),

u^(k) = W^(k) (u^(k−1) + l^(k−1))/2 + |W^(k)| (u^(k−1) − l^(k−1))/2 + b^(k), (10)
l^(k) = W^(k) (u^(k−1) + l^(k−1))/2 − |W^(k)| (u^(k−1) − l^(k−1))/2 + b^(k). (11)

To make the paper self-contained, we provide the details of the linear relaxation given in the supplementary material of CROWN (Zhang et al., 2018), and we refer readers to that supplementary material for more details. Given a network h^[k], we want to upper bound the activation h^[k]_i. We have h^[k]_i(x′) = W^(k)_{i,:} h^(k−1)(h^[k−2](x′)) + b^(k)_i = W^(k)_{i,:} h^(k−1)(z^(k−2)) + b^(k)_i, where z^(k−2) = h^[k−2](x′).
With the linear bounds h̄^(k−1) and h̲^(k−1) on the activation function h^(k−1), we have

h^[k]_i(x′) = W^(k)_{i,:} h^(k−1)(z^(k−2)) + b^(k)_i
≤ Σ_{j: W^(k)_{i,j} < 0} W^(k)_{i,j} h̲_j^(k−1)(z^(k−2)) + Σ_{j: W^(k)_{i,j} ≥ 0} W^(k)_{i,j} h̄_j^(k−1)(z^(k−2)) + b^(k)_i
= Σ_{j: W^(k)_{i,j} < 0} W^(k)_{i,j} (a̲_j^(k−1) z_j^(k−2) + b̲_j^(k−1)) + Σ_{j: W^(k)_{i,j} ≥ 0} W^(k)_{i,j} (ā_j^(k−1) z_j^(k−2) + b̄_j^(k−1)) + b^(k)_i
= W̃^(k)_{i,:} z^(k−2) + b̃^(k)_i
= W̃^(k)_{i,:} h^[k−2](x′) + b̃^(k)_i
= W̃^(k)_{i,:} (W^(k−2) h^[k−3](x′) + b^(k−2)) + b̃^(k)_i
= Ŵ^(k−2)_{i,:} h^(k−3)(z^(k−3)) + b̂^(k−2)_i,

where W̃^(k)_{i,:} = W^(k)_{i,:} D^(k−1) with the diagonal matrix D^(k−1)_{j,j} = a̲_j^(k−1) for j satisfying W^(k)_{i,j} < 0 and D^(k−1)_{j,j} = ā_j^(k−1) for j satisfying W^(k)_{i,j} ≥ 0; b̃^(k)_i = Σ_{j: W^(k)_{i,j} < 0} W^(k)_{i,j} b̲_j^(k−1) + Σ_{j: W^(k)_{i,j} ≥ 0} W^(k)_{i,j} b̄_j^(k−1) + b^(k)_i; Ŵ^(k−2)_{i,:} = W̃^(k)_{i,:} W^(k−2); and b̂^(k−2)_i = W̃^(k)_{i,:} b^(k−2) + b̃^(k)_i. Applying the same procedure iteratively, we can obtain g and b in (2) for the linear relaxation of h^[k]_i.
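The sign-dependent merge that produces W̃^(k) and b̃^(k) above can be sketched for a single output row as follows (a minimal NumPy version; the weights, bias, and relaxation coefficients are hypothetical):

```python
import numpy as np

def backward_substitute(W_row, a_up, b_up, a_lo, b_lo, bias):
    """One backward-substitution step for a single output row W_row:
    entries with W_row[j] >= 0 take the upper relaxation (a_up, b_up) of the
    previous activation, entries with W_row[j] < 0 take the lower one,
    giving tilde-W = W_row * a_sel and tilde-b = W_row . b_sel + bias."""
    pos = W_row >= 0.0
    a_sel = np.where(pos, a_up, a_lo)   # entries of the diagonal matrix D
    b_sel = np.where(pos, b_up, b_lo)
    W_tilde = W_row * a_sel             # equivalent to W_row @ diag(a_sel)
    b_tilde = W_row @ b_sel + bias
    return W_tilde, b_tilde

# Hypothetical slopes/intercepts of the linear upper and lower bounds
# on a two-neuron activation layer.
W_row, bias = np.array([2.0, -1.0]), 0.5
a_up, b_up = np.array([0.5, 0.8]), np.array([1.0, 0.4])
a_lo, b_lo = np.array([0.0, 1.0]), np.array([0.0, 0.0])
W_t, b_t = backward_substitute(W_row, a_up, b_up, a_lo, b_lo, bias)
```

Iterating this step backward through the network composes the per-layer relaxations into the final linear bound g and b in (2).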

C.2 DUAL OPTIMIZATION VIEW

We first modify some notation from the main paper to follow Wong & Kolter (2018). We use the hat notation ẑ^(k+1) = W^(k+1) z^(k) + b^(k+1) and z^(k) = h^(k)(ẑ^(k)), where h^(k) is the k-th nonlinear activation function. With cᵀ = C_{m,:}, we can build the primal problem

max c^T ẑ^(K) (12)

subject to x − ε1 ≤ z^(0), z^(0) ≤ x + ε1, ẑ^(k+1) = W^(k+1) z^(k) + b^(k+1) (k = 0, ..., K−1), and z^(k) = h^(k)(ẑ^(k)) (k = 1, ..., K−1). Note that our c is the negation of that of Wong & Kolter (2018). We can now derive the dual of the primal (12) as

min_{ξ⁺, ξ⁻ ≥ 0, ν_k, ν̂_k} sup_{z^(k), ẑ^(k)} c^T ẑ^(K) + ξ⁻ᵀ(x − ε1 − z^(0)) + ξ⁺ᵀ(z^(0) − x − ε1) + Σ_{k=0}^{K−1} ν_{k+1}ᵀ (ẑ^(k+1) − (W^(k+1) z^(k) + b^(k+1))) + Σ_{k=1}^{K−1} ν̂_kᵀ (z^(k) − h^(k)(ẑ^(k)))
= (c + ν_K)ᵀ ẑ^(K) + (ξ⁺ − ξ⁻ − W^(1)ᵀ ν_1)ᵀ z^(0) + Σ_{k=1}^{K−1} (ν̂_k − W^(k+1)ᵀ ν_{k+1})ᵀ z^(k) + Σ_{k=1}^{K−1} (ν̂_kᵀ h^(k)(ẑ^(k)) − ν_kᵀ ẑ^(k)) − Σ_{k=0}^{K−1} ν_{k+1}ᵀ b^(k+1) − ξᵀ x − ε||ξ||_1. (13)

For the supremum to be finite, we need c + ν_K = 0, ξ⁺ − ξ⁻ − W^(1)ᵀ ν_1 = 0, and ν̂_k − W^(k+1)ᵀ ν_{k+1} = 0 (k = 1, ..., K−1); equivalently, ν_K = −c, ν̂_k = W^(k+1)ᵀ ν_{k+1} (k = K−1, ..., 1), and ξ ≡ ξ⁺ − ξ⁻ = W^(1)ᵀ ν_1. What remains is the relationship between ν̂_k and ν_k, i.e., ν_k = g(ν̂_k). With the further relaxation ν_k = α_k ν̂_k, we have the relaxed problem

min_{α_k} sup_{z^(k), ẑ^(k)} Σ_{k=1}^{K−1} (ν̂_kᵀ h^(k)(ẑ^(k)) − ν_kᵀ ẑ^(k)) − Σ_{k=0}^{K−1} ν_{k+1}ᵀ b^(k+1) − ξᵀ x − ε||ξ||_1 (14)

such that ν_K = −c, ν̂_k = W^(k+1)ᵀ ν_{k+1} (k = K−1, ..., 1), ν_k = α_k ν̂_k (k = K−1, ..., 1), and ξ = W^(1)ᵀ ν_1.

We decompose the first term in (14) and drop the subscript k, giving ν̂ᵀ h(ẑ) − (α ν̂)ᵀ ẑ; element-wise, this reads ν̂ h(ẑ) − α ν̂ ẑ = ν̂ (h(ẑ) − α ẑ). If the pre-activation bounds of h are both positive (active ReLU), then α must be 1, since otherwise the inner supremum is ∞. Similarly, if the pre-activation bounds of h are both negative (dead ReLU), then α must be 0.
In the case of an unstable ReLU (l ≤ 0 ≤ u), if ν̂ < 0, we need to solve max_α inf_ẑ (h(ẑ) − αẑ). The inner infimum is 0 for 0 ≤ α ≤ 1 and −∞ otherwise. On the other hand, if ν̂ ≥ 0, we need to solve min_α sup_ẑ (h(ẑ) − αẑ). The inner supremum is max{u − αu, −αl}, and thus the optimal dual variable is α* = u/(u − l), which yields the optimal value (multiplied by ν̂) of ν̂(u − u/(u − l) · u) = −ul/(u − l) · ν̂. This is equivalent to using the linear relaxation h̄(z) = āz + b̄ = u/(u − l) (z − l). We can write this as h̄(z) = u⁺/(u⁺ − l⁻) (z − l⁻) to also cover the active/dead ReLU cases. For the lower linear bound h̲(z) = a̲z + b̲ of an unstable ReLU, we can use any 0 ≤ a̲ ≤ 1 with b̲ = 0, according to the dual relaxation with α. While CAP and CROWN-IBP use a fixed dual feasible solution such as α = u⁺/(u⁺ − l⁻) or α = 1[u⁺ + l⁻ > 0], our proposed method optimizes over the dual variable α, or equivalently over 0 ≤ a̲ ≤ 1, to further tighten the upper bound on the loss.

Under review as a conference paper at 2021

E LEARNING CURVES FOR VARIANTS OF CROWN-IBP

It seems that certifiable training with a looser bound tends to favor stable ReLUs. For example, IBP starts with a small number of unstable ReLUs, while CAP starts with a large number, as shown in Figure 2 (bottom). However, a tighter bound does not directly lead to many unstable ReLUs. We find that 0.5/1 and 1/1 have looser bounds than CROWN-IBP (as shown in Figure 7) but more unstable ReLUs (as shown in Figure 3), where p/q denotes the variant that samples a ∈ {0, 1} with P(a = 1 | |l| > |u|) = p and P(a = 1 | |l| ≤ |u|) = q for unstable ReLUs. On the other hand, 0/0, 0/0.25, and 0/0.5 have looser bounds than CROWN-IBP and fewer unstable ReLUs, which leads to small loss variations, as shown in Figure 7. This observation implies that having fewer slopes with a = 1 is more important for a smooth landscape.
Table 3: Performance (in terms of errors) of the variants of CROWN-IBP (β = 1). Note that 0/0.25, 0/0.5, and CAP-IBP start with looser bounds but have smoother landscapes, which leads to better performance than CROWN-IBP (β = 1) (highlighted with underline).

F PROOF

To prove Theorem 1, we first prove the following proposition. We note that θ and g are vectorized and that the matrix norm of the Jacobian is naturally defined; for example, ||∇_θ g|| is induced by the vector norms defined on X and Θ.

Proposition 1. Given input x ∈ X and perturbation radius ε, let M = max{||x′|| : x′ ∈ B(x, ε)}. Then, for the upper bound s(x; θ) = max_{x′ ∈ B(x, ε)} g(x; θ)ᵀ x′ + b(x; θ) with b satisfying Assumption 1 (1), we have

||∇_θ s(x; θ_1) − ∇_θ s(x; θ_2)|| ≤ 2ε ||∇_θ g(x; θ_{1,2})|| + M ||∇_θ g(x; θ_1) − ∇_θ g(x; θ_2)|| + L^b_{θθ} ||θ_1 − θ_2|| (15)

for any θ_1, θ_2, where θ_{1,2} can be either of θ_1 and θ_2.

Proof. Let f(x′; θ) = g(x; θ)ᵀ x′ + b(x; θ) and let x*_i = argmax_{x′ ∈ B(x, ε)} f(x′; θ_i) for each θ_i ∈ {θ_1, θ_2}. Then

||∇_θ s(x; θ_1) − ∇_θ s(x; θ_2)|| = ||∇_θ f(x*_1; θ_1) − ∇_θ f(x*_2; θ_2)||
= ||∇_θ f(x*_1; θ_1) − ∇_θ f(x*_2; θ_1) + ∇_θ f(x*_2; θ_1) − ∇_θ f(x*_2; θ_2)||
≤ ||∇_θ f(x*_1; θ_1) − ∇_θ f(x*_2; θ_1)|| + ||∇_θ f(x*_2; θ_1) − ∇_θ f(x*_2; θ_2)||. (16)

The first term on the RHS can be upper bounded as

||∇_θ f(x*_1; θ_1) − ∇_θ f(x*_2; θ_1)|| = ||∇_θ (g̃_1ᵀ x̃*_1 − g̃_1ᵀ x̃*_2)|| = ||∇_θ g_1ᵀ (x*_1 − x*_2)|| ≤ 2ε ||∇_θ g_1||,

where g_i = g(x; θ_i), b_i = b(x; θ_i), g̃_iᵀ = [g_iᵀ; b_i], and x̃ᵀ = [xᵀ; 1]. The second term on the RHS can be upper bounded as

||∇_θ f(x*_2; θ_1) − ∇_θ f(x*_2; θ_2)|| = ||∇_θ (g̃_1ᵀ x̃*_2 − g̃_2ᵀ x̃*_2)|| = ||∇_θ (g̃_1 − g̃_2)ᵀ x̃*_2||
≤ ||∇_θ (g_1 − g_2)|| ||x*_2|| + ||∇_θ (b_1 − b_2)|| ≤ M ||∇_θ (g_1 − g_2)|| + L^b_{θθ} ||θ_1 − θ_2||.

Therefore, we obtain

||∇_θ s(x; θ_1) − ∇_θ s(x; θ_2)|| ≤ 2ε ||∇_θ g_1|| + M ||∇_θ (g_1 − g_2)|| + L^b_{θθ} ||θ_1 − θ_2|| = 2ε ||∇_θ g(x; θ_1)|| + M ||∇_θ g(x; θ_1) − ∇_θ g(x; θ_2)|| + L^b_{θθ} ||θ_1 − θ_2||.

Note that θ_1 in the first term was chosen arbitrarily in (16), which leads to the final inequality (15). ∎

Theorem 1.
Given input x ∈ X and perturbation radius ε, let M = max_{x′ ∈ B(x, ε)} ||x′||. For a linear relaxation-based method with the upper bound s_m(x; θ) = max_{x′ ∈ B(x, ε)} g^(m)(x; θ)ᵀ x′ + b^(m)(x; θ), if b^(m) satisfies Assumption 1 (1) for each m and p_s satisfies Assumption 1 (2), then

||∇_θ L(s(x; θ_1)) − ∇_θ L(s(x; θ_2))|| ≤ max_m { 2ε ||∇_θ g^(m)(x; θ_{1,2})|| + M ||∇_θ g^(m)(x; θ_1) − ∇_θ g^(m)(x; θ_2)|| + L^(m) ||θ_1 − θ_2|| } (4)

for any θ_1, θ_2, where L^(m) = L^{b^(m)}_{θθ} + L^{p_s}_θ ||∇_θ s(x; θ_{1,2})|| and θ_{1,2} can be either of θ_1 and θ_2.

Proof. We abbreviate p_s as p. Then

||∇_θ L(s(x; θ_1)) − ∇_θ L(s(x; θ_2))||
= ||∇_θ s(x; θ_1) ∇_s L(s(x; θ_1)) − ∇_θ s(x; θ_2) ∇_s L(s(x; θ_2))||
= ||Σ_m ∇_θ s_m(x; θ_1)(p_m(x; θ_1) − δ_{y,m}) − Σ_m ∇_θ s_m(x; θ_2)(p_m(x; θ_2) − δ_{y,m})||
= ||∇_θ s(x; θ_1)(p(x; θ_1) − e^(y)) − ∇_θ s(x; θ_2)(p(x; θ_2) − e^(y))||
= ||∇_θ s(x; θ_1) p(x; θ_1) − ∇_θ s(x; θ_2) p(x; θ_2)||
= ||∇_θ s(x; θ_1)(p(x; θ_1) − p(x; θ_2)) + (∇_θ s(x; θ_1) − ∇_θ s(x; θ_2)) p(x; θ_2)||
≤ ||∇_θ s(x; θ_1)|| ||p(x; θ_1) − p(x; θ_2)|| + max_m ||∇_θ s_m(x; θ_1) − ∇_θ s_m(x; θ_2)||
≤ ||∇_θ s(x; θ_1)|| L^p_θ ||θ_1 − θ_2|| + max_m ||∇_θ s_m(x; θ_1) − ∇_θ s_m(x; θ_2)||
≤ max_m { 2ε ||∇_θ g^(m)(x; θ_{1,2})|| + M ||∇_θ g^(m)(x; θ_1) − ∇_θ g^(m)(x; θ_2)|| + L^(m) ||θ_1 − θ_2|| },

where the first inequality uses that p(x; θ_2) lies in the probability simplex and the final inequality applies Proposition 1. ∎

G LEARNING CURVE FOR ε_train

H SMOOTHNESS

We empirically measure the non-smoothness of the loss landscape by the difference between two consecutive loss gradients at θ_1 = θ(0) and θ_2 = θ(1) in (8), called the gradient difference (≡ ||∇_θ L(x; θ(0)) − ∇_θ L(x; θ(1))||). It is highly related to the ratio of the number of unstable ReLUs (the nonlinearity of the classifier), as shown in Figure 10.

Figure 13: Mode connectivity between CROWN-IBP and OURS, where w_0 and w_1 are well-trained models using the CROWN-IBP bound and the OURS bound, respectively. θ_c is trained using CROWN-IBP (13a) and OURS (13b), respectively.

J RELU

In this section, we investigate how the pre-activation bounds u and l of the activation layers change during training. An activation node is said to be "active" when the pre-activation bounds are both positive (0 < l ≤ u), "unstable" when they span zero (l ≤ 0 ≤ u), and "dead" when they are both negative (l ≤ u < 0). Figure 14 shows the ratios of the number of active and dead ReLUs during the ramp-up period. Notably, CROWN-IBP has more active ReLUs during training than the other three methods, and simultaneously the lowest ratio of dead ReLUs. Figure 15 shows the numbers of active, unstable, and dead ReLUs during the ramp-up period. We find that in CROWN-IBP, the number of unstable and active ReLUs increases as the number of dead ReLUs decreases, indicating that dead ReLUs turn into unstable ReLUs as training progresses. In the other methods, however, the number of unstable ReLUs remains consistently small, while the number of active ReLUs decreases as the number of dead ReLUs increases. Figure 16 depicts histograms of the slope u⁺/(u⁺ − l⁻) of the unstable ReLUs during the ramp-up period. In the early stages of CAP training, the slope distribution is concentrated around 0.4. However, as training progresses with a larger ε, the distribution shifts to the left, which indicates that unstable ReLUs change into dead ReLUs; this is consistent with the results in Figure 15c. In the case of CROWN-IBP, on the other hand, the distribution shifts to the right during training, consistent with Figure 15b, which shows that the number of active ReLUs increases during training.
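Counting the three states from the pre-activation bounds takes one comparison per state; a minimal NumPy sketch with hypothetical bound values:

```python
import numpy as np

def relu_state_ratios(l, u):
    """Classify each activation node by its pre-activation bounds:
    'active' (0 < l <= u), 'unstable' (l <= 0 <= u), 'dead' (l <= u < 0),
    and return the ratio of nodes in each state."""
    n = l.size
    active = float(np.sum(l > 0)) / n
    dead = float(np.sum(u < 0)) / n
    return {"active": active, "dead": dead, "unstable": 1.0 - active - dead}

# Hypothetical bounds: one active, two unstable, one dead node.
l = np.array([0.5, -1.0, -2.0, -0.1])
u = np.array([2.0,  1.0, -0.5,  0.3])
ratios = relu_state_ratios(l, u)
```

Tracking these ratios per layer over the ramp-up period yields curves of the kind plotted in Figures 14 and 15.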

K COMPARISON WITH OTHER PRIOR WORK

All experiments and results in this paper (except for Table 4) are based on our own reimplementation. For prior work that we did not reimplement, we compare against the best results reported in the literature in Table 4. We note that the results of Xiao et al. (2018) and Balunovic & Vechev (2019) are evaluated with a MILP-based exact verifier (Tjeng et al., 2017).

L β- AND κ-SCHEDULINGS

Table 5 shows the evaluation results of the models as in Table 1 but trained with a different κ-scheduling (from 0 to 0). Table 6 shows the evaluation results of the proposed models trained with different κ- and β-schedulings.

We update the relaxation slopes with the projected step

a_{t+1} = Π_{[0,1]^n} ( a_t − α sign(∇_a L(s(x, y, ε; θ, φ), y)) ). (17)

We compare the original one-step method (α ≥ 1) to the 7-step (t = 7) method with α = 0.1. The results are summarized in Table 7. We found no significant difference between the two methods, even though the multi-step method takes several times longer. Therefore, we decided to focus on the one-step method.
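The projected update above can be sketched as follows (a minimal NumPy version; the slope values and the gradient with respect to a are hypothetical):

```python
import numpy as np

def update_slopes(a, grad_a, alpha):
    """One projected sign-gradient step on the relaxation slopes:
    a_{t+1} = Pi_[0,1]^n ( a_t - alpha * sign(grad_a) )."""
    return np.clip(a - alpha * np.sign(grad_a), 0.0, 1.0)

a = np.array([0.0, 0.5, 1.0])
g = np.array([-1.0, 2.0, -0.3])           # hypothetical gradient w.r.t. a
a_multi = update_slopes(a, g, alpha=0.1)  # one iteration of the multi-step variant
a_one = update_slopes(a, g, alpha=1.0)    # one-step: any alpha >= 1 saturates a to {0, 1}
```

With α ≥ 1, the projection clips every coordinate to 0 or 1 in a single step, which is why the one-step method needs no step-size tuning.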

P LOSS AND TIGHTNESS VIOLIN PLOTS

We plot loss and tightness violin plots, equivalent to those in Section 6, for models trained with the other methods (Figure 17). The proposed method achieves the best results in terms of loss and tightness, followed by CROWN-IBP, CAP-IBP, and RANDOM.

Q COMPARISON WITH CAP-IBP

As in Section E, we train a model with CAP-IBP and compare it with the proposed method and CROWN-IBP (β = 1). Figure 18 shows that CAP-IBP has gradient differences (defined in Section H) larger than the proposed method but smaller than CROWN-IBP (β = 1), which leads to a performance between the proposed method and CROWN-IBP (β = 1) (see Table 3). CAP-IBP has looser bounds than CROWN-IBP (β = 1), as shown in Figure 4 and Figure 17, but with a relatively smoother landscape it achieves better performance than CROWN-IBP (β = 1).

R RELU STABILITY

To see the effect of unstable ReLUs on smoothness, we adopt the ReLU stability loss (RS loss) L_RS(u, l) = −tanh(1 + u · l) as a regularizer (Xiao et al., 2018). We use L + λL_RS as the loss and run CROWN-IBP (β = 1) with various settings of λ. We plot the smoothness and tightness in Figure 19 and Figure 20 for λ = 0, λ = 0.01, and λ = 10. We found that the small λ suggested in Xiao et al. (2018) has no effect on reducing the number of unstable ReLUs, since certifiably trained models already have few unstable ReLUs, as shown in Figure 15, and thus no effect on improving smoothness. By increasing λ, we observed that the RS loss successfully reduces the number of unstable ReLUs at λ = 10. Figure 19 shows that a large λ leads to better loss variation and gradient difference. This supports the claim that unstable ReLUs are closely related to the smoothness of the loss landscape. However, as Xiao et al. (2018) note, "placing too much weight on RS Loss can decrease the model capacity, potentially lowering the provable adversarial accuracy": the models trained with a large λ ≥ 1 could not obtain a tight upper bound or a significant improvement in robustness, as illustrated in Figure 20. The test errors (Standard / PGD / Verified) are 0.6278 / 0.7189 / 0.7634 for λ = 0.01 and 0.6090 / 0.7085 / 0.7600 for λ = 10.
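The RS regularizer takes a few lines (a minimal NumPy sketch; the bound values are hypothetical):

```python
import numpy as np

def rs_loss(l, u):
    """ReLU-stability loss L_RS(u, l) = -tanh(1 + u * l), averaged over
    neurons (Xiao et al., 2018). Stable neurons (u * l > 0) contribute
    values near -1; unstable neurons (u * l < 0) are penalized."""
    return float(np.mean(-np.tanh(1.0 + u * l)))

stable = rs_loss(np.array([1.0]), np.array([3.0]))     # bounds both positive
unstable = rs_loss(np.array([-2.0]), np.array([2.0]))  # bounds span zero
# Training then minimizes L + lambda * rs_loss(l, u) over all activation layers.
```

Because tanh saturates, the penalty is nearly flat for strongly stable neurons and concentrates its gradient on neurons whose bounds straddle zero.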



https://github.com/locuslab/convex_adversarial



Figure 1: (Left) The learning curves for the scheduled value of ε with the loss variation along the gradient descent direction (the vertical line indicates when the ramp-up ends), and (Right) the loss landscapes along the gradient descent direction at each training step in the later phase of the ramp-up period (epochs 50-130). The thin lines and thick lines in the figure on the right show sample landscapes at each step and the median values, respectively. Our method shows tight bounds like CROWN-IBP, while its landscape is as favorable as IBP's, achieving the best performance among these four methods (see Table 1).

Figure 2: (Top) Cosine similarities between two consecutive loss gradients and (Bottom) the ratio of the number of unstable ReLUs during the ramp-up period. A large number of unstable ReLUs, i.e., high nonlinearity, leads to an unfavorable landscape that can negatively affect the optimization process.

Figure 3: The ratio of the number of unstable ReLUs for models with different a settings during the ramp-up period. The notation p/q denotes the variant that samples a ∈ {0, 1} with P(a = 1 | |l| > |u|) = p and P(a = 1 | |l| ≤ |u|) = q for unstable ReLUs. As the number of slopes with a = 1 increases, the model tends to have more unstable ReLUs, which leads to less smooth loss landscapes.

Figure 4: Violin plots of the test loss with the corresponding verified error (Left) and of tightness (Right) for various linear relaxations. Lower is better. The proposed relaxation method has a tighter bound than the other relaxation methods.

Finally, IBP uses the worst-case margin s = u^(K) to formulate the objective in (1) for certifiable training.

C DETAILS ON LINEAR RELAXATION

C.1 LINEAR RELAXATION EXPLAINED IN CROWN (ZHANG ET AL., 2018)

Figure 7: The learning curves for the scheduled value of ε with the loss variation along the gradient descent direction (equivalent to Figure 1). As the ratio of the number of slopes a with a = 1 increases, the loss variation increases.

Figure 8: A zoomed-in version of Figure 7 for epochs 100-200.

Figure 9 shows the learning curves for the target perturbation ε_train during the ramp-up period, while Figure 1 shows the corresponding curves for the scheduled value of ε. The two figures use the same settings as in Appendix A.1.

Figure 10: (Top) Gradient difference and (Middle) cosine similarities between two consecutive loss gradients, and (Bottom) the ratio of the number of unstable ReLUs during the ramp-up period.

Figure 14: The ratio of the number of active (top) and dead (bottom) ReLUs during the ramp-up period.

Figure 17: Violin plots of the test loss (Left Column) and of tightness (Right Column) for various linear relaxations, the same as in Section 6. Lower is better.

Figure 18: (Top) Gradient difference and (Middle) cosine similarities between two consecutive loss gradients, and (Bottom) the ratio of the number of unstable ReLUs during the ramp-up period.

Figure 19: (Top) Gradient difference, (Middle) cosine similarities between two consecutive loss gradients, and (Bottom) the ratio of the number of unstable ReLUs on CROWN-IBP (β = 1) with λ = 0, λ = 0.01, λ = 10, and OURS.

Test errors (Standard / PGD / Verified error) of IBP, CROWN-IBP (β = 1), CAP, and OURS on MNIST, CIFAR-10, and SVHN. Bold and underlined numbers are the first and second lowest verified errors. Table 1 summarizes the evaluation results under different ε_test for each dataset. In general, when ε_test is low, methods with tighter linear relaxations show good performance, whereas IBP tends to perform better as ε_test increases. In short, each state-of-the-art method performs well only for a specific range of ε_test. For example, IBP shows relatively better performance for ε_test = 0.3, 0.4 on MNIST and ε_test = 6/255, 8/255, 16/255 on CIFAR-10. On the other hand, CAP and CROWN-IBP (β = 1) outperform IBP for ε_test = 0.1 on MNIST, ε_test = 2/255 on CIFAR-10, and ε_test = 0.001 on SVHN.

Test errors (Standard / PGD / Verified error) of OURS and CROWN-IBP 1→0 on CIFAR-10. Bold numbers are the lower errors. It has been shown that CROWN-IBP 1→0 can help improve robustness performance, with the argument that training with the tighter CROWN-IBP bound at the beginning provides a good initialization for later IBP training. We provide another explanation: CROWN-IBP 1→0 starts with a tighter bound (CROWN-IBP only) but does not overfit to small perturbations because it gradually introduces the IBP objective, which has a smoother landscape. Despite using a single objective without the mixture parameter β, the proposed method outperforms CROWN-IBP 1→0 on CIFAR-10, as shown in Table 2.

Figure 11: Mode connectivity between CROWN-IBP and IBP, where w_0 and w_1 are well-trained models using the CROWN-IBP bound and the IBP bound, respectively. θ_c is trained using CROWN-IBP (11a) and IBP (11b), respectively.

Table 4: Test errors (Standard / Verified error) compared to the best errors reported in the literature. Bold numbers are the lowest verified errors.

Table 5: Test errors (Standard / PGD / Verified error) of IBP, CROWN-IBP (β = 1), CAP, and OURS on MNIST, CIFAR-10, and SVHN. See Appendix A for all the other settings, which are the same as in Table 1. Bold and underlined numbers are the first and second lowest verified errors.

Table 6: Test errors of OURS with different β- and κ-schedulings on MNIST and CIFAR-10.

Table 7: Test errors of OURS with different numbers of gradient update steps in (17) on CIFAR-10. Here, we use κ-scheduling from 0 to 0.

D ILLUSTRATION OF LINEAR RELAXATIONS

Figure 6 provides illustrations of the linear relaxations used in IBP, CAP, CROWN-IBP, and the proposed method. CROWN-IBP adaptively chooses the relaxation variable so that the area between h̄ and h̲ is minimized. However, a smaller area does not necessarily imply a tighter bound, and the proposed method achieves tighter bounds than the CROWN-IBP relaxation, as shown in Figure 4.
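As a concrete sketch, the closed-form upper relaxation a z + b = u⁺/(u⁺ − l⁻)(z − l⁻) discussed in Appendix C.2 can be computed per neuron as follows (a minimal NumPy version; the bound values are hypothetical):

```python
import numpy as np

def upper_relaxation(l, u):
    """Slope and intercept of the ReLU upper bound a*z + b, with
    a = u+ / (u+ - l-) and b = -a * l-, where u+ = max(u, 0) and
    l- = min(l, 0). One formula covers active (a = 1, b = 0),
    dead (a = 0, b = 0), and unstable (a = u/(u-l)) neurons."""
    up, lm = np.maximum(u, 0.0), np.minimum(l, 0.0)
    a = up / np.maximum(up - lm, 1e-12)  # guards the degenerate l = u = 0 case
    b = -a * lm
    return a, b

# Hypothetical bounds: one active, one unstable, one dead neuron.
l = np.array([1.0, -2.0, -3.0])
u = np.array([2.0,  2.0, -1.0])
a, b = upper_relaxation(l, u)
# Soundness: the line upper-bounds ReLU at both end points of [l, u].
assert np.all(a * u + b >= np.maximum(u, 0.0) - 1e-9)
assert np.all(a * l + b >= np.maximum(l, 0.0) - 1e-9)
```

The lower bound, by contrast, has a free slope 0 ≤ a̲ ≤ 1 (with b̲ = 0) for unstable neurons, which is the degree of freedom the proposed method optimizes.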

I MODE CONNECTIVITY

In this section, we check the mode connectivity (Garipov et al., 2018) between two models trained using certifiable training methods. Mode connectivity is a framework that investigates the connectedness of two models by finding a high-accuracy curve between them, which helps us understand the loss surface of neural networks.

Let w_0 and w_1 be the two sets of weights corresponding to two different well-trained neural networks, and let φ_{θc}(t) with t ∈ [0, 1] be a continuous piecewise-smooth parametric curve with parameters θ_c such that φ_{θc}(0) = w_0 and φ_{θc}(1) = w_1. To find a low-loss path between w_0 and w_1, Garipov et al. (2018) suggested finding the parameters θ_c that minimize the expectation L(θ_c) of a loss ℓ(w) over a distribution q_{θc}(t) on the curve. To optimize L(θ_c) over θ_c, we use the uniform distribution U[0, 1] as q_{θc}(t) and a Bezier curve (Farouki, 2012) as φ_{θc}(t), which provides a convenient parameterization of smooth paths connecting the two end points w_0 and w_1.

A path φ_{θc} is said to have a barrier if there exists t such that ℓ(φ_{θc}(t)) > max{ℓ(w_0), ℓ(w_1)}. The existence of a barrier suggests that the modes of the two well-trained models are not connected by the path in terms of the given loss function (Zhao et al., 2020).

We test the mode connectivity between the models trained with IBP, CROWN-IBP, and OURS. For example, to check the mode connectivity between two models trained with CROWN-IBP and IBP, we use the loss function of each model as the user-specified loss for training the parametric curve φ_{θc}. We thus obtain two curves for each pair of models, as depicted in Figures 11, 12, and 13. Here, we use the identical settings to Appendix A.1.

Figure 11 shows the mode connectivity between CROWN-IBP and IBP. We use the CROWN-IBP loss as the user-specified loss in Figure 11a and the IBP loss in Figure 11b.
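The curve φ_{θc} can be sketched as follows; we assume the quadratic Bezier parameterization with a single trainable control point used by Garipov et al. (2018) (the end points and control point below are hypothetical toy vectors):

```python
import numpy as np

def bezier_path(t, w0, w1, theta_c):
    """Quadratic Bezier curve phi(t) = (1-t)^2 w0 + 2t(1-t) theta_c + t^2 w1,
    so phi(0) = w0 and phi(1) = w1 by construction; only theta_c is trained."""
    return (1 - t) ** 2 * w0 + 2 * t * (1 - t) * theta_c + t ** 2 * w1

w0, w1 = np.array([0.0, 0.0]), np.array([1.0, 1.0])
theta_c = np.array([1.0, 0.0])  # hypothetical control point
mid = bezier_path(0.5, w0, w1, theta_c)
# A barrier exists if some t gives loss(phi(t)) > max(loss(w0), loss(w1)).
```

In practice, w_0, w_1, and θ_c are full sets of network weights, and θ_c is trained by sampling t ~ U[0, 1] and backpropagating the user-specified loss through φ_{θc}(t).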
In this figure, we find that with the CROWN-IBP loss (11a) there exists a barrier between the two models, suggesting that they are not connected by the path in terms of the CROWN-IBP loss. With the IBP loss, however, there is no loss barrier separating the two models. This indicates that using CROWN-IBP it is hard to optimize the parameters from w_0 to w_1, but IBP can.

Figure 12 shows the mode connectivity results for IBP and OURS. We find that the two models are not connected using either the IBP bound or the OURS bound, since there exists a barrier in both curves. In this figure, we can also note that OURS has tighter bounds than IBP, because the value of the loss function using OURS is lower than that of IBP.

Finally, Figure 13 illustrates the mode connectivity between CROWN-IBP and OURS. Using CROWN-IBP as the user-specified loss function, the robust loss on the curve is higher than at the end points. However, when OURS is used as the loss function, the robust loss generally decreases as t increases. This shows that OURS has a much more favorable loss landscape than CROWN-IBP. In addition, OURS has a tighter bound than CROWN-IBP, since the value of the robust loss using OURS is lower than that of CROWN-IBP.

N TRAINING WITH ε_train ≥ ε_test

N.1 ε_train ≥ ε_test ON MNIST

Zhang et al. (2019b) and Gowal et al. (2018) observed that IBP performs better when using ε_train ≥ ε_test than ε_train = ε_test. Figure 8 shows the results with different ε_train for each ε_test. The overfitting issue is more prominent for IBP and CROWN-IBP 1→0 than for the proposed method and CROWN-IBP 1→1. However, using larger perturbations compromises the standard accuracy, and thus it is desirable to use a smaller ε_train.

O TRAINING TIME

All training times are measured on a single TITAN X (Pascal) with the Medium model on CIFAR-10. We train with a batch size of 128 for OURS, CROWN-IBP 1→1, and IBP, but with a batch size of 32 for CAP due to its high memory cost.
For CAP, we use a random projection of 50 dimensions.
• OURS: 115.9 sec / epoch

