TIGHT SECOND-ORDER CERTIFICATES FOR RANDOMIZED SMOOTHING

Abstract

Randomized smoothing is a popular way of providing robustness guarantees against adversarial attacks: randomly-smoothed functions have a universal Lipschitz-like bound, allowing for robustness certificates to be easily computed. In this work, we show that there also exists a universal curvature-like bound for Gaussian random smoothing: given the exact value and gradient of a smoothed function, we compute a lower bound on the distance of a point to its closest adversarial example, called the Second-order Smoothing (SoS) robustness certificate. In addition to proving the correctness of this novel certificate, we show that SoS certificates are realizable and therefore tight. Interestingly, we show that the maximum achievable benefits, in terms of certified robustness, from using the additional information of the gradient norm are relatively small: because our bounds are tight, this is a fundamental negative result. The gain of SoS certificates further diminishes if we consider the estimation error of the gradient norms, for which we have developed an estimator. We therefore additionally develop a variant of Gaussian smoothing, called Gaussian dipole smoothing, which provides similar bounds to randomized smoothing with gradient information, but with much-improved sample efficiency. This allows us to achieve (marginally) improved robustness certificates on high-dimensional datasets such as CIFAR-10 and ImageNet. Code is available at https://github.com/alevine0/smoothing_second_order.

1. INTRODUCTION

A topic of much recent interest in machine learning has been the design of deep classifiers with provable robustness guarantees. In particular, for an $m$-class classifier $h : \mathbb{R}^d \to [m]$, the $L_2$ certification problem for an input $x$ is to find a radius $\rho$ such that, for all $\delta$ with $\|\delta\|_2 < \rho$, $h(x) = h(x + \delta)$. This robustness certificate serves as a lower bound on the magnitude of any adversarial perturbation of the input that can change the classification: therefore, the certificate is a security guarantee against adversarial attacks. There are many approaches to the certification problem, including exact methods, which compute the precise norm to the decision boundary (Tjeng et al., 2019; Carlini et al., 2017; Huang et al., 2017), as well as methods for which the certificate $\rho$ is merely a lower bound on the distance to the decision boundary (Wong & Kolter, 2018; Gowal et al., 2018; Raghunathan et al., 2018). One approach that belongs to the latter category is Lipschitz function approximation. Recall that a function $f : \mathbb{R}^d \to \mathbb{R}$ is $L$-Lipschitz if, for all $x, x'$, $|f(x) - f(x')| \le L\|x - x'\|_2$. If a classifier is known to be a Lipschitz function, this immediately implies a robustness certificate. In particular, consider binary classification for simplicity, where we use an $L$-Lipschitz function $f$ as a classifier, using the sign of $f(x)$ as the classification. Then for any input $x$, we are assured that the classification (i.e., the sign) will remain constant for all $x'$ within a radius $|f(x)|/L$ of $x$. Numerous methods for training Lipschitz neural networks with small, known Lipschitz constants have been proposed (Fazlyab et al., 2019; Zhang et al., 2019; Anil et al., 2019; Li et al., 2019b). It is desirable that the network be as expressive as possible, while still maintaining the desired Lipschitz property. Anil et al. (2019) in particular demonstrates that their proposed method can universally approximate Lipschitz functions, given sufficient network complexity.
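The Lipschitz certificate described above can be sketched in a few lines; a minimal illustrative example (the function names are ours), using a linear function, whose Lipschitz constant is exactly the norm of its weight vector:

```python
import math

def lipschitz_certificate(fx, L):
    """Certified L2 radius for an L-Lipschitz classifier scored by sign(f(x))."""
    return abs(fx) / L

# Example: f(x) = w.x + b is exactly ||w||_2-Lipschitz.
w, b = [3.0, 4.0], -2.0
x = [1.0, 1.0]
fx = sum(wi * xi for wi, xi in zip(w, x)) + b   # f(x) = 5.0
L = math.sqrt(sum(wi * wi for wi in w))         # ||w||_2 = 5.0
rho = lipschitz_certificate(fx, L)              # certified radius 1.0
```

For a linear classifier this radius is exactly the distance to the decision boundary, so the certificate is tight in that special case.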
However, in practice, for the robust certification problem on large-scale input, randomized smoothing (Cohen et al., 2019) is the current state-of-the-art method.

[Figure 1: (a) Tight lower bound on the value of a smoothed function at $x'$ (i.e., $p_a(x')$) as a function of $\|x - x'\|_2$. In this example, $p_a(x) = 0.8$ and the smoothing standard deviation $\sigma = 1$. The red line shows the lower bound for the function, with no information about the gradient given. The blue line incorporates the additional information that $\|\nabla_x p_a(x)\|_2 = 0$. Note that $\Phi^{-1}(p_a(x')) = 0$ corresponds to $p_a(x') = 0.5$, the decision boundary for a binary classifier. (b) Tight robustness certificates for a randomized-smoothed classifier, given the top-class value $p_a(x)$ and the gradient norm $\|\nabla_x p_a(x)\|_2$. The dashed lines show the certificates given $p_a(x)$ alone. Note that the maximum possible gradient for a smoothed classifier depends on $p_a(x)$ (see Equation 1).]

The key observation of randomized smoothing (as formalized by Salman et al. (2019); Levine et al. (2019)) is that, for any arbitrary base classifier function $f : \mathbb{R}^d \to [0, 1]$, the function $x \mapsto \Phi^{-1}(p_a(x))$, where

$$p_a(x) := \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}[f(x + \epsilon)] \quad (1)$$

is $(1/\sigma)$-Lipschitz, where $\mathcal{N}(0, \sigma^2 I)$ is a $d$-dimensional isotropic Gaussian distribution with variance $\sigma^2$ and $\Phi^{-1}$ is the inverse normal CDF. As a result, given the smoothed classifier value $p_a(x)$ at $x$, one can calculate the certified radius $\rho(x) = \sigma\Phi^{-1}(p_a(x))$ within which $p_a(x') \ge 0.5$ (i.e., $\Phi^{-1}(p_a(x')) \ge 0$). This means that we can use $p_a : \mathbb{R}^d \to [0, 1]$ as a robust binary classifier (with one class assignment if $p_a(x) \ge 0.5$, and the other if $p_a(x) < 0.5$). Cohen et al. (2019) shows that this is a tight certificate result for a classifier smoothed with Gaussian noise: given the value of $p_a(x)$, there exists a base classifier function $f$ such that, if $p_a$ is the Gaussian-smoothed version of $f$, then there exists an $x'$ with $\|x - x'\|_2 = \rho$ such that $p_a(x') = 0.5$.
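A minimal sketch of this first-order certificate, using Python's stdlib NormalDist (this is an illustration of the formula, not the authors' released code). For the one-dimensional step base classifier $f = \mathbb{1}\{x \le \theta\}$, the smoothed value is available in closed form, and $\Phi^{-1}(p_a(x))$ is exactly $(1/\sigma)$-Lipschitz:

```python
from statistics import NormalDist

Phi, Phi_inv = NormalDist().cdf, NormalDist().inv_cdf

def first_order_certificate(pa, sigma):
    """Cohen et al. (2019) certified radius, given top-class value pa >= 0.5."""
    return sigma * Phi_inv(pa)

# For the 1-d base classifier f = 1{x <= theta}, smoothing gives
# pa(x) = Phi((theta - x) / sigma), so Phi_inv(pa(x)) = (theta - x) / sigma,
# which has slope exactly -1/sigma: the Lipschitz bound is met with equality.
sigma, theta = 0.5, 1.0
pa = lambda x: Phi((theta - x) / sigma)
```

This step classifier is the worst case realizing the bound: the linear decision boundary concentrates all adversarial mass in a single direction.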
In other words, the certificate provided by Cohen et al. (2019) is the largest possible certificate for Gaussian smoothing, given only the value of $p_a(x)$. Previous results (Li et al., 2019a; Lecuyer et al., 2019) provided looser bounds for Gaussian smoothing. Singla & Feizi (2020) have recently shown, for shallow neural networks, that, rather than globally bounding the (first-order) Lipschitz constant of the network, it is possible to achieve larger robustness certificates by instead globally bounding the Lipschitz constant of the gradient of the network. This second-order, curvature-based method takes advantage of the fact that the gradient at $x$ can be computed easily via back-propagation, so certificates can make use of both $f(x)$ and $\nabla_x f(x)$. This leads to a question: can we also use the gradient of a smoothed classifier, $\nabla_x p_a(x)$, to improve smoothing-based certificates? In this work, we show that there is a universal curvature-like bound for all randomly-smoothed classifiers. Therefore, given $p_a(x)$ and $\nabla_x p_a(x)$, we can compute larger certificates than are possible using the value of $p_a(x)$ alone. Moreover, our bound is tight in that, given only the pair $(p_a(x), \nabla_x p_a(x))$, the certificate we provide is the largest possible certificate for Gaussian smoothing. We call our certificates "Second-order Smoothing" (SoS) certificates. As shown in Figure 1, the smoothing-based certificates which we can achieve using second-order smoothing represent relatively modest improvements compared to the first-order bounds. This is a meaningful negative result, given the tightness of our bounds, and is therefore useful in guiding (or limiting) future research into higher-order smoothing certificates.
Additionally, this result shows that randomized smoothing (specifically, functions of the form of Equation 1) cannot be used to universally approximate Lipschitz functions: all randomly smoothed functions obey the additional curvature constraint described in this work. If the base classifier $f$ is a neural network, computing the expectation in Equation 1 analytically is not tractable. Therefore it is standard (Lecuyer et al., 2019; Cohen et al., 2019; Salman et al., 2019) to estimate this expectation using $N$ random samples, and to bound the expectation probabilistically. The certificate is then a high-probability, rather than exact, result, using the estimated lower bound of $p_a(x)$. In Section 3.1, we discuss empirical estimation of the gradient norm of a smoothed classifier for second-order certification, and develop an estimator for this quantity, in which the number of samples required to estimate the gradient scales linearly with the dimensionality $d$ of the input. In order to overcome this, in Section 4, we develop a modified form of Gaussian randomized smoothing, Gaussian Dipole Smoothing, which allows for a dipole certificate, related to the second-order certificate, to be computed. Unlike the second-order certificate, however, the dipole certificate has no explicit dependence on dimensionality in its estimation, and therefore can practically scale to real-world high-dimensional datasets.
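The high-probability lower bound on $p_a(x)$ can be sketched as follows. This uses a simple Hoeffding bound for self-containedness; the cited implementations typically use a tighter Clopper-Pearson binomial bound, and the function name and interface here are illustrative assumptions:

```python
import math

def hoeffding_lower_bound(successes, N, eta):
    """High-probability lower bound on pa(x) from N hard-classifier samples.

    With probability >= 1 - eta, the true expectation pa(x) is at least the
    returned value. (Hoeffding's inequality; looser than the Clopper-Pearson
    bound used in standard randomized-smoothing implementations.)
    """
    p_hat = successes / N
    return max(0.0, p_hat - math.sqrt(math.log(1.0 / eta) / (2.0 * N)))
```

For example, with 85,000 of N = 100,000 samples returning the top class and eta = 0.001, the certified value of $p_a(x)$ is roughly 0.844 rather than the empirical 0.85; this gap shrinks as $N$ grows.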

2. PRELIMINARIES, ASSUMPTIONS AND NOTATION

We use $f(x)$ to represent a generic scalar-valued "base" function to be smoothed. In general, we assume $f : \mathbb{R}^d \to [0, 1]$. However, for empirical estimation results (Theorem 3), we assume that $f$ is a "hard" base classifier: $f : \mathbb{R}^d \to \{0, 1\}$. This will be made clear in context. The smoothed version of $f$ is notated as $p_a : \mathbb{R}^d \to [0, 1]$, defined as in Equation 1. Recall that $\Phi$ is the normal CDF and $\Phi'$ is the normal PDF. In randomized smoothing for multi-class problems, the base classifier is typically a vector-valued function $f : \mathbb{R}^d \to \{0, 1\}^m$ with $\sum_c f_c(x) = 1$, where $m$ is the number of classes. The final classification returned by the smoothed classifier is then given by $a := \arg\max_c \mathbb{E}_\epsilon[f_c(x + \epsilon)]$. However, in the most prominent implementations (Cohen et al., 2019; Salman et al., 2019), certificates are computed using only the smoothed value for the estimated top class $a$, where $a$ is estimated using a small number $N_0$ of initial random samples, before the final value of $p_a(x)$ is computed using $N$ samples. The certificate then determines the radius within which $p_a(x')$ will remain above 0.5: this guarantees that $a$ will remain the top class, regardless of the other logits. While some works (Lecuyer et al., 2019; Feng et al., 2020) independently estimate each smoothed logit, this incurs additional estimation error as the number of classes increases. In this work, we assume that only estimates for the top-class smoothed logit $p_a(x)$ and its gradient $\nabla_x p_a(x)$ are available (although we briefly discuss the case with more estimated logits in Section 3.2). When discussing empirical estimation, we use $\eta$ for the accepted probability of failure of an estimation method.

3. SECOND-ORDER SMOOTHING CERTIFICATE

We now state our main second-order robustness certificate result:

Theorem 1. For all $x, x'$ with $\|x - x'\|_2 < \rho$, and for all $f : \mathbb{R}^d \to [0, 1]$,

$$p_a(x') \ge \Phi\left(\Phi^{-1}(a' + p_a(x)) - \frac{\rho}{\sigma}\right) - \Phi\left(\Phi^{-1}(a') - \frac{\rho}{\sigma}\right) \quad (2)$$

where $a'$ is the (unique) solution to

$$\Phi'(\Phi^{-1}(a')) - \Phi'(\Phi^{-1}(a' + p_a(x))) = -\sigma\|\nabla_x p_a(x)\|_2. \quad (3)$$

Further, for all pairs $(p_a(x), \|\nabla_x p_a(x)\|_2)$ which are possible, there exists a base classifier $f$ and an adversarial point $x'$ such that Equation 2 is an equality.

This implies that our certificate is realizable, and therefore tight. Note that the right-hand side of Equation 2 is monotonically decreasing in $\rho$: we can then compute a robustness certificate by simply setting $p_a(x') = 0.5$ and solving for the certified radius $\rho$. Also, $a'$ can be computed easily, because the left-hand side of Equation 3 is monotonic in $a'$.

[Figure 3: worst-case base classifiers for three gradient regimes. (a) The gradient takes its maximum possible value: $\|\nabla_x p_a(x)\|_2 = \sigma^{-1}\Phi'(\Phi^{-1}(p_a(x)))$. (b) The gradient has an intermediate value: $0 < \|\nabla_x p_a(x)\|_2 < \sigma^{-1}\Phi'(\Phi^{-1}(p_a(x)))$. (c) The gradient is zero: $\|\nabla_x p_a(x)\|_2 = 0$.]

Evaluated certificate values are shown in Figure 1-b, and compared with first-order certificates. All proofs are presented in Appendix A. As in Cohen et al. (2019), we proceed by constructing the worst-case base classifier $f$ given $p_a(x)$ and $\|\nabla_x p_a(x)\|_2$. This is the base classifier $f$ which creates an adversarial point for the smoothed classifier as close as possible to $x$, given the constraints that $p_a(x)$ and $\|\nabla_x p_a(x)\|_2$ are equal to their reported values. In Cohen et al. (2019), given only $p_a(x)$, this is simply a linear classifier. With the gradient norm, the worst case is that $x$ lies in a region with class $a$ which is a slice between two linear decision boundaries, both perpendicular to $\nabla_x p_a(x)$. See Figure 3. Note that, by isometry, and because $\nabla_x p_a(x)$ is the only vector information we have, there is no benefit in certified radius to having the direction of $\nabla_x p_a(x)$: the norm is sufficient.
In the case of a linear classifier, the gradient takes its maximum possible value: $\|\nabla_x p_a(x)\|_2 = \sigma^{-1}\Phi'(\Phi^{-1}(p_a(x)))$. This case is shown in Figure 3-a: if the gradient norm is equal to this value, the second-order certificate is identical to the first-order certificate (Cohen et al., 2019). However, if the gradient norm is smaller, then we cannot be in this worst-case linear-classifier scenario. Instead, the new "worst case" is constructed by introducing a second "wrong class" region opposite to the direction of the adversarial point (Figure 3-b). In the extreme case (Figure 3-c) where the gradient norm is zero, this is accomplished by balancing two adversarial regions in a "sandwich" around $x$. This "sandwich" configuration reveals the relative weakness of gradient information in improving robustness certificates: having zero gradient does not require that the adversarial regions be evenly distributed around $x$. Rather, it is sufficient to distribute the adversarial probability mass $1 - p_a(x)$ into just two adversarial regions. Therefore, the certified radius, even in this most extreme case, is similar to the Cohen et al. (2019) certificate in the case with half as much adversarial probability mass (the first-order certificate for $p'_a(x) := (1 + p_a(x))/2$). This can be seen in Figure 1-b: note that at $p_a(x) = 0.6$, if the gradient norm is known to be zero, the certificate is slightly below the certificate for $p_a(x) = 0.8$ with no gradient information. The second-order certificate when $(p_a(x) = 0.6, \|\nabla_x p_a(x)\|_2 = 0)$ is in fact slightly below the first-order certificate for $p_a(x) = 0.8$, because the Gaussian noise samples from all of space, so the smoothed classifier decision boundary is slightly affected by the adversarial region in the direction opposite $x'$.
Because we can explicitly construct "worst-case" classifiers which represent the equality case of Equation 2, our certificates are known to be tight: the reported certified radii are the largest possible certificates if only $p_a(x)$ and $\|\nabla_x p_a(x)\|_2$ are known. In Figure 2, we show how our second-order certificate behaves on a simple, two-dimensional, nonlinearly separable dataset, the classic Swiss Roll. The increases are marginal, mostly because the certificates using standard randomized smoothing are already fairly tight: on these data, the certified radii for the two classes are nearly touching in many places along the decision boundary. However, for the blue class, which is surrounded on multiple sides by the red class, there are noticeable increases in the certified radius. This is especially true for points near the center of the blue class, which are at the "top of the hill" of the blue-class probability, and therefore have smaller gradients.

3.1. GRADIENT NORM ESTIMATION

In order to use the second-order certificate in practice, we must first bound, with high probability, the gradient norm $\|\nabla_x p_a(x)\|_2$ using samples from the base classifier $f$. Because Theorem 1 provides certificates that are strictly decreasing in $\|\nabla_x p_a(x)\|_2$, it is only necessary to upper-bound $\|\nabla_x p_a(x)\|_2$ with high probability. Salman et al. (2019) suggest two ways to approximate the gradient vector $\nabla_x p_a(x)$ itself, both based on the following important observation:

$$\nabla_x p_a(x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}[\nabla_x f(x + \epsilon)] = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, \sigma^2 I)}[\epsilon f(x + \epsilon)]/\sigma^2 \quad (4)$$

These two methods are:

1. At each sampled point, measure the gradient of $f$ using back-propagation, and take the mean vector of these estimates.
2. At each sampled point, multiply $f(x + \epsilon)$ by the noise vector $\epsilon$, and take the mean vector of these estimates.

Note, however, that Salman et al. (2019) do not provide statistical bounds on these estimates: for our certificate application, we must do so. While we ultimately use an approach based on method 2, we will first briefly discuss method 1. The major obstacle to using method 1 is that it requires the base classifier $f$ itself to be a Lipschitz function, with a small Lipschitz constant. This can be understood from Markov's inequality. For example, consider the value of some component $z(x) := u \cdot \nabla_x f(x)$, where $u$ is an arbitrary vector. Suppose $N$ samples are taken, but that $z$ is distributed such that:

$$z(x + \epsilon) = \begin{cases} 0 & \text{with probability } 1 - \frac{1}{2N} \\ 2N & \text{with probability } \frac{1}{2N} \end{cases} \quad (5)$$

This would be the case if $f$ is a function that approximates a step function from 0 to 1, with a small buffer region of very high slope, for example. Note that the probability that any of the $N$ samples measures the nonzero gradient component is less than 0.5, but the expected value of this component is in fact 1.0.
This example shows that, in order to accurately estimate the gradient with high probability, the number of samples used must at least scale linearly with the maximum possible value of the gradient norm of $f$. For unrestricted deep neural networks, Lipschitz constants are NP-hard to compute, and upper bounds on them are typically very large (Virmaux & Scaman, 2018). Of course, we could use Lipschitz-constrained networks as described in Section 1 for the base classifier, but this would defeat the purpose of using randomized smoothing in the first place. Moreover, in standard "hard" randomized smoothing as typically implemented (Cohen et al., 2019; Salman et al., 2019), the range of $f$ is $\{0, 1\}$, so $f$ is non-differentiable: therefore, this back-propagation method cannot be used at all. We therefore use method 2. In particular, we reject the naive approach of estimating each component independently, taking a union bound, and then taking the norm: not only would the error in the norm-squared scale with $d$ as the error from each component accumulates, but there would be an additional dependence on $d$ from the union bound: each component would have to be bounded with failure probability $\eta/d$, where $\eta$ is the total failure probability for measuring the gradient norm. Note that this issue would also be encountered in method 1 above, but in that case, a loose upper bound could at least be achieved without this dependency using Jensen's inequality (the mean of the norms of the gradients is larger than the norm of the mean). Instead, we estimate the norm-squared of the mean using a single, unbiased estimator. Note that:

$$\|\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)]\|_2^2 = \sigma^{-4}\,\mathbb{E}_\epsilon[\epsilon f(x + \epsilon)] \cdot \mathbb{E}_{\epsilon'}[\epsilon' f(x + \epsilon')] = \sigma^{-4}\,\mathbb{E}_{\epsilon, \epsilon'}[(\epsilon \cdot \epsilon')\, f(x + \epsilon)\, f(x + \epsilon')]$$

In other words, we can estimate the norm-squared of the mean by taking pairs of smoothing samples, and averaging the dot product of the two noise vectors times the product of the two sampled values.
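The paired estimator can be sketched as a seeded Monte Carlo simulation; the hard base classifier below ($f = \mathbb{1}\{x_1 > 0\}$) and the function interface are illustrative assumptions, chosen so the population value is known in closed form:

```python
import random

def grad_norm_sq_estimate(f, x, sigma, n_pairs, rng):
    """Unbiased estimate of ||grad pa(x)||_2^2 from paired smoothing samples:
    the mean over pairs of (eps . eps') f(x+eps) f(x+eps'), divided by sigma^4."""
    d = len(x)
    total = 0.0
    for _ in range(n_pairs):
        e1 = [rng.gauss(0.0, sigma) for _ in range(d)]
        e2 = [rng.gauss(0.0, sigma) for _ in range(d)]
        dot = sum(a * b for a, b in zip(e1, e2))
        total += dot * f([xi + a for xi, a in zip(x, e1)]) \
                     * f([xi + b for xi, b in zip(x, e2)])
    return total / (n_pairs * sigma ** 4)

# Illustrative hard base classifier f = 1{x_1 > 0}: at x = 0, the smoothed
# function is pa(x) = Phi(x_1 / sigma), so the true gradient norm is
# Phi'(0)/sigma ~= 0.399 and the norm-squared is ~0.159 (for sigma = 1).
rng = random.Random(0)
f = lambda z: 1.0 if z[0] > 0 else 0.0
est = grad_norm_sq_estimate(f, [0.0] * 5, 1.0, 100_000, rng)
```

Even in this tiny $d = 5$ example, roughly $10^5$ sample pairs leave noticeable variance in the estimate; as the next theorem makes precise, the per-pair variance grows with $d$.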
We show that this is a sub-exponential random variable (see Appendix), which gives us an asymptotically linear scaling of $N$ with $d$:

Theorem 2. Let $V := \mathbb{E}_{\epsilon, \epsilon'}[(\epsilon f(x + \epsilon)) \cdot (\epsilon' f(x + \epsilon'))]$, and $\tilde V$ be its empirical estimate. If $n$ pairs of samples ($= N/2$) are used to estimate $V$, then, with probability at most $\eta$, $\mathbb{E}[V] - \tilde V \ge t$, where:

$$t = \begin{cases} 4\sigma^2\sqrt{-\frac{d}{n}\ln(\eta)} & \text{if } -2\ln(\eta) \le dn \\ -\frac{4\sqrt{2}\sigma^2}{n}\ln(\eta) & \text{if } -2\ln(\eta) > dn \end{cases}$$

Note that in practice, we can use the same samples to estimate $\|\nabla_x p_a(x)\|_2$ as are used to estimate $p_a(x)$. However, this requires reducing the failure probability of each estimate to $\eta' = \eta/2$, in order to use a union bound. This means that, if $N$ is small (or $d$ large), second-order smoothing can in fact give worse certificates than standard smoothing, because the benefit of a (loose, for small $N$) estimate of the gradient is less significant than the negative effect of reducing the estimate of $p_a(x)$. As shown in Figure 4-a, even for very large $N$ and relatively small dimension, the empirical estimation significantly reduces the radii of certificates which can be calculated. See Section 5 for experimental results.
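The two-branch deviation term of Theorem 2 can be computed directly. A sketch (the function name is ours), which also illustrates why holding the margin $t$ fixed requires $n$ to grow linearly with $d$:

```python
import math

def grad_estimation_margin(n, d, sigma, eta):
    """Deviation term t from Theorem 2: with probability >= 1 - eta,
    the true E[V] is less than the empirical estimate plus t."""
    if -2.0 * math.log(eta) <= d * n:
        # sub-Gaussian regime of the sub-exponential tail bound
        return 4.0 * sigma ** 2 * math.sqrt(-d * math.log(eta) / n)
    # sub-exponential (large-deviation) regime
    return -4.0 * math.sqrt(2.0) * sigma ** 2 * math.log(eta) / n
```

In the first branch, $t$ depends on $d$ and $n$ only through the ratio $d/n$, so doubling the input dimensionality requires doubling the number of sample pairs to keep the same estimation margin; the two branches agree at the boundary $-2\ln(\eta) = dn$.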

3.2. UPPER-BOUND AND MULTI-CLASS CERTIFICATES

We can easily convert Theorem 1 into a tight upper bound on $p_a(x')$ by simply evaluating it for $\bar f = 1 - f$ (and therefore $\bar p_a = 1 - p_a$). If estimates and gradients are available for multiple classes, it would then be possible to achieve an even larger certificate, by setting the lower bound of the top logit equal to the upper bounds of each of the other logits. Note, however, that unlike the first-order smoothing works (Lecuyer et al., 2019; Feng et al., 2020) which use this approach, it is not sufficient to compare against just the "runner-up" class, because other logits may have less restrictive upper bounds due to having larger gradients. As discussed above, gradient norm estimation can be computationally expensive, so gradient estimation for many classes may not be feasible. Also, note that while this approach would produce larger, correct certificates, we do not claim that these would be tight certificates given the value and gradient information for all classes: the "worst case" constructions we describe above for a single logit might not be simultaneously constructible for multiple logits.

4. DIPOLE SMOOTHING

For large-scale image datasets, the dependence on $d$ in Theorem 2 can create statistical barriers. However, the general approach of second-order smoothing, especially using the discrete estimation method (method 2) described above, has an interesting interpretation: rather than using simply the mean of $f(x + \epsilon)$, we are also using the geometrical distribution of the values of $f(x + \epsilon)$ in space to compute a larger certified bound. In particular, if we can show that points which are adversarial for the base classifier (points with $f(x + \epsilon) = 0$) are dispersed, then this implies larger certificates, because it makes it impossible for a perturbation in a single direction to move $x$ towards the adversarial region. Second-order smoothing, above, is merely one example of this. We therefore introduce Gaussian Dipole smoothing. This method, like second-order smoothing, also harnesses the geometrical distribution of the values of $f$ to improve certificates. However, unlike second-order smoothing, there is no explicit dependence on $d$ in the empirical dipole smoothing bound. In this method, when we sample $f(x + \epsilon)$ to estimate $p_a(x)$, we also sample $f(x - \epsilon)$. This allows us to compute two quantities:

$$C_S := \mathbb{E}_\epsilon[f(x + \epsilon)\, f(x - \epsilon)] \qquad C_N := \mathbb{E}_\epsilon[f(x + \epsilon) - f(x + \epsilon)\, f(x - \epsilon)]$$

The certificate we can calculate is then as follows:

Theorem 3. For all $x, x'$ with $\|x - x'\|_2 < \rho$, and for all $f : \mathbb{R}^d \to [0, 1]$,

$$p_a(x') \ge \Phi\!\left(\Phi^{-1}(C_N) - \frac{\rho}{\sigma}\right) + \Phi\!\left(\Phi^{-1}\!\left(\frac{1 + C_S}{2}\right) - \frac{\rho}{\sigma}\right) - \Phi\!\left(\Phi^{-1}\!\left(\frac{1 - C_S}{2}\right) - \frac{\rho}{\sigma}\right)$$

We also compute this bound by constructing the worst possible classifier. In this case, the trick is that, if two adversarial sampled points are opposite one another (i.e., $f(x + \epsilon) = f(x - \epsilon) = 0$), then they cannot both contribute to the same adversarial "direction". In the worst case, the "reflected" adversarial points form a plane opposite the base classifier decision boundary (see Figure 4-b).
In the extreme case where $C_N = 0$, the "worst case" classifier is the same as for second-order smoothing. Experimentally, we simply need to lower-bound both $C_S$ and $C_N$ from samples. This reduces the precision of our estimates for two reasons: we have half as many independent samples for the same number of evaluations we must perform, and we are bounding two quantities, which requires halving the error probability for each. However, unlike second-order smoothing, there is no dependence on $d$: this allows for practical certificates on real-world datasets.

5. EXPERIMENTS

Experimental results are presented in Figures 5 and 6, with further results in Appendix B. Because both dipole and second-order certificates reduce the precision with which the empirical quantities needed for certification can be estimated, but both provide strictly larger certificates at the population level, the key question becomes at what number of samples $N$ each higher-order method becomes beneficial. Note that in the figures, we compare the new methods to standard smoothing, using the same $N$ for standard smoothing as for the new method. Due to the poor scaling of second-order certificates with dimension, we tested second-order smoothing on a low-dimensional dataset, 7 x 7 MNIST. However, significant increases to certificates were not seen until $N = 10^7$, even on this dataset. By contrast, dipole smoothing is beneficial for many images even when smaller numbers of smoothing samples are used. Because it scales to higher-dimensional data, we also tested Gaussian dipole smoothing on CIFAR-10 and ImageNet, where it led to modest improvements in certificates, in particular at $N = 10^6$. In Appendix C, we show the absolute, rather than relative, certified accuracy curves for the experiments shown in Figures 5 and 6. These plots show that higher-order smoothing techniques (SoS and Gaussian dipole smoothing) are mostly beneficial for increasing the certificates of images with small certified radii. In cases where certificates are already large, increased estimation error can lead to a decrease in certificates, but this effect is small relative to the magnitudes of these certificates (typically < 1%).

6. CONCLUSION

In this work, we explored the limits of using gradient information to improve randomized smoothing certificates. In particular, we introduced second-order smoothing certificates and showed tight and realizable upper bounds on their maximum achievable benefits. We also proposed Gaussian dipole smoothing, a novel method for robustness certification, which can improve smoothing-based robustness certificates even on large-scale data sets.

A.1 FIRST-ORDER SMOOTHING

We first provide a novel, simple, and intuitive proof for first-order randomized smoothing: this will allow us to develop methods and notation for later proofs.

Theorem. Let $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. For all $x, x'$ with $\|x - x'\|_2 < \rho$, and for all $f : \mathbb{R}^d \to [0, 1]$:

$$\mathbb{E}_\epsilon[f(x' + \epsilon)] \ge \Phi\!\left(\Phi^{-1}(\mathbb{E}_\epsilon[f(x + \epsilon)]) - \frac{\rho}{\sigma}\right)$$

where $\Phi$ is the normal CDF and $\Phi^{-1}$ is its inverse.

Proof. Let $R = \|x - x'\|_2$. Choose our basis so that $x = 0$ and $x' = [R, 0, 0, \ldots, 0]^T$ (note that, by isometry, we still have $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$). Then define $g : \mathbb{R} \to [0, 1]$:

$$g(z) = \mathbb{E}_{\epsilon_2, \ldots, \epsilon_d}[f([z, \epsilon_2, \ldots, \epsilon_d]^T)]$$

Note that:

$$\mathbb{E}_{\epsilon_1}[g(\epsilon_1)] = \mathbb{E}_\epsilon[f(x + \epsilon)] \qquad \mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] = \mathbb{E}_\epsilon[f(x' + \epsilon)] \quad (12)$$

Now, in one dimension, $\epsilon_1 \sim \mathcal{N}(0, \sigma)$, and so has pdf $z \mapsto \sigma^{-1}\Phi'(z/\sigma)$, where $\Phi'$ is the normal pdf. By the definition of expected value:

$$\mathbb{E}_{\epsilon_1}[g(\epsilon_1)] = \int_{-\infty}^{\infty} g(\epsilon_1)\,\sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma}\right) d\epsilon_1 \qquad \mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] = \int_{-\infty}^{\infty} g(R + \epsilon_1)\,\sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma}\right) d\epsilon_1 = \int_{-\infty}^{\infty} g(\epsilon_1)\,\sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma} - \frac{R}{\sigma}\right) d\epsilon_1 \quad (13)$$

We perform a change of integration variables, using $y = \Phi(\epsilon_1/\sigma)$ (and noting that $\frac{dy}{d\epsilon_1} = \sigma^{-1}\Phi'(\epsilon_1/\sigma)$):

$$\mathbb{E}_{\epsilon_1}[g(\epsilon_1)] = \int_0^1 g(\sigma\Phi^{-1}(y))\, dy \quad (14)$$

$$\mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] = \int_0^1 g(\sigma\Phi^{-1}(y))\,\frac{\Phi'\!\left(\Phi^{-1}(y) - \frac{R}{\sigma}\right)}{\Phi'(\Phi^{-1}(y))}\, dy \quad (15)$$

Note that:

$$\frac{\Phi'\!\left(\Phi^{-1}(y) - \frac{R}{\sigma}\right)}{\Phi'(\Phi^{-1}(y))} = \frac{e^{-\frac{1}{2}\left(\Phi^{-1}(y) - \frac{R}{\sigma}\right)^2}}{e^{-\frac{1}{2}\Phi^{-1}(y)^2}} = e^{\frac{R\left(2\sigma\Phi^{-1}(y) - R\right)}{2\sigma^2}}$$

Also, to simplify notation, define $g_\Phi : [0, 1] \to [0, 1]$ as $g_\Phi(y) := g(\sigma\Phi^{-1}(y))$.
Then we have (combining Equations 14 and 15):

$$\mathbb{E}_{\epsilon_1}[g(\epsilon_1)] = \int_0^1 g_\Phi(y)\, dy \qquad \mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] = \int_0^1 g_\Phi(y)\, e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}}\, dy \quad (16)$$

Fixing the expectation at $x$, $\mathbb{E}_{\epsilon_1}[g(\epsilon_1)]$, at a constant $C$, let us consider the function $g_\Phi$ which minimizes the expectation at $x'$:

$$\mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] \ge \min_{\substack{g_\Phi \in [0,1] \to [0,1] \\ \int_0^1 g_\Phi(y)\,dy = C}} \int_0^1 g_\Phi(y)\, e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}}\, dy$$

However, note that $e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}}$ increases monotonically with $y$. Then the minimum is achieved at:

$$g^*_\Phi(y) = \begin{cases} 1 & \text{if } y \le C \\ 0 & \text{if } y > C \end{cases}$$

In terms of the function $g(z)$, this is:

$$g^*(z) = \begin{cases} 1 & \text{if } z \le \sigma\Phi^{-1}(C) \\ 0 & \text{if } z > \sigma\Phi^{-1}(C) \end{cases}$$

Then we can evaluate the minimum, using the form of the integral given in Equation 13:

$$\mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] \ge \int_{-\infty}^{\infty} g^*(\epsilon_1)\,\sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma} - \frac{R}{\sigma}\right) d\epsilon_1 = \int_{-\infty}^{\sigma\Phi^{-1}(C)} \sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma} - \frac{R}{\sigma}\right) d\epsilon_1 = \Phi\!\left(\Phi^{-1}(C) - \frac{R}{\sigma}\right) - \Phi(-\infty) = \Phi\!\left(\Phi^{-1}(C) - \frac{R}{\sigma}\right)$$

By the definition of $C$ and Equation 12, this is:

$$\mathbb{E}_\epsilon[f(x' + \epsilon)] \ge \Phi\!\left(\Phi^{-1}(\mathbb{E}_\epsilon[f(x + \epsilon)]) - \frac{R}{\sigma}\right) \ge \Phi\!\left(\Phi^{-1}(\mathbb{E}_\epsilon[f(x + \epsilon)]) - \frac{\rho}{\sigma}\right)$$

which was to be proven.
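The equality case can be checked numerically: smoothing the worst-case step classifier $g^*$ (i.e., $f^*(z) = \mathbb{1}\{z_1 \le \sigma\Phi^{-1}(C)\}$) and evaluating at $x'$ reproduces $\Phi(\Phi^{-1}(C) - R/\sigma)$ up to Monte Carlo error. A seeded sketch; the dimension and constants are arbitrary choices for illustration:

```python
import random
from statistics import NormalDist

_N = NormalDist()
rng = random.Random(0)

sigma, C, R, d = 1.0, 0.7, 0.5, 3
threshold = sigma * _N.inv_cdf(C)            # worst-case boundary location
f_star = lambda z: 1.0 if z[0] <= threshold else 0.0

# Monte Carlo estimate of the smoothed value at x' = (R, 0, ..., 0).
xprime = [R] + [0.0] * (d - 1)
n = 200_000
est = sum(f_star([xi + rng.gauss(0.0, sigma) for xi in xprime])
          for _ in range(n)) / n

exact = _N.cdf(_N.inv_cdf(C) - R / sigma)    # the bound, met with equality
```

Because only the first coordinate of $f^*$ matters, the estimate concentrates around the closed-form value, confirming that the first-order bound is realizable.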

A.2 SECOND ORDER SMOOTHING

Theorem 1. Let $\epsilon \sim \mathcal{N}(0, \sigma^2 I)$. For all $x, x'$ with $\|x - x'\|_2 < \rho$, and for all $f : \mathbb{R}^d \to [0, 1]$,

$$\mathbb{E}_\epsilon[f(x' + \epsilon)] \ge \Phi\!\left(\Phi^{-1}(a' + \mathbb{E}_\epsilon[f(x + \epsilon)]) - \frac{\rho}{\sigma}\right) - \Phi\!\left(\Phi^{-1}(a') - \frac{\rho}{\sigma}\right)$$

where $\Phi$ is the normal CDF, $\Phi^{-1}$ is its inverse, and $a'$ is the (unique) solution to

$$\Phi'(\Phi^{-1}(a')) - \Phi'(\Phi^{-1}(a' + \mathbb{E}_\epsilon[f(x + \epsilon)])) = -\sigma\|\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)]\|_2$$

Further, for all pairs $(\mathbb{E}_\epsilon[f(x + \epsilon)], \|\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)]\|_2)$ which are possible, there exists a base classifier $f$ and an adversarial point $x'$ such that Equation 4 is an equality.

Proof. As shown by Salman et al. (2019), we have, for all $x \in \mathbb{R}^d$:

$$\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)] = \sigma^{-2}\,\mathbb{E}_\epsilon[\epsilon f(x + \epsilon)]$$

Under the choice of basis of the above proof (in particular, $x = 0$), when evaluated at $x$ this becomes:

$$\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)] = \sigma^{-2}\,\mathbb{E}_\epsilon[\epsilon f(\epsilon)]$$

Let $u := [1, 0, 0, \ldots, 0]^T$, and define $g$ as in the above proof. Note that:

$$-\|\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)]\|_2 \le u \cdot \nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)] = u \cdot \sigma^{-2}\,\mathbb{E}_\epsilon[\epsilon f(\epsilon)] = \sigma^{-2}\,\mathbb{E}_\epsilon[\epsilon_1 f(\epsilon)] = \sigma^{-2}\,\mathbb{E}_{\epsilon_1}[\epsilon_1\,\mathbb{E}_{\epsilon_2, \ldots, \epsilon_d}[f(\epsilon)]] = \sigma^{-2}\,\mathbb{E}_{\epsilon_1}[\epsilon_1 g(\epsilon_1)] \quad (26)$$

By the definition of expectation, and again using the change of integration variables $y := \Phi(\epsilon_1/\sigma)$,

$$\mathbb{E}_{\epsilon_1}[\epsilon_1 g(\epsilon_1)] = \int_0^1 \sigma\Phi^{-1}(y)\, g(\sigma\Phi^{-1}(y))\, dy \quad (27)$$

Define:

$$C := \mathbb{E}_{\epsilon_1}[g(\epsilon_1)] \;\left(= \mathbb{E}_\epsilon[f(x + \epsilon)]\right) \qquad C' := \mathbb{E}_{\epsilon_1}[\epsilon_1 g(\epsilon_1)] \;\left(\ge -\sigma^2\|\nabla_x \mathbb{E}_\epsilon[f(x + \epsilon)]\|_2\right)$$

Then, by Equations 16 and 27, and defining $g_\Phi$ as above,

$$\mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] \ge \min_{\substack{g_\Phi \in [0,1] \to [0,1] \\ \int_0^1 g_\Phi(y)\,dy = C \\ \int_0^1 \sigma\Phi^{-1}(y)\,g_\Phi(y)\,dy = C'}} \int_0^1 g_\Phi(y)\, e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}}\, dy \quad (29)$$

Note that our constraints are linear in the space of functions; we can then introduce Lagrange multipliers:

$$\min_{g_\Phi \in [0,1] \to [0,1]} \int_0^1 g_\Phi(y)\, e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}}\, dy - \lambda_1\left(\int_0^1 g_\Phi(y)\, dy - C\right) - \lambda_2\left(\int_0^1 \sigma\Phi^{-1}(y)\, g_\Phi(y)\, dy - C'\right) \quad (30)$$

$$= \min_{g_\Phi \in [0,1] \to [0,1]} \int_0^1 \left(g_\Phi(y)\, e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}} - \lambda_1 g_\Phi(y) - \lambda_2\sigma\Phi^{-1}(y)\, g_\Phi(y)\right) dy + \text{constants} \quad (31)$$

$$= \min_{g_\Phi \in [0,1] \to [0,1]} \int_0^1 g_\Phi(y)\left(e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}} - \lambda_1 - \lambda_2\sigma\Phi^{-1}(y)\right) dy + \text{constants} \quad (32)$$

This is simply the inner product between $g_\Phi$ and a function: the inner product is minimized by setting $g_\Phi = 1$ where the expression in parentheses is negative, and $g_\Phi = 0$ where the expression is
positive:

$$g^*_\Phi(y) = \begin{cases} 1 & \text{if } e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}} \le \lambda_1 + \lambda_2\sigma\Phi^{-1}(y) \\ 0 & \text{if } e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}} > \lambda_1 + \lambda_2\sigma\Phi^{-1}(y) \end{cases}$$

Sign changes occur at:

$$\Phi^{-1}(y) = -W\!\left(-\frac{e^{-\frac{R^2}{2\sigma^2} - \frac{\lambda_1}{\lambda_2}}}{\lambda_2}\right) - \frac{\lambda_1}{\lambda_2} \quad (33)$$

where $W$ is the product-log (Lambert $W$) function. This returns zero, one, or two possible values, depending on the argument (zero values if the argument is less than $-e^{-1}$, two values on $[-e^{-1}, 0)$, and one value for non-negative arguments). Therefore there are at most two sign changes. Also, note that as $y \to 1$, $\Phi^{-1}(y) \to \infty$; taking the limit in Equation 32, we know that $g^*_\Phi(1) = 0$. Therefore, taking into account the constraint $\int_0^1 g_\Phi(y)\, dy = C$, we know that $g^*_\Phi$ is either:

• 0 everywhere (and $C = 0$), if there are no sign changes
• 1 at $y < C$, 0 otherwise, if there is one sign change
• 1 on the interval $[a, a + C]$, 0 otherwise, for some $a \in [0, 1 - C]$, if there are two sign changes

In fact, the final case includes the first two, so all that we need to do now is find $a$ to satisfy the $C'$ constraint. This constraint (Equation 27) becomes:

$$\int_a^{a + C} \Phi^{-1}(y)\, dy = \frac{C'}{\sigma} \quad (34)$$

Because $\Phi^{-1}(y)$ is monotone increasing, the LHS of Equation 34 is a monotone increasing function of $a$. Using the indefinite integral of $\Phi^{-1}$:

$$\int \Phi^{-1}(y)\, dy = \int \sqrt{2}\,\mathrm{erf}^{-1}(2y - 1)\, dy = -\frac{1}{\sqrt{2\pi}}\, e^{-(\mathrm{erf}^{-1}(2y - 1))^2} + \text{const} = -\Phi'(\Phi^{-1}(y)) + \text{const}$$

where $\mathrm{erf}^{-1}$ is the inverse error function. Then the constraint becomes:

$$\Phi'(\Phi^{-1}(a)) - \Phi'(\Phi^{-1}(a + C)) = \frac{C'}{\sigma} \quad (35)$$

We can now evaluate the value of the smoothed function at $x'$, again using the form of the integral given in Equation 13:

$$\mathbb{E}_{\epsilon_1}[g(R + \epsilon_1)] \ge \int_{-\infty}^{\infty} g^*(\epsilon_1)\,\sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma} - \frac{R}{\sigma}\right) d\epsilon_1 = \int_{\sigma\Phi^{-1}(a)}^{\sigma\Phi^{-1}(a + C)} \sigma^{-1}\Phi'\!\left(\frac{\epsilon_1}{\sigma} - \frac{R}{\sigma}\right) d\epsilon_1 = \Phi\!\left(\Phi^{-1}(a + C) - \frac{R}{\sigma}\right) - \Phi\!\left(\Phi^{-1}(a) - \frac{R}{\sigma}\right) \quad (36)$$

If we consider the form of this integral in Equation 29, which reduces to simply:

$$\int_a^{a + C} e^{\frac{R(2\sigma\Phi^{-1}(y) - R)}{2\sigma^2}}\, dy \quad (37)$$

we see that it is a monotonically increasing function of $a$. Furthermore, we have that the LHS of Equation 35 is monotonic in $a$.
Therefore, if we define a′ as the solution to:

Φ′(Φ⁻¹(a′)) − Φ′(Φ⁻¹(a′ + C)) = −σ‖∇ₓE_ε[f(x + ε)]‖₂    (38)

then by Equation 26, we know a′ ≤ a. Then, because the RHS of Equation 36 is also monotonic in a,

E_{ε₁}[g(R + ε₁)] ≥ Φ(Φ⁻¹(a′ + C) − R/σ) − Φ(Φ⁻¹(a′) − R/σ)    (39)

By the definitions of g and C, and because the RHS is monotonically decreasing in R (see Equation 37), we can conclude the theorem as stated. Further, we can conclude that the equality case is achievable by noting that it is attained by the function g*(z) described above: the minimal f*(z) can then be constructed as f*(z, ·, ·, ...) := g*(z). Note that we also need Equation 26 to be tight: this is achieved when the adversarial direction x′ − x is parallel to the gradient of the smoothed function.

Note that:

‖∇ₓE_ε[f(x + ε)]‖₂² = σ⁻⁴ E_ε[ε f(x + ε)] · E_{ε′}[ε′ f(x + ε′)] = σ⁻⁴ E_{ε,ε′}[(ε f(x + ε)) · (ε′ f(x + ε′))]    (41)

Theorem 2. Let V := (ε f(x + ε)) · (ε′ f(x + ε′)), and let Ṽ be its empirical mean. If n pairs of samples (= N/2) are used to estimate E[V], then, with probability at most η, E[V] − Ṽ ≥ t, where:

t = 4σ²√(−(d/n) ln η)    if −2 ln η ≤ dn
t = −4√2 σ² (ln η)/n     if −2 ln η > dn

Wainwright (2019) gives the following definition of a centered (mean-zero) sub-exponential random variable X:

Definition 1. A centered R.V. X is (a, b)-subexponential if:

E[e^{λX}] ≤ e^{a²λ²/2},  ∀ λ ∈ [−b⁻¹, b⁻¹]

First, we establish bounds for ε · ε′. (This can be considered a simplified case of Gaussian chaos of the second order; see Vershynin (2018).) For each i ∈ [d], εᵢ and ε′ᵢ are independent Gaussian random variables.
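The deviation t of Theorem 2 is a closed-form function of (n, d, σ, η) and can be computed directly. A minimal sketch, assuming numpy; the function name here is ours, not from the released code:

```python
import numpy as np

def theorem2_deviation(n, d, sigma, eta):
    # Deviation t such that Pr[E[V] - V_tilde >= t] <= eta (Theorem 2),
    # using n pairs of smoothing samples in dimension d.
    if -2.0 * np.log(eta) <= d * n:
        # Gaussian-like (small-deviation) regime of the Bernstein bound.
        return 4.0 * sigma**2 * np.sqrt(-(d / n) * np.log(eta))
    # Exponential (large-deviation) regime.
    return -4.0 * np.sqrt(2.0) * sigma**2 * np.log(eta) / n
```

Note the scaling: in the usual regime t ∝ σ²√(d ln(1/η)/n), so driving the estimation error below a fixed constant requires n to grow linearly with d, which is the sample-inefficiency discussed in the main text.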
Recall the moment-generating function of a Gaussian:

E[e^{tεᵢ}] = e^{σ²t²/2}  ∀ t

Then for the product εᵢε′ᵢ we have that:

E[e^{λεᵢε′ᵢ}] = E_{εᵢ}[E_{ε′ᵢ}[e^{λεᵢε′ᵢ}]] = E_{εᵢ}[e^{εᵢ²(λ²σ²/2)}]

Note that this has a similar form to the moment-generating function of the chi-squared distribution with k = 1:

E_{εᵢ}[e^{εᵢ²t}] = 1/√(1 − 2σ²t)

Then:

E[e^{λεᵢε′ᵢ}] = E_{εᵢ}[e^{εᵢ²(λ²σ²/2)}] = 1/√(1 − λ²σ⁴) ≤ e^{λ²σ⁴}  ∀ λ²σ⁴ ≤ 1/2

where the final inequality can be shown by observing that, if λ²σ⁴ ≤ 1/2:

1/(1 − λ²σ⁴) = 1 + λ²σ⁴/(1 − λ²σ⁴) ≤ 1 + 2λ²σ⁴ ≤ e^{2λ²σ⁴}    (48)

and taking square roots. Because εᵢε′ᵢ is centered, this implies that εᵢε′ᵢ is (√2σ², √2σ²)-subexponential. Now, ε · ε′ is simply the sum of d such identical, independent, centered subexponential variables: by Wainwright (2019, Equation 2.18), we conclude that ε · ε′ is (√2σ²√d, √2σ²)-subexponential. This implies:

E[e^{λ(ε·ε′)}] ≤ e^{2σ⁴dλ²/2},  ∀ λ ∈ [−(√2σ²)⁻¹, (√2σ²)⁻¹]

Recall that the quantity which we are measuring is (ε f(ε)) · (ε′ f(ε′)). For notational convenience, let v(ε, ε′) : ℝ^d × ℝ^d → {0, 1} be defined as v(ε, ε′) := f(ε)f(ε′), so that the quantity of interest is

V := (ε · ε′) v(ε, ε′)

We further define a centered version of this quantity, V̄ := V − E[V]. We now introduce an important lemma:

Lemma 1. V̄ is (2√2σ²√d, 2√2σ²)-subexponential.

Proof. Define p := Pr_{ε,ε′}[v(ε, ε′) = 1]. Then:

V = (ε · ε′) v(ε, ε′) = ε · ε′ − (1 − v(ε, ε′))(ε · ε′)    (53)
E[V] = E[ε · ε′] − E[(1 − v(ε, ε′))(ε · ε′)] = −E[(1 − v(ε, ε′))(ε · ε′)]

A.4 DIPOLE SMOOTHING

Theorem 3. Let ε ∼ N(0, σ²I). For all x, x′ with ‖x − x′‖₂ < ρ, and for all f : ℝ^d → [0, 1], define:

C_S := E_ε[f(x + ε)f(x − ε)]
C_N := E_ε[f(x + ε) − f(x + ε)f(x − ε)]

Then:

E_ε[f(x′ + ε)] ≥ Φ(Φ⁻¹(C_N) − ρ/σ) + Φ(Φ⁻¹((1 + C_S)/2) − ρ/σ) − Φ(Φ⁻¹((1 − C_S)/2) − ρ/σ)

where Φ is the standard normal CDF and Φ⁻¹ is its inverse.

Proof. As in the proof of Theorem 1, let R := ‖x − x′‖₂, and choose our basis so that x = 0 and x′ = [R, 0, 0, ..., 0]ᵀ.
First, for f : ℝ^d → [0, 1], we define a decomposition into symmetric and non-symmetric components f^S, f^N : ℝ^d → [0, 1]:

f^S(ε) := f(ε)f(−ε)
f^N(ε) := f(ε) − f(ε)f(−ε)

Note that f(ε) = f^S(ε) + f^N(ε), and also that f^S(ε) = f^S(−ε). Define g^S(z), g^N(z) : ℝ → [0, 1] by analogy to Equation 12. By linearity of expectation, note that g(z) = g^S(z) + g^N(z); also, g^S(−z) = g^S(z), by the symmetry of f^S and of the Gaussian distribution. Similarly, define g^S_Φ and g^N_Φ. We still have:

g_Φ(y) = g(σΦ⁻¹(y)) = g^S(σΦ⁻¹(y)) + g^N(σΦ⁻¹(y)) = g^S_Φ(y) + g^N_Φ(y)    (76)

Also (using Φ⁻¹(y) = −Φ⁻¹(1 − y)):

g^S_Φ(y) = g^S(σΦ⁻¹(y)) = g^S(−σΦ⁻¹(1 − y)) = g^S(σΦ⁻¹(1 − y)) = g^S_Φ(1 − y)    (77)

Note that all of the mechanics of the proof of Theorem 1 can be applied to f, f^S and f^N. Following Equation 13, we obtain C, C_S and C_N as integrals of g_Φ, g^S_Φ and g^N_Φ (Equation 78).

In Figure 13, we see experiments on ImageNet using dipole smoothing, for a range of σ ∈ {0.25, 0.50, 1.00} and N ∈ {10⁵, 10⁶}. There is an anomalous result for σ = 0.50, N = 10⁵, in that this is the only case where dipole smoothing appears to perform worse than standard smoothing. However, this turns out to be a computational artifact. At both σ = 0.50 and σ = 0.25, there are a large number of images where every smoothing sample is correctly classified, so p_a is as close to 1 as the measurement bounds allow. Note that if p_a truly equals 1, the certified radius is infinite, so in this domain the reported certificate is entirely a function of the estimation error. Because dipole smoothing reduces measurement precision, these samples have somewhat smaller certified radii under dipole smoothing, especially at small N. However, this gap should be exactly proportional to σ. The cause of the anomaly is the fact that our code (adapted from Cohen et al. (2019)) records each radius to three significant figures.
At σ = 0.5, for an image where all noise samples are correctly classified, the ratio of the dipole smoothing radius to the standard smoothing radius is reported as 1.89/1.91 = 98.95%, while for σ = 0.25 it is reported as 0.947/0.953 = 99.37%. This explains the large number of samples with reported > 1% decrease in certificates for σ = 0.50, N = 10 5 .

C ABSOLUTE CERTIFICATES FOR MAIN-TEXT EXPERIMENTS

In Figures 14 and 15 , we show the absolute, rather than relative, values of the certificates reported in the main text, compared to the baseline first-order randomized smoothing. We see that the benefit of the proposed techniques is greatest for images with small absolute certificates, and that, on CIFAR-10 and ImageNet, there is some disadvantage to dipole smoothing on the largest possible certificates, where all smoothing samples are classified correctly. This is because, for these images, the certificate depends entirely on estimation error. 



In a concurrent work initially distributed after the submission of this work, Mohapatra et al. (2020) have proposed an identical second-order smoothing certificate, along with a tighter empirical estimator for the gradient norm. In this estimator, the number of samples required scales with √d.



Figure 3: Worst-case base classifiers for second-order smoothing, for the same value of p_a(x) at different values of ‖∇ₓp_a(x)‖₂. The base classifier is f = 1 in the blue regions and f = 0 in the red regions. The point x is shown as a blue dot, with the Gaussian-sampled region used for calculating p_a(x) approximately indicated by a dashed blue circle. ∇ₓp_a(x) is shown as a blue arrow. (a) The gradient takes its maximum possible value: ‖∇ₓp_a(x)‖₂ = σ⁻¹Φ′(Φ⁻¹(p_a(x))). (b) The gradient has an intermediate value: 0 < ‖∇ₓp_a(x)‖₂ < σ⁻¹Φ′(Φ⁻¹(p_a(x))). (c) The gradient is zero: ‖∇ₓp_a(x)‖₂ = 0.

Figure 2: Comparison of second-order smoothing certificates to standard Gaussian smoothing certificates on a selection of points from the Swiss Roll dataset. Correctly labeled points with (second-order) certificates are shown in light red and blue, and points with an incorrect label or no certificate are shown in black. For a selection of points, shown in red/blue, the first-order certified radii are shown as red/blue rings. Increases to the certified radii due to second-order smoothing are shown as light blue (light red) rings around the first-order radii; these are absent where second-order smoothing gives no increase. For both experiments, N = 10⁸ and η = 0.001.

Figure 4: (a) Empirical second-order smoothing certificates, with d = 49 (corresponding to the 7 × 7 MNIST experiments), N = 10⁸, and η = 0.001. (b) Worst-case classifier for dipole smoothing.

Figure 5: Experiments on 7 × 7 MNIST. Reported is the distribution of the improvement (or reduction) of higher-order certificates relative to certificates computed using standard (first-order) randomized smoothing, for each tested image. For all panels, σ = 0.25. For (a, c, e, g), second-order smoothing is used; for (b, d, f, h), Gaussian dipole smoothing is used. For (a, b), N = 10⁵; for (c, d), N = 10⁶; for (e, f), N = 10⁷; for (g, h), N = 10⁸.

PRACTICAL CERTIFICATION ALGORITHM

Define C̲ := lower bound on C, C̄ := upper bound on C.
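These are binomial confidence bounds computed from the smoothing samples; one standard construction, used by Cohen et al. (2019) for bounding p_a, is the Clopper-Pearson interval. A minimal sketch, assuming scipy:

```python
from scipy.stats import beta

def clopper_pearson(k, n, eta):
    # Two-sided Clopper-Pearson interval at confidence 1 - eta for a binomial
    # proportion: k of n smoothing samples landed in the top class.
    lower = beta.ppf(eta / 2.0, k, n - k + 1) if k > 0 else 0.0
    upper = beta.isf(eta / 2.0, k + 1, n - k) if k < n else 1.0
    return lower, upper
```

The lower end plays the role of C̲ above and the upper end the role of C̄; when k = n (every sample correct), the lower bound is strictly below 1, which is exactly the measurement-precision effect discussed for the large-radius dipole results.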

g^S(−z) = E_{ε₂,...,ε_d}[f^S([−z, ε₂, ..., ε_d]ᵀ)] = E_{−ε₂,...,−ε_d}[f^S([−z, −ε₂, ..., −ε_d]ᵀ)] = E_{ε₂,...,ε_d}[f^S([−z, −ε₂, ..., −ε_d]ᵀ)] = E_{ε₂,...,ε_d}[f^S([z, ε₂, ..., ε_d]ᵀ)] = g^S(z)

C_S = E_ε[f^S(x + ε)] = ∫₀¹ g^S_Φ(y) dy,  C_N = E_ε[f^N(x + ε)] = ∫₀¹ g^N_Φ(y) dy

C := E_ε[f(x + ε)] = ∫₀¹ g_Φ(y) dy = ∫₀¹ (g^S_Φ(y) + g^N_Φ(y)) dy = C_S + C_N    (78)

B.4 DIPOLE SMOOTHING ON IMAGENET

Figure 12: CIFAR-10

This introduces a broader question for future work: what other information about the spatial distribution of classes in randomized smoothing can be efficiently used to improve robustness certificates?


Define:

A := ε · ε′,  B := (1 − v(ε, ε′))(ε · ε′) − E[(1 − v(ε, ε′))(ε · ε′)]

so that, by Equation 53, V̄ = A − B. Trivially, we have that A is (√2σ²√d, √2σ²)-subexponential, as shown above. However, note that also, by two applications of Jensen's inequality, and because p ≤ 1, the moment-generating function of B satisfies the same bound (Equation 66). In other words, we have shown that both A and B are (√2σ²√d, √2σ²)-subexponential. Then, by Cauchy-Schwarz:

E[e^{λV̄}] = E[e^{λA} e^{−λB}] ≤ (E[e^{2λA}])^{1/2} (E[e^{−2λB}])^{1/2} ≤ e^{(2√2σ²√d)²λ²/2},  ∀ λ ∈ [−(2√2σ²)⁻¹, (2√2σ²)⁻¹]

which proves Lemma 1. Finally, using the form of the one-sided Bernstein tail bound for subexponential random variables given in Wainwright (2019), we have, given n measurements and an empirical mean estimate Ṽ of V:

Pr[E[V] − Ṽ ≥ t] ≤ exp(−min(nt²/(16σ⁴d), nt/(4√2σ²)))

Then, given a failure rate η, we can compute the minimum deviation t such that the failure probability is less than η: setting the bound equal to η and solving for t in each regime yields the expression for t given in Theorem 2.

We may then write the minimization in Equation 17, fixing C_N and C_S as constants separately²:

E_{ε₁}[g(R + ε₁)] ≥ min_{g^S_Φ} ∫₀¹ g^S_Φ(y) e^{(2RσΦ⁻¹(y) − R²)/(2σ²)} dy + min_{g^N_Φ} ∫₀¹ g^N_Φ(y) e^{(2RσΦ⁻¹(y) − R²)/(2σ²)} dy

where the first minimum is subject to ∫₀¹ g^S_Φ(y) dy = C_S and g^S_Φ(y) = g^S_Φ(1 − y), and the second is subject to ∫₀¹ g^N_Φ(y) dy = C_N. The second minimum can be computed as in the proof of Theorem 1; it is simply

Φ(Φ⁻¹(C_N) − R/σ)

For the first minimum, we consider the additional constraint that g^S_Φ(y) = g^S_Φ(1 − y). Then, using Φ⁻¹(1 − y) = −Φ⁻¹(y), we can rewrite the integral as:

∫₀¹ g^S_Φ(y) e^{(2RσΦ⁻¹(y) − R²)/(2σ²)} dy = 2e^{−R²/(2σ²)} ∫₀^{1/2} g^S_Φ(y) cosh(RΦ⁻¹(y)/σ) dy    (80)

So the minimization becomes:

min 2e^{−R²/(2σ²)} ∫₀^{1/2} g^S_Φ(y) cosh(RΦ⁻¹(y)/σ) dy  subject to  2∫₀^{1/2} g^S_Φ(y) dy = C_S

Note that cosh(RΦ⁻¹(y)/σ) is a monotonically decreasing function of y on the range [0, 1/2]. Then the minimum is achieved by the function:

g^{S*}_Φ(y) = 1 if y ∈ [(1 − C_S)/2, 1/2];  0 otherwise (on [0, 1/2])

where the value on the domain [1/2, 1] can be computed using g^{S*}_Φ(1 − y) = g^{S*}_Φ(y). In terms of the function g^S(z), this is:

g^{S*}(z) = 1 if |z| ≤ σΦ⁻¹((1 + C_S)/2);  0 otherwise

(² Note that we are not considering all applicable constraints here: in particular, we are not restricting the range of g_Φ(z) to [0, 1] explicitly. However, the lower bound presented here must be at least as low as the lower bound with this additional constraint, so the inequality is still valid. Also, this constraint does in fact hold in the final construction.)

We can now evaluate the integral, again using the form of the integral given in Equation 13:

∫_{σΦ⁻¹((1−C_S)/2)}^{σΦ⁻¹((1+C_S)/2)} σ⁻¹Φ′(z/σ − R/σ) dz = Φ(Φ⁻¹((1 + C_S)/2) − R/σ) − Φ(Φ⁻¹((1 − C_S)/2) − R/σ)

So, combining the g^S and g^N terms, we have:

E_{ε₁}[g(R + ε₁)] ≥ Φ(Φ⁻¹(C_N) − R/σ) + Φ(Φ⁻¹((1 + C_S)/2) − R/σ) − Φ(Φ⁻¹((1 − C_S)/2) − R/σ)

From Equation 12, and noting that the last two terms together are monotonically decreasing with R (to see this, note the form of the integral of g^S equal to these terms, given in Equation 80), we complete the proof.
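Taken together, the results above yield a computable pipeline: estimate Ṽ from n sample pairs, convert it into a high-confidence upper bound on the gradient norm via Equation 41 and Theorem 2, and feed that bound to the Theorem 1 certificate. A sketch, assuming numpy/scipy and a hard (0/1) base classifier f applied row-wise; the function names are ours, and this mirrors the structure rather than the exact released implementation:

```python
import numpy as np
from scipy.optimize import brentq
from scipy.stats import norm

def grad_norm_upper_bound(f, x, sigma, n, eta, rng):
    # V_tilde: empirical mean over n pairs of (eps . eps') f(x+eps) f(x+eps').
    d = x.size
    eps1 = rng.normal(0.0, sigma, size=(n, d))
    eps2 = rng.normal(0.0, sigma, size=(n, d))
    v_tilde = np.mean(np.sum(eps1 * eps2, axis=1) * f(x + eps1) * f(x + eps2))
    # Theorem 2 deviation t, so that E[V] <= V_tilde + t with prob. >= 1 - eta.
    if -2.0 * np.log(eta) <= d * n:
        t = 4.0 * sigma**2 * np.sqrt(-(d / n) * np.log(eta))
    else:
        t = -4.0 * np.sqrt(2.0) * sigma**2 * np.log(eta) / n
    # ||grad||^2 = sigma^-4 E[V] (Equation 41); a sound certificate needs an
    # UPPER bound on the gradient norm, hence the one-sided deviation.
    return np.sqrt(max(v_tilde + t, 0.0)) / sigma**2

def sos_radius(p, grad_norm, sigma):
    # Theorem 1 certificate for top-class probability p > 1/2: solve for a',
    # then bisect for the largest rho at which the bound still exceeds 1/2.
    # Cap at the maximum feasible gradient norm for this p (see Figure 3a).
    grad_norm = min(grad_norm, norm.pdf(norm.ppf(p)) / sigma)
    def a_residual(a):
        return norm.pdf(norm.ppf(a)) - norm.pdf(norm.ppf(a + p)) + sigma * grad_norm
    a = brentq(a_residual, 0.0, 1.0 - p)
    def bound(rho):
        t = rho / sigma
        return norm.cdf(norm.ppf(a + p) - t) - norm.cdf(norm.ppf(a) - t)
    hi = 10.0 * sigma
    return hi if bound(hi) >= 0.5 else brentq(lambda r: bound(r) - 0.5, 0.0, hi)
```

With the gradient norm at its maximum feasible value, a′ = 0 and the radius collapses to the first-order certificate σΦ⁻¹(p); with a gradient norm of zero it is strictly larger, illustrating the (modest) maximum achievable gain.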

B ADDITIONAL EXPERIMENTS

Here, we present experiments at a wider range of parameters. For all figures, images misclassified or not certified for both the baseline and the tested method are not counted: the total test set size is 1000 for MNIST, 500 for CIFAR-10 and ImageNet with N = 10⁵, and 100 for CIFAR-10 and ImageNet with N = 10⁶. For all experiments, N₀ = 100. Also, note that we test independently for the baseline and higher-order methods (i.e., we use different smoothing samples). This is necessary to compare fairly to dipole smoothing, where the sampling method is different; however, it does lead to some noise, especially at N = 10⁵.

B.1 NOISE LEVEL σ

We see (Figures 7 and 8) that at a smaller level of noise (σ = 0.12), the effect of higher-order smoothing is diminished. This can be understood in terms of the curves in Figure 1-b: lower noise leads to more inputs with higher p_a, which reduces the benefit of the higher-order certificate. Conversely, higher noise increases the effects of the higher-order certificates, although it also leads to decreased total accuracy. The dipole certificate underperforms at N = 10⁵, σ = 0.5: this is likely due to the increase in estimation error, which becomes significant near p_a = 0.5.

B.2 DIMENSIONALITY d

To test second-order smoothing on a lower-dimensional dataset, we performed PCA on the 7 × 7 MNIST images and classified using the top 10 principal components (d = 10). Results are shown in Figures 9, 10, and 11. We see that, at N = 10⁶, second-order smoothing has a marginal positive impact at this smaller scale.

B.3 DIPOLE SMOOTHING ON CIFAR-10

In Figure 12 , we see experiments on CIFAR-10 using dipole smoothing, for a range of σ ∈ {0.12, 0.25, 0.50, 1.00} and N ∈ {10 5 , 10 6 }. Note that dipole smoothing appears to be beneficial even at N = 10 5 on CIFAR-10, at all noise levels ≥ 0.25. 

