FANTASTIC FOUR: DIFFERENTIABLE BOUNDS ON SINGULAR VALUES OF CONVOLUTION LAYERS

Abstract

In deep neural networks, the spectral norm of the Jacobian of a layer bounds the factor by which the norm of a signal changes during forward/backward propagation. Spectral norm regularizations have been shown to improve the generalization, robustness and optimization of deep learning methods. Existing methods to compute the spectral norm of convolution layers either rely on heuristics that are efficient in computation but lack guarantees, or are theoretically sound but computationally expensive. In this work, we obtain the best of both worlds by deriving four provable upper bounds on the spectral norm of a standard 2D multi-channel convolution layer. These bounds are differentiable and can be computed efficiently during training with negligible overhead. One of these bounds is in fact the popular heuristic method of Miyato et al. (2018) (multiplied by a constant factor depending on the filter size). Depending on the convolution filter, any of the four bounds can achieve the tightest gap. Thus, we propose to use the minimum of these four bounds as a tight, differentiable and efficient upper bound on the spectral norm of convolution layers. We show that our spectral bound is an effective regularizer and can be used to bound either the Lipschitz constant or the curvature values (eigenvalues of the Hessian) of neural networks. Through experiments on MNIST and CIFAR-10, we demonstrate the effectiveness of our spectral bound in improving the generalization and provable robustness of deep networks.

1. INTRODUCTION

Bounding the singular values of the layers of a neural network is a way to control the complexity of the model and has been used in problems including robustness, generalization, optimization and generative modeling. In particular, the spectral norm (the maximum singular value) of a layer bounds the factor by which the norm of a signal increases or decreases during both forward and backward propagation through that layer. If all singular values are close to one, then the gradients neither explode nor vanish (Hochreiter, 1991; Hochreiter et al., 2001; Klambauer et al., 2017; Xiao et al., 2018). Spectral norm regularizations/bounds have been used to improve generalization (Bartlett et al., 2017; Long & Sedghi, 2020), to train deep generative models (Arjovsky et al., 2017; Gulrajani et al., 2017; Tolstikhin et al., 2018; Miyato et al., 2018; Hoogeboom et al., 2020) and to robustify models against adversarial attacks (Singla & Feizi, 2020; Szegedy et al., 2014; Peck et al., 2017; Zhang et al., 2018; Anil et al., 2018; Hein & Andriushchenko, 2017; Cisse et al., 2017). These applications have motivated multiple works that regularize neural networks by penalizing the spectral norms of the network layers (Drucker & Le Cun, 1992; Yoshida & Miyato, 2017; Miyato et al., 2018; 2017; Sedghi et al., 2019; Singla & Feizi, 2020). For a fully connected layer with weight matrix W and bias b, the Lipschitz constant is given by the spectral norm of the weight matrix, i.e., ‖W‖₂, which can be computed efficiently using the power iteration method (Golub & Van Loan, 1996). In particular, if the matrix W is of size p × q, the computational complexity of power iteration (assuming convergence in a constant number of steps) is O(pq). Convolution layers (Lecun et al., 1998) are one of the key components of modern neural networks, particularly in computer vision (Krizhevsky et al., 2012).
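As a concrete illustration, the power iteration just mentioned can be sketched in a few lines of numpy (a minimal sketch; the function name `spectral_norm` is ours):

```python
import numpy as np

def spectral_norm(W, n_iter=1000, seed=0):
    """Estimate ||W||_2 of a p x q matrix by alternating power iteration."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(W.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(n_iter):            # each step costs O(pq)
        u = W @ v
        u /= np.linalg.norm(u)
        v = W.T @ u
        v /= np.linalg.norm(v)
    return float(u @ W @ v)            # sigma_max = u^T W v at convergence

W = np.random.default_rng(1).standard_normal((50, 30))
sigma = spectral_norm(W)               # agrees with np.linalg.norm(W, 2)
```

Each iteration costs one pass over W and its transpose, which is the O(pq) complexity noted above.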
Consider a convolution filter L of size c_out × c_in × h × w, where c_out, c_in, h and w denote the number of output channels, the number of input channels, and the height and width of the filter, respectively; and a square input sample of size c_in × n × n, where n is its height and width. A naive representation of the Jacobian of this layer is a matrix of size n²c_out × n²c_in. For a typical convolution layer with filter size 64 × 3 × 7 × 7 and an ImageNet-sized input 3 × 224 × 224 (Krizhevsky et al., 2012), the corresponding Jacobian matrix has a very large size: 802816 × 150528. This makes an explicit computation of the Jacobian infeasible. Ryu et al. (2019) provide a way to compute the spectral norm of convolution layers using convolution and transposed convolution operations inside power iteration, thereby avoiding this explicit computation. This leads to an improved running time, especially when the number of input/output channels is small (Table 1). However, in addition to the running time, the approach of Ryu et al. (2019) (and the other existing approaches described later) faces an additional difficulty in computing the gradient of the spectral norm (often used as a regularizer during training). The gradient of the largest singular value with respect to the Jacobian can be naively computed as the outer product of the corresponding singular vectors. However, due to the special structure of the convolution operation, the Jacobian is a sparse matrix with repeated elements (see Appendix Section D for details). The naive computation thus assigns non-zero gradient values to elements that should in fact remain zero throughout training, and different gradient values to elements that should always be identical. These issues make it difficult to compute the gradient of the spectral norm with respect to the convolution filter weights using the technique of Ryu et al. (2019). Recently, Sedghi et al. (2019) provided a principled approach for exactly computing the singular values of convolution layers. They construct n² matrices, each of size c_out × c_in, by taking the Fourier transform of the convolution filter (details in Appendix Section B). The set of singular values of the Jacobian equals the union of the singular values of these n² matrices. However, this method can have high computational complexity since it requires the SVD of n² matrices. Although the method can be adapted to compute only the spectral norms of the n² matrices using power iteration (in parallel with a GPU implementation) instead of full SVDs, the intrinsic computational complexity (discussed in Table 2) can make this approach difficult to use for very deep networks and large input sizes, especially when computational resources are limited. Moreover, computing the gradient of the spectral norm using this method is not straightforward since each of these n² matrices contains complex numbers. Thus, Sedghi et al. (2019) suggest clipping the singular values whenever they exceed a certain threshold to bound the spectral norm of the layer. To reduce the training overhead, they clip the singular values only once every 100 iterations. The resulting method reduces the training overhead but remains costly for large input sizes and very deep networks. We report the running time of this method in Table 1 and its training time for one epoch (using a 1-GPU implementation) in Table 4c. Because of the aforementioned issues, efficient methods to control the spectral norm of convolution layers have resorted to heuristics (Yoshida & Miyato, 2017; Miyato et al., 2018; Gouk et al., 2018). Typically, these methods reshape the convolution filter of dimensions c_out × c_in × h × w into a matrix of dimensions c_out × hwc_in, and use the spectral norm of this matrix as an estimate of the spectral norm of the convolution layer.
To regularize during training, they use the outer product of the corresponding singular vectors as the gradient of the largest singular value with respect to the reshaped matrix. Since the weights do not change significantly during each training step, they use only one iteration of the power method per step to update the singular value and vectors (reusing the singular vectors computed in the previous step). These methods incur negligible overhead during training. However, lacking theoretical justification (which we provide in this work), they are not guaranteed to work for all shapes and weights of the convolution filter; previous studies have observed underestimation of the spectral norm using these heuristics (Jiang et al., 2019). In summary, on one hand there are computationally efficient but heuristic ways of estimating the spectral norm of convolution layers (Miyato et al., 2017; 2018). On the other hand, the exact computation of the spectral norm of convolution layers proposed by Sedghi et al. (2019) and Ryu et al. (2019) can be expensive for commonly used architectures, especially with large inputs such as ImageNet samples. Moreover, the difficulty of computing the gradient of the spectral norm with respect to the Jacobian under these methods makes their use as a regularizer during training challenging. In this paper, we resolve these issues by deriving a differentiable and efficient upper bound on the spectral norm of convolution layers. Our bound is provable and not based on heuristics. Our computational complexity is similar to that of the heuristics (Miyato et al., 2017; 2018), allowing our bound to be used as a regularizer for efficiently training deep convolutional networks. In this way, our proposed approach combines the speed of the heuristics with the theoretical rigor of Sedghi et al. (2019). Table 2 summarizes the differences between previous works and our approach. In Table 1, we empirically observe that our bound can be computed significantly faster than the methods of Sedghi et al. (2019) and Ryu et al. (2019), while providing a guaranteed upper bound on the spectral norm. Moreover, we empirically observe that our upper bound and the exact value are close to each other (Section 3.1).

Table 1: Comparison of the exact spectral norm (computed using Sedghi et al. (2019); Ryu et al. (2019)) and our proposed bound for a Resnet-18 network pre-trained on ImageNet. Our bound is within 1.5 times the exact spectral norm (except for the first layer) while being significantly faster to compute. All running times were measured on a GPU using TensorFlow. For Sedghi et al. (2019)'s method, to have a fair comparison, we only compute the largest singular value (and not all singular values) of all individual matrices in parallel using power iteration (on 1 GPU) and take the maximum. We observe that for different filters, different components of our bound give the minimum value; the values of the four bounds can also be very different for different convolution filters (e.g., the filter in the first layer).

Below, we briefly explain our main result. Consider a convolution filter L of dimensions c_out × c_in × h × w and an input of size c_in × n × n. The corresponding Jacobian matrix J is of size n²c_out × n²c_in. We show that the largest singular value of the Jacobian (i.e., ‖J‖₂) is bounded as

$\|J\|_2 \le \sqrt{hw}\,\min\left(\|R\|_2, \|S\|_2, \|T\|_2, \|U\|_2\right),$

where R, S, T and U are matrices of sizes hc_out × wc_in, wc_out × hc_in, c_out × hwc_in and hwc_out × c_in respectively, and can be computed by appropriately reshaping the filter L (details in Section 3). This upper bound is independent of the input width and height (n). Formal results are stated in Theorem 1 and proved in the appendix. Remarkably, ‖T‖₂ is exactly the heuristic suggested by Miyato et al. (2018).
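The one-step-per-training-step scheme used by these heuristic methods (and which, by our result, becomes a provable bound after multiplying by √hw) can be sketched as follows. This is a sketch in the style of Miyato et al. (2018); the class and method names are ours, not from any library:

```python
import numpy as np

class SpectralEstimate:
    """Running estimate of ||T||_2 for a conv filter reshaped to
    (c_out, c_in*h*w). One power-iteration step per call, reusing the
    left singular vector `u` from the previous call."""

    def __init__(self, c_out, seed=0):
        u = np.random.default_rng(seed).standard_normal(c_out)
        self.u = u / np.linalg.norm(u)

    def step(self, L):
        T = L.reshape(L.shape[0], -1)   # column order differs from the matrix T
        v = T.T @ self.u                # defined later, but permuting columns
        v /= np.linalg.norm(v)          # preserves the spectral norm
        u = T @ v
        sigma = np.linalg.norm(u)
        self.u = u / sigma
        return float(sigma)

L = np.random.default_rng(1).standard_normal((8, 3, 3, 3))
est = SpectralEstimate(c_out=8)
for _ in range(300):                    # filter held fixed here, so this converges
    sigma = est.step(L)
h, w = L.shape[2], L.shape[3]
bound = np.sqrt(h * w) * sigma          # provable upper bound on ||J||_2
```

With the filter held fixed, repeated calls converge to ‖T‖₂; during training, a single call per step suffices because the weights move slowly.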
To the best of our knowledge, this is the first work that derives a provable bound on the spectral norm of a convolution filter as a constant factor (dependent on the filter size, but not the filter weights) times the heuristic of Miyato et al. (2018). In Tables 1 and 3, we show that the other three bounds (using ‖R‖₂, ‖S‖₂, ‖U‖₂) can be significantly smaller than √hw ‖T‖₂ for different convolution filters. Thus, we take the minimum of these four quantities to bound the spectral norm of a convolution filter. In Section 4, we show that our bound can be used to improve the generalization and robustness properties of neural networks. Specifically, using our bound as a regularizer during training, we achieve accuracy improvements on par with the exact method (Sedghi et al., 2019) while being significantly faster to train (Table 4). We also achieve significantly higher robustness certificates against adversarial attacks than CNN-Cert (Boopathy et al., 2018) on a single-layer CNN (Table 5). These results demonstrate the potential for practical uses of our results. Code is available at the github repository: https://github.com/singlasahil14/fantastic-four.

                           Exact (Sedghi et al., 2019)   Exact (Ryu et al., 2019)    Upper Bound (Ours)
Computation                Norms of n² matrices,         Norm of one matrix          Norms of four matrices,
                           each of size c_out × c_in     of size n²c_out × n²c_in    each with hw·c_out·c_in entries
Time complexity (O)        n² c_out c_in                 n² h w c_out c_in           h w c_out c_in
Guaranteed bound           ✓                             ✓                           ✓
Easy gradient computation  ✗                             ✗                           ✓

Table 2: Comparison of various methods for computing the norm of convolution layers. n is the height and width of a square input, c_in is the number of input channels, c_out is the number of output channels, and h and w are the height and width of the convolution filter. For Sedghi et al. (2019), we only compute the largest singular value using power iteration (i.e., not all singular values).
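For reference, the exact computation in the first column above (Sedghi et al., 2019; details in Appendix B) reduces to a 2D FFT of the zero-padded filter. A numpy sketch, with our own function name:

```python
import numpy as np

def exact_conv_singular_values(L, n):
    """All singular values of the Jacobian of circular convolution with a
    filter L of shape (c_out, c_in, h, w) on a c_in x n x n input,
    following Sedghi et al. (2019)."""
    # fft2 of the zero-padded filter yields (F^T K_{c,d} F)_{j,k} for all c, d
    G = np.fft.fft2(L, s=(n, n), axes=(2, 3))    # (c_out, c_in, n, n)
    G = G.transpose(2, 3, 0, 1)                  # one c_out x c_in matrix per (j, k)
    return np.linalg.svd(G, compute_uv=False)    # union over all (j, k)

L = np.random.default_rng(0).standard_normal((4, 3, 3, 3))
spectral_norm = float(exact_conv_singular_values(L, n=16).max())
```

The SVD of n² small matrices is what the n²c_out c_in entry in the complexity row above refers to.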

2. NOTATION

For a vector v, we use v_j to denote the element in the j-th position of the vector. We use A_{j,:} and A_{:,k} to denote the j-th row and k-th column of the matrix A respectively. We take both A_{j,:} and A_{:,k} to be column vectors (thus A_{j,:} is the transpose of the j-th row of A). A_{j,k} denotes the element in the j-th row and k-th column of A. The same rules extend directly to higher-order tensors. For a matrix A ∈ ℝ^{q×r} and a tensor B ∈ ℝ^{p×q×r}, vec(A) denotes the vector constructed by stacking the rows of A, and vec(B) denotes the vector constructed by stacking the vectors vec(B_{j,:,:}), j ∈ [p−1]:

$\mathrm{vec}(A)^T = \left[A_{0,:}^T,\ A_{1,:}^T,\ \cdots,\ A_{q-1,:}^T\right], \qquad \mathrm{vec}(B)^T = \left[\mathrm{vec}(B_{0,:,:})^T,\ \mathrm{vec}(B_{1,:,:})^T,\ \cdots,\ \mathrm{vec}(B_{p-1,:,:})^T\right]$

We use the following notation for a convolutional neural network. L denotes the number of layers and φ is the activation function. For an input x, we use z^(I)(x) ∈ ℝ^{N_I} and a^(I)(x) ∈ ℝ^{N_I} to denote the raw (before applying φ) and activated (after applying φ) neurons in the I-th hidden layer respectively. a^(0) denotes the input image x. To simplify notation, and when no confusion arises, we make the dependency of z^(I) and a^(I) on x implicit. φ′(z^(I)) and φ″(z^(I)) denote the elementwise first and second derivatives of φ at z^(I). W^(I) denotes the weights of the I-th layer, i.e., W^(I) is a tensor for a convolution layer and a matrix for a fully connected layer. J^(I) denotes the Jacobian matrix of vec(z^(I)) with respect to the input x. θ denotes the neural network parameters. f_θ(x) denotes the softmax probabilities output by the network for an input x. For an input x and label y, the cross-entropy loss is denoted by ℓ(f_θ(x), y).

3. BOUNDING THE SPECTRAL NORM OF CONVOLUTION LAYERS

Consider a convolution filter L of size c_out × c_in × h × w applied to an input X of size c_in × n × n. The filter L takes an input patch of size c_in × h × w and outputs a vector of size c_out for every such patch; the same operation is applied across all such patches of the image. To apply the convolution at the edges of the image, modern convolution layers either do not compute those outputs, thereby reducing the size of the generated feature map, or pad the input with zeros to preserve its size. When the image is padded with zeros, the corresponding Jacobian is a Toeplitz matrix. Another version of convolution treats the input as if it were a torus: when the convolution operation calls for a pixel off the right end of the image, the layer "wraps around" to take it from the left edge, and similarly for the other edges. For this version of convolution, the Jacobian is a circulant matrix. The quality of the approximation between Toeplitz and circulant convolutions has been analyzed for 1D (Gray, 2005) and 2D signals (Zhu & Wakin, 2017). For the 2D case (as in the 1D case), an O(1/p) bound on the error is obtained, where p × p is the size of both the Toeplitz and the circulant matrix. Consequently, theoretical analysis of convolutions that wrap around the input (i.e., using circulant matrices) has become standard, and this is the case we analyze in this work. Furthermore, we assume a stride of 1 in both the horizontal and vertical directions. The output Y produced by applying the filter L to an input X is of size c_out × n × n, and the corresponding Jacobian J is a matrix of size n²c_out × n²c_in satisfying vec(Y) = J vec(X). Our goal is to bound the spectral norm of the Jacobian of the convolution operation, i.e., ‖J‖₂. Sedghi et al. (2019) derive an expression for the exact singular values of the Jacobian of convolution layers.
However, their method requires computing the spectral norms of n² matrices (each of size c_out × c_in) for every convolution layer. We extend their result to derive a differentiable and easy-to-compute upper bound on the singular values, stated in the following theorem:

Theorem 1. Consider a convolution filter L of size c_out × c_in × h × w that, applied to an input X of size c_in × n × n, gives an output Y of size c_out × n × n. The Jacobian of Y with respect to X (i.e., J) is a matrix of size n²c_out × n²c_in. The spectral norm of J is bounded by

$\|J\|_2 \le \sqrt{hw}\,\min\left(\|R\|_2, \|S\|_2, \|T\|_2, \|U\|_2\right),$

where the matrices R, S, T and U are defined as follows:

$R = \begin{bmatrix} L_{:,:,0,0} & L_{:,:,0,1} & \cdots & L_{:,:,0,w-1} \\ L_{:,:,1,0} & L_{:,:,1,1} & \cdots & L_{:,:,1,w-1} \\ \vdots & \vdots & \ddots & \vdots \\ L_{:,:,h-1,0} & L_{:,:,h-1,1} & \cdots & L_{:,:,h-1,w-1} \end{bmatrix}, \quad S = \begin{bmatrix} L_{:,:,0,0} & L_{:,:,1,0} & \cdots & L_{:,:,h-1,0} \\ L_{:,:,0,1} & L_{:,:,1,1} & \cdots & L_{:,:,h-1,1} \\ \vdots & \vdots & \ddots & \vdots \\ L_{:,:,0,w-1} & L_{:,:,1,w-1} & \cdots & L_{:,:,h-1,w-1} \end{bmatrix}$

$T = \begin{bmatrix} A_0 & A_1 & \cdots & A_{h-1} \end{bmatrix}, \quad \text{where } A_l = \begin{bmatrix} L_{:,:,l,0} & L_{:,:,l,1} & \cdots & L_{:,:,l,w-1} \end{bmatrix}$

$U^T = \begin{bmatrix} B_0^T & B_1^T & \cdots & B_{h-1}^T \end{bmatrix}, \quad \text{where } B_l^T = \begin{bmatrix} L_{:,:,l,0}^T & L_{:,:,l,1}^T & \cdots & L_{:,:,l,w-1}^T \end{bmatrix}$

The proof of Theorem 1 is in Appendix E. The matrices R, S, T and U have dimensions c_out·h × c_in·w, c_out·w × c_in·h, c_out × c_in·hw and c_out·hw × c_in respectively. In the current literature (Miyato et al., 2018), the heuristic used for estimating the spectral norm combines the dimensions h, w and c_in of the filter L to create the matrix T of dimensions c_out × hwc_in; the norm of the resulting matrix is used as a heuristic estimate of the spectral norm of the Jacobian of the convolution operator. In Theorem 1, however, we show that the norm of this matrix multiplied by a factor of √hw gives a provable upper bound on the singular values of the Jacobian. In Tables 1 and 3, we show that for different convolution filters there can be significant differences between the four bounds, and any of the four can be the minimum.
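As we read the definitions above, each of R, S, T and U is a pure axis permutation and reshape of the filter tensor. The following numpy sketch (variable and function names ours) computes the bound:

```python
import numpy as np

def fantastic_four_bound(L):
    """sqrt(hw) * min(||R||_2, ||S||_2, ||T||_2, ||U||_2) as in Theorem 1.

    L has shape (c_out, c_in, h, w); each reshape below arranges the
    c_out x c_in slices L[:, :, i, j] as described in the theorem.
    """
    c_out, c_in, h, w = L.shape
    R = L.transpose(2, 0, 3, 1).reshape(h * c_out, w * c_in)
    S = L.transpose(3, 0, 2, 1).reshape(w * c_out, h * c_in)
    T = L.transpose(0, 2, 3, 1).reshape(c_out, h * w * c_in)
    U = L.transpose(2, 3, 0, 1).reshape(h * w * c_out, c_in)
    norms = [np.linalg.norm(M, 2) for M in (R, S, T, U)]  # spectral norms
    return np.sqrt(h * w) * min(norms), norms

L = np.random.default_rng(0).standard_normal((16, 8, 3, 3))
bound, norms = fantastic_four_bound(L)
```

During training, one would track each norm with a single power-iteration step per update (Section 3.1) rather than calling a full SVD as done here.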

3.1. TIGHTNESS ANALYSIS

In Appendix F, we show that the bound is exact for a convolution filter with h = w = 1:

Lemma 1. For h = 1, w = 1, the bounds in Theorem 1 are exact, i.e., ‖J‖₂ = ‖R‖₂ = ‖S‖₂ = ‖T‖₂ = ‖U‖₂.

In Table 3, we analyze the tightness of our bound against the exact largest singular value computed by Sedghi et al. (2019) for different filter shapes. Each convolution filter was constructed by sampling from a standard normal distribution N(0, 1). We observe that the bound is tight for small filter sizes, but the ratio (Bound/Exact) becomes large for large h and w. We also observe that the values of the four bounds can be significantly different, and that taking the minimum of the four quantities yields a significantly improved bound. In Appendix Section G, Figure 1, we empirically observe that the gap between our upper bound and the exact value becomes very small when our bound is added as a regularizer during training.

Suppose the minimum of the four quantities is achieved by ‖R‖₂ (the other cases are analogous), and let u and v be the singular vectors corresponding to the largest singular value of R. Then, the derivative of our upper bound √hw ‖R‖₂ with respect to R can be computed as

$\nabla_R \left(\sqrt{hw}\, \|R\|_2\right) = \sqrt{hw}\, u v^T, \quad \text{where } \|R\|_2 = u^T R v.$

Moreover, since the weights do not change significantly during training, we can use one iteration of the power method to update u, v and ‖R‖₂ during each training step (similar to Miyato et al. (2018; 2017)). This allows us to use our bound efficiently as a regularizer during the training process.
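The gradient formula can be sanity-checked with a finite-difference probe (a sketch; here R is just a random stand-in for the reshaped filter):

```python
import numpy as np

rng = np.random.default_rng(0)
h, w = 3, 3
R = rng.standard_normal((12, 9))            # stand-in for the reshaped filter
U, s, Vt = np.linalg.svd(R)
u, v = U[:, 0], Vt[0]
grad = np.sqrt(h * w) * np.outer(u, v)      # claimed gradient of sqrt(hw)*||R||_2

D = rng.standard_normal(R.shape)            # random perturbation direction
eps = 1e-6
fd = np.sqrt(h * w) * (np.linalg.norm(R + eps * D, 2)
                       - np.linalg.norm(R - eps * D, 2)) / (2 * eps)
directional = float(np.sum(grad * D))       # <grad, D> should match fd
```

The agreement of `directional` with the central difference `fd` confirms that √hw·uvᵀ is the gradient whenever the top singular value is simple.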

4. EXPERIMENTS

All experiments were conducted using a single NVIDIA GeForce RTX 2080 Ti GPU.

4.1. COMPARISON WITH EXISTING METHODS

In Table 1, we show a comparison between the exact spectral norms (computed using Sedghi et al. (2019) and Ryu et al. (2019)) and our upper bound, i.e., √hw·min(‖R‖₂, ‖S‖₂, ‖T‖₂, ‖U‖₂), on a pretrained Resnet-18 network (He et al., 2015). Except for the first layer, we observe that our bound is within 1.5 times the value of the exact spectral norm while being significantly faster to compute. Similar results can be observed in Table 3 for a standard Gaussian filter. Thus, by taking the minimum of the four bounds, we obtain a significant gain.

4.2. EFFECTS ON GENERALIZATION

In Table 4a, we study the effect of using our proposed bound as a training regularizer on the generalization error. We use a Resnet-32 neural network architecture and the CIFAR-10 dataset (Krizhevsky, 2009) for training. For regularization, we add the sum of the spectral bounds of all layers of the network to the training objective. Thus, our regularized objective function is given as follows:

$\min_\theta \; \mathbb{E}_{(x,y)}\left[\ell(f_\theta(x), y)\right] + \beta \sum_I u^{(I)} \qquad (1)$

where β is the regularization coefficient, the (x, y)'s are the input-label pairs in the training data, and u^(I) denotes the bound for the I-th convolution or fully connected layer. For convolution layers, u^(I) is computed as √hw·min(‖R‖₂, ‖S‖₂, ‖T‖₂, ‖U‖₂) using Theorem 1; for fully connected layers, u^(I) can be computed using power iteration (Miyato et al., 2018). Since weight decay (Krogh & Hertz, 1991) indirectly minimizes the squared Frobenius norm, which equals the sum of squares of the singular values, it implicitly forces the largest singular value of each layer (i.e., the spectral norm) to be small. Therefore, to measure the effect of our regularizer on test set accuracy, we compare the effect of adding our regularizer both with and without weight decay. The weight decay coefficient was selected by grid search over 20 values in [0, 2 × 10⁻³] using a held-out validation set of 5000 samples. Our experimental results are reported in Table 4a. Without weight decay, we observe an improvement of 0.97% over the case β = 0. With a weight decay of 10⁻⁴ during training, there is an improvement of 0.73% over the baseline. Using the method of Sedghi et al. (2019) (i.e., clipping the singular values above 0.5 and 1) results in gains of 0.77% (with weight decay) and 1.05% (without weight decay), similar to the results reported in their paper. In addition to obtaining performance gains on par with the exact method of Sedghi et al. (2019), a key advantage of our approach is its very efficient running time, allowing its use for very large input sizes and deep networks. We report the training times of these methods in Table 4c for a dataset with large input sizes. The increase in training time using our method compared to standard training is just 9.7%, while that for Sedghi et al. (2019) is around 215.6%.

4.3. EFFECTS ON PROVABLE ADVERSARIAL ROBUSTNESS

In this part, we show the usefulness of our spectral bound in enhancing the provable robustness of convolutional classifiers against adversarial attacks (Szegedy et al., 2014). A robustness certificate is a lower bound on the minimum distance from a given input to the decision boundary of the classifier. For any perturbation of the input with magnitude smaller than the robustness certificate value, the classification output provably remains the same. However, computing exact robustness certificates requires solving a non-convex optimization, making the computation difficult for deep classifiers. In the last couple of years, several certifiable defenses against adversarial attacks have been proposed (e.g., Levine & Feizi (2020)). In particular, to show the usefulness of our spectral bound in this application, we examine the method proposed in Singla & Feizi (2020), which uses a bound on the Lipschitz constant of the network gradient (i.e., on the curvature values of the network). Their robustness certification method in fact depends on the spectral norms of the Jacobians of different network layers, making it a perfect case study for our spectral bound (details in Appendix Section C). Due to the difficulty of computing ‖J^(I)‖₂ when the I-th layer is a convolution, Singla & Feizi (2020) restrict their analysis to fully connected networks, where J^(I) simply equals the I-th layer weight matrix. Using our results in Theorem 1, however, we can bound ‖J^(I)‖₂ and run similar experiments for convolution layers. Thus, for a 2-layer convolutional network, our regularized objective function is given as follows:

$\min_\theta \; \mathbb{E}_{(x,y)}\left[\ell(f_\theta(x), y)\right] + \gamma\, b \left(\max_j \left|W^{(2)}_{y,j} - W^{(2)}_{t,j}\right|\right) \left(u^{(1)}\right)^2$

where γ is the regularization coefficient, b is a bound on the second derivative of the activation function (|σ″(·)| ≤ b), the (x, y)'s are the input-label pairs in the training data, t is the attack target class, and u^(I) denotes the bound on the spectral norm of the I-th linear (convolution/fully connected) layer.
For the convolution layer, u^(1) is again computed as √hw·min(‖R‖₂, ‖S‖₂, ‖T‖₂, ‖U‖₂) using Theorem 1. In Table 5, we study the effect of the regularization coefficient γ on provable robustness when the network is trained with curvature regularization (Singla & Feizi, 2020). We use a 2-layer convolutional neural network with the tanh activation function (Dugas et al., 2000) and 5 filters in the convolution layer. We observe that the method of Singla & Feizi (2020), coupled with our bound, gives significantly higher robustness certificates than CNN-Cert, the previous state of the art (Boopathy et al., 2018). In the Appendix, we additionally report results for inputs with robustness certificates (computed using Singla & Feizi (2020)) greater than 0.5.
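For concreteness, the curvature regularization term above can be evaluated for toy shapes as follows (all sizes and values here are illustrative, not from the paper's experiments; for tanh, |φ″| ≤ 4/(3√3) ≈ 0.77):

```python
import numpy as np

rng = np.random.default_rng(0)
W2 = rng.standard_normal((10, 20))   # second-layer weights: classes x hidden units
u1 = 3.7                             # illustrative spectral bound from Theorem 1
b = 4 / (3 * np.sqrt(3))             # bound on |tanh''| (~0.77)
y, t = 2, 5                          # true label and attack target

# curvature regularization term (the factor gamma is left out)
K = b * np.max(np.abs(W2[y] - W2[t])) * u1 ** 2
```

Since u1 enters squared, tightening the spectral bound of the convolution layer shrinks this curvature penalty quadratically.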

5. CONCLUSION

In this paper, we derive four efficient and differentiable bounds on the spectral norm of convolution layers and take their minimum as a single tight spectral bound. This bound significantly improves over the popular heuristic method of Miyato et al. (2017; 2018), for which we provide the first provable guarantee. Compared to the exact methods of Sedghi et al. (2019) and Ryu et al. (2019), our bound is significantly more efficient to compute, making it amenable to large-scale problems. Over various filter sizes, we empirically observe that the gap between our bound and the true spectral norm is small. Using experiments on MNIST and CIFAR-10, we demonstrate the usefulness of our spectral bound in enhancing generalization as well as provable adversarial robustness of convolutional classifiers.

A NOTATION

For a vector v, we use v_j to denote the element in the j-th position of the vector. We use A_{j,:} to denote the j-th row of the matrix A and A_{:,k} to denote its k-th column. We take both A_{j,:} and A_{:,k} to be column vectors (thus A_{j,:} is constructed by taking the transpose of the j-th row of A). A_{j,k} denotes the element in the j-th row and k-th column of A. A_{j,:k} and A_{:j,k} denote the vectors containing the first k elements of the j-th row and the first j elements of the k-th column, respectively. A_{:j,:k} denotes the matrix containing the first j rows and first k columns of A:

$A_{j,:k} = \begin{bmatrix} A_{j,0} \\ A_{j,1} \\ \vdots \\ A_{j,k-1} \end{bmatrix}, \quad A_{:j,k} = \begin{bmatrix} A_{0,k} \\ A_{1,k} \\ \vdots \\ A_{j-1,k} \end{bmatrix}, \quad A_{:j,:k} = \begin{bmatrix} A_{0,0} & A_{0,1} & \cdots & A_{0,k-1} \\ A_{1,0} & A_{1,1} & \cdots & A_{1,k-1} \\ \vdots & \vdots & \ddots & \vdots \\ A_{j-1,0} & A_{j-1,1} & \cdots & A_{j-1,k-1} \end{bmatrix}$

The same rules extend directly to higher-order tensors. For n ∈ ℕ, we use [n] to denote the set {0, …, n} and [p, q] (p < q) to denote the set {p, p+1, …, q}. We index the rows and columns of matrices by elements of [n], i.e., numbering from 0. Addition of row and column indices is done mod n unless otherwise indicated. For a matrix A ∈ ℝ^{q×r} and a tensor B ∈ ℝ^{p×q×r}, vec(A) denotes the vector constructed by stacking the rows of A, and vec(B) denotes the vector constructed by stacking the vectors vec(B_{j,:,:}), j ∈ [p−1]:

$\mathrm{vec}(A) = \begin{bmatrix} A_{0,:} \\ A_{1,:} \\ \vdots \\ A_{q-1,:} \end{bmatrix}, \quad \mathrm{vec}(B) = \begin{bmatrix} \mathrm{vec}(B_{0,:,:}) \\ \mathrm{vec}(B_{1,:,:}) \\ \vdots \\ \mathrm{vec}(B_{p-1,:,:}) \end{bmatrix}$

For a given vector v ∈ ℝⁿ, circ(v) denotes the n × n circulant matrix constructed from v, i.e., the rows of circ(v) are circular shifts of v.
For a matrix A ∈ ℝ^{n×n}, circ(A) denotes the n² × n² doubly block circulant matrix constructed from A, i.e., each n × n block of circ(A) is a circulant matrix constructed from a row A_{j,:}, j ∈ [n−1]:

$\mathrm{circ}(v) = \begin{bmatrix} v_0 & v_1 & \cdots & v_{n-1} \\ v_{n-1} & v_0 & \cdots & v_{n-2} \\ \vdots & \vdots & \ddots & \vdots \\ v_1 & v_2 & \cdots & v_0 \end{bmatrix}, \quad \mathrm{circ}(A) = \begin{bmatrix} \mathrm{circ}(A_{0,:}) & \mathrm{circ}(A_{1,:}) & \cdots & \mathrm{circ}(A_{n-1,:}) \\ \mathrm{circ}(A_{n-1,:}) & \mathrm{circ}(A_{0,:}) & \cdots & \mathrm{circ}(A_{n-2,:}) \\ \vdots & \vdots & \ddots & \vdots \\ \mathrm{circ}(A_{1,:}) & \mathrm{circ}(A_{2,:}) & \cdots & \mathrm{circ}(A_{0,:}) \end{bmatrix}$

We use F to denote the Fourier matrix of dimensions n × n, i.e., F_{j,k} = ω^{jk} with ω = e^{−2πi/n}, i² = −1. For a matrix A, σ(A) denotes the set of singular values of A; σ_max(A) and σ_min(A) denote the largest and smallest singular values of A, respectively. A ⊗ B denotes the Kronecker product of A and B. We use A ⊙ B to denote the Hadamard product of two matrices (or vectors) of the same size. We use I_n to denote the identity matrix of dimensions n × n. We use the following notation for a convolutional neural network. L denotes the number of layers and φ denotes the activation function. For an input x, we use z^(I)(x) ∈ ℝ^{N_I} and a^(I)(x) ∈ ℝ^{N_I} to denote the raw (before applying φ) and activated (after applying φ) neurons in the I-th hidden layer of the network, respectively. Thus a^(0) denotes the input image x. To simplify notation, and when no confusion arises, we make the dependency of z^(I) and a^(I) on x implicit. φ′(z^(I)) and φ″(z^(I)) denote the elementwise first and second derivatives of φ at z^(I). W^(I) denotes the weights of the I-th layer, i.e., W^(I) is a tensor for a convolution layer and a matrix for a fully connected layer. J^(I) denotes the Jacobian matrix of vec(z^(I)) with respect to the input x. θ denotes the neural network parameters. f_θ(x) denotes the softmax probabilities output by the network for an input x. For an input x and label y, the cross-entropy loss is denoted by ℓ(f_θ(x), y).
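The circ(·) constructions above can be reproduced directly with np.roll (a sketch; the function names are ours):

```python
import numpy as np

def circ(v):
    """n x n circulant matrix: row i is v cyclically shifted right by i."""
    n = len(v)
    return np.stack([np.roll(v, i) for i in range(n)])

def dcirc(A):
    """n^2 x n^2 doubly block circulant matrix: block (i, j) is
    circ(A[(j - i) mod n]), matching the definition above."""
    n = A.shape[0]
    rows = [circ(A[k]) for k in range(n)]
    return np.block([[rows[(j - i) % n] for j in range(n)] for i in range(n)])
```

For example, `circ([1, 2, 3])` has rows [1, 2, 3], [3, 1, 2], [2, 3, 1], matching the displayed pattern.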

B SEDGHI ET AL'S METHOD

Consider an input X ∈ ℝ^{c_in×n×n} and a convolution filter L ∈ ℝ^{c_out×c_in×h×w} applied to X, with n > max(h, w). Using L, we construct a new filter K ∈ ℝ^{c_out×c_in×n×n} by padding with zeros:

$K_{c,d,k,l} = \begin{cases} L_{c,d,k,l}, & k \in [h-1],\ l \in [w-1] \\ 0, & \text{otherwise} \end{cases}$

The Jacobian matrix J of dimensions c_out n² × c_in n² is given as follows:

$J = \begin{bmatrix} B^{(0,0)} & B^{(0,1)} & \cdots & B^{(0,c_{in}-1)} \\ B^{(1,0)} & B^{(1,1)} & \cdots & B^{(1,c_{in}-1)} \\ \vdots & \vdots & \ddots & \vdots \\ B^{(c_{out}-1,0)} & B^{(c_{out}-1,1)} & \cdots & B^{(c_{out}-1,c_{in}-1)} \end{bmatrix}, \quad \text{where } B^{(c,d)} = \mathrm{circ}(K_{c,d,:,:})$

By inspection, we can see that vec(Y) = J vec(X). From Sedghi et al. (2019), we know that the singular values of J are given by

$\sigma(J) = \bigcup_{j \in [n-1],\, k \in [n-1]} \sigma\left(G^{(j,k)}\right)$

where each G^{(j,k)} is a matrix of dimensions c_out × c_in whose elements are given by

$G^{(j,k)}_{c,d} = \left(F^T K_{c,d,:,:} F\right)_{j,k}, \quad c \in [c_{out}-1],\ d \in [c_{in}-1]$

The largest singular value of J, i.e., ‖J‖₂, can then be computed as follows:

$\|J\|_2 = \sigma_{\max}(J) = \max_{j \in [n-1],\, k \in [n-1]} \sigma_{\max}\left(G^{(j,k)}\right)$

C SECOND-ORDER ROBUSTNESS

C.1 ROBUSTNESS CERTIFICATION

Given an input x^(0) ∈ ℝ^D and a binary classifier f (here f : ℝ^D → ℝ is differentiable everywhere), our goal is to find a lower bound on the distance to the decision boundary f(x) = 0. The primal problem is given as follows:

$p^*_{cert} = \min_x \max_{\eta}\ \frac{1}{2}\left\|x - x^{(0)}\right\|_2^2 + \eta f(x)$

The dual of the above optimization is given as follows:

$d^*_{cert} = \max_{\eta} \min_x\ \frac{1}{2}\left\|x - x^{(0)}\right\|_2^2 + \eta f(x)$

The inner minimization can be solved exactly when the function ½‖x − x^(0)‖₂² + ηf(x) is strictly convex, i.e., has a positive definite Hessian. This condition can be written as:

$\nabla^2_x \left[\frac{1}{2}\left\|x - x^{(0)}\right\|_2^2 + \eta f(x)\right] = I + \eta \nabla^2_x f \succ 0$

It is easy to verify that if the eigenvalues of the Hessian of f, i.e., ∇²ₓf, are bounded between m and M for all x ∈ ℝ^D, i.e.,

$m I \preccurlyeq \nabla^2_x f \preccurlyeq M I, \quad \forall\, x \in \mathbb{R}^D,$

then the Hessian I + η∇²ₓf is positive definite for −1/M < η < −1/m. We refer the interested reader to Singla & Feizi (2020) for more details and a formal proof of this theorem.
Thus the inner minimization can be solved exactly for this set of values of η, resulting in the following optimization:

q*_cert = max_{-1/M ≤ η ≤ -1/m} min_x (1/2) ‖x - x^(0)‖₂² + η f(x)

Note that q*_cert can be solved exactly using convex optimization. Thus, we get the following chain of inequalities:

q*_cert ≤ d*_cert ≤ p*_cert

Solving q*_cert therefore gives a robustness certificate. The resulting certificate is called the Curvature-based Robustness Certificate (CRC).
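To make the chain of inequalities concrete, the following sketch computes q*_cert for a toy quadratic "classifier" whose Hessian is constant, so its eigenvalue bounds m and M are exact. The function f, the point x^(0), and the grid search over η are illustrative assumptions, not the paper's procedure:

```python
import numpy as np

# Toy instance: f(x) = 0.5 x^T A x + b^T x + c has constant Hessian A,
# so its curvature bounds m, M are exact. A, b, c, x0 are illustrative.
A = np.diag([1.0, -0.5])      # Hessian eigenvalues in [m, M] = [-0.5, 1.0]
b = np.array([1.0, 0.0])
c = -2.0
x0 = np.zeros(2)
m, M = -0.5, 1.0

def f(x):
    return 0.5 * x @ A @ x + b @ x + c

def dual_value(eta):
    # For -1/M < eta < -1/m, I + eta*A is positive definite, so the inner
    # minimisation of 0.5*||x - x0||^2 + eta*f(x) has a unique closed-form
    # minimiser (set the gradient to zero and solve the linear system).
    x_star = np.linalg.solve(np.eye(2) + eta * A, x0 - eta * b)
    return 0.5 * np.sum((x_star - x0) ** 2) + eta * f(x_star)

# q*_cert: maximise the concave dual over the admissible interval for eta
# (a simple grid search stands in for a proper 1D convex solver).
etas = np.linspace(-1 / M + 1e-2, -1 / m - 1e-2, 4000)
q_cert = max(dual_value(eta) for eta in etas)

# p*_cert: for this f the closest boundary point is (sqrt(5) - 1, 0), so the
# primal value is 0.5 * (sqrt(5) - 1)^2; weak duality gives q* <= p*.
p_cert = 0.5 * (np.sqrt(5.0) - 1.0) ** 2
```

Since f here is quadratic, the certificate is essentially tight; for a general network, m and M must come from a curvature bound such as the one in Singla & Feizi (2020).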

C.2 ADVERSARIAL TRAINING WITH CURVATURE REGULARIZATION

Let K(θ, y, t) denote the upper bound on the magnitude of the curvature values for label y and target t (computed using Singla & Feizi (2020)), and let ℓ denote the training loss. The objective function for training with curvature regularization is given by:

min_θ E_{(x,y)} [ ℓ(f_θ(x), y) + γ K(θ, y, t) ]

The objective function for adversarial training with curvature regularization is given by:

min_θ E_{(x,y)} [ ℓ(f_θ(x^(attack)), y) + γ K(θ, y, t) ]

where x^(attack) is computed by performing an adversarial attack on the input x, either using l₂-bounded PGD or the Curvature-based Attack Optimization defined in Singla & Feizi (2020).

D DIFFICULTY OF COMPUTING THE GRADIENT OF CONVOLUTION

Consider a 1D circular convolution filter of size 3 given by [1, 2, -1], applied to an array of size 5. The corresponding Jacobian matrix is given by:

J = [  1  2 -1  0  0 ;
       0  1  2 -1  0 ;
       0  0  1  2 -1 ;
      -1  0  0  1  2 ;
       2 -1  0  0  1 ]

The largest singular value ‖J‖₂ and the corresponding left and right singular vectors u, v are given by:

‖J‖₂ = 2.76008,
u = [-0.55614, 0.11457, 0.62695, 0.27290, -0.45829]^T,
v = [-0.63245, -0.19544, 0.51167, 0.51167, -0.19544]^T.

The gradient of ‖J‖₂ with respect to J is the rank-one matrix:

∇_J ‖J‖₂ = u v^T

Clearly, this gradient contains non-zero values at entries of J that are always zero, and unequal values at entries of J that are always equal (due to weight sharing).
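This example can be reproduced numerically. A minimal sketch assuming NumPy (the signs of the singular vectors returned by the SVD may differ from those above, which does not affect the gradient u v^T):

```python
import numpy as np

# Build the 5x5 Jacobian of the circular 1D convolution with filter [1, 2, -1].
filt = [1.0, 2.0, -1.0]
n = 5
J = np.zeros((n, n))
for i in range(n):
    for k, coeff in enumerate(filt):
        J[i, (i + k) % n] = coeff

# Largest singular value and the corresponding singular vectors.
U, S, Vt = np.linalg.svd(J)
sigma, u, v = S[0], U[:, 0], Vt[0]

# Gradient of the largest singular value with respect to the full matrix J.
grad = np.outer(u, v)

print(round(sigma, 5))   # 2.76008
```

Inspecting `grad` shows non-zero entries at the structural zeros of J and distinct values on its (tied) diagonal, which is exactly the difficulty described above.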

E PROOF OF THEOREM 1

Proof. Consider an input X ∈ R^{c_in × n × n} and a convolution filter L ∈ R^{c_out × c_in × h × w} applied to X such that n > max(h, w). Using L, we construct a new filter K ∈ R^{c_out × c_in × n × n} by padding with zeros:

K_{c,d,k,l} = L_{c,d,k,l} if k ∈ [h-1] and l ∈ [w-1], and K_{c,d,k,l} = 0 otherwise. (2)

The output Y ∈ R^{c_out × n × n} of the convolution operation is given by:

Y_{c,r,s} = Σ_{d=0}^{c_in-1} Σ_{k=0}^{n-1} Σ_{l=0}^{n-1} X_{d,r+k,s+l} K_{c,d,k,l}

We construct a matrix J of dimensions c_out n² × c_in n² as the block matrix

J = [ B^{(0,0)} B^{(0,1)} ⋯ B^{(0,c_in-1)} ;
      B^{(1,0)} B^{(1,1)} ⋯ B^{(1,c_in-1)} ;
      ⋮ ⋮ ⋱ ⋮ ;
      B^{(c_out-1,0)} B^{(c_out-1,1)} ⋯ B^{(c_out-1,c_in-1)} ]

where B^{(c,d)} = circ(K_{c,d,:,:}). By inspection we can see that vec(Y) = J vec(X). This directly implies that J is the Jacobian of vec(Y) with respect to vec(X), and our goal is to find a differentiable upper bound on the largest singular value of J. From Sedghi et al. (2019), we know that the singular values of J are given by:

σ(J) = ⋃_{j ∈ [n-1], k ∈ [n-1]} σ(G^{(j,k)}) (3)

where each G^{(j,k)} is a matrix of dimensions c_out × c_in whose entries are given by:

G^{(j,k)}_{c,d} = (F^T K_{c,d,:,:} F)_{j,k},  c ∈ [c_out-1], d ∈ [c_in-1] (4)

Using equation 3, we can directly observe that a differentiable upper bound on the largest singular value of G^{(j,k)} that is independent of j and k gives the desired upper bound. We derive such a bound next. Using equation 4, we can rewrite G^{(j,k)}_{c,d} as:

G^{(j,k)}_{c,d} = (F^T K_{c,d,:,:} F)_{j,k} = (F_{:,j})^T K_{c,d,:,:} F_{:,k}

Using equation 2, we know that K_{c,d,:,:} is a sparse n × n matrix whose only non-zero entries lie in the top-left h × w block. We therefore take the first h rows and w columns of K_{c,d,:,:}, i.e. K_{c,d,:h,:w}, the first h elements of F_{:,j}, i.e. F_{:h,j}, and the first w elements of F_{:,k}, i.e. F_{:w,k}.
We then have the following chain of equalities:

G^{(j,k)}_{c,d} = (F_{:,j})^T K_{c,d,:,:} F_{:,k} = (F_{:h,j})^T K_{c,d,:h,:w} F_{:w,k} = (F_{:h,j})^T L_{c,d,:,:} F_{:w,k} (6)

Thus, G^{(j,k)} can be written as follows:

G^{(j,k)}_{c,d} = Σ_{l=0}^{h-1} Σ_{m=0}^{w-1} F_{l,j} L_{c,d,l,m} F_{m,k},  i.e.  G^{(j,k)} = Σ_{l=0}^{h-1} Σ_{m=0}^{w-1} F_{l,j} L_{:,:,l,m} F_{m,k} (7)

Now consider a block matrix R of dimensions c_out h × c_in w given by:

R = [ L_{:,:,0,0} L_{:,:,0,1} ⋯ L_{:,:,0,w-1} ;
      L_{:,:,1,0} L_{:,:,1,1} ⋯ L_{:,:,1,w-1} ;
      ⋮ ⋮ ⋱ ⋮ ;
      L_{:,:,h-1,0} L_{:,:,h-1,1} ⋯ L_{:,:,h-1,w-1} ] (8)

Thus the block in the l-th row and m-th column of R is the c_out × c_in matrix L_{:,:,l,m}. Consider the two matrices (F_{:h,j})^T ⊗ I_{c_out} (of dimensions c_out × c_out h) and F_{:w,k} ⊗ I_{c_in} (of dimensions c_in w × c_in). Using equation 7 and equation 8, we can see that:

G^{(j,k)} = ((F_{:h,j})^T ⊗ I_{c_out}) R (F_{:w,k} ⊗ I_{c_in})

Taking the spectral norm of both sides:

‖G^{(j,k)}‖₂ = ‖((F_{:h,j})^T ⊗ I_{c_out}) R (F_{:w,k} ⊗ I_{c_in})‖₂ ≤ ‖(F_{:h,j})^T ⊗ I_{c_out}‖₂ ‖R‖₂ ‖F_{:w,k} ⊗ I_{c_in}‖₂

Using ‖A ⊗ B‖₂ = ‖A‖₂ ‖B‖₂, and since both ‖I_{c_out}‖₂ and ‖I_{c_in}‖₂ are 1:

‖G^{(j,k)}‖₂ ≤ ‖F_{:h,j}‖₂ ‖R‖₂ ‖F_{:w,k}‖₂

Further note that since F_{j,k} = ω^{jk} with |ω| = 1, we have ‖F_{:h,j}‖₂ = √h and ‖F_{:w,k}‖₂ = √w. Therefore:

‖G^{(j,k)}‖₂ ≤ √hw ‖R‖₂ (9)

Alternative inequality (1): Note that equation 6 can also be written by taking the transpose of the scalar (F_{:h,j})^T L_{c,d,:,:} F_{:w,k}:

G^{(j,k)}_{c,d} = (F_{:h,j})^T L_{c,d,:,:} F_{:w,k} = (F_{:w,k})^T (L_{c,d,:,:})^T F_{:h,j}

Thus, G^{(j,k)} can alternatively be written as follows:

G^{(j,k)}_{c,d} = Σ_{l=0}^{w-1} Σ_{m=0}^{h-1} F_{l,k} L_{c,d,m,l} F_{m,j},  i.e.  G^{(j,k)} = Σ_{l=0}^{w-1} Σ_{m=0}^{h-1} F_{l,k} L_{:,:,m,l} F_{m,j} (10)

Now consider a block matrix S of dimensions c_out w × c_in h given by:

S = [ L_{:,:,0,0} L_{:,:,1,0} ⋯ L_{:,:,h-1,0} ;
      L_{:,:,0,1} L_{:,:,1,1} ⋯ L_{:,:,h-1,1} ;
      ⋮ ⋮ ⋱ ⋮ ;
      L_{:,:,0,w-1} L_{:,:,1,w-1} ⋯ L_{:,:,h-1,w-1} ] (11)

For S, the block in the l-th row and m-th column is the c_out × c_in matrix L_{:,:,m,l}. Consider the two matrices (F_{:w,k})^T ⊗ I_{c_out} (of dimensions c_out × c_out w) and F_{:h,j} ⊗ I_{c_in} (of dimensions c_in h × c_in).
Using equation 10 and equation 11, we again have:

G^{(j,k)} = ((F_{:w,k})^T ⊗ I_{c_out}) S (F_{:h,j} ⊗ I_{c_in})

Taking the spectral norm of both sides and using the same procedure as for R, we get the following inequality:

‖G^{(j,k)}‖₂ ≤ √hw ‖S‖₂ (12)

Alternative inequality (2): Using equation 7, G^{(j,k)} can alternatively be written as follows:

G^{(j,k)}_{c,d} = Σ_{l=0}^{h-1} Σ_{m=0}^{w-1} L_{c,d,l,m} F_{l,j} F_{m,k},  i.e.  G^{(j,k)} = Σ_{l=0}^{h-1} Σ_{m=0}^{w-1} L_{:,:,l,m} F_{l,j} F_{m,k} (13)

Now consider a block matrix T of dimensions c_out × c_in hw given by:

T = [A_0 A_1 ⋯ A_{h-1}] (14)

where each block A_l is a matrix of dimensions c_out × c_in w given as follows:

A_l = [L_{:,:,l,0} L_{:,:,l,1} ⋯ L_{:,:,l,w-1}] (15)

For T, the block in the l-th column is the c_out × c_in w matrix A_l; for A_l, the block in the m-th column is the c_out × c_in matrix L_{:,:,l,m}. Consider the matrix F_{:h,j} ⊗ F_{:w,k} ⊗ I_{c_in} (of dimensions c_in hw × c_in). Using equation 13, equation 14 and equation 15, we again have:

G^{(j,k)} = T (F_{:h,j} ⊗ F_{:w,k} ⊗ I_{c_in})

Taking the spectral norm of both sides and using the same procedure as for R and S, we get the following inequality:

‖G^{(j,k)}‖₂ ≤ √hw ‖T‖₂ (16)

Alternative inequality (3): Using equation 7, G^{(j,k)} can alternatively be written as follows:

G^{(j,k)}_{c,d} = Σ_{l=0}^{h-1} Σ_{m=0}^{w-1} F_{l,j} F_{m,k} L_{c,d,l,m},  i.e.  G^{(j,k)} = Σ_{l=0}^{h-1} Σ_{m=0}^{w-1} F_{l,j} F_{m,k} L_{:,:,l,m} (17)

Now consider a block matrix U of dimensions c_out hw × c_in given by:

U = [ B_0 ; B_1 ; ⋮ ; B_{h-1} ] (18)

where each block B_l is a matrix of dimensions c_out w × c_in given as follows:

B_l = [ L_{:,:,l,0} ; L_{:,:,l,1} ; ⋮ ; L_{:,:,l,w-1} ] (19)

For U, the block in the l-th row is the c_out w × c_in matrix B_l; for B_l, the block in the m-th row is the c_out × c_in matrix L_{:,:,l,m}. Consider the matrix (F_{:h,j})^T ⊗ (F_{:w,k})^T ⊗ I_{c_out} (of dimensions c_out × c_out hw).
Using equation 17, equation 18 and equation 19, we again have:

G^{(j,k)} = ((F_{:h,j})^T ⊗ (F_{:w,k})^T ⊗ I_{c_out}) U

Taking the spectral norm of both sides and using the same procedure as for R, S and T, we get the following inequality:

‖G^{(j,k)}‖₂ ≤ √hw ‖U‖₂ (20)

Taking the minimum of the four bounds, i.e. √hw min(‖R‖₂, ‖S‖₂, ‖T‖₂, ‖U‖₂), we have the stated result. ∎
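The four reshapings and the resulting bounds can be checked numerically against the exact spectral norm from the FFT-based characterization. A sketch assuming NumPy; the reshape index orders follow equations 8, 11, 14-15 and 18-19:

```python
import numpy as np

def four_bounds(L):
    # sqrt(hw) * ||R||_2, ||S||_2, ||T||_2, ||U||_2 via reshapes of the
    # filter L of shape (c_out, c_in, h, w).
    c_out, c_in, h, w = L.shape
    scale = np.sqrt(h * w)
    R = L.transpose(2, 0, 3, 1).reshape(c_out * h, c_in * w)   # eq. 8
    S = L.transpose(3, 0, 2, 1).reshape(c_out * w, c_in * h)   # eq. 11
    T = L.transpose(0, 2, 3, 1).reshape(c_out, c_in * h * w)   # eq. 14-15
    U = L.transpose(2, 3, 0, 1).reshape(c_out * h * w, c_in)   # eq. 18-19
    return [scale * np.linalg.norm(M, 2) for M in (R, S, T, U)]

def exact_norm(L, n):
    # ||J||_2 for the n x n circular convolution, via Sedghi et al. (2019):
    # the 2D DFT of the zero-padded filter yields the n^2 matrices G^(j,k).
    c_out, c_in, h, w = L.shape
    K = np.zeros((c_out, c_in, n, n))
    K[:, :, :h, :w] = L
    G = np.fft.fft2(K).transpose(2, 3, 0, 1).reshape(-1, c_out, c_in)
    return max(np.linalg.norm(G[i], 2) for i in range(G.shape[0]))

rng = np.random.default_rng(0)
L = rng.standard_normal((4, 3, 3, 3))
sigma = exact_norm(L, n=8)
bounds = four_bounds(L)
```

Each of the four values in `bounds` upper-bounds `sigma`, and for h = w = 1 all four coincide with it (Lemma 1).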

F PROOF OF LEMMA 1

Proof. When h = 1 and w = 1, note that in equations 9, 12, 16 and 20 we have F_{:h,j} = 1 and F_{:w,k} = 1. This directly implies:

‖J‖₂ = ‖R‖₂ = ‖S‖₂ = ‖T‖₂ = ‖U‖₂

G EFFECT OF INCREASING β ON SINGULAR VALUES

In Figure 1, we plot the effect of increasing β on the sum of the true singular values of the network and the sum of our upper bounds. We observe that the gap between the two decreases as we increase β. In Table 6, we show the effect of increasing β on the bound on the largest singular value of each layer.



We do not use the sum of the log of the spectral norm values, since that would make the filter-size dependent factor of √hw irrelevant for gradient computation.



Figure 1: We plot the effect of increasing β on the sum of the bounds (computed using our method, i.e. √hw ‖R‖₂) and the sum of true spectral norms (‖J^(I)‖₂, computed using Sedghi et al. (2019)) for a ResNet-32 neural network trained on the CIFAR-10 dataset (without any weight decay). We observe that the gap between the two decreases as we increase β. In Table 6, we show the effect of increasing β on the bounds of individual layers of the same network.

Table: Comparison between the exact spectral norm of the Jacobian of convolution layers, computed using Sedghi et al. (2019) and Ryu et al.

Table: Tightness analysis between our proposed bound and the exact method of Sedghi et al. (2019).

3.2 GRADIENT COMPUTATION

Since the matrix R can be directly computed by reshaping the filter weights L (equation 8), we can compute the derivative of our bound √hw ‖R‖₂ (and similarly for ‖S‖₂, ‖T‖₂, ‖U‖₂) with respect to the filter weights L by first computing the derivative of √hw ‖R‖₂ with respect to R and then appropriately reshaping the obtained derivative.
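This reshape-differentiate-reshape procedure can be sketched as follows, assuming NumPy and using the fact that the gradient of the largest (simple) singular value of R is the rank-one matrix u₁v₁ᵀ of its top singular pair; the finite-difference check is only a sanity test:

```python
import numpy as np

def bound_and_grad(L):
    # Bound sqrt(hw)*||R||_2 and its gradient w.r.t. the filter L:
    # reshape L into R (equation 8), use grad_R ||R||_2 = u1 v1^T for the
    # top singular pair, then reshape the gradient back to L's shape.
    c_out, c_in, h, w = L.shape
    scale = np.sqrt(h * w)
    R = L.transpose(2, 0, 3, 1).reshape(c_out * h, c_in * w)
    Um, s, Vt = np.linalg.svd(R)
    grad_R = scale * np.outer(Um[:, 0], Vt[0])
    # Invert the reshape: (h*c_out, w*c_in) -> (c_out, c_in, h, w).
    grad_L = grad_R.reshape(h, c_out, w, c_in).transpose(1, 3, 0, 2)
    return scale * s[0], grad_L

# Finite-difference sanity check on one filter coordinate.
rng = np.random.default_rng(1)
L = rng.standard_normal((2, 2, 3, 3))
val, grad = bound_and_grad(L)
eps = 1e-6
Lp = L.copy()
Lp[0, 1, 2, 0] += eps
fd = (bound_and_grad(Lp)[0] - val) / eps
```

In a deep-learning framework the same gradient would simply come from automatic differentiation through the reshape and the spectral norm.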






Table: Comparison between robustness certificates computed using CNN-Cert (Boopathy et al., 2018) and the method proposed in Singla & Feizi (2020) coupled with our spectral bound, for different values of γ, for a single-hidden-layer convolutional neural network with tanh activation function. Certified robust accuracy is computed as the fraction of correctly classified samples with robustness certificate (computed using Singla & Feizi (2020)).

When Curvature-based Attack Optimization is used for the adversarial attack, the resulting training procedure is called Curvature-based Robust Training (CRT).

EFFECT ON CERTIFIED ROBUST ACCURACY

Table 6: Effect of increasing β on the bounds (√hw ‖R‖₂) of each layer.

Table: Comparison of certified robust accuracy for different values of the regularization parameter γ, for a single-hidden-layer convolutional neural network with softplus activation function. Certified robust accuracy is computed as the fraction of correctly classified samples with CRC (Curvature-based Robustness Certificate, as defined in Singla & Feizi (2020)) greater than 0.5. Empirical robust accuracy is computed by running l₂-bounded PGD (200 steps, step size 0.01). We use a convolution layer with 64 filters and stride 2.

6. ACKNOWLEDGEMENTS

This project was supported in part by NSF CAREER AWARD 1942230, HR001119S0026, HR00112090132, NIST 60NANB20D134 and Simons Fellowship on "Foundations of Deep Learning."

