

Abstract

Recent work has highlighted several advantages of enforcing orthogonality in the weight layers of deep networks, such as maintaining the stability of activations, preserving gradient norms, and enhancing adversarial robustness by enforcing low Lipschitz constants. Although numerous methods exist for enforcing the orthogonality of fully-connected layers, those for convolutional layers are more heuristic in nature, often focusing on penalty methods or limited classes of convolutions. In this work, we propose and evaluate an alternative approach to directly parameterize convolutional layers that are constrained to be orthogonal. Specifically, we propose to apply the Cayley transform to a skew-symmetric convolution in the Fourier domain, so that the inverse convolution needed by the Cayley transform can be computed efficiently. We compare our method to previous Lipschitz-constrained and orthogonal convolutional layers and show that it indeed preserves orthogonality to a high degree even for large convolutions. Applied to the problem of certified adversarial robustness, we show that networks incorporating the layer outperform existing deterministic methods for certified defense against ℓ₂-norm-bounded adversaries, while scaling to larger architectures than previously investigated. Code is available at https://github.com/locuslab/orthogonal-convolutions.

1. Introduction

Encouraging orthogonality in neural networks has proven to yield several compelling benefits. For example, orthogonal initializations allow extremely deep vanilla convolutional neural networks to be trained quickly and stably (Xiao et al., 2018; Saxe et al., 2013). And initializations that remain closer to orthogonality throughout training seem to learn faster and generalize better (Pennington et al., 2017). Unlike merely Lipschitz-constrained layers, orthogonal layers are gradient-norm-preserving (Anil et al., 2019), discouraging vanishing and exploding gradients and stabilizing activations (Rodríguez et al., 2017). Orthogonality is thus a potential alternative to batch normalization in CNNs and can help to remember long-term dependencies in RNNs (Arjovsky et al., 2016; Vorontsov et al., 2017). Constraints and penalty terms encouraging orthogonality can improve generalization in practice (Bansal et al., 2018; Sedghi et al., 2018), improve adversarial robustness by enforcing low Lipschitz constants, and allow deterministic certificates of robustness (Tsuzuku et al., 2018). Despite evidence for the benefits of orthogonality constraints, and while there are many methods to orthogonalize fully-connected layers, the orthogonalization of convolutions has posed challenges. More broadly, current Lipschitz-constrained convolutions rely on spectral normalization and kernel reshaping methods (Tsuzuku et al., 2018), which only allow loose bounds and can cause vanishing gradients. Sedghi et al. (2018) showed how to clip the singular values of convolutions and thus enforce orthogonality, but relied on costly alternating projections to achieve tight constraints. Most recently, Li et al. (2019) introduced the Block Convolution Orthogonal Parameterization (BCOP), which cannot express the full space of orthogonal convolutions. In contrast to previous work, we provide a direct, expressive, and scalable parameterization of orthogonal convolutions.
Our method relies on the Cayley transform, which is well-known for parameterizing orthogonal matrices in terms of skew-symmetric matrices, and can be easily extended to non-square weight matrices. The transform requires efficiently computing the inverse of a particular convolution in the Fourier domain, which we show works well in practice. We demonstrate that our Cayley layer is indeed orthogonal in practice when implemented in 32-bit precision, irrespective of the number of channels. Further, we compare it to alternative convolutional and Lipschitz-constrained layers: we include them in several architectures and evaluate their deterministic certifiable robustness against an ℓ₂-norm-bounded adversary. Our layer provides state-of-the-art results on this task. We also demonstrate that the layers empirically endow a considerable degree of robustness without adversarial training. Our layer generally outperforms the alternatives, particularly for larger architectures.

2. Related Work

Orthogonality in neural networks. Orthogonal weight matrices have often been used to stabilize training, particularly in RNNs (Helfrich et al., 2018; Arjovsky et al., 2016). One way to orthogonalize weight matrices is with the Cayley transform, which is often used in Riemannian optimization (Absil et al., 2009). Helfrich et al. (2018) and Maduranga et al. (2019) avoid vanishing/exploding gradients in RNNs using the scaled Cayley transform. Similarly, Lezcano-Casado & Martínez-Rubio (2019) use the exponential map, which the Cayley transform approximates. Li et al. (2020) derive an iterative approximation of the Cayley transform for orthogonally-constrained optimizers and show it speeds the convergence of CNNs and RNNs. However, they merely orthogonalize a matrix obtained by reshaping the kernel, which is not the same as an orthogonal convolution (Sedghi et al., 2018). Our contribution is unique here in that we parameterize orthogonal convolutions directly, as opposed to reshaping kernels. Bounding neural network Lipschitzness. Orthogonality imposes a strict constraint on the Lipschitz constant, which itself comes with many benefits: lower Lipschitz constants are associated with improved robustness (Yang et al., 2020) and better generalization bounds (Bartlett et al., 2017). Tsuzuku et al. (2018) showed that neural network classifications can be certified as robust to ℓ₂-norm-bounded perturbations given a Lipschitz bound and sufficiently confident classifications. Along with Szegedy et al. (2013), they noted that the Lipschitz constant of a neural network can be bounded if the constants of its layers are known. Thus, there is substantial work on Lipschitz-constrained and regularized layers, which we review in Sec. 5. However, Anil et al. (2019) realized that mere Lipschitz constraints can attenuate gradients, unlike orthogonal layers.
There have been other ideas for calculating and controlling the Lipschitz constants of neural networks, e.g., through regularization (Hein & Andriushchenko, 2017), extreme value theory (Weng et al., 2018), or semi-definite programming (Latorre et al., 2020; Chen et al., 2020; Fazlyab et al., 2019), but constructing bounds from Lipschitz-constrained layers is more scalable and efficient. Besides Tsuzuku et al.'s (2018) strategy for deterministic certifiable robustness, there are many approaches to deterministically verifying neural network defenses using SMT solvers (Huang et al., 2017; Ehlers, 2017; Carlini & Wagner, 2017), integer programming (Lomuscio & Maganti, 2017; Tjeng & Tedrake, 2017; Cheng et al., 2017), or semi-definite programming (Raghunathan et al., 2018). Wong et al.'s (2018) approach of minimizing an LP-based bound on the robust loss is more scalable, but networks made from Lipschitz-constrained components can be more efficient still, as shown by Li et al. (2019), who outperform their approach. However, none of these methods yet perform as well as probabilistic methods (Cohen et al., 2019). Consequently, orthogonal layers appear to be an important component for enhancing the convergence of deep networks while encouraging robustness and generalization. Orthogonality. Since we are concerned with orthogonal convolutions, we review orthogonal matrices: a matrix Q ∈ ℝ^{n×n} is orthogonal if QᵀQ = QQᵀ = I. However, in building neural networks, layers do not always have equal input and output dimensions: more generally, a matrix U ∈ ℝ^{m×n} is semi-orthogonal if UᵀU = I or UUᵀ = I. Importantly, if m ≥ n, then U is also norm-preserving: ‖Ux‖₂ = ‖x‖₂ for all x ∈ ℝⁿ. If m < n, then the mapping is merely non-expansive: ‖Ux‖₂ ≤ ‖x‖₂. A matrix is orthogonal if and only if all of its singular values equal 1. Orthogonal convolutions.
The same concept of orthogonality applies to convolutional layers, which are also linear transformations. A convolutional layer conv : ℝ^{c×n×n} → ℝ^{c×n×n} with c = c_in = c_out input and output channels is orthogonal if and only if ‖conv(X)‖_F = ‖X‖_F for all input tensors X ∈ ℝ^{c×n×n}; the notion of semi-orthogonality extends similarly for c_in ≠ c_out. Note that orthogonalizing each convolutional kernel as in Lezcano-Casado & Martínez-Rubio (2019); Lezcano-Casado (2019) does not yield an orthogonal (norm-preserving) convolution. Lipschitzness under the ℓ₂ norm. A consequence of orthogonality is 1-Lipschitzness. A function f : ℝⁿ → ℝᵐ is L-Lipschitz with respect to the ℓ₂ norm iff ‖f(x) − f(y)‖₂ ≤ L‖x − y‖₂ for all x, y ∈ ℝⁿ. If L is the smallest such constant for f, it is called the Lipschitz constant of f, denoted Lip(f). A useful property for certifiable robustness is that the Lipschitz constant of the composition of f and g is upper-bounded by the product of their constants: Lip(f ∘ g) ≤ Lip(f)Lip(g). Since simple neural networks are fundamentally just composed functions, this allows us to bound their Lipschitz constants, albeit loosely. We can extend this idea to residual networks using the fact that Lip(f + g) ≤ Lip(f) + Lip(g), which motivates using a convex combination in residual connections. More details can be found in Li et al. (2019); Szegedy et al. (2013). Lipschitz bounds for provable robustness. If we know the Lipschitz constant of a neural network, we can certify that a classification with a sufficiently large margin is robust to ℓ₂ perturbations below a certain magnitude. Specifically, denote the margin of a classification with label t as M_f(x) = max(0, y_t − max_{i≠t} y_i), which can be interpreted as the distance between the correct logit and the next largest logit. Then if the logit function f has Lipschitz constant L and M_f(x) > √2 L ε, then f(x) is certifiably robust to perturbations {δ : ‖δ‖₂ ≤ ε}. Tsuzuku et al. (2018) and Li et al. (2019) provide proofs.
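To make the certificate concrete, here is a small NumPy sketch (an illustration under the margin condition above, not code from the paper; it assumes the top logit corresponds to the correct class):

```python
import numpy as np

def certified_radius(logits, L):
    """Largest eps such that the classification provably cannot change under
    any perturbation with ||delta||_2 <= eps, for an L-Lipschitz logit map:
    the prediction is robust iff the margin M_f(x) exceeds sqrt(2) * L * eps."""
    top2 = np.sort(logits)[-2:]           # two largest logits
    margin = max(0.0, top2[1] - top2[0])  # M_f(x), assuming the top logit is the true class
    return margin / (np.sqrt(2) * L)

# A margin of 2 with L = 1 certifies all radii up to 2 / sqrt(2) = sqrt(2).
logits = np.array([3.0, 1.0, 0.5])
print(certified_radius(logits, L=1.0))
```

Note that tightening the Lipschitz bound L directly enlarges the certified radius, which is why loose layer-wise bounds are costly for certification.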

4. The Cayley Convolution

Before describing our method, we first review discrete convolutions and the Cayley transform; then, we show the need for inverse convolutions and how to compute them efficiently in the Fourier domain, which lets us parameterize orthogonal convolutions via the Cayley transform. The key idea in our method is that multi-channel convolution in the Fourier domain reduces to a batch of matrix-vector products, and making each of those matrices orthogonal makes the convolution orthogonal. We describe our method in more detail in Appendix A and provide a minimal implementation in PyTorch in Appendix E. An unstrided convolutional layer with c_in input channels and c_out output channels has a weight tensor W of shape ℝ^{c_out×c_in×n×n} and takes an input X of shape ℝ^{c_in×n×n} to produce an output Y of shape ℝ^{c_out×n×n}, i.e., conv_W : ℝ^{c_in×n×n} → ℝ^{c_out×n×n}. It is easiest to analyze convolutions when they are circular: if the kernel goes out of bounds of X, it wraps around to the other side; this operation can be carried out efficiently in the Fourier domain. Consequently, we focus on circular convolutions. We define conv_W(X) as the circular convolutional layer with weight tensor W ∈ ℝ^{c_out×c_in×n×n} applied to an input tensor X ∈ ℝ^{c_in×n×n}, yielding an output tensor Y = conv_W(X) ∈ ℝ^{c_out×n×n}. Equivalently, we can view conv_W(X) as the doubly block-circulant matrix C ∈ ℝ^{c_out n² × c_in n²} corresponding to the circular convolution with weight tensor W, applied to the unrolled input tensor vec X ∈ ℝ^{c_in n² × 1}. Similarly, we denote by convᵀ_W(X) the transpose Cᵀ of the same convolution, which can be obtained by transposing the first two (channel) dimensions of W and flipping each of the last two (kernel) dimensions vertically and horizontally, calling the result W', and computing conv_{W'}(X). We denote by conv⁻¹_W(X) the inverse of the convolution, i.e., with corresponding matrix C⁻¹, which is more difficult to compute efficiently.
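The matrix view can be checked numerically. The following single-channel NumPy sketch (ours, not the paper's code) builds the doubly block-circulant matrix of a circular 2D convolution and verifies that its transpose is a convolution with the doubly flipped kernel; the extra roll accounts for the wrap-around of circular indexing:

```python
import numpy as np

def circ_conv_matrix(w):
    """Matrix C of the single-channel circular convolution
    Y[i, j] = sum_{a,b} w[a, b] * X[(i - a) % n, (j - b) % n],
    acting on the unrolled input vec(X)."""
    n = w.shape[0]
    C = np.zeros((n * n, n * n))
    for i in range(n):
        for j in range(n):
            for a in range(n):
                for b in range(n):
                    C[i * n + j, ((i - a) % n) * n + ((j - b) % n)] = w[a, b]
    return C

rng = np.random.default_rng(0)
n = 5
w = rng.standard_normal((n, n))
C = circ_conv_matrix(w)

# Transposed convolution = convolution with the kernel flipped in both
# spatial dimensions (the roll handles the circular index convention).
w_flipped = np.roll(np.flip(w), (1, 1), axis=(0, 1))
assert np.allclose(C.T, circ_conv_matrix(w_flipped))
```

Constructing C explicitly like this costs O(n⁴) memory per channel pair, which is exactly why the paper works in the Fourier domain instead.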
Now we review how to perform a convolution in the spatial domain. We refer to a pixel as a c_in- or c_out-dimensional slice of a tensor, like X[:, i, j]. Each of the n² output pixels Y[:, i, j] is computed as follows: for each c ∈ [c_out], compute Y[c, i, j] by centering the tensor W[c] on the (i, j)th pixel of the input and taking a dot product, wrapping around pixels of W that go out of bounds. Typically, W is zero except for a k × k region of the last two (spatial) dimensions, which we call the kernel or the receptive field; convolutional layers usually have small kernels, e.g., k = 3. Considering now matrices instead of tensors, the Cayley transform is a bijection between skew-symmetric matrices A and orthogonal matrices Q without −1 eigenvalues: Q = (I − A)(I + A)⁻¹. A matrix is skew-symmetric if A = −Aᵀ, and we can skew-symmetrize any square matrix B by computing A = B − Bᵀ. The Cayley transform of such a skew-symmetric matrix is always orthogonal, which can be seen by multiplying Q by its transpose and rearranging. We can also apply the Cayley transform to convolutions, noting they are also linear transformations that can be represented as doubly block-circulant matrices. While it is possible to construct the matrix C corresponding to a convolution conv_W and apply the Cayley transform to it, this is highly inefficient in practice: convolutions can be easily skew-symmetrized by computing conv_W(X) − convᵀ_W(X), but finding their inverse is challenging; instead, we manipulate convolutions in the Fourier domain, taking advantage of the convolution theorem and the efficiency of the fast Fourier transform. According to the 2D convolution theorem (Jain, 1989), the circular convolution of two matrices in the Fourier domain is simply their elementwise product.
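The 2D convolution theorem is easy to verify numerically. This single-channel NumPy sketch (illustrative, not the paper's code) compares a direct spatial-domain circular convolution against the elementwise product of FFTs:

```python
import numpy as np

def circ_conv2d(w, x):
    """Direct spatial-domain circular convolution of two n x n matrices."""
    n = x.shape[0]
    y = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            for a in range(n):
                for b in range(n):
                    y[i, j] += w[a, b] * x[(i - a) % n, (j - b) % n]
    return y

rng = np.random.default_rng(0)
n = 8
w = rng.standard_normal((n, n))
x = rng.standard_normal((n, n))

# Convolution theorem: the FFT of the circular convolution equals the
# elementwise product of the FFTs.
y_fft = np.real(np.fft.ifft2(np.fft.fft2(w) * np.fft.fft2(x)))
assert np.allclose(circ_conv2d(w, x), y_fft)
```

The FFT route costs O(n² log n) versus O(n⁴) for the naive loops, which is the efficiency the method relies on.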
We will show that the convolution theorem extends to multi-channel convolutions of tensors, in which case convolution reduces to a batch of complex matrix-vector products rather than elementwise products: inverting these smaller matrices is equivalent to inverting the convolution, and finding their skew-Hermitian part is equivalent to skew-symmetrizing the convolution, which allows us to compute the Cayley transform. We define the 2D Discrete (Fast) Fourier Transform for tensors of order ≥ 2 as a mapping FFT : ℝ^{m_1×...×m_r×n×n} → ℂ^{m_1×...×m_r×n×n} given by FFT(X)[i_1, ..., i_r] = F_n X[i_1, ..., i_r] F_n for i_l ∈ {1, ..., m_l}, l ∈ {1, ..., r}, and r ≥ 0, where F_n is the n × n DFT matrix with entries F_n[i, j] = (1/√n) exp(−(2πı/n)(i − 1)(j − 1)). That is, we treat all but the last two dimensions as batch dimensions. We denote X̃ = FFT(X) for a tensor X. Using the convolution theorem, in the Fourier domain the cth output channel is the sum of the elementwise products of the c_in input and weight channels: that is, Ỹ[c] = Σ_{k=1}^{c_in} W̃[c, k] ⊙ X̃[k]. Equivalently, working in the Fourier domain, the (i, j)th pixel of the cth output channel is the dot product of the (i, j)th pixel of the cth weight with the (i, j)th input pixel: Ỹ[c, i, j] = W̃[c, :, i, j] · X̃[:, i, j]. From this, we can see that the whole (i, j)th Fourier-domain output pixel is the matrix-vector product FFT(conv_W(X))[:, i, j] = W̃[:, :, i, j] X̃[:, i, j]. This interpretation gives a way to compute the inverse convolution as required for the Cayley transform, assuming c_in = c_out: FFT(conv⁻¹_W(X))[:, i, j] = W̃[:, :, i, j]⁻¹ X̃[:, i, j]. Given this method to compute inverse convolutions, we can now parameterize an orthogonal convolution with a skew-symmetric convolution through the Cayley transform, as highlighted in Algorithm 1: In line 1, we use the Fast Fourier Transform on the weight and input tensors.
In line 4, we compute the Fourier-domain weights Ã of the skew-symmetric convolution (the Fourier representation is skew-Hermitian, hence the conjugate transpose). Next, in line 5 we apply the inverses required for FFT(conv⁻¹_{I+A}(X)), and in line 6 we use them to complete the Cayley transform, written as (I + A)⁻¹ − A(I + A)⁻¹. Finally, we obtain the spatial-domain result with the inverse FFT, which is always exactly real despite working with complex matrices in the Fourier domain (see Appendix A).
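Algorithm 1 can be sketched in a few lines of NumPy. The paper's implementation is in PyTorch (Appendix E); this is our illustrative re-implementation of the same steps, checking that the resulting convolution is norm-preserving:

```python
import numpy as np

def cayley_conv(W, X):
    """Orthogonal circular convolution via the Cayley transform (c_in = c_out).
    W: (c, c, n, n) weights, X: (c, n, n) input; returns a (c, n, n) output."""
    c, _, n, _ = W.shape
    Wf = np.fft.fft2(W)                 # line 1: FFT of weights and input
    Xf = np.fft.fft2(X)
    Zf = np.empty_like(Xf)
    I = np.eye(c)
    for i in range(n):                  # lines 2-7: one small matrix per pixel
        for j in range(n):
            A = Wf[:, :, i, j] - Wf[:, :, i, j].conj().T  # skew-Hermitian part
            Yf = np.linalg.solve(I + A, Xf[:, i, j])      # (I + A)^{-1} x
            Zf[:, i, j] = Yf - A @ Yf                     # (I - A)(I + A)^{-1} x
    return np.real(np.fft.ifft2(Zf))    # line 8: the result is exactly real

rng = np.random.default_rng(0)
c, n = 4, 8
W = 0.1 * rng.standard_normal((c, c, n, n))
X = rng.standard_normal((c, n, n))
Y = cayley_conv(W, X)
assert np.isclose(np.linalg.norm(Y), np.linalg.norm(X))  # norm-preserving, i.e. orthogonal
```

Each per-pixel matrix (I − A)(I + A)⁻¹ is unitary because A is skew-Hermitian, and since the FFT preserves norms up to a constant factor, the whole convolution preserves the Frobenius norm of its input.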

4.1. Properties

It is important to note that the inverse in the Cayley transform always exists: because A is skew-symmetric, it has all imaginary eigenvalues, so I + A has all nonzero eigenvalues and is thus nonsingular.

Algorithm 1: Orthogonal convolution via the Cayley transform.
Input: A tensor X ∈ ℝ^{c_in×n×n} and convolution weights W ∈ ℝ^{c_out×c_in×n×n}, with c_in = c_out.
Output: A tensor Y ∈ ℝ^{c_out×n×n}, the orthogonal convolution parameterized by W applied to X.
1: W̃ := FFT(W) ∈ ℂ^{c_out×c_in×n×n}, X̃ := FFT(X) ∈ ℂ^{c_in×n×n}
2: for all i, j ∈ 1, ..., n  // in parallel
3: do
4:   Ã[:, :, i, j] := W̃[:, :, i, j] − W̃[:, :, i, j]*
5:   Ỹ[:, i, j] := (I + Ã[:, :, i, j])⁻¹ X̃[:, i, j]
6:   Z̃[:, i, j] := Ỹ[:, i, j] − Ã[:, :, i, j] Ỹ[:, i, j]
7: end
8: return FFT⁻¹(Z̃).real

Since only square matrices can be skew-symmetrized and inverted, Algorithm 1 only works for c_in = c_out, but it can be extended to the rectangular case: for c_out ≥ c_in, pad the matrix with zeros and project out the first c_in columns after the transform, yielding a norm-preserving semi-orthogonal matrix; the case c_in ≥ c_out follows similarly, but the resulting matrix is merely non-expansive. With an efficient implementation in terms of the Schur complement (Appendix A.1, Eq. A22), this only requires inverting a square matrix of order min(c_in, c_out). We found that learning was easier if we parameterized W in Algorithm 1 as W = gV/‖V‖_F for a learnable scalar g and tensor V, as in weight normalization (Salimans & Kingma, 2016). Comparison to BCOP. While the Block Convolution Orthogonal Parameterization (BCOP) can only express orthogonal convolutions with fixed k × k kernels, a Cayley convolutional layer can represent orthogonal convolutions with a learnable kernel size up to the input size, and it does so without costly projections, unlike Sedghi et al. (2018). However, our parameterization as presented is limited to orthogonal convolutions without −1 eigenvalues.
Hence, our parameterization is incomplete; BCOP, beyond its kernel-size restrictions, was also shown to represent the space of orthogonal convolutions incompletely, though the details of the gap were left unresolved (Li et al., 2019). Our method can represent orthogonal convolutions with −1 eigenvalues by multiplying the Cayley transform by a fixed diagonal matrix with ±1 entries (Gallier, 2006; Helfrich et al., 2018); however, we cannot optimize over the discrete set of such scaling matrices, so our method can optimize over neither all orthogonal convolutions nor all special orthogonal convolutions. In our experiments, we did not find improvements from adding randomly initialized scaling matrices as in Helfrich et al. (2018). Limitations of our method. As our method requires computing an inverse convolution, it is generally incompatible with strided convolutions; e.g., a convolution with stride 2 cannot be inverted, since it involves non-invertible downsampling. It is possible to apply our method to stride-2 convolutions by simultaneously increasing the number of output channels by 4× to compensate for the 2× downsampling of the two spatial dimensions, though we found this computationally inefficient. Instead, we use the invertible downsampling layer from Jacobsen et al. (2018) to emulate striding. The convolution resulting from our method is circular, which is the same as using the circular padding mode instead of zero padding in, e.g., PyTorch, and will not have a large impact on performance if subjects tend to be centered in the images of the data set. BCOP (Li et al., 2019) and Sedghi et al. (2018) also restricted their attention to circular convolutions.
Our method is substantially more expensive than plain convolutional layers, though in most practical settings it is more efficient than existing work: we plot the runtimes of our Cayley layer, BCOP, and plain convolutions in a variety of settings in Figure 6, and we also report runtimes in Tables 4 and 5 (see Appendix C). Runtime comparison. Our Cayley layer does c_in c_out FFTs on n × n matrices (i.e., the kernels padded to the input size), and c_in FFTs for each n × n input, with complexity O(c_in c_out n² log n) and O(c_in n² log n), respectively. The most expensive step is computing the inverses of n² square matrices of order c = min(c_in, c_out), with complexity O(n² c³), similarly to the method of Sedghi et al. (2018). Like those authors, we note that parallelization could effectively reduce this to O(n² log n + c³), which is quite feasible in practice. As in Li et al. (2020), the inverse could be replaced with an iterative approximation, but we did not find this necessary for our relatively small architectures. For comparison, the related layers BCOP and RKO (Sec. 5) take only O(c³) to orthogonalize the convolution, while OSSN takes O(n² c³) (Li et al., 2019). In practice, we found our Cayley layer takes anywhere from 1/2× to 4× as long as BCOP, depending on the architecture (see Appendix C).

5. Experiments

Our experiments have two goals: first, we show that our layer remains orthogonal in practice; second, we compare the performance of our layer against alternatives (particularly BCOP) on two adversarial robustness tasks on CIFAR-10. We investigate certifiable robustness against an ℓ₂-norm-bounded adversary using the idea of Lipschitz Margin Training (Tsuzuku et al., 2018), and then we look at robustness in practice against a powerful adversary. We find that our layer is always orthogonal and performs relatively well in the robustness tasks. Separately, we show our layer improves on the Wasserstein distance estimation task from Li et al. (2019) in Appendix D.2. For alternative layers, we adopt the naming scheme for previous work on Lipschitz-constrained convolutions from Li et al. (2019), and we compare directly against their implementations. We outline the methods below.

RKO.

A convolution can be represented as a matrix-vector product, e.g., using a doubly block-circulant matrix and the unrolled input. Alternatively, one could stack each k × k receptive field and multiply by the c_out × k²c_in reshaped kernel matrix (Cisse et al., 2017). The spectral norm of this reshaped matrix is bounded by the convolution's true spectral norm (Tsuzuku et al., 2018); orthogonalizing the reshaped kernel yields reshaped kernel orthogonalization (RKO), and CRKO denotes the variant that performs this orthogonalization with the Cayley transform. OSSN. Miyato et al. (2018) used the power method on the matrix W associated with the convolution, i.e., s_{i+1} := WᵀW s_i and σ_max ≈ ‖W s_n‖/‖s_n‖. Gouk et al. (2018) improved upon this idea by applying the power method directly to convolutions, using the transposed convolution for Wᵀ. However, this one-sided spectral normalization is quite restrictive; dividing out σ_max can make the other singular values vanishingly small. SVCM. Sedghi et al. (2018) showed how to exactly compute the singular values of convolutional layers using the Fourier transform before the SVD, and proposed a singular value clipping method. However, the clipped convolution can have an arbitrarily large kernel size, so they resorted to alternating projections between orthogonal convolutions and k × k-kernel convolutions, which can be expensive; like Li et al. (2019), we found that roughly 50 projections are needed for orthogonalization. BCOP. The Block Convolution Orthogonal Parameterization extends the orthogonal initialization method of Xiao et al. (2018). It differentiably parameterizes k × k orthogonal convolutions with an orthogonal matrix and 2(k − 1) symmetric projection matrices. The method only parameterizes the subspace of orthogonal convolutions with k × k kernels, but is quite expressive empirically; internally, orthogonalization is done with the method of Björck & Bowie (1971). Note that BCOP and SVCM are the only other orthogonal convolutional layers (SVCM only with many projections); RKO, CRKO, and OSSN merely upper-bound the Lipschitz constant of the layer by 1.
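The power-method step underlying OSSN can be sketched on an explicit matrix (a NumPy illustration; in the convolutional setting the products Ws and Wᵀs are computed with the convolution and its transpose rather than a materialized matrix):

```python
import numpy as np

def power_method_sigma_max(W, iters=100, seed=0):
    """Estimate sigma_max(W) by iterating s <- W^T W s with normalization."""
    rng = np.random.default_rng(seed)
    s = rng.standard_normal(W.shape[1])
    for _ in range(iters):
        s = W.T @ (W @ s)        # one power-method step on W^T W
        s /= np.linalg.norm(s)
    return np.linalg.norm(W @ s)  # Rayleigh-style estimate of the top singular value

rng = np.random.default_rng(1)
W = rng.standard_normal((20, 10))
sigma = power_method_sigma_max(W)
assert np.isclose(sigma, np.linalg.svd(W, compute_uv=False)[0], rtol=1e-4)
```

Dividing W by this estimate bounds only the largest singular value, which is exactly the one-sidedness the text criticizes: the remaining singular values are free to shrink.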

5.1. Training and Architecture Details

Training details. For all experiments, we used CIFAR-10 with standard augmentation, i.e., random cropping and flipping. Inputs to the model are always in the range [0, 1]; we implement normalization as a layer for compatibility with AutoAttack. For each architecture/convolution pair, we tried learning rates in {10⁻⁵, 10⁻⁴, 10⁻³, 10⁻², 10⁻¹}, choosing the one with the best test accuracy; most often, 0.001 was appropriate. We found that a piecewise triangular learning rate schedule, as used by top performers in the DAWNBench competition (Coleman et al., 2017), performed best. Adam (Kingma & Ba, 2014) showed a significant improvement over plain SGD, and we used it for all experiments. Loss function. Li et al. (2019) used a multi-class hinge loss where the margin is the robustness certificate √2 L ε₀. We corroborate their finding that this works better than cross-entropy, and similarly use ε₀ = 0.5. Varying ε₀ controls a tradeoff between accuracy and robustness (see Fig. 5). Initialization. We found that the standard uniform initialization in PyTorch performed well for our layer. We adjusted the variance, but significant differences required order-of-magnitude changes. For residual networks, we tried Fixup initialization (Zhang et al., 2019) but saw no significant improvement; we hypothesize this is due to (1) the learnable scaling parameter inside the Cayley transform, which changes significantly during training, and (2) the dynamical isometry inherent in orthogonal layers. For alternative layers, we used the initializations from Li et al. (2019). Architecture considerations. For fair comparison with previous work, we use the "large" network from Li et al. (2019), which was first implemented in Kolter & Wong (2017)'s work on certifiable robustness. We also compare the performance of the different layers in 1-Lipschitz-constrained versions of ResNet9 (He et al., 2016) and WideResNet10-10 (Zagoruyko & Komodakis, 2016).
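The piecewise triangular schedule mentioned above can be sketched as follows; the location of the peak (`peak_frac`) is our assumption for illustration, as the text does not specify it:

```python
def triangular_lr(step, total_steps, peak_lr, peak_frac=0.3):
    """Piecewise triangular schedule: linear warmup from 0 to peak_lr over
    the first peak_frac of training, then linear decay back to 0.
    peak_frac is a hypothetical parameter, not taken from the paper."""
    peak_step = peak_frac * total_steps
    if step <= peak_step:
        return peak_lr * step / peak_step
    return peak_lr * (total_steps - step) / (total_steps - peak_step)

# Example: peak of 0.1 reached 30% of the way through 100 steps.
print(triangular_lr(30, total_steps=100, peak_lr=0.1))
```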
The architectures we could investigate were limited by compute and memory, as all the layers compared are relatively expensive. For RKO, OSSN, SVCM, and BCOP, we use Björck orthogonalization (Björck & Bowie, 1971) for fully-connected layers, as reported in Li et al. (2019); Anil et al. (2019). For our Cayley convolutional layer and CRKO, we orthogonalize the fully-connected layers with the Cayley transform to be consistent with our method. We found the gradient-norm-preserving GroupSort activation function from Anil et al. (2019) to be more effective than ReLU, and we used a group size of 2, i.e., MaxMin. Strided convolutions. For the KWLarge network, we used "invertible downsampling", which emulates striding by rearranging the inputs to have 4× more channels while halving the two spatial dimensions and reducing the kernel size to k/2 (Jacobsen et al., 2018). For the residual networks, we simply used a form of pooling, noting that average pooling is still non-expansive when multiplied by its kernel size, which allows us to use more of the network's capacity. We also halved the kernel size of the last pooling layer, instead adding another fully-connected layer; empirically, this resulted in higher local Lipschitz constants. Ensuring Lipschitz constraints. Batch normalization layers scale their output, so they cannot be included in our 1-Lipschitz-constrained architectures; the gradient-norm-preserving properties of our layers compensate for this. We ensure residual connections are non-expansive by making them a convex combination with a new learnable parameter α, i.e., g(x) = αf(x) + (1 − α)x for α ∈ [0, 1]; to ensure the latter constraint, we parameterize α through a sigmoid. We can tune the overall Lipschitz bound to a given L using the Lipschitz composition property, multiplying each of the m layers by L^{1/m}.
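A minimal NumPy sketch of the convex residual connection and the Lipschitz-bound tuning (illustrative names; not the paper's code):

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def convex_residual(f, x, alpha_raw):
    """g(x) = alpha * f(x) + (1 - alpha) * x with alpha = sigmoid(alpha_raw)
    in [0, 1]; if Lip(f) <= 1 then Lip(g) <= alpha * Lip(f) + (1 - alpha) <= 1."""
    alpha = sigmoid(alpha_raw)
    return alpha * f(x) + (1 - alpha) * x

# With an orthogonal (hence 1-Lipschitz) f, the block is non-expansive:
rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.standard_normal((6, 6)))
x = rng.standard_normal(6)
y = convex_residual(lambda v: Q @ v, x, alpha_raw=0.3)
assert np.linalg.norm(y) <= np.linalg.norm(x) + 1e-12

# Tuning the overall bound to L across m layers: scale each by L**(1/m).
L, m = 2.0, 4
per_layer_scale = L ** (1 / m)
assert np.isclose(per_layer_scale ** m, L)
```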

5.2. Adversarial Robustness

For certifiable robustness, we report the fraction of certifiable test points, i.e., those with classification margin M_f(x) greater than √2 L ε, where ε = 36/255. For empirical defense, we use both vanilla projected gradient descent (PGD) and AutoAttack (Croce & Hein, 2020). For PGD, we use step size α = ε/4 with 10 iterations. Within AutoAttack, we use both APGD-CE and APGD-DLR, finding that the decision-based attacks provided no improvements. We report on ε = 36/255 for consistency with Li et al. (2019) and previous work on deterministic certifiable robustness (Wong et al., 2018).
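A minimal ℓ₂-PGD loop matching the settings above (step size α = ε/4, 10 iterations); this is a generic sketch rather than the evaluation code, with `grad_fn` standing in for the loss gradient of the attacked model:

```python
import numpy as np

def pgd_l2(grad_fn, x, eps, steps=10):
    """l2-bounded PGD: normalized gradient ascent steps of size eps/4,
    projecting the perturbation back onto the l2 ball of radius eps."""
    alpha = eps / 4.0
    delta = np.zeros_like(x)
    for _ in range(steps):
        g = grad_fn(x + delta)
        delta = delta + alpha * g / (np.linalg.norm(g) + 1e-12)
        norm = np.linalg.norm(delta)
        if norm > eps:
            delta *= eps / norm  # projection onto the l2 ball
    return x + delta

# For a linear loss, the attack saturates the budget exactly:
rng = np.random.default_rng(0)
x = rng.standard_normal(32)
w = rng.standard_normal(32)
x_adv = pgd_l2(lambda z: w, x, eps=36 / 255)
assert np.isclose(np.linalg.norm(x_adv - x), 36 / 255)
```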

5.3. Results

Practical orthogonality. We show that our layer remains very close to orthogonal in practice, both before and after learning, when implemented in 32-bit precision. We investigated Cayley layers from one of our ResNet9 architectures, running them on random tensors to see whether their norm is preserved, which is equivalent to orthogonality. We found that ‖Conv(x)‖/‖x‖, the extent to which our layer is gradient-norm-preserving, is always extremely close to 1. We illustrate the small discrepancies, easily bounded between 0.99999 and 1.00001, in Figure 1. Cayley layers which keep or increase the number of channels are guaranteed to be orthogonal, which we see in practice in graphs (b, c, d, e); those which decrease the number of channels can only be non-expansive, and in fact such a layer seems to become slightly more norm-preserving after training (a). In short, our Cayley layer can capture the full benefits of orthogonality. Certifiable robustness. We use our layer and alternatives within the KWLarge architecture for a more direct comparison to previous work on deterministic certifiable robustness (Li et al., 2019; Wong et al., 2018). As in Li et al. (2019), we got the best performance without normalizing inputs, and can thus say that all networks compared here are at most 1-Lipschitz. Our layer outperforms BCOP on this task (see Table 1) and is thus state-of-the-art, achieving on average 75.33% clean test accuracy and 59.16% certifiably robust accuracy against adversarial perturbations with ℓ₂ norm at most ε = 36/255; in contrast, BCOP gets 75.11% test accuracy and 58.29% certifiably robust accuracy. The reshaped kernel methods perform only a percent or two worse on this task, while the spectral normalization and clipping methods lag behind. We consider one layer meaningfully better than another only if both the test and robust accuracy improve; otherwise, the methods may simply occupy different parts of the tradeoff curve.
Since reshaped kernel methods can encourage smaller Lipschitz constants than orthogonal layers (Sedghi et al., 2018), we investigated the clean vs. certifiably robust accuracy tradeoff enabled by scaling the Lipschitz upper bound of the network, visualized in Figure 2. To that end, in light of the competitiveness of RKO, we chose a Lipschitz upper bound of 0.85, which gave our Cayley layer similar test accuracy; this allowed for even higher certifiable robustness of 59.99%, but lower test accuracy of 74.35%. Overall, we were surprised by the similarity between the four top-performing methods after scaling Lipschitz constants. We were not able to improve certifiable accuracy with ResNets. However, it was useful to increase the kernel size: we found that a kernel size of 5 improved accuracy, while 7 and 9 were slightly worse. (Since our method operates in the Fourier domain, increases in kernel size incur no extra cost.) We also saw an improvement from scaling up the width of each layer of KWLarge, and our Cayley layer was substantially faster than BCOP as the width of KWLarge increased (see Appendix C). Multiplying the width by 3 and increasing the kernel size to 5, we were able to get 61.13% certified robust accuracy with our layer, and 60.55% with BCOP. Empirical robustness. Previous work has shown that adversarial robustness correlates with lower Lipschitz constants. Thus, we investigated the robustness endowed by our layer against ℓ₂ gradient-based adversaries. Here, we got better accuracy with the standard practice of normalizing inputs. Our layer outperformed the others in the ResNet9 and WideResNet10-10 architectures; results were less decisive for KWLarge (see Appendix B). For the WideResNet, we got 82.99% clean accuracy and 73.16% robust accuracy for ε = 36/255. For comparison, the state-of-the-art achieves 91.08% clean accuracy and 72.91% robust accuracy for ε = 0.5 using a ResNet50 with adversarial training and additional unlabeled data (Augustin et al., 2020).
We visualize the tradeoffs for our residual networks in Figure 3 , noting that they empirically have smaller local Lipschitz constants than KWLarge. While our layer outperforms others for the default Lipschitz bound of 1, and is consistently slightly better than BCOP, RKO can perform similarly well for larger bounds. This provides some support for studies showing that hard constraints like ours may not match the performance of softer constraints, such as RKO and penalty terms (Bansal et al., 2018; Vorontsov et al., 2017) .
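The certificates discussed above follow the margin-based argument of Tsuzuku et al. (2018): for a network that is $L$-Lipschitz in the $\ell_2$ sense, a prediction cannot be changed by any perturbation of norm less than $\epsilon$ whenever the logit margin exceeds $\sqrt{2} L \epsilon$. The sketch below is our own illustration of this check; the function name and logits are hypothetical.

```python
import numpy as np

# For an L-Lipschitz network (l2 sense), a prediction is certifiably robust
# to any perturbation of norm < eps when the margin between the top logit
# and the runner-up exceeds sqrt(2) * L * eps (Tsuzuku et al., 2018).
def certified(logits, lipschitz, eps):
    runner_up, top = np.sort(logits)[-2:]
    margin = top - runner_up
    return bool(margin > np.sqrt(2) * lipschitz * eps)

logits = np.array([0.1, 2.0, 1.4, -0.3])  # hypothetical logits of a 1-Lipschitz net

# Margin is 0.6; sqrt(2) * 36/255 is about 0.20, so this example certifies.
print(certified(logits, lipschitz=1.0, eps=36 / 255))
```

Certified robust accuracy is then the fraction of test points whose (correct) predictions pass this check, which is why tightening the Lipschitz bound trades clean accuracy against the certifiable radius.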

6. Conclusion

In this paper, we presented a new, expressive parameterization of orthogonal convolutions using the Cayley transform. Unlike previous approaches to Lipschitz-constrained convolutions, ours gives deep networks the full benefits of orthogonality, such as gradient norm preservation. We showed empirically that our method indeed maintains a high degree of orthogonality both before and after learning, and also scales better to some architectures than previous approaches. Using our layer, we were able to improve upon the state of the art in deterministic certifiable robustness against an $\ell_2$-norm-bounded adversary, and also showed that it endows networks with considerable inherent empirical robustness. While our layer offers theoretical benefits, we observed that heuristics involving orthogonalizing reshaped kernels were also quite effective for empirical robustness. Orthogonal convolutions may only show their true advantage in gradient norm preservation for deeper networks than we investigated. In light of our experiments in scaling the Lipschitz bound, we hypothesize that it is not orthogonality per se, but instead the ability of layers such as ours to exert control over the Lipschitz constant, that may be best for the robustness/accuracy tradeoff. Future work may avoid expensive inverses using approximations or the exponential map, or compare various orthogonal and Lipschitz-constrained layers in the context of very deep networks.

Acknowledgements

We thank Shaojie Bai, Chun Kai Ling, Eric Wong, and the anonymous reviewers for helpful feedback and discussions. This work was partially supported under DARPA grant number HR00112020006.

A. Orthogonal Convolutions in the Fourier Domain

Our method relies on the fact that a multi-channel circular convolution can be block-diagonalized by a suitable Discrete Fourier Transform matrix. We show how this follows from the 2D convolution theorem (Jain, 1989, p. 145) below.

Definition A.1. $F_n$ is the DFT matrix for sequences of length $n$; we drop the subscript when it can be inferred from context.

Definition A.2. We define $\mathrm{conv}_W(X)$ as in Section 4; if $c_{in} = c_{out} = 1$, we drop the channel axes, i.e., for $X, W \in \mathbb{R}^{n \times n}$, the 2D circular convolution of $X$ with $W$ is $\mathrm{conv}_W(X) \in \mathbb{R}^{n \times n}$.

Theorem A.1. If $C \in \mathbb{R}^{n^2 \times n^2}$ represents a 2D circular convolution with weights $W \in \mathbb{R}^{n \times n}$ operating on a vectorized input $\mathrm{vec}(X) \in \mathbb{R}^{n^2 \times 1}$, with $X \in \mathbb{R}^{n \times n}$, then it can be diagonalized as $(F \otimes F) C (F^* \otimes F^*) = D$.

Proof. According to the 2D convolution theorem, we can implement a single-channel 2D circular convolution by computing the elementwise product of the DFTs of the filter and input signals:

$$FWF \odot FXF = F\,\mathrm{conv}_W(X)\,F. \tag{A1}$$

This elementwise product is easier to work with mathematically if we represent it as a diagonal-matrix-vector product after vectorizing the equation:

$$\mathrm{diag}(\mathrm{vec}(FWF))\,\mathrm{vec}(FXF) = \mathrm{vec}(F\,\mathrm{conv}_W(X)\,F). \tag{A2}$$

We can then rearrange this using $\mathrm{vec}(ABC) = (C^T \otimes A)\,\mathrm{vec}(B)$ and the symmetry of $F$:

$$\mathrm{diag}(\mathrm{vec}(FWF))\,(F \otimes F)\,\mathrm{vec}(X) = (F \otimes F)\,\mathrm{vec}(\mathrm{conv}_W(X)). \tag{A3}$$

Left-multiplying by the inverse of $F \otimes F$ and noting $C\,\mathrm{vec}(X) = \mathrm{vec}(\mathrm{conv}_W(X))$, we get the result

$$(F^* \otimes F^*)\,\mathrm{diag}(\mathrm{vec}(FWF))\,(F \otimes F) = C \;\Rightarrow\; \mathrm{diag}(\mathrm{vec}(FWF)) = (F \otimes F)\,C\,(F^* \otimes F^*),$$

which shows that the (doubly-block-circulant) matrix $C$ is diagonalized by $F \otimes F$. An alternate proof can be found in Jain (1989, p. 150).

Now we can consider the case where we have a 2D circular convolution $C \in \mathbb{R}^{c_{out} n^2 \times c_{in} n^2}$ with $c_{in}$ input channels and $c_{out}$ output channels. Here, $C$ has $c_{out} \times c_{in}$ blocks, each of which is a circular convolution $C_{ij} \in \mathbb{R}^{n^2 \times n^2}$. The input image is $\mathrm{vec}(\mathcal{X}) = [\mathrm{vec}^T(X_1), \ldots, \mathrm{vec}^T(X_{c_{in}})]^T \in \mathbb{R}^{c_{in} n^2 \times 1}$, where $X_i$ is the $i$th channel of $\mathcal{X}$.

Corollary A.1.1. If $C \in \mathbb{R}^{c_{out} n^2 \times c_{in} n^2}$ represents a 2D circular convolution with $c_{in}$ input channels and $c_{out}$ output channels, then it can be block-diagonalized as $F_{c_{out}} C F^*_{c_{in}} = D$, where $F_c = S_{c, n^2}(I_c \otimes (F \otimes F))$, $S_{c,n}$ is a permutation matrix, $I_k$ is the identity matrix of order $k$, and $D$ is block diagonal with $n^2$ blocks of size $c_{out} \times c_{in}$.

Proof. We first look at each of the blocks of $C$ individually, referring to $\tilde{D}$ as the block matrix before applying the $S$ permutations, i.e., $\tilde{D} = S^T_{c_{out}, n^2} D S_{c_{in}, n^2}$, so that

$$\tilde{D}_{ij} = [(I_{c_{out}} \otimes (F \otimes F))\, C\, (I_{c_{in}} \otimes (F^* \otimes F^*))]_{ij} = (F \otimes F)\, C_{ij}\, (F^* \otimes F^*) = \mathrm{diag}(\mathrm{vec}(F W_{ij} F)),$$

where $W_{ij}$ are the weights of the $(ij)$th single-channel convolution, using Theorem A.1. That is, $\tilde{D}$ is a block matrix of diagonal matrices. Then, let $S_{a,b}$ be the perfect shuffle matrix that permutes the block matrix of diagonal matrices to a block diagonal matrix. $S_{a,b}$ can be constructed by subselecting rows of the identity matrix. Using slice notation:

$$S_{a,b} = \begin{bmatrix} I_{ab}(1:b:ab,\,:) \\ I_{ab}(2:b:ab,\,:) \\ \vdots \\ I_{ab}(b:b:ab,\,:) \end{bmatrix}. \tag{A6}$$

As an example, for a $2 \times 3$ block matrix $\tilde{D}$ whose blocks are the $4 \times 4$ diagonal matrices $\mathrm{diag}(a,b,c,d), \ldots, \mathrm{diag}(u,v,w,x)$,

$$S_{2,4}\,\tilde{D}\,S^T_{3,4} = \mathrm{blockdiag}\!\left( \begin{bmatrix} a & e & i \\ m & q & u \end{bmatrix}\!, \begin{bmatrix} b & f & j \\ n & r & v \end{bmatrix}\!, \begin{bmatrix} c & g & k \\ o & s & w \end{bmatrix}\!, \begin{bmatrix} d & h & l \\ p & t & x \end{bmatrix} \right) = D. \tag{A7}$$

Then, with the perfect shuffle matrix, we can compute the block diagonal matrix $D$ as:

$$S_{c_{out}, n^2}\, \tilde{D}\, S^T_{c_{in}, n^2} = S_{c_{out}, n^2} (I_{c_{out}} \otimes (F \otimes F))\, C\, (I_{c_{in}} \otimes (F^* \otimes F^*))\, S^T_{c_{in}, n^2} = F_{c_{out}} C F^*_{c_{in}} = D. \tag{A8}$$

The effect of left- and right-multiplying with the perfect shuffle matrices is to create a new matrix $D$ from $\tilde{D}$ such that $[D_k]_{ij} = [\tilde{D}_{ij}]_{kk}$, where the subscripts inside the brackets refer to the $k$th diagonal block and the $(ij)$th block, respectively.

Remark. It is much simpler to compute $D$ (here wfft) in tensor form given the convolution weights w as a $c_{out} \times c_{in} \times n \times n$ tensor: wfft = fft2(w).reshape(cout, cin, n**2).permute(2, 0, 1).

Each runtime was recorded using the autograd profiler in PyTorch (Paszke et al., 2019) by summing the CUDA execution times. The batch size was fixed at 128 for all graphs, and each data point was averaged over 32 iterations. We used an Nvidia Quadro RTX 8000.

The main competing orthogonal convolutional layer, BCOP (Li et al., 2019), uses Björck (Björck & Bowie, 1971) orthogonalization for internal parameter matrices; the authors also used it in their experiments for orthogonal fully-connected layers. Similarly to how we replaced the orthogonalization method in RKO with the Cayley transform for our CRKO (Cayley RKO) experiments, we replaced Björck with the Cayley transform in BCOP and used a Cayley linear layer for the CayleyBCOP experiments, reported in Tables 6 and 7. We see slightly decreased performance over all metrics, similar to the relationship between RKO and CRKO. For additional comparison, we also report on a plain convolutional baseline in Table 7. For this model, we used a plain circular convolutional layer and a Cayley linear layer, which still imparts considerable robustness.
With the plain convolutional layer, the model gains a considerable degree of accuracy but loses some robustness. We did not report a plain convolutional baseline for the provable robustness experiments on KWLarge, as it would require a more sophisticated technique to bound the Lipschitz constants of each layer, which is outside the scope of our investigation.

D.2 Wasserstein Distance Estimation

Method                 Cayley   BCOP    RKO    OSSN
Wasserstein Distance   10.72    10.08   9.18   7.50

Table 8: For BCOP, RKO and OSSN, we report the best bound over all trials from the experiments in the repository containing BCOP's implementation (Li et al., 2019). We ran one trial of the Wasserstein GAN experiment, replacing the BCOP and Björck layers with our Cayley convolutional and linear layers, and achieved a significantly tighter bound. We only report on experiments using the GroupSort (MaxMin) activation (Anil et al., 2019) and on the STL-10 dataset.

We repeated the Wasserstein distance estimation experiment from Li et al. (2019), simply replacing the BCOP layer with our Cayley convolutional layer, and the Björck linear layer with our Cayley fully-connected layer. We took the best Wasserstein distance bound from one trial of each of the four learning rates considered in BCOP (0.1, 0.01, 0.001, 0.0001); see Table 8.
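For context, these bounds come from the Kantorovich–Rubinstein dual: any 1-Lipschitz critic $f$ yields a lower bound $\mathbb{E}_P[f] - \mathbb{E}_Q[f] \le W_1(P, Q)$, and the experiment maximizes this quantity over Lipschitz-constrained networks. The following is our own minimal numpy sketch of the bound, using a fixed hand-picked critic rather than a learned network.

```python
import numpy as np

# Kantorovich-Rubinstein duality: for any 1-Lipschitz critic f,
#   E_P[f(x)] - E_Q[f(x)] <= W1(P, Q),
# so maximizing the left-hand side over 1-Lipschitz networks (e.g., built
# from orthogonal layers) yields a lower bound on the Wasserstein distance.
rng = np.random.default_rng(0)
p = rng.standard_normal(10_000)          # samples from P = N(0, 1)
q = rng.standard_normal(10_000) + 2.0    # samples from Q = N(2, 1)

f = lambda x: -x                         # f(x) = -x is 1-Lipschitz (and optimal here)
bound = f(p).mean() - f(q).mean()        # approaches W1 = 2 for these Gaussians
print(bound)
```

A tighter bound (as the Cayley layers achieve in Table 8) indicates a more expressive family of Lipschitz-constrained critics, since the supremum is taken over that family.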



Figure 1: (Titles: $c_{in} \to c_{out}$). Our layer remains orthogonal in practice even for large convolutions (d, e), and is norm-preserving even when $c_{out} > c_{in}$ (b, c); it is non-expansive when $c_{in} > c_{out}$ (a).

Figure 2: The provable robustness vs. clean accuracy tradeoff enabled by scaling the Lipschitz upper bound for KWLarge. We found it useful to report on empirical local Lipschitz constants throughout training, using the PGD-like method from Yang et al. (2020).


Figure 6: Our Cayley layer is particularly efficient for inputs with small spatial dimensions (width and height, i.e., $n$; see (a), (c)), large kernel size $k$ (see (e)), and unequal numbers of input and output channels ($c_{in} \neq c_{out}$; see (f)). For very large spatial size (see (d)), or the combination of relatively large spatial size and many channels (see (b)), BCOP (Li et al., 2019) tends to be more efficient. Since convolutional layers in neural networks tend to decrease the spatial dimensionality while increasing the number of channels, and also often have unequal numbers of input and output channels, our Cayley layer is often more efficient in practice. In all cases, orthogonal convolutional layers are significantly slower than plain convolutional layers.

Clean        75.33 ± .41   75.11 ± .37   74.47 ± .28   73.92 ± .27   71.69 ± .34   72.43 ± .84   74.35 ± .33
PGD          …     ± .31   67.29 ± .35   68.32 ± .22   68.03 ± .28   65.13 ± .10   66.43 ± .62   67.29 ± .52
AutoAttack   65.13 ± .48   64.62 ± .31   66.10 ± .26   65.95 ± .25   62.92 ± .16   64.27 ± .67   65.00 ± .58
Certified    59.16 ± .36   58.29 ± .19   57.50 ± .17   57.48 ± .34   55.71 ± .57   52.11 ± .90   59.99 ± .40
Emp. Lip.    0.740 ± .01   0.740 ± .02   0.667 ± .01   0.668 ± .01   0.716 ± .01   0.570 ± .02   0.648 ± .01

Trained without normalizing inputs; mean and s.d. from 5 experiments reported. Our Cayley layer outperforms other methods in both test and $\ell_2$ certifiable robust accuracy.

Empirical adversarial robustness for residual networks, mean and standard deviation for ResNet9 from 3 experiments. Cayley layers perform competitively on clean and robust accuracy.

Additional baseline for KWLarge trained for provable robustness. Mean and s.d. over 5 trials.

Additional baselines for ResNet9 trained for empirical adversarial robustness. Mean and s.d. over 3 trials.


Definition A.3. The Cayley transform is a bijection between skew-Hermitian matrices and unitary matrices; for real matrices, it is a bijection between skew-symmetric and orthogonal matrices. We apply the Cayley transform to an arbitrary matrix by first computing its skew-Hermitian part: we define the function $\mathrm{cayley} : \mathbb{C}^{m \times m} \to \mathbb{C}^{m \times m}$ by

$$\mathrm{cayley}(B) = (I_m - B + B^*)(I_m + B - B^*)^{-1},$$

where we compute the skew-Hermitian part of $B$ inline as $B - B^*$. Note that the Cayley transform of a real matrix is always real, i.e., $\mathrm{Im}(B) = 0 \Rightarrow \mathrm{Im}(\mathrm{cayley}(B)) = 0$, in which case $B - B^* = B - B^T$ is a skew-symmetric matrix.

We now note a simple but important fact that we will use to show that our convolutions are always exactly real despite manipulating their complex representations in the Fourier domain.

Lemma A.2. Say $J \in \mathbb{C}^{m \times m}$ is unitary so that $J^* J = I$, and $B = J \tilde{B} J^*$ for $B \in \mathbb{R}^{m \times m}$ and $\tilde{B} \in \mathbb{C}^{m \times m}$. Then $\mathrm{cayley}(B) = J\,\mathrm{cayley}(\tilde{B})\,J^*$.

Proof. First note that $B = J \tilde{B} J^*$ implies $B^T = B^* = (J \tilde{B} J^*)^* = J \tilde{B}^* J^*$. Then

$$\mathrm{cayley}(B) = (I - B + B^T)(I + B - B^T)^{-1} = (I - J \tilde{B} J^* + J \tilde{B}^* J^*)(I + J \tilde{B} J^* - J \tilde{B}^* J^*)^{-1} = J(I - \tilde{B} + \tilde{B}^*) J^* \big(J(I + \tilde{B} - \tilde{B}^*) J^*\big)^{-1} = J(I - \tilde{B} + \tilde{B}^*)(I + \tilde{B} - \tilde{B}^*)^{-1} J^* = J\,\mathrm{cayley}(\tilde{B})\,J^*.$$

For the rest of this section, we drop the subscripts of $F$ and $S$ when they can be inferred from context.

Theorem A.3. If $C$ represents a real multi-channel 2D circular convolution with $c_{in} = c_{out}$, block-diagonalized as $F C F^* = D$ per Corollary A.1.1, then $\mathrm{cayley}(C) = F^*\,\mathrm{cayley}(D)\,F$ is a real orthogonal convolution.

Proof. Note that $F = S(I \otimes (F \otimes F))$ is unitary, since $S$ is a permutation matrix and is thus orthogonal, and $I \otimes (F \otimes F)$ is unitary. Then apply Lemma A.2, where we have $J = F^*$, $B = C$, and $\tilde{B} = D$, to see the result. Note that $\mathrm{cayley}(C)$ is real because $C$ is real; that is, even though we apply the Cayley transform to skew-Hermitian matrices in the Fourier domain, the resulting convolution is real.

Remark. While we deal with skew-Hermitian matrices in the Fourier domain, we are still effectively parameterizing the Cayley transform in terms of skew-symmetric matrices: as in the note in Lemma A.2, we can see that $C - C^T = F^*(D - D^*)F$, where $C$ is real, $D$ is complex, and $C - C^T$ is skew-symmetric (in the spatial domain) despite being computed with a skew-Hermitian matrix $D - D^*$ in the Fourier domain.

Remark. Since $D$ is block diagonal, we only need to apply the Cayley transform to (and thus invert) its $n^2$ blocks of size $c \times c$, which are much smaller than the whole matrix.

In many cases, convolutional layers do not have $c_{in} = c_{out}$, in which case they cannot be orthogonal. Rather, we must resort to enforcing semi-orthogonality. We can semi-orthogonalize convolutions using the same techniques as above.

Lemma A.4. Right-padding the multi-channel 2D circular convolution matrix $C$ (from $c_{in}$ to $c_{out}$ channels) with $d n^2$ columns of zeros is equivalent to padding each diagonal block of the corresponding block-diagonal matrix $D$ on the right with $d$ columns of zeros, i.e., $[C \;\; 0_{d n^2}]$ corresponds to the blocks $[D_k \;\; 0_d]$, where $0_k$ refers to $k$ columns of zeros and a compatible number of rows.

Proof. For a fixed column $j$ of the padding, note that the permutations in $F$ send zero columns to zero columns, so each appended zero column of $C$ corresponds to an appended zero column of one block $D_k$.

Lemma A.5. Conversely, removing the last $d$ columns of each diagonal block $D_1, \ldots, D_{n^2}$ of $D$ is equivalent to removing the corresponding $d n^2$ columns of $C$.

Proof. This proceeds similarly to the previous lemma: removing columns of each of the $n^2$ matrices $D_1, \ldots, D_{n^2}$ implies removing the corresponding blocks of columns of $D$, and thus of $C$.

Theorem A.6. For $c_{in} < c_{out}$ with $d = c_{out} - c_{in}$, right-padding $C$ with $d n^2$ columns of zeros, applying the Cayley transform, and removing the last $d n^2$ columns yields a real 2D multi-channel semi-orthogonal circular convolution.

Proof. For the first step, we use Lemma A.4 for right padding. Then, noting that $[C \;\; 0_{d n^2}]$ is a convolution matrix with $c_{in} = c_{out}$, we can apply Theorem A.3 (and the following remark) to see that its Cayley transform is still a real convolution matrix. Finally, we can apply Lemma A.5 to get the result.
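Definition A.3 is easy to sanity-check numerically. The following minimal numpy sketch is our own illustration: it applies the Cayley transform to the skew-Hermitian part of a random real matrix and verifies that the result is orthogonal.

```python
import numpy as np

# The Cayley transform of (the skew-Hermitian part of) an arbitrary matrix,
# as in Definition A.3: cayley(B) = (I - B + B*)(I + B - B*)^{-1}.
def cayley(B):
    I = np.eye(B.shape[0])
    S = B - B.conj().T  # skew-Hermitian part (skew-symmetric for real B)
    # I + S is always nonsingular: S has purely imaginary eigenvalues.
    return (I - S) @ np.linalg.inv(I + S)

rng = np.random.default_rng(0)
Q = cayley(rng.standard_normal((8, 8)))

# The transform of a real matrix is real and orthogonal: Q^T Q = I.
print(np.allclose(Q.T @ Q, np.eye(8)))
```

The same function applied to a skew-Hermitian complex matrix returns a unitary matrix, which is the Fourier-domain case used by the layer.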

Published as a conference paper at ICLR 2021

This demonstrates that we can semi-orthogonalize convolutions with $c_{in} \neq c_{out}$ by first padding them so that $c_{in} = c_{out}$; despite performing padding, the Cayley transform, and projections on complex matrices in the Fourier domain, we have shown that the resulting convolution is still real. In practice, we do not literally perform padding or projections; we explain how to do an equivalent but more efficient computation on each diagonal block $D_k \in \mathbb{C}^{c_{out} \times c_{in}}$ below.

Proposition A.7. We can efficiently compute the Cayley transform for semi-orthogonalization, i.e., $\mathrm{cayley}([W \;\; 0_d]) \begin{bmatrix} I \\ 0 \end{bmatrix}$, when $c_{in} \le c_{out}$ by writing the inverse in terms of the Schur complement.

Proof. We can partition $W \in \mathbb{C}^{c_{out} \times c_{in}}$ into its top part $U \in \mathbb{C}^{c_{in} \times c_{in}}$ and bottom part $V \in \mathbb{C}^{(c_{out} - c_{in}) \times c_{in}}$, and then write the padded matrix in terms of $U$ and $V$. Taking the skew-Hermitian part and applying the Cayley transform, then projecting, we get the expression in Eq. A20. We focus on computing the inverse while keeping only the first $c_{in}$ columns. We use the inversion formula noted in Zhang (2006, p. 13) for a block-partitioned matrix $M$, where we assume $M$ takes the form of the inverse in Eq. A20, and $M/S = P - Q S^{-1} R$ is the Schur complement. Using this formula for the first $c_{in}$ columns of the inverse in Eq. A20, and computing the Schur complement, we obtain

$$\mathrm{cayley}([W \;\; 0_d]) \begin{bmatrix} I \\ 0 \end{bmatrix} = \begin{bmatrix} (I + A)^{-1}(I - A) \\ -2\,V (I + A)^{-1} \end{bmatrix}, \qquad A = U - U^* + V^* V,$$

which is semi-orthogonal and requires computing only one inverse of size $c_{in} \le c_{out}$. Note that this inverse always exists because $U - U^*$ is skew-Hermitian, so it has purely imaginary eigenvalues, and $V^* V$ is positive semidefinite and has all real non-negative eigenvalues. That is, the sum $I_{c_{in}} + U - U^* + V^* V$ has all nonzero eigenvalues and is thus nonsingular.

Proposition A.8. We can also compute semi-orthogonal convolutions when $c_{in} \ge c_{out}$ using the method described above, because $\mathrm{cayley}(B^T) = \mathrm{cayley}(B)^T$: we may transpose, apply Proposition A.7, and transpose the result.

Proof. We use that $(A^{-1})^T = (A^T)^{-1}$ and that $(I - S)$ commutes with $(I + S)^{-1}$, so with $S = B - B^*$ we have $\mathrm{cayley}(B)^T = (I + S^T)^{-1}(I - S^T) = \mathrm{cayley}(B^T)$.

We have thus shown how to (semi-)orthogonalize real multi-channel 2D circular convolutions efficiently in the Fourier domain.
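The Schur-complement computation of Proposition A.7 can be sketched numerically as follows. This is our own numpy illustration for the real case; `cayley_semi` is a hypothetical name, but the formula mirrors the efficient computation above.

```python
import numpy as np

# Semi-orthogonalizing Cayley transform for W with cout >= cin:
# split W into U (cin x cin) and V ((cout - cin) x cin); with
# A = U - U^T + V^T V, stack (I + A)^{-1}(I - A) over -2 V (I + A)^{-1}.
# Only one cin x cin inverse is needed, never a cout x cout one.
def cayley_semi(W):
    cout, cin = W.shape
    assert cout >= cin
    U, V = W[:cin], W[cin:]
    I = np.eye(cin)
    A = U - U.T + V.T @ V
    inv = np.linalg.inv(I + A)  # nonsingular: skew-symmetric + PSD + I
    return np.vstack([inv @ (I - A), -2 * V @ inv])

rng = np.random.default_rng(0)
Q = cayley_semi(rng.standard_normal((10, 4)))

# Semi-orthogonality: Q^T Q = I, so Q is exactly norm-preserving.
print(np.allclose(Q.T @ Q, np.eye(4)))
```

This is the same computation the PyTorch implementation in Appendix E performs on each Fourier-domain block, there with complex arithmetic and conjugate transposes.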
A minimal implementation of our method can be found in Appendix E. The techniques described above could also be used with other orthogonalization methods, or for calculating the determinants or singular values of convolutions.

B. Adversarial Robustness

For KWLarge, our results on empirical robustness were mixed: while our Cayley layer outperforms BCOP in robust accuracy, the RKO methods are overall more robust by around 2%, for only a marginal decrease in clean accuracy. We note the lower empirical local Lipschitzness of RKO methods, which may explain their higher robustness: Figure 4 shows that the best choice of Lipschitz upper bound for Cayley and BCOP layers may be less than 1 for this architecture.

Table 5: Our Cayley layer was not as fast for residual networks, possibly because they have convolutions with more channels and larger spatial dimensions, which are a multiplicative factor in our runtime analysis. This is especially true for the WideResNet. For "plain conv", we replaced the Cayley convolutional layer with a plain circular convolution, leaving the Cayley fully-connected layers. For "both plain", we also used plain fully-connected layers.

E. Example Implementation

In PyTorch 1.8, our layer can be implemented as follows.

    import torch
    import torch.nn as nn

    def cayley(W):
        if len(W.shape) == 2:
            return cayley(W[None])[0]
        _, cout, cin = W.shape
        if cin > cout:
            return cayley(W.transpose(1, 2)).transpose(1, 2)
        U, V = W[:, :cin], W[:, cin:]
        I = torch.eye(cin, dtype=W.dtype, device=W.device)[None, :, :]
        A = U - U.conj().transpose(1, 2) + V.conj().transpose(1, 2) @ V
        inv = torch.inverse(I + A)
        return torch.cat((inv @ (I - A), -2 * V @ inv), axis=1)

    class CayleyConv(nn.Conv2d):
        # Note: parts of the forward pass were lost in extraction; the lines
        # computing wfft and xfft are reconstructed to match the surviving
        # reshape/permute calls, assuming rfft2-based circular convolution.
        def forward(self, x):
            cout, cin, _, _ = self.weight.shape
            batches, _, n, _ = x.shape
            wfft = torch.fft.rfft2(self.weight, (n, n)) \
                .reshape(cout, cin, n * (n // 2 + 1)).permute(2, 0, 1).conj()
            xfft = torch.fft.rfft2(x).permute(2, 3, 1, 0) \
                .reshape(n * (n // 2 + 1), cin, batches)
            yfft = (cayley(wfft) @ xfft).reshape(n, n // 2 + 1, cout, batches)
            y = torch.fft.irfft2(yfft.permute(3, 2, 0, 1))
            if self.bias is not None:
                y += self.bias[:, None, None]
            return y

To make the layer support stride-2 convolutions, have CayleyConv inherit from the following class instead, which depends on the einops package:

