Orthogonalizing Convolutions with the Cayley Transform

Published as a conference paper at ICLR 2021

Abstract

Recent work has highlighted several advantages of enforcing orthogonality in the weight layers of deep networks, such as maintaining the stability of activations, preserving gradient norms, and enhancing adversarial robustness by enforcing low Lipschitz constants. Although numerous methods exist for enforcing the orthogonality of fully-connected layers, those for convolutional layers are more heuristic in nature, often focusing on penalty methods or limited classes of convolutions. In this work, we propose and evaluate an alternative approach to directly parameterize convolutional layers that are constrained to be orthogonal. Specifically, we propose to apply the Cayley transform to a skew-symmetric convolution in the Fourier domain, so that the inverse convolution needed by the Cayley transform can be computed efficiently. We compare our method to previous Lipschitz-constrained and orthogonal convolutional layers and show that it indeed preserves orthogonality to a high degree even for large convolutions. Applied to the problem of certified adversarial robustness, we show that networks incorporating the layer outperform existing deterministic methods for certified defense against ℓ2-norm-bounded adversaries, while scaling to larger architectures than previously investigated. Code is available at https://github.com/locuslab/orthogonal-convolutions.

1. Introduction

Encouraging orthogonality in neural networks has proven to yield several compelling benefits. For example, orthogonal initializations allow extremely deep vanilla convolutional neural networks to be trained quickly and stably (Xiao et al., 2018; Saxe et al., 2013), and initializations that remain closer to orthogonality throughout training seem to learn faster and generalize better (Pennington et al., 2017). Unlike merely Lipschitz-constrained layers, orthogonal layers are gradient-norm-preserving (Anil et al., 2019), discouraging vanishing and exploding gradients and stabilizing activations (Rodríguez et al., 2017). Orthogonality is thus a potential alternative to batch normalization in CNNs and can help with remembering long-term dependencies in RNNs (Arjovsky et al., 2016; Vorontsov et al., 2017). Constraints and penalty terms encouraging orthogonality can improve generalization in practice (Bansal et al., 2018; Sedghi et al., 2018), improve adversarial robustness by enforcing low Lipschitz constants, and allow deterministic certificates of robustness (Tsuzuku et al., 2018). Despite this evidence for the benefits of orthogonality, and while there are many methods to orthogonalize fully-connected layers, orthogonalizing convolutions has posed challenges. Current Lipschitz-constrained convolutions rely on spectral normalization and kernel reshaping methods (Tsuzuku et al., 2018), which allow only loose bounds and can cause vanishing gradients. Sedghi et al. (2018) showed how to clip the singular values of convolutions and thus enforce orthogonality, but relied on costly alternating projections to achieve tight constraints. Most recently, Li et al. (2019) introduced the Block Convolution Orthogonal Parameterization (BCOP), which cannot express the full space of orthogonal convolutions. In contrast to previous work, we provide a direct, expressive, and scalable parameterization of orthogonal convolutions.
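The singular-value clipping of Sedghi et al. (2018) cited above rests on the fact that, for a stride-1 convolution with circular padding, the exact singular values of the convolution are the singular values of small per-frequency matrices of the kernel's 2-D FFT. The following NumPy sketch illustrates that fact; the sizes and the explicit cross-check matrix are illustrative only, not taken from any released implementation.

```python
import numpy as np

rng = np.random.default_rng(1)
n, cin, cout, k = 8, 2, 3, 3                  # input size, channels, kernel size
kernel = rng.standard_normal((cout, cin, k, k))

# Per Sedghi et al. (2018): with circular padding and stride 1, the singular
# values of the convolution are exactly those of the n*n per-frequency
# cout x cin matrices K_hat[:, :, u, v] of the kernel's 2-D FFT.
K_hat = np.fft.fft2(kernel, s=(n, n))          # (cout, cin, n, n)
svals = np.linalg.svd(K_hat.transpose(2, 3, 0, 1), compute_uv=False)
spectral_norm = svals.max()                    # exact, not a loose bound

# Cross-check: build the (cout*n*n) x (cin*n*n) matrix of the circular
# convolution explicitly, column by column from basis inputs.
M = np.zeros((cout * n * n, cin * n * n))
for col in range(cin * n * n):
    x = np.zeros(cin * n * n)
    x[col] = 1.0
    X_hat = np.fft.fft2(x.reshape(cin, n, n))
    y = np.fft.ifft2(np.einsum('oiuv,iuv->ouv', K_hat, X_hat)).real
    M[:, col] = y.ravel()

print(np.isclose(np.linalg.norm(M, 2), spectral_norm))  # True
```

Because the FFT is unitary up to scaling, the convolution is block-diagonal in the Fourier domain, so its spectrum is the union of the per-frequency spectra; enforcing orthogonality amounts to making every per-frequency matrix unitary.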
Our method relies on the Cayley transform, which is well known for parameterizing orthogonal matrices in terms of skew-symmetric matrices and can be easily extended to non-square weight matrices. The transform requires efficiently computing the inverse of a particular convolution, which we carry out in the Fourier domain and show works well in practice. We demonstrate that our Cayley layer is indeed orthogonal in practice when implemented in 32-bit precision, irrespective of the number of channels. Further, we compare it to alternative convolutional and Lipschitz-constrained layers: we include them in several architectures and evaluate their deterministic certifiable robustness against an ℓ2-norm-bounded adversary. Our layer provides state-of-the-art results on this task. We also demonstrate that the layers empirically endow a considerable degree of robustness without adversarial training. Our layer generally outperforms the alternatives, particularly for larger architectures.
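As a rough illustration of the construction sketched above, the following NumPy snippet forms a skew-Hermitian matrix at each frequency of the kernel's FFT, applies the Cayley transform per frequency, and checks that the resulting convolution preserves the ℓ2 norm. It is a simplified sketch of the idea for the square-channel, stride-1, circular-padding case; the paper's actual layer differs (e.g., it handles strides and non-square weights and uses the real FFT).

```python
import numpy as np

rng = np.random.default_rng(0)
n, c = 8, 2                                    # input size, channels
W = rng.standard_normal((c, c, 3, 3))          # unconstrained parameter kernel

# FFT of the kernel: one c x c complex matrix per frequency (u, v).
W_hat = np.fft.fft2(W, s=(n, n)).transpose(2, 3, 0, 1)      # (n, n, c, c)

# Skew-Hermitian part per frequency, then the Cayley transform
# Q = (I - A)(I + A)^{-1}, which is unitary at every frequency
# (I + A is always invertible for skew-Hermitian A).
A = W_hat - np.conj(W_hat.transpose(0, 1, 3, 2))
I = np.eye(c)
Q = (I - A) @ np.linalg.inv(I + A)             # batched over frequencies

# Apply the resulting convolution: multiply per frequency, invert the FFT.
x = rng.standard_normal((c, n, n))
Y_hat = np.einsum('uvoi,iuv->ouv', Q, np.fft.fft2(x))
y = np.fft.ifft2(Y_hat)

# The output is real (by conjugate symmetry of Q across frequencies),
# and the map preserves the l2 norm, i.e., the convolution is orthogonal.
print(np.allclose(y.imag, 0))                                  # True
print(np.isclose(np.linalg.norm(y.real), np.linalg.norm(x)))   # True
```

The key efficiency point is visible here: the matrix inverse required by the Cayley transform is taken over many small c × c matrices in the Fourier domain, rather than over one enormous matrix representing the whole convolution.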

2. Related Work

Orthogonality in neural networks. The benefits of orthogonal weight initializations for dynamical isometry, i.e., ensuring that signals propagate faithfully through deep networks, are explained by Saxe et al. (2013) and Pennington et al. (2017), with theoretical guarantees investigated by Hu et al. (2020). Xiao et al. (2018) provided a method to initialize orthogonal convolutions and demonstrated that it allows the training of extremely deep CNNs without batch normalization or residual connections. Further, Qi et al. (2020) developed a regularization term to encourage orthogonality throughout training and showed its effectiveness for training very deep vanilla networks. The signal-preserving properties of orthogonality can also help with remembering long-term dependencies in RNNs, on which there has been much work (Helfrich et al., 2018; Arjovsky et al., 2016). Li et al. (2020) derive an iterative approximation of the Cayley transform for orthogonally-constrained optimizers and show that it speeds the convergence of CNNs and RNNs. However, they merely orthogonalize a matrix obtained by reshaping the kernel, which is not the same as an orthogonal convolution (Sedghi et al., 2018). Our contribution is unique here in that we parameterize orthogonal convolutions directly, rather than reshaping kernels.

Bounding neural network Lipschitzness. Orthogonality imposes a strict constraint on the Lipschitz constant of a layer, which itself comes with many benefits: lower Lipschitz constants are associated with improved robustness (Yang et al., 2020) and better generalization bounds (Bartlett et al., 2017). Tsuzuku et al. (2018) showed that neural network classifications can be certified as robust to ℓ2-norm-bounded perturbations given a Lipschitz bound and sufficiently confident classifications. Along with Szegedy et al. (2013), they noted that the Lipschitz constant of a network can be bounded using the constants of its layers. Thus, there is substantial work on Lipschitz-constrained and -regularized layers, which we review in Sec. 5. However, Anil et al. (2019) realized that mere Lipschitz constraints can attenuate gradients, unlike orthogonal layers. Other certified defenses rely on exact verification methods (Katz et al., 2017; Ehlers, 2017; Carlini & Wagner, 2017), integer programming approaches (Lomuscio & Maganti, 2017; Tjeng & Tedrake, 2017; Cheng et al., 2017), or semi-definite programming (Raghunathan et al., 2018). Wong et al. (2018)'s approach of minimizing an LP-based bound on the robust loss is more scalable, but networks built from Lipschitz-constrained components can be more efficient still, as shown by Li et al. (2019), who outperform that approach. However, none of these deterministic methods yet perform as well as probabilistic certification (Cohen et al., 2019). Consequently, orthogonal layers appear to be an important component for enhancing the convergence of deep networks while encouraging robustness and generalization.
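The Lipschitz-based certificate of Tsuzuku et al. (2018) reduces to a one-line computation: for a network that is L-Lipschitz with respect to the ℓ2 norm, a margin m between the top two logits certifies robustness within radius m/(√2·L). A minimal sketch with hypothetical numbers follows; the function name and values are illustrative only, not from any paper.

```python
import numpy as np

def certified_radius(logits: np.ndarray, lipschitz_const: float) -> float:
    """Certified l2 radius in the style of Tsuzuku et al. (2018): if the
    margin between the top two logits is m and the network is L-Lipschitz,
    no perturbation of norm below m / (sqrt(2) * L) can flip the prediction."""
    top2 = np.sort(logits)[-2:]
    margin = float(top2[1] - top2[0])
    return margin / (np.sqrt(2.0) * lipschitz_const)

# Hypothetical example: a 1-Lipschitz network (e.g., one composed of
# orthogonal convolutions and norm-preserving activations) whose logits
# have a margin of 2.5 between the top two classes.
logits = np.array([3.0, 0.5, -1.0])
print(certified_radius(logits, lipschitz_const=1.0))   # 2.5 / sqrt(2) ≈ 1.7678
```

This is why tight Lipschitz bounds matter: a loose bound on L directly shrinks the certified radius, while orthogonal layers pin each layer's constant at exactly 1.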

