ONE REFLECTION SUFFICE

Abstract

Orthogonal weight matrices are used in many areas of deep learning. Much previous work attempts to alleviate the additional computational resources required to constrain weight matrices to be orthogonal. One popular approach represents orthogonal matrices as a product of many Householder reflections. Its only practical drawback is that many reflections cause low GPU utilization. We mitigate this drawback by proving that one reflection is sufficient, if the reflection is computed by an auxiliary neural network.

1. INTRODUCTION

Orthogonal matrices have shown several benefits in deep learning, with successful applications in Recurrent Neural Networks, Convolutional Neural Networks and Normalizing Flows. One popular approach can represent any d × d orthogonal matrix using d Householder reflections (Mhammedi et al., 2017). The only practical drawback is low GPU utilization, which happens because the d reflections need to be evaluated sequentially (Mathiasen et al., 2020). Previous work often increases GPU utilization by using k < d reflections (Tomczak & Welling, 2016; Mhammedi et al., 2017; Zhang et al., 2018; Berg et al., 2018). Using fewer reflections limits the orthogonal transformations the reflections can represent, yielding a trade-off between representational power and computation time. This raises an intriguing question: can we circumvent the trade-off and attain full representational power without sacrificing computation time? We answer this question with a surprising "yes." The key idea is to use an auxiliary neural network to compute a different reflection for each input. In theory, we prove that one such "auxiliary reflection" can represent any number of normal reflections. In practice, we demonstrate that one auxiliary reflection attains validation error similar to models with d normal reflections, when training Fully Connected Neural Networks (Figure 1, left), Recurrent Neural Networks (Figure 1, center) and convolutions in Normalizing Flows (Figure 1, right). Notably, auxiliary reflections train between 2 and 6 times faster for Fully Connected Neural Networks with orthogonal weight matrices (see Section 3).
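The sequential bottleneck described above can be made concrete with a small sketch (NumPy, not the authors' implementation): applying k Householder reflections requires k dependent matrix-vector products, so the loop cannot be parallelized across reflections.

```python
import numpy as np

def householder_product(vs, x):
    """Apply H(v_1) ... H(v_k) to x. Each reflection acts on the
    result of the previous one, so the k steps run sequentially."""
    for v in reversed(vs):  # rightmost reflection is applied first
        x = x - 2.0 * v * (v @ x) / (v @ v)
    return x

rng = np.random.default_rng(0)
d = 8
vs = rng.standard_normal((d, d))  # d reflection vectors v_1, ..., v_d
x = rng.standard_normal(d)
y = householder_product(vs, x)
# A product of Householder reflections is orthogonal, so norms are preserved.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```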

Figure 1: Models with one auxiliary reflection attain validation error similar to models with many reflections, for Fully Connected Neural Networks (left), Recurrent Neural Networks (center) and convolutions in Normalizing Flows (right). Lower error means better performance. See Section 3 for details.

1.1. OUR RESULTS

The Householder reflection of x ∈ R^d around v ∈ R^d can be represented by a matrix H(v) ∈ R^{d×d}:

H(v)x = (I - 2 vv^T / ||v||^2) x.

An auxiliary reflection uses a Householder matrix H(v) with v = n(x) for a neural network n:

f(x) = H(n(x))x = (I - 2 n(x)n(x)^T / ||n(x)||^2) x.

One auxiliary reflection can represent any composition of Householder reflections. We prove this claim even when we restrict the neural network n(x) to have a single linear layer n(x) = Wx for W ∈ R^{d×d}, so that f(x) = H(Wx)x.

Theorem 1. For any k Householder reflections U = H(v_1) ⋯ H(v_k) there exists a neural network n(x) = Wx with W ∈ R^{d×d} such that f(x) = H(Wx)x = Ux for all x ∈ R^d \ {0}.

Previous work (Mhammedi et al., 2017; Zhang et al., 2018) often employs k < d reflections and computes Ux as k sequential Householder reflections H(v_1) ⋯ H(v_k) x with weights V = (v_1 ⋯ v_k). It is the evaluation of these sequential Householder reflections that causes low GPU utilization (Mathiasen et al., 2020), so lower values of k increase GPU utilization but decrease representational power. Theorem 1 states that it is sufficient to evaluate a single auxiliary reflection H(Wx)x instead of k reflections H(v_1) ⋯ H(v_k) x, thereby gaining high GPU utilization while retaining the full representational power of any number of reflections.

In practice, we demonstrate that d reflections can be substituted with a single auxiliary reflection without decreasing validation error, when training Fully Connected Neural Networks (Section 3.1), Recurrent Neural Networks (Section 3.2) and Normalizing Flows (Section 3.3). While the use of auxiliary reflections is straightforward for Fully Connected and Recurrent Neural Networks, we needed additional ideas to support auxiliary reflections in Normalizing Flows. In particular, we developed further theory concerning the inverse and Jacobian of f(x) = H(Wx)x.
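The auxiliary reflection with a single linear layer is only a few lines of code. A minimal NumPy sketch (illustrative names, not the paper's code):

```python
import numpy as np

def auxiliary_reflection(W, x):
    """f(x) = H(Wx)x = (I - 2 (Wx)(Wx)^T / ||Wx||^2) x:
    one Householder reflection whose axis Wx depends on the input x."""
    v = W @ x
    return x - 2.0 * v * (v @ x) / (v @ v)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4))
x = rng.standard_normal(4)
y = auxiliary_reflection(W, x)
# For each fixed x, H(Wx) is orthogonal, so f preserves the norm of x.
assert np.isclose(np.linalg.norm(y), np.linalg.norm(x))
```

Note that although each H(Wx) is an orthogonal matrix, f itself is a nonlinear map, since the reflection axis changes with the input.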
Note that f is invertible if there exists a unique x given y = H(Wx)x and W.

Theorem 2. Let f(x) = H(Wx)x with f(0) := 0. Then f is invertible on R^d with d ≥ 2 if W = W^T and the eigenvalues of W satisfy 3/2 · λ_min(W) > λ_max(W).

Finally, we present a matrix formula for the Jacobian of the auxiliary reflection f(x) = H(Wx)x. This matrix formula is used in our proof of Theorem 2, but it also allows us to simplify the Jacobian determinant (Lemma 1), which is needed when training Normalizing Flows.

Theorem 3. The Jacobian of f(x) = H(Wx)x is

J = H(Wx)A - 2 Wx x^T W / ||Wx||^2,   where   A = I - 2 (x^T W^T x / ||Wx||^2) W.

We prove Theorem 1 in Appendix A.1.1, while Theorems 2 and 3 are proved in Section 2.
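The closed-form Jacobian of Theorem 3 can be checked numerically against finite differences. A sketch (hypothetical helper names, assuming the formula as stated above):

```python
import numpy as np

def f(W, x):
    """Auxiliary reflection f(x) = H(Wx)x."""
    u = W @ x
    return x - 2.0 * u * (u @ x) / (u @ u)

def jacobian_closed_form(W, x):
    """Theorem 3: J = H(Wx) A - 2 Wx x^T W / ||Wx||^2,
    with A = I - 2 (x^T W^T x / ||Wx||^2) W."""
    d = x.size
    u = W @ x
    r = u @ u                                  # ||Wx||^2
    H = np.eye(d) - 2.0 * np.outer(u, u) / r   # H(Wx)
    A = np.eye(d) - 2.0 * (x @ W.T @ x) / r * W
    return H @ A - 2.0 * np.outer(u, x) @ W / r

rng = np.random.default_rng(1)
d = 5
W = rng.standard_normal((d, d))
x = rng.standard_normal(d)

# Central finite differences: column j approximates df/dx_j.
eps = 1e-6
J_fd = np.stack([(f(W, x + eps * e) - f(W, x - eps * e)) / (2 * eps)
                 for e in np.eye(d)], axis=1)
assert np.allclose(jacobian_closed_form(W, x), J_fd, atol=1e-5)
```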

2. NORMALIZING FLOWS

2.1. BACKGROUND

Let z ∼ N(0, 1)^d and let f be an invertible neural network. Then f^{-1}(z) ∼ P_model defines a model distribution for which we can compute the likelihood of x ∼ P_data (Dinh et al., 2015):

log p_model(x) = log p_z(f(x)) + log |det ∂f(x)/∂x|.

This allows us to train invertible neural networks as generative models by maximum likelihood. Previous work demonstrates how to construct invertible neural networks and efficiently compute the log Jacobian determinant (Dinh et al., 2017; Kingma & Dhariwal, 2018; Ho et al., 2019).




