ONE REFLECTION SUFFICE

Abstract

Orthogonal weight matrices are used in many areas of deep learning. Much previous work attempts to reduce the additional computational cost required to constrain weight matrices to be orthogonal. One popular approach represents orthogonal matrices as a product of many Householder reflections. Its only practical drawback is that many reflections cause low GPU utilization. We mitigate this drawback by proving that one reflection is sufficient, if the reflection is computed by an auxiliary neural network.

1. INTRODUCTION

Orthogonal matrices have shown several benefits in deep learning, with successful applications in Recurrent Neural Networks, Convolutional Neural Networks and Normalizing Flows. One popular approach can represent any d × d orthogonal matrix using d Householder reflections (Mhammedi et al., 2017). The only practical drawback is low GPU utilization, which arises because the d reflections need to be evaluated sequentially (Mathiasen et al., 2020). Previous work often increases GPU utilization by using k < d reflections (Tomczak & Welling, 2016; Mhammedi et al., 2017; Zhang et al., 2018; Berg et al., 2018). Using fewer reflections limits the orthogonal transformations the reflections can represent, yielding a trade-off between representational power and computation time. This raises an intriguing question: can we circumvent the trade-off and attain full representational power without sacrificing computation time? We answer this question with a surprising "yes." The key idea is to use an auxiliary neural network to compute a different reflection for each input. In theory, we prove that one such "auxiliary reflection" can represent any number of normal reflections. In practice, we demonstrate that one auxiliary reflection attains validation error similar to that of models with d normal reflections when training Fully Connected Neural Networks (Figure 1 left), Recurrent Neural Networks (Figure 1 center) and convolutions in Normalizing Flows (Figure 1 right). Notably, auxiliary reflections train between 2 and 6 times faster for Fully Connected Neural Networks with orthogonal weight matrices (see Section 3).

Figure 1: Models with one auxiliary reflection attain validation error similar to models with many reflections; lower error means better performance. Panels: Fully Connected Neural Networks (left), Recurrent Neural Networks (center), convolutions in Normalizing Flows (right). See Section 3 for details.
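The contrast between the two parameterizations can be sketched in a few lines of NumPy. Below, a Householder reflection H(v)x = x − 2v(vᵀx)/‖v‖² is first applied d times with d fixed reflection vectors (the standard, inherently sequential parameterization), and then once with a single input-dependent vector v = n(x) produced by a small network. The two-layer network n and its weights W1, W2 are hypothetical placeholders for illustration, not the architecture used in this paper; since every reflection is an orthogonal map, both variants preserve the norm of the input.

```python
import numpy as np

def householder(v, x):
    # Apply the Householder reflection H(v) x = x - 2 v (v^T x) / ||v||^2.
    return x - 2.0 * v * (v @ x) / (v @ v)

rng = np.random.default_rng(0)
d = 4
x = rng.normal(size=d)

# Standard parameterization: d fixed reflection vectors, one per row,
# applied sequentially -- this sequential dependence is what causes
# low GPU utilization.
V = rng.normal(size=(d, d))
y = x
for v in V:
    y = householder(v, y)

# Auxiliary reflection: a small network n(x) outputs ONE reflection
# vector per input; W1, W2 are hypothetical illustration weights.
W1 = rng.normal(size=(d, d))
W2 = rng.normal(size=(d, d))

def aux_reflection(x):
    v = W2 @ np.tanh(W1 @ x)   # v = n(x), input-dependent
    return householder(v, x)   # a single reflection, one pass

z = aux_reflection(x)
# Reflections are orthogonal, so both variants preserve ||x||.
```

In the standard variant the loop body depends on the previous iteration, so the d reflections cannot be parallelized across the product; the auxiliary variant replaces the whole product with one reflection whose vector is computed by an ordinary (GPU-friendly) forward pass.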

