SPARSE LINEAR NETWORKS WITH A FIXED BUTTERFLY STRUCTURE: THEORY AND PRACTICE

Anonymous authors
Paper under double-blind review

Abstract

A butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on the FJLT with high probability approximates the action of any matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network with an architecture based on the butterfly network. The proposed architecture reduces the quadratic number of weights required in a standard dense layer to nearly linear, with little compromise in the expressibility of the resulting operator. In a wide variety of experiments, including supervised prediction on both NLP and vision data, we show that this not only produces results that match and often outperform existing well-known architectures, but also offers faster training and prediction in deployment. To understand the optimization problems posed by neural networks with a butterfly network, we study the optimization landscape of the encoder-decoder network in which the encoder is replaced by a butterfly network followed by a dense linear layer in a smaller dimension. The theoretical results presented in the paper explain why the training speed and outcome are not compromised by our proposed approach. Empirically, we demonstrate that the network performs as well as the encoder-decoder network.

1. INTRODUCTION

A butterfly network (see Figure 6 in Appendix A) is a layered graph connecting a layer of n inputs to a layer of n outputs with O(log n) layers, where each layer contains 2n edges. The edges connecting adjacent layers are organized in disjoint gadgets, each gadget connecting a pair of nodes in one layer with a corresponding pair in the next layer by a complete bipartite graph; the distance between the paired nodes doubles from layer to layer. This network structure represents the execution graph of the Fast Fourier Transform (FFT) (Cooley and Tukey, 1965), the Walsh-Hadamard transform, and many other important transforms in signal processing that are known to have fast matrix-vector product algorithms. Ailon and Chazelle (2009) showed how to use the Fourier (or Hadamard) transform to perform fast Euclidean dimensionality reduction with Johnson and Lindenstrauss (1984) guarantees. The resulting transformation, called the Fast Johnson-Lindenstrauss Transform (FJLT), was improved in subsequent works (Ailon and Liberty, 2009; Krahmer and Ward, 2011). The common theme in this line of work is to define a fast randomized linear transformation composed of a random diagonal matrix, followed by a dense orthogonal transformation which can be represented via a butterfly network, followed by a random projection onto a subset of the coordinates (this research is still active; see e.g. Jain et al. (2020)). In particular, an FJLT matrix can be represented (explicitly) by a butterfly network followed by a projection onto a random subset of coordinates (a truncation operator). We refer to such a representation as a truncated butterfly network (see Section 4). Simple Johnson-Lindenstrauss-like arguments show that with high probability, for any W ∈ R^{n2×n1} and any x ∈ R^{n1}, Wx is close to (J2^T J2) W (J1^T J1) x, where J1 ∈ R^{k1×n1} and J2 ∈ R^{k2×n2} are both FJLTs, and k1 = log n1, k2 = log n2 (see Section 4.2 for details).
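To make the butterfly/FJLT structure concrete, the following minimal NumPy sketch (our own illustration, not code from the paper) implements the Walsh-Hadamard transform as a sequence of butterfly stages and composes it into an FJLT-style map: a random sign diagonal, the butterfly network, and truncation to a random coordinate subset.

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform. Each of the log2(n) stages applies
    disjoint 2x2 gadgets to coordinate pairs at distance h, with h
    doubling from stage to stage -- exactly the butterfly structure."""
    x = x.astype(float).copy()
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    return x / np.sqrt(n)  # orthonormal scaling

def fjlt(x, k, rng):
    """FJLT-style sketch: random sign flip (diagonal D), butterfly
    network (Hadamard H), then projection onto k random coordinates
    (the truncation operator), rescaled to preserve norms in expectation."""
    n = len(x)
    d = rng.choice([-1.0, 1.0], size=n)        # random diagonal
    y = fwht(d * x)                            # butterfly network
    idx = rng.choice(n, size=k, replace=False) # random coordinate subset
    return np.sqrt(n / k) * y[idx]
```

The whole map costs O(n log n) operations via the butterfly stages, versus O(nk) for a dense k × n projection.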
Motivated by this, we propose to replace a dense (fully connected) linear layer of size n2 × n1 in any neural network by the following architecture: J2^T W J1, where J1 and J2 can be represented by truncated butterfly networks and W is a dense k2 × k1 linear layer. The clear advantages of such a strategy are: (1) almost all choices of the weights from a specific distribution, namely one mimicking the FJLT, preserve accuracy while reducing the number of parameters, and (2) the number of weights is nearly linear in the layer width of the original matrix. Our empirical results demonstrate that this offers faster training and prediction in deployment, while producing results that match and often outperform existing known architectures. Compressing neural networks by replacing linear layers with structured linear transforms that are expressed by fewer parameters has been studied extensively in the recent past; we compare our approach with these related works in Section 3. Since the butterfly structure adds logarithmic depth to the architecture, it might pose optimization-related issues. Moreover, the sparse structure of the matrices connecting the layers in a butterfly network defies the general theoretical analysis of convergence of deep linear networks. We take a small step towards understanding these issues by studying the optimization landscape of an encoder-decoder network (a two-layer linear neural network) in which the encoder layer is replaced by a truncated butterfly network followed by a dense linear layer with fewer parameters. This replacement is motivated by the result of Sarlós (2006) on fast randomized low-rank approximation of matrices using the FJLT (see Section 4.2 for details). We consider this replacement, rather than the architecture consisting of two butterfly networks and a dense linear layer proposed above, because it is easier to analyze theoretically.
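A minimal NumPy sketch of the proposed replacement may help fix ideas (function names and the random weight initialization are our own, for illustration only): a butterfly matrix built as a product of log2(n) sparse stage matrices, the forward pass J2^T W J1, and the resulting trainable-parameter count versus a dense layer.

```python
import numpy as np

def butterfly_matrix(n, rng):
    """An n x n butterfly transform, realized as a product of log2(n)
    sparse stage matrices. Stage s pairs coordinates at distance 2**s
    with a dense 2x2 gadget (random weights here, for illustration)."""
    B = np.eye(n)
    h = 1
    while h < n:
        S = np.zeros((n, n))
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                S[j, j], S[j, j + h] = rng.standard_normal(2)
                S[j + h, j], S[j + h, j + h] = rng.standard_normal(2)
        B = S @ B
        h *= 2
    return B

def replaced_dense_layer(x, J1, W_core, J2):
    """Forward pass of the proposed replacement for a dense n2 x n1
    layer: truncated butterfly J1 (k1 x n1), small dense core W_core
    (k2 x k1), transposed truncated butterfly J2^T (n2 x k2)."""
    return J2.T @ (W_core @ (J1 @ x))

def replaced_param_count(n1, n2, k1, k2):
    """Trainable weights in the replacement: each butterfly network has
    log2(n) layers with 2n weights each, plus the dense k2 x k1 core."""
    return int(2 * n1 * np.log2(n1) + 2 * n2 * np.log2(n2) + k1 * k2)
```

For n1 = n2 = 1024 and k1 = k2 = 10, the replacement has about 41 thousand weights, against roughly a million for the dense layer.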
We also empirically demonstrate that our new network with fewer parameters performs as well as an encoder-decoder network. The encoder-decoder network computes the best low-rank approximation of the input matrix. It is well known that with high probability a close-to-optimal low-rank approximation of a matrix is obtained by pre-processing the matrix with either an FJLT (Sarlós, 2006) or a random sparse matrix structured as in Clarkson and Woodruff (2009), and then computing the best low-rank approximation from the rows of the resulting matrix*. A recent work by Indyk et al. (2019) studies this problem in the supervised setting, where the best pre-processing matrix structured as in Clarkson and Woodruff (2009) is learned from a sample of matrices (instead of using a random sparse matrix). Since an FJLT can be represented by a truncated butterfly network, we emulate the setting of Indyk et al. (2019) but learn a pre-processing matrix structured as a truncated butterfly network.
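The sketch-and-solve scheme referred to above can be illustrated as follows (a minimal NumPy sketch in the spirit of Sarlós (2006); a dense Gaussian matrix stands in for the FJLT, and the function name is ours):

```python
import numpy as np

def sketched_low_rank(A, r, k, rng):
    """Rank-r approximation of A built from the rows of S @ A, where S
    is a k x n random pre-processing (sketching) matrix multiplied from
    the left. A Gaussian S is used here as a stand-in for an FJLT."""
    n = A.shape[0]
    S = rng.standard_normal((k, n)) / np.sqrt(k)
    # top-r right singular directions of the small sketched matrix S @ A
    _, _, Vt = np.linalg.svd(S @ A, full_matrices=False)
    V = Vt[:r].T
    # project the rows of A onto that r-dimensional subspace
    return (A @ V) @ V.T
```

The SVD is computed on the k × n matrix S @ A rather than on A itself, which is the source of the speedup when k is much smaller than the number of rows of A.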

2. OUR CONTRIBUTION AND POTENTIAL IMPACT

We provide an empirical report, together with a theoretical analysis, to justify our main idea of using sparse linear layers with a fixed butterfly network in deep learning. Our findings indicate that this approach, which is well rooted in the theory of matrix approximation and optimization, can offer significant speedups and energy savings in deep learning applications. Additionally, we believe that this work will encourage more experiments and theoretical analysis to better understand the optimization and generalization of our proposed architecture (see the Future Work section).

On the empirical side, we report the outcomes of the following experiments: (1) In Section 6.1, we replace a dense linear layer in standard state-of-the-art networks, for both image and language data, with an architecture composed of (a) a truncated butterfly network, (b) a dense linear layer in a smaller dimension, and (c) a transposed truncated butterfly network (see Section 4.2). The structure parameters are chosen so as to keep the number of weights near linear (instead of quadratic). (2) In Sections 6.2 and 6.3, we train a linear encoder-decoder network in which the encoder is replaced by a truncated butterfly network followed by a dense linear layer in a smaller dimension. These experiments support our theoretical result. The network structure parameters are chosen so as to keep the number of weights in the (replaced) encoder near linear in the input dimension. Our results (also theoretically) demonstrate that this has little to no effect on performance compared to the standard encoder-decoder network. (3) In Section 7, we learn the best pre-processing matrix structured as a truncated butterfly network to perform low-rank matrix approximation from a given sample of matrices. We compare our results



* The pre-processing matrix is multiplied from the left.

