SPARSE LINEAR NETWORKS WITH A FIXED BUTTERFLY STRUCTURE: THEORY AND PRACTICE

Anonymous authors
Paper under double-blind review

Abstract

A butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on the FJLT with high probability approximates the action of any matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network by an architecture based on the butterfly network. The proposed architecture significantly improves upon the quadratic number of weights required in a standard dense layer, reducing it to nearly linear with little compromise in the expressibility of the resulting operator. In a wide variety of experiments, including supervised prediction on both NLP and vision data, we show that this not only produces results that match and often outperform existing well-known architectures, but it also offers faster training and prediction in deployment. To understand the optimization problems posed by neural networks with a butterfly network, we study the optimization landscape of the encoder-decoder network, where the encoder is replaced by a butterfly network followed by a dense linear layer in a smaller dimension. The theoretical results presented in the paper explain why the training speed and outcome are not compromised by our proposed approach. Empirically, we demonstrate that the network performs as well as the encoder-decoder network.

1. INTRODUCTION

A butterfly network (see Figure 6 in Appendix A) is a layered graph connecting a layer of n inputs to a layer of n outputs with O(log n) layers, where each layer contains 2n edges. The edges connecting adjacent layers are organized in disjoint gadgets, each gadget connecting a pair of nodes in one layer with a corresponding pair in the next layer by a complete graph. The distance between pairs doubles from layer to layer. This network structure represents the execution graph of the Fast Fourier Transform (FFT) (Cooley and Tukey, 1965), the Walsh-Hadamard transform, and many important transforms in signal processing that are known to have fast algorithms for matrix-vector products. Ailon and Chazelle (2009) showed how to use the Fourier (or Hadamard) transform to perform fast Euclidean dimensionality reduction with Johnson and Lindenstrauss (1984) guarantees. The resulting transformation, called the Fast Johnson-Lindenstrauss Transform (FJLT), was improved in subsequent works (Ailon and Liberty, 2009; Krahmer and Ward, 2011). The common theme in this line of work is to define a fast randomized linear transformation that is composed of a random diagonal matrix, followed by a dense orthogonal transformation which can be represented via a butterfly network, followed by a random projection onto a subset of the coordinates (this research is still active; see e.g. Jain et al. (2020)). In particular, an FJLT matrix can be represented (explicitly) by a butterfly network followed by a projection onto a random subset of coordinates (a truncation operator). We refer to such a representation as a truncated butterfly network (see Section 4). Simple Johnson-Lindenstrauss-like arguments show that with high probability, for any W ∈ R^{n2×n1} and any x ∈ R^{n1}, Wx is close to (J2^T J2) W (J1^T J1) x, where J1 ∈ R^{k1×n1} and J2 ∈ R^{k2×n2} are both FJLT matrices, k1 = log n1, and k2 = log n2 (see Section 4.2 for details).
Motivated by this, we propose to replace a dense (fully-connected) linear layer of size n2 × n1 in any neural network by the following architecture: J2^T W J1, where J1 and J2 can be represented by truncated butterfly networks and W is a k2 × k1 dense linear layer. The clear advantages of such a strategy are: (1) almost all choices of the weights from a specific distribution, namely the one mimicking the FJLT, preserve accuracy while reducing the number of parameters, and (2) the number of weights is nearly linear in the layer width of W (the original matrix). Our empirical results demonstrate that this offers faster training and prediction in deployment while producing results that match and often outperform existing known architectures. Compressing neural networks by replacing linear layers with structured linear transforms that are expressed by fewer parameters has been studied extensively in the recent past. We compare our approach with these related works in Section 3. Since the butterfly structure adds logarithmic depth to the architecture, it might pose optimization-related issues. Moreover, the sparse structure of the matrices connecting the layers in a butterfly network defies the general theoretical analysis of convergence of deep linear networks. We take a small step towards understanding these issues by studying the optimization landscape of an encoder-decoder network (a two-layer linear neural network), where the encoder layer is replaced by a truncated butterfly network followed by a dense linear layer in fewer parameters. This replacement is motivated by a result of Sarlós (2006) related to fast randomized low-rank approximation of matrices using the FJLT (see Section 4.2 for details). We consider this replacement, instead of the architecture consisting of two butterfly networks and a dense linear layer as proposed earlier, because it is easier to analyze theoretically.
We also empirically demonstrate that our new network with fewer parameters performs as well as an encoder-decoder network. The encoder-decoder network computes the best low-rank approximation of the input matrix. It is well known that with high probability a close-to-optimal low-rank approximation of a matrix is obtained by pre-processing the matrix with either an FJLT (Sarlós, 2006) or a random sparse matrix structured as given in Clarkson and Woodruff (2009), and then computing the best low-rank approximation from the rows of the resulting matrix. A recent work by Indyk et al. (2019) studies this problem in the supervised setting, where they find the best pre-processing matrix structured as given in Clarkson and Woodruff (2009) from a sample of matrices (instead of using a random sparse matrix). Since an FJLT can be represented by a truncated butterfly network, we emulate the setting of Indyk et al. (2019) but learn a pre-processing matrix structured as a truncated butterfly network.

2. OUR CONTRIBUTION AND POTENTIAL IMPACT

We provide an empirical report, together with a theoretical analysis, to justify our main idea of using sparse linear layers with a fixed butterfly network in deep learning. Our findings indicate that this approach, which is well rooted in the theory of matrix approximation and optimization, can offer significant speedup and energy savings in deep learning applications. Additionally, we believe that this work will encourage more experiments and theoretical analysis to better understand the optimization and generalization of our proposed architecture (see the Future Work section). On the empirical side, the outcomes of the following experiments are reported: (1) In Section 6.1, we replace a dense linear layer in standard state-of-the-art networks, for both image and language data, with an architecture that constitutes the composition of (a) a truncated butterfly network, (b) a dense linear layer in a smaller dimension, and (c) a transposed truncated butterfly network (see Section 4.2). The structure parameters are chosen so as to keep the number of weights nearly linear (instead of quadratic). (2) In Sections 6.2 and 6.3, we train a linear encoder-decoder network in which the encoder is replaced by a truncated butterfly network followed by a dense linear layer in a smaller dimension. These experiments support our theoretical result. The network structure parameters are chosen so as to keep the number of weights in the (replaced) encoder nearly linear in the input dimension. Our results demonstrate (also theoretically) that this has little to no effect on the performance compared to the standard encoder-decoder network. (3) In Section 7, we learn the best pre-processing matrix structured as a truncated butterfly network to perform low-rank matrix approximation from a given sample of matrices. We compare our results to those of Indyk et al. (2019), who learn a pre-processing matrix structured as given in Clarkson and Woodruff (2009).
On the theoretical side, the optimization landscape of linear neural networks with dense matrices has been studied by Baldi and Hornik (1989) and Kawaguchi (2016). The theoretical part of this work studies the optimization landscape of the linear encoder-decoder network in which the encoder is replaced by a truncated butterfly network followed by a dense linear layer in a smaller dimension. We call such a network an encoder-decoder butterfly network. We give an overview of our main result, Theorem 1, here. Let X ∈ R^{n×d} and Y ∈ R^{m×d} be the data and output matrices respectively. Then the encoder-decoder butterfly network is given as Ŷ = DEBX, where D ∈ R^{m×k} and E ∈ R^{k×ℓ} are dense layers, B is an ℓ × n truncated butterfly network (a product of log n sparse matrices), and k ≤ ℓ ≤ m ≤ n (see Section 5). The objective is to learn D, E and B that minimize ||Y − Ŷ||_F^2. Theorem 1 shows how the loss at the critical points of such a network depends on the eigenvalues of the matrix Σ = Y X^T B^T (B X X^T B^T)^{-1} B X Y^T. In comparison, the loss at the critical points of the encoder-decoder network (without the butterfly network) depends on the eigenvalues of the matrix Σ' = Y X^T (X X^T)^{-1} X Y^T (Baldi and Hornik, 1989). In particular, the loss depends on how the learned matrix B changes the eigenvalues of Σ'. If we learn only an optimal D and E, keeping B fixed (as done in the experiment in Section 6.3), then it follows from Theorem 1 that every local minimum is a global minimum and that the loss at the local/global minima depends on how B changes the top k eigenvalues of Σ'. This inference, together with a result by Sarlós (2006), is used to give a worst-case guarantee in the special case when Y = X (auto-encoders, which capture PCA; see below Theorem 1).

3. RELATED WORK

Important transforms like the discrete Fourier, discrete cosine, Hadamard and many more satisfy a property called the complementary low-rank property, recently defined by Li et al. (2015). For an n × n matrix satisfying this property, which concerns the approximation of specific sub-matrices by low-rank matrices, Michielssen and Boag (1996) and O'Neil et al. (2010) developed the butterfly algorithm to compute the product of such a matrix with a vector in O(n log n) time. The butterfly algorithm factorizes such a matrix into O(log n) many matrices, each with O(n) sparsity. In general, the butterfly algorithm has a pre-computation stage which requires O(n^2) time (O'Neil et al., 2010; Seljebotn, 2012). With the objective of reducing the pre-computation cost, Li et al. (2015); Li and Yang (2017) compute the butterfly factorization for an n × n matrix satisfying the complementary low-rank property in O(n^{3/2}) time. This line of work does not learn butterfly representations for matrices or apply them in neural networks, and is incomparable to our work. A few works in the past have used deep learning models with structured matrices (as hidden layers). Such structured matrices can be described using fewer parameters compared to a dense matrix, and hence a representation can be learned by optimizing over fewer parameters. Examples of structured matrices used include low-rank matrices (Denil et al., 2013; Sainath et al., 2013), circulant matrices (Cheng et al., 2015; Ding et al., 2017), low-distortion projections (Yang et al., 2015), Toeplitz-like matrices (Sindhwani et al., 2015; Lu et al., 2016; Ye et al., 2018), Fourier-related transforms (Moczulski et al., 2016) and matrices with low displacement rank (Thomas et al., 2018). Recently, Alizadeh et al. (2020) demonstrated the benefits of replacing the pointwise convolutional layer in CNNs by a butterfly network. Other works by Mocanu et al. (2018); Lee et al. (2019); Wang et al. (2020); Verdenius et al. (2020) consider a different approach to sparsify neural networks. The works closest to ours are by Yang et al. (2015), Moczulski et al. (2016), and Dao et al. (2020), and we make a comparison below. Yang et al. (2015) and Moczulski et al. (2016) attempt to replace dense linear layers with a stack of structured matrices, including a butterfly structure (the Hadamard or the cosine transform), but they do not place trainable weights on the edges of the butterfly structure as we do. Note that adding these trainable weights does not compromise the run-time benefits in prediction, while adding to the expressiveness of the network in our case. Dao et al. (2020) replace handcrafted structured sub-networks in machine learning models by a kaleidoscope layer, which consists of compositions of butterfly matrices. This is motivated by the fact that the kaleidoscope hierarchy captures a structured matrix exactly and optimally in terms of the multiplication operations required to perform the matrix-vector product. Their work differs from ours in that we propose to replace any dense linear layer in a neural network (instead of a structured sub-network) by the architecture proposed in Section 4.2. Our approach is motivated by theoretical results which establish that this can be done with almost no loss in representation. Finally, Dao et al. (2019) show that butterfly representations of standard transformations like the discrete Fourier, discrete cosine and Hadamard transforms mentioned above can be learned efficiently. They additionally show the following: a) for the benchmark task of compressing a single hidden layer model, they compare the classification accuracy of a network consisting of a composition of butterfly networks with that of a fully-connected linear layer, and b) in ResNet, a butterfly sub-network is added to get an improved result.
In comparison, our approach to replace a dense linear layer by the proposed architecture in Section 4.2 is motivated by well-known theoretical results as mentioned previously, and the results of the comprehensive list of experiments in Section 6.1 support our proposed method.

4. PROPOSED REPLACEMENT FOR A DENSE LINEAR LAYER

In Section 4.1, we define a truncated butterfly network, and in Section 4.2 we motivate and state our proposed architecture based on truncated butterfly network to replace a dense linear layer in any neural network. All logarithms are in base 2, and [n] denotes the set {1, . . . , n}.

4.1. TRUNCATED BUTTERFLY NETWORK

Definition 4.1 (Butterfly Network). Let n be an integral power of 2. Then an n×n butterfly network B (see Figure 6) is a stack of log n linear layers, where in each layer i ∈ {0, . . . , log n − 1}, a bipartite clique connects pairs of nodes j1, j2 ∈ [n] for which the binary representations of j1 − 1 and j2 − 1 differ only in the i'th bit. In particular, the number of edges in each layer is 2n. In what follows, a truncated butterfly network is a butterfly network in which the deepest layer is truncated, namely, only a subset of ℓ neurons is kept and the remaining n − ℓ are discarded. The integer ℓ is a tunable parameter, and the choice of neurons is always assumed to be sampled uniformly at random and fixed throughout training in what follows. The effective number of parameters (trainable weights) in a truncated butterfly network is at most 2n log ℓ + 6n, for any ℓ and any choice of ℓ neurons selected from the last layer. We include a proof of this simple upper bound in Appendix F for lack of space (also, refer to Ailon and Liberty (2009) for a similar result related to the computation time of a truncated FFT). The reason for studying a truncated butterfly network follows (for example) from the works of Ailon and Chazelle (2009); Ailon and Liberty (2009); Krahmer and Ward (2011). These papers define randomized linear transformations with the Johnson-Lindenstrauss property and an efficient computational graph which essentially defines the truncated butterfly network. In what follows, we collectively denote these constructions by FJLT.
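As an illustration of Definition 4.1, the following NumPy sketch (our own illustration, not code from the paper) builds the fixed boolean connectivity masks of the layers of an n × n butterfly network; each layer has exactly 2n admissible edges, as the definition states.

```python
import numpy as np

def butterfly_masks(n):
    """Connectivity masks of an n x n butterfly network (n a power of 2).

    Layer i connects nodes j1, j2 whose 0-indexed binary representations
    differ only in bit i; each 2x2 gadget is a bipartite clique, so each
    layer has exactly 2n admissible (trainable) edges.
    """
    assert n > 0 and n & (n - 1) == 0, "n must be a power of 2"
    log_n = n.bit_length() - 1
    masks = []
    for i in range(log_n):
        m = np.zeros((n, n), dtype=bool)
        for j in range(n):
            m[j, j] = True                 # edge within the gadget (self)
            m[j, j ^ (1 << i)] = True      # edge to the partner differing in bit i
        masks.append(m)
    return masks

masks = butterfly_masks(8)  # 3 layers, each with 16 = 2n admissible edges
```

Placing trainable weights on the True positions of each mask and multiplying the log n sparse factors recovers the butterfly operator.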

4.2. MATRIX APPROXIMATION USING BUTTERFLY NETWORKS

We begin with the following proposition, following known results on matrix approximation (proof in Appendix B). Proposition 1. Suppose J1 ∈ R^{k1×n1} and J2 ∈ R^{k2×n2} are matrices sampled from an FJLT distribution, and let W ∈ R^{n2×n1}. Then for the random matrix W̃ = (J2^T J2) W (J1^T J1), any unit vector x ∈ R^{n1} and any ε ∈ (0, 1), Pr[||W̃x − Wx|| ≤ ε ||W||] ≥ 1 − e^{−Ω(min{k1,k2} ε^2)}. From Proposition 1 it follows that W̃ approximates the action of W with high probability on any given input vector. Now observe that W̃ is equal to J2^T W' J1, where W' = J2 W J1^T. Since J1 and J2 are FJLT matrices, they can be represented by truncated butterfly networks, and hence it is conceivable to replace a dense linear layer connecting n1 neurons to n2 neurons (containing n1 n2 variables) in any neural network with a composition of three gadgets: a truncated butterfly network of size k1 × n1, followed by a dense linear layer of size k2 × k1, followed by the transpose of a truncated butterfly network of size k2 × n2. In Section 6.1, we replace dense linear layers in common deep learning networks with our proposed architecture, where we set k1 = log n1 and k2 = log n2.
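To make the three-gadget composition concrete, here is a NumPy sketch (our illustration, not the paper's implementation: Gaussian weights stand in for trained butterfly parameters, and truncation keeps the first rows rather than a random subset of coordinates). It builds the replacement J2^T W' J1 for a dense 32 × 64 layer, with k1 = log 64 = 6 and k2 = log 32 = 5 as in Section 6.1.

```python
import numpy as np

def random_butterfly(n, rng):
    """Product of log n sparse butterfly factors with random weights.

    Each factor has 2n nonzeros on the fixed butterfly pattern; Gaussian
    weights here are stand-ins for trained (or FJLT-style) parameters.
    """
    log_n = n.bit_length() - 1
    B = np.eye(n)
    for i in range(log_n):
        F = np.zeros((n, n))
        for j in range(n):
            F[j, j] = rng.standard_normal()
            F[j, j ^ (1 << i)] = rng.standard_normal()
        B = F @ B
    return B

def replacement_layer(n1, n2, k1, k2, rng):
    """Sketch of the Section 4.2 replacement J2^T W' J1 for an n2 x n1 layer."""
    J1 = random_butterfly(n1, rng)[:k1, :]    # truncated butterfly, k1 x n1
    J2 = random_butterfly(n2, rng)[:k2, :]    # truncated butterfly, k2 x n2
    W_small = rng.standard_normal((k2, k1))   # trainable dense k2 x k1 core
    return J2.T @ W_small @ J1                # acts like an n2 x n1 matrix

rng = np.random.default_rng(0)
layer = replacement_layer(n1=64, n2=32, k1=6, k2=5, rng=rng)
```

Note that the resulting operator has rank at most min(k1, k2), which is the price paid for the nearly linear parameter count.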

5. ENCODER-DECODER BUTTERFLY NETWORK

Let X ∈ R^{n×d} and Y ∈ R^{m×d} be data and output matrices respectively, and k ≤ m ≤ n. Then the encoder-decoder network for X is given as Ŷ = DEX, where E ∈ R^{k×n} and D ∈ R^{m×k} are called the encoder and decoder matrices respectively. For the special case when Y = X, it is called an auto-encoder. The optimization problem is to learn matrices D and E such that ||Y − Ŷ||_F^2 is minimized. The optimal solution is denoted as Ŷ*, D* and E*. In the case of auto-encoders, X̂* = X_k, where X_k is the best rank-k approximation of X. In this section, we study the optimization landscape of the encoder-decoder butterfly network: an encoder-decoder network where the encoder is replaced by a truncated butterfly network followed by a dense linear layer in a smaller dimension. Such a replacement is motivated by the following result from Sarlós (2006), in which ∆_k = ||X_k − X||_F^2. Proposition 2. Let X ∈ R^{n×d}. Then with probability at least 1/2, the best rank-k approximation of X from the rows of JX (denoted J_k(X)), where J is sampled from an ℓ × n FJLT distribution and ℓ = O(k log k + k/ε), satisfies ||J_k(X) − X||_F^2 ≤ (1 + ε)∆_k. Proposition 2 suggests that in the case of auto-encoders we could replace the encoder with a truncated butterfly network of size ℓ × n followed by a dense linear layer of size k × ℓ, and obtain a network with fewer parameters that loses very little in terms of representation. Hence, it is worthwhile to investigate the representational power of the encoder-decoder butterfly network Ŷ = DEBX. (1) Here, X, Y and D are as in the encoder-decoder network, E ∈ R^{k×ℓ} is a dense matrix, and B is an ℓ × n truncated butterfly network. In the encoder-decoder butterfly network, the encoding is done using EB and the decoding is done using D. This reduces the number of parameters in the encoding matrix from kn (as in the encoder-decoder network) to kℓ + O(n log ℓ).
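The shapes and the parameter counts above can be sketched as follows (our illustration with arbitrary example dimensions; a dense random matrix stands in for the butterfly product B, since only shapes and counts are checked, and the butterfly bound 2n log ℓ + 6n from Section 4.1 is used for the count).

```python
import numpy as np

# Shapes from Section 5: X is n x d and Y_hat = D E B X, with
# D: m x k, E: k x ell, B: ell x n (truncated butterfly), k <= ell <= m <= n.
n, d, m, k, ell = 256, 100, 128, 64, 64
rng = np.random.default_rng(1)

X = rng.standard_normal((n, d))
D = rng.standard_normal((m, k))
E = rng.standard_normal((k, ell))
B = rng.standard_normal((ell, n))   # dense stand-in for the butterfly product

Y_hat = D @ E @ B @ X               # m x d, as in Equation (1)

# Encoder parameter counts: dense k x n encoder vs. E (k x ell) plus the
# truncated-butterfly upper bound 2n log(ell) + 6n on the weights of B.
dense_encoder_params = k * n
butterfly_encoder_params = k * ell + 2 * n * int(np.log2(ell)) + 6 * n
```

For these dimensions the butterfly encoder uses roughly half the parameters of the dense encoder; the gap widens as n grows relative to ℓ.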
Again the objective is to learn matrices D and E, and the truncated butterfly network B, such that ||Y − Ŷ||_F^2 is minimized. The optimal solution is denoted as Ŷ*, D*, E*, and B*. Theorem 1 shows that the loss at a critical point of such a network depends on the eigenvalues of Σ(B) = Y X^T B^T (B X X^T B^T)^{-1} B X Y^T, when B X X^T B^T is invertible and Σ(B) has distinct positive eigenvalues. The loss L is defined as ||Y − Ŷ||_F^2. Theorem 1. Let D, E and B be a point of the encoder-decoder network with a truncated butterfly network satisfying the following: a) B X X^T B^T is invertible, b) Σ(B) has distinct positive eigenvalues λ1 > . . . > λℓ, and c) the gradient of L(Ŷ) with respect to the parameters in the D and E matrices is zero. Then corresponding to this point (and hence corresponding to every critical point) there is an I ⊆ [ℓ] such that L(Ŷ) at this point is equal to tr(Y Y^T) − Σ_{i∈I} λi. Moreover, if the point is a local minimum then I = [k]. The proof of Theorem 1 is given in Appendix C. We also compare our result with those of Baldi and Hornik (1989) and Kawaguchi (2016), which study the optimization landscape of dense linear neural networks, in Appendix C. From Theorem 1 it follows that if B is fixed and only D and E are trained, then a local minimum is indeed a global minimum. We use this to claim a worst-case guarantee using a two-phase learning approach to train an auto-encoder. In this case the optimal solution is denoted as B_k(Y), D_B, and E_B. Observe that when Y = X, B_k(X) is the best rank-k approximation of X computed from the rows of BX. Two-phase learning for auto-encoders: Let ℓ = k log k + k/ε and consider a two-phase learning strategy for auto-encoders, as follows. In phase one, B is sampled from an FJLT distribution, and then only D and E are trained keeping B fixed. Suppose the algorithm learns an optimal D and E at the end of phase one; then X̂ = B_k(X), namely, X̂ is the best rank-k approximation of X from the rows of BX.
From Proposition 2, with probability at least 1/2, L(X̂) ≤ (1 + ε)∆_k. In the second phase, all three matrices are trained to improve the loss. In Sections 6.2 and 6.3 we train an encoder-decoder butterfly network using the standard gradient descent method. In these experiments the truncated butterfly network is initialized by sampling it from an FJLT distribution, and D and E are initialized randomly as in PyTorch.
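The end of phase one can be simulated with a plain sketch-and-solve step. The following NumPy sketch (our illustration; a Gaussian matrix stands in for the FJLT-sampled B) computes B_k(X), the best rank-k approximation of X whose rows lie in the row space of BX, and compares its loss to ∆_k.

```python
import numpy as np

def rank_k_in_rowspace(X, BX, k):
    """Best rank-k approximation of X with rows in the row space of BX.

    Standard sketch-and-solve step (cf. Sarlos 2006): if BX = U S V^T,
    return [X V]_k V^T, where [.]_k denotes best rank-k truncation.
    """
    _, _, Vt = np.linalg.svd(BX, full_matrices=False)
    XV = X @ Vt.T                                  # project onto rowspace(BX)
    U2, S2, V2t = np.linalg.svd(XV, full_matrices=False)
    XV_k = (U2[:, :k] * S2[:k]) @ V2t[:k]          # best rank-k of XV
    return XV_k @ Vt

rng = np.random.default_rng(0)
n, d, k, ell = 64, 48, 4, 24
X = rng.standard_normal((n, d))
B = rng.standard_normal((ell, n)) / np.sqrt(ell)   # Gaussian stand-in for FJLT

X_hat = rank_k_in_rowspace(X, B @ X, k)            # phase-one optimum B_k(X)

# Optimal rank-k error Delta_k via a full SVD, for comparison.
U, S, Vt_full = np.linalg.svd(X, full_matrices=False)
X_k = (U[:, :k] * S[:k]) @ Vt_full[:k]
delta_k = np.linalg.norm(X - X_k) ** 2
loss = np.linalg.norm(X - X_hat) ** 2              # always >= delta_k
```

By construction the sketched loss can never beat ∆_k; Proposition 2 bounds how much worse it can be when B is an FJLT with large enough ℓ.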

6. EXPERIMENTS ON DENSE LAYER REPLACEMENT AND ENCODER-DECODER BUTTERFLY NETWORK

In this section we report the experimental results based on the ideas presented in Sections 4.2 and 5.

6.1. REPLACING DENSE LINEAR LAYERS BY THE PROPOSED ARCHITECTURE

This experiment replaces a dense linear layer of size n2 × n1 in common deep learning architectures with the network proposed in Section 4.2. The truncated butterfly networks are initialized by sampling them from the FJLT distribution, and the dense matrices are initialized randomly as in PyTorch. We set k1 = log n1 and k2 = log n2. The datasets and the corresponding architectures considered are summarized in Table 1. For each dataset and model, the objective function is the same as defined in the model, and the generalization and convergence speed of the original model and the modified one (called the butterfly model for convenience) are compared; the results are reported in Figure 7. We remark that the modified architecture is also trained for fewer epochs. In almost all the cases the modified architecture does better than the normal architecture, both in the rate of convergence and in the final accuracy/F1 score. Moreover, the training time for the modified architecture is lower.

6.2. ENCODER-DECODER BUTTERFLY NETWORK WITH SYNTHETIC GAUSSIAN AND REAL DATA

This experiment tests whether gradient-descent-based techniques can be used to train an encoder-decoder butterfly network. In all the experiments in this section, Y = X. Five types of data matrices are tested, whose attributes are specified in Table 2. Two among them are random.

This experiment is similar to the experiment in Section 6.2, but the training in this case is done in two phases. In the first phase, B is fixed and the network is trained to determine an optimal D and E. In the second phase, the optimal D and E determined in phase one are used as the initialization, and the network is trained over D, E and B to minimize the loss. Theorem 1 ensures worst-case guarantees for this two-phase training (see below the theorem). Figure 3 reports the approximation error of an image from ImageNet. The red and green lines in Figure 3 correspond to the approximation error at the end of phases one and two respectively.

Setup: Suppose X1, . . . , Xt ∈ R^{n×d} are training matrices sampled from a distribution D. Then a B is computed that minimizes the following empirical loss: Σ_{i∈[t]} ||Xi − B_k(Xi)||_F^2. We compute B_k(Xi) using the truncated SVD of BXi (as in Algorithm 1 of Indyk et al. (2019)). Similar to Indyk et al. (2019), the matrix B is learned by the back-propagation algorithm, which uses a differentiable SVD implementation to calculate the gradients, followed by optimization with Adam, such that the butterfly structure of B is maintained. The learned B can be used as the pre-processing matrix for any matrix in the future. The test error for a matrix B and a test set Te is defined as follows: Err_Te(B) = E_{X∼Te} ||X − B_k(X)||_F^2 − App_Te, where App_Te = E_{X∼Te} ||X − X_k||_F^2.
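The test error above can be evaluated directly once B_k(X) is defined. The following NumPy sketch (our illustration, using an untrained Gaussian stand-in for B rather than a learned butterfly matrix) implements B_k(X) via the truncated SVD of BX and computes Err_Te on a small synthetic test set.

```python
import numpy as np

def best_rank_k(X, k):
    """Best rank-k approximation of X (truncated SVD)."""
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    return (U[:, :k] * S[:k]) @ Vt[:k]

def sketch_rank_k(X, B, k):
    """B_k(X): best rank-k approximation of X from the rows of BX."""
    _, _, Vt = np.linalg.svd(B @ X, full_matrices=False)
    return best_rank_k(X @ Vt.T, k) @ Vt

def err_te(B, test_set, k):
    """Err_Te(B) = E ||X - B_k(X)||_F^2 - E ||X - X_k||_F^2 over the test set."""
    app = np.mean([np.linalg.norm(X - best_rank_k(X, k)) ** 2 for X in test_set])
    sketched = np.mean([np.linalg.norm(X - sketch_rank_k(X, B, k)) ** 2
                        for X in test_set])
    return sketched - app

rng = np.random.default_rng(2)
test_set = [rng.standard_normal((32, 24)) for _ in range(5)]
B = rng.standard_normal((8, 32)) / np.sqrt(8)   # untrained Gaussian stand-in
err = err_te(B, test_set, k=3)                  # non-negative by definition
```

Training B (whether sparse, dense, or butterfly-structured) amounts to driving this quantity toward zero on samples from the distribution.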

Experiments and Results:

The experiments are performed on the datasets shown in Table 3. Interestingly, the error for the learned butterfly matrix is not only less than the error for the learned sparse matrix (N = 1, as in Indyk et al. (2019)) but also less than the error for the learned dense matrix (N = 20). In particular, our results indicate that using a learned butterfly sketch can significantly reduce the approximation loss compared to using a learned sparse sketching matrix.

Discussion: Among other things, this work showed that it is beneficial to replace a dense linear layer in deep learning architectures with a more compact architecture (in terms of the number of parameters) using truncated butterfly networks. This approach is justified using ideas from the efficient matrix approximation theory of the last two decades. It does, however, add logarithmic depth to the network. This raises the question of whether the extra depth may harm the convergence of gradient descent optimization. To start answering this question, we show, both empirically and theoretically, that in linear encoder-decoder networks in which the encoding is done using a butterfly network, this typically does not happen. To further demonstrate the utility of truncated butterfly networks, we consider a supervised learning approach as in Indyk et al. (2019), where we learn how to derive low-rank approximations of a distribution of matrices by multiplying by a pre-processing linear operator represented as a butterfly network, with weights trained using a sample of the distribution.

Future Work:

The main open questions arising from this work are related to better understanding the optimization landscape of butterfly networks. The current tools for the analysis of deep linear networks do not apply to these structures, and more theory is necessary. It would be interesting to determine whether replacing dense linear layers in any network with butterfly networks as in Section 4.2 harms the convergence of the original network. Another direction would be to check empirically whether adding non-linear gates between the (logarithmically many) layers of a butterfly network improves the performance of the network. In the experiments in Section 6.1, we have replaced a single dense layer by our proposed architecture. It would be worthwhile to check whether replacing multiple dense linear layers in the different architectures harms the final accuracy. Similarly, it might be insightful to replace a convolutional layer by an architecture based on a truncated butterfly network. Finally, since our proposed replacement reduces the number of parameters in the network, it might be possible to empirically show that the new network is more resilient to over-fitting.

A BUTTERFLY DIAGRAM FROM SECTION 1

Figure 6, referred to in the introduction, is given here.

B PROOF OF PROPOSITION 1

The proof of the proposition will use the following well-known fact (Lemma B.1 below) about FJLT (more generally, JL) distributions (see Ailon and Chazelle (2009); Ailon and Liberty (2009); Krahmer and Ward (2011)).

Lemma B.1. Let x ∈ R^n be a unit vector, and let J ∈ R^{k×n} be a matrix drawn from an FJLT distribution. Then for all ε < 1, with probability at least 1 − e^{−Ω(k ε^2)}: ||x − J^T Jx|| ≤ ε. (2)

By Lemma B.1 we have that with probability at least 1 − e^{−Ω(k1 ε^2)}, ||x − J1^T J1 x|| ≤ ε ||x|| = ε. Henceforth, we condition on the event ||x − J1^T J1 x|| ≤ ε ||x||. Therefore, by the definition of the spectral norm ||W|| of W: ||Wx − W J1^T J1 x|| ≤ ε ||W||. (4) Now apply Lemma B.1 again on the vector W J1^T J1 x and the transformation J2 to get that with probability at least 1 − e^{−Ω(k2 ε^2)}, ||W J1^T J1 x − J2^T J2 W J1^T J1 x|| ≤ ε ||W J1^T J1 x||. (5) Henceforth, we condition on this event as well. To bound the last right-hand side, we use the triangle inequality together with (4): ||W J1^T J1 x|| ≤ ||Wx|| + ε ||W|| ≤ ||W||(1 + ε). (6) Combining (5) and (6) gives: ||W J1^T J1 x − J2^T J2 W J1^T J1 x|| ≤ ε ||W||(1 + ε). (7) Finally, ||J2^T J2 W J1^T J1 x − Wx|| = ||(J2^T J2 W J1^T J1 x − W J1^T J1 x) + (W J1^T J1 x − Wx)|| ≤ ε ||W||(1 + ε) + ε ||W|| = ε ||W||(2 + ε) ≤ 3ε ||W||, where the first inequality follows from the triangle inequality together with (4) and (7), and the second inequality follows from the bound on ε. The proposition is obtained by adjusting the constants hidden inside the Ω(·) notation in the exponent in the proposition statement.

C PROOF OF THEOREM 1

We first note that our result continues to hold even if B in the theorem is replaced by any structured matrix. For example, the result continues to hold if B is an ℓ × n matrix with one non-zero entry per column, as is the case with a random sparse sketching matrix (Clarkson and Woodruff, 2009). We also compare our result with those of Baldi and Hornik (1989); Kawaguchi (2016).

Comparison with Baldi and Hornik (1989) and Kawaguchi (2016): The critical points of the encoder-decoder network are analyzed in Baldi and Hornik (1989). Suppose the eigenvalues of Y X^T (X X^T)^{-1} X Y^T are γ1 > . . . > γm > 0 and k ≤ m ≤ n. Then they show that corresponding to a critical point there is an I ⊆ [m] such that the loss at this critical point is equal to tr(Y Y^T) − Σ_{i∈I} γi, and the critical point is a local/global minimum if and only if I = [k]. Kawaguchi (2016) later generalized this to prove that a local minimum is a global minimum for an arbitrary number of hidden layers in a linear neural network if m ≤ n. Note that since ℓ ≤ n and m ≤ n in Theorem 1, replacing X by BX in Baldi and Hornik (1989) or Kawaguchi (2016) does not imply Theorem 1 as is.

Next, we introduce some notation before delving into the proof. Let X̃ = BX, let r = (Y − Ŷ)^T, and let vec(r) ∈ R^{md} denote the entries of r arranged as a vector in column-first ordering; (∇_{vec(D^T)} L(Ŷ))^T ∈ R^{mk} and (∇_{vec(E^T)} L(Ŷ))^T ∈ R^{kℓ} denote the gradients of the loss with respect to the vectorized parameters.

Lemma C.1. 1. ∇_{vec(D^T)} L(Ŷ) = vec(r)^T (I_m ⊗ (E X̃)^T), and 2. ∇_{vec(E^T)} L(Ŷ) = vec(r)^T (D ⊗ X̃^T).

Proof. 1. Since L(Ŷ) = vec(r)^T vec(r), and vec(Ŷ^T) = vec(X̃^T E^T D^T) = (I_m ⊗ (E X̃)^T) vec(D^T), we have ∇_{vec(D^T)} L(Ŷ) = vec(r)^T · ∇_{vec(D^T)} vec(r) = vec(r)^T (I_m ⊗ (E X̃)^T) · ∇_{vec(D^T)} vec(D^T) = vec(r)^T (I_m ⊗ (E X̃)^T). 2. Similarly, using vec(Ŷ^T) = (D ⊗ X̃^T) vec(E^T), ∇_{vec(E^T)} L(Ŷ) = vec(r)^T · ∇_{vec(E^T)} vec(r) = vec(r)^T (D ⊗ X̃^T) · ∇_{vec(E^T)} vec(E^T) = vec(r)^T (D ⊗ X̃^T).

Assume the rank of D is equal to p.
Hence there is an invertible matrix C ∈ R^{k×k} such that D̃ = D C has its last k − p columns equal to zero and its first p columns linearly independent (via Gaussian elimination). Let Ẽ = C^{-1} E. Without loss of generality, it can be assumed that D̃ ∈ R^{m×p} and Ẽ ∈ R^{p×ℓ}, by restricting D̃ to its first p columns (as the remaining ones are zero) and Ẽ to its first p rows. Hence, D̃ is a full column-rank matrix of rank p, and DE = D̃Ẽ. Claims C.1 and C.2 aid us in completing the proof of the theorem. First the proof of the theorem is completed using these claims, and at the end the two claims are proved.

Claim C.1 (Representation at the critical point). 1. Ẽ = (D̃^T D̃)^{-1} D̃^T Y X̃^T (X̃ X̃^T)^{-1}. 2. D̃Ẽ = P_{D̃} Y X̃^T (X̃ X̃^T)^{-1}, where P_{D̃} = D̃ (D̃^T D̃)^{-1} D̃^T.

Claim C.2. 1. D̃ = (Y X̃^T Ẽ^T)(Ẽ X̃ X̃^T Ẽ^T)^{-1}. 2. P_{D̃} Σ = Σ P_{D̃} = P_{D̃} Σ P_{D̃}.

We denote Σ(B) by Σ for convenience. Since Σ is a real symmetric matrix, there is an orthogonal matrix U consisting of the eigenvectors of Σ such that Σ = U Λ U^T, where Λ is an m × m diagonal matrix whose first ℓ diagonal entries are λ1, . . . , λℓ and whose remaining entries are zero. Let u1, . . . , um be the columns of U. Then for i ∈ [ℓ], ui is the eigenvector of Σ corresponding to the eigenvalue λi, and {u_{ℓ+1}, . . . , u_m} are the eigenvectors of Σ corresponding to the eigenvalue 0. Note that P_{U^T D̃} = U^T D̃ (D̃^T U U^T D̃)^{-1} D̃^T U = U^T P_{D̃} U, and from part two of Claim C.2 we have (U P_{U^T D̃} U^T) Σ = Σ (U P_{U^T D̃} U^T) (9), hence U P_{U^T D̃} Λ U^T = U Λ P_{U^T D̃} U^T (10), hence P_{U^T D̃} Λ = Λ P_{U^T D̃} (11). Since P_{U^T D̃} commutes with Λ, P_{U^T D̃} is a block-diagonal matrix comprising two blocks P1 and P2: the first block P1 is an ℓ × ℓ diagonal block, and P2 is an (m − ℓ) × (m − ℓ) matrix. Since P_{U^T D̃} is an orthogonal projection matrix of rank p, its eigenvalues are 1 with multiplicity p and 0 with multiplicity m − p. Hence at most p diagonal entries of P1 are 1 and the remaining ones are 0.
L(Ŷ) = tr(Y Y^T) − tr(U P_{U^T D̃} Λ U^T) = tr(Y Y^T) − tr(P_{U^T D̃} Λ).

The last equality follows from the fact that tr(U P_{U^T D̃} Λ U^T) = tr(P_{U^T D̃} Λ U^T U) = tr(P_{U^T D̃} Λ). From the structure of P_{U^T D̃} and Λ it follows that there is a subset I ⊆ [ℓ] with |I| ≤ p such that tr(P_{U^T D̃} Λ) = ∑_{i∈I} λ_i. Hence, L(Ŷ) = tr(Y Y^T) − ∑_{i∈I} λ_i. Since P_D̃ = U P_{U^T D̃} U^T, there is a p × p invertible matrix M such that D̃ = (U V)_{I′} M, and Ẽ = M^{-1} ((U V)_{I′})^T Y X̃^T (X̃ X̃^T)^{-1}, where V is a block-diagonal matrix consisting of two blocks V_1 and V_2: V_1 is equal to I_ℓ, and

D = (U V)_{I′} [M | O_{p×(k−p)}] C^{-1}.

Similarly, there is a (k − p) × p matrix N such that

E = C [M^{-1}; N] ((U V)_{I′})^T Y X̃^T (X̃ X̃^T)^{-1},

where v_i = u_i for i ∈ [ℓ] (from the structure of V). For ε > 0 let u′_b = (1 + ε²)^{-1/2}(v_b + ε u_a). Define U′ as the matrix that is equal to U V except that the column vector v_b in U V is replaced by u′_b in U′. Since a ∈ [k] ⊆ [ℓ] and a ∉ I′, v_a = u_a and (U′_{I′})^T U′_{I′} = I_p. Define

D′ = U′_{I′} [M | O_{p×(k−p)}] C^{-1}, and E′ = C [M^{-1}; N] (U′_{I′})^T Y X̃^T (X̃ X̃^T)^{-1},

and let Ŷ′ = D′ E′ X̃. Now observe that D′ E′ = U′_{I′} (U′_{I′})^T Y X̃^T (X̃ X̃^T)^{-1}, and that

L(Ŷ′) = tr(Y Y^T) − ∑_{i∈I} λ_i − (ε²/(1 + ε²))(λ_a − λ_b) = L(Ŷ) − (ε²/(1 + ε²))(λ_a − λ_b).

Since ε can be set arbitrarily close to zero, it can be concluded that there are points in the neighbourhood of Ŷ whose loss is less than L(Ŷ). Further, since L is convex with respect to the parameters in D (respectively E) when the matrix E is fixed (respectively D is fixed), Ŷ is not a local maximum.

Substituting D E as D̃ Ẽ in Equation 12, and multiplying Equation 12 by C^T on both sides from the left, Equation 13 follows:

D̃^T D̃ Ẽ X̃ X̃^T = D̃^T Y X̃^T.    (13)

Since D̃ has full column-rank, we have Ẽ = (D̃^T D̃)^{-1} D̃^T Y X̃^T (X̃ X̃^T)^{-1}. From part one of Claim C.1 it follows that Ẽ has full row-rank, and hence Ẽ X̃ X̃^T Ẽ^T is invertible.
Multiplying this expression for Ẽ by D̃ from the left, we have

D̃ Ẽ = P_D̃ Y X̃^T (X̃ X̃^T)^{-1},    (15)

which proves part two.

Proof of Claim C.2. Since ∇_{vec(D^T)} L(Ŷ) is zero, from the first part of Lemma C.1 the following holds:

E X̃ (Y − Ŷ)^T = E X̃ Y^T − E X̃ Ŷ^T = 0 ⇒ E X̃ X̃^T E^T D^T = E X̃ Y^T.    (16)

Multiplying Equation 18 by the inverse of Ẽ X̃ X̃^T Ẽ^T from the right on both sides gives

D̃ = (Y X̃^T Ẽ^T)(Ẽ X̃ X̃^T Ẽ^T)^{-1}.

This proves part one of the claim. Moreover, multiplying Equation 18 by D̃^T from the right on both sides,

D̃ Ẽ X̃ X̃^T Ẽ^T D̃^T = Y X̃^T Ẽ^T D̃^T
⇒ (P_D̃ Y X̃^T (X̃ X̃^T)^{-1})(X̃ X̃^T)((X̃ X̃^T)^{-1} X̃ Y^T P_D̃) = Y X̃^T ((X̃ X̃^T)^{-1} X̃ Y^T P_D̃)
⇒ P_D̃ Y X̃^T (X̃ X̃^T)^{-1} X̃ Y^T P_D̃ = Y X̃^T (X̃ X̃^T)^{-1} X̃ Y^T P_D̃,

that is, P_D̃ Σ P_D̃ = Σ P_D̃. Since the left-hand side is symmetric, taking transposes gives P_D̃ Σ = Σ P_D̃ = P_D̃ Σ P_D̃, proving part two.

For Olivetti, each data point is an image from the Olivetti faces dataset (Cambridge, 1994), represented as a vector of size 4096 in column-first ordering. Finally, for Hyper, the data matrix is a 1024 × 768 matrix sampled uniformly at random from HS-SOD, a dataset of hyperspectral images from natural scenes (Imamoglu et al., 2018). Figure 13 reports the losses for the Gaussian 2, Olivetti, and Hyper data matrices.

⁸Close-to-zero entries are sampled from a Gaussian distribution with mean zero and variance 0.01.

In this section we describe a few additional experiments that were done as part of the experiment in Section 7.

preceding layer using at most 2ℓ weights. By induction, for all i ≥ 0, the set of nodes S^(p−i) ⊆ V^(p−i) is of size at most 2^i · ℓ, and is connected to the set S^(p−i−1) ⊆ V^(p−i−1) using at most 2^{i+1} · ℓ weights. Now take k = ⌈log₂(n/ℓ)⌉. By the above, the total number of weights that can participate in a path connecting some node in S^(p) with some node in V^(p−k) is at most 2ℓ + 4ℓ + ··· + 2^k ℓ ≤ 4n. From the other direction, the total number of weights that can participate in a path connecting any node from V^(0) with any node from V^(p−k) is 2n times the number of layers in between, or more precisely: 2n(p − k) = 2n(log₂ n − ⌈log₂(n/ℓ)⌉) ≤ 2n(log₂ n − log₂(n/ℓ) + 1) = 2n(log₂ ℓ + 1). The total is 2n log₂ ℓ + 6n, as required.
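The weight-counting argument above can be replayed programmatically. The sketch below assumes one standard butterfly wiring (between layers t and t − 1, node j is paired with node j XOR 2^{t−1}; any wiring with the same gadget structure gives the same counts) and checks the total number of weights on paths into ℓ surviving outputs against the 2n log₂ ℓ + 6n bound:

```python
import math
import random

random.seed(0)
n = 256
p = n.bit_length() - 1                 # p = log2(n), n assumed a power of 2

for ell in [1, 4, 16, 64]:
    S = set(random.sample(range(n), ell))      # surviving outputs S^(p)
    total = 0
    for t in range(p, 0, -1):
        total += 2 * len(S)                    # each node has 2 in-edges
        S = S | {j ^ (1 << (t - 1)) for j in S}  # predecessor set S^(t-1)
    assert total <= 2 * n * math.log2(ell) + 6 * n
```

The predecessor set at most doubles per layer, matching the |S^(p−i)| ≤ 2^i ℓ step of the induction.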



The pre-processing matrix is multiplied from the left.

At a critical point the gradient of the loss function with respect to the parameters in the network is zero.

Note that if n is not a power of 2 then we work with the first n columns of the ℓ × n′ truncated butterfly network, where n′ is the smallest power of 2 greater than n.

To be precise, the constructions in Ailon and Chazelle (2009), Ailon and Liberty (2009), and Krahmer and Ward (2011) also use a random diagonal matrix, but the values of the diagonal entries can be 'absorbed' into the weights of the first layer of the butterfly network.

Possibly multiple D* and E* exist such that Y* = D* E* X.

In all the architectures considered, the final linear layer before the output layer is replaced, and n1 and n2 depend on the architecture.

In Table 2, HS-SOD denotes a dataset for hyperspectral images from natural scenes (Imamoglu et al., 2018).



Figure 7 in Appendix D.1 reports the number of parameters in the dense linear layer of the original model and in the replaced network, and Figure 8 in Appendix D.1 displays the number of parameters in the original model and the butterfly model. In particular, Figure 7 shows the significant reduction in the number of parameters obtained by the proposed replacement. On the left of Figure 1, the test accuracy of the original model and the butterfly model is reported, where the black vertical lines denote the error bars corresponding to the standard deviation, and the values above the rectangles denote the average accuracy. On the right of Figure 1, observe that the test accuracy of the butterfly model trained with stochastic gradient descent is even better than that of the original model trained with Adam in the first few epochs. Figure 12 in Appendix D.1 compares the test accuracy in the first 20 epochs of the original and butterfly models. In the interest of space, the results for the NLP tasks are reported in Figure 9, Appendix D.1. The training and inference times required for the original model and the butterfly model in each of these experiments are reported in Figures 10 and 11 in Appendix D.1.

Figure 1: Left: comparison of final test accuracy with different image classification models and data sets; Right: comparison of test accuracy in the first few epochs with different models and optimizers on CIFAR-10 with PreActResNet18

Figure 3: Approximation error achieved by different methods, with a zoomed-in view on the right

is compared to the test errors for the following three cases: a) B is learned as a sparse sketching matrix as in Indyk et al. (2019), b) B is a random sketching matrix as in Clarkson and Woodruff (2009), and c) B is an ℓ × n Gaussian matrix. Figure 4 compares the test error for ℓ = 20 and k = 10, where App Te = 10.56. Figure 14 in Appendix E compares the test errors of the different methods in the extreme case when k = 1, and Figure 15 in Appendix E compares the test errors of the different methods for various values of ℓ.
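For reference, a random sparse sketching matrix in the style of Clarkson and Woodruff (case b) has a single random ±1 entry per column; a minimal sketch of its construction (dimensions are illustrative, with ℓ = 20 as in the experiment):

```python
import numpy as np

rng = np.random.default_rng(3)
ell, n = 20, 1024                    # sketch rows and input dimension

# CountSketch-style matrix: one non-zero entry (+-1) per column,
# placed in a uniformly random row.
B = np.zeros((ell, n))
B[rng.integers(0, ell, size=n), np.arange(n)] = rng.choice([-1.0, 1.0], size=n)

assert (B != 0).sum(axis=0).tolist() == [1] * n   # one non-zero per column

# Such a B preserves Euclidean norms in expectation: E[||Bx||^2] = ||x||^2.
x = rng.standard_normal(n)
ratio = np.linalg.norm(B @ x) / np.linalg.norm(x)   # typically close to 1
```

Matrix-vector products with such a B cost O(n) time, since only one entry per column is non-zero.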

Figure 16 in Appendix E shows the test error for ℓ = 20 and k = 10 during the training phase on HS-SOD. In Figure 16 it is observed that butterfly learned is able to surpass sparse learned after merely a few iterations.

Figure 5 compares the test error for the learned B via our truncated butterfly structure to a learned matrix B with N non-zero entries in each column, where the N non-zero locations in each column are chosen uniformly at random. The reported test errors are on HS-SOD, with ℓ = 20 and k = 10. Interestingly, the error for butterfly learned is not only less than the error for sparse learned (N = 1, as in Indyk et al. (2019)) but also less than the error for dense learned (N = 20). In particular, our results indicate that using a learned butterfly sketch can significantly reduce the approximation loss compared to using a learned sparse sketching matrix.

Figure 4: Test error by different sketching matrices on different data sets

Figure 6: A 16 × 16 butterfly network represented as a 4-layered graph on the left, and as product of 4 sparse matrices on the right. The white entries are the non-zero entries of the matrices.
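The factorization in Figure 6 can be made concrete for the Walsh-Hadamard transform: a 2^p × 2^p Hadamard matrix factors into p butterfly layers, each with 2n non-zero entries. A small numerical check (n = 16; the layer ordering shown is one valid choice, since the factors act on disjoint tensor slots):

```python
import numpy as np

n, p = 16, 4                           # n = 2^p

H2 = np.array([[1.0, 1.0], [1.0, -1.0]])

# Sylvester's construction as the dense reference matrix.
H = np.array([[1.0]])
for _ in range(p):
    H = np.kron(H, H2)

# One butterfly layer per bit: I_{2^i} (x) H2 (x) I_{2^(p-i-1)}.
layers = [np.kron(np.kron(np.eye(2**i), H2), np.eye(2**(p - i - 1)))
          for i in range(p)]

prod = np.eye(n)
for layer in layers:
    prod = prod @ layer

assert all(np.count_nonzero(layer) == 2 * n for layer in layers)
assert np.allclose(prod, H)
```

Each layer touches pairs of coordinates whose indices differ in a single bit, with the pair distance doubling from layer to layer, exactly as in the figure.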

the partial derivatives of L(Ŷ) with respect to the parameters in vec(D^T) and vec(E^T), respectively. Notice that ∇_{vec(D^T)} L(Ŷ) and ∇_{vec(E^T)} L(Ŷ) are row vectors of size mk and kℓ, respectively. Also, let P_D denote the projection matrix onto the column space of D; in particular, if D is a matrix with full column-rank then P_D = D(D^T D)^{-1} D^T. The n × n identity matrix is denoted I_n, and for convenience of notation let X̃ = B X. First we prove the following lemma, which gives expressions for ∇_{vec(D^T)} L(Ŷ) and ∇_{vec(E^T)} L(Ŷ).

Lemma C.1 (Derivatives with respect to D and E).


L(Ŷ) = tr((Y − Ŷ)(Y − Ŷ)^T)
= tr(Y Y^T) − 2 tr(Ŷ Y^T) + tr(Ŷ Ŷ^T)
= tr(Y Y^T) − 2 tr(P_D̃ Σ) + tr(P_D̃ Σ P_D̃)
= tr(Y Y^T) − tr(P_D̃ Σ).

The second line in the above equation follows from the fact that tr(Ŷ Y^T) = tr(Y Ŷ^T), the third line follows by substituting Ŷ = P_D̃ Y X̃^T (X̃ X̃^T)^{-1} X̃ (from part two of Claim C.1), and the last line follows from part two of Claim C.2. Substituting Σ = U Λ U^T and P_D̃ = U P_{U^T D̃} U^T in the above equation, we have

V_2 is an (m − ℓ) × (m − ℓ) orthogonal matrix, and I′ is such that I ⊆ I′ and |I′| = p. The relation for Ẽ in the above equation follows from part one of Claim C.1. Note that if I′ ⊆ [ℓ], then I′ = I; that is, I consists of the indices corresponding to eigenvectors of non-zero eigenvalues. Recall that D̃ was obtained by truncating the last k − p zero columns of D C, where C was a k × k invertible matrix simulating the Gaussian elimination. Let [M | O_{p×(k−p)}] denote the p × k matrix obtained by augmenting the columns of M with k − p zero columns. Then D

denotes the k × p matrix obtained by stacking the rows of M^{-1} above the rows of N. Now suppose I′ ≠ [k], and hence I ≠ [k]. Then we will show that there are matrices D′ and E′ arbitrarily close to D and E respectively such that if Ŷ′ = D′ E′ X̃ then L(Ŷ′) < L(Ŷ). There is an a ∈ [k] \ I′ and a b ∈ I′ such that λ_a > λ_b (λ_b could also be zero). Denote the columns of the matrix U V as v_1, ..., v_m, and observe that

Hence, if I ≠ [k] then Ŷ represents a saddle point, and in particular Ŷ is a local/global minimum if and only if I = [k].

Proof of Claim C.1. Since ∇_{vec(E^T)} L(Ŷ) is equal to zero, from the second part of Lemma C.1 the following holds:

X̃ (Y − Ŷ)^T D = X̃ Y^T D − X̃ Ŷ^T D = 0 ⇒ X̃ X̃^T E^T D^T D = X̃ Y^T D.

Taking the transpose on both sides,

D^T D E X̃ X̃^T = D^T Y X̃^T.    (12)
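This normal equation is just the least-squares problem for E with D held fixed, so the closed form in Claim C.1 can be cross-checked against a generic solver; a sketch with random data (full column-rank D and full row-rank X̃, dimensions arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
m, k, ell, d = 5, 2, 4, 30
Y = rng.standard_normal((m, d))
D = rng.standard_normal((m, k))
Xt = rng.standard_normal((ell, d))   # X~ = B X

# Closed form from Claim C.1: E = (D^T D)^-1 D^T Y X~^T (X~ X~^T)^-1
E_closed = (np.linalg.inv(D.T @ D) @ D.T @ Y
            @ Xt.T @ np.linalg.inv(Xt @ Xt.T))

# Generic solution of min_E ||Y - D E X~||_F^2 via vec(Y) = (X~^T kron D) vec(E)
A = np.kron(Xt.T, D)
vecE, *_ = np.linalg.lstsq(A, Y.flatten(order="F"), rcond=None)
E_lstsq = vecE.reshape(k, ell, order="F")

assert np.allclose(E_closed, E_lstsq)
```

The agreement holds because X̃^T ⊗ D has full column rank here, so the least-squares solution is unique.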

Substituting E^T D^T as Ẽ^T D̃^T in Equation 16, and multiplying Equation 16 by C^{-1} on both sides from the left, Equation 17 follows:

Ẽ X̃ X̃^T Ẽ^T D̃^T = Ẽ X̃ Y^T.    (17)

Taking the transpose of the above equation we have

D̃ Ẽ X̃ X̃^T Ẽ^T = Y X̃^T Ẽ^T.    (18)
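Putting the claims together, the loss formula at a critical point can be validated end to end: build D̃ from a subset I of eigenvectors of Σ, set Ẽ as in Claim C.1, and check that the loss equals tr(Y Y^T) − ∑_{i∈I} λ_i (a sketch with random data; here X̃ plays the role of B X and the dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
m, ell, d = 5, 4, 30
Y = rng.standard_normal((m, d))
Xt = rng.standard_normal((ell, d))          # X~ = B X

G = Xt @ Xt.T                               # X~ X~^T, invertible since d > ell
Sigma = Y @ Xt.T @ np.linalg.inv(G) @ Xt @ Y.T
lam, U = np.linalg.eigh(Sigma)              # eigh returns ascending order
lam, U = lam[::-1], U[:, ::-1]              # sort eigenpairs descending

I = [0, 2]                                  # any subset of eigenvector indices
D = U[:, I]                                 # decoder columns: chosen eigenvectors
E = D.T @ Y @ Xt.T @ np.linalg.inv(G)       # matching encoder (Claim C.1)
Yhat = D @ E @ Xt

loss = np.trace((Y - Yhat) @ (Y - Yhat).T)
assert np.isclose(loss, np.trace(Y @ Y.T) - lam[I].sum())
```

With I = [k] (the top-k eigenvectors) this construction attains the minimum loss characterized by Theorem 1.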

Figure 7 displays the number of parameters in the dense linear layer of the original model and in the replaced butterfly based network. Figure 9 reports the results for the NLP tasks done as part of the experiment in Section 6.1. Figure 8 displays the number of parameters in the original model and the butterfly model. Figures 10 and 11 report the training and inference times required for the original model and the butterfly model in each of the experiments. The training and inference times in Figures 10 and 11 are averaged over 100 runs. Figure 12 is the same as the right part of Figure 1, but here we compare the test accuracy of the original and butterfly models for the first 20 epochs.

Figure 7: Number of parameters in the dense linear layer of the original model and in the replaced butterfly based architecture; Left: Vision data, Right: NLP

Figure 11: Training/Inference times for NLP; Left: Training time, Right: Inference time

Figure 13: Approximation error on data matrix with various methods for various values of k. From left to right: Gaussian 2, Olivetti, Hyper

Figure 14 compares the test errors of the different methods in the extreme case when k = 1. Figure 15 compares the test errors of the different methods for various values of ℓ. Figure 16 shows the test error for ℓ = 20 and k = 10 during the training phase on HS-SOD. Observe that the butterfly

Data and the corresponding architectures used in the fast matrix multiplication using butterfly matrices experiments.

of phase one, and X = D E B. Then Theorem 1 guarantees that, assuming Σ(B) has distinct positive eigenvalues and D, E are a local minimum, D = D_B, E = E_B, and

Data used in the sketching algorithm for low-rank matrix decomposition experiments.

Table 4 in Appendix E reports the test error for different values of ℓ and k.

Test error for different values of ℓ and k

ACKNOWLEDGEMENT

This project has received funding from the European Union's Horizon 2020 research and innovation programme under grant agreement No. 682203-ERC-[Inf-Speed-Tradeoff].

APPENDIX

learned is able to surpass sparse learned after merely a few iterations. Finally, Table 4 compares the test error for different values of ℓ and k.

A butterfly network for dimension n, which we assume for simplicity to be an integral power of 2, is log₂ n layers deep. Let p denote the integer log₂ n. The set of nodes in the first (input) layer will be denoted here by V^(0). They are connected to the set of n nodes V^(1) from the next layer, and so on until the nodes V^(p) of the output layer. Between two consecutive layers V^(i) and V^(i+1) there are 2n weights, and each node in V^(i) is adjacent to exactly two nodes from V^(i+1). When truncating the network, we discard all but some set S^(p) ⊆ V^(p) of at most ℓ nodes in the last layer. These nodes are connected to a subset S^(p−1) ⊆ V^(p−1) of at most 2ℓ nodes from the

