SPARSE LINEAR NETWORKS WITH A FIXED BUTTERFLY STRUCTURE: THEORY AND PRACTICE
Anonymous authors
Paper under double-blind review

Abstract

A butterfly network consists of logarithmically many layers, each with a linear number of non-zero weights (pre-specified). The fast Johnson-Lindenstrauss transform (FJLT) can be represented as a butterfly network followed by a projection onto a random subset of the coordinates. Moreover, a random matrix based on the FJLT with high probability approximates the action of any matrix on a vector. Motivated by these facts, we propose to replace a dense linear layer in any neural network by an architecture based on the butterfly network. The proposed architecture reduces the quadratic number of weights required in a standard dense layer to nearly linear, with little compromise in the expressibility of the resulting operator. In a wide variety of experiments, including supervised prediction on both NLP and vision data, we show that this approach not only produces results that match and often outperform existing well-known architectures, but also offers faster training and prediction in deployment. To understand the optimization problems posed by neural networks containing a butterfly network, we study the optimization landscape of an encoder-decoder network in which the encoder is replaced by a butterfly network followed by a dense linear layer in a smaller dimension. The theoretical results presented in the paper explain why the training speed and outcome are not compromised by our proposed approach. Empirically, we demonstrate that the network performs as well as the encoder-decoder network.

1. INTRODUCTION

A butterfly network (see Figure 6 in Appendix A) is a layered graph connecting a layer of n inputs to a layer of n outputs with O(log n) layers, where each layer contains 2n edges. The edges connecting adjacent layers are organized in disjoint gadgets, each gadget connecting a pair of nodes in one layer with a corresponding pair in the next layer by a complete bipartite graph. The distance between the paired nodes doubles from layer to layer. This network structure represents the execution graph of the Fast Fourier Transform (FFT) (Cooley and Tukey, 1965), the Walsh-Hadamard transform, and many other important transforms in signal processing that are known to admit fast matrix-vector multiplication algorithms. Ailon and Chazelle (2009) showed how to use the Fourier (or Hadamard) transform to perform fast Euclidean dimensionality reduction with Johnson and Lindenstrauss (1984) guarantees. The resulting transformation, called the Fast Johnson-Lindenstrauss Transform (FJLT), was improved in subsequent works (Ailon and Liberty, 2009; Krahmer and Ward, 2011). The common theme in this line of work is to define a fast randomized linear transformation composed of a random diagonal matrix, followed by a dense orthogonal transformation that can be represented via a butterfly network, followed by a random projection onto a subset of the coordinates (this research is still active; see e.g. Jain et al. (2020)). In particular, an FJLT matrix can be represented (explicitly) by a butterfly network followed by a projection onto a random subset of coordinates (a truncation operator). We refer to such a representation as a truncated butterfly network (see Section 4). Simple Johnson-Lindenstrauss-like arguments show that with high probability, for any W ∈ R^{n_2×n_1} and any x ∈ R^{n_1}, Wx is close to (J_2^T J_2)W(J_1^T J_1)x, where J_1 ∈ R^{k_1×n_1} and J_2 ∈ R^{k_2×n_2} are both FJLT matrices and k_1 = log n_1, k_2 = log n_2 (see Section 4.2 for details).
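The layered gadget structure described above can be made concrete with a small sketch. The following NumPy code (an illustration, not part of the paper's implementation) builds each butterfly layer as a sparse matrix whose 2x2 gadgets mix coordinates at doubling strides, and checks that with Hadamard gadget coefficients the product of the log n layers is exactly the Walsh-Hadamard matrix — one of the fast transforms the butterfly graph computes.

```python
import numpy as np
from functools import reduce

def butterfly_layer(n, stride, coeffs=(1, 1, 1, -1)):
    """One butterfly layer: each gadget mixes coordinates i and i + stride
    with a 2x2 matrix, so the layer has 2n non-zero weights in total."""
    a, b, c, d = coeffs
    M = np.zeros((n, n))
    for i in range(n):
        if i & stride == 0:          # i is the low node of its pair
            j = i + stride
            M[i, i], M[i, j] = a, b
            M[j, i], M[j, j] = c, d
    return M

n = 8
# log2(n) layers; the distance between paired nodes doubles layer to layer
layers = [butterfly_layer(n, 2 ** s) for s in range(int(np.log2(n)))]
B = reduce(np.matmul, layers)

# With gadget coefficients (1, 1, 1, -1) the product is the (unnormalized)
# Walsh-Hadamard matrix H_8 = H_2 ⊗ H_2 ⊗ H_2.
H2 = np.array([[1, 1], [1, -1]])
H8 = np.kron(np.kron(H2, H2), H2)
assert np.allclose(B, H8)
```

Each layer carries only 2n non-zero weights, so the full network has 2n log n weights yet represents a dense n x n orthogonal (up to scaling) transform.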
Motivated by this, we propose to replace a dense (fully-connected) linear layer of size n_2 × n_1 in any neural network by the following architecture: J_2^T W J_1, where W ∈ R^{k_2×k_1} is a dense matrix and J_1, J_2 can each be represented by a truncated butterfly network.
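To illustrate the proposed replacement and its parameter savings, here is a hypothetical NumPy sketch (all dimensions and names are illustrative, not taken from the paper's experiments). It materializes random truncated butterfly factors J_1, J_2 as dense matrices purely to show the shapes; in an actual implementation the butterfly factors would stay sparse and their 2n log n weights would be trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_butterfly(n):
    """Random weights on the fixed butterfly sparsity pattern:
    log2(n) layers of 2x2 gadgets at doubling strides (2n log n parameters)."""
    B = np.eye(n)
    for s in range(int(np.log2(n))):
        stride = 2 ** s
        L = np.zeros((n, n))
        for i in range(n):
            if i & stride == 0:
                j = i + stride
                L[np.ix_([i, j], [i, j])] = rng.standard_normal((2, 2))
        B = L @ B
    return B

n1, n2 = 256, 512    # a dense layer would need n2 * n1 = 131072 weights
k1, k2 = 16, 32      # truncated dimensions (k = O(log n) in the analysis)

# Truncated butterfly networks: butterfly followed by coordinate truncation
J1 = random_butterfly(n1)[:k1, :]
J2 = random_butterfly(n2)[:k2, :]
W = rng.standard_normal((k2, k1))   # small dense core

x = rng.standard_normal(n1)
y = J2.T @ (W @ (J1 @ x))           # stands in for a dense n2 x n1 layer
```

With these illustrative sizes, the trainable weights number roughly 2·n_1 log n_1 + 2·n_2 log n_2 + k_1·k_2 ≈ 14k, against 131k for the dense layer being replaced.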

