DEEP LEARNING MEETS PROJECTIVE CLUSTERING

Abstract

A common approach for compressing Natural Language Processing (NLP) networks is to encode the embedding layer as a matrix A ∈ R n×d , compute its rank-j approximation A j via SVD (Singular Value Decomposition), and then factor A j into a pair of matrices that correspond to smaller fully-connected layers to replace the original embedding layer. Geometrically, the rows of A represent points in R d , and the rows of A j represent their projections onto the j-dimensional subspace that minimizes the sum of squared distances ("errors") to the points. In practice, these rows of A may be spread around k > 1 subspaces, so factoring A based on a single subspace may lead to large errors that turn into large drops in accuracy. Inspired by projective clustering from computational geometry, we suggest replacing this subspace by a set of k subspaces, each of dimension j, that minimizes the sum of squared distances over every point (row in A) to its closest subspace. Based on this approach, we provide a novel architecture that replaces the original embedding layer by a set of k small layers that operate in parallel and are then recombined with a single fully-connected layer. Extensive experimental results on the GLUE benchmark yield networks that are both more accurate and smaller compared to the standard matrix factorization (SVD). For example, we further compress DistilBERT by reducing the size of the embedding layer by 40% while incurring only a 0.5% average drop in accuracy over all nine GLUE tasks, compared to a 2.8% drop using the existing SVD approach. On RoBERTa we achieve 43% compression of the embedding layer with less than a 0.8% average drop in accuracy as compared to a 3% drop previously.

1. INTRODUCTION AND MOTIVATION

Deep Learning revolutionized Machine Learning by improving accuracy by tens of percentage points on fundamental tasks in Natural Language Processing (NLP), through learning representations of natural language via deep neural networks (Mikolov et al., 2013; Radford et al., 2018; Le and Mikolov, 2014; Peters et al., 2018; Radford et al., 2019). It was later shown that there is no need to train these networks from scratch for each new task or dataset; instead, one can fine-tune a fully pre-trained model on the specific task (Dai and Le, 2015; Radford et al., 2018; Devlin et al., 2019). However, these networks are often extremely large compared to classical machine learning models. For example, both BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) have more than 110 million parameters, and RoBERTa (Liu et al., 2019b) consists of more than 125 million parameters. Such large networks have two main drawbacks: (i) they use too much storage, e.g. memory or disk space, which may be infeasible for small IoT devices, smartphones, or when a personalized network is needed for each user/object/task, and (ii) inference may take too much time, especially for real-time NLP applications such as speech recognition, translation, or speech-to-text.

Compressed Networks.

To this end, many papers have suggested techniques for compressing large NLP networks, e.g., low-rank factorization (Wang et al., 2019; Lan et al., 2019), pruning (McCarley, 2019; Michel et al., 2019; Fan et al., 2019; Guo et al., 2019; Gordon et al., 2020), quantization (Zafrir et al., 2019; Shen et al., 2020), weight sharing (Lan et al., 2019), and knowledge distillation (Sanh et al., 2019; Tang et al., 2019; Mukherjee and Awadallah, 2019; Liu et al., 2019a; Sun et al., 2019; Jiao et al., 2019); see more example papers and a comparison table in Gordon (2019) for compressing the BERT model. There is no consensus on which approach should be used in which context. However, for compressing the embedding layer, the most common approach is low-rank factorization, as in Lan et al. (2019), and it may be combined with other techniques such as quantization and pruning. In this work, we suggest a novel low-rank factorization technique for compressing the embedding layer of a given model. This is motivated by the fact that in many networks the embedding layer accounts for 20%-40% of the network size. Our approach, MESSI (Multiple (parallel) Estimated SVDs for Smaller Intralayers), achieves better accuracy for the same compression rate compared to the known standard matrix factorization. To present it, we first describe an embedding layer, the known technique for compressing it, and the geometric assumptions underlying this technique. We then give our approach, followed by geometric intuition and a detailed explanation of the motivation and the architecture changes. Finally, we report experimental results that demonstrate the strong performance of our technique.
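As geometric intuition for the objective behind our approach, the following is a minimal NumPy sketch of a Lloyd-style alternating heuristic for the (k, j)-projective clustering objective: assign each row of A to its closest j-dimensional subspace, then refit each subspace by SVD of its assigned rows. This is an illustration of the objective only, not necessarily the algorithm used later in the paper; the function name and toy sizes are our own.

```python
import numpy as np

def projective_clustering(A, k, j, iters=20, seed=0):
    """Lloyd-style heuristic for (k, j)-projective clustering: alternately
    assign each row of A to its closest j-dimensional subspace, then refit
    each subspace via the SVD of its assigned rows."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=A.shape[0])
    for _ in range(iters):
        bases = []
        for c in range(k):
            rows = A[labels == c]
            if rows.shape[0] < j:  # re-seed an (almost) empty cluster
                rows = A[rng.choice(A.shape[0], size=j, replace=False)]
            _, _, Vt = np.linalg.svd(rows, full_matrices=False)
            bases.append(Vt[:j])   # orthonormal basis of the fitted subspace
        # Squared distance of every row a to each subspace: ||a - a P||^2,
        # where P = Vt^T Vt is the orthogonal projection onto the subspace.
        dists = np.stack(
            [((A - (A @ Vt.T) @ Vt) ** 2).sum(axis=1) for Vt in bases], axis=1)
        labels = dists.argmin(axis=1)
    return labels, bases
```

With k = 1 this reduces to the single-subspace SVD objective described below; with k > 1 each row is only charged for its distance to the nearest of the k subspaces.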

Embedding Layer.

The embedding layer aims to represent each word of a vocabulary by a real-valued vector that reflects the word's semantic and syntactic information as extracted from the language. One can think of the embedding layer as a simple matrix multiplication, as follows. The layer receives a one-hot vector x ∈ R n (a row of the identity matrix, with exactly one nonzero entry) that represents a word in the vocabulary, and multiplies x by a matrix A T ∈ R d×n to obtain the corresponding d-dimensional word-embedding vector y = A T x, which is the row of A that corresponds to the nonzero entry of x. The embedding layer has n input neurons, and the output has d neurons. The nd edges between the input and output neurons define the matrix A ∈ R n×d , where the entry in the ith row and jth column of A is the weight of the edge between the ith input neuron and the jth output neuron; see Figure 1.
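The lookup-as-matrix-multiplication view above can be sketched in a few lines of NumPy; the sizes n = 20 and d = 10 match the toy layer of Figure 1, and the random matrix A stands in for learned weights.

```python
import numpy as np

# Toy embedding layer: n = 20 input neurons (vocabulary size),
# d = 10 output neurons (embedding dimension), as in Figure 1.
n, d = 20, 10
rng = np.random.default_rng(0)
A = rng.normal(size=(n, d))  # row i of A is the embedding of word i

# One-hot vector selecting word 3.
x = np.zeros(n)
x[3] = 1.0

# The layer computes y = A^T x, which simply extracts row 3 of A.
y = A.T @ x
assert np.allclose(y, A[3])
```

In practice frameworks implement this as a table lookup rather than a dense multiplication, but the two are mathematically identical for one-hot inputs.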

Compressing by Matrix Factorization.

A common approach for compressing an embedding layer is to compute the rank-j approximation A j ∈ R n×d of the corresponding matrix A via SVD (Singular Value Decomposition; see, e.g., Lan et al. (2019); Yu et al. (2017); Acharya et al. (2019)), factor A j into two smaller matrices U ∈ R n×j and V ∈ R j×d (i.e., A j = U V ), and replace the original embedding layer that corresponds to A by a pair of layers that correspond to U and V . The number of parameters is then reduced to j(n + d). Moreover, computing the output takes O(j(n + d)) time, compared to the O(nd) time for computing A T x. As above, we continue to use A j to refer to a rank-j approximation of a matrix A.

Fine Tuning.

The layers that correspond to the matrices U and V above are sometimes used only as initial seeds for a training process that is called fine tuning. Here, the training data is fed into the



Figure 1: A standard embedding (or fully-connected) layer of 20 input neurons and 10 output neurons. Its corresponding matrix A ∈ R 20×10 has 200 parameters, where the ith row of A is the vector of weights of the ith neuron in the input layer.
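The matrix-factorization compression described above can be sketched directly in NumPy. The toy sizes n, d, and j below are illustrative only; the factorization splits the scaled singular vectors into the two small layers.

```python
import numpy as np

n, d, j = 20, 10, 4
rng = np.random.default_rng(1)
A = rng.normal(size=(n, d))  # original embedding matrix

# Rank-j approximation A_j via truncated SVD.
U_full, S, Vt = np.linalg.svd(A, full_matrices=False)
U = U_full[:, :j] * S[:j]  # n x j: weights of the first small layer
V = Vt[:j, :]              # j x d: weights of the second small layer
A_j = U @ V                # best rank-j approximation in Frobenius norm

# Parameter count drops from n*d to j*(n + d); here 200 -> 120.
assert A.size == n * d and U.size + V.size == j * (n + d)

# For a one-hot input x, the pair of layers computes V^T (U^T x) = A_j^T x.
x = np.zeros(n)
x[5] = 1.0
assert np.allclose(V.T @ (U.T @ x), A_j.T @ x)
```

The split of the singular values S into U is one common convention; distributing sqrt(S) across both factors works equally well and changes nothing for inference.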


