DATA-AWARE LOW-RANK COMPRESSION FOR LARGE NLP MODELS

Abstract

The representations learned by large-scale NLP models such as BERT have been widely used in various tasks. However, the increasing size of the pre-trained models also brings efficiency challenges, including inference speed and model size, when deploying models on devices. Most operations in BERT are matrix multiplications, but the weight matrices are not low-rank, so canonical matrix decomposition cannot find an efficient approximation. In this paper, we observe that the learned representation of each layer lies in a low-dimensional space. Based on this observation, we propose DRONE (data-aware low-rank compression), a provably optimal low-rank decomposition of weight matrices, which has a simple closed-form solution that can be efficiently computed. DRONE is generic, applies to both fully-connected and self-attention layers, and does not require any fine-tuning or distillation steps. Experimental results show that DRONE can improve both model size and inference speed with limited loss of accuracy. Specifically, DRONE alone achieves a 1.92x speedup on the MRPC task with only 1.5% loss of accuracy, and when combined with distillation, DRONE achieves over a 12.3x speedup on various natural language inference tasks.

1. INTRODUCTION

The representations learned by large-scale Natural Language Processing (NLP) models such as BERT have been widely used in various tasks (Devlin et al., 2018). The pre-trained BERT models and their variants are used as feature extractors for downstream tasks such as question answering and natural language understanding (Radford et al.; Howard & Ruder, 2018). The success of pre-trained BERT relies on large corpora and big models. Indeed, researchers have reported better results from models with more parameters (Shazeer et al., 2018) and more layers (Al-Rfou et al., 2019). The increasing size of the pre-trained models prevents most users from training a model from scratch, and it also brings efficiency challenges, including inference speed and model size, when deploying models on devices. To deal with the efficiency issue, most existing works resort to adjusting model structures or to distillation. For instance, Kitaev et al. (2020) use locality-sensitive hashing to accelerate dot-product attention, Lan et al. (2019) repeat model parameters to reduce the size, and Zhang et al. (2018) apply a pre-defined attention pattern to save computation. A large body of prior work focuses on variants of distillation (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2020; Liu et al., 2020; Xu et al., 2020; Sun et al., 2019). However, all these methods either require a specific model architecture, which is not generic, or require users to train the proposed structure from scratch, which greatly reduces their practicality. In this work, we explore an acceleration method that is more generic. Note that, as shown in Figure 1, matrix multiplication (the feed-forward layer) is a fundamental operation that appears many times in the Transformer architecture. In fact, the underlying computation of both multi-head attention layers and feed-forward layers is matrix multiplication.
Therefore, instead of resorting to complex architecture redesign, we investigate whether low-rank matrix approximation, the most classical and simple model compression approach, can be used to accelerate Transformers. Despite being successfully applied to CNNs (Yu et al., 2017; Sindhwani et al., 2015; Shim et al., 2017; You et al., 2019), at first glance low-rank compression cannot work for BERT. As Figure 2 shows, regardless of the layer, the matrices in the feed-forward layers and in the query and key transformations of the attention layers are not low-rank. Therefore, even the optimal low-rank approximation (e.g., by SVD) leads to a large reconstruction error, and empirically the performance is limited. This is probably why low-rank approximation has not been used for BERT compression. In this paper, we propose a novel low-rank approximation algorithm that compresses the weight matrices even though they are not low-rank. The main idea is to exploit the data distribution. In NLP applications, the latent features, which encode information extracted from natural sentences, often lie in a subspace with a lower intrinsic dimension. Therefore, in most matrix-vector products, even though the weight matrices are not low-rank, the input vectors lie in a low-dimensional subspace, allowing dimension reduction with minimal performance degradation. We mathematically formulate this generalized low-rank approximation problem, which includes a data distribution term, and provide a closed-form solution for the optimal rank-k decomposition of the weight matrices. We name the resulting method DRONE, after this Data-awaRe lOw-raNk comprEssion idea. Our decomposition significantly outperforms SVD under the same rank constraint and can successfully accelerate the BERT model without sacrificing much test performance.

2. RELATED WORK

Inference speed is important for NLP models deployed in various applications. Generally speaking, inference efficiency can be enhanced by hardware (Shawahna et al., 2018) or lower-level instruction optimization (Ning, 2020). On the other hand, the main focus of current research is on algorithmic methods that reduce the computational complexity. These methods mainly fall into two categories: attention complexity reduction and model size reduction.

Attention Complexity Reduction

The attention mechanism is the building block of the Transformer model and has attracted much recent research attention in NLP (Vaswani et al., 2017). Pre-training BERT, a Transformer-based model, on a large corpus has led to state-of-the-art performance on various tasks after fine-tuning (Devlin et al., 2018). Attention on sequences of length L is O(L^2) in both computational and memory complexity, which leads to long inference times on long sequences. Thus, researchers have focused on reducing the complexity of the attention module. Kitaev et al. (2020) use locality-sensitive hashing to reduce the complexity to O(L log L). Zhang et al. (2018) and Child et al. (2019) pre-define an attention map to obtain constant computational time. Goyal et al. progressively eliminate redundant context vectors within the attended sequence to improve the efficiency of attention in the last few layers of the model. Wang et al. (2020) propose to train a low-rank attention by choosing a rank r ≪ L. This is similar to our work in the sense of leveraging low-rank structure, but our method does not require retraining the model and can be applied to modules other than attention. In fact, most of these methods require special modules, so the proposed models must be retrained from scratch. This prohibits the use of the large body of publicly available open models and slows research progress. More importantly, these methods only target the long-sequence scenario. As shown in Figure 1, the attention module is actually not the bottleneck of inference time in common usage: in most if not all models in common use, two large feed-forward layers follow the attention module and incur much more computational time. Attention complexity reduction only pays off when long sequences are used, which is unusual in current practice. Thus, accelerating the attention module by itself does not significantly reduce the overall inference time.

Model Size Reduction

Inference efficiency is also related to model compression. In principle, smaller models lead to fewer operations and thus faster inference. Sanh et al. (2020) explored pruning methods on BERT models to eliminate redundant links, and there is a line of research on pruning methods (Han et al., 2015a;b; Chen et al., 2020). Quantization methods (Zafrir et al., 2019; Hubara et al., 2016; Lin et al., 2016) convert 32-bit float models into lower-bit fixed-point representations and can, in principle, make model prediction faster with a fixed-point accelerator. Lan et al. (2019) reduce the model size by sharing encoder parameters. A large body of prior work focuses on variants of knowledge distillation (Sanh et al., 2019; Jiao et al., 2019; Sun et al., 2020; Liu et al., 2020; Xu et al., 2020; Sun et al., 2019; 2020). These methods use different strategies to distill information from a teacher network and reduce the number of layers (Sanh et al., 2019) or the hidden dimension (Jiao et al., 2019). Further, a hybrid compression method combining matrix factorization, pruning, and knowledge distillation was proposed by Mao et al. (2020). Among these methods, quantization requires a hardware accelerator to reduce inference time, which is not applicable in general scenarios. Pruning methods can reduce the model size, but inference speed may not improve due to the limitations of sparse operations. Only algorithmic methods such as distillation serve as generic inference-time acceleration methods. We emphasize that our method is orthogonal to these distillation methods. In fact, the proposed method is a generic acceleration method applicable to all components of most NLP models. In Section 4, we show that DRONE can be combined with distilled models to further improve performance.

3. METHODS

We now introduce a generic algorithm for improving the efficiency of matrix multiplication. The computation of a feed-forward (FF) layer in attention models can be described as h = Wx + b, where W ∈ R^{d2×d1} and b ∈ R^{d2} are model parameters, x ∈ R^{d1} is the latent representation of a token, and h ∈ R^{d2} is the output. Assume the sequence length is L; all the token representations x_1, ..., x_L ∈ R^{d1} pass through this same operation, so in practice the whole FF layer can be computed by a matrix-matrix product W[x_1, ..., x_L] + b, with the bias term b broadcast to all L input tokens. In practice we normally have L ≪ max(d1, d2) (e.g., L = 128, d2 = 3,072). A standard way to accelerate the computation is to perform a low-rank approximation of W. Such an approximation can be obtained by singular value decomposition (SVD), which achieves the best rank-k approximation in terms of Frobenius norm, and we can write W as W = U S V^T ≈ U_{W,k} V_{W,k}^T, with unitary matrices U ∈ R^{d2×d2}, V ∈ R^{d1×d1} and a diagonal matrix S ∈ R^{d2×d1}. The rank-k factors U_{W,k} ∈ R^{d2×k} and V_{W,k} ∈ R^{d1×k} are obtained as U_{W,k} = U_k S_k^{1/2} and V_{W,k} = V_k S_k^{1/2}, where S_k^{1/2} is the elementwise square root of the first k diagonal entries of S and U_k, V_k are the corresponding leading columns of U and V. Given such an approximation, we can simplify the computation in equation 1 as h = Wx + b ≈ U_{W,k} (V_{W,k}^T x) + b. After the rank-k approximation, the computational complexity is reduced from O(d2 d1) to O((d1 + d2)k). When k is small enough, low-rank approximation can not only accelerate the computation (Shim et al., 2017) but also compress the model size (Sainath et al., 2013). However, as shown in Figure 2, the matrices in the FF layers of BERT models do not exhibit obvious low-rank structure.
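As a concrete sketch (in NumPy, with illustrative dimensions matching a BERT-base FF layer; the variable names are ours), the standard SVD-based rank-k approximation described above can be computed as:

```python
import numpy as np

# Sketch of the SVD-based rank-k approximation of a weight matrix.
# Dimensions mirror a BERT-base FF layer (d2 x d1 = 3072 x 768); k is chosen freely.
rng = np.random.default_rng(0)
d2, d1, k = 3072, 768, 64
W = rng.standard_normal((d2, d1))

U, s, Vt = np.linalg.svd(W, full_matrices=False)  # W = U @ diag(s) @ Vt
# Split the top-k singular values between the two factors: W ~= U_Wk @ V_Wk.T
U_Wk = U[:, :k] * np.sqrt(s[:k])       # shape (d2, k)
V_Wk = Vt[:k].T * np.sqrt(s[:k])       # shape (d1, k)

x = rng.standard_normal(d1)
h_full = W @ x                         # O(d1 * d2) multiply-adds
h_lowrank = U_Wk @ (V_Wk.T @ x)        # O((d1 + d2) * k) multiply-adds
```

By the Eckart-Young theorem, the Frobenius-norm error of this truncation equals the root of the sum of the squared discarded singular values, which is exactly why a heavy singular-value tail (as in BERT) makes plain SVD unattractive.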
Ideally, we would like a small fraction of the ranks to contain all the large singular values, so that the sum of the singular values associated with the selected ranks divided by the sum of all singular values is large. However, we observe that choosing the rank k to be as large as 50% of the full rank (i.e., about 0.5 min(d1, d2)) only accumulates 60 percent of the total singular values. This leads to a large approximation error, while the complexity remains about O(d2 d1) with no speed gain. Although the matrices in the model are not low-rank, we provide an illustrative example showing that a low-rank computation can still exist when the data distribution lies in a lower intrinsic dimension. Suppose W is defined as below and the input x lies in a 2-dimensional subspace:

W = [ 7 0 2 3 1
      9 6 7 5 0
      6 1 8 0 3
      4 3 2 1 4
      1 2 2 1 2 ],   x ∈ span( (2, 2, 5, 5, 4)^T, (1, 1, 2, 2, 6)^T ).

Here W is full-rank, so there is no lossless low-rank approximation of W itself. On the other hand, the input x lies in a 2-dimensional subspace, so we can construct the following computation. Writing a = (2, 2, 5, 5, 4)^T and b = (1, 1, 2, 2, 6)^T, we have

U = W [a  b] = [ 43 23
                 90 39
                 66 41
                 45 37
                 29 21 ],   V^T = [ -1   -1  0.5  0.5  0
                                    -0.5  0   0    0   0.25 ],

where V^T satisfies V^T [a  b] = I_2. This gives a rank-2 matrix U V^T such that W ≠ U V^T but W x = U V^T x for any x in the low-dimensional subspace. This shows that even if we cannot approximate the matrix W itself, it is still possible to construct a good low-rank decomposition, and the key is to exploit the space of the input vectors.
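The example above can be checked numerically. The sketch below (with random data in place of the hand-picked matrix, so the entries are illustrative) builds an exact rank-2 factorization of the action of a full-rank W on a 2-dimensional input subspace:

```python
import numpy as np

# W is full-rank, but every input x lies in a 2-dimensional subspace
# spanned by the columns of an orthonormal basis B.
rng = np.random.default_rng(0)
d = 5
W = rng.standard_normal((d, d))                   # full-rank with probability 1
B, _ = np.linalg.qr(rng.standard_normal((d, 2)))  # orthonormal basis, shape (d, 2)

U = W @ B            # shape (d, 2)
V = B                # shape (d, 2); U @ V.T has rank 2

x = B @ rng.standard_normal(2)   # an arbitrary vector in span(B)
# U @ V.T differs from W, yet U @ (V.T @ x) equals W @ x exactly on the subspace,
# because V.T @ x recovers the coordinates of x in the basis B.
```

This is the same mechanism as in the hand-worked example: the rank-2 factorization is lossless not because W is low-rank, but because the inputs are.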

3.1. DRONE: DATA-AWARE LOW-RANK COMPRESSION

Assume the input x of the FF layer follows some distribution. Instead of minimizing the approximation error of the weight matrix (for which SVD is optimal), we want to minimize the approximation error of the outputs. Denote by X ∈ R^{d1×n} the matrix whose columns capture the empirical distribution of the input. Our goal is to find a pair of projection matrix V_{X,k} ∈ R^{d1×k} and recovery matrix U_{X,k} ∈ R^{d1×k} such that the output is well approximated, and we can rewrite equation 1 as:

h = W X + b ≈ W U_{X,k} V_{X,k}^T X + b = (W U_{X,k}) V_{X,k}^T X + b = W_{X,k} V_{X,k}^T X + b,

where W_{X,k} = W U_{X,k}. Intuitively, when X lies in a lower-dimensional space, we can find such a pair by a PCA decomposition of X, projecting X onto the subspace that explains most of its variance. In this way, instead of considering the decomposition of W, we leverage the distribution of X to complete the low-rank approximation. The best approach, of course, is to consider the properties of both W and X simultaneously, which we can state as the following optimization problem:

min_M ||W X - W M X||_F^2,   s.t. rank(M) = k,   (2)

where M is the desired rank-k transformation that maximally preserves the result of the matrix multiplication. In the theorem below, we show that this optimization problem has a closed-form optimal solution. Before stating the theorem, we introduce some notation. Assume rank(W) = r and rank(X) = t, and write the full SVDs W = U_W S_W V_W^T and X^T = U_X S_X V_X^T with the block structure

U_W = [U_{W,r}, Ū_{W,r}],  S_W = [ S_{W,r} 0 ; 0 0 ],  V_W = [V_{W,r}, V̄_{W,r}],
U_X = [U_{X,t}, Ū_{X,t}],  S_X = [ S_{X,t} 0 ; 0 0 ],  V_X = [V_{X,t}, V̄_{X,t}].

In other words, U_{W,r}, V_{W,r}, U_{X,t}, V_{X,t} span the corresponding column and row spaces, while Ū_{W,r}, V̄_{W,r}, Ū_{X,t}, V̄_{X,t} span the null spaces. With this notation, we are ready to state the theorem.

Theorem 1. Assume rank(W) = r and rank(X) = t.
The closed-form solution M* of the optimization problem in equation 2 is

M* = V_{W,r} S_{W,r}^{-1} Z_k S_{X,t}^{-1} V_{X,t}^T,   (3)

where Z_k is the rank-k truncated SVD of Z = S_{W,r} V_{W,r}^T V_{X,t} S_{X,t}. The proof of Theorem 1 is deferred to Supplementary A. Note that since Z_k is the rank-k truncated SVD of Z, we can also write Z_k as U_{Z,k} V_{Z,k}^T by distributing the top-k singular values of Z into the left and right singular matrices. Thus the original computation can be rewritten as:

W X ≈ (W V_{W,r} S_{W,r}^{-1} U_{Z,k}) (V_{Z,k}^T S_{X,t}^{-1} V_{X,t}^T) X = U* V* X,   (4)

where U* = W V_{W,r} S_{W,r}^{-1} U_{Z,k} and V* = V_{Z,k}^T S_{X,t}^{-1} V_{X,t}^T are two rank-k matrices, and we replace W by U* V*.
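A minimal NumPy sketch of the closed-form solution, assuming Theorem 1 as stated; the function name `drone_factors` and the tolerance-based numerical rank detection are our additions, not from the paper:

```python
import numpy as np

def drone_factors(W, X, k, rtol=1e-10):
    """Closed-form solution of min_M ||W X - W M X||_F s.t. rank(M) = k.

    Returns U*, V* with W X ~= U* @ (V* @ X); a sketch of Theorem 1.
    """
    # SVDs of W and X^T, truncated to numerical ranks r and t.
    _, sw, Vwt = np.linalg.svd(W, full_matrices=False)
    _, sx, Vxt = np.linalg.svd(X.T, full_matrices=False)
    r = int(np.sum(sw > rtol * sw[0]))
    t = int(np.sum(sx > rtol * sx[0]))
    Vwr, swr = Vwt[:r].T, sw[:r]      # V_{W,r} (d1 x r) and diag of S_{W,r}
    Vxt_, sxt = Vxt[:t].T, sx[:t]     # V_{X,t} (d1 x t) and diag of S_{X,t}

    # Z = S_{W,r} V_{W,r}^T V_{X,t} S_{X,t} and its rank-k truncated SVD.
    Z = (swr[:, None] * (Vwr.T @ Vxt_)) * sxt[None, :]
    Uz, sz, Vzt = np.linalg.svd(Z, full_matrices=False)
    Uzk = Uz[:, :k] * np.sqrt(sz[:k])   # U_{Z,k}
    Vzk = Vzt[:k].T * np.sqrt(sz[:k])   # V_{Z,k}

    # U* = W V_{W,r} S_{W,r}^{-1} U_{Z,k};  V* = V_{Z,k}^T S_{X,t}^{-1} V_{X,t}^T
    U_star = W @ (Vwr / swr[None, :]) @ Uzk
    V_star = (Vzk.T / sxt[None, :]) @ Vxt_.T
    return U_star, V_star
```

When X truly has rank t and k = t, the factorization reproduces W X exactly; for smaller k it is never worse than truncating W by SVD alone, by optimality of the closed form.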

3.2. EXTENSION TO DOT-PRODUCT ATTENTION

Although the optimization problem in equation 2 was proposed for the feed-forward computation, in this section we show that it can also be applied to the dot-product part of the attention module. The most important computation in an attention layer is the pairwise similarity between the queries and keys of the sequence, which can be described as:

O = (Q Ȳ)^T (K Y),   (5)

where Ȳ ∈ R^{d1×n} is the batch of query data, Q ∈ R^{d2×d1} is the query transformation matrix, Y ∈ R^{d1×m} is the batch of key data, K ∈ R^{d2×d1} is the key transformation matrix, and n, m are the query and key batch sizes. The desired low-rank approximation is again the solution of the following optimization problem:

min_M ||(Q Ȳ)^T (K Y) - (Q Ȳ)^T M (K Y)||_F^2,   s.t. rank(M) = k.   (6)

With W = (Q Ȳ)^T and X = K Y, we obtain the following corollary directly from Theorem 1.

Corollary 2. Assume rank((Q Ȳ)^T) = r and rank(K Y) = t. Denote by (Q Ȳ)^T = U_W S_W V_W^T and (K Y)^T = U_X S_X V_X^T the SVD decompositions of (Q Ȳ)^T and (K Y)^T respectively. The closed-form solution M* of the optimization problem in equation 6 is M* = V_{W,r} S_{W,r}^{-1} Z_k S_{X,t}^{-1} V_{X,t}^T, where Z_k is the rank-k truncated SVD of Z = S_{W,r} V_{W,r}^T V_{X,t} S_{X,t}.
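The corollary can be exercised with the same solver by setting W = (Q Ȳ)^T and X = K Y. The sketch below is our condensed reimplementation; all dimensions and the synthetic low-dimensional query/key data are illustrative assumptions:

```python
import numpy as np

def drone(W, X, k, rtol=1e-10):
    # Condensed Theorem-1 solver (same math as the FF-layer case; the
    # function name and structure are ours, for illustration only).
    _, sw, Vwt = np.linalg.svd(W, full_matrices=False)
    _, sx, Vxt = np.linalg.svd(X.T, full_matrices=False)
    r, t = int((sw > rtol * sw[0]).sum()), int((sx > rtol * sx[0]).sum())
    Z = (sw[:r, None] * (Vwt[:r] @ Vxt[:t].T)) * sx[None, :t]
    Uz, sz, Vzt = np.linalg.svd(Z, full_matrices=False)
    U = W @ (Vwt[:r].T / sw[None, :r]) @ (Uz[:, :k] * np.sqrt(sz[:k]))
    V = ((np.sqrt(sz[:k])[:, None] * Vzt[:k]) / sx[None, :t]) @ Vxt[:t]
    return U, V

# Corollary 2: approximate O = (Q Ybar)^T (K Y) with W = (Q Ybar)^T, X = K Y.
rng = np.random.default_rng(0)
d1, d2, n, m, k = 64, 64, 32, 40, 8
Q, K = rng.standard_normal((d2, d1)), rng.standard_normal((d2, d1))
# Hidden representations that (as observed empirically) lie in a k-dim subspace:
basis = rng.standard_normal((d1, k))
Ybar = basis @ rng.standard_normal((k, n))   # query batch
Y = basis @ rng.standard_normal((k, m))      # key batch

W_att, X_att = (Q @ Ybar).T, K @ Y
U, V = drone(W_att, X_att, k)
O_exact = W_att @ X_att          # n x m attention-score matrix
O_approx = U @ (V @ X_att)
```

Because the synthetic representations span only a k-dimensional subspace, the rank-k approximation of the score matrix is exact here; on real data it is only near-exact, to the extent the subspace assumption holds.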

3.3. OVERALL ALGORITHM

We have shown that the proposed DRONE method is a generic acceleration module applicable to all parts of neural language models. We summarize DRONE for a feed-forward layer in Algorithm 1. Since in practice we do not have the exact distribution of X, we use the training dataset to calculate the low-rank approximations, as described in Algorithm 1. The attention map can be handled by the same procedure with W = (Q Ȳ)^T and X = K Y, as introduced above. To accelerate the whole model, we need to select an appropriate rank for each component. However, since the approximation of one component affects the distribution of the overall representations, finding the optimal ranks would require a complete search over all possible combinations of rank values, which is infeasible in practice. We thus resort to an intuitive simplification, shown in Algorithm 2 in Supplementary B. In short, because changes to lower-layer parameters shift the distribution of representations in upper layers, we approximate the components in order, from the lower layers toward the higher layers; within each layer, we follow the computational sequence of the underlying modules. A total budget parameter r is given as input to Algorithm 2; the total allowed budget r reflects the efficiency-accuracy trade-off users are willing to accept. We distribute r across modules as R_{l,i}, the allowed loss-increase ratio of the i-th module of the l-th layer in Algorithm 2. For each module, if the approximation at a certain rank does not increase the loss beyond the ratio (1 + R_{l,i}), we use that rank to approximate the module and move on to the next one. The distribution of r into the individual R_{l,i} is based on the empirical inference time of each module: the longer a module takes to compute, the more budget it is allocated, such that the total allowed loss increase satisfies (1 + r) = ∏_{l,i} (1 + R_{l,i}).
A sample of pseudo code is provided in the supplementary to illustrate the process.
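The paper states only the constraint (1 + r) = ∏ (1 + R_{l,i}) and that allocation follows empirical module times; one plausible allocation satisfying both (a sketch, not the authors' exact rule) splits the log-budget proportionally to each module's share of inference time:

```python
import math

def distribute_budget(r, module_times):
    """Split a total allowed loss-increase ratio r over modules so that
    prod(1 + R_i) == 1 + r, weighted by each module's share of inference time.
    One plausible allocation; the paper only states the product constraint."""
    total = sum(module_times)
    log_budget = math.log1p(r)  # log(1 + r)
    return [math.expm1(log_budget * t_i / total) for t_i in module_times]

# Hypothetical per-module times in ms (e.g., attention, FF0, FF1, FF2).
times = [2.0, 1.0, 6.0, 3.0]
R = distribute_budget(0.05, times)

prod = 1.0
for Ri in R:
    prod *= 1.0 + Ri   # recovers 1 + r up to floating point
```

Working in log-space makes the product constraint additive, so the slowest module (here the third) automatically receives the largest loss budget.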

4. EXPERIMENTS

4.1. EXPERIMENTAL SETUP

We evaluate DRONE on both an LSTM model and the Transformer-based BERT model. For the LSTM, we train a 2-layer LSTM-based language model with hidden size 1,500 from scratch on the Penn Treebank (PTB) dataset. For BERT, we evaluate the pre-trained models on the GLUE tasks; various pre-trained models are offered in the open-source platform of Wolf et al. (2019). We use the BERT-base model, which contains 12 layers of the same structure without parameter sharing. Each layer contains an attention module with hidden size 768 and 12 heads, and a small 768 × 768 feed-forward (FF) layer followed by two larger FF layers (768 × 3,072 and 3,072 × 768). As shown in Figure 1, these four components account for most of the computational time in BERT-base. To the best of our knowledge, ours is the first work to discuss generic matrix-computation acceleration on NLP tasks; therefore, our baseline comparison is the SVD low-rank approximation. We also include the state-of-the-art distillation method TinyBERT (Jiao et al., 2019) in the comparison and show that the proposed method can be combined with it to further improve performance. TinyBERT reduces the model to 4 layers with attention dimension 312 and 12 heads, and its FF layers are downsized to 312 × 1,200. As mentioned above, all approximation methods must trade efficiency against accuracy; in principle, one obtains a plot of accuracy versus speedup ratio, as shown in Figure 3 for the MRPC and SST-2 tasks. The tolerable accuracy loss is up to individual users; in this paper, we report approximation results with about 3% loss of accuracy. The inference speed is measured on an Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz chip with a single thread (Table 3 reports the average inference time compared to the distilled models, in milliseconds). Inference is performed in batches of size 100 over the whole evaluation dataset.
The average single-sequence prediction inference time in milliseconds is reported in the results. To perform the approximation, we randomly sub-sample 10% of the training data to serve as the training distribution. After the proposed data-aware low-rank decomposition, we can slightly fine-tune the model to improve performance: we use a relatively small learning rate of 1e-7 and retrain for 1 epoch on the sub-sampled training data.

4.2. RESULTS OF BERT MODELS ON GLUE DATASET

We summarize the results of DRONE on the GLUE tasks in Table 1. We observe that the tasks exhibit different levels of difficulty. The best acceleration we achieve is 92% faster with less than 2% accuracy loss after retraining (MRPC). In addition, DRONE achieves a 1.52x acceleration without accuracy loss on the STS-B task. When applying SVD with the same selected rank for each module, performance drops significantly. This shows that the matrices in the model are generally not low-rank; thus, direct low-rank approximation that ignores the data distribution does not work. The detailed average inference times of the approximated models are listed in Table 2. We observe that the FF2 layer can be accelerated the most. A plausible reason is that the input dimension of the FF2 layer (3,072) is larger than that of all the other layers (64 or 768): when the input distribution lies in a lower-dimensional space, there is much more room for the FF2 layer to be compressed and accelerated by the data-aware low-rank method.

4.3. COMBINATION WITH MODEL SIZE REDUCTION METHODS

Distillation is a competitive method for compressing a model into a smaller one without losing much accuracy. Distilled models have fewer layers or smaller hidden dimensions, resulting in a smaller model size and faster inference, and as shown in Table 3, they incur only a small accuracy loss on some of the GLUE tasks. We emphasize that DRONE is orthogonal to distillation methods: since DRONE is generic, it can be combined with other methods to further improve performance. Because the computation inside a distilled model still consists of full matrix multiplications, DRONE can be applied to find data-aware low-rank approximations of these smaller matrices. The results are summarized in Table 3. Combined with distillation, DRONE further reduces inference time without sacrificing accuracy. In particular, on the SST-2 task DRONE improves the speedup from 11.7x to 15.3x while matching the accuracy of TinyBERT. These results again show that the proposed method is generic and has the potential to be applied in various scenarios.

4.4. RESULTS ON LSTM MODELS

We demonstrate that DRONE also accelerates the matrices in an LSTM model. As shown in Table 5 in Supplementary C, DRONE accelerates the 2-layer LSTM model by about 3.4x on the PTB dataset, with results slightly better than the SVD method. After fine-tuning, DRONE achieves less than 1% accuracy loss. This shows that DRONE is generic: as long as the underlying computation is a matrix multiplication, DRONE can leverage the data distribution to obtain a better low-rank approximation of the computation.

4.5. COULD THE LOW-RANK STRUCTURE BE LEARNED BY END-TO-END TRAINING?

A natural question is whether, once the rank is decided, the same optimal low-rank structure could be learned by end-to-end fine-tuning. We conducted experiments on MRPC to verify this, and the results are summarized in Table 6 in Supplementary D. We start fine-tuning from the SVD result and use the fine-tuning hyper-parameters of (Wolf et al., 2019). After fine-tuning, accuracy improves from 63.8% to 85.8%, which is still slightly worse than DRONE. This shows that fine-tuning the SVD result may not reach the best low-rank solution, whereas the proposed method, via the optimization problem in equation 2, provides a good initial solution to the data-aware low-rank problem.

5. CONCLUSION

In this work, we propose DRONE, a data-aware low-rank approximation method. DRONE leverages the fact that the data distribution in NLP tasks usually lies in a lower-dimensional subspace. By taking the data distribution into consideration, we formulate a data-aware low-rank approximation problem and provide a closed-form solution. Empirical results validate that DRONE significantly outperforms the vanilla SVD method: it achieves at least 20% acceleration with less than 3% accuracy loss. When combined with distillation methods, DRONE achieves up to 15.3x acceleration with less than 2% accuracy loss.

B AN ALGORITHM TO SEARCH OF RANKS UNDER DRONE

In Algorithm 2, we illustrate how to select the rank of each module by applying the DRONE procedure of Algorithm 1. The input to Algorithm 2 consists of the training data, the model with all its weight matrices, and the original training loss. In addition, a pre-defined search grid is necessary. Taking W ∈ R^{768×768} as an example, we can perform a grid search for a proper low rank k over [1, 768], such as {96, 192, 288, 384, ..., 768}. The finer the grid, the more compressed a model we can obtain, at the cost of a longer running time of the DRONE method. With these inputs, we first distribute the total allowed loss increase across the individual modules. We then iteratively apply Algorithm 1 following the computational sequence illustrated in Figure 1. To compress each module, we search for the rank k by going through the grid: if the approximated result does not exceed the allowed loss-increase ratio of the component, we end the search, assign the found rank to the component, and move on. The procedure continues until all components are compressed. The whole process guarantees that the final loss of the compressed model is no greater than (1 + r)L, where L is the original loss before approximation.
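The per-module grid search can be sketched as follows; `compress` and `evaluate` are hypothetical callables standing in for model-specific code, and reverting an unsuccessful compression is omitted for brevity:

```python
def search_ranks(modules, grid, base_loss, budgets, compress, evaluate):
    """Greedy per-module rank search (a sketch of Algorithm 2).

    `compress(module, k)` applies a rank-k approximation to a module and
    `evaluate()` returns the current training loss; both are hypothetical
    callables. `budgets` holds the per-module loss-increase ratios R_{l,i}.
    """
    allowed = base_loss
    chosen = {}
    for module, budget in zip(modules, budgets):
        allowed *= 1.0 + budget        # cumulative bound: base_loss * prod(1 + R)
        for k in grid:                 # smallest rank first
            compress(module, k)
            if evaluate() <= allowed:  # loss stays within this module's budget
                chosen[module] = k
                break
        else:
            chosen[module] = None      # no grid rank fit the budget
    return chosen

# Toy usage: the loss of this toy "model" is 1 + sum(1/k) over compressed modules.
state = {}
def compress(module, k):
    state[module] = k
def evaluate():
    return 1.0 + sum(1.0 / k for k in state.values())

chosen = search_ranks(["attn", "ff1"], [1, 2, 4, 8], 1.0, [0.5, 0.5],
                      compress, evaluate)
```

Because each module is only accepted if the cumulative loss bound holds, the final loss is at most base_loss multiplied by the product of the (1 + R) factors, matching the guarantee stated above.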



https://huggingface.co/transformers/v2.1.1/examples.html#glue



Figure 1: Illustration of the BERT-base computational model. |V| denotes the number of tokens in the vocabulary. #Classes denotes the number of classes in the downstream classification task. Input encoding, Feed-forward 3, and Feed-forward 4 are computed only once during inference and thus contribute little to the overall computational time.

Algorithm 1: Data-Aware Low-rank Compression of a Feed-forward Layer.
Input: rank k; training data D_train; original weight matrix W; prediction model M
Output: low-rank approximation U*, V*
  X ← {}
  for each x in D_train do
      Feed x into M and extract the representation φ(x), i.e., the input that will be multiplied with W.
      Append φ(x) to X.
  end for
  Given X, k, and W, solve for the optimal low-rank matrices U*, V* by equation 4.

Figure 3: Illustration of efficiency and efficacy trade-off. Each point in this graph represents a specific ratio of training loss increase after approximation.

Table 1: The experimental results of natural language inference tasks on the GLUE dataset.

Table 2: The detailed average inference time of each component in the model with the retrained DRONE method, in milliseconds.

TinyBERT, one of the most competitive distillation methods, indeed achieves good performance within 3% accuracy loss.

The average inference time compared to distilled models on GPU, in milliseconds.

Table 5: The average inference time of each component of the 2-layer LSTM model. The proposed method and SVD use the same ranks, so their inference times are approximately the same. The unit is milliseconds, and the number in parentheses shows the ratio relative to the overall inference time.

Table 6: Illustration of SVD fine-tuning. Using the same rank as the proposed method, SVD accuracy drops significantly. After fine-tuning the SVD-based approximation, the accuracy can be recovered, but it is still less competitive than the proposed method.

F PYTHON PSEUDO CODE OF SOLVING EQUATION 2


Supplementary Material for: Data-Aware Low-Rank Compression for Large NLP Models

A PROOF OF THEOREM 1

Theorem 1. Assume rank(W) = r and rank(X) = t. The closed-form solution M* of the optimization problem in equation 2 is M* = V_{W,r} S_{W,r}^{-1} Z_k S_{X,t}^{-1} V_{X,t}^T, where Z_k is the rank-k truncated SVD of Z = S_{W,r} V_{W,r}^T V_{X,t} S_{X,t}.

Proof. We first consider the unconstrained problem. Since X^T = U_X S_X V_X^T, we have X = V_X S_X U_X^T, and therefore

min_M ||W X - W M X||_F^2 = min_M ||U_W S_W V_W^T V_X S_X U_X^T - U_W S_W V_W^T M V_X S_X U_X^T||_F^2
                          = min_M ||S_W V_W^T V_X S_X - S_W V_W^T M V_X S_X||_F^2,

where the second equality holds because U_W and U_X are orthonormal matrices. Note that we can expand the term S_W V_W^T V_X S_X in block form as:

S_W V_W^T V_X S_X = [ S_{W,r} V_{W,r}^T V_{X,t} S_{X,t}  0 ; 0  0 ].

Similarly, we have

S_W V_W^T M V_X S_X = [ S_{W,r} V_{W,r}^T M V_{X,t} S_{X,t}  0 ; 0  0 ].

Therefore, with the rank constraint, the problem above reduces to:

min_{rank(M) = k} ||Z - S_{W,r} V_{W,r}^T M V_{X,t} S_{X,t}||_F^2,   where Z = S_{W,r} V_{W,r}^T V_{X,t} S_{X,t}.

The minimization attains its optimum when S_{W,r} V_{W,r}^T M V_{X,t} S_{X,t} equals the rank-k truncated SVD Z_k of Z, by the fundamental (Eckart-Young) property of the SVD. Thus we have:

M* = V_{W,r} S_{W,r}^{-1} Z_k S_{X,t}^{-1} V_{X,t}^T,

which satisfies S_{W,r} V_{W,r}^T M* V_{X,t} S_{X,t} = Z_k since V_{W,r}^T V_{W,r} = I_r and V_{X,t}^T V_{X,t} = I_t. ∎
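As a numerical sanity check of the theorem (not part of the proof), the closed-form M* can be compared against arbitrary rank-k alternatives; all dimensions below are illustrative:

```python
import numpy as np

# Numerical sanity check: the closed-form M* should never lose to other
# rank-k choices of M on the objective ||W X - W M X||_F.
rng = np.random.default_rng(0)
d2, d1, n, k = 12, 10, 30, 3
W = rng.standard_normal((d2, d1))
X = rng.standard_normal((d1, 5)) @ rng.standard_normal((5, n))  # rank-5 inputs

# SVDs of W and X^T, truncated to numerical ranks r and t.
_, sw, Vwt = np.linalg.svd(W, full_matrices=False)
_, sx, Vxt = np.linalg.svd(X.T, full_matrices=False)
r = int((sw > 1e-10 * sw[0]).sum())
t = int((sx > 1e-10 * sx[0]).sum())

# Z = S_{W,r} V_{W,r}^T V_{X,t} S_{X,t} and its rank-k truncation Z_k.
Z = (sw[:r, None] * (Vwt[:r] @ Vxt[:t].T)) * sx[None, :t]
Uz, sz, Vzt = np.linalg.svd(Z, full_matrices=False)
Zk = (Uz[:, :k] * sz[:k]) @ Vzt[:k]

# M* = V_{W,r} S_{W,r}^{-1} Z_k S_{X,t}^{-1} V_{X,t}^T
M_star = (Vwt[:r].T / sw[None, :r]) @ Zk @ (Vxt[:t] / sx[:t, None])
err_star = np.linalg.norm(W @ X - W @ M_star @ X)
```

By the theorem, err_star lower-bounds the objective over every rank-k M, which random rank-k candidates confirm empirically.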

