INITIALIZATION AND REGULARIZATION OF FACTORIZED NEURAL LAYERS

Abstract

Factorized layers, i.e. operations parameterized by products of two or more matrices, occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to those of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with or pruning a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.

1. INTRODUCTION

Most neural network layers consist of matrix-parameterized functions followed by simple operations such as activation or normalization. These layers are the main sources of model expressivity, but also the biggest contributors to computation and memory cost; modifying them to improve computational efficiency while maintaining predictive performance is thus highly desirable. We study the approach of factorizing layers, i.e. reparameterizing them so that their weights are defined as products of two or more matrices. When these factors are smaller than the original matrix, the resulting networks are more efficient for both training and inference (Denil et al., 2013; Moczulski et al., 2015; Ioannou et al., 2016; Tai et al., 2016), yielding model compression. On the other hand, if training cost is not a concern, one can increase the width or depth of the factors to over-parameterize models (Guo et al., 2020; Cao et al., 2020), improving learning without increasing inference-time cost; this can be seen as a simple, teacher-free form of knowledge distillation. Factorized layers also arise implicitly, as in the case of multi-head attention (MHA) (Vaswani et al., 2017). Despite such appealing properties, networks with factorized neural layers are non-trivial to train from scratch, requiring custom initialization, regularization, and optimization schemes. In this paper we focus on initialization and regularization, and on how they interact with gradient-based optimization of factorized layers. We first study spectral initialization (SI), which initializes factors using singular value decomposition (SVD) so that their product approximates the target un-factorized matrix. Then we study Frobenius decay (FD), which regularizes the product of the matrices in a factorized layer rather than its individual terms. Both are motivated by matching the training regimen of the analogous un-factorized optimization.
Note that SI has been previously considered in the context of model compression, albeit usually for factorizing pre-trained models (Nakkiran et al., 2015; Yaguchi et al., 2019; Yang et al., 2020) rather than as a low-rank initialization for end-to-end training; FD has been used in model compression with an uncompressed teacher (Idelbayev & Carreira-Perpiñán, 2020). We formalize and study the justifications of SI and FD, both from the classical perspective of matching the un-factorized objective and its scaling, and in the presence of BatchNorm (Ioffe & Szegedy, 2015), where this reasoning does not apply. Extending recent studies of weight-decay (Zhang et al., 2019), we argue that the effective step-size at spectral initialization is controlled by the factorization's Frobenius norm, and show convincing evidence that weight-decay penalizes the nuclear norm. We then turn to applications, starting with low-memory training, which is dominated by unstructured sparsity methods, i.e. guessing "lottery tickets" (Frankle & Carbin, 2019), with a prevailing trend of viewing low-rank methods as uncompetitive for compression (Blalock et al., 2020; Zhang et al., 2020; Idelbayev & Carreira-Perpiñán, 2020; Su et al., 2020). Here we show that, without tuning, factorized neural layers outperform all of these sparsity-based methods on ResNet architectures (He et al., 2016), despite lagging on VGG (Simonyan & Zisserman, 2015). Through ablations, we show that this result is due to using both SI and FD on the factorized layers. We further compare to a recent evaluation of tensor-decomposition approaches for compressed WideResNet training (Zagoruyko & Komodakis, 2016; Gray et al., 2019), showing that (a) low-rank approaches with SI and FD can outperform them and (b) they are themselves helped by tensor variants of SI and FD.
We also study a fledgling subfield we term overcomplete knowledge distillation (Arora et al., 2018; Guo et al., 2020; Cao et al., 2020), in which model weights are over-parameterized as overcomplete factorizations; after training, the factors are multiplied to obtain a compact representation of the same network. We show that FD leads to significant improvements: e.g. we outperform ResNet110 with an overcomplete ResNet56 that takes 1.5x less time to train and has 2x fewer parameters at test-time. Finally, we study Transformer architectures, starting by showing that FD improves translation performance when applied to MHA. We also show that SI is critical for low-rank training of the model's linear layers. In an application to BERT pre-training (Devlin et al., 2019), we construct a Frobenius-regularized variant of the LAMB method (You et al., 2020), which we call FLAMBé, and show that, much like in the translation setting, it improves performance for both full-rank and low-rank MHA layers. To summarize, our main contributions are (1) motivating the study of training factorized layers via both the usual setting (model compression) and recent applications (distillation, multi-head attention), (2) justifying the use of SI and FD mathematically and experimentally, and (3) demonstrating their effectiveness by providing strong baselines and novel advances in many settings. Code to reproduce our results is available here: https://github.com/microsoft/fnl_paper.

1.1. RELATED WORK

We are not the first to study gradient descent on factorized layers; in particular, deep linear nets are well-studied in theory (Saxe et al., 2014; Gunasekar et al., 2019). Apart from Bernacchia et al. (2018), these works largely examine existing algorithms, although Arora et al. (2018) do effectively propose overcomplete knowledge distillation. Rather than the descent method, we focus on initialization and regularization. For the former, several papers use SI after training (Nakkiran et al., 2015; Yaguchi et al., 2019; Yang et al., 2020), while Ioannou et al. (2016) argue for initializing factors as though they were single layers, which we find inferior to SI in some cases. Outside deep learning, spectral methods have also been shown to yield better initializations for certain matrix and tensor problems (Keshavan et al., 2010; Chi et al., 2019; Cai et al., 2019). For regularization, Gray et al. (2019) suggest compression-rate scaling (CRS), which scales weight-decay by the reduction in parameter count; this is justified via the usual Bayesian interpretation of $\ell_2$-regularization (Murphy, 2012). However, we find that FD is superior to any tuning of regular weight-decay, which subsumes CRS. Our own analysis is based on recent work suggesting that the function of weight-decay is to aid optimization by preventing the effective step-size from becoming too small (Zhang et al., 2019).

2. PRELIMINARIES ON FACTORIZED NEURAL LAYERS

In the training phase of (self-)supervised ML, we often solve optimization problems of the form $\min_{\theta \in \Theta} \frac{1}{|S|} \sum_{(x,y) \in S} \ell(f_\theta(x), y) + \Omega(\theta)$, where $f_\theta : \mathcal{X} \to \mathcal{Y}$ is a function from input domain $\mathcal{X}$ to output domain $\mathcal{Y}$ parameterized by elements $\theta \in \Theta$, $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}$ is a scalar-valued loss function, $\Omega : \Theta \to \mathbb{R}$ is a scalar-valued regularizer, and $S \subset \mathcal{X} \times \mathcal{Y}$ is a finite set of (self-)supervised training examples. We study the setting where $f_\theta$ is a neural network, an $L$-layer function whose parameters $\theta$ consist of $L$ matrices $W_i \in \mathbb{R}^{m_i \times n_i}$ and whose output $f_\theta(x)$ given input $x$ is defined recursively using $L$ functions $g_i$ via the formula $x_i = g_i(W_i, x_{i-1})$, with $x_0 = x$ and $f_\theta(x) = x_L$. The standard approach to training $f_\theta$ is to specify the regularizer $\Omega$, (randomly) pick an initialization in $\Theta$, and iteratively update the parameters using some first-order algorithm such as SGD to optimize the objective above until some stopping criterion is met. However, in many cases we instead optimize over factorized variants of these networks, in which some or all of the matrices $W_i \in \mathbb{R}^{m_i \times n_i}$ are re-parameterized as a product $W_i = U_i (\prod_{j=1}^{d_i} M_{ij}) V_i^T$ for some inner depth $d_i \ge 0$ and matrices $U_i \in \mathbb{R}^{m_i \times r_i}$, $V_i \in \mathbb{R}^{n_i \times r_i}$, and $M_{ij} \in \mathbb{R}^{r_i \times r_i}$ for all $j$. As discussed in the following examples, this can be done to obtain better generalization, improve optimization, or satisfy practical computational or memory constraints during training or inference. For simplicity, we drop the subscript $i$ whenever re-parameterizing only one layer, and only consider the cases where the inner depth $d$ is 0 or 1.
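As a concrete sketch, the recursive factorized-layer definition above takes only a few lines of numpy; all shapes and variable names here are illustrative, not taken from the paper's code:

```python
import numpy as np

def factorized_weight(U, Ms, V):
    """Recompose W = U (prod_j M_j) V^T from its factors."""
    W = U
    for M in Ms:
        W = W @ M
    return W @ V.T

rng = np.random.default_rng(0)
m, n, r = 8, 6, 3
U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
M = rng.standard_normal((r, r))      # inner depth d_i = 1
W = factorized_weight(U, [M], V)     # plays the role of one W_i
x = rng.standard_normal(n)
h = np.maximum(W @ x, 0.0)           # one layer x_i = g_i(W_i, x_{i-1}) with ReLU g_i
```

By construction the recomposed matrix has rank at most $r$, regardless of the inner depth.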

2.1. FULLY-CONNECTED LAYERS

A fully-connected layer takes an $n$-dimensional input $x_{i-1}$ and outputs an $m$-dimensional vector $x_i = \sigma(W x_{i-1})$, where $\sigma : \mathbb{R}^m \to \mathbb{R}^m$ is an element-wise activation function. Here, decomposing $W \in \mathbb{R}^{m \times n}$ into the product $UV^T$, where $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and setting $r \ll \min\{m, n\}$ reduces computation and memory costs from $O(mn)$ to $O(mr + nr)$. We refer to this setting as model compression. Standard learning theory suggests that a small rank $r$ also improves generalization: e.g. for a factorized fully-connected ReLU network, applying $\|W\|_F^2 / \|W\|_2^2 \le \mathrm{rank}(W)$ to Neyshabur et al. (2018, Theorem 1) and substituting $W_i = U_i V_i^T$ gives a w.h.p. margin bound $\tilde{O}(\sqrt{mr/|S|})$, suggesting that generalization error varies with the square root of the rank (see Corollary A.1). Alternatively, by setting $r \ge \min\{m, n\}$ and/or including an inner matrix $M \in \mathbb{R}^{r \times r}$, we can attempt to take advantage of improved optimization due to increased width (Du & Hu, 2019) and/or increased depth (Arora et al., 2018). Crucially, this does not increase inference costs because we can recompose the matrix after training and just use the product. As the goal is to obtain a better small model by first training a large one, we refer to this setting as overcomplete knowledge distillation; of course, unlike regular distillation it is much simpler since there is no student-teacher training stage.
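The cost comparison above is easy to verify numerically; a minimal sketch (dimensions chosen arbitrarily) that also shows why one never needs to materialize $UV^T$ when applying a low-rank layer:

```python
import numpy as np

m, n, r = 512, 256, 16
# Un-factorized layer: O(mn) parameters; factorized: O(mr + nr).
dense_params = m * n                     # 131072
lowrank_params = m * r + n * r           # 12288

rng = np.random.default_rng(0)
U, V = rng.standard_normal((m, r)), rng.standard_normal((n, r))
x = rng.standard_normal(n)
# Apply V^T and then U so the work is also O(mr + nr), not O(mn).
y_cheap = U @ (V.T @ x)
y_full = (U @ V.T) @ x                   # same result, more memory and compute
assert np.allclose(y_cheap, y_full)
```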

2.2. CONVOLUTIONAL LAYERS

A 2d convolutional layer takes an $h \times w \times c_{i-1}$-dimensional input $x_{i-1}$ and outputs an $h \times w \times c_i$-dimensional output $x_i$ defined by convolving $c_i$ different $k \times k$ filters over each of $c_{i-1}$ input channels; often the result is passed through a nonlinearity. 2d convolutional layers are parameterized by $c_i \times c_{i-1} \times k \times k$ tensors and require $O(k^2 c_i c_{i-1})$ memory and compute. A straightforward way of factorizing this tensor without using tensor decomposition is to reshape it into a $c_i k \times c_{i-1} k$ matrix $W$, which can then be decomposed as $W = UV^T$ for $U \in \mathbb{R}^{c_i k \times r}$, $V \in \mathbb{R}^{c_{i-1} k \times r}$, and some rank $r > 0$. As in the fully-connected case, we can either set the rank $r$ to be small in order to reduce the number of parameters, or alternatively increase the width ($r$) or the depth ($d$) of the factorization to do overcomplete knowledge distillation. Note that in the low-rank case a naive approach does not save computation, since we must first multiply $U$ and $V^T$, reshape the product $UV^T$, and then use the resulting tensor in a regular 2d convolution of the original size and complexity. However, as shown by Tai et al. (2016), applying the 2d $k \times k$ convolution with $c_{i-1}$ input channels and $c_i$ output channels obtained by reshaping $UV^T$ is equivalent to a composition of two 1d convolutions: the first, defined by $V^T \in \mathbb{R}^{r \times c_{i-1} k}$, consists of $r$ output channels with filters of size $k$ along one input dimension, and the second, defined by $U \in \mathbb{R}^{c_i k \times r}$, consists of $c_i$ output channels with filters of size $k$ along the other input dimension. Together the two 1d convolutions require $O(kr(c_i + c_{i-1}))$ memory and computation, which is significantly better than the $O(k^2 c_i c_{i-1})$ cost of the unfactorized case if $r \ll k \min\{c_i, c_{i-1}\}$.
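A hedged numpy sketch of the reshaping step and the cost arithmetic; the exact axis pairing used to form the $c_i k \times c_{i-1} k$ matrix is one of several equivalent conventions and is not necessarily the one used in the paper's code:

```python
import numpy as np

c_out, c_in, k, r = 64, 32, 3, 8
T = np.random.default_rng(0).standard_normal((c_out, c_in, k, k))
# Reshape the 4d kernel into a (c_out*k) x (c_in*k) matrix (illustrative pairing).
W = T.transpose(0, 2, 1, 3).reshape(c_out * k, c_in * k)
# A rank-r factorization of W via truncated SVD.
U_all, s, Vt = np.linalg.svd(W, full_matrices=False)
W_r = (U_all[:, :r] * s[:r]) @ Vt[:r]    # best rank-r approximation of W

# Per-position cost: the two 1d convolutions win when r << k * min(c_in, c_out).
full_cost = k * k * c_out * c_in         # O(k^2 c_i c_{i-1})
fact_cost = k * r * (c_out + c_in)       # O(k r (c_i + c_{i-1}))
```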

2.3. MULTI-HEAD ATTENTION

An MHA layer (Vaswani et al., 2017) with $H$ attention heads and hidden dimension $d$ can be expressed as being parameterized by $4H$ matrices: one each of $Q_h, K_h, V_h, O_h \in \mathbb{R}^{d \times d/H}$ for each head $h$. Then for a length-$T$ input $x \in \mathbb{R}^{T \times d}$ it outputs

$\sum_{h=1}^{H} \mathrm{Softmax}\left(\frac{x Q_h K_h^T x^T}{\sqrt{d/H}}\right) x V_h O_h^T$ (1)

MHA thus combines $2H$ quadratic forms $Q_h K_h^T$, $V_h O_h^T$ of rank $r = d/H$, each a product of matrices, i.e. a factorized layer. We refer to the first form as "Query-Key" and the second as "Output-Value." Note that $r$ can be varied independently of $d$ to change expressivity, memory, and computation.

Figure 1: Average nuclear norm across factorized layers of ResNet20 during CIFAR-10 training when initialized regularly (left) and using SI (center); the dotted line is the upper bound on the nuclear norm regularized by weight-decay (2). The right plot tracks the same quantities for the case of regular weight-decay across different rank-scales. "no decay (normalized)" normalizes matrix factors after each step to have the same norm as "Frobenius decay" (detailed in Section 3.3).
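Equation (1) can be implemented directly; the following numpy sketch uses illustrative shapes and makes the two quadratic forms per head explicit:

```python
import numpy as np

def mha(x, Q, K, V, O):
    """Multi-head attention as in Eq. (1): per head, the Query-Key form
    Q_h K_h^T inside the softmax and the Output-Value form V_h O_h^T outside."""
    T, d = x.shape
    H = len(Q)
    scale = np.sqrt(d / H)
    out = np.zeros((T, d))
    for h in range(H):
        scores = (x @ Q[h]) @ (x @ K[h]).T / scale      # = x Q_h K_h^T x^T / sqrt(d/H)
        scores = np.exp(scores - scores.max(axis=-1, keepdims=True))
        attn = scores / scores.sum(axis=-1, keepdims=True)
        out += attn @ (x @ V[h]) @ O[h].T               # uses V_h O_h^T
    return out

rng = np.random.default_rng(0)
T, d, H = 5, 8, 2
x = rng.standard_normal((T, d))
# Q, K, V, O: one d x (d/H) matrix per head, so each quadratic form has rank d/H.
mats = [[rng.standard_normal((d, d // H)) for _ in range(H)] for _ in range(4)]
y = mha(x, *mats)
```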

3. INITIALIZATION AND REGULARIZATION

We now define the initialization and regularization schemes we study as natural extensions of techniques for non-factorized models. They thus require no tuning when an existing training implementation of a non-factorized deep net is available. We later discuss how to justify these schemes when layers are normalized, e.g. using BatchNorm (Ioffe & Szegedy, 2015) . In all experiments with convolutional models in this and subsequent sections we factorize all layers except the first and last, which are small, and determine layer ranks by multiplying a uniform scaling factor by the product of a layer's output channels and kernel width. This rank-scale can be varied to attain the desired number of parameters. Note that our approach may be further improved via a more sophisticated or adaptive rank-assignment scheme (Idelbayev & Carreira-Perpiñán, 2020) .
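The rank-assignment rule above is simple arithmetic; a sketch with hypothetical layer shapes (not our actual architectures):

```python
# Hypothetical layer shapes (out_channels, in_channels, k, k) for a small convnet.
# rank_i = rank_scale * out_channels_i * k, following the uniform rule in the text.
layers = [(16, 16, 3, 3), (32, 16, 3, 3), (64, 32, 3, 3)]
rank_scale = 0.25
ranks = [max(1, int(rank_scale * c_out * k)) for (c_out, _, k, _) in layers]
# Parameter count of the U, V factors of each reshaped c_out*k x c_in*k matrix.
params = sum(r * (c_out * k + c_in * k)
             for r, (c_out, c_in, k, _) in zip(ranks, layers))
```

Varying `rank_scale` then trades off accuracy against the total factorized parameter count.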

3.1. SPECTRAL INITIALIZATION

Initialization is a major focus of deep learning research (He et al., 2015; Mishkin & Matas, 2016; Yang & Schoenholz, 2017). A common approach is to prevent compounding changes in the norms of the intermediate representations across layers caused by repeated multiplication by the weight matrices. The spectral initialization scheme for initializing low-rank factorized layers attempts to inherit, when possible, this property from an existing initialization by using SVD to ensure that the resulting product matrix is as close as possible to the original parameter:

Definition 3.1. Let $W \in \mathbb{R}^{m \times n}$ be a parameter of an unfactorized layer. For $r \le \min\{m, n\}$ the spectral initialization (SI) of the factors $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$ of the corresponding factorized layer sets $U = \tilde{U}\sqrt{\Sigma}$ and $V = \tilde{V}\sqrt{\Sigma}$, for $\tilde{U}, \Sigma, \tilde{V} = \mathrm{SVD}_r(W)$ given by the rank-$r$ SVD of $W$.

SI preserves the largest singular value of $W$, so if the original scheme did not suffer from a compounding increase in representation norm then neither will spectral initialization. On the other hand, while low-rank layers impose a nullspace, SI aligns it with the directions given least weight by $W$.
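Definition 3.1 amounts to a few lines on top of an SVD routine; a minimal numpy sketch:

```python
import numpy as np

def spectral_init(W, r):
    """SI (Definition 3.1): U = Ũ√Σ, V = Ṽ√Σ from the rank-r SVD of W."""
    Ut, s, Vt = np.linalg.svd(W, full_matrices=False)
    sqrt_s = np.sqrt(s[:r])
    return Ut[:, :r] * sqrt_s, Vt[:r].T * sqrt_s

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 6)) / np.sqrt(6)   # e.g. a Gaussian initialization
U, V = spectral_init(W, r=4)
# UV^T is the best rank-4 approximation of W; the top singular value is preserved.
assert np.isclose(np.linalg.norm(U @ V.T, 2), np.linalg.norm(W, 2))
# At full rank, SI reproduces W exactly.
U6, V6 = spectral_init(W, r=6)
assert np.allclose(U6 @ V6.T, W)
```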

3.2. FROBENIUS DECAY

Weight-decay is a common regularizer for deep nets, often implemented explicitly by adding $\Omega(\theta) = \frac{\lambda}{2} \sum_{W \in \theta} \|W\|_F^2$ to the objective for some $\lambda \ge 0$. Classically, it is thought to improve generalization by constraining model capacity. When training factorized layers parameterized by $U (\prod_{j=1}^d M_j) V^T$, the easiest approach to implement is to replace each $\frac{\lambda}{2}\|W\|_F^2$ term in $\Omega(\theta)$ by $\frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 + \sum_{j=1}^d \|M_j\|_F^2 \right)$. However, this yields a very different optimization problem: for example, in the case of $d = 0$ this regularizer is in fact an upper bound on the nuclear norm of the recomposed matrix $UV^T$ (Srebro & Shraibman, 2005, Lemma 1):

$\frac{\lambda}{2} \left( \|U\|_F^2 + \|V\|_F^2 \right) \ge \min_{\tilde{U}\tilde{V}^T = UV^T} \frac{\lambda}{2} \left( \|\tilde{U}\|_F^2 + \|\tilde{V}\|_F^2 \right) = \lambda \|UV^T\|_*$ (2)

In fact, Figure 1 shows that for weight-decay this upper bound is tight throughout the training of factorized ResNet20 across a variety of ranks and initializations, suggesting that the naive approach is indeed regularizing the nuclear norm rather than the Frobenius norm. Since compression already constrains capacity, in the low-rank case one might favor simply reducing regularization, e.g. multiplying $\lambda$ by the compression rate (Gray et al., 2019). However, Figure 2 shows that this can lead to worse performance, and the approach still penalizes the nuclear norm. Frobenius decay avoids this issue by simply penalizing the squared norm of the entire factorization:

Definition 3.2. For $\lambda \ge 0$ let $\frac{\lambda}{2}\|W\|_F^2$ be the contribution of an unfactorized layer parameterized by $W \in \mathbb{R}^{m \times n}$ to the penalty term. Then the Frobenius decay (FD) penalty on matrices $U \in \mathbb{R}^{m \times r}$, $V \in \mathbb{R}^{n \times r}$, and $M_j \in \mathbb{R}^{r \times r}$ of the corresponding factorized layer is $\frac{\lambda}{2} \left\| U \left( \prod_{j=1}^d M_j \right) V^T \right\|_F^2$.

By substituting the factorization directly, FD makes the least change to the problem: rank-$r$ optima of the non-factorized objective will also minimize the factorized one.
We can also bound the generalization error of the ReLU net from Section 2.1 by a term $\tilde{O}\left( \sqrt{ \frac{m}{|S|} \sum_{i=1}^L \left\| U_i \left( \prod_{j=1}^d M_{ij} \right) V_i^T \right\|_F^2 } \right)$ varying directly with the quantity penalized by FD (see Corollary A.2). Notably, FD is a stronger penalty than the nuclear norm implicitly regularized by weight-decay, yet it still yields better models.
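The relationship between the naive factor-wise penalty, the nuclear-norm bound in (2), and the FD penalty can be checked numerically; a small numpy sketch with arbitrary dimensions:

```python
import numpy as np

rng = np.random.default_rng(0)
U, V = rng.standard_normal((8, 4)), rng.standard_normal((6, 4))
lam = 0.1
W = U @ V.T

# Naive factor-wise weight-decay on U and V (the d = 0 case).
naive_decay = lam / 2 * (np.linalg.norm(U, 'fro')**2 + np.linalg.norm(V, 'fro')**2)
# The nuclear-norm penalty it upper-bounds, per inequality (2).
nuclear = lam * np.linalg.norm(W, 'nuc')
# Frobenius decay (Definition 3.2) penalizes the recomposed matrix directly.
frobenius_decay = lam / 2 * np.linalg.norm(W, 'fro')**2

assert naive_decay >= nuclear
```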

3.3. INITIALIZATION AND REGULARIZATION IN THE PRESENCE OF NORMALIZATION

The use of spectral initialization and Frobenius decay is largely motivated by the norms of the recomposed matrices: SI prevents them from increasing feature-vector norms across layers, while FD constrains model capacity via parameter norms. However, normalization layers like BatchNorm (Ioffe & Szegedy, 2015) and others (Ba et al., 2016) largely negate the forward-pass and model-capacity effects of the norms of the weights parameterizing the layers they follow. Thus for most modern models we need a different explanation for the effectiveness of SI and FD. Despite the fact that most layers' norms do not affect inference or capacity, weight-decay remains useful for optimizing deep nets. Recently, Zhang et al. (2019) extended the analysis of Hoffer et al. (2018) to argue that the effective step-size of the weight direction $\hat{W}$ is roughly $\eta / \|W\|_F^2$, where $\eta$ is the SGD step-size. Thus, by preventing the norm of $W$ from growing too large, weight-decay maintains a large-enough effective step-size during training. We draw on this analysis to explore the effect of SI and FD on factorized models. For simplicity, we define a normalized layer to be one that does not depend on the scale of its parameter. Ignoring stability offset terms, this definition roughly holds for normalized linear layers, convolutional layers in ResNets, and the Output-Value quadratic form in Transformers if the residual connection is added after rather than before normalization.

Definition 3.3. A normalized layer $g(W, x)$ parameterized by $W \in \mathbb{R}^{m \times n}$ is one that satisfies $g(W, x) = g(\rho W, x)$ for all $W \in \mathbb{R}^{m \times n}$ and all positive scalars $\rho$.

Because the output does not depend on the magnitude of $UV^T$, what matters is the direction of the composed matrix. During an SGD step this direction is updated as follows (proof in Appendix B):

Claim 3.1. At all steps $t \ge 0$ let $g$ be a normalized layer of a differentiable model $f_{\theta_t} : \mathcal{X} \to \mathcal{Y}$ parameterized by $U_t V_t^T$ for $U_t \in \mathbb{R}^{m \times r}$, $V_t \in \mathbb{R}^{n \times r}$, $r \ge 1$.
Suppose we update $P = U$ and $P = V$ by SGD, setting $P_{t+1} \leftarrow P_t - \eta \nabla_{P_t}$ using a gradient $\nabla_{P_t} = \frac{1}{|B|} \sum_{(x,y) \in B} \nabla_{P_t} \ell(f_{\theta_t}(x), y)$ over batch $B \subset \mathcal{X} \times \mathcal{Y}$ and $\eta > 0$ sufficiently small. Then for $\bar{\nabla}_t = \nabla_{\hat{W}_t} V_t V_t^T + U_t U_t^T \nabla_{\hat{W}_t}$ we have that the vectorized direction $\hat{w}_t = \mathrm{vec}(\hat{W}_t)$ of $W_t = U_t V_t^T$ is updated as

$\hat{w}_{t+1} \leftarrow \hat{w}_t - \frac{\eta}{\|W_t\|_F^2} \left( I_{mn} - \hat{w}_t \hat{w}_t^T \right) \mathrm{vec}(\bar{\nabla}_t) + O(\eta^2)$ (3)

Note that (1) we ignore decay because $\lambda = O(\eta)$ so any resulting term is $O(\eta^2)$, and (2) the update rule is almost the same as that obtained for the unfactorized case by Zhang et al. (2019), except they have $\bar{\nabla}_t$ as the true gradient of the direction. Thus, apart from a rank-one correction, $\hat{W}_t$ is approximately updated with step-size $\eta / \|W_t\|_F^2$ multiplying a linear transformation of its gradient. To understand the nature of this transformation, note that at spectral initialization we have that $U_0^T U_0 = V_0^T V_0 = \Sigma_r$ are diagonal matrices of singular values of the full-rank initialization $W$; furthermore, if $W$ is a Gaussian ensemble with scale $1/\sqrt{n}$, which roughly aligns with common initialization schemes (He et al., 2015), then its singular values are roughly distributed around 1 and supported on $[0, 2]$ (Bai & Yin, 1993). Since $\bar{\nabla}_t = \nabla_{\hat{W}_t} V_t V_t^T + U_t U_t^T \nabla_{\hat{W}_t}$, this suggests that, at spectral initialization, an effective learning rate of $\eta / \|W_0\|_F^2$ is a reasonable approximation for the factorized update. This points to the role of SI being to initialize the factorization at an appropriate scale, and perhaps also to make the first update more aligned with the gradient w.r.t. $\hat{W}_0$. As in the unfactorized case, our analysis suggests that the main role of decay may be to maintain a large effective learning rate $\eta / \|W\|_F^2$; furthermore, FD may be more effective than regular decay because it provides stronger regularization and directly penalizes the quantity of interest.

(In Figure 3, the effective step size reported in the right panel is the average over convolution layers $i$ of $\eta / \|U_i V_i^T\|_F^2$.)
We support this hypothesis using experiments analogous to Zhang et al. (2019) by comparing training a low-rank ResNet-20 with FD to training it with no decay but at each step normalizing all BatchNormed layers to have the same Frobenius norm as the FD run. In Figure 3 we see that the latter scheme closely tracks the former in terms of both the training loss and the test accuracy. Figure 3 also shows that FD maintains a higher effective step-size than regular weight-decay throughout training.
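Definition 3.3's scale-invariance is easy to verify for a simple batch-standardized linear layer; a minimal sketch ignoring the stability offset, as in the text:

```python
import numpy as np

def batchnormed_linear(W, x, eps=0.0):
    """Linear map followed by per-feature standardization over the batch."""
    z = x @ W.T
    return (z - z.mean(0)) / np.sqrt(z.var(0) + eps)

rng = np.random.default_rng(0)
W = rng.standard_normal((5, 3))
x = rng.standard_normal((16, 3))
# Definition 3.3: g(W, x) = g(ρW, x) for any ρ > 0, since scaling W scales the
# pre-normalization activations, their mean, and their standard deviation alike.
assert np.allclose(batchnormed_linear(W, x), batchnormed_linear(7.3 * W, x))
```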

4. COMPRESSED MODEL TRAINING: LOW-RANK, SPARSE, AND TENSORIAL

We first study SI and FD for training factorized models with low-rank convolutional layers, comparing against the dominant approaches to low-memory training: sparse layers and tensor decomposition. For direct comparison we evaluate models with near-identical parameter counts; note that by normalizing by memory we disadvantage the low-rank approach, which often has a comparative advantage in speed. All models are trained for 200 epochs with the same optimizer settings as the unfactorized models; the weight-decay coefficient is left unchanged when weight-decay is replaced by FD.

Low-Rank and Sparse Training:

We train the modified ResNet32 and VGG19 used in the lottery-ticket-guessing literature (Wang et al., 2020; Su et al., 2020). Motivated by Frankle & Carbin (2019), such methods fix a sparsity pattern at initialization and train only the unpruned weights. While they achieve high parameter savings, their computational advantages over full models are less clear, as software and accelerators often do not efficiently implement arbitrary sparsity (Paszke et al., 2019; NVIDIA, 2020). In contrast, we do see acceleration via low-rank convolutions, almost halving the time of ResNet's forward and backward pass at the highest compression. For completeness, we also show methods that vary the sparsity pattern (dynamic), prune trained models (pruning), and prune trained models and retrain (lottery); note that the latter two require training an uncompressed model. In Table 1 we see that the low-rank approach with SI & FD dominates at the higher memory settings of ResNet across all three datasets considered, often outperforming even approaches that train an uncompressed model first. It is also close to the best compressed training approach in the lowest memory setting for CIFAR-100 (Krizhevsky, 2009) and Tiny-ImageNet (Deng et al., 2009).

Table 1: Comparison of low-rank and sparse training in a common evaluation setting for "ticket-guessing" (Wang et al., 2020; Su et al., 2020). Best results overall are italicized; best results from full low-memory training are bolded. For complete results and deviations see Tables 5 and 6.

† Best of Frankle & Carbin (2019); Renda et al. (2020); Su et al. (2020), obtained from Wang et al. (2020) and Su et al. (2020).

‡ Best of Mostafa & Wang (2019); Mocanu et al. (2018); Bellec et al. (2018), obtained from Wang et al. (2020).

# Best of Lee et al. (2018); Wang et al. (2020); Su et al. (2020), obtained from Wang et al. (2020) and Su et al. (2020).

★ Our method or our reproduction; results averaged over three random trials. See Table 7 for results using the Tucker decomposition, which generally performs worse than Tensor-Train at the same compression.

† Regular weight-decay with λ = 0.005.

‡ Regular weight-decay with coefficient scaled by the compression rate (Gray et al., 2019).

On the other hand, the low-rank approach is substantially worse for VGG; nevertheless, ResNet is both smaller and more accurate, so if the goal is an accurate compressed model learned from scratch then one should prefer the low-rank approach. Our results demonstrate a strong, simple baseline not frequently compared to in the low-memory training literature (Frankle & Carbin, 2019; Wang et al., 2020; Su et al., 2020). In fact, since it preserves the top singular components of the original weights, SI can itself be considered a type of (spectral) magnitude pruning. Finally, Table 1 highlights the complementary nature of SI and FD, which together outperform regular low-rank training on both models, although interestingly they consistently decrease ResNet performance when used separately.

Matrix and Tensor Decomposition:

We next compare against tensor decompositions, another common approach to small model training (Kossaifi et al., 2020a;b). A recent evaluation of tensor and other related approaches by Gray et al. (2019) found that the Tensor-Train decomposition (Oseledets, 2011) obtained the best memory-accuracy trade-off on WideResNet; we thus compare directly to this approach. Note that, while we normalize according to memory, Tensor-Train must be expanded to the full tensor prior to convolution and thus increases the required compute, unlike low-rank factorization. In Table 2 we show that at 6.7% of the original parameters the low-rank approach with FD and SI significantly outperforms Tensor-Train. Tensor-Train excels at the highly compressed 1.7% setting, but is greatly improved by leveraging tensor analogs of SI (decomposing a random initialization rather than randomly initializing tensor cores directly) and of FD (penalizing the squared Frobenius norm of the full tensor). We also compare to CRS (Gray et al., 2019), which scales regular weight-decay by the compression rate. It is roughly as beneficial as FD across different evaluations of Tensor-Train, but FD is significantly better for low-rank factorized neural layers.

5. OVERCOMPLETE KNOWLEDGE DISTILLATION

In contrast to compressed model training, in knowledge distillation (KD) we have the capacity to train a large model but want to deploy a small one. We study what we call overcomplete KD, in which a network is over-parameterized in such a way that an equivalent small model can be directly recovered. Similar approaches have been previously studied only with small models (Arora et al., 2018; Guo et al., 2020) or using convolution-specific methods (Ding et al., 2019; Cao et al., 2020). We take a simple factorization approach in which we decompose weights $W \in \mathbb{R}^{m \times n}$ into products of 2 or 3 matrices while increasing the parameter count, either via depth or width. Here we consider three cases: the full (rank) setting where $W = UV^T$ for $U \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times m}$; the deep setting where $W = UMV^T$ for $U, M \in \mathbb{R}^{m \times m}$, $V \in \mathbb{R}^{n \times m}$; and the wide setting where $W = UV^T$ for $U \in \mathbb{R}^{m \times 3m}$, $V \in \mathbb{R}^{n \times 3m}$. As before, we factorize all but the first and last layer and train factorized networks using the same routine as for the base model, except when replacing weight-decay by FD. We do not study SI here, using the default initialization for $U$, $V^T$ and setting $M = I_m$. Table 3 shows the strength of this approach: we can train ResNet32 to beat ResNet56 and ResNet56 to beat ResNet110 while needing about the same number of parameters during training and only half as many at inference. Note that training ResNet56 in the "full" setting is also 1.5x faster than training ResNet110 (see Table 8 for timings). Furthermore, we improve substantially upon DO-Conv (Cao et al., 2020), which obtains much smaller improvements over the unfactorized baseline. In fact, Table 9 suggests our approach compares favorably even with regular KD methods; combined with its simplicity and efficiency, this suggests that our overcomplete method can serve as a baseline in the field. Finally, we also show that Frobenius decay is critical: without it, overcomplete KD performs worse than regular training.
A detailed visualization of this, comparing the CIFAR performance of factorized ResNets across a range of rank-scale settings covering both the current high-rank (distillation) case and the previous low-rank (compression) case, can be found in Figure 2 .
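The recompose-after-training step in the deep overcomplete setting above can be sketched as follows; sizes are illustrative, and as in the text $M$ starts at the identity:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 16, 12
# "Deep" overcomplete setting: W = U M V^T with U, M in R^{m x m}, V in R^{n x m}.
U = rng.standard_normal((m, m))
M = np.eye(m)                        # inner matrix initialized to the identity
V = rng.standard_normal((n, m))

train_params = U.size + M.size + V.size   # paid only during training
W = U @ M @ V.T                           # recompose once after training
infer_params = W.size                     # the deployed model keeps only W
```

The deployed network is exactly the original small architecture, so inference cost is unchanged.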

6. MULTI-HEAD ATTENTION AS FACTORIZED QUADRATIC FORMS

Our final setting is multi-head attention (Transformer) architectures (Vaswani et al., 2017). As discussed in Section 2, the MHA component is already a factorized layer, consisting of an aggregation over the output of two quadratic forms per head: the "Query-Key" (QK) forms passed into the softmax and the "Output-Value" (OV) forms that multiply the resulting attention scores (c.f. Equation 1). Transformers also contain large linear layers, which can likewise be factorized for efficiency.

Improving Transformer Training and Compression:

We start with the original Transformer architecture on the IWSLT-14 translation task (Cettolo et al., 2014) but use an SGD-based training routine as a baseline (Gehring et al., 2017). As there is no weight-decay by default, we first tune both it and FD on the non-factorized model; here FD is applied to the implicitly factorized MHA layers. In Figure 4 we show that this alone yields an improvement: whereas the effect of weight-decay is either negligible or negative, tuning FD does improve the BLEU score. Furthermore, we see that tuning just the OV form in MHA is more robust at higher regularization levels than tuning both OV and QK. We conclude by examining both SI and FD when reducing the number of parameters by (1) factorizing all linear and embedding layers and (2) scaling down the embedding dimension in MHA. In Figure 4 we see that the benefit of Frobenius decay disappears when compressing; on the other hand, SI provides a strong boost under both types of decay, and is in fact necessary for FD to work at all. Note that the major effect here is for the factorized linear layers: we found that SI has minimal effect when applied to MHA, likely because those initializations have already been tuned.

Table 4: Comparison of BERT unsupervised pretraining using LAMB (You et al., 2020) and our Frobenius decay scheme FLAMBé. Evaluation is conducted by finetuning the final model on the SQuAD question-answering task (Rajpurkar et al., 2016). Guided by IWSLT results, FLAMBé is applied only to the Output-Value form in MHA; regular weight-decay is used on all other parameters. Spectral initialization of MHA is used for FLAMBé as well, but we find the effect minimal. Regularization coefficients for both methods were obtained by tuning on the uncompressed model, targeting the unsupervised loss.
FLAMBé for Unsupervised BERT Pre-Training: Lastly, we examine BERT (Devlin et al., 2019) , a large Transformer trained on a massive unsupervised text corpus and evaluated on downstream language tasks. The state-of-the-art training approach is via the LAMB optimizer (You et al., 2020) using weight-decay based on the AdamW algorithm of Loshchilov & Hutter (2019), in which $\lambda\eta$ times each parameter is subtracted from itself; this is equivalent to $\ell_2$-regularization for SGD but not for adaptive methods. We can define a similar Frobenius alternative by subtracting $\lambda\eta$ times the Frobenius gradients $UV^TV = \nabla_U \frac{1}{2}\|UV^T\|_F^2$ and $VU^TU = \nabla_V \frac{1}{2}\|UV^T\|_F^2$ from $U$ and $V$, respectively; when used with the LAMB optimizer we call this method FLAMBé. We see in Table 4 that FLAMBé outperforms the simple FD modification of LAMB and, as with IWSLT, leads to an improvement in downstream task performance without changing model size. For BERT, however, applying FD via FLAMBé also leads to better downstream performance when scaling down the MHA embedding dimension by half. Besides achieving a better compressed model, the success of FLAMBé also shows the potential of new types of decay schemes for adaptive methods.
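The decoupled update above can be sketched in a few lines. This is a minimal numpy illustration, not NVIDIA's or the paper's implementation: `base_update` stands in for whatever optimizer step (LAMB, Adam, plain SGD) is being decoupled from the decay, and only the Frobenius-gradient subtraction is the scheme described in the text.

```python
import numpy as np

def decoupled_frobenius_step(U, V, grad_U, grad_V, eta, lam, base_update):
    """One training step with decoupled Frobenius decay for W = U @ V.T.

    AdamW-style decoupling subtracts lam * eta times the decay direction from
    each parameter; here that direction is the gradient of 0.5*||U V^T||_F^2,
    namely U @ V.T @ V for U and V @ U.T @ U for V, so the decay acts on the
    product matrix rather than on each factor separately.
    """
    U_new = U - base_update(grad_U) - eta * lam * (U @ V.T @ V)
    V_new = V - base_update(grad_V) - eta * lam * (V @ U.T @ U)
    return U_new, V_new
```

With `base_update = lambda g: eta * g` this reduces to SGD with Frobenius decay; substituting the LAMB step recovers the FLAMBé scheme.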

7. CONCLUSION

In this paper we studied the design of training algorithms for deep nets containing factorized layers, demonstrating that two simple specializations of standard initialization and regularization schemes from the unfactorized case lead to strong improvements for model compression, knowledge distillation, and MHA-based architectures. While we largely focused on the case where the unfactorized model uses Gaussian initialization and $\ell_2$-regularization, we believe our work provides guidance for the many cases where other schemes are used to enforce alternative priors or improve optimization. For example, SI as defined can be applied to any random initialization, while our FD results suggest that regularizers such as penalty terms and DropConnect (Wan et al., 2013) should be applied to the product matrix rather than directly to the individual factors. The success of SI and FD for both low-rank and tensor decomposition also suggests that these schemes may be useful for other types of factorized neural layers, such as ACDC (Moczulski et al., 2015) or K-matrices (Dao et al., 2020).

Corollary A.1. Let $f_\theta$ be a neural network with $L - 1$ factorized fully-connected ReLU layers $g_i(U_iV_i^T, x_{i-1}) = \max\{U_iV_i^T x_{i-1}, 0_m\}$ of hidden dimension $m$ and one factorized classification layer $g_L(U_LV_L^T, x_{L-1}) = U_LV_L^T x_{L-1}$. Let $\mathcal{D}$ be a distribution over $B$-bounded classification data $\mathcal{X} \times \mathcal{Y}$ and suppose we have a finite set $S$ of i.i.d. samples from it. If $\mathrm{rank}(U_iV_i^T) \le r$ and $\|U_iV_i^T\|_2 \le \sigma$ for all $i$ then for any $\delta > 0$ we have w.p. $1 - \delta$ that

$$\ell^0_{\mathcal{D}}(f_\theta) \le \ell^\gamma_{\mathrm{Uniform}(S)} + O\left(\sqrt{\frac{B^2 L^3 m\, r \log(Lm) \prod_{i=1}^L \|U_iV_i^T\|_2^2 + \log\frac{L|S|}{\delta}}{\gamma^2 |S|}}\right) \le \ell^\gamma_{\mathrm{Uniform}(S)} + O\left(\sqrt{\frac{B^2 L^3 m\, \sigma^{2L} r \log(Lm) + \log\frac{L|S|}{\delta}}{\gamma^2 |S|}}\right) \qquad (4)$$

Proof. Apply the inequality $\|W_i\|_F^2 / \|W_i\|_2^2 \le \mathrm{rank}(W_i)$ to Neyshabur et al. (2018, Theorem 1) and substitute $W_i = U_iV_i^T$.

Corollary A.2. Let $f_\theta$ be a neural network with $L - 1$ factorized fully-connected ReLU layers $g_i(U_i(\prod_{j=1}^d M_{ij})V_i^T, x_{i-1}) = \max\{U_i(\prod_{j=1}^d M_{ij})V_i^T x_{i-1}, 0_m\}$ of hidden dimension $m$ and one factorized classification layer $g_L(U_L(\prod_{j=1}^d M_{Lj})V_L^T, x_{L-1}) = U_L(\prod_{j=1}^d M_{Lj})V_L^T x_{L-1}$. Let $\mathcal{D}$ be a distribution over $B$-bounded classification data $\mathcal{X} \times \mathcal{Y}$ from which we have a finite set $S$ of i.i.d. samples. If $\|U_i(\prod_{j=1}^d M_{ij})V_i^T\|_2 \le \sigma$ for all $i$ then for any $\delta > 0$ we have w.p. $1 - \delta$ that

$$\ell^0_{\mathcal{D}}(f_\theta) \le \ell^\gamma_{\mathrm{Uniform}(S)} + O\left(\sqrt{\frac{B^2 L^2 m\, \sigma^{2L-2} \log(Lm) \sum_{i=1}^L \|U_i(\prod_{j=1}^d M_{ij})V_i^T\|_F^2 + \log\frac{L|S|}{\delta}}{\gamma^2 |S|}}\right)$$

Proof. Apply the equality $\left(\prod_{i=1}^L \|W_i\|_2^2\right) \sum_{i=1}^L \frac{\|W_i\|_F^2}{\|W_i\|_2^2} = \sum_{i=1}^L \|W_i\|_F^2 \prod_{j \ne i} \|W_j\|_2^2$ to Neyshabur et al. (2018, Theorem 1) and substitute $W_i = U_i(\prod_{j=1}^d M_{ij})V_i^T$.

B PROOF OF CLAIM 3.1

Proof. Let $\rho_t = \|W_t\|_F$ be the Frobenius norm of the composed matrix at time $t$, with $\hat{W}_t = W_t / \rho_t$ and $\hat{w}_t = \mathrm{vec}(\hat{W}_t)$. Applying the update rules for $U_t$ and $V_t$ and using the fact that $\rho_t \nabla_{W_t} = \hat\nabla_t$ yields

$$W_{t+1} = (U_t - \eta \nabla_{U_t})(V_t^T - \eta \nabla_{V_t^T}) = (U_t - \eta \nabla_{W_t} V_t)(V_t^T - \eta U_t^T \nabla_{W_t}) = W_t - \frac{\eta}{\rho_t} \hat\nabla_t + \eta^2 \nabla_{W_t} W_t^T \nabla_{W_t}$$

Taking the squared norm of both sides yields $\rho_{t+1}^2 = \rho_t^2 - 2\eta \, \mathrm{Tr}(\hat{W}_t^T \hat\nabla_t) + O(\eta^2)$; we can then take the square root of both sides and use a Taylor expansion to obtain

$$\rho_{t+1} = \rho_t \sqrt{1 - \frac{2\eta}{\rho_t^2} \mathrm{Tr}(\hat{W}_t^T \hat\nabla_t) + O(\eta^2)} = \rho_t - \frac{\eta}{\rho_t} \mathrm{Tr}(\hat{W}_t^T \hat\nabla_t) + O(\eta^2) \qquad (7)$$

Then, starting from Equation 6 divided by $\rho_{t+1}$, substituting Equation 7, and applying a Taylor expansion yields

$$\hat{W}_{t+1} = \frac{\rho_t}{\rho_{t+1}} \hat{W}_t - \frac{\eta}{\rho_t \rho_{t+1}} \hat\nabla_t + O(\eta^2) = \frac{1}{1 - \frac{\eta}{\rho_t^2} \mathrm{Tr}(\hat{W}_t^T \hat\nabla_t)} \hat{W}_t - \frac{\eta}{\rho_t^2 - \eta \, \mathrm{Tr}(\hat{W}_t^T \hat\nabla_t)} \hat\nabla_t + O(\eta^2) = \left(1 + \frac{\eta}{\rho_t^2} \mathrm{Tr}(\hat{W}_t^T \hat\nabla_t)\right) \hat{W}_t - \frac{\eta}{\rho_t^2} \hat\nabla_t + O(\eta^2) = \hat{W}_t - \frac{\eta}{\rho_t^2} \left(\hat\nabla_t - (\hat{w}_t^T \mathrm{vec}(\hat\nabla_t)) \hat{W}_t\right) + O(\eta^2) \qquad (8)$$

Vectorizing and observing that $(\hat{w}_t^T \mathrm{vec}(\hat\nabla_t)) \hat{w}_t = \hat{w}_t \hat{w}_t^T \mathrm{vec}(\hat\nabla_t)$ yields the result.
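The factored-update expansion at the start of the proof of Claim 3.1 can be checked numerically before any of the claim's scale-invariance assumptions are applied: one gradient step on each factor of $W = UV^T$ exactly equals $W - \eta(UU^T\nabla_W + \nabla_W VV^T) + \eta^2 \nabla_W W^T \nabla_W$. In the sketch below, `G` is a random stand-in for the true loss gradient with respect to $W$.

```python
import numpy as np

rng = np.random.default_rng(0)
U = rng.standard_normal((4, 3))
V = rng.standard_normal((5, 3))
G = rng.standard_normal((4, 5))   # stand-in for the gradient w.r.t. W = U V^T
eta = 0.01

W = U @ V.T
grad_U, grad_V = G @ V, G.T @ U   # chain rule through W = U V^T
# one gradient step on each factor, then the closed-form expansion
W_next = (U - eta * grad_U) @ (V - eta * grad_V).T
expansion = W - eta * (U @ U.T @ G + G @ V @ V.T) + eta**2 * G @ W.T @ G
assert np.allclose(W_next, expansion)
```

The identity is exact (no Taylor truncation is needed until the norms $\rho_t$ enter the derivation).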

C EXPERIMENTAL DETAILS FOR TRAINING CONVOLUTIONAL NETWORKS

For experiments with regular ResNets on CIFAR we used code provided here: https://github.com/akamaster/pytorch_resnet_cifar10. All hyperparameter settings are the same, except for initialization and regularization as appropriate; additionally, for stability we use a warmup epoch with a 10-times-smaller learning rate for ResNet56 (this is already done by default for ResNet110). For comparisons with LAMB optimization of BERT, we use an implementation provided by NVIDIA: https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT. All hyperparameter settings are the same, except initialization and regularization as appropriate. For fine-tuning on SQuAD we apply the same optimization routine to all pre-trained models.

E PAST WORK ON KNOWLEDGE DISTILLATION

We first briefly summarize past work on overcomplete KD. While the use of factorized neural layers for improving training has theoretical roots (Arora et al., 2018; Du & Hu, 2019) , we are aware of two other works focusing on experimental practicality: ExpandNets (Guo et al., 2020) and DO-Conv (Cao et al., 2020) . As the former focuses on small student networks, we compare numerically to the latter, showing a much larger improvement due to distillation for both ResNet56 and ResNet110 on both CIFAR-10 and CIFAR-100 in Table 3. We are also aware of one related method, ACNet (Ding et al., 2019), which likewise trains an over-parameterized model without increasing expressivity and then collapses it into a smaller "original" model; it passes the input to each layer through differently-shaped kernels that are composed additively. Since it is unclear how to express this method as a factorization, and it may not generalize easily to non-convolutional networks, we do not view it as an overcomplete KD approach. We now conduct a brief comparison with both these works and more standard approaches to KD. Direct comparison with past work is complicated by the wide variety of training routines, teacher models, student models, and evaluation procedures employed by the community; comparing our specific approach is further complicated by the fact that we have no teacher network, or at least not one that is a standard model used in computer vision. Nevertheless, in Table 9 we collect an array of existing results that can plausibly be compared to our own overcomplete distillation of ResNets. Even here, absolute numbers vary significantly, so we focus on changes in accuracy. As can be seen from the results, our overcomplete approach yields the largest improvements for ResNet56 and ResNet110 on both CIFAR-10 and CIFAR-100.
For ResNet32, the Snapshot Distillation method of Xu & Liu (2019) outperforms our own, although it does not do so for ResNet110 and is not evaluated for ResNet56. On CIFAR-10 the additive ACNet approach also has a larger performance improvement for ResNet32. Nevertheless, our method is still fairly close in these cases, so the results in Table 9 together with the simplicity and short (single-stage) distillation routine of our overcomplete approach suggest that it should be a standard baseline for KD. 



Figure 2: Comparison of weight-decay and FD at different regularization levels (left) and different rank-scales (center and right) when training factorized ResNets on CIFAR.

Figure 3: Traces depicting training of low-rank ResNet-20 with different decay settings; "no decay (normalized)" normalizes matrix factors after each step to have the same norm as "Frobenius decay." The effective step size (right) is the average over convolution layers $i$ of $\eta / \|U_iV_i^T\|_F^2$.

Figure 4: Transformer performance on IWSLT-14 as a function of regularization (top) and compression (bottom).

(LeCun et al., 1990; Zeng & Urtasun, 2019), obtained from Wang et al. (2020).

Comparison of low-rank and tensor-decomposition training of WideResNet28-10 on CIFAR-10 (mean of 3 trials) with different regularization and initialization. Best errors bolded.

Overcomplete ResNet performance (mean of 3 trials). Best at each depth is bolded; cases where we match the next deeper network with around the same training memory are underlined. Accuracies from Cao et al. (2020) are averages over the last five epochs across five training runs.

For comparisons with sparse model training of ResNet32x2 and VGG19 NFC we use code by Wang et al. (2019, https://github.com/alecwangcq/EigenDamage-Pytorch), which is closely related to that of the lottery-ticket-guessing paper by Wang et al. (2020). All hyperparameter settings are the same, except (1) initialization and regularization as appropriate and (2) for Tiny-ImageNet we only train for 200 epochs instead of 300.

For comparisons with tensor decomposition training of WideResNet we use code by Gray et al. (2019, https://github.com/BayesWatch/deficient-efficient). All hyperparameter settings are the same, except initialization and regularization as appropriate.

D EXPERIMENTAL DETAILS FOR TRAINING TRANSFORMER MODELS

For experiments with Transformer models on machine translation we used code provided here: https://github.com/StillKeepTry/Transformer-PyTorch. All hyperparameter settings are the same, except initialization and regularization as appropriate.

Pruning, sparse training, and low-rank training for VGG-19 NFC, i.e. the model of Simonyan & Zisserman (2015) with no fully-connected layers, as in Wang et al. (2020).

Pruning, sparse training, and low-rank training for ResNet-32x2, i.e. the model of He et al. (2016) with twice the number of filters, as in Wang et al. (2020).

SNIP (Lee et al., 2018)             56.33±0.24   55.43±0.14   49.57±0.44
GraSP* (Wang et al., 2020)          57.25±0.11   55.53±0.11   51.34±0.29
Random Tickets† (Su et al., 2020)   N/A          55.26±0.22   51.41±0.38
Low-Rank                            60.82±0.72   58.72±0.53   55.39±1.02
Low-Rank (SI)                       59.53±1.41   57.60±0.88   55.00±0.90
Low-Rank (FD)                       59.45±0.82   57.24±0.61   54.15±1.22
Low-Rank (SI & FD)                  62.24±0.39   60.25±1.02   55.97±0.48

* Obtained from Wang et al. (2020).

Overcomplete ResNet performance (mean of 3 trials). The best accuracy at each depth is bolded; cases where we match the next deeper network with around the same training memory are underlined. Note that training times are reported for roughly similar machine types; all ResNet56 and ResNet110 times are reported on identical machine types.

Comparison of our knowledge distillation approach with past work in which the same student network attains roughly similar performance.

A GENERALIZATION ERROR OF FACTORIZED LAYERS

In this section we briefly discuss how to apply the generalization bounds of Neyshabur et al. (2018) to factorized models.

Definition A.1. For any $\gamma > 0$ and distribution $\mathcal{D}$ over classification data $\mathcal{X} \times \mathcal{Y}$, the $\gamma$-margin-loss of a model $f_\theta : \mathcal{X} \mapsto \mathcal{Y}$ is defined as $\ell^\gamma_{\mathcal{D}}(f_\theta) = \mathbb{P}_{(x,y) \sim \mathcal{D}}\big(f_\theta(x)[y] \le \gamma + \max_{y' \ne y} f_\theta(x)[y']\big)$.

Note that $\ell^0_{\mathcal{D}}$ is the expected classification loss over the distribution and $\ell^\gamma_{\mathrm{Uniform}(S)}$ is the empirical $\gamma$-margin-loss. We have the following corollaries.

Published as a conference paper at ICLR 2021.
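The empirical margin loss of Definition A.1 is straightforward to compute from model scores; a minimal sketch (the function name is ours), where `logits` holds $f_\theta(x)$ row-wise:

```python
import numpy as np

def margin_loss(logits, labels, gamma):
    """Empirical gamma-margin-loss: fraction of samples whose correct-class
    score exceeds the best wrong-class score by no more than gamma.
    At gamma = 0 this reduces to the ordinary classification error."""
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf      # exclude the true class
    runner_up = masked.max(axis=1)              # best wrong-class score
    return np.mean(correct <= gamma + runner_up)
```

The corollaries bound the population quantity $\ell^0_{\mathcal{D}}$ in terms of this empirical $\gamma$-margin-loss over $S$.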

