INITIALIZATION AND REGULARIZATION OF FACTORIZED NEURAL LAYERS

Abstract

Factorized layers, i.e. operations parameterized by products of two or more matrices, occur in a variety of deep learning contexts, including compressed model training, certain types of knowledge distillation, and multi-head self-attention architectures. We study how to initialize and regularize deep nets containing such layers, examining two simple, understudied schemes, spectral initialization and Frobenius decay, for improving their performance. The guiding insight is to design optimization routines for these networks that are as close as possible to those of their well-tuned, non-decomposed counterparts; we back this intuition with an analysis of how the initialization and regularization schemes impact training with gradient descent, drawing on modern attempts to understand the interplay of weight-decay and batch-normalization. Empirically, we highlight the benefits of spectral initialization and Frobenius decay across a variety of settings. In model compression, we show that they enable low-rank methods to significantly outperform both unstructured sparsity and tensor methods on the task of training low-memory residual networks; analogs of the schemes also improve the performance of tensor decomposition techniques. For knowledge distillation, Frobenius decay enables a simple, overcomplete baseline that yields a compact model from over-parameterized training without requiring retraining with, or pruning of, a teacher network. Finally, we show how both schemes applied to multi-head attention lead to improved performance on both translation and unsupervised pre-training.

1. INTRODUCTION

Most neural network layers consist of matrix-parameterized functions followed by simple operations such as activation or normalization. These layers are the main sources of model expressivity, but also the biggest contributors to computation and memory cost; modifying them to improve efficiency while maintaining accuracy is thus highly desirable. We study the approach of factorizing layers, i.e. reparameterizing them so that their weights are defined as products of two or more matrices. When the factors are smaller than the original matrix, the resulting networks are cheaper for both training and inference (Denil et al., 2013; Moczulski et al., 2015; Ioannou et al., 2016; Tai et al., 2016), yielding model compression. On the other hand, if training cost is not a concern, one can increase the width or depth of the factors to over-parameterize models (Guo et al., 2020; Cao et al., 2020), improving learning without increasing inference-time cost. This can be seen as a simple, teacher-free form of knowledge distillation. Factorized layers also arise implicitly, as in the case of multi-head attention (MHA) (Vaswani et al., 2017). Despite such appealing properties, networks with factorized neural layers are non-trivial to train from scratch, requiring custom initialization, regularization, and optimization schemes. In this paper we focus on initialization and regularization, and on how they interact with gradient-based optimization of factorized layers. We first study spectral initialization (SI), which initializes factors using the singular value decomposition (SVD) so that their product approximates the target un-factorized matrix. We then study Frobenius decay (FD), which regularizes the product of the matrices in a factorized layer rather than its individual factors. Both are motivated by matching the training regimen of the analogous un-factorized optimization.
Note that SI has been previously considered in the context of model compression, albeit usually for factorizing pre-trained models (Nakkiran et al., 2015; Yaguchi et al., 2019; Yang et al., 2020) rather than as a low-rank initialization for end-to-end training; FD has been used in model compression with an uncompressed teacher (Idelbayev & Carreira-Perpiñán, 2020). We formalize and study the justifications of SI and FD both from the classical perspective of matching the un-factorized objective and its scaling, and in the presence of BatchNorm (Ioffe & Szegedy, 2015), where this reasoning does not apply. Extending recent studies of weight-decay (Zhang et al., 2019), we argue that the effective step-size at spectral initialization is controlled by the factorization's Frobenius norm, and we present evidence that weight-decay penalizes the nuclear norm. We then turn to applications, starting with low-memory training, which is dominated by unstructured sparsity methods, i.e. guessing "lottery tickets" (Frankle & Carbin, 2019), with a prevailing trend of viewing low-rank methods as uncompetitive for compression (Blalock et al., 2020; Zhang et al., 2020; Idelbayev & Carreira-Perpiñán, 2020; Su et al., 2020). Here we show that, without tuning, factorized neural layers outperform all unstructured sparsity methods on ResNet architectures (He et al., 2016), despite lagging on VGG (Simonyan & Zisserman, 2015). Through ablations, we show that this result is due to using both SI and FD on the factorized layers. We further compare to a recent evaluation of tensor-decomposition approaches for compressed WideResNet training (Zagoruyko & Komodakis, 2016; Gray et al., 2019), showing that (a) low-rank approaches with SI and FD can outperform them and (b) the tensor methods are themselves helped by tensor-variants of SI and FD.
We also study a fledgling subfield we term overcomplete knowledge distillation (Arora et al., 2018; Guo et al., 2020; Cao et al., 2020), in which model weights are over-parameterized as overcomplete factorizations; after training, the factors are multiplied together to obtain a compact representation of the same network. We show that FD leads to significant improvements: for example, we outperform ResNet110 with an overcomplete ResNet56 that takes 1.5x less time to train and has 2x fewer parameters at test-time. Finally, we study Transformer architectures, starting by showing that FD improves translation performance when applied to MHA. We also show that SI is critical for low-rank training of the model's linear layers. In an application to BERT pre-training (Devlin et al., 2019), we construct a Frobenius-regularized variant of the LAMB method (You et al., 2020), which we call FLAMBé, and show that, much like for translation, it improves performance for both full-rank and low-rank MHA layers. To summarize, our main contributions are (1) motivating the study of training factorized layers via both the usual setting (model compression) and recent applications (distillation, multi-head attention), (2) justifying the use of SI and FD mathematically and experimentally, and (3) demonstrating their effectiveness by providing strong baselines and novel advances in many settings. Code to reproduce our results is available here: https://github.com/microsoft/fnl_paper.
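The key property exploited by overcomplete knowledge distillation is that the overparameterization disappears at test time. A minimal sketch (the shapes and inner dimension below are illustrative): the layer is trained with an inner dimension k larger than both m and n, and the factors are multiplied once after training, so inference cost is that of an ordinary m-by-n layer:

```python
import numpy as np

# Overcomplete factorization (shapes are illustrative): inner dimension
# k exceeds both m and n, so U @ V is over-parameterized during training
# but collapses to a single m x n matrix afterwards.
m, n, k = 32, 32, 128
rng = np.random.default_rng(0)
U = rng.standard_normal((m, k)) / np.sqrt(k)
V = rng.standard_normal((k, n)) / np.sqrt(n)

W_test = U @ V            # compact weight used at inference
x = rng.standard_normal(n)
y_factored = U @ (V @ x)  # what the over-parameterized model computes
y_compact = W_test @ x    # identical output from the collapsed weight
```

The collapsed network computes exactly the same function as the overcomplete one, which is why no retraining or pruning step is needed to obtain the compact model.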

1.1. RELATED WORK

We are not the first to study gradient descent on factorized layers; in particular, deep linear networks are well-studied in theory (Saxe et al., 2014; Gunasekar et al., 2019). Apart from Bernacchia et al. (2018), these works largely examine existing algorithms, although Arora et al. (2018) do effectively propose overcomplete knowledge distillation. Rather than the descent method itself, we focus on initialization and regularization. For the former, several papers use SI after training (Nakkiran et al., 2015; Yaguchi et al., 2019; Yang et al., 2020), while Ioannou et al. (2016) argue for initializing factors as though they were single layers, which we find inferior to SI in some cases. Outside deep learning, spectral methods have also been shown to yield better initializations for certain matrix and tensor problems (Keshavan et al., 2010; Chi et al., 2019; Cai et al., 2019). For regularization, Gray et al. (2019) suggest compression-rate scaling (CRS), which scales weight-decay using the reduction in parameter count; this is justified via the usual Bayesian understanding of $\ell_2$-regularization (Murphy, 2012). However, we find that FD is superior to any tuning of regular weight-decay, which subsumes CRS. Our own analysis is based on recent work suggesting that the function of weight-decay is to aid optimization by preventing the effective step-size from becoming too small (Zhang et al., 2019).

2. PRELIMINARIES ON FACTORIZED NEURAL LAYERS

In the training phase of (self-)supervised ML, we often solve optimization problems of the form $\min_{\theta \in \Theta} \frac{1}{|S|} \sum_{(x,y) \in S} \ell(f_\theta(x), y) + \Omega(\theta)$, where $f_\theta : \mathcal{X} \mapsto \mathcal{Y}$ is a function from input domain $\mathcal{X}$ to output domain $\mathcal{Y}$ parameterized by elements $\theta \in \Theta$, $\ell : \mathcal{Y} \times \mathcal{Y} \mapsto \mathbb{R}$ is a scalar-valued loss function, $\Omega : \Theta \mapsto \mathbb{R}$ is a scalar-valued regularizer, and $S \subset \mathcal{X} \times \mathcal{Y}$ is a finite set of (self-)supervised training examples. We study the setting where $f_\theta$ is a neural network: an $L$-layer function whose parameters $\theta$ consist of $L$ matrices $W_i \in \mathbb{R}^{m_i \times n_i}$ and whose output $f_\theta(x)$ given input $x$ is defined recursively using $L$ functions $g_i$ via the formula $x_i = g_i(W_i, x_{i-1})$, with $x_0 = x$ and $f_\theta(x) = x_L$.
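The recursion $x_i = g_i(W_i, x_{i-1})$ can be sketched directly; in the snippet below the particular layer maps (a ReLU layer and a plain linear layer) and the shapes are illustrative choices, since the text leaves each $g_i$ abstract:

```python
import numpy as np

def forward(weights, gs, x):
    """Compute f_theta(x) = x_L via x_i = g_i(W_i, x_{i-1}), x_0 = x."""
    for W, g in zip(weights, gs):
        x = g(W, x)
    return x

# Illustrative layer maps g_i (assumed, not specified by the text).
relu_layer = lambda W, x: np.maximum(W @ x, 0.0)
linear_layer = lambda W, x: W @ x

rng = np.random.default_rng(0)
weights = [rng.standard_normal((16, 8)),   # W_1 in R^{16 x 8}
           rng.standard_normal((4, 16))]   # W_2 in R^{4 x 16}
out = forward(weights, [relu_layer, linear_layer], rng.standard_normal(8))
```

Factorizing the network then amounts to replacing each $W_i$ in `weights` by a product of matrices while leaving the maps $g_i$ unchanged.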

