ELRT: EFFICIENT LOW-RANK TRAINING FOR COMPACT CONVOLUTIONAL NEURAL NETWORKS

Abstract

Low-rank compression, a popular model compression technique that produces compact convolutional neural networks (CNNs) with low-rankness, has been well studied in the literature. On the other hand, low-rank training, an alternative way to train low-rank CNNs from scratch, has been little explored. Unlike low-rank compression, low-rank training does not need pre-trained full-rank models, and the entire training phase is performed on the low-rank structure, bringing attractive benefits for practical applications. However, existing low-rank training solutions still face several challenges, such as a considerable accuracy drop and/or the need to update full-size models during training. In this paper, we perform a systematic investigation of low-rank CNN training. By identifying the proper low-rank format and performance-improving strategy, we propose ELRT, an efficient low-rank training solution for high-accuracy, high-compactness low-rank CNN models. Our extensive evaluation results for training various CNNs on different datasets demonstrate the effectiveness of ELRT.

1. INTRODUCTION

Convolutional neural networks (CNNs) have seen widespread adoption in numerous real-world computer vision applications, such as image classification, video recognition and object detection. However, modern CNN models are typically storage-intensive and computation-intensive, potentially hindering their efficient deployment in many resource-constrained scenarios, especially on edge and embedded computing platforms. To address this challenge, many prior efforts have been devoted to producing low-cost compact CNN models. Among them, low-rank compression is a popular model compression solution. By leveraging matrix or tensor decomposition techniques, low-rank compression aims to exploit the potential low-rankness exhibited in full-rank CNN models, enabling simultaneous reductions in both memory footprint and computational cost. To date, numerous low-rank CNN compression solutions have been reported in the literature (Phan et al. (2020); Kossaifi et al. (2020); Li et al. (2021b); Liebenwein et al. (2021)).

Low-rank Training: A Promising Alternative Towards Low-rank CNNs. From the perspective of model production, performing low-rank compression on full-rank networks is not the only approach to obtaining low-rank CNNs. In principle, we can also adopt a low-rank training strategy to directly train a low-rank model from scratch. As illustrated in Fig. 1, low-rank training starts from a low-rank initialization and keeps the desired low-rank structure throughout the entire training phase. Compared with low-rank compression, which is built on a two-stage pipeline ("pre-training-then-compressing"), single-stage low-rank training enjoys two attractive benefits: relaxed operational requirements and reduced training cost. More specifically, first, the underlying training-from-scratch scheme, by its nature, completely eliminates the need for pre-trained full-rank high-accuracy models, thereby lowering the barrier to obtaining low-rank CNNs. In other words, producing low-rank networks becomes more feasible and accessible. Second, the overall computational cost of the entire low-rank CNN production pipeline is significantly reduced. This is because: 1) removing the pre-training phase saves the computations originally needed for pre-training the full-rank models; and 2) directly training the compact low-rank CNNs naturally consumes far fewer floating-point operations (FLOPs) than full-rank pre-training.

Existing Works and Limitations. Despite the benefits analyzed above, low-rank training remains little explored in the literature. Unlike the prosperity of low-rank compression research, to date very few efforts have been directed towards efficient low-rank training. (Ioannou et al. (2015); Tai et al. (2015)) are pioneering works in this direction; however, the obtained low-rank models suffer a considerable accuracy drop. In addition, the corresponding training methods are not evaluated on modern CNNs such as ResNet.
Recently, (Gural et al. (2020); Hawkins et al. (2022)) proposed memory-aware and Bayesian estimation-based low-rank training, respectively; however, these two methods are built on costly repeated SVD operations (Gural et al. (2020)) or Bayesian estimation (Hawkins et al. (2022)), which are computation-intensive and potentially not scalable for practical deployment. (Hayashi et al. (2019); Khodak et al. (2020); Su et al. (2022)) propose to learn a suitable low-rank format and/or apply spectral initialization during training. However, the models trained with the new formats suffer a considerable accuracy drop, even on the CIFAR-10 dataset. (Waleffe & Rekatsinas (2020); Wang et al. (2021)) perform several epochs of full-size training to mitigate this issue. However, at the cost of increased memory and computational overhead, this hybrid-training strategy still incurs a considerable accuracy drop, and it is essentially not training a low-rank model from scratch. Therefore, a satisfactory answer to the following fundamental question is still missing:

Fundamental Question for Low-rank Training: What is the proper training-from-scratch solution that can produce modern low-rank CNN models with high accuracy on large-scale datasets, even outperforming the state-of-the-art low-rank compression methods?

Technical Preview and Contributions. To answer this question and put the low-rank training technique into practice, in this paper we perform a systematic investigation of training low-rank CNN models from scratch. By identifying the proper low-rank format and performance-improving strategy, we propose ELRT, an efficient low-rank training solution for high-accuracy, high-compactness CNN models. Compared with state-of-the-art low-rank compression approaches, ELRT delivers superior performance for various CNNs on different datasets, demonstrating the promising potential of low-rank training in practical applications.
Overall, the contributions of this paper are summarized as follows:
• We systematically investigate the important design knobs of low-rank CNN training from scratch, such as the suitable low-rank format and the potential performance-improving strategy, to understand the key factors in building a high-performance low-rank CNN training framework.
• Based on the study and analysis of these design knobs, we develop ELRT, an orthogonality-aware low-rank training approach that can train high-accuracy, high-compactness low-tensor-rank CNN models from scratch. By imposing the desired orthogonality on the low-rank model during the training process, significant performance improvement can be obtained with low computational overhead.
• We conduct empirical evaluations on various CNN models to demonstrate the effectiveness of ELRT. On the CIFAR-10 dataset, ELRT can train low-rank ResNet-20, ResNet-56 and MobileNetV2 from scratch with 1.98×, 2.05× and 1.71× FLOPs reduction, respectively; meanwhile, the trained compact models enjoy 0.48%, 0.70% and 0.29% accuracy increases over the baselines. On the ImageNet dataset, compared with state-of-the-art approaches that generate compact ResNet-50 models, ELRT achieves 0.49% higher accuracy with the same or even higher inference and training FLOPs reduction.

2. RELATED WORKS

Low-rank Compression. As an important type of model compression strategy, low-rank compression leverages low-rank decomposition techniques to factorize the original full-rank neural network model into a set of small matrices or tensors, leading to storage and computational savings. Based on the adopted factorization methods, existing low-rank CNN compression works can be categorized into 2-D matrix decomposition based (Tai et al. (2015); Li & Shi (2018)

3. PRELIMINARIES

Notation. Throughout this paper, $d$-order tensors, matrices and vectors are represented by boldface calligraphic script letters $\mathcal{X} \in \mathbb{R}^{n_1 \times n_2 \times \cdots \times n_d}$, boldface capital letters $\mathbf{X} \in \mathbb{R}^{n_1 \times n_2}$, and boldface lower-case letters $\mathbf{x} \in \mathbb{R}^{n_1}$, respectively. Also, $\mathcal{X}_{i_1,\cdots,i_d}$ and $\mathbf{X}_{i,j}$ denote the entries of tensor $\mathcal{X}$ and matrix $\mathbf{X}$, respectively.

Tucker-2 Format for Convolutional Layer. As will be analyzed in Section 4, in this paper we choose the low-tensor-rank format (e.g., Tucker-2) to form an efficient low-rank training approach. In general, a convolutional layer $\mathcal{W} \in \mathbb{R}^{C_{in} \times C_{out} \times K \times K}$ can be represented in the Tucker-2 format as:

$$\mathcal{W}_{p,q,i,j} = \sum_{r_1=1}^{\Phi_1} \sum_{r_2=1}^{\Phi_2} \mathcal{G}_{r_1,r_2,i,j}\, \mathbf{U}^{(1)}_{r_1,p}\, \mathbf{U}^{(2)}_{r_2,q},$$

where $\mathcal{G}$ is the 4-D core tensor of size $\Phi_1 \times \Phi_2 \times K \times K$, and $\mathbf{U}^{(1)} \in \mathbb{R}^{\Phi_1 \times C_{in}}$ and $\mathbf{U}^{(2)} \in \mathbb{R}^{\Phi_2 \times C_{out}}$ are the factor matrices. In addition, $\Phi_1$ and $\Phi_2$ are the tensor ranks that determine the complexity of the compact convolutional layer.

Computation on the Tucker-2 Convolutional Layer. Given the Tucker-2 representation described above, the execution of this compact convolutional layer can be performed via consecutive computations as:

$$\mathcal{T}^{1}_{r_1,h,w} = \sum_{p=1}^{C_{in}} \mathbf{U}^{(1)}_{r_1,p} \mathcal{X}_{p,h,w}, \quad \mathcal{T}^{2}_{r_2,h',w'} = \sum_{i=1}^{K} \sum_{j=1}^{K} \sum_{r_1=1}^{\Phi_1} \mathcal{G}_{r_1,r_2,i,j} \mathcal{T}^{1}_{r_1,h_i,w_j}, \quad \mathcal{Y}_{q,h',w'} = \sum_{r_2=1}^{\Phi_2} \mathbf{U}^{(2)}_{r_2,q} \mathcal{T}^{2}_{r_2,h',w'},$$

where $\mathcal{X} \in \mathbb{R}^{C_{in} \times H \times W}$ and $\mathcal{Y} \in \mathbb{R}^{C_{out} \times H' \times W'}$ are the input and output tensors of the convolutional layer, respectively. For the indices of $\mathcal{T}^1$, $h_i = \text{stride} \times (h'-1) + i - \text{padding}$ and $w_j = \text{stride} \times (w'-1) + j - \text{padding}$. In addition, $\mathcal{T}^1$ and $\mathcal{T}^2$ are intermediate results.
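As a sanity check of the consecutive-computation scheme above, the following numpy sketch (our illustration, not the paper's implementation; stride 1 and zero padding are assumed for simplicity) contracts an input with the Tucker-2 factors in three steps and provides a helper that reconstructs the full weight tensor $\mathcal{W}$ for comparison:

```python
import numpy as np

def tucker2_conv(X, U1, G, U2):
    """Tucker-2 convolution as three consecutive contractions (stride 1, no padding).

    X  : (C_in, H, W)        input feature map
    U1 : (Phi1, C_in)        factor matrix U^(1), acting as a 1x1 convolution
    G  : (Phi1, Phi2, K, K)  core tensor, acting as a KxK convolution
    U2 : (Phi2, C_out)       factor matrix U^(2), acting as a 1x1 convolution
    """
    Phi1, Phi2, K, _ = G.shape
    _, H, W = X.shape
    Hp, Wp = H - K + 1, W - K + 1
    # T1[r1, h, w] = sum_p U1[r1, p] * X[p, h, w]
    T1 = np.einsum('rp,phw->rhw', U1, X)
    # T2[r2, h', w'] = sum_{i, j, r1} G[r1, r2, i, j] * T1[r1, h'+i, w'+j]
    T2 = np.zeros((Phi2, Hp, Wp))
    for i in range(K):
        for j in range(K):
            T2 += np.einsum('rs,rhw->shw', G[:, :, i, j], T1[:, i:i + Hp, j:j + Wp])
    # Y[q, h', w'] = sum_{r2} U2[r2, q] * T2[r2, h', w']
    return np.einsum('rq,rhw->qhw', U2, T2)

def reconstruct_weight(U1, G, U2):
    """W[p, q, i, j] = sum_{r1, r2} G[r1, r2, i, j] * U1[r1, p] * U2[r2, q]."""
    return np.einsum('abij,ap,bq->pqij', G, U1, U2)
```

Running `tucker2_conv` on random factors and comparing against a direct convolution with `reconstruct_weight(U1, G, U2)` confirms that the three contractions compute exactly the same output while never materializing the full 4-D weight.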

4. PROPOSED LOW-RANK TRAINING SOLUTION

As outlined in Section 1, efficient training of high-accuracy low-rank CNN models from scratch is still largely under-explored. In this section we systematically explore several important design knobs and factors in training low-rank CNN models from scratch. Based on the outcomes of these analytic and empirical studies, we then develop an efficient solution for low-rank CNN training from scratch.

Questions to be Answered. To be specific, in order to better understand low-rank training and improve its performance, we explore the answers to the following three questions.

Question 1: Which type of low-rank format is more suitable for efficient training from scratch, 2-D matrix or high-order tensor?

Analysis. In general, when training a compact CNN model from scratch, two types of low-rank formats can be considered. The 4-D weight tensor of a trained convolutional layer can be either in the format of a low-matrix-rank 2-D matrix or a high-order low-tensor-rank tensor. As illustrated in Fig. 2, low-matrix-rankness means the flattened and matricized 4-D weight tensor exhibits low-rankness, while low-tensor-rankness means the trained convolutional layer can be directly represented and constructed via multiple small-size factorized matrices/tensors without any flattening operations. Our rationale for adopting the tensor format is that, unlike the low-rank matrix format, which may lose the important spatial weight correlation due to the inevitable flattening operation, the low-rank tensor format is a more natural way to represent the 4-D weight tensor of a convolutional layer; therefore it can better extract and preserve the weight information and correlation existing in the 4-D space (Liu et al. (2012)). For instance, as illustrated in Fig. 3, when representing the same weight tensors of layers in the ResNet-20 model, the low-rank Tucker format enjoys a smaller approximation error than the low-rank matrix format with the same number of weight parameters. Encouraged by this better representation capability for high-order weight tensors, we choose to train low-rank tensor-format CNNs from scratch.
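To make the Fig. 3 comparison concrete, here is a small numpy sketch of the underlying methodology (our illustration; the paper's exact matricization and rank choices may differ). It approximates a 4-D weight tensor in two ways with roughly matched parameter counts: a Tucker-2 truncation via HOSVD on the two channel modes, and a truncated SVD of the flattened matrix:

```python
import numpy as np

def tucker2_approx(W, p1, p2):
    """Tucker-2 approximation of W (C_in, C_out, K, K) via truncated HOSVD on modes 0 and 1.

    Note: here U1/U2 hold orthonormal columns (C_in x p1, C_out x p2), i.e. the
    transpose of the paper's U^(1), U^(2) orientation.
    """
    C_in, C_out, K, _ = W.shape
    M0 = W.reshape(C_in, -1)                                  # mode-0 unfolding
    U1 = np.linalg.svd(M0, full_matrices=False)[0][:, :p1]
    M1 = np.transpose(W, (1, 0, 2, 3)).reshape(C_out, -1)     # mode-1 unfolding
    U2 = np.linalg.svd(M1, full_matrices=False)[0][:, :p2]
    # core G = W x_0 U1^T x_1 U2^T; approximation = G x_0 U1 x_1 U2
    G = np.einsum('pqij,pa,qb->abij', W, U1, U2)
    W_hat = np.einsum('abij,pa,qb->pqij', G, U1, U2)
    return W_hat, U1.size + U2.size + G.size

def matrix_approx(W, r):
    """Rank-r approximation of the (C_in*K*K, C_out) matricization of W."""
    C_in, C_out, K, _ = W.shape
    M = np.transpose(W, (0, 2, 3, 1)).reshape(C_in * K * K, C_out)
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    M_hat = (U[:, :r] * s[:r]) @ Vt[:r]
    W_hat = M_hat.reshape(C_in, K, K, C_out).transpose(0, 3, 1, 2)
    return W_hat, r * (C_in * K * K + C_out)

def mse(A, B):
    return float(np.mean((A - B) ** 2))

# Hypothetical comparison on a random 8x16x3x3 tensor with similar budgets.
rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16, 3, 3))
W_t, n_t = tucker2_approx(W, 4, 8)   # 8*4 + 16*8 + 4*8*9 = 448 parameters
W_m, n_m = matrix_approx(W, 5)       # 5*(72 + 16)        = 440 parameters
print(n_t, mse(W, W_t))
print(n_m, mse(W, W_m))
```

On a random tensor the two errors are of similar magnitude; the paper's Fig. 3 reports that on real ResNet-20 weights, which carry spatial correlation, the Tucker format achieves the smaller error at equal parameter count.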

Question 2: Considering that low-rankness implies potentially low model capacity, what is the proper strategy to improve the performance of low-rank training?

Analysis. As analyzed in Section 1, a key challenge for existing low-rank training works (Tai et al. (2015); Ioannou et al. (2015)) is their inferior model accuracy compared to model compression approaches. We hypothesize that this phenomenon has two causes: 1) low-rank training starts from a low-accuracy random initialization with low-rank constraints, while model compression is built on a pre-trained high-accuracy model; and 2) the entire procedure of low-rank training is constrained to the low-rank space, thereby limiting the room for model capability to grow. Note that a possible remedy is to reconstruct and train a full-rank model for some epochs and then decompose the model back into the low-rank format during training. However, such a "train-decompose-train" scheme is very costly, since it performs computation-intensive tensor decomposition many times.

Our Proposal. To improve the performance of low-rank training while preserving low computational and memory costs, we propose to perform orthogonality-aware low-tensor-rank training. Our key idea is to impose and enforce orthogonality on the factor matrices $\mathbf{U}^{(1)}$ and $\mathbf{U}^{(2)}$ during the entire training process. This training philosophy rests on the following rationale: it is well known that, in principle, using an orthogonal basis maximizes the capacity of information representation. For instance, when approximating a full-rank matrix with a low-rank SVD factorization, the decomposed $\mathbf{U}$ and $\mathbf{V}$ always exhibit self-orthogonality (as unitary matrices) with the smallest approximation error. A similar phenomenon also exists for tensor decomposition, where the factor matrices $\mathbf{U}^{(1)}$ and $\mathbf{U}^{(2)}$ are also unitary matrices after performing Tucker decomposition on the original full-rank tensor.
Inspired by the above observations, when we aim to approach a high-capacity full-rank model using the low-rank format, the orthogonality-based representation is preferable.

Question 3: How should we properly impose orthogonality during low-tensor-rank training?

Analysis. The preliminary experiment in Fig. 4 shows the encouraging benefits of orthogonality-aware low-tensor-rank training. To fully unlock its promising potential, an efficient scheme for imposing the desired orthogonality should be properly explored. To be specific, there are four commonly used regularization approaches (Bansal et al. (2018)) that can introduce orthogonality on the target factor matrices:

Soft Orthogonality (SO) Regularization: $R_s(\mathbf{A}) = \frac{\rho}{\Phi^2} \|\mathbf{A}^T\mathbf{A} - \mathbf{I}\|_F^2$, where $\|\cdot\|_F$ is the Frobenius norm, $\rho$ is the regularization strength, $\mathbf{A}$ is the matrix on which orthogonality is enforced, and $\Phi$ is the rank of $\mathbf{A}$. Note that the experiment in Fig. 4 adopts this approach.

Double Soft Orthogonality (DSO) Regularization: $R_d(\mathbf{A}) = \frac{\rho}{\Phi^2} (\|\mathbf{A}^T\mathbf{A} - \mathbf{I}\|_F^2 + \|\mathbf{A}\mathbf{A}^T - \mathbf{I}\|_F^2)$. Compared with SO regularization, the DSO scheme can always impose the proper orthogonality on $\mathbf{A}$, no matter whether it is over-complete or under-complete.

Mutual Coherence (MC) Regularization: $R_{mc}(\mathbf{A}) = \rho \|\mathbf{A}^T\mathbf{A} - \mathbf{I}\|_\infty$, where $\|\cdot\|_\infty$ is the matrix norm induced by the $\ell_\infty$-norm.

Spectral Restricted Isometry Property (SRIP) Regularization: $R_{sp}(\mathbf{A}) = \rho \cdot \sigma(\mathbf{A}^T\mathbf{A} - \mathbf{I})$, where $\sigma(\mathbf{R}) = \sup_{\mathbf{z} \in \mathbb{R}^n, \mathbf{z} \neq 0} \frac{\|\mathbf{R}\mathbf{z}\|_2}{\|\mathbf{z}\|_2}$ is the spectral norm of $\mathbf{A}^T\mathbf{A} - \mathbf{I}$ (Yoshida & Miyato (2017)), based on the RIP condition (Candes & Tao (2005)).

Our Proposal. We propose to use the DSO regularization scheme to impose the desired orthogonality on the factor matrices. Our main rationale is that DSO regularization can better provide the desired orthogonality. To be specific, as illustrated in Fig. 5, the orthogonality of the factor matrix $\mathbf{U}^{(2)}$ can be measured by its corresponding residual matrix $\mathbf{R} = \mathbf{U}^{(2)T}\mathbf{U}^{(2)} - \mathbf{I}$.
When $\mathbf{R}$ has lower energy, $\mathbf{U}^{(2)}$ is more likely to exhibit self-orthogonality. The essential mechanism of SRIP and MC is to minimize the maximum singular value of $\mathbf{R}$ and the maximum row-sum norm of $\mathbf{R}$, respectively, to push $\mathbf{R}$ close to the all-zero matrix. Evidently, the constraints posed by the SRIP and MC schemes inherently target local components of $\mathbf{R}$, thereby degrading the overall orthogonality effect. On the other hand, the DSO and SO schemes aim to push all singular values of $\mathbf{R}$ to zero in a global way, thereby in principle bringing stronger orthogonality for $\mathbf{U}^{(2)}$. In addition, since $\mathbf{U}^{(1)}$ and $\mathbf{U}^{(2)}$ may not be square matrices in practice, the DSO scheme, which simultaneously considers the potential under-completeness and over-completeness of a matrix, is a more general and better solution than the SO scheme. To verify this hypothesis, we also perform an ablation study of low-tensor-rank training using different orthogonal regularization schemes. As reported in Appendix B, the DSO scheme shows consistently better performance than the other schemes, and hence it is adopted in our proposed low-rank training procedure.

Algorithm 1: Overall ELRT training procedure
Input: Dataset $\mathcal{D}$, pre-set ranks $\{\Phi\}$, Tucker-2-format weights in each layer $\mathbf{U}^{(1)} \in \mathbb{R}^{\Phi_1 \times C_{in}}$, $\mathcal{G} \in \mathbb{R}^{\Phi_1 \times \Phi_2 \times K \times K}$, $\mathbf{U}^{(2)} \in \mathbb{R}^{\Phi_2 \times C_{out}}$, orthogonality parameters $\rho$, $\lambda_d$, training epochs $T$.
Output: Trained $\{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathcal{G}\}$.
Initialize: xavier_uniform($\{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathcal{G}\}$).
for $t = 1$ to $T$ do
  $\mathcal{X}, \mathcal{Y} \leftarrow$ sample_batch($\mathcal{D}$);
  $\hat{\mathcal{Y}} \leftarrow$ forward($\mathcal{X}, \{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathcal{G}\}$) via Eq. 2;
  ▷ Orthogonal regularization
  $R_d(\mathbf{U}^{(1)}) \leftarrow \frac{\rho}{\Phi_1^2}(\|\mathbf{U}^{(1)T}\mathbf{U}^{(1)} - \mathbf{I}\|_F^2 + \|\mathbf{U}^{(1)}\mathbf{U}^{(1)T} - \mathbf{I}\|_F^2)$;
  $R_d(\mathbf{U}^{(2)}) \leftarrow \frac{\rho}{\Phi_2^2}(\|\mathbf{U}^{(2)T}\mathbf{U}^{(2)} - \mathbf{I}\|_F^2 + \|\mathbf{U}^{(2)}\mathbf{U}^{(2)T} - \mathbf{I}\|_F^2)$;
  loss $\leftarrow \mathcal{L}(\mathcal{Y}, \hat{\mathcal{Y}}) + \lambda_d(R_d(\mathbf{U}^{(1)}) + R_d(\mathbf{U}^{(2)}))$;
  update($\{\mathbf{U}^{(1)}, \mathbf{U}^{(2)}, \mathcal{G}\}$, loss);
end

Overall Training Procedure.
Based on the above analysis and proposals, we develop the corresponding efficient low-rank training (ELRT) algorithm to train high-performance, high-compactness low-tensor-rank CNN models from scratch. Algorithm 1 describes the details.
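The DSO penalty and its effect can be sketched in a few lines of numpy (our illustration under assumed shapes, not the paper's implementation; $\Phi$ is taken as min(A.shape), and plain gradient descent stands in for the SGD update in Algorithm 1):

```python
import numpy as np

def dso(A, rho=1.0):
    """DSO penalty R_d(A) = (rho / Phi^2) * (||A^T A - I||_F^2 + ||A A^T - I||_F^2)."""
    r, c = A.shape
    phi = min(r, c)  # rank of the factor matrix
    t1 = np.linalg.norm(A.T @ A - np.eye(c), 'fro') ** 2
    t2 = np.linalg.norm(A @ A.T - np.eye(r), 'fro') ** 2
    return rho / phi ** 2 * (t1 + t2)

def dso_grad(A, rho=1.0):
    """Closed-form gradient: d/dA ||A^T A - I||_F^2 = 4 A (A^T A - I),
    and d/dA ||A A^T - I||_F^2 = 4 (A A^T - I) A."""
    r, c = A.shape
    phi = min(r, c)
    return rho / phi ** 2 * (4 * A @ (A.T @ A - np.eye(c))
                             + 4 * (A @ A.T - np.eye(r)) @ A)

# A few plain gradient steps drive a random factor matrix toward
# row-orthonormality, mimicking what the penalty does to U^(1), U^(2)
# inside the full training loss.
rng = np.random.default_rng(0)
A = 0.5 * rng.standard_normal((4, 6))   # e.g. a U^(2) with Phi_2 = 4, C_out = 6
start = dso(A)
for _ in range(2000):
    A -= 2e-2 * dso_grad(A)
assert dso(A) < start                                # penalty decreased
assert np.allclose(A @ A.T, np.eye(4), atol=1e-3)    # rows now near-orthonormal
```

In a real implementation the penalty would be added to the task loss (scaled by $\lambda_d$) and differentiated by the framework's autograd, exactly as in the `loss` line of Algorithm 1.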

5. EXPERIMENTS

Datasets and Baselines. We evaluate our approach on the CIFAR-10 and ImageNet datasets with different inference and training FLOPs reduction ratios (e.g., a 2× FLOPs reduction ratio refers to 50% FLOPs reduction). For the experiments on CIFAR-10, the performance of ELRT on VGG-16, ResNet-20, ResNet-56 and MobileNetV2 is evaluated and compared with results from compression and structured sparse training methods. For the experiments on ImageNet, we evaluate our approach by training ResNet-50.

Calculation of Inference and Training FLOPs Reduction.

A very important benefit provided by compression-aware training, e.g., low-rank training and sparse training, is simultaneously achieving both "Inference FLOPs Reduction" and "Training FLOPs Reduction". The calculation mechanisms for these two metrics are described in Appendix C.

Hyperparameters. We use the SGD optimizer for training with batch size, momentum and weight decay set to 128, 0.9 and 0.0001, respectively. The learning rates are set to 0.1 on the CIFAR-10 dataset and 0.05 on the ImageNet dataset, with a cosine scheduler. All experiments are performed using PyTorch 1.12 and follow the official PyTorch training strategy. The detailed configurations of the tensor rank values in each layer are reported in Appendix D.

Comparison with Prior Low-rank Training Works. Since some existing low-rank training methods are evaluated on small datasets (MNIST) and/or less commonly used models, we report their comparison with ELRT in Appendix A.3. Also, we compare ELRT with directly training small dense models of the same target model size; the results are reported in Appendix A.4.

Ablation Study. We perform ablation studies on the effect of imposing orthogonality and of different orthogonal regularization schemes; the details are reported in Appendix B.

Practical Speedup of Low-Tensor-Rank CNNs. To demonstrate the practical effectiveness of low-tensor-rank models, we measure the inference time of the models trained using ELRT. Here we evaluate the speedup on four hardware platforms: Nvidia V100 (desktop GPU), Nvidia Jetson TX2 (embedded GPU), Xilinx PYNQ-Z1 (FPGA) and Eyeriss (ASIC). The evaluated models are low-rank ResNet-50 for ImageNet with three target FLOPs reduction settings. As summarized in Table 3, the low-rank CNNs obtained from ELRT enjoy considerable speedup across different hardware platforms.

Similar to (Hayashi et al. (2019)), (Su et al. (2022)) also aims to train low-rank CNNs using new decomposition formats.
However, the model obtained via the method proposed in (Su et al. (2022)) has limited performance. As shown in the following table, for training low-rank ResNet-32 on the CIFAR-10 dataset, ELRT achieves much higher accuracy (by at least 3%) than (Su et al. (2022)) with an even higher compression ratio.

B ABLATION STUDIES

Effect of Imposing Orthogonality. We conduct experiments to study the effect of imposing orthogonality on the factor matrices via DSO regularization. As shown in Fig. 6, our proposed orthogonality-aware low-rank training shows very significant performance improvement over standard low-rank training with the same FLOPs reduction and the same low-rank format, demonstrating the importance of enforcing orthogonality on the components of the low-rank model.

Different Orthogonal Regularization Schemes. We also conduct an ablation study to explore the best-suited scheme for imposing orthogonality. As shown in Table 16, DSO regularization demonstrates consistently better performance than the other schemes. These empirical results also coincide with our analysis in Question 3. Therefore, we adopt the DSO scheme for all our experiments.

C CALCULATION SCHEMES FOR INFERENCE AND TRAINING FLOPS REDUCTION

C.1 CALCULATION OF INFERENCE FLOPS REDUCTION

To calculate the inference FLOPs reduction of one convolutional layer after Tucker-2 decomposition, we adopt the scheme used in (Kim et al. (2015)):

$$E = \frac{D^2 S T H' W'}{S R_1 H' W' + D^2 R_1 R_2 H' W' + T R_2 H' W'},$$

where $D$ is the kernel height and width, $S$ is the number of input channels, $T$ is the number of output channels, $H'$ and $W'$ are the output height and width, and $R_1$, $R_2$ are the ranks of the Tucker decomposition. With the above notation, the overall inference FLOPs reduction can be calculated as $\frac{\sum_{\text{all layers}} A}{\sum_{\text{all layers}} B}$, where $A = D^2 S T H' W'$ is the number of multiply-add operations of a layer in the original model, and $B = S R_1 H' W' + D^2 R_1 R_2 H' W' + T R_2 H' W'$ is the number of multiply-add operations of a layer in the Tucker-decomposed low-rank model.
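The calculation above is simple enough to express directly (a sketch with hypothetical layer dimensions; the channel counts and ranks below are illustrative, not the paper's configurations):

```python
def layer_flops_reduction(D, S, T, Hp, Wp, R1, R2):
    """Per-layer inference FLOPs reduction E for a Tucker-2 layer,
    following the counting scheme of Kim et al. (2015)."""
    dense = D * D * S * T * Hp * Wp
    low_rank = S * R1 * Hp * Wp + D * D * R1 * R2 * Hp * Wp + T * R2 * Hp * Wp
    return dense / low_rank

def model_flops_reduction(layers):
    """Overall reduction: total dense multiply-adds over total low-rank multiply-adds.

    layers: iterable of (D, S, T, Hp, Wp, R1, R2) tuples, one per layer.
    """
    dense = sum(D * D * S * T * Hp * Wp
                for (D, S, T, Hp, Wp, R1, R2) in layers)
    low = sum(S * R1 * Hp * Wp + D * D * R1 * R2 * Hp * Wp + T * R2 * Hp * Wp
              for (D, S, T, Hp, Wp, R1, R2) in layers)
    return dense / low

# Hypothetical example: a 3x3 layer with 64 -> 64 channels on a 32x32 output,
# compressed with ranks (R1, R2) = (16, 16).
print(round(layer_flops_reduction(3, 64, 64, 32, 32, 16, 16), 2))  # → 8.47
```

Note that the whole-model ratio is a FLOPs-weighted aggregate, so it generally differs from the average of per-layer ratios.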

C.2 CALCULATION OF TRAINING FLOPS REDUCTION

As defined in (Zhou et al. (2021); Evci et al. (2020)), "training FLOPs reduction", also referred to as "training-cost saving", is the ratio of the average FLOPs of the dense network over that of the compact network. The total FLOPs of training a network consist of a forward-pass part and a backward-pass part. As indicated in (Evci et al. (2020); Zhou et al. (2021); Baydin et al. (2018)), the FLOPs of backward propagation can be roughly counted as twice those consumed in forward propagation. In the following, we denote the forward-propagation FLOPs of the dense network, the sparse network and the low-rank network as $f_D$, $f_S$ and $f_L$, respectively.

C.2.5 ELRT (OURS).

Recall that ELRT is built on the Tucker-2 decomposition, which converts one convolutional layer into two factor matrices, viewed as 1 × 1 convolutional layers, and one core tensor, viewed as a 3 × 3 convolutional layer. Our low-rank model can therefore be viewed as a compact dense network with more convolutional layers but fewer FLOPs and parameters. Assume the FLOPs of the forward propagation is $f_L$; the FLOPs of the backward propagation is then $2f_L$. The training FLOPs reduction is calculated as $\frac{f_D + 2f_D}{f_L + 2f_L} = \frac{f_D}{f_L}$, which is identical to the inference FLOPs reduction calculated in Eq. 7. In other words, the inference and training FLOPs reductions brought by ELRT are the same. Note that the FLOPs of calculating the orthogonality loss term are ignored here, since this term is computed once per batch, while $f_L$ and $f_D$ are consumed per data sample. Considering that the batch size is typically large (e.g., 128 in our experiments), the FLOPs contribution of the orthogonality loss term is negligible.

From these tables, it can be seen that many adjacent layers share the same rank value, and the entire model only needs a few rank values to be assigned. For instance, for the ResNet-56 model under 2.05× FLOPs reduction on the CIFAR-10 dataset, only three numbers, 12, 18 and 26, are selected as the ranks assigned to all layers, where the first/second/last eighteen layers share the rank value 12/18/26, respectively. Such a rank-sharing phenomenon significantly simplifies the rank selection process.

…conv2 (12, 12)          layer2.1.conv2 (14, 14)   layer3.1.conv1 (28, 28)
layer1.2.conv1 (12, 12)  layer2.2.conv1 (16, 16)   layer3.2.conv1 (28, 28)
layer1.2.conv2 (12, 12)  layer2.2.conv2 (16, 16)   layer3.2.conv2 (28, 28)

Table 20: Layer-wise rank settings of the compressed ResNet-20 model on the CIFAR-10 dataset with 3.02× FLOPs reduction.

Layer Name        Rank      Layer Name        Rank      Layer Name        Rank
FLOPs Reduction = 3.02×
layer1.0.conv1    (10, 10)  layer2.0.conv1    (12, 12)  layer3.0.conv1    (20, 20)
layer1.0.conv2    (10, 10)  layer2.0.conv2    (12, 12)  layer3.0.conv2    (20, 20)
layer1.1.conv1    (10, 10)  layer2.1.conv1    (12, 12)  layer3.1.conv1    (20, 20)
layer1.1.conv2    (8, 8)    layer2.1.conv2    (14, 14)

layer1.0.conv1    (9, 9)    layer2.0.conv1    (12, 12)  layer3.0.conv1    (16, 16)
layer1.0.conv2    (9, 9)    layer2.0.conv2    (12, 12)  layer3.0.conv2    (16, 16)
layer1.1.conv1    (9, 9)    layer2.1.conv1    (12, 12)  layer3.1.conv1    (16, 16)
layer1.1.conv2    (9, 9)    layer2.1.conv2    (12, 12)  layer3.1.conv2    (16, 16)
layer1.2.conv1    (9, 9)    layer2.2.conv1    (12, 12)  layer3.2.conv1    (16, 16)
layer1.2.conv2    (9, 9)    layer2.2.conv2    (12, 12)  layer3.2.conv2    (16, 16)

layer1.0.conv1    (12, 12)  layer2.0.conv1    (18, 18)  layer3.0.conv1    (26, 26)
layer1.0.conv2    (12, 12)  layer2.0.conv2    (18, 18)  layer3.0.conv2    (26, 26)
layer1.1.conv1    (12, 12)  layer2.1.conv1    (18, 18)  layer3.1.conv1    (26, 26)
layer1.1.conv2    (12, 12)  layer2.1.conv2    (18, 18)  layer3.1.conv2    (26, 26)
layer1.2.conv1    (12, 12)  layer2.2.conv1    (18, 18)  layer3.2.conv1    (26, 26)
layer1.2.conv2    (12, 12)  layer2.2.conv2    (18, 18)  layer3.2.conv2    (26, 26)
layer1.3.conv1    (12, 12)  layer2.3.conv1    (18, 18)  layer3.3.conv1    (26, 26)
layer1.3.conv2    (12, 12)  layer2.3.conv2    (18, 18)  layer3.3.conv2    (26, 26)
layer1.4.conv1    (12, 12)  layer2.4.conv1    (18, 18)  layer3.4.conv1    (26, 26)
layer1.4.conv2    (12, 12)  layer2.4.conv2    (18, 18)  layer3.4.conv2    (26, 26)
layer1.5.conv1    (12, 12)  layer2.5.conv1    (18, 18)  layer3.5.conv1    (26, 26)
layer1.5.conv2    (12, 12)  layer2.5.conv2    (18, 18)  layer3.5.conv2    (26, 26)
layer1.6.conv1    (12, 12)  layer2.6.conv1    (18, 18)  layer3.6.conv1    (26, 26)
layer1.6.conv2    (12, 12)  layer2.6.conv2    (18, 18)  layer3.6.conv2    (26, 26)
layer1.7.conv1    (12, 12)  layer2.7.conv1    (18, 18)  layer3.7.conv1    (26, 26)
layer1.7.conv2    (12, 12)  layer2.7.conv2    (18, 18)  layer3.7.conv2    (26, 26)
layer1.8.conv1    (12, 12)  layer2.8.conv1    (18, 18)  layer3.8.conv1    (26, 26)
layer1.8.conv2    (12, 12)  layer2.8.conv2    (18, 18)  layer3.8.conv2    (26, 26)

layer1.0.conv1    (10, 10)  layer2.0.conv1    (15, 15)  layer3.0.conv1    (28, 28)
layer1.0.conv2    (10, 10)  layer2.0.conv2    (15, 15)  layer3.0.conv2    (28, 28)
layer1.1.conv1    (10, 10)  layer2.1.conv1    (15, 15)  layer3.1.conv1    (28, 28)
layer1.1.conv2    (10, 10)  layer2.1.conv2    (15, 15)  layer3.1.conv2    (28, 28)
layer1.2.conv1    (10, 10)  layer2.2.conv1    (15, 15)  layer3.2.conv1    (28, 28)
layer1.2.conv2    (10, 10)  layer2.2.conv2    (15, 15)  layer3.2.conv2    (28, 28)
layer1.3.conv1    (10, 10)  layer2.3.conv1    (15, 15)  layer3.3.conv1    (28, 28)
layer1.3.conv2    (10, 10)  layer2.3.conv2    (15, 15)  layer3.3.conv2    (28, 28)
layer1.4.conv1    (10, 10)  layer2.4.conv1    (15, 15)  layer3.4.conv1    (28, 28)
layer1.4.conv2    (10, 10)  layer2.4.conv2    (15, 15)  layer3.4.conv2    (28, 28)
layer1.5.conv1    (10, 10)  layer2.5.conv1    (15, 15)  layer3.5.conv1    (28, 28)
layer1.5.conv2    (10, 10)  layer2.5.conv2    (15, 15)  layer3.5.conv2    (28, 28)
layer1.6.conv1    (10, 10)  layer2.6.conv1    (15, 15)  layer3.6.conv1    (28, 28)
layer1.6.conv2    (10, 10)  layer2.6.conv2    (15, 15)  layer3.6.conv2    (28, 28)
layer1.7.conv1    (10, 10)  layer2.7.conv1    (15, 15)  layer3.7.conv1    (28, 28)
layer1.7.conv2    (10, 10)  layer2.7.conv2    (15, 15)  layer3.7.conv2    (28, 28)
layer1.8.conv1    (10, 10)  layer2.8.conv1    (15, 15)  layer3.8.conv1    (28, 28)
layer1.8.conv2    (10, 10)  layer2.8.conv2    (15, 15)  layer3.8.conv2    (28, 28)



Figure 1: Different paths towards producing low-rank CNN models.

; Xu et al. (2020); Idelbayev & Carreira-Perpinán (2020); Yang et al. (2020); Liebenwein et al. (2021)) and high-order tensor decomposition based (Denton et al. (2014); Kim et al. (2015); Novikov et al. (2015); Yang et al. (2017); Wang et al. (2018); Kossaifi et al. (2019); Phan et al. (2020); Kossaifi et al. (2020); Li et al. (2021a); Lin et al. (2020b); Yu et al. (2021)).

Low-rank Training. Similar to low-rank compression, the goal of low-rank training is also to produce compact neural network models with low-rankness; the key difference is that low-rank training initializes and updates the low-rank CNNs throughout the entire training process. In other words, pre-trained full-rank models are not required in this scenario, and the CNN models being updated are always kept in the low-rank format. To date, efficient low-rank training approaches remain little explored. More specifically, existing works either suffer considerable accuracy loss (Ioannou et al. (2015); Tai et al. (2015); Hayashi et al. (2019); Khodak et al. (2020); Su et al. (2022)) or suffer high computational overhead because of costly SVD operations (Gural et al. (2020)) or Bayesian estimation (Hawkins et al. (2022)), or perform only partially low-rank training (Waleffe & Rekatsinas (2020); Wang et al. (2021)), limiting their effectiveness in practical scenarios.

Unstructured & Structured Sparse Training. Low-rank training is essentially a type of compression-aware training solution, which includes another related strategy: sparse training. Sparse training can be performed in unstructured (Lee et al. (2018); Wang et al. (2019); Evci et al. (2020); Mostafa & Wang (2019); Liu et al. (2020); Mocanu et al. (2018); Bellec et al. (2018)) and structured (Yuan et al. (2021);

Figure 2: A low-rank CONV layer can either exhibit low-matrix-rankness (Top) or low-tensor-rankness (Bottom).

Our Proposal. Currently, most existing low-rank training works (Yang et al. (2020); Ioannou et al. (2015); Tai et al. (2015)) conduct and keep 4-D convolutional layer training in the format of a low-rank 2-D matrix. Instead, we propose to perform low-rank CNN training directly in the high-order tensor format. In other words, each convolutional layer always stays in a low-rank tensor decomposition format, e.g., Tucker (Tucker (1966)) or CP (Hitchcock (1927)), during the entire training phase.

Figure 3: Approximation error (mean squared error, MSE) of the low-matrix-rank and low-tensor-rank formats for approximating ResNet-20 layers. Note that this MSE measurement is part of our analysis to identify the suitable low-rank format; it is not executed during training.

Figure 4: Training loss (left) and test accuracy (right) for low-tensor-rank ResNet-20 on CIFAR-10 with/without SO regularization. Same ranks are used for different experiments. Ranks are selected to provide 2× FLOPs reduction.

Figure 5: The mechanism of different approaches to impose orthogonality on U (2) . From top to bottom: (a) Soft Orthogonal Regularization, (b) Double Soft Orthogonal Regularization, (c) Spectral Restricted Isometry Property Regularization, (d) Mutual Coherence Regularization.

Figure 6: Performance of low-tensor-rank training with and without using DSO for ResNet-20/32/56 on CIFAR-10 dataset.

Figure 7: Performance of ELRT for ResNet-20/56 on CIFAR-10 dataset with different orthogonal penalty parameters λ d .

C.2.1 DENSE NETWORK. The FLOPs of the forward propagation is $f_D$; the FLOPs of the backward propagation is $2f_D$. The training FLOPs reduction is $\frac{f_D + 2f_D}{f_D + 2f_D} = 1$, which means there is no training FLOPs reduction.

C.2.2 PRUNING. Assume $T$ epochs are needed to obtain a pre-trained model. After pruning, another $K$ epochs are needed to re-train the model for fine-tuning. In the pre-training phase, the total computation costs of the forward and backward propagation are $f_D \cdot T$ and $2f_D \cdot T$, respectively. In the re-training phase, the total computation costs of the forward and backward propagation are $f_S \cdot K$ and $2f_S \cdot K$, respectively. Overall, the training FLOPs reduction is $\frac{3f_D \cdot T}{3f_D \cdot T + 3f_S \cdot K} < 1$, which means there is no training FLOPs reduction.

C.2.3 LOW-RANK COMPRESSION. Assume $T$ epochs are needed to obtain a pre-trained model. After low-rank decomposition, another $K$ epochs are needed to re-train the model for fine-tuning. In the pre-training phase, the total computation costs of the forward and backward propagation are $f_D \cdot T$ and $2f_D \cdot T$, respectively. In the re-training phase, the total computation costs of the forward and backward propagation are $f_L \cdot K$ and $2f_L \cdot K$, respectively. Overall, the training FLOPs reduction is $\frac{3f_D \cdot T}{3f_D \cdot T + 3f_L \cdot K} < 1$, which means there is no training FLOPs reduction.

C.2.4 STRUCTURED SPARSE TRAINING.
• GrowEfficient (Yuan et al. (2021)). For training a sparse model via GrowEfficient, the FLOPs of the forward propagation is $f_S$. Since the channel masks and scores, which determine whether to keep or remove the corresponding channels/filters, are updated by the Straight-Through gradient Estimator (STE), the backward propagation has to go through all the channels/filters, leading to dense computation. Therefore, the FLOPs of the backward propagation is $2f_D$.
For training a dense model, the computation cost of the forward propagation and backward propagation are f D and 2f D , respectively. Therefore, the training FLOPs reduction is f D +2f D f S +2f D = 3f D f S +2f D . • SparseBackward (Zhou et al. (2021)). For training a sparse model via SparseBackward, the FLOPs of the forward propagation is f S . The backward propagation keeps sparse with 2f S computation cost. Unlike updating masks/scores via dense backward propagation in GrowEfficient, SparseBackward updates them via Variance Reduced Policy Gradient Estimator (VR-PGE), which only requires an extra one-time forward propagation with training cost f S . For training a dense model, the computation cost of the forward propagation and backward propagation are f D and 2f D , respectively. Therefore, the training FLOPs reduction is
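The per-method training FLOPs reduction formulas above can be expressed as plain functions for quick comparison. The function names and the sample inputs below are illustrative placeholders; $f_D$, $f_S$, $f_L$ denote per-epoch forward-pass FLOPs of the dense, sparse and low-rank models, and the backward pass is taken to cost twice the forward pass, as in the derivations.

```python
# Training FLOPs reduction formulas from Sections C.2.1-C.2.4.
# f_d / f_s / f_l: forward-pass FLOPs of the dense / sparse / low-rank model.

def dense(f_d):
    # Forward (f_d) + backward (2*f_d) against itself: the ratio is always 1.
    return (f_d + 2 * f_d) / (f_d + 2 * f_d)

def pruning(f_d, f_s, T, K):
    # T pre-training epochs on the dense model, then K fine-tuning
    # epochs on the pruned (sparse) model.
    return (3 * f_d * T) / (3 * f_d * T + 3 * f_s * K)

def low_rank_compression(f_d, f_l, T, K):
    # Same schedule as pruning, with fine-tuning on the low-rank model.
    return (3 * f_d * T) / (3 * f_d * T + 3 * f_l * K)

def grow_efficient(f_d, f_s):
    # Sparse forward (f_s) but dense backward (2*f_d) due to STE.
    return (3 * f_d) / (f_s + 2 * f_d)

def sparse_backward(f_d, f_s):
    # Sparse forward + sparse backward + one extra forward for VR-PGE.
    return (3 * f_d) / (f_s + 2 * f_s + f_s)
```

For instance, with a sparse model at half the dense forward cost (f_s = f_d / 2), GrowEfficient yields a 1.2× training FLOPs reduction while SparseBackward yields 1.5×, matching the observation that dense backward propagation limits the achievable savings.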

Results for VGG-16, ResNet-20, ResNet-56 and MobileNetV2 on CIFAR-10 dataset. "*" denotes compression ratio since the corresponding work does not report FLOPs reduction.

Compared with structured sparse training, ELRT achieves higher FLOPs reduction for both inference and training. This is because structured sparse training consumes extra computation to calculate the channel masks during backward propagation, which is not needed in low-rank training. More details on the calculation of inference and training FLOPs reduction for different methods are reported in Appendix C.

Results for ResNet-50 on ImageNet dataset.

Runtime (per image) for the low-rank ResNet-50 trained via our proposed ELRT.



Comparison with directly training small dense models with similar target model sizes on CIFAR-10 dataset.

Comparison with Hayashi et al. (2019) on CIFAR-10 dataset.

Comparison with Su et al. (2022) on CIFAR-10 dataset.

Lin et al. (2020b) is a low-rank approach that requires the existence of a pre-trained model, while ELRT performs low-rank training from scratch without consuming any pre-training cost, significantly reducing training complexity. More importantly, as shown in the following table, ELRT shows better model performance (more than 1% accuracy increase) than Lin et al. (2020b) for obtaining low-rank AlexNet on CIFAR-10 dataset. Here the ranks for Conv2, Conv3, Conv4 and Conv5 are set as [32, 64], [72, 108], [64, 32] and [40, 40], respectively.

Comparison with Lin et al. (2020b) on CIFAR-10 dataset.

Yu et al. (2021) is also a low-rank compression work that requires a pre-training phase. As shown in the following table, even without using any pre-trained high-accuracy model, ELRT still achieves higher accuracy than the pre-training-dependent Yu et al. (2021) for obtaining low-rank ResNet-56 on CIFAR-10 and ResNet-50 on ImageNet.

Comparison with Yu et al. (2021).

Comparison with Khodak et al. (2020) on CIFAR-10 dataset.

Comparison with Wang et al. (2021) on ImageNet dataset.

Similar to Wang et al. (2021), Waleffe & Rekatsinas (2020) initializes training with a wide, full-rank model and then decomposes it into low-rank format after a few epochs. Instead, ELRT performs low-rank training from scratch and keeps the model in the low-rank format during the entire training procedure. As shown in the following table, ELRT shows a 0.51% accuracy increase over Waleffe & Rekatsinas (2020) with 2× model size reduction for training ResNet-50 on ImageNet dataset.

Comparison with Waleffe & Rekatsinas (2020) on ImageNet dataset.

Performance of using different orthogonal regularization schemes for training low-rank ResNet-20, ResNet-32 and ResNet-56 on CIFAR-10 dataset.

Effect of Different Orthogonality Penalty Parameters λ_d. As shown in Tables 17, 18 and Figure 7, we evaluate the effect of different orthogonality penalty parameters λ_d.
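A common way to impose orthogonality on the low-rank factors, weighted by a penalty parameter λ_d, is a soft regularizer on the Gram matrix. The formulation below, $\lambda_d \, \|W^{\top}W - I\|_F^2$, is a minimal sketch of such a scheme and is not necessarily the exact ELRT regularizer; the function name and NumPy setup are illustrative assumptions.

```python
import numpy as np

def orthogonality_penalty(w, lam_d):
    # Soft orthogonality regularizer: lam_d * ||W^T W - I||_F^2.
    # w: (n, r) low-rank factor matrix with r <= n. The penalty is
    # zero exactly when the r columns of w are orthonormal, so adding
    # it to the training loss pushes the factor toward orthogonality.
    r = w.shape[1]
    gram = w.T @ w
    return lam_d * float(np.sum((gram - np.eye(r)) ** 2))
```

A factor with orthonormal columns (e.g., the Q of a QR decomposition) incurs zero penalty, while any column correlation increases the loss proportionally to λ_d; sweeping λ_d trades off task loss against factor orthogonality, which is what the λ_d ablation above measures.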

with different orthogonality penalty parameters (λ_d).

ResNet-56 with different orthogonality penalty parameters (λ_d).

Tables 19, 20 and 21 list the layer-wise rank settings of the ResNet-20 model on CIFAR-10 dataset under 1.98× FLOPs, 3.02× FLOPs and 6.01× parameters reduction, respectively. Tables 22 and 23 list the layer-wise rank settings of the ResNet-56 model on CIFAR-10 dataset under 2.05× and 2.52× FLOPs reduction.

Table 24 lists the layer-wise rank settings of the ResNet-50 model on ImageNet dataset under 2.49× FLOPs reduction.
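To connect a layer-wise rank setting to the reported FLOPs reductions, the per-layer arithmetic can be sketched as below. The sketch assumes the common two-factor scheme in which a rank-r, k×k convolution is realized as a k×k convolution to r channels followed by a 1×1 convolution; this scheme and the function names are illustrative assumptions, not necessarily the exact ELRT factorization.

```python
def conv_flops(c_in, c_out, k, h, w):
    # Multiply-accumulate count of a dense k x k convolution
    # producing an h x w output feature map.
    return h * w * k * k * c_in * c_out

def low_rank_conv_flops(c_in, c_out, k, h, w, r):
    # Rank-r factorization: a k x k conv (c_in -> r) followed by
    # a 1 x 1 conv (r -> c_out).
    return h * w * (k * k * c_in * r + r * c_out)

def layer_flops_reduction(c_in, c_out, k, h, w, r):
    # Per-layer FLOPs reduction achieved by a given rank setting.
    return conv_flops(c_in, c_out, k, h, w) / low_rank_conv_flops(
        c_in, c_out, k, h, w, r)
```

For example, a 3×3 layer with 64 input and 64 output channels on a 32×32 feature map reduced to rank 16 yields a 3.6× per-layer FLOPs reduction; summing such terms over all layers gives the model-level figures quoted in the tables.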

Layer-wise rank settings of the compressed ResNet-20 model on CIFAR-10 dataset with 1.98× FLOPs reduction.

Layer-wise rank settings of the compressed ResNet-20 model on CIFAR-10 dataset with 6.01× parameters reduction.

Layer-wise rank settings of the compressed ResNet-56 model on CIFAR-10 dataset with 2.05× FLOPs reduction.

Layer-wise rank settings of the compressed ResNet-56 model on CIFAR-10 dataset with 2.52× FLOPs reduction.

APPENDIX

The entire Appendix consists of four sections.
• Section A lists additional experimental results with more model types and more comparisons.
• Section B reports the ablation study for the orthogonality-imposing strategy.
• Section C presents the calculation schemes for two important performance metrics: "inference FLOPs reduction" and "training FLOPs reduction".
• Section D shows the layer-wise rank distributions in the experiments, thereby demonstrating the convenience of rank setting.

A ADDITIONAL EXPERIMENTS

A.1 RESNET-32 AND WRN-28-8 ON CIFAR-10

Compared with the baseline (2015), ELRT can achieve higher FLOPs reduction while providing better accuracy. In addition, we also compare our proposed method with another Tucker-format work (Hawkins et al. (2022)) for low-rank training from scratch. As shown in Table 7, ELRT can achieve higher accuracy and FLOPs reduction than Hawkins et al. (2022). Moreover, considering that ELRT does not require the computation-intensive Bayesian estimation used in Hawkins et al. (2022), ELRT is more attractive for practical applications.

A.4 COMPARISON WITH OTHER COMPACT MODELS TRAINED FROM SCRATCH

We also compare the performance of ELRT with directly training small dense models from scratch with similar target model sizes.

