RETHINKING COMPRESSED CONVOLUTIONAL NEURAL NETWORKS FROM A STATISTICAL PERSPECTIVE

Abstract

Many designs have recently been proposed to improve the model efficiency of convolutional neural networks (CNNs) at a fixed resource budget, but there is a lack of theoretical analysis to justify them. This paper first formulates CNNs with high-order inputs as statistical models, which admit a special "Tucker-like" formulation. This makes it possible to conduct a sample complexity analysis of CNNs, as well as of CNNs compressed via tensor decomposition. Tucker and CP decompositions are commonly adopted to compress CNNs in the literature, and the low-rank assumption is usually imposed on the output channels. According to our study, this assumption may not help obtain a more computationally efficient model while maintaining a similar accuracy. Our finding is further supported by ablation studies on the CIFAR10, SVHN and UCF101 datasets.

1. INTRODUCTION

The introduction of AlexNet (Krizhevsky et al., 2012) spurred a line of research in 2D CNNs, which have progressively achieved high levels of accuracy in image recognition (Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017). The current state-of-the-art CNNs leave little room for significant accuracy improvements on still images, and attention has hence been diverted in two directions. The first is to deploy deep CNNs on mobile devices by removing redundancy from over-parametrized networks; representative models include MobileNetV1 & V2 (Howard et al., 2017; Sandler et al., 2018). The second is to utilize CNNs to learn from higher-order inputs, for instance, video clips (Tran et al., 2018; Hara et al., 2017) or electronic health records (Cheng et al., 2016; Suo et al., 2017). This area has not yet seen a widely accepted state-of-the-art network. High-order kernel tensors are usually required to account for the multiway dependence of the input, which leads to a notoriously heavy computational burden, as the number of parameters to be trained grows exponentially with the order of the input. Model compression therefore becomes critical to guarantee the successful training and deployment of tensor CNNs.

Tensor methods for compressing CNNs. Denil et al. (2013) showed that there is huge redundancy in network weights, such that the entire network can be approximately recovered from a small fraction of its parameters. Tensor decomposition has recently been widely used to compress the weights of a CNN (Lebedev et al., 2015; Kim et al., 2016; Kossaifi et al., 2020b; Hayashi et al., 2019). Specifically, the weights at each layer are first arranged into a tensor, and a tensor decomposition, CP or Tucker, is then applied to reduce the number of parameters.
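To make the layer-wise compression step concrete, the following sketch applies a truncated higher-order SVD (one standard way to compute a Tucker decomposition) to a 4-way convolution kernel. The kernel sizes, Tucker ranks, and helper names are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def unfold(tensor, mode):
    """Mode-`mode` unfolding: move the chosen axis first, flatten the rest."""
    return np.moveaxis(tensor, mode, 0).reshape(tensor.shape[mode], -1)

def mode_dot(tensor, matrix, mode):
    """Multiply `tensor` by `matrix` along axis `mode`."""
    out = np.tensordot(matrix, np.moveaxis(tensor, mode, 0), axes=1)
    return np.moveaxis(out, 0, mode)

def tucker_hosvd(kernel, ranks):
    """Truncated HOSVD: top-r left singular vectors of each unfolding, then the core."""
    factors = [np.linalg.svd(unfold(kernel, m), full_matrices=False)[0][:, :r]
               for m, r in enumerate(ranks)]
    core = kernel
    for m, U in enumerate(factors):
        core = mode_dot(core, U.T, m)
    return core, factors

def tucker_reconstruct(core, factors):
    out = core
    for m, U in enumerate(factors):
        out = mode_dot(out, U, m)
    return out

# Illustrative kernel: 64 output channels, 32 input channels, 3x3 window,
# built with exact multilinear rank `ranks` so the truncated HOSVD is exact.
rng = np.random.default_rng(0)
ranks = (16, 8, 3, 3)
core0 = rng.standard_normal(ranks)
facs0 = [rng.standard_normal((n, r)) for n, r in zip((64, 32, 3, 3), ranks)]
kernel = tucker_reconstruct(core0, facs0)

core, factors = tucker_hosvd(kernel, ranks)
full_params = kernel.size                              # 64*32*3*3 = 18432
tucker_params = core.size + sum(U.size for U in factors)
err = (np.linalg.norm(kernel - tucker_reconstruct(core, factors))
       / np.linalg.norm(kernel))
print(full_params, tucker_params, err)
```

The compressed parameter count is the core size plus the factor-matrix sizes, far below the full kernel's; replacing the exact multilinear ranks with smaller ones trades reconstruction error for further compression.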
Applying different tensor decompositions to convolution layers leads to a variety of compressed CNN block designs. For instance, the bottleneck block in ResNet (He et al., 2016) corresponds to a convolution kernel with a special Tucker low-rank structure, while the depthwise separable block in MobileNetV1 (Howard et al., 2017) and the inverted residual block in MobileNetV2 (Sandler et al., 2018) correspond to convolution kernels with special CP forms. All of the above are for 2D CNNs; Kossaifi et al. (2020b) and Su et al. (2018) considered tensor decompositions that factorize convolution kernels for higher-order tensor inputs. Tensor decomposition can also be applied to fully-connected layers, since these may introduce a large number of parameters (Kossaifi et al., 2017; 2020a); see also the discussions in Section 5. Moreover, Kossaifi et al. (2019) summarized all weights of a network into one single high-order tensor and directly imposed a low-rank structure to achieve full network compression. While the idea is
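As a rough illustration of why these CP-style factorizations shrink block designs, the sketch below compares parameter counts for a standard 2D convolution kernel, a rank-R CP factorization of it (one factor vector per mode and component), and the depthwise separable pattern of MobileNetV1. The layer sizes and the rank are illustrative assumptions; the precise correspondence between these blocks and CP forms is a claim of the surrounding text, not of this sketch.

```python
def conv_params(T, S, d):
    """Standard 2D convolution kernel: T x S x d x d weights."""
    return T * S * d * d

def cp_params(T, S, d, R):
    """Rank-R CP factorization: R components, one factor vector per mode."""
    return R * (T + S + d + d)

def depthwise_separable_params(T, S, d):
    """Depthwise (S x d x d) followed by pointwise (T x S) convolution."""
    return S * d * d + T * S

# Illustrative layer: 256 input and output channels, 3x3 window, CP rank 64.
T, S, d, R = 256, 256, 3, 64
print(conv_params(T, S, d))                 # 589824
print(cp_params(T, S, d, R))                # 33152
print(depthwise_separable_params(T, S, d))  # 67840
```

Under these sizes, both factorized forms use under 12% of the standard kernel's parameters, which is the kind of saving that motivates the block designs cited above.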

