RETHINKING COMPRESSED CONVOLUTIONAL NEURAL NETWORKS FROM A STATISTICAL PERSPECTIVE

Abstract

Many designs have recently been proposed to improve the model efficiency of convolutional neural networks (CNNs) at a fixed resource budget, but there is a lack of theoretical analysis to justify them. This paper first formulates CNNs with high-order inputs into statistical models, which have a special "Tucker-like" formulation. This makes it possible to conduct sample complexity analysis for CNNs, as well as for CNNs compressed via tensor decomposition. Tucker and CP decompositions are commonly adopted to compress CNNs in the literature, and the low-rank assumption is usually imposed on the output channels; according to our study, this may not be beneficial for obtaining a computationally efficient model while maintaining a similar accuracy. Our finding is further supported by ablation studies on the CIFAR10, SVHN and UCF101 datasets.

1. INTRODUCTION

The introduction of AlexNet (Krizhevsky et al., 2012) spurred a line of research in 2D CNNs, which progressively achieved high levels of accuracy in the domain of image recognition (Simonyan & Zisserman, 2015; Szegedy et al., 2015; He et al., 2016; Huang et al., 2017). The current state-of-the-art CNNs leave little room for significant improvement in accuracy on still images, and attention has hence been diverted towards two directions. The first is to deploy deep CNNs on mobile devices by removing redundancy from the over-parametrized network; representative models include MobileNetV1 & V2 (Howard et al., 2017; Sandler et al., 2018). The second direction is to utilize CNNs to learn from higher-order inputs, for instance, video clips (Tran et al., 2018; Hara et al., 2017) or electronic health records (Cheng et al., 2016; Suo et al., 2017). This area has not yet seen a widely accepted state-of-the-art network. High-order kernel tensors are usually required to account for the multiway dependence of the input. This notoriously leads to a heavy computational burden, as the number of parameters to be trained grows exponentially with the dimension of the inputs. Consequently, model compression becomes the critical juncture to guarantee the successful training and deployment of tensor CNNs.

Tensor methods for compressing CNNs. Denil et al. (2013) showed that there is huge redundancy in network weights, such that the entire network can be approximately recovered from a small fraction of the parameters. Tensor decomposition has recently been widely used to compress the weights of a CNN (Lebedev et al., 2015; Kim et al., 2016; Kossaifi et al., 2020b; Hayashi et al., 2019). Specifically, the weights at each layer are first summarized into a tensor, and then a tensor decomposition, such as CP or Tucker decomposition, is applied to reduce the number of parameters.
Applying different tensor decompositions to convolution layers leads to a variety of compressed CNN block designs. For instance, the bottleneck block in ResNet (He et al., 2016) corresponds to a convolution kernel with a special Tucker low-rank structure, while the depthwise separable block in MobileNetV1 (Howard et al., 2017) and the inverted residual block in MobileNetV2 (Sandler et al., 2018) correspond to convolution kernels with special CP forms. All the above are for 2D CNNs; Kossaifi et al. (2020b) and Su et al. (2018) considered tensor decomposition to factorize convolution kernels for higher-order tensor inputs. Tensor decomposition can also be applied to fully-connected layers, since they may introduce a large number of parameters (Kossaifi et al., 2017; 2020a); see also the discussions in Section 5. Moreover, Kossaifi et al. (2019) summarized all weights of a network into one single high-order tensor, and then directly imposed a low-rank structure to achieve full network compression. While the idea is highly motivating, the proposed structure of the high-order tensor is heuristic and can be further improved; see the discussions in Section 2.4. The parameter efficiency of the above architectures has been justified only heuristically, by methods such as FLOP counting, naive parameter counting and/or empirical running time. There is still a lack of theoretical study to understand the mechanism by which tensor decomposition compresses CNNs. This paper attempts to fill this gap from a statistical perspective.

Sample Complexity Analysis. Du et al. (2018a) first characterized the statistical sample complexity of a CNN; see also Wang et al. (2019) for compact autoregressive nets. Specifically, consider a CNN model, y = F_CNN(x, W) + ξ, where y and x are the output and input, respectively, W contains all weights, and ξ is an additive error.
Given the trained and true underlying networks F_CNN(x, Ŵ) and F_CNN(x, W*), the root-mean-square prediction error is defined as E(Ŵ) = [E_x |F_CNN(x, Ŵ) - F_CNN(x, W*)|^2]^{1/2}, where Ŵ and W* are the trained and true underlying weights, respectively, and E_x denotes the expectation over x. Sample complexity analysis investigates how many samples are needed to guarantee a given tolerance on the prediction error. It can also be used to detect model redundancy. Consider two nested CNNs, where F_1 is more compressed than F_2. Given the same true underlying network, if the prediction errors from the trained F_1 and F_2 are comparable, we can argue that F_2 has redundant weights compared with F_1. As a result, conducting sample complexity analysis for CNNs with higher-order inputs sheds light on the compressing mechanism of popular compressed CNNs via tensor decomposition. The study in Du et al. (2018a) is limited to 1-dimensional convolution with a single kernel, followed by weighted summation, and its theoretical analysis cannot be generalized to CNNs with compressed layers. In comparison, our paper presents a more realistic model of a CNN by introducing a general N-dimensional convolution with multiple kernels, followed by an average pooling layer and a fully-connected layer. The convolution kernel and fully-connected weights are in tensor form, which allows us to explicitly model compressed CNNs by imposing low-rank assumptions on the weight tensors. Moreover, we use an alternative technical tool, which yields a sharper upper bound on the sample complexity. Our paper makes three main contributions:
1. We formulate CNNs with high-order inputs into statistical models, and show that they have an explicit "Tucker-like" form.
2. Sample complexity analysis can then be conducted for CNNs as well as compressed CNNs via tensor decomposition, under weak conditions that allow for time-dependent inputs such as video data.
3.
From our theoretical analysis, we draw an interesting finding: forcing low dimensionality on the output channels may introduce unnecessary parameter redundancy into a compressed network.

Our study is related to the literature on generalization bounds for deep networks (et al., 2017; Golowich et al., 2018; Bartlett et al., 2017; Neyshabur et al., 2015) and low-rank compression based methods (Li et al., 2020; Zhou & Feng, 2018; Arora et al., 2018). These works use a model-agnostic framework, and hence rely heavily on explicit regularization, such as weight decay, dropout or data augmentation, as well as algorithm-based implicit regularization, to remove the redundancy in the network. We, however, attempt to theoretically explain how much compressibility is achieved by a compressed network architecture. Specifically, we compare a CNN with its compressed version, and make theoretically supported modifications to the latter to further increase efficiency.
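As a hedged toy illustration of the root-mean-square prediction error E(Ŵ) used above, one can estimate it by Monte Carlo for a minimal 1D CNN. The architecture, weight shapes, and function names below are hypothetical stand-ins, not the networks analyzed in the paper.

```python
import numpy as np

def f_cnn(x, kernels, w_fc):
    """Toy 1D CNN: valid convolution per kernel, ReLU, global average
    pooling, then a linear readout. A minimal stand-in for F_CNN."""
    feats = [np.maximum(np.convolve(x, k[::-1], mode="valid"), 0).mean()
             for k in kernels]
    return float(np.dot(w_fc, feats))

def prediction_error(w_hat, w_star, sample_x, n=500, seed=0):
    """Monte Carlo estimate of E(W_hat) = [E_x |F(x, W_hat) - F(x, W*)|^2]^{1/2},
    averaging over draws of the input x."""
    rng = np.random.default_rng(seed)
    sq = 0.0
    for _ in range(n):
        x = sample_x(rng)
        sq += (f_cnn(x, *w_hat) - f_cnn(x, *w_star)) ** 2
    return (sq / n) ** 0.5

# Hypothetical "true" network: 4 kernels of length 5 on inputs of length 32.
rng = np.random.default_rng(1)
kernels_star = rng.standard_normal((4, 5))
w_fc_star = rng.standard_normal(4)
w_star = (kernels_star, w_fc_star)
# A "trained" network whose kernels are slightly perturbed.
w_hat = (kernels_star + 0.1 * rng.standard_normal((4, 5)), w_fc_star)

sample_x = lambda r: r.standard_normal(32)
err_self = prediction_error(w_star, w_star, sample_x)  # 0.0 for identical weights
err_hat = prediction_error(w_hat, w_star, sample_x)    # strictly positive
```

In this spirit, comparing err_hat across two nested architectures trained on the same data mimics the redundancy-detection argument in the text: comparable errors from the smaller and larger networks suggest the larger one carries redundant weights.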

