BETTER TOGETHER: RESNET-50 ACCURACY WITH 13× FEWER PARAMETERS AND AT 3× SPEED

Abstract

Recent research on compressing deep neural networks has focused on reducing the number of parameters, since smaller networks are easier to export and deploy on edge devices. We introduce adjoined networks, a training approach that can regularize and compress any CNN-based neural architecture. Our one-shot learning paradigm trains the original and the smaller network together, with the parameters of the smaller network shared across both architectures. We prove strong theoretical guarantees on the regularization behavior of the adjoint training paradigm, and we complement this analysis with an extensive empirical evaluation of both the compression and the regularization behavior of adjoined networks. For resnet-50 trained adjointly on ImageNet, we achieve a 13.7x reduction in the number of parameters^1 and a 3x improvement in inference time without any significant drop in accuracy. For the same architecture on CIFAR-100, we achieve a 99.7x reduction in the number of parameters and a 5x improvement in inference time. On both datasets, the original network trained in the adjoint fashion gains about 3% in top-1 accuracy compared to the same network trained in the standard fashion.



1 Introduction

Deep neural networks have achieved state-of-the-art performance on computer vision tasks such as classification, object detection Redmon et al. (2016), image segmentation Badrinarayanan et al. (2017) and many more. Since the introduction of Alexnet Krizhevsky et al. (2012), neural architectures have progressively grown deeper, with a corresponding increase in the number of parameters. Examples include Resnet He et al. (2016) and its many variants (xresnet He et al. (2019), resnext Xie et al. (2017); Hu et al. (2018), etc.), Densenet Huang et al. (2017), Inception networks Chen et al. (2017) and many others.

While these networks achieve exceptional performance on many tasks, their large size makes them difficult to deploy on edge devices such as mobile phones, IoT and embedded devices. Unlike cloud servers, edge devices are constrained in memory, compute and energy. A large network performs many computations, consumes more energy and is difficult to transport and update. It also has a high per-image prediction time, which is a constraint when real-time inference is needed. Compressing neural networks while maintaining accuracy has therefore received significant attention in the last few years.

Pruning. These techniques remove parameters (or weights) that satisfy some criterion. For example, in weight pruning, all parameters whose values fall below a pre-determined threshold are removed Han et al. (2015). Natural extensions are channel pruning Liu et al. (2017) and filter pruning Li et al. (2016), where an entire convolution channel or filter is removed according to some criterion. However, all of these methods involve multiple passes of pruning followed by fine-tuning and require a very long time to compress. Moreover, weight pruning does not yield faster inference unless there is hardware support for fast sparse matrix multiplications. In this paper, we propose a one-shot compression procedure (as opposed to multiple passes). To the best of our knowledge, pruning techniques have not been successfully applied, without significant accuracy loss, to architectures as large as the Resnet-50 considered in this paper.

A separate line of work designs compact models directly via architecture search. In this paper, our goal is to design a paradigm that can compress any given architecture; hence the direction of architecture search is orthogonal to our approach.

To summarize, most current approaches suffer from one of two problems: (1) they require special hardware to support fast inference, or (2) they require large training time, as they alternate between pruning and fine-tuning. In this work, we propose a novel training paradigm based on adjoined networks which can compress any neural architecture, provides inference-time speedups and works at the application layer (it does not require any specialized hardware).

Figure 1: Training paradigm based on adjoined networks. The original and the compressed version of the network are trained together, with the parameters of the smaller network shared across both. The network outputs two probability vectors: p (corresponding to the original network) and q (corresponding to the smaller network).

As shown in Fig. 1, in the adjoint training paradigm, both the original and the compressed network are trained together at the same time. The parameters of the larger network are a super-set of the parameters of the smaller network. Details of our design, how it supports fast inference, and its relationship with other approaches (teacher-student Hinton et al. (2015), siamese networks Bertinetto et al. (2016), slimmable networks Yu et al. (2018), deep mutual learning Zhang et al. (2018) and the lottery ticket hypothesis Frankle & Carbin (2018)) are discussed in Section 2. Our training paradigm produces two outputs, p and q, corresponding to the original and smaller networks respectively. We train the two networks using a novel time-dependent loss function, the adjoint loss, described in Section 3. The adjoint loss not only trains the smaller network but also acts as a regularizer for the bigger (original) network. In Section 4, we provide strong theoretical guarantees on the regularization behaviour of adjoint training. We also show that training and regularizing in the adjoint fashion is better than other regularization techniques such as dropout Srivastava et al. (2014). In Section 5, we describe our results. We run several experiments on datasets such as ImageNet Russakovsky et al. (2015) and CIFAR-10 and CIFAR-100 Krizhevsky et al. (2009), and for each we consider different architectures such as resnet-50 and resnet-18. On CIFAR-100, the adjoint training paradigm allows us to compress the resnet-50 architecture by 99.7x without losing any accuracy, and the compressed architecture runs inference 5x faster than the original, bigger architecture. Moreover, the original network gains 3.58% in accuracy when compared against the same network trained in the standard (non-adjoint) fashion. On the same dataset, for resnet-18, we [...]

^1 For size comparison, we ignore the parameters in the last linear layer, as it varies by dataset and is typically dropped during fine-tuning. Otherwise, the reductions are 11.5x and 95x for ImageNet and CIFAR-100 respectively.
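To make the shared-parameter idea concrete, the following is a minimal sketch (plain Python, hypothetical class and variable names, not the paper's implementation) of a toy fully connected layer in which the smaller network reuses a slice of the larger network's weight matrix: every small-network parameter is literally a big-network parameter, so the larger network's parameters are a super-set of the smaller network's. A convolutional version would slice filters rather than rows.

```python
# Toy sketch of adjoined parameter sharing: the small network's weights are a
# slice of the big network's weights (hypothetical names, illustration only).

class AdjoinedLinear:
    def __init__(self, weights, k):
        self.weights = weights  # full weight matrix: one row per output unit
        self.k = k              # the small network keeps only the first k units

    def forward_big(self, x):
        # big network: use every row of the shared weight matrix
        return [sum(w_i * x_i for w_i, x_i in zip(row, x))
                for row in self.weights]

    def forward_small(self, x):
        # small network: reuse only the first k rows -- no separate parameters
        return [sum(w_i * x_i for w_i, x_i in zip(row, x))
                for row in self.weights[:self.k]]

W = [[1.0, 0.0],   # unit 0 (shared by both networks)
     [0.0, 1.0],   # unit 1 (shared by both networks)
     [1.0, 1.0]]   # unit 2 (big network only)
layer = AdjoinedLinear(W, k=2)
x = [2.0, 3.0]
print(layer.forward_big(x))    # [2.0, 3.0, 5.0]
print(layer.forward_small(x))  # [2.0, 3.0]
```

Because the two forward passes read from the same matrix, a gradient step on either output updates the shared slice for both networks, which is what lets the pair be trained together in one shot.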

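The adjoint loss itself is defined in Section 3, which is outside this excerpt. As a hedged illustration only, a time-dependent loss of the general shape described above could combine a supervised cross-entropy term on p with a schedule-weighted divergence term pulling q toward p; the schedule `lam` below is a made-up linear ramp, not the paper's.

```python
import math

# Hedged sketch of a time-dependent "adjoint-style" loss: cross-entropy on the
# big network's output p, plus a schedule-weighted KL(p || q) term that pulls
# the small network's output q toward p. The paper's actual adjoint loss is
# defined in its Section 3; this form and the schedule are assumptions.

def cross_entropy(p, label):
    return -math.log(p[label])

def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def lam(t, t_max):
    return t / t_max  # hypothetical linear ramp from 0 to 1 over training

def adjoint_loss(p, q, label, t, t_max):
    return cross_entropy(p, label) + lam(t, t_max) * kl_divergence(p, q)

p = [0.7, 0.2, 0.1]  # big-network probabilities
q = [0.5, 0.3, 0.2]  # small-network probabilities
print(adjoint_loss(p, q, label=0, t=5, t_max=10))
```

Minimizing the first term trains the original network on the labels, while the second term distills its predictions into the smaller network, consistent with the dual training/regularization role described above.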

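For contrast with the one-shot procedure proposed here, the magnitude-based weight-pruning baseline discussed in the introduction can be sketched generically as follows (an illustration of the thresholding criterion, not the exact procedure of Han et al. (2015)):

```python
# Generic magnitude-pruning sketch: zero out weights whose absolute value is
# below a threshold. In practice this step is repeated and interleaved with
# fine-tuning, and the resulting sparse matrices speed up inference only on
# hardware with fast sparse matrix multiplication.

def prune(weights, threshold):
    return [0.0 if abs(w) < threshold else w for w in weights]

def sparsity(weights):
    return sum(1 for w in weights if w == 0.0) / len(weights)

w = [0.05, -0.4, 0.007, 1.2, -0.02, 0.3]
pruned = prune(w, threshold=0.1)
print(pruned)            # [0.0, -0.4, 0.0, 1.2, 0.0, 0.3]
print(sparsity(pruned))  # 0.5
```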