BETTER TOGETHER: RESNET-50 ACCURACY WITH 13× FEWER PARAMETERS AND AT 3× SPEED

Abstract

Recent research on compressing deep neural networks has focused on reducing the number of parameters, since smaller networks are easier to export and deploy on edge devices. We introduce Adjoined Networks, a training approach that can both regularize and compress any CNN-based neural architecture. Our training paradigm trains the original network and a smaller network together in a single pass; the parameters of the smaller network are shared across both architectures. We prove strong theoretical guarantees on the regularization behavior of the adjoint training paradigm, and we complement our theoretical analysis with an extensive empirical evaluation of both the compression and regularization behavior of adjoined networks. For ResNet-50 trained adjointly on ImageNet, we achieve a 13.7× reduction in the number of parameters and a 3× improvement in inference time without any significant drop in accuracy. For the same architecture on CIFAR-100, we achieve a 99.7× reduction in the number of parameters and a 5× improvement in inference time. On both datasets, the original network trained in the adjoint fashion gains about 3% in top-1 accuracy compared to the same network trained in the standard fashion.
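To make the paradigm concrete, the following is a minimal illustrative sketch (not the authors' exact formulation) of the core idea: a thin network's weights are shared with the full network, and a single joint objective is minimized so that gradients on the shared weights come from both branches. All function and variable names here are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(x, w_shared, w_extra=None):
    # Thin network uses only the shared weights; the full network adds a
    # hypothetical extra branch on top of the same shared parameters.
    h = np.tanh(x @ w_shared)
    if w_extra is not None:
        h = h + np.tanh(x @ w_extra)
    return h.sum(axis=1)

# Toy regression data and weights.
x = rng.normal(size=(8, 4))
y = rng.normal(size=8)
w_shared = 0.1 * rng.normal(size=(4, 3))  # shared across both networks
w_extra = 0.1 * rng.normal(size=(4, 3))   # belongs only to the full network

# One joint loss over both networks: minimizing it trains the pair together,
# so the shared weights are regularized by serving both architectures.
loss_full = np.mean((forward(x, w_shared, w_extra) - y) ** 2)
loss_thin = np.mean((forward(x, w_shared) - y) ** 2)
joint_loss = loss_full + loss_thin
```

After training, the thin branch can be deployed on its own, which is the source of the parameter and inference-time reductions reported above.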



While these networks achieve exceptional performance on many tasks, their large size makes them difficult to deploy on edge devices (such as mobile phones, IoT and embedded devices). Unlike cloud servers, these edge devices are constrained in terms of memory, compute and energy resources. A large network performs many computations, consumes more energy and is difficult to transport and to update. A large network also has a high prediction time per image, which is a constraint when real-time inference is needed. Thus, compressing neural networks while maintaining accuracy has received significant attention in the last few years. Pruning techniques remove parameters (or weights) that satisfy some criterion. For example, in weight pruning, all parameters whose values fall below some pre-determined threshold are removed Han et al. (2015). A natural extension of this is channel pruning Liu et al. (2017) and filter pruning Li et al. (2016), where entire convolutional channels or filters are removed according to some criterion. However, all of these methods involve multiple passes of pruning followed by fine-tuning and require a very long time to compress. Moreover, weight pruning does not yield faster inference unless there is hardware support for fast sparse matrix multiplications. In this paper, we propose a one-shot compression procedure (as opposed to multiple passes). To the best of our knowledge, pruning techniques have not been successfully applied (without significant accuracy loss) to large architectures like the ResNet-50 considered in this paper. For size comparisons, we ignore the parameters in the last linear layer, as it varies by dataset and is typically dropped during fine-tuning; otherwise, the reductions are 11.5× and 95× for ImageNet and CIFAR-100 respectively.
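The magnitude-based weight pruning criterion described above can be sketched in a few lines. This is a toy illustration, not the pipeline of Han et al. (2015), which alternates pruning with fine-tuning over several passes.

```python
import numpy as np

def magnitude_prune(weights: np.ndarray, threshold: float) -> np.ndarray:
    """Zero out every weight whose absolute value is below the threshold."""
    mask = np.abs(weights) >= threshold
    return weights * mask

# Example: a small weight matrix pruned at threshold 0.1.
w = np.array([[0.30, -0.05],
              [0.02, -0.40]])
pruned = magnitude_prune(w, threshold=0.1)
# Only the two large-magnitude weights survive. Note that the matrix stays
# dense (zeros are stored explicitly), which is why such pruning does not
# speed up inference without sparse-aware hardware or kernels.
```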



Deep neural networks have achieved state-of-the-art performance on computer vision tasks such as classification, object detection Redmon et al. (2016), image segmentation Badrinarayanan et al. (2017) and many more. Since the introduction of AlexNet Krizhevsky et al. (2012), neural architectures have progressively gone deeper with an increase in the number of parameters. This includes architectures like ResNet He et al. (2016) and its many variants (XResNet He et al. (2019), ResNeXt Xie et al. (2017); Hu et al. (2018), etc.), DenseNet Huang et al. (2017), Inception networks Chen et al. (2017) and many others.

