MORE SIDE INFORMATION, BETTER PRUNING: SHARED-LABEL CLASSIFICATION AS A CASE STUDY

Abstract

Pruning of neural networks, also known as compression or sparsification, is the task of converting a given network, which may be too expensive to use (in prediction) on low-resource platforms, into a 'lean' network that performs almost as well as the original while using considerably fewer resources. By turning the compression-ratio knob, the practitioner can trade off the information gain against the necessary computational resources, where information gain is a measure of the reduction of uncertainty in the prediction. In certain cases, however, the practitioner may readily possess some information on the prediction from other sources. The main question we study here is whether it is possible to take advantage of this additional side information in order to further reduce the computational resources, in tandem with the pruning process. Motivated by a real-world application, we distill the following elegantly stated problem. We are given a multi-class prediction problem, combined with a (possibly pre-trained) network architecture for solving it on a given instance distribution, together with a method for pruning the network that allows trading off prediction speed with accuracy. We assume the network and the pruning methods are state-of-the-art, and it is not our goal here to improve them. However, instead of being asked to predict a single drawn instance x, we are asked to predict the label of an n-tuple of instances (x_1, ..., x_n), with the additional side information that all tuple instances share the same label. The shared-label distribution is identical to the distribution on which the network was trained. One trivial way to do this is to obtain individual raw predictions for each of the n instances (separately), using our given network pruned for a desired accuracy, and then take the average to obtain a single, more accurate prediction.
This is simple to implement but intuitively sub-optimal, because the n independent instantiations of the network do not share any information and would probably waste resources on overlapping computation. We propose various methods for performing this task and compare them in extensive experiments on public benchmark data sets for image classification. Our comparison is based on measures of relative information (RI) and n-accuracy, which we define. Interestingly, we empirically find that (i) sharing information between the n independently computed hidden representations of x_1, ..., x_n using an LSTM-based gadget performs best among all methods we experiment with, and (ii) for all methods studied, we exhibit a sweet-spot phenomenon, which sheds light on the compression-information trade-off and may assist a practitioner in choosing the desired compression ratio.
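To make the idea of the LSTM-based gadget concrete, the following is a minimal numpy sketch of folding n per-instance hidden representations into a single shared prediction. All names (`recurrent_aggregate`, the random weights, and the simple tanh recurrence standing in for a trained LSTM cell) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def recurrent_aggregate(hidden_states, W_h, W_x, w_out):
    """Fold n per-instance hidden representations into one shared state
    with a simple tanh recurrence; a trained LSTM cell would replace
    this update in the actual gadget."""
    h = np.zeros(W_h.shape[0])
    for x in hidden_states:          # one recurrent step per tuple instance
        h = np.tanh(W_h @ h + W_x @ x)
    return w_out @ h                 # single logit vector for the shared label

d, n, k = 8, 5, 3                    # hidden dim, tuple size, number of classes
states = [rng.normal(size=d) for _ in range(n)]
W_h, W_x = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w_out = rng.normal(size=(k, d))
logits = recurrent_aggregate(states, W_h, W_x, w_out)
print(logits.shape)   # one prediction for the whole n-tuple
```

The point of the sketch is only the information flow: unlike n independent forward passes, each instance's representation updates a shared state before a single prediction is emitted.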

1. INTRODUCTION

Pruning of neural networks, the task of compressing a network by removing parameters, has been an important subject both for practical deployment and for theoretical research. Some pruning algorithms focus on manipulating pre-trained models (Mozer & Smolensky, 1989; LeCun et al., 1990; Reed, 1993; Han et al., 2015), while recent work has identified that there exist sparse subnetworks (also called winning tickets) in randomly-initialized neural networks that, when trained in isolation, can match and often even surpass the test accuracy of the original network (Frankle & Carbin, 2019; Frankle et al., 2020). There is a vast literature on network pruning, and we refer the reader to Blalock et al. (2020) for an excellent survey.

More crucially, most literature on pruning has focused on designing a machine that converts a fixed deep learning solution to a prediction problem into a more efficient version thereof. The pruning machine has a compression knob which trades off the level of pruning against the accuracy of the prediction. The more resources we are willing to expend in prediction (measured here in floating-point operations (FLOPs)), the more information we can obtain, where information is measured as prediction accuracy, or as reduction of uncertainty (defined below). We now ask what happens when we want to prune a network but also possess information on the prediction coming from another source. Intuitively, given some form of additional side information, we should be able to prune our network at a higher compression ratio and still reach the same level of accuracy for the prediction task, compared with a scenario with no additional side information. But how can we take the side information into account when pruning?

1.1 MOTIVATION

This question was motivated by an actual real-life scenario. We describe the scenario in detail, although the problem we thoroughly study in what follows is much simpler. Imagine a database retrieval system with a static space of objects X.
Given a query object q, the goal is to return an object x from X that maximizes a ground-truth retrieval value function f_q(x). We have access to a function f̂_q(x), expressed as a deep network, which approximates f_q and was trained using samples thereof. The function f̂_q is very expensive to compute. (Note that we keep q fixed here, as part of the definition of f_q(•), although in an actual setting both q and x would be inputs to a bivariate retrieval function f.) Computing f̂_q(x) for all x ∈ X is infeasible. One way to circumvent this is to compute a less accurate but efficient function f̂_q^(2)(•), defined by a pruned version of the network defining f̂_q. We then compute f̂_q^(2)(•) on all x ∈ X to obtain a shortlist of candidates X', and compute f̂_q(x) on x ∈ X' only. This idea can also be bootstrapped, using rougher, more aggressively pruned estimates f̂_q^(3), f̂_q^(4), f̂_q^(5), . . . and increasingly shorter shortlists. However, an important point is ignored in this approach: the space X is structured, and we expect there to be prior connections between its elements. This is the side information. Such connections can be encoded, for example, as a similarity graph over X, where it is expected that f_q(x_1) is close to f_q(x_2) whenever there is an edge between x_1 and x_2. There is much work on deep networks over graphs (Zhou et al., 2018; Kipf & Welling, 2017; Wu et al., 2020). But how can the extra information, encoded as a graph, be used in conjunction with the pruning process? Let us simplify the information retrieval scenario. First, assume that we are in a classification rather than a regression scenario, so that f_q(x) takes a finite set of discrete values and f̂_q(x) returns a vector of logits, one coordinate per class. Second, assume the side information on X is a partitioning of X into cliques, or clusters, X_1, ..., X_k, where on each clique the value of f_q(•) is fixed, written f_q(X_i), i = 1, ..., k.
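The shortlist cascade described above can be sketched as follows. This is a hypothetical illustration: `cascade_retrieve`, the toy scoring functions, and the shortlist size are all assumptions standing in for the expensive network f̂_q and its pruned estimates.

```python
import numpy as np

def cascade_retrieve(X, scorers, shortlist_sizes):
    """Score all candidates with the cheapest (most pruned) scorer first,
    keep a shortlist, then rescore with progressively more expensive
    scorers; the last scorer decides among the final shortlist."""
    candidates = np.arange(len(X))
    for score, m in zip(scorers[:-1], shortlist_sizes):
        vals = np.array([score(X[i]) for i in candidates])
        candidates = candidates[np.argsort(-vals)[:m]]   # keep top-m
    final_vals = np.array([scorers[-1](X[i]) for i in candidates])
    return candidates[int(np.argmax(final_vals))]

X = np.linspace(0.0, 1.0, 1000)
cheap = lambda x: -(x - 0.7) ** 2 + 0.001 * np.sin(200 * x)  # noisy proxy
exact = lambda x: -(x - 0.7) ** 2                            # ground truth
best = cascade_retrieve(X, [cheap, exact], [50])
print(X[best])   # close to the true maximizer 0.7
```

Only the cheap proxy is evaluated on all 1000 candidates; the expensive scorer touches just the 50 shortlisted ones, which is the whole point of the cascade.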
Now the problem becomes that of estimating the f_q(X_i)'s using n random samples x_{i1}, ..., x_{in} ∈ X_i, i = 1, ..., k.¹ Fixing the cluster X_i, one obvious way to estimate f_q(X_i) is to take the average of the logit vectors f̂_q(x_{i1}), ..., f̂_q(x_{in}), where f̂_q is some fixed (possibly pruned) network, and use the argmax coordinate as the prediction. Assuming each pruned network f̂_q outputs a prediction vector with a certain level of uncertainty, the averaged vector should have lower uncertainty, and this can be quantified using simple probabilistic arguments. This will henceforth be called the baseline method. Intuitively, the baseline method, though easy to implement with out-of-the-box pruning libraries, cannot possibly be optimal given the side information of a shared label across X_i. Indeed, the baseline method feeds the examples x_{i1}, ..., x_{in} independently through separate instantiations of f̂_q, and nothing is shared between these computations.
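The uncertainty-reduction argument behind the baseline method can be seen in a minimal simulation. The noise model below (correct-class logit plus i.i.d. Gaussian noise) is a hypothetical stand-in for a pruned network's output, chosen only to show that averaging n logit vectors with a shared label raises accuracy.

```python
import numpy as np

rng = np.random.default_rng(2)

def noisy_logits(true_label, k, noise):
    """Hypothetical pruned-network output: unit signal on the correct
    class plus Gaussian noise on every logit."""
    z = rng.normal(scale=noise, size=k)
    z[true_label] += 1.0
    return z

k, n, noise, trials = 10, 8, 2.0, 2000
single_correct = avg_correct = 0
for _ in range(trials):
    y = int(rng.integers(k))
    logits = np.stack([noisy_logits(y, k, noise) for _ in range(n)])
    single_correct += int(np.argmax(logits[0]) == y)           # one instance
    avg_correct += int(np.argmax(logits.mean(axis=0)) == y)    # baseline: average of n
print(single_correct / trials, avg_correct / trials)
```

Averaging shrinks the effective noise standard deviation by a factor of roughly √n, which is the "simple probabilistic argument" quantifying the baseline's gain.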



¹ Continuing the retrieval story, the practitioner would now find the X_i that maximizes f̂_q(X_i), and then further focus the search in that cluster.



Additional surveys include Sze et al. (2017) and Reed (1993). In this work, we adopt the pruning methods of Tanaka et al. (2020); Lee et al. (2019); Wang et al. (2020); Han et al. (2015), which underlie our experiments.
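As a toy illustration of the compression knob that all of these methods expose, the following sketches one-shot global magnitude pruning (in the spirit of Han et al., 2015): the smallest-magnitude weights are zeroed until a target fraction is removed. The function name and shapes are illustrative assumptions; real pipelines interleave pruning with retraining.

```python
import numpy as np

def magnitude_prune(weights, compression_ratio):
    """Zero out the globally smallest-magnitude entries;
    compression_ratio is the fraction of weights removed."""
    flat = np.abs(np.concatenate([w.ravel() for w in weights]))
    k = int(compression_ratio * flat.size)
    if k == 0:
        return [w.copy() for w in weights]
    threshold = np.partition(flat, k - 1)[k - 1]   # k-th smallest magnitude
    return [np.where(np.abs(w) <= threshold, 0.0, w) for w in weights]

rng = np.random.default_rng(1)
layers = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]
pruned = magnitude_prune(layers, 0.9)
remaining = sum(int((w != 0).sum()) for w in pruned)
total = sum(w.size for w in layers)
print(remaining / total)   # roughly 10% of the weights survive
```

Turning `compression_ratio` up reduces the surviving parameter count (a proxy for FLOPs in sparse inference) at the cost of prediction accuracy, which is exactly the trade-off discussed in the introduction.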

