ENHANCING VISUAL REPRESENTATIONS FOR EFFICIENT OBJECT RECOGNITION DURING ONLINE DISTILLATION

Anonymous

Abstract

We propose ENVISE, an online distillation framework that ENhances VISual representations for Efficient object recognition. We are motivated by the observation that in many real-world scenarios, the probability of occurrence is not the same for all classes, and only a subset of classes occurs frequently. Exploiting this fact, we reduce the computations of our framework by employing a binary student network (BSN) to learn the frequently occurring classes using the pseudo-labels generated by the teacher network (TN) on an unlabeled image stream. To maintain overall accuracy, the BSN must also accurately determine when a rare (or unknown) class is present in the image stream so that the TN can be used in such cases. To achieve this, we propose an attention triplet loss which ensures that the BSN emphasizes the same semantically meaningful regions of the image as the TN. When the prior class probabilities in the image stream vary, we demonstrate that the BSN adapts to the TN faster than a real-valued student network. We also introduce Gain in Efficiency (GiE), a new metric which estimates the relative reduction in FLOPs based on the number of times the BSN and TN are used to process the image stream. We benchmark on the CIFAR-100 and tiny-imagenet datasets by creating meaningful inlier (frequent) and outlier (rare) class pairs that mimic real-world scenarios. We show that ENVISE outperforms state-of-the-art (SOTA) outlier detection methods in terms of GiE, and also achieves greater separation between inlier and outlier classes in the feature space.

1. INTRODUCTION

Deep CNNs that are widely used for image classification (Huang et al., 2017) often require large computing resources and process each image with high computational complexity (Livni et al., 2014). In real-world scenarios, the prior probability of occurrence of individual classes in an image stream is often unknown and varies with the deployed environment. For example, in a zoo, the image stream input to the deep CNN will mostly consist of animals, while vehicles would be rare, and other object classes such as furniture and aircraft would be absent. Therefore, only a subset of the many classes known to a deep CNN may be presented to it for classification during deployment. To adapt to the varying prior class probabilities of the deployed scenario with high efficiency, we propose an online distillation framework, ENVISE. Here, we employ a high-capacity general-purpose image classifier as the teacher network (TN), while the student network (SN) is a low-capacity network. For greater efficiency and faster convergence, we require the coefficients of the SN to be binary and refer to it as the binary student network (BSN). When the BSN is first deployed, it is trained on the unlabeled image stream using the predicted labels of the TN as pseudo-labels. Once the BSN converges to the performance of the TN, it is used as the primary classifier to classify the frequent classes faster than the TN. However, if a rare class (i.e., a class absent during online training) appears in the image stream, the BSN must accurately detect it as a class it has not yet encountered, and the image is then processed by the TN. Since the BSN is trained only on the frequent classes, we refer to these classes as inlier (IL) and to the rare classes as outlier (OL). It is important to note that the OL classes are outliers with respect to the BSN only, but are known to the TN.
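The routing scheme described above, together with the usage-count view of GiE from the abstract, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `route`, `outlier_score`, the threshold, the per-image FLOP costs, and the exact form of GiE (total TN-only FLOPs divided by the FLOPs the mixed BSN/TN pipeline actually spends) are all assumptions made for the sake of the example.

```python
# Hypothetical per-image costs in GFLOPs; illustrative values only.
FLOPS_TN = 4.0   # high-capacity teacher network
FLOPS_BSN = 0.1  # low-capacity binary student network

def route(image, bsn, tn, outlier_score, threshold, counts):
    """Classify with the BSN; defer to the TN when the BSN flags the
    image as an outlier (a class unseen during online training)."""
    counts["bsn"] += 1                 # every image first passes through the BSN
    if outlier_score(image) < threshold:
        return bsn(image)              # inlier: trust the fast binary student
    counts["tn"] += 1                  # outlier: fall back to the teacher
    return tn(image)

def gain_in_efficiency(counts):
    """One plausible form of GiE: FLOPs of running the TN on every image,
    divided by the FLOPs the mixed BSN/TN pipeline actually spends."""
    n = counts["bsn"]                  # total images (all pass through the BSN)
    spent = n * FLOPS_BSN + counts["tn"] * FLOPS_TN
    return n * FLOPS_TN / spent

# Toy stream: the "images" are stand-in precomputed outlier scores,
# 90 inliers followed by 10 outliers.
counts = {"bsn": 0, "tn": 0}
stream = [0.1] * 90 + [0.9] * 10
for s in stream:
    route(s, bsn=lambda x: "inlier", tn=lambda x: "outlier",
          outlier_score=lambda x: x, threshold=0.5, counts=counts)
print(gain_in_efficiency(counts))  # 8.0 with these illustrative costs
```

With these toy numbers the teacher is invoked on only 10 of 100 images, so the pipeline spends 50 GFLOPs instead of the 400 GFLOPs a TN-only deployment would, an eightfold gain; the more skewed the class prior toward inliers, the larger the gain.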
Detecting extremely rare classes which are unknown to both the BSN and the TN (global unknowns) is beyond the scope of this paper. Thus, assuming that the TN knows all possible classes from the deployed environment, we

