ENHANCING VISUAL REPRESENTATIONS FOR EFFICIENT OBJECT RECOGNITION DURING ONLINE DISTILLATION

Anonymous

Abstract

We propose ENVISE, an online distillation framework that ENhances VISual representations for Efficient object recognition. We are motivated by the observation that in many real-world scenarios, the probability of occurrence of all classes is not the same, and only a subset of classes occurs frequently. Exploiting this fact, we reduce the computation of our framework by employing a binary student network (BSN) to learn the frequently occurring classes using the pseudo-labels generated by the teacher network (TN) on an unlabeled image stream. To maintain overall accuracy, the BSN must also accurately determine when a rare (or unknown) class is present in the image stream so that the TN can be used in such cases. To achieve this, we propose an attention triplet loss which ensures that the BSN emphasizes the same semantically meaningful regions of the image as the TN. When the prior class probabilities in the image stream vary, we demonstrate that the BSN adapts to the TN faster than a real-valued student network. We also introduce Gain in Efficiency (GiE), a new metric that estimates the relative reduction in FLOPs based on the number of times the BSN and TN are used to process the image stream. We benchmark the CIFAR-100 and Tiny-ImageNet datasets by creating meaningful inlier (frequent) and outlier (rare) class pairs that mimic real-world scenarios. We show that ENVISE outperforms state-of-the-art (SOTA) outlier detection methods in terms of GiE, and also achieves greater separation between inlier and outlier classes in the feature space.

1. INTRODUCTION

Deep CNNs that are widely used for image classification (Huang et al. (2017)) often require large computing resources and process each image with high computational complexity (Livni et al. (2014)). In real-world scenarios, the prior probability of occurrence of individual classes in an image stream is often unknown and varies with the deployed environment. For example, in a zoo, the image stream input to the deep CNN will consist mostly of animals, vehicles will be rare, and other object classes such as furniture and aircraft will be absent. Therefore, only a subset of the many classes known to a deep CNN may be presented to it for classification during its deployment. To adapt to the varying prior class probability in the deployed scenario with high efficiency, we propose an online distillation framework, ENVISE. Here, we employ a high-capacity general-purpose image classifier as the teacher network (TN), while the student network (SN) is a low-capacity network. For greater efficiency and faster convergence, we require the coefficients of the SN to be binary and refer to it as the binary student network (BSN). When the BSN is first deployed, it is trained on the unlabeled image stream using the predicted labels of the TN as pseudo-labels. Once the BSN converges to the performance of the TN, it is used as the primary classifier to classify the frequent classes faster than the TN. However, if a rare class (i.e., a class absent during online training) appears in the image stream, the BSN must accurately detect it as a class it has not yet encountered, so that it can be processed by the TN. Since the BSN is trained only on the frequent classes, we refer to these classes as inlier (IL) and to the rare classes as outlier (OL). It is important to note that the OL classes are outliers with respect to the BSN only, but are known to the TN.
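The deployment loop just described, where the BSN serves as the primary classifier and defers to the TN only on suspected outliers, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the `bsn`, `tn`, and `is_outlier` callables are hypothetical stand-ins for the trained networks and the OL detector.

```python
def classify_stream(images, bsn, tn, is_outlier):
    """Route each image: the cheap BSN handles frequent (IL) classes;
    the expensive TN is invoked only when the BSN flags an outlier."""
    predictions = []
    for img in images:
        features, label = bsn(img)   # cheap binary-network forward pass
        if is_outlier(features):     # class the BSN has not yet learned
            label = tn(img)          # fall back to the teacher network
        predictions.append(label)
    return predictions
```

Because most images in the deployed scene belong to IL classes, the TN branch is taken rarely, which is the source of the efficiency gain.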
Detecting extremely rare classes which are unknown to both the BSN and the TN (global unknowns) is beyond the scope of this paper. Thus, assuming that the TN knows all possible classes from the deployed environment, we aim to increase the overall efficiency of the system (without sacrificing performance) by exploiting the higher probability of occurrence of frequent classes in a given scenario. Our approach for detecting OL classes is motivated by the observation of Ren et al. (2019) that networks incorrectly learn to emphasize the background rather than the semantically important regions of the image, leading to a poor understanding of the IL classes. Attention maps highlight the regions of the image responsible for the classifier's prediction (Selvaraju et al. (2017)). We empirically observe that the attention map of the BSN may focus on the background even when the attention map of the TN emphasizes the semantically meaningful regions of the image. In doing so, the BSN memorizes the labels of the TN, making it difficult to differentiate between the representations of IL and OL classes. To mitigate these issues, we propose an attention triplet loss that achieves two key objectives: (a) guide the attention map of the correct prediction of the BSN to focus on the semantically meaningful regions, and (b) simultaneously ensure that the attention maps from the correct and incorrect predictions of the BSN are dissimilar. We show that by focusing on the semantically relevant regions of the image, the BSN learns to distinguish between the representations of IL and OL classes, thereby improving its ability to detect OL classes. To assess the overall gain in efficiency of ENVISE, we propose a new evaluation metric, GiE, based on the number of times the BSN and TN are used to process the image stream. Since the deployed scene comprises mostly IL classes with few OL classes, we expect the BSN to be employed most of the time for classifying IL classes.
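The two objectives of the attention triplet loss admit a standard triplet formulation, sketched below with numpy. This is one plausible reading of the description above, not the paper's exact loss: the TN's attention map acts as the anchor, the BSN's correct-prediction attention as the positive, and the BSN's incorrect-prediction attention as the negative; the margin value is an illustrative assumption.

```python
import numpy as np

def attention_triplet_loss(a_tn, a_bsn_correct, a_bsn_wrong, margin=1.0):
    """Illustrative triplet loss on attention maps: pull the BSN's
    correct-prediction attention toward the TN's attention (objective a)
    while pushing it away from the attention of an incorrect BSN
    prediction (objective b)."""
    def unit(a):
        # Flatten and L2-normalize an attention map.
        v = a.ravel().astype(float)
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    anchor, pos, neg = unit(a_tn), unit(a_bsn_correct), unit(a_bsn_wrong)
    d_pos = np.sum((anchor - pos) ** 2)  # TN vs. correct BSN attention
    d_neg = np.sum((pos - neg) ** 2)     # correct vs. incorrect BSN attention
    return max(0.0, d_pos - d_neg + margin)
```

The loss is zero once the BSN's correct-prediction attention matches the TN's and is at least `margin` farther (in squared distance) from the incorrect-prediction attention.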
The TN is used rarely, i.e., only when the BSN detects an OL class. We refer to efficiency as the overall reduction in FLOPs required by the online distillation framework to process the varying prior probability of classes in the image stream. This differs from conventional model compression techniques (Frankle & Carbin (2019); Chen et al. (2020)), which process an image stream comprising classes with equal probability using a single compressed model. To the best of our knowledge, we are the first to propose supervision on attention maps for OL detection, and a new evaluation metric that measures the gain in computational efficiency of an online distillation framework. A summary of our main contributions is:
• Faster convergence of the BSN: We theoretically justify and empirically illustrate that the BSN adapts to the performance of the TN faster than the real-valued SN (RvSN). We also demonstrate the faster convergence of the BSN for different BSN architectures over their corresponding RvSNs.
• Attention triplet loss (L_at), which guides the BSN to focus on the semantically meaningful regions of the image, thereby improving OL detection.
• A new evaluation metric, GiE, to measure the overall gain in computational efficiency of the online distillation framework.
• We benchmark the CIFAR-100 and Tiny-ImageNet datasets with SOTA OL detection methods by creating meaningful IL and OL class pairs. ENVISE outperforms these baseline methods for OL detection, improves the separation of IL and OL classes in the feature space, and yields the highest gain in computational efficiency.
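A back-of-the-envelope version of GiE can be computed from the invocation counts alone. The paper defines the exact metric later; the sketch below is an assumed formulation in which GiE is the ratio of the FLOPs a TN-only system would spend to the FLOPs actually spent, where every image passes through the BSN and outlier images additionally invoke the TN.

```python
def gain_in_efficiency(n_bsn_only, n_tn, flops_bsn, flops_tn):
    """Hypothetical GiE estimate (not the paper's exact definition):
    n_bsn_only - images classified by the BSN alone,
    n_tn       - images routed to the TN after OL detection,
    flops_*    - per-image cost of each network."""
    total_images = n_bsn_only + n_tn
    baseline = total_images * flops_tn                    # TN-only system
    actual = total_images * flops_bsn + n_tn * flops_tn   # BSN-first routing
    return baseline / actual
```

Under this reading, GiE grows as the stream becomes dominated by IL classes: with 90% of images handled by a BSN that is 10x cheaper than the TN, the system does 5x less work than running the TN on everything.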

2. RELATED WORK

Knowledge distillation: Distilling knowledge to train a low-capacity student network from a high-capacity teacher network has been proposed as part of model compression (Hinton et al. (2015)). Wang & Yoon (2020) provide a detailed review of different knowledge distillation methods. Mullapudi et al. (2019) propose an online distillation framework for semantic segmentation in videos, while Abolghasemi et al. (2019) use knowledge distillation to augment a visuomotor policy with visual attention. Lin et al. (2019) propose an ensemble student network that recursively learns from the teacher network in a closed-loop manner, while Kim et al. (2019) use a feature fusion module to distill knowledge. Gao et al. (2019) propose online mutual learning, and Cioppa et al. (2019) propose to periodically update weights to train an ensemble of student networks. These ensemble-based methods require large computational resources and are expensive to train. In contrast, ENVISE trains a single compact model that mimics the performance of the TN with less computation.
Outlier detection: Outlier detection, or out-of-distribution (OOD) detection, refers to detecting a sample from an unknown class (Hendrycks & Gimpel (2016)). Existing SOTA OOD methods use outlier samples during training or validation. Yu & Aizawa (2019) increase the distance between IL and OL samples during training, while Vyas et al. (2018); Lee et al. (2018) add perturbations

