ENHANCING VISUAL REPRESENTATIONS FOR EFFICIENT OBJECT RECOGNITION DURING ONLINE DISTILLATION

Anonymous

Abstract

We propose ENVISE, an online distillation framework that ENhances VISual representations for Efficient object recognition. We are motivated by the observation that in many real-world scenarios, the probability of occurrence of all classes is not the same and only a subset of classes occurs frequently. Exploiting this fact, we aim to reduce the computations of our framework by employing a binary student network (BSN) to learn the frequently occurring classes using the pseudo-labels generated by the teacher network (TN) on an unlabeled image stream. To maintain overall accuracy, the BSN must also accurately determine when a rare (or unknown) class is present in the image stream so that the TN can be used in such cases. To achieve this, we propose an attention triplet loss which ensures that the BSN emphasizes the same semantically meaningful regions of the image as the TN. When the prior class probabilities in the image stream vary, we demonstrate that the BSN adapts to the TN faster than a real-valued student network. We also introduce Gain in Efficiency (GiE), a new metric which estimates the relative reduction in FLOPs based on the number of times the BSN and TN are used to process the image stream. We benchmark the CIFAR-100 and tiny-imagenet datasets by creating meaningful inlier (frequent) and outlier (rare) class pairs that mimic real-world scenarios. We show that ENVISE outperforms state-of-the-art (SOTA) outlier detection methods in terms of GiE, and also achieves greater separation between inlier and outlier classes in the feature space.

1. INTRODUCTION

Deep CNNs that are widely used for image classification (Huang et al. (2017)) often require large computing resources and process each image with high computational complexity (Livni et al. (2014)). In real-world scenarios, the prior probability of occurrence of individual classes in an image stream is often unknown and varies with the deployed environment. For example, in a zoo, the image stream input to the deep CNN will mostly consist of animals, while vehicles would be rare. Other object classes such as furniture and aircraft would be absent. Therefore, only a subset of the many classes known to a deep CNN may be presented to it for classification during its deployment. To adapt to the varying prior class probability in the deployed scenario with high efficiency, we propose an online distillation framework, ENVISE. Here, we employ a high-capacity general-purpose image classifier as the teacher network (TN), while the student network (SN) is a low-capacity network. For greater efficiency and faster convergence, we require the coefficients of the SN to be binary and refer to it as the binary student network (BSN). When the BSN is first deployed, it is trained on the unlabeled image stream using the predicted labels of the TN as pseudo-labels. Once the BSN converges to the performance of the TN, it is used as the primary classifier to classify the frequent classes faster than the TN. However, if a rare class appears in the image stream (i.e., a class absent during online training), the BSN must accurately detect it as a class it has not yet encountered, which is then processed by the TN. Since the BSN is trained only on the frequent classes, we refer to these classes as inlier (IL) and to the rare classes as outlier (OL). It is important to note that the OL classes are outliers with respect to the BSN only, but are known to the TN.
Detecting extremely rare classes which are unknown to both the BSN and TN (global unknowns) is beyond the scope of this paper. Thus, assuming that the TN knows all possible classes from the deployed environment, we aim to increase the overall efficiency of the system (without sacrificing performance) by exploiting the higher probability of occurrence of frequent classes in a given scenario. Our approach for detecting OL classes is motivated by the observation of Ren et al. (2019) that networks incorrectly learn to emphasize the background rather than the semantically important regions of the image, leading to poor understanding of the IL classes. We know that attention maps highlight the regions of the image responsible for the classifier's prediction (Selvaraju et al. (2017)). We empirically observe that the attention map of the BSN may focus on the background even when the attention map of the TN emphasizes the semantically meaningful regions of the image. In doing so, the BSN memorizes the labels of the TN, making it difficult to differentiate between the representations of IL and OL classes. To mitigate these issues, we propose an attention triplet loss that achieves two key objectives: (a) guide the attention map of the correct prediction of the BSN to focus on the semantically meaningful regions, and (b) simultaneously ensure that the attention maps from the correct and incorrect predictions of the BSN are dissimilar. We show that by focusing on the semantically relevant regions of the image, the BSN learns to distinguish between the representations of IL and OL classes, thereby improving its ability to detect OL classes. To assess the overall gain in efficiency of ENVISE, we propose a new evaluation metric, GiE, based on the number of times the BSN and TN are used to process the image stream. Since the deployed scene is comprised mostly of IL classes with few OL classes, we expect the BSN to be employed most of the time for classifying IL classes.
The TN is used rarely, i.e., only when the BSN detects an OL class. We refer to efficiency as the overall reduction in FLOPs in the online distillation framework to process the varying prior probability of classes in the image stream. This differs from conventional model compression techniques (Frankle & Carbin (2019); Chen et al. (2020)), which process an image stream comprising classes with equal probability using a single compressed model. To the best of our knowledge, we are the first to propose supervision on attention maps for OL detection, and a new evaluation metric that measures the gain in computational efficiency of an online distillation framework. A summary of our main contributions is:
• Faster convergence of BSN: We theoretically justify and empirically illustrate that the BSN adapts to the performance of the TN faster than the real-valued SN (RvSN). We also demonstrate the faster convergence of the BSN for different BSN architectures over their corresponding RvSNs.
• Attention triplet loss (L_at), which guides the BSN to focus on the semantically meaningful regions of the image, thereby improving OL detection.
• A new evaluation metric, GiE, to measure the overall gain in computational efficiency of the online distillation framework.
• We benchmark the CIFAR-100 and tiny-imagenet datasets with SOTA OL detection methods by creating meaningful IL and OL class pairs. ENVISE outperforms these baseline methods for OL detection, improves separation of IL and OL classes in the feature space, and yields the highest gain in computational efficiency.

2. RELATED WORK

Online distillation: Distilling knowledge to train a low-capacity student network from a high-capacity teacher network has been proposed as part of model compression (Hinton et al. (2015)). Wang & Yoon (2020) provide a detailed review of different knowledge distillation methods. Mullapudi et al. (2019) propose an online distillation framework for semantic segmentation in videos, while Abolghasemi et al. (2019) use knowledge distillation to augment a visuomotor policy from visual attention. Lin et al. (2019) propose an ensemble student network that recursively learns from the teacher network in a closed-loop manner, while Kim et al. (2019) use a feature fusion module to distill knowledge. Gao et al. (2019) propose online mutual learning, and Cioppa et al. (2019) propose to periodically update weights to train an ensemble of student networks. These ensemble-based methods require large computational resources and are expensive to train. In contrast, ENVISE trains a compact model that mimics the performance of the TN with less computation. Outlier detection: Outlier detection, or out-of-distribution (OOD) detection, refers to detecting a sample from an unknown class (Hendrycks & Gimpel (2016)). Reconstruction-based approaches (2019) generate a discriminative feature space. However, the likelihood of correct classification by these networks is based on background regions, as described in Ren et al. (2019). Hence, these methods fail when applied to real-world IL and OL class pairs since the confidence of prediction is based on background information. We overcome this drawback by proposing supervision on attention maps, where we guide the BSN to focus on the important semantic regions using the attention map of the teacher network as a reference. This enables ENVISE to classify IL images with high confidence while also improving OL detection.

3. ONLINE DISTILLATION FOR EFFICIENT OBJECT RECOGNITION

The framework of ENVISE is shown in Figure 1(a). Initially, the TN (indicated by "1" in the figure) classifies the images from the input stream with high accuracy, albeit at a slower rate. The BSN (red dotted box) comprises an adaptation module ("2") and a binary CNN ("3"). Given an image stream, the adaptation module learns the optimum binary weights to mimic the performance of the TN. Once its accuracy converges to that of the TN, the adaptation of the real-valued weights stops, but the inference with binary weights continues. The OL detector ("4") uses the softmax output of the BSN as confidence to either generate the final prediction (if the confidence is high), or treats the image as an OL class and redirects it to the TN for classification. Binarization algorithm for student network: During adaptive training, the BSN does not have access to the labeled data and mimics the behavior of the TN on the IL classes as shown in Figure 1(b). For an input image x, let f(g(w_{i,j}), x) represent the output of the BSN, where w_{i,j} are the real-valued weights of the i-th kernel in the j-th layer, and g(.) is a binarizing function for these weights. During adaptive training, the error between the outputs of the BSN and TN is minimized by optimizing w*_{i,j} = argmin_{g(w_{i,j})} L, where L is the overall loss function used for training the BSN. While the intermediate weights w_{i,j} are computed during adaptive learning, the binary versions of these weights, w*_{i,j} = g(w_{i,j}), are used for fast computation. After convergence, the real-valued weights are discarded, and only the binary weights are used. The question becomes: what is a good choice for the binarization function g(.)? Following Rastegari et al. (2016), we find a binary vector b_{i,j} (whose elements are either +1 or -1) and a non-negative scalar α_{i,j} such that w*_{i,j} = α_{i,j} · b_{i,j} is the best approximation of w_{i,j} in a minimum squared error sense, where α_{i,j} = (1/N)‖w_{i,j}‖₁ and b_{i,j} = sign(w_{i,j}).
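The binarizing function g(.) above can be sketched in a few lines. This is a minimal per-kernel illustration following Rastegari et al. (2016); the function name and the use of a flat weight vector are our own assumptions, not the paper's implementation:

```python
import numpy as np

def binarize(w):
    """XNOR-Net-style binarization: approximate w by alpha * b, with
    b in {-1, +1}^N and alpha >= 0 minimizing ||w - alpha * b||^2."""
    w = np.asarray(w, dtype=float)
    b = np.where(w >= 0, 1.0, -1.0)   # b = sign(w), mapping 0 to +1
    alpha = np.abs(w).mean()          # alpha = (1/N) * ||w||_1
    return alpha, b
```

For example, `binarize([0.5, -1.0, 2.0, -0.5])` yields alpha = 1.0 and the binary approximation [1, -1, 1, -1].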
However, unlike Rastegari et al. (2016), where a standalone network is binarized, our BSN learns the w*_{i,j} during adaptive training in an online distillation framework. In doing so, the BSN compensates for the effects of binarization on the overall loss function. Our motivation for using the BSN is that it classifies the image stream with high accuracy and utilizes fewer FLOPs than the RvSN. This results in an increase in the efficiency of the overall framework. Furthermore, the BSN converges to the performance of the TN faster than the RvSN, which we theoretically justify in the Lemma below and experimentally validate in Section 4. We provide a sketch of the proof here and the complete proof in Appendix A.1.

Lemma 3.1. Let RvSN and BSN represent the real-valued student network and binary student network respectively, with the same network architecture. Let R(.) denote the rate of convergence in terms of accuracy when the student network is adaptively trained using the pseudo-labels from the teacher network. Then, R(BSN) > R(RvSN) for the same image stream and number of iterations.

Proof (sketch): We assume our image stream x(n), n = 1, 2, 3, ..., N, comprises N samples from the deployed scenario. Since the weights of the BSN are derived from the RvSN, we prove this lemma first for the RvSN and then extend it to the BSN. Let w* be the optimal weight vector which represents network convergence, i.e., misclassification error = 0. The weight update rule using back propagation is given by:

w(n+1) = w(n) + ηx(n) = w(n) + x(n)   [since η = 1]   (1)

Computing the bounds on the weights of the RvSN, we have

n²α²/‖w*‖² ≤ ‖w(n+1)‖² ≤ nβ   (2)

where α = min_{x(n)∈C} w*ᵀx(n) and β = max_{x(n)∈C} ‖x(n)‖². From eq. (2), for this inequality to be satisfied, there exists an n_r which denotes the optimal number of samples at which the RvSN converges, i.e., we obtain w* at n = n_r. This is given as follows:

n_r²α²/‖w*‖² = n_r β  ⟹  n_r = β‖w*‖²/α²   (3)

Thus, from eq. (3), the RvSN achieves convergence after being trained on β‖w*‖²/α² samples. Since the weights of the BSN are derived from the RvSN, we substitute the binary weight ŵ* from Rastegari et al. (2016) to obtain

n_b = n_r/N   (4)

Here, n_b is the optimal number of samples at which the BSN converges. Comparing eq. (3) and eq. (4), we observe that n_b < n_r, i.e., the BSN takes fewer samples than the RvSN to converge to the performance of the TN given the same network architecture.

Knowledge transfer from teacher to student network: Given an unlabeled image stream, the BSN is trained on the predictions of the TN as hard pseudo-labels using the cross-entropy loss L_d = -(1/N) Σ_i y_i log(s_i). Here, y and s are the pseudo-labels generated by the TN and the softmax output of the BSN respectively, and N is the total number of images in the stream. Once the BSN converges to the performance of the TN, we employ the BSN as the primary classifier during inference. When the deployed scenario does not change for a long duration, the BSN classifies the IL classes faster than the TN without relying on the latter. This improves the overall efficiency of the framework, as the TN is now used to classify only the OL classes (number of OL images << number of IL images). However, to maintain overall accuracy, the BSN must also accurately determine when an image from the input stream belongs to an OL class so that the TN can be used in such cases. We know that attention maps highlight the regions of the image responsible for the classifier's prediction (Li et al. (2018)). Some examples of the attention maps from the correct predictions of the BSN and the TN are shown in Figure 2(a). The first three columns in Figure 2(a) show that although the BSN (adaptively trained using L_d) and the TN both correctly classify the images, their attention maps are significantly different.
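The hard pseudo-label distillation loss L_d can be sketched as follows; the function names and the use of raw NumPy logits are illustrative assumptions rather than the paper's implementation:

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def distillation_loss(teacher_logits, student_logits):
    """L_d: cross-entropy between the student's softmax output and the
    teacher's argmax predictions used as hard pseudo-labels."""
    pseudo = np.argmax(teacher_logits, axis=1)            # y: TN pseudo-labels
    s = softmax(np.asarray(student_logits, dtype=float))  # s: BSN softmax
    n = len(pseudo)
    return -np.mean(np.log(s[np.arange(n), pseudo] + 1e-12))
```

The loss is near zero when the student confidently agrees with the teacher's predicted classes, and grows as the student's probability mass moves away from them.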
The attention map from the correct prediction of the BSN focuses on the background, while that of the TN lies on the semantically meaningful regions of the image. Furthermore, the attention maps of the BSN's correct and incorrect predictions are visually similar, which causes the BSN to learn incorrect representations of the IL images while correctly classifying them. To better differentiate between IL and OL classes, the BSN should not memorize the predictions of the TN, but learn proper representations of the IL images. To achieve this, we propose the attention triplet loss given in Equation 5, which causes the BSN's attention map to be similar to that of the TN while forcing the attention maps from the BSN's correct and incorrect predictions to be different. We use Grad-CAM (Selvaraju et al. (2017)) to compute the attention map A from the one-hot vector of the predicted class. To generate the attention map triplets, we use the attention map from the TN's prediction A_t as the anchor. The hard positive attention map A_sp is obtained from a correct prediction of the BSN. When the BSN misclassifies an image (the predictions of the BSN and TN differ), we use the label from the TN to compute A_sp. Motivated by the findings in Wang et al. (2019), we observe that the attention maps generated from the correct and incorrect predictions of the BSN are visually similar. Hence, we use the attention map from an incorrect prediction as the hard negative attention map A_sn to enforce its separability from A_sp. To avoid bad local minima and a collapsed model early on during training, mining suitable hard negatives is important (Schroff et al. (2015)). Hence, we use the attention map from the second most probable class (incorrect prediction) as our hard negative, due to its similarity with the hard positive (third and fourth columns in Figure 2(a)).
Thus, we formulate the attention triplet loss L_at as:

L_at = (1/N)(1/K) Σ_n [ Σ_k ‖A_t,k − A_sp,k‖² − Σ_k ‖A_t,k − A_sn,k‖² + δ ]   (5)

Initially, the hard negative lies within the margin since its squared distance from the hard positive is small. Hence, L_at enforces separation between the hard positive and the hard negative by a distance greater than the margin δ, which we empirically set to 1.5. N is the total number of samples and K is the total number of pixels in A. Using L_at and L_d, we formulate our final objective function as:

L_f = λ₁L_d + λ₂L_at   (6)

where λ₁ and λ₂ are the weights of L_d and L_at, which we empirically set to 1 and 0.2 respectively. The effect of L_at is shown in the last column of Figure 2(a), where the BSN learns proper representations of the IL image: its attention map is very similar to that of the TN (second column). Furthermore, the improvement in feature-space separation between the IL and OL classes in Figure 2(b) shows that the BSN not only forms clusters of the different IL classes (in red), but is also well separated from the OL classes (in purple).
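A per-sample sketch of L_at on flattened attention maps (averaged over the stream in practice). The hinge at zero is our assumption, added because triplet losses are conventionally clamped to be non-negative; the function name is illustrative:

```python
import numpy as np

def attention_triplet_loss(A_t, A_sp, A_sn, delta=1.5):
    """Pull the BSN's positive attention map A_sp toward the TN anchor A_t
    and push the hard-negative map A_sn away by at least the margin delta."""
    A_t, A_sp, A_sn = (np.asarray(a, dtype=float) for a in (A_t, A_sp, A_sn))
    K = A_t.size                            # number of pixels in the map
    d_pos = np.sum((A_t - A_sp) ** 2) / K   # anchor-positive distance
    d_neg = np.sum((A_t - A_sn) ** 2) / K   # anchor-negative distance
    return max(0.0, d_pos - d_neg + delta)
```

When the negative map is already farther from the anchor than the positive by more than delta, the loss vanishes and no gradient is applied.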

4. EXPERIMENTAL DETAILS

Datasets and implementation: We evaluate ENVISE on the CIFAR-100 (Krizhevsky & Hinton (2009)) and tiny-imagenet (TI) (Yao & Miller (2015)) datasets by creating meaningful IL and OL super-class pairs that mimic real-world scenarios. On the test set of CIFAR-100 and the validation set of TI, we create 12 and 10 super-classes respectively. We summarize our experimental settings in Table 1. We use DenseNet-201 (Huang et al. (2017)) as the TN, pre-trained on the training set of all classes of CIFAR-100 or TI individually. The BSN is a VGG-16 (Simonyan & Zisserman (2014)) network whose weights, except for the first and last convolution layers, are binarized (Rastegari et al. (2016)). We also compare this with other choices for the BSN, including AlexNet, ResNet-18 and ResNet-50. In Appendix A.3, we illustrate that ENVISE is insensitive to the specific TN and BSN architecture used and achieves high performance gains even with different network architectures.

Table 1: Different pairs of IL and OL super-classes on the CIFAR-100 (Krizhevsky & Hinton (2009)) and tiny-imagenet (Yao & Miller (2015)) datasets.

Training and evaluation: Throughout our experiments, we fix the TN and do not train it. We train the BSN using the pseudo-labels and attention maps from the TN with the cost function in eq. 6, only on the IL classes of the super-classes from Table 1. This is done with a learning rate of 1e-4 for 10 epochs using the Adam optimizer (Kingma & Ba (2014)) with a batch size of 10. Once the BSN converges to the performance of the TN, we employ it as our primary classifier during inference. To create a more realistic scenario during inference, we apply a random center crop and a random rotation between [−15°, 15°] to the image stream. We assume that the distribution of classes will not change rapidly in the deployed scenario, and that the BSN can be used for inference for long durations (e.g., days, weeks or months).
Thus, the epochs for online distillation are expected to require a small fraction of that time. For each image from the input stream during inference (comprising IL and OL classes), following Hendrycks & Gimpel (2016), we compute the confidence of prediction from the softmax probability of the predicted class. If the confidence is low, we treat the image as an OL class and transfer it to the TN for classification. In Figure 3 we keep the IL/OL super-classes the same for 40 epochs, and then switch the input stream to a different scenario. The 10 epochs shaded in red indicate the adaptive training phase, while the unshaded (white) intervals indicate the 30 epochs for inference; the 30-epoch inference phase is chosen arbitrarily. Directly binarizing the RvSN (purple line) is also worse than the BSN, and substantially differs from the binarization algorithm in Section 3, which compensates for the quantization effect. The yellow line shows the performance of the standalone BSN, which is not trained during the adaptive training. We show in Appendix A.3 that during inference the combined accuracy of ENVISE is identical to that of the stand-alone TN, which illustrates that ENVISE maintains the overall accuracy of the system. When the scenario changes after 40 epochs (e.g., from C_1 to C_2), we observe similar and consistent learning behaviour, and the BSN retrains quickly from the TN to regain efficiency. We also observe a similar convergence pattern for different BSN architectures when adaptively trained from the same TN in Figure 3. Here, the rate of convergence of Binary AlexNet is the fastest and that of Binary ResNet-50 is the slowest.
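The confidence-based routing described above can be sketched as a simple threshold on the BSN's softmax output; the threshold value tau and the function name are illustrative assumptions:

```python
import numpy as np

def route(bsn_logits, tau=0.9):
    """Classify with the BSN when its softmax confidence exceeds tau;
    otherwise flag a likely OL class and defer the image to the TN."""
    z = np.asarray(bsn_logits, dtype=float)
    z = z - z.max()                     # stable softmax
    p = np.exp(z) / np.exp(z).sum()
    if p.max() >= tau:
        return "BSN", int(p.argmax())   # confident: BSN prediction is final
    return "TN", None                   # low confidence: OL, use the TN
```

A peaked logit vector stays on the BSN, while a flat (uncertain) one is handed to the TN.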
This illustrates that a smaller binary network would make an ideal BSN in real-world scenarios, since it would utilize fewer epochs during adaptive training. Baselines: We compare ENVISE with SOTA OL detection methods, including maximum classifier discrepancy (MCD) (Yu & Aizawa (2019)) and confidence aware learning (CAL) (Moon et al. (2020)). We use the official code of these methods and adaptively train them using their proposed loss functions in our experimental settings. For fair comparison in terms of network architecture, we use DenseNet-201 as the TN and our binary VGG-16 as the SN. Following Hendrycks & Gimpel (2016), we use FPR at 95% true positive rate (TPR), detection error (DE), area under the ROC curve (AuROC), and area under the precision-recall curve (AuPR) as our evaluation metrics. From Table 2, we observe that ENVISE outperforms the best performing baseline method by achieving the lowest FPR and detection error with high AuROC and AuPR. SOTA model compression techniques (Frankle & Carbin (2019)) do not focus on processing images from an input stream with varying prior class probabilities. Hence, direct comparison with these methods is not meaningful since their objectives differ from those of ENVISE. We visualize the separation of IL and OL classes in the feature space using UMAP (McInnes et al. (2018)) across the last fully connected layer of the BSN. Figure 4 shows that ENVISE has the best separation between IL and OL classes, which is consistent with the quantitative analysis shown in Table 2. To quantify the feature separation, we compute the p-value using Wilcoxon's rank sum test (Wilcoxon (1992)) for the null hypothesis that the IL and OL feature distributions are the same, i.e., they overlap. Ideally, for high separation, the p-value should be less than 0.05 (rejecting the hypothesis with 95% confidence). We observe from Figure 4 that ENVISE has the smallest p-value compared to the SOTA OL detection methods.
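The FPR at 95% TPR and detection error metrics (following Hendrycks & Gimpel (2016)) can be computed as below, assuming higher confidence scores indicate IL; the function names are our own:

```python
import numpy as np

def fpr_at_tpr(il_scores, ol_scores, tpr=0.95):
    """FPR at a fixed TPR: choose the confidence threshold that keeps `tpr`
    of the IL scores above it, then report the fraction of OL scores that
    also exceed it (false positives)."""
    il = np.asarray(il_scores, dtype=float)
    ol = np.asarray(ol_scores, dtype=float)
    thresh = np.quantile(il, 1.0 - tpr)   # 95% of IL scores lie above this
    return np.mean(ol >= thresh), thresh

def detection_error(fpr, tpr=0.95):
    """Detection error at the same operating point, assuming equal IL/OL
    priors: 0.5 * (1 - TPR) + 0.5 * FPR."""
    return 0.5 * (1.0 - tpr) + 0.5 * fpr
```

With perfectly separated score distributions the FPR at 95% TPR is 0; with identical IL and OL distributions it approaches 0.95.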
This indicates that ENVISE achieves the least overlap between IL and OL classes, thereby outperforming the SOTA OL detection methods in its ability to detect OL classes. A detailed comparison of ENVISE with each baseline method on different pairs of IL and OL classes, and the corresponding feature space visualization, is presented in Appendix A.4. Furthermore, in Appendix A.2, we ablate the effect of the proposed L_at and the margin (δ) with the BSN, and also the performance of OL detection with the RvSN. We present additional discussions in Appendix A.3.

RFL = (N_tr/N_T) × (FL_tr/X) + (N_in/N_T) × (p × FL_in + q × FL_o)/X   (7)

Here, N_T = N_tr + N_in. For high computational efficiency, RFL should be small. If the BSN works with perfect accuracy (i.e., fp = 0 and od = 1.0), then the minimum value of RFL during inference is RFL_min = (Y + q × X)/X. We define GiE as RFL normalized with respect to its minimum value, i.e.,

GiE = RFL_min/RFL = (Y + q × X) / [(N_tr/N_T) × FL_tr + (N_in/N_T) × (p × FL_in + q × FL_o)]

To numerically compute GiE, we use X = 9048 (TN), Y = 6 (BSN), N_tr = 10 and N_in = 30, while p and q are calculated from Table 1. However, as discussed previously, N_in can be very large since the inference phase can occur over a very long duration. Figure 5 illustrates that ENVISE outperforms SOTA OL detection methods in terms of GiE for different scenarios on the CIFAR-100 dataset. Thus, L_at ensures that the BSN can accurately distinguish between representations of IL and OL classes, which enables ENVISE to operate in an efficient manner.
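Eq. (7) can be evaluated directly. In this sketch, FL_tr (per-image training cost), FL_in (per-image IL inference cost) and FL_o (per-image OL cost, i.e., the BSN followed by the TN) are our assumed interpretations of the symbols above:

```python
def gie(N_tr, N_in, FL_tr, FL_in, FL_o, p, q, X, Y):
    """Gain in Efficiency: RFL is the framework's FLOP count relative to
    running the TN (cost X) on every image; the BSN costs Y per image.
    p and q are the IL and OL fractions of the stream (p + q = 1)."""
    N_T = N_tr + N_in
    rfl = (N_tr / N_T) * (FL_tr / X) + (N_in / N_T) * (p * FL_in + q * FL_o) / X
    rfl_min = (Y + q * X) / X   # perfect BSN: fp = 0, od = 1.0
    return rfl_min / rfl
```

With no training phase and a perfect OL detector (IL images cost Y, OL images cost Y + X), RFL equals RFL_min and GiE reaches its ideal value of 1; any training overhead or detection error lowers it.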

5. CONCLUSION

We propose ENVISE, an online distillation framework which uses a BSN to adaptively learn relevant knowledge from a TN to quickly classify the frequent classes in the deployed scenario. In doing so, it automatically allocates processing resources (i.e. either the BSN or the TN) to reduce overall computation. To learn proper representations of the IL classes, we propose an attention triplet loss that enables the BSN to learn the semantically relevant regions of the image that are emphasized by the TN. This enables the BSN to accurately distinguish between representations of IL and OL classes which is key for maintaining overall accuracy. Our experiments show that the BSN i) quickly converges to the performance of the TN, ii) classifies IL classes more accurately than a RvSN and other variants of the BSN, and iii) distinguishes between IL and OL classes with lower FPR and detection error than other SOTA OL detection methods on CIFAR-100 and tiny-imagenet datasets. We introduce a new metric GiE to assess the overall gain in efficiency, and experimentally show that the attention triplet loss enables ENVISE to achieve higher GiE than SOTA OL detection methods. We show that ENVISE is agnostic to the specific TN and BSN and achieves high gains with different BSN and TN architectures.

A APPENDIX

A.1 PROOF OF CONVERGENCE

In Section 3 of the main paper, we describe that one of the motivations for using the BSN instead of the RvSN is the ability of the BSN to adapt to the performance of the TN faster than the RvSN. Lemma A.1. Let RvSN and BSN represent the real-valued student network and binary student network respectively, with the same architecture. Let R denote the rate of convergence in terms of accuracy when the student network is adaptively trained using the teacher network's predictions. Then, R(BSN) > R(RvSN) for the same input image stream and number of iterations. Proof: For ease of understanding, we assume a binary classification problem with C = 2 classes, linearly separable data points, weights initialized to zero (w(0) = 0), and learning rate η = 1. The initialization of the weights and the learning rate do not affect the proof. Let us further assume that we have a stream of input images x(n), n = 1, 2, 3, ..., N, where N is the total number of samples. Since the weights of the BSN are derived from the RvSN (Section 3.1 of the main paper), we prove this lemma for the RvSN first and then extend it to the BSN. Since the pretrained performance of the SN is worse than that of the TN (comparing the starting point of the blue line with the red line in Figure 3(a) of the main paper), we assume that the SN misclassifies the images from the input stream, i.e., w(n)ᵀx(n) < 0, where w(n) is the weight vector of the network. The misclassification generates an error through which the network's weights are updated, and there exists an optimal weight vector w* at which the network converges, i.e., the misclassification error is 0.
The weight update rule using back propagation is given by:

w(n+1) = w(n) + ηx(n) = w(n) + x(n)   [since η = 1]   (9)

Expanding eq. (9), we have w(1) = w(0) + x(0), and w(2) = w(1) + x(1) = w(0) + x(0) + x(1) = x(0) + x(1) [since w(0) = 0] (10). Similarly, w(3) = x(2) + x(1) + x(0), and this continues recursively until w(n+1). Thus,

w(n+1) = x(1) + x(2) + x(3) + ... + x(n)   (11)

where we ignore x(0) for ease of calculation. Ideally, when the network converges, we obtain an optimal weight vector, which is represented as w*. We define the margin α as the shortest distance of w* to a datapoint x ∈ x(n); the concept of margin is similar to that of support vectors. Hence, the margin α is denoted as

α = min_{x(n)∈C} w*ᵀx(n)   (12)

Pre-multiplying both sides of eq. (11) by w*ᵀ, we have

w*ᵀw(n+1) = w*ᵀx(1) + w*ᵀx(2) + ... + w*ᵀx(n) ≥ nα   (13)

From the Cauchy–Schwarz inequality, ‖w*‖²‖w(n+1)‖² ≥ (w*ᵀw(n+1))² ≥ (nα)², hence

‖w(n+1)‖² ≥ n²α²/‖w*‖²   (14)

We highlight eq. (14) as we will use it in the later stages of the proof. We re-write eq. (9) as w(k+1) = x(k) + w(k) for all k ∈ [0, 1, 2, ..., n]. Squaring both sides of this equation and expanding,

‖w(k+1)‖² = ‖x(k)‖² + ‖w(k)‖² + 2x(k)ᵀw(k) ≤ ‖x(k)‖² + ‖w(k)‖²   [since in misclassification w(k)ᵀx(k) ≤ 0]

which gives ‖w(k+1)‖² − ‖w(k)‖² ≤ ‖x(k)‖² (15). Iterating eq. (15) for all k ∈ [0, 1, 2, ..., n] [and using w(0) = 0],

‖w(n+1)‖² ≤ Σ_{k=0}^{n} ‖x(k)‖²   (16)

Since ‖x(k)‖² > 0, we define β = max_{x(n)∈C} ‖x(k)‖², so eq. (16) can be re-written as

‖w(n+1)‖² ≤ nβ   (17)

From eq. (14) and (17), we have

n²α²/‖w*‖² ≤ ‖w(n+1)‖² ≤ nβ   (18)

From eq. (18), for this inequality to be satisfied, there exists an n* which denotes the optimal number of samples at which the network converges, i.e., we obtain w* at n = n*. This is given as follows:

n*²α²/‖w*‖² = n*β  ⟹  n* = β‖w*‖²/α²   (19)

Thus, from eq. (19), the network achieves convergence after being trained on β‖w*‖²/α² samples. In the case of the BSN, the binary weights are ŵ* = Δ*D*, where Δ* = (1/N)‖w*‖₁ and D* = sign(w*). Let n_b be the number of samples required for the BSN to converge. Then eq. (19) for the BSN is given as

n_b = β‖ŵ*‖²/α² = β‖Δ*D*‖²/α² = β‖w*‖²(D*ᵀD*)/(N²α²) = n*·N/N² = n*/N   (20)

[from eq. (19) and Rastegari et al. (2016), using D*ᵀD* = N]. From eq. (19) and (20), we see that n_b < n*, i.e., the BSN takes fewer samples than the RvSN to converge to the TN's performance given the same network architecture. We also experimentally observe the higher rate of convergence of the BSN over the RvSN in Figure 3 of the main paper.
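The RvSN bound n* = β‖w*‖²/α² above mirrors the classical perceptron mistake bound, which can be checked numerically on a toy linearly separable stream (the data and separator below are illustrative, not from the paper):

```python
import numpy as np

def perceptron_mistakes(X, y, epochs=100):
    """Run the classic perceptron (eta = 1, w(0) = 0) and count mistakes
    until the data is separated; each mistake triggers w <- w + y*x."""
    w = np.zeros(X.shape[1])
    mistakes = 0
    for _ in range(epochs):
        clean = True
        for x, label in zip(X, y):
            if label * np.dot(w, x) <= 0:   # misclassification
                w += label * x
                mistakes += 1
                clean = False
        if clean:
            break
    return mistakes, w

# Toy stream separated by w* = (1, 0) with ||w*|| = 1 and margin alpha = 2.
X = np.array([[2.0, 1.0], [3.0, -1.0], [-2.0, 1.0], [-3.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m, w = perceptron_mistakes(X, y)
beta = max(np.sum(X ** 2, axis=1))   # beta = max ||x||^2 = 10
bound = beta * 1.0 / 2.0 ** 2        # beta * ||w*||^2 / alpha^2 = 2.5
assert m <= bound                    # mistake count respects the bound
```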

A.2 ABLATION STUDY

All the ablation studies are performed on the CIFAR-100 dataset. We illustrate the effectiveness of the attention triplet loss (L_at), the value of the margin in L_at, the significance of the hard negative term in L_at, and the performance of the BSN over the RvSN for OL detection. The quantitative analysis is summarized in Table 3. Effect of attention triplet loss (L_at): To test the effectiveness of the attention triplet loss (L_at), we train the BSN without it. Specifically, we use DenseNet-201 as the TN and Binary VGG-16 as the BSN, and train the BSN using only the distillation loss L_d. Comparing column IDs 1 and 2 in Table 3, we observe that training the BSN using L_d + L_at improves the mean FPR by 35.4%, mean detection error by 41.6%, mean AuROC by 12.5%, mean AuPR of the IL class by 7.4% and mean AuPR of the OL class by 30.3%. From Figure 6, we observe that the IL (red) and OL features (purple) are barely separable when evaluated on a pre-trained BSN. When the BSN is adaptively trained using only L_d, the features begin to separate but still have some overlap. However, training the BSN using L_d + L_at significantly enhances the feature separation, thereby improving OL detection. Furthermore, the p-value (noted in Figure 6 below each UMAP visualization) of the BSN trained using L_d + L_at has the smallest value of 0.041. This validates our assertion that L_at learns proper representations of the IL classes, thereby improving the BSN's ability to differentiate between IL and OL classes. Effect of margin δ: We mention in Section 3 that, in order to avoid a collapsed model early on during training, we select the hard negative attention map such that its squared distance is closest to the hard positive attention map. Without using any OL images during training or validation, we empirically observe that the L2 norm of the hard negative attention map with respect to the hard positive attention map ranges between [0.33, 2.8].
Hence, we choose δ = 1.5 to ensure separability with respect to the hard-positive attention map. To show that ENVISE is insensitive to the specific value of the margin, we train the BSN using L_at with δ set to 0.5, 1.0, 1.5, 2.0 and 2.5. From Table 3, comparing column 3 with columns 4, 5, 6, 7 and 8, we observe that ENVISE is insensitive to the specific value of the margin and outperforms the best performing baseline method for each super-class, even with different values of the margin. Furthermore, from Figure 7, we observe that ENVISE performs best when the margin is set to 1.5.

Significance of using the hard-negative term in L_at: As mentioned in Section 3 of the main paper, we observe that the attention maps from the correct and incorrect predictions are visually similar. This causes the BSN to learn improper representations of the IL classes during online distillation. Hence, we choose the attention map from the incorrect prediction (second most probable class) as a hard negative to ensure separability from the attention map of the correct prediction. To illustrate the significance of the hard-negative attention map, we train ENVISE without the second term in equation 5, along with L_d. The attention L2 loss is formulated as:

L_a = (1/N)(1/K) Σ_{n=1}^{N} Σ_{k=1}^{K} ‖A_t^k − A_sp^k‖₂

From Figure 8, we observe that the confidence of classifying an image decreases when ENVISE is trained using L_a, and the attention maps become less focused on the semantically meaningful regions of the image.

Effect of using the RvSN instead of the BSN: To illustrate the effectiveness of using the BSN to improve the overall gain in efficiency of the system, we employ an RvSN instead of the BSN as our SN. We use DenseNet-201 as the TN and real-valued VGG-16 as our SN, and adaptively train it using L_d + L_at. We observe from Table 4 that using the BSN instead of the RvSN as our SN improves the mean FPR by 4.8%, mean detection error by 25%, mean AuROC by 11%, mean AuPR of the IL class by 19.2%, and mean AuPR of the OL class by 15.9%.
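The two losses contrasted above can be sketched as follows. This is a minimal, framework-free illustration: attention maps are flattened to plain lists, K is taken as 1, and the function names are ours, not the paper's; the triplet term follows the standard hinge form max(0, d(a, p) − d(a, n) + δ) with the paper's default margin δ = 1.5.

```python
import math

def l2_dist_sq(a, b):
    """Squared L2 distance between two flattened attention maps."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def attention_l2_loss(teacher_maps, student_maps):
    """L_a: mean L2 distance between teacher and student attention maps,
    i.e. the variant without the hard-negative term."""
    n = len(teacher_maps)
    return sum(math.sqrt(l2_dist_sq(t, s))
               for t, s in zip(teacher_maps, student_maps)) / n

def attention_triplet_loss(anchor, positive, negative, margin=1.5):
    """Triplet variant: pull the student (anchor) map toward the
    hard-positive map and push it from the hard-negative map."""
    return max(0.0,
               l2_dist_sq(anchor, positive)
               - l2_dist_sq(anchor, negative)
               + margin)
```

When the anchor already matches the positive map and differs from the negative one, the hinge is inactive and the loss is zero; when the anchor sits on the negative map, the full margin penalty applies.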
Furthermore, we also observe that the mean GiE for the BSN is 0.53 while that of the RvSN is 0.78, which illustrates that ENVISE has a 47.2% gain in efficiency with the BSN as compared to the RvSN.

A.3 DISCUSSIONS

The TN does not leak OL class information to the BSN during online distillation: We mention in Section 3 that the BSN is trained using only the hard pseudo-labels from the TN on the IL classes. We show that during adaptive training, the TN does not transfer OL class information to the BSN; hence, the ability of the BSN to detect OL classes is due to its ability to accurately differentiate between IL and OL class representations. To illustrate this, we compute the Shannon entropy H(·) of the OL and IL classes of the BSN during the online distillation process (when the BSN is trained using L_d + L_at). Figure 9 illustrates that while H(IL|X) decreases, H(OL|X) remains high, where X is the image from the input stream. This illustrates that during online distillation the BSN learns information about the IL classes, while information about the OL classes is completely absent. Thus, the ability of the BSN to detect OL classes better than the SOTA outlier detection methods is not due to an OL information leak, but due to the BSN's ability to accurately differentiate between the representations of the IL and OL classes.

Gain with a smaller teacher network: We investigate the importance of the architecture and size of the TN on the performance of the BSN for efficient OL detection. Here, we use ResNet-18 (11M parameters) in place of the originally used DenseNet-201 (18M parameters) as the TN, a much smaller network than the latter. We train ResNet-18 using the same training procedure as discussed in Section 4. Once the BSN converges to the performance of the TN, we evaluate the BSN for OL detection on the different CIFAR-100 super-classes of Table 1. From Table 5, we observe that regardless of the TN used, ENVISE outperforms the best performing baseline method for OL detection, achieving lower FPR and detection error with higher AuROC and AuPR.
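The entropy diagnostic used above can be sketched directly from the BSN's predictive distribution. This is a minimal illustration with made-up softmax outputs: a confident prediction on an IL image has low entropy, while a diffuse prediction on an OL image sits near the maximum of log2(C) bits.

```python
import math

def shannon_entropy(probs):
    """H(p) = -sum_i p_i * log2(p_i) over a predictive distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical 4-class softmax outputs of the BSN
h_il = shannon_entropy([0.94, 0.02, 0.02, 0.02])   # confident IL prediction
h_ol = shannon_entropy([0.25, 0.25, 0.25, 0.25])   # diffuse OL prediction: 2 bits
```

Tracking these two quantities over training epochs reproduces the qualitative behaviour of Figure 9: H(IL|X) falls while H(OL|X) stays high.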
Furthermore, we also observe that the BSN achieves a similar gain in performance with ResNet-18 as with DenseNet-201. From Figure 3 and Table 5, we show that ENVISE is agnostic to the specific TN and BSN network architectures and achieves similar performance gains with different architectures.

We consider all 2500 images from the IL class and 300 images from the OL class, making the probability of occurrence of the IL class p = 0.9 (2500/(2500+300) = 0.89 ≈ 0.9). In such a setting, we obtain an FPR (f) of 0.26 and a detection error (od) of 0.15. From the formulation of FL_r mentioned in Section 4 of the main paper, we use X = 9048 × 10⁶ and Y = 6 × 10⁶ to obtain GiE = 0.79. Thus, when the images from the IL class occur more frequently than those from the OL class (i.e., p = 0.9), ENVISE achieves a gain in efficiency of 36.2%.

Trade-off between accuracy and GiE in ENVISE: Table 6 shows the accuracy of the BSN, the TN, and their combined performance (i.e., ENVISE) for all six IL super-class pairs on the CIFAR-100 dataset. Comparing rows 2 and 3 in Table 6, we observe a very minimal loss in accuracy in ENVISE with respect to the TN on the different super-classes of the CIFAR-100 dataset. In ENVISE, during inference, the BSN classifies the IL classes from the image stream. When the BSN misclassifies an IL image as an OL class, the image is then given to the TN for classification. This ensures that the overall accuracy of the system is bounded by the performance of the TN. Furthermore, from Figure 5, we observe that ENVISE has the highest GiE, which indicates that, along with high computational efficiency, ENVISE also has minimal loss in overall accuracy as compared to a standalone TN.

Complete comparison with baseline methods:

We present a detailed comparison of ENVISE with the SOTA OL detection methods on the CIFAR-100 and TI datasets in Table 9. We observe that ENVISE achieves lower FPR and detection error as compared to the baseline methods. Specifically, ENVISE outperforms the best performing baseline method (ODIN) by 23% and 25% in terms of mean FPR and mean detection error respectively. Furthermore, ENVISE also achieves higher AuROC and AuPR (outlier class), outperforming ODIN by 9.5% and 15.9% in terms of mean AuROC and mean AuPR (outlier class) respectively on the CIFAR-100 dataset. Table 9 also shows that ENVISE outperforms the baseline methods on the TI dataset, achieving lower FPR and detection error and higher AuROC and AuPR (IL and OL class). Specifically, ENVISE achieves 9.3% and 30.4% lower mean FPR and mean detection error respectively as compared to the best performing baseline method (MCD). ENVISE also outperforms MCD by 5.5%, 14.1% and 3.1% in terms of mean AuROC, mean AuPR (inlier class) and mean AuPR (outlier class) respectively.

Performance of ENVISE is insensitive to specific IL/OL class pairs: As mentioned in Table 1, we evaluate ENVISE on meaningful IL/OL class pairs that mimic real-life scenarios. However, we show that ENVISE is insensitive to the specific IL/OL class pairs created in Table 1. From Table 10, Table 11 and Table 12, we show that ENVISE outperforms all baseline methods when each IL class is paired with every other OL class, achieving lower FPR and detection error and higher AuROC and AuPR on the CIFAR-100 dataset.

Feature space separation: Figure 10 illustrates the comparison of ENVISE with all baselines in terms of the separation between IL and OL images in the feature space for all IL/OL class pairs on the CIFAR-100 dataset. To quantify the feature separation between the IL and OL samples, we compute the p-value using Wilcoxon's rank-sum test (Wilcoxon (1992)) for the null hypothesis that the IL and OL feature distributions are the same.
Hence, a lower p-value indicates better separation. We observe that ENVISE achieves better separation than the baseline methods, with the lowest p-value. Furthermore, we also observe clusters of the IL images (in red) in the feature space, which illustrates that ENVISE is capable of learning representations of IL images rather than memorizing labels from the TN.

Sub-classes within each super-class: Table 7 and Table 8 list the sub-classes within each inlier and outlier super-class pair (x = 1, ..., 6) on the CIFAR-100 dataset.
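The rank-sum p-value used above to quantify feature separation can be sketched with a self-contained normal approximation (the usual large-sample form of Wilcoxon's test; a library routine such as `scipy.stats.ranksums` would normally be used instead). The sketch assumes no tied values, and the 1-D "feature" samples are illustrative.

```python
import math

def rank_sum_p_value(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, no ties):
    a small p-value rejects the hypothesis that x and y share a distribution."""
    n1, n2 = len(x), len(y)
    combined = sorted(x + y)
    ranks = {v: i + 1 for i, v in enumerate(combined)}   # assumes no ties
    r1 = sum(ranks[v] for v in x)                        # rank sum of sample x
    mu = n1 * (n1 + n2 + 1) / 2                          # mean of r1 under H0
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)      # std of r1 under H0
    z = (r1 - mu) / sigma
    # two-sided p-value from the standard normal CDF
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p_separated = rank_sum_p_value([0.1, 0.2, 0.3, 0.4, 0.5],
                               [1.1, 1.2, 1.3, 1.4, 1.5])  # small p-value
p_overlapping = rank_sum_p_value([1, 3, 5, 7, 9],
                                 [2, 4, 6, 8, 10])          # large p-value
```

Applied to projected IL and OL features, well-separated clusters give a small p-value, matching the reading of the UMAP plots.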



Figure 1: (a) The ENVISE architecture adaptively trains the BSN from predictions of the TN. (b) The process updates real-valued weights that minimize the error produced by their binarized version.

Figure 2: (a) The attention map visualization illustrating that after the BSN is trained using the proposed attention triplet loss, it learns to focus on the semantically meaningful regions in the image. The confidence of prediction (c) also increases. (b) Separation between IL and OL classes in feature space improves with the proposed attention triplet loss. More visualizations in Appendix A.4

Figure 3: (a):(d) The BSN converges to the performance of the TN faster than the RvSN and the variants of the BSN. The red period illustrates adaptive training and the white period illustrates inference. We observe a similar convergence pattern with different binary network architectures.

Gradually changing the deployed scenario: When the prior probabilities of the classes change in the deployed scenario, the BSN quickly re-trains to learn the new IL classes and regains efficiency on the new image stream. Figure 3 illustrates the learning behavior of the BSN when the IL and OL classes are changed. Initially, we assume that the image stream comprises only IL classes from C_1. Once the BSN (in green) converges to the performance of the TN (in red), we stop training the BSN and observe that it achieves an accuracy comparable to the TN's. The learning behaviour of the RvSN (in blue) is slower than that of the BSN, and it has poorer accuracy after 10 epochs.
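The adaptive-training and inference phases described above can be sketched as two small steps; the callables and their names here are hypothetical stand-ins for the TN, the BSN, and the BSN's update rule, not the paper's implementation.

```python
def adaptive_training_step(image, teacher, student_update):
    """Adaptive training: the TN's hard prediction is the pseudo-label
    used to update the BSN on the unlabeled stream."""
    pseudo_label = teacher(image)
    student_update(image, pseudo_label)
    return pseudo_label

def inference_step(image, teacher, student):
    """Inference: the BSN handles frequent (IL) classes; images it flags
    as outliers are deferred to the TN."""
    label, is_outlier = student(image)
    return teacher(image) if is_outlier else label
```

Once the BSN matches the TN's accuracy, the loop switches from the first step to the second, and the TN runs only on flagged outliers; a change in class priors simply switches back to the first step.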

Figure 4: Feature space separation of the IL and OL classes for different baseline methods. The p-value denotes the overlap between the IL and OL features, where a lower value indicates better separation.

Comparison with SOTA outlier detection methods: The ability to accurately distinguish between IL and OL classes is key to improving the efficiency of ENVISE. We benchmark ENVISE on the CIFAR-100 and TI datasets against SOTA OL detection methods: error detection (ED) (Hendrycks & Gimpel (2016)), confidence estimation (CE) (DeVries & Taylor (2018)), confidence scaling (CS) (DeVries & Taylor (2018)), ODIN (Liang et al. (2018)), outlier exposure with confidence control (OECC) (Papadopoulos et al. (2019)), maximum classifier discrepancy (MCD) (Yu & Aizawa (2019)) and confidence-aware learning (CAL) (Moon et al. (2020)). We use the official code of these methods and adaptively train them using their proposed loss functions in our experimental setting. For fair comparison in terms of network architecture, we use DenseNet-201 as the TN and our binary VGG-16 as the SN. Following Hendrycks & Gimpel (2016), we use FPR at 95% true positive rate (TPR), detection error (DE), area under the ROC curve (AuROC), and area under the precision-recall curve (AuPR) as our evaluation metrics. From Table 2, we observe that ENVISE outperforms the best performing baseline method, achieving the lowest FPR and detection error with high AuROC and AuPR. SOTA model compression techniques (Frankle & Carbin (2019)) do not focus on processing images from an input stream with varying prior class probabilities; hence, direct comparison with these methods is not meaningful since their objectives differ from those of ENVISE.
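Two of the evaluation metrics above can be sketched from raw confidence scores. This is a minimal illustration, assuming higher scores mean "more inlier"; the detection-error formula at a fixed TPR follows Hendrycks & Gimpel (2016), and the sample scores are made up.

```python
def fpr_at_tpr(il_scores, ol_scores, tpr=0.95):
    """FPR at a fixed TPR: set the threshold so that `tpr` of inlier
    images are accepted, then measure how many outliers slip through."""
    thr = sorted(il_scores, reverse=True)[int(tpr * len(il_scores)) - 1]
    return sum(s >= thr for s in ol_scores) / len(ol_scores)

def detection_error(fpr, tpr=0.95):
    """Detection error at the same operating point:
    0.5 * (1 - TPR) + 0.5 * FPR."""
    return 0.5 * (1 - tpr) + 0.5 * fpr

il = [1.0 - 0.01 * i for i in range(20)]   # hypothetical inlier confidences
ol = [0.5, 0.6, 0.83, 0.4]                 # hypothetical outlier confidences
f = fpr_at_tpr(il, ol)
de = detection_error(f)
```

AuROC and AuPR would in practice be computed with a library routine (e.g. scikit-learn's `roc_auc_score` and `average_precision_score`) over the same score lists.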

Figure 5: (a) Comparison of ENVISE with SOTA OL detection methods, and (b) comparison of different BSN architectures in terms of GiE on different super-classes of the CIFAR-100 dataset.

Gain in Efficiency (GiE): The main focus of our work is to develop an efficient system that classifies the image stream with low computational cost and high accuracy. We propose a new evaluation metric, GiE, to measure the overall gain in efficiency from the number of times the BSN and TN are used individually to classify the image stream. During inference, we require the BSN to accurately classify the IL classes so that it does not rely on the TN, thereby reducing the overall computational cost. Furthermore, the BSN should detect an image as OL so that the TN can be used in such cases. To process a single image of the CIFAR-100 dataset, the TN uses 9048 MFLOPs and 28.5 msec, while the RvSN uses 310 MFLOPs and 3.04 msec. The BSN, however, uses 6 MFLOPs and 0.57 msec; a ∼50× reduction in computation and a ∼6× speed improvement over the RvSN. Furthermore, the BSN occupies ∼30× and ∼100× less memory than the RvSN and TN respectively. During adaptive training, since each image is processed by both the BSN and TN, the total FLOPs used is FL_tr = (X + Y), where X and Y are the FLOPs used by the TN and BSN respectively. Once the BSN converges to the performance of the TN, only the BSN is used to process the input image. Hence, for IL images the total FLOPs is FL_in = Y + (f_p × X), where f_p is the fraction of IL images that the BSN misclassifies as OL and forwards to the TN. Similarly, for OL images the FLOPs is FL_o = Y + (od × X), where od is the fraction of OL images the BSN correctly detects and forwards to the TN. Since the adaptive training and inference phases for a given super-class occur for N_tr and N_in epochs respectively, the relative FLOPs of the two phases together determine the GiE.
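The per-image FLOPs formulas above can be sketched directly. This covers only the per-phase terms given in the text (the final weighting over N_tr and N_in epochs is not reproduced here); X and Y are the CIFAR-100 figures for the DenseNet-201 TN and the binary VGG-16 BSN, and the f_p and od values are the ones reported for the p = 0.9 setting.

```python
def flops_per_image(phase, X, Y, f_p=0.0, od=0.0):
    """Expected FLOPs per image, following the formulas in the text:
    training runs both networks (X + Y); at inference an IL image costs
    Y + f_p * X and an OL image costs Y + od * X, where the X term
    accounts for images forwarded to the TN."""
    if phase == "train":
        return X + Y
    if phase == "inlier":
        return Y + f_p * X
    return Y + od * X

X, Y = 9048e6, 6e6        # TN and BSN FLOPs per image (CIFAR-100)
fl_tr = flops_per_image("train", X, Y)
fl_in = flops_per_image("inlier", X, Y, f_p=0.26)
fl_ol = flops_per_image("outlier", X, Y, od=0.15)
```

Even with a 26% forwarding rate, the inlier path costs roughly a quarter of running the TN alone, which is the source of the gain in efficiency.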

Figure 6: (a):(c) Illustration of the effect of applying L d and its combination with L at on the separation between the IL (red) and OL (purple) classes in the feature space.

Figure 7: Effect of different values of margin on the separation between the IL (red) and OL (purple) classes in the feature space.

Figure 8: The attention map visualization representing the regions of the image that ENVISE focuses on when trained using L a and L at . The confidence of classification is reported below each image.

Figure 9: Shannon entropy of the IL and OL classes of the BSN during online distillation. While the entropy of the IL classes decreases, the entropy of the OL classes remains high, showing that the BSN does not learn OL class information.

Table 7 and Table 8 list the sub-classes within each super-class of the CIFAR-100 and tiny-imagenet (TI) datasets. These sub-classes are the original classes of the respective datasets. Each super-class is denoted by C_xi and C_xo, where x indexes the super-class pairs in Table 1.

Comparison of ENVISE with the baseline methods on IL class pairs C 5 and C 6 with every OL class on CIFAR-100. The ↓ indicates smaller value is better and ↑ indicates greater value is better.

Performance comparison of ENVISE with the best performing baseline method on different IL and OL class pairs on the CIFAR-100 and TI datasets. ↓ and ↑ indicate that a smaller or greater value is better, respectively. The detailed comparison with each baseline is in Appendix A.4.



Ablation study to illustrate the effect of employing the BSN over RvSN on the performance of ENVISE for outlier detection on CIFAR-100 dataset.

Performance of ENVISE in terms of OL detection with a smaller TN (ResNet-18) as compared to the larger TN (DenseNet-201), using the same BSN (binary VGG-16) on different super-classes of the CIFAR-100 dataset. Notation: R: ResNet-18 as TN; D: DenseNet-201 as TN.

Comparison of accuracy between the BSN, the TN and ENVISE on the IL classes of the CIFAR-100 dataset. The IL images that the BSN misclassifies as OL are correctly reclassified by the TN.

Performance comparison of ENVISE with the baseline methods on different IL and OL classes on the CIFAR-100 and TI datasets as presented in Table 2. The ↓ indicates smaller value is better and ↑ indicates greater value is better.

Comparison of ENVISE with the baseline methods on IL class pairs C 1 and C 2 with every OL class on CIFAR-100. The ↓ indicates smaller value is better and ↑ indicates greater value is better.

Comparison of ENVISE with the baseline methods on IL class pairs C 3 and C 4 with every OL class on CIFAR-100. The ↓ indicates smaller value is better and ↑ indicates greater value is better.


We use these notations when comparing ENVISE with the SOTA OL detection methods in Table 9, Table 10, Table 11 and Table 12 respectively.

