ONE SIZE DOESN'T FIT ALL: ADAPTIVE LABEL SMOOTHING

Abstract

This paper concerns the use of objectness measures to improve the calibration performance of Convolutional Neural Networks (CNNs). CNNs have proven to be very good classifiers and generally localize objects well; however, the loss functions typically used to train classification CNNs do not penalize inability to localize an object, nor do they take into account an object's relative size in the given image. During training on ImageNet-1K almost all approaches use random crops on the images and this transformation sometimes provides the CNN with background only samples. This causes the classifiers to depend on context. Context dependence is harmful for safety-critical applications. We present a novel approach to classification that combines the ideas of objectness and label smoothing during training. Unlike previous methods, we compute a smoothing factor that is adaptive based on relative object size within an image. This causes our approach to produce confidences that are grounded in the size of the object being classified instead of relying on context to make the correct predictions. We present extensive results using ImageNet to demonstrate that CNNs trained using adaptive label smoothing are much less likely to be overconfident in their predictions. We show qualitative results using class activation maps and quantitative results using classification and transfer learning tasks. Our approach is able to produce an order of magnitude reduction in confidence when predicting on context only images when compared to baselines. Using transfer learning, we gain 0.021AP on MS COCO compared to the hard label approach.

1. INTRODUCTION

Convolutional neural networks (CNNs) have been used for addressing many computer vision problems for over two decades (LeCun, 1998); in particular, they have shown promising results on object detection and localization tasks since 2013 (Krizhevsky et al., 2012; Russakovsky et al., 2015; Girshick et al., 2018). Unfortunately, modern CNNs are overconfident in their predictions (Lakshminarayanan et al., 2017; Hein et al., 2019) and they suffer from reliability issues due to miscalibration (Guo et al., 2017a). Problems related to overconfidence, generalization, bias and reliability represent a severe limitation of current CNNs for real-world applications. We address the problems of overconfidence and contextual bias in this work. In the case of classification CNNs, ground-truth labels are typically provided as a one-hot (hard label) representation of class probabilities. These labels consist of 0s and 1s, with a single 1 indicating the pertinent class in a given label vector. Szegedy et al. (2016) introduced label smoothing, which provides soft labels that are a weighted average of the hard targets and a uniform distribution over classes during training, to improve learning speed and generalization performance. Label smoothing minimizes weight magnification (Mukhoti et al., 2020; Müller et al., 2019) and shows improvement in learning speed and generalization; in contrast, hard targets tend to increase the values of the logits and produce overconfident predictions (Szegedy et al., 2016; Müller et al., 2019). Both label smoothing and traditional hard labels force CNNs to produce high-confidence predictions even when pertinent objects are absent during training. To obtain more reliable confidence measures, we use the objectness measure to derive a smoothing factor adaptively for every sample undergoing a unique scale and crop transformation. Safely deploying deep learning based models has also become a more immediate challenge (Amodei et al., 2016).
As a community, we need to obtain high accuracies, but also provide reliable uncertainty measures of CNNs. We can improve the precision of CNNs by providing reliable confidence measures, avoiding acting with certainty when uncertain predictions are produced, as in the case of safety-critical systems. Object detection (Girshick et al., 2018) requires bounding box information during training. Recently, Dvornik et al. (2018) proposed using novel synthetic images to improve object detection performance by augmenting training data using object location information; however, to our knowledge, classification CNNs have not exploited object size information for regularization on large datasets like ImageNet (Russakovsky et al., 2015). Objectness, which quantifies the likelihood that an image window contains an object belonging to any class, was first introduced by Alexe et al. (2012), and the role of objectness has been studied extensively since then. Object detectors specialize in a few classes, but objectness is class agnostic by definition. We limit the definition of objectness to the 1000 ImageNet-1K classes, meaning any object outside these defined classes will have an objectness score of 0.
When training a classifier, the cross-entropy loss is employed, but it does not penalize incorrect spatial attention, often making CNNs overfit to context or texture rather than the pertinent object (Geirhos et al., 2019), as shown in the left half of figure 1. The bottom row displays samples with negligible amounts of 'Dog' pixels, where traditional methods would label them as 'Dog', causing CNNs to output incorrect predictions with high confidence when presented with images of backgrounds or just context. Adaptive label smoothing (our approach) involves using gross object size to smooth the hard labels of a classifier, as displayed to the right in figure 1. Our approach adapts label smoothing by deriving the smoothing factor using the objectness measure. When compared to approaches based on hard labels, sample mixing, and label smoothing, our approach improves object detection and calibration performance. Traditional approaches (Yun et al., 2019; Takahashi et al., 2018; Krizhevsky et al., 2012; Russakovsky et al., 2015) use random resize and random crop augmentation, and sometimes lose the pertinent object in the training sample, allowing the classifier to make the correct predictions by overfitting to the context surrounding the object. We believe that our approach addresses significant problems that are associated with current training techniques. In particular, random cropping of images is a common augmentation technique during training of classifiers, but occasionally the crop misses the object entirely. In such a case, the equivalent of a one-hot label is typically provided, with the result that the system is steered toward increased dependence on background (context) portions of the image. We argue that one-hot representations are too limiting, and our adaptive approach to label smoothing makes it possible for the classifier to avoid overconfidence in many cases. Specifically, our contributions are listed below:

1. Our regularization technique, called adaptive label smoothing, adjusts labels during training based on an object's relative size for every sample, directly affecting the confidence measure produced by the classifier. This implicit regularization guides the classifiers, avoiding high-confidence predictions when the proportion of object pixels is low.
2. For safety-critical applications, our approach allows the classifier to produce low-confidence predictions when images with context and no pertinent object are presented. Predictions from our approach can easily be thresholded to reject false positives, whereas high-confidence approaches are hard to threshold because their predictions have high confidence even when they are wrong. Our predictions are also more explainable, as the confidence is grounded in object size and not context. We assume that every class is equiprobable when inputs without pertinent objects are supplied during training. While context helps increase computed accuracy for a given dataset, such reliance is not viable for real-world applications.
3. We show that the representation learned with adaptive label smoothing also leads to better transfer learning performance on MS COCO (Lin et al., 2014).

We have trained classifiers and evaluated them on three popular datasets, with results showing that our approach produces an average confidence that is an order of magnitude lower than that of the baselines for context-only images. Confidence values generated by CNNs help us understand the output predictions, but unreliable confidence measures hurt the applicability of CNNs for safety-critical applications.

2. RELATED WORK

Bias exhibited by machine learning models can be attributed to many underlying statistics present in datasets and model architectures (Battaglia et al., 2018; Zhang et al., 2017b) including context, object texture (Geirhos et al., 2019) , size, shape, and color in the case of images. Various approaches to mitigate bias have been proposed (Anne Hendricks et al., 2018; Choi et al., 2019; Geirhos et al., 2019) in recent years. Our approach produces high entropy predictions when context-only images are provided as input during inference, as we aim to learn the size of the relevant object within the image and classify it, instead of relying on contextual bias to produce a prediction. Traditionally, any label preserving transformation on an input image is employed to help regularize a CNN. The authors of AlexNet (Krizhevsky et al., 2012) employed random cropping and horizontal flipping methods, designed to prevent overfitting and improve the generalization of viewpoints when they surpassed the performance of conventional machine learning approaches in 2012. The random noise class of data augmentation methods (DeVries & Taylor, 2017; Zhong et al., 2017) mask random regions of an input image with zeros, which may accidentally erase the pertinent object in a given image forcing the CNN to rely on context to make a prediction, contributing to label noise. The authors of DropBlock (Ghiasi et al., 2018) have used dropout in the feature space to obtain better generalization. Authors of AutoAugment (Cubuk et al., 2019) used reinforcement learning dynamically during training to learn the best combination of existing data augmentation methods. The latest work in the area of data augmentation uses samples from different classes and changes expected outputs to predict a probability distribution based on the number and intensity of pixels represented by each class. 
The authors of Mixup (Zhang et al., 2017a; Tokozume et al., 2018) use alpha blending (a weighted sum of pixels from two different classes), with the same weights applied to the corresponding labels. The authors of CutMix and RICAP (Yun et al., 2019; Takahashi et al., 2018) use soft labels by cropping different regions and classes of images and 'mixing' the labels proportionally to the corresponding regions in the final augmented sample. None of these approaches relies on object size when 'mixing' regions in images and computing the label. Conversely, our approach regularizes classification CNNs by using objectness information and applying a smoothing factor based on an object's proportion in a given image, producing a soft label without mixing the samples. Calibration and uncertainty estimation of predictors has been an ongoing interest to the machine learning community (Murphy, 1973; DeGroot & Fienberg, 1983; Platt, 1999; Lin et al., 2007; Zadrozny & Elkan, 2002), as predictions need to be equally accurate and confident. Bayesian binning into quantiles (BBQ) (Naeini et al., 2015) was proposed for binary classification, and beta calibration (Kull et al., 2017) employed logistic calibration for binary classifiers. In the context of CNNs, Guo et al. (2017b) proposed a temperature scaling approach to improve calibration performance of pre-trained models. Calibration has been explored in multiple directions; popular approaches include transforming outputs of pre-trained models using approximate Bayesian inference (Maddox et al., 2019), or using a special loss function to help regularize the model during training (Pereyra et al., 2017; Kumar et al., 2018). Our approach is loosely related to the latter class of methods and relates to label smoothing, proposed first by Szegedy et al. (2016), with applicability for many tasks explored by Pereyra et al. (2017); Xie et al. (2016) apply dropout-like noise to the labels.
Recently, (Müller et al., 2019) explored the benefits of label smoothing; aside from having a regularizing effect, label smoothing helps reduce the intra-class distance between samples (Müller et al., 2019) . Another approach to calibrate CNNs was proposed by (Mukhoti et al., 2020) . By applying a focal loss function and temperature scaling, the authors were able to obtain state-of-the-art calibration performance. Label smoothing also improves calibration performance of CNNs (Mukhoti et al., 2020) . In contrast to previously discussed methods, our approach involves using hard labels multiplied by the objectness measure and obtaining a uniform distribution over all other classes when input images are devoid of pertinent objects. We do not change our loss function as opposed to (Mukhoti et al., 2020) or add additional layers to our model. Our approach can be described as a variant of label smoothing, employing an adaptive label smoothing approach that is unique to every training sample as it accounts for object size. To our knowledge, we are the first to apply objectness based adaptive label smoothing to train image classification CNNs. The objectness is computed using bounding box information during training. CNNs trained using hard labels produce 'peaky' probability distributions without considering the spatial size of the pertaining object. Our approach produces outputs that are softer and the peaks correspond to the spatial footprint of the object being classified as illustrated in the appendix A.1.

3. METHOD

We provide a mathematical discussion of the cross entropy loss computed using different approaches in this section. Consider $D = \{(x_i, y_i)\}_{i=1}^{N}$ to be a dataset consisting of $N$ independent and identically distributed real-world images belonging to $K$ different classes. Let $X$ represent the set of images, and let $Y$ denote the set of ground-truth class labels. Sample $i$ consists of the image $x_i \in X$ along with its corresponding label $y_i \in Y = \{1, 2, ..., K\}$. Let $f_\theta$ represent the CNN classifier $f$ with model parameters denoted by $\theta$. The predicted class is $\hat{y}_i = \arg\max_{y \in Y} \hat{p}_{i,y}$, where $\hat{p}_{i,y} = f_\theta(y \mid x_i)$ is the computed probability that the image $x_i$ belongs to the class $y$. The confidence or class probability can be computed using $\hat{p}_i = \max_{y \in Y} \hat{p}_{i,y}$, following the notation adopted by Mukhoti et al. (2020). We denote the output probability distribution over $K$ classes after applying the softmax function as:

$$\hat{p}_{i,k} = \frac{\exp(p_k)}{\sum_{j=1}^{K} \exp(p_j)}$$

where $p_k$ represents the logit for class $k$. Let $\pi(k \mid x_i)$ be the one-hot label vector ($K$ elements long) corresponding to input $x_i$ and $k \in \{1, 2, ..., K\}$. The cross-entropy loss $L$ used to train the CNN is computed by:

$$L(x_i) = -\sum_{k=1}^{K} \pi(k \mid x_i) \log(\hat{p}_{i,k})$$

In the case of one-hot labels, $\pi(y \mid x_i) = 1$ for the pertinent class $y$ and $\pi(k \mid x_i) = 0$ for all other classes $k \neq y$. The cross entropy loss can now be reduced to a single term as opposed to a summation:

$$L(x_i) = -\log(\hat{p}_{i,y})$$

There are three problems associated with the loss described above:
1. The CNN is encouraged to produce a very large peak for the pertinent class $y$, and the CNN is not penalized for producing peaks for incorrect classes, $k \neq y$.
2. The supplied label and input may not always be correct when random cropping is used during training.
More precisely, predicting correctly or incorrectly with high confidence based on just context shows that random cropping can lead to overreliance on context (predicting the presence of a dog based on an image of a dog park without any dogs, for example).
3. The CNNs trained with one-hot labels produce extremely high confidence values ($\hat{p}_i$) without paying attention to the presence of an object or its proportion.

Following Szegedy et al. (2016), the hard label $\pi(k \mid x_i)$ can be converted to a soft label $\tilde{\pi}(k \mid x_i)$ using

$$\tilde{\pi}(k \mid x_i) = \pi(k \mid x_i)(1 - \alpha) + (1 - \pi(k \mid x_i))\frac{\alpha}{K - 1}$$

where $\alpha \in [0, 1]$ is a fixed hyperparameter. This is the standard procedure known as label smoothing or uniform label smoothing. The cross-entropy loss $L_{ls}$ for uniform label smoothing can be written as:

$$L_{ls}(x_i) = -(1 - \alpha)\log(\hat{p}_{i,y}) - \frac{\alpha}{K - 1}\sum_{k \neq y} (1 - \pi(k \mid x_i)) \log(\hat{p}_{i,k})$$

The novelty of our approach is to make $\alpha$ adaptive, calculating the value based on the relative size of an object within a given training image. Using the bounding box annotations available for the images in the dataset, we generate object masks. We apply the same augmentation transform (scale, crop) to the masks and compute the objectness score on the fly for every training image. Let the image width and height be denoted by $(W, H)$ and the object width and height be denoted by $(w, h)$. The ratio $\alpha$ is computed as

$$\alpha = 1 - \frac{wh}{WH}$$

The soft label $\tilde{\pi}(k \mid x_i)$ is computed as before:

$$\tilde{\pi}(k \mid x_i) = \pi(k \mid x_i)(1 - \alpha) + (1 - \pi(k \mid x_i))\frac{\alpha}{K - 1}$$

We also explore a weighted combination of adaptive label smoothing and hard labels. To do this, we introduce a parameter $\beta \in [0, 1]$ to determine the degree of adaptive label smoothing being applied. The setting $\beta = 0$ corresponds to the case of classic hard labels. The soft label in this case is computed as

$$\tilde{\pi}(k \mid x_i) = \beta\left(\pi(k \mid x_i)(1 - \alpha) + (1 - \pi(k \mid x_i))\frac{\alpha}{K - 1}\right) + (1 - \beta)\,\pi(k \mid x_i)$$
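As an illustration of the objectness computation described above, the following minimal sketch builds a binary mask from bounding boxes, applies a crop to it, and derives the adaptive smoothing factor from the fraction of object pixels. The function names, the half-open `(x0, y0, x1, y1)` box convention, and the pure-Python list representation are assumptions made for this example, not the authors' implementation; only cropping is modelled, since resizing leaves the object proportion approximately unchanged.

```python
def make_mask(height, width, boxes):
    """Binary mask (list of rows) with 1 inside any bounding box.

    boxes: iterable of (x0, y0, x1, y1) in pixels, half-open (assumed convention).
    """
    mask = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1 in boxes:
        for r in range(max(0, y0), min(height, y1)):
            for c in range(max(0, x0), min(width, x1)):
                mask[r][c] = 1
    return mask


def crop(grid, top, left, crop_h, crop_w):
    """Apply to the mask the same crop that is applied to the image."""
    return [row[left:left + crop_w] for row in grid[top:top + crop_h]]


def objectness(mask):
    """Fraction of pixels in the (cropped) mask that belong to an object."""
    total = sum(len(row) for row in mask)
    return sum(map(sum, mask)) / total if total else 0.0


def adaptive_alpha(mask):
    """Smoothing factor alpha = 1 - objectness, recomputed per training sample."""
    return 1.0 - objectness(mask)


if __name__ == "__main__":
    mask = make_mask(8, 8, [(2, 2, 6, 6)])
    print(objectness(mask))                        # object proportion of the full image
    print(adaptive_alpha(crop(mask, 0, 0, 2, 2)))  # crop that misses the object
```

A crop that misses the object entirely yields an objectness of 0 and therefore a smoothing factor of 1, which is the case the hard-label approach cannot express.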
The cross-entropy loss $L_{als}$ with adaptive label smoothing can be written as:

$$L_{als}(x_i) = -\beta\left((1 - \alpha)\log(\hat{p}_{i,y}) + \frac{\alpha}{K - 1}\sum_{k \neq y} (1 - \pi(k \mid x_i)) \log(\hat{p}_{i,k})\right) - (1 - \beta)\log(\hat{p}_{i,y})$$

We penalize the CNN for producing high-confidence predictions when the objectness score is low using an adaptive $\alpha$. We introduce $\beta$ as an ablation parameter to adjust the amount of context dependence allowed. When $\beta$ is set to 0, we end up with one-hot labels, and when $\beta$ is set to 1, the CNN is trained using adaptive label smoothing only. Setting $\beta$ to a value between 0 and 1 reduces the context dependence relative to hard labels. When $\beta$ is set to 0.75, the CNN is trained with a label of at least 0.25 for the pertinent class regardless of whether an object is present or not. The rest of the label is computed using adaptive label smoothing and weighted by $\beta$. As adaptive label smoothing accounts for object size, the label for the pertinent class increases with the objectness score of the sample. When $K$ is small, $\beta$ can be adjusted to avoid computing incorrect labels for objects with a low objectness score.
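The soft label and loss above can be sketched in a few lines. This is a minimal pure-Python illustration under stated assumptions: the helper names `adaptive_soft_label`, `log_softmax`, and `cross_entropy` are hypothetical, and the loss is computed as the generic cross entropy between the soft label and the softmax output, which expands to the expression for the adaptive loss given in this section.

```python
import math


def adaptive_soft_label(true_class, num_classes, alpha, beta=1.0):
    """Soft label: beta * (adaptively smoothed label) + (1 - beta) * one-hot."""
    off_value = beta * alpha / (num_classes - 1)  # mass given to each other class
    label = [off_value] * num_classes
    label[true_class] = beta * (1.0 - alpha) + (1.0 - beta)
    return label


def log_softmax(logits):
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_z for z in logits]


def cross_entropy(label, logits):
    """Cross entropy between a (soft or hard) label vector and the prediction."""
    return -sum(t * lp for t, lp in zip(label, log_softmax(logits)))


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.0, -1.0, 0.5]
    # Object covers 40% of the crop, so alpha = 0.6.
    label = adaptive_soft_label(true_class=0, num_classes=5, alpha=0.6, beta=1.0)
    print(label, cross_entropy(label, logits))
```

With `beta=0` the function returns the one-hot label, and with `beta=0.75` and `alpha=1` (no object in the crop) the pertinent class still receives a label of 0.25, matching the behaviour described in the text.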

4. EXPERIMENTS

In this section, we provide a description of the datasets used in our experiments, introduce some of the commonly used metrics for calibration of CNNs and describe our implementation details. We then discuss the merits of our approach and answer important questions related to applicability to transfer learning in an object detection setting, and we discuss the effect of using different types of labels during training in an ablative manner. We use ResNet-50 (He et al., 2016) for most of our experiments and ResNet-101 (He et al., 2016) for the rest. For additional information on experimental setup please refer to the appendix A.3.

4.1. DATASETS

We use several training datasets derived from the ImageNet-1K dataset (Russakovsky et al., 2015); see the tables in appendix A.7. ImageNet-1K consists of 1.28M training images and 50K validation images spanning 1K categories. As only 38% of ImageNet training images have bounding-box annotations, we distinguish these experiments from those trained on the full dataset. We use standard data-augmentation strategies for all methods and train all our models for 300 epochs with a batch size of 256, starting with a learning rate of 0.1 that is decayed by a factor of 0.1 at epochs 75, 150, and 225. For additional dataset information, please refer to the appendix A.2. Because our method needs object proportions to compute the objectness score, we use the subset of the standard ImageNet dataset that has bounding boxes (0.474M images). To generate the 'mask' version, we make sure that only one object is present in a given image and 'mask' all other objects, replacing them with pixel means. We derive this version of the dataset from the 0.474M subset and identify the approach with '(mask)' next to the method in tables 2 and 3. As some ImageNet images have multiple annotated objects, we end up with about 54K more images, and this training dataset has 0.528M images as a result. Lastly, we generate another dataset that is devoid of any object altogether. We sample from this dataset about 15% of the time during training of one of our approaches (identified with 'Context'), and the label generated for these samples is a uniform probability distribution across the 1000 classes. The idea is that when no objects are present in a sample, a CNN should produce a high-entropy prediction. For validation, we use the validation set of Russakovsky et al. (2015) (V1) and the newly released ImageNetV2 set (Recht et al., 2019). Specifically, we use the more challenging 'MatchedFrequency' set of images.
The different validation sets are identified in the 'Val.' column of table 2. To measure the transfer-learning ability of the representations learned by our classifiers, we used the challenging MS COCO (Lin et al., 2014) dataset to obtain the results described in table 4. The dataset consists of about 230K training images and we use the 'minival' validation set of 5K images with bounding box annotations.
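The construction of context-only samples described in this section can be illustrated as follows. This is a minimal sketch, assuming a grayscale image stored as a list of rows, half-open `(x0, y0, x1, y1)` boxes, and the per-image mean as the fill value; the names `mask_objects`, `uniform_label`, and `pick_sample` are hypothetical and do not come from the authors' code.

```python
import random


def mask_objects(image, boxes):
    """Return a copy of the image with every boxed region set to the image mean."""
    height, width = len(image), len(image[0])
    mean = sum(map(sum, image)) / (height * width)
    out = [row[:] for row in image]  # leave the input image untouched
    for x0, y0, x1, y1 in boxes:
        for r in range(max(0, y0), min(height, y1)):
            for c in range(max(0, x0), min(width, x1)):
                out[r][c] = mean
    return out


def uniform_label(num_classes):
    """Label for a context-only sample: every class is equally probable."""
    return [1.0 / num_classes] * num_classes


def pick_sample(rng, object_sample, context_sample, p_context=0.15):
    """Draw a context-only sample about 15% of the time (assumed sampling rule)."""
    return context_sample if rng.random() < p_context else object_sample


if __name__ == "__main__":
    image = [[0, 0, 0, 0], [0, 8, 8, 0], [0, 8, 8, 0], [0, 0, 0, 0]]
    context_only = mask_objects(image, [(1, 1, 3, 3)])
    rng = random.Random(0)
    sample = pick_sample(rng, (image, "hard or adaptive label"),
                         (context_only, uniform_label(1000)))
    print(context_only)
```

The 'mask' variant would use the same masking routine but pass all boxes except the one belonging to the object that is kept.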

4.2. CLASSIFICATION, CONTEXT AND CALIBRATION

This section identifies various calibration metrics used by the community and discusses our results obtained on the popular datasets of Russakovsky et al. (2015) and Recht et al. (2019). We use the implementation of Wenger et al. (2020) on all of our classifiers to generate the results in table 2. For extended results, refer to appendix A.7. To evaluate the performance of adaptive label smoothing, we use five common metrics: accuracy (ACC), expected calibration error (ECE) (Naeini et al., 2015), maximum calibration error (MCE) (Naeini et al., 2015), overconfidence (Mund et al., 2015), and underconfidence (Mund et al., 2015). We computed ECE using 100 bins and 15 bins. Wenger et al. (2020) and Kumar et al. (2019) discuss the advantages of using 100 bins in greater detail. A classifier is said to be calibrated if its confidence matches the probability of the prediction being correct:

$$E[1_{\hat{y}_i = y_i} \mid \hat{p}_i] = \hat{p}_i$$

ECE is defined as the expected absolute difference between a classifier's confidence and its accuracy using a finite number of bins (Naeini et al., 2015; Wenger et al., 2020):

$$ECE = E\left[\,\left|\hat{p}_i - E[1_{\hat{y}_i = y_i} \mid \hat{p}_i]\right|\,\right]$$

MCE is defined as the maximum absolute difference between a classifier's confidence and its accuracy over all bins (Naeini et al., 2015; Wenger et al., 2020):

$$MCE = \max_{p \in [0, 1]} \left|p - E[1_{\hat{y}_i = y_i} \mid \hat{p}_i = p]\right|$$

Overconfidence is the average confidence of a classifier's false predictions:

$$o(f) = E[\hat{p}_i \mid \hat{y}_i \neq y_i]$$

Underconfidence is the average uncertainty on its correct predictions (Wenger et al., 2020; Mund et al., 2015):

$$u(f) = E[1 - \hat{p}_i \mid \hat{y}_i = y_i]$$

Overconfidence and underconfidence of a classifier are not reflective of its accuracy (Wenger et al., 2020). Our approach uses labels that are more accurate than those of other baselines when random cropping and scaling of images are applied during training.
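The five metrics defined above can be estimated from per-sample confidences and correctness flags. The sketch below uses equal-width confidence bins; it is an illustrative estimator written for this discussion, not the implementation of Wenger et al. (2020) used for the reported results, and the function name and returned dictionary keys are assumptions.

```python
def calibration_metrics(confidences, correct, n_bins=15):
    """Estimate ACC, ECE, MCE, overconfidence and underconfidence.

    confidences: maximum softmax probability of each prediction.
    correct: booleans, True where the predicted class equals the label.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((conf, hit))

    ece, mce = 0.0, 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(1 for _, h in members if h) / len(members)
        gap = abs(avg_conf - acc)
        ece += len(members) / n * gap  # gap weighted by the bin's share of samples
        mce = max(mce, gap)

    wrong = [c for c, h in zip(confidences, correct) if not h]
    right = [1.0 - c for c, h in zip(confidences, correct) if h]
    return {
        "acc": sum(1 for h in correct if h) / n,
        "ece": ece,
        "mce": mce,
        "overconfidence": sum(wrong) / len(wrong) if wrong else 0.0,
        "underconfidence": sum(right) / len(right) if right else 0.0,
    }


if __name__ == "__main__":
    print(calibration_metrics([0.95, 0.95, 0.65, 0.65],
                              [True, True, True, False], n_bins=10))
```

Because ECE and MCE are computed per bin, their values depend on `n_bins`, which is why results with both 15 and 100 bins are reported.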
To our knowledge, almost all classifiers trained on ImageNet use random crop and scaling based augmentation for regularization. The random crop transformation allows the CNNs to predict by relying on context rather than the pertinent object; in contrast, our approach uses bounding box annotations to produce labels in an adaptive way during training. To quantify context dependence, we used bounding box annotations on the 50K validation images provided by Choe et al. (2020), removed all objects, and replaced their pixels with the mean image pixel values. The hard label trained CNN had an accuracy of 6.3% with an average confidence of 0.29, the label smoothing based CNN predicted with an accuracy of 6.1% and an average confidence of 0.2, and CutMix had an accuracy of 9.2% with an average confidence of 0.2. These baseline methods produced high-confidence predictions on images with no objects present, using just context information. Our approach had an accuracy of 4.7% and an average confidence of 0.02. This is an order of magnitude reduction in confidence compared to recent baselines, as our approach helps CNNs produce confidence based on the relative size of the pertinent object.

Table 1: Confidence and accuracy metrics on the validation set of ImageNet with all the objects removed using bounding box annotation provided by Choe et al. (2020). Our approach has the best performance under total uncertainty. 'ACC', 'A.conf', 'O.conf' and 'U.conf' refer to accuracy, average confidence, mean overconfidence, and mean underconfidence scores. High underconfidence and low overconfidence point to minimal reliance on context when no pertinent objects are in the given image. The last row of figure 6 in appendix provides a qualitative example.

Method                                    ACC     O.conf  U.conf  A.conf
Hard Label                                0.0633  0.2734  0.3362  0.2982
Label Smoothing (Szegedy et al., 2016)    0.0618  0.1851  0.4816  0.2057
CutMix (Yun et al., 2019)                 0.0921  0.1679  0.4696  0.2013
A. L. S. (Ours)                           0.0473  0.0121  0.8409  0.0191
Predictions using our method are more explainable as we ground our labels and confidences in the object size, as opposed to making correct predictions using contextual information only, as shown in table 1. Table 2 identifies our approaches based on adaptive label smoothing using the abbreviation 'A. L. S.' In general, these results have a low overconfidence score. This is highly desirable for safety-critical applications: when our approach is wrong, it is wrong with the least amount of confidence. Likewise, when no pertinent objects are present, our approach is the least confident compared to the baselines. The mean objectness of images in the validation set of ImageNet is 0.49. The mean objectness deviation, computed as the mean of the absolute difference between maximum confidence and objectness over the ImageNet validation samples, is 0.24 for our approach as opposed to 0.42 for the hard label case. Using these metrics, we show that our confidences are more explainable, as they match the objectness statistics more closely than those of the hard label approach. Our approach is underconfident, as we are not trying to produce the maximum possible confidence of 1 when we are correct. Our confidence is grounded in the objectness score instead, so our peaks are proportional to the size of the object. These results demonstrate that adaptive label smoothing based CNNs seldom produce high confidence scores when they make incorrect predictions. In fact, our models are underconfident because they pay attention to the spatial footprint of the pertinent object instead of producing a large peak most of the time. It is important to note that our methods outperform all baselines for the overconfidence metric. ECE and MCE measure the difference between a classifier's accuracy and its confidence; our approach has higher values on these metrics because we intend to produce peaks proportional to objectness values rather than confidences that match the classifier's accuracy. As shown in figure 2, our models lie above the diagonal: they are more accurate than they are confident, compared to baselines.
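The mean objectness deviation used in this section follows directly from its definition as the mean absolute difference between maximum confidence and objectness. The sketch below is an illustrative computation on made-up numbers; the function name is hypothetical and the inputs are not values from our experiments.

```python
def mean_objectness_deviation(max_confidences, objectness_scores):
    """Mean of |maximum confidence - objectness| over a set of samples."""
    if len(max_confidences) != len(objectness_scores):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(max_confidences, objectness_scores))
    return sum(abs(conf, ) if False else abs(conf - obj) for conf, obj in pairs) / len(pairs)


if __name__ == "__main__":
    # Illustrative values only: a peaky model versus one whose confidence
    # tracks the object proportion more closely.
    objectness = [0.5, 0.5, 0.2]
    print(mean_objectness_deviation([0.99, 0.97, 0.95], objectness))
    print(mean_objectness_deviation([0.60, 0.45, 0.30], objectness))
```

A lower value indicates that the confidence tracks the proportion of the image occupied by the object more closely.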

4.3. TRANSFER LEARNING FOR OBJECT DETECTION

We use the MS COCO (Lin et al., 2014) dataset to benchmark our transfer learning performance. We adopt the architecture of Faster RCNN (Ren et al., 2015) adapted to use the ResNet-50 backbone. Specifically, we train all of our detectors using the implementation of https://github.com/jwyang/faster-rcnn.pytorch. We fine-tune all ImageNet pre-trained models with a batch size of 16 and an initial learning rate of 0.01, decayed after every 4 epochs, for a total of 10 epochs. We employ the standard metrics for average precision (AP) and average recall (Lin et al., 2014) at different intersection over union (IoU) levels. As shown in table 4, our approach outperforms hard label and label smoothing based approaches on this downstream task, and performs almost as well as CutMix (Yun et al., 2019) on AP measures. For information on qualitative results, please refer to the appendix. Increasing the value of β helps reduce model overconfidence and produces predictions that are less 'peaky' compared to the label smoothing and hard label settings. Conversely, as β decreases in value, the overconfidence rate goes up, as shown in table 2. In the case of transfer learning, we observe that decreasing β causes the object localization performance to drop. Using objectness information helps our CNNs localize and detect objects better than the hard label baseline. Context dependence can be controlled using β.

5. CONCLUSION

This paper has addressed the problems of contextual bias and calibration using a novel approach called adaptive label smoothing. We show that bounding box information pertaining to objects can be used to compute a smoothing factor adaptively during training to improve the localization and calibration performance of CNNs. We use bounding box information for a portion of the ImageNet dataset (Russakovsky et al., 2015) to train different classifiers. We show that our approach can be used to train CNNs that are less overconfident and, after fine-tuning, have better localization performance on the challenging MS COCO dataset (Lin et al., 2014) than approaches that use hard labels or traditional label smoothing. Our labels implicitly capture the object proportion within an image during training, a significantly more challenging task than training with hard labels. Our methods provide the lowest accuracy and an order of magnitude reduction in average confidence when presented with context-only images. With adaptive label smoothing, when no pertinent objects are present, every class is equally probable for a given image. We introduce adaptive label smoothing with the notion that safety-critical applications need CNNs that are trained not to be overconfident in their predictions. Our intention is for decision-making systems (steering inputs to an autonomous vehicle, for example) to avoid acting decisively when the models are not confident in their predictions. Our approach provides a more reliable measure of confidence compared to all baselines. We are extending this work to out-of-distribution detection as well.

Even after clipping the sample counts, the OpenImages dataset is very skewed compared to ImageNet as shown in 5, and we believe this imbalance makes OpenImages unsuitable for training good classifiers.

Image with bounding box annotation and its corresponding object mask. The 'mask' version of our approach uses images with a single object. The 'context' version of our approach uses images with all the objects masked out about 15% of the time during training. The label vector for such images (context only) is a vector of uniform distribution.



Figure 1: Random crops of images are often used when training classification CNNs to help mitigate size, position and scale bias (as shown in the left half of the figure, with the objectness values listed below the crops). Unfortunately, some of these crops miss the object, as the process does not use any object location information. Traditional hard label and smooth label approaches do not account for the proportion of the object being classified and use a fixed label of 1, or 0.9 in the case of label smoothing. Our approach (right half) smooths the hard labels by using the objectness measure to compute an adaptive smoothing factor. The objectness is computed using bounding box information as shown above. Our approach helps generate accurate labels during training and penalizes low-entropy (high-confidence) predictions for context-only images.
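The adaptive labels described in the caption above can be sketched as follows. This is a minimal illustration, not the released implementation: it assumes the label is a linear blend of the one-hot target and a uniform distribution, weighted by an objectness value in [0, 1], which matches the stated behavior (a hard label when the object fills the crop, a uniform label for context-only crops). The names `adaptive_soft_label` and `cross_entropy` are hypothetical.

```python
import math


def adaptive_soft_label(class_index, num_classes, objectness):
    """Blend a one-hot target with a uniform distribution.

    Assumption: objectness lies in [0, 1]. A value of 1 recovers the hard
    label; a value of 0 gives a uniform label vector (context-only crop).
    """
    if not 0.0 <= objectness <= 1.0:
        raise ValueError("objectness must be in [0, 1]")
    # Mass that is spread evenly over all classes.
    uniform = (1.0 - objectness) / num_classes
    label = [uniform] * num_classes
    # Remaining mass goes to the annotated class.
    label[class_index] += objectness
    return label


def cross_entropy(soft_label, probabilities, eps=1e-12):
    """Cross-entropy between a soft label and predicted class probabilities."""
    return -sum(t * math.log(p + eps) for t, p in zip(soft_label, probabilities))
```

Under this assumed form, a crop that misses the object entirely is trained toward a uniform prediction, so a confident prediction on such a crop incurs a higher loss than a uniform one.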

Figure 4: The first row of images in the left half of the figure is an example from the portion of the ImageNet dataset (N=0.474M) that has bounding box annotations. We match the images from the training set of the ImageNet-1K dataset with the corresponding '.xml' files included in the ImageNet object detection dataset. We then create object masks for each of the images. When applying any scaling and cropping operation to training samples, we apply the same transformation to the corresponding object masks as well. By counting the number of white pixels, we can determine the object proportion post transformation. The figure also shows the two other versions of our approach: the 'mask' version keeps a single object per sample (for images with multiple bounding box annotations) and has 0.528M samples. Our approach helps generate accurate labels during training and penalizes low-entropy (high-confidence) predictions for context-only images like the example on the right half of the figure.
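The mask-based measurement described in this caption can be sketched as follows. This is an illustration under stated assumptions: boxes are given as integer `(xmin, ymin, xmax, ymax)` with exclusive maxima, the object proportion is taken to be the white pixel count divided by the number of pixels in the crop, and only cropping is shown (scaling is omitted). The function names are hypothetical.

```python
def boxes_to_mask(height, width, boxes):
    """Rasterize bounding boxes into a 0/1 mask (1 = object, i.e. white)."""
    mask = [[0] * width for _ in range(height)]
    for xmin, ymin, xmax, ymax in boxes:
        # Clamp each box to the image bounds before filling it in.
        for y in range(max(0, ymin), min(height, ymax)):
            for x in range(max(0, xmin), min(width, xmax)):
                mask[y][x] = 1
    return mask


def crop(mask, top, left, crop_h, crop_w):
    """Apply to the mask the same crop that is applied to the image."""
    return [row[left:left + crop_w] for row in mask[top:top + crop_h]]


def object_proportion(mask):
    """Fraction of white pixels in the (cropped) mask."""
    total = sum(len(row) for row in mask)
    if total == 0:
        return 0.0
    white = sum(sum(row) for row in mask)
    return white / total
```

A crop that lies entirely outside every bounding box has an object proportion of zero, which corresponds to the context-only case in the figure.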

Classification and calibration results with ImageNet using ResNet-101. For a detailed explanation of the metrics please refer to section 4.2. 'O.conf' and 'U.conf' refer to overconfidence and underconfidence scores.

Fine-tuning on MS COCO using FRCNN for object detection with a ResNet-50 backbone. For a detailed explanation of the results please refer to section 4.3. AP refers to average precision and AR refers to average recall at the specified intersection over union (IoU) level. Our AP is only 0.001 lower than CutMix.

localization performance without any fine-tuning using class activation maps, please refer to appendix A.6. We compare our approach with standard baselines and provide results in an ablative manner to understand the benefits and limitations of applying adaptive label smoothing to classification and

A APPENDIX

We provide more detailed results and discussions that were left out of the main paper due to space constraints.

A.2 DATASET

Our approach to creating the different versions of ImageNet (Russakovsky et al., 2015) used to train our models is described in figure 4. We use the pixel means to mask all but one or all of the objects, using the same methodology as Anne Hendricks et al. (2018) and Choi et al. (2019). We use the standard validation set along with ImageNet V2 (Recht et al., 2019) without any changes to the images.

We also used a portion of the OpenImages (Kuznetsova et al., 2020) dataset. More specifically, we used the object detection version of the dataset, consisting of 600 classes and 1.7M images with 14M bounding boxes. We selected a subset of these images and trained 5 classifiers. The 600 classes include many parent nodes, and as this can contribute to label confusion, we remove all parent node classes and use only the leaf node classes. The dataset has bounding boxes for only a subset of images for commonly occurring objects and we remove these classes as well. Finally, we follow the approach of Liu et al. (2020) and merge confusing classes. We end up with 480 classes and approximately 1.2M images. There are about 7 objects per image (average) in this subset and after applying the 'mask' method, we end up with approximately 6.8M images. Of these, about 1.3M images corresponded to the 'man' class, and the 'women' and 'windows' classes also had very high sample counts. We restrict the maximum number of images in a given class to around 50K and end up with roughly 2.2M images. We apply the same methodology to the val and test splits but we do not clip the sample counts per class.

Visualization of the count per each of the 1000 classes in the 'mask' version of ImageNet used by our approach.

Visualization of the count per each of the 480 classes in the 'mask' version of OpenImages used by our approach. Class '256', for example, has 40k images.
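The per-class clipping step can be sketched as follows. The text does not say how the retained images are chosen, so this sketch assumes a seeded random subsample; the function name `clip_per_class` and the `(image_id, class_name)` sample format are hypothetical.

```python
import random
from collections import defaultdict


def clip_per_class(samples, max_per_class, seed=0):
    """Keep at most max_per_class samples for every class.

    samples: list of (image_id, class_name) pairs.
    Assumption: classes above the cap are randomly subsampled.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image_id, class_name in samples:
        by_class[class_name].append(image_id)
    clipped = []
    # Sort class names so the result is deterministic for a given seed.
    for class_name in sorted(by_class):
        ids = by_class[class_name]
        if len(ids) > max_per_class:
            ids = rng.sample(ids, max_per_class)
        clipped.extend((image_id, class_name) for image_id in ids)
    return clipped
```

For the training split described above, `max_per_class` would be around 50K; for the val and test splits the clipping is simply not applied.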

A.3 HYPERPARAMETERS

We use standard data-augmentation strategies like random cropping, scaling, color jitter, etc., for all methods and train all our ImageNet models for 300 epochs, starting with a learning rate of 0.1 that is decayed by a factor of 0.1 at epochs 75, 150, and 225, using a batch size of 256. For a fair comparison with our ImageNet-'mask' based models, we matched the number of iterations and reduced the total epochs for our OpenImages classifiers. We trained all our OpenImages models for 72 epochs, starting with a learning rate of 0.1 that is decayed by a factor of 0.1 at epochs 18, 36, and 54, using a batch size of 256. We assume that this reduced number of epochs also contributed to poor localization for the transfer learning case.
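The schedule and the iteration matching above can be sketched as follows. The sketch assumes zero-indexed epochs with the drop taking effect at the listed epoch, and it uses the approximate sample counts quoted in this paper (0.528M for ImageNet 'mask', roughly 2.2M for the clipped OpenImages subset); the function names are hypothetical.

```python
import math


def step_lr(epoch, base_lr=0.1, milestones=(75, 150, 225), gamma=0.1):
    """Learning rate after multiplying by gamma at every milestone reached."""
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * gamma ** drops


def total_iterations(num_samples, epochs, batch_size=256):
    """Number of optimizer steps for a run, keeping the last partial batch."""
    return epochs * math.ceil(num_samples / batch_size)


# Approximate sample counts taken from the text.
imagenet_mask_iters = total_iterations(528_000, epochs=300)
openimages_iters = total_iterations(2_200_000, epochs=72)
```

With these rounded counts, both runs see about 158.4M samples in total (300 x 0.528M and 72 x 2.2M), which is consistent with the stated goal of matching the number of iterations.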

A.4 HARDWARE AND SOFTWARE

All our experiments were run on 'Dell C4130' nodes, equipped with 4 Nvidia V100 cards each. We used Docker to maintain the same set of libraries across multiple nodes. The host environment was running Ubuntu 18.04 with CUDA 10.2 installed. The Docker environment used Ubuntu 16.04 with CUDA 9.0, PyTorch 1.1, and Anaconda Python 4.3. We will release all our code and pretrained models before the conference.

A.5 RUNTIMES

Our adaptive label smoothing approach using the 'mask' version of ImageNet took approximately 74 hours and the hard label version took approximately 48 hours for 300 epochs. The object detection experiments took approximately 34 hours for 10 epochs.

A.6 CLASS ACTIVATION MAPS

We provide more class activation maps to visualize the localization performance of the baseline approaches as well as our approaches in figures 6, 7, and 8.

A.7 TABLES

We provide detailed calibration metrics for ImageNet and OpenImages classifiers in tables 5 and 6, respectively. The average confidence of a model is the mean confidence of its predictions. As our model predictions are grounded in the spatial size of the object, our average confidence values on 'V1' and 'V2' are 0.48 and 0.39, respectively; in the case of hard labels the values are 0.77 and 0.69, respectively. We also provide AP (average precision) measures for different object sizes in table 7.

Table 7: Fine-tuning on COCO using FRCNN for object detection. For a detailed explanation of the results please refer to section 4.3 in the main paper. AP refers to average precision and AR refers to average recall at the specified intersection over union (IoU) level. We also provide AP values for small, medium, and large objects using 'S', 'M', and 'L', respectively.
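The average confidence reported in this section can be sketched as follows. The sketch assumes the common convention that a prediction's confidence is its maximum softmax probability; the exact definition used for the tables is given in the main paper, and the function names here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]


def average_confidence(batch_logits):
    """Mean of the maximum softmax probability over a set of predictions."""
    confidences = [max(softmax(logits)) for logits in batch_logits]
    return sum(confidences) / len(confidences)
```

Under this convention, a model that outputs a uniform distribution over K classes on every image has an average confidence of 1/K, the lower bound that context-only images are trained toward.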

