ONE SIZE DOESN'T FIT ALL: ADAPTIVE LABEL SMOOTHING

Abstract

This paper concerns the use of objectness measures to improve the calibration performance of Convolutional Neural Networks (CNNs). CNNs have proven to be very good classifiers and generally localize objects well; however, the loss functions typically used to train classification CNNs do not penalize the inability to localize an object, nor do they take into account an object's relative size in the given image. During training on ImageNet-1K, almost all approaches use random crops of the images, and this transformation sometimes provides the CNN with background-only samples, causing classifiers to depend on context. Context dependence is harmful for safety-critical applications. We present a novel approach to classification that combines the ideas of objectness and label smoothing during training. Unlike previous methods, we compute a smoothing factor that adapts to the relative object size within an image. As a result, our approach produces confidences grounded in the size of the object being classified instead of relying on context to make correct predictions. We present extensive results on ImageNet demonstrating that CNNs trained using adaptive label smoothing are much less likely to be overconfident in their predictions. We show qualitative results using class activation maps and quantitative results on classification and transfer learning tasks. Our approach produces an order-of-magnitude reduction in confidence when predicting on context-only images compared to baselines. Using transfer learning, we gain 0.021 AP on MS COCO compared to the hard-label approach.

1. INTRODUCTION

Convolutional neural networks (CNNs) have been used to address many computer vision problems for over two decades (LeCun, 1998); in particular, they have shown promising results on object detection and localization tasks since 2013 (Krizhevsky et al., 2012; Russakovsky et al., 2015; Girshick et al., 2018). Unfortunately, modern CNNs are overconfident in their predictions (Lakshminarayanan et al., 2017; Hein et al., 2019) and suffer from reliability issues due to miscalibration (Guo et al., 2017a). Problems related to overconfidence, generalization, bias, and reliability severely limit current CNNs in real-world applications. In this work, we address the problems of overconfidence and contextual bias. In classification CNNs, ground-truth labels are typically provided as a one-hot representation of class probabilities (hard labels): vectors of 0s and 1s, with a single 1 indicating the pertinent class. Szegedy et al. (2016) introduced label smoothing, which replaces hard targets with soft labels formed as a weighted average of the hard targets and a uniform distribution over classes, improving learning speed and generalization. Label smoothing mitigates weight magnification (Mukhoti et al., 2020; Müller et al., 2019); in contrast, hard targets tend to inflate the logits and produce overconfident predictions (Szegedy et al., 2016; Müller et al., 2019). However, both hard labels and conventional label smoothing force CNNs to produce high-confidence predictions even when the pertinent object is absent from a training sample. To obtain more reliable confidence measures, we use the objectness measure to derive an adaptive smoothing factor for every sample as it undergoes its unique scale-and-crop transformation. Safely deploying deep-learning-based models has also become an immediate challenge (Amodei et al., 2016).
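The fixed-factor label smoothing of Szegedy et al. (2016) described above can be sketched as follows; the function name and the choice of eps = 0.1 are illustrative, matching the 0.9/uniform split mentioned later in this paper:

```python
import numpy as np

def smooth_labels(target: int, num_classes: int, eps: float = 0.1) -> np.ndarray:
    """Standard (fixed-factor) label smoothing: mix the one-hot target
    with a uniform distribution over all classes."""
    soft = np.full(num_classes, eps / num_classes)  # uniform mass eps/K per class
    soft[target] += 1.0 - eps                       # true class keeps 1 - eps
    return soft

# A 5-class example: the true class gets 0.9 + 0.1/5 = 0.92,
# every other class gets 0.1/5 = 0.02.
labels = smooth_labels(target=2, num_classes=5, eps=0.1)
```

Note that the smoothing factor eps is the same for every training sample, regardless of how much of the object the random crop actually contains; this is the limitation our adaptive variant addresses.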
As a community, we need not only high accuracies but also reliable uncertainty measures from CNNs. Providing reliable confidence measures improves the precision of CNNs by allowing safety-critical systems to avoid acting with certainty on uncertain predictions. Object detection (Girshick et al., 2018) requires bounding box information during training. Recently, Dvornik et al. (2018) proposed augmenting training data with novel synthetic images, using object location information, to improve object detection performance; however, to our knowledge, classification CNNs have not exploited object size information for regularization on large datasets like ImageNet (Russakovsky et al., 2015). Objectness, quantifying the likelihood that an image window contains an object belonging to any class, was first introduced by Alexe et al. (2012), and its role has been studied extensively since then. Object detectors specialize in a few classes, but objectness is class-agnostic by definition. We limit the definition of objectness to the 1000 ImageNet-1K classes, meaning any object outside these classes receives an objectness score of 0.
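The paper derives objectness from bounding box information relative to the random crop and uses it to set the smoothing factor per sample. The sketch below assumes objectness is the fraction of the crop's area covered by the object's box, and maps it linearly to the smoothing factor; the authors' exact definition and mapping may differ:

```python
import numpy as np

def crop_objectness(bbox, crop):
    """Fraction of the crop's area covered by the object's bounding box.
    Boxes are (x1, y1, x2, y2) in pixels. Returns 0.0 when the random
    crop misses the object entirely (a context-only sample)."""
    ix1, iy1 = max(bbox[0], crop[0]), max(bbox[1], crop[1])
    ix2, iy2 = min(bbox[2], crop[2]), min(bbox[3], crop[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    crop_area = (crop[2] - crop[0]) * (crop[3] - crop[1])
    return inter / crop_area if crop_area > 0 else 0.0

def adaptive_smooth_labels(target, num_classes, objectness):
    """Adaptive label smoothing sketch: a context-only crop
    (objectness ~ 0) yields a near-uniform, high-entropy target,
    while a crop dominated by the object yields a near-one-hot target.
    The linear objectness-to-factor mapping here is an assumption."""
    eps = 1.0 - objectness          # more smoothing when the object is small
    soft = np.full(num_classes, eps / num_classes)
    soft[target] += 1.0 - eps
    return soft

# A crop that misses the object gets a uniform (maximally smoothed) label.
miss = adaptive_smooth_labels(0, 4, crop_objectness((0, 0, 50, 50), (100, 100, 200, 200)))
# A crop fully covered by the object keeps the hard label.
full = adaptive_smooth_labels(0, 4, crop_objectness((0, 0, 100, 100), (0, 0, 100, 100)))
```

Under this formulation, the loss no longer rewards confident predictions on context-only crops, which is exactly the failure mode of the fixed-label approaches described above.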
When training a classifier, the cross-entropy loss does not penalize incorrect spatial attention, so CNNs often overfit to context or texture rather than the pertinent object (Geirhos et al., 2019), as shown in the left half of Figure 1. The bottom row displays samples with negligible amounts of 'Dog' pixels; traditional methods would still label them 'Dog', causing CNNs to output high-confidence, incorrect predictions when presented with images of backgrounds or context alone. Adaptive label smoothing (our approach) uses gross object size to smooth the hard labels of a classifier, as displayed in the right half of Figure 1: the smoothing factor is derived from the objectness measure. Compared to approaches based on hard labels, sample mixing, and label smoothing, our approach improves object detection and calibration performance. Traditional approaches (Yun et al., 2019; Takahashi et al., 2018; Krizhevsky et al., 2012; Russakovsky et al., 2015) use random-resize and random-crop augmentation, which sometimes loses the pertinent object from the training sample, allowing the classifier to make correct predictions by overfitting to the surrounding context. We believe



Figure 1: Random crops of images are often used when training classification CNNs to help mitigate size, position, and scale bias (as shown in the left half of the figure, along with the objectness values listed below them). Unfortunately, some of these crops miss the object, as the process does not use any object location information. Traditional hard-label and smooth-label approaches do not account for the proportion of the object being classified and use a fixed label of 1, or 0.9 in the case of label smoothing. Our approach (right half) smooths the hard labels by accounting for the objectness measure to compute an adaptive smoothing factor. The objectness is computed using bounding box information as shown above. Our approach helps generate accurate labels during training and penalizes low-entropy (high-confidence) predictions for context-only images.

