NO COST LIKELIHOOD MANIPULATION AT TEST TIME FOR MAKING BETTER MISTAKES IN DEEP NETWORKS

Abstract

There has been increasing interest in building deep hierarchy-aware classifiers that aim to quantify and reduce the severity of mistakes, and not just reduce the number of errors. The idea is to exploit the label hierarchy (e.g., the WordNet ontology) and consider graph distances as a proxy for mistake severity. Surprisingly, on examining the mistake-severity distributions of the top-1 predictions, we find that current state-of-the-art hierarchy-aware deep classifiers do not always show practical improvement over the standard cross-entropy baseline in making better mistakes. The reduction in average mistake severity can instead be attributed to an increase in low-severity mistakes, which may also explain the noticeable drop in their accuracy. Motivated by this, we use the classical Conditional Risk Minimization (CRM) framework for hierarchy-aware classification. Given a cost matrix and a reliable estimate of likelihoods (obtained from a trained network), CRM simply amends mistakes at inference time; it needs no extra hyperparameters and requires adding just a few lines of code to the standard cross-entropy baseline. It significantly outperforms the state of the art and consistently obtains large reductions in the average hierarchical distance of top-k predictions across datasets, with very little loss in accuracy. Because of its simplicity, CRM can be used with any off-the-shelf trained model that provides reliable likelihood estimates.

1. INTRODUCTION

The conventional performance measure of accuracy for image classification treats all classes other than the ground truth as equally wrong. However, some mistakes may have a much higher impact than others in real-world applications: an autonomous vehicle mistaking a car for a bus makes a better mistake than one mistaking a car for a lamppost. Consequently, it is essential to integrate the notion of mistake severity into classifiers, and one convenient way to do so is to use a taxonomic hierarchy tree of class labels, where severity is defined by a distance on the graph (e.g., the height of the Lowest Common Ancestor) between the ground truth and the predicted label (Deng et al., 2010; Zhao et al., 2011). This is similar to the problem of providing a good ranking of classes in a retrieval setting. Consider the case of an autonomous vehicle ranking classes for a thin, white, narrow band (a pole, in reality). A top-3 prediction of {pole, lamppost, tree} would be a better prediction than {pole, person, building}. Notice that the top-k class predictions here contain at least k-1 incorrect predictions, and the aim is to reduce the severity of these mistakes, measured by the average hierarchical distance of each of the top-k predictions from the ground truth.

Silla & Freitas (2011) survey classical methods that leverage class hierarchy when designing classifiers across various application domains and illustrate clear advantages over flat classification, especially when the labels have a well-defined hierarchy. There has been growing interest in the problem of deep hierarchy-aware image classification (Barz & Denzler, 2019; Bertinetto et al., 2020). These approaches seek to leverage the class hierarchy inherent in large-scale datasets (e.g., the ImageNet dataset is derived from the WordNet semantic ontology). Hierarchy is incorporated using label embedding methods, hierarchical loss functions, or hierarchical architectures.
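The LCA-based notion of severity can be made concrete with a toy taxonomy. The tree and class names below are illustrative, not taken from any dataset used in the paper:

```python
# Toy illustration: mistake severity as the height of the lowest common
# ancestor (LCA) between the ground-truth class and the predicted class.

# Each leaf class is described by its path from the root of the taxonomy.
PATHS = {
    "car":      ["entity", "vehicle", "car"],
    "bus":      ["entity", "vehicle", "bus"],
    "lamppost": ["entity", "structure", "lamppost"],
}

def lca_height(a, b):
    """Height of the LCA above the leaf level; 0 when a == b."""
    shared = 0
    for x, y in zip(PATHS[a], PATHS[b]):
        if x != y:
            break
        shared += 1
    return max(len(PATHS[a]), len(PATHS[b])) - shared

# Mistaking a car for a bus (siblings under "vehicle") is a better
# mistake than mistaking a car for a lamppost (shares only the root).
print(lca_height("car", "bus"))       # 1
print(lca_height("car", "lamppost"))  # 2
```

Averaging these LCA heights over the top-k predictions gives the hierarchical-distance metric discussed above.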
We empirically found that these models indeed improve the ranking of the top-k predicted classes, ensuring that the top alternative classes are closer in the class hierarchy. However, this improvement is observed only for k > 1. On closer inspection of the top-1 predictions of these models, we observe that instead of improving mistake severity, they simply introduce additional low-severity mistakes, which in turn favours the mistake-severity metric proposed by Bertinetto et al. (2020). This metric involves division by the number of misclassified samples; therefore, in many situations (discussed in the paper), it can prefer a model making additional low-severity mistakes over one that does not make such mistakes. This is at odds with the intuitive notion of making better mistakes. These additional low-severity mistakes can also explain the significant drop in top-1 accuracy compared to the vanilla cross-entropy model. We further find these models to be highly miscalibrated, which limits their practical usability.

In this work we explore a different direction for hierarchy-aware classification, where we amend mistake severity at test time by making post-hoc corrections to the class likelihoods (e.g., the softmax outputs of deep neural networks). Given a label hierarchy, we perform such amendments using the well-known classical framework of Conditional Risk Minimization (CRM). We find that CRM outperforms state-of-the-art deep hierarchy-aware classifiers by large margins at ranking classes, with little loss in classification accuracy. In contrast to other recent approaches, CRM does not hurt the calibration of a model, as the cross-entropy likelihoods can still be used for confidence estimates. CRM is simple, requires adding just a few lines of code to the standard cross-entropy baseline, does not require retraining of the network, and contains no hyperparameters whatsoever.
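Concretely, given a trained network's softmax likelihoods p(j|x) and a cost matrix C whose entry C[i, j] is the hierarchical distance between classes i and j, CRM ranks classes by their conditional risk R(i|x) = Σ_j C[i, j] p(j|x). A minimal numpy sketch follows; the 3-class cost matrix and probabilities are made up for illustration:

```python
import numpy as np

def crm_predict(probs, C, k=1):
    """Rank classes by conditional risk R(i) = sum_j C[i, j] * p(j | x).

    probs: (num_classes,) softmax likelihoods from a trained network.
    C:     (num_classes, num_classes) cost matrix, e.g. LCA heights.
    Returns the top-k classes in order of increasing expected cost.
    """
    risk = C @ probs               # expected cost of predicting each class
    return np.argsort(risk)[:k]

# Made-up example: classes 0 and 1 are siblings (cost 1 between them),
# while class 2 is far from both (cost 2).
C = np.array([[0., 1., 2.],
              [1., 0., 2.],
              [2., 2., 0.]])

p = np.array([0.44, 0.11, 0.45])   # argmax would predict class 2
print(crm_predict(p, C, k=3))      # CRM ranks class 0 first: [0 2 1]
```

Note how CRM reranks toward the class whose potential mistakes are cheap: class 0 wins over the marginally more likely class 2 because confusing 0 with its sibling 1 costs little.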
We would like to emphasize that we do not claim any algorithmic novelty: CRM has been well explored in the literature (Duda & Hart, 1973, Ch. 2). Almost a decade ago, Deng et al. (2010) proposed a very similar solution using a Support Vector Machine (SVM) classifier applied to handcrafted features; however, it did not yield practically useful performance given the machine learning tools available at the time. We intend to bring this old, simple, and extremely effective approach back to attention before delving deeper into sophisticated methods that require expensive retraining of large neural networks and the design of complex loss functions. Overall, our investigation into hierarchy-aware classification makes the following contributions:

• We highlight a shortcoming in one of the metrics proposed to evaluate hierarchy-aware classification and show that it can easily be fooled, giving the false impression of making better mistakes.
• We revisit an old post-hoc correction technique (CRM) that significantly outperforms prior art when the ranking of the predictions made by the model is considered.
• We investigate the reliability of prior art in terms of calibration and show that these methods are severely miscalibrated, limiting their practical usefulness.

COST-SENSITIVE CLASSIFICATION

Cost-sensitive classification assigns varying costs to different types of misclassification errors. Abe et al. (2004) group cost-sensitive classifiers into three main categories. The first category extends one particular classification model to be cost-sensitive, such as support vector machines (Tu & Lin, 2010) or decision trees (Lomax & Vadera, 2013). The second category makes the training procedure cost-sensitive, typically by assigning the training examples of different classes different weights (rescaling) (Zhou & Liu, 2010) or by changing the proportions of each class while training using sampling (rebalancing) (Elkan, 2001). The third category makes the prediction procedure cost-sensitive (Domingos, 1999; Zadrozny & Elkan, 2001a). Such direct cost-sensitive decision-making is the most generic: it considers the underlying classifier as a black box and extends to any number of classes and arbitrary cost matrices. Our work comes under this third category of post-hoc amendment. We study cost-sensitive classification in the setting where the costs are derived from the class hierarchy.
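The rescaling idea in the second category can be sketched with a class-weighted cross-entropy loss; the probabilities, labels, and weights below are invented purely for illustration:

```python
import numpy as np

def weighted_cross_entropy(probs, labels, class_weights):
    """Rescaling: examples of costlier classes contribute more to the loss.

    probs:         (n, num_classes) predicted probabilities.
    labels:        (n,) integer ground-truth labels.
    class_weights: (num_classes,) per-class misclassification weights.
    """
    per_example = -np.log(probs[np.arange(len(labels)), labels])
    return float(np.mean(class_weights[labels] * per_example))

# Invented example: mistakes on class 1 are treated as twice as costly,
# so its examples are up-weighted during training.
probs = np.array([[0.8, 0.2],
                  [0.3, 0.7]])
labels = np.array([0, 1])
w = np.array([1.0, 2.0])
print(weighted_cross_entropy(probs, labels, w))
```

In contrast to such training-time rescaling, the post-hoc approach studied in this work leaves training untouched and adjusts only the decision rule at inference.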

