FINE-GRAIN INFERENCE ON OUT-OF-DISTRIBUTION DATA WITH HIERARCHICAL CLASSIFICATION

Abstract

Machine learning methods must be trusted to make appropriate decisions in real-world environments, even when faced with out-of-distribution (OOD) samples. Many current approaches simply aim to detect OOD examples and alert the user when an unrecognized input is given. However, when the OOD sample significantly overlaps with the training data, a binary anomaly detection is neither interpretable nor explainable, and provides little information to the user. We propose a new model for OOD detection that makes predictions at varying levels of granularity: as the inputs become more ambiguous, the model predictions become coarser and more conservative. Consider an animal classifier that encounters an unknown bird species and a car. Both cases are OOD, but the user gains more information if the classifier recognizes that its uncertainty over the particular species is too large and predicts "bird" instead of simply flagging the input as OOD. Furthermore, we diagnose the classifier's performance at each level of the hierarchy, improving the explainability and interpretability of the model's predictions. We demonstrate the effectiveness of hierarchical classifiers for both fine- and coarse-grained OOD tasks.

1. INTRODUCTION

Real-world computer vision systems will encounter out-of-distribution (OOD) samples while making or informing consequential decisions. Therefore, it is crucial to design machine learning methods that make reasonable predictions for anomalous inputs that are outside the scope of the training distribution. Recently, research has focused on detecting inputs during inference that are OOD with respect to the training distribution (Ahmed & Courville, 2020; Hendrycks & Gimpel, 2017; Hendrycks et al., 2019; Hsu et al., 2020; Huang & Li, 2021; Lakshminarayanan et al., 2017; Lee et al., 2018; Liang et al., 2018; Liu et al., 2020; Neal et al., 2018; Roady et al., 2020; Inkawhich et al., 2022). These methods typically use a threshold on the model's "confidence" to produce a binary decision indicating whether the sample is in-distribution (ID) or OOD. However, binary decisions based on model heuristics offer little interpretability or explainability. The fundamental problem is that there are many ways for a sample to be out-of-distribution. Ideally, a model should provide more nuanced information about how a sample differs from the training data. For example, if a bird classifier is presented with a novel bird species, we would like it to recognize that the sample is a bird rather than simply reporting OOD. Conversely, if the bird classifier is shown an MNIST digit, it should indicate that the digit is outside its domain of expertise. Recent studies have shown that fine-grained OOD samples are significantly more difficult to detect, especially when there is a large number of training classes (Ahmed & Courville, 2020; Huang & Li, 2021; Roady et al., 2020; Zhang et al., 2021; Inkawhich et al., 2021).
We argue that the difficulty stems from trying to address two opposing objectives: learning semantically meaningful features to discriminate between ID classes while also maintaining tight decision boundaries to avoid misclassification of fine-grain OOD samples (Ahmed & Courville, 2020; Huang & Li, 2021). We hypothesize that additional information about the relationships between classes could help determine those decision boundaries and simultaneously offer more interpretable predictions. To address these challenges, we propose a new method based on hierarchical classification, illustrated in Figure 1. Rather than directly outputting a distribution over all possible classes, as in a flat network, hierarchical classification methods leverage the relationships between classes to produce conditional probabilities for each node in the tree. This can simplify the classification problem, since each node only needs to distinguish between its children, which are far fewer in number (Redmon & Farhadi, 2017; Ridnik et al., 2021). It can also improve the interpretability of the neural network (Wan et al., 2021). For example, we leverage these conditional probabilities to define novel OOD metrics for hierarchical classifiers and make coarser predictions when the model is more uncertain. By employing an inference mechanism that predicts at different levels of granularity, we can estimate how similar the OOD samples are to the ID set and at what node of the tree a sample becomes OOD. When outliers are encountered, predicting at lower granularity allows the system to convey imprecise but accurate information. In addition, we propose a novel loss function for the hierarchical softmax classification technique to address the fine-grained OOD scenario, define hierarchical OOD detection metrics, and design a hierarchical inference procedure for ID and OOD samples that improves the utility of the model.
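The path-wise probabilities and coarse back-off described above can be sketched as follows. This is a minimal illustration on a hypothetical two-level hierarchy, not the paper's implementation: the function names, the toy label tree, and the single threshold `tau` are all assumptions made for exposition (the actual method derives its thresholds from the training set).

```python
import numpy as np

def softmax(z):
    z = np.asarray(z, dtype=float)
    e = np.exp(z - z.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical hierarchy: root -> {animal, vehicle}; animal -> {dog, bird};
# vehicle -> {car, truck}. Each node holds a softmax over its children only.
ROOT_CHILDREN = ["animal", "vehicle"]
LEAVES = {"animal": ["dog", "bird"], "vehicle": ["car", "truck"]}

def path_probs(root_logits, child_logits):
    """Path-wise leaf probability: P(leaf) = P(parent) * P(leaf | parent)."""
    p_root = softmax(root_logits)
    probs = {}
    for p_parent, parent in zip(p_root, ROOT_CHILDREN):
        p_cond = softmax(child_logits[parent])
        for p_c, leaf in zip(p_cond, LEAVES[parent]):
            probs[leaf] = p_parent * p_c
    return probs

def hierarchical_predict(root_logits, child_logits, tau=0.5):
    """Predict at the finest level whose path-wise probability clears tau;
    otherwise back off to the coarser parent, and finally to OOD."""
    probs = path_probs(root_logits, child_logits)
    leaf, p_leaf = max(probs.items(), key=lambda kv: kv[1])
    if p_leaf >= tau:
        return leaf
    p_root = dict(zip(ROOT_CHILDREN, softmax(root_logits)))
    parent, p_parent = max(p_root.items(), key=lambda kv: kv[1])
    return parent if p_parent >= tau else "OOD"
```

With confident logits at every node the sketch returns a leaf label; when only the leaf-level softmax is ambiguous it returns the parent ("bird"-style coarse prediction), and when the root is also ambiguous it reports OOD.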
We evaluate the method's sensitivity to granularity under coarse- and fine-grain outlier datasets using ImageNet-1K ID-OOD holdout class splits (Russakovsky et al., 2015). Our in-depth analysis of the ID and OOD classification error cases illuminates the behavior of the hierarchical classifier and showcases how explainable models are instrumental for solving fine-grain OOD tasks. Ultimately, we show that hierarchical classifiers effectively perform inference on ID and OOD samples at varying granularity levels and improve interpretability in fine-grained OOD scenarios.

2. RELATED WORK

There are four main types of related work: those which focus on fine-grain out-of-distribution detection; those which emphasize scalability to large numbers of classes; those which leverage hierarchical classifiers; and those which utilize improved feature extractors.



Figure 1: Method overview. Top: A ResNet50 extracts features from images, and fully-connected layers output softmax probabilities p_n(x_i) for each set in the hierarchy H. Path-wise probabilities are used for final classification. Path-wise probability and entropy thresholds generated from the training set D_train form the stopping criteria for the inference process. Bottom: Common error cases encountered by the hierarchical predictor, from left to right: standard error, resulting from an incorrect intermediate or leaf decision; ID under-prediction, where the network predicts at a coarse granularity due to high uncertainty; and OOD over-prediction, where the OOD sample is mistaken for a sibling node.
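The entropy-based stopping criterion in the caption can be sketched as follows. This is a hedged illustration, not the paper's exact procedure: the percentile rule, the epsilon constant, and the function names `fit_entropy_threshold` and `stop_here` are assumptions; the paper only states that thresholds are derived from D_train.

```python
import numpy as np

def entropy(p):
    """Shannon entropy (in nats) of a softmax vector."""
    p = np.asarray(p, dtype=float)
    return float(-np.sum(p * np.log(p + 1e-12)))  # epsilon avoids log(0)

def fit_entropy_threshold(train_probs, percentile=95):
    """Fit one node's threshold from the softmax vectors that node produced
    on training samples, e.g. a high percentile of their entropies."""
    return float(np.percentile([entropy(p) for p in train_probs], percentile))

def stop_here(p_node, threshold):
    """Stop descending the tree (i.e. predict at this node's level) when the
    node's softmax is more uncertain than anything typical of D_train."""
    return entropy(p_node) > threshold
```

At inference time, the tree would be traversed from the root, descending only while each node's entropy stays below its fitted threshold; a uniform softmax at some node halts the descent there, yielding the coarse predictions shown in the figure.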

