VISUAL RECOGNITION WITH DEEP NEAREST CENTROIDS

Abstract

We devise deep nearest centroids (DNC), a conceptually elegant yet surprisingly effective network for large-scale visual recognition, by revisiting Nearest Centroids, one of the most classic and simple classifiers. Current deep models learn the classifier in a fully parametric manner, ignoring the latent data structure and lacking explainability. DNC instead conducts nonparametric, case-based reasoning; it utilizes sub-centroids of training samples to describe class distributions and clearly explains the classification as the proximity of test data to the class sub-centroids in the feature space. Owing to its distance-based nature, the network output dimensionality is flexible, and all the learnable parameters serve only data embedding. This means all the knowledge learnt for ImageNet classification can be completely transferred to pixel recognition learning, under the "pre-training and fine-tuning" paradigm. Apart from its nested simplicity and intuitive decision-making mechanism, DNC can even possess ad-hoc explainability when the sub-centroids are selected as actual training images that humans can view and inspect. Compared with parametric counterparts, DNC performs better on image classification (CIFAR-10, CIFAR-100, ImageNet) and greatly boosts pixel recognition (ADE20K, Cityscapes) with improved transparency, using various backbone network architectures (ResNet, Swin) and segmentation models (FCN, DeepLabV3, Swin). Our code is available at DNC.

1. INTRODUCTION

Deep learning models, from convolutional networks (e.g., VGG [1], ResNet [2]) to Transformer-based architectures (e.g., Swin [3]), push forward the state of the art in visual recognition. With these advancements, parametric softmax classifiers, which learn a set of parameters, i.e., a weight vector and a bias term, for each class, have become the de facto regime in the area (Fig. 1(b)). However, due to their parametric nature, they suffer from several limitations: First, they lack simplicity and explainability. The parameters in the classification layer are abstract and detached from the physical nature of the problem being modelled [4]. Thus these classifiers do not naturally lend themselves to an explanation that humans are able to process [5]. Second, linear classifiers are typically trained to optimize classification accuracy only, paying less attention to modeling the latent data structure. For each class, only one single weight vector is learned in a fully parametric manner. Thus they essentially assume unimodality for each class [6, 7] and are less tolerant of intra-class variation. Third, as each class has its own set of parameters, deep parametric classifiers require an output space with a fixed dimensionality (equal to the number of classes) [8]. As a result, their transferability is limited; when using ImageNet-trained classifiers to initialize segmentation networks (i.e., pixel classifiers), the last classification layer, whose parameters carry valuable knowledge learnt from the image classification task, has to be thrown away.

In light of the foregoing discussions, we are motivated to present deep nearest centroids (DNC), a powerful, nonparametric classification network (Fig. 1(d)). Nearest Centroids, which has historical roots dating back to the dawn of artificial intelligence [9-14], is arguably the simplest classifier.
Nearest Centroids operates on an intuitive principle: a given data sample is directly classified to the class of training examples whose mean (centroid) is closest to it. Apart from its internal transparency, Nearest Centroids is a classical form of exemplar-based reasoning [5, 11], which is fundamental to our most effective strategies for tactical decision-making [15] (Fig. 1(c)). Numerous past studies [16-18] have shown that humans learn to solve new problems by using past solutions of similar problems. Despite its conceptual simplicity, empirical evidence in cognitive science, and enduring popularity [19-22], Nearest Centroids has received little attention in the deep learning era. DNC enjoys a few attractive qualities: First, improved simplicity and transparency. The intuitive working mechanism and statistical meaning of class sub-centroids make DNC elegant and easy to understand. Second, automated discovery of underlying data structure. By within-class deterministic clustering, the latent distribution of each class is automatically mined and fully captured as a set of representative local means. In contrast, parametric classifiers learn one single weight vector per class, intolerant of rich intra-class variations. Third, direct supervision of representation learning. DNC achieves classification by comparing data samples and class sub-centroids in the feature space. With this distance-based nature, DNC blends unsupervised sub-pattern mining (class-wise clustering) and supervised representation learning (nonparametric classification) in a synergy: locally significant patterns are automatically mined to facilitate classification decision-making; the supervisory signal from classification directly optimizes the representation, which in turn boosts meaningful clustering. Fourth, better transferability. DNC learns by optimizing only the feature representation, thus the output dimensionality no longer needs to equal the number of classes.
With this algorithmic merit, all the useful knowledge (parameters) learnt from a source task (e.g., ImageNet [25] classification) is stored in the representation space, and can be completely transferred to target tasks (e.g., Cityscapes [26] segmentation). Fifth, ad-hoc explainability. If the class sub-centroids are further restricted to be samples (images) of the training set, DNC can explain its prediction based on If...Then rules, allowing users to intuitively view the class representatives and appreciate the similarity of test data to the representative images (detailed in §3 and §4.3). Such ad-hoc explainability [27] is valuable in safety-sensitive scenarios, and distinguishes DNC from most existing network interpretation techniques [28-30] that only investigate post-hoc explanations and thus fail to elucidate precisely how a model works [31, 32].
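The decision rule described above can be sketched in a few lines: a simple k-means run within each class yields the sub-centroids, and a test feature is assigned to the class of its nearest sub-centroid. The following NumPy snippet is only a minimal illustration of the idea, not the paper's implementation; the function names and the plain k-means routine are our own assumptions, and in DNC itself the features would come from a learnable deep embedding.

```python
import numpy as np

def subcentroids_per_class(feats, labels, k, iters=10, seed=0):
    """Run a plain k-means within each class to obtain k sub-centroids.

    feats: (N, D) array of embedded features; labels: (N,) class ids.
    Returns {class_id: (k, D) array of sub-centroids}.
    """
    rng = np.random.default_rng(seed)
    centroids = {}
    for c in np.unique(labels):
        x = feats[labels == c]
        # initialize centers from random in-class samples
        centers = x[rng.choice(len(x), size=k, replace=False)]
        for _ in range(iters):
            # assign each sample to its nearest center, then re-estimate means
            d = np.linalg.norm(x[:, None] - centers[None], axis=-1)  # (n_c, k)
            assign = d.argmin(axis=1)
            for j in range(k):
                if np.any(assign == j):
                    centers[j] = x[assign == j].mean(axis=0)
        centroids[c] = centers
    return centroids

def dnc_predict(feat, centroids):
    """Classify a feature vector to the class of its nearest sub-centroid."""
    best_class, best_dist = None, np.inf
    for c, centers in centroids.items():
        d = np.linalg.norm(centers - feat, axis=1).min()
        if d < best_dist:
            best_class, best_dist = c, d
    return best_class
```

Note that nothing in `dnc_predict` depends on the number of classes: the only learnable quantities live in the feature extractor, which is what makes the representation fully transferable across tasks.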
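The ad-hoc explainability discussed above can likewise be sketched: each sub-centroid is replaced by the feature of the closest real training image, so every class representative is an actual image a human can view and inspect. `select_representatives` below is a hypothetical helper written for illustration under that assumption; it is not part of the released code.

```python
import numpy as np

def select_representatives(feats, labels, centroids):
    """For each class sub-centroid, return the index of the closest real
    training sample (a medoid-style choice), so the representative is an
    actual image rather than an abstract mean.

    feats: (N, D) features; labels: (N,) class ids;
    centroids: {class_id: (k, D) array of sub-centroids}.
    Returns {class_id: (k,) array of training-sample indices}.
    """
    reps = {}
    for c, centers in centroids.items():
        idx_c = np.flatnonzero(labels == c)   # global indices of class c
        x = feats[idx_c]
        # distance from every in-class sample to every sub-centroid
        d = np.linalg.norm(x[:, None] - centers[None], axis=-1)  # (n_c, k)
        reps[c] = idx_c[d.argmin(axis=0)]
    return reps
```

A prediction can then be justified in If...Then form: "if the test feature is closest to representative image i of class c, then predict c", with image i available for inspection.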



Figure 1: (b) Prevalent visual recognition models, built upon parametric softmax classifiers, have several limitations, such as their non-transparent decision-making process. (c) Humans can use past cases as models when solving new problems [16, 18] (e.g., comparing with a few familiar/exemplar animals for categorization). (d) DNC makes classification based on the similarity of input samples to class sub-centroids (representative training examples) in the feature space. The class sub-centroids are vital for capturing underlying data structure, enhancing interpretability, and boosting recognition.

