PREDICTING CLASSIFICATION ACCURACY WHEN ADDING NEW UNOBSERVED CLASSES

Abstract

Multiclass classifiers are often designed and evaluated only on a sample from the classes on which they will eventually be applied. Hence, their final accuracy remains unknown. In this work we study how a classifier's performance on the initial class sample can be used to extrapolate its expected accuracy on a larger, unobserved set of classes. To this end, we define a measure of separation between correct and incorrect classes that is independent of the number of classes: the reversed ROC (rROC), obtained by reversing the roles of classes and data points in the common ROC. We show that the classification accuracy is a function of the rROC in multiclass classifiers for which the learned representation of data from the initial class sample remains unchanged when new classes are added. Using these results, we formulate a robust neural-network-based algorithm, CleaneX, which learns to estimate the accuracy of such classifiers on arbitrarily large sets of classes. Unlike previous methods, our method uses both the observed accuracies of the classifier and the densities of its classification scores, and therefore achieves remarkably better predictions than current state-of-the-art methods on both simulations and real datasets of object detection, face recognition, and brain decoding.

1. INTRODUCTION

Advances in machine learning and representation learning have led to automatic systems that can identify an individual class from very large candidate sets. Examples are abundant in visual object recognition (Russakovsky et al., 2015; Simonyan & Zisserman, 2014), face identification (Liu et al., 2017b), and brain-machine interfaces (Naselaris et al., 2011; Seeliger et al., 2018). In all of these domains, the possible set of classes is much larger than the one observed at training or testing time. Acquiring and curating data is often the most expensive component in developing new recognition systems. A practitioner would prefer to know early in the modeling process whether the data-collection apparatus and the classification algorithm are expected to meet the required accuracy levels. In large multiclass problems, the pilot data may contain considerably fewer classes than will be found when the system is deployed (consider, for example, researchers who develop a face-recognition system that is planned to be used on 10,000 people, but can only collect 1,000 in the initial development phase). This increase in the number of classes changes the difficulty of the classification problem and therefore the expected accuracy. The magnitude of the change depends on the classification algorithm and the interactions between the classes: classification accuracy usually deteriorates as the number of classes increases, but the deterioration varies across classifiers and data distributions. For pilot experiments to work, theory and algorithms are needed to estimate how the accuracy of multiclass classifiers is expected to change when the number of classes grows. In this work, we develop a prediction algorithm that observes the classification results for a small set of classes and predicts the accuracy on larger class sets.
In large multiclass classification tasks, a representation is often learned on a set of k_1 classes, whereas the classifier is eventually used on a new, larger class set. On the larger set, classification can be performed by applying simple procedures such as measuring the distances in an embedding space between a new example x ∈ X and labeled examples associated with the classes y_i ∈ Y. Such classifiers, where the score assigned to a data point x for belonging to a class y is independent of the other classes, are defined as marginal classifiers (Zheng et al., 2018). Their performance on the larger set describes how robust the learned representation is. Examples of classifiers that are marginal when used on a larger class set include siamese neural networks (Koch et al., 2015), one-shot learning (Fei-Fei et al., 2006), and approaches that directly optimize the embedding (Schroff et al., 2015). Our goal in this work is to estimate how well a given marginal classifier will perform on a large unobserved set of k_2 classes, based on its performance on a smaller set of k_1 classes. Recent works (Zheng & Benjamini, 2016; Zheng et al., 2018) set up a probabilistic model for rigorously studying this problem, assuming that the k_1 available classes are sampled from the same distribution as the larger set of k_2 classes. Following their framework, we assume that the sets of k_1 and k_2 classes on which the classifier is trained and evaluated are sampled independently from an infinite continuous set Y according to Y_i ∼ P_Y(y), and that for each class, r data points are sampled independently from X according to the conditional distribution P_{X|Y}(x | y). In their work, the authors presented two methods for predicting the expected accuracy, one of them originally due to Kay et al. (2008). We cover these methods in Section 2.
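To make the marginal-classifier property concrete, the following minimal sketch implements a nearest-centroid rule in a learned embedding space. The centroid scoring rule and all names here are illustrative assumptions, not the specific classifiers cited above; the point is only that each class's score depends on x and that class alone, so adding new classes leaves the existing scores unchanged.

```python
import numpy as np

def centroid_scores(embed_x, class_centroids):
    """Score of x for each class: negative Euclidean distance to the
    class centroid in the embedding space. Each score depends only on x
    and that class's centroid, so the classifier is marginal."""
    return {y: -np.linalg.norm(embed_x - c) for y, c in class_centroids.items()}

def predict(embed_x, class_centroids):
    scores = centroid_scores(embed_x, class_centroids)
    return max(scores, key=scores.get)

# Toy 2-D embedding with two classes; adding a third class later does
# not alter the scores already assigned to the first two.
centroids = {"a": np.array([0.0, 0.0]), "b": np.array([3.0, 0.0])}
x = np.array([0.5, 0.1])
print(predict(x, centroids))  # a  (x is closest to centroid "a")
centroids["c"] = np.array([0.0, 3.0])
print(predict(x, centroids))  # a  (old scores unchanged by the new class)
```

Because the score of each class is computed in isolation, the same representation can be reused on a larger class set without retraining, which is exactly the setting in which we wish to predict accuracy.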
As a first contribution of this work (Section 3), we provide a theoretical analysis that connects the accuracy of marginal classifiers to a variant of the receiver operating characteristic (ROC) curve, which is obtained by reversing the roles of classes and data points in the common ROC. We show that the reversed ROC (rROC) measures how well a classifier's learned representation separates the correct from the incorrect classes of a given data point. We then prove that the accuracy of marginal classifiers is a function of the rROC, allowing the use of well-researched ROC estimation methods (Gonçalves et al., 2014; Bhattacharya & Hughes, 2015) to predict the expected accuracy. Furthermore, the reversed area under the curve (rAUC) equals the expected accuracy of a binary classifier, where the expectation is taken over all randomly selected pairs of classes. We use our results regarding the rROC to provide our second contribution (Section 4): CleaneX (Classification Expected Accuracy Neural EXtrapolation), a new neural-network-based method for predicting the expected accuracy of a given classifier on an arbitrarily large set of classes¹. CleaneX differs from previous methods by using both the raw classification scores and the observed classification accuracies for different class-set sizes to calibrate its predictions. In Section 5 we verify the performance of CleaneX on simulations and real datasets. We find that it achieves better overall predictions of the expected accuracy, and very few "large" errors, compared to its competitors. We discuss the implications, and how the method can be used by practitioners, in Section 6.
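As a simple numerical illustration of the rAUC, the sketch below estimates it from a matrix of classification scores by averaging, over data points, the fraction of incorrect classes that the correct class outscores (the quantity C_x of Section 1.1). This naive pairwise estimator and its helper names are assumptions for illustration only, not the estimation procedure developed in the paper.

```python
import numpy as np

def c_x(correct_score, incorrect_scores):
    """Fraction of incorrect classes outscored by the correct class of a
    single data point x (ties counted as half, the Mann-Whitney way)."""
    inc = np.asarray(incorrect_scores, dtype=float)
    return ((correct_score > inc).sum() + 0.5 * (correct_score == inc).sum()) / len(inc)

def r_auc(score_matrix, labels):
    """Naive rAUC estimate: score_matrix[i, j] holds S_{y_j}(x_i) and
    labels[i] is the column of the correct class of x_i. Averaging C_x
    over data points estimates the expected accuracy of a binary
    classifier on a randomly drawn pair of classes."""
    vals = []
    for i, yi in enumerate(labels):
        row = np.asarray(score_matrix[i], dtype=float)
        vals.append(c_x(row[yi], np.delete(row, yi)))
    return float(np.mean(vals))

# Two data points, three classes: the first point's correct class is
# outscored by one incorrect class, the second point wins both comparisons.
scores = np.array([[0.9, 0.2, 0.95],
                   [0.1, 0.8, 0.3]])
print(r_auc(scores, [0, 1]))  # 0.75
```

Note that this quantity is computed over *classes* for each fixed data point, which is the role reversal that distinguishes the rROC from the common ROC.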

1.1. PRELIMINARIES AND NOTATION

In this work x denotes data points and y denotes classes; when referred to as random variables they are denoted X and Y, respectively. We denote by y(x) the correct class of x, and write y* when x is implicitly understood. Similarly, we denote by ȳ an incorrect class of x. We assume that for each x and y the classifier h assigns a score S_y(x), such that the predicted class of x is argmax_y S_y(x). On a given dataset of k classes, {y_1, ..., y_k}, the accuracy of the trained classifier h is the probability that it assigns the highest score to the correct class, A(y_1, ..., y_k) = P_X(S_{y*}(x) ≥ max_{i=1,...,k} S_{y_i}(x)), where P_X is the distribution of the data points x in the sample of classes. Since r points are sampled from each class, P_X assumes a uniform distribution over the classes within the given sample. An important quantity for a data point x is the probability of the correct class y* to outscore a randomly chosen incorrect class Ȳ ∼ P_{Ȳ|Y=y*}, that is, C_x = P_Ȳ(S_{y*}(x) ≥ S_Ȳ(x)). This is the cumulative distribution function of the incorrect scores, evaluated at the value of the correct score. We denote the expected accuracy over all possible subsets of k classes from Y by E_k[A] and its estimator by Ê_k[A]. We refer to the curve of E_k[A] at different values of k ≥ 2 as the accuracy curve. Given a sample of K classes, the average accuracy over all subsets of k ≤ K classes from the sample is denoted Ā^K_k.
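The empirical counterparts of A(y_1, ..., y_k) and Ā^K_k can be computed directly from a score matrix. The sketch below is a minimal illustration under assumed conventions (rows are data points, columns are classes, one point per class shown for brevity); it enumerates all k-subsets exhaustively, whereas in practice one would subsample subsets for large K.

```python
from itertools import combinations
import numpy as np

def accuracy(score_matrix, labels, class_subset):
    """Empirical A(y_1, ..., y_k) restricted to `class_subset`: a data
    point is counted correct when its true class attains the maximal
    score within the subset. Only points whose true class belongs to the
    subset are evaluated (P_X is uniform over the sampled classes)."""
    subset = list(class_subset)
    hits, total = 0, 0
    for i, yi in enumerate(labels):
        if yi not in subset:
            continue
        total += 1
        if score_matrix[i, yi] >= score_matrix[i, subset].max():
            hits += 1
    return hits / total

def avg_accuracy(score_matrix, labels, k):
    """\bar{A}^K_k: average accuracy over all k-subsets of the K sampled
    classes."""
    K = score_matrix.shape[1]
    accs = [accuracy(score_matrix, labels, s) for s in combinations(range(K), k)]
    return float(np.mean(accs))

# Three points, three classes; the first point is misclassified whenever
# class 2 is present, so accuracy drops as the class set grows.
scores = np.array([[0.9, 0.2, 0.95],
                   [0.1, 0.8, 0.3],
                   [0.2, 0.1, 0.7]])
print(avg_accuracy(scores, [0, 1, 2], 2))  # 0.8333...
print(accuracy(scores, [0, 1, 2], [0, 1, 2]))  # 0.6666...
```

The gap between the two printed values is a small instance of the phenomenon studied here: the expected accuracy decreases as k grows from 2 to 3.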



¹ Code is publicly available at: https://github.com/YuliSl/CleaneX

