APPROXIMATE CONDITIONAL COVERAGE & CALIBRATION VIA NEURAL MODEL APPROXIMATIONS

Abstract

A minimal required desideratum for quantifying the uncertainty from a classification model as a prediction set is class-conditional singleton set calibration. That is, such sets should map to the output of well-calibrated selective classifiers, matching the observed frequencies of similar instances. Recent works proposing adaptive and localized conformal p-values for deep networks do not guarantee this behavior, nor do they achieve it empirically. Instead, we use the strong signals for prediction reliability from KNN-based approximations of Transformer networks to construct data-driven partitions for Mondrian Conformal Predictors, which are treated as weak selective classifiers that are then calibrated via a new Inductive Venn Predictor, the VENN-ADMIT Predictor. The resulting selective classifiers are well-calibrated, in a conservative but practically useful sense for a given threshold, unlike conformal sets. They are inherently robust to changes in the proportions of the data partitions, and straightforward conservative heuristics provide additional robustness to covariate shifts. We compare and contrast to the quantities produced by recent Conformal Predictors on several representative and challenging natural language processing classification tasks, including class-imbalanced and distribution-shifted settings.

1. INTRODUCTION

Uncertainty quantification is hard. The problem of the reference class (see, e.g., Vovk et al., 2005, p. 159) necessitates task-specific care in interpreting even well-calibrated probabilities that agree with the observed frequencies. It is made harder in practice with deep neural networks, for which the otherwise strong blackbox point predictions are typically not well-calibrated and can unexpectedly under-perform over distribution shifts. And it is harder still for classification, given that the promising distribution-free approach of split-conformal inference (Vovk et al., 2005; Papadopoulos et al., 2002), an assumption-light frequentist approach suitable when sample sizes are sufficiently large, produces a counterintuitive p-value quantity in the case of classification (cf. regression).

Setting. In a typical natural language processing (NLP) binary or multi-class classification task, we have access to a computationally expensive blackbox neural model, F; a training dataset, D_tr = {Z_i}_{i=1}^{I} = {(X_i, Y_i)}_{i=1}^{I} of |D_tr| = I instances paired with their corresponding ground-truth discrete labels, Y_i ∈ Y = {1, . . . , C}; and a held-out labeled calibration dataset, D_ca = {Z_j}_{j=I+1}^{N=I+J} of |D_ca| = J instances. We are then given a new test instance, X_{N+1}, from an unlabeled test set, D_te. One approach to conveying uncertainty in the predictions is to construct a prediction set, produced by some set-valued function Ĉ(X_{N+1}) ∈ 2^Y, containing the true unseen label with a specified level 1 − α ∈ (0, 1) on average. We consider two distinct interpretations: as coverage, and as a conservatively coarsened calibrated probability (after conversion to selective classification), both from a frequentist perspective.

Desiderata. For such prediction sets to be of general interest for classification, we seek class-conditional singleton set calibration (CCS).
We are willing to accept noise in other size stratifications, but the singleton sets, |Ĉ(X_{N+1})| = 1, must contain the true value with a proportion of ≥ 1 − α, at least on average per class. We further seek singleton set sharpness; that is, to maximize the number of singleton sets. We seek reasonable robustness to distribution shifts. Finally, we seek informative sets that avoid the trivial solution of full cardinality.
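To make the setting and the CCS desideratum concrete, the following is a minimal sketch, not the paper's VENN-ADMIT or Mondrian construction: a standard split-conformal prediction set built from softmax scores on synthetic data (the toy score generator and all sizes are assumptions for illustration), followed by the diagnostic the desiderata describe, i.e., whether singleton sets for each class contain the true label with frequency at least 1 − α.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    # Numerically stable row-wise softmax.
    e = np.exp(z - z.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

C = 3            # number of classes, Y = {0, ..., C-1}
J, M = 500, 200  # calibration (D_ca) and test (D_te) sizes
alpha = 0.1      # target miscoverage level

# Toy stand-in for the blackbox model F: logits weakly separated by label.
y_ca = rng.integers(0, C, J)
y_te = rng.integers(0, C, M)
logits_ca = rng.normal(0.0, 1.0, (J, C)); logits_ca[np.arange(J), y_ca] += 2.0
logits_te = rng.normal(0.0, 1.0, (M, C)); logits_te[np.arange(M), y_te] += 2.0
p_ca, p_te = softmax(logits_ca), softmax(logits_te)

# Split-conformal nonconformity score: 1 - softmax prob of the true label.
scores = 1.0 - p_ca[np.arange(J), y_ca]
# Finite-sample-corrected (1 - alpha) quantile of the calibration scores.
qhat = np.quantile(scores, np.ceil((J + 1) * (1 - alpha)) / J, method="higher")

# Prediction set C_hat(X): every label whose score falls below the threshold.
sets = (1.0 - p_te) <= qhat  # boolean (M, C) membership matrix

# Marginal coverage, then the CCS diagnostic: restrict to singleton sets
# and check per-class accuracy against 1 - alpha.
covered = sets[np.arange(M), y_te]
print("marginal coverage:", covered.mean())
singleton = sets.sum(axis=1) == 1
for c in range(C):
    mask = singleton & (sets.argmax(axis=1) == c)
    if mask.any():
        print(f"class {c}: singleton accuracy {covered[mask].mean():.2f} "
              f"over {int(mask.sum())} singleton sets")
```

As the abstract notes, the split-conformal guarantee is only marginal: the per-class singleton accuracies printed here can fall below 1 − α even when overall coverage holds, which is precisely the gap the CCS desideratum targets.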

