APPROXIMATE CONDITIONAL COVERAGE & CALIBRATION VIA NEURAL MODEL APPROXIMATIONS

Abstract

A minimal required desideratum for quantifying the uncertainty from a classification model as a prediction set is class-conditional singleton set calibration. That is, such sets should map to the output of well-calibrated selective classifiers, matching the observed frequencies of similar instances. Recent works proposing adaptive and localized conformal p-values for deep networks do not guarantee this behavior, nor do they achieve it empirically. Instead, we use the strong signals for prediction reliability from KNN-based approximations of Transformer networks to construct data-driven partitions for Mondrian Conformal Predictors, which are treated as weak selective classifiers that are then calibrated via a new Inductive Venn Predictor, the VENN-ADMIT Predictor. The resulting selective classifiers are well-calibrated, in a conservative but practically useful sense for a given threshold, unlike conformal sets. They are inherently robust to changes in the proportions of the data partitions, and straightforward conservative heuristics provide additional robustness to covariate shifts. We compare and contrast to the quantities produced by recent Conformal Predictors on several representative and challenging natural language processing classification tasks, including class-imbalanced and distribution-shifted settings.

1. INTRODUCTION

Uncertainty quantification is hard. The problem of the reference class (see, e.g., Vovk et al., 2005, p. 159) necessitates task-specific care in interpreting even well-calibrated probabilities that agree with the observed frequencies. It is made harder in practice with deep neural networks, whose otherwise strong blackbox point predictions are typically not well-calibrated and can unexpectedly under-perform under distribution shifts. And it is harder still for classification, given that the promising distribution-free approach of split-conformal inference (Vovk et al., 2005; Papadopoulos et al., 2002), an assumption-light frequentist approach suitable when sample sizes are sufficiently large, produces a counterintuitive p-value quantity in the case of classification (cf. regression).

Setting. In a typical natural language processing (NLP) binary or multi-class classification task, we have access to a computationally expensive blackbox neural model, $F$; a training dataset, $\mathcal{D}_{tr} = \{Z_i\}_{i=1}^{I} = \{(X_i, Y_i)\}_{i=1}^{I}$, of $|\mathcal{D}_{tr}| = I$ instances paired with their corresponding ground-truth discrete labels, $Y_i \in \mathcal{Y} = \{1, \ldots, C\}$; and a held-out labeled calibration dataset, $\mathcal{D}_{ca} = \{Z_j\}_{j=I+1}^{N=I+J}$, of $|\mathcal{D}_{ca}| = J$ instances. We are then given a new test instance, $X_{N+1}$, from an unlabeled test set, $\mathcal{D}_{te}$. One approach to convey uncertainty in the predictions is to construct a prediction set, produced by some set-valued function $\hat{C}(X_{N+1}) \in 2^{C}$, containing the true unseen label at a specified level $1-\alpha \in (0, 1)$ on average. We consider two distinct interpretations: as coverage, and as a conservatively coarsened calibrated probability (after conversion to selective classification), both from a frequentist perspective.

Desiderata. For such prediction sets to be of general interest for classification, we seek class-conditional singleton set calibration (CCS).
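Under one concrete reading of this desideratum (a minimal sketch for illustration, not the paper's evaluation code): among singleton sets that predict a given class, the empirical frequency of the true label should be at least $1-\alpha$, and sharpness can be taken as the overall fraction of singleton sets. The `pred_sets` and `labels` inputs below are hypothetical:

```python
import numpy as np

def singleton_metrics(pred_sets, labels, n_classes):
    """Per-class singleton coverage and overall singleton sharpness.

    pred_sets: list of sets of class indices, one per instance.
    labels: true class index per instance.
    Coverage for class c is computed among singleton sets predicting c.
    """
    coverage = {}
    for c in range(n_classes):
        hits, total = 0, 0
        for s, y in zip(pred_sets, labels):
            if len(s) == 1 and c in s:
                total += 1
                hits += int(y == c)
        coverage[c] = hits / total if total else float("nan")
    # Sharpness: fraction of sets that are singletons.
    sharpness = float(np.mean([len(s) == 1 for s in pred_sets]))
    return coverage, sharpness
```

For instance, with sets `[{0}, {0}, {1}, {0, 1}]` and true labels `[0, 1, 1, 0]`, class 0 has two singletons of which one is correct (coverage 0.5), class 1 has one correct singleton (coverage 1.0), and sharpness is 0.75.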
We are willing to accept noise in other size stratifications, but the singleton sets, $|\hat{C}| = 1$, must contain the true value with a proportion of $\geq 1-\alpha$, at least on average per class. We further seek singleton set sharpness; that is, to maximize the number of singleton sets. We seek reasonable robustness to distribution shifts. Finally, we seek informative sets that avoid the trivial solution of full cardinality. If we are willing to fully dispense with specificity in the non-singleton-set stratifications for tasks with $|\mathcal{Y}| > 2$, our desiderata can be achieved, in principle, with selective classifiers.

Definition 1 (Classification with reject option). A selective classifier, $g: \mathcal{X} \to \mathcal{Y} \cup \{\perp\}$, maps from the input to either a single class or the reject option (represented here with the falsum symbol).

Remark 1 (Prediction sets are selective classifications). The output of any set-valued function $\hat{C}(X_{N+1}) \in 2^{C}$ corresponds to that of a selective classifier: map non-singleton sets, $|\hat{C}(X_{N+1})| \neq 1$, to $\perp$; map all singleton sets to the corresponding class in $\mathcal{Y}$.

To date, the typical approach for constructing prediction sets is not via methods for calibrating probabilities, but rather via the hypothesis testing framework of Conformal Predictors, which carry a PAC-style $(\alpha, \delta)$-valid coverage guarantee. In the inductive (or "split") conformal formulation (Vovk, 2012; Papadopoulos et al., 2002, inter alia), the p-value corresponds to confidence that a new point is as or more conforming than a held-out set with known labels. More specifically, we require a measurable function $A: \mathcal{Z}^I \times \mathcal{Z} \to \mathbb{R}$, which measures the conformity between $z$ and other instances. For example, given the softmax output of a neural network for $x$, $\pi \in \mathbb{R}^C$, with $\pi_y$ as the output of the true class, $A((z_1, \ldots, z_I), (x, y)) := \pi_y$ is a typical choice. We construct a p-value, $v^{\hat{y}}$, as follows: $v^{\hat{y}} := \frac{|\{j = I+1, \ldots, N \mid \tau_j \leq \tau_{N+1}\}| + 1}{J + 1}$, where $\tau_j := A((z_1, \ldots, z_I), z_j)$, $\forall z_j \in \mathcal{D}_{ca}$, and $\tau_{N+1} := A((z_1, \ldots, z_I), (x_{N+1}, \hat{y}_{N+1}))$, where we suppose the true label is $\hat{y}$. We then construct the prediction set: $\hat{C}(x_{N+1}) = \{\hat{y} : v^{\hat{y}} > \alpha\}$. This is accompanied by a finite-sample, distribution-free coverage guarantee, which we state informally here.[1]

Theorem 1 (Marginal Coverage of Conformal Predictors (Vovk et al., 2005)). Provided the points of $\mathcal{D}_{ca}$ and $\mathcal{D}_{te}$ are drawn exchangeably from the same distribution $P_{XY}$ (which need not be further specified), the following marginal guarantee holds for a given $\alpha$: $P\left(Y_{N+1} \in \hat{C}(X_{N+1})\right) \geq 1-\alpha$.

The distribution of split-conformal coverage is Beta distributed (Vovk, 2012), from which a PAC-style $(\alpha, \delta)$-validity guarantee can be obtained, and from which we can determine a suitable sample size to achieve this coverage in expectation. Unfortunately, this does not guarantee singleton set coverage (the hypothesis testing analogue of our CCS desideratum), a known, but under-appreciated, negative result that motivates the present work:

Corollary 1. Conformal Predictors do not guarantee singleton set coverage. If they did, it would imply a stronger than marginal coverage guarantee.

Existing approaches. Empirically, Conformal Predictors are weak selective classifiers, limiting their real-world utility. We show this problem is not resolved by re-weighting the empirical CDF near a test point (Guan, 2022), nor by applying separate per-class hypothesis tests, nor by APS conformal score functions (Romano et al., 2020), nor by adaptive regularization, RAPS (Angelopoulos et al., 2021), and occurs even on in-distribution data.

Solution. In the present work, we demonstrate, with a focus on Transformer networks (Vaswani et al., 2017), first that a closer notion of approximate conditional coverage, obtained via the stronger validity guarantees of Mondrian Conformal Predictors, is not sufficient in itself to achieve our desiderata.
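The split-conformal construction and the set-to-selective-classifier mapping of Remark 1 can be sketched in a few lines. This is a minimal illustration under assumptions, not the paper's implementation: `cal_scores` stands for the held-out conformity scores $\tau_j$ (softmax probability of the true class), and `None` stands in for the reject option $\perp$:

```python
import numpy as np

def conformal_pvalue(cal_scores, test_score):
    """Split-conformal p-value: fraction of held-out conformity
    scores (plus the test point itself) at or below the test score."""
    return (np.sum(cal_scores <= test_score) + 1) / (len(cal_scores) + 1)

def prediction_set(cal_scores, test_probs, alpha):
    """C(x) = {y : v_y > alpha}, with the softmax probability of the
    hypothesized label as the conformity score (marginal conformal)."""
    return {y for y, p in enumerate(test_probs)
            if conformal_pvalue(cal_scores, p) > alpha}

def to_selective(pred_set):
    """Remark 1: a singleton set maps to its class; any other set
    maps to the reject option (None)."""
    return next(iter(pred_set)) if len(pred_set) == 1 else None
```

For example, with calibration scores `[0.9, 0.8, 0.7, 0.6]` and test softmax `[0.95, 0.05]`, the p-values are 1.0 and 0.2, so at alpha = 0.2 the set is the singleton `{0}`, while at alpha = 0.1 it is `{0, 1}`, which the selective classifier rejects.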
Instead, we treat such Conformal Predictors as weak selective classifiers, which serve as the underlying learner to construct a taxonomy for a Venn Predictor (Vovk et al., 2003) , a valid multi-probability calibrator. This is enabled by data-driven partitions determined by KNN (Devroye et al., 1996) approximations, which themselves encode strong signals for prediction reliability. The result is a principled, well-calibrated selective classifier, with a sharpness suitable even for highly imbalanced, low-accuracy settings, and with at least modest robustness to covariate shifts.
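The Venn-prediction step can be sketched as follows (hypothetical names; the KNN-based VENN-ADMIT taxonomy construction itself is not reproduced here): each calibration point is assigned to a taxonomy category, and for each hypothesized test label, the empirical label distribution in the test point's category (including the test point) yields one row of the multi-probability output:

```python
from collections import defaultdict
import numpy as np

def venn_predict(cal_categories, cal_labels, test_category_fn, x, n_classes):
    """Inductive Venn Predictor sketch.

    test_category_fn(x, y) assigns the test point, with hypothesized
    label y, to a taxonomy category. Returns one empirical label
    distribution (row) per hypothesized label.
    """
    by_cat = defaultdict(list)
    for k, y in zip(cal_categories, cal_labels):
        by_cat[k].append(y)
    rows = []
    for y_hyp in range(n_classes):
        k = test_category_fn(x, y_hyp)
        # Include the test point itself with its hypothesized label.
        labels = by_cat[k] + [y_hyp]
        counts = np.bincount(labels, minlength=n_classes)
        rows.append(counts / counts.sum())
    return np.array(rows)
```

The lower and upper probabilities for a class are then the column-wise minimum and maximum over the returned rows.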

2. MONDRIAN CONFORMAL PREDICTORS AND VENN PREDICTORS

A stronger than marginal coverage guarantee can be obtained by Mondrian Conformal Predictors (Vovk et al., 2005), which guarantee coverage within partitions of the data, including conditioned on the labels. Such Predictors are not sufficient for obtaining our desiderata, but serve as a principled approach for constructing a Venn taxonomy with a desirable balance between specificity and generality (i.e., over-fitting vs. under-fitting), the classic problem of the reference class.
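For intuition, a label-conditional Mondrian p-value (one common choice of Mondrian taxonomy, shown here as a hedged sketch rather than the paper's partitioning) compares the test conformity score only against calibration points sharing the hypothesized label:

```python
import numpy as np

def mondrian_pvalue(cal_scores_by_class, test_score, y_hyp):
    """Label-conditional p-value: rank the test score only among
    calibration scores whose true label equals y_hyp."""
    scores = cal_scores_by_class[y_hyp]
    return (np.sum(scores <= test_score) + 1) / (len(scores) + 1)

def mondrian_set(cal_scores_by_class, test_probs, alpha):
    """Class-conditional prediction set from per-class p-values."""
    return {y for y, p in enumerate(test_probs)
            if mondrian_pvalue(cal_scores_by_class, p, y) > alpha}
```

With per-class calibration scores `{0: [0.9, 0.8], 1: [0.3, 0.2]}` and test softmax `[0.95, 0.1]`, the class-0 p-value is 1.0 and the class-1 p-value is 1/3, so at alpha = 0.5 only class 0 survives, yielding coverage that holds within each label partition rather than only marginally.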



[1] We omit the randomness component, which is not practically relevant at the sample sizes considered here.

