EFFICIENT CONFORMAL PREDICTION VIA CASCADED INFERENCE WITH EXPANDED ADMISSION

Abstract

In this paper, we present a novel approach for conformal prediction (CP), in which we aim to identify a set of promising prediction candidates, in place of a single prediction. This set is guaranteed to contain a correct answer with high probability, and is well-suited for many open-ended classification tasks. In the standard CP paradigm, the predicted set can often be unusably large and also costly to obtain. This is particularly pervasive in settings where the correct answer is not unique, and the number of total possible answers is high. We first expand the CP correctness criterion to allow for additional, inferred "admissible" answers, which can substantially reduce the size of the predicted set while still providing valid performance guarantees. Second, we amortize costs by conformalizing prediction cascades, in which we aggressively prune implausible labels early on by using progressively stronger classifiers, again while still providing valid performance guarantees. We demonstrate the empirical effectiveness of our approach for multiple applications in natural language processing and computational chemistry for drug discovery.

1. INTRODUCTION

The ability to provide precise performance guarantees is critical to many classification tasks (Amodei et al., 2016; Jiang et al., 2012; 2018). Yet, achieving perfect accuracy with only single guesses is often out of reach due to noise, limited data, insufficient modeling capacity, or other pitfalls. Nevertheless, in many applications it can be more feasible, and ultimately just as useful, to hedge predictions by having the classifier return a set of plausible options, one of which is likely to be correct. Consider the example of information retrieval (IR) for fact verification. Here the goal is to retrieve a snippet of text at some granularity (e.g., a sentence, paragraph, or article) that can be used to verify a given claim. Large resources, such as Wikipedia, can contain millions of candidate snippets, many of which may independently serve as viable evidence. A good retriever should make precise snippet suggestions quickly, but without excessively sacrificing sensitivity (i.e., recall). Conformal prediction (CP) is a methodology for placing exactly that sort of bet on which candidates to retain (Vovk et al., 2005). Concretely, suppose we are given n examples, (X_i, Y_i) ∈ X × Y, i = 1, ..., n, as training data, drawn exchangeably from an underlying distribution P. For instance, in our IR setting, X_i would be the claim in question, Y_i a viable piece of evidence that supports or refutes it, and Y a large corpus (e.g., Wikipedia). Next, let X_{n+1} be a new exchangeable test example (e.g., a new claim to verify) for which we would like to predict the paired label Y_{n+1} ∈ Y. The aim of conformal prediction is to construct a set of candidates C_n(X_{n+1}) likely to contain Y_{n+1} (e.g., the relevant evidence) with distribution-free marginal coverage at a tolerance level ε ∈ (0, 1):

P(Y_{n+1} ∈ C_n(X_{n+1})) ≥ 1 − ε,  for all distributions P.
(1)

The marginal probability in Eq. 1 is taken over all n + 1 calibration and test points {(X_i, Y_i)}_{i=1}^{n+1}. A classifier is considered valid if the frequency of error, Y_{n+1} ∉ C_n(X_{n+1}), does not exceed ε. In our IR setting, this would mean including the correct snippet at least a 1 − ε fraction of the time. Not all valid classifiers, however, are particularly useful (e.g., a trivial classifier that merely returns all possible outputs). A classifier is considered to have good predictive efficiency if E[|C_n(X_{n+1})|] is small (i.e., ≪ |Y|). In our IR setting, this would mean not returning too many irrelevant articles, or in IR terms, maximizing precision while holding recall at ≥ 1 − ε (assuming Y_{n+1} is a single answer). In practice, in domains where the number of outputs to choose from is large and the "correct" one is not necessarily unique, classifiers derived using conformal prediction can suffer dramatically from both poor predictive and computational efficiency (Burnaev and Vovk, 2014; Vovk et al., 2016; 2020). Unfortunately, these two conditions tend to be compounding: large label spaces Y both (1) often place strict constraints on the set of tractable model classes available for consideration, and (2) frequently contain multiple clusters of labels that are difficult to discriminate between, especially for a low-capacity classifier. In this paper, we present two effective methods for improving the efficiency of conformal prediction for classification tasks with large output spaces Y, in which several y ∈ Y might be admissible, i.e., acceptable for the purposes of our given task.

Figure 1: A demonstration of our conformalized cascade for m-step inference with set-valued outputs, here on an IR for claim verification task. The number of considered articles is reduced at every level: red frames are filtered, while green frames pass on. We only care about retrieving at least one of the admissible articles (starred) for resolving the claim.
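The coverage guarantee above can be obtained with the standard split-conformal recipe: score each calibration example with a nonconformity measure, take an adjusted empirical quantile of those scores, and keep every candidate label for the test point that scores below that threshold. A minimal sketch in Python, using synthetic data (all names and the choice of nonconformity score are ours, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)

def calibrate_threshold(cal_scores, epsilon):
    """Split-CP threshold: the ceil((n + 1)(1 - epsilon)) / n empirical
    quantile of the calibration nonconformity scores (higher = less conforming)."""
    n = len(cal_scores)
    level = min(np.ceil((n + 1) * (1 - epsilon)) / n, 1.0)
    return np.quantile(cal_scores, level, method="higher")

def predict_set(test_scores, tau):
    """Keep every candidate label whose nonconformity score is <= tau."""
    return {y for y, s in enumerate(test_scores) if s <= tau}

# Synthetic calibration data; here nonconformity = 1 - predicted probability.
n_cal, n_labels = 1000, 10
probs = rng.dirichlet(np.ones(n_labels), size=n_cal)
labels = np.array([rng.choice(n_labels, p=p) for p in probs])
cal_scores = 1.0 - probs[np.arange(n_cal), labels]

tau = calibrate_threshold(cal_scores, epsilon=0.1)
prediction_set = predict_set(1.0 - rng.dirichlet(np.ones(n_labels)), tau)
```

Under exchangeability, the set {y : score(y) ≤ τ} then contains the true label with probability at least 1 − ε, marginally over the calibration and test draws.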
First, in Section 4 we describe a generalization of Eq. 1 to an expanded admission criterion, in which C_n(X_{n+1}) is considered valid if it contains at least one admissible y with high probability. For example, in our IR setting, given the claim "Michael Collins took part in the Apollo mission to the moon," any of the articles "Apollo 11," "Michael Collins (astronaut)," or "Apollo 11 (2019 film)" contains enough information to independently support it (see Figure 1), and all are therefore admissible. When Y_{n+1} is not unique, forcing the classifier to hedge for the worst case, in which a specific realization of Y_{n+1} must be contained in C_n(X_{n+1}), is too strict and can lead to conservative predictions. We show, theoretically and empirically, that optimizing for the expanded admission criterion yields classifiers with significantly better predictive efficiency.

Second, in Section 5 we present a technique for conformalizing prediction cascades, which progressively filter the number of candidates with a sequence of increasingly complex classifiers. This allows us to balance predictive efficiency with computational efficiency during inference. Importantly, we also show theoretically that, in contrast to other similarly motivated pipelines, our method filters the output space in a manner that still guarantees marginal coverage. Figure 1 illustrates our combined approach. Together, these two approaches serve as complementary pieces of the puzzle towards making CP more efficient. We empirically validate our approach on information retrieval for fact verification, open-domain question answering, and in-silico screening for drug discovery.

Contributions. In summary, our main results are as follows:

• A theoretical extension of validity (Eq. 1) to allow for inferred admissible answers.

• A principled framework for conformalizing computationally efficient prediction cascades.
• Consistent empirical gains on three diverse tasks, demonstrating up to 4.6× better predictive efficiency AUC (measured across all ε) when calibrating for expanded admission, with computation pruning factors of up to 1/m, where m is the number of models, when using prediction cascades.
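To make the expanded admission idea concrete, one plausible way to calibrate it (this is our illustrative reading, not necessarily the paper's exact procedure from Section 4) is to let each calibration example contribute the minimum nonconformity score over its whole admissible set, rather than the score of one designated reference label. The threshold then only needs to be large enough to admit some acceptable answer per example:

```python
import numpy as np

def expanded_admission_threshold(score_rows, admissible_sets, epsilon):
    """Calibrate on min-over-admissible nonconformity scores: the resulting
    prediction set {y : score(y) <= tau} contains at least one admissible
    label with probability >= 1 - epsilon (under exchangeability)."""
    min_scores = np.array([row[list(adm)].min()
                           for row, adm in zip(score_rows, admissible_sets)])
    n = len(min_scores)
    level = min(np.ceil((n + 1) * (1 - epsilon)) / n, 1.0)
    return np.quantile(min_scores, level, method="higher")

# Toy calibration set: 4 examples, 5 labels, nonconformity scores in [0, 1].
rows = np.array([
    [0.9, 0.1, 0.8, 0.2, 0.7],   # admissible labels: {1, 3}
    [0.3, 0.9, 0.9, 0.9, 0.9],   # admissible labels: {0}
    [0.9, 0.9, 0.2, 0.9, 0.4],   # admissible labels: {2, 4}
    [0.9, 0.9, 0.9, 0.1, 0.9],   # admissible labels: {3}
])
admissible = [{1, 3}, {0}, {2, 4}, {3}]
tau = expanded_admission_threshold(rows, admissible, epsilon=0.2)
```

Because the min-over-admissible score is never larger than the score of any single fixed reference label, the calibrated threshold can only shrink, which is the mechanism behind the smaller prediction sets reported above.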

2. RELATED WORK

Confident prediction. Methods for obtaining precise uncertainty estimates have received intense interest in recent years. A significant body of work is concerned with calibrating model confidence-

