CONFORMAL PREDICTION IS ROBUST TO LABEL NOISE

Abstract

We study the robustness of conformal prediction-a powerful tool for uncertainty quantification-to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. With both theory and experiments, we argue that conformal prediction with noisy labels conservatively covers the clean ground truth labels except in adversarial cases. This leads us to believe that correcting for label noise is unnecessary except for pathological data distributions or noise sources. In such cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure correct coverage of the ground truth labels without score or data regularity.

1. INTRODUCTION

In most supervised classification and regression tasks, one would assume the provided labels reflect the ground truth. In reality, this assumption is often violated; see (Cheng et al., 2022; Xu et al., 2019; Yuan et al., 2018; Lee & Barber, 2022; Cauchois et al., 2022) . For example, doctors labeling the same medical image may have different subjective opinions about the diagnosis, leading to variability in the ground truth label itself. In other settings, such variability may arise due to sensor noise, data entry mistakes, the subjectivity of a human annotator, or many other sources. In other words, the labels we use to train machine learning (ML) models may often be noisy in the sense that these are not necessarily the ground truth. Quantifying the prediction uncertainty is crucial in highstakes applications in general, and especially so in settings where the training data is inexact. We aim to investigate uncertainty quantification in this challenging noisy setting via conformal prediction, a framework that uses hold-out calibration data to construct prediction sets that are guaranteed to contain the ground truth labels; see (Vovk et al., 2005; Angelopoulos & Bates, 2021) . In short, this paper shows that conformal prediction typically yields confidence sets with conservative coverage when the hold-out calibration data has noisy labels. We adopt a variation of the standard conformal prediction setup. Consider a calibration data set of i.i.d. observations {(X i , Y i )} n i=1 sampled from an arbitrary unknown distribution P XY . Here, X i ∈ R p is the feature vector that contains p features for the ith sample, and Y i denotes its response, which can be discrete for classification tasks or continuous for regression tasks. Given the calibration dataset, an i.i.d. test data point (X test , Y test ), and a pre-trained model f , conformal prediction constructs a set C(X test ) that contains the unknown test response, Y test , with high probability, e.g., 90%. That is, for a user-specified level α ∈ (0, 1), P Y test ∈ C(X test ) ≥ 1 -α. (1) This property is called marginal coverage, where the probability is defined over the calibration and test data. In the setting of label noise, we only observe the corrupted labels Ỹi = g(Y i ) for some corruption function g : Y×[0, 1] → Y, so the i.i.d. assumption and marginal coverage guarantee are invalidated. The corruption is random; we will always take the second argument of g to be a random seed U uniformly distributed on [0, 1]. To ease notation, we leave the second argument implicit henceforth. Nonetheless, using the noisy calibration data, we seek to form a prediction set C noisy (X test ) that covers the clean, uncorrupted test label, Y test . More precisely, our goal is to delineate when it is possible to provide guarantees of the form P Y test ∈ C noisy (X test ) ≥ 1 -α, where the probability is taken jointly over the calibration data, test data, and corruption function (this will be the case for the remainder of the paper). Our theoretical vignettes and experiments suggest that in realistic situations, (2) is usually satisfied. That is, even with access only to noisy labels, conformal prediction yields confidence sets that have conservative coverage on clean labels. There are a few failure cases involving adversarial noise that we discuss, but in general we argue that a user should feel safe deploying conformal prediction even with noisy labels.

MOTIVATIONAL EXAMPLE

As a real-world example of label noise, we conduct an image classification experiment where we only observe one annotater's label but seek to cover the majority vote of many annotators. 2020), which contains 10,000 images labeled by approximately 50 annotators. We calibrate using only a single annotator and seek to cover the majority vote of the 50. The single annotator differs from the ground truth labels in approximately 5% of the images. Using the noisy calibration set (i.e., a calibration set containing these noisy labels), we applied vanilla conformal prediction as if the data were i.i.d, and studied the performance of the resulting prediction sets. Details regarding the training procedure can be found in section 4.2. The fraction of majority vote labels covered is demonstrated in Figure 1 . As we can see, when using the clean calibration set the marginal coverage is 90%, as expected. When using the noisy calibration set, the coverage increases to approximately 93%. Figure 1 also demonstrates the prediction sets that are larger when calibrating with noisy labels. This experiment demonstrates the main intuition behind our paper: adding noise will usually increase the variability in the labels, leading to larger prediction sets that retain the coverage property. 

2. THEORETICAL ANALYSIS

In this section we show mathematically that under stylized settings and some regularity conditions, the marginal coverage guarantee (1) of conformal prediction persists even when the labels used for calibration are noisy; i.e., (2) holds. In Sections 3 and 4 we support this argument with realistic experiments. Towards that end, we now give more details on the conformal prediction algorithm. As explained in the introduction, conformal prediction uses a held-out calibration data set and a pre-trained model to construct the prediction set on a new data point. More formally, we use the model f to construct a score function, s : X × Y → R, which is engineered to be large when the model is uncertain and small otherwise. Abbreviate the scores on each calibration data point as s i = s(X i , Y i ) for each i = 1, ..., n. Conformal prediction tells us that we can achieve a marginal coverage guarantee by picking qclean = s (⌈(n+1)(1-α)⌉) as the ⌈(n + 1)(1 -α)⌉-smallest of the calibration scores and constructing the prediction sets as C (X test ) = {y ∈ Y : s (X test , y) ≤ qclean } . We will introduce different score functions for both classification and regression as needed in the following subsections. In this paper, we do not allow ourselves access to the calibration labels, only their noisy versions, Ỹ1 , . . . , Ỹn , so we cannot calculate qclean . Instead, we can calculate the noisy quantile qnoisy as the ⌈(n + 1)(1 -α)⌉-smallest of the noisy score functions, si = s(X i , Ỹi ). The main formal question of our work is whether the resulting prediction set, C noisy (X test ) = {y : s(X test , y) ≤ qnoisy }, covers the clean label as in (2). We state this general recipe algorithmically for future reference: Recipe 1 (Conformal prediction with noisy labels). 1. Consider i.i.d. data points (X 1 , Y 1 ), . . . , (X n , Y n ), (X test , Y test ), a corruption model g : Y → Y, and a score function s : X × Y → R. 2. Compute the conformal quantile with the corrupted labels, qnoisy = Quantile (n + 1)(1 -α) n , {s(X i , Ỹi )} n i=1 , where Ỹi = g(Y i ). 3. Construct the prediction set using the noisy conformal quantile, C noisy = {y : s(X test , y) ≤ qnoisy }. This recipe will produce valid prediction sets whenever the noisy score distribution stochastically dominates the clean score distribution. The intuition is that the noise distribution 'spreads out' the distribution of the score function such that qnoisy is (stochastically) larger than qclean . Theorem 1. Assume that P(s test ≤ t) ≤ P(s test ≤ t). Then, P Y test ∈ C noisy (X test ) ≥ 1 -α. Furthermore, for any u satisfying P(s test ≤ t) + u ≥ P(s test ≤ t), then P Y test ∈ C noisy (X test ) ≤ 1 -α + 1 n + 1 + u. In the following subsections, we present examples under which this schochastic dominance holds, and conformal prediction with noisy labels succeeds in covering the true, noiseless label. The purpose of these examples is to illustrate simple and intuitive statistical setups where Theorem 1 holds. Under the hood, all proofs from Sections 2.1-2.3 are applications of Theorem 1. Though the noise can be adversarially designed to violate these assumptions and cause miscoverage (as in the impossibility result Proposition 1), the evidence presented here suggests that in the majority of practical settings, conformal prediction can be applied without modification. All proofs are in Appendix A1.

2.1. RANDOM CORRUPTIONS IN CLASSIFICATION

First we will examine a classification problem similar to our motivating example in Section 1, in which the noisy label is randomly flipped ϵ fraction of the time. This noise model is well-studied in the literature; see, for example, (Aslam & Decatur, 1996; Angluin & Laird, 1988; Ma et al., 2018; Jenni & Favaro, 2018; Jindal et al., 2016; Yuan et al., 2018) . Formally, in the setting of K-class classification, we define the corruption model as follows: g flip (y) = y w.p 1 -ϵ Y ′ else, where Y ′ is uniformly drawn from the set {1, ..., K}. Also consider the adaptive prediction sets (APS) scores, first introduced by Romano et al. (2020) , s APS (x, y) = y ′ ∈Y πy ′ (x) I {π y ′ (x) > πy (x)} + πy (x) • U ′ , where I is the indicator function, πy (x) is the estimated conditional probability P Y = y | X = x and U ′ ∼ Unif(0, 1). To make this non-random, a variant of the above where U ′ = 1 is often used. The next proposition shows that for any classifier that ranks classes in the same order as P( Ỹ | X) the APS score leads to valid coverage even when calibrated on noisy data. Example 1. Let C APS noisy (X test ) be constructed as in Recipe 1 with the corruption function g flip and the score function s APS (deterministic version) with any classifier that ranks the classes in the same order as the oracle classifier πy (x) = P Ỹ = y | X = x . Then 1 -α ≤ P Y test ∈ C APS noisy (X test ) ≤ 1 -α + 1 n + 1 + ϵ K -1 K . The APS score is one of two popular conformal methods for classification. The other score from Vovk et al. (2005) ; Lei et al. (2013) , is referred to as homogeneous prediction sets (HPS) score, s HPS (x, y) = 1 -πy (x), for some classifier πy (x) ∈ [0, 1]. The next proposition shows that with access to an oracle classifier for the noisy label distribution πy (x) = P Ỹ = y | X = x , conformal prediction covers the noiseless test label. Example 2. Let C HPS noisy (X test ) be constructed as in Recipe 1 with the corruption function g flip where ϵ ≤ (1 -1 K ) 2 , and the score function s HPS with the oracle classifier πy (x) = P Ỹ = y | X = x . Then 1 -α ≤ P Y test ∈ C HPS noisy (X test ) ≤ 1 -α + 1 n + 1 + ϵ 2 1 -ϵ (K -1). It should be noted that the above theorem only requires knowledge of the noisy conditional distribution P Ỹ | X , so a model trained on a large amount of noisy data should approximately have the desired coverage as well.

2.2. CONFORMALIZED QUANTILE REGRESSION WITH SYMMETRIC NOISE

Next we will analyze a regression task where the labels are continuous-valued and the corruption function is g sym (y) = y + Z for some symmetric, unimodal, independent noise sample Z. We also assume that Y | X is symmetric unimodal. We analyze the most common regression strategy, conformalized quantile regression (CQR) as presented by Romano et al. (2019) . Let the score function be s CQR (x, y) := max{q α lo (x) -y, y -qα hi (x) , 0}. where qα lo and qα hi are estimates of the α/2 and 1 -α/2 conditional quantiles respectively. We will analyze the setting where qα lo and qα hi fall on the correct side of the true median of the noisy distribution, an extremely weak assumption about the fitted model. In this case, conformalized quantile regression achieves valid coverage even with noisy labels, as stated next. Example 3. Let C CQR noisy (X test ) be constructed as in Recipe 1 with the corruption function g sym and the score function s CQR where the estimated quantiles satisfy qα lo (x) ≤ q 1/2 (x) ≤ qα hi (x), where q 1/2 (x) is the true median of Ỹ | X = x. Then P Y test ∈ C CQR noisy (X test ) ≥ 1 -α.

2.3. CONFORMALIZED UNCERTAINTY SCALARS

We next analyze the case of a score function using a single uncertainty scalar, û(x) > 0. In this setting, conformal prediction will work whenever the noisy label distribution has the same mean as, but heavier tails than, the ground truth label distribution. We set the corruption function to be g vi (y) = y + Z, where Z satisfies P Y -E Y | X = x > t X = x ≤ P Y + Z -E Y | X = x > t X = x . This condition is satisfied in many statistical setups, including the case from the previous section, where Y and Z are symmetric and Y is unimodal. Also take the score function to be the normalized magnitude of the residual, s RM (x, y) = f (x) -y /û(x). The following result states that conformal prediction is robust to these conditions. Example 4. Let C RM noisy be constructed as in Recipe 1 with the score function s RM using the oracle model f (x) = E[ Ỹi | X i = x] and the corruption function g vi . Then P Y test ∈ C RM noisy (X test ) ≥ 1 -α. This example summarizes the practical intuition that if Y has lighter tails than Y +Z, then conformal prediction will be conservative (see Corollary 3 in Appendix A1 for a statement in terms of tail bounds).

2.4. DISCLAIMER: DISTRIBUTION-FREE RESULTS

Though the coverage guarantee holds in many realistic cases, we have also given examples where conformal prediction fails to cover. Indeed, in the general case, conformal prediction will fail to cover, and must be adjusted to account for the size of the noise. The following proposition makes it clear that for any nontrivial noise distribution, there exists a score function that breaks naïve conformal. Proposition 1 (Coverage is impossible in the general case.). Take any Ỹ d ̸ = Y . Then there exists a score function s that yields P Y test ∈ C noisy (X test ) < P Y test ∈ C(X test ) . The above proposition says that for any noise distribution, there exists an adversarially chosen score function that will disrupt coverage. Furthermore, as we discuss in Appendix A1, with noise of a sufficient magnitude, it is possible to get arbitrarily bad violations of coverage. Next, we discuss how to adjust the threshold of conformal prediction to account for noise of a known size, as measured by total variation (TV) distance from the clean label. Corollary 1 (Corollary of Barber et al. (2022) ). Let Ỹ be any random variable satisfying D TV (Y, Ỹ ) ≤ ϵ. Take α ′ = α + n n+1 ϵ. Letting C noisy (X test ) be the output of Recipe 1 with any score function at level α ′ yields P Y test ∈ C noisy (X test ) ≥ 1 -α. We discuss this strategy more in Appendix A1-the algorithm implied by Corollary 1 may not be particularly useful, as the TV distance is a badly behaved quantity that is also difficult to estimate. As a final note, if the noise is bounded in TV norm, then the coverage is also not too conservative. Corollary 2 (Corollary of Barber et al. (2022) Theorem 3). Let Ỹ be any random variable satisfying D TV (Y, Ỹ ) ≤ ξ. Letting C noisy (X test ) be the output of Recipe 1 with any score function at level α yields P Y test ∈ C noisy (X test ) ≤ 1 -α + 1 n + 1 + n n + 1 ξ.

3.1. CLASSIFICATION

In this section, we focus on multi-class classification problems, where we study the validity of conformal prediction using different types of label noise distributions, described below. Class-independent noise. This noise model, which we call uniform flip, randomly flips the ground truth label with probability ϵ, defined formally by the class corruption function g flip+ in (A6). We analyzed this setting in depth in Section 2.1, and proved that the coverage achieved by an oracle classifier is guaranteed to increase. Class-dependent noise. In contrast to uniform flip noise, here we consider a more challenging setup in which the probability of a label to be flipped depends on the ground truth class label Y . Such a noise label is often called Noisy at Random (NAR) in the literature, where certain classes are more likely to be mislabeled or confused with similar ones. Let T be a row stochastic transition matrix of size K × K such that T i,j is the probability of a point with label i to be swapped with label j. In what follows, we consider three possible strategies for building the transition matrix T . (1) Confusion matrix (Algan & Ulusoy, 2020): we define T as the oracle classifier's confusion matrix, up to a proper normalization to ensure the total flipping probability is ϵ. We provide a theoretical study of this case in Appendix A1.3. (2) Rare to most frequent class (Xu et al., 2019) : here, we flip the labels of the least frequent class with those of the most frequent class. This noise model is not uncommon in medical applications: imagine a setting where only a small fraction of the observations are abnormal, and thus likely to be annotated as the normal class. If switching between the rare and most frequent labels does not lead to a total probability of ϵ, we move to the next least common class, and so on. To set the stage for the experiments, we generate a synthetic data with K = 10 classes as follows. The features X ∈ R d follow a standard multivariate Gaussian distribution of dimen- sion d = 100. The conditional distribution of Y | X is multinomial with weights w j (x) = exp((x ⊤ B) j )/ K i=1 exp((x ⊤ B) i ) , where B ∈ R d×K whose entries are sampled independently from the standard normal distribution. In our experiments, we generate a total of 60, 000 data points, where 50, 000 are used to fit a classifier, and the remaining ones are randomly split to form calibration and test sets, each of size 5, 000. The training and calibration data are corrupted using the label noise models we defined earlier, with a fixed flipping probability of ϵ = 0.05. Of course, the test set is not corrupted and contains the ground truth labels. We apply conformal prediction using both the HPS and the APS score functions, with a target coverage level 1 -α of 90%. We use two predictive models: a two-layer neural network and an oracle classifier that has access to the conditional distribution of Ỹ | X. Finally, we report the distribution of the coverage rate as in (2) and the prediction set sizes across 100 random splits of the calibration and test data. As a point of reference, we repeat the same experimental protocol described above on clean data; in this case, we do not violate the i.i.d. assumption required to grant the marginal coverage guarantee in (1). The results are depicted in Figure 2 . As expected, in the clean setting all conformal methods achieve the desired coverage of 90%. Under the uniform flip noise model, the coverage of the oracle classifier increases to around 94%, supporting our theoretical results from Section 2.1. The neural network model follows a similar trend. Although not supported by a theoretical guarantee, when corrupting the labels using the more challenging confusion matrix noise, we can see a conservative behavior similar to the uniform flip. By contrast, under the rare to most frequent class noise model, we can see a decrease in coverage, which is in line with our dis- claimer from Section 2.4. Yet, observe how the APS score tends to be more robust to label noise than HPS, which emphasizes the role of the score function. In Appendix A2.1 we provide additional experiments with adversarial noise models that more aggressively reduce the coverage rate. Such adversarial cases are more pathological and less likely to occur in real-world settings, unless facing a malicious attacker.

3.2. REGRESSION

Similarly to the classification experiments, we study two types of noise distributions. Response-independent noise. We consider an additive noise of the form: Ỹ = Y + c • Z, where c is a parameter that allows us to control the noise level. The noise component Z is a random variable sampled from the following distributions. (1) Symmetric light tailed: standard normal distribution; (2) Symmetric heavy tailed: t-distribution with one degree of freedom; (3) Asymmetric: standard gumbel distribution, normalized to have zero mean and unit variance; and (4) Biased: positive noise formulated as the absolute value of #2 above. Response-dependent noise. Analogously to the class-dependent noise from Section 3.1, we define more challenging noise models as follows. (1) Contractive: this corruption pushes the ground truth response variables towards their mean. Formally, Ỹi = Y i -Y i -1 n n i=1 Y i • U , where U is a random uniform variable defined on the segment [0,0.5], and n is the number of calibration points. (2) Dispersive: this noise introduces some form of a dispersion effect on the ground truth response, which takes the opposite form of the contractive model, given by Ỹi = Y i + Y i -1 n n i=1 Y i • U . Having defined the noise models, we turn to describe the data generating process. We simulate a 100-dimensional X whose entries are sampled independently from a uniform distribution on the segment [0, 5]. Following Romano et al. (2019) , the response variable is generated as follows: Y ∼ Pois(sin 2 ( X) + 0.1) + 0.03 • X • η 1 + 25 • η 2 • 1 {U < 0.01} , (4) where X is the mean of the vector X, and Pois(λ) is the Poisson distribution with mean λ. Both η 1 and η 2 are i.i.d. standard Gaussian variables, and U is a uniform random variable on [0, 1]. The right-most term in (4) creates a few but large outliers. Figure A2 in the appendix illustrates the effect of the noise models discussed earlier on data sampled from (4). We apply conformal prediction with the CQR score for each noise model as follows.foot_0 First, we fit a quantile random forest model on 8, 000 noisy training points; we then calibrate the model using 2, 000 fresh noisy samples; and, lastly, test the performance on additional 5, 000 clean, ground truth samples. The results are summarized in Figures 3 and 4 . Observe how the prediction intervals tend to be conservative under symmetric, both for light-and heavy-tailed noise distributions, asymmetric, and dispersive corruption models. Intuitively, this is because these noise models increase the variability of Y ; in Example 3 we prove this formally for any symmetric independent noise model, whereas here we show this result holds more generally even for response-dependent noise. By contrast, the prediction intervals constructed under the biased and contractive corruption models tend to under-cover the response variable. This should not surprise us: following Figure A2 (c), the biased noise shifts the data 'upwards', and, consequently, the prediction intervals are undesirably pushed towards the positive quadrants. Analogously, the contractive corruption model pushes the data towards the mean, leading to intervals that are too narrow. Figure A5 in the appendix illustrates the scores achieved when using the different noise models and the 90%'th empirical quantile of the CQR scores. This figure supports the behaviour we see in Figures A3, 3 and A4: over-coverage is achieved when qnoisy is larger than qclean , and under-coverage is obtained when qnoisy is smaller. In Section A2.2 of the Appendix we study the effect of the predictive model on the coverage property, for all noise models. To this end, we repeat similar experiments to the ones presented above, however, we now fit the predictive model on clean training data; the calibration data remains noisy. We also provide an additional adversarial noise model that reduces the coverage rate, but is unlikely to appear in real-world settings. Figures A3 and A4 in the appendix depict a similar behaviour for most noise models, except the biased noise for which the coverage requirement is not violated. This can be explained by the improved estimation of the low and high conditional quantiles, as these are fitted on clean data and thus less biased.

4.1. REGRESSION: AESTHETIC VISUAL RATING

In this section, we present a real-world application with continuous response, using Aesthetic Visual Analysis (AVA) data set, first presented by Murray et al. (2012) . This data set contains pairs of images and their aesthetic scores in the range of 1 to 10, obtained by approximately 200 annotators. Following Kao et al. (2015) ; Talebi & Milanfar (2018) ; Murray et al. (2012) , the task is to predict the average aesthetic score of a given test image. Therefore, we consider the average aesthetic score taken over all annotators as the clean, ground truth response. The noisy response is the average aesthetic score taken over 10 randomly selected annotators only. We examine the performance of conformal prediction using both CQR and the residual magnitude score-the score from Section 2.3 with û(x) = 1. We follow Talebi & Milanfar (2018) and take a transfer learning approach to fit the predictive model. Figure 5 portrays the marginal coverage and average interval length achieved using CQR and residual magnitude scores. As a point of reference, this figure also presents the performance of the two conformal methods when calibrated with a clean calibration set; as expected, the two perfectly attain 90% coverage. By constant, when calibrating the same predictive models with a noisy calibration set, the resulting prediction intervals tend to be wider and to over-cover the average aesthetic scores.

4.2. CLASSIFICATION: OBJECT RECOGNITION

We return to the classification experiment in Section 1. The CIFAR-10H data set contains the same 10,000 images as CIFAR-10, but with labels from a single annotator instead of a majority vote of 50 annotators. We fine-tune a ResNet18 model pre-trained on the clean training set of CIFAR-10, which contains 50,000 samples. Then we randomly select 2,000 observations from CIFAR-10H for calibration. The test set contains the remaining 8,000 samples, but with CIFAR-10 labels. Figure 1 illustrates results obtained by applying conformal prediction with the APS score. We can see that (i) we obtain the exact desired coverage when using the clean calibration set; and (ii) when calibrating on noisy data, the constructed prediction sets over-cover the clean test labels. This is because the sets tend to be larger when calibrating on noisy data, as seen in Figure A6 .

5. DISCUSSION

Related work. Conformal prediction was first proposed by Vladimir Vovk and collaborators (Vovk et al., 1999; 2005) . Recently there has been a body of work studying the statistical properties of conformal prediction (Lei et al., 2018; Barber, 2020) and its performance under deviations from exchangeability (Tibshirani et al., 2019; Podkopaev & Ramdas, 2021; Barber et al., 2022) . Label noise independently of conformal prediction has been well-studied; see, for example Angluin & Laird (1988) ; Tanno et al. (2019) ; Frénay & Verleysen (2013) . To our knowledge, conformal prediction under label noise has not been previously analyzed. The closest work to ours is that of Cauchois et al. (2022) studying conformal prediction with weak supervision, which could be interpreted as a type of noisy label. Future directions. Our work raises many new questions. First, one can try and define a score function that is more robust to label noise, continuing the line of Gendler et al. (2021) ; Frénay & Verleysen (2013) ; Cheng et al. (2022) . Second, an important remaining question is whether or not conformal risk control (Angelopoulos et al., 2021; 2022) more generally is also robust to label noise. Lastly, it would be interesting to analyze the robustness of alternative conformal methods such as cross-conformal and jackknife+ (Vovk, 2015; Barber et al., 2021) that do not require data-splitting. Applying (A2) and using the fact that p k is sorted in decreasing order yields k * k=1 πk = (1 -ϵ) k * k=1 p k + k * ϵ/K = k * k=1 p k + ϵ k * /K - k * k=1 p k . Next we will critically use the fact that p is sorted from greatest to least. This sorting implies that (p 1 + . . . + p k ) ≥ k/K for all k (if this failed at any k, we would have (p 1 + . . . + p K ) < 1, which is impossible). Applying this fact at the k * th index, k * k=1 p k + ϵ k * /K - k * k=1 p k ≤ k * k=1 p k , as the term multiplied by ϵ is negative. Reversing the steps from (A3), one can rewrite the limits of the sum as an indicator function, which by law of total probability yields the CDF of the scores, k * k=1 p k = k:s(x,k)≤t p k = k∈Y p k 1 {s(X, k) ≤ t} = P(s(x, Y ) ≤ t | X = x). (A5) We have so far proved that P(s(x, Ỹ ) ≤ t | X = x) ≤ P(s(x, Y ) ≤ t | X = x); taking expectations on both sides yields the conclusion, P(s(X, Ỹ ) ≤ t) ≤ P(s(X, Y ) ≤ t). Proof of Example 1. To complete the proof of the lower bound it suffices to show that P(s(X, Y ) ≤ t) ≥ P(s(X, Ỹ ) ≤ t). We prove this in Lemma A1. For the coverage upper-bound, combining (A4) and (A5), witness that P(s(x, Ỹ ) ≤ t | X = x) = k * k=1 p k + ϵ k * /K - k * k=1 p k = P(s(x, Y ) ≤ t | X = x) + ϵ k * /K - k * k=1 p k . We can lower-bound this expression by realizing that k * /K -k * k=1 p k takes values no lower than -(K -1)/K. Therefore, P(s(x, Ỹ ) ≤ t | X = x) + ϵ K -1 K ≥ P(s(x, Y ) ≤ t | X = x). Applying the tower property and noting that by the standard conformal argument Vovk et al. (2005) , gives P(s(x, Ỹ ) ≤ t) ≥ 1 -α + 1/(n + 1) yields 1 -α + 1 n + 1 + ϵ K -1 K ≥ P(s(x, Y ) ≤ t).

A1.2 HPS SCORE

Proof of Example 2. For proof with the HPS function, we consider the following corruption model: g flip+ (y) = y w.p 1 -ϵ Y ′ else, ( ) where Y ′ is uniformly drawn from the set {1, ..., K} \ Y . This is the same as g flip , but rescaled so that ϵ in g flip corresponds to ϵ(K -1)/K in g flip+ .

A1.3 CONFUSION MATRIX

The confusion matrix noise model is more realistic than the uniform flip. However, there exists a score function that causes conformal prediction to fail for any non-identity confusion matrix. We define the corruption model as follows: consider a matrix T in which (T ) i,j = P( Ỹ = j | Y = i). g confusion (y) =    1 w.p. T 1,y . . . K w.p. T K,y . Example 5. Let C noisy be constructed as in Recipe 1 with any score function s and the corruption function g confusion . Then, P Y test ∈ C noisy (X test ) ≥ 1 -α. if and only if for all classes j ∈ {1, . . . , K}, K j ′ =1 P( Ỹ = j ′ | Y = j)P(s ≤ t | Ỹ = j ′ ) ≥ P(s ≤ t | Y = j). The proof is below. In words, the necessary and sufficient condition is for the noise distribution/confusion matrix P( Ỹ = j ′ | Y = j) to place sufficient mass on those classes j ′ whose quantiles which are larger than P(s ≤ t | Y = j). However, without assumptions on the model and score, the conditional probabilities are unknown, so it is impossible to say which noise distributions will preserve coverage. Proof. By law of total probability, P(s ≤ t) = E [P(s ≤ t | Y = j)] = K j=1 w j P(s ≤ t | Y = j). But under the noisy model, we have instead that P(s ≤ t) = E P(s ≤ t | Ỹ = j ′ ) = K j=1 K j ′ =1 w j P( Ỹ = j ′ | Y = j)P(s ≤ t | Ỹ = j ′ ). We can write P(s ≤ t)-P(s ≤ t) = K j=1 K j ′ =1 w j P( Ỹ = j ′ | Y = j)P(s ≤ t | Ỹ = j ′ )- K j=1 w j P(s ≤ t | Y = j). Combining the sums and factoring, the above display equals K j=1 w j   K j ′ =1 P( Ỹ = j ′ | Y = j)P(s ≤ t | Ỹ = j ′ ) -P(s ≤ t | Y = j)   . We can factor this expression as K j=1 w j P(s ≤ t | Y = j) K j ′ =1 P( Ỹ = j ′ | Y = j) P(s ≤ t | Ỹ = j ′ ) P(s ≤ t | Y = j) - 1 K . The stochastic dominance condition holds uniformly over all choices of base probabilities w j if and only if for all j ∈ [K], K j ′ =1 P( Ỹ = j ′ | Y = j)P(s ≤ t | Ỹ = j ′ ) ≥ P(s ≤ t | Y = j). Notice that the left-hand side of the above display is a convex mixture of the quantiles P(s ≤ t | Ỹ = j ′ ) for j ′ ∈ [K]. Thus, the necessary and sufficient condition is for the noise distribution P( Ỹ = j ′ | Y = j) to place sufficient mass on those classes j ′ whose quantiles which are larger than P(s ≤ t | Y = j). But of course, without assumptions on the model and score, the latter are unknown, so it is impossible to say which noise distributions will preserve coverage.

A1.4 CQR SCORE

Proof of Example 3. We will show that P (s ≤ q | X = x) ≤ P (s ≤ q | X = x). Notice that we will only prove for the case where q ≥ 0 since according to the score function, for q ≤ 0 both probabilities are equal to zero. Formally, P (s ≤ q | X = x) = P (max{q α lo (x) -Y -Z, Y + Z -qα hi (x)} ≤ q | X = x) = P (q α lo (x) -Y -Z ≤ q, Y + Z -qα hi (x) ≤ q | X = x) = P (q α lo (x) -q ≤ Y + Z ≤ qα hi (x) + q | X = x) = 1 -P (Y + Z ≥ qα hi (x) + q | X = x) -P (Y + Z ≤ qα lo (x) -q | X = x) . Similarly, P (s ≤ q | X = x) = 1 -P (Y ≥ qα hi (x) + q | X = x) -P (Y ≤ qα lo (x) -q | X = x) . According to Lemma A2, 1 -P (Y + Z ≥ qα hi (x) + q | X = x) -P (Y + Z ≤ qα lo (x) -q | X = x) ≤ 1 -P (Y ≥ qα hi (x) + q | X = x) -P (Y ≤ qα lo (x) -q | X = x) , under the assumption that qα lo (x) ≤ q 1/2 (x) ≤ qα hi (x) , where q 1/2 (x) is the true median of Ỹ | X = x. By the law of total probability this inequality holds also without conditioning on X. Lemma A2. Suppose Y has a density that is unimodal and symmetric around 0 and Z is symmetric, i.e., Z has the same distribution as -Z. Suppose further that Y and Z are independent. Then for any q > 0, P(Y + Z ≤ -q) ≥ P(Y ≤ -q) and P(Y + Z ≥ q) ≥ P(Y ≥ q). Proof of Lemma A2. P(Y + Z ≤ -q) = E [P(Y + Z ≤ -q | |Z| = z)] = E 1 2 P(Y ≤ -q -z) + 1 2 P(Y ≤ -q + z) = P(Y ≤ -q) + E 1 2 (P(Y ≤ -q -z) -P(Y ≤ -q)) + 1 2 (P(Y ≤ -q + z) -P(Y ≤ -q)) = P(Y ≤ -q) + E 1 2 (P(Y ≤ -q + z) -P(Y ≤ -q)) - 1 2 (P(Y ≤ -q) -P(Y ≤ -q -z)) = P(Y ≤ -q) + E 1 2 P(-q < Y ≤ -q + z) - 1 2 P(-q -z < Y ≤ -q) ≥0 ≥ P(Y ≤ -q). Above, the expectations are over z ≥ 0 which we take to be a random variable with the same distribution as |Z|. The final inequality follows from the fact that the density of Y is unimodal. The proof for the upper tail is similar and we omit it. Remark 1. In the above argument, if one further assumes continuity of the (ground truth) score function and P( Ỹ ∈ A) = P(Y ∈ A) + ρ for ρ = inf ρ ′ > 0 : BinomCDF(n, δ 1 , ⌈(n + 1)(1 -α)⌉ -1) + 1 n < BinomCDF(n, δ 2 + ρ ′ , ⌈(n + 1)(1 -α)⌉ -1) , then P(s n+1 ≤ q) < 1 -α. In other words, the noise must have some sufficient magnitude in order to disrupt coverage. Proof of Corollary 1. This a consequence of the TV bound from Barber et al. (2022) with weights identically equal to 1. Unfortunately, getting such a TV bound requires a case by case analysis. It's not even straightforward to get a TV bound under strong Gaussian assumptions. Proposition 2 (No general TV bound). Assume Y ∼ N (0, τfoot_2 ) and Ỹ = y + Z, where Z ∼ N (0, σ 2 ). Then D TV (Y, Ỹ ) τ →0 → 1. Proof. TV(N (0, τ 2 ), N (0, τ 2 + σ 2 )) = ∞ -∞ e -x 2 /τ 2 -e -x 2 /(τ 2 +σ 2 ) dx τ →0 → 1.

A2.1 SYNTHETIC CLASSIFICATION: ADVERSARIAL NOISE MODELS

In contrast with the noise distributions presented in Section 3.1, here we construct adversarial noise models to intentionally reduce the coverage rate. In these experiments we apply the same settings as described in Section 3.1 of the main manuscript and present the results in Figure A1 . We can see that the optimal adversarial noise causes the largest decrease in coverage as one would expect. The most-frequent-confusion noise decreases the neural network coverage to approximately 89%. The wrong-to-right noise decreases the coverage to around 85% with the HPS score and to around 87% with the APS score. This gap is expected as this noise directly reduces the HPS score. We can see that the optimal worst case noise for each score function reduces the coverage to around 85% when using that score. This is in fact the maximal decrease in coverage possible theoretically, hence it strengthens the optimally of our iterative algorithm.

A2.2 SYNTHETIC REGRESSION: ADDITIONAL RESULTS

Here we first illustrate in Figure A2 the data we generate in the synthetic regression experiment and the different corruptions we apply. In Section 3.2 of the main manuscript we apply some realistic noise models and examine the performance of conformal prediction using CQR score with noisy training and calibration sets. Here we construct some more experiments using the same settings, however we train the models using clean data instead of noisy data. Moreover, we apply an additional adversarial noise model that differs from those presented in Section 3.2 in the sense that it is designed to intentionally reduce the coverage level. Wrong to right: an adversarial noise that depends on the underlying trained regression model. In order to construct the noisy calibration set we switch 7% of the responses as follows: we randomly swap between outputs that are not included in the interval predicted by the model and outputs that are included. Figures A3 and A4 depict the marginal coverage and interval length achieved when applying the different noise models. We see that the adversarial wrong to right noise model reduces the coverage rate to approximately 83%. Moreover, these results are similar to those achieved in Section 3.2, except for the conservative coverage attained using biased noise, which can be explained by the more accurate low and high estimated quantiles. Lastly, in order to explain the over-coverage or under-coverage achieved for some of the different noise models, as depicted in Figures 3 and 4 , we present in Figure A5 the CQR scores and their 90%'th empirical quantile. Over-coverage is achieved when the noisy scores are larger than the clean ones, for example, in the symmetric heavy tailed case, and under-coverage is achieved when the noisy scores are smaller. Herein, we provide additional details regarding the training of the predictive models for the real world regression task. As explained in Section 4.1, we trained two different models in this experiment. The first is a quantile regression model for CQR. Here we trained the model for 70 epochs using 'SGD' optimizer with a batch size of 128 and an initial learning rate of 0.001 decayed every 20 epochs exponentially with a rate of 0.95 and a frequency of 10. We applied dropout regularization to avoid overfitting with a rate of 0.2. The second model is a classic regression model for conformal with residual magnitude scores. Here we trained the model for 70 epochs using 'Adam' optimizer with a batch size of 128 and an initial learning rate of 0.00005 decayed every 10 epochs exponentially with a rate of 0.95 and a frequency of 10. The dropout rate in this case is 0.5.



In our experiments, we use the original score fromRomano et al. (2019) that can have negative values in contrast to the score defined in (3) as the former is more likely to be used in practice. Formally, the score we apply is s CQR (x, y) := max{qα lo (x) -y, y -qα hi (x)}. Most frequent confusion: we extract from the confusion matrix the pair of classes with the highest probability to be confused between each other, and switch their labels until reaching a total probability of ϵ. In cases where switching between the most common pair is not enough to reach ϵ, we proceed by flipping the labels of second most confused pairs of labels, and so on. Wrong to right: wrong predictions during calibration cause larger predictions sets during test time. Hence making the model think it makes less mistakes then actual during calibration can lead to under-coverage during test time. Here, we first observe the model predictions over the calibration set, and then switch the labels only of points that were misclassified. We switch the label to the class that is most likely to be the correct class according to the model, hence making the model think it was correct. We switch a suitable amount of labels in order to reach a total switching probability of ϵ (this noise model assumes there are enough wrong predictions in order to do so). Optimal adversarial: we describe here an algorithm for building the worst possible label noise for a specific model using a specific non-conformity score. This noise will decrease the calibration threshold at most and as a result, will cause significant undercoverage during test time. To do this, we perform an iterative process. In each iteration we calculate the non-conformity scores of all of the calibration points with their current labels. We calculate the calibration threshold as in regular conformal prediction and then, from the points that have a score above the threshold, we search for the one that switching its label can reduce its score by most. We switch the label of this point to the label that gives the lowest score and then repeat the iterative process with the new set of labels. Basically, at every step we make the label swap that will decrease the threshold by most.



For this purpose, we use the CIFAR-10H data set, first introduced by Peterson et al. (2019); Battleday et al. (2020); Singh et al. (

Figure1: Effect of label noise on CIFAR-10. Left: distribution of average coverage on a clean test set over 30 independent experiments evaluated on CIFAR-10H test data with target coverage 1 -α = 90%, using noisy and clean labels for calibration. We use a pre-trained resnet 18 model, which has Top-1 accuracy of 93% and 90% on the clean and noisy test set, respectively. The gray bar represents the interquartile range. Center and right: prediction sets achieved using noisy and clean labels for calibration.

Figure 2: Effect of label noise on synthetic multi-class classification data. Performance of conformal prediction sets with target coverage 1 -α = 90%, using a noisy training set and a noisy calibration set. Left: Marginal coverage; Right: Average size of predicted sets. The results are evaluated over 100 independent experiments and the gray bar represents the interquartile range.

Figure 3: Response-independent noise. Performance of conformal prediction intervals with target coverage 1 -α = 90%, using a noisy training set and a noisy calibration set. Left: Marginal coverage; Right: Length of predicted intervals (divided by the average clean length) using symmetric, asymmetric and biased noise with a varying magnitude. The results are evaluated over 50 independent experiments, with error bars showing one standard deviation. The standard deviation of Y (without noise) and the square root of E [Var [Y | X]] are both approximately 2.6.

Figure 4: Dispersive versus contractive noise regression experiment. Performance of conformal prediction intervals with target coverage 1 -α = 90%, using a noisy training set and a noisy calibration set. Left: Marginal coverage; Right: Length of predicted intervals. The results are evaluated over 50 independent experiments and the gray bar represents the interquartile range.

Figure A1: Effect of label noise on synthetic multi-class classification data. Performance of conformal prediction sets with target coverage 1 -α = 90%, using a noisy training set and a noisy calibration set with adversarial noise models. Left: Marginal coverage; Right: Average size of predicted sets. The results are evaluated over 100 independent experiments.

Figure A2: Illustration of the generated data with different corruptions. (a): Clean samples. (b): Samples with symmetric heavy-tailed noise. (c): Samples with asymmetric noise. (d): Samples with biased noise. Noise magnitude is set to 0.1. (e): Samples with contractive noise. (f): Samples with dispersive noise.

Figure A4: Response-dependent noise. Performance of conformal prediction intervals with target coverage 1 -α = 90%, using a clean training set and a noisy calibration set. Left: Marginal coverage; Right: Length of predicted intervals. The results are evaluated over 50 independent experiments.

Results for real-data regression experiment: predicting aesthetic visual rating. Performance of conformal prediction intervals with 90% marginal coverage based on a VGG-16 model using a noisy training set. We compare the residual magnitude score and CQR methods with both noisy and clean calibration sets. Left: Marginal coverage; Right: Interval length. The results are evaluated over 30 independent experiments and the gray bar represents the interquartile range. classic regression model for conformal with residual magnitude score. The models are trained on 34, 000 noisy samples, calibrated on 7, 778 noisy holdout points, and tested on 7, 778 clean samples. Further details regarding the training strategy are in Appendix A2.3.

A1 MATHEMATICAL PROOFS

Theorem 1. Our assumption states that P(s test ≤ t) ≤ P(s test ≤ t).Note that the probability is only taken over stest . Since qnoisy is constant (measurable) with respect to this probability, we have that, for any α ∈ (0, 1), P(s test ≤ qnoisy ) ≥ P(s test ≤ qnoisy ) ≥ 1 -α.This implies that Y test ∈ C noisy (X test ) with probability at least 1 -α, completing the proof of the lower bound.Regarding the upper bound, by the same argument, P(s test ≤ qnoisy ) ≤ P(s test ≤ qnoisy ) + u ≤ 1 -α + 1 n + 1 + u.

A1.1 APS SCORE

Lemma A1. Let fy (x) = f (Y = y | X = x) be any classifier that ranks the classes in the same order as the oracle model, i.e., one satisfying, for all y ̸ = y ′ ,Let s be the (deterministic) APS conformal score with model f ,Consider the corruption function g flip . Then,Proof of Lemma A1. For notational convenience, assume the classes are ordered such that fy (x) is decreasing in y. Fix x ∈ X , let p k = P(Y = k | X = x) and let us use the shorthand πk = P( Ỹ = k | X = x). Finally, draw (X, Y, Ỹ ) from their joint distribution P.Notice that in this context, the assumption (A1) simply states that πk is sorted from greatest to least. Furthermore, by the definition of g flip ,This implies that πk and p k have the same ordering, and therefore, that the entries p k are also decreasing in k. In the ensuing argument, this fact will be crucial.Starting with the law of total probability,and applying the tower property. Therefore, we continue the proof in the case that k * ≥ 1. Because s(x, k) is monotone in k, we can rewrite the last expression asConsider the corrupted oracle nonconformity score s(x, y) = 1 -P( Ỹ | X). By definition of this score function, we can explicitly write the probability mass functions of si and s i as follows:Note that these two distributions are supported on the same discrete set, D = 1-P( Ỹ = j |X), j = 1, ..., K . Applying the definition of our corruption model and rearranging yieldsUsing this fact and the law of total probability, we can writewhere the last step holds because each summand is nonzero at exactly one value of j.To prove the lower bound, we will show that P (s i ≤ q) -P (s i ≤ q) ≥ 0. It suffices to show that each summand in the last line of (A8) is positive. This is trivially true when t > 1 -ϵ/(K -1) because by (A7), we have that inf j P( Ỹ = j | X) ≥ ϵ/(K -1), and thus q = 1. To complete the proof, our goal is to show that if t ≤ 1 -ϵ/(K -1) and ϵ ≤ 1 -1/K, thenRearranging terms in the above expression givesThis is implied directly by combining the two bounds t ≤ 1 -ϵ/(K -1) and ϵ ≤ 1 -1/K. This completes the proof of the lower bound.To prove the upper bound, we need to show thatThe last line of (A8) can be rewritten aswhere the inequality upper-bounded the cardinality by K and the last term by 0.To recover the statement in the main paper, one can rewrite this final expression with ϵ(K -1)/K substituted in wherever the term ϵ appears. Rearranging terms then gives the result.

A1.5 NORMALIZED RESIDUAL MAGNITUDE SCORE

Proof of Example 4. Equation 2.3 directly impliessince û(x) > 0 and both probabilities are X-conditional. Furthermore, since we are using the oracle There is nothing special about the sub-Gaussian tail decay in the above corollary; any tail decay conditions will work if they can be chained together.

A1.6 DISTRIBUTION-FREE RESULTS

Proof of Proposition 1. For convenience, assume the existence of probability density functions p and p for Ỹ and Y respectively (these can be taken to be probability mass functions if Y is discrete). Also define the multiset of Y values E = {Y 1 , ..., Y n } and the corresponding multiset of Ỹ values Ẽ. Take the set A = {y : p(y) > p(y)}.Since Y d ̸ = Ỹ , we know that the set A is nonempty andThe adversarial choice of score function will be s(x, y) = 1 {y ∈ A c }; it puts high mass wherever the ground truth label is more likely than the noisy label. The crux of the argument is that this design makes the quantile smaller when it is computed on the noisy data than when it is computed on clean data, as we next show.Begin by noticing that, because s(x, y) is binary, qclean is also binary, and therefore qclean > t ⇐⇒ qclean = 1. Furthermore, qclean = 1 if and only if |E ∩ A| < ⌈(n + 1)(1 -α)⌉. Thus, these events are the same, and for any t ∈ (0, 1],By the definition of A, we have that PChaining the inequalities, we getSince s n+1 is measurable with respect to E and Ẽ, we can plug it in for t, yielding the conclusion. Here we provide additional results of the classification experiment with CIFAR-10H explained in Section 4.2. We apply conformal prediction with the APS score. The marginal coverage achieved when using noisy and clean calibration sets are depicted in Figure 1 . Figure A6 illustrates the average prediction set sizes that are larger when using noisy data for calibration and thus lead to higher coverage level.

