CONFORMAL PREDICTION IS ROBUST TO LABEL NOISE

Abstract

We study the robustness of conformal prediction, a powerful tool for uncertainty quantification, to label noise. Our analysis tackles both regression and classification problems, characterizing when and how it is possible to construct uncertainty sets that correctly cover the unobserved noiseless ground truth labels. With both theory and experiments, we argue that conformal prediction with noisy labels conservatively covers the clean ground truth labels except in adversarial cases. This leads us to believe that correcting for label noise is unnecessary except for pathological data distributions or noise sources. In such cases, we can also correct for noise of bounded size in the conformal prediction algorithm in order to ensure correct coverage of the ground truth labels, without requiring regularity of the score function or the data distribution.

1. INTRODUCTION

In most supervised classification and regression tasks, one assumes the provided labels reflect the ground truth. In reality, this assumption is often violated; see (Cheng et al., 2022; Xu et al., 2019; Yuan et al., 2018; Lee & Barber, 2022; Cauchois et al., 2022). For example, doctors labeling the same medical image may hold different subjective opinions about the diagnosis, leading to variability in the ground truth label itself. In other settings, such variability may arise from sensor noise, data entry mistakes, the subjectivity of a human annotator, or many other sources. In other words, the labels we use to train machine learning (ML) models are often noisy in the sense that they are not necessarily the ground truth. Quantifying prediction uncertainty is crucial in high-stakes applications in general, and especially so in settings where the training data is inexact.

We aim to investigate uncertainty quantification in this challenging noisy setting via conformal prediction, a framework that uses hold-out calibration data to construct prediction sets that are guaranteed to contain the ground truth labels; see (Vovk et al., 2005; Angelopoulos & Bates, 2021). In short, this paper shows that conformal prediction typically yields confidence sets with conservative coverage when the hold-out calibration data has noisy labels.

We adopt a variation of the standard conformal prediction setup. Consider a calibration data set of i.i.d. observations $\{(X_i, Y_i)\}_{i=1}^n$ sampled from an arbitrary unknown distribution $P_{XY}$. Here, $X_i \in \mathbb{R}^p$ is the feature vector containing $p$ features for the $i$th sample, and $Y_i$ denotes its response, which can be discrete for classification tasks or continuous for regression tasks. Given the calibration data set, an i.i.d. test data point $(X_{\text{test}}, Y_{\text{test}})$, and a pre-trained model $f$, conformal prediction constructs a set $C(X_{\text{test}})$ that contains the unknown test response, $Y_{\text{test}}$, with high probability, e.g., 90%.
That is, for a user-specified level $\alpha \in (0, 1)$,
$$\mathbb{P}\left[ Y_{\text{test}} \in C(X_{\text{test}}) \right] \ge 1 - \alpha. \quad (1)$$
This property is called marginal coverage, where the probability is defined over the calibration and test data.

In the setting of label noise, we only observe the corrupted labels $\tilde{Y}_i = g(Y_i)$ for some corruption function $g : \mathcal{Y} \times [0, 1] \to \mathcal{Y}$, so the i.i.d. assumption and the marginal coverage guarantee are invalidated. The corruption is random; we will always take the second argument of $g$ to be a random seed $U$ uniformly distributed on $[0, 1]$. To ease notation, we leave the second argument implicit henceforth. Nonetheless, using the noisy calibration data, we seek to form a prediction set $C_{\text{noisy}}(X_{\text{test}})$ that covers the clean, uncorrupted test label, $Y_{\text{test}}$. More precisely, our goal is to delineate when it is possible to provide guarantees of the form
$$\mathbb{P}\left[ Y_{\text{test}} \in C_{\text{noisy}}(X_{\text{test}}) \right] \ge 1 - \alpha, \quad (2)$$
where the probability is taken jointly over the calibration data, test data, and corruption function (as it will be for the remainder of the paper). Our theoretical vignettes and experiments suggest that in realistic situations, (2) is usually satisfied. That is, even with access only to noisy labels, conformal prediction yields confidence sets that have conservative coverage on clean labels. There are a few failure cases involving adversarial noise that we discuss, but in general we argue that a user should feel safe deploying conformal prediction even with noisy labels.
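For concreteness, one simple instance of the corruption function $g$ is symmetric label flipping. The sketch below is our own illustrative example (the noise level `eps = 0.05` and the class count are assumptions, not taken from the paper): with probability `eps`, the clean label is replaced by a uniformly random different class, with all randomness driven by the uniform seed `u`.

```python
import numpy as np

def g(y, u, num_classes=10, eps=0.05):
    """Symmetric label-flip corruption g(y, u): with probability eps,
    replace the clean label y with a uniformly random *different* class.
    The second argument u ~ Uniform[0, 1] is the random seed."""
    if u >= eps:
        return y  # label kept intact
    # Map the remaining mass u in [0, eps) to one of the other classes.
    k = int(u / eps * (num_classes - 1))   # k in {0, ..., num_classes - 2}
    return (y + 1 + k) % num_classes       # any class except y itself

# Simulate corrupting a batch of clean labels.
rng = np.random.default_rng(0)
y_clean = rng.integers(0, 10, size=1000)
y_noisy = np.array([g(y, u) for y, u in zip(y_clean, rng.random(1000))])
flip_rate = np.mean(y_clean != y_noisy)  # empirically close to eps
```

Note that every flipped label lands on a class different from the original, matching the convention that $g$ may leave the label unchanged only through the `u >= eps` branch.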

MOTIVATIONAL EXAMPLE

As a real-world example of label noise, we conduct an image classification experiment where we observe only one annotator's label but seek to cover the majority vote of many annotators. For this purpose, we use the CIFAR-10H data set, first introduced by Peterson et al. (2019); Battleday et al. (2020); Singh et al. (2020), which contains 10,000 images labeled by approximately 50 annotators. We calibrate using only a single annotator and seek to cover the majority vote of the 50. The single annotator differs from the ground truth labels on approximately 5% of the images. Using the noisy calibration set (i.e., a calibration set containing these noisy labels), we applied vanilla conformal prediction as if the data were i.i.d. and studied the performance of the resulting prediction sets. Details regarding the training procedure can be found in Section 4.2. The fraction of majority vote labels covered is shown in Figure 1. As we can see, when using the clean calibration set the marginal coverage is 90%, as expected. When using the noisy calibration set, the coverage increases to approximately 93%. Figure 1 also shows that the prediction sets are larger when calibrating with noisy labels. This experiment demonstrates the main intuition behind our paper: adding noise usually increases the variability of the labels, leading to larger prediction sets that retain the coverage property.
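This intuition can be reproduced in a synthetic setting. The sketch below is our own toy construction (a 1-D regression task with absolute-residual scores, not the paper's CIFAR-10H experiment): calibrating on labels with extra noise inflates the score quantile, so the resulting intervals over-cover the clean test labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha = 2000, 0.1

# Toy setup: the model predicts mu(x) = x exactly; clean labels have
# Gaussian noise, and the "noisy" labels carry additional label noise.
x = rng.uniform(-1, 1, size=n)
y_clean = x + rng.normal(scale=0.3, size=n)
y_noisy = y_clean + rng.normal(scale=0.3, size=n)  # extra corruption

def qhat(scores, alpha=alpha):
    """Finite-sample-corrected conformal quantile of calibration scores."""
    level = np.ceil((len(scores) + 1) * (1 - alpha)) / len(scores)
    return np.quantile(scores, min(level, 1.0), method="higher")

q_clean = qhat(np.abs(y_clean - x))  # calibrated on clean labels
q_noisy = qhat(np.abs(y_noisy - x))  # calibrated on noisy labels

# The noisy quantile is larger, so intervals mu(x) +/- q_noisy
# conservatively cover fresh *clean* test labels.
x_test = rng.uniform(-1, 1, size=n)
y_test = x_test + rng.normal(scale=0.3, size=n)
cov_noisy = np.mean(np.abs(y_test - x_test) <= q_noisy)
```

In this simulation `q_noisy` exceeds `q_clean`, and the coverage of clean test labels rises above the nominal 90%, mirroring the 90% vs. 93% gap observed on CIFAR-10H.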

2. THEORETICAL ANALYSIS

In this section we show mathematically that, under stylized settings and some regularity conditions, the marginal coverage guarantee (1) of conformal prediction persists even when the labels used for calibration are noisy; i.e., (2) holds. In Sections 3 and 4 we support this argument with realistic experiments.

Towards that end, we now give more details on the conformal prediction algorithm. As explained in the introduction, conformal prediction uses a held-out calibration data set and a pre-trained model to construct the prediction set on a new data point. More formally, we use the model $f$ to construct a score function, $s : \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, which is engineered to be large when the model is uncertain and small otherwise. Abbreviate the scores on each calibration data point as $s_i = s(X_i, Y_i)$ for each $i = 1, \ldots, n$. Conformal prediction tells us that we can achieve a marginal coverage guarantee by picking
$$\hat{q} = \text{the } \lceil (n+1)(1-\alpha) \rceil \text{-th smallest value of } s_1, \ldots, s_n,$$
and setting
$$C(X_{\text{test}}) = \left\{ y : s(X_{\text{test}}, y) \le \hat{q} \right\}.$$
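The calibration step above can be sketched in a few lines. This is an illustrative implementation of standard split conformal prediction, with the score $s(x, y) = 1 - f(x)_y$ (one common choice for classification) and a synthetic calibration set standing in for real data:

```python
import numpy as np

def conformal_quantile(scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration score, i.e. the
    finite-sample-corrected (1 - alpha) empirical quantile."""
    n = len(scores)
    level = np.ceil((n + 1) * (1 - alpha)) / n
    return np.quantile(scores, min(level, 1.0), method="higher")

def prediction_set(softmax_probs, q_hat):
    """C(x) = {y : s(x, y) <= q_hat} with score s(x, y) = 1 - f(x)_y."""
    return np.nonzero(1.0 - softmax_probs <= q_hat)[0]

# Toy usage with synthetic calibration scores.
rng = np.random.default_rng(1)
cal_scores = rng.uniform(size=500)          # stand-in for s_1, ..., s_n
q_hat = conformal_quantile(cal_scores, alpha=0.1)

# Prediction set for one hypothetical test point's softmax output.
probs = np.array([0.85, 0.10, 0.03, 0.02])
C = prediction_set(probs, q_hat)            # labels with score <= q_hat
```

Calibrating on noisy labels changes only the inputs to `conformal_quantile`: noisier labels typically yield larger scores, hence a larger $\hat{q}$ and larger, more conservative sets.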

Figure 1: Effect of label noise on CIFAR-10. Left: distribution of average coverage on a clean test set over 30 independent experiments, evaluated on CIFAR-10H test data with target coverage 1 - α = 90%, using noisy and clean labels for calibration. We use a pre-trained ResNet-18 model, which has top-1 accuracy of 93% and 90% on the clean and noisy test sets, respectively. The gray bar represents the interquartile range. Center and right: prediction sets obtained using noisy and clean labels for calibration.

