LEARNING TO ABSTAIN FROM UNINFORMATIVE DATA

Abstract

Learning and decision making in domains with naturally high noise-to-signal ratios, such as finance or healthcare, can be challenging yet extremely important. In this paper, we study the problem of learning and decision making under a general noisy generative process. The distribution has a significant proportion of uninformative data with high label noise, while part of the data contains useful information represented by low label noise. This dichotomy is present during both training and inference, which requires proper handling of uninformative data at test time. We propose a novel approach to learning under these conditions via a loss inspired by selective learning theory. By minimizing the loss, the model is guaranteed to make near-optimal decisions by distinguishing informative data from uninformative data and making predictions. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluate its empirical performance under a variety of settings.

1. INTRODUCTION

Despite the success of machine learning in computer vision (Krizhevsky et al., 2009; He et al., 2016a; Huang et al., 2017) and natural language processing (Vaswani et al., 2017; Devlin et al., 2018), ML has yet to make a significant impact in other areas such as finance and public health. One major challenge is the inherently high noise-to-signal ratio in certain domains. In financial statistical arbitrage, the spread between two assets is usually modeled using Ornstein-Uhlenbeck processes (Øksendal, 2003; Avellaneda & Lee, 2010). The spread behaves almost purely randomly near zero and is naturally unpredictable there. It becomes predictable only in certain rare pockets/scenarios. For example, when the spread exceeds a certain threshold, with high probability it will move toward zero, making arbitrage possible. In cancer research, due to limited resources, only a small number of the most popular gene mutations are routinely tested for differential diagnosis and prognosis. However, due to the long-tailed distribution of mutation frequencies across genes, these popular gene mutations capture only a small proportion of the relevant driver mutations of a patient (Reddy et al., 2017). For a significant number of patients, the tested gene mutations may not be among the relevant driver mutations, and their relationship to the outcome may appear completely random. Identifying these patients automatically will justify additional gene mutation testing. These high noise-to-signal ratio datasets pose new challenges to learning. New methods are required to deal with a large fraction of uninformative/high-noise data in both the training and testing stages. Uninformative data can arise either from the random nature of the data generating process, or from the fact that the real causal factor is not captured during data collection. Direct application of standard supervised learning methods to such datasets is both challenging and unwarranted.
Deep neural networks are even more affected by the presence of noise, due to their strong memorization power (Zhang et al., 2017a) : they are likely to overfit the noise and make overly confident predictions where weak/no real structure exists. In this paper, we propose a novel method for learning on datasets where a significant portion of content has high noise. Instead of forcing the classifier to make predictions for every sample, we learn to decide whether a datapoint is informative or not. Our idea is inspired by the classic selective prediction problem (Chow, 1957) , in which one learns to select a subset of the data and only predict on that subset. However, the goal of selective prediction is very different from ours. A selective prediction method pursues a balance between coverage (i.e. proportion of the data selected) and conditional accuracy on the selected data, and does not explicitly model the underlying generative process. In particular, the aforementioned balance needs to be specified by a human expert, as opposed to being derived directly from the data. In our problem, we assume that uninformative data is an integral part of the underlying generative process and needs to be accounted for. By definition, no learning method, no matter how powerful, can successfully make predictions on uninformative data. Our goal is therefore to identify these uninformative/high noise samples, and at the same time, to train a classifier that suffers less from the noisy data. Our method learns a selector, g, to approximate the optimal indicator function of informative data, g * . We assume that g * exists as a part of the data generation process, but it is never revealed to us, even during training. Instead of direct supervision, we therefore must rely on the predictor's mistakes to train the selector. 
To achieve this goal, we propose a novel selector loss enforcing that (1) the selected data best fits the predictor, and (2) the portion of the data where we abstain from forecasting, does not contain many correct predictions. This loss function is quite different from the loss in classic selective prediction, which penalizes all unselected data equally. We theoretically analyze our method under a general noisy data generation process which follows the standard data dependent label noise model (Massart & Nédélec, 2006; Hanneke, 2009) . We distinguish informative/uninformative data via a gap in label noise ratio. A major contribution of this paper is the derivation of theoretical guarantees for the empirical minimizer of our loss. A minimax-optimal sample complexity bound for approximating the optimal selector is provided. We show that optimizing the selector loss can recover nearly all the informative data in a PAC fashion (Valiant, 1984) . This guarantee holds even in a challenging setting where the uninformative data has purely random labels, and dominates the training set. This theoretical guarantee empowers us to expand to a more realistic setting where sample size is limited, and the initial predictor is not sufficiently close to the ground truth. Our method extends to an iterative algorithm, in which both the predictor and the selector are progressively optimized. The selector is improved by optimizing our novel selector loss. Meanwhile, the predictor is improved by optimizing the empirical risk, re-weighted based on the selector's output; uninformative samples identified by the selector will be down-weighed. Experiments on both synthetic and real-world datasets demonstrate the merit of our method compared to existing baselines.

2. RELATED WORK

Learning with untrusted data aims to recover the ground truth model from a partially corrupted dataset. Different noise models for untrusted data have been studied, including random label noise (Bylander, 1994; Natarajan et al., 2013; Han et al., 2018; Yu et al., 2019; Zheng et al., 2020; Zhang et al., 2020), Massart noise (Massart & Nédélec, 2006; Awasthi et al., 2015; Hanneke, 2009; Hanneke & Yang, 2015; Yan & Zhang, 2017; Diakonikolas et al., 2019; 2020; Cheng et al., 2020; Xia et al., 2020; Zhang & Li, 2021) and adversarial noise (Kearns & Li, 1993; Kearns et al., 1994; Kalai et al., 2008; Klivans et al., 2009; Awasthi et al., 2017). Our noise model is similar to General Massart Noise (Massart & Nédélec, 2006; Hanneke, 2009; Diakonikolas et al., 2019), where the label noise is data dependent and the label can be generated by a purely random coin flip. The main distinguishing feature of our noisy generative model is the existence of uninformative data with high label noise alongside informative data with low label noise. We characterize this uninformative/informative structure via a non-vanishing gap in label noise ratio. While there is a long line of work on learning classifiers with label noise in the training stage (Thulasidasan et al., 2019; Cheng et al., 2020; Xia et al., 2020), we are the first to investigate learning a model for the inference stage under label noise. We study the case where label noise is an integral part of the generative process and thus appears during the inference stage as well, where it must be detected and discarded once more. We view this as a realistic setup in industries like finance and healthcare.
Selective learning is an active research area (Chow, 1957; 1970; El-Yaniv et al., 2010; Kalai et al., 2012; Nan & Saligrama, 2017; Ni et al., 2019; Acar et al., 2020; Gangrade et al., 2021a). It extends the classic selective prediction problem and studies how to select a subset of data for different learning tasks, and has also been generalized to other problems, e.g., learning to defer to a human expert (Madras et al., 2018; Mozannar & Sontag, 2020). Existing methods fall into four categories: Monte Carlo sampling based methods (Gal & Ghahramani, 2016; Kendall & Gal, 2017; Pearce et al., 2020), margin based methods (Fumera & Roli, 2002; Bartlett & Wegkamp, 2008; Grandvalet et al., 2008; Wegkamp et al., 2011; Zhang et al., 2018), confidence based methods (Wiener & El-Yaniv, 2011; Geifman & El-Yaniv, 2017; Jiang et al., 2018) and customized selective losses (Cortes et al., 2016; Geifman & El-Yaniv, 2019; Liu et al., 2019; Gangrade et al., 2021b). Notably, several works propose customized losses and incorporate them into neural networks. In Geifman & El-Yaniv (2019), the network maintains an extra output neuron to indicate rejection of datapoints. Liu et al. (2019) use the Gambler loss, where a cost term is associated with each output neuron and a doubling-rate-like loss function is used to balance rejections and predictions. Thulasidasan et al. (2019) also use an extra output neuron, in their case to identify noisy labels and improve the robustness of learning. Huang et al. (2020) adopt a progressive label smoothing method which prevents DNNs from overfitting and improves selective risk when applied to selective classification. Cortes et al. (2016) perform data selection with an extra model and introduce a selective loss that helps maximize the coverage ratio, thus trading off a small fraction of data for better precision. In a similar spirit to Kalai et al. (2012), Gangrade et al. (2021b) apply a one-sided prediction method to model a high-confidence region for each individual class, and maximize coverage while maintaining a low risk level.
Existing works on selective prediction are all motivated by the trade-off between accuracy and coverage, i.e., one wants to make safe predictions to achieve higher precision while maintaining a reasonable recall. Our paper is the first to investigate the case where some (or even the majority) of the data is uninformative, and thus must be discarded at test time. Unlike in selective prediction, in our setting there is a latent ground truth indicator function of whether a data point should be selected or not. Our method is guaranteed to identify the uninformative samples.

3. PROBLEM FORMULATION

In this section, we describe our model for the inherently noisy data generation process that we aim to study.

Definition 1 (Noisy Generative Process). We define the Noisy Generative Process by the notation $x \sim D_\alpha$, where

$$D_\alpha \equiv \begin{cases} x \sim D_U & \text{with prob. } 1-\alpha \quad \text{(Uninformative)} \\ x \sim D_I & \text{with prob. } \alpha \quad \text{(Informative)} \end{cases} \quad (1)$$

Let the ground truth labeling function $f^* : X \to \{+1, -1\}$ be in the hypothesis class $F$. Let $\Omega_D \subseteq \mathbb{R}^d$ be the support of $D_\alpha$. Suppose $\{\Omega_U, \Omega_I\}$ is a partition of $\Omega_D$. Let $\lambda(x) \in (\lambda, \frac{1}{2}]$ with $\lambda > 0$. The latent informative/uninformative status $z \in \{+1, -1\}$ has posterior distribution

$$P[z = 1 \mid x] \equiv \begin{cases} \frac{1}{2} - \lambda(x), & \text{if } x \in \Omega_U \\ \frac{1}{2} + \lambda(x), & \text{if } x \in \Omega_I \end{cases} \quad (2)$$

The observed data $(x, y)$ is generated according to

$$x \sim D_\alpha; \quad z \sim P[z \mid x]; \quad y \equiv \begin{cases} \text{Bernoulli}(0.5), & \text{if } z = -1 \\ f^*(x), & \text{if } z = 1 \end{cases} \quad (3)$$

Since $\lambda(x) > 0$, a point $x$ from $\Omega_U$ has a lower chance of being observed with its true label than a point from $\Omega_I$, and can thus be viewed as uninformative data in a relative sense. Conversely, $x$ from $\Omega_I$ can be viewed as informative data. Our Noisy Generative Process follows standard data-dependent label noise models, e.g., Massart Noise (Massart & Nédélec, 2006) and Benign Label Noise (Hanneke, 2009; Hanneke & Yang, 2015; Diakonikolas et al., 2019). Indeed, one can always choose $\lambda(x) \in [0, \frac{1}{2}]$ and $\alpha \in [0, 1]$ to replicate General Massart noise. Compared to classical label noise models, the assumption $\lambda(x) > \lambda$ introduces a label noise ratio gap, which distinguishes the informative and uninformative data. In Equation 3, the Bernoulli(0.5) label noise serves as a proxy for "white noise" in label corruption. When $\lambda(x) = \frac{1}{2}$ and $x \in \Omega_U$, Bernoulli(0.5) random label noise can be viewed as the strongest known non-adversarial label noise, of both theoretical and practical interest (Diakonikolas et al., 2019).
Such Bernoulli(0.5) random label noise could occur when hard-to-classify examples are shown to a human annotator (Klebanov & Beigman, 2010), or when fluctuations in financial markets closely resemble random walks (Tsay, 2005). A typical setting studied in this work is the case where both $\alpha$ and $\lambda$ are non-vanishing while a significant fraction of the data is uninformative (large $1-\alpha$), and the label noise ratio gap between informative and uninformative data is distinguishable (large $\lambda$). The next definition describes a recoverability condition for the optimal function for the latent informative/uninformative status $z$.

Definition 2 ($G$-realizable). Given support $\Omega$ and $\lambda(x) \in (\lambda, \frac{1}{2}]$, let the posterior distribution of $z$ be defined as in Equation (2). We say $\Omega$ is $G$-realizable if there exists $g^* \in G : X \to \{+1, -1\}$ satisfying $g^*(x) = 2 \cdot \mathbb{1}\{P[z = 1 \mid x] > \frac{1}{2}\} - 1$.

Ideally, one wishes to select all informative data, where signal dominates noise. This can be done by recovering $g^*(\cdot)$, which we view as the ground truth selector. The $G$-realizable condition is analogous to the realizability condition (Massart & Nédélec, 2006; Hanneke & Yang, 2015) in the classical label noise problem. The major difference and challenge in recovering $g^*(\cdot)$, compared to learning a classifier, is that there is no direct observation of the informative/uninformative status $z$. The major contribution of this work is a natural selector risk which recovers $g^*(\cdot)$ without observing the latent variable $z$. Having introduced the data generation process, we now describe the learning task.

Assumption 1. The data $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ is generated i.i.d. according to the Noisy Generative Process (Definition 1), with $f^* \in F$ and a support $\Omega$ that satisfies the $G$-realizable condition.

Given the above assumption, we are interested in the following learning task.

Problem 1 (Abstain from Uninformative Data). Under Assumption 1 with i.i.d. observations from $D_\alpha$, given $f \in F$ sufficiently close to $f^*$, we aim to learn a selector $g \in G$ that is close to $g^*$.
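As an illustration of Definition 1, the following minimal sketch draws samples from the Noisy Generative Process. It assumes a constant noise gap λ(x) = `lam` (as in the experiments of Section 7) and uses hypothetical toy choices that are not part of the paper: informative points lie in [0, 1], uninformative points in [-1, 0), and `toy_f_star` thresholds |x| at 0.5.

```python
import random


def toy_f_star(x):
    # Hypothetical ground-truth labeling function f* for the toy example.
    return 1 if abs(x) >= 0.5 else -1


def draw_informative_toy(rng):
    # Hypothetical D_I: uniform on [0, 1].
    return rng.uniform(0.0, 1.0)


def draw_uninformative_toy(rng):
    # Hypothetical D_U: uniform on [-1, 0).
    return rng.uniform(-1.0, -1e-9)


def sample_noisy_process(n, alpha, lam, f_star, draw_inf, draw_uninf, rng):
    """Draw n triples (x, y, z) following Definition 1 with lambda(x) = lam."""
    data = []
    for _ in range(n):
        informative = rng.random() < alpha            # x ~ D_I with prob. alpha
        x = draw_inf(rng) if informative else draw_uninf(rng)
        # P[z = 1 | x] from Equation (2)
        p_clean = 0.5 + lam if informative else 0.5 - lam
        z = 1 if rng.random() < p_clean else -1
        # Equation (3): true label if z = 1, fair coin flip over {+1, -1} otherwise
        y = f_star(x) if z == 1 else rng.choice([+1, -1])
        data.append((x, y, z))
    return data
```

Note that the latent status z is returned here only for inspection; in the learning problem it is never observed.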

4. OUR METHOD

In this section, we present our approach for learning and abstaining in the presence of uninformative data (Problem 1). The main challenge is that the latent informative/uninformative status of a datapoint is unknown. Our main idea is to introduce a novel yet natural selector loss function that trains a selector based on the performance of the best predictor (Section 4.1). In Section 5, we present our main theoretical result: given any reasonably good classifier, a selector that minimizes the proposed selector loss solves Problem 1 with minimax-optimal sample complexity. Inspired by the theoretical results, in Section 6 we propose a heuristic algorithm that iteratively optimizes the predictor and the selector.

4.1. SELECTOR LOSS

In an idealized setting where the latent informative/uninformative variables $\{(x_i, z_i)\}_{i=1}^{n}$ are available, recovering $g^*$ is similar in spirit to learning a classifier under label noise. It suffices to minimize the following classical classification risk:

$$\text{Non-Realizable Risk}(g; S_n) \equiv \sum_{i=1}^{n} \mathbb{1}\{g(x_i) \neq z_i\} \quad (4)$$

However, in practice $z$ is never revealed. To learn a selector without direct supervision, we have to leverage the performance of a predictor $f$. We propose to replace $z$ in Equation 4 with a pseudo-informative label $\mathbb{1}\{f(x) = y\}$, whose randomness comes from $z$ and the noisy label $y$.

Definition 3 (Selector Loss). Given $f \in F$ and its selector $g \in G$, we define the following empirical weighted 0-1 type risk w.r.t. $g(\cdot)$ as the selector risk:

$$R_{S_n}(g; f, \beta) \equiv \sum_{i=1}^{n} \Big\{ \beta \, \mathbb{1}\{f(x_i) \neq y_i\} \mathbb{1}\{g(x_i) > 0\} + \mathbb{1}\{f(x_i) = y_i\} \mathbb{1}\{g(x_i) \le 0\} \Big\} \quad (5)$$

The selector loss is also a natural metric for evaluating the quality of the selector. The loss penalizes the cases where (1) the predictor makes a correct prediction on a datapoint that the selector considers uninformative and abstains from, or (2) the predictor makes an incorrect prediction on a datapoint that the selector considers informative. Intuitively, the loss drives the selector to partition the domain into informative and uninformative regions. Within the informative region, the predictor is supposed to fit the data well and should be more accurate. Within the uninformative region, the label is random and the predictor is supposed to be more prone to error. Note that the selector loss penalizes two types of errors: an incorrect prediction on a selected datapoint, $(f(x) \neq y) \wedge (g(x) > 0)$, and a correct prediction on an unselected datapoint, $(f(x) = y) \wedge (g(x) \le 0)$. Since the label noise is non-adversarial, $y$ coincides with $f^*(x)$ with higher probability than not, which introduces an imbalance in the pseudo-informative label.
We thus use $\beta$ to weigh these two types of errors in the loss. An analysis of the choice of $\beta$ can be found in Section A.1. Our theoretical analysis suggests that the accuracy of the selector is guaranteed for a wide range of $\beta$. Our empirical study also shows stability with regard to this choice. Learning a selector with the novel loss. To learn a selector, one can first follow a standard procedure, e.g., empirical risk minimization (ERM), to estimate a predictor $\hat{f}$ of reasonable quality. The selector can then be estimated by minimizing the selector loss, $\hat{g} = \arg\min_{g \in G} R_{S_n}(g; \hat{f}, \beta)$, conditioned on the estimated predictor $\hat{f}$. In Figure 1, we show an example of the ERM strategy using an SVM, with the 0-1 loss replaced by the hinge loss. In this case, the losses are all convex and the empirical minimizers $\hat{f}$ and $\hat{g}$ can be computed exactly. In practice, however, exact empirical minimization is not always possible, as optimization for complex models (e.g., DNNs) and non-convex losses remains an open problem. We therefore propose a heuristic algorithm in the spirit of our theoretical results: it jointly learns $f$ and $g$ by iteratively minimizing the selector loss and a re-weighted classification risk (see Section 6).
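The empirical selector risk of Definition 3 can be evaluated directly from predictions. The sketch below is illustrative only; the function name and the list-based interface are assumptions, with predictions and labels in {+1, -1} and a real-valued selector score that selects a point when it is positive.

```python
def selector_loss(f_preds, g_scores, labels, beta):
    """Empirical selector risk of Definition 3 (Equation 5), as an unnormalized sum."""
    total = 0.0
    for f_x, g_x, y in zip(f_preds, g_scores, labels):
        selected = g_x > 0
        correct = f_x == y
        if selected and not correct:
            total += beta   # incorrect prediction on a selected datapoint
        elif not selected and correct:
            total += 1.0    # correct prediction on an abstained datapoint
    return total


# Toy usage: the predictor is right on points 0 and 2, wrong on points 1 and 3.
f_preds = [1, 1, -1, -1]
labels = [1, -1, -1, 1]
print(selector_loss(f_preds, [1, 1, -1, -1], labels, beta=3.0))   # selects 0 and 1
print(selector_loss(f_preds, [1, -1, 1, -1], labels, beta=3.0))   # selects 0 and 2
```

A selector that keeps exactly the points on which the predictor is correct incurs zero loss, while selecting a misclassified point costs β and abstaining on a correctly classified point costs 1.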

5. MINIMAX-OPTIMAL RISK BOUND

In this section we present our main theoretical results. The main result can be summarized in the following (informal) statement.

Main Result (Informal). For any reasonably good predictor $\hat{f}$, with sufficient data, the selector $\hat{g}$ estimated using $\hat{g} = \arg\min_{g \in G} R_{S_n}(g; \hat{f}, \beta)$ is sufficiently close to the target $g^*$ with high probability.

Remark 1. The toolkit we use in the proof is a Bernstein-type inequality for fast generalization rates under a margin condition (Massart & Nédélec, 2006; Van Erven et al., 2015; Li & Liu, 2021). We also provide an information-theoretic lower bound construction in Section A.2 to show that our risk bound is minimax-optimal. Our construction of the lower bound is motivated by (Ehrenfeucht et al., 1989; Blumer et al., 1989) and Le Cam's method (Yu, 1997). Due to space constraints, detailed proofs of our theorems are provided in the Appendix. We also present our extension from a finite hypothesis class to a VC class using Local Rademacher Average tools (Bartlett et al., 2005) in Section A.6. In the analysis, we do not pursue risk bounds for learning $f^*$, since this has been thoroughly studied in the existing literature (Blum et al., 2016; Mohri et al., 2018; Bartlett et al., 2005; Massart & Nédélec, 2006). Instead, the theorem admits any classifier $\hat{f}$ that is close to $f^*$ in a PAC fashion, providing additional flexibility in choosing the classifier $\hat{f}$.

Theorem 1 (Risk Bound). Let $S_n = \{(x_i, y_i)\}_{i=1}^{n}$ be an i.i.d. sample from the Noisy Generative Process described in Definition 1 under Assumption 1, with $f^*(\cdot) \in F$, $g^*(\cdot) \in G$, $|F| < \infty$, and $|G| < \infty$. Given $\lambda$, let

$$\beta \in \left[ \frac{3-2\lambda}{1+2\lambda} + \lambda, \ \min\left( \frac{3+2\lambda}{1-2\lambda} - \frac{\lambda}{1-4\lambda^2}, \ 10 \right) \right].$$

For any $\hat{f}(\cdot) \in F$, let $\hat{g} = \arg\min_{g \in G} R_{S_n}(g; \hat{f}, \beta)$. Then for any $\varepsilon > 0$, there is a $\delta > 0$ such that the following holds. For

$$n \ge \max\left\{ \frac{32 \beta^2 \log(|G|/\delta)}{\lambda \varepsilon}, \ \frac{24 \beta \log(|F|/\delta)}{\varepsilon} \right\},$$

and for $\hat{f}$ that satisfies one of the following conditions:
• $\hat{f}(\cdot) \in F$ satisfies $\mathbb{E}_x[\mathbb{1}\{\hat{f}(x) \neq f^*(x)\}] \le \frac{\varepsilon}{8\beta}$ with probability at least $1-\delta$,
• if $\lambda = \frac{1}{2}$, $\hat{f}(\cdot) \in F$ satisfies $\mathbb{E}_x[\mathbb{1}\{\hat{f}(x) \neq f^*(x)\} \mid x \in \Omega_I] \le \frac{\varepsilon}{8\beta\alpha}$ with probability at least $1-\delta$,
the following holds with probability at least $1-2\delta$:

$$R(\hat{g}; f^*, \beta) - R(g^*; f^*, \beta) \le \varepsilon.$$

Remark 2. The assumption that $\mathbb{E}_x[\mathbb{1}\{\hat{f}(x) \neq f^*(x)\}] \le \varepsilon$ can be achieved via ERM on the classification loss $\sum_{i=1}^{n} \mathbb{1}\{f(x_i) \neq y_i\}$ under some margin conditions (Massart & Nédélec, 2006; Bartlett et al., 2005). In practice, one can also apply methods beyond ERM to obtain $\hat{f}$ (Namkoong & Duchi, 2017; Zhang et al., 2017b; Huang et al., 2020). In particular, in the case $\lambda = \frac{1}{2}$, the data in the support $\Omega_U$ is un-learnable, as the labels $y$ are purely random. While approximating $f^*$ on the full support is not possible in general, one can control the conditional risk $\mathbb{E}_x[\mathbb{1}\{\hat{f}(x) \neq f^*(x)\} \mid x \in \Omega_I]$ via a standard ERM scheme (see the proof in Appendix Section A.5). We stress that Theorem 1 holds for any classifier that is close to $f^*$, even in the case where $\hat{f}$ and $\hat{g}$ are trained on the same dataset.

Corollary 1 (Recovering $g^*$). Under the conditions of Theorem 1, if we choose $\beta = 3$, we have:

$$\mathbb{E}_x[\mathbb{1}\{\hat{g}(x) \neq g^*(x)\}] \le \frac{4\varepsilon(1+2\lambda)}{\lambda}.$$

Corollary 1 suggests that by minimizing the empirical version of the loss from Definition 3, one can recover $g^*$ in a PAC fashion. The theoretical guarantee holds even in the very challenging case where $\alpha < 0.5$ and $\lambda = \frac{1}{2}$, i.e., the majority of the data have purely random labels. The analysis of the selector loss (Theorem 1) relies on the quality of the classifier $\hat{f}$. But since we know that $\hat{g}$ is able to abstain from uninformative data, we can retrain $\hat{f}$ beyond standard ERM, with informative data up-weighted, thereby improving the accuracy of $\hat{f}$.
Such circular logic naturally leads to a practical iterative algorithm that we present in the next section.

Algorithm 1 Iterative Soft Abstain
Input: dataset $S_n = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, weight parameter $\beta$, random initial $f^0$ and $g^0$, initial sample weights $\gamma_i^0 = \frac{1}{n}, \forall i \in [n]$, meta learning rate $\eta$, number of iterations $T$
for $t \leftarrow 1, \ldots, T$ do
    Optimize loss to update predictor $f^t$: $\sum_{i=1}^{n} \gamma_i^t \{ y_i \log(f(x_i)) + (1-y_i) \log(1-f(x_i)) \}$
    Approximate the 'pseudo-informative label': $z_i^t = \mathbb{1}\{ \mathbb{1}\{f^t(x_i) > 0.5\} = y_i \}$
    Optimize loss to update selector $g^t$: $\sum_{i=1}^{n} \{ z_i^t \log(g(x_i)) + \beta (1-z_i^t) \log(1-g(x_i)) \}$
    Update sample weights using $g^t$: $\gamma_i^{t+1} = \frac{\gamma_i^t (1 + \eta \mathbb{1}\{g^t(x_i) > 0.5\})}{\sum_{j=1}^{n} \gamma_j^t (1 + \eta \mathbb{1}\{g^t(x_j) > 0.5\})}$
end for
Output: $f^T$, $g^T$

6. A PRACTICAL HEURISTIC ALGORITHM

Motivated by our theoretical analysis, we propose a practical algorithm in the same spirit as the selector loss. From a computational standpoint, we replace the binary loss by the cross-entropy loss and require that both $f$ and $g$ have continuous-valued outputs, ranging between 0 and 1 instead of the binary output $\{+1, -1\}$. The labels $y$ also need to be processed so that their values are in $\{0, 1\}$. We also relax the requirement for minimization oracles, allowing the practical algorithm to jointly optimize the predictor and the selector in an iterative manner. At each iteration, we update the predictor using the informative data selected by the selector, and then update the selector based on the predictor's output. See Algorithm 1 for the pseudo-code. A pictorial example of Algorithm 1's performance can be found in Figure 3 of the Appendix. During the joint optimization process, the predictor counts on the selector to up-weight informative data. By putting more effort into the informative data, we aim to improve the performance of the predictor via learning beyond ERM (Zhang et al., 2017b; Ren et al., 2018; Huang et al., 2020).
However, the initial selector is not trustworthy. To update the predictor $f$, we turn to a so-called soft abstention scheme: we use a weight vector $\gamma$ that progressively down-weights the samples on which $g$ abstains, in the spirit of multiplicative weights update (MWU) algorithms (Cesa-Bianchi & Lugosi, 2006; Arora et al., 2012). Specifically, we increase the weight $\gamma_i$ of the $i$-th sample if the selector accepts $x_i$, $\gamma_i \leftarrow \gamma_i (1 + \eta \cdot \mathbb{1}\{g(x_i) > 0.5\})$, and then normalize so that $\sum_{i=1}^{n} \gamma_i = 1$. We call this a soft abstention approach because the algorithm decreases the relative weight of uninformative data gradually. We count on the MWU mechanism to serve as a soft version of the selector, allowing the classifier to put less effort into learning uninformative data.
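To make the loop of Algorithm 1 concrete, here is a minimal sketch under simplifying assumptions that are not part of the paper: both predictor and selector are linear logistic models trained by plain gradient descent, labels are in {0, 1}, and the selector objective is divided by n for step-size stability. All function names, step counts and learning rates are hypothetical.

```python
import math


def sigmoid(t):
    # Numerically safe logistic function.
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def predict(w, x):
    # Linear logistic model; the last entry of w is the bias.
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + w[-1])


def fit_logistic(X, targets, weights, w, steps=200, lr=0.5):
    """Weighted cross-entropy minimization by gradient descent, warm-started at w."""
    for _ in range(steps):
        grad = [0.0] * len(w)
        for x, t, c in zip(X, targets, weights):
            err = c * (predict(w, x) - t)
            for j, xj in enumerate(x):
                grad[j] += err * xj
            grad[-1] += err
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return w


def iterative_soft_abstain(X, y, beta=3.0, eta=0.5, T=5):
    n, d = len(X), len(X[0])
    f_w, g_w = [0.0] * (d + 1), [0.0] * (d + 1)
    gamma = [1.0 / n] * n
    for _ in range(T):
        # Update the predictor on the gamma-weighted cross-entropy.
        f_w = fit_logistic(X, y, gamma, f_w)
        # Pseudo-informative label: 1 if the thresholded prediction matches y.
        z = [1 if (predict(f_w, x) > 0.5) == (yi == 1) else 0
             for x, yi in zip(X, y)]
        # Update the selector; samples with z = 0 carry weight beta.
        sel_weights = [(1.0 if zi == 1 else beta) / n for zi in z]
        g_w = fit_logistic(X, z, sel_weights, g_w)
        # Multiplicative weights update, then normalize to sum to one.
        gamma = [c * (1.0 + eta * (predict(g_w, x) > 0.5))
                 for c, x in zip(gamma, X)]
        total = sum(gamma)
        gamma = [c / total for c in gamma]
    return f_w, g_w, gamma
```

In practice both models would be neural networks trained with a stochastic optimizer; the sketch only mirrors the order of the four updates and the normalization of the sample weights.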

7. EXPERIMENTS

In this section, we test the efficacy of our heuristic algorithm (Algorithm 1) on publicly available datasets. The empirical study aims to answer the following questions.
Q1: Is Algorithm 1 able to approximate the ground truth selector $g^*$? The answer is yes. The empirical result on the semi-synthetic dataset (Figure 2 in Section 7.1) suggests that Algorithm 1 recovers the ground truth selector $g^*$ within a reasonable error range.
Q2: How does Algorithm 1 compare to baselines on the semi-synthetic dataset in recovering the ground truth selector $g^*$? The results are presented in Table 1 in Section 7.2. None of the baselines is equipped to distinguish informative from uninformative data automatically. They all suffer from poor estimation of $\alpha$, the proportion of informative data in the dataset, which must be given as a hyper-parameter. In contrast, our method does not require such prior information and provably recovers $\alpha$ (which is implied by the recovery of $g^*$), and thus behaves well consistently.
Q3: How does Algorithm 1 perform on real-world datasets compared to selective learning baselines? On real-world datasets, Algorithm 1 consistently achieves competitive performance against the baselines in the low coverage regime, e.g., when the proportion of data chosen by the selector is at most 20%. These empirical results suggest that our method is good at picking out strongly informative data.
Baselines. We compare our method to two recently proposed selective learning algorithms: (1) SelectiveNet (Geifman & El-Yaniv, 2019), which integrates an extra neuron as a data selector in the output layer and also introduces a loss term to control the coverage ratio; and (2) DeepGambler (Liu et al., 2019), which also maintains an extra neuron for abstention and uses a doubling-rate-like loss term (i.e., the gambler loss) to train the model. (3) We also create a third baseline that selects data using model prediction confidence, which we refer to as Confidence.
The intuition behind this heuristic baseline is that informative data should have higher confidence than uninformative data. Dataset Construction: We explicitly control the support of informative/uninformative data. For the MNIST+FashionMNIST dataset, images from MNIST are set to be uninformative and images from Fashion-MNIST are set to be informative. For the SVHN (Netzer et al., 2011) dataset, classes 5-9 are set to be uninformative and classes 0-4 are set to be informative. Datasets are constructed with different values of the informative data fraction $\alpha$ and the label noise ratio gap $\lambda$ according to the noisy generative process. We inject label noise according to Definition 1 by setting $\lambda(x) = \lambda$. We shuffle the labels of informative/uninformative data according to the value of $\lambda$ and mix informative/uninformative data according to $\alpha$. In particular, we choose $\alpha \in \{25\%, 50\%\}$ and $\lambda \in \{0.3, 0.35, 0.4\}$.

Experiment Details and

Results and Discussion. The average accuracy of the selector given by Algorithm 1, measured against the ground truth selector, is presented in Figure 2. As one can observe, Algorithm 1 recovers $g^*$ within a reasonable error range. In addition, the accuracy improves as $\lambda$ increases, which supports our bound in Equation 1. The MNIST+FashionMNIST data turns out to be more challenging than SVHN for recovering $g^*$. We believe this is because the informative data in MNIST+FashionMNIST has 10 classes, which makes learning a predictor $f$ harder than on SVHN with 5 classes. The quality of the selector $g$ suffers from an imprecise $f$.

7.2. EXPERIMENTS USING SEMI-SYNTHETIC DATA FOR Q2

Dataset Construction: The construction of informative/uninformative data follows Section 7.1. We uniformly shuffle the labels of the uninformative data and keep the original labels of the informative data. Informative and uninformative datasets are mixed in different proportions as a proxy for the noisy generative process with different $\alpha$. The construction of the dataset mimics the noisy generative process with $\lambda = \frac{1}{2}$. This choice of $\lambda$ ensures that the informative data has no label noise. Such a noiseless setting allows the three baselines to estimate/set $\alpha$ to the best of their ability, according to the accuracy of the estimated predictor $\hat{f}$. Evaluation Metric. We use three criteria to jointly evaluate a selective learning outcome. (1) Selective risk (SR). Selective risk is the empirical risk measured over the data points selected by the algorithm. This metric is also adopted in (Geifman & El-Yaniv, 2019; Liu et al., 2019). (2) Precision. Precision is the proportion of truly informative data points among all the data picked out by the selector. (3) Recall. Recall is the proportion of truly informative samples picked out by the selector out of all the informative samples in the dataset. SR evaluates the quality of the classifier, while precision and recall are the standard ML metrics for the selector. An ideal algorithm should have low SR, high precision and high recall. Results and discussion. Table 1 presents the results on the MNIST+FashionMNIST and SVHN datasets with different fractions ($\alpha$) of informative data. The selective learning methods, SelectiveNet and DeepGambler, perform poorly because they require the prior information $\alpha$ to be given as input. The estimation of $\alpha$ turns out to be very challenging in practice due to the presence of noisy uninformative data. In contrast, our method provably recovers $\alpha$ automatically and is robust to the choice of the hyper-parameter $\beta$. The ablation study in Appendix Fig 7 exhibits this stability: our method consistently behaves well for different values of $\beta$.
A thorough exploration of the estimation of $\alpha$ and the corresponding performance of the baselines is provided in Appendix Tables 10, 11 and 12. We also provide an ablation study on the MWU mechanism in Figure 4 and Table 13, showing its ability to up-weight informative data and improve the algorithm's performance.
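The three criteria defined in this section can be computed as in the following sketch. The function name and the boolean-list interface are assumptions made for illustration: `selected` marks the points accepted by the selector, `is_informative` is the ground-truth status available in the semi-synthetic setting, and `correct` marks whether the predictor's output matches the label.

```python
def selection_metrics(selected, is_informative, correct):
    """Return (selective risk, precision, recall) from equal-length boolean lists."""
    n_selected = sum(selected)
    n_informative = sum(is_informative)
    true_positives = sum(s and i for s, i in zip(selected, is_informative))
    errors_on_selected = sum(s and not c for s, c in zip(selected, correct))
    nan = float("nan")   # returned when a ratio is undefined
    selective_risk = errors_on_selected / n_selected if n_selected else nan
    precision = true_positives / n_selected if n_selected else nan
    recall = true_positives / n_informative if n_informative else nan
    return selective_risk, precision, recall


# Toy usage: three of four points are selected, two of them truly informative,
# and the predictor errs on one selected point.
print(selection_metrics([True, True, True, False],
                        [True, True, False, True],
                        [True, False, True, True]))
```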

7.3. EXPERIMENTS USING REAL-WORLD DATA FOR Q3

In this section, we report our empirical study on 3 publicly available datasets: (1) breast ultrasound images (BUS) (Al-Dhabyani et al., 2020), (2) the Lending Club dataset (LC), and (3) the Oxford realized volatility (Volatility) dataset (Heber et al., 2009). We aim to demonstrate the potential of the proposed algorithm in real-world applications and its advantages in selecting useful information out of noisy datasets. Evaluation Metric. Unlike in the synthetic experiments, for real-world datasets the ground-truth labels indicating which data is informative and which is uninformative are not available, so metrics like precision/recall are not applicable. Instead, we report the selective risk of each algorithm at different coverage levels. Specifically, for a given coverage level, we pick the fraction of test data points with the highest selective confidence given by each selector, and compute the test selective risk on those points. Result and Discussion. From Table 2, we can see that our method achieves competitive performance against the baselines at low coverage levels. This suggests that our method is especially good at picking out strongly informative data. Strongly informative data are easier to learn, and thus the classifier is more consistent with the ground-truth model on them. Such a low-risk regime can be captured by our selector loss, leading to low selective risk.
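The coverage-based evaluation described above can be sketched as follows. The function name, the rounding of the number of retained points and the arbitrary tie-breaking are assumptions; the paper does not specify them.

```python
def selective_risk_at_coverage(confidences, correct, coverage):
    """Error rate on the `coverage` fraction of points with the highest selector confidence."""
    n = len(confidences)
    k = max(1, int(round(coverage * n)))          # number of retained test points
    ranked = sorted(range(n), key=lambda i: confidences[i], reverse=True)[:k]
    return sum(not correct[i] for i in ranked) / k


# Toy usage: five test points ranked by selector confidence.
confidences = [0.9, 0.8, 0.4, 0.2, 0.1]
correct = [True, True, False, True, False]
print(selective_risk_at_coverage(confidences, correct, 0.4))   # top two points only
print(selective_risk_at_coverage(confidences, correct, 1.0))   # all five points
```

Sweeping `coverage` over a grid yields a risk-coverage curve for each selector, which is the kind of comparison reported per coverage level.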

8. CONCLUSION AND FUTURE WORK

In this work, we take the first step towards principled learning in domains where a lot of data is naturally uninformative or highly noisy and should be discarded in both the learning and inference stages. We propose a general noisy generative process that formally describes such a setting. A novel loss is designed for training the selector model with theoretical guarantees. Based on this loss, we design a heuristic algorithm that jointly learns the predictor and the selector. Our empirical study supports the merit of our methods. We believe the Noisy Generative Process can be generalized to address different problems, such as active learning (Cohn et al., 1994) and out-of-distribution generalization (Arjovsky et al., 2019). We look forward to these extensions in future work.

9. REPRODUCIBILITY STATEMENT

We describe the synthetic data generation procedure and the evaluation metrics in Section 7. For the convenience of readers who wish to reproduce the experiments, we also summarize the experimental settings and give implementation details in Section B.1. The source code, as well as the parameters needed to reproduce the experimental results, will be made available together with the publication of the paper.

A THEORETICAL RESULTS DETAILS

In this appendix section, we present the missing proofs as well as additional empirical results.

A.1 PRELIMINARY

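We describe the risk for the selector loss on $(x, y) \sim \mathcal{X} \times \mathcal{Y} \subset \mathbb{R}^d \times \{+1, -1\}$:

$$R(g; f, \beta) := \mathbb{E}_{x,y}\Big[\beta\,\mathbb{1}\{g(x) = 1\}\mathbb{1}\{f(x) \neq y\} + \mathbb{1}\{g(x) \neq 1\}\mathbb{1}\{f(x) = y\}\Big] \tag{7}$$

The choice of $\beta$ should ensure that, given the Bayes optimal classifier $f^*(\cdot)$, $g^*(\cdot)$ is the minimizer of the selector risk $R(g; f^*, \beta)$. Given $f^*$, the risk gap between any selector $g$ and $g^*$ can be written as:

$$
\begin{aligned}
R(g; f^*, \beta) - R(g^*; f^*, \beta)
={}& \mathbb{E}_{x,y}\Big[\beta\,\mathbb{1}\{g(x) = 1\}\mathbb{1}\{g^*(x) = 1\}\mathbb{1}\{f^*(x) \neq y\} + \beta\,\mathbb{1}\{g(x) = 1\}\mathbb{1}\{g^*(x) \neq 1\}\mathbb{1}\{f^*(x) \neq y\} \\
&\qquad + \mathbb{1}\{g(x) \neq 1\}\mathbb{1}\{g^*(x) = 1\}\mathbb{1}\{f^*(x) = y\} + \mathbb{1}\{g(x) \neq 1\}\mathbb{1}\{g^*(x) \neq 1\}\mathbb{1}\{f^*(x) = y\}\Big] \\
&- \mathbb{E}_{x,y}\Big[\beta\,\mathbb{1}\{g(x) = 1\}\mathbb{1}\{g^*(x) = 1\}\mathbb{1}\{f^*(x) \neq y\} + \beta\,\mathbb{1}\{g(x) \neq 1\}\mathbb{1}\{g^*(x) = 1\}\mathbb{1}\{f^*(x) \neq y\} \\
&\qquad + \mathbb{1}\{g(x) \neq 1\}\mathbb{1}\{g^*(x) \neq 1\}\mathbb{1}\{f^*(x) = y\} + \mathbb{1}\{g(x) = 1\}\mathbb{1}\{g^*(x) \neq 1\}\mathbb{1}\{f^*(x) = y\}\Big] \\
={}& \mathbb{E}_{x}\Big[\Big(\beta\Big(\tfrac{1}{4} + \tfrac{\lambda(x)}{2}\Big) - \tfrac{3}{4} + \tfrac{\lambda(x)}{2}\Big)\mathbb{1}\{g(x) = 1\}\mathbb{1}\{g^*(x) \neq 1\}\Big] \\
&+ \mathbb{E}_{x}\Big[\Big(\tfrac{3}{4} + \tfrac{\lambda(x)}{2} - \beta\Big(\tfrac{1}{4} - \tfrac{\lambda(x)}{2}\Big)\Big)\mathbb{1}\{g(x) \neq 1\}\mathbb{1}\{g^*(x) = 1\}\Big]
\end{aligned} \tag{8}
$$

Since $\lambda(x)$ is data dependent, to ensure that $R(g; f^*, \beta) \geq R(g^*; f^*, \beta)$ for all $g \in \mathcal{G}$, it suffices to have $\beta\big(\tfrac{1}{4} + \tfrac{\lambda(x)}{2}\big) - \tfrac{3}{4} + \tfrac{\lambda(x)}{2} > 0$ and $\tfrac{3}{4} + \tfrac{\lambda(x)}{2} - \beta\big(\tfrac{1}{4} - \tfrac{\lambda(x)}{2}\big) > 0$. That is, we need $\beta \geq \sup_x \frac{3 - 2\lambda(x)}{1 + 2\lambda(x)}$ and $\beta \leq \inf_x \frac{3 + 2\lambda(x)}{1 - 2\lambda(x)}$, which implies that it suffices to pick $\beta \in \big[\frac{3 - 2\lambda}{1 + 2\lambda}, \frac{3 + 2\lambda}{1 - 2\lambda}\big]$. Assuming $\lambda(x) \geq \lambda$, we pick $\beta$ within a certain margin of the above interval: by picking $\beta \in \big[\frac{3 - 2\lambda}{1 + 2\lambda} + \lambda,\ \frac{3 + 2\lambda}{1 - 2\lambda} - \frac{\lambda}{1 - 4\lambda^2}\big]$ we have

$$R(g; f^*, \beta) - R(g^*; f^*, \beta) \geq \frac{\lambda}{4(1 + 2\lambda)}\,\mathbb{E}_x\big[\mathbb{1}\{g(x) \neq g^*(x)\}\big] \tag{9}$$

$$g(x) = \sum_{j=1}^{d} \mathbb{1}\{x^\top e_j \neq 0\}\Big(-\mathbb{1}\{\|x\| \geq 1 - \alpha\}\cdot\zeta_j + \mathbb{1}\{\|x\| \leq \alpha\}\cdot\zeta_j - \mathbb{1}\{\alpha \leq \|x\| \leq 1 - \alpha\}\Big)$$

where $\zeta \in \{+1, -1\}^d$. For example, suppose that for each $j = 1, \dots, d$, $e_j$ is the vector with the $j$th entry one and the other entries zero. Then for $x = \tau \cdot e_j$ for some $j \in \{1, \dots, d\}$, $f^*(x)$ is $1$ if $\tau$ is positive and $-1$ if $\tau$ is negative. Moreover, $g(x)$ is $-\zeta_j$ if $|\tau| \geq 1 - \alpha$, it is $\zeta_j$ if $|\tau| \leq \alpha$, and it is $-1$ otherwise.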
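Let $\sigma \in \{+1, -1\}^d$ be a $d$-dimensional Rademacher vector and set

$$g^*(x) = \sum_{j=1}^{d} \mathbb{1}\{x^\top e_j \neq 0\}\Big(-\mathbb{1}\{\|x\| \geq 1 - \alpha\}\cdot\sigma_j + \mathbb{1}\{\|x\| \leq \alpha\}\cdot\sigma_j - \mathbb{1}\{\alpha \leq \|x\| \leq 1 - \alpha\}\Big).$$

In other words, the support of $D_\alpha$ is $\cup_{j=1}^{d} \Omega_j$, where $\Omega_j := \{x \mid x = \tau \cdot e_j\}$. If $\sigma_j = -1$, the informative part of $\Omega_j$ is $\{x \mid \|x\| \geq 1 - \alpha\}$; otherwise the informative part of $\Omega_j$ is $\{x \mid \|x\| \leq \alpha\}$. Assume $f^*(x)$ is deterministic and let $S = \{(x_i, y_i)\}_{i=1}^{n}$ be generated from the following process, which we denote as $Q$:

$$
\begin{aligned}
&\sigma \sim \mathrm{Unif}\{+1, -1\}^d \\
&g^*(x) = \sum_{j=1}^{d} \mathbb{1}\{x^\top e_j \neq 0\}\Big(-\mathbb{1}\{\|x\| \geq 1 - \alpha\}\cdot\sigma_j + \mathbb{1}\{\|x\| \leq \alpha\}\cdot\sigma_j - \mathbb{1}\{\alpha \leq \|x\| \leq 1 - \alpha\}\Big) \\
&\text{Generate } S = \{(x_i, y_i)\}_{i=1}^{n} \text{ according to:} \\
&\qquad j \;\begin{cases} = 1, & \text{with prob. } 1 - \frac{\varepsilon}{\lambda} \\ \sim \mathrm{Unif}\{2, \dots, d\}, & \text{with prob. } \frac{\varepsilon}{\lambda} \end{cases} \\
&\qquad \tau \sim \mathrm{Unif}[-1, 1] \\
&\qquad x = \tau \cdot e_j \\
&\qquad y = \begin{cases} f^*(x), & \text{with prob. } \frac{3}{4} + \frac{g^*(x)\lambda}{2} \\ -f^*(x), & \text{with prob. } \frac{1}{4} - \frac{g^*(x)\lambda}{2} \end{cases}
\end{aligned} \tag{10}
$$

Let $A$ be any (potentially randomized) algorithm that takes a dataset $S_\sigma$ as input, where $S_\sigma$ is generated from the process described in Equation 10. Let $\hat{g}$ be the hypothesis output by algorithm $A$. For a parameter $\beta$ we define

$$R(A(S_\sigma), \beta) = R(\hat{g}(x), f^*, \beta) = \mathbb{E}_{x,y}\Big[\beta\,\mathbb{1}\{\hat{g}(x) = 1\}\mathbb{1}\{f^*(x) \neq y\} + \mathbb{1}\{\hat{g}(x) \neq 1\}\mathbb{1}\{f^*(x) = y\}\Big]. \tag{11}$$

Theorem 2. Consider the noisy generative process defined in Definition 1 with $\Omega$ being $\mathcal{G}$-realizable. For any $\varepsilon \leq \lambda$, to achieve $\mathbb{E}_{S_n}[R(A(S_n), f^*, \beta) - R(g^*, f^*, \beta)] \leq \frac{\varepsilon}{8(1 + 2\lambda)}$ with $\beta \in \big[\frac{3 - 2\lambda}{1 + 2\lambda} + \lambda,\ \frac{3 + 2\lambda}{1 - 2\lambda} - \frac{\lambda}{1 - 4\lambda^2}\big]$, any algorithm $A$ will take at least $\frac{\log(|\mathcal{G}|)}{\lambda\varepsilon}$ many samples.

Proof. The lower bound construction is presented in Equation 10. Note that when $\lambda = \frac{1}{2}$, $y$ becomes purely random wherever $g^*(x) = -1$. Our lower bound construction in the case $\lambda = \frac{1}{2}$ works for any $f$ that is consistent with $f^*$ on $\Omega_I$.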
From equation ( 9) the risk gap R( g, f * , β) -R(g * , f * , β) averaged over σ and S σ can be written as E σ E Sσ [R( g, f * , β) -R(g * , f * , β)] ≥ λ 4(1 + 2λ) E σ E Sσ E x [1{ g(x) ̸ = g * (x)}] σ ≥ λ 4(1 + 2λ) E σ E Sσ d j=2 P x x ∈ Ω j ]P x g(x) ̸ = g * (x) x ∈ Ω j σ ≥ ε 4(1 + 2λ)d d j=2 E σ E Sσ P x g(x) ̸ = g * (x) x ∈ Ω j σ (12) In the last inequality we use the fact that P x [x ∈ Ω j ] = ϵ/(λd) for j ≥ 2. Let σ /j be a Rademacher vector conditional on coordinates {1, ..., j -1, j + 1, ...d}. Let σ {-j} be a vector equal to σ except at the jth entry. Above equation becomes: E σ E Sσ [R( g, f * , β) -R(g * , f * , β)] ≥ ε 8(1 + 2λ)d d j=2 E σ /j E Sσ P x g(x) ̸ = g * (x) x ∈ Ω j + E S σ {-j} P x g(x) ̸ = g * (x) x ∈ Ω j σ /j = 1 2 d-1 σ /j ∈{+1,-1} d-1 ε 8(1 + 2λ)d d j=2 P Sσ,x g(x) ̸ = g * (x) x ∈ Ω j + P S σ {-j} ,x g(x) ̸ = g * (x) x ∈ Ω j (13) We make our notation more specific. Let A(S σ ) = g σ and A(S σ {-j} ) = g σ -j . Notice that g * (x) also depends on σ. We let g * σ be g * (x) conditioned on σ and g * σ -j be g * (x) conditioned on σ {-j} . In particular, for all x ∈ Ω j , g * σ -j (x) ̸ = g * σ (x) could happen only when α ≥ ∥x∥ or ∥x∥ ≥ 1 -α. So equation 13 becomes E σ E Sσ [R( g, f * , β) -R(g * , f * , β)] ≥ 1 2 d-1 σ /j ∈{+1,-1} d-1 ε 8(1 + 2λ)d d j=2 P Sσ,x g σ (x) ̸ = g * σ (x) x ∈ Ω j + P S σ {-j} ,x g σ -j (x) ̸ = g * σ -j (x) x ∈ Ω j = 1 2 d-1 σ /j ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 P Sσ,x g σ (x) ̸ = g * σ (x) x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α + P S σ {-j} ,x g σ -j (x) ̸ = g * σ -j (x) x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α = 1 2 d-1 σ /j ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 P Sσ,x g σ (x) ̸ = g * σ (x) x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α + P S σ {-j} ,x g σ -j (x) ̸ = -g * σ (x) x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α Next we make Equation14 independent of x. 
E σ E Sσ [R( g, f * , β) -R(g * , f * , β)] ≥ 1 2 d-1 σ /j ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 P Sσ,x g σ (x) ̸ = g * σ (x) x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α + P S σ {-j} ,x g σ -j (x) ̸ = -g * σ (x) x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α = 1 2 d-1 σ /j ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 E Sσ,x 1{ g σ (x) ̸ = g * σ (x)} x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α + E S σ {-j} ,x 1{ g σ -j (x) ̸ = -g * σ (x)} x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α = 1 2 d-1 σ /j ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 E Sσ,x 1{ g σ ̸ = g * σ } x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α + E S σ {-j} ,x 1{ g σ -j ̸ = -g * σ } x ∈ Ω j , α ≥ ∥x∥, or, ∥x∥ ≥ 1 -α = * 1 2 d-1 σ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 P Sσ g σ ̸ = g * σ + P S σ {-j} g σ -j ̸ = -g * σ = 1 2 d-1 σ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 1 -P Sσ g σ ̸ = -g * σ + P S σ {-j} g σ -j ̸ = -g * σ ≥ 1 2 d-1 σ /j ∈{+1,-1} d-1 αε 8(1 + 2λ)d d j=2 1 -∥Q (n) σ -Q (n) σ {-j} ∥ T V (15) where Q (n) σ , Q (n) σ {-j} is the product distribution of n samples for S σ andS σ -j . The last step of inequality follows from the Le Cam's method. In the Equation * we use the fact that for all x s.t., ∥x∥ ≤ α or ∥x∥ ≥ 1 -α, 1{ g σ j (x) ̸ = g * σ j (x)} = 1{ g σ j ̸ = g * σ j } is free of x. Let Q σ be distribution of S σ and Q σ {-j} be distribution of S σ -j . The total variation distance can be bounded using the Hellinger distance, which is denoted as H(•, •). Below we bound the TV distance using Hellinger distance. ∥Q (n) σ -Q (n) σ {-j} ∥ T V ≤H(Q (n) σ , Q (n) σ {-j} ) 1 - H 2 (Q (n) σ , Q (n) σ {-j} ) 4 ≤ √ nH(Q σ , Q σ {-j} ) 1 - H 2 (Q (n) σ , Q (n) σ {-j} ) 4 H 2 (Q (n) σ ,Q (n) σ {-j} )≤nH 2 (Qσ,Q σ {-j} ) ≤ √ nH(Q σ , Q σ {-j} ) Now we bound the Hellinger distance. 
H 2 (Q σ , Q σ {-j} ) = x,y Q σ (x, y) -Q σ {-j} (x, y) 2 dxdy = x∈Ω j ,∥x∥≤α y=f * (x) Q σ (x, y) -Q σ {-j} (x, y) 2 dxdy + x∈Ω j ,∥x∥≥1-α y=f * (x) Q σ (x, y) -Q σ {-j} (x, y) 2 dxdy + x∈Ω j ,∥x∥≤α y̸ =f * (x) Q σ (x, y) -Q σ {-j} (x, y) 2 dxdy + x∈Ω j ,∥x∥≥1-α y̸ =f * (x) Q σ (x, y) -Q σ {-j} (x, y) 2 dxdy = αε dλ 3 4 + λ 2 - 3 4 - λ 2 2 + 1 4 + λ 2 - 1 4 - λ 2 2 ≤ 3αελ d Thus we can bound the total variation distance as: ∥Q (n) σ -Q (n) σ {-j} ∥ T V ≤ 3nαελ d Note inequality 18 together with inequality 13 we have and inequality 15 E σ E Sσ [R(A(S σ ), β) -R(g * , f * , β)] =E σ E Sσ [R( g, f * , β) -R(g * , f * , β)] ≥ d -1 d αε 8(1 + 2λ) 1 - 3nαελ d (19) Above implies sup σ∈{+1,-1} d E Sσ [R(A(S σ ), f * , β) -R(g * , f * , β)] ≥ E σ E Sσ [R(A(S σ ), β) - R(g * , f * , β)] ≥ d-1 d αε 8(1+2λ) 1 -3nαελ d . Since |G| = 2 d , any algorithm A will needs number of samples at least n = Ω log |G)| λεα so that there is a hope to achieve sup σ E Sσ [R(A(S σ ), β)] -R(g * , f * , β) ≤ αε 32(1 + 2λ) . Replacing αε with α finishes the proof. Remark 3. From the second inequality in Equation 12, it can be observed that the construction of information theoretic lower bound for risk function R(g, β) can also be applied to construction an Ω(log(|G|/(λε))) sample complexity lower bound for E x [g(x) ̸ = g * (x)]. Thus our Corollary 1 also achieves minimax-optimal rate for recovering g * for family of Noise Generative Process.
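To make the lower-bound construction concrete, the following is a minimal simulation sketch of the process $Q$ from Equation 10. It rests on assumptions: the function name `sample_process_q` is hypothetical, coordinates are 0-indexed (so the paper's $j = 1$ is index 0), and $f^*(x)$ is taken to be the sign of $\tau$, as in the example given with the construction.

```python
import numpy as np

def sample_process_q(n, d, alpha, lam, eps, rng):
    """Draw sigma and n labelled points from the process Q (a sketch).

    Returns (sigma, x, y, g_star, f_star). Assumes f*(x) = sign(tau).
    """
    if not (d >= 2 and 0 < alpha < 0.5 and 0 < eps <= lam <= 0.5):
        raise ValueError("need d >= 2, 0 < alpha < 1/2 and 0 < eps <= lam <= 1/2")

    # sigma ~ Unif{+1, -1}^d decides which band of each axis is informative.
    sigma = rng.choice([-1, 1], size=d)

    # First coordinate with prob. 1 - eps/lam, otherwise uniform over the rest.
    heavy = rng.random(n) < 1 - eps / lam
    j = np.where(heavy, 0, rng.integers(1, d, size=n))

    tau = rng.uniform(-1.0, 1.0, size=n)
    x = np.zeros((n, d))
    x[np.arange(n), j] = tau  # x = tau * e_j

    # g*(x): -sigma_j on the outer band, sigma_j on the inner band, -1 between.
    r = np.abs(tau)
    g_star = np.where(r >= 1 - alpha, -sigma[j], np.where(r <= alpha, sigma[j], -1))

    # y agrees with f*(x) with probability 3/4 + g*(x) * lam / 2.
    f_star = np.where(tau >= 0, 1, -1)
    agree = rng.random(n) < 0.75 + g_star * lam / 2
    y = np.where(agree, f_star, -f_star)
    return sigma, x, y, g_star, f_star
```

With `lam = 0.5`, labels on the informative region always match $f^*$, while labels elsewhere are fair coin flips, which is the regime the proof singles out.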

A.3 PROOF OF SAMPLE COMPLEXITY UPPER BOUND

Here we prove Theorem 1 in which we bound the risk gap R(g; f * , β) -R(g * ; f * , β). Recall that the empirical version of the selector loss is R Sn (g; f, β) = 1 n n i=1 β1{g(x i ) = 1}1{f (x i ) ̸ = y i } + 1{g(x i ) = 1}1{f (x i ) = y i } . Our high level approach is as follows. We first analyze the gap between R Sn ( g; f * , β) and R Sn (g * ; f * , β) and provide an upper bound for it. Then we use this upper bound to get an upper bound for the gap between R( g; f * , β) and R(g * ; f * , β) using concentration properties and Bernstein inequality. CASE I: f (•) ∈ F and E x [ f (x) ̸ = f * (x)] ≤ ε 8β with probability at least 1 -δ. To upper bound R Sn ( g; f * , β) -R Sn (g * ; f * , β), we use R Sn ( g; f , β) and R Sn (g * ; f , β) as a middle step. Since g is the empirical risk minimizer, we have R Sn ( g; f , β) ≤ R Sn (g * ; f , β). ( ) Next we leverage on the fact that f is consistent with f to establish an inequality in following fashion: R Sn ( g; f * , β) ≤ R Sn (g * ; f * , β) + const • ε Note we have that R Sn ( g; f , β) = 1 n n i=1 1 f (x i ) ̸ = f * (x i ) β1{ g(x i ) = 1}1{ f (x i ) ̸ = y i } + 1{ g(x i ) = 1}1{ f (x i ) = y i } + 1 n n i=1 1 f (x i ) = f * (x i ) β1{ g(x i ) = 1}1{ f (x i ) ̸ = y i } + 1{ g(x i ) = 1}1{ f (x i ) = y i } (21) Recall that R Sn ( g; f * , β) = 1 n n i=1 β1{ g(x i ) = 1}1{f * (x i ) ̸ = y i } + 1{ g(x i ) = 1}1{f * (x i ) = y i } . So R Sn ( g; f , β) = R Sn ( g; f * , β) - 1 n n i=1 1 f (x i ) ̸ = f * (x i ) β1{ g(x i ) = 1}1{f * (x i ) ̸ = y i } + 1{ g(x i ) = 1}1{f * (x i ) = y i } + 1 n n i=1 1 f (x i ) ̸ = f * (x i ) β1{ g(x i ) = 1}1{ f (x i ) ̸ = y i } + 1{ g(x i ) = 1}1{ f (x i ) = y i } ≥ R Sn ( g; f * , β) - β -1 n n i=1 1 f (x i ) ̸ = f * (x i ) (22) Recall that in the theorem assumptions we have E x [ f (x) ̸ = f * (x)] ≤ ε 8β with probability at least 1-δ. 
By Lemma 2, if n ≥ 24β 2 log(|F |/δ) ε we have with probability at least 1-δ, 1 n n i=1 1{ f (x i ) ̸ = f * (x i )} ≤ ε 4β , so we have R Sn ( g; f , β) ≥ R Sn ( g; f * , β) -ε/4 With a similar approach we get that R Sn (g * ; f , β) ≤ R Sn (g * ; f * , β) + ε/4 Thus using (20) we have following inequality holds with probability at least 1 -δ R Sn ( g; f * , β) ≤ R Sn (g * ; f * , β) + ε/2. ( ) To get a bound for R( g; f * , β) -R(g * ; f * , β), we first define ℓ(g; f, x, y) = β1{g(x) = 1}1{f (x) ̸ = y} + 1{g(x) = 1}1{f (x) = y}. Note that at this point we think of β as fixed and so we have not included it in the arguments of ℓ(•) for simplicity. Observe that R Sn (g, f * , β) = 1 n n i=1 ℓ(g; f * , x i , y) and R(g, f * , β) = E x,y ℓ(g; f * , x, y) for any g. First we have the following simple inequality directly taken from (23).

R( g, f

* , β) -R(g * , f * , β) = E x,y ℓ( g; f * , x, y) -E x,y ℓ(g * ; f * , x, y) ≤R Sn (g * ; f * , β) -R Sn ( g; f * , β) -(E x,y ℓ(g * ; f * , x, y) -E x,y ℓ( g; f * , x, y)) + ε/2 (24) By defining ∆ℓ(g * , g, x, y) = ℓ(g * ; f * , x, y) -ℓ(g; f * , x, y) for any g, we can express the above inequality as follows: E x,y ℓ( g; f * , x, y) -E x,y ℓ(g * ; f * , x, y) ≤ 1 n n i=1 ∆ℓ(g * , g, x i , y) -E x,y ∆ℓ(g * ; g, x, y) + ε/2 (25) To bound 1 n n i=1 ∆ℓ(g * , g, x i , y) -E x,y ∆ℓ(g * ; g, x, y) with high probability over all S n , we need to find a bound on 1 n n i=1 ∆ℓ(g * , g, x i , y) -E x,y ∆ℓ(g * ; g, x, y) that is true for all g simultaneously with high probability. We have : P Sn ∃g ∈ G, 1 n n i=1 ∆ℓ(g * , g; x i , y i ) -E x,y [∆ℓ(g * , g; x, y)] ≥ n β log(|G|/δ) + 2V ar(∆(g * , g; x, y)) log(|G|/δ) n ≤ g∈G P Sn 1 n n i=1 ∆ℓ(g * , g; x i , y i ) -E x,y [∆ℓ(g * , g; x, y)] ≥ n β log(|G|/δ) + 2V ar(∆(g * , g; x, y)) log(|G|/δ) n Now to bound 1 n n i=1 ∆ℓ(g * , g, x i , y) -E x,y ∆ℓ(g * ; g, x, y) we use Bernstein inequality. For that we need to bound V ar x,y [∆ℓ(g * , g, x, y)]. We first expand ∆ℓ(g * , g, x, y). ∆ℓ(g * , g, x, y) =ℓ(g * ; f * , x, y) -ℓ(g; f * , x, y) =β1{g * (x) = 1}1{f * (x) ̸ = y} + 1{g * (x) = 1}1{f * (x) = y} -β1{g(x) = 1}1{f * (x) ̸ = y} -1{g(x) = 1}1{f * (x) = y} So we have ∆ 2 ℓ(g * , g, x, y) = β 2 1{g * (x) = 1} + 1{g(x) = 1} -21{g(x) = 1}1{g * (x) = 1} 1{f * (x) ̸ = y} + 1{g * (x) = 1} + 1{g(x) = 1} -21{g(x) = 1}1{g * (x) = 1} 1{f * (x) = y} = β 2 1{f * (x) ̸ = y} + 1{f * (x) = y} 1{g(x) = 1}1{g * (x) ̸ = 1} + 1{g(x) ̸ = 1}1{g * (x) = 1} ≤ β 2 1{g * (x) ̸ = g(x)} (27) Hence we conclude that V ar x,y [∆ℓ(g * , g, x, y)] ≤ E x,y ∆ 2 ℓ(g * , g, x, y) ≤ β 2 E x [1{g * (x) ̸ = g(x}]. On the other hand, Equation 8 implies that R(g; f * , β)-R(g * ; f * , β) ≥ λ 1+2λ E x [1{g * (x) ̸ = g(x}] . 
Thus we can use the following inequality to achieve fast rate of convergence using the Bernstein Inequality: V ar x,y [∆ℓ(g * , g; x, y)] ≤ β 2 (1 + 2λ) λ {R(g; f * , β)) -R(g * ; f * , β)}. We use following version of the Bernstein Inequality, with X 1 , ..., X n i.i.d random variable uniformly bounded by b: P 1 n n i=1 X i -E[X] < b n log(1/δ) + 2V ar(X) log(1/δ) n ≥ 1 -δ Using union bounds, the Bernstein Inequality implies that with probability for all g ∈ G simultaneously: P Sn 1 n n i=1 ∆ℓ(g * , g; x i , y i ) -E x,y [∆ℓ(g * , g; x, y)] ≤ β n log G /δ + 2V ar(∆(g * , g; x, y)) log G /δ n ≥ 1 -δ Thus by applying inequality 29 with g we have: P Sn 1 n n i=1 ∆ℓ(g * , g; x i , y i ) -E x,y [∆ℓ(g * , g; x, y)] ≤ β n log G /δ + 2V ar(∆(g * , g; x, y)) log G /δ n ≥ 1 -δ By inequality 23, we have 1 n n i=1 ∆ℓ(g * , g; x i , y i ) = R Sn (g * ; f * , β) -R Sn ( g; f * , β) ≥ -ε/2 holds with probability 1 -δ. Note R( g; f * , β)) -R(g * ; f * , β) = -E x,y [∆ℓ(g * , g; x, y)]. So we have P Sn {R( g; f * , β) -R(g * ; f * , β)} ≤ β n log G /δ + 2β 2 (1 + 2λ){R( g; f * , β)) -R(g * ; f * , β)} log G /δ λn + ε/2 ≥1 -2δ The choice of n ≥ 16β 2 log( |G| δ ) λε ensures that with probability at least 1 -2δ, R( g; f * , β) - R(g * ; f * , β) ≤ ε. CASE II: λ = 1 2 , f (•) ∈ F that satisfies E x [ f (x) ̸ = f * (x)|x ∈ Ω I ] ≤ ε 8βαβ with probability at least 1 -δ. When λ = 1 2 , achieving E x [ f (x) ̸ = f * (x)] ≤ ε 8β is in general impossible. One can only approximate f * (•) on the informative support Ω I since y is generated by coin flipping when x ∈ Ω U . For simplicity of analysis, we introduce a 'pseudo' version of f * (•) denoted as f * . Let F be following hypothesis class: f f (x) = f 1 (x), x ∈ Ω U f 2 (x), x ∈ Ω I , f 1 ∈ F, f 2 ∈ F and we let f * (•) be: f * (x) = f (x), x ∈ Ω U f * (x), x ∈ Ω I Clearly, f * ∈ F. Note such hypothesis class is only introduced in analysis and is potentially impractical. The cardinality of hypothesis class | F| ≤ |F| 2 . 
The construction of f * is to make R(g(x), f * , β) -R(g * (x), f * , β) ≥ 0 for all g ∈ G. To see this: R(g(x), f * , β) -R(g * (x), f * , β) =E x,y β1{g(x) = 1}1{g * (x) ̸ = 1}1{ f * (x) ̸ = y} -1{g(x) = 1}1{g * (x) ̸ = 1}1{ f * (x) = y} +1{g(x) ̸ = 1}1{g * (x) = 1}1{ f * (x) = y} -β1{g(x) ̸ = 1}1{g * (x) = 1}1{ f * (x) ̸ = y} = β 2 - 1 2 E x 1{g(x) = 1}1{g * (x) ̸ = 1} +E x 1{g(x) ̸ = 1}1{g * (x) = 1} ≥ 0 (30) Meanwhile f * (•) also satisfies the property that E x [ f * (x) ̸ = f (x)] ≤ ε 8β with probability at least 1 -δ. Thus by Lemma 2, if n ≥ 24β 2 log( |F | δ ) ε we have with probability at least 1 -δ, 1 n n i=1 1{ f (x i ) ̸ = f * (x i )} ≤ ε 8β . The rest of the proof is the same as the proof in CASE I by replacing f * with f * , leveraging on the fact that R(g * (x), f * , β) = R(g * (x), f * , β). Remark 4. Let us point out that our proposed selective strategy is different from the consistent selective strategy in (El-Yaniv et al., 2010) . Instead of rejecting by looking for consistent output from all hypothesis in the version space, our approach deals with one single reasonably accurate hypothesis (the empirical minimizer). We leverage empirical mistakes made by the predictor in order to learn a selector, aiming to reject (only) the mistakes in a data driven manner. This avoids dealing with the issues found in Theorem 14 in (El-Yaniv et al., 2010) , where the selector fails to select any data points. Remark 5. In (Cortes et al., 2016) , a second hypothesis for the selector is introduced and analyzed, and at the same time, multiple commonly used loss functions are scrutinized and generalization results are provided. The major difference between this work and (Cortes et al., 2016; Geifman & El-Yaniv, 2019) is the motivation pertaining to selective learning. While in (Cortes et al., 2016; Geifman & El-Yaniv, 2019 ) the selective loss is designed from a coverage ratio perspective, i.e. 
one wants to trade coverage ratio for a higher precision (selective loss), our approach is designed to distinguish data that is naturally unlearnable and unpredictable. This difference leads to an alternative theoretical result. While the analysis in (Cortes et al., 2016) focuses on selective risk, our theoretical analysis focuses on the quality of the selector in distinguishing informative from uninformative data, without relying on a rejection cost tuned by a human.
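The empirical selector risk minimization analyzed in this section can be illustrated with a small sketch over a finite selector class. It follows the loss as stated in Equation 7 (a cost of β for selecting a point the predictor gets wrong, and a cost of 1 for rejecting a point the predictor gets right). The function names and the callable-based representation of the class are assumptions made for this illustration; this is not the paper's joint training algorithm.

```python
import numpy as np

def empirical_selector_risk(g_out, f_out, y, beta):
    """Empirical selector loss for selector outputs g_out in {+1, -1}."""
    g_out, f_out, y = np.asarray(g_out), np.asarray(f_out), np.asarray(y)
    select = g_out == 1
    wrong = f_out != y
    # beta * 1{g = 1} 1{f != y} + 1{g != 1} 1{f = y}, averaged over the sample.
    return float(np.mean(beta * (select & wrong) + (~select & ~wrong)))

def erm_selector(candidates, x, f_out, y, beta):
    """Return (index of the empirical risk minimizer, list of all risks).

    `candidates` is a finite list of callables mapping x to {+1, -1}.
    """
    risks = [empirical_selector_risk(g(x), f_out, y, beta) for g in candidates]
    return int(np.argmin(risks)), risks
```

For example, with threshold selectors on a one-dimensional input, the minimizer is the threshold that keeps the region where the predictor is reliable and rejects the region where its errors would cost β each.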

A.4 MISSING PROOF FOR COROLLARY 6

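It can be easily verified that $\beta = 3$ is in the interval $\big[\frac{3 - 2\lambda}{1 + 2\lambda} + \lambda,\ \min\big(\frac{3 + 2\lambda}{1 - 2\lambda} - \frac{\lambda}{1 - 4\lambda^2},\ 10\big)\big]$. By the choice of $\beta$, from (9) we have $R(\hat{g}, f^*, \beta) - R(g^*, f^*, \beta) \geq \frac{\lambda}{4(1 + 2\lambda)}\,\mathbb{E}_x[\mathbb{1}\{\hat{g}(x) \neq g^*(x)\}]$. Together with the conclusion of Theorem 1 that $R(\hat{g}; f^*, \beta) - R(g^*; f^*, \beta) \leq \varepsilon$, we can conclude that Equation 6 holds.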
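The claim that β = 3 is admissible can be checked numerically. The sketch below assumes the interval endpoints are (3 − 2λ)/(1 + 2λ) + λ and min((3 + 2λ)/(1 − 2λ) − λ/(1 − 4λ²), 10), and that the two coefficients of the risk gap in Equation 8 are β(1/4 + λ/2) − 3/4 + λ/2 and 3/4 + λ/2 − β(1/4 − λ/2). The function names are hypothetical.

```python
import numpy as np

def beta_interval(lam, cap=10.0):
    """Admissible range for beta at noise level lam in (0, 1/2]."""
    lower = (3 - 2 * lam) / (1 + 2 * lam) + lam
    if lam >= 0.5:
        # (3 + 2*lam) / (1 - 2*lam) diverges as lam -> 1/2, so only the cap binds.
        upper = cap
    else:
        upper = min((3 + 2 * lam) / (1 - 2 * lam) - lam / (1 - 4 * lam**2), cap)
    return lower, upper

def gap_coefficients(beta, lam):
    """Per-point costs of a false selection and a false rejection."""
    false_select = beta * (0.25 + lam / 2) - 0.75 + lam / 2
    false_reject = 0.75 + lam / 2 - beta * (0.25 - lam / 2)
    return false_select, false_reject

# Check beta = 3 on a grid of noise levels.
for lam in np.linspace(0.01, 0.5, 50):
    lower, upper = beta_interval(lam)
    c1, c2 = gap_coefficients(3.0, lam)
    margin = lam / (4 * (1 + 2 * lam))
    assert lower <= 3.0 <= upper
    assert c1 >= margin and c2 >= margin
```

At β = 3 both coefficients simplify to 2λ, which exceeds the margin λ/(4(1 + 2λ)) for every λ > 0, so this single value works without knowing λ.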

A.5 MISSING PROOF FOR CONTROLLING CONDITIONAL

RISK E x [ f (x) ̸ = f * (x)|x ∈ Ω I ] Lemma 1 (Sauer-Shelah Lemma(See (Blum et al., 2016; Mohri et al., 2018; Sauer, 1972) )). Let d vc (G) be the VC-dimension of hypothesis class G, for all n ∈ N, B G (n) ≤ dvc i=0 n i ≤ en d vc (G) dvc(G) Definition 4 (Growth Function(Vapnik & Chervonenkis, 2015) ). Let G be the hypothesis class of function f and F x1,...,xn = { f (x 1 ), ..., f (x n ) : f ∈ F } ⊆ {+1, -1} n . The growth function is defined to be the maximum number of ways in which n points can be classified by the function class: B F (n) = sup x1,...,xn |F x1,...,xn |. Theorem 3. For every ε > 0, there is a δ > 0 such that under Assumption 1, given a set of samples S n = {(x 1 , y 1 ), ..., (x n , y n )} drawn i.i.d. from the Noisy Generative Process and f = arg min f ∈F n i=1 1{f(x i ) ̸ = y i }, if n is chosen such that n ≥ 32 d V C (F) log( 1 ε ) + log( 1 δ ) ϵ 2 α 2 , then with probability at least 1 -2δ: E x,y [ f (x) ̸ = y] ≤ 1 2 (1 -α) + 2ϵα. Furthermore, P x [f * (x) ̸ = f (x)|x ∈ Ω I ] ≤ 2ϵ Proof. We first bound the probability of the event that E x,y [ f (x) ̸ = y] ≤ 1 2 (1 -α) + 2ϵα . By Lemma 3 and Hoeffiding inequality we have: P Sn [sup f ∈F | 1 n n i=1 1{f(x i ) ̸ = y i } -E x,y [1{f (x) ̸ = y}]| ≥ t] ≤ 4B F (2n)e -nt 2 32 (31) By setting t = αϵ and n ≥ 32(4 log(B F (2n))+log( 1 δ )) α 2 ϵ 2 we have with probability of at least 1 -δ: 1 n n i=1 1{ f (x i ) ̸ = y i } -E x,y [1{ f (x) ̸ = y}] ≤ αϵ 2 The term B F (2n) could be bounded by Sauer's lemma. Next we apply the fact that f = arg min f ∈F 1 n n i=1 1{ f (x i ) ̸ = y i }. We have: E x,y [1{ f (x) ̸ = y}] ≤ αϵ 2 + 1 n n i=1 1{ f (x i ) ̸ = y i } ≤ αϵ 2 + 1 n n i=1 1{f * (x i ) ̸ = y i } Since 1 n n i=1 1{f * (x i ) ̸ = y i } ≤ 1 2 (1 -α) + ϵα with failure probability at most δ (Lemma 5), we have with probability at least 1 -2δ: E x,y [1{ f (x) ̸ = y}] ≤ 1 2 (1 -α) + 2ϵα. Next we prove the claim that: P x∼D I [f * (x) ̸ = f (x)] ≤ 2ϵ. 
Since E x,y [1{ f (x) ̸ = y}] ≤ 1 2 (1 -α) + 2ϵα: E x,y [1{ f (x) ̸ = y}] =E (x,y)∼Dα [1{ f (x) ̸ = y}] = E (x,y)∼Dα [1{ f (x) ̸ = y}|x ∈ Ω U ] 1 2 P (x,y)∼Dα [x ∈ Ω U ] 1-α + E (x,y)∼Dα [1{ f (x) ̸ = y}|x ∈ Ω I ] P (x,y)∼Dα [1{ f (x)̸ =f * (x)}|x∈Ω I ] P (x,y)∼Dα [x ∈ Ω I ] α = 1 2 (1 -α) + αP x∼Dα [ f (x) ̸ = f * (x)|x ∈ Ω I ] ≤ 1 2 (1 -α) + 2ϵα =⇒P x∼Dα [1{ f (x) ̸ = f * (x)}|x ∈ Ω I ] ≤ 2ϵ A.6 EXTENTION TO VC-CLASS In order to leverage the margin condition of distribution of z to obtain a minimax-optimal generalization rate, we leverage on the Local Rademacher Average tool. Our analysis tool largely follows from (Bousquet et al., 2003; Bartlett et al., 2005) . Throughout this section, ≲ and ≳ represent as shorthand for the ≤ and ≥ that ignores universal constants. Definition 5 (L 2 -Covering Number). (Wellner et al., 2013) Let Local Rademacher Average (Bartlett et al., 2005; Bousquet et al., 2003) ). Let σ 1:n be Rademacher sequence of length n, the Empirical Local Rademacher Complexity at distributional and empirical radius r ≥ 0 for the class F are defined as x 1:n be set of points. A set of U ⊆ R n is an ε-cover w.r.t L 2 -norm of F on x 1:n , if ∀f ∈ F, ∃u ∈ U , s.t. 1 n n i=1 |[u] i -f (x i )| 2 ≤ ε, where [u] i is the i-th coordinate of u. We define the covering number N 2 (ε, F, x 1:n ) : N 2 (ε, F, x 1:n ) := min{|U |: U is an ε-cover of F on x 1:n } Let N 2 (ε, F, n) be the maximum cardinality of N 2 (ε, F, x 1:n ) over all x 1:n . Formally N 2 (ε, F, n) is defined as: N 2 (ε, F, n) := sup x1:n∈X n min{|U |: U is an ε-cover of F on x 1:n } Definition 6 ( R n (F, P f 2 ≤ r) ≡ E σ1:n [ sup f ∈F ,Exf (x) 2 ≤r 1 n n i=1 σ i f (x i )] R n (F, P n f 2 ≤ r) ≡ E σ1:n [ sup f ∈F , 1 n n i=1 f (xi) 2 ≤r 1 n n i=1 σ i f (x i )] and their distributional Average as: R(F, P f 2 ≤ r) ≡ E Sn [R n (F, P f 2 ≤ r)] and R(F, P n f 2 ≤ r) ≡ E Sn [R n (F, P n f 2 ≤ r)]. Definition 7 (Star Hull). 
(Bartlett et al., 2005; Bousquet et al., 2003) The star hull of set of functions F is defined as et al., 2005; Massart & Nédélec, 2006; Bousquet et al., 2003) A function ψ : R → R is sub-root if * F ≡ {αf : f ∈ F, α ∈ [0, 1]} Definition 8 (Sub-Root Function). (Bartlett • ψ is non-decreasing • ψ is non-negative • ψ(r)/ √ r is non-increasing And we say r * is a fixed point of ψ if ψ(r * ) = r * . Theorem 4. [Risk Bound VC-Class] Let S n = {(x i , y i )} n i=1 be i.i.d sample from Data Generative Process described in Definition 1 under Assumption 1, with f * (•) ∈ F and g * (•) ∈ G with VC- dimension d V C (F) < ∞ d V C (G) < ∞. Given λ, let β ∈ 3-2λ 1+2λ + λ, min( 3+2λ 1-2λ -λ 1-4λ 2 , 10) . For any f (•) ∈ F, let g = arg min g∈G R Sn (g; f , β). Then for any ε > 0, there is a δ > 0 such that the following holds: For n ≳ max{ β 4 d V C (G) log( 1 ε ) + β 4 log( 1 δ ) λε , βd V C (F) log( d V C (F ) ε ) + β log( 1 δ ) ε }. and for f that satisfies one of the following condition: • For any f (•) ∈ F that satisfies E x [ f (x) ̸ = f * (x)] ≲ ε β with probability at least 1 -δ, • If λ = 1 2 , for any f (•) ∈ F that satisfies E x [ f (x) ̸ = f * (x)|x ∈ Ω I ] ≲ ε βα with probability at least 1 -δ, The following holds with probability at least 1 -3δ: R( g; f * , β) -R(g * ; f * , β) ≲ ε Proof. The major difference from the proof for Theorem 1 is the fact that G and F are not finite hypothesis class. To achieve fast generalization rate, we leverage the Local Rademacher Complexity Tool from (Bartlett et al., 2005) . CASE I : f (•) ∈ F and E x [ f (x) ̸ = f * (x)] ≲ ε β with probability at least 1 -δ. We use a proof similar to the one in Theorem 1 up to Equation 21. Since F is a VC-class, we will invoke Lemma 8 instead of Lemma 2. Since n ≳ β(d V C (F ) log( 1 ε )+log( 1 δ )) ε , it can be achieved with probability at least 1 -δ that R Sn ( g; f , β) ≥ R Sn ( g; f * , β) -ε/4 and R Sn (g * ; f , β) ≤ R Sn (g * ; f * , β) + ε/4. 
Thus following hold with probability at least 1 -δ: R Sn ( g; f * , β) ≤ R Sn (g * ; f * , β) + ε 2 . Next we turn to bound the risk gap using R( g; f * , β) -R(g * ; f * , β) using concentration property of inequality 33. Similar to the proof in Theorem 1, we define ℓ(g; f, x, y) = β1{g(x) = 1}1{f (x) ̸ = y} + 1{g(x) ̸ = 1}1{f (x) = y}. Based on ℓ, we define following hypothesis class: ∆ • ℓ • G ≡ ∆ℓ(g; g * , x, y) = ℓ(g; f * , x, y) -ℓ(g * ; f * , x, y) : g ∈ G . To invoke Lemma 6, we need to establish some hypothesis class H that satisfies condition V ar [h] ≤ BE[h]. Next we show ∆ • ℓ • G satisfies the condition that V ar[h] ≤ BE[h] and thus we can apply Lemma 6 with H = ∆ • ℓ • G. To begin with, one can apply Equation 27 to show that, 1{g * (x) ̸ = g(x)} ≤ ∆ 2 ℓ(g; g * , x, y) ≤ β 2 1{g * (x) ̸ = g(x)} Above implies that V ar x,y [∆ℓ(g * , g, x, y)] ≤ E x,y ∆ 2 ℓ(g * , g, x, y) ≤ β 2 E x [1{g * (x) ̸ = g(x}]. On the other hand, Equation 8 implies that R(g; f * , β) -R(g * ; f * , β) ≥ λ 1 + 2λ E x [1{g * (x) ̸ = g(x}]. Thus we have following holds: V ar x,y [∆ℓ(g; g * , x, y)] ≤ β 2 (1 + 2λ) λ E x,y {∆ℓ(g; g * , x, y)} Thus we can apply Lemma6 with H = ∆ • ℓ • G, T (h) = E[h 2 ] and B = β 2 (1+2λ) λ . Now we find a subroot function ψ(r) that ψ(r) ≥ β 2 (1 + 2λ) λ ER n {∆ℓ(g; g * ) ∈ H : E[h 2 ] ≤ r}. To find ψ(r), we show some analysis on the Local Rademacher Average ER n {∆ℓ(g; g * ) ∈ H : E[h 2 ] ≤ r}. ER n (∆ • ℓ • G, r) =E Snσ1:n [ sup g∈G,Ex,y[∆ 2 ℓ(g;g * )]≤r 1 n n i=1 σ i ∆ℓ(g; g * )] ≤ E Snσ1:n [ sup g∈G,Ex[1(g̸ =g * )]≤r 1 n n i=1 σ i ∆ℓ(g; g * )] 1(g̸ =g * )≤∆ 2 ℓ(g;g * ) ≤ βE Snσ1:n [ sup g∈G,Ex[1(g̸ =g * )]≤r 1 n n i=1 σ i 1g(x i ) ̸ = g * (x i ))] |∆ℓ(g1;g * )-∆ℓ(g2;g * )|≤β|1(g1̸ =g * )-1(g2̸ =g * )| Talagrand Contraction Inequality (Ledoux & Talagrand, 1991) In the last inequality, we use the fact that |∆ℓ(g 1 ; g * ) -∆ℓ(g 2 ; g * )| ≤ |ℓ(g 1 ) -ℓ(g 2 )| ≤ β|1(g 1 ̸ = g 2 )| = β|1(g 1 ̸ = g * ) -1(g 2 ̸ = g * )|. 
Define 1 • G ≡ 1{g(x) ̸ = g * (x), g ∈ G}. The indicator function 1{g(x) ̸ = g * (x) is a Boolean function taking g as input, thus d V C (1 • G) ≤ d V C (G). Thus we have β 2 (1 + 2λ) λ ER n {∆ℓ(g; g * ) ∈ H : E[h 2 ] ≤ r} ≤ β 3 (1 + 2λ) λ ER n {1{g(x) ̸ = g * (x)} ∈ 1 • G : E[1{g(x) ̸ = g * (x)}] ≤ r} Above implies that we can pick ψ(r) to be ψ(r) = β 3 (1 + 2λ) λ ER n { * 1 • G : E[1{g(x) ̸ = g * (x)}] ≤ r} + +11β 2 log n n By Equation 49, we have: E x,y [∆ℓ( g; g * , x, y)] ≤ 2 n n i=1 ∆ℓ( g; g * ; x i , y i ) + 1500λ β 2 r * + log(1/δ)(11β + 52 λ ) n By inequality 33, we have 1 n n i=1 ∆ℓ( g; g * ; x i , y i ) = R Sn ( g; f * , β) -R Sn (g * ; f * , β) ≤ ε/2 holds with probability 1 -δ. By Lemma 7 we have r * ≲ β 6 λ 2 d V C (G) log n n . Plugging in Equation 49we have that n ≳ β 4 (d V C (G) log( 1 ε )+log(1/δ)) λε suffices to achieve E x,y [∆ℓ( g; g * , x, y)] ≲ ε. CASE II: λ = 1 2 , f (•) ∈ F that satisfies E x [ f (x) ̸ = f * (x)|x ∈ Ω I ] ≤ ε 8αβ with probability at least 1 -δ. The proof is similar to the one in Theorem 1 except for that we need to bound the VC-dimension of pseudo hypothesis class F. Since f can be viewed as Boolean function given f 1 (x), f 2 (x) as input, with two hypothesis f 1 , f 2 ∈ F, by Lemma 3.2.3 in (Blumer et al., 1989) we know d V C ( F) ≤ 2d V C (F) log(d V C (F)). The rest of the proof follows from the one in Theorem 1. Next we present our extension of information theoretic lower bound to VC-class. The lower bounds suggest that the risk bound in Theorem 4 is tight up to some logarithmic factor. Theorem 5. There exists noisy generative process defined in Definition 1 with Ω being G-realizable, for any ε ≤ λ, to achieve E Sn [R(A(S n ), f * , β) -R(g * , f * , β)] ≤ ε 8(1 + 2λ) with β ∈ 3-2λ 1+2λ + λ, 3+2λ 1-2λ -λ 1-4λ 2 , any algorithm A will take at least d V C (G) log(d V C (G) )λε many samples. Proof. 
The proof follows that of Theorem 2, except that we need an upper bound on the VC-dimension of our hypothesis construction $\mathcal{G}$. Since $\mathcal{G}$ consists of compositions of interval hypotheses and each individual interval has VC-dimension at most 3, by Lemma 3.2.3 in (Blumer et al., 1989) we know $d_{VC}(\mathcal{G}) \leq 6d\log(d)$, which implies a $\frac{d_{VC}(\mathcal{G})}{\log(d_{VC}(\mathcal{G}))\,\lambda\varepsilon}$ lower bound.

A.7 TECHNICAL LEMMAS

Lemma 2. Let S n = {(x i , y i )} be i.i.d sample from Data Generative Process described in Definition 1. For every ε > 0, there exist a δ > 0 such that if n ≥ 3 log( |F | δ ) ε , following inequality holds simultaneously for all f ∈ F with |F| < ∞ with probability at least 1 -δ 1 n n i=1 1{f(x i ) ̸ = f * (x i )} < E x 1{f(x) ̸ = f * (x)} + ε (40) Proof. By taking union bound one can ensure that P Sn sup f ∈F n i=1 1{f(x i ) ̸ = f * (x i )} -nE x 1{f(x) ̸ = f * (x)} ≥ nE x 1{f(x) ̸ = f * (x)} + nε ≤P Sn f ∈F n i=1 1{f(x i ) ̸ = f * (x i )} -nE x 1{f(x) ̸ = f * (x)} ≥ nE x 1{f(x) ̸ = f * (x)} + nε ≤ f ∈F P Sn n i=1 1{f(x i ) ̸ = f * (x i )} -nE x 1{f(x) ̸ = f * (x)} ≥ nE x,y 1{f(x) ̸ = f * (x)} + nε (41) We next apply following version of Chernoff inequality with a ≥ 1: Let X = n i=1 X i where X i ∈ {0, 1}. Then P[X ≥ (1 + a)EX] ≤ exp (- a 2 2 + a EX) ≤ exp (- a 3 EX) P[X ≤ (1 -a)EX] ≤ exp (- a 2 2 EX) ≤ exp (- a 3 EX) So we have P[|X -EX| ≥ aEX] ≤ exp (- a 3 EX) For any fixed f ∈ F, let a = ε/E x 1[f(x) ̸ = f * (x)] , by Chernoff Inequality we have P Sn n i=1 1{f(x i ) ̸ = f * (x i )} -nE x,y 1{f(x) ̸ = f * (x)} ≥ nE x 1{f(x) ̸ = f * (x)} + nε ≤ exp (- nE x 1{f(x) ̸ = f * (x)}a 3 ) = exp (- nε 3 ) Using ( 41) and setting δ = |F| exp(-nϵ/3) finishes the proof. Lemma 3. Suppose S n = {(x 1 , y 1 ), ..., (x n , y n )} are i.i.d sampled , L(f, x, y) ∈ [0, b] and L Sn (f ) = 1 n n i=1 L(f, x i , y i ). Given parameter t such that nt 2 ≥ 2b 2 then we have: P Sn∼D [sup f ∈F |L Sn (f ) -L(f )| ≥ t] ≤ 4B F (2n)e -nt 2 4b 2 Proof: For sample sets S n and S ′ n , if we have |L Sn (f ) -L(f )| ≥ t and |L S ′ n (f ) -L(f )| ≤ t 2 then we get that |L Sn -L S ′ n | ≥ t 2 . 
Thus we have 1{sup f ∈F |L Sn (f ) -L(f )| ≥ t} • 1{sup f ∈F |L S ′ n (f ) -L(f )| ≤ t 2 } ≤ 1{sup f ∈F |L Sn (f ) -L S ′ n (f )| ≥ t 2 } (43) Taking expectation w.r.t S n ∼ D and S ′ n ∼ D we have P Sn∼D sup f ∈F |L Sn (f ) -L(f )| ≥ t • P S ′ n ∼D sup f ∈F |L S ′ n (f ) -L(f )| ≤ t 2 ≤ P Sn,S ′ n ∼D sup f ∈F |L Sn (f ) -L S ′ n (f )| ≥ t 2 Next we lower bound P sup f ∈F |L Sn (f ) -L(f )| ≥ t 2 . Since L(f, x, y) ∈ [0, b] and so V ar(L(f, x, y)) ≤ b 2 /4, using nt 2 ≥ 2b 2 we have that: P Sn∼D sup f ∈F |L Sn (f ) -L(f )| ≥ t 2 ≤ 4V ar(L Sn ) nt 2 ≤ 1 2 So we have P S ′ n sup f ∈F |L S ′ n (f ) -L(f )| ≤ t 2 ≥ 1 2 . Combining this inequality with (44) we have P Sn∼D [sup f ∈F |L Sn (f ) -L(f )| ≥ t] ≤2P Sn,S ′ n ∼D [sup f ∈F |L Sn (f ) -L S ′ n (f )| ≥ t 2 ] =2P Sn,S ′ n ∼D [ sup f (x)∈F S 2n |L Sn (f ) -L S ′ n (f )| ≥ t 2 ] ≤2P S2n P Sn=S2n-S ′ n [ sup f (x)∈F S 2n |L Sn (f ) -L S ′ n (f )| ≥ t 2 |S 2n ] ≤2P S2n f (x)∈F S 2n P Sn=S2n-S ′ n [|L Sn (f ) -L S ′ n (f )| ≥ t 2 |S 2n ] ≤2P S2n 2|F S2n |e -nt 2 4b 2 |S 2n ] ≤2P S2n sup S2n |F S2n |e -nt 2 4b 2 |S 2n ] ≤2 sup S2n |F S2n |P S2n e -nt 2 4b 2 ] ≤2B F (2n)e -nt 2 4b 2 Lemma 4 (Hoeffding's Inequality). Let Z 1 , ..., Z n be independent bounded random variables with Z i ∈ [a, b] for all i, where -∞ < a < b < ∞. Then for all t > 0: P( 1 n | n i=1 Z i -E[Z i ]| ≥ t) ≤ 2e -2nt 2 (b-a) 2 Lemma 5. Consider a set of samples S = {(x 1 , y 1 ), ..., (x n , y n )} drawn i.i.d. from the Noisy Generative Process and f * in the hypothesis class F satisfying f (x) ∈ {-1, +1}. If: n ≥ 3 log( 1 δ ) ϵ 2 α 2 Then we have with probability at least 1 -δ : 1 n n i=1 1{f * (x i ) ̸ = y i } ≤ 1 2 (1 -α) + αε Proof: Since 1{f(x) ̸ = y} is bounded in the interval [0, 1] and given f * ∈ F, 1{f(x i ) ̸ = y i }, i ∈ [n] form a set of n independent random variables. By setting b -a = 1, t = αϵ in Equation 46, the choice of n ensures that -2nt 2 (b-a) 2 ≤ 6 log(δ). 
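Thus

$$P_{S_n \sim D_\alpha}\Big[\Big|\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{f^*(x_i) \neq y_i\} - \mathbb{E}_{x,y}[\mathbf{1}\{f^*(x) \neq y\}]\Big| \ge \epsilon\alpha\Big] \le \delta,$$

where we have

$$\mathbb{E}_{(x,y) \sim D_\alpha}[\mathbf{1}\{f^*(x) \neq y\}] = \underbrace{\mathbb{E}_{(x,y) \sim D_\alpha}[\mathbf{1}\{f^*(x) \neq y\} \mid x \in \Omega_U]\, P[x \in \Omega_U]}_{\frac{1}{2}P[x \in \Omega_U]:\ \text{since } y \text{ is labeled by coin flipping in } \Omega_U} + \underbrace{\mathbb{E}_{(x,y) \sim D_\alpha}[\mathbf{1}\{f^*(x) \neq y\} \mid x \in \Omega_I]\, P[x \in \Omega_I]}_{0:\ \text{since } y \text{ is labeled by } f^* \text{ with 0 Bayes risk in } \Omega_I}$$

Since $r^* = \psi(r^*)$, $r^*$ satisfies

$$r^* \le 100 B\, \mathbb{E}R_n\{*\mathcal{F}, P_n f^2 \le 2r^*\} + \frac{b + 11 b^2 \log n}{n}.$$

Next we leverage Dudley's chaining bound (Dudley, 2014) to upper bound $\mathbb{E}R_n\{*\mathcal{F}, P_n f^2 \le 2r^*\}$ using an integral of the covering number. We first bound the covering number of the star hull of $\mathcal{F}$. It follows from (Bartlett et al., 2005) Corollary 3.7 that

$$\log N_2(\varepsilon, *\mathcal{F}, x_{1:n}) \le \log\Big[N_2\Big(\frac{\varepsilon}{2}, \mathcal{F}, x_{1:n}\Big)\Big(\Big\lceil\frac{2}{\varepsilon}\Big\rceil + 1\Big)\Big]$$

The covering number $\log N_2(\varepsilon, \mathcal{F}, n)$ can be bounded in terms of the VC-dimension of $\mathcal{F}$ using Haussler's bound on the covering number (Haussler, 1995; Wellner et al., 2013):

$$\log N_2\Big(\frac{\varepsilon}{2}, \mathcal{F}, n\Big) \le c_1 d_{VC} \log\Big(\frac{1}{\varepsilon}\Big)$$

where $c_1$ is some universal constant. Now we are ready to apply the chaining bound. It follows from Theorem B.7 (Bartlett et al., 2005) that

$$\begin{aligned}
\mathbb{E}[R_n(*\mathcal{F}, P_n f^2 \le 2r^*)]
&\le \frac{c_2}{\sqrt{n}}\, \mathbb{E}\int_0^{\sqrt{2r^*}} \sqrt{\log N_2(\varepsilon, *\mathcal{F}, x_{1:n})}\, d\varepsilon \\
&\le \frac{c_2}{\sqrt{n}}\, \mathbb{E}\int_0^{\sqrt{2r^*}} \sqrt{\log\Big[N_2\Big(\frac{\varepsilon}{2}, \mathcal{F}, x_{1:n}\Big)\Big(\Big\lceil\frac{2}{\varepsilon}\Big\rceil + 1\Big)\Big]}\, d\varepsilon \\
&\le c_3 \sqrt{\frac{d_{VC}(\mathcal{F})\, r^* \log(1/r^*)}{n}} \\
&\le c_3 \sqrt{\frac{d_{VC}^2(\mathcal{F})}{n^2} + \frac{d_{VC}(\mathcal{F})\, r^* \log(n / e\, d_{VC}(\mathcal{F}))}{n}}
\end{aligned}$$

where $c_2$ and $c_3$ are some universal constants. Together with Equation 52 one can solve for

$$r^* \lesssim \frac{B^2 d_{VC}(\mathcal{F}) \log\big(\frac{n}{d_{VC}(\mathcal{F})}\big)}{n}$$

Lemma 8. Let $S_n = \{(x_i, y_i)\}$ be an i.i.d. sample from the Data Generative Process described in Definition 1. For every $\varepsilon > 0$, there exists a $\delta > 0$ such that if $n \gtrsim \frac{d_{VC}(\mathcal{F}) \log(\frac{1}{\varepsilon}) + \log(\frac{1}{\delta})}{\varepsilon}$, the following inequality holds simultaneously for all $f \in \mathcal{F}$ with $d_{VC}(\mathcal{F}) < \infty$, with probability at least $1 - \delta$:

$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{f(x_i) \neq f^*(x_i)\} \lesssim \mathbb{E}_x \mathbf{1}\{f(x) \neq f^*(x)\} + \varepsilon \qquad (54)$$

Proof. The proof invokes Lemma 6, in particular Equation 50. Let $\mathbf{1} \circ \mathcal{F} : \{\mathbf{1}\{f(x) \neq f^*(x)\}, f \in \mathcal{F}\}$ be the hypothesis class $\mathcal{H}$ in Lemma 6.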
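Since $f^*$ is a deterministic boolean function, it does not increase the number of points that can be shattered by $\mathcal{F}$. We have $d_{VC}(\mathbf{1} \circ \mathcal{F}) \le d_{VC}(\mathcal{F})$. In particular, we choose the functional $T(\cdot) = \mathbb{E}[\cdot]$, and it is easy to verify that $\mathrm{Var}(\mathbf{1}\{f(x) \neq f^*(x)\}) \le \mathbb{E}_x[\mathbf{1}\{f(x) \neq f^*(x)\}] = \mathbb{E}_x[\mathbf{1}^2\{f(x) \neq f^*(x)\}]$. Let $\psi(r) = 100\, \mathbb{E}R_n\{*\mathcal{F}, \mathbb{E}f \le r\} + \frac{11\log n}{n}$. We have

$$\mathbb{E}R_n\{\mathcal{F}, \mathbb{E}f^2 \le r\} \le \mathbb{E}R_n\{*\mathcal{F}, \mathbb{E}f^2 \le r\} \le 100\, \mathbb{E}R_n\{*\mathcal{F}, \mathbb{E}f^2 \le r\} + \frac{11\log n}{n} = \psi(r)$$

Since the local Rademacher average of the star hull is a sub-root function, we know that for all $r \ge r^*$, $\psi(r) \ge \psi(r^*) = r^*$. By Equation 50 in Lemma 6 we have

$$\frac{1}{n}\sum_{i=1}^{n} \mathbf{1}\{f(x_i) \neq f^*(x_i)\} \le 2\mathbb{E}_x \mathbf{1}\{f(x) \neq f^*(x)\} + \underbrace{15 r^* + \frac{\log(1/\delta) + 5200}{n}}_{\varepsilon}$$

Next we bound $r^*$. A direct application of Lemma 7 shows that

$$r^* \lesssim \frac{d_{VC}(\mathbf{1} \circ \mathcal{F}) \log\big(\frac{n}{d_{VC}(\mathbf{1} \circ \mathcal{F})}\big)}{n} \lesssim \frac{d_{VC}(\mathcal{F}) \log\big(\frac{n}{d_{VC}(\mathcal{F})}\big)}{n}.$$

The rest of the proof follows from plugging $r^*$ into Equation 50 and removing absolute constants.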
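To give a feel for the sample-size conditions in Lemma 2 and Lemma 5, the short Python sketch below evaluates the two thresholds $n \ge 3\log(|\mathcal{F}|/\delta)/\varepsilon$ and $n \ge 3\log(1/\delta)/(\epsilon^2\alpha^2)$ numerically. The function names and the example values of $\varepsilon$, $\delta$, $\alpha$ and $|\mathcal{F}|$ are illustrative assumptions, not quantities used in the paper.

```python
import math


def n_lemma2(eps, delta, class_size):
    """Smallest integer n with n >= 3 * log(|F| / delta) / eps (finite class, Lemma 2)."""
    return math.ceil(3.0 * math.log(class_size / delta) / eps)


def n_lemma5(eps, delta, alpha):
    """Smallest integer n with n >= 3 * log(1 / delta) / (eps^2 * alpha^2) (Lemma 5)."""
    return math.ceil(3.0 * math.log(1.0 / delta) / (eps ** 2 * alpha ** 2))


# Hypothetical example values, chosen only for illustration.
n2 = n_lemma2(eps=0.1, delta=0.05, class_size=1000)
n5 = n_lemma5(eps=0.1, delta=0.05, alpha=0.5)
print(n2, n5)
```

Note the different scaling: the Lemma 2 threshold grows like $1/\varepsilon$, while the Lemma 5 threshold grows like $1/(\epsilon\alpha)^2$, so a small informative ratio $\alpha$ quickly raises the number of samples needed.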

B MORE EXPERIMENT RESULTS AND DETAILS

B.1 EXPERIMENT SETTING AND IMPLEMENTATION DETAILS

Extension to multi-class. Our method extends naturally to the multi-class setting. For K-class classification, our selector loss remains the same, while the predictor becomes $f(x) = f(x)_{1:K} : X \to \Delta_K$, where $\Delta_K$ is the K-simplex. We use the multi-class cross-entropy loss to train the classifier. The pseudo-informative label becomes $z_i = \mathbf{1}\{\arg\max_{k \in [K]} f(x_i)_k = y_i\}$.

Semi-Synthetic Experiment Setting. For the experiments in Sections 7.1 and 7.2, we use the same TinyCNN backbone for all baselines. It is a lightweight CNN with 2 convolutional layers and 3 fully connected layers. We adopt the same training scheme for all baseline algorithms. We use the Adam optimizer with learning rate 1e-3 and weight decay 1e-4. We train for 220 epochs with batch size 196 for all baselines, and the learning rate is reduced by 50% at the 45th and 90th epochs. For the experiments in Section 7.2, we use the default hyper-parameters for every method as recommended in the respective original paper (i.e., internal selective learning-specific defaults, as reported in (Geifman & El-Yaniv, 2019; Liu et al., 2019), and β = 2 and MWU step-size η = 2 for our algorithm). This simulates a practical scenario where hyper-parameter optimization is impossible, since the ground truth regarding which datapoints are actually informative is never revealed. For SelectiveNet, the default hyper-parameters are the weight for the coverage rate penalty λ_sl = 32 and the weight for the selective net loss a = 0.5. For DeepGambler, the sole hyper-parameter "reward" should lie between 1 and 10, and we choose the default value recommended by the authors, which is 2.2. For the SVHN experiment, we use ResNet18 (He et al., 2016b) as the backbone model for every candidate. We set the batch size to 128 and use Adam as the optimizer with learning rate 1e-3 and weight decay 1e-4. We train each algorithm for 162 epochs and halve the learning rate at epochs 45 and 90.
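The multi-class pseudo-informative label $z_i = \mathbf{1}\{\arg\max_{k} f(x_i)_k = y_i\}$ described above can be sketched in a few lines of plain Python. This is only an illustration under simple assumptions: the predictor output is given as a list of per-class probabilities, classes are indexed from 0, and the function name is hypothetical rather than taken from the authors' implementation.

```python
def pseudo_informative_labels(probs, labels):
    """Return z_i = 1 if the arg-max class of probs[i] equals labels[i], else 0.

    probs  : list of per-sample lists of K class probabilities (a point on the simplex)
    labels : list of integer class ids in {0, ..., K-1}
    """
    z = []
    for p, y in zip(probs, labels):
        # arg max over the K class scores of one sample
        pred = max(range(len(p)), key=lambda k: p[k])
        z.append(1 if pred == y else 0)
    return z


# Toy example with K = 3 classes (values are made up for illustration).
probs = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.4, 0.5, 0.1]]
labels = [0, 1, 1]
z = pseudo_informative_labels(probs, labels)
print(z)
```

The second sample is predicted as class 2 while its label is 1, so it is the only one marked as pseudo-uninformative.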
We assume that the ratio of informative data, α, is unknown to all methods. This is necessary in practice: the ratio and strength of noise are not known in most real-world scenarios. For SelectiveNet and DeepGambler, this ratio is a required input. To run these baselines, we first train the original backbone for 60 epochs and then estimate α using the backbone's training accuracy. Assume that the backbone fits all of the informative data perfectly and also guesses correctly on noisy data with probability $\frac{1}{K}$, where $K$ is the number of classes. Then the frequency estimate of α is $\hat{\alpha} = \frac{K \cdot \text{train\_acc} - 1}{K - 1}$, which is the estimate of α we give to the baselines. Note that this estimate can only be applied when $\lambda = \frac{1}{2}$.

Real-World Experiment Setting. We use the same backbone model for all baseline methods. Specifically, we use a 1-layer LSTM for the volatility data, VGG16 (Simonyan & Zisserman, 2014) for the BUS data, and a 3-layer multi-layer perceptron for the LC data. We adopt the same training scheme for all baseline algorithms. We use the Adam optimizer with learning rate 1e-4 and weight decay 1e-4 for all experiments. The learning rate is reduced by 50% every 10 epochs. We use batch size 128 for the volatility data, batch size 32 for the BUS data and batch size 256 for the LC data, chosen according to the size of each dataset. We further split the training set and perform 30 rounds of random hyper-parameter search for each baseline. For BUS and LC, we set the number of training epochs to 50. For the financial time-series volatility data, since each baseline is quite sensitive to the number of training epochs, we add it as a search dimension during the HPO process.
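The frequency estimate of α follows from solving train_acc = α + (1 − α)/K for α. A minimal Python sketch of that calculation is given below; the function name is hypothetical, and the clipping to [0, 1] is an added safeguard assumed here, not something stated in the text.

```python
def estimate_alpha(train_acc, num_classes):
    """Estimate the informative ratio alpha from the backbone's training accuracy.

    Assumes informative data are fit perfectly and uninformative data are guessed
    correctly with probability 1/K, so train_acc = alpha + (1 - alpha) / K.
    """
    k = float(num_classes)
    alpha = (k * train_acc - 1.0) / (k - 1.0)
    # Clipping is an assumption of this sketch, to keep the estimate a valid ratio.
    return min(1.0, max(0.0, alpha))


# Example: 10 classes and 55% training accuracy give alpha close to 0.5.
alpha_hat = estimate_alpha(train_acc=0.55, num_classes=10)
print(alpha_hat)
```

When the backbone also memorizes part of the noisy labels, the training accuracy exceeds α + (1 − α)/K and the estimate is biased upward, which is consistent with the sensitivity to the estimation epoch discussed in the ablation study.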

B.2 REAL-WORLD DATASET DESCRIPTION

The first dataset is the Oxford realized volatility (Volatility) (Heber et al., 2009) dataset, containing 5-min realized volatility of 31 stock indices from 2000 to 2022, with 155107 records in total. We use past volatility and returns as features, and the task is to predict whether the next-day volatility will be higher than the current one, making it a binary classification problem. We choose data from 2000 to 2020 as our training set and the rest for testing (2020 Jan. to 2021 Oct.). This dataset is used as an example to show our algorithm's possible application in selectively forecasting financial time series.

The second one is the dataset of breast ultrasound images (BUS) (Al-Dhabyani et al., 2020). BUS contains 780 gray-scale breast ultrasound images of women between 25 and 75 years old. These images have an average size of 500 × 500 pixels and can be categorized into 3 classes (487 benign, 210 malignant and 133 healthy). We randomly choose 80% of the data as the training set and the remaining 20% for testing. We use this dataset as an example to show a possible application of our algorithm in automatic diagnosis: the machine generates a diagnosis only on selected cases and delivers unsure cases to a human expert for further investigation.

The third one is the Lending Club dataset (LC)¹. Lending Club is a peer-to-peer lending company that matches borrowers with investors through an online platform. The dataset contains loan data of its customers from 2007 to 2018. We compare different versions of the existing LC dataset and remove all inconsistent and incomplete records. Loan records in this dataset have different statuses; we keep the 3 types of records that make up the major part of the dataset (261442 charged-off cases, 1035418 fully paid cases and 25757 late cases). We use 20% of the dataset as the testing set. This example shows that our algorithm can be used to grant loans under different risk preferences.
Table 3 presents the original accuracy given by the neural network on each of these 3 real-world datasets. On all datasets, the neural network without a selection mechanism cannot give reliable inference. In mortgage granting, risk this high can cause significant losses. In medical diagnosis, where errors are health-critical, a misdiagnosis rate as high as 15% is not acceptable. However, if we apply our selective algorithm, the risk on all datasets is sharply reduced. On the BUS dataset, we can even almost perfectly guarantee the diagnosis result empirically for our most confident cases. This evidence is of practical interest.

We provide several ablation studies that further elucidate the reasons for our advantages. We stick to the case λ = 1/2 so that the baselines are able to estimate the value of α according to the 'no label noise' assumption for informative data. First, we reduce the total data size by subsampling the original dataset. We test each baseline on these partial datasets to evaluate their performance under data shortage. The results are presented in Appendix Table 4 and Table 5. In this study, we follow the setting of our synthetic experiment: we fully shuffle the labels of the uninformative data and keep all the informative data intact. In this setting, our method still outperforms all baselines. Then we conduct experiments where we reveal the ground truth α to each baseline. We test all baselines on both the complete dataset and the partial dataset. We always completely permute the labels of MNIST to impose high label noise on the uninformative data and keep the informative data clean. The results are presented in Appendix Tables 6-9. Even in this "easier" scenario, where every algorithm is handed the true α, our method still wins out. We believe that this is partially due to the MWU mechanism. We will have a section discussing MWU later.
Finally, we provide an ablation study investigating the effect of the choice of epoch at which α is estimated. For all baselines, the estimation of α is a non-trivial step, yet it is crucial to the results. We hypothesize that this is the main cause of the suboptimal performance of these baselines. Here we provide additional results obtained by training for 120 epochs instead of the 60 used in Section 7.2. The estimation error for α at 120 epochs is shown in Table 10. One can see that the magnitude of the error is significant. This results in worse performance of both the selector and the classifier (Tables 11 and 12). Our method's results stay the same as in Table 1, since it does not require α as an input.

As mentioned in the discussion of α's effect, even in the "easier" scenario, where every algorithm is handed the true α, our method still wins out. We believe that this is partially due to the MWU mechanism. It guides the classifier to put more emphasis on the informative data. This is confirmed via an ablation study reported in Table 13: when MWU is turned off, our algorithm's performance deteriorates.

Precision 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00
Recall 0.99 ± 0.00 1.00 ± 0.00 1.00 ± 0.00 1.00 ± 0.00

We inject different levels of noise into each part of the data according to Definition 1 by setting λ(x) = λ. The higher λ is, the larger the gap in noise ratio between the informative and uninformative partitions. Specifically, for informative data, we inject 100 * (1/2 − λ)% uniform

The ideal coverage (α) is indicated by the black dashed line on the plot. A selective learning algorithm achieves the ideal coverage if it covers exactly the informative data. A coverage rate beyond the ideal level forces the algorithm to select uninformative data, which has purely random labels in this case. When the coverage ratio is within a reasonable range of the ground truth α, our method outperforms all baselines.

The advantage of our method is larger when the noise ratio is high, where all other baselines show a reversed coverage vs. selective risk curve. Such a curve indicates that the baseline methods severely overfit to noise when given little data in the high-noise regime.
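The quantities compared in these experiments (coverage, selective risk, and the selector's precision and recall with respect to the informative data) can be computed as in the following Python sketch. It assumes the usual definitions: coverage is the fraction of selected samples, selective risk is the error rate on the selected samples, and precision/recall measure how well the selected set matches the ground-truth informative set. The function name and the toy inputs are illustrative only.

```python
def selective_metrics(correct, selected, informative):
    """Compute coverage, selective risk, and selector precision/recall.

    correct[i]     : 1 if the classifier's prediction on sample i is right
    selected[i]    : 1 if the selector accepts sample i (no abstention)
    informative[i] : 1 if sample i truly lies in the informative region
    """
    n = len(correct)
    n_sel = sum(selected)
    n_inf = sum(informative)
    errors_on_selected = sum(1 for c, s in zip(correct, selected) if s and not c)
    true_pos = sum(1 for s, g in zip(selected, informative) if s and g)
    return {
        "coverage": n_sel / n,
        "selective_risk": errors_on_selected / n_sel if n_sel else 0.0,
        "precision": true_pos / n_sel if n_sel else 0.0,
        "recall": true_pos / n_inf if n_inf else 0.0,
    }


# Toy example: six samples, three selected, three informative.
m = selective_metrics(
    correct=[1, 1, 0, 1, 0, 0],
    selected=[1, 1, 1, 0, 0, 0],
    informative=[1, 1, 0, 1, 0, 0],
)
print(m)
```

In this toy case the selector accepts one uninformative sample, which is exactly the sample that is misclassified, illustrating how coverage beyond the informative set raises the selective risk.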



url:https://www.kaggle.com/datasets/wordsforthewise/lending-club/download?datasetVersionNumber=3



Illustration of the learning strategy with mixture of Gaussian data. We replace the 0-1 loss with hinge loss and train SVM models for f and g. (a) upper panel shows the original dataset and bottom panel shows the region of informative (easy) and uninformative (hard) data. (b) shows that the classifier has high accuracy in the informative region, but low accuracy in the uninformative region. In (c), the selector trained with f successfully recovers informative support thus resulting in low selective risk, and we abstain from making a prediction elsewhere.

Figure 2: Recover g * using Algorithm 1 under different label noise ratio gap λ.

PROOF OF INFORMATION THEORETIC LOWER BOUND

In this section we quantify the hardness of recovering $g^*$ from an information theoretic perspective. Let $(x, y) \sim X \times Y \subset \mathbb{R}^d \times \{+1, -1\}$. We set $X : \{\tau \cdot e \mid e \in \{e_1, ..., e_d\}, |\tau| \le 1\}$, where $e_j$ represents the $j$-th canonical basis vector. Let $w$ be the vector of ones, $w = \mathbf{1}$, and set $f^*(x) = 2 \cdot \mathbf{1}\{w^\top x > 0\} - 1$. Let $\mathcal{G}$ be the hypothesis class that contains $g^*(x)$. In our lower bound construction $|\mathcal{G}| = 2^d$ and g

Figure 3: Illustration of Algorithm 1 when λ = 1/2. a) shows a mix of informative/uninformative data. b) and c) show classifiers trained on weighted samples with different numbers of iterations. By up-weighting the informative datapoints, the algorithm progressively improves the classifier. d) shows the sum of weights of informative data over the total selected, i.e., $\sum_{i: x_i \in \Omega_I} \gamma_i / \sum_{i=1}^{n} \gamma_i$ (see γ in Algorithm 1): the algorithm converges to an all-informative dataset.

Figure 4: We showcase the efficacy of the MWU mechanism on the MNIST dataset. We plot the weight of informative data as a function of the training epoch and the ratio of informative data. The y-axis is the percentage of weight put on the informative data, i.e., $\sum_{i: x_i \in \Omega_I} \gamma_i / \sum_{i=1}^{n} \gamma_i$ in the notation of Algorithm 1.
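The quantity plotted on the y-axis of Figures 3(d) and 4, the share of total weight placed on informative samples, can be computed as in the sketch below. The weights stand in for the γ_i of Algorithm 1; the function name and the toy values are assumptions made for illustration.

```python
def informative_weight_fraction(weights, informative):
    """Return sum of weights over informative samples divided by the total weight.

    weights[i]     : non-negative sample weight (gamma_i in the notation of Algorithm 1)
    informative[i] : True if sample i lies in the informative region
    """
    total = sum(weights)
    if total == 0:
        raise ValueError("total weight must be positive")
    return sum(w for w, g in zip(weights, informative) if g) / total


# Toy example: two of three samples are informative and carry 75% of the weight.
frac = informative_weight_fraction([0.5, 0.25, 0.25], [True, False, True])
print(frac)
```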

Figure 5: Ablation Study on Hyper-parameter o -DeepGambler.

Figure 9: Coverage v.s Selective Risk Curve under Different Hyper-parameter Setting. We refer 'conf' to Confidence, 'slnet' to Selective-Net, 'dg' to DeepGambler. We plot the curve of different methods with varying hyper-parameters, e.g., β for our method, λ and α for Selective-Net, warm up epoch and O for DeepGambler.

Synthetic Data Experiment on MNIST+Fashion and SVHN

Real-World Experiments: Selective Risk v.s Coverage

This paper focuses on a theoretical discussion of learning from data that contains different portions of non-informative samples. Our experiments only use publicly available datasets. Our discussion, analysis, and data should not raise any ethics-related issues. The learning method proposed in this paper, however, could potentially be used in applications with fairness and privacy concerns. In our future efforts in this area, we aim to address and resolve possible negative impacts.

Xingrui Yu, Bo Han, Jiangchao Yao, Gang Niu, Ivor Tsang, and Masashi Sugiyama. How does disagreement help generalization against label corruption? In ICML, pp. 7164-7173, 2019.

Chicheng Zhang and Yinan Li. Improved algorithms for efficient active learning halfspaces with Massart and Tsybakov noise. arXiv preprint arXiv:2102.05312, 2021.

DNN Original Risk on Each Dataset

(MNIST -Partial Dataset -Unknown α) Uninformative/Informative MNIST/Fashion using 25% shuffled data as proxy for noise.

(SVHN -Partial Dataset -Unknown α) Uninformative/Informative SVHN using 25% of shuffled classes as proxy for noise.74.47 ± 4.15 64.68 ± 20.32 34.48 ± 13.46 14.36 ± 0.08

(MNIST -Full Data Setting -Known α) Results on a synthetic dataset consisting of uninformative MNIST data and informative Fashion-MNIST data using the entire MNIST.

(MNIST -Partial Data Setting -Known α) Results on a synthetic dataset consisting of uninformative MNIST data and informative Fashion-MNIST data using 25% of MNIST.

(SVHN -Full Data Setting -Known α) Results on a synthetic dataset consisting of uninformative SVHN and informative SVHN using the entire uninformative data.


This way we have $\mathbb{E}_{(x,y) \sim D_\alpha}[\mathbf{1}\{f^*(x) \neq y\}] = \frac{1}{2}(1 - \alpha)$, which implies that Equation 47 holds with probability at least $1 - \delta$.

Lemma 6 (Theorem 3.3 in (Bartlett et al., 2005)). Let $\mathcal{H}$ be a class of functions with range in $[a, b]$ and assume that there are some functional $T : \mathcal{H} \to \mathbb{R}^+$ and some constant $B$ such that for every $h \in \mathcal{H}$, $\mathrm{Var}(h) \le T(h) \le B P[h]$. Let $\psi$ be a sub-root function and $r^*$ be the fixed point of $\psi$. Assume that $\psi$ satisfies, for any $r \ge r^*$, $\psi(r) \ge B\, \mathbb{E}R_n\{h \in \mathcal{H} : T(h) \le r\}$. Then with $c_1 = 704$ and $c_2 = 26$, for any $K > 1$ and every $t > 1$, with probability at least $1 - e^{-t}$, $\forall h \in \mathcal{H}$, Also with probability at least $1 - e^{-t}$, $\forall h \in \mathcal{H}$, where

Proof. The proof largely follows the proof of Corollary 3.7 in (Bartlett et al., 2005). We include it here for completeness. Since $f$ is uniformly bounded by $b$, for any $r \ge \psi(r)$, Corollary 2.2 in (Bartlett et al., 2005) implies that with probability at least $1 - \frac{1}{n}$, $\{f \in *\mathcal{F} : P f^2 \le r\} \subseteq \{f \in *\mathcal{F} :$

label noise (each class has chance (1/2 − λ) to be flipped into the other classes). For uninformative data, we inject 100 * 1/2 * (1/2 + λ)% uniform label noise. We test each baseline on these semi-synthesized datasets. The results are presented in Table 14. For all baselines, we give λ as prior information to calculate α according to the realized accuracy of the predictor f. In Table 14, we can see that our method effectively recovers the informative data from the uninformative data compared with existing baselines. We put a † on top of the selective risk number where the corresponding algorithm fails to select a reasonable amount of data (low recall) and thus shows degenerate performance. All baselines have the same problem when learning from noisy labels: the key input α cannot be properly estimated due to the poor accuracy, which in turn leads to poor selection results. Our method does not have this issue because it learns to abstain on uninformative data and thus does not require knowing α.
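The uniform label-noise injection used to build these semi-synthetic datasets can be sketched as follows: each label is, with a given probability, replaced by a different class drawn uniformly at random. This is a generic illustration and not the authors' script; the function name, the fixed seed, and the way the noise rate is passed in are assumptions.

```python
import random


def inject_uniform_label_noise(labels, noise_rate, num_classes, seed=0):
    """Flip each label to a uniformly chosen *different* class with probability noise_rate.

    labels      : list of integer class ids in {0, ..., num_classes - 1}
    noise_rate  : flip probability in [0, 1]
    num_classes : number of classes K (must be at least 2)
    """
    rng = random.Random(seed)  # local generator, so results are reproducible
    noisy = []
    for y in labels:
        if rng.random() < noise_rate:
            others = [k for k in range(num_classes) if k != y]
            noisy.append(rng.choice(others))
        else:
            noisy.append(y)
    return noisy


# Toy example: a hypothetical 30% flip rate on a small label list.
clean = [0, 1, 2, 3, 4] * 4
noisy = inject_uniform_label_noise(clean, noise_rate=0.3, num_classes=5)
print(sum(a != b for a, b in zip(clean, noisy)), "labels flipped")
```

Applying the function with one rate on the informative partition and a higher rate on the uninformative partition reproduces the kind of noise gap controlled by λ in this experiment.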
In this section, we provide an ablation study on the sensitivity of different algorithms w.r.t. their hyper-parameters. We study the same setting as Tables 1 and 4. For SelectiveNet, we first fix λ_sl to 32 (the default setting recommended by the authors in the original paper) and vary a from 0.1 to 0.7. Then we fix a to 0.5 (the default setting) and vary λ_sl from 1 to 66. For DeepGambler, we vary o from 1 to 7. For our algorithm, we progressively increase the hyper-parameter β from 4 to 10. As the results show, the performance of all baselines is quite sensitive to the choice of hyper-parameters and can be expected to fluctuate widely. In contrast, our algorithm is more stable with regard to the choice of hyper-parameters. This empirical observation supports that the choice of β is flexible, as stated in Theorem 1. Furthermore, in all scenarios, our algorithm performs better than these two baselines, for the following reasons. First, our selector has better precision, so that we can recover almost all informative data while the two baselines cannot. These two baselines tend to select the whole dataset indiscriminately (low precision and high recall). Second, these baselines consistently show worse risk performance than ours, mainly because their selectors fail to pick the informative data.

We also present the convergence curve of each evaluation metric for the partial data blind setting where α = 50% in Fig. 8. We pick different combinations of β and MWU step-size η. Both recall and precision converge quickly and smoothly. The performance of our algorithm is very robust to different combinations of hyper-parameters. There can be a slight drop in recall and a slight increase in precision when β is set to an extreme value, e.g., β = 10. We include this case to illustrate that while the method is not sensitive to the hyper-parameter β, the trade-off between precision and recall, controlled by β, does exist.

