LEARNING TO ABSTAIN FROM UNINFORMATIVE DATA

Abstract

Learning and decision making in domains with naturally high noise-to-signal ratios, such as finance or healthcare, can be challenging yet extremely important. In this paper, we study the problem of learning and decision making under a general noisy generative process. The distribution contains a significant proportion of uninformative data with high label noise, while the remainder carries useful information, represented by low label noise. This dichotomy is present during both training and inference, which requires the proper handling of uninformative data at testing time as well. We propose a novel approach to learning under these conditions via a loss inspired by selective learning theory. By minimizing this loss, the model is guaranteed to make a near-optimal decision: it distinguishes informative data from uninformative data and predicts only on the former. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluate its empirical performance under a variety of settings.

1. INTRODUCTION

Despite the success of machine learning in computer vision (Krizhevsky et al., 2009; He et al., 2016a; Huang et al., 2017) and natural language processing (Vaswani et al., 2017; Devlin et al., 2018), the power of ML is yet to make a significant impact in other areas such as finance and public health. One major challenge is the inherently high noise-to-signal ratio in certain domains. In financial statistical arbitrage, the spread between two assets is usually modeled using Ornstein-Uhlenbeck processes (Øksendal, 2003; Avellaneda & Lee, 2010). The spread behaves almost purely randomly near zero and is naturally unpredictable there; it becomes predictable only in certain rare pockets/scenarios. For example, when the spread exceeds a certain threshold, with high probability it will move toward zero, making arbitrage possible. In cancer research, due to limited resources, only a small number of the most popular gene mutations are routinely tested for differential diagnosis and prognosis. However, due to the long-tail distribution of mutation frequencies across genes, these popular gene mutations can only capture a small proportion of the relevant list of driver mutations of a patient (Reddy et al., 2017). For a significant number of patients, the tested gene mutations may not be in the relevant list of driver mutations, and their relationship to the outcome may appear completely random. Identifying these patients automatically would justify additional gene mutation testing. Such high noise-to-signal ratio datasets pose new challenges to learning: new methods are required to deal with a large fraction of uninformative/high-noise data in both the training and testing stages. The source of uninformative data can be either the random nature of the data generating process, or the fact that the real causal factor is not captured during data collection. Direct application of standard supervised learning methods to such datasets is both challenging and unwarranted.
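To make the setup concrete, a generative process of this kind can be simulated in a few lines. This is a minimal numpy sketch under illustrative assumptions of our own: the informative region (x[0] > 0), the clean label rule (sign of x[1]), and the two noise rates are placeholders, not the paper's exact model.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_dataset(n, eta_info=0.05, eta_uninfo=0.5):
    """Draw features, assign clean labels, then flip each label with a
    data-dependent noise rate: low on informative points (here x[0] > 0),
    about 1/2 (a purely random coin flip) on uninformative points."""
    X = rng.normal(size=(n, 2))
    g_star = X[:, 0] > 0                  # optimal (hidden) selector: informative region
    clean = (X[:, 1] > 0).astype(int)     # ground-truth labels before noise
    eta = np.where(g_star, eta_info, eta_uninfo)
    flip = rng.random(n) < eta
    y = np.where(flip, 1 - clean, clean)
    return X, y, g_star

X, y, g_star = sample_dataset(1000)
```

On the informative half the observed labels agree with the clean rule about 95% of the time, while on the uninformative half agreement is near chance, which is exactly the dichotomy a learner must detect at both training and test time.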
Deep neural networks are even more affected by the presence of noise, due to their strong memorization power (Zhang et al., 2017a): they are likely to overfit the noise and make overly confident predictions where weak or no real structure exists. In this paper, we propose a novel method for learning on datasets where a significant portion of the content has high noise. Instead of forcing the classifier to make predictions for every sample, we learn to decide whether a datapoint is informative or not. Our idea is inspired by the classic selective prediction problem (Chow, 1957), in which one learns to select a subset of the data and only predict on that subset. However, the goal of selective prediction is very different from ours. A selective prediction method pursues a balance between coverage (i.e., the proportion of the data selected) and conditional accuracy on the selected data, and does not explicitly model the underlying generative process. In particular, the aforementioned balance needs to be specified by a human expert, as opposed to being derived directly from the data. In our problem, we assume that uninformative data is an integral part of the underlying generative process and needs to be accounted for. By definition, no learning method, no matter how powerful, can successfully make predictions on uninformative data. Our goal is therefore to identify these uninformative/high-noise samples and, at the same time, to train a classifier that suffers less from the noisy data. Our method learns a selector, g, to approximate the optimal indicator function of informative data, g*. We assume that g* exists as part of the data generation process, but it is never revealed to us, even during training. In the absence of direct supervision, we must therefore rely on the predictor's mistakes to train the selector.
To achieve this goal, we propose a novel selector loss enforcing that (1) the selected data best fits the predictor, and (2) the portion of the data on which we abstain from forecasting does not contain many correct predictions. This loss function is quite different from the loss in classic selective prediction, which penalizes all unselected data equally. We theoretically analyze our method under a general noisy data generation process that follows the standard data-dependent label noise model (Massart & Nédélec, 2006; Hanneke, 2009). We distinguish informative from uninformative data via a gap in the label noise ratio. A major contribution of this paper is the derivation of theoretical guarantees for the empirical minimizer of our loss. A minimax-optimal sample complexity bound for approximating the optimal selector is provided. We show that optimizing the selector loss can recover nearly all the informative data in a PAC fashion (Valiant, 1984). This guarantee holds even in the challenging setting where the uninformative data has purely random labels and dominates the training set. The theoretical guarantee empowers us to extend to a more realistic setting where the sample size is limited and the initial predictor is not sufficiently close to the ground truth. Our method extends to an iterative algorithm in which both the predictor and the selector are progressively optimized. The selector is improved by optimizing our novel selector loss. Meanwhile, the predictor is improved by optimizing the empirical risk, re-weighted based on the selector's output; uninformative samples identified by the selector are down-weighted. Experiments on both synthetic and real-world datasets demonstrate the merit of our method compared to existing baselines.
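The alternating scheme can be sketched in a few lines of numpy. This is a toy illustration under assumptions of our own, not the paper's exact loss or algorithm: we use a linear predictor and selector trained by gradient descent, cross-entropy as the per-sample loss, and exp(-loss) as a smooth surrogate for the correctness indicator penalized on the abstained portion.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-np.clip(z, -30.0, 30.0)))

# Toy data: points with x[0] > 0 are informative (label = sign of x[1]);
# the rest are uninformative, with purely random labels.
n, d = 600, 2
X = rng.normal(size=(n, d))
informative = X[:, 0] > 0
y = (X[:, 1] > 0).astype(float)
y[~informative] = rng.integers(0, 2, size=(~informative).sum()).astype(float)

w_f = np.zeros(d)  # predictor weights
w_g = np.zeros(d)  # selector weights
lr = 0.5

for _ in range(500):
    p = sigmoid(X @ w_f)   # predictor confidence for class 1
    g = sigmoid(X @ w_g)   # selection probability
    ce = -(y * np.log(p + 1e-9) + (1 - y) * np.log(1 - p + 1e-9))
    # Predictor step: empirical risk re-weighted by the selector's output,
    # so samples flagged as uninformative are down-weighted.
    w_f -= lr * (X.T @ (g * (p - y))) / n
    # Selector step: (1) selected data should fit the predictor (g * ce);
    # (2) the abstained portion should not contain many correct predictions
    # ((1 - g) * correctness), with exp(-ce) as the correctness surrogate.
    correct = np.exp(-ce)
    dL_dg = ce - correct
    w_g -= lr * (X.T @ (dL_dg * g * (1.0 - g))) / n

sel = sigmoid(X @ w_g)  # learned selection probabilities
```

Because the predictor fits the informative region well, its per-sample loss is low there, so the selector loss pushes g up on informative points and down where labels are random; the re-weighted predictor step in turn relies less on the noisy half as training proceeds.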

2. RELATED WORK

Learning with untrusted data aims to recover the ground-truth model from a partially corrupted dataset. Different noise models for untrusted data have been studied, including random label noise (Bylander, 1994; Natarajan et al., 2013; Han et al., 2018; Yu et al., 2019; Zheng et al., 2020; Zhang et al., 2020), Massart noise (Massart & Nédélec, 2006; Awasthi et al., 2015; Hanneke, 2009; Hanneke & Yang, 2015; Yan & Zhang, 2017; Diakonikolas et al., 2019; 2020; Cheng et al., 2020; Xia et al., 2020; Zhang & Li, 2021) and adversarial noise (Kearns & Li, 1993; Kearns et al., 1994; Kalai et al., 2008; Klivans et al., 2009; Awasthi et al., 2017). Our noise model is similar to general Massart noise (Massart & Nédélec, 2006; Hanneke, 2009; Diakonikolas et al., 2019), where the label noise is data dependent and labels can be generated by a purely random coin flip. The distinguishing feature of our noisy generative model is the coexistence of uninformative data with high label noise and informative data with low label noise. We characterize this uninformative/informative data structure via a non-vanishing gap in the label noise ratio. While there is a long history of literature on learning classifiers with label noise in the training stage (Thulasidasan et al., 2019; Cheng et al., 2020; Xia et al., 2020), ours is the first work to investigate learning a model for the inference stage under a label noise setting. We study the case where label noise is an integral part of the generative process and thus appears during the inference stage as well, where it must be detected and discarded once more. We view this as a realistic setup in industries like finance and healthcare. Selective learning is an active research area (Chow, 1957; 1970; El-Yaniv et al., 2010; Kalai et al., 2012; Nan & Saligrama, 2017; Ni et al., 2019; Acar et al., 2020; Gangrade et al., 2021a).
It extends the classic selective prediction problem, studying how to select a subset of the data for different learning tasks, and has also been generalized to other problems, e.g., learning to defer to a human expert (Madras et al., 2018; Mozannar & Sontag, 2020). We can summarize existing methods into four categories: Monte Carlo sampling based methods (Gal & Ghahramani, 2016; Kendall & Gal, 2017; Pearce et al., 2020), margin based methods (Fumera & Roli, 2002; Bartlett & Wegkamp, 2008; Grandvalet et al., 2008; Wegkamp et al., 2011; Zhang et al., 2018), confidence based methods (Wiener & El-Yaniv,

