LEARNING TO ABSTAIN FROM UNINFORMATIVE DATA

Abstract

Learning and decision making in domains with naturally high noise-to-signal ratios, such as finance or healthcare, can be challenging yet extremely important. In this paper, we study the problem of learning and decision making under a general noisy generative process. The distribution has a significant proportion of uninformative data with high label noise, while the remainder contains useful information, reflected in low label noise. This dichotomy is present during both training and inference, which requires proper handling of uninformative data at test time. We propose a novel approach to learn under these conditions via a loss inspired by selective learning theory. By minimizing the loss, the model is guaranteed to make a near-optimal decision by distinguishing informative from uninformative data before making predictions. We build upon the strength of our theoretical guarantees by describing an iterative algorithm, which jointly optimizes both a predictor and a selector, and evaluate its empirical performance under a variety of settings.

1. INTRODUCTION

Despite the success of machine learning in computer vision (Krizhevsky et al., 2009; He et al., 2016a; Huang et al., 2017) and natural language processing (Vaswani et al., 2017; Devlin et al., 2018), ML has yet to make a significant impact in other areas such as finance and public health. One major challenge is the inherently high noise-to-signal ratio in certain domains. In financial statistical arbitrage, the spread between two assets is usually modeled using Ornstein-Uhlenbeck processes (Øksendal, 2003; Avellaneda & Lee, 2010). The spread behaves almost purely randomly near zero and is naturally unpredictable there; it becomes predictable only in certain rare pockets. For example, when the spread exceeds a certain threshold, it will with high probability move toward zero, making arbitrage possible. In cancer research, due to limited resources, only a small number of the most popular gene mutations are routinely tested for differential diagnosis and prognosis. However, due to the long-tail distribution of mutation frequencies across genes, these popular gene mutations can capture only a small proportion of a patient's relevant driver mutations (Reddy et al., 2017). For a significant number of patients, the tested gene mutations may not be among the relevant driver mutations, and their relationship to the outcome may appear completely random. Identifying these patients automatically would justify additional gene mutation testing.

Such high noise-to-signal datasets pose new challenges to learning: new methods are required to handle a large fraction of uninformative, high-noise data in both the training and testing stages. Uninformative data can arise either from the inherently random nature of the data-generating process, or because the true causal factors are not captured during data collection. Directly applying standard supervised learning methods to such datasets is both challenging and unwarranted.
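The statistical-arbitrage example above can be made concrete with a small simulation. The sketch below (illustrative parameter and threshold values, not numbers from the paper) simulates a mean-zero Ornstein-Uhlenbeck spread dX = -theta * X dt + sigma dW via Euler-Maruyama and measures how often the spread moves back toward zero, conditioned on being near zero versus beyond a threshold:

```python
import numpy as np

rng = np.random.default_rng(0)

# Euler-Maruyama simulation of dX = -theta * X dt + sigma dW (mean level zero)
theta, sigma, dt, n = 5.0, 0.5, 0.01, 200_000
x = np.empty(n)
x[0] = 0.0
for t in range(1, n):
    x[t] = x[t - 1] - theta * x[t - 1] * dt + sigma * np.sqrt(dt) * rng.standard_normal()

# Did the spread move toward its mean (zero) over the next step?
toward_mean = np.sign(x[1:] - x[:-1]) == -np.sign(x[:-1])

near_zero = np.abs(x[:-1]) < 0.05  # "uninformative" region near zero
far = np.abs(x[:-1]) > 0.4         # spread exceeds the threshold

print(f"P(move toward zero | near zero) ~ {toward_mean[near_zero].mean():.2f}")  # roughly a coin flip
print(f"P(move toward zero | far)       ~ {toward_mean[far].mean():.2f}")        # clearly above 1/2
```

Near zero the mean-reverting drift is dwarfed by the diffusion term, so the next move is close to a coin flip; beyond the threshold the drift dominates and the direction becomes predictable. This is exactly the informative/uninformative dichotomy the paper sets out to exploit.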
Deep neural networks are especially affected by the presence of noise, due to their strong memorization power (Zhang et al., 2017a): they are likely to overfit the noise and make overly confident predictions where little or no real structure exists. In this paper, we propose a novel method for learning on datasets where a significant portion of the data has high label noise. Instead of forcing the classifier to make predictions for every sample, we learn to decide whether a datapoint is informative or not. Our idea is inspired by the classic selective prediction problem (Chow, 1957), in which one learns to select a subset of the data and to predict only on that subset. However, the goal of selective prediction is very different from ours: a selective prediction method pursues a balance between coverage (i.e., the proportion of data selected) and conditional accuracy on the selected data, and does not explicitly model the underlying generative

