ON STATISTICAL BIAS IN ACTIVE LEARNING: HOW AND WHEN TO FIX IT

Abstract

Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follow the population distribution. We formalize this bias and investigate the situations in which it can be harmful and those in which it can even be helpful. We further introduce novel corrective weights to remove the bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also explains the empirical successes of various existing approaches that ignore this bias. In particular, we show that this bias can be actively helpful when training overparameterized models, such as neural networks, with relatively little data.

1. INTRODUCTION

In modern machine learning, unlabelled data can be plentiful while labelling requires scarce resources and expert attention, for example in medical imaging or scientific experimentation. A promising solution is active learning: picking the most informative datapoints to label, in the hope of training the model in the most sample-efficient way possible (Atlas et al., 1990; Settles, 2010). However, active learning has a complication. By picking the most informative labels, the acquired dataset is no longer drawn from the population distribution. This sampling bias, noted by, e.g., MacKay (1992) and Dasgupta & Hsu (2008), is worrying: key results in machine learning depend on the training data being independently and identically distributed (i.i.d.) samples from the population distribution. For example, we train neural networks by minimizing a Monte Carlo estimator of the population risk. If the training data are actively sampled, that estimator is biased and we optimize the wrong objective. The possibility of bias in active learning has been considered by, e.g., Beygelzimer et al. (2009), Chu et al. (2011), and Ganti & Gray (2012), but the full problem is not well understood. In particular, methods that remove active learning bias have been restricted to special cases, so it has been impossible to even establish whether removing active learning bias is helpful or harmful in typical situations. To this end, we show how to remove the bias introduced by active learning with minimal changes to existing active learning methods. As a stepping stone, we build a Plain Unbiased Risk Estimator, R_PURE, which applies a corrective weighting to actively sampled datapoints in pool-based active learning. Our Levelled Unbiased Risk Estimator, R_LURE, builds on this and has lower variance and additional desirable finite-sample properties. We prove that both estimators are unbiased and consistent for arbitrary functions, and we characterize their variance.
Interestingly, we find, both theoretically and empirically, that our bias corrections can simultaneously reduce the variance of the estimator, with these gains becoming larger for more effective acquisition strategies. We show that, in turn, these combined benefits can sometimes lead to significant improvements for both model evaluation and training. The benefits are most pronounced in underparameterized models, where each datapoint affects the learned function globally. For example, in linear regression, adopting our weighting allows better estimates of the parameters with less data. On the other hand, in cases where the model is overparameterized and datapoints mostly affect the learned function locally, as in deep neural networks, we find that correcting active learning bias can be ineffective or even harmful during model training. Namely, even though our corrections typically produce strictly superior statistical estimators, we find that the bias from standard active learning can actually be helpful by providing a regularising effect that aids generalization. Through this, our work explains the known empirical successes of existing active learning approaches for training deep models (Gal et al., 2017b; Shen et al., 2018), despite these ignoring the bias this induces. Our main contributions are:

1. We offer a formalization of the problem of statistical bias in active learning.
2. We introduce active learning risk estimators, R_PURE and R_LURE, and prove that both are unbiased and consistent, with variance that can be lower than that of the naive (biased) estimator.
3. Using these, we show that active learning bias can hurt in underparameterized cases like linear regression but help in overparameterized cases like neural networks, and we explain why.

2. BIAS IN ACTIVE LEARNING

We begin by characterizing the bias introduced by active learning. In supervised learning, we generally aim to find a decision rule f_θ, mapping inputs x to outputs y drawn from a population data distribution p_data(x, y), which, given a loss function L(y, f_θ(x)), minimizes the population risk:

r = E_{x,y ∼ p_data}[L(y, f_θ(x))].

The population risk cannot be computed exactly, so instead we consider the empirical distribution of some dataset of N points drawn from the population. This gives the empirical risk, an unbiased and consistent estimator of r when the data are drawn i.i.d. from p_data and are independent of θ:

R = (1/N) Σ_{n=1}^{N} L(y_n, f_θ(x_n)).

In pool-based active learning (Lewis & Gale, 1994; Settles, 2010), we begin with a large unlabelled dataset, known as the pool dataset D_pool ≡ {x_n | 1 ≤ n ≤ N}, and sequentially pick the most useful points for which to acquire labels. Because most labels are missing, we cannot evaluate R directly, so we use the sub-sample empirical risk evaluated on the M actively sampled labelled points:

R̃ = (1/M) Σ_{m=1}^{M} L(y_m, f_θ(x_m)).

Though almost all active learning research uses this estimator (see Appendix D), it is not an unbiased estimator of either R or r when the M points are actively sampled. Under active, i.e. non-uniform, sampling, the M datapoints are not drawn from the population distribution, resulting in a bias which we formally characterize in §4. See Appendix A for a more general overview of active learning. Note an important distinction between what we will call "statistical bias" and "overfitting bias." The bias from active learning above is a statistical bias in the sense that using R̃ biases our estimation of r, regardless of θ. As such, optimizing θ with respect to R̃ introduces bias into our optimization of θ. In turn, this breaks any consistency guarantees for our learning process: if we keep M/N fixed, take M → ∞, and optimize for θ, we no longer obtain the optimal θ that minimizes r.
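The bias of the naive sub-sample estimator is easy to reproduce numerically. Below is a minimal sketch under illustrative assumptions: a fixed linear decision rule, squared-error loss, and an "active" acquisition scheme that labels points with probability proportional to their loss as a stand-in for an informativeness score. All names (`f_theta`, the proposal `q`) are ours, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy pool: N points from a linear-Gaussian population, a fixed predictor,
# and squared-error loss L(y_n, f_theta(x_n)).
N, M = 10_000, 100
x = rng.normal(size=N)
y = 2.0 * x + rng.normal(scale=0.5, size=N)

def f_theta(x):
    return 1.5 * x  # some fixed decision rule (theta chosen independently of data)

losses = (y - f_theta(x)) ** 2
R = losses.mean()  # empirical risk over the full pool

# "Active" acquisition: label hard points more often (prob. proportional to loss).
q = losses / losses.sum()

def sub_sample_risk(probs):
    idx = rng.choice(N, size=M, replace=False, p=probs)
    return losses[idx].mean()  # the naive sub-sample estimator

uniform = np.mean([sub_sample_risk(np.full(N, 1.0 / N)) for _ in range(1000)])
active = np.mean([sub_sample_risk(q) for _ in range(1000)])

print(f"pool risk R         : {R:.3f}")
print(f"E[naive], uniform   : {uniform:.3f}")  # matches R: uniform sampling is unbiased
print(f"E[naive], active    : {active:.3f}")   # overestimates R: active sampling is biased
```

Averaged over many acquisition runs, the uniformly sub-sampled estimate recovers R, while the actively sub-sampled estimate is systematically too high, because points were selected precisely for having large loss.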
Almost all current work on active learning for neural networks ignores the issue of statistical bias. However, even without this statistical bias, indeed even if we use R directly, the training process itself creates an overfitting bias: evaluating the risk using training data induces a dependence between the data and θ. This is why we usually evaluate the risk on held-out test data when doing model selection. Dealing with overfitting bias is beyond the scope of our work, as this would equate to solving the problem of generalization. The small amount of prior work that does consider statistical bias in active learning ignores this overfitting bias entirely, without commenting on it. In §3-6, we focus on statistical bias in active learning, so that we can produce estimators that are valid and consistent and let us optimize the intended objective, not so that they can miraculously close the train-test gap. From a more formal perspective, our results all assume that θ is chosen independently of the training data, an assumption that is almost always (implicitly) made in the literature. This ensures that our estimators form valid objectives, but it also has important implications that are typically overlooked. We return to this in §7, examining the interaction between statistical and overfitting bias.

3. UNBIASED ACTIVE LEARNING: R_PURE AND R_LURE

We now show how to unbiasedly estimate the risk in the form of a weighted expectation over actively sampled datapoints. We denote the set of actively sampled points D_train ≡ {(x_m, y_m) | 1 ≤ m ≤ M}, where ∀m : x_m ∈ D_pool. We begin by building a "plain" unbiased risk estimator, R_PURE, as a stepping stone; its construction is quite natural in that each term is individually an unbiased estimator of the risk. We then use it to construct a "levelled" unbiased risk estimator, R_LURE, which is an unbiased
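To convey the flavour of such corrective weightings, the sketch below shows the classical inverse-probability (importance-sampling) correction for points drawn with replacement from a known proposal. This is a generic illustration only: the paper's R_PURE and R_LURE weights, which additionally handle sequential acquisition without replacement, are derived later. The proposal `q` and the loss distribution here are arbitrary stand-ins.

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-in per-point losses over a pool of N points; target is the pool risk R.
N, M = 10_000, 100
losses = rng.gamma(shape=2.0, scale=0.5, size=N)
R = losses.mean()

# Active-style proposal: acquire high-loss points more often (offset keeps q > 0).
score = losses + 0.25
q = score / score.sum()

def weighted_risk():
    # Sample WITH replacement so the textbook 1/(N q) correction is exact:
    # E[(1/M) sum_m L_m / (N q_m)] = (1/N) sum_n L_n = R.
    idx = rng.choice(N, size=M, p=q)
    weights = 1.0 / (N * q[idx])
    return np.mean(weights * losses[idx])

estimates = [weighted_risk() for _ in range(2000)]
print(f"pool risk R       : {R:.3f}")
print(f"E[weighted est.]  : {np.mean(estimates):.3f}")  # matches R despite non-uniform sampling
```

Each weighted term is individually an unbiased estimator of R, mirroring the property the text attributes to R_PURE; the paper's contribution lies in extending this idea to the without-replacement, sequential setting and reducing its variance.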

