ON STATISTICAL BIAS IN ACTIVE LEARNING: HOW AND WHEN TO FIX IT

Abstract

Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follow the population distribution. We formalize this bias and investigate the situations in which it can be harmful and sometimes even helpful. We further introduce novel corrective weights to remove the bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also an explanation of the empirical successes of various existing approaches which ignore this bias. In particular, we show that this bias can be actively helpful when training overparameterized models, like neural networks, with relatively little data.

1. INTRODUCTION

In modern machine learning, unlabelled data can be plentiful while labelling requires scarce resources and expert attention, for example in medical imaging or scientific experimentation. A promising solution is active learning: picking the most informative datapoints to label so that the model can be trained in the most sample-efficient way possible (Atlas et al., 1990; Settles, 2010). However, active learning has a complication. By picking the most informative labels, the acquired dataset is no longer drawn from the population distribution. This sampling bias, noted by e.g., MacKay (1992); Dasgupta & Hsu (2008), is worrying: key results in machine learning depend on the training data being independently and identically distributed (i.i.d.) samples from the population distribution. For example, we train neural networks by minimizing a Monte Carlo estimator of the population risk. If the training data are actively sampled, that estimator is biased and we optimize the wrong objective. The possibility of bias in active learning has been considered by e.g., Beygelzimer et al. (2009); Chu et al. (2011); Ganti & Gray (2012), but the full problem is not well understood. In particular, methods that remove active learning bias have been restricted to special cases, so it has been impossible to even establish whether removing this bias is helpful or harmful in typical situations. To this end, we show how to remove the bias introduced by active learning with minimal changes to existing active learning methods. As a stepping stone, we build a Plain Unbiased Risk Estimator, R_PURE, which applies a corrective weighting to actively sampled datapoints in pool-based active learning. Our Levelled Unbiased Risk Estimator, R_LURE, builds on this and has lower variance as well as additional desirable finite-sample properties. We prove that both estimators are unbiased and consistent for arbitrary functions, and we characterize their variance. Interestingly, we find, both theoretically and empirically, that our bias corrections can simultaneously reduce the variance of the estimator, with these gains growing larger for more effective acquisition strategies. We show that, in turn, these combined benefits can sometimes lead to significant improvements for both model evaluation and training.

The benefits are most pronounced in underparameterized models, where each datapoint affects the learned function globally. For example, in linear regression, adopting our weighting allows better estimates of the parameters with less data. On the other hand, in cases where the model is overparameterized and datapoints mostly affect the learned function locally, as in deep neural networks, we find that correcting the active learning bias can be ineffective or even harmful during model training. Namely, even though our corrections typically produce strictly superior statistical estimators, we find that the bias from standard active learning can actually be helpful, providing a regularizing effect that aids generalization. Through this, our work explains the known empirical successes of existing active learning approaches for training deep models (Gal et al., 2017b; Shen et al., 2018), despite these ignoring the bias they induce.
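To make the bias, and one possible correction, concrete, the following is a minimal sketch in the spirit of R_PURE, not the paper's exact estimator. A toy pool of per-point losses stands in for a trained model's losses, and the acquisition scores (here, hypothetically proportional to the loss itself) stand in for an acquisition function. Each acquisition step contributes a single-step term that weights the freshly acquired point by one over its proposal probability, so that the term's conditional expectation equals the population risk; averaging these terms gives an unbiased estimate, while the naive empirical risk over the acquired points is biased.

```python
import numpy as np

def active_risk_estimates(losses, M, rng):
    """Acquire M of the N pool points without replacement, with probability
    proportional to a hypothetical informativeness score, and return:
    (population risk, naive empirical risk of acquired points, corrected
    estimate built from single-acquisition unbiased terms)."""
    N = len(losses)
    # Hypothetical acquisition scores: higher-loss points look "informative".
    scores = losses + 1e-3
    remaining = list(range(N))
    acquired, r_terms = [], []
    for m in range(M):
        s = scores[remaining]
        q = s / s.sum()                      # proposal over remaining pool
        k = rng.choice(len(remaining), p=q)  # actively sample one point
        i = remaining.pop(k)
        # Single-step unbiased term: previously acquired losses enter with
        # weight 1; the fresh point gets importance weight 1/q. Conditioned
        # on the history, its expectation is the full-pool mean loss.
        r_m = (sum(losses[j] for j in acquired) + losses[i] / q[k]) / N
        acquired.append(i)
        r_terms.append(r_m)
    pop = losses.mean()
    naive = losses[acquired].mean()          # biased: over-samples high loss
    corrected = float(np.mean(r_terms))
    return pop, naive, corrected

rng = np.random.default_rng(0)
pool_losses = rng.exponential(size=100)
pop, naive, corrected = active_risk_estimates(pool_losses, M=20, rng=rng)
print(f"population {pop:.3f}  naive {naive:.3f}  corrected {corrected:.3f}")
```

Because the proposal here concentrates on high-loss points, the naive estimate overshoots the population risk, while the weighted terms stay centred on it; the corrected estimate is also low-variance precisely because the acquisition distribution is effective, echoing the variance-reduction effect described above. The paper's actual R_PURE and R_LURE weights, which also handle sampling without replacement more carefully, are derived later.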

