ON STATISTICAL BIAS IN ACTIVE LEARNING: HOW AND WHEN TO FIX IT

Abstract

Active learning is a powerful tool when labelling data is expensive, but it introduces a bias because the training data no longer follow the population distribution. We formalize this bias and investigate the situations in which it can be harmful and sometimes even helpful. We further introduce novel corrective weights to remove the bias when doing so is beneficial. Through this, our work not only provides a useful mechanism that can improve the active learning approach, but also an explanation of the empirical successes of various existing approaches which ignore this bias. In particular, we show that this bias can be actively helpful when training overparameterized models, like neural networks, with relatively little data.

1. INTRODUCTION

In modern machine learning, unlabelled data can be plentiful while labelling requires scarce resources and expert attention, for example in medical imaging or scientific experimentation. A promising solution is active learning: picking the most informative datapoints to label so that the model can be trained as sample-efficiently as possible (Atlas et al., 1990; Settles, 2010). However, active learning has a complication. By picking the most informative labels, the acquired dataset is not drawn from the population distribution. This sampling bias, noted by e.g. MacKay (1992) and Dasgupta & Hsu (2008), is worrying: key results in machine learning depend on the training data being independently and identically distributed (i.i.d.) samples from the population distribution. For example, we train neural networks by minimizing a Monte Carlo estimator of the population risk. If training data are actively sampled, that estimator is biased and we optimize the wrong objective. The possibility of bias in active learning has been considered by e.g. Beygelzimer et al. (2009), Chu et al. (2011), and Ganti & Gray (2012), but the full problem is not well understood. In particular, methods that remove active learning bias have been restricted to special cases, so it has been impossible to even establish whether removing active learning bias is helpful or harmful in typical situations. To address this, we show how to remove the bias introduced by active learning with minimal changes to existing active learning methods. As a stepping stone, we build a Plain Unbiased Risk Estimator, RPURE, which applies a corrective weighting to actively sampled datapoints in pool-based active learning. Our Levelled Unbiased Risk Estimator, RLURE, builds on this and has lower variance and additional desirable finite-sample properties. We prove that both estimators are unbiased and consistent for arbitrary functions, and characterize their variance.
Interestingly, we find, both theoretically and empirically, that our bias corrections can simultaneously reduce the variance of the estimator, with these gains becoming larger for more effective acquisition strategies. We show that, in turn, these combined benefits can sometimes lead to significant improvements for both model evaluation and training. The benefits are most pronounced in underparameterized models where each datapoint affects the learned function globally. For example, in linear regression adopting our weighting allows better estimates of the parameters with less data. On the other hand, in cases where the model is overparameterized and datapoints mostly affect the learned function locally, as with deep neural networks, we find that correcting active learning bias can be ineffective or even harmful during model training. Namely, even though our corrections typically produce strictly superior statistical estimators, we find that the bias from standard active learning can actually be helpful by providing a regularising effect that aids generalization. Through this, our work explains the known empirical successes of existing active learning approaches for training deep models (Gal et al., 2017b; Shen et al., 2018), despite these ignoring the bias this induces. In summary, our contributions are as follows:

1. We offer a formalization of the problem of statistical bias in active learning.
2. We introduce active learning risk estimators, RPURE and RLURE, and prove both are unbiased and consistent, with variance that can be less than the naive (biased) estimator.
3. Using these, we show that active learning bias can hurt in underparameterized cases like linear regression but help in overparameterized cases like neural networks, and explain why.

2. BIAS IN ACTIVE LEARNING

We begin by characterizing the bias introduced by active learning. In supervised learning, we generally aim to find a decision rule f_θ, mapping inputs x to outputs y drawn from a population data distribution p_data(x, y), which, given a loss function L(y, f_θ(x)), minimizes the population risk:

    r = E_{x,y ∼ p_data}[ L(y, f_θ(x)) ].

The population risk cannot be found exactly, so instead we consider the empirical distribution for some dataset of N points drawn from the population. This gives the empirical risk, an unbiased and consistent estimator of r when the data are drawn i.i.d. from p_data and are independent of θ:

    R = (1/N) Σ_{n=1}^{N} L(y_n, f_θ(x_n)).

In pool-based active learning (Lewis & Gale, 1994; Settles, 2010), we begin with a large unlabelled dataset, known as the pool dataset D_pool ≡ {x_n | 1 ≤ n ≤ N}, and sequentially pick the most useful points for which to acquire labels. The lack of most labels means we cannot evaluate R directly, so we use the sub-sample empirical risk evaluated on the M actively sampled labelled points:

    R̃ = (1/M) Σ_{m=1}^{M} L(y_m, f_θ(x_m)).    (1)

Though almost all active learning research uses this estimator (see Appendix D), it is not an unbiased estimator of either R or r when the M points are actively sampled. Under active, i.e. non-uniform, sampling the M datapoints are not drawn from the population distribution, resulting in a bias which we formally characterize in §4. See Appendix A for a more general overview of active learning.

Note an important distinction between what we will call "statistical bias" and "overfitting bias". The bias from active learning above is a statistical bias in the sense that using R̃ biases our estimation of r, regardless of θ. As such, optimizing θ with respect to R̃ induces bias into our optimization of θ. In turn, this breaks any consistency guarantees for our learning process: if we keep M/N fixed, take M → ∞, and optimize for θ, we no longer get the optimal θ that minimizes r.
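The statistical bias of the naive sub-sample risk is easy to see in simulation. The sketch below is our own illustration, not the paper's code; the exponential losses and the loss-proportional sampling rule are assumptions chosen purely for the example. It draws a pool of per-point losses and compares the unweighted sub-sample risk under active (non-uniform) sampling with the full pool risk:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 1000, 50
losses = rng.exponential(size=N)   # per-point losses L_n over the pool
pool_risk = losses.mean()          # R: the quantity we would like to estimate

# An "active" (non-uniform) proposal preferring hard points: q(n) proportional to L_n.
q = losses / losses.sum()
idx = rng.choice(N, size=M, replace=False, p=q)
naive_risk = losses[idx].mean()    # the unweighted sub-sample risk

print(f"pool risk = {pool_risk:.3f}, naive sub-sample risk = {naive_risk:.3f}")
# The naive estimate systematically overestimates the pool risk because
# high-loss points are over-represented in the actively sampled subset.
```

Under uniform sampling the two numbers would agree in expectation; the gap here is exactly the statistical bias discussed above.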
Almost all work on active learning for neural networks currently ignores the issue of statistical bias. However, even without this statistical bias, indeed even if we use R directly, the training process itself also creates an overfitting bias: evaluating the risk using training data induces a dependency between the data and θ. This is why we usually evaluate the risk on held-out test data when doing model selection. Dealing with overfitting bias is beyond the scope of our work, as this would equate to solving the problem of generalization. The small amount of prior work which does consider statistical bias in active learning entirely ignores this overfitting bias without commenting on it. In §3-6, we focus on statistical bias in active learning, so that we can produce estimators that are valid and consistent and let us optimize the intended objective, not so they can miraculously close the train-test gap. From a more formal perspective, our results all assume that θ is chosen independently of the training data; an assumption that is almost always (implicitly) made in the literature. This ensures our estimators form valid objectives, but also has important implications that are typically overlooked. We return to this in §7, examining the interaction between statistical and overfitting bias.

3. UNBIASED RISK ESTIMATORS FOR ACTIVE LEARNING

We now introduce our two risk estimators for actively sampled data. The first, RPURE, corrects the bias of the naive estimator by applying importance weights to the actively sampled points. The second, RLURE, is an unbiased and consistent estimator of the population risk just like RPURE, but reweights individual terms to produce lower variance and resolve some pathologies of the first approach. Both estimators are easy to implement and have trivial compute/memory requirements.

3.1. RPURE : PLAIN UNBIASED RISK ESTIMATOR

For our estimators, we introduce an active sampling proposal distribution over indices rather than the more typical distribution over datapoints. This simplifies our proofs, but the two are algorithmically equivalent for pool-based active learning because of the one-to-one relationship between datapoints and indices. We define the probability mass for each index being the next to be sampled, once D_train contains m−1 points, as q(i_m; i_{1:m−1}, D_pool). Because we are learning actively, the proposal distribution depends on the indices sampled so far (i_{1:m−1}) and the available data (D_pool; note though that it does not depend on the labels of unsampled points). The only requirement on this proposal distribution for our theoretical results is that it must place non-zero probability on every point that has not yet been sampled: anything else necessarily introduces bias. Considerations for the acquisition proposal distribution are discussed further in §3.3. We first present the estimator before proving the main results:

    RPURE ≡ (1/M) Σ_{m=1}^{M} a_m,  where  a_m ≡ w_m L_{i_m} + (1/N) Σ_{t=1}^{m−1} L_{i_t},    (2)

where the loss at a point is L_{i_m} ≡ L(y_{i_m}, f_θ(x_{i_m})), the weights are w_m ≡ 1 / (N q(i_m; i_{1:m−1}, D_pool)), and i_m ∼ q(i_m; i_{1:m−1}, D_pool). For practical implementation, RPURE can further be written in the following more computationally friendly form that avoids a double summation:

    RPURE = (1/M) Σ_{m=1}^{M} [ 1 / (N q(i_m; i_{1:m−1}, D_pool)) + (M−m)/N ] L_{i_m}.    (3)

However, we focus on the first form for our analysis because a_m in (2) has some beneficial properties not shared by the weighting factors in (3). In particular, in Appendix B.1 we prove that:

Lemma 1. The individual terms a_m of RPURE are unbiased estimators of the risk: E[a_m] = r.

The motivation for the construction of a_m directly originates from constructing an estimator where Lemma 1 holds while only making use of the observed losses L_{i_1}, ..., L_{i_m}, taking care with the fact that each new proposal distribution does not have support over points that have already been acquired. Except for trivial problems, a_m is essentially unique in this regard; naive importance sampling (i.e. (1/M) Σ_{m=1}^{M} w_m L_{i_m}) does not lead to an unbiased, or even consistent, estimator. However, the overall estimator RPURE is not the only unbiased estimator of the risk, as we discuss in §3.2. We can now characterize the behaviour of RPURE as follows (see Appendix B.2 for proof).

Theorem 1. RPURE as defined above has the properties:

    E[RPURE] = r,
    Var[RPURE] = Var[L(y, f_θ(x))] / N + (1/M²) Σ_{m=1}^{M} E_{D_pool, i_{1:m−1}}[ Var[w_m L_{i_m} | i_{1:m−1}, D_pool] ].    (4)

Remark 1. The first term of (4) is the variance of the loss on the whole pool, while the second term accounts for the variance originating from the active sampling itself given the pool. This second term is O(N/M) times larger and so will generally dominate in practice, as typically M ≪ N.

Armed with Theorem 1, we can prove the consistency of RPURE under standard assumptions: RPURE converges in expectation (i.e. its mean squared error tends to zero) as M → ∞ under the assumptions that N > M, L(y, f_θ(x)) is square-integrable, and q(i_m; i_{1:m−1}, D_pool) is a valid proposal in the sense that it puts non-zero mass on each unlabelled datapoint. Formally, as proved in Appendix B.3,

Theorem 2. Let α = N/M and assume that α > 1. If E[L(y, f_θ(x))²] < ∞ and ∃β > 0 such that min_{n ∈ {1:N} \ i_{1:m−1}} q(i_m = n; i_{1:m−1}, D_pool) ≥ β/N for all N ∈ Z⁺ and m ≤ N, then RPURE converges in L² norm to r as M → ∞, i.e., lim_{M→∞} E[(RPURE − r)²] = 0.
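As a concrete illustration of the computational form (3), the sketch below (our own code, not the authors'; the roughly loss-proportional proposal is just an example) sequentially samples M indices without replacement, records the proposal mass of each chosen index at the moment it was sampled, and applies the RPURE weights. Averaged over many acquisition trajectories for a fixed pool, the estimate matches the pool risk, consistent with Theorem 1:

```python
import numpy as np

def r_pure(losses, q_at_sample, N):
    """RPURE via the computationally friendly form (3).

    losses:      L_{i_m} of the acquired points, in acquisition order
    q_at_sample: q(i_m; i_{1:m-1}, D_pool) of each point when it was chosen
    N:           pool size
    """
    M = len(losses)
    m = np.arange(1, M + 1)
    weights = 1.0 / (N * q_at_sample) + (M - m) / N
    return np.mean(weights * losses)

rng = np.random.default_rng(0)
N, M = 20, 5
pool_losses = rng.exponential(size=N)
pool_risk = pool_losses.mean()   # with a fixed pool, RPURE targets the pool risk

estimates = []
for _ in range(5000):
    remaining = np.arange(N)
    L, q = [], []
    for _ in range(M):
        # Example proposal: roughly loss-proportional over unsampled points,
        # with an offset keeping the mass non-zero everywhere.
        p = pool_losses[remaining] + 0.5
        p = p / p.sum()
        j = rng.choice(len(remaining), p=p)
        L.append(pool_losses[remaining[j]])
        q.append(p[j])
        remaining = np.delete(remaining, j)
    estimates.append(r_pure(np.array(L), np.array(q), N))

print(round(float(np.mean(estimates)), 3), round(float(pool_risk), 3))
# The two numbers agree closely: RPURE is unbiased given the pool.
```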

3.2. RLURE : LEVELLED UNBIASED RISK ESTIMATOR

RPURE is natural in that each term is an unbiased estimator of r. However, this creates surprising behaviour given the sequential structure of active learning. For example, with a uniform proposal distribution, equivalent to not actively learning, points sampled earlier are more highly weighted than later ones and RPURE ≠ R̃ (the naive sub-sample risk of (1)). Specifically, a uniform proposal, q(i_m; i_{1:m−1}, D_pool) = 1/(N−m+1), gives a weight on each sampled point of 1 + (M−2m+1)/N ≠ 1. Similarly, as M → N (such that we use the full pool) the weights also fail to become uniform: setting M = N gives a weight for each point of 1 + (N−2m+1)/N ≠ 1. RLURE fixes this. We first quote the estimator before proving key results:

    RLURE ≡ (1/M) Σ_{m=1}^{M} v_m L_{i_m};   v_m ≡ 1 + (N−M)/(N−m) · [ 1 / ((N−m+1) q(i_m; i_{1:m−1}, D_pool)) − 1 ].    (5)

This estimator ensures that the expected value of the weight, v_m, does not depend on the position in which the point was sampled, but only on the probability with which it was sampled. That is, E[v_m] = 1 for all m, M, N, and q(i_m; i_{1:m−1}, D_pool). As a consequence the variance is generally lower. Moreover, we resolve the finite-sample behaviour shown by RPURE. The weights become more even as M increases for a given N, and when M = N each v_m = 1, such that RLURE equals the naive sub-sample risk R̃, which in turn equals the full empirical risk R. Additionally, if the proposal is uniform, all weights are always exactly 1, such that RLURE = R̃.

To derive RLURE, note that each a_m estimates r without bias, so for any normalized linear combination:

    E[ Σ_{m=1}^{M} c_m a_m / Σ_{m=1}^{M} c_m ] = r,

provided that the c_m are constant with respect to the data and sampled indices (they can depend on M, N, and m). In Appendix B.4 we show that the choice

    c_m = N(N−M) / ((N−m)(N−m+1))

produces the v_m from (5), and in turn that these v_m have the desired property E[v_m] = 1 for all m ∈ {1, ..., M}. We note also that Σ_{m=1}^{M} c_m = M, such that RLURE = (1/M) Σ_{m=1}^{M} c_m a_m. We further characterise the variance and unbiasedness of RLURE as follows (see Appendix B.5 for proof).

Theorem 3. RLURE as defined above has the following properties:

    E[RLURE] = r,
    Var[RLURE] = Var[L(y, f_θ(x))] / N + (1/M²) Σ_{m=1}^{M} c_m² E_{D_pool, i_{1:m−1}}[ Var[w_m L_{i_m} | i_{1:m−1}, D_pool] ].    (6)

Although not obvious from inspection of (6), in Appendix B.6 we prove that the variance of RLURE is always less than that of RPURE subject to a mild assumption about the proposal, which we detail there.

Theorem 4. If Equation (14) in Appendix B.6 holds, then Var[RLURE] ≤ Var[RPURE]. If M > 1 and E_{D_pool}[Var[w_1 L_{i_1} | D_pool]] > 0 also hold, then the inequality is strict: Var[RLURE] < Var[RPURE].

To provide intuition into why this result holds, remember that the c_m were introduced to ensure that the E[v_m] are all identically one. This weighting therefore removes the tendency of RPURE to overemphasize the earlier samples, essentially increasing the effective sample size by correcting the imbalance. We finish by confirming that RLURE is a consistent estimator as M → ∞ (proof in Appendix B.7):

Theorem 5. Under the same assumptions as Theorem 2, lim_{M→∞} E[(RLURE − r)²] = 0.
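The RLURE weights in (5) take only a few lines to compute. The sketch below is our own illustration, not reference code; it checks the finite-sample property noted above: with a uniform proposal q(i_m; i_{1:m−1}, D_pool) = 1/(N−m+1), every weight is exactly 1 and RLURE reduces to the naive sub-sample risk.

```python
import numpy as np

def lure_weights(q_at_sample, N):
    """Weights v_m from (5) for acquisition steps m = 1..M.

    q_at_sample[m-1] is q(i_m; i_{1:m-1}, D_pool) for the chosen index.
    """
    M = len(q_at_sample)
    m = np.arange(1, M + 1)
    return 1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q_at_sample) - 1.0)

N, M = 100, 10
# Uniform proposal over the remaining points: q_m = 1/(N - m + 1).
q_uniform = 1.0 / (N - np.arange(1, M + 1) + 1)
v = lure_weights(q_uniform, N)
print(v)  # all exactly 1: RLURE collapses to the unweighted sub-sample risk

# A non-uniform proposal yields non-trivial weights (E[v_m] = 1 holds in
# expectation over a genuine proposal, not for this arbitrary perturbation).
rng = np.random.default_rng(0)
print(lure_weights(q_uniform * rng.uniform(0.5, 2.0, size=M), N).round(3))
```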

3.3. FROM ACTIVE LEARNING SCHEMES TO PROPOSALS

We have introduced two elements of the active learning scheme: the risk estimators, RPURE and RLURE, and the acquisition proposal distribution, q(i_m; i_{1:m−1}, D_pool), which has so far remained general. So long as the acquisition proposal puts non-zero mass on all the unlabelled points, RPURE and RLURE are unbiased and consistent, as proven above. This is in contrast to the naive risk estimator R̃, for which the choice of proposal distribution affects the bias of the estimator. It is easy to satisfy the requirement for non-zero mass everywhere. Even prior work which selects points deterministically (e.g., Bayesian Active Learning by Disagreement (BALD) (Houlsby et al., 2011) or a geometric heuristic like coreset construction (Sener & Savarese, 2018)) can be easily adapted. Any scheme, like BALD, that selects points with an argmax can use a softmax to return a distribution instead. Alternatively, a distribution can be constructed analogous to epsilon-greedy exploration: with probability ε we pick uniformly, otherwise we pick the point returned by an arbitrary acquisition strategy. This adapts any deterministic active learning scheme to allow unbiased risk estimation.

It is also possible to use RLURE and RPURE with data collected using a proposal distribution that does not fully support the training data, though they will not fully correct the bias in this case. Namely, if we have a set of points, I, that are ignored by the proposal (i.e. that are assigned zero mass), we can still use RLURE and RPURE in the same way, but they both introduce the same bias:

    E[R^I_LURE] = E[R^I_PURE] = E[ E[RLURE | D_pool] − (1/N) Σ_{n∈I} L_n ] = r − E[ (1/N) Σ_{n∈I} L_n ].

Sometimes this bias will be small and may be acceptable if it enables a desired acquisition scheme, but in general one of the stochastic adaptations described above is likely to be preferable.
One can naturally also extend this result to cases where I varies at each iteration of the active learning (including deterministic acquisition strategies), for which we again have a non-zero bias. Though the choices of acquisition proposal and risk estimator are algorithmically detached, choosing a good proposal will still be critical to performance in practice. In the next section, we discuss how the proposal distribution can affect the variance of the estimators, and we will see that our approaches also offer the potential to reduce the variance relative to the naive biased estimator. Later, in §7, we turn to a third element of active learning schemes, generalization, and consider the bias introduced by optimization separately from the choice of risk estimator and proposal distribution.
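Both adaptations described above, a softmax over acquisition scores and epsilon-greedy mixing, are simple to implement. A minimal sketch of our own (the scores are made up; BALD or any other scorer could supply them):

```python
import numpy as np

def softmax_proposal(scores, temperature=1.0):
    """Turn acquisition scores over the remaining pool into a distribution."""
    z = (scores - scores.max()) / temperature   # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

def epsilon_greedy_proposal(scores, eps=0.1):
    """With probability eps pick uniformly; otherwise pick the argmax point.
    Every remaining point gets mass at least eps / len(scores) > 0."""
    n = len(scores)
    p = np.full(n, eps / n)
    p[np.argmax(scores)] += 1.0 - eps
    return p

scores = np.array([0.2, 1.5, 0.9, 0.1])   # e.g. BALD scores of unlabelled points
q_soft = softmax_proposal(scores)
q_eps = epsilon_greedy_proposal(scores)
print(q_soft.round(3), q_eps.round(3))
# Both sum to 1 and give non-zero mass everywhere, as required for
# RPURE and RLURE to remain unbiased.
```

Either proposal can then be renormalized over the not-yet-sampled indices at each acquisition step.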

4. UNDERSTANDING THE EFFECT OF RLURE AND RPURE ON VARIANCE

To show that the variance of our unbiased estimators can be lower than that of the biased R̃ when the acquisition function is well chosen, we first introduce a result for R̃ analogous to Theorems 1 and 3, the proof for which is given in Appendix B.8.

Theorem 6. Let μ_m := E[L_{i_m}] and μ_{m|i,D} := E[L_{i_m} | i_{1:m−1}, D_pool]. For R̃ (defined in (1)):

    E[R̃] = (1/M) Σ_{m=1}^{M} μ_m  (≠ r in general),

    Var[R̃] = Var_{D_pool}[ E[R̃ | D_pool] ]                                                  (term 1)
            + (1/M²) Σ_{m=1}^{M} E_{D_pool, i_{1:m−1}}[ Var[L_{i_m} | i_{1:m−1}, D_pool] ]     (term 2)
            + (1/M²) Σ_{m=1}^{M} E_{D_pool}[ Var[μ_{m|i,D} | D_pool] ]                         (term 3)
            + (2/M²) Σ_{m=1}^{M} E_{D_pool}[ Cov[L_{i_m}, Σ_{k<m} L_{i_k} | D_pool] ].         (term 4)    (7)

Examining this expression suggests that the variances of RPURE and, in particular, RLURE will often be lower than that of R̃, given a suitable proposal. Consider the terms of (7). Term 1 is analogous to the shared first term of (4) and (6), Var[L(y, f_θ(x))]/N. If R̃ were an unbiased estimator of R then these would be exactly equal, but the conditional bias introduced by R̃ also varies between pool datasets. In general, term 1 will typically be larger than, or similar to, its unbiased counterparts. In any case, recall that the first terms of (4) and (6) tend to be small contributors to the overall variance anyway, so term 1 provides negligible scope for R̃ to offer notable benefits over our estimators. We can also relate term 2 to terms in (4) and (6): it corresponds to the second half of (4), but with the expected conditional variances of the weighted losses w_m L_{i_m} replaced by those of the unweighted losses L_{i_m}. For effective proposals, w_m and L_{i_m} should be anticorrelated: high-loss points should have higher density and thus lower weight. This means the expected conditional variance of w_m L_{i_m} should be less than that of L_{i_m} for a well-designed acquisition strategy. Variation in the expected value of the weights with m can complicate this slightly for RPURE, but the correction factors applied for RLURE avoid this issue and ensure that the second half of (6) will be reliably smaller than term 2.

We have shown that the variance of RLURE is typically smaller than the sum of terms 1 and 2 under sensible proposals. Expression (7) has additional terms: term 3 is trivially always positive and so contributes to higher variance for R̃ (it comes from variation in the bias of the index sampling given D_pool); term 4 reflects correlations between the losses at different iterations, which have been eliminated by our estimators. This term is harder to quantify and can be positive or negative depending on the problem. For example, sampling points without replacement can cause negative correlation, while the proposal adaptation itself can cause positive correlations (finding one high-loss point can help find others). The former effect diminishes as N grows for fixed M, hinting that term 4 may tend to be positive for N ≫ M. Regardless, if term 4 were big enough to change which estimator has higher variance, the correlation between losses at different acquired points would imply high bias in R̃ anyway.

In contrast, we prove in Appendix B.9 that under an optimal proposal distribution both RPURE and RLURE become exact estimators of the empirical risk for any number of samples M, such that they will inevitably have lower variance than R̃ in this case. A similar result holds when we are estimating gradients of the loss, though note that the optimal proposal is different in the two cases.

Theorem 7. Given a non-negative loss, the optimal proposal distribution

    q*(i_m; i_{1:m−1}, D_pool) = L_{i_m} / Σ_{n∉i_{1:m−1}} L_n

yields estimators exactly equal to the pool risk, that is, RPURE = RLURE = R almost surely for all M.

In practice, it is impossible to sample using the optimal proposal distribution, since it requires knowing all the losses in advance. However, we make this point to prove that adopting our unbiased estimators is certainly capable of reducing variance relative to standard practice if appropriate acquisition strategies are used. It also provides interesting insights into what makes a good acquisition strategy from the perspective of the risk estimation itself.
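Theorem 7 is easy to verify numerically, since under q* each individual term a_m of RPURE already equals the pool risk. A quick sketch of our own (the loss values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 12, 6
L = rng.exponential(size=N) + 0.1     # non-negative losses over the pool
R = L.mean()                          # pool risk

remaining = list(range(N))
acquired_sum = 0.0                    # running sum of already-acquired losses
terms = []
for m in range(1, M + 1):
    q = L[remaining] / L[remaining].sum()   # optimal proposal q* of Theorem 7
    j = rng.choice(len(remaining), p=q)
    i = remaining.pop(j)
    w = 1.0 / (N * q[j])                    # importance weight w_m = 1/(N q)
    terms.append(w * L[i] + acquired_sum / N)   # a_m from (2)
    acquired_sum += L[i]

print(np.round(terms, 6), round(float(R), 6))
# Every a_m equals the pool risk R exactly, so RPURE (and hence RLURE)
# is exact for any M under the optimal proposal.
```

The cancellation is visible in the algebra: w_m L_{i_m} equals the mean of the remaining losses, and the running term (1/N) Σ_{t<m} L_{i_t} supplies exactly the acquired remainder.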

5. RELATED WORK

Pool-based active learning (Lewis & Gale, 1994) is useful in cases where input data are plentiful but labelling them is expensive (Atlas et al., 1990; Settles, 2010). The bias from selective sampling was noted by MacKay (1992), but dismissed from a Bayesian perspective based on the likelihood principle. Others have noted that the likelihood principle remains controversial (Rainforth, 2017), and in this case relying on it would assume a well-specified model. Moreover, from a discriminative learning perspective this bias is uncontentiously problematic. Lowell et al. (2019) observe that active learning algorithms and datasets become coupled by active sampling and that datasets often outlive algorithms. Despite the potential pitfalls, in deep learning this bias is generally ignored. As an informal survey, we examined the 15 most-cited peer-reviewed papers citing Gal et al. (2017b), which applied active learning to image data using neural networks. Of these, only two mentioned estimator bias, and neither addressed it; the rest either ignored or were unaware of the problem (see Appendix D). There have been some attempts to address active learning bias, but these have generally required fundamental changes to the active learning approach and only apply to particular setups. Beygelzimer et al. (2009), Chu et al. (2011), and Cortes et al. (2019) apply importance-sampling corrections (Sugiyama, 2006; Bach, 2006) to online active learning. Unlike pool-based active learning, this involves deciding whether or not to sample a new point as it arrives from an infinite stream. This makes importance-sampling estimators much easier to develop, but as Settles (2010) notes, "the pool-based scenario appears to be much more common among application papers." Ganti & Gray (2012) address unbiased active learning in a pool-based setting by sampling from the pool with replacement.
This effectively converts pool-based learning into a stationary online learning setting, although it overweights data that happen to be sampled early. Sampling with replacement is unwanted in active learning because it requires retraining the model on duplicate data, which is either impossible or wasteful depending on the details of the setting. Moreover, they only prove the consistency of their estimator under very strong assumptions (well-specified linear models with noiseless labels and a mean-squared-error loss). Imberg et al. (2020) consider optimal proposal distributions in an importance-sampling setting. Outside the context of active learning, Byrd & Lipton (2019) question the value of importance-weighting for deep learning, which aligns with our findings below.

6. EXPERIMENTS

Figure 1: Active learning deliberately over-samples unusual points (red ×'s) which no longer match the population (black dots). Common practice uses the biased unweighted estimator R̃, which puts too much emphasis on unusual points. Our unbiased estimators RPURE and RLURE fix this, learning a function using only D_train nearly equal to the ideal one you would get if you had labels for the whole of D_pool, despite only using a few points.

We first verify that RLURE and RPURE remove the bias introduced by active learning and examine the variance of the estimators. We do this by taking a fixed function whose parameters are independent of D_train and estimating the risk using actively sampled points. We note that this is equivalent to the problem of estimating the risk of an already trained model in a sample-efficient way given unlabelled test data. We consider two settings: an inflexible model (linear regression) on toy but non-linear data, and an overparameterized model (a convolutional Bayesian neural network) on a modified version of MNIST with unbalanced classes and noisy labels.

Linear regression.
For linear functions, removing active learning bias (ALB), i.e., the statistical bias introduced by active learning, is critical. We illustrate this in Figure 1. Actively sampled points overrepresent unusual parts of the distribution, so a model learned using the unweighted D_train differs from the ideal function fit to the whole of D_pool. Using our corrective weights more closely approximates the ideal line. The full details of the population distribution and the geometric acquisition proposal distribution are in Appendix C.1, where we also show results using an alternative epsilon-greedy proposal. We inspect the ALB in Figure 2a by comparing the estimated risk (with squared-error loss and a fixed function) to the true population risk r. While M < N, the unweighted R̃ is biased (in practice we never have M = N, as then active learning is unnecessary). RPURE and RLURE are unbiased throughout, but they have high variance because the proposal is rather poor. Shading represents the standard deviation of the bias over 1000 different acquisition trajectories.

Bayesian Neural Network. We actively classify MNIST and FashionMNIST images using a convolutional Bayesian neural network (BNN) with roughly 80,000 parameters. In Figures 2b and 2c we show that RPURE and RLURE remove the ALB. Here the variance of RPURE and RLURE is lower than or similar to that of the biased estimator. This is because the acquisition proposal distribution, a stochastic relaxation of the Bayesian Active Learning by Disagreement (BALD) objective (Houlsby et al., 2011), is effective (c.f. §4). A full description of the dataset and procedure is provided in Appendix C.2. Our modified MNIST dataset is unbalanced and has noisy labels, which makes the bias more distorting.
Overall, Figure 2 shows that our estimators remove the bias introduced by active learning, as expected, and can do so with reduced variance given an acquisition proposal distribution that puts high probability mass on more informative/surprising high-expected-loss points.

Next, we examine the overall effect on downstream performance of using the unbiased estimators to learn a model. Intuitively, removing bias in training while also reducing the variance ought to improve the downstream task objective: test loss and accuracy. To investigate this, we train models using R̃, RLURE, and RPURE with actively sampled data and measure the population risk of each model. For linear regression (Figure 3a), the new estimators improve the test loss: even with small numbers of acquired points we have nearly optimal test loss (estimated with many samples). However, for the BNN, there is a small but significant negative impact on the full test dataset loss from training with RLURE or RPURE (Figure 3b), and a slightly larger negative impact on test accuracy (Figure 3c). That is, we get a better model by training using a biased estimator with higher variance! To validate this further, we consider models trained instead on FashionMNIST (Fig. 3d), on MNIST but with Monte Carlo dropout (MCDO) (Gal & Ghahramani, 2015) (Fig. 3e), and on a balanced version of the MNIST data (Fig. 3f). In all cases we find similar patterns, suggesting the effects are not overly sensitive to the setting. Further experiments and ablations can be found in Appendix C.2.

7. ACTIVE LEARNING BIAS IN THE CONTEXT OF OVERALL BIAS

In order to explain the finding that RLURE hurts training for the BNN, we return to the bias introduced by overfitting, allowing us to examine the effect of removing statistical bias in the context of overall bias. Namely, we need to consider the fact that training would induce an overfitting bias (OFB) even if we had not used active learning. If we optimize parameters θ according to R, then E[R(θ*)] ≠ r, because the optimized parameters θ* tend to explain training data better than unseen data. Using RLURE, which removes statistical bias, we can isolate OFB in an active learning setting. More formally, supposing we are optimizing any of the discussed risk estimators (which we write using R̂(·) as a placeholder to stand for any of them), we define the OFB as:

    B_OFB(R̂(·)) = r − RLURE(θ*)  where  θ* = argmin_θ R̂(·).

B_OFB(R̂(·)) depends on the details of the optimization algorithm and the dataset. Understanding it fully would mean understanding generalization in machine learning and is outside our scope. We can still gain insight into the interaction of active learning bias (ALB) and OFB. Consider the possible relationships between the magnitudes of ALB and OFB:

[ALB >> OFB] Removing ALB reduces the overall bias. This is most likely to occur when f_θ is not very expressive, such that there is little chance of overfitting.

[ALB << OFB] Removing ALB is irrelevant, as the model has massively overfit regardless.

[ALB ≈ OFB] Here the sign is critical. If ALB and OFB have opposite signs and similar scale, they will tend to cancel each other out. Indeed, they usually have opposite signs: B_OFB is usually positive, because θ* fits the training data better than unseen data, while ALB is generally negative, because we actively choose unusual/surprising/informative points which are harder to fit than typical points. Therefore, when significant overfitting is possible, unless ALB is also large, removing ALB will have little effect and can even be harmful.

Figure 4: (a) Linear regression: B_OFB is small compared to ALB (c.f. Figure 2a). Shading IQR, 1000 trajectories. (b) BNN: B_OFB is of similar scale and opposite sign to ALB (c.f. Figure 2b). (c) BNN on FashionMNIST: OFB is somewhat larger than with MNIST, particularly for R̃ (i.e. our approaches reduce overfitting), and dominates the active learning bias (c.f. Figure 2c). Shading ±1 standard error, 150 trajectories.

This hypothesis would explain the observations in §6 if we were to show that B_OFB was small for linear regression but had a similar magnitude and opposite sign to ALB for the BNN. This is exactly what we show in Figure 4. Specifically, we see that for linear regression, the B_OFB for models trained with R̃, RPURE, and RLURE are all small (Figure 4a) when contrasted with the ALB shown in Figure 2a. Here ALB >> OFB; removing ALB matters. For the BNNs we instead see that the OFB has opposite sign to the ALB, but is either similar in scale for MNIST (Figures 2b and 4b), or much larger than the ALB for FashionMNIST (Figures 4c and 2c). The two sources of bias thus (partially) cancel out. Essentially, using active learning can be treated (quite instrumentally) as an ad hoc form of regularization. This explains why removing ALB can hurt active learning with neural networks.

8. CONCLUSIONS

Active learning is a powerful tool but raises potential problems with statistical bias. We offer a corrective weighting which removes that bias with negligible compute/memory costs and few requirements: it suits standard pool-based active learning without replacement. It requires a proposal distribution with non-zero mass on all unlabelled points, but existing acquisition functions can be easily transformed into such sampling distributions. Indeed, estimates of scores like mutual information are so noisy that many applications already have an implicit proposal distribution. We show that removing active learning bias (ALB) can be helpful in some settings, like linear regression, where the model is not sufficiently complex to perfectly match the data, such that the exact loss function and input data distribution are essential in discriminating between different possible (imperfect) model fits. We also find that removing ALB can be counter-productive for overparameterized models like neural networks, even if its removal also reduces the variance of the estimators, because here the ALB can help cancel out the bias originating from overfitting. This leads to the interesting conclusion that active learning can be helpful not only as a mechanism to reduce variance, as it was originally designed, but also because it introduces a bias that can be actively helpful by regularizing the model. This helps explain why active learning with neural networks has shown success despite using a biased risk estimator.

We propose the following rules of thumb for deciding when to embrace or correct the bias, noting that we should always prefer RLURE to RPURE. First, the more closely the acquisition proposal distribution approaches the optimal distribution (as per Theorem 7), the better RLURE will perform relative to the naive estimator R̃. Second, the less overfitting we expect, the more likely it is that RLURE will be useful, as less overfitting reduces the chance that the ALB will actually help.
Third, RLURE will tend to have more of an effect for highly imbalanced datasets, as the biased estimator will over-represent actively selected but unlikely datapoints. Fourth, if the training data does not accurately represent the test data, using RLURE will likely be less important as the ALB will tend to be dwarfed by bias from the distribution shift. Fifth, at test-time, where optimization and overfitting bias are no-longer an issue, there is little cost to using RLURE to evaluate a model and it will usually be beneficial. This final application, of active learning for model evaluation, is an interesting new research direction that is opened up by our estimators.

A OVERVIEW OF ACTIVE LEARNING

Active learning selectively picks datapoints for which to acquire labels, with the aim of more sample-efficient learning. For an excellent overview of the general active learning problem setting we refer the reader to Settles (2010). Since that review was written, a number of significant advances have further developed active learning. Houlsby et al. (2011) develop an efficient way to estimate the mutual information between the model parameters and the output distribution, which can be used as the Bayesian Active Learning by Disagreement (BALD) score. Active learning has been applied to deep learning, especially in vision (Gal et al., 2017b; Wang et al., 2017). For neural networks specifically, empirical work has suggested that simple geometric core-set-style approaches can outperform uncertainty-based acquisition functions (Sener & Savarese, 2018). Much recent work in active learning has focused on speeding up acquisition from a computational perspective (Coleman et al., 2020) and on allowing batch acquisition in order to parallelize labelling (Kirsch et al., 2019; Ash et al., 2020). Some work has also focused on applying active learning to specific settings with particular constraints (Krishnamurthy et al., 2017; Yan et al., 2018; Sundin et al., 2019; Behpour et al., 2019; Shi & Yu, 2019; Hu et al., 2019).

B PROOFS

B.1 PROOF OF LEMMA 1

Lemma 1. The individual terms a_m of RPURE are unbiased estimators of the risk: E[a_m] = r.

Proof. We begin by applying the tower property of expectations:

E[a_m] = E[ w_m L_{i_m} + (1/N) Σ_{t=1}^{m-1} L_{i_t} ]
       = E_{D_pool, i_{1:m-1}}[ E_{i_m}[ w_m L_{i_m} + (1/N) Σ_{t=1}^{m-1} L_{i_t} | D_pool, i_{1:m-1} ] ].

By further noting that E_{i_m}[ w_m L_{i_m} | D_pool, i_{1:m-1} ] can be written out analytically as a sum over all the possible values of i_m, while (1/N) Σ_{t=1}^{m-1} L_{i_t} is deterministic given D_pool and i_{1:m-1}, we have (using w_m = 1/(N q(i_m; i_{1:m-1}, D_pool))):

E[a_m] = E_{D_pool, i_{1:m-1}}[ Σ_{n ∉ i_{1:m-1}} q(i_m = n; i_{1:m-1}, D_pool) L_n / (N q(i_m = n; i_{1:m-1}, D_pool)) + (1/N) Σ_{t=1}^{m-1} L_{i_t} ]
       = E_{D_pool, i_{1:m-1}}[ (1/N) Σ_{n=1}^{N} L_n ].

But (1/N) Σ_{n=1}^{N} L_n is now independent of the indices which have been sampled:

E[a_m] = E_{D_pool}[ (1/N) Σ_{n=1}^{N} L_n ] = E_{D_pool}[R] = r.
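The inner step of the proof, E[a_m | D_pool] = R for any valid proposal, can be checked numerically. The sketch below is our own (the shifted-loss proposal and the name sample_am are illustrative) and estimates E[a_M] for one fixed pool by simulation:

```python
import numpy as np

rng = np.random.default_rng(0)
N, M = 20, 5
L = rng.exponential(size=N)   # per-point losses of one fixed pool
R = L.mean()                  # pool risk

def sample_am(m_target):
    """Draw one realization of a_{m_target} under a non-uniform proposal."""
    idx, hist = list(range(N)), []
    for m in range(1, m_target + 1):
        s = L[idx] + 0.1                      # any strictly positive proposal works
        q = s / s.sum()
        j = rng.choice(len(idx), p=q)
        if m == m_target:
            w = 1.0 / (N * q[j])              # importance weight w_m = 1/(N q)
            return w * L[idx[j]] + sum(L[t] for t in hist) / N
        hist.append(idx[j])
        idx.pop(j)

est = np.mean([sample_am(M) for _ in range(50_000)])
assert abs(est - R) < 0.02    # matches E[a_m | D_pool] = R
```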

B.2 PROOF OF THE UNBIASEDNESS AND VARIANCE FOR RPURE : THEOREM 1

Theorem 1. RPURE as defined above has the properties:

E[RPURE] = r,
Var[RPURE] = Var[L(y, f_θ(x))]/N + (1/M²) Σ_{m=1}^{M} E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ].

Proof. Having established Lemma 1, unbiasedness follows quickly from the linearity of expectations:

E[RPURE] = E[ (1/M) Σ_{m=1}^{M} a_m ] = (1/M) Σ_{m=1}^{M} E[a_m] = r.

For the variance we instead have (starting with the definition of RPURE):

Var[RPURE] = E[ ((1/M) Σ_{m=1}^{M} a_m)² ] - r²,

which by the tower property of expectations

= E[ E[ ((1/M) Σ_{m=1}^{M} a_m)² | D_pool ] ] - r²,

and from the definition of variance, using E[(1/M) Σ_m a_m | D_pool] = R,

= E[ Var[ (1/M) Σ_{m=1}^{M} a_m | D_pool ] ] + E[R²] - r²
= E[ Var[ (1/M) Σ_{m=1}^{M} a_m | D_pool ] ] + Var[R],

which, using the fact that R is a standard Monte Carlo estimator,

= E[ Var[ (1/M) Σ_{m=1}^{M} a_m | D_pool ] ] + Var[L(y, f_θ(x))]/N,    (8)

where x, y ∼ p_data. Now considering the first term we have

Var[ (1/M) Σ_{m=1}^{M} a_m | D_pool ] = E[ ((1/M) Σ_{m=1}^{M} (a_m - R))² | D_pool ]
= (1/M²) Σ_{m=1}^{M} Σ_{k=1}^{M} E[ (a_m - R)(a_k - R) | D_pool ].    (9)

We attack this term by first considering the terms for which m ≠ k and showing that these yield E[(a_m - R)(a_k - R) | D_pool] = 0, returning to the m = k terms later. We assume, without loss of generality, that k < m, noting that by symmetry the same set of arguments can be applied when m < k. Substituting in the definition of a_m from equation (2):

E[(a_m - R)(a_k - R) | D_pool] = E[ (w_m L_{i_m} + (1/N) Σ_{t=1}^{m-1} L_{i_t} - R)(w_k L_{i_k} + (1/N) Σ_{s=1}^{k-1} L_{i_s} - R) | D_pool ].

We introduce the notation Rrem_m = R - (1/N) Σ_{t=1}^{m-1} L_{i_t} to describe the remainder of the empirical risk not ascribable to the first m - 1 sampled datapoints. Then, by multiplying out the terms, we have:

E[(a_m - R)(a_k - R) | D_pool] = E[w_m L_{i_m} w_k L_{i_k} | D_pool]    (a)
 - E[w_m L_{i_m} Rrem_k | D_pool]    (b)
 - E[w_k L_{i_k} Rrem_m | D_pool]    (c)
 + E[Rrem_k Rrem_m | D_pool].    (d)

Now by the tower property:

(a) = E[ E[w_m L_{i_m} w_k L_{i_k} | D_pool, i_{1:m-1}] | D_pool ],

and noting that because k < m, w_k L_{i_k} is deterministic given D_pool and i_{1:m-1}, while E[w_m L_{i_m} | D_pool, i_{1:m-1}] = Rrem_m:

(a) = E[ w_k L_{i_k} E[w_m L_{i_m} | D_pool, i_{1:m-1}] | D_pool ] = E[ w_k L_{i_k} Rrem_m | D_pool ],

which thus cancels with (c). The terms (b) and (d) cancel similarly:

(b) = E[ E[w_m L_{i_m} Rrem_k | D_pool, i_{1:m-1}] | D_pool ] = E[ Rrem_k E[w_m L_{i_m} | D_pool, i_{1:m-1}] | D_pool ] = E[ Rrem_k Rrem_m | D_pool ] = (d).

Putting this together, we have E[(a_m - R)(a_k - R) | D_pool] = 0 for all k ≠ m. Considering now the m = k terms, we have

E[ (a_m - R)² | D_pool ] = E_{i_{1:m-1}}[ E_{i_m}[ (a_m - R)² | i_{1:m-1}, D_pool ] | D_pool ]
= E_{i_{1:m-1}}[ Var[a_m | i_{1:m-1}, D_pool] | D_pool ]
= E_{i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] | D_pool ].

Finally, substituting everything back into (9) and then (8), and applying the tower property, gives

Var[RPURE] = Var[L(y, f_θ(x))]/N + (1/M²) Σ_{m=1}^{M} E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ],

and we are done.

B.3 PROOF OF THE CONSISTENCY OF RPURE : THEOREM 2

Theorem 2. Let α = N/M and assume that α > 1. If E[L(y, f_θ(x))²] < ∞ and ∃β > 0 such that min_{n ∈ {1:N} \ i_{1:m-1}} q(i_m = n; i_{1:m-1}, D_pool) ≥ β/N for all N ∈ Z+ and m ≤ N, then RPURE converges in its L2 norm to r as M → ∞, i.e.,

lim_{M→∞} E[ (RPURE - r)² ] = 0.

Proof. Theorem 1 showed that RPURE is an unbiased estimator, so we first note that its MSE is simply its variance, which we found in (4). Substituting N = αM:

E[ (RPURE - r)² ] = Var[RPURE] = Var[L(y, f_θ(x))]/(αM) + (1/M²) Σ_{m=1}^{M} E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ].

The first term tends to zero as M → ∞, as our standard assumptions guarantee that (1/α) Var[L(y, f_θ(x))] < ∞. For the second term we note that our assumptions about q guarantee that w_m ≤ 1/β, and thus:

Var[w_m L_{i_m} | i_{1:m-1}, D_pool] = E[ w_m² L_{i_m}² | i_{1:m-1}, D_pool ] - (R - (1/N) Σ_{t=1}^{m-1} L_{i_t})²
≤ (1/β²) E[ L_{i_m}² | i_{1:m-1}, D_pool ] - (R - (1/N) Σ_{t=1}^{m-1} L_{i_t})²
< ∞  ∀ i_{1:m-1}, D_pool,

as our assumptions guarantee that 1/β² < ∞, we have E_{i_m}[L_{i_m}²] < ∞, and the empirical risk and losses are finite. Given that each Var[w_m L_{i_m} | i_{1:m-1}, D_pool] is finite, it follows that

s² := (1/M) Σ_{m=1}^{M} E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ] < ∞,

and we thus have

lim_{M→∞} E[ (RPURE - r)² ] = lim_{M→∞} ( Var[L(y, f_θ(x))]/(αM) + s²/M ) = 0,

as desired.
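The consistency result can be illustrated by simulation. This sketch is ours (a uniform proposal, which satisfies the assumption with β = 1, and Exp(1) losses whose population risk is r = 1 are illustrative choices); with N = αM at α = 4, the MSE of RPURE shrinks roughly like 1/M:

```python
import numpy as np

rng = np.random.default_rng(4)

def pure_mse(M, alpha=4, reps=4000):
    """Monte Carlo MSE of RPURE against the population risk r = 1 for Exp(1) losses."""
    N = alpha * M
    errs = []
    for _ in range(reps):
        L = rng.exponential(size=N)        # a fresh pool drawn from the population
        idx, hist, acc = list(range(N)), [], 0.0
        for m in range(1, M + 1):
            j = rng.integers(len(idx))     # uniform proposal; q >= beta/N with beta = 1
            w = (N - m + 1) / N            # w_m = 1/(N q) for uniform sampling w/o replacement
            acc += w * L[idx[j]] + sum(hist) / N
            hist.append(L[idx[j]])
            idx.pop(j)
        errs.append((acc / M - 1.0) ** 2)
    return float(np.mean(errs))

assert pure_mse(40) < pure_mse(5) / 2      # MSE shrinks roughly like 1/M
```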

B.4 DERIVATION OF THE CONSTANTS OF RLURE

We note from before that, because of the unbiasedness of the a_m,

E[ C Σ_{m=1}^{M} c_m a_m ] = r, where C = (Σ_{m=1}^{M} c_m)^{-1}.

To construct our improved estimator RLURE, we now need to find the right constants c_m, which in turn lead to overall weights v_m (as per (5)) such that E[v_m] = 1. We start by substituting in the definition of a_m:

RLURE := C Σ_{m=1}^{M} v_m L_{i_m} := C Σ_{m=1}^{M} c_m a_m = C Σ_{m=1}^{M} ( c_m w_m L_{i_m} + (c_m/N) Σ_{t=1}^{m-1} L_{i_t} ),

and then redistributing the L_{i_t} from later terms to the index where they match m:

v_m = c_m w_m + (1/N) Σ_{t=m+1}^{M} c_t.

Note that though RLURE remains an unbiased estimator of the risk, each individual term v_m L_{i_m} is not. Now we require E[v_m] = 1 ∀m ∈ {1, ..., M}. Remembering that w_m = 1/(N q(i_m; i_{1:m-1}, D_pool)):

E[v_m] = (c_m/N) E[ 1/q(i_m; i_{1:m-1}, D_pool) ] + (1/N) Σ_{t=m+1}^{M} c_t
       = (c_m/N) Σ_{n ∉ i_{1:m-1}} q(i_m = n; i_{1:m-1}, D_pool)/q(i_m = n; i_{1:m-1}, D_pool) + (1/N) Σ_{t=m+1}^{M} c_t
       = c_m (N - m + 1)/N + (1/N) Σ_{t=m+1}^{M} c_t.

Imposing that each E[v_m] = 1, we now have M equations for the M unknowns c_1, ..., c_M, such that we can find the required values of c_m by solving the set of simultaneous equations:

(N - m + 1) c_m/N + (1/N) Σ_{t=m+1}^{M} c_t = 1  ∀m ∈ {1, ..., M}.    (10)

We do this by induction. First, consider E[v_m] - E[v_{m+1}] = 0, which can be rewritten as

(N - m + 1) c_m/N - (N - m) c_{m+1}/N + c_{m+1}/N = 0,

and thus

c_m = ((N - m - 1)/(N - m + 1)) c_{m+1}.    (11)

Published as a conference paper at ICLR 2021

Solving this recursion yields c_m = N(N - M)/((N - m)(N - m + 1)) (the full induction is given later in the appendix). To finish our definition, we simply need to derive C:

C = ( Σ_{m=1}^{M} c_m )^{-1} = ( N(N - M) Σ_{m=1}^{M} 1/((N - m)(N - m + 1)) )^{-1}
  = ( N(N - M) Σ_{m=1}^{M} [ 1/(N - m) - 1/(N - m + 1) ] )^{-1},

where we now have a telescopic sum, so

C = ( N(N - M) [ 1/(N - M) - 1/N ] )^{-1} = 1/M.

We thus see that the c_m always sum to M, giving the quoted form for RLURE in the main paper.

B.5 PROOF OF UNBIASEDNESS AND VARIANCE FOR RLURE: THEOREM 3

Theorem 3. RLURE as defined above has the following properties:

E[RLURE] = r,
Var[RLURE] = Var[L(y, f_θ(x))]/N + (1/M²) Σ_{m=1}^{M} c_m² E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ].    (6)

Proof.
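The algebra above is easy to verify exactly with rational arithmetic. The following check is ours (the particular N and M are illustrative) and confirms the recursion (11), the simultaneous equations (10), and that the c_m sum to M:

```python
from fractions import Fraction

N, M = 12, 7
c = [Fraction(N * (N - M), (N - m) * (N - m + 1)) for m in range(1, M + 1)]

# recursion (11): c_m = (N-m-1)/(N-m+1) * c_{m+1}
for m in range(1, M):
    assert c[m - 1] == Fraction(N - m - 1, N - m + 1) * c[m]

# simultaneous equations (10): (N-m+1) c_m / N + (1/N) sum_{t>m} c_t = 1
for m in range(1, M + 1):
    assert Fraction(N - m + 1, N) * c[m - 1] + Fraction(1, N) * sum(c[m:]) == 1

# the normalizer telescopes: sum_m c_m = M, i.e. C = 1/M
assert sum(c) == M
```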
RLURE is, by construction, a linear combination of the terms a_m. By Lemma 1 each a_m is an unbiased estimator of r, so by the linearity of expectation (and the fact that the c_m sum to M), E[RLURE] = r. As in Theorem 1, the variance requires a degree of care because the a_m are not independent. Noting that the expectation does not change through the weighting, i.e. E[(1/M) Σ_m c_m a_m | D_pool] = R, we analogously have

Var[RLURE] = E[ Var[ (1/M) Σ_{m=1}^{M} c_m a_m | D_pool ] ] + Var[L(y, f_θ(x))]/N.

Similarly, we also have

Var[ (1/M) Σ_{m=1}^{M} c_m a_m | D_pool ] = E[ ((1/M) Σ_{m=1}^{M} c_m a_m)² | D_pool ] - R²
= (1/M²) Σ_{m=1}^{M} Σ_{k=1}^{M} c_m c_k E[ a_m a_k | D_pool ] - R²
= (1/M²) Σ_{m=1}^{M} Σ_{k=1}^{M} c_m c_k E[ (a_m - R)(a_k - R) | D_pool ].

Using the result from the proof of Theorem 1 that

E[(a_m - R)(a_k - R) | D_pool] = E_{i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] | D_pool ] if m = k, and 0 otherwise,

substituting back and applying the tower property of expectations yields (6).

B.6 PROOF OF THE VARIANCE COMPARISON: THEOREM 4

Recall from Theorems 1 and 3 that the variances of the RPURE and RLURE estimators are

Var[RPURE] = Var[L(y, f_θ(x))]/N + (1/M²) Σ_{m=1}^{M} E_m,    (12)
Var[RLURE] = Var[L(y, f_θ(x))]/N + (1/M²) Σ_{m=1}^{M} c_m² E_m,    (13)

where we have used the shorthand E_m = E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ]. Recall also that

c_m² = N²(N - M)² / ((N - m)²(N - m + 1)²).

Though the potential to use pathologically bad proposals means that it is not possible to show that Var[RLURE] ≤ Var[RPURE] universally holds, we can show this result under a relatively weak assumption that ensures our proposal is "sufficiently good." To formalize this assumption, we first define

F_m := E_{D_pool, i_{1:m-1}}[ Var[ (w_m / E[w_m | i_{1:m-1}, D_pool]) L_{i_m} | i_{1:m-1}, D_pool ] ] = ((N - m + 1)/N)^{-2} E_m

as the weight-normalized expected variance, where the second form comes from the fact that E[w_m | i_{1:m-1}, D_pool] = (N - m + 1)/N. Our assumption is now that

F_m ≥ F_{M-m+1}  ∀m : 1 ≤ m ≤ M/2.    (14)

Note that a sufficient, but not necessary, condition for this to hold is that the F_m do not increase with m, i.e. F_m ≥ F_j ∀(m, j) : 1 ≤ m ≤ j ≤ M.
Intuitively, this is equivalent to saying that the conditional variances of our normalized weighted losses should not increase as we acquire more points. It is, for example, satisfied by a uniform sampling acquisition strategy (for which all the F_m are equal). More generally, it should hold in practice for sensible acquisition strategies because a) our proposal should improve on average as we acquire more labels, leading to lower average variance; and b) higher-loss points will generally be acquired earlier, so the scaling will typically decrease with m. In particular, note that E[w_m L_{i_m} | i_{1:m-1}, D_pool] < R and is monotonically decreasing with m because it omits the already-sampled losses (which is why these are added back in when calculating a_m). This assumption is actually stronger than necessary: in practice the result will hold even if F_m increases with m, provided the rate of increase is sufficiently small. However, the assumption as stated already holds for a broad range of sensible proposals, and fully encapsulating the minimum requirements on the F_m is beyond the scope of this paper. We are now ready to formally state and prove our result.

Theorem 4. Under assumption (14), Var[RPURE] ≥ Var[RLURE], with the inequality strict whenever E_1 > 0 and M > 1.

For the proof it is convenient to first prove the following lemma, which we will invoke multiple times.

Lemma 2. If a, b, M, N ∈ Z+ are positive integers such that M < N and a + b ≤ M, then

(N - a)²/N² ≥ (N - M)²/(N - b)².

Proof.

(N - a)²/N² - (N - M)²/(N - b)² = [ (N - a)²(N - b)² - N²(N - M)² ] / [ N²(N - b)² ]
= [ N(2N - M - a - b) + ab ][ N(M - a - b) + ab ] / [ N²(N - b)² ] ≥ 0,

where we have factorized the numerator as a difference of squares; the result follows as 2N ≥ M + a + b and M ≥ a + b, so all bracketed terms are non-negative.

The main argument (whose derivation appears in the fragment at the end of the appendix) subtracts (13) from (12) and groups the resulting terms into pairs S_m, showing that each S_m ≥ 0 when M < N. To cover the case where M = N, we simply note that here

S_m = ((N - m + 1)²/N²) F_m + (m²/N²) F_{M-m+1},

where both terms are clearly non-negative. We have now shown that S_m ≥ 0 in all possible scenarios given our assumption on the F_m, and so we can conclude that Var[RPURE] ≥ Var[RLURE].
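Lemma 2 can also be checked exhaustively for small integers, since the inequality is equivalent to (N - a)²(N - b)² ≥ N²(N - M)². A brute-force sweep (our own sanity check, not part of the paper):

```python
# exhaustive check of Lemma 2 over all valid (N, M, a, b) with N < 30
for N in range(2, 30):
    for M in range(1, N):                  # M < N
        for a in range(1, M):
            for b in range(1, M - a + 1):  # a + b <= M
                assert (N - a) ** 2 * (N - b) ** 2 >= N ** 2 * (N - M) ** 2
```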
Finally, we need to show that the inequality is strict if E_1 > 0 and M > 1. For this we first note that E_1 > 0 ensures F_1 > 0, and then consider S_1 as follows:

S_1 = ( 1 - (N - M)²/(N - 1)² ) F_1 + ( (N - M + 1)²/N² - 1 ) F_M,

and as the second term is clearly non-positive and F_1 ≥ F_M,

S_1 ≥ ( (N - M + 1)²/N² - (N - M)²/(N - 1)² ) F_1 = ( (M - 1)(2N² - 2MN + M - 1) / (N²(N - 1)²) ) F_1 > 0,

as M > 1 and N ≥ M ensure that each bracketed term is strictly positive. Now as S_1 > 0 and S_m ≥ 0 for all other m, we can conclude that the sum of the S_m is strictly positive, and thus that the inequality is strict.

B.7 PROOF OF THE CONSISTENCY OF RLURE: THEOREM 5

Theorem 5. Under the same assumptions as Theorem 2:

lim_{M→∞} E[ (RLURE - r)² ] = 0.

Proof. As before, since RLURE is unbiased, the MSE is simply the variance, so:

E[ (RLURE - r)² ] = Var[RLURE] = Var[L(y, f_θ(x))]/N + (1/M²) Σ_{m=1}^{M} c_m² E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ].

Taking N = αM, we already showed in the proof of Theorem 2 that the first of these terms tends to zero as M → ∞. We also showed that E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ] is finite given our assumptions. As such, there must be some finite constant d such that E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ] < d, and thus

(1/M²) Σ_{m=1}^{M} c_m² E_{D_pool, i_{1:m-1}}[ Var[w_m L_{i_m} | i_{1:m-1}, D_pool] ] < (d/M²) Σ_{m=1}^{M} ( N(N - M) / ((N - m)(N - m + 1)) )²,

and, by substituting N = αM,

= (d/M²) Σ_{m=1}^{M} ( αM(αM - M) / ((αM - m)(αM - m + 1)) )²
< (d/M²) Σ_{m=1}^{M} ( αM / (αM - M + 1) )² = d α² M / ((α - 1)M + 1)² → 0 as M → ∞,

giving the desired result.

B.8 PROOF OF THE BIAS AND VARIANCE OF THE UNWEIGHTED ESTIMATOR: THEOREM 6

Theorem 6. Define µ_{m|i,D} := E[L_{i_m} | i_{1:m-1}, D_pool]. For the unweighted estimator R̂ (defined in (1)):

E[R̂] = (1/M) Σ_{m=1}^{M} E[µ_{m|i,D}]  (≠ r in general),

Var[R̂] = Var_{D_pool}[ E[R̂ | D_pool] ]    (term 1)
 + (1/M²) Σ_{m=1}^{M} E_{D_pool, i_{1:m-1}}[ Var[L_{i_m} | i_{1:m-1}, D_pool] ]    (term 2)
 + (1/M²) Σ_{m=1}^{M} E_{D_pool}[ Var[µ_{m|i,D} | D_pool] ]    (term 3)
 + (2/M²) Σ_{m=1}^{M} E_{D_pool}[ Cov[L_{i_m}, Σ_{k<m} L_{i_k} | D_pool] ].    (term 4)

Proof. The result for the bias follows immediately from the definition of µ_{m|i,D}, the tower property, and the linearity of expectations.
For the variance, we have

Var[R̂] = E[R̂²] - E[R̂]²
= E_{D_pool}[ E[R̂² | D_pool] ] - E[R̂]²
= E_{D_pool}[ Var[R̂ | D_pool] + E[R̂ | D_pool]² ] - E[R̂]²
= Var_{D_pool}[ E[R̂ | D_pool] ] + E_{D_pool}[ Var[R̂ | D_pool] ],    (16)

where the first term is term 1 from the result; the second term is expanded in the continuation below.

For RPURE (continuing the proof of Theorem 7 from B.9), substituting the optimal proposal q*(i_m = n; i_{1:m-1}, D_pool) = L_n / Σ_{n' ∉ i_{1:m-1}} L_{n'} into the a_m form of the estimator gives

w_m L_{i_m} = L_{i_m} / (N q*(i_m; i_{1:m-1}, D_pool)) = (1/N) Σ_{n ∉ i_{1:m-1}} L_n,

so that a_m = (1/N) Σ_{n ∉ i_{1:m-1}} L_n + (1/N) Σ_{t=1}^{m-1} L_{i_t} = (1/N) Σ_{n=1}^{N} L_n = R. The proof proceeds identically for the case of ∇L_{i_m} because the gradient passes through the summations. For RLURE, we similarly substitute the optimal proposal into the definition of the estimator:

RLURE = (1/M) Σ_{m=1}^{M} v_m L_{i_m}
= (1/M) Σ_{m=1}^{M} [ L_{i_m} + ((N - M)/(N - m)) ( L_{i_m} / ((N - m + 1) q*(i_m; i_{1:m-1}, f_{θ_{m-1}}, D_pool)) - L_{i_m} ) ]
= (1/M) Σ_{m=1}^{M} [ L_{i_m} + ((N - M)/(N - m)) ( Σ_{t=m}^{N} L_{i_t} / (N - m + 1) - L_{i_m} ) ].

Pulling out the loss,

= (1/M) Σ_{m=1}^{M} L_{i_m} [ 1 - (N - M)/(N - m) + (N - M) Σ_{k=1}^{m} 1/((N - k)(N - k + 1)) ]
 + (1/M) Σ_{t=M+1}^{N} L_{i_t} (N - M) Σ_{k=1}^{M} 1/((N - k)(N - k + 1)),

where the telescoping sums give Σ_{k=1}^{m} 1/((N - k)(N - k + 1)) = m/(N(N - m)) and Σ_{k=1}^{M} 1/((N - k)(N - k + 1)) = M/(N(N - M)). Simplifying and rearranging,

= (1/M) Σ_{m=1}^{M} L_{i_m} [ 1 - ((N - M)/(N - m)) (1 - m/N) ] + (1/N) Σ_{t=M+1}^{N} L_{i_t}
= (1/M) Σ_{m=1}^{M} L_{i_m} [ 1 - ((N - M)/(N - m)) ((N - m)/N) ] + (1/N) Σ_{t=M+1}^{N} L_{i_t}
= (1/N) Σ_{m=1}^{M} L_{i_m} + (1/N) Σ_{t=M+1}^{N} L_{i_t} = (1/N) Σ_{n=1}^{N} L_n = R,

as required.

Remark 2. The optimal proposal for estimating the gradient of the pool risk, ∇_φ R, with respect to some scalar φ is instead

q**(i_m; i_{1:m-1}, D_pool) = |∇_φ L_{i_m}| / Σ_{n ∉ i_{1:m-1}} |∇_φ L_n|.

Note that when taking gradients with respect to multiple variables, the optimal proposal will be different for each.

Table 2: Existing applications of deep active learning rarely acknowledge the bias introduced by actively sampling points and do not, to the best of our knowledge, try to correct it. (Of the surveyed papers, Zhang & Lee (2019) and Kellenberger et al. (2019) discuss bias in D_pool.)
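A striking consequence of this derivation is that, under the optimal proposal, RLURE equals the pool risk R on every single realization, i.e. with zero variance. This is easy to confirm by simulation; the sketch below is ours, with arbitrary illustrative loss values and seeds:

```python
import numpy as np

rng = np.random.default_rng(1)
N, M = 15, 6
L = rng.exponential(size=N) + 0.05   # strictly positive losses, as the theorem requires
R = L.mean()

def lure_optimal():
    """One realization of RLURE under the optimal proposal q*(i_m = n) = L_n / sum(remaining)."""
    idx, total = list(range(N)), 0.0
    for m in range(1, M + 1):
        q = L[idx] / L[idx].sum()
        j = rng.choice(len(idx), p=q)
        v = 1.0 + (N - M) / (N - m) * (1.0 / ((N - m + 1) * q[j]) - 1.0)  # weight v_m
        total += v * L[idx[j]]
        idx.pop(j)
    return total / M

draws = [lure_optimal() for _ in range(20)]
assert max(abs(d - R) for d in draws) < 1e-9   # every draw equals the pool risk exactly
```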

D DEEP ACTIVE LEARNING IN PRACTICE

In Table 2 we show an informal survey of highly cited papers citing Gal et al. (2017b), which introduced active learning to computer vision using deep convolutional neural networks. Across a range of papers, including theory papers as well as applications ranging from agriculture to molecular science, only two acknowledged the bias introduced by actively sampling, and none took steps to address it. It is worth noting, though, that at least two papers motivated their use of active learning by observing that they expected their training data to already be unrepresentative of the population data, and saw active learning as a way to address that bias. This does not quite work unless one explicitly assumes that the actively chosen distribution is more like the population distribution, but it is an interesting phenomenon to observe in practical applications of active learning.



One can, in principle, construct an exact estimator in this scenario as well with the TABI approach of Rainforth et al. (2020), by employing two separate proposals that target max(∇_θ R, 0) and -min(∇_θ R, 0) respectively, then taking the difference between the two resultant estimators.



Figure 1: Illustrative linear regression. Active learning deliberately over-samples unusual points (red crosses), which no longer match the population (black dots). Common practice uses the biased unweighted estimator R, which puts too much emphasis on these unusual points. Our unbiased estimators RPURE and RLURE fix this, learning a function from only D_train that is nearly identical to the ideal one you would get with labels for the whole of D_pool, despite using only a few points.

Figure 2: RPURE and RLURE remove the bias introduced by active learning, while the unweighted R, which most active learning work uses, is biased. Note the sign: R overestimates the risk because active learning samples the hardest points. The variance of RPURE and RLURE depends on the acquisition distribution placing high weight on high-expected-loss points. In (b), the BALD-style distribution means that the variance of the unbiased estimators is smaller. For FashionMNIST, (c), the active learning bias is small and the variance is high in all cases. Shading is ±1 standard deviation.

Figure 4: Overfitting bias B_OFB for models trained using the three objectives. (a) Linear regression: B_OFB is small compared to the ALB (cf. Figure 2a). Shading shows the IQR; 1000 trajectories. (b) BNN on MNIST: B_OFB is of similar scale and opposite sign to the ALB (cf. Figure 2b). (c) BNN on FashionMNIST: the OFB is somewhat larger than with MNIST, particularly for R (i.e., our approaches reduce overfitting), and dominates the active learning bias (cf. Figure 2c). Shading ±1 standard error; 150 trajectories.

For the second term of (16), introducing the notations µ_{|D} := E[R̂ | D_pool] and µ_{m|D} := E[L_{i_m} | D_pool], we have

Var[R̂ | D_pool] = (1/M²) Σ_{m=1}^{M} Σ_{k=1}^{M} E[ (L_{i_m} - µ_{|D})(L_{i_k} - µ_{|D}) | D_pool ]
= (1/M²) Σ_{m=1}^{M} Σ_{k=1}^{M} E[ (L_{i_m} - µ_{m|D} + µ_{m|D} - µ_{|D})(L_{i_k} - µ_{k|D} + µ_{k|D} - µ_{|D}) | D_pool ],

which is expanded in the continuation at the end of the appendix. For RPURE, the proof of Theorem 7 follows straightforwardly from substituting the definition of the optimal proposal into the a_m form of the estimator.

Figure 10: Further downstream performance experiments. (a)-(c) are partners to Figures 3d, 3e, and 3f. (d) and (e) show similar results for a smaller multi-layer perceptron (with one hidden layer of 50 units). In all cases the results broadly mirror the results in the main paper.


ACKNOWLEDGEMENTS

The authors would like to especially thank Lewis Smith for helpful conversations and specifically for his assistance with the proof of Theorem 4. In addition, we would like to thank Joost van Amersfoort and Andreas Kirsch for their conversations and advice. The authors are grateful to the Engineering and Physical Sciences Research Council for their support of the Centre for Doctoral Training in Cyber Security, University of Oxford, as well as to the Alan Turing Institute.

annex

By further noting that the solution for m = M is trivially c_M = N/(N - M + 1) (from (10)), we have by induction

log c_m = log c_M + Σ_{t=m}^{M-1} [ log(N - t - 1) - log(N - t + 1) ].

Now we can exploit the fact that most of the terms in this sum cancel. The exceptions are -log(N - m + 1), -log(N - m), log(N - M + 1), and log(N - M). We thus have

c_m = c_M (N - M)(N - M + 1) / ((N - m)(N - m + 1)) = N(N - M) / ((N - m)(N - m + 1)),

which is our final simple form for c_m. We can now check that this satisfies the required recursive relationship (noting it trivially gives the correct value for c_M) as per (11):

c_m / c_{m+1} = (N - m - 1)(N - m) / ((N - m)(N - m + 1)) = (N - m - 1)/(N - m + 1),

as required. Similarly, we can straightforwardly show that this form of c_m satisfies (10) by substitution. We then find the form of v_m given this expression for c_m. Remember that

v_m = c_m w_m + (1/N) Σ_{t=m+1}^{M} c_t.

We can rearrange (10) to

(1/N) Σ_{t=m+1}^{M} c_t = 1 - (N - m + 1) c_m / N,

from which it follows that

v_m = 1 + c_m ( w_m - (N - m + 1)/N ).

Substituting in our expressions for c_m and w_m, we thus have

v_m = 1 + ((N - M)/(N - m)) ( 1/((N - m + 1) q(i_m; i_{1:m-1}, D_pool)) - 1 ),

which is the form given in the original expression.

Proof (of Theorem 4). We start by subtracting equation (13) from (12), yielding

Var[RPURE] - Var[RLURE] = (1/M²) Σ_{m=1}^{M} (1 - c_m²) E_m.

Assuming, for now, that M is even and M < N, we can group terms into pairs by counting from each end of the sequence (i.e., pairing the m-th and (M - m + 1)-th terms) to yield

Var[RPURE] - Var[RLURE] = (1/M²) Σ_{m=1}^{M/2} S_m, where
S_m = (1 - c_m²) E_m + (1 - c_{M-m+1}²) E_{M-m+1}
    = ( (N - m + 1)²/N² - (N - M)²/(N - m)² ) F_m + ( (N - M + m)²/N² - (N - M)²/(N - M + m - 1)² ) F_{M-m+1}.    (15)

We will now show that S_m ≥ 0 for all 1 ≤ m ≤ M/2, from which we can directly conclude that Var[RPURE] ≥ Var[RLURE]. For this, note that F_m and F_{M-m+1} are themselves non-negative by construction. Consider first the case where (N - M + m)²/N² ≥ (N - M)²/(N - M + m - 1)². Here the second term in S_m is non-negative. Furthermore, invoking Lemma 2 with a = m - 1 and b = m (noting this satisfies a + b ≤ M for all 1 ≤ m ≤ M/2 as required) shows that (N - m + 1)²/N² - (N - M)²/(N - m)² ≥ 0, and so the first term is also non-negative. It thus immediately follows that S_m ≥ 0 in this scenario. When this does not hold, (N - M + m)²/N² < (N - M)²/(N - M + m - 1)², and so the second term in S_m is negative.
We now instead invoke our assumption that F_m ≥ F_{M-m+1}, to yield

S_m ≥ ( (N - m + 1)²/N² - (N - M)²/(N - m)² + (N - M + m)²/N² - (N - M)²/(N - M + m - 1)² ) F_{M-m+1},

and invoke Lemma 2 twice more, with (a, b) = (m - 1, M - m + 1) and (a, b) = (M - m, m) (both satisfying a + b = M), to give

(N - m + 1)²/N² ≥ (N - M)²/(N - M + m - 1)² and (N - M + m)²/N² ≥ (N - M)²/(N - m)².

Substituting these back into (15) thus again yields the desired result that S_m ≥ 0. To cover the case where M is odd, we simply need to note that the pairing leaves the additional middle term

S_{(M+1)/2} = ( (N - m + 1)²/N² - (N - M)²/(N - m)² ) F_m with m = (M + 1)/2,

and we can again invoke Lemma 2 with a = M/2 - 1/2 and b = M/2 + 1/2 to show that this additional term is non-negative.

Continuing the proof of Theorem 6 from B.8: multiplying out the terms and using the symmetry of m and k, the cross terms vanish, because E[L_{i_m} - µ_{m|D} | D_pool] = 0 and (1/M) Σ_{k=1}^{M} µ_{k|D} = µ_{|D}, so we have

Var[R̂ | D_pool] = (1/M²) Σ_{m=1}^{M} Σ_{k=1}^{M} E[ (L_{i_m} - µ_{m|D})(L_{i_k} - µ_{k|D}) | D_pool ],    (17)

separating out the m = k and m < k terms, with symmetry. Here the m < k part yields term 4 in the result when substituted back into (16). For the m = k part, we have by analogous arguments to those used at the start of the proof for Var[R̂]

E[ (L_{i_m} - µ_{m|D})² | D_pool ] = E[ Var[L_{i_m} | i_{1:m-1}, D_pool] | D_pool ] + Var[ µ_{m|i,D} | D_pool ],    (18)

where µ_{m|i,D} := E[L_{i_m} | i_{1:m-1}, D_pool] as per the definition in the theorem itself. Substituting this back into (17) and then (16) in turn now yields the desired result through the tower property of expectations, with the first term in (18) producing term 2 and the second term producing term 3.

B.9 PROOF OF OPTIMAL PROPOSAL DISTRIBUTION: THEOREM 7

Theorem 7. Given a non-negative loss, the optimal proposal distribution is

q*(i_m = n; i_{1:m-1}, D_pool) = L_n / Σ_{n' ∉ i_{1:m-1}} L_{n'}.

Proof. We start by proving the result for the simpler case of RPURE before considering RLURE. To make the notation simpler, we introduce hypothetical indices i_t for t > M, noting that their exact values will not change the proof provided that they are all distinct from each other and from the real indices (i.e., that they are a possible realization of the active sampling process in the setting M = N). The substitutions themselves are carried out in the displaced fragment following equation (16) above.

C EXPERIMENTAL DETAILS

C.1 LINEAR REGRESSION

Our training dataset contains a small cluster of points near x = -1 and two larger clusters at 0 ≤ x ≤ 0.5 and 1 ≤ x ≤ 1.5, sampled proportionately to the 'true' data distribution.
The data distribution from which we select data in a Rao-Blackwellised manner has a probability density over x concentrated on these three clusters, with the distribution over y then induced by the underlying regression function. We set N = 101, with 5 points in the small cluster and 96 points across the other two clusters, and consider 10 ≤ M ≤ 100. We actively sample points without replacement using a geometric heuristic that scores the quadratic distance to previously sampled points and then selects points via a Boltzmann distribution over the normalized scores with β = 1. Here we also show, in Figure 6, results collected using an epsilon-greedy acquisition proposal; the results align with those from the other acquisition distribution considered in the main body of the paper. This proposal selects the point that has the highest total distance to all previously selected points with probability 0.9, and selects uniformly at random among the remaining points with probability ε = 0.1, where of course D_train consists of the i_{1:m-1} elements of D_pool. For all graphs we use 1000 trajectories with different random seeds to calculate error bars. Although each regression and scoring is deterministic, the acquisition distribution is stochastic. Although the variance of the estimators can be inferred from Figure 2a, we also provide Figure 5a, which displays the variance of the estimators directly.
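The epsilon-greedy proposal described above can be sketched as follows. This is our own illustration (the function name eps_greedy_proposal and the toy scores are assumptions; in the experiment the scores are total distances to previously selected points):

```python
import numpy as np

def eps_greedy_proposal(scores, eps=0.1):
    """q(i) = (1 - eps) on the argmax score, plus eps spread uniformly over all candidates."""
    q = np.full(len(scores), eps / len(scores))
    q[int(np.argmax(scores))] += 1.0 - eps
    return q

# toy total-distance scores for four unlabelled candidates
scores = np.array([0.2, 1.3, 0.7, 0.1])
q = eps_greedy_proposal(scores)
assert np.isclose(q.sum(), 1.0)
assert np.isclose(q[1], 0.9 + 0.1 / 4)   # greedy point gets probability 0.925
```

Because every unlabelled point receives probability at least eps/K, this proposal satisfies the non-zero-support requirement of the unbiased estimators.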

C.2 BAYESIAN NEURAL NETWORK

We train a Bayesian neural network using variational inference (Jordan et al., 1999). In particular, we use the radial Bayesian neural network approximating distribution (Farquhar et al., 2020). The details of the hyperparameters used for training are provided in Table 1.

Figure 7: We contrast the effect of using RLURE throughout the entire acquisition procedure and training (rather than using the same acquisition procedure based on R for all estimators). (a) Bias (as in Figure 2a). The purple and orange test-performance curves are nearly identical, suggesting the result is not sensitive to this choice.

The unbalanced dataset is constructed by first noising 10% of the training labels, which are assigned random labels, and then selecting a subset of the training dataset such that the number of examples of each class is proportional to the ratios (1.0, 0.5, 0.5, 0.2, 0.2, 0.2, 0.1, 0.1, 0.01, 0.01): that is, there are 100 times as many zeros as nines in the unbalanced dataset. (Figure 3f shows a version of this experiment which uses a balanced dataset instead, in order to make sure that any effects are not entirely caused by this design choice.) In fact, we took only a quarter of this dataset in order to speed up acquisition (since each model must be evaluated many times on each candidate datapoint to estimate the mutual information). 1000 validation points were then removed from this pool to allow early stopping. The remaining points were placed in D_pool. We then uniformly selected 10 points from D_pool to place in D_train. Adding noise to the labels and using an unbalanced dataset is designed to mimic the difficult situations active learning systems face in practice, despite the relatively simple dataset. However, we used a simple dataset for a number of reasons. Active learning is very costly because it requires constant retraining, and accurately measuring the properties of estimators generally requires taking large numbers of samples.
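The unbalanced-dataset construction can be sketched as follows. This is our own sketch on synthetic labels; make_unbalanced and all parameters are illustrative assumptions, not the paper's code:

```python
import numpy as np

def make_unbalanced(labels, ratios, noise_frac=0.1, rng=None):
    """Randomize noise_frac of labels, then subsample so class c appears proportionally to ratios[c]."""
    rng = rng or np.random.default_rng(0)
    labels = labels.copy()
    flip = rng.random(len(labels)) < noise_frac
    labels[flip] = rng.integers(0, 10, flip.sum())
    base = min(np.bincount(labels, minlength=10) / np.asarray(ratios))
    keep = []
    for c, r in enumerate(ratios):
        members = np.flatnonzero(labels == c)
        keep.extend(rng.choice(members, int(base * r), replace=False))
    return np.array(sorted(keep)), labels

ratios = (1.0, 0.5, 0.5, 0.2, 0.2, 0.2, 0.1, 0.1, 0.01, 0.01)
y = np.random.default_rng(3).integers(0, 10, 60_000)   # stand-in for MNIST labels
idx, y_noisy = make_unbalanced(y, ratios)
counts = np.bincount(y_noisy[idx], minlength=10)
assert counts[0] >= 90 * counts[9]   # roughly 100x more zeros than nines
```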
The combination makes using more complicated datasets expensive. In addition, because our work establishes a lower bound on the architecture complexity for which correcting the active learning bias is no longer valuable, establishing that lower bound with MNIST is in fact a stronger result than showing a similar result with a more complex model. The active learning loop then proceeds by:

1. training the neural network on D_train using R;
2. scoring D_pool;
3. sampling a point to be added to D_train;
4. every 3 points, separately training models on D_train using R, RPURE, and RLURE and evaluating them.

This ensures that all of the estimators are evaluated on data collected under the same sampling distribution, for fair comparison. As a sense-check, in Figures 7a and 7b we show an alternate version in which the first step trains with RLURE instead of R, and find that this does not have a significant effect on the results.

When we compute the bias of a fixed neural network in Figure 2b, we train a single neural network on 1000 points. We then sample evaluation points from the test dataset using the acquisition proposal distribution and evaluate the bias using those points.

In Figures 8a and 8b we revisit the graphs shown in Figures 3b and 3c, this time showing standard errors in order to make clear that the biased R estimator has better performance, while the earlier figures show that the performance is quite variable.

We considered a range of alternative proposal distributions. In addition to the Boltzmann distribution which we used, we considered a temperature range between 1,000 and 20,000, finding that it had relatively little effect. Higher temperatures correspond to more certainly picking the highest-mutual-information point, which approaches a deterministic proposal. We found that because the mutual information had to be estimated, and was itself a random variable, different trajectories still picked very different sets of points. However, for very high temperatures the estimators became higher variance, and for lower temperatures the acquisition distribution became nearly uniform. In Figure 9 we show the results of networks trained with a variety of temperatures other than the 10,000 ultimately used. We also considered a proposal simply proportional to the scores, but found this was also too close to uniform sampling for any of the costs or benefits of active learning to be visible.

Figure 9: Higher temperatures approach a deterministic acquisition function. These also tend to increase the variance of the risk estimator, because the weight associated with an unlikely point increases when it happens to be selected. The overall pattern seems fairly consistent, however.

We considered Monte Carlo dropout as an alternative approximating distribution (Gal & Ghahramani, 2015) (see Figures 3e and 10b). We found that the mutual information estimates were compressed into a fairly narrow range, consistent with the observation by Osband et al. (2018) that Monte Carlo dropout uncertainties do not necessarily converge unless the dropout probabilities are also optimized (Gal et al., 2017a). While this might be good enough when only the relative score is needed in order to calculate the argmax, for our proposal distribution we would ideally prefer to have good absolute scores as well. For this reason, we chose the richer approximate posterior distribution instead.

Last, we considered a different architecture, using a fully-connected neural network with a single hidden layer of 50 units, also trained as a Radial BNN. This showed higher variance in downstream performance, but was broadly similar to the convolutional architecture (see Figures 10d and 10e).
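The score-to-proposal transformation discussed here can be sketched as a numerically stable softmax over acquisition scores. This is our illustration; boltzmann_proposal and the toy scores are assumptions:

```python
import numpy as np

def boltzmann_proposal(scores, beta):
    """Softmax over acquisition scores; larger beta approaches the deterministic argmax."""
    z = beta * (scores - scores.max())   # subtract the max for numerical stability
    p = np.exp(z)
    return p / p.sum()

scores = np.array([0.010, 0.020, 0.015])       # e.g. noisy mutual-information estimates
near_uniform = boltzmann_proposal(scores, beta=10.0)
near_argmax = boltzmann_proposal(scores, beta=10_000.0)
assert near_uniform.max() < 0.4                # low scaling: close to uniform sampling
assert near_argmax[1] > 0.99                   # high scaling: essentially deterministic
```

Because mutual-information estimates are typically small in absolute terms, large multipliers like the 10,000 used here are needed before the proposal becomes strongly peaked, which matches the temperature sensitivity reported above.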

