SOCAL: SELECTIVE ORACLE QUESTIONING FOR CONSISTENCY-BASED ACTIVE LEARNING OF CARDIAC SIGNALS

Abstract

The ubiquity and rate of collection of cardiac signals produce large, unlabelled datasets. Active learning (AL) can exploit such datasets by incorporating human annotators (oracles) to improve generalization performance. However, the overreliance of existing algorithms on oracles continues to burden physicians. To minimize this burden, we propose SoCal, a consistency-based AL framework that dynamically determines whether to request a label from an oracle or to generate a pseudo-label instead. We show that our framework decreases the labelling burden while maintaining strong performance, even in the presence of a noisy oracle.

1. INTRODUCTION

The success of modern-day deep learning algorithms in the medical domain has been contingent upon the availability of large, labelled datasets (Poplin et al., 2018; Tomašev et al., 2019; Attia et al., 2019). Curating such datasets, however, is a challenge due to the time-consuming nature of, and high costs associated with, labelling. This is particularly the case in the medical domain, where the input of expert medical professionals is required. One way of overcoming this challenge and exploiting large, unlabelled datasets is via the active learning (AL) framework (Settles, 2009). This framework iterates over three main steps: 1) a learner acquires unlabelled instances, usually through an acquisition function, 2) an oracle (e.g., a physician) labels the acquired instances, and 3) the learner is trained on the existing and newly-labelled instances. By altering the way in which acquisitions are performed and the degree of involvement of the oracle, the active learning framework aims to improve the performance of a network while minimizing the burden of labelling on the oracle. One principal desideratum for an acquisition function is its ability to reduce the size of the version space, the set of hypotheses (decision boundaries) consistent with the labelled training instances.
This ability is highly dependent upon the approximation of the version space, a goal that Monte Carlo Dropout (MCD) attempts to achieve (see Fig. 1a). For example, state-of-the-art uncertainty-based acquisition functions, such as BALD (Houlsby et al., 2011), used alongside MCD acquire instances that lie in a region of uncertainty, a region where there is high disagreement between the hypotheses about a particular instance. In many scenarios, however, estimating this region of uncertainty is nontrivial. Furthermore, existing AL frameworks are overly reliant on the presence of an oracle. Such over-reliance precludes the applicability of AL algorithms to certain environments, such as low-resource healthcare settings, where an oracle is either unavailable or ill-trained for the task at hand. In this work, we aim to design an active learning framework that better estimates the region of uncertainty and decreases its reliance on an oracle. Our contributions are as follows:

1. Consistency-based active learning framework: we propose a novel framework that stochastically perturbs inputs, network parameters, or both to guide the acquisition of unlabelled instances.




2. RELATED WORK

Active learning methodologies were recently reviewed by Settles (2009). In the healthcare domain, Gong et al. (2019) propose to acquire instances from an electronic health record (EHR) database using a Bayesian deep latent Gaussian model to improve mortality prediction. Smailagic et al. (2018; 2019) acquire unannotated medical images by measuring their distance in a latent space to images in the training set. The work of Wang et al. (2019) is similar to ours in that they focus on the electrocardiogram (ECG). Gal et al. (2017) adopt BALD (Houlsby et al., 2011) in the context of Monte Carlo Dropout to acquire datapoints that maximize the Jensen-Shannon divergence (JSD) across MC samples. Previous work attempts to learn from multiple or imperfect oracles (Dekel et al., 2012; Zhang & Chaudhuri, 2015; Sinha et al., 2019). For example, Urner et al. (2012) propose choosing the oracle that should label a particular instance. Unlike our approach, they do not explore independence from an oracle. Yan et al. (2016) consider oracle abstention in an AL setting. Instead, we place the decision of abstention under the control of the learner. To the best of our knowledge, previous work, in contrast to ours, has assumed the existence of an oracle and has not explored a dynamic oracle selection strategy. Consistency training in the context of semi-supervised learning helps enforce the smoothness assumption (Zhu, 2005). For example, Interpolation Consistency Training (Verma et al., 2019) penalizes networks for not generating a linear combination of outputs in response to a linear combination of inputs. Similarly, Xie et al. (2019) penalize networks for generating drastically different outputs in response to perturbed instances. In the process, networks learn perturbation-invariant representations.
McCallumzy & Nigamy (1998) introduce an acquisition function that calculates the average Kullback-Leibler divergence, D KL , between the output of a network and the consensus output across all networks in an ensemble. Unlike ours, their approach does not exploit perturbations. Similar to our work is that of Gao et al. (2019) which incorporates into the objective function a consistency-loss based on the D KL and actively acquires instances using the variance of the probability assigned to each class by the network in response to perturbed versions of the same instance. Selective classification imbues a network with the ability to abstain from making predictions. Chow (1970); El-Yaniv & Wiener (2010) introduce the risk-coverage trade-off whereby the empirical risk of a model is inversely related to its rate of abstentions. Wiener & El-Yaniv (2011) use a support vector machine (SVM) to rank and reject instances based on the degree of disagreement between hypotheses. In some frameworks, these are the same instances that active learning views as most informative. Cortes et al. (2016) outline an objective function that penalizes abstentions that are inappropriate and frequent. Most recently, Liu et al. (2019) propose the gambler's loss to learn a selection function that determines whether instances are rejected. However, this approach is not implemented in the context of AL. Most similar to our work is SelectiveNet (Geifman & El-Yaniv, 2019) where a multi-head architecture is used alongside an empirical selective risk objective function and a percentile threshold. However, their work assumes the presence of ground-truth labels and, therefore, does not extend to unlabelled instances.

3. BACKGROUND

3.1 ACTIVE LEARNING

Consider a learner f_ω : x ∈ R^m → v ∈ R^d, parameterized by ω, that maps an m-dimensional input, x, to a d-dimensional representation, v. Further consider g_φ : v ∈ R^d → y ∈ R^C that maps a d-dimensional representation, v, to a C-dimensional output, y, where C is the number of classes. After training on a pool of labelled data L = (X_L, Y_L) for τ epochs, the learner is tasked with querying the unlabelled pool of data U = (X_U, Y_U) and acquiring the top b fraction of instances, x_b ∼ X_U, that it deems to be most informative. The degree of informativeness of an instance is determined by an acquisition function, α, such as BALD (Houlsby et al., 2011). Additional acquisition functions can be found in Appendix A. These are typically used in conjunction with Monte Carlo Dropout (Gal & Ghahramani, 2016) to identify instances that lie in the region of uncertainty, a region in which hypotheses disagree the most about instances.
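As a point of reference, the MCD-based acquisition step described above can be sketched as follows; the model, the pool tensor, and T are placeholders, and the dropout-based MC sampling is one common approximation rather than the paper's exact implementation:

```python
import torch
import torch.nn.functional as F

def bald_scores(model, x_pool, T=20):
    """Approximate BALD (Houlsby et al., 2011) with Monte Carlo Dropout:
    score each unlabelled instance by the mutual information between its
    prediction and the model parameters, estimated from T stochastic
    forward passes with dropout kept active."""
    model.train()  # keep dropout active so each pass samples a hypothesis
    with torch.no_grad():
        # probs: (T, N, C) class probabilities across T dropout masks
        probs = torch.stack([F.softmax(model(x_pool), dim=-1) for _ in range(T)])
    mean_probs = probs.mean(dim=0)  # (N, C) consensus prediction
    # entropy of the mean minus mean of the entropies (non-negative by Jensen)
    entropy_of_mean = -(mean_probs * mean_probs.clamp_min(1e-12).log()).sum(-1)
    mean_of_entropy = -(probs * probs.clamp_min(1e-12).log()).sum(-1).mean(0)
    return entropy_of_mean - mean_of_entropy  # (N,) higher = more informative
```

The top b fraction of instances by score would then be acquired.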

4.1.1. MONTE CARLO PERTURBATIONS

Unlabelled instances in proximity to the decision boundary are likely to be more informative for training than those further away. To identify such instances, we stochastically perturb them and observe the network's outputs. The intuition is that such outputs will differ significantly across the perturbations for instances close to the decision boundary (see Fig. 1b ). We refer to this setup as Monte Carlo Perturbations (MCP) and illustrate its derivation in Appendix B.
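A minimal sketch of MCP, assuming additive Gaussian input noise as the stochastic perturbation (the specific perturbation family is our assumption, not prescribed by the text above):

```python
import torch
import torch.nn.functional as F

def mcp_scores(model, x_pool, T=20, sigma=0.1):
    """Monte Carlo Perturbations (MCP) sketch: the network is fixed and
    only the inputs are stochastically perturbed (here with additive
    Gaussian noise, an assumed choice). Instances whose outputs vary most
    across perturbations are treated as lying near the decision boundary."""
    model.eval()  # parameters fixed; dropout off
    with torch.no_grad():
        probs = torch.stack([
            F.softmax(model(x_pool + sigma * torch.randn_like(x_pool)), dim=-1)
            for _ in range(T)
        ])  # (T, N, C)
    # total variance of the class probabilities across the T perturbations
    return probs.var(dim=0).sum(dim=-1)  # (N,) higher = less robust
```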

4.1.2. BAYESIAN ACTIVE LEARNING BY CONSISTENCY

Acquisition functions dependent upon perturbations applied to either the inputs (MCP) or the network parameters (MCD) alone can fail to identify instances that lie in the region of uncertainty. We illustrate this point with the following example: without loss of generality, let us assume an unlabelled instance is in proximity to some decision boundary A and is classified by the network as belonging to some arbitrary class 3. Such proximity should deem the instance informative for the training process (Settles, 2009). In the MCD setting, perturbations are applied to parameters, generating various decision boundaries, which in turn influence the network outputs. In Fig. 2 (red rectangle), we visualize such outputs for three arbitrary classes. If these parameter perturbations happen to be too small in magnitude, for example, then the network will continue to classify the instance as belonging to the same class. At this stage, regardless of whether an uncertainty-based or a consistency-based acquisition function is used, the instance would be deemed uninformative, and thus not acquired. As a result, an instance that should have been acquired (due to its proximity to the decision boundary) would be erroneously deemed uninformative. A similar argument extends to MCP. By applying perturbations to both instances and network parameters, we aim to leverage the smoothness assumption (Zhu, 2005) to better identify instances that lie in the region of uncertainty and thus avoid missing their acquisition. Motivated by this, we propose a framework, entitled Bayesian Active Learning by Consistency (BALC) (see Fig. 1c), that consists of three main steps: 1) we perturb an instance, x, to generate z, 2) we perturb the network parameters, ω, to generate ω′, and 3) we pass both instances through the perturbed network, generating outputs p(y|x, ω′) and p(y|z, ω′) ∈ R^C, respectively.
We perform these steps for T stochastic perturbations and generate two matrices of network outputs, G(x), G′(z) ∈ R^{T×C}. We visualize such network outputs in Fig. 2, where T = 3 and C = 3. To leverage G and G′, we propose two divergence-based acquisition functions that acquire instances to which the network is least robust. In BALC_KLD, we calculate the D_KL between two C-dimensional Gaussians that are empirically fit to G and G′:

BALC_KLD = D_KL( N(μ(x), Σ(x)) ‖ N(μ(z), Σ(z)) )

where μ = (1/T) Σ_{i=1}^{T} G_i and Σ = (G − μ)^T (G − μ) represent the empirical mean vector and covariance matrix of the network outputs, respectively. BALC_KLD is likely to detect output variations due to input perturbations. We support this claim in Fig. 2 by illustrating two scenarios. In scenario 1, network output variations are caused solely by input perturbations. In contrast, in scenario 2, network output variations are caused by both input and parameter perturbations. We show that BALC_KLD ≈ 1 and 0 in these two scenarios, respectively. Since the higher the value of an acquisition function, the more informative an instance is, these scenarios illustrate BALC_KLD's preference for input perturbations. To detect variations due to both input and parameter perturbations, we introduce BALC_JSD, whose full derivation can be found in Appendix C:

BALC_JSD = E_{i∈T}[ D_KL( G_i(x) ‖ G′_i(z) ) ] − D_KL( E_{i∈T}[G(x)] ‖ E_{i∈T}[G′(z)] )

where the first term is computed across parameter perturbations and the second across input perturbations.
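The BALC_KLD score can be sketched as follows, fitting a C-dimensional Gaussian to each T×C output matrix and applying the closed-form Gaussian KL divergence; the ridge term eps is our addition, needed because the empirical covariance of a small number of MC samples can be singular:

```python
import numpy as np

def gaussian_kl(mu0, cov0, mu1, cov1):
    """Closed-form D_KL(N(mu0, cov0) || N(mu1, cov1))."""
    d = mu0.shape[0]
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

def balc_kld(G_x, G_z, eps=1e-4):
    """BALC_KLD sketch: fit a C-dimensional Gaussian to the T outputs for
    the original instance (G_x) and for its perturbed counterpart (G_z),
    then compute the divergence between the two fits. The ridge eps is an
    assumed regularizer keeping the small-sample covariance invertible."""
    C = G_x.shape[1]
    cov_x = np.cov(G_x, rowvar=False) + eps * np.eye(C)
    cov_z = np.cov(G_z, rowvar=False) + eps * np.eye(C)
    return gaussian_kl(G_x.mean(0), cov_x, G_z.mean(0), cov_z)
```

An instance whose perturbed outputs diverge more from its original outputs receives a higher score and is therefore more likely to be acquired.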

4.2. TRACKED ACQUISITION FUNCTION

Deriving the informativeness of unlabelled instances based solely on the value of the acquisition function at a single epoch can be erroneous. This is partially driven by limitations in the approximation of the version space, which is known to hinder performance (Cohn et al., 1994). To improve this approximation, we propose to track an acquisition function over time (e.g., epochs) before employing it to acquire instances. The intuition is that by incorporating temporal information, we accumulate hypotheses in the version space and thus obtain a more reliable estimate of the relative informativeness of each instance. Acquiring such instances would help reduce the size of the version space at a greater rate. For any tracked acquisition function, α(t), the corresponding area under the temporal acquisition function, AUTAF ∈ R, is calculated as follows:

AUTAF = ∫₀^τ α(t) dt ≈ Σ_{t=0}^{τ−Δt} [ (α(t + Δt) + α(t)) / 2 ] Δt    (3)

where the integral is approximated using the trapezoidal rule, Δt is the time-step (in epochs) between epochs at which ordinary acquisition values are calculated, and τ is the epoch at which the AUTAF is calculated and an acquisition of unlabelled instances is performed.
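Eq. 3 reduces to a trapezoidal sum over the recorded acquisition values; a minimal sketch, where alpha_history holds one snapshot of acquisition values per time-step Δt:

```python
import numpy as np

def autaf(alpha_history, dt=1.0):
    """Area under the tracked acquisition function (Eq. 3): a trapezoidal
    approximation of the integral of alpha(t) over the epochs preceding an
    acquisition. alpha_history holds one acquisition-value snapshot per
    time-step dt (a scalar per snapshot, or one value per instance)."""
    a = np.asarray(alpha_history, dtype=float)
    # trapezoidal rule along the time axis
    return 0.5 * (a[1:] + a[:-1]).sum(axis=0) * dt
```

For example, a constant acquisition value of 2 recorded at five consecutive epochs yields an area of 8 over the four unit-length intervals.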

4.3. SELECTIVE ORACLE QUESTIONING

Alongside our consistency-based AL framework, we aim to minimize the burden of labelling on an oracle. To do so, we learn a network that dynamically determines whether to request a label from an oracle or to generate a pseudo-label for each acquired unlabelled instance. We refer to this strategy as selective oracle questioning in active learning (SoQal).

Oracle selection network. In addition to the learners outlined in Sec. 3.1, f_ω and g_φ, we introduce an oracle selection network, h_θ : v ∈ R^d → t ∈ [0, 1], parameterized by θ, that maps d-dimensional representations, v, to a scalar, t, as shown in Fig. 3.

Figure 3: An instance, x, is provided as input to the feature extractor, f_ω. The representation, v, is passed through g_φ to generate task predictions, and through h_θ to help determine whether a label is requested from an oracle. The zero-one loss of g_φ acts as the ground-truth label for t, where t ≈ 1 indicates a hard-to-classify instance that would benefit from an oracle label. Otherwise, the instance is pseudo-labelled by taking the argmax of the outputs of the prediction network.

Proxy for misclassifications. Ideally, a network should only rely on an oracle when it would itself misclassify an instance. Quantifying these misclassifications is trivial in the presence of ground-truth labels. However, in the AL framework, we are interested in making decisions on unlabelled instances and therefore need a reliable proxy for such misclassifications. We propose that this proxy be the output of the oracle selection network, t. For example, low and high values of t can indicate correct and incorrect network predictions, respectively. To learn this behaviour, h_θ needs an appropriate supervisory signal. For this, we choose the zero-one loss, e, of the prediction network, g_φ, as the ground-truth label (see Fig. 3).
For a mini-batch of size B, our objective function thus consists of two terms: 1) a cross-entropy class prediction loss for the main task, and 2) a binary cross-entropy oracle selection loss with a weighting coefficient, β (described next):

L = Σ_{i=1}^{B} [ −log p(y_i = c | x_i, ω, φ) − ( β e_i log h_θ(t|x_i) + (1 − e_i) log(1 − h_θ(t|x_i)) ) ]

where c is the target class. During the early stages of training on labelled data, a network struggles to classify instances correctly. This means that the ratio of zero to one losses will be skewed toward the latter. As training progresses and the network becomes more adept at classifying instances, this ratio becomes skewed in the opposite direction. Such an imbalance in the ground-truth labels, e, and their subsequent shift send strong supervisory signals to h_θ. For example, during the early stages of training, the outputs of h_θ will be high and pulled towards t = 1 (high error) even if the corresponding instance was correctly classified. Such behaviour makes it difficult to ascertain whether instances have been misclassified, hindering the reliability of t as a proxy. To offset the aforementioned class imbalance, we introduce a dynamic hyperparameter, β = Σ_i δ_{e_i=0} / Σ_i δ_{e_i=1}, where δ is the Kronecker delta. As training progresses, β transitions from β < 1 to β > 1 as the ratio of correctly classified (e = 0) to misclassified (e = 1) instances within a mini-batch changes.

Decision-making with proxy. We aim to exploit t for the binary decision of either requesting a label from an oracle or generating a pseudo-label. One way to do so is via a simple threshold at 0.5. However, this threshold may not be appropriate. In designing a robust selection strategy, we must account for the distribution of the t values that corresponds to each decision and the separability of such distributions. In Figs.
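A sketch of this joint objective, with the dynamic weight β computed per mini-batch from the zero-one losses; the tensor shapes and the numerical clamps are our assumptions:

```python
import torch
import torch.nn.functional as F

def soqal_loss(logits, targets, t_out):
    """Sketch of the joint objective: a cross-entropy class prediction
    loss plus a beta-weighted binary cross-entropy oracle selection loss.
    The zero-one loss e of the prediction head serves as the ground-truth
    label for the selection output t, and beta is the in-batch ratio of
    correctly classified (e = 0) to misclassified (e = 1) instances."""
    ce = F.cross_entropy(logits, targets, reduction='sum')
    e = (logits.argmax(dim=-1) != targets).float()   # zero-one loss per instance
    n0, n1 = (e == 0).sum(), (e == 1).sum()
    beta = n0.float() / n1.float().clamp_min(1.0)    # dynamic weight (assumed clamp)
    t = t_out.clamp(1e-6, 1 - 1e-6)                  # numerical safety
    bce = -(beta * e * t.log() + (1 - e) * (1 - t).log()).sum()
    return ce + bce
```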
4a and 4b, we illustrate these distributions during the early and late stages of training, colour-coded based on whether the t values correspond to correctly-classified (e = 0) or misclassified (e = 1) labelled training instances. We now outline how the binary decision is made. After each training epoch, we fit the t values in Fig. 4b to two unimodal Gaussian distributions. This generates N_0(μ_0, σ_0²) and N_1(μ_1, σ_1²) for e = 0 and e = 1, respectively. We choose to quantify the separability of these two distributions using the Hellinger distance, D_H ∈ [0, 1], as it allows for a straightforward threshold. Low separability, expressed as D_H < S, implies that h_θ has yet to generate a reliable proxy, and thus an oracle is requested for a label. We note that the value of S can be altered depending on the relative level of trust one has in the network and oracle. When D_H ≥ S, we evaluate N_0 and N_1 at the t output for each acquired unlabelled instance and define p(A), the probability of requesting a label from an oracle, as:

p(A) = 1 if D_H < S; 1 if N(t|μ_1, σ_1²) > N(t|μ_0, σ_0²); 0 otherwise.

We elucidate the entire active learning algorithm in Appendix E.

To observe the impact of the availability of labelled training data on the active learning procedure, we take a fraction β = (0.1, 0.3, 0.5, 0.7, 0.9) of the training dataset and place it into the labelled set. Its complement is placed into the unlabelled set. Details about the data splits, preprocessing steps, and the network architecture can be found in Appendices F and G.
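The decision rule above can be sketched with the closed-form Hellinger distance between two univariate Gaussians; the helper names are illustrative:

```python
import math

def hellinger(mu0, s0, mu1, s1):
    """Closed-form Hellinger distance between N(mu0, s0^2) and N(mu1, s1^2)."""
    h2 = 1.0 - math.sqrt(2.0 * s0 * s1 / (s0 ** 2 + s1 ** 2)) \
               * math.exp(-0.25 * (mu0 - mu1) ** 2 / (s0 ** 2 + s1 ** 2))
    return math.sqrt(max(h2, 0.0))

def normal_pdf(t, mu, s):
    return math.exp(-0.5 * ((t - mu) / s) ** 2) / (s * math.sqrt(2.0 * math.pi))

def ask_oracle(t, mu0, s0, mu1, s1, S=0.15):
    """Decision rule: defer to the oracle when the two fitted distributions
    are not yet separable (D_H < S), or when the misclassified density N1
    dominates the correctly-classified density N0 at the proxy value t."""
    if hellinger(mu0, s0, mu1, s1) < S:
        return True   # proxy not yet reliable; request an oracle label
    return normal_pdf(t, mu1, s1) > normal_pdf(t, mu0, s0)
```

Once the two distributions separate, an instance whose proxy value t falls under the misclassification mode is sent to the oracle; otherwise it is pseudo-labelled.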

5.2. BASELINES

Acquisition Functions. We compare our novel active learning framework to state-of-the-art acquisition functions used in conjunction with MCD. These include Var Ratio, Entropy, and BALD, definitions of which can be found in Appendix A. We also compare to the scenario in which active learning is not employed (No AL). Selective Oracle Questioning. We experiment with baselines that exhibit varying degrees of oracle dependence. No Oracle is a scenario in which 0% of the labels that correspond to unlabelled instances are oracle-based; instead, instances are pseudo-labelled by taking the argmax of the network predictions. Epsilon Greedy is a stochastic strategy (Watkins, 1989) that we adapt to exponentially decay the reliance of the network on an oracle as a function of the number of acquisition epochs. Entropy Response assumes that high-entropy predictions generated by a network are indicative of instances that the network is unsure of. Therefore, we introduce a threshold, S_Entropy, such that if it is exceeded, an oracle is requested to label the chosen instance (see Appendix G). The most dependent baseline is 100% Oracle, a traditionally-employed strategy in AL where 100% of the labels are oracle-based. We do not compare our methods to Softmax Response (Geifman & El-Yaniv, 2017) and SelectiveNet (Geifman & El-Yaniv, 2019), despite their strong performance for selective classification, as they do not trivially extend to the setting in which labels are unavailable.
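The Entropy Response baseline, for instance, reduces to a simple thresholding rule; a sketch, with S_entropy as the threshold:

```python
import torch

def entropy_response(probs, S_entropy):
    """Entropy Response baseline sketch: request an oracle label whenever
    the predictive entropy exceeds the threshold S_entropy; otherwise
    pseudo-label with the argmax of the network's prediction."""
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)
    ask = entropy > S_entropy       # True -> defer to the oracle
    pseudo = probs.argmax(dim=-1)   # used when ask is False
    return ask, pseudo
```

A near-uniform prediction (entropy ≈ log C) triggers an oracle request, whereas a confident prediction is pseudo-labelled.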

5.3. HYPERPARAMETERS

For all experiments, we chose the number of MC samples T = 20 to balance computational complexity against the accuracy of the approximation of the version space. We acquire unlabelled instances at pre-defined epochs during training, which we refer to as acquisition epochs, τ = 5n, n ∈ N+. During each acquisition epoch, we acquire b = 2% of the remaining unlabelled instances. We also investigate the effect of such hyperparameters on performance (see Appendices P-R). When experimenting with tracked acquisition functions, we chose the temporal period ∆t = 1. For the CIFAR10 experiments, we chose T = 5, τ = 2n, and b = 10%. Selective Oracle Questioning. Recall that we delegate selective oracle questioning to the network only when D_H ≥ S. Given D_H's increasing trend during training (see Fig. 4c), we chose S = 0.15 to balance the reliability of the proxy against the independence of the network from an oracle. We also explore the sensitivity of SoQal to this choice of S.

6.1. ACTIVE LEARNING WITHOUT ORACLE

Active learning frameworks typically assume the presence of an oracle (expert annotator). However, oracles are not always available, particularly in the medical domain, where physicians have limited time to provide annotations. To reflect this scenario, we evaluate the ability of our AL framework to operate without an oracle. In Fig. 5a, we illustrate the validation AUC of various methods when exposed to a fraction, β, of the labelled training data on D_2 and D_5, respectively. We find that BALC_KLD achieves strong generalization performance and does so in fewer epochs than the remaining methods. For example, in Fig. 5a, BALC_KLD achieves an AUC ≈ 0.69 after only 20 epochs whereas BALD_MCD does so at epoch 40. This implies that BALC_KLD can result in a two-fold reduction in training time. It also achieves a higher final AUC ≈ 0.72 relative to the remaining methods. We hypothesize that such behaviour is due to BALC_KLD's improved ability to estimate the region of uncertainty, which allows it to acquire more informative instances. Moreover, given the absence of an oracle, these informative instances are likely to have also been pseudo-labelled correctly. The acquisition of more informative instances which are labelled correctly by the network suggests that such instances are closer to the decision boundary than their non-acquired counterparts, yet are still on the correct side of the boundary. We arrive at similar conclusions for the remaining experiments (see Appendices H and J). Having illustrated the potential benefit of static acquisition functions, we now quantify the effect of incorporating temporal information on generalization performance. In Figs. 5b and 5c, we illustrate the percent change in the AUC when using MCP, with and without tracked acquisition functions, relative to MCD. We find that tracked acquisition functions are most useful when the initial size of the labelled dataset is small (low β values) (red rectangle).
For example, incorporating temporal information into BALD at β = 0.1 improves the generalization performance by 11%. We hypothesize that this improvement is due to the increased enumeration of hypotheses over time, which in turn, results in a more reliable approximation of the version space.

6.2. ACTIVE LEARNING WITH NOISE-FREE ORACLE

In this section, we relax the assumption that physicians are unavailable for annotation purposes. Instead, we focus on alleviating the labelling burden that is placed on physicians when they are available. More specifically, we assume that oracles can provide accurate, i.e., noise-free, labels. In Table 1, we illustrate the test-set AUC of the oracle questioning strategies on all datasets at β = 0.1. We find that SoQal consistently outperforms its counterparts across D_1-D_3. For example, while using BALD_MCD on D_2, SoQal achieves an AUC = 0.707 whereas Epsilon Greedy and Entropy Response achieve AUC = 0.609 and 0.584, respectively. Such a finding suggests that SoQal is better equipped to know when, and for which instance, a label should be requested from an oracle. One could argue that SoQal's superiority is due to a high dependence on the oracle; we show that this is not the case in Appendix L. On the other hand, we observe that SoQal performs on par with the other methods on D_4 and D_5. We hypothesize that this outcome is due to the cold-start problem (Konyushkova et al., 2017), where AL algorithms fail to learn due to the limited availability of labelled training data. We support this claim with experiments in Appendix M. Moreover, we remind readers that by increasing the value of S in the SoQal experiments, networks can cede more control to the oracle and thus further improve performance, an effect we quantify in Appendix N.

So far, we have presented scenarios in which oracles are either absent or present with the ability to provide accurate labels. In healthcare, however, physicians may be ill-trained, fatigued, or unable to diagnose a case due to its difficulty. We simulate this scenario by introducing two types of label noise: we stochastically flip each label to 1) any other label randomly (Random), or 2) its nearest neighbour from a different class in a compressed subspace (Nearest Neighbour).
Whereas the first form of noise is extreme, the latter is more realistic as it may represent uncertain physician diagnoses. To simulate various magnitudes of noise, we chose the probability of introducing noise, γ ∈ {0.05, 0.1, 0.2, 0.4, 0.8}. In Fig. 6, we illustrate the effect of label noise on the test AUC for the various oracle questioning strategies.

Figure 6: Average test AUC for the oracle questioning strategies in the absence and presence, of various magnitudes, of label noise on D_1 using BALD_MCP. With up to 80% random or nearest-neighbour label noise, SoQal still outperforms the remaining methods trained without label noise, illustrating that SoQal is better equipped to deal with oracle label noise.

We find that SoQal outperforms the remaining strategies regardless of noise type and magnitude (except at 40% random noise). For example, at 5% random noise, SoQal achieves an AUC ≈ 0.66 whereas Epsilon Greedy and Entropy Response achieve an AUC ≈ 0.56 and ≈ 0.53, respectively. Surprisingly, we find that the introduction of label noise can sometimes improve performance. For example, SoQal's AUC improves from ≈ 0.64 with no noise to ≈ 0.66 with 5% random noise. We hypothesize that this is due to inherent label noise in the datasets; by introducing further noise, we nudge some labels towards their ground-truth values. Moreover, SoQal is better able to deal with label noise than its counterparts. Specifically, SoQal at 80% random noise achieves AUC ≈ 0.53 whereas Epsilon Greedy and Entropy Response trained without noise achieve AUC ≈ 0.50 and ≈ 0.52, respectively. This effect, which is even more pronounced with nearest-neighbour noise, indicates the utility of SoQal in the presence of a noisy oracle. We arrive at similar conclusions when experimenting with other datasets and acquisition functions (see Appendix O).
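The Random noise model can be sketched as follows; the uniform draw over the remaining classes is our reading of "any other label randomly":

```python
import numpy as np

def flip_labels_random(y, num_classes, gamma, seed=None):
    """Random label-noise sketch: with probability gamma, replace each
    oracle label with a uniformly drawn *different* class."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y).copy()
    flip = rng.random(len(y)) < gamma  # which labels get corrupted
    for i in np.where(flip)[0]:
        y[i] = rng.choice([c for c in range(num_classes) if c != y[i]])
    return y
```

Setting gamma = 0 leaves the oracle noise-free, while gamma = 0.8 corrupts, in expectation, 80% of the returned labels.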

7. DISCUSSION AND FUTURE WORK

In this paper, we proposed a novel consistency-based active learning framework which perturbs both inputs and network parameters and acquires instances to which the network is least robust. We illustrate the utility of this approach in the absence of an oracle. Moreover, we propose a strategy that dynamically determines whether an oracle should be requested for a label. We empirically show that this approach is better able to deal with noisy oracles than the baseline methods. We now elucidate several future avenues worth exploring.

Incorporating Prior Information. The default mode for SoQal is deferral to an oracle. However, relevant a priori information, such as the degree of noise inherent in the oracle's labels, can be incorporated to alter either the default mode or the Hellinger threshold, S.

Incorporating Multiple Oracles. In this work, we focused on the presence of a single oracle. Scenarios in which multiple oracles exist may better reflect clinical environments, which include multiple experts of varying levels of competency. Therefore, dynamically querying these oracles might be of interest.



2. Selective oracle questioning: we propose a dynamic strategy which learns, for an acquired unlabelled instance, whether to request a label from an oracle or to generate a pseudo-label instead.



Figure 1: Labelled instances from two classes and unlabelled instances (gray), alongside the version space of (a) MCD, where each MC sample is viewed as a distinct hypothesis (decision boundary), (b) MCP, where there is one hypothesis but several perturbations of the unlabelled instance, and (c) BALC, where there are several hypotheses in addition to the unlabelled instance and its perturbed counterpart.

Figure 2: (Scenario 1) network output variations (for three classes, A, B, and C) caused primarily by input perturbations. The red rectangle illustrates a potential limitation of MCD: given that these parameter perturbations do not result in network output variations and MCD is dependent upon said perturbations, unlabelled instances can be erroneously deemed uninformative. (Scenario 2) network output variations caused by both input and parameter perturbations. We show that while BALC_KLD is likely to acquire instances due to input perturbations, BALC_JSD considers both input and parameter perturbations when performing acquisitions.

Figure 4: Density of the outputs, t, of h_θ, colour-coded based on the zero-one classification error during the (a) early and (b) late stages of training. (c) The Hellinger distance, D_H, between the two distributions of t during training increases as they become more separable.

Experiments were implemented in PyTorch (Paszke et al., 2019) on four publicly-available cardiac datasets, consisting of time-series such as the photoplethysmogram (PPG) and the electrocardiogram (ECG) alongside cardiac arrhythmia labels, and on CIFAR10. We use D_1 = PhysioNet 2015 PPG, D_2 = PhysioNet 2015 ECG (Clifford et al., 2015) (5-way), D_3 = PhysioNet 2017 ECG (Clifford et al., 2017) (4-way), D_4 = Cardiology ECG (Hannun et al., 2019) (12-way), and D_5 = CIFAR10 (Krizhevsky et al., 2009) (10-way).


Figure 5: Validation AUC on (a) D_2 at β = 0.3. Mean percent change in test AUC when comparing MCP with static and tracked acquisition functions to MCD with their static counterparts on (b) D_1 and (c) D_5. We show results for Var Ratio, Entropy, and BALD, at all fractions β ∈ {0.1, 0.3, 0.5, 0.7, 0.9} and across five seeds.

Table 1: Mean test AUC of oracle questioning strategies in the presence of a noise-free oracle. Results are shown for a subset of the acquisition functions on D_1-D_5 and are averaged across five seeds.

