ON THE GEOMETRY OF DEEP BAYESIAN ACTIVE LEARNING

Anonymous authors
Paper under double-blind review

Abstract

We present geometric Bayesian active learning by disagreements (GBALD), a framework that performs BALD through its geometric interpretation while interacting with a deep learning model. GBALD has two main components: initial acquisitions based on core-set construction, and model uncertainty estimation seeded with those initial acquisitions. Our key innovation is to construct the core-set on an ellipsoid, rather than the typical sphere, preventing its updates from drifting towards the boundary regions of the distribution. The main improvements over BALD are twofold: relieved sensitivity to an uninformative prior, and reduced redundant information in the model-uncertainty estimates. To guarantee these improvements, our generalization analysis proves that, compared to the typical Bayesian spherical interpretation, geodesic search with an ellipsoid derives a tighter lower error bound and achieves a higher probability of obtaining a nearly zero error. Experiments on acquisitions under several scenarios demonstrate that, while suffering only slight perturbations under noisy and repeated samples, GBALD achieves significant accuracy improvements over BALD, BatchBALD and other baselines.

1. INTRODUCTION

Lack of training labels restricts the performance of deep neural networks (DNNs), even though the price of GPU resources has been falling fast. Recently, leveraging the abundance of unlabeled data has become a potential solution to this bottleneck, whereby expert knowledge is solicited to annotate the unlabeled data. In this setting, the deep learning community introduced active learning (AL) (Gal et al., 2017), which maximizes the model uncertainty (Ashukha et al., 2019; Lakshminarayanan et al., 2017) to acquire a set of highly informative or representative unlabeled data and then solicits experts' annotations. During this AL process, the learning model tries to achieve the desired accuracy using minimal data labeling. The recent spread of model uncertainty across many fields, such as Bayesian neural networks (Blundell et al., 2015), Monte-Carlo (MC) dropout (Gal & Ghahramani, 2016), and Bayesian core-set construction (Sener & Savarese, 2018), shows that new scenarios arise from deep Bayesian AL (Pinsler et al., 2019; Kirsch et al., 2019). Bayesian AL (Golovin et al., 2010; Jedoui et al., 2019) presents an expressive probabilistic interpretation of model uncertainty (Gal & Ghahramani, 2016). Theoretically, for a simple regression model such as a linear, logistic, or probit model, AL can derive a closed form for updating one sparse subset that maximally reduces the uncertainty of the posteriors over the regression parameters (Pinsler et al., 2019). However, for a DNN model, optimizing massive numbers of training parameters is not easily tractable. Bayesian approximation therefore provides alternatives, including importance sampling (Doucet et al., 2000) and Frank-Wolfe optimization (Vavasis, 1992). With importance sampling, a typical approach is to express the information gain in terms of the predictive entropy of the model; this approach is called Bayesian active learning by disagreements (BALD) (Houlsby et al., 2011).
BALD has two interpretations: model uncertainty estimation and core-set construction. To estimate the model uncertainty, a greedy strategy is applied to select those data that maximize the parameter disagreements between the current training model and its subsequent updates, as in (Gal et al., 2017). However, naively running BALD with an uninformative prior (Strachan & Van Dijk, 2003; Price & Manson, 2002), which is designed to reflect a balance among outcomes when no information is available, leads to unstable, biased acquisitions (Gao et al., 2020), e.g. under insufficient prior labels. Moreover, the similarity or consistency of those acquisitions with previously acquired samples brings redundant information to the model and decelerates its training. Core-set construction (Campbell & Broderick, 2018) avoids the greedy interaction with the model by capturing characteristics of the data distribution. By modeling the complete data posterior over the parameter distribution, BALD can be deemed a core-set construction process on a sphere (Kirsch et al., 2019), which seamlessly solicits a compact subset to approximate the input data distribution and efficiently mitigates the sensitivity to an uninformative prior and to redundant information. From the view of geometry, the updates of core-set construction are usually optimized with a spherical geodesic, as in (Nie et al., 2013; Wang et al., 2019).

Figure 1: BALD has two types of interpretation: model uncertainty estimation and core-set construction, where the deeper the color of a core-set element, the higher its representativeness; GBALD integrates them into a uniform framework. Stage 1: core-set construction with an ellipsoid, not the typical sphere, represents the original distribution to initialize the input features of the DNN. Stage 2: model uncertainty estimation with those initial acquisitions then derives highly informative and representative samples for the DNN.
Once the core-set is obtained, deep AL immediately seeks annotations from experts and starts training. However, data points located in the boundary regions of the distribution, which are usually close to uniformly distributed, cannot be highly representative candidates for the core-set. Therefore, constructing the core-set on a sphere may not be the optimal choice for deep AL. This paper presents a novel AL framework, namely Geometric BALD (GBALD), built on the geometric interpretation of BALD: interpreting BALD as core-set construction on an ellipsoid, it initializes an effective representation to drive a DNN model. The goal is to achieve significant accuracy improvements in spite of an uninformative prior and redundant information. Figure 1 describes this two-stage framework. In the first stage, geometric core-set construction on an ellipsoid initializes effective acquisitions to start a DNN model regardless of the uninformative prior. Taking the core-set as the input features, the next stage ranks the batch acquisitions of model uncertainty according to their geometric representativeness, and then solicits the most highly representative examples from the batch. Under these representation constraints, the ranked acquisitions reduce the probability of sampling near the previous acquisitions, preventing redundant acquisitions. To guarantee the improvement, our generalization analysis shows that the lower bound on the generalization error of AL with the ellipsoid is provably tighter than that of AL with the sphere, and that achieving a nearly zero generalization error with the ellipsoid provably has a higher probability. The contributions of this paper can be summarized from geometric, algorithmic, and theoretical perspectives.

• Geometrically, our key innovation is to construct the core-set on an ellipsoid, not the typical sphere, preventing its updates from drifting towards the boundary regions of the distribution.
• In terms of algorithm design, from a Bayesian perspective we propose a two-stage framework that sequentially introduces the core-set representation and the model uncertainty, strengthening their performance "independently". Moreover, differently from the typical BALD optimizations, we present geometric solvers to construct the core-set and estimate the model uncertainty, which yield a different view of Bayesian active learning.

• Theoretically, to guarantee those improvements, our generalization analysis proves that, compared to the typical Bayesian spherical interpretation, geodesic search with an ellipsoid derives a tighter lower error bound and achieves a higher probability of obtaining a nearly zero error. See Appendix B.

The rest of this paper is organized as follows. Section 2 reviews related work. Sections 3 and 4 elaborate BALD and GBALD, respectively. Experimental results are presented in Section 5. Finally, we conclude in Section 6.

2. RELATED WORK

Model uncertainty. In the deep learning community, AL (Cohn et al., 1994) was introduced to improve the training of a DNN model by annotating unlabeled data, where the data that maximize the model uncertainty (Lakshminarayanan et al., 2017) are the primary acquisitions. For example, in ensemble deep learning (Ashukha et al., 2019), out-of-domain uncertainty estimation selects those data that do not follow the same distribution as the input training data, while in-domain uncertainty draws data from the original input distribution, producing reliable probability estimates. Gal & Ghahramani (2016) approximate such uncertainty with MC dropout.

Bayesian AL. From a Bayesian perspective (Golovin et al., 2010), AL can be deemed as minimizing the Bayesian posterior risk with multiple label acquisitions over the input unlabeled data. A potential informative approach is to reduce the uncertainty about the parameters using Shannon's entropy (Tang et al., 2002). This can be interpreted as seeking the acquisitions for which the Bayesian parameters under the posterior disagree most about the outcome, so this acquisition algorithm is referred to as Bayesian active learning by disagreement (BALD) (Houlsby et al., 2011).

Deep AL. Recently, deep Bayesian AL has attracted much attention. Gal et al. (2017) proposed to combine BALD with a DNN to improve training. The unlabeled data that maximize the model uncertainty provide positive feedback. However, the model needs to be updated repeatedly until the acquisition budget is exhausted. To improve acquisition efficiency, batch sampling with BALD is applied in (Kirsch et al., 2019; Pinsler et al., 2019). In BatchBALD, Kirsch et al. (2019) developed a tractable approximation to the mutual information between a batch of unlabeled data and the current model parameters. However, these uncertainty evaluations of Bayesian AL, whether over single or batch acquisitions, all take greedy strategies, which lead to computationally infeasible or erratic parameter estimations.
For deep Bayesian AL, a shortage of interactions with the DNN cannot maximally drive model performance, as in (Pinsler et al., 2019; Sener & Savarese, 2018), etc.

3. BALD

BALD has two different interpretations: model uncertainty estimation and core-set construction. We briefly introduce them in this section.

3.1. MODEL UNCERTAINTY ESTIMATION

We consider a discriminative model $p(y \mid x, \theta)$ parameterized by $\theta$ that maps $x \in \mathcal{X}$ into an output distribution over a set of $y \in \mathcal{Y}$. Given an initial labeled (training) set $\mathcal{D}_0 \subseteq \mathcal{X} \times \mathcal{Y}$, Bayesian inference over this parameterized model estimates the posterior $p(\theta \mid \mathcal{D}_0)$, i.e. estimates $\theta$ by repeatedly updating $\mathcal{D}_0$. AL adopts this setting from a Bayesian view. With AL, the learner can choose unlabeled data from $\mathcal{D}_u = \{x_i\}_{i=1}^{N} \subseteq \mathcal{X}$ to observe the outputs of the current model, maximizing the uncertainty of the model parameters. Houlsby et al. (2011) proposed a greedy strategy termed BALD to update $\mathcal{D}_0$ by estimating a desired data point $x^*$ that maximizes the decrease in expected posterior entropy:
$$x^* = \arg\max_{x \in \mathcal{D}_u} \; \mathrm{H}[\theta \mid \mathcal{D}_0] - \mathbb{E}_{y \sim p(y \mid x, \mathcal{D}_0)} \mathrm{H}[\theta \mid x, y, \mathcal{D}_0], \quad (1)$$
where the labeled and unlabeled sets are updated by $\mathcal{D}_0 = \mathcal{D}_0 \cup \{x^*, y^*\}$ and $\mathcal{D}_u = \mathcal{D}_u \setminus \{x^*\}$, and $y^*$ denotes the output of $x^*$. In deep AL, $y^*$ can be annotated with a label from experts, and $\theta$ yields a DNN model.
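As a concrete (illustrative, not the authors') sketch, the BALD objective of Eq. (1) is usually estimated with MC dropout via the equivalent mutual-information form $\mathrm{H}[y \mid x, \mathcal{D}_0] - \mathbb{E}_{\theta}\mathrm{H}[y \mid x, \theta]$; the array shapes and function names below are our own assumptions:

```python
import numpy as np

def bald_scores(probs):
    """BALD mutual-information scores from MC-dropout samples.

    probs: array of shape (T, N, C) -- T stochastic forward passes,
    N candidate points, C classes.
    Returns N scores of the disagreement I[y; theta | x, D_0].
    """
    eps = 1e-12
    mean_p = probs.mean(axis=0)                                # (N, C)
    # Entropy of the predictive (marginal) distribution: H[y | x, D_0]
    h_pred = -(mean_p * np.log(mean_p + eps)).sum(axis=1)      # (N,)
    # Expected entropy over posterior samples: E_theta H[y | x, theta]
    h_exp = -(probs * np.log(probs + eps)).sum(axis=2).mean(axis=0)
    return h_pred - h_exp

def bald_acquire(probs):
    """Index of the single point maximizing the BALD disagreement."""
    return int(np.argmax(bald_scores(probs)))
```

A point on which the stochastic passes confidently disagree receives a high score, while a point with identical predictions across passes scores (near) zero.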

3.2. CORE-SET CONSTRUCTION

Let $p(\theta \mid \mathcal{D}_0)$ be updated through its log posterior $\log p(\theta \mid \mathcal{D}_0, x^*, y^*)$, $y^* \in \{y_i\}_{i=1}^N$, and assume the outputs are conditionally independent given the inputs, i.e. $p(y^* \mid x^*, \mathcal{D}_0) = \int_\theta p(y^* \mid x^*, \theta)\, p(\theta \mid \mathcal{D}_0)\, d\theta$. Then the complete data log posterior follows (Pinsler et al., 2019):
$$\mathbb{E}_{y^*}[\log p(\theta \mid \mathcal{D}_0, x^*, y^*)] = \mathbb{E}_{y^*}[\log p(\theta \mid \mathcal{D}_0) + \log p(y^* \mid x^*, \theta) - \log p(y^* \mid x^*, \mathcal{D}_0)]$$
$$= \log p(\theta \mid \mathcal{D}_0) + \mathbb{E}_{y^*}\log p(y^* \mid x^*, \theta) + \mathrm{H}[y^* \mid x^*, \mathcal{D}_0]$$
$$= \log p(\theta \mid \mathcal{D}_0) + \sum_{i=1}^{N} \Big( \mathbb{E}_{y_i}\log p(y_i \mid x_i, \theta) + \mathrm{H}[y_i \mid x_i, \mathcal{D}_0] \Big). \quad (2)$$
The key idea of core-set construction is to approximate the log posterior of Eq. (2) by a subset $\mathcal{D}'_u \subseteq \mathcal{D}_u$ such that
$$\mathbb{E}_{Y_u}[\log p(\theta \mid \mathcal{D}_0, \mathcal{D}_u, Y_u)] \approx \mathbb{E}_{Y'_u}[\log p(\theta \mid \mathcal{D}_0, \mathcal{D}'_u, Y'_u)],$$
where $Y_u$ and $Y'_u$ denote the predictive labels of $\mathcal{D}_u$ and $\mathcal{D}'_u$ respectively under the Bayesian discriminative model, that is, $p(Y_u \mid \mathcal{D}_u, \mathcal{D}_0) = \int_\theta p(Y_u \mid \mathcal{D}_u, \theta)\, p(\theta \mid \mathcal{D}_0)\, d\theta$ and $p(Y'_u \mid \mathcal{D}'_u, \mathcal{D}_0) = \int_\theta p(Y'_u \mid \mathcal{D}'_u, \theta)\, p(\theta \mid \mathcal{D}_0)\, d\theta$. Here $\mathcal{D}'_u$ can be indicated by a core-set (Pinsler et al., 2019) that highly represents $\mathcal{D}_u$. Optimization techniques such as Frank-Wolfe optimization (Vavasis, 1992) can then be adopted to solve this problem.

Motivations. Eqs. (1) and (2) provide the Bayesian rules of BALD for model uncertainty and core-set construction respectively, which have further attracted the attention of the deep learning community. However, the two interpretations of BALD are limited by 1) redundant information and 2) an uninformative prior, where one major cause of both issues is a poor initialization of the prior, i.e. $p(\mathcal{D}_0 \mid \theta)$. For example, an unbalanced label initialization of $\mathcal{D}_0$ usually leads to an uninformative prior, which further drives the acquisitions of AL to select unlabeled data from one or a few fixed classes; highly biased results with redundant information (Gao et al., 2020) are then inevitable. Therefore, these two limitations affect each other.
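To make the Frank-Wolfe route mentioned above concrete, here is a minimal sketch of our own (not the paper's implementation): each point's log-likelihood contribution is represented by a finite-dimensional vector (the rows of `Phi`), standing in for the Hilbert-space embedding of Campbell & Broderick, and a sparse weight vector is built over the scaled simplex $\{w \geq 0, \sum_i w_i = N\}$, whose vertices keep the iterates sparse:

```python
import numpy as np

def frank_wolfe_coreset(Phi, iters=50):
    """Frank-Wolfe sketch for the core-set objective min_w ||L - L(w)||^2.

    Phi: (N, d) array whose rows are per-point log-likelihood feature
    vectors; L = Phi.sum(axis=0) is the full-likelihood vector and
    L(w) = Phi.T @ w its weighted counterpart.  Each Frank-Wolfe step
    adds at most one new support point, so the returned w is sparse.
    """
    N = Phi.shape[0]
    L = Phi.sum(axis=0)
    # Start at the vertex N*e_i best aligned with L (a 1-point core-set).
    i0 = int(np.argmax(Phi @ L))
    w = np.zeros(N)
    w[i0] = N
    for _ in range(iters):
        r = L - Phi.T @ w                  # residual L - L(w)
        i = int(np.argmax(Phi @ r))        # vertex minimizing the linearization
        d = -w.copy()
        d[i] += N                          # direction toward vertex N*e_i
        Pd = Phi.T @ d
        denom = Pd @ Pd
        if denom < 1e-12:
            break
        # Exact line search for the quadratic objective, clipped to [0, 1].
        gamma = float(np.clip((Pd @ r) / denom, 0.0, 1.0))
        w += gamma * d
    return w
```

The sketch keeps the two defining properties of the Hilbert core-set construction: feasibility on the scaled simplex (so the weighted likelihood stays calibrated to the full one) and sparsity growing by at most one point per iteration.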
4. GBALD

GBALD consists of two components: initial acquisitions based on core-set construction and model uncertainty estimation with those initial acquisitions.

4.1. GEOMETRIC INTERPRETATION OF CORE-SET

Modeling the complete data posterior over the parameter distribution can relieve the above two limitations of BALD. Typically, finding the acquisitions of AL is equivalent to approximating a core-set centered with spherical embeddings (Sener & Savarese, 2018). Let $w_i$ be the sampling weight of $x_i$ with $\|w\|_0 \leq N$; core-set construction then optimizes
$$\min_w \Big\| \underbrace{\sum_{i=1}^N \big( \mathbb{E}_{y_i}\log p(y_i \mid x_i, \theta) + \mathrm{H}[y_i \mid x_i, \mathcal{D}_0] \big)}_{L} - \underbrace{\sum_{i=1}^N w_i \big( \mathbb{E}_{y_i}\log p(y_i \mid x_i, \theta) + \mathrm{H}[y_i \mid x_i, \mathcal{D}_0] \big)}_{L(w)} \Big\|^2, \quad (3)$$
where $L$ and $L(w)$ denote the full and expected (weighted) log-likelihoods, respectively (Campbell & Broderick, 2018; 2019). Specifically, $\sum_{i=1}^N \mathrm{H}[y_i \mid x_i, \mathcal{D}_0] = -\sum_{i=1}^N \sum_{y_i} p(y_i \mid x_i, \mathcal{D}_0) \log p(y_i \mid x_i, \mathcal{D}_0)$, where $p(y_i \mid x_i, \mathcal{D}_0) = \int_\theta p(y_i \mid x_i, \theta)\, p(\theta \mid \mathcal{D}_0)\, d\theta$, and $\|\cdot\|$ denotes the $\ell_2$ norm. The approximation of Eq. (3) implicitly requires that the complete data log posterior of Eq. (2) w.r.t. $L$ be close to an expected posterior w.r.t. $L(w)$, such that approximating a sparse subset of the original inputs by spherical geodesic search is feasible (see Figure 2(a)). Generally, solving this optimization is intractable due to the cardinality constraint (Pinsler et al., 2019). Campbell & Broderick (2019) proposed to relax the constraint in Frank-Wolfe optimization, in which the mapping of $\mathcal{X}$ is usually performed in a Hilbert space (HS) with a bounded inner product operation. In this solution, the sphere embedded in the HS replaces the cardinality constraint with a polynomial constraint. However, the initialization of $\mathcal{D}_0$ affects the iterative approximation to $\mathcal{D}_u$ at the start of the geodesic search. Moreover, the posterior $p(\theta \mid \mathcal{D}_0)$ is uninformative if the initialized $\mathcal{D}_0$ is empty or incorrect. Therefore, the typical Bayesian core-set construction of BALD cannot ideally fit an uninformative prior. Another geometric interpretation of core-set construction, such as k-centers (Sener & Savarese, 2018), is not restricted by this setting.
We thus follow the k-centers construction to find the core-set.

k-centers. Sener & Savarese (2018) proposed a core-set representation approach for active deep learning based on k-centers. This approach can be adopted for the core-set construction of BALD without the help of the discriminative model; the uninformative prior therefore has no further influence on the core-set. Typically, the k-centers approach uses a greedy strategy to search for the data point $\tilde{x}$ whose nearest distance to the elements of $\mathcal{D}_0$ is maximal:
$$\tilde{x} = \arg\max_{x_i \in \mathcal{D}_u} \min_{c_j \in \mathcal{D}_0} \|x_i - c_j\|, \quad (4)$$
after which $\mathcal{D}_0$ is updated to $\mathcal{D}_0 \cup \{\tilde{x}, \tilde{y}\}$ and $\mathcal{D}_u$ to $\mathcal{D}_u \setminus \{\tilde{x}\}$, where $\tilde{y}$ denotes the output of $\tilde{x}$. This max-min operation is usually performed $k$ times to construct the centers. From the view of geometry, k-centers can be deemed core-set construction via spherical geodesic search (Bādoiu et al., 2002; Har-Peled & Mazumdar, 2004). Specifically, the max-min optimization guides $\mathcal{D}_0$ to be updated towards a data point that draws the longest line segment from any $x_i$ across the sphere center; the iterative update on $\tilde{x}$ then proceeds along its unique diameter through the sphere center. However, this greedy optimization has a large probability of making the core-set fall into the boundary regions of the sphere, which cannot capture the characteristics of the distribution.
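The greedy max-min rule of Eq. (4) can be sketched in a few lines; this is an illustrative implementation under our own naming (Euclidean embeddings, NumPy arrays), not the authors' code:

```python
import numpy as np

def k_center_greedy(X, init_idx, k):
    """Greedy k-center selection in the style of Sener & Savarese (2018).

    X: (N, d) array of (embedded) inputs; init_idx: indices of the
    initially labeled set D_0; k: number of centers to acquire.
    Each step acquires the point whose distance to its nearest
    already-chosen center is maximal (the max-min rule of Eq. (4)).
    """
    centers = list(init_idx)
    # Minimum distance of every point to the current center set.
    d_min = np.min(
        np.linalg.norm(X[:, None, :] - X[centers][None, :, :], axis=2),
        axis=1,
    )
    chosen = []
    for _ in range(k):
        j = int(np.argmax(d_min))          # max-min acquisition
        chosen.append(j)
        centers.append(j)
        # Incrementally refresh nearest-center distances.
        d_min = np.minimum(d_min, np.linalg.norm(X - X[j], axis=1))
    return chosen
```

Note that, exactly as the text warns, the first acquisitions are the points farthest from everything already labeled, which tends to pick boundary points of the distribution.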

4.2. INITIAL ACQUISITIONS BASED ON CORE-SET CONSTRUCTION

We present a novel greedy search that rescales the geodesic of a sphere into an ellipsoid following Eq. (4), in which the iterative update of the geodesic search is rescaled (see Figure 2(b)). We follow the importance sampling strategy to begin the search.

Initial prior on geometry. Initializing $p(\mathcal{D}_0 \mid \theta)$ is performed with a group of internal spheres centered at $D_j$, $\forall j$, subject to $D_j \in \mathcal{D}_0$, in which the geodesic between $\mathcal{D}_0$ and the unlabeled data passes over those spheres. Since $\mathcal{D}_0$ is known, the specification of $\theta$ plays the key role in initializing $p(\mathcal{D}_0 \mid \theta)$. Given a radius $R_0$ for any observed internal sphere, $p(y_i \mid x_i, \theta)$ is first defined by
$$p(y_i \mid x_i, \theta) = \begin{cases} 1, & \exists j,\ \|x_i - D_j\| \leq R_0, \\[4pt] \max\limits_{j} \left\{ \dfrac{R_0}{\|x_i - D_j\|} \right\}, & \forall j,\ \|x_i - D_j\| > R_0, \end{cases} \quad (5)$$
so that $\theta$ yields the parameter $R_0$. When a data point is enclosed by a ball, the probability in Eq. (5) is 1. A data point outside every ball is given the probability $\max_j R_0 / \|x_i - D_j\|$, constrained by $\min_j \|x_i - D_j\|$, i.e. the probability is assigned by the ball nearest to $x_i$, centered at some $D_j$. From Eq. (3), the information entropy of $y_i \sim \{y_1, y_2, ..., y_N\}$ over $x_i \sim \{x_1, x_2, ..., x_N\}$ can be expressed as the integral of $p(y_i \mid x_i, \theta)$:
$$\sum_{i=1}^N \mathrm{H}[y_i \mid x_i, \mathcal{D}_0] = -\sum_{i=1}^N \int_\theta p(y_i \mid x_i, \theta)\, p(\theta \mid \mathcal{D}_0)\, d\theta \, \log \int_\theta p(y_i \mid x_i, \theta)\, p(\theta \mid \mathcal{D}_0)\, d\theta, \quad (6)$$
which can be approximated by $-\sum_{i=1}^N p(y_i \mid x_i, \theta) \log p(y_i \mid x_i, \theta)$ following the details of Eq. (3). In short, this approximates the entropy over the entire outputs on $\mathcal{D}_u$, assuming the prior $p(\mathcal{D}_0 \mid \theta)$ w.r.t. $p(y_i \mid x_i, \theta)$ is already known from Eq. (5).

Max-min optimization. Recalling the max-min optimization of k-centers in the core-set construction of (Sener & Savarese, 2018), the minimizer of Eq. (3) can be divided into two parts: $\min_{x^*} L$ and $\max_w L(w)$, where $\mathcal{D}_0$ is updated by acquiring $x^*$. However, the updates of $\mathcal{D}_0$ determine the minimizer of $L$ with respect to the internal spheres centered at $D_j$, $\forall j$.
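The piecewise prior of Eq. (5) has a direct translation to code. The sketch below is our own minimal rendering (Euclidean distances, NumPy arrays; names are assumptions, not from the paper):

```python
import numpy as np

def geometric_prior(x, centers, R0):
    """p(y | x, theta) under the internal-sphere prior of Eq. (5).

    centers: (m, d) array of labeled centers D_j; R0: common radius.
    Returns 1 if x lies inside any sphere of radius R0 around a
    center; otherwise R0 over the distance to the *nearest* center
    (the max over j of R0 / ||x - D_j||).
    """
    d = np.linalg.norm(centers - x, axis=1)
    if np.any(d <= R0):
        return 1.0
    return float(np.max(R0 / d))  # nearest center gives the max ratio
```

The value decays towards 0 as $x$ moves away from every labeled center, which is what makes $\log p(y_i \mid x_i, \theta)$ usable as a representativeness term in the max-min search that follows.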
Therefore, minimizing $L$ should be constrained by an unbiased full likelihood over $\mathcal{X}$ to alleviate the potential biases from the initialization of $\mathcal{D}_0$. Let $L_0$ denote the unbiased full likelihood over $\mathcal{X}$, which stipulates $\mathcal{D}_0$ as the k-means centers of $\mathcal{X}$, written as $U$, which jointly draw the input distribution. We define $L_0 = \sum_{i=1}^N \mathbb{E}_{y_i}\big[\log p(y_i \mid x_i, \theta) + \mathrm{H}[y_i \mid x_i, U]\big]$ to regulate $L$, that is,
$$\min_{x^*} \|L_0 - L\|^2, \quad \text{s.t. } \mathcal{D}_0 = \mathcal{D}_0 \cup \{x^*, y^*\},\ \mathcal{D}_u = \mathcal{D}_u \setminus \{x^*\}. \quad (7)$$
The other sub-optimizer is $\max_w L(w)$. We present a greedy strategy following Eq. (1):
$$\max_{1 \leq i \leq N} \min_{w_i} \sum_{i=1}^N w_i \mathbb{E}_{y_i}\big[\log p(y_i \mid x_i, \theta) + \mathrm{H}[y_i \mid x_i, \mathcal{D}_0]\big] = \sum_{i=1}^N w_i \log p(y_i \mid x_i, \theta) - \sum_{i=1}^N w_i p(y_i \mid x_i, \theta) \log p(y_i \mid x_i, \theta), \quad (8)$$
which can be further written as $\sum_{i=1}^N w_i \log p(y_i \mid x_i, \theta)\big(1 - p(y_i \mid x_i, \theta)\big)$. Letting $w_i = 1, \forall i$ for an unbiased estimate of the likelihood $L(w)$, Eq. (8) can be simplified as
$$\max_{x_i \in \mathcal{D}_u} \min_{D_j \in \mathcal{D}_0} \log p(y_i \mid x_i, \theta), \quad (9)$$
where $p(y_i \mid x_i, \theta)$ follows Eq. (5). Combining Eqs. (7) and (9), the optimization of Eq. (3) is transformed into
$$x^* = \arg\max_{x_j \in \mathcal{D}_u} \min_{D_j \in \mathcal{D}_0} \left\{ \|L_0 - L\|^2 + \log p(y_j \mid x_j, \theta) \right\}, \quad (10)$$
where $\mathcal{D}_0$ is updated by acquiring $x^*$, i.e. $\mathcal{D}_0 = \mathcal{D}_0 \cup \{x^*, y^*\}$.

Geodesic line. For a metric geometry $M$, a geodesic line is a curve $\gamma$ that projects its interval $I$ to $M$: $I \to M$, remaining everywhere locally a distance minimizer (Lou et al., 2020). Given a constant $\nu > 0$, for any $a, b \in I$ there exists a geodesic distance
$$d(\gamma(a), \gamma(b)) := \int_a^b g_{\gamma(t)}\big(\gamma'(t), \gamma'(t)\big)\, dt,$$
where $\gamma'(t)$ denotes the derivative of $\gamma$, and $g$ denotes the metric tensor over $M$. Here, taking zero curvature so that $g_{\gamma(t)}(\cdot, \cdot) \equiv 1$, $d(\gamma(a), \gamma(b))$ can be generalized as a segment of a straight line: $d(\gamma(a), \gamma(b)) = \|a - b\|$.

Ellipsoid geodesic distance. For any observation points $p, q \in M$, the spherical geodesic distance is defined as $d(\gamma(p), \gamma(q)) = \|p - q\|$.
The affine projection yields its ellipsoid interpretation, $d(\gamma(p), \gamma(q)) = \|\eta (p - q)\|$, where $\eta$ denotes the affine factor subject to $0 < \eta < 1$.

Optimizing with ellipsoid geodesic search. The max-min optimization of Eq. (10) is performed on an ellipsoid geometry to prevent the updates of the core-set from drifting towards the boundary regions, where the ellipsoid geodesic line rescales the original update on the sphere. Assume $x_i$ is the previous acquisition and $x^*$ is the next desired acquisition; the ellipsoid geodesic rescales the position of $x^*$ as $x^*_e = x_i + \eta(x^* - x_i)$. We then update this position $x^*_e$ to its nearest neighbor $x_j$ in the unlabeled data pool, i.e.
$$\arg\min_{x_j \in \mathcal{D}_u} \|x_j - x^*_e\| = \arg\min_{x_j \in \mathcal{D}_u} \big\|x_j - [x_i + \eta(x^* - x_i)]\big\|. \quad (11)$$
To study the advantage of ellipsoid geodesic search, Appendix B presents our generalization analysis.
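The rescale-and-snap step of Eq. (11) can be sketched as follows; this is an illustrative rendering under our own assumptions (a fixed `eta`, a finite pool held as a NumPy array):

```python
import numpy as np

def ellipsoid_update(x_prev, x_star, pool, eta=0.5):
    """Rescale a spherical geodesic step onto the ellipsoid, then snap
    back to the unlabeled pool (Eq. (11)).

    x_prev: previous acquisition; x_star: candidate from the max-min
    search; pool: (M, d) array of unlabeled points; 0 < eta < 1 is the
    affine factor.  Shrinking the step by eta keeps the update away
    from the boundary regions a pure max-min search would reach.
    Returns the index of the pool point nearest to the rescaled
    target x_e = x_prev + eta * (x_star - x_prev).
    """
    x_e = x_prev + eta * (x_star - x_prev)
    return int(np.argmin(np.linalg.norm(pool - x_e, axis=1)))
```

With `eta` close to 1 the behavior approaches the spherical (k-centers) update; smaller `eta` pulls acquisitions towards the interior of the distribution.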

4.3. MODEL UNCERTAINTY ESTIMATION WITH CORE-SET

GBALD starts the model uncertainty estimation with those initial core-set acquisitions, and introduces a ranking scheme to derive acquisitions that are both informative and representative.

Single acquisition. We follow (Gal et al., 2017) and use MC dropout to perform Bayesian inference on the neural network model. Ranking the informative acquisitions in batch sequences is then highly efficient. We first present the ranking criterion by rewriting Eq. (1) with batch returns:
$$\{x^*_1, x^*_2, ..., x^*_b\} = \arg\max_{\{\hat{x}_1, ..., \hat{x}_b\} \subseteq \mathcal{D}_u} \mathrm{H}[\theta \mid \mathcal{D}_0] - \mathbb{E}_{\hat{y}_{1:b} \sim p(\hat{y}_{1:b} \mid \hat{x}_{1:b}, \mathcal{D}_0)} \mathrm{H}[\theta \mid \hat{x}_{1:b}, \hat{y}_{1:b}, \mathcal{D}_0], \quad (12)$$
where $\hat{x}_{1:b} = \{\hat{x}_1, ..., \hat{x}_b\}$, $\hat{y}_{1:b} = \{\hat{y}_1, ..., \hat{y}_b\}$, and $\hat{y}_i$ denotes the output of $\hat{x}_i$. The informative acquisition $x^*_t$ is then selected from the ranked batch acquisitions $\hat{x}_{1:b}$ as having the highest representativeness for the unlabeled data:
$$x^*_t = \arg\max_{x^*_i \in \{x^*_1, x^*_2, ..., x^*_b\}} \left\{ \max_{D_j \in \mathcal{D}_0} p(y_i \mid x^*_i, \theta) := \frac{R_0}{\|x^*_i - D_j\|} \right\}, \quad (13)$$
where $t$ denotes the index of the final acquisition, subject to $1 \leq t \leq b$. This also adopts the max-min optimization of k-centers in Eq. (4), i.e. $x^*_t = \arg\max_{x^*_i \in \{x^*_1, ..., x^*_b\}} \min_{D_j \in \mathcal{D}_0} \|x^*_i - D_j\|$.

Batch acquisitions. The greedy strategy of Eq. (13) can be written as batch acquisitions by setting its output to a batch set, i.e.
$$\{x^*_{t_1}, ..., x^*_{t_{b'}}\} = \arg\max_{x^*_{t_1:t_{b'}} \subseteq \{x^*_1, ..., x^*_b\}} p\big(y^*_{t_1:t_{b'}} \mid x^*_{t_1:t_{b'}}, \theta\big), \quad (14)$$
where $x^*_{t_1:t_{b'}} = \{x^*_{t_1}, ..., x^*_{t_{b'}}\}$, $y^*_{t_1:t_{b'}} = \{y^*_{t_1}, ..., y^*_{t_{b'}}\}$, $y^*_{t_i}$ denotes the output of $x^*_{t_i}$, $1 \leq i \leq b'$, and $1 \leq b' \leq b$. This setting can be used to accelerate the acquisitions of AL on a large dataset. Appendix A presents the two-stage GBALD algorithm.
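The ranking step can be sketched in a few lines. We implement the max-min form that the text equates with k-centers (preferring candidates far from every labeled center, which discourages sampling near previous acquisitions); array layouts and names are our own assumptions:

```python
import numpy as np

def rank_representative(batch, centers, b_prime):
    """Keep the b' most representative of b informative candidates.

    batch: (b, d) array of the informative batch returns of Eq. (12);
    centers: (m, d) array of already-labeled points D_0.  Following the
    max-min reading of Eq. (13), a candidate ranks higher when its
    distance to the nearest labeled center, min_j ||x*_i - D_j||, is
    larger, so the selected b' points avoid previous acquisitions.
    """
    d_near = np.min(
        np.linalg.norm(batch[:, None, :] - centers[None, :, :], axis=2),
        axis=1,
    )
    order = np.argsort(-d_near)            # descending max-min distance
    return [int(i) for i in order[:b_prime]]
```

Composing this with the BALD batch of Eq. (12) gives the two-criterion selection: first informativeness (uncertainty), then geometric representativeness within the informative batch.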

5. EXPERIMENTS

In the experiments, we start by showing how BALD degenerates under an uninformative prior and redundant information, and then show how our proposed GBALD relieves these limitations. Our experiments address three questions: 1) Is GBALD, using the core-set of Eq. (11), competitive under an uninformative prior? 2) Can GBALD, using the ranking of Eq. (14), improve the informative acquisitions of model uncertainty? 3) Can GBALD outperform state-of-the-art acquisition approaches? Following the experimental settings of (Gal et al., 2017; Kirsch et al., 2019), we use MC dropout to implement the Bayesian approximation of DNNs. Three benchmark datasets are selected: MNIST, SVHN and CIFAR-10. More experiments are presented in Appendix C.

5.1. UNINFORMATIVE PRIORS

As discussed in the introduction, BALD is sensitive to an uninformative prior, i.e. $p(\mathcal{D}_0 \mid \theta)$. We thus initialize $\mathcal{D}_0$ from a fixed class of the tested dataset to observe its acquisition performance. Figure 3 presents the prediction accuracies of BALD with an acquisition budget of 130 on the training set of MNIST, in which we randomly select 20 samples from the digits '0' and '1' to initialize $\mathcal{D}_0$, respectively. The classification model of AL follows a convolutional neural network with one block of [convolution, dropout, max-pooling, relu], with 32 3x3 convolution filters, 5x5 max pooling, and a 0.5 dropout rate. In the AL loops, we use 2,000 MC dropout samples from the unlabeled data pool to fit the training of the network, following (Kirsch et al., 2019). The results show that BALD accelerates the training model only slowly due to biased initial acquisitions, which do not uniformly cover all the label categories. Moreover, the uninformative prior leads BALD to unstable acquisition results. As shown in Figure 3(b), BALD with Batchsize = 10 performs better than with Batchsize = 1, while BALD in Figure 3(a) keeps stable performance. This is because the initial labeled data do not cover all classes, and BALD with Batchsize = 1 may be further misled into selecting samples from one or a few fixed classes in its first acquisitions; in contrast, Batchsize > 1 yields a more random acquisition process that possibly covers more diverse labels in its first acquisitions. Another erratic result of BALD is that increasing the batch size does not monotonically degrade its acquisition performance in Figure 3(b). Specifically, Batchsize = 10 ≻ Batchsize = 1 ≻ Batchsize = 20, 40 ≻ Batchsize = 30, where '≻' denotes 'better' performance, and Batchsize = 20 achieves results similar to Batchsize = 40. This undermines the acquisition policy of BALD: its performance would be expected to degenerate as the batch size increases, and it is sometimes worse than random sampling.
This is also why we use a core-set to start BALD in our framework. Differently from BALD, the core-set construction of GBALD using Eq. (11) provides a complete label matching against all classes. Therefore, it outperforms BALD with batch sizes of 1, 10, 20, 30, and 40. As the learning curves in Figure 3 show, GBALD with a batch size of 1 and a sequence size of 10 (i.e. breakpoints at acquired sizes 10, 20, ..., 130) achieves significantly higher accuracies than BALD with any batch size, since BALD misguides the network updates through a poor prior. Training the same parameterized CNN model as in Section 5.1, Figure 4 presents the acquisition performance of parameterized BALD and GBALD. As the learning curves show, BALD cannot accelerate the model as fast as GBALD due to repeated information over the acquisitions. GBALD, in contrast, ranks the batch acquisitions of highly informative samples and selects the most representative ones. By employing this ranking strategy, GBALD reduces the probability of sampling near the previous acquisitions; GBALD thus significantly outperforms BALD, even if we progressively increase the ranked batch size b. Each of the two components, Eqs. (11) and (14), has been demonstrated to achieve successful improvements over BALD. We thus combine these two components into a uniform framework. Figure 5 reports the AL accuracies using different acquisition algorithms on the three image datasets. The selected baselines follow (Gal et al., 2017), including 1) maximizing the variation ratios (Var), 2) BALD, 3) maximizing the entropy (Entropy), 4) k-medoids, and 5) the greedy k-centers approach (Sener & Savarese, 2018). The network architecture has three blocks of [convolution, dropout, max-pooling, relu], with 32, 64, and 128 3x3 convolution filters, 5x5 max pooling, and a 0.5 dropout rate.
In the AL loops, MC dropout still randomly samples 2,000 data points from the unlabeled data pool to approximate the training of the network architecture, following (Kirsch et al., 2019). The initial labeled sets of MNIST, SVHN and CIFAR-10 contain 20, 1,000 and 1,000 random samples from their full training sets, respectively. Details of the baselines are presented in Appendix C.1.

5.2. IMPROVED INFORMATIVE ACQUISITIONS

The batch size of the compared baselines is 100, and GBALD ranks 300 acquisitions to select 100 data points for training, i.e. b = 300, b' = 100. As the learning curves in Figure 5 show: 1) the k-centers algorithm performs worse than the other compared baselines because its representative optimization with spherical geodesics usually falls into selecting boundary data; 2) the Var, Entropy and BALD algorithms cannot accelerate the network model rapidly due to highly skewed acquisitions towards a few fixed classes in their first acquisitions (start states); 3) the k-medoids approach does not interact with the neural network model but directly imports the clustering centers into its training set; 4) the acquisitions of GBALD achieve better accuracy from the beginning than the Var, Entropy and BALD approaches, which feed the training set of the network model via acquisition loops. In short, the network improves faster after drawing the distribution characteristics of the input dataset with sufficient labels. GBALD thus combines representative and informative acquisitions in a uniform framework; the advantages of the two acquisition paradigms are integrated and present higher accuracies than either single paradigm. As reported in Table 1, all std values are around 0.1. Usually, averaging accuracy at the same acquisition size over different random seeds of DNNs results in a small std value; our mean accuracy instead spans the whole learning curve. The results show that 1) GBALD achieves the highest average accuracies, with k-medoids ranked second amongst the compared baselines; 2) k-centers has the worst accuracies amongst these approaches; 3) the others, which iteratively update the training model, rank in the middle, including the BALD, Var and Entropy algorithms. Table 2 shows the acquisition numbers needed to achieve accuracies of 70%, 80%, and 90% on the three datasets.
The three numbers in each cell are the acquisition numbers on MNIST, SVHN, and CIFAR-10, respectively. The results show that GBALD uses fewer acquisitions than the other algorithms to achieve a desired accuracy.

5.4. GBALD VS. BATCHBALD

Batch active deep learning was recently proposed to accelerate the training of a DNN model. In recent literature, BatchBALD (Kirsch et al., 2019) extended BALD with a batch acquisition setting to converge the network using fewer iteration loops. Differently from BALD, BatchBALD introduces diversity to avoid repeated or similar acquisitions. How to set the batch size of the acquisitions deserves attention before starting the experiments, since it determines whether our experiment settings are fair and reasonable. From a theoretical view, the larger the batch size, the worse the batch acquisitions will be; the experimental results of (Kirsch et al., 2019) also demonstrated this phenomenon. We thus run BatchBALD with different batch sizes. Figure 6 shows that BatchBALD degenerates in test accuracy as we progressively increase the batch size, while BatchBALD with a batch size of 10 keeps learning curves similar to BALD. This means BatchBALD can actually accelerate BALD with similar acquisition results if the batch size is not large; that is, for batch sizes between 2 and 10, BatchBALD degenerates into BALD and maintains highly consistent results. For the same reason, BatchBALD has the same sensitivity to the uninformative prior. For our GBALD, the core-set solicits sufficient data that properly match the input distribution (w.r.t. acquired set sizes ≤ 100), providing powerful input features to start the DNN model (w.r.t. acquired set sizes > 100). Table 3 then presents the mean±std over the breakpoints ({0, 10, 20, ..., 600}) of active acquisitions on MNIST with batch settings. The statistical results show that GBALD has a much higher mean accuracy than BatchBALD with any of the batch sizes. Therefore, evaluating the model uncertainty of a DNN using highly representative core-set samples can improve the performance of the neural network.

A. TWO-STAGE GBALD ALGORITHM

The two-stage GBALD algorithm is described as follows: 1) construct the core-set on an ellipsoid (Lines 3 to 13), and 2) estimate model uncertainty with a deep learning model (Lines 14 to 21). Core-set construction is derived from the max-min optimization of Eq. (10) and then updated along the ellipsoid geodesic w.r.t. Eq. (11), where θ yields a geometric probability model w.r.t. Eq. (5). Importing the core-set into D_0, the deep learning model returns b informative acquisitions at a time, where θ now yields a deep learning model. Ranking those samples, we select the b′ samples with the highest representativeness as the batch outputs. The batch acquisition iterations stop once the budget is exhausted. The final update of D_0 is the acquisition set of AL. Details of the hyperparameter settings are presented in Appendix C.6.

Algorithm 1: Two-stage GBALD
Input: data set X, core-set size N_M, batch returns b, batch outputs b′, iteration budget A.
Initialization: α ← 0, core-set M ← ∅.
Stage 1:
  Initialize θ to yield a geometric probability model w.r.t. Eq. (5).
  Perform k-means to initialize U into D_0.
  for i ← 1, 2, ..., N_M do
      Acquire x*_i ← arg max_{x_i ∈ D_u} min_{D_i ∈ D_0} { ∥L_0 − L∥_2 + log p(y_i | x_i, θ) },
          where L_0 ← Σ_{i=1}^N E_{y_i}[ log p(y_i | x_i, θ) + H[y_i | x_i, U] ].
      Rescale x*_i along the ellipsoid geodesic:
          x*_i ← arg min_{x_j ∈ D_u} ∥ x_j − [x_i + η(x* − x_i)] ∥.
      Update x*_i into the core-set: M ← {x*_i} ∪ M.
      Update N ← N − 1.
  end
  Import the core-set into D_0: D_0 ← M ∪ U′, where U′ replaces each element of U by its nearest sample in X.
Stage 2:
  Initialize θ to yield a deep learning model.
  while α < A do
      Return b informative deep learning acquisitions in one loop:
          {x*_1, x*_2, ..., x*_b} ← arg max_{x ∈ D_u} H[θ | D_0] − E_{y∼p(y|x,D_0)} H[θ | x, y, D_0].
      Rank the b′ acquisitions with the highest geometric representativeness:
          {x*_{t_1}, ..., x*_{t_{b′}}} ← arg max_{x*_i ∈ {x*_1,...,x*_b}} p(y_i | x*_i, θ).
      Update {x*_{t_1}, ..., x*_{t_{b′}}} into D_0: D_0 ← D_0 ∪ {x*_{t_1}, ..., x*_{t_{b′}}}.
      α ← α + 1.
  end
Output: the final update of D_0.
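The two-stage loop above can be sketched in code. This is a minimal illustration, not the paper's exact method: the ellipsoid-geodesic update of Eq. (11) and the geometric probability model of Eq. (5) are abstracted away (Stage 1 is approximated by plain farthest-first core-set selection), and the uncertainty and representativeness scores are assumed to be given as arrays.

```python
import numpy as np

def core_set_stage(X, n_core, rng):
    """Stage 1 (sketch): greedy max-min core-set selection.

    Farthest-first traversal stands in for the paper's max-min
    optimization of Eq. (10); the ellipsoid geodesic rescaling is omitted.
    """
    idx = [int(rng.integers(len(X)))]          # seed with a random point
    for _ in range(n_core - 1):
        # distance of every point to its nearest already-selected point
        d = np.min(np.linalg.norm(X[:, None] - X[idx], axis=2), axis=1)
        idx.append(int(np.argmax(d)))          # farthest point joins the core-set
    return idx

def uncertainty_stage(scores, repr_scores, b, b_prime):
    """Stage 2 (sketch): take the b most uncertain points, then keep the
    b' of them with the highest representativeness as batch outputs."""
    top_b = np.argsort(scores)[-b:]
    keep = top_b[np.argsort(repr_scores[top_b])[-b_prime:]]
    return keep
```

A batch loop would call `uncertainty_stage` repeatedly, relabel the returned points, and retrain the model until the budget A is exhausted.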

B. GENERALIZATION ERRORS OF GEODESIC SEARCH WITH SPHERE AND ELLIPSOID

Optimizing along the ellipsoid geodesic linearly rescales the spherical search onto a tighter geometric object. The motivation is that the ellipsoid prevents the core-set updates from drifting towards the boundary regions of the sphere. This section presents a generalization error analysis from the geometric view, which provides feasible error guarantees for geodesic search with an ellipsoid, following the perceptron analysis. Proofs are presented in Appendix D.

B.1 ERROR BOUNDS OF GEODESIC SEARCH WITH SPHERE

Given a perceptron function $h := w_1x_1 + w_2x_2 + w_3$, the classification task is over two classes $A$ and $B$ embedded in a three-dimensional geometry. Let $S_A$ and $S_B$ be the spheres that tightly cover $A$ and $B$, respectively, where $S_A$ has center $c_a$ and radius $R_a$, and $S_B$ has center $c_b$ and radius $R_b$. Under this setting, the generalization analysis is as follows.

Theorem 1. Given a linear perceptron function $h = w_1x_1 + w_2x_2 + w_3$ that classifies $A$ and $B$, and a sampling budget $k$, with representation sampling over $S_A$ and $S_B$, let $d_a$ and $d_b$ be the minimum distances of the representation data to the boundaries of $S_A$ and $S_B$, respectively. Let $\mathrm{err}(h,k)$ be the classification error rate with respect to $h$ and $k$, and $\frac{\pi}{\phi} = \arcsin\frac{R_a - d_a}{R_a}$. We then have the error inequality
$$\min\left\{ \frac{4R_a^3 - (2R_a + t_k)(R_a - t_k)^2}{4R_a^3 + 4R_b^3},\ \frac{4R_b^3 - (2R_b + t'_k)(R_b - t'_k)^2}{4R_b^3 + 4R_a^3} \right\} < \mathrm{err}(h,k) < \frac{1}{k},$$
where
$$t_k = \frac{R_a^2}{3} + \sqrt[3]{-\frac{\mu_k}{2\pi} + \sqrt{\frac{\mu_k^2}{4\pi^2} - \frac{\pi^3 R_a^3}{27\pi^3}}} + \sqrt[3]{-\frac{\mu_k}{2\pi} - \sqrt{\frac{\mu_k^2}{4\pi^2} - \frac{\pi^3 R_a^3}{27\pi^3}}},\quad \mu_k = \Big(\frac{2k-4}{3k} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_a^3 - \frac{4\pi R_b^3}{3k},$$
$$t'_k = \frac{R_b^2}{3} + \sqrt[3]{-\frac{\mu'_k}{2\pi} + \sqrt{\frac{\mu'^2_k}{4\pi^2} - \frac{\pi^3 R_b^3}{27\pi^3}}} + \sqrt[3]{-\frac{\mu'_k}{2\pi} - \sqrt{\frac{\mu'^2_k}{4\pi^2} - \frac{\pi^3 R_b^3}{27\pi^3}}},\quad \mu'_k = \Big(\frac{2k-4}{3k} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_b^3 - \frac{4\pi R_a^3}{3k}.$$
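The quantity $t_k$ in Theorem 1 is the positive root of the cubic $\pi t^3 - \pi R_a^2 t + \mu_k = 0$ derived in the proof. The sketch below checks that numerically instead of through the closed Cardano form (which needs care when the discriminant is negative). The chosen values of $R_a$, $R_b$, $k$, and $\phi$ are purely illustrative; in the theorem, $\phi$ is fixed by the sampling geometry rather than freely chosen.

```python
import numpy as np

def mu_k(R_a, R_b, k, phi):
    # mu_k from Theorem 1 (phi is an illustrative input here)
    return ((2 * k - 4) / (3 * k) - np.cos(np.pi / phi) / phi) * np.pi * R_a**3 \
           - 4 * np.pi * R_b**3 / (3 * k)

def t_k(R_a, R_b, k, phi):
    """Smallest positive real root of  pi*t^3 - pi*R_a^2*t + mu_k = 0
    inside (0, R_a), found numerically via the companion matrix."""
    coeffs = [np.pi, 0.0, -np.pi * R_a**2, mu_k(R_a, R_b, k, phi)]
    roots = np.roots(coeffs)
    real = roots[np.abs(roots.imag) < 1e-9].real
    cands = real[(real > 0) & (real < R_a)]
    return float(cands.min())
```

The returned root can be substituted back into the cubic to verify it is indeed a solution.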

B.2 ERROR BOUNDS OF GEODESIC SEARCH WITH ELLIPSOID

Given that classes $A$ and $B$ are tightly covered by ellipsoids $E_a$ and $E_b$ in a three-dimensional geometry, let $R_{a1}$ be the polar radius of $E_a$, $\{R_{a2}, R_{a3}\}$ the equatorial radii of $E_a$, $R_{b1}$ the polar radius of $E_b$, and $\{R_{b2}, R_{b3}\}$ the equatorial radii of $E_b$. The generalization analysis under these settings is as follows.

Theorem 2. Given a linear perceptron function $h = w_1x_1 + w_2x_2 + w_3$ that classifies $A$ and $B$, and a sampling budget $k$, with representation sampling over $E_a$ and $E_b$, let $d_a$ and $d_b$ be the minimum distances of the representation data to the boundaries of $E_a$ and $E_b$, respectively. Let $\mathrm{err}(h,k)$ be the classification error rate with respect to $h$ and $k$, and $\frac{\pi}{\phi} = \arcsin\frac{R_a - d_a}{R_a}$. We then have the error inequality
$$\min\left\{ \frac{4\prod_i R_{ai} - (2R_{a1} + \lambda_k)(R_{a1} - \lambda_k)^2}{4\prod_i R_{ai} + 4\prod_i R_{bi}},\ \frac{4\prod_i R_{bi} - (2R_{b1} + \lambda'_k)(R_{b1} - \lambda'_k)^2}{4\prod_i R_{bi} + 4\prod_i R_{ai}} \right\} < \mathrm{err}(h,k) < \frac{1}{k},$$
where $i = 1, 2, 3$,
$$\lambda_k = \frac{R_{a1}^2}{3} + \sqrt[3]{-\frac{\sigma_k}{2\pi} + \sqrt{\frac{\sigma_k^2}{4\pi^2} - \frac{\pi^3 R_{a1}^3}{27\pi^3}}} + \sqrt[3]{-\frac{\sigma_k}{2\pi} - \sqrt{\frac{\sigma_k^2}{4\pi^2} - \frac{\pi^3 R_{a1}^3}{27\pi^3}}},\quad \sigma_k = \Big(\frac{2k-4}{3k} - \frac{\pi R_{a1}}{2\phi}\Big)\pi\prod_i R_{ai} - \frac{4\pi\prod_i R_{bi}}{3k},$$
and $\lambda'_k$ is defined analogously with $R_{b1}$ and $\sigma'_k$, where $\sigma'_k = \big(\frac{2k-4}{3k} - \frac{\pi R_{b1}}{2\phi}\big)\pi\prod_i R_{bi} - \frac{4\pi\prod_i R_{ai}}{3k}$.

B.3 PROBABILITY BOUNDS OF ACHIEVING A ZERO GENERALIZATION ERROR

Let $\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}}$ and $\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}}$ be the probabilities of achieving a zero generalization error of geodesic search with the sphere and the ellipsoid, respectively; we present their inequality relationship.

Theorem 3. Based on the γ-tube theory (Ben-David & Von Luxburg, 2008) of clustering stability, the probability of achieving a zero generalization error of geodesic search with the sphere can be defined as the volume ratio $\mathrm{Vol}(\mathrm{Tube})/\mathrm{Vol}(\mathrm{Sphere})$. Then we have
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}} = 1 - \frac{t_k^3}{R_a^3},$$
where $t_k$ keeps consistent with Theorem 1.

Theorem 4. Based on the γ-tube theory (Ben-David & Von Luxburg, 2008) of clustering stability, the probability of achieving a zero generalization error of geodesic search with the ellipsoid can be defined as the volume ratio $\mathrm{Vol}(\mathrm{Tube})/\mathrm{Vol}(\mathrm{Ellipsoid})$. Then we have
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}} = 1 - \frac{\lambda_{k1}\lambda_{k2}\lambda_{k3}}{R_{a1}R_{a2}R_{a3}},$$
where
$$\lambda_{ki} = \frac{R_{ai}^2}{3} + \sqrt[3]{-\frac{\sigma_{ki}}{2\pi} + \sqrt{\frac{\sigma_{ki}^2}{4\pi^2} - \frac{\pi^3 R_{ai}^3}{27\pi^3}}} + \sqrt[3]{-\frac{\sigma_{ki}}{2\pi} - \sqrt{\frac{\sigma_{ki}^2}{4\pi^2} - \frac{\pi^3 R_{ai}^3}{27\pi^3}}},$$
and $\sigma_{ki} = \big(\frac{2k-4}{3k} - \frac{\pi R_{ai}}{2\phi}\big)\pi R_{a1}R_{a2}R_{a3} - \frac{4\pi R_{b1}R_{b2}R_{b3}}{3k}$, $i = 1, 2, 3$.

B.4 HIGHER PROBABILITY OF ACHIEVING A ZERO ERROR

Let $\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}}$ and $\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}}$ be the probabilities of achieving a zero generalization error of geodesic search with the sphere and the ellipsoid, respectively. Their relationship is presented in Proposition 2.

Proposition 2. Given a linear perceptron function $h = w_1x_1 + w_2x_2 + w_3$ that classifies $A$ and $B$, and a sampling budget $k$, with representation sampling over $E_a$ and $E_b$, let $\mathrm{err}(h,k)$ be the classification error rate with respect to $h$ and $k$. The probabilities of geodesic search with the sphere and the ellipsoid satisfy
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}} > \Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}}.$$

With these theoretical results, the geometric interpretation of geodesic search over an ellipsoid is more effective than over a sphere, due to its tighter lower error bound and higher probability of achieving a zero error. Theorems 6 and 7 of Appendix D then generalize the above results to high dimensions in terms of the volume functions of the sphere and the ellipsoid.
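Proposition 2 can be illustrated by plugging in $t_k = R_a - d_a$ and $\lambda_{ki} = R_{ai} - d_a$ (the substitutions used in its proof) into Theorems 3 and 4. The numeric radii below are hypothetical; the only structural assumption, following Eq. (10), is that every ellipsoid radius is smaller than the covering sphere's radius.

```python
def pr_zero_sphere(R_a, d_a):
    # Theorem 3 with t_k = R_a - d_a:  1 - (1 - d_a / R_a)^3
    return 1.0 - (1.0 - d_a / R_a) ** 3

def pr_zero_ellipsoid(radii, d_a):
    # Theorem 4 with lambda_ki = R_ai - d_a; radii = (R_a1, R_a2, R_a3)
    R1, R2, R3 = radii
    return 1.0 - ((R1 - d_a) * (R2 - d_a) * (R3 - d_a)) / (R1 * R2 * R3)
```

Since each $R_{ai} < R_a$, each factor $(1 - d_a/R_{ai})$ is smaller than $(1 - d_a/R_a)$, so the ellipsoid probability exceeds the sphere probability, matching the proposition.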

C. EXPERIMENTS

C.1 BASELINES

To evaluate the performance of GBALD, several typical baselines from the recent deep AL literature are selected.
• Bayesian active learning by disagreement (BALD) (Houlsby et al., 2011). Introduced in Section 3.
• Maximize Variation Ratio (Var) (Gal et al., 2017). The algorithm chooses the unlabeled data that maximizes the variation ratio of the predictive probability: $x^* = \arg\max_{x \in D_u} \big(1 - \max_{y \in Y} \Pr(y \mid x, D_0)\big)$.
• Maximize Entropy (Entropy) (Gal et al., 2017). The algorithm chooses the unlabeled data that maximizes the predictive entropy: $x^* = \arg\max_{x \in D_u} -\sum_{y \in Y} \Pr(y \mid x, D_0) \log \Pr(y \mid x, D_0)$.
• k-medoids (Park & Jun, 2009). A classical unsupervised algorithm that represents the input distribution with $k$ clustering centers: $\{x^*_1, x^*_2, ..., x^*_k\} = \arg\min_{z_1, z_2, ..., z_k} \sum_{i=1}^k \sum_{x_i \in X_k} \|x_i - z_i\|$, where $X_k$ denotes the $k$-th subcluster centered at $z_i$, and $z_i \in X$, $\forall i$.
• Greedy k-centers (k-centers) (Sener & Savarese, 2018). A geometric core-set interpretation on a sphere; see Eq. (4).
• BatchBALD (Kirsch et al., 2019). A batch extension of BALD which incorporates diversity, not maximal entropy as in BALD, to rank the acquisitions: $\{x^*_{t_1}, ..., x^*_{t_b}\} = \arg\max_{x_{t_1}, ..., x_{t_b}} H(y_{t_1}, ..., y_{t_b}) - \mathbb{E}_{p(\theta \mid D_0)}[H(y_{t_1}, ..., y_{t_b} \mid \theta)]$, where $H(y_{t_1}, \ldots$ … (et al., 2012) a subset from $X$ which approximates the parameter distributions of $\theta$.
The parameter settings of Eq. (5) are $R_0 = 2.0\mathrm{e}{+}3$ and $\eta = 0.9$. The accuracy at each acquired dataset size is averaged over 3 runs.
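The two score-based baselines above can be written out directly. This sketch assumes the predictive probabilities $\Pr(y \mid x, D_0)$ are available as a matrix of shape (number of unlabeled points, number of classes); both baselines then acquire the arg-max of their respective score.

```python
import numpy as np

def variation_ratio(probs):
    """Var score per unlabeled point: 1 - max_y Pr(y | x, D_0).
    probs: array of shape (n_points, n_classes)."""
    return 1.0 - probs.max(axis=1)

def predictive_entropy(probs, eps=1e-12):
    """Entropy score per point: -sum_y Pr(y|x,D_0) log Pr(y|x,D_0).
    eps guards against log(0) for degenerate predictions."""
    return -(probs * np.log(probs + eps)).sum(axis=1)
```

For a confident prediction like (0.9, 0.1) both scores are low, while a maximally uncertain (0.5, 0.5) prediction maximizes both, so either baseline would acquire the latter point first.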

C.2 ACTIVE ACQUISITIONS WITH REPEATED SAMPLES

Repeatedly collected samples are very common when establishing a database. Those repeated samples may be continuously evaluated as the primary acquisitions of AL due to the lack of one or more kinds of class labels; this situation can trap the evaluation of model uncertainty in repeated acquisitions. To simulate this collection scenario, we compare the acquisition performance of BALD, Var, and GBALD using 5,000 and 10,000 repeated samples taken from the first 5,000 and 10,000 unlabeled data of SVHN, respectively. The unsupervised algorithms that do not interact with the network architecture, such as k-medoids and k-centers, were shown not to accelerate the training in the experiments of Section 5.3; we therefore no longer study their performance here. The network architecture still follows the MLP settings of Section 5.3. The acquisition results over the repeated SVHN datasets are presented in Figure 7. The results demonstrate that GBALD suffers slighter perturbations from repeated samples than Var and BALD, because it draws the core-set from the input distribution as the initial acquisition, leading to a small probability of sampling from one or a few fixed classes. In GBALD, the informative acquisitions constrained by geometric representations further scatter the acquisitions across different classes. In contrast, Var and BALD have no particular scheme against repeated acquisitions: the maximizer of the model uncertainty may be repeatedly produced by those repeated samples. The unsupervised algorithms such as k-medoids and k-centers do not have these limitations, but they cannot accelerate the training since they have no interaction with the network architecture.

C.3 ACTIVE ACQUISITIONS WITH NOISY SAMPLES

Noisy labels (Golovin et al., 2010; Han et al., 2018) are inevitable due to human errors in data annotation. Trained on noisy labels, a neural network model degenerates its inherent properties. To assess the perturbations of the above acquisition algorithms against noisy labels, we organize the following experiment: we select the first 5,000 and 10,000 samples, respectively, from the unlabeled data pool of the MNIST dataset and reset their labels by shifting {'0','1',...,'8'} to {'1','2',...,'9'}. The network architecture follows the MLP of Section 5.3. The selected baselines are Var and BALD. Figure 8 presents the acquisition results of those baselines with noisy labels. The batch sizes of the compared baselines are 100, where GBALD ranks 300 acquisitions to select 100 data for the training, i.e., b = 300, b′ = 100. Table 5 presents the mean±std values of the breakpoints (i.e., {0, 100, 200, ..., 10000}) over the learning curves of Figure 8. The results further show that GBALD suffers smaller noisy perturbations than the other baselines. For Var and BALD, model uncertainty assigns high probabilities to sampling those noisy data because they greatly update the model.
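The label-shifting corruption described above is a one-liner. The experiment specifies the shift only for classes '0' through '8'; how class '9' is treated is not stated in the text, so leaving it unchanged here is an assumption.

```python
def shift_labels(labels):
    """Corrupt labels as in the noisy-acquisition experiment: classes
    0..8 are shifted to 1..9.  Class 9 is left unchanged (an assumption;
    the paper does not specify its treatment)."""
    return [y + 1 if y <= 8 else y for y in labels]
```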

C.5 ACCELERATION OF ACCURACY

Accelerations of accuracy, i.e., the first-order differences of the breakpoints of the learning curve, describe the efficiency of the active acquisition loops. Different from the accuracy curves, the acceleration curve reflects how active acquisitions help the convergence of the interacting DNN model. We first present the acceleration curves of the different baselines on the MNIST, SVHN, and CIFAR10 datasets, following the experiments of Section 5.3. The acceleration curves of active acquisitions are drawn in Figure 9. Observing those curves, GBALD always keeps higher accelerations of accuracy than the other baselines on all three benchmark datasets. This reveals why GBALD can derive more informative and representative data to maximally update the DNN model. The acceleration curves of active acquisitions with repeated samples of Appendix C.2 are presented in Figure 10. As shown in this figure, GBALD presents slighter perturbations to the number of repeated samples than Var and BALD, due to its effective ranking scheme for optimizing the model uncertainty of the DNN. The acceleration curves of active noisy acquisitions of Appendix C.3 are drawn in Figure 11; compared to Figure 8, they present more intuitive descriptions of the noisy perturbations of the different baselines. Compared horizontally with the acceleration curves of Var and BALD, our proposed GBALD has smaller noisy perturbations due to 1) the powerful core-set which properly captures the input distribution, and 2) the highly representative and informative acquisitions of model uncertainty.
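The acceleration curve defined above is simply the first-order difference of the accuracy breakpoints; the sample values below are illustrative, not taken from the experiments.

```python
import numpy as np

def acceleration_curve(breakpoint_accuracies):
    """First-order differences of the accuracy breakpoints: how much
    each acquisition loop improves the model."""
    return np.diff(np.asarray(breakpoint_accuracies, dtype=float))
```

A faster-converging method shows large early differences that decay quickly, which is exactly the shape compared across baselines in Figures 9-11.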

C.6 HYPERPARAMETER SETTINGS

When is the proper time to start active acquisitions using Eq. (14) in the GBALD framework? Does the ratio of core-set acquisitions to model-uncertainty acquisitions affect the performance of GBALD? We discuss the key hyperparameter of GBALD here: the core-set size N_M. Table 6 presents the relationship between accuracies and core-set sizes, where the start accuracy denotes the test accuracy over the initial core-set, and the ultimate accuracy denotes the test accuracy over up to Q = 20,000 training data. Let b = 1000 and b′ = 500 in GBALD, and let N_M be the core-set size; the iteration budget A of GBALD is then defined as A = (Q − N_M)/b′. For example, if the number of initial core-set labels is set to N_M = 1,000, we have A = (Q − N_M)/b′ = 38; if N_M = 2,000, then A = (Q − N_M)/b′ = 36. From Table 6, the GBALD algorithm keeps stable start, ultimate, and mean±std accuracies once more than 1,000 core-set labels are input. Therefore, drawing a sufficient core-set using Eq. (10) before starting the model-uncertainty acquisition of Eq. (14) maximizes the performance of our GBALD framework.

Hyperparameter setting of the iteration budget A. Given the acquisition budget Q, let b′ be the number of output returns at each loop and N_M be the core-set size; the iteration budget A of GBALD is then defined as A = (Q − N_M)/b′.

Other hyperparameter settings. Eq. (5) has one parameter R_0, which describes the geometric prior from probability. The default radius of the inner balls R_0 is used to legalize the prior and has no further influence on Eq. (10); it is set as R_0 = 2.0e+3 for the three image datasets. The ellipsoid geodesic is adjusted by η, which controls how far the core-set updates move towards the boundaries of the distributions; it is set as η = 0.9 in this paper.
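The budget relation above can be checked directly; integer division reproduces the two worked examples from the text.

```python
def iteration_budget(Q, N_M, b_prime):
    """Iteration budget A = (Q - N_M) / b' (number of acquisition loops
    after the core-set of size N_M is labeled)."""
    return (Q - N_M) // b_prime
```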

C.7 TWO-SIDED t-TEST

We present a two-sided (two-tailed) t-test for the learning curves of Figure 5. The t-score between a pair of breakpoint groups $B_1 = \{\alpha_1, ..., \alpha_n\}$ and $B_2 = \{\beta_1, ..., \beta_n\}$ is the standard paired statistic
$$\mathrm{tscore} = \frac{\mu\sqrt{n}}{\sigma},\quad \mu = \frac{1}{n}\sum_{i=1}^n (\alpha_i - \beta_i),\quad \sigma = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (\alpha_i - \beta_i - \mu)^2}.$$
In the two-sided t-test, $B_1$ beats $B_2$ on breakpoints $\alpha_i$ and $\beta_i$ when $\mathrm{tscore} > \nu$; $B_2$ beats $B_1$ when $\mathrm{tscore} < -\nu$, where $\nu$ denotes the hypothesized criterion at a given confidence risk. Following (Ash et al., 2019), we add a penalty of $1/e$ to each pair of breakpoints, which further enlarges their differences in the aggregated penalty matrix, where $e$ denotes the number of times $B_1$ beats $B_2$ over all breakpoints. All penalty values are finally aggregated in their $L_1$ expressions.
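The paired t-score above is a few lines of code. This sketch implements the statistic only; the beat/penalty bookkeeping of the aggregated matrix is omitted.

```python
import math

def t_score(B1, B2):
    """Paired t statistic over matched breakpoints:
    t = mu * sqrt(n) / sigma, where mu and sigma are the mean and
    (sample) standard deviation of the differences alpha_i - beta_i."""
    n = len(B1)
    diffs = [a - b for a, b in zip(B1, B2)]
    mu = sum(diffs) / n
    sigma = math.sqrt(sum((d - mu) ** 2 for d in diffs) / (n - 1))
    return mu * math.sqrt(n) / sigma
```

Swapping the two groups flips the sign of the score, which is why a single threshold ν decides both directions of the two-sided test.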

D. PROOFS

We first present the generalization errors of the $k = 3$ case of AL with the sphere as a case study. The assumption of the generalization analysis is described in Figure 13. The bound is
$$\min\left\{\frac{4R_a^3 - (2R_a + t)(R_a - t)^2}{4R_a^3 + 4R_b^3},\ \frac{4R_b^3 - (2R_b + t')(R_b - t')^2}{4R_b^3 + 4R_a^3}\right\} < \mathrm{err}(h,3) < 0.3334,$$
where
$$t = \frac{R_a^2}{3} + \sqrt[3]{-\frac{\mu}{2\pi} + \sqrt{\frac{\mu^2}{4\pi^2} - \frac{\pi^3 R_a^3}{27\pi^3}}} + \sqrt[3]{-\frac{\mu}{2\pi} - \sqrt{\frac{\mu^2}{4\pi^2} - \frac{\pi^3 R_a^3}{27\pi^3}}},\quad \mu = \Big(\frac{2}{9} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_a^3 - \frac{4\pi R_b^3}{9},$$
and $t'$, $\mu'$ are the analogues with $a$ and $b$ exchanged, i.e. $\mu' = \big(\frac{2}{9} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\big)\pi R_b^3 - \frac{4\pi R_a^3}{9}$.

Proof. Consider the unseen acquisitions $\{q_a, q_b, q^*\}$, where $q_a \in A$, $q_b \in B$, and whether $q^* \in A$ or $q^* \in B$ is uncertain; the position of $q^*$ largely decides $h$. The proof therefore studies the error bounds related to $q^*$ in two cases: $R_a \geq R_b$ and $R_a < R_b$.

1) If $R_a \geq R_b$, then $q^* \in A$. Estimating the position of $q^*$ starts from the analysis of $q_a$. Given the volume function $\mathrm{Vol}(\cdot)$ over the 3-D geometry, $\mathrm{Vol}(A) = \frac{4\pi}{3}R_a^3$ and $\mathrm{Vol}(B) = \frac{4\pi}{3}R_b^3$. Given $k = 3$ over $S_a$ and $S_b$, define the minimum distance of $q_a$ to the boundary of $A$ as $d_a$. Let $S_a$ be cut off by a cross section $h'$, where $C$ is the cut and $D$ is the spherical cap of the half-sphere (see Figure 12), satisfying
$$\mathrm{Vol}(C) = \frac{2}{3}\pi R_a^3 - \mathrm{Vol}(D) = \frac{4\pi(R_a^3 + R_b^3)}{9}, \quad (19)$$
and the volume of $D$ is
$$\mathrm{Vol}(D) = \pi(R_a^2 - (R_a - d_a)^2)(R_a - d_a) + \pi R_a^2 \arcsin\frac{R_a - d_a}{R_a}\sqrt{R_a^2 - (R_a - d_a)^2}. \quad (20)$$
Letting $\frac{\pi}{\phi} = \arcsin\frac{R_a - d_a}{R_a}$, Eq. (20) can be written as
$$\mathrm{Vol}(D) = \pi(R_a^2 - (R_a - d_a)^2)(R_a - d_a) + \frac{\pi R_a^3}{\phi}\cos\frac{\pi}{\phi}. \quad (21)$$
Introducing Eq. (21) into Eq. (19), we have
$$\Big(\frac{2}{3} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_a^3 - \pi(R_a^2 - (R_a - d_a)^2)(R_a - d_a) = \frac{4\pi(R_a^3 + R_b^3)}{9}. \quad (22)$$
Let $t = R_a - d_a$; Eq. (22) can be rewritten as
$$\pi t^3 - \pi R_a^2 t + \Big(\frac{2}{9} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_a^3 - \frac{4\pi R_b^3}{9} = 0. \quad (23)$$
To simplify Eq. (23), let $\mu = \big(\frac{2}{9} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\big)\pi R_a^3 - \frac{4\pi R_b^3}{9}$; Eq. (23) then becomes $\pi t^3 - \pi R_a^2 t + \mu = 0$, whose positive solution is the $t$ stated above. Based on Eq. (19), we know
$$\mathrm{Vol}(D) = \frac{2}{3}\pi R_a^3 - \frac{4\pi(R_a^3 + R_b^3)}{9} = \frac{2}{9}\pi R_a^3 - \frac{4}{9}\pi R_b^3 > 0, \quad (26)$$
and thus $\sqrt[3]{2}\,R_b < R_a$. We next prove $q^* \in A$. From Eq. (26), $\pi R_b^3 < \frac{1}{2}\pi R_a^3$. Then the following inequalities hold: 1) $2\pi R_b^3 < \pi R_a^3$, 2) $\frac{2}{3}\pi R_b^3 < \frac{1}{3}\pi R_a^3$, and 3) $\frac{2}{3}\pi R_b^3 + \frac{1}{3}\pi R_b^3 < \frac{1}{3}\pi R_a^3 + \frac{1}{3}\pi R_b^3$. Finally, $\pi R_b^3 < \frac{1}{3}(\pi R_a^3 + \pi R_b^3)$, i.e., $\mathrm{Vol}(B) < \frac{1}{3}(\mathrm{Vol}(A) + \mathrm{Vol}(B))$. We thus know: 1) $q_a \in A$ with a minimum distance $d_a$ to the boundary of $S_a$, 2) $q_b \in B$, and 3) $q^* \in A$. Therefore, class $B$ has a very high probability of achieving a nearly zero generalization error, and the position of $q^*$ largely decides the upper bound of the generalization error of $h$. In $S_a$ covering class $A$, the nearly optimal error region can be bounded as the spherical cap of $S_a$ with the volume constraint $\mathrm{Vol}(A) - \mathrm{Vol}(C)$. We hence have
$$\frac{\mathrm{Vol}(A) - \mathrm{Vol}(C)}{\mathrm{Vol}(A) + \mathrm{Vol}(B)} < \mathrm{err}(h,3) < \frac{1}{3}. \quad (29)$$
We next calculate the volume of the spherical cap:
$$\mathrm{Vol}(A) - \mathrm{Vol}(C) = \pi\int_{-R_a}^{R_a - d_a}(R_a^2 - y^2)\,dy = \frac{4}{3}\pi R_a^3 - \frac{\pi}{3}(3R_a - d_a)d_a^2. \quad (30)$$
Eq. (29) is then rewritten as
$$\frac{\frac{4}{3}\pi R_a^3 - \frac{\pi}{3}(3R_a - d_a)d_a^2}{\frac{4}{3}\pi R_a^3 + \frac{4}{3}\pi R_b^3} < \mathrm{err}(h,3) < 0.3334,$$
i.e., $\frac{4R_a^3 - (3R_a - d_a)d_a^2}{4R_a^3 + 4R_b^3} < \mathrm{err}(h,3) < 0.3334$. Introducing $d_a = R_a - t$, this becomes
$$\frac{4R_a^3 - (2R_a + t)(R_a - t)^2}{4R_a^3 + 4R_b^3} < \mathrm{err}(h,3) < 0.3334.$$
2) Under the other assumption $R_a < R_b$, following the same proof technique we obtain $\frac{4R_b^3 - (3R_b - d_b)d_b^2}{4R_b^3 + 4R_a^3} < \mathrm{err}(h,3) < 0.3334$, where $d_b = R_b - t'$ and $t'$, $\mu'$ are as stated above. We thus conclude that
$$\min\left\{\frac{4R_a^3 - (2R_a + t)(R_a - t)^2}{4R_a^3 + 4R_b^3},\ \frac{4R_b^3 - (2R_b + t')(R_b - t')^2}{4R_b^3 + 4R_a^3}\right\} < \mathrm{err}(h,3) < 0.3334.$$
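The spherical-cap identity used in Eq. (30), $\mathrm{Vol} = \frac{\pi}{3}d^2(3R - d)$ for a cap of height $d$, can be verified numerically by slicing the cap into disks of area $\pi(R^2 - y^2)$; the numeric radii below are arbitrary.

```python
import numpy as np

def cap_volume(R, d):
    """Closed form used in the proof: (pi/3) * d^2 * (3R - d)."""
    return np.pi / 3.0 * d**2 * (3.0 * R - d)

def cap_volume_numeric(R, d, n=200000):
    """Same cap by midpoint-rule slicing of pi*(R^2 - y^2), y in [R-d, R]."""
    y = (R - d) + (np.arange(n) + 0.5) * (d / n)
    return float(np.sum(np.pi * (R**2 - y**2)) * (d / n))
```

Setting the cap height to the full diameter ($d = 2R$) recovers the whole sphere volume $\frac{4}{3}\pi R^3$, a quick consistency check on the formula.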

D.2 PROOF OF THEOREM 1

We next present the generalization errors for an agnostic sampling budget $k$, following the above proof technique.

Proof. The proof studies two cases: $R_a \geq R_b$ and $R_a < R_b$.

1) If $R_a \geq R_b$, we estimate the optimal position of $q_a$ satisfying $q_a \in A$. Given the volume function $\mathrm{Vol}(\cdot)$ over the 3-D geometry, $\mathrm{Vol}(A) = \frac{4\pi}{3}R_a^3$ and $\mathrm{Vol}(B) = \frac{4\pi}{3}R_b^3$. Assume $q_a$ is the nearest representation data to the boundary of $S_a$, $q_b$ the nearest representation data to the boundary of $S_b$, and $q^*$ the nearest representation data to $h$, lying either in $S_a$ or $S_b$. Let $d_a$ be the minimum distance of $q_a$ to the boundary of $A$. Let $S_a$ be cut off by a cross section $h'$, where $C$ is the cut and $D$ is the spherical cap of the half-sphere, satisfying
$$\mathrm{Vol}(C) = \frac{2}{3}\pi R_a^3 - \mathrm{Vol}(D) = \frac{4\pi(R_a^3 + R_b^3)}{3k}, \quad (35)$$
and the volume of $D$ is
$$\mathrm{Vol}(D) = \pi(R_a^2 - (R_a - d_a)^2)(R_a - d_a) + \pi R_a^2 \arcsin\frac{R_a - d_a}{R_a}\sqrt{R_a^2 - (R_a - d_a)^2}. \quad (36)$$
Letting $\frac{\pi}{\phi} = \arcsin\frac{R_a - d_a}{R_a}$, Eq. (36) can be written as $\mathrm{Vol}(D) = \pi(R_a^2 - (R_a - d_a)^2)(R_a - d_a) + \frac{\pi R_a^3}{\phi}\cos\frac{\pi}{\phi}$. Introducing this into Eq. (35), we have
$$\Big(\frac{2}{3} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_a^3 - \pi(R_a^2 - (R_a - d_a)^2)(R_a - d_a) = \frac{4\pi(R_a^3 + R_b^3)}{3k}. \quad (37)$$
Let $t_k = R_a - d_a$; we know
$$\pi t_k^3 - \pi R_a^2 t_k + \Big(\frac{2k-4}{3k} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\Big)\pi R_a^3 - \frac{4\pi R_b^3}{3k} = 0. \quad (38)$$
To simplify Eq. (38), let $\mu_k = \big(\frac{2k-4}{3k} - \frac{1}{\phi}\cos\frac{\pi}{\phi}\big)\pi R_a^3 - \frac{4\pi R_b^3}{3k}$; Eq. (38) then becomes $\pi t_k^3 - \pi R_a^2 t_k + \mu_k = 0$, whose positive solution $t_k$ is as stated in Theorem 1. Based on Eq. (35), we know
$$\mathrm{Vol}(D) = \frac{2}{3}\pi R_a^3 - \frac{4\pi(R_a^3 + R_b^3)}{3k} = \frac{2k-4}{3k}\pi R_a^3 - \frac{4}{3k}\pi R_b^3 > 0, \quad (41)$$
and thus $\sqrt[3]{\frac{2}{k-2}}\,R_b < R_a$. We next prove $q^* \in A$. According to Eq. (41), $\pi R_b^3 < \frac{k-2}{2}\pi R_a^3$. Then the following inequalities hold: 1) $\frac{2}{k-2}\pi R_b^3 < \pi R_a^3$, 2) $\frac{2}{(k-2)k}\pi R_b^3 < \frac{1}{k}\pi R_a^3$, and 3) $\frac{2}{(k-2)k}\pi R_b^3 + \frac{k^2-2k-2}{(k-2)k}\pi R_b^3 < \frac{1}{k}\pi R_a^3 + \frac{k^2-2k-2}{(k-2)k}\pi R_b^3$. Finally, we have
$$\pi R_b^3 < \frac{1}{k}\pi R_a^3 + \frac{k^2-2k-2}{(k-2)k}\pi R_b^3 = \frac{1}{k}\pi(R_a^3 + R_b^3) - \frac{2}{(k-2)k}\pi R_b^3 < \frac{1}{k}\pi(R_a^3 + R_b^3).$$
Therefore $\mathrm{Vol}(B) < \frac{1}{k}(\mathrm{Vol}(A) + \mathrm{Vol}(B))$. We thus know: 1) $q_a \in A$ with a minimum distance $d_a$ to the boundary of $S_a$, 2) $q_b \in B$, and 3) $q^* \in A$. Therefore, class $B$ has a very high probability of achieving a zero generalization error, and the position of $q^*$ largely decides the upper bound of the generalization error of $h$. In $S_a$ covering class $A$, the nearly optimal error region can be bounded as $\mathrm{Vol}(A) - \mathrm{Vol}(C)$. We then have the inequality
$$\frac{\mathrm{Vol}(A) - \mathrm{Vol}(C)}{\mathrm{Vol}(A) + \mathrm{Vol}(B)} < \mathrm{err}(h,k) < \frac{1}{k}.$$
Based on the spherical-cap volume in Eq. (30), we have $\frac{4R_a^3 - (3R_a - d_a)d_a^2}{4R_a^3 + 4R_b^3} < \mathrm{err}(h,k) < \frac{1}{k}$. Introducing $d_a = R_a - t_k$ gives
$$\frac{4R_a^3 - (2R_a + t_k)(R_a - t_k)^2}{4R_a^3 + 4R_b^3} < \mathrm{err}(h,k) < \frac{1}{k}.$$
2) With the other assumption $R_a < R_b$, we follow the same proof technique and obtain $\frac{4R_b^3 - (3R_b - d_b)d_b^2}{4R_b^3 + 4R_a^3} < \mathrm{err}(h,k) < \frac{1}{k}$, where $d_b = R_b - t'_k$ and $t'_k$, $\mu'_k$ are as stated in Theorem 1. We thus conclude that
$$\min\left\{\frac{4R_a^3 - (2R_a + t_k)(R_a - t_k)^2}{4R_a^3 + 4R_b^3},\ \frac{4R_b^3 - (2R_b + t'_k)(R_b - t'_k)^2}{4R_b^3 + 4R_a^3}\right\} < \mathrm{err}(h,k) < \frac{1}{k}.$$

D.3 PROOF OF THEOREM 2

Proof. Classes $A$ and $B$ are tightly covered by ellipsoids $E_a$ and $E_b$ in a three-dimensional geometry. Let $R_{a1}$ be the polar radius of $E_a$, $\{R_{a2}, R_{a3}\}$ the equatorial radii of $E_a$, $R_{b1}$ the polar radius of $E_b$, and $\{R_{b2}, R_{b3}\}$ the equatorial radii of $E_b$. Based on Eq. (10), $R_{ai} < R_a$ and $R_{bi} < R_b$, $\forall i$, where $R_a$ and $R_b$ are the radii of the spheres over classes $A$ and $B$, respectively. We follow the proof technique of Theorem 1 to present the generalization errors of AL with an ellipsoid.

Following this conclusion, representation data can achieve the optimal generalization error if they are spread over the tube structure. Let $\gamma = d_a$; the probability of achieving a nearly zero generalization error can then be expressed as the volume ratio of the $\gamma$-tube and $E_a$:
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}} = \frac{\mathrm{Vol}(\mathrm{Tube}_\gamma)}{\frac{4}{3}\pi R_{a1}R_{a2}R_{a3}} = 1 - \frac{\lambda_{k1}\lambda_{k2}\lambda_{k3}}{R_{a1}R_{a2}R_{a3}},$$
where $\lambda_{ki}$ and $\sigma_{ki} = \big(\frac{2k-4}{3k} - \frac{\pi R_{ai}}{2\phi}\big)\pi R_{a1}R_{a2}R_{a3} - \frac{4\pi R_{b1}R_{b2}R_{b3}}{3k}$, $i = 1, 2, 3$, are as defined in Theorem 4. Theorem 4 is then as stated.

D.6 PROOF OF PROPOSITION 1

Proof. Let $\mathrm{Cube}_a$ tightly cover $S_a$ with a side length of $2R_a$, and let $\mathrm{Cube}'_a$ tightly cover the cut $C$. Following Theorem 1, the corresponding lower bound for the sphere is $1 - \frac{d_a}{R_a}$. For the ellipsoid, we know
$$\mathrm{err}(h,k) > \frac{\pi R_{a1}R_{a2}R_{a3} - \pi d_a R_{a2}R_{a3}}{\pi R_{a1}R_{a2}R_{a3}} = 1 - \frac{d_a}{R_{a1}}.$$
Since $R_{a1} < R_a$, we have $1 - \frac{d_a}{R_a} > 1 - \frac{d_a}{R_{a1}}$. Thus the lower bound of AL with an ellipsoid is tighter than that of AL with a sphere, and Proposition 1 holds.

D.7 PROOF OF PROPOSITION 2

Proof. Following the proof of Theorem 3:
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}} = 1 - \frac{t_k^3}{R_a^3} = 1 - \frac{(R_a - d_a)^3}{R_a^3} = 1 - \Big(1 - \frac{d_a}{R_a}\Big)^3.$$
Following the proof of Theorem 4:
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}} = 1 - \frac{\lambda_{k1}\lambda_{k2}\lambda_{k3}}{R_{a1}^3} = 1 - \frac{(R_{a1} - d_a)(R_{a2} - d_a)(R_{a3} - d_a)}{R_{a1}^3} > 1 - \Big(1 - \frac{d_a}{R_{a1}}\Big)^3.$$

Proof (of Theorem 6). Given $S_a$ over class $A$ defined by $x_1^2 + x_2^2 + x_3^2 = R_a^2$, let $x_1^2 + x_2^2 = r_a^2$ be its 2-D generalization, and let $x_2$ be a variable parameter in this 2-D formula. The "volume" of the 2-D object (its area) can be expressed as
$$\mathrm{Vol}_2(S_a) = \int_{-r_a}^{r_a} 2\sqrt{r_a^2 - x_2^2}\,dx_2. \quad (65)$$
Let $\vartheta$ be an angle variable satisfying $x_2 = r_a\sin(\vartheta)$, so that $dx_2 = r_a\cos(\vartheta)d\vartheta$; Eq. (65) is rewritten as
$$\mathrm{Vol}_2(S_a) = \int_{-\pi/2}^{\pi/2} 2r_a^2\cos^2(\vartheta)\,d\vartheta = \int_0^{\pi/2} 4r_a^2\cos^2(\vartheta)\,d\vartheta. \quad (66)$$
For a 3-D geometry, the variable $x_3$ ranges over a cross-section which is a 2-dimensional ball (circle) whose radius can be expressed as $r_a\cos(\vartheta)$, s.t. $\vartheta \in [0, \pi]$; here $\mathrm{Vol}_{m-1}$ denotes the volume of the $(m-1)$-dimensional generalization geometry of $S_a$.



https://en.wikipedia.org/wiki/Spherical_cap



Figure 1: Illustration of the two-stage GBALD framework. BALD has two types of interpretation: model uncertainty estimation and core-set construction, where the deeper the color of a core-set element, the higher its representativeness; GBALD integrates them into a uniform framework. Stage 1: the core-set is constructed on an ellipsoid, not the typical sphere, representing the original distribution to initialize the input features of the DNN. Stage 2: model uncertainty estimation with those initial acquisitions then derives highly informative and representative samples for the DNN.

Figure 2: Optimizing BALD with sphere and ellipsoid geodesics. The ellipsoid geodesic rescales the sphere geodesic to prevent the core-set updates from drifting towards the boundary regions of the sphere, where the characteristics of the distribution cannot be captured. Black points denote the feasible updates of the red points; dashed lines denote the geodesics.

Figure 3: Acquisitions with uninformative priors from digit '0' and '1'.

Figure 4: GBALD outperforms BALD using ranked informative acquisitions which cooperate with representation constraints. Repeated or similar acquisitions delay the acceleration of BALD's model training. Following the experiment settings of Section 5.1, we compare the best performance of BALD with a batch size of 1 and GBALD with different batch-size parameters. Following Eq. (14), we set b = {3, 5, 7} and b′ = 1, respectively; that is, we output the most representative data from a batch of highly informative acquisitions. Different settings of b and b′ are used to observe the parameter perturbations of GBALD.

Figure 5: Active acquisitions on MNIST, SVHN, and CIFAR10 datasets.

Figure 6: Comparisons of BALD, BatchBALD, and GBALD on active acquisitions on MNIST with batch settings.

The batch sizes of the compared baselines are 100, where GBALD ranks 300 acquisitions to select 100 data for the training, i.e. b = 300, b ′ = 100. The mean±std values of these baselines of the breakpoints (i.e. {0, 100, 200, ..., 10000}) are reported in Table

Figure 7: Active acquisitions on SVHN with 5,000 and 10,000 repeated samples.

Figure 9: Accelerations of accuracy of different baselines on MNIST, SVHN, and CIFAR10 datasets.

Hyperparameter settings on batch returns b and batch outputs b′. The experiments of Sections 5.1 and 5.2 used different b and b′ to observe the parameter perturbations. No matter what the settings of b′ and b are, GBALD still outperforms BALD. For single acquisitions of GBALD, we suggest b = 3 and b′ = 1. For batch acquisitions, the settings of b′ and b are user-defined according to the time cost and hardware resources.

Figure 12 presents the penalty matrices over the learning curves of Figure 5. The column-wise values at the bottom of each matrix show the overall performance of the compared baselines. As shown, GBALD performs significantly better than the other baselines over the three datasets; especially on SVHN, it shows superior performance.

Figure 13: Assumption of the generalization analysis. The ball above h denotes S b , the ball below h denotes S a , and R b < R a . Spherical cap 1 of the half-sphere of S a is denoted as D.

$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}} = \mathrm{Vol}(\mathrm{Tube}_\gamma)/\mathrm{Vol}(S_a)$, where $t_k$ keeps consistent with Eq. (40). With the initial sampling from the tube structure of class A, the subsequent acquisitions of AL would be updated from the tube structure of class B; if the initial sampling comes from the tube structure of B, the next acquisition must be updated from the tube structure of A. With the updated acquisitions spread over the tube structures of both classes, h easily achieves a nearly zero error. Theorem 3 is then as stated.

D.5 PROOF OF THEOREM 4

Proof. Following the proof technique of Theorem 3, the volume of the tube is redefined as $\mathrm{Vol}(\mathrm{Tube}_\gamma) = \frac{4}{3}\pi\big[R_{a1}R_{a2}R_{a3} - (R_{a1}-d_a)(R_{a2}-d_a)(R_{a3}-d_a)\big]$. Then we know
$$\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}} = \frac{\mathrm{Vol}(\mathrm{Tube}_\gamma)}{\mathrm{Vol}(E_a)} = \frac{R_{a1}R_{a2}R_{a3} - (R_{a1}-d_a)(R_{a2}-d_a)(R_{a3}-d_a)}{R_{a1}R_{a2}R_{a3}}.$$

$\mathrm{Cube}_e$ tightly covers $E_a$ with a side length of $2R_{a1}$, and $\mathrm{Cube}'_e$ tightly covers $C$.

Based on Proposition 1, $1 - \frac{d_a}{R_a} > 1 - \frac{d_a}{R_{a1}}$; therefore $\Pr[\mathrm{err}(h,k)=0]_{\mathrm{Sphere}} < \Pr[\mathrm{err}(h,k)=0]_{\mathrm{Ellipsoid}}$. Proposition 2 is then as stated.

D.8 LOWER-DIMENSIONAL GENERALIZATION OF THE d-DIMENSIONAL GEOMETRY

With the above theoretical results, we next present a connection between the 3-D geometry and the d-dimensional geometry. The major technique is to prove that the volume of the 3-D geometry is a lower-dimensional generalization of the d-dimensional geometry; the whole proof process from Theorems 1 to 5 then holds in the d-dimensional geometry.

Theorem 6. Let $\mathrm{Vol}_d$ be the volume of a d-dimensional geometry; $\mathrm{Vol}_3(S_a)$ is a low-dimensional generalization of $\mathrm{Vol}_d(S_a)$.

Particularly, let $\mathrm{Vol}_3(S_a)$ be the volume of $S_a$ in 3 dimensions; the volume of this 3-dimensional sphere can then be written as
$$\mathrm{Vol}_3(S_a) = \int_0^{\pi/2} 2\,\mathrm{Vol}_2(r_a\cos(\vartheta))\,r_a\cos(\vartheta)\,d\vartheta. \quad (67)$$
With Eq. (67), the volume of an m-dimensional geometry can be expressed as the integral over the (m−1)-dimensional cross-sections of $S_a$:
$$\mathrm{Vol}_m(S_a) = \int_0^{\pi/2} 2\,\mathrm{Vol}_{m-1}(r_a\cos(\vartheta))\,r_a\cos(\vartheta)\,d\vartheta,$$
where $\mathrm{Vol}_{m-1}$ denotes the volume of the $(m-1)$-dimensional generalization geometry of $S_a$.

use MC dropout to estimate predictive uncertainty, approximating a Bayesian convolutional neural network. Lakshminarayanan et al. (2017) estimate predictive uncertainty using a proper scoring rule as the training criterion to train a DNN.

Mean±std of the test accuracies at the breakpoints of the learning curves on MNIST, SVHN, and CIFAR-10.

Number of acquisitions on MNIST, SVHN, and CIFAR-10 until 70%, 80%, and 90% accuracies are reached. {100, 200, ..., 20000}. We then calculate their average accuracies and std values over these acquisition points. As shown in Table

Mean±std of active acquisitions by BALD, BatchBALD, and GBALD on MNIST with batch settings.

Mean±std of active acquisitions on SVHN with 5,000 and 10,000 repeated samples.

Mean±std of active noisy acquisitions on SVHN with 5,000 and 10,000 noisy samples.

Relationship between accuracy and core-set size on SVHN.

A t-test can enlarge the significant difference among those baselines. In a typical t-test, the two groups of observations usually require a degree of freedom smaller than 30. However, the numbers of breakpoints on MNIST, SVHN, and CIFAR-10 are 61, 101, and 201, respectively, thereby yielding degrees of freedom of 60, 100, and 200. We thus introduce a t-test score to directly compare the significant difference between pairwise baselines. The t-test score between any pair of breakpoint groups is defined as follows. Let $B_1 = \{\alpha_1, \alpha_2, ..., \alpha_n\}$ and $B_2 = \{\beta_1, \beta_2, ..., \beta_n\}$; there exists a t-score of
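Since the paper's t-score formula is truncated here, the sketch below assumes the standard two-sample (Welch-form) t statistic over equal-sized groups; the accuracy values are hypothetical, not from the paper's tables:

```python
from math import sqrt
from statistics import mean, variance

def t_score(b1, b2):
    """Two-sample t statistic between breakpoint accuracy groups B1 and B2.
    Welch form, assumed to match the paper's truncated definition."""
    n1, n2 = len(b1), len(b2)
    return (mean(b1) - mean(b2)) / sqrt(variance(b1) / n1 + variance(b2) / n2)

# two hypothetical groups of test accuracies at matched acquisition points
b1 = [0.90, 0.91, 0.92, 0.93, 0.94]
b2 = [0.88, 0.89, 0.90, 0.91, 0.92]
assert t_score(b1, b2) > 0           # B1 dominates B2 on average
assert abs(t_score(b1, b1)) < 1e-12  # identical groups -> zero score
```

A larger positive score indicates that the first method's breakpoint accuracies are higher relative to the pooled variability, which is what the comparisons in the tables above report.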


The proof studies two cases: $R_{a1} \ge R_{b1}$ and $R_{a1} < R_{b1}$. 1) If $R_{a1} \ge R_{b1}$, $q^* \in A$. Given the volume function $\mathrm{Vol}(\cdot)$ over the 3-D geometry, we know $\mathrm{Vol}(A) = \frac{4\pi}{3} R_{a1}R_{a2}R_{a3}$ and $\mathrm{Vol}(B) = \frac{4\pi}{3} R_{b1}R_{b2}R_{b3}$. Let the minimum distance of $q_a$ to the boundary of $A$ be $d_a$. Let $E_a$ be cut off by a cross-section $h'$, where $C$ is the cut and $D$ is the ellipsoid cap of the half-ellipsoid satisfying Eq. (49); approximating the volume of $D$, Eq. (49) can be rewritten as Eq. (50). Introducing Eq. (50) into Eq. (48) yields Eq. (52), which simplifies under a substitution, giving the positive solution of $\lambda_k$. The remaining proof process follows Eq. (40) to Eq. (46) of Theorem 1. We thus conclude the stated bound, in which $R_{a1}R_{a2}R_{a3}$ and $R_{b1}R_{b2}R_{b3}$ can be written in a simple way as

D.4 PROOF OF THEOREM 3

Proof. In clustering stability, the $\gamma$-tube structure that surrounds the cluster boundary largely decides the performance of a learning algorithm. The definition of the $\gamma$-tube is as follows.

Definition 1 ($\gamma$-tube). $\mathrm{Tube}_\gamma(f)$ is the set of points distributed near the boundary of the cluster:
$$\mathrm{Tube}_\gamma(f) := \{x \in X : d(x, B(f)) \le \gamma\},$$
where $X$ is a noise-free cluster with $n$ samples, $B(f) := \{x \in X : f \text{ is discontinuous at } x\}$, $f$ is a clustering function, and $d(\cdot,\cdot)$ denotes the distance function.

Based on Eq. (68), we know $\mathrm{Vol}_3$ can be written as Eq. (69). Introducing Eq. (66) into Eq. (69), we have the stated result.

Proof. The integral of Eq. (67) can also be adapted to the volume $\mathrm{Vol}_3(E_a)$ by transforming the area, i.e., $\mathrm{Vol}_2(S_a)$, into $\mathrm{Vol}_2(E_a)$. Eq. (68) then follows this transform.
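The membership test behind Definition 1 reduces to a distance threshold against the boundary set $B(f)$. The 1-D sketch below is our own illustration; the distance function, the boundary set, and the points are hypothetical choices, not quantities from the paper:

```python
def in_tube(x, boundary_points, gamma, dist=lambda u, v: abs(u - v)):
    """Membership test for Tube_gamma(f): x lies in the tube iff its distance
    to the cluster-boundary set B(f) is at most gamma.
    1-D sketch; `dist` and `boundary_points` are illustrative assumptions."""
    return min(dist(x, b) for b in boundary_points) <= gamma

# two 1-D clusters separated at a single decision boundary b = 0.0
boundary = [0.0]
gamma = 0.3
assert in_tube(0.2, boundary, gamma)      # near the boundary -> inside the tube
assert not in_tube(1.5, boundary, gamma)  # deep inside a cluster -> outside
```

Points inside the tube are exactly the boundary-region samples whose labels most affect the clustering function $f$, which is why the tube volume governs the error bounds in Theorems 3 and 4.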

