EXPLORING ACTIVE 3D OBJECT DETECTION FROM A GENERALIZATION PERSPECTIVE

Abstract

To alleviate the high annotation cost in LiDAR-based 3D object detection, active learning is a promising solution that learns to select only a small portion of unlabeled data to annotate, without compromising model performance. Our empirical study, however, suggests that mainstream uncertainty-based and diversity-based active learning policies are not effective when applied to the 3D detection task, as they fail to balance the trade-off between point cloud informativeness and box-level annotation costs. To overcome this limitation, we jointly investigate three novel criteria in our framework CRB for point cloud acquisition: label conciseness, feature representativeness and geometric balance, which hierarchically filters out point clouds with redundant 3D bounding box labels, latent features and geometric characteristics (e.g., point cloud density) from the unlabeled sample pool and greedily selects informative ones with fewer objects to annotate. Our theoretical analysis demonstrates that the proposed criteria align the marginal distributions of the selected subset with the prior distributions of the unseen test set, and minimize the upper bound of the generalization error. To validate the effectiveness and applicability of CRB, we conduct extensive experiments on the two benchmark 3D object detection datasets KITTI and Waymo and examine both one-stage (i.e., SECOND) and two-stage (i.e., PV-RCNN) 3D detectors. Experiments evidence that the proposed approach outperforms existing active learning strategies and achieves fully supervised performance while requiring annotations for only 1% of the bounding boxes and 8% of the point clouds.

1. INTRODUCTION

LiDAR-based 3D object detection plays an indispensable role in 3D scene understanding, with a wide range of applications such as autonomous driving (Deng et al., 2021; Wang et al., 2020) and robotics (Ahmed et al., 2018; Montes et al., 2020; Wang et al., 2019). The emerging stream of 3D detection models enables accurate recognition at the cost of large-scale labeled point clouds, where 7-degrees-of-freedom (DOF) 3D bounding boxes (consisting of position, size, and orientation information) are annotated for each object. In benchmark datasets like Waymo (Sun et al., 2020), there are over 12 million LiDAR boxes, and labeling a single precise 3D box takes an annotator more than 100 seconds (Song et al., 2015). This prerequisite for the performance boost greatly hinders the feasibility of deploying models in the wild, especially when the annotation budget is limited. To alleviate this limitation, active learning (AL) aims to reduce labeling costs by querying labels for only a small portion of unlabeled data. The criterion-based query selection process iteratively selects the most beneficial samples for subsequent model training until the labeling budget runs out. The criterion is expected to quantify sample informativeness using heuristics derived from sample uncertainty (Gal et al., 2017; Du et al., 2021; Caramalau et al., 2021; Yuan et al., 2021; Choi et al., 2021; Zhang et al., 2020; Shi & Li, 2019) and sample diversity (Ma et al., 2021; Gudovskiy et al., 2020; Gao et al., 2020; Sinha et al., 2019; Pinsler et al., 2019).
In particular, uncertainty-driven approaches focus on the samples whose labels the model is least confident about, thus searching for candidates with: maximum entropy (MacKay, 1992; Shannon, 1948; Kim et al., 2021b; Siddiqui et al., 2020; Shi & Yu, 2019), disagreement among different experts (Freund et al., 1992; Tran et al., 2019), minimum posterior probability of a predicted class (Wang et al., 2017), or reducible yet maximum estimated error (Roy & McCallum, 2001; Yoo & Kweon, 2019; Kim et al., 2021a). On the other hand, diversity-based methods try to find the most representative samples to avoid sample redundancy. To this end, they form subsets that are sufficiently diverse to describe the entire data pool by making use of greedy coreset algorithms (Sener & Savarese, 2018) or clustering algorithms (Nguyen & Smeulders, 2004). Recent works (Liu et al., 2021; Citovsky et al., 2021; Kirsch et al., 2019; Houlsby et al., 2011) combine the aforementioned heuristics: they measure uncertainty as the gradient magnitude of samples (Ash et al., 2020) or its second-order metrics (Liu et al., 2021) at the final layer of neural networks, and then select samples with gradients spanning a diverse set of directions. While effective, these hybrid approaches commonly cause heavy computational overhead, since gradient computation is required for each sample in the unlabeled pool. Another stream of works applies active learning to 2D/3D object detection tasks (Feng et al., 2019; Schmidt et al., 2020; Wang et al., 2022; Wu et al., 2022; Tang et al., 2021), leveraging ensemble (Beluch et al., 2018) or Monte Carlo (MC) dropout (Gal & Ghahramani, 2016) algorithms to estimate the classification and localization uncertainty of bounding boxes for image/point cloud acquisition (more details in Appendix I).
Nevertheless, these AL methods generally favor point clouds with more objects, which have a higher chance of containing uncertain and diverse objects. With a fixed annotation budget, selecting such point clouds is far from optimal, since more clicks are required to form the 3D box annotations. To overcome the above limitations, we propose to learn AL criteria for cost-efficient sample acquisition at the 3D box level by empirically studying their relationship with optimizing the generalization upper bound. Specifically, we propose three selection criteria for cost-effective point cloud acquisition, termed CRB: label conciseness, feature representativeness and geometric balance. We divide the sample selection process into three stages: (1) To alleviate the issues of label redundancy and class imbalance, and to ensure label conciseness, we first calculate the entropy of bounding box label predictions and only pass the top-K_1 point clouds to Stage 2; (2) We then examine the feature representativeness of candidates by formulating the task as a K_2-medoids problem on the gradient space. To jointly consider the impact of the classification and regression objectives on gradients, we enable Monte Carlo dropout (MC-DROPOUT) and construct hypothetical labels by averaging predictions from multiple stochastic forward passes; (3) Finally, to maintain the geometric balance property, we minimize the KL divergence between the marginal distributions of the point cloud density of each predicted bounding box. This makes the trained detector predict more accurate localization and sizes of objects, and recognize both close (i.e., dense) and distant (i.e., sparse) objects at test time, using a minimum number of annotations. We base our criterion design on a theoretical analysis of optimizing the upper bound of the generalization risk, which can be reformulated as a distribution alignment between the selected subset and the test set.
Note that since the empirical distribution of the test set is not observable during training, we make an appropriate assumption on its prior distribution, without loss of generality (WLOG).

Contributions.

Our work is a pioneering study of active learning for 3D object detection, aiming to boost detection performance at the lowest cost of bounding box-level annotations. To this end, we propose a hierarchical active learning scheme for 3D object detection, which progressively filters candidates according to the derived selection criteria without triggering heavy computation. Extensive experiments demonstrate that the proposed CRB strategy consistently outperforms all state-of-the-art AL baselines on two large-scale 3D detection datasets, irrespective of the detector architecture. To enhance the reproducibility of our work and accelerate future research in this direction, we develop an active-3D-det toolbox, which accommodates various AL approaches and 3D detectors.

2. METHODOLOGY

2.1 PROBLEM FORMULATION

In this section, we mathematically formulate the problem of active learning for 3D object detection and set up the notation. Given an orderless LiDAR point cloud P = {(x, y, z, e)} with 3D locations (x, y, z) and reflectance e, the goal of 3D object detection is to localize the objects of interest as a set of 3D bounding boxes B = {b_k}_{k∈[N_B]}, with N_B indicating the number of detected bounding boxes, and to predict the associated box labels Y = {y_k}_{k∈[N_B]} ∈ Y = {1, . . . , C}, with C being the number of classes. Each bounding box b represents the relative center position (p_x, p_y, p_z) to the object ground plane, the box size (l, w, h), and the heading angle θ. Mainstream 3D object detectors use point clouds P to extract point-level features x ∈ R^{W·L·F}, either point-wise (Shi et al., 2019; Yang et al., 2019; 2020) or by voxelization (Shi et al., 2020).

2.2. THEORETICAL MOTIVATION

The core question of active 3D detection is how to design a proper criterion, based on which a fixed number of unlabeled point clouds can be selected to achieve minimum empirical risk R_T[ℓ(f, g; w)] on the test set D_T with minimum annotation time. Below, inspired by (Mansour et al., 2009; Ben-David et al., 2010), we derive a generalization bound for active 3D detection, so that the desired acquisition criteria can be obtained by optimizing the generalization risk.

Theorem 2.1. Let H be a hypothesis space of Vapnik-Chervonenkis (VC) dimension d, with f and g being the classification and regression branches, respectively. Let \hat{D}_S and \hat{D}_T denote the empirical distributions induced by samples drawn from the acquired subset D_S and the test set D_T, and let ℓ be the loss function bounded by J. Then for all δ ∈ (0, 1) and all f, g ∈ H, with probability at least 1 − δ, the following inequality holds:

R_T[\ell(f, g; w)] \le R_S[\ell(f, g; w)] + \tfrac{1}{2}\,\mathrm{disc}(\hat{D}_S, \hat{D}_T) + \lambda^* + \mathrm{const}, \quad \text{where}

\mathrm{const} = 3J\left(\sqrt{\tfrac{\log(4/\delta)}{2N_r}} + \sqrt{\tfrac{\log(4/\delta)}{2N_t}}\right) + \sqrt{\tfrac{2d\log(eN_r/d)}{N_r}} + \sqrt{\tfrac{2d\log(eN_t/d)}{N_t}}.

Notably, λ* = R_T[ℓ(f*, g*; w*)] + R_S[ℓ(f*, g*; w*)] denotes the joint risk of the optimal hypotheses f* and g*, with w* being the model weights, and N_r and N_t indicate the number of samples in D_S and D_T, respectively. The proof can be found in the supplementary material.

Remark. The first term is the training error on the selected subset, which is assumed to be trivial under the zero training error assumption (Sener & Savarese, 2018). To obtain a tight upper bound on the generalization risk, the optimal subset D*_S can therefore be determined by minimizing the discrepancy distance between the two empirical distributions, i.e., D*_S = arg min_{D_S ⊂ D_U} disc(\hat{D}_S, \hat{D}_T). Below, we define the discrepancy distance for the 3D object detection task.

Definition 1.
For any f, g, f′, g′ ∈ H, the discrepancy between the empirical distributions of the selected set \hat{D}_S and the test set \hat{D}_T can be formulated as

\mathrm{disc}(\hat{D}_S, \hat{D}_T) = \sup_{f, f' \in H} \big| \mathbb{E}_{\hat{D}_S}\,\ell(f, f') - \mathbb{E}_{\hat{D}_T}\,\ell(f, f') \big| + \sup_{g, g' \in H} \big| \mathbb{E}_{\hat{D}_S}\,\ell(g, g') - \mathbb{E}_{\hat{D}_T}\,\ell(g, g') \big|,

where the bounded expected loss ℓ for any classification and regression functions is symmetric and satisfies the triangle inequality.

Remark. As 3D object detection is naturally an integration of classification and regression tasks, mitigating the set discrepancy essentially amounts to aligning the inputs and outputs of each branch. Therefore, with the detector frozen during active selection, finding an optimal D*_S can be interpreted as enhancing the acquired set's (1) Label Conciseness: aligning the marginal label distribution of bounding boxes, (2) Feature Representativeness: aligning the marginal distribution of the latent representations of point clouds, and (3) Geometric Balance: aligning the marginal distribution of the geometric characteristics of point clouds and predicted bounding boxes. This can be written as

D^*_S \approx \arg\min_{D_S \subset D_U} \underbrace{d_A(P_{Y_S}, P_{Y_T})}_{\text{Conciseness}} + \underbrace{d_A(P_{X_S}, P_{X_T})}_{\text{Representativeness}} + \underbrace{d_A(P_{\phi(P_S, B_S)}, P_{\phi(P_T, B_T)})}_{\text{Balance}}. \quad (1)

Here, P_S and P_T represent the point clouds in the selected set and in the test set, respectively; ϕ(·) indicates the geometric descriptor of point clouds, and d_A is the d_A distance (Kifer et al., 2004), which can be estimated from a finite set of samples. For the latent features X_S and X_T, we only focus on features that differ from those of the training set, since E_{D_L} ℓ_cls = 0 and E_{D_L} ℓ_reg = 0 under the zero training error assumption. Considering that test samples and their associated labels are not observable during training, we make an assumption on the prior distributions of the test data: WLOG, we assume that the prior distributions of bounding box labels and geometric features are uniform.
Note that we can adopt the KL divergence for the implementation of d_A, assuming that latent representations follow a univariate Gaussian distribution. Connections with existing AL approaches. The proposed criteria jointly optimize the discrepancy distance for both tasks with three objectives, which reveals connections with existing AL strategies. Uncertainty-based methods focus strongly on the first term, based on the assumption that learning more difficult samples helps to reduce the suprema of the loss. This rigorous assumption can result in a bias towards hard samples, which is accumulated and amplified across iterations. Diversity-based methods put more effort into minimizing the second term, aiming to align the distributions in the latent subspace. However, diversity-based approaches are unable to discover the latent features specific to regression, which can be critical when dealing with a detection problem. We introduce the third term for the 3D detection task, motivated by the fact that aligning the geometric characteristics of point clouds helps to preserve the fine-grained details of objects, leading to more accurate regression. Our empirical study in Sec. 3.3 suggests that jointly optimizing the three terms leads to the best performance.

2.3. OUR APPROACH

To optimize the three criteria outlined in Eq. 1, we derive an AL scheme consisting of three components. In particular, to reduce the computational overhead, we hierarchically filter the samples that meet the selection criteria (illustrated in Fig. 1): we first pick K_1 candidates by concise label sampling (Stage 1), from which we select K_2 representative prototypes (Stage 2), with K_1, K_2 ≪ n. Finally, we leverage greedy search (Stage 3) to find the N_r prototypes that best match the prior marginal distribution of the test data. The hierarchical sampling scheme saves O((n − K_1)T_2 + (n − K_2)T_3) cost, with T_2 and T_3 indicating the runtime of criterion evaluation at Stages 2 and 3. The algorithm is summarized in the supplementary material. In the following, we describe the details of the three stages.

Stage 1: Concise Label Sampling (CLS). By using label conciseness as a sampling criterion, we aim to alleviate label redundancy and align the source label distribution with the target prior label distribution. In particular, we find a subset D*_{S1} of size K_1 that minimizes the Kullback-Leibler (KL) divergence between the probability distribution P_{Y_S} and the uniform distribution P_{Y_T}. To this end, we express the KL divergence via the Shannon entropy H(·) and obtain an optimization problem of maximizing the entropy of the label distributions:

D_{KL}(P_{\hat{Y}_{S1}} \,\|\, P_{Y_T}) = -H(\hat{Y}_{S1}) + \log|\hat{Y}_{S1}|,

D^*_{S1} = \arg\min_{D_{S1} \subset D_U} D_{KL}(P_{\hat{Y}_{S1}} \,\|\, P_{Y_T}) = \arg\max_{D_{S1} \subset D_U} H(\hat{Y}_{S1}),

where log|\hat{Y}_{S1}| = log K_1 indicates the number of values \hat{Y}_{S1} can take on, which is a constant. Note that P_{Y_T} is a uniform distribution, so the constant terms are removed from the formulation. We pass all point clouds {P_j}_{j∈[n]} from the unlabeled pool to the detector and extract the predicted labels {ŷ_i}_{i=1}^{N_B} for the N_B bounding boxes, with ŷ_i = arg max_{y∈[C]} f(x_i; w_f).
The label entropy of the j-th point cloud can then be calculated as

H(\hat{Y}_{j}) = -\sum_{c=1}^{C} p_c \log p_c, \quad p_c = \frac{\exp(|\{i : \hat{y}_i = c\}| / N_B)}{\sum_{c'=1}^{C} \exp(|\{i : \hat{y}_i = c'\}| / N_B)}.

Based on the calculated entropy scores, we keep the top-K_1 candidates and pass them to the Stage 2 representative prototype selection.

Stage 2: Representative Prototype Selection (RPS). In this stage, we aim to identify whether the subsets cover the unique knowledge encoded only in D_U and not in D_L, by measuring feature representativeness with the gradient vectors of point clouds. Concretely, we find representative prototypes in the gradient space G to form the subset D_{S2}, where the magnitude and orientation of gradients represent the uncertainty and diversity of the new knowledge, respectively. For a classification problem, gradients can be retrieved by feeding the hypothetical label ŷ = arg max_{y∈[C]} p(y|x) to the network. However, gradient extraction for the regression problem has not yet been explored in the literature, since hypothetical labels for regression heads cannot be directly obtained. To mitigate this, we propose to enable Monte Carlo dropout (MC-DROPOUT) at Stage 1 and use the average predictions \hat{B} of M stochastic forward passes through the model as the hypothetical labels for the regression loss:

\hat{B} \approx \frac{1}{M} \sum_{i=1}^{M} g(x; w_d, w_g), \quad w_d \sim \mathrm{Bernoulli}(1 - p),

G_{S2} = \{\nabla_{\Theta}\, \ell_{reg}(g(x), \hat{B}; w_g),\; x \sim D_{S2}\},

with p indicating the dropout rate, w_d the random variable of the dropout layers, and Θ the parameters of the convolutional layer of the shared block. The gradient maps G_{S2} ∈ G are extracted from the shared layers and calculated by the chain rule.
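As a concrete illustration, the Stage 1 scoring can be sketched in a few lines of NumPy; `label_entropy` and `concise_label_sampling` are hypothetical helper names, and the actual toolbox operates on detector outputs rather than raw label lists:

```python
import numpy as np

def label_entropy(pred_labels, num_classes):
    """Entropy of the predicted box-label distribution of one point cloud.

    `pred_labels`: predicted class indices of the N_B detected boxes.
    As in Stage 1, a softmax over normalized class counts gives p_c.
    """
    n_boxes = max(len(pred_labels), 1)
    counts = np.bincount(pred_labels, minlength=num_classes) / n_boxes
    probs = np.exp(counts) / np.exp(counts).sum()  # softmax over count ratios
    return float(-(probs * np.log(probs)).sum())

def concise_label_sampling(all_pred_labels, num_classes, k1):
    """Rank point clouds by label entropy and keep the top-K_1 candidates."""
    scores = [label_entropy(y, num_classes) for y in all_pred_labels]
    order = np.argsort(scores)[::-1]  # descending entropy
    return order[:k1].tolist()
```

A point cloud whose predicted boxes cover the classes evenly receives a higher entropy score than one dominated by a single class, which is exactly the conciseness preference described above.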
Since the gradients of test samples are not observable, we assume that their prior distribution is Gaussian, which allows us to rewrite the optimization problem as

D^*_{S2} = \arg\min_{D_{S2} \subset D_{S1}} D_{KL}(P_{X_{S2}} \,\|\, P_{X_T}) \approx \arg\min_{D_{S2} \subset D_{S1}} D_{KL}(P_{G_{S2}} \,\|\, P_{G_T}) = \arg\min_{D_{S2} \subset D_{S1}} \log\frac{\sigma_T}{\sigma_{S2}} + \frac{\sigma_{S2}^2 + (\mu_{S2} - \mu_T)^2}{2\sigma_T^2} - \frac{1}{2} \approx K_2\text{-medoids}(G_{S1}),

with µ_{S2} and σ_{S2} (µ_T and σ_T) being the mean and standard deviation of the univariate Gaussian distribution of the selected set (test set), respectively. Under this formulation, the task of finding a representative set can be viewed as picking K_2 prototypes (i.e., K_2-medoids) from the clustered data, so that the centroids (mean values) of the selected subset and the test set are naturally matched; the variance σ_{S2}, essentially the distance of each point to its prototype, is minimized simultaneously. We test different approaches for selecting prototypes in Sec. 3.3.

Stage 3: Greedy Point Density Balancing (GPDB). The third criterion adopted is geometric balance, which targets aligning the distribution of the selected prototypes with the marginal distribution of the test point clouds. As point clouds typically consist of thousands (if not millions) of points, it is computationally expensive to directly align the meta features (e.g., coordinates) of points. Furthermore, in representation learning for point clouds, the common practice of using voxel-based architectures relies on quantized representations of point clouds and loses object details due to the limited perception range of voxels. Therefore, we utilize the point density ϕ(·, ·) within each bounding box to preserve the geometric characteristics of an object in 3D point clouds.
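A minimal sketch of the prototype selection is given below, assuming the per-sample regression gradients have already been flattened into vectors; the deterministic farthest-first seeding is an illustrative choice on our part, not prescribed by the method:

```python
import numpy as np

def k_medoids(gradients, k2, n_iter=20):
    """Alternating K_2-medoids over per-sample gradient vectors.

    `gradients`: (N, D) array, one flattened regression-gradient per
    candidate point cloud. Returns indices of the K_2 medoids, i.e.,
    the representative prototypes kept for Stage 3.
    """
    gradients = np.asarray(gradients, dtype=float)
    dists = np.linalg.norm(gradients[:, None, :] - gradients[None, :, :], axis=-1)
    # Deterministic farthest-first seeding (coreset-style initialization).
    medoids = [0]
    while len(medoids) < k2:
        medoids.append(int(np.argmax(dists[:, medoids].min(axis=1))))
    medoids = np.array(medoids)
    for _ in range(n_iter):
        assign = np.argmin(dists[:, medoids], axis=1)  # nearest medoid per sample
        new = medoids.copy()
        for j in range(k2):
            members = np.where(assign == j)[0]
            if len(members):  # move medoid to the in-cluster cost minimizer
                new[j] = members[np.argmin(dists[np.ix_(members, members)].sum(axis=0))]
        if np.array_equal(new, medoids):
            break
        medoids = new
    return sorted(medoids.tolist())
```

Picking medoids (actual samples) rather than means matters here: each prototype must correspond to a real point cloud that can be sent to the annotator.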
By aligning the geometric characteristics of the selected set and the unlabeled pool, the fine-tuned detector is expected to predict more accurate localization and sizes of bounding boxes and to recognize both close (i.e., dense) and distant (i.e., sparse) objects at test time. The probability density function (PDF) of the point density is not given and has to be estimated from the bounding box predictions. To this end, we adopt Kernel Density Estimation (KDE) using a finite set of samples from each class, computed as

p(\phi(P, B)) = \frac{1}{N_B h} \sum_{j=1}^{N_B} \mathrm{Ker}\!\left(\frac{\phi(P, B) - \phi(P, B_j)}{h}\right),

with h > 0 being a pre-defined bandwidth that determines the smoothing of the resulting density function; we use a Gaussian kernel for Ker(·). With the PDF defined, the optimization problem of selecting the final candidate set D_S of size N_r for the label query is

D^*_S = \arg\min_{D_S \subset D_{S2}} D_{KL}\big(p(\phi(P_S, B_S)) \,\|\, p(\phi(P_T, B_T))\big),

where ϕ(·, ·) measures the point density of each bounding box. We use greedy search to find the optimal combination from the subset D_{S2} that minimizes the KL divergence to the uniform distribution p(ϕ(P_T, B_T)) ∼ Uniform(α_lo, α_hi).

Generic AL Baselines. We implemented the following five generic AL baselines, whose implementation details can be found in the supplementary material.
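The KDE and greedy selection of Stage 3 can be sketched as follows; `kde_pdf`, `kl_to_uniform`, and `greedy_density_balancing` are illustrative names, and the discretized KL on a fixed evaluation grid is a simplification of the continuous objective:

```python
import numpy as np

def kde_pdf(samples, grid, h=5.0):
    """Gaussian KDE of per-box point densities, evaluated on `grid`."""
    diffs = (grid[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * diffs**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

def kl_to_uniform(densities, lo, hi, h=5.0, n_grid=64):
    """KL(KDE(densities) || Uniform(lo, hi)), discretized on a fixed grid."""
    grid = np.linspace(lo, hi, n_grid)
    p = kde_pdf(np.asarray(densities, dtype=float), grid, h) + 1e-12
    p /= p.sum()                       # normalize to a probability vector
    q = np.full(n_grid, 1.0 / n_grid)  # uniform reference distribution
    return float((p * np.log(p / q)).sum())

def greedy_density_balancing(box_densities, n_select, lo, hi):
    """Greedily add the candidate whose box densities best flatten the KDE.

    `box_densities[i]` lists the per-box point densities of candidate i.
    """
    chosen, acc = [], []
    pool = list(range(len(box_densities)))
    for _ in range(n_select):
        best = min(pool, key=lambda i: kl_to_uniform(acc + list(box_densities[i]), lo, hi))
        chosen.append(best)
        acc += list(box_densities[best])
        pool.remove(best)
    return chosen
```

A candidate whose boxes span both sparse and dense regions flattens the estimated density and is preferred over one whose boxes all share a similar density.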
(1) RAND: a basic sampling method that selects N_r samples at random in each selection round; (2) ENTROPY (Wang & Shang, 2014): an uncertainty-based active learning approach that targets the classification head of the detector and selects the top-N_r ranked samples based on the entropy of the sample's predicted label; (3) LLAL (Yoo & Kweon, 2019): an uncertainty-based method that adopts an auxiliary network to predict an indicative loss, enabling selection of samples for which the model is likely to produce wrong predictions; (4) CORESET (Sener & Savarese, 2018): a diversity-based method performing core-set selection that uses greedy furthest-first search on both labeled and unlabeled embeddings at each round; and (5) BADGE (Ash et al., 2020): a hybrid approach that samples instances that are disparate and of high magnitude when represented in a hallucinated gradient space.

Applied AL Baselines for 2D and 3D Detection. For a fair comparison, we also compared three variants of deep active learning methods for 3D detection and adapted one 2D active detection method to our 3D detector. (6) MC-MI (Feng et al., 2019) utilizes Monte Carlo dropout associated with mutual information to determine the uncertainty of point clouds. (7) MC-REG: additionally, to verify the importance of uncertainty in regression, we design an uncertainty-based baseline that determines regression uncertainty via M rounds of MC-DROPOUT stochastic forward passes at test time. The variances of the predictions are then calculated, and the samples with the top-N_r greatest variance are selected for label acquisition.
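For concreteness, the MC-REG acquisition score can be sketched as follows; the array shapes and helper names are illustrative assumptions, not the baseline's actual implementation:

```python
import numpy as np

def mc_reg_uncertainty(mc_predictions):
    """MC-REG score of one point cloud: mean predictive variance over
    M stochastic MC-dropout forward passes.

    `mc_predictions`: (M, N_boxes, 7) array of regression outputs
    (center, size, heading) from M passes with dropout enabled.
    """
    return float(np.var(np.asarray(mc_predictions), axis=0).mean())

def select_top_variance(scores, n_r):
    """Keep the N_r samples with the greatest variance for label queries."""
    return np.argsort(scores)[::-1][:n_r].tolist()
```

If the M passes agree on every box, the score is zero; disagreement across passes raises the variance and hence the sample's priority for annotation.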
We further adapted two applied AL methods for 2D detection to the 3D detection setting, where (8) LT/C (Kao et al., 2018) measures the class-specific localization tightness, i.e., the changes from the intermediate proposal to the final bounding box, and (9) CONSENSUS (Schmidt et al., 2020) calculates the variation ratio of the minimum IoU value for each RoI-match of 3D boxes.

3.2. COMPARISONS AGAINST ACTIVE LEARNING METHODS

Quantitative Analysis. We conducted comprehensive experiments on the KITTI and Waymo datasets to demonstrate the effectiveness of the proposed approach. K_1 and K_2 are empirically set to 300 and 200 for KITTI, and to 2,000 and 1,200 for Waymo. Under a fixed budget of point clouds, the performance of 3D and BEV detection achieved by different AL policies is reported in Figure 2, with the standard deviation over three trials shown as shaded regions. We can clearly observe that CRB consistently outperforms all state-of-the-art AL methods by a noticeable margin, irrespective of the number of annotated bounding boxes and the difficulty settings. It is worth noting that, on the KITTI dataset, annotation for the proposed CRB is 3 times faster than for RAND, while achieving comparable performance. Moreover, AL baselines for joint regression and classification tasks (e.g., LLAL) or for regression-only tasks (e.g., MC-REG) generally obtain higher scores, yet lead to higher labeling costs than classification-oriented methods (e.g., ENTROPY). Table 1 reports the main experimental results of the state-of-the-art generic AL methods and applied AL approaches for 2D and 3D detection on the KITTI dataset. It is observed that LLAL and LT/C achieve competitive results, as their acquisition criteria jointly consider the classification and regression tasks. Our proposed CRB improves the 3D mAP scores by 6.7%, which validates the effectiveness of minimizing the generalization risk.

Qualitative Analysis. To intuitively understand the merits of our proposed active 3D detection strategy, Figure 3 shows the 3D detection results yielded by RAND (bottom left) and CRB selection (bottom right), with the corresponding image in the upper row. Both 3D detectors are trained under a budget of 1K annotated bounding boxes. False positives and correct predictions are indicated with red and green boxes, respectively. It is observed that, under the same conditions, CRB produces more accurate and more confident predictions than RAND. Moreover, for the cyclist highlighted in the orange box in Figure 3, the detector trained with RAND produces a significantly lower confidence score than our approach. This confirms that the samples selected by CRB align better with the test cases. More visualizations can be found in the supplementary material.

3.3. ABLATION STUDY

Study of Active Selection Criteria. Table 2 reports performance comparisons of six variants of the proposed CRB method and the basic random selection baseline (1st row) on the KITTI dataset. We report the 3D and BEV mAP metrics at all difficulty levels with 1,000 bounding boxes annotated. We observe that applying only GPDB (4th row) produces 12.5% lower scores and greater variance than the full model (last row); however, adding CLS (6th row) increases the performance by approximately 10% with the minimum variance. This phenomenon evidences the importance of optimizing the discrepancy for both classification and regression tasks. It is further shown that removing any selection criterion from the proposed CRB triggers a drop in mAP scores, confirming the importance of each criterion in a sample-efficient AL strategy.

Sensitivity to Prototype Selection. We examine the sensitivity of performance to the prototype selection method used in the RPS module on the KITTI dataset (moderate difficulty level). In Figure 4 (right), we show the performance of our approach using the prototype selection methods of the Gaussian mixture model (GMM), K-MEANS, and K-MEANS++. To fairly reflect the trend in the performance curves, we run two trials for each prototype selection approach and plot the mean and the variance bars. K-MEANS is slightly more stable than the other two, with higher time complexity and better representation learning. Overall, there is very little variation (∼1.5%) in the performance of our approach across prototype selection methods, which confirms that CRB's superiority over existing baselines does not come from the prototype selection method.

Sensitivity Analysis of Thresholds K_1 and K_2. We examine the sensitivity of our approach to the threshold parameters K_1 and K_2. We report the mean average precision (mAP) on the KITTI dataset, including both 3D and BEV views at all difficulty levels. We check four possible combinations of K_1 and K_2 and show the results in Table 3.
We observe that at the MODERATE and HARD levels, there are fluctuations of only 3.28% and 2.81% in average mAP, respectively. In the last row, we further report the accuracy achieved by the backbone detector trained with all labeled training data and a larger batch size. With only 8% of the point clouds and 1% of the annotated bounding boxes, CRB achieves performance comparable to the full model.

4. DISCUSSION

This paper studies three novel criteria for sample-efficient active 3D object detection, which effectively achieve high performance with minimum 3D box annotation costs and runtime complexity. We theoretically analyze the relationship between finding the optimal acquired subset and mitigating the discrepancy between the selected and test sets. The framework is versatile and can accommodate existing AL strategies to provide in-depth insights into heuristic design. The limitation of this work lies in the set of assumptions made on the prior distribution of the test data, which could be violated in practice; for more discussion, please refer to Sec. A.1 in the Appendix. On the other hand, this opens an opportunity to adopt our framework for active domain adaptation, where the target distribution is accessible for alignment. Addressing these two avenues is left for future work.



Figure 1: An illustrative flowchart of the proposed CRB framework for active selection of point clouds. Motivated by optimizing the generalization risk, the derived strategy hierarchically selects point clouds that have non-redundant bounding box labels, latent gradients and geometric characteristics to mitigate the gap with the test set and minimize annotation costs.

Point-level features are extracted point-wise or by voxelization (Shi et al., 2020), with W, L, and F representing the width, length, and channels of the feature map. The feature map x is passed to a classifier f(·; w_f) parameterized by w_f and regression heads g(·; w_g) (e.g., box refinement and ROI regression) parameterized by w_g. The output of the model is the detected bounding boxes \hat{B} = {b̂_k} with the associated box labels \hat{Y} = {ŷ_k} from anchored areas. The loss functions ℓ_cls and ℓ_reg for classification (e.g., regularized cross entropy loss (Oberman & Calder, 2018)) and regression (e.g., mean absolute error/L1 regularization (Qi et al., 2020)) are assumed to be Lipschitz continuous. As shown in the left half of Figure 1, in an active learning pipeline, a small set of labeled point clouds D_L = {(P, B, Y)_i}_{i∈[m]} and a large pool of raw point clouds D_U = {P_j}_{j∈[n]} are provided at training time, with n and m being the total numbers of point clouds and m ≪ n. In each active learning round r ∈ [R], based on the criterion defined by an active learning policy, we select a subset of raw data {P_j}_{j∈[N_r]} from D_U and query the labels of 3D bounding boxes from an oracle Ω : P → B × Y to construct D_S = {(P, B, Y)_j}_{j∈[N_r]}. The 3D detection model is pre-trained with D_L for active selection, and then retrained with D_S ∪ D_L until the selected samples reach the final budget B, i.e., \sum_{r=1}^{R} N_r = B.
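The acquisition pipeline described here can be sketched as a generic loop; `detector`, `policy`, and `oracle` are placeholder interfaces rather than the released toolbox's API:

```python
def active_learning_loop(detector, labeled, unlabeled, oracle, policy,
                         rounds, n_per_round):
    """Generic active 3D detection loop, following the pipeline of Sec. 2.1.

    `policy(detector, unlabeled, n)` returns indices of point clouds to
    query and `oracle(cloud)` returns that cloud's (boxes, labels); both
    are illustrative callables standing in for the real components.
    """
    detector.fit(labeled)                      # pre-train on the seed set D_L
    for _ in range(rounds):                    # R acquisition rounds
        picks = policy(detector, unlabeled, n_per_round)
        for i in sorted(picks, reverse=True):  # pop from the back to keep indices valid
            cloud = unlabeled.pop(i)
            labeled.append((cloud, *oracle(cloud)))
        detector.fit(labeled)                  # retrain with D_S ∪ D_L
    return detector
```

In CRB, `policy` is the three-stage CLS → RPS → GPDB filter; any of the baseline strategies above can be dropped into the same slot.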

Figure 2: 3D and BEV mAP (%) of CRB and AL baselines on the KITTI and Waymo val split.

Figure 3: A case study of active 3D detection performance of RAND (bottom left) and CRB (bottom right) under a budget of 1,000 annotated bounding boxes. False positives (correct predictions) are highlighted in red (green) boxes. The orange box denotes a detection with low confidence.

We compare the time complexity of training and active selection for different active learning approaches, where n indicates the total number of unlabeled point clouds, N_r is the quantity selected, and E is the number of training epochs, with N_r ≪ n. We can clearly observe that, at the training stage, the complexity of all AL strategies is O(En), except for LLAL, which needs extra epochs E_l to train the loss prediction module. At the active selection stage, RAND randomly generates N_r indices to retrieve samples from the pool. CORESET computes pairwise distances between the embeddings of selected samples and unlabeled samples, which yields a time complexity of O(N_r n). BADGE iterates through the gradients of all unlabeled samples, passing them to the K-MEANS++ algorithm, with a complexity of O(N_r n) bounded by K-MEANS++. Given K_1, K_2 ≈ N_r, the time complexity of our method is O(n log n + 2N_r^2), with O(n log n) being the complexity of sorting the entropy scores in CLS and O(N_r^2) coming from the K_2-medoids and greedy search in RPS and GPDB. Note that, in our case, O(n log n + 2N_r^2) < O(N_r n). The complexity of simple ranking-based baselines is O(n log n) due to sorting the sample acquisition scores. Comparing our method with recent state-of-the-art approaches, LLAL has the highest training complexity, while BADGE and CORESET have the highest selection complexity. Unlike the existing baselines, the training and selection complexities of the proposed CRB are upper bounded by reasonable asymptotic growth rates.

The upper bound α_hi and lower bound α_lo of the uniform distribution are set to the 95% density interval, i.e., p(α_lo < ϕ(P, B_j) < α_hi) = 95% for every predicted bounding box j. Notably, the density of each bounding box is recorded during Stage 1, which incurs no extra computational overhead. The analysis of time complexity against other active learning methods is presented in Sec. 3.4.

Performance comparisons (3D AP scores) with generic AL and applied AL for detection on KITTI val set with 1% queried bounding boxes.

Ablative study of different active learning criteria on the KITTI val split. 3D and BEV AP scores (%) are reported when 1,000 bounding boxes are annotated.

Sensitivity to Bandwidth h. Figure 4 depicts the results of CRB with the bandwidth h varying in {3, 5, 7, 9}. Choosing the optimal bandwidth value h* avoids under-smoothing (h < h*) and over-smoothing (h > h*) in KDE. Except for h = 3, which yields a large variation, CRB with all other bandwidth values reaches similar detection results, within a 2% absolute difference in 3D mAP. This evidences that CRB is robust to different bandwidth values.

Sensitivity to Detector Architecture. We validate the sensitivity of performance to the choice of one-stage and two-stage detectors. Table 4 reports the results with the SECOND detection backbone on the KITTI dataset. With only 3% of the 3D bounding boxes queried, the proposed CRB approach consistently outperforms the state-of-the-art generic active learning approaches across a range of detection difficulties, improving 3D mAP and BEV mAP scores by 4.7% and 2.8%, respectively.

Performance comparisons on the KITTI val set w.r.t. varying thresholds K_1 and K_2 after two rounds of active selection (8% point clouds, 1% bounding boxes). Results are reported as 3D AP with 40 recall positions. † indicates the reported performance of the backbone trained with the fully labeled set (100%).

AL Results with one-stage 3D detector SECOND.

Complexity Analysis.

ACKNOWLEDGEMENT

This work was supported by Australian Research Council (CE200100025).

REPRODUCIBILITY STATEMENT

The source code of the developed active 3D detection toolbox is available in the supplementary material, which accommodates various AL approaches and one-stage and two-stage 3D detectors. We specify the settings of hyper-parameters, the training scheme and the implementation details of our model and AL baselines in Sec. B of the supplementary material. We show the proofs of Theorem 2.1 in Sec. C followed by the overview of the algorithm in Sec. D in the supplementary material. We repeat the experiments on the KITTI dataset 3 times with different initial labeled sets and show the standard deviation in plots and tables.

ETHICS STATEMENT

Our work may have a positive impact on communities to reduce the costs of annotation, computation, and carbon footprint. The high-performing AL strategy greatly enhances the feasibility and practicability of 3D detection in critical yet data-scarce fields such as medical imaging. We did not use crowdsourcing and did not conduct research with human subjects in our experiments. We cited the creators when using existing assets (e.g., code, data, models).

AVAILABILITY

https://github.com/Luoyadan/CRB

