DBQ-SSD: DYNAMIC BALL QUERY FOR EFFICIENT 3D OBJECT DETECTION

Abstract

Many point-based 3D detectors adopt point-feature sampling strategies to drop some points for efficient inference. These strategies are typically based on fixed, handcrafted rules, making it difficult to handle complicated scenes. Different from them, we propose a Dynamic Ball Query (DBQ) network to adaptively select a subset of input points according to the input features, and to assign each selected point a feature transform with a suitable receptive field. It can be embedded into state-of-the-art 3D detectors and trained in an end-to-end manner, significantly reducing the computational cost. Extensive experiments demonstrate that our method increases the inference speed by 30%-100% on the KITTI, Waymo, and ONCE datasets. Specifically, the inference speed of our detector reaches 162 FPS on the KITTI scene and 30 FPS on the Waymo and ONCE scenes without performance degradation. Because redundant points are skipped, some evaluation metrics even show significant improvements.

1. INTRODUCTION

3D object detection, a fundamental task in 3D scene understanding, has made significant progress. It aims to recognize and localize objects from point clouds and paves the way for real applications like autonomous driving (Geiger et al., 2012), robotic systems (Yang et al., 2020b), and augmented reality (Park et al., 2008). The structure of point clouds is sparse, unordered, and semantically deficient, making it difficult to encode point features like highly structured image features. To eliminate this barrier, voxel-based methods are proposed to organize the overall point cloud as neatly distributed voxels. Therefore, naive 3D convolution or its efficient variant, i.e., 3D sparse convolution (Yan et al., 2018), can be used to extract voxel features in an image-processing manner. Although voxel-based methods bring convenience to the processing of point clouds, they are prone to dropping detailed information and thus inescapably suffer from suboptimal performance. Another stream of methods are point-based methods (Yang et al., 2020b; Chen et al., 2022; Zhang et al., 2022) inspired by PointNet++ (Qi et al., 2017). They employ a series of operations, i.e., farthest point sampling (FPS), querying, and grouping, to directly extract informative features from the raw point cloud. However, this straightforward pipeline is cumbersome and costly. 3DSSD (Yang et al., 2020b) first proposes a single-stage, encoder-only architecture, which replaces the geometric Distance-based FPS (D-FPS) with the Feature-similarity-based FPS (F-FPS) to recall more foreground point features, and further discards feature propagation (FP) layers to reduce latency. Although the FP-layer overhead is eliminated, the F-FPS operations still account for a large share of the latency. IA-SSD (Zhang et al., 2022) further proposes a contextual centroid prediction module to replace F-FPS, which directly predicts the classification score of each point and adopts an efficient top-k operation to recall more foreground points and further cut computation overhead.
The current advanced designs are mainly credited to efficient foreground recall, but redundancy may remain in other parts, such as the background points or the network structure. In this paper, we conduct several empirical analyses of IA-SSD (Zhang et al., 2022) on two representative benchmarks, i.e., KITTI (Geiger et al., 2012) and Waymo (Sun et al., 2020), to uncover the full picture of its inference-speed bottleneck. We first measure the latency distribution over all detector modules to determine which parts are the main speed bottlenecks. We then count the ratio of background points and the scale distribution of all objects to further explore potential redundancy in the cumbersome modules. Our study reveals three valuable points: (1) the MLP network occupies over half of the latency; (2) tremendous spatial redundancy exists in the background point features that appear at every stage of the detector; (3) object sizes vary, so the fixed receptive fields of conventional multi-scale grouping (MSG) cannot all align with every object, leaving redundant MSG branches. These findings motivate us to build a more efficient, faster detector by reducing the redundant background points and blocking useless MSG branches. As shown in Fig. 2, we propose Dynamic Ball Query (DBQ) to replace the vanilla ball-querying module, where vanilla ball querying refers to the sampling technique proposed by PointNet++ (Qi et al., 2017). DBQ dynamically activates useful, compact point features and blocks redundant background point features for each branch of MSG. For each point feature sampled by FPS or by top-k classification score, we design a dynamic query multiplexer to determine which branches it goes through. Specifically, we apply a lightweight MLP network to the point features to predict N masks corresponding to the N branches of MSG, where each mask value is in {0, 1}, corresponding to the blocking and activating states.
The overall dynamic routing procedure is data-driven, so the point features are adaptively activated or blocked with a suitable combination among all receptive fields of MSG. Ultimately, we introduce a resource budget loss for DBQ to learn a trade-off between latency and performance. To verify the efficiency of our method, we conduct extensive experiments on three typical datasets, i.e., KITTI (Geiger et al., 2012), Waymo (Sun et al., 2020), and ONCE (Mao et al., 2021b). Our Dynamic Ball Query enables the 3D detector to cut the latency from 9.85 ms (102 FPS) to 6.172 ms (162 FPS) on the KITTI scene, and speeds up inference from 20 FPS to 30 FPS on the Waymo and ONCE scenes, while keeping comparable performance. In particular, some evaluation metrics show significant improvements. Our work focuses on dynamically dropping redundant background point features for single-stage point-based detectors, which is rarely researched in previous works, and endows the detector with very fast inference.

2.2. EFFICIENT 3D POINT-BASED DETECTORS

Point-based methods need to process large-scale raw point features, which requires building cumbersome models and incurs expensive computation costs. Therefore, several works (Yang et al.,

3. ANALYSIS AND MOTIVATION

To explore what hampers faster 3D detection, we conduct several experiments on both the KITTI val and Waymo val sets with the state-of-the-art point-based 3D detector IA-SSD (Zhang et al., 2022). To establish a strong baseline, we split the point cloud into four parallel parts to speed up the first FPS operation. We first measure the latency of each detector module. As shown in Fig. 1 (a), the MLP network occupies the major part of the overall time overhead, i.e., 6.44 ms (65.4%) and 28.56 ms (56.5%) on the KITTI and Waymo scenes, respectively. Therefore, optimizing the cumbersome and costly MLP network is a top priority for building an efficient detector. Decreasing the input scale of points or the number of network parameters is a natural, empirical way to reduce latency, but such policies are prone to damaging performance. To avoid both notorious drawbacks, we analyze the redundancy of background points in each backbone stage. As shown in Fig. 1 (b), background points (point features) account for over 70% of the points at every network level. This phenomenon reveals significant redundancy in background points, which may be discarded to speed up the detection procedure. Going one step further, we point out that the conventional multi-scale grouping (MSG) operation (Qi et al., 2017) of the set abstraction (SA) layer may also be redundant. As shown in Fig. 1 (c), the size of each object varies in both the KITTI and Waymo scenes. Therefore, the fixed receptive fields of MSG may not entirely align with the sizes of objects; in this regime, some grouping branches of MSG are useless. This valuable observation motivates us to choose the optimal grouping branches, which further improves the detector's efficiency.

4. DYNAMIC BALL QUERYING

Our DBQ-SSD framework is established on the efficient IA-SSD framework, which adopts set abstraction (SA) layers to encode point features. To achieve a better balance between effectiveness and efficiency, we introduce the dynamic network mechanism into the IA-SSD framework. Specifically, we propose Dynamic Ball Querying (DBQ) to replace the vanilla ball querying in each set abstraction layer, as shown in Fig. 2. It is able to adaptively select a subset of input points as queries and extract local features within a suitable spatial radius. Dynamic ball querying is a generic module that can be easily embedded into the encoder blocks of mainstream point-based 3D detection frameworks. We denote an input point sequence as x_c ∈ R^{N×3} with corresponding features x_f ∈ R^{N×C}, where N denotes the length of the sequence and C the number of feature channels. Besides, the coordinate sequence of the input queries is denoted as q_c ∈ R^{N_q×3}, where N_q is the number of queries.

4.1. INFERENCE

For an input query i ∈ {1, 2, ..., N_q} with coordinate q_c(i), we first obtain the corresponding query feature according to its coordinate, which serves as the basis for the dynamic decisions of the query multiplexer. Specifically, we adopt the nearest-sampling technique to obtain its representation from the features of the input points. Although sampling with more points, e.g., top-k sampling or ball gathering, can lead to a slight performance gain, it causes a significant decrease in efficiency. Top-k sampling chooses a set of points with the smallest distances under a specific metric, while ball gathering samples and concatenates features according to the coordinates of points. Previous variants of ball querying employ heuristic, hand-designed rules; in contrast, the routing process of our query multiplexer is performed in a data-dependent way. To achieve this, we aggregate input features for each query from its nearest input point and predict the gating logits by a linear projection:

h(i) = x_f(argmin_j ||q_c(i) − x_c(j)||) W + b ∈ R^{1×K},

where W ∈ R^{C×K} and b ∈ R^{1×K} denote the weight and bias, respectively. Moreover, the binary gating mask for the i-th query and the k-th group is generated by quantizing the gating logits:

m(i, k) = step(h(i, k)), where step(x) = 1 if x ≥ 0, and 0 otherwise.

The gating masks control which ball-querying group is enabled, i.e., a group with a positive mask value is enabled and vice versa. Based on this, we can adaptively reduce the number of queries and obtain the coordinates of the sparse queries for each group:

q̃_c(k) = {q_c(i) | m(i, k) ≠ 0, ∀i ∈ {1, 2, ..., N_q}}, ∀k ∈ {1, 2, ..., K}.

Furthermore, the sparse coordinates guide the vanilla ball querying with the corresponding settings to generate the sparse query features.
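The query-multiplexer steps above (nearest sampling, linear gating, hard step, sparse selection) can be sketched in a few lines. This is a minimal NumPy illustration with hypothetical names; the paper's actual implementation (presumably batched PyTorch/CUDA) is not released here, so treat this as an assumption-laden sketch of the math, not the authors' code.

```python
import numpy as np

def gating_masks(q_c, x_c, x_f, W, b):
    """Query multiplexer sketch: nearest sampling + linear gating + hard step.

    q_c: (Nq, 3) query coords; x_c: (N, 3) input coords; x_f: (N, C) features.
    W: (C, K) and b: (K,) form the gating projection. Returns masks of shape (Nq, K).
    """
    # Nearest sampling: each query takes the feature of its closest input point.
    d = np.linalg.norm(q_c[:, None, :] - x_c[None, :, :], axis=-1)  # (Nq, N)
    nearest = d.argmin(axis=1)
    h = x_f[nearest] @ W + b            # gating logits h(i) in R^{Nq x K}
    return (h >= 0).astype(np.float64)  # m(i, k) = step(h(i, k))

def sparse_queries(q_c, m, k):
    """Coordinates of the queries enabled for group k, i.e. the sparse set for group k."""
    return q_c[m[:, k] > 0]
```

Note that the argmin-based nearest sampling is quadratic in the number of points here; a real implementation would use a spatial index or a fused kernel.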
Following the conventional protocols of PointNet-like methods, the sparse query features are then transformed by the predefined MLP layers and a max-pooling operator:

ẑ(k) = MaxPool(MLP_k(VanillaBallQuery(q̃_c(k); x_c, x_f, r_k, φ_k))).

To fuse the sparse transformed features of different groups, we remap them into the dense form according to the gating mask. The remap operator is similar to an unpooling process: it projects each enabled feature back to its original position and fills the disabled positions with zeros. The output features of the SA layer are then blended by summation:

y_f = Σ_{k∈{1,...,K}} z(k), where z(k) = Aggregation_k(Remap(ẑ(k); h)),

where Aggregation_k is a linear layer that transforms the features of the different groups into the same dimension.
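The remap-and-sum fusion above can be sketched as follows. This is an illustrative NumPy version under stated assumptions: the per-group Aggregation_k linear layer is omitted (features are assumed to already share a dimension), and the gating masks stand in for the logits when locating enabled positions.

```python
import numpy as np

def remap_and_blend(sparse_feats, masks, n_queries, n_channels):
    """Scatter each group's sparse features back to dense positions, then sum.

    sparse_feats[k]: (n_k, C) features for the n_k queries enabled in group k.
    masks: (Nq, K) binary gating masks. Returns the blended output y_f: (Nq, C).
    """
    y = np.zeros((n_queries, n_channels))
    for k, z_k in enumerate(sparse_feats):
        enabled = np.flatnonzero(masks[:, k])  # positions enabled for group k
        dense = np.zeros((n_queries, n_channels))
        dense[enabled] = z_k                   # remap: unpooling-like scatter, zeros elsewhere
        y += dense                             # blend the groups by summation
    return y
```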

4.2. TRAINING

Since the sparse selection in Eq. 3 is non-differentiable, it is nontrivial to train dynamic ball querying fully end-to-end. To achieve this, we replace the deterministic decisions in Eq. 3 with a stochastic sampling process. Concretely, we treat the gating logits as unnormalized log probabilities under the Bernoulli distribution. With noise samples g and g′ drawn from a standard Gumbel distribution (Gumbel, 1954), a discrete gating mask can be sampled:

m(i, k) = step(h(i, k) + g − g′), where g, g′ ∼ Gumbel(0, 1).

To enable end-to-end training, motivated by previous dynamic networks, we use the Gumbel-Sigmoid technique to give a differentiable approximation of Eq. 6 by replacing the hard step function with the soft sigmoid function. The likelihood of the i-th query in the k-th group being selected is:

π(i, k) = exp((h(i, k) + g)/τ) / [exp((h(i, k) + g)/τ) + exp(g′/τ)] ∈ [0, 1],

where τ is the temperature coefficient. In the training phase, we use Eq. 6 as the gating mask to select the sparse queries and employ a straight-through estimator (Bengio et al., 2013; Verelst & Tuytelaars, 2020) to obtain the gradients of the gating logits:

y_f(i) = Σ_{k=1}^{K} z(i, k) in the forward pass, and y_f(i) = Σ_{k=1}^{K} π(i, k) · z(i, k) in the backward pass.

The latency ratio of all the SA layers with dynamic ball querying (Sec. 4.3) is computed as:

Ψ = [Σ_l Σ_k Ψ_{l,k}(Σ_i m_l(i, k))] / [Σ_l Σ_k Ψ_{l,k}(N_q^l)],

where Ψ_{l,k} indicates the latency map of the k-th group in the l-th layer. Finally, the latency is constrained through a budget loss based on the Euclidean distance, so the total loss is:

L = L_tasks + λ L_budget, where L_budget = |Ψ − γ|.

For simplicity, the target latency budget γ is set to 0 in all experiments. The hyper-parameter λ scales the budget loss. In addition, if the input is batched, Ψ is averaged along the batch dimension to estimate the average overhead of the network.

Evaluation metrics. For the KITTI scene, we report the performance of all classes using the average precision (AP) metric.
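Returning to the training-time gate of Sec. 4.2, the Gumbel sampling and its sigmoid relaxation can be sketched numerically. This is an illustrative NumPy version (names and the `rng` parameter are our assumptions); in an autodiff framework the two outputs would be combined straight-through style, e.g. `m + (pi - stop_gradient(pi))`, so the forward value is the hard mask while gradients flow through the soft probability.

```python
import numpy as np

def gumbel_sigmoid_gate(h, tau=1.0, rng=None):
    """Training-time gate: hard Gumbel mask (forward) and soft probability pi (backward).

    h: gating logits of shape (Nq, K); tau: temperature. Returns (m, pi).
    """
    rng = rng if rng is not None else np.random.default_rng()
    u1 = rng.uniform(1e-9, 1.0, size=h.shape)
    u2 = rng.uniform(1e-9, 1.0, size=h.shape)
    g, g2 = -np.log(-np.log(u1)), -np.log(-np.log(u2))  # Gumbel(0, 1) noise
    m = (h + g - g2 >= 0).astype(np.float64)            # Eq. 6: hard stochastic mask
    # Gumbel-Sigmoid relaxation: selection likelihood pi(i, k) in [0, 1]
    pi = np.exp((h + g) / tau) / (np.exp((h + g) / tau) + np.exp(g2 / tau))
    return m, pi
```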
Following most state-of-the-art methods, we adopt IoU thresholds of 0.7, 0.5, and 0.5 for Car, Pedestrian, and Cyclist, respectively. In addition, three levels of difficulty ("Easy", "Moderate", and "Hard") are also reported. For Waymo, we use the official metrics, Average Precision (AP) and Average Precision weighted by Heading (APH), and report the performance on the LEVEL 1 (L1) and LEVEL 2 (L2) difficulty levels.

Dynamic vs. Static To verify the efficiency of our method, we first report the performance and latency on the KITTI scene. As shown in Fig. 3 (a), our detector achieves superior speed while maintaining comparable performance with other detectors. When λ is set to 0.1, the performance surpasses the efficient baseline; we therefore set the scale parameter to 0.1 for all experiments by default. As the supervision of the resource budget loss increases, the speed further improves to 223 FPS. These impressive results show that our method endows the point-based 3D detector with efficient detection capability. Going one step further, we report the latency reduction of the two dominant modules, i.e., the query & grouping operation and the MLP network. As shown in Fig. 3 (b), our method cuts considerable overhead of both in different SA layers, verifying that adaptively turning off useless points via DBQ speeds up the computation of both modules.

Implementation Details

Activation and Blocking Points To explore which points are activated or blocked, we conduct quantitative and qualitative experiments using statistics and visualization, respectively. Fig. 3 (c) counts the activation ratios of point features in different layers, indicating that a considerably large portion of points is discarded. Next, we turn to the visualization results to explain these quantitative results. As shown in Fig. 4, the deeper the network goes, the larger the ratio of blocked points. This is because deeper layers learn richer semantic information for distinguishing useful foreground points from redundant background points. In the early layers, high-level semantic knowledge is difficult to extract, so the model can hardly decide to block background points. In general, the results clearly show that most activated points (red points) in the deeper layers are distributed around objects, while points away from the objects are blocked. This reveals that our method discards most redundant background points, which are useless for localizing and classifying objects, making the speedup of inference well founded. The remaining foreground points and their surrounding points preserve the structure of objects and enrich the context information; therefore, our method does not damage detection performance. This echoes our analysis in Sec. 3.

Routing Manner and Branch Redundancy Tab. 1 compares the effectiveness of the layer-wise and point-wise routing manners. During inference, the layer-wise routing manner only speeds up by 18 FPS with small performance gains, while our point-wise manner not only achieves a significant performance improvement but also reduces considerable latency. In addition, contrasting predicting a separate mask for each group of MSG against sharing an identical mask across groups indicates that the former is the optimal policy.
Generating two masks for the different scales of groups allows each point to reach an optimal combination of receptive fields. The corresponding statistical and visualization results in Fig. 3 (d) and Fig. 4 reveal that more points go through both groups in the early layers, while in deeper layers most points keep only a single branch or are even killed entirely. This phenomenon agrees well with the empirical analysis in Sec. 3.

Main Results

As illustrated in Tab. 2, our method outperforms the efficient baseline on all categories while achieving a higher inference speed (162 FPS vs. 102 FPS). This verifies that our method not only drops redundant points to speed up inference but also extracts more useful information for localization and classification. Compared with other state-of-the-art methods, our detector surpasses their speed by a large margin while maintaining comparable accuracy.

5.3. EVALUATION ON WAYMO DATASET

To verify the generalization of our method, we further evaluate its performance on the Waymo (Sun et al., 2020) dataset. Because the Waymo scene consists of a 360-degree point cloud whose scale is larger than the KITTI scene, we increase the number of sampled input points from 16,384 to 65,536. The batch size is set to 2 for each GPU, and we train for 30 epochs with 8 GPUs. Other settings are the same as in the KITTI experiments. As shown in Tab. 3, we report the performance and inference speed of our DBQ-SSD. With suitable supervision (i.e., λ = 0.05), the detector achieves impressive performance in all classes; some categories even achieve state-of-the-art accuracy on certain evaluation metrics or difficulty levels. Meanwhile, it improves the inference speed of the efficient baseline from 20 FPS to 27 FPS. These results show that adaptively blocking redundant point features while activating high-quality ones is the key to making our detector efficient. With stronger supervision, DBQ-SSD is capable of detecting objects at real-time speed (30 FPS) with comparable accuracy, providing a flexible configuration for practical applications to realize the best trade-off between accuracy and overhead cost.

6. CONCLUSION

In this paper, we point out the spatial redundancy of background points and the useless receptive-field groups in MSG. This redundancy impedes further inference-efficiency improvements for point-based 3D detectors.

D VISUALIZATION

As shown in Fig. 5, Fig. 6, and Fig. 7, we provide detailed visualizations of the predicted results on the Waymo val and KITTI val sets. The conclusions match those drawn on the KITTI scene: as the network depth increases, the foreground points are retained for classification and regression, while redundant background points are dropped. This reveals that our method can adaptively discard useless points to speed up inference. Notably, the discarding behavior differs significantly between the KITTI and Waymo scenes, which verifies that our method generalizes well.



https://github.com/open-mmlab/OpenPCDet



2.1. 3D DETECTORS

The task of 3D detection is to predict 3D bounding boxes and class labels for each object in a point cloud scene. Detection algorithms can be split into voxel-based (Zhou & Tuzel, 2018; Yan et al., 2018; Lang et al., 2019; He et al., 2020) and point-based (Shi et al., 2019; Yang et al., 2020b; Chen et al., 2022; Zhang et al., 2022) methods. Voxel-based methods convert the point cloud into regular voxels or pillars, making it natural to apply 3D convolution or its sparse variant (Yan et al., 2018) for feature extraction. This regular partition may lose detailed information, so point-based methods are proposed to process the raw point cloud directly. Inspired by PointNet++ (Qi et al., 2017) and Faster R-CNN (Ren et al., 2015), PointRCNN (Shi et al., 2019) employs SA and FP layers to extract per-point features, designs a region proposal network (RPN) to produce proposals, and utilizes an extra stage to predict bounding boxes and class labels. In addition, PV-RCNN (Shi et al., 2020a) integrates voxel and point features into the RPN to generate higher-quality proposals. Pyramid R-CNN (Mao et al., 2021a) introduces a pyramid RoI head with learnable radii to boost accuracy at the expense of latency. In contrast, our method aims at more efficient inference by adaptively selecting the network branches for each input point. Other efforts resemble single-stage 2D detectors (Lin et al., 2017; Tian et al., 2019; Wang et al., 2021; Song et al., 2019b; Zhang et al., 2019): 3DSSD (Yang et al., 2020b) and IA-SSD (Zhang et al., 2022) discard the region proposal network and use an encoder-only architecture to localize 3D objects.

Figure 1: Statistics of latency, background ratio, and size distribution on both the KITTI val (Geiger et al., 2012) and Waymo val (Sun et al., 2020) sets. (a) reveals that the MLP network occupies the largest share of latency; "Q & G" means the query and grouping operation. (b) reflects that redundant background points significantly dominate the input points of each stage. (c) shows the distribution of varying object sizes (measuring in 3

Figure 2: The pipeline of dynamic ball query in a set abstraction layer. 'NS' indicates the nearest sampling, which samples the query features from the input features. The query multiplexer generates gating masks to adaptively select a subset of input queries for each group. The remap operator is used to map the sparse features to the dense form.

Dataset We evaluate our detector on two representative datasets: KITTI dataset (Geiger et al., 2012) and Waymo dataset (Sun et al., 2020). KITTI dataset includes 7,481 training point clouds/images and 7,518 test point clouds/images. The KITTI scene contains three classes, i.e., Car, Pedestrian, and Cyclist. Waymo scene contains 798 training, 202 validation, and 150 testing sequences with three classes of Vehicle, Pedestrian, and Cyclist. Each sequence includes nearly 200 frames with a 360-degree lidar point cloud.

Figure 3: Illustration of the effects of Dynamic Ball Query. All experiments are evaluated on the KITTI val set. λ is the scale parameter of the resource budget loss in Eq. 10. Latency is evaluated on a single RTX2080Ti GPU with a batch size of 16. (a) reports the comparison of both the accuracy on the Car class and the overall latency distribution. (b) indicates the latency reduction of the query & grouping operation and the MLP network in different SA layers. (c) reflects the activation distribution of point features in different SA layers. (d) shows the proportion of point features going through different groups of MSG. "Small" and "Large" mean activating the group branches with small and large radii, respectively. "Kill" represents blocking all groups, while "Small & Large" means going through all scales of groups.

Figure 4: Visualization results on the KITTI val set. The 3D boxes in the figures are prediction boxes. Green, cyan, and yellow represent Car, Pedestrian, and Cyclist. Red and white points represent activated and blocked points, respectively. "Small" and "Large" denote the scale of the group in MSG, and the number in parentheses is the index of the SA layer.


For efficiency, most previous point-based methods (Yang et al., 2020b; Zhang et al., 2022) use Farthest Point Sampling (FPS) or its variants (e.g., dividing the point cloud into multiple parallel parts to reduce computational complexity) to generate the query points. For each query, the vanilla ball querying samples a predefined number of local points within a specific spatial radius. Since different object instances require different receptive fields, within a set abstraction layer the previous 3D detectors (Yang et al., 2020b; Zhang et al., 2022) typically adopt multiple groups of vanilla ball querying with different radii to increase feature diversity. Specifically, for one set abstraction layer, we define the set of predefined radii as R = {r_k}_{k=1}^{K} and the numbers of sampling points as Φ = {φ_k}_{k=1}^{K}, where K indicates the number of groups. Based on these, we establish a set of K vanilla ball-querying blocks. As shown in Fig. 2, our dynamic ball querying is made up of a query multiplexer and a set of vanilla ball-querying blocks. Similar to the gate-based dynamic networks (Song et al., 2021; Wang et al., 2018; Li et al., 2020b), the query multiplexer adopts a fine-grained routing process to select a suitable combination of vanilla ball-querying blocks for each query.
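The vanilla ball querying referenced above can be sketched as follows. This is a simplified NumPy illustration (names are ours): real PointNet++-style kernels run on the GPU and handle the empty-ball case differently; here we fall back to the single nearest point as a stated simplification.

```python
import numpy as np

def vanilla_ball_query(q_c, x_c, radius, n_sample):
    """For each query, gather indices of up to n_sample input points within `radius`.

    q_c: (Nq, 3) query coords; x_c: (N, 3) input coords.
    Groups are padded to a fixed size by cyclically repeating the found indices.
    """
    d = np.linalg.norm(q_c[:, None, :] - x_c[None, :, :], axis=-1)  # (Nq, N)
    groups = []
    for row in d:
        idx = np.flatnonzero(row <= radius)
        if idx.size == 0:                  # empty ball: simplification,
            idx = np.array([row.argmin()])  # fall back to the nearest point
        groups.append(np.resize(idx, n_sample))  # cyclic repeat to fixed size
    return np.stack(groups)                # (Nq, n_sample) index array
```

Running K such blocks with radii r_1 < ... < r_K and sample counts φ_k yields the multi-scale groups whose branches DBQ later enables or blocks per query.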

4.3. LATENCY CONSTRAINT

Without a latency constraint, dynamic ball querying typically enables more queries for each group to obtain high accuracy. To achieve a better balance between effectiveness and efficiency, we introduce a latency constraint as a training target to reduce the inference time. Unlike the computational-complexity proxies employed in many previous dynamic networks (Song et al., 2021; Wang et al., 2018; Li et al., 2020b), latency represents the actual runtime on specific devices. To this end, we first establish a latency map for each group in each SA layer, which records the latency with regard to the number of queries. Based on this, we can calculate the latency ratio Ψ of all the SA layers with dynamic ball querying (Eq. 8).
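A minimal sketch of this latency-ratio computation, assuming the profiled latency maps are available as lookup callables (our assumption; the paper presumably stores them as measured tables per device):

```python
import numpy as np

def latency_ratio(latency_maps, masks_per_layer, nq_per_layer):
    """Psi: latency at the current sparse query counts over latency at full counts.

    latency_maps[l][k] maps a query count to the measured latency of group k
    in SA layer l. masks_per_layer[l]: (Nq_l, K_l) binary gating masks.
    """
    used = total = 0.0
    for l, masks in enumerate(masks_per_layer):
        for k in range(masks.shape[1]):
            used += latency_maps[l][k](int(masks[:, k].sum()))   # enabled queries
            total += latency_maps[l][k](nq_per_layer[l])         # all queries
    return used / total

def budget_loss(psi, gamma=0.0):
    """Absolute (Euclidean) distance between the latency ratio and the target budget."""
    return abs(psi - gamma)
```

With γ = 0, the loss simply pushes Ψ down, and λ sets how aggressively latency is traded against task accuracy.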

Following the stream of single-stage point-based methods, we use an encoder-only architecture like (Yang et al., 2020b; Chen et al., 2022; Zhang et al., 2022). Specifically, we split the point features into four parallel fan parts to speed up D-FPS in the first sampling layer, which serves as the "Efficient Baseline". Other sampling layers follow the default setting of (Zhang et al., 2022).

Table 1: Performance of dynamic gating with different routing manners on the KITTI val set. The scale parameter λ is set to 0.1. "Layer" indicates controlling an entire SA layer; "Point" indicates point-wise routing instead of layer-wise routing; "Share" means sharing masks across all groups.

Table 2: Comparison with the state-of-the-art methods on the KITTI test set. Bold font indicates the best performance. The speed is tested on a single GPU with a batch size of 16 and measured in FPS.

Table 3: Comparison with the state-of-the-art methods on the Waymo val set. Bold font indicates the best performance. The speed is tested on a single GPU with a batch size of 16 and measured in FPS.

To eliminate this dilemma, we propose dynamic ball query, which dynamically generates gating masks for each group of MSG to process useful points and block redundant background points. Extensive experiments support our analysis and show the effectiveness of our method. In short, we offer a new view that focuses on the redundant background points instead of the limited foreground part, which further deepens the understanding of the sparsity of point clouds. We hope this work sheds light on the research of efficient point-based models and inspires future work.

Table 5: Comparison with the state-of-the-art methods on the KITTI val set. The average precision is measured with 11 recall positions. Bold font indicates the best performance. The speed is tested on a single GPU with a batch size of 16 and measured in FPS.

Table 6: Comparison with the state-of-the-art methods on the ONCE val set. Bold font indicates the best performance. The speed is tested on a single GPU with a batch size of 16 and measured in FPS.

A LIMITATION AND FUTURE WORK

As the supervision on the resource budget increases (i.e., a larger λ), the performance decreases accordingly. We suspect that dropping too many points may also eliminate part of the useful point cloud. Therefore, this paper targets a better trade-off between accuracy and inference speed: maintaining, or even gaining, accuracy while significantly speeding up inference. We cannot specify exactly which points to drop, but our detector is equipped with a strong ability to eliminate redundancy. We hope to inspire future work on dropping more redundant points without performance degradation.

C EXPERIMENTS ON KITTI val AND ONCE val SETS

To verify generalization, we evaluate our method on both the KITTI val set and the ONCE val set.

KITTI val set. As shown in Tab. 5, our DBQ-SSD achieves performance comparable to IA-SSD while running at nearly twice its inference speed.

ONCE val set. Because the official configuration file of IA-SSD for the ONCE dataset is not released, we reproduce the results according to the paper. As shown in Tab. 6, our method significantly improves the inference speed to 33 FPS (a 2.4x speedup) while maintaining performance comparable to IA-SSD. When adjusting γ to 0.1, our method achieves nearly 1 mAP of performance improvement while gaining a 1.7x speedup.


Figure 5: Visualization results on the Waymo val set. The red and green 3D boxes in the figures are ground-truth and prediction boxes. Green, cyan, and yellow represent Car, Pedestrian, and Cyclist. Red and white points represent activated and blocked points, respectively. "Small" and "Large" denote the scale of the group in MSG, and the number in parentheses is the index of the SA layer.

