DBQ-SSD: DYNAMIC BALL QUERY FOR EFFICIENT 3D OBJECT DETECTION

Abstract

Many point-based 3D detectors adopt point-feature sampling strategies to drop some points for efficient inference. These strategies are typically based on fixed and handcrafted rules, making it difficult to handle complicated scenes. Different from them, we propose a Dynamic Ball Query (DBQ) network to adaptively select a subset of input points according to the input features, and to assign each selected point a feature transform with a suitable receptive field. It can be embedded into state-of-the-art 3D detectors and trained in an end-to-end manner, which significantly reduces the computational cost. Extensive experiments demonstrate that our method increases the inference speed by 30%-100% on the KITTI, Waymo, and ONCE datasets. Specifically, our detector reaches 162 FPS on the KITTI scene and 30 FPS on the Waymo and ONCE scenes without performance degradation. Owing to skipping redundant points, some evaluation metrics even show significant improvements.

1. INTRODUCTION

3D object detection, a fundamental task in 3D scene understanding, has made significant progress. It aims to recognize and localize objects from point clouds and paves the way for real applications such as autonomous driving (Geiger et al., 2012), robotic systems (Yang et al., 2020b), and augmented reality (Park et al., 2008). The structure of point clouds is sparse, unordered, and semantically deficient, making it difficult to encode point features in the way highly structured image features are encoded. To eliminate this barrier, voxel-based methods organize the overall point cloud into neatly distributed voxels, so that naive 3D convolution or its efficient variant, i.e., 3D sparse convolution (Yan et al., 2018), can extract voxel features in an image-processing-like manner. Although voxel-based methods simplify the processing of point clouds, they are prone to dropping detailed information, inevitably leading to suboptimal performance. Another stream of methods is point-based (Yang et al., 2020b; Chen et al., 2022; Zhang et al., 2022), inspired by PointNet++ (Qi et al., 2017). They employ a series of operations, i.e., farthest point sampling (FPS), querying, and grouping, to directly extract informative features from the raw point cloud. However, this straightforward pipeline is cumbersome and costly. 3DSSD (Yang et al., 2020b) first proposes a single-stage, encoder-only architecture, which replaces the geometric Distance-based FPS (D-FPS) with Feature similarity-based FPS (F-FPS) to recall more foreground point features, and further discards the feature propagation (FP) layers to reduce latency. Although the FP-layer overhead is eliminated, the F-FPS operation still accounts for a large share of the latency.
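To make the sampling step concrete, the following is a minimal sketch of the distance-based FPS (D-FPS) described above, as defined in PointNet++; the function name and array shapes are our own illustrative choices, not taken from any released implementation:

```python
import numpy as np

def farthest_point_sample(points, n_samples):
    """Distance-based FPS (D-FPS): iteratively pick the point farthest
    from the already-selected set. points: (N, 3) array of xyz coords."""
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # distance from each point to its nearest already-selected point
    min_dist = np.full(n, np.inf)
    farthest = 0  # start from an arbitrary point
    for i in range(n_samples):
        selected[i] = farthest
        d = np.sum((points - points[farthest]) ** 2, axis=1)
        min_dist = np.minimum(min_dist, d)
        farthest = int(np.argmax(min_dist))
    return selected
```

F-FPS follows the same greedy loop but measures distance in feature space (or a weighted sum of feature and geometric distance) rather than purely in xyz space, which is why it recalls more foreground points at a comparable cost.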
IA-SSD (Zhang et al., 2022) further proposes a contextual centroid prediction module to replace F-FPS: it directly predicts a classification score for each point and adopts an efficient top-k operation to recall more foreground points, further cutting the computation overhead. These advanced designs mainly attribute their gains to efficient foreground recall, but redundancy may still remain in other parts, such as the background points or the network structure. In this paper, we conduct several empirical analyses of IA-SSD (Zhang et al., 2022) on two representative benchmarks, i.e., KITTI (Geiger et al., 2012) and Waymo (Sun et al., 2020), to uncover the full picture of its inference-speed bottleneck. We first measure the latency distribution across all detector modules to identify the main speed bottlenecks. We then count the ratio of background points and the scale distribution of all objects to explore potential redundancy in the cumbersome modules. Our study reveals three valuable points: (1) The MLP networks occupy over half of the latency. (2) Tremendous spatial redundancy exists in the background point features at every stage of the detector. (3) Object sizes vary, so the fixed receptive fields of conventional multi-scale grouping (MSG) cannot align with every object, which causes branch redundancy in MSG. These findings motivate us to build a faster, more efficient detector by reducing redundant background points and blocking useless MSG branches. As shown in Fig. 2, we propose Dynamic Ball Query (DBQ) to replace the vanilla ball query module, where vanilla ball query refers to the sampling technique proposed by PointNet++ (Qi et al., 2017). DBQ dynamically activates useful and compact point features, and blocks redundant background point features for each branch of MSG.
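For reference, the vanilla ball query that DBQ replaces can be sketched as follows: for each sampled centroid, it gathers up to k neighbour indices within a fixed radius (one radius per MSG branch). This is a simplified, brute-force illustration of the PointNet++ operation, with names and the padding convention chosen by us for clarity:

```python
import numpy as np

def ball_query(points, centroids, radius, k):
    """Vanilla ball query (PointNet++ style): for each centroid, collect
    up to k neighbour indices within `radius`, padding short groups by
    repeating the first neighbour found (0 if the ball is empty)."""
    groups = []
    for c in centroids:
        d = np.linalg.norm(points - c, axis=1)
        neigh = np.nonzero(d < radius)[0][:k]
        # pad so every group has exactly k indices
        pad = np.full(k, neigh[0] if len(neigh) else 0, dtype=np.int64)
        pad[:len(neigh)] = neigh
        groups.append(pad)
    return np.stack(groups)
```

In MSG, this query is run once per branch with a different radius and k, and every centroid goes through every branch; it is exactly this all-branches-for-all-points behaviour that DBQ makes conditional.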
For each point feature sampled by FPS or by top-k classification score, we design a dynamic query multiplexer to determine which branches it goes through. Specifically, we apply a lightweight MLP to each point feature to predict N masks corresponding to the N branches of MSG, where each mask value in {0, 1} denotes the blocking or activating state. The overall dynamic routing procedure is data-driven, so point features are adaptively activated or blocked with a suitable combination of the MSG receptive fields. Finally, we introduce a resource budget loss for DBQ to learn a trade-off between latency and performance. To verify the efficiency of our method, we conduct extensive experiments on three typical datasets, i.e., KITTI (Geiger et al., 2012), Waymo (Sun et al., 2020), and ONCE (Mao et al., 2021b). Our Dynamic Ball Query enables the 3D detector to cut the latency from 9.85 ms (102 FPS) to 6.172 ms (162 FPS) on the KITTI scene and to speed up inference from 20 FPS to 30 FPS on the Waymo and ONCE scenes while keeping comparable performance. In particular, some evaluation metrics show significant improvements.
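The routing idea can be sketched as below. This is our own toy approximation under stated assumptions: a single linear layer stands in for the lightweight gating MLP, the inference-time mask is simply logit > 0 (training would use a differentiable relaxation such as Gumbel-softmax), and the budget term penalizes deviation of the per-branch activation ratio from a target budget t; none of these specific choices are claimed to match the paper's exact formulation:

```python
import numpy as np

rng = np.random.default_rng(0)

def mlp_gate(feats, w, b):
    """Hypothetical lightweight gate: one linear layer per point feature
    producing one logit per MSG branch; inference mask = (logit > 0)."""
    logits = feats @ w + b               # (P, N_branches)
    return (logits > 0).astype(np.float32)

# toy setting: 5 sampled points, 8-dim features, 3 MSG branches (3 radii)
feats = rng.standard_normal((5, 8)).astype(np.float32)
w = rng.standard_normal((8, 3)).astype(np.float32)
b = np.zeros(3, dtype=np.float32)

mask = mlp_gate(feats, w, b)             # (5, 3) binary mask in {0, 1}
# a branch is only evaluated for points whose mask entry is 1

# illustrative resource budget term: keep each branch's activation
# ratio close to a target budget t (e.g. t = 0.5)
t = 0.5
budget_loss = float(np.mean((mask.mean(axis=0) - t) ** 2))
```

At inference, a blocked point simply skips the ball query and grouped MLP of that branch, which is where the latency savings come from.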

2.1. 3D DETECTORS

The task of 3D detection is to predict 3D bounding boxes and class labels for each object in a point cloud scene. Detection algorithms can be split into voxel-based (Zhou & Tuzel, 2018; Yan et al., 2018; Lang et al., 2019; He et al., 2020) and point-based (Shi et al., 2019; Yang et al., 2020b; Chen et al., 2022; Zhang et al., 2022) methods. Voxel-based methods convert the point cloud into regular voxels or pillars, making it natural to apply 3D convolution or its sparse variant (Yan et al., 2018) for feature extraction. This regular partitioning may lose detailed information, so point-based methods are proposed to directly process the raw point cloud. Inspired by PointNet++ (Qi et al., 2017) and Faster R-CNN (Ren et al., 2015), PointRCNN (Shi et al., 2019) employs SA and FP layers to extract features for each point, designs a region proposal network (RPN) to produce proposals, and utilizes an extra refinement stage to predict bounding boxes and class labels. In addition, PV-RCNN (Shi et al., 2020a) integrates voxel and point features into the RPN to generate higher-quality proposals. Pyramid R-CNN (Mao et al., 2021a) introduces a pyramid RoI head with learnable radii to boost accuracy at the expense of latency overhead. In contrast, our method aims at more efficient inference by adaptively selecting the network branches for each input point. Other efforts are similar to single-stage 2D detectors (Lin et al., 2017; Tian et al., 2019; Wang et al., 2021; Song et al., 2019b; Zhang et al., 2019): 3DSSD (Yang et al., 2020b) and IA-SSD (Zhang et al., 2022) discard the region proposal network and use an encoder-only architecture to localize 3D objects. Our work focuses on dynamically dropping redundant background point features for single-stage point-based detectors, which has rarely been explored in previous works, and endows the detector with superior inference speed.

2.2. EFFICIENT 3D POINT-BASED DETECTORS

Point-based methods need to process large-scale raw point features, which requires building cumbersome models and incurs expensive computation costs. Therefore, several works (Yang et al., 



* Equal contribution. † Corresponding author.

