OPEN-SET 3D DETECTION VIA IMAGE-LEVEL CLASS AND DE-BIASED CROSS-MODAL CONTRASTIVE LEARNING

Abstract

Current point-cloud detectors have difficulty detecting open-set objects in the real world, due to their limited generalization capability. Moreover, collecting and fully annotating a point-cloud detection dataset with a large number of object classes is extremely laborious and expensive, which limits the class size of existing point-cloud datasets and hinders models from learning general representations for open-set point-cloud detection. Instead of seeking well-annotated point-cloud datasets, we resort to ImageNet1K to broaden the vocabulary of point-cloud detectors. Specifically, we propose OS-3DETIC, an Open-Set 3D DETector using Image-level Class supervision. Intuitively, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo-labels for unseen classes. We then propose a novel de-biased cross-modal contrastive learning strategy to transfer knowledge from images to point clouds. Without hurting inference latency, OS-3DETIC makes a well-known point-cloud detector capable of open-set detection. Extensive experiments demonstrate that OS-3DETIC achieves at least a 10.77% and a 9.56% absolute mAP improvement over a wide range of baselines on SUN RGB-D and ScanNetV2, respectively. Moreover, we conduct sufficient experiments to shed light on why the proposed OS-3DETIC works.

1. INTRODUCTION

3D point-cloud detection is defined as finding objects in a point cloud (localization) and naming them (classification). Recently, deep-learning-based 3D detectors have achieved significant progress. However, most methods are developed on point-cloud detection datasets with limited classes, whereas the real world contains a cornucopia of classes. It is therefore common for 3D detectors to encounter objects that never occurred during training, resulting in failure to generalize to real-life scenarios. Consequently, it is extremely important to design an open-set point-cloud detector that is able to generalize to unseen classes.

The key ingredient of open-set detection is that the model learns sufficient knowledge and is thus able to output general representations. To achieve this in the image field, typical open-set classification and detection methods either introduce large-scale image-text pairs or image datasets with sufficient labels. For example, CLIP Radford et al. (2021) introduced 400 million image-text pairs for pre-training to help visual models learn general representations. Detic Zhou et al. (2022) leverages ImageNet21K to extend the knowledge of image detectors. OVR-CNN Zareian et al. (2021) uses a language-pretrained embedding layer to broaden the vocabulary of the 2D detector.

However, in the point-cloud field, as far as we know, there are few studies on open-set point-cloud detection. The most notable hindrance is that large-scale point-cloud data and labels (or, optionally, captions as in the aforementioned image field) can hardly be obtained, due to the difficulty of collection and annotation. The scarcity of point-cloud data and labels drastically restricts the point-cloud model from learning sufficient knowledge and obtaining general representations. This limitation motivates us to ask: can we transfer knowledge from images to point clouds so that the point-cloud model is capable of learning general representations?

Open-set 3D detection aims to detect unseen categories without corresponding 3D labels. Note that, in this setting, open-set is defined in terms of 3D detection, so we can make use of external knowledge such as ImageNet, which only provides image-level category labels. Specifically, the proposed OS-3DETIC lets the 3D detector learn sufficient knowledge from image-level supervision, thus achieving open-set point-cloud detection. It is a synergy of two components: 1) we make full use of the knowledge learned from ImageNet and the generalizability of localization on point clouds to generate pseudo-labels for unseen classes; 2) we design a de-biased cross-modal contrastive learning scheme with a distance-aware temperature to capture the shared low-dimensional space within and across modalities, thus better transferring knowledge from the image domain to the point-cloud domain. It is noteworthy that, during training, we introduce paired images to narrow the gap between the point-cloud data and the images from ImageNet, but we do not need any extra annotations except the LiDAR-camera transformation matrix.

Figure 1: (a) The left point cloud includes tables (green) and chairs (red). The tables are labeled and denoted as seen objects, while the chairs are unseen objects. The right part shows examples from ImageNet1K, which has sufficient labels. We aim to utilize large-scale ImageNet1K to broaden the vocabulary of the point-cloud detector. (b) We train 3DETR Misra et al. (2021) and DETR Carion et al. (2020) on the ScanNet dataset. Under a low-data regime, e.g., randomly sampling 10% of the data, AR25 remains at a high level, even competitive with using 100% of the data, which demonstrates that localization in point-cloud object detection generalizes well. (c) mAP25 w.r.t. categories on the ScanNet dataset: performance comparison between the proposed OS-3DETIC and the baseline 3DETR. OS-3DETIC significantly improves the baseline by a large margin on all categories.

Extensive experiments show that OS-3DETIC outperforms a wide range of state-of-the-art baselines by at least 10.77% mAP (absolute) and 9.56% mAP (absolute), without hurting the latency of the original 3D detector, on the unseen classes of SUN RGB-D Song et al. (2015) and ScanNet Dai et al. (2017), respectively. An example on the ScanNet dataset is shown in Fig. 1(c). Sufficient ablation studies shed light on why OS-3DETIC works. Overall, our contributions are as follows:

• We propose an open-set 3D detector with image-level class supervision, termed OS-3DETIC, which is a synergy of two components: pseudo-label generation from two modalities, and de-biased cross-modal contrastive learning with a distance-aware temperature.
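As a concrete illustration of the first component, the recipe — class-agnostic 3D proposals, projected into the paired image via the LiDAR-camera matrix and named by an ImageNet-trained classifier — can be sketched as below. The interface (`classify_crop`, the 0.5 confidence threshold, and the box parameterization) is a hypothetical stand-in for exposition, not the paper's actual pipeline.

```python
import numpy as np

def project_to_image(centers, P):
    """Project (N, 3) 3D box centers to pixels with a 3x4 LiDAR-to-camera
    projection matrix P (the only extra annotation the method assumes)."""
    homo = np.hstack([centers, np.ones((len(centers), 1))])  # (N, 4) homogeneous
    uvw = homo @ P.T                                         # (N, 3)
    return uvw[:, :2] / uvw[:, 2:3]                          # perspective divide -> (u, v)

def generate_pseudo_labels(boxes3d, P, classify_crop, unseen_classes, thresh=0.5):
    """boxes3d: (N, 7) class-agnostic proposals (cx, cy, cz, dx, dy, dz, yaw).
    classify_crop: callable mapping a projected pixel location to
        (class_name, score); it stands in for cropping the paired image
        around the projected box and running an ImageNet-trained classifier
        (hypothetical interface for this sketch).
    Returns (box, class) pseudo-labels for confidently recognized unseen classes."""
    uv = project_to_image(boxes3d[:, :3], P)
    labels = []
    for box, pix in zip(boxes3d, uv):
        cls, score = classify_crop(pix)
        if cls in unseen_classes and score >= thresh:
            labels.append((box, cls))
    return labels
```

The point-cloud branch thus only needs to localize (which, per Fig. 1(b), generalizes well even in low-data regimes), while the image branch supplies the class names.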
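The second component can likewise be sketched. The exact de-biasing correction and temperature schedule are not specified in this excerpt; the sketch below combines a standard cross-modal InfoNCE objective with the de-biased negative estimator of Chuang et al. (2020) and a linear distance-aware temperature `tau0 * (1 + alpha * dist)` — all illustrative assumptions, not the paper's formulation.

```python
import numpy as np

def debiased_xmodal_nce(pc_feat, im_feat, dist, tau0=0.1, alpha=0.05, tau_plus=0.1):
    """De-biased cross-modal InfoNCE with a distance-aware temperature (sketch).

    pc_feat, im_feat : (N, D) features of matched point-cloud / image proposal
        pairs; row i of each matrix forms a positive pair.
    dist : (N,) distance of each proposal from the sensor; farther (sparser)
        objects get a larger temperature, i.e. a softer similarity distribution.
    tau_plus : assumed class prior used by the de-biasing correction of
        Chuang et al. (2020), which discounts likely same-class "false negatives".
    """
    p = pc_feat / np.linalg.norm(pc_feat, axis=1, keepdims=True)
    q = im_feat / np.linalg.norm(im_feat, axis=1, keepdims=True)
    sim = p @ q.T                                # (N, N) cosine similarities
    tau = tau0 * (1.0 + alpha * dist)            # (N,) distance-aware temperature
    exp = np.exp(sim / tau[:, None])
    n = len(p)
    pos = np.diag(exp)                           # matched cross-modal pairs
    neg_mean = (exp.sum(axis=1) - pos) / (n - 1) # mean over non-matching pairs
    # De-biased negative term: subtract the expected contribution of
    # same-class negatives, floored at its theoretical minimum e^{-1/tau}.
    ng = (neg_mean - tau_plus * pos) / (1.0 - tau_plus)
    ng = np.maximum(ng * (n - 1), (n - 1) * np.exp(-1.0 / tau))
    return float(np.mean(-np.log(pos / (pos + ng))))
```

Under this sketch, aligned cross-modal pairs drive the loss toward zero, while the temperature term relaxes the penalty on distant, sparsely observed objects.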

