OPEN-SET 3D DETECTION VIA IMAGE-LEVEL CLASS AND DE-BIASED CROSS-MODAL CONTRASTIVE LEARNING

Abstract

Current point-cloud detectors have difficulty detecting open-set objects in the real world due to their limited generalization capability. Moreover, collecting and fully annotating a point-cloud detection dataset with a large number of object classes is extremely laborious and expensive. This leads to the limited class size of existing point-cloud datasets and hinders models from learning general representations for open-set point-cloud detection. Instead of seeking well-annotated point-cloud datasets, we resort to ImageNet1K to broaden the vocabulary of point-cloud detectors. Specifically, we propose OS-3DETIC, an Open-Set 3D DETector using Image-level Class supervision. Intuitively, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo-labels for unseen classes. We then propose a novel de-biased cross-modal contrastive learning strategy to transfer knowledge from the image modality to the point-cloud modality. Without hurting inference latency, OS-3DETIC makes well-known point-cloud detectors capable of open-set detection. Extensive experiments demonstrate that OS-3DETIC achieves at least 10.77% and 9.56% absolute mAP improvement over a wide range of baselines on SUN-RGBD and ScanNetV2, respectively. Moreover, we conduct extensive experiments to shed light on why the proposed OS-3DETIC works.



However, in the point-cloud field, there are, as far as we know, few studies on open-set point-cloud detection. The most notable hindrance is that large-scale point-cloud data and labels (or, optionally, captions as in the image field) can hardly be obtained, due to the difficulty of collection and annotation. The scarcity of point-cloud data and labels drastically restricts point-cloud models from learning sufficient knowledge and obtaining general representations. This limitation motivates us to ask: can we transfer knowledge from images to point clouds so that the point-cloud model is capable of learning general representations?



Point-cloud detection is defined as finding objects in a point cloud (localization) and naming them (classification). Recently, deep-learning-based 3D detectors have achieved significant progress. However, most methods are developed on point-cloud detection datasets with limited classes, whereas the real world contains a cornucopia of classes. It is common for 3D detectors to encounter objects that never occurred during training, resulting in a failure to generalize to real-life scenarios. It is therefore extremely important to design an open-set point-cloud detector that is able to generalize to unseen classes. The key ingredient of open-set detection is that the model learns sufficient knowledge and is thus able to output general representations. To achieve this in the image field, typical open-set classification and detection methods require either large-scale image-text pairs or image datasets with sufficient labels. For example, CLIP Radford et al. (2021) introduced 400 million image-text pairs for pre-training to help visual models learn general representations. Detic Zhou et al. (2022) leverages ImageNet21K to extend the knowledge of image detectors. OVR-CNN Zareian et al. (2021) uses a language-pretrained embedding layer to broaden the vocabulary of the 2D detector.

