OPEN-SET 3D DETECTION VIA IMAGE-LEVEL CLASS AND DE-BIASED CROSS-MODAL CONTRASTIVE LEARNING

Abstract

Current point-cloud detectors have difficulty detecting open-set objects in the real world due to their limited generalization capability. Moreover, collecting and fully annotating a point-cloud detection dataset with a large number of object classes is extremely laborious and expensive, which limits the class size of existing point-cloud datasets and hinders models from learning general representations for open-set point-cloud detection. Instead of seeking well-annotated point-cloud datasets, we resort to ImageNet1K to broaden the vocabulary of point-cloud detectors. Specifically, we propose OS-3DETIC, an Open-Set 3D DETector using Image-level Class supervision. Intuitively, we take advantage of two modalities, the image modality for recognition and the point-cloud modality for localization, to generate pseudo-labels for unseen classes. We then propose a novel de-biased cross-modal contrastive learning strategy to transfer knowledge from images to point-clouds. Without hurting inference latency, OS-3DETIC makes a well-known point-cloud detector capable of open-set detection. Extensive experiments demonstrate that the proposed OS-3DETIC achieves at least 10.77% and 9.56% absolute mAP improvement over a wide range of baselines on SUN-RGBD and ScanNetV2, respectively. Moreover, we conduct sufficient experiments to shed light on why the proposed OS-3DETIC works.

1. INTRODUCTION

3D point-cloud detection is defined as finding objects in a point-cloud (localization) and naming them (classification). Recently, deep-learning-based 3D detectors have achieved significant progress. However, most methods are developed on point-cloud detection datasets with limited classes, whereas the real world has a cornucopia of classes. It is common for 3D detectors to encounter objects that never occurred during training, resulting in failure to generalize to real-life scenarios. Therefore, it is extremely important to design an open-set point-cloud detector that is able to generalize to unseen classes. The key ingredient of open-set detection is that the model learns sufficient knowledge and is thus able to output general representations. To achieve this in the image field, typical open-set classification and detection methods either introduce large-scale image-text pairs or image datasets with sufficient labels. For example, CLIP Radford et al. (2021) introduced 400 million image-text pairs for pre-training to help visual models learn general representations. Detic Zhou et al. (2022) leverages ImageNet21K to extend the knowledge of image detectors. OVR-CNN Zareian et al. (2021) uses a language-pretrained embedding layer to broaden the vocabulary of the 2D detector. In the point-cloud field, however, there are, as far as we know, few studies on open-set point-cloud detection. The most notable hindrance is that large-scale point-cloud data and labels (or, optionally, captions as in the image field) can hardly be obtained, due to the difficulty of collection and annotation. This scarcity drastically restricts point-cloud models from learning sufficient knowledge and obtaining general representations. This limitation motivates us to ask: can we transfer knowledge from images to point-clouds so that the point-cloud model is capable of learning general representations?
A straightforward approach is to directly utilize an image detection dataset with full bounding boxes and class labels and transfer knowledge from the 2D detector to 3D. However, bounding-box-level annotation is still laborious and difficult to scale, while open-set detection requires rich labels to help the detector learn sufficient knowledge Zhou et al. (2022). Therefore, in this paper, instead of seeking to build a large-scale point-cloud dataset or use 2D detection datasets, we open up another path by resorting to a large-scale image dataset with image-level class supervision, ImageNet1K Krizhevsky et al. (2012), to enable the point-cloud detector to learn general representations, thus broadening its vocabulary, as shown in Fig. 1(a). The goal of open-set 3D detection is to detect unseen categories without corresponding 3D labels. Note that, in this setting, open-set is defined in terms of 3D detection; we can make use of external knowledge such as ImageNet, which only provides image-level category knowledge. Specifically, the proposed OS-3DETIC lets the 3D detector learn sufficient knowledge from image-level supervision, thereby achieving open-set point-cloud detection, and it is a synergy of two components: 1) we make full use of knowledge learned from ImageNet and the generalizability of localization on point-clouds to generate pseudo-labels for unseen classes; 2) we design a de-biased cross-modal contrastive learning with distance-aware temperature to capture the shared low-dimensional space within and across modalities, thus better transferring sufficient knowledge from the image domain to the point-cloud domain. It is noteworthy that during training we introduce paired images to narrow the gap between the point-cloud data and the images from ImageNet, but we do not need any extra annotations except the Lidar-Camera transformation matrix.
Extensive experiments show that OS-3DETIC outperforms a wide range of state-of-the-art baselines by at least 10.77% mAP (absolute) and 9.56% mAP (absolute) on the unseen classes of SUN RGB-D Song et al. (2015) and ScanNet Dai et al. (2017), respectively, without hurting the latency of the original 3D detector. An example on the ScanNet dataset is shown in Fig. 1(c). Sufficient ablation studies shed light on why OS-3DETIC works. Overall, our contributions are as follows: • We propose an open-set 3D detector with image-level class supervision, termed OS-3DETIC, which is a synergy of two components: pseudo-label generation from two modalities, and de-biased cross-modal contrastive learning with distance-aware temperature. • OS-3DETIC can be regarded as a strong baseline for open-set 3D point-cloud detection. • Extensive experiments demonstrate the effectiveness of OS-3DETIC, and we also provide sufficient analysis to uncover why it works.

3.1. NOTATION AND PRELIMINARIES

We use I ∈ R^{3×H×W} to represent an image and P = {p_i ∈ R^3, i = 1, 2, ..., N} to represent a point-cloud. We also use the ImageNet1K dataset, denoted as D_ign = {(I, c^ign_2D)_j}_{j=1}^{|D_ign|}. A typical point-cloud detector deals with localization and classification: the localization module outputs bounding boxes b_3D ∈ R^7, from which we obtain the corresponding point-cloud ROI features f_3D. Similarly, we can project the 3D bounding box into the 2D image via the projection matrix K, i.e., b_2D ∈ R^4, and index the corresponding image ROI features f_2D. We then use f_3D and f_2D to predict the class of each object.
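As a concrete illustration, the projection of b_3D into b_2D can be sketched as follows. This is a minimal NumPy sketch: the (cx, cy, cz, l, w, h, yaw) box parameterization, the corner layout, the choice of depth axis, and the intrinsic matrix K are our illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def box3d_corners(box7):
    # Hypothetical (cx, cy, cz, l, w, h, yaw) parameterization of b_3D in R^7.
    cx, cy, cz, l, w, h, yaw = box7
    xs = np.array([1, 1, 1, 1, -1, -1, -1, -1]) * l / 2.0
    ys = np.array([1, 1, -1, -1, 1, 1, -1, -1]) * w / 2.0
    zs = np.array([1, -1, 1, -1, 1, -1, 1, -1]) * h / 2.0
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return (R @ np.stack([xs, ys, zs])).T + np.array([cx, cy, cz])

def project_box(box7, K):
    # b_2D = tight 2D box around the 8 projected corners (pinhole camera,
    # depth taken along the third axis for simplicity).
    pts = box3d_corners(box7)               # (8, 3) corners
    uvw = (K @ pts.T).T                     # apply intrinsics
    uv = uvw[:, :2] / uvw[:, 2:3]           # perspective divide
    return np.array([uv[:, 0].min(), uv[:, 1].min(),
                     uv[:, 0].max(), uv[:, 1].max()])
```

The resulting b_2D ∈ R^4 (x1, y1, x2, y2) is what indexes the image ROI feature f_2D.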

3.2. OS-3DETIC: OPEN-SET 3D DETECTOR WITH IMAGE CLASSES

We use ImageNet1K to broaden the vocabulary of the point-cloud detector. To transfer the knowledge contained in ImageNet1K to the point-cloud detector, paired images are introduced to bridge the two modalities. Specifically, we design a two-phase training strategy to enable open-set point-cloud detection. The first phase is similar to Detic Zhou et al. (2022) and aims at leveraging ImageNet to help the 2D detector learn sufficient knowledge. The second phase is the core of the proposed OS-3DETIC and aims at transferring the knowledge of the 2D detector to the 3D detector via a customized pseudo-label strategy and de-biased cross-modal contrastive learning, as shown in Fig. 2. During inference, we only need the 3D detector, without any extra models or modalities. OS-3DETIC is developed on top of 3DETR Misra et al. (2021), and we use DETR Carion et al. (2020) as the 2D detector. The details are illustrated below.

Figure 2: Overview of phase 2 of OS-3DETIC, which is the core and a synergy of two components: 1) we take advantage of two modalities, the image modality for classification and the point-cloud modality for localization, to generate pseudo-labels for unseen classes, and 2) we design a de-biased cross-modal contrastive learning to transfer knowledge from images to point-clouds. Note that before phase 2, we first train both the 2D detector and the 3D detector, similar to Detic Zhou et al. (2022). The green brackets denote ground truth of seen classes. For ImageNet1K, the "bbox" is not provided, so it is denoted as ∅. The red brackets denote pseudo-labels, where bounding boxes come from the output of the 3D detector and classes come from the output of the 2D detector.

In the first phase, the point-cloud P in D_pc is fed to the 3D detector, which is supervised by the ground truth {b_3D, c_3D} of seen objects. The paired images I from D_img are input into the 2D detector, which is supervised by the projected 3D boxes and the corresponding classes, denoted as {K × b_3D, c_3D}. It is noteworthy that the images I from the ImageNet1K dataset D_ign are also input into the same 2D detector, with only image-level labels (categories) provided. Following Detic Zhou et al. (2022), we choose the max-size proposal f_maxsize and apply the classification label c^ign_2D to supervise the classifier in the 2D detector. The loss in the phase-1 training is given by

L_phase1 = L^3D_box(b_3D, b̂_3D) + L^3D_cls(c_3D, W_3D f_3D) + L^2D_box(K × b_3D, b̂_2D) + L^2D_cls(c_3D, W_2D f_2D) + L^ign_cls(c^ign_2D, W_2D f_maxsize),

where L^3D_box and L^2D_box follow Misra et al. (2021) and Carion et al. (2020), L^3D_cls and L^2D_cls are cross-entropy losses for classification, and W_3D and W_2D denote the classifiers in the 3D detector and the 2D detector, respectively.

In phase 2, we first generate pseudo-labels for unseen classes. The pseudo-labels contain two parts: bounding boxes from the 3D detector and classes from the 2D detector. Specifically, since we already exploit ImageNet1K to train the 2D detector in phase 1, similar to Detic Zhou et al. (2022), the 2D detector is able to classify unseen classes, so we can use its classification results as relatively accurate pseudo-labels. To leverage this, we crop the image region of the projected 3D detection and use the 2D detector's classifier to generate its class label. For the bounding-box pseudo-labels, we take advantage of the generalizability of localization of the point-cloud detector, as mentioned in Fig. 1(b). Besides using pseudo-labels, we also design a de-biased cross-modal contrastive learning to better transfer knowledge from the image modality to the point-cloud modality. Note that there is significant synergy between this pseudo-label strategy and our proposed de-biased cross-modal contrastive learning.
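The image-level term L^ign_cls above can be sketched as follows. This is a NumPy sketch under assumed shapes: the proposal format, feature dimensions, and classifier shape are illustrative, not the paper's exact code.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def image_level_loss(boxes2d, feats, W2d, label):
    # boxes2d: (Q, 4) proposals (x1, y1, x2, y2); feats: (Q, D) ROI features;
    # W2d: (C, D) classifier weights; label: image-level class id (c^ign_2D).
    # Only the max-size proposal f_maxsize receives the classification loss.
    areas = (boxes2d[:, 2] - boxes2d[:, 0]) * (boxes2d[:, 3] - boxes2d[:, 1])
    f_maxsize = feats[np.argmax(areas)]
    probs = softmax(W2d @ f_maxsize)
    return -np.log(probs[label])            # cross-entropy on that one proposal
```

The design choice, following Detic, is that the largest proposal is most likely to cover the object named by the image-level label, so it alone is supervised.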
The pseudo-labels are beneficial for true-positive sampling in cross-modal contrastive learning, and the cross-modal contrastive learning gradually transfers knowledge from images to point-clouds, in turn helping to generate better pseudo-labels. Thus the pseudo-labels are iteratively updated to be of higher quality. Overall, the total loss in phase 2 is given by

L_phase2 = L^3D_box(b̃_3D, b̂_3D) + L^3D_cls(c̃_3D, W_3D f_3D) + L^2D_box(K × b̃_3D, b̂_2D) + L^2D_cls(c̃_3D, W_2D f_2D) + L^ign_cls(c^ign_2D, W_2D f_maxsize) + L_DECC,

where b̃_3D and c̃_3D come from either ground truth or pseudo-labels, and L_DECC denotes the loss function of the de-biased cross-modal contrastive learning.

Typical cross-modal contrastive learning assigns positive and negative pairs purely by one-to-one position correspondence. This results in biases, as shown in the left part of Fig. 3: both within and across modalities, some objects of the same class are inaccurately assigned as negative samples. To mitigate this biased issue in contrastive learning, we propose a de-biased cross-modal contrastive learning (DECC), as shown in the right part of Fig. 3. We take the ROI features f_3D from the 3D detector and f_2D from the 2D detector, and use the ground truth or pseudo-labels c̃_3D to assign positive and negative samples. For a mini-batch of f_3D and f_2D with M features in total from the two modalities, we use a linear projection to transform these features into hidden representations, denoted as h_i, i = 1, 2, ..., M. The loss is then given by

L_DECC = -(1/M) Σ_{i=1}^{M} log [ Σ_{t=0}^{m} exp(h_i^⊤ h_t / τ(dist_it)) / Σ_{j=0}^{M} exp(h_i^⊤ h_j / τ_0) ],

where m is the number of positive samples corresponding to h_i (m ≤ M) and τ_0 is the base temperature. It is noteworthy that, unlike previous contrastive learning with a constant temperature τ for each sample, we adopt a distance-aware temperature τ(dist_ij) = τ_0 × γ^{dist_ij} for positive samples, where γ is a hyperparameter. In particular, we calculate the Euclidean distance between the two samples in 3D space.
Note that if one of the samples comes from ImageNet, we set the distance between it and any other sample to 1. Intuitively, the correlation between two close-by objects is greater than that between two distant objects, so we scale the temperature with a distance-aware strategy to facilitate this correlation. Specifically, if γ > 1.0, L_DECC pays more attention to connecting close-by objects; if γ = 1.0, L_DECC degenerates to de-biased cross-modal contrastive learning with a constant temperature; and if γ < 1.0, L_DECC focuses on bridging distant objects. In practice, we set γ > 1.0 to enforce the alignment of close-by objects.
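Under these definitions, L_DECC with the distance-aware temperature can be sketched as below. This is a NumPy sketch: batching, the linear projection, and the cross-modal masking are simplified assumptions, and positives are found purely by matching class ids.

```python
import numpy as np

def decc_loss(H, labels, dists, tau0=0.2, gamma=1.1):
    # H: (M, d) L2-normalized hidden representations h_i from both modalities.
    # labels: (M,) class ids from ground truth or pseudo-labels.
    # dists: (M, M) pairwise 3D Euclidean distances (1.0 for ImageNet samples).
    M = len(H)
    sim = H @ H.T                           # h_i^T h_j similarities
    total, count = 0.0, 0
    for i in range(M):
        pos = labels == labels[i]
        pos[i] = False                      # the anchor is not its own positive
        if not pos.any():
            continue
        # positives use the distance-aware temperature tau0 * gamma**dist
        num = np.exp(sim[i, pos] / (tau0 * gamma ** dists[i, pos])).sum()
        # denominator keeps all other samples at the base temperature tau0
        den = np.exp(np.delete(sim[i], i) / tau0).sum()
        total += -np.log(num / den)
        count += 1
    return total / max(count, 1)
```

With γ > 1, a distant positive gets a larger temperature and thus a softer pull, so the loss concentrates on aligning close-by objects, matching the discussion above.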

4. EXPERIMENTS

In this section, we compare the proposed OS-3DETIC with popular baselines on two widely used 3D detection datasets, SUN RGB-D and ScanNet. mAP takes both classification and localization into consideration (the larger the mAP, the better the detection), whereas AR mainly focuses on localization. Therefore, compared to AR, mAP is a better metric for evaluating detection results.
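For completeness, mAP at a fixed IoU threshold (e.g., mAP_25) reduces, per class, to an average-precision computation over score-ranked detections; AR at the same threshold is simply the recall. The following is a minimal all-point-interpolation sketch; the IoU-matching step that marks each detection as a true positive is assumed already done and is not the paper's evaluation code.

```python
import numpy as np

def average_precision(scores, is_tp, n_gt):
    # scores: (D,) detection confidences; is_tp: (D,) bool, True if the
    # detection matched a ground-truth box at IoU >= 0.25; n_gt: #GT boxes.
    order = np.argsort(-scores)             # rank detections by confidence
    tp = np.cumsum(is_tp[order])
    fp = np.cumsum(~is_tp[order])
    recall = tp / n_gt                      # final recall value is AR here
    precision = tp / (tp + fp)
    # monotone precision envelope, then integrate precision over recall
    prec = np.maximum.accumulate(precision[::-1])[::-1]
    return float(np.sum(np.diff(np.concatenate(([0.0], recall))) * prec))
```

mAP is then the mean of this quantity over classes, which is why it penalizes misclassification in a way AR does not.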

4.2. EXPERIMENTAL SETUP AND IMPLEMENTATION DETAILS

Implementation The proposed OS-3DETIC is mainly based on 3DETR Misra et al. (2021), yet it can also generalize to other point-cloud detectors. During training, a mini-batch consists of a point-cloud with its paired image and several images from ImageNet. 3DETR, as the 3D detector, consumes the point-cloud, and DETR Carion et al. (2020), as the 2D detector, deals with the images. Note that only the classification loss is used in DETR when the input images come from ImageNet: we select the max-size ROI feature detected by DETR and apply the classification loss on it, following Detic Zhou et al. (2022). τ_0 and γ are the two hyperparameters of the distance-aware temperature strategy; in practice, τ_0 is set to 0.2, γ is set to 1.1, and the distance between an image from ImageNet and any other sample is 1.0. For both 3DETR and DETR, a unified learning rate is set to 2 × 10^-5 with a batch size of 4 per GPU, and we train our model on 8 RTX 2080 Ti GPUs. We train 200 epochs for phase 1 and 200 epochs for phase 2.

Pseudo-Label Generation A pseudo-label consists of two elements: a 3D bounding box and a category. We generate initial pseudo-labels after finishing phase 1 and update them iteratively. As discussed in Section 3.2, the 3D detector performs generalizable object localization, while the 2D detector is trained on the ImageNet dataset with sufficient knowledge. Therefore, the bounding box comes from the 3D detector and the class label comes from the 2D detector. Furthermore, we only keep proposals with high confidence, and filter out proposals that duplicate ground truth or contain no points. Finally, only the top-k proposals are kept for each unseen class; k defaults to 50 and increases linearly by 10 (default) every 50 epochs in phase 2. The results are presented in Tables 1 and 2.
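The pseudo-label filtering described above can be sketched as follows. k0 = 50 and the increment of 10 every 50 epochs are the paper's stated defaults; the confidence threshold is our illustrative choice, and the duplicate-with-ground-truth test is omitted for brevity.

```python
import numpy as np

def select_pseudo_labels(scores, classes, points_in_box, epoch,
                         conf_thresh=0.5, k0=50, k_step=10):
    # scores: (P,) proposal confidences; classes: (P,) predicted class ids;
    # points_in_box: (P,) number of points inside each proposal.
    k = k0 + k_step * (epoch // 50)              # k grows every 50 epochs
    keep = []
    for c in np.unique(classes):
        idx = np.where((classes == c)
                       & (scores >= conf_thresh)  # keep high confidence only
                       & (points_in_box > 0))[0]  # drop boxes with no points
        idx = idx[np.argsort(-scores[idx])][:k]   # top-k per unseen class
        keep.extend(int(i) for i in idx)
    return sorted(keep)
```

Growing k gradually admits more pseudo-labels only as their quality improves over the phase-2 iterations.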
We can observe that OS-PointCLIP achieves mAP_25 of 2.26% and 3.09%, outperforming the other baselines on both SUN RGB-D and ScanNet. Furthermore, Detic-ModelNet reaches mAP_25 of 1.52% and 1.69% on SUN RGB-D and ScanNet, respectively. Both OS-PointCLIP and Detic-ModelNet outperform OS-Image2Point and Detic-ImageNet. Indeed, OS-PointCLIP and Detic-ModelNet enhance the classifier of the detector by introducing pre-training on ModelNet, a 3D classification dataset, while OS-Image2Point and Detic-ImageNet try to transfer knowledge from images (COCO and ImageNet) to point-clouds. This contrast shows that the modality gap between 2D and 3D does indeed hinder knowledge transfer. The results also demonstrate that directly plugging the Detic method into 3DETR with ImageNet is infeasible for transferring knowledge from 2D images to 3D. Nonetheless, our method achieves mAP_25 of 13.03% and 12.65% on SUN RGB-D and ScanNet, respectively, which proves that the proposed OS-3DETIC can indeed make use of image knowledge to achieve open-set 3D detection.

Ablation Study on Different Components

We conduct an ablation study on SUN RGB-D; the results on unseen classes are reported in Table 3. Our baseline is 3DETR trained only on seen categories. "Pseudo-Label" denotes training with pseudo-labels. "Augmentation-Based CL" denotes that we follow SimCLR Chen et al. (2020); He et al. (2020) and utilize a data-augmentation strategy in contrastive learning, which, however, is also a biased contrastive-learning setting. Position-based contrastive learning takes only the same instance from the paired image/point-cloud as positive. Class-based contrastive learning treats two samples as positive so long as they belong to the same category. "Distance-Aware Temperature" refers to the strategy proposed in Section 3.3. We can observe that, first, the pseudo-labels bring the largest improvement; indeed, pseudo-labels implicitly transfer the knowledge contained in ImageNet to a totally different modality, the point-cloud modality, by providing useful class labels. This not only validates our hypothesis that the 3D detector functions as a general region-proposal network, but also demonstrates the effectiveness of introducing large-scale image-level supervision for classification. Second, class-based contrastive learning and the distance-aware temperature further significantly improve the performance, while traditional augmentation-based and position-based contrastive learning hurt the performance, which indicates the weakness of biased contrastive learning and demonstrates that the proposed de-biased cross-modal contrastive learning indeed helps the point-cloud detector learn general representations. Fig. 4(a) plots mAP_25 and AR_25 of both the point-cloud and the paired-image branches against the data ratio used during training. On one hand, compared with mAP_25, AR_25 converges with relatively less training data, in both the image and the point-cloud branches.
AR_25 partially reflects localization ability, which means we can train a generalizable bounding-box detector with few annotations. On the other hand, compared with the AR_25 of the image branch, the AR_25 of the point-cloud branch converges with less training data, which means the localization ability of the 3D detector is better than that of the 2D detector, verifying our strategy of using the point-cloud detector to generate bounding-box pseudo-labels. Moreover, the converged mAP_25 of the image branch is better than that of the point-cloud branch. This may be because texture and detail account for much of classification, which is a weakness of the point-cloud detector.

Analysis of Performance on Different Training Data Ratio

Analysis on Pseudo-Label Effects Fig. 4(b) illustrates the relation between mAP_25, AR_25, and the pseudo-label iteration. During the second phase, we iteratively update the pseudo-labels every 50 epochs. The results show that with more iterations the pseudo-labels become better, leading to better performance.

4.5. QUALITATIVE RESULTS

Fig. 5 provides qualitative results. We can observe that the Baseline is able to predict relatively accurate locations compared with the ground truth, but the size, center, and direction are incorrect, especially for the unseen-class bounding boxes (chair in Scene 1, sofa in Scene 2, bed in Scene 3, and sofa in Scene 4). Compared with the Baseline, OS-3DETIC performs much better on the unseen classes.

5. CONCLUSION

In this paper, we study the new problem of open-set 3D detection. The proposed method, OS-3DETIC, introduces ImageNet1K to help open-set point-cloud detection. OS-3DETIC consists of two components: 1) we take advantage of two modalities, the image modality for classification and the point-cloud modality for localization, to generate pseudo-labels for unseen classes; and 2) de-biased cross-modal contrastive learning transfers knowledge from images to point-clouds. Extensive experiments show that we improve a wide range of baselines by a large margin, demonstrating the effectiveness of the proposed method. We also explain why it works via ablation studies and analysis of the learned representations. We hope our work inspires the research community to further explore this field.



mAP_25 w.r.t. categories on the ScanNet dataset.

Figure 1: (a) The left point-cloud includes tables (green) and chairs (red). The tables are labeled and denoted as seen objects, while the chairs are unseen objects. The right part shows examples from ImageNet1K, which has sufficient labels. We aim at utilizing large-scale ImageNet1K to broaden the vocabulary of the point-cloud detector. (b) We train 3DETR Misra et al. (2021) and DETR Carion et al. (2020) on the ScanNet dataset. Under a low-data regime, e.g., randomly sampling 10% of the data, AR_25 remains at high accuracy, even competitive with using 100% of the data, which demonstrates that localization in point-cloud object detection generalizes well. (c) Performance comparison between the proposed OS-3DETIC and the baseline 3DETR on the ScanNet dataset. OS-3DETIC improves the baseline by a large margin on all categories.

where N is the number of points in the point-cloud. During training, we use 1) a point-cloud dataset denoted as D_pc = {(P, (b_3D ∈ R^7, c_3D)_k)_j}_{j=1}^{|D_pc|} with vocabulary size C_pc, where b_3D is the 3D bounding-box annotation and c_3D is the corresponding classification label; and 2) a paired image dataset denoted as D_img = {I_j}_{j=1}^{|D_img|}

Figure 3: Typical biased cross-modal contrastive learning (left) and the proposed de-biased cross-modal contrastive learning (right). Position-based cross-modal contrastive learning follows the position correspondence with a one-to-one match, which may result in assigning inaccurate negative samples. The proposed de-biased cross-modal contrastive learning (DECC) leverages pseudo-labels to mitigate the bias issue. A red outline denotes an unseen class to which we assign a pseudo-label, and a green outline denotes a seen class with ground truth.

Since there is no baseline directly solving the problem of both open-set 3D point-cloud localization and classification, we mainly compare OS-3DETIC with state-of-the-art 3D point-cloud detectors Liu et al. (2021d); Qi et al. (2019a); Zhang et al. (2020); Misra et al. (2021) and some well-known works Zhang et al. (2021); Xu et al. (2021a); Zhou et al. (2022) that study either transferability in point-clouds or 2D open-set detection. Specifically, the baselines we use include: • GroupFree3D Liu et al. (2021d), VoteNet Qi et al. (2019a), H3DNet Zhang et al. (2020), and 3DETR Misra et al. (2021) are well-known and representative 3D point-cloud detectors chosen as our baselines. These four baselines are trained on the seen classes and tested on the unseen ones. • The second is PointCLIP Zhang et al. (2021), which bridges the point-cloud and text domains. We use it directly as a pre-trained open-set 3D classifier and replace the classifier of 3DETR with PointCLIP. This baseline is denoted OS-PointCLIP, and is similar to well-known 2D open-set detection works Bansal et al. (2018); Gu et al. (2021); Zhou et al. (2022) that replace the classifier in the detector with a generalizable classifier. • Besides, Xu et al. (2021a) transfer an image-pretrained transformer to the point-cloud by copying or inflating the weights. Similarly, we copy the weights of the transformer and the classifier from DETR pre-trained on COCO Lin et al. (2014) to 3DETR, and finetune the set-aggregation module and the 3D box head. We term this baseline OS-Image2Point. • Moreover, Detic Zhou et al. (2022) leverages a large-scale classification dataset (ImageNet) to broaden the 2D detector; here we directly extend the idea to 3D open-set detection. Specifically, we consider two manners of extending the classifier, via ModelNet or ImageNet, and term them Detic-ModelNet and Detic-ImageNet, respectively.

Performance vs. Iteration.

Figure 4: (a) illustrates the relation between mAP_25, AR_25, and the data regime of both 2D and 3D detection; the green, orange, solid, and dashed lines represent point-cloud, image, mAP_25, and AR_25, respectively. (b) illustrates the relation between mAP_25, AR_25, and the pseudo-label iteration; the blue and red lines represent mAP_25 and AR_25, respectively.

Fig. 4(a) illustrates the relation between mAP_25, AR_25, and different training-data ratios; the left y axis denotes mAP_25, and the right y axis denotes AR_25.

Figure 5: Visualization of detection results. The four columns represent four different scenes; the comparison is conducted among the Baseline, OS-3DETIC, and the Ground Truth. All bounding boxes in Baseline and OS-3DETIC are predicted bounding boxes. In Ground Truth, red bounding boxes represent unseen-class samples and green ones represent seen classes.

with vocabulary C_ign, where c^ign_2D denotes the classification label of an image in ImageNet1K. During testing, we evaluate on the vocabulary C_test, where C_ign ≥ C_test > C_pc.

Song et al. (2015) and ScanNet Dai et al. (2017). We then conduct sufficient analysis and ablation studies to explore why OS-3DETIC works. The details are illustrated below.

Detection results (AP 25 ) on unseen classes of SUN RGB-D.

Detection results (AP 25 ) on unseen classes of ScanNet.

Ablation study on different components.

