3D SEGMENTER: 3D TRANSFORMER BASED SEMANTIC SEGMENTATION VIA 2D PANORAMIC DISTILLATION

Abstract

Recently, 2D semantic segmentation has witnessed significant advances thanks to the huge amount of available 2D image datasets. In this work, we propose the first 2D-to-3D knowledge distillation strategy, which enhances a 3D semantic segmentation model with knowledge embedded in the latent space of powerful 2D models. Specifically, unlike standard knowledge distillation, where teacher and student models take the same data as input, we use 2D panoramas properly aligned with the corresponding 3D rooms to train the teacher network, and use the knowledge learned by the 2D teacher to guide the 3D student. To facilitate our research, we create a large-scale, finely annotated 3D semantic segmentation benchmark containing voxel-wise semantic labels and aligned panoramas of 5175 scenes. Based on this benchmark, we propose a 3D volumetric semantic segmentation network that adapts the Video Swin Transformer as its backbone and introduces a skip-connected linear decoder. Achieving state-of-the-art performance, our 3D Segmenter is computationally efficient and requires only 3.8% of the parameters compared to the prior art.

1. INTRODUCTION

Semantic segmentation assigns each 2D pixel (Long et al., 2015) or 3D point (Qi et al., 2017a)/voxel (Çiçek et al., 2016) a category label representing the corresponding object class. As a fundamental computer vision technique, semantic segmentation has been widely applied to medical image analysis (Ronneberger et al., 2015), autonomous driving (Cordts et al., 2016), and robotics (Ainetter & Fraundorfer, 2021). Most existing efforts are invested in 2D settings thanks to the great number of public 2D semantic segmentation datasets (Zhou et al., 2017; Cordts et al., 2016; Nathan Silberman & Fergus, 2012). Nowadays, the wide availability of consumer 3D sensors also largely promotes the need for 3D semantic segmentation (Han et al., 2020; Choy et al., 2019; Thomas et al., 2019; Graham et al., 2018b; Nekrasov et al., 2021).
We have seen great success in semantic segmentation on 2D images. However, this success does not fully transfer to the 3D domain. One reason is that, given the same scene, processing the 3D data usually requires orders of magnitude more computation than processing a 2D image; e.g., using a volumetric 3D representation, the number of voxels grows as O(n³) with the size of the scene. As a result, existing 3D semantic segmentation methods usually need to use smaller receptive fields or shallower networks than 2D models to handle large scenes. This motivates us to facilitate 3D semantic segmentation using 2D models. Specifically, we propose a novel 2D-to-3D knowledge distillation method that enhances 3D semantic segmentation by leveraging the knowledge embedded in a 2D semantic segmentation network. Unlike traditional knowledge distillation approaches, where student and teacher models take the same input, in our case the 2D teacher model is pre-trained on a large-scale 2D image repository and finetuned on panoramas rendered from 3D scenes. Each panorama is rendered at the center of the 3D scan with a 360° receptive field, i.e., it contains the global context of the environment.
Our 3D student model takes the 3D scene as input and outputs a 3D volumetric semantic segmentation. Through differentiable panoramic rendering of the 3D semantic segmentation prediction, we obtain a mapping from the 2D teacher's prediction to the 3D student's prediction. We then transfer the class distribution of each pixel produced by the 2D teacher to the corresponding voxel of the 3D student. To the best of our knowledge, this is the first solution that distills from a pre-trained 2D model to enhance a 3D computer vision task. To facilitate this 2D-to-3D knowledge distillation design, we create a large-scale indoor dataset called PanoRooms3D. PanoRooms3D contains 5917 rooms diversely furnished with 3D CAD objects and augmented with randomly sampled floor and wall textures. We manually filter out scenes that contain unlabeled furniture. For each room, we prepare a dense 3D volume and a corresponding 2D panorama rendered at the center of the room with a 360° receptive field, both with clean semantic labels. Once the panorama-to-volume correspondence is established, knowledge distillation from 2D to 3D is enabled. As our 3D student backbone, we propose a novel, efficient architecture, 3D Segmenter, for semantic segmentation of 3D volumetric data. Inspired by the Video Swin Transformer (Liu et al., 2022a), 3D Segmenter employs a pure-transformer structure with an efficient linear decoder. We demonstrate that our 3D Segmenter achieves superior performance while requiring only 3.8% of the parameters compared to the prior art that uses 3D convolutional backbones. Our contributions are three-fold:
• We propose the first 2D-to-3D knowledge distillation method, which utilizes data-abundant and pretrain-ready 2D semantic segmentation to improve 3D semantic segmentation.
• We propose PanoRooms3D, a large-scale 3D volumetric dataset with clean voxel-wise semantic labels and well-aligned corresponding 2D panoramic renderings with pixel-wise semantic labels.
• We propose a novel, efficient pure-transformer network for the 3D semantic segmentation task. Our 3D Segmenter outperforms prior art that uses a 3D convolutional backbone with only 3.8% of the parameters. We demonstrate through experiments that our baseline variants already achieve state-of-the-art performance, and the knowledge distilled from 2D further widens our lead. Considering the difficulty of 3D data collection and the cubic memory consumption of 3D models, our model paves the way toward bridging 2D and 3D vision tasks.

2. RELATED WORKS

Different from 2D images, when applying transformer models to 3D tasks, the length of the token sequence grows cubically with the size of the input. The shifted window design proposed by the Swin Transformer (Liu et al., 2021) efficiently limits the sequence length of each Multi-head Self Attention (MSA) computation, giving it linear computational complexity with respect to image size. For efficient and scalable 3D computation, we choose the Video Swin Transformer (Liu et al., 2022a) as our encoder backbone.

3.1. ENCODER

Our model is trained on fixed-size 3D blocks cropped from 3D scenes. The input combines RGB with a TSDF, an implicit, regular 3D shape representation that stores in each voxel the distance to the nearest surface and supports high-quality real-time rendering. Receiving an input block x = [x_tsdf, x_rgb] ∈ R^(X×Y×Z×4), we regard each 3D patch of spatial size P × P × P as a token. Each 3D block x is therefore partitioned into (X/P) × (Y/P) × (Z/P) tokens. A single 3D convolution layer embeds each token from 4 channels to a feature dimension of size D. To form a hierarchical architecture, the Video Swin Transformer reduces the number of tokens through patch merging between blocks: a 3D patch merging operation concatenates the C-dimensional features of 2 × 2 spatially neighboring patches into a 4C-dimensional feature, keeping the token size on the temporal dimension unchanged. This reduces the number of tokens by a factor of 2 × 2 = 4. To keep the resolution consistent with typical convolutional networks, the 4C-dimensional concatenated feature is downsampled to 2C using a linear layer. Following this design, we use a patch merging layer that merges neighboring 2 × 2 × 2 patches into an 8C-dimensional feature and downsamples it to 2C. Our encoder comprises h Swin Transformer blocks, with patch merging performed between every two blocks. Therefore, the output of the encoder is a sequence of N = X/(P·2^(h−1)) × Y/(P·2^(h−1)) × Z/(P·2^(h−1)) tokens, each of dimension D·2^(h−1).
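To make the token bookkeeping concrete, the following is a minimal PyTorch sketch of the two operations described above: 3D patch embedding and 3D patch merging. The class names, default sizes, and exact layer choices are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class PatchEmbed3D(nn.Module):
    """Embed each P x P x P patch (4 input channels: TSDF + RGB) into a D-dim token."""
    def __init__(self, patch_size=4, in_chans=4, embed_dim=96):
        super().__init__()
        # A single strided 3D convolution both partitions and embeds the patches.
        self.proj = nn.Conv3d(in_chans, embed_dim,
                              kernel_size=patch_size, stride=patch_size)

    def forward(self, x):                        # x: (B, 4, X, Y, Z)
        x = self.proj(x)                         # (B, D, X/P, Y/P, Z/P)
        return x.flatten(2).transpose(1, 2)      # (B, N, D) token sequence

class PatchMerging3D(nn.Module):
    """Concatenate 2 x 2 x 2 neighboring patches (C -> 8C), then project down to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.reduction = nn.Linear(8 * dim, 2 * dim, bias=False)

    def forward(self, x, dims):                  # x: (B, N, C), dims = token grid (Dx, Dy, Dz)
        B, N, C = x.shape
        Dx, Dy, Dz = dims
        x = x.view(B, Dx, Dy, Dz, C)
        # Gather the 8 corners of each 2x2x2 neighborhood and concatenate features.
        parts = [x[:, i::2, j::2, k::2, :]
                 for i in (0, 1) for j in (0, 1) for k in (0, 1)]
        x = torch.cat(parts, dim=-1)             # (B, Dx/2, Dy/2, Dz/2, 8C)
        x = x.view(B, -1, 8 * C)
        return self.reduction(x)                 # (B, N/8, 2C)
```

After h blocks with a merge between every two, the token count shrinks to N = X/(P·2^(h−1)) × Y/(P·2^(h−1)) × Z/(P·2^(h−1)), matching the formula above.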

3.2. DECODER

The decoder maps tokens coming from the encoder to a voxel-wise prediction ŷ_sem ∈ R^(X×Y×Z×K), where K is the number of categories in this task. In this work, we propose a decoder with only two linear layers. Accepting a token sequence of shape N × (D·2^(h−1)) from the encoder, a token-wise linear layer is applied to the sequence, producing an N × K class embedding. The class embedding is then rearranged to a patch-wise class embedding of shape X/(P·2^(h−1)) × Y/(P·2^(h−1)) × Z/(P·2^(h−1)) × K and trilinearly interpolated to the original resolution to obtain y_f ∈ R^(X×Y×Z×K). We use a skip connection here to 'remind' the model of the input x. At the cost of a small number of additional parameters, the skip connection effectively enhances our performance. Concretely, we concatenate x and y_f and feed them through a linear layer to predict the output:

ŷ_sem = Linear(Concatenate(y_f, x)) ∈ R^(X×Y×Z×K).

The model is trained using a cross-entropy loss L_CE between ŷ_sem and the ground truth y_sem.
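A minimal PyTorch sketch of this two-linear-layer decoder with the skip connection might look as follows; the class and argument names are hypothetical, and only the tensor shapes follow the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearDecoder3D(nn.Module):
    """Two-linear-layer decoder with a skip connection to the raw input.
    Shapes follow the paper's description; names are illustrative."""
    def __init__(self, token_dim, num_classes, in_chans=4):
        super().__init__()
        self.cls = nn.Linear(token_dim, num_classes)               # token -> class embedding
        self.out = nn.Linear(num_classes + in_chans, num_classes)  # skip-connected head

    def forward(self, tokens, x, dims):
        # tokens: (B, N, token_dim); x: (B, X, Y, Z, 4) raw TSDF+RGB input
        B = tokens.shape[0]
        Dx, Dy, Dz = dims                                      # coarse token grid
        y = self.cls(tokens)                                   # (B, N, K)
        y = y.view(B, Dx, Dy, Dz, -1).permute(0, 4, 1, 2, 3)   # (B, K, Dx, Dy, Dz)
        # Trilinear interpolation back to the full voxel resolution.
        y = F.interpolate(y, size=tuple(x.shape[1:4]),
                          mode='trilinear', align_corners=False)
        y_f = y.permute(0, 2, 3, 4, 1)                         # (B, X, Y, Z, K)
        # Skip connection: concatenate the raw input before the final linear layer.
        return self.out(torch.cat([y_f, x], dim=-1))           # (B, X, Y, Z, K)
```

The skip connection adds only (K + 4) × K weights, which is consistent with the paper's claim that it improves performance at the cost of a small number of parameters.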

4. 2D TO 3D DISTILLATION

Figure 2: Overview of our proposed 2D-to-3D panoramic knowledge distillation. The panorama and the 3D volume are fed into the 2D Segmenter and the 3D Segmenter in parallel, which predict the semantic distributions in 2D and 3D, respectively. We then apply differentiable rendering to the predicted 3D semantic volume to obtain its panoramic projection. Finally, the knowledge is distilled from 2D to the 3D projection using a Kullback-Leibler divergence loss.

As shown in Figure 2, we use pairs of panoramic renderings I_r and 3D rooms r = [r_tsdf, r_rgb] ∈ R^(X′×Y′×Z′×4) as our input for 2D-to-3D distillation. The 2D teacher is pretrained on a large-scale 2D repository and finetuned on the 2D panoramas I_r and the 2D semantic ground truth rendered from the 3D rooms. The 3D Segmenter, as the student model in the distillation, is first trained to convergence using fixed-size block data. Then, the knowledge distillation from the 2D teacher to the 3D student is performed at the scene level. Given an input room r = [r_tsdf, r_rgb] ∈ R^(X′×Y′×Z′×4), we feed it to the student model and obtain a voxel-level semantic segmentation ŷ_r ∈ R^(X′×Y′×Z′×K). The 2D panoramic image could be acquired from a panoramic camera placed at any position that is not occupied by objects; here, we place it at the center of the room for simplicity. To combine the 3D semantic segmentation prediction ŷ_r with the input geometry r_tsdf, a mapping between pixels and voxels is required to project ŷ_r to 2D. The mapping is determined by raycasting. For each pixel in the image, we construct a ray from the viewpoint and march along the ray through r_tsdf. Trilinear interpolation is used to determine TSDF values along the ray. The surface voxel is located when a zero-crossing is detected. In this way, we establish a mapping between each pixel and a voxel. Through this mapping, six square semantic segmentation maps are rendered from the +X, −X, +Y, −Y, +Z, and −Z directions with a 90° field of view.
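The per-pixel zero-crossing search used to build this mapping can be sketched as follows. This is a simplified version that uses nearest-neighbor TSDF lookup and a fixed step size instead of the trilinear interpolation used in the paper; all names and parameters are illustrative.

```python
import numpy as np

def find_surface_voxel(origin, direction, tsdf, voxel_size=0.04,
                       max_dist=10.0, step=0.02):
    """March along a ray through a TSDF volume and return the index of the
    voxel at the first zero-crossing (positive -> non-positive), or None if
    the ray exits the volume without hitting a surface."""
    direction = direction / np.linalg.norm(direction)
    prev_val = None
    for t in np.arange(0.0, max_dist, step):
        pos = origin + t * direction
        idx = np.floor(pos / voxel_size).astype(int)
        # Stop when the ray leaves the volume.
        if np.any(idx < 0) or np.any(idx >= np.array(tsdf.shape)):
            break
        val = tsdf[tuple(idx)]            # nearest-neighbor lookup for brevity
        if prev_val is not None and prev_val > 0 >= val:
            return tuple(idx)             # sign change: surface voxel found
        prev_val = val
    return None
```

Running this search once per panorama pixel (with the ray direction derived from the pixel's viewing direction) yields the pixel-to-voxel mapping through which the 3D prediction is projected to 2D.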
The six views are then stitched into a panorama through pixel mapping and interpolation. The whole process is differentiable, allowing gradients to backpropagate from 2D to 3D. We project a panoramic view of ŷ_r:

StudentSegMap = PanoramicProjecting(r_tsdf, ŷ_r),

where StudentSegMap ∈ R^(H×W×K), and H and W are the height and width of the rendered panorama. After forwarding I_r through the teacher model, we obtain a TeacherSegMap ∈ R^(H×W×K) that contains the teacher model's learned knowledge. Following (Hinton et al., 2015), we formulate a Kullback-Leibler divergence loss to measure the difference between the distributions of StudentSegMap and TeacherSegMap:

L_Dist = τ² · KLDiv(StudentSegMap/τ, TeacherSegMap/τ),

where τ is the distillation temperature. The model is trained using the sum of the 2D-to-3D distillation loss L_Dist and the 3D cross-entropy loss L_CE. To achieve a human-like understanding of the 3D environment, we use professionally designed room layouts from 3D-FRONT (Fu et al., 2021a), furnished with 3D-FUTURE (Fu et al., 2021b), a large-scale collection of 3D CAD shapes with high-resolution, informative textures.
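Assuming the two segmentation maps are pre-softmax logits, the combined training objective can be sketched in PyTorch as follows; function names and reduction choices are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, tau=20.0):
    """Temperature-scaled KL divergence between the rendered student map and
    the teacher map, following Hinton et al. (2015). Both maps are assumed
    to be pre-softmax logits of shape (H, W, K)."""
    s = F.log_softmax(student_logits / tau, dim=-1).flatten(0, -2)  # (H*W, K)
    t = F.softmax(teacher_logits / tau, dim=-1).flatten(0, -2)
    # 'batchmean' averages over pixels; the tau^2 factor keeps gradient
    # magnitudes comparable across temperatures.
    return (tau ** 2) * F.kl_div(s, t, reduction='batchmean')

def total_loss(student_map_2d, teacher_map_2d, pred_3d, target_3d, tau=20.0):
    """Sum of the 2D-to-3D distillation loss L_Dist and the 3D cross-entropy
    loss L_CE. pred_3d: (B, K, X, Y, Z) voxel logits; target_3d: (B, X, Y, Z)."""
    l_dist = distillation_loss(student_map_2d, teacher_map_2d, tau)
    l_ce = F.cross_entropy(pred_3d, target_3d)
    return l_dist + l_ce
```

Because the panoramic projection is differentiable, the gradient of L_Dist flows through StudentSegMap back into the 3D student's voxel predictions.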

5. PANOROOMS3D DATASET

To diversify the generation, we augment each scene with randomly sampled floor/wall textures. To obtain the TSDF representation together with semantic labels, we randomly sample camera poses inside the rooms at heights between 1.4 m and 1.8 m. This process simulates users holding RGB-D cameras to capture indoor scenes, as in (Chang et al., 2017; Dai et al., 2017; Hua et al., 2016; Armeni et al., 2017). RGB-D frames plus semantic frames are rendered at all camera poses and later fused into 3D scenes and semantic ground truth using TSDF fusion. A total of 2,509,873 RGB-D images and corresponding pixel-wise semantic segmentation maps are rendered. Each voxel corresponds to a 4 cm cube. We use 120 NVIDIA Tesla V100 GPUs to render for 48 hours. 3D-FRONT includes 99 semantic categories. In practice, considering the semantic consistency between labels, we consolidate similar labels, merging the original 99 classes into 30 hyper-classes. Since many scenes contain furniture that is left unlabeled, we manually filter out noisy rooms. In total, 5917 TSDF-based living rooms and bedrooms are created. We split 5325 rooms for training, 297 for validation, and 295 for testing.

6. EXPERIMENTS

Our model variants include:
• 3DSeg-M(iddle): D = 64, Swin Layers [3, 6], Num Heads [4, 4], #PARAM 1.55M
• 3DSeg-L(arge): D = 96, Swin Layers [4, 8], Num Heads [6, 6], #PARAM 4.43M
We train our 3D Segmenter on an NVIDIA Ampere A100 80GB, using the Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 and a batch size of 8. It takes 48 hours for our models to converge. For 2D-to-3D knowledge distillation, we conduct the experiments taking a room as a unit. We render the 3D rooms of the train/val/test sets to obtain 2D panoramas and 2D ground-truth segmentations for the train/val/test sets, respectively. The 2D training set is used to finetune the 2D teacher model. The size of the rendered panoramas is set to (H, W) = (512, 1024). We manually filter out noisy renderings, caused by objects that block the view, in the 5325 training rooms.
We use the 5175 panoramas remaining after filtering, paired with their original 3D rooms, for distillation. We choose Segmenter (Strudel et al., 2021) Seg-B-Mask/16 and Seg-L-Mask/16 as the 2D teacher models. Pretrained on ImageNet-21K (Steiner et al., 2021), the models are finetuned on our rendered 2D panorama training set for 200 epochs. Because the scale of the rooms varies, we train/validate the distillation with a batch size of 1. We train the distillation for 45,000 iterations and validate every 1,500 iterations. The distillation temperature τ is set to 20 in all of our experiments. Apart from our PanoRooms3D dataset, we also include the ScanNet (Dai et al., 2017) dataset for baseline comparison. We also conduct a complexity comparison of our four baseline variants and 3D U-Net. We randomly choose 20 scenes to calculate the total inference time. Since some of the scenes are too large to run on GPU with 3D U-Net, we run the inference-time comparison on CPU. The total inference time of our four variants is 43.62 s, 44.79 s, 61.08 s, and 95.89 s from Tiny to Large, and 254.75 s for 3D U-Net.

6.3. ABLATION STUDY

The effectiveness of 2D-to-3D distillation. Our 3D Segmenter is pretrained on fixed-size blocks and then distilled at the room level. We can see from Table 3 that, compared with directly finetuning the model on rooms, our panoramic distillation design steadily improves the mIoU performance.
The necessity of the skip connection. When introducing the decoder of the 3D Segmenter, we use a skip connection to 'remind' our model of the input x. We validate the necessity of this design in this experiment. We can see in Table 4 that adding a skip connection before the output significantly improves the overall performance at the cost of only a small number of parameters.

7. DISCUSSIONS

In this work, we propose the first 2D-to-3D knowledge distillation method, which utilizes data-abundant and pretrain-ready 2D semantic segmentation to improve 3D semantic segmentation. Experiments demonstrate that our technique significantly improves the 3D model over the baseline. We have also introduced the PanoRooms3D dataset, which contains a large variety of indoor scenes with dense 3D volumes and their corresponding panoramic renderings.

REFERENCES

Bolei Zhou, Hang Zhao, Xavier Puig, Sanja Fidler, Adela Barriuso, and Antonio Torralba. Scene parsing through ADE20K dataset. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 633-641, 2017.

A APPENDIX

Limitations. A few limitations are yet to be addressed. First, though the panorama images provide the global context of the environment, the single-camera rendering inevitably suffers from occlusion; a promising direction is to directly leverage the full 2D textures from the surface of the 3D scans, as done in TextureNet (Huang et al., 2019). Second, we currently do not consider the similarity between different models; we leave this as future work.




Figure 1: Overview of our 3D Segmenter pipeline. On the left-hand side is the encoder: the input blocks are partitioned and embedded into a sequence of tokens before flowing through the 3D Swin Transformer. On the right-hand side is the decoder: a linear-layers-only decoder with a skip connection. Receiving a sequence of tokens from the encoder, the decoder maps the class embeddings to per-voxel semantic labels.

Figure 3: A sample from PanoRooms3D: a pair of an input scene and its panoramic rendering.

Figure 4: Qualitative comparison between our variant 3DSeg-L and the convolutional baseline 3D U-Net (Çiçek et al., 2016) on our PanoRooms3D dataset (top two rows) and ScanNet (Dai et al., 2017) (bottom two rows).

Figure 5: Qualitative comparison between our variant 3DSeg-L and the convolutional baseline 3D U-Net (Çiçek et al., 2016) on our PanoRooms3D dataset.

We randomly split the 1513 ScanNet rooms into train/val/test splits of 1361/77/75 rooms, respectively. All rooms are fused with a voxel resolution of 4 cm. All spatial dimensions of the scenes are padded to multiples of 128. For training, the rooms are sliced into 128 × 128 × 128 cubes.

Table 1: Quantitative comparison of our proposed 3D Segmenter with the FCN baseline 3D U-Net (Çiçek et al., 2016), SparseConvNet (Graham et al., 2018a), and VT-UNet (Peiris et al., 2022) on our PanoRooms3D dataset. All metrics are reported in percentages.

Table 2: Quantitative comparison of our proposed 3D Segmenter with the FCN baseline 3D U-Net (Çiçek et al., 2016) on the ScanNet (Dai et al., 2017) dataset. All metrics are reported in percentages.

Our 3D Segmenter outperforms 3D U-Net even with only 3.8% of the parameters. The distillation schemes achieve a higher mIoU score without adding extra parameters, demonstrating the feasibility of training powerful 3D networks with 2D knowledge. Our baseline comparison on ScanNet (Dai et al., 2017) is shown in Table 2. The results show that our proposed 3D Segmenter also achieves the best performance on real-world data.

To control variables and validate the effectiveness of our proposed distillation strategy, we conduct two experiments by finetuning 3DSeg-T and 3DSeg-B at the room level with the 3D cross-entropy loss only.

Table 3: Quantitative ablation study of distillation effectiveness. All metrics are reported in percentages.

Table 4: Quantitative ablation study of the skip connection. All metrics are reported in percentages.

ACKNOWLEDGMENTS

This work is supported by JST Moonshot R&D Grant Number JPMJMS2011 and JST ACT-X Grant Number JPMJAX190D, Japan, and partially supported by the Shanghai Committee of Science and Technology (Grant No. 21DZ1100100).

AVAILABILITY

//github.com/swwzn714

