3D SEGMENTER: 3D TRANSFORMER BASED SEMANTIC SEGMENTATION VIA 2D PANORAMIC DISTILLATION

Abstract

Recently, 2D semantic segmentation has witnessed significant advances thanks to the huge amount of available 2D image data. In this work, we propose the first 2D-to-3D knowledge distillation strategy to enhance 3D semantic segmentation models with knowledge embedded in the latent space of powerful 2D models. Specifically, unlike standard knowledge distillation, where teacher and student models take the same data as input, we train the teacher network on 2D panoramas properly aligned with the corresponding 3D rooms, and use the knowledge learned by the 2D teacher to guide the 3D student. To facilitate our research, we create a large-scale, finely annotated 3D semantic segmentation benchmark containing voxel-wise semantic labels and aligned panoramas of 5175 scenes. Based on this benchmark, we propose a 3D volumetric semantic segmentation network that adapts Video Swin Transformer as its backbone and introduces a skip-connected linear decoder. Achieving state-of-the-art performance, our 3D Segmenter is computationally efficient and requires only 3.8% of the parameters of the prior art.

1. INTRODUCTION

Semantic segmentation assigns each 2D pixel (Long et al., 2015) or 3D point (Qi et al., 2017a) / voxel (Çiçek et al., 2016) a category label representing the corresponding object class. As a fundamental computer vision technique, semantic segmentation has been widely applied to medical image analysis (Ronneberger et al., 2015), autonomous driving (Cordts et al., 2016), and robotics (Ainetter & Fraundorfer, 2021). Most existing efforts target the 2D setting thanks to the large number of public 2D semantic segmentation datasets (Zhou et al., 2017; Cordts et al., 2016; Nathan Silberman & Fergus, 2012). Nowadays, the wide availability of consumer 3D sensors also greatly promotes the need for 3D semantic segmentation (Han et al., 2020; Choy et al., 2019; Thomas et al., 2019; Graham et al., 2018b; Nekrasov et al., 2021).

We have seen great success in 2D semantic segmentation. However, this success does not fully transfer to the 3D domain. One reason is that, for the same scene, processing 3D data usually requires orders of magnitude more computation than processing a 2D image: with a volumetric 3D representation, for example, the number of voxels grows as O(n^3) with the size of the scene. As a result, existing 3D semantic segmentation methods usually need smaller receptive fields or shallower networks than 2D models to handle large scenes. This motivates us to facilitate 3D semantic segmentation using 2D models. Specifically, we propose a novel 2D-to-3D knowledge distillation method that enhances 3D semantic segmentation by leveraging the knowledge embedded in a 2D semantic segmentation network. Unlike traditional knowledge distillation approaches, where student and teacher models take the same input, in our case the 2D teacher model is pre-trained on a large-scale 2D image repository and fine-tuned on panoramas rendered from 3D scenes.
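To make the cubic growth concrete, a quick back-of-envelope calculation (the resolutions and channel count below are illustrative choices, not values from the paper):

```python
# Illustrative only: memory for a dense float32 feature volume with C channels,
# versus a 2D feature map at the same per-axis resolution n.
def dense_feature_bytes_3d(n, channels=32, bytes_per_elem=4):
    """Bytes for an n x n x n feature volume: grows as O(n^3)."""
    return n ** 3 * channels * bytes_per_elem

def dense_feature_bytes_2d(n, channels=32, bytes_per_elem=4):
    """Bytes for an n x n feature map: grows as O(n^2)."""
    return n ** 2 * channels * bytes_per_elem

for n in (64, 128, 256):
    gb3d = dense_feature_bytes_3d(n) / 1e9
    mb2d = dense_feature_bytes_2d(n) / 1e6
    print(f"n={n}: 3D volume {gb3d:.2f} GB vs 2D map {mb2d:.2f} MB")
```

Doubling the resolution multiplies the 3D footprint by 8 but the 2D footprint only by 4, which is why 3D networks are forced toward shallower architectures or smaller receptive fields.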
Each panorama is rendered at the center of the 3D scan with a 360° field of view, i.e., it contains the global context of the environment. Our 3D student model takes the 3D scene as input and outputs a 3D volumetric semantic segmentation. Through differentiable panoramic rendering of the 3D semantic segmentation prediction, we obtain the mapping from the 2D teacher's prediction to the 3D student's prediction. We then transfer the class distribution of each pixel produced by the 2D teacher to the corresponding voxel of the 3D student. To the best of our knowledge, this is the first solution that distills from a pre-trained 2D model to enhance a 3D computer vision task.

To facilitate this 2D-to-3D knowledge distillation design, we create a large-scale indoor dataset called PanoRooms3D. PanoRooms3D contains 5917 rooms diversely furnished with 3D CAD objects and augmented with randomly sampled floor and wall textures. We manually filter out scenes that contain unlabeled furniture. For each room, we prepare a dense 3D volume and a corresponding 2D panorama rendered at the center of the room with a 360° field of view, both with clean semantic labels. Once the panorama-to-volume correspondence is established, knowledge distillation from 2D to 3D is enabled.

As our 3D student backbone, we propose a novel efficient architecture, 3D Segmenter, for semantic segmentation of 3D volumetric data. Inspired by Video Swin Transformer (Liu et al., 2022a), 3D Segmenter employs a pure-transformer structure with an efficient linear decoder. We demonstrate that our 3D Segmenter achieves superior performance while requiring only 3.8% of the parameters of prior art that uses 3D convolutional backbones.

Our contributions are three-fold:

• We propose the first 2D-to-3D knowledge distillation method, which exploits data-abundant, pretrain-ready 2D semantic segmentation to improve 3D semantic segmentation.
• We propose PanoRooms3D, a large-scale 3D volumetric dataset with clean voxel-wise semantic labels and well-aligned 2D panoramic renderings with pixel-wise semantic labels.

• We propose a novel, efficient pure-transformer network for 3D semantic segmentation. Our 3D Segmenter outperforms prior art that uses a 3D convolutional backbone with only 3.8% of the parameters.

We show through experiments that our baseline variants already achieve state-of-the-art performance, and the knowledge distilled from 2D further widens our lead. Considering the difficulty of 3D data collection and the cubic memory consumption of 3D models, our approach paves the way to bridging 2D and 3D vision tasks.
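The pixel-to-voxel transfer described above can be sketched as a temperature-scaled KL distillation term. This is a minimal illustration, not the paper's implementation: the function name and the precomputed `pixel_to_voxel` index map (assumed to come from the differentiable panoramic renderer) are our assumptions.

```python
import torch
import torch.nn.functional as F

def panoramic_distillation_loss(student_voxel_logits, teacher_pixel_logits,
                                pixel_to_voxel, temperature=2.0):
    """Hypothetical sketch of 2D-to-3D distillation via panorama correspondence.

    student_voxel_logits: (V, C) logits for the V voxels of the 3D student.
    teacher_pixel_logits: (P, C) logits from the 2D panorama teacher.
    pixel_to_voxel:       (P,) long tensor mapping each panorama pixel to the
                          voxel it sees (assumed precomputed by the renderer).
    """
    # Gather the student prediction corresponding to each teacher pixel.
    matched_student = student_voxel_logits[pixel_to_voxel]            # (P, C)
    # Soften both distributions with a temperature, as in standard KD.
    teacher_prob = F.softmax(teacher_pixel_logits / temperature, dim=-1)
    student_logp = F.log_softmax(matched_student / temperature, dim=-1)
    # KL divergence transfers the teacher's per-pixel class distribution
    # to the corresponding voxel of the student.
    return F.kl_div(student_logp, teacher_prob,
                    reduction="batchmean") * temperature ** 2
```

In practice this term would be added to the standard voxel-wise cross-entropy against ground-truth labels; the loss vanishes exactly when the student's rendered distribution matches the teacher's.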

2. RELATED WORK

2.1. 2D SEMANTIC SEGMENTATION

The success of deep convolutional neural networks (Simonyan & Zisserman, 2014; He et al., 2016) for object classification led researchers to explore dense prediction problems. Encoder-decoder architectures based on Fully Convolutional Networks (FCN) (Long et al., 2015) have dominated semantic segmentation research since 2015, and later works extend FCN along different facets. Follow-up approaches (Badrinarayanan et al., 2017; Lin et al., 2017; Pohlen et al., 2017; Ronneberger et al., 2015; Zhao et al., 2017) leverage multi-scale feature aggregation, while others (Fu et al., 2019; Yin et al., 2020; Yu et al., 2020; Yuan et al., 2018; Zhao et al., 2018) apply attention mechanisms to model long-range dependencies. Recently, ConvNeXt (Liu et al., 2022b) modernizes CNNs to achieve higher performance than transformers, suggesting the continued effectiveness of convolutional methods. On the other hand, ViT (Dosovitskiy et al., 2020) successfully introduced the Transformer (Vaswani et al., 2017) to computer vision. Later, SETR (Zheng et al., 2021) demonstrates the feasibility of Transformer-based semantic segmentation. PVT (Pyramid Vision Transformer) (Wang et al., 2021) combines pyramid structures with ViT for dense prediction. SegFormer (Xie et al., 2021) proposes a hierarchical Transformer encoder and a lightweight MLP decoder to fuse multi-level features and predict the semantic segmentation mask. Using vanilla ViT and DeiT (Touvron et al., 2021) as backbones, Segmenter (Strudel et al., 2021) designed a trainable

Availability: //github.com/swwzn714

