3D SEGMENTER: 3D TRANSFORMER BASED SEMANTIC SEGMENTATION VIA 2D PANORAMIC DISTILLATION

Abstract

Recently, 2D semantic segmentation has witnessed a significant advancement thanks to the huge amount of 2D image datasets available. Therefore, in this work, we propose the first 2D-to-3D knowledge distillation strategy to enhance 3D semantic segmentation model with knowledge embedded in the latent space of powerful 2D models. Specifically, unlike standard knowledge distillation, where teacher and student models take the same data as input, we use 2D panoramas properly aligned with corresponding 3D rooms to train the teacher network and use the learned knowledge from 2D teacher to guide 3D student. To facilitate our research, we create a large-scale, fine-annotated 3D semantic segmentation benchmark, containing voxel-wise semantic labels and aligned panoramas of 5175 scenes. Based on this benchmark, we propose a 3D volumetric semantic segmentation network, which adapts Video Swin Transformer as backbone and introduces a skip connected linear decoder. Achieving a state-of-the-art performance, our 3D Segmenter is computationally efficient and only requires 3.8% of the parameters compared to the prior art.

1. INTRODUCTION

Semantic segmentation assigns each 2D pixel (Long et al., 2015) or 3D point (Qi et al., 2017a) /voxel (C ¸ic ¸ek et al., 2016) to a separate category label representing the corresponding object class. As a fundamental computer vision technique, semantic segmentation has been widely applied to medical image analysis (Ronneberger et al., 2015) , autonomous driving (Cordts et al., 2016) and robotics (Ainetter & Fraundorfer, 2021) . Most of the existing efforts are invested in 2D settings thanks to great amount of public 2D semantic segmentation datasets (Zhou et al., 2017; Cordts et al., 2016; Nathan Silberman & Fergus, 2012) . Nowadays, the wide availability of consumer 3D sensors also largely promotes the need for 3D semantic segmentation (Han et al., 2020; Choy et al., 2019; Thomas et al., 2019; Graham et al., 2018b; Nekrasov et al., 2021) . We have seen great success in semantic segmentation in 2D images. However, such success does not fully transfer to the 3D domain. One reason is that, given the same scene, processing the 3D data usually requires orders of magnitude more computations than processing a 2D image. E.g. using a volumetric 3D representation, the number of voxels grows O(n 3 ) with the size of the scene. As a result, existing 3D semantic segmentation methods usually need to use smaller receptive fields or shallower networks than 2D models to handle large scenes. This motivates us to facilitate 3D semantic segmentation using 2D models. Specifically, we propose a novel 2D-to-3D knowledge distillation method to enhance 3D semantic segmentation by leveraging the knowledge embedded in a 2D semantic segmentation network. Unlike traditional knowledge distillation approaches, where student and teacher models should take the same input, in our case, the 2D teacher model is pre-trained on a large-scale 2D image repository and finetuned by panoramas rendered from 3D scenes. The panorama image is rendered at the center of the 3D scans with

availability

//github.com/swwzn714

