PRE-TRAINING BY COMPLETING POINT CLOUDS

Abstract

There has recently been a flurry of exciting advances in deep learning models on point clouds. However, these advances have been hampered by the difficulty of creating labelled point cloud datasets: sparse point clouds often have unclear label identities for certain points, while dense point clouds are time-consuming to annotate. Inspired by mask-based pre-training in the natural language processing community, we propose a pre-training mechanism based on point cloud completion. It works by masking occluded points that result from observations at different camera views. It then optimizes a completion model that learns how to reconstruct the occluded points, given the partial point cloud. In this way, our method learns a pre-trained representation that can identify the visual constraints inherently embedded in real-world point clouds. We call our method Occlusion Completion (OcCo). We demonstrate that OcCo learns representations that improve semantic understanding as well as generalization on downstream tasks over prior methods, transfer across datasets, reduce training time, and improve label efficiency.

1. INTRODUCTION

Point clouds are a natural representation of 3D objects. Recently, there has been a flurry of exciting new point cloud models in areas such as segmentation (Landrieu & Simonovsky, 2018; Yang et al., 2019a; Hu et al., 2020a) and object detection (Zhou & Tuzel, 2018; Lang et al., 2019; Wang et al., 2020b). Current 3D sensing modalities (i.e., 3D scanners, stereo cameras, lidars) have enabled the creation of large repositories of point cloud data (Rusu & Cousins, 2011; Hackel et al., 2017). However, annotating point clouds is challenging because: (1) point cloud data can be sparse and at low resolution, making the identity of points ambiguous; (2) datasets that are not sparse can easily reach hundreds of millions of points (e.g., small dense point clouds for object classification (Zhou & Neumann, 2013) and vast point clouds for 3D reconstruction (Zolanvari et al., 2019)); (3) labelling individual points or drawing 3D bounding boxes is more complex and time-consuming than annotating 2D images (Wang et al., 2019a). Since most methods require dense supervision, the lack of annotated point cloud data impedes the development of novel models. On the other hand, owing to the rapid development of 3D sensors, unlabelled point cloud datasets are abundant. Recent work has developed unsupervised pre-training methods that learn initializations for point cloud models, based on designing novel generative adversarial networks (GANs) (Wu et al., 2016; Han et al., 2019; Achlioptas et al., 2018) and autoencoders (Hassani & Haley, 2019; Li et al., 2018a; Yang et al., 2018). However, these fully unsupervised pre-training methods have recently been outperformed by the self-supervised techniques of Sauder & Sievers (2019) and Alliegro et al. (2020). Both methods work by first voxelizing point clouds, splitting each axis into k parts and yielding k³ voxels.
Then, the voxels are randomly permuted, and a model is trained to rearrange the permuted voxels back to their original positions. The intuition is that such a model learns the spatial configuration of objects and scenes. However, this random permutation destroys all of the spatial information that the model could otherwise have used to predict the final object point cloud. Our insight is that partial point cloud masking is a good candidate for pre-training on point clouds, for two reasons: (1) the pre-trained model requires spatial and semantic understanding of the input point clouds to reconstruct the masked shapes; (2) mask-based completion tasks have become the de facto standard for learning pre-trained representations in natural language processing (NLP) (Mikolov et al., 2013; Devlin et al., 2018; Peters et al., 2018). Unlike random permutation, masking respects the spatial constraints that are naturally encoded in point clouds of real-world objects and scenes. Given this insight, we propose Occlusion Completion (OcCo).
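For concreteness, the permute-and-rearrange pre-training task described above can be sketched in a few lines of NumPy. This is a simplified illustration, not the original authors' code: points are binned into k³ voxels, the voxels are shuffled, and the self-supervised target for each point is the index of the voxel it originally belonged to.

```python
import numpy as np

def jigsaw_task(points, k=3):
    """Simplified sketch of voxel permute-and-rearrange pre-training.

    points: (N, 3) array. Returns (shuffled_points, labels), where
    labels[i] is point i's original voxel index in [0, k**3).
    Function and variable names here are illustrative only."""
    lo, hi = points.min(0), points.max(0)
    # Assign each point to one of k cells per axis.
    idx = np.minimum((points - lo) / (hi - lo + 1e-9) * k, k - 1).astype(int)
    labels = idx[:, 0] * k * k + idx[:, 1] * k + idx[:, 2]
    perm = np.random.permutation(k ** 3)          # random voxel shuffle
    # Decompose the permuted voxel id back into per-axis cell indices.
    new_idx = np.stack([perm[labels] // (k * k),
                        (perm[labels] // k) % k,
                        perm[labels] % k], axis=1)
    # Translate each point from its original cell to its permuted cell.
    shuffled = points + (new_idx - idx) * (hi - lo) / k
    return shuffled, labels
```

A model pre-trained this way receives `shuffled` as input and predicts `labels` per point, which is why random permutation is the part that discards spatial structure: the target depends only on voxel membership, not on where the voxel ends up.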
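The view-based occlusion that OcCo uses to mask points can be approximated with a coarse depth buffer: project the cloud onto a screen grid for a chosen camera direction and keep only the nearest point per grid cell. The sketch below is a simplified stand-in for this idea under assumed names (`occlude`, `cam_dir`, `grid` are illustrative, not from the paper's code).

```python
import numpy as np

def occlude(points, cam_dir, grid=32):
    """Keep only the points visible from direction cam_dir.

    points: (N, 3) array; cam_dir: 3-vector view direction.
    A coarse depth-buffer approximation of view-based occlusion."""
    # Orthonormal basis with the view direction as the depth axis.
    z = cam_dir / np.linalg.norm(cam_dir)
    x = np.cross(z, [0.0, 0.0, 1.0])
    if np.linalg.norm(x) < 1e-6:              # viewing along world z-axis
        x = np.array([1.0, 0.0, 0.0])
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    uv = points @ np.stack([x, y]).T          # screen-plane coordinates
    depth = points @ z                        # distance along the view axis
    # Quantize screen coordinates and keep the nearest point per cell.
    lo, hi = uv.min(0), uv.max(0)
    cell = np.floor((uv - lo) / (hi - lo + 1e-9) * grid).astype(int)
    key = cell[:, 0] * grid + cell[:, 1]
    visible = np.zeros(len(points), dtype=bool)
    seen = set()
    for i in np.argsort(depth):               # nearest points first
        if key[i] not in seen:
            seen.add(key[i])
            visible[i] = True
    return points[visible]
```

A completion model would then be trained to reconstruct the full cloud from `occlude(points, cam_dir)` for randomly sampled camera directions; unlike voxel permutation, the visible subset preserves the object's true spatial layout.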

