PRE-TRAINING BY COMPLETING POINT CLOUDS

Abstract

There has recently been a flurry of exciting advances in deep learning models on point clouds. However, these advances have been hampered by the difficulty of creating labelled point cloud datasets: sparse point clouds often have unclear label identities for certain points, while dense point clouds are time-consuming to annotate. Inspired by mask-based pre-training in the natural language processing community, we propose a pre-training mechanism based on point cloud completion. It works by masking occluded points that result from observations at different camera views. It then optimizes a completion model that learns how to reconstruct the occluded points given the partial point cloud. In this way, our method learns a pre-trained representation that can identify the visual constraints inherently embedded in real-world point clouds. We call our method Occlusion Completion (OcCo). We demonstrate that OcCo learns representations that improve semantic understanding as well as generalization on downstream tasks over prior methods, transfer to different datasets, reduce training time, and improve label efficiency.

1. INTRODUCTION

Point clouds are a natural representation of 3D objects. Recently, there has been a flurry of exciting new point cloud models in areas such as segmentation (Landrieu & Simonovsky, 2018; Yang et al., 2019a; Hu et al., 2020a) and object detection (Zhou & Tuzel, 2018; Lang et al., 2019; Wang et al., 2020b). Current 3D sensing modalities (i.e., 3D scanners, stereo cameras, lidars) have enabled the creation of large repositories of point cloud data (Rusu & Cousins, 2011; Hackel et al., 2017). However, annotating point clouds is challenging: (1) point cloud data can be sparse and at low resolution, making the identity of points ambiguous; (2) datasets that are not sparse can easily reach hundreds of millions of points (e.g., small dense point clouds for object classification (Zhou & Neumann, 2013) and vast point clouds for 3D reconstruction (Zolanvari et al., 2019)); (3) labelling individual points or drawing 3D bounding boxes is more complex and time-consuming than annotating 2D images (Wang et al., 2019a). Since most methods require dense supervision, the lack of annotated point cloud data impedes the development of novel models. On the other hand, because of the rapid development of 3D sensors, unlabelled point cloud datasets are abundant. Recent work has developed unsupervised pre-training methods to learn initializations for point cloud models. These are based on designing novel generative adversarial networks (GANs) (Wu et al., 2016; Han et al., 2019; Achlioptas et al., 2018) and autoencoders (Hassani & Haley, 2019; Li et al., 2018a; Yang et al., 2018). However, completely unsupervised pre-training methods have recently been outperformed by the self-supervised pre-training techniques of Sauder & Sievers (2019) and Alliegro et al. (2020). Both methods work by first voxelizing point clouds, then splitting each axis into k parts, yielding k³ voxels.
Then, voxels are randomly permuted, and a model is trained to rearrange the permuted voxels back to their original positions. The intuition is that such a model learns the spatial configuration of objects and scenes. However, such random permutation destroys all spatial information that the model could have used to predict the final object point cloud. Our insight is that partial point-cloud masking is a good candidate for pre-training on point clouds for two reasons: (1) the pre-trained model requires spatial and semantic understanding of the input point clouds to be able to reconstruct masked shapes; (2) mask-based completion tasks have become the de facto standard for learning pre-trained representations in natural language processing (NLP) (Mikolov et al., 2013; Devlin et al., 2018; Peters et al., 2018). Different from random permutations, masking respects the spatial constraints that are naturally encoded in point clouds of real-world objects and scenes. Given this insight, we propose Occlusion Completion (OcCo), a self-supervised pre-training method that consists of (a) a mechanism to generate occluded point clouds, and (b) a completion task to reconstruct the occluded point cloud (Figure 1). Specifically, in (a) point clouds are generated by determining what part of objects would be occluded if the underlying object was observed from a particular view-point. In fact, many point clouds generated from a fixed 3D sensor will have occlusions exactly like this. Given an occluded point cloud, the goal of the completion task (b) is to learn a model that accurately reconstructs the missing parts of the point cloud. For a model to perform this task well, it needs to learn to encode localized structural information, based on the context and geometry of partial objects. This is something that is useful for any point cloud model to know, even if used only for classification or segmentation.

We demonstrate that the weights learned by our pre-training method on a single unsupervised dataset can be used as initialization for models in downstream tasks (e.g., object classification, part and semantic segmentation) to improve them, even on completely different datasets. Specifically, our pre-training technique: (i) leads to improved generalization over prior baselines on the downstream tasks of object classification, object part and scene semantic segmentation; (ii) speeds up model convergence, in some cases by up to 5×; (iii) maintains improvements as the size of the labelled downstream dataset decreases; (iv) can be used for a variety of state-of-the-art point cloud models.

2. OCCLUSION COMPLETION

We now introduce Occlusion Completion (OcCo). Our approach is shown in Figure 1. Our main insight is that by continually occluding point clouds and learning a model c(•) to complete them, the weights of the completion model can be used as initialization for downstream tasks (e.g., classification, segmentation), speeding up training and improving generalization over other initialization techniques. Throughout, we assume point clouds P are sets of points in 3D Euclidean space, P = {p_1, p_2, ..., p_n}, where each point p_i is a vector of coordinates (x_i, y_i, z_i) and features (e.g., color and normal). We begin by describing the components that make up our occlusion mapping o(•). Then we detail how to learn a completion model c(•), giving pseudocode and architectural details in the appendix. Finally, we discuss the criteria for validating the effectiveness of a pre-training model for 3D point clouds.
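Training c(•) so that c(o(P)) ≈ P requires a reconstruction loss between two point sets. As a minimal sketch of what such an objective can look like, the snippet below implements the symmetric Chamfer distance, a common choice for point cloud completion; the paper's exact training objective and architecture are given in its appendix, so treat this as an illustrative stand-in rather than the method's definitive loss.

```python
import numpy as np

def chamfer_distance(pred, gt):
    """Symmetric Chamfer distance between two point sets of shapes
    (N, 3) and (M, 3): for each point, find its nearest neighbour in
    the other set, and average the squared distances in both directions.
    Illustrative sketch of a completion loss, not the paper's exact one."""
    # Pairwise squared distances, shape (N, M).
    d = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)
    # Nearest-neighbour terms in both directions.
    return d.min(axis=1).mean() + d.min(axis=0).mean()
```

A perfect reconstruction gives a loss of zero, and the loss grows with the squared distance between unmatched points, so minimizing it pushes the completed cloud onto the full cloud P.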

2.1. GENERATING OCCLUSIONS

We first describe a randomized occlusion mapping o : 𝒫 → 𝒫 (where 𝒫 is the space of all point clouds) that maps a full point cloud P to an occluded point cloud P̃. We do so by determining which points are occluded when the point cloud is viewed from a particular camera position. This requires three steps: (1) a projection of the point cloud (in a world reference frame) into the coordinates of a camera reference frame; (2) determining which points are occluded based on the camera view-point; (3) mapping the points back from the camera reference frame to the world reference frame.

Viewing the point cloud from a camera. A camera defines a projection from a 3D world reference frame into a distinctive 3D camera reference frame. It does so by specifying a camera model and a camera view-point from which the projection occurs. The simplest camera model is the pinhole camera.
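The three steps above can be sketched with a simple z-buffer heuristic: project the points through a pinhole camera, bin them into an image grid, and keep only the nearest point in each cell. The grid resolution, the look-at-origin camera, and the z-buffer visibility test are assumptions of this sketch; the paper's exact occlusion test may differ.

```python
import numpy as np

def occlude(points, cam_pos, grid=64):
    """Sketch of the occlusion mapping o(.): keep only points visible
    from cam_pos. Assumes the camera looks at the origin, is not on the
    world y-axis, and all points lie in front of it."""
    # Step 1: world frame -> camera frame (camera looks down +z at origin).
    fwd = -cam_pos / np.linalg.norm(cam_pos)
    right = np.cross([0.0, 1.0, 0.0], fwd)
    right /= np.linalg.norm(right)          # fails if cam_pos is on the y-axis
    up = np.cross(fwd, right)
    R = np.stack([right, up, fwd])          # rows are the camera axes
    pc = (points - cam_pos) @ R.T

    # Step 2: pinhole projection, then a per-cell depth test (z-buffer).
    uv = pc[:, :2] / pc[:, 2:3]             # perspective divide
    ij = ((uv - uv.min(0)) / (np.ptp(uv, axis=0) + 1e-9) * (grid - 1)).astype(int)
    cell = ij[:, 0] * grid + ij[:, 1]
    visible = np.zeros(len(points), dtype=bool)
    for c in np.unique(cell):
        idx = np.where(cell == c)[0]
        visible[idx[np.argmin(pc[idx, 2])]] = True  # nearest point wins

    # Step 3: return the visible points, still in the world frame.
    return points[visible]
```

For example, of two points lying on the same camera ray, only the one nearer the camera survives, while points projecting to different image cells are all kept.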




Figure 1: OcCo consists of two steps: (a) occlusion o(•) of a point cloud P based on a random camera view-point, yielding a partial point cloud P̃, and (b) a model c(•) that completes the occluded point cloud P̃ so that c(P̃) ≈ P. We demonstrate that the completion model c(•) can be used as initialization for downstream tasks, leading to faster training and better generalization over existing methods.

