PRE-TRAINING BY COMPLETING POINT CLOUDS

Abstract

There has recently been a flurry of exciting advances in deep learning models on point clouds. However, these advances have been hampered by the difficulty of creating labelled point cloud datasets: sparse point clouds often have unclear label identities for certain points, while dense point clouds are time-consuming to annotate. Inspired by mask-based pre-training in the natural language processing community, we propose a pre-training mechanism based on point cloud completion. It works by masking occluded points that result from observations at different camera views. It then optimizes a completion model that learns how to reconstruct the occluded points, given the partial point cloud. In this way, our method learns a pre-trained representation that can identify the visual constraints inherently embedded in real-world point clouds. We call our method Occlusion Completion (OcCo). We demonstrate that OcCo learns representations that improve semantic understanding as well as generalization on downstream tasks over prior methods, transfer to different datasets, reduce training time, and improve label efficiency.

1. INTRODUCTION

Point clouds are a natural representation of 3D objects. Recently, there has been a flurry of exciting new point cloud models in areas such as segmentation (Landrieu & Simonovsky, 2018; Yang et al., 2019a; Hu et al., 2020a) and object detection (Zhou & Tuzel, 2018; Lang et al., 2019; Wang et al., 2020b). Current 3D sensing modalities (i.e., 3D scanners, stereo cameras, lidars) have enabled the creation of large repositories of point cloud data (Rusu & Cousins, 2011; Hackel et al., 2017). However, annotating point clouds is challenging because: (1) Point cloud data can be sparse and at low resolution, making the identity of points ambiguous; (2) Datasets that are not sparse can easily reach hundreds of millions of points (e.g., small dense point clouds for object classification (Zhou & Neumann, 2013) and large-scale point clouds for 3D reconstruction (Zolanvari et al., 2019)); (3) Labelling individual points or drawing 3D bounding boxes is more complex and time-consuming than annotating 2D images (Wang et al., 2019a). Since most methods require dense supervision, the lack of annotated point cloud data impedes the development of novel models. On the other hand, because of the rapid development of 3D sensors, unlabelled point cloud datasets are abundant. Recent work has developed unsupervised pre-training methods to learn initializations for point cloud models. These are based on designing novel generative adversarial networks (GANs) (Wu et al., 2016; Han et al., 2019; Achlioptas et al., 2018) and autoencoders (Hassani & Haley, 2019; Li et al., 2018a; Yang et al., 2018). However, completely unsupervised pre-training methods have recently been outperformed by the self-supervised pre-training techniques of (Sauder & Sievers, 2019) and (Alliegro et al., 2020). Both methods work by first voxelizing point clouds, then splitting each axis into k parts, yielding k^3 voxels.
Then, voxels are randomly permuted, and a model is trained to rearrange the permuted voxels back to their original positions. The intuition is that such a model learns the spatial configuration of objects and scenes. However, such random permutation destroys all spatial information that the model could have used to predict the final object point cloud. Our insight is that partial point-cloud masking is a good candidate for pre-training on point clouds for two reasons: (1) The pre-trained model requires spatial and semantic understanding of the input point clouds to be able to reconstruct masked shapes. (2) Mask-based completion tasks have become the de facto standard for learning pre-trained representations in natural language processing (NLP) (Mikolov et al., 2013; Devlin et al., 2018; Peters et al., 2018). Different from random permutations, masking respects the spatial constraints that are naturally encoded in point clouds of real-world objects and scenes. Given this insight, we propose Occlusion Completion (OcCo). Specifically, in (a) point clouds are generated by determining what part of objects would be occluded if the underlying object was observed from a particular view-point. In fact, many point clouds generated from a fixed 3D sensor will have occlusions exactly like this. Given an occluded point cloud, the goal of the completion task (b) is to learn a model that accurately reconstructs the missing parts of the point cloud. For a model to perform this task well, it needs to learn to encode localized structural information, based on the context and geometry of partial objects. This is something that is useful for any point cloud model to know, even if used only for classification or segmentation.

We demonstrate that the weights learned by our pre-training method on a single unsupervised dataset can be used as initialization for models in downstream tasks (e.g., object classification, part and semantic segmentation) to improve them, even on completely different datasets. Specifically, our pre-training technique: (i) leads to improved generalization over prior baselines on the downstream tasks of object classification, object part and scene semantic segmentation; (ii) speeds up model convergence, in some cases by up to 5×; (iii) maintains improvements as the size of the labelled downstream dataset decreases; (iv) can be used for a variety of state-of-the-art point cloud models.

2. OCCLUSION COMPLETION

We now introduce Occlusion Completion (OcCo). Our approach is shown in Figure 1. Our main insight is that by continually occluding point clouds and learning a model c(•) to complete them, the weights of the completion model can be used as initialization for downstream tasks (e.g., classification, segmentation), speeding up training and improving generalization over other initialization techniques. Throughout, we assume point clouds P are sets of points in 3D Euclidean space, P = {p_1, p_2, ..., p_n}, where each point p_i is a vector of coordinates (x_i, y_i, z_i) and features (e.g., color and normals). We begin by describing the components that make up our occlusion mapping o(•). Then we detail how to learn a completion model c(•), giving pseudocode and the architectural details in the appendix. Finally, we discuss criteria for validating the effectiveness of a pre-training model for 3D point clouds.

2.1. GENERATING OCCLUSIONS

We first describe a randomized occlusion mapping o : 𝒫 → 𝒫 (where 𝒫 is the space of all point clouds) from a full point cloud P to an occluded point cloud P̃. We will do so by determining which points are occluded when the point cloud is viewed from a particular camera position. This requires three steps: (1) a projection of the point cloud (in a world reference frame) into the coordinates of a camera reference frame; (2) determining which points are occluded based on the camera view-point; (3) mapping the points back from the camera reference frame to the world reference frame.

Viewing the point cloud from a camera. A camera defines a projection from a 3D world reference frame into a distinct 3D camera reference frame. It does so by specifying a camera model and a camera view-point from which the projection occurs. The simplest camera model is the pinhole camera, for which the view-point projection is given by a simple linear equation:

$$
\begin{bmatrix} x_{cam} \\ y_{cam} \\ z_{cam} \end{bmatrix}
=
\underbrace{\begin{bmatrix} f & \gamma & w/2 \\ 0 & f & h/2 \\ 0 & 0 & 1 \end{bmatrix}}_{\text{intrinsic } K}
\underbrace{\begin{bmatrix} r_1 & r_2 & r_3 & t_1 \\ r_4 & r_5 & r_6 & t_2 \\ r_7 & r_8 & r_9 & t_3 \end{bmatrix}}_{\text{rotation}\,|\,\text{translation } [R|t]}
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}. \quad (1)
$$

In the above, (x, y, z) are the original point cloud coordinates, the matrix with r and t entries is the concatenation of a 3D rotation matrix with a 3D translation vector, and the left-most matrix is the camera intrinsic matrix (f is the camera focal length, γ is the skewness between the x and y axes of the camera, and w, h are the width and height of the camera image). Given these, (x_cam, y_cam, z_cam) are the coordinates of the point in the camera reference frame. We will refer to the intrinsic matrix as K and the rotation/translation matrix as [R|t].

Determining occluded points. We can think of the point (x_cam, y_cam, z_cam) in two ways: (a) as a 3D point in the camera reference frame; (b) as a 2D pixel with coordinates (f·x_cam/z_cam, f·y_cam/z_cam) at a depth of z_cam.
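The projection in eq. (1) can be sketched in a few lines of NumPy. The focal length, image size and camera pose below are illustrative assumptions, not values used in the paper.

```python
# Hedged sketch of the pinhole projection in eq. (1) using NumPy.
import numpy as np

def project_to_camera(points, K, R, t):
    """Map Nx3 world-frame points to camera-frame coordinates via K [R | t]."""
    cam = (R @ points.T).T + t   # apply the extrinsics [R | t]
    return (K @ cam.T).T         # apply the intrinsic matrix K

f, gamma, w, h = 100.0, 0.0, 64.0, 64.0   # assumed camera parameters
K = np.array([[f, gamma, w / 2],
              [0.0,   f, h / 2],
              [0.0, 0.0,   1.0]])
R = np.eye(3)                    # identity rotation, for simplicity
t = np.array([0.0, 0.0, 5.0])    # translate points in front of the camera

P = np.random.rand(1024, 3)      # a toy point cloud
P_cam = project_to_camera(P, K, R, t)
pixels = P_cam[:, :2] / P_cam[:, 2:3]   # 2D pixel coordinates, depth = z_cam
```

The last line makes interpretation (b) explicit: dividing by the depth z_cam yields the 2D pixel coordinates used for the occlusion test.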
In this way, some 2D points resulting from the projection may be occluded by others if they have the same pixel coordinates but appear at a larger depth. To determine which points are occluded, we first use Delaunay triangulation to reconstruct a polygon mesh from the points, and then remove the points that belong to hidden surfaces, as determined via z-buffering.

Mapping back from the camera frame to the world frame. Once occluded points are removed, we re-project the remaining point cloud to the original world reference frame via the inverse linear transformation:

$$
\begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
=
\underbrace{\begin{bmatrix} r_1 & r_2 & r_3 & t_1 \\ r_4 & r_5 & r_6 & t_2 \\ r_7 & r_8 & r_9 & t_3 \\ 0 & 0 & 0 & 1 \end{bmatrix}^{-1}}_{\left[\begin{smallmatrix} R & t \\ 0 & 1 \end{smallmatrix}\right]^{-1}}
\underbrace{\begin{bmatrix} 1/f & -\gamma/f^2 & (\gamma h - f w)/(2f^2) & 0 \\ 0 & 1/f & -h/(2f) & 0 \\ 0 & 0 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}}_{\left[\begin{smallmatrix} K^{-1} & 0 \\ 0 & 1 \end{smallmatrix}\right]}
\begin{bmatrix} x_{cam} \\ y_{cam} \\ z_{cam} \\ 1 \end{bmatrix}. \quad (2)
$$

Our randomized occlusion mapping o(•) is constructed as follows. Fix an initial point cloud P. Given a camera intrinsic matrix K, sample rotation/translation matrices [R_1|t_1], . . . , [R_V|t_V], where V is the number of views. For each view v ∈ [V], project P into the camera frame of that view-point using eq. (1), find occluded points and remove them, then map the remaining points back to the world reference frame using eq. (2). This yields the final occluded world-frame point cloud for view-point v: P̃_v.
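The full pipeline of o(•) can be sketched end to end. In place of the Delaunay mesh and hidden-surface removal used in the paper, the simplification below keeps only the nearest point per pixel (a point-wise z-buffer); all camera parameters are assumptions.

```python
# Minimal sketch of the occlusion mapping o(.): project, remove hidden points
# with a per-pixel depth test, then map the visible points back to the world.
import numpy as np

def occlude(points, K, R, t):
    cam = (K @ ((R @ points.T).T + t).T).T            # eq. (1)
    px = np.floor(cam[:, 0] / cam[:, 2]).astype(int)  # pixel column
    py = np.floor(cam[:, 1] / cam[:, 2]).astype(int)  # pixel row
    nearest = {}                                      # (px, py) -> point index
    for i in range(len(points)):
        key = (px[i], py[i])
        if key not in nearest or cam[i, 2] < cam[nearest[key], 2]:
            nearest[key] = i                          # closer point wins
    visible = sorted(nearest.values())
    back = (np.linalg.inv(K) @ cam[visible].T).T      # eq. (2): undo K,
    return (R.T @ (back - t).T).T                     # then undo [R | t]

K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.array([0.0, 0.0, 5.0])
P = np.random.rand(2048, 3)
P_occ = occlude(P, K, R, t)   # the occluded point cloud, a subset of P
```

Sampling several [R|t] pairs and repeating the call yields one occluded cloud P̃_v per view, exactly as in the construction above.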

2.2. THE COMPLETION TASK

Given an occluded point cloud P̃ produced by o(•), the goal of the completion task is to learn a completion mapping c : 𝒫 → 𝒫 from P̃ to a completed point cloud. We say that a completion mapping is accurate w.r.t. a loss ℓ(•, •) if E_{P̃∼o(P)}[ℓ(c(P̃), P)] → 0. The structure of the completion model c(•) is an "encoder-decoder" network (Dai et al., 2017b; Yuan et al., 2018; Tchapmi et al., 2019; Wang et al., 2020a). The encoder maps an occluded point cloud to a vector, and the decoder reconstructs the full shape. After pre-training, the encoder weights can be used as initialization for downstream tasks. In the appendix we give pseudocode for OcCo and describe the architectures.

3.1. PRE-TRAINING AND DOWNSTREAM TRAINING DETAILS

We evaluate how OcCo improves the learning and generalization of a number of classification and segmentation tasks. Here we describe the details of training in each setting.

OcCo pre-training. For all experiments, we use a single pre-training dataset based on ModelNet40 (Wu et al., 2015). It includes 12,311 synthesized objects from 40 object categories, divided into 9,843 training objects and 2,468 testing objects. To construct the pre-training dataset, we generate occluded point clouds from the training objects using the procedure described in Section 2.1. Concretely, for PCN and PointNet, we use the Adam optimizer with an initial learning rate of 1e-3, decayed by 0.7 every 20 epochs to a minimum value of 1e-5. For DGCNN, we use the SGD optimizer with a momentum of 0.9 and a weight decay of 1e-4. The learning rate starts at 0.1 and is then reduced using cosine annealing (Loshchilov & Hutter, 2017) with a minimum value of 1e-3. We use dropout (Srivastava et al., 2014) in the fully connected layers before the softmax output layer. The dropout rate of PointNet and PCN is set to 0.7, and 0.5 for DGCNN. For all three models, we train for 200 epochs with a batch size of 32. We report results based on three runs.

Part segmentation. We use the ShapeNetPart (Yi et al., 2016) benchmark for object part segmentation. This dataset contains 16,881 objects from 16 categories, with 50 parts in total. Each object is represented with 2,048 points, and we use the same training settings as the original work.

Semantic segmentation. We use the S3DIS benchmark (Armeni et al., 2016) for semantic indoor scene segmentation. It contains 3D scans collected via Matterport scanners in 6 different areas, encompassing 271 rooms. Each point, described by a 9-dimensional vector (coordinates, RGB values and normalised location), is labelled as one of 13 semantic categories (e.g., chair, table and floor). We use the same preprocessing procedures and training settings as the original work.
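The DGCNN learning-rate schedule described above can be written in closed form; the sketch below is the standard cosine-annealing rule with the stated maximum (0.1), minimum (1e-3) and epoch budget (200). It only illustrates the schedule, not the training code.

```python
import math

def cosine_annealed_lr(epoch, lr_max=0.1, lr_min=1e-3, total_epochs=200):
    """Cosine annealing from lr_max down to lr_min over total_epochs."""
    cos_term = math.cos(math.pi * epoch / total_epochs)
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + cos_term)
```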

3.2. WHAT IS LEARNED FROM OCCO PRE-TRAINING

Alongside OcCo's ability to improve learning tasks, we analyze the properties of the pre-trained representation itself. Here we describe the approaches we use for these analyses. Visualisation of learned features. Feature visualisation (Olah et al., 2017) is widely used to qualitatively understand the role of a convolutional neural network unit. It links highly activated parts of a CNN channel with human concepts that have semantic meaning. Ideally, the pre-training process learns disentangled features that are useful for distinguishing different parts of an object or a scene. Such learned features are beneficial not only for few-shot learning, but also for object recognition and part and scene segmentation. Detection of semantic concepts. To quantitatively analyse the features learned by pre-training, we adapt network dissection (Bau et al., 2017; 2020) to determine the number of concept detectors in a pre-trained point cloud feature encoder. Specifically, for the k-th channel, we first create a binary activation mask M_k based on highly activated point subsets. Since point cloud encoders usually learn each point feature either independently or via neighborhood aggregation, the feature maps usually do not change in the vertical direction (see Figure 3). Therefore we can skip the retrieval step and directly quantify the alignment between an activation mask M_k and the n-th concept mask C_n (i.e., object parts) via mean intersection over union (mIoU) over a collection of point clouds D_P:

$$\mathrm{mIoU}^{(k,n)} = \mathbb{E}_{P \sim D_P}\left[\frac{|M_k(P) \cap C_n(P)|}{|M_k(P) \cup C_n(P)|}\right],$$

where |•| is the set cardinality. mIoU^{(k,n)} can be interpreted as how well unit k detects concept n. Structural invariance/equivariance under SO(3) transformations. Pre-training should learn a representation that is robust under rigid SO(3) transformations (i.e., rotation, translation, permutation). Although a single representation might vary after transformation, the cluster structures should be preserved.
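The concept-detector probe can be sketched as follows: build the binary mask M_k from the top-20% activations of channel k, then score it against a part mask C_n with IoU (averaged over D_P to obtain mIoU). The threshold follows the description in the text; the arrays are illustrative.

```python
import numpy as np

def activation_mask(channel_values, top_frac=0.2):
    """M_k: True for the points with the top 20% highest activations."""
    threshold = np.quantile(channel_values, 1.0 - top_frac)
    return channel_values >= threshold

def iou(mask_k, concept_n):
    """|M_k ∩ C_n| / |M_k ∪ C_n| for one point cloud; averaging this over
    the dataset D_P gives mIoU^(k,n)."""
    union = np.logical_or(mask_k, concept_n).sum()
    if union == 0:
        return 0.0
    return np.logical_and(mask_k, concept_n).sum() / union
```

Channel k would then be declared a detector for part n whenever the dataset-averaged IoU exceeds 0.5.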
We use adjusted mutual information (AMI) (Nguyen et al., 2009) between the clustering Ω and the ground-truth labels C, which prevents the score from monotonically increasing as the number of clusters increases:

$$\mathrm{AMI}(\Omega, C) = \frac{I(\Omega; C) - \mathbb{E}[I(\Omega; C)]}{(H(\Omega) + H(C))/2 - \mathbb{E}[I(\Omega; C)]},$$

where Ω is the clustering determined by the learned embeddings Enc(•) and an unsupervised clustering method such as K-means, I(Ω; C) = Σ_k Σ_j P(w_k ∩ c_j) log [P(w_k ∩ c_j) / (P(w_k) P(c_j))] denotes the mutual information, and H(•) is the entropy. AMI has a maximum of 1 when two partitions are identical, and reaches a minimum of 0 when the two partitions are totally uncorrelated. The invariance score is calculated as:

$$L = \mathbb{E}_{P \sim D_P,\, S \sim SO(3)}\left[\mathrm{AMI}(\Omega, \mathrm{Enc}(S(P)))\right].$$

Once we finish pre-training on ModelNet40, we first analyse the learned features and embeddings of the OcCo PointNet via the tests and probes described above. Specifically, we examine the learned concepts of the pre-trained encoders on ShapeNetPart. We assign the activation mask M_k to the points with the top 20% highest values in the k-th unit of the feature, and the n-th concept mask C_n is derived from the ground-truth annotations of the n-th object part. We ignore object parts that have fewer than 100 points. We call the k-th channel a detector of concept n when mIoU^{(k,n)} > 0.5. We analyze the learned embeddings of the pre-trained encoders on ShapeNet10 and ScanObjectNN. We cluster the learned embeddings Enc(P) into Ω with K-means and calculate the AMI w.r.t. the labels. Since the encoders are permutation invariant, here we consider rotation, translation and jittering.
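The quantities entering the AMI above can be computed directly; the sketch below implements I(Ω; C) and H(•) in NumPy for two discrete labelings. The expected-mutual-information correction E[I(Ω; C)] that turns these into AMI uses a hypergeometric null model and is omitted here (it is available, e.g., via sklearn.metrics.adjusted_mutual_info_score).

```python
import numpy as np

def entropy(labels):
    """H(.) of a discrete labeling, in nats."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log(p)))

def mutual_info(omega, c):
    """I(Omega; C) = sum_k sum_j P(w_k, c_j) log[P(w_k, c_j) / (P(w_k) P(c_j))]."""
    mi = 0.0
    for wk in np.unique(omega):
        for cj in np.unique(c):
            p_joint = np.mean((omega == wk) & (c == cj))
            if p_joint > 0:
                p_w, p_c = np.mean(omega == wk), np.mean(c == cj)
                mi += p_joint * np.log(p_joint / (p_w * p_c))
    return float(mi)
```

For identical partitions the mutual information equals the entropy, which gives the AMI numerator its maximal value before the correction.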

3.3. COMPLETION RESULTS AND PROBE TESTS

Visualisation of learned features. In Figure 3, we first visualize the features learned by OcCo PointNet on objects from the test split of ModelNet40. We visualize each learned feature by coloring the points according to their channel values. We find that in the early stages the encoder learns low-level geometric primitives, i.e., planes, cylinders and cones, while later the network recognises more complex shapes like wings, leaves and upper bodies. We use t-SNE on the embeddings of OcCo encoders on ShapeNet10; distinguishable clusters are formed for the different categories.

Number of concept detectors. In Figure 4, we plot the number of detected parts for randomly-, Jigsaw- and OcCo-initialised PointNet (the latter trained for 10 and 50 epochs). We find that, while keeping the previously learned concepts, OcCo helps the encoder progressively detect more object parts as training proceeds. OcCo outperforms prior methods in terms of total detected parts (numbers in legends). We provide visualisations in the appendix.

Structural invariance under transformations. Table 1 reports the adjusted mutual information (AMI) under transformations; we report the mean and std over 10 runs, where 'J', 'T', 'R' stand for jittering, translation and rotation respectively. Each point cloud is represented as a vector, and we use K-means for clustering, where K is set to the number of categories. OcCo pre-training helps the networks learn better embeddings of point cloud objects, especially when they are occluded and contain outlier points (ScanObjectNN).

3.4. FEW-SHOT LEARNING

We use the same setting and train/test split as (Sharma & Kaul, 2020) (cTree), and report the mean and standard deviation across 10 runs. The top half of the table reports results for eight randomly initialized point cloud models, while the bottom half reports results for two models across three pre-training methods. We bold the best results (and those whose standard deviation overlaps the mean of the best result). It is worth mentioning that (Sharma & Kaul, 2020) pre-trained the encoders on both datasets before fine-tuning, while we pre-train only once on ModelNet40. The results show that models pre-trained with OcCo either outperform or have standard deviations that overlap with the best method in 7 out of 8 settings.

3.5. OBJECT CLASSIFICATION RESULTS

We now compare OcCo against prior initialization approaches on object classification tasks. Table 3 compares OcCo-initialization to random (Rand) and (Sauder & Sievers, 2019)'s (Jigsaw) initialization on various object classification datasets across different encoders. "MN40", "ScN10" and "SO15" stand for ModelNet40, ScanNet10 and ScanObjectNN respectively. Recall that OcCo-initialization is pre-trained only on occlusions generated from the train split of ModelNet40. We color blue the best results for each encoder and bold in black the overall best result (and those whose standard deviation overlaps the mean of the best result) for each dataset. OcCo-initialized models outperform all baselines. These results demonstrate that OcCo-initialized models have strong transfer capabilities on out-of-domain datasets. We make more comparisons in the appendix.

3.6. OBJECT PART SEGMENTATION RESULTS

Table 4 compares OcCo-initialization to random and (Sauder & Sievers, 2019)'s (Jigsaw) initialization on the object part segmentation task. OcCo-initialized models outperform or match the others in terms of accuracy and IoU for all three encoders, demonstrating that representations derived from completing occluded ModelNet40 objects improve the performance of part segmentation.

3.7. SEMANTIC SEGMENTATION

Here we compare random, Jigsaw and OcCo initialization on the semantic segmentation task. We follow the same design as PointNet and DGCNN, and use a k-fold train-test procedure as in (Armeni et al., 2016). The results are reported in Table 5. OcCo-initialized models outperform random and Jigsaw-initialized ones, demonstrating that pre-trained representations derived from completing occluded ModelNet40 objects bring improvements when segmenting indoor scenes, which consist of occluded objects.

Under review as a conference paper at ICLR 2021

3.8. LEARNING CURVES

We plot the learning curves for classification and segmentation tasks in Figure 5. We observe that models with OcCo initialization converge faster to better test accuracy than the random and, sometimes, the Jigsaw-initialized models.

4. RELATED WORK

4.1. LEARNING ON POINT CLOUDS

Deep models for point clouds fall broadly into three classes. The first class operates directly on point sets (Yang et al., 2019c; Lin et al., 2019). One well-known method, PointNet, devises a novel neural network that is designed to respect the permutation invariance of point clouds. Each point is independently fed into a multi-layer perceptron, then outputs are aggregated using a permutation-invariant function (e.g., max-pooling) to obtain a global point cloud representation. Another class of methods are convolution-based networks (Hua et al., 2018; Su et al., 2018; Li et al., 2018b; Atzmon et al., 2018; Landrieu & Simonovsky, 2018; Hermosilla et al., 2018; Groh et al., 2018; Rao et al., 2019). These works map point clouds to regular grid structures and extend the classic convolution operator to handle these grid structures. A representative model, PCNN (Atzmon et al., 2018), defines two operators, extension and restriction, for mapping point cloud functions to volumetric functions and vice versa. The third class of models is graph-based networks (Simonovsky & Komodakis, 2017; Wang et al., 2019b; Shen et al., 2018; Wang et al., 2018; Zhang & Rabbat, 2018; Chen et al., 2019). These networks regard each point as a vertex of a graph and generate edges based on spatial information and node similarities. A popular method is DGCNN (Wang et al., 2019b), which introduces a new operation, EdgeConv, to aggregate local features and a graph update module to learn dynamic graph relations from layer to layer. NRS (Cao et al., 2020) uses a neural random subspace method on top of the encoded embeddings to further improve model performance.

4.2. PRE-TRAINING FOR POINT CLOUDS

Pre-training models on unlabelled data has gained popularity recently due to its success on a wide range of tasks, such as natural language understanding (Mikolov et al., 2013; Devlin et al., 2018), object detection (He et al., 2020; Chen et al., 2020) and graph representations (Hu et al., 2020c;d). The representations learned by these pre-trained models can serve as a good initializer for downstream tasks where task-specific annotated samples are scarce. The three most common pre-training objectives for point clouds are based on: (i) generative adversarial networks (GANs), (ii) autoencoders, and (iii) spatial relations (Sauder & Sievers, 2019; Sharma & Kaul, 2020). However, GANs for point clouds are limited to non-point-set inputs, i.e., voxelized representations (Wu et al., 2016), 2D depth images of point clouds (Han et al., 2019), and latent representations from autoencoders (Achlioptas et al., 2018), as sampling point sets from a neural network is non-trivial. Thus these GAN approaches cannot leverage the natural order-invariance of point sets. Autoencoders (Yang et al., 2018; Li et al., 2018a; Hassani & Haley, 2019; Shi et al., 2020) learn to encode point clouds into a latent space before reconstructing them from their latent representation. Similarly, generative models based on normalizing flows (Yang et al., 2019b) and approximate convex decomposition (Gadelha et al., 2020) have been shown to be effective for unsupervised learning on point clouds. However, both GAN- and autoencoder-based pre-training methods have recently been outperformed on downstream tasks by the pre-training technique of Sauder & Sievers (2019), and by Sharma & Kaul (2020) in the few-shot setting. These methods are based on spatial relation reconstruction, which aims to reconstruct point clouds given rearranged point clouds as input.
To this end, Sauder & Sievers (2019) equally split the 3D space into k^3 voxels, rearrange the voxels, and train a model to predict the original voxel label for each point. However, these random permutations destroy all spatial information that the model could have used to predict the true point cloud. Inspired by cover-trees (Beygelzimer et al., 2006), Sharma & Kaul (2020) utilised ball covers for hierarchical partitioning of points, and then train a model to classify each point to its assigned cluster. However, the selection of the ball centroids is somewhat random, and they need to pre-train from scratch for each fine-tuning task. Instead, our method creates spatially realistic occlusions that a completion model learns to reconstruct. As such, this model learns how to naturally encode 3D object shape and contextual information. Recently, PointContrast (Xie et al., 2020b) has used contrastive learning to pre-train indoor segmentation models; our method is more general and transferable in comparison. Point cloud completion (Yuan et al., 2018) has received attention in recent years. Most works aim at achieving a lower reconstruction loss by incorporating 1) a better encoder (Xie et al., 2020a; Huang et al., 2020); 2) a better decoder (Tchapmi et al., 2019; Wen et al., 2020b); 3) cascaded refinement (Wang et al., 2020a); or 4) multi-view consistency (Hu et al., 2020b). Completing 3D shapes for model initialisation has been considered before. Schönberger et al. (2018) used scene completion (Song et al., 2017; Dai et al., 2020; Hou et al., 2020) as an auxiliary task to initialise 3D voxel descriptors for visual localisation. They generated nearly-complete and partial voxelised scenes based on depth images and trained a variational autoencoder for completion. They showed that the pre-trained encoder is more robust under different viewpoints and weather conditions.
We adapt this idea to pre-training for point clouds. We have shown that our initialisation is better than random and prior methods in terms of 1) object understanding; 2) invariance under transformations; and 3) downstream task performance.

5. DISCUSSION

In this work, we have demonstrated why and how Occlusion Completion (OcCo) learns representations of point clouds that are more transformation-invariant, more accurate in few-shot learning, and better in various classification and segmentation fine-tuning tasks, compared to prior work. In the future, it would be interesting to design a completion model that is explicitly aware of the view-point of the occlusion. Such a model would likely converge even faster and require fewer parameters, as this knowledge could act as a stronger inductive bias during learning. In general, we advocate for structuring deep models using graphical constraints as an inductive bias to improve learning.

A DESIGN OF THE COMPLETION MODEL

Previous point completion models (Dai et al., 2017b; Yuan et al., 2018; Tchapmi et al., 2019; Wang et al., 2020a) all use an "encoder-decoder" architecture. The encoder maps a partial point cloud to a vector of a fixed dimension, and the decoder reconstructs the full point cloud. In the OcCo experiments, we exclude the last few MLPs of PointNet and DGCNN, and use the remaining architecture as the encoder to map a partial point cloud into a 1024-dimensional vector. We adapt the folding-based decoder design from PCN, which is a two-stage point cloud generator that produces a coarse and a fine-grained output point cloud (Y_coarse, Y_fine) for each input. We removed all the batch normalisation layers in the folding-based decoder since we find they have a negative effect on the completion process in terms of Chamfer distance loss and convergence speed. Consistent with prior self-supervised learning methods, SimCLR (Chen et al., 2020), MoCo (He et al., 2020) and BYOL (Grill et al., 2020), we find that batch normalisation is important in the encoder yet harmful in our decoder. Also, we find the L2 regularisation in the Adam optimiser is undesirable for completion training but brings improvements on the downstream fine-tuning tasks. Note that the two point clouds P̃ and P need not have the same size. But when calculating the Earth Mover Distance (EMD), P̃ and P are usually required to have the same number of points:

$$\mathrm{EMD}(\tilde{P}, P) = \min_{\phi: \tilde{P} \to P} \frac{1}{|\tilde{P}|} \sum_{x \in \tilde{P}} \|x - \phi(x)\|_2,$$

where φ is a bijection between points in P̃ and P. Note that EMD is not commutative. Since finding the optimal mapping φ is quite time-consuming, we use the approximation of Bertsekas (1985). The loss l of the completion task is an adaptive weighted sum of the coarse and fine generations: l = d_1(Ŷ_coarse, Y_coarse) + α · d_2(Ŷ_fine, Y_fine), where the step-wise trade-off coefficient α grows incrementally during training.
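A minimal NumPy sketch of the combined coarse-plus-fine loss l above, using the symmetric Chamfer distance (CD) for both d_1 and d_2; the point sets and the value of α are illustrative, and the actual training uses batched GPU implementations.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (Nx3) and b (Mx3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def completion_loss(coarse_pred, coarse_gt, fine_pred, fine_gt, alpha):
    """l = d1(coarse) + alpha * d2(fine), with Chamfer distance for both."""
    return chamfer(coarse_pred, coarse_gt) + alpha * chamfer(fine_pred, fine_gt)
```

Unlike EMD, Chamfer distance needs no bijection, so the two point sets may have different sizes, which is why it suits the completion setting.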
In our experiments, we find that even with this approximation, it is still suboptimal to use EMD for d_2, since solving the approximate bijection φ for over 16k point pairs is inefficient. We evaluate both 'EMD+CD' and 'CD+CD' combinations for the loss l. We find that OcCo with the 'EMD+CD' loss achieves performance comparable to 'CD+CD' in the downstream classification tasks, yet takes longer to train. We therefore use 'CD+CD' as the loss function in the OcCo pre-training process described in Section.

We construct ModelNet Occluded using the methods described in Section 2; for ShapeNet Occluded, we directly use the data provided with PCN, whose generation method is similar to, but not exactly the same as, ours. Basic statistics of these two datasets are reported in Table 6. Compared with the ShapeNet Occluded dataset, which was released by PCN and used in all its follow-ups (Tchapmi et al., 2019; Wang et al., 2020a), our occluded ModelNet dataset has more object categories, more view-points and more points per object, and is therefore more challenging. We believe these differences help the encoder models learn a more comprehensive and robust representation that is transferable to downstream tasks. To support this idea, we perform OcCo pre-training on these two datasets respectively, and test their performance on the ModelNet40 and ShapeNet Occluded classification benchmarks. We choose these two datasets for benchmarking because ShapeNet Occluded is out-of-domain data for models pre-trained on ModelNet Occluded, and vice versa; this gives us sufficient information on which occluded dataset should be preferred for OcCo pre-training. The results are shown in Table 7. By visualising objects from ShapeNet Occluded (in Figure 11), we believe this performance deficiency in the downstream fine-tuning of pre-trained models is due to the quality of the generated occluded point clouds (in comparison with our generated dataset, shown in Figure 2). Further, we think our dataset poses a more challenging task for all present completion models.

We first describe the benchmark datasets used for classification in Table 8. To make a comprehensive and convincing comparison, we follow procedures similar to (Achlioptas et al., 2018; Han et al., 2019; Sauder & Sievers, 2019; Wu et al., 2016; Yang et al., 2018), and train a linear Support Vector Machine (SVM) to examine the generalisation of OcCo encoders that are pre-trained on occluded objects from ModelNet40. For all six classification datasets, we fit a linear SVM on the output 1024-dimensional embeddings of the train split and evaluate it on the test split. Since Sauder & Sievers (2019) have already shown that their method outperforms prior work, here we only systematically compare with theirs. We report the results in Table 9; all OcCo models achieve superior results compared to their randomly-initialised counterparts, demonstrating that OcCo pre-training helps generalisation both in-domain and cross-domain.

In this section, we describe how we reproduce the 'Jigsaw' pre-training method from (Sauder & Sievers, 2019). Following their description, we first separate the objects/chopped indoor scenes into 3^3 = 27 small cubes and assign each point a label indicating which small cube it belongs to. We then shuffle all the small cubes and train a model to predict the label of each point. We formulate this task as 27-class semantic segmentation; for details on the data generation and model training, please refer to our released code.
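A minimal sketch of the label generation for this reproduction, assuming axis-aligned cubes over the object's bounding box; `jigsaw_labels_and_shuffle` is a hypothetical helper name, not code from either paper's release:

```python
import numpy as np

def jigsaw_labels_and_shuffle(points: np.ndarray, k: int = 3, seed: int = 0):
    """Assign each point a cube label in [0, k^3) and shuffle the cubes.

    Sketch of the 'Jigsaw' task of Sauder & Sievers (2019): the model
    must predict, per point, which of the k^3 cubes it came from, i.e.
    a k^3-class semantic segmentation problem.
    """
    rng = np.random.default_rng(seed)
    lo, hi = points.min(0), points.max(0)
    # Per-axis cube index in {0, ..., k-1}; clip handles points on the max face.
    idx = np.clip(((points - lo) / (hi - lo + 1e-9) * k).astype(int), 0, k - 1)
    labels = idx[:, 0] * k * k + idx[:, 1] * k + idx[:, 2]
    # Shuffle: translate each cube's points to a randomly permuted cube position.
    perm = rng.permutation(k ** 3)
    cube_size = (hi - lo) / k
    new = perm[labels]
    new_idx = np.stack([new // (k * k), (new // k) % k, new % k], axis=1)
    shuffled = points + (new_idx - idx) * cube_size
    return shuffled, labels
```

The model then receives `shuffled` as input and is trained to predict `labels` point-wise.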

G MORE COMPARISONS

In Table 10, we compare OcCo with the prior point-cloud-specific pre-training method of (Alliegro et al., 2020). Our method obtains the best results in all settings. These results confirm that the inductive bias learned by reconstructing occluded point clouds is stronger than one based on reconstructing permuted clouds (Alliegro et al., 2020; Sauder & Sievers, 2019). Specifically, we believe that because OcCo does not rearrange object parts but instead creates point clouds that resemble real-world 3D sensor occlusions, the initialisation better encodes realistic object shape and context.



We noticed that the randomly-initialised/pre-trained model in (Sauder & Sievers, 2019) (mIoU = 40.3/41.2) did not achieve results similar to the original DGCNN (mIoU = 56.1). They consider a transductive setting that is not directly comparable to ours, so here we stick to the supervised setting and report our reproduced scores. In our implementation, we also provide an alternative that uses grid search to find the optimal set of parameters for an SVM with a Radial Basis Function (RBF) kernel. In this setting, all OcCo pre-trained models outperform the randomly-initialised and Jigsaw pre-trained ones by a large margin as well.
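The SVM evaluation protocol (linear, or RBF with grid-searched parameters) can be sketched with scikit-learn; the hyper-parameter grids below are illustrative assumptions, not the values used in our implementation:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

def fit_svm_on_embeddings(train_emb: np.ndarray, train_y: np.ndarray, use_rbf: bool = True):
    """Fit an SVM on frozen encoder embeddings (e.g. 1024-d OcCo features).

    use_rbf=False gives the plain linear-SVM protocol; use_rbf=True grid
    searches C and gamma for an RBF kernel via cross-validation.
    """
    if use_rbf:
        grid = {"C": [1, 10, 100], "gamma": ["scale", 0.01]}
        clf = GridSearchCV(SVC(kernel="rbf"), grid, cv=3)
    else:
        clf = SVC(kernel="linear")
    clf.fit(train_emb, train_y)
    return clf
```

In the actual protocol, `train_emb` would be the encoder outputs on the train split, and accuracy would be reported on the held-out test split's embeddings.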




Figure 1: OcCo consists of two steps: (a) occlusion o(•) of a point cloud P, based on a random camera view-point, into a partial point cloud P̄, and (b) a model c(•) that completes the occluded point cloud P̄ so that c(P̄) ≈ P. We demonstrate that the completion model c(•) can be used as an initialisation for downstream tasks, leading to faster training and better generalisation over existing methods. In short, OcCo is a self-supervised pre-training method that consists of (a) a mechanism to generate occluded point clouds, and (b) a completion task to reconstruct the occluded point cloud.

Figure 2: Examples of self-occluded objects generated by our method.
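The occlusion step o(•) can be illustrated with a toy z-buffer sketch, a simplified stand-in (the paper generates occlusions with a proper camera model): rotate the cloud to a random view, bin points onto a coarse (x, y) grid, and keep only the point nearest the camera in each cell. The function name `self_occlude`, the grid resolution, and the single-axis random rotation are all illustrative assumptions:

```python
import numpy as np

def self_occlude(points: np.ndarray, grid: int = 32, seed: int = 0) -> np.ndarray:
    """Approximate self-occlusion of an (N,3) cloud from a random view-point."""
    rng = np.random.default_rng(seed)
    # Random rotation about the up-axis as a stand-in for a random camera pose.
    t = rng.uniform(0, 2 * np.pi)
    rot = np.array([[np.cos(t), -np.sin(t), 0.0],
                    [np.sin(t),  np.cos(t), 0.0],
                    [0.0,        0.0,       1.0]])
    p = points @ rot.T
    # Quantise (x, y) into grid cells; z acts as depth towards the camera.
    xy = p[:, :2]
    lo, hi = xy.min(0), xy.max(0)
    cell = np.clip(((xy - lo) / (hi - lo + 1e-9) * grid).astype(int), 0, grid - 1)
    key = cell[:, 0] * grid + cell[:, 1]
    keep: dict[int, int] = {}
    for i, k in enumerate(key):
        if k not in keep or p[i, 2] < p[keep[k], 2]:
            keep[k] = i  # the point closest to the camera wins the cell
    return points[sorted(keep.values())]
```

Points hidden behind nearer surfaces from the sampled view fall out of the result, producing a partial cloud analogous to those in Figure 2.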

Figure 2 shows examples of the resulting occluded point clouds. Given these, we train an "encoder-decoder" style completion model c(•). For encoders, similar to prior completion models (Tchapmi et al., 2019; Wang et al., 2020a; Wen et al., 2020a), we consider PointNet (Qi et al., 2017a), PCN (Yuan et al., 2018) and DGCNN (Wang et al., 2019b). These networks encode an occluded point cloud into a 1024-dimensional vector. We adapted the folding-based decoder from (Yuan et al., 2018) to complete the point clouds in a two-stage procedure. We use the Chamfer Distance (CD) as our loss function d(•, •). We use Adam (Kingma & Ba, 2015) with an initial learning rate of 1e-4, decayed by 0.7 every 10 epochs to a minimum value of 1e-6, for a total of 50 epochs. We use a batch size of 32 and set the momentum in the batch normalisation to 0.9.

Few-shot learning. We use ModelNet40 and Sydney10 (De Deuge et al., 2013) for "K-way N-shot" learning. During training, K classes are randomly selected, and for each class we sample N random samples; the model is then tested on the same K classes. As in Sharma & Kaul (2020), we represent each object with 100 points. We use the same training settings as in the next paragraph.

Object classification. We use three 3D object recognition benchmarks: ModelNet40, ScanNet10 (Qin et al., 2019) and ScanObjectNN (Uy et al., 2019); we describe them in the appendix. All objects are represented with 1024 points. We use the same training settings as the original works.
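The stated optimiser schedule (initial rate 1e-4, multiplied by 0.7 every 10 epochs, floored at 1e-6) can be written as a one-line step decay; `occo_lr` is a hypothetical helper name:

```python
def occo_lr(epoch: int, base: float = 1e-4, decay: float = 0.7,
            step: int = 10, floor: float = 1e-6) -> float:
    """Step-decay learning rate: base * decay^(epoch // step), clipped at floor."""
    return max(base * decay ** (epoch // step), floor)
```

For example, `occo_lr(0)` gives 1e-4 and the rate drops to roughly 7e-5 from epoch 10 onward, never going below 1e-6.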

Figure 3: Visualisation of the learned features and embeddings of OcCo-initialised encoders. The upper half illustrates the location of the learned features within the PointNet architecture (Qi et al., 2017a).

Figure 4: Number of (unique) detected object parts in the feature maps of randomly-, Jigsaw- and OcCo-initialised PointNet. The digit in brackets is the number of parts in that object category.

Figure 5: Learning curves of randomly-, Jigsaw- and OcCo-initialised models; '10%' is the portion of the training data used.

For example, on ModelNet40 with a PCN encoder, the OcCo-initialised model takes around 10 epochs to converge, while the randomly-initialised model takes around 50 epochs. Similarly, for ScanObjectNN with a DGCNN encoder, the OcCo-initialised model converges in around 20 epochs, and to a better test accuracy than the randomly- and Jigsaw-initialised models.

4 RELATED WORK

4.1 DEEP MODELS FOR POINT CLOUDS

Work on deep models for point clouds can largely be divided into three structural approaches: (a) pointwise-based networks, (b) convolution-based networks, and (c) graph-based networks. We call the networks that independently process each point before aggregating these point representations pointwise-based networks (Qi et al., 2017a;b; Joseph-Rivlin et al., 2019; Duan et al., 2019; Zhao

3.1 in terms of simplicity and efficiency.

B QUALITATIVE RESULTS FROM OCCO PRE-TRAINING

In this section, we show some qualitative results of OcCo pre-training by visualising the input, coarse output, fine output and ground truth at different training epochs and with different encoders. In Figures 6, 7 and 8, we notice that the trained completion models are able to complete even difficult occluded shapes such as plants and planes. In Figure 9 we plot some failure cases, possibly due to their complicated fine structures; it is worth mentioning that the completion model can still complete these objects under the same category.

Figure 6: OcCo pre-training with PCN encoder on occluded ModelNet40.

Figure 7: OcCo pre-training with PointNet encoder on occluded ModelNet40.

Figure 8: OcCo pre-training with DGCNN encoder on occluded ModelNet40.

Figure 9: Examples of failed completions during OcCo pre-training.

Figure 11: Examples from ShapeNet Occluded which fail to depict the underlying object shapes.



Overall point prediction accuracy (mAcc) and mean intersection over union (mIoU) on ShapeNetPart. We report the mean and standard error over three runs.



The predicted coarse point cloud Ŷ_coarse, which represents the global geometry of a shape, is generated via a set of fully connected layers. A folding-based generator is then used to predict the local fine structure around each point in Ŷ_coarse, resulting in Ŷ_fine. The folding-based structure has been shown to be good at approximating a smooth surface that reflects the local geometry. During training, Y_coarse and Y_fine are generated by randomly sampling 1024 and 16384 points from the mesh, respectively.

We use either the Chamfer Distance (CD) or the Earth Mover Distance (EMD) as the loss function for the completion model. We use a normalised and symmetric (thus commutative) version of the Chamfer Distance (CD) to quantify the difference between two point clouds P̄ and P:

CD(\bar{P}, P) = \frac{1}{|\bar{P}|} \sum_{x \in \bar{P}} \min_{y \in P} \lVert x - y \rVert_2 + \frac{1}{|P|} \sum_{y \in P} \min_{x \in \bar{P}} \lVert y - x \rVert_2
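For small point sets, the exact EMD bijection can be computed with the Hungarian algorithm; a minimal sketch, assuming SciPy is available (the text above uses the cheaper Bertsekas (1985) approximation instead, since exact matching does not scale to ~16k points):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def emd(p: np.ndarray, q: np.ndarray) -> float:
    """Exact Earth Mover's Distance between two equal-size point sets.

    Solves the optimal bijection phi by minimum-cost bipartite matching
    over pairwise Euclidean distances, then averages the matched costs.
    """
    assert len(p) == len(q), "EMD requires equally sized point sets"
    cost = np.linalg.norm(p[:, None, :] - q[None, :, :], axis=-1)
    rows, cols = linear_sum_assignment(cost)
    return cost[rows, cols].mean()
```

Because the matching is a bijection, `emd` is invariant to reordering the points of either cloud, which is why equal sizes are required.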

Statistics of occluded datasets for OcCo pre-training

Performance of OcCo pre-trained models with different pre-trained datasets

Statistics of classification datasets

linear SVM on the output embeddings from random, Jigsaw and OcCo initialised encoders

Accuracy comparison between OcCo and the prior pre-training baseline of Alliegro et al. (2020) on 3D object recognition benchmarks. ModelNet40-20% means only 20% of the training data are used.

We investigate whether OcCo pre-training can improve the labelled-sample efficiency of downstream tasks. Specifically, we reduce the labelled samples to 1%, 5%, 10% and 20% of the original training set for the ModelNet40 object classification task, and evaluate on the full test set. As shown in Table 11, OcCo-initialised models achieve superior results compared to randomly-initialised models, demonstrating that OcCo with in-domain pre-training improves labelled-sample efficiency.
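A class-balanced way to build the reduced training sets (1%, 5%, 10%, 20%) could look like the following sketch; the text does not specify the exact sampling procedure, so `stratified_subset` is an assumption:

```python
import numpy as np

def stratified_subset(labels: np.ndarray, fraction: float, seed: int = 0) -> np.ndarray:
    """Indices of a class-balanced subset of the training set.

    At least one sample per class is kept so that every category remains
    represented even at very small fractions (e.g. 1%).
    """
    rng = np.random.default_rng(seed)
    picked = []
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        n = max(1, int(round(fraction * len(idx))))
        picked.append(rng.choice(idx, size=n, replace=False))
    return np.concatenate(picked)
```

The returned indices select both the point clouds and their labels, and the full test set is used for evaluation regardless of the training fraction.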

Sample efficiency with randomly-initialized and OcCo-initialized models.

I NETWORKS AND TRAINING SETTINGS OF PCN ENCODER

We sketch the network structures of the PCN encoder and the output layers for downstream tasks in Figure 12. In Table 12 we report the detailed scores on each individual shape category from ShapeNetPart, bolding the best scores for each class. We show that for all three encoders, OcCo initialisation achieves better results on over two thirds of these 15 object classes.

