BAYESIAN META-LEARNING FOR FEW-SHOT 3D SHAPE COMPLETION

Abstract

Estimating the 3D shape of real-world objects is a key perceptual challenge. It requires going from partial observations, which are often too sparse to be comprehensible to the human eye, to detailed shape representations that vary significantly across categories and instances. We propose to cast shape completion as a Bayesian meta-learning problem, so that knowledge gained from observing one object can be transferred to estimating the shape of another. To enable the learning of object shapes from sparse point clouds, we introduce an encoder that describes the posterior distribution of a latent representation conditioned on the sparse cloud. By isolating object-specific properties from object-agnostic properties, our meta-learning algorithm enables accurate shape completion of newly encountered objects from sparse observations. We demonstrate the efficacy of the proposed method with experimental results on the standard ShapeNet and ICL-NUIM benchmarks.

1. INTRODUCTION

The task of estimating 3D geometry from sparse observations, commonly referred to as shape completion, is a key perceptual challenge and an integral part of many mission-critical problems, including robotics (Varley et al., 2017) and autonomous driving (Giancola et al., 2019; Stutz & Geiger, 2018). Recently, a series of methods (Mescheder et al., 2019; Park et al., 2019) have achieved great success by using the observations to infer the parameters of an implicit 3D geometric representation of the target object. However, with some notable exceptions (Yuan et al., 2018), such methods require relatively dense observations to achieve high accuracy, which is usually impractical in real situations. In this paper we introduce a novel methodology that enables state-of-the-art shape completion of previously unseen objects from highly sparse observations. Our insight comes from the following simple intuition: "Can we leverage the geometric information available in one object to improve shape completion results on another target object?" Meta-learning is an emerging field of machine learning that serves this very purpose. By training a model on multiple inter-related tasks, it learns how to learn new tasks efficiently from a small number of observations. Recently proposed meta-learning methods often achieve this by parameterizing the input-output relationship with task-specific latent variables and training a separate, task-agnostic model/mechanism that can infer these task-specific variables from sparse observations of the target task (Chang et al., 2015; Finn et al., 2017; Garnelo et al., 2018). We can cast the shape completion problem as a Bayesian meta-learning problem by treating each object as a task and its sparse observations as the corresponding contextual dataset.
In popular Bayesian variants of meta-learning (Edwards & Storkey, 2017; Eslami et al., 2018; Garnelo et al., 2018), the task-specific latent variables are treated as random variables, and the aforementioned task-agnostic model (i.e., the encoder) is represented as a posterior distribution of the latent variables conditioned on sparse observations. In this study, we combine probabilistic meta-learning with recent shape completion methods that represent the geometry of a given object with implicit parameters, such as the parameters of a signed distance function (SDF). By training an encoder that computes the posterior distribution of these implicit parameters conditioned on sparse observations, we develop a framework that enables the few-shot learning of implicit geometric functions. Under appropriate regularity conditions, computing the correct posterior distribution leads to optimal prediction in the sense of Bayes risk (Maeda et al., 2020). Our proposed approach is a natural extension of many implicit approaches, in the sense that it introduces an additional encoder function that represents the posterior distribution of the geometry-describing implicit parameters. More specifically, we build upon the Bayesian approach of (Maeda et al., 2020), whose posterior estimate behaves asymptotically well with respect to the size of the contextual dataset, and combine their method with Implicit Geometric Regularization (IGR) (Gropp et al., 2020). We use IGR as the baseline, and demonstrate the efficacy of our method on two benchmark datasets (ShapeNet and ICL-NUIM), especially when the observations are very sparse.

2. RELATED WORK

2.1 3D DECODING REPRESENTATIONS

Unlike images, which contain a clear pixel-based structural pattern, there is no unified representation for 3D object reconstruction that is both computationally and memory efficient. In terms of the 3D representation used, existing methods can be broadly divided into the following categories. Voxel-based methods are a generalization of 2D pixels into 3D space, and thus constitute a natural extension of classical image-based methods. Early works focused on 3D convolutions operating on dense grids (Choy et al., 2016) to generate an occupancy function that determines whether each cell is inside an object or not; however, these were limited to relatively small resolutions. To address the high memory requirements of dense voxel grids, various works have proposed 3D reconstruction in a multi-resolution fashion (Häne et al., 2017), with the added complexity of requiring multiple passes to generate the final output. More recently, OccNet (Mescheder et al., 2019) proposed encoding a 3D description of the output at infinite resolution, and showed that this representation can be learned from different sensor modalities. Signed distance methods are an alternative to occupancy functions, where instead of the occupancy state we learn a function describing the signed distance to the object surface (Dai et al., 2017; Stutz & Geiger, 2018). This approach builds upon earlier fusion methods that utilize a truncated signed distance function (SDF) introduced in (Curless & Levoy, 1996). DeepSDF (Park et al., 2019) represents 3D space as a continuous volumetric field, and requires at training time the ground-truth SDF calculated from dense input data using numerical methods. Implicit Geometric Regularization (IGR) (Gropp et al., 2020) is an SDF variant that uses Eikonal regularization, thus enforcing that the output of the decoder is the SDF of "some" surface.
This is an effective way of mitigating the impact of outliers in the final generated surface, and it is used as the starting point for our proposed meta-learning approach to shape completion. Point-based methods directly output points located on the object surface, thus eliminating the need for a dense representation of the 3D space. Earlier works such as PointNet (Charles et al., 2017; Qi et al., 2017) combined fully connected networks with a symmetric aggregation function, thus achieving permutation invariance and robustness to perturbations. (Fan et al., 2017b) introduced point clouds as a viable output representation for 3D reconstruction, and (Yang et al., 2017) proposed a decoder design that approximates a 3D surface as the deformation of a 2D plane. The Point Completion Network (PCN) (Yuan et al., 2018) is a recent architecture that enables the generation of coarse-to-fine shapes while maintaining a small number of parameters. However, a common limitation of all these methods is that they do not describe topology, and thus are not suitable for the generation of watertight surfaces. Also, to change the number of output points, methods like PCN have to re-train their networks entirely, while SDF-based methods learn the geometry in an implicit form and can thus generate any number of points. Mesh-based methods represent classes of similarly shaped objects in terms of a predetermined set of template meshes. First attempts focused on graph convolutions along the mesh's vertices and edges (Guo et al., 2015), and more recently meshes have been used as a direct output representation for 3D reconstruction (Kanazawa et al., 2018). These methods, however, are only able to generate meshes with simple topologies (Wang et al., 2018), require a reference template from the same object class (Kanazawa et al., 2018), and cannot guarantee watertight closed surfaces (Groueix et al., 2018).
A learnable extension of the Marching Cubes algorithm (Lorensen & Cline, 1987) has been proposed in (Liao et al., 2018); however, this approach is limited by the memory requirements of the underlying 3D voxel grid.

2.2. ENCODING AND DECODING MECHANISM

In shape completion, one must construct the geometry of an object from point clouds of varying size that are scattered over various regions. Thus, all methods use some mechanism to encode the sparse point cloud into a tensor of fixed size, and to decode this tensor to produce the final output. Existing methods differ in the design of these encoding and decoding mechanisms. Methods like DeepSDF and IGR train an auto-decoder, and do not train separate encoder functions at all. To find the geometry-describing implicit parameters for each object, these methods apply likelihood-based gradient descent to randomly initialized latent variables, using just the sparse observations from the target object. Thus, by design, these methods do not use observations from multiple objects to train an object-agnostic mechanism that can efficiently learn the latent variables. Meanwhile, PCN and OccNet both train a complex encoder function that maps sparse observations to latent variables. Although their decoders differ entirely (PCN directly outputs 3D points, while OccNet outputs binary values), they both use an encoder that aims to capture the hierarchical structure of the object's geometry. More specifically, PCN's encoder is equipped with a mechanism to extract information from sparse observations in two steps, one aimed at extracting global information and the other at extracting local information. On the other hand, OccNet's encoder is a version of PointNet (Charles et al., 2017) that uses max pooling with respect to the sparse observation set. Our method also uses an encoder function; however, it differs from those mentioned above in that it represents a posterior distribution rather than a deterministic function. Under an appropriate set of regularity conditions, inference made from the predictions of the posterior distribution is optimal in terms of the Bayes risk (Maeda et al., 2020).
Furthermore, because our encoder is probabilistic, it is capable of outputting multiple candidate shapes for a given sparse observation.

3. META-SHAPE COMPLETION

3.1 PROBLEM SETTING

Let $D_k = \{x^{(k)}_n \in \mathbb{R}^3 \mid n = 1, \dots, N_k\}$ be an arbitrary set of points located on the surface of a 3D object $k$. The goal of shape completion for object $k$ is to use $D_k$ to infer its surface. There are various approaches to this problem: if the surface is closed, one can use an occupancy function $F_{\mathrm{occ}}(x)$ that tells whether a point $x$ is inside or outside the surface. One may also use a signed distance function $F_{\mathrm{sdf}}(x)$ that evaluates how far the point $x$ is from the surface, with positive values indicating that it is outside and negative values indicating that it is inside. In both approaches, the goal translates to using $D_k$ to find the object-$k$-specific function $F_k$ that best describes its geometry. Our proposed method focuses on the latter approach, and in the following sections we show how Bayesian meta-learning can be used to learn a probabilistic SDF for shape completion.
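To make the sign convention concrete, here is a minimal, illustrative SDF for a sphere (the `sdf_sphere` helper is ours, not part of the proposed method): points outside the surface evaluate to positive values, points inside to negative values, and surface points to zero.

```python
import math

def sdf_sphere(x, center=(0.0, 0.0, 0.0), radius=1.0):
    # Signed distance to a sphere: Euclidean distance to the center minus the
    # radius, so the sign encodes inside (<0) / outside (>0) / surface (=0).
    return math.dist(x, center) - radius

print(sdf_sphere((2.0, 0.0, 0.0)))  # 1.0  (outside)
print(sdf_sphere((0.0, 0.0, 0.0)))  # -1.0 (inside)
print(sdf_sphere((1.0, 0.0, 0.0)))  # 0.0  (on the surface)
```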

3.2. META-LEARNING FOR SHAPE COMPLETION

Meta-learning exploits underlying similarities among tasks to enable the transfer of knowledge between tasks, so that information gained from solving one problem can improve performance on another. Traditional meta-learning methods parameterize their models with two types of parameters: task-specific and task-agnostic. Following this basic philosophy, we represent the signed distance function (SDF) of all observed objects using object-specific parameters $h_k$ and object-agnostic parameters $\theta$. More specifically, we write our SDF for object $k$ as $\mathrm{SDF}_\theta(x; h_k)$, and determine $h_k$ using a $\theta$-parameterized function that encodes $D_k$. The function that maps $D_k$ to $h_k$ is often referred to as the encoder in the meta-learning literature; the SDF, on the other hand, plays the role of the decoder. In a probabilistic setting, we consider a probabilistic SDF of the form $\mathcal{N}(s \mid \mathrm{SDF}_\theta(x; h_k), \sigma^2_\theta)$. Now, if we interpret this estimation problem as a case of minimizing the Bayes risk (Maeda et al., 2020), we can show that the optimal solution is given by the predictive distribution $p_\theta(s \mid x, D_k)$ computed using the posterior distribution $p(h_k \mid D_k)$:

$$p(s \mid x, D_k) = \int p(s \mid x, h_k)\, p(h_k \mid D_k)\, dh_k. \quad (1)$$

The estimation of the posterior $p(h_k \mid D_k)$ is a challenging task: for the approximation to be valid, it must be able to accept unordered sets $D_k$ of varying size, while satisfying all other properties of a posterior distribution. For example, its variance must approach zero as we take the number of observed points to infinity. According to the theory of (Maeda et al., 2020), we can construct an approximation that satisfies all these requirements using a Gaussian distribution $p_\theta(h_k \mid D_k) = \prod_{i=1}^{d} \mathcal{N}(h_{k,i} \mid \mu_{i,\theta}(D_k), \sigma^2_{i,\theta}(D_k))$ with mean and variance as follows (we use $h_{k,i}$ to denote the $i$-th element of $h_k$):

$$\mu_{i,\theta}(D_k) = \sigma^2_{i,\theta}(D_k) \sum_{n=1}^{N_k} \frac{f_{i,n}}{g^2_{i,n}}, \qquad \sigma^2_{i,\theta}(D_k) = \left( \sum_{n=1}^{N_k} \left( \frac{1}{g^2_{i,n}} - \frac{1}{g^2_{i,0}} \right) + \frac{1}{g^2_{i,0}} \right)^{-1}. \quad (2)$$
In the above, $f_{i,n} = f_{i,\theta}(x^{(k)}_n, y_n)$ and $g_{i,n} = g_{i,\theta}(x^{(k)}_n, y_n)$ are neural networks parameterized by $\theta$. When making a probabilistic inference, we sample from this Gaussian posterior approximation and feed the sample to the decoder SDF. A schematic diagram of our proposed meta-learning shape completion method can be found in Figure 1. Training proceeds by making predictions about target points using contextual information from various datasets. To prepare the mock-target points at training time, we decompose $D_k$ into $D^{\mathrm{ctxt}}_k$ and $D^{\mathrm{targ}}_k$ for each task $k$ and use the following classic ELBO (Ranganath et al., 2014; Kingma & Welling, 2014) to optimize the predictive distribution given $D_k$:

$$\mathcal{L}_k(\theta) := -\int p_\theta(h_k \mid D_k) \left[ \sum_{n=1}^{N_k} \log p_\theta(s^{(k)}_n \mid x^{(k)}_n, h_k) + \log p_\theta(h_k \mid D^{\mathrm{ctxt}}_k) \right] dh_k - H(p_\theta(h_k \mid D_k)). \quad (3)$$

Here $s^{(k)}_n$ is the true signed distance from the point $x^{(k)}_n$ to the surface of object $k$. In our problem setting, $s^{(k)}_n$ is 0 for all $(n, k)$ because all observed points lie on the surface of the objects. To test the efficacy of our meta-learning approach in isolation, we use a simple MLP for the encoder in this study, and do not explore architectures specifically tailored to the shape-completion problem.
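As an illustrative sketch, the closed-form aggregation above can be computed for a single latent dimension as follows. Here `f_vals` and `g_vals` stand in for the outputs of the networks $f_{i,\theta}$ and $g_{i,\theta}$ at each observed point, and `g0` for the prior scale $g_{i,0}$; the names are ours, and we assume $g_{i,n} \le g_{i,0}$ so that the accumulated precision stays positive.

```python
def gaussian_posterior_1d(f_vals, g_vals, g0):
    # Closed-form Gaussian posterior for one latent dimension (Eq. 2):
    #   sigma^2 = [ sum_n (1/g_n^2 - 1/g0^2) + 1/g0^2 ]^(-1)
    #   mu      = sigma^2 * sum_n f_n / g_n^2
    # An empty observation set recovers the prior (mu = 0, sigma^2 = g0^2),
    # and the variance shrinks as more points are observed.
    inv_var = sum(1.0 / g**2 - 1.0 / g0**2 for g in g_vals) + 1.0 / g0**2
    var = 1.0 / inv_var
    mu = var * sum(f / g**2 for f, g in zip(f_vals, g_vals))
    return mu, var

mu0, v0 = gaussian_posterior_1d([], [], g0=2.0)            # prior: (0.0, 4.0)
_, v10 = gaussian_posterior_1d([1.0] * 10, [1.0] * 10, 2.0)
_, v100 = gaussian_posterior_1d([1.0] * 100, [1.0] * 100, 2.0)
```

Because the aggregation is a plain sum over points, the resulting posterior is invariant to the ordering of $D_k$ and accepts sets of any size, and its variance vanishes as the number of points grows, matching the consistency requirements stated above.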

3.3. EIKONAL REGULARIZATION

The datasets $D_k$ considered in shape completion tasks usually contain only points on the object surface. However, it is difficult to force the learned function to be a signed distance function (SDF) just by requiring that its value be close to 0 at the observed points. Implicit Geometric Regularization (IGR) (Gropp et al., 2020) is a regularization method based on the theory of Eikonal partial differential equations, which states that any function $F$ satisfying

$$F(x) = 0 \;\; \forall x \in \mathcal{B}, \qquad \|\nabla F(x)\| = 1 \;\; \forall x \in \mathbb{R}^3 \quad (4)$$

is a signed distance function for the surface $\mathcal{B}$. To better encourage our decoder to describe a valid SDF, we therefore augment our loss function with an extra term $\mathcal{L}_{\mathrm{eik}} = \mathbb{E}\big[\,\big|\,\|\nabla \mathrm{SDF}_\theta(x; h_k)\| - 1\,\big|\,\big]$. We estimate this expectation by sampling $x$ in a soft neighborhood of the object surface. For more details, we refer the reader to Section 4.
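The Eikonal term can be sketched as follows. For clarity, this illustration (function names are ours) estimates the gradient with central finite differences; in practice the gradient of the network is obtained by automatic differentiation. A true SDF incurs near-zero penalty, while a scaled copy violates the unit-gradient condition.

```python
import numpy as np

def eikonal_loss(sdf, points, eps=1e-4):
    # Monte-Carlo estimate of E[ | ||grad SDF(x)|| - 1 | ].
    # `sdf` maps an (N, 3) array of points to an (N,) array of values.
    grads = np.empty_like(points)
    for i in range(points.shape[1]):
        d = np.zeros(points.shape[1])
        d[i] = eps
        grads[:, i] = (sdf(points + d) - sdf(points - d)) / (2 * eps)
    norms = np.linalg.norm(grads, axis=1)
    return np.abs(norms - 1.0).mean()

# The exact distance to the unit sphere has unit gradient almost everywhere,
# so its penalty is ~0; doubling it makes the gradient norm 2, penalty ~1.
true_sdf = lambda x: np.linalg.norm(x, axis=1) - 1.0
doubled = lambda x: 2.0 * true_sdf(x)
pts = np.random.default_rng(0).normal(size=(256, 3))
```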

3.4. NORMAL VECTOR REGULARIZATION

If the training dataset contains ground-truth surface normal vectors, we can also add a regularization term that encourages the gradient of the estimated SDF at observed points to agree with the true normal vectors. This regularization is also used in the original IGR (Gropp et al., 2020) implementation.

3.5. LOSS FUNCTION AND POST-ENCODER LATENT OPTIMIZATION

The loss function we minimize at training time for object $k$ is given by $\mathcal{L} = \mathcal{L}_k(\theta) + \lambda \mathcal{L}_{\mathrm{eik}}$, where $\lambda$ is an empirically chosen regularization parameter. At inference time, we can further fine-tune the encoder output by optimizing the likelihood of the sparse observations. That is, if $\mu^k_0$ is the mean of the encoder conditioned on $D_k$ (i.e., $\mathbb{E}[h \mid D_k]$), we additionally apply the following iterative updates to the encoder's mean:

$$\mu^k_t \leftarrow \mu^k_{t-1} + \sum_{(x^{(k)}_n, s^{(k)}_n) \in D_k} \nabla_h \log p(s^{(k)}_n \mid x^{(k)}_n, h) \Big|_{h = \mu^k_{t-1}}.$$

At inference time, we feed $\mu^k_T$ to the decoder instead of $\mu^k_0$. When we apply this post-encoder latent optimization, our method becomes a version of IGR that is additionally equipped with an encoder.
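A minimal sketch of this post-encoder refinement, with a toy one-dimensional likelihood standing in for the decoder (we add an explicit step size for numerical stability; all names here are illustrative, not the paper's implementation):

```python
def refine_latent(mu0, grad_log_lik, data, step, num_steps):
    # Start from the encoder mean mu0 and ascend the log-likelihood of the
    # sparse observations: h <- h + step * sum_n d/dh log p(s_n | x_n, h).
    h = mu0
    for _ in range(num_steps):
        h = h + step * sum(grad_log_lik(x, s, h) for x, s in data)
    return h

# Toy check: Gaussian likelihood log p(s | x, h) = -(s - h*x)^2 / 2, whose
# gradient w.r.t. h is x*(s - h*x). Data generated with h* = 2 should pull
# the latent from mu0 = 0 toward 2.
grad = lambda x, s, h: x * (s - h * x)
data = [(1.0, 2.0), (0.5, 1.0), (2.0, 4.0)]
h_T = refine_latent(mu0=0.0, grad_log_lik=grad, data=data, step=0.05, num_steps=200)
print(h_T)  # converges toward 2.0
```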

4. EXPERIMENTS

We conducted a series of experiments to evaluate the efficacy of our proposed Meta-Shape Completion (MSC) method relative to other well-known published methods. In particular, we focus on very sparse scenarios, and show that the introduction of Bayesian meta-learning enables the generation of state-of-the-art shape predictions from a very small number of observed samples. We also show that our proposed method generalizes better to novel, unseen objects.

4.1 DATASETS

ShapeNet (Chang et al., 2015). We used the synthetic ShapeNet CAD models as the primary source of evaluation for this paper. Following the procedure described in (Park et al., 2019), we first applied an affine transformation to each object so that the center of mass is located at the origin, and rescaled all points so that the maximum distance from any vertex to the origin is 1. To sample points exclusively on the object surfaces, we first generated 100 viewpoints uniformly around each object, and used OpenGL to obtain images from these viewpoints. We then identified the set of points observable from these 100 viewpoints as surface points, and took samples from these points. This modified dataset is henceforth referred to as Disemboweled ShapeNet. For IGR and MSC, we used the above procedure to generate 200k pairs of coordinates and normal vectors on each object surface. For DeepSDF, we followed the original procedure and computed the ground-truth SDF at 500k randomly generated points in addition to the 200k surface points. For OccNet, we randomly sampled 100k points from those used in DeepSDF, and annotated whether they are inside or outside the object. For PCN, we sampled 16384 surface points as ground truth.
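The normalization step described above can be sketched as follows (`normalize_object` is a hypothetical helper name): translate so the center of mass sits at the origin, then rescale so the farthest point lies at unit distance.

```python
import numpy as np

def normalize_object(points):
    # Center the point set at its mean, then rescale so that the maximum
    # distance from any point to the origin is exactly 1.
    centered = points - points.mean(axis=0)
    return centered / np.linalg.norm(centered, axis=1).max()

pts = np.random.default_rng(0).uniform(-3.0, 5.0, size=(1000, 3))
out = normalize_object(pts)
print(np.abs(out.mean(axis=0)).max())     # ~0 (centered)
print(np.linalg.norm(out, axis=1).max())  # 1.0 (unit max radius)
```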


ICL-NUIM (Handa et al., 2014). This is a dataset consisting of RGB-D images from two different scenes: living rooms and office rooms. The task on this dataset is to reconstruct 3D shapes from depth images. As pre-processing, we normalized the location and scale of each scene using the same procedure described above for ShapeNet, and used Open3D (Zhou et al., 2018) to obtain normal vectors. Because Open3D sometimes produces normal vectors with inconsistent orientations, we applied an extra correction step in which we flipped normals so that their inner products with the camera view vector are all positive.
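The orientation-correction step can be sketched as follows (a hypothetical helper; `view_dirs` holds the per-point camera view vectors):

```python
import numpy as np

def orient_normals(normals, view_dirs):
    # Flip each estimated normal whose inner product with the corresponding
    # camera view vector is negative, so all dot products become positive.
    dots = np.einsum('ij,ij->i', normals, view_dirs)
    signs = np.where(dots < 0.0, -1.0, 1.0)
    return normals * signs[:, None]

rng = np.random.default_rng(1)
normals = rng.normal(size=(100, 3))
views = np.tile([0.0, 0.0, 1.0], (100, 1))  # all points seen along +z
fixed = orient_normals(normals, views)
```

Flipping only changes the sign, so the magnitude of each normal is preserved.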

4.2. EVALUATION

As our evaluation metric for each method, we calculated the Chamfer distance (Fan et al., 2017a) between ground-truth and generated point clouds. If $P_1$ and $P_2$ are two point clouds in 3D space, the Chamfer distance between them is given by:

$$d(P_1, P_2) = \frac{1}{|P_1|} \sum_{i \in P_1} \min_{j \in P_2} \|x_i - x_j\|_2^2 + \frac{1}{|P_2|} \sum_{j \in P_2} \min_{i \in P_1} \|x_i - x_j\|_2^2. \quad (7)$$

Let $P^{\mathrm{gt}}_k$ be the ground-truth point cloud of object $k$, and $\hat{P}^M_k$ the point cloud of object $k$ generated by method $M$. To evaluate the performance of method $M$, we computed $d^M_k := d(P^{\mathrm{gt}}_k, \hat{P}^M_k)$ for each $k$ and used it to calculate the following values:

$$\bar{d}^M_{\mathrm{ave}} = \frac{1}{K} \sum_{k=1}^{K} d^M_k, \qquad \bar{d}^M_{\mathrm{norm}} = \frac{1}{K} \sum_{k=1}^{K} \left( d^M_k / d^{M_0}_k \right), \quad (8)$$

where $M_0$ is our proposed method, so that $\bar{d}^M_{\mathrm{norm}}$ measures method $M$'s performance relative to our own. We refer to the former as the average Chamfer distance, and the latter as the normalized Chamfer distance. For the ICL-NUIM experiments we used an asymmetric variant of the Chamfer distance, in which only the first sum in Equation 7 is used; this is because the ground-truth point clouds for this dataset are very sparse and lack many portions of the scene to be reconstructed. To generate ground-truth point clouds, we randomly sampled 30k points from the raw meshes in our Disemboweled ShapeNet dataset. Each point on the object surface was sampled in two steps: first, we selected a triangle on the polygonal mesh with probability proportional to its area; then, we sampled a point uniformly from the selected triangle. DeepSDF, IGR and OccNet use implicit functions to represent the geometry of each object, so their decoders do not directly output a point cloud. For these methods, we used Marching Cubes (Lorensen & Cline, 1987) to generate a point cloud of size 30k. PCN, on the other hand, directly outputs 3D points from each set of sparse observations; following the original implementation, we used those to generate a point cloud of size 16384.
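A minimal implementation of Equation 7 might look like this. It is a brute-force $O(|P_1||P_2|)$ sketch; practical evaluations typically use a KD-tree for the nearest-neighbor queries.

```python
import numpy as np

def chamfer_distance(p1, p2):
    # Symmetric Chamfer distance: mean squared distance from each point to its
    # nearest neighbor in the other cloud, summed over both directions.
    d2 = ((p1[:, None, :] - p2[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
    return d2.min(axis=1).mean() + d2.min(axis=0).mean()

a = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
b = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 1.0]])
print(chamfer_distance(a, a))  # 0.0
print(chamfer_distance(a, b))  # 1.0
```

The asymmetric variant used for the ICL-NUIM experiments simply drops the second term (`d2.min(axis=0).mean()`).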

4.3. SHAPE COMPLETION ON SHAPENET

To verify the ability of our proposed method to complete the shape of unseen categories, we split the object categories of Disemboweled ShapeNet into two groups: one used to train the model (training), and one used as newly encountered object types (novel). For the exact splits, please refer to the Appendix. In the first set of experiments, we trained each model on the training set of the training categories, and evaluated it on the test set of the training categories; results can be found in Table 1. In the second set of experiments, we trained each model on the training set of the training categories and evaluated it on the test set of the novel categories; results can be found in Table 2. In both cases, our proposed method (MSC) was evaluated with and without post-encoder latent optimization. As both tables show, our proposed method consistently outperforms all other methods for both novel and training categories when the number of observation points is 100 or fewer, while achieving competitive results when higher densities are available. In particular, when the number of observations is 50, our method greatly outperforms PCN, the current state of the art on these tasks. The strength of our method can be verified qualitatively as well. Figures 2 and 3 show examples of completed shapes obtained by different methods for various ShapeNet categories. As shown, our method tends to output smooth surface predictions regardless of the number of observations. This is most likely because our encoder is learning smoothness as a task-agnostic property that can be transferred across different objects. Meanwhile, the performance of IGR differs drastically between trained and novel categories, both in terms of Chamfer distance and output appearance. DeepSDF performs poorly on all categories, most likely because it is not allowed to use the ground-truth SDF at test time.
When the number of observations is small, DeepSDF tends to output a complex nonsensical surface that looks like a combination of all the training objects. It also tends to produce disconnected artifacts that are located far away from the real surface. OccNet also tends to produce many artifacts when the number of observations is small. We hypothesize that these methods are failing to isolate category-agnostic properties from category-specific properties. As a general remark, all methods struggle with reproducing detailed topology even with a large number of observations, particularly in the case of holes and large gaps, possibly due to model limitations. For a detailed analysis of the category-wise performance of various methods, please refer to the Appendix.

4.4. SCENE COMPLETION OF ICL-NUIM DATASET

Table 3 summarizes the performance of various methods on ICL-NUIM, for the task of scene completion. Similarly to ShapeNet, MSC significantly outperforms all others when the number of observations is 300 or less, while still achieving competitive results when a higher density of observations is available. Figures 4 and 5 contain examples of scene completion results produced by different methods. DeepSDF and OccNet both tend to produce very complex shapes with many disconnected artifacts, and DeepSDF in particular seems to not converge as we increase the number of observations. PCN performs poorly when compared to the shape completion task on ShapeNet, mostly densifying observed areas rather than extrapolating this information to other portions of the scene. Our method, on the other hand, succeeds in correctly recognizing the boundary of objects even when large areas, such as walls and the floor, are missing in the ground-truth. 

5. CONCLUSION

In this paper we introduced the concept of meta-learning to the task of shape completion using implicit representations of 3D surfaces. Our proposed encoder mechanism allows object-agnostic properties to be learned separately from object-specific properties, yielding a model that consistently produces smooth predictions from highly sparse observations and achieves state-of-the-art results under these conditions. Although we have used IGR, an SDF-based method, as the basis for our implementation, our proposed meta-shape completion algorithm (MSC) can equally be applied to different implicit surface representations by simply changing the decoder. Furthermore, in this paper we have used a simple MLP-based encoder to learn task-specific parameters, while other methods like PCN use hierarchical models to capture both global and local geometric properties of objects. We believe that using these more complex models will lead to substantial improvements to our proposed method; however, this is left for future work. In conclusion, there seems to be much room left for the application of meta-learning to scene completion tasks, and further studies in this direction may allow us to develop models that can be used in a wide range of applications.

A APPENDIX

A.1 IMPLEMENTATION DETAILS

DeepSDF (Park et al., 2019) and IGR (Gropp et al., 2020). We implemented both methods using PyTorch (Paszke et al., 2017), matching their model architectures, initialization procedures and published results. As in the original study of (Park et al., 2019), however, we used ReLU in place of softplus for the DeepSDF decoder. To create the mesh with which the Chamfer distance is calculated, we used the Marching Cubes algorithm (Lorensen & Cline, 1987) at 256^3 resolution to convert the SDF to a mesh. To find the optimal z with respect to the likelihood, we applied gradient descent from a random initial value z_0 sampled from N(0, 0.01^2). For the step size, we chose lr * error_0, where error_0 is the error computed with z_0. We used the Adam optimizer, with batch size b = 32, alpha = 3.2 × 10^-4, and learning rates lr = 1.0 × 10^-3 for DeepSDF and lr = 1.0 × 10^-4 for IGR. We trained for 5000 epochs, halving the learning rate every 500 epochs. We used 16384 surface points to compute the loss for each object. To generate points for the evaluation of the Eikonal term, we randomly selected 5k of the 16384 surface points and sampled one point each from a Gaussian distribution centered around it with variance 4.0 × 10^-2, using λ = 1.0 as the Eikonal regularization parameter. We trained our IGR model with the version of the algorithm in (Gropp et al., 2020) that uses the surface normal vectors. OccNet (Mescheder et al., 2019). We used the authors' official PyTorch implementation 1 and trained under the same conditions as described in the original paper. For the encoder architecture (Figure 6), we used the ResNetPointNet class, and for the decoder architecture we used the DecoderBatchNorm class. We trained the decoder over 5k epochs with batch size b = 64 and learning rate lr = 1.0 × 10^-4. For each object, we fed 300 surface points to the encoder, and produced 2048 binary points from the decoder.
To stabilize training, we followed the same procedure described in the original paper and added Gaussian noise with standard deviation σ = 5.0 × 10^-3 to the encoder input. PCN (Yuan et al., 2018). We implemented PCN using PyTorch based on the authors' official TensorFlow implementation 2, matching their model architectures, initialization procedures and published results. To train, we followed the same procedure as the original paper, feeding k surface points to the encoder. We used batch size b = 32. Starting from an initial value of 0.0001, we reduced the learning rate by a factor of 0.7 every 50k iterations and trained the model for a total of 5k epochs.

MSC (Ours). Figure 1 of the main paper illustrates our proposed encoder network. The dimension of h (i.e., the output of the encoder) was set to 1024 for ShapeNet40 and 512 for the ICL-NUIM dataset. We used the same decoder as IGR. At training time, we set the number of contextual surface points to 16384 and evaluated the loss using target sample points of size ranging from 1 to 5000, chosen at random at each iteration. We trained our model for 5k epochs with batch size b = 32 and learning rate lr = 10^-4, using the same Eikonal regularization term as IGR. For the ShapeNet experiment, training the model required 16 days on 16 GPUs (NVIDIA V100). Encoding 1000 contextual points took 0.003 seconds on average, and creating the mesh surface with the Marching Cubes algorithm (which requires evaluation at many points) took 22.660 seconds. For the ICL-NUIM experiment, training the entire model took 30 hours on 32 GPUs.

A.2 RESULTS ON SHAPENET

Tables 4 and 5 contain more detailed shape completion results for each ShapeNet training category, and Table 6 contains similar results for each ShapeNet novel category. Qualitative results for selected objects can be found in Figures 2-12. Generally speaking, the performance of all methods varies greatly across categories, and as expected, all methods perform better on trained than on novel categories. Some objects were particularly difficult for all methods, such as headphones, for which it is difficult to determine whether the target object is ring-shaped until there is a large number of contextual points. By the same logic, all methods also performed poorly on objects with intricate details, such as motorcycles and keyboards. As described in the main paper, PCN tends to outperform MSC when there is a large number of observations. However, in many of these cases simple Gaussian densification also yields good results. In fact, Gaussian densification outperforms all methods on most categories when the number of observations is as large as 1000, which suggests that the representational power of these methods is still not sufficient. Meanwhile, MSC outperforms all methods on most of both the novel and training categories when the number of observations is small. In particular, when the number of observations is 50, our method with post-encoder optimization outperforms all other methods on all novel categories. When the number of observations is 100, MSC with post-encoder optimization outperforms the other methods in 10 out of 15 categories. However, we note that the categories where MSC is outperformed (pistol, motorcycle, guitar, earphone, microphone) correspond to objects with fine details, and in 2 of them the best-performing method is Gaussian densification. IGR, which represents a version of our method without the encoder, sometimes produced nonsensical outputs even with a large number of observations, apparently switching between different objects.
Similarly, DeepSDF has a tendency to produce outputs that look like a combination of many objects. It is possible that, because IGR and DeepSDF are not equipped with an encoder, they inadvertently mix object-specific with object-agnostic features by encoding some object-specific properties into their decoder model. DeepSDF also produced disconnected artifacts in many cases, mostly because we only allow the use of contextual observations at test time (e.g., information like ground-truth SDF values at points off the surface and ground-truth normal vectors is inaccessible at test time with sparse observations). OccNet also produced such artifacts when the number of observations is very small. Our proposed MSC method, on the other hand, makes relatively conservative, smooth predictions when the observations are too sparse, and incrementally increases the complexity of the output shapes as more points become available.

A.3 RESULTS ON ICL-NUIM

Figures 13-15 show selected scene completion results obtained on the ICL-NUIM dataset. Although both the ground-truth and the contextual point cloud are missing large portions because of the data collection procedure (each scene is captured from a single view), our MSC is able to generate smooth surfaces that cover the unobserved areas of the environment. Interestingly, in this case IGR also does a good job of producing smooth surfaces, most likely because all rooms are somewhat similar to each other (i.e. object-specific properties can be shared between classes without significant detrimental effects). DeepSDF again produces disconnected artifacts on this dataset. These failure cases indicate that the Eikonal term plays an important role in producing smooth surfaces, especially when the model is not allowed to use normal vectors. Finally, PCN fails to reconstruct unobserved areas of the environment, producing instead what looks like a densified version of the observed points.

Figure 13: Shape completion results on ShapeNet for Bed (novel category). Our proposed method correctly predicts the overall box shape of the object, and gradually improves this prediction as more observation points become available. Other methods, such as OccNet and IGR, fail to separate the top and lower portions of the object even at higher density levels. With only 50 points our proposed method is already capable of predicting the correct shape, which is then further refined with more observations. OccNet requires a higher number of observations before it settles on the correct shape, while IGR converges to the wrong object.
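The Eikonal term referred to here is the regularizer of IGR (Gropp et al., 2020), which encourages the learned field f to have unit gradient norm, as a true signed distance function does. The following is a minimal numerical sketch of the residual it penalizes, using central finite differences rather than the automatic differentiation used in training; `eikonal_residual` and its defaults are illustrative, not the paper's training code.

```python
import numpy as np

def eikonal_residual(sdf, points, eps=1e-4):
    """Mean squared Eikonal residual, E[(|grad f(x)| - 1)^2].

    sdf:    callable mapping an (N, 3) array to (N,) field values.
    points: (N, 3) sample locations at which the residual is evaluated.
    """
    grads = np.zeros_like(points)
    for d in range(3):  # central finite differences along each axis
        offset = np.zeros(3)
        offset[d] = eps
        grads[:, d] = (sdf(points + offset) - sdf(points - offset)) / (2 * eps)
    grad_norm = np.linalg.norm(grads, axis=1)
    return np.mean((grad_norm - 1.0) ** 2)

# A true SDF (unit sphere) has unit gradient norm everywhere, so its
# residual is ~0; a merely "distance-like" field is penalized.
pts = np.random.default_rng(0).uniform(-1.0, 1.0, size=(256, 3))
sphere_sdf = lambda p: np.linalg.norm(p, axis=1) - 1.0
squared_field = lambda p: np.sum(p ** 2, axis=1) - 1.0
low = eikonal_residual(sphere_sdf, pts)      # ~0 for the true SDF
high = eikonal_residual(squared_field, pts)  # clearly positive
```

Minimizing this residual pushes the decoder toward valid signed distance fields, which is one plausible reason for the smoother surfaces observed above when normal supervision is unavailable.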




Figure 1: Schematic of our proposed SDF-based meta-shape completion method.

Figure 2: Shape completion results on ShapeNet for chairs (training category).

Figure 4: Scene completion results on ICL-NUIM for office rooms.

Figure 6: Encoder of OccNet (Mescheder et al., 2019). OccNet uses a version of PointNet that carries out max-pooling operations in multiple places. Compared to our encoder (Figure 1, main paper), this architecture is much more complex. Future work will involve extending MSC to different encoder architectures.

Figure 14: Shape completion results on ShapeNet for Bed (novel category). With only 50 points our proposed method is already capable of predicting the correct shape, which is then further refined with more observations. OccNet requires a higher number of observations before it settles on the correct shape, while IGR converges to the wrong object.

Results on ShapeNet trained categories (Mean Chamfer Distance per point).

Results on ShapeNet novel categories (Mean Chamfer Distance per point).

Results on the ICL-NUIM dataset (Mean Asymmetric Chamfer Distance per point).

±3.22  2.81 ±1.28  0.66 ±0.29  0.22 ±0.07  21.33 ±14.64  12.46 ±6.87  3.85 ±1.74  1.43 ±0.50
PCN (Yuan et al., 2018)  2.76 ±2.39  0.95 ±0.80  0.28 ±0.25  0.15 ±0.11  7.94 ±8.15  4.11 ±3.62  1.58 ±1.27  0.96 ±0.53
OccNet (Mescheder et al., 2019)  0.49 ±0.43  0.30 ±0.16  0.25 ±0.15  0.24 ±0.15  1.45 ±1.59  1.32 ±0.55  1.47 ±0.44  1.55 ±0.42
DeepSDF (Park et al., 2019)  3.33 ±5.10  2.12 ±5.48  1.03 ±2.71  0.57 ±1.36  10.09 ±20.04  9.51 ±34.35  5.91 ±17.61  3.45 ±8.17
IGR (Gropp et al., 2020)  3.11 ±5.76  2.28 ±5.01  1.62 ±4.28  1.30 ±4.07
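For reference, the Chamfer metrics reported in these tables can be sketched as follows. Whether squared or unsquared point-to-point distances are used is an assumption here (unsquared below), and `chamfer_per_point` is an illustrative name rather than the paper's evaluation code.

```python
import numpy as np

def chamfer_per_point(pred, gt, symmetric=True):
    """Mean Chamfer distance per point between two point sets.

    pred: (N, 3) predicted points; gt: (M, 3) ground-truth points.
    Asymmetric: mean distance from each predicted point to its nearest
    ground-truth point. Symmetric: average of both directions.
    """
    # Pairwise Euclidean distances, shape (N, M).
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)
    pred_to_gt = d.min(axis=1).mean()
    if not symmetric:
        return pred_to_gt
    gt_to_pred = d.min(axis=0).mean()
    return 0.5 * (pred_to_gt + gt_to_pred)

# Example: a single predicted point one unit away from a single
# ground-truth point has per-point Chamfer distance 1 in both variants.
d_sym = chamfer_per_point(np.array([[1.0, 0.0, 0.0]]),
                          np.array([[0.0, 0.0, 0.0]]))  # 1.0
```

The asymmetric variant is the natural choice for ICL-NUIM, since large parts of the ground truth are themselves unobserved and should not penalize completed surfaces.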

Table 4: Results for each ShapeNet trained category (1/2)

Table 5: Results for each ShapeNet trained category (2/2)

Table 6: Results for each ShapeNet novel category


Figure 7: Shape completion results on ShapeNet for Basket (training category). Our method starts conservatively and improves as more points are added (it also correctly infers that there is no lid). All methods struggle with predictions for this category, irrespective of the number of observations. In particular, IGR consistently assumes that the target object is rectangular, even at higher density levels. At these levels, our method succeeds in capturing the approximate geometry of the object.

Figure 15: Shape completion results on ShapeNet for Guitar (novel category). Here, PCN is the only method that seems to succeed at shape completion, mostly because of the compact nature of the object, which enables accurate densification. Interestingly, as the number of observations increases, our proposed method performs better without post-encoder optimization, producing a reasonable output with 300 observations. We attribute this to a smoothing effect introduced by this step, which enables better generalization at the expense of finer details.
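The post-encoder optimization step mentioned in these captions can be understood as test-time refinement of the latent code against the contextual points. The sketch below illustrates the idea on a toy decoder (a sphere SDF whose latent code is its center); the function names, step sizes, and finite-difference gradients are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def refine_latent(z0, context_pts, decoder, steps=200, lr=0.05, eps=1e-4):
    """Refine a latent code by gradient descent on the contextual loss.

    Contextual points should lie on the zero level set of the decoded SDF,
    so the loss is mean f(x; z)^2 over them. Gradients with respect to z
    are taken by central finite differences for simplicity.
    """
    z = np.array(z0, dtype=float)
    loss = lambda zv: np.mean(decoder(context_pts, zv) ** 2)
    for _ in range(steps):
        g = np.zeros_like(z)
        for d in range(z.size):
            dz = np.zeros_like(z)
            dz[d] = eps
            g[d] = (loss(z + dz) - loss(z - dz)) / (2 * eps)
        z -= lr * g
    return z

# Toy decoder: sphere of radius 0.5 whose latent code is its center.
decoder = lambda x, z: np.linalg.norm(x - z, axis=1) - 0.5
# Sparse contextual points sampled from a sphere centered at (0.3, 0, 0).
rng = np.random.default_rng(0)
dirs = rng.normal(size=(50, 3))
dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
context = 0.5 * dirs + np.array([0.3, 0.0, 0.0])
# Starting from z = 0, refinement recovers a code near the true center.
z_refined = refine_latent(np.zeros(3), context, decoder)
```

Because the refinement fits the latent code tightly to the observed points, it can over-smooth fine details, which would be consistent with the Guitar results described above.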

