Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding

Abstract

Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of the 3D point cloud representation with the natural view-awareness of the multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks (ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).

1. Introduction

A fundamental question in 3D computer vision and computer graphics is how to represent 3D data (Mescheder et al., 2019; Qi et al., 2017a; Maturana & Scherer, 2015) . This question becomes particularly vital given how the success of deep learning in 2D computer vision has pushed for the wide adoption of deep learning in 3D vision and graphics. In fact, deep networks already achieve impressive results in 3D classification (Hamdi et al., 2021) , 3D segmentation (Hu et al., 2021) , 3D detection (Liu et al., 2021a) , 3D reconstruction (Mescheder et al., 2019) , and novel view synthesis (Mildenhall et al., 2020) . 3D computer vision networks either rely on direct 3D representations, indirect 2D projection on images, or a mixture of both. Direct approaches operate on 3D data commonly represented with point clouds (Qi et al., 2017a) , meshes (Feng et al., 2019) , or voxels (Choy et al., 2019) . In contrast, indirect approaches commonly render multiple 2D views of objects or scenes (Su et al., 2015) , and process each image with a traditional 2D image-based architecture. The human visual system is closer to such a multi-view indirect approach for 3D understanding, as it receives streams of rendered images rather than explicit 3D data. Tackling 3D vision tasks with indirect approaches has three main advantages: (i) mature and transferable 2D computer vision models (CNNs, Transformers, etc. ) , (ii) large and diverse labeled image datasets for pre-training (e.g. ImageNet (Russakovsky et al., 2014) ), and (iii) the multi-view images give context-rich features based on the viewing angle, which are different from the geometric 3D neighborhood features. Multi-view approaches achieve impressive performance in 3D shape classification and segmentation (Wei et al., 2020; Hamdi et al., 2021; Dai & Nießner, 2018) . However, the challenge with the multi-view representation (especially for dense predictions) lies in properly aggregating the per-view features with 3D point clouds. 
The appropriate aggregation is necessary to obtain representative 3D point clouds with a single feature per point suitable for typical point cloud processing pipelines. Previous multi-view works rely on heuristics (e.g. average or label mode pooling) after mapping pixels to points (Kundu et al., 2020; Wang et al., 2019a), or multi-view fusion with voxels (Dai & Nießner, 2018). Such setups might not be optimal for a few reasons. (i) Such heuristics may aggregate information of misleading projections that are obtained from arbitrary view-points. For example, looking at an object from the bottom and processing that view independently can carry wrong information about the object's content when combined with other views. (ii) The views lack geometric 3D information. To this end, we propose a new hybrid 3D data structure that inherits the merits of point clouds (i.e. compactness, flexibility, and 3D descriptiveness) and leverages the benefits of rich perceptual features of multi-view projections. We call this new representation multi-view point cloud (or Voint cloud) and illustrate it in Figure 1. A Voint cloud is a set of Voints, where each Voint is a set of view-dependent features (view-features) that correspond to the same point in the 3D point cloud. The cardinality of these view-features may differ from one Voint to another. In Table 1, we compare some of the widely used 3D representations and our Voint cloud representation. Voint clouds inherit the characteristics of the parent explicit 3D point clouds, which facilitates learning Voint representations for a variety of vision applications (e.g. point cloud classification and segmentation).

[Figure 1: Voint cloud, a 3D representation that is compact and naturally descriptive of view projections of a 3D point cloud. Each point in the 3D cloud is tagged with a Voint, which accumulates view-features for that point. Note that not all 3D points are visible from all views. The set of Voints constructs a Voint cloud.]

To deploy deep learning on the new Voint space, we define basic operations on Voints, such as pooling and convolution. Based on these operations, we define a practical way of building Voint neural networks that we dub VointNet. VointNet takes a Voint cloud and outputs point cloud features for 3D point cloud processing. We show how learning this Voint cloud representation leads to strong performance and gained robustness for the tasks of 3D classification, 3D object retrieval, and 3D part segmentation on standard benchmarks like ScanObjectNN (Uy et al., 2019), and ShapeNet (Chang et al., 2015) .

Contributions: (i) We propose a novel multi-view 3D point cloud representation (denoted as Voint cloud), which represents each point (namely a Voint) as a set of features from different view-points. (ii) We define pooling and convolutional operations at the Voint level to construct a Voint Neural Network (VointNet) capable of learning to aggregate information from multiple views in the Voint space. (iii) Our VointNet reaches state-of-the-art performance on several 3D understanding tasks, including 3D shape classification, retrieval, and robust part segmentation. Further, VointNet achieves improved robustness to occlusion and rotation.

2. Related Work

Learning on 3D Point Clouds. 3D point clouds are widely used for 3D representation in computer vision due to their compactness, flexibility, and because they can be obtained naturally from sensors like LiDAR and RGBD cameras. PointNet (Qi et al., 2017a) paved the way as the first deep learning algorithm to operate directly on 3D point clouds. It computes point features independently and aggregates them using an order-invariant function like max-pooling. Subsequent works focused on finding neighborhoods of points to define point convolutional operations (Qi et al., 2017b; Wang et al., 2019c; Li et al., 2018; Han et al., 2019). Several recent works combine point cloud representations with other 3D modalities like voxels (Liu et al., 2019b; You et al., 2018) or multi-view images (Jaritz et al., 2019). We propose a novel Voint cloud representation for 3D shapes and investigate novel architectures that aggregate view-dependent features at the 3D point level.

Multi-View Applications. The idea of using 2D images to understand the 3D world was initially proposed in 1994 by Bradski et al. (Bradski & Grossberg, 1994). This intuitive multi-view approach was combined with deep learning for 3D understanding in MVCNN (Su et al., 2015). A line of works continued developing multi-view approaches for classification and retrieval by improving the aggregation of the view-features from each image view (Kanezaki et al., 2018; Esteves et al., 2019; Cohen & Welling, 2016; Wei et al., 2020; Hamdi et al., 2021). In this work, we fuse the concept of multi-view into the 3D structure itself, such that every 3D point has an independent set of view-features according to the view-points available in the setup. Our Voints are aligned with the sampled 3D point cloud, offering a compact representation that allows for efficient computation and memory usage while maintaining the view-dependent component that facilitates view-based learning for vision.

Hybrid Multi-View with 3D Data.
On the task of 3D semantic segmentation, a smaller number of works tried to follow the multi-view approach (Dai & Nießner, 2018; Kundu et al., 2020; Wang et al., 2019a; Kalogerakis et al., 2017; Jaritz et al., 2019; Liu et al., 2021b; Lyu et al., 2020) . A problem arises when combining view features to represent local points/voxels while preserving local geometric features. These methods tend to average the view-features (Kundu et al., 2020; Kalogerakis et al., 2017) , propagate the labels only (Wang et al., 2019a) , learn from reconstructed points in the neighborhood (Jaritz et al., 2019) , order points on a single grid (Lyu et al., 2020) , or combine the multi-view features with 3D voxel features (Dai & Nießner, 2018; Hou et al., 2019) . To this end, our proposed VointNet operates on the Voint cloud space while preserving the compactness and 3D descriptiveness of the original point cloud. VointNet leverages the power of multi-view features with learned aggregation on the view-features applied to each point independently.

3. Methodology

The primary assumption in our work is that surface 3D points are spherical functions, i.e. their representations depend on the viewing angles observing them. This condition contrasts with most 3D point cloud processing pipelines that assume a view-independent representation of 3D point clouds. The full pipeline is illustrated in Figure 2.

[Figure 2: pipeline diagram. Point cloud X (N × 3) and view-points U (M × 2) → renderer R → multi-view renderings (M × H × W × 3), pixel-to-point mapping B (M × H × W), and visibility V (N × M) → 2D backbone C → multi-view features (M × H × W × d) → unprojection Φ_B → Voint cloud (N × M × d) → VointNet F → point features (N × d).]

3.1. 3D Voint Cloud

From Point Clouds to Voint Clouds. A 3D point cloud is a compact 3D representation composed of points sampled on the surface of a 3D object or a scene. It can be obtained by different sensors like LiDAR (Chen et al., 2017) or as a result of reconstruction (Okutomi & Kanade, 1993). Formally, we define the coordinate function for the surface $g_s(\mathbf{x}) : \mathbb{R}^3 \to \mathbb{R}$ as the Signed Distance Function (SDF) in the continuous Euclidean space (Park et al., 2019; Mescheder et al., 2019). The 3D iso-surface is then defined as the set of all points $\mathbf{x}$ that satisfy the condition $g_s(\mathbf{x}) = 0$. We define a surface 3D point cloud $\mathcal{X} \in \mathbb{R}^{N \times 3}$ as a set of $N$ 3D points, where each point $\mathbf{x}_i \in \mathbb{R}^3$ is represented by its 3D coordinates $(x_i, y_i, z_i)$ and satisfies the iso-surface condition as follows:

$$\mathcal{X} = \left\{ \mathbf{x}_i \in \mathbb{R}^3 \mid g_s(\mathbf{x}_i) = 0 \right\}_{i=1}^{N}.$$

In this work, we aim to fuse view-dependency into 3D points. Inspired by NeRFs (Mildenhall et al., 2020), we assume that surface points also depend on the view direction from which they are being observed. Specifically, there exists a continuous implicit spherical function $g(\mathbf{x}, \mathbf{u}) : \mathbb{R}^5 \to \mathbb{R}^d$ that defines the features of each point $\mathbf{x}$ depending on the view-point direction $\mathbf{u}$. Given a set of $M$ view-point directions $\mathcal{U} \in \mathbb{R}^{M \times 2}$, a Voint $\widehat{\mathbf{x}} \in \mathbb{R}^{M \times d}$ is a set of $M$ view-dependent features of size $d$ for the sphere centered at point $\mathbf{x}$ as follows:

$$\widehat{\mathbf{x}}_i = \left\{ g(\mathbf{x}_i, \mathbf{u}_j) \in \mathbb{R}^d \mid \mathbf{x}_i \in \mathcal{X} \right\}_{j=1}^{M} \tag{1}$$

The Voint cloud $\widehat{\mathcal{X}} \in \mathbb{R}^{N \times M \times d} = \{\widehat{\mathbf{x}}_i\}_{i=1}^{N}$ is the set of all $N$ Voints $\widehat{\mathbf{x}}_i$ corresponding to the parent point cloud $\mathcal{X}$. Note that we typically do not have access to the underlying implicit function $g$, and we approximate it with the following three steps.
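
To make the shape of a Voint cloud in Eq (1) concrete, the following is a minimal sketch that samples a made-up spherical function at a few view directions. The function `toy_g`, the (azimuth, elevation) encoding of a view direction, and the choice d = 1 are assumptions for illustration only; in the paper the true g is unknown and is approximated by rendering, 2D feature extraction, and unprojection.

```python
import math


def view_direction(azimuth, elevation):
    """Unit vector for a view direction u = (azimuth, elevation), in radians."""
    return (
        math.cos(elevation) * math.cos(azimuth),
        math.cos(elevation) * math.sin(azimuth),
        math.sin(elevation),
    )


def toy_g(x, u):
    """Hypothetical spherical function g(x, u) with feature size d = 1.

    x is assumed to lie on the unit sphere, so its surface normal is x itself.
    The feature is the clamped cosine between the normal and the view direction.
    """
    d = view_direction(*u)
    return [max(0.0, sum(a * b for a, b in zip(x, d)))]


def build_voint_cloud(points, views, g):
    """Evaluate g at every (point, view) pair: nested list of shape N x M x d."""
    return [[g(x, u) for u in views] for x in points]


points = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]                     # N = 2
views = [(0.0, 0.0), (math.pi / 2, 0.0), (0.0, math.pi / 2)]    # M = 3
voint_cloud = build_voint_cloud(points, views, toy_g)
```

Each entry `voint_cloud[i]` plays the role of one Voint: the same 3D point carries a different feature for each view direction.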

1-Multi-View Projection.

As mentioned earlier, a Voint combines multiple view-features of the same 3D point. These view-features come from a multi-view projection of the points by a point cloud renderer $\mathbf{R} : \mathbb{R}^{N \times 3} \to \mathbb{R}^{M \times H \times W \times 3}$ that renders the point cloud $\mathcal{X}$ from multiple view-points $\mathcal{U}$ into $M$ images of size $H \times W \times 3$. In addition to projecting the point cloud into the image space, $\mathbf{R}$ defines the index mapping $\mathbf{B} \in \{0, \ldots, N\}^{M \times H \times W}$ from each pixel to one of the $N$ points or to the background it renders. Also, $\mathbf{R}$ outputs the binary visibility matrix $\mathbf{V} \in \{0, 1\}^{N \times M}$ for each point from each view. Since not all points appear in all the views due to pixel discretization, the visibility score $\mathbf{V}_{i,j}$ defines whether the Voint $\widehat{\mathbf{x}}_i$ is visible in the view $\mathbf{u}_j$. The matrix $\mathbf{B}$ is crucial for unprojection, while $\mathbf{V}$ is needed for defining meaningful operations on Voints.

2-Multi-View Feature Extraction.

The rendered images are processed by a function $\mathbf{C} : \mathbb{R}^{M \times H \times W \times 3} \to \mathbb{R}^{M \times H \times W \times d}$ that extracts image features. If $\mathbf{C}$ is the identity function, all the view-features would be identical for each Voint (typically the RGB value of the corresponding point). However, $\mathbf{C}$ can be a 2D network dedicated to the downstream task and can extract useful global and local features about each view.

3-Multi-View Unprojection.

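We propose a module $\Phi_{\mathbf{B}} : \mathbb{R}^{M \times H \times W \times d} \to \mathbb{R}^{N \times M \times d}$ that unprojects the 2D features from each pixel to be 3D view-features at the corresponding Voint. Using the mapping $\mathbf{B}$ created by the renderer, $\Phi_{\mathbf{B}}$ forms the Voint cloud features $\widehat{\mathcal{X}}$. To summarize, the output Voint cloud is described by Eq (1), where $g(\mathbf{x}_i, \mathbf{u}_j) = \Phi_{\mathbf{B}}\big(\mathbf{C}(\mathbf{R}(\mathcal{X}, \mathbf{u}_j))\big)_i$ and the features are only defined for a view $j$ of Voint $\widehat{\mathbf{x}}_i$ if $\mathbf{V}_{i,j} = 1$.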
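
The unprojection step can be sketched in plain Python as a scatter of per-pixel features into per-point, per-view slots. This is an illustrative sketch, not the authors' implementation. It assumes that index 0 in the map B marks the background and that points are numbered 1..N, and it keeps the first pixel encountered when several pixels of one view map to the same point; the source does not specify either convention.

```python
def unproject(features, index_map, num_points):
    """Scatter pixel features to Voints.

    features:  M x H x W x d nested lists (output of the 2D backbone).
    index_map: M x H x W integers in {0, ..., N}; 0 = background (assumed).
    Returns (voints, visibility) with shapes N x M x d and N x M.
    """
    num_views = len(index_map)
    dim = len(features[0][0][0])
    voints = [[[0.0] * dim for _ in range(num_views)] for _ in range(num_points)]
    visibility = [[0] * num_views for _ in range(num_points)]
    for j, view in enumerate(index_map):
        for r, row in enumerate(view):
            for c, idx in enumerate(row):
                if idx == 0:
                    continue  # background pixel, nothing to unproject
                i = idx - 1  # shift to a 0-based point index
                if visibility[i][j] == 0:  # keep the first hit (assumption)
                    voints[i][j] = list(features[j][r][c])
                    visibility[i][j] = 1
    return voints, visibility


# Toy setup: M = 2 views of size 1 x 2, d = 2 features, N = 2 points.
index_map = [[[1, 0]], [[2, 1]]]
features = [[[[1.0, 2.0], [9.0, 9.0]]], [[[3.0, 4.0], [5.0, 6.0]]]]
voints, visibility = unproject(features, index_map, num_points=2)
```

In this toy setup the second point is not hit by any pixel of the first view, so its visibility entry for that view stays 0 and its feature slot stays at the zero placeholder.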

3.2. Operations on 3D Voint Clouds

We show in the Appendix that a functional form of max-pooled individual view-features of a set of angles can approximate any function in the spherical coordinates. We provide a theorem that extends PointNet's theorem of point cloud functional composition (Qi et al., 2017a) and its Universal Approximation to the spherical functions underlying Voints. Next, we define a set of operations on Voints as building blocks for Voint neural networks (VointNet).

VointMax. We define VointMax as max-pooling on the visible view-features along the views dimension of the Voint $\widehat{\mathbf{x}}$. For all $i \in \{1, 2, \ldots, N\}$ and $j \in \{1, 2, \ldots, M\}$,

$$\text{VointMax}(\widehat{\mathbf{x}}_i) = \max_{j} \widehat{\mathbf{x}}_{i,j}, \quad \text{s.t. } \mathbf{V}_{i,j} = 1 \tag{2}$$

VointConv. We define the convolution operation $h_V : \mathbb{R}^{N \times M \times d} \to \mathbb{R}^{N \times M \times d'}$ as any learnable function that operates on the Voint space with shared weights on all the Voints. It takes view-features of size $d$ as input, outputs view-features of size $d'$, and consists of $l_V$ layers. A simple example of this VointConv operation is the shared MLP applied only on the visible view-features. We provide further details for such operations in Section 4.2, which result in different non-exhaustive variants of VointNet.
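
As a minimal sketch of Eq (2), VointMax for a single Voint can be written as an element-wise maximum restricted to the visible views. Returning `None` for a Voint with no visible view is an assumption made here for illustration.

```python
def voint_max(voint, visible):
    """Max-pool the visible view-features of one Voint.

    voint:   M x d nested list of view-features.
    visible: length-M list of 0/1 visibility flags (one row of V).
    Returns a length-d list, or None if no view is visible (assumption).
    """
    rows = [feat for feat, flag in zip(voint, visible) if flag]
    if not rows:
        return None
    # zip(*rows) iterates over feature channels; take the max over views.
    return [max(channel) for channel in zip(*rows)]


example_voint = [[1.0, 5.0], [7.0, 0.0], [3.0, 9.0]]  # M = 3 views, d = 2
example_visible = [1, 0, 1]                            # second view is hidden
pooled = voint_max(example_voint, example_visible)
```

The hidden view holds the largest first-channel value (7.0), but it is excluded by the visibility mask, which is the point of the constraint in Eq (2).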

3.3. Learning on 3D Voint Clouds

VointNet. The goal of the VointNet model is to obtain multi-view point cloud features that can be subsequently used by any point cloud processing pipeline. The VointNet module $\mathbf{F} : \mathbb{R}^{N \times M \times d} \to \mathbb{R}^{N \times d}$ is defined as follows:

$$\mathbf{F}(\widehat{\mathcal{X}}) = h_P\Big(\text{VointMax}\big(h_V(\widehat{\mathcal{X}})\big)\Big) \tag{3}$$

where $h_P$ is any point convolutional operation (e.g. shared MLP or EdgeConv). VointNet $\mathbf{F}$ transforms the individual view-features using the learned VointConv $h_V$ before VointMax is applied on the view-features to obtain point features.

VointNet Pipeline for 3D Point Cloud Processing. The full pipeline is described in Figure 2. The loss for this pipeline can be described as follows:

$$\underset{\theta_{\mathbf{C}},\, \theta_{\mathbf{F}}}{\arg\min} \sum_{i}^{N} L\Big(\mathbf{F}\big(\Phi_{\mathbf{B}}(\mathbf{C}(\mathbf{R}(\mathcal{X}, \mathcal{U})))\big)_i,\; y_i\Big) \tag{4}$$

where $L$ is a Cross-Entropy (CE) loss defined on all the training points $\mathcal{X}$, and $\{y_i\}_{i=1}^{N}$ defines the labels of these points. The other components ($\mathbf{R}$, $\Phi_{\mathbf{B}}$, $\mathcal{U}$, $\mathbf{C}$) are all defined before. The weights to be jointly learned are those of the 2D backbone ($\theta_{\mathbf{C}}$) and those of the VointNet ($\theta_{\mathbf{F}}$) using the same 3D loss. An auxiliary 2D loss on $\theta_{\mathbf{C}}$ can be optionally added for supervision at the image level. For classification, the entire object can be treated as a single Voint, and the global features of each view would be the view-features of that Voint. We analyze different setups in detail in Section 6.

Metrics. For 3D point cloud classification, we report the overall accuracy, while shape retrieval is evaluated using mean Average Precision (mAP) over test queries (Hamdi et al., 2021). 3D semantic segmentation is evaluated using mean Intersection over Union (mIoU) on points. For part segmentation, we report Instance-averaged mIoU (Ins. mIoU).
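
The composition of VointConv, VointMax, and a point operation in the VointNet definition can be sketched with toy single-layer stand-ins. The helper `linear_relu`, the weight matrices, and the omission of normalization are assumptions for illustration; this is not the trained architecture, and it assumes every Voint has at least one visible view.

```python
def linear_relu(x, weight):
    """Apply ReLU(x W) for a length-d_in vector x and a d_in x d_out matrix W."""
    return [max(0.0, sum(xi * w for xi, w in zip(x, column)))
            for column in zip(*weight)]


def voint_net(voint_cloud, visibility, w_v, w_p):
    """Toy VointNet: h_P(VointMax(h_V(X_hat))) with one-layer h_V and h_P."""
    point_features = []
    for voint, flags in zip(voint_cloud, visibility):
        # h_V: shared layer applied to each visible view-feature.
        transformed = [linear_relu(f, w_v) for f, flag in zip(voint, flags) if flag]
        # VointMax: element-wise max over the visible views.
        pooled = [max(channel) for channel in zip(*transformed)]
        # h_P: shared per-point layer on the pooled feature.
        point_features.append(linear_relu(pooled, w_p))
    return point_features


toy_cloud = [[[1.0, -1.0], [2.0, 0.5]], [[0.0, 3.0], [4.0, 4.0]]]  # N=2, M=2, d=2
toy_visibility = [[1, 1], [1, 0]]
w_v = [[1.0, 0.0], [0.0, 1.0]]   # identity weights for h_V (illustrative)
w_p = [[1.0], [1.0]]             # sums the two channels for h_P (illustrative)
toy_output = voint_net(toy_cloud, toy_visibility, w_v, w_p)
```

The output has one feature vector per point, so it can be passed to any point cloud head, which matches the stated goal of mapping an N × M × d Voint cloud to N × d point features.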

4. Experiments

Baselines. We include PointNet (Qi et al., 2017a) , PointNet++ (Qi et al., 2017b) , DGCNN (Wang et al., 2019c) , as baselines that use point clouds. We also compare against multi-view classification approaches like MVCNN (Su et al., 2015) , SimpleView (Goyal et al., 2021) , and MVTN (Hamdi et al., 2021) as baselines for classification and retrieval and adopt some of the multi-view segmentation baselines (e.g. Label Fusion (Wang et al., 2019a) and Mean Fusion (Kundu et al., 2020) ) for part segmentation.

4.2. VointNet Variants

VointNet in Eq (3) relies on the VointConv operation $h_V$ as the basic building block. Here, we briefly describe three examples of $h_V$ operations that VointNet uses.

Shared Multi-Layer Perceptron (MLP).

It is the most basic VointConv formulation. For a layer $l$, the features of Voint $i$ at view $j$ are updated to layer $l+1$ as $\mathbf{h}^{l+1}_{i,j} = \rho\big(\mathbf{h}^{l}_{i,j} \mathbf{W}_{\rho}\big)$, where $\rho$ is the shared MLP with weights $\mathbf{W}_{\rho}$ followed by normalization and a nonlinear function (e.g. ReLU). This operation is applied on all Voints independently and only involves the visible view-features of each Voint. This formulation extends the shared MLP formulation of PointNet (Qi et al., 2017a) to work on Voints' view-features.

Graph Convolution (GCN).

We define a fully connected graph for each Voint by creating a virtual center node connected to all the view-features to aggregate their information (similar to "cls" token in ViT (Dosovitskiy et al., 2021) ). Then, the graph convolution can be defined as the shared MLP (as described above) but on the edge features between all view features, followed by a max pool on the graph neighbors. An additional shared MLP is used before the final output.

Graph Attention (GAT).

A graph attention operation can be defined just like the GCN operation above but with learned attention weights on the graph neighbor's features before averaging them. A shared MLP computes these weights.

[Table header residue: MVCNN (Su et al., 2015), RotNet (Kanezaki et al., 2018), ViewGCN (Wei et al., 2020), MVTN (Hamdi et al., 2021), VointNet.]

[Table caption fragment: [...] (Yi et al., 2016). At test time, we randomly rotate the objects and report the results over ten runs. Note how VointNet's performance largely exceeds the point baselines in the realistic rotated scenarios, while exceeding multi-view baselines on the unrotated benchmark. All the results are reproduced in our setup.]

4.3. Implementation Details

Rendering and Unprojection. We choose the differentiable point cloud renderer $\mathbf{R}$ from PyTorch3D (Ravi et al., 2020) in our pipeline for its speed and compatibility with PyTorch libraries (Paszke et al., 2017). We render point clouds on multi-view images with size 224 × 224 × 3. We color the points by their normals' values or keep them white if the normals are not available. Following a similar procedure to (Wei et al., 2020; Hamdi et al., 2021), the view-points setup is randomized during training (using M = 8 views) and fixed to spherical views in testing (using M = 12 views).

Architectures. For the 2D backbone $\mathbf{C}$, we use ViT-B (Dosovitskiy et al., 2021) (with pretrained weights from the TIMM library (Wightman, 2019)) for classification and DeepLabV3 (Chen et al., 2018) for segmentation. We use the 3D CE loss on the 3D point cloud output and the 2D CE loss when the loss is defined on the pixels. The feature dimension of the VointNet architectures is d = 64, and the depth is $l_V$ = 4 layers in $h_V$. The main results are based on VointNet (MLP), unless otherwise specified as in Section 6, where we study in detail the effect of VointConv $h_V$ and $\mathbf{C}$.

Training Setup. We train our pipeline in two stages: we start by training the 2D backbone on the 2D projected labels of the points, then train the entire pipeline end-to-end while focusing the training on the VointNet part. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.0005 and a step learning rate schedule of 33.3% every 12 epochs for 40 epochs. The pipeline is trained with one NVIDIA Tesla V100 GPU. We do not use any data augmentation. More details about the training setup (loss and rendering), VointNet, and the 2D backbone architectures can be found in the Appendix.
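
For illustration, the step learning rate schedule in the training setup can be sketched as a small function. The text does not say whether "33.3%" is the multiplicative factor or the size of the reduction; the sketch below assumes the learning rate is multiplied by 1/3 every 12 epochs, and the function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.0005, step_size=12, gamma=1.0 / 3.0):
    """Learning rate at a given epoch under a step schedule.

    Assumption: the rate is multiplied by gamma = 1/3 every step_size epochs.
    """
    return base_lr * gamma ** (epoch // step_size)


# Learning rate for each of the 40 training epochs.
schedule = [step_lr(epoch) for epoch in range(40)]
```

Under this reading, the rate is reduced three times during training, at epochs 12, 24, and 36.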

5. Results

The main test results of our Voint formulations are summarized in Tables 2, 3, 4, and 5. We achieve state-of-the-art performance in the task of 3D [...] 2021). VointNet demonstrates state-of-the-art results on all the variants, including the challenging Hardest (PB_T50_RS) variant, which includes rotated and translated objects. The increase in performance (+2.6%) is significant on this variant, which highlights the benefits of Voints in challenging scenarios, with further affirming results in Section 5.4. We follow exactly the same procedure as in MVTN (Hamdi et al., 2021). Figure 3 shows qualitative 3D segmentation results for VointNet and Mean Fuse (Kundu et al., 2020) as compared to the ground truth.

5.4. Occlusion Robustness

One aspect of the robustness of 3D classification models that has been recently studied is robustness to occlusion, as detailed in MVTN (Hamdi et al., 2021). These simulated occlusions are introduced at test time, and the average test accuracy is reported for each cropping ratio. We benchmark our VointNet against recent baselines in Table 5. PointNet (Qi et al., 2017a) and DGCNN (Wang et al., 2019c) [...] (Kundu et al., 2020) and Label Fuse (Wang et al., 2019a). Both baselines use the same trained 2D backbone as VointNet and are tested on the same unrotated setup.

6. Analysis and Insights

Number of Views. We study the effect of the number of views M on the performance of 3D part segmentation using multiple views. We compare Mean Fuse (Kundu et al., 2020) and Label Fuse (Wang et al., 2019a) to our VointNet when all of them have the same trained 2D backbone. The views are randomly picked, and the experiments are repeated four times. Ins. mIoU with confidence intervals are shown in Figure 4 . We observe a consistent improvement with VointNet over the other two baselines across different numbers of views. 

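Table 6 column groups: 2D Backbone (FCN, DeepLabV3), VointConv (MLP, GCN, GAT), Results (Ins. mIoU).

FCN | DeepLabV3 | MLP | GCN | GAT | Ins. mIoU
✓ | - | ✓ | - | - | 78.8 ± 0.2
✓ | - | - | ✓ | - | 77.6 ± 0.2
✓ | - | - | - | ✓ | 77.1 ± 0.2
- | ✓ | ✓ | - | - | 80.6 ± 0.1
- | ✓ | - | ✓ | - | 77.2 ± 0.4
- | ✓ | - | - | ✓ | 80.4 ± 0.2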

Choice of Backbones.

We ablate the choice of the 2D backbone and the VointConv operation used in VointNet and report the segmentation Ins. mIoU results in Table 6. Note how the 2D backbone greatly affects performance, while the VointConv operation type does not. This ablation highlights the importance of the 2D backbone in the VointNet pipeline and motivates the use of the simplest variant of VointNet (MLP). We provide a detailed study of more factors as well as compute and memory costs in the Appendix.

One aspect limiting the performance of Voints is how well-trained the 2D backbone is for the downstream 3D task. In most cases, the 2D backbone must be pretrained with enough data to learn meaningful information for VointNet. Another aspect that limits the capability of the Voint cloud is how to properly select the view-points for segmentation. Addressing these limitations is an important direction for future work. Extending Voint learning to more 3D tasks like 3D scene segmentation and 3D object detection is also left for future work.

[Figure: toy 2D example. Here, we look at a single 2D point at the center with a circular function $g(u) = \text{sign}(\cos u)$ from five arbitrary view-points $\{u_j\}_{j=1}^{5}$. Trying to reduce $g$ to a single value based on the $u_j$ projections undermines the underlying structure of $g$. We take the full set $\{(u_j, g(u_j))\}_{j=1}^{5}$ as a representation of $g$ and learn a set function $f$ on these view-features for a more informative manner of representation aggregation.]

Note that $\gamma(q(\mathcal{U}))$ can be rewritten as follows:

$$\gamma(q(\mathcal{U})) = \gamma\big(q(u_1, \ldots, u_M)\big) = \gamma\big(\text{MAX}(h(u_1), \ldots, h(u_M))\big) = (\gamma \circ \text{MAX})\big(h(u_1), \ldots, h(u_M)\big)$$

Since $\gamma \circ \text{MAX}$ is a symmetric function, and from Eq (8) and Eq (9), we reach the main result in Eq (5). This concludes the proof. □
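
The last step of the proof relies on $\gamma \circ \text{MAX}$ being symmetric in its arguments. The sketch below checks this numerically for one made-up choice of the per-view map `h` and the outer function `gamma`; both are hypothetical stand-ins chosen for illustration, not the functions used in the theorem.

```python
import itertools
import math


def h(u):
    """Hypothetical per-view feature map for a view angle u (radians)."""
    return [math.cos(u), math.sin(u)]


def q(views):
    """Element-wise MAX over the per-view features h(u_1), ..., h(u_M)."""
    return [max(channel) for channel in zip(*(h(u) for u in views))]


def gamma(z):
    """Hypothetical continuous function applied after pooling."""
    return sum(z)


views = [0.3, 1.2, 2.5, 4.0, 5.1]  # five arbitrary view angles
# Evaluate gamma(q(.)) on every ordering of the same five views.
outputs = {gamma(q(order)) for order in itertools.permutations(views)}
is_symmetric = len(outputs) == 1
```

Because the maximum ignores the order of its inputs, every permutation of the views yields the same value, which is the symmetry property the proof uses.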

A.3 3D Voint Cloud

Plenoptic and Spherical Coordinate Functions. The Plenoptic function was first introduced by McMillan and Bishop (McMillan & Bishop, 1995) in 1995 as a general function that describes the visible world. The Plenoptic function $P$ is a continuous spherical function that describes the visibility at any Euclidean 3D point in space $(V_x, V_y, V_z)$ when looking into any direction $(\theta, \phi)$ across wavelength $\lambda$ at time $t$. It is defined as $p = P(\theta, \phi, \lambda, V_x, V_y, V_z, t)$. Such a remarkable and compact formulation covers all the images observed as just samples of the function $P$. For fixed time and wavelength, the reduced Plenoptic function becomes $p = P(\theta, \phi, V_x, V_y, V_z)$, which can describe any field in 3D space. This shortened formulation is what Neural Radiance Fields (NeRFs) (Mildenhall et al., 2020; Pumarola et al., 2021; Martin-Brualla et al., 2021) try to learn with MLPs to describe the radiance and RGB values in the continuous Euclidean space with a dependency on the view direction $(\theta, \phi)$. In the same spirit as the Plenoptic function and NeRFs, the Voint cloud representation relies on the viewing angles $(\theta, \phi)$ to define the view-features. The problem with Plenoptic functions, and subsequently NeRFs, is that they are very high dimensional, and any attempt to densely represent the scene with discrete and fixed data will cause memory and compute issues (Yu et al., 2021; Pumarola et al., 2021). Unlike NeRFs (Mildenhall et al., 2020), which define dense 3D volumes, we focus only on the surface of the 3D shapes with our Voint cloud representation. Our Voints are in the order of the sampled point cloud, offering a compact representation that allows for efficient computation and memory while maintaining the view-dependent component that facilitates view-based learning.

From Point Clouds to Voint Clouds.
Implicit representation of 3D surfaces typically aims to learn an implicit function $g_s(\mathbf{x}) : \mathbb{R}^3 \to \mathbb{R}$ that defines the Signed Distance Function (SDF) or the occupancy in the continuous Euclidean space (Park et al., 2019; Mescheder et al., 2019). The 3D iso-surface is then defined as the set of all points $\mathbf{x}$ that satisfy the condition $g_s(\mathbf{x}) = 0$ (assuming $g_s(\mathbf{x})$ is an SDF hereafter). We define a surface 3D point cloud $\mathcal{X} \in \mathbb{R}^{N \times 3}$ as a set of $N$ 3D points, where each point $\mathbf{x}_i \in \mathbb{R}^3$ is represented by its 3D coordinates $(x_i, y_i, z_i)$ and satisfies the iso-surface condition as follows:

$$\mathcal{X} = \left\{ \mathbf{x}_i \in \mathbb{R}^3 \mid g_s(\mathbf{x}_i) = 0 \right\}_{i=1}^{N} \tag{10}$$

Here, we assume that surface points also depend on the view direction from which they are being observed. Specifically, there exists a continuous implicit spherical function $g(\mathbf{x}, \mathbf{u}) : \mathbb{R}^5 \to \mathbb{R}^d$ that defines the features at each point $\mathbf{x}$ depending on the view direction $\mathbf{u}$. Given a set of $M$ view-point directions $\mathcal{U} \in \mathbb{R}^{M \times 2}$, a Voint $\widehat{\mathbf{x}} \in \mathbb{R}^{M \times d}$ is a set of $M$ view-dependent features of size $d$ for the sphere centered at point $\mathbf{x}$. The Voint cloud $\widehat{\mathcal{X}} \in \mathbb{R}^{N \times M \times d}$ is the set of all $N$ Voints $\widehat{\mathbf{x}}$:

$$\widehat{\mathbf{x}}_i = \left\{ g(\mathbf{x}_i, \mathbf{u}_j) \in \mathbb{R}^d \mid \mathbf{x}_i \in \mathcal{X} \right\}_{j=1}^{M}, \qquad \widehat{\mathcal{X}} = \left\{ \widehat{\mathbf{x}}_i \in \mathbb{R}^{M \times d} \right\}_{i=1}^{N} \tag{11}$$

Note that we typically do not have access to the underlying implicit function $g$, and we approximate it by 2D projection, feature extraction, and then unprojection, as we show next.

1-Multi-View Projection.

As mentioned earlier, a Voint combines multiple view-features of the same 3D point. These view-features come from a multi-view projection of the points by a point cloud renderer $\mathbf{R} : \mathbb{R}^{N \times 3} \to \mathbb{R}^{M \times H \times W \times 3}$ that renders the point cloud $\mathcal{X}$ from multiple view-points $\mathcal{U}$ into $M$ images of size $H \times W \times 3$. In addition to projecting the point cloud into the image space, $\mathbf{R}$ defines the mapping $\mathbf{B} \in \{0, \ldots, N\}^{M \times H \times W}$ from each pixel to one of the $N$ points or to the background it renders. Also, $\mathbf{R}$ outputs the binary visibility matrix $\mathbf{V} \in \{0, 1\}^{N \times M}$ for each point from each view. Since not all points appear in all the views due to pixel discretization, the visibility score $\mathbf{V}_{i,j}$ defines whether the Voint $\widehat{\mathbf{x}}_i$ is visible in the view $\mathbf{u}_j$. The matrix $\mathbf{B}$ is crucial for unprojection, while $\mathbf{V}$ is needed for defining meaningful operations on Voints.

2-Multi-View Feature Extraction.

The rendered images are processed by a function $\mathbf{C} : \mathbb{R}^{M \times H \times W \times 3} \to \mathbb{R}^{M \times H \times W \times d}$ that extracts image features. If $\mathbf{C}$ is the identity function, all the view-features would be identical for each Voint (typically the RGB value of the corresponding point). However, $\mathbf{C}$ can be a 2D network dedicated to the downstream task and can extract useful global and local features about each view.

3-Multi-View Unprojection.

We propose a module $\Phi_{\mathbf{B}} : \mathbb{R}^{M \times H \times W \times d} \to \mathbb{R}^{N \times M \times d}$ that unprojects the 2D features from each pixel to be 3D view-features at the corresponding Voint. This is performed by using the mapping $\mathbf{B}$ created by the renderer to form the Voint cloud features $\widehat{\mathcal{X}}$. Note that the points are not necessarily visible from all the views, and some Voints that are not visible from any of the $M$ views will not receive any features. We post-process these empty points (∼0.5% of points during inference) by filling them with the features of their nearest 3D neighbors. The output Voint cloud features are described as follows:

$$\widehat{\mathbf{x}}_i = \left\{ \mathbf{g}_{i,j,:} \in \mathbb{R}^d \mid \mathbf{x}_i \in \mathcal{X},\ \mathbf{V}_{i,j} = 1 \right\}_{j=1}^{M}, \qquad \mathbf{g}_{:,j} = \Phi_{\mathbf{B}}\big(\mathbf{C}(\mathbf{R}(\mathcal{X}, \mathbf{u}_j)), \mathbf{B}\big), \qquad \widehat{\mathcal{X}} = \left\{ \widehat{\mathbf{x}}_i \in \mathbb{R}^{M \times d} \right\}_{i=1}^{N} \tag{12}$$

A.4 Voint Operations

VointMax. In order to learn a neural network in the Voint space in the form dictated by Theorem 1, we need to define some basic differentiable operations on the Voint space. [...] Voints and has the view-features input size $d$ and outputs view-features of size $d'$ and consists of $l_V$ layers. Examples of this VointConv operation include the following.

Shared MLP. It is the most basic Voint neural network. For layer $l$, the features of Voint $i$ at view $j$ are updated to layer $l+1$ as follows:

$$\mathbf{h}^{l+1}_{i,j} = \rho\big(\mathbf{h}^{l}_{i,j} \mathbf{W}_{\rho}\big), \quad \forall i, j \ \text{s.t.}\ i \in \{1, 2, \ldots, N\},\ j \in \{1, 2, \ldots, M\},\ \mathbf{V}_{i,j} = 1 \tag{16}$$

where $\rho$ is the shared MLP with weights $\mathbf{W}_{\rho}$ followed by normalization and a nonlinear function (e.g. ReLU), applied on all Voints independently at the visible view-features of each Voint. This formulation extends the shared MLP formulation of PointNet (Qi et al., 2017a) to make the MLP shared across the Voints and the view-features.
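
The post-processing of empty Voints mentioned above can be sketched with a brute-force nearest-neighbor search. This is an illustrative sketch under stated assumptions: it copies both the view-features and the visibility row from the nearest Voint that has at least one visible view, and it uses squared Euclidean distance on the 3D coordinates. The source only states that empty points are filled with nearest 3D neighbor features, so the exact level at which the copy happens is an assumption.

```python
def fill_empty_voints(points, voints, visibility):
    """Fill Voints with no visible view from their nearest non-empty neighbor.

    points:     N x 3 coordinates.
    voints:     N x M x d view-features.
    visibility: N x M flags (0/1).
    Returns new (voints, visibility) lists; the inputs are left unchanged.
    """
    has_features = [any(flags) for flags in visibility]
    donors = [i for i, ok in enumerate(has_features) if ok]
    filled_voints = [[list(f) for f in voint] for voint in voints]
    filled_visibility = [list(flags) for flags in visibility]
    for i, ok in enumerate(has_features):
        if ok or not donors:
            continue
        # Brute-force search over all Voints that received features.
        nearest = min(
            donors,
            key=lambda k: sum((a - b) ** 2 for a, b in zip(points[i], points[k])),
        )
        filled_voints[i] = [list(f) for f in voints[nearest]]
        filled_visibility[i] = list(visibility[nearest])
    return filled_voints, filled_visibility


pts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.1, 0.0, 0.0)]
vts = [[[1.0], [2.0]], [[3.0], [4.0]], [[0.0], [0.0]]]
vis = [[1, 1], [0, 1], [0, 0]]  # the third point is not seen from any view
filled_vts, filled_vis = fill_empty_voints(pts, vts, vis)
```

For large point clouds a spatial index would replace the brute-force search; the quadratic loop is used here only to keep the sketch short.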

Graph Convolution (GCN).

Just like how DGCNN (Wang et al., 2019c) extended PointNet (Qi et al., 2017a) by taking the neighborhood information and extracting edge features, we extend the basic VointNet formulation in Eq (15). We define a fully connected graph for each Voint along the views dimension by creating a center virtual node connected to all the view-features (similar to the classification token in ViT (Dosovitskiy et al., 2021)). This center virtual view-feature is assigned the index $j = 0$ and can be initialized with zeros, like the "cls" token in ViT (Dosovitskiy et al., 2021). Then, the Voint graph convolution operation can be defined as follows to update the activations from layer $l$ to $l+1$:

$$\mathbf{h}^{l+1}_{i,j} = \rho\Big(\max_{k}\, \psi\big((\mathbf{h}^{l}_{i,j}, \mathbf{h}^{l}_{i,k}) \mathbf{W}_{\psi}\big)\, \mathbf{W}_{\rho}\Big), \quad \forall i, j, k \ \text{s.t.}\ i \in \{1, 2, \ldots, N\},\ j \in \{0, 1, \ldots, M\},\ k \in \{0, 1, \ldots, M\},\ k \neq j,\ \mathbf{V}_{i,j} = 1 \tag{17}$$

where $\rho$ and $\psi$ are two different shared MLPs as in Eq (16). The difference between VointNet and VointNet (GCN) is highlighted in Figure 6.
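
A minimal sketch of the graph convolution on a single Voint follows. It is not the authors' implementation: the edge input is assumed to be the concatenation of the two view-features, each shared MLP is reduced to one linear layer with ReLU and no normalization, and neighbors are restricted to the visible views plus the zero-initialized virtual center node. The helper names are hypothetical.

```python
def linear_relu(x, weight):
    """Apply ReLU(x W) for a length-d_in vector x and a d_in x d_out matrix W."""
    return [max(0.0, sum(xi * w for xi, w in zip(x, column)))
            for column in zip(*weight)]


def voint_graph_conv(voint, visible, w_psi, w_rho):
    """One graph-convolution layer on a single Voint.

    Node 0 is the virtual center node, initialized with zeros.
    Returns one output feature per node (virtual node first).
    """
    dim = len(voint[0])
    nodes = [[0.0] * dim] + [f for f, flag in zip(voint, visible) if flag]
    outputs = []
    for j, h_j in enumerate(nodes):
        # Edge features psi((h_j, h_k) W_psi) for every other node k.
        edges = [linear_relu(h_j + h_k, w_psi)
                 for k, h_k in enumerate(nodes) if k != j]
        # Max-pool over the graph neighbors, then apply rho.
        pooled = [max(channel) for channel in zip(*edges)]
        outputs.append(linear_relu(pooled, w_rho))
    return outputs


gcn_voint = [[1.0], [2.0], [5.0]]   # M = 3 views, d = 1
gcn_visible = [1, 1, 0]             # the third view is hidden
w_psi = [[1.0], [1.0]]              # sums the concatenated pair (illustrative)
w_rho = [[1.0]]                     # identity (illustrative)
gcn_out = voint_graph_conv(gcn_voint, gcn_visible, w_psi, w_rho)
```

The output at index 0 belongs to the virtual node, which aggregates information from all visible views in the same way a classification token would.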

Graph Attention (GAT).

Similar to how Point Transformer (Zhao et al., 2020) extended the graph convolution by adding attention to DGCNN (Wang et al., 2019c), we extend the basic Voint GraphConv formulation in Eq (17). The Voint graph attention operation can be defined as follows to update the activations from layer $l$ to $l+1$:

$$\mathbf{h}^{l+1}_{i,j} = \rho\Big(\sum_{k=0,\, k \neq j}^{M} \eta_k\, \psi\big((\mathbf{h}^{l}_{i,j}, \mathbf{h}^{l}_{i,k}) \mathbf{W}_{\psi}\big)\, \mathbf{W}_{\rho}\Big), \quad \eta_k = \zeta\big(\mathbf{h}^{l}_{i,k} \mathbf{W}_{\zeta}\big), \quad \forall i, j \ \text{s.t.}\ i \in \{1, 2, \ldots, N\},\ j \in \{0, 1, \ldots, M\},\ \mathbf{V}_{i,j} = 1 \tag{18}$$

where $\rho$, $\psi$, and $\zeta$ are three different shared MLPs as in Eq (16), and $\eta_k$ are the learned attention weights for each neighbor view-feature.

B Detailed Experimental Setup B.1 Datasets

ScanObjectNN: 3D Point Cloud Classification. We follow the literature (Goyal et al., 2021; Hamdi et al., 2021) (Yi et al., 2016). Figure 10 shows some of the renderings used in training the 2D backbone in our pipeline, colored with the ground truth segmentation labels.

ModelNet40: 3D Shape Classification Occlusion Robustness. ModelNet40 (Wu et al., 2015) is composed of 12,311 3D objects (9,843/2,468 in training/testing) labelled with 40 object classes. We sample 2048 points from each object following previous works (Qi et al., 2017b; Zhao et al., 2020). Figure 8 shows some of the renderings used in training the 2D backbone in our pipeline.

B.2 Metrics

Classification Accuracy. The standard evaluation metric in 3D classification is accuracy. We report overall accuracy (percentage of correctly classified test samples) and average per-class accuracy (mean of all true class accuracies).

Retrieval mAP. Shape retrieval is evaluated by mean Average Precision (mAP) over test queries. For every query shape $S_q$ from the test set, AP is defined as

$$AP = \frac{1}{GTP} \sum_{n=1}^{N} \frac{\mathbb{1}(S_n)}{n},$$

where GTP is the number of ground truth positives, N is the size of the ordered training set, and $\mathbb{1}(S_n) = 1$ if the shape $S_n$ has the same class label as the query $S_q$. We average the retrieval AP over the test set to measure retrieval mAP.

Segmentation mIoU. Semantic segmentation is evaluated by mean Intersection over Union (mIoU) over pixels or points. For every class label, we measure the size of the intersection between the ground truth points of that label and the points predicted as that label, then divide by the size of the union of the two sets to get the IoU. This procedure is repeated over all the labels, and averaging the IoUs gives the mIoU. We report two types of mIoU: Instance-averaged mIoU (the average of the mIoUs across all objects) and Category-averaged mIoU (the mIoUs of shapes of the same category are averaged first, and those averages are then averaged across object categories).
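The two metrics can be illustrated with a small pure-Python sketch. The function names are hypothetical, `average_precision` simply evaluates the AP expression exactly as it is written in the text for one ranked retrieval list, and skipping labels that appear in neither the prediction nor the ground truth is an assumption (benchmarks differ on how they treat such labels).

```python
def average_precision(relevant):
    """relevant: 0/1 flags ordered by retrieval rank (rank 1 first).

    Evaluates AP = (1 / GTP) * sum_n 1(S_n) / n as stated in the text.
    """
    gtp = sum(relevant)  # number of ground truth positives
    if gtp == 0:
        return 0.0
    return sum(flag / n for n, flag in enumerate(relevant, start=1)) / gtp

def mean_iou(pred, gt, labels):
    """pred, gt: per-point label lists of equal length for one shape."""
    ious = []
    for c in labels:
        inter = sum(1 for p, g in zip(pred, gt) if p == c and g == c)
        union = sum(1 for p, g in zip(pred, gt) if p == c or g == c)
        if union > 0:  # assumption: labels absent from both are skipped
            ious.append(inter / union)
    return sum(ious) / len(ious) if ious else 0.0

ap = average_precision([1, 0, 1])                      # (1/1 + 1/3) / 2
miou = mean_iou([0, 0, 1, 1], [0, 1, 1, 1], [0, 1])    # (1/2 + 2/3) / 2
```

Averaging `mean_iou` over all shapes would give the instance-averaged figure; averaging within each category first and then across categories would give the category-averaged one.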

B.3 Baselines

Point Cloud Networks. We include PointNet (Qi et al., 2017a), PointNet++ (Qi et al., 2017b), DGCNN (Wang et al., 2019c), PVNet (You et al., 2018), KPConv (Thomas et al., 2019), Point Transformer (Zhao et al., 2020), and CurveNet (Xiang et al., 2021) as baselines that use point clouds. These methods leverage different convolution operators on point clouds by aggregating local and global point information.

Multi-View Networks. We also compare against multi-view classification approaches like MVCNN (Su et al., 2015) and MVTN (Hamdi et al., 2021) as baselines for classification and retrieval. Since there is no available multi-view pipeline for 3D part segmentation, we adapt some of the multi-view segmentation baselines (e.g. Label Fusion (Wang et al., 2019a) and Mean Fusion (Kundu et al., 2020)) to work in the Voint space for part segmentation.

B.4 Implementation Details

Rendering and Un-Projection. We choose the differentiable point cloud renderer R from PyTorch3D (Ravi et al., 2020) in our pipeline for its speed and compatibility with PyTorch libraries (Paszke et al., 2017). We render multi-view images with size 224 × 224 × 3. We color the points by their normals' values or keep them white if the normals are not available. Following a similar procedure to (Wei et al., 2020; Hamdi et al., 2021), the view-point setup is randomized during training (using M = 8 views) and fixed to spherical views in testing (using M = 12 views).



Figure 1: 3D Voint Clouds. We propose the multi-view point cloud (Voint cloud), a novel 3D

Figure 2: Learning from Voint Clouds. To construct a 3D Voint cloud X , a renderer R renders the point cloud X from view-points U and image features are extracted from the generated images via a 2D backbone C. The image features are then unprojected to the Voint cloud by ΦB and passed to VointNet F. To learn both C and F, a 3D loss on the output points is used with an optional auxiliary 2D loss on C.

-View Feature Extraction. The rendered images are processed by a function $C : \mathbb{R}^{M \times H \times W \times 3} \to \mathbb{R}^{M \times H \times W \times d}$ that extracts image features, as shown in Figure 2. If C is the identity function, all the view-features would typically be the RGB value of the corresponding point. However, the function C can be a 2D network dedicated to the downstream task and can extract useful global and local features about each view.

Figure 3: Qualitative Comparison for Part Segmentation. We compare our VointNet 3D segmentation predictions to Mean Fuse (Kundu et al., 2020) that is using the same trained 2D backbone. Note how VointNet distinguishes detailed parts (e.g. the car window frame).

Figure 5: A Toy 2D Example of Voints. Voints assume view-dependency for every 3D point.

Figure 8: ModelNet40. We show some examples of point cloud renderings of ModelNet40 (Wu et al., 2015) used for 3D classification robustness in our setup.

Figure 9: ShapeNet Core55. We show some examples of point cloud renderings of ShapeNet Core55 (Chang et al., 2015) used for 3D shape retrieval in our setup.

Figure 10: ShapeNet Parts. We show some examples of point cloud renderings of ShapeNet Parts (Yi et al., 2016) colored with ground truth segmentation labels. We use these renderings as 2D ground truth to pre-train the 2D backbone C for 2D segmentation before training VointNet's pipeline for 3D segmentation.

Comparison of Different 3D Representations. We compare some of the widely used 3D representations to our proposed Voint cloud. Note that our Voint cloud shares the view-dependency of NeRFs

3D Point Cloud Classification on ScanObjectNN. We report the accuracy of VointNet in 3D point cloud classification on three different variants of ScanObjectNN (Uy et al., 2019). Bold denotes the best result in its setup. Note that the Hardest variant includes rotated and translated objects, which highlights the benefits of Voints on challenging scenarios.

3D Shape Retrieval. We report 3D shape retrieval mAP on ShapeNet Core55 (Chang et al., 2015; Sfikas et al., 2017). VointNet achieves state-of-the-art results on this benchmark.

Robust 3D Part Segmentation on ShapeNet Parts. We compare the Inst. mIoU of VointNet against other methods in 3D segmentation on ShapeNet Parts





Table 4 reports the Instance-averaged segmentation mIoU of VointNet compared with other methods on ShapeNet Parts (Yi et al., 2016). Two variants of the benchmark are reported: the unrotated normalized setup and the rotated realistic setup. For the rotated setup, we follow the previous 3D literature (Liu et al., 2019a; Hamdi et al., 2021; 2020) by testing the robustness of trained models, perturbing the shapes in ShapeNet Parts with random rotations at test time (ten runs) and reporting the averages in Table 4. Note VointNet's improvement over Mean Fuse (Kundu et al., 2020) and Label Fuse (Wang et al., 2019a) on the unrotated setup, even though both baselines use the same trained 2D backbone as VointNet. Also, on the rotated setup, point-based methods do not perform as well. All the results in Table 4 are reproduced by our code in the same setup (see the code attached in the supplementary material).

Ablation Study for 3D Segmentation. We ablate different components of VointNet (2D backbone and VointConv choice) and report Ins. mIoU performance on ShapeNet Parts.

on testing 3D classification in the challenging ScanObjectNN (Uy et al., 2019) point cloud dataset, since it includes background and considers occlusions. The dataset is composed of 2902 point clouds divided into 15 object categories. We use 2048 sampled points per object for Voint learning. We benchmark on its variants: Object only, Object with Background, and the Hardest perturbed variant (PB_T50_RS variant). Figure 7 shows some of the renderings used in training the 2D backbone in our pipeline.

The dataset consists of 51,162 3D mesh objects labeled with 55 object classes. The training, validation, and test sets consist of 35764, 5133, and 10265 shapes. We create a dataset of point clouds by sampling 5000 points from each mesh object as in MVTN (Hamdi et al., 2021).

Acknowledgments

This work was supported by the King Abdullah University of Science and Technology (KAUST) Office of Sponsored Research through the Visual Computing Center (VCC) funding and the SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI).

availability

//github.com/ajhamdi/vointcloud

Appendix A Detailed Formulations

A.1 Toy Example

In the toy 2D example in Figure 5, the center point (represented by a circular function g) is viewed from various view-points $u_j$ that are agnostic to the underlying function itself. In many applications, it is desired to have a single feature representing each point in the point cloud. When the projected values of g from these $u_j$ view-points are aggregated together (e.g. by max/mean pooling) to get a constant representation of that point, the underlying properties of g are lost. We build our Voint representation to keep the structure of g intact by taking the full set $\{(u_j, g(u_j))\}_{j=1}^{5}$ in learning the aggregations.

A.2 Functional Form of VointNet

We can look at a simplified setup to decide on the functional form of the deep neural network that operates in the Voint space. In this simplified setup, we consider a 2D example (instead of 3D Voints) and assume that a circular function describes a point at the center. The center point will assume its value according to the angle u. The following Theorem 1 proves that for any continuous set function f that operates on any set of M angles $\{u_1, ..., u_M\}$, there exists an equivalent composite function consisting of transformed max-pooled individual view-features. This composition is the functional form we describe later for Voint neural networks

where γ is a continuous function, and MAX is an element-wise vector max operator.

Proof. By the continuity of f, we take

which split [0, 2π] into K intervals evenly and define an auxiliary function that maps an angle to the beginning of the interval it lies in:

indicating the occupancy of the j-th interval by angles in $\mathcal{U}$. Let $q = [q_1; \dots; q_K]$, then $q : [0, 2\pi]^{M} \to \{0, 1\}^{K}$ is a symmetric function, indicating the occupancy of each interval by angles in $\mathcal{U}$.

which maps the occupancy vector to a set which contains the left end of each angle interval. It is straightforward to show:

Let $\gamma : \mathbb{R}^{K} \to \mathbb{R}$ be a continuous function such that $\gamma(q) = f(\zeta(q))$ for $q \in \{0, 1\}^{K}$. Then from Eq (6) and Eq (7),

max operation on the Voint cloud can be defined as follows.

VointMax($\hat{x}$) = max

Equivalently, $\text{VointMax}(\hat{x}) = \max_{j} \left( \hat{x}_{:,j} - \infty \overline{V}_{:,j} \right)$, where $\overline{V}$ is the complement of V.

VointConv. We define the convolution operation $h_V : \mathbb{R}^{N \times M \times d} \to \mathbb{R}^{N \times M \times d'}$ as any learnable function that operates on the Voint space with shared weights on all the Voints, takes view-features of size d as input, outputs view-features of size d′, and consists of $l_V$ layers. Examples of this VointConv operation include the following operations applied only on the visible view-features: a shared MLP, a graph convolution, and a graph attention.
We detail these operations later in Section A.6, which result in different non-exhaustive variants of VointNet.
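The masked form of VointMax, in which invisible view-features are pushed to minus infinity before the max, can be sketched for a single Voint as follows. This is an illustrative pure-Python sketch; the name `voint_max` and the list-based layout are assumptions, not the authors' code.

```python
NEG_INF = float("-inf")

def voint_max(x_i, V_i):
    """x_i: M x d view-features of one Voint, V_i: length-M visibility flags.

    Invisible views are replaced by -inf so they never win the
    element-wise max over the view dimension.
    """
    d = len(x_i[0])
    masked = [feat if vis else [NEG_INF] * d for feat, vis in zip(x_i, V_i)]
    return [max(col) for col in zip(*masked)]

# Three views with d = 2; the third view is invisible and must be ignored.
x_i = [[1.0, 5.0], [3.0, 2.0], [9.0, 9.0]]
V_i = [1, 1, 0]
pooled = voint_max(x_i, V_i)
```

A Voint with no visible view would come out as all minus infinity under this sketch, which is why such empty points need separate handling (the text fills them from their nearest 3D neighbors).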

A.5 Learning on 3D Voint Clouds

VointNet. Typical 3D point cloud classifiers with a feature max pooling layer work as in Eq (14), where $h_{\text{mlp}}$ and $h_{\text{Pconv}}$ are the MLP and point convolutional (1 × 1 or edge) layers, respectively. This produces a K-class classifier F.

Here, $F : \mathbb{R}^{N \times 3} \to \mathbb{R}^{K}$ produces the logits layer of the classifier with size K. On the other hand, the goal of the VointNet model is to obtain multi-view point cloud features that can subsequently be used by any point cloud processing pipeline. The VointNet module $F : \mathbb{R}^{N \times M \times d} \to \mathbb{R}^{N \times d}$ is defined as follows.

A.6 VointNet Variants

We define the convolution operation h

Architectures. For the 2D backbone, we use ViT (Dosovitskiy et al., 2021) (with pretrained weights from the TIMM library (Wightman, 2019)) for classification and DeepLabV3 (Chen et al., 2018) for segmentation. We used parallel heads for each object category for part segmentation since the task is solely focused on parts. We use the 3D cross-entropy loss on the 3D point cloud output and the 2D cross-entropy loss when the loss is defined on the pixels. When used, the linear tradeoff coefficient of the 2D loss term is set to 0.003. To balance the frequency of objects in part segmentation, we multiply the loss by the frequency of the object class of each object we segment. The feature dimension of the VointNet architectures is d = 64, and the depth is $l_V$ = 4 layers in $h_V$. The main results are based on the VointNet (MLP) variant unless otherwise specified. The coordinates x can optionally be appended to the input view-features $\hat{x}$, which can improve the performance but reduces the rotation robustness, as we show later in Section C.1 and Table 9.

Training Setup. We train our pipeline in two stages: we start by training the 2D backbone on the 2D projected labels of the points, then train the full pipeline end-to-end while focusing the training on the VointNet part. We use the AdamW optimizer (Loshchilov & Hutter, 2017) with an initial learning rate of 0.0005 and a step learning rate schedule of 33.3% every 12 epochs for 40 epochs. The pipeline is trained with one NVIDIA Tesla V100 GPU. We do not use any data augmentation.
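The loss weighting and learning-rate schedule described above can be sketched in a few lines of Python. Two readings are assumed here and are not confirmed by the text: that the "linear tradeoff" means the 2D loss is simply added with weight 0.003, and that "33.3% every 12 epochs" means the learning rate is multiplied by 0.333 at each step. The function names are hypothetical.

```python
def total_loss(loss_3d, loss_2d, coeff_2d=0.003):
    # Assumption: the auxiliary 2D loss is added linearly to the 3D loss.
    return loss_3d + coeff_2d * loss_2d

def step_lr(epoch, base_lr=0.0005, gamma=0.333, step=12):
    # Assumption: the rate is multiplied by gamma once every `step` epochs.
    return base_lr * gamma ** (epoch // step)

# Learning rate at the start of each 12-epoch block of a 40-epoch run.
schedule = [step_lr(e) for e in (0, 12, 24, 36)]
combined = total_loss(1.0, 2.0)
```

Under these assumptions the rate is reduced three times within the 40 epochs, at epochs 12, 24, and 36.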

C.1 Model Robustness

Rotation Robustness for 3D Classification. We follow the standard practice in 3D shape classification literature by testing the robustness of trained models to perturbations at test time (Liu et al., 2019a; Hamdi et al., 2021). We perturb the shapes with random rotations around the Y-axis (gravity-axis) contained within ±90° and ±180° and report the test accuracy over ten runs in Table 8.

Rotation Robustness for 3D Segmentation. We follow the previous 3D literature by testing the robustness of trained models to perturbations at test time (Liu et al., 2019a; Hamdi et al., 2021; 2020). We perturb the shapes in ShapeNet Parts with random rotations in SO(3) at test time (ten runs) and report Ins. mIoU in Table 9. Note how our VointNet performance largely exceeds the baselines in this realistic unaligned scenario. We can augment the training with rotated objects for the baselines, which improves their robustness but loses performance on the unrotated setup. Adding xyz coordinates to the view-features of VointNet improves the performance on the unrotated setup but negatively affects the robustness to rotations. The discrepancy between the Voint results and the results of some point cloud methods arises because Voints depend heavily on the underlying 2D backbone and inherit all its biases, especially those from pretraining. Hence, the 2D backbone limits the performance that VointNet can reach. We study the effect of the backbone in detail in Section C.2. Figure 11 shows qualitative 3D segmentation results for VointNet and Mean Fuse (Kundu et al., 2020) as compared to the ground truth.

Classification Backbone. We study the effect of ablating the 2D backbone C for 3D classification on ModelNet40. We show in Table 10 the performance of VointNet (MLP) when ViT-B (Dosovitskiy et al., 2021) and ResNet-18 (He et al., 2015) are used.
We also show that following the per-point classification setup instead of the per-shape for 3D shape classification leads to worse performance for VointNet and the naive multi-view. This is why we used the per-shape approach when adopting VointNet for 3D classification (using one Voint for the entire shape).
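The test-time perturbation used in the rotation-robustness experiments of Section C.1 (a random rotation around the Y-axis within a given angular bound) can be sketched as follows. The function names and the sign convention of the rotation are assumptions made for illustration; only the bounded random rotation about the gravity axis comes from the text.

```python
import math
import random

def rotate_y(points, angle_deg):
    """Rotate (x, y, z) points around the Y-axis by angle_deg degrees."""
    a = math.radians(angle_deg)
    c, s = math.cos(a), math.sin(a)
    # Assumed convention: x' = c*x + s*z, z' = -s*x + c*z, y unchanged.
    return [(c * x + s * z, y, -s * x + c * z) for x, y, z in points]

def random_y_rotation(points, max_deg, rng=random):
    """Apply one random Y-axis rotation with angle drawn from [-max_deg, max_deg]."""
    return rotate_y(points, rng.uniform(-max_deg, max_deg))

# Perturb a tiny point cloud within +/-90 degrees, using a seeded generator.
cloud = [(1.0, 2.0, 3.0), (0.0, 1.0, 0.0)]
perturbed = random_y_rotation(cloud, 90.0, random.Random(0))
```

Repeating the perturbation with different seeds and averaging the resulting accuracy mirrors the "ten runs" protocol at a conceptual level.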

Number of points and visibility.

Table 11 studies the effect of the number of points on 3D part segmentation performance when different numbers of views are used. The visibility ratio is also reported in each case.

Points color. We colored the points with ground truth normals as in Figure 16 when they are available (ShapeNet Parts), and we used white colors as in Figure 9 when other baselines do not use normals. We ablate the color of the points on VointNet (MLP) with normals colors, white color, and NOCs colors (Wang et al., 2019b). We obtain the following segmentation mIoU results: (normals: 80.6), (white: 74.7), and (NOCs: 57.9).

Time and Memory Requirements.

To assess the contribution of the Voint module, we take a macroscopic look at the time and memory requirements of each component in the pipeline. We record the number of floating-point operations (GFLOPs) and the time of a forward pass for a single input sample. In Table 12 , the VointNet module contributes negligibly to the memory requirements compared to multi-view and point networks.

Feature Size (d).

We study the effect of the feature size d on the performance of VointNet (MLP) in 3D part segmentation on ShapeNet Parts (Yi et al., 2016) and plot the results (with confidence intervals) in Figure 12. We note that the performance peaks at d = 128, but it is close to what we use in the main results (d = 64).

View Aggregation    2D Backbone:
                    ResNet18 (per-shape)    ViT-B (per-shape)    DeepLabV3 (per-point)
VointNet            91.2                    92.8                 10.2

Table 10: Ablation Study for 3D Classification. We study the effect of different 2D backbones on the ModelNet40 3D classification task. We compare VointNet's performance to naive multi-view (e.g. MVCNN (Su et al., 2015) or Mean Fuse (Kundu et al., 2020)) using the same 2D backbone. Note that using the per-point classification setup instead of the per-shape setup for 3D shape classification leads to worse performance for VointNet and the naive multi-view.

Model Depth ($l_V$). We study the effect of the model depth $l_V$ on the performance of VointNet (MLP) in 3D part segmentation on ShapeNet Parts (Yi et al., 2016) and plot the results (with confidence intervals) in Figure 13. We note that increasing the model depth of VointNet does not enhance the performance significantly. Our choice of $l_V$ = 4 balances the performance and the memory/computation requirements of VointNet (MLP).


Distance to the Object. We study the effect of the distance to the object in rendering (as in Figure 17) on the performance of VointNet (MLP) in 3D part segmentation on ShapeNet Parts (Yi et al., 2016) and plot the results (with confidence intervals) in Figure 14. We note that our default choice of 1.0 is reasonable. This choice of distance shows the object entirely (as illustrated in Figure 17) but also covers the details needed for small parts segmentation (see Figure 11).

Image Size (H, W ).

We study the effect of the image size (H, W) on the performance of the Mean Fuse (Kundu et al., 2020) baseline when training the 2D backbone for 3D part segmentation. We plot the results (with confidence intervals) in Figure 15.

Number of Views on Classification.

We study the effect of the number of views (M) on the classification accuracy of VointNet on ModelNet40 (Wu et al., 2015) and report results in Table 13.

Unprojection Operation Speed. We evaluate the speed of the unprojection operation $\Phi_B$ and report the average latency over 10,000 runs (in ms) in Table 14.

Rendering Operation Speed. We evaluate the speed of the point cloud renderer R from PyTorch3D (Ravi et al., 2020) used in the Voint pipeline and report the average latency over 1,000 renderings (in ms/image) in Table 15.

C.3 Visualizations

In Figures 16 and 17, we visualize the multi-view renderings of the point clouds along with the 2D learned features based on the DeepLabV3 (Chen et al., 2018) backbone. These features are then unprojected and transformed by VointNet to obtain 3D semantic labels.

Network                              GFLOPs    Time (ms)    Parameters (M)
MVCNN (Su et al., 2015)              43.72     39.89        11.20
ViewGCN (Wei et al., 2020)           44.19     26.06        23.56
ResNet 18 (He et al., 2015)          3.64      3.70         11.20
ResNet 50 (He et al., 2015)          8.24      9.42         23.59
ViT-B (Dosovitskiy et al., 2021)     33.70     12.46        86.57
ViT-L (Dosovitskiy et al., 2021)     119.30    29.28        304.33
FCN (Long et al., 2015)              53.13     10.34        32.97
DeeplabV3 (Chen et al., 2018)        92.61     20.62        58.64
PointNet (Qi et al., 2017a)          1.78      4.24         3.50
DGCNN (Wang et al., 2019c)           10.42     0.95         16.350
MVTN (Hamdi et al., 2021) 1

(Yi et al., 2016). We note that model depth of VointNet does not enhance the performance significantly. Our choice of $l_V$ = 4 balances the performance and the memory/computation requirements of VointNet (MLP).

Figure 14: The Effect of Distance to the Object. We plot Ins. mIoU of 3D segmentation vs. the distance to the object used in inference on ShapeNet Parts (Yi et al., 2016). We note that our default choice of 1.0 is reasonable. This choice of distance shows the object entirely (as illustrated in Figure 17) but also covers the details needed for small parts segmentation (see Figure 11).

