Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding

Abstract

Multi-view projection methods have demonstrated promising performance on 3D understanding tasks such as 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), which represents each 3D point as a set of features extracted from several view-points. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks (ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).

1. Introduction

A fundamental question in 3D computer vision and computer graphics is how to represent 3D data (Mescheder et al., 2019; Qi et al., 2017a; Maturana & Scherer, 2015). This question becomes particularly vital given how the success of deep learning in 2D computer vision has pushed for the wide adoption of deep learning in 3D vision and graphics. In fact, deep networks already achieve impressive results in 3D classification (Hamdi et al., 2021), 3D segmentation (Hu et al., 2021), 3D detection (Liu et al., 2021a), 3D reconstruction (Mescheder et al., 2019), and novel view synthesis (Mildenhall et al., 2020). 3D computer vision networks rely either on direct 3D representations, on indirect 2D projections onto images, or on a mixture of both. Direct approaches operate on 3D data commonly represented with point clouds (Qi et al., 2017a), meshes (Feng et al., 2019), or voxels (Choy et al., 2019). In contrast, indirect approaches commonly render multiple 2D views of objects or scenes (Su et al., 2015) and process each image with a traditional 2D image-based architecture. The human visual system is closer to such a multi-view indirect approach for 3D understanding, as it receives streams of rendered images rather than explicit 3D data.

Tackling 3D vision tasks with indirect approaches has three main advantages: (i) mature and transferable 2D computer vision models (CNNs, Transformers, etc.), (ii) large and diverse labeled image datasets for pre-training (e.g. ImageNet (Russakovsky et al., 2014)), and (iii) multi-view images yield context-rich features based on the viewing angle, which differ from geometric 3D neighborhood features. Multi-view approaches achieve impressive performance in 3D shape classification and segmentation (Wei et al., 2020; Hamdi et al., 2021; Dai & Nießner, 2018). However, the challenge with the multi-view representation (especially for dense predictions) lies in properly aggregating the per-view features with 3D point clouds.
The appropriate aggregation is necessary to obtain representative 3D point clouds with a single feature per point suitable for typical point cloud processing pipelines. Previous multi-view works rely on heuristics (e.g. average or label mode pooling) after mapping pixels to points (Kundu et al., 2020; Wang et al., 2019a), or on multi-view fusion with voxels (Dai & Nießner, 2018). Such setups may not be optimal for two reasons. (i) Such heuristics may aggregate information from misleading projections obtained from arbitrary view-points; for example, a view looking at an object from the bottom, processed independently, can carry wrong information about the object's content when combined with other views. (ii) The views lack geometric 3D information.

To this end, we propose a new hybrid 3D data structure that inherits the merits of point clouds (i.e. compactness, flexibility, and 3D descriptiveness) and leverages the rich perceptual features of multi-view projections. We call this new representation the multi-view point cloud (or Voint cloud) and illustrate it in Figure 1. A Voint cloud is a set of Voints, where each Voint is a set of view-dependent features (view-features) that correspond to the same point in the 3D point cloud. The cardinality of these view-features may differ from one Voint to another. In Table 1, we compare some of the widely used 3D representations with our Voint cloud representation. Voint clouds inherit the characteristics of the parent explicit 3D point cloud, which facilitates learning Voint representations for a variety of vision applications (e.g. point cloud classification and segmentation). To deploy deep learning in the new Voint space, we define basic operations on Voints, such as pooling and convolution. Based on these operations, we define a practical way of building Voint neural networks, which we dub VointNet. VointNet takes a Voint cloud and outputs point cloud features for 3D point cloud processing.

The code is available at https://github.com/ajhamdi/vointcloud
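As an illustration only (not the paper's exact architecture), a Voint cloud can be stored as a dense view-feature tensor together with a visibility mask, since each Voint may be visible in a different subset of views. The sketch below contrasts an unlearned average-pooling heuristic with a learned-style aggregation that applies a shared layer in Voint space before masked max pooling; all sizes, names, and the single linear map `W` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: N points, V views, C feature channels per view.
N, V, C = 1024, 6, 32

# Dense Voint cloud: one C-dim view-feature per (point, view) pair, plus a
# boolean mask marking in which views each point is actually visible
# (the per-Voint cardinality varies).
voint_features = rng.standard_normal((N, V, C)).astype(np.float32)
visibility = rng.random((N, V)) < 0.7  # (N, V) boolean mask

def heuristic_view_pool(feats, mask):
    """Unlearned baseline: average the visible view-features per point."""
    m = mask[..., None].astype(feats.dtype)        # (N, V, 1)
    denom = np.maximum(m.sum(axis=1), 1.0)         # avoid /0 for unseen points
    return (feats * m).sum(axis=1) / denom         # (N, C)

def learned_style_pool(feats, mask, W):
    """Sketch of a learned aggregation in Voint space: a shared per-view
    layer (here a single hypothetical linear map W with ReLU), followed by
    masked max pooling over the view dimension."""
    h = np.maximum(feats @ W, 0.0)                 # (N, V, C') shared layer
    h = np.where(mask[..., None], h, -np.inf)      # ignore invisible views
    pooled = h.max(axis=1)                         # (N, C')
    # Points visible in no view get a zero feature instead of -inf.
    return np.where(np.isfinite(pooled), pooled, 0.0)

W = (rng.standard_normal((C, 64)) * 0.1).astype(np.float32)
point_feats_avg = heuristic_view_pool(voint_features, visibility)
point_feats_learned = learned_style_pool(voint_features, visibility, W)
print(point_feats_avg.shape, point_feats_learned.shape)  # (1024, 32) (1024, 64)
```

Both functions reduce the (N, V, C) Voint tensor to a single feature per point, which is the interface a standard point cloud pipeline expects; the difference is that the second variant has parameters that could be trained end-to-end rather than a fixed heuristic.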
We show how learning this Voint cloud representation leads to strong performance and improved robustness for the tasks of 3D classification, 3D object retrieval, and 3D part segmentation on standard benchmarks such as ScanObjectNN (Uy et al., 2019) and ShapeNet (Chang et al., 2015).

Contributions: (i) We propose a novel multi-view 3D point cloud representation (denoted Voint cloud), which represents each point (namely, a Voint) as a set of features from different view-points. (ii) We define pooling and convolution operations at the Voint level to construct a Voint Neural Network (VointNet) capable of learning to aggregate information from multiple views in the Voint space. (iii) Our VointNet reaches state-of-the-art performance on several 3D understanding tasks, including 3D shape classification, retrieval, and robust part segmentation. Furthermore, VointNet achieves improved robustness to occlusion and rotation.

