Voint Cloud: Multi-View Point Cloud Representation for 3D Understanding

Abstract

Multi-view projection methods have demonstrated promising performance on 3D understanding tasks like 3D classification and segmentation. However, it remains unclear how to combine such multi-view methods with the widely available 3D point clouds. Previous methods use unlearned heuristics to combine features at the point level. To this end, we introduce the concept of the multi-view point cloud (Voint cloud), representing each 3D point as a set of features extracted from several viewpoints. This novel 3D Voint cloud representation combines the compactness of 3D point cloud representation with the natural view-awareness of multi-view representation. Naturally, we can equip this new representation with convolutional and pooling operations. We deploy a Voint neural network (VointNet) to learn representations in the Voint space. Our novel representation achieves state-of-the-art performance on 3D classification, shape retrieval, and robust 3D part segmentation on standard benchmarks (ScanObjectNN, ShapeNet Core55, and ShapeNet Parts).1

1. Introduction

A fundamental question in 3D computer vision and computer graphics is how to represent 3D data (Mescheder et al., 2019; Qi et al., 2017a; Maturana & Scherer, 2015). This question becomes particularly vital given how the success of deep learning in 2D computer vision has pushed for the wide adoption of deep learning in 3D vision and graphics. In fact, deep networks already achieve impressive results in 3D classification (Hamdi et al., 2021), 3D segmentation (Hu et al., 2021), 3D detection (Liu et al., 2021a), 3D reconstruction (Mescheder et al., 2019), and novel view synthesis (Mildenhall et al., 2020). 3D computer vision networks either rely on direct 3D representations, indirect 2D projection on images, or a mixture of both. Direct approaches operate on 3D data commonly represented with point clouds (Qi et al., 2017a), meshes (Feng et al., 2019), or voxels (Choy et al., 2019). In contrast, indirect approaches commonly render multiple 2D views of objects or scenes (Su et al., 2015), and process each image with a traditional 2D image-based architecture. The human visual system is closer to such a multi-view indirect approach for 3D understanding, as it receives streams of rendered images rather than explicit 3D data. Tackling 3D vision tasks with indirect approaches has three main advantages: (i) mature and transferable 2D computer vision models (CNNs, Transformers, etc.), (ii) large and diverse labeled image datasets for pre-training (e.g. ImageNet (Russakovsky et al., 2014)), and (iii) the multi-view images give context-rich features based on the viewing angle, which are different from the geometric 3D neighborhood features.

Multi-view approaches achieve impressive performance in 3D shape classification and segmentation (Wei et al., 2020; Hamdi et al., 2021; Dai & Nießner, 2018). However, the challenge with the multi-view representation (especially for dense predictions) lies in properly aggregating the per-view features with 3D point clouds.
The appropriate aggregation is necessary to obtain representative 3D point
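To make the Voint cloud idea concrete, the following is a minimal NumPy sketch of the representation and a simple view-pooling aggregation. It is an illustration only, not the paper's implementation: a Voint cloud is stored as an (N points, M views, C features) tensor, a visibility mask marks which views actually see each point, and pooling over the view axis collapses the per-view features into a single per-point feature. The function name `view_pool` and the mask convention are assumptions for this sketch.

```python
import numpy as np

def view_pool(voints, mask, mode="max"):
    """Aggregate per-view point features into per-point features.

    voints: (N, M, C) array; feature of point n as seen from view m.
    mask:   (N, M) boolean array; True where point n is visible in view m.
    mode:   "max" or "mean" pooling over the view axis.
    Returns an (N, C) array of per-point features.
    """
    if mode == "max":
        # Mask out invisible views with -inf so they never win the max.
        masked = np.where(mask[..., None], voints, -np.inf)
        pooled = masked.max(axis=1)
        # Points seen by no view fall back to zero features.
        pooled = np.where(np.isfinite(pooled), pooled, 0.0)
    else:
        # Mean over visible views only, guarding against empty masks.
        w = mask[..., None].astype(float)
        pooled = (voints * w).sum(axis=1) / np.maximum(w.sum(axis=1), 1e-8)
    return pooled
```

A learned aggregation (as in VointNet) would replace this fixed pooling with shared point-wise layers applied across views before reduction, but the tensor layout stays the same.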



1 The code is available at https://github.com/ajhamdi/vointcloud

