REVISITING POINT CLOUD CLASSIFICATION WITH A SIMPLE AND EFFECTIVE BASELINE

Anonymous

Abstract

Processing point cloud data is an important component of many real-world systems. As such, a wide variety of point-based approaches have been proposed, reporting steady benchmark improvements over time. We study the key ingredients of this progress and uncover two critical results. First, we find that auxiliary factors like different evaluation schemes, data augmentation strategies, and loss functions, which are independent of the model architecture, make a large difference in performance. The differences are large enough that they obscure the effect of architecture. When these factors are controlled for, PointNet++, a relatively older network, performs competitively with recent methods. Second, a very simple projection-based method, which we refer to as SimpleView, performs surprisingly well. It achieves results on par with or better than sophisticated state-of-the-art methods on ModelNet40 while being half the size of PointNet++. It also outperforms state-of-the-art methods on ScanObjectNN, a real-world point cloud benchmark, and demonstrates better cross-dataset generalization.

1. INTRODUCTION

Processing 3D point cloud data accurately is crucial in many applications, including autonomous driving (Navarro-Serment et al., 2010) and robotics (Rusu et al., 2009). In these settings, sensors like LIDAR produce unordered sets of points that correspond to object surfaces. Correctly classifying objects from this data is important for 3D scene understanding (Uy et al., 2019). While classical approaches to this problem relied on hand-crafted features (Arras et al., 2007), recent efforts have focused on designing deep neural networks (DNNs) that learn features directly from raw point cloud data (Qi et al., 2017a). Deep learning-based methods have proven effective at aggregating information across a set of 3D points to accurately classify objects.

The most widely adopted benchmark for comparing point cloud classification methods has been ModelNet40 (Wu et al., 2015b). Accuracy on ModelNet40 has steadily improved over the last few years, from 89.2% by PointNet (Qi et al., 2017a) to 93.6% by RSCNN (Liu et al., 2019c) (Fig. 1). This progress is commonly perceived to be a result of better network architecture designs. However, after performing a careful analysis of recent works, we find two surprising results. First, auxiliary factors including differing evaluation schemes, data augmentation strategies, and loss functions affect performance to such a degree that it can be difficult to disentangle improvements due to the network architecture. Second, a very simple projection-based architecture works surprisingly well, outperforming state-of-the-art point-based architectures.

In deep learning, as results improve on a benchmark, attention generally focuses on the novel architectures used to achieve those results. However, many factors beyond architecture design influence performance, including data augmentation and the evaluation procedure. We refer to these additional factors as a method's protocol.
A protocol defines all details orthogonal to the network architecture that can be controlled when comparing architectures. Note that it is possible for some specific form of loss or data augmentation to be tied to a particular architecture and inapplicable to others; in such cases it would be inappropriate to treat them as part of the protocol. However, for all the methods we consider in this paper, the losses and augmentation schemes are fully compatible with each other and can be considered independently. We conduct experiments to study the effect of protocol and discover that it accounts for a large difference in performance, large enough to obscure the contribution of a novel architecture. For example, the performance of the PointNet++ architecture (Qi et al., 2017b) jumps from 90.0±0.3 to 93.3±0.3 when switching from its original protocol to RSCNN's protocol (Liu et al., 2019c). We further find that the protocols that lead to the strongest performance rely on feedback from the test set, which differs from conventional evaluation setups. We therefore re-evaluate prior architectures using the best augmentation and loss functions while using no feedback from the test set. We find that once protocol is taken into account, the PointNet++ architecture performs competitively with more recent ones in various settings.

In addition to the surprising importance of protocol, our review of past approaches yields another surprising discovery: a very simple projection-based baseline works very well. One simply projects the points to depth maps along orthogonal views, passes them through a light-weight CNN, and fuses the features. We refer to this baseline as SimpleView. Compared to previous projection-based methods for point cloud classification (Roveri et al., 2018; Sarkar et al., 2018), SimpleView is very simple.
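To make the projection step concrete, the following is a minimal sketch of rendering a point cloud into axis-aligned orthogonal depth maps with a simple z-buffer. The function name, 32×32 resolution, and the assumption that inputs are normalized to [-1, 1]^3 are illustrative choices, not SimpleView's exact settings.

```python
import numpy as np

def depth_maps(points, res=32):
    """Project a point cloud (N, 3) with coordinates in [-1, 1]^3 to six
    axis-aligned orthogonal depth maps of shape (res, res).

    Hypothetical sketch only; SimpleView's actual projection details
    (resolution, normalization, view set) may differ.
    """
    maps = []
    for axis in range(3):          # view along the x, y, and z axes
        for sign in (1.0, -1.0):   # front and back view per axis
            # The two axes orthogonal to the view direction index pixels.
            u, v = [a for a in range(3) if a != axis]
            cols = np.clip(((points[:, u] + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
            rows = np.clip(((points[:, v] + 1) / 2 * (res - 1)).astype(int), 0, res - 1)
            # Depth in [0, 1]; larger values are closer to this viewpoint.
            depth = (sign * points[:, axis] + 1) / 2
            img = np.zeros((res, res))
            # Z-buffer: keep only the closest point that lands in each pixel.
            np.maximum.at(img, (rows, cols), depth)
            maps.append(img)
    return np.stack(maps)          # (6, res, res)

views = depth_maps(np.random.uniform(-1, 1, size=(1024, 3)))
print(views.shape)  # (6, 32, 32)
```

Each resulting depth map is an ordinary single-channel image, which is what lets a standard 2D CNN consume the point cloud directly.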
Prior methods have developed special modules for view selection, rendering, and feature merging, and use larger CNN backbones pretrained on ImageNet (see Sec. 2 for more details). In contrast, SimpleView has no such special operations and only requires simple point projections, a much smaller CNN backbone, and no ImageNet pretraining. The discovery of SimpleView is surprising because recent state-of-the-art results have all been achieved by point-based architectures of increasing sophistication. In recent literature, it is often assumed that point-based methods are the superior choice for point cloud processing as they "do not introduce explicit information loss" (Guo et al., 2020). Prior work has stated that the "convolution operation of these methods lacks the ability to capture nonlocally geometric features" (Yan et al., 2020), that a projection-based method "often demands a huge number of views for decent performance" (Liu et al., 2019c), and that projection-based methods often "fine-tune a pre-trained image-based architecture for accurate recognition" (Liu et al., 2019c). It is thus surprising that a projection-based method achieves state-of-the-art results with a simple architecture, only a few views, and no pretraining. On ModelNet40, SimpleView performs on par with or better than more sophisticated state-of-the-art networks across various protocols, including those used by prior methods (Table 3) as well as our protocol (Table 5).

Note that we are not proposing a new architecture or method, but simply evaluating a simple and strong projection-based baseline for point cloud classification that has been largely ignored in the literature. We do not claim any novelty in the design of SimpleView, as all of its components have appeared in the literature. Our contribution is showing that such a simple baseline works surprisingly well, a result absent from the existing literature.
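The remaining two steps of the pipeline, a shared per-view feature extractor followed by feature fusion, can be sketched as follows. Here a random linear map stands in for the light-weight CNN backbone purely to show the data flow and shapes; the function name, feature width, and fusion-by-concatenation are illustrative assumptions, not SimpleView's exact design.

```python
import numpy as np

rng = np.random.default_rng(0)

def classify_views(views, n_classes=40, feat_dim=64):
    """Toy stand-in for the SimpleView data flow: extract features from
    each depth map with a single shared (untrained, random) linear map,
    fuse by concatenating the per-view features, and apply a linear
    classifier. Shapes, not learning, are the point of this sketch.
    """
    n_views, h, w = views.shape
    w_feat = rng.standard_normal((h * w, feat_dim))    # shared across views
    feats = views.reshape(n_views, -1) @ w_feat        # (n_views, feat_dim)
    fused = feats.reshape(-1)                          # concatenate view features
    w_cls = rng.standard_normal((n_views * feat_dim, n_classes))
    return fused @ w_cls                               # (n_classes,) logits

logits = classify_views(rng.uniform(0, 1, size=(6, 32, 32)))
print(logits.shape)  # (40,)
```

Because the extractor is shared across views, the per-view branch adds no parameters as views are added; only the fusion and classifier widths grow with the number of views.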



Figure 1: Performance of point-based models on ModelNet40. Models using > 1024 points or normals are marked with triangles. The line joins the top-performing models over time.

At the same time, SimpleView outperforms state-of-the-art architectures on ScanObjectNN (Uy et al., 2019), a real-world dataset where point clouds are noisy (background points, occlusions, holes in objects) and not axis-aligned. SimpleView also demonstrates better cross-dataset generalization than prior work. Furthermore, SimpleView uses fewer parameters than state-of-the-art networks (Table 5).

