REVISITING POINT CLOUD CLASSIFICATION WITH A SIMPLE AND EFFECTIVE BASELINE

Anonymous

Abstract

Processing point cloud data is an important component of many real-world systems. As such, a wide variety of point-based approaches have been proposed, reporting steady benchmark improvements over time. We study the key ingredients of this progress and uncover two critical results. First, we find that auxiliary factors like different evaluation schemes, data augmentation strategies, and loss functions, which are independent of the model architecture, make a large difference in performance. The differences are large enough that they obscure the effect of architecture. When these factors are controlled for, PointNet++, a relatively older network, performs competitively with recent methods. Second, a very simple projection-based method, which we refer to as SimpleView, performs surprisingly well. It achieves results on par with or better than sophisticated state-of-the-art methods on ModelNet40 while being half the size of PointNet++. It also outperforms state-of-the-art methods on ScanObjectNN, a real-world point cloud benchmark, and demonstrates better cross-dataset generalization.

1. INTRODUCTION

Processing 3D point cloud data accurately is crucial in many applications, including autonomous driving (Navarro-Serment et al., 2010) and robotics (Rusu et al., 2009). In these settings, sensors like LIDAR produce unordered sets of points that correspond to object surfaces. Correctly classifying objects from this data is important for 3D scene understanding (Uy et al., 2019). While classical approaches to this problem have relied on hand-crafted features (Arras et al., 2007), recent efforts have focused on designing deep neural networks (DNNs) that learn features directly from raw point cloud data (Qi et al., 2017a). Deep learning-based methods have proven effective at aggregating information across a set of 3D points to accurately classify objects.

The most widely adopted benchmark for comparing point cloud classification methods is ModelNet40 (Wu et al., 2015b). Accuracy on ModelNet40 has steadily improved over the last few years, from 89.2% by PointNet (Qi et al., 2017a) to 93.6% by RSCNN (Liu et al., 2019c) (Fig. 1). This progress is commonly perceived to be the result of better network architecture designs. However, after a careful analysis of recent work, we find two surprising results. First, auxiliary factors, including differing evaluation schemes, data augmentation strategies, and loss functions, affect performance to such a degree that it can be difficult to disentangle improvements due to the network architecture. Second, a very simple projection-based architecture works surprisingly well, outperforming state-of-the-art point-based architectures.

In deep learning, as results improve on a benchmark, attention generally focuses on the novel architectures used to achieve those results. However, many factors beyond architecture design influence performance, including data augmentation and the evaluation procedure. We refer to these additional factors as a method's protocol.
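To make the notion of a projection-based architecture concrete, the sketch below renders a point cloud as a depth image from a single fixed viewpoint. This is a hypothetical minimal illustration, not the actual SimpleView implementation: the function name, resolution, and viewpoint are our own choices, and a real multi-view pipeline would render several viewpoints and feed the resulting images to a standard 2D CNN.

```python
import numpy as np

def depth_project(points, resolution=32):
    """Render an (N, 3) point cloud as a depth image viewed along +z.

    Illustrative sketch only (not the paper's implementation): empty
    pixels stay at 0, and each occupied pixel stores the depth of its
    nearest point. A multi-view method would repeat this from several
    viewpoints and classify the stack of images with a 2D CNN.
    """
    # Normalize the cloud into the unit cube [0, 1]^3.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    pts = (points - mins) / (maxs - mins + 1e-9)

    # Discretize (x, y) into pixel indices.
    ij = np.minimum((pts[:, :2] * resolution).astype(int), resolution - 1)

    depth = np.zeros((resolution, resolution))
    for (i, j), z in zip(ij, pts[:, 2]):
        # Nearer points (smaller z) overwrite farther ones.
        depth[i, j] = max(depth[i, j], 1.0 - z)
    return depth
```

Because the output is an ordinary image, the aggregation over the unordered point set is handled entirely by the rasterization step, and all learned parameters can live in a conventional 2D convolutional network.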
A protocol comprises all details orthogonal to the network architecture that can be controlled when comparing architectures. Note that a specific form of loss or data augmentation may be tied to a particular architecture and inapplicable to others; in such cases it would be inappropriate to treat it as part of the protocol. However, for all the methods we consider in this paper, the losses and augmentation schemes are fully compatible with one another and can be considered independently. We conduct experiments to study the effect of protocol and discover that it accounts for a large difference in performance, so large as to obscure the contribution of a novel architecture. For example, the

