ON THE ADVERSARIAL ROBUSTNESS OF 3D POINT CLOUD CLASSIFICATION

Abstract

3D point clouds play pivotal roles in various safety-critical fields, such as autonomous driving, which requires the corresponding deep neural networks to be robust to adversarial perturbations. Though a few defenses against adversarial point cloud classification have been proposed, it remains unknown whether they provide real robustness. To this end, we perform the first security analysis of state-of-the-art defenses and design adaptive attacks on them. Our 100% adaptive attack success rates demonstrate that current defense designs are still vulnerable. Since adversarial training (AT) is believed to be the most effective defense, we present the first in-depth study of how AT behaves in point cloud classification and identify that the required symmetric function (pooling operation) is paramount to the model's robustness under AT. Through our systematic analysis, we find that the fixed pooling operations used by default (e.g., MAX pooling) generally weaken AT's performance in point cloud classification, whereas sorting-based parametric pooling operations can significantly improve the models' robustness. Based on the above insights, we further propose DeepSym, a deep symmetric pooling operation, to architecturally advance the adversarial robustness of PointNet under AT to 47.0% without sacrificing nominal accuracy, outperforming the original design and a strong baseline by 28.5% (∼2.6×) and 6.5%, respectively.

1. INTRODUCTION

Despite the prominent achievements that deep neural networks (DNN) have reached in the past decade, adversarial attacks (Szegedy et al., 2013) are becoming the Achilles' heel in modern deep learning deployments, where adversaries generate imperceptible perturbations to mislead DNN models. Numerous attacks have been deployed in various 2D vision tasks, such as classification (Carlini & Wagner, 2017), object detection (Song et al., 2018), and segmentation (Xie et al., 2017). Since adversarial robustness is a critical feature, tremendous efforts have been devoted to defending against 2D adversarial images (Guo et al., 2017; Papernot et al., 2016; Madry et al., 2018). However, Athalye et al. (2018) suggest that most of the current countermeasures essentially try to obfuscate gradients, which gives a false sense of security. Besides, certified methods (Zhang et al., 2019) often provide only a lower bound of robustness, which is of limited help in practice. Therefore, adversarial training is widely believed to be the most, and arguably only, effective defense solution.

The emergence of 3D point cloud applications in safety-critical areas like autonomous driving raises public concerns about the security of their DNN pipelines. A few studies (Xiang et al., 2019; Cao et al., 2019; Sun et al., 2020) have demonstrated that various deep learning tasks on point clouds are indeed vulnerable to adversarial examples. Among them, point cloud classification models have laid solid foundations upon which other, more complex models are built (Lang et al., 2019; Yu et al., 2018a). While it seems intuitive to extend convolutional neural networks (CNN) from 2D to 3D for point cloud classification, it is actually not a trivial task. The difficulty mainly stems from the fact that a point cloud is an unordered set, which CNNs cannot handle.
Modern point cloud classification models (Qi et al., 2017a; Zaheer et al., 2017) address this problem by leveraging a symmetric function, which is permutation-invariant to the order of points, to aggregate local features, as shown in Figure 2. Recently, a number of countermeasures have been proposed to defend against 3D adversarial point clouds. However, the failure of gradient obfuscation-based defenses in the 2D space motivates us to re-think whether current defense designs provide real robustness for 3D point cloud classification. In particular, DUP-Net (Zhou et al., 2019) and GvG-PointNet++ (Dong et al., 2020a) claim to improve adversarial robustness significantly. Through our analysis, we find that both defenses rely on gradient obfuscation, and we further design white-box adaptive attacks to break their robustness. Unfortunately, our 100% attack success rates demonstrate that current defense designs are still vulnerable. As mentioned above, adversarial training (AT) is considered the most effective defense strategy; we thus perform the first rigorous study of how AT behaves in point cloud classification by exploiting projected gradient descent (PGD) attacks (Madry et al., 2018). We identify that the symmetric function used by default weakens the effectiveness of AT. Specifically, popular models (e.g., PointNet) utilize fixed pooling operations like MAX and SUM pooling as their symmetric functions to aggregate features. Different from CNN-based models, which usually apply pooling operations with a small sliding window (e.g., 2 × 2), point cloud classification models leverage such fixed pooling operations to aggregate features from a large number of candidates (e.g., 1024). We find that these fixed pooling operations inherently lack flexibility and learnability, properties that AT fails to benefit from.
Moreover, recent research has also presented parametric pooling operations in set learning (Wang et al., 2020; Zhang et al., 2020), which likewise preserve permutation-invariance. We take a step further to systematically analyze point cloud classification models' robustness with parametric pooling operations under AT. Experimental results show that the sorting-based pooling design benefits AT well, vastly outperforming MAX pooling, for instance, in adversarial accuracy by 7.3% without hurting the nominal accuracy. Lastly, based on our experimental insights, we propose DeepSym, a sorting-based pooling operation that employs deep learnable layers, to architecturally advance the adversarial robustness of point cloud classification models under AT. Experimental results show that DeepSym reaches the best adversarial accuracy on all chosen backbones, which is, on average, a 10.8% improvement over the default architectures. We also explore the limits of DeepSym based on PointNet due to its broad adoption (Guo et al., 2020). We obtain the best robustness on ModelNet40, achieving an adversarial accuracy of 47.0% and significantly outperforming the default MAX pooling design by 28.5% (∼2.6×). In addition, we demonstrate that PointNet with DeepSym also reaches the best adversarial accuracy of 45.2% under the most efficient AT on ModelNet10 (Wu et al., 2015), exceeding MAX pooling by 17.9% (∼1.7×).

2. BACKGROUND AND RELATED WORK

3D point cloud classification. Early works attempt to classify point clouds by adapting deep learning models from the 2D space (Su et al., 2015; Yu et al., 2018b). DeepSets (Zaheer et al., 2017) and PointNet (Qi et al., 2017a) are the first to achieve end-to-end learning on point cloud classification and formulate a general specification (Figure 2) for point cloud learning. PointNet++ (Qi et al., 2017b) and DGCNN (Wang et al., 2019) build upon PointNet's set abstraction to better learn local features. Lately, DSS (Maron et al., 2020) generalizes DeepSets to enable complex functions in set learning. Besides, ModelNet40 (Wu et al., 2015) is the most popular dataset for benchmarking point cloud classification; it consists of 12,311 CAD models belonging to 40 categories, with the point cloud data normalized to [-1, 1].

Adversarial attacks and defenses on point clouds. Xiang et al. (2019) perform the first study extending the C&W attack (Carlini & Wagner, 2017) to point cloud classification. Wen et al. (2019) improve the loss function of the C&W attack to realize attacks with smaller perturbations, and Hamdi et al. (2019) present black-box attacks on point cloud classification. Recently, Zhou et al. (2019) and Dong et al. (2020a) propose to defend against adversarial point clouds by input transformation and adversarial detection, respectively. Besides, Liu et al. (2019) conduct a preliminary investigation on extending countermeasures from the 2D space to defend against simple attacks like FGSM (Goodfellow et al., 2014) on point cloud data. In this work, we first design adaptive attacks to break existing defenses and then analyze the adversarial robustness of point cloud classification under adversarial training.

3. BREAKING THE ROBUSTNESS OF EXISTING DEFENSES

3.1 ADAPTIVE ATTACKS ON DUP-NET

DUP-Net (ICCV'19) presents a denoiser layer and upsampler network structure to defend against adversarial point cloud classification. The denoiser layer g : X → X leverages kNN (k-nearest neighbour) for outlier removal. Specifically, the kNN of each point x_i in point cloud X is defined as knn(x_i, k), so that the average distance d_i of each point x_i to its kNN is: d_i = (1/k) Σ_{x_j ∈ knn(x_i, k)} ||x_i − x_j||_2, i = 1, 2, . . . , n, where n is the number of points. The mean µ = (1/n) Σ_{i=1}^{n} d_i and standard deviation σ = sqrt((1/n) Σ_{i=1}^{n} (d_i − µ)²) of all these distances are computed to determine a distance threshold µ + α · σ to trim the point cloud, where α is a hyper-parameter. As a result, the denoised point cloud is represented as X′ = {x_i | d_i < µ + α · σ}. The denoised point cloud X′ is further fed into PU-Net (Yu et al., 2018a), defined as p : X → X, to upsample X′ to a fixed number of points. Combined with the classifier f, the integrated DUP-Net can be denoted as (f ∘ p ∘ g)(X). The hypothesis is that the denoiser layer will eliminate the adversarial perturbations and the upsampler network will re-project the denoised off-manifold point cloud back onto the natural manifold.

Analysis. The upsampler network p (i.e., PU-Net) is differentiable and can be integrated with the classification network f. Therefore, f ∘ p is clearly vulnerable to gradient-based adaptive attacks. Although the denoiser layer g is not differentiable, it can be treated as deterministic masking: M(x_i) = 1_{d_i < µ + α·σ}, so that gradients can still flow through the unmasked points. By involving M(x_i) in the iterative optimization process, ∇_{x_i}(f ∘ p ∘ g)(X)|_{x_i = x̂_i} ≈ ∇_{x_i}(f ∘ p)(X)|_{x_i = x̂_i · M(x̂_i)}, similar to BPDA (Athalye et al., 2018), attackers can still find adversarial examples.
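To make the denoiser layer concrete, the statistical outlier removal described above can be sketched as follows. This is our illustrative NumPy re-implementation, not DUP-Net's released code; the default k = 2 and α = 1.1 follow the hyper-parameters reported later in the paper.

```python
import numpy as np

def knn_outlier_removal(points, k=2, alpha=1.1):
    """Sketch of DUP-Net's denoiser layer g: statistical outlier removal.

    points: (n, 3) array. Points whose mean kNN distance d_i exceeds
    mu + alpha * sigma are trimmed.
    """
    # Pairwise Euclidean distances (n, n), excluding self-distances.
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt((diff ** 2).sum(-1))
    np.fill_diagonal(dist, np.inf)
    # Mean distance d_i of each point to its k nearest neighbours.
    d = np.sort(dist, axis=1)[:, :k].mean(axis=1)
    mu, sigma = d.mean(), d.std()
    keep = d < mu + alpha * sigma          # deterministic mask M(x_i)
    return points[keep], keep

# A tight cluster along a line plus one isolated point far away: the
# isolated point's d_i far exceeds mu + alpha * sigma and is trimmed.
pts = np.vstack([np.linspace(0.0, 1.0, 30).reshape(10, 3), [10.0, 10.0, 10.0]])
cleaned, keep = knn_outlier_removal(pts)
```

During the adaptive attack, this mask is exactly the deterministic M(x_i) above: gradients are simply routed through the surviving points, BPDA-style.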

Experimentation.

We leverage the open-sourced codebase of DUP-Net for experimentation. Specifically, a PointNet (Qi et al., 2017a) trained on ModelNet40 is used as the target classifier f. For the PU-Net, the upsampled number of points is 2048, and the upsampling ratio is 2. For the adaptive attacks, we exploit the targeted L_2 norm-based C&W attack and the untargeted L_∞ norm-based PGD attack with 200 iterations. Detailed setups are elaborated in Appendix A.1.

3.2 ADAPTIVE ATTACKS ON GVG-POINTNET++

GvG-PointNet++ (CVPR'20) learns a gather vector v_i for each local feature f_i to predict the global object center; local features whose predicted centers deviate too far from the ground-truth center c_g are masked out: c_i = x_{ci} + v_i; M_i = 1_{||c_g − c_i|| < r}; F_g = {f_i · M_i}, where x_{ci} is the geometry center of the local point set, r is the distance threshold to mask the local feature, and F_g is the cleaned feature set for final classification. To train GvG-PointNet++, it is necessary to optimize a surrogate loss, besides the cross-entropy (xent) loss, to correctly learn the gather vectors: L_total = L_xent + λ · L_gather, L_gather = Σ_{i=1}^{n} ||c_i − c_g||_1, where n is the number of local features and λ is a hyper-parameter. Thus, GvG-PointNet++ essentially applies self-attention to the local features and relies on it for robustness enhancement.

Analysis. Dong et al. (2020a) evaluate white-box adversaries on GvG-PointNet++ with naïve L_2 norm-based PGD attacks. Specifically, only L_xent is utilized in the adversarial optimization process, so the masking M_i hinders gradient propagation. However, since M_i is learned by the network itself, it is highly possible to break this obfuscation by taking L_gather into account. The adaptive attack can then be formulated as an optimization problem with the loss function L_adv = L_xent − β · L_gather, where β is a hyper-parameter. By maximizing L_adv with L_2 norm-based PGD attacks, adversaries strive to enlarge the adversarial effect while also minimizing the perturbations to the gather vectors.
We also find that GvG-PointNet++ is by design vulnerable to PGD attacks on L_gather, as such perturbations can affect most gather vector predictions, causing many local features to be masked out and leaving insufficient information for the final classification.

Discussion. As shown in Table 2, both adaptive PGD attacks achieve high success rates on GvG-PointNet++. We also observe that the L_∞ norm-based PGD attack is more effective on L_gather, since L_∞ norm perturbations assign the same adversarial budget to each point, which can easily impact a large number of gather vector predictions. In contrast, it is hard for the L_2 norm-based PGD attack to influence as many gather vector predictions because it prefers to perturb key points rather than the whole point set. GvG-PointNet++ leverages a DNN to detect adversarial perturbations, similar to MagNet (Meng & Chen, 2017) in the 2D space. We validate that adversarial detection also fails to provide real robustness under adaptive white-box adversaries in point cloud classification.
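The mechanics of the adaptive attack reduce to a generic L_∞ PGD ascent on L_adv. The sketch below is ours, and the two losses are toy analytic stand-ins (a linear "xent" term and a quadratic "gather" penalty), not the real networks; they only illustrate how the combined objective is maximized inside the ε-ball.

```python
import numpy as np

def pgd_linf(x0, grad_fn, eps=0.05, steps=200, step_size=0.005):
    """Generic L_inf PGD ascent (sketch). grad_fn(x) returns the gradient
    of the adaptive loss L_adv = L_xent - beta * L_gather w.r.t. x."""
    x = x0.copy()
    for _ in range(steps):
        x = x + step_size * np.sign(grad_fn(x))
        x = np.clip(x, x0 - eps, x0 + eps)   # project back into the eps-ball
    return x

# Toy stand-ins: "xent" grows as x moves along w; "gather" penalizes
# perturbation energy, mimicking the -beta * L_gather term.
w = np.array([1.0, -2.0, 0.5])
def l_xent(x, x0):
    return float(w @ (x - x0))
def grad_adv(x, x0, beta=0.1):
    # d/dx [ l_xent - beta * ||x - x0||^2 ]
    return w - beta * 2.0 * (x - x0)

x0 = np.zeros(3)
x_adv = pgd_linf(x0, lambda x: grad_adv(x, x0), eps=0.05)
```

With the real defense, grad_fn would be the network's gradient of L_adv obtained by back-propagation; everything else stays the same.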

4. ADVERSARIAL TRAINING WITH DIFFERENT SYMMETRIC FUNCTIONS

We have so far demonstrated that state-of-the-art defenses against 3D adversarial point clouds are still vulnerable to adaptive attacks. Since gradient obfuscation cannot offer real adversarial robustness, we turn to adversarial training (AT) and study how the choice of symmetric function affects its performance.

4.2. ADVERSARIAL TRAINING WITH FIXED POOLING OPERATIONS

Besides the widely used MAX and SUM pooling, we additionally select MEDIAN pooling due to its robust-statistic feature (Huber, 2004). Though models with fixed pooling operations have achieved satisfactory accuracy under standard training, they face various difficulties under AT. As shown in Table 3, models with MEDIAN pooling achieve better nominal accuracy among fixed pooling operations but much worse adversarial accuracy, while SUM pooling performs contrarily. Most importantly, none of them reaches a decent balance of nominal and adversarial accuracy. The underlying reason is that a fixed pooling operation leverages only a single statistic to represent the distribution of a feature dimension (Murray & Perronnin, 2014). Although MEDIAN pooling, as a robust statistic, should intuitively enhance robustness, we find it actually hinders the inner maximization stage of AT from making progress. We utilize the L_∞ norm-based PGD attack to maximize the xent loss of standard-trained models with the three fixed pooling operations. Figure 3 validates that MEDIAN pooling takes many more steps to maximize the loss; it therefore fails to find the worst-case adversarial examples in the first stage within a limited number of steps. Though MAX and SUM pooling are able to achieve higher loss values, they encounter challenges in the second stage. MAX pooling back-propagates gradients to a single point at each dimension, so the remaining (n−1)/n of the features do not contribute to model learning. Since n is oftentimes a large number (e.g., 1024), the huge information loss and non-smoothness will fail AT (Xie et al., 2020). While SUM pooling realizes a smoother backward propagation, it lacks discriminability: by applying the same weight to each element, the resulting representations are strongly biased by the adversarial perturbations.
Thus, with SUM pooling, the models cannot generalize well on clean data.
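The gradient-flow argument above can be checked numerically. The following sketch (our illustration, not the paper's code) computes the subgradient of column-wise MAX pooling over a 1024-point feature set and shows that only one of the n entries per feature dimension receives any learning signal, while SUM pooling spreads an identical weight everywhere:

```python
import numpy as np

def max_pool_grad(F):
    """Subgradient of column-wise MAX pooling over a feature set.

    F: (n, d) local features. MAX pooling g_j = max_i F[i, j] routes the
    gradient of each output dimension to a single row, so only 1/n of each
    column receives a signal (the (n-1)/n argument in the text)."""
    n, d = F.shape
    grad = np.zeros_like(F)
    grad[F.argmax(axis=0), np.arange(d)] = 1.0   # one-hot per column
    return grad

def sum_pool_grad(F):
    # SUM pooling spreads an identical gradient of 1 to every entry:
    # smooth, but every (possibly perturbed) point weighs equally.
    return np.ones_like(F)

rng = np.random.default_rng(0)
F = rng.standard_normal((1024, 64))
sparsity = (max_pool_grad(F) != 0).mean()   # fraction of entries updated
```

With n = 1024, the fraction of feature entries that ever receive a MAX pooling gradient is exactly 1/1024, matching the information-loss argument.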

4.3. ADVERSARIAL TRAINING WITH PARAMETRIC POOLING OPERATIONS

Recent studies have also presented trainable parametric pooling operations for different tasks in set learning, e.g., multiple instance learning, which are also qualified to serve as the symmetric function ρ(•) in point cloud classification models. We first group them into two classes, 1) attention-based and 2) sorting-based pooling, and further benchmark their robustness under AT in point cloud classification. It is worth noting that none of these parametric pooling operations were proposed to improve adversarial robustness, and we are the first to conduct such an in-depth analysis of how they behave as the symmetric function under AT in point cloud classification.

4.3.1. ATTENTION-BASED POOLING OPERATIONS

An attention module can be abstracted as mapping a query and a set of key-value pairs to an output, making the models learn and focus on the critical information (Bahdanau et al., 2014). Figure 4(a) shows the design principle of attention-based pooling, which leverages a compatibility function to learn point-level importance. The aggregated global feature is computed as a column-wise weighted sum of the local features. Two attention-based pooling operations, ATT and ATT-GATE, were first proposed for multiple instance learning (Ilse et al., 2018). Let F = {f_1, f_2, . . . , f_n} be a set of features; ATT aggregates the global feature g by: g = Σ_{i=1}^{n} a_i · f_i, a_i = exp(w · tanh(V · f_i)) / Σ_{j=1}^{n} exp(w · tanh(V · f_j)), where w ∈ R^{L×1} and V ∈ R^{L×d_m} are learnable parameters. ATT-GATE improves the expressiveness of ATT by introducing another non-linear activation, sigmoid(•), and more trainable parameters into the weight learning. Furthermore, PMA (Lee et al., 2019) is proposed for general set learning, leveraging multi-head attention (Vaswani et al., 2017) for pooling. We detail the design and our implementation of ATT, ATT-GATE, and PMA in Appendix B.3, and adversarially train the backbone models with these attention-based pooling operations.
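The ATT formulation above can be written compactly in NumPy. This is our illustrative sketch with toy dimensions (n = 16, d_m = 8, L = 4), not the adversarially trained implementation from Appendix B.3; note how each point receives one scalar weight a_i shared across all feature dimensions:

```python
import numpy as np

def att_pool(F, w, V):
    """Attention pooling (ATT, Ilse et al., 2018) sketch.

    F: (n, d_m) feature set; V: (L, d_m), w: (L,) learnable parameters.
    a_i = softmax_i(w . tanh(V f_i)); g = sum_i a_i * f_i."""
    scores = np.tanh(F @ V.T) @ w          # one scalar score per point
    scores -= scores.max()                 # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a @ F                           # (d_m,) global feature

rng = np.random.default_rng(1)
F = rng.standard_normal((16, 8))
w, V = rng.standard_normal(4), rng.standard_normal((4, 8))
g = att_pool(F, w, V)
# The softmax weights do not depend on point order, so the pooled
# feature is permutation-invariant:
g_perm = att_pool(F[rng.permutation(16)], w, V)
```

The single scalar weight per point is exactly the expressiveness bottleneck discussed in Section 4.3.3.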

4.3.2. SORTING-BASED POOLING OPERATIONS

Sorting has recently been considered in the set learning literature due to its permutation-invariant characteristic, as shown in Figure 4(b). Let F ∈ R^{n×d_m} be the matrix version of the feature set F; FSPool (Zhang et al., 2020) aggregates F by feature-wise sorting in descending order: F̃_{i,j} = sort↓(F_{:,j})_i; g_j = Σ_{i=1}^{n} W_{i,j} · F̃_{i,j}, where W ∈ R^{n×d_m} are learnable parameters. The pooled representation is thus a column-wise weighted sum of F̃. SoftPool (Wang et al., 2020) re-organizes F so that its j-th dimension is sorted in descending order, and picks the top k point-level embeddings F′_j ∈ R^{k×d_m} to form F′ = [F′_1, F′_2, . . . , F′_{d_m}]. Then, SoftPool applies a CNN to each F′_j → g′_j so that the pooled representation is g = [g′_1, g′_2, . . . , g′_{d_m}]. Implementation details of SoftPool are elaborated in Appendix B.3. We also adversarially train the backbone models with FSPool and SoftPool.
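FSPool's sort-then-weight scheme fits in a few lines. The sketch below is our illustration with toy sizes (n = 32, d_m = 6); it also shows that FSPool strictly generalizes MAX pooling, since a weight matrix with 1 on the first sorted row and 0 elsewhere recovers the column-wise maximum:

```python
import numpy as np

def fspool(F, W):
    """FSPool (Zhang et al., 2020) sketch: featurewise sort, then weight.

    F: (n, d_m) feature set; W: (n, d_m) learnable weights.
    g_j = sum_i W[i, j] * sort_desc(F[:, j])[i]."""
    F_sorted = -np.sort(-F, axis=0)        # descending sort, column-wise
    return (W * F_sorted).sum(axis=0)      # (d_m,) pooled feature

rng = np.random.default_rng(2)
F = rng.standard_normal((32, 6))
W = rng.standard_normal((32, 6))
g = fspool(F, W)                           # invariant to point order

# Weight 1 on the first sorted row only => plain MAX pooling.
W_max = np.zeros((32, 6))
W_max[0] = 1.0
```

Because sorting restores a canonical order before any learnable weights are applied, permutation-invariance holds for arbitrary W.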

4.3.3. EXPERIMENTAL ANALYSIS

Table 4 shows the results of AT with different parametric pooling operations. To meet the requirement of permutation-invariance, attention-based pooling is restricted to learning point-level importance. For example, ATT applies the same weight to all dimensions of a point embedding. As a result, attention barely improves the pooling operation's expressiveness, as it essentially projects each point onto a single dimension (e.g., f_i → a_i in ATT) and differentiates points based on it, which significantly limits discriminability. Therefore, little useful information can be learned from the attention module, explaining why these operations perform similarly under AT to SUM pooling, which applies the same weight to each point, as shown in Table 4. Sorting-based pooling operations naturally maintain permutation-invariance, as sort↓(•) re-organizes the unordered feature set F into an ordered feature map F̃. Thus, FSPool and SoftPool are able to further apply feature-wise linear transformations and CNNs. The insight is that feature dimensions are mostly independent of each other, and each point expresses itself to a different extent in every dimension. By employing feature-wise learnable parameters, gradients also flow more smoothly through sorting-based pooling operations. Table 4 validates that sorting-based pooling operations achieve much better adversarial accuracy, e.g., on average, 7.3% better than MAX pooling, while maintaining comparable nominal accuracy.

5. IMPROVING THE ADVERSARIAL ROBUSTNESS WITH DEEPSYM

In the above analysis, we have shown that sorting-based pooling operations can benefit AT in point cloud classification. We hereby explore further improving the sorting-based pooling design, inspired by existing arts. First, we notice that both FSPool and SoftPool apply sort↓(•) right after a ReLU function (Nair & Hinton, 2010). However, ReLU zeroes out many neurons (Goodfellow et al., 2016), which makes sort↓(•) unstable. Second, recent studies have shown that AT appreciates deeper neural networks (Xie & Yuille, 2019). Nevertheless, FSPool only employs one linear layer to aggregate features, and SoftPool requires d_m to be a small number. The reason is that scaling up the depth of these existing sorting-based pooling designs requires an exponential growth of parameters, which would make end-to-end learning intractable. To address the above limitations, we propose a simple yet effective pooling operation, DeepSym, which embraces the benefits of sorting-based pooling and also applies deep learnable layers to the pooling process. Given a feature set after ReLU activation F ∈ R₊^{n×d_m}, DeepSym first applies another linear transformation to re-map F into R^{n×d_m}, i.e., f′_i = W · f_i, where W ∈ R^{d_m×d_m} and F′ = {f′_1, f′_2, . . . , f′_n}. Let F′ be the matrix version of F′; DeepSym then sorts F′ in descending order (Equation 7) to get F̃. Afterwards, we apply a column-wise shared MLP on F̃: g_j = MLP(F̃_{:,j}), j = 1, 2, . . . , d_m. To deal with variable-size point clouds, DeepSym adopts column-wise linear interpolation on F̃ to form a continuous feature map and then re-samples it to be compatible with the trained MLP (Jaderberg et al., 2015). Last but not least, DeepSym is by design flexible in the number of features pooled from each dimension. In this paper, we only allow DeepSym to output a single feature for a fair comparison with others; it is hard for other pooling operations to offer this flexibility.
For example, it requires a linear complexity increase for FSPool to enable this capability.
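The three steps of DeepSym (re-map, sort, shared MLP) can be sketched as follows. This is our illustrative NumPy version with toy sizes and without batch normalization; the layer widths stand in for the paper's [512, 128, 32, 8, 1] configuration:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def deepsym(F, W_lin, mlp_weights):
    """DeepSym sketch. F: (n, d_m) post-ReLU features.

    1) A linear map W_lin (d_m, d_m) re-maps F out of the non-negative
       orthant (sorting ReLU's many zeros is unstable).
    2) Each column is sorted in descending order.
    3) A shared MLP, applied column-wise, reduces each sorted column of
       length n to a single scalar g_j."""
    Fp = F @ W_lin.T                        # step 1: re-map
    Fs = -np.sort(-Fp, axis=0)              # step 2: descending sort
    h = Fs.T                                # (d_m, n): one row per column
    for Wk in mlp_weights[:-1]:
        h = relu(h @ Wk.T)                  # step 3: MLP shared by columns
    return (h @ mlp_weights[-1].T).ravel()  # (d_m,) global feature g

rng = np.random.default_rng(3)
n, d_m = 64, 16
F = relu(rng.standard_normal((n, d_m)))
W_lin = rng.standard_normal((d_m, d_m))
mlp = [rng.standard_normal((32, n)), rng.standard_normal((8, 32)),
       rng.standard_normal((1, 8))]
g = deepsym(F, W_lin, mlp)
```

Because the MLP is shared across all d_m columns, its parameter count is independent of d_m, which is exactly why DeepSym scales where FSPool and SoftPool do not.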

5.1. EVALUATIONS

We implement a 5-layer DeepSym with [512, 128, 32, 8, 1] hidden neurons on three backbone networks and adversarially train them on ModelNet40 in the same way as introduced in Section 4.1. Table 4 shows that almost all models with DeepSym reach the best results in both nominal and adversarial accuracy, outperforming the default architectures by 10.8% on average. Taking PointNet as an example, DeepSym (33.6%) improves the adversarial accuracy by 17.5% (∼2.1×) compared to the original MAX pooling architecture. Besides, DeepSym also achieves a 3.5% improvement in adversarial accuracy compared to FSPool and SoftPool. Overall, we demonstrate that DeepSym can benefit AT significantly in point cloud classification. We further leverage various white- and black-box adversarial attacks to cross-validate the robustness improvements of DeepSym on PointNet. Specifically, we exploit the well-known FGSM (Goodfellow et al., 2014), BIM (Kurakin et al., 2016), and MIM (Dong et al., 2018) as the white-box attack methods. We set the adversarial budget ε = 0.05 and use 200 steps for the iterative attacks as well. For the black-box attacks, we choose two score-based methods, SPSA (Uesato et al., 2018) and NES (Ilyas et al., 2018), and a decision-based evolution attack (Dong et al., 2020b). We still select ε = 0.05 and allow 2000 queries to find each adversarial example. The detailed setups are elaborated in Appendix C.1. As shown in Table 5, PointNet with DeepSym consistently achieves the best adversarial accuracy under white-box attacks, except for FGSM. The reason is that FGSM is a single-step method with limited ability to find adversarial examples. Besides, we find the black-box attacks are not as effective as the white-box attacks, which also demonstrates that adversarial training with DeepSym improves the robustness of point cloud classification without gradient obfuscation (Carlini et al., 2019).
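For readers unfamiliar with score-based black-box attacks such as NES, the core idea is a query-only gradient estimate via antithetic Gaussian sampling. The sketch below is our illustration on a known linear loss (not an attack on the trained models); the sample count and σ are arbitrary assumptions:

```python
import numpy as np

def nes_gradient(loss_fn, x, sigma=0.01, n_samples=500, rng=None):
    """NES-style gradient estimate (in the spirit of Ilyas et al., 2018).

    Only loss values are queried; the gradient is estimated as
    grad ~ 1/(2*sigma*N) * sum_k [L(x + sigma*u_k) - L(x - sigma*u_k)] * u_k
    with antithetic Gaussian directions u_k."""
    rng = rng or np.random.default_rng(0)
    grad = np.zeros_like(x)
    for _ in range(n_samples):
        u = rng.standard_normal(x.shape)
        grad += (loss_fn(x + sigma * u) - loss_fn(x - sigma * u)) * u
    return grad / (2.0 * sigma * n_samples)

# On a known linear loss the estimate should align with the true gradient w.
w = np.array([3.0, -1.0, 2.0])
g_hat = nes_gradient(lambda x: float(w @ x), np.zeros(3))
```

The estimated gradient then drives an ordinary PGD-style update, which is why query-limited black-box attacks are noisier, and in our experiments weaker, than their white-box counterparts.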
Since DeepSym brings deep trainable layers into the original backbones, it is necessary to report its overhead. We leverage TensorFlow (Abadi et al., 2016) and NVIDIA profilers to measure the inference time, the number of trainable parameters, and the GPU memory usage on PointNet. Specifically, the inference time is averaged over the 2468 objects in the validation set, and the GPU memory is measured on an RTX 2080 with batch size 8. As shown in Table 6, DeepSym indeed introduces more computation overhead by leveraging the shared MLP. However, we believe the overhead is relatively small and acceptable compared to its massive improvements in adversarial robustness.

5.2. EXPLORING THE LIMITS OF DEEPSYM

There is a trade-off between the training cost and adversarial robustness in AT. Increasing the number of PGD attack steps creates harder adversarial examples (Madry et al., 2018), which could further improve the model's robustness. However, the training time also increases linearly with the number of attack iterations. Due to PointNet's broad adoption (Guo et al., 2020), we here analyze how it performs under various AT settings. Specifically, we exploit the most efficient AT with PGD-1 on ModelNet10 (Wu et al., 2015), a dataset consisting of 10 categories with 4899 objects, and a relatively expensive AT with PGD-20 on ModelNet40 to demonstrate the effectiveness of DeepSym. Other training setups are identical to Section 4.1.

6. CONCLUSION

In this work, we perform the first rigorous study on the adversarial robustness of point cloud classification. We design adaptive attacks and demonstrate that state-of-the-art defenses against adversarial point clouds cannot provide real robustness. Furthermore, we conduct a thorough analysis of how the required symmetric function affects the AT performance of point cloud classification models. We are the first to identify that fixed pooling generally weakens the models' robustness under AT and that, on the other hand, sorting-based parametric pooling benefits AT well. Lastly, we propose DeepSym, which further architecturally advances the adversarial accuracy of PointNet to 47.0% under AT, outperforming the original design and a strong baseline by 28.5% (∼2.6×) and 6.5%, respectively.

Datasets. Our evaluations are conducted on ModelNet40 and ModelNet10 (Wu et al., 2015). We randomly sample 1024 points from each object to form its point cloud, if not otherwise stated.

PointNet. We leverage the default architecture in the PointNet codebase and exclude the transformation nets (i.e., T-Net) and dropout layers for simplicity and reproducibility. PointNet leverages shared fully connected (FC) layers as the permutation-equivariant layer φ_l : FC_l(F^l_{:,i}) → F^{l+1}_{:,i} and MAX pooling as the symmetric function ρ(•). We visualize the differences of φ(•) in Figure 9 and summarize the layer information in Table 7.

B.3 PARAMETRIC POOLING DESIGN AND IMPLEMENTATION

We have introduced ATT in Section 4.3.1. In our implementation, we choose L = 512, so that V ∈ R^{512×1024}, to train the backbone models. ATT-GATE is a variant of ATT with more learnable parameters: g = Σ_{i=1}^{n} a_i · f_i, a_i = exp(w · (tanh(V · f_i) ⊙ sigm(U · f_i))) / Σ_{j=1}^{n} exp(w · (tanh(V · f_j) ⊙ sigm(U · f_j))), where U, V ∈ R^{L×d_m}, sigm(•) is the sigmoid activation function, and ⊙ is an element-wise multiplication. We also choose L = 512 in ATT-GATE to train the backbone models. PMA pools the feature set with multi-head attention over a learnable seed vector, where FC(•) is the fully connected layer and Multihead(•) is the multi-head attention module (Vaswani et al., 2017). We follow the implementation in the released codebase to choose k = 1, the number of heads = 4, and the number of hidden neurons in FC(•) = 128 to train the backbone models. Since SoftPool (Wang et al., 2020) sorts the feature set in each dimension, it requires the number of dimensions d_m to be relatively small. We follow the description in their paper to choose d_m = 8 and k = 32, so that each F′_j ∈ R^{32×8}. We apply one convolutional layer to aggregate each F′_j into g′_j ∈ R^{1×32}, so that the final g ∈ R^{1×256}. Therefore, for all backbone networks with SoftPool, we apply the last equivariant layer as φ : n × d_{m−1} → n × 8 and ρ : n × 8 → 256.
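The gated variant differs from ATT only in how the per-point score is produced. The sketch below is our NumPy illustration of ATT-GATE with toy dimensions (n = 16, d_m = 8, L = 4), not the L = 512 implementation used for training:

```python
import numpy as np

def sigm(x):
    return 1.0 / (1.0 + np.exp(-x))

def att_gate_pool(F, w, V, U):
    """ATT-GATE (Ilse et al., 2018) sketch: gated attention pooling.

    F: (n, d_m); V, U: (L, d_m); w: (L,). The tanh branch is modulated
    element-wise by a sigmoid gate before scoring:
    a_i = softmax_i( w . (tanh(V f_i) * sigm(U f_i)) )."""
    h = np.tanh(F @ V.T) * sigm(F @ U.T)   # (n, L) gated hidden units
    scores = h @ w
    scores -= scores.max()                 # numerical stability
    a = np.exp(scores) / np.exp(scores).sum()
    return a @ F                           # (d_m,) global feature

rng = np.random.default_rng(4)
F = rng.standard_normal((16, 8))
w = rng.standard_normal(4)
V, U = rng.standard_normal((4, 8)), rng.standard_normal((4, 8))
g = att_gate_pool(F, w, V, U)
```

As with ATT, each point still collapses to one scalar weight, so the gate adds capacity to the scoring function without lifting the point-level restriction.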



Footnotes:
1. In this paper, we use nominal and adversarial accuracy to denote the model's accuracy on clean and adversarially perturbed data, respectively.
2. https://github.com/RyanHangZhou/DUP-Net
3. https://github.com/charlesq34/pointnet
4. https://github.com/manzilzaheer/DeepSets
5. https://github.com/juho-lee/set_transformer



Figure 1: Sampled visualizations of adversarial examples generated by adaptive attacks on both defenses (ε = 0.05 and δ = 0.16). More visualizations can be found in Appendix A.2.

We train GvG-PointNet++ based on single-scale grouping PointNet++ (Qi et al., 2017b) on ModelNet40 and set r = 0.08 and λ = 1, as suggested by Dong et al. (2020a). The model is trained with the Adam (Kingma & Ba, 2014) optimizer for 250 epochs using batch size 16 and an initial learning rate of 0.01. For the adaptive attack, we use a 10-step binary search to find an appropriate β. The setup of the L_2 norm-based PGD attacks is identical to Dong et al. (2020a), and we also leverage L_∞ norm-based PGD-200 in the evaluation. Detailed setups are elaborated in Appendix A.1.

Figure 2: The general specification of point cloud classification (σ • ρ • Φ)(X), where n is the number of points, d_i is the number of hidden dimensions in the i-th feature map, Φ(•) represents the permutation-equivariant layers, ρ(•) denotes the column-wise symmetric (permutation-invariant) function, and σ(•) is the fully connected layer.

Figure 3: Xent loss of PGD attack on PointNet with three fixed pooling operations (each value is averaged over 100 runs from random starting points).

Attention-based pooling operations apply self-attention to each point-level feature vector f_i. The learned weight a_i is multiplied with each element in f_i, and the aggregated feature is computed by a column-wise summation. Sorting-based pooling operations sort each dimension to re-organize the feature set into an ordered matrix F̃, to which complex operations (e.g., CNN) can be applied to aggregate features.

Figure 4: Design philosophy of attention-based and sorting-based pooling operations.

(a) PGD-1 adversarial training on ModelNet10. (b) PGD-20 adversarial training on ModelNet40.

Figure 5: Adversarial robustness of PointNet with various pooling operations under PGD-200 at ε = 0.05.

The shared MLP learns the global feature representation g. Each layer of the MLP is composed of a linear transformation, a batch normalization module (Ioffe & Szegedy, 2015), and a ReLU activation function. Compared to FSPool, which applies different linear transformations to different dimensions, DeepSym employs an MLP shared across dimensions. By doing so, DeepSym deepens the pooling process to be more capable of digesting the adversarial perturbations. DeepSym also addresses the problem that SoftPool is only achievable with a limited d_m: because the MLP is shared by all feature channels, it can scale up to a large d_m with little complexity increase. Moreover, DeepSym generalizes MAX and SUM pooling through specific weight vectors. Therefore, it can also theoretically achieve universality with d_m ≥ n (Wagstaff et al., 2019) while being more expressive in its representation and smoother in gradient propagation.

Figure 6: Visualizations of adversarial examples (2048 points) generated by L_2 norm-based C&W adaptive attacks on PU-Net.

Figure 7: Visualizations of adversarial examples (2048 points) generated by L 2 norm-based C&W adaptive attacks on DUP-Net.

Figure 8: Visualizations of adversarial examples (1024 points) generated by L∞ norm-based PGD adaptive attacks on GvG-PointNet++.

Figure 9: Different architectures of φ(·) in PointNet, DeepSets, and DSS. The aggregated feature appears in (b) and (c).

Figure 12: T-SNE visualizations of PointNet feature embeddings with MAX, FSPool, SoftPool, and DeepSym pooling operations, shown in rows (a) to (d), respectively. The three columns correspond to training data, validation data, and PGD-200 adversarial validation data, from left to right.

Adversarial accuracy under adaptive attacks on PU-Net and DUP-Net. For the denoiser layer g, k = 2 and α = 1.1 are set the same as Zhou et al. (2019). † denotes the attack in the original paper.



Adversarial robustness of models with fixed pooling operations under PGD-200 at ε = 0.05.

Adversarial robustness of models with parametric pooling operations under PGD-200 at ε = 0.05.

Adversarial robustness of PointNet with different pooling operations under attacks at ε = 0.05.

In a lateral comparison, point cloud classification backbones are much more lightweight than image classification models. For example, ResNet-50 (He et al., 2016) and VGG-16 (Simonyan & Zisserman, 2014) have 23 and 138 million trainable parameters, respectively, and take much longer to run inference. The reason that models with SoftPool and PMA have fewer trainable parameters is that they limit the number of dimensions in the global feature by design.

Overhead measurement of PointNet with different pooling operations.

Figure 5 shows the results of the robustness of adversarially trained PointNet with various pooling operations under PGD-200. We demonstrate that PointNet with DeepSym reaches the best adversarial accuracy of 45.2% under AT with PGD-1 on ModelNet10, outperforming the original MAX pooling by 17.9% (∼ 1.7×) and SoftPool by 4.0%. Surprisingly, PointNet with DeepSym also achieves the best nominal accuracy of 88.5%. Moreover, DeepSym further advances itself under AT with PGD-20 on ModelNet40. Figure 5(b) shows that PointNet with DeepSym reaches the best adversarial accuracy of 47.0%, a 28.5% (∼ 2.6×) and 6.5% improvement over MAX pooling and SoftPool, respectively, while maintaining competent nominal accuracy. We also report detailed evaluations using different PGD attack steps and budgets in Appendix C.1.

Layer information of PointNet, DeepSets, and DSS. BN represents a batch normalization layer.

A ADAPTIVE ATTACK EXPERIMENTAL SETUP AND VISUALIZATIONS

A.1 EXPERIMENTAL SETUPS

Since DUP-Net is open-sourced, we target the publicly released PointNet and PU-Net models. For the L2 norm-based C&W attack, we set the loss function as:

min_{X'} ||X' − X||_2 + λ · max( max_{i≠t} Z(X')_i − Z(X')_t, 0 )

where X ∈ R^{n×3} is the matrix version of the point cloud, X' is the optimized adversarial example, Z(X')_i is the i-th element of the output logits, and t is the target class. We leverage a 10-step binary search to find the appropriate hyper-parameter λ in [10, 80]. As suggested by Xiang et al. (2019), we choose 10 distinct classes and pick 25 objects per class from the ModelNet40 validation set for evaluation. The step size of the adversarial optimization is 0.01, and we allow at most 500 iterations of optimization in each binary search step to find the adversarial examples.

For the L∞ norm-based PGD attack, we adopt the formulation in Madry et al. (2018):

X^{t+1} = Π_S( X^t + α · sign(∇_X L(θ, X^t, y)) )

where X^t is the adversarial example in the t-th attack iteration, Π_S is the projection onto the pre-defined perturbation space S (the L∞ norm ball in our setup), and α is the step size. We select the boundary of allowed perturbations ε ∈ {0.01, 0.025, 0.05, 0.075} out of the point cloud data range [−1, 1]. Since point cloud data is continuous, we set the step size α = ε/10.

For GvG-PointNet++, we train it on the single scale grouping (SSG) PointNet++ backbone. The backbone network has three PointNet set abstraction modules that hierarchically aggregate local features, and we enable gather vectors in the last module, which contains 128 local features (i.e., n = 128 in Section 3.2) with 256 dimensions. To learn the gather vectors, we apply three fully connected layers with 640, 640, and 3 hidden neurons, respectively, as suggested by Dong et al. (2020a). Since the data from ModelNet40 is normalized to [−1, 1], the global object center is c_g = [0, 0, 0]. For the L∞ norm-based PGD attack, we leverage the same setup as the attack on DUP-Net.
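The L∞ PGD iteration above can be sketched in a few lines of NumPy. The loss here is an illustrative stand-in (L(X) = ||X||², so the gradient is 2X), not a real classifier; the structure (gradient-sign ascent, projection onto the ε-ball, and clipping to the point cloud data range) matches the setup described:

```python
import numpy as np

def linf_pgd(x0, grad_fn, eps, steps, alpha):
    """L-inf PGD sketch: ascend the loss, then project back into the
    eps-ball around the clean input x0 and the data range [-1, 1]."""
    x = x0.copy()
    for _ in range(steps):
        x = x + alpha * np.sign(grad_fn(x))   # gradient-sign ascent step
        x = np.clip(x, x0 - eps, x0 + eps)    # project onto the L-inf ball
        x = np.clip(x, -1.0, 1.0)             # keep the valid point range
    return x

# Toy surrogate loss L(X) = ||X||^2, so grad_fn returns 2X.
x_clean = np.random.uniform(-0.5, 0.5, size=(16, 3))  # 16 points, xyz
eps = 0.05
x_adv = linf_pgd(x_clean, lambda x: 2 * x, eps=eps, steps=200, alpha=eps / 10)

assert np.max(np.abs(x_adv - x_clean)) <= eps + 1e-9  # stays in the ball
assert np.all(np.abs(x_adv) <= 1.0)                   # stays in data range
```

Against a real model, `grad_fn` would return the loss gradient with respect to the input points, obtained via backpropagation.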
For the L2 norm-based PGD attack, we follow the settings in Dong et al. (2020a) and set the L2 norm threshold ε = δ · √(n × d_in), where δ is selected from {0.08, 0.16, 0.32}, n is the number of points, and d_in is the dimension of the input point cloud (i.e., 3). The number of attack iterations is set to 50, and the step size is α = ε/50.
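As a quick sanity check on the budget formula, the resulting L2 thresholds can be computed directly (n = 1024 points is assumed here for illustration; the experiments above use this point count for PointNet++ inputs):

```python
import math

# eps = delta * sqrt(n * d_in), following the L2 PGD setup described above.
n, d_in = 1024, 3  # illustrative point count and xyz coordinates
for delta in (0.08, 0.16, 0.32):
    eps = delta * math.sqrt(n * d_in)
    print(f"delta={delta}: eps={eps:.3f}")
```

The budget therefore grows with the square root of the total number of input coordinates, so denser point clouds receive proportionally larger L2 allowances.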

A.2 VISUALIZATIONS

We visualize some adversarial examples generated by adaptive attacks on PU-Net and DUP-Net in Figure 6 and Figure 7. As expected, adversarial examples targeting DUP-Net are noisier than the ones targeting PU-Net, since the former need to break the denoiser layer. However, as mentioned in Section 3.1, they are barely distinguishable by human perception. We also visualize some adversarial examples generated by untargeted adaptive PGD attacks on GvG-PointNet++ in Figure 8 with different perturbation budgets ε.

B ADVERSARIAL TRAINING SETUP

B.1 PGD ATTACK IN ADVERSARIAL TRAINING

We also follow the formulation in Equation 18 to find the worst-case adversarial examples. Specifically, we empirically select ε = 0.05 for the training recipe, as there is no quantitative study on how much point cloud perturbation humans can tolerate. Figure 8 shows that adversarial examples with ε = 0.05 are still recognizable by human perception. Moreover, because point cloud data is continuous, we set the step size of PGD attacks as α = ε/10, as in Appendix A.1.

C DEEPSYM ABLATIONS

It is worth noting that DeepSym does not require the final layer to have only one neuron. However, to have a fair comparison with other pooling operations that aggregate one feature from each dimension, our implementation of DeepSym also aggregates one feature from each dimension.

C.1 EVALUATION DETAILS

We also perform extensive evaluations using different PGD attack steps and budgets on PGD-20 trained PointNet. Figure 10 shows that PointNet with DeepSym consistently achieves the best adversarial accuracy. We also validate that MEDIAN pooling indeed hinders backward gradient propagation: the adversarial accuracy of PointNet with MEDIAN pooling keeps dropping even after PGD-1000, while the adversarial accuracy of PointNet with other pooling operations usually converges after PGD-200. Figure 11 shows that DeepSym also outperforms the other pooling operations under different adversarial budgets ε.

We leverage the default setups of FGSM, BIM, and MIM in our evaluation. FGSM is a single-step attack, which can be represented as:

X' = X + ε · sign(∇_X L(θ, X, y))

The BIM attack is similar to the PGD attacks described in Appendix A.1. The differences are that 1) the attack starts from the original point cloud X and 2) the step size is α = ε/T, where T is the number of attack steps. The MIM attack introduces a momentum term into the adversarial optimization:

g^{t+1} = μ · g^t + ∇_X L(θ, X^t, y) / ||∇_X L(θ, X^t, y)||_1,   X^{t+1} = X^t + α · sign(g^{t+1})

Similar to BIM, the attack starts from the original point cloud X with step size α = ε/T. We set μ = 1 following the original setup (Dong et al., 2018).

Due to computational resource constraints, we set the sample size to 32 and allow 2000 queries to find each adversarial example in the score-based black-box attack (Uesato et al., 2018; Ilyas et al., 2018). For the evolution attack, we use the default loss L as the fitness score and initialize 32 sets of perturbations from a Gaussian distribution N(0, 1). The 4 sets of perturbations with the top fitness scores remain for the next iteration, while the others are discarded. We also allow 2000 generations of evolution to find each adversarial example.
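The three gradient-based attacks above differ only in how each step is formed. A minimal NumPy sketch makes the relationship concrete; the loss is again an illustrative surrogate (L(X) = ||X||², gradient 2X) standing in for the model loss:

```python
import numpy as np

def fgsm(x, grad_fn, eps):
    """Single-step FGSM: X' = X + eps * sign(grad_X L)."""
    return x + eps * np.sign(grad_fn(x))

def bim(x, grad_fn, eps, T):
    """BIM: start from the clean X, take T steps of size eps/T, clip to the ball."""
    x_adv = x.copy()
    for _ in range(T):
        x_adv = np.clip(x_adv + (eps / T) * np.sign(grad_fn(x_adv)),
                        x - eps, x + eps)
    return x_adv

def mim(x, grad_fn, eps, T, mu=1.0):
    """MIM: BIM with an L1-normalized momentum term g (Dong et al., 2018)."""
    x_adv, g = x.copy(), np.zeros_like(x)
    for _ in range(T):
        grad = grad_fn(x_adv)
        g = mu * g + grad / (np.sum(np.abs(grad)) + 1e-12)  # momentum update
        x_adv = np.clip(x_adv + (eps / T) * np.sign(g), x - eps, x + eps)
    return x_adv

# Toy surrogate loss L(X) = ||X||^2, so the gradient is 2X.
x = np.random.uniform(-0.4, 0.4, size=(16, 3))
eps = 0.05
for x_adv in (fgsm(x, lambda v: 2 * v, eps),
              bim(x, lambda v: 2 * v, eps, T=10),
              mim(x, lambda v: 2 * v, eps, T=10)):
    assert np.max(np.abs(x_adv - x)) <= eps + 1e-9  # all stay in the eps-ball
```

With μ = 0, MIM reduces to BIM; the momentum term accumulates gradient direction across iterations to stabilize the update.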

C.2 EVALUATION ON SCANOBJECTNN

We also evaluate the adversarial robustness of different pooling operations on a newer point cloud dataset, ScanObjectNN (Uy et al., 2019), which contains 2902 objects belonging to 15 categories. We leverage the same adversarial training setup as for ModelNet10 (i.e., PGD-1). Table 8 shows the results. We find that PointNet with DeepSym still achieves the best adversarial robustness. Since the point clouds in ScanObjectNN are collected from real-world scenes and suffer from occlusion and imperfection, both nominal and adversarial accuracy drop compared to the results on ModelNet40. We find that even some clean point clouds cannot be correctly recognized by human perception. The performance degradation is therefore expected, and we believe these results are not as representative as those on ModelNet40.

C.3 T-SNE VISUALIZATIONS

We visualize the global feature embeddings of PointNet adversarially trained with PGD-20 under different pooling operations in Figure 12, and their logits in Figure 13. Although we feed the data from all 40 classes into the T-SNE process, it is hard to pick 40 visually distinct colors, so we only plot 10 categories from ModelNet40 in the visualizations.
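The visualization pipeline itself is standard t-SNE over the pooled global features. A minimal sketch with synthetic stand-in features (the real inputs are the trained PointNet embeddings, which are not reproduced here) looks like:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical stand-in for PointNet global features: 10 classes of
# 1024-d vectors, mimicking the embeddings visualized in Figure 12.
rng = np.random.default_rng(0)
n_classes, per_class, dim = 10, 20, 1024
centers = rng.normal(size=(n_classes, dim))
feats = np.concatenate([c + 0.1 * rng.normal(size=(per_class, dim))
                        for c in centers])
labels = np.repeat(np.arange(n_classes), per_class)

# Project to 2D; each point is then colored by its class label.
emb = TSNE(n_components=2, perplexity=10, random_state=0).fit_transform(feats)
assert emb.shape == (n_classes * per_class, 2)
```

Although all 40 classes can be embedded jointly, only the rows belonging to the 10 plotted categories are scattered and colored in the final figure.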

