MANIFOLD-AWARE TRAINING: INCREASE ADVERSARIAL ROBUSTNESS WITH FEATURE CLUSTERING

Abstract

The problem of defending against adversarial attacks has attracted increasing attention in recent years. While various types of defense methods (e.g., adversarial training, detection and rejection, and recovery) have been proven empirically to bring robustness to the network, their weaknesses were exposed by later works. Inspired by observations of the distribution properties of the features extracted by CNNs in the feature space and their link to robustness, this work designs a novel training process, called Manifold-Aware Training (MAT), which forces CNNs to learn compact features to increase robustness. The effectiveness of the proposed method is evaluated via comparisons with existing defense mechanisms, i.e., the TRADES algorithm, which has been recognized as a representative state-of-the-art technology, and the MMC method, which also aims to learn compact features. Further verification is conducted using an attack adapted to our method. Experimental results show that MAT-trained CNNs exhibit significantly higher robustness than state-of-the-art models.

1. INTRODUCTION

1.1 BACKGROUND

Convolutional neural networks (CNNs) have seen increasing use in recent years due to their high adaptivity and flexibility. However, Szegedy et al. (2014) discovered that by maximizing the loss of a CNN model w.r.t. the input data, one can find a small, imperceptible perturbation which causes the CNN to misclassify; the corrupted data (with perturbation) are referred to as adversarial examples. A simple and widely used method for constructing such perturbations is the Fast Gradient Sign Method (FGSM). Since then, many algorithms for constructing such perturbations have been proposed; these algorithms are referred to generally as adversarial attack methods (e.g., Madry et al. (2018); Carlini & Wagner (2017); Rony et al. (2019); Brendel et al. (2018); Chen et al. (2017); Alzantot et al. (2019); Ru et al. (2020); Al-Dujaili & O'Reilly (2020)). Among them, the Projected Gradient Descent (PGD) (Kurakin et al., 2017) and Carlini & Wagner (C&W) (Carlini & Wagner, 2017) attacks are the most widely used. Specifically, PGD is a multi-step variant of FGSM which exhibits a higher attack success rate, while the C&W attack leverages an objective function designed to jointly minimize the perturbation norm and the likelihood of the input being correctly classified. The existence of adversarial examples implies an underlying inconsistency between the decision-making processes of CNN models and humans, and can be catastrophic in life-and-death applications, such as automated vehicles or medical diagnosis systems, in which unpredictable noise may cause the CNN to misclassify its inputs. Various countermeasures for thwarting adversarial attacks, known as adversarial defenses, have been proposed. One of the most common forms of adversarial defense is to augment the training dataset with adversarial examples so as to increase the generalization ability of the CNN toward these patterns.
Such a technique is known as adversarial training (Shaham et al., 2015; Zhang et al., 2019; Wang et al., 2020) . However, while such methods can achieve state-of-the-art robustness in terms of robust accuracy (i.e., the accuracy of adversarial examples), training robust classifiers is a non-trivial task. For example, Nakkiran (2019) showed that significantly higher model capacity is required for robust training, while Schmidt et al. (2018) proved that robust training requires a significantly larger number of data instances than natural training.
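The FGSM/PGD construction described above can be sketched as follows. This is a minimal numpy illustration on a linear softmax classifier (so the input gradient has a closed form); the model, step size `alpha`, and radius `eps` are illustrative assumptions, not the paper's actual attack configuration.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def ce_loss(x, y, W):
    # cross-entropy of a linear softmax classifier with logits W @ x
    return -np.log(softmax(W @ x)[y])

def pgd_linf(x0, y, W, eps=0.05, alpha=0.01, steps=10):
    """Multi-step FGSM (PGD): ascend the loss with signed gradient steps
    and project back onto the l_inf ball of radius eps around x0."""
    x = x0.copy()
    onehot = np.eye(W.shape[0])[y]
    for _ in range(steps):
        grad_x = W.T @ (softmax(W @ x) - onehot)  # closed-form input gradient
        x = x + alpha * np.sign(grad_x)           # FGSM-style ascent step
        x = x0 + np.clip(x - x0, -eps, eps)       # project onto the eps-ball
        x = np.clip(x, 0.0, 1.0)                  # stay in the valid input range
    return x
```

For a real CNN, `grad_x` would come from backpropagation rather than the closed form; the projection and clipping steps are unchanged.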

1.2. MOTIVATION AND CONTRIBUTIONS

The study commences by observing the distribution properties of the features learned by existing training methods in order to obtain a better understanding of the adversarial example problem. In particular, the t-SNE dimension reduction method (van der Maaten & Hinton, 2008) is used to visualize the extracted features, as illustrated in Figure 1. The visualization reveals the following feature distribution properties:

• Non-clustering: same-class features are not always clustered together (i.e., some points leave the clusters of their respective colors in Figure 1), which is at odds with the intuition that the features representative (for classification) of same-class samples should be similar to one another.

• Confusing-distance: closeness between samples in the feature space does not imply resemblance of their predictions (especially for adversarial examples, as there are many triangles colored differently from the surrounding points in Figure 1(b)).

We confirm the validity of these observations through a numerical analysis of the match rate between the dominant class of the closest cluster and the predicted class. Clustering analysis algorithms (e.g., Ward's hierarchical clustering algorithm (Ward, 1963)) are leveraged to find the clusters formed by the CNN-learned features. The match rate is defined as $\mathbb{E}_x[\mathbf{1}_{\mathrm{dom}(C^{(x)})=f(x)}]$, where $C^{(x)}$ is the cluster closest to $x$ ($C^{(x)} = \arg\min_C \mathrm{dist}(C, x)$) and $\mathrm{dom}(C)$ evaluates the dominant class of a cluster ($\mathrm{dom}(C) = \arg\max_i m(C)_i$, where $m(C) \in \mathbb{N}^L$ is a cluster mapping vector counting the members of each predicted class). Table 1 summarizes the analysis results and confirms that both properties exist. Intuitively, a good feature extractor for classification purposes should produce similar features for all samples within the same class. On the other hand, according to Tang et al.
(2019), the existence of adversarial examples results from a mismatch between the features used by humans and those used by CNNs. Therefore, one intuitive approach for increasing CNN robustness is simply to drive the CNN-learned features toward human-used features. However, it is impossible to understand and predict human-used features with any absolute degree of certainty. Thus, an alternative approach is to force the CNN-learned features to have properties that human-used features are expected to have. For example, as mentioned above, features for the classification of objects belonging to the same class should be similar to one another. Based on the above observations, the present study proposes a novel training process, designated as Manifold-Aware Training (MAT), for learning features which are both representative and compact. The experimental results confirm that models trained with MAT exhibit significantly higher robustness than existing state-of-the-art models. As will become clear later, our idea is, in some sense, similar to that of Pang et al. (2020), who proposed the Max-Mahalanobis Center (MMC) loss, which minimizes the distance of features to their assigned preset class centers. By showing that the robustness of a model using MMC with adversarial training is higher than that using adversarial training alone, the authors claimed that high feature compactness makes the training samples locally sufficient for robust learning (Schmidt et al., 2018). Their conclusion that feature compactness helps robustness matches our idea. The main contributions of our study can be summarized as follows:

• A better understanding of the relationship between robustness and the distance between adversarial examples and clean samples in the MAT-learned feature space.

• Our method improves the state-of-the-art performance from 57% to 80% for CIFAR10 and from 96% to 99% for MNIST in terms of robust accuracy.

The main notations used in the present study are summarized in Appendix A.
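The match-rate analysis above can be sketched as follows. This numpy sketch assumes the cluster assignments are already given (e.g., produced by Ward's algorithm); `match_rate` and its argument names are hypothetical, not the paper's implementation.

```python
import numpy as np

def match_rate(features, cluster_ids, preds, n_classes):
    """Match rate E_x[1{dom(C(x)) == f(x)}]: the fraction of samples whose
    predicted class agrees with the dominant predicted class of their
    nearest cluster (nearest measured by distance to cluster centroids)."""
    clusters = np.unique(cluster_ids)
    centroids = np.stack([features[cluster_ids == c].mean(axis=0) for c in clusters])
    # dominant predicted class per cluster: argmax of the mapping vector m(C)
    dom = np.array([np.bincount(preds[cluster_ids == c], minlength=n_classes).argmax()
                    for c in clusters])
    # nearest cluster for each sample
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    return (dom[nearest] == preds).mean()
```

A perfectly clustered feature space gives a match rate of 1.0; the non-clustering and confusing-distance properties show up as a lower rate.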

2. RELATED WORKS

One family of adversarial defense methods, referred to as adversarial-training-based methods, augments the training data with adversarial examples. For example, Zhang et al. (2019) defined the so-called TRADES loss, which is based on the trade-off between the clean accuracy (i.e., the accuracy on clean images) and the robust accuracy, and has the form

$\min_\theta \frac{1}{N} \sum_{i=1}^{N} \Big[ J(x^{(i)}, y^{(i)}; \theta) + \max_{x'^{(i)} \in B_\epsilon(x^{(i)})} J'(x^{(i)}, x'^{(i)}; \theta)/\lambda \Big]$,   (1)

where $B_\epsilon(x)$ denotes the $l_p$-bounded ball ($B_\epsilon(x) = \{x' \mid \|x - x'\|_p \le \epsilon\}$), and $J(\cdot, \cdot; \theta)$ and $J'(\cdot, \cdot; \theta)$ are implemented using the cross-entropy loss and the Kullback-Leibler divergence criterion, respectively. Ding et al. (2020) proposed the Max-Margin Adversarial (MMA) training method, which minimizes the cross-entropy loss for misclassified samples and maximizes the margin between correctly classified samples and the nearest decision boundaries. The authors proved that by minimizing the cross-entropy loss of the samples on the decision boundaries, the margin can be maximized. Wang et al. (2020) found that misclassified samples bring more robustness when used in adversarial training than correctly classified ones. Inspired by this observation, the authors proposed an objective function that encourages stability around misclassified samples and optimizes the classification loss for adversarial examples. While adversarial-training-based methods achieve state-of-the-art robust accuracy, they inherit some limitations (Nakkiran, 2019; Schmidt et al., 2018). Some studies try to alleviate these limitations to further improve CNN robustness. For example, according to the work of Schmidt et al. (2018), robust training requires significantly more samples than natural training. However, high sample density may make the training samples locally sufficient for robust training. Therefore, Pang et al.
(2020) proposed the Max-Mahalanobis Center (MMC) loss as follows: $L_{MMC}(\phi(x), y) = \frac{1}{2}\|\phi(x) - \mu^{(y)}\|_2^2$, where $\phi(\cdot)$ is a CNN feature extractor and $\mu^{(y)}$ denotes the preset class center for class $y$. Additionally, the authors analyzed the limitations of traditional softmax and cross-entropy (SCE) loss-based training methods regarding the induced sample density in the feature space, and proved that the feature density induced by the MMC loss is guaranteed to be high around the centers $\mu^{(y)}$. To our knowledge, however, improving the robustness of CNNs by manipulating the feature space has attracted little attention. In view of this, we propose a novel defense method which trains CNNs with loss functions designed to induce feature compactness, and which improves the state-of-the-art robustness significantly.
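The TRADES objective in Eq. 1 can be sketched as follows. This numpy sketch operates on precomputed logits; the inner maximization that produces the adversarial logits is assumed to have been performed separately (e.g., by PGD), and all names are illustrative.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def trades_loss(logits_clean, logits_adv, y, lam):
    """Batch TRADES surrogate: cross-entropy on clean logits plus
    KL(p_clean || p_adv) weighted by 1/lambda.  logits_adv is assumed
    to come from the inner PGD maximization over the eps-ball."""
    p, q = softmax(logits_clean), softmax(logits_adv)
    ce = -np.log(p[np.arange(len(y)), y]).mean()          # clean-accuracy term
    kl = (p * (np.log(p) - np.log(q))).sum(axis=1).mean()  # robustness term
    return ce + kl / lam
```

Since the KL term is non-negative, the TRADES loss is lower-bounded by the plain cross-entropy, with equality when the adversarial logits match the clean ones.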

3. PROPOSED METHOD

The proposed Manifold-Aware Training (MAT) algorithm is introduced in Subsection 3.1, and two additional loss functions to further improve the performance of MAT are described in Subsection 3.2. Moreover, a transformation technique designed to architecturally transform MAT-trained CNNs into traditional CNNs for compatibility with existing techniques is presented in Subsection 3.3.

3.1. MANIFOLD-AWARE TRAINING (MAT)

Recall the observed properties of features learned by traditional CNNs, i.e., the non-clustering property and the confusing-distance property (see Section 1). This subsection introduces our proposed loss function for achieving intra-class feature compactness and inter-class feature dispersion. Specifically, the loss function minimizes the distance of CNN output features to their corresponding class centers, while the class centers are kept far from one another. According to Pang et al. (2020), fixing the class centers during the training phase yields better performance than updating them while learning for another objective (e.g., intra-class feature compactness), since the CNN can focus on one objective rather than seeking a trade-off between multiple objectives. The distance between samples is measured using the cosine distance ($CD(a, b) = 1 - \cos(a, b)$, where $\cos(a, b) = (a \cdot b)/(\|a\|_2 \|b\|_2)$), and, to ensure that the pre-defined class centers are far from one another (i.e., inter-class dispersion), the class centers are generated using the algorithm proposed by Pang et al. (2018), i.e., the Max-Mahalanobis Distribution (MMD) centers. The generated center vectors are the vertices of a simplex such that the included angles between class center vectors are maximized, i.e., $\{\mu^{(1)}, \ldots, \mu^{(L)}\} = \arg\min_\mu \max_{i \ne j} \mu^{(i)} \cdot \mu^{(j)}$, subject to $\|\mu^{(k)}\|_2 = C$ for all $1 \le k \le L$. The details of the generation algorithm and examples of MMD centers are provided in Appendix B. Accordingly, the proposed loss function for intra-class compactness, i.e., the Feature-To-Center (FTC) loss, is defined as $L_{FTC}(x, y) = CD(\phi(x), \mu^{(y)})$. In the inference phase, an input image $x$ is predicted to belong to class $\hat{y}$ as:

$\hat{y} = f(x) = \arg\min_{1 \le k \le L} CD(\phi(x), \mu^{(k)}) = \arg\max_{1 \le k \le L} \cos(\phi(x), \mu^{(k)})$.   (5)
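The FTC loss and the prediction rule of Eq. 5 can be sketched as follows; the orthonormal centers in the usage example are a simple equal-norm stand-in for the MMD centers of Appendix B.

```python
import numpy as np

def cosine_distance(a, b):
    """CD(a, b) = 1 - cos(a, b)."""
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def ftc_loss(phi_x, centers, y):
    """Feature-To-Center loss: cosine distance from the extracted
    feature phi(x) to the fixed, preset center of the true class y."""
    return cosine_distance(phi_x, centers[y])

def predict(phi_x, centers):
    """Inference (Eq. 5): the class whose center is closest in cosine distance."""
    return int(np.argmin([cosine_distance(phi_x, mu) for mu in centers]))
```

During training only `ftc_loss` is minimized; the centers themselves stay fixed, so no trade-off between compactness and center placement arises.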

3.2. AUXILIARY LOSS FUNCTIONS TO ENHANCE ROBUSTNESS

This subsection introduces two auxiliary loss functions which are proposed to further improve the robustness of MAT.

3.2.1. SECOND-ORDER (SO) LOSS

Inspired by the approach proposed by Yan et al. (2018), this subsection proposes an objective function which minimizes the magnitude of the gradient of the classification loss and thus improves the stability of the classification results. Generally speaking, backpropagating through a gradient value for an arbitrary CNN is difficult and expensive to compute. However, since the computational operations of the FTC loss after the feature layer are fixed, the gradient w.r.t. the features can be computed simply by applying a fixed series of operations in the forward pass. Note that rather than using an ordinary classification loss (e.g., the cross-entropy loss), the FTC loss is used when applying this technique to MAT, since the stability of the FTC loss implies both the stability of the output features and that of the classification performance. The magnitude of the gradient of the FTC loss w.r.t. the features can thus be formulated as the following Second-Order (SO) loss:

$L_{SO}(x, y) = \|\nabla_\phi L_{FTC}\|_2^2 = \frac{1}{\|\phi\|_2^2} \left\| \frac{\mu^{(y)}}{\|\mu^{(y)}\|_2} - \cos(\phi, \mu^{(y)}) \frac{\phi}{\|\phi\|_2} \right\|_2^2$.   (6)

The corresponding derivation is provided below:

$\nabla_\phi L_{FTC} = \nabla_\phi [1 - \cos(\phi, \mu^{(y)})] = -\nabla_\phi \cos(\phi, \mu^{(y)}) = -\nabla_\phi \frac{\phi \cdot \mu^{(y)}}{\|\phi\|_2 \|\mu^{(y)}\|_2} = -\frac{1}{\|\mu^{(y)}\|_2} \nabla_\phi \frac{\phi \cdot \mu^{(y)}}{\|\phi\|_2} = -\frac{1}{\|\mu^{(y)}\|_2} \cdot \frac{\|\phi\|_2 \, \mu^{(y)} - (\phi \cdot \mu^{(y)}) \frac{\phi}{\|\phi\|_2}}{\|\phi\|_2^2} = -\left[ \frac{\mu^{(y)}}{\|\phi\|_2 \|\mu^{(y)}\|_2} - \cos(\phi, \mu^{(y)}) \frac{\phi}{\|\phi\|_2^2} \right] = -\frac{1}{\|\phi\|_2} \left[ \frac{\mu^{(y)}}{\|\mu^{(y)}\|_2} - \cos(\phi, \mu^{(y)}) \frac{\phi}{\|\phi\|_2} \right]$.   (7)
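The closed-form gradient of Eq. 7 and the resulting SO loss can be sketched and checked numerically; this numpy sketch uses a finite-difference comparison to confirm the derivation, with all names illustrative.

```python
import numpy as np

def grad_ftc(phi, mu):
    """Closed-form gradient of L_FTC = 1 - cos(phi, mu) w.r.t. phi (Eq. 7)."""
    n_phi, n_mu = np.linalg.norm(phi), np.linalg.norm(mu)
    cos = (phi @ mu) / (n_phi * n_mu)
    return -(mu / n_mu - cos * phi / n_phi) / n_phi

def so_loss(phi, mu):
    """Second-Order loss (Eq. 6): squared l2-norm of the gradient above."""
    return float(np.sum(grad_ftc(phi, mu) ** 2))
```

When phi is parallel to mu, the cosine is 1 and the gradient vanishes, so the SO loss reaches its minimum of zero exactly where the FTC loss does.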

3.2.2. BOUNDED-INPUT-BOUNDED-OUTPUT (BIBO) LOSS

Conceptually, the adversarial example problem is a consequence of the local instability of CNNs. Specifically, the output of a CNN may change significantly when a small perturbation is added to the input. Intuitively, therefore, if the local instability is alleviated, the adversarial robustness should be correspondingly improved. This subsection thus proposes a Bounded-Input-Bounded-Output (BIBO) loss to minimize the difference in the feature space between samples with small perturbations and clean samples, i.e., $L_{BIBO}(x, y) = \max_{x' \in B_\epsilon(x)} CD(\phi(x), \phi(x'))$. In practice, the maximization is performed using PGD optimization, as also adopted in the second term of the TRADES loss (Eq. 1).
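The BIBO loss can be sketched as follows. In this numpy sketch, random search in the ε-ball stands in for the PGD inner maximization used in the paper, and the identity feature extractor in the usage example is purely illustrative.

```python
import numpy as np

def cosine_distance(a, b):
    return 1.0 - (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def bibo_loss(x, phi, eps, n_samples=64, seed=0):
    """L_BIBO(x) ~= max over x' in the l_inf eps-ball around x of
    CD(phi(x), phi(x')).  Random search replaces PGD in this sketch;
    the true maximum is lower-bounded by the best sampled value."""
    rng = np.random.default_rng(seed)
    f_clean = phi(x)
    worst = 0.0  # x' = x is always in the ball and gives distance 0
    for _ in range(n_samples):
        delta = rng.uniform(-eps, eps, size=x.shape)
        worst = max(worst, cosine_distance(f_clean, phi(x + delta)))
    return worst
```

A feature extractor that is locally stable around x yields a small BIBO loss even for moderately large ε.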

3.2.3. TOTAL LOSS OF MAT

The total loss of MAT is defined as $L_{MAT} = L_{FTC} + \alpha \cdot L_{SO} + \beta \cdot L_{BIBO}$. The SO loss ensures the local stability of the feature distance to the class center ($L_{FTC}$) w.r.t. the features, while the BIBO loss ensures the local stability of the features w.r.t. the input image. Together, the total loss stabilizes the feature distance to the class center when the input is subject to small perturbations.

3.3. MAT TRANSFORMATION

The MAT-trained models described above produce features rather than class scores (logits), as traditional CNNs do, and this difference may make the former incompatible with existing theories or techniques designed for traditional CNNs. For example, the objective functions of most adversarial attacks are based on logits, which MAT-trained CNNs do not produce. However, MAT models can be transformed to be architecturally identical to traditional CNNs by using the cosine similarities to the class centers as the logits. Additionally, since the $l_2$-norms of the center vectors are all equal, the computation of the scores can be simplified to computing inner products, as proved below.

Proof. Let $\mu^{(i)}$ and $\mu^{(j)}$ be the center vectors of two different classes, with $\|\mu^{(i)}\|_2 = \|\mu^{(j)}\|_2$. We can derive:

$\cos(\phi, \mu^{(i)}) \ge \cos(\phi, \mu^{(j)}) \iff \frac{\phi \cdot \mu^{(i)}}{\|\phi\|_2 \|\mu^{(i)}\|_2} \ge \frac{\phi \cdot \mu^{(j)}}{\|\phi\|_2 \|\mu^{(j)}\|_2} \iff \phi \cdot \mu^{(i)} \ge \phi \cdot \mu^{(j)}$.   (10)

Note that the simplified logit computation can be achieved by adding a linear layer to the back end of the CNN, with the weights set to the class centers and the biases set to zero ($h = W\phi$, $W = [\mu^{(1)}, \mu^{(2)}, \ldots, \mu^{(L)}]^T$, where $h$ denotes the logits).
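The transformation can be sketched as follows: append a linear layer whose weight rows are the class centers and whose bias is zero, so the logits are the inner products with the centers. With equal-norm centers, the argmax over these logits coincides with the argmax over cosine similarities, which the test below checks on random inputs; all names are illustrative.

```python
import numpy as np

def mat_to_logit_weights(centers):
    """Weights of the appended linear layer: W = [mu_1, ..., mu_L]^T,
    bias zero, so the logits are h = W @ phi."""
    return np.stack(centers)

def predict_from_logits(phi, W):
    return int(np.argmax(W @ phi))

def predict_from_cosine(phi, centers):
    sims = [(phi @ mu) / (np.linalg.norm(phi) * np.linalg.norm(mu))
            for mu in centers]
    return int(np.argmax(sims))
```

After this transformation, logit-based attack objectives (and other tools built for traditional CNNs) apply to MAT models unchanged.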

4. EXPERIMENTAL RESULTS

This section evaluates the robustness of the proposed MAT algorithm. First, the assumptions made in the threat model are described in Subsection 4.1. In Subsections 4.2 and 4.3, the defense capability of our method is evaluated and compared with the state-of-the-art methods TRADES (Zhang et al., 2019) and MMC (Pang et al., 2020), respectively, along with verification against adaptive attacks in Subsection 4.4. Finally, we examine the effectiveness of feature compactness in Subsection 4.5.

4.1. THREAT MODEL

According to Carlini et al. (2019), defining threat models is essential to achieve fair and reliable evaluation and comparison. Since the motivation of our work is to test worst-case robustness, the goal of the adversary is assumed to be that of causing misclassification errors, i.e., non-targeted adversaries. In pursuing this goal, the adversary is further assumed to have full knowledge of the victim model, including the architecture, parameters, and datasets used. In other words, white-box attacks are considered in this paper. For bounded attacks, the perturbation is constrained to have bounded $l_p$-norm; for unbounded attacks, the mean $l_p$-norm of the perturbation, rather than the robust accuracy, is adopted for evaluation.

4.2. ROBUSTNESS OF MAT-TRAINED CNNS

This subsection compares the robustness of our proposed MAT with that of existing mechanisms. TRADES (Zhang et al., 2019) was selected as the state-of-the-art method because the authors released their code and trained models, which makes the experiments easy to reproduce; we note that both MMA (Ding et al., 2020) and MART (Wang et al., 2020) only yield performance comparable with TRADES. The basic adversarial training (AT) approach, which minimizes the cross-entropy loss for adversarial examples rather than clean samples, was chosen as a baseline. Furthermore, to perform an ablation study, all variants of MAT with various combinations of loss functions were empirically compared to explore their capabilities in defending against attacks. The experiments use two datasets, namely MNIST and CIFAR10 (as in Zhang et al. (2019)), associated with VGG Net (Simonyan & Zisserman, 2015) and wide ResNet (Zagoruyko & Komodakis, 2016), respectively. The foolbox library (Rauber et al., 2017) was utilized to construct the adversarial examples. Three common white-box attack methods were used for evaluation purposes: PGD (Madry et al., 2018), C&W (Carlini & Wagner, 2017), and DDN (Rony et al., 2019), where C&W and DDN are unbounded attacks. The parameters of these attacks are given in Appendix C. Before applying adversarial attacks to MAT-trained models, the models are transformed into the traditional architecture using the approach described in Subsection 3.3. The robust accuracy under different attacks and the clean accuracy (the accuracy on clean test data) are shown in Table 2. We can observe that the MAT-trained CNN (with all loss functions applied) exhibits significantly higher robustness than TRADES.

Figure 2: Robust accuracy vs. PGD with different perturbation strengths ($\epsilon$). Note that here MAT employs the total loss function.

4.3. COMPARISON WITH MMC

As a similar approach to MAT, MMC (Pang et al., 2020) was chosen as another baseline. To make a fair comparison, the attack settings in MMC were used. The classification accuracies for natural training (Natural), AT, and MMC were taken directly from Pang et al. (2020). Note that the model architecture for MAT was wide ResNet (Zagoruyko & Komodakis, 2016), while the one chosen in MMC was ResNet32. As shown in Table 3, the MAT model yields remarkably higher robustness than the MMC model, despite the underlying difference in model architectures.

4.4. ADAPTIVE ATTACKS

Carlini et al. (2019) stated that performing robustness evaluation against adaptive attacks, i.e., attacks whose objective functions take the defense details into account, is essential to prevent a false sense of robustness. Adaptive PGD attacks for MAT-trained models can be realized by directly maximizing the training loss, including the regularization terms (the SO loss and the BIBO loss). Note that the BIBO loss is itself the result of a maximization (using PGD); conducting attacks by optimizing an objective function incorporating such a maximized result incurs quadratic time complexity, which may be unrealistic. Nonetheless, the inner maximization can be simplified to the inner cosine-distance function, since the outer optimization is also a maximization process. The same technique can also be applied to the TRADES loss to conduct adaptive attacks on TRADES models. In the experiments, the threat model of this adaptive attack is the same as that described in Subsection 4.1. The robust accuracy of the models under this attack is shown in Table 4. The best performance is still comparable with the state-of-the-art performance of TRADES. Note that the adversarial training works (Zhang et al., 2019; Ding et al., 2020; Wang et al., 2020) did not consider adaptive attack evaluations, while an adaptive attack method for TRADES (Zhang et al., 2019) is proposed in our work, and a similar approach can intuitively be used for MMA (Ding et al., 2020) and MART (Wang et al., 2020) to construct adaptive adversarial examples. Additionally, Tramèr et al. (2020) have managed to conduct adaptive attacks on MMC (Pang et al., 2020) models and reduce the robust accuracy to under 0.5% by using the MMC loss as the objective function of PGD, which is the same as our strategy for MAT and TRADES.

4.5. FEATURE CLUSTERING PERFORMANCE BY MAT

The effectiveness of MAT in performing feature clustering was examined by evaluating the cosine similarity (mean ± standard deviation) to the class center of the predicted class. For each class, the mean and the standard deviation were computed to quantify the level of clustering within a class and across different classes, respectively. As shown in Table 5, the models with lower cosine similarity on clean data have lower clean accuracy (Table 2) than the other models. However, the difference between their adversarial (PGD) similarity and their clean similarity is higher. Therefore, the cosine similarity of clean data and the difference between the adversarial and clean similarities can be taken as indicators of the clean accuracy and the robust accuracy, respectively.

5. CONCLUSION

To defend against adversarial attacks, this work analyzes the feature distribution of traditionally trained CNNs to gain more knowledge about adversarial examples. Two properties of the feature distribution, i.e., the non-clustering property and the confusing-distance property, are identified by means of t-SNE visualization and clustering analysis (showing the limitations regarding representativeness). By exploiting feature compactness, a novel training process for improving model robustness, designated as MAT, is proposed, in which the FTC loss induces intra-class feature compactness and inter-class feature dispersion, while two auxiliary loss functions (the SO loss and the BIBO loss) are introduced. The experimental results show that the MAT-trained model with all loss functions applied exhibits significantly higher robustness than the model trained with the state-of-the-art method (TRADES). Additionally, the effectiveness of feature clustering and the defense against adaptive attacks are also evaluated. Our results reveal that the compactness of the clean features provides a useful indication of the clean accuracy, while the distance between the clean compactness and the adversarial compactness provides an indication of the robust accuracy. Overall, this study gives useful insight into the relation between feature compactness and robustness and provides a novel training process that improves the adversarial robustness of CNNs over current state-of-the-art methods.



Footnote: The second term in Eq. 1 is a BIBO-like loss, where the distance is measured by the Kullback-Leibler divergence rather than the cosine distance.



Figure 1: Features learned by models trained for MNIST using TRADES (Zhang et al., 2019) training, where the clean samples (points) are colored by predicted classes and the adversarial examples (triangles) are colored by (a) true labels or (b) predicted classes.

Moreover, as Carlini et al. (2019) stated, for bounded attacks, the accuracy-versus-perturbation-strength curve is more indicative of the robustness than the accuracy at a single perturbation strength. Therefore, the related curves are presented in Figure 2 (the detailed values are presented in Appendix D).

Table 1: Match rates of clean data and adversarial examples, which were generated using the PGD and Decoupled Direction and Norm (DDN) (Rony et al., 2019) attacks, in naturally trained models (-N) and TRADES-trained models (-T).

Table 2: Robustness comparison. For Clean and PGD, the accuracy is reported. For C&W and DDN, the average perturbation norm of the adversarial examples is reported.

Table 3: Classification accuracy (%) comparison with MMC (Pang et al., 2020) for CIFAR10. The superscripts un and tar denote untargeted and targeted PGD attacks, respectively, and the subscripts of PGD denote the number of iterations used to conduct the attacks. Note that ≤ 1 denotes robust accuracy under 1%.

Table 4: Robust accuracy under the adaptive attack.

