MANIFOLD-AWARE TRAINING: INCREASING ADVERSARIAL ROBUSTNESS WITH FEATURE CLUSTERING

Abstract

The problem of defending against adversarial attacks has attracted increasing attention in recent years. While various types of defense methods (e.g., adversarial training, detection and rejection, and recovery) have been shown empirically to improve network robustness, their weaknesses were exposed by later works. Motivated by observations of the distribution properties of the features extracted by CNNs in the feature space, and their link to robustness, this work designs a novel training process, called Manifold-Aware Training (MAT), which forces CNNs to learn compact features in order to increase robustness. The effectiveness of the proposed method is evaluated via comparisons with existing defense mechanisms, i.e., the TRADES algorithm, which is recognized as a representative state-of-the-art technique, and the MMC method, which also aims to learn compact features. Further verification is conducted using an attack adapted to our method. Experimental results show that MAT-trained CNNs exhibit significantly higher robustness than state-of-the-art models.

1. INTRODUCTION

1.1 BACKGROUND

Convolutional neural networks (CNNs) have seen increasing use in recent years due to their high adaptivity and flexibility. However, Szegedy et al. (2014) discovered that by maximizing the loss of a CNN model w.r.t. the input data, one can find a small and imperceptible perturbation which causes misclassification errors of the CNN. The corrupted data (with perturbation) are referred to as adversarial examples, and Goodfellow et al. (2015) subsequently proposed the Fast Gradient Sign Method (FGSM), a simple one-step procedure for constructing such perturbations. Since that time, many algorithms for constructing such perturbations have been proposed, where these algorithms are referred to generally as adversarial attack methods (e.g., Madry et al. (2018); Carlini & Wagner (2017); Rony et al. (2019); Brendel et al. (2018); Chen et al. (2017); Alzantot et al. (2019); Ru et al. (2020); Al-Dujaili & O'Reilly (2020)). Among them, the Projected Gradient Descent (PGD) (Madry et al., 2018) and Carlini & Wagner (C&W) (Carlini & Wagner, 2017) attacks are the most widely used. Specifically, PGD is a multi-step variant of FGSM which exhibits a higher attack success rate, while the C&W attack leverages an objective function designed to jointly minimize the perturbation norm and the likelihood of the input being correctly classified.

The existence of adversarial examples implies an underlying inconsistency between the decision-making processes of CNN models and humans, and can be catastrophic in life-and-death applications, such as automated vehicles or medical diagnosis systems, in which unpredictable noise may cause the CNN to misclassify its inputs. Various countermeasures for thwarting adversarial attacks, known as adversarial defenses, have been proposed. One of the most common approaches is to augment the training dataset with adversarial examples so as to increase the generalization ability of the CNN toward these patterns. Such a technique is known as adversarial training (Shaham et al., 2015; Zhang et al., 2019; Wang et al., 2020).
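The FGSM and PGD attacks described above can be sketched in a few lines. The snippet below is a minimal illustration on a toy logistic-regression "model" (a stand-in for a CNN, chosen so the input gradient has a closed form); the weights, step sizes, and epsilon are illustrative, not values from the paper.

```python
import numpy as np

# Toy differentiable model: p(y=1|x) = sigmoid(w.x + b), so the gradient of
# the cross-entropy loss w.r.t. the input x is available in closed form.
w = np.array([1.5, -2.0, 0.5])
b = 0.1

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_grad_x(x, y):
    """Gradient of the cross-entropy loss w.r.t. the input x."""
    p = sigmoid(w @ x + b)
    return (p - y) * w

def fgsm(x, y, eps):
    # One step in the sign of the input gradient.
    return x + eps * np.sign(loss_grad_x(x, y))

def pgd(x, y, eps, alpha=0.05, steps=10):
    # Multi-step FGSM, projecting back onto the L-inf ball of radius eps
    # around the clean input after every step.
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(loss_grad_x(x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # projection step
    return x_adv

x, y = np.array([0.2, 0.4, -0.1]), 1
x_adv = pgd(x, y, eps=0.1)
# The perturbation stays within the eps-ball while the confidence in the
# true label y=1 decreases.
```

The projection (`np.clip`) is what distinguishes PGD from simply iterating FGSM: it keeps the accumulated perturbation within the allowed L-infinity budget.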
However, while such methods can achieve state-of-the-art robustness in terms of robust accuracy (i.e., the accuracy on adversarial examples), training robust classifiers is a non-trivial task. For example, Nakkiran (2019) showed that robust training requires significantly higher model capacity, while Schmidt et al. (2018) proved that it requires significantly more data instances than natural training.

1.2 MOTIVATION AND CONTRIBUTIONS

The study commences by observing the distribution properties of the features learned by existing training methods in order to obtain a better understanding of the adversarial example problem. In particular, the t-SNE dimension reduction method (van der Maaten & Hinton, 2008) is used to visualize the extracted features, as illustrated in Figure 1. The visualization reveals the following feature distribution properties:

• Non-clustering: same-class features are not always clustered together (i.e., some points leave the clusters of their respective colors in Figure 1), which is at odds with the intuition that the representative (for classification) features of same-class samples should be similar to one another.

• Confusing-distance: closeness between samples in the feature space does not imply resemblance of their predictions (especially for adversarial examples, as there are many triangles colored differently than the surrounding points in Figure 1(b)).

We confirm the validity of these observations through a numerical analysis of the match rate between the dominant class of the closest cluster and the predicted class. Clustering analysis algorithms (e.g., Ward's hierarchical clustering algorithm (Ward, 1963)) are leveraged to find the clusters formed by the CNN-learned features. The match rate is defined as $\mathbb{E}_x[\mathbf{1}_{\mathrm{dom}(C(x)) = f(x)}]$, where $C(x)$ is the closest cluster to $x$ ($C(x) = \arg\min_C \mathrm{dist}(C, x)$) and $\mathrm{dom}(C)$ evaluates the dominant class of a cluster ($\mathrm{dom}(C) = \arg\max_i m(C)_i$, where $m(C) \in \mathbb{N}^L$ is a cluster mapping vector counting the number of members holding each class prediction). Table 1 summarizes the analysis results and confirms that both properties exist. Intuitively, a good feature extractor for classification purposes should produce similar features for all samples within the same class.

On the other hand, according to Tang et al. (2019), the existence of adversarial examples results from a mismatch between the features used by humans and those used by CNNs. Therefore, one intuitive approach for increasing CNN robustness is simply to drive CNN-learned features toward human-used features. However, it is impossible to understand and predict human-used features with any absolute degree of certainty. Thus, an alternative approach is to force the CNN-learned features to have some expected properties that human-used features should also have; for example, as mentioned above, features for the classification of objects belonging to the same class should be similar to one another.

Based on the above observations, the present study proposes a novel training process, designated as Manifold-Aware Training (MAT), for learning features which are both representative and compact. The experimental results confirm that models trained with MAT exhibit significantly higher robustness than existing state-of-the-art models. As will become clear later, our idea is, in some sense, similar to that of Pang et al. (2020), who proposed the Max-Mahalanobis Center (MMC) loss, which minimizes the distance of features to their assigned preset class centers. By showing that the robustness of a model using MMC with adversarial training is higher than that using adversarial training alone, the authors claimed that high feature compactness results in locally sufficient
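The match-rate metric above can be computed directly once clusters are available. The sketch below uses hand-picked centroids and member lists purely for illustration (the paper obtains clusters via Ward's hierarchical clustering, not shown here); `match_rate` and its arguments are hypothetical names.

```python
import numpy as np

def match_rate(features, preds, centroids, member_preds):
    """Fraction of samples whose prediction f(x) equals the dominant
    predicted class dom(C(x)) of their nearest cluster.

    features: (n, d) CNN feature vectors
    preds: (n,) predicted classes f(x)
    centroids: (k, d) cluster centroids
    member_preds: list of k arrays holding each cluster's member predictions
    """
    # dom(C): dominant predicted class of each cluster, arg max_i m(C)_i.
    dom = np.array([np.bincount(m).argmax() for m in member_preds])
    # C(x): nearest cluster by Euclidean distance.
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=2)
    nearest = dists.argmin(axis=1)
    # E_x[ 1{ dom(C(x)) = f(x) } ] estimated as an empirical mean.
    return np.mean(dom[nearest] == preds)

# Tiny synthetic check: two well-separated clusters whose members all share
# the prediction of their own cluster, so the match rate should be perfect.
feats = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.1, 5.0]])
preds = np.array([0, 0, 1, 1])
cents = np.array([[0.05, 0.0], [5.05, 5.0]])
members = [np.array([0, 0]), np.array([1, 1])]
print(match_rate(feats, preds, cents, members))  # → 1.0
```

A low match rate on real features indicates exactly the confusing-distance property: the nearest cluster in feature space does not predict the model's output.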




Figure 1: Features learned by models trained for MNIST using TRADES (Zhang et al., 2019) training, where the clean samples (points) are colored by predicted classes and the adversarial examples (triangles) are colored by (a) true labels or (b) predicted classes.

