INEQUALITY PHENOMENON IN l∞-ADVERSARIAL TRAINING, AND ITS UNREALIZED THREATS

Abstract

The appearance of adversarial examples has drawn attention from both academia and industry. Throughout the attack-defense arms race, adversarial training has proven the most effective defense against adversarial examples. However, we find that inequality phenomena occur during l∞-adversarial training: a few features dominate the prediction made by the adversarially trained model. We systematically evaluate these inequality phenomena through extensive experiments and find that they become more pronounced as adversarial training is performed with increasing adversarial strength (measured by ϵ). We hypothesize that such inequality phenomena make the l∞-adversarially trained model less reliable than the standard trained model when the few important features are influenced. To validate our hypothesis, we propose two simple attacks that perturb important features with either noise or occlusion. Experiments show that the l∞-adversarially trained model can be easily attacked when a few important features are influenced. Our work sheds light on the limited practicality of l∞-adversarial training.

1. INTRODUCTION

The discovery of adversarial examples in deep neural networks (DNNs) poses significant threats to deep learning-based applications such as autonomous driving and face recognition. Before DNN-based applications can be deployed safely and securely in real-world scenarios, we must defend against adversarial examples. Since the emergence of adversarial examples, several defensive strategies have been proposed (Guo et al., 2018; Prakash et al., 2018; Mummadi et al., 2019; Akhtar et al., 2018). By retraining on adversarial samples generated in each training loop, adversarial training (Goodfellow et al., 2015; Zhang et al., 2019; Madry et al., 2018b) is regarded as the most effective defense against adversarial attacks. The most prevalent form is l∞-adversarial training, which trains on adversarial samples whose perturbations are l∞-bounded by ϵ.

Numerous works have been devoted to theoretical and empirical comprehension of adversarial training (Andriushchenko & Flammarion, 2020; Allen-Zhu & Li, 2022; Kim et al., 2021). For example, Ilyas et al. (2019) proposed that an adversarially trained model (robust model for short) learns robust features from adversarial examples and discards non-robust ones. Engstrom et al. (2019) also proposed that adversarial training forces the model to be invariant to features to which humans are also invariant. Therefore, adversarial training yields feature representations of robust models that are more comparable to humans'. As theoretically validated by Chalasani et al. (2020), l∞-adversarial training suppresses the significance of redundant features, and the robust model therefore has sparser and better-behaved feature representations than the standard trained model. In general, previous research indicates that robust models have a sparse representation of features and views such sparse representation as advantageous because it is more human-aligned. Several works investigate this property of robust models and attempt to transfer such feature representation to standard trained models using various methods (Ross & Doshi-Velez, 2018; Salman et al., 2020; Deng et al., 2021).
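To make the l∞-bounded training scheme concrete, the following is a minimal sketch of one adversarial-training step, using a single-step FGSM-style attack on a logistic-regression model for brevity; the function names and the toy model are illustrative assumptions, not the method of any cited work (practical l∞-adversarial training uses multi-step PGD on deep networks).

```python
import numpy as np

def fgsm_linfty(x, grad, eps):
    """One-step l_infty attack: shift each input coordinate by eps in the
    sign of the loss gradient, so the perturbation is l_infty-bounded by eps,
    then clip back to the valid input range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def adv_train_step(w, x, y, eps, lr):
    """One adversarial-training step for logistic regression: craft an
    l_infty-bounded adversarial example from the clean input, then take a
    gradient-descent step on the loss at that adversarial example."""
    p = 1.0 / (1.0 + np.exp(-x @ w))       # model prediction on clean input
    grad_x = (p - y) * w                   # d(logistic loss)/dx
    x_adv = fgsm_linfty(x, grad_x, eps)    # inner maximization (one step)
    p_adv = 1.0 / (1.0 + np.exp(-x_adv @ w))
    grad_w = (p_adv - y) * x_adv           # d(logistic loss)/dw at x_adv
    return w - lr * grad_w                 # outer minimization step
```

The key invariant is that `x_adv` never deviates from `x` by more than ϵ in any coordinate, which is exactly the l∞ constraint that defines this family of adversarial training.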

