BOOSTING ADVERSARIAL TRAINING WITH MASKED ADAPTIVE ENSEMBLE

Abstract

Adversarial training (AT) can improve the robustness of a deep neural network (DNN) against potential adversarial attacks by intentionally injecting adversarial examples into the training data, but doing so inevitably degrades standard accuracy to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, prominent AT solutions remain vulnerable to sparse attacks, due to "robustness overfitting" on dense attacks, which AT often adopts to produce the threat model. To tackle such shortcomings, this paper proposes a novel framework, including a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, a Guided Backpropagation-based detector is designed to sniff adversarial examples, driven by our empirical observation. Meanwhile, a classifier with two encoders is employed to extract visual representations respectively from clean images and adversarial examples. The adaptive ensemble approach also enables us to mask off a random subset of image patches within the input data, eliminating potential adversarial effects when encountering malicious inputs, with negligible standard accuracy degradation. As such, our approach enjoys improved robustness, able to withstand both dense and sparse attacks, while maintaining high standard accuracy. Experimental results show that our detector and classifier outperform their state-of-the-art counterparts in terms of detection accuracy, standard accuracy, and adversarial robustness. For example, on CIFAR-10, our detector achieves the best detection accuracy of 99.6% under dense attacks and of 98.5% under sparse attacks. Our classifier achieves the best standard accuracy of 91.2% and the best robustness against the dense attack (or sparse attack) of 57.5% (or 54.8%).

1. INTRODUCTION

Deep neural networks (DNNs) have been reported to be vulnerable to adversarial attacks. That is, maliciously crafting clean images under a small distance can mislead DNNs into incorrect predictions. Such vulnerability prevents DNNs' wide adoption in critical domains, such as healthcare, autonomous driving, and finance, among many others. In a nutshell, adversarial attacks can be roughly grouped into two categories, i.e., the dense attack and the sparse attack. The former (e.g., Goodfellow et al. (2015); Moosavi-Dezfooli et al. (2016); Madry et al. (2018); Croce & Hein (2020); Yao et al. (2021)) tends to perturb almost all pixels of the clean image, whereas the latter (e.g., Papernot et al. (2016); Carlini & Wagner (2017); Modas et al. (2019); Dong et al. (2020); Pintor et al. (2021); Zhu et al. (2021)) modifies only a limited number of pixels to fool the DNN models. So far, adversarial training (AT) is widely accepted as the most effective method to improve DNNs' robustness against adversarial attacks, by intentionally injecting adversarial examples into the training data. In particular, multi-step ATs Madry et al. (2018); Zhang et al. (2019); Jia et al. (2022) perform multi-step dense attacks (e.g., the PGD attack) to find worst-case adversarial examples for training, achieving state-of-the-art robustness but incurring significant computational overhead. On the other hand, by using a single-step dense attack (e.g., the FGSM attack), one-step ATs Wong et al. (2020); Andriushchenko & Flammarion (2020); Kim et al. (2021); Li et al. (2022); Wang et al. (2022) can significantly reduce the computational overhead while achieving decent robustness under dense attacks.

Despite their effectiveness, existing ATs suffer from two shortcomings: i) a trade-off between standard accuracy (i.e., the accuracy on clean images) and adversarial robustness (i.e., the accuracy on adversarial examples), with improved robustness yielding non-negligible standard accuracy degradation, and ii) robustness overfitting on dense attacks, which leaves the improved robustness vulnerable to sparse attacks. One promising direction to address the trade-off between standard accuracy and adversarial robustness is via a detection/rejection mechanism, that is, training an additional detector to reject malicious input data, with various detection techniques proposed Roth et al. (2019); Ma & Liu (2019); Yin et al. (2020); Raghuram et al. (2021); Tramèr (2022). Unfortunately, the detection/rejection mechanism is still ineffective in defending against sparse attacks, as sparse attacks only perturb a limited number of pixels. Even worse, the detection/rejection mechanism can be applied merely to a limited number of scenarios. For example, it cannot be generalized to application domains where natural adversarial examples exist, as reported in a recent study Hendrycks et al. (2021). In this work, we consider robustness under a more general and challenging scenario (than that addressed earlier by the detection/rejection mechanism), where malicious inputs are not allowed to be rejected.

Note that a robust model under such a scenario is crucial for applying DNNs to critical domains. For example, an autonomous driving car is expected to recognize a road sign even if it has been maliciously crafted. Our goal is to develop a novel framework, including a detector and a classifier, to boost adversarial training for improving DNNs' robustness against both dense and sparse attacks at a small expense of standard accuracy degradation. Specifically, our framework is adversarially trained by using one-step least-likely adversarial training, adopted from Fast Adversarial Training Wong et al. (2020) with slight modification (see Section A.2 in Appendix for details). We incorporate two new designs in our detector to make adversarial examples more noticeable. First, we resort to Guided Backpropagation Springenberg et al. (2015) to expose adversarial perturbations, driven by our empirical observations. Second, the Soft-Nearest Neighbors Loss (SNN Loss) Salakhutdinov & Hinton (2007); Frosst et al. (2019) is tailored to push adversarial examples away from their corresponding clean images. As such, our detector is effective in sniffing both dense attack-generated and sparse attack-generated adversarial examples. Our classifier includes two encoders for extracting visual representations respectively from clean images and adversarial examples, aiming to alleviate the negative effect of adversarial training on standard accuracy. We separate the training process into "pre-training" and "fine-tuning", for representation learning and classification, respectively. In pre-training, our goal is to jointly learn high-quality representations and encourage pairwise similarity between a clean image and its adversarial example. Specifically, we extend Masked Autoencoders (MAE) He et al. (2022), i.e., learning visual representations by reconstructing masked images, for adversarial training via a new design.

That is, we reconstruct images from a pair of a masked clean image and a masked adversarial example for representation learning, with a contrastive loss on visual representations to encourage pairwise similarity. In the fine-tuning for classification, we freeze the weights of the two encoders and fine-tune an MLP (multi-layer perceptron) for accurate classification, using our proposed adaptive ensemble to bridge the detector and the classifier. Meanwhile, our adaptive ensemble allows us to mask off an arbitrary subset of image patches within the input, enabling our approach to mitigate potential adversarial effects when encountering malicious inputs, with negligible standard accuracy degradation. Extensive experiments have been carried out on three popular benchmarks, with the results demonstrating that our solutions outperform state-of-the-art detection and adversarial training techniques in terms of detection accuracy, standard accuracy, and robustness.
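The single-step dense attack discussed above can be illustrated with a minimal sketch. Assuming a toy differentiable model (here a hypothetical linear softmax classifier in NumPy, not the paper's architecture), FGSM moves every pixel by the sign of the loss gradient, which is why it is categorized as a dense attack:

```python
import numpy as np

def fgsm_example(x, y, W, eps=8 / 255):
    """One-step FGSM on a linear softmax model: x' = clip(x + eps * sign(dL/dx)).

    x: flattened image in [0, 1]; y: true label; W: weight matrix (classes x pixels).
    A toy stand-in for a DNN -- for a linear model, the gradient of the
    cross-entropy loss w.r.t. the input is W^T (softmax(Wx) - onehot(y)).
    """
    logits = W @ x
    p = np.exp(logits - logits.max())
    p /= p.sum()
    p[y] -= 1.0                        # softmax - onehot: dL/dlogits
    grad_x = W.T @ p                   # chain rule back to the input
    x_adv = x + eps * np.sign(grad_x)  # dense: every pixel moves by +/- eps
    return np.clip(x_adv, 0.0, 1.0)

# Toy usage: 2 classes, 4 "pixels"
rng = np.random.default_rng(0)
W = rng.normal(size=(2, 4))
x = rng.uniform(size=4)
x_adv = fgsm_example(x, y=0, W=W)
print(np.max(np.abs(x_adv - x)) <= 8 / 255 + 1e-9)  # perturbation stays within the budget
```

A sparse attack would instead concentrate its budget on a few coordinates (e.g., only the largest-gradient pixels), leaving the rest of the image untouched.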

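The random patch masking shared by the MAE-style pre-training and the adaptive ensemble can be sketched as follows; the patch size and mask ratio here are illustrative choices, not the paper's hyperparameters. Masking drops any adversarial perturbation that falls inside the masked patches, at the cost of discarding some clean signal as well:

```python
import numpy as np

def mask_random_patches(img, patch=4, ratio=0.5, rng=None):
    """Zero out a random subset of non-overlapping (patch x patch) squares.

    img: 2-D array whose sides are divisible by `patch`;
    ratio: fraction of patches to mask off.
    """
    rng = np.random.default_rng(rng)
    h, w = img.shape
    assert h % patch == 0 and w % patch == 0
    n_h, n_w = h // patch, w // patch
    n_masked = int(round(ratio * n_h * n_w))
    idx = rng.choice(n_h * n_w, size=n_masked, replace=False)
    out = img.copy()
    for i in idx:
        r, c = divmod(i, n_w)          # patch-grid coordinates of patch i
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out

masked = mask_random_patches(np.ones((8, 8)), patch=4, ratio=0.5, rng=0)
print(masked.sum())  # 2 of the 4 patches survive: 2 * 16 = 32.0
```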
2. RELATED WORK

Our work closely relates to two research scopes, i.e., detection/rejection mechanisms and adversarial training approaches. This section discusses how our work relates to, and differs from, prior studies.

Detection Mechanisms. Detecting adversarial examples (AEs) and then rejecting them (i.e., the detection/rejection mechanism) can improve model robustness: the input is rejected whenever the detector classifies it as an adversarial example. Popular detection techniques include Odds Roth et al. (2019), which considers the difference between clean images and AEs in terms of log-odds; NIC Ma & Liu (2019), which checks channel invariants within DNNs; GAT Yin et al. (2020), which resorts to multiple binary classifiers; and JTLA Raghuram et al. (2021), which proposes a detection framework employing internal layer representations, among many others Lee et al. (2018); Yang et al. (2020); Sheikholeslami et al. (2021). Unfortunately, existing detection methods are typically ineffective in sniffing sparse attack-generated AEs, which modify only a limited number of pixels.
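The rejection scheme these works share can be sketched as a small wrapper: a detector scores the input, and the classifier runs only when the score stays below a threshold. The detector and classifier below are hypothetical stand-ins (the scoring rule and threshold are illustrative, not taken from any of the cited methods):

```python
import numpy as np

REJECT = -1  # sentinel returned when the detector flags the input

def classify_with_rejection(x, detector_score, classifier, threshold=0.5):
    """Detection/rejection wrapper: refuse to answer on suspected AEs.

    detector_score: callable mapping an input to a scalar in [0, 1], where
    higher means "more likely adversarial" (stand-in for Odds/NIC/GAT/JTLA).
    classifier: callable returning a class label for accepted inputs.
    """
    if detector_score(x) >= threshold:
        return REJECT                  # suspected adversarial example
    return classifier(x)

# Toy usage: flag inputs with large high-frequency energy as "adversarial"
score = lambda x: float(np.mean(np.abs(np.diff(x))) > 0.3)
clf = lambda x: int(np.sum(x) > 2.0)

smooth = np.linspace(0.0, 1.0, 8)     # low detector score -> classified
noisy = np.array([0.0, 1.0] * 4)      # high detector score -> rejected
print(classify_with_rejection(smooth, score, clf),
      classify_with_rejection(noisy, score, clf))
```

Note how this wrapper embodies the limitation discussed above: a sparse attack perturbing only a handful of pixels barely shifts most detector scores, so the input is accepted and passed to the (fooled) classifier.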

