BOOSTING ADVERSARIAL TRAINING WITH MASKED ADAPTIVE ENSEMBLE

Abstract

Adversarial training (AT) can improve the robustness of a deep neural network (DNN) against potential adversarial attacks by intentionally injecting adversarial examples into the training data, but doing so inevitably degrades standard accuracy to some extent, thereby calling for a trade-off between standard accuracy and robustness. Besides, prominent AT solutions remain vulnerable to sparse attacks, due to "robustness overfitting" to the dense attacks often adopted by AT to produce the threat model. To tackle these shortcomings, this paper proposes a novel framework comprising a detector and a classifier bridged by our newly developed adaptive ensemble. Specifically, a Guided Backpropagation-based detector, motivated by our empirical observation, is designed to sniff out adversarial examples. Meanwhile, a classifier with two encoders is employed to extract visual representations from clean images and adversarial examples, respectively. The adaptive ensemble approach also enables us to mask off a random subset of image patches within the input, eliminating potential adversarial effects when encountering malicious inputs, with negligible standard accuracy degradation. As such, our approach enjoys improved robustness, able to withstand both dense and sparse attacks, while maintaining high standard accuracy. Experimental results show that our detector and classifier outperform their state-of-the-art counterparts in terms of detection accuracy, standard accuracy, and adversarial robustness. For example, on CIFAR-10, our detector achieves the best detection accuracy of 99.6% under dense attacks and 98.5% under sparse attacks, while our classifier achieves the best standard accuracy of 91.2% and the best robustness of 57.5% against dense attacks (54.8% against sparse attacks).
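To make the patch-masking mechanism concrete, the following is a minimal NumPy sketch of masking a random subset of non-overlapping image patches; the function name, patch size, and masking ratio are illustrative assumptions, not our actual configuration:

```python
import numpy as np

def mask_random_patches(image, patch_size=4, mask_ratio=0.3, seed=None):
    """Zero out a random subset of non-overlapping patches in an image.

    image: (H, W, C) array, with H and W divisible by patch_size.
    mask_ratio: fraction of patches to mask (hypothetical default).
    """
    rng = np.random.default_rng(seed)
    h, w, _ = image.shape
    gh, gw = h // patch_size, w // patch_size
    n_patches = gh * gw
    n_masked = int(round(mask_ratio * n_patches))
    masked = image.copy()
    # Pick which patches to mask, uniformly at random without replacement.
    for idx in rng.choice(n_patches, size=n_masked, replace=False):
        r, c = divmod(int(idx), gw)
        masked[r * patch_size:(r + 1) * patch_size,
               c * patch_size:(c + 1) * patch_size, :] = 0.0
    return masked
```

Because a sparse attack concentrates its perturbation on few pixels, randomly dropping patches has a chance of removing the perturbed region entirely, while a classifier trained with such masking loses little accuracy on clean inputs.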

1. INTRODUCTION

Deep neural networks (DNNs) have been reported to be vulnerable to adversarial attacks. That is, maliciously crafting clean images under a small distance budget can mislead DNNs into incorrect predictions. Such vulnerability prevents DNNs' wide adoption in critical domains, such as healthcare, autonomous driving, and finance, among many others. In a nutshell, adversarial attacks can be roughly grouped into two categories, i.e., the dense attack and the sparse attack. The former (e.g., Goodfellow et al. (2015); Moosavi-Dezfooli et al. (2016); Madry et al. (2018); Croce & Hein (2020); Yao et al. (2021)) tends to perturb almost all pixels of the clean image, whereas the latter (e.g., Papernot et al. (2016); Carlini & Wagner (2017); Modas et al. (2019); Dong et al. (2020); Pintor et al. (2021); Zhu et al. (2021)) modifies only a limited number of pixels to fool the DNN models. So far, adversarial training (AT) is widely accepted as the most effective method to improve DNNs' robustness against adversarial attacks, by intentionally injecting adversarial examples into the training data. In particular, multi-step ATs Madry et al. (2018); Zhang et al. (2019); Jia et al. (2022) perform multi-step dense attacks (e.g., the PGD attack) to find the worst-case adversarial examples for training, achieving state-of-the-art robustness but incurring a significant computational overhead. On the other hand, by using a single-step dense attack (e.g., the FGSM attack), one-step ATs Wong et al. (2020); Andriushchenko & Flammarion (2020); Kim et al. (2021); Li et al. (2022); Wang et al. (2022) can significantly reduce the computational overhead while achieving decent robustness under dense attacks. Despite their effectiveness, existing ATs suffer from two shortcomings: i) a trade-off between standard accuracy (i.e., the accuracy on clean images) and adversarial robustness (i.e., the accuracy on adversarial examples), with improved robustness yielding non-negligible standard
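The one-step versus multi-step distinction can be sketched as follows. This is a minimal NumPy illustration of FGSM and PGD under an L-infinity budget, assuming a caller-supplied loss-gradient function rather than the actual models and losses used in the cited works:

```python
import numpy as np

def fgsm(x, grad, eps):
    """One-step FGSM: move eps along the sign of the loss gradient,
    then clip to the valid pixel range [0, 1]."""
    return np.clip(x + eps * np.sign(grad), 0.0, 1.0)

def pgd(x, grad_fn, eps, alpha, steps):
    """Multi-step PGD: repeated FGSM-like steps of size alpha, each
    projected back onto the L-infinity ball of radius eps around x."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(grad_fn(x_adv))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project into the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # keep a valid image
    return x_adv
```

FGSM costs a single gradient computation per example, while PGD costs one per step, which is why multi-step ATs find stronger worst-case examples at a much higher training expense.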

