ADVERSARIAL MASKING: TOWARDS UNDERSTANDING ROBUSTNESS TRADE-OFF FOR GENERALIZATION

Abstract

Adversarial training is a commonly used technique to improve model robustness against adversarial examples. Despite its success as a defense mechanism, adversarial training often fails to generalize well to unperturbed test data. While previous work attributes this to the discrepancy between robust and non-robust features, in this paper we introduce Adversarial Masking, a new hypothesis that the trade-off is caused by the different feature maskings applied. Specifically, the rescaling operation in the batch normalization layer, combined with the ReLU activation, serves as a feature masking layer that selects different features for model training. By carefully manipulating these maskings, a well-balanced trade-off can be achieved between model performance on unperturbed and perturbed data. Built upon this hypothesis, we further propose Robust Masking (RobMask), which constructs a unique masking for each attack perturbation by learning a set of primary adversarial feature maskings. By incorporating different feature maps after the masking, we can distill better features to improve model generalization. Consequently, adversarial training can be treated as an effective regularizer to achieve better generalization. Experiments on multiple benchmarks demonstrate that RobMask achieves significant improvement in clean test accuracy over strong state-of-the-art baselines.

1. INTRODUCTION

Deep neural networks have achieved unprecedented success on a variety of tasks across different domains. However, studies have shown that neural networks are inherently vulnerable to adversarial examples (Biggio et al., 2013; Szegedy et al., 2014). To enhance model robustness against adversarial examples, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) has become one of the most effective and widely applied defense methods; it employs specific attack algorithms to generate adversarial examples during training in order to learn robust models. Albeit effective in countering adversarial examples, adversarial training often suffers from inferior performance on clean data (Zhang et al., 2019; Balaji et al., 2019). This observation has led prior work to extrapolate that a trade-off between robustness and accuracy may be inevitable, particularly for image classification tasks (Zhang et al., 2019; Tsipras et al., 2019). However, Yang et al. (2020) recently suggested that it is possible to learn classifiers that are both robust and highly accurate on real image data. The current state of adversarial training methods falls short of this prediction, and the discrepancy remains poorly understood. In this paper, we conduct an in-depth study of the trade-off between robustness and clean accuracy in adversarial training, and introduce Adversarial Masking, a new hypothesis stating that a widely used technique, batch normalization (BN), has a significant impact on the trade-off between robustness and natural accuracy. Specifically, we break down BN into its normalization and rescaling operations, and find that the rescaling operation has a significant impact on the robustness trade-off, while normalization has only a marginal influence.
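To make this decomposition concrete, the following minimal NumPy sketch (an illustration, not the paper's implementation) separates BN into its two sub-operations; the `gamma`/`beta` values are arbitrary illustrative choices:

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Batch normalization, split into its two sub-operations."""
    # (1) Normalization: zero mean, unit variance per channel.
    mu = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # (2) Rescaling: learnable per-channel affine transform (gamma, beta).
    return gamma * x_hat + beta, x_hat

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4)) * 3.0 + 1.0   # a batch of 32 samples, 4 channels
gamma = np.array([1.0, 2.0, 0.5, 1.0])         # illustrative rescaling parameters
beta = np.array([0.0, -1.0, 0.5, 0.0])
out, x_hat = batch_norm(x, gamma, beta)
```

Only the second step carries learnable parameters, which is why the two operations can be ablated independently when studying the robustness trade-off.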
Built upon this observation, we hypothesize that adversarial masking (i.e., the combination of the rescaling operation and the follow-up ReLU activation function) acts as a feature masking layer that can magnify or block feature maps, thereby influencing robust or clean generalization. Under this hypothesis, different rescaling parameters in BN correspond to different adversarial maskings learned through training. By using a simple linear combination of two adversarial maskings, rather than the robust features learned by adversarial training (Madry et al., 2018; Ilyas et al., 2019; Zhang et al., 2019), we show that a well-balanced trade-off can be readily achieved. Based on the Adversarial Masking hypothesis, we further propose RobMask (Robust Masking), a new training scheme that learns an adaptive feature masking for different perturbation strengths. We use the learned adaptive feature masking to incorporate different features so that we can improve model generalization with a better robustness trade-off. Specifically, each perturbation strength is encoded as a low-dimensional vector, which is fed into a learnable linear projection layer followed by a ReLU activation to obtain the adversarial masking for that perturbation strength. Therefore, for different perturbation strengths, we learn different maskings accordingly. By doing so, rather than hurting performance on clean test data, we use adversarial examples as a powerful regularizer to boost model generalization. Experiments on multiple benchmarks demonstrate that RobMask achieves not only significantly better natural accuracy, but also a better trade-off between robustness and generalization. Our contributions are summarized as follows. (i) We conduct a detailed analysis to demonstrate that the rescaling operation in batch normalization has a significant impact on the trade-off between robustness and natural accuracy.
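The masking view, and the linear combination of two maskings, can be sketched as follows (a hypothetical NumPy illustration; the `gamma`/`beta` values are hand-picked so that a different channel is blocked under each parameter set, not values learned by any model):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def masked_features(z, gamma, beta):
    """Rescaling (the BN affine step) followed by ReLU.

    A channel whose rescaled activations stay below zero is blocked
    entirely by the ReLU; a channel with large gamma is magnified.
    """
    return relu(gamma * z + beta)

rng = np.random.default_rng(0)
z = rng.standard_normal((4, 3))   # normalized features: 4 samples, 3 channels

# Hypothetical rescaling parameters, standing in for those learned on
# clean vs. adversarial data.
gamma_clean, beta_clean = np.array([1.0, 0.1, 1.0]), np.array([0.0, -2.0, 0.0])
gamma_adv,   beta_adv   = np.array([0.1, 1.0, 1.0]), np.array([-2.0, 0.0, 0.0])

# Channel 1 is effectively masked under the "clean" parameters,
# channel 0 under the "adversarial" ones.
f_clean = masked_features(z, gamma_clean, beta_clean)
f_adv   = masked_features(z, gamma_adv, beta_adv)

# A simple linear combination of the two maskings interpolates between them.
alpha = 0.5
gamma_mix = alpha * gamma_clean + (1 - alpha) * gamma_adv
beta_mix  = alpha * beta_clean + (1 - alpha) * beta_adv
f_mix = masked_features(z, gamma_mix, beta_mix)
```

Varying `alpha` between 0 and 1 traces out the trade-off between the two feature selections.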
(ii) We introduce Adversarial Masking, a new hypothesis that this trade-off is caused by the different feature maskings applied, and that different combinations of maskings lead to different trade-offs. (iii) We propose RobMask, a new training scheme that learns an adaptive masking for each perturbation strength, in order to utilize adversarial examples to boost generalization on clean data. RobMask also achieves a better trade-off between robust and natural accuracy.
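A minimal sketch of the perturbation-conditioned masking mechanism (the sinusoidal `epsilon_embedding`, the random `W` and `b`, and all dimensions here are hypothetical stand-ins; the paper's encoding and projection are learned end-to-end):

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def epsilon_embedding(eps, dim=8):
    """Hypothetical low-dimensional encoding of a perturbation strength."""
    freqs = 2.0 ** np.arange(dim // 2)
    return np.concatenate([np.sin(eps * freqs), np.cos(eps * freqs)])

def robmask(eps, W, b):
    """Perturbation-conditioned masking: linear projection followed by ReLU."""
    return relu(W @ epsilon_embedding(eps) + b)

rng = np.random.default_rng(1)
n_channels, emb_dim = 16, 8
W = rng.standard_normal((n_channels, emb_dim)) * 0.5   # learnable in practice
b = rng.standard_normal(n_channels) * 0.5              # learnable in practice

mask_clean = robmask(0.0, W, b)       # masking for clean data (eps = 0)
mask_adv   = robmask(8 / 255, W, b)   # masking for an l_inf budget of 8/255

features = rng.standard_normal((2, n_channels))
masked = features * mask_clean        # broadcast the per-channel mask
```

Because the projection is shared across perturbation strengths, the model learns one set of parameters that produces a distinct masking for every `eps` it is trained with.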

2. PRELIMINARY AND RELATED WORK

Adversarial Training Since the discovery of the vulnerability of deep neural networks, diverse approaches have been proposed to enhance adversarial robustness. A natural idea is to iteratively generate adversarial examples, add them back to the training data, and retrain the model. For example, Goodfellow et al. (2015) uses adversarial examples generated by FGSM to augment the data, and Kurakin et al. (2017) proposes a multi-step FGSM to further improve performance. Madry et al. (2018) shows that adversarial training can be formulated as a min-max optimization problem, and proposes the PGD attack (similar to multi-step FGSM) to find adversarial examples for each batch. Specifically, for a $K$-class classification problem, denote $D = \{(x_i, y_i)\}_{i=1}^n$ as the set of training samples with $x_i \in \mathbb{R}^d$ and $y_i \in \{1, \ldots, K\}$. Considering a classification model $f_\theta(x): \mathbb{R}^d \to \Delta^K$ parameterized by $\theta$, where $\Delta^K$ represents the $K$-dimensional simplex, adversarial training can be formulated as: $$\min_\theta \frac{1}{n} \sum_{i=1}^n \max_{x_i' \in B_p(x_i, \epsilon)} \ell(f_\theta(x_i'), y_i),$$ where $B_p(x_i, \epsilon)$ denotes the $\ell_p$-norm ball centered at $x_i$ with radius $\epsilon$, and $\ell(\cdot, \cdot)$ is the cross-entropy loss. The inner maximization problem aims to find an adversarial version of a given data point $x_i$ that yields the highest loss. In general, $B_p(x_i, \epsilon)$ can be defined based on the threat model, but the $\ell_\infty$ ball is the most popular choice among recent work (Madry et al., 2018; Zhang et al., 2019) and is also adopted in this paper. For deep neural networks, the inner maximization does not have a closed-form solution.
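Since the inner maximization has no closed form, PGD approximates it by projected signed gradient ascent. A self-contained sketch on a toy binary logistic model (not the paper's networks; `w`, `b`, and the hyperparameters are illustrative):

```python
import numpy as np

def loss(x, y, w, b):
    """Binary cross-entropy loss of a logistic model on one example."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def pgd_attack(x, y, w, b, eps, alpha, steps):
    """l_inf PGD for the inner maximization, on the toy logistic model.

    Approximately solves  max_{||x' - x||_inf <= eps} loss(x', y)
    by iterated signed gradient ascent with projection onto the eps-ball.
    """
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(x_adv @ w + b)))   # sigmoid prediction
        grad_x = (p - y) * w                         # d(cross-entropy)/dx here
        x_adv = x_adv + alpha * np.sign(grad_x)      # signed ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)     # project into the ball
    return x_adv

rng = np.random.default_rng(0)
w, b = rng.standard_normal(5), 0.0
x, y = rng.standard_normal(5), 1.0
x_adv = pgd_attack(x, y, w, b, eps=0.1, alpha=0.03, steps=10)
```

The outer minimization then simply trains on `x_adv` in place of `x` for each batch.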

Trade-off between Robustness and Accuracy While effective in improving model robustness, adversarial training is known to incur a performance drop on clean test data. Tsipras et al. (2019) provides a theoretical example of a data distribution on which any classifier with high test accuracy must also have low adversarial accuracy under $\ell_\infty$ perturbations. They claim that high performance on both accuracy and robustness may be unattainable due to their inherently opposing goals. Zhang et al. (2019) decomposes the robust error as the sum of the natural (classification) error and the boundary error.
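For reference, that decomposition can be written in the binary-classification setting (following the notation of Zhang et al. (2019), where $B(\mathrm{DB}(f), \epsilon)$ denotes the set of points within distance $\epsilon$ of the decision boundary of $f$, and $f(X)Y \le 0$ indicates misclassification):

$$\mathcal{R}_{\mathrm{rob}}(f) \;=\; \mathcal{R}_{\mathrm{nat}}(f) \;+\; \mathcal{R}_{\mathrm{bdy}}(f),$$

$$\text{where}\quad \mathcal{R}_{\mathrm{nat}}(f) = \Pr\big[f(X)\,Y \le 0\big], \qquad \mathcal{R}_{\mathrm{bdy}}(f) = \Pr\big[X \in B(\mathrm{DB}(f), \epsilon),\; f(X)\,Y > 0\big].$$

The boundary term counts correctly classified points that nevertheless sit within $\epsilon$ of the decision boundary, which is exactly what trades off against natural accuracy.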

