ADVERSARIAL MASKING: TOWARDS UNDERSTANDING ROBUSTNESS TRADE-OFF FOR GENERALIZATION

Abstract

Adversarial training is a commonly used technique to improve model robustness against adversarial examples. Despite its success as a defense mechanism, adversarial training often fails to generalize well to unperturbed test data. While previous work assumes this is caused by the discrepancy between robust and non-robust features, in this paper we introduce Adversarial Masking, a new hypothesis that this trade-off is caused by the different feature maskings applied. Specifically, the rescaling operation in the batch normalization layer, when combined with the ReLU activation, serves as a feature masking layer that selects different features for model training. By carefully manipulating these maskings, a well-balanced trade-off can be achieved between model performance on unperturbed and perturbed data. Built upon this hypothesis, we further propose Robust Masking (RobMask), which constructs a unique masking for each specific attack perturbation by learning a set of primary adversarial feature maskings. By incorporating different feature maps after the masking, we can distill better features to aid model generalization. Consequently, adversarial training can be treated as an effective regularizer to achieve better generalization. Experiments on multiple benchmarks demonstrate that RobMask achieves significant improvement in clean test accuracy compared to strong state-of-the-art baselines.

1. INTRODUCTION

Deep neural networks have achieved unprecedented success over a variety of tasks and across different domains. However, studies have shown that neural networks are inherently vulnerable to adversarial examples (Biggio et al., 2013; Szegedy et al., 2014). To enhance model robustness against adversarial examples, adversarial training (Goodfellow et al., 2015; Madry et al., 2018) has become one of the most effective and widely applied defense methods; it employs specific attack algorithms to generate adversarial examples during training in order to learn robust models. Albeit effective in countering adversarial examples, adversarial training often suffers from inferior performance on clean data (Zhang et al., 2019; Balaji et al., 2019). This observation has led prior work to conclude that a trade-off between robustness and accuracy may be inevitable, particularly for image classification tasks (Zhang et al., 2019; Tsipras et al., 2019). However, Yang et al. (2020) recently suggested that it is possible to learn classifiers that are both robust and highly accurate on real image data. The current state of adversarial training methods falls short of this prediction, and the discrepancy remains poorly understood. In this paper, we conduct an in-depth study of the trade-off between robustness and clean accuracy in adversarial training, and introduce Adversarial Masking, a new hypothesis stating that a widely used technique, batch normalization (BN), has a significant impact on the trade-off between robustness and natural accuracy. Specifically, we break down BN into normalization and rescaling operations, and find that the rescaling operation has a significant impact on the robustness trade-off while normalization has only marginal influence.
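To make this decomposition concrete, the following NumPy sketch (illustrative only; the function `batchnorm_relu`, the channel settings, and the parameter values are our own examples, not the paper's method) separates BN into its normalization and rescaling steps and shows how the learned rescaling parameters, followed by ReLU, can magnify or completely block individual feature channels:

```python
import numpy as np

def batchnorm_relu(x, gamma, beta, eps=1e-5):
    """Batch normalization split into its two operations, followed by ReLU.

    Normalization: x_hat = (x - mean) / std   (per-channel batch statistics)
    Rescaling:     y = gamma * x_hat + beta   (learned affine parameters)
    """
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    x_hat = (x - mean) / (std + eps)   # normalization step
    y = gamma * x_hat + beta           # rescaling step
    return np.maximum(y, 0.0)          # ReLU activation

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 3))           # a batch of 64 samples, 3 feature channels

# Hypothetical rescaling parameters illustrating three masking regimes:
# channel 0: large positive gamma  -> feature passes through, magnified;
# channel 1: gamma = 0, beta < 0   -> ReLU blocks the channel entirely;
# channel 2: small gamma, beta < 0 -> channel is mostly suppressed.
gamma = np.array([2.0, 0.0, 0.1])
beta = np.array([0.0, -1.0, -0.2])

out = batchnorm_relu(x, gamma, beta)
print((out > 0).mean(axis=0))  # fraction of active units per channel
```

The per-channel activation fractions make the masking effect visible: channel 1 is fully blocked regardless of the input, while channel 0 remains largely active, mirroring how different rescaling parameters can gate which features reach subsequent layers.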
Built upon this observation, we hypothesize that adversarial masking (i.e., the combination of the rescaling operation and the follow-up ReLU activation function) acts as a feature masking layer that can magnify or block feature maps, thereby influencing robust or clean generalization. Under this hypothesis, different rescaling parameters in BN contribute to different adversarial maskings learned through training. By using a simple linear combination of two adversarial maskings, rather than using robust features learned

