STRENGTH-ADAPTIVE ADVERSARIAL TRAINING

Abstract

Adversarial training (AT) has been shown to reliably improve a network's robustness against adversarial data. However, AT with a pre-specified perturbation budget has limitations in learning a robust network. First, applying a pre-specified perturbation budget to networks of various model capacities yields divergent degrees of robustness disparity between natural and robust accuracies, which deviates from the desideratum of a robust network. Second, the attack strength of the adversarial training data, constrained by the pre-specified perturbation budget, fails to increase as network robustness grows, which leads to robust overfitting and further degrades adversarial robustness. To overcome these limitations, we propose Strength-Adaptive Adversarial Training (SAAT). Specifically, the adversary employs an adversarial loss constraint to generate adversarial training data. Under this constraint, the perturbation budget is adaptively adjusted according to the training state of the adversarial data, which effectively avoids robust overfitting. Moreover, SAAT explicitly constrains the attack strength of the training data through the adversarial loss, which governs model capacity scheduling during training, and can thereby flexibly control the degree of robustness disparity and adjust the tradeoff between natural accuracy and robustness. Extensive experiments show that our proposal boosts the robustness of adversarial training.

1. INTRODUCTION

Current deep neural networks (DNNs) achieve impressive breakthroughs in a variety of fields such as computer vision (He et al., 2016), speech recognition (Wang et al., 2017), and NLP (Devlin et al., 2018), but it is well known that DNNs are vulnerable to adversarial data: small input perturbations that are imperceptible to humans can cause wrong outputs (Szegedy et al., 2013; Goodfellow et al., 2014). As a countermeasure against adversarial data, adversarial training (AT) hardens networks against adversarial attacks (Madry et al., 2017). AT trains the network on adversarial data constrained by a pre-specified perturbation budget, aiming to output a network that minimizes the adversarial risk, i.e., the risk of a sample being wrongly classified under the same perturbation budget. Among existing defense techniques, AT has proven to be one of the most effective and reliable methods against adversarial attacks (Athalye et al., 2018). Although promising for improving a network's robustness, AT with a pre-specified perturbation budget still has limitations in learning a robust network. First, the pre-specified perturbation budget is not adaptable to networks of various model capacities, yielding divergent degrees of robustness disparity between natural and robust accuracies, which deviates from the desideratum of a robust network. Ideally, for a robust network, perturbing the attack budget within a small range should not cause significant accuracy degradation. Unfortunately, the degree of robustness disparity is intractable under AT with a pre-specified perturbation budget, and the output network can exhibit a prominent robustness disparity. For instance, a standard PGD adversarially trained PreAct ResNet-18 network achieves 84% natural accuracy but only 46% robust accuracy on CIFAR-10 under the ℓ∞ threat model, as shown in Figure 1(a).
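As a concrete illustration (not the paper's code), the inner maximization of standard AT with a pre-specified ℓ∞ budget is typically solved with PGD. The sketch below uses a toy logistic model with an analytic input gradient so that it is self-contained; all names and hyperparameter values are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    # binary cross-entropy of a logistic model p = sigmoid(w.x + b)
    p = sigmoid(w @ x + b)
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def input_grad(w, b, x, y):
    # analytic gradient of the loss w.r.t. the input: (p - y) * w
    return (sigmoid(w @ x + b) - y) * w

def pgd_attack(w, b, x, y, eps=8/255, alpha=2/255, steps=10):
    """Inner maximization of standard AT: sign-gradient ascent on the loss,
    projected back onto the pre-specified l_inf ball of radius eps."""
    x_adv = x.copy()
    for _ in range(steps):
        x_adv = x_adv + alpha * np.sign(input_grad(w, b, x_adv, y))
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project onto the eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)          # stay in valid pixel range
    return x_adv
```

Note that, whatever the training state of the model, the perturbation never exceeds eps; this fixed ceiling on attack strength is exactly the constraint that SAAT relaxes.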
Empirically, we have to increase the pre-specified perturbation budget to allocate more model capacity to defense against adversarial attacks and thereby mitigate the degree of robustness disparity, as shown in Figure 1(b). However, the feasible range of perturbation budgets differs across networks of different model capacities. For example, AT with perturbation budget ϵ = 40/255 causes the optimization of PreAct ResNet-18 to collapse, while Wide ResNet-34-10 still learns normally. To maintain a steady degree of robustness disparity, we would have to find a separate perturbation budget for each network of different capacity. It may therefore be pessimistic to rely on AT with a pre-specified perturbation budget to learn a robust network. Second, the attack strength of adversarial training data constrained by a pre-specified perturbation budget gradually weakens as network robustness grows. During training, adversarial data are generated on the fly and change as the network updates. As the network's adversarial robustness continues to increase, the attack strength of adversarial training data under the pre-specified perturbation budget becomes relatively weaker. Given the limited network capacity, a stagnant adversary paired with an evolving network easily causes a training bias: adversarial training leans toward defending against weak attacks and thereby erodes defenses against strong attacks, leading to undesirable robust overfitting, as shown in Figure 1(c). Moreover, compared with the "best" checkpoint in AT with robust overfitting, the "last" checkpoint's defense advantage under weak attacks is slight, while its defense disadvantage under strong attacks is significant, which indicates that robust overfitting not only exacerbates the degree of robustness disparity but also further degrades adversarial robustness.
Thus, it may be deficient to train a robust network on adversarial data with a pre-specified perturbation budget. To overcome these limitations, we propose strength-adaptive adversarial training (SAAT), which employs an adversarial loss constraint to generate adversarial training data. Adversarial perturbations generated under this constraint adapt to the dynamic training schedule and to networks of various model capacities. Specifically, as adversarial training progresses and the network becomes more robust, a larger perturbation budget is required to satisfy the adversarial loss constraint; the perturbation budget in SAAT is thus adaptively adjusted according to the training state of the adversarial data, which restrains the training bias and effectively avoids robust overfitting. Moreover, SAAT explicitly constrains the attack strength of training data through the adversarial loss constraint, which guides model capacity scheduling during adversarial training and thereby can flexibly adjust the tradeoff between natural accuracy and robustness, ensuring that the output network maintains a steady degree of robustness disparity even across networks of different model capacities. Our contributions are as follows. (a) In standard AT, we characterize the pessimism of an adversary with a pre-specified perturbation budget, which stems from the intractable robustness disparity and undesirable robust overfitting (Section 3.1). (b) We propose a new adversarial training method, i.e., SAAT (its learning objective in Section 3.2 and its realization in Section 3.3). SAAT is a general adversarial training method that can easily be converted to natural training or standard AT.
(c) Empirically, we find that the adversarial training loss is well correlated with the degree of robustness disparity and the robust generalization gap (Section 4.2), which enables SAAT to overcome the issue of robust overfitting and flexibly adjust the tradeoff of adversarial training, leading to improved natural accuracy and robustness (Section 4.3).
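The exact SAAT algorithm is given in Section 3.3; as a rough, hypothetical sketch of the loss-constrained generation described above, the attack below keeps taking sign-gradient steps until the adversarial loss reaches a target value rho (an assumed attack-strength hyperparameter), so the effective ℓ∞ budget grows with the model's robustness instead of being fixed. The toy logistic model and all names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, x, y):
    # binary cross-entropy of a logistic model p = sigmoid(w.x + b)
    p = sigmoid(w @ x + b)
    return -(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))

def input_grad(w, b, x, y):
    # analytic gradient of the loss w.r.t. the input: (p - y) * w
    return (sigmoid(w @ x + b) - y) * w

def loss_constrained_attack(w, b, x, y, rho=0.6, alpha=0.01, max_steps=200):
    """Hypothetical SAAT-style generation: no fixed eps-ball; instead, stop
    as soon as the adversarial loss reaches the target rho. The l_inf
    budget actually used is therefore adaptive to the training state."""
    x_adv = x.copy()
    for _ in range(max_steps):
        if bce_loss(w, b, x_adv, y) >= rho:   # adversarial loss constraint met
            break
        x_adv = x_adv + alpha * np.sign(input_grad(w, b, x_adv, y))
        x_adv = np.clip(x_adv, 0.0, 1.0)      # valid pixel range only
    return x_adv
```

On a more confident (more robust) model the loop runs longer before the constraint is met, i.e., the perturbation budget expands automatically, and rho plays the role of the attack-strength knob discussed above; max_steps merely caps the search.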

2. PRELIMINARY AND RELATED WORK

In this section, we review the adversarial training method and related works.

2.1. ADVERSARIAL TRAINING

Learning objective. Let f_θ, X, and ℓ denote the network f with trainable model parameters θ, the input feature space, and the loss function, respectively. Given a C-class dataset S = {(x_i, y_i)}_{i=1}^n, where x_i ∈ X is an input and y_i ∈ Y = {0, 1, ..., C−1} is its associated label. In natural training, most machine



Figure 1: Robustness evaluation across different test perturbation budgets of (a) standard AT and (b) AT with different pre-specified training perturbation budgets. (c) The learning curve of standard AT with pre-specified perturbation budget ϵ = 8/255 on PreAct ResNet-18 under the ℓ∞ threat model, and the robustness evaluation of its "best" and "last" checkpoints.
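The kind of sweep plotted in Figure 1 amounts to evaluating robust accuracy at a grid of test-time budgets. For a toy linear classifier the worst-case ℓ∞ perturbation has a closed form (ignoring the [0,1] box constraint on inputs), which makes the disparity between natural (ϵ = 0) and robust accuracy easy to see. The sketch below is illustrative, not the paper's evaluation code; the data and model are synthetic.

```python
import numpy as np

def robust_accuracy_linear(w, b, X, y, eps):
    """Worst-case l_inf robust accuracy of a linear classifier z = X.w + b.
    The strongest eps-perturbation shifts each score toward the decision
    boundary by eps * ||w||_1 (input box constraints are ignored here)."""
    z = X @ w + b
    margin = np.where(y == 1, z, -z)        # > 0 iff correctly classified
    worst = margin - eps * np.abs(w).sum()  # margin after the optimal attack
    return float(np.mean(worst > 0))

# sweep the test-time budget, as in a Figure-1-style robustness curve
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(500, 5))
w = rng.normal(size=5)
b = -float(np.mean(X @ w))                  # roughly balanced classes
y = (X @ w + b > 0).astype(int)             # labels from the model itself
accs = [robust_accuracy_linear(w, b, X, y, e) for e in (0, 8/255, 16/255, 32/255)]
```

The accuracies are non-increasing in the test budget; how quickly they fall away from the natural accuracy is the robustness disparity that the paper seeks to control.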

