TOWARDS UNDERSTANDING FAST ADVERSARIAL TRAINING

Abstract

Current neural-network-based classifiers are susceptible to adversarial examples. The most empirically successful approach to defending against such adversarial examples is adversarial training, which incorporates a strong self-attack during training to enhance the model's robustness. This approach, however, is computationally expensive and hence hard to scale up. A recent work, called fast adversarial training, has shown that it is possible to markedly reduce computation time without sacrificing significant performance. This approach incorporates simple self-attacks, yet it can only run for a limited number of training epochs, resulting in sub-optimal performance. In this paper, we conduct experiments to understand the behavior of fast adversarial training and show that the key to its success is the ability to recover from overfitting to weak attacks. We then extend our findings to improve fast adversarial training, demonstrating robust accuracy superior to that of strong adversarial training, with much-reduced training time.

1. INTRODUCTION

Adversarial examples are carefully crafted versions of the original data that successfully mislead a classifier (Szegedy et al., 2013) while appearing minimally changed to most human observers. Although deep neural networks have achieved impressive success on a variety of challenging machine learning tasks, the existence of such adversarial examples has hindered their application and drawn great attention in the deep-learning community. Empirically, the most successful defense thus far is Projected Gradient Descent (PGD) adversarial training (Goodfellow et al., 2014; Madry et al., 2017), which augments the training data with strong adversarial examples to improve model robustness. Although effective, this approach is not efficient and may take multiple days to train a moderately large model. On the other hand, one of the early versions of adversarial training, based on the weaker Fast Gradient Sign Method (FGSM) attack, is much more efficient but suffers from "catastrophic overfitting," a phenomenon in which the robust accuracy with respect to strong attacks suddenly drops to almost zero during training (Tramèr et al., 2017; Wong et al., 2019), so it fails to provide robustness against strong attacks.
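For reference, the attacks named above are commonly written as follows. This is a sketch in standard notation rather than notation taken from the text: the classifier $f_\theta$, the loss $\ell$, the $\ell_\infty$ perturbation budget $\epsilon$, the step size $\alpha$, and the projection $\Pi$ are all assumed symbols.

```latex
% Adversarial training as robust optimization: train on worst-case perturbations
\min_{\theta} \; \mathbb{E}_{(x,y)} \Big[ \max_{\|\delta\|_\infty \le \epsilon} \ell\big(f_\theta(x+\delta), y\big) \Big]

% FGSM: approximate the inner maximization with a single signed-gradient step
\delta_{\mathrm{FGSM}} = \epsilon \cdot \mathrm{sign}\big(\nabla_x \ell(f_\theta(x), y)\big)

% PGD: iterate smaller signed-gradient steps, projecting back onto the \epsilon-ball
\delta^{(t+1)} = \Pi_{\|\delta\|_\infty \le \epsilon}\Big(\delta^{(t)} + \alpha \cdot \mathrm{sign}\big(\nabla_x \ell(f_\theta(x+\delta^{(t)}), y)\big)\Big)
```

The cost difference follows from the number of gradient evaluations per training example: one for FGSM versus one per iteration for PGD.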

Fast adversarial training (Wong et al., 2019) is a simple modification to FGSM that mitigates this overfitting issue. By initializing FGSM attacks with large randomized perturbations, it can efficiently obtain models that are robust against strong attacks. Although the modification is simple, the underlying reason for its success remains unclear. Moreover, fast adversarial training is only compatible with a cyclic learning rate schedule (Smith & Topin, 2019) with a limited number of training epochs, resulting in sub-optimal robust accuracy compared to PGD adversarial training (Rice et al., 2020). When fast adversarial training runs for a large number of epochs, it still suffers from catastrophic overfitting, similar to vanilla FGSM adversarial training. Therefore, obtaining the effectiveness of PGD adversarial training and the efficiency of FGSM adversarial training simultaneously remains an open problem. In this paper, we conduct experiments to show that the key to the success of fast adversarial training is not avoiding catastrophic overfitting, but being able to retain the robustness of the model when catastrophic overfitting occurs. We then use this understanding to propose a simple fix to fast adversarial training that allows it to train for a large number of epochs without sacrificing efficiency. We demonstrate that this fix yields improved performance.
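The randomly initialized FGSM attack described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a toy linear classifier with logistic loss so that the input gradient is available in closed form, an $\ell_\infty$ budget `eps`, and a step size `alpha`. The function names and the value `alpha = 1.25 * eps` are hypothetical choices made for illustration only.

```python
import math
import random


def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)


def logistic_loss(w, b, x, y):
    """Logistic loss log(1 + exp(-y * (w.x + b))) for a label y in {-1, +1}."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    return math.log(1.0 + math.exp(-margin))


def loss_grad_x(w, b, x, y):
    """Gradient of the logistic loss with respect to the input x."""
    margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    coeff = -y / (1.0 + math.exp(margin))  # derivative w.r.t. (w.x + b)
    return [coeff * wi for wi in w]


def fgsm_random_init(w, b, x, y, eps, alpha, rng):
    """One FGSM step started from a random point in the l-infinity ball.

    Assumed procedure: sample delta uniformly in [-eps, eps], take a single
    signed-gradient step of size alpha, then clip delta back to [-eps, eps].
    """
    delta = [rng.uniform(-eps, eps) for _ in x]
    x_start = [xi + di for xi, di in zip(x, delta)]
    grad = loss_grad_x(w, b, x_start, y)
    return [
        max(-eps, min(eps, di + alpha * sign(gi)))
        for di, gi in zip(delta, grad)
    ]


# Toy usage with made-up weights and a made-up input.
rng = random.Random(0)
w, b = [1.0, -2.0, 0.5], 0.1
x, y = [0.2, 0.4, -0.1], 1
eps = 0.1
delta = fgsm_random_init(w, b, x, y, eps, alpha=1.25 * eps, rng=rng)
x_adv = [xi + di for xi, di in zip(x, delta)]
```

In adversarial training, the perturbed input `x_adv` would replace `x` in the parameter update. The attack still costs a single gradient evaluation per example, which is what keeps the method as cheap as plain FGSM.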

