FINDING ACTUAL DESCENT DIRECTIONS FOR ADVERSARIAL TRAINING

Abstract

Adversarial Training using a strong first-order adversary (PGD) is the gold standard for training Deep Neural Networks that are robust to adversarial examples. We show that, contrary to the general understanding of the method, the gradient at an optimal adversarial example may increase, rather than decrease, the adversarially robust loss. This holds independently of the learning rate. More precisely, we provide a counterexample to a corollary of Danskin's Theorem presented in the seminal paper of Madry et al. (2018), which states that a solution of the inner maximization problem can yield a descent direction for the adversarially robust loss. Based on a correct interpretation of Danskin's Theorem, we propose Danskin's Descent Direction (DDi), and we verify experimentally that it provides better directions than those obtained by a PGD adversary. Using the CIFAR10 dataset, we further provide a real-world example showing that our method achieves a steeper increase in robustness levels in the early training stages of smooth-activation networks without BatchNorm, and is more stable than the PGD baseline. As a limitation, PGD training of ReLU+BatchNorm networks still performs better, but current theory is unable to explain this.

1. INTRODUCTION

Adversarial Training (AT) (Goodfellow et al., 2015; Madry et al., 2018) has become the de-facto algorithm used to train Neural Networks that are robust to adversarial examples (Szegedy et al., 2014). Variations of AT together with data augmentation yield the best-performing models in public benchmarks (Croce et al., 2020). Despite lacking optimality guarantees for the inner-maximization problem, the simplicity and performance of AT are enough reasons to embrace its heuristic nature. From an optimization perspective, the consensus is that AT is a sound algorithm: based on Danskin's Theorem, Madry et al. (2018, Corollary C.2) posit that by finding a maximizer of the inner non-concave maximization problem, i.e., an optimal adversarial example, one can obtain a descent direction for the adversarially robust loss. What if this is not true? Are we potentially overlooking issues in its algorithmic framework?

As mentioned in (Dong et al., 2020, Section 2.3), Corollary C.2 in Madry et al. (2018) can be considered the theoretical optimization foundation of the non-convex non-concave min-max optimization algorithms that we now collectively refer to as Adversarial Training. It justifies the two-stage structure of the training loop: first we find one approximately optimal adversarial example, and then we update the model using the gradient (with respect to the model parameters) at the perturbed input. The only drawbacks of a first-order adversary seem to be its computational cost and the fact that it is an approximate solver that may return suboptimal points. Setting the computational cost aside, suppose we have access to a theoretical oracle that provides a single solution of the inner-maximization problem. In such an idealized setting, can we safely assume AT is decreasing the adversarially robust loss on the data sample? According to the aforementioned theoretical results, it would appear so.
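To make the two-stage structure concrete, the following is a minimal sketch of one AT step on a toy model: binary logistic regression on a single example, with a sign-based PGD adversary over an l-infinity ball. The model, step sizes, and radius are illustrative choices, not the paper's setup; note that in this toy case the loss is convex in the input, so the inner maximizer is a vertex of the ball and the gradient there is indeed a descent direction — the failure mode discussed in this paper requires the non-concave inner problems that arise with neural networks.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, x, y):
    # logistic loss on a single example, label y in {-1, +1}
    return np.log1p(np.exp(-y * w.dot(x)))

def pgd_attack(w, x0, y, eps=0.3, alpha=0.1, steps=10):
    # inner maximization: first-order (PGD) adversary over the
    # l_inf ball of radius eps around the clean input x0
    x = x0.copy()
    for _ in range(steps):
        grad_x = -y * w * sigmoid(-y * w.dot(x))  # d loss / d x
        x = x + alpha * np.sign(grad_x)           # ascent step
        x = np.clip(x, x0 - eps, x0 + eps)        # project onto the ball
    return x

def at_step(w, x0, y, eps=0.3, lr=0.5):
    # two-stage AT update: find an adversarial example, then take a
    # gradient step (w.r.t. the parameters w) evaluated at that point
    x_adv = pgd_attack(w, x0, y, eps=eps)
    grad_w = -y * x_adv * sigmoid(-y * w.dot(x_adv))  # d loss / d w at x_adv
    return w - lr * grad_w

rng = np.random.default_rng(0)
w = rng.normal(size=3)
x0, y = np.array([1.0, -0.5, 2.0]), 1

robust_before = loss(w, pgd_attack(w, x0, y), y)
w_new = at_step(w, x0, y)
robust_after = loss(w_new, pgd_attack(w_new, x0, y), y)
print(robust_before, robust_after)
```

Here the robust loss decreases after the update, as Danskin's Theorem predicts when the inner problem is well-behaved; the question this paper raises is precisely whether that guarantee survives in the non-concave setting.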
In this work, we scrutinize the optimization paradigm on which Adversarial Training (AT) has been founded, and we posit that finding multiple solutions of the inner-maximization problem is necessary

* These authors contributed equally to this work

