DISTRIBUTED ADVERSARIAL TRAINING TO ROBUSTIFY DEEP NEURAL NETWORKS AT SCALE

Abstract

Current deep neural networks are vulnerable to adversarial attacks, where adversarial perturbations of the inputs can change or manipulate classification. To defend against such attacks, an effective and popular approach, known as adversarial training, mitigates the negative impact of adversarial attacks by virtue of a min-max robust training method. While effective, this approach is generally difficult to scale to large models and large datasets (e.g., ImageNet). To address this challenge, we propose distributed adversarial training (DAT), a large-batch adversarial training framework implemented over multiple machines. DAT supports one-shot and iterative attack generation methods, gradient quantization, and training over labeled and unlabeled data. Theoretically, we provide, under standard conditions in optimization theory, the convergence rate of DAT to first-order stationary points in general non-convex settings. Empirically, on ResNet-18 and ResNet-50 trained on CIFAR-10 and ImageNet, we demonstrate that DAT matches or outperforms state-of-the-art robust accuracies and achieves a graceful training speedup.

1. INTRODUCTION

The rapid increase of research in deep neural networks (DNNs) and their adoption in practice is, in part, owed to the significant breakthroughs made with DNNs in computer vision (Alom et al., 2018). Yet, with the apparent power of DNNs, there remains a serious weakness: robustness. DNNs can easily be manipulated by an adversary to output drastically different classifications, and this can be done in a controlled and directed way. This process, known as an adversarial attack, is considered one of the major hurdles to using DNNs in security-critical and real-world applications (Goodfellow et al., 2015; Szegedy et al., 2013; Carlini & Wagner, 2017; Papernot et al., 2016; Kurakin et al., 2016; Eykholt et al., 2018; Xu et al., 2019b). Methods to train DNNs to be robust against adversarial attacks are now a major research focus (Xu et al., 2019a), but most are far from satisfactory (Athalye et al., 2018), with the exception of the adversarial training (AT) approach (Madry et al., 2017b). AT is a min-max robust training method that minimizes the worst-case training loss at adversarially perturbed examples. AT has inspired a wide range of state-of-the-art defenses (Kannan et al., 2018; Ross & Doshi-Velez, 2018; Moosavi-Dezfooli et al., 2019; Zhang et al., 2019b; Wang et al., 2019b; Sinha et al., 2018; Chen et al., 2019; Boopathy et al., 2020; Wong & Kolter, 2017; Dvijotham et al., 2018; Stanforth et al., 2019; Carmon et al., 2019; Shafahi et al., 2019; Zhang et al., 2019a; Wong et al., 2020), which ultimately resort to min-max optimization. However, these methods, together with AT, are generally difficult to scale to large networks on large datasets. While scaling AT is important, doing so effectively is non-trivial. We find that the direct solution of distributing the data batch across multiple machines may not work and leaves many questions unanswered.
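The AT objective just described is a min-max problem: an inner maximization crafts a worst-case perturbation within an epsilon-ball around each input, and an outer minimization updates the model on those perturbed examples. The toy sketch below illustrates this two-level structure with an iterative (PGD-style) inner attack on a linear model; the margin loss, the step sizes, and the model itself are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def loss_and_grads(w, x, y):
    """Squared-hinge margin loss max(0, 1 - y * w.x)^2 with gradients
    w.r.t. the model w and the input x."""
    margin = 1.0 - y * np.dot(w, x)
    if margin <= 0.0:
        return 0.0, np.zeros_like(w), np.zeros_like(x)
    grad_w = -2.0 * margin * y * x
    grad_x = -2.0 * margin * y * w
    return margin ** 2, grad_w, grad_x

def pgd_attack(w, x, y, eps=0.1, alpha=0.02, steps=10):
    """Inner maximization: iterative L-inf PGD around the clean input x."""
    x_adv = x.copy()
    for _ in range(steps):
        _, _, grad_x = loss_and_grads(w, x_adv, y)
        x_adv = x_adv + alpha * np.sign(grad_x)   # gradient-ascent step
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back to eps-ball
    return x_adv

def adversarial_training_step(w, batch, lr=0.1):
    """Outer minimization: one SGD step on the worst-case (adversarial) loss."""
    grad = np.zeros_like(w)
    for x, y in batch:
        x_adv = pgd_attack(w, x, y)
        _, grad_w, _ = loss_and_grads(w, x_adv, y)
        grad += grad_w
    return w - lr * grad / len(batch)
```

Running `adversarial_training_step` repeatedly decreases the average adversarial loss on a linearly separable toy set, mirroring how AT minimizes the loss at the inner maximizer rather than at the clean inputs.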
First, if the direct solution does not allow the batch size to scale with the number of machines, then it does not speed up training and incurs significant communication costs (since the number of training iterations is not reduced over a fixed number of epochs). Second, without proper design, directly applying a large batch size to distributed adversarial training introduces a significant loss in both standard accuracy and adversarial robustness (e.g., more than 10% performance drop for ResNet-18 on CIFAR-10 in our experiments). Third, the direct approach does not yield a general algorithmic framework, which is needed to support different variants of AT, large-batch optimization, and efficient communication. Taking all factors into consideration, a question naturally arises: Can we speed up AT by leveraging distributed learning with full utility of multiple computing nodes (machines), even when each only has access to limited GPU resources? Although a few works made empirical efforts to scale AT by simply using multiple computing nodes (Xie et al., 2019; Kang et al., 2019; Qin et al., 2019), they were limited to specific use cases and lacked a thorough study of when and how distributed learning helps, either in theory or in practice. By contrast, we propose a principled and theoretically grounded distributed (large-batch) adversarial training (DAT) framework that makes full use of the computing capability of multiple data-locality (distributed) machines, and we show that DAT expands both data storage capacity and computational scalability. We summarize our main contributions below.

Contributions (i) We provide a general problem formulation of DAT, which supports multiple distributed variants of AT, e.g., supervised AT and semi-supervised AT.
(ii) We propose a principled algorithmic framework for DAT, which, unlike conventional AT, supports large-batch DNN training (without losing performance over a fixed number of epochs) and allows the transmission of compressed gradients for efficient communication. (iii) We theoretically quantify the convergence speed of DAT to first-order stationary points in general non-convex settings at a rate of O(1/√T), where T is the total number of iterations. This result matches the standard convergence rate of classic training algorithms, e.g., stochastic gradient descent (SGD), for minimization-only problems. (iv) We conduct a comprehensive empirical study of DAT, showing that it not only speeds up training of large models on large datasets but also matches (and even exceeds) state-of-the-art robust accuracies in different attacking and learning scenarios. For example, DAT on ImageNet with 6 × 6 (machines × GPUs per machine) yields 38.45% robust accuracy (comparable to 40.38% from AT) while requiring only 16.3 hours of training time (thanks to the 6× larger batch size allowed in DAT), a 3.1× speedup over AT on a single machine with 6 GPUs. We also make a significant effort to evaluate the empirical performance of DAT across different computing configurations and learning regimes, e.g., semi-supervised learning and transfer learning.
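To make contribution (ii) concrete, the sketch below simulates one distributed training iteration: each (simulated) worker computes a gradient on its local shard, compresses it with a simple sign-plus-scale quantizer before communication, and the server averages the dequantized gradients. The quantizer and the worker abstraction are illustrative assumptions for exposition; DAT's actual gradient quantization and communication scheme may differ.

```python
import numpy as np

def quantize_sign(g):
    """Compress a gradient to its sign plus a single scalar scale (mean |g|),
    reducing per-coordinate communication to roughly one bit."""
    scale = np.mean(np.abs(g))
    return np.sign(g), scale

def dequantize(sign, scale):
    """Reconstruct an approximate gradient from the compressed message."""
    return sign * scale

def distributed_step(w, worker_batches, local_grad_fn, lr=0.1):
    """One simulated DAT iteration: each worker computes a local gradient,
    quantizes it, and the quantized gradients are averaged (all-reduce)."""
    agg = np.zeros_like(w)
    for batch in worker_batches:
        g = local_grad_fn(w, batch)      # gradient on the local data shard
        sign, scale = quantize_sign(g)   # what actually gets communicated
        agg += dequantize(sign, scale)
    return w - lr * agg / len(worker_batches)
```

On a simple quadratic objective with slightly heterogeneous per-worker data, iterating this step still drives the iterate toward the shared minimizer, illustrating why the compressed updates can preserve convergence.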

2. RELATED WORK

Training robust classifiers The lack of robustness of DNNs has prompted a rapid expansion of defenses against adversarial attacks, ranging from heuristic defenses to robust (min-max) optimization based approaches. However, many heuristic strategies are easily bypassed by stronger adversaries due to the presence of obfuscated gradients (Athalye et al., 2018). By contrast, min-max optimization-based training methods are generally able to offer significant gains in robustness. AT (Madry et al., 2017b), the first known min-max optimization-based defense, has inspired a wide range of other effective defenses. Examples include adversarial logit pairing (Kannan et al., 2018), input gradient or curvature regularization (Ross & Doshi-Velez, 2018; Moosavi-Dezfooli et al., 2019), trade-off between robustness and accuracy (TRADES) (Zhang et al., 2019b), distributionally robust training (Sinha et al., 2018), dynamic adversarial training (Wang et al., 2019b), robust input attribution regularization (Chen et al., 2019; Boopathy et al., 2020), certifiably robust training (Wong & Kolter, 2017; Dvijotham et al., 2018), and semi-supervised robust training (Stanforth et al., 2019; Carmon et al., 2019). In particular, some recent works proposed fast but approximate AT algorithms, such as 'free' AT (Shafahi et al., 2019), you only propagate once (YOPO) (Zhang et al., 2019a), and fast gradient sign method (FGSM) based AT (Wong et al., 2020). These algorithms achieve training speedup by simplifying the inner maximization step of AT. Although there is a vast literature on min-max optimization based robust training, it is designed for centralized model training and does not address the scalability of AT. In (Xie et al., 2019), a distributed version of vanilla AT was implemented via the large-batch SGD algorithm (Goyal et al., 2017), but it differs significantly from our proposal.
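The fast AT variants mentioned above cut the cost of the inner maximization to a single gradient-sign step. A minimal sketch of such a one-shot attack, with the random initialization used by FGSM-based fast AT, is given below; the step sizes and the gradient oracle `grad_fn` are illustrative assumptions, not a specific method's implementation.

```python
import numpy as np

def fgsm_attack(x, grad_fn, eps=0.1, alpha=0.125, rng=None):
    """One-shot attack generation: sample a random start inside the
    L-inf eps-ball, take a single gradient-sign step of size alpha,
    and project the perturbation back to the eps-ball."""
    if rng is None:
        rng = np.random.default_rng()
    delta = rng.uniform(-eps, eps, size=x.shape)      # random initialization
    delta = delta + alpha * np.sign(grad_fn(x + delta))  # single FGSM step
    delta = np.clip(delta, -eps, eps)                 # project to eps-ball
    return x + delta
```

Compared with an iterative PGD loop, this requires only one extra gradient evaluation per example, which is the source of the training speedup these methods report.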
Compared to (Xie et al., 2019), we adopt a different distributed training recipe (a layer-wise adaptive learning rate method vs. SGD) and provide an in-depth theoretical analysis quantifying the convergence rate of DAT.

Distributed model training Distributed optimization has been found to be effective for the standard training of machine learning models (Dean et al., 2012; Goyal et al., 2017; You et al., 2019; Chen et al., 2020). In contrast to centralized optimization, distributed learning enables increasing the batch size in proportion to the number of computing nodes/machines. However, it is challenging to train a

