DISTRIBUTED ADVERSARIAL TRAINING TO ROBUSTIFY DEEP NEURAL NETWORKS AT SCALE

Abstract

Current deep neural networks are vulnerable to adversarial attacks, where adversarial perturbations to the inputs can change or manipulate classification. To defend against such attacks, an effective and popular approach, known as adversarial training, has been shown to mitigate the negative impact of adversarial attacks by virtue of a min-max robust training method. While effective, this approach is difficult to scale well to large models on large datasets (e.g., ImageNet) in general. To address this challenge, we propose distributed adversarial training (DAT), a large-batch adversarial training framework implemented over multiple machines. DAT supports one-shot and iterative attack generation methods, gradient quantization, and training over labeled and unlabeled data. Theoretically, we provide, under standard conditions in the optimization theory, the convergence rate of DAT to the first-order stationary points in general non-convex settings. Empirically, on ResNet-18 and -50 under CIFAR-10 and ImageNet, we demonstrate that DAT either matches or outperforms state-of-the-art robust accuracies and achieves a graceful training speedup.

1. INTRODUCTION

The rapid increase of research in deep neural networks (DNNs) and their adoption in practice is, in part, owed to the significant breakthroughs made with DNNs in computer vision (Alom et al., 2018). Yet, despite the apparent power of DNNs, they suffer from a serious weakness: a lack of robustness. That is, DNNs can easily be manipulated by an adversary to output drastically different classifications, in a controlled and directed way. This process is known as an adversarial attack and is considered one of the major hurdles to using DNNs in security-critical and real-world applications (Goodfellow et al., 2015; Szegedy et al., 2013; Carlini & Wagner, 2017; Papernot et al., 2016; Kurakin et al., 2016; Eykholt et al., 2018; Xu et al., 2019b). Methods to train DNNs to be robust against adversarial attacks are now a major research focus (Xu et al., 2019a), but most are far from satisfactory (Athalye et al., 2018), with the exception of the adversarial training (AT) approach (Madry et al., 2017b). AT is a min-max robust training method that minimizes the worst-case training loss at adversarially perturbed examples. AT has inspired a wide range of state-of-the-art defenses (Kannan et al., 2018; Ross & Doshi-Velez, 2018; Moosavi-Dezfooli et al., 2019; Zhang et al., 2019b; Wang et al., 2019b; Sinha et al., 2018; Chen et al., 2019; Boopathy et al., 2020; Wong & Kolter, 2017; Dvijotham et al., 2018; Stanforth et al., 2019; Carmon et al., 2019; Shafahi et al., 2019; Zhang et al., 2019a; Wong et al., 2020), which ultimately resort to min-max optimization. However, these methods, together with AT, are generally difficult to scale to large networks on large datasets. While scaling AT is important, doing so effectively is non-trivial. We find that the direct solution of distributing the data batch across multiple machines may not work and leaves many questions unanswered.
First, if the direct solution does not scale the batch size with the number of machines, then it does not speed up training and incurs significant communication costs (since the number of training iterations is not reduced over a fixed number of epochs). Second, without proper design, the direct application of a large batch size to distributed adversarial training introduces a significant loss in both standard accuracy and adversarial robustness (e.g., more than a 10% performance drop for ResNet-18 on CIFAR-10, as shown by our experiments). Third, the direct approach does not provide a general algorithmic framework, which is needed to support different variants of AT, large-batch optimization, and efficient communication. Taking all factors into consideration, a question naturally arises: Can we speed up AT by leveraging distributed learning with full utility of multiple computing nodes (machines), even when each has access to only limited GPU resources? Although a few works made empirical efforts to scale up AT by simply using multiple computing nodes (Xie et al., 2019; Kang et al., 2019; Qin et al., 2019), they were limited to specific use cases and lacked a thorough study, in theory or in practice, of when and how distributed learning helps. By contrast, we propose a principled and theoretically grounded distributed (large-batch) adversarial training (DAT) framework that makes full use of the computing capability of multiple data-locality (distributed) machines, and show that DAT expands the capacity of data storage and the computational scalability. We summarize our main contributions below.

Contributions
(i) We provide a general problem formulation of DAT, which supports multiple distributed variants of AT, e.g., supervised AT and semi-supervised AT.
(ii) We propose a principled algorithmic framework for DAT, which, different from conventional AT, supports large-batch DNN training (without losing performance over a fixed number of epochs) and allows the transmission of compressed gradients for efficient communication.
(iii) We theoretically quantify the convergence speed of DAT to first-order stationary points in general non-convex settings at a rate of O(1/√T), where T is the total number of iterations. This result matches the standard convergence rate of classic training algorithms, e.g., stochastic gradient descent (SGD), for minimization-only problems.
(iv) We conduct a comprehensive empirical study of DAT, showing that it not only speeds up the training of large models on large datasets but also matches (and even exceeds) state-of-the-art robust accuracies in different attack and learning scenarios. For example, DAT on ImageNet with 6 × 6 (machines × GPUs per machine) yields 38.45% robust accuracy (comparable to 40.38% from AT) but requires only 16.3 hours of training time (a 6× larger batch size is allowed in DAT), a 3.1× speedup over AT on a single machine with 6 GPUs. We also evaluate the empirical performance of DAT across different computing configurations and learning regimes, e.g., semi-supervised learning and transfer learning.

2. RELATED WORK

Training robust classifiers The lack of robustness of DNNs has prompted a rapid expansion of defenses against adversarial attacks, ranging from heuristic defenses to robust (min-max) optimization based approaches. However, many heuristic strategies are easily bypassed by stronger adversaries due to the presence of obfuscated gradients (Athalye et al., 2018). By contrast, min-max optimization-based training methods are generally able to offer significant gains in robustness. AT (Madry et al., 2017b), the first known min-max optimization-based defense, has inspired a wide range of other effective defenses. Examples include adversarial logit pairing (Kannan et al., 2018), input gradient or curvature regularization (Ross & Doshi-Velez, 2018; Moosavi-Dezfooli et al., 2019), trade-off between robustness and accuracy (TRADES) (Zhang et al., 2019b), distributionally robust training (Sinha et al., 2018), dynamic adversarial training (Wang et al., 2019b), robust input attribution regularization (Chen et al., 2019; Boopathy et al., 2020), certifiably robust training (Wong & Kolter, 2017; Dvijotham et al., 2018), and semi-supervised robust training (Stanforth et al., 2019; Carmon et al., 2019). In particular, some recent works proposed fast but approximate AT algorithms, such as 'free' AT (Shafahi et al., 2019), you only propagate once (YOPO) (Zhang et al., 2019a), and fast gradient sign method (FGSM) based AT (Wong et al., 2020). These algorithms achieve a training speedup by simplifying the inner maximization step of AT. Although there is a vast literature on min-max optimization based robust training, it targets centralized model training and does not address the scalability of AT. In (Xie et al., 2019), although a distributed version of vanilla AT was implemented via the large-batch SGD algorithm (Goyal et al., 2017), it is significantly different from our proposal.
Compared to (Xie et al., 2019), we adopt a different distributed training recipe (a layer-wise adaptive learning rate method vs. SGD) and provide an in-depth theoretical analysis quantifying the convergence rate of DAT. Distributed model training Distributed optimization has been found to be effective for the standard training of machine learning models (Dean et al., 2012; Goyal et al., 2017; You et al., 2019; Chen et al., 2020). In contrast to centralized optimization, distributed learning enables increasing the batch size in proportion to the number of computing nodes/machines. However, it is challenging to train a model via large-batch optimization without incurring an accuracy loss compared to standard training with the same number of epochs (Krizhevsky, 2014; Keskar et al., 2016). To tackle this challenge, it was shown in (You et al., 2017; 2018; 2019) that adapting learning rates to the increase of the batch size is an essential means of boosting the performance of large-batch optimization. A layer-wise adaptive learning rate strategy was then proposed to speed up training while preserving accuracy. Although these works demonstrate several successful applications of distributed learning in training standard image classifiers, they leave open the question of how to build robust DNNs with DAT. In this paper, we show that the power of layer-wise adaptive learning rates also applies to DAT. Since distributed learning introduces machine-to-machine communication overhead, another line of work (Alistarh et al., 2017; Yu et al., 2019; Bernstein et al., 2018; Wangni et al., 2018; Stich et al., 2018; Wang et al., 2019a) focused on the design of communication-efficient distributed optimization algorithms. The literature on distributed learning is extensive, but the problem of distributed min-max optimization is less explored, with some exceptions (Srivastava et al., 2011; Notarnicola et al., 2018; Hanada et al., 2017; Tsaknakis et al., 2020; Liu et al., 2019a; b).
A key difference from our work is that none of the aforementioned literature studied large-batch min-max optimization with applications to training robust DNNs, either theoretically or empirically. While there are recently proposed algorithms for training Generative Adversarial Nets (GANs) (Liu et al., 2019a; b), training robust DNNs against adversarial examples is intrinsically different from GAN training. In particular, training robust DNNs requires inner maximization with respect to each training sample rather than empirical maximization with respect to model parameters. This essential difference leads to different optimization goals, algorithms, convergence analyses, and implementations.

3. PROBLEM FORMULATION

In this section, we first review the standard setup of adversarial training (AT) (Madry et al., 2017b). We then motivate the need for distributed AT (DAT) and propose a general min-max setup for DAT. Adversarial training AT (Madry et al., 2017b) is a min-max optimization method for training robust ML/DL models against adversarial examples (Goodfellow et al., 2015). Formally, AT solves the problem

minimize_θ E_{(x,y)∈D} [ maximize_{‖δ‖_∞ ≤ ϵ} ℓ(θ, x + δ; y) ],   (1)

where θ ∈ R^d denotes the vector of model parameters, δ ∈ R^n is the vector of input perturbations within an ℓ_∞ ball of the given radius ϵ, namely ‖δ‖_∞ ≤ ϵ, (x, y) ∈ D corresponds to the training example x with label y in the dataset D, and ℓ represents a pre-defined training loss, e.g., the cross-entropy loss. The rationale behind problem (1) is that the model θ is robustly trained against the worst-case loss induced by adversarially perturbed samples. It is worth noting that the AT problem (1) is different from conventional stochastic min-max optimization problems (e.g., GAN training (Goodfellow et al., 2014)): in (1), the stochastic sampling corresponding to the expectation over (x, y) ∈ D is conducted prior to the inner maximization operation. This difference leads to the sample-specific adversarial perturbation δ(x) := argmax_{‖δ‖_∞ ≤ ϵ} ℓ(θ, x + δ; y).
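As a worked example of the inner maximization in (1) (illustrative only, not part of the paper's method): for a linear model with logistic loss, the loss is monotonically decreasing in the margin y·wᵀ(x + δ), so the worst-case ℓ∞-bounded perturbation has the closed form δ* = −ϵ·y·sign(w). A minimal NumPy sketch, with all names hypothetical:

```python
import numpy as np

def logistic_loss(w, x, y):
    # y in {-1, +1}; standard logistic loss log(1 + exp(-y * w^T x))
    return np.log1p(np.exp(-y * np.dot(w, x)))

def worst_case_perturbation(w, y, eps):
    # The loss decreases in the margin y * w^T (x + delta), so the maximizer
    # over the l_inf ball of radius eps is delta* = -eps * y * sign(w).
    return -eps * y * np.sign(w)

w = np.array([0.5, -1.0, 2.0])
x = np.array([1.0, 1.0, 1.0])
y = 1.0
eps = 0.1

delta_star = worst_case_perturbation(w, y, eps)
# Sanity check: no random feasible perturbation should beat the closed form.
rng = np.random.default_rng(0)
best_random = max(
    logistic_loss(w, x + rng.uniform(-eps, eps, size=3), y) for _ in range(1000)
)
assert logistic_loss(w, x + delta_star, y) >= best_random
```

For DNNs no such closed form exists, which is why AT relies on iterative inner maximization (Sec. 4).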

Distributed AT (DAT)

The need for AT in a distributed setting arises from at least the following two aspects: 1) training data are distributed, provided by multiple parties, which expands the individual capability of data storage or preserves data privacy; 2) computing units are often distributed, provided by distributed machines, which enables large-batch optimization and thus improves AT's scalability. Let us consider a popular parameter-server model of distributed learning (Dean et al., 2012). Formally, there exist M workers, each of which has access to a local dataset D^(i), so that D = ∪_{i=1}^M D^(i). There also exists a server/master node (e.g., one of the workers could act as the server), which collects local information (e.g., individual gradients) from the other workers to update the model parameters θ. Spurred by (1), DAT solves problems of the following generic form:

minimize_θ (1/M) Σ_{i=1}^M f_i(θ; D^(i)),  f_i(θ; D^(i)) := λ E_{(x,y)∈D^(i)} [ℓ(θ; x, y)] + E_{(x,y)∈D^(i)} [ maximize_{‖δ‖_∞ ≤ ϵ} φ(θ, δ; x, y) ],   (2)

where f_i denotes the local cost function at the ith worker, φ is a robustness regularizer against the input perturbation δ, and λ ≥ 0 is a regularization parameter that strikes a balance between the training loss and the worst-case robustness regularization. In (2), if M = 1, D^(1) = D, λ = 0, and φ = ℓ, then the DAT problem reduces to the AT problem (1). We cover two categories of (2):
• DAT with labeled data: In (2), we consider φ(θ, δ; x, y) = ℓ(θ, x + δ; y) with labeled training data (x, y) ∈ D^(i) for i ∈ [M]. Here [M] denotes the integer set {1, 2, . . . , M}.
• DAT with unlabeled data: In (2), different from DAT with labeled data, D^(i) contains an unlabeled dataset U^(i) (namely, U^(i) ⊆ D^(i)), and we define the robust regularizer φ as (Stanforth et al., 2019; Zhang et al., 2019b): φ(θ, δ; x) = CE(z(x + δ; θ), z(x; θ)). Here z(x; θ) represents the probability distribution over class labels predicted by the model θ, and CE denotes the cross-entropy function.
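The unlabeled-data regularizer φ(θ, δ; x) = CE(z(x + δ; θ), z(x; θ)) can be sketched on raw logits as follows (an illustrative NumPy toy, not the paper's implementation; treating the clean prediction as the CE target is an assumption about the argument order):

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

def unlabeled_robust_reg(logits_clean, logits_adv):
    # phi = CE(z(x + delta), z(x)): cross-entropy between the prediction on the
    # perturbed input and the (fixed) prediction on the clean input. No label needed.
    p_clean = softmax(logits_clean)   # assumed target distribution
    p_adv = softmax(logits_adv)
    return -np.sum(p_clean * np.log(p_adv + 1e-12))

# Identical predictions attain the minimum (the entropy of the clean prediction);
# a perturbation that shifts the prediction increases the regularizer.
clean = np.array([2.0, 0.5, -1.0])
same = unlabeled_robust_reg(clean, clean)
shifted = unlabeled_robust_reg(clean, np.array([0.0, 2.0, -1.0]))
assert shifted > same
```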

4. A UNIFIED ALGORITHMIC FRAMEWORK FOR DAT

In this section, we introduce the algorithmic framework of DAT, where we address three main challenges of algorithm design: a) inner maximization; b) gradient quantization/compression; and c) outer large-batch training, by leveraging state-of-the-art adversarial defense and distributed optimization techniques. We also obtain the first theoretical convergence rate for min-max optimization based robust training in a distributed large-batch setting. We first discuss the key components of DAT and their respective roles in performing scalable training; see the meta-form of DAT in Algorithm 1 (or the detailed Algorithm A1). DAT contains three algorithmic blocks. In the first block (Steps 3-8 of Algorithm A1), every distributed worker calls a maximization oracle to obtain the adversarial perturbation for each sample within a data batch, then computes the gradient of the local cost function f_i in (2) with respect to (w.r.t.) the model parameters θ. Every worker is allowed to quantize/compress its local gradient prior to transmission to the server. In the second block (Steps 9-10 of Algorithm A1), the server aggregates the local gradients and transmits the aggregated gradient (or its quantized version) to the other workers. In the third block (Steps 11-13 of Algorithm A1), the model parameters are updated by a minimization oracle at each worker based on the gradient information received from the server.

Algorithm 1 (meta-form of DAT, per iteration):
1: Input: current model parameters and per-worker data batches
2: for Worker i = 1, 2, . . . , M do            [Block 1]
3:   Inner maximization oracle (A1)
4:   Local gradient computation (A2)
5:   Worker-server communication (optional)
6: end for
7: Gradient aggregation at server (A3)          [Block 2]
8: Server-worker communication (optional)
9: for Worker i = 1, 2, . . . , M do            [Block 3]
10:  Model parameter update (A4)
11: end for

In contrast to standard AT, DAT allows an M times larger batch size for updating the model parameters θ. Thus, given the same number of epochs, DAT takes M times fewer gradient updates than AT. In addition, distributed learning introduces communication overhead.
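The three-block structure above can be sketched as a single-process simulation (a toy illustration with plain SGD, an identity placeholder for the attack oracle, and hypothetical function names; real DAT runs the workers on separate machines):

```python
import numpy as np

def dat_step(theta, worker_batches, attack_fn, grad_fn, lr):
    """One iteration of the DAT meta-loop, simulated in-process.

    Block 1: each worker perturbs its batch and computes a local gradient.
    Block 2: the server averages the local gradients.
    Block 3: every worker applies the same update to the shared model.
    """
    local_grads = []
    for batch in worker_batches:                  # Block 1
        perturbed = attack_fn(theta, batch)       #   inner maximization oracle
        local_grads.append(grad_fn(theta, perturbed))
    g = np.mean(local_grads, axis=0)              # Block 2: aggregation at server
    return theta - lr * g                         # Block 3: update (plain SGD here)

# Toy instantiation: per-sample quadratic loss ||theta - x||^2, no real attack.
attack = lambda theta, batch: batch               # identity "attack" placeholder
grad = lambda theta, batch: np.mean(2 * (theta - batch), axis=0)
theta = np.zeros(2)
batches = [np.ones((4, 2)), 3 * np.ones((4, 2))]  # two workers, mean target = 2
for _ in range(200):
    theta = dat_step(theta, batches, attack, grad, lr=0.1)
assert np.allclose(theta, [2.0, 2.0], atol=1e-3)
```

The aggregation in Block 2 is where gradient quantization and All-reduce-style communication would be plugged in.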
To address this issue, gradient quantization can optionally be performed at both the worker and server sides when a very large model is trained. We elaborate on DAT in what follows. Inner maximization: Iterative and one-shot solutions In DAT, each worker calls an inner maximization oracle (A1 in Algorithm 1) to generate adversarial perturbations. We consider two perturbation generation solvers: the iterative projected gradient descent (PGD) method used in standard AT (Madry et al., 2017a) and the one-shot fast gradient sign method (FGSM) (Goodfellow et al., 2015). We specify attack generation in the unified form

δ_t^(i)(x) = z_K,   z_k = Π_{[-ϵ,ϵ]^d}[ z_{k-1} + α · sign(∇_δ φ(θ_t, z_{k-1}; x)) ],   k ∈ [K],

where K is the total number of inner-loop iterations, the cases K = 1 and K > 1 correspond to the FGSM attack and the iterative PGD attack respectively, z_k denotes the PGD update of δ at the kth iteration, z_0 is a given initial point, Π_{[-ϵ,ϵ]^d}(·) denotes the projection onto the box constraint [-ϵ, ϵ]^d, α > 0 is a given step size, and sign(·) denotes the element-wise sign operation. The recent work (Wong et al., 2020) showed that if FGSM is conducted with a random initialization z_0 and a proper step size, e.g., α = 1.25ϵ, then FGSM can be as effective as iterative PGD in robust training. Indeed, we will show in Sec. 5 that the effectiveness of our proposed DAT-FGSM algorithm echoes the finding of Wong et al. (2020). We remark that other techniques (Shafahi et al., 2019; Zhang et al., 2019a) can also be used to simplify the inner maximization; we focus on FGSM since it is the most computationally light.
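The unified attack update can be sketched in NumPy on a toy linear surrogate loss (illustrative; in practice ∇_δ φ is obtained by backpropagation through the network, and all names here are hypothetical):

```python
import numpy as np

def attack(grad_phi, dim, eps, alpha, K, seed=None):
    """Unified attack generation: K = 1 with random z_0 is FGSM, K > 1 is PGD.

    grad_phi: callable returning the gradient of phi w.r.t. the perturbation z.
    """
    rng = np.random.default_rng(seed)
    z = rng.uniform(-eps, eps, size=dim)          # random initialization z_0
    for _ in range(K):
        z = z + alpha * np.sign(grad_phi(z))      # signed ascent step
        z = np.clip(z, -eps, eps)                 # projection onto [-eps, eps]^dim
    return z

# Toy loss phi(z) = c^T z, whose maximizer over the box is eps * sign(c).
c = np.array([0.3, -0.7, 1.2])
eps = 8 / 255
delta_pgd = attack(lambda z: c, dim=3, eps=eps, alpha=2 / 255, K=10, seed=0)
delta_fgsm = attack(lambda z: c, dim=3, eps=eps, alpha=1.25 * eps, K=1, seed=0)
assert np.allclose(delta_pgd, eps * np.sign(c))   # 10 PGD steps reach the corner
assert np.max(np.abs(delta_fgsm)) <= eps + 1e-12  # projection keeps FGSM feasible
```

For this linear toy loss, 10 PGD steps of size 2/255 are enough to reach the box corner from any random initialization, mirroring how iterative PGD finds stronger perturbations than one FGSM step.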

Gradient quantization

In contrast to standard AT, DAT requires worker-server communications (Steps 5 and 8 of Algorithm 1). That is, if a single-precision floating-point data type is used, then DAT needs to transmit 32d bits per worker-server communication at each iteration; recall that d is the dimension of θ. To reduce the communication cost, DAT has the option to quantize the transmitted gradients using fewer than 32 bits. We specify the gradient quantization operation as the randomized quantizer of (Alistarh et al., 2017; Yu et al., 2019). In Sec. 5 we will show that DAT, combined with gradient quantization, still delivers competitive performance. For example, the robust accuracy of ResNet-50 trained on ImageNet by an 8-bit DAT (performing quantization at Step 5 of Algorithm 1) is just 0.55% lower than that achieved by the 32-bit DAT. It is also worth mentioning that the All-reduce communication protocol can be regarded as a special case of the parameter-server setting considered in Algorithm 1 in which every worker acts as a server. In this case, the communication network becomes fully connected and the server-worker quantization (Step 8 of Algorithm 1) can be omitted. Outer minimization by layerwise adaptive learning rate (LALR) In DAT, the aggregated gradient (Step 7 in Algorithm 1) used for updating the model parameters (Step 10 in Algorithm 1) is built on a data batch that is M times larger than in standard AT. Recent works (You et al., 2019; 2017) showed that the use of LALR is key to successfully training standard DNNs with large data batches. Spurred by that, we incorporate LALR in DAT. Specifically, the parameter updating operation A in Eq.
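A generic b-bit randomized quantizer in the spirit of (Alistarh et al., 2017) can be sketched as follows (an unbiased stochastic-rounding scheme; the exact quantizer used by DAT is specified in the appendix, so this is only an assumed variant):

```python
import numpy as np

def quantize(g, bits, seed=None):
    """Stochastic b-bit quantization: unbiased rounding onto a uniform grid.

    Only the scale (a single float) plus one integer level per coordinate need
    to be transmitted, so a 32-bit float shrinks to roughly `bits` bits each.
    """
    rng = np.random.default_rng(seed)
    levels = 2 ** bits - 1
    scale = np.max(np.abs(g))
    if scale == 0:
        return g.copy()
    u = np.abs(g) / scale * levels                 # position on the grid
    lower = np.floor(u)
    # Round up with probability equal to the fractional part, so E[q] = g.
    q = lower + (rng.random(g.shape) < (u - lower))
    return np.sign(g) * q / levels * scale

g = np.array([0.01, -0.5, 0.25, 1.0])
q8 = quantize(g, bits=8, seed=0)
# Each coordinate lands within one grid cell of the true value.
assert np.max(np.abs(q8 - g)) <= np.max(np.abs(g)) / (2 ** 8 - 1)
```

The per-coordinate error is bounded by one grid cell (scale/(2^b − 1)), which is the source of the min{d/4^b, √d/2^b} variance term in the convergence analysis of Sec. 4.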
(A4) is given by

θ_{t+1,i} = θ_{t,i} − τ(‖θ_{t,i}‖₂) · (η_t / ‖u_{t,i}‖₂) · u_{t,i},   ∀ i ∈ [h],   (5)

where θ_{t,i} denotes the ith-layer parameters, h is the number of layers, u_t is a descent direction computed from the first-order information Q(ĝ_t), τ(‖θ_{t,i}‖₂) = min{max{‖θ_{t,i}‖₂, c_l}, c_u} is a layer-wise scaling factor of the adaptive learning rate η_t/‖u_{t,i}‖₂, c_l = 0 and c_u = 10 are set in our experiments (see Appendix 4.2 for results on tuning c_u), and θ_t = [θ_{t,1}, . . . , θ_{t,h}]. In (5), the specific form of the descent direction u_t is determined by the optimizer employed. For example, if the adaptive momentum (Adam) method is used, then u_t is the exponential moving average of past gradients scaled by the square root of the exponential moving average of squared past gradients (Reddi et al., 2018; Chen et al., 2018). The variant of (5) that uses Adam as the base algorithm is known as LAMB (You et al., 2019) in standard training. However, it was unclear whether the advantage of LALR is preserved in large-batch min-max optimization. We show, in both theory and practice, that the use of LALR significantly boosts the performance of DAT with large data batches.
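The layer-wise update (5) can be sketched in NumPy with the simplest choice u_t = g_t (an SGD direction; LAMB would instead use an Adam-style direction). All names are illustrative:

```python
import numpy as np

def lalr_update(theta_layers, u_layers, eta, c_l=0.0, c_u=10.0):
    """Layer-wise adaptive learning rate step, as in Eq. (5).

    Each layer's step length is eta * tau(||theta_i||), with the direction
    normalized per layer, so layers with larger weights take larger steps.
    """
    new_layers = []
    for th, u in zip(theta_layers, u_layers):
        tau = min(max(np.linalg.norm(th), c_l), c_u)   # layer-wise scaling factor
        u_norm = np.linalg.norm(u)
        if u_norm > 0:
            th = th - tau * eta / u_norm * u
        new_layers.append(th)
    return new_layers

theta = [np.array([3.0, 4.0]), np.array([0.6, 0.8])]   # layer norms 5.0 and 1.0
grads = [np.array([1.0, 0.0]), np.array([1.0, 0.0])]
out = lalr_update(theta, grads, eta=0.01)
# Step lengths are tau * eta: 0.05 for the first layer, 0.01 for the second.
assert np.allclose(out[0], [2.95, 4.0]) and np.allclose(out[1], [0.59, 0.8])
```

Note how the same raw gradient produces different effective step sizes per layer, which is exactly the mechanism that keeps large-batch updates well scaled across layers.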

CONVERGENCE ANALYSIS OF DAT

To the best of our knowledge, no existing work has tackled the convergence of DAT, nor accounted for LALR and gradient quantization even in standard AT, although AT itself has been shown to enjoy convergence guarantees (Wang et al., 2019b; Gao et al., 2019). DAT needs to quantify the descent errors from multiple sources (namely, gradient estimation, quantization, the adaptive learning rate, and the inner maximization oracle). In particular, the incorporation of LALR makes our analysis of DAT highly non-trivial. The fundamental challenge lies in the nonlinear coupling between the biased gradient estimate resulting from LALR and the additional error generated by the alternating updates in AT. In our theoretical results, we show that even when the gradient estimate is a function of the AT variables, the estimation bias resulting from the layer-wise normalization can still be compensated by increasing the batch size, so that the convergence rate of DAT achieves a linear speedup in reducing the gradient estimation error w.r.t. the increasing number of computing nodes.

Upon defining Ψ(θ) := (1/M) Σ_{i=1}^M f_i(θ; D^(i)) in (2), we measure the convergence of DAT by the first-order stationarity of Ψ. Prior to the convergence analysis, we impose the following assumptions: (A1) Ψ(θ) has layer-wise Lipschitz continuous gradients; (A2) φ(θ, δ; x) in Eq. (A1) is strongly concave with respect to δ and has Lipschitz continuous gradients; (A3) the stochastic gradient estimate in Eq. (A2) is unbiased and has bounded variance σ² for each worker. Note that the validity of (A2) can be justified from distributionally robust optimization (Sinha et al., 2018; Wang et al., 2019b); it is also needed for tractability of the analysis. We refer readers to Appendix 2.1 for more justification of assumptions (A1)-(A3). In Theorem 1, we present the sub-linear rate of DAT.

Theorem 1. Suppose that assumptions (A1)-(A3) hold, the inner maximizer Eq. (A1) provides an ε-approximate solution (i.e., the ℓ₂-norm of the inner gradient is upper bounded by ε), and the learning rate is set as η_t ∼ O(1/√T). Then {θ_t}_{t=1}^T generated by DAT yields the convergence rate

(1/T) Σ_{t=1}^T E‖∇_θ Ψ(θ_t)‖₂² = O( 1/√T + σ/√(MB) + min{ d/4^b, √d/2^b } + ε ),   (6)

where b denotes the number of quantization bits, and B = min{|B_t^(i)|, ∀t, i} stands for the smallest batch size per worker. Proof: Please see Appendix 3.

The error rate given by (6) involves four terms. The term O(σ/√(MB)) characterizes the benefit of using the large per-worker batch size B and M computing nodes in DAT: the variance of adaptive gradients (i.e., σ²) is reduced by a factor 1/(MB), where 1/M corresponds to the linear speedup from M machines. In (6), the term min{d/4^b, √d/2^b} arises from the variance of compressed gradients, and the other two terms capture the dependence on the number of iterations T and the ε-accuracy of the inner maximization oracle. We highlight that our convergence analysis (Theorem 1) is not merely a combination of LALR-enabled standard training analysis (You et al., 2019; 2017) and adversarial training convergence analysis (Wang et al., 2019b; Gao et al., 2019). Different from previous work, we address the fundamental challenges of (a) quantifying the descent property of the objective value in the presence of multi-source errors during alternating min-max optimization, and (b) deriving the theoretical relationship between the large data batch (across distributed machines) and the eventual convergence error of DAT.

5. EXPERIMENTS: DAT FOR ROBUST IMAGE CLASSIFICATION

In this section, we empirically evaluate DAT and show its success in training robust deep neural networks (DNNs) on the CIFAR and ImageNet datasets. We measure the performance of DAT in four aspects: a) accuracy on clean and adversarial test inputs, b) scalability to multiple computing nodes, c) incorporation of unlabeled data, and d) transferability of models pre-trained by DAT.

DNN models and Datasets

We use the DNN models Pre-act ResNet-18 (He et al., 2016b) and ResNet-50 (He et al., 2016a) for image classification, where the former is shortened to ResNet-18. We train these models on CIFAR-10 and ImageNet, using ResNet-18 for CIFAR-10 only. We also acquire unlabeled data from 80 Million Tiny Images following (Carmon et al., 2019). When studying a pre-trained model's transferability, CIFAR-100 is used as the target dataset for down-stream classification. Computing resources We train a DNN using p computing nodes, each of which contains q GPUs (Nvidia V100 or P100). Nodes are connected with 1 Gbps ethernet. A configuration of computing resources is denoted by p × q. If p > 1, then training is conducted in a distributed manner, and we split the training data into p subsets, each stored at a local node. Based on our resources, in the CIFAR experiments we consider p ∈ {1, 6, 18, 24} machines, each with 1 GPU. In the ImageNet experiments, we consider p ∈ {1, 6} machines, each with 6 GPUs. Given our computing budget, we are not able to use as many GPUs as (Xie et al., 2019; Kannan et al., 2018). However, as will become evident, our results clearly demonstrate the advantages of our proposal in applicability and scalability across various computing configurations and adversarial scenarios. In particular, our batch size of 6 × 512 = 3072 on ImageNet uses only 36 GPUs; by contrast, Xie et al. (2019) used a batch size of 4096 across 128 GPUs. Training setting We consider 2 variants of DAT: 1) DAT-PGD, namely Algorithm A1 with (iterative) PGD as the inner maximization oracle; and 2) DAT-FGSM, namely Algorithm A1 with FGSM as the inner maximization oracle.
Additionally, we consider 4 training baselines: 1) AT (Madry et al., 2017b); 2) Fast AT (Wong et al., 2020); 3) DAT w/o LALR, namely a direct distributed implementation of AT in the form of DAT-PGD or DAT-FGSM but without LALR; and 4) DAT-LSGD (Xie et al., 2019), namely a distributed implementation of large-batch SGD (LSGD) for standard AT. We remark that both AT and Fast AT are centralized training methods. In our setup, the number of GPUs on a single machine is limited to 6, so the largest batch size the centralized methods can use is around 2048 for CIFAR-10 and 85 for ImageNet. We also find that a direct distributed implementation of Fast AT scales quite poorly as the batch size grows, and is thus a weaker distributed baseline than DAT-FGSM w/o LALR. Lastly, we remark that the work (Xie et al., 2019) proposed modifying the model architecture by incorporating feature denoising. In contrast, DAT does not call for architecture modification. Thus, to enable a fair comparison, we use the same training recipe, LSGD, as (Xie et al., 2019) in the DAT setting, leading to the distributed training baseline DAT-LSGD. Unless specified otherwise, we choose the training perturbation size ϵ = 8/255 for CIFAR and ϵ = 2/255 for ImageNet, where ϵ was defined in (1). We use 10 steps and 4 steps for PGD attack generation in DAT (and its variants) on CIFAR and ImageNet, respectively. These training settings are consistent with previous state-of-the-art work (Zhang et al., 2019a; Wong et al., 2020). The number of training epochs is 100 for CIFAR-10 and 30 for ImageNet. Note that adversarially robust deep learning can be sensitive to the step size (learning rate) choice (Rice et al., 2020). For example, a cyclic learning rate trick can further accelerate the Fast AT algorithm (Wong et al., 2020).
However, such a trick becomes less effective when the batch size grows larger (namely, the number of iterations gets smaller); see Appendix 4.1. Meanwhile, the sensitivity of adversarial model training to the step size can be mitigated by an early-stop remedy, due to the existence of robust overfitting (Rice et al., 2020). Spurred by that, we use a standard piecewise-decay step size and an early-stop strategy during robust training. We refer readers to Appendix 4.2 for more implementation details. We observe that the direct extension of AT to DAT (namely, DAT-PGD w/o LALR) leads to significantly degraded standard test accuracy (TA) and robust accuracy (RA). When an 18 times larger batch size is used, DAT-PGD w/o LALR yields more than a 25% drop in TA and a 10% drop in RA compared to the best AT case. We find that the performance of DAT-PGD w/o LALR rapidly degrades as the number of computing nodes increases. A similar conclusion holds for DAT-FGSM w/o LALR vs. Fast AT. Furthermore, we observe that DAT-PGD outperforms DAT-LSGD (Xie et al., 2019) with 16.13% and 4.32% improvements in TA and RA, respectively. In Figure 1, we further compare our proposed DAT with the DAT-LSGD baseline in terms of TA/RA versus the number of computing nodes. Clearly, our approach scales more gracefully than the baseline (without losing much performance as the batch size increases with the number of computing nodes). Moreover, we consistently observe that DAT-PGD (or DAT-FGSM) achieves performance competitive with AT (or Fast AT) and enables a graceful training speedup, e.g., 3× using 6 machines for ImageNet. In practice, DAT does not achieve a linear speedup, mainly because of the communication cost. For example, when comparing the computation time of DAT-PGD (batch size 6 × 512) with that of AT (batch size 512) on ImageNet, the computation speedup (excluding the communication cost) is 6022/(1960 − 898) = 5.67, consistent with the ideal computation gain of the 6× larger batch size used in DAT-PGD.
Furthermore, we observe that when the largest batch size (24 × 2048) is used, DAT-FGSM takes only 500 seconds to obtain satisfactory robustness. When comparing DAT-FGSM with DAT-PGD, we consistently observe that the former offers satisfactory (and even better) RA, but inevitably introduces a TA loss. This phenomenon also holds for Fast AT versus AT, e.g., a 0.4% RA improvement vs. a 3.71% TA degradation on ImageNet. We also note that the per-epoch communication time decreases when more machines (24) are used, since a larger batch size requires fewer iterations per epoch, leading to less frequent communication among machines. In Appendix 4.3, we present additional results on CIFAR-10 using ResNet-50.

Robustness against different PGD attacks

In Figure 2, we evaluate the adversarial robustness of ResNet-50 on ImageNet learned by DAT-PGD and DAT-FGSM against PGD attacks with different numbers of steps and perturbation sizes (i.e., values of ϵ). We observe that DAT matches the robust accuracies of standard AT even against PGD attacks at different values of ϵ and numbers of steps. We also note that although DAT-FGSM has the worst TA (ϵ = 0), it yields slightly better robustness as the number of attack steps increases. Similar results can be found in Appendix 4.4 for (CIFAR-10, ResNet-18) against PGD (and C&W) attacks. DAT under unlabeled data In Table 2, we report the TA and RA of DAT in the semi-supervised setting (Carmon et al., 2019) with 500K unlabeled images mined from Tiny Images (Carmon et al., 2019). Compared to the supervised DAT results in Table 1, we observe that although the communication and computation costs increase due to the additional unlabeled images, both TA and RA are significantly improved. In particular, the performance of DAT-FGSM matches that of DAT-PGD. This suggests that unlabeled data might compensate for the TA loss induced by FGSM-based robust training algorithms. DAT from pre-training to fine-tuning In Figure 3, we investigate whether a DAT pre-trained model (ResNet-50) on a source dataset (ImageNet) can enable fast fine-tuning on a down-stream target dataset (CIFAR-100). Here we up-sample a CIFAR image to the same dimensions as an ImageNet image before feeding it into the pre-trained model (Shafahi et al., 2020). Compared with the direct application of DAT to the target dataset (without pre-training), pre-training enables fast adaptation to the down-stream CIFAR task in both TA and RA within just 3 epochs. Thus, the scalability of DAT to large datasets and multiple nodes offers great potential in the pre-training + fine-tuning paradigm. Similar results can be found in Appendix 4.5 on CIFAR-10.
Quantization effect. In Appendix 4.6, we also study how the performance of DAT is affected by gradient quantization. We find that when the number of bits is reduced from 32 to 8, the resulting TA and RA become worse than in the best 32-bit case. For example, in the worst case (8-bit 2-sided quantization) on CIFAR-10, TA drops by 1.52% and 6.32% for DAT-PGD and DAT-FGSM, respectively, and RA drops by 4.74% and 5.58%, respectively. Note that our main communication configuration is Ring-AllReduce, which calls for 1-sided (rather than 2-sided) quantization. We also observe that DAT-FGSM is more sensitive to the effect of gradient quantization than DAT-PGD. Even in the centralized setting, the use of 8-bit quantization can lead to a non-trivial drop in TA (see Table A5). However, quantization reduces the amount of data transmitted per iteration. We also show that if a high-performance computing cluster (with NVLink high-speed GPU interconnect (Foley & Danskin, 2017)) is used, the communication cost can be further reduced.

LALR vs. centralized and distributed solutions. We examine the effect of LALR on both centralized and distributed robust training methods given a batch size affordable to a single machine. We consider a variant of AT incorporating LALR, termed AT w/ LALR. As presented in Table 3, when the batch size is not large, centralized and distributed methods lead to very similar performance, although the former is slightly better as it is free of machine synchronization and communication, and the performance is not sensitive to LALR. By contrast, if the batch size is large (inapplicable to centralized cases, as in Table 1), then DAT + LALR outperforms DAT (namely, LALR matters).

6. CONCLUSIONS

We proposed distributed adversarial training (DAT) to scale up the training of adversarially robust DNNs over multiple machines. We showed that DAT is general in that it enables large-batch min-max optimization and supports gradient compression and different learning regimes. We proved that, under mild conditions, DAT is guaranteed to converge to a first-order stationary point at a sublinear rate. Empirically, we provided comprehensive experimental results demonstrating the effectiveness and usefulness of DAT in training robust DNNs with large datasets on multiple machines. In the future, it will be worthwhile to examine the speedup achieved by DAT in extreme training regimes, e.g., with a significantly larger number of PGD attack steps and/or computing nodes during training.

2.2. ORACLE OF MAXIMIZATION

In practice, Φ_i(θ; x), ∀i may not be obtained exactly, since the inner loop would need an infinite number of iterations to reach the exact maximum. Therefore, we allow a numerical error term resulting from the maximization step in (A1). This consideration makes the convergence analysis more realistic. First, we have the following criterion to measure the closeness of the approximate maximizer to the optimal one.

Definition 1. Under A2, if a point δ(x) satisfies
$$\max_{\|\delta\|\le\epsilon}\ \big\langle \delta-\delta^*(x),\ \nabla_\delta \phi(\theta,\delta^*(x);x)\big\rangle \le \varepsilon, \tag{A20}$$
then it is an ε-approximate solution to δ*(x), where
$$\delta^*(x) := \arg\max_{\delta}\ \phi(\theta,\delta;x), \tag{A21}$$
and x denotes the sampled data.

Condition (A20) is standard for defining approximate solutions of an optimization problem over a compact feasible set and has been widely studied in (Wang et al., 2019b; Lu et al., 2020). In the following, we show that when the inner maximization problem is solved accurately enough, the gradients of the function φ(θ, δ(x); x) at δ(x) and at δ*(x) are also close. A similar claim has been shown in (Wang et al., 2019b, Lemma 2). For completeness of the analysis, we provide the specific statement for our problem and give the detailed proof.

Lemma 1. Let δ_t^(k) be a (µε)/L_φ² approximate solution of the inner maximization problem for worker k, i.e., max_{δ^(k)} φ(θ, δ^(k); x_t), where x_t denotes the data sampled at the t-th iteration of DAT. Under A2, we have
$$\big\|\nabla_\theta \phi\big(\theta_t, \delta_t^{(k)}(x_t); x_t\big) - \nabla_\theta \phi\big(\theta_t, (\delta^*)_t^{(k)}(x_t); x_t\big)\big\|^2 \le \varepsilon. \tag{A22}$$

Throughout the convergence analysis, we assume that δ_t^(k)(x_t), ∀k, t are all (µε)/L_φ² approximate solutions of the inner maximization problem. Let us define
$$\big|[\nabla\phi(\theta_t, \delta_t^{(k)}(x_t); x_t)]_{ij} - [\nabla\phi(\theta_t, (\delta^*)_t^{(k)}(x_t); x_t)]_{ij}\big|^2 = \varepsilon_{ij}. \tag{A23}$$
From Lemma 1, we know that when δ_t^(k)(x_t) is a (µε)/L_φ² approximate solution, then
$$\sum_{i=1}^{h}\sum_{j=1}^{d_i} \varepsilon_{ij} = \sum_{i=1}^{h}\sum_{j=1}^{d_i} \big|[\nabla\phi(\theta_t, \delta_t^{(k)}(x_t); x_t)]_{ij} - [\nabla\phi(\theta_t, (\delta^*)_t^{(k)}(x_t); x_t)]_{ij}\big|^2 \le \varepsilon. \tag{A24}$$

2.3. FORMAL STATEMENTS OF CONVERGENCE RATE GUARANTEES

In what follows, we provide the formal statement of the convergence rate of DAT. In our analysis, we focus on 1-sided quantization, namely, Step 10 of Algorithm A1 is omitted, and we specify the outer minimization oracle as LAMB (You et al., 2019); see Algorithm A2. The addition and multiplication operations in LAMB are component-wise.

Theorem 2. Under A1-A4, suppose that {θ_t} is generated by DAT for a total number of T iterations, and let the problem dimension at each layer be d_i = d/h. Then the convergence rate of DAT is given by
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla_\theta \Psi(\theta_t)\|^2 \le \frac{\Delta_\Psi}{\eta_t c_l C T} + \frac{2\Big(\varepsilon + \frac{(1+\lambda)\sigma^2}{MB} + 4\delta^2\Big) + \kappa\sqrt{3}\,\|\chi\|_1}{C} + \frac{\eta_t c_u \kappa \|L\|_1}{2C}, \tag{A25}$$
where ∆_Ψ := E[Ψ(θ_1)] − Ψ*, η_t is the learning rate, κ = c_u/c_l, c_l and c_u are the constants used in LALR (5), χ is an error vector whose (ih + j)-th entry is √((1+λ)σ²_ij/(MB) + ε_ij + δ²_ij), the ε_ij were given in (A24), L = [L_1, ..., L_h]^T, C = (1/4)√(h(1−β₂)/(G²d)), 0 < β₂ < 1 is given in LAMB, B = min{|B^(i)|, ∀i}, and G is given in A1.

Remark 1. When the batch size is large, i.e., B ~ √T, the gradient estimate error is O(σ²/√T). Further, it is worth noting that, different from the convergence results of LAMB, there is a linear speedup in decreasing the gradient estimate error in DAT with respect to M, i.e., O(σ²/(M√T)), which is the advantage of using multiple computing nodes.

Remark 2. Note that A4 implies
$$\sum_{i=1}^{h}\sum_{j=1}^{d_i}\mathbb{E}\big[\big(Q([g^{(k)}(\theta)]_{ij}) - [g^{(k)}(\theta)]_{ij}\big)^2\big] \le \sum_{i=1}^{h}\sum_{j=1}^{d_i}\delta_{ij}^2 := \delta^2.$$
From (Alistarh et al., 2017, Lemma 3.1), we know that δ² ≤ min{d/s², √d/s} G². Recall that s = 2^b, where b is the number of quantization bits. Therefore, with a proper choice of the parameters, we obtain the following convergence result, which was shown as Theorem 1.

Corollary 1. Under the same conditions as Theorem 2, if we choose η_t ~ O(1/√T) and ε ~ O(ξ²), then
$$\frac{1}{T}\sum_{t=1}^{T}\mathbb{E}\|\nabla_\theta \Psi(\theta_t)\|^2 \le \frac{\Delta_\Psi}{c_l C \sqrt{T}} + \frac{(1+\lambda)\sigma^2}{MB} + \frac{c_u \kappa \|L\|_1}{2C\sqrt{T}} + O\Big(\xi,\ \frac{\sigma}{\sqrt{MT}},\ \min\Big\{\frac{d}{4^b}, \frac{\sqrt{d}}{2^b}\Big\}\Big). \tag{A27}$$
In summary, when the batch size is large enough, DAT converges to a first-order stationary point of problem (2), and there is a linear speedup in terms of M with respect to σ². Next, we provide the details of the proof.
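For intuition on the quantization term in Corollary 1, the bound δ² ≤ min{d/s², √d/s}G² from Remark 2 can be evaluated numerically; a minimal sketch:

```python
import math

def quantization_variance_bound(d, b, G):
    """Upper bound delta^2 <= min(d / s^2, sqrt(d) / s) * G^2 with s = 2**b
    quantization levels (Alistarh et al., 2017, Lemma 3.1)."""
    s = 2 ** b
    return min(d / s ** 2, math.sqrt(d) / s) * G ** 2

# More quantization bits -> a smaller worst-case compression error.
assert quantization_variance_bound(10**6, 8, 1.0) < quantization_variance_bound(10**6, 4, 1.0)
```

The bound shrinks exponentially in the bit budget b, which is why the O(min{d/4^b, √d/2^b}) term in (A27) becomes negligible for moderate b.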

3.1. PRELIMINARIES

In the proof, we use the following inequality and notation.

1. Young's inequality with parameter ϵ:
$$\langle x, y\rangle \le \frac{1}{2\epsilon}\|x\|^2 + \frac{\epsilon}{2}\|y\|^2,$$
where x, y are two vectors.

2. Define the historical trajectory of the iterates as F_t = {θ_{t−1}, ..., θ_1}.

3. We denote by [x]_i the parameters at the i-th layer of the neural net, and [x]_ij represents the j-th entry of the parameters at the i-th layer.

4. We define
$$g_t := \frac{1}{M}\sum_{i=1}^{M}\mathbb{E}_{x_t\in B^{(i)}}\Big[\lambda\nabla l(\theta_t; x_t) + \nabla_\theta \phi\big(\theta_t, \delta_t^{(i)}(x_t); x_t\big)\Big] = \frac{1}{M}\sum_{i=1}^{M} g_t^{(i)}. \tag{A29}$$

3.5. PROOF OF THEOREM 2

Proof. We set β₁ = 0 in LAMB for simplicity. From the layer-wise gradient Lipschitz continuity, we have
$$\Psi(\theta_{t+1}) \overset{A1}{\le} \Psi(\theta_t) + \sum_{i=1}^{h}\big\langle[\nabla_\theta\Psi(\theta_t)]_i,\ \theta_{t+1,i}-\theta_{t,i}\big\rangle + \sum_{i=1}^{h}\frac{L_i}{2}\|\theta_{t+1,i}-\theta_{t,i}\|^2 \tag{A54}$$
$$\overset{(a)}{\le} \Psi(\theta_t) \underbrace{-\,\eta_t\sum_{i=1}^{h}\sum_{j=1}^{d_i}\tau(\|\theta_{t,i}\|)\,[\nabla\Psi(\theta_t)]_{ij}\frac{[u_t]_{ij}}{\|u_{t,i}\|}}_{=:R} + \sum_{i=1}^{h}\frac{\eta_t^2 c_u^2 L_i}{2}, \tag{A55}$$
where in (a) we use (A30) and the upper bound of τ(‖θ_{t,i}‖). Next, we split the term R into two parts according to whether sign([∇Ψ(θ_t)]_{ij}) and sign([u_t]_{ij}) agree:
$$R = -\eta_t\sum_{i,j}\tau(\|\theta_{t,i}\|)[\nabla\Psi(\theta_t)]_{ij}\frac{[u_t]_{ij}}{\|u_{t,i}\|}\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})=\operatorname{sign}([u_t]_{ij})\big) - \eta_t\sum_{i,j}\tau(\|\theta_{t,i}\|)[\nabla\Psi(\theta_t)]_{ij}\frac{[u_t]_{ij}}{\|u_{t,i}\|}\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([u_t]_{ij})\big) \tag{A56}$$
$$\overset{(a)}{\le} -\eta_t c_l\sum_{i,j}\sqrt{\frac{1-\beta_2}{G^2 d_i}}\,[\nabla\Psi(\theta_t)]_{ij}[\hat g_t]_{ij}\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})=\operatorname{sign}([\hat g_t]_{ij})\big) - \eta_t\sum_{i,j}\tau(\|\theta_{t,i}\|)[\nabla\Psi(\theta_t)]_{ij}\frac{[u_t]_{ij}}{\|u_{t,i}\|}\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([u_t]_{ij})\big) \tag{A57}$$
$$\overset{(b)}{\le} -\eta_t c_l\sum_{i,j}\sqrt{\frac{1-\beta_2}{G^2 d_i}}\,[\nabla\Psi(\theta_t)]_{ij}[\hat g_t]_{ij} - \eta_t\sum_{i,j}\tau(\|\theta_{t,i}\|)[\nabla\Psi(\theta_t)]_{ij}\frac{[u_t]_{ij}}{\|u_{t,i}\|}\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([u_t]_{ij})\big), \tag{A58}$$
where in (a) we use the fact that ‖u_{t,i}‖ ≤ √(d_i/(1−β₂)) and √v_t ≤ G, and in (b) we add the nonnegative term
$$-\eta_t c_l\sum_{i,j}\sqrt{\frac{1-\beta_2}{G^2 d_i}}\,[\nabla\Psi(\theta_t)]_{ij}[\hat g_t]_{ij}\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([\hat g_t]_{ij})\big) \ge 0. \tag{A59}$$
Taking expectation on both sides of (A58) (with d_i = d/h), we have
$$\mathbb{E}[R] \le \underbrace{-\eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\sum_{i,j}\mathbb{E}\big[[\nabla\Psi(\theta_t)]_{ij}[\hat g_t]_{ij}\big]}_{=:U} + \underbrace{\eta_t c_u\sum_{i,j}\mathbb{E}\big[|[\nabla\Psi(\theta_t)]_{ij}|\,\mathbb{1}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([u_t]_{ij})\big)\big]}_{=:V}. \tag{A60}$$
Next, we bound U and V separately. First, writing the inner product between [∇Ψ(θ_t)]_i and [ĝ_t]_i more compactly,
$$U = -\eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\sum_{i=1}^{h}\mathbb{E}\big\langle[\nabla\Psi(\theta_t)]_i, [\hat g_t]_i\big\rangle \tag{A61}$$
$$= -\eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\sum_{i=1}^{h}\mathbb{E}\big\langle[\nabla\Psi(\theta_t)]_i, [\hat g_t]_i - [g_t]_i + [g_t]_i\big\rangle \tag{A62}$$
$$= -\eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\Big(\mathbb{E}\big\langle\nabla\Psi(\theta_t), g_t\big\rangle + \sum_{i=1}^{h}\mathbb{E}\big\langle[\nabla\Psi(\theta_t)]_i, [\hat g_t]_i - [g_t]_i\big\rangle\Big). \tag{A63}$$
Applying Lemma 2, we can get
$$U \overset{(A38)}{\le} -\frac{\eta_t c_l}{2}\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\,\mathbb{E}\|\nabla\Psi(\theta_t)\|^2 + \eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\Big(\varepsilon + \frac{(1+\lambda)\sigma^2}{MB}\Big) - \eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\sum_{i=1}^{h}\mathbb{E}\big\langle[\nabla\Psi(\theta_t)]_i, [\hat g_t]_i - [g_t]_i\big\rangle \tag{A64}$$
$$\overset{(a)}{\le} -\frac{\eta_t c_l}{2}\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\,\mathbb{E}\|\nabla\Psi(\theta_t)\|^2 + \eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\Big(\varepsilon + \frac{(1+\lambda)\sigma^2}{MB}\Big) + \frac{\eta_t c_l}{4}\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\,\mathbb{E}\|\nabla\Psi(\theta_t)\|^2 + \eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\,\mathbb{E}\|\hat g_t - g_t\|^2 \tag{A65}$$
$$\overset{(b)}{\le} -\frac{\eta_t c_l}{4}\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\,\mathbb{E}\|\nabla\Psi(\theta_t)\|^2 + \eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\Big(\varepsilon + \frac{(1+\lambda)\sigma^2}{MB}\Big) + \eta_t c_l\sqrt{\frac{h(1-\beta_2)}{G^2 d}}\,\delta^2, \tag{A66}$$
where in (a) we use Young's inequality (with parameter 2), and in (b) we use
$$\mathbb{E}\|\hat g_t - g_t\|^2 = \mathbb{E}\Big\|\frac{1}{M}\sum_{i=1}^{M}\big(Q(g_t^{(i)}) - g_t^{(i)}\big)\Big\|^2 \overset{A4}{\le} \delta^2. \tag{A67}$$
Second, we give the upper bound of V:
$$V \le \eta_t c_u\sum_{i,j}|[\nabla\Psi(\theta_t)]_{ij}|\underbrace{\mathbb{P}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([\hat g_t]_{ij})\big)}_{=:W}, \tag{A68}$$
where W can be quantified by using Markov's inequality followed by Jensen's inequality as follows:
$$W = \mathbb{P}\big(\operatorname{sign}([\nabla\Psi(\theta_t)]_{ij})\ne\operatorname{sign}([\hat g_t]_{ij})\big) \le \mathbb{P}\big[|[\nabla\Psi(\theta_t)]_{ij} - [\hat g_t]_{ij}| > |[\nabla\Psi(\theta_t)]_{ij}|\big] \tag{A69}$$
$$\le \frac{\mathbb{E}\big[|[\nabla\Psi(\theta_t)]_{ij} - [\hat g_t]_{ij}|\big]}{|[\nabla\Psi(\theta_t)]_{ij}|} \tag{A70}$$
$$\le \frac{\sqrt{\mathbb{E}\big[([\nabla\Psi(\theta_t)]_{ij} - [\hat g_t]_{ij})^2\big]}}{|[\nabla\Psi(\theta_t)]_{ij}|} \tag{A71}$$
$$\overset{(A42)}{\le} \frac{\sqrt{\mathbb{E}\big[([\bar g_t]_{ij} - [g_t^*]_{ij} + [g_t^*]_{ij} - [g_t]_{ij} + [g_t]_{ij} - [\hat g_t]_{ij})^2\big]}}{|[\nabla\Psi(\theta_t)]_{ij}|} \tag{A72}$$
$$\overset{(a)}{\le} \frac{\sqrt{3}\sqrt{\frac{(1+\lambda)\sigma_{ij}^2}{M|B|} + \varepsilon_{ij} + \delta_{ij}^2}}{|[\nabla\Psi(\theta_t)]_{ij}|}. \tag{A73}$$

4. ADDITIONAL EXPERIMENTS

4.1. DISCUSSION ON CYCLIC LEARNING RATE

It was shown in (Wong et al., 2020) that a cyclic learning rate (CLR) trick can further accelerate the Fast AT algorithm in the small-batch setting. In Figure A1, we present the performance of Fast AT with CLR versus batch size. We observe that when CLR meets the large-batch setting, it becomes significantly worse than its performance in the small-batch setting. The reason is that CLR requires a certain number of iterations to proceed through the cyclic schedule; however, a large data batch yields only a small number of iterations per epoch under a fixed number of epochs.
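The argument above, that a large batch leaves too few iterations for the cyclic schedule to complete, can be made concrete with a back-of-the-envelope sketch (dataset size 50,000 as in CIFAR-10; the triangular schedule below is a generic one-cycle shape, not necessarily the exact schedule of Wong et al., 2020):

```python
import math

def iters_per_epoch(n_samples, batch_size):
    """Number of gradient steps per epoch for a given batch size."""
    return math.ceil(n_samples / batch_size)

def triangular_lr(step, total_steps, lr_max):
    """Generic one-cycle triangular schedule: ramp up to lr_max at the
    midpoint of training, then ramp back down to zero."""
    half = total_steps / 2
    return lr_max * (step / half if step <= half else (total_steps - step) / half)

epochs, lr_max = 15, 0.2
small = iters_per_epoch(50000, 128) * epochs       # thousands of steps: smooth cycle
large = iters_per_epoch(50000, 24 * 2048) * epochs # only ~30 steps: coarse cycle
```

With batch size 24 × 2048, the whole cyclic schedule must be traversed in roughly 30 steps, so the learning rate jumps in very coarse increments, which is consistent with the degradation seen in Figure A1.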
Tuning the LALR hyperparameter c_u. We also evaluate the sensitivity of the performance of DAT to the choice of the hyperparameter c_u in LALR. In Table A1, we fix c_l = 0 (a natural choice) but vary c_u ∈ {8, 9, 10, 11, 12} when DAT-FGSM is executed on CIFAR-10 using an 18 × 2048 batch size, where c_u = 10 is our default choice. As we can see, both RA and TA are not very sensitive to c_u, and the default choice yields the best-RA model (despite the minor improvement).

4.3. OVERALL PERFORMANCE OF (CIFAR-10, RESNET-50)

In Table A2, we observe that in the large-batch setting, the proposed DAT-PGD and DAT-FGSM algorithms outperform the baseline algorithm DAT-PGD w/o LALR, and achieve performance competitive with AT and Fast AT, which require more iterations by using a smaller batch size. In Figure A5, we investigate whether a DAT pre-trained model (ResNet-50) on a source dataset (ImageNet) can offer fast fine-tuning to a downstream target dataset (CIFAR-10). Compared with the direct application of DAT to the target dataset (without pre-training), pre-training enables fast adaptation to the downstream CIFAR-10 task in both TA and RA within just 5 epochs.

4.6. QUANTIZATION

In Table A3, we present the performance of DAT with gradient quantization. Two quantization scenarios are covered: 1) quantization is conducted at each worker (Step 7 of Algorithm A1), and 2) quantization is conducted at both the worker and server sides (Steps 7 and 10 of Algorithm A1). As we can see, when the number of bits is reduced from 32 to 8, the communication cost and the amount of transmitted data are reduced by 2 and 4 times, respectively. Although gradient quantization introduces some performance loss, the resulting TA and RA are still comparable to the best 32-bit case. In the worst case on CIFAR-10 (8-bit 2-sided quantization), TA drops by 0.91% and 6.33% for DAT-PGD and DAT-FGSM, respectively, and RA drops by 4.73% and 5.22%, respectively. However, 8-bit 2-sided quantization transmits the least amount of data per iteration. To further reduce communication cost, we also run DAT on an HPC cluster. The computing nodes of the cluster are connected with InfiniBand (IB) and a PCIe Gen4 switch. To compare with the results in Table 1, we use 6 of the 57 nodes of the cluster. Each node has 6 Nvidia V100s interconnected with NVLink. We use Nvidia NCCL as the communication backend. In Table A4, we present the performance of DAT for (ImageNet, ResNet-50) with the use of HPC compared to a standard (non-HPC) distributed system. As we can see, the communication cost is largely alleviated, and thus the total training time is further reduced. In Table A5, we conduct an additional experiment by integrating a centralized method with the gradient quantization operation on CIFAR-10 under batch sizes 2048 and 6 × 2048, respectively. We specify the centralized method as Fast AT with LALR, where LALR is introduced to improve the scalability of Fast AT under the larger batch size 6 × 2048. Due to the centralized implementation, we only need 1-sided gradient quantization (namely, no server-worker communication is involved).

As we can see, when the batch size 2048 is used, Fast AT w/ LALR performs as well as Fast AT even in the presence of 8-bit gradient quantization. On the other hand, when the larger batch size 6 × 2048 is used, Fast AT w/ LALR can still preserve performance in the absence of gradient quantization. By contrast, Fast AT w/ LALR in the presence of quantization encounters a 6.05% TA drop. This suggests that even in the non-DAT setting, 8-bit gradient quantization hurts performance as the batch size becomes large. Thus, in DAT it is not surprising that 8-bit quantized gradients can cause a non-trivial accuracy drop, particularly when using 2-sided gradient quantization and a much larger data batch size (≥ 18 × 2048 on CIFAR-10). One possible reason is that the quantization error cannot easily be mitigated as the number of iterations decreases (due to the increased batch size under a fixed number of epochs).
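The communication saving discussed above follows directly from the bit count in Appendix 1: a quantized gradient needs at most 32 + d + b·d bits (a 32-bit norm, one sign bit and b level bits per coordinate) versus 32d bits at full precision. A small sketch (the dimension d below is an illustrative, roughly ResNet-scale number):

```python
def transmitted_bits(d, b=None):
    """Bits needed to send a d-dimensional gradient: 32d at full precision,
    or 32 + d + b*d for the randomized quantizer (a 32-bit norm, 1 sign bit
    and b level bits per coordinate)."""
    return 32 * d if b is None else 32 + d + b * d

d = 11_000_000            # illustrative parameter count, roughly ResNet-18 scale
full = transmitted_bits(d)
eight_bit = transmitted_bits(d, b=8)
# 8-bit quantization cuts per-iteration traffic by roughly 3.5x.
```

The ratio approaches 32/(1 + b) as d grows, so lowering b trades accuracy (as Table A3 shows) for proportionally less traffic per iteration.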



Code will be released. Our distributed implementation builds on the PyTorch distributed package (https://pytorch.org/docs/stable/distributed.html).



Figure 1: TA/RA comparison between DAT-FGSM and DAT-PGD vs. node-GPU configurations. Left: (CIFAR-10, ResNet-18). Right: (ImageNet, ResNet-50).

Figure 2: RA against PGD attacks for the model trained by DAT-PGD, DAT-FGSM, and AT following (ImageNet, ResNet-50) in Table 1. (Left) RA versus different perturbation sizes ε (in units of 1/255). (Right) RA versus different numbers of attack steps.

Figure 3: Fine-tuning ResNet-50 (pre-trained on ImageNet) under CIFAR-100. Here DAT-PGD is used for both pre-training and finetuning at 6 nodes with batch size 6 × 128.

Figure A1: TA/RA of Fast AT with CLR versus batch sizes.

Figure A5: Fine-tuning ResNet-50 (pre-trained on ImageNet) under CIFAR-10. Here DAT-PGD is used for both pre-training and fine-tuning at 6 nodes with batch size 6 × 128.

Table 1: Overall performance of DAT (in gray color), compared with baselines, in TA (%), RA (%), communication time per epoch (seconds), and total training time (including communication time) per epoch (seconds). For brevity, 'p × q' represents '# nodes × # GPUs per node', 'Comm.' represents communication cost, and 'Tr. Time' represents training time.

Overall performance of DAT. In Table 1, we evaluate our proposals and baselines in TA, RA, and communication and computation efficiency. Note that AT and Fast AT are centralized training methods on a single node under the same number of epochs as distributed training.

Table 2: DAT in semi-supervised learning with unlabeled data.

Table 3: Effect of LALR on centralized and distributed training under CIFAR-10 with the same batch size (2048).



Table A2: Overall performance of DAT (in gray color), compared with baselines, in TA (%) and RA (%).

Table A3: Effect of gradient quantization on the performance of DAT for various numbers of bits. The training settings of (CIFAR-10, ResNet-18) and (ImageNet, ResNet-50) are consistent with those in Table 1.

Table A4: Comparison to training over a high-performance computing (HPC) cluster of nodes.

Table A5: Effect of 8-bit quantization on the centralized robust training method Fast AT w/ LALR.

APPENDIX 1 DAT ALGORITHM FRAMEWORK

Algorithm A1 Distributed adversarial training (DAT) for solving problem (2)

1: Input: initial θ_1, dataset D^(i) for each of M workers, and T iterations
2: for iteration t = 1, 2, ..., T
3:   for worker i = 1, 2, ..., M
4:     Draw a finite-size data batch B_t^i ⊆ D^(i)
5:     For each data sample x ∈ B_t^i, call an inner maximization oracle (we omit the label or possible pseudo-label y of x for brevity)
6:     Compute the local gradient g_t^(i) of f_i in (2) with respect to θ given the perturbed samples
7:     (Optional) Call the gradient quantizer Q(·) and transmit Q(g_t^(i)) to the server
8:   end for
9:   Gradient aggregation at the server to obtain ĝ_t
10:  (Optional) Call the gradient quantizer ĝ_t ← Q(ĝ_t), and transmit ĝ_t to the workers
11:  for worker i = 1, 2, ..., M
12:    Call an outer minimization oracle A(·) to update θ
13:  end for
14: end for

Additional details on gradient quantization. Let b denote the number of bits (b ≤ 32), so that there are s = 2^b quantization levels. We specify the gradient quantization operation Q(·) in Algorithm A1 as the randomized quantizer (Alistarh et al., 2017; Yu et al., 2019). Formally, the quantization operation at the i-th coordinate of a vector g is given by (Alistarh et al., 2017)
$$Q(g_i) = \|g\|_2 \cdot \operatorname{sign}(g_i) \cdot \xi_i(g_i, s). \tag{A5}$$
In (A5), ξ_i(g_i, s) is a random number drawn as follows. Given |g_i|/‖g‖₂ ∈ [l/s, (l+1)/s] for some l ∈ N_+ with 0 ≤ l < s, ξ_i(g_i, s) takes the value (l+1)/s with probability (|g_i|/‖g‖₂)·s − l and the value l/s otherwise, where |a| denotes the absolute value of a scalar a, and ‖a‖₂ denotes the ℓ₂ norm of a vector a. The rationale behind (A5) is that Q(g_i) is an unbiased estimate of g_i, namely, E_{ξ_i(g_i,s)}[Q(g_i)] = g_i, with bounded variance. Moreover, we need at most (32 + d + bd) bits to transmit the quantized Q(g): 32 bits for ‖g‖₂, 1 bit for the sign of each g_i, and b bits for each ξ_i(g_i, s), whereas a single-precision g needs 32d bits. Clearly, a small b saves communication cost. We note that if every worker acts as a server in DAT, then the quantization operation at Step 10 of Algorithm A1 is no longer needed. In this case, the communication network becomes fully connected; with synchronized communication, this is favored for training DNNs under the All-reduce operation.
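The randomized quantizer (A5) can be sketched in plain Python as follows; this follows the construction described above (s = 2^b levels, unbiased rounding between adjacent levels) and is an illustrative sketch rather than the paper's released code:

```python
import math
import random

def quantize(g, b):
    """Randomized quantization Q(g): each coordinate becomes
    ||g||_2 * sign(g_i) * xi_i, where xi_i in {l/s, (l+1)/s} is drawn so
    that E[Q(g_i)] = g_i (unbiased), with s = 2**b quantization levels."""
    s = 2 ** b
    norm = math.sqrt(sum(x * x for x in g))
    if norm == 0.0:
        return [0.0] * len(g)
    out = []
    for x in g:
        r = abs(x) / norm * s          # position within the s levels
        l = min(int(r), s - 1)         # lower level index
        # round up with probability r - l, down otherwise (unbiased)
        xi = (l + 1) / s if random.random() < r - l else l / s
        out.append(math.copysign(norm * xi, x))
    return out

# Unbiasedness check: averaging many quantizations recovers g.
random.seed(0)
g = [0.3, -0.4, 1.2]
avg = [sum(col) / 2000 for col in zip(*(quantize(g, 2) for _ in range(2000)))]
```

Each coordinate is rounded stochastically to one of the two adjacent levels, so the per-coordinate expectation equals the original value while only the norm, a sign, and a b-bit level index need to be transmitted.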

2. THEORETICAL RESULTS

In this section, we quantify the convergence behavior of the proposed DAT algorithm. First, we define the required notation; in particular,
$$l_i(\theta) = \mathbb{E}_{x\in D^{(i)}}\, l(\theta; x), \tag{A8}$$
where the label y of x is omitted for labeled data. The objective function of problem (2) can then be expressed compactly as Ψ(θ), and the optimization problem is given by min_θ Ψ(θ). Therefore, if a point θ satisfies E‖∇_θΨ(θ)‖² ≤ ξ, then we say θ is a ξ-approximate first-order stationary point (FOSP) of problem (2). Prior to delving into the convergence analysis of DAT, we make the following assumptions.

2.1. ASSUMPTIONS

A1. The objective function has layer-wise Lipschitz continuous gradients with constant L_i for each layer, i.e.,
$$\|\nabla_i\Psi(\theta) - \nabla_i\Psi(\theta')\| \le L_i\|[\theta]_i - [\theta']_i\|, \quad \forall i,$$
where ∇_iΨ(·) denotes the gradient w.r.t. the variables at the i-th layer. We also assume that Ψ(θ) is lower bounded, i.e., Ψ* := min_θ Ψ(θ) > −∞, and that the gradient estimate is bounded, i.e., ‖ĝ_t‖_∞ ≤ G.

A2. The function φ(θ, δ; x) is strongly concave with respect to δ with parameter µ and has gradient Lipschitz continuity with constant L_φ:
$$\|\nabla_\theta\phi(\theta, \delta; x) - \nabla_\theta\phi(\theta, \delta'; x)\| \le L_\phi\|\delta - \delta'\|.$$

A3. The gradient estimate is unbiased and has bounded variance σ², where recall that B^(i) denotes the data batch used at worker i. Further, we define a component-wise bounded variance σ²_jk of the gradient estimate, where j denotes the index of the layer and k denotes the index of the entry at each layer. Under A3, we have Σ_{j=1}^h Σ_{k=1}^{d_j} σ²_jk ≤ σ².

A4. The component-wise compression error has bounded variance δ²_ij. Assumption A4 is satisfied when the randomized quantizer is used (Alistarh et al., 2017, Lemma 3.1).

3.2. DETAILS OF LAMB ALGORITHM

Algorithm A2 LAMB (You et al., 2019), whose input gradient ĝ_t is given by (A3).

3.3. PROOF OF LEMMA 1

Proof. From A2, we have the gradient Lipschitz continuity bound (A31). Also, since the function φ(θ, δ; x) is strongly concave with respect to δ, we obtain (A32). Next, we have two conditions on the quality of the solutions δ_t(x_t); adding them together yields (A35). Substituting (A35) into (A32) and combining with (A31) gives the claim.

3.4. DESCENT OF QUANTIZED LAMB

First, we provide the following lemma as a stepping stone for the subsequent analysis.

Lemma 2. Under A1-A3, suppose that the sequence {θ_t} is generated by DAT. Then the descent inequality (A38) holds.

Proof. From (A21), (A7), and A2, we obtain the stated bounds. Next, we quantify the difference between g_t and g*_t by gradient Lipschitz continuity, which is at most ε (A49), where in (a) we use Jensen's inequality. The difference between ḡ_t and g*_t can be upper bounded analogously. Applying Young's inequality with parameter 2 then gives the claim, where (a) holds due to (A51).

Therefore, combining (A55) with the upper bound of U shown in (A66) and that of V shown in (A68)-(A73), we arrive at (A77), where the error vector χ collects the component-wise error terms. Rearranging the terms and applying the telescoping sum over t = 1, ..., T completes the proof of Theorem 2.
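As a concrete illustration of the LAMB update with the layer-wise adaptive learning rate (LALR) used as the outer-minimization oracle, the following is a minimal single-layer sketch in plain Python (β₁ = 0 as in the proof of Theorem 2; the clipping form of τ(·) and the hyperparameter values are illustrative assumptions, not the released implementation):

```python
import math

def lamb_layer_step(theta, grad, v, lr, beta2=0.999, c_l=0.1, c_u=10.0, eps=1e-6):
    """One LAMB-style step for a single layer with beta1 = 0.

    v is the running second-moment estimate; u = grad / sqrt(v) is the
    normalized direction; tau = clip(||theta||, c_l, c_u) plays the role of
    the layer-wise adaptive learning rate, and the update is normalized
    by ||u|| as in the analysis."""
    v = [beta2 * vi + (1 - beta2) * gi * gi for vi, gi in zip(v, grad)]
    u = [gi / (math.sqrt(vi) + eps) for gi, vi in zip(grad, v)]
    u_norm = math.sqrt(sum(x * x for x in u)) or 1.0
    theta_norm = math.sqrt(sum(t * t for t in theta))
    tau = min(max(theta_norm, c_l), c_u)   # c_l <= tau(||theta||) <= c_u
    return [t - lr * tau * x / u_norm for t, x in zip(theta, u)], v

# One step on a toy two-parameter "layer".
theta, v = lamb_layer_step([1.0, 2.0], [0.1, -0.2], [0.0, 0.0], lr=0.01)
```

Because the update direction is normalized by ‖u‖ and scaled by the clipped trust ratio τ, the per-layer step size stays within [c_l, c_u] times the base learning rate, which is the property the convergence analysis exploits.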

4.4. ROBUSTNESS AGAINST PGD AND C&W ATTACKS

In Figure A3, we evaluate the adversarial robustness of ResNet-18 on CIFAR-10 learned by DAT-PGD and DAT-FGSM against PGD attacks of different steps and perturbation sizes (namely, values of ε). We consistently observe that DAT matches the robust accuracies of standard AT against PGD attacks at different values of ε and numbers of steps. Specifically, DAT has slightly smaller RA than AT when facing weak PGD attacks with ε less than 5/255 and fewer than 5 steps. Moreover, although DAT-FGSM has the worst RA against weak PGD attacks (which reduces to TA at ε = 0), it outperforms the other methods when the attacks become stronger in the CIFAR-10 experiments. In Figure A4, we present additional robust accuracies against C&W attacks (Carlini & Wagner, 2017) of different perturbation sizes. As we can see, the results are consistent with the aforementioned ones against PGD attacks.
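For reference, the ℓ∞ PGD attack used throughout these evaluations (Madry et al., 2017b) can be sketched on a toy differentiable loss; this illustrative version uses a numerically estimated gradient in plain Python rather than autograd on a DNN:

```python
def pgd_linf(loss, x0, eps, alpha, steps, h=1e-5):
    """l_inf PGD: ascend `loss` via signed (numerical) gradient steps of
    size alpha, projecting the perturbation back into [-eps, eps]."""
    def grad(x):
        g = []
        for i in range(len(x)):
            xp, xm = list(x), list(x)
            xp[i] += h
            xm[i] -= h
            g.append((loss(xp) - loss(xm)) / (2 * h))
        return g

    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        x = [xi + alpha * (1 if gi >= 0 else -1) for xi, gi in zip(x, g)]
        # project delta = x - x0 back onto the eps-ball around x0
        x = [min(max(xi, x0i - eps), x0i + eps) for xi, x0i in zip(x, x0)]
    return x

# Toy linear "loss": PGD drives x to the corner of the eps-ball.
loss = lambda x: x[0] + 2 * x[1]
x_adv = pgd_linf(loss, [0.0, 0.0], eps=8 / 255, alpha=2 / 255, steps=10)
```

With ε = 8/255 and step size 2/255, the iterate saturates the ℓ∞ budget after a few steps, matching the attack configuration style used in the robustness tables.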

