RADEMACHER COMPLEXITY OVER H∆H CLASS FOR ADVERSARIALLY ROBUST DOMAIN ADAPTATION

Abstract

In domain adaptation, a model is trained on a dataset generated from a source domain and its generalization is evaluated on a possibly different target domain. Understanding the generalization capability of the learned model is a longstanding question. Recent studies have demonstrated that adversarially robust learning under ℓ∞ attacks is even harder to generalize to different domains. To thoroughly study the fundamental difficulty behind adversarially robust domain adaptation, we propose to analyze a key complexity measure that controls cross-domain generalization: the adversarial Rademacher complexity over the H∆H class. For linear models, we show that the adversarial Rademacher complexity over the H∆H class is always greater than its non-adversarial counterpart, which reveals the intrinsic hardness of adversarially robust domain adaptation. We also establish upper bounds on this complexity measure, and extend them to the ReLU neural network class. Finally, based on our adversarially robust domain adaptation theory, we explain how adversarial training helps transfer model performance to different domains. We believe our results initiate the study of the generalization theory of adversarially robust domain adaptation, and could shed light on distributed adversarially robust learning from heterogeneous sources, a scenario typically encountered in federated learning applications.

1. INTRODUCTION

Domain adaptation is a key learning scenario where one tries to generalize a model learnt on a source domain to a target domain. How to predict target accuracy using source accuracy has been a longstanding research topic in both the theory community (Ben-David et al., 2006; Quinonero-Candela et al., 2008; Ben-David et al., 2010; Mansour et al., 2009; Cortes et al., 2015; Zhang et al., 2019; 2020) and the applications community (Long et al., 2015; Saito et al., 2018; You et al., 2019). From a theoretical perspective, this problem can be attacked by establishing bounds on the target-domain generalization of the source-domain-learnt model, using different complexity measures including the VC-dimension (Ben-David et al., 2006; 2010; Zhang et al., 2020) and the Rademacher complexity (Mansour et al., 2009; Zhang et al., 2019). In particular, the latter works (Mansour et al., 2009; Zhang et al., 2019) rely on the Rademacher complexity over a so-called H∆H function class to bound the gap between source and target generalization risks:

Definition 1 (Mansour et al. (2009)). Let the hypothesis space H be a set of real (vector)-valued functions defined over input space X and label space Y, H = {h_w : X → Y}, each parameterized by w ∈ W ⊆ R^d, and let ℓ : Y × Y → R_+ be the loss function. Given a dataset {x_1, ..., x_n} sampled i.i.d. from a distribution D defined over X, the empirical Rademacher complexity of H∆H over this dataset is defined as

R_D(ℓ • H∆H) = E_σ [ sup_{h_w, h_{w'} ∈ H} (1/n) Σ_{i=1}^n σ_i ℓ(h_w(x_i), h_{w'}(x_i)) ],   (1)

where σ_1, ..., σ_n are i.i.d. Rademacher random variables with P{σ_i = 1} = P{σ_i = -1} = 1/2.

Intuitively, the above quantity measures how well the loss vector realized by two hypotheses within H correlates with random sign vectors; better correlation implies a richer hypothesis class. However, unlike the classical Rademacher complexity, whose loss vector is computed between the predictions made by a hypothesis and the true labels, Eq.
(1) is defined purely over the predictions made by two hypotheses. The authors of Mansour et al. (2009); Zhang et al. (2019) have shown that this complexity measure controls the domain adaptation generalization bound. Unfortunately, none of those works gives a precise analysis of R_D(ℓ • H∆H). To the best of our knowledge, Kuroki et al. (2019) is the only prior work to analyze R_D(ℓ • H∆H) for the linear classifier class, but their analysis is not tight. Given the importance of this complexity measure, we are interested in characterizing how large it can be in terms of model dimension and data diversity, even for toy models, e.g., linear models. Hence, the first question we investigate in this paper is: for linear models, what quantities control the Rademacher complexity over the H∆H function class?

Meanwhile, in modern machine learning, practitioners are interested not only in transferring standard model accuracy to another domain, but also in transferring robustness. Consider the adversarially robust risk over a domain D:

R^adv-label_D(h_w, y_D) = E_{x∼D}[ max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x + δ), y_D(x)) ],

where y_D(•) is the labeling function. In the adversarially robust domain adaptation problem, we are interested in the robust risk when the same model h_w is tested on a new domain D'. Unfortunately, as shown empirically by Shafahi et al. (2019); Hong et al. (2021); Fan et al. (2021), a robust model learnt on the source domain can lose its robustness catastrophically on a different domain. That is, the gap between the robust risks on the old and new domains can be dramatically large, compared to the gap in standard risk. This observation naturally leads to the question: why is the robust risk harder to adapt to different domains? This is the second question we aim to examine in this paper.
To answer this question, inspired by the Rademacher complexity over the H∆H function class, we extend this complexity measure to the adversarial learning setting and propose the adversarial Rademacher complexity over the H∆H class. We show that the adversarial complexity is always greater than its non-adversarial counterpart, similar to the results proven in Yin et al. (2019) for the single-domain setting. Relying on this new complexity measure, we characterize, for the first time, the generalization bound of adversarially robust learning between source and target domains. Recent studies Salman et al. (2020); Deng et al. (2021) also show that a model trained adversarially on the source domain usually attains better standard accuracy on the target domain than a normally trained model. In this paper, by further exploring our generalization bound, we show that, given a large enough adversarial budget, small source adversarially robust risk almost guarantees small target-domain standard risk, with the residual error controlled by ϵ. This connection between source robust risk and target standard risk theoretically supports the advantage of performing robust training in domain adaptation tasks. Our contributions are summarized as follows:

• We study the Rademacher complexity over the H∆H class and propose its adversarial variant, a new complexity measure towards better understanding domain adaptation in adversarial learning. In both linear classification and regression settings, we first show that the adversarial Rademacher complexity over the H∆H class is greater than its non-adversarial counterpart. We also show that the adversarial complexity is smaller than its non-adversarial counterpart plus residual terms depending polynomially on the data dimension, model norm, and adversarial budget.
• We generalize our results to ReLU neural networks, deriving similar upper bounds on the adversarial H∆H Rademacher complexity of a 2-layer ReLU neural network.
• We establish a connection between robust learning and standard domain adaptation, which helps explain the widely observed phenomenon that adversarially trained models can generalize well to different domains.
• We support our theoretical analysis with experiments illustrating how adversarial training can help domain adaptation, especially with ℓ1 regularization. We also highlight numerically the difficulty of transferring adversarial robustness across domains.

2. PROBLEM SETUP

We adopt the following notation throughout this paper. We use lowercase bold letters to denote vectors, e.g., w, and uppercase bold letters to denote matrices, e.g., M. We use ∥w∥_p and ∥M∥_p to denote the ℓ_p-norm of a vector w and a matrix M, respectively. We define the (p, q)-group norm as ∥M∥_{p,q} := ∥(∥m_1∥_p, ..., ∥m_n∥_p)^⊤∥_q, where the m_i's are the columns of M. We use D to denote a data distribution (domain) defined over the instance space X, and D̂ the empirical distribution given by n_D samples drawn i.i.d. from D. We let H := {h_w : X → Y} be the hypothesis space, where the vector w ∈ W ⊆ R^d denotes the parametrization of h_w. Given a loss function ℓ : Y × Y → R and a data distribution D, we let R_D(h_w, h_{w'}) := E_{x∼D}[ℓ(h_w(x), h_{w'}(x))] be the risk of the disagreement between models h_w and h_{w'} on domain D. In particular, when the second argument of R_D(•, •) is the labeling function of D, it becomes the commonly used risk. We also define two adversarially robust risks: (1) the model-label robust risk R^adv-label_D(h_w, y) := E_{x∼D}[ max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x + δ), y(x)) ] and (2) the model-model robust risk R^adv_D(h_w, h_{w'}) := E_{x∼D}[ max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x + δ), h_{w'}(x + δ)) ].¹ In the domain adaptation scenario, we consider source and target domain distributions S and T, and let Ŝ and T̂ be the empirical source and target distributions with n_S and n_T samples, respectively. A key quantity that controls generalization in domain adaptation is the following discrepancy measure:

Definition 2 (H∆H discrepancy; Mansour et al. (2009); Ben-David et al. (2010)). Given a hypothesis class H and risk function R_D(•, •), the H∆H discrepancy between distributions S and T is defined by

disc_{H∆H}(S, T) = max_{h_w, h_{w'} ∈ H} |R_S(h_w, h_{w'}) - R_T(h_w, h_{w'})|.
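For intuition, the discrepancy in Definition 2 can be approximated from finite samples. The following sketch is our own illustration, not code from the paper; the linear hypothesis class with ℓ2-bounded weights and the absolute-difference loss are assumptions made for concreteness, and the max over hypothesis pairs is replaced by random search, so the returned value is only a lower estimate of the true discrepancy:

```python
import numpy as np

def hdh_discrepancy(XS, XT, W_bound=1.0, n_pairs=1000, seed=0):
    """Plug-in estimate of the H-delta-H discrepancy between two empirical
    samples, for linear hypotheses h_w(x) = <w, x> with ||w||_2 <= W_bound
    and loss l(a, b) = |a - b|.  The max over (h_w, h_w') is approximated
    by random search over hypothesis pairs."""
    rng = np.random.default_rng(seed)
    d = XS.shape[1]
    best = 0.0
    for _ in range(n_pairs):
        # sample a hypothesis pair on the radius-W_bound sphere
        w, wp = rng.normal(size=(2, d))
        w *= W_bound / np.linalg.norm(w)
        wp *= W_bound / np.linalg.norm(wp)
        # disagreement risks of the pair on each empirical domain
        risk_S = np.mean(np.abs(XS @ w - XS @ wp))
        risk_T = np.mean(np.abs(XT @ w - XT @ wp))
        best = max(best, abs(risk_S - risk_T))
    return best
```

As a sanity check, the estimate is exactly zero when the two samples coincide, and grows when the domains drift apart.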
The H∆H discrepancy defines a semi-distance over distributions; it does not depend on the labeling functions of the two distributions and is hence invariant to potential model shift across domains. Another advantage is that it can be efficiently estimated from finite samples whenever the Rademacher complexity over H∆H is finite. Based on Definitions 1 and 2, Mansour et al. (2009) derived the following generalization bound relating source and target domains.

Lemma 1 (Domain adaptation generalization lemma; consequence of Theorem 8 of Mansour et al. (2009)). Assume that the loss function ℓ is symmetric, obeys the triangle inequality, and is bounded by M. Then for all h_w ∈ H, the following holds with probability at least 1 - c:

R_T(h_w, y_T) ≤ R_S(h_w, h_{w*_S}) + disc_{H∆H}(Ŝ, T̂) + R_T(h_{w*_T}, h_{w*_S}) + R_T(h_{w*_T}, y_T) + R_Ŝ(ℓ • H∆H) + R_T̂(ℓ • H∆H) + 3M √(log(2/c)/(2n_S)) + 3M √(log(2/c)/(2n_T)),

where y_T is the labeling function on the target domain, and h_{w*_T}, h_{w*_S} are the best target and source models in H, i.e., h_{w*_S} = arg min_{h∈H} R_S(h, y_S) and h_{w*_T} = arg min_{h∈H} R_T(h, y_T).

The above bound connects the target risk and source risk with the help of the Rademacher complexity over H∆H and the disc_{H∆H} distance. It turns out that R_Ŝ(ℓ • H∆H) and R_T̂(ℓ • H∆H) are the key complexity measures that control generalization between different domains. Hence, to study the generalization of domain adaptation in the adversarial setting, we are naturally motivated to consider the following adversarially robust variant of this measure.

Definition 3 (Adversarial Rademacher complexity over the H∆H class). Let H be a set of real-valued hypothesis functions H = {h_w : X → Y}, and let ℓ(•, •) : Y × Y → R be the loss function.
Given a dataset {x_1, ..., x_n} sampled from distribution D, the empirical adversarial Rademacher complexity of H∆H over this dataset is defined as

R_D(ℓ̃ • H∆H) = E_σ [ sup_{h_w, h_{w'} ∈ H} (1/n) Σ_{i=1}^n σ_i max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x_i + δ), h_{w'}(x_i + δ)) ],   (3)

where σ_1, ..., σ_n are i.i.d. Rademacher random variables with P{σ_i = 1} = P{σ_i = -1} = 1/2.

As we can see, (3) is defined over the class

ℓ̃ • H∆H := { x → max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x + δ), h_{w'}(x + δ)) : h_w, h_{w'} ∈ H }.

We will see later how this quantity controls the generalization of adversarial domain adaptation. We also generalize the H∆H discrepancy to the adversarial setting:

¹ R^adv-label_D and R^adv_D are also called the constant-in-ball risk and the exact-in-ball risk in Gourdeau et al. (2021).

Definition 4 (Adversarial H∆H discrepancy). Given a hypothesis class H and risk function R^adv_D(•, •), the adversarial H∆H discrepancy between two distributions S and T is defined by

disc^adv_{H∆H}(S, T) = max_{h_w, h_{w'} ∈ H} |R^adv_S(h_w, h_{w'}) - R^adv_T(h_w, h_{w'})|.

The definition of the adversarial H∆H discrepancy is analogous to the standard one, and for linear models it can indeed be estimated as a function of the latter; we defer this result to Appendix F, Lemma 19.

Lemma 2 (Adversarially robust domain adaptation generalization lemma). Assume that the loss function ℓ is symmetric, obeys the triangle inequality, and is bounded by M. Then, for any hypothesis h_w ∈ H, the following holds with probability at least 1 - c:

R^adv-label_T(h_w, y_T) ≤ R^adv-label_S(h_w, y_S) + R^adv-label_S(h_{w*_S}, y_S) + disc^adv_{H∆H}(T̂, Ŝ) + R^adv_T(h_{w*_T}, h_{w*_S}) + R^adv-label_T(h_{w*_T}, y_T) + R_Ŝ(ℓ̃ • H∆H) + R_T̂(ℓ̃ • H∆H) + 3M √(log(2/c)/(2n_S)) + 3M √(log(2/c)/(2n_T)).

The proof of Lemma 2 is deferred to Appendix E. This lemma establishes the relation between the source and target adversarially robust risks.
It shows that the adversarial discrepancy and the adversarial complexity measures R_D(ℓ̃ • H∆H) on the source and target domains are the two key quantities controlling the deviation of the model's performance between the two domains. Hence, to answer our previously posed question of why the robust risk is harder to adapt to different domains, it is essential to study the connection between the adversarial and non-adversarial Rademacher complexities over the H∆H class.
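To make Definition 3 concrete for linear models with the squared loss, note that the inner maximization admits a closed form: writing v = w - w', we have max_{∥δ∥_∞ ≤ ϵ} (v^⊤(x + δ))² = (|v^⊤x| + ϵ∥v∥₁)², since v^⊤δ ranges over [-ϵ∥v∥₁, ϵ∥v∥₁]. The sketch below is our own illustration (not the paper's code): it uses this closed form and approximates both the expectation over σ and the sup over hypothesis pairs by sampling, so it only yields a crude lower estimate.

```python
import numpy as np

def adv_hdh_rademacher(X, eps, W_bound=1.0, n_sigma=200, n_pairs=500, seed=0):
    """Monte-Carlo estimate of the adversarial Rademacher complexity over
    H-delta-H for linear hypotheses with ||w||_2 <= W_bound and squared
    loss.  With v = w - w', the inner max over the l_inf ball equals
    (|v^T x| + eps * ||v||_1)^2, so no PGD is needed."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # random hypothesis pairs on the radius-W_bound sphere
    w = rng.normal(size=(n_pairs, d))
    w *= W_bound / np.linalg.norm(w, axis=1, keepdims=True)
    wp = rng.normal(size=(n_pairs, d))
    wp *= W_bound / np.linalg.norm(wp, axis=1, keepdims=True)
    V = w - wp
    # L[k, i] = (|v_k^T x_i| + eps * ||v_k||_1)^2
    L = (np.abs(X @ V.T).T + eps * np.linalg.norm(V, 1, axis=1)[:, None]) ** 2
    L = np.vstack([L, np.zeros((1, n))])  # include h_w = h_w', so estimate >= 0
    total = 0.0
    for _ in range(n_sigma):
        sigma = rng.choice([-1.0, 1.0], size=n)
        total += np.max(L @ sigma) / n  # max over sampled pairs approximates sup
    return total / n_sigma
```

Setting eps = 0 recovers a sampled estimate of the non-adversarial quantity in Definition 1 for the same class.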

3.1. BINARY CLASSIFICATION SETTING

We start with the binary classification problem, where labels come from {-1, +1}. As in Section 4.1 of Yin et al. (2019), we introduce the hypothesis class of linear functions with bounded weights:

H := { h_w : x → ⟨w, x⟩ : w ∈ R^d, ∥w∥_p ≤ W },   (5)

where p ≥ 1. Moreover, we consider losses of the form ℓ(h_w(x), y) := ϕ(y h_w(x)), where ϕ is a monotonically non-increasing and L_ϕ-Lipschitz function. With such a loss, the non-adversarial class of loss functions over H∆H becomes

ℓ • H∆H := { x → ℓ(h_w(x), h_{w'}(x)) := ϕ(h_w(x) h_{w'}(x)) : h_w, h_{w'} ∈ H }.

However, directly analyzing the ℓ • H∆H class is difficult, since we do not assume an explicit formula for ϕ. Hence, following Yin et al. (2019), let us define the class

f • H∆H := { x → h_w(x) h_{w'}(x) : h_w, h_{w'} ∈ H }.

We switch from the Rademacher complexity (1) over the function class induced by (5) to the following quantity:

R_D(f • H∆H) = E_σ [ sup_{∥w∥_p ≤ W, ∥w'∥_p ≤ W} (1/n) Σ_{i=1}^n σ_i (w^⊤ x_i)(w'^⊤ x_i) ].   (6)

Indeed, by the Ledoux-Talagrand contraction property of Rademacher complexity (Ledoux & Talagrand, 2013), we have R_D(ℓ • H∆H) ≤ L_ϕ R_D(f • H∆H). Thus, in the following lemma we aim at estimating R_D(f • H∆H).

Lemma 3 (Rademacher complexity for binary classification under the linear hypothesis class). Consider the hypothesis class defined in (5), and assume a dataset {x_1, ..., x_n} drawn from D. Let R_D(f • H∆H) be defined as in (6). Then the following holds for the non-adversarial Rademacher complexity:

R_D(f • H∆H) ≤ (W²/n) [ √(2 ∥Σ_{i=1}^n (x_i x_i^⊤)²∥₂ log(2d)) + ∥X∥²_{2,∞} log(2d)/3 ] × { 1, 1 ≤ p ≤ 2;  d^{1-2/p}, p > 2 },

where X ∈ R^{n×d} is the data matrix whose i-th row is x_i. The proof of Lemma 3 is deferred to Appendix G.1. Lemma 3 shows that the magnitude of the non-adversarial Rademacher complexity over the H∆H class depends on the spectral norm of the data covariance matrix.
It implies that a more diverse dataset results in a larger Rademacher complexity, and hence makes domain adaptation harder. We note that Kuroki et al. (2019) also gave an upper-bound estimate of R_D(f • H∆H) in their Lemma 5, but our bound is superior to theirs in two respects: (1) our bound is tighter in its dependence on the covariance matrix, since it depends on the spectral norm ∥Σ_{i=1}^n (x_i x_i^⊤)²∥₂ while theirs depends on the corresponding Frobenius-norm quantity; (2) we allow the model capacity to be controlled by any p-norm, while they only consider the 2-norm.

Next, let us specify the class of functions involved in the definition of the adversarial Rademacher complexity over H∆H in (3):

ℓ̃ • H∆H := { x → max_{∥δ∥_∞ ≤ ϵ} ϕ(h_w(x + δ) h_{w'}(x + δ)) : h_w, h_{w'} ∈ H },

and let us define

f̃ • H∆H := { x → min_{∥δ∥_∞ ≤ ϵ} h_w(x + δ) h_{w'}(x + δ) : h_w, h_{w'} ∈ H }.

With the above notation, we can characterize the adversarial counterpart of (6). Again by the Ledoux-Talagrand contraction property, as in Yin et al. (2019); Awasthi et al. (2020), we get R_D(ℓ̃ • H∆H) ≤ L_ϕ R_D(f̃ • H∆H), where

R_D(f̃ • H∆H) = E_σ [ sup_{∥w∥_p ≤ W, ∥w'∥_p ≤ W} (1/n) Σ_{i=1}^n σ_i min_{∥δ∥_∞ ≤ ϵ} (w^⊤(x_i + δ))(w'^⊤(x_i + δ)) ].   (8)

Theorem 1 (Adversarial Rademacher complexity for binary classification under the linear hypothesis class). Consider the hypothesis class defined in (5), and assume a dataset {x_1, ..., x_n} drawn from D. Let R_D(f • H∆H) and R_D(f̃ • H∆H) be defined as in (6) and (8), respectively. Then the adversarial Rademacher complexity over the H∆H function class under the linear hypothesis class (5) satisfies

R_D(f̃ • H∆H) ≤ R_D(f • H∆H) + 2(W²/√n) ϵ d^{1/p*} [ (1 + √(d log(3√n))) ϵ d^{1/p*} + 2∥X∥_{p*,∞} ],

where p* satisfies 1/p + 1/p* = 1, and X ∈ R^{n×d} is the data matrix whose i-th row is x_i. Moreover, the following lower bound also holds:

R_D(f̃ • H∆H) ≥ R_D(f • H∆H) + { 0, 1 ≤ p ≤ 2;  (W²/n)(1 - d^{1-2/p}) E_σ ∥Σ_{i=1}^n σ_i x_i x_i^⊤∥₂, p > 2 }.   (9)
The proofs of Theorem 1 are deferred to Appendices G.2 and G.3. From the upper bound, we notice that R_D(f̃ • H∆H) has the smallest dependence on the model dimension d when the weights are constrained in ℓ1-norm (p = 1). This mirrors the observation of Yin et al. (2019) in the single-domain setting. However, in their single-domain setting the adversarial Rademacher complexity is dimension-free when p = 1, whereas we still have a √d dependence. This heavier dependence is likely due to the fact that f̃ • H∆H couples two models, which enlarges the complexity. The bound has a sublinear O(1/√n) dependence on the number of samples n and a quadratic dependence on the maximum model weight W. It implies that models with suppressed norms can help adversarially robust domain adaptation, since the norm constraint reduces the Rademacher complexity, as we will see in the experiments. The lower bound (9) shows that the adversarial Rademacher complexity over H∆H is always at least as large as the non-adversarial one, which implies that adversarially robust domain adaptation is at least as hard as non-adversarial domain adaptation; this is why, as we will also see in our experiments, the gap between a model's source-domain and target-domain robust risks is usually larger than the corresponding gap in standard risks. Moreover, the gap between the adversarial and non-adversarial complexities is controlled by the spectral norm of the Rademacher-variable-weighted covariance matrix. This dependence reveals that robustness learnt on a more diverse dataset is harder to transfer, compared to standard domain adaptation.
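The quantities appearing in the Lemma 3 upper bound are straightforward to evaluate numerically. The sketch below is our own illustration; the exact constants follow our reading of the bound as transcribed above, so treat them as assumptions rather than a reference implementation:

```python
import numpy as np

def lemma3_upper_bound(X, W_bound, p=2.0):
    """Numeric evaluation of the Lemma 3 upper bound on the non-adversarial
    Rademacher complexity over H-delta-H for linear classifiers with
    ||w||_p <= W_bound.  The constants mirror our transcription of the bound."""
    n, d = X.shape
    # spectral norm of sum_i (x_i x_i^T)^2
    S = np.zeros((d, d))
    for x in X:
        M = np.outer(x, x)
        S += M @ M
    spec = np.linalg.norm(S, 2)
    # ||X||_{2,inf}: largest l2 norm of a data point
    row_norm_max = np.max(np.linalg.norm(X, axis=1))
    base = (W_bound ** 2 / n) * (np.sqrt(2 * spec * np.log(2 * d))
                                 + row_norm_max ** 2 * np.log(2 * d) / 3)
    # piecewise dimension factor: 1 for 1 <= p <= 2, d^{1-2/p} for p > 2
    return base if p <= 2 else base * d ** (1 - 2 / p)
```

Evaluating this on concrete data makes the roles of the data-diversity term (the spectral norm) and the dimension factor visible.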

3.2. LINEAR REGRESSION SETTING

In this section we consider linear regression problems. The hypothesis class of linear functions with bounded weights remains the same as in (5). However, we consider the following class of quadratic loss functions:

ℓ • H∆H := { x → ℓ(h_w(x), h_{w'}(x)) := (h_w(x) - h_{w'}(x))² : h_w, h_{w'} ∈ H }.   (10)

The following lemma establishes an upper bound on the non-adversarial Rademacher complexity over H∆H in this setting.

Lemma 4 (Rademacher complexity for regression under the linear hypothesis class). Let H be the set of linear functions with bounded weights as defined in (5). Then the following holds for the non-adversarial Rademacher complexity over the H∆H class:

R_D(ℓ • H∆H) ≤ (4W²/n) [ √(2 ∥Σ_{i=1}^n (x_i x_i^⊤)²∥₂ log(2d)) + (1/3)∥X∥²_{2,∞} log(2d) ] × { 1, 1 ≤ p ≤ 2;  d^{1-2/p}, p > 2 },

where X ∈ R^{n×d} is the data matrix whose i-th row is x_i. The proof of Lemma 4 is deferred to Appendix H.1. As in the binary classification case of Section 3.1, we relate the non-adversarial Rademacher complexity over the H∆H class to the spectral norm of the data covariance matrix.

Theorem 2 (Adversarial Rademacher complexity for regression under the linear hypothesis class). Let H be the set of linear functions with bounded weights as defined in (5). Then the following holds for the adversarial Rademacher complexity over the H∆H function class:

R_D(ℓ̃ • H∆H) ≤ R_D(ℓ • H∆H) + 4(W²/√n) (2√d ϵ ∥X∥_{2,∞} + dϵ²)(√(2d log(6√n)) + 1) × { 1, 1 ≤ p ≤ 2;  d^{1-2/p}, p > 2 },

where X ∈ R^{n×d} is the data matrix whose i-th row is x_i. Meanwhile, the following lower bound holds as well:

R_D(ℓ̃ • H∆H) ≥ R_D(ℓ • H∆H) + { 0, 1 ≤ p ≤ 2;  (4W²/n)(1 - d^{1-2/p}) E_σ ∥Σ_{i=1}^n σ_i x_i x_i^⊤∥₂, p > 2 }.

The proofs of Theorem 2 are deferred to Appendices H.2 and H.3. A few comments can be made on the above theorem. First, the upper bound on the adversarial Rademacher complexity again depends quadratically on W and on the adversarial budget ϵ, and super-linearly on the model dimension d.
Second, for the lower bound, we establish a gap between R_D(ℓ̃ • H∆H) and R_D(ℓ • H∆H) similar to the one in the classification setting, which means that data diversity also affects the hardness of adversarially robust domain adaptation in the regression setting.

4. EXTENSION TO NEURAL NETWORKS WITH RELU ACTIVATION

We next extend our analysis to the more complicated neural network function class. In this section we present our results for two-layer ReLU neural networks; that is, we consider the hypothesis class

H := { h_w : x → a^⊤ ReLU(Wx) : a ∈ R^d, W ∈ R^{d×d}, ∥a∥_p ≤ A, ∥W∥_p ≤ W }.   (11)

The following theorem relates the adversarial Rademacher complexity to its non-adversarial version. As in Section 3.1, we consider losses of the form ℓ(h_w(x), y) := ϕ(y h_w(x)) in the classification setting, and the ℓ2 loss in the regression setting.

Theorem 3 (Adversarial Rademacher complexity for the ReLU neural network class). Let H be the set of two-layer ReLU neural networks with bounded weights as defined in (11). Then the following holds for the adversarial Rademacher complexity over the H∆H function class with the classification loss:

R_D(f̃ • H∆H) ≤ R_D(f • H∆H) + (A²W²/n) ϵ √(d log(2d)) [ √(2 Σ_{i=1}^n (√d ϵ + 2∥x_i∥₂)²) + (1/3) log(2d)(√d ϵ + 2∥X∥_{2,∞}) ] × { 1, 1 ≤ p ≤ 2;  d^{2-4/p}, p > 2 }.

Similarly, the following holds for the adversarial Rademacher complexity over the H∆H function class with the ℓ2 loss:

R_D(ℓ̃ • H∆H) ≤ R_D(ℓ • H∆H) + (A²W²/n) ϵ √(d log(2d)) [ 3√(2 Σ_{i=1}^n (√d ϵ + 2∥x_i∥₂)²) + log(2d)(√d ϵ + 2∥X∥_{2,∞}) ] × { 1, 1 ≤ p ≤ 2;  d^{2-4/p}, p > 2 }.

The proofs of Theorem 3 are deferred to Appendices I.1 and I.2. As the theorem shows, we obtain upper bounds for the ReLU neural network class similar to those in the linear case: the adversarial Rademacher complexity is bounded by its non-adversarial version plus terms depending on the norms of the layers and of the data points. We note that, unlike the linear case where we show that the gap between the adversarial and non-adversarial Rademacher complexities is lower bounded, here we do not establish such a lower bound, due to the difficulty of analyzing the ReLU unit.
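For concreteness, the hypothesis class (11) and the model-model robust loss it induces can be written out directly. Unlike the linear case, the inner maximization is non-concave for ReLU networks, so the corner-sampling heuristic below is our own illustration (not the paper's method) and only lower-bounds the true inner max:

```python
import numpy as np

def relu_net(x, a, W):
    """Two-layer ReLU hypothesis h_w(x) = a^T ReLU(W x) from the class (11)."""
    return a @ np.maximum(W @ x, 0.0)

def adv_disagreement(x, a1, W1, a2, W2, eps, n_trials=200, seed=0):
    """Heuristic estimate of the model-model robust squared loss
        max_{||delta||_inf <= eps} (h_w(x+delta) - h_w'(x+delta))^2
    for two ReLU networks, obtained by sampling sign perturbations at the
    corners of the l_inf ball (not an exact solver)."""
    rng = np.random.default_rng(seed)
    # initialize with the clean disagreement, so the estimate never falls below it
    best = (relu_net(x, a1, W1) - relu_net(x, a2, W2)) ** 2
    for _ in range(n_trials):
        delta = eps * rng.choice([-1.0, 1.0], size=x.shape)
        val = (relu_net(x + delta, a1, W1) - relu_net(x + delta, a2, W2)) ** 2
        best = max(best, val)
    return best
```

With eps = 0 the estimate reduces to the clean disagreement loss, matching the fact that the adversarial quantities collapse to their standard counterparts when the budget vanishes.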

5. ADVERSARIAL TRAINING HELPS TRANSFER TO DIFFERENT DOMAIN

Here we discuss the connection between standard ERM learning and adversarially robust learning. As observed in prior works Salman et al. (2020); Deng et al. (2021), if a model is adversarially trained on the source domain, its standard accuracy on the target domain is sometimes better than if it had been fitted via vanilla ERM on the source domain. In this section, we try to explain this phenomenon from the adversarially robust domain adaptation perspective. We find that, when the adversarial budget is large enough, small source adversarial risk almost guarantees small target-domain standard risk. First, we introduce the following optimization problem.

Definition 5. Let p and p' be two vectors on the N-dimensional simplex and let ℓ be a {0,1}-valued vector. Let Λ be an arbitrary subset of [N] := {1, ..., N}. The Subset Sum Problem with Structural Objective is the following combinatorial optimization problem:

min_{l̃ ∈ {0,1}^N} |p^⊤ l̃ - p'^⊤ ℓ|   s.t.   l̃_i = ℓ_i, ∀ i ∈ [N] \ Λ.

We denote its optimal value by V*(p', p, ℓ, Λ).

The above problem is a variant of the Subset Sum Problem Hartmanis (1982), which is also NP-complete. We look for a subset of coordinates of a simplex vector p whose sum is closest to a given goal. The goal has a special structure: it is the sum of a subset of coordinates of another simplex vector p'. If the constraint set Λ contains more indices, the optimal value is smaller, since we can choose the value of more coordinates of l̃. In the following lemma, we explain how this combinatorial quantity connects the adversarially robust and standard risks for the binary classification task.

Lemma 5. Consider the binary classification task with the sign linear classifier class H = {h_w : h_w(x) = sign(w^⊤ x), ∥w∥_p ≤ W} and the 0-1 loss ϕ(x, y) = (1/2)|x - y|. Assume all domains share the same labeling function y(x) ∈ {-1, 1}.
Then the following holds for any T':

R_{T'}(h_w, y) ≤ R^adv-label_T(h_w, y) + V*(p', p, ℓ, Λ_ϵ),

where ℓ is the loss vector with ℓ_i = (1/2)|sign(w^⊤ x_i) - y(x_i)| for x_i ∈ X, and p, p' are the probability mass vectors of T and T', i.e., p(x) = P_{X∼T}(X = x) and p'(x) = P_{X∼T'}(X = x) for x ∈ X. Moreover, Λ_ϵ = { i : |w^⊤ x_i| ≤ ϵ∥w∥₁, x_i ∈ X, ∀w with ∥w∥_p ≤ W }. The corresponding proof is given in Appendix J. Lemma 5 shows that the standard risk on domain T' can be bounded by the robust risk on domain T plus a quantity controlled by ϵ. Since the set Λ_ϵ stores all indices i at which we may flip the value of l̃_i between 0 and 1, a larger ϵ yields more indices in Λ_ϵ, hence more coordinates of l̃ to play with, and thus a smaller value of V*.
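The optimal value V*(p', p, ℓ, Λ) in Definition 5 can be computed by brute force on small instances; the enumeration below is our own illustration of the definition (exponential in |Λ|, so only for toy examples):

```python
from itertools import product

def subset_sum_structural(p, p_prime, ell, Lam):
    """Brute-force solver for V*(p', p, ell, Lam) from Definition 5:
    minimize |p^T l - p'^T ell| over binary vectors l that agree with
    ell on every index outside the free set Lam."""
    N = len(p)
    goal = sum(pp * e for pp, e in zip(p_prime, ell))  # structural target p'^T ell
    best = float("inf")
    for bits in product([0, 1], repeat=len(Lam)):
        l = list(ell)                       # fixed coordinates copied from ell
        for idx, b in zip(Lam, bits):
            l[idx] = b                      # free coordinates take all 2^|Lam| values
        best = min(best, abs(sum(pi * li for pi, li in zip(p, l)) - goal))
    return best
```

As the discussion above predicts, enlarging the free set Λ can only decrease the optimal value, since every feasible l̃ for a smaller Λ remains feasible for a larger one.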

6. EMPIRICAL RESULTS

In this section, we verify the theoretical implications through empirical studies on a multi-domain dataset, DIGITS Ganin & Lempitsky (2015). DIGITS consists of 28 × 28 images from 5 different domains: MNIST (Lecun et al., 1998), SVHN (Netzer et al., 2011), USPS (Hull, 1994), SynthDigits (Ganin & Lempitsky, 2015), and MNIST-M (Ganin & Lempitsky, 2015). All domain datasets are subsampled to 7438 images to eliminate the effect of sample size on generalization. Given a model h_w parameterized by w, we consider two training methods:

Adversarial Training: min_w (1/n) Σ_{i=1}^n max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x_i + δ), y(x_i)),   (12)
Standard Training: min_w (1/n) Σ_{i=1}^n ℓ(h_w(x_i), y(x_i)).

To solve the inner maximization in (12), we use the k-step PGD (projected gradient descent) attack (Madry et al., 2018) with a constant noise magnitude ϵ. Following Madry et al. (2018), we use ϵ = 8/255, k = 7, and an inner-loop attack step size of 2/255, for both training and adversarial testing. We then use Adam to minimize the losses for 100 epochs with a learning rate of 10^-2, decayed in a cosine manner. We evaluate model performance by (1) standard accuracy (SA), the classification accuracy on the clean test set, and (2) robust accuracy (RA), the classification accuracy on adversarial images perturbed from the original test set.
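The k-step PGD inner maximization in (12) can be sketched as follows. The linear classifier and logistic loss here are our simplifying assumptions for a dependency-free illustration (the paper's experiments use a convolutional network), but the random start, signed gradient steps, and ℓ∞ projection match the attack of Madry et al. (2018):

```python
import numpy as np

def pgd_attack_linear(x, y, w, eps, k=7, step=2 / 255, seed=None):
    """k-step PGD attack against a linear classifier with logistic loss
    l(w; x, y) = log(1 + exp(-y w^T x)), y in {-1, +1}: gradient ascent on
    delta with projection onto the l_inf ball of radius eps."""
    rng = np.random.default_rng(seed)
    delta = rng.uniform(-eps, eps, size=x.shape)   # random start inside the ball
    for _ in range(k):
        margin = y * w @ (x + delta)
        grad = -y * w / (1.0 + np.exp(margin))     # d loss / d delta
        delta = delta + step * np.sign(grad)       # l_inf steepest-ascent step
        delta = np.clip(delta, -eps, eps)          # projection onto the ball
    return x + delta
```

With ϵ = 8/255, k = 7, and step 2/255 as in the experiments, the returned point stays inside the ℓ∞ ball and, for a linear model, never has smaller loss than the clean input.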

How does adversarial robustness transfer across domains?

In these experiments, we use a convolutional network whose architecture is detailed in Appendix K. We report the transfer accuracy in Table 1, where models are trained on a source domain (first column in each row) and tested on different target domains (the remaining columns), together with the difference between source SA/RA and target SA/RA. The experiment has the following implications. (1) The transfer difference ∆ is more significant for RA than for SA. For example, for models trained on MNIST, whether standardly or adversarially, the test RAs on all other domains drop far more dramatically than the SAs. This implies that adversarially robust domain adaptation is harder than standard domain adaptation, as suggested by Theorem 1. (2) Models trained on complicated datasets may attain higher robust accuracy on simpler datasets, e.g., SVHN → {MNIST, SynthDigits, USPS} and MNIST-M → MNIST. The increase can be attributed to the source domain containing more complicated features, so that more robust features are learnt; exploring the reason behind this interesting phenomenon is, however, beyond the scope of this paper.

Adversarial training helps domain adaptation. Table 1 also shows that when models are adversarially trained on simple datasets (e.g., MNIST and SynthDigits), the standard accuracy on other datasets noticeably improves. For example, adversarial training on MNIST achieves significantly higher SA on the other datasets than standard training on MNIST. The same phenomenon occurs when SynthDigits or USPS is the source domain. These advantages are consistent with our Corollary 3.

Does ℓ1 regularization help adversarial transfer? Our Theorem 1 shows that the adversarial Rademacher complexity over the H∆H class is suppressed the most when the ℓ1-norm of the model parameters is controlled.
To empirically investigate this relation, we consider a linear model on vectorized images x and solve the following optimization problem with ℓ1 regularization:

min_w (1/n) Σ_{i=1}^n max_{∥δ∥_∞ ≤ ϵ} ℓ(h_w(x_i + δ), y(x_i)) + µ∥w∥₁,

where µ ≥ 0 is the regularization intensity. In Figure 1, we present the drop in robust accuracy from source to target domain, RA_S - RA_T, as a function of the intensity of the ℓ1 regularization. Consistent with our theoretical results, increasing the ℓ1 regularization (µ > 0) reduces the transfer accuracy drop under different levels of adversarial attack.
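For a linear model with logistic loss the objective above simplifies, because the inner maximization has the closed form max_{∥δ∥_∞ ≤ ϵ} ℓ = ℓ(y w^⊤x - ϵ∥w∥₁), so the robust loss can be minimized directly by (sub)gradient descent. The sketch below is our own minimal illustration of this ℓ1-regularized adversarial training; the hyperparameters are illustrative, not the tuned values used for Figure 1:

```python
import numpy as np

def train_l1_adversarial(X, y, eps, mu, lr=0.1, epochs=200, seed=0):
    """Sketch of l1-regularized adversarial training for a linear classifier
    with logistic loss and labels y in {-1, +1}.  Uses the closed-form robust
    margin y w^T x - eps ||w||_1 instead of an explicit PGD inner loop."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = 0.01 * rng.normal(size=d)
    for _ in range(epochs):
        margins = y * (X @ w) - eps * np.abs(w).sum()        # robust margins
        s = 1.0 / (1.0 + np.exp(np.clip(margins, -60, 60)))  # sigmoid(-margin)
        # gradient of the mean robust logistic loss w.r.t. w
        grad = -(X * (s * y)[:, None]).mean(axis=0) + eps * s.mean() * np.sign(w)
        grad += mu * np.sign(w)                              # l1 subgradient
        w -= lr * grad
    return w
```

On a toy separable problem the learnt weight vector aligns with the informative coordinate and classifies the training points correctly.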

7. CONCLUSION

In this paper we propose and analyze the adversarial Rademacher complexity over the H∆H class, which we prove to be a key factor controlling the generalization of the adversarially robust risk to different domains. We theoretically explain why adversarially robust domain adaptation is harder than standard domain adaptation. We also characterize the standard accuracy of a given model on any target domain using its adversarial accuracy on the source domain, which matches the recent observations regarding the superiority of adversarial training.

A MORE RELATED WORK

Here, we briefly discuss some relevant prior work.

Discrepancy-Based Domain Adaptation Theory A significant category of domain adaptation studies is discrepancy-based generalization analysis. Ben-David et al. (2006) borrowed the A-discrepancy from the seminal work Kifer et al. (2004) and bounded the target-domain generalization in terms of the source-domain error and this discrepancy measure. Afterwards, Ben-David et al. (2010) proposed the H∆H discrepancy, which is easier to estimate from unlabeled data, and also proved a VC-dimension-based generalization bound. Mansour et al. (2009) also considered the H∆H discrepancy, with an analysis based on the Rademacher complexity over the H∆H function class; they claim that in some situations their learning bound is superior to that of Ben-David et al. (2010). Mohri & Muñoz Medina (2012) proposed the Y-discrepancy, a labeling-function-dependent measure which therefore cannot be estimated from unlabeled data. Kuroki et al. (2019) advocated a source-guided discrepancy and showed that it is a tighter discrepancy measure than the H∆H discrepancy. Zhang et al. (2020) proposed a localized discrepancy measure, arguing that considering the whole hypothesis class when defining a discrepancy may be too pessimistic, so they incorporate the risk level into the discrepancy definition.

(2018) proved that empirical robust risk minimization is a successful robust PAC learner. Montasser et al. (2019) showed that function classes with finite VC dimension are adversarially robustly PAC learnable, with a sample complexity related to the dual VC dimension, which can be exponentially larger than the vanilla VC dimension. Diochnos et al. (2019) proved a lower sample complexity bound for robust PAC learning under hybrid attacks, showing that a sample complexity exponential in the adversary budget is unavoidable. Gourdeau et al.
(2021) also studied the hardness of robust classification under the PAC learning framework, and proved several impossibility results regarding the adversary budget. Diochnos et al. (2018) investigated different adversarial risk definitions, and proved negative results for the uniform distribution. Pydi & Jog (2022) also analyzed the existing adversarial risk notions, and uncovered the differences and connections among them.

Generalization of Adversarially Robust Learning

Robustness Transfer Robustness transfer is a newly initiated research area. Shafahi et al. (2019) discovered that by fine-tuning the network on the target domain, robustness can be inherited by the new model. Hong et al. (2021) considered the federated learning scenario, where one wishes to transfer a robust model from computationally rich users to users that cannot afford adversarial training; they proposed a batch-normalization based method to share robustness among different clients. Fan et al. (2021) studied when the robust features learnt in contrastive learning can be transferred to different tasks.

B DIFFERENCE BETWEEN ROBUST LEARNING, STANDARD DOMAIN ADAPTATION AND ADVERSARIALLY ROBUST DOMAIN ADAPTATION

Here we briefly discuss the differences between the three relevant learning scenarios: adversarially robust learning, standard domain adaptation, and adversarially robust domain adaptation. In adversarially robust learning, we care about the gap between the population robust risk and the empirical robust risk on the same domain; in standard domain adaptation, we consider the gap between the (standard) risk on the target domain and the risk on the source domain. In adversarially robust domain adaptation, we care about the relation between the adversarially robust risks on the target and the source domain, respectively. We consider the loss between the predictions of two models, hence the Rademacher complexity is R = E_σ[sup_{w,w′} (1/n) Σ_{i=1}^n σ_i min_{∥δ∥_∞≤ϵ} w^⊤(x_i + δ) w′^⊤(x_i + δ)], where the inner problem is quadratic in δ. Hence, unlike previous works on single-domain Rademacher complexity, where the inner problem has a simple closed-form solution, we have to use ε-nets and covering-number arguments to prove the upper bound. Another key technical contribution lies in the proof of the lower bound results: controlling the magnitude of the Rademacher complexity when the inner problem has a quadratic objective is significantly harder than when it is linear. We derive the (complicated) closed-form solution to the inner quadratic program, and leverage the symmetry of the Rademacher random variables to avoid heavy computation.
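To make the distinction between standard and adversarially robust risks concrete, the following minimal Python sketch (illustrative only; the data, the toy linear model, and the ℓ∞ budget ϵ are our own assumptions) contrasts the standard and adversarial squared-loss risks of one linear model on a source and a shifted target domain, using the closed form max_{∥δ∥∞≤ϵ}(⟨w, x+δ⟩ − y)² = (|⟨w, x⟩ − y| + ϵ∥w∥₁)² for linear predictors:

```python
import numpy as np

def standard_risk(w, X, y):
    """Average squared loss of a linear model on a dataset."""
    return np.mean((X @ w - y) ** 2)

def robust_risk(w, X, y, eps):
    """Average worst-case squared loss under an l_inf perturbation of radius eps;
    for linear models the inner max has the closed form (|<w,x> - y| + eps*||w||_1)^2."""
    return np.mean((np.abs(X @ w - y) + eps * np.abs(w).sum()) ** 2)

rng = np.random.default_rng(0)
w_true = np.array([1.0, -2.0, 0.5])
X_src = rng.normal(size=(200, 3))            # source domain
X_tgt = rng.normal(size=(200, 3)) + 1.0      # shifted target domain
y_src, y_tgt = X_src @ w_true, X_tgt @ w_true

w = w_true + 0.1                             # an imperfect model
for name, X, y in [("source", X_src, y_src), ("target", X_tgt, y_tgt)]:
    std, rob = standard_risk(w, X, y), robust_risk(w, X, y, eps=0.1)
    print(f"{name}: standard risk = {std:.3f}, robust risk = {rob:.3f}")
    assert rob >= std  # the adversarial risk always dominates the standard one
```

The per-sample robust loss dominates the standard loss on both domains; the adversarial H∆H analysis is precisely about controlling how these robust risks relate across the two domains.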

D USEFUL LEMMAS

In this section, we present the lemmas that are used in the proofs of our main results.

D.1 MATRIX CONCENTRATION INEQUALITY

Theorem 4 (Matrix Bernstein inequality, Thm 6.1.1 of Tropp et al. (2015)). Let us denote by ∥·∥₂ the spectral norm of a matrix. Consider a finite sequence of n independent random matrices Z_i with common dimension d₁ × d₂. Assume that E[Z_i] = 0 and ∥Z_i∥₂ ≤ L for all i ∈ [n]. Let Y := Σ_{i=1}^n Z_i. Then,

E[∥Y∥₂] = E[∥Σ_{i=1}^n Z_i∥₂] ≤ √(2 Var(Y) log(d₁ + d₂)) + (1/3) L log(d₁ + d₂),

where the matrix variance is given by Var(Y) := max{∥E[YY^⊤]∥₂, ∥E[Y^⊤Y]∥₂} = max{∥Σ_{i=1}^n E[Z_i Z_i^⊤]∥₂, ∥Σ_{i=1}^n E[Z_i^⊤ Z_i]∥₂}.
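As a numerical sanity check of Theorem 4 (our own illustration; the data distribution, sizes, and trial count are arbitrary choices), the sketch below draws Rademacher matrix sums Y = Σᵢ σᵢ xᵢxᵢ^⊤, for which d₁ = d₂ = d, and compares a Monte Carlo estimate of E∥Y∥₂ with the stated bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 8
# Z_i = sigma_i * x_i x_i^T : symmetric, zero mean, bounded spectral norm
X = rng.normal(size=(n, d))
L = max(np.linalg.norm(x) ** 2 for x in X)  # ||Z_i||_2 = ||x_i||_2^2
V = np.linalg.norm(sum(np.outer(x, x) @ np.outer(x, x) for x in X), 2)

bound = np.sqrt(2 * V * np.log(2 * d)) + L * np.log(2 * d) / 3

# Monte Carlo estimate of E ||sum_i Z_i||_2 over the Rademacher signs
trials = 200
est = np.mean([
    np.linalg.norm(sum(s * np.outer(x, x) for s, x in
                       zip(rng.choice([-1, 1], size=n), X)), 2)
    for _ in range(trials)
])
print(f"E||Y||_2 estimate {est:.2f} <= Bernstein bound {bound:.2f}")
assert est <= bound
```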

D.2 BASIC LEMMAS

Lemma 6 (Basic squared norm inequality). For any vectors a, b, we have ∥a − b∥₂² ≤ 2∥a∥₂² + 2∥b∥₂².

Lemma 7 (Hölder's inequality). Let p ∈ R with 1 < p < ∞, and let p* be its conjugate, that is, 1/p + 1/p* = 1; if p = 1, we set p* = ∞. Then for all v, w ∈ R^d, |⟨v, w⟩| ≤ ∥v∥_p ∥w∥_{p*}.

Lemma 8 (Equivalence of p-norms). Let q > p ≥ 1. Then for all v ∈ R^d we have ∥v∥_q ≤ ∥v∥_p ≤ d^{1/p−1/q} ∥v∥_q. This also holds for q = ∞, that is, ∥v∥_∞ ≤ ∥v∥_p ≤ d^{1/p} ∥v∥_∞.

Lemma 9 (Maximum dot product over the ℓ∞ ball). The ℓ∞ and ℓ₁ norms are dual to each other. That is: max_{∥x∥_∞≤ϵ} z^⊤x = ϵ∥z∥₁, and this maximum is attained for x* = ϵ sgn(z), where sgn denotes the element-wise sign function.

Proof. Hölder's inequality implies that for all z, x ∈ R^p with ∥x∥_∞ ≤ ϵ, z^⊤x ≤ |z^⊤x| ≤ ∥x∥_∞∥z∥₁ ≤ ϵ∥z∥₁. Finally, we notice this upper bound is reached for x* = ϵ sgn(z) as z^⊤x* = ϵ z^⊤sgn(z) = ϵ Σ_{i=1}^p z_i sgn(z_i) = ϵ∥z∥₁.

Lemma 10 (Minimum dot product over the ℓ∞ ball). Let z ∈ R^d. Then min_{∥x∥_∞≤ϵ} z^⊤x = −ϵ∥z∥₁, attained at x* = −ϵ sgn(z).

Proof. Same reasoning as in Lemma 9.

Lemma 11 (Lower bound on the minimum "quadratic" form over the ℓ∞ ball). Let u, v ∈ R^d. Then min_{∥x∥_∞≤ϵ} ⟨u, x⟩⟨v, x⟩ ≥ −ϵ²∥u∥₁∥v∥₁.

Proof. Let x ∈ B_∞(0_d, ϵ), the centered ℓ∞ ball of radius ϵ > 0. Hölder's inequality for dual norms, applied twice, gives: |⟨u, x⟩⟨v, x⟩| ≤ ∥x∥_∞² ∥u∥₁∥v∥₁ ≤ ϵ²∥u∥₁∥v∥₁, which directly implies min_{∥x∥_∞≤ϵ} ⟨u, x⟩⟨v, x⟩ ≥ −ϵ²∥u∥₁∥v∥₁.

Remark 1. We make several comments on the above Lemma 11: • Note that if v = u, then the objective becomes non-negative and 0 is a simpler and sharp lower bound, as it is attained at x = 0_d. • Else if v = −u, then applying Lemma 12 shows the minimum is attained and equals −ϵ²∥u∥₁², which means the lower bound in (17) is sharp.
• Else if v ⊥ u, then one should be able to prove that the minimum is attained at something like x* = ϵ(u − v)/∥u − v∥_∞, which corresponds to an objective value of ⟨u, x*⟩⟨v, x*⟩ = −ϵ² ∥u∥₂²∥v∥₂²/∥u − v∥_∞².

Lemma 12 (Maximum squared dot product over the ℓ∞ ball). We have that max_{∥x∥_∞≤ϵ} (z^⊤x)² = ϵ²∥z∥₁², and this maximum is attained for x ∈ {ϵ sgn(z), −ϵ sgn(z)}, where sgn denotes the element-wise sign function.

Proof. Hölder's inequality implies that for all z ∈ R^p and all x with ∥x∥_∞ ≤ ϵ, (z^⊤x)² ≤ ∥x∥_∞²∥z∥₁² ≤ ϵ²∥z∥₁². Finally, we notice this upper bound is reached for x* = ±ϵ sgn(z) as (z^⊤x*)² = ϵ²(z^⊤sgn(z))² = ϵ²∥z∥₁².

Lemma 13. Let A be a symmetric matrix. We have that sup_{∥w∥₂≤W, ∥w′∥₂≤W} w^⊤Aw′ = W²∥A∥₂.

Proof. Let w, w′ have ℓ₂-norm smaller than W. By Cauchy-Schwarz's inequality we directly get that w^⊤Aw′ ≤ |⟨w, Aw′⟩| ≤ ∥A∥₂∥w∥₂∥w′∥₂ ≤ W²∥A∥₂. (19) We then perform an eigendecomposition of A: w^⊤Aw′ = w^⊤UΣU^⊤w′ = W² y^⊤Σy′, where Σ is a diagonal matrix containing the eigenvalues λ_i of A, U is an orthogonal matrix since A is symmetric, y := (1/W)U^⊤w and y′ := (1/W)U^⊤w′. In this orthogonal basis, let i* be the coordinate of the eigenvalue λ_{i*} with largest magnitude in absolute value. We denote by (e_i)_{i∈[d]} the canonical basis of R^d. Letting y = e_{i*} and y′ = sgn(λ_{i*})y, we get y^⊤Σy′ = |λ_{i*}| = max_{i∈[d]} |λ_i(A)| = √(λ_max(A²)) = ∥A∥₂. Thus, the upper bound in (19) is attained by inverting the change of variables from y, y′ to w, w′.

Lemma 14. Let A ∈ R^{d×d}. Then the following statements hold: sup_{∥w∥_p≤W,∥w′∥_p≤W} w^⊤Aw′ ≤ sup_{∥w∥₂≤W,∥w′∥₂≤W} w^⊤Aw′ · {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}, (20) and sup_{∥w∥_p≤W,∥w′∥_p≤W} w^⊤Aw′ ≥ sup_{∥w∥₂≤W,∥w′∥₂≤W} w^⊤Aw′ · {d^{1−2/p} if 1 ≤ p ≤ 2; 1 if p > 2}. (21)

Proof. We begin by proving the first inequality. If 1 ≤ p ≤ 2, we know that B_p(W) ⊆ B₂(W).
Hence sup_{∥w∥_p≤W,∥w′∥_p≤W} w^⊤Aw′ ≤ sup_{∥w∥₂≤W,∥w′∥₂≤W} w^⊤Aw′. If p > 2, since (1/d^{1/2−1/p})∥w∥₂ ≤ ∥w∥_p, we know that ∥w∥_p ≤ W implies (1/d^{1/2−1/p})∥w∥₂ ≤ W. So we have: B_p(W) := {w : ∥w∥_p ≤ W} ⊆ {w : (1/d^{1/2−1/p})∥w∥₂ ≤ W} = B₂(W d^{1/2−1/p}). Hence: sup_{∥w∥_p≤W,∥w′∥_p≤W} w^⊤Aw′ ≤ sup_{∥w∥₂≤Wd^{1/2−1/p},∥w′∥₂≤Wd^{1/2−1/p}} w^⊤Aw′ ≤ sup_{∥w∥₂≤W,∥w′∥₂≤W} w^⊤Aw′ · d^{1−2/p}. Now we switch to proving the second inequality. If 1 ≤ p ≤ 2, then ∥w∥₂ ≥ (1/d^{1/p−1/2})∥w∥_p, so we know B₂(W) := {w : ∥w∥₂ ≤ W} ⊆ {w : (1/d^{1/p−1/2})∥w∥_p ≤ W} = B_p(d^{1/p−1/2}W). Hence: sup_{∥w∥₂≤W,∥w′∥₂≤W} w^⊤Aw′ ≤ sup_{∥w∥_p≤d^{1/p−1/2}W,∥w′∥_p≤d^{1/p−1/2}W} w^⊤Aw′ ≤ d^{2/p−1} sup_{∥w∥_p≤W,∥w′∥_p≤W} w^⊤Aw′. If p > 2, then we have B₂(W) ⊆ B_p(W), so we can conclude: sup_{∥w∥₂≤W,∥w′∥₂≤W} w^⊤Aw′ ≤ sup_{∥w∥_p≤W,∥w′∥_p≤W} w^⊤Aw′.

Lemma 15 (Partition). Let A := {−1, +1}^N. Then there exists an equal partition A = A⁺ ∪ A⁻ such that A⁻ is obtained by negating each vector in A⁺. That is, |A⁺| = |A⁻| and A⁻ = {−a : a ∈ A⁺}.

Proof. We prove this by induction. When N = 1, we have A₁ = {−1, 1}, which we can partition as A⁺ = {1} and A⁻ = {−1}. Then we assume the hypothesis holds for N = k, that is, A_k := {−1, +1}^k can be partitioned as A_k = A⁺_k ∪ A⁻_k such that |A⁺_k| = |A⁻_k| and A⁻_k = {−a : a ∈ A⁺_k}. Now, for N = k + 1, we write A_{k+1} = A_k × {−1, +1} and set A⁺_{k+1} := A⁺_k × {−1, +1} and A⁻_{k+1} := A⁻_k × {−1, +1}. These two sets are disjoint, cover A_{k+1}, have equal cardinality, and satisfy A⁻_{k+1} = {−a : a ∈ A⁺_{k+1}}, since −(a, s) = (−a, −s) with −a ∈ A⁻_k and −s ∈ {−1, +1}, which completes the induction.

Lemma 16 (Maximum of a perturbed square over the ℓ∞ ball). Let w ∈ R^p, a ∈ R and ϵ > 0, and consider the problem max_{δ∈R^p:∥δ∥_∞≤ϵ} (w^⊤δ + a)². The solution is given by δ* = ϵ sgn(a) sgn(w) ∈ R^p, (22) where we overload the notation sgn to denote at the same time a scalar and a coordinate-wise sign operator, i.e., sgn(a) ∈ R but sgn(w) ∈ R^p. Moreover, the maximum reached is (w^⊤δ* + a)² = (ϵ sgn(a) w^⊤sgn(w) + a)² = (ϵ∥w∥₁ + |a|)². (23)

Proof. Let us give a first intuition and proof in dimension one and then extend this to larger dimensions. • Case p = 1 (w becomes the scalar w). In this setting the intuition is clear: one should select δ, with maximal amplitude, that makes δw have the same sign as a.
If a and w have the same sign, then δ = ϵ; else, δ = −ϵ. • Case p ∈ N*. For all w, δ ∈ R^p, Hölder's inequality gives that |w^⊤δ| ≤ ∥δ∥_∞∥w∥₁. Let δ ∈ R^p be such that ∥δ∥_∞ ≤ ϵ. Then, on the feasible set, |w^⊤δ| ≤ ϵ∥w∥₁. (24) Thus, (w^⊤δ + a)² = (w^⊤δ)² + 2a w^⊤δ + a² ≤ ϵ²∥w∥₁² + 2a w^⊤δ + a² ≤ ϵ²∥w∥₁² + 2|a||w^⊤δ| + a² ≤ ϵ²∥w∥₁² + 2|a|ϵ∥w∥₁ + a² = (ϵ∥w∥₁ + |a|)², where we used (24) in the first and last inequalities. Finally, one can check that this upper bound on the objective is attained for the δ* given in (22).

Lemma 17 (Minimum of a perturbed square over the ℓ∞ ball). Let w ∈ R^p, a ∈ R and ϵ > 0, and consider the problem min_{δ∈R^p:∥δ∥_∞≤ϵ} (w^⊤δ + a)². (25) Let I := {i ∈ [p] : w_i ≠ 0}. • If ϵ∥w∥₁ ≥ |a|, then a solution is given by δ*_i = −(a/∥w∥₁)(w_i/|w_i|) for all i ∈ I and δ*_i = 0 for all i ∈ [p]\I. • Else ϵ∥w∥₁ < |a|, and a solution is given by δ*_i = −ϵ(a/|a|)(w_i/|w_i|) for all i ∈ I and δ*_i = 0 for all i ∈ [p]\I. This solution can be condensed in the following formulation: δ*_i = −a (w_i/|w_i|) min(1/∥w∥₁, ϵ/|a|) for all i ∈ I, and δ*_i = 0 for all i ∈ [p]\I. (26) The minimal value is given by: min_{δ∈R^p:∥δ∥_∞≤ϵ} (w^⊤δ + a)² = a²(1 − min(1, ϵ∥w∥₁/|a|))².

Remark 2. Note that in general there are infinitely many solutions to (25), as in (26) one can choose the value of δ*_i arbitrarily for all i ∈ [p]\I (as long as it is kept smaller than ϵ in absolute value).

Proof. Let [p] := {1, …, p}. Let us try to build a solution which drives the dot product w^⊤δ towards −a. Let I := {i ∈ [p] : w_i ≠ 0}. Case 1: ϵ∥w∥₁ ≥ |a|. Let δ* ∈ R^p be such that δ*_i = −(a/∥w∥₁)(w_i/|w_i|) for all i ∈ I and δ*_i = 0 for all i ∈ [p]\I. This vector is in the feasible set as ∥δ*∥_∞ = max_{i∈I} (|a|/∥w∥₁)(|w_i|/|w_i|) = |a|/∥w∥₁ ≤ ϵ, as assumed. Then, w^⊤δ* + a = −(a/∥w∥₁) Σ_{i∈I} w_i²/|w_i| + a = 0. This means that if the entries of the vector w are large enough (in absolute value), we can build a feasible vector δ* such that the objective in (25) is zero. Case 2: ϵ∥w∥₁ < |a|. Let δ* ∈ R^p be such that δ*_i = −ϵ(a/|a|)(w_i/|w_i|) for all i ∈ I and δ*_i = 0 for all i ∈ [p]\I. This vector is in the feasible set as ∥δ*∥_∞ = max_{i∈I} ϵ(|a|/|a|)(|w_i|/|w_i|) = ϵ, as assumed.
Then, w^⊤δ* + a = −ϵ(a/|a|) Σ_{i∈I} w_i²/|w_i| + a = a(1 − ϵ∥w∥₁/|a|), with ϵ∥w∥₁/|a| ∈ [0, 1]. This means that if the entries of the vector w are too small (in absolute value), we can only build a feasible vector δ* such that w^⊤δ* is close to −a, and the corresponding objective in (25) becomes a²(1 − ϵ∥w∥₁/|a|)². Finally, one can show with Hölder's inequality that (w^⊤δ + a)² ≥ a²(1 − ϵ∥w∥₁/|a|)² for all δ in the feasible set, which concludes the proof. If a ≥ 0 the computation follows easily; otherwise we can replace δ ← −δ to reduce to the former case.
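The closed form of the minimization problem min_{∥δ∥∞≤ϵ}(w^⊤δ + a)² is easy to check numerically. The sketch below (illustrative only; the helper name min_quadratic_linf and the sampled w, a, ϵ are our own choices) evaluates the claimed minimum a²(1 − min(1, ϵ∥w∥₁/|a|))², verifies that it is attained at the condensed minimizer, and confirms that no randomly sampled feasible δ does better:

```python
import numpy as np

def min_quadratic_linf(w, a, eps):
    """Claimed closed form of min_{||delta||_inf <= eps} (w^T delta + a)^2."""
    if a == 0:
        return 0.0
    return a ** 2 * (1 - min(1.0, eps * np.abs(w).sum() / abs(a))) ** 2

rng = np.random.default_rng(0)
w, a, eps = rng.normal(size=4), 0.7, 0.05
closed = min_quadratic_linf(w, a, eps)

# condensed minimizer: delta*_i = -a * sign(w_i) * min(1/||w||_1, eps/|a|)
delta_star = -a * np.sign(w) * min(1.0 / np.abs(w).sum(), eps / abs(a))
assert np.all(np.abs(delta_star) <= eps + 1e-12)            # feasible
assert abs((w @ delta_star + a) ** 2 - closed) < 1e-9       # attains the value
for _ in range(1000):                                       # no feasible delta beats it
    delta = rng.uniform(-eps, eps, size=4)
    assert (w @ delta + a) ** 2 >= closed - 1e-9
print("closed-form minimum verified:", closed)
```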

E PROOF OF GENERALIZATION LEMMA ( LEMMA 2)

In this section we provide the proof of Lemma 2. First let us introduce the following helper lemma.

Lemma 18. Assume Ŝ and T̂ are sets of data drawn from S and T, with sizes n_S and n_T respectively, and that the value of l̃(·) is bounded by M. Then we have: |disc^adv_{H∆H}(S, T) − disc^adv_{H∆H}(Ŝ, T̂)| ≤ R̂_Ŝ(l̃ ∘ H∆H) + R̂_T̂(l̃ ∘ H∆H) + 3M√(log(2/c)/n_S) + 3M√(log(2/c)/n_T).

Proof. Since the absolute value satisfies the triangle inequality, we have: disc^adv_{H∆H}(S, T) ≤ disc^adv_{H∆H}(S, Ŝ) + disc^adv_{H∆H}(T, T̂) + disc^adv_{H∆H}(Ŝ, T̂). According to the Rademacher-based generalization bound of Mohri et al. (2018), we know that disc^adv_{H∆H}(S, Ŝ) = max_{h,h′∈H} |R^adv_S(h, h′) − R^adv_Ŝ(h, h′)| ≤ R̂_Ŝ(l̃ ∘ H∆H) + 3M√(log(2/c)/n_S), and similarly for disc^adv_{H∆H}(T, T̂).

Proof of Lemma 2. Since the loss function l̃ satisfies the triangle inequality, we can split R^{adv-label}_T(h_w, y_T) into the following terms: R^{adv-label}_T(h_w, y_T) ≤ R^adv_T(h_w, h_{w*_S}) + R^adv_T(h_{w*_S}, h_{w*_T}) + R^adv_T(h_{w*_T}, y_T) ≤ R^adv_S(h_w, h_{w*_S}) + disc^adv_{H∆H}(T, S) + R^adv_T(h_{w*_S}, h_{w*_T}) + R^adv_T(h_{w*_T}, y_T) ≤ R^adv_S(h_w, h_{w*_S}) + disc^adv_{H∆H}(T̂, Ŝ) + R^adv_T(h_{w*_S}, h_{w*_T}) + R^adv_T(h_{w*_T}, y_T) + R̂_Ŝ(l̃ ∘ H∆H) + R̂_T̂(l̃ ∘ H∆H) + 3M√(log(2/c)/n_S) + 3M√(log(2/c)/n_T) ≤ R^{adv-label}_S(h_w, y_S) + R^{adv-label}_S(h_{w*_S}, y_S) + disc^adv_{H∆H}(T̂, Ŝ) + R^adv_T(h_{w*_S}, h_{w*_T}) + R^adv_T(h_{w*_T}, y_T) + R̂_Ŝ(l̃ ∘ H∆H) + R̂_T̂(l̃ ∘ H∆H) + 3M√(log(2/c)/n_S) + 3M√(log(2/c)/n_T), where we plug in Lemma 18 in the last step.

F EXTENSIONS F.1 ESTIMATION OF ADVERSARIAL DISCREPANCY FROM STANDARD DISCREPANCY

The following lemma gives the bound obtained when estimating the adversarial H∆H discrepancy from the standard H∆H discrepancy. Lemma 19. The following relations between the adversarial and the standard discrepancy hold for the linear model class with bounded norm: H = {h_w : x ↦ ⟨w, x⟩, ∥w∥_p ≤ W}. For an L_ϕ-Lipschitz binary classification loss, we have: disc^adv_{H∆H}(Ŝ, T̂) ≤ disc_{H∆H}(Ŝ, T̂) + 2W²L_ϕ√d ϵ ((1/n_T)Σ_{x_i∈T̂} ∥x_i∥₂ + (1/n_S)Σ_{x_i∈Ŝ} ∥x_i∥₂) · {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}. For the ℓ₂ loss, we have: disc^adv_{H∆H}(Ŝ, T̂) ≤ disc_{H∆H}(Ŝ, T̂) + 8√d ϵW² ((1/n_T)Σ_{x_i∈T̂} ∥x_i∥₂ + (1/n_S)Σ_{x_i∈Ŝ} ∥x_i∥₂) · {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}.

Proof. Let B_p(W) := {w ∈ R^d : ∥w∥_p ≤ W} be the centered ℓ_p-norm ball of radius W. By the definition of disc^adv_{H∆H}(Ŝ, T̂), we have: disc^adv_{H∆H}(Ŝ, T̂) = max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R^adv_Ŝ(w, w′)| ≤ max_{w,w′∈B_p(W)²} |R_T̂(w, w′) − R_Ŝ(w, w′) + R^adv_T̂(w, w′) − R_T̂(w, w′) − (R^adv_Ŝ(w, w′) − R_Ŝ(w, w′))| ≤ disc_{H∆H}(Ŝ, T̂) + max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R_T̂(w, w′)| + max_{w,w′∈B_p(W)²} |R^adv_Ŝ(w, w′) − R_Ŝ(w, w′)|. Now we study the gaps max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R_T̂(w, w′)| and max_{w,w′∈B_p(W)²} |R^adv_Ŝ(w, w′) − R_Ŝ(w, w′)|.
For linear classification, we have: max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R_T̂(w, w′)| = max_{w,w′∈B_p(W)²} (1/n_T)Σ_{x_i∈T̂} max_{δ:∥δ∥_∞≤ϵ} (ϕ(⟨w, x_i + δ⟩ · ⟨w′, x_i + δ⟩) − ϕ(⟨w, x_i⟩ · ⟨w′, x_i⟩)) ≤ max_{w,w′∈B_p(W)²} (1/n_T)Σ_{x_i∈T̂} max_{δ:∥δ∥_∞≤ϵ} L_ϕ |⟨w, x_i + δ⟩ · ⟨w′, x_i + δ⟩ − ⟨w, x_i⟩ · ⟨w′, x_i⟩| ≤ max_{w,w′∈B_p(W)²} (1/n_T)Σ_{x_i∈T̂} max_{δ:∥δ∥_∞≤ϵ} L_ϕ |w^⊤((x_i + δ)(x_i + δ)^⊤ − x_i x_i^⊤)w′| ≤ L_ϕ (1/n_T)Σ_{x_i∈T̂} max_{w,w′∈B_p(W)²} w^⊤((x_i + δ*_i)(x_i + δ*_i)^⊤ − x_i x_i^⊤)w′ ≤ (by Cauchy-Schwarz) L_ϕ (1/n_T)Σ_{x_i∈T̂} max_{w,w′∈B_p(W)²} ∥w∥₂ ∥(x_i + δ*_i)(x_i + δ*_i)^⊤ − x_i x_i^⊤∥₂ ∥w′∥₂ ≤ L_ϕ (1/n_T)Σ_{x_i∈T̂} max_{w,w′∈B_p(W)²} ∥w∥₂ ∥δ*_i x_i^⊤ + x_i δ*_i^⊤∥₂ ∥w′∥₂, where δ*_i is a maximizer of the i-th optimization problem in δ over the ℓ∞-ball of radius ϵ. We know that ∥w∥₂ ≤ W · {1 if 1 ≤ p ≤ 2; d^{1/2−1/p} if p > 2}, and ∥δ*_i x_i^⊤ + x_i δ*_i^⊤∥₂ ≤ 2√d ϵ∥x_i∥₂, so we have: max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R_T̂(w, w′)| ≤ 2√d ϵ W² L_ϕ (1/n_T)Σ_{x_i∈T̂} ∥x_i∥₂ · {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}, which concludes the proof for the linear classification setting.

Now we switch to the regression setting: max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R_T̂(w, w′)| = max_{w,w′∈B_p(W)²} (1/n_T)Σ_{x_i∈T̂} max_{δ:∥δ∥_∞≤ϵ} (∥⟨w, x_i + δ⟩ − ⟨w′, x_i + δ⟩∥₂² − ∥⟨w, x_i⟩ − ⟨w′, x_i⟩∥₂²) = max_{w,w′∈B_p(W)²} (1/n_T)Σ_{x_i∈T̂} max_{δ:∥δ∥_∞≤ϵ} (∥⟨w − w′, x_i + δ⟩∥₂² − ∥⟨w − w′, x_i⟩∥₂²) = max_{w,w′∈B_p(W)²} (1/n_T)Σ_{x_i∈T̂} (w − w′)^⊤((x_i + δ*_i)(x_i + δ*_i)^⊤ − x_i x_i^⊤)(w − w′) = max_{w,w′∈B_p(W)²} (w − w′)^⊤ ((1/n_T)Σ_{x_i∈T̂} (δ*_i x_i^⊤ + x_i δ*_i^⊤)) (w − w′). Now, letting v := w − w′, we rewrite the above as: max_{w,w′∈B_p(W)²} |R^adv_T̂(w, w′) − R_T̂(w, w′)| ≤ (by Cauchy-Schwarz) max_{v:∥v∥_p≤2W} ∥v∥₂ ∥(1/n_T)Σ_{x_i∈T̂} (δ*_i x_i^⊤ + x_i δ*_i^⊤)∥₂ ∥v∥₂ ≤ ∥(1/n_T)Σ_{x_i∈T̂} (δ*_i x_i^⊤ + x_i δ*_i^⊤)∥₂ · 4W² · {1 if p ≤ 2; d^{1−2/p} if p > 2} ≤ 8√d ϵ W² (1/n_T)Σ_{x_i∈T̂} ∥x_i∥₂ · {1 if p ≤ 2; d^{1−2/p} if p > 2}, where we use norm equivalence (Lemma 8) to bound ∥v∥₂.
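The key matrix inequality used above, ∥δx^⊤ + xδ^⊤∥₂ ≤ 2√d ϵ∥x∥₂ for ∥δ∥∞ ≤ ϵ, can be spot-checked numerically. The sketch below (dimensions and sample counts are our own arbitrary choices) samples random feasible perturbations and compares the spectral norm of the rank-2 update against the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
d, eps = 6, 0.1
for _ in range(200):
    x = rng.normal(size=d)
    delta = rng.uniform(-eps, eps, size=d)   # a feasible l_inf perturbation
    M = np.outer(delta, x) + np.outer(x, delta)
    spec = np.linalg.norm(M, 2)              # spectral norm of the rank-2 matrix
    # ||delta||_2 <= sqrt(d)*eps, and triangle inequality gives the bound
    assert spec <= 2 * np.sqrt(d) * eps * np.linalg.norm(x) + 1e-9
print("rank-2 perturbation norm bound holds on all samples")
```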

F.2 ADVERSARIALLY ROBUST DOMAIN ADAPTATION GENERALIZATION BOUND

In this section, we present the generalization bound for adversarially robust domain adaptation, using our upper bound on the adversarial Rademacher complexity over the H∆H class. An immediate implication of Theorem 1 is the following bound: Corollary 1 (Adversarially Robust Domain Adaptation Learning Bound, Linear Classification). Assume that the loss function l is symmetric and obeys the triangle inequality. We further assume l is bounded by M. Then, for any hypothesis h_w ∈ H, the following holds with probability at least 1 − c: R^{adv-label}_T(h_w, y_T) ≤ R^{adv-label}_S(h_w, y_S) + R^{adv-label}_S(h_{w*_S}, y_S) + disc^adv_{H∆H}(T̂, Ŝ) + R^adv_T(h_{w*_T}, h_{w*_S}) + R^adv_T(h_{w*_T}, y_T) + R̂_Ŝ(ℓ ∘ H∆H) + R̂_T̂(ℓ ∘ H∆H) + 3M(√(log(2/c)/n_S) + √(log(2/c)/n_T)) + Õ( L_ϕ ϵ d^{1/p*} W² √d ( (ϵd^{1/p*} + 2∥X_T∥_{p*,∞})/√n_T + (ϵd^{1/p*} + 2∥X_S∥_{p*,∞})/√n_S ) ), where p* is such that 1/p + 1/p* = 1, and X_S and X_T are the data matrices formed by concatenating the data points of Ŝ and T̂, respectively. Proof. Combining Lemma 2 and Theorem 1 concludes the proof. An immediate implication of Theorem 2 is the following result. Corollary 2 (Adversarially Robust Domain Adaptation Learning Bound, Linear Regression). Assume that the loss function l is symmetric and convex. Also let Ŝ and T̂ have n_S and n_T data points, respectively. We further assume l is bounded by M. Then, for any hypothesis h_w ∈ H, the following holds with probability at least 1 − c: R^{adv-label}_T(h_w, y_T) ≤ 6R^{adv-label}_S(h_w, y_S) + 6R^{adv-label}_S(h_{w*_S}, y_S) + 3disc^adv_{H∆H}(T̂, Ŝ) + 3R^adv_T(h_{w*_S}, h_{w*_T}) + 3R^adv_T(h_{w*_T}, y_T) + M·O(√(log(2/c)/n_S) + √(log(2/c)/n_T)) + 3R̂_Ŝ(ℓ ∘ H∆H) + 3R̂_T̂(ℓ ∘ H∆H) + Õ( (W²/√n_S)(√d ϵ∥X_S∥_{2,∞} + d^{3/2}ϵ²) + (W²/√n_T)(√d ϵ∥X_T∥_{2,∞} + d^{3/2}ϵ²) ) × {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}, where X_S and X_T are the data matrices formed by concatenating the data points of Ŝ and T̂, respectively. Proof.
The following proof is almost identical to that of Lemma 2, and the only change is that we apply Jensen's inequality instead of triangle inequality here: R adv T (h w , y T ) ≤ 3R adv T (h w , h w * S ) + 3R adv T (h w * S , h w * T ) + 3R adv T (h w * T , y T ) ≤ 3R adv S (h w , h w * S ) + 3disc adv H∆H (T , S) + 3R adv T (h w * S , h w * T ) + 3R adv T (h w * T , y T ) ≤ 3R adv S (h w , h w * S ) + 3disc adv H∆H ( T , Ŝ) + 3R adv T (h w * S , h w * T ) + 3R adv T (h w * T , y T ) + 3 RS ( l • H∆H) + 3 RT ( l • H∆H) + 3   3M log(2/c) n S + 3M log(2/c) n T   ≤ 6R adv-label S (h w , y S ) + 6R adv-label S (h w * S , y S ) + 3disc adv H∆H ( T , Ŝ) + 3R adv T (h w * S , h w * T ) + 3R adv T (h w * T , y T ) + 3 RS ( l • H∆H) + 3 RT ( l • H∆H) + 3   3M log(2/c) n S + 3M log(2/c) n T   where we plug in Lemma 18 at last step. Finally plugging in Theorem 2 will conclude the proof.

G PROOFS FOR BINARY CLASSIFICATION G.1 PROOF OF LEMMA 3

Proof. To simplify notation, we omit specifying that the model parameters w and w′ belong to R^d. We first prove the upper bound result. By definition, we have R̂_D(f ∘ H∆H) = E_σ[sup_{∥w∥_p≤W, ∥w′∥_p≤W} (1/n)Σ_{i=1}^n σ_i w^⊤x_i w′^⊤x_i] ≤ (by (20)) E_σ[sup_{∥w∥₂≤W, ∥w′∥₂≤W} w^⊤((1/n)Σ_{i=1}^n σ_i x_i x_i^⊤)w′] · {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2} = (by Lemma 13) (W²/n) E_σ[∥Σ_{i=1}^n σ_i x_i x_i^⊤∥₂] · {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}. (14) We now look for a more explicit upper bound of the above Rademacher complexity depending on the dimension d and on a norm of the covariance of the data points x₁, …, x_n. To do so, we introduce some notation before applying a matrix Bernstein inequality (Theorem 6.1.1 of Tropp et al. (2015)), recalled in Theorem 4. Let Z_i := σ_i x_i x_i^⊤ ∈ R^{d×d} for all i ∈ [n], and let Y := Σ_{i=1}^n Z_i. These random matrices are symmetric, independent, have zero mean, and have bounded spectral norm: for all i ∈ [n], ∥Z_i∥₂ = √(λ_max(Z_i²)) = √(λ_max(x_i x_i^⊤ x_i x_i^⊤)) = ∥x_i∥₂² ≤ max_{j∈[n]} ∥x_j∥₂² = ∥X∥²_{2,∞}, so that according to the matrix Bernstein inequality, we get the desired bound: R̂(f ∘ H∆H) ≤ (by (14)) (W²/n)(√(2∥Σ_{i=1}^n (x_i x_i^⊤)²∥₂ log(2d)) + (1/3)∥X∥²_{2,∞} log(2d)) × {1 if 1 ≤ p ≤ 2; d^{1−2/p} if p > 2}.
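The spectral-norm reduction above lends itself to a quick numerical estimate. The Python sketch below (a toy illustration for p = 2; the data, sizes, and trial counts are our own choices) Monte Carlo estimates R̂_D(f ∘ H∆H) = (W²/n) E_σ∥Σᵢ σᵢ xᵢxᵢ^⊤∥₂ and compares it with the matrix-Bernstein upper bound used in Lemma 3:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, W = 40, 5, 1.0
X = rng.normal(size=(n, d))

# Lemma 13 reduction (p = 2): sup_{||w||,||w'|| <= W} w^T A w' = W^2 ||A||_2,
# so the complexity is (W^2 / n) * E_sigma || sum_i sigma_i x_i x_i^T ||_2
trials = 500
rad = W ** 2 / n * np.mean([
    np.linalg.norm(sum(s * np.outer(x, x) for s, x in
                       zip(rng.choice([-1, 1], size=n), X)), 2)
    for _ in range(trials)
])

V = np.linalg.norm(sum(np.outer(x, x) @ np.outer(x, x) for x in X), 2)
L = max(np.linalg.norm(x) ** 2 for x in X)
bound = W ** 2 / n * (np.sqrt(2 * V * np.log(2 * d)) + L * np.log(2 * d) / 3)

print(f"Monte Carlo complexity {rad:.3f} <= Bernstein bound {bound:.3f}")
assert rad <= bound
```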

G.2 PROOF OF THE UPPER BOUND OF THEOREM 1

Like the analysis of Theorem 7 of Awasthi et al. (2020), the study below uses the notion of coverings. For completeness' sake, we recall its definition. Definition 6 (ρ-covering). Let ρ > 0 and let (V, ∥·∥) be a normed space. A set C ⊆ V is a ρ-covering of V if for any v ∈ V there exists v′ ∈ C such that ∥v − v′∥ ≤ ρ. We also use Lemma 6 of Awasthi et al. (2020), restated as Lemma 20, which guarantees a ρ-covering C of the ℓ_p ball B_p(W) with |C| ≤ (3W/ρ)^d.

Proof of the upper bound of Theorem 1. In this proof, we consider the linear hypothesis class where the norm of the models is controlled by a general ℓ_p-norm with p ≥ 1. Let B_p(W) := {w ∈ R^d : ∥w∥_p ≤ W}; the hypothesis class defined in (5) then reads H := {h_w : x ↦ ⟨w, x⟩ : w ∈ B_p(W)}. Similarly, let B_∞(ϵ) := {δ ∈ R^d : ∥δ∥_∞ ≤ ϵ}. Recall that we defined in (8): R̂_Ŝ(f̃ ∘ H∆H) = E_σ[sup_{w,w′∈B_p(W)²} (1/n)Σ_{i=1}^n σ_i min_{δ∈B_∞(ϵ)} w^⊤(x_i + δ) w′^⊤(x_i + δ)] = E_σ[sup_{w,w′∈B_p(W)²} (1/n)Σ_{i=1}^n σ_i (x_i^⊤ww′^⊤x_i + min_{δ∈B_∞(ϵ)} (δ^⊤ww′^⊤δ + x_i^⊤(ww′^⊤ + w′w^⊤)δ))] ≤ R̂_Ŝ(f ∘ H∆H) + E_σ[sup_{w,w′∈B_p(W)²} (1/n)Σ_{i=1}^n σ_i min_{δ∈B_∞(ϵ)} (δ^⊤ww′^⊤δ + x_i^⊤(ww′^⊤ + w′w^⊤)δ)] =: R̂_Ŝ(f ∘ H∆H) + A.

Now we examine the upper bound of the second term A using the notion of covering recalled in Definition 6. Let C be a ρ-covering of the ℓ_p ball B_p(W) w.r.t. the ℓ_p-norm, with ρ > 0. Let us define ψ_i(w, w′) := min_{δ∈B_∞(ϵ)} (δ^⊤ww′^⊤δ + x_i^⊤(ww′^⊤ + w′w^⊤)δ). (28) Thus we can rewrite A as A = E_σ[sup_{w,w′∈B_p(W)²} (1/n)Σ_{i=1}^n σ_i ψ_i(w, w′)] = E_σ[sup_{w,w′∈B_p(W)²; w_c,w′_c∈C²: ∥w−w_c∥_p,∥w′−w′_c∥_p≤ρ} (1/n)Σ_{i=1}^n σ_i (ψ_i(w_c, w′_c) + ψ_i(w, w′) − ψ_i(w_c, w′_c))], where w_c, respectively w′_c, is the closest element to w, resp. w′, in C.
Using the subadditivity of the supremum, we get A ≤ E σ sup w, w′ ∈C 2 1 n n i=1 σ i ψ i ( w, w′ ) + E σ sup w,w ′ ∈Bp(W ) 2 1 n n i=1 σ i (ψ i (w, w ′ ) -ψ i (w c , w ′ c )) ≤ E σ sup w, w′ ∈C 2 1 n n i=1 σ i ψ i ( w, w′ ) + sup w,w ′ ∈Bp(W ) 2 1 n n i=1 |ψ i (w, w ′ ) -ψ i (w c , w ′ c )| ≤ E σ sup w, w′ ∈C 2 1 n n i=1 σ i ψ i ( w, w′ ) (I) + max i∈[n] sup w,w ′ ∈Bp(W ) 2 |ψ i (w, w ′ ) -ψ i (w c , w ′ c )| (II) . ( ) where we recall that w c , resp. w ′ c , is the closest vector to w, resp. w ′ , in C.

Bounding (I):

We first need to bound the left-hand side term (I). We introduce the vector ψ( w, w′ ) := [ψ 1 ( w, w′ ), . . . , ψ n ( w, w′ )] ⊤ ∈ R n . By Massart's lemma (Lemma 5.2 of Massart ( 2000)), we are able to control the first term (I) in ( 29): (I) = E σ sup w, w′ ∈C 2 1 n n i=1 σ i ψ i ( w, w′ ) ≤ K 2 log(|C| 2 ) n , with K given by the largest ℓ 2 -norm of ψ over the covering C 2 , that is K 2 = max w w′ ∈C 2 ∥ψ∥ 2 2 = max w w′ ∈C 2 n i=1 ψ i ( w, w′ ) 2 . ( ) Now we examine the upper and lower bound of ψ i ( w, w′ ). For upper bound, by taking δ = 0 we know that ψ i ( w, w′ ) is non-positive. Thus, we only have to control how negative this term can be. Let w, w′ ∈ C 2 , we have ψ i ( w, w′ ) = min δ∈B∞(ϵ) δ ⊤ w w′⊤ δ + x ⊤ i ( w w′⊤ + w′ w⊤ )δ ≥ min δ∈B∞(ϵ) δ ⊤ w w′⊤ δ + min δ∈B∞(ϵ) x ⊤ i ( w w′⊤ + w′ w⊤ )δ Lemma 10 = min δ∈B∞(ϵ) δ ⊤ w w′⊤ δ -ϵ ( w w′⊤ + w′ w⊤ )x i 1 . We focus on the first term which is a quadratic optimization problem under infinite norm constraints. For all δ, w, w ′ ∈ B ∞ (ϵ) × B p (W ) 2 , this quadratic form can be lower bounded by calling Hölder's inequality twice and norm equivalence, that is if p * ≥ 1 we have ∥v∥ p * ≤ d 1/p * ∥v∥ ∞ : δ ⊤ w w′⊤ δ ≥ -|δ ⊤ w w′⊤ δ| Lemma 7 ≥ -∥δ∥ 2 p * ∥ w∥ p ∥ w′ ∥ p Lemma 8 ≥ -d 2/p * ∥δ∥ 2 ∞ ∥ w∥ p ∥ w′ ∥ p if p > 1 -∥δ∥ 2 ∞ ∥ w∥ 1 ∥ w′ ∥ 1 else if p = 1 δ∈B∞(ϵ), w, w′ ∈Bp(W ) 2 ≥ -d 2/p * ϵ 2 W 2 if p > 1 -ϵ 2 W 2 else if p = 1 . Using the same tools, we now study the second term -w w′⊤ x i 1 = -|⟨ w′ , x i ⟩| ∥ w∥ 1 Lemma 7 ≥ -∥x i ∥ p * ∥ w′ ∥ p ∥ w∥ 1 Lemma 8 ≥ -d 1/p * ∥x i ∥ p * ∥ w′ ∥ p ∥ w∥ p if p > 1 -∥x i ∥ ∞ ∥ w′ ∥ 1 ∥ w∥ 1 else if p = 1 w, w′ ∈Bp(W ) 2 ≥ -d 1/p * W 2 ∥x i ∥ p * if p > 1 -W 2 ∥x i ∥ ∞ else if p = 1 , and symmetrically we get the same bound for -w′ w⊤ x i 1 . By taking the convention that d 1/p * = 1 if p = 1 (i.e. p * = ∞), we drop the disjunction between p = 1 and p > 1 in what follows. 
Combining (32) and the above two inequalities, the auxiliary function ψ i (28) can be lower bounded after applying the triangle inequality: ψ i ( w, w′ ) ≥ -d 2/p * ϵ 2 W 2 -2ϵd 1/p * W 2 ∥x i ∥ p * . So that we get ψ i ( w, w′ ) 2 ≤ (d 2/p * ϵ 2 W 2 + 2ϵd 1/p * W 2 ∥x i ∥ p * ) 2 ≤ ϵ 2 d 2/p * W 4 (ϵd 1/p * + 2 max j∈[n] ∥x j ∥ p * ) 2 = ϵ 2 d 2/p * W 4 (ϵd 1/p * + 2 ∥X∥ p * ,∞ ) 2 . Finally we get the following upper bound for K defined in (31): K ≤ √ nϵd 1/p * W 2 (ϵd 1/p * + 2 ∥X∥ p * ,∞ ) , which, jointly with the application of Lemma 20 implies the upper bound for (I): (I) (30) ≤ K 2 log |C| 2 n ≤ ϵd 1/p * W 2 (ϵd 1/p * + 2 ∥X∥ p * ,∞ ) √ n 4d log(3W/ρ) . Bounding (II). Now we turn to bounding the second term of (29). Let w, w ′ ∈ B p (W ) 2 and let w c , resp. w ′ c , be the closest element to w, resp. w ′ , in C. Let us define an "implicit" minimizer w.r.t. δ (the objective being continuous over a closed ball it is attained) for ψ i (w c , w ′ c ): δ * c := arg min ∥δc∥ ∞ ≤ϵ δ ⊤ c w c w ′⊤ c δ c + x ⊤ i (w c w ′⊤ c + w ′ c w ⊤ c )δ c . Thus, we have ψ i (w, w ′ ) -ψ i (w c , w ′ c ) (34) = min δ∈B∞(ϵ) δ ⊤ ww ′⊤ δ + x ⊤ i (ww ′⊤ + w ′ w ⊤ )δ -(δ * c ) ⊤ w c w ′⊤ c δ * c -x ⊤ i (w c w ′⊤ c + w ′ c w ⊤ c )δ * c ≤(δ * c ) ⊤ ww ′⊤ δ * c + x ⊤ i (ww ′⊤ + w ′ w ⊤ )δ * c -(δ * c ) ⊤ w c w ′⊤ c δ * c -x ⊤ i (w c w ′⊤ c + w ′ c w ⊤ c )δ * c =(δ * c ) ⊤ (ww ′⊤ -w c w ′⊤ c )δ * c + x ⊤ i (ww ′⊤ -w c w ′⊤ c + w ′ w ⊤ -w ′ c w ⊤ c )δ * c =(δ * c ) ⊤ w(w ′ -w ′ c ) ⊤ -(w c -w)w ′⊤ c δ * c + x ⊤ i w(w ′ -w ′ c ) ⊤ -(w c -w)w ′⊤ c + w ′ (w -w c ) ⊤ -(w ′ c -w ′ )w ⊤ c δ * c . We focus on upper bounding a single term of the ones appearing above. By applying Hölder's inequality twice and norm equivalence we get: |(δ * c ) ⊤ w(w ′ -w ′ c ) ⊤ δ * c | Lemma 7 ≤ ∥w∥ p ∥δ * c ∥ 2 p * ∥w ′ -w ′ c ∥ p Lemma 8 ≤ ρϵ 2 d 2/p * W , where in the last line we used that ∥w -w c ∥ p ≤ ρ, by the definition of the ρ-covering of the ball B p (W ) w.r.t. the ℓ p -norm. 
Proceeding identically with other terms involving x i we get |x ⊤ i w(w ′ -w ′ c ) ⊤ δ * c | Lemma 7 ≤ ∥x i ∥ p * ∥w∥ p ∥w ′ -w ′ c ∥ p ∥δ * c ∥ p * Lemma 8 ≤ ρϵd 1/p * W ∥x i ∥ p * , finally get that ψ i (w, w ′ ) -ψ i (w c , w ′ c ) ≤ 2ρϵ 2 d 2/p * W + 4ρϵd 1/p * W ∥x i ∥ p * . Similarly we can prove the same bound holds for other side of the difference (by using an "implicit" minimizer of ψ i (w, w ′ ). Thus we are able to control (II): (II) (29) = max i∈[n] sup w,w ′ ∈Bp(W ) 2 |ψ i (w, w ′ )-ψ i (w c , w ′ c )| ≤ 2ρϵd 1/p * W ϵd 1/p * +2 ∥X∥ p * ,∞ . And finally, we proved that A (33)+(35) ≤ 2ϵd 1/p * W (ϵd 1/p * + 2 ∥X∥ p * ,∞ ) ρ + d n W log(3W/ρ) , which concludes the first part of the proof if we choose ρ = W/ √ n: A ≤ 2ϵ d 1/p * √ n W 2 1 + √ d log(3 √ n) ϵd 1/p * + 2 ∥X∥ p * ,∞ .

G.3 PROOF OF THE LOWER BOUND OF THEOREM 1

In this subsection we present the proof of lower bound of the adversarial Rademacher complexity for binary classification under linear hypothesis. Proof of the lower bound of Theorem 1. Now we are going to prove the lower bound result of adversarial Rademacher complexity. Recall the definition of non-adversarial Rademacher complexity RD (f • H∆H) = E sup ∥w∥p≤W,∥w ′ ∥p≤W 1 n n i=1 σ i w ⊤ x i x ⊤ i w ′ = 1 n E sup ∥w∥p≤W,∥w ′ ∥p≤W w ⊤ n i=1 σ i x i x ⊤ i w ′ (20) ≤ 1 n E sup ∥w∥2≤W,∥w ′ ∥2≤2W w ⊤ n i=1 σ i x i x ⊤ i w ′ × 1, if 1 ≤ p ≤ 2 d 1-2/p , else if p > 2 Lemma 13 = W 2 n E n i=1 σ i x i x ⊤ i 2 × 1, if 1 ≤ p ≤ 2 d 1-2/p , else if p > 2 , due to equivalence of norms. Now, we denote v * such that v * = arg max ∥w∥p≤W,∥w ′ ∥p≤W 1 n n i=1 σ i w ⊤ x i x ⊤ i w ′ . According to Lemma 13 the maximum value of 1 n n i=1 σ i w ⊤ x i x ⊤ i w ′ is: sup ∥w∥2≤W,∥w ′ ∥2≤W 1 n n i=1 σ i w ⊤ x i x ⊤ i w ′ = W 2 n n i=1 σ i x i x ⊤ i 2 and if we define S(σ) := n i=1 σ i x i x ⊤ i the maxima is attained when v * = W v max (S(σ) 2 ) is an eigenvector of ℓ 2 -norm W associated to the largest eigenvalue of S(σ) 2 . Now, we switch to adversarial Rademacher: RD ( f • H∆H) = E sup ∥w∥p≤W,∥w ′ ∥p≤W 1 n n i=1 σ i min ∥δ∥∞≤ϵ w ⊤ (x i + δ)(x i + δ) ⊤ w ′ ≥ E sup ∥w∥2≤W,∥w ′ ∥2≤W 1 n n i=1 σ i min ∥δ∥∞≤ϵ w ⊤ (x i + δ)(x i + δ) ⊤ w ′ × d 1-2/p , if 1 ≤ p ≤ 2 1, else if p > 2 ≥ E sup ∥w∥2≤W 1 n n i=1 σ i min ∥δ∥∞≤ϵ w ⊤ (x i + δ)(x i + δ) ⊤ w × d 1-2/p , if 1 ≤ p ≤ 2 1, else if p > 2 where in the first inequality we used that, if 1 ≤ p ≤ 2, then B 2 (W ) ⊆ B p (d 1/p-1/2 W ) and else when p > 2, we simply have that B 2 (W ) ⊆ B p (W ). According to Lemma 17, we have: RD ( f • H∆H) ≥ E sup ∥w∥2≤2W 1 n n i=1 σ i w ⊤ x i -w ⊤ x i min 1, ϵ∥w∥ 1 |w ⊤ x i | 2 Case I: 1 ≤ p ≤ 2. First, to avoid confusion in different Rademacher variables, let us use σ ′ and σ to denote the Rademacher variables in RD ( f • H∆H) and RD (f • H∆H). Then, let us define v ′ * := W v max (S(σ ′ ) 2 ). 
Then we consider the gap: RD ( f • H∆H) -RD (f • H∆H) = E σ ′ sup ∥w∥p≤2W 1 n n i=1 σ ′ i w ⊤ x i -w ⊤ x i min 1, ϵ∥w∥ 1 |w ⊤ x i | 2 -E σ sup ∥w∥2≤W,∥w ′ ∥2≤W 1 n n i=1 σ i w ⊤ x i x ⊤ i w ′ ≥ E σ ′ 1 n n i=1 σ ′ i v ′ * ⊤ x i -v ′ * ⊤ x i min 1, ϵ∥v ′ * ∥ 1 |v ′ * ⊤ x i | 2 -E σ 1 n n i=1 σ i v * ⊤ x i x ⊤ i v * = W 2 n E σ ′ n i=1 σ ′ i -2(v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | + W 2 n E σ ′ n i=1 σ ′ i (v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 ))∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | 2 Let us define I(σ ′ ) := n i=1 σ ′ i -2(v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | and J(σ ′ ) := n i=1 σ ′ i (v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | 2 , so that RD ( f • H∆H) -RD (f • H∆H) ≥ W 2 n (E σ ′ [I(σ ′ )] + E σ ′ [J(σ ′ )]) . We are now going to prove that I(σ) + I(-σ) = 0 and J(σ) + J(-σ) = 0. First we know that v max (S(σ) 2 ) = v max (S(-σ) 2 ), since S(-σ) 2 = (- n i=1 σ i x i x ⊤ i ) 2 = S(σ) 2 . So I(σ) + I(-σ) = n i=1 σ ′ i -2(v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v ⊤ max (S(σ) 2 )x i | + n i=1 -σ i -2(v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | = 0. Similarly J(σ) + J(-σ) = 0. According to Lemma 15, we can split {-1, +1} n into A + and A -, such that |A + | = |A -| and A -= -A + whereis element-wised negative sign. So we know: E σ [I(σ)] = σ∈A + 1 2 n I(σ) + σ∈A - 1 2 n I(σ) = σ∈A + 1 2 n I(σ) + σ∈A + 1 2 n I(-σ) = 0 E σ [J(σ)] = σ∈A + 1 2 n J(σ) + σ∈A - 1 2 n J(σ) = σ∈A + 1 2 n J(σ) + σ∈A + 1 2 n J(-σ) = 0 Hence we conclude that RD ( f • H∆H) ≥ RD (f • H∆H). Case II: p > 2. 
Similarly we have that RD ( f • H∆H) -RD (f • H∆H) = E σ ′ sup ∥w∥p≤2W 1 n n i=1 σ ′ i w ⊤ x i -w ⊤ x i min 1, ϵ∥w∥ 1 |w ⊤ x i | 2 -d 1-2/p E σ sup ∥w∥2≤W,∥w ′ ∥2≤W 1 n n i=1 σ i w ⊤ x i x ⊤ i w ′ ≥ E σ ′ 1 n n i=1 σ ′ i v ′ * ⊤ x i -v ′ * ⊤ x i min 1, ϵ∥v ′ * ∥ 1 |v ′ * ⊤ x i | 2 -d 1-2/p E σ 1 n n i=1 σ i v * ⊤ x i x ⊤ i v * = W 2 n (1 -d 1-2/p )E σ n i=1 σ i x i x ⊤ i + W 2 n E σ ′ n i=1 σ ′ i -2(v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | + W 2 n E σ ′ n i=1 σ ′ i (v max (S(σ ′ ) 2 ) ⊤ x i ) 2 min 1, ϵ∥v max (S(σ ′ ) 2 )∥ 1 |v max (S(σ ′ ) 2 ) ⊤ x i | 2 ≥ W 2 n (1 -d 1-2/p )E σ n i=1 σ i x i x ⊤ i 2 . where in the last step we also use the same reasoning as in Case I.
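The symmetry argument at the heart of the lower-bound proof, namely that S(−σ)² = S(σ)² forces Σ_σ I(σ) = 0, can be verified by exhaustive enumeration on a tiny instance. The sketch below (the data, n, d, and ϵ are our own toy choices) sums the I(σ) term over all 2ⁿ sign patterns and checks that it vanishes:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n, d, eps = 6, 3, 0.1
X = rng.normal(size=(n, d))

def v_max(sigma):
    """Unit eigenvector for the largest eigenvalue of S(sigma)^2,
    where S(sigma) = sum_i sigma_i x_i x_i^T."""
    S = sum(s * np.outer(x, x) for s, x in zip(sigma, X))
    vals, vecs = np.linalg.eigh(S @ S)
    return vecs[:, np.argmax(vals)]

def I_term(sigma):
    v = v_max(sigma)
    return sum(
        s * (-2) * (v @ x) ** 2 * min(1.0, eps * np.abs(v).sum() / abs(v @ x))
        for s, x in zip(sigma, X)
    )

# S(-sigma)^2 = S(sigma)^2, so I(sigma) + I(-sigma) = 0 and E_sigma[I] = 0
total = sum(I_term(np.array(s)) for s in product([-1, 1], repeat=n))
print("sum of I(sigma) over all sign patterns:", total)
assert abs(total) < 1e-8
```

The quantity I_term is invariant under v → −v, so the sign ambiguity of the numerical eigenvector does not affect the pairwise cancellation.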

H PROOFS FOR LINEAR REGRESSION
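The proofs in this appendix all revolve around the spectral norm $\|\sum_i\sigma_ix_ix_i^\top\|_2$. As a numerical companion (an illustration we add with arbitrary synthetic data, not part of the proofs), the sketch below Monte-Carlo-estimates the non-adversarial H∆H complexity for $p=2$ via this spectral-norm identity and checks it against a matrix-Bernstein-style bound of the form established in Lemma 4:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, W = 50, 5, 1.0
X = rng.normal(size=(n, d))

def empirical_complexity(num_draws=500):
    # Monte-Carlo estimate of E_sigma sup_{||v||_2 <= 2W} v^T (1/n sum_i sigma_i x_i x_i^T) v.
    vals = []
    for _ in range(num_draws):
        sigma = rng.choice([-1.0, 1.0], size=n)
        M = (sigma[:, None] * X).T @ X / n          # (1/n) sum_i sigma_i x_i x_i^T
        vals.append(4 * W**2 * max(np.linalg.eigvalsh(M)[-1], 0.0))
    return float(np.mean(vals))

# Matrix-Bernstein-style upper bound (p = 2 case): variance term plus range term.
variance = np.linalg.norm(sum(np.outer(x, x) @ np.outer(x, x) for x in X), 2)
row_norm = max(np.linalg.norm(x) ** 2 for x in X)   # ||X||_{2,inf}^2
bound = 4 * W**2 / n * (np.sqrt(2 * variance * np.log(2 * d))
                        + row_norm * np.log(2 * d) / 3)

assert empirical_complexity() <= bound
```

The Monte-Carlo estimate sits below the closed-form bound, as the lemma predicts; the slack comes from both the Bernstein inequality and the relaxation of the supremum to a spectral norm.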

H.1 PROOF OF LEMMA 4

In this subsection we present the proof of the upper bound on the Rademacher complexity for regression under the linear hypothesis class.

Proof. We first aim at controlling the non-adversarial Rademacher complexity over the H∆H class. Specializing its definition in (1) to the linear regression setting gives
$$\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)=\mathbb{E}_\sigma\left[\sup_{\|w\|_p\le W,\,\|w'\|_p\le W}\frac1n\sum_{i=1}^n\sigma_i\big(w^\top x_i-w'^\top x_i\big)^2\right].$$
We introduce the change of variable $v:=w-w'$, which yields
$$\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)=\mathbb{E}_\sigma\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2\right].\qquad(36)$$
We first derive the upper bound of $\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)$, following similar steps as in Appendix G.1: we rewrite the supremum as a spectral norm and then apply a matrix Bernstein inequality. We have
$$\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)=\mathbb{E}_\sigma\left[\sup_{\|v\|_p\le2W}v^\top\left(\frac1n\sum_{i=1}^n\sigma_ix_ix_i^\top\right)v\right]\qquad(20)$$
$$\le\mathbb{E}_\sigma\left[\sup_{\|v\|_2\le2W}v^\top\left(\frac1n\sum_{i=1}^n\sigma_ix_ix_i^\top\right)v\right]\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2\end{cases}=\frac{4W^2}{n}\,\mathbb{E}_\sigma\left\|\sum_{i=1}^n\sigma_ix_ix_i^\top\right\|_2\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$
Following exactly the same steps as in Appendix G.1, we apply the matrix Bernstein inequality. Denote by $Z_i:=\sigma_ix_ix_i^\top\in\mathbb{R}^{d\times d}$ the random matrices to which we apply Theorem 4. Then $Z_i^2=\sigma_i^2(x_ix_i^\top)^2=(x_ix_i^\top)^2$ is a deterministic matrix. These random matrices $Z_i$ are symmetric, independent, have zero mean, and satisfy, for all $i\in[n]$,
$$\|Z_i\|_2=\sqrt{\lambda_{\max}(Z_i^2)}=\sqrt{\lambda_{\max}\big(x_ix_i^\top x_ix_i^\top\big)}=\|x_i\|_2^2\le\max_{j\in[n]}\|x_j\|_2^2=\|X\|_{2,\infty}^2.$$
Moreover, let $Y:=\sum_{i=1}^nZ_i$. According to the matrix Bernstein inequality, we get the desired bound
$$\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)\overset{(14)}{\le}\frac{4W^2}{n}\left(\sqrt{2\left\|\sum_{i=1}^n(x_ix_i^\top)^2\right\|_2\log(2d)}+\frac13\|X\|_{2,\infty}^2\log(2d)\right)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$

H.2 PROOF OF THE UPPER BOUND OF THEOREM 2

In this subsection we present the proof of the upper bound on the adversarial Rademacher complexity for regression under the linear hypothesis class.

Proof.
We then examine the adversarial Rademacher complexity of linear regression models defined in (3):
$$\widehat{\mathfrak{R}}_S(\tilde\ell\circ H\Delta H)=\mathbb{E}_\sigma\left[\sup_{\|w\|_p\le W,\,\|w'\|_p\le W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\big(w^\top(x_i+\delta)-w'^\top(x_i+\delta)\big)^2\right].\qquad(37)$$
We start by expressing $\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)$ as a function of $\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)$, its non-adversarial counterpart studied in Lemma 4. Let $v:=w-w'$. We expand this quantity as follows:
$$\widehat{\mathfrak{R}}_S(\tilde\ell\circ H\Delta H)\overset{(37)}{=}\mathbb{E}_\sigma\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\big(v^\top x_i+v^\top\delta\big)^2\right]=\mathbb{E}_\sigma\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\Big((v^\top x_i)^2+2v^\top x_i\,v^\top\delta+(v^\top\delta)^2\Big)\right]$$
$$\le\widehat{\mathfrak{R}}_S(\ell\circ H\Delta H)+\mathbb{E}_\sigma\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\Big(2v^\top x_i\,\delta^\top v+(v^\top\delta)^2\Big)\right]=\widehat{\mathfrak{R}}_S(\ell\circ H\Delta H)+\underbrace{\mathbb{E}_\sigma\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}v^\top\big(2x_i\delta^\top+\delta\delta^\top\big)v\right]}_{A},$$
where we used the subadditivity of the supremum to make the non-adversarial Rademacher complexity over the H∆H class appear. Now we examine the upper bound of the second term $A$ using the notion of covering recalled in Definition 6. Let $C$ be a covering of the centered $\ell_p$ ball of radius $2W$, denoted $B_p(2W)$, with $\ell_p$ balls of radius $\rho>0$. Let us define
$$\zeta_i(v):=\max_{\|\delta\|_\infty\le\epsilon}v^\top\big(2x_i\delta^\top+\delta\delta^\top\big)v.\qquad(39)$$
Thus we can rewrite $A$ as
$$A\overset{(39)}{=}\mathbb{E}_\sigma\left[\sup_{v\in B_p(2W)}\frac1n\sum_{i=1}^n\sigma_i\zeta_i(v)\right]=\mathbb{E}_\sigma\left[\sup_{\substack{v\in B_p(2W)\\ v_c\in C:\,\|v-v_c\|_p\le\rho}}\frac1n\sum_{i=1}^n\sigma_i\big(\zeta_i(v_c)+\zeta_i(v)-\zeta_i(v_c)\big)\right]$$
$$\le\mathbb{E}_\sigma\left[\sup_{\tilde v\in C}\frac1n\sum_{i=1}^n\sigma_i\zeta_i(\tilde v)\right]+\mathbb{E}_\sigma\left[\sup_{v\in B_p(2W)}\frac1n\sum_{i=1}^n\sigma_i\big(\zeta_i(v)-\zeta_i(v_c)\big)\right]\le\underbrace{\mathbb{E}_\sigma\left[\sup_{\tilde v\in C}\frac1n\sum_{i=1}^n\sigma_i\zeta_i(\tilde v)\right]}_{(I)}+\underbrace{\sup_{v\in B_p(2W)}\frac1n\sum_{i=1}^n\big|\zeta_i(v)-\zeta_i(v_c)\big|}_{(II)},\qquad(40)$$
where $v_c$ is the closest element to $v$ in $C$ and where we used the subadditivity of the supremum.

Bounding (I):

We first need to bound the left-hand side term (I). We introduce the vector $\zeta(\tilde v):=[\zeta_1(\tilde v),\ldots,\zeta_n(\tilde v)]^\top\in\mathbb{R}^n$, and let $K$ denote the largest $\ell_2$-norm of $\zeta(\tilde v)$ over the covering $C$. By Massart's lemma (Lemma 5.2 of Massart (2000)), we are able to control the first term (I):
$$(I)=\mathbb{E}_\sigma\left[\sup_{\tilde v\in C}\frac1n\sum_{i=1}^n\sigma_i\zeta_i(\tilde v)\right]\le\frac{K\sqrt{2\log|C|}}{n}.\qquad(45)$$

Bounding (II). Let $\delta^*$ and $\delta_c^*$ be maximizers of the problems defining $\zeta_i(v)$ and $\zeta_i(v_c)$, respectively, and set $M^*:=2x_i\delta^{*\top}+\delta^*\delta^{*\top}$. Then we can make the difference explicit and upper bound it:
$$\zeta_i(v)-\zeta_i(v_c)=v^\top\big(2x_i\delta^{*\top}+\delta^*\delta^{*\top}\big)v-v_c^\top\big(2x_i\delta_c^{*\top}+\delta_c^*\delta_c^{*\top}\big)v_c\le v^\top\big(2x_i\delta^{*\top}+\delta^*\delta^{*\top}\big)v-v_c^\top\big(2x_i\delta^{*\top}+\delta^*\delta^{*\top}\big)v_c$$
$$=v^\top M^*v-v_c^\top M^*v_c=(v-v_c)^\top M^*v+v_c^\top M^*(v-v_c)\overset{\text{Cauchy--Schwarz}}{\le}\|M^*\|_2\,\|v-v_c\|_2\big(\|v\|_2+\|v_c\|_2\big)\le4\rho W\sqrt d\,\epsilon\big(\sqrt d\epsilon+2\|x_i\|_2\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2,\end{cases}$$
where lastly we used the same arguments leading to (43), the norm transfer in (44) and the inequality $\|\cdot\|_2\le\sqrt d\,\|\cdot\|_\infty$. Symmetrically, we can show that the same upper bound holds for $\zeta_i(v_c)-\zeta_i(v)$. Thus, (II) is upper bounded by
$$(II)\le\frac{4\rho W\sqrt d\,\epsilon}{n}\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}\qquad(46)$$
Let us choose $\rho=W/n$. We have then proved that
$$A\overset{(45)\text{--}(46)}{\le}(I)+(II)\le\left(\frac{4W^2\sqrt d\,\epsilon}{n}\sqrt{\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)^2\cdot2d\log(6W/\rho)}+\frac{4\rho W\sqrt d\,\epsilon}{n}\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)\right)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2\end{cases}$$
$$=\frac{4W^2\sqrt d\,\epsilon}{n}\left(\frac1n\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)+\sqrt{\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)^2\cdot2d\log(6n)}\right)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$
By combining (38), (40) and the above bound, we are able to finish the proof of the upper bound on the adversarial Rademacher complexity over the class H∆H for linear regression. Finally, this gives
$$\widehat{\mathfrak{R}}_S(\tilde\ell\circ H\Delta H)\le\widehat{\mathfrak{R}}_S(\ell\circ H\Delta H)+\frac{4W^2\sqrt d\,\epsilon}{n}\left(\frac1n\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)+\sqrt{\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)^2\cdot2d\log(6n)}\right)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$
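The Massart step used to control (I) can be checked on a toy instance. The sketch below (an illustration we add; the data, the finite set standing in for the covering, and $\epsilon$ are arbitrary test values) computes the exact expectation over all sign vectors and compares it with $K\sqrt{2\log|C|}/n$, using the closed form of $\zeta_i$:

```python
import itertools
import numpy as np

rng = np.random.default_rng(2)
n, d, eps = 6, 3, 0.2
X = rng.normal(size=(n, d))
C = [rng.normal(size=d) for _ in range(8)]   # stand-in for a rho-covering

def zeta(v):
    # Closed form of max_{||delta||_inf <= eps} v^T (2 x delta^T + delta delta^T) v:
    # with t = v^T delta in [-eps||v||_1, eps||v||_1], the max of 2(v^T x)t + t^2
    # is 2 eps ||v||_1 |v^T x| + (eps ||v||_1)^2.
    l1 = np.abs(v).sum()
    return np.array([2 * eps * l1 * abs(v @ x) + (eps * l1) ** 2 for x in X])

zetas = np.array([zeta(v) for v in C])       # |C| x n matrix of loss vectors
K = max(np.linalg.norm(z) for z in zetas)    # largest l2-norm over the cover

# Exact E_sigma sup_{v in C} (1/n) <sigma, zeta(v)> by enumerating all signs.
lhs = np.mean([max(float(z @ np.array(s)) for z in zetas) / n
               for s in itertools.product([-1, 1], repeat=n)])
assert lhs <= K * np.sqrt(2 * np.log(len(C))) / n
```

The inequality holds with slack here; Massart's lemma is tight only for very spread-out finite classes.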

H.3 PROOF OF THE LOWER BOUND OF THEOREM 2

In this subsection we present the proof of the lower bound on the adversarial Rademacher complexity for regression under the linear hypothesis class.

Proof. Recall the definition of the non-adversarial Rademacher complexity:
$$\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)\overset{(36)}{=}\mathbb{E}\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2\right]\overset{\text{Lemma 8}}{\le}\mathbb{E}\left[\sup_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2\right]\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2,\end{cases}$$
due to the equivalence of norms. Now, we denote by $v_*$ a maximizer:
$$v_*=\arg\max_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2.$$
One can verify that
$$\sup_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2=\frac1n\sup_{\|v\|_2\le2W}v^\top\left(\sum_{i=1}^n\sigma_ix_ix_i^\top\right)v\le\frac{4W^2}{n}\left\|\sum_{i=1}^n\sigma_ix_ix_i^\top\right\|_2,$$
and, if we define $S(\sigma):=\sum_{i=1}^n\sigma_ix_ix_i^\top$, the maximum is attained at $v_*=2W\,v_{\max}(S(\sigma)^2)$. Now, we switch to the adversarial Rademacher complexity:
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)=\mathbb{E}\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\big(v^\top(x_i+\delta)\big)^2\right]\overset{\text{Lemma 14}}{\ge}\mathbb{E}\left[\sup_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\big(v^\top(x_i+\delta)\big)^2\right]\times\begin{cases}d^{1-2/p},&\text{if }1\le p\le2\\ 1,&\text{if }p>2.\end{cases}$$
According to Lemma 16, the inner maximum admits the closed form $\max_{\|\delta\|_\infty\le\epsilon}(v^\top(x_i+\delta))^2=(\epsilon\|v\|_1+|v^\top x_i|)^2$, so that
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)\ge\mathbb{E}\left[\sup_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i\big(\epsilon\|v\|_1+|v^\top x_i|\big)^2\right]\times\begin{cases}d^{1-2/p},&\text{if }1\le p\le2\\ 1,&\text{if }p>2.\end{cases}$$

Case I: $1\le p\le2$. To avoid confusion between the different Rademacher variables, we use $\sigma'$ and $\sigma$ for those in $\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)$ and $\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)$, respectively, and define $v'_*:=2W\,v_{\max}(S(\sigma')^2)$. Then
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)-\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)=\mathbb{E}_{\sigma'}\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma'_i\big(\epsilon\|v\|_1+|v^\top x_i|\big)^2\right]-\mathbb{E}_\sigma\left[\sup_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2\right]$$
$$\ge\mathbb{E}_{\sigma'}\left[\frac1n\sum_{i=1}^n\sigma'_i\big(\epsilon\|v'_*\|_1+|v'^\top_*x_i|\big)^2\right]-\mathbb{E}_\sigma\left[\frac1n\sum_{i=1}^n\sigma_i(v_*^\top x_i)^2\right]=\frac{4W^2}{n}\,\mathbb{E}_{\sigma'}\left[\sum_{i=1}^n\sigma'_i\Big(2\epsilon\|v_{\max}(S(\sigma')^2)\|_1\,\big|v_{\max}(S(\sigma')^2)^\top x_i\big|+\epsilon^2\|v_{\max}(S(\sigma')^2)\|_1^2\Big)\right].$$
Let
$$J(\sigma):=\sum_{i=1}^n\sigma_i\Big(2\epsilon\|v_{\max}(S(\sigma)^2)\|_1\,\big|v_{\max}(S(\sigma)^2)^\top x_i\big|+\epsilon^2\|v_{\max}(S(\sigma)^2)\|_1^2\Big),$$
and we claim that $J(\sigma)+J(-\sigma)=0$. To prove this claim, first note that $v_{\max}(S(\sigma)^2)=v_{\max}(S(-\sigma)^2)$, since $S(-\sigma)^2=\big(-\sum_{i=1}^n\sigma_ix_ix_i^\top\big)^2=S(\sigma)^2$. So
$$J(\sigma)+J(-\sigma)=\sum_{i=1}^n\sigma_i\Big(2\epsilon\|v_{\max}(S(\sigma)^2)\|_1\,\big|v_{\max}(S(\sigma)^2)^\top x_i\big|+\epsilon^2\|v_{\max}(S(\sigma)^2)\|_1^2\Big)+\sum_{i=1}^n(-\sigma_i)\Big(2\epsilon\|v_{\max}(S(\sigma)^2)\|_1\,\big|v_{\max}(S(\sigma)^2)^\top x_i\big|+\epsilon^2\|v_{\max}(S(\sigma)^2)\|_1^2\Big)=0.$$
According to Lemma 15, we can split $\{-1,+1\}^n$ into $A^+$ and $A^-$ such that $|A^+|=|A^-|$ and $A^-=-A^+$, where $-$ is the element-wise negation. So we know:
$$\mathbb{E}_\sigma[J(\sigma)]=\sum_{\sigma\in A^+}\frac1{2^n}J(\sigma)+\sum_{\sigma\in A^-}\frac1{2^n}J(\sigma)=\sum_{\sigma\in A^+}\frac1{2^n}J(\sigma)+\sum_{\sigma\in A^+}\frac1{2^n}J(-\sigma)=0.$$
Hence we conclude that $\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)\ge\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)$.

Case II: $p>2$. Similarly we have
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)-\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)=\mathbb{E}_{\sigma'}\left[\sup_{\|v\|_p\le2W}\frac1n\sum_{i=1}^n\sigma'_i\big(\epsilon\|v\|_1+|v^\top x_i|\big)^2\right]-d^{1-2/p}\,\mathbb{E}_\sigma\left[\sup_{\|v\|_2\le2W}\frac1n\sum_{i=1}^n\sigma_i(v^\top x_i)^2\right]$$
$$\ge\mathbb{E}_{\sigma'}\left[\frac1n\sum_{i=1}^n\sigma'_i\big(\epsilon\|v'_*\|_1+|v'^\top_*x_i|\big)^2\right]-d^{1-2/p}\,\mathbb{E}_\sigma\left[\frac1n\sum_{i=1}^n\sigma_i(v_*^\top x_i)^2\right]$$
$$=\big(1-d^{1-2/p}\big)\frac{4W^2}{n}\,\mathbb{E}\left\|\sum_{i=1}^n\sigma_ix_ix_i^\top\right\|_2+\frac{4W^2}{n}\,\mathbb{E}_{\sigma'}\left[\sum_{i=1}^n\sigma'_i\Big(2\epsilon\|v_{\max}(S(\sigma')^2)\|_1\,\big|v_{\max}(S(\sigma')^2)^\top x_i\big|+\epsilon^2\|v_{\max}(S(\sigma')^2)\|_1^2\Big)\right]\ge\big(1-d^{1-2/p}\big)\frac{4W^2}{n}\,\mathbb{E}\left\|\sum_{i=1}^n\sigma_ix_ix_i^\top\right\|_2,$$
where in the last step we also use the same reasoning as in Case I.
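Two ingredients of the proof above lend themselves to a quick numerical check: the closed form of the inner maximization (Lemma 16) and the sign-symmetry that makes $\mathbb{E}_\sigma[J(\sigma)]$ vanish. The sketch below (our illustration with arbitrary data; not part of the proof) verifies the first one:

```python
import numpy as np

rng = np.random.default_rng(3)
d, eps = 4, 0.3
v, x = rng.normal(size=d), rng.normal(size=d)

# Closed form: max_{||delta||_inf <= eps} (v^T(x + delta))^2 = (|v^T x| + eps ||v||_1)^2,
# attained at delta* = eps * sign(v) * sign(v^T x).
closed_form = (abs(v @ x) + eps * np.abs(v).sum()) ** 2
delta_star = eps * np.sign(v) * np.sign(v @ x)
assert abs((v @ (x + delta_star)) ** 2 - closed_form) < 1e-9

# No random feasible perturbation exceeds the closed-form value.
for _ in range(1000):
    delta = rng.uniform(-eps, eps, size=d)
    assert (v @ (x + delta)) ** 2 <= closed_form + 1e-9
```

The maximizer aligns each coordinate of $\delta$ with the sign of $v$ (flipped according to the sign of $v^\top x$), which is why the $\ell_1$-norm of $v$ appears in the adversarial complexity.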

I PROOF OF NEURAL NETWORK COMPLEXITY BOUNDS

In this section, we present the proof of the upper bound on the adversarial Rademacher complexity for the two-layer neural network hypothesis class (Theorem 3). We prove the classification bound in Appendix I.1 and the regression bound in Appendix I.2.

I.1 PROOF OF BINARY CLASSIFICATION SETTING

Proof of the classification bound of Theorem 3. We recall that $B_p(R)$ stands for the $\ell_p$ ball of radius $R$ (either in the vector space $\mathbb{R}^d$ or the matrix space $\mathbb{R}^{d\times d}$). To simplify notation, we denote by $g(x):=\max\{0,x\}$ the ReLU activation $\mathbb{R}^d\to\mathbb{R}^d$, where the max is applied coordinate-wise. Using the definition of the $f\circ H\Delta H$ class in (7), we upper bound the adversarial Rademacher complexity in the binary classification setting by expressing it as its non-adversarial counterpart plus an additional term. Thus we get
$$\widehat{\mathfrak{R}}_D(\tilde f\circ H\Delta H)=\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\min_{\|\delta\|_\infty\le\epsilon}a^\top g\big(W(x_i+\delta)\big)\,a'^\top g\big(W'(x_i+\delta)\big)\right].$$
As the function $\delta\mapsto a^\top g(W(x_i+\delta))\,a'^\top g(W'(x_i+\delta))$ is continuous (a composition of continuous linear and ReLU functions), it reaches a minimum on the compact $\ell_\infty$ ball of radius $\epsilon>0$, also denoted by $B_\infty(\epsilon)$. Let $\delta_i^*$ be an argument of the minimum, i.e.
$$\delta_i^*\in\arg\min_{\delta\in B_\infty(\epsilon)}a^\top g\big(W(x_i+\delta)\big)\,a'^\top g\big(W'(x_i+\delta)\big).$$
With this notation, we can write
$$\widehat{\mathfrak{R}}_D(\tilde f\circ H\Delta H)\le\widehat{\mathfrak{R}}_D(f\circ H\Delta H)+\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\Big(a^\top g\big(W(x_i+\delta_i^*)\big)g\big(W'(x_i+\delta_i^*)\big)^\top a'-a^\top g(Wx_i)g(W'x_i)^\top a'\Big)\right]$$
$$=\widehat{\mathfrak{R}}_D(f\circ H\Delta H)+\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}a^\top\left(\frac1n\sum_{i=1}^n\sigma_i\Big(g\big(W(x_i+\delta_i^*)\big)g\big(W'(x_i+\delta_i^*)\big)^\top-g(Wx_i)g(W'x_i)^\top\Big)\right)a'\right].$$
Let the term inside the expectation be maximized (as a continuous function over a compact set) at $a_*,a'_*,W_*,W'_*$, and let
$$S_i:=\sigma_i\Big(g\big(W_*(x_i+\delta_i^*)\big)g\big(W'_*(x_i+\delta_i^*)\big)^\top-g(W_*x_i)g(W'_*x_i)^\top\Big)\in\mathbb{R}^{d\times d}.$$
Then we can rewrite the term inside the expectation as
$$a_*^\top\left(\frac1n\sum_{i=1}^nS_i\right)a'_*\overset{\text{Cauchy--Schwarz}}{\le}\|a_*\|_2\|a'_*\|_2\left\|\frac1n\sum_{i=1}^nS_i\right\|_2\overset{\text{Lemma 8}}{\le}\|a_*\|_p\|a'_*\|_p\left\|\frac1n\sum_{i=1}^nS_i\right\|_2\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2\end{cases}\overset{a_*,a'_*\in B_p(A)^2}{\le}A^2\left\|\frac1n\sum_{i=1}^nS_i\right\|_2\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$
Thus we have the following inequality:
$$\widehat{\mathfrak{R}}_D(\tilde f\circ H\Delta H)\le\widehat{\mathfrak{R}}_D(f\circ H\Delta H)+\frac{A^2}{n}\,\mathbb{E}_\sigma\left\|\sum_{i=1}^nS_i\right\|_2\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}\qquad(47)$$
Let us now estimate the spectral norm of the sum of random matrices $\sum_{i=1}^nS_i$ using the matrix Bernstein inequality of Theorem 4. We have
$$\|S_i\|_2=\big\|g\big(W_*(x_i+\delta_i^*)\big)g\big(W'_*(x_i+\delta_i^*)\big)^\top-g(W_*x_i)g(W'_*x_i)^\top\big\|_2=\big\|g\big(W_*(x_i+\delta_i^*)\big)\big(g\big(W'_*(x_i+\delta_i^*)\big)-g(W'_*x_i)\big)^\top+\big(g\big(W_*(x_i+\delta_i^*)\big)-g(W_*x_i)\big)g(W'_*x_i)^\top\big\|_2$$
$$\le\big\|g\big(W_*(x_i+\delta_i^*)\big)\big(g\big(W'_*(x_i+\delta_i^*)\big)-g(W'_*x_i)\big)^\top\big\|_2+\big\|\big(g\big(W_*(x_i+\delta_i^*)\big)-g(W_*x_i)\big)g(W'_*x_i)^\top\big\|_2\overset{\text{Cauchy--Schwarz}}{\le}\|W_*(x_i+\delta_i^*)\|_2\|W'_*\delta_i^*\|_2+\|W_*\delta_i^*\|_2\|W'_*x_i\|_2$$
$$\overset{\text{Lemma 8}}{\le}W^2\sqrt d\,\epsilon\big(\sqrt d\epsilon+2\|x_i\|_2\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2\end{cases}\qquad(48)$$
$$\le W^2\sqrt d\,\epsilon\big(\sqrt d\epsilon+2\|X\|_{2,\infty}\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2,\end{cases}\qquad(49)$$
where we used the 1-Lipschitzness of the ReLU function. Now let us examine the upper bound of the variance of the sum. Note that $S_i^\top S_i$ is a deterministic matrix since $\sigma_i^2=1$. Thus, we have
$$\mathrm{Var}\left(\sum_{i=1}^nS_i\right):=\left\|\sum_{i=1}^n\mathbb{E}_\sigma[S_i^\top S_i]\right\|_2\le\sum_{i=1}^n\|S_i\|_2^2\overset{(48)}{\le}\big(W^2\sqrt d\,\epsilon\big)^2\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)^2\times\begin{cases}1,&\text{if }1\le p\le2\\ \big(d^{1-2/p}\big)^2,&\text{if }p>2.\end{cases}$$
Using (49), we can apply the matrix Bernstein inequality of Theorem 4 and get
$$\mathbb{E}_\sigma\left\|\sum_{i=1}^nS_i\right\|_2\le W^2\epsilon\sqrt d\left(\sqrt{2\sum_{i=1}^n\big(\sqrt d\epsilon+2\|x_i\|_2\big)^2\log(2d)}+\frac13\log(2d)\big(\sqrt d\epsilon+2\|X\|_{2,\infty}\big)\right)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$
For the regression bound of Theorem 3, the analogous decomposition reads
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)\le\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)+\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\Big(\big(a^\top g\big(W(x_i+\delta_i^*)\big)-a'^\top g\big(W'(x_i+\delta_i^*)\big)\big)^2-\big(a^\top g(Wx_i)-a'^\top g(W'x_i)\big)^2\Big)\right],$$
where, like in Appendix I.1, we denote by $\delta_i^*$ an argument of the maximum of the inner function, i.e. $\delta_i^*\in\arg\max_{\delta\in B_\infty(\epsilon)}\big(a^\top g(W(x_i+\delta))-a'^\top g(W'(x_i+\delta))\big)^2$.
Then, by introducing the following matrices
• $P_i:=g\big(W(x_i+\delta_i^*)\big)g\big(W(x_i+\delta_i^*)\big)^\top-g(Wx_i)g(Wx_i)^\top\in\mathbb{R}^{d\times d}$
• $Q_i:=g\big(W'(x_i+\delta_i^*)\big)g\big(W'(x_i+\delta_i^*)\big)^\top-g(W'x_i)g(W'x_i)^\top\in\mathbb{R}^{d\times d}$
• $R_i:=g\big(W(x_i+\delta_i^*)\big)g\big(W'(x_i+\delta_i^*)\big)^\top-g(Wx_i)g(W'x_i)^\top\in\mathbb{R}^{d\times d}$
we can rewrite the upper bound of the adversarial Rademacher complexity as
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)\le\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)+\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\big(a^\top P_ia+a'^\top Q_ia'+a^\top R_ia'\big)\right]$$
$$\le\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)+\mathbb{E}_\sigma\left[\sup_{\substack{a\in B_p(A)\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\,a^\top P_ia\right]+\mathbb{E}_\sigma\left[\sup_{\substack{a'\in B_p(A)\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\,a'^\top Q_ia'\right]+\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\,a^\top R_ia'\right]$$
$$\le\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)+\frac1n\,\mathbb{E}_\sigma\left[\sup_{\substack{a\in B_p(A)\\ W,W'\in B_p(W)^2}}a^\top\left(\sum_{i=1}^n\sigma_iP_i\right)a\right]+\frac1n\,\mathbb{E}_\sigma\left[\sup_{\substack{a'\in B_p(A)\\ W,W'\in B_p(W)^2}}a'^\top\left(\sum_{i=1}^n\sigma_iQ_i\right)a'\right]+\frac1n\,\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}a^\top\left(\sum_{i=1}^n\sigma_iR_i\right)a'\right].$$
Following the same steps that lead to (47), we get that
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)\le\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)+\frac{A^2}{n}\underbrace{\left(\mathbb{E}_\sigma\left\|\sum_{i=1}^n\sigma_iP_i^*\right\|_2+\mathbb{E}_\sigma\left\|\sum_{i=1}^n\sigma_iQ_i^*\right\|_2+\mathbb{E}_\sigma\left\|\sum_{i=1}^n\sigma_iR_i^*\right\|_2\right)}_{(I)}\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2,\end{cases}$$
where the matrices are defined by
• $P_i^*:=g\big(W_P^*(x_i+\delta_i^*)\big)g\big(W_P^*(x_i+\delta_i^*)\big)^\top-g(W_P^*x_i)g(W_P^*x_i)^\top\in\mathbb{R}^{d\times d}$
• $Q_i^*:=g\big(W'^*_Q(x_i+\delta_i^*)\big)g\big(W'^*_Q(x_i+\delta_i^*)\big)^\top-g(W'^*_Qx_i)g(W'^*_Qx_i)^\top\in\mathbb{R}^{d\times d}$
• $R_i^*:=g\big(W_R^*(x_i+\delta_i^*)\big)g\big(W'^*_R(x_i+\delta_i^*)\big)^\top-g(W_R^*x_i)g(W'^*_Rx_i)^\top\in\mathbb{R}^{d\times d}$
for some optimal matrices $W_P^*,W'^*_P,W_Q^*,W'^*_Q,W_R^*,W'^*_R$ with $\ell_p$-norm smaller than $W$.
Now we examine the spectral norms of $\sigma_iP_i^*$, $\sigma_iQ_i^*$ and $\sigma_iR_i^*$ by following the same steps leading to (48):
$$\|\sigma_iP_i^*\|_2=\big\|g\big(W_P^*(x_i+\delta_i^*)\big)g\big(W_P^*(x_i+\delta_i^*)\big)^\top-g(W_P^*x_i)g(W_P^*x_i)^\top\big\|_2\le W^2\epsilon\sqrt d\big(\sqrt d\epsilon+2\|x_i\|_2\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2,\end{cases}$$
$$\|\sigma_iQ_i^*\|_2=\big\|g\big(W'^*_Q(x_i+\delta_i^*)\big)g\big(W'^*_Q(x_i+\delta_i^*)\big)^\top-g(W'^*_Qx_i)g(W'^*_Qx_i)^\top\big\|_2\le W^2\epsilon\sqrt d\big(\sqrt d\epsilon+2\|x_i\|_2\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2,\end{cases}$$
$$\|\sigma_iR_i^*\|_2=\big\|g\big(W_R^*(x_i+\delta_i^*)\big)g\big(W'^*_R(x_i+\delta_i^*)\big)^\top-g(W_R^*x_i)g(W'^*_Rx_i)^\top\big\|_2\le W^2\epsilon\sqrt d\big(\sqrt d\epsilon+2\|x_i\|_2\big)\times\begin{cases}1,&\text{if }1\le p\le2\\ d^{1-2/p},&\text{if }p>2.\end{cases}$$
Now, let us examine the coordinates in $\Lambda$. If the $i$-th coordinate is in $\Lambda$, we know that $-\epsilon\|w\|_1\le w^\top x_i\le\epsilon\|w\|_1$, which implies that there is a $\delta$ that can change the sign of $\mathrm{sign}(w^\top(x+\delta))$, and hence change the value of $\tilde\ell_i$. That is, if $w^\top x\,y(x)\ge0$, there is a $\delta^*=\arg\min_{\|\delta\|_\infty\le\epsilon}w^\top\delta\,y(x)$ such that
$$w^\top(x+\delta^*)\,y(x)=w^\top x\,y(x)-\epsilon\|w\|_1\le0.$$
Similarly, if $w^\top x\,y(x)\le0$, there is a $\delta^*=\arg\max_{\|\delta\|_\infty\le\epsilon}w^\top\delta\,y(x)$ such that
$$w^\top x\,y(x)+w^\top\delta^*\,y(x)=w^\top x\,y(x)+\epsilon\|w\|_1\ge0.$$
Combining Lemmas 2 and 5 implies Corollary 3. Equation (55) shows that a small robust risk on the source domain implies a small standard risk on any target domain $T'$, provided the residual error $V^*(p',p,\ell,\Lambda_\epsilon)$ is also small. The residual error mainly depends on the adversarial budget: if the budget $\epsilon$ is larger, then $\Lambda_\epsilon$ is larger, which means the quantity $V^*(p',p,\ell,\Lambda_\epsilon)$ is smaller. However, there is a trade-off, since $\epsilon$ also affects other terms on the right-hand side of the bound, such as the adversarial Rademacher complexity and the adversarial disagreement between $h_{w_T^*}$ and $h_{w_S^*}$.
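The spectral-norm estimates of the form (48) used in this section can be checked by simulation. The sketch below is our own illustration: it treats the matrix constraint as a spectral-norm bound (an assumption for the $p=2$ case) and samples random weights, inputs, and feasible perturbations:

```python
import numpy as np

rng = np.random.default_rng(4)
d, eps, Wmax = 5, 0.1, 2.0
relu = lambda z: np.maximum(z, 0.0)

for _ in range(200):
    # Random weight matrices rescaled so that their spectral norm equals Wmax.
    W1 = rng.normal(size=(d, d)); W1 *= Wmax / np.linalg.norm(W1, 2)
    W2 = rng.normal(size=(d, d)); W2 *= Wmax / np.linalg.norm(W2, 2)
    x = rng.normal(size=d)
    delta = rng.uniform(-eps, eps, size=d)       # ||delta||_inf <= eps
    # Rank-two difference of outer products of ReLU features, as in S_i / R_i*.
    S = (np.outer(relu(W1 @ (x + delta)), relu(W2 @ (x + delta)))
         - np.outer(relu(W1 @ x), relu(W2 @ x)))
    # Bound W^2 * sqrt(d) * eps * (sqrt(d) * eps + 2 ||x||_2), from 1-Lipschitz ReLU.
    bound = Wmax**2 * np.sqrt(d) * eps * (np.sqrt(d) * eps + 2 * np.linalg.norm(x))
    assert np.linalg.norm(S, 2) <= bound + 1e-9
```

The bound is loose in practice because ReLU features are much smaller than their Lipschitz estimate, but no sampled instance violates it.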

K DETAILS ON EXPERIMENTS

In Table 2, we present the details of the convolutional network. For a convolutional layer (Conv2D or Conv1D), the first argument is the number of channels. For a fully connected layer (FC), the first argument is the number of hidden units.



Figure 1: Robust accuracy drop (∆) as a function of the ℓ1 regularization intensity (µ) and the ℓ∞ perturbation budget ϵ. A linear classifier is adversarially trained on MNIST and tested on target domains.

To characterize the generalization of adversarially robust learning, a line of research Khim & Loh (2018); Yin et al. (2019); Awasthi et al. (2020) has been conducted from the Rademacher complexity point of view. Khim & Loh (2018) were among the first to examine the adversarial Rademacher complexity under ℓ∞ attack; as a concurrent work, Yin et al. (2019) characterized its upper and lower bounds and claimed that adversarially robust learning is at least as hard as standard ERM learning. Awasthi et al. (2020) further extended the results of Yin et al. (2019) to adversary sets under arbitrary norm constraints, and analyzed the complexity of neural networks as well. Another category of generalization studies of robust learning is based on the PAC learning framework. Cullina et al. (

C TECHNICAL NOVELTY

In this section we explain our technical novelty compared to existing works regarding adversarial Rademacher complexity, Yin et al. (2019); Awasthi et al. (2020). Taking the classification setting as an example, Yin et al. (2019); Awasthi et al. (2020) consider the Rademacher complexity over the loss class between model predictions and labels, i.e., $\mathfrak{R}=\mathbb{E}\big[\sup_w\frac1n\sum_{i=1}^n\sigma_i\min_{\|\delta\|_\infty\le\epsilon}w^\top(x_i+\delta)\big]$, where the inner minimization problem is linear in $w$ and $\delta$.
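Because the inner problem is linear in $\delta$, it admits the well-known closed form $\min_{\|\delta\|_\infty\le\epsilon}w^\top(x+\delta)=w^\top x-\epsilon\|w\|_1$, attained at $\delta^*=-\epsilon\,\mathrm{sign}(w)$. A toy check (our illustration; the data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
d, eps = 6, 0.25
w, x = rng.normal(size=d), rng.normal(size=d)

# Closed form of the linear inner minimization over the l_inf ball.
closed_form = w @ x - eps * np.abs(w).sum()
assert abs(w @ (x - eps * np.sign(w)) - closed_form) < 1e-9

# No feasible perturbation drives the margin below the closed-form value.
for _ in range(1000):
    delta = rng.uniform(-eps, eps, size=d)
    assert w @ (x + delta) >= closed_form - 1e-9
```

This linearity is precisely what the H∆H analysis in this paper loses: the loss between two hypotheses is quadratic in the perturbation, which is why the proofs above need the quadratic closed forms and covering arguments instead.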

then the following inequality holds: $|\langle v,w\rangle|\le\|v\|_p\,\|w\|_{p^*}$.

Let $w\in\mathbb{R}^d$, $a\in\mathbb{R}$ and $\epsilon\ge0$. Let us consider the problem $\delta^*=\arg\min_{\delta\in\mathbb{R}^d:\,\|\delta\|_\infty\le\epsilon}\,(w^\top\delta+a)^2$.
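This problem reduces to a scalar one: with $t=w^\top\delta$ ranging over $[-\epsilon\|w\|_1,\epsilon\|w\|_1]$, the minimum is $\max(|a|-\epsilon\|w\|_1,0)^2$, which matches the $\big(a-a\min(1,\epsilon\|w\|_1/|a|)\big)^2$ expressions appearing in the lower-bound proofs. A quick check (our illustration with arbitrary values):

```python
import numpy as np

rng = np.random.default_rng(6)
d, eps = 4, 0.5
w, a = rng.normal(size=d), float(rng.normal())

l1 = np.abs(w).sum()
closed_form = max(abs(a) - eps * l1, 0.0) ** 2
t_star = np.clip(-a, -eps * l1, eps * l1)        # best achievable value of w^T delta
delta_star = (t_star / l1) * np.sign(w)          # feasible: ||delta*||_inf <= eps
assert abs((w @ delta_star + a) ** 2 - closed_form) < 1e-9

# No feasible perturbation does better than the closed form.
for _ in range(1000):
    delta = rng.uniform(-eps, eps, size=d)
    assert (w @ delta + a) ** 2 >= closed_form - 1e-9
```

If $|a|\le\epsilon\|w\|_1$ the perturbation can zero out the loss entirely, which is the mechanism behind the set $\Lambda$ of sign-flippable coordinates discussed in the Lemma 5 analysis.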

We will also need a standard result dealing with the size of coverings of balls. Lemma 20. Let $\rho>0$, let $B\subseteq\mathbb{R}^d$ be the ball of radius $R\ge0$ in a norm $\|\cdot\|$, and let $C$ be a smallest $\rho$-covering of $B$ w.r.t. $\|\cdot\|$. Then $|C|\le(3R/\rho)^d$. Now we are ready to present the proof of the upper bound of the adversarial Rademacher complexity for linear binary classification.
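A covering of this kind is easy to exhibit for the $\ell_\infty$ norm: a grid with spacing $2\rho$ covers $B_\infty(R)$, and its size stays below the $(3R/\rho)^d$ regime. The sketch below (our illustration; the radii are arbitrary) constructs such a grid and checks both properties:

```python
import itertools
import numpy as np

R, rho, d = 1.0, 0.15, 2
k = int(np.ceil(R / rho))                        # grid points per dimension
axis = (2 * np.arange(k) - (k - 1)) * rho        # symmetric grid, spacing 2*rho
cover = list(itertools.product(axis, repeat=d))
assert len(cover) <= (3 * R / rho) ** d          # size bound of the covering lemma

# Every point of the ball (sampled randomly) lies within rho of some center.
rng = np.random.default_rng(7)
for _ in range(1000):
    v = rng.uniform(-R, R, size=d)
    assert min(np.max(np.abs(v - np.array(c))) for c in cover) <= rho + 1e-12
```

For general norms the same cardinality bound follows from a volume argument, which is why the $d\log(6W/\rho)$ term shows up once the ball radius is $2W$.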

To avoid confusion between the different Rademacher variables, let us use $\sigma'$ and $\sigma$ to denote the Rademacher variables in $\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)$ and $\widehat{\mathfrak{R}}_D(\ell\circ H\Delta H)$, respectively. Then, let us define $v'$

The above, combined with (47), concludes the proof of the bound on $\widehat{\mathfrak{R}}_D(\tilde f\circ H\Delta H)$ in terms of $\widehat{\mathfrak{R}}_D(f\circ H\Delta H)$.

I.2 PROOF OF REGRESSION SETTING

Proof of the regression bound of Theorem 3. First, by the definition of $\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)$ we have that
$$\widehat{\mathfrak{R}}_D(\tilde\ell\circ H\Delta H)=\mathbb{E}_\sigma\left[\sup_{\substack{a,a'\in B_p(A)^2\\ W,W'\in B_p(W)^2}}\frac1n\sum_{i=1}^n\sigma_i\max_{\|\delta\|_\infty\le\epsilon}\big(a^\top g\big(W(x_i+\delta)\big)-a'^\top g\big(W'(x_i+\delta)\big)\big)^2\right].$$

$$\min_{\tilde\ell}\ p^\top\tilde\ell-p'^\top\ell\quad\text{s.t.}\quad\tilde\ell_i=\ell_i,\ \forall i\in[N]\setminus\Lambda,$$
which equals $V^*(p',p,\ell,\Lambda)$. Corollary 3. Continuing with the settings and assumptions in Lemma 5, the following statement holds with probability at least $1-c$:
$$R_{T'}(h_w,y)\le R^{\text{adv-label}}_S(h_w,y)+R^{\text{adv-label}}_S(h_{w_S^*},y)+V^*(p',p,\ell,\Lambda_\epsilon)+\mathrm{disc}^{\mathrm{adv}}_{H\Delta H}(\hat S,\hat T)+R^{\mathrm{adv}}_T(h_{w_T^*},h_{w_S^*})+R^{\mathrm{adv}}_T(h_{w_T^*},y)+\widehat{\mathfrak{R}}_S(\tilde\ell\circ H\Delta H)+\widehat{\mathfrak{R}}_T(\tilde\ell\circ H\Delta H)+3M\log($$

Transferred standard (SA %) and robust (RA %) accuracies tested on different domains from the DIGITS datasets. ∆ indicates the difference between the train and test domain accuracy.

we append $1$ to every vector $a\in A_k^+$ and put $[a^\top,1]^\top$ into $A_{k+1}^+$, and we append $-1$ to every vector $a\in A_k^-$ and put $[a^\top,-1]^\top$ into $A_{k+1}^-$. It can be verified that $|A_{k+1}^+|=|A_{k+1}^-|$
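One way to realize such a split in code (a sketch we add; the recursion in the text is abbreviated, and this variant keys the split on the last coordinate, which satisfies the same properties):

```python
def split(k):
    # A_k^+ collects the vectors of {-1,+1}^k ending in +1, A_k^- those ending in -1.
    if k == 1:
        return [[1]], [[-1]]
    ap, am = split(k - 1)
    prev = ap + am                 # all of {-1,+1}^{k-1}
    return [a + [1] for a in prev], [a + [-1] for a in prev]

ap, am = split(4)
assert len(ap) == len(am) == 8
assert len(set(map(tuple, ap + am))) == 2 ** 4   # disjoint union is the whole cube
assert sorted(map(tuple, am)) == sorted(tuple(-x for x in a) for a in ap)  # A^- = -A^+
```

Negating the last coordinate maps each element of $A^+$ to a distinct element of $A^-$ and vice versa, giving $|A^+|=|A^-|=2^{n-1}$ and $A^-=-A^+$ as required by the symmetry argument.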

Table 2: Convolutional network architecture.


We bound $K$ by the largest $\ell_2$-norm of $\zeta(\tilde v)$ over the covering $C$, that is, $K:=\max_{\tilde v\in C}\|\zeta(\tilde v)\|_2$ (42). Now we examine the upper bound of $K$. Note that $\zeta_i(v)\ge0$; we can then upper bound it by applying the Cauchy–Schwarz inequality and the definition of the operator norm, and then the subadditivity of the maximum. Recalling the case disjunction, we are able to upper bound $K^2$ using (42) and (43); taking the square root of the resulting bound, jointly with the application of Lemma 20, allows us to conclude the bound on (I).

Bounding (II). We now turn to bounding the second term of (40). We then examine the variance of the sum; similarly, we can show that the variance of the remaining terms can be upper bounded the same way. Finally, by applying the matrix Bernstein inequality (Theorem 4) three times, we obtain the stated bound, which concludes the proof.

J PROOF OF LEMMA 5

In this section we present the proof of Lemma 5.

Proof. The proof idea is to show that, by perturbing the standard risk on $T$ within the adversary set, the perturbed risk can approximate the standard risk on $T'$ with some error. First, let us define a perturbed risk for any perturbation. We then recall the definition of the standard risk on $T'$, as well as the definition of the adversarially robust risk over domain $T$ for the labeling function $y(\cdot)$. So, for any $\delta$ such that $\|\delta\|_\infty\le\epsilon$, we get that $R_{T'}(h_w,y)-R^{\mathrm{adv}}_T(h_w,y)$

