FORMALIZING GENERALIZATION AND ROBUSTNESS OF NEURAL NETWORKS TO WEIGHT PERTURBATIONS

Abstract

Studying the sensitivity of weight perturbation in neural networks and its impacts on model performance, including generalization and robustness, is an active research topic due to its implications on a wide range of machine learning tasks such as model compression, generalization gap assessment, and adversarial attacks. In this paper, we provide the first formal analysis for feed-forward neural networks with non-negative monotone activation functions against norm-bounded weight perturbations, in terms of the robustness in pairwise class margin functions and the Rademacher complexity for generalization. We further design a new theory-driven loss function for training generalizable and robust neural networks against weight perturbations. Empirical experiments are conducted to validate our theoretical analysis. Our results offer fundamental insights for characterizing the generalization and robustness of neural networks against weight perturbations.

1. INTRODUCTION

Neural networks are currently the state-of-the-art machine learning models in a variety of tasks, including computer vision, natural language processing, and game-playing, to name a few. In particular, feed-forward neural networks consist of layers of trainable model weights and activation functions, with the premise of learning informative data representations and the complex mapping between data samples and the associated labels. Despite attaining superior performance, the need for studying the sensitivity of neural networks to weight perturbations is intensifying owing to several practical motivations. For instance, in model compression, robustness to weight quantization is crucial for reducing memory storage while retaining model performance (Hubara et al., 2017; Weng et al., 2020). The notion of weight perturbation sensitivity is also used as a metric to evaluate the generalization gap at local minima (Keskar et al., 2017; Neyshabur et al., 2017). In adversarial robustness and security, weight sensitivity can be leveraged as a vulnerability for fault injection that causes erroneous predictions (Liu et al., 2017; Zhao et al., 2019). However, while weight sensitivity plays an important role in many machine learning tasks and problem setups, a theoretical characterization of its impact on the generalization and robustness of neural networks remains elusive. This paper bridges this gap by developing a novel theoretical framework for understanding the generalization gap (through Rademacher complexity) and the robustness (through the classification margin) of neural networks against norm-bounded weight perturbations. Specifically, we consider the multi-class classification problem setup and multi-layer feed-forward neural networks with non-negative monotone activation functions. Our analysis offers fundamental insights into how weight perturbation affects the generalization gap and the pairwise class margin.
To the best of our knowledge, this study is the first work that provides a comprehensive theoretical characterization of the interplay between weight perturbation, robustness in the classification margin, and the generalization gap. Moreover, based on our analysis, we propose a theory-driven loss function for training generalizable and robust neural networks against norm-bounded weight perturbations, and we validate its effectiveness via empirical experiments. We summarize our main contributions as follows.

• We study the robustness (worst-case bound) of the pairwise class margin function against weight perturbations in neural networks, including the analysis of single-layer (Theorem 1), all-layer (Theorem 2), and selected-layer (Theorem 3) weight perturbations.

• We characterize the generalization behavior of the robust surrogate loss for neural networks under weight perturbations (Section 3.4) through Rademacher complexity (Theorem 4).

• We propose a theory-driven loss design for training generalizable and robust neural networks (Section 3.5). The empirical results in Section 4 validate our theoretical analysis and demonstrate the effectiveness of improving generalization and robustness against weight perturbations.

2. RELATED WORKS

In model compression, the robustness to weight quantization is critical for reducing memory size and accesses for low-precision inference and training (Hubara et al., 2017). Weng et al. (2020) showed that incorporating weight perturbation sensitivity into training can better retain model performance (standard accuracy) after quantization. For studying the generalization of neural networks, Keskar et al. (2017) proposed a metric called sharpness (or weight sensitivity), obtained by perturbing the learned model weights around the local minima of the loss landscape, for generalization assessment, while An (1996) introduced weight noise into the training process and concluded that random noise training improves the overall generalization. Neyshabur et al. (2017) made a connection between sharpness and PAC-Bayes theory and found that some combination of sharpness and norms of the model weights may capture the generalization behavior of neural networks. Additionally, Bartlett et al. (2017) found the normalized margin to be a useful measure for quantifying generalization, and constructed a bound that gives a quantitative description of the generalization gap. Moreover, Golowich et al. (2019) incorporated additional assumptions to offer tighter and size-independent bounds for the settings of (Neyshabur et al., 2015) and (Bartlett et al., 2017), respectively. Despite the development of various generalization bounds, empirical observations in (Nagarajan & Kolter, 2019) showed that as the size of the training dataset grows, the generalization bounds proposed in (Neyshabur et al., 2017) and (Bartlett et al., 2017) enlarge and thus become vacuous. A discussion of the relation between (Nagarajan & Kolter, 2019) and our work can be found in Appendix E, where we show that in our studied setting the associated generalization bounds are non-vacuous. On the other hand, Barron & Klusowski (2018) and Theisen et al.
(2019) applied several techniques in tandem with a probabilistic method named path sampling to construct a representative set of given neural networks for approximating and studying the generalization property. Another approach, considered by Petzka et al. (2020), segments a neural network into two functions, a predictor and a feature selector, and combines two measures on these functions (representativeness and feature robustness) to obtain a meaningful generalization bound. However, these works only focused on the generalization behavior at local minima and did not consider generalization and robustness under weight perturbations. Weng et al. (2020) proposed a certification method for weight perturbations that retains consistent model predictions. While the certification bound can be used to train robust models with interval bound propagation (Gowal et al., 2019), it requires an additional optimization subroutine and extra computation cost compared with our approach. Moreover, the convoluted nature of the certification bound complicates the analysis when studying generalization, which is one of our main objectives. In adversarial robustness, fault-injection attacks are known to inject errors into model weights at the inference phase and cause erroneous model predictions (Liu et al., 2017; Zhao et al., 2019); they can be realized at the hardware level by changing or flipping the logic values of the corresponding bits and thus modifying the model parameters saved in memory (Barenghi et al., 2012; Van Der Veen et al., 2016). Zhao et al. (2020) proposed to use the mode connectivity of the model parameters in the loss landscape for mitigating such weight-perturbation-based adversarial attacks.
Although, to the best of our knowledge, a theoretical characterization of generalization and robustness for neural networks against weight perturbations remains elusive, recent works have studied these properties under another scenario: input perturbations. Both empirical and theoretical evidence has been given for the existence of a fundamental trade-off between generalization and robustness against norm-bounded input perturbations (Xu & Mannor, 2012; Su et al., 2018; Zhang et al., 2019; Tsipras et al., 2019). The adversarial training proposed in (Madry et al., 2018) is a popular strategy for training robust models against input perturbations, where a min-max optimization principle is used to minimize the worst-case input perturbations of a data batch during model parameter updates. For adversarial training with input perturbations, Wang et al. (2019) proved its convergence and Yin et al. (2019) derived bounds on its Rademacher complexity for generalization. Different from the case of input perturbation, we note that min-max optimization for neural network training subject to weight perturbation is not straightforward, as the minimization and maximization steps are both taken over the model parameters. In this paper, we disentangle the min-max formulation for weight perturbation by developing bounds for the inner maximization step and providing quantifiable metrics for training generalizable and robust neural networks against weight perturbations.

3. MAIN RESULTS

We provide an overview of the presentation flow for our main results as follows. First, we introduce the mathematical notations and preliminary information in Section 3.1. In Section 3.2, we establish our weight perturbation analysis on a simplified case of single-layer perturbation. We then use the single-layer analysis as a building block and extend the results to the multi-layer perturbation setting in Section 3.3. In Section 3.4, we define the framework of robust training with surrogate loss and study the generalization property using Rademacher complexity. Finally, we propose a theory-driven loss toward training robust and generalizable neural networks in Section 3.5.

3.1. NOTATION AND PRELIMINARIES

Notation We start by introducing the mathematical notation used in this paper. We define the set $[L] := \{1, 2, \dots, L\}$. For any two non-empty sets $A, B$, $\mathcal{F}_{A \to B}$ denotes the set of all functions from $A$ to $B$. We write the indicator function of an event $E$ as $\mathbf{1}(E)$, which is 1 if $E$ holds and 0 otherwise. We use $\mathrm{sgn}(\cdot)$ to denote the element-wise sign function that outputs 1 when the input is non-negative and $-1$ otherwise. Boldface lowercase letters denote vectors (e.g., $\mathbf{x}$), and the $i$-th element is denoted $[\mathbf{x}]_i$. Matrices are written as boldface uppercase letters, say $\mathbf{W}$. Given a matrix $\mathbf{W} \in \mathbb{R}^{k \times d}$, we write its $i$-th row, $j$-th column, and $(i,j)$-th element as $\mathbf{W}_{i,:}$, $\mathbf{W}_{:,j}$, and $\mathbf{W}_{i,j}$, respectively, and its transpose as $\mathbf{W}^T$. The matrix $(p, q)$ norm is defined as $\|\mathbf{W}\|_{p,q} := \|[\|\mathbf{W}_{:,1}\|_p, \|\mathbf{W}_{:,2}\|_p, \dots, \|\mathbf{W}_{:,d}\|_p]\|_q$ for any $p, q \geq 1$. For convenience, we let $\|\mathbf{W}\|_p = \|\mathbf{W}\|_{p,p}$ and write the spectral norm and Frobenius norm as $\|\mathbf{W}\|_\sigma$ and $\|\mathbf{W}\|_F$, respectively. We highlight one matrix norm used frequently in this paper, the matrix $(1, \infty)$ norm, written $\|\mathbf{W}\|_{1,\infty}$ and defined as $\|\mathbf{W}\|_{1,\infty} = \max_j \|\mathbf{W}_{:,j}\|_1$, so that $\|\mathbf{W}^T\|_{1,\infty} = \max_i \|\mathbf{W}_{i,:}\|_1$. We use $\mathbb{B}^\infty_{\mathbf{W}}(\epsilon)$ to denote the element-wise $\ell_\infty$ norm ball around the matrix $\mathbf{W}$ within radius $\epsilon$, i.e., $\mathbb{B}^\infty_{\mathbf{W}}(\epsilon) = \{\hat{\mathbf{W}} \mid |\hat{\mathbf{W}}_{i,j} - \mathbf{W}_{i,j}| \leq \epsilon, \forall i \in [k], j \in [d]\}$.

Preliminaries In order to formally state our theoretical results, we introduce the considered learning problem, the neural network model, and the complexity definition. Let $\mathcal{X}$ and $\mathcal{Y}$ be the feature space and label space, respectively. We assume that all data points are drawn i.i.d. from an unknown distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$. In this paper, we specifically consider the feature space $\mathcal{X}$ as a subset of the $d$-dimensional Euclidean space, i.e., $\mathcal{X} \subseteq \mathbb{R}^d$. We denote by $\mathcal{F} \subseteq \mathcal{F}_{\mathcal{X} \to \mathcal{Y}}$ the hypothesis class used to make predictions.
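To make the matrix $(1,\infty)$ norm concrete, here is a small NumPy sketch (NumPy is our choice of illustration and not part of the paper) computing $\|\mathbf{W}\|_{1,\infty}$ and $\|\mathbf{W}^T\|_{1,\infty}$ for a toy matrix:

```python
import numpy as np

W = np.array([[1.0, -2.0],
              [3.0,  0.5]])

# ||W||_{1,inf}: maximum column-wise l1 norm, max_j ||W_{:,j}||_1
norm_1_inf = np.abs(W).sum(axis=0).max()

# ||W^T||_{1,inf}: maximum row-wise l1 norm, max_i ||W_{i,:}||_1
norm_T_1_inf = np.abs(W).sum(axis=1).max()

print(norm_1_inf)    # columns: |1|+|3| = 4, |-2|+|0.5| = 2.5, so 4.0
print(norm_T_1_inf)  # rows: 1+2 = 3, 3+0.5 = 3.5, so 3.5
```

These two quantities are the norms that govern the error-propagation rates in the bounds of Sections 3.2 and 3.3.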
Furthermore, we consider a loss function $\ell: \mathcal{X} \times \mathcal{Y} \to [0, 1]$ and compose it with the hypothesis class to form the function family $\ell_{\mathcal{F}} := \{(\mathbf{x}, y) \mapsto \ell(f(\mathbf{x}), y) \mid f \in \mathcal{F}\}$. The optimal solution of this learning problem is a function $f^* \in \mathcal{F}$ that minimizes the population risk $R(f) = \mathbb{E}_{(\mathbf{x},y) \sim \mathcal{D}}[\ell(f(\mathbf{x}), y)]$. However, since the underlying data distribution is generally unknown, one typically aims at reducing the empirical risk evaluated on a set of training data $\{(\mathbf{x}_i, y_i)\}_{i=1}^n$, expressed as $\hat{R}_n(f) = \frac{1}{n}\sum_{i=1}^n \ell(f(\mathbf{x}_i), y_i)$. The generalization error is the gap between the population and empirical risks, which serves as an indicator of the model's performance on unseen data from the same distribution $\mathcal{D}$. To study the generalization error, one explores the learning capacity of the hypothesis class. In this paper, we adopt the notion of Rademacher complexity as a measure of learning capacity, which is widely used in the statistical machine learning literature (Mohri et al., 2018). The empirical Rademacher complexity of the function class $\ell_{\mathcal{F}}$ given a set of samples $S = \{(\mathbf{x}_i, y_i)\}_{i=1}^n$ is $\hat{\mathfrak{R}}_S(\ell_{\mathcal{F}}) = \mathbb{E}_\nu[\sup_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^n \nu_i \, \ell(f(\mathbf{x}_i), y_i)]$, where $\{\nu_i\}_{i=1}^n$ is a set of i.i.d. Rademacher random variables with $\mathbb{P}\{\nu_i = -1\} = \mathbb{P}\{\nu_i = +1\} = \frac{1}{2}$. The empirical Rademacher complexity measures, on average, how well the function class $\ell_{\mathcal{F}}$ correlates with random noise on the dataset $S$; thus, a richer or more complex family can better correlate with random noise on average. With Rademacher complexity as a toolkit, one can develop the following relationship between the generalization error and the complexity measure. Specifically, it is shown in (Mohri et al., 2018) that, given a set of training samples $S$ and assuming the range of the loss function $\ell(f(\mathbf{x}), y)$ is $[0, 1]$:
Then for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have for all $f \in \mathcal{F}$
$$R(f) \leq \hat{R}_n(f) + 2\hat{\mathfrak{R}}_S(\ell_{\mathcal{F}}) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$
Note that when the Rademacher complexity is small, it is viable to learn the hypothesis class $\mathcal{F}$ by minimizing the empirical risk, thus effectively reducing the generalization gap. Finally, we define the structure of the neural networks and introduce a few related quantities. The problem studied in this paper is a multi-class classification task with $K$ classes. Given an input vector $\mathbf{x} \in \mathcal{X} \subseteq \mathbb{R}^d$, an $L$-layer neural network is defined as
$$f_{\mathcal{W}}(\mathbf{x}) = \mathbf{W}^L(\cdots \rho(\mathbf{W}^1 \mathbf{x}) \cdots) \in \mathcal{F}_{\mathcal{X} \to \mathbb{R}^K}, \quad (3)$$
where $\mathcal{W} := \{\mathbf{W}^i \mid \forall i \in [L]\}$ is the set containing all weight matrices, and $\rho(\cdot)$ denotes any non-negative monotone activation function applied element-wise. We further assume that $\rho(\cdot)$ is 1-Lipschitz, which includes popular activation functions such as ReLU. Moreover, the $i$-th component of the network output is written as $f^i_{\mathcal{W}}(\mathbf{x}) = [f_{\mathcal{W}}(\mathbf{x})]_i$, and the pairwise margin between the $i$-th and $j$-th classes, denoted $f^{ij}_{\mathcal{W}}(\mathbf{x}) := f^i_{\mathcal{W}}(\mathbf{x}) - f^j_{\mathcal{W}}(\mathbf{x})$, is the difference between the two classes in the network output. Lastly, we use $\mathbf{z}^k$ and $\hat{\mathbf{z}}^k$ to denote the output vector of the $k$-th layer ($k \in [L-1]$) under the natural and weight-perturbed settings, respectively: $\mathbf{z}^k = \rho(\mathbf{W}^k(\cdots \rho(\mathbf{W}^1 \mathbf{x})\cdots))$ and $\hat{\mathbf{z}}^k = \rho(\hat{\mathbf{W}}^k(\cdots \rho(\hat{\mathbf{W}}^1 \mathbf{x})\cdots))$, where $\hat{\mathbf{W}}^i \in \mathbb{B}^\infty_{\mathbf{W}^i}(\epsilon_i)$ denotes the perturbed weight matrix bounded in element-wise $\ell_\infty$ norm with radius $\epsilon_i$ for each $i \in [k]$.
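The network definition in (3) and the pairwise margin can be sketched in NumPy (an illustration, not part of the paper; the layer sizes and seed are arbitrary):

```python
import numpy as np

def relu(v):
    # a non-negative, monotone, 1-Lipschitz activation
    return np.maximum(v, 0.0)

def forward(weights, x):
    """L-layer network f_W(x) = W^L rho(... rho(W^1 x) ...);
    no activation is applied after the last layer."""
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    return weights[-1] @ z

def pairwise_margin(weights, x, i, j):
    # f_W^{ij}(x) = f_W^i(x) - f_W^j(x)
    out = forward(weights, x)
    return out[i] - out[j]

rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)),
           rng.standard_normal((3, 5))]   # a 2-layer net with K = 3 classes
x = rng.standard_normal(4)
m = pairwise_margin(weights, x, 0, 1)
```

By definition, $f^{ij}_{\mathcal{W}}(\mathbf{x}) = -f^{ji}_{\mathcal{W}}(\mathbf{x})$, which the sketch reproduces.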

3.2. BUILDING BLOCK: SINGLE-LAYER WEIGHT PERTURBATION

We study the sensitivity of neural networks to weight perturbations through the pairwise margin $f^{ij}_{\mathcal{W}}(\mathbf{x})$. Specifically, when $i$ and $j$ correspond to the top-1 and second-top class predictions for $\mathbf{x}$, respectively, the margin can be used as an indicator of robust prediction under perturbations to $\mathcal{W}$. For ease of understanding, we first consider a simple example with a three-layer neural network and explain the bound through the error propagation incurred by the weight perturbation. We define the perturbed neural network as $f_{\hat{\mathcal{W}}}(\mathbf{x}) = \mathbf{W}^3 \rho(\mathbf{W}^2 \rho(\hat{\mathbf{W}}^1 \mathbf{x}))$, with $\mathbf{W}^i$ being the weight matrix of the $i$-th layer, and assume that one may only perturb elements of the first weight matrix within an $\ell_\infty$ norm ball of radius $\epsilon$, i.e., $\hat{\mathbf{W}}^1 \in \mathbb{B}^\infty_{\mathbf{W}^1}(\epsilon)$. We also define an error vector $\mathbf{e}^i$, which stands for the entry-wise error after propagating through the $i$-th layer. Since no perturbation occurs prior to the first layer, we directly take the input vector $\mathbf{x}$ and derive an upper bound on the entry-wise error $\mathbf{e}^1$. Since every element of the first weight matrix may change its magnitude by at most $\epsilon$, the maximum error of any entry of the matrix-vector product is
$$[\mathbf{e}^1]_i := |\hat{\mathbf{W}}^1_{i,:}\mathbf{x} - \mathbf{W}^1_{i,:}\mathbf{x}| \leq \sum_j |\hat{\mathbf{W}}^1_{i,j} - \mathbf{W}^1_{i,j}|\,|[\mathbf{x}]_j| \leq \epsilon \sum_j |[\mathbf{x}]_j| = \epsilon \|\mathbf{x}\|_1.$$
Since the following layer's weights are not subject to perturbation, we simply take the magnitude of each element of the subsequent weight matrix to compute the next error vector. In this case, the error after the second layer with $\mathbf{W}^2$ is
$$[\mathbf{e}^2]_i = \sum_j |\mathbf{W}^2_{i,j}|\,[\mathbf{e}^1]_j = \epsilon \|\mathbf{x}\|_1 \sum_j |\mathbf{W}^2_{i,j}|.$$
Eventually, with the error propagating over layers, we arrive at the final layer and can assess the maximum change of any entry of the output. Recalling the pairwise class margin $f^{ij}_{\mathcal{W}}(\mathbf{x})$, we inspect the relative change in error between any two classes. Specifically, we derive an upper bound on the pairwise margin between any two classes $\alpha$ and $\beta$.
In the above example, the difference in entry-wise maximum error can be deduced as follows:
$$[\mathbf{e}^3]_\alpha - [\mathbf{e}^3]_\beta = \sum_k (|\mathbf{W}^3_{\alpha,k}| - |\mathbf{W}^3_{\beta,k}|)[\mathbf{e}^2]_k \overset{(i)}{\leq} \sum_k |\mathbf{W}^3_{\alpha,k} - \mathbf{W}^3_{\beta,k}| \, \epsilon \|\mathbf{x}\|_1 \sum_l |\mathbf{W}^2_{k,l}| \quad (5)$$
$$\overset{(ii)}{\leq} \epsilon \|\mathbf{x}\|_1 \max_k \|\mathbf{W}^2_{k,:}\|_1 \sum_k |\mathbf{W}^3_{\alpha,k} - \mathbf{W}^3_{\beta,k}| = \epsilon \|\mathbf{x}\|_1 \|(\mathbf{W}^2)^T\|_{1,\infty} \|\mathbf{W}^3_{\alpha,:} - \mathbf{W}^3_{\beta,:}\|_1,$$
where inequality (i) comes from the triangle inequality and inequality (ii) results from taking the row of $\mathbf{W}^2$ with the maximum $\ell_1$ norm. It is worth noting that there exist scenarios in which the above inequalities hold with equality, thereby achieving the worst-case error. Specifically, tracing the bound in (i), the first inequality is achieved when the final weight layer has all positive weights and the row associated with label $\alpha$ is greater than that of label $\beta$ in every entry. Furthermore, as long as the second weight matrix $\mathbf{W}^2$ has equal $\ell_1$ norm across all rows, the bound in (ii) is also tight and gives the worst-case error. In any case, the above example shows that the difference in maximum error between output entries propagates at the rate of the weight matrices' $(1, \infty)$ norms. Utilizing this essential concept, we introduce the first theorem of our results, which provides an upper bound on the pairwise margin under single-layer weight perturbation.

Theorem 1 ($N$-th layer weight perturbation, $N \neq L$) Let $f_{\mathcal{W}}(\mathbf{x}) = \mathbf{W}^L(\cdots \rho(\mathbf{W}^1\mathbf{x})\cdots)$ denote an $L$-layer neural network and let $f_{\hat{\mathcal{W}}}(\mathbf{x}) = \mathbf{W}^L(\cdots \hat{\mathbf{W}}^N \cdots \rho(\mathbf{W}^1\mathbf{x})\cdots)$ with $\hat{\mathbf{W}}^N \in \mathbb{B}^\infty_{\mathbf{W}^N}(\epsilon)$, $N \neq L$, denote the corresponding network subject to an $N$-th layer perturbation. For any pair of perturbed and unperturbed pairwise margins $f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x})$ and $f^{ij}_{\mathcal{W}}(\mathbf{x})$, we have
$$f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x}) \leq f^{ij}_{\mathcal{W}}(\mathbf{x}) + \epsilon \,\|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \,\|\mathbf{z}^{N-1}\|_1 \prod_{k=1}^{L-N-1} \|(\mathbf{W}^{L-k})^T\|_{1,\infty},$$
where $\mathbf{z}^k = \rho(\mathbf{W}^k(\cdots \rho(\mathbf{W}^1 \mathbf{x})\cdots))$.
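As a sanity check on Theorem 1, the following NumPy sketch computes the stated bound for a small randomly initialized network (all sizes, seeds, and function names here are illustrative assumptions, not the paper's); random perturbations of layer $N$ drawn from the $\epsilon$ ball should never inflate the pairwise margin beyond it:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def forward(weights, x):
    z = x
    for W in weights[:-1]:
        z = relu(W @ z)
    return weights[-1] @ z

def theorem1_bound(weights, x, i, j, N, eps):
    """Margin-inflation bound of Theorem 1 for an eps-bounded perturbation
    of layer N (N != L): eps * ||W^L_{i,:} - W^L_{j,:}||_1 * ||z^{N-1}||_1
    * prod_{k=1}^{L-N-1} ||(W^{L-k})^T||_{1,inf}."""
    L = len(weights)
    z = x                                  # z^0 = x
    for W in weights[:N - 1]:
        z = relu(W @ z)                    # unperturbed z^{N-1}
    bound = eps * np.abs(weights[-1][i] - weights[-1][j]).sum() * np.abs(z).sum()
    for k in range(1, L - N):              # ||(W^{L-k})^T||_{1,inf}, k = 1..L-N-1
        bound *= np.abs(weights[L - 1 - k]).sum(axis=1).max()
    return bound
```

Sampling perturbed weights $\hat{\mathbf{W}}^N \in \mathbb{B}^\infty_{\mathbf{W}^N}(\epsilon)$ at random and comparing margins gives an empirical (non-exhaustive) check that the bound holds.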
Proof: See Appendix A.1. Since the final layer has no activation function, the bound on the margin difference when perturbing only the final layer can be derived directly, as stated in the following lemma.

Lemma 1 (Final-layer weight perturbation) Consider the setting of Theorem 1 with $N = L$ instead. We have $f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x}) \leq f^{ij}_{\mathcal{W}}(\mathbf{x}) + 2\epsilon \|\mathbf{z}^{L-1}\|_1$, where $\mathbf{z}^{L-1}$ is the output of the $(L-1)$-th layer. Proof: See Appendix A.1.
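Lemma 1 admits an equally short numerical check (again with illustrative random weights of our choosing): for final-layer perturbations, the margin can grow by at most $2\epsilon\|\mathbf{z}^{L-1}\|_1$.

```python
import numpy as np

rng = np.random.default_rng(2)
W1, W2 = rng.standard_normal((5, 4)), rng.standard_normal((3, 5))
x, eps = rng.standard_normal(4), 0.1

z1 = np.maximum(W1 @ x, 0.0)               # z^{L-1}: penultimate-layer output (non-negative)
lemma1_bound = 2 * eps * np.abs(z1).sum()  # 2 * eps * ||z^{L-1}||_1
base_margin = (W2 @ z1)[0] - (W2 @ z1)[1]
```

Random final-layer perturbations within the $\epsilon$ ball should never exceed `lemma1_bound`.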

3.3. GENERAL SETTING: MULTI-LAYER WEIGHT PERTURBATION

With the developed single-layer analysis in Section 3.2 as a building block, we now extend our analysis to the general setting of multi-layer weight perturbation, which is further divided into two cases: (i) the case of perturbing all L layers; and (ii) the case of perturbing I out of L layers.

3.3.1. PERTURBING ALL LAYERS

Once equipped with the concept of error propagation over subsequent layers, we consider the scenario where every layer of the neural network is subject to weight perturbation. We refer to the model under this circumstance as the all-perturbed setting. The following theorem states an upper bound on the pairwise margin between the natural (unperturbed) and all-perturbed settings.

Theorem 2 (all-layer perturbation) Let $f_{\mathcal{W}}(\mathbf{x}) = \mathbf{W}^L(\cdots \rho(\mathbf{W}^1\mathbf{x})\cdots)$ denote an $L$-layer (natural) neural network and let $f_{\hat{\mathcal{W}}}(\mathbf{x}) = \hat{\mathbf{W}}^L(\cdots \hat{\mathbf{W}}^N \cdots \rho(\hat{\mathbf{W}}^1\mathbf{x})\cdots)$ with $\hat{\mathbf{W}}^m \in \mathbb{B}^\infty_{\mathbf{W}^m}(\epsilon_m)$, $\forall m \in [L]$, denote its perturbed version. For any pair of pairwise margins $f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x})$ and $f^{ij}_{\mathcal{W}}(\mathbf{x})$, we have
$$f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x}) \leq f^{ij}_{\mathcal{W}}(\mathbf{x}) + \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \Big[ \underbrace{\epsilon_1 \|\mathbf{x}\|_1 \prod_{l=1}^{L-2}\|(\mathbf{W}^{L-l})^T\|_{1,\infty}}_{\text{input-layer error}} + \underbrace{\sum_{k=1}^{L-3} \prod_{m=k+2}^{L-1}\|(\mathbf{W}^m)^T\|_{1,\infty} \, \epsilon_{k+1} \|\mathbf{z}^k_*\|_1}_{\text{error from layers 2 to } L-2} + \underbrace{\epsilon_{L-1}\|\mathbf{z}^{L-2}_*\|_1}_{\text{error of layer } L-1} \Big] + \underbrace{2\epsilon_L \|\mathbf{z}^{L-1}_*\|_1}_{\text{error of layer } L},$$
where $\mathbf{z}^k_* = \rho(\mathbf{W}^k_* \cdots \rho(\mathbf{W}^1_* \mathbf{x})\cdots)$ with $\mathbf{W}^m_*$ defined as
$$[\mathbf{W}^m_*]_{i,j} = \mathbf{W}^m_{i,j} + \epsilon_m, \;\forall i, j, \;\forall m \in [L] \setminus \{1\}; \qquad [\mathbf{W}^1_*]_{i,j} = \mathbf{W}^1_{i,j} + \mathrm{sgn}([\mathbf{x}]_j)\,\epsilon_1, \;\forall i, j.$$
Proof: See Appendix A.2.2. Here, we provide some intuition on the derivation of the margin upper bound in the all-perturbed setting. The scheme behind this all-perturbed scenario can be viewed as an inductive layer-wise error propagation. Specifically, we choose any perturbed layer as the starting point of the propagation, fix the values of all other weight matrices, and calculate the propagation of error from that layer onward using the concept of Section 3.2. After iterating through all weight matrices subject to perturbation in this manner, one obtains the final change in the output value and therefore the pairwise margin bound. A close inspection of the bound shows that the first term captures the error propagated from the input layer, while the remaining terms capture the errors propagated from the subsequent layers of the network.

3.3.2. PERTURBING MULTIPLE LAYERS

The all-perturbed setting is a special case of perturbing the layers in an index set $I$ with $I = [L]$. We now extend our analysis to the general multi-layer weight perturbation setting with $I \subseteq [L]$, which includes the single-layer setting ($I = \{N\}$) and the all-perturbed setting ($I = [L]$) as special cases.

Theorem 3 (multiple-layer perturbation) Let an $L$-layer neural network be written as $f_{\mathcal{W}}(\mathbf{x}) = \mathbf{W}^L(\cdots \rho(\mathbf{W}^1\mathbf{x})\cdots)$. Given an index set $I \subseteq [L]$, we define the perturbed neural network as $f_{\tilde{\mathcal{W}}}(\mathbf{x}) = \tilde{\mathbf{W}}^L(\cdots \tilde{\mathbf{W}}^N \cdots \rho(\tilde{\mathbf{W}}^1\mathbf{x})\cdots)$ with
$$\tilde{\mathbf{W}}^i = \mathbf{W}^i, \;\forall i \in [L] \setminus I; \qquad \tilde{\mathbf{W}}^i = \hat{\mathbf{W}}^i, \;\hat{\mathbf{W}}^i \in \mathbb{B}^\infty_{\mathbf{W}^i}(\epsilon_i), \;\forall i \in I.$$
For any pairwise margins $f^{ij}_{\tilde{\mathcal{W}}}(\mathbf{x})$ and $f^{ij}_{\mathcal{W}}(\mathbf{x})$, we have
$$f^{ij}_{\tilde{\mathcal{W}}}(\mathbf{x}) \leq f^{ij}_{\mathcal{W}}(\mathbf{x}) + \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \Big[ \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \|\mathbf{z}^{\ell-1}_*\|_1 \prod_{j=\ell+1}^{L-1} \|(\mathbf{W}^j)^T\|_{1,\infty} + \mathbf{1}(L-1 \in I)\,\epsilon_{L-1}\|\mathbf{z}^{L-2}_*\|_1 \Big] + \mathbf{1}(L \in I)\,2\epsilon_L\|\mathbf{z}^{L-1}_*\|_1 := f^{ij}_{\mathcal{W}}(\mathbf{x}) + \eta^{ij}_{\mathcal{W}}(\mathbf{x}|I),$$
where $\mathbf{z}^k_* = \rho(\mathbf{W}^k_* \cdots \rho(\mathbf{W}^1_* \mathbf{x})\cdots)$ and $\mathbf{z}^0_* = \mathbf{x}$, with $\mathbf{W}^m_*$ defined as
$$[\mathbf{W}^m_*]_{i,j} = \begin{cases} \mathbf{W}^m_{i,j} + \epsilon_m, \;\forall i, j, & \forall m \in I \setminus \{1\} \\ \mathbf{W}^m_{i,j}, \;\forall i, j, & \forall m \in [L] \setminus (I \cup \{1\}) \end{cases} \qquad [\mathbf{W}^1_*]_{i,j} = \begin{cases} \mathbf{W}^1_{i,j} + \mathrm{sgn}([\mathbf{x}]_j)\,\epsilon_1, \;\forall i, j, & \text{if } 1 \in I \\ \mathbf{W}^1_{i,j}, \;\forall i, j, & \text{otherwise.} \end{cases}$$
Proof: See Appendix A.2.3.
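A NumPy sketch of the quantity $\eta^{ij}_{\mathcal{W}}(\mathbf{x}|I)$ from Theorem 3, under our reading of the bound's grouping; the helper names `z_star` and `eta` and the per-layer radius dictionary are ours, not the paper's:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def z_star(weights, x, I, eps, k):
    """Layer output z^k_* of Theorem 3: perturbed layers m in I use
    W^m + eps_m (with the sign of x on the first layer), unperturbed
    layers stay as-is; z^0_* = x."""
    z = x
    sign_x = np.where(x >= 0, 1.0, -1.0)   # sgn as defined in Section 3.1
    for m in range(1, k + 1):
        W = weights[m - 1]
        if m in I:
            W = W + (eps[m] * sign_x[None, :] if m == 1 else eps[m])
        z = relu(W @ z)
    return z

def eta(weights, x, i, j, I, eps):
    """Worst-case margin inflation eta^{ij}_W(x|I) from Theorem 3."""
    L = len(weights)
    WL = weights[-1]
    inner = 0.0
    for l in sorted(set(I) - {L, L - 1}):
        prod = 1.0
        for m in range(l + 1, L):          # ||(W^m)^T||_{1,inf}, m = l+1..L-1
            prod *= np.abs(weights[m - 1]).sum(axis=1).max()
        inner += eps[l] * np.abs(z_star(weights, x, I, eps, l - 1)).sum() * prod
    if L - 1 in I:
        inner += eps[L - 1] * np.abs(z_star(weights, x, I, eps, L - 2)).sum()
    out = np.abs(WL[i] - WL[j]).sum() * inner
    if L in I:
        out += 2 * eps[L] * np.abs(z_star(weights, x, I, eps, L - 1)).sum()
    return out
```

With $I = [L]$ this reduces to the Theorem 2 bound, and with $I = \{N\}$, $N \neq L$, to the Theorem 1 bound.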

3.4.1. CONSTRUCTION OF ROBUST SURROGATE LOSS ON PAIRWISE MARGIN

We aim to construct a surrogate loss function based on a standard loss function and study its behavior against weight perturbations. Specifically, given a perturbation radius $\epsilon$ and the original loss function $\ell(f_{\mathcal{W}}(\mathbf{x}), y)$, robust training aims to minimize the objective
$$\tilde{\ell}(f_{\mathcal{W}}(\mathbf{x}), y) = \max_{\forall m \in [L],\; \hat{\mathbf{W}}^m \in \mathbb{B}^\infty_{\mathbf{W}^m}(\epsilon)} \ell(f_{\hat{\mathcal{W}}}(\mathbf{x}), y),$$
which we call the robustness (worst-case) loss. Even for a single data point $(\mathbf{x}, y)$, it is hard to evaluate the exact robustness loss, since it requires maximizing a non-concave function over a norm ball. To make robust training against weight perturbations more computationally tractable, we design a surrogate loss that upper-bounds the worst-case loss, constructed by means of the pairwise margin bounds in Section 3.3. We first define two popular loss functions for classification, the ramp loss and the cross entropy, and derive their surrogate versions. Define the margin function $M(f_{\mathcal{W}}(\mathbf{x}), y)$ as
$$M(f_{\mathcal{W}}(\mathbf{x}), y) = \min_{y' \neq y} \big([f_{\mathcal{W}}(\mathbf{x})]_y - [f_{\mathcal{W}}(\mathbf{x})]_{y'}\big) = [f_{\mathcal{W}}(\mathbf{x})]_y - \max_{y' \neq y} [f_{\mathcal{W}}(\mathbf{x})]_{y'}.$$
The ramp loss for a given data point $(\mathbf{x}, y)$ and neural network $f_{\mathcal{W}}(\cdot)$ is written as $\ell_{\mathrm{ramp}}(f_{\mathcal{W}}(\mathbf{x}), y) = \phi_\gamma(M(f_{\mathcal{W}}(\mathbf{x}), y))$, where the function $\phi_\gamma: \mathbb{R} \to [0, 1]$ is defined as $\phi_\gamma(t) = 1$ if $t \leq 0$, $\phi_\gamma(t) = 0$ if $t \geq \gamma$, and $\phi_\gamma(t) = 1 - \frac{t}{\gamma}$ if $t \in [0, \gamma]$. Since the ramp loss is a piece-wise linear function, its surrogate loss can be obtained directly from the pairwise margin bounds in Section 3.3. The cross entropy is written as $\ell_{\mathrm{CE}}(\tilde{f}_{\mathcal{W}}(\mathbf{x}), y) = -\ln([\tilde{f}_{\mathcal{W}}(\mathbf{x})]_y)$, where $\tilde{f}_{\mathcal{W}}(\mathbf{x})$ represents the neural network with its output passed through a softmax layer, i.e., $[\tilde{f}_{\mathcal{W}}(\mathbf{x})]_i = \frac{\exp([f_{\mathcal{W}}(\mathbf{x})]_i)}{\sum_{k \in [K]} \exp([f_{\mathcal{W}}(\mathbf{x})]_k)}$. For ease of demonstration, we use the ramp loss and its pairwise margin bound under single-layer perturbation in the following lemma.
The surrogate loss analysis for the cross entropy and the robust surrogate loss for multiple-layer perturbation are given in Appendix B.

Lemma 2 (robust surrogate ramp loss) Let $N \in [L]$ denote the perturbed layer index and let
$$\hat{\ell}(f_{\mathcal{W}}(\mathbf{x}), y) := \phi_\gamma\Big(\underbrace{M(f_{\mathcal{W}}(\mathbf{x}), y)}_{\text{margin}} - \underbrace{2\epsilon \max_{k \in [K]} \|\mathbf{W}^L_{k,:}\|_1 \prod_{m=1}^{N-1}\|\mathbf{W}^m\|_{1,\infty} \prod_{k=1}^{L-N-1}\|(\mathbf{W}^{L-k})^T\|_{1,\infty} \|\mathbf{x}\|_1}_{\text{worst-case error}}\Big).$$
Then we have upper and lower bounds on $\hat{\ell}$ in terms of 0-1 losses:
$$\max_{\hat{\mathbf{W}}^N \in \mathbb{B}^\infty_{\mathbf{W}^N}(\epsilon)} \mathbf{1}\{y \neq \arg\max_{y' \in [K]} [f_{\hat{\mathcal{W}}}(\mathbf{x})]_{y'}\} \leq \hat{\ell}(f_{\mathcal{W}}(\mathbf{x}), y) \leq \mathbf{1}\Big\{M(f_{\mathcal{W}}(\mathbf{x}), y) - 2\epsilon \max_{k \in [K]} \|\mathbf{W}^L_{k,:}\|_1 \prod_{m=1}^{N-1}\|\mathbf{W}^m\|_{1,\infty} \prod_{k=1}^{L-N-1}\|(\mathbf{W}^{L-k})^T\|_{1,\infty} \|\mathbf{x}\|_1 \leq \gamma\Big\}.$$
Proof: Please see Appendix B.1. One can observe from the formula that the margin function $M(f_{\mathcal{W}}(\mathbf{x}), y)$ serves as an accuracy objective, as in the standard training process, while the latter term can be interpreted as the worst-case error caused by the weight perturbation, which should be suppressed. Training under such an objective therefore simulates robust training. Another implication of the surrogate loss is the difficulty of training robust and generalizable models against large weight perturbations: since the error caused by perturbations grows rapidly through the layers, only small perturbations can be applied in training and practice, so that the worst-case error term remains smaller than the margin term. A natural follow-up question is whether the generalization gap widens when training with the robust surrogate loss. The following section investigates the generalization property of robust training and provides theoretical insights toward training a generalizable and robust model under weight perturbation.
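A minimal NumPy sketch of the surrogate ramp loss of Lemma 2 (the function names are ours, and the network sizes in the usage example are illustrative):

```python
import numpy as np

def phi_gamma(t, gamma):
    # ramp function: 1 for t <= 0, 0 for t >= gamma, linear in between
    return float(np.clip(1.0 - t / gamma, 0.0, 1.0))

def surrogate_ramp_loss(weights, x, y, N, eps, gamma):
    """Robust surrogate ramp loss of Lemma 2: the margin minus the
    worst-case error term for an eps-bounded perturbation of layer N."""
    # forward pass for the margin M(f_W(x), y)
    z = x
    for W in weights[:-1]:
        z = np.maximum(W @ z, 0.0)
    out = weights[-1] @ z
    margin = out[y] - np.max(np.delete(out, y))
    # worst-case error: 2*eps * max_k ||W^L_{k,:}||_1 * prod_{m=1}^{N-1} ||W^m||_{1,inf}
    #                   * prod_{k=1}^{L-N-1} ||(W^{L-k})^T||_{1,inf} * ||x||_1
    L = len(weights)
    err = 2 * eps * np.abs(weights[-1]).sum(axis=1).max() * np.abs(x).sum()
    for m in range(1, N):
        err *= np.abs(weights[m - 1]).sum(axis=0).max()      # ||W^m||_{1,inf}
    for k in range(1, L - N):
        err *= np.abs(weights[L - 1 - k]).sum(axis=1).max()  # ||(W^{L-k})^T||_{1,inf}
    return phi_gamma(margin - err, gamma)
```

With `eps = 0` this recovers the standard ramp loss; a positive radius can only increase the loss, reflecting the accuracy-robustness trade-off discussed above.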

3.4.2. GENERALIZATION GAP

Theorem 4 Consider the hypothesis class
$$\mathcal{F} = \{f_{\mathcal{W}}(\mathbf{x}) \mid \mathcal{W} = (\mathbf{W}^1, \mathbf{W}^2, \dots, \mathbf{W}^L), \|\mathbf{W}^h\|_\sigma \leq s_h, \|(\mathbf{W}^h)^T\|_{2,1} \leq b_h, h \in [L]\}.$$
For any $\gamma > 0$, with probability at least $1 - \delta$, we have for all $f_{\mathcal{W}}(\cdot) \in \mathcal{F}$:
$$\mathbb{P}_{(\mathbf{x},y)\sim\mathcal{D}}\Big(\exists \hat{\mathbf{W}}^N \in \mathbb{B}^\infty_{\mathbf{W}^N}(\epsilon) \text{ s.t. } y \neq \arg\max_{y' \in [K]} [f_{\hat{\mathcal{W}}}(\mathbf{x})]_{y'}\Big) \leq \frac{1}{n}\sum_{i=1}^n \mathbf{1}\Big\{[f_{\mathcal{W}}(\mathbf{x}_i)]_{y_i} \leq \gamma + \max_{y' \neq y_i}[f_{\mathcal{W}}(\mathbf{x}_i)]_{y'} + 2\epsilon \max_{k \in [K]}\|\mathbf{W}^L_{k,:}\|_1 \prod_{m=1}^{N-1}\|\mathbf{W}^m\|_{1,\infty} \prod_{k=1}^{L-N-1}\|(\mathbf{W}^{L-k})^T\|_{1,\infty}\|\mathbf{x}_i\|_1\Big\} + \frac{1}{\gamma}\Bigg(\frac{4}{n^{3/2}} + \frac{60\log(n)\log(2d_{\max})}{n}\|\mathbf{X}\|_F \prod_{h=1}^L s_h \Big(\sum_{j=1}^L \big(\tfrac{b_j}{s_j}\big)^{2/3}\Big)^{3/2} + \frac{2\epsilon \sup_{f \in \mathcal{F}} \prod_{m=1}^{N-1}\|\mathbf{W}^m\|_{1,\infty} \prod_{k=0}^{L-N-1}\|(\mathbf{W}^{L-k})^T\|_{1,\infty}}{n}\|\mathbf{X}\|_1\Bigg) + 3\sqrt{\frac{\log(2/\delta)}{2n}}.$$
As highlighted in the bracketed term of Theorem 4, if the product of the multiple weight norm bounds is not well confined, the model will suffer from a notable generalization gap. Consequently, our analysis suggests a solution for reducing the generalization gap: imposing norm penalty functions on all weight matrices for training generalizable neural networks subject to weight perturbations.

3.5. THEORY-DRIVEN LOSS TOWARD ROBUSTNESS AND GENERALIZATION

With our theoretical insights, we now propose a robust and generalizable loss function. Standard neural network classifier training uses a classification loss $\ell_{\mathrm{cls}}(f_{\mathcal{W}}(\mathbf{x}), y)$ that widens the pairwise margin so as to raise accuracy, but does not necessarily curb the error in the output once a weight perturbation is imposed. To address this issue, we propose to train under a mixed and regularized objective for a given data sample $(\mathbf{x}, y)$, which in turn balances the trade-off between standard accuracy, robustness, and generalization. The designed loss takes the form
$$\ell^*(f_{\mathcal{W}}(\mathbf{x}), y) = \underbrace{\ell_{\mathrm{cls}}(f_{\mathcal{W}}(\mathbf{x}), y)}_{\text{standard loss}} + \lambda \cdot \underbrace{\max_{y' \neq y}\{\eta^{y y'}_{\mathcal{W}}(\mathbf{x}|I)\}}_{\text{robustness loss from Thm. 3}} + \mu \cdot \underbrace{\sum_{m=1}^L \|(\mathbf{W}^m)^T\|_{1,\infty} + \|\mathbf{W}^m\|_{1,\infty}}_{\text{generalization gap regularization from Thm. 4}}. \quad (9)$$
The first term in the proposed loss function originates from the standard classification problem for task-specific accuracy, while the second term is the maximum error on the pairwise margin (the term $\eta^{y y'}_{\mathcal{W}}(\mathbf{x}|I)$ defined in Theorem 3) induced by the weight perturbation. Finally, we attribute the last term to the theoretical findings in Theorem 4, where imposing norm constraints on the weight matrices benefits generalization and prevents the generalization gap from widening.
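A NumPy sketch of the loss in (9), restricted for brevity to the special case $I = \{L\}$, where the robustness term reduces to $2\epsilon_L\|\mathbf{z}^{L-1}\|_1$ for every label pair and no inner maximization over $y'$ is needed (the function name and single-sample form are our simplifications, not the paper's training code):

```python
import numpy as np

def theory_loss(weights, x, y, eps_L, lam, mu):
    """Sketch of the loss in (9) with I = {L}: cross entropy + lambda * eta
    + mu * sum_m (||(W^m)^T||_{1,inf} + ||W^m||_{1,inf})."""
    z = x
    for W in weights[:-1]:
        z = np.maximum(W @ z, 0.0)        # z^{L-1}
    logits = weights[-1] @ z
    # standard loss: cross entropy on the softmax output (numerically stabilized)
    shifted = logits - logits.max()
    cls = -shifted[y] + np.log(np.exp(shifted).sum())
    # robustness term: for I = {L}, eta^{y y'}_W(x|I) = 2 * eps_L * ||z^{L-1}||_1
    # for every pair (y, y'), so the max over y' is this same quantity
    robust = 2 * eps_L * np.abs(z).sum()
    # generalization regularizer: sum_m ||(W^m)^T||_{1,inf} + ||W^m||_{1,inf}
    reg = sum(np.abs(W).sum(axis=1).max() + np.abs(W).sum(axis=0).max()
              for W in weights)
    return cls + lam * robust + mu * reg
```

Setting `lam = mu = 0` recovers plain cross-entropy training; both added terms are non-negative, so they only ever trade accuracy for robustness and a smaller norm product.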

4. NUMERICAL VALIDATION

We validate our theoretical results and the designed loss function in (9) through two sets of experiments: the empirical generalization gap with matrix norm regularization, and the robust accuracy against adversarial weight perturbations. Experiment setup We used the MNIST dataset, which comprises gray-scale images of hand-written digits in ten categories. We trained neural network models as in (3) with four dense layers (128-64-32-10 neurons) and the ReLU activation function, without bias terms. We used the loss function $\ell^*$ in (9) with the all-layer perturbation bound (i.e., $I = [L]$), an identical weight perturbation radius $\epsilon$ (or $\epsilon_{\mathrm{train}}$) for every layer, the cross entropy as the standard classification loss $\ell_{\mathrm{cls}}$, and a batch size of 32 with 20 epochs. Stochastic gradient descent with momentum was used for training, with the learning rate set to 0.01. For the generalization experiment, we follow the same setting as in (Yin et al., 2019), which uses 1000 data samples to train the neural network for 100 epochs. All experiments were conducted on an Intel Xeon E5-2620v4 CPU with 125 GB RAM and an NVIDIA TITAN Xp GPU with 12 GB RAM. For reproducibility, our code is given in the supplementary material.

Robustness against adversarial weight perturbation

To evaluate the robustness against weight perturbations, we modified the projected gradient descent (PGD) attack originally designed for input perturbations (Madry et al., 2018), which we call the weight PGD attack. Starting from a trained neural network weight $\mathcal{W}$, the perturbed weight is crafted by iterative gradient ascent using the signed gradient of the standard loss, $\mathrm{sgn}(\nabla_{\mathcal{W}} \ell_{\mathrm{cls}}(f_{\mathcal{W}}(\mathbf{x}), y))$, followed by element-wise clipping centered at $\mathcal{W}$. The attack iteration with step size $\alpha$ is expressed as
$$\mathcal{W}^{(0)} = \mathcal{W}, \qquad \mathcal{W}^{(t+1)} = \mathrm{Clip}_{\mathcal{W},\epsilon}\big(\mathcal{W}^{(t)} + \alpha \, \mathrm{sgn}(\nabla_{\mathcal{W}} \ell_{\mathrm{cls}}(f_{\mathcal{W}^{(t)}}(\mathbf{x}), y))\big).$$
We trained neural networks with different combinations of the coefficients $\lambda$ and $\mu$ in (9) using $\epsilon_{\mathrm{train}} = 0.01$.
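A simplified sketch of the weight PGD iteration above; unlike the experiments, which perturb all layers using automatic differentiation, this illustration perturbs only the final layer so the cross-entropy gradient has a closed form (all names and sizes are our assumptions):

```python
import numpy as np

def weight_pgd(weights, x, y, eps, alpha, steps):
    """Weight PGD attack sketch on the final layer: signed-gradient ascent
    on the cross entropy w.r.t. W^L, clipped to the eps infinity-ball
    around the original weights after every step."""
    z = x
    for W in weights[:-1]:
        z = np.maximum(W @ z, 0.0)        # z^{L-1}, fixed during the attack
    W_orig = weights[-1]
    W_adv = W_orig.copy()
    for _ in range(steps):
        logits = W_adv @ z
        p = np.exp(logits - logits.max())
        p /= p.sum()                      # softmax probabilities
        p[y] -= 1.0                       # d CE / d logits = p - onehot(y)
        grad = np.outer(p, z)             # d CE / d W^L
        W_adv = W_adv + alpha * np.sign(grad)
        W_adv = np.clip(W_adv, W_orig - eps, W_orig + eps)  # element-wise clip
    return W_adv
```

The clipping step guarantees the crafted weights remain inside $\mathbb{B}^\infty_{\mathbf{W}^L}(\epsilon)$ throughout the attack.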

5. CONCLUSION

In this paper, we developed a formal analysis of the robustness of the pairwise class margin for neural networks against weight perturbations. We also characterized the generalization gap through Rademacher complexity. A theory-driven loss function was proposed, and the empirical results showed significantly improved performance in generalization and robustness. Our analysis offers theoretical insights and training loss design principles for studying the generalization and robustness of neural networks subject to weight perturbations.

A MARGIN BOUND

A.1 SINGLE-LAYER BOUND

We shall first prove the case $N \neq L$ and follow similar reasoning to prove the case $N = L$. Consider the difference between the pairwise margins $f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x}) - f^{ij}_{\mathcal{W}}(\mathbf{x})$. We have
$$f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x}) - f^{ij}_{\mathcal{W}}(\mathbf{x}) = (\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:})(\hat{\mathbf{z}}^{L-1} - \mathbf{z}^{L-1}) \quad (11)$$
$$\overset{(a)}{\leq} \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \|\rho(\mathbf{W}^{L-1}\hat{\mathbf{z}}^{L-2}) - \rho(\mathbf{W}^{L-1}\mathbf{z}^{L-2})\|_\infty$$
$$\overset{(b)}{\leq} \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \|\mathbf{W}^{L-1}(\hat{\mathbf{z}}^{L-2} - \mathbf{z}^{L-2})\|_\infty$$
$$\overset{(c)}{\leq} \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \|(\mathbf{W}^{L-1})^T\|_{1,\infty} \|\hat{\mathbf{z}}^{L-2} - \mathbf{z}^{L-2}\|_\infty$$
$$\overset{(d)}{\leq} \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \|(\mathbf{W}^{L-1})^T\|_{1,\infty} \cdots \|(\mathbf{W}^{N+1})^T\|_{1,\infty} \|(\hat{\mathbf{W}}^N - \mathbf{W}^N)\mathbf{z}^{N-1}\|_\infty$$
$$\overset{(e)}{\leq} \epsilon \|\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:}\|_1 \|\mathbf{z}^{N-1}\|_1 \prod_{k=1}^{L-N-1}\|(\mathbf{W}^{L-k})^T\|_{1,\infty},$$
where inequality (a) results from applying Hölder's inequality, and inequality (b) comes from the contractive property (1-Lipschitzness) of the activation function $\rho(\cdot)$. Inequalities (c) and (d) come from the triangle inequality applied element-wise to the vector $\mathbf{W}^{L-1}(\hat{\mathbf{z}}^{L-2} - \mathbf{z}^{L-2})$ combined with induction, while inequality (e) simply follows from the fact that every element of the matrix $\hat{\mathbf{W}}^N - \mathbf{W}^N$ has magnitude at most $\epsilon$. With similar reasoning, we prove the case $N = L$ as follows:
$$f^{ij}_{\hat{\mathcal{W}}}(\mathbf{x}) - f^{ij}_{\mathcal{W}}(\mathbf{x}) = (\hat{\mathbf{W}}^L_{i,:} - \hat{\mathbf{W}}^L_{j,:})\mathbf{z}^{L-1} - (\mathbf{W}^L_{i,:} - \mathbf{W}^L_{j,:})\mathbf{z}^{L-1} \quad (17)$$
$$\overset{(i)}{\leq} 2\epsilon \, \mathbf{1}^T \mathbf{z}^{L-1} = 2\epsilon \|\mathbf{z}^{L-1}\|_1, \quad (18)$$
where inequality (i) comes from the problem definition (perturbation within the element-wise $\ell_\infty$ norm ball); since the activation function $\rho(\cdot)$ is non-negative, the inner product with the all-ones vector equals the $\ell_1$ norm.

A.2 MULTI-LAYER SCENARIO

A.2.1 KEY PREREQUISITE

Before going through the proof of the all-perturbed bound, we introduce a maximization problem and derive its solution. Recall the output vector under the weight perturbation setting defined in Section 3.1; we now maximize its $\ell_1$ norm over a given perturbed matrix set $\mathcal{W}$. Using the notation of Section 3.1, let $\hat{z}^k$ denote the output vector under the perturbation setting and write the vector achieving the maximum $\ell_1$ norm as $z^k_*$. We obtain the following solution:

$$
\|z^k_*\|_1 = \max_{\hat{W}} \|\hat{z}^k\|_1, \qquad z^k_* = \rho(W^k_* \cdots \rho(W^1_* x)),
$$
$$
[W^m_*]_{i,j} = W^m_{i,j} + \epsilon_m \ \ \forall i,j,\ \forall m \in \{2, \ldots, L\}, \qquad [W^1_*]_{i,j} = W^1_{i,j} + \mathrm{sgn}([x]_j)\,\epsilon_1 \ \ \forall i,j.
$$

The reasoning uses the non-negativity of the activation function: for each element of a matrix, the perturbation maximizing the $\ell_1$ norm of the matrix-vector product moves in the direction of the sign of the corresponding element of the incoming vector. Since the activation function is applied after the first layer, all subsequent inputs are non-negative, which yields the solution above.
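A layer-wise numerical check of this construction (a sketch under the stated non-negativity assumption, with hypothetical variable names): for a non-negative incoming vector, adding $+\epsilon$ to every entry dominates any element-wise $\epsilon$-perturbation, while for the raw input the sign-matched perturbation dominates.

```python
import numpy as np

rng = np.random.default_rng(1)
relu = lambda t: np.maximum(t, 0.0)
eps = 0.1

W = rng.normal(size=(5, 4))
x = rng.normal(size=4)          # raw input: entries may have mixed signs
v = relu(rng.normal(size=4))    # a hidden activation: non-negative

# Constructed maximizers (Appendix A.2.1): +eps everywhere once the incoming
# vector is non-negative, and eps * sgn(x_j) column-wise for the first layer.
W_star_hidden = W + eps
W_star_first = W + eps * np.sign(x)[None, :]

for _ in range(500):
    D = rng.uniform(-eps, eps, size=W.shape)   # any element-wise eps-perturbation
    assert np.sum(relu((W + D) @ v)) <= np.sum(relu(W_star_hidden @ v)) + 1e-10
    assert np.sum(relu((W + D) @ x)) <= np.sum(relu(W_star_first @ x)) + 1e-10
```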

A.2.2 ALL-PERTURBED BOUND

In the following proof of Theorem 2, we apply similar steps to Appendix A.1 and consider the difference between pairwise margins under the natural and weight perturbation settings. Recall from Theorem 2 that $f_W(x) = W^L(\cdots W^N \cdots \rho(W^1 x) \cdots)$ and $f_{\hat{W}}(x) = \hat{W}^L(\cdots \hat{W}^N \cdots \rho(\hat{W}^1 x) \cdots)$. Thus for any pair of pairwise margins $f^{ij}_{\hat{W}}(x)$ and $f^{ij}_{W}(x)$, we have

$$
\begin{aligned}
f^{ij}_{\hat{W}}(x) - f^{ij}_{W}(x) &= (\hat{W}^{L}_{i,:} - \hat{W}^{L}_{j,:})\hat{z}^{L-1} - (W^{L}_{i,:} - W^{L}_{j,:})z^{L-1} \qquad (20)\\
&\overset{(a)}{\leq} \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1}\, \|\rho(\hat{W}^{L-1}\hat{z}^{L-2}) - \rho(W^{L-1}z^{L-2})\|_{\infty} + 2\epsilon_L\, \mathbf{1}^{T}\hat{z}^{L-1} \qquad (21)\\
&\overset{(b)}{\leq} \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \big( \|W^{L-1}(\hat{z}^{L-2} - z^{L-2})\|_{\infty} + \|(\hat{W}^{L-1} - W^{L-1})\hat{z}^{L-2}\|_{\infty} \big) + 2\epsilon_L \|\hat{z}^{L-1}\|_{1} \qquad (22)\\
&\overset{(c)}{\leq} \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \big( \|(W^{L-1})^{T}\|_{1,\infty}\, \|\rho(\hat{W}^{L-2}\hat{z}^{L-3}) - \rho(W^{L-2}z^{L-3})\|_{\infty} + \epsilon_{L-1}\|\hat{z}^{L-2}\|_{1} \big) + 2\epsilon_L \|\hat{z}^{L-1}\|_{1} \\
&\overset{(d)}{\leq} \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \Big( \epsilon_1 \|x\|_{1} \prod_{l=1}^{L-2} \|(W^{L-l})^{T}\|_{1,\infty} + \sum_{j=1}^{L-3} \prod_{k=j+2}^{L-1} \|(W^{k})^{T}\|_{1,\infty}\, \epsilon_{j+1} \|\hat{z}^{j}\|_{1} + \epsilon_{L-1}\|\hat{z}^{L-2}\|_{1} \Big) + 2\epsilon_L \|\hat{z}^{L-1}\|_{1} \qquad (24)\\
&\overset{(e)}{\leq} \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \Big( \epsilon_1 \|x\|_{1} \prod_{l=1}^{L-2} \|(W^{L-l})^{T}\|_{1,\infty} + \sum_{j=1}^{L-3} \prod_{k=j+2}^{L-1} \|(W^{k})^{T}\|_{1,\infty}\, \epsilon_{j+1} \|z^{j}_{*}\|_{1} + \epsilon_{L-1}\|z^{L-2}_{*}\|_{1} \Big) + 2\epsilon_L \|z^{L-1}_{*}\|_{1}.
\end{aligned}
$$

In the above proof, inequality (a) comes from the problem definition (the perturbation of the final layer lies within $\epsilon_L$), and inequality (b) results from the contractive property of $\rho(\cdot)$ (1-Lipschitzness) combined with the triangle inequality. Inequality (c) follows from the triangle inequality applied to the elements of $W^{L-1}(\hat{z}^{L-2} - z^{L-2})$ and the fact that every element of $\hat{W}^{L-1} - W^{L-1}$ has magnitude at most $\epsilon_{L-1}$. By induction and by maximizing the $\ell_1$ norm of the perturbed outputs $\hat{z}^k$, we arrive at inequalities (d) and (e).

A.2.3 MULTI-LAYER BOUND

We now utilize similar reasoning to establish the multi-layer bound when weight perturbation is imposed according to an index set $I$:

$$
\begin{aligned}
f^{ij}_{\tilde{W}}(x) - f^{ij}_{W}(x) &= (\tilde{W}^{L}_{i,:} - \tilde{W}^{L}_{j,:})\hat{z}^{L-1} - (W^{L}_{i,:} - W^{L}_{j,:})z^{L-1} \qquad (26)\\
&\leq \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1}\, \|\rho(\tilde{W}^{L-1}\hat{z}^{L-2}) - \rho(W^{L-1}z^{L-2})\|_{\infty} + \mathbb{1}(L \in I)\, 2\epsilon_L\, \mathbf{1}^{T}\hat{z}^{L-1} \qquad (27)\\
&\leq \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \big( \|W^{L-1}(\hat{z}^{L-2} - z^{L-2})\|_{\infty} + \|(\tilde{W}^{L-1} - W^{L-1})\hat{z}^{L-2}\|_{\infty} \big) + \mathbb{1}(L \in I)\, 2\epsilon_L \|\hat{z}^{L-1}\|_{1} \qquad (28)\\
&\leq \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \big( \|(W^{L-1})^{T}\|_{1,\infty}\, \|\rho(\tilde{W}^{L-2}\hat{z}^{L-3}) - \rho(W^{L-2}z^{L-3})\|_{\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1}\|\hat{z}^{L-2}\|_{1} \big) + \mathbb{1}(L \in I)\, 2\epsilon_L \|\hat{z}^{L-1}\|_{1} \qquad (29)\\
&\leq \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \Big( \mathbb{1}(1 \in I)\, \epsilon_1 \|x\|_{1} \prod_{l=1}^{L-2} \|(W^{L-l})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1}\|\hat{z}^{L-2}\|_{1} + \sum_{j=1}^{L-3} \mathbb{1}(j+1 \in I) \prod_{k=j+2}^{L-1} \|(W^{k})^{T}\|_{1,\infty}\, \epsilon_{j+1} \|\hat{z}^{j}\|_{1} \Big) + \mathbb{1}(L \in I)\, 2\epsilon_L \|\hat{z}^{L-1}\|_{1} \qquad (30)\\
&\leq \|W^{L}_{i,:} - W^{L}_{j,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{k=\ell+1}^{L-1} \|(W^{k})^{T}\|_{1,\infty}\, \|z^{\ell-1}_{*}\|_{1} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1}\|z^{L-2}_{*}\|_{1} \Big) + \mathbb{1}(L \in I)\, 2\epsilon_L \|z^{L-1}_{*}\|_{1}.
\end{aligned}
$$

The proof of the multi-layer bound follows the same reasoning as the all-perturbed setting, except that an indicator function checks whether each layer $m$ belongs to the index set $I$; in the last step we rewrite the expression as a sum over the members of $I$.

B SURROGATE LOSS

B.1 CASE ON RAMP LOSS

We now provide a proof of Lemma 2. Recall the definition of the ramp function in Section 3.4.1: the ramp loss for a given data point $(x, y)$ and neural network $f_W(\cdot)$ is $\ell_{\mathrm{ramp}}(f_W(x), y) = \phi_\gamma(M(f_W(x), y))$, where the function $\phi_\gamma : \mathbb{R} \to [0, 1]$ is defined as

$$
\phi_\gamma(t) = \begin{cases} 1 & \text{if } t \leq 0, \\ 1 - t/\gamma & \text{if } t \in (0, \gamma), \\ 0 & \text{if } t \geq \gamma. \end{cases}
$$

Then for any $(x, y)$, using ReLU as the activation function, we have

$$
\begin{aligned}
\max_{\hat{W}} \mathbb{1}\big(y \neq \arg\max_{y'} [f_{\hat{W}}(x)]_{y'}\big) &\overset{(a)}{\leq} \phi_\gamma\big(\min_{\hat{W}} M(f_{\hat{W}}(x), y)\big) \\
&\overset{(b)}{\leq} \phi_\gamma\big(\min_{y' \neq y} \min_{\hat{W}}\, [f_{\hat{W}}(x)]_{y} - [f_{\hat{W}}(x)]_{y'}\big) \qquad (34)\\
&\overset{(c)}{\leq} \phi_\gamma\Big(\min_{y' \neq y} \big([f_{W}(x)]_{y} - [f_{W}(x)]_{y'}\big) - \epsilon \max_{y' \neq y} \|W^{L}_{y',:} - W^{L}_{y,:}\|_{1}\, \|z^{N-1}\|_{1} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\Big) \\
&\overset{(d)}{\leq} \phi_\gamma\Big(\min_{y' \neq y} \big([f_{W}(x)]_{y} - [f_{W}(x)]_{y'}\big) - 2\epsilon \max_{k \in [K]} \|W^{L}_{k,:}\|_{1}\, \|z^{N-1}\|_{1} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\Big) \\
&\overset{(e)}{\leq} \phi_\gamma\Big(M(f_{W}(x), y) - 2\epsilon \max_{k \in [K]} \|W^{L}_{k,:}\|_{1}\, \|z^{N-1}\|_{1} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\Big) \\
&\overset{(f)}{\leq} \phi_\gamma\Big(M(f_{W}(x), y) - 2\epsilon \max_{k \in [K]} \|W^{L}_{k,:}\|_{1}\, \|x\|_{1} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\Big) := \hat{\ell}(f_{W}(x), y) \\
&\overset{(g)}{\leq} \mathbb{1}\Big(M(f_{W}(x), y) - 2\epsilon \max_{k \in [K]} \|W^{L}_{k,:}\|_{1}\, \|x\|_{1} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty} \leq \gamma\Big),
\end{aligned}
$$

where inequality (a) is due to the property of the ramp loss, inequality (b) is by the definition of the margin, and inequality (c) comes from applying Theorem 1. Inequality (d) results from the triangle inequality together with taking the maximum, inequality (e) is again by the definition of the margin, and inequality (f) comes from the fact that with ReLU we have $\|\rho(Ax)\|_1 \leq \|Ax\|_1$. Lastly, inequality (g) is a direct consequence of the property of the ramp loss.
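The ramp function and ramp loss above can be written compactly; a minimal sketch (the function names are ours, not the paper's):

```python
import numpy as np

def phi_gamma(t, gamma):
    """Ramp function: 1 for t <= 0, 0 for t >= gamma, linear in between."""
    return float(np.clip(1.0 - t / gamma, 0.0, 1.0))

def ramp_loss(logits, y, gamma):
    """Ramp loss phi_gamma(M(f(x), y)) with margin M = f_y - max_{y' != y} f_{y'}."""
    margin = logits[y] - np.max(np.delete(logits, y))
    return phi_gamma(margin, gamma)
```

For instance, a confidently correct prediction (margin at least $\gamma$) incurs zero loss, a misclassification incurs loss 1, and a correct prediction with margin $\gamma/2$ incurs loss $1/2$.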

B.1.1 RAMP LOSS ON MULTIPLE LAYER BOUND

We now follow a similar course and bound the robust ramp loss using the multi-layer bound in Theorem 3. Considering the robust loss form proposed in Section 3.4.1, we have

$$
\begin{aligned}
\max_{\tilde{W}} \ell_{\mathrm{ramp}}(f_{\tilde{W}}(x), y) &\overset{(a)}{\leq} \phi_\gamma\big(\min_{\tilde{W}} M(f_{\tilde{W}}(x), y)\big) \\
&\overset{(b)}{\leq} \phi_\gamma\Big(\min_{y' \neq y} \big([f_{W}(x)]_{y} - [f_{W}(x)]_{y'}\big) - \max_{y' \neq y} \eta^{y'y}_{W}(x|I)\Big) \qquad (42)\\
&\overset{(c)}{\leq} \phi_\gamma\Big(M(f_{W}(x), y) - \max_{y' \neq y} \eta^{y'y}_{W}(x|I)\Big) := \hat{\ell}(f_{W}(x), y).
\end{aligned}
$$

B.2 CROSS ENTROPY

We further consider the case of cross entropy and prove an upper bound for it. We denote the loss function as $\mathrm{CE}(\cdot)$; during training, hard labels are used. Recalling the definition of $\tilde{f}_W(x)$ in Section 3.4.1, the difference of the loss function between the natural and perturbation settings is

$$
\begin{aligned}
\mathrm{CE}(\tilde{f}_{\hat{W}}(x), y) - \mathrm{CE}(\tilde{f}_{W}(x), y) &= -\ln \frac{[\tilde{f}_{\hat{W}}(x)]_y}{[\tilde{f}_{W}(x)]_y} \qquad (44)\\
&= \ln \frac{[\tilde{f}_{W}(x)]_y}{[\tilde{f}_{\hat{W}}(x)]_y} \qquad (45)\\
&= \ln \Big( e^{[f_{W}(x)]_y - [f_{\hat{W}}(x)]_y} \cdot \frac{\sum_{k \in [K]} e^{[f_{\hat{W}}(x)]_k}}{\sum_{k \in [K]} e^{[f_{W}(x)]_k}} \Big) \qquad (46)\\
&\overset{(a)}{\leq} \ln \Big( e^{[f_{W}(x)]_y - [f_{\hat{W}}(x)]_y} \cdot \max_{y' \neq y} e^{[f_{\hat{W}}(x)]_{y'} - [f_{W}(x)]_{y'}} \Big) \qquad (47)\\
&\overset{(b)}{\leq} \ln e^{\max_{y' \neq y} f^{y'y}_{\hat{W}}(x) - f^{y'y}_{W}(x)} \\
&\overset{(c)}{=} \max_{y' \neq y} \eta^{y'y}_{W}(x|I),
\end{aligned}
$$

where inequality (a) comes from taking the maximum over the set of all ratios $e^{[f_{\hat{W}}(x)]_k} / e^{[f_{W}(x)]_k}$ and inequality (b) comes from the monotonicity of the exponential function. The last step (c) follows from Theorem 3. With the above proof, we can establish the following robust surrogate loss for cross entropy, denoted $\hat{\ell}_{\mathrm{CE}}(f_W(x), y)$:

$$
\hat{\ell}_{\mathrm{CE}}(f_W(x), y) = \mathrm{CE}(f_W(x), y) + \max_{y' \neq y} \eta^{y'y}_{W}(x|I).
$$

Lemma 3 (Mohri et al. (2018)). Assume that the range of the loss function $\ell(\cdot)$ is $[0, 1]$. Then, for any $\delta \in (0, 1)$, with probability at least $1 - \delta$, we have for all $f \in \mathcal{F}$:
$$
R(f) \leq R_n(f) + 2\Re(\hat{\mathcal{F}}) + 3\sqrt{\frac{\log(2/\delta)}{2n}},
$$
where $R(f)$ and $R_n(f)$ stand for the population risk and the empirical risk, respectively, and $\hat{\mathcal{F}}$ is the loss-composed function class.

Lemma 4 (Bartlett et al. (2017)). Consider the neural network hypothesis class
$$
\mathcal{F} = \{f_W(x) \mid W = (W^1, W^2, \ldots, W^L),\ \|W^h\|_\sigma \leq s_h,\ \|(W^h)^T\|_{2,1} \leq b_h,\ h \in [L]\}.
$$
We have the following upper bound on the Rademacher complexity:
$$
\Re(\mathcal{F}) \leq \frac{4}{n^{3/2}} + \frac{26 \log(n) \log(2 d_{\max})}{n}\, \|X\|_F \Big( \prod_{h=1}^{L} s_h \Big) \Big( \sum_{j=1}^{L} \big( \tfrac{b_j}{s_j} \big)^{2/3} \Big)^{3/2}.
$$

We now study the Rademacher complexity of the function class $\hat{\mathcal{F}} = \{(x, y) \mapsto \hat{\ell}(f_W(x), y) \mid f \in \mathcal{F}\}$, where $\hat{\ell}(\cdot)$ is defined in Lemma 2, and let $\mathcal{M}_{\mathcal{F}} = \{(x, y) \mapsto M(f_W(x), y) \mid f \in \mathcal{F}\}$.
We then obtain

$$
\Re(\hat{\mathcal{F}}) \leq \frac{1}{\gamma} \Bigg( \Re(\mathcal{M}_{\mathcal{F}}) + \frac{2\epsilon}{n} \mathbb{E}_{\nu}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \nu_i \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\, \|x_i\|_{1} \Big] \Bigg),
$$

where the inequality follows from the Ledoux-Talagrand contraction inequality and the convexity of the supremum. For the second term, we have

$$
\begin{aligned}
&\frac{2\epsilon}{n} \mathbb{E}_{\nu}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \nu_i \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\, \|x_i\|_{1} \Big] \qquad (51)\\
&\overset{(a)}{\leq} \frac{2\epsilon}{n} \sup_{f \in \mathcal{F}} \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=1}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\, \mathbb{E}_{\nu}\Big[ \Big| \sum_{i=1}^{n} \nu_i \|x_i\|_{1} \Big| \Big] \qquad (52)\\
&\overset{(b)}{\leq} \frac{2\epsilon}{n} \sup_{f \in \mathcal{F}} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=0}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\, \|X\|_{1,2},
\end{aligned}
$$

where inequality (a) is achieved by separating all network-dependent parameters from the expectation and inequality (b) results from applying Khintchine's inequality. Combined with Lemma 4, we have

$$
\Re(\hat{\mathcal{F}}) \leq \frac{1}{\gamma} \Bigg( \frac{4}{n^{3/2}} + \frac{60 \log(n) \log(2 d_{\max})}{n}\, \|X\|_F \Big( \prod_{h=1}^{L} s_h \Big) \Big( \sum_{j=1}^{L} \big( \tfrac{b_j}{s_j} \big)^{2/3} \Big)^{3/2} + \frac{2\epsilon}{n} \sup_{f \in \mathcal{F}} \prod_{m=1}^{N-1} \|W^{m}\|_{1,\infty} \prod_{k=0}^{L-N-1} \|(W^{L-k})^{T}\|_{1,\infty}\, \|X\|_{1,2} \Bigg).
$$

Having bounded $\Re(\hat{\mathcal{F}})$, Theorem 4 is a direct consequence of Lemmas 2 and 3.
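The weight-dependent factor of this bound and the data term $\|X\|_{1,2}$ can be computed directly. A sketch under assumed norm conventions: $\|(W)^T\|_{1,\infty}$ is read as the maximum row-wise $\ell_1$ norm of $W$, and $\|X\|_{1,2}$ as the $\ell_2$ norm of the column-wise $\ell_1$ norms (matching the Khintchine step); `second_term` is an illustrative reading of the bound, not an exact reproduction.

```python
import numpy as np

def norm_1_inf_T(W):
    # ||(W)^T||_{1,inf}: max row-wise l1 norm of W (assumed convention)
    return np.max(np.sum(np.abs(W), axis=1))

def norm_X_12(X):
    # ||X||_{1,2}: l2 norm of the vector of column-wise l1 norms of X = [x_1 ... x_n]
    return float(np.linalg.norm(np.sum(np.abs(X), axis=0)))

def second_term(weights, X, N, eps):
    """Illustrative reading of the second term: (2 eps / n) times the product of
    the layer norms above the perturbed layer N, times ||X||_{1,2}."""
    n = X.shape[1]
    prod = np.prod([norm_1_inf_T(W) for W in weights[N:]])  # layers N+1, ..., L
    return 2.0 * eps / n * prod * norm_X_12(X)
```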

C.2 EXTENSION TO MULTIPLE LAYER BOUND

In this section, we consider the robust surrogate loss under the multi-layer bound and study its Rademacher complexity. We first give the expression of the robust surrogate loss and then a result on the generalization bound.

Lemma 5. Define the robust loss function $\hat{\ell}(f_W(x), y)$ as

$$
\hat{\ell}(f_W(x), y) = \phi_\gamma\Bigg( M(f_W(x), y) - 2 \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i=1}^{\ell-1} \|W^{i}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty}\, \|x\|_{1} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i=1}^{L-2} \|W^{i}_{*}\|_{1,\infty}\, \|x\|_{1} \Big) - \mathbb{1}(L \in I)\, 2\epsilon_L \prod_{i=1}^{L-1} \|W^{i}_{*}\|_{1,\infty}\, \|x\|_{1} \Bigg).
$$

Then we have

$$
\max_{\tilde{W}} \mathbb{1}\big(y \neq \arg\max_{y'} [f_{\tilde{W}}(x)]_{y'}\big) \leq \hat{\ell}(f_W(x), y) \leq \mathbb{1}\Bigg( M(f_W(x), y) - 2 \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i=1}^{\ell-1} \|W^{i}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty}\, \|x\|_{1} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i=1}^{L-2} \|W^{i}_{*}\|_{1,\infty}\, \|x\|_{1} \Big) - \mathbb{1}(L \in I)\, 2\epsilon_L \prod_{i=1}^{L-1} \|W^{i}_{*}\|_{1,\infty}\, \|x\|_{1} \leq \gamma \Bigg).
$$

Using the above loss, we can establish an upper bound on the robust surrogate loss and state a generalization bound. Given the function class $\hat{\mathcal{F}} = \{(x, y) \mapsto \hat{\ell}(f_W(x), y) \mid f \in \mathcal{F}\}$, we have

$$
\begin{aligned}
\Re(\hat{\mathcal{F}}) \leq \frac{1}{\gamma} \Bigg( \Re(\mathcal{M}_{\mathcal{F}}) &+ \frac{2}{n} \mathbb{E}_{\nu}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \nu_i \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i'=1}^{\ell-1} \|W^{i'}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i'=1}^{L-2} \|W^{i'}_{*}\|_{1,\infty} \Big) \|x_i\|_{1} \Big] \\
&+ \frac{2}{n} \mathbb{E}_{\nu}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \nu_i\, \mathbb{1}(L \in I)\, \epsilon_L \prod_{i'=1}^{L-1} \|W^{i'}_{*}\|_{1,\infty}\, \|x_i\|_{1} \Big] \Bigg), \qquad (56)
\end{aligned}
$$

where the second term can be bounded as

$$
\begin{aligned}
&\frac{2}{n} \mathbb{E}_{\nu}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \nu_i \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i'=1}^{\ell-1} \|W^{i'}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i'=1}^{L-2} \|W^{i'}_{*}\|_{1,\infty} \Big) \|x_i\|_{1} \Big] \qquad (57)\\
&\leq \frac{2}{n} \sup_{f \in \mathcal{F}} \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i'=1}^{\ell-1} \|W^{i'}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i'=1}^{L-2} \|W^{i'}_{*}\|_{1,\infty} \Big) \mathbb{E}_{\nu}\Big[ \Big| \sum_{i=1}^{n} \nu_i \|x_i\|_{1} \Big| \Big] \qquad (58)\\
&\leq \frac{2}{n} \sup_{f \in \mathcal{F}} \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i'=1}^{\ell-1} \|W^{i'}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i'=1}^{L-2} \|W^{i'}_{*}\|_{1,\infty} \Big) \|X\|_{1,2},
\end{aligned}
$$

while the last term can likewise be bounded as

$$
\begin{aligned}
\frac{2}{n} \mathbb{E}_{\nu}\Big[ \sup_{f \in \mathcal{F}} \sum_{i=1}^{n} \nu_i\, \mathbb{1}(L \in I)\, \epsilon_L \prod_{i'=1}^{L-1} \|W^{i'}_{*}\|_{1,\infty}\, \|x_i\|_{1} \Big] &\leq \frac{2}{n} \sup_{f \in \mathcal{F}} \mathbb{1}(L \in I)\, \epsilon_L \prod_{i'=1}^{L-1} \|W^{i'}_{*}\|_{1,\infty}\, \mathbb{E}_{\nu}\Big[ \Big| \sum_{i=1}^{n} \nu_i \|x_i\|_{1} \Big| \Big] \qquad (60, 61)\\
&\leq \frac{2\, \mathbb{1}(L \in I)\, \epsilon_L}{n} \sup_{f \in \mathcal{F}} \prod_{i'=1}^{L-1} \|W^{i'}_{*}\|_{1,\infty}\, \|X\|_{1,2}.
\end{aligned}
$$

With all of the upper bounds above and Lemmas 4 and 5,
we have the following theorem.

Theorem 5 (generalization gap for robust surrogate loss). With Lemma 5, consider the neural network hypothesis class $\mathcal{F} = \{f_W(x) \mid W = (W^1, W^2, \ldots, W^L),\ \|W^h\|_\sigma \leq s_h,\ \|(W^h)^T\|_{2,1} \leq b_h,\ h \in [L]\}$. For any $\gamma > 0$, with probability at least $1 - \delta$, we have for all $f_W(\cdot) \in \mathcal{F}$:

$$
\begin{aligned}
&P_{(x,y) \sim D}\big( \exists\, \tilde{W}\ \text{s.t.}\ y \neq \arg\max_{y' \in [K]} [f_{\tilde{W}}(x)]_{y'} \big) \\
&\leq \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\Bigg( [f_W(x_i)]_{y_i} \leq \gamma + \max_{y' \neq y_i} [f_W(x_i)]_{y'} + 2 \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i'=1}^{\ell-1} \|W^{i'}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i'=1}^{L-2} \|W^{i'}_{*}\|_{1,\infty} \Big) \|x_i\|_{1} + 2\, \mathbb{1}(L \in I)\, \epsilon_L \prod_{i'=1}^{L-1} \|W^{i'}_{*}\|_{1,\infty}\, \|x_i\|_{1} \Bigg) \\
&\quad + \frac{1}{\gamma} \Bigg( \frac{4}{n^{3/2}} + \frac{60 \log(n) \log(2 d_{\max})}{n}\, \|X\|_F \Big( \prod_{h=1}^{L} s_h \Big) \Big( \sum_{j=1}^{L} \big( \tfrac{b_j}{s_j} \big)^{2/3} \Big)^{3/2} + \frac{2}{n} \sup_{f \in \mathcal{F}} \Big( \max_{k \in [K]} \|W^{L}_{k,:}\|_{1} \Big( \sum_{\ell \in I \setminus \{L, L-1\}} \epsilon_\ell \prod_{i'=1}^{\ell-1} \|W^{i'}_{*}\|_{1,\infty} \prod_{j=\ell+1}^{L-1} \|(W^{j})^{T}\|_{1,\infty} + \mathbb{1}(L-1 \in I)\, \epsilon_{L-1} \prod_{i'=1}^{L-2} \|W^{i'}_{*}\|_{1,\infty} \Big) + 2\, \mathbb{1}(L \in I)\, \epsilon_L \prod_{i'=1}^{L-1} \|W^{i'}_{*}\|_{1,\infty} \Big) \|X\|_{1,2} \Bigg) + 3\sqrt{\frac{\log(2/\delta)}{2n}}. \qquad (63)
\end{aligned}
$$

D ADDITIONAL EXPERIMENTS

D.1 SUPPLEMENT OF FIGURE 1 (A)

For completeness, Figure 2 adds the curve for $\epsilon = 0$ (the standard generalization setting) to Figure 1(a). The generalization gap is clearly lower when $\epsilon = 0$ than in the other settings, but its trend with respect to the regularization is similar to the settings with $\epsilon \neq 0$.

Figure 4 shows the trade-off with a fine grid search over the coefficients $\lambda$ and $\mu$: we present the test accuracy under the weight PGD attack with perturbation radii $\epsilon = 0.01$ and $0.02$ for different combinations of $\lambda$ (from 0.01 to 0.015) and $\mu$ (from 0 to 0.05). We find that there is indeed a sweet spot where proper values of $\lambda$ and $\mu$ lead to significantly better robust accuracy; when both $\lambda$ and $\mu$ are too large or too small, the robustness of the model decreases.

D.4 ALTERNATIVE ROBUST LOSS FUNCTION

In addition to the proposed loss function in equation (9), one can consider an alternative generalization gap regularization term derived from Theorem 4, which is

E ON THE VACUITY OF GENERALIZATION BOUND

In (Nagarajan & Kolter, 2019), empirical observations point out that as the size of the training data increases, the error bound proposed in (Bartlett et al., 2017) grows rapidly, losing the ability to describe the generalization gap and thus becoming vacuous. However, we note that under our settings, with models trained using the loss function in Section 3.5, the bound does not grow at a polynomial rate and instead shows a decreasing trend. We conducted experiments under the same setting as (Nagarajan & Kolter, 2019) and present the results in Figure 6. Here we verify two existing generalization bounds from the literature: one from (Bartlett et al., 2017) and one from (Barron & Klusowski, 2018), where the former is composed mainly of the product of the weight matrices' norms and the latter comprises the norm of the product of the matrices. Empirical results in Figure 6(a) show that under the standard setting, the main components of the generalization bounds in (Bartlett et al., 2017) and (Barron & Klusowski, 2018) both grow rapidly with the size of the training dataset, as confirmed in (Nagarajan & Kolter, 2019).



FOR ROBUST SURROGATE LOSS

We consider the robust surrogate loss established in Lemma 2 and study its generalization bound via Rademacher complexity in Theorem 4, where $S = \{(x_i, y_i)\}_{i=1}^{n}$ denotes the set of i.i.d. training samples, $X := [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ denotes the matrix composed of the training data samples, and $d_{\max} = \max\{d, d_1, \ldots, d_L\}$ denotes the maximum dimension among all weight matrices.

Theorem 4 (generalization gap for robust surrogate loss). With Lemma 2, consider the neural network hypothesis class

Figure 1(a) shows the empirical generalization gap (training accuracy minus test accuracy) as a function of the matrix norm regularization coefficient $\mu$ defined in (9). As indicated by Theorem 4, increasing $\mu$ effectively suppresses the Rademacher complexity and thus reduces the generalization gap, which is consistently observed for neural networks trained with different weight perturbation levels $\epsilon$.

Figure 1: (a) Empirical generalization gaps when varying the matrix norm regularization coefficient $\mu$ in (9). Consistent with the theoretical results, the gap reduces as $\mu$ increases for every $\epsilon$ value used for training. (b) Comparison of the test accuracy of neural networks trained with different coefficients $\lambda$ and $\mu$ under the weight PGD attack (200 steps) with perturbation level $\epsilon$. AUC refers to the area under the curve score. Joint training with the two theory-driven terms as described in (9) indeed yields more generalizable and robust neural networks against weight perturbations.

Figure 1(b) shows the test accuracy under different weight perturbation levels $\epsilon$ (i.e., the robust accuracy) with 200 attack steps. The standard model ($\lambda = \mu = 0$) is fragile to the weight PGD attack. On the other hand, neural networks trained only with the robustness loss ($\lambda > 0$ and $\mu = 0$) or only with the generalization gap regularization ($\lambda = 0$ and $\mu > 0$) already improve the robust accuracy, owing to the improved generalization and classification margin. Moreover, joint training using the proposed loss with proper coefficients can further boost model performance (e.g., $(\lambda, \mu) = (0.0125, 0.01)$), as seen from the significantly improved area under the curve (AUC) score of robust accuracy over all tested $\epsilon$ values. The AUC of the best model is about 2x larger than that of the standard model. Similar results hold for the attack with 100 steps (see Appendix D). In Appendix D.3, we conduct additional experiments on the coefficients $\lambda$ and $\mu$ and discuss their trade-offs. We also report a run-time analysis in Appendix F: training with our loss function has a per-epoch run time comparable to standard training.
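As a hedged sketch (the exact form of equation (9) is not reproduced in this excerpt), a training objective of this shape combines the classification loss, a $\lambda$-weighted robustness term, and a $\mu$-weighted matrix-norm regularizer; `robust_term` is a placeholder for the theory-driven margin-degradation term, and the max row-wise $\ell_1$ norm is an assumed choice of matrix norm:

```python
import numpy as np

def joint_loss(logits, y, weights, robust_term, lam, mu):
    """Sketch of a joint objective in the spirit of equation (9):
    cross entropy + lam * robustness term + mu * matrix-norm regularizer."""
    z = logits - np.max(logits)                  # numerically stable log-softmax
    ce = -(z[y] - np.log(np.sum(np.exp(z))))
    reg = sum(np.max(np.sum(np.abs(W), axis=1)) for W in weights)
    return ce + lam * robust_term + mu * reg
```

Setting $\lambda = \mu = 0$ recovers standard cross-entropy training.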

C GENERALIZATION BOUND ON RADEMACHER COMPLEXITY

C.1 PROOF ON SINGLE LAYER BOUND

To derive the Rademacher complexity and generalization gap for the single-layer robust surrogate loss, we first introduce a result proven in (Bartlett et al., 2017) and another classical result in statistical learning theory, and then prove Theorem 4. Given a set $S = \{(x_i, y_i)\}_{i=1}^{n}$ of i.i.d. training samples, denote by $X := [x_1, x_2, \ldots, x_n] \in \mathbb{R}^{d \times n}$ the matrix composed of the training data, and let $d_{\max} = \max\{d, d_1, \ldots, d_L\}$ be the maximum dimension among all weight matrices.

Figure 2: Empirical generalization gaps when altering the matrix norm regularization coefficient µ in (9).

Figure 3: Comparison of the test accuracy of neural networks trained with different coefficients $\lambda$ and $\mu$ against the weight PGD attack (100 steps) with perturbation level $\epsilon$.

$\sum_{m=1}^{L} \big( \log \|(W^{m})^{T}\|_{1,\infty} + \log \|W^{m}\|_{1,\infty} \big)$. We compare its performance following the same experiment setting as in Figure 1(b), with fine-tuned coefficients $\lambda$ and $\mu$. It can be observed that this alternative loss function also yields robust models with comparable (sometimes slightly better) performance to those in Figure 1(b), verifying the effectiveness of using theory-driven insights to reduce the generalization gap against weight perturbations.
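This alternative regularizer is straightforward to compute; a sketch reading $\|\cdot\|_{1,\infty}$ as the maximum column-wise $\ell_1$ norm (an assumed convention, under which the transposed factor becomes the maximum row-wise $\ell_1$ norm):

```python
import numpy as np

def norm_1_inf(A):
    # ||A||_{1,inf}: max column-wise l1 norm (assumed convention)
    return np.max(np.sum(np.abs(A), axis=0))

def log_norm_regularizer(weights):
    """Sum over layers of log ||(W^m)^T||_{1,inf} + log ||W^m||_{1,inf}."""
    return float(sum(np.log(norm_1_inf(W.T)) + np.log(norm_1_inf(W)) for W in weights))
```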

Figure 5: Test accuracy under different weight perturbation levels $\epsilon$ with 200 attack steps. The models are trained using the alternative loss described in Section D.4.

Another empirical finding is that the last column of Figure 6(a) shows the multiplicative difference between the bounds in (Bartlett et al., 2017) and (Barron & Klusowski, 2018) exhibiting a constant rate, demonstrating the vacuity of both bounds. However, when measuring the same components under our setting in Figure 6(b), the results show decreasing bounds as the size of the training dataset increases, establishing the non-vacuity of the associated generalization bounds in our settings.

Figure 6: Statistics associated with the generalization bounds for the standard model and the robust model trained using equation (9). The experiment setting follows Figure 1(a). In the first column, we present the test error under the weight PGD attack with perturbation level $\epsilon$; the curve with $\epsilon = 0$ (no weight perturbation) corresponds to the standard generalization setting. In the second column, we present the product of the spectral norms of the weight matrices, related to the bounds in (Bartlett et al., 2017). In the third column, we show the spectral norm of the product of the weight matrices, related to the bounds in (Barron & Klusowski, 2018). In the fourth column, we show the product of the spectral norms of the weight matrices divided by the spectral norm of the product of the weight matrices. All values are plotted on a logarithmic scale. For the standard model, both types of bound increase with the training set size and are shown to be vacuous, consistent with the results in (Nagarajan & Kolter, 2019). For the robust model, both types of bound exhibit the same decreasing trend as the test error and are therefore non-vacuous. Moreover, the two bounds demonstrate similar scaling behavior (nearly constant log ratio) in both the standard and robust models.



As an ablation study, Figure 7 shows the performance of the robust model with ($\lambda = 0.01$, $\mu = 0$), that is, a neural network trained without the generalization regularization term in equation (9). Unlike the standard model, the generalization bounds still show a decreasing trend with the test error. This is because the robustness term in equation (9), which is induced from our analysis of the worst-case error propagation under weight perturbation, also plays a role in regularizing the network weights, thereby making the resulting generalization bound non-vacuous.

Figure 7: Ablation study of the robust model with ($\lambda = 0.01$, $\mu = 0$). The experiment setting is the same as Figure 6. In the absence of the generalization regularization term in equation (9), the generalization bounds still show a decreasing trend with the test error.

F RUN TIME ANALYSIS

Table 1 reports the per-epoch run time of the standard model ($\lambda = 0$, $\mu = 0$) and the robust model ($\lambda = 0.01$, $\mu = 0.1$) trained using equation (9). We train both models for 20 epochs and use the same hyperparameters, including the SGD optimizer with learning rate 0.

