TOWARDS CERTIFYING ℓ∞ ROBUSTNESS USING NEURAL NETWORKS WITH ℓ∞-DIST NEURONS

Abstract

It is well known that standard neural networks, even with high classification accuracy, are vulnerable to small ℓ∞ perturbations. Many attempts have been made to learn networks that can resist such adversarial attacks. However, most previous works either provide only empirical verification of the defense against a particular attack method, or develop a theoretical guarantee of model robustness only in limited scenarios. In this paper, we develop a theoretically principled neural network that inherently resists ℓ∞ perturbations. In particular, we design a novel neuron that uses the ℓ∞ distance as its basic operation, which we call the ℓ∞-dist neuron. We show that the ℓ∞-dist neuron is naturally a 1-Lipschitz function with respect to the ℓ∞ norm, and that neural networks constructed from ℓ∞-dist neurons (ℓ∞-dist Nets) enjoy the same property. This directly provides a theoretical guarantee of certified robustness based on the margin of the prediction outputs. We further prove that ℓ∞-dist Nets have enough expressive power to approximate any 1-Lipschitz function, and that they generalize well: the robust test error can be upper-bounded by the performance of a large-margin classifier on the training data. Preliminary experiments show that even without the help of adversarial training, the learned networks with high classification accuracy are already provably robust.

1. INTRODUCTION

Modern neural networks are usually sensitive to small, adversarially chosen perturbations of their inputs (Szegedy et al., 2013; Biggio et al., 2013). Given an image x that is correctly classified by a neural network, a malicious attacker may find a small adversarial perturbation δ such that the perturbed image x + δ, though visually indistinguishable from the original, is assigned to a wrong class with high confidence by the network. Such vulnerability creates security concerns in many real-world applications. Developing a model that can resist small ℓ∞ perturbations has been extensively studied in the literature. Adversarial training methods (Szegedy et al., 2013; Madry et al., 2017; Goodfellow et al., 2015; Huang et al., 2015; Athalye et al., 2018; Ding et al., 2020) first generate on-the-fly adversarial examples of the inputs, and then update model parameters using these perturbed samples together with the original labels. Such approaches are restricted to a particular (class of) attack method, and there is no formal guarantee that the resulting model is robust against other attacks. Another line of algorithms trains robust models by maximizing the certified radius provided by robust certification methods. Weng et al. (2018); Wong & Kolter (2018a); Zhang et al. (2018); Mirman et al. (2018); Wang et al. (2018); Gowal et al. (2018); Zhang et al. (2019b) develop methods based on linear or convex relaxations of fully connected ReLU networks. However, these certification methods are usually computationally expensive and can only handle ReLU activations. Cohen et al. (2019); Salman et al. (2019); Zhai et al. (2020) show that a certified guarantee against small ℓ2 perturbations can be easily computed for general Gaussian smoothed classifiers, but recent works suggest that such methods are hard to extend to the ℓ∞-perturbation scenario.
In this work, we overcome the challenges mentioned above by introducing a new type of neural network that naturally resists local adversarial attacks and can be easily certified under ℓ∞ perturbations. In particular, we propose a novel neuron called the ℓ∞-dist neuron. Unlike the standard neuron design that applies a non-linear activation after a linear transformation, the ℓ∞-dist neuron is purely based on computing the ℓ∞ distance between the input and the parameters. It is straightforward to see that such a neuron is 1-Lipschitz with respect to the ℓ∞ norm, and neural networks constructed from ℓ∞-dist neurons (ℓ∞-dist Nets) enjoy the same property. Based on this property, we can obtain certified robustness for any ℓ∞-dist Net using the margin of the prediction outputs. Theoretically, we investigate the expressive power of ℓ∞-dist Nets and their adversarially robust generalization ability. We first prove a Lipschitz-universal approximation theorem for ℓ∞-dist Nets using a structured approach, which shows that ℓ∞-dist Nets can approximate any 1-Lipschitz function with respect to the ℓ∞ norm. We then give upper bounds on the robust test error based on Rademacher complexity, which show that the robust test error is small if the ℓ∞-dist Net learns a large-margin classifier on the training data. Both results demonstrate the strong capability and generalization ability of the ℓ∞-dist Net function class. While ℓ∞-dist Nets have nice theoretical guarantees, training one is not easy in practice. For example, the gradient of the parameters inside the ℓ∞ norm is sparse, which makes optimization inefficient. We show how to initialize the model parameters, apply proper normalization, and overcome the sparse-gradient problem via smoothed approximate gradients. Preliminary experiments on MNIST and Fashion-MNIST show that even without adversarial training, the learned networks are already provably robust.
Our contributions are summarized as follows:
• We propose a novel neural network built from ℓ∞-dist neurons, called the ℓ∞-dist Net. Theoretically,
  - In Section 3, we show that ℓ∞-dist Nets are 1-Lipschitz with respect to the ℓ∞ norm by construction, which directly guarantees the certified robustness of any ℓ∞-dist Net (with respect to the ℓ∞ norm).
  - In Section 4.1, we prove that ℓ∞-dist Nets have good expressive power, as they can approximate any 1-Lipschitz function with respect to the ℓ∞ norm.
  - In Section 4.2, we prove that ℓ∞-dist Nets have good generalization ability, as the robust test error can be upper-bounded by the performance of a large-margin classifier on the training data.
• We provide several implementation strategies that are practically helpful for model training (Section 5).

2. RELATED WORKS

There are two major lines of work seeking to obtain robust neural networks.

Robust Training Approaches. Previous works showed that conventional neural networks learned with standard training (e.g., maximum likelihood) are sensitive to small adversarial perturbations, and significant effort has been put into developing training approaches for learning robust models. Adversarial training is the most successful method against adversarial attacks. By adding adversarial examples to the training set on the fly, adversarial training methods (Szegedy et al., 2013; Goodfellow et al., 2015; Huang et al., 2015; Zhang et al., 2019a; Wong et al., 2020) can significantly improve the robustness of conventional neural networks. However, all of the methods above are evaluated by the empirical robust accuracy against pre-defined adversarial attack algorithms, such as projected gradient descent; there is no formal guarantee that the resulting model is also robust against other attacks.

Certified Robustness. Many recent works focus on certifying the robustness of learned neural networks under any attack. Approaches that bound the certified radius layer by layer using convex relaxation have been proposed (Wong & Kolter, 2018b; Gowal et al., 2018; Mirman et al., 2018; Dvijotham et al., 2018; Raghunathan et al., 2018; Zhang et al., 2020). However, such approaches are usually computationally expensive and have difficulty scaling to deep models. More recently, researchers found a new approach, randomized smoothing, that can provide a certified robustness guarantee for general models. Zhai et al. (2020) showed that if Gaussian random noise is added to the input layer, a certified guarantee against small ℓ2 perturbations can be computed for Gaussian smoothed classifiers. However, Yang et al. (2020); Blum et al. (2020); Kumar et al. (2020) showed that randomized smoothing cannot achieve nontrivial certified accuracy beyond radius Ω(min{1, d^(1/p − 1/2)}) for ℓp perturbations, where d is the input dimension; it therefore cannot provide meaningful results for ℓ∞ perturbations due to the curse of dimensionality. Another, more conservative line of certification approaches bounds the global Lipschitz constant of the network (Gouk et al., 2018; Tsuzuku et al., 2018; Anil et al., 2019; Cisse et al., 2017). However, the bounds are not tight (Cohen et al., 2019), and the final robustness is not as good as that of other certification methods.

3. ℓ∞-DIST NETWORK AND ITS ROBUSTNESS GUARANTEE

3.1. PRELIMINARIES

Throughout this paper, we use bold letters to denote vectors and plain letters to denote scalars. Consider a standard classification task with an underlying data distribution D over pairs of examples x ∈ X and corresponding labels y ∈ Y = {1, 2, · · · , M}. Usually D is unknown, and we can only access a training set T = {(x_1, y_1), · · · , (x_n, y_n)} in which each (x_i, y_i) is drawn i.i.d. from D. Let f ∈ F be the classifier of interest that maps any x ∈ X to Y. We call x′ = x + δ an adversarial example of x for classifier f if f correctly classifies x but assigns a different label to x′. In practice, the most common setting considers attacks under an ε-bounded ℓ∞-norm constraint, i.e., δ satisfies ‖δ‖∞ ≤ ε, which is also called an ℓ∞ perturbation. Our goal is to learn a model from T that, with high probability over (x, y) ∼ D, resists any small ℓ∞ perturbation at (x, y). This amounts to computing the radius of the largest ℓ∞ ball centered at x in which f does not change its prediction. This radius is called the robust radius, defined as (Zhai et al., 2020; Zhang et al., 2019a):

R(f; x, y) = inf_{f(x′) ≠ f(x)} ‖x′ − x‖∞ if f(x) = y, and R(f; x, y) = 0 if f(x) ≠ y.  (1)

Unfortunately, computing the robust radius (1) of a classifier induced by a standard deep neural network is very difficult. For example, Weng et al. (2018) showed that computing the ℓp robust radius of a deep neural network is NP-hard for some specific p. Researchers therefore seek a tight lower bound on R(f; x, y) for general f. This lower bound is called the certified radius, denoted CR(f; x, y).
The certified radius satisfies 0 ≤ CR(f ; x, y) ≤ R(f ; x, y) for any f, x, y.

3.2. NETWORKS WITH ℓ∞-DIST NEURONS

In this subsection, we propose the ℓ∞-dist neuron, which is inherently Lipschitz with respect to the ℓ∞ norm. Using these neurons as building blocks, we then show how to obtain a 1-Lipschitz neural network, dubbed the ℓ∞-dist Net. Denote by z the input vector to a neuron. A standard neuron processes the input by first projecting z to a scalar value using a linear transformation, and then applying a non-linear activation function σ, i.e., σ(wᵀz + b), where w and b are parameters and σ can be, e.g., the sigmoid or ReLU activation. Departing from this design paradigm, we introduce a new type of neuron that uses the ℓ∞ distance as its basic operation, which we call the ℓ∞-dist neuron:

u(z, θ) = ‖z − w‖∞ + b,

where θ = {w, b} is the parameter set. From the above equation, we can see that the ℓ∞-dist neuron is non-linear, as it calculates the ℓ∞ distance between the input z and the parameter w, with a bias term b. As a result, there is no need to apply a further non-linear activation function. Without loss of generality, we study the properties of multi-layer perceptron (MLP) networks constructed from ℓ∞-dist neurons; all theoretical results extend easily to other architectures, such as convolutional neural networks. We use x ∈ R^(d_input) to denote the input vector of an MLP network. An MLP network using ℓ∞-dist neurons can be formally defined as follows.

Definition 3.1. (ℓ∞-dist Net) Denote by g an MLP network that takes x^(1) = x ∈ R^(d_input) as input. Assume there are L hidden layers and the l-th hidden layer contains d_l hidden units. We say g is an MLP network constructed from ℓ∞-dist neurons if the k-th unit in the l-th hidden layer, x^(l+1)_k, is computed by

x^(l+1)_k = u(x^(l), θ^(l,k)) = ‖x^(l) − w^(l,k)‖∞ + b^(l,k),   0 < l ≤ L, 0 < k ≤ d_l,

where x^(l) = (x^(l)_1, x^(l)_2, · · · , x^(l)_{d_{l−1}}) is the input vector to the l-th hidden layer. For simplicity, we call an MLP network constructed from ℓ∞-dist neurons an ℓ∞-dist Net.
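The neuron and a full hidden layer can be sketched in a few lines of NumPy (an illustrative sketch; the function names and array shapes are ours, not from the paper):

```python
import numpy as np

def linf_dist_neuron(z, w, b=0.0):
    """u(z; w, b) = ||z - w||_inf + b, the basic operation of an l_inf-dist neuron."""
    return np.max(np.abs(z - w)) + b

def linf_dist_layer(x, W, b):
    """One hidden layer of an l_inf-dist Net.

    x: (d_in,) input vector; W: (d_out, d_in), one weight vector per neuron;
    b: (d_out,) biases. Returns the (d_out,) vector of neuron outputs."""
    return np.max(np.abs(x[None, :] - W), axis=1) + b

# By the reverse triangle inequality for z -> ||z - w||_inf, the neuron is
# 1-Lipschitz w.r.t. the l_inf norm: |u(z1) - u(z2)| <= ||z1 - z2||_inf.
```

Because every neuron (and hence every layer) is a 1-Lipschitz map, their composition is as well, which is exactly Fact 3.1 below.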
For classification tasks, the dimensionality of the output of g matches the number of categories M. We write g(x) = (g_1(x), g_2(x), · · · , g_M(x)) and define the predictor f(x) = arg max_{i∈[M]} g_i(x). Note that g(x) can serve as the output logits just as in any other network, so we can apply any standard loss function to the ℓ∞-dist Net, such as the cross-entropy loss or the hinge loss. We will further show that the ℓ∞-dist neuron, and the neural networks constructed from it, have nice theoretical properties for controlling the robustness of the model. For completeness, we first recall the definition of a Lipschitz function.

Definition 3.2. (Lipschitz Function) A function g(z): R^m → R^n is called λ-Lipschitz with respect to the ℓp norm ‖·‖_p if for any z_1, z_2 the following holds: ‖g(z_1) − g(z_2)‖_p ≤ λ‖z_1 − z_2‖_p.

It is straightforward to see that the ℓ∞-dist neuron is 1-Lipschitz with respect to the ℓ∞ norm.

3.3. LIPSCHITZ AND ROBUSTNESS FACTS ABOUT ℓ∞-DIST NETS

In this subsection, we introduce two straightforward but important facts about ℓ∞-dist Nets. We first show that ℓ∞-dist Nets are 1-Lipschitz with respect to the ℓ∞ norm, and then derive the certified robustness of the model from this property.

Fact 3.1. Any ℓ∞-dist Net g(·) is 1-Lipschitz with respect to the ℓ∞ norm, i.e., for any x_1, x_2 ∈ R^(d_input), we have ‖g(x_1) − g(x_2)‖∞ ≤ ‖x_1 − x_2‖∞.

Proof. It is easy to check that every operation u(x^(l), θ^(l,k)) is 1-Lipschitz, and therefore the mapping from one layer to the next, x^(l) → x^(l+1), is 1-Lipschitz. Composing over layers, for any x_1, x_2 ∈ R^(d_input), ‖g(x_1) − g(x_2)‖∞ ≤ ‖x_1 − x_2‖∞.

Since g is 1-Lipschitz with respect to the ℓ∞ norm, if the perturbation of x is small enough, the change of the output is bounded and the label of the perturbed input x′ does not change, i.e., arg max_{i∈[M]} g_i(x) = arg max_{i∈[M]} g_i(x′), which directly bounds the certified radius.

Fact 3.2. Let f(x) = arg max_{i∈[M]} g_i(x) be defined as above, and suppose x is correctly classified as y. Define margin(x; g) as the difference between the largest and second-largest elements of g(x). Then for any x′ satisfying ‖x − x′‖∞ ≤ margin(x; g)/2, we have f(x) = f(x′), and thus

CR(f; x, y) ≥ margin(x; g)/2.  (3)

Proof. Since g is 1-Lipschitz, each element of g(x) can change by at most margin(x; g)/2 when x changes to x′; hence the gap between the largest element and every other element cannot close, and the arg max remains the same.

Using this bound, we can certify the robustness of an ℓ∞-dist Net of any size under the ℓ∞ norm with little computational cost (a single forward pass). In contrast, existing certification methods suffer from either poor scalability (methods based on convex relaxation) or the curse of dimensionality (randomized smoothing), and lack efficiency as well.
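The certificate of Fact 3.2 can be computed directly from the output logits (a minimal sketch; the function name is ours):

```python
import numpy as np

def certified_radius(logits):
    """Fact 3.2: for a 1-Lipschitz (w.r.t. l_inf) network g, the prediction cannot
    change within an l_inf ball of radius margin/2 around the input, where margin
    is the gap between the largest and second-largest logits."""
    top2 = np.sort(logits)[-2:]          # [second-largest, largest]
    return (top2[1] - top2[0]) / 2.0

logits = np.array([3.2, 0.1, 2.4])       # predicted class 0, margin = 3.2 - 2.4 = 0.8
r = certified_radius(logits)             # no perturbation with ||delta||_inf <= r flips the prediction
```

This is why certification for an ℓ∞-dist Net costs only a forward pass: the certificate is a sort over M logits.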

4. EXPRESSIVE POWER AND ROBUST GENERALIZATION OF ℓ∞-DIST NET

The expressive power of a model family and its generalization are two central topics in machine learning. In the previous section, we showed that an ℓ∞-dist Net is 1-Lipschitz with respect to the ℓ∞ norm. It is then natural to ask whether the designed neural network can approximate any 1-Lipschitz function (with respect to the ℓ∞ norm), and whether we can give theoretical guarantees on the robust test error of a learned model. In this section, we give affirmative answers to both questions. We first prove a Lipschitz-universal approximation theorem for ℓ∞-dist Nets using a structured approach, then give upper bounds on the robust test error based on the Rademacher complexity of the model family. Without loss of generality, we consider binary classification problems and assume the output dimension is 1. All omitted proofs in this section can be found in the appendix.

4.1. LIPSCHITZ-UNIVERSAL APPROXIMATION OF ℓ∞-DIST NET

In this subsection, we show that ℓ∞-dist Nets with bounded width can approximate any 1-Lipschitz function (with respect to the ℓ∞ norm), formalized in the following theorem.

Theorem 1. For any 1-Lipschitz function g(x) (with respect to the ℓ∞ norm) on a bounded domain K ⊂ R^(d_input) and any ε > 0, there exists an ℓ∞-dist Net g̃(x) with width no more than d_input + 2 such that |g(x) − g̃(x)| ≤ ε for all x ∈ K.

To prove Theorem 1, we need the following key lemma.

Lemma 4.1. For any 1-Lipschitz function f(x) (with respect to the ℓ∞ norm) on a bounded domain K ⊂ R^n, and any ε > 0, there exists a finite collection of functions f_i(x) such that for all x ∈ K,

max_i f_i(x) ≤ f(x) ≤ max_i f_i(x) + ε,

where each f_i(x) has the form

f_i(x) = min{x_1 − w^(i)_1, w^(i)_1 − x_1, x_2 − w^(i)_2, w^(i)_2 − x_2, ..., x_n − w^(i)_n, w^(i)_n − x_n} + b_i.

Lemma 4.1 "decomposes" any target 1-Lipschitz function into simple "base functions", which serve as building blocks in the proof of the main theorem. Lu et al. (2017) first proved such a universal approximation theorem for width-bounded ReLU networks by constructing a special network that approximates the target function by a "sum of cubes". For ℓ∞-dist Nets, this approach cannot be applied directly, as summation would break the Lipschitz property. We instead employ a novel "max of pyramids" construction: the key idea is to approximate the target function from below using the maximum of many "pyramid-like" basic 1-Lipschitz functions. Theorem 1 implies that an ℓ∞-dist Net can approximate any 1-Lipschitz function with respect to the ℓ∞ norm on a compact set using finite width, barely larger than the input dimension. Combined with the fact that an ℓ∞-dist Net is itself 1-Lipschitz, we conclude that ℓ∞-dist Nets are a good model class for approximating 1-Lipschitz functions.
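The "max of pyramids" idea of Lemma 4.1 can be checked numerically: each base function is a pyramid f_i(x) = f(w_i) − ‖x − w_i‖∞ centered at a grid point w_i, and their maximum approximates f from below within ε. A sketch with our own choice of grid and target function (both are assumptions for illustration, not from the paper):

```python
import numpy as np

def pyramid_approx(f, grid_pts, x):
    """max_i [f(w_i) - ||x - w_i||_inf]: the 'max of pyramids' lower bound."""
    return max(f(w) - np.max(np.abs(x - w)) for w in grid_pts)

# Target: a 1-Lipschitz (w.r.t. l_inf) function on [0, 1]^2 of our own choosing.
f = lambda x: np.max(np.abs(x - 0.3))
eps = 0.2
ticks = np.arange(0.0, 1.0 + 1e-9, eps / 2)          # an (eps/2)-grid of pyramid centers
grid = [np.array([a, b]) for a in ticks for b in ticks]

rng = np.random.default_rng(0)
xs = rng.uniform(0.0, 1.0, size=(200, 2))
gaps = [f(x) - pyramid_approx(f, grid, x) for x in xs]
# Lemma 4.1 guarantees 0 <= f(x) - max_i f_i(x) <= eps at every x.
```

The lower-bound direction holds because f is 1-Lipschitz, and the ε gap comes from each x having a grid neighbour within ε/2 in ℓ∞ distance.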

4.2. BOUNDING ROBUST TEST ERROR OF ℓ∞-DIST NET

Good robust generalization, i.e., small empirical (margin) error implying small robust test error, is as important as strong expressive power. In this subsection, we give an upper bound on the robust test error of ℓ∞-dist Nets. Consider the following classification problem: let (x, y) be an instance–label pair with x ∈ K and y ∈ {1, −1}, and denote by D the distribution of (x, y). For a function g(x): R^(d_input) → R, we use sign(g(x)) as the classifier. The r-robust test error γ_r of a classifier g is defined as

γ_r = E_D [ sup_{‖x′ − x‖∞ ≤ r} 1{y g(x′) ≤ 0} ].

We can upper-bound γ_r by the margin error on the training data and the size of the network:

Theorem 2. Let F denote the set of all g represented by an ℓ∞-dist Net with width w and depth L. For every t > 0, with probability at least 1 − 2e^(−2t²) over the random draw of n samples, for all r > 0 and g ∈ F we have

γ_r ≤ inf_{δ∈(0,1]} [ (1/n) Σ_{i=1}^n 1{y_i g(x_i) ≤ δ + r} (training margin error) + Õ(Lw² / (δ√n)) (network size) + (log log₂(2/δ) / n)^(1/2) ] + t/√n.

Theorem 2 provides a theoretical guarantee on the adversarially robust test error: when a large-margin classifier is found on the training data, and the ℓ∞-dist Net is not too large, then with high probability the model generalizes well in terms of adversarial robustness. We prove the theorem in two steps. The first step is a margin bound controlling the gap between the standard training error and the standard test error; here we use the Rademacher complexity R_n(F) of the hypothesis set F, where the test error β_r at margin r is defined as β_r = E_D [ 1{y g(x) ≤ r} ]. The second step bounds the gap between the test error and the robust test error. The two steps are given by the following two lemmas.

Lemma 4.2. Let F be a hypothesis class. Then for any t > 0,

P[ ∃ g ∈ F : β_r > inf_{δ∈(0,1]} ( (1/n) Σ_{i=1}^n 1{y_i g(x_i) ≤ δ + r} + (48/δ) R_n(F) + (log log₂(2/δ) / n)^(1/2) ) + t/√n ] ≤ 2e^(−2t²).

Lemma 4.3.
The r-robust test error γ r is no larger than the test error β r , i.e., γ r ≤ β r .

5. TRAINING ℓ∞-DIST NET

Motivated by the theoretical analysis in the previous sections, we would like to find a large-margin solution on the training data. We therefore use the standard hinge loss as the objective function:

L(g, T) = (1/n) Σ_{i=1}^n max{0, t + max_{y′ ≠ y_i} g(x_i)_{y′} − g(x_i)_{y_i}},

where t controls the desired margin and, according to Eqn. 3, should be slightly larger than twice the desired robustness level. L(g, T) is a continuous function of the network parameters and is differentiable almost everywhere, so any gradient-based optimization method can be used to train an ℓ∞-dist Net. However, the optimization is not easy: we find that directly using the SGD or Adam optimizer fails to train the network well. In fact, the gradient of the ℓ∞ norm ‖z − w‖∞ with respect to w is very sparse, which makes optimization inefficient. In this section, we list a few optimization techniques that we find effective in practice.

Normalization and parameter initialization. Batch Normalization (Ioffe & Szegedy, 2015), which shifts and scales feature values in each layer, is one of the most important components in training deep neural networks. However, if we apply batch normalization directly in an ℓ∞-dist Net, the scaling operation changes the Lipschitz constant of the network, and the robustness of the trained model can no longer be guaranteed. To leverage the advantages of BatchNorm, we apply only the shift operation, which shifts the output of each layer by its mean over the current batch during training. As in BatchNorm, we use the running mean during inference; at inference time the running mean acts as an additional bias term in the ℓ∞-dist neuron, which does not affect the Lipschitz constant. We do not use the affine transformation typically used in BatchNorm. We apply the shift operation in all intermediate layers after computing the ℓ∞ distance, but not in the final layer.
As a result, we remove the bias term of the ℓ∞-dist neuron in all intermediate layers; the bias in the last layer is initialized to zero. As for weight initialization, the weights and inputs should have the same scale because of the ℓ∞ distance operation, so we initialize all weights from a standard Gaussian distribution, given that the inputs are normalized to unit variance.

Smoothed approximate gradients. We find that training an ℓ∞-dist Net from scratch is usually inefficient, and the model can easily get stuck in bad local minima. To obtain a well-optimized model, we relax the ℓ∞ distance to the ℓp distance in each neuron, which yields an approximate, non-sparse gradient of the parameters (Chen et al., 2020). During training, we set p to a small value at the beginning and linearly increase it at each iteration until it reaches a large value. When training finishes, we freeze the model parameters and set p to ∞ to obtain the final model.
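The two ingredients above — the hinge-loss objective and the ℓp relaxation of the neuron — can be sketched in NumPy (a sketch; the function names and the single-neuron scope are our own illustration, not the authors' implementation):

```python
import numpy as np

def lp_dist(z, w, p):
    """Relaxed neuron distance: ||z - w||_p -> ||z - w||_inf as p -> infinity.

    Factoring out the max keeps the computation numerically stable for large p."""
    d = np.abs(z - w)
    m = d.max()
    return 0.0 if m == 0 else m * np.sum((d / m) ** p) ** (1.0 / p)

def lp_grad(z, w, p):
    """Gradient of ||z - w||_p w.r.t. w: dense for small p, nearly one-hot as p
    grows, which is why the relaxation makes optimization easier."""
    d = z - w
    a = np.abs(d)
    return -np.sign(d) * (a / lp_dist(z, w, p)) ** (p - 1)

def hinge_loss(logits, labels, t):
    """L(g, T) = (1/n) sum_i max(0, t + max_{j != y_i} g(x_i)_j - g(x_i)_{y_i})."""
    n = logits.shape[0]
    correct = logits[np.arange(n), labels]
    masked = logits.copy()
    masked[np.arange(n), labels] = -np.inf   # exclude the true class from the max
    return np.maximum(0.0, t + masked.max(axis=1) - correct).mean()
```

Increasing p over training (as in the schedule of Section 6.1) moves `lp_dist` smoothly toward the exact ℓ∞ distance while its gradient stays non-sparse for moderate p.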

6. EXPERIMENT

6.1. EXPERIMENTAL SETTING

Model details and training configurations. We train the ℓ∞-dist Net on two benchmark datasets: MNIST and Fashion-MNIST. For both tasks, we use a 4-layer ℓ∞-dist Net in which each hidden layer has 4096 neurons. Normalization is applied between intermediate layers. The batch size is set to 512. Random-crop data augmentation and image standardization are used during training. We train the network using the Adam optimizer with hyper-parameters β_1 = 0.9 and β_2 = 0.99. The training procedure is as follows: first, we relax the ℓ∞-dist Net to an ℓp-dist Net with p = 6 and train the model parameters for 30 epochs; then we gradually increase p from 6 to 100 over the next 170 epochs; finally, we fix p = 100 and train for the last 50 epochs. The learning rate in the first two parts is set to 0.015 and is divided by 5.0 in the final 50 epochs. We use the hinge loss with parameter t = 0.9. For the Fashion-MNIST dataset, we use exactly the same model and hyper-parameters as for MNIST, except for the hinge-loss parameter t = 0.4. Adjusting models and hyper-parameters may further improve the results.

Evaluation and baselines. We use two metrics to measure the robustness of the model. We first evaluate the robust test accuracy under the Projected Gradient Descent (PGD) attack (Madry et al., 2017); following standard practice, we set the number of PGD steps to 20 and the step size to 0.01. We also calculate the certified radius for each sample using Eq. 3 and report the percentage of test samples that are certified to be robust within the chosen radius. As baselines we consider CAP (Wong & Kolter, 2018b), IBP (Gowal et al., 2018), CROWN-IBP (Zhang et al., 2019b), and GroupSort Networks (Anil et al., 2019). Randomized smoothing (Cohen et al., 2019; Salman et al., 2019; Zhai et al., 2020) is another strong baseline for robustness with respect to ℓ2 perturbations.
However, as discussed in the related-work section, it cannot achieve nontrivial certified accuracy beyond radius Ω(min{1, d^(−1/2)}) for ℓ∞ perturbations, where d is the input dimension, which implies that randomized smoothing cannot obtain good certification under the ℓ∞ norm. We report the performances (taken from the original papers) together with other properties of these methods, e.g., whether they use adversarial training (abbreviated as Adv Train in the table) and whether they scale to large neural networks.
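The robust-accuracy metric uses the PGD attack with the settings above (20 steps, step size 0.01). A minimal sketch of the ℓ∞ PGD loop, where `grad_fn` is a hypothetical stand-in for the gradient of the classification loss with respect to the input (any autodiff framework provides it for a real model):

```python
import numpy as np

def pgd_linf(x, grad_fn, eps, step=0.01, n_steps=20):
    """Projected gradient descent within the l_inf ball of radius eps around x.

    grad_fn(x_adv): placeholder returning the gradient of the loss w.r.t. the
    input at x_adv. Each step ascends the loss, then projects back to the ball."""
    x_adv = x.copy()
    for _ in range(n_steps):
        x_adv = x_adv + step * np.sign(grad_fn(x_adv))   # signed-gradient ascent
        x_adv = np.clip(x_adv, x - eps, x + eps)         # project onto the l_inf ball
    return x_adv
```

Note that PGD gives only an empirical robustness estimate; the certified accuracy from Eq. 3 holds regardless of the attack.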

6.2. EXPERIMENTAL RESULTS

We list the performances of ℓ∞-dist Net in Tables 1 and 2. In these tables, we use "Standard Acc", "Robust Acc" and "Certified Acc" as abbreviations for standard (clean) test accuracy, robust test accuracy under PGD attack, and certified robust test accuracy; all numbers are reported in percent. From Tables 1 and 2, we find that ℓ∞-dist Net achieves comparable or better robustness than the baselines while maintaining high standard accuracy. Specifically, on MNIST, among the methods that need no adversarial training or that scale to large networks, ℓ∞-dist Net has the best robustness, achieving more than 93% robust test accuracy and more than 91% certified test accuracy. These results are also comparable with those of methods that train the model using adversarial training and compute the certified radius at large computational cost (Wong & Kolter, 2018b; Gowal et al., 2018; Zhang et al., 2019b). We also plot the radius–certified-accuracy curve on MNIST in Figure 1(b). Results on Fashion-MNIST are similar. To understand how the ℓ∞-dist Net extracts information from the data, we sample some neurons in the first layer and record their parameters w, which we visualize together with some input data for both tasks. Interestingly, in Figure 1(a) we can see that the parameters w (first row) "look similar" to the input data (second row). This happens because the network uses the ℓ∞ distance to obtain effective signals for classification: the neurons seek typical patterns that differentiate one category from the others in terms of ℓ∞ distance, e.g., the shape of the objects, which is quite different from the feature extractors of standard neural networks.

7. CONCLUSION

In this paper, we design a novel neuron that uses the ℓ∞ distance as its basic operation. We show that the ℓ∞-dist neuron is naturally a 1-Lipschitz function with respect to the ℓ∞ norm, and that neural networks constructed from ℓ∞-dist neurons (ℓ∞-dist Nets) enjoy the same property. This directly provides a theoretical guarantee of certified robustness based on the margin of the prediction outputs. We further formally analyze the expressive power and the robust generalization ability of the network. Preliminary experiments show promising results on the MNIST and Fashion-MNIST datasets. As future work, we will conduct experiments with ℓ∞-dist Nets on more datasets (CIFAR-10 and ImageNet) and extend them to modern architectures, such as deep convolutional networks with residual connections.

Proposition A.1. For all j, k, p and constants w, c, the following base functions are realizable at the k-th unit of the l-th hidden layer:

1. the projection map: u(x^(l), θ^(l,k)) = x^(l)_j + c;  (11)
2. the negation map: u(x^(l), θ^(l,k)) = −x^(l)_j + c;  (12)
3. the maximum maps: u(x^(l), θ^(l,k)) = max{x^(l)_j + w, x^(l)_p} + c and u(x^(l), θ^(l,k)) = max{−x^(l)_j + w, x^(l)_p} + c.  (13)

Proof. 1. The projection map: set u(x^(l), θ^(l,k)) = ‖(x^(l)_1, ..., x^(l)_j + 2C, ..., x^(l)_n)‖∞ − 2C + c.  (14)
Since every coordinate is bounded by C in absolute value, the term x^(l)_j + 2C dominates the maximum, so the right-hand side equals x^(l)_j + c.

2. The negation map: set u(x^(l), θ^(l,k)) = ‖(x^(l)_1, ..., x^(l)_j − 2C, ..., x^(l)_n)‖∞ − 2C + c.  (15)

3. The maximum maps: set u(x^(l), θ^(l,k)) = ‖(x^(l)_1, ..., x^(l)_j + w + 2C, ..., x^(l)_p + 2C, ..., x^(l)_n)‖∞ − 2C + c  (16)
and u(x^(l), θ^(l,k)) = ‖(x^(l)_1, ..., −x^(l)_j + w + 2C, ..., x^(l)_p + 2C, ..., x^(l)_n)‖∞ − 2C + c.  (17)

Each right-hand side above is a single ℓ∞-dist neuron ‖x^(l) − w‖∞ + b for an appropriate choice of the weight vector and bias.

With these three basic maps in hand, we are prepared to construct our network. Using Proposition A.1, the first hidden layer realizes u(x, θ^(1,k)) = x_k for k = 1, ..., d_input. The remaining two units can be arbitrary; we set both to x_1.
By Proposition A.1, throughout the network we can maintain u(x^(l), θ^(l,k)) = x_k for all l and k = 1, ..., d_input. Notice that f_i(x) can be rewritten as

f_i(x) = −max{x_1 − w^(i)_1, max{w^(i)_1 − x_1, max{..., w^(i)_n − x_n}...}} + g(w^(i)).

Using the maximum map recurrently while keeping the other units unchanged via the projection map, we can use the unit u(x^(l), θ^(l,d_input+1)) to realize one f_i(x) at a time. Again using the maximum map, the last unit u(x^(l), θ^(l,d_input+2)) recurrently accumulates (initialized with max{f_1(x)} = f_1(x))

max_i f_i(x) = max{f_m(x), max{..., max{f_1(x)}...}}

within a finite depth, say L; the network then outputs g̃(x) = u(x^(L), θ^(L,1)) = u(x^(L−1), θ^(L−1,d_input+2)) = max_i f_i(x), as desired. We are left with choosing a valid value for C. Because K is bounded and g(x) is continuous, there exist constants C_1, C_2 such that ‖x‖∞ ≤ C_1 and |g(x)| ≤ C_2 for all x ∈ K; it is easy to verify that C = 2C_1 + C_2 is valid.

A.3 PROOF OF LEMMA 4.2

First we briefly revisit Rademacher complexity and its properties.

Rademacher Complexity. Given a sample X_n = {x_1, ..., x_n} ∈ K^n and a real-valued function class F on K, the Rademacher complexity of F is defined as

R_n(F) = E_{X_n} E_σ [ (1/n) sup_{f∈F} Σ_{i=1}^n σ_i f(x_i) ],

where the σ_i are drawn independently from the Rademacher distribution, i.e., P(σ_i = 1) = P(σ_i = −1) = 1/2. It is worth noting that R_n(F) = R_n(F ⊕ r) for any r, where F ⊕ r = {f + r | f ∈ F}. It is well known (via Massart's lemma) that the Rademacher complexity can be bounded by the VC dimension. To generalize Lemma A.1 to β_r with r > 0, we use the fact that the Rademacher complexity remains unchanged when the same constant r is added to all functions in F. Lemma 4.2 is then a direct consequence of replacing m_f by m_f − r at the end of the proof of Theorem 11 in Koltchinskii et al. (2002), where m_f is plugged into Theorem 2 of Koltchinskii et al. (2002).

A.4 PROOF OF LEMMA 4.3

Proof.
Since g is 1-Lipschitz, y g(x) > r implies inf_{‖x′ − x‖∞ ≤ r} y g(x′) > 0. Therefore

E_D [ 1{y g(x) > r} ] ≤ E_D [ inf_{‖x′ − x‖∞ ≤ r} 1{y g(x′) > 0} ],

i.e., 1 − β_r ≤ 1 − γ_r, which concludes the proof.

A.5 PROOF OF THEOREM 2

Proof. There are generalization bounds, such as Luxburg & Bousquet (2004), that involve only the Lipschitz property. Unfortunately, in those bounds the dependence on the sample size is of order n^(−1/d_input), which suffers from the curse of dimensionality. To derive a more meaningful bound, we take the network size into consideration as well. We bound the VC dimension of the ℓ∞-dist Net by reducing a given ℓ∞-dist Net to a fully-connected ReLU network. We first recall the VC bound for fully-connected ReLU networks borrowed from Bartlett et al. (2019). The following lemma shows how to compute the ℓ∞ distance using a fully-connected ReLU network.




Following Wong & Kolter (2018b), Gowal et al. (2018), Madry et al. (2017) and Zhang et al. (2019b), we test the robustness of ℓ∞-dist Net under ℓ∞ radius 0.3 on MNIST and 0.1 on Fashion-MNIST.

Figure 1: (a) Neurons sampled from the first layer: the parameters w (first row) shown alongside input data (second row) on MNIST and Fashion-MNIST. On both tasks, the parameters closely resemble the data in some categories, showing that the neuron captures shape information, a meaningful feature that is useful for classification. (b) The radius vs. certified-accuracy curve of our trained model on the MNIST dataset.
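A certified-accuracy curve like the one in Figure 1(b) follows directly from the Lipschitz guarantee: if every output logit is 1-Lipschitz in the ℓ∞ norm of the input, a prediction cannot flip under any perturbation of radius $r$ whenever the top logit exceeds the runner-up by more than $2r$. A minimal sketch (the function name and toy logits are assumptions, not from the paper):

```python
import numpy as np

def certified_accuracy(logits, labels, r):
    """Fraction of examples certified at l_inf radius r, assuming each
    output logit is 1-Lipschitz w.r.t. the l_inf norm of the input:
    the prediction cannot change if (top logit - runner-up) > 2r."""
    logits = np.asarray(logits, dtype=float)
    top = logits.max(axis=1)
    correct = logits.argmax(axis=1) == labels
    runner_up = np.partition(logits, -2, axis=1)[:, -2]  # second-largest
    return np.mean(correct & (top - runner_up > 2 * r))

# Toy logits: two confidently-correct rows and one low-margin row.
logits = [[3.0, 0.1, 0.2], [0.0, 2.5, 0.4], [1.0, 0.9, 0.0]]
labels = np.array([0, 1, 0])
acc = certified_accuracy(logits, labels, r=0.3)  # -> 2/3: third row's margin 0.1 < 2r
```

Sweeping `r` over a grid and plotting `certified_accuracy` against it reproduces the shape of the radius-accuracy curve, with no attack or relaxation-based verification needed.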

Denote the hypothesis set by F and its Rademacher complexity on the training data by R_n(F). We can then upper-bound the generalization error when a large-margin solution is found on the training data: Lemma A.1 (Theorem 11 in Koltchinskii et al. (2002)). For all t > 0,
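The displayed inequality of Lemma A.1 is lost in this copy. For intuition only, one standard shape of such a Rademacher margin bound is shown below; the exact constants in Koltchinskii et al. (2002, Theorem 11) differ, so this is a generic textbook-style form, not the cited statement:

```latex
% Generic margin bound (illustrative; constants differ from the cited theorem):
% with probability at least 1 - e^{-2t^2}, for every f \in \mathcal{F} and margin m > 0,
\Pr_{(x,y)\sim D}\bigl[y f(x) \le 0\bigr]
  \;\le\; \frac{1}{n}\sum_{i=1}^{n}\mathbb{1}\bigl[y_i f(x_i) \le m\bigr]
  \;+\; \frac{2}{m}\,R_n(\mathcal{F})
  \;+\; \frac{t}{\sqrt{n}}.
```

The structure is what matters for the argument here: test error is controlled by the fraction of small-margin training points plus a complexity term that Lemma A.4 bounds via the VC dimension.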

(2019):

Lemma A.2 (Theorem 6 in Bartlett et al. (2019)). Consider a fully-connected ReLU network architecture F with input dimension d, width w ≥ d and depth (number of hidden layers) L. Then its VC dimension satisfies
$$\mathrm{VCdim}(\mathcal{F}) = \tilde{O}(L^2 w^2). \tag{21}$$

Table 1: Comparison of different methods on MNIST under ℓ∞ radius 0.3.

Table 2: Comparison of different methods on Fashion-MNIST under ℓ∞ radius 0.1.

In Tables 1 and 2, we compare ℓ∞-dist Net with several baselines, i.e., standard training on a standard neural network (abbreviated as Natural in the tables), CAP

REFERENCES

Hongyang Zhang, Yaodong Yu, Jiantao Jiao, Eric P. Xing, Laurent El Ghaoui, and Michael I. Jordan. Theoretically principled trade-off between robustness and accuracy. In Kamalika Chaudhuri and Ruslan Salakhutdinov (eds.), Proceedings of the 36th International Conference on Machine Learning, 2019.

A APPENDIX

A.1 PROOF OF LEMMA 4.1

Proof. Without loss of generality we may assume K ⊆ [0, 1]^n. Consider the set S consisting of all points $\frac{\epsilon}{2}(N_1, \ldots, N_n)$ where the $N_j$ are integers; since $S \cap K$ is a finite set, we can write $S \cap K = \{w^{(1)}, \ldots, w^{(m)}\}$. For every $w^{(i)} \in S \cap K$, we define the corresponding $f_i(x)$ as
$$f_i(x) = f(w^{(i)}) - \|x - w^{(i)}\|_\infty.$$
Since $f$ is 1-Lipschitz with respect to the ℓ∞ norm, obviously $f(x) \ge f_i(x)$ for all $x \in K$, and thus $\max_i f_i(x) \le f(x)$ holds as a direct consequence. On the other hand, every $x \in K$ has a "neighbour" $w^{(j)} \in S \cap K$ such that $\|x - w^{(j)}\|_\infty \le \frac{\epsilon}{2}$; therefore, using the Lipschitz properties of both $f(x)$ and $f_j(x)$, we have
$$f_j(x) = f(w^{(j)}) - \|x - w^{(j)}\|_\infty \ge f(x) - 2\|x - w^{(j)}\|_\infty \ge f(x) - \epsilon.$$
Since $\max_i f_i(x) \ge f_j(x)$, we conclude our proof.

A.2 PROOF OF THEOREM 1

Proof. By Lemma 4.1, there exists a finite collection of functions $f_1, \ldots, f_m$ whose maximum approximates $g$, where each $f_i(x)$ has the form
$$f_i(x) = g(w^{(i)}) - \|x - w^{(i)}\|_\infty.$$
The high-level idea of the proof is very simple: with width $d_{input} + 2$, we allocate $d_{input}$ neurons in each layer to preserve the information of $x$, one neuron to compute each $f_i(x)$ one after another, and the last neuron to accumulate the maximum of the $f_i(x)$.

To simplify the proof, we first introduce three general basic maps, each realizable by a single unit, and then illustrate how to represent $\max_i f_i(x)$ by combining these basic maps.

Assume for now that any input to any unit in the whole network has its ℓ∞ norm upper-bounded by a large constant $C$; we will come back later to determine this value and prove its validity.

Lemma A.3. For every $w \in \mathbb{R}^d$, there exists a fully-connected ReLU network $h$ with width $O(d)$ and depth $O(\log d)$ such that $h(x) = \|x - w\|_\infty$.

Proof. Notice that $\|x - w\|_\infty = \max_i \max\{x_i - w_i, w_i - x_i\}$, which is a maximum of $2d$ items. Since $\max\{x, y\} = \mathrm{ReLU}(x - y) + \mathrm{ReLU}(y) - \mathrm{ReLU}(-y)$, we can use $3d$ neurons in the first hidden layer so that the inputs to the second hidden layer are the $d$ items $\max\{x_i - w_i, w_i - x_i\}$. Repeating this procedure, which halves the number of items inside the maximum, the network outputs $\|x - w\|_\infty$ within $O(\log d)$ hidden layers, as desired.

The VC bound of the ℓ∞-dist Net is formalized by the following lemma:

Lemma A.4.
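The reduction in Lemma A.3 can be simulated directly in NumPy: starting from the $2d$ items $\{x_i - w_i,\, w_i - x_i\}$ and halving with the identity $\max\{a, b\} = \mathrm{ReLU}(a-b) + \mathrm{ReLU}(b) - \mathrm{ReLU}(-b)$, we recover $\|x - w\|_\infty$ using only affine maps and ReLUs (the function name is ours; the padding for odd-length levels is a small implementation detail the proof glosses over):

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def linf_via_relu(x, w):
    """Compute ||x - w||_inf using only affine maps and ReLU,
    mirroring the halving construction of Lemma A.3."""
    items = np.concatenate([x - w, w - x])  # 2d items; their max is the norm
    while items.size > 1:
        if items.size % 2:                  # pad odd levels by duplicating the last item
            items = np.append(items, items[-1])
        a, b = items[0::2], items[1::2]
        # Elementwise max{a, b} = ReLU(a - b) + ReLU(b) - ReLU(-b),
        # since ReLU(b) - ReLU(-b) = b and ReLU(a - b) + b = max{a, b}.
        items = relu(a - b) + relu(b) - relu(-b)
    return items[0]

rng = np.random.default_rng(0)
x, w = rng.normal(size=7), rng.normal(size=7)
assert np.isclose(linf_via_relu(x, w), np.max(np.abs(x - w)))
```

Each pass through the loop corresponds to one constant number of ReLU layers, so the depth is $O(\log d)$, exactly as the lemma claims.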
Consider an ℓ∞-dist Net architecture F with input dimension d, width w ≥ d and depth (number of hidden layers) L. Then its VC dimension satisfies
$$\mathrm{VCdim}(\mathcal{F}) = \tilde{O}(L^2 w^4).$$

Proof. By Lemma A.3, each neuron in the ℓ∞-dist Net can be replaced by a fully-connected ReLU subnetwork of width $O(w)$ and depth $O(\log w)$. Therefore a fully-connected ReLU network architecture $\mathcal{G}$ with width $O(w^2)$ and depth $O(L \log w)$ can realize any function represented by the ℓ∞-dist Net as its parameters vary. Since the VC dimension is monotone under set inclusion, the ℓ∞-dist Net architecture $\mathcal{F}$ has VC dimension no more than that of $\mathcal{G}$, which equals $\tilde{O}(L^2 w^4)$ by Lemma A.2.

Finally, Theorem 2 is a direct consequence of combining Lemmas 4.2, 4.3, A.3, A.2 and A.4.
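Spelling out the last VC-dimension calculation, with log factors absorbed into the $\tilde{O}$ notation:

```latex
% The simulated ReLU network G has width O(w^2) and depth O(L \log w).
% Plugging these into Lemma A.2 (VCdim = \tilde{O}(\text{depth}^2 \cdot \text{width}^2)):
\mathrm{VCdim}(\mathcal{F}) \;\le\; \mathrm{VCdim}(\mathcal{G})
  \;=\; \tilde{O}\!\bigl((L \log w)^2 \, (w^2)^2\bigr)
  \;=\; \tilde{O}\!\bigl(L^2 w^4 \log^2 w\bigr)
  \;=\; \tilde{O}(L^2 w^4).
```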

