TOWARDS CERTIFYING ℓ∞ ROBUSTNESS USING NEURAL NETWORKS WITH ℓ∞-DIST NEURONS

Abstract

It is well known that standard neural networks, even with high classification accuracy, are vulnerable to small ℓ∞ perturbations. Many attempts have been made to learn a network that can resist such adversarial attacks. However, most previous works either only provide empirical verification of the defense against a particular attack method, or only develop a theoretical guarantee of model robustness in limited scenarios. In this paper, we develop a theoretically principled neural network that inherently resists ℓ∞ perturbations. In particular, we design a novel neuron that uses the ℓ∞ distance as its basic operation, which we call the ℓ∞-dist neuron. We show that the ℓ∞-dist neuron is naturally a 1-Lipschitz function with respect to the ℓ∞ norm, and that neural networks constructed with ℓ∞-dist neurons (ℓ∞-dist Nets) enjoy the same property. This directly provides a theoretical guarantee of certified robustness based on the margin of the prediction outputs. We further prove that ℓ∞-dist Nets have enough expressive power to approximate any 1-Lipschitz function, and generalize well, as the robust test error can be upper-bounded by the performance of a large-margin classifier on the training data. Preliminary experiments show that, even without the help of adversarial training, the learned networks with high classification accuracy are already provably robust.

1. INTRODUCTION

Modern neural networks are usually sensitive to small, adversarially chosen perturbations of the inputs (Szegedy et al., 2013; Biggio et al., 2013). Given an image x that is correctly classified by a neural network, a malicious attacker may find a small adversarial perturbation δ such that the perturbed image x + δ, though visually indistinguishable from the original image, is assigned to a wrong class with high confidence by the network. Such vulnerability creates security concerns in many real-world applications. Developing a model that can resist small ℓ∞ perturbations has been extensively studied in the literature. Adversarial training methods (Szegedy et al., 2013; Madry et al., 2017; Goodfellow et al., 2015; Huang et al., 2015; Athalye et al., 2018; Ding et al., 2020) first learn on-the-fly adversarial examples of the inputs, and then update model parameters using these perturbed samples together with the original labels. Such approaches are restricted to a particular (class of) attack method, and it cannot be formally guaranteed that the resulting model is robust against other attacks. Another line of algorithms trains robust models by maximizing the certified radius provided by robust certification methods. Randomized smoothing approaches (Cohen et al., 2019; Salman et al., 2019; Zhai et al., 2020) show that a certified guarantee against small ℓ2 perturbations can be easily computed for general Gaussian-smoothed classifiers, but recent works suggest that such methods are hard to extend to the ℓ∞-perturbation scenario.

In this work, we overcome the challenge mentioned above by introducing a new type of neural network that naturally resists local adversarial attacks and can be easily certified under ℓ∞ perturbations. In particular, we propose a novel neuron called the ℓ∞-dist neuron. Unlike the standard neuron design that applies a non-linear activation after a linear transformation, the ℓ∞-dist neuron is purely based on computing the ℓ∞ distance between the input and the parameters.
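The basic operation can be sketched in a few lines of numpy (a minimal illustration under our reading of the definition; the function names and the bias term are our own choices): a neuron with weight vector w computes u(x) = ‖x − w‖∞ + b, and a layer stacks one such neuron per row of a weight matrix.

```python
import numpy as np

def linf_dist_neuron(x, w, b=0.0):
    """A single l_inf-dist neuron: u(x) = ||x - w||_inf + b."""
    return np.max(np.abs(x - w)) + b

def linf_dist_layer(x, W, b):
    """A layer of l_inf-dist neurons; each row of W is one neuron's weights."""
    return np.max(np.abs(x[None, :] - W), axis=1) + b

# Numerical check of the 1-Lipschitz property w.r.t. the l_inf norm:
# |u(x) - u(y)| <= ||x - y||_inf for any x, y (by the reverse triangle
# inequality applied coordinate-wise under the max).
rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.normal(size=8)
y = rng.normal(size=8)
lhs = abs(linf_dist_neuron(x, w) - linf_dist_neuron(y, w))
rhs = np.max(np.abs(x - y))
assert lhs <= rhs + 1e-12
```

Since a composition of 1-Lipschitz maps is 1-Lipschitz, stacking such layers preserves the property for the whole network.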
It is straightforward to see that such a neuron is 1-Lipschitz with respect to the ℓ∞ norm, and that neural networks constructed with ℓ∞-dist neurons (ℓ∞-dist Nets) enjoy the same property. Based on this property, we can obtain certified robustness for any ℓ∞-dist Net using the margin of the prediction outputs. Theoretically, we investigate the expressive power of ℓ∞-dist Nets and their adversarially robust generalization ability. We first prove a Lipschitz-universal approximation theorem for ℓ∞-dist Nets using a structured approach, which shows that ℓ∞-dist Nets can approximate any 1-Lipschitz function with respect to the ℓ∞ norm. We then give upper bounds on the robust test error based on the Rademacher complexity, which show that the robust test error will be small if the ℓ∞-dist Net learns a large-margin classifier on the training data. Both results demonstrate the excellent capacity and generalization ability of the ℓ∞-dist Net function class.

While ℓ∞-dist Nets enjoy nice theoretical guarantees, empirically, training an ℓ∞-dist Net is not easy. For example, the gradient of the parameters under the ℓ∞ norm is sparse, which makes optimization inefficient. We show how to initialize the model parameters, apply proper normalization, and overcome the sparse gradient problem via smoothed approximate gradients. Preliminary experiments on MNIST and Fashion-MNIST show that even without using adversarial training, the learned networks are already provably robust.

Our contributions are summarized as follows:

• We propose a novel neural network using ℓ∞-dist neurons, called ℓ∞-dist Nets. Theoretically,
  - In Section 3, we show that ℓ∞-dist Nets are 1-Lipschitz with respect to the ℓ∞ norm by nature, which directly guarantees the certified robustness of any ℓ∞-dist Net (with respect to the ℓ∞ norm).
  - In Section 4.1, we prove that ℓ∞-dist Nets have good expressive power, as they can approximate any 1-Lipschitz function with respect to the ℓ∞ norm.
  - In Section 4.2, we prove that ℓ∞-dist Nets have good generalization ability, as the robust test error can be upper-bounded by the performance of a large-margin classifier on the training data.

• We provide several implementation strategies that are shown to be practically helpful for model training (in Section 5).
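The margin-based certificate mentioned above can be made concrete in a short sketch (the function name is ours; we only assume the network is 1-Lipschitz w.r.t. ℓ∞ in every output coordinate): an ℓ∞ perturbation of size ε can move each logit by at most ε, so the top prediction cannot flip as long as 2ε is smaller than the gap between the two largest logits.

```python
import numpy as np

def certified_radius(logits):
    """Certified l_inf radius for a classifier whose every logit is
    1-Lipschitz w.r.t. the l_inf norm of the input.

    A perturbation of l_inf size eps lowers the top logit by at most eps
    and raises the runner-up by at most eps, so the prediction is
    unchanged whenever 2 * eps < (top1 - top2).
    """
    logits = np.asarray(logits, dtype=float)
    top2 = np.sort(logits)[-2:]          # two largest logits
    margin = top2[1] - top2[0]
    return margin / 2.0
```

For example, hypothetical logits (5.0, 2.0, 1.0) have margin 3.0 and hence a certified radius of 1.5; certification thus reduces to a single forward pass, with no relaxation or search.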

2. RELATED WORK

There are two major lines of work seeking to obtain robust neural networks.

Robust Training Approaches. Previous works showed that conventional neural networks learned using standard training approaches (e.g., maximum likelihood) are sensitive to small adversarial perturbations, and significant effort has been put into developing training approaches for learning robust models. Adversarial training is the most successful method against adversarial attacks. By adding adversarial examples to the training set on the fly, adversarial training methods (Szegedy et al., 2013; Goodfellow et al., 2015; Huang et al., 2015; Zhang et al., 2019a; Wong et al., 2020) can significantly improve the robustness of conventional neural networks. However, all the methods above are evaluated according to the empirical robust accuracy against pre-defined adversarial attack algorithms, such as projected gradient descent. It cannot be formally guaranteed that the resulting model is also robust against other attacks.

Certified Robustness. Many recent works focus on certifying the robustness of learned neural networks under any attack. Approaches based on bounding the certified radius layer by layer using convex relaxation methods have been proposed for certifying the robustness of neural networks (Wong & Kolter, 2018b; Gowal et al., 2018; Mirman et al., 2018; Dvijotham et al., 2018; Raghunathan et al., 2018; Zhang et al., 2020). However, such approaches are usually computationally expensive and have difficulty scaling to deep models. More recently, researchers found a new approach called randomized smoothing that can provide a certified robustness guarantee for general models.



Weng et al. (2018); Wong & Kolter (2018a); Zhang et al. (2018); Mirman et al. (2018); Wang et al. (2018); Gowal et al. (2018); Zhang et al. (2019b) develop their methods based on linear or convex relaxations of fully connected ReLU networks. However, these certification methods are usually computationally expensive and can only handle ReLU activations.

Lecuyer et al. (2018); Li et al. (2018); Cohen et al. (2019); Salman et al. (2019); Zhai et al. (2020) showed that if Gaussian random noise is added to the input, the resulting smoothed classifier enjoys a certified robustness guarantee against ℓ2 perturbations.
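For reference, the certified ℓ2 radius of a Gaussian-smoothed classifier from Cohen et al. (2019) can be computed in a few lines using only the Python standard library (a sketch; here p_a is a lower bound on the smoothed classifier's top-class probability and p_b an upper bound on the runner-up probability):

```python
from statistics import NormalDist

def smoothing_certified_radius(p_a, p_b, sigma):
    """Certified l2 radius R = sigma/2 * (Phi^{-1}(p_a) - Phi^{-1}(p_b))
    for a classifier smoothed with Gaussian noise N(0, sigma^2 I)
    (Cohen et al., 2019)."""
    phi_inv = NormalDist().inv_cdf  # standard normal quantile function
    return sigma / 2.0 * (phi_inv(p_a) - phi_inv(p_b))
```

Note that this guarantee is tied to the ℓ2 geometry of the Gaussian, which is precisely why such certificates are hard to transfer to the ℓ∞ setting that this paper targets.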

