TOWARDS CERTIFYING ℓ∞ ROBUSTNESS USING NEURAL NETWORKS WITH ℓ∞-DIST NEURONS

Abstract

It is well-known that standard neural networks, even those with high classification accuracy, are vulnerable to small ℓ∞ perturbations. Many attempts have been made to learn a network that can resist such adversarial attacks. However, most previous works either can only provide empirical verification of the defense against a particular attack method, or can only develop a theoretical guarantee of model robustness in limited scenarios. In this paper, we develop a theoretically principled neural network that inherently resists ℓ∞ perturbations. In particular, we design a novel neuron that uses the ℓ∞ distance as its basic operation, which we call the ℓ∞-dist neuron. We show that the ℓ∞-dist neuron is naturally a 1-Lipschitz function with respect to the ℓ∞ norm, and that neural networks constructed with ℓ∞-dist neurons (ℓ∞-dist Nets) enjoy the same property. This directly provides a theoretical guarantee of certified robustness based on the margin of the prediction outputs. We further prove that ℓ∞-dist Nets have enough expressive power to approximate any 1-Lipschitz function, and can generalize well, as the robust test error can be upper-bounded by the performance of a large-margin classifier on the training data. Preliminary experiments show that, even without the help of adversarial training, the learned networks achieve high classification accuracy while already being provably robust.
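The basic operation of the ℓ∞-dist neuron, and the 1-Lipschitz property the abstract refers to, can be sketched in a few lines. This is a minimal illustration under our own naming conventions (`linf_dist_neuron`, `w`, `b` are assumptions for exposition, not the paper's implementation):

```python
# Sketch of an l_inf-dist neuron: u(x) = ||x - w||_inf + b,
# where w is the parameter vector and b a bias.
# Such a neuron is 1-Lipschitz w.r.t. the l_inf norm:
#   |u(x) - u(x')| <= ||x - x'||_inf  for all x, x'.

def linf_dist_neuron(x, w, b=0.0):
    """l_inf distance between input x and parameter w, plus bias b."""
    return max(abs(xi - wi) for xi, wi in zip(x, w)) + b

def linf_norm(v):
    return max(abs(vi) for vi in v)

# Numerical check of the 1-Lipschitz property on a small example
# (values chosen to be exactly representable in binary floating point).
w = [0.5, -1.0, 2.0]
x = [1.0, 0.0, 1.5]
x_perturbed = [1.25, -0.25, 1.5]

lhs = abs(linf_dist_neuron(x, w) - linf_dist_neuron(x_perturbed, w))
rhs = linf_norm([a - c for a, c in zip(x, x_perturbed)])
assert lhs <= rhs  # |u(x) - u(x')| <= ||x - x'||_inf
```

Since a composition of 1-Lipschitz maps is itself 1-Lipschitz, a network built entirely from such neurons inherits the property; for a multi-class network of this kind, no perturbation with ℓ∞ norm smaller than half the prediction margin can change the predicted class, which is the source of the margin-based certificate described above.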

1. INTRODUCTION

Modern neural networks are usually sensitive to small, adversarially chosen perturbations of their inputs (Szegedy et al., 2013; Biggio et al., 2013). Given an image x that is correctly classified by a neural network, a malicious attacker may find a small adversarial perturbation δ such that the perturbed image x + δ, though visually indistinguishable from the original, is assigned to a wrong class with high confidence by the network. Such vulnerability creates security concerns in many real-world applications. Developing models that can resist small ℓ∞ perturbations has been extensively studied in the literature. Adversarial training methods (Szegedy et al., 2013; Madry et al., 2017; Goodfellow et al., 2015; Huang et al., 2015; Athalye et al., 2018; Ding et al., 2020) first generate on-the-fly adversarial examples of the inputs, and then update model parameters using these perturbed samples together with the original labels. Such approaches are tailored to a particular (class of) attack method, and there is no formal guarantee that the resulting model is robust against other attacks. Another line of algorithms trains robust models by maximizing the certified radius provided by robust certification methods. Weng et al. (2018); Wong & Kolter (2018a); Zhang et al. (2018); Mirman et al. (2018); Wang et al. (2018); Gowal et al. (2018); Zhang et al. (2019b) develop their methods based on linear or convex relaxations of fully connected ReLU networks. However, these certification methods are usually computationally expensive and can only handle ReLU activations. Cohen et al. (2019); Salman et al. (2019); Zhai et al. (2020) show that a certified guarantee against small ℓ2 perturbations can be easily computed for general Gaussian smoothed classifiers, but recent works suggest that such methods are hard to extend to the ℓ∞-perturbation scenario.

In this work, we overcome the challenges mentioned above by introducing a new type of neural network that naturally resists local adversarial attacks and can be easily certified under ℓ∞ perturbations. In particular, we propose a novel neuron called the ℓ∞-dist neuron. Unlike the standard neuron design that applies a non-linear activation to a linear transformation, the ℓ∞-dist neuron is purely based on computing the ℓ∞ distance between the inputs and the parameters. It is straightfor-

