ROBUSTNESS EVALUATION USING LOCAL SUBSTITUTE NETWORKS

Abstract

Robustness of a neural network against adversarial examples is an important topic when a deep classifier is applied in safety-critical use cases such as health care or autonomous driving. To assess robustness, practitioners use a range of tools, from adversarial attacks to the exact computation of the distance to the decision boundary. We exploit the fact that the robustness of a neural network is a local property and empirically show that computing the same metrics on smaller local substitute networks yields good estimates of the robustness at lower cost. To construct the substitute network, we develop two pruning techniques that preserve the local properties of the initial network around a given anchor point. Our experiments on the MNIST dataset show that this approach saves a significant amount of computing time and is especially beneficial for larger models.



Figure 1: A toy example in the two-dimensional setting. Panels: (a) global view, before pruning; (b) local view, before pruning; (c) global, 12.5% pruned; (d) local, 12.5% pruned; (e) global, 50% pruned; (f) local, 50% pruned; (g) global, 100% pruned; (h) local, 100% pruned. We show the boundaries of a classifier before (1a and 1b) and after pruning (1c-1h), when up to 100% of the hidden neurons are removed. The sample we are interested in is marked by the black point, and the square around it on the plots with the global view (1a, 1c, 1e and 1g) shows the local region that is depicted on the other four plots (1b, 1d, 1f and 1h). While the global behaviour changes a lot, the boundary around the chosen anchor point remains similar. Most importantly, the distance to the closest adversarial example from the anchor point (shown by the black line on plots 1b, 1d, 1f and 1h) does not change significantly. Note that this applies even in the extreme case when we prune all the hidden layers and what remains is a linear classifier (1g and 1h). Thus, instead of solving the complex task of finding the distance to the decision boundary for the initial model, we can work with the simple local substitute at lower cost and still obtain a good approximation of the exact solution.

1. INTRODUCTION

The impressive success of neural networks in a variety of complicated tasks makes them irreplaceable for practitioners in spite of their known flaws. One of the problems that continuously gains attention is the robustness of deep neural classifiers. While multiple notions of robustness exist, depending on the use case, we consider the most basic concept: robustness against adversarial examples, i.e., small perturbations of correctly classified samples that lead to a false prediction. The presence of adversarial examples severely limits the application of neural networks in safety-critical tasks like health care and autonomous driving, where the data is collected from sensors and it is not acceptable that the same image, for example a road sign, is classified differently depending on signal noise. While this problem is widely known, formal robustness verification methods either do not allow for an assessment of the classifier's robustness when the network is large or require specific modifications to the network's architecture or training procedure. In fact, Katz et al. (2017) show that the exact verification task for ReLU classifiers is NP-complete. Therefore, constructing adversarial attacks and measuring the magnitude of the perturbation required to change the prediction is still one of the most popular ways to estimate a network's robustness. The farther away the adversarial point is from the initial sample, the more robust behaviour we expect from the network around this point. Unfortunately, the distance to an adversarial point provides only an upper bound on the distance to the decision boundary. Formal verification methods, on the other hand, output a lower bound on that value by certifying a region around the sample as adversarial-free.
In this work we develop a novel inexact robustness assessment method that utilizes both techniques, as well as the fact that the robustness of a network against adversarial perturbations around a given sample is a local property. That is, it does not depend on the behaviour of the network outside of the sample's neighborhood: two networks with similar decision boundaries around the same anchor point must have similar robustness properties, even if they behave completely differently away from that neighborhood. Based on these observations we

1. develop a novel method to assess the robustness of deep neural classifiers based on the local nature of the robustness properties,

2. develop two pruning techniques that remove non-linear activation functions to reduce the complexity of the verification task while preserving the local behaviour of the network as much as possible: one based on a bound propagation technique that replaces specific activations by a constant value, and one that preserves the output of the initial network for one adversarial point and replaces activation functions by linear ones,

3. empirically verify that the robustness metrics computed on the pruned substitute networks are good estimates of the robustness of the initial network, by conducting experiments on the MNIST dataset with convolutional networks of different sizes.

In Figure 1 we show an example of the difference in the decision boundary before and after we apply one of the proposed pruning techniques in a two-dimensional setting. We remove up to all of the hidden neurons while retaining the important properties of the decision boundary locally around the base point.

This work is organized as follows. In Section 2 we introduce the necessary notation and formalize the context of our analysis. In Section 3 we develop both pruning techniques and explain how we put the focus on the local neighborhood around the base sample.
Further, in Section 4 we set up the experimentation workflow and show the results. In Section 5 we discuss the relevant related work and, finally, in Section 6 we draw conclusions and outline directions for future research.
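The bound-propagation idea behind the first pruning technique can be illustrated with a common related construction: propagate interval bounds over a small box around the anchor point and identify ReLUs that are stable on the whole box, so their non-linearity can be removed. This is a minimal sketch under that assumption, with an invented toy layer; it is not the paper's exact algorithm.

```python
import numpy as np

# Sketch: interval bound propagation (IBP) over an l_inf box around an
# anchor point.  A ReLU whose pre-activation lower bound is >= 0 acts as
# the identity on the whole box; one whose upper bound is <= 0 is the
# constant 0.  Only the remaining "unstable" ReLUs must stay non-linear.

def ibp_bounds(W, bias, lo, hi):
    """Propagate the box [lo, hi] through the affine layer W x + bias."""
    mid, rad = (lo + hi) / 2, (hi - lo) / 2
    center = W @ mid + bias
    radius = np.abs(W) @ rad      # worst-case deviation per output unit
    return center - radius, center + radius

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(8, 2)), rng.normal(size=8)   # toy hidden layer

x0 = np.array([0.5, -0.3])        # anchor point
eps = 0.05                        # radius of the local region of interest
lo1, hi1 = ibp_bounds(W1, b1, x0 - eps, x0 + eps)

always_on  = lo1 >= 0             # replace ReLU by the identity
always_off = hi1 <= 0             # replace ReLU by the constant 0
unstable   = ~(always_on | always_off)

print(f"{always_on.sum()} identity, {always_off.sum()} zeroed, "
      f"{unstable.sum()} ReLUs kept")
```

The smaller eps is, the tighter the bounds become and the more activations can be pruned, which matches the intuition that the substitute only needs to be faithful locally.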

2. NOTATION

The general idea as well as our pruning methods are applicable to any type of deep classifier; constraints arise only from the deployed attack, verification, and bound propagation approaches. However, to allow for a simpler comparison with existing attacks and verification techniques, we develop our analysis for the classification networks that Li et al. (2023) use for their comprehensive overview and toolbox of robustness verification approaches.

