NETWORK ROBUSTNESS TO PCA PERTURBATIONS

Abstract

A key challenge in analyzing neural networks' robustness is identifying input features for which networks are robust to perturbations. Existing work focuses on direct perturbations to the inputs, and thereby studies network robustness to the lowest-level features. In this work, we take a new approach and study the robustness of networks to the inputs' semantic features. We present a black-box approach to determine features for which a network is robust or weak. We leverage these features to obtain provably robust neighborhoods defined using robust features and adversarial examples defined by perturbing weak features. We evaluate our approach with PCA features. We show that (1) our provably robust neighborhoods are larger than the standard neighborhoods, by 1.5x on average and up to 4.5x, and (2) our adversarial examples are generated using at least 12.2x fewer queries and have at least 2.8x lower L2 distortion than the state-of-the-art. We further show that our attack is effective even against ensemble adversarial training.

1. INTRODUCTION

The reliability of deep neural networks (DNNs) has been undermined by adversarial examples: small perturbations to inputs that deceive the network (e.g., Goodfellow et al. (2015)). A key step in recovering DNN reliability is identifying input features for which the network is robust. Existing work focuses on the input values, the lowest-level features, to evaluate network robustness. For example, many works analyze networks' robustness to neighborhoods consisting of all inputs at a certain distance from a given input (e.g., Boopathy et al. (2019); Katz et al. (2017); Salman et al. (2019); Singh et al. (2019a); Tjeng et al. (2019); Wang et al. (2018)). Despite the variety of approaches introduced to analyze robustness, the diameter (controlling the neighborhood size) of the provably robust neighborhoods is often very small. This may suggest an inherent barrier to the robustness of DNNs to distance-based neighborhoods. To illustrate, consider Figure 1(a) and Figure 1(b), which are visibly the same but in fact each of their pixels differs by ε = 0.026. That is the maximal ε for which the L∞ ball B_ε(x) (where x is Figure 1(a)) was proven robust by ERAN (Singh et al., 2018; Gehr et al., 2018), a state-of-the-art robustness analyzer.
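To make the distance-based notion concrete, the following sketch (our own illustration, not code from the paper) samples a perturbed image from the L∞ ball B_ε(x): every pixel may move by at most ε from the original. The image and ε are toy values; only the radius ε = 0.026 is taken from the text.

```python
import numpy as np

def linf_ball_sample(x, eps, rng=None):
    """Sample an image from the L-infinity ball B_eps(x):
    each pixel is shifted by at most eps, then clipped to [0, 1]."""
    rng = np.random.default_rng() if rng is None else rng
    delta = rng.uniform(-eps, eps, size=x.shape)
    return np.clip(x + delta, 0.0, 1.0)

x = np.full((28, 28), 0.5)   # toy grayscale image (hypothetical data)
eps = 0.026                  # the maximal certified radius cited in the text
x_pert = linf_ball_sample(x, eps)
# By construction, x_pert stays within eps of x in every pixel.
assert np.max(np.abs(x_pert - x)) <= eps
```

The point of the illustration is how tight such a ball is: at ε = 0.026 no pixel can change by more than about 2.6% of the intensity range, which is why Figure 1(a) and Figure 1(b) are visibly identical.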

Feature-defined neighborhoods

We propose to analyze network robustness to perturbations of high-level input features. A small perturbation to a feature translates to changes in multiple input entries (e.g., image pixels) and as such may produce visible perturbations. To illustrate, consider a neighborhood around Figure 1(a) in which only the background pixels can change their color. It turns out that, for this neighborhood, ERAN (the same robustness analyzer) is able to prove robust a neighborhood containing 10^672x more images. Figure 1(c) shows a maximally perturbed image in this neighborhood, and Figure 1(d) illustrates two other images in it. These images are visibly different from Figure 1(a). Proving such neighborhoods robust, for many inputs, can suggest that the network is robust to background color perturbations, thereby providing insight into the patterns the network learned.

Key idea: robust features. An inherent challenge in finding robust feature-defined neighborhoods is automatically finding good candidate features (e.g., background color). Part of this challenge stems from the substantial running time of any robustness analyzer on a single neighborhood, which makes a brute-force search over feature-defined neighborhoods, for a large number of features and inputs, futile. We propose a sampling approach to identify features which are likely to be robust for many inputs. We call these robust features. We experimentally observe that our robust features generalize to unseen inputs, even though they were determined from a (small) set of inputs.
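Since the evaluation uses PCA features, one concrete way to realize a feature perturbation is to shift a single PCA coefficient of an input and reconstruct it. The sketch below is our own illustration of that idea on made-up toy data (the dataset, the helper names, and the perturbation size are all assumptions, not part of the paper); a single-coefficient shift changes many pixels at once, exactly the high-level behavior described above.

```python
import numpy as np

def pca_basis(X):
    """Compute the PCA basis of flattened inputs X (shape: n_samples x dim).
    Returns the data mean and the principal directions (rows of Vt)."""
    mean = X.mean(axis=0)
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt

def perturb_pca_feature(x, mean, Vt, k, delta):
    """Shift the k-th PCA coefficient of input x by delta and reconstruct.
    This perturbs one high-level feature, touching many pixels at once."""
    coeffs = Vt @ (x - mean)   # project onto the PCA basis
    coeffs[k] += delta         # perturb a single feature
    return mean + Vt.T @ coeffs

rng = np.random.default_rng(0)
X = rng.random((100, 64))      # toy dataset: 100 flattened 8x8 "images"
mean, Vt = pca_basis(X)
x = X[0]
x_pert = perturb_pca_feature(x, mean, Vt, k=0, delta=0.1)
```

With a full-rank basis, the reconstruction changes only the chosen coefficient: projecting x_pert back onto the basis recovers the original coefficients plus delta in position k, while the perturbation in pixel space, delta * Vt[k], is spread across many entries.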




