NETWORK ROBUSTNESS TO PCA PERTURBATIONS

Abstract

A key challenge in analyzing neural networks' robustness is identifying input features for which networks are robust to perturbations. Existing work focuses on direct perturbations to the inputs, and thereby studies network robustness only to the lowest-level features. In this work, we take a new approach and study the robustness of networks to the inputs' semantic features. We present a black-box approach to determine features for which a network is robust or weak. We leverage these features to obtain provably robust neighborhoods defined using robust features and adversarial examples defined by perturbing weak features. We evaluate our approach with PCA features. We show that (1) our provably robust neighborhoods are larger, on average by 1.5x and up to 4.5x, compared to the standard neighborhoods, and (2) our adversarial examples are generated using at least 12.2x fewer queries and have at least 2.8x lower L2 distortion compared to state-of-the-art. We further show that our attack is effective even against ensemble adversarial training.

1. INTRODUCTION

The reliability of deep neural networks (DNNs) has been undermined by adversarial examples: small perturbations to inputs that deceive the network (e.g., Goodfellow et al. (2015)). A key step in recovering DNN reliability is identifying input features for which the network is robust. Existing work focuses on the input values, the lowest-level features, to evaluate network robustness. For example, many works analyze networks' robustness to neighborhoods consisting of all inputs within a certain distance from a given input (e.g., Boopathy et al. (2019); Katz et al. (2017); Salman et al. (2019); Singh et al. (2019a); Tjeng et al. (2019); Wang et al. (2018)). Despite the variety of approaches introduced to analyze robustness, the diameter (controlling the neighborhood size) of the provably robust neighborhoods is often very small. This may suggest an inherent barrier to the robustness of DNNs in distance-based neighborhoods. To illustrate, consider Figure 1(a) and Figure 1(b), which are visibly identical but in fact differ in every pixel by ε = 0.026. This is the maximal ε for which the L∞ ball B_ε(x) (where x is Figure 1(a)) was proven robust by ERAN (Singh et al., 2018; Gehr et al., 2018), a state-of-the-art robustness analyzer.

Feature-defined neighborhoods. We propose to analyze network robustness to perturbations of high-level input features. A small perturbation to a feature translates to changes in multiple input entries (e.g., image pixels) and as such may produce visible perturbations. To illustrate, consider a neighborhood around Figure 1(a) in which only the background pixels can change their color. It turns out that, for this neighborhood, ERAN (the same robustness analyzer) is able to prove robust a neighborhood which contains 10^672x more images. Figure 1(c) shows a maximally perturbed image in this neighborhood, and Figure 1(d) illustrates two other images in it. These images are visibly different from Figure 1(a).
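Such a background-only neighborhood can be sketched as per-pixel interval bounds: background pixels may vary within ±δ, while all other pixels stay fixed. The following is a minimal illustration of this idea (not ERAN's actual input format); `background_neighborhood`, `background_mask`, and `delta` are hypothetical names introduced here.

```python
import numpy as np

def background_neighborhood(x, background_mask, delta):
    """Per-pixel lower/upper bounds: background pixels may move by
    +/- delta (clipped to [0, 1]); foreground pixels are fixed."""
    lower = np.where(background_mask, np.clip(x - delta, 0.0, 1.0), x)
    upper = np.where(background_mask, np.clip(x + delta, 0.0, 1.0), x)
    return lower, upper

x = np.full((4, 4), 0.5)          # toy 4x4 grayscale image
mask = np.zeros((4, 4), dtype=bool)
mask[0, :] = True                 # treat the top row as "background"
lo, hi = background_neighborhood(x, mask, delta=0.3)
```

A verifier would then be asked to prove that every image between `lo` and `hi` receives the same classification, rather than an L∞ ball around the whole image.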
Proving such neighborhoods robust for many inputs can suggest that the network is robust to background color perturbations, thereby providing insight into the patterns the network has learned.

Key idea: robust features. An inherent challenge in finding robust feature-defined neighborhoods is automatically finding good candidate features (e.g., background color). Part of this challenge stems from the substantial running time of any robustness analyzer on a single neighborhood, which makes a brute-force search over feature-defined neighborhoods for a large number of features and inputs futile. We propose a sampling approach to identify features which are likely to be robust for many inputs. We call these robust features. We experimentally observe that our robust features generalize to unseen inputs, even though they were determined from a (small) set of inputs.

Feature-guided adversarial examples. Dually to robust features, we define and identify weak features. We show how to exploit them in a simple yet effective black-box attack. Our attack perturbs a given input based on the weak features and then greedily reduces the number of modified pixels to obtain an adversarial example which minimizes the L2 distance from the original input. Figure 1 illustrates our attack on ImageNet with the Inception-V3 architecture (Szegedy et al., 2016).

PCA features. We obtain an initial set of features by running principal component analysis (PCA). PCA provides an automatic way to extract useful features from a dataset. For example, the sky color feature of the airplane in Figure 1(a) is in fact the second PCA dimension of this class of images in CIFAR10. Figure 1(h) shows the effect of perturbing this feature by small constants (multiplied by δ = 5) in the PCA domain and projecting it back to the image domain. Our choice of PCA is inspired by earlier work.
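The kind of PCA-domain perturbation described above can be sketched with scikit-learn: shift a single PCA coefficient and project back to the input space. This is a minimal sketch on synthetic data, not the paper's pipeline; the component index `k` and shift `delta` are illustrative stand-ins for the "sky color" dimension and δ.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 64))        # stand-in for flattened images of one class
pca = PCA(n_components=10).fit(X)

def perturb_feature(x, k, delta):
    """Shift the k-th PCA coefficient of x by delta, then map back."""
    z = pca.transform(x.reshape(1, -1))   # image -> PCA coordinates
    z[0, k] += delta                      # perturb one semantic feature
    return pca.inverse_transform(z)[0]    # PCA coordinates -> image

x_pert = perturb_feature(X[0], k=1, delta=5.0)
```

Because `inverse_transform` is affine, shifting coefficient `k` by δ moves the reconstructed image by exactly δ times the k-th principal component, which is why a single-feature perturbation changes many pixels at once.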
We hypothesized that DNNs may learn (some of) the PCA features, relying on works showing that PCA can capture semantic features of the dataset (e.g., Zhoua et al. (2008); Jolliffe (2002)) and that DNNs learn semantic features (e.g., Zeiler & Fergus (2014)). Further, PCA has been linked to adversarial examples: several works showed how to detect adversarial examples using PCA (Hendrycks & Gimpel, 2017; Jere et al., 2019; Li & Li, 2017), others utilized PCA to approximate an adversarial manifold (Zhang et al., 2020), and others constructed an attack that modifies only the first PCA dimensions (Carlini & Wagner, 2017).

Computing the exact PCA dimensions requires perfect knowledge of the dataset and is time-consuming for large datasets (e.g., ImageNet). We show that an approximation of the PCA dimensions is sufficient to obtain robust and weak features, and that these can be computed from a small subset of the dataset (not used for training). The assumption that the attacker has access to a small dataset with a distribution similar to the training set is often valid in practice (e.g., for traffic sign recognition benchmarks, like GTSRB, or face recognition applications).

We evaluate our approach on six datasets and various network architectures. Results indicate that (1) compared to the standard neighborhoods, our provably robust feature-guided neighborhoods have larger volume, on average by 1.5x and up to 4.5x, and they contain 10^79x more images (the average is taken over the exponent), and (2) our adversarial examples are generated using at least 12.2x fewer queries and have at least 2.8x lower L2 distortion compared to state-of-the-art practical black-box attacks. We also show that our attack is effective even against ensemble adversarial training. To conclude, our main contributions are:




Figure 1: Network robustness to feature perturbations. (a) An image x from CIFAR10. (b) A maximally perturbed input from the maximal L∞ neighborhood B_ε(x) proven robust by ERAN. (c) A maximally perturbed input (sky color is blue instead of gray) from the maximal feature-defined neighborhood N_δ^sky_color(x) proven robust by ERAN.

Figure 1(e) shows a traffic light image (correctly classified), Figure 1(f) shows our feature-guided adversarial example (classified as a crane), and Figure 1(g) visualizes the difference between the two images. We experimentally show that our attack is competitive with state-of-the-art practical black-box attacks (AutoZoom by Tu et al. (2019) and GenAttack by Alzantot et al. (2019)) and fools the ensemble adversarial training defense of Tramèr et al. (2018). Our results strengthen the claim of Ilyas et al. (2019), who suggested that weak (non-robust) features of the data contribute to the existence of adversarial examples. However, unlike Ilyas et al. (2019), we do not focus on features that stem from the DNN. This allows us to expose interesting patterns, expressed via simple functions, that the DNN has learned or missed.
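The greedy L2-reduction step of the attack, described earlier, can be sketched as follows. This is a minimal illustration, not the paper's implementation: `predict` is a toy stand-in for the black-box classifier, and the candidate `x_adv` plays the role of an input already perturbed along weak features.

```python
import numpy as np

def predict(x):
    # stand-in black-box classifier: class 1 iff the pixel sum exceeds 8
    return int(x.sum() > 8.0)

def greedy_reduce(x_orig, x_adv, adv_label):
    """Revert modified pixels one at a time (largest perturbation first),
    keeping a reversion only if the example still gets the adversarial
    label; this shrinks the L2 distance to the original input."""
    x = x_adv.copy()
    for i in np.argsort(-np.abs(x_adv - x_orig).ravel()):
        old = x.flat[i]
        x.flat[i] = x_orig.flat[i]        # try restoring the original pixel
        if predict(x) != adv_label:       # reverting broke the attack,
            x.flat[i] = old               # so undo it
    return x

x_orig = np.zeros((3, 3))                  # toy "image", classified as 0
x_adv = 0.25 * np.arange(9).reshape(3, 3)  # adversarial candidate (sum = 9)
x_red = greedy_reduce(x_orig, x_adv, adv_label=1)
```

Each reversion costs one query to the black-box model, which is why starting from a good feature-guided candidate keeps the total query count low.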

