THE SHAPE AND SIMPLICITY BIASES OF ADVERSARIALLY ROBUST IMAGENET-TRAINED CNNS

Abstract

Adversarial training has been the topic of dozens of studies and a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially-robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across the AlexNet, GoogLeNet, and ResNet-50 architectures. We find that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of "robustifying" the network. That is, each convolutional neuron in an R network often changes to detect (1) pixel-wise smoother patterns, i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) lower-level features, i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the mechanisms that make networks more adversarially robust and also explain some recent findings, e.g. why R networks benefit from much larger capacity (Xie & Yuille, 2020) and can act as a strong image prior in image synthesis (Santurkar et al., 2019).

1. INTRODUCTION

Despite excellent test-set performance, deep neural networks often fail to generalize to out-of-distribution (OOD) examples (Nguyen et al., 2015), including "adversarial examples", i.e. modified inputs that are imperceptibly different from the real data but change predicted labels entirely (Szegedy et al., 2014). Importantly, adversarial examples can transfer between models and cause unseen machine learning (ML) models to misbehave (Papernot et al., 2017), threatening the security and reliability of ML applications (Akhtar & Mian, 2018). Adversarial training, i.e. teaching a classifier to correctly label adversarial examples (instead of only real data), has been a leading method for defending against adversarial attacks and was the most effective among the ICLR 2018 defenses evaluated by Athalye et al. (2018). Besides improved performance on adversarial examples, test-set accuracy can also be improved, for some architectures, when real images are properly incorporated into adversarial training (Xie et al., 2020). It is therefore important to study how standard adversarial training (Madry et al., 2018) changes the hidden representations and generalization capabilities of neural networks. On smaller datasets, Zhang & Zhu (2019) found that adversarially-robust networks (hereafter, R networks) rely heavily on shapes (instead of textures) to classify images. Intuitively, training on pixel-wise noisy images would encourage R networks to focus less on local statistics (e.g. textures) and instead harness global features (e.g. shapes) more. However, an important, open question is: Q1: On ImageNet, do R networks still prefer shapes over textures? It remains unknown whether such shape preference carries over to the large-scale ImageNet dataset (Russakovsky et al., 2015), which often induces a large texture bias into networks (Geirhos et al., 2019), e.g. to separate the ∼150 four-legged species in ImageNet.
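For intuition, an adversarial example under an L∞ budget perturbs each input dimension by at most ε in the direction that increases the classifier's loss. Below is a minimal numpy sketch under simplifying assumptions: a hypothetical linear classifier with a margin-style loss, so the input gradient is analytic (real attacks compute the gradient through a CNN via autograd). All names and values here are illustrative.

```python
import numpy as np

# Hypothetical linear "classifier": score = w . x, with surrogate loss
# L = -y * score for a label y in {-1, +1}. Its input gradient is -y * w,
# so the one-step sign perturbation is eps * sign(grad).
rng = np.random.default_rng(0)
w = rng.normal(size=8)      # fixed classifier weights (illustrative)
x = rng.normal(size=8)      # clean input
y = 1.0                     # ground-truth label
eps = 0.05                  # L-infinity perturbation budget

grad_loss_x = -y * w                    # analytic gradient of L w.r.t. x
x_adv = x + eps * np.sign(grad_loss_x)  # adversarial example

clean_score = float(w @ x)
adv_score = float(w @ x_adv)
# Each coordinate moves by exactly eps, yet for y = +1 the score drops
# by eps * ||w||_1 -- a small pixel-wise change, a large output change.
assert np.all(np.abs(x_adv - x) <= eps + 1e-12)
assert adv_score < clean_score
```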
Also, the shape-bias hypothesis suggested by Zhang & Zhu (2019) seems to contradict recent findings that R networks on ImageNet act as a strong texture prior, i.e. they can be successfully used for many image translation tasks without any extra image prior (Santurkar et al., 2019). The above discussion leads to a follow-up question: Q2: If an R network has a stronger preference for shapes than standard ImageNet networks (hereafter, S networks), will it perform better on OOD distorted images? Networks trained to be more shape-biased can generalize better to many unseen ImageNet-C (Hendrycks & Dietterich, 2019) image corruptions than S networks, which have a strong texture bias (Brendel & Bethge, 2019). In contrast, there is also evidence that classifiers trained on one type of images often do not generalize well to others (Geirhos et al., 2018; Nguyen et al., 2015; Kang et al., 2019). Importantly, R networks often underperform S networks on the original test sets (Tsipras et al., 2019), perhaps due to an inherent accuracy-robustness trade-off (Madry et al., 2018), a mismatch between the real and adversarial distributions (Xie et al., 2020), or a limitation in architectures: AdvProp improves the performance of EfficientNets but not ResNets (Xie et al., 2020). Most previous work aimed at understanding the behaviors of R classifiers as a function, i.e. their input-output behavior; little is known about the internal characteristics of R networks and, furthermore, their connections to the shape bias and generalization performance. Here, we ask: Q3: How did adversarial training change the hidden neural representations to make classifiers more shape-biased and adversarially robust?
In this paper, we harness common benchmarks in ML interpretability and neuroscience, namely cue-conflict (Geirhos et al., 2019), NetDissect (Bau et al., 2017), and ImageNet-C, to answer the three questions above via a systematic study across three convolutional architectures: AlexNet (Krizhevsky et al., 2012), GoogLeNet (Szegedy et al., 2015), and ResNet-50 (He et al., 2016), all trained to perform image classification on the large-scale ImageNet dataset (Russakovsky et al., 2015). Our main findings include:

1. R classifiers trained on ImageNet prefer shapes over textures ∼67% of the time (Sec. 3.1), a stark contrast to S classifiers, which use shapes only ∼25% of the time.
2. Consistent with this strong shape bias, R classifiers interestingly outperform their S counterparts on texture-less, distorted images (stylized and silhouetted images) (Sec. 3.2.2).
3. Adversarial training makes R networks more robust by (1) blocking pixel-wise input noise via smooth filters (Sec. 3.3.1); and (2) narrowing the range of inputs that highly activate neurons to simpler patterns, effectively reducing the space of adversarial inputs (Sec. 3.3.2).
4. Units that detect texture patterns (according to NetDissect) are not only useful to texture-based recognition, as expected, but can also be highly useful to shape-based recognition (Sec. 3.4). By aligning the NetDissect and cue-conflict frameworks, we found that hidden neurons in R networks are surprisingly neither strongly shape-biased nor texture-biased, but are instead generalists that detect low-level features (Sec. 3.4).
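The shape-vs-texture preference in finding 1 above is typically quantified on cue-conflict images, each of which carries the shape of one class and the texture of another. A minimal sketch of the standard metric (shape bias = fraction of shape decisions among all shape-or-texture decisions, following Geirhos et al., 2019); the records, labels, and helper name below are illustrative, not the paper's actual code:

```python
# Each record holds (predicted class, shape label, texture label) for one
# cue-conflict image. Predictions matching neither label are ignored.
def shape_bias(records):
    shape_hits = sum(1 for pred, shape, _ in records if pred == shape)
    texture_hits = sum(1 for pred, _, texture in records if pred == texture)
    return shape_hits / (shape_hits + texture_hits)

# Toy cue-conflict predictions (illustrative):
preds = [
    ("cat", "cat", "elephant"),       # shape decision
    ("elephant", "cat", "elephant"),  # texture decision
    ("cat", "cat", "clock"),          # shape decision
    ("dog", "cat", "clock"),          # neither -> ignored
]
print(shape_bias(preds))  # 2 shape vs 1 texture decisions -> 0.666...
```

Under this metric, an S classifier scoring ∼25% and an R classifier scoring ∼67% differ in which cue wins when shape and texture conflict.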

2. NETWORKS AND DATASETS

Networks  To understand the effects of adversarial training across a wide range of architectures, we compare each pair of S and R models while keeping the network architecture constant. That is, we conduct all experiments on two groups of classifiers: (a) the standard AlexNet, GoogLeNet, & ResNet-50 (hereafter, ResNet) models pre-trained on the 1000-class 2012 ImageNet dataset; and (b) three adversarially-robust counterparts, i.e. AlexNet-R, GoogLeNet-R, & ResNet-R, which were trained via adversarial training (see below) (Madry et al., 2018). All code and data will be available on GitHub upon publication.

Training  A standard classifier with parameters θ was trained to minimize the cross-entropy loss L over pairs of (training example x, ground-truth label y) drawn from the ImageNet training set D:

arg min_θ E_(x,y)∼D [ L(θ, x, y) ]

In contrast, we trained each R classifier via the adversarial training framework of Madry et al. (2018), where each real example x is perturbed by a worst-case perturbation ∆ chosen from an allowed set P:

arg min_θ E_(x,y)∼D [ max_{∆∈P} L(θ, x + ∆, y) ]
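In practice, the inner maximization over ∆ is approximated with a few steps of projected gradient descent (PGD), as in Madry et al. (2018). Below is a minimal numpy sketch under simplifying assumptions: a linear classifier with a logistic loss so the input gradient is analytic, whereas the actual training uses a CNN, autograd, and mini-batches. The function name and hyperparameter values are illustrative.

```python
import numpy as np

def pgd_linf(x, y, w, eps=0.1, alpha=0.02, steps=10):
    """Approximate max over ||delta||_inf <= eps of L(w, x + delta, y)
    for a linear classifier score = w . x with logistic loss
    L = log(1 + exp(-y * score)); dL/dx = -y * w / (1 + exp(y * score))."""
    delta = np.zeros_like(x)
    for _ in range(steps):
        score = w @ (x + delta)
        grad = -y * w / (1.0 + np.exp(y * score))  # analytic input gradient
        delta = delta + alpha * np.sign(grad)      # ascent step on the loss
        delta = np.clip(delta, -eps, eps)          # project onto the eps-ball
    return x + delta

rng = np.random.default_rng(0)
w = rng.normal(size=8)
x = rng.normal(size=8)
y = 1.0

x_adv = pgd_linf(x, y, w)
loss = lambda z: np.log1p(np.exp(-y * (w @ z)))
assert np.all(np.abs(x_adv - x) <= 0.1 + 1e-12)  # stays inside the eps-ball
assert loss(x_adv) > loss(x)                     # inner max increased the loss
```

During adversarial training, this inner loop runs inside every SGD iteration, and θ is then updated on the perturbed examples x + ∆ rather than on the clean x.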

