THE SHAPE AND SIMPLICITY BIASES OF ADVERSARIALLY ROBUST IMAGENET-TRAINED CNNS

Abstract

Adversarial training has been the subject of dozens of studies and is a leading method for defending against adversarial attacks. Yet, it remains largely unknown (a) how adversarially robust ImageNet classifiers (R classifiers) generalize to out-of-distribution examples; and (b) how their generalization capability relates to their hidden representations. In this paper, we perform a thorough, systematic study to answer these two questions across AlexNet, GoogLeNet, and ResNet-50 architectures. We find that while standard ImageNet classifiers have a strong texture bias, their R counterparts rely heavily on shapes. Remarkably, adversarial training induces three simplicity biases into hidden neurons in the process of "robustifying" the network. That is, each convolutional neuron in R networks often changes to detecting (1) pixel-wise smoother patterns, i.e. a mechanism that blocks high-frequency noise from passing through the network; (2) lower-level features, i.e. textures and colors (instead of objects); and (3) fewer types of inputs. Our findings reveal the mechanisms that make networks more adversarially robust and also explain some recent findings, e.g. why R networks benefit from much larger capacity (Xie & Yuille, 2020) and can act as a strong image prior in image synthesis (Santurkar et al., 2019).
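As background, the adversarial training studied here (Madry et al., 2018) alternates an inner maximization, which crafts L∞-bounded adversarial examples by projected gradient ascent (PGD), with an outer minimization that trains the model on those examples. The sketch below illustrates this two-level loop on a toy NumPy logistic model rather than an ImageNet CNN; the data, model, and hyperparameters are illustrative assumptions, not the paper's setup.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pgd_linf(x, y, w, b, eps, alpha, steps):
    """Inner maximization: find a worst-case perturbation of x inside
    an L-infinity ball of radius eps via projected gradient ascent."""
    x_adv = x.copy()
    for _ in range(steps):
        p = sigmoid(x_adv @ w + b)
        grad = (p - y)[:, None] * w[None, :]      # d(BCE loss)/d(x_adv), per sample
        x_adv = x_adv + alpha * np.sign(grad)     # ascend the loss
        x_adv = np.clip(x_adv, x - eps, x + eps)  # project back into the eps-ball
    return x_adv

# Toy linearly separable data (a stand-in for images).
x = rng.normal(size=(200, 5))
w_true = rng.normal(size=5)
y = (x @ w_true > 0).astype(float)

# Outer minimization: fit the model on adversarial examples only.
w, b = np.zeros(5), 0.0
for _ in range(300):
    x_adv = pgd_linf(x, y, w, b, eps=0.1, alpha=0.03, steps=7)
    p = sigmoid(x_adv @ w + b)
    w -= 0.5 * (x_adv.T @ (p - y)) / len(y)       # gradient descent on BCE
    b -= 0.5 * (p - y).mean()

clean_acc = (((x @ w + b) > 0) == (y > 0.5)).mean()
```

In the ImageNet setting the inner gradient is computed by backpropagating through the CNN and eps is measured in pixel intensities; the alternating structure is the same.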

1. INTRODUCTION

Given excellent test-set performance, deep neural networks often fail to generalize to out-of-distribution (OOD) examples (Nguyen et al., 2015), including "adversarial examples", i.e. modified inputs that are imperceptibly different from real data but change predicted labels entirely (Szegedy et al., 2014). Importantly, adversarial examples can transfer between models, causing even unseen machine learning (ML) models to misbehave (Papernot et al., 2017) and threatening the security and reliability of ML applications (Akhtar & Mian, 2018).

Adversarial training, i.e. teaching a classifier to correctly label adversarial examples (instead of real data), has been a leading method for defending against adversarial attacks and was the most effective defense at ICLR 2018 (Athalye et al., 2018). Besides improved performance on adversarial examples, test-set accuracy can also improve, for some architectures, when real images are properly incorporated into adversarial training (Xie et al., 2020). It is therefore important to study how standard adversarial training (Madry et al., 2018) changes the hidden representations and generalization capabilities of neural networks. On smaller datasets, Zhang & Zhu (2019) found that adversarially robust networks (hereafter, R networks) rely heavily on shapes (instead of textures) to classify images. Intuitively, training on pixel-wise noisy images would encourage R networks to focus less on local statistics (e.g. textures) and to harness global features (e.g. shapes) more. However, an important, open question is:

Q1: On ImageNet, do R networks still prefer shapes over textures?

It remains unknown whether such a shape preference carries over to the large-scale ImageNet dataset (Russakovsky et al., 2015), which often induces a large texture bias into networks (Geirhos et al., 2019), e.g. to separate the ∼150 four-legged species in ImageNet. Also, the shape-bias hypothesis suggested by Zhang & Zhu (2019) seems to contradict the recent findings that R networks trained on ImageNet act as a strong image prior, i.e. they can be successfully used for many image translation tasks without any extra image prior (Santurkar et al., 2019). This discussion leads to a follow-up question:

