WHAT DO DEEP NETS LEARN? CLASS-WISE PATTERNS REVEALED IN THE INPUT SPACE

Abstract

Deep neural networks (DNNs) have been widely adopted in different applications to achieve state-of-the-art performance. However, they are often applied as a black box, with limited understanding of what the model has learned from the data. In this paper, we focus on image classification and propose a method to visualize and understand the class-wise patterns learned by DNNs trained under three different settings: natural, backdoored, and adversarial. Different from existing class-wise deep representation visualizations, our method searches for a single predictive pattern in the input (i.e., pixel) space for each class. Based on the proposed method, we show that DNNs trained on natural (clean) data learn abstract shapes along with some texture, and that backdoored models learn a small but highly predictive pattern for the backdoor target class. Interestingly, the existence of class-wise predictive patterns in the input space indicates that even DNNs trained on clean data can have backdoors, and the class-wise patterns identified by our method can be readily applied to "backdoor" attack the model. In the adversarial setting, we show that adversarially trained models learn more simplified shape patterns. Our method can serve as a useful tool to better understand DNNs trained on different datasets under different settings.

1. INTRODUCTION

Deep neural networks (DNNs) are a family of powerful models that have demonstrated superior learning capabilities in a wide range of applications such as image classification, object detection, and natural language processing. However, DNNs are often applied as a black box, with limited understanding of what the model has learned from the data. Existing understandings of DNNs have mostly been developed in the deep representation space or via attention maps. DNNs are known to be able to learn high-quality representations (Donahue et al., 2014), and these representations are well associated with the attention map of the model on the inputs (Zhou et al., 2016; Selvaraju et al., 2016). It has also been found that DNNs trained on high-resolution images like ImageNet are biased towards texture (Geirhos et al., 2019). While these works have significantly contributed to the understanding of DNNs, a method that can intuitively visualize what DNNs learn for each class in the input space (rather than the deep representation space) is still missing. Recently, the above understandings have been challenged by the vulnerabilities of DNNs to backdoor (Gu et al., 2017; Chen et al., 2017) and adversarial attacks (Szegedy et al., 2014; Goodfellow et al., 2015). The backdoor vulnerability is believed to be caused by a preference for learning high-frequency patterns (Chen et al., 2017; Liu et al., 2020; Wang et al., 2020). Nevertheless, no existing method is able to reliably reveal backdoor patterns, even though they have been well learned by the backdoored model. Adversarial attacks can easily fool state-of-the-art DNNs with either sample-wise (Goodfellow et al., 2016) or universal (Moosavi-Dezfooli et al., 2017) adversarial perturbations. One recent explanation for the adversarial vulnerability is that, besides robust features, DNNs also learn useful (to the prediction) yet non-robust features which are sensitive to small perturbations (Ilyas et al., 2019).
Adversarial training, a state-of-the-art adversarial defense method, has been shown to train DNNs that learn sample-wise robust features (Madry et al., 2018; Ilyas et al., 2019). However, it is still not clear whether adversarially trained DNNs can learn a robust pattern for each class. In this paper, we focus on image classification tasks and propose a visualization method that can reveal the pattern learned by DNNs for each class in the input space. Different from sample-wise visualization methods like attention maps, we aim to reveal the knowledge (or pattern) learned by DNNs for each class. Moreover, we reveal these patterns in the input space rather than the deep representation space, because input space patterns are arguably much easier to interpret. Furthermore, we are interested in a visualization method that can provide new insights into the backdoor and adversarial vulnerabilities of DNNs, both of which are input space vulnerabilities (Szegedy et al., 2014; Ma et al., 2018). Given a target class, a canvas image, and a subset of images from the non-target classes, our method searches for a single pattern (a set of pixels) from the canvas image that is highly predictive of the target class. In other words, when the pattern is attached to images from any other (i.e., non-target) class, the model will consistently predict them as the target class. Figure 1 illustrates a few examples of the class-wise patterns revealed by our method for DNNs trained on the natural (clean) CIFAR-10 (Krizhevsky, 2009) and ImageNet (Deng et al., 2009) datasets. In summary, our main contributions are: 1) We propose a visualization method to reveal the class-wise patterns learned by DNNs in the input space, and show how they differ from attention maps and universal adversarial perturbations.
2) With the proposed visualization method, we show that DNNs trained on natural datasets can learn a consistent and predictive pattern for each class, and that the pattern contains abstract shapes along with some texture. This sheds new light on the current texture-bias understanding of DNNs. 3) When applied to backdoored DNNs, our method can reveal the trigger patterns learned by the model from the poisoned dataset, and can thus serve as an effective tool to assist the detection of backdoored models. 4) The existence of class-wise predictive patterns in the input space indicates that even DNNs trained on clean data can have backdoors, and the class-wise patterns identified by our method can be readily applied to "backdoor" attack the model. 5) By examining the patterns learned by DNNs trained in the adversarial setting, we find that adversarially trained models learn more simplified shape patterns.
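To make the search described above concrete, the following is a minimal, hypothetical sketch, not the paper's actual algorithm: it substitutes a toy linear classifier for a trained DNN and uses a simple greedy per-pixel search; the names `reveal_class_pattern` and `predict_logits`, and all sizes, are our own assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a trained classifier: a linear model over flattened
# 8x8 "images" (64 pixels) with 3 classes. In the paper's setting this
# would be a trained DNN such as ResNet-18.
D, C = 64, 3
W = rng.normal(size=(C, D))

def predict_logits(x):
    """Class logits for a batch of flattened images, shape (N, D)."""
    return x @ W.T

def reveal_class_pattern(canvas, nontarget, target, k):
    """Greedily pick k canvas pixels whose pasting onto non-target
    images most increases the mean target-class logit (a simplified,
    hypothetical version of the paper's pattern search)."""
    mask = np.zeros(D, dtype=bool)
    for _ in range(k):
        # Images with the pixels chosen so far already pasted on.
        base = nontarget.copy()
        base[:, mask] = canvas[mask]
        base_score = predict_logits(base)[:, target].mean()
        best_gain, best_i = -np.inf, None
        for i in np.flatnonzero(~mask):
            trial = base.copy()
            trial[:, i] = canvas[i]  # paste one more canvas pixel
            gain = predict_logits(trial)[:, target].mean() - base_score
            if gain > best_gain:
                best_gain, best_i = gain, int(i)
        mask[best_i] = True
    return mask

canvas = rng.normal(size=D)                  # the canvas image
nontarget = rng.normal(size=(16, D))         # images from other classes
mask = reveal_class_pattern(canvas, nontarget, target=0, k=3)

# Attaching the pattern should raise the target-class logit on average.
patched = nontarget.copy()
patched[:, mask] = canvas[mask]
before = predict_logits(nontarget)[:, 0].mean()
after = predict_logits(patched)[:, 0].mean()
```

In the paper's setting, the search would instead run against a trained DNN over real images, with the pattern-size budget (e.g., 5% of the image, as in Figure 1) playing the role of `k`.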

2. RELATED WORK

General Understandings of DNNs. DNNs are known to learn more complex and higher-quality representations than traditional models. Features learned at intermediate layers of AlexNet have been found to contain both simple patterns, like lines and corners, and high-level shapes (Donahue et al., 2014), and such features have been found crucial to the superior performance of DNNs (He et al., 2015). The exceptional representation learning capability of DNNs has also been found to relate to network structure, such as depth and width (Safran & Shamir, 2017; Telgarsky, 2016). One recent work found that ImageNet-trained DNNs are biased towards texture features (Geirhos et al., 2019). Attention maps have also been used to develop better understandings of the decisions made by DNNs on a given input (Simonyan et al., 2014; Springenberg et al., 2015; Zeiler & Fergus, 2014; Gan et al., 2015). The Grad-CAM technique proposed by Selvaraju et al. (2016) utilizes input gradients to produce intuitive attention maps. Whilst these works mostly focus on deep representations or sample-wise attention, an understanding and visualization of what DNNs learn for each class in the input space is still missing from the current literature.

Understanding Vulnerabilities of DNNs. Recent works have found that DNNs are vulnerable to backdoor and adversarial attacks. A backdoor attack implants a backdoor trigger into a victim model by injecting the trigger into a small proportion of the training data (Gu et al., 2017; Liu et al., 2018). The model trained on the poisoned dataset will learn a noticeable correlation between the trigger and
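The poisoning step of such an attack can be sketched as follows. This is a hypothetical toy illustration of BadNets-style poisoning (Gu et al., 2017), not code from the paper; the function name `poison`, the trigger shape, and the poisoning rate are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 100 8x8 grayscale "images" with labels in {0, 1, 2}.
images = rng.random((100, 8, 8))
labels = rng.integers(0, 3, size=100)

def poison(images, labels, target, rate=0.1):
    """Stamp a 2x2 white trigger into the bottom-right corner of a
    small fraction of images and relabel them as the target class."""
    images, labels = images.copy(), labels.copy()
    n_poison = int(rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    images[idx, -2:, -2:] = 1.0  # the backdoor trigger pattern
    labels[idx] = target         # the attacker's target class
    return images, labels, idx

p_images, p_labels, idx = poison(images, labels, target=2)
```

A model trained on `(p_images, p_labels)` would tend to associate the corner trigger with class 2, which is the correlation the surrounding text describes.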



Figure 1: Example images (top row) and the class-wise patterns (bottom row) learned by a ResNet-18 on CIFAR-10 (left three columns) and a ResNet-50 on ImageNet (right three columns), as revealed by our method. The 3 ImageNet classes are "n02676566" ("guitar"), "n02123045" ("cat"), and "n03874599" ("padlock"). The pattern size is set to 5% of the image size.


