WHAT DO DEEP NETS LEARN? CLASS-WISE PATTERNS REVEALED IN THE INPUT SPACE

Abstract

Deep neural networks (DNNs) have been widely adopted in different applications to achieve state-of-the-art performance. However, they are often applied as a black box with limited understanding of what the model has learned from the data. In this paper, we focus on image classification and propose a method to visualize and understand the class-wise patterns learned by DNNs trained under three different settings: natural, backdoored, and adversarial. Different from existing class-wise deep representation visualizations, our method searches for a single predictive pattern in the input (i.e., pixel) space for each class. Based on the proposed method, we show that DNNs trained on natural (clean) data learn abstract shapes along with some texture, and that backdoored models learn a small but highly predictive pattern for the backdoor target class. Interestingly, the existence of class-wise predictive patterns in the input space indicates that even DNNs trained on clean data can have backdoors, and the class-wise patterns identified by our method can be readily applied to "backdoor" attack the model. In the adversarial setting, we show that adversarially trained models learn more simplified shape patterns. Our method can serve as a useful tool to better understand DNNs trained on different datasets under different settings.

1. INTRODUCTION

Deep neural networks (DNNs) are a family of powerful models that have demonstrated superior learning capabilities in a wide range of applications such as image classification, object detection, and natural language processing. However, DNNs are often applied as a black box with limited understanding of what the model has learned from the data. Existing understandings of DNNs have mostly been developed in the deep representation space or via attention maps. DNNs are known to be able to learn high-quality representations (Donahue et al., 2014), and these representations are well associated with the attention map of the model on the inputs (Zhou et al., 2016; Selvaraju et al., 2016). It has also been found that DNNs trained on high-resolution images like ImageNet are biased towards texture (Geirhos et al., 2019). While these works have significantly contributed to the understanding of DNNs, a method that can intuitively visualize what DNNs learn for each class in the input space (rather than the deep representation space) is still missing. Recently, the above understandings have been challenged by the vulnerabilities of DNNs to adversarial (Szegedy et al., 2014; Goodfellow et al., 2015) and backdoor attacks (Gu et al., 2017; Chen et al., 2017). The backdoor vulnerability is believed to be caused by a preference for learning high-frequency patterns (Chen et al., 2017; Liu et al., 2020; Wang et al., 2020). Nevertheless, no existing method is able to reliably reveal the backdoor pattern, even though the backdoored model has clearly learned it. Adversarial attacks can easily fool state-of-the-art DNNs with either sample-wise (Goodfellow et al., 2016) or universal (Moosavi-Dezfooli et al., 2017) adversarial perturbations. One recent explanation for the adversarial vulnerability is that, besides robust features, DNNs also learn features that are useful (to the prediction) yet non-robust, i.e., sensitive to small perturbations (Ilyas et al., 2019).
Adversarial training, a state-of-the-art adversarial defense method, has been shown to train DNNs to learn sample-wise robust features (Madry et al., 2018; Ilyas et al., 2019). However, it is still not clear whether adversarially trained DNNs can learn a robust pattern for each class. In this paper, we focus on image classification tasks and propose a visualization method that can reveal the pattern learned by DNNs for each class in the input space. Different from sample-wise visualization methods like attention maps, we aim to reveal the knowledge (or pattern) learned by DNNs for each class. Moreover, we reveal these patterns in the input space rather than the deep representation space, because input space patterns are arguably much easier to interpret. Furthermore, we are interested in a visualization method that can provide new insights into the
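The core idea of searching the input space for a single class-wise predictive pattern can be illustrated with a minimal sketch. The paper's actual objective and constraints are not specified in this excerpt, so the code below is an assumption-laden toy: it stands in a trained DNN with a random linear classifier (so the example runs without a deep learning framework), and the function name `find_class_pattern`, the norm bound, and the gradient-ascent loop are illustrative choices, not the authors' method. For a real DNN the input gradient would come from backpropagation rather than the closed form used here.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a trained model: a linear classifier over 3 classes
# on 64-dimensional inputs. (Assumption: a real DNN would replace this.)
W = rng.normal(size=(3, 64))
b = rng.normal(size=3)

def logits(x):
    return W @ x + b

def find_class_pattern(target, steps=200, lr=0.1, eps=2.0):
    """Gradient ascent on the target class's logit over the input
    space, with the pattern's L2 norm kept bounded by eps (a common
    constraint choice; illustrative here)."""
    x = np.zeros(64)
    for _ in range(steps):
        # For a linear model the gradient of the target logit w.r.t.
        # the input is simply W[target]; a DNN would use backprop.
        x = x + lr * W[target]
        n = np.linalg.norm(x)
        if n > eps:
            x = x * (eps / n)  # project back into the L2 ball
    return x

pattern = find_class_pattern(target=1)
# The recovered pattern is, by construction, highly predictive of
# its class: the model assigns it to class 1.
predicted = int(np.argmax(logits(pattern)))
```

This also hints at why such a pattern behaves like a backdoor trigger, as the abstract notes: adding a small, highly class-predictive input-space pattern to other images can steer predictions toward that class.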

