GAME-THEORETIC UNDERSTANDING OF MISCLASSIFICATION

Abstract

This paper analyzes various types of image misclassification from a game-theoretic view. In particular, we consider the misclassification of clean, adversarial, and corrupted images and characterize it through the distribution of multi-order interactions. We discover that the distribution of multi-order interactions varies across the types of misclassification. For example, misclassified adversarial images have a higher strength of high-order interactions than correctly classified clean images, which indicates that adversarial perturbations create spurious features arising from complex cooperation between pixels. By contrast, misclassified corrupted images have a lower strength of low-order interactions than correctly classified clean images, which indicates that corruptions break the local cooperation between pixels. We also provide the first analysis of Vision Transformers using interactions. We find that Vision Transformers show a different tendency in the distribution of interactions from that of CNNs, which implies that they exploit features that CNNs do not use for prediction. Our study demonstrates that the recent game-theoretic analysis of deep learning models can be broadened to analyze various malfunctions of deep learning models, including Vision Transformers, by using the distribution, order, and sign of interactions.

1. INTRODUCTION

Deep learning models misclassify images for various reasons. They fail to classify some clean images in a dataset, and they also misclassify images because of adversarial perturbations and common corruptions. Understanding the causes of misclassification in deep learning models is vital for their safe application in society. Several recent studies have provided new directions for understanding deep learning models from game-theoretic viewpoints (Cheng et al., 2021; Deng et al., 2022; Ren et al., 2021; Wang et al., 2021; Zhang et al., 2021). For example, adversarial images (Goodfellow et al., 2015; Szegedy et al., 2014), i.e., images that are slightly but maliciously perturbed to fool deep learning models, were characterized using the interaction (Ren et al., 2021; Wang et al., 2021), which was originally introduced as a measure of the synergy of two players in game theory (Grabisch & Roubens, 1999). In image classification, the average interaction I of an image measures the average change in the confidence score (i.e., the softmax value of the logit of the true class) caused by the cooperation of various pairs of pixels: I = E_{(i,j)}[I(i, j)]. Here, I(i, j) denotes the interaction of the i-th and j-th pixels, which is, roughly speaking, defined by the difference in their contributions to the confidence score: I(i, j) = g(i, j) - g(i) - g(j) + const., where g(k) relates to the contribution of the k-th pixel to the confidence score. The formal definition of I(i, j) will be provided later in this paper. When I(i, j) ≈ 0 for a pixel pair, the two pixels contribute to the model prediction almost independently. By contrast, a large interaction indicates that the combination of pixels (e.g., an edge) contributes to the model prediction in synergy. An interaction can be represented as an average of interactions of different orders, I(i, j) = (1/(n-1)) Σ_s I^(s)(i, j), where I^(s)(i, j) and n denote the interaction of order s and the number of pixels, respectively.
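The order-s interaction averages the joint-vs-separate contribution of a pixel pair over contexts S of size s, and can be estimated by Monte Carlo sampling. The sketch below is a minimal illustration, not the paper's implementation; the function name order_s_interaction and the toy set-value function v (standing in for a model's confidence score on a masked image) are our own assumptions.

```python
import random


def order_s_interaction(v, n, i, j, s, num_samples=100, seed=0):
    """Monte Carlo estimate of the order-s interaction I^(s)(i, j).

    v: set-value function mapping a set of "present" pixel indices to a
       confidence score (a toy stand-in for the model output).
    n: number of pixels; (i, j): the pixel pair; s: context size |S|.
    """
    rng = random.Random(seed)
    rest = [k for k in range(n) if k not in (i, j)]
    total = 0.0
    for _ in range(num_samples):
        S = set(rng.sample(rest, s))  # random context of size s
        # Marginal synergy of adding i and j jointly vs. separately.
        total += v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S)
    return total / num_samples


# Toy check: for the supermodular v(S) = |S|^2, the pair synergy is
# (s+2)^2 - 2(s+1)^2 + s^2 = 2, independent of the sampled context.
v = lambda S: len(S) ** 2
print(order_s_interaction(v, n=10, i=0, j=1, s=4))  # -> 2.0
```

In practice, v would evaluate the model on an image with all pixels outside S ∪ {i, j} replaced by a baseline value (e.g., the mean pixel), which is the costly step this toy function elides.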
The decomposition into I^(s) gives a more detailed view of the cooperation of pixels: low-order interactions measure simple cooperation between pixels, whereas high-order interactions measure relatively global and complex concepts. In other words, low- and high-order interactions correspond to different categories of features. Cheng et al. (2021) investigated the link between the order of interactions and image features. They showed that, in general, low-order interactions reflect local shapes and textures, whereas high-order interactions reflect global shapes and textures that frequently appear in training samples. Ren et al. (2021) showed that adversarial perturbations affect high-order interactions and that adversarially trained models are robust to perturbations of the features related to high-order interactions. Zhang et al. (2021) found that dropout regularizes low-order interactions. These results suggest that interactions can characterize how deep learning models view images; thus, one can obtain a deeper understanding of the cause of model predictions through interactions.

In this study, we investigate one of the most fundamental issues of deep learning models, misclassification, through the lens of interactions. We examine various types of misclassification: we consider the misclassification of clean, adversarial, and corrupted images and characterize them by the distribution, order, and sign of the interactions. In the experiments, we contrasted the distribution of interactions of misclassified images with that of successfully classified clean images, thereby revealing which types of features are more exploited to make a prediction for each set of images. The results show that these three types of misclassification have distinct tendencies in interactions, which indicates that each of them arises from a different cause. The results are summarized as follows:

Misclassification of clean images.
The distributions of interactions did not present a large difference between misclassified clean images and successfully classified ones. This result indicates that the misclassification is not triggered by distracting the model from the useful features; the model relies on similar features in images regardless of the correctness of the prediction.

Misclassification of adversarial images. We observed a sharp increase in the strength of high-order interactions, indicating that for adversarial images, the model exploits more features related to interactions in these orders. Namely, the misclassification is triggered by distracting the model from the useful cooperation of pixels in low orders toward spurious cooperation in other orders.

Misclassification of corrupted images. We observed that while the interactions moderately increased in high orders, they decreased in low and middle orders. This indicates that the model can no longer use the originally observed pixel cooperations because of the corruption and instead gains useless or even harmful pixel cooperations in high orders to make a prediction.

The abovementioned results were observed on convolutional neural networks (CNNs; He et al., 2016). We investigated whether these results generalize to Vision Transformers, which have recently shown better performance than CNNs in image recognition tasks. For DeiT-Ti (Touvron et al., 2021) and Swin-T (Liu et al., 2021), most results hold, but with stronger contrast. Moreover, for the misclassification of clean images, where CNNs showed no particular difference in the distribution of interactions between correctly classified and misclassified images, the Vision Transformers showed a striking difference. This suggests that the characteristics of the predictions are exhibited more clearly by Vision Transformers than by CNNs, and thus the analysis with interactions can be exploited even after the shift from CNNs to Vision Transformers.
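The order-wise comparisons above reduce to contrasting normalized strength profiles across orders. The sketch below is our illustration of that bookkeeping step, assuming interactions have already been estimated per order; the numbers are hypothetical placeholders, not measured values from the paper.

```python
import numpy as np


def strength_profile(interactions):
    """Normalized order-wise interaction strength.

    interactions: array of shape (num_orders, num_pairs) holding
    estimated I^(s)(i, j) values for one image set, rows ordered from
    low to high order. Returns J with J[s] = E|I^(s)| / sum_s' E|I^(s')|.
    """
    mean_abs = np.abs(interactions).mean(axis=1)
    return mean_abs / mean_abs.sum()


# Hypothetical numbers for illustration only (orders: low, middle, high).
clean = np.array([[1.0, 1.1], [0.5, 0.5], [0.2, 0.2]])
adv = np.array([[0.9, 0.9], [0.5, 0.5], [1.5, 1.5]])

# A high-order spike, as reported for misclassified adversarial images,
# appears as a heavier tail in the normalized profile.
print(strength_profile(clean)[-1] < strength_profile(adv)[-1])  # -> True
```

Normalizing by the total strength makes profiles comparable between image sets whose overall interaction magnitudes differ.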
We also conducted experiments on adversarial attacks and transferability, as they have been discussed extensively in the literature (Croce & Hein, 2020; Dong et al., 2018; Ilyas et al., 2019; Wang et al., 2021; Yang et al., 2021). We discovered that when images are adversarially perturbed, the distribution of interactions shifts toward negative values. This reveals that adversarial perturbations break the features that the model exploits or even alter them into misleading ones. We also discovered that adversarial transferability depends on the order of interactions: adversarial images with higher interactions in high orders transfer better when adversarially perturbed using ResNet-18, whereas, interestingly, they transfer less when Swin-T is used. This contrasting tendency is analogous to the recent observation that adversarial images are perturbed more in the high-frequency domain with CNNs and in the low-frequency domain with Vision Transformers (Kim & Lee, 2022).

The contributions of our study can be summarized as follows:

• This study investigates various types of misclassification from a game-theoretic perspective for the first time. In particular, we characterize the misclassification of clean, adversarial, and corrupted images with the distribution, order, and sign of the interactions.

