GAME-THEORETIC UNDERSTANDING OF MISCLASSIFICATION

Abstract

This paper analyzes various types of image misclassification from a game-theoretic view. In particular, we consider the misclassification of clean, adversarial, and corrupted images and characterize it through the distribution of multi-order interactions. We discover that the distribution of multi-order interactions varies across the types of misclassification. For example, misclassified adversarial images have a higher strength of high-order interactions than correctly classified clean images, which indicates that adversarial perturbations create spurious features that arise from complex cooperation between pixels. By contrast, misclassified corrupted images have a lower strength of low-order interactions than correctly classified clean images, which indicates that corruptions break the local cooperation between pixels. We also provide the first analysis of Vision Transformers using interactions. We find that Vision Transformers show a different tendency in the distribution of interactions from that of CNNs, which implies that they exploit features that CNNs do not use for prediction. Our study demonstrates that the recent game-theoretic analysis of deep learning models can be broadened to analyze various malfunctions of deep learning models, including Vision Transformers, by using the distribution, order, and sign of interactions.

1. INTRODUCTION

Deep learning models misclassify images for various reasons. They fail to classify some clean images in a dataset, and they also misclassify because of adversarial perturbations and common corruptions. Understanding the causes of misclassification in deep learning models is vital for their safe application in society. Several recent studies have provided new directions for understanding deep learning models from game-theoretic viewpoints (Cheng et al., 2021; Deng et al., 2022; Ren et al., 2021; Wang et al., 2021; Zhang et al., 2021). For example, adversarial images (Goodfellow et al., 2015; Szegedy et al., 2014), i.e., images that are slightly but maliciously perturbed to fool deep learning models, were characterized using the interaction (Ren et al., 2021; Wang et al., 2021), which was originally introduced as a measure of the synergy of two players in game theory (Grabisch & Roubens, 1999). In image classification, the average interaction I of an image measures the average change in the confidence score (i.e., the softmax value of the logit of the true class) caused by the cooperation of various pairs of pixels: I = E_{(i,j)}[I(i, j)]. Here, I(i, j) denotes the interaction of the i-th and j-th pixels, which is, roughly speaking, defined by the difference in their contributions to the confidence score: I(i, j) = g(i, j) - g(i) - g(j) + const., where g(k) relates to the contribution of the k-th pixel to the confidence score. The formal definition of I(i, j) will be provided later in this paper. When I(i, j) ≈ 0 for a pixel pair, the two pixels contribute to the model prediction almost independently. By contrast, a large interaction indicates that the combination of pixels (e.g., edges) contributes to the model prediction in synergy. An interaction can be represented as an average of interactions of different orders, I(i, j) = (1/(n-1)) Σ_s I^(s)(i, j), where I^(s) and n denote the interaction of order s and the number of pixels, respectively.
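To make the quantities above concrete, the following is a minimal sketch of estimating the order-s interaction I^(s)(i, j) = E_{S ⊆ N\{i,j}, |S|=s}[v(S∪{i,j}) - v(S∪{i}) - v(S∪{j}) + v(S)] by sampling contexts S, and averaging over orders to obtain I(i, j). The value function `v` here is a hypothetical toy stand-in for the model's confidence score on an input with only the pixels in S kept; in practice v would come from a trained classifier with a masking baseline.

```python
# Sketch of multi-order interaction estimation (toy setup; `v` is a
# placeholder for the model's confidence score on a masked input).
import random
from statistics import mean

def multi_order_interaction(v, n, i, j, s, num_samples=200, seed=0):
    """Estimate I^(s)(i, j): expected synergy of pixels i and j over
    sampled contexts S of size s drawn from the remaining n-2 pixels."""
    rng = random.Random(seed)
    rest = [k for k in range(n) if k not in (i, j)]
    deltas = []
    for _ in range(num_samples):
        S = frozenset(rng.sample(rest, s))
        # Marginal synergy: joint contribution minus individual contributions.
        deltas.append(v(S | {i, j}) - v(S | {i}) - v(S | {j}) + v(S))
    return mean(deltas)

def average_interaction(v, n, i, j, num_samples=200):
    """I(i, j) = (1/(n-1)) * sum over orders s = 0 .. n-2 of I^(s)(i, j)."""
    orders = range(n - 1)
    return mean(multi_order_interaction(v, n, i, j, s, num_samples)
                for s in orders)

# Toy value function: each pixel contributes 1 on its own, and pixels
# 0 and 1 together contribute an extra synergy bonus of 2.0.
def v_toy(S):
    return len(S) + (2.0 if {0, 1} <= S else 0.0)
```

With this toy v, the pair (0, 1) has interaction 2.0 at every order (the synergy bonus appears only when both pixels are present), while a pair without a joint term, such as (2, 3), has interaction 0 at every order.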
The decomposition into the I^(s) gives a more detailed view of the cooperation of pixels; low-order interactions measure simple cooperation between 1

