SCHEMA INFERENCE FOR INTERPRETABLE IMAGE CLASSIFICATION

Abstract

In this paper, we study a novel inference paradigm, termed schema inference, that learns to deductively infer explainable predictions by rebuilding the forwarding scheme of a prior deep neural network (DNN), guided by the prevalent philosophical cognitive concept of the schema. We reformulate the conventional model inference pipeline into a graph matching policy that associates the extracted visual concepts of an image with a pre-computed scene impression, by analogy with the human reasoning mechanism of impression matching. To this end, we devise an elaborated architecture, termed SchemaNet, as a dedicated instantiation of the proposed schema inference concept, which models both the visual semantics of input instances and the learned abstract imaginations of target categories as topological relational graphs. Meanwhile, to capture and leverage the compositional contributions of visual semantics in a global view, we also introduce a universal Feat2Graph scheme in SchemaNet to establish relational graphs that contain abundant interaction information. Both the theoretical analysis and the experimental results on several benchmarks demonstrate that the proposed schema inference achieves encouraging performance and meanwhile yields a clear picture of the deductive process leading to the predictions.

1. INTRODUCTION

"Now this representation of a general procedure of the imagination for providing a concept with its image is what I call the schema for this concept." - Immanuel Kant

Deep neural networks (DNNs) have demonstrated increasingly prevailing capabilities in visual representation compared to conventional hand-crafted features. Take visual recognition as an example: the canonical deep learning (DL) scheme for image recognition yields an effective visual representation from a stack of non-linear layers followed by a fully-connected (FC) classifier at the end (He et al., 2016; Dosovitskiy et al., 2021; Tolstikhin et al., 2021; Yang et al., 2022a), where the inner-product similarities with each category embedding serve as the prediction. Despite the great success of DL, existing deep networks are typically required to simultaneously perceive low-level patterns as well as high-level semantics to make predictions (Zeiler & Fergus, 2014; Krizhevsky et al., 2017). As such, both the procedure of computing visual representations and the learned category-specific embeddings are opaque to humans, which poses challenges in safety-critical scenarios such as autonomous driving and healthcare. Unlike prior works that merely obtain targets in a black-box manner, we strive to devise an innovative and generalized DNN inference paradigm by reformulating the traditional one-shot forwarding scheme into an interpretable DNN reasoning framework, resembling deductive human reasoning.
Towards this end, inspired by the schema in Kant's philosophy, which describes human cognition as the procedure of associating an image of abstract concepts with a specific sense impression, we propose to formulate DNN inference as an interactive matching procedure between the local visual semantics of an input instance and an abstract category imagination, which we term schema inference, thereby accomplishing interpretable deductive inference based on micro-level interactions of visual semantics. To concretize the proposed schema inference concept, we take image classification, the most basic task in computer vision, as an example to explain our technical details. At a high level, the devised schema inference scheme leverages a pre-trained DNN to extract feature ingredients, which are the semantics represented by a cluster of deep feature vectors from a specific local region in the image domain. The obtained feature ingredients are then organized into an ingredient relation graph (IR-Graph) to model their interactions, characterized by similarity at the semantic level as well as adjacency at the spatial level. We then implement the category-specific imagination as an ingredient relation atlas (IR-Atlas) for all target categories, induced from observed data samples. As a final step, the graph similarity between an instance-level IR-Graph and the category-level IR-Atlas is computed as the measurement for yielding the target predictions. As such, instead of relying on raw deep features, the outputs of schema inference derive solely from the relationships among visual words, as shown in Figure 1. More specifically, our dedicated schema-based architecture, termed SchemaNet, is built on vision Transformers (ViTs) (Dosovitskiy et al., 2021; Touvron et al., 2021), which are nowadays the most prevalent vision backbones.
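The pipeline above can be sketched in miniature. The following is an illustrative NumPy toy, not the paper's actual implementation: the shapes, the `quantize`/`ir_graph`/`match_score` names, and the use of a simple vertex-and-edge dot product as a stand-in for the learned graph matcher are all assumptions made for exposition.

```python
# Minimal sketch of schema inference: quantize patch features into visual
# words ("ingredients"), build an instance IR-Graph, and score it against a
# per-class IR-Atlas. All names and shapes are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

K, D, C = 16, 8, 3                        # visual words, feature dim, classes
codebook = rng.normal(size=(K, D))        # cluster centers ("ingredients")

def quantize(patch_feats):
    """Assign each patch feature to its nearest visual word."""
    d = ((patch_feats[:, None, :] - codebook[None]) ** 2).sum(-1)
    return d.argmin(1)                    # (N,) word indices

def ir_graph(words, attn):
    """IR-Graph: word occurrence vector + word-to-word interaction matrix."""
    v = np.bincount(words, minlength=K).astype(float)
    A = np.zeros((K, K))
    for i, wi in enumerate(words):
        for j, wj in enumerate(words):
            A[wi, wj] += attn[i, j]       # accumulate attention as edge weight
    return v, A

# Category-level IR-Atlas: one abstract relation graph per class (random here;
# in the paper these are induced from observed data samples).
atlas = [(rng.random(K), rng.random((K, K))) for _ in range(C)]

def match_score(graph, atlas_entry):
    """Graph similarity over vertices and edges (stand-in for the GCN matcher)."""
    v, A = graph
    va, Aa = atlas_entry
    return float(v @ va + (A * Aa).sum())

patches = rng.normal(size=(10, D))        # 10 patch features from a backbone
attn = rng.random((10, 10))               # their pairwise attention scores
g = ir_graph(quantize(patches), attn)
logits = np.array([match_score(g, a) for a in atlas])
pred = int(logits.argmax())               # class with the most matched evidence
```

The key property this toy preserves is that the prediction is a sum of matched evidence over visual-word relationships, rather than an inner product with an opaque category embedding.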
To obtain the feature ingredients effectively, we collect the intermediate features of the backbone from probe data samples and cluster them with the k-means algorithm. IR-Graphs are established through a customized Feat2Graph module that transfers the discretized ingredient array to graph vertices, and meanwhile builds the connections indicating ingredient interactions based on the self-attention mechanism (Vaswani et al., 2017) and spatial adjacency. Eventually, graph similarities are evaluated via a shallow graph convolutional network (GCN). Our work relates to several existing methods that mine semantic-rich visual words from DNN backbones for self-explanation (Brendel & Bethge, 2019; Chen et al., 2019; Nauta et al., 2021; Xue et al., 2022b; Yang et al., 2022b). Particularly, BagNet (Brendel & Bethge, 2019) uses a DNN as a visual bag-of-local-features extractor.



Figure 1: An example showing how an instance IR-Graph is matched to the class imagination. The vertices of IR-Graphs represent visual semantics, and the edges indicate the vertex interactions. The graph matcher captures the similarity between the local structures of joint vertices (e.g., vertex #198 and vertex #150) by aggregating information from their neighbors as class evidence. The final prediction is defined as the sum of all evidence.
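The Feat2Graph edge construction described in the introduction, fusing attention-derived semantic interactions with spatial adjacency of the token grid, can be sketched as follows. This is a hedged illustration under assumed conventions (4-neighborhood grid adjacency, a simple convex combination with weight `alpha`); the function names and the fusion rule are not the paper's actual API.

```python
# Sketch of a Feat2Graph-style edge construction: edge weights combine
# semantic interactions (row-normalized self-attention scores) with spatial
# adjacency of the h x w token grid. Names and the alpha-blend are assumptions.
import numpy as np

def spatial_adjacency(h, w):
    """4-neighborhood adjacency matrix over an h x w token grid."""
    n = h * w
    A = np.zeros((n, n))
    for r in range(h):
        for c in range(w):
            i = r * w + c
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < h and 0 <= cc < w:
                    A[i, rr * w + cc] = 1.0
    return A

def feat2graph_edges(attn, h, w, alpha=0.5):
    """Fuse attention-based and spatial edges into one interaction graph."""
    attn = attn / attn.sum(axis=-1, keepdims=True)  # row-normalize attention
    return alpha * attn + (1.0 - alpha) * spatial_adjacency(h, w)

rng = np.random.default_rng(1)
attn = rng.random((9, 9))            # attention among 3 x 3 = 9 tokens
E = feat2graph_edges(attn, 3, 3)     # (9, 9) fused edge-weight matrix
```

The resulting weighted adjacency is what a shallow GCN could then propagate over when comparing an instance IR-Graph against the IR-Atlas.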

