LEARNABLE VISUAL WORDS FOR INTERPRETING IMAGE RECOGNITION MODELS

Anonymous

Abstract

To interpret deep models' predictions, attention-based visual cues are widely used to address why deep models make certain predictions. Beyond that, the research community has become increasingly interested in reasoning about how deep models make predictions, where some prototype-based methods employ interpretable representations with corresponding visual cues to reveal the black-box mechanisms behind deep model behaviors. However, these pioneering attempts either learn category-specific prototypes at the cost of generalization ability, or demonstrate only a few illustrative examples without a quantitative evaluation of visual-based interpretability, which narrows their practical usage. In this paper, we revisit the concept of visual words and propose Learnable Visual Words (LVW) to interpret model prediction behaviors with two novel modules: semantic visual words learning and dual fidelity preservation. Semantic visual words learning relaxes the category-specific constraint, enabling generic visual words to be shared across multiple categories. Beyond employing the visual words for prediction to align them with the base model, our dual fidelity preservation also includes an attention-guided semantic alignment that encourages the learned visual words to focus on the same conceptual regions for prediction. Experiments on six visual benchmarks demonstrate the superior effectiveness of our proposed LVW over state-of-the-art methods in both accuracy and interpretation fidelity. Moreover, we provide various in-depth analyses to further explore the learned visual words and the generalizability of our method to unseen categories.¹

1. INTRODUCTION

Model interpretation aims to explain a black-box base model in a semantically understandable way while preserving high fidelity with the model's outputs (Li et al., 2021; Molnar et al., 2020; Zhang et al., 2021b). Although interpretative models are not designed to pursue higher performance than the base model, they are of great importance, especially in the deep learning era. With human-understandable explanations, model designers can diagnose errors and potential biases embedded in deep models, and model users can confidently rely on model predictions. This trust is built on the assumption that the interpretation model is faithful to the original deep network.

For tabular data, raw features with physical meanings are used to interpret model predictions. LIME (Ribeiro et al., 2016) and SHAP (Lundberg & Lee, 2017) are two representative explainable models, which learn linear models to locally fit the base model's outputs under sample perturbations. For visual data, attention-based visual cues are widely adopted to interpret deep models' predictions (Correia & Colombini, 2021). Grad-CAM (Selvaraju et al., 2017; 2020) is a popular technique for visualizing the regions a convolutional neural network attends to. It uses the category-specific gradient information flowing into the final convolutional layer of a CNN to produce a coarse localization attention map of the important regions/pixels in the image. Following this direction, several studies further impose weakly supervised information directly on attention maps during training to reduce attention bias (Li et al., 2018; Srinivas & Fleuret, 2019; Chaudhari et al., 2021). Beyond the above studies, which address why deep models make their predictions at a generally coarse level, some pioneering attempts have been made to reason about how deep models make predictions at a fine-grained level.
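The Grad-CAM procedure described above can be sketched in a few lines: pool the class-score gradients into one weight per feature map, take the weighted sum of the maps, and pass it through a ReLU. The sketch below uses synthetic numpy arrays in place of real CNN activations; all names are illustrative, not from the paper.

```python
import numpy as np

def grad_cam(feature_maps: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Compute a Grad-CAM localization map.

    feature_maps: (K, H, W) activations of the final conv layer.
    gradients:    (K, H, W) gradients of the class score w.r.t. those activations.
    """
    # Channel weights: global-average-pool the gradients (one weight per map).
    alphas = gradients.mean(axis=(1, 2))              # shape (K,)
    # Weighted sum of the feature maps, then ReLU to keep positive evidence.
    cam = np.tensordot(alphas, feature_maps, axes=1)  # shape (H, W)
    cam = np.maximum(cam, 0.0)
    # Normalize to [0, 1] for visualization (guard against an all-zero map).
    if cam.max() > 0:
        cam = cam / cam.max()
    return cam

# Toy example with synthetic activations and gradients.
rng = np.random.default_rng(0)
fmaps = rng.random((8, 7, 7))
grads = rng.standard_normal((8, 7, 7))
heatmap = grad_cam(fmaps, grads)
print(heatmap.shape)  # (7, 7)
```

In practice the gradients would come from backpropagating the target class score to the last convolutional layer, and the resulting map would be upsampled onto the input image.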
ProtoPNet (Chen et al., 2019) learns a predetermined number of prototypes per category and proposes a prototypical part network with a hidden layer of prototypes representing the activated patterns. By learning these prototype-based interpretable representations, ProtoPNet explains the model's reasoning process by dissecting the query image into several prototypical parts and interpreting these prototypes with training images of the same category. Later, ProtoTree (Nauta et al., 2021) and ProtoPShare (Rymarczyk et al., 2021b) extend ProtoPNet with decision trees and prototype pruning to achieve global interpretation and reduce model complexity, respectively. Unfortunately, these studies either learn category-specific prototypes that deteriorate their generalization capacity, or only demonstrate a few illustrative examples without a quantitative evaluation of visual-based interpretability (Figure 2 shows their discovered meaningless or irrelevant prototypes despite their high recognition accuracy). Moreover, they only evaluate on two object-cropped visual datasets, leaving their performance on other general visual datasets and on samples from unseen categories unexplored. It is crucial to note that an interpretation model should be evaluated by whether it is loyal to the base model in all aspects of its outputs, rather than solely by whether it achieves higher accuracy or delivers explanations consistent with human cognition.

Contributions. The process of ProtoPNet reminds us of the bag of visual words (Csurka et al., 2004; Sivic & Zisserman, 2003), a popular image representation technique from before the deep learning era. It treats an image as a document, where visual words are defined by the keypoints/descriptors/patches used to construct vocabularies. The image can then be represented as a histogram over the occurrences of these visual words.
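The classical bag-of-visual-words representation recalled above can be sketched as follows: assign each local descriptor of an image to its nearest centroid in a pre-built codebook, then count word occurrences. This is a minimal illustration with random data standing in for real descriptors and a real (e.g. k-means) vocabulary; the names are not from the paper.

```python
import numpy as np

def bovw_histogram(descriptors: np.ndarray, codebook: np.ndarray) -> np.ndarray:
    """Represent an image as a normalized histogram over visual words.

    descriptors: (N, D) local descriptors extracted from one image.
    codebook:    (V, D) visual-word centroids (e.g. from k-means on a corpus).
    """
    # Assign each descriptor to its nearest visual word (hard assignment).
    dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
    words = dists.argmin(axis=1)  # (N,) word index per descriptor
    # Count occurrences of each word and L1-normalize the histogram.
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

rng = np.random.default_rng(1)
desc = rng.random((50, 16))   # 50 local descriptors of dimension 16
vocab = rng.random((10, 16))  # a 10-word vocabulary
h = bovw_histogram(desc, vocab)
print(h.shape)  # (10,)
```

The resulting histogram is the "document over words" view of an image that LVW revisits, with the key change that the vocabulary is learned end-to-end rather than fixed in advance.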
It is worth noting that these visual words, carrying semantic meanings shared across different images, can also be used for model interpretation, similar to the prototypes in ProtoPNet. The major difference lies in that conventional visual words are usually hand-crafted, pre-defined, and independent of the downstream learning tasks. In light of this, we propose Learnable Visual Words (LVW) to overcome the aforementioned drawbacks of prototype-based interpretative methods. Technically, our model consists of two modules: semantic visual words learning and dual fidelity preservation. Semantic visual words learning relaxes the category-specific constraint, enabling generic visual words to be shared across different categories, while dual fidelity preservation encourages the learned visual words to behave similarly to the base model in both prediction and model attention. This novel attention fidelity ensures that the learned visual words attend to the same areas of a sample image as the base deep network when it makes its predictions. Our major contributions are summarized as follows:

• In semantic visual words learning, we relax the category-specific constraint and further simplify ProtoPNet, rather than adding any new terms, to achieve cross-category visual words and increase the generalizability of model interpretation.

• In dual fidelity preservation, we encourage the learned visual words to preserve high fidelity with the base model in terms of both prediction and model attention. This dual fidelity helps our model identify an interpretation that loyally represents the base network. Additionally, we design a measurement to quantitatively evaluate visual-based interpretation.

• We demonstrate the superior effectiveness of our model on six visual benchmarks over state-of-the-art prototype-based methods in both accuracy and visual-based interpretation, and explore the generalizability of our model by interpreting unseen categories.
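One plausible form of the dual fidelity idea, matching the base model in both prediction and attention, can be sketched as a two-term objective: a cross-entropy term that fits the interpreter's class distribution to the base model's output, plus an attention term that penalizes disagreement between the two attention maps. This is a hypothetical illustration of the concept, not the paper's actual loss; all names and the specific functional form are assumptions.

```python
import numpy as np

def dual_fidelity_loss(p_interp, p_base, attn_interp, attn_base, lam=1.0):
    """Illustrative dual fidelity objective (hypothetical, not the paper's exact loss).

    p_*:    (C,) predicted class distributions of the interpreter and base model.
    attn_*: (H, W) attention maps of the interpreter and base model.
    """
    eps = 1e-12
    # Prediction fidelity: cross-entropy against the base model's output distribution.
    pred_loss = -np.sum(p_base * np.log(p_interp + eps))
    # Attention fidelity: 1 - cosine similarity between the flattened attention maps.
    a, b = attn_interp.ravel(), attn_base.ravel()
    cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps)
    attn_loss = 1.0 - cos
    return pred_loss + lam * attn_loss

# When interpreter and base model agree exactly, the attention term vanishes and
# only the entropy of the shared prediction distribution remains.
p = np.array([0.7, 0.2, 0.1])
attn = np.ones((4, 4))
loss = dual_fidelity_loss(p, p, attn, attn)
```

The weight `lam` trading off the two terms is likewise a placeholder; any concrete choice would come from the method's actual formulation.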
¹Our code is available at https://github.com/LearnableVW/Learnable-Visual-Words.

2. RELATED WORK

Visual Image Understanding. Research efforts in interpretable explanations of a Convolutional Neural Network (CNN) can be generally divided into posthoc and self-interpretable genres. Posthoc methods attempt to build an extra explainer for the pre-trained black-box model to interpret its predictions. Approaches include saliency visualization based on backpropagation (Springenberg et al., 2014; Zhou et al., 2016; Zhang et al., 2018; Bach et al., 2015; Sundararajan et al., 2017; Shrikumar et al., 2017; Smilkov et al., 2017; Fong & Vedaldi, 2017; Rebuffi et al., 2020) and activation maximization via perturbation (Simonyan et al., 2013; Zeiler & Fergus, 2014; Petsiuk et al., 2018; Fong et al., 2019; Kapishnikov et al., 2019; Dabkowski & Gal, 2017; Ancona et al., 2017) to identify the parts most influential for the black-box model's prediction. However, visualizing the salient areas does not explain how the black box makes such decisions. Other posthoc methods (Ghorbani et al., 2019; Zhang et al., 2021a; Olah et al., 2018; Akula et al., 2020; Yeh et al., 2020; Koh et al., 2020; Kim et al., 2018; Chen et al., 2020) obtain interpretable concept activation vectors from pre-segmented feature maps and interpret the CNN model with these concepts.