MULTILEVEL XAI: VISUAL AND LINGUISTIC BONDED EXPLANATIONS

Abstract

Applications of deep neural networks are booming in ever more fields, yet these models lack transparency due to their black-box nature. Explainable Artificial Intelligence (XAI) is therefore of paramount importance, proposing strategies to understand how these black-box models function. Research so far has mainly focused on producing, for example, class-wise saliency maps that highlight the parts of a given image that affect the prediction the most. However, such maps do not fully reflect the way humans explain their reasoning and, more problematically, validating them is complex and generally requires subjective interpretation. In this article, we approach XAI differently by proposing a new methodology that operates in a multilevel (i.e., visual and linguistic) manner. By leveraging the interplay between the learned representations, i.e., image features and linguistic attributes, the proposed approach provides salient attributes and attribute-wise saliency maps, which are far more intuitive than class-wise maps, without requiring per-image ground-truth human explanations. It introduces self-interpretable attributes to overcome the current limitations of XAI and bring XAI closer to human-like explanations. The proposed architecture is simple to use and reaches surprisingly good performance in both prediction and explainability for deep neural networks, thanks to the low-cost per-class attributes¹.

1. INTRODUCTION

Exciting developments in computational resources, together with a significant rise in data size, have led deep neural networks (DNNs) to be widely used in various tasks, for example image classification. Despite their excellent predictive performance, DNNs are seen as black boxes, as their decision process generally involves a huge number of parameters and nonlinearities (Gilpin et al., 2018; Hagras, 2018; Zeiler & Fergus, 2014). The lack of explanation in these black boxes hinders their direct deployment in important and sensitive domains such as medicine and autonomous driving, where human life may be directly affected (Loyola-Gonzalez, 2019; Lipton, 2018). An example is the DNNs trained to detect coronavirus. Although many works have claimed high predictive performance in detecting COVID-19 cases, a recent report from the Turing Institute (Heaven, 2021) disappointingly finds that Artificial Intelligence (AI) used to detect coronavirus had little to no benefit and may even have been harmful, mainly due to unnoticed biases in the data and the models' inherent black-box nature (see also, e.g., Roberts et al. 2021). Another example is a woman who was hit and killed by an autonomous car. An investigation showed that the death was caused by the car's inability to detect a pedestrian who was not near a crosswalk (McCausland, 2019). In addition to these life-related examples, there are plenty of others where bias in the training data or the model itself causes unwanted discrimination that may immensely affect people's lives. Amazon's AI-enabled recruitment tool is an example of how discriminatory these models can be, recommending only men and directly eliminating resumes that included the word "woman"; the company later announced that this tool had never been used to recruit people, owing to the detected bias (Olavsrud, 2022).
These examples clearly show that, for machine learning models to gain acceptance, it is critical to be able to explain why a certain decision has been made, so as to prevent any unwanted consequences. A large body of XAI work addresses this need (Springenberg et al., 2014; Zhou et al., 2016; Chattopadhay et al., 2018; Petsiuk et al., 2018; Ribeiro et al., 2016; 2018). However, the most widely used techniques, which create class-wise saliency maps (e.g. see left of Figure 1) to indicate the areas that contribute most to the prediction, have severe innate limitations. The first is the validation of these maps, which is mostly qualitative or requires labour-intensive object-wise annotations (Goebel et al., 2018; Park et al., 2018). A recent study (Bearman et al., 2016) showed that full human supervision of object segmentation takes around 78 seconds per instance, while higher-error-rate bounding boxes take 10 seconds per instance to produce; both are much more expensive than image-level annotations at 1 second per instance. Moreover, requiring a higher level of annotation by experts is rather impractical. Another limitation stems from the discrepancy between these maps and human-like explanations. Humans naturally explain their reasoning using discriminative words (e.g. domestic vs wild or weak vs strong to differentiate a cat from a lion), together with pointing to where those words apply in the given image if visually possible (Park et al., 2018; Goebel et al., 2018) (cf. our results on the right of Figure 1). This multilevel (i.e., visual and linguistic) manner is crucial for producing human-like explanations, and it inspires the work in this article. We propose a new methodology called multilevel XAI to delve into DNNs by leveraging visual and linguistic attributes. Our approach exploits per-class attributes (rather than per-image attributes, which are too expensive and generally impractical) to interpret DNNs in, e.g., classifying raw images.
By creating multilevel explanations, i.e., linguistic salient attributes and attribute-wise saliency maps, our method moves towards human-like explanations (e.g. see right of Figure 1). This is a big step forward in XAI, and the new methodology does not suffer from the above-mentioned limitations of current XAI solutions. The proposed setting adds only a tiny extra cost to the training set, i.e., per-class attributes, which can be easily obtained if needed using, for example, online search engines or autonomous tools (e.g. the GPT-3 API, Brown et al. 2020); once acquired, they can be reused indefinitely since in most cases they are time- and image-invariant. Our main contributions lie in: i) proposing a multilevel XAI methodology that is easy to use and moves towards human-like explanations; ii) implementing extensive experiments on both coarse-grained and fine-grained datasets to validate the performance of the proposed approach; and iii) conducting insightful discussions of XAI and future paths.
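To make the per-class-attribute idea concrete, the following toy sketch illustrates one plausible reading of it: image features are mapped to linguistic attribute scores, a class is predicted by comparing those scores against per-class attribute vectors, and the salient attributes of the predicted class serve as the linguistic part of the explanation. All dimensions, attribute names, and the linear map are hypothetical stand-ins (the paper's actual blocks are trained DNNs); this is not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: feature dim, number of attributes, number of classes.
d_feat, n_attr, n_cls = 512, 6, 3
attr_names = ["striped belly", "short beak", "dark wings",
              "white throat", "long tail", "red crown"]

# Per-class (not per-image) binary attribute annotations: one cheap vector
# per class, e.g. obtained once from an online source.
class_attr = rng.integers(0, 2, size=(n_cls, n_attr)).astype(float)

# Stand-in for the self-explainable block bridging image features and
# attributes; in the paper this would be a trained network, not random weights.
W = rng.standard_normal((n_attr, d_feat)) * 0.01

def predict(feat):
    """Return attribute scores, the predicted class, and salient attributes."""
    attr_scores = W @ feat                 # linguistic attribute scores
    cls_scores = class_attr @ attr_scores  # compatibility with each class
    cls = int(np.argmax(cls_scores))
    # Salient attributes: highest-scoring attributes present in that class.
    present = np.where(class_attr[cls] > 0)[0]
    salient = sorted(present, key=lambda i: -attr_scores[i])[:2]
    return attr_scores, cls, [attr_names[i] for i in salient]

feat = rng.standard_normal(d_feat)         # stand-in extracted image feature
scores, cls, salient = predict(feat)
print(cls, salient)
```

In this reading, the attribute scores double as the hook for attribute-wise saliency: gradients of each attribute score with respect to the input image would yield one saliency map per attribute, rather than a single class-wise map.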

2. METHODOLOGY OF MULTILEVEL XAI

In this section, we introduce our multilevel XAI methodology; see Figure 2 for its main architecture. It consists of three main components: i) a pre-trained feature extraction block generating high-level image features from input images (left of Figure 2); ii) a self-explainable DNN block bridging the extracted features with linguistic attributes (middle of Figure 2); and iii) a language model block



¹ Our code webpage: https://anonymous.4open.science/r/Multilevel_XAI-FBBC



Figure 1: Explainability of the proposed multilevel XAI model. A bird image (middle) from the Least Auklet class is correctly predicted by our approach, with human-like multilevel explanations via salient attributes (e.g. "striped belly") and the corresponding attribute-wise saliency maps (right). The result of Grad-CAM (Selvaraju et al., 2017) (left) is also given for comparison.

