FORMAL INTERPRETABILITY WITH MERLIN-ARTHUR CLASSIFIERS

Abstract

We propose a new type of multi-agent interactive classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between the features selected by this classifier and the true class of the data point. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express these bounds in terms of measurable metrics such as soundness and completeness. We introduce the notion of Asymmetric Feature Concentration, which relates the information carried by sets of features to that of individual features. Crucially, our bounds rely neither on optimal play by the agents nor on independently distributed features. We verify our framework through numerical experiments on image classification problems.

1. INTRODUCTION

Merlin-Arthur Classifier. Safe deployment of Neural Network (NN) based AI systems in high-stakes applications requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps (Mohseni et al., 2021) and mechanistic interpretability (Olah et al., 2018). These have had some successes, such as detecting biases in established datasets (Lapuschkin et al., 2019) or connecting individual neurons to understandable features (Carter et al., 2019). However, these approaches are motivated purely by heuristics and come without any theoretical guarantees, so their success cannot be verified. It has also been demonstrated for numerous XAI methods that they can be manipulated by a clever design of the NNs (Slack et al., 2021; 2020; Anders et al., 2020; Dimanov et al., 2020). Formal approaches to interpretability, on the other hand, run into complexity barriers when applied to NNs and require an exponential amount of time to guarantee useful properties (Macdonald et al., 2020; Ignatiev et al., 2019). This makes any "right to explanation", as codified in the EU's GDPR (Goodman & Flaxman, 2017), unenforceable. In this work, we design a classifier that guarantees feature-based interpretability under reasonable assumptions, thus overcoming both theoretical and computational shortcomings. To this end, we connect classification to the Merlin-Arthur protocol (Arora & Barak, 2009) from Interactive Proof Systems (IPS); see Figure 1. For easier illustration, we split the unreliable prover into a cooperative Merlin and an adversarial Morgana. Merlin aims to send features that cause Arthur to classify the underlying data point correctly. On the opposite side, Morgana selects features to convince Arthur of the wrong class. Arthur does not know who sent the feature and is allowed to answer "Don't know!" if he cannot discern the class.
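The protocol described above can be sketched in a few lines. The following is a minimal toy illustration, not the paper's implementation: data points are sets of visible features, the two provers are hand-written selection rules, and Arthur is a lookup that abstains on ambiguous features. All names (`CLASS_FEATURES`, `play_round`, and so on) are invented for this sketch.

```python
import random

# Toy universe (illustrative only): each image is the set of its visible
# features; "sail" occurs only on boats, "palm" only on isles, while
# "sea" and "sky" occur in both classes.
CLASS_FEATURES = {"boat": {"sail", "sea", "sky"}, "isle": {"palm", "sea", "sky"}}

def merlin(x, label):
    # Cooperative prover: reveal a feature unique to the true class.
    return next(f for f in x
                if all(f not in v for c, v in CLASS_FEATURES.items() if c != label))

def morgana(x, label):
    # Adversarial prover: reveal an ambiguous feature to confuse Arthur.
    return next(f for f in x
                if sum(f in v for v in CLASS_FEATURES.values()) > 1)

def arthur(feature):
    # Classify only if the feature pins down a single class, else abstain.
    classes = [c for c, v in CLASS_FEATURES.items() if feature in v]
    return classes[0] if len(classes) == 1 else None  # None = "Don't know!"

def play_round(label, rng):
    # Arthur does not know which prover selected the feature he receives.
    x = CLASS_FEATURES[label]
    prover = rng.choice([merlin, morgana])
    return arthur(prover(x, label))
```

In this toy setting Arthur is never fooled: Merlin's features yield the correct class and Morgana's ambiguous features only ever elicit "Don't know!".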
We can then translate the concepts of completeness and soundness from IPS to our setting. Completeness describes the probability that Arthur classifies correctly based on features from Merlin. Soundness is the probability that Arthur does not get fooled by Morgana, thus either giving the correct class or answering "Don't know!". These two quantities can be measured on a test dataset and are used to lower bound the information contained in features selected by Merlin.
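Given black-box access to the agents, both quantities can be estimated on a held-out test set. A minimal sketch, assuming the agents are callables and Arthur returns `None` for "Don't know!" (this interface is our assumption, not the paper's API):

```python
def completeness(dataset, merlin, arthur):
    """Empirical completeness: the fraction of test points (x, y) that
    Arthur classifies correctly from the feature Merlin selects."""
    return sum(arthur(merlin(x, y)) == y for x, y in dataset) / len(dataset)

def soundness(dataset, morgana, arthur):
    """Empirical soundness: the fraction of test points on which Morgana
    fails to fool Arthur, i.e. Arthur answers the true class or abstains
    (None stands for "Don't know!")."""
    return sum(arthur(morgana(x, y)) in (y, None) for x, y in dataset) / len(dataset)
```

Both estimators are plain test-set averages, so standard concentration arguments relate them to the population-level quantities used in the bounds.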

1.1. RELATED WORK

"Isle!" "Boat!" Original Images: Masked Images: In the original dataset, the features "sea" and "sky" appear equally in both classes "boat" and "island". In the new set of images that Merlin creates by masking features of the original image, the "sea" feature is visible only in the images labelled "boat" and the "sky" feature is visible only in the images labelled "island". Thus, these features now strongly indicate the class of the image. This allows Merlin to communicate the correct class with uninformative features -in contrast to our concept of an interpretable classifier. 2019). In this setup, the feature selector chooses a feature from a data point and presents it to the classifier who decides the class, see Figure 1 . The classification accuracy is meant to guarantee the informativeness of the exchanged features. P (C = "boat"|"sea") = 1 P (C = "isle"|"sea") = 0 I(C ; "sea") = 1 P (C = "boat"|"sea") = 0.5 P (C = "isle"|"sea") = 0.5 I(C ; "sea") = 0 However, it was noted by Yu et al. that the selector and the classifier can cooperate to achieve high accuracy while communicating over uninformative features, see Figure 2 for an illustration of this "cheating". Thus, one cannot hope to bound the information content of features via accuracy alone. The authors propose to include an adversarial feature classifier to remedy this fact, however do not provide any bounds. Irving et al. introduce a different adversary, and we discuss in Appendix A.4 why this approach cannot yield bounds similar to the ones in our work. Chang et al. include an adversarial selector to prevent the cheating. The reasoning is that any "cheating" strategy can be exploited by the adversary to fool the classifier into stating the wrong class, see Figure 3 for an illustration. Anil et al. investigate scenarios in which this three-player setup converges to an equilibrium of perfect completeness and soundness. However, both works assume that the players can play optimally. 
Optimal play, however, is unlikely in practice, since the general feature selection problem is hard (Waeldchen et al., 2021). It would thus amount to an exponentially expensive search, similar to the one in Chattopadhyay et al. (2022). In our work, we instead rely on the relative strength of the cooperative and the adversarial provers, both of which can be realised with imperfect strategies. It has been shown that the optimal strategy for the prover is to select the features that have the highest mutual information with the class (Chang et al., 2019). This demonstrates a strong theoretical link between a game equilibrium and feature quality. However, the authors rely on the assumption that the features are independently distributed within each class, whereas in almost all types of data there are strong correlations between features. We do not make this assumption, but instead define Asymmetric Feature Concentration as an important property of datasets for drawing conclusions about feature quality.
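Under the independence assumption of Chang et al., this optimal strategy amounts to ranking candidate features by their mutual information with the class. A hedged sketch for binary features (the interface, with tables of P(F = 1 | c) and a class prior, is our simplification, not the original formulation):

```python
from math import log2

def feature_mi(p_feat_given_class, prior):
    """I(C; F) in bits for a binary feature F, given P(F=1 | c) and P(c)."""
    p1 = sum(prior[c] * p_feat_given_class[c] for c in prior)  # P(F = 1)
    mi = 0.0
    for c, pc in prior.items():
        for f, pf in ((1, p1), (0, 1.0 - p1)):
            pcf = pc * (p_feat_given_class[c] if f else 1.0 - p_feat_given_class[c])
            if pcf > 0:
                mi += pcf * log2(pcf / (pc * pf))
    return mi

def select_feature(candidates, prior):
    # Idealised prover strategy under the independence assumption:
    # send the feature whose presence is most informative about the class.
    return max(candidates, key=lambda name: feature_mi(candidates[name], prior))
```

With correlated features this per-feature ranking breaks down, which is exactly the gap that Asymmetric Feature Concentration is introduced to address.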

1.2. CONTRIBUTION

Our framework requires few constraints on the classifier and is thus applicable to a wide range of tasks. Similarly, our results hold for general feature spaces, allowing free choice for the practitioner, e.g., parts of an image up to a certain size, or functional queries about the input.

1. We prove a lower bound on the mutual information of the exchanged features with the true class of the data point. This bound relies neither on the features being independently distributed nor on restricting the strategy of the provers.
2. We introduce Asymmetric Feature Concentration as a possible effect that complicates drawing conclusions about individual features from the informativeness of feature sets. We show how to circumvent it in Theorem 2.8 or include it explicitly as in Theorem 2.10.
3. We numerically demonstrate how the interactive setup prevents the manipulation that has been demonstrated for other XAI methods. Furthermore, we evaluate our theoretical bounds on the MNIST dataset for provers based on either Frank-Wolfe optimisers or U-Nets.



Figure 1: The Merlin-Arthur classifier consists of two interactive agents that communicate over an exchanged feature. This feature serves as an interpretation of the classification.

Figure 2: Illustration of "cheating" behaviour.


