FORMAL INTERPRETABILITY WITH MERLIN-ARTHUR CLASSIFIERS

Abstract

We propose a new type of multi-agent interactive classifier that provides provable interpretability guarantees even for complex agents such as neural networks. These guarantees consist of lower bounds on the mutual information between the features selected by this classifier and the true class. Our results are inspired by the Merlin-Arthur protocol from Interactive Proof Systems and express this bound in terms of measurable metrics such as soundness and completeness. We introduce the notion of Asymmetric Feature Concentration, which relates the information carried by sets of features to that of individual features. Crucially, our bound relies neither on optimal play by the agents nor on independently distributed features. We verify our framework through numerical experiments on image classification problems.

1. INTRODUCTION

Figure 1: The Merlin-Arthur classifier consists of two interactive agents that communicate over an exchanged feature. This feature serves as an interpretation of the classification.

Safe deployment of Neural Network (NN) based AI systems in high-stakes applications requires that their reasoning be subject to human scrutiny. The field of Explainable AI (XAI) has thus put forth a number of interpretability approaches, among them saliency maps (Mohseni et al., 2021) and mechanistic interpretability (Olah et al., 2018). These have had some successes, such as detecting biases in established datasets (Lapuschkin et al., 2019) or connecting individual neurons to understandable features (Carter et al., 2019). However, these approaches are motivated purely by heuristics and come without any theoretical guarantees; thus, their success cannot be verified. It has also been demonstrated for numerous XAI methods that they can be manipulated by a clever design of the NNs (Slack et al., 2021; 2020; Anders et al., 2020; Dimanov et al., 2020). On the other hand, formal approaches to interpretability run into complexity barriers when applied to NNs and require an exponential amount of time to guarantee useful properties (Macdonald et al., 2020; Ignatiev et al., 2019). This makes any "right to explanation," as codified in the EU's GDPR (Goodman & Flaxman, 2017), unenforceable.

In this work, we design a classifier that guarantees feature-based interpretability under reasonable assumptions, thus overcoming both theoretical and computational shortcomings. For this, we connect classification to the Merlin-Arthur protocol (Arora & Barak, 2009) from Interactive Proof Systems (IPS); see Figure 1. For easier illustration, we split the unreliable prover into a cooperative Merlin and an adversarial Morgana. Merlin aims to send features that cause Arthur to correctly classify the underlying data point. On the opposite side, Morgana selects features to convince Arthur of the wrong class. Arthur does not know who sent the feature and is allowed to say "Don't know!" if he cannot discern the class. We can then translate the concepts of completeness and soundness from IPS to our setting. Completeness describes the probability that Arthur classifies correctly based on features from Merlin. Soundness is the probability that Arthur does not get fooled by Morgana, thus
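To make the protocol concrete, the following is a minimal toy sketch of the Merlin/Morgana/Arthur interaction and of how completeness and soundness are measured empirically. It is not the paper's implementation: the majority-bit task, the `arthur`/`merlin`/`morgana` functions, and the single-feature rule are all illustrative assumptions. In the paper, Arthur may also abstain ("Don't know!"); the toy Arthur below never does, which is what makes him foolable.

```python
# Toy Merlin-Arthur protocol (illustrative, not the paper's setup).
# Data points are bit-lists; the true class is the majority bit.
# A "feature" is a (position, value) pair; Arthur sees only this feature.

def arthur(feature):
    """Arthur guesses the class from one revealed feature.
    A real Arthur could also return None for "Don't know!"."""
    pos, value = feature
    return value  # naive rule: guess the class equal to the revealed bit

def merlin(x, label):
    """Cooperative prover: reveal a feature supporting the true class."""
    for i, bit in enumerate(x):
        if bit == label:
            return (i, bit)

def morgana(x, label):
    """Adversarial prover: reveal a misleading feature, if one exists."""
    for i, bit in enumerate(x):
        if bit != label:
            return (i, bit)
    return None  # no feature of x points to the wrong class

def evaluate(dataset):
    """Empirical completeness and soundness over (x, label) pairs."""
    complete = sound = 0
    for x, label in dataset:
        # Completeness: Arthur classifies correctly from Merlin's feature.
        if arthur(merlin(x, label)) == label:
            complete += 1
        # Soundness: Arthur is not convinced of the wrong class by Morgana.
        f = morgana(x, label)
        if f is None or arthur(f) != 1 - label:
            sound += 1
    n = len(dataset)
    return complete / n, sound / n
```

On a dataset with non-unanimous inputs, this naive Arthur achieves completeness 1 but soundness below 1, since Morgana can exhibit a minority bit: the interpretability guarantee in the paper is driven by how high both metrics can be made simultaneously.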

