CONSTRAINT-DRIVEN EXPLANATIONS OF BLACK-BOX ML MODELS

Abstract

Modern machine learning techniques have enjoyed widespread success, but are plagued by a lack of transparency in their decision making, which has led to the emergence of the field of explainable AI. One popular approach, LIME, seeks to explain an opaque model's behavior by training a surrogate interpretable model to be locally faithful on perturbed instances. Despite being model-agnostic and easy to use, LIME's explanations are known to be unstable and susceptible to adversarial attacks as a result of Out-Of-Distribution (OOD) sampling. The quality of explanations is also measured heuristically, and lacks a strong theoretical foundation. In spite of numerous attempts to remedy some of these issues, making the LIME framework more trustworthy and reliable remains an open problem. In this work, we demonstrate that the OOD sampling problem stems from the rigidity of the perturbation procedure. To resolve this issue, we propose a theoretically sound framework based on uniform sampling of user-defined subspaces. Through logical constraints, we afford the end-user the flexibility to delineate the precise subspace of the input domain to be explained. This not only helps mitigate the problem of OOD sampling, but also allows experts to drill down and uncover bugs and biases hidden deep inside the model. To test the quality of generated explanations, we develop an efficient estimation algorithm that certifiably measures the true value of metrics such as fidelity up to any desired degree of accuracy, which can help in building trust in the generated explanations. Our framework, called CLIME, can be applied to any ML model, and extensive experiments demonstrate its versatility on real-world problems.

1. INTRODUCTION

Advances in Machine Learning (ML) in the last decade have resulted in new applications to safety-critical and human-sensitive domains such as driver-less cars, health, finance, education and the like (cf. Lipton, 2018). In order to build trust in automated ML decision processes, it is not sufficient to only verify properties of the model; the concerned human authority also needs to understand the reasoning behind predictions (DARPA, 2016; Goodman & Flaxman, 2017; Lipton, 2018). Highly successful models such as Neural Networks and ensembles, however, are often complex, and even experts find it hard to decipher their inner workings. Such opaque decision processes are unacceptable for safety-critical domains where a wrong decision can have serious consequences. This has led to the emergence of the field of eXplainable AI (XAI), which targets the development of both naturally interpretable models (e.g. decision trees, lists, or sets) (Hu et al., 2019; Angelino et al., 2018; Rudin, 2019; Avellaneda, 2020) as well as post-hoc explanations for opaque models (Ribeiro et al., 2016; Lundberg & Lee, 2017). Although interpretable models have been gaining traction, state-of-the-art approaches in most domains are uninterpretable and therefore necessitate post-hoc explanations. One popular post-hoc approach is the Locally Interpretable Model-agnostic Explanation (LIME) framework (Ribeiro et al., 2016), which seeks to explain individual predictions by capturing the local behaviour of the opaque model in an approximate but interpretable way. To do so, it trains a surrogate interpretable classifier to be faithful to the opaque model in a small neighborhood around a user-provided data instance. Specifically, the given data instance is perturbed to generate a synthetic data set on which an interpretable classifier such as a linear model is trained with the objective of having high fidelity to the original model.
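The perturb-then-fit procedure described above can be sketched in a few lines. The Gaussian perturbation, the toy black-box model, and the kernel width below are simplified stand-ins for illustration only; LIME's actual tabular perturbation and feature discretization are more involved:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical black-box classifier: returns a hard label for each row of X.
# (Stands in for any opaque model; not part of the original framework.)
def black_box(X):
    return (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

def lime_style_surrogate(x, n_samples=2000, kernel_width=0.75):
    """Fit a proximity-weighted linear surrogate around instance x.

    Sketch of LIME's core idea: perturb x, query the black box on the
    perturbed samples, then solve a weighted least-squares problem so the
    surrogate is faithful near x."""
    d = x.shape[0]
    Z = x + rng.normal(scale=1.0, size=(n_samples, d))  # perturbed neighborhood
    preds = black_box(Z)                                # black-box labels
    dist = np.linalg.norm(Z - x, axis=1)
    w = np.exp(-(dist ** 2) / kernel_width ** 2)        # exponential proximity kernel
    # Weighted least squares with an intercept column appended to Z.
    A = np.hstack([Z, np.ones((n_samples, 1))]) * np.sqrt(w)[:, None]
    b = preds * np.sqrt(w)
    theta, *_ = np.linalg.lstsq(A, b, rcond=None)
    return theta[:d]  # per-feature coefficients = the local explanation

coefs = lime_style_surrogate(np.array([1.0, -0.5, 0.2]))
```

On this toy model the surrogate recovers the local structure: the coefficient for feature 0 dominates, feature 1 contributes about half as much, and feature 2 (which the black box ignores) stays near zero.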
The human decision-maker can inspect the coefficients of the linear model to understand which features contributed positively or negatively to the prediction. LIME is model-agnostic as it treats the opaque model as a black box. It is applicable to a wide variety of data, such as text, tabular, and image data, and is easy to use. Nevertheless, it suffers from sensitivity to out-of-distribution sampling, which undermines trust in the explanations and makes them susceptible to adversarial attacks (Slack et al., 2020). Post-hoc explanation techniques are increasingly being used by ML experts to debug their models. In this setting, LIME does not provide the user with the flexibility to refine explanations by drilling down to bug-prone corners of the input domain. Further, LIME has limited capabilities for measuring explanation quality accurately. For instance, checking whether the generated explanation is faithful to the original model can only be done on a heuristically defined number of samples, which can be unreliable. While later works such as Anchor (Ribeiro et al., 2018) and SHAP (Lundberg & Lee, 2017) mitigate some of these issues, LIME remains a popular framework and making it more robust and reliable remains an open problem. Towards this goal, we first demonstrate that LIME's rigid perturbation procedure is at the root of many of these problems, due to its inability to capture the original data distribution. We craft concrete examples which clearly show LIME generating misleading explanations due to OOD sampling. Further, LIME affords the end-user limited control over the subspace to be explained, since the notions of 'locality' and 'neighborhood' are defined implicitly by LIME's perturbation procedure. XAI, however, is inherently human-centric, and one-size-fits-all approaches are not powerful enough to handle all user needs and application domains (Sokol et al., 2019).
Instead, we propose to generalize the LIME framework by letting the user define the specific subspace of the input domain to be explained, through the use of logical constraints. Boolean constraints can capture a wide variety of distributions, and recent advances in SAT solving and hashing technology have enabled fast solution sampling from large complex formulas (Soos et al., 2020). Making LIME's neighborhood generation flexible in this way resolves both limitations of LIME. First, our approach helps in mitigating the problem of OOD sampling. Second, it allows the user to drill down and refine explanations. For instance, it may not be sufficient for doctors to simply know which test results contributed to a model's prediction of cancer; it is also important to understand the prediction in the context of the patient's specific age group, ethnicity, etc. Such requirements can be naturally represented as constraints. In the same vein, constraints can also be used to zoom in to bug-prone corners of the input space to uncover potential problems in the model. This is especially useful for model debugging, a recent direction in ML research (Kang et al., 2020). Letting users define subspaces to be explained also necessitates a theoretically grounded method of measuring explanation quality, as a poor quality score can indicate to the user that the constraints need refining. Existing works compute these metrics heuristically, without formal guarantees of accuracy. In this light, we propose a theoretical framework and an efficient estimation algorithm that enables measurement of the true value of metrics like fidelity, up to any desired accuracy, in a model-agnostic way. Through extensive experiments, we demonstrate the scalability of our estimation framework, as well as applications of CLIME to real-world problems such as uncovering model biases and detecting adversarial attacks. In summary, our contributions are as follows: 1.
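As a toy illustration of constraint-driven neighborhood generation: the user states the subspace of interest as a conjunction of predicates, and the explainer samples uniformly from only that subspace. CLIME's actual sampler operates on Boolean constraints via SAT-based uniform samplers; the box-shaped continuous domain, feature names, and predicates below are hypothetical, and rejection sampling is used purely as a stand-in:

```python
import numpy as np

rng = np.random.default_rng(1)

# Feature ranges of a hypothetical input domain: age, BMI, glucose.
lo = np.array([20.0, 15.0, 60.0])
hi = np.array([80.0, 45.0, 200.0])

# User-defined subspace as a conjunction of logical predicates.
constraints = [
    lambda x: 40.0 <= x[0] <= 50.0,  # restrict to the 40-50 age group
    lambda x: x[2] >= 100.0,         # and to elevated glucose readings
]

def sample_subspace(n, max_tries=200_000):
    """Uniform samples from the constrained subspace via rejection sampling.

    CLIME itself uses SAT-based uniform samplers for Boolean constraints;
    this sketch only illustrates the idea on a continuous box domain."""
    out = []
    for _ in range(max_tries):
        x = rng.uniform(lo, hi)
        if all(c(x) for c in constraints):  # keep only in-subspace samples
            out.append(x)
            if len(out) == n:
                break
    return np.array(out)

Z = sample_subspace(500)  # neighborhood on which a surrogate would be trained
```

Every sample in `Z` satisfies the user's constraints by construction, so a surrogate fit on `Z` explains the model's behavior specifically on, e.g., 40-to-50-year-old patients with elevated glucose, rather than on an implicitly defined perturbation neighborhood.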
A framework for precisely crafting explanations for specific subspaces of the input domain through logical constraints

2. A theoretical framework and an efficient algorithm for estimating the 'true' value of metrics like fidelity up to any desired accuracy

3. An empirical study which demonstrates the efficacy of constraints in
• Mitigating the problem of OOD sampling
• Detecting adversarial attacks
• Zooming in and refining explanations for uncovering hidden biases

2 PRELIMINARIES

Problem formulation. We follow the notation of Ribeiro et al. (2016). Let D = (X, y) = {(x^1, y^1), (x^2, y^2), ..., (x^n, y^n)} denote the input dataset, drawn from some distribution D, where x^i ∈ R^d is a vector that captures the feature values of the i-th sample and y^i ∈ {C_0, C_1} is the corresponding class label¹. We use subscripts, i.e. x_j, to denote the j-th feature of the vector x.

¹ We focus on binary classification; the extension to multi-class classification follows by the 1-vs-rest approach.
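The estimation guarantee in contribution 2 can be realized, in its simplest model-agnostic form, by Monte Carlo sampling with a Hoeffding bound: to estimate fidelity (the probability that surrogate and black box agree on the subspace) within ±ε with confidence 1 − δ, it suffices to draw n ≥ ln(2/δ) / (2ε²) samples. The black box, surrogate, and sampler below are toy stand-ins, not the paper's actual algorithm:

```python
import math
import numpy as np

rng = np.random.default_rng(2)

def hoeffding_sample_size(eps, delta):
    """Samples needed so the empirical fidelity lies within +/- eps of the
    true fidelity with probability at least 1 - delta (Hoeffding bound)."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

def estimate_fidelity(black_box, surrogate, sampler, eps=0.05, delta=0.01):
    """(eps, delta)-estimate of agreement between black box and surrogate."""
    n = hoeffding_sample_size(eps, delta)
    Z = sampler(n)  # uniform samples from the user-defined subspace
    return float(np.mean(black_box(Z) == surrogate(Z)))

# Toy stand-ins: the two models agree except on a thin slice of the domain,
# so the true fidelity is 0.95.
black_box = lambda Z: (Z[:, 0] > 0.5).astype(int)
surrogate = lambda Z: (Z[:, 0] > 0.55).astype(int)
sampler = lambda n: rng.uniform(0.0, 1.0, size=(n, 2))

n_needed = hoeffding_sample_size(0.05, 0.01)  # 1060 samples
fid = estimate_fidelity(black_box, surrogate, sampler)
```

The bound is distribution-free and needs only black-box query access, which matches the model-agnostic setting; sharper sample complexities are possible with sequential or variance-adaptive estimators, but the Hoeffding version already gives a certified accuracy guarantee.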

