ROBUST EXPLANATION CONSTRAINTS FOR NEURAL NETWORKS

Abstract

Post-hoc explanation methods are used with the intent of providing insights about neural networks and are sometimes said to help engender trust in their outputs. However, popular explanation methods have been found to be fragile to minor perturbations of input features or model parameters. Relying on constraint relaxation techniques from non-convex optimization, we develop a method that upper-bounds the largest change an adversary can make to a gradient-based explanation via bounded manipulation of either the input features or the model parameters. By propagating a compact input or parameter set as symbolic intervals through the forward and backward computations of the neural network, we can formally certify the robustness of gradient-based explanations. Our bounds are differentiable, hence we can incorporate provable explanation robustness into neural network training. Empirically, our method surpasses the robustness provided by previous heuristic approaches. We find that our training method is the only one able to learn neural networks with certificates of explanation robustness across all six datasets tested.

1. INTRODUCTION

Providing explanations for automated decisions is a principal way to establish trust in machine learning models. In addition to engendering trust with end-users, explanations that reliably highlight key input features provide important information to machine learning engineers, who may use them to aid in model debugging and monitoring (Pinto et al., 2019; Adebayo et al., 2020; Bhatt et al., 2020a;b). The importance of explanations has led regulators to consider them as a potential requirement for deployed decision-making algorithms (Gunning, 2016; Goodman & Flaxman, 2017). Unfortunately, deep learning models can return significantly different explanations for nearly identical inputs (Dombrowski et al., 2019), which erodes user trust in the underlying model and leads to a pessimistic outlook on the potential of explainability for neural networks (Rudin, 2019). Developing models that provide robust explanations, i.e., that provide similar explanations for similar inputs, is vital to ensuring that explainability methods yield beneficial insights (Lipton, 2018). Current works evaluate the robustness of explanations in an adversarial setting by finding minor manipulations to the deep learning pipeline which cause the worst-case (e.g., largest) changes to the explanation (Dombrowski et al., 2019; Heo et al., 2019). It has been shown that imperceptible changes to an input can fool explanation methods into placing importance on arbitrary features (Dombrowski et al., 2019; Ghorbani et al., 2019) and that model parameters can be manipulated to globally corrupt which features are highlighted by explanation methods (Heo et al., 2019). While practices for improving explanation robustness focus on heuristic methods to avoid current attacks (Dombrowski et al., 2022; Wang et al., 2020; Chen et al., 2019), it is well-known that adversaries can develop more sophisticated attacks to evade these robustness measures (Athalye et al., 2018).
To counter this, our work establishes the first framework for general neural networks which guarantees that, for any minor perturbation to a given input's features or to the model's parameters, the change in the explanation is bounded. Our bounds are formulated over the input-gradient of the model, which is a common source of information for explanations (Sundararajan et al., 2017; Wang et al., 2020). Our guarantees constitute a formal certificate of robustness for a neural network's explanations at a given input, which can provide users, developers, and regulators with a heightened sense of trust. Further, while it is known that explanations of current neural networks are not robust, the differentiable nature of our method allows us to incorporate provable explanation robustness as a constraint at training time, yielding models with significantly heightened explanation robustness. Formally, our framework abstracts all possible manipulations to an input's features or a model's parameters into a hyper-rectangle, a common abstraction in the robustness literature (Mirman et al., 2018; Gowal et al., 2018). We extend known symbolic interval analysis techniques in order to propagate hyper-rectangles through both the forward and backward pass operations of a neural network. The result of our method is a hyper-rectangle over the space of explanations that is guaranteed to contain every explanation reachable by an adversary who perturbs features or parameters within the specified bounds. We then provide techniques that prove that all explanations in the reachable explanation set are sufficiently similar, and thus that no successful adversarial attack exists. Noticing that smaller reachable explanation sets imply more robust explanations, we introduce a novel regularization scheme which allows us to minimize the size of the explanation set during parameter inference.
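The interval propagation described above can be sketched for a one-hidden-layer ReLU network. The helper names and the use of NumPy below are illustrative assumptions, not the paper's implementation; the sketch only shows the standard center/radius interval arithmetic applied to the forward pass and then to the backward (input-gradient) pass:

```python
import numpy as np

def linear_ibp(l, u, W, b):
    """Propagate the box [l, u] through x -> W @ x + b via center/radius."""
    c, r = (l + u) / 2, (u - l) / 2
    oc = W @ c + b
    orad = np.abs(W) @ r
    return oc - orad, oc + orad

def grad_interval(x_l, x_u, W1, b1, w2):
    """Box guaranteed to contain the input-gradient of
    f(x) = w2 . relu(W1 @ x + b1) for every x in [x_l, x_u]."""
    # Forward pass: bounds on the pre-activations.
    z_l, z_u = linear_ibp(x_l, x_u, W1, b1)
    # ReLU derivative per hidden unit: 1 if surely active, 0 if surely
    # inactive, and the whole interval [0, 1] if the sign is uncertain.
    s_l = (z_l > 0).astype(float)
    s_u = (z_u > 0).astype(float)
    # Elementwise interval product s * w2 (w2 entries may have either sign).
    g_l = np.minimum(s_l * w2, s_u * w2)
    g_u = np.maximum(s_l * w2, s_u * w2)
    # Backward pass through W1^T, again via center/radius.
    return linear_ibp(g_l, g_u, W1.T, np.zeros(W1.shape[1]))
```

The same layer-by-layer trick extends to deeper networks: each backward operation is itself a linear or monotone map, so the resulting box over the explanation space contains every gradient an adversary can reach within the input set.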
Analogous to state-of-the-art robustness training (Gowal et al., 2018), this allows users to specify input sets and parameter sets as explainability constraints at training time. Empirically, we test our framework on six datasets of varying complexity, from tabular datasets in financial applications to medical imaging datasets. We find that our method outperforms state-of-the-art methods for improving explanation robustness and is the only method that allows for certified explanation robustness even on full-color medical image classification. We highlight the following contributions:
• We instantiate a framework for bounding the largest change to an explanation that is based on the input gradient, thereby certifying that no adversarial explanation exists for a given set of inputs and/or model parameters.
• We compute explicit bounds relying on interval bound propagation, and show that these bounds can be used to regularize neural networks during learning with robust explanation constraints.
• Empirically, our framework allows us to certify the robustness of explanations and to train networks with robust explanations across six different datasets ranging from financial applications to medical image classification.
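The training-time constraint in the second contribution can be sketched as a regularized objective. The penalty weight and choice of norm below are illustrative assumptions rather than the paper's exact formulation:

```latex
% Sketch of a regularized objective; \lambda and the l1 norm are assumptions.
% [\underline{g}_\theta(x), \overline{g}_\theta(x)] denotes the certified box
% containing every input-gradient reachable under the allowed perturbations.
\min_\theta \; \mathbb{E}_{(x,y)} \Big[ \ell\big(f_\theta(x), y\big)
  + \lambda \, \big\| \overline{g}_\theta(x) - \underline{g}_\theta(x) \big\|_1 \Big]
```

Because the interval bounds are differentiable in the parameters, the second term can be minimized by the same gradient-based optimizer as the task loss, shrinking the reachable explanation set during training.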

2. RELATED WORK

Adversarial examples, inputs that have been imperceptibly modified to induce misclassification, are a well-known vulnerability of neural networks (Szegedy et al., 2013; Goodfellow et al., 2015; Madry et al., 2018). A significant amount of research has studied methods for proving that no adversary can change the prediction for a given input and perturbation budget (Tjeng et al., 2017; Weng et al., 2018; Fazlyab et al., 2019). More recently, analogous attacks on gradient-based explanations have been explored (Ghorbani et al., 2019; Dombrowski et al., 2019; Heo et al., 2019). In (Ghorbani et al., 2019; Dombrowski et al., 2019) the authors investigate how to maximally perturb the explanation using first-order methods similar to the projected gradient descent attack on predictions (Madry et al., 2018). In (Heo et al., 2019; Slack et al., 2020; Lakkaraju & Bastani, 2020; Dimanov et al., 2020) the authors investigate perturbing model parameters rather than inputs, with the goal of finding a model that globally produces corrupted explanations while maintaining model performance. This attack on explanations has been extended to a worrying use-case: disguising model bias (Dimanov et al., 2020). Methods that seek to remedy the lack of robustness work either by modifying the training procedure (Dombrowski et al., 2022; Wang et al., 2020) or by modifying the explanation method (Anders et al., 2020; Wang et al., 2020; Si et al., 2021). Methods that modify the model include regularizing the Hessian norm during training in order to penalize the principal curvatures (Dombrowski et al., 2022; Wang et al., 2020). Further work suggests attributional adversarial training, searching for the points that maximize the distance between local gradients (Chen et al., 2019; Ivankay et al., 2020; Singh et al., 2020; Lakkaraju et al., 2020).
Works that seek to improve explanations in a model-agnostic way include smoothing the gradient by adding random noise (Smilkov et al., 2017; Sundararajan et al., 2017) or using an ensemble of explanation methods (Rieger & Hansen, 2020). The above defenses cannot rule out the potential success of more sophisticated adversaries; indeed, it is well-known that heuristic approaches proposed for improving the robustness of explanations can be evaded by adaptive attacks.
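The gradient-smoothing defense mentioned above (SmoothGrad) can be sketched as follows; `grad_fn`, a callable returning the model's input-gradient, is an assumed interface for illustration:

```python
import numpy as np

def smoothgrad(grad_fn, x, sigma=0.1, n_samples=50, seed=0):
    """SmoothGrad: average the input-gradient over Gaussian-perturbed
    copies of the input, yielding a less noisy saliency map.

    grad_fn: assumed callable mapping an input array to the model's
    input-gradient (same shape as the input)."""
    rng = np.random.default_rng(seed)
    total = np.zeros_like(x)
    for _ in range(n_samples):
        total += grad_fn(x + sigma * rng.standard_normal(x.shape))
    return total / n_samples
```

Note that such averaging smooths the explanation empirically but, unlike the certification approach developed in this work, provides no guarantee against a worst-case perturbation.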

