INTERPRETABILITY THROUGH INVERTIBILITY: A DEEP CONVOLUTIONAL NETWORK WITH IDEAL COUNTERFACTUALS AND ISOSURFACES

Abstract

Current state-of-the-art computer vision applications rely on highly complex models. Their interpretability is mostly limited to post-hoc methods, which are not guaranteed to be faithful to the model. To elucidate a model's decision, we present a novel interpretable model based on an invertible deep convolutional network. Our model generates meaningful, faithful, and ideal counterfactuals. Using PCA on the classifier's input, we can also create "isofactuals": image interpolations that yield the same outcome but exhibit visually distinct features. Counter- and isofactuals can be used to identify positive and negative evidence in an image, which can also be visualized with heatmaps. We evaluate our approach against gradient-based attribution methods, which we find to produce meaningless adversarial perturbations. Using our method, we reveal biases in three different datasets. In a human subject experiment, we test whether non-experts find our method useful for spotting spurious correlations learned by a model. Our work is a step towards more trustworthy explanations for computer vision.

1. INTRODUCTION

The lack of interpretability is a significant obstacle to adopting Deep Learning in practice. As deep convolutional neural networks (CNNs) can fail in unforeseeable ways, are susceptible to adversarial perturbations, and may reinforce harmful biases, companies rightly refrain from automating high-risk applications without understanding the underlying algorithms and the patterns used by the model. Interpretable Machine Learning aims to discover insights into how a model makes its predictions. For image classification with CNNs, a common explanation technique is the saliency map, which estimates the importance of individual image areas for a given output. The underlying assumption, that users studying local explanations can obtain a global understanding of the model (Ribeiro et al., 2016), was, however, refuted. Several user studies demonstrated that saliency explanations did not significantly improve users' task performance, trust calibration, or model understanding (Kaur et al., 2020; Adebayo et al., 2020; Alqaraawi et al., 2020; Chu et al., 2020). Alqaraawi et al. (2020) attributed these shortcomings to the inability to highlight global image features or absent ones, making it difficult to provide counterfactual evidence. Even worse, many saliency methods fail to represent the model's behavior faithfully (Sixt et al., 2020; Adebayo et al., 2018; Nie et al., 2018). While no commonly agreed definition of faithfulness exists, it is often characterized by describing what an unfaithful explanation is (Jacovi & Goldberg, 2020): for example, a method is unfaithful if it fails to create the same explanations for identically behaving models. To ensure faithfulness, previous works have proposed building networks with interpretable components (e.g. ProtoPNet (Chen et al., 2018) or Brendel & Bethge (2018)) or mapping network activations to human-defined concepts (e.g. TCAV (Kim et al., 2018)).
However, these interpretable network components mostly rely on fixed-size patches, and concepts have to be defined a priori. Here, we argue that explanations should neither be limited to patches nor rely on a priori knowledge. Instead, users should discover hypotheses in the input space themselves, using faithful counterfactuals that are ideal, i.e. samples that exhibit changes that directly and exclusively correspond to changes in the network's prediction (Wachter et al., 2018). We can guarantee this property by combining an invertible deep neural network z = ϕ(x) with a linear classifier y = wᵀϕ(x) + b. This yields three major advantages: 1) the model is powerful (it can approximate any function (Zhang et al., 2019)); 2) the weight vector w of the classifier directly and faithfully encodes the feature importance of a target class y in the z feature space; 3) human-interpretable explanations can be obtained by simply inverting explanations for the linear classifier back to input space. As a local explanation for one sample x, we generate ideal counterfactuals by altering its feature representation z along the direction of the weight vector: z̃ = z + αw. The logit score can be manipulated directly via α. Inverting z̃ back to input space yields a human-understandable counterfactual x̃ = ϕ⁻¹(z + αw). Any change orthogonal to w creates an "isofactual", a sample that looks different but results in the same prediction. While many vectors are orthogonal to w, we find the directions that explain the most variance of the features z using PCA. As the principal components explain all variance of the features, they can be used to summarize the model's behavior globally. We demonstrate the usefulness of our method on a broad range of evaluations. We compared our approach to gradient-based saliency methods and found that gradient-based counterfactuals are not ideal, as they also change irrelevant features.
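The PCA-based isofactual directions can be sketched in a few lines of numpy. Here Z and w are hypothetical stand-ins for the trained model's feature matrix and classifier weights; the top principal component is projected orthogonal to w so that moving along it leaves the logit unchanged:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature matrix Z (rows = phi(x) for a batch of inputs) and
# classifier weights w; both would come from the trained invertible model.
Z = rng.normal(size=(500, 6)) @ rng.normal(size=(6, 6))
w = rng.normal(size=6)

# Isofactual direction: top principal component of the features, with any
# component along w projected out so the prediction stays fixed.
Zc = Z - Z.mean(axis=0)
_, _, Vt = np.linalg.svd(Zc, full_matrices=False)
v = Vt[0] - (Vt[0] @ w) / (w @ w) * w   # top PC, made orthogonal to w
v /= np.linalg.norm(v)

z = Z[0]
z_iso = z + 3.0 * v                     # large change in feature space ...
print(np.isclose(w @ z_iso, w @ z))     # ... identical logit: prints True
```

In the full method this direction would be mapped back through ϕ⁻¹ to obtain the isofactual image.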
We evaluated our method on three datasets, which allowed us to create hypotheses about potential biases in all three. After statistical evaluation, we confirmed that these biases existed. Finally, we evaluated our method's utility against a strong baseline of example-based explanations in an online user study. We confirmed that participants could identify the patterns relevant to the model's output and reject irrelevant ones. This work demonstrates that invertible neural networks provide interpretability that conceptually stands out against the more commonly used alternatives.

2. METHOD

Throughout this work, we rely on the following definitions, which are based on Wachter et al. (2018):

Definition 2.1 (Counterfactual Example). Given a data point x and its prediction y, a counterfactual example is an alteration of x, defined as x̃ = x + ∆x, with an altered prediction ỹ = y + ∆y where ∆y ≠ 0. Samples x̃ with ∆y = 0 are designated "isofactuals".

Almost any ∆x will match the counterfactual definition, including those that additionally change aspects unrelated to the model's prediction, e.g. removing an object but also changing the background's color. It is desirable to isolate the change most informative about a prediction:

Definition 2.2 (Ideal Counterfactual). Given a set of unrelated properties ξ(x) = {ξᵢ(x)}, a sample x̃ is called an ideal counterfactual of x if all unrelated properties ξᵢ remain the same.

The following paragraphs describe how we generate explanations using an invertible neural network ϕ : Rⁿ → Rⁿ. The forward function ϕ maps a data point x to a feature vector z = ϕ(x). Since ϕ is invertible, one can regain x by applying the inverse x = ϕ⁻¹(z). We used the features z to train a binary classifier f(x) = wᵀz + b that predicts the label y. In addition to the supervised loss, we also trained ϕ as a generative model (Dinh et al., 2015; 2016) to ensure that the inverted samples are human-understandable.

Counterfactuals. To create a counterfactual example x̃ for a data point x, we can exploit the fact that w encodes feature importance in the z-space directly. To change the logit score of the classifier, we simply add the weight vector to the features z and then invert the result back to the input space: x̃ = ϕ⁻¹(z + αw). Hence, for any sample x, we can create counterfactuals x̃ with an arbitrary change in logit value ∆y = αwᵀw by choosing α accordingly. Figure 1a shows several such examples.
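As a minimal sketch of this construction, the following uses an orthogonal linear map as a toy stand-in for the invertible network ϕ (the actual model is a deep invertible network trained as a flow); it checks that moving along w in feature space shifts the logit by exactly αwᵀw:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4

# Toy stand-in for the invertible network phi: an orthogonal linear map.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
phi = lambda x: Q @ x
phi_inv = lambda z: Q.T @ z             # inverse of an orthogonal map

w = rng.normal(size=d)                  # linear classifier weights in z-space
b = 0.1
logit = lambda x: w @ phi(x) + b

x = rng.normal(size=d)
z = phi(x)

alpha = 0.5
x_cf = phi_inv(z + alpha * w)           # counterfactual in input space

# The logit shifts by exactly alpha * w^T w, controllable through alpha:
delta_y = logit(x_cf) - logit(x)
assert np.isclose(delta_y, alpha * (w @ w))
```

Because the same map is used for prediction and generation, the logit offset holds exactly rather than approximately, which is the faithfulness property argued for below.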
Since the generation (ϕ⁻¹) and prediction (ϕ) are performed by the same model, we know that x̃ will correspond exactly to the logit offset αwᵀw. Consequently, x̃ is a faithful explanation. To show that our counterfactuals are ideal, we have to verify that no property unrelated to the prediction is changed. For such a property ξ(x) = vᵀz, v has to be orthogonal to w.¹ As the unrelated property ξ does not change for the counterfactual, ξ(x̃) = vᵀ(z + αw) = vᵀz = ξ(x), we know that x̃ = ϕ⁻¹(z + αw) is indeed an ideal counterfactual.
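This ideality argument can be verified numerically. The sketch below uses hypothetical w and z vectors and builds an orthonormal basis of w's orthogonal complement via a QR decomposition; every linear property in that complement is preserved under the counterfactual shift z + αw, while the logit moves by exactly αwᵀw:

```python
import numpy as np

rng = np.random.default_rng(2)
d = 8
w = rng.normal(size=d)                  # hypothetical classifier weights
z = rng.normal(size=d)                  # hypothetical feature vector
alpha = 1.5
z_cf = z + alpha * w                    # counterfactual features

# Orthonormal basis of the subspace orthogonal to w (QR of [w | random]):
Q, _ = np.linalg.qr(np.column_stack([w, rng.normal(size=(d, d - 1))]))
V = Q[:, 1:]                            # columns span w's orthogonal complement

# Every unrelated linear property xi(x) = v^T z is preserved ...
assert np.allclose(V.T @ z_cf, V.T @ z)
# ... while the logit moves by exactly alpha * w^T w:
assert np.isclose(w @ z_cf - w @ z, alpha * (w @ w))
```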



¹ ξ(x) could also be non-linear in the features z, as long as its gradient ∂ξ/∂z is orthogonal to w.

Availability

https://anonymous.4open.science/r/ae263acc-aad1-42f8

