CONCEPT GRADIENTS: CONCEPT-BASED INTERPRETATION WITHOUT LINEAR ASSUMPTION

Abstract

Concept-based interpretations of black-box models are often more intuitive for humans to understand than feature-based counterparts. The most widely adopted approach for concept-based gradient interpretation is Concept Activation Vector (CAV). CAV relies on learning linear relations between some latent representations of a given model and concepts. The premise that meaningful concepts lie in a linear subspace of model layers is usually implicitly assumed but does not hold in general. In this work we propose Concept Gradients (CG), which extends concept-based gradient interpretation methods to non-linear concept functions. We show that for a general (potentially non-linear) concept, we can mathematically measure how a small change of the concept affects the model's prediction, which extends gradient-based interpretation to the concept space. We demonstrate empirically that CG outperforms CAV in evaluating concept importance on real-world datasets, and we perform a case study on a medical dataset. The code is available at github.com/jybai/concept-gradients.

1. INTRODUCTION

Explaining the prediction mechanism of machine learning models is important, not only for debugging and gaining trust, but also for humans to learn from and actively interact with them. Many feature attribution methods have been developed to attribute importance to input features for the prediction of a model (Sundararajan et al., 2017; Zeiler & Fergus, 2014). However, input feature attribution may not be ideal when the input features themselves are not intuitive for humans to understand. It is then desirable to generate explanations with human-understandable concepts instead, motivating the need for concept-based explanation. For instance, to understand a machine learning model that classifies bird images into fine-grained species, attributing importance to high-level concepts such as body color and wing shape explains the predictions better than attributing importance to raw pixel values (see Figure 1). The most popular approach for concept-based interpretation is Concept Activation Vector (CAV) (Kim et al., 2018). CAV represents a concept with a vector in some layer of the target model and evaluates the sensitivity of the target model's output along the concept vector's direction. Many follow-up works build on CAV and share the same fundamental assumption that concepts can be represented as a linear function in some layer of the target model (Ghorbani et al., 2019; Schrouff et al., 2021). This assumption generally does not hold, however, and it limits the application of concept-based attribution to relatively simple concepts. Another problem is the CAV concept importance score. The score is defined as the inner product of the CAV and the input gradient of the target model. An inner product captures correlation, not causation, yet concept importance scores are often read causally, as explaining why the model predicts what it does. In this paper, we rethink the problem of concept-based explanation and address these two weak points of CAV.
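To make the linear assumption concrete, the CAV recipe can be sketched as follows: fit a linear separator between activations of concept-positive and concept-negative examples, take its weight vector as the concept direction, and score importance by the directional derivative of the model output along that vector. This is a toy sketch, not the original implementation: the least-squares probe stands in for the linear classifier of Kim et al. (2018), and the activations and input gradient (`grad_h`) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy activations: concept-positive examples are shifted along the
# first axis, concept-negative examples are not (hypothetical data).
d = 8
pos = rng.normal(size=(100, d)) + np.array([2.0] + [0.0] * (d - 1))
neg = rng.normal(size=(100, d))

# Linear probe: a least-squares fit on +/-1 labels stands in for the
# linear classifier; its normalized weight vector is the CAV.
A = np.vstack([pos, neg])
y = np.concatenate([np.ones(len(pos)), -np.ones(len(neg))])
v, *_ = np.linalg.lstsq(A, y, rcond=None)
v /= np.linalg.norm(v)

# Conceptual sensitivity: directional derivative of the model output
# along the CAV.  grad_h is a hypothetical input gradient at one sample.
grad_h = np.array([1.0] + [0.0] * (d - 1))
sensitivity = grad_h @ v
```

Since the toy concept signal lies entirely along the first axis, the learned CAV points mostly in that direction and the sensitivity is large and positive; when the concept is not linearly separable in the chosen layer, no such vector exists, which is exactly the limitation discussed above.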
We relax the linear assumption by modeling concepts with more complex, non-linear functions (e.g. neural networks). To address the causation problem, we extend the idea of taking gradients from feature-based interpretation. Gradient-based feature interpretation assigns importance to input features by estimating input gradients, i.e., taking the derivative of the model output with respect to the input features. Input features corresponding to larger input gradients are considered more important. A question naturally arises: is it possible to extend the notion of input gradients to "concept" gradients? Can we take the derivative of the model output with respect to post-hoc concepts when the model is not explicitly trained to take concepts as inputs? We answer this question in the affirmative and formulate Concept Gradients (CG). CG mathematically measures how small changes in a concept affect the model's prediction. Given any target function and (potentially non-linear) concept function, CG first computes the input gradients of both functions. CG then combines the two input gradients with the chain rule to imitate taking the derivative of the target with respect to the concept through the shared input. The idea is to capture how the target function changes locally according to the concept. If there exists a unique function that maps the concept to the target function output, CG exactly recovers the gradient of that function. If the mapping is not unique, CG captures the gradient of the mapping function with the minimum gradient norm, which avoids overestimating concept importance. We discover that when the concept function is linear, CG recovers CAV (up to a slightly different scaling factor), which explains why CAV works well in linearly separable cases. We show on real-world datasets that the linear separability assumption of CAV does not always hold and that CG consistently outperforms CAV.
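The chain-rule combination described above can be sketched on a toy example. Writing f(x) = h(g(x)) gives grad_x f = J_g(x)^T dh/dc, and one way to realize the minimum-gradient-norm solution is the Moore-Penrose pseudoinverse. The target f(x) = 5·x₀² and concept g(x) = x₀² below are hypothetical, with hand-written gradients for clarity; this is a sketch of the idea, not the paper's implementation.

```python
import numpy as np

def grad_f(x):
    # Input gradient of a toy target f(x) = 5 * x[0]**2.
    return np.array([10.0 * x[0], 0.0])

def jac_g(x):
    # Jacobian of the (non-linear) concept map g(x) = (x[0]**2,), shape (m, d).
    return np.array([[2.0 * x[0], 0.0]])

def concept_gradients(x):
    # Chain rule through the shared input: grad_x f = J_g(x)^T dh/dc,
    # so the minimum-norm solution is dh/dc = pinv(J_g(x)^T) @ grad_x f.
    return np.linalg.pinv(jac_g(x).T) @ grad_f(x)

x = np.array([1.0, 0.0])
cg = concept_gradients(x)  # f = 5 * g here, so the concept score recovers 5
```

Since the target is exactly 5 times the concept, the recovered concept gradient is 5 regardless of the evaluation point, even though the concept itself is a non-linear function of the input.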
On these datasets, the average best local recall@30 of CG exceeds that of CAV by 7.9%, and the global recall@30 by 21.7%.
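Recall@k here can be read as: rank concepts by attributed importance and measure the fraction of ground-truth concepts recovered among the top k (k = 30 in the reported numbers). A minimal sketch with hypothetical concept names and scores:

```python
# Hypothetical importance scores from an attribution method, and the
# set of ground-truth concepts for one sample (illustrative only).
scores = {"wing_shape": 0.9, "body_color": 0.7, "beak_length": 0.2, "leg_color": 0.1}
ground_truth = {"wing_shape", "body_color", "leg_color"}

def recall_at_k(scores, ground_truth, k):
    # Fraction of ground-truth concepts among the k highest-scored ones.
    top_k = set(sorted(scores, key=scores.get, reverse=True)[:k])
    return len(ground_truth & top_k) / len(ground_truth)

r = recall_at_k(scores, ground_truth, 2)  # 2 of the 3 true concepts ranked in the top 2
```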

2. PRELIMINARIES

Problem definition In this paper we use f : R^d → R^k to denote the machine learning model to be explained, x ∈ R^d to denote an input, and y ∈ R^k to denote the label. For an input sample x ∈ R^d, concept-based explanation aims to explain the prediction f(x) based on a set of m concepts {c_1, . . . , c_m}. In particular, the goal is to reveal how important each concept is to the prediction. Concepts can be given in different ways, but in the most general form, we can consider a concept as a function mapping from the input space to the concept space, denoted g : R^d → R. The function can be explicit or implicit. For example, morphological concepts such as cell perimeter and circularity can be given as explicit functions defined for explaining lung nodule prediction models. On the other hand, many concept-based explanations in the literature consider concepts given as a set of examples, which are finite observations from the underlying concept function. We further assume



Figure 1: Comparison of feature-based interpretation heatmap (left: Integrated Gradients) and concept-based importance score (right: Concept Gradients) for the model prediction of "Black footed Albatross". Attribution to high-level concepts is more informative to humans than raw pixels.

