CONCEPT GRADIENTS: CONCEPT-BASED INTERPRETATION WITHOUT LINEAR ASSUMPTION

Abstract

Concept-based interpretations of black-box models are often more intuitive for humans to understand than feature-based counterparts. The most widely adopted approach for concept-based gradient interpretation is the Concept Activation Vector (CAV). CAV relies on learning linear relations between latent representations of a given model and concepts. The premise that meaningful concepts lie in a linear subspace of model layers is usually implicitly assumed but does not hold true in general. In this work we propose Concept Gradients (CG), which extends concept-based gradient interpretation methods to non-linear concept functions. We show that for a general (potentially non-linear) concept, we can mathematically measure how a small change in the concept affects the model's prediction, which is an extension of gradient-based interpretation to the concept space. We demonstrate empirically that CG outperforms CAV in evaluating concept importance on real-world datasets and perform a case study on a medical dataset. The code is available at github.com/jybai/concept-gradients.

1. INTRODUCTION

Explaining the prediction mechanism of machine learning models is important, not only for debugging and gaining trust, but also for humans to learn from and actively interact with them. Many feature attribution methods have been developed to attribute importance to input features for a model's prediction (Sundararajan et al., 2017; Zeiler & Fergus, 2014). However, input feature attribution may not be ideal when the input features themselves are not intuitive for humans to understand. It is then desirable to generate explanations with human-understandable concepts instead, motivating the need for concept-based explanation. For instance, to understand a machine learning model that classifies bird images into fine-grained species, attributing importance to high-level concepts such as body color and wing shape explains the predictions better than input features of raw pixel values (see Figure 1). The most popular approach for concept-based interpretation is the Concept Activation Vector (CAV) (Kim et al., 2018). CAV represents a concept with a vector in some layer of the target model and evaluates the sensitivity of the target model's gradient in the concept vector's direction. Many follow-up works are based on CAV and share the same fundamental assumption that concepts can be represented as a linear function in some layer of the target model (Ghorbani et al., 2019; Schrouff et al., 2021). This assumption generally does not hold, however, and it limits the application of

