INTERPRETING NEURAL NETWORKS THROUGH THE LENS OF HEAT FLOW

Anonymous

Abstract

Machine learning models are often developed in a way that prioritizes task-specific performance but defers the understanding of how they actually work. This is especially true nowadays for deep neural networks. In this paper, we step back and consider the basic problem of understanding a learned model represented as a smooth scalar-valued function. We introduce HeatFlow, a framework based upon the heat diffusion process for interpreting the multi-scale behavior of the model around a test point. At its core, our approach looks into the heat flow initialized at the function of interest, which generates a family of functions with increasing smoothness. By applying differential operators to these smoothed functions, summary statistics (i.e., explanations) characterizing the original model at different scales can be drawn. We place an emphasis on studying the heat flow on the data manifold, where the model is trained and expected to be well behaved. Numerical approximation procedures for implementing the proposed method in practice are discussed and demonstrated on image recognition tasks.

1. INTRODUCTION

In recent years, thanks to the growing availability of computational power and data, together with rapid methodological advances, the machine learning community has witnessed the creation of models of ever higher capacity and performance. However, a downside of scaling model complexity is that it complicates understanding how learned models work and why they sometimes fail. Requirements for interpretability arise in both scientific research and engineering practice. Carefully interpreting the working mechanism of a predictive model may help uncover weaknesses in its robustness, informing improvements that should be made before deployment in high-stakes decision-making. In this paper, we consider the interpretation of scalar-valued smooth functions, a basic hypothesis class in machine learning. Models of this type arise naturally in regression and binary classification tasks with continuous input features. Multi-output models, e.g., neural networks for multiclass classification, can be treated as such functions by investigating each output separately. While 1D and 2D functions can be understood intuitively through graphical visualization, there is no straightforward way to visualize, or even imagine, general higher-dimensional functions. Fortunately, mathematicians developed the derivative to interpret functions in a pointwise manner. The directional derivative at a point measures the instantaneous rate of change of the function along a given direction, and the gradient gives the direction of steepest ascent. This forms the basis of popular gradient-based explanation methods for neural networks (Simonyan et al., 2014; Selvaraju et al., 2017; Sundararajan et al., 2017; Smilkov et al., 2017; Ancona et al., 2018; Erion et al., 2021; Xu et al., 2020; Hesse et al., 2021; Srinivas & Fleuret, 2021; Kapishnikov et al., 2021).
To interpret the outcome of a learned high-dimensional function at a test point, the gradient there is only part of the story, because it characterizes only the first-order behavior of the function in an infinitesimal range. This extreme localness is to blame for several known pitfalls of vanilla gradient-based interpretation. For example, if the point falls into a locally constant region, the gradient will be zero (Shrikumar et al., 2017). On the other hand, the gradient may change dramatically even between nearby points, leading to noisy and non-robust explanations in practice (Dombrowski et al., 2019; Wang et al., 2020). Moreover, the gradient is also zero at various classes of critical points, suggesting the need for higher-order derivatives. To this end, we introduce HeatFlow, an interpretation framework for summarizing the behavior of a learned model at different scales from the point of view of a test input. Our approach is motivated by a natural question to ask about a function f: how much does the value of f at a point x deviate from the average value of f in a neighborhood of x? In our opinion, such deviation from a local average is more comprehensible to a non-mathematical audience than an instantaneous rate of change. If the neighborhood is taken to be a small open ball centered at x, then the answer is related to $\Delta f(x)$, where $\Delta$ is the Laplace operator, a fundamental second-order differential operator. To consider increasingly larger neighborhoods in a multi-scale manner, we propose solving a heat equation, a fundamental partial differential equation (PDE) studied in mathematics and physics. We show that a detailed interpretation of a function of interest can be achieved by extracting a rich set of principled summary statistics from the solution of the heat equation initialized at that function.
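The deviation-from-local-average idea can be sketched numerically in the Euclidean case. Below is a minimal Monte Carlo illustration for an assumed toy model $f(x) = \|x\|^2$ (chosen because its Laplacian is the constant $2D$, so under the convention $\partial_t u = \Delta u$ the local average $\mathbb{E}[f(X_t) \mid X_0 = x]$ exceeds $f(x)$ by exactly $t \cdot \Delta f(x)$); it is not the paper's implementation, only an illustration of the question posed above.

```python
import numpy as np

# Toy Euclidean sketch: for f(x) = ||x||^2 the Laplacian is 2*D everywhere,
# so the Brownian local average deviates from f(x) by t * Laplacian(f)(x).
rng = np.random.default_rng(0)
D = 3

def f(x):
    return np.sum(x**2, axis=-1)

def local_average(x, t, n=200_000):
    # Monte Carlo estimate of E[f(X_t) | X_0 = x], with X_t = x + sqrt(2t) Z
    # matching the convention du/dt = Laplacian(u)
    Z = rng.standard_normal((n, D))
    return f(x + np.sqrt(2.0 * t) * Z).mean()

x = np.array([1.0, -0.5, 2.0])
t = 0.1
deviation = local_average(x, t) - f(x)
print(deviation, 2 * D * t)  # the two values should nearly agree
```

For this particular f the agreement is exact in expectation; for a general smooth f it holds to first order in t, which is precisely the multi-scale quantity HeatFlow builds on.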
Furthermore, because a learned function is expected to be well behaved only on the data manifold embedded in Euclidean space, it is possible to restrict the function to the manifold and solve the corresponding heat equation. In doing so, the interpretation problem is treated in a principled way based on the theory of differential geometry. Briefly, HeatFlow satisfies the following desiderata: (i) it provides a multi-scale analysis of feature importance in the formation of model predictions; (ii) it is stable and informative because, implicitly, the neighborhood of a test point is exhaustively explored by Brownian motion; (iii) it offers practitioners the flexibility to restrict their analysis to a manifold chosen from among Euclidean space, the (learned) data manifold, and other submanifolds of interest.
A Toy Example. A toy example of understanding the sum of two Gaussian functions in a 2D Euclidean space is illustrated in Figure 1a. The first row shows the initial function and its heat flow. The heat flow generates a sequence of functions that are increasingly smoother than the initial one. Later we will see one possible representation of these functions as $\mathbb{E}[f(X_t) \mid X_0 = x]$, in which f is the initial function and $\{X_t\}_{t>0}$ is a path of Brownian motion. Intuitively, the values of these smoothed functions at x are local averages of the initial function over increasingly larger neighborhoods centered at x. The first subplot in the second row shows the deviation between the initial function and the smoothed functions at three points, while the remaining subplots show the Laplacians of the smoothed functions along with their gradient fields. The deviation and the Laplacian are further decomposed into the two directions, $x_1$ and $x_2$, presented in the third and fourth rows, respectively. Intuitively, our core idea is to distribute the deviation caused by the heat flow to each input feature by decomposing the heat flow into a sum of "sub-flows" in the corresponding directions.
We show that the proposed method further enables the detection of interaction strength between one variable and other variables, as demonstrated in Figure 1b .

2. PRELIMINARIES

In this section, we introduce some basic concepts of explanation methods (Covert et al., 2021; Arrieta et al.) from the interpretable AI community, along with definitions and key concepts regarding the Laplacian and the heat equation (Hsu, 2002; Grigoryan, 2009). The connections between the two fields that inspired this paper are also presented. Throughout our discussion, we consider a closed and connected manifold M endowed with a Riemannian metric g. A function $f: M \to \mathbb{R}$ is assumed to be bounded and twice continuously differentiable.

2.1. FEATURE IMPORTANCE & ATTRIBUTION

Let $f: \mathbb{R}^D \to \mathbb{R}$ be a trained neural network that predicts a label $y \in \mathbb{R}$ for a given input $x = (x_1, \ldots, x_D) \in \mathbb{R}^D$. Post hoc interpretation methods aim at generating human-friendly explanations of why $y = f(x)$. One of the most popular types of explanation is feature importance, which assigns a score to each feature $x_i$ of a particular data point x. For differentiable f, the gradient $\nabla f: \mathbb{R}^D \to \mathbb{R}^D$ is the simplest method for quantifying the local importance of each input feature to the model's outcome. Attribution methods go a step further and assign each feature a value $\psi_i(f, x)$ with the same physical meaning and unit of measurement as the model's output. It is then enforced that the values over all input features sum to the difference between the outcome y and a baseline b, i.e., $f(x) - b = \sum_{i=1}^D \psi_i(f, x)$. Previous work usually chose the baseline as the model prediction on a baseline input $x'$, i.e., $b := f(x')$. However, the choice of baseline input is known to be non-trivial (Sturmfels et al., 2020). In this work, we instead choose the baseline b directly and design a corresponding attribution method. An interesting choice of b is the locally averaged outcome of f in the neighborhood around x, that is, $b := \mathbb{E}_{X \sim N(x)}[f(X)]$. To avoid selecting a single neighborhood distribution $N(x)$, we further suggest, informally, considering a sequence of increasingly larger neighborhoods $\{N_t(x)\}_{t \geq 0}$, where t governs the scale of the neighborhood. To obtain this sequence, we turn to the Laplacian and heat equations introduced below.
The manifold hypothesis assumes that real-world data concentrate on a low-dimensional submanifold M embedded in an ambient space $\mathbb{R}^D$ of much higher dimension. The function to be explained is expected to behave consistently only on M, despite being learned in $\mathbb{R}^D$. For a learned function $f: \mathbb{R}^D \to \mathbb{R}$, we may be interested in its mechanism in either an open world or a closed world.
Arbitrary inputs in $\mathbb{R}^D$ are allowed in the former setting, while the latter accepts only inputs on the data manifold. Users should therefore choose the target and manifold according to the purpose of model interpretation: if the goal is to understand model behavior in the open world, f and $\mathbb{R}^D$ should be chosen; if the analysis is restricted to the data manifold, $f|_M$, the restriction of f to M, and M are suggested.

2.2. THE LAPLACIAN ON A RIEMANNIAN MANIFOLD

Since we are interested in manifolds beyond Euclidean space, we introduce the Laplace-Beltrami operator $\Delta_M$, which generalizes the ordinary Laplacian to Riemannian manifolds. There are multiple equivalent ways of introducing this operator (Hsu, 2002, Section 3.1).
• Divergence of the gradient field: $\Delta_M f = \operatorname{div} \operatorname{grad} f$. The gradient $\operatorname{grad} f$ is the unique vector field on M that satisfies $\langle \operatorname{grad} f, X \rangle_g = df(X)$ for any vector field X, where the differential df gives the directional derivative of f along X. The divergence $\operatorname{div} X$ of a vector field X measures how much it locally behaves like a sink or a source. The Laplacian is therefore positive at local minima and negative at local maxima; more generally, it acts as a measure of deviation from the local average. The relevance of the Laplacian to our work is then clear: through a closer look at the Laplacian, it is possible to show, for example, why the model's output at a local maximum (minimum) is larger (smaller) than in its neighborhood.
• Trace of the Hessian: $\Delta_M f = \operatorname{tr}(\nabla^2 f)$. In Euclidean space $\mathbb{R}^n$, the Hessian $\nabla^2 f$ is the matrix of second partial derivatives, hence $\Delta f = \sum_i \frac{\partial^2 f}{\partial x_i^2}$. On a Riemannian manifold with Levi-Civita connection $\nabla$, the Hessian is the second covariant derivative, such that in local coordinates $\{x^i\}$ the Hessian tensor is $\nabla^2 f = \left( \frac{\partial^2 f}{\partial x^i \partial x^j} - \Gamma^k_{ij} \frac{\partial f}{\partial x^k} \right) dx^i \otimes dx^j$, where $\Gamma^k_{ij}$ are the Christoffel symbols of the connection. The Laplacian therefore contains second-order information about a function that is not available from gradients.
• Hodge Laplacian: $\Delta_M f = -d^* df$. In exterior calculus, the codifferential $d^*$ is adjoint to the differential d, in the sense that $\int_M f \, d^*\theta \, dV_g = \int_M \langle df, \theta \rangle_g \, dV_g$, where $V_g$ is the Riemannian volume. When acting on 1-forms, $d^*$ is given by the divergence as $d^*(\alpha_X) = -\operatorname{div}(X)$, where $\alpha_X$ is the 1-form dual to X via $\alpha_X(Y) = \langle X, Y \rangle_g$; hence $\Delta_M f = \operatorname{div} \operatorname{grad} f = -d^* df$.
The Hodge Laplacian for differential forms, given by $\Box_M = -(d^* d + d d^*)$, coincides with $\Delta_M f$ when acting on functions, i.e., 0-forms. Generalizing the Laplacian to differential forms allows us to analyze the gradient field of a function in addition to the function itself.
• Infinitesimal generator of Brownian motion on M: roughly speaking, $\frac{1}{2}\Delta_M f(x) = \lim_{t \downarrow 0} \frac{1}{t}\left( \mathbb{E}[f(X_t) \mid X_0 = x] - f(x) \right)$, where $\{X_t\}_{t>0}$ is a Brownian path. This connection provides a clue to defining a distribution on the neighborhood of x through Brownian motion.
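The "deviation from local average" reading of the Laplacian can be checked numerically in the Euclidean case via the classical mean-value expansion: the average of f over a small ball $B_r(x)$ minus $f(x)$ is approximately $\frac{r^2}{2(D+2)} \Delta f(x)$. The sketch below verifies this for an arbitrary smooth toy function (our choice, not from the paper), using uniform ball sampling and a finite-difference Laplacian.

```python
import numpy as np

# Numerical check of: mean_{B_r(x)} f - f(x) ~ r^2 / (2(D+2)) * Laplacian(f)(x).
# The test function is an arbitrary smooth toy choice.
rng = np.random.default_rng(1)

def f(p):
    return np.sin(p[..., 0]) + p[..., 1] ** 2

def laplacian_fd(x, h=1e-3):
    # finite-difference Laplacian: sum of 1D central second differences
    D = x.size
    lap = 0.0
    for i in range(D):
        e = np.zeros(D); e[i] = h
        lap += (f(x + e) - 2.0 * f(x) + f(x - e)) / h**2
    return lap

def ball_average_deviation(x, r, n=2_000_000):
    # uniform sampling in the ball: random direction scaled by r * U^(1/D)
    D = x.size
    Z = rng.standard_normal((n, D))
    dirs = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    radii = r * rng.random(n) ** (1.0 / D)
    return f(x + radii[:, None] * dirs).mean() - f(x)

x = np.array([0.3, -0.7])
r = 0.05
lhs = ball_average_deviation(x, r)
rhs = r**2 / (2.0 * (x.size + 2)) * laplacian_fd(x)
print(lhs, rhs)  # should nearly agree
```

The $O(r^4)$ correction and Monte Carlo noise are both well below the leading term at this radius and sample size.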

2.3. HEAT EQUATION

With the above Laplacian defined, we can now introduce the most basic diffusion process, governed by the following PDE, which describes how an initial heat distribution $f: M \to \mathbb{R}$ looks after being diffused for time $t > 0$: $\frac{\partial}{\partial t} u(t, x) = \Delta_M u(t, x)$, $u(0, x) = f(x)$. The solution $u: (0, \infty) \times M \to \mathbb{R}$ can also be given in multiple, roughly equivalent ways:
• Convolution: $u(t, x) = \int_M k_t(x, y) f(y)\, dy$, in which $k_t(x, y)$ is the fundamental solution of this PDE, known as the heat kernel. In Euclidean space, it is just a Gaussian centered at the point x.
• Expectation: $u(t, x) = \mathbb{E}[f(X_t) \mid X_0 = x]$. As the heat kernel is also the transition density of Brownian motion on M, the expectation of the function value over endpoints of Brownian paths yields the solution on a stochastically complete manifold (Hsu, 2002). Such a stochastic representation of PDE solutions is known as the Feynman-Kac formula (Karatzas & Shreve, 2012).
• Gradient flow (Santambrogio, 2016): it is well known that the solution is the gradient flow of the Dirichlet energy $\int_M \|\operatorname{grad} f\|_g^2 \, dV_g$ in the space $L^2(M)$. Consequently, the Dirichlet energy monotonically decreases in time under the heat flow, meaning that the smoothness of the solution always increases.
It is convenient to define the heat operators $\{P_t\}_{t>0}$ to represent the solution: $u(t, x) = (P_t f)(x)$. The heat equation can be extended to the diffusion of tensor fields and differential forms using generalized Laplacians. For instance, the heat equation on 1-forms may be defined with the Hodge Laplacian: $\frac{\partial}{\partial t} \theta(t, x) = \Box_M \theta(t, x)$. Since $\Box_M$ commutes with the differential d, if the initial condition is set to be an exact 1-form, i.e., the differential of a function, $\theta(0, x) = df(x)$, then the solution has an interesting connection with the heat equation on functions (Hsu, 2002, Section 7.2): $\theta(t, x) = d(P_t f)(x)$.
It means that diffusing the differential of a function with the Hodge Laplacian is equivalent to applying the differential operator to the solution of the scalar heat equation.
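The convolution and expectation representations above can be compared directly in one dimension. In the sketch below, the initial condition is an arbitrary illustrative Gaussian density $N(0, s^2)$; with the convention $\partial_t u = u_{xx}$, diffusing it for time t yields the density of $N(0, s^2 + 2t)$ in closed form, which the Brownian-motion expectation should reproduce.

```python
import numpy as np

# 1D check that two representations of the heat solution agree:
# closed-form convolution vs. u(t, x) = E[f(X_t) | X_0 = x].
rng = np.random.default_rng(2)

def gauss_pdf(y, var):
    return np.exp(-y**2 / (2.0 * var)) / np.sqrt(2.0 * np.pi * var)

s2, t, x = 1.0, 0.5, 0.3

# convolution / closed form: a Gaussian widens from variance s2 to s2 + 2t
u_exact = gauss_pdf(x, s2 + 2.0 * t)

# expectation over Brownian endpoints, X_t = x + sqrt(2t) Z
Z = rng.standard_normal(500_000)
u_mc = gauss_pdf(x + np.sqrt(2.0 * t) * Z, s2).mean()

print(u_exact, u_mc)  # should nearly agree
```

The same two-sided check extends to manifolds in principle, with the heat kernel and Brownian motion defined intrinsically.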

3. MULTI-SCALE INTERPRETATION BASED ON HEAT DIFFUSION

By treating the model to be explained as the initial heat distribution, we analyze the solution of the heat equation to summarize the model at different scales from the point of view of an input x. The solution alone already gives a locally weighted average of the function centered at x: $(P_t f)(x) = \mathbb{E}[f(X_t) \mid X_0 = x]$. The time t quantifies vicinity: it defines the range of the neighborhood, viewed stochastically as the area reachable by Brownian motion started at x. For small t, the smoothed value is mainly affected by a small neighborhood of x, reflecting only local information about the explained model; as t increases, the smoothed function captures more global trends of the model. In this section, we show that it is possible to extract more detailed information from the solution by applying differential operators to $\{P_t f\}_{t>0}$.

3.1. SUMMARIZING GRADIENTS

By Eq. 3 and the duality of differential and gradient, taking the gradient of $P_t f$ is equivalent to smoothing the gradient field of f by running a heat equation for vector fields up to time t. In the Euclidean setting, the gradient of $P_t f$ is $\nabla P_t f(x) = \mathbb{E}[\nabla f(X_t) \mid X_0 = x]$, which is simply the locally averaged version of the initial gradient field. On more general manifolds, it does not make sense to add vectors living in different tangent spaces $T_{X_t}M$, and we need parallel transport (Lee, 2018, Chapter 4) as a mechanism to connect nearby tangent spaces. There is a Feynman-Kac formula for this operation (Hsu, 2002, Theorem 7.2.1): $\operatorname{grad} P_t f(x) = \mathbb{E}[M_t \tau_t \operatorname{grad} f(X_t) \mid X_0 = x]$, where $\tau_t$ is the stochastic parallel transport map that carries vectors from the tangent space at $X_t$ back to that at $X_0$ along the Brownian path, and $M_t$ is related to the Ricci curvature tensor on M. One can interpret Eq. 4 as collecting derivative information from the neighborhood: send Brownian particles that travel for a fixed time length, parallel transport the gradients at the particles' positions back to the explained point, and average them.
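In the Euclidean case, the identity between the gradient of the smoothed function and the smoothed gradient field can be verified numerically. The sketch below uses an arbitrary toy function $f(x) = \cos(x_0) + \cos(x_1)$ (our choice, picked because $\Delta f = -f$, so its heat flow is simply $P_t f = e^{-t} f$), and compares the Monte Carlo average of gradients over Brownian endpoints against the closed-form gradient of $P_t f$.

```python
import numpy as np

# Euclidean check that grad P_t f (x) = E[grad f(X_t) | X_0 = x].
# Toy choice: f(x) = cos(x_0) + cos(x_1), whose Laplacian is -f,
# so P_t f = e^{-t} f in closed form.
rng = np.random.default_rng(3)

def grad_f(X):
    return -np.sin(X)  # componentwise gradient of a sum of cosines

x = np.array([0.8, -0.4])
t = 0.3

# Monte Carlo smoothing of the gradient field, X_t = x + sqrt(2t) Z
Z = rng.standard_normal((400_000, 2))
smoothed_grad_mc = grad_f(x + np.sqrt(2.0 * t) * Z).mean(axis=0)

# closed form: grad P_t f (x) = -e^{-t} sin(x)
smoothed_grad_exact = -np.exp(-t) * np.sin(x)
print(smoothed_grad_mc, smoothed_grad_exact)  # should nearly agree
```

On a curved manifold, the averaging on the right-hand side would additionally require the stochastic parallel transport and curvature terms of the Feynman-Kac formula above.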

3.2. ATTRIBUTION

For attribution tasks, a natural baseline to choose in our framework is the local average $\mathbb{E}[f(X_t) \mid X_0 = x]$. Formally, given a model function $f: M \subset \mathbb{R}^D \to \mathbb{R}$, the input of interest x, and the solution $P_t f$ of the heat equation initialized at f, we attribute $(P_t f)(x) - f(x)$ to the input features $\{x_i\}_{i=1}^D$ at each t. In order to allocate the difference to the coordinates of the ambient space $\mathbb{R}^D$, we disaggregate the Laplace-Beltrami operator using the standard orthonormal basis $\{e_i\}_{i=1}^D$. Let $E_i(x)$ be the orthogonal projection of the unit vector $e_i$ onto the tangent space $T_x M$. Since the gradient field is tangential to M, we have $\operatorname{grad} f = \sum_i \langle \operatorname{grad} f, e_i \rangle_g\, e_i = \sum_i \langle \operatorname{grad} f, E_i \rangle_g\, E_i$. The projection of $\operatorname{grad} f$ onto the i-th dimension is achieved by $\operatorname{grad}_i f = \langle \operatorname{grad} f, E_i \rangle_g E_i = (E_i f) E_i$, and we define $d_i f$, the dual form of $\operatorname{grad}_i f$, to be the "partial differential" operator. Further, we can compute the contribution of each feature i by taking the divergence of the projected vector field, $\operatorname{div}((E_i f) E_i)$. In this representation, the flow of heat along the i-th dimension for an input x up to time T is defined as $\mathrm{HeatFlow}^M_i(x, T, f) := \int_0^T \operatorname{div} \operatorname{grad}_i (P_t f)(x)\, dt$. A desirable property of HeatFlow is that the attributions add up to the difference between the values of f and its smoothed version as follows,
$(P_T f)(x) - f(x) = \int_0^T \frac{\partial}{\partial t}(P_t f)(x)\, dt = \int_0^T \Delta_M (P_t f)(x)\, dt = \sum_{i=1}^D \mathrm{HeatFlow}^M_i(x, T, f)$, (6)
where the last equality is an immediate consequence of $\Delta_M f = \operatorname{div} \operatorname{grad} f = \sum_{i=1}^D \operatorname{div}(\langle \operatorname{grad} f, E_i \rangle_g E_i)$. Instead of computing div with respect to the manifold, we can also globally express both div and grad in the standard basis $\{e_i\}$, i.e., $\operatorname{grad} f = \sum_{i=1}^D (E_i f) e_i$ and $\operatorname{div} X = \sum_{i=1}^D E_i X^i$, where $X^i$ is the i-th component of the vector field X expressed in ambient coordinates. This leads to another decomposition, $\Delta_M f = \sum_{i=1}^D E_i^2 f$.
Furthermore, through the trace-of-the-Hessian expression, we also have $\Delta_M f = \sum_{i=1}^D \nabla^2 f(E_i, E_i)$ (Hsu, 2002, Corollary 3.1.5). Together, we have the following three decompositions:
$\sum_{i=1}^D \int_0^T \operatorname{div} \operatorname{grad}_i (P_t f)(x)\, dt = \sum_{i=1}^D \int_0^T E_i^2 (P_t f)(x)\, dt = \sum_{i=1}^D \int_0^T \nabla^2 (P_t f)(E_i(x), E_i(x))\, dt$. (7)
It is worth noticing that while the three decompositions agree when summed over all D dimensions, the three summands for a single dimension i are not necessarily equal to each other. For simplicity of calculation, we adopt the last decomposition strategy, based on the Hessian tensor, in our experiments. In Euclidean space, all three decompositions reduce to the sum of the diagonal terms of the Hessian matrix. Hence, the attribution for the i-th dimension becomes the integral of the second partial derivative: $\mathrm{HeatFlow}^{\mathbb{R}^D}_i(x, T, f) := \int_0^T \frac{\partial^2}{\partial x_i^2}(P_t f)(x)\, dt$.
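The Euclidean attribution and its completeness property can be sketched for a toy quadratic model $f(x) = x^\top A x$ (an assumed example of ours, chosen because its heat flow has a closed form: $\Delta f = 2\,\mathrm{tr}(A)$ is constant, so $P_t f = f + 2t\,\mathrm{tr}(A)$). The sketch integrates second partials of $P_t f$ over time and checks that the attributions sum to $(P_T f)(x) - f(x)$.

```python
import numpy as np

# Sketch of Euclidean HeatFlow attribution for a toy quadratic model,
# with a completeness check: attributions sum to (P_T f)(x) - f(x).
A = np.array([[1.0, 0.3],
              [0.3, 2.0]])
x = np.array([0.7, -1.2])
T = 0.5

def f(y):
    return y @ A @ y

def Pt_f(y, t):
    # closed-form heat flow of a quadratic: P_t f = f + 2 t tr(A)
    return f(y) + 2.0 * t * np.trace(A)

def heatflow_i(i, steps=200, h=1e-4):
    # integral over [0, T] of d^2/dx_i^2 (P_t f)(x) dt, via central
    # second differences and a trapezoidal rule in t
    ts = np.linspace(0.0, T, steps)
    e = np.zeros_like(x); e[i] = h
    d2 = np.array([(Pt_f(x + e, t) - 2.0 * Pt_f(x, t) + Pt_f(x - e, t)) / h**2
                   for t in ts])
    dt = ts[1] - ts[0]
    return dt * (0.5 * d2[0] + d2[1:-1].sum() + 0.5 * d2[-1])

attr = np.array([heatflow_i(i) for i in range(2)])
total = Pt_f(x, T) - f(x)
print(attr, attr.sum(), total)  # completeness: attr sums to the deviation
```

For this quadratic the second partials are the constants $2A_{ii}$, so each attribution is simply $2A_{ii}T$; for a general model the same time quadrature would be applied to numerically smoothed second derivatives.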

3.3. PROPERTIES IN EUCLIDEAN SPACE

In this section, we discuss some desirable properties satisfied by our method in Euclidean space. All proofs are given in the Appendix. We leave the study of analogous properties on general manifolds to future work.
HeatFlow obeys attribution axioms. The evolution of the model interpretation problem has produced several desirable properties that a new attribution method should satisfy (Lundberg & Lee, 2017; Sundararajan et al., 2017; Friedman, 2004). The following proposition covers four axioms defined in (Sundararajan & Najmi, 2020).
Proposition 1. $\mathrm{HeatFlow}^{\mathbb{R}^D}$ satisfies the Dummy, Efficiency, Symmetry, and Linearity axioms.
HeatFlow recovers GAMs. Generalized additive models (Hastie & Tibshirani, 2017; Lou et al., 2012) are a family of inherently interpretable models built as sums of univariate functions. Feature interactions in a GAM are purely additive, so a user can interpret the contribution of each feature independently by inspecting the corresponding univariate function. Ideally, if an attribution method is applied to a GAM, its output should be consistent with the GAM itself. This is the case for HeatFlow.
Proposition 2. Suppose $f: \mathbb{R}^D \to \mathbb{R}$, $x \mapsto \sum_{i=1}^D f_i(x_i)$ is a smooth additive function, in which for all $i \in \{1, \ldots, D\}$, $f_i: \mathbb{R} \to \mathbb{R}$ is a bounded continuous function in $L^1(\mathbb{R})$. Then $\lim_{t \to \infty} \mathrm{HeatFlow}^{\mathbb{R}^D}_i(x, f, t) = -f_i(x_i)$.
HeatFlow detects additive structure. For a given neural network f, it is usually impossible to decompose f into a sum of univariate functions. How do we know whether at least some variable, say $x_i$, contributes additively to f? One idea is to inspect its gradient $\nabla f: \mathbb{R}^D \to \mathbb{R}^D$. Since Euclidean space is the focus of this section, we slightly abuse notation and write $df = \operatorname{grad} f = \nabla f$, $d_i f = \operatorname{grad}_i f = [\nabla f]_i e_i$, $\partial_i f = [\nabla f]_i$, and $d^* X = -\operatorname{div} X$ for simplicity.
If $x_i$ truly contributes to f additively, i.e., there exists $f_i: \mathbb{R} \to \mathbb{R}$ such that $f(x) = f_i(x_i) + f_{-i}(x_{-i})$, then the gradient projected onto the i-th direction is fully characterized by $f_i$ via $\partial_i f = df_i$. The following lemma shows that the converse is also true.
Lemma 1. Assume that there exists a univariate function $f_i: \mathbb{R} \to \mathbb{R}$ such that $\forall x \in \mathbb{R}^D: df_i(x_i) = \partial_i f(x)$. Then $f(x) = f_i(x_i) + f_{-i}(x_{-i})$, in which the function $f_{-i}: \mathbb{R}^{D-1} \to \mathbb{R}$ does not depend on $x_i$.
Based on this observation, in order to detect whether $x_i$ contributes univariately, one can try to find a function $g_i: \mathbb{R}^D \to \mathbb{R}$ whose gradient matches $d_i f$ everywhere. If such a $g_i$ exists, it will depend only on $x_i$ because its partial derivatives with respect to $\{x_j\}_{j \neq i}$ will always be zero. A straightforward implementation of this idea is to solve the optimization problem
$\min_{g: \mathbb{R}^D \to \mathbb{R}} |dg - d_i f|^2 := \int \|dg(x) - d_i f(x)\|_2^2 \, dx$. (9)
It is well known that the Euler-Lagrange equation for this problem is the Poisson equation $\Delta g = -d^* d_i f$. Interestingly, we discover that HeatFlow solves exactly this kind of equation.
Proposition 3. $g_{i,t}(\cdot) = \mathrm{HeatFlow}^{\mathbb{R}^D}_i(\cdot, f, t)$ solves the Poisson equation $\Delta g_{i,t} = -d^* d_i (P_t f - f)$.
As a result, for a function f satisfying $\forall x: \lim_{t \to \infty} P_t f(x) = 0$, e.g., $f \in L^1(\mathbb{R}^D)$, the difference between the gradient of HeatFlow at large enough t and the partial gradient of $-f$ approximately measures the degree of deviation from being univariate for a considered feature. Specifically, if feature i rarely interacts with other features, the following residual should be small everywhere:
$r_i(x) = \left\| d\,\mathrm{HeatFlow}_i(x, f, t) - d_i (P_t f - f)(x) \right\|_2^2$.
When this residual is zero everywhere, according to Lemma 1, feature i contributes to $P_t f - f$ additively. In Figure 1b, we visualize the residuals at a large t for the toy example.
Interestingly, similar residuals have also been discovered in a different setting, where the Shapley value (Shapley, 1953) is employed for attribution (Kumar et al., 2021).
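A direct finite-sample proxy for Lemma 1 (not the Poisson-equation residual itself) can be sketched as follows: feature i is additive exactly when $\partial_i f$ depends on $x_i$ alone, so pinning $x_i$ and resampling the other coordinates should leave the partial derivative unchanged. The model below is an arbitrary toy of ours with one additive feature ($x_0$) and two interacting features ($x_1, x_2$).

```python
import numpy as np

# Finite-sample proxy for Lemma 1: measure the spread of df/dx_i when x_i is
# pinned and the remaining coordinates are resampled. The spread is near zero
# for an additive feature and clearly positive for interacting ones.
rng = np.random.default_rng(4)

def f(x):
    # x_0 contributes additively; x_1 and x_2 interact
    return np.sin(x[..., 0]) + x[..., 1] * x[..., 2]

def partial(x, i, h=1e-5):
    e = np.zeros(x.shape[-1]); e[i] = h
    return (f(x + e) - f(x - e)) / (2.0 * h)

def interaction_score(i, n=200, D=3):
    pinned = rng.standard_normal()
    X = rng.standard_normal((n, D))
    X[:, i] = pinned                  # pin coordinate i, resample the rest
    vals = np.array([partial(row, i) for row in X])
    return vals.std()

scores = [interaction_score(i) for i in range(3)]
print(scores)  # near zero for x_0, clearly positive for x_1 and x_2
```

This pin-and-resample check is a cheap diagnostic in the same spirit as the residual $r_i$; the HeatFlow residual additionally localizes the interaction through the smoothed function.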

4. CONNECTION WITH EXISTING WORK

Close connections can be drawn between our method and many existing attribution approaches. In this section we discuss how our method generalizes the literature on attributing the prediction of a model $f(\cdot)$ on an input x to its features.
• Vanilla Gradient (Grad) (Simonyan et al., 2014). The vanilla gradient is defined as $\mathrm{Grad}(x) = \nabla_x f(x)$. It characterizes only the first-order behavior of the model in an infinitesimal range.
• Smooth Gradient (SG) (Smilkov et al., 2017). Given a user-defined variance $\sigma^2$, the smooth gradient is defined as $\mathrm{SG}(x) = \mathbb{E}_{z \sim \mathcal{N}(x, \sigma^2 I)} \nabla_z f(z)$. It is a special case of our method in the Euclidean setting: $\mathrm{SG}(x) = \nabla_x [(f * \mathcal{N}(0, \sigma^2 I))(x)] = \nabla_x u(t = \sigma^2/2, x)$.
• Integrated Gradients (IG) (Sundararajan et al., 2017). Given a user-defined baseline input $x'$, the IG of feature i is defined as $\mathrm{IG}_i(x, x') = (x_i - x'_i) \cdot \int_0^1 \frac{\partial}{\partial x_i} f(x' + \alpha(x - x'))\, d\alpha$, where gradients are accumulated along the straight-line path in Euclidean space. Our method can also be viewed as an expectation over multiple path integrals: $\mathbb{E}[f(X_t) \mid X_0 = x] = \mathbb{E}[\int_{X_{[0,t]}} df \mid X_0 = x]$.
• Expected Gradients (EG) (Erion et al., 2019). By introducing a distribution of baselines $\mathcal{D}$, EG is the expectation of path integrals, defined as $\mathrm{EG}_i(x, \mathcal{D}) = \mathbb{E}_{x' \sim \mathcal{D}}\, \mathrm{IG}_i(x, x')$. When a Gaussian baseline distribution $\mathcal{D} = \mathcal{N}(x, \sigma^2 I)$ is adopted, it coincides with our method in the Euclidean setting, explaining the deviation from the local average via $\sum_i \mathrm{EG}_i(x, \mathcal{D}) = f(x) - \mathbb{E}_{x' \sim \mathcal{N}(x, \sigma^2 I)} f(x')$. When the training data itself is used, $\mathcal{D} = \mathcal{D}_{\mathrm{data}}$, it coincides with our method in the manifold setting as $t \to \infty$, explaining the difference from the global average of all data.
• Blur Integrated Gradients (BlurIG) (Xu et al., 2020). BlurIG extends IG by considering the path that successively blurs the input image with a Gaussian filter. It solves a heat equation in the 2D plane of a single image, in contrast to the data space in our method.
It is worth noting that the heat equation has also found applications in diffusion-based generative modeling (Sohl-Dickstein et al., 2015; Song et al., 2021; De Bortoli et al., 2022), in which certain designs of the forward diffusion process are equivalent to solving the heat equation (as a special case of the Fokker-Planck equation) initialized at p(x), the probability density function of the data distribution.
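Two of the connections above can be checked numerically on an arbitrary toy model of ours, $f(x) = \sin(x_0)\,x_1 + x_1^2$: first, SmoothGrad equals the gradient of the Gaussian-smoothed function (the heat solution at $t = \sigma^2/2$, which has a closed form for this f); second, IG satisfies completeness, with attributions summing to $f(x) - f(x')$.

```python
import numpy as np

# Toy checks: (1) SG(x) = grad of the Gaussian-smoothed f; (2) IG completeness.
rng = np.random.default_rng(5)

def f(X):
    X = np.atleast_2d(X)
    return np.sin(X[:, 0]) * X[:, 1] + X[:, 1] ** 2

def grad_f(X):
    X = np.atleast_2d(X)
    return np.stack([np.cos(X[:, 0]) * X[:, 1],
                     np.sin(X[:, 0]) + 2.0 * X[:, 1]], axis=1)

x = np.array([0.4, 1.1])
sigma = 0.6

# SmoothGrad: average gradient under N(x, sigma^2 I)
Zs = x + sigma * rng.standard_normal((400_000, 2))
sg = grad_f(Zs).mean(axis=0)

# closed-form smoothing for this f:
# E[f(x + sigma Z)] = e^{-sigma^2/2} sin(x_0) x_1 + x_1^2 + sigma^2
c = np.exp(-sigma**2 / 2.0)
sg_exact = np.array([c * np.cos(x[0]) * x[1], c * np.sin(x[0]) + 2.0 * x[1]])

# Integrated Gradients with baseline x', midpoint rule on the straight path
x_prime = np.zeros(2)
alphas = (np.arange(1000) + 0.5) / 1000.0
path = x_prime + alphas[:, None] * (x - x_prime)
ig = (x - x_prime) * grad_f(path).mean(axis=0)

print(sg, sg_exact)                        # should nearly agree
print(ig.sum(), f(x)[0] - f(x_prime)[0])   # completeness
```

The factor $e^{-\sigma^2/2}$ comes from the Gaussian smoothing of the sine term; any smooth f with a tractable Gaussian convolution admits the same comparison.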

5. EXPERIMENTS

To illustrate our multi-scale attribution method, we demonstrate sequences of attribution maps at discretized time steps on three image recognition tasks: a synthetic image regression task, MNIST classification, and facial age estimation. We compare our method, HeatFlow, with four other methods: Grad, IG, SG, and BlurIG. For SG, increasing levels of Gaussian noise are applied and the resulting gradients are averaged over 100 samples; we denote by noise level s the amount of noise added to the input as a fraction of the total spread, $\max x - \min x$. For BlurIG, partial accumulation of gradients along the path is calculated, where $\alpha$ denotes the variance of the 2D Gaussian kernel. A detailed discussion and pseudo-algorithm of the numerical implementation of HeatFlow in an end-to-end framework is presented in Appendix A.5. In the following experiments, all data manifolds are learned by VAEs. Since our method involves many differentiation operations, experiments are implemented using the JAX package (Bradbury et al., 2018) on a Tesla V100 GPU. Further information, such as the latent dimensions of the VAEs and the image sizes, is summarized in Appendix A.7. In all attribution maps of the following figures, red and blue pixels denote positive and negative values, respectively, and deeper color denotes larger absolute value and stronger contribution.

5.1. SYNTHETIC EXPERIMENTS FOR HIERARCHICAL FEATURE STRUCTURE

To study the ability of HeatFlow to uncover features with global or local effects, we design a synthetic dataset with a hierarchical feature structure, in which each image is generated from a latent code $z = [z_0, \ldots, z_{d-1}]$ with coordinates drawn from Uniform(0, 2π). As shown in Figure 2b, vanilla Grad mostly highlights local features close to the upper-left corner. Both Grad and SG misleadingly highlight squares lying on the diagonal, which stay constant across all images. IG and BlurIG concentrate on more global features; however, multi-scale information is lacking. For HeatFlow, localness is meticulously controlled by heat diffusion on the manifold: when t is small, only small squares, which correspond to local features, are highlighted; as t grows, more global pixels, i.e., larger squares, gradually appear. Furthermore, highlighted pixels tend to appear as organized groups in HeatFlow, suggesting its ability to attribute features with correlated contributions.
Manifold Mismatch. The heat kernel is known to be stable under perturbations of the underlying manifold (Sun et al., 2009). Intuitively, such stability comes from the Brownian-motion interpretation of heat flow: small perturbations affect only a subset of the infinitely many Brownian paths. To study the reliance on alignment between the learned manifold and the true manifold, an ablation experiment is conducted in which HeatFlow is run separately on the true underlying manifold and on manifolds learned by VAEs with latent dimensionality d = 6 and d = 12, shown in the first three rows of Figure 2b. We observe that as the correctness of the manifold increases, more detailed information is captured. The multi-scale property behaves as expected regardless of the degree of manifold mismatch.

5.2. MNIST AND UTKFACE

The same experiments are conducted on the MNIST classification task; in Figure 3a, the logits of a neural network with 98% test accuracy are diffused over time. To handle all logits simultaneously, we use a heuristic extension of HeatFlow described in the Appendix. Saliency maps are particularly noisy for vanilla Grad. SG and BlurIG also fail to provide succinct information, as Gaussian noise added in Euclidean space yields unreliable gradient evaluations at off-manifold samples. HeatFlow produces concise attributions in which only pixels on the digit area are highlighted. It is also evident that the attribution changes as the heat flows. More interesting findings emerge when we study the logits of similar digits. In Figure 3b, we present heat flows and multi-scale attribution maps for three digits, 3, 6, and 9, and compare them with their visually similar digits, {5, 8}, {0, 8}, and {7, 4}, respectively. We observe that, in the last column, only pixels corresponding to features common to the compared digits remain, while at small t, pixels that set the digits apart are highlighted. Results for the facial age estimation task are presented in Appendix A.9. Our method locates structured features, such as eyebrows, wrinkles, and whether the smile shows teeth, as important information for age prediction, which the other methods fail to discover. We also emphasize that our method summarizes multi-scale information: attributed features that remain at large t represent more global effects, such as wrinkles.
Quantitative Evaluation. We also include a quantitative evaluation comparing HeatFlow with other methods, using a strategy adopted in (Jethani et al., 2021).
This evaluation strategy uses the remove/retain-and-retrain idea: an evaluator is trained to predict the label given an arbitrary subset of features, and its performance is then assessed as features are gradually excluded or included according to the absolute importance output by each explanation method. Using 1000 test images for MNIST and 100 for UTKFace, accuracy and mean absolute error, respectively, are assessed as we include or exclude between 0% and 100% of the most important features, with curves visualized in Figure 4. HeatFlow is very competitive in terms of AUC, indicating that it is good at identifying important discriminative features.
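The inclusion-curve bookkeeping can be sketched in a stripped-down form, with a fixed linear scoring function standing in for the retrained evaluator (everything below is an illustrative assumption of ours, not the paper's setup). Features are revealed most-important-first; a good importance ranking closes the gap to the full prediction faster than a bad one, which the mean gap (an AUC-like score) captures.

```python
import numpy as np

# Toy inclusion-curve evaluation with a linear stand-in evaluator.
# Contributions are made positive so that revealing the largest-magnitude
# features first is provably the best ordering.
rng = np.random.default_rng(6)
D = 64
w = np.abs(rng.standard_normal(D))   # stand-in evaluator weights
x = np.abs(rng.standard_normal(D))   # input with all-positive contributions
importance = w * x                   # toy attribution: contribution per feature

def inclusion_gaps(order):
    # gap to the full prediction after revealing features in the given order
    masked = np.zeros(D)
    gaps = []
    for i in order:
        masked[i] = x[i]
        gaps.append(abs(w @ x - w @ masked))
    return np.array(gaps)

best = np.argsort(-importance)       # most important first
worst = np.argsort(importance)       # least important first
auc_best = inclusion_gaps(best).mean()
auc_worst = inclusion_gaps(worst).mean()
print(auc_best, auc_worst)           # the better ranking has a smaller mean gap
```

In the actual protocol the linear scorer is replaced by a retrained evaluator network and the gap by accuracy or mean absolute error, but the curve and AUC computation are the same.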

6. CONCLUSIONS

We have introduced a novel interpretation framework, HeatFlow, which generates a sequence of feature attributions for an input of interest, explaining the deviations of the model's outcome from multi-scale local averages. The drawbacks of this work lie in the difficulties of manifold learning on more complex datasets, high-dimensional PDE solving, and high-order derivative computation; our method will benefit from methodological advances in these directions. It would also be interesting to generalize HeatFlow to unions of disconnected manifolds, which may be a more appropriate assumption for classification datasets.

A APPENDIX

A.1 AXIOMATIC PROPERTIES IN EUCLIDEAN SPACE

Proof of Proposition 1. Since the derivative of a function with respect to a dummy variable is always zero, the Dummy axiom is satisfied. By Eq. 6, the attributions always add up to the difference $(P_t f)(x) - f(x)$, which proves the Efficiency axiom. Linearity is a consequence of the fact that the heat operator is linear. For a function that is symmetric in two variables $x_i$ and $x_j$, we prove below that $\partial^2 P_t f / \partial x_i^2 = \partial^2 P_t f / \partial x_j^2$ as long as $x_i = x_j$.

Let $f(x_1, \dots, x_i, \dots, x_j, \dots, x_n)$, $f: \mathbb{R}^n \to \mathbb{R}$, be a function of $n$ variables. Without loss of generality, let $x_i$ and $x_j$ be symmetric variables, i.e. $f(\dots, x_i, \dots, x_j, \dots) = f(\dots, x_j, \dots, x_i, \dots)$ for all $x_i, x_j \in \mathbb{R}$. Then we have
$$
\begin{aligned}
P_t f(\dots, x_i, \dots, x_j, \dots)
&= \int \frac{1}{(4\pi t)^{n/2}} \exp\Big(-\frac{1}{4t}\sum_{k=1}^n (x_k - y_k)^2\Big)\, f(\dots, y_i, \dots, y_j, \dots)\, dy \\
&= \int \frac{1}{(4\pi t)^{n/2}} \exp\Big(-\frac{1}{4t}\Big[\sum_{k \notin \{i,j\}} (x_k - y_k)^2 + (x_i - y_i)^2 + (x_j - y_j)^2\Big]\Big)\, f(\dots, y_i, \dots, y_j, \dots)\, dy \\
&= \int \frac{1}{(4\pi t)^{n/2}} \exp\Big(-\frac{1}{4t}\Big[\sum_{k \notin \{i,j\}} (x_k - y'_k)^2 + (x_i - y'_j)^2 + (x_j - y'_i)^2\Big]\Big)\, f(\dots, y'_j, \dots, y'_i, \dots)\, dy' \\
&= \int \frac{1}{(4\pi t)^{n/2}} \exp\Big(-\frac{1}{4t}\Big[\sum_{k \notin \{i,j\}} (x_k - y'_k)^2 + (x_j - y'_i)^2 + (x_i - y'_j)^2\Big]\Big)\, f(\dots, y'_i, \dots, y'_j, \dots)\, dy' \\
&= P_t f(\dots, x_j, \dots, x_i, \dots),
\end{aligned}
$$
where $y'$ is the permutation of $y$ obtained by exchanging the $i$-th and $j$-th variables. The third equality follows from the change-of-variables formula and the fact that the determinant of a permutation matrix is $\pm 1$; the fourth uses the symmetry of $f$. Hence the heat operator preserves symmetry in Euclidean space.

Next, further assuming that $x_i = x_j = a$, the difference-quotient limit for the second derivative gives
$$
\begin{aligned}
\frac{\partial^2}{\partial x_i^2} f(\dots, a, \dots, a, \dots)
&= \lim_{h \to 0} \frac{f(\dots, a+h, \dots, a, \dots) - 2 f(\dots, a, \dots, a, \dots) + f(\dots, a-h, \dots, a, \dots)}{h^2} \\
&= \lim_{h \to 0} \frac{f(\dots, a, \dots, a+h, \dots) - 2 f(\dots, a, \dots, a, \dots) + f(\dots, a, \dots, a-h, \dots)}{h^2} \\
&= \frac{\partial^2}{\partial x_j^2} f(\dots, a, \dots, a, \dots).
\end{aligned}
$$
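Both facts admit a quick numerical check on a closed-form example (this sketch is ours, not part of the proof). For the symmetric function $f(x_1, x_2) = x_1^2 x_2^2$, Gaussian moments give the smoothed function exactly: $P_t f = (x_1^2 + 2t)(x_2^2 + 2t)$, since $\mathbb{E}[(a + \sqrt{2t}\,Z)^2] = a^2 + 2t$ for $Z \sim \mathcal{N}(0,1)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def heat_mc(f, x, t, n=400_000):
    """Monte Carlo heat smoothing: (P_t f)(x) = E[f(x + sqrt(2t) Z)], Z ~ N(0, I)."""
    z = rng.standard_normal((n, x.size))
    return f(x + np.sqrt(2 * t) * z).mean()

# f is symmetric in its two variables
f = lambda p: (p[..., 0] ** 2) * (p[..., 1] ** 2)

x, t = np.array([1.0, 1.0]), 0.5
# closed form of the smoothed function: P_t f = (x1^2 + 2t)(x2^2 + 2t)
exact = (x[0] ** 2 + 2 * t) * (x[1] ** 2 + 2 * t)
assert abs(heat_mc(f, x, t) - exact) < 0.1

# second derivatives of P_t f are 2(x2^2 + 2t) and 2(x1^2 + 2t);
# they coincide exactly on the diagonal x1 = x2, as the proposition requires
d11 = 2 * (x[1] ** 2 + 2 * t)
d22 = 2 * (x[0] ** 2 + 2 * t)
assert d11 == d22
```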
In other words, we have shown that the second derivatives of a symmetric function with respect to its symmetric variables are equal whenever those variables take equal values. As a result, we have proved that $\partial^2 P_t f / \partial x_i^2 = \partial^2 P_t f / \partial x_j^2$ as long as $x_i = x_j$.

A.2 DEVIATION FROM ADDITIVE CONTRIBUTION AND LINK WITH SHAPLEY VALUE

An interesting link exists between our method and the Hodge decomposition of a cooperative game (Stern & Tettenhorst, 2019). Given a set of $D$ players and a value function $v: 2^{[D]} \to \mathbb{R}$, each coalition of players $S \in 2^{[D]} = V$ can be considered a vertex of a $D$-dimensional hypercube $G = (V, E)$, where each edge corresponds to the addition of a single player $i \notin S$ to $S$. The gradient and divergence of exterior calculus operating on this graph, denoted $(d, d^*)$, are defined as $dv(S, S \cup \{i\}) = v(S \cup \{i\}) - v(S)$ and $(d^* f)(a) = \sum_{(b,a) \in E} f(b, a)$, respectively. Thus $dv \in \ell^2(E)$ gives the marginal value contributed by a player joining a coalition. To specify the marginal contribution of an individual player $i \in [D]$, a partial gradient $d_i: \ell^2(V) \to \ell^2(E)$ is defined as
$$
d_i v(S, S \cup \{j\}) = \begin{cases} v(S \cup \{j\}) - v(S), & i = j, \\ 0, & \text{otherwise.} \end{cases}
$$
Note the corresponding definition of the partial differentiation operator in continuous space adopted in our method, $d_i f = \operatorname{grad}_i f$. A key result in (Stern & Tettenhorst, 2019) connects the inessentiality of games to the partial gradient operators defined above, where in an inessential game, for all $S \subseteq [D]$, $v(S) = \sum_{i \in S} v(\{i\})$, meaning each player contributes a fixed value $v(\{i\})$ to any coalition it joins.

Proposition 4. (Stern & Tettenhorst, 2019, Prop. 3.3) A game is inessential if and only if $d_i v \in \operatorname{im} d$ for all $i \in [D]$.

This means a game is inessential if one can find games $v_i$ such that $d_i v = d v_i$ for each player $i$. In our setting there is no notion of a game, but we consider breaking the given differentiable model $f(\cdot)$ down into additive univariate functions.
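The inessentiality condition can be made concrete on a tiny hypercube. The sketch below (all names are ours and hypothetical) builds an additive three-player game and checks that each partial gradient $d_i v$ is itself the gradient of the single-player game $v_i(S) = v(\{i\})\,\mathbb{1}[i \in S]$, i.e. $d_i v = d v_i$ on every edge:

```python
from itertools import combinations

players = [0, 1, 2]
worth = {0: 1.0, 1: 2.0, 2: -0.5}          # v({i}) for each player

def v(S):
    """An inessential (additive) game: v(S) = sum of v({i}) over i in S."""
    return sum(worth[i] for i in S)

def edges():
    """Edges of the hypercube: (S, S ∪ {j}) for each coalition S and j ∉ S."""
    for r in range(len(players) + 1):
        for S in combinations(players, r):
            for j in players:
                if j not in S:
                    yield frozenset(S), frozenset(S) | {j}

def d_i(v, i, S, T):
    """Partial gradient d_i v on edge (S, T): nonzero only if the added player is i."""
    (j,) = T - S
    return v(T) - v(S) if j == i else 0.0

def v_single(i, S):
    """Single-player game v_i: worth of player i if present, else 0."""
    return worth[i] if i in S else 0.0

# check d_i v = d v_i on every edge, for every player i
for S, T in edges():
    for i in players:
        assert abs(d_i(v, i, S, T) - (v_single(i, T) - v_single(i, S))) < 1e-12
```

For a non-additive game (e.g. one with a synergy bonus for a particular pair), no such decomposition exists, mirroring the interaction terms our method measures.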
This ability to separate each feature so that it contributes independently is the "inessentiality" of our problem. In particular, we consider the simplest case where the function to be explained is itself additive, i.e. $f: \mathbb{R}^D \to \mathbb{R};\ x \mapsto \sum_{i=1}^D f_i(x_i)$. We prove Proposition 2, namely that our method $\mathrm{HeatFlow}^{\mathbb{R}^D}_i(\cdot, T, f)$ tends to recover each independent component as $T$ tends to infinity.

Proof. The solution of the heat equation initialized at an additive function is the sum of the 1D solutions:
$$
\begin{aligned}
P_t f(x) &= \int \frac{1}{(4\pi t)^{D/2}} \exp\Big(-\frac{1}{4t}\|x - y\|^2\Big) \sum_{i=1}^D f_i(y_i)\, dy \\
&= \sum_{i=1}^D \int \frac{1}{(4\pi t)^{1/2}} \exp\Big(-\frac{1}{4t}(x_i - y_i)^2\Big) f_i(y_i)\, dy_i
= \sum_{i=1}^D (P_t f_i)(x_i).
\end{aligned}
$$
Following this observation, it is easy to derive that
$$
\mathrm{HeatFlow}^{\mathbb{R}^D}_i(x, T, f) := \int_0^T \frac{\partial^2}{\partial x_i^2}(P_t f)(x)\, dt
= \int_0^T \frac{\partial^2}{\partial x_i^2}(P_t f_i)(x_i)\, dt
= \int_0^T \frac{\partial}{\partial t}(P_t f_i)(x_i)\, dt
= P_T f_i(x_i) - f_i(x_i).
$$
For $f_i \in L^1(\mathbb{R})$, $\lim_{T \to \infty} P_T f_i(x_i) = 0$. Hence Proposition 2 is proved.

Next, we consider the more general case where some variables may interact with each other.

Proof of Lemma 1. Consider an arbitrary path over $\mathbb{R}^n$, $r(t) = (x_1(t), \dots, x_n(t))^T$, whose endpoints are $r(0) = x_a$ and $r(1) = x_b$; in particular, the $i$-th feature takes the values $(x_a)_i = a$ and $(x_b)_i = b$. By the fundamental theorem of calculus for line integrals, and if $\frac{\partial f}{\partial x_i} = \frac{d f_i}{d x_i}$ for any $x$, we have
$$
\begin{aligned}
f(x_b) - f(x_a) &= \int_0^1 \frac{\partial f(r(t))}{\partial x} \cdot \frac{dr(t)}{dt}\, dt
= \int_0^1 \sum_{i=1}^n \frac{\partial f(r(t))}{\partial x_i} \frac{d x_i(t)}{dt}\, dt \\
&= \sum_{j \neq i} \int_0^1 \frac{\partial f(r(t))}{\partial x_j} \frac{d x_j(t)}{dt}\, dt + \int_0^1 \frac{\partial f(r(t))}{\partial x_i} \frac{d x_i(t)}{dt}\, dt \\
&= \sum_{j \neq i} \int_0^1 \frac{\partial f(r(t))}{\partial x_j} \frac{d x_j(t)}{dt}\, dt + \int_0^1 \frac{d f_i(x_i(t))}{d x_i} \frac{d x_i(t)}{dt}\, dt \\
&= \sum_{j \neq i} \int_0^1 \frac{\partial f(r(t))}{\partial x_j} \frac{d x_j(t)}{dt}\, dt + f_i(b) - f_i(a).
\end{aligned}
$$
With $x_a$ set to $0$ and assuming $f(0) = 0$, we have $f(x) = \sum_{j \neq i} \int_0^1 \frac{\partial f(r(t))}{\partial x_j} \frac{d x_j(t)}{dt}\, dt + f_i(x_i)$.
Since the path $r(t)$ is arbitrary, the equation holds for any path between $0$ and $x$; if the first term depended on $x_i$, then differentiating both sides with respect to $x_i$ would contradict $\frac{\partial f}{\partial x_i} = \frac{d f_i}{d x_i}$, and hence Lemma 1 is proved.

Next, we prove Proposition 3, namely that our method $\mathrm{HeatFlow}^{\mathbb{R}^D}_i(\cdot, t, f)$ solves the Poisson equation $\Delta g_{i,t} = d^* d g_{i,t} = d^* d_i (P_t f - f)$.

Proof. First we emphasize that the operator $d^* d_i$ and the heat operator $P_t$ commute, since $d^* d_i$ commutes with the Laplacian $\Delta$:
$$
d^* d_i \Delta f = \operatorname{div}\Big(\frac{\partial}{\partial x_i} \sum_{k=1}^D \frac{\partial^2}{\partial x_k^2} f\Big)
= \frac{\partial^2}{\partial x_i^2} \sum_{k=1}^D \frac{\partial^2}{\partial x_k^2} f
= \sum_{k=1}^D \frac{\partial^2}{\partial x_k^2} \frac{\partial^2}{\partial x_i^2} f
= \Delta d^* d_i f.
$$
With this fact, we have
$$
\begin{aligned}
d^* d\, \mathrm{HeatFlow}_i(x, t, f)
&= d^* d \int_0^t d^* d_i (P_\tau f)(x)\, d\tau
= d^* d \int_0^t P_\tau (d^* d_i f)(x)\, d\tau \\
&= \int_0^t \Delta P_\tau (d^* d_i f)(x)\, d\tau
= \int_0^t \frac{\partial}{\partial \tau} P_\tau (d^* d_i f)(x)\, d\tau \\
&= P_t (d^* d_i f)(x) - d^* d_i f(x)
= d^* d_i (P_t f - f)(x).
\end{aligned}
$$
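Returning to Proposition 2, the additive-recovery property admits a quick closed-form check (this sketch is ours): take each component $f_i$ to be a Gaussian bump, for which heat smoothing simply inflates the variance by $2t$ (as derived in Appendix A.3), so $\mathrm{HeatFlow}_i = P_T f_i(x_i) - f_i(x_i) \to -f_i(x_i)$ as $T \to \infty$:

```python
import numpy as np

def gauss(x, var):
    """1D Gaussian density with mean 0 and variance `var`."""
    return np.exp(-x**2 / (2 * var)) / np.sqrt(2 * np.pi * var)

# additive model f(x) = f1(x1) + f2(x2), each component a Gaussian bump;
# heat smoothing of a Gaussian inflates its variance: P_t f_i = N(0, var_i + 2t)
def heatflow_i(xi, var, T):
    # for an additive model: HeatFlow_i(x, T, f) = P_T f_i(x_i) - f_i(x_i)
    return gauss(xi, var + 2 * T) - gauss(xi, var)

xi, var = 0.3, 1.0
# as T grows, P_T f_i -> 0 and the attribution recovers -f_i(x_i)
assert abs(heatflow_i(xi, var, 1e6) + gauss(xi, var)) < 1e-3
# at T ~ 0, nothing has diffused yet and the attribution vanishes
assert abs(heatflow_i(xi, var, 1e-9)) < 1e-6
```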

A.3 EXAMPLE ON MULTIVARIATE GAUSSIAN

Here, we present a simple example of our method, HeatFlow, on a $d$-dimensional multivariate Gaussian with mean $\mu$ and diagonal covariance $\Sigma = \operatorname{diag}(\sigma_1, \dots, \sigma_d)$. First, the solution of the heat equation is calculated as
$$
P_t f(x) = \int k_t(x, y)\, \mathcal{N}(y; \mu, \Sigma)\, dy
= \int \frac{1}{(4\pi t)^{d/2}} \exp\Big(-\frac{1}{4t}\|x - y\|^2\Big) \frac{1}{(2\pi)^{d/2} |\Sigma|^{1/2}} \exp\Big(-\frac{1}{2}(y - \mu)^T \Sigma^{-1} (y - \mu)\Big)\, dy
= \mathcal{N}(x; \mu, \Sigma + 2tI).
$$
Consequently, evaluated at the mean $x = \mu$, the contribution of the $i$-th variable is
$$
\mathrm{HeatFlow}^{\mathbb{R}^D}_i(x, T, f) := \int_0^T \frac{\partial^2}{\partial x_i^2}(P_t f)(x)\, dt
= \int_0^T \frac{1}{(2\pi)^{d/2} |\Sigma + 2tI|^{1/2}} \Big(-\frac{1}{\sigma_i + 2t}\Big)\, dt.
$$

A.4 ADDITIONAL RELATED WORK

A relevant line of literature from the perspective of generative models is the identification of interpretable latent dimensions (Yang et al., 2021). In (Wang & Ponce, 2020), unsupervised discovery of interpretable axes is facilitated by exploring the latent space along the eigenvectors of the metric tensor defined by the decoder; eigenvectors at different ranks encode qualitatively different types of changes. The main difference between our work and this prior literature is that our primary goal is to interpret a learned regression/classification model, using a generative model only to explore the data manifold, while their goal is to explain the generative model itself.

The discussion in the main text assumes that we have access to a parameterized representation of the manifold $M$. In practice, the manifold may be implicitly defined via a projection function that maps an arbitrary point onto the manifold by identifying its closest on-manifold counterpart in terms of Euclidean distance (Ruuth & Merriman, 2008; März & Macdonald, 2012). Given such a closest-point function $\mathrm{cp}(x): \mathbb{R}^D \to M$, a function $f: M \to \mathbb{R}$ can be extended to the surrounding space by defining $f \circ \mathrm{cp}: \mathbb{R}^D \to \mathbb{R}$. It has been shown that the intrinsic operations of grad and $\Delta_M$ then simplify as $\operatorname{grad}_M f(x) = \nabla[f \circ \mathrm{cp}](x)$ and $\Delta_M f(x) = \Delta[f \circ \mathrm{cp}](x)$. The latter equation provides a fourth possibility for disaggregating the Laplace-Beltrami operator.

A.5 NUMERICAL IMPLEMENTATION

In this section, we briefly discuss the numerical implementation of our method in an end-to-end framework, from manifold learning to multi-scale explanation.
Since there are numerous approaches to manifold learning and high-dimensional PDE solving, our implementation is only one of many possible realizations and can be generalized to different manifold settings for various purposes.

Manifold Learning. Deep generative models strive to infer the probability distribution of observational data $x \in X = \mathbb{R}^D$, as well as to learn the underlying data manifold (Brehmer & Cranmer, 2020; Arvanitidis et al., 2018). Approaches such as VAEs (Kingma & Welling, 2014; Rezende et al., 2014) and GANs (Goodfellow et al., 2014) assume a $d$-dimensional data manifold $M$ embedded in the ambient space with $d < D$, realized through a highly flexible generator $g: Z \to M$, where $z \in Z = \mathbb{R}^d$ is the latent variable. The local Jacobian of the generator, $J_z = \frac{\partial g}{\partial z}$, provides a local basis in the input space, and $G_z = J_z^T J_z$ is the Riemannian metric.

Heat Equation Solving. To solve the heat equation on a manifold, especially a high-dimensional one, we utilize the Feynman-Kac formula and the framework proposed by (Beck et al., 2021; Berner et al., 2020), in which one simulates training data in order to learn the solution $P_t f$ by means of deep learning. A supervised learning problem is constructed with predictor variables $(x, t)$ and target variable $y = f(X_t)|_{X_0 = x}$. The unique minimizer of the quadratic loss $\min_\phi \mathbb{E}[(\phi(x, t) - y)^2]$ is the solution of the heat equation. In practice, the following empirical error is minimized over a function space of suitable neural networks $H$:
$$
\varphi = \arg\min_{\phi \in H} \frac{1}{s} \sum_{i=1}^s \big(\phi(x_i, t_i) - f(X_{t_i})|_{X_0 = x_i}\big)^2,
$$
where $\{(x_i, t_i)\}_{i=1}^s$ are realizations of i.i.d. samples drawn uniformly from $M \times [0, T]$. To simulate Brownian motion on a Riemannian manifold, we perform random walks over the learned manifold according to the Laplacian adopted. In the Euclidean setting, random walks are realized by adding Gaussian noise.
In curved space, one usually resorts to geodesic random walks performed directly on the latent variables; the detailed procedure is presented in Algorithm 2. For classification models, to align with the training of the original model, we minimize the cross-entropy loss instead of the MSE in Eq. 11.
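The Feynman-Kac target generation can be sanity-checked in the Euclidean setting (a minimal 1D sketch of ours, not the paper's implementation): the heat equation $u_t = \Delta u$ corresponds to a random walk with increments $\mathcal{N}(0, 2\,\delta t)$, and for $f(x) = x^2$ the smoothed solution is known in closed form, $P_t f(x) = x^2 + 2t$. A regression model $\phi(x, t)$ would then be fit to pairs $((x, t),\ f(X_t))$:

```python
import numpy as np

rng = np.random.default_rng(2)

def simulate_targets(f, x, t, n_steps=100, n_paths=200_000):
    """Euclidean random walk matching u_t = Δu: increments are N(0, 2·dt).
    Returns a Monte Carlo estimate of E[f(X_t) | X_0 = x] ≈ (P_t f)(x)."""
    dt = t / n_steps
    x_t = np.full(n_paths, x, dtype=float)
    for _ in range(n_steps):
        x_t += np.sqrt(2 * dt) * rng.standard_normal(n_paths)
    return f(x_t).mean()

# sanity check against the closed form: for f(x) = x^2, P_t f(x) = x^2 + 2t
x, t = 1.5, 0.8
assert abs(simulate_targets(np.square, x, t) - (x**2 + 2 * t)) < 0.05
```

On a learned manifold, the Euclidean increments above are replaced by the metric-aware steps of Algorithm 2.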

Gradient and Decomposition of Laplacian

To compute the gradient and the attribution after solving the heat equation, we need to explicitly realize the projection $E_i$. When a local chart $g$ is available, such as a decoder function, we define $J(J^T J)^{-1} J^T$ as the projection matrix. The resulting explicit forms of $\operatorname{grad} f$ and $E_i$ are
$$
\operatorname{grad} f = (J^T J)^{-1} J^T \frac{\partial f}{\partial x} = G^{-1} \frac{\partial f}{\partial z}, \qquad
E_i = (J^T J)^{-1} J^T e_i = G^{-1} \frac{\partial g_i}{\partial z}.
$$
By the definition of the Christoffel symbols, which describe the metric connection of the manifold, it is easy to derive
$$
\Gamma^i_{jk} = \sum_m \frac{1}{2} G^{-1}_{im} \big(\partial_j G_{km} + \partial_k G_{jm} - \partial_m G_{jk}\big) = \sum_{m,b} G^{-1}_{im} J^b_m H^b_{jk},
$$
where $H^b_{jk} = \frac{\partial^2 g_b}{\partial z_j \partial z_k}$ is the Hessian of the generator and $J^b_m = \frac{\partial g_b}{\partial z_m}$ its Jacobian. Following the definition of the Hessian tensor, we have
$$
\nabla^2 f(E_i, E_i) = E_i^T \Big[\frac{\partial^2 f}{\partial z_j \partial z_k} - \Gamma^l_{jk} \frac{\partial f}{\partial z_l}\Big]_{jk} E_i
= \Big(\frac{\partial g_i}{\partial z}\Big)^T G^{-1} \Big[H_f - J_f G^{-1} \Big(\sum_b J^b H^b\Big)\Big] G^{-1} \frac{\partial g_i}{\partial z},
$$
where $H_f$ and $J_f$ are the Hessian and Jacobian of the explained model $f$ with respect to the latent variables, respectively. Substituting this implementation into the third decomposition strategy in Eq. 7 for each $P_t f$ yields the multi-scale attribution.

The following pseudo-algorithm shows a reference implementation of HeatFlow. Notice that, given the training data and the function to be explained, the manifold learning and heat equation solving steps only need to be carried out once; the resulting manifold and heat kernel can be reused to explain any further test input. To simulate Brownian motion on a Riemannian manifold, as needed for solving a high-dimensional heat equation, we resort to Algorithm 2, which is also adopted by (Arvanitidis et al., 2018).

Table 1: Definition of the math symbols involved, in local coordinates on a Riemannian manifold and in Cartesian coordinates in the special case of Euclidean space, along with their intuitive meanings in natural language (intuitive meanings referenced from Wikipedia).
Particularly for Cartesian coordinates in Euclidean space, a three-dimensional example is also shown. Note that the Einstein summation convention is used, implying summation over $i$ and $j$. Here $e_i = \partial x / \partial x^i$ and $e^i = dx^i$ denote the unnormalized local covariant and contravariant bases, $g^{ij}$ is the inverse metric tensor, $\alpha_X$ is the 1-form dual to the vector field $X$, and $\langle\!\langle \cdot, \cdot \rangle\!\rangle$ is the square-integrable inner product.

Flux: the amount of "stuff" flowing through a surface locally per unit time, carried along the vector field.


Laplacian: $\Delta f = \nabla \cdot \nabla f$; in local coordinates $\frac{1}{\sqrt{g}} \frac{\partial}{\partial x^i}\big(\sqrt{g}\, g^{ij} \frac{\partial f}{\partial x^j}\big)$; in Cartesian coordinates $\frac{\partial^2 f}{\partial x_i^2}$, e.g. $\frac{\partial^2 f}{\partial x^2} + \frac{\partial^2 f}{\partial y^2} + \frac{\partial^2 f}{\partial z^2}$ in 3D. Intuitive meaning: local average deviation, i.e. how much the average value of a function over small balls centered at a point deviates from its value at that point.



Face images from the UTKFace https://susanqq.github.io/UTKFace/ dataset. Source code available at: https://anonymous.4open.science/r/heat-explainer-FFD0



(a) Heat flow and the decomposition of Laplacian in two directions.

Figure 1: A toy example in R 2 .

Latent variables are first sampled, based on which labels and input images are generated, as shown in Figure 2a. The underlying manifold is known: the Cartesian product of six circles. Local features are the pixels close to the upper-left corner, while global ones are those of the larger squares.

Heatmaps generated by various methods.

Figure 2: (a) Top: label generation. Bottom: image generation. (b) Comparison of HeatFlow, Grad, SG, IG with a black baseline, and BlurIG on synthetic data. HeatFlow under the true manifold and under manifolds learned by a VAE with latent dimension d = 12 and d = 6 is presented in the first three rows, respectively. In the fourth row, gradients on the true manifold are collected at each time step. The reconstructed inputs are shown below the original input, and vanilla Grad and IG are presented below the reconstructed inputs. For HeatFlow, SG, and BlurIG, different time steps, noise levels of Gaussian blur, and partial integration up to kernel width α are shown, respectively.

(a) Multi-scale attributions on MNIST.(b) HeatFlow for different logits.

Figure 3: (a) Heat maps of the logit of the label class on MNIST test samples. The original input is shown in the upper-left corner, with vanilla Grad and IG presented below it. HeatFlow, SG, and BlurIG with increasing time, noise level, and kernel width are shown in each row starting from the second column, respectively. (b) Heat diffusion for MNIST samples, comparing logits of different classes. Left: change in function value. Right: HeatFlow attribution maps.

Figure 4: The change in accuracy for MNIST (upper) and MAE for UTKFace (lower) as an increasing percentage of the pixels attributed as important is included (left) or excluded (right).


Algorithm 1: A reference implementation of HeatFlow
Require: D: training dataset; f: R^D → R: a trained model; x*: a test input to be explained; T: total horizon
Output: Ψ = {ψ_i^(t)}_{i,t=1}^{D,T}: feature importance for each feature i and level of localness t
// Manifold Learning
1: learn the underlying manifold from D using a VAE, obtaining encoder g, decoder h, Jacobian of the encoder J_g, and metric G_g = J_g^T J_g
// Heat Equation Solving
2: initialize model ϕ with a structure similar to f, taking t as an extra input
3: while learning not done do
4:   for each batch do
5:     sample {x^(i), t^(i)}_{i=1}^s uniformly from D × [0, T]
6:     for each i = 1, ..., s do
7:       Z^(i) ← RW(z_0 = g(x^(i)), s, t^(i), G_g)
8:     end
9:     update ϕ to minimize the loss L = Σ_{i=1}^s (ϕ(x^(i), t^(i)) − f(h(Z^(i)[−1, :])))^2
10:   end
11: end, yielding the solution ϕ*(x, t) ≈ P_t f(x) on the learned manifold
// Decomposition of Laplacian
12: initialize Ψ as a zero array
13: z* = g(x*)
14: for each i = 1, ..., D do
15:   for each t = 1, ..., T do
16:     calculate δ_i^HeatFlow = ∇²ϕ*(·, t)(E_i(z*), E_i(z*)) according to Eq. 12
17:     accumulate Ψ[i, t] = Ψ[i, t−1] + δ_i^HeatFlow · δt
18:   end
19: end
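The random-walk subroutine RW invoked during heat-equation solving above can be sketched in a few lines (a minimal sketch of ours; `metric` is a hypothetical callable returning the metric tensor G_z at the current latent point):

```python
import numpy as np

rng = np.random.default_rng(3)

def random_walk(z0, step, n_steps, metric):
    """Metric-aware random walk RW(z0, s, T, G): each step draws
    v = U L^(-1/2) eps, where G_z = U diag(L) U^T and eps ~ N(0, I)."""
    z = np.array(z0, dtype=float)
    path = np.empty((n_steps, z.size))
    for t in range(n_steps):
        G = metric(z)
        L, U = np.linalg.eigh(G)          # eigenvalues / eigenvectors of G_z
        eps = rng.standard_normal(z.size)
        v = U @ (eps / np.sqrt(L))        # U L^(-1/2) eps
        z = z + step * v
        path[t] = z
    return path

# sanity check: with a Euclidean metric G = I, the walk reduces to an
# isotropic Gaussian random walk
path = random_walk(np.zeros(2), step=0.1, n_steps=50,
                   metric=lambda z: np.eye(2))
assert path.shape == (50, 2)
```

Note how the step length shrinks where the metric is large, which is what makes the walk approximate Brownian motion on the manifold rather than in the latent coordinates.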

Algorithm 2: Random walk on a Riemannian manifold: RW(z_0, s, T, G)
Input: latent starting point z_0 ∈ R^d, step size s, number of steps T, metric tensor G
Output: random walk path Z ∈ R^{T×d}
1: z = z_0
2: for t = 1 to T do
3:   L, U = eig(G_z)  (L: eigenvalues, U: eigenvectors)
4:   v = U L^{−1/2} ε, ε ~ N(0, I_d)
5:   z = z + s · v
6:   Z(t, :) = z
7: end

A.6 DEFINITION OF MATHEMATICAL SYMBOLS

Divergence: the outward flux of a vector field from an infinitesimal volume around a given point. Codifferential (adjoint of d): $d^*(\alpha_X) = -\operatorname{div}(X)$, characterized by $\langle\!\langle f, d^*\theta \rangle\!\rangle = \langle\!\langle df, \theta \rangle\!\rangle$.
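The explicit forms of the metric, the projection matrix $J(J^TJ)^{-1}J^T$, and $E_i$ from the "Gradient and Decomposition of Laplacian" paragraph can also be checked numerically. In the sketch below (all names are ours and hypothetical), a one-dimensional decoder traces a circle of radius $r$ in $\mathbb{R}^2$; the finite-difference Jacobian recovers the pullback metric $G = r^2$, and the projection matrix is verified to be idempotent:

```python
import numpy as np

def jacobian(g, z, eps=1e-6):
    """Central finite-difference Jacobian J = dg/dz of a decoder g: R^d -> R^D."""
    z = np.atleast_1d(np.asarray(z, dtype=float))
    cols = [(g(z + eps * e) - g(z - eps * e)) / (2 * eps) for e in np.eye(z.size)]
    return np.stack(cols, axis=1)                  # shape (D, d)

r = 2.0
g = lambda z: np.array([r * np.cos(z[0]), r * np.sin(z[0])])  # circle of radius r

z = np.array([0.3])
J = jacobian(g, z)
G = J.T @ J                         # pullback (Riemannian) metric
assert abs(G[0, 0] - r**2) < 1e-4   # for a circle, G = r^2

# ambient basis direction e_0 expressed in latent coordinates: E_0 = G^{-1} J^T e_0
e0 = np.array([1.0, 0.0])
E0 = np.linalg.solve(G, J.T @ e0)

# projection matrix J (J^T J)^{-1} J^T onto the tangent space is idempotent
P = J @ np.linalg.solve(G, J.T)
assert np.allclose(P @ P, P, atol=1e-6)
```

In the actual implementation the Jacobian would come from automatic differentiation of the decoder rather than finite differences.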

A.7 IMPLEMENTATION DETAILS

We summarize the hyperparameter settings used in our experiments in Table 2.

