VARIATIONAL SALIENCY MAPS FOR EXPLAINING MODEL'S BEHAVIOR

Abstract

Saliency maps have been widely used to explain the behavior of image classifiers. We introduce a new interpretability method that treats a saliency map as a random variable and aims to calculate the posterior distribution over the saliency map. The likelihood function is designed to measure the distance between the classifier's predictive probability on an image and that on a locally perturbed image. For the prior distribution, we encourage the attributions of adjacent pixels to be positively correlated. We use a variational approximation, and show that the approximate posterior is effective in explaining the classifier's behavior. It also provides uncertainty over the explanation, giving experts auxiliary information about how trustworthy the explanation is.

1. INTRODUCTION

Since the advent of deep learning brought significant improvement in general machine learning tasks (Krizhevsky et al., 2012), explaining deep networks has become an important issue (Ribeiro et al., 2016). Problems inherent in training a deep neural network, such as unfairness (Arrieta et al., 2020) or the model classifying based on unintended features (Ribeiro et al., 2016), can be mitigated when the model is finely explained. Therefore, models that have gained users' trust through explanation are preferred in practical applications. Saliency maps, also called attribution maps or relevance maps, have been widely used as interpretability methods in classification tasks, typically in the image domain (Simonyan et al., 2013). A saliency map represents the importance of each feature of the given data in influencing the model's decision. There have been several approaches for obtaining saliency maps: backpropagation based methods (Ancona et al., 2017; Bach et al., 2015; Lundberg & Lee, 2017; Montavon et al., 2017; Selvaraju et al., 2017; Shrikumar et al., 2017; Simonyan et al., 2013; Smilkov et al., 2017; Srinivas & Fleuret, 2019; Sundararajan et al., 2017) and perturbation based methods (Chang et al., 2019; Chen et al., 2018; Dabkowski & Gal, 2017; Fong et al., 2019; Fong & Vedaldi, 2017; Schulz et al., 2020; Zeiler & Fergus, 2014; Zintgraf et al., 2017). Regardless of the approach, the common implicit assumption shared by most previous interpretability methods is that the saliency map is deterministic once a model and an input are given: one attribution map is provided to explain the model's decision for each data point. Instead, we propose a stochastic approach called Variational Saliency maps (VarSal), which assumes that the interpretation has inherent randomness. The intuition stems from the observation that stochasticity can make interpretability methods more effective.
For instance, FIDO (Chang et al., 2019) expands the search space of the mask by drawing it from a Bernoulli distribution. This prevents the mask search from being confined to a local region, as happens when the mask is optimized directly (Fong & Vedaldi, 2017). The example suggests that stochasticity yields better interpretations. We define the posterior distribution as the probability of the saliency map given the training data and the classifier. For the posterior to behave as a distribution over explanations, it is essential to carefully design the likelihood function and the prior distribution. We follow the idea of perturbation based methods to form the likelihood: an input that retains only the features with high attribution in a saliency map should reproduce the classifier's behavior. For modeling the prior, we propose a new covariance matrix for a Gaussian distribution that induces positive correlation among the attributions of adjacent pixels. As this property mimics total variation (TV) regularization, we name the prior the soft-TV Gaussian prior. After modeling the likelihood and the prior, a variational Bayesian method (Hoffman et al., 2013; Kingma & Welling, 2013) is used since the posterior is intractable. After optimization, unlike most perturbation based methods, VarSal produces a real-time saliency map since only a single forward pass is required to generate it. The VarSal method also provides high quality in visual inspection, with sophisticated borderlines and object-oriented attention. We compare VarSal with baseline methods on the perturbation benchmark to show the effectiveness of our approach. Finally, we examine a benefit of employing a posterior distribution: uncertainty over the explanation.

2. RELATED WORK

In this section, we review perturbation based interpretability methods. Fong & Vedaldi (2017) optimize a cost function with respect to a mask that indicates the features of an image most important to the classifier's prediction. This approach is further developed by Fong et al. (2019), who introduce a new method for constructing the perturbed image that reduces the number of hyper-parameters and produces better qualitative results. Both methods must optimize the mask every time they receive an input, which is computationally expensive. Dabkowski & Gal (2017) relax the time-complexity problem by training a network whose output is a saliency mask. However, all three methods are limited in producing an importance ranking among the features of a given image, since their objective is to produce a binary mask. PDA (Zintgraf et al., 2017) produces a saliency map from a different perspective: it computes the importance of each pixel by treating it as unobserved and marginalizing it out to obtain the classifier's predictive probability. The same idea is used in FIDO (Chang et al., 2019) to generate a perturbed image regarded as a sample from the training data distribution. FIDO optimizes the parameters of a Bernoulli dropout distribution for making a saliency mask; since the mask is sampled from the distribution at each training iteration, the search space of binary masks is explored rather than being limited to a local search. Our method is similar to FIDO in that VarSal also explores the search space by sampling the saliency map from the encoder in the training phase. There is also an information-theoretic approach to explaining the classifier's prediction. Schulz et al. (2020) adopt an information bottleneck that restricts the flow of information in an intermediate layer by adding noise, and find the importance of each feature by calculating the information flow. Chen et al. (2018) also adopt the mutual information concept and optimize its variational bound to train a network that maps an input image to a saliency map. VarSal is similar in that we also train the encoder network by optimizing the evidence lower bound (ELBO). However, our method differs in that we regard the saliency map as a random variable and aim to calculate the posterior over the saliency map.

3. METHOD

In this section, we introduce the details of the VarSal method, which provides stochastic saliency maps. Let the pre-trained classifier that we aim to interpret be M : R^{c×h×w} → Y, where x ∈ R^{c×h×w} is an input with c, h, and w the channel, height, and width of the input image, respectively, and Y = {1, 2, . . . , K} is the set of classes. The classifier M provides a categorical probability P_M(x) = ŷ ∈ Δ^{K−1}, where Δ^{K−1} is the (K−1)-simplex. Since the purpose of a saliency map s ∈ R^{h×w} is to describe the behavior of the classifier's prediction, our goal is to calculate the posterior distribution of the saliency map, p(s|x, ŷ) (solid lines in Figure 1). By Bayes' rule, the posterior is:


p(s|x, ŷ) = p(ŷ|x, s) p(s|x) / Z ,   (1)

where Z is the marginal likelihood. To calculate the posterior, we must model two terms: the likelihood p(ŷ|x, s) and the prior p(s|x).

3.1. MODELING LIKELIHOOD

The likelihood should be designed so that the posterior over the saliency map explains the behavior of the classifier. We focus on the property that the importance of each feature in an image is determined by observing the response of the classifier's output when the feature is perturbed (Zeiler & Fergus, 2014; Fong & Vedaldi, 2017; Zintgraf et al., 2017). More specifically, the important features are sufficient to correctly classify the input as the target class with high confidence. This concept was first introduced by Dabkowski & Gal (2017) and is called the smallest sufficient region (SSR). The difference between SSR and our approach is that we do not search for the smallest sufficient region, but rather rank the features (therefore, s ∈ R^{h×w}, not s ∈ {0, 1}^{h×w}). Moreover, we consider not the target class but the full categorical probability, in order to interpret the model itself. The likelihood is designed to satisfy the aforementioned properties:

−log p(ŷ|x, s, k) = D_KL[ P_M(x) ‖ P_M(x ⊘ τ^{(k)}(s)) ] + const ,   (2)

p(ŷ|x, s) = E_{p(k)}[ p(ŷ|x, s, k) ] ,   (3)

where D_KL is the Kullback-Leibler (KL) divergence, τ^{(k)} is a top-k operation, and ⊘ is a perturb operation that locally perturbs the input x. The top-k operation applied to the saliency map, τ^{(k)}(s) ∈ {0, 1}^{h×w}, acts as a mask where [τ^{(k)}(s)]_{i,j} = 1 when s_{i,j} is among the k largest attributions in s. This way, the top-k operation makes the conditional likelihood in equation 2 consider only the selected features of the input. By varying k, the amount of selected features is controlled; we set p(k) to be a uniform distribution. To locally perturb the input x with the perturb operation ⊘, we follow the method proposed by Fong & Vedaldi (2017),

x ⊘ m = x ⊙ m + x̄ ⊙ (1 − m) ,

where x̄ is a baseline input and ⊙ is pointwise multiplication. We use three baseline settings in our experiments: a blurred baseline, a noise baseline, and a mean baseline.
Equation 2 states that the categorical probability ŷ is more likely when, for a given k, the classifier's predictive probability on the input x is close to that on the perturbed input. This makes sense, since a better saliency map that explains the classifier's behavior more closely approximates the model's prediction using only the features selected by the top-k attributions. Also, we do not consider the ground-truth class or the top-1 predicted class, but rather all classes with their predictive probabilities, in order to examine the classifier's behavior itself. To consider various values of k, we take the expectation in equation 3.
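The masking-and-perturbation step above can be sketched in a few lines of numpy. This is an illustrative sketch, not the paper's code: `topk_mask`, `perturb`, and `neg_log_likelihood` are our own names, a hard (non-differentiable) top-k is used in place of the differentiable SOFT operator used for training, and the classifier is assumed to return a probability vector.

```python
import numpy as np

def topk_mask(s, k):
    """Binary mask with 1 at the k largest attributions of saliency map s (h x w)."""
    h, w = s.shape
    flat = s.ravel()
    idx = np.argpartition(flat, -k)[-k:]      # indices of the k largest values
    m = np.zeros(h * w)
    m[idx] = 1.0
    return m.reshape(h, w)

def perturb(x, m, x_baseline):
    """x (/) m: keep pixels where m == 1, replace the rest with the baseline."""
    return x * m + x_baseline * (1.0 - m)     # mask broadcasts over channels

def neg_log_likelihood(p_orig, p_pert, eps=1e-12):
    """KL[P_M(x) || P_M(x perturbed)], i.e. equation 2 up to an additive constant."""
    return float(np.sum(p_orig * (np.log(p_orig + eps) - np.log(p_pert + eps))))
```

In training, k is drawn from p(k) at every iteration and the per-k terms are averaged, which corresponds to the expectation in equation 3.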

3.2. SOFT-TV GAUSSIAN PRIOR

The easiest way to model the prior distribution p(s|x) is to make it independent of x, p(s|x) = p(s), and use a standard Gaussian distribution N(vec(s); 0, I), where vec(s) ∈ R^{hw} is the vectorized version of s. However, the standard Gaussian prior does not encode the belief that the attributions of neighboring pixels should be correlated (Fong et al., 2019). Therefore, we propose a new prior distribution that expresses this belief. While Dabkowski & Gal (2017) and Fong & Vedaldi (2017) used total variation (TV) regularization to prevent adversarial artifacts from appearing in the saliency mask, we mimic the TV idea in building the prior distribution so as to induce positive correlation between neighboring attributions. Specifically, we design the prior as a zero-mean Gaussian distribution, N(vec(s); 0, Σ), and infuse the TV knowledge into the covariance matrix by setting Σ_{i,j} > 0 when pixels i and j are identical or adjacent:

Σ_{i,j} = 1 ,   if i = j ;
Σ_{i,j} = α ,   if i − j ∈ {−w, −1, 1, w} and j ∈ Adj_i ;
Σ_{i,j} = α² ,  if i − j ∈ {−w−1, −w+1, w−1, w+1} and j ∈ Adj_i ;
Σ_{i,j} = 0 ,   otherwise ,

where α > 0 and Adj_i is the set of indices adjacent to pixel i (Figure 9 in Appendix B). In this way we grant the TV knowledge to the Gaussian prior, and call it the soft-TV Gaussian prior.
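The covariance above can be built directly, or equivalently as the Kronecker product of two tridiagonal matrices, which Section 3.3 exploits. A small numpy sketch, with `tridiag` and `soft_tv_cov` our own helper names:

```python
import numpy as np

def tridiag(n, alpha):
    """n x n matrix with 1 on the main diagonal and alpha on the first off-diagonals."""
    off = alpha * np.eye(n, k=1)
    return np.eye(n) + off + off.T

def soft_tv_cov(h, w, alpha):
    """Covariance of the soft-TV Gaussian prior: Sigma = kappa_h (x) kappa_w,
    so that Sigma[i, j] is 1 for i = j, alpha for 4-neighbors, alpha^2 for
    diagonal neighbors, and 0 otherwise."""
    return np.kron(tridiag(h, alpha), tridiag(w, alpha))
```

For α = 0.4 (the value used in the experiments), each tridiagonal factor is strictly diagonally dominant, so Σ is positive definite and thus a valid covariance matrix.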

3.3. VARIATIONAL INFERENCE ON SALIENCY MAPS

Modeling the likelihood as in equation 3 and the prior as N(vec(s); 0, Σ) makes the posterior of equation 1 intractable. Therefore, we approximate it with a distribution q_θ(s|x) parameterized by θ (dotted lines in Figure 1), where the objective is to minimize the KL divergence between q_θ(s|x) and p(s|x, ŷ):

argmin_θ D_KL[ q_θ(s|x) ‖ p(s|x, ŷ) ] = argmin_θ E_q[ −log p(ŷ|x, s) ]_(*) + D_KL[ q_θ(s|x) ‖ p(s|x) ]_(**) .   (5)

We apply a mean-field approximation with a univariate Gaussian for each factorized term of the approximate posterior, q_θ(s|x) = N( vec(s); µ_θ(x), diag(ν_θ(x)) ), where µ_θ(·) ∈ R^{hw} is the mean and ν_θ(·) ∈ R^{hw} the variance produced by the encoder.

There are two problems in optimizing equation 5: the non-differentiable top-k operation in (*) and the computational expense of (**). In the case of (*), the top-k operation τ^{(k)} is non-differentiable, so the gradient cannot flow backward. To overcome this issue, we approximate it with the differentiable SOFT operator proposed by Xie et al. (2020). This allows the gradient to flow from the classifier M to the encoder parameters θ. Note that the classifier M is the pre-trained classifier that we aim to interpret, and is therefore kept fixed.

Since the covariance Σ is of size hw × hw, it is computationally expensive to compute (**) naively (for the Imagenet dataset (Russakovsky et al., 2015), Σ has 224⁴ entries!). We solve this by decomposing Σ with a Kronecker product, Σ = κ_h ⊗ κ_w, where ⊗ is the Kronecker product, and κ_h ∈ R^{h×h} and κ_w ∈ R^{w×w} are tridiagonal matrices with 1 on the main diagonal and α on the first diagonals below and above it (Figure 9 in Appendix B). Then (**) can be derived analytically as:

D_KL[ q_θ(s|x) ‖ p(s|x) ] = (1/2) [ diag(κ_w⁻¹)ᵀ · rsh(ν_θ) · diag(κ_h⁻¹) + sum( rsh(µ_θ) ⊙ (κ_w⁻¹ · rsh(µ_θ) · κ_h⁻ᵀ) ) − sum(log ν_θ) ] + const ,   (7)

where sum(·) is the summation of all elements and ⊙ is elementwise multiplication.
For a vector b ∈ R^{hw}, we denote by rsh(b) ∈ R^{w×h} the reshaping of b into a matrix with [rsh(b)]_{i,j} = b_{i+wj}. Also, for a square matrix B, diag(B) is the vector whose i-th entry is B_{i,i}. The derivation of equation 7 is provided in Appendix D. Equation 7 shows that the computation is easily done in general deep learning frameworks such as PyTorch (Paszke et al., 2017).
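The benefit of the decomposition can be checked numerically: the Kronecker-factored expression agrees with the naive KL computed from the full hw × hw covariance. A sketch under our own naming (`kl_naive`, `kl_kron`); the constants are kept explicit so the two quantities agree exactly.

```python
import numpy as np

def kl_naive(mu, nu, kh, kw):
    """Exact KL[N(mu, diag(nu)) || N(0, kh (x) kw)] via the full hw x hw covariance."""
    S1 = np.kron(kh, kw)
    S1inv = np.linalg.inv(S1)
    hw = mu.size
    _, ld1 = np.linalg.slogdet(S1)
    return 0.5 * (np.trace(S1inv @ np.diag(nu)) + mu @ S1inv @ mu - hw
                  + ld1 - np.sum(np.log(nu)))

def kl_kron(mu, nu, kh, kw):
    """Same KL using only the h x h and w x w inverses (equation 7 with constants)."""
    h, w = kh.shape[0], kw.shape[0]
    khi, kwi = np.linalg.inv(kh), np.linalg.inv(kw)
    M = mu.reshape(h, w).T          # rsh: (i, j)-th entry is mu[i + w*j], shape w x h
    V = nu.reshape(h, w).T
    trace = np.diag(kwi) @ V @ np.diag(khi)
    quad = np.sum(M * (kwi @ M @ khi.T))
    _, ldh = np.linalg.slogdet(kh)
    _, ldw = np.linalg.slogdet(kw)
    return 0.5 * (trace + quad - h * w + w * ldh + h * ldw - np.sum(np.log(nu)))
```

The factored version never materializes the hw × hw matrix, which is what makes the regularization term tractable at Imagenet resolution.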

4. EXPERIMENTS

4.1. IMPLEMENTATION DETAIL

There are two terms in the objective function of equation 5: the reconstruction term (*) and the regularization term (**). The components of the regularization term have dimension hw, which is usually so large that the regularization term dominates the loss. We avoid this by introducing a hyper-parameter β on the regularization term (Higgins et al., 2017), with a default value of β = 1/(10hw). We test our method on the ImageNet dataset (Russakovsky et al., 2015). We use a pre-trained VGG16 (Simonyan & Zisserman, 2014) for the classifier M, and 16 convolutional layers for the encoder producing µ_θ and ν_θ. As for the variable k used in the top-k operator, we sample k from the uniform distribution, k ∼ U(hw/10, 9hw/10), at each training iteration. The baseline input x̄ is also randomly selected among the three baselines (blurred, noise, and mean) at each training iteration. For α in the soft-TV Gaussian prior, we set α = 0.4. Finally, unless stated otherwise, the mean of the approximate posterior is used in the qualitative and quantitative experiments. More details are given in Appendix C.

4.2. QUALITATIVE RESULTS

We perform visual inspection by comparing VarSal with previous interpretability methods. For fair comparison, on data that are correctly classified, the previous interpretability methods are run with respect to the ground-truth target, while VarSal explains the predictive probability. In Figure 2, the first heatmaps (Simonyan et al., 2013) are generated from the gradients of the input. They visually highlight the object, but are sparse. PDA (Zintgraf et al., 2017) usually produces unnecessary highlights since it only considers spatially local parts in the optimization process. Real-time saliency (Dabkowski & Gal, 2017) and EP (Fong et al., 2019) show a boolean mask with smoothed borderlines; this is because they optimize a mask smaller than the input and then upsample it to obtain the final saliency mask. FIDO (Chang et al., 2019) captures the shape of the object to some extent, but is still sparse. In contrast to these approaches, VarSal involves no upsampling and directly produces a saliency map at the same size as the input image. As the last heatmap shows, VarSal highlights the object with more sophisticated borderlines. More results are provided in Appendix E.


To investigate the importance of modeling the prior distribution, we visually compare the VarSal method trained with the standard Gaussian prior against that trained with the soft-TV Gaussian prior. As Figure 3 shows, VarSal trained with the standard Gaussian prior produces saliency maps with high-frequency noise inside the object boundary. This is because the standard Gaussian prior does not constrain adjacent attributions to be correlated. In contrast, VarSal optimized with the soft-TV Gaussian prior shows smaller variation of attribution between adjacent pixels. Moreover, the noise on the background is reduced when the soft-TV prior is used.

4.3. SANITY CHECK

A prerequisite for an interpretability method is to pass the sanity check (Adebayo et al., 2018), which tests whether the method provides a saliency map that actually depends on the classifier and the data (Figure 4).

4.4. PIXEL PERTURBATION BENCHMARK

Determining the state-of-the-art interpretability method is challenging since there is no evaluation benchmark that exactly reflects a method's performance (Hooker et al., 2019). Commonly used benchmarks can separate methods to some extent, but cannot exactly rank them with a single quantitative indicator. We use the pixel perturbation benchmark to verify the usefulness of our method.

For the pixel perturbation metric, the image pixels corresponding to the largest k% saliency values (Ancona et al., 2017; Samek et al., 2016) or the smallest k% saliency values (Srinivas & Fleuret, 2019) are erased, and the response of the classifier's output is observed. In our experiments we erase pixels with the latter procedure, as the former is more prone to creating artifacts that obscure the reason for the score drop (Srinivas & Fleuret, 2019). We set to zero the input pixels corresponding to the smallest k% values in the saliency map, and measure the KL divergence between the classifier's predictive probability on the original input and that on the perturbed input; an interpretability method is considered better when this distance is smaller. PDA and FIDO are computationally expensive, taking about 25 minutes and 1 minute per image, respectively (Figure 5 (b)). Therefore, for evaluating PDA and FIDO, we randomly sample 100 images from the validation dataset and repeat the process 5 times; the other methods are evaluated on the entire validation dataset. For fair comparison, the saliency map is generated from the predicted class for every interpretability method except VarSal. We also include a control experiment that perturbs k% randomly drawn pixels; we omit its error range since the standard deviation is below 1e-2. The results are shown in Figure 5 (a). Input-Gradient (Simonyan et al., 2013), Integrated-Gradient (Sundararajan et al., 2017), and PDA approach the control setting as k grows. Others, such as SmoothGrad-squared (Smilkov et al., 2017; Hooker et al., 2019), FIDO, and VarSal, maintain similar values across k. Moreover, VarSal performs better with the soft-TV Gaussian prior than with the standard Gaussian prior.
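The perturbation metric described above can be sketched as follows. This is our own minimal numpy illustration, not the evaluation code: `least_k_perturbation_score` and the `classify` callable are hypothetical names, and ties among saliency values are broken arbitrarily by the sort.

```python
import numpy as np

def least_k_perturbation_score(classify, x, s, k_percent, eps=1e-12):
    """Zero the pixels with the smallest k% saliency values and measure
    KL[classify(x) || classify(x_perturbed)]; smaller is better."""
    h, w = s.shape
    n_drop = int(h * w * k_percent / 100.0)
    order = np.argsort(s.ravel())           # ascending saliency
    mask = np.ones(h * w)
    mask[order[:n_drop]] = 0.0              # erase the least salient pixels
    x_pert = x * mask.reshape(h, w)         # mask broadcasts over channels
    p, q = classify(x), classify(x_pert)
    return float(np.sum(p * (np.log(p + eps) - np.log(q + eps))))
```

A good saliency map keeps the classifier's predictive distribution nearly unchanged even as a large fraction of low-attribution pixels is erased.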
An interesting point is that a saliency map obtained by sampling from the approximate posterior, rather than taking its mean, results in lower quality (red dotted line in Figure 5 (a)). We speculate that this is because sampling produces artifacts in the perturbed image that cause a score drop (Kurakin et al., 2016). As shown in Figure 6, even though both the sampled saliency map and the mean saliency map capture objects with high values, the sampled one contains high-frequency noise. When the image pixels corresponding to the smallest k% of the noisy saliency map are perturbed, the perturbed image exhibits sharp color contrast between adjacent pixels that can degrade the score.

4.5. UNCERTAINTY OVER EXPLANATION

One advantage of having a posterior distribution over the saliency map is that it provides the uncertainty of the explanation, obtained by summarizing the posterior with its covariance matrix. Since VarSal approximates the posterior with a factorized Gaussian, we can inspect the variance of each attribution; examples are shown in Figure 7 (a). While the explanation (second row) has higher attribution at the object, the uncertainty over the explanation (third row) concentrates at the borderline of the object.

We also qualitatively compare the posterior results on shifted samples. Hendrycks & Dietterich (2018) established the Imagenet-C dataset, created by visually corrupting the data in the Imagenet dataset. The first row in Figure 7 (b) and (c) shows examples of images corrupted by "Brightness" and "Motion blur", respectively, at severity level 5. While the explanations of the corrupted images are similar to those of the original images in that they all capture the object to some extent, the uncertainty of the explanation looks different: the heatmaps of standard deviation are noisy for the corrupted images.

5. CONCLUSION AND FUTURE WORK

In this paper, we presented a new perspective on saliency maps in which the saliency map is assumed to be a random variable. After designing a likelihood function and a prior distribution that make the posterior over the saliency map explain the behavior of the classifier's prediction, the approximate posterior is optimized by maximizing the ELBO. The experiments were performed with the mean of the approximate posterior, and showed that our method produces object-oriented saliency maps with visually sharp borderlines. For quantitative results, the pixel perturbation benchmark was used to demonstrate the effectiveness of our method. We verified that using the proposed soft-TV Gaussian distribution rather than the standard Gaussian distribution for the prior yields better performance in both qualitative and quantitative comparisons. It was also shown that the proposed VarSal method has a strong advantage over other methods in terms of inference-time computational complexity. Finally, we showed that our method provides not only the explanation but also the uncertainty over the explanation.

There remains future work for producing a better posterior distribution over the saliency map. In modeling the likelihood, the problem of data distribution shift could be mitigated by generating a perturbed image that resembles a sample from the training data distribution (Chang et al., 2019). It would also be interesting to consider axiomatic prior knowledge (Sundararajan et al., 2017; Srinivas & Fleuret, 2019) in modeling the prior distribution and the likelihood. Finally, uncertainty over the explanation deserves further study, since it can tell us how much the explanation given by an interpretability method can be trusted.

A TRAINING OVERVIEW

An input image is fed into the encoder network, which outputs the mean and the variance of a Gaussian distribution (the approximate posterior). A saliency map is sampled from it and passed through a differentiable top-k operator to produce a binary mask. With this mask, the input image is perturbed. The perturbed image is passed through the classifier M, which gives a categorical probability ŷ_perturb. The loss function is composed of two terms: the reconstruction term between ŷ_perturb and the categorical probability obtained from the original image, ŷ_origin, and the regularization term between the approximate posterior q(s|x) = N(s; µ(x), diag(ν(x))) and the prior distribution N(s; 0, Σ). As the classifier M is the model we aim to interpret, it is fixed so that its parameters are not updated while training the encoder network.

B SOFT-TV GAUSSIAN PRIOR Σ

We consider the prior knowledge that adjacent pixels in a saliency map have positive correlation (Figure 9 (a)). After modeling the prior as a Gaussian distribution N(s; 0, Σ), the covariance matrix Σ is designed to infuse this prior knowledge into the distribution (Figure 9 (b)). The covariance matrix is then decomposed as a Kronecker product to efficiently calculate the KL divergence in the regularization loss.

C IMPLEMENTATION DETAILS

Encoder architecture We use 17 convolution layers for the encoder network. To keep the spatial size of the encoder's input and output the same, we do not use pooling layers. Every convolution layer consists of a convolution with kernel size 3 × 3, stride 1, and padding 1, followed by batch normalization and a rectified linear unit. The numbers of output channels are: [64, 64, 64, 64, 32, 32, 32, 32, 32, 16, 16, 16, 16, 16, 16, 2]. The encoder network provides µ ∈ R^{h×w} and η ∈ R^{h×w} as the two channels of its output, where µ is the mean of the Gaussian distribution and η = log ν, with ν ∈ R^{h×w} the variance of the Gaussian distribution.

Hyper-parameters We use the Adam optimizer (Kingma & Ba, 2014) with learning rate 0.0001, weight decay 0.0005, and betas (0.9, 0.99). We use a batch size of 128 while training the encoder network. We train for 10 epochs on the Imagenet dataset, and save the network with the lowest loss.

D PROOF OF REGULARIZATION LOSS EQUATION

Let us first define the notation. ⊗ is the Kronecker product and ⊙ is elementwise multiplication. sum(·) is the summation of all elements. For a vector b ∈ R^{hw}, we denote by rsh(b) ∈ R^{w×h} the reshaping of b into a matrix whose (i, j)-th entry is b[i + wj], and by diag(b) the diagonal matrix with diagonal b. For a square matrix B, diag(B) is the vector whose i-th entry is B[i, i]. For a matrix B, vec(B) denotes vectorization by stacking the columns of B into a single column vector.

Recall that the approximate posterior is q(s|x) = N(µ, Σ₀) with Σ₀ = diag(ν), µ, ν ∈ R^{hw}, and the prior distribution is p(s|x) = N(0, Σ₁) with Σ₁ = κ_h ⊗ κ_w, κ_h ∈ R^{h×h}, κ_w ∈ R^{w×w}. The KL divergence between the two Gaussian distributions is:

D_KL[ q(s|x) ‖ p(s|x) ] = D_KL[ N(µ, Σ₀) ‖ N(0, Σ₁) ] = (1/2) [ tr(Σ₁⁻¹ Σ₀) + µᵀ Σ₁⁻¹ µ − hw + log( |Σ₁| / |Σ₀| ) ] .   (8)

We compute each term on the RHS of equation 8 so that it is computationally efficient:

tr(Σ₁⁻¹ Σ₀) = tr( (κ_h ⊗ κ_w)⁻¹ · diag(ν) ) = tr( (κ_h⁻¹ ⊗ κ_w⁻¹) · diag(ν) ) = diag(κ_h⁻¹ ⊗ κ_w⁻¹)ᵀ · ν = diag(κ_w⁻¹)ᵀ · rsh(ν) · diag(κ_h⁻¹) .   (9)

µᵀ Σ₁⁻¹ µ = µᵀ · (κ_h⁻¹ ⊗ κ_w⁻¹) · vec(rsh(µ)) = µᵀ · vec( κ_w⁻¹ · rsh(µ) · κ_h⁻ᵀ ) = sum( rsh(µ) ⊙ (κ_w⁻¹ · rsh(µ) · κ_h⁻ᵀ) ) .   (10)
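The two Kronecker identities used in equations 9 and 10 can be verified numerically on a small grid. A numpy sketch under our own setup (the grid size, α = 0.4, and the `rsh` helper mirror the definitions above):

```python
import numpy as np

rng = np.random.default_rng(1)
h, w = 3, 5
kh = np.eye(h) + 0.4 * (np.eye(h, k=1) + np.eye(h, k=-1))
kw = np.eye(w) + 0.4 * (np.eye(w, k=1) + np.eye(w, k=-1))
nu = rng.uniform(0.5, 2.0, size=h * w)
mu = rng.normal(size=h * w)

def rsh(b):
    """[rsh(b)]_{i,j} = b[i + w*j]: reshape to (h, w), then transpose to (w, h)."""
    return b.reshape(h, w).T

khi, kwi = np.linalg.inv(kh), np.linalg.inv(kw)
Sinv = np.kron(khi, kwi)      # (kh (x) kw)^{-1} = kh^{-1} (x) kw^{-1}

# Equation 9: the trace reduces to a bilinear form in the small diagonals.
lhs_tr = np.trace(Sinv @ np.diag(nu))
rhs_tr = np.diag(kwi) @ rsh(nu) @ np.diag(khi)

# Equation 10: the quadratic term via the vec identity (A (x) B) vec(X) = vec(B X A^T).
lhs_quad = mu @ Sinv @ mu
rhs_quad = np.sum(rsh(mu) * (kwi @ rsh(mu) @ khi.T))
```

Both right-hand sides involve only h × h and w × w matrices, which is the source of the computational saving.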



We define the "blurred baseline" as the input image blurred with a Gaussian kernel. The "noise baseline" is defined as Gaussian noise. We use the term "mean baseline" when the baseline is the per-channel mean of the original image with added Gaussian noise.
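The three baselines might be implemented as follows. This is a numpy sketch with our own function names and hyper-parameters (blur width, noise scale), since the text does not fix them here.

```python
import numpy as np

def gaussian_kernel1d(sigma, radius=None):
    """Normalized 1-D Gaussian kernel with support of about 3 sigma."""
    radius = radius or int(3 * sigma)
    t = np.arange(-radius, radius + 1)
    k = np.exp(-t**2 / (2 * sigma**2))
    return k / k.sum()

def blurred_baseline(x, sigma=2.0):
    """Blur each channel of x (c, h, w) with a separable Gaussian ('same' padding)."""
    k = gaussian_kernel1d(sigma)
    out = np.empty_like(x)
    for c in range(x.shape[0]):
        tmp = np.apply_along_axis(lambda r: np.convolve(r, k, mode='same'), 1, x[c])
        out[c] = np.apply_along_axis(lambda col: np.convolve(col, k, mode='same'), 0, tmp)
    return out

def noise_baseline(x, sigma=1.0, seed=0):
    """Pure Gaussian noise of the same shape as x."""
    return sigma * np.random.default_rng(seed).normal(size=x.shape)

def mean_baseline(x, sigma=0.1, seed=0):
    """Per-channel mean of x plus Gaussian noise."""
    mean = x.mean(axis=(1, 2), keepdims=True)
    return mean + sigma * np.random.default_rng(seed).normal(size=x.shape)
```

At training time, one of the three baselines is drawn at random per iteration, as described in Section 4.1.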



Figure 1: Graphical model.

Figure 2: Qualitative results. Compared with previous methods, VarSal captures the most sophisticated borderlines and the most object-oriented saliency maps.

Figure 3: Prior selection.

Figure 4: Sanity check.

Figure 5: (a) Quantitative results for the pixel perturbation benchmark; the results verify the usefulness of our method. (b) Time complexity of each interpretability method. SG-sq. refers to SmoothGrad-squared.

Figure 6: Sampled saliency map.

Figure 7: Uncertainty over the explanation. The posterior distribution of the saliency map gives two summaries: the explanation (mean) and the uncertainty over the explanation (standard deviation). They are compared qualitatively on the original dataset and two different shifted datasets.

Figure 9: Soft-TV Gaussian prior.

log( |Σ₁| / |Σ₀| ) = log |κ_h ⊗ κ_w| − log |diag(ν)| = log( |κ_h|^w · |κ_w|^h ) − log ∏_{i=1}^{hw} ν_i = w · log |κ_h| + h · log |κ_w| − sum(log ν) = −sum(log ν) + const .   (11)

Therefore, equation 8 is derived as:

D_KL[ q(s|x) ‖ p(s|x) ] = (1/2) [ diag(κ_w⁻¹)ᵀ · rsh(ν) · diag(κ_h⁻¹) + sum( rsh(µ) ⊙ (κ_w⁻¹ · rsh(µ) · κ_h⁻ᵀ) ) − sum(log ν) ] + const .   (12)

