VALID P-VALUE FOR DEEP LEARNING-DRIVEN SALIENT REGION

Abstract

Various saliency map methods have been proposed to interpret and explain predictions of deep learning models. Saliency maps allow us to interpret which parts of the input signals have a strong influence on the prediction results. However, since a saliency map is obtained by complex computations in deep learning models, it is often difficult to know how reliable the saliency map itself is. In this study, we propose a method to quantify the reliability of a salient region in the form of p-values. Our idea is to consider a salient region as a selected hypothesis by the trained deep learning model and employ the selective inference framework. The proposed method can provably control the probability of false positive detections of salient regions. We demonstrate the validity of the proposed method through numerical examples in synthetic and real datasets. Furthermore, we develop a Keras-based framework for conducting the proposed selective inference for a wide class of CNNs without additional implementation cost.

1. INTRODUCTION

Deep neural networks (DNNs) have exhibited remarkable predictive performance in numerous practical applications in various domains owing to their ability to automatically discover the representations needed for prediction tasks from the provided data. To ensure that the decision-making process of DNNs is transparent and easy to understand, it is crucial to effectively explain and interpret DNN representations. For example, in image classification tasks, obtaining salient regions allows us to explain which parts of the input image strongly influence the classification results. Several saliency map methods have been proposed to explain and interpret the predictions of DNN models (Ribeiro et al., 2016; Bach et al., 2015; Doshi-Velez & Kim, 2017; Lundberg & Lee, 2017; Zhou et al., 2016; Selvaraju et al., 2017). However, the results obtained from saliency methods are fragile (Kindermans et al., 2017; Ghorbani et al., 2019; Melis & Jaakkola, 2018; Zhang et al., 2020; Dombrowski et al., 2019; Heo et al., 2019). Therefore, it is important to develop a method for quantifying the reliability of DNN-driven salient regions. Our idea is to interpret salient regions as hypotheses driven by a trained DNN model and employ a statistical hypothesis testing framework. We use the p-value as a criterion to quantify the statistical reliability of the DNN-driven hypotheses. Unfortunately, constructing a valid statistical test for DNN-driven salient regions is challenging because of the selection bias. In other words, because the trained DNN selects the salient region based on the provided data, the post-selection assessment of importance is biased upwards. To correct the selection bias and compute valid p-values for DNN-driven salient regions, we introduce a conditional selective inference (SI) approach. The selection bias is corrected by conditional SI, in which the test statistic is considered conditional on the event that the hypotheses (salient regions) are selected using the trained DNN. Our main technical contribution is to develop a method for explicitly deriving the exact (non-asymptotic) conditional sampling distribution of the salient region for a wide class of convolutional neural networks (CNNs), which enables us to conduct conditional SI and compute valid p-values. Figure 1 presents an example of the problem setup.

Figure 1: Examples of the problem setup and the proposed method on the brain tumor dataset. By applying a saliency method called CAM (Zhou et al., 2016) to a query input image, we obtain the salient region. Our goal is to provide the statistical significance of the salient region in the form of a p-value by considering a two-sample test between the salient region and the corresponding region in the reference image. Note that, since the salient region is selected based on the data, the degree of saliency in the selected region is biased upward. In the upper image, where there is no true brain tumor, the naive p-value, which is obtained without accounting for the selection bias, is nearly zero, indicating a false positive finding of the salient region. On the other hand, the selective p-value, which is obtained by the proposed conditional SI approach, is 0.43, indicating that the selected salient region is not statistically significant. In the lower image, where there is a true brain tumor, both the naive p-value and the selective p-value are very small, which indicates a true positive finding. These results illustrate that the naive p-value cannot be used to quantify the reliability of a DNN-based salient region. In contrast, with the selective p-values, we can successfully identify false positive and true positive detections with a desired error rate.

Related works. In this study, we focus on statistical hypothesis testing for post-hoc analysis, i.e., quantifying the statistical significance of the salient regions identified in a trained DNN model when a test input instance is fed into the model.
Several methods have been developed to visualize and understand trained DNNs. Many of these post-hoc approaches (Mahendran & Vedaldi, 2015; Zeiler & Fergus, 2014; Dosovitskiy & Brox, 2016; Simonyan et al., 2013) have focused on developing visualization tools for saliency maps given a trained DNN. Other methods have aimed to identify the discriminative regions in an input image given a trained network (Selvaraju et al., 2017; Fong & Vedaldi, 2017; Zhou et al., 2016; Lundberg & Lee, 2017). However, some recent studies have shown that many of these saliency methods are not stable against a perturbation or adversarial attack on the input data and model (Kindermans et al., 2017; Ghorbani et al., 2019; Melis & Jaakkola, 2018; Zhang et al., 2020; Dombrowski et al., 2019; Heo et al., 2019). To the best of our knowledge, no study to date has succeeded in quantitatively evaluating the reproducibility of DNN-driven salient regions with a rigorous statistical inference framework.

In recent years, conditional SI has emerged as a promising approach for evaluating the statistical reliability of data-driven hypotheses. It has been actively studied for making inferences on the features of linear models selected by various feature selection methods, such as Lasso (Lee et al., 2016). The main concept behind conditional SI is to make inference based on the sampling distribution of the test statistic conditional on a selection event. This approach allows us to derive the exact sampling distribution of the test statistic. After the seminal work of Lee et al. (2016), conditional SI has also been applied to a wide range of problems (Loftus, 2015; Choi et al., 2017; Tian & Taylor, 2018; Yang et al., 2016; Tibshirani et al., 2016; Fithian et al., 2014; Loftus & Taylor, 2014; Panigrahi et al., 2016; Sugiyama et al., 2021a; Hyun et al., 2021; Duy & Takeuchi, 2021a; b; Sugiyama et al., 2021b; Chen & Bien, 2019; Tsukurimichi et al., 2021; Tanizaki et al., 2020; Duy et al., 2020; 2022). The most relevant existing work is Duy et al. (2022), where the authors provide a framework to compute valid p-values for DNN-based image segmentation results. Duy et al. (2022) only considered inference on the output of a DNN in a segmentation task. In this paper, we address a more general problem in which the hypotheses characterized by any internal nodes of the DNN can be considered. This enables us to quantify the statistical significance of salient regions. Furthermore, we introduce a Keras-based implementation framework that enables us to conduct SI for a wide class of CNNs without additional implementation costs. This is in contrast to Duy et al. (2022), in which the selection event must be re-implemented whenever the network architecture is changed.

In another direction, Burns et al. (2020) considered black-box model interpretability as a multiple hypothesis testing problem. Their goal was to identify important features by testing the significance of the difference between the prediction and the expected outcome when certain features are replaced with their counterfactuals. However, this approach faces a significant challenge: the number of hypotheses to be considered can be very large (e.g., in the case of an image with $n$ pixels, the number of possible salient regions is $2^n$). Multiple testing correction methods, such as the Bonferroni correction, are highly conservative when the number of hypotheses is large.
To address the challenge, they only considered a tractable number of regions selected by a human expert or object detector, which causes selection bias because the candidate regions are selected based on the data.

Contribution. Our main contributions are as follows:

• We provide an exact (non-asymptotic) inference method for salient regions obtained by CAM based on the SI concept. We introduce valid p-values to statistically quantify the reliability of the DNN-driven salient regions, inspired by Duy et al. (2022).
• We propose a novel algorithm and its implementation. Specifically, we propose a Keras-based implementation which enables us to conduct conditional SI for a wide class of CNNs without additional implementation costs.
• We conducted experiments on synthetic and real-world datasets, through which we show that our proposed method can control the false positive rate, has good performance in terms of computational efficiency, and provides good results in practical applications.

Our code is available at https://github.com/takeuchi-lab/selective_inference_dnn_salient_region.

2. PROBLEM FORMULATION

In this paper, we consider the problem of quantifying the statistical significance of the salient regions identified by a trained DNN model when a test input instance is fed into the model. Consider an $n$-dimensional query input vector

$$X = (X_1, \ldots, X_n)^\top = s + \varepsilon, \quad \varepsilon \sim N(0, \sigma^2 I_n),$$

and an $n$-dimensional reference input vector

$$X^{\mathrm{ref}} = (X^{\mathrm{ref}}_1, \ldots, X^{\mathrm{ref}}_n)^\top = s^{\mathrm{ref}} + \varepsilon^{\mathrm{ref}}, \quad \varepsilon^{\mathrm{ref}} \sim N(0, \sigma^2 I_n),$$

where $s, s^{\mathrm{ref}} \in \mathbb{R}^n$ are the signals and $\varepsilon, \varepsilon^{\mathrm{ref}} \in \mathbb{R}^n$ are the noises for the query and reference input vectors, respectively. We assume that the signals $s$ and $s^{\mathrm{ref}}$ are unknown, whereas the noises $\varepsilon$ and $\varepsilon^{\mathrm{ref}}$ are mutually independent and their distribution is known (or can be estimated from external independent data) to be $N(0, \sigma^2 I_n)$, an $n$-dimensional normal distribution with mean vector $0$ and covariance matrix $\sigma^2 I_n$. In the illustrative example presented in §1, $X$ is a query brain image for a potential patient (we do not know whether she/he has a brain tumor), whereas $X^{\mathrm{ref}}$ is a brain image of a healthy person without brain tumors.

Consider a saliency method for a trained CNN. We denote the saliency method as a function $\mathcal{A} : \mathbb{R}^n \to \mathbb{R}^n$ that takes a query input vector $X \in \mathbb{R}^n$ and returns the saliency map $\mathcal{A}(X) \in \mathbb{R}^n$. We define a salient region $\mathcal{M}_X$ for the query input vector $X$ as the set of elements whose saliency map value is greater than or equal to a threshold:

$$\mathcal{M}_X = \{ i \in [n] : \mathcal{A}_i(X) \ge \tau \},$$

where $\tau \in \mathbb{R}$ denotes the given threshold. In this study, we consider CAM (Zhou et al., 2016) as an example of a saliency method and the threshold-based definition of the salient region. Our method can be applied to other saliency methods and other definitions of the salient region.

Statistical inference. To quantify the statistical significance of the salient region $\mathcal{M}_X$, we consider a two-sample test for the difference between the salient region of the query input vector, $X_{\mathcal{M}_X}$, and the corresponding region of the reference input vector, $X^{\mathrm{ref}}_{\mathcal{M}_X}$, where $X_{\mathcal{M}_X}$ is the sub-vector of $X$ indexed by $\mathcal{M}_X$. As examples of the two-sample test, we consider the mean null test:

$$H_0 : \frac{1}{|\mathcal{M}_X|} \sum_{i \in \mathcal{M}_X} s_i = \frac{1}{|\mathcal{M}_X|} \sum_{i \in \mathcal{M}_X} s^{\mathrm{ref}}_i \quad \text{vs.} \quad H_1 : \frac{1}{|\mathcal{M}_X|} \sum_{i \in \mathcal{M}_X} s_i \neq \frac{1}{|\mathcal{M}_X|} \sum_{i \in \mathcal{M}_X} s^{\mathrm{ref}}_i, \tag{2}$$

and the global null test:

$$H_0 : s_i = s^{\mathrm{ref}}_i, \ \forall i \in \mathcal{M}_X, \quad \text{vs.} \quad H_1 : s_i \neq s^{\mathrm{ref}}_i, \ \exists i \in \mathcal{M}_X. \tag{3}$$

In the mean null test in Eq. (2), we consider a null hypothesis that the average signals in the salient region $\mathcal{M}_X$ are the same between $X$ and $X^{\mathrm{ref}}$. In contrast, in the global null test in Eq. (3), we consider a null hypothesis that all elements of the signals in the salient region $\mathcal{M}_X$ are the same between $X$ and $X^{\mathrm{ref}}$. The p-values for these two-sample tests can be used to quantify the statistical significance of the salient region $\mathcal{M}_X$.

Test-statistic. For a two-sample test conducted between $X_{\mathcal{M}_X}$ and $X^{\mathrm{ref}}_{\mathcal{M}_X}$, we consider a class of test statistics called conditionally linear test-statistic, which is expressed as

$$T(X, X^{\mathrm{ref}}) = \eta_{\mathcal{M}_X}^\top \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix}, \tag{4}$$

and conditionally $\chi$ test-statistic, which is expressed as

$$T(X, X^{\mathrm{ref}}) = \sigma^{-1} \left\| P_{\mathcal{M}_X} \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} \right\|, \tag{5}$$

where $\eta_{\mathcal{M}_X} \in \mathbb{R}^{2n}$ is a vector and $P_{\mathcal{M}_X} \in \mathbb{R}^{2n \times 2n}$ is a projection matrix, both of which depend on $\mathcal{M}_X$. The test statistics for the mean null test and the global null test can be written in the form of Eqs. (4) and (5), respectively. For the mean null test in Eq. (2), we consider the following test-statistic:

$$T(X, X^{\mathrm{ref}}) = \eta_{\mathcal{M}_X}^\top \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} = \frac{1}{|\mathcal{M}_X|} \sum_{i \in \mathcal{M}_X} X_i - \frac{1}{|\mathcal{M}_X|} \sum_{i \in \mathcal{M}_X} X^{\mathrm{ref}}_i,$$

where

$$\eta_{\mathcal{M}_X} = \frac{1}{|\mathcal{M}_X|} \begin{pmatrix} 1^n_{\mathcal{M}_X} \\ -1^n_{\mathcal{M}_X} \end{pmatrix} \in \mathbb{R}^{2n},$$

with $1^n_{\mathcal{C}}$ being the $n$-dimensional vector whose elements belonging to the set $\mathcal{C}$ are set to 1, and 0 otherwise. For the global null test in Eq. (3), we consider the following test-statistic:

$$T(X, X^{\mathrm{ref}}) = \sigma^{-1} \left\| P_{\mathcal{M}_X} \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} \right\| = \sqrt{ \sum_{i \in \mathcal{M}_X} \left( \frac{X_i - X^{\mathrm{ref}}_i}{\sqrt{2}\,\sigma} \right)^2 },$$

where

$$P_{\mathcal{M}_X} = \frac{1}{2} \begin{pmatrix} \mathrm{diag}(1^n_{\mathcal{M}_X}) & -\mathrm{diag}(1^n_{\mathcal{M}_X}) \\ -\mathrm{diag}(1^n_{\mathcal{M}_X}) & \mathrm{diag}(1^n_{\mathcal{M}_X}) \end{pmatrix}.$$

To obtain p-values for these two-sample tests, we need to know the sampling distribution of the test-statistics. Unfortunately, it is challenging to derive the sampling distributions of the test-statistics because they depend on the salient region $\mathcal{M}_X$, which is obtained through a complicated calculation in the trained CNN.
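As a minimal sketch of the quantities defined in this section, the following Python snippet computes a thresholded salient region and the two test statistics on toy data. The saliency map used here is a hypothetical stand-in (the input vector itself) rather than CAM, and all function names are illustrative assumptions, not part of the authors' code.

```python
import math

def salient_region(saliency, tau):
    """Indices whose saliency value is at least tau (threshold rule)."""
    return [i for i, a in enumerate(saliency) if a >= tau]

def eta_vector(n, region):
    """eta in R^{2n}: +1/|M| on the region in the X block, -1/|M| in the X_ref block."""
    m = len(region)
    eta = [0.0] * (2 * n)
    for i in region:
        eta[i] = 1.0 / m
        eta[n + i] = -1.0 / m
    return eta

def mean_null_statistic(x, x_ref, region):
    """Difference of the region means of x and x_ref."""
    m = len(region)
    return sum(x[i] for i in region) / m - sum(x_ref[i] for i in region) / m

def global_null_statistic(x, x_ref, region, sigma):
    """sqrt(sum over the region of ((x_i - x_ref_i) / (sqrt(2) * sigma))^2)."""
    return math.sqrt(
        sum(((x[i] - x_ref[i]) / (math.sqrt(2.0) * sigma)) ** 2 for i in region)
    )

# Toy data; the "saliency map" is simply the input itself (an assumption, not CAM).
x = [0.2, 1.5, 2.0, -0.3]
x_ref = [0.1, 0.5, 1.0, 0.4]
region = salient_region(x, tau=1.0)

t_mean = mean_null_statistic(x, x_ref, region)
# The same value written as the inner product eta^T (x, x_ref).
t_linear = sum(e * v for e, v in zip(eta_vector(len(x), region), x + x_ref))
t_global = global_null_statistic(x, x_ref, region, sigma=1.0)
```

Both statistics are cheap to evaluate once the region is known; the difficulty discussed above lies entirely in the fact that the region itself depends on the data.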

3. COMPUTING VALID p-VALUE BY CONDITIONAL SELECTIVE INFERENCE

In this section, we introduce an approach to compute valid p-values for the two-sample tests for the salient region $\mathcal{M}_X$ between the query input vector $X$ and the reference input vector $X^{\mathrm{ref}}$ based on the concept of conditional SI (Lee et al., 2016).

3.1. CONDITIONAL DISTRIBUTION AND SELECTIVE p-VALUE

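Conditional distribution. The basic idea of conditional SI is to consider the sampling distribution of the test-statistic conditional on a selection event. Specifically, we consider the following conditional distribution:

$$T(X, X^{\mathrm{ref}}) \mid \{ \mathcal{M}_X = \mathcal{M}_{X_{\mathrm{obs}}} \}, \tag{6}$$

where $X_{\mathrm{obs}}$ is the observation (realization) of the random vector $X$. The condition in Eq. (6) restricts the randomness of $X$ to the event that the same salient region $\mathcal{M}_X$ as the observed $\mathcal{M}_{X_{\mathrm{obs}}}$ is obtained. By conditioning on the salient region $\mathcal{M}_X$, the derivation of the sampling distribution of the conditionally linear and conditionally $\chi$ test-statistics $T(X, X^{\mathrm{ref}})$ is reduced to the derivation of the distribution of a linear function and a quadratic function of $(X, X^{\mathrm{ref}})$, respectively.

Selective p-value. After considering the conditional sampling distribution in (6), we introduce the following selective p-value:

$$p^{\mathrm{selective}} = \mathbb{P}_{H_0} \left( \left| T(X, X^{\mathrm{ref}}) \right| \ge \left| T(X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}) \right| \ \middle| \ \mathcal{M}_X = \mathcal{M}_{X_{\mathrm{obs}}}, \ \mathcal{Q}_{X, X^{\mathrm{ref}}} = \mathcal{Q}_{\mathrm{obs}} \right), \tag{7}$$

where $\mathcal{Q}_{X, X^{\mathrm{ref}}} = \Omega_{X, X^{\mathrm{ref}}}$ and $\mathcal{Q}_{\mathrm{obs}} = \mathcal{Q}_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ with

$$\Omega_{X, X^{\mathrm{ref}}} = \left( I_{2n} - \frac{\eta_{\mathcal{M}_X} \eta_{\mathcal{M}_X}^\top}{\| \eta_{\mathcal{M}_X} \|^2} \right) \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} \in \mathbb{R}^{2n}$$

in the case of the mean null test, and $\mathcal{Q}_{X, X^{\mathrm{ref}}} = \{ V_{X, X^{\mathrm{ref}}}, U_{X, X^{\mathrm{ref}}} \}$ and $\mathcal{Q}_{\mathrm{obs}} = \mathcal{Q}_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ with

$$V_{X, X^{\mathrm{ref}}} = \frac{ \sigma P_{\mathcal{M}_X} \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} }{ \left\| P_{\mathcal{M}_X} \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} \right\| } \in \mathbb{R}^{2n}, \qquad U_{X, X^{\mathrm{ref}}} = P^{\perp}_{\mathcal{M}_X} \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix} \in \mathbb{R}^{2n}$$

in the case of the global null test. $\mathcal{Q}_{X, X^{\mathrm{ref}}}$ is the sufficient statistic of the nuisance parameter that needs to be conditioned on in order to tractably conduct the inference$^{1}$. The selective p-value in Eq. (7) has the following desired sampling property:

$$\mathbb{P}_{H_0} \left( p^{\mathrm{selective}} \le \alpha \ \middle| \ \mathcal{M}_X = \mathcal{M}_{X_{\mathrm{obs}}} \right) = \alpha, \quad \forall \alpha \in [0, 1]. \tag{8}$$

This means that the selective p-value $p^{\mathrm{selective}}$ can be used as a valid statistical significance measure for the salient region $\mathcal{M}_X$.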
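As a short worked check of why conditioning on $\Omega_{X, X^{\mathrm{ref}}}$ leaves a single free scalar in the mean null case, the data vector can be split along $\eta = \eta_{\mathcal{M}_X}$. This is standard linear algebra given here for orientation under the definitions above; it is not reproduced from the appendix:

$$
\begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix}
= \left( I_{2n} - \frac{\eta \eta^\top}{\|\eta\|^2} \right) \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix}
+ \frac{\eta}{\|\eta\|^2} \, \eta^\top \begin{pmatrix} X \\ X^{\mathrm{ref}} \end{pmatrix}
= \Omega_{X, X^{\mathrm{ref}}} + \frac{\eta}{\|\eta\|^2} \, T(X, X^{\mathrm{ref}}).
$$

Once $\Omega_{X, X^{\mathrm{ref}}}$ is fixed at its observed value, the data can only move along the direction $\eta / \|\eta\|^2$, and the position along that direction is the value of the test statistic. Moreover, $\eta^\top \left( I_{2n} - \eta \eta^\top / \|\eta\|^2 \right) = 0$, so for a fixed $\eta$ and isotropic Gaussian noise the two components are uncorrelated and hence independent.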

3.2. CHARACTERIZATION OF THE CONDITIONAL DATA SPACE

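To compute the selective p-value in (7), we need to characterize the conditional data space; how this characterization is carried out for a trained CNN is described in the next section. We define the set of $(X^\top \ X^{\mathrm{ref}\top})^\top \in \mathbb{R}^{2n}$ that satisfies the conditions in Eq. (7) as

$$\mathcal{D} = \left\{ (X^\top \ X^{\mathrm{ref}\top})^\top \in \mathbb{R}^{2n} \ \middle| \ \mathcal{M}_X = \mathcal{M}_{X_{\mathrm{obs}}}, \ \mathcal{Q}_{X, X^{\mathrm{ref}}} = \mathcal{Q}_{\mathrm{obs}} \right\}. \tag{9}$$

According to the second condition, the data in $\mathcal{D}$ are restricted to a line in $\mathbb{R}^{2n}$, as stated in the following lemma.

Lemma 1. Let us define $a = \Omega_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ and $b = \eta_{\mathcal{M}_X} / \| \eta_{\mathcal{M}_X} \|^2 \in \mathbb{R}^{2n}$ in the case of the mean null test, and $a = U_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ and $b = V_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ in the case of the global null test. The $\mathcal{D}$ in (9) can be rewritten as

$$\mathcal{D} = \left\{ (X^\top \ X^{\mathrm{ref}\top})^\top = a + bz \ \middle| \ z \in \mathcal{Z} \right\}$$

by using the scalar parameter $z \in \mathbb{R}$, where

$$\mathcal{Z} = \left\{ z \in \mathbb{R} \ \middle| \ \mathcal{M}_{a_{1:n} + b_{1:n} z} = \mathcal{M}_{X_{\mathrm{obs}}} \right\}, \tag{10}$$

with $x_{1:n}$ representing the vector of elements 1 through $n$ of $x$.

Proof. The proof is deferred to Appendix A.1.

Lemma 1 indicates that we do not need to consider the $2n$-dimensional data space. Instead, we only need to consider the one-dimensional projected data space $\mathcal{Z}$ in (10). Now, let us consider a random variable $Z \in \mathbb{R}$ and its observation $Z_{\mathrm{obs}} \in \mathbb{R}$ that satisfy $(X^\top \ X^{\mathrm{ref}\top})^\top = a + bZ$ and $(X_{\mathrm{obs}}^\top \ X^{\mathrm{ref}\top}_{\mathrm{obs}})^\top = a + bZ_{\mathrm{obs}}$. The selective p-value (7) is rewritten as

$$p^{\mathrm{selective}} = \mathbb{P}_{H_0} \left( |Z| \ge |Z_{\mathrm{obs}}| \ \middle| \ Z \in \mathcal{Z} \right). \tag{11}$$

Because $(X^\top \ X^{\mathrm{ref}\top})^\top \sim N\left( (s^\top \ s^{\mathrm{ref}\top})^\top, \sigma^2 I_{2n} \right)$ due to the independence between $X$ and $X^{\mathrm{ref}}$, under the null hypothesis the variable $Z \sim N(0, \sigma^2 \|\eta\|^2)$ in the case of the mean null test and $Z \sim \chi(\mathrm{Trace}(P))$ in the case of the global null test. Therefore, $Z \mid Z \in \mathcal{Z}$ follows a truncated normal distribution and a truncated $\chi$ distribution, respectively. Once the truncation region $\mathcal{Z}$ is identified, computation of the selective p-value in (11) is straightforward. Therefore, the remaining task is to identify $\mathcal{Z}$.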
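To make the last step concrete for the mean null test, the sketch below evaluates the two-sided probability in Eq. (11) for a zero-mean normal variable truncated to a union of intervals. It is a minimal illustration using only the Python standard library; the function name and the interval representation are assumptions, and `scale` plays the role of $\sigma \|\eta\|$.

```python
from statistics import NormalDist

def truncated_normal_two_sided_p(z_obs, intervals, scale):
    """P(|Z| >= |z_obs| | Z in union of intervals) for Z ~ N(0, scale^2).

    `intervals` is a list of disjoint (lower, upper) pairs describing the
    truncation region; float('-inf') / float('inf') mark unbounded ends.
    """
    dist = NormalDist(mu=0.0, sigma=scale)
    c = abs(z_obs)

    def mass(lo, hi):
        # Probability of [lo, hi]; empty or reversed ranges contribute zero.
        return max(dist.cdf(hi) - dist.cdf(lo), 0.0) if hi > lo else 0.0

    total = sum(mass(lo, hi) for lo, hi in intervals)
    # Within each interval, keep only the parts with z <= -c or z >= c.
    tail = sum(mass(lo, min(hi, -c)) + mass(max(lo, c), hi) for lo, hi in intervals)
    return tail / total

inf = float("inf")
# No truncation: this reduces to the usual two-sided normal p-value.
p_full = truncated_normal_two_sided_p(1.96, [(-inf, inf)], scale=1.0)
# Truncation to [1, inf): the same observation is much less surprising.
p_trunc = truncated_normal_two_sided_p(1.96, [(1.0, inf)], scale=1.0)
```

Differences of CDF values lose precision when the truncation region lies far in the tails, so a production implementation would likely need a more careful numerical treatment than this sketch.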

4. PIECEWISE LINEAR NETWORK

The problem of computing selective p-values for the selected salient region is cast into the problem of identifying a set of intervals $\mathcal{Z} = \{ z \in \mathbb{R} \mid \mathcal{M}_{X(z)} = \mathcal{M}_{X_{\mathrm{obs}}} \}$. Given the complexity of saliency computation in a trained DNN, it seems difficult to obtain $\mathcal{Z}$. In this section, however, we show that this is feasible for a wide class of CNNs.

Piecewise linear components in CNN.

The key idea is to note that most of the basic operations and common activation functions used in a trained CNN can be represented as piecewise linear functions in the following form:

Definition 1. (Piecewise Linear Function) A piecewise linear function $f : \mathbb{R}^n \to \mathbb{R}^m$ is written as:

$$f(X) = \begin{cases} \Psi^f_1 X + \psi^f_1, & \text{if } X \in \mathcal{P}^f_1 := \{ X' \in \mathbb{R}^n \mid \Delta^f_1 X' \le \delta^f_1 \}, \\ \Psi^f_2 X + \psi^f_2, & \text{if } X \in \mathcal{P}^f_2 := \{ X' \in \mathbb{R}^n \mid \Delta^f_2 X' \le \delta^f_2 \}, \\ \quad \vdots \\ \Psi^f_{K(f)} X + \psi^f_{K(f)}, & \text{if } X \in \mathcal{P}^f_{K(f)} := \{ X' \in \mathbb{R}^n \mid \Delta^f_{K(f)} X' \le \delta^f_{K(f)} \}, \end{cases}$$

where $\Psi^f_k$, $\psi^f_k$, $\Delta^f_k$ and $\delta^f_k$ for $k \in [K(f)]$ are certain matrices and vectors with appropriate dimensions, $\mathcal{P}^f_k := \{ x \in \mathbb{R}^n \mid \Delta^f_k x \le \delta^f_k \}$ is a polytope in $\mathbb{R}^n$ for $k \in [K(f)]$, and $K(f)$ is the number of polytopes for the function $f$.

Examples of piecewise linear components in a trained CNN are shown in Appendix A.2.

Piecewise Linear Network.

Definition 2. (Piecewise Linear Network) A network obtained by concatenations and compositions of piecewise linear functions is called a piecewise linear network.

Since the concatenation and the composition of piecewise linear functions are again piecewise linear functions, the output of any node in the piecewise linear network is written as a piecewise linear function of an input vector $X$. This is also true for the saliency map function $\mathcal{A}_i(X)$, $i \in [n]$, obtained by CAM. Furthermore, as discussed in §3.2, we can focus on the input vector in the form of $X(z) = a_{1:n} + b_{1:n} z$, which is parametrized by a scalar parameter $z \in \mathbb{R}$. Therefore, the saliency map value for each element is written as a piecewise linear function of the scalar parameter $z$, i.e.,

$$\mathcal{A}_i(X(z)) = \begin{cases} \kappa^{\mathcal{A}_i}_1 z + \rho^{\mathcal{A}_i}_1, & \text{if } z \in [L^{\mathcal{A}_i}_1, U^{\mathcal{A}_i}_1], \\ \kappa^{\mathcal{A}_i}_2 z + \rho^{\mathcal{A}_i}_2, & \text{if } z \in [L^{\mathcal{A}_i}_2, U^{\mathcal{A}_i}_2], \\ \quad \vdots \\ \kappa^{\mathcal{A}_i}_{K(\mathcal{A}_i)} z + \rho^{\mathcal{A}_i}_{K(\mathcal{A}_i)}, & \text{if } z \in [L^{\mathcal{A}_i}_{K(\mathcal{A}_i)}, U^{\mathcal{A}_i}_{K(\mathcal{A}_i)}], \end{cases} \tag{12}$$

where $K(\mathcal{A}_i)$ is the number of linear pieces of the piecewise linear function, $\kappa^{\mathcal{A}_i}_k$ and $\rho^{\mathcal{A}_i}_k$ are certain scalar parameters, and $[L^{\mathcal{A}_i}_k, U^{\mathcal{A}_i}_k]$ are intervals for $k \in [K(\mathcal{A}_i)]$ (note that a polytope in $\mathbb{R}^n$ is reduced to an interval when it is projected onto one-dimensional space).

Algorithm 1 SI_DNN_Saliency
Input: $X_{\mathrm{obs}}$, $z_{\min}$, $z_{\max}$, $T$
 1: Obtain $\mathcal{E}_{\mathrm{obs}}$, compute $\eta$ as well as $a$ and $b$ ← Lemma 1, and initialize: $t = 1$, $z_t = z_{\min}$
 2: for $t \le T$ do
 3:   Compute $z_{t+1}$ by Auto-Conditioning (see §5)
 4:   if $\mathcal{E}_{X(z), X^{\mathrm{ref}}(z)} = \mathcal{E}_{\mathrm{obs}}$ in $z \in [z_t, z_{t+1}]$ (by using Eq. (13)) then
 5:     $\mathcal{T} \leftarrow \mathcal{T} + \{t\}$
 6:   end if
 7:   $t = t + 1$
 8: end for
 9: Identify $\mathcal{Z} \leftarrow \bigcup_{t \in \mathcal{T}} [z_t, z_{t+1}]$
10: $p^{\mathrm{selective}}$ ← Eq. (11)
Output: $p^{\mathrm{selective}}$

This means that, for each piece of the piecewise linear function, we can identify the interval of $z$ such that $\mathcal{A}_i(X(z)) \ge \tau$ as follows$^{2}$:

$$z \in \begin{cases} \left[ \max\left( L^{\mathcal{A}_i}_k, \ (\tau - \rho^{\mathcal{A}_i}_k) / \kappa^{\mathcal{A}_i}_k \right), \ U^{\mathcal{A}_i}_k \right] & \text{if } \kappa^{\mathcal{A}_i}_k > 0, \\ \left[ L^{\mathcal{A}_i}_k, \ \min\left( U^{\mathcal{A}_i}_k, \ (\tau - \rho^{\mathcal{A}_i}_k) / \kappa^{\mathcal{A}_i}_k \right) \right] & \text{if } \kappa^{\mathcal{A}_i}_k < 0 \end{cases} \ \Rightarrow \ \mathcal{A}_i(X(z)) \ge \tau. \tag{13}$$

With a slight abuse of notation, let us collectively denote the finite number of intervals on $z \in \mathbb{R}$ that are defined by $L^{\mathcal{A}_i}_k$, $U^{\mathcal{A}_i}_k$, and $(\tau - \rho^{\mathcal{A}_i}_k) / \kappa^{\mathcal{A}_i}_k$ for all $(k, i) \in [K(\mathcal{A}_i)] \times [n]$ as

$$[z_0, z_1], [z_1, z_2], \ldots, [z_{t-1}, z_t], [z_t, z_{t+1}], \ldots, [z_{T-1}, z_T],$$

where $z_{\min} = z_0$ and $z_{\max} = z_T$ are defined such that the probability masses of $z < z_{\min}$ and $z > z_{\max}$ are negligibly small.
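The interval rule in Eq. (13) can be sketched in a few lines of Python. The representation of a piece as a tuple `(kappa, rho, lower, upper)` and the function name are illustrative assumptions; the constant-slope branch follows the remark in the paper's footnote on the case of zero slope.

```python
def pieces_above_threshold(pieces, tau):
    """Intervals of z on which a 1-D piecewise linear function is >= tau.

    `pieces` is a list of (kappa, rho, lower, upper), meaning that the
    function equals kappa * z + rho on the interval [lower, upper].
    """
    result = []
    for kappa, rho, lower, upper in pieces:
        if kappa > 0:
            lo, hi = max(lower, (tau - rho) / kappa), upper
        elif kappa < 0:
            lo, hi = lower, min(upper, (tau - rho) / kappa)
        else:
            # Constant piece: the whole interval qualifies iff rho >= tau.
            lo, hi = (lower, upper) if rho >= tau else (1.0, 0.0)
        if lo <= hi:
            result.append((lo, hi))
    return result

# ReLU(z) on [-3, 3]: zero on [-3, 0], identity on [0, 3].
relu_pieces = [(0.0, 0.0, -3.0, 0.0), (1.0, 0.0, 0.0, 3.0)]
relu_intervals = pieces_above_threshold(relu_pieces, tau=1.0)

# A tent function: z on [0, 2], then 4 - z on [2, 4].
tent_pieces = [(1.0, 0.0, 0.0, 2.0), (-1.0, 4.0, 2.0, 4.0)]
tent_intervals = pieces_above_threshold(tent_pieces, tau=1.0)
```

In the tent example the two returned intervals share the endpoint $z = 2$; a caller that needs a minimal description of the region would merge such adjacent intervals.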

Algorithm. Algorithm 1 shows how we identify $\mathcal{Z} = \{ z \in \mathbb{R} \mid \mathcal{M}_{X(z), X^{\mathrm{ref}}(z)} = \mathcal{M}_{\mathrm{obs}} \}$. We simply check the intervals of $z$ in the order $[z_0, z_1], [z_1, z_2], \ldots, [z_{T-1}, z_T]$ to see whether $\mathcal{M}_{X(z)} = \mathcal{M}_{X_{\mathrm{obs}}}$ holds in each interval by using Eq. (13). Then, the truncation region $\mathcal{Z}$ in Eq. (10) is given as

$$\mathcal{Z} = \bigcup_{t \in [T] \, \mid \, \mathcal{E}_{X(z), X^{\mathrm{ref}}(z)} = \mathcal{E}_{\mathrm{obs}} \text{ for } z \in [z_t, z_{t+1}]} [z_t, z_{t+1}].$$

In the literature on the homotopy method (a.k.a. parametric programming), it is known that the actual computational cost is typically far lower than the worst-case cost. A well-known application of the homotopy method in the ML community is the Lasso regularization path, whose worst-case computational cost is exponential in the number of features, but whose actual cost is known to be nearly linear. Empirically, this also applies to our proposed method.
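The interval-by-interval search can be illustrated on a toy problem. In the sketch below the saliency map is a hypothetical element-wise ReLU (not CAM), so every breakpoint along the line $X(z) = a_{1:n} + b_{1:n} z$ can be listed in closed form instead of being found by auto-conditioning; all names are illustrative assumptions.

```python
def saliency(x):
    """Hypothetical stand-in for CAM: an element-wise ReLU (piecewise linear)."""
    return [max(0.0, v) for v in x]

def region(x, tau):
    """Salient region as a set of indices with saliency >= tau."""
    return frozenset(i for i, s in enumerate(saliency(x)) if s >= tau)

def truncation_region(a_x, b_x, tau, z_obs, z_min, z_max):
    """Intervals of z in [z_min, z_max] where region(a_x + b_x * z) matches the observed one."""
    point = lambda z: [ai + bi * z for ai, bi in zip(a_x, b_x)]
    observed = region(point(z_obs), tau)
    # Candidate breakpoints: ReLU kinks and threshold crossings of each coordinate.
    cuts = {z_min, z_max}
    for ai, bi in zip(a_x, b_x):
        if bi != 0.0:
            for z in (-ai / bi, (tau - ai) / bi):
                if z_min < z < z_max:
                    cuts.add(z)
    grid = sorted(cuts)
    intervals = []
    for lo, hi in zip(grid, grid[1:]):
        # The region is constant inside (lo, hi), so the midpoint decides it.
        if region(point(0.5 * (lo + hi)), tau) == observed:
            if intervals and intervals[-1][1] == lo:
                intervals[-1] = (intervals[-1][0], hi)  # merge adjacent intervals
            else:
                intervals.append((lo, hi))
    return intervals

# Observed point z_obs = 0 gives X = [0, 2, -1], so only index 1 is salient for tau = 1.
z_region = truncation_region(
    a_x=[0.0, 2.0, -1.0], b_x=[1.0, -1.0, 0.5], tau=1.0,
    z_obs=0.0, z_min=-5.0, z_max=5.0,
)
```

For this element-wise toy the region comes out as a single interval; for a real CNN the saliency values are non-monotone in $z$, so the result is in general a union of several intervals, which is why the search has to walk through all of them.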

5. IMPLEMENTATION: AUTO-CONDITIONING

The bottleneck of our algorithm is Line 3 in Algorithm 1, where $z_{t+1}$ must be found by considering all relevant piecewise linear components in a complicated trained CNN. The difficulty lies not only in the computational cost but also in the implementation cost. To implement conditional SI in DNNs naively, it is necessary to characterize all operations at each layer of the network as selection events and implement each of them specifically (Duy et al., 2022). To circumvent this difficulty, we introduce a modular implementation scheme called auto-conditioning, which is similar in concept to auto-differentiation (Baydin et al., 2018). This enables us to conduct conditional SI for a wide class of CNNs without additional implementation costs.

The basic idea in auto-conditioning is to add a mechanism to compute and maintain the interval $z \in [L^f_k, U^f_k]$ for each piecewise linear component $f$ in the network (e.g., layer API in the Keras framework). This enables us to automatically compute the interval $[L^f_k, U^f_k]$ of a piecewise linear function $f$ when it is obtained as a concatenation and/or composition of multiple piecewise linear components. If $f$ is obtained by concatenating two piecewise linear functions $f_1$ and $f_2$, we can easily obtain $[L^f_k, U^f_k] = [L^{f_1}_{k_1}, U^{f_1}_{k_1}] \cap [L^{f_2}_{k_2}, U^{f_2}_{k_2}]$. However, if $f$ is obtained as a composition of two piecewise linear functions $f_1$ and $f_2$, the calculation of the interval is given by the following lemma.

Lemma 2. Consider the composition of two piecewise linear functions $f(X(z)) = (f_2 \circ f_1)(X(z))$. Given a real value of $z$, the interval $[L^{f_2}_{k_2}, U^{f_2}_{k_2}]$ in the input domain of $f_2$ can be computed as

$$L^{f_2}_{k_2} = \max_{j : (\Delta^{f_2}_{k_2} \gamma^{f_1})_j < 0} \frac{ (\delta^{f_2}_{k_2})_j - (\Delta^{f_2}_{k_2} \beta^{f_1})_j }{ (\Delta^{f_2}_{k_2} \gamma^{f_1})_j }, \qquad U^{f_2}_{k_2} = \min_{j : (\Delta^{f_2}_{k_2} \gamma^{f_1})_j > 0} \frac{ (\delta^{f_2}_{k_2})_j - (\Delta^{f_2}_{k_2} \beta^{f_1})_j }{ (\Delta^{f_2}_{k_2} \gamma^{f_1})_j },$$

where $\beta^{f_1} + \gamma^{f_1} z$ is the output of $f_1$ (i.e., the input of $f_2$). Moreover, $\Delta^{f_2}_{k_2}$ and $\delta^{f_2}_{k_2}$ are obtained by verifying the value of $\beta^{f_1} + \gamma^{f_1} z$. Then, the interval of the composite function is obtained as follows:

$$[L^f_k, U^f_k] = [L^{f_1}_{k_1}, U^{f_1}_{k_1}] \cap [L^{f_2}_{k_2}, U^{f_2}_{k_2}].$$

The proof is provided in Appendix A.3. Here, the variables $\beta^{f_k}$ and $\gamma^{f_k}$ can be recursively computed through layers as $\beta^{f_{k+1}} = \Psi^{f_k}_k \beta^{f_k} + \psi^{f_k}_k$ and $\gamma^{f_{k+1}} = \Psi^{f_k}_k \gamma^{f_k}$. Lemma 2 indicates that the intervals in which $X(z)$ falls can be propagated forward through these layers. This means that the lower bound $L^{\mathcal{A}_i}_k$ and upper bound $U^{\mathcal{A}_i}_k$ of the current piece in the piecewise linear function in Eq. (12) can be automatically computed by forward propagation of the intervals of the relevant piecewise linear components.

2 For simplicity, we omit the description for the case of $\kappa^{\mathcal{A}_i}_k = 0$. In this case, if $\rho^{\mathcal{A}_i}_k \ge \tau$, then $z \in [L^{\mathcal{A}_i}_k, U^{\mathcal{A}_i}_k] \Rightarrow i \in \mathcal{M}_{X(z)}$.
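A minimal sketch of this forward propagation is given below for a toy network made of affine and ReLU layers. It is written in plain Python rather than Keras, and the layer encoding (`("linear", W, bias)` and `("relu",)`) and function names are hypothetical; it only illustrates how the pair $(\beta, \gamma)$ and the interval of $z$ can be carried through the layers together, in the spirit of Lemma 2.

```python
import math

def linear(weights, bias, beta, gamma):
    """Affine layer: beta + gamma*z maps to (W beta + bias) + (W gamma) z; no new constraint on z."""
    new_beta = [sum(w * v for w, v in zip(row, beta)) + c for row, c in zip(weights, bias)]
    new_gamma = [sum(w * v for w, v in zip(row, gamma)) for row in weights]
    return new_beta, new_gamma

def relu(beta, gamma, z):
    """ReLU at the current z: output (beta, gamma) plus the interval of z on which
    the same activation pattern (the same linear piece) remains valid."""
    lower, upper = -math.inf, math.inf
    out_beta, out_gamma = [], []
    for bj, gj in zip(beta, gamma):
        active = bj + gj * z > 0
        # Current piece requires s * (bj + gj * z) <= 0, with s = -1 if active else +1.
        s = -1.0 if active else 1.0
        coef, rhs = s * gj, -s * bj
        if coef > 0:
            upper = min(upper, rhs / coef)
        elif coef < 0:
            lower = max(lower, rhs / coef)
        out_beta.append(bj if active else 0.0)
        out_gamma.append(gj if active else 0.0)
    return (out_beta, out_gamma), (lower, upper)

def forward(layers, beta, gamma, z):
    """Propagate beta + gamma*z through the layers, intersecting the intervals of z."""
    lower, upper = -math.inf, math.inf
    for layer in layers:
        if layer[0] == "linear":
            beta, gamma = linear(layer[1], layer[2], beta, gamma)
        else:  # "relu"
            (beta, gamma), (lo, hi) = relu(beta, gamma, z)
            lower, upper = max(lower, lo), min(upper, hi)
    return beta, gamma, (lower, upper)

# Toy network: h = ReLU(W1 x), output = w2 . h + 0.5, evaluated along x(z) = (z, 1).
layers = [
    ("linear", [[1.0, -1.0], [1.0, 1.0]], [0.0, 0.0]),
    ("relu",),
    ("linear", [[1.0, 1.0]], [0.5]),
]
out_beta, out_gamma, interval = forward(layers, beta=[0.0, 1.0], gamma=[1.0, 0.0], z=0.5)
```

In this example the network output equals $1.5 + z$ for every $z$ in the returned interval $[-1, 1]$. The upper end of such an interval is a natural candidate for the next breakpoint in a line search, although how the authors' framework chooses the step is not specified in this section.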

6. EXPERIMENT

We only highlight the main results. More details (methods for comparison, network structure, etc.) can be found in Appendix A.4.

Experimental setup. We compared our proposed method with the naive method, the over-conditioning (OC) method, and Bonferroni correction. To investigate the false positive rate (FPR), we considered 1,000 null images $X = (X_1, \ldots, X_n)$ and 1,000 reference images $X^{\mathrm{ref}} = (X^{\mathrm{ref}}_1, \ldots, X^{\mathrm{ref}}_n)$, where $s = s^{\mathrm{ref}} = 0$ and $\varepsilon, \varepsilon^{\mathrm{ref}} \sim N(0, I_n)$, for each $n \in \{64, 256, 1024, 4096\}$. To investigate the true positive rate (TPR), we set $n = 256$ and generated 1,000 images, in which $s_i = \Delta$ for any $i \in \mathcal{S}$, where $\mathcal{S}$ is the "true" salient region whose location is randomly determined, and $s_i = 0$ for any $i \notin \mathcal{S}$. We set $\Delta \in \{1, 2, 3, 4\}$. Reference images were generated in the same way as in the case of FPR. In all experiments, we set $\tau = 0$ in the mean null test and $\tau = 5$ in the global null test. We set the significance level $\alpha = 0.05$. We used CAM as the saliency method in all experiments.

Numerical results. The results of FPR control properties are presented in Fig. 2. The proposed method, OC, and Bonferroni successfully controlled the FPR in both the mean and global null test cases, whereas the naive method could not. Because the naive method failed to control the FPR, we no longer considered its TPR. The results of the TPR comparison are shown in Fig. 3. The proposed method has the highest TPR in all cases. The Bonferroni method has the lowest TPR because it is conservative owing to considering the number of all possible hypotheses. The OC method also has a low TPR because it conditions on several extra events, which causes a loss of TPR.

Real data experiments. We examined the brain image dataset extracted from the dataset used in Buda et al. (2019), which included 939 and 941 images with and without tumors, respectively. The results of the mean null test are presented in Figs. 4 and 5. The results of the global null test are presented in Figs. 6 and 7. The naive p-value remains small even when the image has no tumor region, which indicates that naive p-values cannot be used to quantify the reliability of DNN-based salient regions. The proposed method successfully identified false and true positive detections.
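The qualitative gap between naive and selective p-values under the null can be reproduced on a toy problem with a few lines of Python. The sketch below is not the experimental setup of this section: it assumes $\sigma = 1$, uses the image itself as a hypothetical saliency map instead of CAM on a CNN, and considers only the mean null test, for which the truncation region along the line $a + bz$ is a half-line that can be written in closed form.

```python
import math
import random
from statistics import NormalDist

PHI = NormalDist().cdf  # standard normal CDF

def one_trial(rng, n, tau):
    """Return (naive_p, selective_p) for one null image pair, or None if nothing is selected."""
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x_ref = [rng.gauss(0.0, 1.0) for _ in range(n)]
    region = [i for i in range(n) if x[i] >= tau]  # toy saliency map: the image itself
    if not region:
        return None
    m = len(region)
    t_obs = sum(x[i] - x_ref[i] for i in region) / m
    scale = math.sqrt(2.0 / m)  # sigma * ||eta|| with sigma = 1
    naive_p = 2.0 * PHI(-abs(t_obs) / scale)
    # Along the line a + b z, selected pixels move as x_i + (z - t_obs) / 2 and the
    # others stay fixed, so the same region is selected iff z >= lower.
    lower = t_obs - 2.0 * (min(x[i] for i in region) - tau)
    c = abs(t_obs)
    total = PHI(-lower / scale)  # P(Z >= lower)
    tail = PHI(-max(lower, c) / scale) + max(PHI(-c / scale) - PHI(lower / scale), 0.0)
    return naive_p, tail / total

def false_positive_rates(trials=2000, n=16, tau=1.0, alpha=0.05, seed=0):
    """Empirical rejection rates of the naive and selective tests under the null."""
    rng = random.Random(seed)
    results = [r for r in (one_trial(rng, n, tau) for _ in range(trials)) if r is not None]
    naive = sum(p <= alpha for p, _ in results) / len(results)
    selective = sum(p <= alpha for _, p in results) / len(results)
    return naive, selective

naive_fpr, selective_fpr = false_positive_rates()
```

In this toy setting the naive rejection rate should come out far above the nominal level, while the selective rejection rate should stay close to $\alpha = 0.05$ up to Monte Carlo error, mirroring the behavior reported for Fig. 2.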

7. CONCLUSION

In this study, we proposed a novel method to conduct statistical inference on the significance of DNN-driven salient regions based on the concept of conditional SI. We provided a novel algorithm for efficiently and flexibly conducting conditional SI for salient regions. We conducted experiments on both synthetic and real-world datasets to demonstrate the performance of the proposed method. In the current setting, we have not considered situations where there is a misalignment between the input image and the reference image. A potential future improvement would be to add a step that automatically finds an appropriate region in the reference image before conducting the statistical test. If the matching operations can be represented as a set of linear inequalities, they can be easily incorporated into the proposed method.



1 This nuisance parameter $\mathcal{Q}_{X, X^{\mathrm{ref}}}$ corresponds to the component $z$ in the seminal conditional SI paper (Lee et al., 2016) (see Sec. 5, Eq. 5.2 and Theorem 5.2) and $z, w$ in (Chen & Bien, 2019) (see Sec. 3, Theorem 3.7). We note that additional conditioning on $\mathcal{Q}_{X, X^{\mathrm{ref}}}$ is a standard approach in the conditional SI literature and is used in almost all conditional SI-related studies. Here, we would like to note that the selective p-value depends on $\mathcal{Q}_{X, X^{\mathrm{ref}}}$, but the property in (8) is satisfied without this additional condition because we can marginalize over all values of $\mathcal{Q}_{X, X^{\mathrm{ref}}}$ (see the lower part of the proof of Theorem 5.2 in Lee et al. (2016) and the proof of Theorem 3.7 in Chen & Bien (2019)).



Figure 1 (upper image): Image without tumor region. The naive-p = 0.00 (wrong detection) and selective-p = 0.43 (true negative).
Figure 1 (lower image): Image with tumor region. The naive-p = 0.00 (true positive) and selective-p = 0.00 (true positive).

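Note on Lemma 1. In the case of the mean null test, $a = \Omega_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ and $b = \eta_{\mathcal{M}_X} / \| \eta_{\mathcal{M}_X} \|^2$; in the case of the global null test, $a = U_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$ and $b = V_{X_{\mathrm{obs}}, X^{\mathrm{ref}}_{\mathrm{obs}}}$. With these definitions, the $\mathcal{D}$ in (9) can be rewritten as $\mathcal{D} = \{ (X^\top \ X^{\mathrm{ref}\top})^\top = a + bz \mid z \in \mathcal{Z} \}$.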

Figure 2: False Positive Rate (FPR) comparison.

Figure 3: True Positive Rate (TPR) comparison.

Figure 4: Mean null test for image without tumor ($p_{\mathrm{naive}} = 0.00$, $p_{\mathrm{selective}} = 0.78$).

Figure 5: Mean null test for image with a tumor ($p_{\mathrm{naive}} = 0.00$, $p_{\mathrm{selective}} = 1.92 \times 10^{-4}$).

Figure 6: Global null test for image without tumor ($p_{\mathrm{naive}} = 0.03$, $p_{\mathrm{selective}} = 0.46$).

Figure 7: Global null test for image with a tumor ($p_{\mathrm{naive}} = 0.00$, $p_{\mathrm{selective}} = 1.51 \times 10^{-3}$).

ACKNOWLEDGEMENTS

This work was partially supported by MEXT KAKENHI (20H00601), JST CREST (JPMJCR21D3), JST Moonshot R&D (JPMJMS2033-05), JST AIP Acceleration Research (JPMJCR21U2), NEDO (JPNP18002, JPNP20006), and RIKEN Center for Advanced Intelligence Project.

