QUANTIFYING STATISTICAL SIGNIFICANCE OF NEURAL NETWORK REPRESENTATION-DRIVEN HYPOTHESES BY SELECTIVE INFERENCE

Anonymous

Abstract

In the past few years, various approaches have been developed to explain and interpret deep neural network (DNN) representations, but it has been pointed out that these representations are sometimes unstable and not reproducible. In this paper, we interpret these representations as hypotheses driven by the DNN (called DNN-driven hypotheses) and propose a method to quantify their reliability within a statistical hypothesis testing framework. To this end, we introduce the Selective Inference (SI) framework, which has received much attention in the past few years as a new statistical inference framework for data-driven hypotheses. The basic idea of SI is to make inference on a hypothesis conditional on the event that it was selected. To apply the SI framework to DNN representations, we develop a new SI algorithm based on the homotopy method, which enables us to derive the exact (non-asymptotic) conditional sampling distribution of the DNN-driven hypotheses. We demonstrate the proposed method on computer vision tasks as practical examples. We conduct experiments on both synthetic and real-world datasets, through which we offer evidence that our proposed method can successfully control the false positive rate, has decent computational efficiency, and provides good results in practical applications.

1. INTRODUCTION

The remarkable predictive performance of deep neural networks (DNNs) stems from their ability to learn appropriate representations from data. To understand the decision-making process of DNNs, it is thus important to be able to explain and interpret DNN representations. For example, in image classification tasks, knowing the attention region in a DNN representation allows us to understand the reason for the classification. In the past few years, several methods have been developed to explain and interpret DNN representations (Ribeiro et al., 2016; Bach et al., 2015; Doshi-Velez & Kim, 2017; Lundberg & Lee, 2017; Zhou et al., 2016; Selvaraju et al., 2017); however, some of them have turned out to be unstable and not reproducible (Kindermans et al., 2017; Ghorbani et al., 2019; Melis & Jaakkola, 2018; Zhang et al., 2020; Dombrowski et al., 2019; Heo et al., 2019). It is therefore crucially important to develop a method to quantify the reliability of DNN representations. In this paper, we interpret these representations as hypotheses that are driven by the DNN (called DNN-driven hypotheses) and employ a statistical hypothesis testing framework to quantify their reliability. For example, in an image classification task, the reliability of an attention region can be quantified based on the statistical significance of the difference between the attention region and the rest of the image. Unfortunately, a traditional statistical test cannot be applied to this problem because the hypothesis (the attention region in the above example) is itself selected by the data; a traditional statistical test is valid only when the hypothesis is non-random. Roughly speaking, if a hypothesis is selected by the data, the hypothesis will over-fit to the data, and this bias needs to be corrected when assessing the reliability of the hypothesis.
Our main contribution in this paper is to introduce the Selective Inference (SI) approach for testing the reliability of DNN representations. The basic idea of SI is to perform statistical inference under the condition that the hypothesis is selected. The SI approach has been demonstrated to be effective in the context of feature selection methods such as the Lasso. In this paper, to apply SI to DNN representations, we develop a novel SI algorithm based on the homotopy method, which enables us to derive the exact (non-asymptotic) conditional sampling distribution of the DNN-driven hypothesis. We use the p-value as a criterion to quantify the reliability of a DNN representation. In the literature, p-values are often misinterpreted, and various sources of misinterpretation have been discussed (Wasserstein & Lazar, 2016). In this paper, by using SI, we address one of these sources: p-values are biased when the hypothesis is selected after looking at the data (often called double-dipping or data dredging). We believe our approach is a significant first step toward providing valid p-values for assessing the reliability of DNN representations. Figure 1 shows an example that illustrates the importance of our method.

Related works. Several recent approaches have been developed to visualize and understand a trained DNN. Many of these post-hoc approaches (Mahendran & Vedaldi, 2015; Zeiler & Fergus, 2014; Dosovitskiy & Brox, 2016; Simonyan et al., 2013) have focused on developing visualization tools for the activation maps and/or the filter weights within trained networks. Others have aimed to identify the discriminative regions in an input image, given a trained network (Selvaraju et al., 2017; Fong & Vedaldi, 2017; Zhou et al., 2016; Lundberg & Lee, 2017).
In parallel, some recent studies have shown that many popular methods for explanation and interpretation are not stable with respect to perturbations or adversarial attacks on the input data and the model (Kindermans et al., 2017; Ghorbani et al., 2019; Melis & Jaakkola, 2018; Zhang et al., 2020; Dombrowski et al., 2019; Heo et al., 2019). However, no previous study has quantitatively evaluated the stability and reproducibility of DNN representations within a rigorous statistical inference framework. In the past few years, SI has been actively studied for inference on the features of linear models selected by feature selection methods such as the Lasso (Lee et al., 2016; Liu et al., 2018; Duy & Takeuchi, 2020). The basic idea of SI is to make inference conditional on the selection event, which allows us to derive the exact (non-asymptotic) sampling distribution of the test statistic. SI has also been applied to various other problems (Bachoc et al., 2014; Fithian et al., 2015; Choi et al., 2017; Tian et al., 2018; Chen & Bien, 2019; Hyun et al., 2018; Bachoc et al., 2018; Loftus & Taylor, 2014; Loftus, 2015; Panigrahi et al., 2016; Tibshirani et al., 2016; Yang et al., 2016; Suzumura et al., 2017; Duy et al., 2020). However, to the best of our knowledge, no existing study provides SI for DNNs, which is technically challenging. This study is partly motivated by Tanizaki et al. (2020), where the authors provide a framework to compute p-values for image segmentation results obtained by graph cut and threshold-based segmentation algorithms. As we demonstrate in this paper, our method can also be used to assess the reliability of DNN-based segmentation results.

Contribution. To our knowledge, this is the first study that provides an exact (non-asymptotic) inference method for statistically quantifying the reliability of data-driven hypotheses discovered from DNN representations.
We propose a novel SI homotopy method, inspired by Duy & Takeuchi (2020), for conducting powerful and efficient SI for DNN representations. We conduct experiments on both synthetic and real-world datasets, through which we offer evidence that our proposed method can successfully control the false positive rate, has decent computational efficiency, and provides good results in practical applications. We provide our implementation in the supplementary material, and it will be released when this paper is published.

2. PROBLEM STATEMENT

To formulate the problem, we denote an image with n pixels corrupted by Gaussian noise as

X = (X_1, ..., X_n)^T = μ + ε, ε ~ N(0, Σ), (1)

where μ ∈ R^n is the unknown mean pixel intensity vector and ε ∈ R^n is a vector of Normally distributed noise with covariance matrix Σ, which is known or can be estimated from external data. We note that we do not assume that the pixel intensities in an image follow a Normal distribution in Equation (1); we only assume that the vector of noise added to the true pixel values follows a multivariate Normal distribution. For an image X and a trained DNN, the main target is to identify an attention region (discriminative/informative region) in the input image X based on a DNN representation. A pixel is assigned to the attention region if its corresponding value in the representation layer is greater than a pre-defined threshold. We denote the attention region and non-attention region of X as C^+_X and C^-_X, respectively.

Definition 1. We define A(X) as the event that the pixels of image X are divided into the two sets C^+_X and C^-_X by applying a DNN to X, i.e., A(X) = {C^+_X, C^-_X}.

Quantifying the statistical significance of DNN-driven hypotheses. Given an observed image x^obs ∈ R^n sampled from the model (1), we obtain C^+_{x^obs} and C^-_{x^obs} by applying the DNN to x^obs. Let us consider a score ∆ that represents the degree to which the attention region differs from the non-attention region. In general, we can define any score as long as it can be written in the form ∆ = η^T x^obs.
For example, we can define ∆ as the difference in average pixel values between the attention region and the non-attention region, i.e.,

∆ = m_{C^+_{x^obs}} - m_{C^-_{x^obs}} = (1/|C^+_{x^obs}|) Σ_{i ∈ C^+_{x^obs}} x^obs_i - (1/|C^-_{x^obs}|) Σ_{i ∈ C^-_{x^obs}} x^obs_i = η^T x^obs,

where η = (1/|C^+_{x^obs}|) 1^n_{C^+_{x^obs}} - (1/|C^-_{x^obs}|) 1^n_{C^-_{x^obs}}, and 1^n_C ∈ R^n is a vector whose elements are 1 if they belong to the set C and 0 otherwise. If |∆| is sufficiently large, the difference between C^+_{x^obs} and C^-_{x^obs} is significant and the attention region is reliable. To quantify the statistical significance, we consider a statistical hypothesis test with the following null hypothesis H_0 and alternative hypothesis H_1:

H_0: μ_{C^+_{x^obs}} = μ_{C^-_{x^obs}} vs. H_1: μ_{C^+_{x^obs}} ≠ μ_{C^-_{x^obs}}, (3)

where μ_{C^+_{x^obs}} and μ_{C^-_{x^obs}} are the true means of the pixel values in the attention region and non-attention region, respectively. Given a significance level α (e.g., 0.05), we reject H_0 if the p-value is smaller than α, which indicates that the attention region differs from the non-attention region. Otherwise, we cannot say that the difference is significant. In a standard (naive) statistical test, the hypotheses in (3) are assumed to be fixed, i.e., non-random. Then, the naive (two-sided) p-value is simply given as

p_naive = P_{H_0}(|η^T X| ≥ |∆|) = P_{H_0}(|η^T X| ≥ |η^T x^obs|). (4)

However, since the hypotheses in (3) are not actually fixed in advance, the naive p-value is not valid in the sense that, if we reject H_0 at significance level α, the false detection rate (type-I error) cannot be controlled at level α, which indicates that p_naive is unreliable. This is because the hypotheses (the attention region) in (3) are selected by looking at the data (the input image), and thus selection bias exists. This selection bias is sometimes called data dredging, data snooping, or p-hacking (Ioannidis, 2005; Head et al., 2015).

Selective inference (SI) for computing valid p-values.
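To make the test statistic concrete, the contrast vector η can be assembled directly from an attention mask. The following sketch is our own illustration (not the authors' code); it assumes NumPy and uses a hypothetical 4-pixel "image" in which the first two pixels form the attention region:

```python
import numpy as np

def contrast_vector(attn_mask):
    """Build eta so that delta = eta @ x is the difference in mean
    pixel intensity between the attention and non-attention regions.
    attn_mask: boolean array of length n (True = attention region)."""
    attn_mask = np.asarray(attn_mask, dtype=bool)
    # +1/|C+| on attention pixels, -1/|C-| on the rest
    eta = np.where(attn_mask, 1.0 / attn_mask.sum(), -1.0 / (~attn_mask).sum())
    return eta

x_obs = np.array([5.0, 6.0, 1.0, 2.0])        # toy observed image
mask = np.array([True, True, False, False])   # hypothetical attention region
eta = contrast_vector(mask)
delta = eta @ x_obs                           # (5+6)/2 - (1+2)/2 = 4.0
```

Note that the entries of η sum to zero by construction, so ∆ is invariant to adding a constant intensity to every pixel.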
The basic idea of SI is to make inference conditional on the selection event, which allows us to derive the exact (non-asymptotic) sampling distribution of the test statistic η^T X and thereby avoid the selection bias. Thus, we employ the following conditional p-value:

p_selective = P_{H_0}(|η^T X| ≥ |η^T x^obs| | A(X) = A(x^obs), q(X) = q(x^obs)), (5)

where q(X) = (I_n - c η^T) X with c = Σ η (η^T Σ η)^{-1}. The first condition, A(X) = A(x^obs), indicates the event that the division of pixels into an attention region and a non-attention region for a random image X is the same as that of the observed image x^obs, i.e., C^+_X = C^+_{x^obs} and C^-_X = C^-_{x^obs}. The second condition, q(X) = q(x^obs), indicates that the component independent of the test statistic is the same for X and x^obs; q(X) corresponds to the component z in the seminal SI paper of Lee et al. (2016) (Sec. 5, Eq. 5.2 and Theorem 5.2). The p-value in (5), called the selective type-I error or selective p-value in the SI literature (Fithian et al., 2014), is valid in the sense that P_{H_0}(p_selective < α) = α, ∀α ∈ [0, 1], i.e., the false detection rate is theoretically controlled at level α, indicating that the selective p-value is reliable. To calculate the selective p-value in (5), we need to identify the conditional data space. Let us define the set of x ∈ R^n that satisfies the conditions in (5) as

X = {x ∈ R^n | A(x) = A(x^obs), q(x) = q(x^obs)}.

According to the second condition, the data in X are restricted to a line (Sec. 6 in Liu et al. (2018); Fithian et al. (2014)). Therefore, the set X can be re-written, using a scalar parameter z ∈ R, as

X = {x(z) = a + b z | z ∈ Z},

where a = q(x^obs), b = Σ η (η^T Σ η)^{-1}, and

Z = {z ∈ R | A(x(z)) = A(x^obs)}. (8)

Now, let us consider a random variable Z ∈ R and its observation z^obs ∈ R that satisfy X = a + b Z and x^obs = a + b z^obs.
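The line parametrization above can be checked numerically: with a = q(x^obs) and b = Ση(η^TΣdη)^{-1}, the test statistic along x(z) = a + bz is exactly z. The sketch below is our own illustration with hypothetical random data (assuming NumPy):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
Sigma = np.eye(n)                 # known noise covariance (illustrative)
eta = rng.standard_normal(n)      # contrast vector (illustrative)
x_obs = rng.standard_normal(n)    # observed image (illustrative)

# Decompose x_obs into the test-statistic direction and the
# independent component q(x_obs): x = a + b z with
# b = Sigma eta / (eta' Sigma eta) and a = (I - b eta') x_obs.
b = Sigma @ eta / (eta @ Sigma @ eta)
a = x_obs - b * (eta @ x_obs)     # equals q(x_obs)
z_obs = eta @ x_obs

# Along the line x(z) = a + b z, the test statistic is exactly z,
# since eta' b = 1 and eta' a = 0.
for z in (-1.0, 0.0, z_obs, 3.5):
    assert abs(eta @ (a + b * z) - z) < 1e-9
```

This is why identifying the truncation region Z reduces to a one-dimensional search over z.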
Then, the selective p-value in (5) is re-written as

p_selective = P_{H_0}(|η^T X| ≥ |η^T x^obs| | X ∈ X) = P_{H_0}(|Z| ≥ |z^obs| | Z ∈ Z). (9)

Since Z ~ N(0, η^T Σ η) under the null hypothesis, the law of Z | Z ∈ Z follows a truncated Normal distribution. Once the truncation region Z is identified, the selective p-value (9) can be computed as

p_selective = F^Z_{0, η^T Σ η}(-|z^obs|) + 1 - F^Z_{0, η^T Σ η}(|z^obs|),

where F^E_{m, s^2} is the c.d.f. of the truncated Normal distribution with mean m, variance s^2, and truncation region E. Therefore, the most important task is to identify Z.

Extension of the problem setup to hypotheses driven from DNN-based image segmentation. We interpret the hypothesis driven from an image segmentation result as one obtained from the representation at the output layer instead of an internal representation. Our problem setup is general and can be directly applied to this case. For example, we can consider the attention region as the object region and the non-attention region as the background region; then, we can conduct SI to quantify the significance of the difference between the object and background regions. We note that we consider the case where the image is segmented into two regions (object and background) to simplify the problem and notation; the extension to more than two regions is straightforward.
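Once Z is known as a union of intervals, the selective p-value is a ratio of ordinary Normal c.d.f. values. A minimal sketch of this computation (our own illustration, assuming SciPy; the interval list and variance are illustrative placeholders):

```python
import numpy as np
from scipy.stats import norm

def truncated_cdf(val, sd, intervals):
    """CDF at `val` of N(0, sd^2) truncated to the union of `intervals`
    (list of (lo, hi) pairs), computed from the standard normal CDF."""
    num = sum(norm.cdf(min(val, hi) / sd) - norm.cdf(lo / sd)
              for lo, hi in intervals if lo < val)
    den = sum(norm.cdf(hi / sd) - norm.cdf(lo / sd) for lo, hi in intervals)
    return num / den

def selective_p_value(z_obs, sd, intervals):
    """Two-sided selective p-value: F(-|z|) + 1 - F(|z|) under truncation."""
    F = lambda v: truncated_cdf(v, sd, intervals)
    return F(-abs(z_obs)) + 1.0 - F(abs(z_obs))

# Sanity check: when the truncation region is (effectively) the whole line,
# the selective p-value reduces to the ordinary two-sided z-test p-value.
p = selective_p_value(1.96, 1.0, [(-50.0, 50.0)])  # approx 0.05
```

With a genuinely restricted truncation region, the same code returns a larger, selection-corrected p-value.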

3. PROPOSED METHOD

As discussed in §2, to calculate the selective p-value, the truncation region Z in Equation (8) must be identified. To construct Z, we have to 1) compute A(x(z)) for all z ∈ R, and 2) identify the set of intervals of z on which A(x(z)) = A(x^obs). However, it is intractable to obtain A(x(z)) for infinitely many values of z ∈ R. Our first idea for developing SI for DNNs is to additionally condition on an extra event to make the problem tractable. We focus on the class of DNNs whose activation functions (AFs) are piecewise-linear, e.g., ReLU and Leaky ReLU (the extension to general AFs is discussed later), and consider additionally conditioning on the selected piece of each piecewise-linear AF in the DNN.

Figure 2: A schematic illustration of the proposed method. By applying a DNN to the observed image x^obs, we obtain a representation. We then parametrize the data with a scalar parameter z along the direction of the test statistic, via the line x(z) = a + bz, to identify the conditional data space X = {x ∈ R^n | A(x) = A(x^obs), q(x) = q(x^obs)} whose data yield the same representation as x^obs. Finally, valid statistical inference is conducted conditional on X. We introduce a homotopy method for efficiently characterizing the conditional data space X.

Definition 2. Let s_j(x) be the selected piece of a piecewise-linear AF at the j-th unit in a DNN for a given input image x, and let s(x) be the set of s_j(x) for all the units in the DNN. For example, for a ReLU activation function, s_j(x) takes either 0 or 1 depending on whether the input to the j-th unit is located at the flat part (inactive) or the linear part (active) of the ReLU function.
Using the notion of selected pieces s(x), instead of computing the selective p-value in (9), we consider the following over-conditioning (oc) conditional p-value:

p^oc_selective = P_{H_0}(|Z| ≥ |z^obs| | Z ∈ Z^oc), (11)

where Z^oc = {z ∈ R | A(x(z)) = A(x^obs), s(x(z)) = s(x^obs)}. However, such over-conditioning in SI leads to a loss of statistical power (Lee et al., 2016). Our second idea is to develop a homotopy method to resolve the over-conditioning problem, i.e., to remove the conditioning on s(x(z)) = s(x^obs). With the homotopy method, we can compute A(x(z)) in a finite number of operations without considering infinitely many values of z ∈ R, which is subsequently used to obtain the truncation region Z in (8). The main idea is to compute a finite number of breakpoints at which one node of the network changes its status from active to inactive or vice versa. This is similar to the regularization path of the Lasso, where we can compute a finite number of breakpoints at which the active set changes. To this end, we introduce a two-step iterative approach (see Fig. 2):

• Step 1 (over-conditioning step). Consider the over-conditioned case by additionally conditioning on the selected pieces of all the hidden nodes in the DNN.
• Step 2 (homotopy step). Combine multiple over-conditioned cases by the homotopy method to obtain A(x(z)) for all z ∈ R.
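For a fixed set of selected pieces, Step 1 reduces to intersecting one-dimensional linear constraints: each row of the inequality system Θ x(z) ≤ ψ (made precise in Lemma 1 below) restricts z to a half-line, so the over-conditioned region along the line is a single interval. A sketch of this reduction (our own illustration, assuming NumPy; Θ and ψ here are toy constraints rather than ones derived from a real network):

```python
import numpy as np

def overconditioned_interval(Theta, psi, a, b):
    """Interval of z with Theta @ (a + b z) <= psi.
    Each row k gives (Theta a)_k + z (Theta b)_k <= psi_k, i.e. an
    upper bound on z if (Theta b)_k > 0, a lower bound if < 0,
    and a constant (feasible or not) constraint if == 0."""
    ta, tb = Theta @ a, Theta @ b
    lo, hi = -np.inf, np.inf
    for k in range(len(psi)):
        if tb[k] > 0:
            hi = min(hi, (psi[k] - ta[k]) / tb[k])
        elif tb[k] < 0:
            lo = max(lo, (psi[k] - ta[k]) / tb[k])
        elif ta[k] > psi[k]:
            return None  # infeasible constant constraint
    return lo, hi

# Toy example: constraints z <= 2 and -z <= 1, so the interval is [-1, 2].
Theta = np.array([[1.0], [-1.0]])
psi = np.array([2.0, 1.0])
lo, hi = overconditioned_interval(Theta, psi, np.zeros(1), np.ones(1))
```

Step 2 then stitches many such intervals together by walking from one activation pattern to the next.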

3.1. STEP 1: OVER-CONDITIONING STEP

We now show that, by conditioning on the selected pieces s(x^obs) of all the hidden nodes, the selection event of the DNN can be written as a set of linear inequalities.

Lemma 1. Consider a class of DNNs consisting of affine operations and piecewise-linear AFs. Then, the over-conditioned region is written as Z^oc = {z ∈ R | Θ^{s(x^obs)} x(z) ≤ ψ^{s(x^obs)}} for a matrix Θ^{s(x^obs)} and a vector ψ^{s(x^obs)} that depend only on the selected pieces s(x^obs).

Proof. For this class of DNNs, by fixing the selected pieces of all the piecewise-linear AFs, the input to each AF is represented by an affine function of the image x. Therefore, the condition for selecting a piece of a piecewise-linear AF, s_j(x(z)) = s_j(x^obs), is written as a linear inequality w.r.t. x(z). Similarly, the value of each unit in the representation layer is also written as an affine function of x(z). Since a pixel is assigned to the attention region if this value is greater than a threshold, the choice of attention region A(x(z)) = A(x^obs) is characterized by a set of linear inequalities w.r.t. x(z).

Furthermore, let us consider the max-operation, an operation that selects the maximum from a finite number of candidates. A max-operation is characterized by a set of comparison operators, i.e., inequalities. Consider a DNN that contains max-operations, and let s̄(x) be the set of selected candidates of all the max-operations for an image x.

Corollary 1. Consider a class of DNNs consisting of affine operations, max-operations, and piecewise-linear AFs. Then, the region Z̄^oc := {z ∈ Z^oc | s̄(x(z)) = s̄(x^obs)} is characterized by a set of linear inequalities w.r.t. x(z). The proof is shown in Appendix A.1.

Remark 1. In this work, we mainly focus on trained DNNs whose hidden-layer activation functions are piecewise-linear, e.g., ReLU and Leaky ReLU, which are commonly used in CNNs.
If there is a specific need to use non-piecewise-linear functions such as sigmoid or tanh at hidden layers, we can apply a piecewise-linear approximation to these functions; examples of such approximations are provided in Appendix A.5.

Remark 2. Most of the basic operations in a trained neural network are affine operations. In a traditional neural network, multiplying the weight matrix by the output of the previous layer and adding the bias vector is an affine operation. In a CNN, the main convolution operation is obviously an affine operation, and the upsampling operation is also affine.

Remark 3. Although the max-pooling operation is not an affine operation, it can be written as a set of linear inequalities. For instance, v_1 = max{v_1, v_2, v_3} can be written as the set {e_2^T v ≤ e_1^T v, e_3^T v ≤ e_1^T v}, where v = (v_1, v_2, v_3)^T and e_i is a standard basis vector with a 1 at position i.

Remark 4. In Remark 1, we mentioned that non-piecewise-linear activations at hidden layers require a piecewise-linear approximation. However, if such functions are used at the output layer, we do not need to perform the approximation, because we can define the set of linear inequalities based on the pre-activation values. See the next example for the case of the sigmoid function.

Example 1. Let us consider a 3-layer neural network with n input nodes, h hidden nodes, and n output nodes. Let W^(1) ∈ R^{h×n} and w^(1) ∈ R^h be the weight matrix and bias vector between the input layer and hidden layer, and W^(2) ∈ R^{n×h} and w^(2) ∈ R^n be the weight matrix and bias vector between the hidden layer and output layer. The activation function at the hidden layer is ReLU, and we use the sigmoid function at the output layer.
At the hidden layer, for any node j ∈ [h], the selection event is written as

W^(1)_{j,:} x + w^(1)_j ≥ 0, if the output of the ReLU at the j-th node is ≥ 0,
W^(1)_{j,:} x + w^(1)_j < 0, otherwise.

Let a^(1) ∈ R^h and s^(1) ∈ R^h be the vectors in which a^(1)_j = 1 and s^(1)_j = 1 if the output of the ReLU at the j-th node is ≥ 0, and a^(1)_j = 0 and s^(1)_j = -1 otherwise. Then we have the linear inequality system Θ_1 x ≤ ψ_1, where Θ_1 stacks the rows -s^(1)_j W^(1)_{j,:} for j ∈ [h] and ψ_1 = (s^(1)_1 w^(1)_1, ..., s^(1)_h w^(1)_h)^T. Next, for any output node o ∈ [n], the selection event (a linear inequality) is written as

W^(2)_{o,:}((W^(1) x + w^(1)) ∘ a^(1)) + w^(2)_o ≥ 0, if the output of the sigmoid at the o-th node is ≥ 0.5,
W^(2)_{o,:}((W^(1) x + w^(1)) ∘ a^(1)) + w^(2)_o < 0, otherwise,

where ∘ is the element-wise product. As with the hidden layer, we can construct the linear inequality system Θ_2 x ≤ ψ_2 at the output layer. Finally, the whole linear inequality system Θ x ≤ ψ is obtained by stacking the two systems, i.e., Θ = (Θ_1^T Θ_2^T)^T and ψ = (ψ_1^T ψ_2^T)^T.

Algorithm 1 compute solution path
Input: a, b, [z_min, z_max]
1: Initialization: t = 1, z_t = z_min, T = {z_t}
2: while z_t < z_max do
3:   Obtain A(x(z_t)) by applying the trained DNN to x(z_t) = a + b z_t
4:   Compute the next breakpoint z_{t+1} by Equation (13); then set T = T ∪ {z_{t+1}} and t = t + 1
5: end while
Output: {A(x(z_t))}_{z_t ∈ T}

3.2. STEP 2: HOMOTOPY STEP

We now introduce a homotopy method to compute A(x(z)) based on the over-conditioning step.

Lemma 2. Consider a real value z_t. By applying the trained DNN to x(z_t), we obtain the set of linear inequalities Θ^{s(x(z_t))} x(z_t) ≤ ψ^{s(x(z_t))}. Then, the next breakpoint z_{t+1} > z_t, at which the status of one node changes from active to inactive or vice versa (i.e., the sign of one linear inequality changes), is calculated by

z_{t+1} = min_{k : (Θ^{s(x(z_t))} b)_k > 0} ( ψ^{s(x(z_t))}_k - (Θ^{s(x(z_t))} a)_k ) / (Θ^{s(x(z_t))} b)_k. (13)

The proof is shown in Appendix A.2.
Algorithm 1 shows our procedure for efficiently identifying A(x(z)). In this algorithm, multiple breakpoints z_1 < z_2 < ... < z_{|T|} are computed one by one. Each breakpoint z_t, t ∈ [|T|], indicates a point at which the sign of one linear inequality changes, i.e., the status of one node in the network changes from active to inactive or vice versa. By identifying all these breakpoints {z_t}_{t ∈ [|T|]}, the solution path is given by A(x(z)) = A(x(z_t)) if z ∈ [z_t, z_{t+1}], t ∈ [|T|]. For the choice of [z_min, z_max], see Appendix A.3.
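The breakpoint enumeration of Algorithm 1 can be traced end-to-end on a toy model. The sketch below is our own illustration (assuming NumPy; a tiny random one-layer ReLU model stands in for a trained DNN): it builds the Lemma 1 inequalities for the current activation pattern, advances along x(z) = a + bz via Equation (13), and verifies each recorded pattern at the interval midpoint.

```python
import numpy as np

rng = np.random.default_rng(1)
n, h = 4, 3
W = rng.standard_normal((h, n))  # hypothetical hidden-layer weights
w = rng.standard_normal(h)       # hypothetical hidden-layer bias

def pattern(x):
    """Activation pattern (selected pieces) of the ReLU layer at x."""
    return tuple(np.where(W @ x + w >= 0, 1, -1))

def inequalities(x):
    """Theta, psi for the pattern selected at x (as in Lemma 1):
    -s_j (W_j x + w_j) <= 0 pins unit j to its selected piece."""
    s = np.asarray(pattern(x), dtype=float)
    return -s[:, None] * W, s * w

def solution_path(a, b, z_min, z_max, eps=1e-9):
    """Enumerate activation patterns along x(z) = a + b z.
    Each breakpoint is the smallest z above the current one at which
    an inequality of the current pattern flips sign (Eq. 13)."""
    z, pieces = z_min, []
    while z < z_max:
        Theta, psi = inequalities(a + b * z)
        ta, tb = Theta @ a, Theta @ b
        ratios = [(psi[k] - ta[k]) / tb[k] for k in range(len(psi)) if tb[k] > 0]
        z_next = min(min([r for r in ratios if r > z + eps], default=z_max), z_max)
        pieces.append(((z, z_next), pattern(a + b * z)))
        z = z_next + eps  # step just past the breakpoint
    return pieces

a, b = rng.standard_normal(n), rng.standard_normal(n)
path = solution_path(a, b, -5.0, 5.0)
# The pattern recorded for each interval matches a direct evaluation
# at the interval's midpoint.
for (lo, hi), pat in path:
    assert pattern(a + b * (lo + hi) / 2) == pat
```

In the full method, the intervals whose pattern reproduces A(x^obs) are then collected to form the truncation region Z.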

4. EXPERIMENT

We highlight the main results; several additional results and details can be found in Appendix A.6.

Numerical experiments. We demonstrate the performance of two versions of the proposed method: proposed-method (homotopy) and proposed-method-oc, whose p-values were computed by (5) and (11), respectively. We also compared the proposed methods with the naive p-value in (4) and the permutation test; the details of the permutation test procedure are described in Appendix A.6. To test FPR control, we generated 120 null images x = (x_1, ..., x_n) with x_i ~ N(0, 1) for i ∈ [n], for each n ∈ {64, 256, 1024, 4096}. To test the power, we generated images x = (x_1, ..., x_n) with n = 256 for each true average difference in the underlying model, μ_{C^+_x} - μ_{C^-_x} = ∆_μ ∈ {0.5, 1.0, 1.5, 2.0}. For each case, we ran 120 trials and chose the significance level α = 0.05. For more information about the setup and the structure of the neural network, see the experimental setup paragraph in Appendix A.6. The FPR results are shown in the first part of Fig. 3: the proposed methods successfully controlled the FPR under α = 0.05, while the naive method could not. Since the naive method fails to control the FPR, we did not evaluate its power. In the second part of Fig. 3, we see that the over-conditioning option has lower power than the homotopy method; this is because the truncation region in proposed-method-oc is shorter than the one in proposed-method (homotopy), as demonstrated in the third part of Fig. 3. The last part of Fig. 3 shows why the proposed homotopy method is efficient: we only need to consider the intervals encountered on the line along the direction of the test statistic, whose number increases almost linearly in practice.

Real-data examples. We performed comparisons on a real-world brain image dataset, which includes 939 images with tumors and 941 images without tumors.
We first compared our method with the permutation test in terms of FPR control; the results are shown in Table 1. Since the permutation test could not control the FPR properly, we did not compare the power. The comparisons between the naive p-value and the selective p-value are shown in Figs. 4, 5, 6, and 7. The naive p-value was small even when the image had no tumor region, which indicates that naive p-values cannot be used for quantifying the reliability of DNN-driven hypotheses. The proposed method successfully identified false positive detections as well as true positive detections.

5. CONCLUSION

We proposed a novel method to conduct statistical inference on the significance of data-driven hypotheses derived from neural network representations, based on the concept of selective inference. In the context of explainable or interpretable DNNs, we are primarily interested in the reliability of the trained network on new inputs (not training inputs); therefore, the validity of our proposed method does not depend on how the DNN is trained. Regarding generality, the proposed method can be applied to any kind of network as long as its operations are characterized by a set of linear inequalities (or approximated by piecewise-linear functions), because all the algorithms and theories in §2 and §3 depend only on the properties of each component, not on the entire structure of the network. We believe this paper provides a significant step toward reliable artificial intelligence (AI) and opens several directions for statistically evaluating the reliability of DNN representation-driven hypotheses. Although it is not necessary to account for the impact of training in this paper, since the validity of our proposed method does not depend on how the DNN is trained, defining a new problem setup and providing a solution for the case in which the training process needs to be considered is a potential direction. Moreover, widening the practical applicability of the proposed method to other fields such as NLP and signal processing would also be a valuable contribution.



Figure 1: Examples of the proposed method on brain tumor image classification. Given a CNN trained in advance to classify tumor versus non-tumor brain images, our method provides the statistical significance of the attention region for each test image in the form of p-values, by comparing the pixel information in the attention and non-attention regions. Since the attention region is selected based on the input image, the p-value obtained by a naive comparison of the two regions (naive p-value) is highly biased. In the left-hand figure, where there is no brain tumor, the naive p-value is nearly zero (a false positive: incorrectly identifying a tumor region), while the proposed selective p-value is large (a true negative). In the right-hand figure, where a brain tumor actually exists, both the naive p-value and the selective p-value are very small (a true positive). The proposed selective inference method can provide valid, exact (non-asymptotic) p-values for DNN representations such as attention regions.

Figure 3: Results for false positive rate (FPR), power, length of the truncation interval, and number of encountered intervals.

Figure 4: Inference on hypotheses obtained from internal representation (without tumor region).

Table 1: FPR and power comparisons on the real-world brain image dataset.

