SEQSHAP: SUBSEQUENCE LEVEL SHAPLEY VALUE EXPLANATIONS FOR SEQUENTIAL PREDICTIONS

Abstract

With the increasing demand for interpretability in real-world applications, various methods for explainable artificial intelligence (XAI) have been proposed. However, most of them overlook interpretability in sequential scenarios, which have a wide range of applications, e.g., online transactions and sequential recommendations. In this paper, we propose a Shapley-value-based explainer named SeqSHAP to explain model predictions in sequential scenarios. Compared to existing methods, SeqSHAP provides more intuitive explanations at the subsequence level, which explicitly models the effect of contextual information among related elements in a sequence. We propose to calculate subsequence-level feature attributions instead of element-wise attributions to utilize the information embedded in the sequence structure, and provide a distribution-based segmentation method to obtain reasonable subsequences. Extensive experiments on two online transaction datasets from a real-world e-commerce platform show that the proposed method can provide valid and reliable explanations for sequential predictions.

1. INTRODUCTION

Sequential prediction tasks have a wide range of real-world applications, e.g., online transactions (Wang et al., 2017; Zhang et al., 2018; Weber et al., 2018; Tam et al., 2019; Zhu et al., 2020; Chen & Lai, 2021) and sequential recommendation (Quadrana et al., 2017; Tang & Wang, 2018; Sun et al., 2019; Shen et al., 2021; Cui et al., 2022), since sequences contain continuous signals which are important for model predictions. With the development of deep learning techniques, sequence-based models have achieved desirable performance in recent years (Hidasi et al., 2015; Quadrana et al., 2017; Wang et al., 2017; Tang & Wang, 2018; Zhang et al., 2018; Sun et al., 2019; Zhu et al., 2020; Qiao & Wang, 2022). However, the complicated sequential data and increased model complexity make it hard for humans to understand model predictions. Indeed, for security and trust considerations, it is essential to develop effective explainable artificial intelligence (XAI) methods for sequence-based models in scenarios like fraud detection and medical care, so that end-users can understand how model predictions are produced from these complicated sequential data and models. In recent years, considerable effort has been devoted to model explanation algorithms (Ribeiro et al., 2016; Shrikumar et al., 2017; Lundberg & Lee, 2017; Selvaraju et al., 2017; Wachter et al., 2017; Alvarez-Melis & Jaakkola, 2018; Mothilal et al., 2020; Slack et al., 2021; Ghalebikesabi et al., 2021; Ali et al., 2022). Among these works, feature attribution methods (Ribeiro et al., 2016; Shrikumar et al., 2016; 2017; Lundberg & Lee, 2017) are a popular family of post-hoc XAI methods. They calculate an attribution score for each feature to identify the features that are important for model predictions. However, most existing methods focus on explaining tabular data or images.
When dealing with data and models in sequential scenarios, the complex input sequences make the element-wise explanations produced by these methods less interpretable. The high-dimensional features and abundant interactions make it difficult for existing element-wise XAI methods to provide explanations. Separately assigning attribution scores to individual feature cells in the sequence is not informative enough for users to understand the predictions. In addition, the large number of features in a sequence can incur a high execution cost for existing methods, since their time complexity mostly depends on the number of features to be explained.

In this paper, we propose SeqSHAP, a Shapley-value-based method to explain model predictions in sequential scenarios. SeqSHAP provides explanations at the subsequence level, which is more intuitive for humans in sequential scenarios than element-wise explanations. Meanwhile, we propose a distribution-based segmentation method, which utilizes the distribution information of sequential features to split the sequence into reasonable subsequences. With the obtained subsequences, we group the feature elements under each subsequence as independent units. Shapley value estimates for these feature units are then calculated to capture the important features that strongly influence the model prediction. We carry out extensive experiments on two large-scale real-world online transaction datasets. We analyze the local explanations produced by SeqSHAP and show that, compared to existing feature attribution methods in sequential scenarios, our method provides intuitive explanations with meaningful subsequences. Our contributions can be summarized as follows:
• We propose an effective XAI method to explain sequential predictions at the subsequence level, which is a unique and intuitive view in sequential scenarios.
• We propose a distribution-based segmentation method that characterizes the distribution information of sequential features to capture context information and obtain reasonable subsequences.
• Extensive experiments on two real-world transaction datasets are provided to evaluate the validity of our segmentation method and of the subsequence-level explanations produced by SeqSHAP.

2. BACKGROUND

In this section, we firstly introduce the task of explaining model predictions with sequential inputs. Then we introduce the background of SHAP (Lundberg & Lee, 2017), a popular interpretable framework based on Shapley values in game theory.

2.1. EXPLAINING PREDICTIONS WITH SEQUENTIAL DATA

Machine learning (ML) models for sequential prediction tasks have been widely applied in real-world applications (Hidasi et al., 2015; Tang & Wang, 2018; Sun et al., 2019; Zhu et al., 2020), since the historical behaviour records in a sequence contain valuable information for the prediction task. However, while various models with desirable performance have been proposed, their predictions are becoming particularly difficult to explain due to the increasing model complexity, which blocks the application of new techniques in scenarios requiring a high degree of interpretability. As a result, the demand for XAI methods in sequential domains is growing rapidly, as existing methods mostly focus on tabular data and are not suitable for data with sequence structure.

Task Description. In this paper, our task is to build an interpreter $g$ to explain model predictions in sequential scenarios. Specifically, we are given a classifier $f$ and a sequence $X$ of the form

$$X = \{e_1, e_2, \ldots, e_T\}, \quad e_t = \{x^t_1, x^t_2, \ldots, x^t_M\},$$

where $T$ is the length of the sequence, $M$ is the number of features, and $e_t \in \mathbb{R}^M$ is the $t$-th element of the sequence, described by $M$ feature fields. The interpreter $g$ is expected to generate an explanation for the model prediction $\hat{y} = f(X) \in [0, 1]$. For the family of additive feature attribution methods, an element-wise explanation $\phi \in \mathbb{R}^{T \times M}$ assigns an importance score $\phi_{i,j}$ ($1 \le i \le T$, $1 \le j \le M$) to the corresponding feature cell $x^i_j$ in the sequence $X$, which represents the influence of that feature cell on the model prediction.

2.2. SHAPLEY VALUE BASED EXPLANATIONS

SHapley Additive exPlanations, termed SHAP (Lundberg & Lee, 2017), is a popular framework for explaining model predictions based on the Shapley value in game theory. By summarizing previous methods (Lipovetsky & Conklin, 2001; Štrumbelj & Kononenko, 2014; Bach et al., 2015; Datta et al., 2016; Ribeiro et al., 2016; Shrikumar et al., 2017), SHAP builds an additive explanation model $g$ as

$$g(z) = \phi_0 + \sum_{i=1}^{M} \phi_i z_i,$$

where $M$ is the number of features, $z \in \{0, 1\}^M$ is the vector of simplified features in a binary feature space, and an attribution score $\phi_i$ is assigned to each participating feature by solving for Shapley values in a designed cooperative game. Although there has been a lot of research on SHAP's properties and applications in recent years (Sundararajan & Najmi, 2020; Frye et al., 2020; Zhang et al., 2020; Slack et al., 2020; Kumar et al., 2021; Jethani et al., 2021; Covert & Lee, 2021; Bento et al., 2021; Watson, 2022), here we mainly introduce KernelSHAP (Lundberg & Lee, 2017), which is most relevant to our work.

KernelSHAP. KernelSHAP is a model-agnostic explainer for local predictions which adopts the same objective function as the classic feature attribution method LIME (Ribeiro et al., 2016), shown in Eq. (2), while adjusting the choice of several settings to satisfy three desirable properties. Given a classifier $f$ and an input sample $x$, the objective can be solved using weighted linear regression with the loss function $L$ in Eq. (3), where $h_x$ is a mapping function that maps simplified features $z$ to the original input feature space, and $\pi_x$ is a weighting kernel. The solution $\phi = \{\phi_i \mid i \in \{1, 2, \ldots, M\}\}$ provides an estimate of the Shapley values for the input features.

$$\xi = \arg\min_{g} L(f, g, \pi_x) + \Omega(g) \tag{2}$$

$$L(f, g, \pi_x) = \sum_{z \in Z} \left[f(h_x(z)) - g(z)\right]^2 \pi_x(z) \tag{3}$$

KernelSHAP treats each feature independently and calculates the attribution score as the weighted sum of the feature's marginal contributions.
It works well for tabular data since there is less context information, and the calculation of marginal contribution could partly model the interactions among features. However, for the case of sequential data, various contextual information is embedded in the sequence. And when explaining sequences, KernelSHAP has to adopt a Monte Carlo sampling strategy which sacrifices the precision of Shapley values to reduce the computational complexity, since the number of features in a sequence is too large to build a power set and calculate marginal contributions for all features. Accordingly, abundant contextual information could be ignored, and the element-wise explanation by KernelSHAP is less reliable.
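To make the "weighted sum of marginal contributions" concrete, the following minimal sketch computes exact Shapley values by enumerating every coalition. It is an illustration only and is not KernelSHAP itself, which fits a weighted regression instead of enumerating; the names `exact_shapley` and `toy_model`, the linear toy classifier, and the all-zero background are assumptions introduced here.

```python
from itertools import combinations
from math import factorial
from typing import Callable, List, Sequence


def exact_shapley(
    f: Callable[[List[float]], float],
    x: Sequence[float],
    background: Sequence[float],
) -> List[float]:
    """Exact Shapley values of f at x; absent features take background values."""
    m = len(x)

    def value(present: frozenset) -> float:
        # Features outside the coalition are replaced by their background value.
        return f([x[i] if i in present else background[i] for i in range(m)])

    phi = [0.0] * m
    for i in range(m):
        others = [k for k in range(m) if k != i]
        for size in range(m):
            # Classic Shapley weight |S|! (M - |S| - 1)! / M!
            weight = factorial(size) * factorial(m - size - 1) / factorial(m)
            for subset in combinations(others, size):
                s = frozenset(subset)
                # Marginal contribution of feature i to coalition s.
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi


def toy_model(v: List[float]) -> float:
    # Hypothetical linear model used only for illustration.
    return 2.0 * v[0] + 3.0 * v[1] - v[2]


phi = exact_shapley(toy_model, [1.0, 2.0, 3.0], [0.0, 0.0, 0.0])
```

For this linear toy model each attribution equals the weight times the deviation from the background, and the attributions sum to the difference between the prediction and the background prediction. The inner loops visit $2^{M-1}$ coalitions per feature, so the enumeration becomes infeasible once every one of the $T \times M$ cells of a sequence is treated as a separate feature.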

3. SEQSHAP

In this section we introduce our method SeqSHAP which could provide more intuitive explanations for sequential predictions. Firstly, we discuss the motivation and advantage of explaining sequential predictions at the subsequence level. Then we propose a segmentation method to obtain reasonable subsequences from the input sequence. Finally, the process of generating subsequence level explanations with SeqSHAP is given.

3.1. EXPLAINING SEQUENTIAL PREDICTIONS AT SUBSEQUENCE LEVEL

A sequence is a stack of events that happened within a range of time. Interactions of features among neighbouring events often contain hidden patterns (e.g., continuously changing, alternating, and recurring fields), which can be captured by ML models and have a significant impact on the model predictions. We note that applying attribution methods to the individual cells of the sequence (i.e., the cell-level explanation) returns an importance matrix with the same shape as the input sequence,

$$G_{\text{cell}} = \begin{bmatrix} g_{1,1} & \cdots & g_{1,M} \\ g_{2,1} & \cdots & g_{2,M} \\ \vdots & \ddots & \vdots \\ g_{T,1} & \cdots & g_{T,M} \end{bmatrix},$$

which cannot explicitly model the effects of these interactions. In practical applications, it is also difficult for end-users to understand the prediction from such an importance matrix, since humans tend to make predictions by finding abnormal patterns rather than by inspecting single feature cells in sequences.

We are inspired by the concept of a session in recommender systems, which refers to several operations by a user over a short period of time. Since a session can represent the user's characteristics during this time period, we group related neighbouring events in the sequence as sessions and apply XAI methods to them to explain the sequential data. By splitting the sequence into several subsequences and calculating importance scores for each feature under each subsequence (i.e., the subsequence-level explanation), explanations are provided as

$$G_{\text{subseq}} = \begin{bmatrix} g_{1,1} & \cdots & g_{1,M} \\ g_{2,1} & \cdots & g_{2,M} \\ \vdots & \ddots & \vdots \\ g_{K,1} & \cdots & g_{K,M} \end{bmatrix},$$

where $K \ll T$ is the number of subsequences split from the input sequence, and $g_{k,i}$ is the importance score of the $i$-th feature field under the $k$-th subsequence. In this way, feature interactions between adjacent events are taken as a unit to be explained using attribution methods. The smaller explanation matrix also provides a clearer guide for end-users to focus on the important areas in the sequence.

3.2. A DISTRIBUTION-BASED SEGMENTATION METHOD

As mentioned above, we want these feature patterns to be explained as grouped units by splitting the sequence into several subsequences. The problem thus becomes how to split the sequence properly while ensuring that the events making up a pattern are grouped into the same subsequence. Simply splitting the sequence randomly or with a fixed window could easily separate related events and break the patterns, so that the resulting subsequences are meaningless. Here we suppose that the hidden patterns and contextual information can be viewed as a specific distribution of the features, and propose a distribution-based segmentation method to obtain reasonable subsequences from the sequence. We attempt to maximize the distribution discrepancy among adjacent subsequences, so that adjacent subsequences contain different context information.

Firstly, the events that happened within a specific time range are grouped as units $s_i$ ($1 \le i \le k$), and these units make up the initial set $S_{\text{init}}$ waiting to be segmented:

$$S_{\text{init}} = \{s_1, s_2, \ldots, s_k\} = \{\{e_1, \ldots, e_{n_1}\}, \{e_{n_1+1}, \ldots, e_{n_2}\}, \ldots, \{e_{n_{k-1}+1}, \ldots, e_{n_k}\}\},$$

$$\forall i \in [1, k]: \quad ts(e_{n_i}) - ts(e_{n_{i-1}+1}) \le w, \quad \delta_1 \le |s_i| \le \delta_2,$$

where $k$ is the number of grouped units, $w$ is the size of the time window, $ts(e)$ is the scaled timestamp feature of event $e$, and $\delta_1$ and $\delta_2$ limit the size of a unit. Then, we insert split points into $S_{\text{init}}$ gradually; in each round, the point that maximizes a metric function is chosen as the split point. The segmentation process is shown in Algorithm 1.

Algorithm 1 Distribution-based segmentation
Input: Initial set $S_{\text{init}}$, subsequence amount $K$, metric function $D$
1: $S \leftarrow \{S_{\text{init}}\} = \{\{s_1, s_2, \ldots, s_k\}\}$, split points $P \leftarrow \emptyset$
2: while $|S| < K$ do
3:     $d_{\max} \leftarrow 0$, $p \leftarrow 0$, $S_p \leftarrow \emptyset$
4:     for $i \leftarrow 1$ to $|S_{\text{init}}| - 1$ do
5:         if $i \notin P$ then
6:             $P' \leftarrow \text{Sort}(P + i)$    ▷ Add point $i$ to $P$ temporarily
7:             $S' \leftarrow \{\{s_1, \ldots, s_{P'[1]}\}, \{s_{P'[1]+1}, \ldots, s_{P'[2]}\}, \ldots, \{s_{P'[-1]+1}, \ldots, s_k\}\}$
8:             $d \leftarrow D(S')$    ▷ Calculate the metric
9:             if $d > d_{\max}$ then
10:                $p \leftarrow i$, $d_{\max} \leftarrow d$, $S_p \leftarrow S'$
11:            end if
12:        end if
13:    end for
14:    $P \leftarrow P + p$, $S \leftarrow S_p$    ▷ Update split points and segmented subsequences
15: end while
Output: Segmented subsequences $S$

The metric function $D$ we design is shown in Eq. (5), where the input $S_p$ is the set of subsequences obtained after the latest insertion at index $p$, $f_{\text{dist}}$ is a distance function measuring the distribution discrepancy between two subsequences in $S_p$ (e.g., MMD (Gretton et al., 2012) or KL-divergence), and $|s^p_i|$ and $|s^p_j|$ are used to limit the size of subsequences. The size $m$ of the measuring window determines how many neighbouring subsequences are included when calculating the distance to the current subsequence. Our purpose is to distinguish subsequences that follow different distributions, so that related events are captured in one unit. The segmentation process stops when the number of segmented subsequences reaches a given parameter $K < |S_{\text{init}}|$.

$$D(S_p) = \sum_{i=1}^{|S_p|-1} \; \sum_{j=\max(i-m,\,1)}^{\min(i+m,\,|S_p|)} \frac{f_{\text{dist}}(s^p_i, s^p_j)}{|s^p_i| \cdot |s^p_j|} \tag{5}$$
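The greedy procedure of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions rather than the implementation used in the experiments: events carry a single scalar feature, `mean_gap` is a hypothetical stand-in for $f_{\text{dist}}$ (the paper uses MMD or KL-divergence), the metric divides each distance by the product of the two subsequence sizes, the neighbour window is clipped to valid indices, and the best score starts at negative infinity so that every round selects a point.

```python
from typing import Callable, List, Sequence

Unit = List[float]  # one initial unit s_i: scalar feature values of its events


def mean_gap(a: Sequence[float], b: Sequence[float]) -> float:
    """Hypothetical stand-in for f_dist: absolute difference of the means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))


def metric(
    segments: Sequence[Sequence[float]],
    f_dist: Callable[[Sequence[float], Sequence[float]], float] = mean_gap,
    m: int = 1,
) -> float:
    """Size-normalised discrepancy between each segment and its neighbours."""
    total = 0.0
    n = len(segments)
    for i in range(n - 1):
        # Neighbours within the measuring window m, clipped to valid indices.
        for j in range(max(i - m, 0), min(i + m, n - 1) + 1):
            if j != i:
                total += f_dist(segments[i], segments[j]) / (
                    len(segments[i]) * len(segments[j])
                )
    return total


def merge(units: Sequence[Unit], points: Sequence[int]) -> List[List[float]]:
    """Concatenate the units between consecutive split points."""
    bounds = [0] + sorted(points) + [len(units)]
    return [
        [v for unit in units[a:b] for v in unit]
        for a, b in zip(bounds, bounds[1:])
    ]


def segment(
    units: Sequence[Unit],
    K: int,
    f_dist: Callable[[Sequence[float], Sequence[float]], float] = mean_gap,
    m: int = 1,
) -> List[int]:
    """Greedily add the split point that maximises the metric until K segments exist."""
    if not 1 <= K <= len(units):
        raise ValueError("K must lie between 1 and the number of units")
    points: List[int] = []
    while len(points) + 1 < K:
        best_point, best_score = -1, float("-inf")
        for i in range(1, len(units)):  # a split at i separates units[:i] from units[i:]
            if i in points:
                continue
            score = metric(merge(units, points + [i]), f_dist, m)
            if score > best_score:
                best_point, best_score = i, score
        points.append(best_point)
    return sorted(points)


# Toy input: the first unit differs clearly from the remaining five.
units = [[0.0, 0.0]] + [[10.0, 10.0] for _ in range(5)]
split_points = segment(units, K=2)
```

Each round evaluates every remaining candidate split, so the cost grows with both the number of initial units and $K$. Under the size normalisation assumed here, short subsequences are favoured, so the selected split need not coincide with a balanced partition.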

3.3. PROVIDING EXPLANATIONS WITH SEQSHAP

With a sequence divided into $K$ subsequences, the feature matrix of the sequence $X \in \mathbb{R}^{T \times M}$ can be rearranged as $X' \in \mathbb{R}^{K \times M}$. Each row of $X'$ corresponds to a subsequence containing several events, and $N = [n_1, n_2, \ldots, n_K]$ gives the number of events in each subsequence. The background values $B = [x_1, x_2, \ldots, x_M]$ are sampled from the average feature values in the dataset, and fill the absent features with uninformative values when computing Shapley values.

When explaining a sequence $X'$ directly using KernelSHAP, the explanation model $g^k$ is shown in Eq. (6). It takes the $K \cdot M$ feature units in $X'$ to build the coalition game, and with such a large feature space the Monte Carlo sampling strategy used to approximate the Shapley values incurs a noticeable loss of precision, as mentioned in Subsection 2.2. To reduce the computational cost and the loss of precision, our method SeqSHAP calculates the subsequence-level Shapley values in two stages. In the first stage, we build a feature-level explanation model $g^f$, as in TimeSHAP (Bento et al., 2021), shown in Eq. (7), where each feature field of the sequence is taken as a unit and the feature-level Shapley values $\phi^f \in \mathbb{R}^M$ are calculated with KernelSHAP.

$$f(h_{X'}(z)) \approx g^k(z^k) = \phi^k_0 + \sum_{i=1}^{K} \sum_{j=1}^{M} \phi^k_{i,j} z^k_{i,j}. \tag{6}$$

$$f(h_{X'}(z)) \approx g^f(z^f) = \phi^f_0 + \sum_{j=1}^{M} \phi^f_j z^f_j. \tag{7}$$

The simplified features $z^f \in \{0, 1\}^M$ in Eq. (7) can be treated as a coalition of $z^k$ in Eq. (6), i.e., $z^f_j = 0$ is equivalent to $\forall i \in [1, K],\ z^k_{i,j} = 0$, since each column of $X'$ is taken as a unit in $g^f$. Moreover, $\phi^k_0 = f(h_{X'}(z^k_0)) \approx f(h_{X'}(z^f_0)) = \phi^f_0$, since $z^k_0$ is equivalent to $z^f_0$ when all features are absent. Therefore, the feature-level explanation $\phi^f$ is actually an estimate of the sum of the corresponding features' Shapley values in KernelSHAP:

$$\phi^f_j \approx \sum_{i=1}^{K} \phi^k_{i,j}, \quad 1 \le j \le M. \tag{8}$$

The second stage of SeqSHAP is shown in Algorithm 2. Given the feature-level explanations $\phi^f$, we traverse the $M$ feature fields of $X'$ to provide subsequence-level explanations. For the $j$-th field, the explanation model $g^{seq}_j$ is built with the candidate feature set $S_j = \{x'_{1,j}, x'_{2,j}, \ldots, x'_{K,j}\}$, where $x'_{i,j}$ represents the $i$-th subsequence of the $j$-th feature field. The simplified feature $z^{seq}_j \in \{0, 1\}^K$ indicates the presence of the subsequences in $S_j$, and the mapping function $h_{x'}$ is defined to map $z^{seq}_j$ to the original feature space:

$$h_{x'}(z^{seq}_j) = \begin{bmatrix} x'_{1,1} & \cdots & h_{x'}(z^{seq}_{1,j}) & \cdots & x'_{1,M} \\ x'_{2,1} & \cdots & h_{x'}(z^{seq}_{2,j}) & \cdots & x'_{2,M} \\ \vdots & & \vdots & & \vdots \\ x'_{K,1} & \cdots & h_{x'}(z^{seq}_{K,j}) & \cdots & x'_{K,M} \end{bmatrix}, \quad h_{x'}(z^{seq}_{i,j}) = \begin{cases} x'_{i,j} & \text{if } z^{seq}_{i,j} = 1 \\ [x_j]_{n_i} & \text{if } z^{seq}_{i,j} = 0 \end{cases}. \tag{9}$$

Thus, when the $j$-th feature is being explained, the other features in the mapping result keep their original input values, and the simplified features $z^{seq}_j$ determine whether each subsequence under the $j$-th feature is retained with its original input values or replaced by uninformative feature values. Notably, each element $x'_{i,j}$ in the matrix represents a subsequence of the $j$-th feature of length $n_i$, and an element is replaced by filling in $n_i$ uninformative values. Afterwards, KernelSHAP is applied to solve the explanation model

$$g^{seq}_j(z^{seq}_j) = \phi^{seq}_{0,j} + \sum_{i=1}^{K} \phi^{seq}_{i,j} z^{seq}_{i,j}, \quad \phi^{seq}_{0,j} = \phi^f_0 + \sum_{i \ne j} \phi^f_i. \tag{10}$$

The definition of $\phi^{seq}_{0,j}$ is based on Eq. (6) and Eq. (7): since the other feature fields are taken as the background and do not change while coalitions are sampled, their effect should be added to the bias term of the explanation model. Hence the calculated explanations satisfy the property in Eq. (11), which maintains the consistency of the explanation among features. Finally, the subsequence-level explanation $\phi$ is obtained by repeating the process $M$ times, where each element $\phi^{seq}_j \in \mathbb{R}^K$ contains the Shapley values of the $K$ subsequences under the $j$-th feature.

$$\sum_{i=1}^{K} \phi^{seq}_{i,j} \approx \phi^f_j. \tag{11}$$

Algorithm 2 Subsequence-level explanation
Input: sequence $X' \in \mathbb{R}^{K \times M}$, classifier $f$, feature-level attributions $\phi^f$
1: for $j \leftarrow 1$ to $M$ do
2:     $S_j \leftarrow [x'_{1,j}, x'_{2,j}, \ldots, x'_{K,j}]$    ▷ Candidate feature set
3:     $z^j \leftarrow [z^j_1, z^j_2, \ldots, z^j_K] \in \{0, 1\}^K$    ▷ Simplified features
4:     $h_x(z^j) \leftarrow$ Eq. (9)    ▷ Mapping function
5:     $g(z^j) \leftarrow$ Eq. (10)    ▷ Explanation model
6:     $\phi^j = [\phi^j_1, \phi^j_2, \ldots, \phi^j_K] \leftarrow \text{KernelSHAP}(f, g, S_j, z^j, h_x)$
7: end for
Output: $\phi = [\phi^1, \phi^2, \ldots, \phi^M]$

By calculating Shapley values in two stages, SeqSHAP significantly reduces the size of the feature space compared to applying KernelSHAP directly to the sequence $X'$. Consequently, SeqSHAP requires fewer perturbed samples to calculate the Shapley values and can provide subsequence-level explanations for sequence data more precisely and at lower computational cost.
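The second stage can be illustrated with a short sketch of the mapping function in Eq. (9) followed by a per-field attribution loop. Two departures from the paper are assumptions made for brevity: the Shapley values are estimated by permutation sampling rather than by KernelSHAP's weighted regression, and the data layout (`grid[i][j]` holding the values of feature field `j` inside subsequence `i`) together with the names `mask_field`, `field_attributions`, and `toy_model` are hypothetical.

```python
import random
from typing import Callable, List, Sequence

Subseq = List[float]       # values of one feature field inside one subsequence
Grid = List[List[Subseq]]  # grid[i][j]: subsequence i, feature field j


def mask_field(x: Grid, j: int, z: Sequence[int], background: Sequence[float]) -> Grid:
    """Mapping of Eq. (9): keep x[i][j] where z[i] == 1, else fill n_i background values."""
    out = [[list(cell) for cell in row] for row in x]
    for i, keep in enumerate(z):
        if not keep:
            out[i][j] = [background[j]] * len(x[i][j])
    return out


def field_attributions(
    f: Callable[[Grid], float],
    x: Grid,
    j: int,
    background: Sequence[float],
    n_perm: int = 200,
    seed: int = 0,
) -> List[float]:
    """Permutation-sampling estimate of the K subsequence attributions of field j."""
    rng = random.Random(seed)
    k = len(x)
    phi = [0.0] * k
    for _ in range(n_perm):
        order = list(range(k))
        rng.shuffle(order)
        z = [0] * k
        prev = f(mask_field(x, j, z, background))
        for i in order:
            z[i] = 1  # reveal subsequence i of field j
            cur = f(mask_field(x, j, z, background))
            phi[i] += cur - prev
            prev = cur
    return [p / n_perm for p in phi]


def toy_model(grid: Grid) -> float:
    # Hypothetical additive classifier score used only for illustration.
    return float(sum(v for row in grid for cell in row for v in cell))


# K = 3 subsequences of lengths 2, 1, 3 and M = 2 feature fields.
x = [
    [[1.0, 2.0], [0.0, 0.0]],
    [[5.0], [1.0]],
    [[3.0, 3.0, 3.0], [2.0, 2.0, 2.0]],
]
background = [1.0, 0.0]
phi_field0 = field_attributions(toy_model, x, j=0, background=background)
```

Because each permutation telescopes from the fully masked field to the original input, the attributions of field $j$ sum exactly to the change in prediction caused by masking that field, which mirrors the consistency property of Eq. (11). Only $K$ units are perturbed at a time, instead of the $K \cdot M$ units of the direct approach.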

4. EXPERIMENTS

To evaluate our method, experiments on two online transaction datasets from an e-commerce platform in real world are carried out. The task of XAI methods is to provide local explanations for sequence-based fraud detection models, which are used to help end-users understand the model predictions, as described in Subsection 2.1.

4.1. EXPERIMENTAL SETUP

Our datasets are collected from a large-scale e-commerce platform and consist of approximately 1.1M (Dataset A) and 1.6M (Dataset B) samples, respectively. Each sample $X$ in these datasets is tabular and corresponds to a sequence of one user's historical operation records on the platform, ordered by time, as formulated in Subsection 2.1. An operation record, which is called an event here, includes $M$ features that describe the details of the event. The details of the datasets are shown in Table 1. For privacy considerations, the feature names are encoded into several types according to the content they describe (e.g., location, time, account, description of transactions). For Dataset A, a given classifier $f_A$ predicts whether the last event of the sequence, $e_T$, is a fraudulent transaction, based on the historical events $e[1 : T-1]$. For Dataset B, the classifier $f_B$ predicts whether the user is a fraudster based on the recent operation records $e[1 : T]$. In our experiments, we choose RNN-based models as the classifiers for the prediction tasks, each built with an embedding layer, two LSTM layers and several feed-forward layers. Both models $f_A$ and $f_B$ are well fitted on their respective datasets. Additionally, the detailed parameter settings of our segmentation method can be found in Appendix A.

4.2. COMPARATIVE EXPERIMENTS

Metric of Feature Attributions. In our quantitative experiments, we use feature removal experiments to evaluate the local explanations produced by different feature attribution methods. Since the ground truth of an explanation is hard to obtain, we remove the top $\alpha\%$ elements in the ordered explanations and observe the change in model predictions. A large change in the predicted score means that the explanation does capture the features that are important for model predictions. We define the metric $Dr_\alpha$ as in Eq. (12), where $x$ is the original input and $x'$ is the perturbed input obtained by replacing the top $\alpha\%$ elements in the explanation $\phi$ with uninformative feature values.

$$Dr_\alpha = \frac{1}{N} \sum_{i=1}^{N} \frac{|f(x') - f(x)|}{f(x)}. \tag{12}$$

Experiments on Segmentation Methods. We compare the performance of different segmentation methods to show that our distribution-based method can capture related events and generate reasonable subsequences. The subsequence-level attributions are calculated as local explanations with the subsequences obtained from the different segmentation methods, and the metric $Dr_\alpha$ is used to compare the performance of the explanations. The results are shown in Table 2, where the segmentation method Uniform splits the input sequence with a fixed window of size $\lfloor T/K \rfloor$, and Random splits the sequence into $K$ subsequences randomly. Ours(KL) and Ours(MMD) are our distribution-based method using KL-divergence and MMD distance as the distance function $f_{\text{dist}}$, respectively. We choose the drop rate $\alpha$ from $\{1\%, 1.5\%, 2\%, 2.5\%\}$ and calculate the mean change of model predictions after removing the top $\alpha\%$ elements in the explanations. The results show that the subsequences split by our distribution-based segmentation method are more reasonable for subsequence-level explanations: the contextual information that is helpful for the model predictions is better grouped into subsequences, and removing the top elements brings significant changes to the model predictions.
Experiments on Local Explanations. To show that the subsequence-level explanations from SeqSHAP outperform the element-wise explanations from existing feature attribution methods, we choose two popular feature attribution methods, KernelSHAP (Lundberg & Lee, 2017) and LIME (Ribeiro et al., 2016), as our baselines. We compare the explanations produced by SeqSHAP and the baselines using the metric $Dr_\alpha$ described above. Since both baseline methods provide element-wise explanations, for fairness we drop the top $\lceil T \cdot M \cdot \alpha\% \rceil$ elements in the explanations from the baselines and the top $\lceil K \cdot M \cdot \alpha\% \rceil$ elements in those from SeqSHAP when calculating $Dr_\alpha$. The number of perturbed samples for KernelSHAP and LIME is set to 64K as a trade-off between efficiency and precision. As shown in Table 3, the performance of KernelSHAP and LIME is similar, since they apply the same objective function (Eq. (2)) with different kernel weights for perturbed samples (Eq. (3)). Our method outperforms the baselines in most cases on the two datasets, which means that the important feature subsequences are accurately captured and assigned higher attribution scores by SeqSHAP, and that removing them makes a large difference to the model predictions.
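The drop-rate metric $Dr_\alpha$ can be sketched as follows. This is a simplified illustration under stated assumptions: every cell of an explained input is a single scalar, cells are ranked by their signed attribution score in descending order (the text does not specify whether signed or absolute scores are used), and the names `top_cells`, `drop_rate`, and `toy_score` are hypothetical.

```python
from math import ceil
from typing import Callable, List, Sequence, Tuple

Matrix = List[List[float]]


def top_cells(phi: Matrix, alpha_pct: float) -> List[Tuple[int, int]]:
    """Indices of the top ceil(rows * cols * alpha%) cells by attribution score."""
    cells = [(i, j) for i, row in enumerate(phi) for j in range(len(row))]
    n_drop = ceil(len(cells) * alpha_pct / 100)
    cells.sort(key=lambda c: phi[c[0]][c[1]], reverse=True)
    return cells[:n_drop]


def drop_rate(
    f: Callable[[Matrix], float],
    samples: Sequence[Matrix],
    explanations: Sequence[Matrix],
    background: Sequence[float],
    alpha_pct: float,
) -> float:
    """Mean relative change of the prediction after removing the top cells."""
    total = 0.0
    for x, phi in zip(samples, explanations):
        perturbed = [list(row) for row in x]
        for i, j in top_cells(phi, alpha_pct):
            perturbed[i][j] = background[j]  # uninformative value for field j
        total += abs(f(perturbed) - f(x)) / f(x)
    return total / len(samples)


def toy_score(x: Matrix) -> float:
    # Hypothetical positive-valued score used only for illustration.
    return sum(sum(row) for row in x)


x = [[1.0, 2.0], [3.0, 4.0]]
phi = [[0.1, 0.2], [0.9, 0.3]]
dr = drop_rate(toy_score, [x], [phi], background=[0.0, 0.0], alpha_pct=25)
```

In this toy case one of the four cells is removed, the score falls from 10 to 7, and the metric is 0.3. The same function applies to element-wise explanations by passing $T \times M$ matrices instead of $K \times M$ ones.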

4.3. CASE STUDY

In this section, we analyze several local explanation cases with sequence samples to show the effect of the subsequence-level explanations from SeqSHAP. Firstly, we choose a positive sample from Dataset A whose latest event $e_T$ is fraudulent and which is predicted correctly with high confidence by the RNN-based classifier $f_A$. We generate local explanations for the sample by applying KernelSHAP and SeqSHAP separately. As shown in Figure 1, we visualize the feature attributions $\phi^{kernel} \in \mathbb{R}^{T \times M}$ and $\phi^{seq} \in \mathbb{R}^{K \times M}$, where $T$ is the length of the sequence and $K$ is the number of subsequences obtained with our segmentation method. For simplicity, we drop the feature fields in which all cells have an absolute attribution score smaller than a threshold $\delta_a$, and mask the cells whose absolute attribution score $\phi_{i,j}$ is smaller than another threshold $\delta_b$ (the grey cells in Figure 1). The darker a cell in the heatmaps, the larger its attribution score and the more important it is for the model prediction. It is clear that the element-wise explanation $\phi^{kernel}$ in Figure 1(a) is not intuitive for users, given so many events and feature fields, even after the less important elements have been dropped. In the subsequence-level explanation $\phi^{seq}$ in Figure 1(b), each cell represents the importance of a subsequence that includes several related events under a feature field. With fewer subsequences ($K \ll T$), it is more intuitive for users to understand which parts of the sequence lead the classifier to make such a prediction. Our method SeqSHAP provides explanations at the subsequence level, and the feature attribution results $\phi^{seq} \in \mathbb{R}^{K \times M}$ assign an importance score to each subsequence for each feature field. By analysing the distribution of attributions along different axes, higher-level explanations can be obtained.
Figure 2 visualizes the higher-level explanations in two different views, capturing the importance of subsequences and of features, respectively. As mentioned in Subsection 3.1, SeqSHAP explains a sequence at the subsequence level to capture the contextual information and hidden patterns contained in the related neighbouring events of the sequence. We analyze the elements in the explanation $\phi^{seq}$ with large attribution scores and summarize the input features corresponding to $\phi^{seq}_{i,j}$. As shown in Figure 3, the top elements with the largest Shapley values are labeled with a summary of the corresponding features in the sequence sample. Multiple abnormal patterns are discovered in the 10-th subsequence, including several transactions with large amounts, variable locations and uncommon operations. Other subsequences with patterns such as transactions at midnight and repeated failed transactions are also assigned high scores. Indeed, our segmentation method can provide reasonable subsequences that include explicit feature patterns, and SeqSHAP is applied to find the subsequences that are important for model predictions.

5. CONCLUSION

For security and trust considerations in the real world, the increasing need for explainable AI promotes the study of post-hoc feature attribution methods. However, existing XAI methods mostly overlook interpretability in sequential scenarios, which have a wide range of applications. The widely used element-wise explanations, which assign importance scores to all feature cells in a sequence, are not intuitive for end-users and can cause a large reduction in the precision of explanations. In this work, we propose SeqSHAP to explain sequential predictions at the subsequence level, a unique view for feature attribution methods. We provide a distribution-based segmentation method to obtain reasonable subsequences that capture the hidden patterns and contextual information among neighbouring events. Through evaluating the explanations on real-world online transaction datasets, SeqSHAP is shown to generate reliable Shapley value explanations for sequential data. With the user studies looking into the explanations and input features, the subsequence-level explanations are confirmed to be aligned with human concepts and can help users find the abnormal patterns in the sequence that significantly influence the model predictions.



Figure 1: Local explanations with KernelSHAP and SeqSHAP

Figure 2(a) shows the importance of subsequences, where the Shapley values of the features under the i-th subsequence are plotted along the y-axis, which can help locate the abnormal subsequences. Figure 2(b) provides a feature-level importance explanation, which can help identify the influential features in the sequence.

Figure 2: Two higher level explanations

Figure 3: Semantic patterns included in the important subsequences with higher Shapley values.

Table 1: Dataset details

Table 2: $Dr_\alpha$ of different segmentation methods (columns: $Dr_{1\%}$, $Dr_{1.5\%}$, $Dr_{2\%}$, $Dr_{2.5\%}$, in two groups)

Table 3: $Dr_\alpha$ of different attribution methods (columns: $Dr_{1\%}$, $Dr_{1.5\%}$, $Dr_{2\%}$, $Dr_{2.5\%}$, in two groups)

A PARAMETER SETTINGS OF THE SEGMENTATION METHOD

In the first step, we divide the input sequence into the initial units $S_{\text{init}}$ defined in Subsection 3.2. The size of the time window is defined as $w = \max\left(\frac{ts(e_0)}{100}, 0.01\right)$, where $ts(e_0)$ is the scaled time interval between the earliest event and the latest event. We set $\delta_1 = 3$ and $\delta_2 = 8$ to limit the size of the initial units. Maximum mean discrepancy (MMD) (Gretton et al., 2012) is chosen as the distance function $f_{\text{dist}}$ to split the sequence (Eq. (5)), and the embedding representations of events obtained from the given model $f$ are used to calculate the distribution distance. The number of sessions $K$ is defined as $\min\left(\max\left(10, \frac{|S_{\text{init}}|}{2}\right), 30\right)$, as a balance between the precision and efficiency of computing Shapley values.
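As a quick check of these settings, the two formulas can be written as small helper functions. The function names are hypothetical, and the floor division used for $|S_{\text{init}}|/2$ is an assumption, since the rounding rule is not stated.

```python
def time_window(ts_span: float) -> float:
    """Window size w = max(ts(e_0) / 100, 0.01) for a scaled time span ts(e_0)."""
    return max(ts_span / 100, 0.01)


def num_subsequences(n_units: int) -> int:
    """Number of sessions K = min(max(10, |S_init| / 2), 30)."""
    # Floor division is an assumption; the rounding of |S_init| / 2 is not specified.
    return min(max(10, n_units // 2), 30)


k_small = num_subsequences(12)    # lower bound applies
k_mid = num_subsequences(50)      # half of the initial units
k_large = num_subsequences(100)   # upper bound applies
```

With these settings $K$ always lies between 10 and 30, so the number of units per stage-two game stays small regardless of the sequence length.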

