

Abstract

Selective rationalization improves the explainability of neural networks by selecting a subsequence of the input (i.e., a rationale) to explain the prediction results. Although existing methods have achieved promising results, they still suffer from adopting spurious correlations in data (aka. shortcuts) to compose rationales and make predictions. Inspired by causal theory, in this paper we develop an interventional rationalization (Inter-RAT) method to discover causal rationales. Specifically, we first analyse the causalities among the input, rationales and results with a structural causal model. Then, we discover the spurious correlations between the input and rationales, and between rationales and results, respectively, by identifying the confounder in these causalities. Next, based on the backdoor adjustment, we propose a causal intervention method to remove the spurious correlations between the input and rationales. Further, we discuss why spurious correlations between the selected rationales and results exist by analysing the limitations of the sparsity constraint in rationalization, and employ the causal intervention method to remove these correlations. Extensive experimental results on three real-world datasets clearly validate the effectiveness of our proposed method.

Under review as a conference paper at ICLR 2023

(Figure 1: a charge-prediction example. The selector extracts a rationale from the case fact and the predictor outputs the charge Manslaughter; the fact describes a defendant who punched the victim during a fight over a trivial matter, causing the victim to fall, hit his head and suffer fatal injuries, after which the defendant immediately tried to resuscitate him, but he died after being sent to hospital.)

1. INTRODUCTION

The remarkable success of deep neural networks (DNNs) in natural language understanding tasks has prompted interest in how to explain the results of DNNs. Among explanation tasks, selective rationalization Lei et al. (2016); Yu et al. (2019; 2021) has received increasing attention, answering the question "Which features have a significant impact on the prediction results of the model?". Specifically, the goal of selective rationalization is to extract a small subset of the input (i.e., the rationale) to support and explain the prediction results while yielding them. Existing methods often generate rationales with a conventional framework consisting of a selector (aka. rationale generator) and a predictor Lei et al. (2016). As shown in Figure 1, given the input X, the selector and the predictor generate rationales R and prediction results Y cooperatively (i.e., P(Y|X) = P(Y|R)P(R|X)). The selector (P(R|X)) first extracts a subsequence of the input; then, the predictor (P(Y|R)) yields results based only on the selected tokens, and this selected subsequence is defined as the rationale. Despite the appeal of rationalization methods, current implementations are prone to exploit spurious correlations (aka. shortcuts) between the input and labels to yield the prediction results and select the rationales Chang et al. (2020); Wu et al. (2022). We illustrate this problem with an example from charge prediction. Consider Figure 1: although this case corresponds to Manslaughter, a DNN model readily predicts the charge as Intentional homicide. Specifically, since Intentional homicide occurs more frequently than Manslaughter and is often accompanied by tokens denoting violence and death, DNNs do not need to learn the real correlations between the case facts and the charge to yield the result.
Instead, it is much easier to exploit spurious correlations in the data to achieve high accuracy (i.e., predicting the charge as Intentional homicide directly upon identifying the tokens about violence and death). As a result, when facing cases such as the example in Figure 1, the effectiveness of such DNNs tends to degrade (e.g., the underlined tokens in Figure 1, denoting that the offence is negligent, will be ignored during rationale extraction and the charge will be misjudged). Therefore, DNNs that depend on spurious correlations in data fail to reveal the truly critical subsequence for predicting labels. To address this, Chang et al. (2020) propose an environment-invariant method (INVRAT) to discover causal rationales. They argue that causal rationales should remain stable as the environment shifts, while the spurious correlations between input and labels vary. Although this method performs well in selecting rationales, since the environment in rationalization is hard to observe and obtain, we argue that this "causal pattern" can be further explored to improve rationalization. Along this research line, in this paper, we propose an interventional rationalization (Inter-RAT) method that removes the spurious correlations via causal intervention Glymour et al. (2016). Specifically, motivated by causal inference theory, we first formulate the causal relationships among X, R and Y in a Structural Causal Model (SCM) Pearl et al. (2000); Glymour et al. (2016), as shown in Figure 2(a). Then, we identify the confounder C in this SCM, which opens two backdoor paths X ← C → R and R ← C → Y, making X and R, and R and Y, spuriously correlated. Next, we address these correlations, respectively. For the spurious correlations between X and R, we assume the confounder is observed and intervene on X (i.e., calculating P(R|do(X)) instead of P(R|X)) to block the backdoor path and remove the spurious correlations based on the backdoor adjustment Glymour et al.
(2016). Here, the do-operation denotes the pursuit of the real causality from X to R. For the spurious correlations between R and Y, by the definition of R (rationales are the only basis for yielding prediction results), there should be no spurious correlations between R and Y. However, in practice, we discover that the sparsity constraint commonly defined in rationalization Lei et al. (2016); Cao et al. (2020); Chang et al. (2020); Yu et al. (2019), which ensures that the selector extracts short rationales, results in spurious correlations between R and Y. Therefore, we further analyse this discovery and employ the causal intervention to remove these correlations.

2. THE CONVENTIONAL FRAMEWORK OF RATIONALIZATION

This section formally defines the problem of rationalization, and then presents the details about the conventional rationalization framework consisting of the selector and predictor, where these two components are trained cooperatively to generate rationales and yield the prediction results.

2.1. PROBLEM FORMULATION

Considering a text classification task, only the text input X = {x_1, x_2, ..., x_n}, where x_i denotes the i-th token, and the discrete ground-truth label Y are observed during training, while the rationale R is unavailable. The goal of selective rationalization is to first adopt the selector to learn a binary mask variable M = {m_1, m_2, ..., m_n}, where m_j ∈ {0, 1}, and select a subsequence of the input R = M ⊙ X = {m_1 · x_1, m_2 · x_2, ..., m_n · x_n}, and then employ the predictor to re-encode the masked input R to yield the results. Formally, the whole process of rationalization is defined as:

$$P(Y|X) = \underbrace{P(Y|R)}_{\text{predictor}} \underbrace{P(R|X)}_{\text{selector}}. \quad (1)$$
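As a minimal numerical sketch of the masking step R = M ⊙ X (the embeddings and mask values are illustrative stand-ins, not the model's learned representations):

```python
import numpy as np

# Hypothetical token embeddings for an n-token input (all names illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 4))            # n = 6 tokens, d = 4 dimensions

# Binary mask M in {0, 1}^n, as produced by the selector.
M = np.array([1, 0, 1, 1, 0, 0])

# Rationale R = M ⊙ X: masked-out tokens become zero vectors, so the
# predictor P(Y|R) only "sees" the selected subsequence.
R = M[:, None] * X
```

In the actual framework the mask is sampled per token rather than fixed; the point here is only the element-wise gating.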

2.2. SELECTOR

The selector divides the process of generating rationales into three steps. First, the selector samples each binary value m_j from the probability distribution P(M|X) = {p_1, p_2, ..., p_n}, where p_j represents the probability of selecting each x_j as part of the rationale. Specifically, p_j is calculated as p_j = P(m_j|x_j) = softmax(W_e f_e(x_j)), where the encoder f_e(·) encodes the token x_j into a d-dimensional vector and W_e ∈ R^{2×d}. Then, to ensure the sampling operation is differentiable, several reparameterization tricks have been proposed, such as the policy gradient Lei et al. (2016) and HardKuma Bastings et al. (2019). In this paper, we adopt the Gumbel-softmax method Jang et al. (2017) to achieve this goal:

$$m_j = \frac{\exp\left((\log(p_j) + g_j)/\tau\right)}{\sum_t \exp\left((\log(p_t) + g_t)/\tau\right)}, \quad (2)$$

where τ is a temperature hyperparameter, g_j = -log(-log(u_j)) and u_j is randomly sampled from the uniform distribution U(0, 1). Finally, the rationale can be selected as R = M ⊙ X = {m_1 · x_1, m_2 · x_2, ..., m_n · x_n}. Therefore, the probability of generating rationales P(R|X) is calculated as P(R|X) = P(M ⊙ X|X) ≡ P(M|X).
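A small sketch of the Gumbel-softmax sampling in Eq (2), assuming two classes per token (select / not select); the input probabilities are illustrative stand-ins for p_j:

```python
import numpy as np

def gumbel_softmax(log_p, tau=1.0, rng=None):
    """Soft, differentiable sample from a token-selection distribution."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-9, 1.0, size=log_p.shape)  # u_j ~ U(0, 1)
    g = -np.log(-np.log(u))                       # Gumbel(0, 1) noise
    y = (log_p + g) / tau
    e = np.exp(y - y.max())                       # numerically stable softmax
    return e / e.sum()

# Illustrative selection probabilities for one token: (select, not select).
m = gumbel_softmax(np.log(np.array([0.8, 0.2])), tau=0.5,
                   rng=np.random.default_rng(0))
m_j = m[0]   # soft "selected" score; hardens towards {0, 1} as tau -> 0
```

Lowering τ makes the soft sample approach a one-hot vector, which is why the relaxation can stand in for discrete mask sampling during training.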

2.3. PREDICTOR

Based on the selected rationale tokens R, the predictor outputs the prediction results (i.e., calculates P(Y|R) = P(Y|M ⊙ X)), and R can then be seen as an explanation of Y. Specifically, after obtaining R from the selector, we adopt the neural network f_p(·) to re-encode the rationale into d-dimensional continuous hidden states to yield the results. The objective of the predictor is defined as:

$$\mathcal{L}_{task} = \mathbb{E}_{X,Y \sim D_{tr},\, M \sim P(M|X)}\left[\ell(Y, W_p f_p(M \odot X))\right], \quad (3)$$

where D_tr denotes the training set, ℓ(·) represents the cross-entropy loss function, W_p ∈ R^{N×d} is a trained parameter and N is the number of labels (e.g., N = 2 in binary classification).
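The predictor objective can be sketched for a single example as follows; `R_repr` is a hypothetical stand-in for the re-encoded rationale f_p(M ⊙ X), and all weights are random placeholders:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def predictor_loss(R_repr, W_p, y):
    """Cross-entropy loss for a single example.

    R_repr: stand-in for f_p(M ⊙ X), the re-encoded rationale (d-dim).
    W_p:    (N, d) output projection for N labels.
    y:      gold label index.
    """
    probs = softmax(W_p @ R_repr)
    return -np.log(probs[y])

rng = np.random.default_rng(0)
loss = predictor_loss(rng.normal(size=4), rng.normal(size=(2, 4)), y=1)
```

The training objective averages this loss over the training set and over masks sampled from the selector.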

2.4. SPARSITY AND CONTINUITY CONSTRAINTS

Since an ideal rationale should be a short and coherent part of the original input, we add the sparsity and continuity constraints Lei et al. (2016); Chang et al. (2020):

$$\mathcal{L}_{re} = \lambda_1 \left| \alpha - \frac{1}{n}\sum_{j=1}^{n} m_j \right| + \lambda_2 \sum_{j=2}^{n} |m_j - m_{j-1}|, \quad (4)$$

where the first term encourages the model to select short rationales and α is a predefined sparsity level on the scale [0, 1], and the second term ensures the coherence of the selected tokens. Finally, the overall objective of rationalization is defined as L = L_task + L_re.
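The two constraint terms can be sketched as follows (λ_1, λ_2 and the mask values are illustrative):

```python
import numpy as np

def rationale_regularizer(m, alpha, lam1=1.0, lam2=1.0):
    """Sparsity + continuity constraints on a (soft) mask m in [0, 1]^n."""
    sparsity = np.abs(alpha - m.mean())       # keep roughly alpha of the tokens
    continuity = np.abs(np.diff(m)).sum()     # penalise fragmented selections
    return lam1 * sparsity + lam2 * continuity

# A contiguous 50% selection: sparsity term is 0, continuity term counts
# one 0->1 and one 1->0 transition, so the total is 2.0.
m = np.array([0., 1., 1., 1., 0., 0.])
reg = rationale_regularizer(m, alpha=0.5)
```

A scattered mask with the same mean would pay the same sparsity cost but a higher continuity cost, which is exactly what pushes the selector towards coherent spans.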

3. INTERVENTIONAL RATIONALIZATION

In this section, we first reveal how the confounder C causes spurious correlations in rationalization with a causal graph. Then, we remove these correlations by using a causal intervention method.

3.1. STRUCTURAL CAUSAL MODEL

As shown in Figure 2(a), we formulate the causalities among the text input X, rationale R, ground-truth label Y and the confounder C with a Structural Causal Model (SCM) Pearl et al. (2000); Glymour et al. (2016), where a link between two variables represents a causal relationship. In the following, we introduce the causal graph at a high level. C → X: the confounder C in rationalization can be seen as the context prior, determining which tokens can be "put" into the text input X. For example, in Figure 1, the context prior decides where the tokens denoting violence and manslaughter appear, and also where other, meaningless tokens appear. In practice, the confounder is commonly only partially observed (e.g., in text classification, we consider the entire label set as the partially observed confounder; see section 3.2 for details). From the graph, we find that X and R, and R and Y, are confounded by the context prior C through two backdoor paths: X ← C → R (or X ← C → K → R for elaboration) and R ← C → Y (or R ← C → H → Y). These backdoor paths result in spurious correlations among the text input X, rationale R, and label Y. Based on this, we propose a causal intervention method to remove the confounding effect by cutting off the links C → X and C → R, respectively.

3.2. CAUSAL INTERVENTION VIA BACKDOOR ADJUSTMENT

To pursue the real causality from X to R (or from R to Y), we adopt the causal intervention P(R|do(X)) instead of P(R|X) (or P(Y|do(R)) instead of P(Y|R)) to remove the effect of the confounder C. Next, we introduce the causal intervention method by taking P(R|do(X)) as an example; P(Y|do(R)) is similar. Specifically, since adopting a randomized controlled trial to intervene on X is impossible, as it would require control over the causal features, we apply the backdoor adjustment Glymour et al. (2016) to achieve P(R|do(X)) by cutting off C → X (Figure 2(c)):

$$P(R|do(X)) = \sum_{i=1}^{|C|} P(R|X, K = g_s(X, c_i))\, P(c_i), \quad (5)$$

where the confounder C is stratified into pieces C = {c_1, c_2, ..., c_{|C|}}, P(c_i) denotes the prior distribution of c_i, which is calculated before training, and g_s(·) is a function achieving X → K ← C. However, the confounder C is commonly hard to observe. Fortunately, based on existing research Wang et al. (2020); D'Amour (2019), we can consider the entire label set as the partially observed children of the unobserved confounder. Therefore, we approximate it by designing a dictionary D_c = {c_1, c_2, ..., c_N} as an N × d matrix, where N represents the number of labels and d is the hidden feature dimension. As described in section 2.2, P(R|X) ≡ P(M|X), and therefore P(R|do(X)) ≡ P(M|do(X)). Specifically, the probability of each token x_j being selected as part of the rationale is calculated as:

$$P(m_j|do(X)) = \sum_{i=1}^{N} P(m_j|f_s(x_j, k_i))\, P(c_i) = \sum_{i=1}^{N} \mathrm{softmax}(f_s(x_j, k_i))\, P(c_i), \quad (6)$$

where f_s(·) is the function achieving X → R ← K, and k_i ∈ K is defined as the context-specific representation using the context prior c_i; we express it as k_i = g_s(x_j, c_i) = λ_i c_i, where λ_i ∈ λ and λ ∈ R^N is the set of normalized similarities between x_j and each c_i in the confounder set C (i.e., λ = softmax(f_e(x_j) D_c^T)).
Besides, since Eq (6) requires sampling over C, which is expensive, we seek an easily computed alternative to approximate it. Empirically, based on the results in Xu et al. (2015); Wang et al. (2020); Yue et al. (2020), we can adopt the NWGM approximation to move the outer sum into the softmax (i.e., P(m_j|do(X)) ≈ softmax(Σ_{i=1}^{N} f_s(x_j, k_i) P(c_i))). In this paper, we adopt the linear model f_s(x_j, k_i) = W_1 f_e(x_j) + W_2 k_i = W_1 f_e(x_j) + W_2 λ_i c_i to fuse the information of the input X and the confounder C. Then, the final implementation of the intervention is formulated as:

$$P(m_j|do(X)) \approx \mathrm{softmax}\Big(W_1 f_e(x_j) + W_2 \sum_{i=1}^{N} \lambda_i c_i P(c_i)\Big). \quad (7)$$
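A toy numerical sketch contrasting the exact backdoor sum with its NWGM-style approximation, using random stand-ins for f_e(x_j), the dictionary D_c and the weights (all shapes and values are illustrative, not the paper's trained parameters):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

N, d = 3, 4                          # N labels, d hidden dims (illustrative)
rng = np.random.default_rng(0)
f_x = rng.normal(size=d)             # stand-in for f_e(x_j)
D_c = rng.normal(size=(N, d))        # confounder dictionary, one c_i per label
p_c = np.array([0.5, 0.3, 0.2])      # prior P(c_i), fixed before training
W1 = rng.normal(size=(2, d))
W2 = rng.normal(size=(2, d))

# lambda: normalised similarity between x_j and each c_i.
lam = softmax(D_c @ f_x)

# Exact backdoor sum: expectation of the softmax over strata c_i.
exact = sum(p * softmax(W1 @ f_x + W2 @ (l * c))
            for p, l, c in zip(p_c, lam, D_c))

# NWGM-style approximation: move the expectation inside the softmax.
nwgm = softmax(W1 @ f_x + W2 @ sum(p * l * c for p, l, c in zip(p_c, lam, D_c)))
```

The approximation turns |C| softmax evaluations per token into one, which is the computational motivation stated above.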

3.3. LIMITATIONS ON THE PREDEFINED SPARSITY α IN RATIONALIZATION

In this section, we discuss in detail why C → Y in Figure 2(a) holds. Since rationales are defined as the subsequence of the input that is sufficient to yield the results, C → Y should not exist. Unfortunately, in practical implementations, the sparsity constraint (denoted the α-constraint) in the first term of Eq (4) may result in spurious correlations between the extracted rationale and the predicted result. Specifically, the α-constraint encourages the selector to extract a fraction α of the tokens from the original text input. When the predefined number of extracted tokens is greater than the length of the actual rationale, a few tokens corresponding to shortcuts of Y may still be selected. For example, as α converges to 1, all tokens in the input will be extracted, including both rationale tokens and shortcut tokens (more examples are shown in Appendix B.1). The shortcut tokens then hurt the prediction performance. To alleviate this situation, we first construct a fine-grained causal graph (Figure 2(d)) between the selected rationale R and the prediction results Y, where R represents the rationale generated under the α-constraint and H denotes the context-specific representation of R based on the context prior C. As mentioned before, from the graph we find that since there exists a backdoor path R ← C → H → Y, R and Y are confounded. Based on this observation, the predictor adopts the causal intervention method described in section 3.2 (i.e., calculating P(Y|do(R)) ≈ softmax(Σ_{i=1}^{N} f_r(R, h_i) P(c_i))) to remove the spurious correlations and yield the prediction results, where f_r(·) is the function achieving R → Y ← H, and h_i = g_r(R, c_i) represents the process R → H ← C. A high-level description of the graph and the derivation are given in Appendix A.2. Besides, although many rationalizers Jain et al. (2020); Paranjape et al. (2020) do not use the α-constraint, we believe their constraints for selecting short rationales can be considered variants of the α-constraint, as detailed in Appendix B.2; our intervention method thus remains effective for these methods.

4. EXPERIMENTS

In this section, we validate the effectiveness of our method on three real-world tasks: beer review sentiment analysis, movie review prediction and legal judgment prediction.

4.1. BEER REVIEWS SENTIMENT ANALYSIS

Beer review sentiment analysis is formulated as a sentiment prediction task, predicting ratings (on the scale [0, 1]) for multiple aspects of beer reviews (e.g., appearance, aroma and palate). We use BeerAdvocate McAuley et al. as our dataset, which is commonly used in the field of rationalization. As there is a high sentiment correlation across different aspects of the same beer review Lei et al. (2016), which may confuse model training, several studies Lei et al. (2016); Bastings et al. (2019) adopt de-correlated sub-datasets (i.e., parts of BeerAdvocate) in the training stage. However, a highly correlated dataset is more conducive to validating our Inter-RAT, which is designed to remove spurious correlations in data. Although Chang et al. (2020) also construct a correlated sub-dataset, their data split and processing are not available. Therefore, for a fair comparison, unlike previous studies that experiment on sub-datasets, we train and validate models on the original BeerAdvocate, containing more than 220,000 beer reviews. Besides, following the setup of Chang et al. (2020), we treat beer review prediction as a binary classification, labeling ratings ≤ 0.4 as negative and ≥ 0.6 as positive. The processed BeerAdvocate is then a non-balanced dataset; for example, the label distribution for the appearance aspect is positive:negative ≈ 20:1. For testing, we take manually annotated rationales as our test set; detailed statistics are shown in Appendix C.1. For implementation, the encoders are bi-GRUs with hidden size 100. We optimize the rationalization objective using Adam Kingma & Ba (2014) with a mini-batch size of 256 and an initial learning rate of 0.001. Besides, we set the α in Eq (4) to {0.1, 0.2, 0.3}, respectively. For testing, we report token precision, recall and F1-score to evaluate the quality of the selected rationales. Among them, token precision is defined as the percentage of selected tokens that are in the annotated rationales, and token recall is the percentage of annotated rationale tokens that are selected by the model. The token F1-score is calculated as 2 · precision · recall / (precision + recall).
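The token-level metrics above can be sketched as follows (the token index sets are illustrative):

```python
def token_f1(selected, annotated):
    """Token-level precision/recall/F1 between selected and gold rationale tokens."""
    selected, annotated = set(selected), set(annotated)
    tp = len(selected & annotated)                       # correctly selected tokens
    precision = tp / len(selected) if selected else 0.0
    recall = tp / len(annotated) if annotated else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Token indices selected by the model vs. human-annotated rationale tokens.
p, r, f = token_f1({1, 2, 3, 7}, {2, 3, 4, 5})
```

Here tokens 2 and 3 are correct, so precision = recall = 2/4 = 0.5 and F1 = 0.5.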

4.1.2. EXPERIMENTAL RESULTS

To demonstrate the effectiveness of our Inter-RAT, we compare it with RNP, HardKuma and INVRAT in Table 1, where Inter-RAT outperforms the baselines consistently in finding correct rationales. Specifically, Inter-RAT surpasses RNP and HardKuma on all three aspects (i.e., appearance, aroma and palate) by a large margin on most metrics. Besides, although INVRAT has proven helpful in discovering de-confounded rationales, Inter-RAT still performs better, improving the average token F1-score by 10.5, 14.0 and 1.4 across the three aspects, and with lower variance, illustrating that our method is more stable than INVRAT. These observations strongly demonstrate that Inter-RAT can remove the spurious correlations in data to select rationales effectively. As discussed in section 3.3, we propose the causal intervention method to alleviate the problem that, as α increases, tokens corresponding to spurious correlations in data may be selected and mislead the prediction. Here, we conduct an experiment to validate the effectiveness of this causal intervention. Since there are only about 1,000 beer reviews in the test set, we report the binary classification F1-score with different α on the dev set, which contains about 30,000 reviews. As shown in Figure 3, we experiment on the palate aspect; Inter-once is a variant of Inter-RAT that yields the rationales based on P(R|do(X)) but predicts the results based on P(Y|R) rather than P(Y|do(R)). We observe that when α is small (i.e., the selected rationales are shorter than the real rationales), the difference between Inter-RAT and Inter-once is minor. However, as α increases, Inter-RAT steadily improves, while Inter-once grows slowly and even degrades. This observation illustrates that our causal intervention method can alleviate the spurious-correlation problem between R and Y caused by the α-constraint.
Figure 5: The token F1 for rationales on the appearance aspect with different prior distributions.

As mentioned in section 4.1, we run experiments on a non-balanced dataset, which differs from previous studies Chang et al. (2020; 2019); Huang et al. (2021) that adopt balanced datasets. Therefore, there is a research question we need to answer: "Does the information of label distributions (or prior distributions) somehow influence Inter-RAT to yield better rationales, rather than the causal intervention?". For instance, as the label distribution in the appearance aspect is positive:negative ≈ 20:1, we conduct experiments on the appearance dataset with P(c_1) = 20/21 and P(c_2) = 1/21, where P(c_i) in Eq (5) represents the prior distribution of c_i, c_1 is the positive label and c_2 is the negative one.
The above non-balanced label distribution might induce the model to "better rationalize" the majority class (i.e., the positive class), which would inflate the improvement of Inter-RAT over the whole dataset. Therefore, for a safer evaluation, we compute token F1-scores for positive and negative examples separately, denoting the evaluation of Inter-RAT on positive examples as Inter-RAT(+) and on negative ones as Inter-RAT(-). Figure 5 summarizes the results on the appearance aspect. We find that extracting positive rationales works better than extracting negative ones, although the difference between the two types of results is not significant, and the scores for the negative class are still high (better than INVRAT). To further validate the effect of the label distribution, we re-run the experiments with P(c_1) = P(c_2) = 1/2 (i.e., assuming a balanced dataset with a uniform label distribution) and denote the corresponding model Inter-RAT-balance, reporting the results in Figure 5 as well. We find that adopting the true prior distribution P(c_i) (Inter-RAT) performs better than the assumed one (Inter-RAT-balance), which demonstrates that the prior distribution is critical for the backdoor adjustment method. Interestingly, with a balanced label distribution, the results on the minority label (i.e., negative) are worse than with the true label distribution, which suggests that Inter-RAT is not simply "paying more attention" to instances of the majority class. Besides, comparing with INVRAT, we investigate model performance by showing the changes in token precision and recall over training epochs. Figure 4 shows the experiments on the appearance aspect with α = 0.3.
From the observation, we can conclude that Inter-RAT significantly outperforms INVRAT in both precision and recall, with lower variance from the start of training, which demonstrates the effectiveness of our proposed method. Furthermore, since Inter-RAT is agnostic to the structure of the selector and predictor, we adopt Bert Devlin et al. to replace the bi-GRU in f_e(•) and f_p(•) in both RNP and Inter-RAT, and denote the resulting models as Bert_RNP and Bert_Inter-RAT, respectively. From the results, we observe that Bert_Inter-RAT still outperforms Bert_RNP, illustrating the effectiveness of Inter-RAT.
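The token precision, recall and F1 used above can be computed from selected and annotated token positions. This is a generic sketch of the standard metric; representing rationales as sets of token indices is our assumption.

```python
def token_prf1(selected, annotated):
    """Token-level precision/recall/F1 between the positions of selected
    rationale tokens and the human-annotated positions."""
    sel, gold = set(selected), set(annotated)
    tp = len(sel & gold)  # tokens both selected and annotated
    precision = tp / len(sel) if sel else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall > 0 else 0.0)
    return precision, recall, f1
```

For example, selecting tokens {4, 5, 6} against gold annotations {4, 5} gives precision 2/3, recall 1, and F1 0.8.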

4.3. LEGAL JUDGMENT PREDICTION

Since there are only two categories (positive and negative) in both beer and movie review prediction, we further generalize Inter-RAT to the multi-class classification setting. Specifically, we focus on the Legal Judgment Prediction (LJP) task, which yields judgment results such as charges based on the case fact. We conduct experiments on the publicly available datasets of the Chinese AI and Law challenge (CAIL). CAIL contains criminal cases consisting of the fact description and the corresponding charge, law article, and term of penalty results. For data processing, following Yue et al. (2021), we remove several infrequent and multi-charge cases, and divide the terms into non-overlapping intervals. The detailed statistics of the datasets can be found in Yue et al. (2021). Figure 1 shows an example of LJP, which predicts the charge according to the case fact.

4.3.1. COMPARISON METHODS AND EXPERIMENTAL SETUP

In addition to comparing with RNP Lei et al. (2016), HardKuma Bastings et al. (2019) and INVRAT Chang et al. (2020), we also compare our method with several classical baselines in the LJP task, including TopJudge Zhong et al. (2018), Few-Shot Hu et al. (2018), LADAN Xu et al. (2020) and NeurJudge Yue et al. (2021). All the above baselines are trained by exploiting legal particularities. Among them, NeurJudge is the state-of-the-art model in LJP, which adopts different crime circumstances to yield corresponding results. Meanwhile, it employs a label embedding method to enhance the prediction. We conduct experiments on one of the versions of CAIL containing 134,739 cases Yue et al. (2021). For testing, as there are no annotated rationales, we first employ accuracy (Acc), macro-precision (MP), macro-recall (MR), and macro-F1 (F1) to evaluate the performance of yielding judgment results. Then, we provide a human evaluation of the selected rationales in LJP. A detailed description of the comparison methods and experimental setups can be found in Appendix C.2.

4.3.2. EXPERIMENTAL RESULTS

To evaluate the performance of our model on LJP, we present the experimental results from two aspects. First, Table 3 shows that Inter-RAT still performs better than the rationalization methods when generalized to the multi-class task. Meanwhile, compared with the LJP approaches (e.g., TopJudge and NeurJudge), even though our model is trained on the three subtasks separately while these LJP approaches explore the dependencies between tasks and are trained within a multi-task learning framework, our model still achieves promising performance. However, Inter-RAT does not perform better than NeurJudge. A potential reason is that NeurJudge is designed only for LJP and exploits legal particularities well (e.g., crime circumstances), while Inter-RAT is designed for general text classification tasks. Furthermore, different from NeurJudge and the other LJP baselines, Inter-RAT can provide an intuitive explanation (i.e., rationales) when yielding the judgment results, while the LJP baselines fail to produce them. The above observations strongly validate adopting the causal intervention method to remove spurious correlations in data for predicting results. Interestingly, we find there is only a minor difference between Inter-RAT and NeurJudge in yielding the charge and law article. We argue that a potential reason is that the label embedding method in NeurJudge can be approximated as a causal intervention method; we discuss this further in Appendix D.1. Second, as CAIL does not provide annotated rationales like BeerAdvocate, we conduct a human evaluation of the selected rationales. Specifically, we sample 100 examples and ask human annotators to evaluate the rationales in the charge prediction. Besides, following Sha et al. (2021), we employ three metrics scored in the interval from 1 (lowest) to 5 (e.g., 2.0 and 3.2) to evaluate rationales, including usefulness (U), completeness (C), and fluency (F). Appendix C.3 describes the detailed scoring standards for the human annotators. The human evaluation results are shown in Table 4. From the results, we find that Inter-RAT outperforms RNP and INVRAT in all metrics, further demonstrating that our causal intervention method can select more sufficient rationales for yielding results.

5. RELATED WORK

Rationalization. To improve the explainability of DNNs, rationalization has attracted increasing attention Lei et al. (2016); Treviso & Martins (2020); Bastings et al. (2019); Chang et al. (2019); Yu et al. (2021). Specifically, Lei et al. (2016) first proposed a rationalization framework which consists of a selector and a predictor. Following this framework, multiple variants were proposed to improve rationalization. Among them, to replace the Bernoulli sampling distribution in Lei et al. (2016), Bastings et al. (2019) introduced a HardKuma distribution for reparameterized gradient estimates, and Paranjape et al. (2020) studied the Gumbel-softmax trick for reparameterization. Meanwhile, the latter also adopted the information bottleneck method to manage the trade-off between selecting sparse rationales and yielding accurate results. Additionally, another fundamental direction is adding external components to enhance the original framework. Yu et al. (2019) employed an introspective selector which incorporates the prediction results into the selection process. Several works Huang et al. (2021); Sha et al. (2021); Cao et al. (2020) proposed an external guider to reduce the difference between the distributions of rationales and input. However, few works have considered the spurious correlations in data, which degrade rationalization. Among the exceptions, Chang et al. (2020) discovered causal rationales with environment-invariant methods by creating different environments, and Wu et al. (2022) extracted rationales from graphs to study the explainability of graph neural networks (GNNs) via intervention distributions Tian et al. (2006).

Causal Inference. Causal inference Glymour et al. (2016), which aims to endow models with the ability to estimate causal effects, has been widely explored in various fields, including medicine Richiardi et al. (2013) and politics Keele (2015). Recently, several studies Deng & Zhang (2021); Dong et al. (2020); Yue et al. (2020) introduced causal inference into machine learning, using causal intervention to remove the spurious correlations in data. In particular, it has inspired several studies in natural language understanding, such as Named Entity Recognition Zhang et al. (2021), Topic Modeling Wu et al. (2021), and Relation Extraction Liu et al. (2021). In this paper, we focus on improving rationalization with causal intervention.

6. CONCLUSION

In this paper, we proposed a causal intervention method (Inter-RAT) to improve rationalization. Specifically, we first formulated the causalities in rationalization with a structural causal model and revealed how the confounder hurts the performance of selecting rationales through opened backdoor paths. Then, considering the entire label set as the observed confounder set, we introduced a backdoor adjustment method to remove spurious correlations between inputs and rationales, and between rationales and results. Besides, we further discussed the potential bias between selected rationales and predicted results caused by the sparsity constraints, and adopted the above causal intervention method to yield de-confounded prediction results. Experimental results on three real-world datasets have clearly demonstrated the effectiveness of our proposed method.

C → X. The context prior C determines which tokens can be "put" into the text input X. The context prior consists of an unobserved prior and an observed prior (such as the label set). For example, in Figure 1, both the observed Intentional homicide and Manslaughter priors decide where the tokens denoting violence and death appear; the Manslaughter prior determines where the tokens representing manslaughter appear; and the unobserved prior decides where other, meaningless tokens appear.

A INSTANTIATED STRUCTURAL CAUSAL MODEL

X → K ← C. K denotes the context-specific representation, which is a weighted representation of the prior knowledge in C associated with X. Taking Figure 1 as an example, we assume that the label set consisting of Intentional homicide, Manslaughter, and Theft is the observed prior. Then, we can get the context prior consisting of four parts (i.e., the Intentional homicide prior c_1, the Manslaughter prior c_2, the Theft prior c_3 and the unobserved prior c_4). Next, we calculate the association between X and C and obtain the corresponding scores, assuming a_1 = 0.3, a_2 = 0.6, a_3 = 0.0, a_4 = 0.1, where the Manslaughter prior c_2 is most relevant to X and the Theft prior c_3 is least relevant. Finally, we can calculate K as a_1 c_1 + a_2 c_2 + a_3 c_3 + a_4 c_4 = 0.3c_1 + 0.6c_2 + 0.1c_4.

X → R ← K. As the rationale R is a subsequence of X, X → R holds. Besides, K → R represents the contextual constitution of the text that affects the composition of rationales. Continuing the previous example, since K is calculated as 0.3c_1 + 0.6c_2 + 0.1c_4, tokens in R will be more inclined toward the Manslaughter prior c_2.
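The weighted combination K = a_1 c_1 + a_2 c_2 + a_3 c_3 + a_4 c_4 described above can be sketched directly. The association scores are taken from the example; the 2-dimensional prior embeddings c_1 .. c_4 are toy values of our own, not from the paper.

```python
def context_repr(scores, prior_vectors):
    """K = sum_i a_i * c_i: weighted sum of context-prior vectors."""
    dim = len(prior_vectors[0])
    k = [0.0] * dim
    for a, c in zip(scores, prior_vectors):
        for d in range(dim):
            k[d] += a * c[d]
    return k

# Scores from the example: c2 (Manslaughter) dominates, c3 (Theft) drops out.
a = [0.3, 0.6, 0.0, 0.1]
# Hypothetical 2-d prior embeddings c_1 .. c_4.
c = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
K = context_repr(a, c)  # equals 0.3*c1 + 0.6*c2 + 0.1*c4
```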

A.2 SCM FOR THE PREDICTOR

P(Y|do(R)) = Σ_{i=1}^{|N|} [P(Y | R, H = g_r(R, c_i)) P(c_i)] = Σ_{i=1}^{|N|} [P(Y | f_r(R, h_i)) P(c_i)] = Σ_{i=1}^{|N|} [softmax(f_r(R, h_i)) P(c_i)],

where f_r(•) is the function achieving R → Y ← H, h_i = g_r(R, c_i) = β_i c_i, and β_i ∈ β. β ∈ R^{|N|} is the set of normalized similarities between R and each c_i in the confounder set C (i.e., β = softmax(f_p(R) D_c^T)). Among them, f_p(•) encodes R into a d-dimensional vector, and the dictionary D_c = [c_1, c_2, . . . , c_{|N|}] is approximated as the observed confounder.

• TopJudge Zhong et al. (2018) explores the dependencies among the three subtasks in LJP.
• Few-Shot Hu et al. (2018) utilizes the charge attributes to identify the confusing charges.
• LADAN Xu et al. (2020) learns distinguished law article representations for LJP prediction.
• NeurJudge Yue et al. (2021) is a circumstance-aware approach adopting different crime circumstances to yield corresponding results. Meanwhile, it employs a label embedding method to enhance the prediction.

For training, we adopt word2vec Mikolov et al. (2013) for word embedding pre-training with size 200, and set the encoder in f_e(•) and f_p(•) as a Bi-GRU. Besides, we use a learning rate of 0.0002 with batch size 256, and take α as 0.2. For evaluation, we employ the accuracy (Acc), macro-precision (MP), macro-recall (MR), and macro-F1 (F1) to evaluate the performance of yielding judgment results.
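The interventional prediction P(Y|do(R)) defined in A.2 can be sketched end-to-end as follows. This is a toy sketch: dot-product similarity stands in for f_p(R) D_c^T, the linear `classify` stands in for f_r, and all vectors and priors are made-up values.

```python
import math

def softmax(xs):
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def deconfounded_predict(r_vec, dictionary, priors, classify):
    """P(Y|do(R)) = sum_i softmax(f_r(R, h_i)) P(c_i),
    with h_i = beta_i * c_i and beta = softmax over similarities."""
    beta = softmax([dot(r_vec, c) for c in dictionary])
    mixed = None
    for b, c, p in zip(beta, dictionary, priors):
        h = [b * x for x in c]               # h_i = beta_i c_i
        probs = softmax(classify(r_vec, h))  # softmax(f_r(R, h_i))
        if mixed is None:
            mixed = [0.0] * len(probs)
        for k in range(len(probs)):
            mixed[k] += p * probs[k]
    return mixed

def classify(r, h):
    """Toy stand-in for f_r: a fixed linear map over [R; h]."""
    return [r[0] + h[0], r[1] + h[1]]

# Toy 2-d encodings, three observed confounders, and their priors.
D_c = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
priors = [0.5, 0.3, 0.2]
p_y = deconfounded_predict([0.2, 0.8], D_c, priors, classify)
```

Because the priors sum to one and each per-context softmax is a distribution, the mixture is again a valid distribution over labels.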

C.3 SCORING STANDARDS FOR HUMAN EVALUATION

Following Sha et al. (2021), we evaluate the rationales in the charge prediction with three metrics: usefulness (U), completeness (C), and fluency (F). Each metric is scored from 1 (lowest) to 5. Below, we briefly introduce the scoring standards for these metrics. Detailed standards for human annotators can be found in Sha et al. (2021).

C.3.1 USEFULNESS

Q: Do you think the selected rationales can be useful for explaining the predicted labels?
• 5: Exactly. Selected rationales are useful for me to get the correct label.
• 4: Highly useful. Although several tokens have no relevance to the correct label, most selected tokens are useful to explain the labels.
• 3: Half of them are useful. About half of the tokens are useful for getting labels.
• 2: Almost useless. Almost all of the tokens are useless.
• 1: No use. The selected rationales are useless for identifying labels.

C.3.2 COMPLETENESS

Q: Do you think the selected rationales are enough for explaining the predicted labels?
• 5: Exactly. Selected rationales are enough for me to get the correct label.
• 4: Highly complete. Several tokens related to the label are missing.
• 3: Half complete. There are still some important tokens that have not been selected, and they are in nearly the same number as the selected tokens.
• 2: Somewhat complete. The selected tokens are not enough.
• 1: Nonsense. None of the important tokens is selected.

Label -Manslaughter CAIL -Charge

The defendant and the victim were both students. After the dormitory relocation, in the new dormitory, the defendant and the victim had a dispute over sleeping in the lower bunk bed; the defendant picked up a bottle and forcefully smashed the victim's head, resulting in the victim's head injury, then the defendant sent the victim to hospital. The victim died after hospital treatment failed. The forensic medical appraisal showed the injury to the victim's face was a pre-existing injury and the degree of injury was minor. The victim's death was consistent with an acute heart attack, with trauma and emotional stress as precipitating factors…

Figure 8: Examples of selective rationalization on the charge prediction. Although both Inter-RAT and INVRAT predict the charge correctly, Inter-RAT extracts more comprehensive rationales (i.e., "The victim's death was consistent with an acute heart attack"), which support that the victim's death was due to negligence.

Label -A fixed-term imprisonment of two years CAIL -Term of Penalty

The trial found that the defendant saw the victim, who was a waiter in the hotel, when he was preparing to go on duty in the lobby of the xx hotel, and he kicked the victim in the buttocks, and a fight broke out between the two. In the course of the fight, the defendant pulled out the folding fruit knife he was carrying and stabbed the victim in the inner right thigh. After seeing that the victim's leg was bleeding profusely, the defendant and his colleague sent the victim to the hospital, where it was determined that the victim suffered minor injuries. The defendant then surrendered to the police and made a truthful confession to the crime…

Figure 9: Examples of selective rationalization on the term of penalty prediction. Both Inter-RAT and INVRAT predict the term of penalty correctly, but Inter-RAT extracts more plausible rationales (i.e., "sending the victim to the hospital", which is important for sentencing).



• Charge prediction: predicting the charge, such as Robbery and Theft, based on the case fact. A detailed definition of charge prediction is given in section 4.3.
• https://wenshu.court.gov.cn
• The definition of token F1-score can be found in section 4.1.1. Different from token F1, the F1-score is commonly adopted to evaluate the performance of binary classification.
• https://github.com/china-ai-law-challenge/CAIL2018



Figure 1: Conventional framework of rationalization presented in this paper. In the charge prediction, the input X represents the case fact and the result Y denotes the charge.

Figure 2: Structural Causal Model for Rationalization.

Figure 3: The F1-score on the palate aspect with an increasing α.

A.1 SCM FOR THE SELECTOR

In this section, we describe the SCM for the selector (Figure 2(b)) in detail with examples:

Figure 6: SCM for the predictor.

In X → R ← K, R is affected by the context prior C indirectly through K. For example, in Figure 1, the underlined tokens denoting negligence are ignored in rationalization, since the context prior in C misleads the model to focus on the violence and death features in X through the mediator K. Detailed examples and explanations for this SCM can be found in Appendix A.1.

C → Y ← R. As the predictor yields the result based on the rationale, R → Y holds. Meanwhile, in the ideal situation, since the rationale is defined as a subsequence of X sufficient to predict Y, there should be no direct causal relationship between C and Y. However, in practice, rationales R are commonly extracted with shortcut tokens (introduced later in section 3.3), making C → Y exist. Figure 2(d) describes a fine-grained SCM between Y and C with a mediator H, where H is the context-specific representation of R obtained by using the context prior C. More detailed descriptions of this SCM are presented in Appendix A.2.

Precision, Recall and F1 of selected rationales for the three aspects. Among them, α is the predefined sparsity level.

Results on movie reviews, where several results of baselines are quoted from Yu et al. (2021).

LJP results on CAIL. Among them, the underlined scores are the state-of-the-art performances in LJP but lack explainability, and the results in bold perform second only to NeurJudge but come with an explainable rationale. Results of LJP baselines are quoted from Yue et al. (2021).

Human evaluation on charge prediction.

Detailed statistics of the processed BeerAdvocate and MovieReview.

Movie

Besides the beer review sentiment analysis task, we also conduct experiments on another binary classification task (i.e., movie review prediction Zaidan & Eisner (2008)) in the ERASER benchmark DeYoung et al. (2020), which contains token-level human annotations. We follow the same experimental setups as in 4.1.1 and report the experimental results with α = 0.2 in Table 2. A detailed description of the dataset is shown in Appendix C.1. As shown in the table, Inter-RAT performs better than RNP, HardKuma and INVRAT on all three metrics, which further validates the effectiveness of Inter-RAT.

Besides, we also adopt the NWGM approximation to Eq (8), set f_r(R, h_i) as a linear model, and derive the final objective of the intervention. As mentioned in section 3.3, in practice, a higher predefined sparsity level α may bring in shortcut tokens which hurt the prediction performance; an extreme example is that all tokens in the text input will be selected. Below, we take a beer review as an example to further illustrate this problem, where the example is used to predict the smell aspect.

The original text: He thinks this beer smells great and tastes terrific.
Rationale: smells great
Rationale with shortcut tokens, where α is set to 0.5: smells great and tastes terrific. Among them, "and tastes terrific" can be considered shortcut tokens.
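The shortcut-token effect of a large α can be illustrated with a toy top-k selector over the beer review above. The token scores are hypothetical values chosen for illustration, not learned selector outputs.

```python
import math

def select_rationale(tokens, scores, alpha):
    """Keep the ceil(alpha * len(tokens)) highest-scoring tokens,
    preserving their original order in the text."""
    k = max(1, math.ceil(alpha * len(tokens)))
    top = sorted(range(len(tokens)), key=lambda i: -scores[i])[:k]
    return [tokens[i] for i in sorted(top)]

tokens = "He thinks this beer smells great and tastes terrific .".split()
scores = [0.05, 0.1, 0.05, 0.2, 0.9, 0.95, 0.5, 0.6, 0.55, 0.0]

tight = select_rationale(tokens, scores, 0.2)  # the informative pair
loose = select_rationale(tokens, scores, 0.5)  # pulls in shortcut tokens
```

With α = 0.2 only "smells great" survives, while α = 0.5 also admits "and tastes terrific", the shortcut tokens from the example.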

B.2 DISCUSSIONS ON α-CONSTRAINT

Although several rationalization methods do not set an α-constraint to extract rationales, we believe that their ways of constraining the extraction of short rationales can be considered variants of the α-constraint, and our intervention method in section 3.3 will still be effective for them. Specifically, we argue that these methods must set hyperparameters to encourage the model to select short rationales. However, if the hyperparameters are not set properly, more shortcut tokens may be extracted, making R and Y confounded. For example, several methods Chen & Ji (2020); Paranjape et al. (2020) adopt the information bottleneck to ensure the model extracts short rationales, which introduces a KL divergence between the posterior distribution P(m_j | x_j) and the prior distribution r(m_j), where r(m_j) = Bernoulli(π) for some constant π ∈ (0, 1). For instance, if we set π to 0.1, we encourage the model to extract 10% of the input text. Therefore, we consider π a variant of our α.
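The KL term above can be sketched per mask position. This is the generic Bernoulli-vs-Bernoulli KL formula; the posterior selection probabilities used below are toy values of our own.

```python
import math

def kl_bernoulli(p, pi, eps=1e-12):
    """KL( Bernoulli(p) || Bernoulli(pi) ): divergence between the
    posterior selection probability p of one mask position and the
    sparse prior pi."""
    p = min(max(p, eps), 1 - eps)
    pi = min(max(pi, eps), 1 - eps)
    return p * math.log(p / pi) + (1 - p) * math.log((1 - p) / (1 - pi))

# With pi = 0.1, the model is pushed toward selecting ~10% of the tokens:
pi = 0.1
dense = kl_bernoulli(0.9, pi)   # selecting a token very often is penalized
sparse = kl_bernoulli(0.1, pi)  # matching the prior costs nothing
```

Positions whose selection probability matches π contribute no penalty, while positions selected far more often than π dominate the loss, which is how π enforces a sparsity level analogous to α.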

C SETTING DETAILS C.1 STATISTICS OF BEERADVOCATE AND MOVIEREVIEW

In this section, we show the detailed statistics of BeerAdvocate and MovieReview in Table 5. Among them, BeerAdvocate contains three aspects: appearance, aroma and palate. From Table 5, we can observe that the processed BeerAdvocate is a non-balanced dataset. In the training set, the prior distribution is positive:negative ≈ 20:1 in appearance, ≈ 17:3 in aroma, and ≈ 17:3 in palate. Meanwhile, MovieReview is a balanced dataset with positive:negative = 1:1.

C.2 COMPARISON METHODS AND EXPERIMENTAL SETUPS FOR LJP

In addition to comparing with RNP Lei et al. (2016), HardKuma Bastings et al. (2019) and INVRAT Chang et al. (2020), we also compare our method with several classical baselines in the LJP task: TopJudge Zhong et al. (2018), Few-Shot Hu et al. (2018), LADAN Xu et al. (2020) and NeurJudge Yue et al. (2021).

C.3.3 FLUENCY

Q: Do you think the selected rationales are fluent?
• 5: Very fluent.
• 4: Highly fluent.
• 3: Partially fluent.
• 2: Very unfluent.
• 1: Nonsense.

D.1 A CAUSAL VIEW ON NEURJUDGE

In this section, we discuss, from a causal view, why the difference between Inter-RAT and NeurJudge on the charge and law article prediction is not significant. Here, we explain the observation by taking the charge prediction as an example; the article prediction is similar. Specifically, NeurJudge adopts a label embedding method to incorporate the semantics of the charge into the case fact to yield the corresponding result. We argue that this method can be approximated as a causal intervention method. To illustrate this discovery, we assume Figure 2(d) is the SCM of the charge prediction task, and consider the case fact as R (i.e., α = 1) and the charge label set as C. Then the process of label embedding can be formulated as R → H ← C and R → Y ← H. The objective of NeurJudge is written as:

P(Y|R) = Σ_{i=1}^{|N|} [softmax(f_r(R, h_i))], where h_i = g_neru(R, c_i). (10)

We can find that the difference between Eq (10) and our causal intervention method is that Eq (10) ignores the prior distribution P(c_i). It is worth noting that although NeurJudge ignores P(c_i), it performs slightly better than our model. A potential reason is that NeurJudge exploits the dependencies among the LJP tasks well, while our model is trained on each task independently.

D.2 VISUALIZATION

We provide several visualization cases from the CAIL dataset, shown in Figure 7, Figure 8 and Figure 9, with rationales selected by Inter-RAT and INVRAT. Among them, annotated rationales are underlined; Inter-RAT and INVRAT rationales are highlighted in pink and green, respectively.

Label -Law Article 233 CAIL -Law Article

The defendant was driving a tricycle without a driver's licence to deliver walnuts to a certain place. When he drove into the door of the victim's house, he collapsed the gate pier of the victim's house, causing the victim to be injured by the collapsed gate pier, and he died after being rescued.…

