COUNTERFACTUAL THINKING FOR LONG-TAILED INFORMATION EXTRACTION

Abstract

Information Extraction (IE) aims to extract structured information from unstructured texts. In practice, however, long-tailed and imbalanced data can lead to severe bias issues for deep learning models, since very few training instances are available for the tail classes. Existing works come mainly from the computer vision community, leveraging re-balancing, decoupling, transfer learning and causal inference to address this problem for image classification and scene graph generation. However, these approaches may not perform well on textual data, which involves complex language structures that have been proven crucial for IE tasks. To this end, we propose a novel framework (named CFIE) based on language structure and causal reasoning, with three key ingredients. First, by fusing syntax information into various structured causal models for mainstream IE tasks, including relation extraction (RE), named entity recognition (NER), and event detection (ED), our approach learns the direct effect for classification from an imbalanced dataset. Second, counterfactuals are generated based on an explicit language structure to better estimate the direct effect during the inference stage. Third, we propose a flexible debiasing approach for more robust prediction at inference time. Experimental results on three IE tasks across five public datasets show that our model outperforms state-of-the-art models by a large margin in terms of Mean Recall and Macro F1, achieving a relative 30% improvement in Mean Recall for 7 tail classes on the ACE2005 dataset. We also discuss some interesting findings based on our observations.

1. INTRODUCTION

The goal of Information Extraction (IE) (Sarawagi, 2008; Chiticariu et al., 2013) is to extract structured information from unstructured texts. IE tasks, such as named entity recognition (NER) (Lample et al., 2016), relation extraction (RE) (Zeng et al., 2014; Peng et al., 2017) and event detection (ED) (Nguyen & Grishman, 2015), have developed rapidly with data-hungry deep learning models trained on large amounts of data. However, in real-world settings, unstructured texts follow a long-tailed distribution (Doddington et al., 2004), leading to a significant performance drop on the instance-scarce (or tail) classes, for which very few instances are available. For example, in the ACE2005 (Doddington et al., 2004) dataset, nearly 70% of event triggers are long-tailed while they only take up 20% of the training data. With a strong baseline (Jie & Lu, 2019), the macro F1 score on the instance-rich (or head) classes can be 71.6, while the score on the tail classes sharply drops to 41.7. The underlying causes of these issues are the biased statistical dependencies and spurious correlations between feature representations and classes learned from an imbalanced dataset. For example, the entity Gardens appears 13 times in the training set of OntoNotes5.0 (Pradhan et al., 2013) with the NER tag LOC, and only 2 times with the organization tag ORG. A classifier trained on this dataset will build a spurious correlation between Gardens and LOC; as a result, an organization containing the entity Gardens may be wrongly predicted as a location LOC. There are only a few studies (Zhang et al., 2019; Han et al., 2018) in the Natural Language Processing (NLP) field that address such long-tailed issues. These works mostly rely on external, pre-constructed knowledge graphs, which provide useful data-specific prior information that may not be available for other datasets. On the other hand, there are plenty of works from the computer vision community, where the bias issue is also quite prominent.
Current solutions include re-balanced training (Lin et al., 2017), which re-balances the contribution of each class during training; transfer learning (Liu et al., 2019b), which takes advantage of the knowledge in data-rich classes to boost the performance on instance-scarce classes; decoupling strategies (Kang et al., 2019), which learn the representations and classifiers separately; and causal inference (Tang et al., 2020a;b; Abbasnejad et al., 2020), which relies on structured causal models for unbiased scene graph generation, image classification and visual question answering. These studies from the computer vision community may not achieve good performance on textual datasets in NLP due to a significant difference between the two fields. Unlike images, texts involve complex language structures, such as dependency trees and constituency trees, that describe syntactic- or semantic-level relations between tokens. For long-tailed IE, how to exploit the rich relational information as well as the complex long-distance interactions among words conveyed by such linguistic structures remains an open challenge. Furthermore, to capture a more informative context, the way of utilizing the syntax tree varies across the three IE tasks: the RE task relies more on the context and entity types rather than the entities themselves, while classification in the NER and ED tasks depends more on the entities than on the context. Hence, it is challenging to decide how to properly utilize language structures for these three different IE tasks. One may also expect that prevalent pre-trained models such as BERT (Devlin et al., 2019) would address the long-tailed issues; however, we empirically show that such models still suffer from bias.
In this paper, we propose CFIE, a novel framework that combines language structure and counterfactual analysis in causal inference (Pearl et al., 2016) to alleviate spurious correlations in IE tasks including NER, RE and ED. From a causal perspective, counterfactuals (Bottou et al., 2013; Abbasnejad et al., 2020) state what the outcome would have been had certain factors been different. This concept entails a hypothetical scenario in which the values in the causal graph can be altered to study the effect of each factor. Intuitively, the factor that yields the most significant change in model predictions has the greatest impact and is therefore considered the main effect; other factors causing minor changes are categorized as side effects. In the context of IE with complex language structures, counterfactual analysis answers the question: "which tokens in the text would be the key clues for RE, NER or ED that could change the prediction result?". With that in mind, CFIE explores the language structure to eliminate the bias caused by the side effects while maintaining the main effect for classification. We evaluate our model on five public datasets across three IE tasks, achieving significant performance gains on instance-scarce classes. We will release our code to contribute to the community. Our major contributions are summarized as follows:
• To the best of our knowledge, CFIE is the first attempt to marry counterfactual analysis and language structure to address long-tailed IE issues. We build different structured causal models (SCMs) (Pearl et al., 2016) for the IE tasks and fuse the dependency structure into the models to better capture the main causality for classification.
• We generate counterfactuals based on the syntax structure, where the counterfactuals serve as interventions to alleviate spurious correlations. In doing so, the main effect can be better estimated through the intervention methodology.
• We also propose flexible classification debiasing approaches inspired by the Total Direct Effect (TDE) in causal inference. Our approach strikes a good balance between the direct effect and the counterfactual representations to achieve more robust predictions.

2. RELATED WORK

Long-tailed Information Extraction: Information extraction tasks, such as relation extraction (Zeng et al., 2014; Peng et al., 2017; Quirk & Poon, 2017), named entity recognition (Lample et al., 2016; Chiu & Nichols, 2016), and event extraction (Nguyen & Grishman, 2015; Huang et al., 2018), are fundamental NLP tasks and have been extensively studied in recent years. For long-tailed IE, recent models (Lei et al., 2018; Zhang et al., 2019) leverage external rules or transfer knowledge from data-rich classes to the tail classes. Few-shot learning (Gao et al., 2019; Obamuyide & Vlachos, 2019) has also been applied to IE tasks, although it focuses more on new classification tasks with only a handful of training instances. Re-balancing/Decoupling Models: Re-balancing approaches include re-sampling strategies (Mahajan et al., 2018; Wang et al., 2020a) that aim to alleviate the statistical bias toward head classes, and re-weighting approaches (Milletari et al., 2016; Lin et al., 2017) that assign balanced weights to the losses of training samples from each class to boost discriminability via robust classifier decision boundaries. These techniques may inevitably suffer from under-fitting/over-fitting to the head/tail classes (Tang et al., 2020a). There are also recent studies (Kang et al., 2019) that decouple representation learning and the classifier, which effectively mitigates the performance loss caused by direct re-sampling. Causal Inference: Causal inference (Pearl et al., 2016; Rubin, 2019) and counterfactuals have been widely used in psychology, politics and epidemiology for years. There are many studies in the computer vision community (Tang et al., 2020b; Abbasnejad et al., 2020; Tang et al., 2020a; Niu et al., 2020; Yang et al., 2020; Zhang et al., 2020; Yue et al., 2020) that use the Total Direct Effect (TDE) analysis framework and counterfactuals for scene graph generation (SGG), visual question answering, and image classification.
There is also a recent work (Zeng et al., 2020) that generates counterfactuals for weakly-supervised NER by replacing the target entity with another entity. Our method differs from previous works in three aspects: 1) we explore the syntax structures of texts to build different causal graphs; 2) counterfactuals are generated based on a task-specific pruned dependency tree; and 3) our proposed inference method yields robust predictions for the NER and ED tasks. Model Interpretation: Besides causal inference, there have been plenty of studies (Molnar, 2020) on traditional model interpretation applied in various applications, such as text and image classification (Ribeiro et al., 2016; Ebrahimi et al., 2018), question answering (Feng et al., 2018; Ribeiro et al., 2018), and machine translation (Doshi-Velez & Kim, 2017). LIME (Ribeiro et al., 2016) selects a set of instances to explain the predictions. The input reduction method (Feng et al., 2018) finds the most important features and uses very few words to obtain the same prediction. Unlike LIME and input reduction, the word selection in our CFIE is based on the syntax structure. SEARs (Ribeiro et al., 2018) induces adversaries via data augmentation during the training phase. Along this line, a recent study (Kaushik et al., 2019) also uses a data augmentation technique to provide an extra training signal. Our CFIE is orthogonal to data augmentation, as it generates counterfactuals during the inference stage, where the counterfactuals are used to mitigate spurious correlations rather than to train the network parameters.

3. MODEL

Figure 1 shows the workflow of our proposed CFIE, illustrated with the example sentences "The man was killed" and "The program was killed". We detail its components as follows.

3.1. STEP1: CAUSAL REPRESENTATION LEARNING

In this step, we train a causal graph on an imbalanced dataset. Our goal here is to teach the model to identify the main cause (main effect) and the spurious correlations (side effect) for classification. Structural Causal Models (SCMs): The two well-known causal inference frameworks are SCMs and potential outcomes (Rubin, 2019), which are complementary and theoretically connected. We choose SCMs in our case due to their advantages in expressing and reasoning about the effects of causal relationships among variables. An SCM can be represented as a directed acyclic graph (DAG) G = {V, F, U}, where we denote the set of observables (vertices) as V = {V_1, ..., V_n}, the set of functions (directed edges) as F = {f_1, ..., f_n}, and the set of exogenous variables (e.g., noise) as U = {U_1, ..., U_n}. Note that in the deterministic case where U is given, the values of all variables in the SCM are uniquely determined (Pearl, 2009). Each observable V_i is derived from V_i := f_i(PA_i, U_i), (i = 1, ..., n), where PA_i ⊆ V \ {V_i} is the set of parents of V_i. A directed edge such as PA_i → V_i in the graph G, i.e., f_i, refers to the direct causation from the parental variables PA_i to the child variable V_i. Our Proposed SCMs: Figure 2(a) demonstrates our unified SCMs for IE tasks, which are built based on our prior knowledge of the tasks. The variable S indicates the contextualized representations of an unstructured input sentence, where the representations are the output of a BiLSTM (Schuster & Paliwal, 1997) or a pre-trained BERT encoder (Devlin et al., 2019). The variables Z_i (i ∈ [1, m]) denote auxiliary features such as POS and NER tags, and X denotes the representation derived from S via the syntax-based networks described below. Let E = {S, X, Z_1, ..., Z_m} denote the parents of Y. The direct causal effects towards Y, including X → Y, S → Y, Z_1 → Y, ..., Z_m → Y, are linear transformations. For each edge i → Y, the transformation is denoted as W_iY ∈ R^{c×d}, where i ∈ E and c is the number of classes. We let H_i ∈ R^{d×h} denote the h representations, each with d dimensions, for node i ∈ E.
Then, the prediction can be obtained by summation, Y_x = Σ_{i∈E} W_iY H_i, or by a gated mechanism, Y_x = W_g H_X ⊙ σ(Σ_{i∈E} W_iY H_i), where ⊙ refers to the element-wise product, W_g ∈ R^{c×d} is a linear transformation, and σ(·) indicates the sigmoid function. To avoid any single edge, such as S → Y, dominating the generation of the logits Y_x, we add a cross-entropy loss L_iY for each branch i ∈ E, where i indicates a parent of the node Y. Let L_Y denote the loss for Y_x; the total loss L is computed as: L = L_Y + Σ_{i∈E} L_iY (2). Note that the proposed SCM is encoder-neutral: it can be equipped with various encoders, such as BiLSTM, BERT and RoBERTa (Liu et al., 2019a). For simplicity, we omit the exogenous variables U from the graph, as they are only needed for the derivations in the following sections. Fusing Syntax Structures Into SCMs: So far we have built basic SCMs for IE tasks. On the edge S → X, we adopt different neural network architectures for RE, NER and ED. For RE, we use dependency trees to aggregate long-range relations with graph convolutional networks (GCN) (Kipf & Welling, 2017). Assume the length of the sentence is h. For the GCN, we generate an adjacency matrix A ∈ R^{h×h} from the dependency tree. The convolution for node i at the l-th layer takes the representation x_i^{l-1} from the previous layer as input and outputs the updated representation x_i^l: x_i^l = σ(Σ_{j=1}^{h} A_ij W^l x_j^{l-1} + b^l), i ∈ [1, h], where W^l and b^l are the weight matrix and bias vector of the l-th layer respectively, and σ(·) is the sigmoid function. Here x^0 takes its value from H_S, and H_X takes its value from the output of the last GCN layer, x^{l_max}. For NER and ED, we adopt the dependency-guided concatenation approach (Jie & Lu, 2019).
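As an illustration, the graph convolution above can be sketched in a few lines of NumPy. This is a minimal, framework-free sketch with toy dimensions, not the authors' implementation; the real model stacks several such layers over BiLSTM/BERT outputs.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gcn_layer(X, A, W, b):
    """One graph convolution over a dependency adjacency matrix.

    X: (h, d) token representations from the previous layer
    A: (h, h) adjacency matrix built from the dependency tree
       (with self-loops so each token keeps its own features)
    W: (d, d) layer weight, b: (d,) bias
    Returns the updated (h, d) representations:
    x_i^l = sigmoid(sum_j A_ij W x_j^{l-1} + b).
    """
    return sigmoid(A @ X @ W + b)

# Toy example: 3 tokens with 4-dim features, chain dependency 0-1-2.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
A = np.array([[1, 1, 0],
              [1, 1, 1],
              [0, 1, 1]], dtype=float)  # tree edges + self-loops
W = rng.normal(size=(4, 4))
b = np.zeros(4)
H = gcn_layer(X, A, W, b)
```

Stacking `l_max` such layers (feeding each output back in as `X`) yields the `H_X` used on the edge S → X.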
Given a dependency edge (t_h, t_i, r), with t_h as the head (parent), t_i as the dependent (child), and r as the dependency relation between them, the representation of the dependent (assumed to be at the i-th position of a sentence) is: x_i = [H_S^(i); H_S^(h); v_r], t_h = parent(t_i), and H_X = LSTM(x), where H_S^(i) and H_S^(h) are the word representations of the word t_i and its parent t_h, and v_r denotes the learnable embedding of the dependency relation r.
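The dependency-guided concatenation can likewise be sketched as follows. The names `heads`, `rel_ids` and `rel_emb` are hypothetical placeholders for the parsed head indices, relation ids and relation embedding table; the resulting matrix would then be fed to another LSTM as in the equation above.

```python
import numpy as np

def dep_guided_inputs(H_S, heads, rel_ids, rel_emb):
    """Build dependency-guided inputs x_i = [H_S[i]; H_S[head(i)]; v_r].

    H_S:     (h, d) contextual word representations
    heads:   head (parent) index per token; the root points to itself
    rel_ids: dependency-relation id per token
    rel_emb: (num_relations, r) learnable relation embedding table
    Returns an (h, 2d + r) matrix to be fed to an LSTM.
    """
    rows = [np.concatenate([H_S[i], H_S[heads[i]], rel_emb[rel_ids[i]]])
            for i in range(len(heads))]
    return np.stack(rows)

# Toy example: 3 tokens, d = 4, 3 relation types with r = 3.
H_S = np.arange(12.0).reshape(3, 4)
heads = [1, 1, 1]          # token 1 is the root
rel_ids = [0, 1, 2]
rel_emb = np.eye(3) * 0.1
X = dep_guided_inputs(H_S, heads, rel_ids, rel_emb)
```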

3.2. STEP 2 AND 3: INFERENCE AND COUNTERFACTUAL GENERATION

We have trained our SCMs in the first step. The second step performs inference with the SCMs, and the third step generates dependency-based counterfactuals to better measure the main effect. Interventions: For the SCM G, an intervention is an operation that modifies a subset of variables V' ⊆ V such that each variable V'_i ∈ V' is generated by a new structural mechanism f'_i(PA'_i, U_i) that is independent of the original f_i(PA_i, U_i). Thus, the causal dependency between V'_i and its parents {PA_i, U_i} is cut off. Mathematically, such an intervention on a variable X ∈ V is expressed by the do-notation do(X = x*), where x* is the assigned value.

Counterfactuals: Unlike interventions, a counterfactual reflects an imaginary scenario: "what would the outcome be had the variable(s) been different?". Recall from Section 3.1 the definition of an SCM and the set of exogenous variables U, which uniquely determine the variables in the system (Pearl, 2009). Let Y ∈ V denote the outcome variable, and let X ∈ V \ {Y} denote the variable of study. The counterfactual for setting X = x* is formally estimated as: Y_{x*}(u) = Y_{G_{x*}}(u), where G_{x*} means assigning X = x* in all equations of the SCM G. In our CFIE setting, we aim to estimate the counterfactual of the model prediction at the instance level. For the proposed SCM shown in Figure 1, the counterfactual Y_{x*} for our prediction Y is practically computed as: Y_{x*} = Y_{G_{x*}}(u) = f_Y(do(X = x*), S = s, Z = z) = Σ_{i∈E\{X}} W_iY H_i + W_XY H_{x*} (6), where f_Y is the function that computes Y, and we only replace the original feature representation H_X with H_{x*}. No actual value is needed for u. See Appendix A.1.1 for the derivation.
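Concretely, the additive prediction and its counterfactual from Eq. (6) can be sketched as simple dictionary arithmetic. The node names and shapes are illustrative, not the released code.

```python
import numpy as np

def predict(H, W):
    """Y_x = sum over parents i in E of W_iY @ H_i (additive variant)."""
    return sum(W[name] @ H[name] for name in H)

def counterfactual(H, W, H_x_star):
    """Y_{x*}: keep S and Z at their observed values, replace only H_X."""
    H_cf = dict(H, X=H_x_star)
    return predict(H_cf, W)

# Toy example with parents E = {S, X, Z}, d = 4 features, c = 3 classes.
rng = np.random.default_rng(1)
d, c = 4, 3
H = {n: rng.normal(size=d) for n in ("S", "X", "Z")}
W = {n: rng.normal(size=(c, d)) for n in ("S", "X", "Z")}
h_star = rng.normal(size=d)
y = predict(H, W)
y_cf = counterfactual(H, W, h_star)
```

Note that the difference `y - y_cf` reduces to `W_XY @ (H_X - H_{x*})`, which is exactly the quantity the TDE-style estimation in Section 3.3 relies on.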

Dependency-based Counterfactual Generation: There are many other language structures, such as constituency trees, abstract meaning representation (Flanigan et al., 2014) and semantic role labeling (Björkelund et al., 2009). We choose the dependency structure in our case because it captures rich relational information as well as complex long-distance interactions that have been proven effective for IE tasks. Counterfactuals lead us to ask: "what are the key clues that determine the relation between two entities for RE, or that make a certain span of a sentence an entity or an event trigger for NER and ED, respectively?". To generate the counterfactual representations for the RE task, we mask the tokens along the shortest path between the two entities of a relation in the dependency tree to form a new sequence. This masked sequence is then fed to a BiLSTM or BERT encoder to output new contextualized representations S*. For the NER and ED tasks, we mask entities, or the tokens within 1 hop on the dependency tree, to generate S*. We then feed S* through the function S → X to obtain X*. The operation for NER also aligns with a recent finding (Zeng et al., 2020) that the entity itself is more important than the context for entity classification. In doing so, the key clues are wiped out in the generated counterfactual representations X*, which can be used to strengthen the main effect while reducing the spurious correlations and side effects.
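A minimal sketch of the shortest-path masking for RE, assuming a head-index representation of the dependency tree (as produced by parsers such as spaCy); `[MASK]` is an illustrative placeholder token.

```python
from collections import deque

def shortest_dep_path(heads, src, dst):
    """BFS over the (undirected) dependency tree; heads[i] = parent of i."""
    adj = {i: set() for i in range(len(heads))}
    for i, h in enumerate(heads):
        if h != i:                      # skip the root's self-loop
            adj[i].add(h)
            adj[h].add(i)
    prev = {src: None}
    q = deque([src])
    while q:
        u = q.popleft()
        if u == dst:
            break
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                q.append(v)
    path, u = [], dst
    while u is not None:                # walk back from dst to src
        path.append(u)
        u = prev[u]
    return path[::-1]

def mask_path(tokens, heads, e1, e2, mask="[MASK]"):
    """Counterfactual input for RE: mask tokens on the path between entities."""
    out = list(tokens)
    for i in shortest_dep_path(heads, e1, e2):
        out[i] = mask
    return out

# Toy parse of "Paris is in France": "is" (index 1) is the root.
tokens = ["Paris", "is", "in", "France"]
heads = [1, 1, 1, 2]
masked = mask_path(tokens, heads, 0, 3)
```

Feeding the masked sequence through the encoder gives the counterfactual S* described above; the 1-hop masking for NER/ED follows the same pattern with a neighborhood instead of a path.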

3.3. STEP 4 AND 5: CAUSAL EFFECT ESTIMATION

We estimate the causal effect in the fourth step and make use of the counterfactual representations for a more robust prediction in the fifth step. Inspired by the Total Direct Effect (TDE) used in (Tang et al., 2020b), we compare the original outcome Y_x and its counterfactual Y_{x*} to estimate the effect for RE, so that the side effect can be eliminated (see Appendix A.1.2 for the derivation): TDE = Y_x - Y_{x*}. As both the context and the entity (or trigger) play important roles for classification in the NER and ED tasks, we propose a novel approach that alleviates the spurious correlations caused by side effects while strengthening the main effect at the same time. The interventional causal effect for the i-th entity in a sequence is: Effect = Y_{x_i} - Y_{x*_i} + α W_XY x*_i, where α is a hyperparameter that balances the importance of the context and the entity (or trigger) for the NER and ED tasks. The first part, Y_{x_i} - Y_{x*_i}, indicates the main effect, which reflects more of the debiased context, while the second part, W_XY x*_i, reflects more of the entity (or trigger) itself. Combining them yields more robust predictions by better distinguishing the main and side effects. As shown in Figure 1, the sentence "The program was killed" produces a biased high score for the event "Life:Die" in Y_x and results in a wrong prediction due to the word "killed". By computing the counterfactual Y_{x*} with "program" masked, the score for "Life:Die" remains high but the score for "SW:Quit" drops dramatically. The difference Y_{x_i} - Y_{x*_i} leads to the correct prediction and reveals the important role of the word "program". Such a design differs from the previous work in the vision community (Tang et al., 2020a) by providing more flexible adjustment and effect estimation. We will show that our approach is more suitable for long-tailed IE tasks.
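Both estimators can be sketched directly from the two equations above; the shapes are illustrative toy values, and `alpha` corresponds to the hyperparameter α.

```python
import numpy as np

def tde(y_x, y_x_star):
    """Total Direct Effect used for RE: Y_x - Y_{x*}."""
    return y_x - y_x_star

def flexible_effect(y_x, y_x_star, W_XY, h_x_star, alpha):
    """Debiased score for NER/ED: (Y_x - Y_{x*}) + alpha * W_XY @ x*.

    alpha trades off the debiased context (first term) against the
    entity/trigger representation itself (second term).
    """
    return y_x - y_x_star + alpha * (W_XY @ h_x_star)

# Toy logits over 3 classes for one token.
y_x = np.array([2.0, 1.0, 0.5])
y_x_star = np.array([1.5, 1.0, 0.1])
W_XY = np.eye(3)
h_star = np.array([0.2, 0.0, 0.4])
score = flexible_effect(y_x, y_x_star, W_XY, h_star, alpha=1.5)
```

The final class is then taken as the argmax of `score` (or of `tde(...)` for RE).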

4. EXPERIMENTS

4.1. DATASETS AND SETTINGS

The five datasets used in our experiments include OntoNotes5.0 (Pradhan et al., 2013) and ATIS (Tur et al., 2010) for the NER task, ACE2005 (Doddington et al., 2004) and MAVEN (Wang et al., 2020b) for the ED task, and NYT24 (Gardent et al., 2017) for the RE task. For all five datasets, we categorize the classes into three splits based on the number of training instances per class. The model parameters are fine-tuned on the development sets. For RE, we use the Stochastic Gradient Descent (SGD) optimizer with a learning rate of 0.3 and a weight decay rate of 0.9. For NER and ED, we use the Adam optimizer with an initial learning rate of 0.001. The hidden sizes of the BiLSTM and GCNs are set to 300, and the number of GCN layers is set to 3. 300-dimensional GloVe (Pennington et al., 2014) vectors are used to initialize the word embeddings. We focus on Mean Recall (MR) (Tang et al., 2020b) and Macro F1 (MF1), two more balanced metrics for measuring performance on long-tailed IE tasks: MR better reflects the capability of identifying instance-scarce classes, and MF1 better represents the model's ability on each class, whereas the conventional Micro F1 score depends heavily on the data-rich classes and pays less attention to the tail classes. We report the Micro F1 score (F1) for each dataset in the Appendix. We also follow (Liu et al., 2019b) to report the MR and MF1 on the three splits in Table 5 in the Appendix.
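For reference, Mean Recall and Macro F1 are plain per-class averages; a minimal sketch, assuming single-label classification:

```python
from collections import defaultdict

def mean_recall_macro_f1(gold, pred):
    """Per-class recall and F1, averaged with equal class weight.

    Unlike micro F1, every class contributes equally to the average,
    so tail classes are not drowned out by instance-rich head classes.
    """
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fn[g] += 1
            fp[p] += 1
    classes = sorted(set(gold))
    recalls, f1s = [], []
    for c in classes:
        r = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        pr = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        recalls.append(r)
        f1s.append(2 * pr * r / (pr + r) if pr + r else 0.0)
    n = len(classes)
    return sum(recalls) / n, sum(f1s) / n

# Toy example: class 1 is "tail" with a single instance.
gold = [0, 0, 0, 1]
pred = [0, 0, 1, 1]
mr, mf1 = mean_recall_macro_f1(gold, pred)
```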

4.2. BASELINES

We categorize the baselines into three groups. 1) Conventional Models include BiLSTM (Chiu & Nichols, 2016), BiLSTM+CRF (Ma & Hovy, 2016), C-GCN (Zhang et al., 2017), Dep-Guided LSTM (Jie & Lu, 2019), AGGCN (Guo et al., 2019) and BERT (Devlin et al., 2019). They do not explicitly take the long-tailed issue into consideration. 2) Re-weighting/Decoupling Models refer to loss re-weighting approaches, including Focal Loss (Lin et al., 2017), and two-stage decoupled learning approaches (Kang et al., 2019) that include τ-normalization, classifier retraining (cRT) and learnable weight scaling (LWS). 3) Causal Models include TDE (Tang et al., 2020b). There are also recent studies based on the deconfounded methodology (Tang et al., 2020a; Yang et al., 2020), which, however, are not directly applicable as causal baselines in our case. In our experiments, we reproduced the results for all the baselines, as most of the results have not been reported on NLP datasets. We believe some recent strong baselines, not mentioned in this paper due to space limitations, may further benefit our model when integrated into the edge S → X.

4.3. TASK DEFINITIONS

Named Entity Recognition: NER is a sequence labeling task that seeks to locate and classify named entities in unstructured text into pre-defined categories such as person, location, etc. Event Detection: ED aims to detect occurrences of predefined events in unstructured text and identify their triggers. An event trigger is defined as the word or phrase that most clearly expresses an event occurrence. Taking the sentence "a cameraman died in the Palestine Hotel" as an example, the word "died" is considered the trigger of a "Death" event. Relation Extraction: The goal of RE is to identify semantic relationships in text, given two or more entities. For example, "Paris is in France" states an "is in" relationship between the two entities Paris and France. Their relation can be denoted by the triple (Paris, is in, France).

4.4. RESULTS

Named Entity Recognition: Table 1 shows the comparison results on both the OntoNotes5.0 and ATIS datasets. Our models outperform the two classical models, BiLSTM and BiLSTM+CRF, under most settings, especially the Few setting, e.g., achieving 10.2 points higher Mean Recall (MR) than BiLSTM on OntoNotes5.0, and 12.7 points higher Macro F1 (MF1) than BiLSTM+CRF on ATIS. These results indicate the superiority of our proposed model in handling instance-scarce classes. Compared with the C-GCN model, which makes use of dependency trees for information aggregation, our model achieves 8.4 points higher MR and comparable MF1, indicating the capability of a causal model in improving long-tailed sequence labeling. Compared with the recent causal baseline TDE, our model consistently performs better in terms of long-tailed scores; the results confirm our hypothesis that making good use of language structure helps a causal model distinguish the main effect from the side effect. Among re-balancing approaches such as Focal Loss, cRT and LWS, τ-Normalization performs best, which aligns with the findings of a previous study (Kang et al., 2019) on long-tailed image classification. Event Detection: Table 2 shows the comparison results on both the ACE2005 and MAVEN datasets. Overall, our model significantly outperforms the baselines under the Few setting by a large margin, e.g., 12.8 and 15.8 points higher in terms of MR and MF1, respectively, on the ACE2005 dataset, and 20.6 and 20.8 points higher in terms of the two metrics on the MAVEN dataset. Meanwhile, our model achieves better or comparable results under the other settings. These results further confirm the robustness of our model in improving classification for tail classes with few training instances available. Our model also performs better than the BERT baselines under the Few setting, indicating that pre-trained BERT models still suffer from bias on long-tailed IE tasks.
Our CFIE better handles the imbalanced dataset by learning to distinguish the main effect from the side effect. We also observe that CFIE outperforms the previously proposed TDE by a large margin under both the Few and Overall settings, i.e., 11.5 points and 3.4 points improvement in terms of MF1. This further proves our hypothesis that properly exploring language structure in causal models boosts the performance of IE tasks on imbalanced datasets.

4.5. DISCUSSIONS

[Figure 3: qualitative analysis of the sentence "The picture showed premier Peng Li visiting malacca" under different masking schemes.]
What are the most important factors for NER? We have hypothesized that factors such as the 2-hop and 1-hop context on the dependency tree, the entity itself, and the POS feature may hold the key clues for NER predictions. To evaluate the impact of these factors, we first generate new sequences by masking or removing these factors. We then feed the generated sequences to the proposed SCM to obtain predictions. Figure 3 shows a qualitative example of predicting the NER tag for the entity "malacca". Specifically, Figure 3(a) visualizes the variance of the predictions, where the histograms on the left show the prediction probabilities for the ground-truth class, while the histograms on the right show the maximum predictions excluding the ground-truth class. Figure 3(b) illustrates how we mask the context based on the dependency tree. It shows that masking the entity itself, i.e., "malacca", leads to the most significant performance drop, indicating that the entity plays a key role in NER classification. This also inspired us to design step 5 in our framework. More analyses of ED and RE are given in Appendix A.4.1 and A.4.2. Does the syntax structure matter? To answer this question, we design three baselines: 1) Causal Models w/o Syntax, which does not employ dependency trees during training and only uses them for generating counterfactuals; 2) Counterfactuals w/o Syntax, which employs dependency structures for training but uses a null input as the intervention during inference, a setting adopted from a previous study (Tang et al., 2020a); and 3) No Syntax, which is the same as the previous work TDE (Tang et al., 2020b).

A APPENDIX

A.1 DERIVATIONS

A.1.1 COUNTERFACTUALS

Recall that the formal computation of a counterfactual is defined as: Y_{x*}(u) = Y_{G_{x*}}(u), where G_{x*} means assigning X = x* in all equations of the SCM. The crucial step in the derivation is to understand the role of the exogenous variables U, by which the variables in the causal graph are uniquely determined. To compute the counterfactual of a prediction with respect to the variable X, we have to keep all other variables under the same setting as the original prediction. Consider an intuitive example: a boy got an A in a subject because he studied hard. To estimate the counterfactual "what score would he get if he had not studied hard", we should keep all other factors, such as the difficulty of the subject and the skills of the teacher, at their original levels, simulating the hypothetical scenario that the boy travelled back in time and behaved differently. Thus, setting U = u, where u is the environment (e.g., year of admission, faculty) of the original prediction, ensures consistency in estimating the values of all other variables, which is mathematically: V_i = f_i(PA_i, U = u), ∀V_i ∈ V, except for the variable of interest X and its descendants (e.g., commendation from the teacher), due to the intervention do(X = x*). Thus, for our SCM, as long as we ensure that the values of the variables (S, Z), which are not descendants of X, follow the original situation, the exogenous variable u serves only a notational purpose and is no longer needed in computing the counterfactuals. Moreover, only the descendants of X need to be re-calculated. We now present the mathematical derivation of the counterfactual Y_{x*} in our SCM: Y_{x*} = Y_{G_{x*}}(u) = Y(do(X = x*), U = u) = f_Y(do(X = x*), S = s, Z = z) = f_Y(x*, s, z) = Σ_{i∈E\{X}} W_iY H_i + W_XY H_{x*}. In short, to compute the counterfactual Y_{x*}, we simply need to:
1. Assign a new value x* to the variable of interest X.
2. Cut off the dependency between X and its parents in the SCM.
3. Recompute all values.

A.1.2 TOTAL DIRECT EFFECT

In an SCM, let M be the mediator variables such that a path X → M → Y exists. The formal definition of the Total Direct Effect (TDE) is: TDE = Y_x(u) - Y_{x*,m}(u), where m denotes the original values of the mediator variables. Thus, an additional intervention do(M = m) is required to compute the TDE. Fortunately, our SCM shown in Figure 1 has no mediators for X, so the computation reduces to: TDE = Y_x(u) - Y_{x*}(u) = Y_x - Y_{x*}. One may question why X imposes no effect on Z, including the POS and NER tags for relation extraction. This is because the POS and NER tags are provided in the dataset and we do not use them for joint training. Thus, there is no direct dependency between the contextual representation and the representations of the tags.

A.2 DATASET STATISTICS

We give the statistics of the five datasets in Table 4. We follow (Liu et al., 2019b) to split the training set into Few-shot (Few), Medium-shot (Medium) and Many-shot (Many). We split each dataset based on the distribution of class types and instance counts. Details are given in Table 5. We use spaCy to generate the dependency tree, NER tags and POS tags for an input sentence.

A.3 HYPERPARAMETERS

The hyperparameters used for the three tasks are listed in Table 6, Table 7, and Table 8. We show the parameters in different tables as the settings vary for each task.

A.4 MORE DISCUSSIONS

We add more discussions here based on Section 4.5. The design of our experiments is similar to that of the NER task described in Section 4.5.

Figure 10 shows a qualitative example of predicting the event type for the word "shot". Specifically, Figure 10 (a) visualizes the variances of the predictions: the histograms on the left show the prediction probabilities for the ground-truth class, while the histograms on the right show the maximum predicted probabilities over all other classes. Figure 10 (b) illustrates how we mask the context based on a dependency tree. We reach the same conclusion: masking the word itself, i.e., "shot", leads to the most significant performance drop, indicating that the token itself serves as a key clue for ED classification. We can also see that the 1-hop neighbours in the dependency tree play the second most important role. When the 1-hop neighbours are masked, the margin between the probability of the ground-truth class and the probability of the top wrong class shrinks, indicating a decline in the model's classification ability.

For the RE task, we conduct experiments on the NYT24 dataset. We hypothesise that factors such as the context on the shortest path between the targets, the contextualized word representations, the POS feature, and the NER feature may be the key clues for RE predictions. The design of our experiments is again similar to that of the NER task described in Section 4.5. Figure 11 shows a qualitative example of predicting the relation type for the targets "Italy" and "Modena"; Figure 11 (a) visualizes the variances of the predictions and Figure 11 (b) illustrates the masking operations based on the dependency tree.

For the hyper-parameter α, we show the performance on the OntoNotes5.0, ACE2005, and MAVEN datasets under various values of α. As shown in Figure 12, the trends are similar across datasets. The optimal values are 0.9, 1.5, and 1.5 on OntoNotes5.0, ACE2005, and MAVEN, respectively.
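The dependency-based masking operations used in these probes can be sketched as follows. The function and scope names are ours, not the paper's:

```python
def mask_tokens(tokens, dep_edges, target, scope):
    """Return a copy of `tokens` with selected positions replaced by "[MASK]".

    tokens:    list of token strings.
    dep_edges: list of (head_idx, dependent_idx) pairs from a dependency parser.
    target:    index of the token under analysis (e.g. the event trigger).
    scope:     "token", "1hop", or "token+1hop" (our naming)."""
    # 1-hop neighbours: tokens directly connected to the target in the tree.
    one_hop = {h for h, d in dep_edges if d == target}
    one_hop |= {d for h, d in dep_edges if h == target}

    to_mask = set()
    if "token" in scope:
        to_mask.add(target)
    if "1hop" in scope:
        to_mask |= one_hop
    return [t if i not in to_mask else "[MASK]" for i, t in enumerate(tokens)]

# Example: "shot" (index 2) heads the other three tokens in the parse.
tokens = ["He", "was", "shot", "yesterday"]
edges = [(2, 0), (2, 1), (2, 3)]
masked_token = mask_tokens(tokens, edges, target=2, scope="token")
masked_1hop = mask_tokens(tokens, edges, target=2, scope="1hop")
```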

A.6 MORE DETAILED EXPERIMENTAL RESULTS

For the NER and ED tasks, we report more detailed comparisons on the OntoNotes5.0, ATIS, ACE2005, and MAVEN datasets in Table 10, Table 11, Table 12, and Table 13, respectively. We also report the detailed results for RE on the NYT24 dataset in Table 14.






Figure 1: Workflow of our CFIE in five steps.

Figure 2: (a) a unified structured causal model for IE tasks. (b) interventions on X.

Figure 3: (a) prediction distributions for various factors. (b) masking operations based on a syntax tree.

Figure 10: (a) prediction distributions for various factors. (b) masking operations based on a syntax tree.

A.4.2 WHAT ARE THE MOST IMPORTANT FACTORS FOR THE RE TASK?

Figure 11: (a) prediction distributions for various factors. (b) masking operations based on a syntax tree.

A.4.3 HOW DOES THE HYPER-PARAMETER α IMPACT THE PERFORMANCE?

Evaluation results on the OntoNotes5.0 dataset and ATIS dataset for the NER task.

Evaluation results on the ACE2005 dataset and MAVEN dataset for the event detection.

Evaluation results on the NYT24 dataset for RE.

As shown in Table 3, we further evaluate CFIE on relation extraction on the NYT24 dataset. Our method significantly outperforms all other methods in Macro F1, both for the tail classes and overall. Although cRT achieves relatively high

Table 4: Data statistics.

Table 10: Evaluation results on the OntoNotes5.0 dataset for named entity recognition.

Table 11: Evaluation results on the ATIS dataset for named entity recognition.

Table 12: Evaluation results on the ACE2005 dataset for event detection.

Table 13: Evaluation results on the MAVEN dataset for event detection.

which do not involve dependency structures in either the training or inference stage. As shown in Figure 4, our model outperforms the first two baselines on the ACE2005 dataset under both the Few and All settings, demonstrating the effectiveness of the dependency structure in improving causal models for long-tailed IE. How can we make good use of the dependency structure? To answer this question, we examine three tree pruning mechanisms under two graph aggregation settings, i.e., Prune with DGLSTM and Prune with C-GCN, as described in Equation 3 and Equation 4. The three pruning strategies are: 1) CFIE Mask 1-hop, which masks the tokens directly connected to the target token in the dependency tree; 2) CFIE Mask token, which masks the target token itself; 3) CFIE Mask token&1-hop, which masks both the target token and its 1-hop neighbours in the dependency tree. Figure 5 and Figure 6 depict the results on the OntoNotes5.0 dataset. We observe that masking the 1-hop neighbours in the dependency tree achieves the best performance among the three strategies, indicating that the entity itself is more important for NER sequence labeling. Comparing the two graph aggregation methods, we conclude that Prune with DGLSTM makes better use of dependency structures.
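The C-GCN-style aggregation over a (possibly pruned) dependency tree can be sketched as an adjacency-matrix construction. This is a generic sketch of the idea; the exact form of Equation 4 may differ:

```python
import numpy as np

def dep_adjacency(n, dep_edges, self_loops=True):
    """Build a row-normalized adjacency matrix over a dependency tree for
    GCN-style neighbour aggregation.  Pruning a strategy (e.g. Mask 1-hop)
    amounts to dropping the corresponding edges before calling this."""
    A = np.zeros((n, n))
    for h, d in dep_edges:
        A[h, d] = A[d, h] = 1.0     # treat the tree as undirected
    if self_loops:
        A += np.eye(n)              # let each node also keep its own state
    # Row-normalize so each node averages over its neighbours.
    return A / A.sum(axis=1, keepdims=True)

# Example: a 3-token chain 0 - 1 - 2.
A = dep_adjacency(3, [(0, 1), (1, 2)])
```

Each GCN layer would then compute something like `H_next = relu(A @ H @ W)`, so the structure of `A` directly controls which context each token aggregates.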

How does the model perform under various interventions and SCMs?

We study this question on the ACE2005 dataset for the ED task. We design three interventional methods; Figure 7 shows that introducing interventions solely on X achieves the best performance under both the Few and All settings. We also introduce three variants of our proposed SCM: 1) SCM w/o NER, 2) SCM w/o POS, 3) SCM w/o NER and POS. Figure 8 shows that removing the NER node significantly decreases ED performance, especially under the Few setting. These results demonstrate the superiority of our proposed SCMs, which explicitly involve linguistic features to calculate the main effect. More analyses for the NER task are given in Appendix A.4.4. How does the hyper-parameter α impact the performance? To evaluate the impact of α, we tuned it on four datasets: OntoNotes5.0, ATIS, ACE2005, and MAVEN. As shown in Figure 9, when increasing α from 0 to 2.4 on the ATIS dataset, the F1 score first increases dramatically and then decreases slowly, peaking at α = 1.2. As the value of α represents the importance of the entity for classification, we conclude that, for the NER task, the entity plays a relatively more important role than the context. This also demonstrates the necessity of step 5 in our framework, since performance is poor when α is set to 0. Experimental results on the other three datasets are given in Appendix A.4.3.
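The α-controlled debiasing of step 5 can be sketched as follows. We assume here that it follows the TDE form Y_x - Y_{x*} with α scaling the subtracted counterfactual term; the paper's exact combination may differ:

```python
import numpy as np

def debias(y_factual, y_counterfactual, alpha):
    """Hypothetical step-5 debiasing: subtract alpha-scaled counterfactual
    logits from the factual logits.  alpha = 0 keeps the biased factual
    prediction; larger alpha removes more of the spurious context effect."""
    return y_factual - alpha * y_counterfactual

# Example: context bias pushes class 0; the counterfactual pass (entity
# masked) exposes that bias, and subtracting it flips the prediction.
y_f  = np.array([2.0, 1.8])   # factual logits
y_cf = np.array([1.5, 0.2])   # counterfactual logits (made-up values)

pred_biased   = int(np.argmax(debias(y_f, y_cf, alpha=0.0)))
pred_debiased = int(np.argmax(debias(y_f, y_cf, alpha=1.2)))
```

This also illustrates why α = 0 performs poorly: it disables the subtraction entirely and falls back to the biased prediction.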

5. CONCLUSION

In this paper, we present CFIE, a novel approach to tackling the long-tailed information extraction issues via counterfactual analysis in causal inference. Experimental results on five datasets across three IE tasks show the effectiveness of our approach. The future research directions include applying the proposed framework to more challenging long-tailed document-level IE tasks.

A.4.1 WHAT ARE THE MOST IMPORTANT FACTORS FOR THE ED TASK?

To answer this question, we conduct experiments on ACE2005. We hypothesise that factors such as the 2-hop and 1-hop context in the dependency tree, the entity itself, the POS feature, and the NER feature may be the key clues for ED predictions. The design of our experiments is similar to that of the NER task described in Section 4.5.

A.4.4 MORE ANALYSES FOR THE NER TASK

We conduct experiments for the NER task on the OntoNotes5.0 dataset with different intervention methods and SCMs. The design and conclusions are similar to those of the ED task described in Section 4.5. The results are shown in Table 13. Specifically, only intervening on X achieves the best performance, indicating that our method captures the most significant effect. Furthermore, including the POS tag in the causal graph incorporates its useful information while eliminating the bias in POS tags.

A.5 MEASURING CAUSAL EFFECTS OF VARIOUS FACTORS

We measure the causal effects of different factors for the RE task. We define a set of factors F = {X, S, NER, POS, TAGS, Context, DepEdges}, where S, X, NER, and POS are variables defined in our SCM; TAGS includes both the NER tag and the POS tag; Context denotes the tokens along the shortest path between the subject and the object; and DepEdges denotes the dependency edges connected to either the subject or the object. We calculate the causal effect using Equation 7, where x* is generated by masking each factor in F. Instead of measuring the effect on a specific instance, we calculate the average effect on the ground-truth class over all samples in the NYT24 dataset. A larger value indicates a more significant causal effect of the specific factor on the ground-truth label. From Table 9 we observe that X and Context have the largest effects on the ground truth, which is captured by our model. We also conclude that masking tokens in the dependency tree is a better choice than masking dependency relations.
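Our reading of this averaging procedure can be sketched as follows. Since Equation 7 is not reproduced in this appendix, the exact form below (a mean drop in ground-truth probability) is an assumption:

```python
import numpy as np

def average_effect(probs_full, probs_masked, gold):
    """Average causal effect of one masked factor on the gold class:
    the mean, over all samples, of p_full[gold] - p_masked[gold].
    probs_full, probs_masked: (n_samples, n_classes) probability arrays.
    gold: length-n_samples array of gold class indices."""
    idx = np.arange(len(gold))
    return float(np.mean(probs_full[idx, gold] - probs_masked[idx, gold]))

# Example with two samples and two classes; masking the factor lowers the
# gold-class probability, giving a positive average effect.
p_full   = np.array([[0.8, 0.2], [0.6, 0.4]])
p_masked = np.array([[0.5, 0.5], [0.3, 0.7]])
effect = average_effect(p_full, p_masked, np.array([0, 0]))
```

Computing this value once per factor in F and ranking the results gives the kind of comparison reported in Table 9.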

