UTC-IE: A UNIFIED TOKEN-PAIR CLASSIFICATION ARCHITECTURE FOR INFORMATION EXTRACTION

Abstract

Information Extraction (IE) spans several tasks with different output structures, such as named entity recognition, relation extraction and event extraction. Previously, these tasks were solved with different models because of their diverse output structures. By re-examining IE tasks, we find that all of them can be interpreted as extracting spans and span relations. We propose using the start and end token of a span to pinpoint the span in texts, and using the start-to-start and end-to-end token pairs of two spans to determine the relation. Hence, we can unify all IE tasks under the same token-pair classification formulation. Based on the reformulation, we propose a Unified Token-pair Classification architecture for Information Extraction (UTC-IE), where we introduce Plusformer on top of the token-pair feature matrix. Specifically, it models axis-aware interaction with plus-shaped self-attention and local interaction with a Convolutional Neural Network over token pairs. Experiments show that our approach outperforms task-specific and unified models on all tasks in 10 datasets, and achieves better or comparable results on 2 joint IE datasets. Moreover, UTC-IE is significantly faster than state-of-the-art models on IE tasks in most datasets, which verifies the effectiveness of our architecture.

1. INTRODUCTION

Information Extraction (IE) aims to identify and classify structured information from unstructured texts (Andersen et al., 1992; Grishman, 2019). IE consists of a wide range of tasks, such as named entity recognition (NER), joint entity relation extraction (RE)¹ and event extraction (EE)². In the last decade, many paradigms have been proposed to solve IE tasks, such as sequence labeling (McCallum & Li, 2003; Huang et al., 2015; Zheng et al., 2017; Yu et al., 2020a), span-based classification (Jiang et al., 2020; Yu et al., 2020b; Wang et al., 2021; Ye et al., 2022), MRC-based methods (Levy et al., 2017; Li et al., 2020; Liu et al., 2020) and generation-based methods (Zeng et al., 2018; Yan et al., 2021a; Hsu et al., 2022). The above work mainly concentrates on solving individual tasks, but a unified model that solves all IE tasks without dedicated modules is desirable. Besides, tackling all IE tasks with one model can facilitate knowledge sharing between different tasks. Therefore, various attempts have been made to unify all IE tasks with one model structure. Wadden et al. (2019), Lin et al. (2020) and Nguyen et al. (2021) encode all IE tasks' target structures as graphs and design graph-based methods to predict them; Paolini et al. (2021) and Lu et al. (2022) solve general IE tasks in a generative way with a text-to-text or text-to-structure framework. However, graph-based models tend to be complex to design, and generative models are time-consuming to decode.

In our work, we propose a simple yet effective paradigm for unified IE. Inspired by Jiang et al. (2020), we re-examine IE tasks and consider that all of them are fundamentally span extraction (entity extraction in NER and RE, trigger classification and argument span detection in EE) or relational extraction³ (relation extraction in RE and argument role classification in EE). Based on this perspective, we further simplify and unify all IE tasks into token-pair classification tasks. Figure 1 shows how each task can be converted. Specifically, a span is decomposed into start-to-end and end-to-start token pairs. As depicted, the entity "School of Computer Science" in Figure 1(a) is decomposed into indices of (School, Science) and (Science, School). As for detecting the relation between two spans, we convert it into start-to-start and end-to-end token pairs from head mention to tail mention. For example, in Figure 1(b), the relation "Author" between "J.K. Rowling" and "Harry Potter novels" is decomposed into indices of (J.K., Harry) and (Rowling, novels).

Figure 1: [...] token pair, and it can be classified into pre-defined types. e, r, t, a and rol in the figures mean entity, relation, event trigger, event argument and event role. For span extraction, we use the start-to-end and end-to-start token pairs to pinpoint the span, such as entity spans $e_1$, $e_2$, argument spans $a_1$, $a_2$ and trigger span $t$ (cells with pure color). For relational extraction, we use the start-to-start and end-to-end token pairs to represent the relation, such as $r$ and $rol_1$, $rol_2$ (cells with gradient color). Therefore, all IE tasks can be decomposed into token-pair classifications. After the reformulation, the local dependency and the interaction from the plus-shaped orientation (as the orange and blue dotted lines depict) can provide vital information to classify the central token pair.

Based on the above decomposition, we propose a Unified Token-pair Classification architecture for Information Extraction (UTC-IE). Specifically, we first apply a Biaffine model on top of the pre-trained language model to get representations of token pairs. Then we design a novel Transformer to obtain interactions between them. As the plus-shaped dotted lines in Figure 1 depict, token pairs in horizontal and vertical directions cover vital information for the classification of the central token pair.
For span extraction, token pairs in the plus-shaped orientation are either clashing or nested with the central token pair; for example, $e_2$ is contained by $e_1$ in Figure 1(a). For relational extraction, the central token pair's two constituent spans are located in the plus-shaped orientation; for example, in Figure 1(b), $r$ is determined by $e_1$ and $e_2$. Therefore, we make each token pair attend only horizontally and vertically in the token-pair feature matrix. In addition, position embeddings are incorporated to keep the token pairs position-aware. Moreover, neighboring token pairs are highly likely to be informative for determining the type of the central token pair, so we apply a Convolutional Neural Network (CNN) to model the local interaction after the plus-shaped attention. Since the attention map for one token pair resembles the plus operator, we name the novel module Plusformer.

We conduct numerous experiments in two settings. When training separately on each task, our model outperforms previous task-specific and unified models on 10 datasets of all IE tasks. When training a single model simultaneously on all IE tasks in one dataset (named the joint IE task), UTC-IE achieves results better than or comparable to 2 joint IE baselines. To thoroughly analyze why our UTC-IE architecture is useful in IE tasks under the token-pair paradigm, we carry out several ablation studies. We observe that the CNN module in Plusformer plays a significant role in IE tasks because of the abundant local dependency between token pairs after the reformulation. Furthermore, owing to the good parallelism of self-attention and CNN, UTC-IE is one to two orders of magnitude faster than prior unified IE models and some task-specific work.

To summarize, our key contributions are as follows:

1. We introduce UTC-IE, which decomposes all IE tasks into token-pair classification tasks. In this way, we can unify all IE tasks under the same task formulation. Hence, we can use one model to fit all IE tasks without designing task-specific modules. Besides, this unified decomposition is much faster than recently proposed generation-based unified frameworks.

2. After the reformulation of different IE tasks, we propose the Plusformer to model interaction between different token pairs. The plus-shaped self-attention and CNN in Plusformer are well-motivated and effective in the reformulated IE scenario. Experiments on 12 IE datasets all achieve state-of-the-art (SOTA) performance, which justifies the superiority of Plusformer in IE tasks.

3. The reformulation enables us to use one model to fit all IE tasks. Therefore, we can train one model on three IE tasks, and results on two joint IE datasets show that the proposed unification can effectively benefit each IE task through multi-task learning.

4. Extensive ablation experiments reveal that the components of Plusformer are necessary and beneficial. Among them, the CNN module can be essential to the overall performance. Analysis shows that this performance gain is well explained: when IE tasks are reformulated as token-pair classifications, adjacent token pairs can be informative, and the CNN can take good advantage of the local dependency between them.

2. TASK DECOMPOSITION AND DECODING

We first introduce how we decompose IE tasks to conduct training, and then present the decoding procedure for the decomposition. More discussions about the decomposition are presented in Appendix A.

2.1. TASK DECOMPOSITION

Formally, given an input sentence of $L$ tokens $x = [x_1, x_2, \ldots, x_L]$, the potential token pairs form a score matrix $Y \in \mathbb{R}^{L \times L \times (|S|+|R|)}$, where $S$ is the set of span classes and $R$ is the set of relational classes.

We stipulate that EE aims to extract all events $\{\{(s_i, e_i, t_i), (s^1_{ia}, e^1_{ia}, rol^1_i), \ldots, (s^k_{ia}, e^k_{ia}, rol^k_i)\}\}$, where $(s_i, e_i)$ is the trigger span, $t_i \in S_t$ is the event type and $S_t$ is the set of pre-defined event types; $s_{ia}, e_{ia} \in [1, L]$ are the start and end token indices of an argument span, and $k$ is the number of arguments of the trigger. To extract argument spans, we set the argument type set as $S_a$ with $|S_a| = 1$; $rol_i \in R_o$ is the role type of the argument and $R_o$ is the set of pre-defined role types. We can view role types between the trigger and the arguments as relations. Therefore, in EE, $S = S_t \cup S_a$ and $R = R_o$.

Joint IE aims to jointly extract entities, relations, and events in the text. Extracting entities and relations is generally the same as in NER and RE. When extracting events, there is no need to extract argument spans separately because all argument candidates are entities. Therefore, in joint IE, $S = S_t \cup S_e$ and $R = R_r \cup R_o$.

For two spans $(s_1, e_1)$ and $(s_2, e_2)$, if $Y[s_1, s_2, r] = Y[e_1, e_2, r] = 1$ and $r \in [|S| + 1, |S| + |R|]$, then the span $(s_1, e_1)$ forms relation $r$ with the span $(s_2, e_2)$. The above decoding is for the ideal situation, where no span clash exists. For the model's predictions, however, we first need to resolve conflicts. The decoding with the model's predictions is presented in Appendix B.
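The decomposition and the ideal-case decoding can be illustrated with a small sketch. It is not the authors' implementation: the function names `encode` and `decode`, the 0-based inclusive indices, the example sentence, and the rule that a span is read off when both its start-to-end and end-to-start cells are set are assumptions made for illustration.

```python
import numpy as np

def encode(length, spans, relations, num_span_types, num_rel_types):
    """Build a label tensor Y of shape (L, L, |S| + |R|).

    spans:     list of (start, end, span_type), 0-based inclusive (assumed convention)
    relations: list of ((s1, e1), (s2, e2), rel_type), head span first
    """
    Y = np.zeros((length, length, num_span_types + num_rel_types), dtype=np.int8)
    for s, e, t in spans:
        Y[s, e, t] = 1  # start-to-end token pair
        Y[e, s, t] = 1  # end-to-start token pair
    for (s1, e1), (s2, e2), r in relations:
        Y[s1, s2, num_span_types + r] = 1  # start-to-start token pair
        Y[e1, e2, num_span_types + r] = 1  # end-to-end token pair
    return Y

def decode(Y, num_span_types):
    """Ideal-case decoding: assumes Y contains no clashing spans."""
    L = Y.shape[0]
    spans = [(s, e, t)
             for s in range(L) for e in range(s, L) for t in range(num_span_types)
             if Y[s, e, t] == 1 and Y[e, s, t] == 1]
    relations = []
    for s1, e1, _ in spans:
        for s2, e2, _ in spans:
            if (s1, e1) == (s2, e2):
                continue  # assumption: no relation between a span and itself
            for r in range(num_span_types, Y.shape[2]):
                if Y[s1, s2, r] == 1 and Y[e1, e2, r] == 1:
                    relations.append(((s1, e1), (s2, e2), r - num_span_types))
    return spans, relations

# Hypothetical sentence built around the Figure 1(b) example.
tokens = ["J.K.", "Rowling", "wrote", "the", "Harry", "Potter", "novels"]
spans = [(0, 1, 0), (4, 6, 1)]        # two entity spans with assumed type ids 0 and 1
relations = [((0, 1), (4, 6), 0)]     # "Author" with assumed relation id 0
Y = encode(len(tokens), spans, relations, num_span_types=2, num_rel_types=1)
decoded_spans, decoded_relations = decode(Y, num_span_types=2)
```

In this example the relation occupies the cells (J.K., Harry) and (Rowling, novels), i.e. `Y[0, 4, 2]` and `Y[1, 6, 2]`, matching the start-to-start and end-to-end description in Section 1.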

3. METHOD

Figure 2 shows an overview of the architecture. First, we present the Biaffine (Dozat & Manning, 2017) model built on pre-trained language models (PLM). Then, we propose a novel Transformer-like structure named Plusformer to model interactions between token pairs. Finally, we describe the loss functions.

3.1. BIAFFINE MODEL

Given an input sentence, we first apply a PLM as our sentence encoder to obtain the contextualized representation as follows

$$H = [h_1, h_2, \ldots, h_L] = \mathrm{PLM}([x_1, x_2, \ldots, x_L]),$$

where $H \in \mathbb{R}^{L \times d}$ and $d$ is the PLM's hidden size. Next, we use the Biaffine mechanism to get features for each token pair as follows

$$H_s, H_e = \mathrm{MLP}_{start}(H), \mathrm{MLP}_{end}(H),$$

$$S[i, j] = (H_s[i])^T W_1 H_e[j] + W_2 (H_s[i] \oplus H_e[j]) + b,$$

where $\mathrm{MLP}_{start}$ and $\mathrm{MLP}_{end}$ are multi-layer perceptron layers, $H_s, H_e \in \mathbb{R}^{L \times d}$, $W_1 \in \mathbb{R}^{d \times c \times d}$, $W_2 \in \mathbb{R}^{c \times 2d}$, $b \in \mathbb{R}^{c}$, and $\oplus$ refers to concatenation; $S \in \mathbb{R}^{L \times L \times c}$ provides features for all possible token pairs, and $c$ is the feature dimension size.
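The Biaffine scoring above can be written compactly with NumPy. The sketch below is only illustrative: the function name `biaffine` is hypothetical, and random arrays stand in for the PLM output and the two MLPs, which are not modeled here.

```python
import numpy as np

def biaffine(H_s, H_e, W1, W2, b):
    """Token-pair features S of shape (L, L, c).

    H_s, H_e: (L, d) start/end representations
    W1: (d, c, d), W2: (c, 2d), b: (c,)
    """
    d = H_s.shape[1]
    # Bilinear term: out[i, j, k] = sum_{a, b} H_s[i, a] * W1[a, k, b] * H_e[j, b]
    bilinear = np.einsum("ia,akb,jb->ijk", H_s, W1, H_e)
    # W2 applied to the concatenation H_s[i] (+) H_e[j] splits into two halves of W2.
    linear = (H_s @ W2[:, :d].T)[:, None, :] + (H_e @ W2[:, d:].T)[None, :, :]
    return bilinear + linear + b

# Stand-in inputs (assumption: random values replace MLP_start(H) and MLP_end(H)).
rng = np.random.default_rng(0)
L, d, c = 5, 8, 4
H_s, H_e = rng.normal(size=(L, d)), rng.normal(size=(L, d))
W1 = rng.normal(size=(d, c, d))
W2 = rng.normal(size=(c, 2 * d))
b = rng.normal(size=c)
S = biaffine(H_s, H_e, W1, W2, b)

# Direct evaluation of the formula for one token pair, for comparison.
i, j = 1, 2
S_ij = (np.einsum("a,akb,b->k", H_s[i], W1, H_e[j])
        + W2 @ np.concatenate([H_s[i], H_e[j]]) + b)
```

Splitting $W_2$ into a start half and an end half avoids materializing the $L \times L \times 2d$ concatenation while producing the same values.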

3.2. PLUSFORMER

As illustrated in Section 1, when modeling the interaction between token pairs, the plus-shaped and local interactions should be beneficial. Therefore, we introduce axis-aware plus-shaped self-attention and position embeddings to conduct plus-shaped interaction; we name this self-attention PlusAttention. Then, we leverage a CNN to model local dependencies. We name the whole structure Plusformer.

PlusAttention. We first apply the self-attention mechanism (Vaswani et al., 2017) horizontally and vertically as follows

$$Z_h[i, :] = \mathrm{Attention}(S[i, :]W^Q_h, S[i, :]W^K_h, S[i, :]W^V_h),$$

$$Z_v[:, j] = \mathrm{Attention}(S[:, j]W^Q_v, S[:, j]W^K_v, S[:, j]W^V_v),$$

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{c}}\right)V,$$

where $W^Q_h, W^K_h, W^V_h, W^Q_v, W^K_v, W^V_v \in \mathbb{R}^{c \times c}$ and $Z_h, Z_v \in \mathbb{R}^{L \times L \times c}$. After the self-attention, we merge $Z_h$ and $Z_v$ as follows

$$S' = \mathrm{MLP}(Z_h \oplus Z_v),$$

where $S' \in \mathbb{R}^{L \times L \times c}$. We make the plus-shaped self-attention axis-aware by using two groups of attention parameters and by using concatenation instead of addition to merge $Z_h$ and $Z_v$.

Position Embeddings. Although the model should be able to distinguish between horizontal and vertical directions through axis-aware plus-shaped attention, it still lacks a sense of the distances between token pairs and of the area in which a token pair is located. Therefore, we utilize two kinds of position embeddings to give the model these abilities.

• Rotary Position Embedding (RoPE) (Su et al., 2021) can encode the relative distance between two token pairs. It is utilized in both horizontal and vertical self-attention.

• Triangle position embedding is incorporated to mark the position of token pairs in the feature map, which means cells in the upper and lower triangles use different position embeddings. It is added to $S$ in Eq. (3) before Attention.

CNN Layer. After the PlusAttention, we apply a CNN with kernel size 3 × 3 on $S'$ to help the model exploit the local dependency between neighboring token pairs.
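The PlusAttention step described above can be sketched in NumPy as follows. This is a simplified illustration under stated assumptions, not the authors' code: it is single-head, it omits RoPE, the triangle position embedding and layer normalization, and a single linear map `W_merge` (a hypothetical name) stands in for the merging MLP.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(X, Wq, Wk, Wv):
    """Self-attention along the second-to-last axis of X with shape (..., n, c)."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ np.swapaxes(K, -1, -2) / np.sqrt(X.shape[-1])
    return softmax(scores, axis=-1) @ V

def plus_attention(S, params_h, params_v, W_merge):
    """S: (L, L, c); params_h, params_v: (Wq, Wk, Wv), each (c, c); W_merge: (2c, c)."""
    # Horizontal: every row S[i, :] attends within itself.
    Z_h = attention(S, *params_h)
    # Vertical: transpose so that columns become rows, attend, transpose back.
    Z_v = np.swapaxes(attention(np.swapaxes(S, 0, 1), *params_v), 0, 1)
    # Axis-aware merge: separate parameters per axis, concatenation instead of addition.
    return np.concatenate([Z_h, Z_v], axis=-1) @ W_merge

rng = np.random.default_rng(0)
L, c = 5, 6
S = rng.normal(size=(L, L, c))
params_h = tuple(rng.normal(size=(c, c)) for _ in range(3))
params_v = tuple(rng.normal(size=(c, c)) for _ in range(3))
W_merge = rng.normal(size=(2 * c, c))
S_prime = plus_attention(S, params_h, params_v, W_merge)

# Perturb one cell and recompute: only cells sharing its row or column can change.
S_perturbed = S.copy()
S_perturbed[2, 3] += 1.0
S_prime_perturbed = plus_attention(S_perturbed, params_h, params_v, W_merge)
```

After the perturbation of cell (2, 3), the output at a cell such as (0, 1), which shares neither row nor column with it, is unchanged, which is the plus-shaped attention pattern the section describes.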
The CNN layer is formulated as follows

$$S'' = \mathrm{Conv}(\sigma(\mathrm{Conv}(S'))),$$

where $S'' \in \mathbb{R}^{L \times L \times c}$ and $\sigma$ is the activation function. The bias term of the CNN is not used, to avoid inconsistent results for a sample when it appears in batches of different lengths. The Plusformer layer is stacked repeatedly so that token pairs can interact fully. Layer normalization (Ba et al., 2016) is omitted from the formulation for brevity.

The prediction scores are $\hat{Y} \in \mathbb{R}^{L \times L \times (|S|+|R|+1)}$, where $\hat{Y}_S \in \mathbb{R}^{L \times L \times |S|}$ and $\hat{Y}_R \in \mathbb{R}^{L \times L \times (|R|+1)}$ are scores for span extraction and relational extraction, respectively. The $+1$ in $(|R|+1)$ is because we use the adaptive thresholding loss (ATL) from Zhou et al. (2021) to avoid a global threshold in relational extraction.

For span extraction, we use the binary cross-entropy (BCE) loss as follows

$$L_1 = -\sum_{i,j=1}^{L} \sum_{r=1}^{|S|} \left[ Y[i, j, r] \log(\hat{Y}[i, j, r]) + (1 - Y[i, j, r]) \log(1 - \hat{Y}[i, j, r]) \right].$$

For relational extraction, we utilize the ATL as follows

$$L_2 = -\sum_{i,j=1}^{L} \left[ \sum_{r \in P_T} \log \frac{\exp(\hat{Y}_R[i, j, r])}{\sum_{r' \in P_T \cup \{TH\}} \exp(\hat{Y}_R[i, j, r'])} + \log \frac{\exp(\hat{Y}_R[i, j, |R|+1])}{\sum_{r' \in N_T \cup \{TH\}} \exp(\hat{Y}_R[i, j, r'])} \right],$$

where $P_T$ and $N_T$ denote the positive and negative classes, and $\hat{Y}_R[:, :, |R|+1]$ is the score for the threshold class TH. Only token pairs with scores higher than their corresponding adaptive thresholds are considered when decoding. We do not use the ATL for span extraction because we need to sort span scores when decoding spans. The total loss $L = L_1 + L_2$ is used for optimization.

[...] (Pradhan et al., 2013) on flat NER, and with ACE04 (Doddington et al., 2004), ACE05-Ent (Walker et al., 2006) and GENIA (Kim et al., 2003) on nested NER. As for relation extraction, we use ACE05-R (Walker et al., 2006) and SciERC (Luan et al., 2018). Since Wang et al. (2021) and Ye et al. (2022) consider symmetric relations, which substantially influences performance, we name this scenario Symmetric RE, with datasets ACE05-R+ and SciERC+. For event extraction, we follow Lin et al.
(2020) to perform experiments on three datasets, ACE05-E, ACE05-E+ (Doddington et al., 2004) and ERE-EN (Song et al., 2015). For joint IE, we test on ACE05-E+ and ERE-EN. Statistics of all these datasets, implementation details, evaluation metrics, baselines, pre-trained language models and hyper-parameters are described in Appendix C.

Besides, we also test UTC-IE without Plusformer. Surprisingly, this simple model surpasses previous SOTA models on four results marked with ♣, which supports the effectiveness of the task decomposition. The comparison between models with and without Plusformer clearly shows that Plusformer is effective on all tested datasets, with performance improvements ranging from +0.40 (on OntoNotes) to +3.00 (on SciERC+). Notably, the average performance gain from adding Plusformer on symmetric RE (+2.06) is larger than that on RE (+1.03). We presume this is because the interaction between token pairs is more beneficial for symmetric relations.

4.3. RESULTS ON JOINT IE TASK

Multi-task learning has been proven to be useful in the IE area (Lin et al., 2020; Nguyen et al., 2021). Since UTC-IE unifies all IE tasks into a token-pair classification scenario, it is natural to test whether one UTC-IE model can benefit from jointly learning all IE tasks. In Table 2 [...]

We unify all IE tasks as several token-pair classification tasks, which are fundamentally similar to span-based methods for IE, since the start and end tokens can locate a span. Numerous NER studies build on span-based models, which are compatible with both flat and nested entities and perform well (Eberts & Ulges, 2020; Yu et al., 2020b; Li et al., 2021; Zhu & Li, 2022). In addition to entities, the span-based method is also used in RE. Some models (Wang et al., 2021; Ye et al., 2022) [...] Jiang et al. (2020)'s work is similar to ours, but they need a two-stage model to determine the span type and span relations, respectively. A detailed analysis is presented in Appendix G. Although many span-based IE models exist, they are task-specific and lack interaction between token pairs. Decomposing IE tasks as token-pair classification and conducting interaction between token pairs can uniformly model span-related knowledge and advance SOTA performance.

The key component of Plusformer is the plus-shaped attention mechanism, which lets token pairs interact with each other efficiently. A similar structure called Axial Transformers (Ho et al., 2019) was proposed in the Computer Vision (CV) field to deal with data organized as high-dimensional tensors. Tan et al. (2022) incorporate axial attention into relation classification, aiming to improve the performance on two-hop relations. However, CNN was not used in these works, although CNN has proven vital to IE tasks. Another structure named Twin Transformer (Guo et al., 2021), used in CV, is very similar to Plusformer: it encodes the pixels of an image along rows and columns sequentially and leverages CNN on top of them. But position embeddings, which are important for IE tasks, are not used in the Twin Transformer. Besides, we want to point out that the usage of plus-shaped attention and CNN originates from the reformulation of IE tasks; any other module that can directly enable interaction between the constituent spans of a relation and between adjacent token pairs should also be beneficial.

6. CONCLUSION

In this paper, we decompose NER, RE and EE tasks into token-pair classifications. Through the decomposition, we unify all IE tasks under the same formulation. After scrutinizing the token-pair feature matrix, we find that the adjacent and plus-shaped interactions between token pairs should be informative. Therefore, we propose Plusformer, which uses an axis-aware plus-shaped self-attention followed by CNN layers to help token pairs interact with each other. Experiments on 10 single IE datasets and 2 joint IE datasets all outperform or approach the SOTA performance. Besides, owing to the parallelism of self-attention and CNN, our model's inference is substantially faster than previous SOTA models in RE and EE. Lastly, most previous IE models limit the interaction to the 1-D sequential dimension, while the reformulation of IE tasks opens a new angle that broadens the communication to the 2-D feature matrix.

A DISCUSSION ON TASK DECOMPOSITION

In this section, we discuss two issues of the decomposition. The first is the inconsistent stipulation about the relation decomposition; the second is the false positive issue when decoding relations.

B DECODING WITH MODEL'S PREDICTIONS

In this section, we detail the decoding process for the model's predictions. The process described in Section 2.2 is not directly applicable to the model's predictions since spans may conflict with each other⁶. With the prediction score matrix $\hat{Y}_S$ from Eq. (6), we follow previous work (Yu et al., 2020b) and first filter out spans whose scores are less than 0.5; we then sort the remaining spans by score and choose spans in descending order, making sure that each chosen span has no boundary clash with previously chosen spans. For relational extraction, we first decode all spans, then we get a binary matrix [...]
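The greedy span-selection step described above (threshold at 0.5, sort by score, skip boundary clashes) might look like the following sketch. It is an illustration under assumptions rather than the authors' decoder: candidates are given directly as `(start, end, type, score)` tuples, so how a span score is derived from the token-pair scores is left open, a clash is taken to mean partial overlap, and candidates with identical boundaries to an already chosen span are skipped.

```python
def clashes(a, b):
    """True if spans a and b partially overlap, i.e. are neither disjoint nor nested."""
    (s1, e1), (s2, e2) = a, b
    return s1 < s2 <= e1 < e2 or s2 < s1 <= e2 < e1

def decode_spans(candidates, threshold=0.5):
    """Greedy decoding over (start, end, type, score) candidates."""
    kept = []
    for s, e, t, score in sorted(candidates, key=lambda cand: -cand[3]):
        if score < threshold:
            break  # candidates are sorted, so all remaining scores are below the threshold
        duplicate_or_clash = any(
            (s, e) == (ks, ke) or clashes((s, e), (ks, ke)) for ks, ke, _, _ in kept
        )
        if not duplicate_or_clash:
            kept.append((s, e, t, score))
    return kept

# Hypothetical candidates with 0-based inclusive boundaries.
candidates = [(0, 4, 0, 0.9), (2, 6, 1, 0.8), (1, 3, 1, 0.7), (5, 6, 0, 0.4)]
chosen = decode_spans(candidates)
```

Here `(2, 6)` is dropped because it crosses the boundary of the higher-scoring `(0, 4)`, the nested `(1, 3)` is kept, and `(5, 6)` falls below the threshold. For flat NER one would additionally reject nested spans.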

C EXPERIMENTAL SETTINGS

In this section, we describe all experimental settings in detail, such as the statistics of datasets, baseline models, and more implementation details.

C.1 DATASETS

We conduct experiments on 10 single IE datasets and 2 joint IE datasets, and we detail the statistics of all datasets in Table 5.

[...] et al., 2006) and GENIA (Kim et al., 2003). To distinguish it from the ACE05 dataset used in other tasks, we name ACE05 in named entity recognition ACE05-Ent. Specifically, we use the same preprocessing and splitting procedure on nested datasets as Yan et al. (2022), for they fix some annotation problems to unify different versions of these datasets and make a strictly fair comparison.

Relation extraction. We conduct experiments on two relation extraction datasets, ACE05 (Walker et al., 2006) and SciERC¹⁰ (Luan et al., 2018). The ACE05 dataset, named ACE05-R in our paper, is collected from various domains, such as newswire and online forums. The SciERC dataset provides entity, coreference and relation annotations from AI conference/workshop proceedings.

Previous models mentioned above use different RE datasets. Specifically, UniRE and PL-Marker regard symmetric relations as two directed relations, while other work does not. Besides, these two models utilize cross-sentence context.

Event extraction. Generative methods are popular in recently proposed event extraction papers.

• TEXT2EVENT (Lu et al., 2021) is a sequence-to-structure model which outputs a tree-like event structure given an input sentence. The model uses T5-large as the base model.

• DEGREE (Hsu et al., 2022) leverages manually designed prompts to generate event records in natural language. We report the end-to-end performance of DEGREE instead of the pipeline variant. The model leverages BART-large as encoder-decoder.

Joint IE. There are only two previous models that consider joint IE on the ACE05-E and ERE-EN datasets.

• OneIE (Lin et al., 2020) proposes an end-to-end IE model, which employs global features and type dependency constraints at the decoding step.

• FourIE (Nguyen et al., 2021) further improves the model by incorporating interaction dependency at the representation level and the label level.

For a fair comparison, we list the pre-trained model used for all baselines and our model on every IE dataset in Table 6. When choosing our pre-trained language model for each IE dataset, we pick the same pre-trained model as the most recently published papers, such as BioBERT for GENIA and RoBERTa-base for other NER datasets. For RE and joint IE tasks, we choose the same pre-trained model as previous work. For tasks where previous work applied a generative pre-trained model, we choose a pre-trained model of similar size. For example, in event extraction, we use DeBERTa-large, whose number of parameters is 390M, which is closest to BART-large and T5-large used by previous EE papers.

C.3 EVALUATION METRICS

We report micro-F1 on all tasks:

• Entity: an entity is correct if its entity type and span offsets match a golden entity. We use "Ent." to represent entity F1 throughout all tables.

• Relation: a relation is correct if its type and its head and tail entities are correct, and the offsets and types of the entities should also match the golden instance. We use "Rel." to represent relation F1 throughout all tables.

• Event trigger: a trigger is correct if its span offsets and event type are correct. We use "Trig." to represent trigger F1 throughout all tables.

• Event argument: an argument is correct if its span offsets, event type and role type all match the ground truth. We use "Arg." to represent argument F1 throughout all tables.
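The matching criteria above reduce to exact-match comparison of tuples, so micro-F1 can be computed by pooling counts over all sentences. The sketch below assumes each prediction and gold item is encoded as a hashable tuple, for example `(start, end, type)` for an entity; the function name `micro_f1` and this encoding are illustrative choices, not taken from the paper.

```python
def micro_f1(pred, gold):
    """pred, gold: lists with one set of hashable items per sentence."""
    true_pos = sum(len(p & g) for p, g in zip(pred, gold))  # exact matches, pooled
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical one-sentence example: one of two predicted entities has a wrong end offset.
gold = [{(0, 1, "PER"), (4, 6, "WORK")}]
pred = [{(0, 1, "PER"), (4, 5, "WORK")}]
entity_f1 = micro_f1(pred, gold)  # precision = recall = 0.5
```

For relation F1 under the criterion above, each item would carry the head and tail entity tuples (offsets and types) together with the relation type, so that an error in either entity makes the relation count as wrong.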

C.4 HYPER-PARAMETERS

The detailed hyper-parameters used in each dataset are listed in Table 7. We use the AdamW optimizer (Loshchilov & Hutter, 2019) with weight decay 1e-2 for all datasets. Experiments are conducted five times with five different random seeds. We report the performance on test sets based on the model which achieves the best dev results in each dataset. For NER, the best model is selected by the entity F1; for RE, by the sum of entity F1 and relation F1; for EE, by the argument F1; for joint IE, by the sum of relation F1 and trigger F1.

D COMPLETE RESULTS

We present the complete results of UTC-IE with and without Plusformer in Table 8.

E SPEED COMPARISON

We test the speed of other models through their released code. Models such as OneIE (Lin et al., 2020), DEGREE (Hsu et al., 2022) and PL-Marker¹¹ (Ye et al., 2022) also released a trained model along with their code, and we used the released model to test the inference speed. For UIE (Lu et al., 2022) and BS (Zhu & Li, 2022), we trained a model with their code. The speed test is conducted on one RTX 3090 GPU and the batch size is set to 32 for all models (if a model runs out of memory, we choose the largest batch size that fits on the GPU); the test corpus is the test set of each dataset. The speed is measured as the number of sentences in the test set divided by the number of seconds elapsed. Each inference is repeated three times, and the average speed is reported.

The speed comparison can be roughly categorized into two kinds. The first kind is the comparison with previous universal IE models, namely OneIE (Lin et al., 2020) and UIE (Lu et al., 2022), and results are shown in Table 3. Compared with UIE on five chosen datasets, UTC-IE is 19.7 times faster and improves performance by 1.86 on average. Besides, for the joint IE task, UTC-IE is 18.4 times faster than OneIE and improves performance by 2.72 on average. The second kind is the comparison between UTC-IE and SOTA models targeted at each IE task, and results are presented in Table 9. Compared with previous SOTA models, the average performance increments for entity F1, relation F1 and argument F1 are 0.31, 0.94 and 2.15, and the corresponding average speedups are 1.0x, 5.5x and 101.9x. In short, using UTC-IE for IE tasks can not only substantially enhance the performance in most cases, but also significantly speed up inference on almost all datasets.

[...] 11. Besides, we also study how the performance varies with the number of Plusformer layers in Figure 10.
F.1 CNN

Based on our ablations in Table 10 and Table 11, the CNN module in Plusformer contributes most to the performance enhancement. To reveal why CNN is so effective in both span extraction and relational extraction, we first present an intuitive example in Figure 3 to show how CNN helps to extract entities and relations in the RE task. As in Figure 3(a), for NER, the entity $e_2$ can interact with entity $e_1$ and relation $r_1$ through CNN. Besides, for RE, CNN can contribute in two ways. On the one hand, CNN helps the relational token pair to directly gather information from its constituent entities, like $r_1$ in Figure 3(b). On the other hand, the start-to-start and end-to-end relational token pairs, like the two $r_2$ cells, can directly interact with each other through CNN. To quantitatively present the effectiveness of CNN in UTC-IE, we propose further ablations to show how the distance between the relational token pair and its constituent spans affects the relational F1, and how the distance between start-to-start and end-to-end token pairs affects the relational F1.

In this section, we show how the relational F1 (relation F1 in RE and argument F1 in EE) changes when the distance between the relational token pair and its constituent spans varies. For two spans $(s_1, e_1)$ and $(s_2, e_2)$ (we ignore their diagonally symmetric counterparts, since they do not affect the calculation here), the span relation from $(s_1, e_1)$ to $(s_2, e_2)$ is represented by two token pairs $(s_1, s_2)$ and $(e_1, e_2)$. The distance between the two token pairs and their constituent spans is calculated as follows

$$d = \max(|s_2 - e_1|, |s_1 - e_2|) + 1,$$

where the distance $d$ is named the "Span-Rel-Span Distance"; it represents the longest distance between the relational token pairs and their constituent spans. The relation between $d$ and the relational F1 is shown in Figure 4.
Without CNN, the performance for extracting relations between nearby constituent spans drops slightly, while relations between more distant spans are less affected, which supports the claim that CNN is effective at exploiting local dependency to predict relations. From the results, it is clear that without CNN, the performance of Plusformer drops when extracting relations (relation for RE and argument for EE) between nearby spans, while the performance is less affected for relations with more distant constituent spans. We conjecture this is because the receptive field of CNN is limited to a relatively small distance.

As shown in Figure 3(b), if the distance between the start-to-start and end-to-end relational token pairs is small, the CNN should be helpful. To verify this assumption, we first define the "Inner Relational Distance" as follows: for two spans $(s_1, e_1)$ and $(s_2, e_2)$, the relational token pairs are $(s_1, s_2)$ and $(e_1, e_2)$, and the distance between the two relational token pairs is calculated as

$$d = \max(e_1 - s_1, e_2 - s_2) + 1,$$

where $d$ reveals the distance between the start-to-start and end-to-end token pairs; it is in fact determined by the maximum constituent span length. Its relation with the relational F1 is shown in Figure 5. It is clear that most start-to-start token pairs are near their end-to-end token pairs, and CNN takes advantage of this adjacency to make better predictions.
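As a small worked example of the two distances defined in this subsection, the functions below evaluate both formulas for a head span and a tail span. The function names and the example indices are illustrative assumptions; only the two formulas come from the text.

```python
def span_rel_span_distance(head, tail):
    """d = max(|s2 - e1|, |s1 - e2|) + 1 for head (s1, e1) and tail (s2, e2)."""
    (s1, e1), (s2, e2) = head, tail
    return max(abs(s2 - e1), abs(s1 - e2)) + 1

def inner_relational_distance(head, tail):
    """d = max(e1 - s1, e2 - s2) + 1, i.e. the longer of the two span lengths."""
    (s1, e1), (s2, e2) = head, tail
    return max(e1 - s1, e2 - s2) + 1

# Hypothetical spans: head covers tokens 0..1, tail covers tokens 4..6.
head, tail = (0, 1), (4, 6)
d_span_rel_span = span_rel_span_distance(head, tail)  # max(|4 - 1|, |0 - 6|) + 1 = 7
d_inner = inner_relational_distance(head, tail)       # max(1, 2) + 1 = 3
```

With a 3 × 3 kernel and a small number of stacked layers, only pairs with small values of these distances fall inside the convolution's receptive field, which is consistent with the conjecture stated above.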

F.1.3 CNN KERNEL SIZE VS. F1

We study the relation between the kernel size of CNN and F1 performance in Figure 6 . We observe that CNN with kernel size 3 obtains the best performance on almost all datasets and tasks. Specifically, reducing CNN kernel size to 1 significantly harms the performance on all datasets, for CNN will lose the capability of interacting with neighboring token pairs. In contrast, F1 also slightly decreases with larger CNN kernel size. We presume that CNN with a larger kernel size may introduce more noise and harm performance. Therefore, we choose kernel size 3 for all datasets.
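To make the kernel-size argument concrete, the sketch below implements a bias-free two-layer convolution over the token-pair map in plain NumPy and checks which cells can influence a given output cell. It is an illustration only: `conv2d_same` and `cnn_layer` are hypothetical names, zero padding is assumed, and ReLU is used as a stand-in for the unspecified activation $\sigma$.

```python
import numpy as np

def conv2d_same(X, K):
    """Bias-free 2-D convolution with zero padding.

    X: (L, L, c_in); K: (k, k, c_in, c_out) with odd k. Returns (L, L, c_out).
    """
    k = K.shape[0]
    pad = k // 2
    L = X.shape[0]
    Xp = np.pad(X, ((pad, pad), (pad, pad), (0, 0)))
    out = np.zeros((L, L, K.shape[3]))
    for di in range(k):
        for dj in range(k):
            # Shifted view of the padded input, mixed across channels by K[di, dj].
            out += Xp[di:di + L, dj:dj + L, :] @ K[di, dj]
    return out

def cnn_layer(S, K1, K2):
    """S'' = Conv(sigma(Conv(S'))) with ReLU assumed for sigma."""
    return conv2d_same(np.maximum(conv2d_same(S, K1), 0.0), K2)

rng = np.random.default_rng(0)
L, c = 5, 4
S = rng.normal(size=(L, L, c))
S_perturbed = S.copy()
S_perturbed[2, 3] += 1.0  # change a direct neighbor of cell (2, 2)

K3 = [rng.normal(size=(3, 3, c, c)) for _ in range(2)]
K1 = [rng.normal(size=(1, 1, c, c)) for _ in range(2)]
changed_k3 = not np.allclose(cnn_layer(S, *K3)[2, 2], cnn_layer(S_perturbed, *K3)[2, 2])
changed_k1 = not np.allclose(cnn_layer(S, *K1)[2, 2], cnn_layer(S_perturbed, *K1)[2, 2])
```

With kernel size 3 the output at cell (2, 2) reacts to its neighbor, whereas with kernel size 1 each cell is transformed in isolation, which matches the observation that kernel size 1 removes the interaction with neighboring token pairs.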

F.2 IS CNN ALL WE NEED?

Since CNN is so effective in the Plusformer, it is natural to ask whether using only CNN is enough. Therefore, we conduct experiments on a model without the plus-shaped self-attention, which we name CNN-IE. We evaluate CNN-IE on six datasets, and results are listed in Table 10 and Table 11. With the CNN module alone, the model achieves SOTA or near-SOTA performance on all six datasets, which demonstrates the effectiveness of the proposed token-pair decomposition and the CNN module. However, it still lags behind the full UTC-IE model, which reveals the necessity of the PlusAttention.

F.3 POSITION EMBEDDINGS

The RoPE embedding aims to help token pairs be aware of the spatial relationships between each other, and the triangle position embedding tries to enable spans to be informed of their areas in the feature map. From As expected, from Table 10 and Table 11 , if we discard the axis-aware in Plusformer, the average performance of span extraction and relational extraction diminish 0.28 and 0.86, respectively, which reveals the necessity of axis-aware in the PlusAttention module. Besides, we show two case studies of the plus-shaped attention in Figure 9 . The sentences are from the test dataset of ACE2005-Ent and ACE2005-R. Both cases put larger attention scores on informative token pairs. F.5 NUMBER OF PLUSFORMER VS. F1 We study the relation between the number of Plusformer layers and F1 performance in Figure 10 . For the NER datasets, we use two layers of Plusformer, and for the RE and EE we use three. G COMPARISON WITH GLAD token representations to represent a span, and for relations, they concatenate the head and tail span representations. Therefore, in their work, the interaction between spans are weak. In our work, we obtain the feature matrix of all token pairs and add well-designed Plusformer module on top of all token pairs, where token pairs can interact with others thoroughly. In order to prove the superiority of our reformulation and UTC-IE model, we make a fair comparison on several tasks from the GLAD benchmark (Jiang et al., 2020) . We choose 3 additional IE tasks, including Open Information Extraction (OIE), Semantic Role Labeling (SRL) and Aspect Based Sentiment Analysis (ABSA), and NER and RE. We use WLP (Hashimoto et al., 2017) on NER and RE, OIE2016 (Stanovsky & Dagan, 2016 ) on OIE, OntoNotes (Pradhan et al., 2013) on SRL and SemEval14 (Pontiki et al., 2014) on ABSA. The detailed experimental settings are the same as those in GLAD, to ensure a fair comparison. Results are present in Table 12 . 
The table shows that UTC-IE substantially outperforms GLAD on all chosen tasks, with an average improvement of +2.76. Moreover, UTC-IE without Plusformer also surpasses GLAD on all tasks, with an average improvement of +0.84, which supports the superiority of our unified reformulation.



Joint entity relation extraction aims to extract both entities and relations. In our paper, we call it relation extraction (RE) for simplicity.

Event extraction covers trigger extraction and argument extraction, where we first conduct argument span detection and then conduct argument role classification in our architecture.

In this paper, we use relational extraction to represent extracting relations between spans, which has a broader meaning than relation extraction.

The precision and recall for UTC-IE on these datasets can be found in Table 8 in the Appendix.

Removing axis-aware means using the same self-attention parameters for both directions and adding Z^h and Z^v instead of concatenating them.

All IE tasks forbid span boundary clashes.

https://catalog.ldc.upenn.edu/LDC2013T19

https://catalog.ldc.upenn.edu/LDC2005T09

https://catalog.ldc.upenn.edu/LDC2006T06

http://nlp.cs.washington.edu/sciIE/

PL-Marker uses a two-stage pipeline for prediction. Therefore, the time is measured as the total number of seconds elapsed to finish both stages.



Figure 1: An illustration of the token-pair decomposition for IE tasks. Each cell represents one token pair, and it can be classified into pre-defined types. e, r, t, a and rol in the figures mean entity, relation, event trigger, event argument and event role. For span extraction, we use the start-to-end and end-to-start token pairs to pinpoint the span, such as entity spans $e_1$, $e_2$, argument spans $a_1$, $a_2$ and trigger span $t$ (cells with pure color). For relational extraction, we use the start-to-start and end-to-end token pairs to represent the relation, such as $r$ and $rol_1$, $rol_2$ (cells with gradient color). Therefore, all IE tasks can be decomposed into token-pair classifications. After the reformulation, the local dependency and interaction from the plus-shaped orientation (as the orange and blue dotted lines depict) can provide vital information to classify the central token pair.
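The decomposition described in the Figure 1 caption can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `build_token_pair_labels`, the tuple layouts, the token indices and the type names `ORG`, `PER` and `ORG-AFF` are all hypothetical. Rows index the first token of a pair and columns the second.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int, str]       # (start, end, span_type), end inclusive
Relation = Tuple[int, int, str]   # (head_span_index, tail_span_index, relation_type)
Cell = Tuple[int, int]            # (row token index, column token index)


def build_token_pair_labels(
    n_tokens: int, spans: List[Span], relations: List[Relation]
) -> Dict[Cell, Set[str]]:
    """Map each labelled token pair to the set of types placed on it."""
    labels: Dict[Cell, Set[str]] = {}

    def mark(i: int, j: int, label: str) -> None:
        assert 0 <= i < n_tokens and 0 <= j < n_tokens
        labels.setdefault((i, j), set()).add(label)

    # Span extraction: label the start-to-end and end-to-start cells.
    for start, end, span_type in spans:
        mark(start, end, span_type)
        mark(end, start, span_type)

    # Relational extraction: label the start-to-start and end-to-end cells.
    for head_idx, tail_idx, rel_type in relations:
        head_start, head_end, _ = spans[head_idx]
        tail_start, tail_end, _ = spans[tail_idx]
        mark(head_start, tail_start, rel_type)
        mark(head_end, tail_end, rel_type)

    return labels


# Hypothetical 4-token sentence: span 0 covers tokens 0..1, span 1 covers token 2.
spans = [(0, 1, "ORG"), (2, 2, "PER")]
relations = [(1, 0, "ORG-AFF")]   # span 1 is the head, span 0 is the tail
labels = build_token_pair_labels(4, spans, relations)
for cell in sorted(labels):
    print(cell, sorted(labels[cell]))
```

Note that a single-token span maps both of its span cells onto the same diagonal cell, which is why the sketch stores a set of labels per cell.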

Figure 2: An overview of the UTC-IE Model.

LOSS FUNCTION

Finally, we get the final scores as follows:

$$\hat{Y} = \mathrm{MLP}(S'' + S), \qquad \hat{Y}_S,\ \hat{Y}_R = \mathrm{Sigmoid}(\hat{Y}[:, :, :|S|]),\ \hat{Y}[:, :, |S|:]$$
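A minimal sketch of the split in the formula above, assuming the score tensor is stored as a nested Python list indexed `[row][col][channel]`, with the first |S| channels holding span scores and the remaining channels holding relational scores. Following the formula as written, Sigmoid is applied to the span channels only. The names `split_scores` and `n_span_types` and the example values are hypothetical.

```python
import math
from typing import List, Tuple

Tensor3 = List[List[List[float]]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def split_scores(y_hat: Tensor3, n_span_types: int) -> Tuple[Tensor3, Tensor3]:
    """Split per-cell scores into span scores and relational scores."""
    # Span part: first n_span_types channels, squashed with Sigmoid.
    span_scores = [
        [[sigmoid(v) for v in cell[:n_span_types]] for cell in row]
        for row in y_hat
    ]
    # Relational part: remaining channels, left as raw scores.
    rel_scores = [[list(cell[n_span_types:]) for cell in row] for row in y_hat]
    return span_scores, rel_scores


# A 2x2 token-pair grid with 3 channels per cell (2 span types + 1 relational channel).
y_hat = [
    [[0.0, 2.0, -1.0], [0.0, 0.0, 3.0]],
    [[-2.0, 0.0, 0.5], [1.0, 1.0, 1.0]],
]
span_scores, rel_scores = split_scores(y_hat, n_span_types=2)
```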

Figure 3: An intuitive example of the influence of CNN on span extraction and relation extraction.

.2 DISTANCE BETWEEN START-TO-START AND END-TO-END TOKEN PAIRS VS. RELATIONAL F1

Figure 4: Distance between the relational token pair and its constituent spans (Span-Rel-Span Distance) vs. relational F1 with and without CNN in Plusformer. The upper and lower figures are for the RE and EE tasks, respectively. The results show that without CNN, the performance of Plusformer drops when extracting relations (relation for RE and argument for EE) between nearby spans, while the performance is less affected for relations whose constituent spans are farther apart. We conjecture this is because the receptive field of CNN is limited to a relatively small distance.

Figure 5: Distance between two relational token pairs of the same span pair (Inner Relational Distance) VS. relational F1 when with or without CNN in Plusformer. The upper and lower figures are for RE and EE tasks, respectively. Since almost all spans are of a length of less than 5, CNN is valuable to model the interaction between start-to-start and end-to-end relational pairs.

Figure 6: The performance varies with the kernel size of CNN. NER, RE and EE results are listed from top to bottom. CNN with kernel size 3 has the best performance over almost all datasets.

Figure 7: Distance between the relational token pair and its constituent spans (Span-Rel-Span Distance) vs. relational F1 with and without position embeddings in Plusformer. The upper and lower figures are for the RE and EE tasks, respectively. Without position embeddings, the relational performance is lower at almost all "Span-Rel-Span" distances. We presume this is because, with position embeddings, Plusformer can exploit the distance inductive bias to determine the relations.

[Figure 8 legend: spans with the same end token as e; spans that clash with entity e (✘); spans with the same start token as e; potential tail span for r; potential head span for r.]

Figure 8: Examples showing why axis-aware is meaningful for IE tasks. In the left figure, spans in e's vertical direction share the same end token as e, except for spans in the lower triangle, since they clash with e at the back (because "Airway" is the end token of e but the start token of these spans); spans in e's horizontal direction share the same start token as e, except for spans in the lower triangle, because they clash with e at the front (since "US" is the start token of e but the end token of these spans). Therefore, both the axis-aware design and the triangle position embedding are crucial for spans to figure out their relationships with each other. In the right figure, for a relational token pair, the spans from its horizontal direction must be the head span, while the tail span must come vertically. Thus, axis-aware is informative for relational extraction.
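The left-hand case of Figure 8 can be made concrete with a small sketch. It assumes, as in the caption, that a cell in row i and column j with i <= j stands for the span from token i to token j, and that a lower-triangle cell (i > j) stands for the span from token j to token i. The function name `describe_axis_cell` and the token positions are hypothetical, and the sketch only handles a multi-token reference span.

```python
from typing import Tuple


def describe_axis_cell(span: Tuple[int, int], cell: Tuple[int, int]) -> str:
    """Relate a cell on the row or column of `span` to that span."""
    s, e = span
    assert s < e, "this sketch assumes a multi-token reference span"
    i, j = cell
    if (i, j) == (s, e):
        return "the span itself"
    if i == s:  # horizontal direction: same row as the span
        if j >= s:
            return "same start token"
        return "clash: the other span ends at this span's start token"
    if j == e:  # vertical direction: same column as the span
        if i <= e:
            return "same end token"
        return "clash: the other span starts at this span's end token"
    return "off-axis"


# Hypothetical reference span covering tokens 3..4.
reference = (3, 4)
for cell in [(3, 6), (3, 1), (1, 4), (6, 4), (0, 0)]:
    print(cell, "->", describe_axis_cell(reference, cell))
```

The two "clash" branches are exactly the lower-triangle cells that the triangle position embedding is meant to distinguish from the upper-triangle cells on the same axis.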

Figure 9: Two case studies of PlusAttention. The horizontal and vertical attention scores are from the horizontal and vertical self-attention of the last Plusformer layer. The center cells have two colors, one for the horizontal attention scores and the other for the vertical attention scores. For NER, the center cell attends more to other entities. For RE, the center relational cell attends more to its constituent entities.

Figure 10: The performance varies with the number of Plusformer layers. NER, RE and EE results are listed from top to bottom. For the NER tasks, the performance peaks at two Plusformer layers, and for RE and EE, the performance plateaus after three Plusformer layers.

where the superscripts $h$ and $t$ denote the head and tail entities, $t_i^h, t_i^t \in S_e$, $r_i \in R_r$, and $S_e$, $R_r$ are the pre-defined entity types and relation types. Therefore, in RE, $S = S_e$ and $R = R_r$.

Overall F1 on single IE tasks. Results of UTC-IE are the average of 5 runs, and the subscript is the standard deviation (e.g., $93.45_{24}$ means 93.45±0.24). Datasets marked with * have nested entities. Results marked with † are from Yan et al. (2022). ⋆ means results from their GitHub repo or our reproduction. ♣ means that UTC-IE without Plusformer surpasses previous SOTA performance.

Results on joint IE. UTC-IE single shows results from models trained separately on NER, RE and EE, while UTC-IE joint shows results from a jointly trained model. ♣ means that UTC-IE without Plusformer surpasses previous SOTA performance.

RESULTS ON SINGLE IE TASKS

In this section, we report the UTC-IE performance on each single IE task. Results are shown in Table 1

, the performance of UTC-IE single is from the entity F1 of NER, relation F1 of RE, trigger F1 of EE and argument F1 of EE, respectively. Comparing UTC-IE single with UTC-IE joint shows that jointly learning these three tasks consistently improves performance on the 2 joint IE datasets. Moreover, UTC-IE joint outperforms previous SOTA joint IE models in Table 2: the average performance gain is +0.69 on ACE05-E+ and +0.75 on ERE-EN. Specifically, UTC-IE joint increases the average performance of relational extraction by +1.30. Thus, by unifying different IE tasks through our task decomposition, Plusformer can enjoy the benefit of multi-task learning and achieve better performance than previous SOTA models.

4.4 SPEED COMPARISON

To get a sense of the speed advantage of UTC-IE, we compare the inference speed of UTC-IE with previous unified models and task-specific SOTA models. The former comparison is presented in Table 3 and the latter is in Appendix E. Compared with the generative UIE (Lu et al.

The F1 and efficiency comparison with UIE and OneIE. "Ent.", "Rel." and "Arg." denote F1 on the corresponding test sets. "Speed" is measured in "sentence/s" during inference. The improvement shows the changes in performance and speed.

Ablation studies on the NER, RE and EE datasets. CNN-IE is similar to UTC-IE except that it lacks PlusAttention. Underlines mark the factor whose removal causes the largest drop. ♣ means that CNN-IE surpasses previous SOTA performance.

Based on the ablation, CNN is the most useful component across all IE tasks. The reason behind this improvement is that once token pairs are organized in the square feature map, the spatial correlations between neighboring token pairs become apparent, and CNN excels at exploiting these local interactions. A more comprehensive analysis of CNN in Plusformer is given in Appendix F.1. To deepen our understanding of UTC-IE, we try another variant of Plusformer where PlusAttention is discarded, and we name this variant CNN-IE. The bottom line of Table 4 shows that the CNN-IE model can surpass or approach previous SOTA performance on almost all datasets, which demonstrates the universality of our proposed task formulation.

However, CNN is not a panacea for UTC-IE. From Table 4, removing position embeddings or axis-aware⁵ from UTC-IE leads to an average performance degradation of 0.39 or 0.44, respectively. Moreover, comparing CNN-IE with UTC-IE, the average performance shrinks from 74.53 to 74.17 if PlusAttention is removed from Plusformer, which means the plus-shaped self-attention is necessary. In addition, we present some intuitive examples and deeper analysis of position embeddings and axis-aware in Appendix F.3 and F.4.

5 RELATED WORK

Information extraction, which consists of named entity recognition, relation extraction and event extraction, has long been a fundamental and well-researched task in the natural language processing (NLP) field. Previous research mainly focuses on one or two tasks. Recently, building joint neural models for unified IE tasks has attracted increasing attention. Some of them incorporate graphs into the IE structure. Wadden et al. (2019) propose a unified framework called DYGIE++ to extract entities, relations and events by leveraging span representations via span graph updates. Lin et al. (2020) and Nguyen et al. (2021) extend DYGIE++ by incorporating global features to extract cross-task and cross-instance interactions with multi-task learning. In addition to the graph-based models mentioned above, other studies tackle general IE with generative models. Paolini et al. (2021) construct a framework called TANL, which enhances the generation model using augmented language methods. Moreover, Lu et al. (2022) regard IE as a text-to-structure generation task and leverage a prompt mechanism.

only leverage span representations to locate entities and simply calculate the interaction between entity pairs, while others (Wang et al., 2020; Zhong & Chen, 2021) encode span pair information explicitly to extract relations. With regard to event extraction, as far as we know, there is little work on injecting span information into EE explicitly. Wadden et al. (2019) leverage span representations for general IE, but their model is complicated and only considers spans at the embedding layer without further interaction. Conceptually, Jiang et al. (

Given four spans $p_1 = (s_1, e_1)$, $p_2 = (s_2, e_2)$, $p_3 = (s_3, e_3)$ and $p_4 = (s_4, e_4)$, suppose $p_4$ has relation $r$ with $p_1$ and $p_2$, and no relation exists between $p_4$ and $p_3$. Then $Y[s_4, s_1, r] = Y[e_4, e_1, r] = 1$ and $Y[s_4, s_2, r] = Y[e_4, e_2, r] = 1$. However, if $s_1 = s_3$ and $e_2 = e_3$, namely $p_1$ shares its start token with $p_3$ and $p_2$ shares its end token with $p_3$, then from $Y[s_4, s_1, r] = Y[e_4, e_2, r] = 1$ we get $Y[s_4, s_3, r] = Y[e_4, e_3, r] = 1$, and the decoding process will mistakenly conclude that $p_4$ has relation $r$ with $p_3$. However, this situation should be rare, and none is found in the tested datasets.
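The ambiguity described above can be reproduced with concrete token indices. The following is a minimal sketch for a single relation type $r$; the span positions and the helper name `relation_cells` are hypothetical, chosen only so that $s_1 = s_3$ and $e_2 = e_3$ hold.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]   # (start, end), end inclusive
Cell = Tuple[int, int]   # (row token index, column token index)


def relation_cells(head: Span, tail: Span) -> Set[Cell]:
    """Start-to-start and end-to-end cells that encode head -> tail."""
    (head_start, head_end), (tail_start, tail_end) = head, tail
    return {(head_start, tail_start), (head_end, tail_end)}


# p1 and p3 share a start token (0); p2 and p3 share an end token (3).
p1, p2, p3, p4 = (0, 1), (2, 3), (0, 3), (5, 6)

# Gold annotation: p4 has relation r with p1 and with p2, but not with p3.
active: Set[Cell] = relation_cells(p4, p1) | relation_cells(p4, p2)

# Decoding accepts a tail span when both of its relational cells are active.
decoded: List[Span] = [
    tail for tail in (p1, p2, p3) if relation_cells(p4, tail) <= active
]
print(decoded)
```

Here `p3` is accepted although it was never annotated, because its start-to-start cell is borrowed from `p1` and its end-to-end cell from `p2`.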

$\bar{Y}_R = \hat{Y}_R[:, :, :|R| + 1] > \hat{Y}_R[:, :, |R| + 1]$, then we pair spans to check whether they form relations. Take two spans $(s_1, e_1)$ and $(s_2, e_2)$ for instance: if $\bar{Y}_R[s_1, s_2, r] = \bar{Y}_R[e_1, e_2, r] = 1$, we claim the first span has relation $r$ with the second span. For the RE task, we pair all entity spans to check if they form relations. For the EE task, we pair the trigger spans and argument spans to check if they form a role relationship. For the joint IE task, we pair entity spans to check if they form relations, and we pair the trigger spans and entity spans (because all argument spans are entity spans) to check if they form a role relationship.
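The pairing step above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' code: relational scores are stored as a nested list indexed `[row][col][channel]`, the last channel of each cell is taken to be the threshold score that the relation channels are compared against (0-indexed here, whereas the text's indexing convention is not fully specified), spans have already been extracted, and a span is never paired with itself. The name `decode_relations` is hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end), end inclusive


def decode_relations(
    rel_scores: List[List[List[float]]],
    spans: List[Span],
    n_rel_types: int,
) -> List[Tuple[int, int, int]]:
    """Return (head_span_index, tail_span_index, relation_index) triples."""
    found = []
    for h, (head_start, head_end) in enumerate(spans):
        for t, (tail_start, tail_end) in enumerate(spans):
            if h == t:
                continue  # assumption: no relation between a span and itself
            start_cell = rel_scores[head_start][tail_start]
            end_cell = rel_scores[head_end][tail_end]
            for r in range(n_rel_types):
                # Both relational cells must beat their own threshold channel.
                if start_cell[r] > start_cell[-1] and end_cell[r] > end_cell[-1]:
                    found.append((h, t, r))
    return found


# 4 tokens, 1 relation type plus 1 threshold channel per cell.
n = 4
rel_scores = [[[0.0, 0.0] for _ in range(n)] for _ in range(n)]
rel_scores[2][0][0] = 1.0   # start-to-start cell of span 1 -> span 0
rel_scores[2][1][0] = 1.0   # end-to-end cell of span 1 -> span 0
spans = [(0, 1), (2, 2)]
found = decode_relations(rel_scores, spans, n_rel_types=1)
```

For EE or joint IE, the same loop would run over trigger spans as heads and argument (entity) spans as tails instead of over all span pairs.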



The hyper-parameters used in each dataset.

Complete results for precision (P), recall (R) and F1 (F) of UTC-IE on different tasks. Bold results represent the most improved metrics on UTC-IE without Plusformer between precision and recall.

76.2 73.5 55.5 57.6 56.5 70.8 76.1 73.4 57.8 57.6 57.7 58.1 62.5 60.2 54.5 50.7 52.5
-Plusformer 70.1 76.0 72.9 52.5 58.7 55.4 70.5 75.5 72.9 55.6 57.7 56.6 56.0 63.0 59.3 52.3 50.3 51.3

The F1 and inference time comparison between UTC-IE and current SOTA models on each IE task. "Ent.", "Rel." and "Arg." denote F1 on the corresponding test sets. "Speed" is measured in "sentence/s" during inference. Improvement shows the changes in performance and speed.

Ablation study for span extraction. Underlines mark the factor whose removal causes the largest drop. ♣ means that CNN-IE surpasses previous SOTA performance.

ablation, we will choose two datasets for each IE task to study the effect of each component in Plusformer. We separately list the performance for span extraction (including entity extraction in NER and RE, trigger extraction in EE) in Table 10 and relational extraction (including relation extraction in RE and argument extraction in EE) in Table

Ablation study for relational extraction. Underlines mark the factor whose removal causes the largest drop. ♣ means that CNN-IE surpasses previous SOTA performance.

Table 10 and Table 11, the position embeddings enhance the span extraction

Comparison of results between the similar work GLAD and UTC-IE on NER, RE, SRL and ABSA. We use BERT-base as the base model for a fair comparison. GLAD performs NER and RE jointly on the WLP dataset and reports them separately. We use the same settings as theirs. ♣ means that UTC-IE without Plusformer surpasses previous SOTA performance.


We follow the data preprocessing in Luan et al. (2019) to split ACE05-R and SciERC into train, dev and test sets.

In typical RE, it is crucial to distinguish which entity comes first (head entity) and which comes next (tail entity). For a symmetric relational instance, the relation holds in both the head-to-tail and tail-to-head directions. There is one such relation type in ACE05-R and two in SciERC. Some papers (Wang et al., 2021; Ye et al., 2022) regard each symmetric relational example as two directed relations, while others regard it as one relation. We find that this setting hugely influences the performance. Therefore, we name this setting Symmetric Relation Extraction and name the corresponding datasets ACE05-R+ and SciERC+.

Event extraction. We evaluate UTC-IE on two widely used event extraction datasets, ACE2005 (Doddington et al., 2004) and ERE (Song et al., 2015). Following the prior preprocessing steps (Wadden et al., 2019; Lin et al., 2020; Lu et al., 2021), we obtain three datasets: ACE05-E, ACE05-E+ and ERE-EN. Compared with ACE05-E, ACE05-E+ additionally takes relation arguments, pronouns and multi-token event triggers into consideration. We use the same train/dev/test split as Lu et al. (2021) for all datasets to ensure a fair comparison. Furthermore, we also use ACE05-E+ and ERE-EN for joint IE, since they have annotations for all IE tasks.

C.2 BASELINES

TANL (Paolini et al., 2021) and UIE (Lu et al., 2022) are both unified information extraction models that work in a generative way, with different input and output formats. TANL uses T5-base as the backbone model, while UIE uses T5-large. We compare our model with them on every IE task. For TANL, we report single-task results because our model is trained on each task separately. For UIE, we report results with pre-training, which perform better.
In addition to these two baselines, each task is also compared with a series of recently proposed task-specific methods as follows.

CNN-IE is the baseline model we design to demonstrate the necessity of PlusAttention. The only difference between CNN-IE and UTC-IE is that the former omits the PlusAttention in Figure 2. We tune the number of CNN layers in CNN-IE from 2 to 6 and report the best results.

Named entity recognition. We compare our model's performance on NER with several recently proposed NER methods.

• BART-NER (Yan et al., 2021a)
• CNN-NER (Yan et al., 2022) utilizes CNN to model local spatial correlations between spans and surpasses recently proposed methods on nested NER. We report results using the RoBERTa-base model.

Relation extraction.

For relation extraction, we compare our model with several SOTA models.

• UniRE (Wang et al., 2021) jointly extracts entities and relations using a table containing all word pairs.
• PURE (Zhong & Chen, 2021) adopts a pipeline approach to solve NER and RE independently, using distinct contextual representations for entities and relations.
• PFN (Yan et al., 2021b) claims that some information should be shared between named entity recognition and relation extraction, while other information should be independent. The authors propose PFN to model two-way interaction (partition and filter) between the two tasks.
• PL-Marker (Ye et al., 2022) considers interactions between spans by strategically packing the markers in the encoder.

Table 6: Pre-trained models used by all IE baselines. Abbreviations before "-" denote pre-trained model names: "BA" means BART, "BE" means BERT, "RoB" means RoBERTa, "ALB" means ALBERT and "DeB" means DeBERTa. The letters after "-" denote the size of the model: base ("b"), large ("l") or xx-large ("xxl"). The number of parameters of each pre-trained model is as follows: BE-b (110M), BE-l (340M), RoB-b (125M), ALB-xxl (233M), DeB-l (390M), T5-b (220M), T5-l (770M), BA-l (406M).

Named Entity Recognition        CoNLL03   OntoNotes   ACE04*   ACE05-Ent*   GENIA*
BART-NER (Yan et al., 2021a)    BA-l      BA-l        BA-l     BA-l         BA-l
TANL (Paolini et al., 2021)     T5-b      T5-b        -        T5-b         T5-b
W2NER (Li et al., 2022)         BE-l      BE-l        BE-l     BE-l         BioBERT
UIE (Lu et al., 2022)           T5-l      -           T5-l     T5-l         -
BS (Zhu & Li, 2022)             RoB

