EDGEFORMERS: GRAPH-EMPOWERED TRANSFORMERS FOR REPRESENTATION LEARNING ON TEXTUAL-EDGE NETWORKS

Abstract

Edges in many real-world social/information networks are associated with rich text information (e.g., user-user communications or user-product reviews). However, mainstream network representation learning models focus on propagating and aggregating node attributes, lacking specific designs to utilize text semantics on edges. While there exist edge-aware graph neural networks, they directly initialize edge attributes as a feature vector, which cannot fully capture the contextualized text semantics of edges. In this paper, we propose Edgeformers, a framework built upon graph-enhanced Transformers, to perform edge and node representation learning by modeling texts on edges in a contextualized way. Specifically, in edge representation learning, we inject network information into each Transformer layer when encoding edge texts; in node representation learning, we aggregate edge representations through an attention mechanism within each node's ego-graph. On five public datasets from three different domains, Edgeformers consistently outperform state-of-the-art baselines in edge classification and link prediction, demonstrating their efficacy in learning edge and node representations, respectively.

1. INTRODUCTION

Networks are ubiquitous and are widely used to model interrelated data in the real world, such as user-user and user-item interactions on social media (Kwak et al., 2010; Leskovec et al., 2010) and recommender systems (Wang et al., 2019; Jin et al., 2020) . In recent years, graph neural networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019) have demonstrated their power in network representation learning. However, a vast majority of GNN models leverage node attributes only and lack specific designs to capture information on edges. (We refer to these models as node-centric GNNs.) Yet, in many scenarios, there is rich information associated with edges in a network. For example, when a person replies to another on social media, there will be a directed edge between them accompanied by the response texts; when a user comments on an item, the user's review will be naturally associated with the user-item edge. To utilize edge information during network representation learning, some edge-aware GNNs (Gong & Cheng, 2019; Jiang et al., 2019; Yang & Li, 2020; Jo et al., 2021) have been proposed. Nevertheless, these studies assume the information carried by edges can be directly described as an attribute vector. This assumption holds well when edge features are categorical (e.g., bond features in molecular graphs (Hu et al., 2020) and relation features in knowledge graphs (Schlichtkrull et al., 2018) ). However, effectively modeling free-text edge information in edge-aware GNNs has remained elusive, mainly because bag-of-words and context-free embeddings (Mikolov et al., 2013) used in previous edge-aware GNNs cannot fully capture contextualized text semantics. For example, "Byzantine" in history book reviews and "Byzantine" in distributed system papers should have different meanings given their context, but they correspond to the same entry in a bag-of-words vector and have the same context-free embedding. 
To accurately capture contextualized semantics, a straightforward idea is to integrate pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) with GNNs. In node-centric GNN studies, this idea has been instantiated by a PLM-GNN cascaded architecture (Fang et al., 2020; Li et al., 2021; Zhu et al., 2021), where text information is first encoded by a PLM and then aggregated by a GNN. However, such architectures process text and graph signals one after the other, and fail to simultaneously model the deep interactions between the two types of information. This weakens the text encoder, because network signals are often strong indicators of text semantics. For example, a brief political tweet may become more comprehensible if the stances of the two communicators are known. To deeply couple PLMs and GNNs, the recent GraphFormers model (Yang et al., 2021) proposes a GNN-nested PLM architecture to inject network information into the text encoding process. They introduce GNNs nested in between Transformer layers so that the center node encoding not only leverages its own textual information, but also aggregates the signals from its neighbors. Nevertheless, they assume that only nodes are associated with textual information, and their model cannot be easily adapted to handle text-rich edges. To effectively model textual and network structure information via a unified encoder architecture, in this paper, we propose a novel network representation learning framework, Edgeformers, which leverages graph-enhanced Transformers to model edge texts in a contextualized way. Edgeformers include two architectures, Edgeformer-E and Edgeformer-N, for edge and node representation learning, respectively. In Edgeformer-E, we add virtual node tokens to each Transformer layer inside the PLM when encoding edge texts.
Such an architecture goes beyond the PLM-GNN cascaded architecture and enables deep, layer-wise interactions between network and text signals to produce edge representations. In Edgeformer-N, we aggregate the network-and-text-aware edge representations to obtain node representations through an attention mechanism within each node's ego-graph. The two architectures can be trained via edge classification (which relies on good edge representations) and link prediction (which relies on good node representations) tasks, respectively. To summarize, our main contributions are as follows:
• Conceptually, we identify the importance of modeling text information on network edges and formulate the problem of representation learning on textual-edge networks.
• Methodologically, we propose Edgeformers (i.e., Edgeformer-E and Edgeformer-N), two graph-enhanced Transformer architectures, to deeply couple network and text information in a contextualized way for edge and node representation learning.
• Empirically, we conduct experiments on five public datasets from different domains and demonstrate the superiority of Edgeformers over various baselines, including node-centric GNNs, edge-aware GNNs, and PLM-GNN cascaded architectures.

2. PRELIMINARIES

2.1 TEXTUAL-EDGE NETWORKS

In a textual-edge network, each edge is associated with texts. We view the texts on each edge as a document, and all such documents constitute a corpus D. Since the major goal of this work is to explore the effect of textual information on edges, we assume there is no auxiliary information (e.g., categorical or textual attributes) associated with nodes in the network.

Definition 1 (Textual-Edge Networks) A textual-edge network is defined as G = (V, E, D), where V, E, and D represent the sets of nodes, edges, and documents, respectively. Each edge e_ij ∈ E is associated with a document d_ij ∈ D.

To give an example of textual-edge networks, consider a review network (e.g., Amazon (He & McAuley, 2016)) where nodes are users and items. If a user v_i writes a review about an item v_j, there will be an edge e_ij connecting them, and the review text will be the associated document d_ij.

2.2 TRANSFORMER

Many PLMs (e.g., BERT (Devlin et al., 2019)) adopt a multi-layer Transformer architecture (Vaswani et al., 2017) to encode texts. Each Transformer layer utilizes a multi-head self-attention mechanism to obtain a contextualized representation of each text token. Specifically, let H^(l) = [h^(l)_1, h^(l)_2, ..., h^(l)_n] denote the output sequence of the l-th Transformer layer, where h^(l)_i ∈ R^d is the hidden representation of the text token at position i. Then, in the (l+1)-th Transformer layer, the multi-head self-attention (MHA) is calculated as

    MHA(H^(l)) = ∥_{t=1}^{k} head_t(H^(l)),    (1)
    head_t(H^(l)) = V^(l)_t · softmax( K^(l)T_t Q^(l)_t / √(d/k) ),    (2)
    Q^(l)_t = W^(l)_{Q,t} H^(l),  K^(l)_t = W^(l)_{K,t} H^(l),  V^(l)_t = W^(l)_{V,t} H^(l),    (3)

where W_{Q,t}, W_{K,t}, and W_{V,t} are the query, key, and value projection matrices to be learned by the model, k is the number of attention heads, and ∥ is the concatenation operation.
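The multi-head self-attention of Eqs. (1)-(3) can be sketched as follows. This is a minimal NumPy illustration (written row-wise rather than column-wise, which is mathematically equivalent); the per-head weight matrices and inputs are illustrative, not taken from the paper:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(H, W_Q, W_K, W_V):
    """H: (n, d) token states; W_Q/W_K/W_V: lists of k per-head (d, d/k) matrices.
    Returns the concatenation of the k heads, shape (n, d), as in Eqs. (1)-(3)."""
    k, d_head = len(W_Q), W_Q[0].shape[1]
    heads = []
    for t in range(k):
        Q, K, V = H @ W_Q[t], H @ W_K[t], H @ W_V[t]       # (n, d/k) each
        A = softmax(Q @ K.T / np.sqrt(d_head), axis=-1)     # (n, n) attention weights
        heads.append(A @ V)                                 # head_t, (n, d/k)
    return np.concatenate(heads, axis=-1)                   # (n, d)
```

Each output row is a contextualized representation of the token at that position, mixing information from all other tokens in proportion to the learned attention weights.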

2.3. PROBLEM DEFINITIONS

Our general goal is to learn meaningful edge and node embeddings in textual-edge networks so as to benefit downstream tasks. Specifically, we consider the following two tasks, focusing on edge representation learning and node representation learning, respectively.

The first task is edge classification, which relies on learning a good representation h_e of an edge e ∈ E. We assume each edge e_ij belongs to a category y ∈ Y. The category can be indicated by its associated text d_ij and/or the nodes v_i and v_j. For example, in the Amazon review network, Y = {1-star, 2-star, ..., 5-star}. The category of e_ij reflects how satisfied the user v_i is with the item v_j, which may be expressed by the sentiment of d_ij and/or implied by v_i's preference and v_j's quality. Given a review, the task is to predict its category based on the review text and user/item information.

Definition 2 (Edge Classification) In a textual-edge network G = (V, E, D), we can observe the categories of some edges E_train ⊆ E. Given an edge e_ij ∈ E \ E_train, predict its category y ∈ Y based on d_ij ∈ D and v_i, v_j ∈ V.

The second task is link prediction, which relies on learning an accurate representation h_{v_i} of a node v_i ∈ V. Given two nodes v_i and v_j, the task is to predict whether there is an edge between them. Note that, unlike edge classification, we no longer have the text information d_ij (because we do not even know whether e_ij exists). Instead, we need to exploit other edges (local network structure) involving v_i or v_j, as well as their text, to learn node representations h_{v_i} and h_{v_j}. For example, in the Amazon review network, we aim to predict whether a user will be satisfied with a product according to the user's reviews of other products and the item's reviews from other users.

Definition 3 (Link Prediction) In a textual-edge network G = (V, E, D), we can observe some edges E_train ⊆ E and their associated text. Given v_i, v_j ∈ V where e_ij ∉ E_train, predict whether e_ij ∈ E.

3. PROPOSED METHOD

In this section, we present our Edgeformers framework. Based on the two tasks mentioned in Section 2.3, we first introduce how we conduct edge representation learning by jointly considering text and network information via a Transformer-based architecture (Edgeformer-E). Then, we illustrate how to perform node representation learning using the edge representation learning module as building blocks (Edgeformer-N). The overview of Edgeformers is shown in Figure 1 .

3.1. EDGE REPRESENTATION LEARNING (EDGEFORMER-E)

Network-aware Edge Text Encoding with Virtual Node Tokens. Encoding d_ij in a textual-edge network is different from encoding plain text, mainly because edge texts are naturally accompanied by network structure information, which can provide auxiliary signals. Given that text semantics can be well captured by a multi-layer Transformer architecture (Devlin et al., 2019), we propose a simple and effective way to inject network signals into the Transformer encoding process. The key idea is to introduce virtual node tokens. Given an edge e_ij = (v_i, v_j) and its associated text d_ij, let H^(l)_{e_ij} ∈ R^{d×n} denote the output representations of all text tokens in d_ij after the l-th model layer (l ≥ 1). In each layer, we introduce two virtual node tokens to represent v_i and v_j, respectively. Their embeddings, denoted as z^(l)_{v_i}, z^(l)_{v_j} ∈ R^d, are concatenated to the text token hidden states as follows:

    H̃^(l)_{e_ij} = z^(l)_{v_i} ∥ z^(l)_{v_j} ∥ H^(l)_{e_ij}.    (4)

After the concatenation, H̃^(l)_{e_ij} contains information from both e_ij's associated text d_ij and the two involved nodes v_i and v_j. To let text token representations carry node signals, we adopt a multi-head attention mechanism:

    MHA(H^(l)_{e_ij}, H̃^(l)_{e_ij}) = ∥_{t=1}^{k} head_t(H^(l)_{e_ij,t}, H̃^(l)_{e_ij,t}),    (5)
    Q^(l)_t = W^(l)_{Q,t} H^(l)_{e_ij,t},  K^(l)_t = W^(l)_{K,t} H̃^(l)_{e_ij,t},  V^(l)_t = W^(l)_{V,t} H̃^(l)_{e_ij,t}.    (6)

In Eq. (5), the multi-head attention is asymmetric (i.e., the keys K and values V are augmented with virtual node embeddings but the queries Q are not) to avoid network information being overwritten by text signals. This design has been used in existing studies (Yang et al., 2021) and offers better effectiveness than the original self-attention mechanism according to our experiments in Section 4.2. The output of MHA includes updated node-aware representations of text tokens.
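A single-head NumPy sketch of this asymmetric attention, where only the keys and values see the two virtual node tokens while queries come from text tokens alone; all weights and dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def asymmetric_node_aware_attention(H, z_vi, z_vj, W_Q, W_K, W_V):
    """H: (n, d) text-token states of edge e_ij; z_vi, z_vj: (d,) virtual node tokens.
    Queries come from text tokens only; keys/values from [z_vi; z_vj; H]."""
    H_aug = np.vstack([z_vi, z_vj, H])                      # (n+2, d) augmented sequence
    Q = H @ W_Q                                             # (n, d_head): text tokens query
    K, V = H_aug @ W_K, H_aug @ W_V                         # (n+2, d_head): node-aware keys/values
    A = softmax(Q @ K.T / np.sqrt(W_Q.shape[1]), axis=-1)   # (n, n+2) attention weights
    return A @ V                                            # (n, d_head): node-aware text states
```

Because the node tokens appear only on the key/value side, each text token can absorb network signals, but the node embeddings themselves are never diluted by text at this step.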
Then, following the Transformer architecture (Vaswani et al., 2017), the updated representations go through a feed-forward network (FFN) to finish the (l+1)-th model layer encoding. Formally,

    H'^(l)_{e_ij} = Normalize(H^(l)_{e_ij} + MHA(H^(l)_{e_ij}, H̃^(l)_{e_ij})),    (7)
    H^(l+1)_{e_ij} = Normalize(H'^(l)_{e_ij} + FFN(H'^(l)_{e_ij})),    (8)

where Normalize(·) is the layer normalization function. After L model layers, the final representation of the [CLS] token is used as the edge representation of e_ij, i.e., h_{e_ij} = H^(L)_{e_ij}[CLS].

Representation of Virtual Node Tokens. The virtual node representation z^(l)_{v_i} used in Eq. (4) is obtained by a layer-specific mapping of the initial node embedding z^(0)_{v_i}. Formally,

    z^(l)_{v_i} = W^(l)_n z^(0)_{v_i},    (9)

where W^(l)_n ∈ R^{d×d'} is the mapping matrix for the l-th layer. The large population of nodes would introduce a large number of parameters into our framework, which may finally lead to model underfitting. As a result, in Edgeformers, we set the initial node embeddings to be low-dimensional (e.g., z^(0)_{v_i} ∈ R^64) and project them to the high-dimensional token representation space (e.g., z^(l)_{v_i} ∈ R^768). Note that it is possible to go beyond the linear mapping in Eq. (9) and use structure-aware encoders such as GNNs to obtain z^(l)_{v_i}; we leave such extensions for future studies.
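The projection in Eq. (9) keeps the per-node parameter count small: each node stores only a d' = 64 vector, and one shared (d × d') matrix per layer lifts it into the 768-dimensional token space. A sketch, with random initializations standing in for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
num_nodes, d_low, d_model, num_layers = 1000, 64, 768, 12

# z^(0)_v: shared low-dimensional node embedding table
Z0 = rng.normal(scale=0.02, size=(num_nodes, d_low))
# W_n^(l): one layer-specific projection matrix per model layer
W_n = rng.normal(scale=0.02, size=(num_layers, d_model, d_low))

def virtual_node_token(v, layer):
    """z^(l)_v = W_n^(l) z^(0)_v (Eq. 9): lift node v into layer l's token space."""
    return W_n[layer] @ Z0[v]
```

Storing the table at d' = 64 costs num_nodes × 64 node parameters instead of num_nodes × 768; the layer-specific projections add only 12 × 768 × 64 parameters, independent of the number of nodes.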

3.2. TEXT-AWARE NODE REPRESENTATION LEARNING (EDGEFORMER-N)

In this section, we first discuss how to perform text-aware node representation learning by taking the aforementioned edge representation learning module (i.e., Edgeformer-E) as a building block. Then, we propose to enhance the edge representation learning module with the target node's local network structure.

Aggregating Edge Representations. Since the edge representations learned by Edgeformer-E capture both text semantics and network structure information, a straightforward way to obtain a node representation is to aggregate the representations of all edges involving the node. Given a node v_i, its representation h_{v_i} is given by

    h_{v_i} = AGG({h_{e_ij} | e_ij ∈ N_e(v_i)}),    (10)

where N_e(v_i) is the set of edges containing v_i. AGG(·) can be any permutation-invariant function such as mean(·) or max(·). Here, we instantiate AGG(·) with an attention-based aggregation:

    α_{e_ij,v_i} = softmax(h^T_{e_ij} W_s z^(0)_{v_i}),  h_{v_i} = Σ_{e_ij ∈ N_e(v_i)} α_{e_ij,v_i} h_{e_ij},    (11)

where W_s ∈ R^{d×d'} is a learnable scoring matrix.

Enhancing Edge Representations with the Node's Local Network Structure. Since we are aggregating information from multiple edges, it is intuitive that they can mutually improve each other's representations by providing auxiliary semantic signals. For example, given a conversation about "Transformers" and the participants' other conversations centered around "machine learning", it is more likely that the term "Transformers" refers to a deep learning architecture rather than a character in the movie. To implement this intuition in the edge representation learning module, we introduce a third virtual token hidden state ĥ^(l)_{e_ij|v_i} during edge encoding:

    H̃^(l)_{e_ij|v_i} = z^(l)_{v_i} ∥ z^(l)_{v_j} ∥ ĥ^(l)_{e_ij|v_i} ∥ H^(l)_{e_ij},    (12)

where ĥ^(l)_{e_ij|v_i} is the contextualized representation of e_ij given the target node v_i's local network structure, obtained by letting the [CLS] hidden states of all edges in N_e(v_i) interact with each other via MHA.

Connection between Edgeformer-N and GNNs. According to Figure 1, Edgeformer-N adopts a Transformer-based architecture. Meanwhile, it can also be viewed as a GNN model.
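The attention-based edge aggregation above can be sketched in a few lines of NumPy; the scoring matrix and inputs are illustrative placeholders for the learned parameters:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_edge_representations(H_e, z0_v, W_s):
    """H_e: (N, d) representations of the N edges in N_e(v); z0_v: (d',) initial
    node embedding; W_s: (d, d') scoring matrix. Returns h_v, a (d,) vector."""
    alpha = softmax(H_e @ (W_s @ z0_v))   # (N,) attention weights over edges
    return alpha @ H_e                     # (d,) weighted sum of edge representations
```

Since the weights form a convex combination, each coordinate of the node representation lies between the minimum and maximum of the corresponding coordinate across the node's edges.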
Indeed, GNN models (Wu et al., 2020; Yang et al., 2020) mainly adopt a propagation-aggregation paradigm to obtain node representations:

    a^(l-1)_{ij} = PROP^(l)(h^(l-1)_i, h^(l-1)_j), ∀j ∈ N(i);  h^(l)_i = AGG^(l)(h^(l-1)_i, {a^(l-1)_{ij} | j ∈ N(i)}).    (14)

Analogously, in Edgeformer-N, Eq. (13) can be treated as the propagation function PROP^(l), and the aggregation step AGG^(l) is the combination of Eqs. (12), (7), (8), and (10).
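The generic propagation-aggregation paradigm of Eq. (14) can be written as a tiny plain-Python sketch; the mean aggregator below is one illustrative choice of PROP/AGG, not the one Edgeformer-N uses:

```python
def gnn_layer(h, neighbors, prop, agg):
    """One generic GNN layer: a_ij = PROP(h_i, h_j) for each neighbor j of i,
    then h_i' = AGG(h_i, {a_ij}), following the propagation-aggregation paradigm."""
    return {i: agg(h[i], [prop(h[i], h[j]) for j in nbrs])
            for i, nbrs in neighbors.items()}

# Illustrative instantiation: propagate the neighbor state, aggregate by mean.
prop = lambda h_i, h_j: h_j
agg = lambda h_i, msgs: (h_i + sum(msgs)) / (1 + len(msgs))
```

In Edgeformer-N, `prop` corresponds to the MHA over neighbor-edge [CLS] states (Eq. 13) and `agg` to the stack of Eqs. (12), (7), (8), and (10), so one Edgeformer-N layer is a (much richer) instance of this same template.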

3.3. TRAINING

As mentioned in Section 2.3, we consider edge classification and link prediction as the two tasks to train Edgeformer-E and Edgeformer-N, respectively.

Edge Classification. For Edgeformer-E (i.e., edge representation learning), we adopt supervised training with the following objective function:

    L_e = -Σ_{e_ij} [ y^T_{e_ij} log ŷ_{e_ij} + (1 - y_{e_ij})^T log(1 - ŷ_{e_ij}) ],    (15)

where ŷ_{e_ij} = f(h_{e_ij}) is the predicted category distribution of e_ij and f(·) is a learnable classifier.

Link Prediction. For Edgeformer-N (i.e., node representation learning), we conduct unsupervised training with the following objective function:

    L_n = Σ_{v∈V} Σ_{u∈N_n(v)} -log [ exp(h^T_v h_u) / ( exp(h^T_v h_u) + Σ_{u'} exp(h^T_v h_{u'}) ) ].    (16)

Here, N_n(v) is the set of v's node neighbors and u' denotes a random negative sample. In our implementation, we utilize "in-batch negative samples" (Karpukhin et al., 2020) to reduce encoding and training costs.

Overall Algorithm. The workflows of our edge representation learning (Edgeformer-E) and node representation learning (Edgeformer-N) algorithms can be found in Alg. 1 and Alg. 2, respectively.

Complexity Analysis. Given a node involved in N edges, each with P text tokens, the time complexity of edge encoding for each Edgeformer-E layer is O(P^2) (the same as one vanilla Transformer layer). The time complexity of node encoding for each Edgeformer-N layer is O(NP^2 + N^2). For most nodes in the network, we can assume N^2 ≪ NP^2, so the complexity is roughly O(NP^2) (the same as one PLM-GNN cascaded layer). For more discussion of time complexity, please refer to Section 4.5.
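The link prediction objective with in-batch negatives can be sketched as follows. Each row of H_v is paired with the same row of H_u as its positive, and the other rows in the batch serve as negatives; this is a NumPy simplification of the contrastive objective above, not the authors' training code:

```python
import numpy as np

def in_batch_link_loss(H_v, H_u):
    """H_v, H_u: (B, d) embeddings of B positive node pairs (v, u).
    logits[i, i] is pair i's positive score; logits[i, j != i] are in-batch negatives."""
    logits = H_v @ H_u.T                                    # (B, B) similarity scores
    m = logits.max(axis=1, keepdims=True)                   # stabilize log-sum-exp
    log_z = m + np.log(np.exp(logits - m).sum(axis=1, keepdims=True))
    log_p_pos = np.diag(logits - log_z)                     # log-softmax of positives
    return -log_p_pos.mean()
```

Reusing the other B-1 encoded nodes in the batch as negatives is what saves encoding cost: no extra forward passes are needed to score the negative samples.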

4. EXPERIMENTS

In this section, we first introduce five datasets. Then, we demonstrate the effectiveness of Edgeformers on both edge-level (e.g., edge classification) and node-level (e.g., link prediction) tasks. Finally, we conduct visualization and efficiency analysis to further understand Edgeformers.

4.1. DATASETS

We run experiments on three real-world networks: Amazon (He & McAuley, 2016), Goodreads (Wan et al., 2019), and StackOverflow. Amazon is a user-item interaction network, where reviews are treated as text on edges; Goodreads is a reader-book network, where readers' comments are used as edge text information; StackOverflow is an expert-question network, where there is an edge when an expert posts an answer to a question. Since Amazon and Goodreads both have multiple domains, we select two domains for each of them. In total, five datasets are used in evaluation (i.e., Amazon-Movie, Amazon-Apps, Goodreads-Crime, Goodreads-Children, StackOverflow). Dataset statistics can be found in Appendix A.1.

4.2. TASK FOR EDGE REPRESENTATION LEARNING

Baselines. We compare our Edgeformer-E model with a bag-of-words method (TF-IDF (Robertson & Walker, 1994)) and a pretrained language model (BERT (Devlin et al., 2019)). Both baselines are further enhanced with network information by concatenating the node embedding z_i with the bag-of-words vector (TF-IDF+nodes) or appending it to the input token sequence (BERT+nodes).

Edge Classification. The model is asked to predict the category of each edge based on its associated text and local network structure. There are 5 categories for edges in Amazon (i.e., 1-star, ..., 5-star) and 6 categories for edges in Goodreads (i.e., 0-star, ..., 5-star). For TF-IDF methods, the dimension of the bag-of-words vector is 2000. BERT-involved models and Edgeformer-E have the same model size (L = 12, d = 768) and are initialized from the same checkpoint. The dimension of the initial node embeddings d' is set to 64. We use AdamW as the optimizer with (ε, β_1, β_2) = (1e-8, 0.9, 0.999). The learning rate is 1e-5. The early stopping patience is 3 epochs. The batch size is 25. Macro-F1 and Micro-F1 are used as evaluation metrics. For BERT-involved models, the parameters in BERT are trainable.

4.3. TASK FOR NODE REPRESENTATION LEARNING

Baselines. We compare Edgeformer-N with several vanilla GNN models and PLM-integrated GNN models. Vanilla GNN models include node-centric GNNs such as MeanSAGE (Hamilton et al., 2017), MaxSAGE (Hamilton et al., 2017), and GIN (Xu et al., 2019), and edge-aware GNNs such as CensNet (Jiang et al., 2019) and NENN (Yang & Li, 2020). All vanilla edge-aware GNN models use bag-of-words vectors as initial edge feature representations. PLM-integrated GNN models utilize a PLM (Devlin et al., 2019) to obtain text representations on edges and adopt a GNN to obtain node representations by aggregating edge representations.
Such baselines include BERT+MeanSAGE (Hamilton et al., 2017), BERT+MaxSAGE (Hamilton et al., 2017), BERT+GIN (Xu et al., 2019), BERT+CensNet (Jiang et al., 2019), BERT+NENN (Yang & Li, 2020), and GraphFormers (Yang et al., 2021). To verify the importance of both text and network information in text-rich networks, we also include matrix factorization (MF) (Qiu et al., 2018) and vanilla BERT (Devlin et al., 2019) in the comparison.

Link Prediction. The task is to predict whether there will be an edge between two target nodes, given their local network structures. Specifically, in the Amazon and Goodreads datasets, given the target user's reviews of other items/books and the target item/book's reviews from other users, we aim to predict whether there will be a 5-star link between the target user and the target item/book. In the StackOverflow dataset, we aim to predict whether the target expert can give an answer to the target question. We use MRR and NDCG as evaluation metrics. For vanilla GNN models, we find that adopting MF node embeddings as initial node embeddings helps them obtain better performance (Lv et al., 2021). For edge-aware GNNs, bag-of-words vectors of size 2000 are used as edge features. For BERT-involved models, the training parameters are the same as in Section 4.2. During the testing stage, all methods are evaluated with in-batch samples for efficiency, i.e., each query node is provided with one positive key node and 99 randomly sampled negative key nodes. More details can be found in Appendix A.8.

Ablation Study. We further conduct an ablation study to validate the effectiveness of all three virtual tokens for node representation learning. The three virtual token hidden states are removed in turn from the full model, and the results are shown in Table 3.
From the table, we can find that Edgeformer-N generally outperforms all the model variants on all datasets, except the variant without the neighbor edge information virtual token on Amazon-Apps, which indicates the importance of all three virtual token hidden states.

Node Classification with Unsupervised Node Embeddings. To further evaluate the quality of the unsupervised node embeddings, we fix the node embeddings obtained from link prediction and train a logistic regression classifier to predict node categories. This is a multi-class multi-label classification task, with 2 classes for Amazon-Movie and 26 classes for Amazon-Apps. Table 4 summarizes the performance comparison among several edge-aware methods. We find that: (a) Edgeformer-N outperforms all the baselines significantly and consistently, which indicates that Edgeformer-N learns more effective node representations; (b) edge-aware models can achieve better performance, but this depends on how the edge text information is employed.

4.4. EMBEDDING VISUALIZATION

To reveal the relation between edge embeddings and node embeddings learned by our model, we apply t-SNE (Van der Maaten & Hinton, 2008) to visualize them in Figure 2 . Node embeddings (i.e., th v |v P Vu) are denoted as stars, while edge embeddings (i.e., th e |e P Eu) are denoted as points with the same color as the node they link to. From the figure, we observe that: (1) node embeddings tend to be closer to each other in the embedding space compared with edge embeddings; (2) the embeddings of edges linked to the same node are in the vicinity of each other.

4.5. EFFICIENCY ANALYSIS

We now compare the efficiency of BERT+GIN (a node-centric GNN), BERT+NENN (an edge-aware GNN), GraphFormers (a PLM-GNN nested architecture), and our Edgeformer-N. All models are run on one NVIDIA A6000 GPU. The mini-batch size is 25; each sample contains one center node and |N_e(v)| neighbor edges; the maximum text length is 64 tokens. The running time (per mini-batch) of the compared models is reported in Table 5, from which we have the following findings: (a) the time cost of training Edgeformer-N is quite close to that of BERT+GIN, BERT+NENN, and GraphFormers; (b) PLM-GNN nested architectures (i.e., GraphFormers and Edgeformer-N) require slightly more training time than PLM-GNN cascaded architectures (i.e., BERT+GIN and BERT+NENN); (c) the time cost of Edgeformer-N increases linearly with the neighbor size |N_e(v)|, which is consistent with our analysis in Section 3.3 that the time complexity of Edgeformer-N is O(NP^2 + N^2) = O(NP^2) when N ≪ P^2.

5. RELATED WORK

5.1 PRETRAINED LANGUAGE MODELS

PLMs are proposed to learn universal language representations from large-scale text corpora. Early studies such as word2vec (Mikolov et al., 2013), fastText (Bojanowski et al., 2017), and GloVe (Pennington et al., 2014) aim to learn a set of context-independent word embeddings to capture word semantics. However, many NLP tasks go beyond the word level, so it is beneficial to derive word representations based on specific contexts. Contextualized language models have been extensively studied recently to achieve this goal. For example, ELMo (Peters et al., 2018) and GPT (Radford et al., 2019) adopt autoregressive language modeling to predict a token given all previous tokens; BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019) are trained via masked language modeling to recover randomly masked tokens; XLNet (Yang et al., 2019) proposes permutation language modeling; ELECTRA (Clark et al., 2020) uses an auxiliary Transformer to replace some tokens and pretrains the main Transformer to detect the replaced tokens. For more related studies, one can refer to a recent survey (Qiu et al., 2020). To jointly leverage text and graph information, previous studies (Zhang et al., 2019; Fang et al., 2020; Li et al., 2021; Zhu et al., 2021) propose a PLM-GNN cascaded architecture, where the text information of each node is first encoded via PLMs and the node representations are then aggregated via GNNs. Bi et al. (2022) propose a triple2seq operation to linearize subgraphs and a "mask prediction" paradigm to conduct inference. Recently, GraphFormers (Yang et al., 2021) introduces a GNN-nested Transformer that stacks GNN layers and Transformer layers alternately. However, these works mainly consider textual-node networks, so their focus is orthogonal to ours on textual-edge networks.

5.2. EDGE-AWARE GRAPH NEURAL NETWORKS

A vast majority of GNN models (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019) leverage node attributes only and lack specific designs to utilize edge features. Heterogeneous GNNs (Schlichtkrull et al., 2018; Yang et al., 2020) assume each edge has a predefined type and take such types into consideration during aggregation. However, they still cannot deal with more complicated features (e.g., text) associated with the edges. EGNN (Gong & Cheng, 2019) introduces an attention mechanism to inject edge features into node representations; CensNet (Jiang et al., 2019) alternately updates node embeddings and edge embeddings in convolution layers; NENN (Yang & Li, 2020) aggregates the representation of each node/edge from both its node and edge neighbors via a GAT-like attention mechanism. EHGNN (Jo et al., 2021) proposes the dual hypergraph transformation and conducts graph convolutions for edges. Nevertheless, these models do not combine PLMs and GNNs to specifically handle text features on edges; thus, they underperform our Edgeformers, even when stacked with a BERT encoder.

6. CONCLUSIONS

We tackle the problem of representation learning on textual-edge networks. Promising future directions include applying the framework to more network-related tasks such as recommendation and text-rich social network analysis.

A APPENDIX

A.1 DATASETS

The statistics of the five datasets can be found in Table 6.

A.3 EDGE CLASSIFICATION

We also compare our method with the state-of-the-art edge representation learning method EHGNN (Jo et al., 2021) and two node-centric PLM-GNN methods, BERT+MaxSAGE (Hamilton et al., 2017) and GraphFormers (Yang et al., 2021). The experimental results can be found in Table 7. From the results, we can find that Edgeformer-E consistently outperforms all the baseline methods, including EHGNN, BERT+EHGNN, BERT+MaxSAGE, and GraphFormers. EHGNN cannot obtain promising results for two reasons: 1) edge-edge propagation: EHGNN transforms edges to nodes and nodes to edges in the original network, followed by graph convolutions on the new hypernetwork. This results in edge-edge information propagation when conducting edge representation learning. However, edge-edge information propagation relies on an underlying edge-edge homophily assumption, which is not always true in textual-edge networks. For example, when predicting the rating of a review e_ij given by user u to item i, it is not straightforward to make the judgment based on reviews of i written by other users (neighbor edges); 2) the integration.

Algorithm 2: Node Representation Learning Procedure of Edgeformer-N. Input: the center node v_i, its edge neighbors N_e(v_i), and its node neighbors N_n(v_i).

The 2 classes of Amazon-Movie nodes are "Movies" and "TV".

Node Dimension. We conduct experiments on Amazon-Apps, Goodreads-Children, and StackOverflow to understand the effect of the initial node embedding dimension in Eq. (9). We test the performance of Edgeformer-N on the link prediction task with the initial node embedding dimension varying in {4, 8, 16, 32, 64}. The results are shown in Figure 3, where we can find that the performance of Edgeformer-N generally increases as the initial node embedding dimension increases. This finding is straightforward, since the more parameters the initial node embeddings have (before overfitting), the more information they can represent.
Sampled Neighbor Size. We further analyze the impact of the sampled neighbor size for node representation learning on Amazon-Apps, Goodreads-Children, and StackOverflow, with a fraction of edges randomly sampled for the center node. The results can be found in Figure 4. We can find that the performance increases progressively as the sampled neighbor size |N_e(v)| increases. This is intuitive, since the more neighbors we have, the more information can contribute to center node learning. Meanwhile, the rate of improvement decreases as |N_e(v)| grows, because neighbors can carry overlapping information.

A.7 SELF-ATTENTION MAP STUDY

In order to study how the virtual node tokens benefit the encoding of Edgeformer-E, we plot the self-attention probability map for a random sample in Figure 5. We randomly pick a token from this sample and plot the self-attention probabilities of how different tokens (x-axis), including the virtual node tokens and the first twenty original text tokens, contribute to the encoding of this token in different layers (y-axis). From the figure, we can find that in higher layers (e.g., Layers 10-11), the attention weights of the virtual node tokens are significantly larger than those of the original text tokens. Since the virtual node token hidden states are of R^{d×2} and the original text token hidden states are of R^{d×l} (l is the text sequence length), the ratio of network tokens to text tokens is 2 : l in H̃^(l)_{e_ij} (Eq. (4)). However, the self-attention mechanism can automatically learn to balance the two types of information by assigning higher weights to the virtual node tokens, so the larger number of tokens representing textual information does not cause the network information to be overwhelmed.

A.8 REPRODUCIBILITY SETTINGS

For a fair comparison, the training objectives of Edgeformer-N and all PLM-involved baselines are the same. The hyper-parameter configuration for obtaining the results in Tables 1 and 2 can be found in Table 9.



Code can be found at https://github.com/PeterGriffinJin/Edgeformers.
StackOverflow data: https://www.kaggle.com/datasets/stackoverflow/stackoverflow
BERT checkpoint: https://huggingface.co/bert-base-uncased



Figure 1: Model Framework Overview. (a) An illustration of Edgeformer-E for edge representation learning, where virtual node token hidden states are concatenated to the original token hidden states of the edge text to inject network signals into edge text encoding. (b) An illustration of Edgeformer-N for node representation learning, where Edgeformer-E is enhanced with a local-network-structure virtual token hidden state, and edge representations are aggregated to obtain the node representation.
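The mechanism in Figure 1(a) can be sketched as follows. This is a simplified single-head version (no learned Q/K/V projections, heads, or residual connections), intended only to show how text-token queries attend over the concatenation of virtual node tokens and text tokens:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edgeformer_e_attention(text_states, virtual_states):
    """Text-token queries attend over [virtual node tokens; text tokens],
    so network signal enters every text token's updated hidden state."""
    H = np.concatenate([virtual_states, text_states], axis=0)  # (2 + l, d)
    d = text_states.shape[-1]
    scores = text_states @ H.T / np.sqrt(d)                    # (l, 2 + l)
    return softmax(scores, axis=-1) @ H                        # (l, d)

rng = np.random.default_rng(1)
l, d = 8, 16
out = edgeformer_e_attention(rng.normal(size=(l, d)),   # edge text tokens
                             rng.normal(size=(2, d)))   # 2 virtual node tokens
```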

Figure 2: Embedding visualization. Node embeddings are denoted as stars, and the embeddings of edges are denoted as points with the same color if they are linked to the same node.

Figure 3: Effect of the dimension of initial node embeddings.

Figure 4: Effect of the sampled neighbor size (i.e., |N e pvq|).

h^(l)_{e_ij|v_i} is the contextualized representation of e_ij given target node v_i's local network structure. Now we introduce how to calculate h^(l)_{e_ij|v_i} by aggregating information from N_e(v_i).

Representation of h^(l)_{e_ij|v_i}. For each edge e_is ∈ N_e(v_i) (including e_ij), we treat the hidden state of its [CLS] token after the l-th layer as its representation (i.e., h^(l)_{e_is}). To obtain h^(l)_{e_is|v_i}, we adopt MHA to let all edges in N_e(v_i) interact with each other.
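The edge-edge interaction step above can be sketched as follows; this is a single-head simplification (the paper uses multi-head attention with learned projections):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def edge_set_attention(cls_states):
    """The [CLS] hidden states of all edges in N_e(v_i) attend to each other,
    yielding a contextualized representation h^(l)_{e_is|v_i} for each edge.

    cls_states: (|N_e(v_i)|, d) array of per-edge [CLS] states after layer l.
    """
    d = cls_states.shape[-1]
    scores = cls_states @ cls_states.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ cls_states

rng = np.random.default_rng(2)
h = edge_set_attention(rng.normal(size=(5, 16)))   # 5 neighbor edges, d = 16
```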

Edge classification performance on Amazon-Movie, Amazon-App, Goodreads-Crime, and Goodreads-Children.



Link prediction performance (on the testing set) on Amazon-Movie, Amazon-Apps, Goodreads-Crime, Goodreads-Children, and StackOverflow. ∆ denotes the relative improvement of our model compared with the best baseline.



Ablation study of link prediction performance (on the testing set) on Amazon-Movie, Amazon-Apps, Goodreads-Crime, Goodreads-Children, and StackOverflow. (-) means removing the corresponding virtual tokens.

Node classification performance on Amazon-Movie and Amazon-App.

Table 5: Time cost (ms) per mini-batch for BERT+GIN, BERT+NENN, GraphFormers, and Edgeformer-N, with the neighbor size |N_e(v)| increasing from 2 to 5 on Amazon-Apps, Goodreads-Children, and StackOverflow. Edgeformer-N achieves efficiency similar to the baselines.

To this end, we propose a novel graph-empowered Transformer framework, which integrates local network structure information into the text encoding of each Transformer layer for edge representation learning, and aggregates edge representations fused with network and text signals for node representation learning. Comprehensive experiments on five real-world datasets from different domains demonstrate the effectiveness of Edgeformers on both edge-level and node-level tasks. Interesting future directions include (1) exploring other variants of introducing network signals into Transformer text encoding and (

Dataset Statistics.

Output: the embedding h_{e_ij} of the edge e_ij.

The initial token embedding H^(0)_{e_ij} of each document d_ij associated with e_ij ∈ N_e(v_i) is also given as input. Output: the embedding h_{v_i} of the center node v_i. for e_ij ∈ N_e(v_i) do: compute h_{e_ij|v_i} from H^(0)_{e_ij}; end; h_{v_i} ← AGG({h_{e_ij|v_i} | e_ij ∈ N_e(v_i)}); return h_{v_i}.
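The final AGG step of Algorithm 2 can be sketched as follows; we show attention pooling with a query vector as one plausible aggregator (the query would be learned in practice, and the paper's exact aggregator may differ):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def aggregate_node(edge_reps, query):
    """AGG step: pool the contextualized edge representations h_{e_ij|v_i}
    for all e_ij in N_e(v_i) into the node embedding h_{v_i}, weighting each
    edge by its attention score against a query vector."""
    scores = edge_reps @ query / np.sqrt(query.shape[-1])
    w = softmax(scores)            # one weight per neighbor edge, sums to 1
    return w @ edge_reps           # convex combination of edge representations

rng = np.random.default_rng(3)
edge_reps = rng.normal(size=(6, 16))          # 6 neighbor edges, d = 16
h_v = aggregate_node(edge_reps, rng.normal(size=16))
```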

ACKNOWLEDGMENTS

We thank anonymous reviewers for their valuable and insightful feedback. Research was supported in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and INCAS Program No. HR001121C0165, National Science Foundation IIS-19-56151, IIS-17-41317, and IIS 17-04532, and the Molecule Maker Lab Institute: An AI Research Institutes program supported by NSF under Award No. 2019897, and the Institute for Geospatial Understanding through an Integrative Discovery Environment (I-GUIDE) by NSF under Award No. 2118329. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily represent the views, either expressed or implied, of DARPA or the U.S. Government.

APPENDIX

The integration of text and network signals is loose for BERT+EHGNN, since such architectures process text and graph signals sequentially and fail to model the deep interactions between the two types of information. In contrast, our Edgeformer-E follows the more reasonable node-edge homophily hypothesis and deeply integrates text and network signals by introducing virtual node tokens into Transformer encoding. Note that both the PLM+GNN methods and GraphFormers require textual information on ALL nodes in the network. However, this assumption does not hold in many textual-edge networks. We therefore use a workaround that concatenates the text on the edges linked to a given node to make up its node text. However, such a strategy does not lead to competitive performance for PLM+MaxSAGE and GraphFormers according to our experimental results. Therefore, to generalize to the case of missing node text, the proposed Edgeformers are a better solution.

A.4 LINK PREDICTION

We further report the link prediction performance of compared models on the validation set in Table 8.

Table 8: Link prediction performance (on the validation set) on Amazon-Movie, Amazon-Apps, Goodreads-Crime, Goodreads-Children, and StackOverflow. ∆ denotes the relative improvement of our model compared with the best baseline.

A.5 NODE CLASSIFICATION

The 26 classes of Amazon-Apps nodes are: "Books & Comics", "Communication", "Cooking", "Education", "Entertainment", "Finance", "Games", "Health & Fitness", "Kids", "Lifestyle", "Music", "Navigation", "News & Magazines", "Novelty", "Photography", "Podcasts", "Productivity", "Reference", "Ringtones", "Shopping", "Social Networking", "Sports", "Themes", "Travel", "Utilities", and "Weather".

In Table 9, "sampled neighbor size" stands for the number of neighbors sampled for each type of the center node during node representation learning. This hyper-parameter is determined according to the average node degree of the corresponding node type. The edge classification and link prediction experiments are conducted on one NVIDIA V100 GPU and one NVIDIA A6000 GPU, respectively. In Section 4.3, we adopt logistic regression as our classifier. We train it with the Adam optimizer (Kingma & Ba, 2015) with an early-stopping patience of 10 and a learning rate of 0.001.
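The classifier setup above can be sketched as follows. This is a minimal binary-case illustration in NumPy (the paper's tasks may be multi-class, and the actual implementation likely uses a deep-learning framework); the Adam hyper-parameters other than the learning rate are the usual defaults:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logreg(X, y, Xv, yv, lr=1e-3, patience=10, max_epochs=2000):
    """Logistic-regression probe trained with full-batch Adam (lr 0.001)
    and early stopping with patience 10 on validation loss."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    m, v = np.zeros(d + 1), np.zeros(d + 1)        # Adam moment estimates
    b1, b2, eps = 0.9, 0.999, 1e-8
    best, best_wb, wait = np.inf, (w.copy(), b), 0
    for t in range(1, max_epochs + 1):
        p = sigmoid(X @ w + b)
        g = np.append(X.T @ (p - y) / n, np.mean(p - y))   # grad of (w, b)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g * g
        step = lr * (m / (1 - b1 ** t)) / (np.sqrt(v / (1 - b2 ** t)) + eps)
        w, b = w - step[:-1], b - step[-1]
        pv = sigmoid(Xv @ w + b)                            # validation loss
        val = -np.mean(yv * np.log(pv + 1e-12) + (1 - yv) * np.log(1 - pv + 1e-12))
        if val < best - 1e-6:
            best, best_wb, wait = val, (w.copy(), b), 0
        else:
            wait += 1
            if wait >= patience:                            # early stopping
                break
    return best_wb

# Toy usage on separable 2-D data (validation = training set for brevity).
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(-2.0, size=(100, 2)), rng.normal(2.0, size=(100, 2))])
y = np.array([0] * 100 + [1] * 100)
w, b = train_logreg(X, y, X, y)
acc = np.mean((sigmoid(X @ w + b) > 0.5) == (y == 1))
```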

A.9 LIMITATIONS

In this work, we mainly focus on modeling homogeneous textual-edge networks and on solving fundamental graph learning tasks such as node/edge classification and link prediction. Interesting future studies include designing models to characterize network heterogeneity and applying our proposed model to real-world applications such as recommendation.

A.10 ETHICAL CONSIDERATIONS

While it has been demonstrated that PLMs are powerful in language understanding (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020), there are studies pointing out their drawbacks, such as containing social bias (Liang et al., 2021) and misinformation (Abid et al., 2021). In our work, we focus on enriching PLMs' text encoding process with the associated network structure information, which could be a way to mitigate such bias and misinformation.

