EDGEFORMERS: GRAPH-EMPOWERED TRANSFORMERS FOR REPRESENTATION LEARNING ON TEXTUAL-EDGE NETWORKS

Abstract

Edges in many real-world social/information networks are associated with rich text information (e.g., user-user communications or user-product reviews). However, mainstream network representation learning models focus on propagating and aggregating node attributes, lacking specific designs to utilize text semantics on edges. While there exist edge-aware graph neural networks, they directly initialize edge attributes as a feature vector, which cannot fully capture the contextualized text semantics of edges. In this paper, we propose Edgeformers, a framework built upon graph-enhanced Transformers, to perform edge and node representation learning by modeling texts on edges in a contextualized way. Specifically, in edge representation learning, we inject network information into each Transformer layer when encoding edge texts; in node representation learning, we aggregate edge representations through an attention mechanism within each node's ego-graph. On five public datasets from three different domains, Edgeformers consistently outperform state-of-the-art baselines in edge classification and link prediction, demonstrating their efficacy in learning edge and node representations, respectively.

1. INTRODUCTION

Networks are ubiquitous and are widely used to model interrelated data in the real world, such as user-user and user-item interactions on social media (Kwak et al., 2010; Leskovec et al., 2010) and recommender systems (Wang et al., 2019; Jin et al., 2020). In recent years, graph neural networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017; Velickovic et al., 2018; Xu et al., 2019) have demonstrated their power in network representation learning. However, a vast majority of GNN models leverage node attributes only and lack specific designs to capture information on edges. (We refer to these models as node-centric GNNs.) Yet, in many scenarios, there is rich information associated with edges in a network. For example, when a person replies to another on social media, there will be a directed edge between them accompanied by the response texts; when a user comments on an item, the user's review will be naturally associated with the user-item edge. To utilize edge information during network representation learning, some edge-aware GNNs (Gong & Cheng, 2019; Jiang et al., 2019; Yang & Li, 2020; Jo et al., 2021) have been proposed. Nevertheless, these studies assume the information carried by edges can be directly described as an attribute vector. This assumption holds well when edge features are categorical (e.g., bond features in molecular graphs (Hu et al., 2020) and relation features in knowledge graphs (Schlichtkrull et al., 2018)). However, effectively modeling free-text edge information in edge-aware GNNs has remained elusive, mainly because bag-of-words and context-free embeddings (Mikolov et al., 2013) used in previous edge-aware GNNs cannot fully capture contextualized text semantics. For example, "Byzantine" in history book reviews and "Byzantine" in distributed system papers should have different meanings given their context, but they correspond to the same entry in a bag-of-words vector and have the same context-free embedding.
To accurately capture contextualized semantics, a straightforward idea is to integrate pretrained language models (PLMs) (Devlin et al., 2019; Liu et al., 2019; Clark et al., 2020) with GNNs. In node-centric GNN studies, this idea has been instantiated by a PLM-GNN cascaded architecture (Fang et al., 2020; Li et al., 2021; Zhu et al., 2021), where text information is first encoded by a PLM and then aggregated by a GNN. However, such architectures process text and graph signals one after the other, and fail to simultaneously model the deep interactions between both types of information. This is a loss for the text encoder, because network signals are often strong indicators of text semantics. For example, a brief political tweet may become more comprehensible if the stances of the two communicators are known. To deeply couple PLMs and GNNs, the recent GraphFormers model (Yang et al., 2021) proposes a GNN-nested PLM architecture to inject network information into the text encoding process. They introduce GNNs nested in between Transformer layers so that the center node encoding not only leverages its own textual information, but also aggregates the signals from its neighbors. Nevertheless, they assume that only nodes are associated with textual information, so their model cannot be easily adapted to handle text-rich edges. To effectively model the textual and network structure information via a unified encoder architecture, in this paper, we propose a novel network representation learning framework, Edgeformers, that leverages graph-enhanced Transformers to model edge texts in a contextualized way. Edgeformers include two architectures, Edgeformer-E and Edgeformer-N, for edge and node representation learning, respectively. In Edgeformer-E, we add virtual node tokens to each Transformer layer inside the PLM when encoding edge texts.
Such an architecture goes beyond the PLM-GNN cascaded architecture and enables deep, layer-wise interactions between network and text signals to produce edge representations. In Edgeformer-N, we aggregate the network- and text-aware edge representations to obtain node representations through an attention mechanism within each node's ego-graph. The two architectures can be trained via edge classification (which relies on good edge representations) and link prediction (which relies on good node representations) tasks, respectively. To summarize, our main contributions are as follows:
• Conceptually, we identify the importance of modeling text information on network edges and formulate the problem of representation learning on textual-edge networks.
• Methodologically, we propose Edgeformers (i.e., Edgeformer-E and Edgeformer-N), two graph-enhanced Transformer architectures, to deeply couple network and text information in a contextualized way for edge and node representation learning.
• Empirically, we conduct experiments on five public datasets from different domains and demonstrate the superiority of Edgeformers over various baselines, including node-centric GNNs, edge-aware GNNs, and PLM-GNN cascaded architectures.
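To make the ego-graph aggregation idea concrete, the sketch below shows one simple way to attention-pool edge representations into a node representation. This is a minimal single-head NumPy sketch under our own assumptions (a learned query vector and dot-product scoring); the function and variable names are illustrative, not the paper's exact parameterization.

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

def aggregate_node(edge_reprs, query):
    """Attention-pool the representations of a node's incident edges
    (its ego-graph) into a single node representation.

    edge_reprs: (num_edges, d) array of edge embeddings
    query:      (d,) query vector (a stand-in for a learned parameter)
    """
    scores = edge_reprs @ query   # (num_edges,) relevance of each edge
    weights = softmax(scores)     # attention distribution over the ego-graph
    return weights @ edge_reprs   # (d,) weighted sum of edge embeddings

rng = np.random.default_rng(0)
edges = rng.standard_normal((3, 4))   # three incident edges, hidden size d = 4
q = rng.standard_normal(4)
node_repr = aggregate_node(edges, q)  # (4,) node representation
```

Because the output is a convex combination of the edge embeddings, a node whose incident edges all carry the same representation simply inherits that representation.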

2. PRELIMINARIES

2.1 TEXTUAL-EDGE NETWORKS

In a textual-edge network, each edge is associated with texts. We view the texts on each edge as a document, and all such documents constitute a corpus D. Since the major goal of this work is to explore the effect of textual information on edges, we assume there is no auxiliary information (e.g., categorical or textual attributes) associated with nodes in the network.

Definition 1 (Textual-Edge Networks). A textual-edge network is defined as G = (V, E, D), where V, E, and D represent the sets of nodes, edges, and documents, respectively. Each edge e_ij ∈ E is associated with a document d_ij ∈ D.

To give an example of textual-edge networks, consider a review network (e.g., Amazon (He & McAuley, 2016)) where nodes are users and items. If a user v_i writes a review about an item v_j, there will be an edge e_ij connecting them, and the review text will be the associated document d_ij.
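Definition 1 can be mirrored by a minimal data structure that keys each document by the endpoints of its edge. The sketch below is our own illustration (class and method names are not from the paper), using the user-item review network as the running example.

```python
from collections import defaultdict

class TextualEdgeNetwork:
    """Minimal container for G = (V, E, D): nodes, edges, and one
    document per edge. Illustrative names, not the paper's API."""

    def __init__(self):
        self.nodes = set()             # V
        self.docs = {}                 # E and D: maps (v_i, v_j) -> d_ij
        self.adj = defaultdict(set)    # adjacency, node -> neighbors

    def add_edge(self, v_i, v_j, document):
        self.nodes.update((v_i, v_j))
        self.docs[(v_i, v_j)] = document
        self.adj[v_i].add(v_j)
        self.adj[v_j].add(v_i)

    def ego_edges(self, v):
        """Edges (with their documents) incident to node v, i.e.,
        the textual part of v's ego-graph."""
        return [(e, d) for e, d in self.docs.items() if v in e]

# Example: a user-item review network
g = TextualEdgeNetwork()
g.add_edge("user_1", "item_9", "Great read on Byzantine history.")
g.add_edge("user_1", "item_3", "Battery died after a week.")
```

Note that `ego_edges` returns exactly the inputs a node-level encoder needs: every document attached to the node's incident edges.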



Code can be found at https://github.com/PeterGriffinJin/Edgeformers.



Let H^(l) = [h^(l)_1, ..., h^(l)_n] denote the output sequence of the l-th Transformer layer, where h^(l)_i ∈ R^d is the hidden representation of the text token at position i. Then, in the (l+1)-th Transformer layer, the multi-head self-attention (MHA) is calculated as

MHA(H^(l)) = [head_1; ...; head_K] W_O,   head_k = softmax( Q_k K_k^T / √d ) V_k,

where Q_k = H^(l) W_k^Q, K_k = H^(l) W_k^K, and V_k = H^(l) W_k^V are the query, key, and value projections of the k-th head, and W_O is the output projection.
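For concreteness, a single-head (K = 1, no output projection) version of the self-attention above can be sketched in NumPy as follows; the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(H, W_Q, W_K, W_V):
    """One attention head over a token sequence H of shape (n, d):
    computes softmax(Q K^T / sqrt(d)) V. Multi-head attention just
    concatenates several such heads and applies an output projection."""
    Q, K, V = H @ W_Q, H @ W_K, H @ W_V
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))  # (n, n) attention weights
    return attn @ V                       # (n, d) updated token states

rng = np.random.default_rng(0)
n, d = 5, 8                               # 5 tokens, hidden size 8
H = rng.standard_normal((n, d))           # stands in for H^(l)
W_Q, W_K, W_V = (rng.standard_normal((d, d)) for _ in range(3))
out = self_attention(H, W_Q, W_K, W_V)
```

Each output row is a weighted mixture of the value vectors of all n tokens, which is what lets extra (virtual) tokens appended to the sequence influence every text token's representation.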

