SELF-ATTENTIVE RATIONALIZATION FOR GRAPH CONTRASTIVE LEARNING

Abstract

Graph augmentation is the key component for revealing the instance-discriminative features of a graph, i.e., its rationale, in graph contrastive learning (GCL). Existing rationale-aware augmentation mechanisms in GCL frameworks roughly fall into two categories, each with inherent limitations: (1) non-heuristic methods guided by domain knowledge to preserve salient features, which require expensive expertise and lack generality; and (2) heuristic augmentations with a co-trained auxiliary model to identify crucial substructures, which face not only a dilemma between system complexity and transformation diversity, but also instability stemming from the co-training of two separate sub-models. Inspired by recent studies on transformers, we propose Self-attentive Rationale guided Graph Contrastive Learning (SR-GCL), which integrates the rationale generator and the encoder, leverages the self-attention values in the transformer module as a natural guidance to delineate semantically informative substructures from both node- and edge-wise perspectives, and contrasts rationale-aware augmented pairs. On real-world biochemistry datasets, visualization results verify the effectiveness of self-attentive rationalization, and the performance on downstream tasks demonstrates that SR-GCL achieves state-of-the-art results for graph model pre-training.

1. INTRODUCTION

Graph augmentation is a crucial enabler for graph contrastive learning (GCL) (You et al., 2020; Qiu et al., 2020; Zhu et al., 2020). It pre-trains the model to yield instance-discriminative representations by contrasting augmented samples against each other, without hand-annotated labels. To achieve this goal, early studies (You et al., 2020; 2021; Qiu et al., 2020; Zhu et al., 2020) conduct random corruptions of topological structures (i.e., nodes and edges) or attributes to construct contrastive pairs. However, such random corruptions, especially on salient substructures, easily cause a semantic gap between two augmented views of the same anchor graph, misguiding the subsequent contrastive optimization (Wang et al., 2021; Li et al., 2022). To mitigate this, there has been recent interest in rationale discovery (Chang et al., 2020; Suresh et al., 2021; Li et al., 2022) as graph augmentation. We systematize these studies as rationale-aware augmentations, where a rationale exhibits the information that discriminates a graph instance from the others. The dominant paradigm consists of two subsequent modules: the rationale discovery function and the rationale encoder, which create the rationale-aware views and yield their representations to contrast, respectively. To find rationales, early studies turn to domain knowledge to highlight the salient parts of graphs (Zhu et al., 2021; Liu et al., 2022). For instance, Rong et al. (2020) leverage RDKit (Landrum, 2010), a cheminformatics toolkit, to capture crucial functional groups with high activity in molecule graphs. However, such expertise is expensive or even inaccessible in some scenarios (Tang et al., 2014). Besides, bringing in too much prior knowledge might harm generalization (Wang et al., 2022).
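The contrastive objective underlying GCL pulls together the representations of two augmented views of the same anchor graph and pushes apart those of different anchors, commonly instantiated as an InfoNCE-style (NT-Xent) loss. Below is a minimal NumPy sketch under our own assumptions (the function name, the temperature value, and the use of cosine similarity are illustrative, not taken from any specific paper):

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """NT-Xent contrastive loss over a batch of paired graph embeddings.

    z1, z2: (N, d) arrays; row i of z1 and row i of z2 are the embeddings
    of two augmented views of the same anchor graph (a positive pair).
    """
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    z = np.concatenate([z1, z2], axis=0)        # (2N, d)
    sim = z @ z.T / tau                         # temperature-scaled sims
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    n = z1.shape[0]
    # The positive of sample i is sample i+n, and vice versa.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -log_prob[np.arange(2 * n), pos].mean()
```

Minimizing this loss increases the similarity of each positive pair relative to all other (negative) pairs in the batch.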
To mitigate this problem, recent efforts (Suresh et al., 2021; Li et al., 2022) instead introduce an auxiliary model, named the rationale generator, to automatically identify rationales; it is co-trained with the rationale encoder. In this ad-hoc scheme, however, we reveal two inherent limitations:

• Typically, the generator is tailor-made for one single transformation of graph data (Suresh et al., 2021; Li et al., 2022), forcing the focus on either node- or edge-wise rationales (e.g., Figure 1(b) or (c)). We ascribe this crux of the generator to a lack of transformation diversity, and argue that a high-performing generator should be equipped with both node- and edge-wise perspectives (e.g., Figure 1(d)).

• As illustrated in Figure 2(a), the generator, which discovers rationales, is separate from the subsequent encoder, which specializes in encoding them. While conceptually appealing, we hypothesize that these separate modules struggle to cooperate smoothly in pursuit of high-quality rationales, because the supervision signal for the generator is generated remotely by the contrastive optimization of the encoder and is therefore weak. Moreover, co-optimizing two sub-models makes pre-training more complicated and time-consuming, yet less stable.

To resolve these limitations, we draw inspiration from transformers (Vaswani et al., 2017) to reshape the generator-encoder scheme. Despite originally being proposed for language (Devlin et al., 2019) and vision tasks (Dosovitskiy et al., 2021), transformers are attracting a surge of interest in the graph area (Wu et al., 2021; Chen et al., 2022; Rampásek et al., 2022). At their core is the self-attention operation, which models pairwise connections between tokens and yields high-quality representations. We find self-attention de facto a natural mechanism to concurrently discover and condense rationale information from both edge- and node-wise transformations.
By prepending a special token as the graph's proxy and treating its nodes as other tokens, self-attention is able to elegantly indicate the importance of each node and each edge (see Section 2.2 and Figure 4). Sampling nodes and edges based on these importance scores (i.e., heterogeneous transformations) allows us to generate both the node- and edge-wise subgraphs (i.e., rationales) simultaneously. Moreover, self-attention can directly output the rationale representations without additional modules. In stark contrast to the prior generator-encoder scheme, this "self-attentive rationalization" not only accomplishes diverse rationales in one shot, but also integrates the functions of rationale discovery and encoding. Using self-attentive rationalization as graph augmentation, we incorporate it into GCL and name the framework SR-GCL. Specifically, two augmented views stem from the node- and edge-wise rationales, respectively; the contrastive optimization then pulls together the representations of contrastive pairs augmented from the same anchor graph and pushes apart those of different anchors by minimizing a contrastive loss. Compared with conventional GCL methods, SR-GCL collates and conflates the instance-discriminative information across both node- and edge-wise transformations to construct rationale-aware contrastive pairs from dual perspectives. We find that this strategy improves the generalization performance of the pre-trained model on downstream tasks, while simultaneously interpreting the contribution of each node/edge to instance discrimination. Extensive experiments show that SR-GCL sets a new state of the art for graph pre-training across a number of biochemical molecule and social network benchmark datasets (Wu et al., 2018a; Morris et al., 2020). Code is available at https://anonymous.4open.science/r/SR-GCL-EDD3.
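The importance-reading step above can be sketched as follows: a proxy token is prepended to the node tokens, single-head self-attention is computed, attention from the proxy to each node scores node importance, and the symmetrized attention between an edge's endpoints scores edge importance; the top-scoring nodes and edges then form the two rationale views. The NumPy sketch below is a simplified single-head illustration under our own assumptions (the mean-pooled proxy token, the projections `Wq`/`Wk`, and `keep_ratio` are hypothetical choices, not the paper's exact implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rationale_views(node_feats, edges, Wq, Wk, keep_ratio=0.7):
    """Score nodes/edges via self-attention and sample two rationale views.

    node_feats: (n, d) node features; edges: list of (u, v) index pairs.
    Wq, Wk: (d, d) query/key projections (hypothetical, e.g. pre-trained).
    """
    # Prepend a proxy token (here simply the mean of node features).
    cls = node_feats.mean(axis=0, keepdims=True)
    x = np.concatenate([cls, node_feats], axis=0)    # (1+n, d)
    q, k = x @ Wq, x @ Wk
    attn = softmax(q @ k.T / np.sqrt(k.shape[1]), axis=-1)
    node_score = attn[0, 1:]                         # proxy -> node attention
    # Edge importance: symmetrized attention between the two endpoints.
    edge_score = np.array([attn[u + 1, v + 1] + attn[v + 1, u + 1]
                           for u, v in edges])
    n_keep = max(1, int(keep_ratio * len(node_score)))
    e_keep = max(1, int(keep_ratio * len(edges)))
    node_view = np.argsort(-node_score)[:n_keep]     # node-wise rationale
    edge_view = [edges[i] for i in np.argsort(-edge_score)[:e_keep]]
    return node_view, edge_view
```

In the full framework, the same attention pass that produces these scores also yields the rationale representations, which is what removes the need for a separate generator module.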




Figure 1: Rationale-aware graph augmentation preserves instance-discriminative features in the original graph (e.g., (a)). Existing frameworks tailor an auxiliary model to one single transformation (e.g., (b) or (c)). SR-GCL constructs both node- and edge-wise rationale-aware views (e.g., (d)).

