LOOK IN THE MIRROR: MOLECULAR GRAPH CONTRASTIVE LEARNING WITH LINE GRAPH

Anonymous

Abstract

Motivated by label scarcity in molecular property prediction and drug design, graph contrastive learning has come to the fore. A general contrastive model consists of a view generator, a view encoder, and a contrastive loss, in which the views largely control what information of the input graphs is encoded. Leading contrastive learning works employ two kinds of view generators: random or learnable data corruption, and domain knowledge incorporation. While effective, these two approaches respectively suffer from altered molecular semantics and limited generalization capability. A decent view should therefore fully retain molecular semantics while remaining free of profound domain knowledge. To this end, we relate molecular graph contrastive learning to the line graph and propose a novel method termed LGCL. Specifically, by contrasting the given graph with its corresponding line graph, the graph encoder can encode the molecular semantics without omission. To handle the information inconsistency and over-smoothing that arise during training from the mismatched pace of message passing in the two kinds of graphs, we introduce edge attribute fusion and two local contrastive losses. Compared with state-of-the-art (SOTA) methods for view generation, superior performance on molecular property prediction demonstrates the effectiveness of line graphs serving as contrasting views.

1. INTRODUCTION

A deep understanding of molecular properties plays a vital role in the chemical and pharmaceutical domains. To computationally discover novel materials and drugs, molecules are abstracted as graphs, in which atoms are vertices and bonds are edges Gilmer et al. (2017); Goh et al. (2017); Chen et al. (2018a). The marriage between molecular property prediction and graph learning has thus attracted numerous researchers and proven fruitful in several fields Yang et al. (2019); Song et al. (2020); Chen et al. (2021); Wu et al. (2022a). However, this pairing faces the challenge of label scarcity: deep learning methods are known to consume massive amounts of labeled data, yet annotated data are often limited in size and hard to acquire in many specific domains. In addition, given the immense diversity of chemical molecules, existing supervised models can barely be reused on unseen cases Hu et al. (2020); Rong et al. (2020). Therefore, there is increasing demand for molecular representation learning in an unsupervised or self-supervised manner. Plenty of works have attempted to learn molecular representations without label supervision, such as graph context prediction Liu et al. (2019), graph-level motif prediction Rong et al. (2020), and masked attribute prediction Hu et al. (2020). Inspired by contrastive learning in computer vision, researchers have gone one step further to model molecules in a contrastive manner with data augmentations You et al. (2020); Suresh et al. (2021). Considering the inherent characteristics of chemical molecules, graph contrastive learning incorporating well-designed domain knowledge has also shown excellent capacity in molecular property prediction Sun et al. (2021); Fang et al. (2022). However, everything comes at a price: inspecting the views generated by previous molecular graph contrastive learning unveils two intrinsic limitations.
First, data augmentation-based methods adopting random or learnable corruption (e.g., node/edge dropping and graph generation) lead to inevitable variance in the crucial semantics and further misguide the contrastive learning You et al. (2020); Sun et al. (2021). Second, relying on predefined sub-structure substitution rules Sun et al. (2021) or contrasting with 3-dimensional geometric views Liu et al. (2022); Stärk et al. (2022), domain knowledge-based methods aim to alleviate the problem of semantic alteration. While effective, they are tied to profound domain knowledge that is unfriendly to researchers without such expertise, which limits their generalization to other domains. In this context, we seek a decent view that is not burdened by prefabricated domain knowledge and can maintain the molecular semantic information in full. Fortunately, we met the line graph, also known as the congruent graph in graph theory Whitney (1932); Harary & Norman (1960); Jung (1966). In a line graph, the nodes correspond to the edges of the original graph, and each edge indicates that the corresponding pair of original edges share a common node. In particular, the isomorphism of two line graphs is judged to be consistent with that of the corresponding original graphs Whitney (1932); Jung (1966), which ensures a congruent semantic structure after line graph transformation. In light of the line graph, we propose a method termed LGCL to meet these expectations. The framework of LGCL is shown in Figure 1.

Figure 1: Framework overview of LGCL. Contrasted views consist of the original graph and the corresponding line graph. Input graphs are encoded by a dual-helix graph encoder with edge attribute fusion for information consistency. The whole model is jointly optimized by minimizing the NT-Xent loss and the two local contrastive losses.
Specifically, to supply the two views the framework demands, every input molecular graph is transformed into its corresponding line graph. On this basis, LGCL is equipped with a dual-helix graph encoder to learn hidden representations of the two views. Note that, due to the different pace of message passing in the original graph and the corresponding line graph, two issues arise during learning: information inconsistency and over-smoothing. For information consistency, we augment the graph encoder with edge attribute fusion to bridge the edge attributes between the two kinds of graphs. Over-smoothing is addressed by a novel intra-local contrastive loss based on the idea of the NT-Xent loss; put differently, the intra-local contrastive loss aims to maximize the consensus between corresponding edge pairs across the two views and minimize the consensus between different edge pairs within the same view. Moreover, we further introduce an inter-local contrastive loss to enhance representation learning. The effectiveness of LGCL is verified under the ubiquitous transfer learning setting for molecular property prediction Hu et al. (2020). Through pre-training on two million molecular graphs from ZINC15, LGCL shows superior performance on six out of eight benchmarks for molecular property prediction and attains the highest position on both average ROC-AUC and average ranking. Additionally, we delve deeper into the proposed components via analytical experiments to further assess their benefits. The contributions are elaborated below:
• To the best of our knowledge, we are the first to propose a way to freely and fully excavate molecular semantics within graph contrastive learning.
• Inspired by the line graph, we present an approach, termed LGCL, in which edge attribute fusion and two local contrastive losses are united to address the concomitant issues and enhance molecular representation learning.
• Leveraging eight benchmarks for molecular property prediction under the setting of transfer learning, LGCL exhibits its superiority against the SOTA methods for view generation.

2. RELATED WORKS

This research focuses on molecular graph contrastive view generation, especially the case that is free from intricate domain knowledge. We elaborate on these topics below.
Molecular graph contrastive learning. To enhance molecular property prediction, domain knowledge-driven contrastive learning frameworks were proposed to preserve graph semantics during augmentation Sun et al. (2021); Fang et al. (2022). However, their learning capability relies heavily on the embedded domain knowledge, namely the well-designed substitution rules in MoCL Sun et al. (2021) and the prefabricated associations among chemical elements in KCL Fang et al. (2022). Furthermore, domain knowledge varies across domains, which limits the applicability of these methods. Recently, beyond contrasting view exploration, GraphLoG Xu et al. (2021) and OEPG Yang & Hong (2022) build upon generic graph contrastive learning methods to discover the global semantic structure underlying the whole dataset and achieve excellent performance. In this work, we are devoted to contrasting view generation, which is orthogonal to works on dataset semantic structure exploration; put differently, works on contrasting view generation can be combined with the frameworks of GraphLoG and OEPG to produce even better performance.
Line graph. The line graph is a classic concept with a long history in graph theory Whitney (1932); Harary & Norman (1960); Jung (1966). In a line graph, the nodes correspond to the edges of the original graph, and each edge indicates that the corresponding pair of original edges share a common node. Thus, graph neural networks (GNNs) built on line graphs are capable of encoding edge features and enhancing feature learning on graphs. Recently, several line graph neural networks built on line graph structures have shown promising performance on various graph-related tasks Chen et al. (2018b); Jiang et al. (2019); Bandyopadhyay et al. (2020). In the chemistry domain, the structure of a compound can be treated as a graph, where edges derived from chemical bonds link the corresponding atom nodes. Thus, the edges in such graphs have distinct properties and varied functions. In generic GNNs, however, message passing among nodes does not pay enough attention to edge properties. Fortunately, the line graph structure enables generic GNNs to treat edges on an equal footing with nodes Jiang et al. (2019); Chen et al. (2018b). To date, there is still no graph contrastive learning model that encodes molecular semantics in full without well-designed domain knowledge. In this paper, we revisit the line graph from the angle of graph contrastive learning and design a novel contrastive model, termed LGCL, to freely and fully excavate molecular semantics.

3. PRELIMINARIES

Here, we first present some preliminary concepts and notations. In this work, let $\mathcal{G} = \{G_1, G_2, \cdots, G_N\}$ be a graph dataset of size $N$. A molecular graph is formulated as $G = (V, E, X_V, X_E)$, where $V$ is the node set, $E$ is the edge set, $X_V \in \mathbb{R}^{|V| \times d_V}$ denotes the node features, and $X_E \in \mathbb{R}^{|E| \times d_E}$ stands for the edge attributes, with $d_V$ and $d_E$ the respective feature dimensions.

3.1. GRAPH REPRESENTATION LEARNING

In generic GNNs, a message-passing scheme is adopted for information transmission among nodes Xu et al. (2019); Wu et al. (2022b). By stacking $L$ layers, a GNN produces a hidden representation $h_v \in \mathbb{R}^d$ carrying $L$-hop neighborhood information for each node, and a feature vector $h_G \in \mathbb{R}^d$ for the entire graph $G$ via a global readout function. Each node $v$ is initialized with its node feature $X_v$ and fed to the GNN input. Formally, the $l$-th layer of a GNN Xu et al. (2019) can be written as
$$\hat{h}_v^{(l)} = \mathrm{AGGREGATE}^{(l)}\big(\{h_u^{(l-1)} \mid u \in \mathcal{N}(v)\}\big), \quad h_v^{(l)} = \mathrm{COMBINE}^{(l)}\big(\hat{h}_v^{(l)}, h_v^{(l-1)}\big),$$
where $h_v^{(l)}$ represents the feature vector of node $v$ at the $l$-th iteration, $\mathcal{N}(v)$ covers the 1-hop neighbors of $v$, AGGREGATE denotes the crucial message-passing scheme in GNNs, and COMBINE updates the hidden feature of $v$ by merging information from its neighbors and itself. Finally, a GNN produces the feature vector $h_G$ of the entire graph with a prefabricated readout function:
$$h_G = \mathrm{READOUT}(\{h_v \mid v \in V\}),$$
where READOUT aggregates the final set of node representations.
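The update above can be sketched in plain Python on scalar node features. This is a toy sum-aggregate/sum-combine layer with a mean readout, not the paper's GIN encoder; the function names are ours.

```python
from collections import defaultdict

def gnn_layer(h, edges):
    """One message-passing layer on scalar node features:
    AGGREGATE sums the neighbors' features, COMBINE adds the
    node's own feature (a GIN-style sum update without the MLP)."""
    neigh = defaultdict(list)
    for u, v in edges:              # undirected: messages flow both ways
        neigh[u].append(h[v])
        neigh[v].append(h[u])
    return {v: h[v] + sum(neigh[v]) for v in h}

def readout(h):
    """Mean READOUT over the final node representations."""
    return sum(h.values()) / len(h)
```

Stacking `gnn_layer` $L$ times gives each node an $L$-hop receptive field, after which `readout` yields the graph-level feature.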

3.2. GRAPH CONTRASTIVE LEARNING

In a generic graph contrastive learning model, two correlated views of the same graph $G_i$ are required for contrasting, generally produced by two augmentation operations. Here, we denote the augmented views as $G_i^1$ and $G_i^2$. A graph encoder and a projection head are stacked behind the two augmentation operators to map the correlated views into an embedding space, yielding feature vectors $h_i^1$ and $h_i^2$. The resulting hidden representations are supposed to contain the essential features of the original graph $G_i$ so that they can be distinguished from the others. Thus, the objective of graph contrastive learning is to maximize the consensus between the two positive views via the widespread NT-Xent loss Chen et al. (2020):
$$\mathcal{L}_i = -\log \frac{e^{\mathrm{sim}(h_i^1, h_i^2)/\tau}}{\sum_{j=1, j \neq i}^{N} e^{\mathrm{sim}(h_i^1, h_j^2)/\tau}},$$
where $N$ is the batch size, $\tau$ refers to the temperature parameter, and $\mathrm{sim}(h^1, h^2)$ denotes the cosine similarity $\frac{h^{1\top} h^2}{\|h^1\| \cdot \|h^2\|}$. The numerator holds the similarity of the correlated views as a positive pair. The remaining pairs, consisting of views from different graphs, are regarded as negative pairs and form the denominator. Note that the negative pairs come from two directions; put differently, $h_i^1$ can pair with all $h_j^2$, and $h_i^2$ can pair with all $h_j^1$.
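The NT-Xent loss above can be written out directly for one sample of a batch. A minimal pure-Python sketch (list-of-vectors batches; in practice this runs vectorized on a GPU):

```python
import math

def cosine(a, b):
    """sim(h1, h2): cosine similarity of two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def nt_xent(h1, h2, i, tau=0.1):
    """NT-Xent loss for graph i: the positive pair is (h1[i], h2[i]);
    the views of the other graphs in the batch serve as negatives
    in the denominator, as in the loss stated above."""
    pos = math.exp(cosine(h1[i], h2[i]) / tau)
    neg = sum(math.exp(cosine(h1[i], h2[j]) / tau)
              for j in range(len(h2)) if j != i)
    return -math.log(pos / neg)
```

The full batch loss averages `nt_xent` over all $i$ and over both pairing directions.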

4. METHODOLOGY

In this section, we introduce the proposed graph contrastive learning framework, termed LGCL, which revisits the line graph of the corresponding molecules. Given the issue of label scarcity in real-world graph data, LGCL is designed to encode the molecular semantics in full while remaining free of well-designed domain knowledge. Specifically, to produce two contrastive views without any loss of molecular semantics, we first transform the given molecule into its corresponding line graph. On this basis, LGCL is equipped with a dual-helix graph encoder to learn hidden representations of the two views with edge attribute fusion. In particular, besides the ubiquitous contrastive loss on the readout graph representations, we propose two local contrastive losses to enhance representation learning and alleviate the over-smoothing issue in deep GNNs. We elaborate on the LGCL framework below.

4.1. LINE GRAPH TRANSFORMATION

Here, we first illustrate the line graph transformation on a simple graph. As shown in Figure 2, let $G = (V, E)$ be a simple undirected graph; the output line graph $L(G)$ is the graph that reveals the adjacencies of the edges in $G$. Specifically, each edge of $G$ is mapped to a node of $L(G)$, and each edge of $L(G)$ indicates that the corresponding two vertices share a common node in $G$. Formally, the line graph can be written as $L(G) = (V_L, E_L)$, where $V_L = \{(v_i, v_j) \mid (v_i, v_j) \in E\}$ and $E_L = \{((v_i, v_j), (v_j, v_k)) \mid \{(v_i, v_j), (v_j, v_k)\} \subset E\}$. At this point, we have settled the topology transformation of line graphs. Besides the relationships among nodes and edges, the node and edge attributes of the molecular graph should also be delivered to the corresponding line graph. In this paper, based on the one-to-one correspondence between the edges of $G$ and the nodes of $L(G)$, the node attributes of the line graph are directly obtained from the edge attributes of the original graph, i.e., $X_{V_L} = X_E$. As for the edge attributes of $L(G)$, because several edges of the line graph may correspond to the same node of the original graph, a mapping function encoding this relationship is required to endow each line graph edge with the attribute of the shared original node. Given $E_L$ after the line graph transformation, the mapping can be formulated as $M(e_L) = (v_i, v_j) \cap (v_j, v_k)$ for $e_L = ((v_i, v_j), (v_j, v_k))$, so the edge attributes of the line graph are obtained via $X_{E_L} = M(X_V)$. Finally, the line graph of a molecular graph is given by $L(G) = (V_L, E_L, X_{V_L}, X_{E_L})$. According to Roussopoulos's algorithm Roussopoulos (1973), the time complexity of the line graph transformation is $O(\max(|V|, |E|))$.
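The transformation above, including both attribute mappings, fits in a few lines of Python. A minimal sketch (pairwise edge comparison for clarity, not the linear-time algorithm of Roussopoulos; all names are ours):

```python
from itertools import combinations

def line_graph(edges, node_attrs, edge_attrs):
    """Transform an undirected graph into its line graph L(G).
    Each edge of G becomes a node of L(G), inheriting the edge's
    attribute (X_{V_L} = X_E); two such nodes are linked iff the
    original edges share an endpoint, and the new edge inherits
    the attribute of that shared node (the mapping M)."""
    nodes_L = [tuple(sorted(e)) for e in edges]
    x_VL = {n: edge_attrs[e] for n, e in zip(nodes_L, edges)}
    edges_L, x_EL = [], {}
    for a, b in combinations(nodes_L, 2):
        shared = set(a) & set(b)            # common endpoint in G
        if shared:
            edges_L.append((a, b))
            x_EL[(a, b)] = node_attrs[shared.pop()]
    return nodes_L, edges_L, x_VL, x_EL
```

For the path C–O–N with edges $\{(1,2),(2,3)\}$, $L(G)$ has two nodes carrying the bond attributes and one edge carrying the attribute of the shared atom 2.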
As stated in the Whitney graph isomorphism theorem Whitney (1932), the isomorphism of two line graphs is judged to be consistent with that of the corresponding original graphs, which assures us that the semantic structure information of $G$ is encoded in the line graph $L(G)$. In particular, as described in the line graph transformation, there is a one-to-one correspondence between the edges of the graph $G$ and the vertices of the line graph $L(G)$. Therefore, a vertex with $e$ incident edges in $G$ produces $e \times (e-1)/2$ edges in $L(G)$. Meanwhile, the message-passing frequency around this node drifts from $O(e)$ in $G$ to $O(e^2)$ in $L(G)$; put differently, this node's feature in $G$ is passed to only $e$ neighbors, while the corresponding line graph passes such information along $e \times (e-1)/2$ edges. In practice, in light of the line graph statistics in Table A.1, the line graph encoder only requires about 50% more computation than the original graph encoder. However, this property of the line graph causes two inevitable issues in a contrastive learning framework with stacked graph convolutional layers: information inconsistency and over-smoothing. In this paper, we propose two remedies, edge attribute fusion and two local contrastive losses, to alleviate these issues and strengthen molecular representation learning. Next, we give a detailed description.
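The $e \times (e-1)/2$ growth rule can be checked numerically: summing $\binom{\deg(v)}{2}$ over vertices counts the edges of $L(G)$ without building it. A small self-verifying sketch (function names are ours):

```python
from collections import Counter
from itertools import combinations

def line_graph_size(edges):
    """Count the edges of L(G) analytically: a vertex of degree e
    in G contributes e*(e-1)/2 edges to L(G)."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return sum(e * (e - 1) // 2 for e in deg.values())

def line_graph_size_brute(edges):
    """Reference count: pairs of edges of G sharing an endpoint."""
    return sum(1 for a, b in combinations(edges, 2) if set(a) & set(b))
```

A degree-4 hub alone yields $\binom{4}{2} = 6$ line-graph edges, illustrating why dense graphs blow up while sparse molecular graphs stay cheap.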

4.2. EDGE ATTRIBUTE FUSION

In the chemistry domain, the structure of a compound can be treated as a graph, where edges derived from chemical bonds link the corresponding atom nodes. Thus, the edges in such graphs have distinct properties and varied functions. Besides the topology woven by atoms, a well-designed graph convolution with edge attributes plays a crucial role in molecular property and protein function prediction. Given a molecular graph, its input node features and edge features are both represented as 2-dimensional categorical vectors (see Appendix A for details), denoted as $X_V \in \mathbb{R}^{|V| \times 2}$ and $X_E \in \mathbb{R}^{|E| \times 2}$, respectively. In previous works on molecular property prediction Hu et al. (2020), the raw node categorical vectors are embedded in the input layer by
$$h_v^{(0)} = \mathrm{EMBEDDING}(x_v^0) + \mathrm{EMBEDDING}(x_v^1),$$
where $x_v^0$ and $x_v^1$ are the atomic number and chirality tag of node $v$, respectively, and $\mathrm{EMBEDDING}(\cdot)$ denotes an embedding function mapping a single integer into a $d$-dimensional vector space. Meanwhile, the raw edge categorical vectors are embedded in each layer by
$$h_e^{(l)} = \mathrm{EMBEDDING}(x_e^0) + \mathrm{EMBEDDING}(x_e^1),$$
where $x_e^0$ and $x_e^1$ represent the bond type and bond direction, respectively, and $l$ denotes the index of the GNN layer. At the $l$-th layer, the node representation is updated by
$$h_v^{(l)} = \sigma\Big(\mathrm{MLP}^{(l)}\Big(h_v^{(l-1)} + \sum_{u \in \mathcal{N}(v)} h_u^{(l-1)} + \sum_{e \in \{(v,u) \mid u \in \mathcal{N}(v) \cup \{v\}\}} h_e^{(l-1)}\Big)\Big),$$
where $\sigma(\cdot)$ is an activation function and $(v, v)$ represents the self-loop edge. Under this GNN architecture, the output molecular representations are decorated with edge attributes. However, as discussed above, there is a significant difference in message-passing frequency between the original graph and the corresponding line graph, which can lead to information inconsistency between their outputs. Here, we present a novel edge attribute fusion approach to tackle this issue.
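The summed-embedding initialization can be illustrated as follows. This is a toy sketch: the lookup tables are random rather than trainable, the dimension is tiny, and the vocabulary sizes are assumptions rather than the exact values of Hu et al. (2020).

```python
import random

random.seed(0)
EMB_DIM = 4          # toy dimension; the paper uses d = 300

def embedding_table(num_categories, dim=EMB_DIM):
    """A toy EMBEDDING(.): maps an integer category to a vector.
    (A trainable lookup table in the real model; random here.)"""
    table = [[random.uniform(-1, 1) for _ in range(dim)]
             for _ in range(num_categories)]
    return lambda idx: table[idx]

# Hypothetical vocabulary sizes for the two node categories.
atom_emb = embedding_table(120)      # atomic number
chirality_emb = embedding_table(4)   # chirality tag

def init_node_feature(x0, x1):
    """h_v^(0) = EMBEDDING(atomic number) + EMBEDDING(chirality tag)."""
    return [a + c for a, c in zip(atom_emb(x0), chirality_emb(x1))]
```

Edge categorical vectors are embedded the same way at every layer, with bond type and bond direction replacing the two node categories.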
Specifically, we bridge the edge information between the molecular graph and the line graph to help the original graph encoder keep pace with the line graph encoder. The edge and node embeddings are still employed as the initial edge attributes at the first layer (i.e., $l = 0$). For $l \geq 1$, the edge attributes of the original graph are obtained from the node hidden features of the line graph:
$$h_{G \cdot (v_i, v_j)}^{(l)} = h_{L(G) \cdot (v_i, v_j)}^{(l-1)},$$
where $(v_i, v_j) \in E$ and $(v_i, v_j) \in V_L$. Correspondingly, the edge attributes of the line graph are updated by the node hidden features of the original graph:
$$h_{L(G) \cdot ((v_i, v_j), (v_j, v_k))}^{(l)} = h_{G \cdot v_j}^{(l-1)},$$
where $(v_i, v_j) \in E$, $(v_j, v_k) \in E$, and $((v_i, v_j), (v_j, v_k)) \in E_L$. Based on the dual-helix graph encoder with edge attribute fusion, the hidden features of the line graph are dissolved into the original graph representations, allowing information consistency between the two contrastive views and enhancing molecular representation learning.
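One fusion step between the two helices can be sketched in plain Python. This is a minimal sketch under our own naming (scalar hidden features, dict-based graphs), not the paper's encoder implementation:

```python
def fuse_edge_attrs(h_G_nodes, h_L_nodes, edges_G, edges_L):
    """One fusion step at layer l: each edge (vi, vj) of G takes the
    layer l-1 hidden feature of its mirror node in L(G), while each
    edge ((vi,vj),(vj,vk)) of L(G) takes the layer l-1 hidden feature
    of the shared original node vj."""
    edge_attr_G = {e: h_L_nodes[e] for e in edges_G}
    edge_attr_L = {(a, b): h_G_nodes[(set(a) & set(b)).pop()]
                   for a, b in edges_L}
    return edge_attr_G, edge_attr_L
```

Calling this before every layer $l \geq 1$ keeps the edge attributes of each view synchronized with the node states of the other.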

4.3. INTRA-LOCAL CONTRASTIVE LOSS

In this part, we aim to tackle the over-smoothing issue introduced by the line graph. Motivated by the NT-Xent loss for contrastive learning, an intra-local contrastive loss is proposed. The NT-Xent loss is designed to maximize the representation similarities of positive pairs, consisting of hidden features of the same molecule, and simultaneously to enforce dissimilarity of negative pairs, comprising hidden features of different molecules. Similarly, the proposed intra-local contrastive loss seeks to maximize the consensus between corresponding nodes as opposed to different nodes within a single graph. Considering the one-to-one correspondence between the edges of $G$ and the vertices of $L(G)$, the contrastive samples of this loss are composed of the edge hidden features in $G$ and the node hidden features in $L(G)$. Thus, given a graph $G$, the intra-local contrastive loss of one edge pair is formally defined as
$$\mathcal{L}_{\mathrm{IntraC}}^{e_i} = -\log \frac{e^{\mathrm{sim}(\hat{h}_{G \cdot e_i}, h_{L(G) \cdot e_i})/\tau}}{\sum_{j=1, j \neq i}^{|E|} e^{\mathrm{sim}(\hat{h}_{G \cdot e_i}, h_{L(G) \cdot e_j})/\tau}},$$
where $e_i = (v_m, v_n)$, $e_i \in E$, and $(v_m, v_n) \in V_L$. In particular, the edge representations from $G$ are formed from the hidden features of the two endpoints of each edge, i.e., $\hat{h}_{G \cdot e_i} = \mathrm{MLP}([h_{v_m}, h_{v_n}])$. With this contrastive loss designed inside the graph, we expect to reduce the similarity between different nodes and thereby alleviate over-smoothing.
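The intra-local loss for one edge can be sketched directly from the definition. A minimal pure-Python version (edge representations are assumed to be precomputed vectors; the MLP over endpoint features is omitted):

```python
import math

def _cos(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def intra_local_loss(h_G_edges, h_L_nodes, i, tau=0.1):
    """Intra-local loss for edge i of ONE molecule: pull its edge
    representation in G toward the mirror node of L(G), push it
    away from the other line-graph nodes of the same molecule."""
    pos = math.exp(_cos(h_G_edges[i], h_L_nodes[i]) / tau)
    neg = sum(math.exp(_cos(h_G_edges[i], h_L_nodes[j]) / tau)
              for j in range(len(h_L_nodes)) if j != i)
    return -math.log(pos / neg)
```

The loss drops when the mirror pair agrees and the remaining within-graph pairs disagree, which is exactly the pressure that counteracts over-smoothed, near-identical node states.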

4.4. INTER-LOCAL CONTRASTIVE LOSS

Here, we give another contrastive loss to enhance molecular graph contrastive learning. As we can already enforce dissimilarity between different edge representations, we move forward to generalizing the edge dissimilarity to all contrasted samples. Our key insight is that the widespread NT-Xent loss only provides graph-level contrast, while a contrastive angle based on node representations would also be meaningful for crucial structure identification. In light of the intra-local contrastive loss, the inter-local contrastive loss is formally formulated as
$$\mathcal{L}_{\mathrm{InterC}}^{e_i} = -\log \frac{e^{\mathrm{sim}(\hat{h}_{G \cdot e_i}, h_{L(G) \cdot e_i})/\tau}}{\sum_{\hat{G} \in \mathcal{G}, \hat{G} \neq G} \sum_{j=1}^{|E_{\hat{G}}|} e^{\mathrm{sim}(\hat{h}_{G \cdot e_i}, h_{L(\hat{G}) \cdot e_j})/\tau}},$$
where $e_i \in E_G$, $e_j \in V_{L(\hat{G})}$, and $\mathcal{G}$ represents a training batch. Note that the negative pairs of the inter-local contrastive loss also come from two directions. We have now presented the main components of the proposed LGCL, which aims to free molecular graph contrastive learning from well-designed domain knowledge while maintaining the semantics. For unsupervised molecular graph representation learning, the final pre-training objective of LGCL is given by
$$\min \mathcal{L} = \mathcal{L}_G + \alpha \mathcal{L}_{\mathrm{InterC}} + \beta \mathcal{L}_{\mathrm{IntraC}},$$
where $\mathcal{L}_G$ denotes the NT-Xent loss, and $\alpha$ and $\beta$ are two hyper-parameters controlling the loss weights.

5. EXPERIMENT

In this section, we evaluate LGCL with extensive experiments. Following the procedure of pre-training and fine-tuning, we validate the effectiveness of our approach against SOTA competitors for view generation. Furthermore, we carry out analytical studies to assess each proposed component. Unsupervised and semi-supervised learning results are reported in the appendix.

5.1. EXPERIMENTAL SETUP

To stay in line with previous graph contrastive learning methods without prefabricated domain knowledge and to make the comparisons fair, we follow the experimental setup of Hu et al. (2020).
Pre-training dataset. The ZINC15 dataset Sterling & Irwin (2015) is adopted for LGCL pre-training. In particular, a subset of two million unlabeled molecular graphs is sampled from ZINC15.
Pre-training details. Following the graph encoder setting in Hu et al. (2020), a Graph Isomorphism Network (GIN) Xu et al. (2019) with five convolutional layers is adopted for message passing. The hidden dimension is fixed to 300 across all layers, and an average pooling readout over graph nodes is employed for the NT-Xent loss calculation with temperature $\tau = 0.1$. The hidden representations at the last layer are fed into the average pooling function. An Adam optimizer Kingma & Ba (2015) is employed to minimize the combined losses produced by the 5-layer GIN encoder. The batch size is set to 256, and all training runs last 100 epochs. The two hyper-parameters for loss weighting (i.e., $\alpha$ and $\beta$) are both set to 1.
Fine-tuning dataset. We employ the eight ubiquitous benchmarks from the MoleculeNet dataset Wu et al. (2018) for the downstream experiments. These benchmarks cover a variety of molecular tasks in physical chemistry, quantum mechanics, physiology, and biophysics. For dataset splitting, the scaffold split scheme Chen et al. (2012) is adopted for train/validation/test set generation. Table A.1 summarizes the basic characteristics of the datasets, such as size, tasks, and molecule statistics. Detailed descriptions can be found in Appendix A.

5.2. RESULTS

The results of LGCL and SOTA competitors for molecular property prediction on eight benchmarks are shown in Table 1. In summary, the proposed graph contrastive learning framework with the line graph, LGCL, obtains superior performance compared with previous works. Specifically, in the last column for average rank, our method achieves the highest ranking among SOTA contrastive learning methods as well as self-supervised learning methods, with a significant improvement over the second place (D-SLA, with an A.R. of 5.0). In particular, LGCL achieves the best performance on six out of eight benchmarks and also the best overall performance (see the penultimate column). Thus, we conclude that LGCL captures the molecular semantic information well in the absence of well-designed domain knowledge, and the line graph provides an excellent contrastive view without altering the molecular semantics.

5.3. ABLATION STUDY

Here, we delve deeper into the performance contribution of each proposed component. First, we analyze the performance boost from introducing the line graph and edge attribute fusion without ZINC15 pre-training. For the two local contrastive losses, we present the test results of various combinations of these parts under the transfer learning settings. The detailed discussions follow.
The effect of the line graph. In Figure 3, we analyze the effect of the line graph. In comparison with the red bar (i.e., 'No Pre-Train') denoting results from random initialization, introducing the line graph (i.e., 'No Pre-Train w/LG') shows an overall superior performance, which empirically suggests that the semantics underlying the edges are better captured by the line graph.
The effect of edge attribute fusion. Building on the performance boost of the line graph, we further present results with edge attribute fusion in Figure 3. Edge attribute fusion also brings five out of eight better results in contrast to the line graph alone. Furthermore, under the transfer learning setting, the performance differences between the first and fifth rows as well as the fourth and sixth rows in Table 2 also validate the effectiveness of edge attribute fusion. Thus, we may conclude that edge attribute fusion alleviates information inconsistency and enhances molecular graph representation learning.
The effect of the intra-local contrastive loss. The test results under the supervision of the proposed losses are shown in Table 2. For a comprehensive comparison, we first give a baseline pre-trained only with the NT-Xent loss (see the first row). The effectiveness of the proposed intra-local contrastive loss is confirmed by the performance differences between the second and first rows as well as the fourth and third rows, in which the only difference in experimental setup is $\mathcal{L}_{\mathrm{IntraC}}$.
Specifically, at least six out of eight better results are obtained by deploying this contrastive loss, which confirms its effectiveness in addressing over-smoothing.
The effect of the inter-local contrastive loss. Analogously, comparing the results of the first and third rows as well as the second and fourth rows in Table 2, we observe that six and five datasets achieve performance gains, respectively. This promotion indicates the effectiveness of the inter-local contrastive loss in crucial structure identification. Finally, despite several failures within these ablation studies, the last row, which simultaneously adopts all proposed components, performs best; thus, the proposed parts of LGCL are complementary to each other in molecular semantic exploration, independent of intricate domain knowledge.

6. CONCLUSIONS

In this work, we seek a decent view for molecular graph contrastive learning that maintains the integrity of molecular semantic information and is friendly to researchers without profound domain knowledge. Inspired by the line graph, we propose a method, called LGCL, to meet these expectations. Due to the different pace of message passing in the original graph and the corresponding line graph, we further present three crucial components to address the concomitant issues and enhance molecular graph representation learning. Under the transfer learning setting, we empirically demonstrate the superior performance of LGCL over SOTA works.

Supplementary Materials for

Look in The Mirror: Molecular Graph Contrastive Learning with Line Graph

A DETAILS OF MOLECULAR DATASETS

As discussed in Section 4.1, a vertex with $e$ incident edges in $G$ produces $e \times (e-1)/2$ edges in $L(G)$, which can lead to severe runtime complexity when the original graphs are dense. Therefore, our method only suits sparse graphs. As for the molecules adopted in this work, Table A.1 shows that the computation of the line graph encoder is about 1.5 times that of the original graph encoder, according to the average degrees of the transformed line graphs. Detailed comparisons of the realistic time required for model pre-training are shown in Appendix C.2.
Input graph representation. For simplicity, we use a minimal set of node and bond features that unambiguously describe the two-dimensional structure of molecules. We use RDKit Landrum (2013) to obtain these features.
Downstream datasets. Eight benchmarks from MoleculeNet Wu et al. (2018) are used to evaluate model performance.
• BBBP Martins et al. (2012). Blood-brain barrier penetration (membrane permeability); records of whether a compound carries the permeability property of penetrating the blood-brain barrier.
• Tox21 Tox (2014). Toxicity data on 12 biological targets, used in the 2014 Tox21 Data Challenge, including nuclear receptors and stress response pathways.
• ToxCast Richard et al. (2016). Toxicology measurements based on over 600 in vitro high-throughput screenings.
• SIDER Kuhn et al. (2016). Database of marketed drugs and adverse drug reactions (ADR), grouped into 27 system organ classes; also known as the Side Effect Resource.
• ClinTox Novick et al. (2013); Gayvert et al. (2016). Qualitative data classifying drugs approved by the FDA and those that have failed clinical trials for toxicity reasons.
• MUV Gardiner et al. (2011). Subset of PubChem BioAssay obtained by a refined nearest neighbor analysis, designed for validation of virtual screening techniques.
• HIV. Experimentally measured abilities to inhibit HIV replication.
• BACE Subramanian et al. (2016).
Qualitative binding results for a set of inhibitors of human β-secretase 1.
Dataset splitting. For molecular prediction tasks, we follow Ramsundar et al. (2019).
• NCI1 is a dataset made publicly available by the National Cancer Institute (NCI), a subset of balanced datasets containing chemical compounds screened for their ability to suppress or inhibit the growth of a panel of human tumor cell lines; this dataset has 37 discrete labels.
• MUTAG has seven kinds of graphs derived from 188 mutagenic aromatic and heteroaromatic nitro compounds.
• PROTEINS is a dataset where the nodes are secondary structure elements (SSEs), and there is an edge between two nodes if they are neighbors in the given amino acid sequence or in 3D space. The dataset has 3 discrete labels, representing helixes, sheets, or turns.
Configuration. To stay in line with GraphCL You et al. (2020), the same GNN architectures are employed with their original hyper-parameters under each experimental setting. Specifically, a GIN Xu et al. (2019) with 3 layers is set up for unsupervised representation learning. The encoder hidden dimensions are fixed for all layers to match GraphCL under each experimental setting. Models are trained for 20 epochs and tested every 10 epochs. The hidden dimension is 32, and the batch size is chosen from {32, 128}. An Adam optimizer Kingma & Ba (2015) is employed to minimize the contrastive loss, with the learning rate chosen from {0.01, 0.001, 0.0001}.
Learning protocol. Following the learning settings of SOTA works, the corresponding learning protocols are adopted for a fair comparison. In unsupervised representation learning Sun et al. (2020), all data are used for model pre-training, and the learned graph embeddings are then fed into a non-linear SVM classifier to perform classification. Experiments are repeated 5 times, each corresponding to a 10-fold evaluation as in Sun et al. (2020), with the mean and standard deviation of accuracies (%) reported.
Compared methods. We adopt baselines from three categories; the published hyper-parameters of these methods are used. The first set comprises three SOTA kernel-based methods, starting with GL Shervashidze et al. (2009).

Results. The results of LGCL along with SOTA competitors on three benchmarks are shown in Table A.3. To summarize, the proposed graph contrastive learning framework with the line graph, LGCL, obtains superior performance compared with previous works. In particular, LGCL achieves the best performance on two out of three benchmarks, all except NCI1. Thus, we conclude that LGCL captures molecular semantic information well in the unsupervised learning setting.

Configuration. ResGCN with 128 hidden units and 5 layers is set up for semi-supervised learning. For all datasets, we perform experiments with a 10% label rate 5 times, each corresponding to a 10-fold evaluation as in You et al. (2020), with the mean and standard deviation of accuracies (%) reported. For pre-training, the learning rate is tuned in {0.01, 0.001, 0.0001} and the epoch number in {20, 40, 60, 80, 100} via grid search. For fine-tuning, we follow the default setting in You et al. (2020): the learning rate is 0.001, the hidden dimension is 128, the batch size is 128, and the pre-trained models are trained for 100 epochs.

Learning protocols. Following the learning settings of SOTA works, the corresponding learning protocols are adopted for a fair comparison, as in the semi-supervised learning of You et al. (2020).

Results. The results of LGCL along with SOTA competitors on the two benchmarks are shown in Table A.4, in which LGCL surpasses the SOTA view generation works on both employed datasets. Thus, we conclude that LGCL captures molecular semantic information well in the semi-supervised learning setting.
In the design of LGCL, besides the general hyper-parameters (i.e., learning rate, batch size, dropout ratio, etc.), we introduce two hyper-parameters, α and β, for loss balance in the pre-training stage. To clearly show the essential effectiveness of the two losses rather than of the two hyper-parameters, we fix α and β to 1 in the main text. The other hyper-parameters in the pre-training phase are also fixed and consistent with Hu et al. (2020). To further inspect the hyper-parameter sensitivity of LGCL, we tune α and β over the candidates [0.01, 0.1, 1, 10, 100], respectively. When tuning α, we fix β to 1, and vice versa. In the fine-tuning stage, the learning rate is fixed to 0.001, the batch size to 32, and the dropout ratio to 0.5. The node representations for graph pooling are adopted from the last layer. The results are shown in Table A.5. As can be seen, LGCL surpasses GROVER on all datasets, MGSSL on six out of eight datasets, and 3D-Infomax on six out of seven datasets. Moreover, LGCL also achieves the highest average results among these baselines.
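The loss balance described above can be sketched as follows. The function name `total_loss` and the toy loss values are illustrative assumptions; the sketch only shows how α and β weight the two local contrastive losses against the global one.

```python
# Hedged sketch of the pre-training objective balance: a global contrastive
# loss plus two local contrastive losses weighted by alpha and beta.
def total_loss(l_global, l_intra, l_inter, alpha=1.0, beta=1.0):
    return l_global + alpha * l_intra + beta * l_inter

# With alpha = beta = 1 (the main-text setting), all terms weigh equally.
assert total_loss(0.5, 0.25, 0.25) == 1.0
# Sensitivity sweep: fix beta = 1 and vary alpha over the candidate grid.
sweep = [total_loss(0.5, 0.25, 0.25, alpha=a) for a in (0.01, 0.1, 1, 10, 100)]
assert sweep[2] == 1.0 and sweep[-1] == 0.5 + 100 * 0.25 + 0.25
```

Sweeping one weight while fixing the other to 1 isolates each local loss's contribution, matching the protocol used for Figure A.1.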

E THEORETICAL UNDERSTANDING OF LGCL

Besides the superior performance of LGCL shown in the main text for molecular property prediction, here we further present a theoretical understanding of how LGCL obtains better performance.

Definition E.1 (Graph Quotient Space). Define the equivalence ≅ between two graphs as G1 ≅ G2 if G1 and G2 cannot be distinguished by the 1-WL test. Define the quotient space G = G/≅. So every element of the quotient space, i.e., G ∈ G, is a representative graph from a family of graphs that cannot be distinguished by the 1-WL test. Note that our definition also allows attributed graphs.

Suppose G is a countable space, and thus G′ is a countable space. Because G and G′ are countable, PG and PG′ are defined over countable sets and are therefore discrete distributions. We say a function z(·) can distinguish two graphs G1, G2 if z(G1) ≠ z(G2). Suppose the encoder f is implemented by a GNN, and let the optimal encoder f* be the best model a GNN can find. Because f* is at most as powerful as the 1-WL test, for any two graphs G1, G2 ∈ G with G1 ≅ G2, we have f*(G1) = f*(G2).

Statement 1 in Theorem E.2 indicates that the information retained after the line graph transformation of a given graph is no less than that retained by contrastive views generated with data augmentation. Statement 2 in Theorem E.2 suggests that the information underlying the line graph that is essential for target prediction is likewise no less than that in contrastive views with data augmentation. In the proof below, step (a) in Eq. (15) holds because of the data processing inequality Cover (1999). Moreover, f* is as powerful as the 1-WL test and is injective on G′. Meanwhile, as stated in the Whitney graph isomorphism theorem Whitney (1932), two line graphs are isomorphic exactly when the corresponding original graphs are, which yields Eq. (16).



The code of LGCL will be made public after acceptance. Neither of the two models is equipped with the two local contrastive losses.



Figure 2: An illustration of line graph transformation. (a) shows a simple undirected graph G; (b) reveals the derivation of vertices in the line graph: every vertex of the line graph is marked in green and labeled with the node pair of the corresponding edge in G; (c) establishes the associations in L(G) based on the common nodes of two edges; (d) delivers the resulting line graph L(G) of the original graph G.
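The transformation illustrated in Figure 2 can be sketched in a few lines of pure Python. This is an illustrative implementation, not the paper's code: each edge {u, v} of G becomes a vertex of L(G), and two such vertices are adjacent iff the corresponding edges share an endpoint.

```python
from itertools import combinations

# Line graph transformation: returns the vertex set and edge set of L(G)
# for an undirected graph G given as a list of edges.
def line_graph(edges):
    incident = {}
    for e in edges:
        for v in e:
            incident.setdefault(v, []).append(frozenset(e))
    lg_nodes = {frozenset(e) for e in edges}
    lg_edges = set()
    for v, es in incident.items():
        # A node with e incident edges contributes e * (e - 1) / 2 edges to L(G).
        for e1, e2 in combinations(es, 2):
            lg_edges.add(frozenset((e1, e2)))
    return lg_nodes, lg_edges

# Star with a degree-3 center: its line graph is a triangle.
nodes, lg_edges = line_graph([(0, 1), (0, 2), (0, 3)])
assert len(nodes) == 3               # |V(L(G))| = |E(G)|
assert len(lg_edges) == 3 * 2 // 2   # 3 * (3 - 1) / 2 edges from the center
```

The assertion makes the runtime remark from Appendix B concrete: a single high-degree vertex alone already contributes quadratically many edges to L(G).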

Fine-tuning details. For downstream tasks, a linear layer is stacked on the pre-trained graph encoder for final property prediction. The downstream model employs the Adam optimizer for 100 epochs of fine-tuning. All experiments on each dataset are performed for ten runs with different seeds, and the results are the averaged ROC-AUC scores (%) ± standard deviations. The hyper-parameters tuned for each dataset are: (a) the learning rate ∈ {0.01, 0.001, 0.0001}; (b) the batch size ∈ {32, 128}; (c) the dropout ratio ∈ {0, 0.5}. The node representations for graph pooling are adopted from the last layer or the concatenation of all layers. These hyper-parameters are selected by grid search on the validation sets.

Baselines. In this paper, we choose SOTA competitors that follow the experimental setup in Hu et al. (2020). The first category is self-supervised graph learning algorithms, including EdgePred, AttrMasking, ContextPred Hu et al. (2020), Infomax Velickovic et al. (2019), and GraphMAE Hou et al. (2022). The second category is graph contrastive learning methods for view generation, such as GraphCL You et al. (2020), JOAO(v2) You et al. (2021), LP-Info You et al. (2022), AutoGCL Yin et al. (2022), GraphMVP Liu et al. (2022), RGCL Li et al. (2022b), and D-SLA Kim et al. (2022).
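The grid search described above can be sketched as follows. The `evaluate` callable is a hypothetical stand-in for fine-tuning the pre-trained encoder and scoring ROC-AUC on a validation set; only the candidate grids come from the text.

```python
from itertools import product

# Select the hyper-parameter combination with the best validation score.
def grid_search(evaluate, lrs=(0.01, 0.001, 0.0001),
                batch_sizes=(32, 128), dropouts=(0.0, 0.5)):
    best_cfg, best_score = None, float("-inf")
    for lr, bs, dr in product(lrs, batch_sizes, dropouts):
        score = evaluate(lr=lr, batch_size=bs, dropout=dr)
        if score > best_score:
            best_cfg = {"lr": lr, "batch_size": bs, "dropout": dr}
            best_score = score
    return best_cfg, best_score

# Toy validation scorer that happens to prefer lr=0.001, batch 32, dropout 0.5.
mock = lambda lr, batch_size, dropout: (-abs(lr - 0.001)
                                        - abs(batch_size - 32) / 1000
                                        - abs(dropout - 0.5))
cfg, _ = grid_search(mock)
assert cfg == {"lr": 0.001, "batch_size": 32, "dropout": 0.5}
```

In practice `evaluate` would train for the full fine-tuning schedule per configuration, so the 3 × 2 × 2 grid above costs twelve fine-tuning runs per dataset.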

Figure 3: Average test ROC-AUC (%) gain within 'No Pre-Train' from the line graph (w/LG) and edge attribute fusion (w/AF) across all datasets.

The kernel-based set is completed by WL Shervashidze et al. (2011) and DGK Yanardag & Vishwanathan (2015). The second set is four heuristic self-supervised methods, including node2vec Grover & Leskovec (2016), sub2vec Adhikari et al. (2018), graph2vec Annamalai Narayanan & Jaiswal (2017), and InfoGraph Sun et al. (2020). The final compared methods are GraphCL You et al. (2020), JOAO(v2) You et al. (2021), AD-GCL Suresh et al. (2021), AutoGCL Yin et al. (2022), and RGCL Li et al. (2022b).

Figure A.1: Sensitivity w.r.t. hyper-parameters α and β.

Comparison among parts of LGCL.

Figure A.2: Pre-training time comparison. The time required by LGCL is much less than the time needed by baselines except AutoGCL.

Figure A.2 shows the pre-training time required for 2 million molecular graphs from ZINC15 with 100 epochs of pre-training. As shown in Figure A.2a, because the contrastive views of LGCL are static, the time required by LGCL is much less than the time needed by the baselines except AutoGCL, which reveals that LGCL has not only superior performance but also excellent efficiency.

Figure A.3: Visualization of the intra-local contrastive loss in alleviating over-smoothing. We examine the two groups of distributions by a Kolmogorov–Smirnov test (KS test); the KS test p-values show that the two groups of distributions are distinct (i.e., the p-values are less than 0.01).
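The KS test used in Figures A.3 and A.4 can be reproduced with a minimal two-sample implementation. This pure-Python sketch (a production setup would call `scipy.stats.ks_2samp` instead) computes the maximum CDF gap and the standard asymptotic p-value.

```python
import math

# Two-sample Kolmogorov-Smirnov test with the asymptotic p-value approximation.
def ks_2samp(x, y):
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    cdf = lambda s, v: sum(1 for t in s if t <= v) / len(s)
    # D = sup_v |F_x(v) - F_y(v)| over the pooled sample values.
    d = max(abs(cdf(x, v) - cdf(y, v)) for v in set(x) | set(y))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # small-sample correction (Numerical Recipes)
    p = 2 * sum((-1) ** (k - 1) * math.exp(-2 * (k * lam) ** 2)
                for k in range(1, 101))
    return d, max(min(p, 1.0), 0.0)

same = [i / 100 for i in range(100)]
shifted = [v + 5 for v in same]        # a clearly distinct distribution
d, p = ks_2samp(same, shifted)
assert d == 1.0 and p < 0.01           # the test rejects equality, as in Fig. A.3
```

A p-value below 0.01, as reported in the captions, means the two similarity distributions are very unlikely to come from the same underlying distribution.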

(b) Pre-training loss curves of LGCL with and without edge attribute fusion.

Figure A.4: Visualization of edge attribute fusion in alleviating information inconsistency. We examine the two view similarity distributions by a Kolmogorov–Smirnov test (KS test); the KS test p-values show that the two distributions are distinct (i.e., the p-values are less than 0.01).

Theorem E.2. We may define a mapping over G′, also denoted by f*, which simply satisfies f*(G′) :≜ f*(G), where G ≅ G′, G ∈ G, and G′ ∈ G′. Suppose t(·) is the data augmentation function and L(·) is the line graph transformation function. We have

1. I(L(G); G) ≥ I(t(G′); G′);
2. I(L(G); Y) ≥ I(t(G); Y).

Proof. Given G, G ⇒ L(G) is an injective deterministic mapping. Therefore, for any random variable Q,

I(L(G); Q) = I(G; Q). (13)

Setting Q = G, we obtain

I(L(G); G) = I(G; G). (14)

Then, we have

I(L(G′); G′) = I(G′; G′) ≥(a) I(t(G′); G′). (15)

Moreover,

I(L(G′); G′) = I(f*(L(G′)); f*(G′)) = I(f*(L(G)); f*(G)) = I(L(G); G). (16)

Here, the second equality holds because the line graph transformation does not change the isomorphism relationship between the two graphs G′ and G, while f*(G′) = f*(G). Therefore, we achieve statement 1:

I(L(G); G) ≥ I(t(G′); G′). (17)

Again, because by definition f* = argmax_f I(f(G); G), f* must be injective, and G ⇒ f*(G) is an injective deterministic mapping. Setting Q = Y, we have

I(f*(G); Y) = I(G; Y). (18)

Because G ⇒ L(G) is an injective deterministic mapping,

I(f*(G); Y) = I(f*(L(G)); Y) = I(L(G); Y). (19)

Further, because of the data processing inequality Cover (1999),

I(f*(G); Y) = I(G; Y) ≥ I(t(G); Y) = I(f*(t(G)); Y). (20)

Combining the above equations, we obtain statement 2:

I(L(G); Y) = I(f*(L(G)); Y) = I(f*(G); Y) ≥ I(f*(t(G)); Y) = I(t(G); Y), (21)

which concludes the proof of the essential information of the line graph.
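As a numerical sanity check of the two mechanisms used in the proof (not part of the proof itself), the following sketch verifies on toy discrete data that an injective deterministic map preserves mutual information, mirroring G ⇒ L(G), while a non-injective map, mimicking a lossy augmentation t(·), can only decrease it, matching the data processing inequality in Eqs. (15) and (20). The toy samples and maps are assumptions for illustration.

```python
import math
from collections import Counter

# Plug-in estimate of mutual information from (x, y) sample pairs.
def mutual_info(pairs):
    n = len(pairs)
    pxy = Counter(pairs)
    px = Counter(a for a, _ in pairs)
    py = Counter(b for _, b in pairs)
    return sum(c / n * math.log((c / n) / (px[a] / n * py[b] / n))
               for (a, b), c in pxy.items())

data = [(0, 0), (0, 0), (1, 1), (1, 1), (2, 1), (2, 0)]  # samples of (G, Y)
inj = lambda g: g + 10        # injective map, analogous to G -> L(G)
lossy = lambda g: min(g, 1)   # merges states 1 and 2, analogous to t(G)

i_gy = mutual_info(data)
i_inj = mutual_info([(inj(g), y) for g, y in data])
i_lossy = mutual_info([(lossy(g), y) for g, y in data])
assert abs(i_inj - i_gy) < 1e-12   # injective transform preserves I(.; Y)
assert i_lossy <= i_gy + 1e-12     # lossy transform cannot increase I(.; Y)
```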

Along with the development of graph contrastive learning, plenty of research effort has been devoted to designing contrastive learning models for molecular graphs Sun et al. (2021); Xu et al. (2021); Fang et al. (2022); Stärk et al. (2022); Liu et al. (2022); Li et al. (2022a). Besides random or learnable corruption, several works presented contrastive learning models that embed molecular geometry information by contrasting the generic 2D graph with its 3D conformers Liu et al. (2022); Stärk et al. (2022); Li et al. (2022a). They indeed get rid of the semantic-altering issue caused by random corruption on molecular graphs, while introducing another semantic-altering issue caused by 3D conformers, because a single 2D molecular graph generally has multiple conformers with different chemical properties Stärk et al. (2022).

Average test ROC-AUC (%) ± Std. over 10 different runs of LGCL along with all baselines on eight downstream molecular property prediction benchmarks. The results of baselines are derived from the published works. Bold indicates the best performance among all baselines. Avg. shows the average ROC-AUC over all datasets. A.R. denotes the average rank.

Average test ROC-AUC (%) of LGCL with different components. Avg. shows the average ROC-AUC over all datasets. A.R. denotes the average rank.

Downstream task datasets. Eight binary graph classification datasets from MoleculeNet Wu et al. (2018) are adopted.

As discussed above, LGCL only suits sparse graphs, to avoid prohibitive runtime complexity. As shown in Table A.2, the sizes of all social network datasets and of the dense bioinformatics dataset (i.e., DD) increase heavily after the line graph transformation, which leads to unaffordable computation consumption in graph representation learning. Therefore, we only employ sparse bioinformatics datasets for unsupervised and semi-supervised learning. Experiment details are elaborated below. Three sparse bioinformatics datasets are adopted from TUDataset Morris et al. (2020) for unsupervised learning: NCI1, MUTAG, and PROTEINS. Table A.2 summarizes the characteristics of the three employed datasets.
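The size blow-up above follows directly from the degree sequence: L(G) has |E(G)| nodes and Σ_v deg(v)(deg(v) − 1)/2 edges. A back-of-envelope sketch (the cycle and complete-graph degree sequences are illustrative examples, not the datasets' statistics):

```python
from math import comb

# Compute |V(L(G))| and |E(L(G))| from a degree sequence of G.
def line_graph_size(degrees):
    n_nodes = sum(degrees) // 2                  # |E(G)| by the handshake lemma
    n_edges = sum(comb(d, 2) for d in degrees)   # sum of C(deg(v), 2)
    return n_nodes, n_edges

sparse = [2] * 20    # a 20-node cycle: every vertex has degree 2
dense = [19] * 20    # the complete graph K_20: every vertex has degree 19

assert line_graph_size(sparse) == (20, 20)                # a cycle maps to a cycle
assert line_graph_size(dense) == (190, 20 * comb(19, 2))  # 190 nodes, 3420 edges
```

A 20-edge sparse graph keeps its size after transformation, while a 20-node dense graph explodes to 190 nodes and 3,420 edges, which is why the social network and DD datasets are excluded.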

Bold indicates the best performance over all methods. Italic marks the second best performance.

In semi-supervised learning, there exist two settings. For datasets with a public training/validation/test split, pre-training is performed only on the training set, fine-tuning is conducted with 10% of the training data, and final evaluation results are from the validation/test sets. For datasets without such splits, all samples are employed for pre-training, while fine-tuning and evaluation are performed over 10 folds.
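The 10-fold setting above can be sketched with a simple index generator. This is an illustrative helper (a real pipeline would typically use a library utility such as scikit-learn's `KFold`): it partitions n samples into k contiguous folds and yields one train/test split per fold.

```python
# Yield (train_indices, test_indices) for each of k folds over n samples.
def k_fold_indices(n, k=10):
    sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for s in sizes:
        folds.append(list(range(start, start + s)))
        start += s
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(k_fold_indices(25, k=10))
assert len(splits) == 10
assert all(len(tr) + len(te) == 25 for tr, te in splits)
# Every sample appears in exactly one test fold.
assert sorted(i for _, te in splits for i in te) == list(range(25))
```

Repeating this procedure 5 times with different shuffles and reporting mean ± standard deviation gives the protocol used for Tables A.3 and A.4.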

Table A.4: Average accuracies (%) ± Std. of compared methods via semi-supervised representation learning with 10% labels. Bold indicates the best performance over all methods. Italic marks the second best performance.

Table A.5: Average test ROC-AUC (%) ± Std. over 10 different runs of LGCL along with all baselines on eight downstream molecular property prediction benchmarks. The results of baselines are derived from the published works. Bold indicates the best performance among all baselines. Avg. shows the average ROC-AUC over all datasets. A.R. denotes the average rank. '-' indicates data missing in such works.

