O-GNN: INCORPORATING RING PRIORS INTO MOLECULAR MODELING

Abstract

Cyclic compounds, which contain at least one ring, play an important role in drug design. Despite the recent success of molecular modeling with graph neural networks (GNNs), few models explicitly take the rings in compounds into consideration, which limits their expressiveness. In this work, we design a new variant of GNN, ring-enhanced GNN (O-GNN), that explicitly models rings in addition to atoms and bonds in compounds. In O-GNN, each ring is represented by a latent vector, which contributes to and is iteratively updated by atom and bond representations. Theoretical analysis shows that O-GNN is able to distinguish two isomorphic subgraphs lying on different rings using only one layer, while conventional graph convolutional neural networks require multiple layers to do so, demonstrating that O-GNN is more expressive. In experiments, O-GNN shows good performance on 11 public datasets. In particular, it achieves state-of-the-art validation results on the PCQM4Mv1 benchmark (outperforming the previous KDDCup champion solution) and state-of-the-art results on the drug-drug interaction prediction task on DrugBank. Furthermore, O-GNN outperforms strong baselines (without modeling rings) on the molecular property prediction and retrosynthesis prediction tasks.

1. INTRODUCTION

Cyclic compounds, i.e., molecules that have at least one ring in their structure, occur naturally throughout chemical space. According to our statistics on 109M compounds from PubChem (Kim et al., 2019), a widely used chemical library, more than 90% of compounds have at least one ring. The rings can be small and simple (e.g., benzene is a six-membered carbon ring, and pentazole is a five-membered nitrogen ring) or large and complex (e.g., the molecule shown in Figure 1). Rings are important in drug discovery. For example: (1) Rings can potentially reduce the flexibility of molecules, reduce the uncertainty when interacting with target proteins, and lock the molecules into their bioactive conformation (Sun et al., 2012). (2) Macrocyclic compounds, which usually have a ring with more than 12 atoms, play important roles in antibiotic design (Venugopal & Johnson, 2011) and peptide drug design (Bhardwaj et al., 2022).

Recently, deep neural networks, especially graph neural networks (GNNs) (Kipf & Welling, 2017; Hamilton et al., 2017a), have been widely used in molecular modeling. A GNN takes a graph as input, and messages from different nodes are passed along edges. GNNs have achieved great success in scientific discovery: (1) Stokes et al. (2020) train a GNN to predict growth inhibition of Escherichia coli and find that Halicin is a broad-spectrum bactericidal antibiotic. (2) Shan et al. (2022) leverage GNNs to model the interactions between proteins, and they eventually obtain possible antibodies for SARS-CoV-2. In addition, GNNs are widely used in drug property prediction (Rong et al., 2020), drug-target interaction modeling (Torng & Altman, 2019), retrosynthesis (Chen & Jung, 2021), etc.

However, none of the above works explicitly models ring information in GNNs. From the application perspective, they miss an important feature for their tasks.
From the machine learning perspective, Loukas (2020) points out that existing message-passing-based GNNs cannot properly capture ring information when the product of network width and depth is not large enough (see Table 1 in Loukas (2020)). Therefore, with classic GNNs, the ring information in compounds is not well leveraged.

Figure 1: Paclitaxel, a compound with 7 simple rings. Kampan et al. (2015) summarized that the intact taxane ring (i.e., $r_4$, $r_5$, $r_6$) and a four-membered oxetane side ring (i.e., $r_7$) are essential to induce cytotoxic activity.

To tackle this issue, in this work, we propose a new model, ring-enhanced GNN (denoted as O-GNN), that explicitly models the ring information in a compound. The O stands for the rings in molecules and is pronounced as "O". Generally speaking, O-GNN stacks $L$ layers, and each layer sequentially updates edge representations, node representations and ring representations by aggregating their neighbourhood information. We mainly use a self-attention layer for adaptive message passing, and use a feed-forward layer to introduce non-linearity into the representations.

We first demonstrate the advantage of O-GNN through theoretical analysis. O-GNN is able to distinguish two isomorphic sub-graphs lying on different rings using only one layer (see Figure 2 for an example). In contrast, if we remove the ring-modeling components from O-GNN, distinguishing them requires multiple layers (see Section 2.3 for the detailed analysis). These results demonstrate that O-GNN is more expressive than conventional graph convolutional networks that do not explicitly model rings.

We then conduct experiments on 11 datasets from three tasks, including molecular property prediction, drug-drug interaction prediction and retrosynthesis: (1) For molecular property prediction, we first conduct experiments on PCQM4Mv1, where the task is to predict the HOMO-LUMO gap of molecules.
Our method outperforms the champion solution of KDDCup on the validation set (Shi et al., 2022) (note that test set labels are not available). Next, we verify O-GNN on six datasets from MoleculeNet (Wu et al., 2018), where the task is to predict several pharmaceutically relevant properties of molecules. O-GNN outperforms the corresponding GNN baselines without rings. Finally, we conduct experiments on FS-Mol (Stanley et al., 2021), a few-shot property prediction task, and show that modeling rings also improves the prediction accuracy. (2) For drug-drug interaction prediction, where the task is to predict whether two drugs interact with each other, we test O-GNN on DrugBank following the previous settings (Nyamabo et al., 2021; Li et al., 2022), and achieve state-of-the-art results. (3) For retrosynthesis, we apply O-GNN to LocalRetro (Chen & Jung, 2021), a strong GNN-based method for retrosynthesis. On USPTO-50k, our method improves the top-k accuracy over the LocalRetro baseline.

2.1. NOTATION AND PRELIMINARIES

Let $G = (V, E)$ denote a molecular graph, where $V$ and $E$ are the collections of nodes/atoms and edges/bonds [footnote 1]. Let $R$ denote the collection of rings in $G$. Define $V = \{v_1, v_2, \cdots, v_{|V|}\}$ and $E = \{e_{ij}\}$, where $v_i$ is the $i$-th atom and $e_{ij}$ is the bond connecting $v_i$ and $v_j$. When the context is clear, we use $i$ to denote atom $v_i$, and use $e(v_i, v_j)$ to denote edge $e_{ij}$. Let $N(i)$ denote the neighbors of atom $i$, i.e., $N(i) = \{v_j \mid e_{ij} \in E\}$. Define $R = \{r_1, r_2, \cdots, r_{|R|}\}$, where each $r_i$ is a simple ring. A simple ring does not contain any other ring structure. For example, the molecule in Figure 3 has two simple rings, marked $r_1$ and $r_2$. The ring $(1, 2, 3, 4, 5, 6, 7, 8, 9, 1)$ is not a simple ring.

Let $R(v_i)$ and $R(e_{ij})$ denote the rings that the atom $v_i$ or the bond $e_{ij}$ lies on, and let $V(r)$ and $E(r)$ denote all the atoms and the bonds lying on ring $r$. For example, in Figure 3, $R(v_4) = \{r_1, r_2\}$ while $R(v_3) = \{r_2\}$; $R(e_{49}) = \{r_1, r_2\}$ while $R(e_{78}) = \{r_1\}$; $V(r_1) = \{v_4, v_5, v_6, v_7, v_8, v_9\}$ and $E(r_1) = \{e_{45}, e_{56}, e_{67}, e_{78}, e_{89}, e_{94}\}$.

A graph neural network (GNN) is usually built by stacking several identical GNN layers. Each GNN layer is composed of an Aggregate function and an Update function,

$$h'_i = \text{Update}\big(h_i, \text{Aggregate}(\{h_j \mid j \in N(i)\})\big),$$

where $h_i$ is the representation of atom $i$ and $h'_i$ is its updated representation.

In Figure 3, $H_E^{(l)}$, $H_V^{(l)}$, $H_R^{(l)}$ and $U^{(l)}$ denote the representation collections of bonds, atoms, rings and the global compound at the $l$-th layer.
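The ring notation above can be made concrete with a small sketch. It builds $V(r)$, $E(r)$, $R(v)$ and $R(e)$ for the two simple rings of the Figure 3 example. The atom list of $r_2$ is an assumption inferred from the statements in the text (the figure itself is not reproduced here), and the function names are hypothetical.

```python
from collections import defaultdict

# Simple rings given as ordered atom cycles.
# r1 follows V(r1) in the text; r2 = (1, 2, 3, 4, 9) is an assumption that is
# consistent with R(v3) = {r2}, R(e49) = {r1, r2} and the outer 9-atom ring.
rings = {
    "r1": [4, 5, 6, 7, 8, 9],
    "r2": [1, 2, 3, 4, 9],
}

def ring_bonds(cycle):
    """Return the bonds of a ring as a set of unordered atom pairs."""
    n = len(cycle)
    return {frozenset((cycle[k], cycle[(k + 1) % n])) for k in range(n)}

def build_ring_maps(rings):
    """Build V(r), E(r), R(v) and R(e) from a dict of ring id -> atom cycle."""
    V_of, E_of = {}, {}
    R_of_atom, R_of_bond = defaultdict(set), defaultdict(set)
    for name, cycle in rings.items():
        V_of[name] = set(cycle)
        E_of[name] = ring_bonds(cycle)
        for v in V_of[name]:
            R_of_atom[v].add(name)
        for e in E_of[name]:
            R_of_bond[e].add(name)
    return V_of, E_of, R_of_atom, R_of_bond

V_of, E_of, R_of_atom, R_of_bond = build_ring_maps(rings)
print(sorted(R_of_atom[4]), sorted(R_of_atom[3]))
print(sorted(R_of_bond[frozenset((4, 9))]), sorted(R_of_bond[frozenset((7, 8))]))
```

In practice the simple rings would come from a cheminformatics toolkit rather than a hand-written dictionary; the lookup tables are the part that the model equations rely on.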

2.2. MODEL

Our model consists of $L$ layers that share the same architecture but have separate parameters. The architecture of each layer is shown in Figure 3. Let $h_i^{(l)}$, $h_{ij}^{(l)}$ and $h_r^{(l)}$ denote the output representations of atom $v_i$, bond $e_{ij}$ and ring $r$ at the $l$-th layer, respectively. Let $U^{(l)}$ denote the compound representation at the $l$-th layer. We initialize $h_i^{(0)}$ via a learnable embedding layer that encodes the atom's type, chirality, degree, formal charge, hybridization type, and so on. Similarly, we initialize $h_{ij}^{(0)}$ with a learnable embedding that encodes the bond type, the stereoisomerism type and whether the bond is conjugated. Then we initialize $h_r^{(0)}$ by concatenating the node and edge embeddings and transforming the result with a non-linear layer. Last, we initialize the compound representation with a learnable embedding.

In each layer, we sequentially update the representations of bonds, atoms, rings and the compound. We will frequently use $\text{MLP}(\cdots)$, a multi-layer perceptron with one hidden layer, to build our model. The inputs of an MLP are concatenated into one long vector and processed by the network.

(1) Update bond representations: The representation of a bond is updated via the connected atoms, the rings that the bond belongs to and the compound representation from the last layer:

$$h_{ij}^{(l)} = h_{ij}^{(l-1)} + \text{MLP}\Big(h_i^{(l-1)},\ h_j^{(l-1)},\ h_{ij}^{(l-1)},\ \frac{1}{|R(e_{ij})|}\sum_{r \in R(e_{ij})} h_r^{(l-1)},\ U^{(l-1)}\Big). \quad (2)$$

(2) Update atom representations: We use an attention model to adaptively aggregate bond representations into the central atoms. Mathematically,

$$\bar{h}_i^{(l)} = \sum_{j \in N(i)} \alpha_j W_v\,\text{concat}\big(h_{ij}^{(l)}, h_j^{(l-1)}\big);$$
$$\alpha_j \propto \exp\Big(a^{\top}\,\text{LeakyReLU}\big(W_q h_i^{(l-1)} + W_k\,\text{concat}(h_j^{(l-1)}, h_{ij}^{(l)})\big)\Big);$$
$$h_i^{(l)} = h_i^{(l-1)} + \text{MLP}\Big(h_i^{(l-1)},\ \bar{h}_i^{(l)},\ \frac{1}{|R(v_i)|}\sum_{r \in R(v_i)} h_r^{(l-1)},\ U^{(l-1)}\Big). \quad (3)$$

In Eqn. (3), the $W$'s are the parameters to be learned, and concat denotes concatenating the input vectors into one long vector.
(3) Update ring representations: The ring representations are updated using MLP networks:

$$h_r^{(l)} = h_r^{(l-1)} + \text{MLP}\Big(h_r^{(l-1)},\ \sum_{v_i \in V(r)} h_i^{(l)},\ \sum_{e_{ij} \in E(r)} h_{ij}^{(l)},\ U^{(l-1)}\Big). \quad (4)$$

(4) Update the compound representation:

$$U^{(l)} = U^{(l-1)} + \text{MLP}\Big(\frac{1}{|V|}\sum_{i=1}^{|V|} h_i^{(l)},\ \frac{1}{|E|}\sum_{i,j} h_{ij}^{(l)},\ \frac{1}{|R|}\sum_{r \in R} h_r^{(l)},\ U^{(l-1)}\Big). \quad (5)$$

After stacking $L$ O-GNN layers, we get the graph representation by a simple average pooling layer, i.e., $h_G = \frac{1}{|V|}\sum_{i=1}^{|V|} h_i^{(L)}$, which can be used for graph classification tasks. For node classification tasks, we can add a classification head on top of $h_i^{(L)}$.
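The four update steps can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the released implementation: weights are random and bias-free, the hidden size `D`, the LeakyReLU slope of 0.2, the zero vector used when an atom or bond lies on no ring, and the fixed end-atom order for an undirected bond are all choices made here, and every function name is hypothetical.

```python
import math
import random

random.seed(0)
D = 4  # hidden size, chosen only for illustration

def linear(n_in, n_out):
    """Random bias-free linear map standing in for a learned weight matrix."""
    W = [[random.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]
    return lambda x: [sum(w * xi for w, xi in zip(row, x)) for row in W]

def mlp(n_in):
    """One-hidden-layer MLP applied to the concatenation of its inputs."""
    f1, f2 = linear(n_in, D), linear(D, D)
    return lambda *parts: f2([max(0.0, z) for z in f1([x for p in parts for x in p])])

def add(x, y):
    return [a + b for a, b in zip(x, y)]

def vsum(vs):
    return [sum(c) for c in zip(*vs)] if vs else [0.0] * D

def mean(vs):
    return [c / len(vs) for c in vsum(vs)] if vs else [0.0] * D

def make_params():
    return {"bond": mlp(5 * D), "atom": mlp(4 * D), "ring": mlp(4 * D), "glob": mlp(4 * D),
            "Wq": linear(D, D), "Wk": linear(2 * D, D), "Wv": linear(2 * D, D),
            "a": [random.uniform(-0.5, 0.5) for _ in range(D)]}

def ognn_layer(p, h_atom, h_bond, h_ring, U, rings):
    """rings maps a ring id to (atoms on the ring, bonds on the ring)."""
    rings_of_atom = lambda i: [h_ring[r] for r, (V, _) in rings.items() if i in V]
    rings_of_bond = lambda e: [h_ring[r] for r, (_, E) in rings.items() if e in E]
    # Step (1), Eqn. (2): bonds read their end atoms, their rings and the compound.
    new_bond = {}
    for e, h in h_bond.items():
        i, j = sorted(e)  # fixed end-atom order for an undirected bond (assumption)
        new_bond[e] = add(h, p["bond"](h_atom[i], h_atom[j], h, mean(rings_of_bond(e)), U))
    # Step (2), Eqn. (3): attention over incident bonds, then a residual MLP.
    new_atom = {}
    for i, h in h_atom.items():
        scores, msgs = [], []
        for e, h_e in new_bond.items():
            if i not in e:
                continue
            (j,) = e - {i}
            z = add(p["Wq"](h), p["Wk"](h_atom[j] + h_e))  # list "+" concatenates
            scores.append(sum(a * (x if x > 0 else 0.2 * x) for a, x in zip(p["a"], z)))
            msgs.append(p["Wv"](h_e + h_atom[j]))
        top = max(scores, default=0.0)
        w = [math.exp(s - top) for s in scores]  # softmax numerators
        agg = vsum([[wk / sum(w) * x for x in m] for wk, m in zip(w, msgs)])
        new_atom[i] = add(h, p["atom"](h, agg, mean(rings_of_atom(i)), U))
    # Step (3), Eqn. (4): rings sum-pool their updated atoms and bonds.
    new_ring = {r: add(h_ring[r], p["ring"](h_ring[r], vsum([new_atom[i] for i in V]),
                                            vsum([new_bond[e] for e in E]), U))
                for r, (V, E) in rings.items()}
    # Step (4), Eqn. (5): the compound reads the means of all three collections.
    new_U = add(U, p["glob"](mean(list(new_atom.values())), mean(list(new_bond.values())),
                              mean(list(new_ring.values())), U))
    return new_atom, new_bond, new_ring, new_U

def rand_vec():
    return [random.uniform(-1, 1) for _ in range(D)]

# Toy molecule: a three-membered ring (atoms 0, 1, 2) with one substituent (atom 3).
bonds = [frozenset(b) for b in [(0, 1), (1, 2), (0, 2), (0, 3)]]
rings = {"r1": ({0, 1, 2}, set(bonds[:3]))}
h_atom = {i: rand_vec() for i in range(4)}
h_bond = {e: rand_vec() for e in bonds}
h_ring = {"r1": rand_vec()}
U = rand_vec()

params = [make_params() for _ in range(2)]  # L = 2 layers with separate parameters
for p in params:
    h_atom, h_bond, h_ring, U = ognn_layer(p, h_atom, h_bond, h_ring, U, rings)
h_G = mean(list(h_atom.values()))  # average-pooling readout
print(h_G)
```

Note that the ring and compound inputs of steps (1) and (2) come from the previous layer, while steps (3) and (4) read the freshly updated atom and bond representations, which mirrors the superscripts in Eqn. (2) to (5).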

2.3. THEORETICAL ANALYSIS

In this section, we compare the distinguishability of a standard GNN (without ring representations) and O-GNN. In addition to the notation defined in Section 2.1, we define the valued version of a graph $G = (V, E)$ as a triplet $\text{VALUE}_f(G) = (V, E, f)$, where $f$ is a mapping that stores feature information by mapping a node or an edge to its corresponding input feature (e.g., a 256-dimensional representation). We call $f$ a feature mapping on $G$.

Definition 1 (k-neighbourhood node). For a molecular graph $G = (V, E)$ and two nodes $u, v \in V$, we say $u$ is a $k$-neighbourhood node of $v$ if there exists a path in $G$ connecting $u$ and $v$ with length no larger than $k$. More formally, $u$ is a $k$-neighbourhood node of $v$ if and only if there exists a set of nodes $\{v_0, v_1, \cdots, v_t\} \subset V$ such that $t \le k$, $v_0 = v$, $v_t = u$, and for any $i \in \{0, \cdots, t-1\}$, $v_{i+1} \in N(v_i)$. We highlight here that $v$ is a 0-neighbourhood node (and thus a $k$-neighbourhood node for any $k \ge 0$) of itself.

Definition 2 (k-neighbourhood sub-graph). For a molecular graph $G = (V, E)$ and a node $v$ in $G$, we define the $k$-neighbourhood sub-graph of $v$ as the sub-graph composed of all of $v$'s $k$-neighbourhood nodes. More formally, we slightly abuse the notation and denote the $k$-neighbourhood sub-graph of $v$ as $G(v, k) \triangleq (V(v, k), E(v, k))$, where $V(v, k) \triangleq \{u \in V : u \text{ is a } k\text{-neighbourhood node of } v\}$ and $E(v, k) \triangleq \{e(v_1, v_2) \in E : v_1, v_2 \in V(v, k)\}$.

Definition 3 (Equivalent valued graph). For two valued graphs $\text{VALUE}_{f_1}(G_1) = (V_1, E_1, f_1)$ and $\text{VALUE}_{f_2}(G_2) = (V_2, E_2, f_2)$, we say that they are equivalent if (i) $G_1$ and $G_2$ are isomorphic, i.e., there exists a one-to-one mapping $P : V_1 \to V_2$ such that the edges are preserved; (ii) $P$ also preserves the values of edges and the values of nodes, i.e., $\forall u, v \in V_1$,

$$e(u, v) \in E_1 \Leftrightarrow e(P(u), P(v)) \in E_2,\quad f_1(u) = f_2(P(u)),\quad f_1(v) = f_2(P(v)),\quad f_1(e(u, v)) = f_2(e(P(u), P(v))).$$
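Definitions 1 and 2 translate directly into a breadth-first search. The sketch below computes $V(v, k)$ and $E(v, k)$ for a graph given as an adjacency dictionary; the function name and the six-membered ring used as input are illustrative choices, not part of the original text.

```python
from collections import deque

def k_neighbourhood(adj, v, k):
    """Return (V(v, k), E(v, k)) following Definitions 1 and 2."""
    dist = {v: 0}              # v is a 0-neighbourhood node of itself
    queue = deque([v])
    while queue:
        u = queue.popleft()
        if dist[u] == k:       # do not expand beyond path length k
            continue
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    nodes = set(dist)
    # E(v, k): every edge of G whose two end points both lie in V(v, k)
    edges = {frozenset((a, b)) for a in nodes for b in adj[a] if b in nodes}
    return nodes, edges

# Six-membered ring 0-1-2-3-4-5-0
adj = {i: [(i - 1) % 6, (i + 1) % 6] for i in range(6)}
nodes, edges = k_neighbourhood(adj, 0, 2)
print(sorted(nodes), len(edges))
```

For $k = 2$ the sub-graph is an open path of five atoms, so the ring closure is invisible inside $G(v, 2)$; it only appears once $k$ reaches 3.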
With all the preparations above, we are now ready to define the graph feature extractor and its discriminatory ability.

Definition 4 (Graph feature extractor and its discriminatory ability). We say a mapping $\Phi$ is a graph feature extractor if it maps a valued graph $\text{VALUE}_f(G)$ to a new feature mapping $\tilde{f}$ on $G$. We further allow $\Phi$ to be parameterized as $\Phi_\theta$, and call $\Phi_\theta$ a parameterized graph feature extractor. For a parameterized graph feature extractor $\Phi_\theta$, we say $\Phi_\theta$ has the discriminatory ability for $k$-neighbourhood sub-graphs if, for any valued graph $(G, f)$ and any two nodes $u, v$ in $G$ whose valued $k$-neighbourhood sub-graphs (i.e., $(G(u, k), f)$ and $(G(v, k), f)$) are equivalent, there exists $\theta^\star$ such that $\Phi_{\theta^\star}((G, f))(u) \neq \Phi_{\theta^\star}((G, f))(v)$. In this case, we also say that $\Phi_{\theta^\star}$ can distinguish $u$ and $v$.

We point out that $\{h_i^{(l)}\}_i \cup \{h_{ij}^{(l)}\}_{i,j}$ defined by Eqn. (2, 3, 4, 5) is a parameterized feature extractor, and thus the above provides a formal definition of O-GNN's discriminatory ability. The next proposition shows that, without the ring representations, O-GNN needs at least $k + 1$ layers to have the discriminatory ability for $k$-neighbourhood sub-graphs.

Proposition 1. Without the ring representations, O-GNN with no more than $k$ layers does not have the discriminatory ability for $k$-neighbourhood sub-graphs.

Note that Proposition 1 can be easily extended to conventional graph convolutional neural networks, which only aggregate information from 1-neighbourhood nodes. We then show that with the ring representations, O-GNN with only one layer has the discriminatory ability.

Proposition 2. If $u$ and $v$ lie on different rings, O-GNN with only one layer can distinguish them.

The proofs are deferred to Appendix B due to space limitations. From Propositions 1 and 2, we can see that O-GNN is more expressive than a regular GNN that does not model rings.
By Proposition 1, a regular GNN requires at least $k + 1$ layers to distinguish two isomorphic $k$-neighbourhood sub-graphs on different rings, while O-GNN only requires one layer for this purpose (see the example in Figure 2). Compared with a regular GNN that has the same number of layers, modeling ring representations increases the number of parameters by a constant fraction that is independent of $k$. A regular GNN, however, needs more than $k$ layers to achieve the discriminatory ability for $k$-neighbourhood sub-graphs, so its parameter count grows with $k$. When $k$ is large, O-GNN is therefore much more parameter efficient. More discussion is in Appendix C.5.
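The contrast between Propositions 1 and 2 can be illustrated with a small experiment in the spirit of Figure 2. The sketch below is not the proof from Appendix B: it replaces learned message passing by 1-WL colour refinement (each round a node's colour is combined with the multiset of its neighbours' colours), and it uses the sorted atom types on a ring as a hypothetical stand-in for the initial ring representation $h_r^{(0)}$. The two rings are an all-carbon eight-membered ring and an eight-membered ring with one nitrogen placed four bonds away from the query atom; both are kept as separate components for simplicity.

```python
def ring_graph(atom_types, offset=0):
    """Build one ring; returns (labels, adjacency) with node ids starting at offset."""
    n = len(atom_types)
    labels = {offset + i: t for i, t in enumerate(atom_types)}
    adj = {offset + i: [offset + (i - 1) % n, offset + (i + 1) % n] for i in range(n)}
    return labels, adj

labels_a, adj_a = ring_graph(["C"] * 8)                                   # ring A
labels_b, adj_b = ring_graph(["C"] * 4 + ["N"] + ["C"] * 3, offset=8)     # ring B
labels = {**labels_a, **labels_b}
adj = {**adj_a, **adj_b}
ring_of = {i: "A" if i < 8 else "B" for i in labels}

def refine(colors, adj, rounds):
    """Run `rounds` steps of neighbour aggregation (1-WL colour refinement)."""
    for _ in range(rounds):
        colors = {i: (colors[i], tuple(sorted(colors[j] for j in adj[i]))) for i in adj}
    return colors

def ring_summary(ring):
    """Hypothetical ring feature: the sorted atom types lying on the ring."""
    return tuple(sorted(labels[i] for i in labels if ring_of[i] == ring))

u, v = 0, 8  # u on ring A, v on ring B, four bonds away from the nitrogen
separated = {}
for k in range(1, 5):
    plain = refine(dict(labels), adj, k)
    separated[k] = plain[u] != plain[v]
    print(k, separated[k])

# With a ring feature attached to every atom, one round is already enough.
with_rings = refine({i: (labels[i], ring_summary(ring_of[i])) for i in labels}, adj, 1)
print(with_rings[u] != with_rings[v])
```

The 3-neighbourhood sub-graphs of `u` and `v` are both seven-carbon paths, so up to three rounds of refinement leave them indistinguishable; only the fourth round reaches the nitrogen. Attaching the ring summary separates them after a single round.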

3. EXPERIMENTS

To validate the effectiveness of our method, we test O-GNN on the following three tasks: molecular property prediction, drug-drug interaction prediction and retrosynthesis. The first two tasks are graph-level prediction tasks, and the third one is a node/link prediction task.

3.1. APPLICATION TO MOLECULAR PROPERTY PREDICTION

Datasets. We work on three datasets for this application: (1) The HOMO-LUMO energy gap prediction of the PCQM4Mv1 dataset (Hu et al., 2021). The input is a 2D molecular graph, and the target is its HOMO-LUMO energy gap, which is an essential molecular property in quantum chemistry. PCQM4Mv1 has 3,045,360 training molecules and 380,670 validation molecules (test labels are not available). The properties are obtained via density functional theory. (2) Molecular property prediction on the MoleculeNet dataset (Wu et al., 2018). This dataset is about the prediction of pharmaceutical properties of small molecules. We choose six molecular property prediction tasks (BBBP, Tox21, ClinTox, HIV, BACE and SIDER), whose sizes range from 1.5k to 41k molecules. (3) Few-shot molecular property prediction on the FS-Mol dataset (Stanley et al., 2021). FS-Mol is made up of 5120 separate assays extracted from ChEMBL27 (https://www.ebi.ac.uk/chembl/). Each assay has 94 molecule-property pairs on average.

Training configuration. For PCQM4Mv1, we set the number of layers to 12 and the hidden dimension to 256, which are selected by cross-validation on the training set. For FS-Mol, the number of layers is 6 and the hidden dimension is 256. The candidate numbers of layers and hidden dimensions for MoleculeNet are {4, 6, 8, 12} and {128, 256}. On FS-Mol and MoleculeNet, the hyper-parameters are selected according to validation performance. We train all these tasks on one GPU. The optimizer is AdamW (Loshchilov & Hutter, 2019). More detailed parameters are summarized in Table 5 of Appendix A.

Results on PCQM4Mv1

The results on PCQM4Mv1 are reported in Table 1. We compare O-GNN with the following baselines: (1) Conventional GCN/GIN with/without virtual node (marked by "vn"). The results are from Hu et al. (2021); (2) ConfDSS (Liu et al., 2021), which predicts quantum properties conditioned on low-cost conformer sets; (3) Two-branch Transformer (Xia et al., 2021), which has a regression head and a classification head that learn from each other; (4) Graphormer (Ying et al., 2021; Shi et al., 2022), the champion solution of PCQM4Mv1. Since the owners of PCQM4Mv1 did not release the labels of the test set, we can only compare the results on the validation set. The evaluation metric is the mean absolute error (MAE).

From Table 1, we can see that O-GNN achieves the best results among the strong baseline models, which shows the effectiveness of our method. In addition, GIN vn, ConfDSS and Graphormer do not explicitly use the ring information, and we will combine O-GNN with these strong methods in the future.

To investigate the significance of the ring information, we study a variant of O-GNN obtained by removing the ring modeling component, and denote this variant as "O-GNN w/o ring". Specifically, it is implemented by removing Eqn. (4) and all the $h_r$'s in Eqn. (2, 3, 5). We conduct experiments for O-GNN and "O-GNN w/o ring" from 2 to 14 layers. The results are in Figure 4. We can see that by utilizing ring information, the performance is improved regardless of the number of layers. In addition, we find that a 6-layer O-GNN is comparable with the 12-layer O-GNN w/o ring, which shows the benefit of modeling rings in GNNs. O-GNN also outperforms "O-GNN w/o ring" when the two are compared at the same number of parameters (see Figure 10). It is worth noting that the 14-layer O-GNN performs slightly worse on the validation set than the 12-layer O-GNN. This phenomenon is also observed in Graphormer (Shi et al., 2022): larger models do not always lead to better validation results.
We will explore how to train deeper models in the future.

On PCQM4Mv1, we also study the average performance improvement w.r.t. several ring properties. The performance improvement is defined as $\epsilon_1 - \epsilon_2$, where $\epsilon_1$ and $\epsilon_2$ denote the validation MAE of "O-GNN w/o rings" and O-GNN, respectively. The ring properties include: (i) the number of rings in a molecule; (ii) the number of atoms lying on rings; (iii) the number of atoms in the largest ring. We conduct experiments for networks with different numbers of layers ($L$ = 2, 6, 12). Results are reported in Figure 5. Overall, as the number of rings, the maximum ring size and the number of atoms lying on rings increase, O-GNN achieves a larger improvement over the variant without modeling rings. More analyses are in Appendix C.4.
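The grouped improvement $\epsilon_1 - \epsilon_2$ described above can be computed as in the following sketch. The record format, the function name and the toy numbers are all hypothetical placeholders chosen to make the arithmetic easy to follow; they are not values from the paper.

```python
from collections import defaultdict

def improvement_by_group(records):
    """records: iterable of (group_key, abs_err_without_ring, abs_err_with_ring).

    Returns, per group, MAE("O-GNN w/o rings") minus MAE(O-GNN), i.e. eps1 - eps2.
    """
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for key, e1, e2 in records:
        s = sums[key]
        s[0] += e1
        s[1] += e2
        s[2] += 1
    return {k: s[0] / s[2] - s[1] / s[2] for k, s in sums.items()}

# Placeholder per-molecule absolute errors, keyed by the number of rings.
toy_records = [
    (0, 0.10, 0.10),
    (1, 0.12, 0.11),
    (1, 0.14, 0.11),
    (2, 0.20, 0.15),
]
result = improvement_by_group(toy_records)
print(result)
```

The same function applies to the other two ring properties by changing the grouping key to the number of atoms on rings or to the size of the largest ring.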

Results on MoleculeNet

For MoleculeNet, we compare with both pre-training and non-pre-training methods. For non-pre-training methods, we compare with the following baselines: (i) GCN (Kipf & Welling, 2017) with virtual node; (ii) GIN (Xu et al., 2018) with virtual node; (iii) O-GNN without using ring information (denoted as "O-GNN w/o ring"). For pre-training methods, we select several representative graph-based methods: (i) Hu et al. (2020) proposed to predict the masked attributes on graphs as well as maintaining the consistency between a subgraph and its neighbors; (ii) G-{Contextual, Motif} are variants of Rong et al. (2020), which are provided in Liu et al. [...]

Method                        BBBP        Tox21       ClinTox     HIV         BACE        SIDER
[...] (Liu et al., 2022)      70.3 ± 1.6  75.2 ± 0.3  59.9 ± 8.2  75.9 ± 0.9  79.2 ± 0.3  58.4 ± 0.6
G-Motif (Liu et al., 2022)    66.4 ± 3.4  73.2 ± 0.8  77.8 ± 2.0  73.8 ± 1.4  73.4 ± 4.0  60.6 ± 1.1
GraphMVP (Liu et al., 2022)   72.4 ± 1.6  75.9 ± 0.5  79.1 ± 2.8  77.0 ± 1.2  81.2 ± 0.9  63.9 ± 1.

Results on FS-Mol

[...] The evaluation metric is ∆AUPRC, which is the difference between the AUPRC (area under the precision-recall curve) and the ratio of active compounds in the query set. A higher ∆AUPRC score indicates better classification performance. The results are in Figure 6. We report the mean and the standard deviation for different tasks across various support set sizes. We have the following observations: (i) By using O-GNN as the backbone model for the prototypical network, the results are improved for all support set sizes. (ii) The improvement is more significant when the support set size is large. When $|T_{u,\text{support}}|$ = 128 and 256, the improvements are 0.014 and 0.016. When the size is reduced to 16/32/64, the improvements are all around 0.008. We will further improve the results on limited data in the future.
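The ∆AUPRC metric can be sketched as follows. The sketch assumes that the AUPRC is approximated by average precision (the step-wise sum over ranked positives) and that scores contain no ties; the function names and the toy query set are illustrative and do not come from the FS-Mol code base.

```python
def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    order = sorted(range(len(labels)), key=lambda i: -scores[i])
    hits, total = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank   # precision at the rank of each active compound
    return total / hits if hits else 0.0

def delta_auprc(labels, scores):
    """AUPRC minus the ratio of active compounds in the query set."""
    return average_precision(labels, scores) - sum(labels) / len(labels)

# Toy query set: 1 = active, 0 = inactive (placeholder values).
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.7, 0.1]
print(round(delta_auprc(labels, scores), 4))
```

Subtracting the active ratio makes assays with different class balance comparable, since a random ranking is expected to score close to zero regardless of how many actives the query set contains.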

3.2. APPLICATION TO DDI PREDICTION

Drug-drug interaction (DDI) prediction aims to predict the therapeutic outcome of taking two drugs together, such as an increased risk of certain side effects or an enhanced effect. We focus on the classification task, where the inputs are two drug molecules and one interaction (e.g., inhibition), and the output is 0 or 1 to indicate whether the two drugs have this specific interaction.

Table 3: Results of drug-drug interaction prediction on DrugBank.

To predict the interaction between two drugs, we use one 6-layer O-GNN to extract features for both drugs. Specifically, for each drug, we average the node representations output by the last layer as the drug feature. We concatenate the two drug features, and then multiply the result with the interaction embedding to make the prediction. The detailed parameters are given in Table 6 of Appendix A.

The results are reported in Table 3. O-GNN significantly outperforms previous baselines in terms of accuracy (denoted as ACC), the area under the receiver operating characteristic curve (AUROC), the average precision (AP), and the F1 score. Most previous works use GCN, GIN or GAT backbones and focus on designing sophisticated interaction modules (Nyamabo et al., 2021; Li et al., 2022). By using the O-GNN backbone, we can significantly improve the results without designing complex interaction modules. This shows the effectiveness of our method.
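The prediction head described above can be sketched in a few lines. The text only says that the concatenated drug features are multiplied with the interaction embedding, so the sketch assumes a dot product followed by a sigmoid; this reading, the function names and the toy vectors are assumptions, not the confirmed implementation.

```python
import math

def mean_pool(node_reps):
    """Average the last-layer node representations of one drug."""
    n = len(node_reps)
    return [sum(col) / n for col in zip(*node_reps)]

def ddi_probability(nodes_a, nodes_b, interaction_emb):
    """Score one (drug A, drug B, interaction) triple.

    Assumption: "multiply" is read as a dot product with an interaction
    embedding of twice the hidden size, squashed by a sigmoid.
    """
    z = mean_pool(nodes_a) + mean_pool(nodes_b)   # list "+" concatenates
    logit = sum(x * w for x, w in zip(z, interaction_emb))
    return 1.0 / (1.0 + math.exp(-logit))

# Toy node representations with hidden size 2 (placeholder values).
nodes_a = [[1.0, 0.0], [0.0, 1.0]]   # two atoms -> drug feature [0.5, 0.5]
nodes_b = [[2.0, 2.0]]               # one atom  -> drug feature [2.0, 2.0]
p_neutral = ddi_probability(nodes_a, nodes_b, [1.0, -1.0, 0.5, -0.5])
p_positive = ddi_probability(nodes_a, nodes_b, [0.0, 0.0, 1.0, 0.0])
print(p_neutral, p_positive)
```

Because the two drug features are concatenated in a fixed order, this head is not symmetric in the two drugs; whether that matters depends on how the interaction types are defined in the dataset.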

3.3. APPLICATION TO RETROSYNTHESIS

Retrosynthesis aims to predict the reactants of a given product. Various GNNs have been applied to this task. For example, GLN (Dai et al., 2019) uses GNNs to predict the distributions of candidate reaction templates and reactants. GraphRetro (Somnath et al., 2021) and G2G (Shi et al., 2020) use GNNs to predict where to break a bond and how to add fragments to complete the synthons. To demonstrate the ability of O-GNN, we combine our method with LocalRetro (Chen & Jung, 2021), the current best graph-based model for retrosynthesis (without using pre-training). LocalRetro uses a GNN to predict the possible templates for each atom and each bond, and sorts the predicted templates according to their probabilities. The top templates are applied to the corresponding atoms or bonds via RDKit (Landrum et al., 2016) to generate the reactants. Chen & Jung (2021) use MPNN (Gilmer et al., 2017a) for prediction, and we replace the MPNN with O-GNN.

We conduct experiments on the USPTO-50k dataset (Coley et al., 2017), which contains 50,016 reactions. Following Chen & Jung (2021), we partition the dataset into a 45k training set, a 5k validation set and a 5k test set. The evaluation metric is the top-k accuracy, where k = 1, 3, 5, 10, 50. The results are summarized in Table 4. We can observe that O-GNN predicts reactions more accurately than the baselines without ring information. In particular, when the reaction type is known, we improve the top-1 accuracy by 1.8 points and the top-3 accuracy by 1.6 points. These results show the importance of modeling ring structures and the effectiveness of our method.

The performance for different numbers of rings. To study the prediction performance on molecules with different numbers of rings, we group the USPTO-50k test set by the number of rings in the product molecules and compute the top-1 accuracy for each group. More specifically, we divide [...] Overall, the improvement is larger when there are more rings in a molecule.
In particular, for the group of molecules with at least 6 rings (i.e., the last column), O-GNN increases the accuracy by 5.96 points, demonstrating that our method can better leverage ring structures.

Case study. In Figure 7(b), we show an example prediction for a product molecule with 5 rings. The reactions in the left panel are the top-3 predictions from the LocalRetro baseline and the ones on the right are from O-GNN. Our method successfully predicts the correct reactants in its first output (marked in green), but the baseline fails to give a correct prediction. More importantly, the baseline even fails to identify the correct bond to change. These results suggest that modeling ring structures is crucial for predicting reactions accurately, and that O-GNN is an effective algorithm for retrosynthesis.
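The two evaluation quantities used in this section, top-k accuracy and top-1 accuracy grouped by the number of rings in the product, can be sketched as below. The function names, the cap that merges all products with six or more rings into one group, and the toy predictions are illustrative assumptions; candidate reactant sets are represented by opaque strings.

```python
from collections import defaultdict

def top_k_accuracy(predictions, targets, k):
    """Fraction of products whose ground-truth reactants appear in the top-k list."""
    hits = sum(t in p[:k] for p, t in zip(predictions, targets))
    return hits / len(targets)

def top1_by_ring_count(predictions, targets, ring_counts, cap=6):
    """Top-1 accuracy per ring-count group; counts >= cap share one group."""
    groups = defaultdict(lambda: [0, 0])   # group -> [hits, total]
    for p, t, n in zip(predictions, targets, ring_counts):
        g = groups[min(n, cap)]
        g[0] += int(bool(p) and p[0] == t)
        g[1] += 1
    return {n: hit / tot for n, (hit, tot) in sorted(groups.items())}

# Placeholder ranked predictions for three products.
predictions = [["A", "B", "C"], ["X", "Y", "Z"], ["Q", "R", "S"]]
targets = ["A", "Z", "T"]
ring_counts = [1, 7, 6]

print(top_k_accuracy(predictions, targets, 1), top_k_accuracy(predictions, targets, 3))
print(top1_by_ring_count(predictions, targets, ring_counts))
```

In a real evaluation the strings would be canonicalized reactant SMILES, so that two equivalent notations of the same reactant set compare as equal.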

4. CONCLUSIONS AND FUTURE WORK

In this work, we propose a new model, ring-enhanced GNN (O-GNN for short), for molecular modeling. We explicitly incorporate ring representations into the GNN and jointly update them with atom and bond representations. We provide a theoretical analysis of O-GNN and prove that, with O-GNN, the node representations are more distinguishable than with the variant without ring representations. We conduct experiments on molecular property prediction, drug-drug interaction (DDI) prediction and retrosynthesis. O-GNN outperforms strong baselines on these tasks and achieves state-of-the-art results on the PCQM4Mv1 validation set and on DDI prediction.

For future work, first, we will combine O-GNN with pre-training to obtain a stronger model. Second, we need to further improve our model when the training data is very limited (e.g., when the support set size is 16 or fewer). Third, how to efficiently identify and incorporate representations of more complex structures is another interesting direction to explore. Fourth, we will apply our model to more real-world scenarios, like the synthesis and generation of natural products with large rings.

Therefore, there exists a choice of MLP such that

$$h_a^{(1)} = h_a^{(0)} + \text{MLP}\Big(h_a^{(0)},\ \bar{h}_a^{(1)},\ \frac{1}{|R(v_a)|}\sum_{r \in R(v_a)} h_r^{(0)},\ U^{(0)}\Big) \neq h_b^{(0)} + \text{MLP}\Big(h_b^{(0)},\ \bar{h}_b^{(1)},\ \frac{1}{|R(v_b)|}\sum_{r \in R(v_b)} h_r^{(0)},\ U^{(0)}\Big) = h_b^{(1)}.$$

The proof is completed.

C MORE ABLATION STUDY

C.1 NODE REPRESENTATION POOLING VS. COMPOUND REPRESENTATIONS

We explore the difference between using the average pooling $h_G = \frac{1}{|V|}\sum_{i=1}^{|V|} h_i^{(L)}$ and the compound representation $U^{(L)}$ for the final prediction. We try two networks with different numbers of layers ($L$ = 6 and 12). We conduct experiments on the PCQM4Mv1 dataset. The validation mean absolute errors (MAE) are reported in Table 8. We can see that using average node pooling is better than using the compound representation. This is consistent with the findings on using a virtual node in GIN (Hu et al., 2021).

In Eqn. (4), we concatenate the sum pooling of atom representations, the sum pooling of bond representations and the compound representation to update ring representations. An alternative solution is to use attention models to aggregate the atom and bond representations. We study a variant that updates the ring representations as follows:

$$h_r^{(l)} = h_r^{(l-1)} + \text{MLP}\Big(h_r^{(l-1)},\ \sum_{v_i \in V(r)} \alpha_i^{(l)} h_i^{(l)},\ \sum_{e_{ij} \in E(r)} \beta_{ij}^{(l)} h_{ij}^{(l)},\ U^{(l-1)}\Big). \quad (9)$$

In Eqn. (9), $\alpha_i^{(l)} \propto \exp\big(W_{q1} h_r^{(l-1)} + W_{k1} h_i^{(l)}\big)$ and $\beta_{ij}^{(l)} \propto \exp\big(W_{q2} h_r^{(l-1)} + W_{k2} h_{ij}^{(l)}\big)$, where the four $W$'s are parameters to be learned. The results are reported in Table 9. We can see that although our method is simple, it effectively leverages the ring information and outperforms this attention-based variant.

C.3 O-GNN WITH BRICS

The ring representation used in our method can be considered a special motif. One might wonder whether other types of motifs would be helpful. To see the effect, we use the BRICS model (Degen et al., 2008) to decompose molecules into fragments. BRICS designs 16 rules to break bonds that can match a set of chemical reactions. The ring representations in Eqn. (2, 3, 4, 5) are replaced by these motif representations. The remaining parts remain unchanged. We conduct the experiments on the PCQM4Mv1 dataset, and the results are shown in Table 10. Due to time and computation resource limitations, all the models are trained for 200 epochs.

Table 9: Comparison between our method and using attention models when updating ring representations.

Method                                                           L = 6    L = 12
O-GNN                                                            0.1171   0.1149
O-GNN with attention models when updating ring representations   0.1179   0.1160

Table 10 (columns: L = 2, L = 4, L = 6, L = 8, L = 12).

From Table 10, we can conclude that: (1) using simple ring representations achieves better results than using BRICS; (2) in general, using BRICS is better than the variant without any ring or ring-based fragmentation information. We will keep exploring more segmentation methods.

C.4 MORE COMPARISON BETWEEN O-GNN AND O-GNN W/O RINGS

As a complement to the analysis of the MAE w.r.t. the number of rings in molecules in Figure 5, we also report the prediction error (i.e., mean absolute error, MAE) of O-GNN and the variant "O-GNN w/o rings" in Figure 8. We can observe that when molecules have no rings, the two methods perform similarly. As the number of rings increases from 1 to 6, the MAE increases, and O-GNN always outperforms the "O-GNN w/o rings" variant.

C.5 ADDITIONAL DISCUSSIONS

About over-smoothing: Since we build a 12-layer network, one might wonder whether it suffers from over-smoothing. Actually, Cong et al. (2021) [...] the architecture of (Addanki et al., 2021); therefore, we do not think that our model suffers from over-smoothing.

Modeling k-neighborhood: If we want to explicitly use the $k$-neighborhood information, we might need additional modules to process them, like

$$\text{net}_1(\text{1-neighbor nodes}) + \text{net}_2(\text{2-neighbor nodes}) + \cdots + \text{net}_k(\text{$k$-neighbor nodes}). \quad (11)$$

To ensure expressiveness, we usually do not share parameters. Therefore, the number of parameters is $k$ times larger than that of a regular GNN. In contrast, O-GNN increases the number of parameters by a constant fraction that is independent of $k$. When $k$ is large, O-GNN is much more parameter efficient. On the other hand, the optimal $k^*$ is not easy to determine. For example, in DrugBank, the maximum ring sizes range from 3 (e.g., DB00658) to 53 (e.g., DB05034), so it is hard to tell which $k$ is best.

About invariant constraints

In O-GNN, the features of atoms, bonds and rings are all invariant. Specifically, the features of atoms and bonds describe their types, number of correlated electrons, number of neighbors, etc. (please refer to https://github.com/O-GNN/O-GNN/blob/5b70a4f9dc9a5f87a0171eea1e9cecde30489eb8/ogb/utils/features.py#L2 for details). The ring representations are obtained from atom and bond representations (see Eqn. (4)), and are therefore also invariant. Variant features (like coordinates) are not encoded.

Comparison about the convergence speed: The validation MAE curves on PCQM4Mv1 are shown in Figure 9. The results of the 6-layer O-GNN (with/without rings) and the 12-layer O-GNN (with/without rings) are reported. We can see that: [...]

Comparison between different numbers of parameters. Figure 4 shows the validation MAE of O-GNN and "O-GNN w/o ring" w.r.t. the number of layers. We also visualize the validation MAE w.r.t. the number of parameters in Figure 10. We can observe that, when compared at the same number of parameters, O-GNN still outperforms the variant without modeling rings.

Pre-training baselines on MoleculeNet. We summarize the pre-training baselines on MoleculeNet in Table 11. Sun et al. (2022) have demonstrated that different data splitting methods can lead to significantly different results. We follow the common practice of using scaffold-based splitting, and we cite the results of Rong et al. (2020) from Fang et al. (2022). Note that O-GNN is not pre-trained on unlabeled molecules. We can see that in terms of the average score, our method [...]



When the context is clear, we use nodes/atoms and edges/bonds interchangeably in this work.



Figure 2: An illustrative example of theoretical results. The three substructures in the red circles are isomorphic. The second and third substructures lie on different rings (a Cyclooctane and an Azocane). A regular GNN requires multiple layers to distinguish the three substructures while O-GNN requires only one layer due to the ring representations.

Figure 3: The workflow of our method. H

Figure 4: MAE w.r.t. numbers of layers.

Number of atoms lying on rings.

Figure 5: Performance improvement over several ring properties.

Case study on molecule with complex rings.

Figure 7: Study of O-GNN on retrosynthesis task. (a) The top-1 accuracy w.r.t. number of rings in product molecules. (b) The one-step retrosynthesis prediction of a product molecule with five rings. The first O-GNN output is the same as the ground truth (marked as green).

Figure 8: Predicted MAE categorized by different properties. The x-axis denotes the number of rings, and the y-axis denotes the mean absolute error (MAE) on the validation set.

Figure 9: Comparison of the convergence speed of O-GNN and "O-GNN w/o ring". The x-axis denotes the training epoch and the y-axis denotes the validation MAE.

Different GNNs have different Aggregate functions and Update functions. Details are summarized in Appendix D.

, the champion solution of PCQM4Mv1. Since

Validation MAE on PCQM4Mv1.

with virtual node; (iii) O-GNN without using ring information (denoted as "O-GNN w/o ring"). For pre-training methods, we select

Test ROC-AUC (%) performance of different methods on 6 binary classification tasks from the MoleculeNet benchmark. The training, validation and test sets are provided by DeepChem. Each experiment is run independently three times. The mean and standard deviation are reported.

several representative graph-based methods: (i) Hu et al. (2020) proposed to predict the masked attributes on graphs as well as to maintain the consistency between a subgraph and its neighbors; (ii) G-{Contextual, Motif} are variants of (Rong et al., 2020), which are provided in Liu et al.

Following Nyamabo et al. (2022) and Li et al. (2022), we work on the inductive setting of the DrugBank dataset (Wishart et al., 2018), which has 1,706 drugs, 86 interaction types, and 191,808 triplets. To test the generalization ability of the model, we conduct experiments on two settings w.r.t. the drugs: the S1 setting, where neither of the two drugs in a test pair appears in the training set; and the S2 setting, where one drug is seen in the training set and the other is not. Note that the drug pairs in the test set do not appear in the training set. Hence, the DrugBank data is split into training and test sets by the visibility of the drugs, and the negative samples are generated offline. We directly use the data provided by Nyamabo et al. (2021; 2022), where 20% of the drugs are first held out as unseen drugs to form the test set and the remaining 80% of the drugs are used to create the training set.
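The drug-level split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure used to build the released data: the function name `split_inductive`, the toy drug identifiers and the random seed are all hypothetical, and offline negative sampling is not covered.

```python
import random


def split_inductive(triplets, drugs, unseen_ratio=0.2, seed=0):
    """Split (drug_a, drug_b, interaction_type) triplets by drug visibility.

    Hypothetical helper: a fraction `unseen_ratio` of the drugs is held out,
    and each triplet is routed by how many of its two drugs are held out.
    """
    rng = random.Random(seed)
    shuffled = sorted(drugs)
    rng.shuffle(shuffled)
    unseen = set(shuffled[: int(len(shuffled) * unseen_ratio)])

    train, s1, s2 = [], [], []
    for a, b, r in triplets:
        n_unseen = (a in unseen) + (b in unseen)
        if n_unseen == 0:
            train.append((a, b, r))  # both drugs are seen during training
        elif n_unseen == 2:
            s1.append((a, b, r))     # S1: neither drug appears in training
        else:
            s2.append((a, b, r))     # S2: exactly one drug appears in training
    return train, s1, s2, unseen


# Toy data: 10 made-up drug IDs, every unordered pair, one interaction type.
drugs = [f"D{i}" for i in range(10)]
triplets = [(a, b, 0) for i, a in enumerate(drugs) for b in drugs[i + 1:]]
train, s1, s2, unseen = split_inductive(triplets, drugs)
print(len(train), len(s1), len(s2))
```

Because the split is made over drugs rather than over triplets, no test pair can appear in the training set, which matches the property noted in the text.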

Results on USPTO-50k datasets with reaction type known/unknown.

A virtual node can be regarded as a compound representation, which connects to all nodes in the graph. When using virtual nodes, it is a common practice to use the average or sum pooling of node representations to represent a graph. One can refer to https://github.com/snap-stanford/ogb/blob/1c875697fdb20ab452b2c11cf8bfa2c0e88b5ad3/examples/lsc/pcqm4m/gnn.py#L60 for the detailed implementation.

Comparison between using average node representations VS compound representations.
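To make the two kinds of graph-level vectors concrete, the sketch below contrasts average/sum pooling of node representations with one simplified round of virtual-node message passing. It is an assumption-laden illustration in plain Python: the function names are hypothetical, the learned transformation normally applied to the virtual node is omitted, and it should not be read as the OGB or O-GNN implementation.

```python
def mean_pool(node_feats):
    """Average the node vectors of one graph into a single graph vector."""
    n, dim = len(node_feats), len(node_feats[0])
    return [sum(h[d] for h in node_feats) / n for d in range(dim)]


def sum_pool(node_feats):
    """Sum the node vectors of one graph into a single graph vector."""
    dim = len(node_feats[0])
    return [sum(h[d] for h in node_feats) for d in range(dim)]


def virtual_node_step(node_feats, virtual):
    """One simplified round of virtual-node message passing (hypothetical).

    1. Broadcast: every node adds the virtual-node vector to its own vector.
    2. Collect: the virtual node adds the sum of the updated node vectors.
    A learned MLP would normally follow step 2; it is left out here.
    """
    nodes = [[x + v for x, v in zip(h, virtual)] for h in node_feats]
    pooled = sum_pool(nodes)
    new_virtual = [p + v for p, v in zip(pooled, virtual)]
    return nodes, new_virtual


# Toy graph with three nodes and two feature dimensions.
H = [[1.0, 0.0], [0.0, 2.0], [2.0, 1.0]]
u = [0.0, 0.0]  # virtual node initialised to zero
H, u = virtual_node_step(H, u)
print(mean_pool(H))  # average node representation
print(u)             # virtual-node (compound-level) representation
```

In this toy case the virtual node starts at zero, so the node vectors are unchanged and the virtual node simply holds their sum, while the average readout divides that sum by the number of nodes.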

Comparison between using simple rings (i.e., our method) and using BRICS-based fragments.

point that "over-smoothing does not necessarily happen in practice, a deeper model is provably expressive, can converge to global optimum with linear convergence rate, and achieve very high training accuracy as long as properly trained." (The words are from (Cong et al., 2021) for accurate expression). In addition, Li et al. (2020) and Addanki et al. (2021) both successfully trained 50+ layer networks. Our method follows

ACKNOWLEDGMENTS

This work was supported in part by NSFC under Contract 61836011, and in part by the Fundamental Research Funds for the Central Universities under contract WK3490000007.

availability

//github.

A DETAILED EXPERIMENT CONFIGURATIONS

The hyperparameters for molecular property prediction, drug-drug interaction prediction and retrosynthesis are summarized in Table 5, Table 6 and Table 7, respectively.

B PROOFS OF THE TWO PROPOSITIONS

Proof of Proposition 1. We start the proof by explicitly writing down the ring-free variant of O-GNN. Specifically, the bond representations are given by, U (l-1) ). The atom representations are given by); The compound representations are given by

Given the above notations, Proposition 1 can then be translated to the following claim:

Claim. For any valued graphs $(G, f)$ and any two nodes

We denote the equivalent mapping between $(G(v_a, k), f)$ and $(G(v_b, k), f)$ as $P$. We will slightly abuse the notations by letting

We will prove the above claim by induction. Specifically, we will prove that for any l ∈ {0, 1,

Base case: for $l = 0$, by the definition of $f$, we have f), and $h^{0}_{P(c_1)P(c_2)} = f(P(c_1), P(c_2))$.

Induction step: suppose the claim is true for l

where Eq. (⋆) is due to the induction hypothesis, as

Similarly, for every

P(c1)P(j) , h (l-1) P(c1)P(j) , h (l-1)

where Eq. (•) is due to the induction hypothesis and Eq. (8). Eq. (⋄) is due to

))) = exp(a ⊤ LeakyReLU(W q h (l-1)

and thus $\alpha_j = \alpha_{P(j)}$ for any $j \in N(c_1)$. We then have

Thus, the claim holds for $l = i + 1$, and the proof for the induction claim completes. Thus, the claim is true for every $l \in \{0, \cdots, k\}$.

b , and the proof is completed.

Proof of Proposition 2. For two equivalent valued sub-graphs $(G(v_a, k), f)$ and $(G(v_b, k), f)$, if $v_a$ and $v_b$ lie on different rings, we have

is comparable with those strong baselines, which demonstrate the effectiveness of our method. We will combine our method with pre-training in the future.
(Liu et al., 2022) | 70.3 ± 1.6 | 75.2 ± 0.3 | 59.9 ± 8.2 | 75.9 ± 0.9 | 79.2 ± 0.3 | 58.4 ± 0.6 | 69.8
G-Motif (Liu et al., 2022) | 66.4 ± 3.4 | 73.2 ± 0.8 | 77.8 ± 2.0 | 73.8 ± 1.4 | 73.4 ± 4.0 | 60.6 ± 1.1 | 70.9
GraphMVP (Liu et al., 2022) | 72.4 ± 1.6 | 75.9 ± 0.5 | 79.1 ± 2.8 | 77.0 ± 1.2 | 81.2 ± 0.9 | 63.9 ± 1.2 | 74.9
MGSSL (Zhang et al., 2021) | 70.5 ± 1.1 | 76.5 ± 0.3 | 80.7 ± 2.1 | 79.5 ± 1.1 | 79.7 ± 0.8 | 61.8 ± 0.8 | 74.8
GROVERbase (Rong et al., 2020) | 70.0 ± 0.1 | 74.3 ± 0.1 | 81.2 ± 3.0 | 62.5 ± 0.9 | 82.6 ± 0.7 | 64.8 ± 0.6 | 72.6
GROVERlarge (Rong et al., 2020) | 69.5 ± 0.1 | 73.5 ± 0.1 | 76.2 ± 3.7 | 68.2 ± 1.1 | 81.0 ± 1.4 | 65.4 ± 0.1 | 72.3
GEM (Fang et al., 2022) | 72.4 ± 0.4 | 78.1 ± 0.1 | 90.

D RELATED WORK SUMMARY

GCN (Kipf & Welling, 2017) aggregates neighbor information according to the adjacency matrix and degree matrix, and then updates the aggregated information with a linear transformation and a non-linear activation layer. GraphSAGE (Hamilton et al., 2017b) aggregates the neighbors' information by element-wise averaging. GAT (Veličković et al., 2017)

