TOWARDS RELIABLE LINK PREDICTION WITH ROBUST GRAPH INFORMATION BOTTLENECK

Abstract

Link prediction on graphs has achieved great success with the rise of deep graph learning. However, its robustness under edge noise remains less investigated. We reveal that the inherent edge noise, which naturally perturbs both the input topology and the target labels, leads to severe performance degradation and representation collapse. In this work, we propose an information-theory-guided principle, Robust Graph Information Bottleneck (RGIB), to extract reliable supervision signals and avoid representation collapse. Different from the general information bottleneck, RGIB decouples and balances the mutual dependence among graph topology, target labels, and representation, building new learning objectives towards robust representation. We also provide two instantiations, RGIB-SSL and RGIB-REP, which benefit from different methodologies, i.e., self-supervised learning and data reparametrization, for implicit and explicit data denoising, respectively. Extensive experiments on 6 benchmarks under various scenarios verify the effectiveness of the proposed RGIB.

1. INTRODUCTION

As a fundamental problem in graph learning, link prediction (Liben-Nowell & Kleinberg, 2007) has attracted growing interest in real-world applications like drug discovery (Ioannidis et al., 2020), knowledge graph completion (Bordes et al., 2013), and question answering (Huang et al., 2019). Recent advances, from heuristic designs (Katz, 1953; Page et al., 1999) to graph neural networks (GNNs) (Kipf & Welling, 2016a; Gilmer et al., 2017; Kipf & Welling, 2016b; Zhang & Chen, 2018; Zhu et al., 2021), have achieved superior performance. Nevertheless, poor robustness in imperfect scenarios with inherent edge noise is still a practical bottleneck for current deep graph models (Gallagher et al., 2008; Ferrara et al., 2016; Wu et al., 2022a; Dai et al., 2022).

Early explorations improve the robustness of GNNs for node classification under label noise (Dai et al., 2021; Li et al., 2021) through the smoothing effect of neighboring nodes. Other methods achieve a similar goal by randomly removing edges (Rong et al., 2020) or by actively selecting informative nodes or edges and pruning task-irrelevant ones (Zheng et al., 2020; Luo et al., 2021). However, when applying these noise-robust methods to link prediction with noise, only marginal improvements are achieved (see Section 5). The reason is that the edge noise naturally deteriorates both the input topology and the target labels (Figure 1(a)). Previous works that consider noise in either the input space or the label space cannot effectively deal with such a coupled scenario. Therefore, it raises a new challenge to understand and tackle the edge noise for robust link prediction.

In this paper, we dive into the inherent edge noise and empirically show the significantly degraded performance it leads to (Section 3.1).
Then, we reveal the negative effect of the edge noise by carefully inspecting the distribution of learned representations, and discover that the graph representation is severely collapsed, reflected by much lower alignment and poorer uniformity (Section 3.2).

To solve this challenging problem, we propose the Robust Graph Information Bottleneck (RGIB) principle, building on the basic GIB for adversarial robustness (Wu et al., 2020) (Section 4.1). Conceptually, the RGIB principle introduces new learning objectives that decouple the mutual information (MI) among the noisy inputs Ã, the noisy labels Ỹ, and the representation H. As illustrated in Figure 1(b), RGIB generalizes the basic GIB to learn a robust representation that is resistant to the edge noise. Technically, we provide two instantiations of RGIB based on different methodologies, i.e., RGIB-SSL and RGIB-REP: (1) the former utilizes contrastive pairs with automatically augmented views to form informative regularization in a self-supervised learning manner (Section 4.2); and (2) the latter explicitly purifies the graph topology and supervision targets with a reparameterization mechanism (Section 4.3). Both instantiations are equipped with adaptive designs that effectively estimate and balance the corresponding informative terms in a tractable manner, e.g., the hybrid augmentation algorithm and self-adversarial alignment loss for RGIB-SSL, and the relaxed information constraints on the topology and label spaces for RGIB-REP. Empirically, we show that the two instantiations work effectively under extensive noisy scenarios and can be seamlessly integrated with various existing GNNs (Section 5). Our main contributions are summarized as follows.

• To the best of our knowledge, we are the first to study the robustness problem of link prediction under the inherent edge noise. We reveal that the inherent noise can bring severe representation collapse and performance degradation, and that such negative impacts are general across common datasets and GNNs.
• We propose a general learning framework, RGIB, with refined representation learning objectives to promote the robustness of GNNs. Two instantiations, RGIB-SSL and RGIB-REP, are built upon different methodologies and equipped with adaptive designs and theoretical guarantees.
• Without modifying the GNN architectures, RGIB achieves state-of-the-art results on 3 GNNs and 6 datasets under various noisy scenarios, obtaining up to 12.9% AUC promotion. The distribution of learned representations is notably recovered and more robust to the inherent noise.

2. PRELIMINARIES

Notation. We denote V = {v_i}^N_{i=1} as the set of nodes and E = {e_ij}^M_{ij=1} as the set of edges. With adjacency matrix A and node features X, an undirected graph is denoted as G = (A, X), where A_ij = 1 means there is an edge e_ij between v_i and v_j. X_[i,:] ∈ R^D is the D-dimensional node feature of v_i. Link prediction is to indicate the existence of query edges with labels Y that are not observed in A.

GNNs for link prediction. We follow the common link prediction framework, i.e., graph auto-encoders (Kipf & Welling, 2016b), where the GNN architecture can be GCN (Kipf & Welling, 2016a), GAT (Veličković et al., 2018), or SAGE (Hamilton et al., 2017). Given an L-layer GNN, the graph representations H ∈ R^{|V|×D} for all nodes v_i ∈ V are obtained by L layers of message propagation as the encoding process. For decoding, the logit φ_{e_ij} of each query edge e_ij is computed with a readout function, e.g., the dot product φ_{e_ij} = h_i^T h_j. Finally, the optimization objective is to minimize the binary classification loss

min L_cls = Σ_{e_ij ∈ E_train} -y_ij log σ(φ_{e_ij}) - (1 - y_ij) log(1 - σ(φ_{e_ij})),

where σ(·) is the sigmoid function, and y_ij = 1 for positive edges while y_ij = 0 for negative ones.

Topological denoising approaches. A natural way to tackle the input edge noise is to directly clean the noisy graph. Sampling-based methods, such as DropEdge (Rong et al., 2020), NeuralSparse (Zheng et al., 2020), and PTDNet (Luo et al., 2021), are proposed to remove task-irrelevant edges. Besides, as GNNs can be easily fooled by adversarial attacks with only a few perturbed edges (Chen et al., 2018; Zhu et al., 2019; Entezari et al., 2020), defending methods like GCN-Jaccard (Wu et al., 2019) and GIB (Wu et al., 2020) are designed to prune adversarial edges.

Label-noise-resistant techniques.
For tackling the general problem of noisy labels, Co-teaching (Han et al., 2018) lets two neural networks teach each other with small-loss samples based on the memorization effect (Arpit et al., 2017). Besides, the peer loss function (Liu & Guo, 2020) pairs independent peer examples for supervision and works within the standard empirical risk minimization framework. As for the graph domain, label propagation techniques proposed by pioneering works (Dai et al., 2021; Li et al., 2021) propagate reliable signals from clean nodes to noisy ones; these techniques are nonetheless entangled with node annotations and the node classification task, and thus cannot be directly applied here.
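As a concrete reference for the graph auto-encoder decoding step in the preliminaries above, here is a minimal PyTorch sketch of the dot-product readout and the binary classification loss L_cls (the function names and toy tensors are illustrative, not from the paper's code):

```python
import torch
import torch.nn.functional as F

def link_logits(h, edge_index):
    """Dot-product readout: phi_ij = <h_i, h_j> for each query edge."""
    src, dst = edge_index
    return (h[src] * h[dst]).sum(dim=-1)

def link_prediction_loss(h, edge_index, labels):
    """Binary cross-entropy over query edges, i.e., the L_cls objective
    (with the sigmoid folded into the numerically stable logits version)."""
    return F.binary_cross_entropy_with_logits(link_logits(h, edge_index), labels)

# toy example: 4 nodes with 8-dim representations, 3 query edges
torch.manual_seed(0)
h = torch.randn(4, 8)                         # node representations H
edges = torch.tensor([[0, 1, 2], [1, 2, 3]])  # (src, dst) index pairs
y = torch.tensor([1.0, 0.0, 1.0])             # positive / negative labels
loss = link_prediction_loss(h, edges, y)
```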

3. AN EMPIRICAL STUDY OF THE INHERENT EDGE NOISE

In this section, we attempt to figure out how GNNs behave when learning with the edge noise and what are the latent mechanisms behind it. We first present an empirical study in Section 3.1 and then investigate the negative impact of noise through the lens of representation distribution in Section 3.2.

3.1. HOW DO GNNS PERFORM UNDER THE INHERENT EDGE NOISE?

Since existing benchmarks are usually well-annotated and clean, the inherent edge noise needs to be simulated properly to investigate its impact. Note that the data split adopted by most relevant works (Kipf & Welling, 2016b; Zhang & Chen, 2018; Zhu et al., 2021) randomly divides partial edges as observations and the others as prediction targets. The inherent edge noise, if it exists, should consist of false-positive samples uniformly distributed over both the input observations and the output labels. Thus, the training data can come with a noisy adjacency Ã and noisy labels Ỹ, i.e., the input noise and the label noise. We elaborate the formal simulation of such additive edge noise as follows.

Definition 3.1 (Additive edge noise). Given clean training data, i.e., an observed graph G = (A, X) and labels Y ∈ {0, 1} of query edges, the noisy adjacency Ã is generated by only adding edges to the original adjacency matrix A while keeping the node features X unchanged. The noisy labels Ỹ are generated by only adding false-positive edges to the labels Y. Specifically, given a noise ratio ε_a, the added noisy edges A′ (Ã = A + A′) are randomly generated with the zero elements of A as candidates.

With the simulated noise, an empirical study is then performed with various GNNs and datasets. As shown in Figure 2, the edge noise causes a significant drop in performance, and a larger noise ratio generally leads to greater degradation. That is, these common GNNs, normally trained with stochastic gradient descent, are vulnerable to the inherent edge noise, yielding a severe robustness problem. However, none of the existing defending methods are designed for such a coupled noise, which is practical for real-world graph data that can be extremely noisy (Gallagher et al., 2008; Wu et al., 2022a). Thus, it is urgent to understand the noise effect and devise the corresponding robust method.
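The additive noise of Definition 3.1 can be simulated in a few lines. The following NumPy sketch is a hypothetical helper, with the noise budget taken relative to the number of clean edges as one plausible reading of the noise ratio ε_a; it adds symmetric false-positive edges and never removes existing ones:

```python
import numpy as np

def add_edge_noise(A, eps, rng=None):
    """Simulate additive edge noise (cf. Definition 3.1): turn a fraction eps
    (relative to the clean edge count) of zero entries of the adjacency
    matrix A into edges, keeping A symmetric and X untouched."""
    rng = np.random.default_rng(rng)
    iu = np.triu_indices(A.shape[0], k=1)          # candidate positions (upper triangle)
    zeros = np.flatnonzero(A[iu] == 0)             # only absent edges can be added
    n_noise = int(eps * (A[iu] == 1).sum())        # noise budget from the clean edges
    picked = rng.choice(zeros, size=min(n_noise, zeros.size), replace=False)
    A_noisy = A.copy()
    rows, cols = iu[0][picked], iu[1][picked]
    A_noisy[rows, cols] = A_noisy[cols, rows] = 1  # add symmetric false positives
    return A_noisy
```

The same routine can be applied to the positive labels Y to produce the noisy targets Ỹ.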

3.2. UNDERSTANDING THE IMPACT OF NOISE VIA REPRESENTATION DISTRIBUTION

Denote a GNN as f_w(·) with learnable weights w. Node representations are extracted by a forward pass as H = f_w(A, X), while backward propagation with stochastic gradient descent optimizes the GNN weights by minimizing the classification loss. When encountering noise within Ã and Ỹ, the representation H can be directly influenced, since such training neglects the adverse effect of data corruption. Besides, the GNN reads out the edge logit φ_{e_ij} based on the top-layer node representations h_i and h_j, which are possibly degraded or even collapsed under the edge noise (Graf et al., 2021; Nguyen et al., 2022).

For a quantitative and qualitative analysis of H under edge noise, we introduce two concepts (Wang & Isola, 2020), i.e., alignment and uniformity. Specifically, alignment is computed as the distance between the representations of two randomly augmented graphs, i.e., 1/N Σ^N_{i=1} ||H^i_1 - H^i_2||_2, where H^i_1 = f_w(A^i_1, X) and H^i_2 = f_w(A^i_2, X) are the representations of the two augmented views. It quantifies the stability of the GNN when encountering edge noise in the testing phase: a lower alignment value means the GNN is more resistant and invariant to input perturbations. Uniformity, on the other hand, qualitatively measures the denoising effect of the GNN when learning with edge noise in the training phase. A greater uniformity implies that the learned representations of various samples are more uniformly distributed on the unit hypersphere, preserving as much information about the original data as possible.

As shown in Table 1 and Figure 3, a severer edge noise brings a poorer alignment and a worse uniformity. The learned GNN f_w(·) is more sensitive to input perturbations, as the alignment values sharply increase, and the learned edge representations tend to be less uniformly distributed and gradually collapse to individual points as the noise ratio increases. Thereby, we discover that the graph representation is severely collapsed under the inherent edge noise, resulting in a non-discriminative property that is usually undesirable and potentially risky for downstream applications.
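The two diagnostics just defined are straightforward to compute. A PyTorch sketch (the helper names and the Gaussian-potential temperature `t` are our illustrative choices, following Wang & Isola, 2020):

```python
import torch
import torch.nn.functional as F

def alignment(H1, H2):
    """Mean L2 distance between representations of two augmented views;
    lower values mean the encoder is more invariant to perturbations."""
    return (H1 - H2).norm(dim=-1).mean()

def uniformity(H, t=2.0):
    """Log of the mean Gaussian potential between normalized representations
    (Wang & Isola, 2020); more negative values mean a more uniform spread on
    the unit hypersphere, while 0 indicates full collapse to a single point."""
    H = F.normalize(H, dim=-1)
    sq_dists = torch.pdist(H, p=2).pow(2)
    return sq_dists.mul(-t).exp().mean().log()
```

For instance, a fully collapsed representation gives a uniformity of 0, whereas well-spread random representations give a clearly negative value.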

4. ON TACKLING THE REPRESENTATION COLLAPSE VIA ROBUST GIB

As aforementioned, the inherent edge noise brings unique challenges to link-level graph learning. Without prior knowledge like the noise ratio, without the assistance of auxiliary datasets, and even without modifying GNN architectures, how can learned representations be resistant to the inherent noise? Here, we formally build a method to address this problem from the perspective of graph information bottleneck (Section 4.1) and design its two practical instantiations (Section 4.2 and 4.3).

4.1. THE PRINCIPLE OF ROBUST GIB

Recall from Section 3.2 that the graph representation H is severely degraded by incorporating noisy signals in Ã and Ỹ. To robustify H, one can naturally utilize the information constraint based on the basic graph information bottleneck (GIB) principle (Wu et al., 2020; Yu et al., 2020), i.e., solving

min GIB ≜ -I(H; Ỹ), s.t. I(H; Ã) < γ, (1)

where the hyper-parameter γ constrains the MI I(H; Ã) to prevent H from capturing excess task-irrelevant features from Ã. The basic GIB (Eqn. 1) can effectively withstand input perturbations (Wu et al., 2020). However, it is intrinsically susceptible to label noise, since it entirely preserves the label supervision by maximizing I(H; Ỹ). Empirical results in Section 5 show that it becomes ineffective when learning with the inherent edge noise simulated as in Definition 3.1. Hence, we propose the Robust Graph Information Bottleneck principle that further decouples and balances the informative terms, i.e.,

min RGIB ≜ -I(H; Ỹ), s.t. γ⁻_H < H(H) < γ⁺_H, I(H; Ỹ|Ã) < γ_Y, I(H; Ã|Ỹ) < γ_A, (2)

where the constraints on the entropy H(H) encourage a diverse H to prevent representation collapse (> γ⁻_H) and also limit its capacity (< γ⁺_H) to avoid over-fitting. The other two symmetric terms, I(H; Ỹ|Ã) and I(H; Ã|Ỹ), mutually regularize the posteriors to mitigate the negative impact of the inherent noise on H. Note that MI terms like I(H; Ã|Ỹ) are usually intractable. Therefore, we introduce two practical implementations of RGIB, i.e., RGIB-SSL and RGIB-REP, based on different methodologies. The former explicitly optimizes the representation H with self-supervised regularization, while the latter implicitly optimizes H by purifying the noisy Ã and Ỹ with a reparameterization mechanism.

4.2. INSTANTIATING ROBUST GIB VIA SELF-SUPERVISED LEARNING

Recall that the graph representation deteriorates under the supervised learning paradigm. Naturally, we modify it into a self-supervised counterpart by explicitly regularizing the representation H (shown in Figure 4(b)) to avoid collapse and to implicitly capture reliable relations among noisy edges, i.e.,

min RGIB-SSL ≜ -λ_s (I(H_1; Ỹ) + I(H_2; Ỹ)) [supervision] - λ_u (H(H_1) + H(H_2)) [uniformity] - λ_a I(H_1; H_2) [alignment], (3)

where the margins λ_s, λ_u, λ_a balance one supervised and two self-supervised regularization terms. When λ_s ≡ 1 and λ_u ≡ 0, RGIB-SSL degenerates to the basic GIB. Note that contrastive learning is a prevalent technique in the self-supervised area for learning robust representations (Chen et al., 2020), where contrasting pairs of samples with data augmentation play an essential role. In practice, we follow this manner and calculate the supervision term by

E[I(H; Ỹ|Ã)] ≤ E[I(H; Ỹ)] = E_{Ã_s ~ P(Ã)}[I(H_s; Ỹ)] ≈ 1/2 (I(H_1; Ỹ) + I(H_2; Ỹ)) = 1/2 (L_cls(H_1; Ỹ) + L_cls(H_2; Ỹ)).

However, directly applying existing contrastive methods like (Chen et al., 2020; Khosla et al., 2020) can be suboptimal, since they are not originally designed for graph data and neglect the internal correlation of the topology Ã and the target Ỹ. The two following designs are proposed to avoid trivial solutions.

Hybrid graph augmentation. To encourage more diverse views with a lower I(A_1; A_2) and to avoid the manual selection of augmentation operations, we propose a hybrid augmentation method with four augmentation operations as predefined candidates, together with the ranges of their corresponding hyper-parameters. In each training iteration, two augmentation operators, T_1(·) and T_2(·), and their hyper-parameters θ_1 and θ_2 are automatically sampled from the search space. Then, two augmented graphs are obtained by applying the two operators to the original graph G, namely, G_1 = T_1(G|θ_1) and G_2 = T_2(G|θ_2).

Self-adversarial alignment & uniformity loss.
With the representations H_1 and H_2 of the two augmented views, we build the alignment objective by minimizing the representation distance of the positive pairs (h^1_ij, h^2_ij) and maximizing that of the randomly-sampled negative pairs (h^1_ij, h^2_mn), e_ij ≠ e_mn. The proposed self-adversarial alignment loss is R_align = Σ^N_{i=1} (R^pos_i + R^neg_i), where R^pos_i = p_pos(h^1_ij, h^2_ij) · ||h^1_ij - h^2_ij||^2_2 and R^neg_i = p_neg(h^1_ij, h^2_mn) · (γ - ||h^1_ij - h^2_mn||^2_2).² Importantly, the softmax functions p_pos(·) and p_neg(·) aim to mitigate the inefficiency problem (Sun et al., 2019) that already-aligned pairs are not informative. Besides, with the Gaussian potential kernel, the uniformity loss is R_unif = Σ^K_{ij,mn} e^{-||h^1_ij - h^1_mn||^2_2} + e^{-||h^2_ij - h^2_mn||^2_2}, where the edges e_ij, e_mn are randomly sampled.

Optimization. As in Eqn. 3, the overall loss function of RGIB-SSL is L = λ_s L_cls + λ_a R_align + λ_u R_unif.

Remark 4.1. The inherent noise leads to class collapse, i.e., samples from the same class have the same representation. It comes from trivially maximizing the noisy supervision I(H; Ỹ) and results in the degraded representations shown in Section 3.2. Fortunately, the alignment and uniformity terms regularizing the representations can alleviate such noise effects and avoid collapse (see Section 5).
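A minimal PyTorch sketch of the two self-supervised losses above. This is an illustrative simplification, not the paper's implementation: negatives are formed by shuffling the second view, the softmax weights are detached as constants, and `gamma`/`alpha` are the margins from the definitions:

```python
import torch

def self_adversarial_alignment(h1, h2, gamma=2.0, alpha=2.0):
    """R_align sketch: squared distances of positive pairs (same edge, two
    views) and shuffled negative pairs, re-weighted by a softmax over pair
    difficulty so uninformative (already-aligned) pairs contribute less."""
    perm = torch.randperm(h2.size(0))
    d_pos = (h1 - h2).pow(2).sum(-1)                      # ||h1_ij - h2_ij||^2
    d_neg = (h1 - h2[perm]).pow(2).sum(-1)                # ||h1_ij - h2_mn||^2
    p_pos = torch.softmax(d_pos, dim=0).detach()          # harder positives weigh more
    p_neg = torch.softmax(alpha - d_neg, dim=0).detach()  # harder negatives weigh more
    return (p_pos * d_pos).sum() + (p_neg * (gamma - d_neg)).sum()

def uniformity_loss(h1, h2, t=1.0):
    """R_unif sketch: Gaussian potential over sampled pairs within each view."""
    pot = lambda h: torch.pdist(h, p=2).pow(2).mul(-t).exp().mean()
    return pot(h1) + pot(h2)
```

Combining `lam_s * L_cls + lam_a * R_align + lam_u * R_unif` then yields the overall RGIB-SSL objective.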

4.3. INSTANTIATING ROBUST GIB VIA DATA REPARAMETERIZATION

Another realization is by reparameterizing the graph data in both the topology space and the label space jointly, so as to preserve clean information and discard noise. We propose RGIB-REP, which explicitly models the reliability of Ã and Ỹ via latent variables Z to learn a noise-resistant H (as in Figure 4(c)), namely,

min RGIB-REP ≜ -λ_s I(H; Z_Y) [supervision] + λ_A I(Z_A; Ã) [topology constraint] + λ_Y I(Z_Y; Ỹ) [label constraint]. (4)

The overall loss is instantiated as L = λ_s L_cls + λ_A R_A + λ_Y R_Y, and the corresponding guarantee is discussed in Theorem 4.5.

Proposition 4.3. Given the edge number n of Ã, the marginal distribution of Z_A is Q(Z_A) = Σ_n P_φ(P|n) P(|Ã| = n) = P(n) Π_{Ã_ij=1} P_ij. Then, we derive the upper bound I(Z_A; Ã) ≤ E[KL(P_φ(Z_A|Ã)||Q(Z_A))] = Σ_{e_ij ∈ Ã} P_ij log(P_ij/τ) + (1 - P_ij) log((1 - P_ij)/(1 - τ)) = R_A, where τ is a constant. Similarly, the label constraint is bounded as I(Z_Y; Ỹ) ≤ R_Y. Proof. See Appendix C.3.

Proposition 4.4. The supervision term is lower-bounded as I(H; Z_Y) ≥ E_{Z_Y,Z_A}[log P_w(Z_Y|Z_A)] ≈ -L_cls(f_w(Z_A), Z_Y). Proof. See Appendix C.4.

Theorem 4.5. Suppose there exists an optimal subset D_sub = (Z*_A, X, Z*_Y) with Z*_Y ≈ Y, based on which a trained GNN predictor f_w(·) satisfies f_w(Z*_A, X) = Z*_Y + ε, where the error ε is independent of D_sub and ε → 0. Then, for any λ_s, λ_A, λ_Y ∈ [0, 1], Z_A = Z*_A and Z_Y = Z*_Y minimize RGIB-REP (Eqn. 4). Proof. See Appendix C.5.

² Specifically, p_pos(h^1_ij, h^2_ij) = exp(||h^1_ij - h^2_ij||^2_2) / Σ^N_{i=1} exp(||h^1_ij - h^2_ij||^2_2), and p_neg(h^1_ij, h^2_mn) = exp(α - ||h^1_ij - h^2_mn||^2_2) / Σ^N_{i=1} exp(α - ||h^1_ij - h^2_mn||^2_2).
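The bound R_A in Proposition 4.3 is cheap to compute: for each observed edge it is the KL divergence between a Bernoulli(P_ij) selection probability and a Bernoulli(τ) prior. A sketch (the helper name and tensor shape are illustrative):

```python
import torch

def topology_constraint(p, tau=0.5):
    """R_A from Proposition 4.3: sum over observed edges of the KL divergence
    KL(Bernoulli(p_ij) || Bernoulli(tau)), where p holds one edge-wise
    selection probability per observed edge of the noisy adjacency."""
    p = p.clamp(1e-6, 1 - 1e-6)  # numerical safety for the logs
    return (p * (p / tau).log() + (1 - p) * ((1 - p) / (1 - tau)).log()).sum()
```

The label constraint R_Y has exactly the same form over the query labels; when the selection probabilities match the prior τ, the penalty vanishes, and it grows as the selector deviates.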

5. EXPERIMENTS

Setup. In this section, we empirically verify the effectiveness of the proposed RGIB framework. 6 popular datasets and 3 GNNs are taken in the experiments. The inherent edge noise is generated based on Definition 3.1 after the commonly used data split, where 85% of the edges are randomly selected for training, 5% for validation, and 10% for testing. AUC is used as the evaluation metric as in (Zhang & Chen, 2018; Zhu et al., 2021). The software framework is PyTorch (Paszke et al., 2017), while the hardware platform is a single NVIDIA RTX 3090 GPU. We repeat all experiments 10 times and report mean values and standard deviations.

Baselines. As existing robust methods deal with input noise or label noise separately, both kinds of methods are considered as baselines. For input noise, three sampling-based approaches are used for comparison, i.e., DropEdge (Rong et al., 2020), NeuralSparse (Zheng et al., 2020), and PTDNet (Luo et al., 2021). Jaccard (Wu et al., 2019), GIB (Wu et al., 2020), VIB (Sun et al., 2022), and PRI (Yu et al., 2022), which are designed for pruning adversarial edges, are also included. Two generic methods are selected for label noise, i.e., Co-teaching (Han et al., 2018) and Peer loss (Liu & Guo, 2020). In addition, two contrastive learning methods are adapted to the link prediction task for comparison: SupCon (Khosla et al., 2020), which utilizes the full labels for supervision, and GRACE (Zhu et al., 2020), which is optimized in a self-supervised manner without labels. All the above baseline methods are evaluated w.r.t. their original implementations.

Performance comparison. As shown in Table 2, RGIB achieves the best results on all 6 datasets under the inherent edge noise with various noise ratios, especially on the more challenging datasets, i.e., Cora and Citeseer, where a 12.9% AUC promotion can be gained compared with the second-best methods. When it comes to the decoupled noise settings shown in Table 3, RGIB also surpasses all the baselines ad hoc for input noise or label noise by a large margin.

Remark 5.1. The two instantiations of RGIB generalize to different scenarios with their own priorities w.r.t. intrinsic graph properties, and can be complementary to each other with flexible options in practice. Basically, RGIB-SSL is more adaptive to sparser graphs, e.g., Cora and Citeseer, where the inherent edge noise presents a considerable challenge and results in greater performance degradation. RGIB-REP can be more suitable for denser graphs, e.g., Facebook and Chameleon. Meanwhile, the two instantiations also work effectively on heterogeneous graphs with low homophily.

The learned representation distribution. Next, we justify that the proposed method can effectively alleviate the representation collapse. Compared with standard training, both RGIB-REP and RGIB-SSL bring significant improvements in alignment with much lower values, as shown in Table 4. At the same time, the uniformity of the learned representation is also enhanced: it can be seen from Figure 5 that the various query edges tend to be more uniformly distributed on the unit circle. As RGIB-SSL explicitly constrains the representation, its recovery power on the representation distribution is naturally stronger than that of RGIB-REP, resulting in comparably better alignment and uniformity measures.

Optimization schedulers. To reduce the search cost for the coefficients in the objectives of RGIB-SSL and RGIB-REP, we set up a unified optimization framework of the form L = α L_cls + (1 - α) R_1 + (1 - α) R_2. Here, we try 5 different schedulers to tune the only hyper-parameter α ∈ [0, 1], including (1) constant, α ≡ c; (2) linear, α_t = k·t, where t is the normalized time step; (3) sine, α_t = sin(t·π/2); (4) cosine, α_t = cos(t·π/2); and (5) exponential, α_t = e^{k·t}. As the empirical results summarized in Table 5 show, the selection of the optimization scheduler greatly influences the final results. Although there is no gold scheduler that consistently performs best, the constant and sine schedulers are generally better than the others among the 5 candidates. A further hyper-parameter study with a grid search of the λs in RGIB-SSL and RGIB-REP can be found in Appendix F.2.
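The five candidate schedulers can be written compactly as follows; `c` and `k` are tunable constants (note that for the exponential form, k ≤ 0 is one way to keep α within [0, 1]):

```python
import math

# Schedulers for the trade-off coefficient alpha in
# L = alpha * L_cls + (1 - alpha) * R1 + (1 - alpha) * R2,
# where t in [0, 1] is the normalized training step.
SCHEDULERS = {
    "constant":    lambda t, c=0.5, k=1.0: c,
    "linear":      lambda t, c=0.5, k=1.0: k * t,
    "sine":        lambda t, c=0.5, k=1.0: math.sin(t * math.pi / 2),
    "cosine":      lambda t, c=0.5, k=1.0: math.cos(t * math.pi / 2),
    "exponential": lambda t, c=0.5, k=1.0: math.exp(k * t),
}
```

At each iteration, the trainer looks up `alpha = SCHEDULERS[name](t)` and rebalances the classification loss against the two regularizers.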

5.2. ABLATION STUDY

In this part, we conduct a thorough ablation study of the RGIB framework. As shown in Table 6, each component contributes to the final performance. For RGIB-SSL, we have the following analysis:

• Hybrid augmentation. RGIB-SSL benefits from the hybrid augmentation algorithm that automatically generates graphs of high diversity for contrast. Compared with fixed augmentation, the hybrid augmentation brings consistent improvements with a 3.0% average AUC promotion on Cora.
• Self-adversarial alignment loss. Randomly-sampled pairs carry hierarchical information to be learned from, and the proposed re-weighting technique further enhances high-quality pairs and down-weights low-quality counterparts. It helps discriminate the more informative contrasting pairs and thus refines the alignment signal for optimization, bringing up to 2.1% AUC promotion.
• Information constraints. Label supervision contributes the most among the three informative terms, even with label noise. Degenerating RGIB-SSL to a purely self-supervised manner without supervision (i.e., λ_s = 0) leads to an average 11.9% AUC drop. Meanwhile, we show that the three regularization terms can be jointly optimized, and the other two terms are also of significant value.

As for RGIB-REP, its sample selection mechanism and the corresponding constraints are also essential:

• Edge / label selection. The two sample selection methods are nearly equally important for learning with the coupled inherent noise, as both information sources need to be purified. Besides, the edge selection is more important for tackling the decoupled input noise, as a greater drop comes when it is removed. Similarly, the label selection plays a dominant role in handling the label noise.
• Topological / label constraint. Table 6 also shows that the selection mechanism should be regularized by the related constraint; otherwise, sub-optimal solutions will be achieved. Besides, the topological constraint is generally more sensitive than the label constraint for RGIB-REP.

6. CONCLUSION

In this work, we study the problem of link prediction with the inherent edge noise and reveal that the graph representation is severely collapsed under such a coupled noise. Based on the observation, we introduce the Robust Graph Information Bottleneck (RGIB) principle, aiming to extract reliable signals via decoupling and balancing the mutual information among inputs, labels, and representation to enhance the robustness and avoid collapse. Regarding the instantiation of RGIB, the self-supervised learning technique and data reparametrization mechanism are utilized to establish the RGIB-SSL and RGIB-REP, respectively. Empirical studies on 6 datasets and 3 GNNs verify the denoising effect of the proposed RGIB under different noisy scenarios. In future work, we will generalize RGIB to more scenarios in the link prediction domain, e.g., multi-hop reasoning on knowledge graphs with multi-relational edges that characterize more diverse and complex patterns.

REPRODUCIBILITY STATEMENT

To ensure the reproducibility of the empirical results in this work, we have stated all the details about our experiments in the corresponding contents of Section 3 and Section 5. The implementation details are further elaborated in Appendix E. Besides, we will provide an anonymous GitHub repository with the source code during the discussion phase for the reviewers of ICLR 2023.

A BROADER IMPACT

In this work, we propose the RGIB framework for denoising the inherent edge noise in graphs and for improving the robustness of link prediction. We conceptually derive the RGIB and empirically justify that it enables various GNNs to learn input-invariant and label-invariant graph representations, preventing representation collapse and obtaining superior performances against edge noise. Our objective is to protect and enhance current graph models, and we do not think our work would have negative societal impacts. A similar problem has been noticed in an early work (Gallagher et al., 2008), where robots tend to build connections with normal users to spread misinformation on social networks, yielding the degeneration of GNNs for robot detection. In addition, as pointed out by a recent survey (Wu et al., 2022b), the inherent noise can be produced in the data generation process that requires manually annotating the data. As introduced in (Wu et al., 2022b), the inherent noise here refers to irreducible noise in graph structures, attributes, and labels. In short, the edge noise studied in our work is in line with the generic concept of graph inherent noise in the literature, which is recognized as a common problem in practice but also an under-explored challenge in both academic and industrial settings. However, several existing benchmarks, e.g., Cora and Citeseer, are generally clean and without annotated noisy edges. Inevitably, there is a gap when one would like to study the influence of inherent noise on these common benchmarks. Thus, to fill this gap, it is necessary to simulate the inherent edge noise properly to investigate its impact. If the inherent edge noise exists, it should consist of false-positive samples, while false-negative samples are often intractable to collect.

Thus, we focus on investigating false-positive edges as the inherent noise and design the simulation approach in Definition 3.1, as it is closer to real-world scenarios. Besides, to the best of our knowledge, we are the first to study the robustness problem of link prediction under the inherent edge noise. One of our major contributions to this research problem is revealing that the inherent noise can bring severe representation collapse and performance degradation, and that such negative impacts are general across common datasets and GNNs. Moreover, it is also possible that new instantiations based on other kinds of methodology will be inspired by the robust GIB principle. We believe that such a bidirectional information bottleneck, which strictly treats the information sources on both the input side and the label side, is helpful in practice, especially for extremely noisy scenarios.

B NOTATIONS

We summarize the frequently used notations in Table 7.

Table 7: The most frequently used notations in this paper.
V = {v_i}^N_{i=1} : the set of nodes
E = {e_ij}^M_{ij=1} : the set of edges
A ∈ {0, 1}^{N×N} : the adjacency matrix with binary elements
X ∈ R^{N×D} : the node features
G = (A, X) : the input graph of a GNN
E_query = {e_ij}^K_{ij=1} : the set of query edges

C.1 PROOF FOR PROPOSITION 4.1

Proof. As introduced in Section 4.2, the uniformity loss on the two augmented graphs is formed as R_unif = Σ^K_{ij,mn} e^{-||h^1_ij - h^1_mn||^2_2} + e^{-||h^2_ij - h^2_mn||^2_2}. For simplicity, we can reduce it to one graph, i.e., R_unif = E_{ij,mn ~ p_E}[e^{-||h_ij - h_mn||^2_2}]. Next, we obtain its upper bound by the following derivation:

log R_unif = log E_{ij,mn ~ p_E}[e^{-||h_ij - h_mn||^2_2}]
= log E_{ij,mn ~ p_E}[e^{2t·h^T_ij h_mn - 2t}]
≤ E_{ij ~ p_E} log E_{mn ~ p_E}[e^{2t·h^T_ij h_mn}]
= 1/N Σ^N_ij log Σ^N_mn [e^{2t·h^T_ij h_mn}]
= 1/N Σ^N_ij log p̂_vMF-KDE(h_ij) + log Z_vMF
≜ -Ĥ(H) + log Z_vMF. (5)

Here, p̂_vMF-KDE(·) is a von Mises-Fisher (vMF) kernel density estimate based on the N training samples, and Z_vMF is the vMF normalization constant. Ĥ(·) is the resubstitution entropy estimator of H = f_w(·) (Ahmad & Lin, 1976). As the uniformity loss R_unif can be approximated by the entropy Ĥ(H), a higher entropy H(H) indicates a lower uniformity loss, i.e., a higher uniformity.

C.2 PROOF FOR PROPOSITION 4.2

Proof. Recall that the hybrid graph augmentation technique is adopted by RGIB-SSL for contrasting; we can approximate the expectation E[I(H; Ã|Ỹ)] by 1/N Σ^N_{i=1} I(H_i; Ã_i|Ỹ) with N augmented graphs. Based on this, we derive the upper bound of I(H; Ã|Ỹ) as follows:

I(H; Ã|Ỹ) ≤ I(H; Ã) = 1/N Σ^N_{i=1} I(H_i; Ã_i)
≈ 1/2 (I(H_1; Ã_1) + I(H_2; Ã_2))
= 1/2 (H(H_1) + H(Ã_1) + H(H_2) + H(Ã_2) - H(H_1, Ã_1) - H(H_2, Ã_2))
≤ 1/2 (H(H_1) + H(Ã_1) + H(H_2) + H(Ã_2) - H(H_1, H_2) - H(Ã_1, Ã_2))
= 1/2 (I(H_1; H_2) + I(Ã_1; Ã_2))
= 1/2 (I(H_1; H_2) + c). (6)

Thus, minimizing I(H_1; H_2) tightens the upper bound on I(H; Ã) and hence on I(H; Ã|Ỹ).

C.3 PROOF FOR PROPOSITION 4.3

Proof. With the marginal distribution Q(Z_A) = Σ_n P_φ(P|n) P(Ã = n) = P(n) Σ_n Π_{Ã_ij=1} P_ij, we derive the upper bound of the MI I(Z_A; Ã) as:

I(Z_A; Ã) = E_{Z_A,Ã}[log(P(Z_A|Ã) / P(Z_A))]
          = E_{Z_A,Ã}[log(P_φ(Z_A|Ã) / Q(Z_A))] - KL(P(Z_A) || Q(Z_A))
          ≤ E[KL(P_φ(Z_A|Ã) || Q(Z_A))]
          = Σ_{e_ij∈Ã} P_ij log(P_ij / τ) + (1 - P_ij) log((1 - P_ij) / (1 - τ)) = R_A,    (7)

where the KL divergence of two distributions P(x) and Q(x) is defined as KL(P(x) || Q(x)) = Σ_x P(x) log(P(x)/Q(x)). Thus, we obtain the upper bound of I(Z_A; Ã) as Eqn. 7. Similarly, the label constraint is bounded as I(Z_Y; Ỹ) ≤ Σ_{e_ij∈Ỹ} P_ij log(P_ij / τ) + (1 - P_ij) log((1 - P_ij) / (1 - τ)) = R_Y.

C.4 PROOF FOR PROPOSITION 4.4

Proof. We derive the lower bound of I(H; Z_Y) as follows:

I(H; Z_Y) = I(f_w(Z_A); Z_Y)
          ≥ E_{Z_Y,Z_A}[log(P_w(Z_Y|Z_A) / P(Z_Y))]
          = E_{Z_Y,Z_A}[log P_w(Z_Y|Z_A) - log P(Z_Y)]
          = E_{Z_Y,Z_A}[log P_w(Z_Y|Z_A)] + H(Z_Y)
          ≥ E_{Z_Y,Z_A}[log P_w(Z_Y|Z_A)]
          ≈ -(1/|Z_Y|) Σ_{e_ij∈Z_Y} L_cls(f_w(Z_A), Z_Y).

Proof. With the relaxation via parametrization in C.4, we first relax the standard RGIB-REP to its upper bound as follows:

min RGIB-REP ≜ -λ_s I(H; Z_Y) + λ_A I(Z_A; Ã) + λ_Y I(Z_Y; Ỹ)
             ≤ -λ_s I(Z_A; Z_Y) + λ_A I(Z_A; Ã) + λ_Y I(Z_Y; Ỹ).    (9)

As Z*_Y ≈ Y, Eqn. 9 can be reduced to min -λ_s I(Z_A; Y) + λ_A I(Z_A; Ã) + λ_Y H(Y), where I(Y; Ỹ) = H(Y) as Y ⊆ Ỹ. Removing the final term with the constant H(Y), it can be further reduced to min -I(Z_A; Y) + λ I(Z_A; Ã), where λ = λ_A/λ_s. Since minimizing -I(Z_A; Y) + λ I(Z_A; Ã) is equal to maximizing I(Z_A; Y) - λ I(Z_A; Ã), we next conduct the following derivation. Z_A = Z*_A maximizes I(Z_A; Y) - λ I(Z_A; Ã), which is equal to minimizing -I(Z_A; Y) + λ I(Z_A; Ã), i.e., the RGIB-REP:

max I(Z_A; Y) - λ I(Z_A; Ã)
 = max I(Y; Z_A, Ã) - I(Ã; Y | Z_A) - λ I(Z_A; Ã)
 = max I(Y; Z_A, Ã) - I(Ã; Y | Z_A) - λ [I(Z_A; Ã, Y) - I(Ã; Y | Z_A)]
 = max I(Y; Z_A, Ã) - (1 - λ) I(Ã; Y | Z_A) - λ I(Z_A; Ã, Y)
 = max I(Y; Ã) - (1 - λ) I(Ã; Y | Z_A) - λ I(Z_A; Ã, Y)
 = max (1 - λ) I(Ã; Y) - (1 - λ) I(Ã; Y | Z_A) - λ I(Z_A; Ã | Y)
 = max (1 - λ) c - (1 - λ) I(Ã; Y | Z_A) - λ I(Z_A; Ã | Y)
 = (1 - λ) c.

Symmetrically, when Z_A ≡ Ã, Z_Y = Z*_Y maximizes I(A; Z_Y) - λ I(Z_Y; Ỹ), as I(A; Z_Y) - λ I(Z_Y; Ỹ) = (1 - λ) I(Ỹ; A) - (1 - λ) I(Ỹ; A | Z_Y) - λ I(Z_Y; Ỹ | A).

D A FURTHER EMPIRICAL STUDY OF NOISE EFFECTS

This section extends Section 3 of the main text.

D.1 FULL EVALUATION RESULTS

Evaluation settings. In this part, we provide a thorough empirical study traversing all combinations of the following settings from 5 different aspects:
• 3 GNN architectures: GCN, GAT, SAGE.
• 3 numbers of layers: 2, 4, 6.
• 4 noise types: clean, mixed noise, input noise, label noise.
• 3 noise ratios: 20%, 40%, 60%.
• 6 datasets: Cora, CiteSeer, PubMed, Facebook, Chameleon, Squirrel.

We then summarize the entire evaluation results for the three kinds of GNNs as follows. As can be seen, all three common GNNs, i.e., GCN, GAT, and SAGE, are vulnerable to the inherent edge noise.
• Table 13: full evaluation results with GCN.
• Table 14: full evaluation results with GAT.
• Table 15: full evaluation results with SAGE.

Data statistics. The statistics of the 6 datasets in our experiments are shown in Table 8. As can be seen, the 4 homophilic graphs (Cora, CiteSeer, PubMed, and Facebook) have much higher homophily values than the other 2 heterophilic graphs (Chameleon and Squirrel).

D.2 LOSS DISTRIBUTION

We visualize the loss distribution under different scenarios to further investigate the memorization effect of GNNs. As shown in Figure 6, the two clusters gradually separate with clean data, but this learning process is slowed down when training with noisy-input data, which confuses the model and leads to overlapped distributions. As for the label noise, the model cannot distinguish the noisy samples with a clear decision boundary separating the clean and noisy samples. Besides, we find that the model gradually memorizes the noisy edges, as indicated by the decreasing trend of the corresponding losses: the loss distribution of the noisy edges is minimized and progressively moves towards that of the clean ones.

D.3 DECOUPLING INPUT NOISE AND LABEL NOISE

For a further and deeper study of the coupled noise, we use the edge homophily metric to quantify the distribution of edges. Specifically, the homophily value h_homo_ij of the edge e_ij is computed as the cosine similarity of the node features x_i and x_j, i.e., h_homo_ij = cos(x_i, x_j). As shown in Figure 8, the distributions of edge homophily are nearly the same for label noise and structure noise, with the envelopes of the two distributions almost overlapping. This justifies that the randomly split label noise and structure noise are indeed coupled together. Based on the measurement of edge homophily, we then decouple these two kinds of noise in simulation. Here, we consider two cases that separate them: (case 1) input noise with high homophily and label noise with low homophily, shown in Figure 9; and (case 2) input noise with low homophily and label noise with high homophily, shown in Figure 10.

Basically, edge noise can exist in the form of false positive edges or false negative edges. Specifically, false positive edges are treated as existing edges with label 1, while in fact such edges do not exist. On the other hand, false negative edges are treated as non-existing edges with label 0, whose predictive probabilities will be minimized. We would highlight that our work focuses on false positive edges, as they are more practical and common in real-world scenarios: the data annotation procedure can produce this kind of noise (Wu et al., 2022b). Thus, if inherent edge noise exists, it is more likely to consist of false positive samples, while false negative samples are often intractable to collect and annotate in practice. Investigating the influence of false negative samples is another line of research, e.g., (Yang et al., 2020; Kamigaito & Hayashi, 2022), which is orthogonal and complementary to our work.
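The edge homophily score used above is just a cosine similarity between the two endpoint features; a minimal sketch (the function name is ours) is:

```python
import numpy as np

def edge_homophily(x_i: np.ndarray, x_j: np.ndarray) -> float:
    """h_homo_ij = cos(x_i, x_j): cosine similarity of the node features."""
    return float(np.dot(x_i, x_j)
                 / (np.linalg.norm(x_i) * np.linalg.norm(x_j)))

x = np.array([1.0, 0.0, 1.0])
assert np.isclose(edge_homophily(x, 2.0 * x), 1.0)                    # parallel features
assert np.isclose(edge_homophily(x, np.array([0.0, 1.0, 0.0])), 0.0)  # orthogonal features
```

Ranking the simulated noisy edges by this score and assigning the high/low-homophily halves to the input or label side gives the two decoupled cases above.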
Learning from a clean graph can also encounter the problem of false negative samples, usually due to the random sampling of negative nodes/edges. Note that tackling false negative samples is a well-known problem in the area of link prediction, while handling false positive samples is also valuable but still under-explored.

The input noise and label noise do look quite similar and coupled. From the perspective of data processing, the collected noisy edges in the training set can be randomly split into the observed graph (i.e., the input of the GNN) or the predictive graph (i.e., the query edges for the GNN to predict). Considering that the noisy edges might come from similar sources, e.g., biases from human annotation, the corresponding noise patterns can also be similar between these two kinds of noise, which are thus naturally coupled together to some extent. However, we claim and justify that the two kinds of noise are fundamentally different. Although both can bring severe degradation of empirical performance, they act in different places from the model perspective. As the learnable weights are updated by w := w - η∇_w L(H, Ỹ), the label noise Ỹ, acting on the backward propagation, can directly influence the model. By contrast, the input noise acts on the learned weights indirectly, as it appears at the front end of the forward inference, i.e., H = f_w(Ã, X). Empirically, as the results in Table 3 and Table 4 show, the standard GNN (without any defenses) performs quite differently under the same proportion of input noise and label noise. Such a phenomenon can inspire one to understand the intrinsic denoising mechanism or memorization effects of the GNN, and we leave that as future work. More importantly, from the perspective of defense, it would be easy and trivial to defend if these two kinds of noise were the same. However, none of the existing robust methods can effectively defend against such inherently coupled noise.
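The forward-vs-backward distinction above can be made concrete with a toy stand-in for the GNN: a logistic scorer (our simplification, not the paper's model). Label noise perturbs the gradient through the (p - y) residual, while input noise perturbs it through the features entering the forward pass; both corrupt the update, but through different terms.

```python
import numpy as np

def grad_logistic(w, X, y):
    """Gradient of the binary cross-entropy for a linear scorer s = X @ w."""
    p = 1.0 / (1.0 + np.exp(-(X @ w)))
    return X.T @ (p - y) / len(y)  # residual (p - y) carries the label side

rng = np.random.default_rng(1)
w = rng.normal(size=4)
X = rng.normal(size=(32, 4))                     # clean inputs (forward pass)
y = (X @ rng.normal(size=4) > 0).astype(float)   # clean labels

g_clean = grad_logistic(w, X, y)
# Label noise acts on the backward pass: flipping y changes the residual directly.
y_noisy = y.copy(); y_noisy[:8] = 1.0 - y_noisy[:8]
g_label = grad_logistic(w, X, y_noisy)
# Input noise acts on the forward pass: perturbing X changes the features first.
X_noisy = X + rng.normal(scale=0.5, size=X.shape)
g_input = grad_logistic(w, X_noisy, y)

assert not np.allclose(g_clean, g_label)  # both corrupt the update ...
assert not np.allclose(g_clean, g_input)  # ... but through different terms
```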
As can be seen from Table 2, only marginal improvements are achieved when applying the existing robust methods to the coupled noise, while in Table 3 and Table 4, these robust methods work effectively in handling the decoupled noise, i.e., when only the structure noise or the label noise exists. The reason is that the properties of the coupled noise are much more complex than those of a single decoupled noise: both sides of the information sources, i.e., Ã and Ỹ, should be considered noisy, and the defending mechanism should be devised based on this.

E IMPLEMENTATION DETAILS

E.1 GNNS FOR LINK PREDICTION

We provide a detailed introduction to the forward propagation and backward update of GNNs in this part. Formally, let l = 1...L denote the layer index, h_i the representation of node i, MESS(·) a learnable mapping function that transforms the input feature, AGGREGATE(·) the operation capturing the 1-hop information from the neighborhood N(v) in the graph, and COMBINE(·) the final combination of the neighbor features and the node itself. Then, the l-th layer of a GNN can be formulated as

m^l_v = AGGREGATE^l({MESS(h^{l-1}_u, h^{l-1}_v, e_uv) : u ∈ N(v)}),    h^l_v = COMBINE^l(h^{l-1}_v, m^l_v).

After L layers of propagation, the final node representation h^L_v of each node v ∈ V is obtained. Then, for each query edge e_ij ∈ E_train unseen from the input graph, the logit φ_eij is computed from the node representations h^L_i and h^L_j with the readout function, i.e., φ_eij = READOUT(h^L_i, h^L_j) ∈ R. Finally, the optimization objective is defined as minimizing the binary cross-entropy loss, i.e.,

min L_cls = Σ_{e_ij∈E_train} -y_ij log(σ(φ_eij)) - (1 - y_ij) log(1 - σ(φ_eij)),

where σ(·) is the sigmoid function, and y_ij = 1 for positive edges while y_ij = 0 for negative ones. In addition, we summarize the detailed architectures of different GNNs in Table 9:

GCN:  m^l_i = W^l Σ_{j∈N(i)} (1/√(d_i d_j)) h^{l-1}_j;   h^l_i = σ(m^l_i + W^l (1/d_i) h^{l-1}_i);   φ_eij = h_i^T h_j
GAT:  m^l_i = Σ_{j∈N(i)} α_ij W^l h^{l-1}_j;             h^l_i = σ(m^l_i + W^l α_ii h^{l-1}_i);      φ_eij = h_i^T h_j
SAGE: m^l_i = W^l (1/|N(i)|) Σ_{j∈N(i)} h^{l-1}_j;       h^l_i = σ(m^l_i + W^l h^{l-1}_i);           φ_eij = h_i^T h_j

E.2 IMPLEMENTATION DETAILS OF RGIB-SSL

As introduced in Section 4.2, the graph augmentation technique is adopted here to generate perturbed graphs of various views.
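As a concrete illustration of the inner-product readout and binary cross-entropy described in E.1, here is a minimal numpy sketch; the random matrix stands in for the GNN's final node representations h^L, and the function name is ours.

```python
import numpy as np

def predict_and_bce(h: np.ndarray, query: list, labels: np.ndarray) -> float:
    """Inner-product readout phi_eij = h_i^T h_j followed by the
    binary cross-entropy over the query edges."""
    logits = np.array([h[i] @ h[j] for i, j in query])
    p = 1.0 / (1.0 + np.exp(-logits))  # sigma(phi_eij)
    eps = 1e-12                        # numerical guard for log
    return float(np.mean(-labels * np.log(p + eps)
                         - (1 - labels) * np.log(1 - p + eps)))

rng = np.random.default_rng(0)
h = rng.normal(size=(5, 8))        # stand-in for the final representations h^L
query = [(0, 1), (2, 3)]           # query edges e_ij
labels = np.array([1.0, 0.0])      # positive and negative edge
loss = predict_and_bce(h, query, labels)
assert np.isfinite(loss) and loss > 0.0
```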
To avoid manually selecting and tuning the augmentation operations, we propose the hybrid graph augmentation method with four augmentation operations as predefined candidates, together with the ranges of their corresponding hyper-parameters. The search space is summarized in Table 10, where the candidate operations cover most augmentation approaches except those that modify the number of nodes, which are unsuitable for the link prediction task. In each training iteration, two augmentation operators T_1(·) and T_2(·) and their hyper-parameters θ_1 and θ_2 are randomly sampled from the search space, as elaborated in Algorithm 1: for each operator T_i(·), the corresponding hyper-parameter is uniformly sampled as θ_i ∼ U(θ_Ti), and the newly sampled operator is stored and combined with the existing ones, i.e., T(·) ∪ {T_i(·|θ_i)} → T(·). The two operators are then performed on the observed graph G, obtaining the two augmented graphs G^1 = T_1(G|θ_1) and G^2 = T_2(G|θ_2), with edge representations h^1_ij = f(G^1, e_ij|w) = h^1_i h^1_j, and similarly for h^2_ij with G^2.

Then, we learn the self-supervised edge representations by maximizing the edge-level agreement between the same query edge in different augmented graphs (positive pairs) and minimizing the agreement among different edges (negative pairs), as shown in Figure 11. Note that h_ij here is the edge representation. Specifically, we maximize the representation similarity of the positive pairs (h^1_ij, h^2_ij) and minimize the representation similarity of the randomly-sampled negative pairs (h^1_ij, h^2_mn), where e_ij ≠ e_mn.

Figure 11: Illustration of the RGIB-SSL model.
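The per-iteration sampling step of the hybrid augmentation can be sketched as follows; the candidate pool below is a hypothetical stand-in for Table 10 (the operator names and hyper-parameter ranges are ours, not the paper's exact search space).

```python
import random

# Hypothetical candidate pool standing in for Table 10: each operator name
# maps to the range of its hyper-parameter (e.g., a drop/mask ratio).
SEARCH_SPACE = {
    "drop_edge": (0.1, 0.5),
    "add_edge": (0.1, 0.5),
    "mask_feature": (0.1, 0.3),
    "perturb_feature": (0.0, 0.2),
}

def sample_augmentation(rng: random.Random):
    """Randomly pick one operator T_i and a hyper-parameter theta_i ~ U(theta_Ti)."""
    name = rng.choice(sorted(SEARCH_SPACE))
    lo, hi = SEARCH_SPACE[name]
    return name, rng.uniform(lo, hi)

rng = random.Random(0)
(t1, th1), (t2, th2) = sample_augmentation(rng), sample_augmentation(rng)
assert SEARCH_SPACE[t1][0] <= th1 <= SEARCH_SPACE[t1][1]
assert SEARCH_SPACE[t2][0] <= th2 <= SEARCH_SPACE[t2][1]
```

Each iteration thus sees a fresh pair (T_1, θ_1), (T_2, θ_2), which is what produces the two differently augmented views G^1 and G^2.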

E.3 BASELINE IMPLEMENTATIONS

All baselines compared in this paper are based on their original implementations. We list their source links here:
• DropEdge: https://github.com/DropEdge/DropEdge
• NeuralSparse: https://github.com/flyingdoog/PTDNet
• PTDNet: https://github.com/flyingdoog/PTDNet
• GCN Jaccard: https://github.com/DSE-MSU/DeepRobust
• GIB: https://github.com/snap-stanford/GIB
• VIB: https://github.com/RingBDStack/VIB-GSL
• PRI: https://github.com/SJYuCNEL/PRI-Graphs
• Co-teaching: https://github.com/bhanML/Co-teaching
• Peer loss functions: https://github.com/weijiaheng/Multi-class-Peer-Loss-functions
• SupCon: https://github.com/HobbitLong/SupContrast
• GRACE: https://github.com/CRIPAC-DIG/GRACE

F FULL EMPIRICAL RESULTS

In this section, we elaborate on the full empirical study of the inherent edge noise with various robust methods and GNN architectures.

F.1 ROBUST METHODS COMPARISON WITH CLEAN DATA

Here, we would like to figure out how the robust methods introduced in Section 5 behave when learning with clean data, i.e., when no edge noise exists. As shown in Table 11, the two proposed instantiations of RGIB also boost the prediction performance when learning on clean graphs, and outperform the other baselines in most cases.

F.2 FURTHER ABLATION STUDIES ON THE TRADE-OFF PARAMETERS λ

We conduct an ablation study with a grid search over the trade-off hyper-parameters λ in RGIB. For simplicity, we fix the weight of the supervision signal to one, i.e., λ_s = 1. Then, the objective of RGIB can be formed as L = L_cls + λ_1 R_1 + λ_2 R_2, where the information regularization terms R_1/R_2 are the alignment and uniformity for RGIB-SSL, and the topology and label constraints for RGIB-REP, respectively. The effects of λ_1 and λ_2 are illustrated as heatmaps in Figure 12 and Figure 13.

F.3 FURTHER CASE STUDIES

Alignment and uniformity of baseline methods. The alignment of the other methods is summarized in Table 12, while the uniformity is visualized in Figure 14 and Figure 15. We make the following three observations. First, the sampling-based methods, e.g., DropEdge, PTDNet, and NeuralSparse, can also promote alignment and uniformity due to their sampling mechanisms that defend against structural perturbations. Second, the contrastive methods, e.g., SupCon and GRACE, achieve much better alignment but much worse uniformity. The reason is that their learned representations are severely collapsed: they can degenerate to single points, as seen from the uniformity plots, but stay nearly unchanged when encountering structural perturbations. Third, the remaining methods show no significant improvements in alignment or uniformity. Connecting the above observations with the empirical performances, we can draw a conclusion: both alignment and uniformity are important for evaluating robust methods from the perspective of representation learning. This conclusion is in line with the previous study (Wang & Isola, 2020).

Table 12: Alignment comparison.

Learning curves of RGIB. We draw the learning curves of RGIB with constant schedulers in Figure 16, Figure 17, Figure 18, and Figure 19. We normalize the values of each plotted line to (0, 1) for better visualization. For RGIB-SSL, the uniformity term, i.e., H(H), converges quickly and remains low after 200 epochs. Similarly, the alignment term, i.e., I(H^1; H^2), also converges in the early stages and stays stable afterwards. At the same time, the supervised signal, i.e., I(H; Y), gradually and steadily decreases as training moves forward. The learning processes are generally stable across different datasets.
As for RGIB-REP, we observe that the topology constraint I(Z_A; Ã) and the label constraint I(Z_Y; Ỹ) can indeed adapt to noisy scenarios with different noise ratios. As can be seen, these two regularizations converge more significantly when learning in a noisier case; that is, when the noise ratio increases from 0 to 60%, the two regularizations react adaptively to the noisy data Ã, Ỹ. Such a phenomenon shows that RGIB-REP with these two information constraints works as an effective information bottleneck that filters out the noisy signals.

The entire evaluation with 10 baselines and the two proposed methods is conducted with the same settings as in D.1. Results on each dataset are summarized as follows.
• Table 16, Table 17, Table 18: full results of GCN/GAT/SAGE on the Cora dataset.
• Table 19, Table 20, Table 21: full results of GCN/GAT/SAGE on the CiteSeer dataset.
• Table 22, Table 23, Table 24: full results of GCN/GAT/SAGE on the PubMed dataset.
• Table 25, Table 26, Table 27: full results of GCN/GAT/SAGE on the Facebook dataset.
• Table 28, Table 29, Table 30: full results of GCN/GAT/SAGE on the Chameleon dataset.
• Table 31, Table 32, Table 33: full results of GCN/GAT/SAGE on the Squirrel dataset.



For generality, we provide the full empirical study in Appendix D. Full evaluation results can be found in Appendix F. To avoid abusing notations, we use h_i to stand for the final representation h^L_i in later contents.



(a) Illustration of the inherent edge noise. The GNN takes the graph topology A as input and predicts the logits of unseen edges with labels Y. The noisy Ã and Ỹ are formed by adding random edges to simulate the inherent edge noise as in Def. 3.1. (b) The basic GIB (left) and the proposed RGIB (right). I(·;·) here indicates the mutual information. To solve the intrinsic deficiency of the basic GIB in tackling the edge noise, RGIB learns the graph representation H via a further balance of the informative signals 1, 3, 4 regarding H.

Figure 1: Link prediction with inherent edge noise (a) and the proposed RGIB principle (b).

The noisy adjacency is Ã = A + A′, where the added edges A′ are disjoint from the original ones (A ∘ A′ = O), with noise ratio ε_a = (|nonzero(Ã)| - |nonzero(A)|) / |nonzero(A)|. Similarly, noisy labels are generated and added to the original labels, where ε_y = (|nonzero(Ỹ)| - |nonzero(Y)|) / |nonzero(Y)|.
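The noise-ratio definition can be sketched directly from two adjacency matrices (a toy example with our own helper name):

```python
import numpy as np

def noise_ratio(A_clean: np.ndarray, A_noisy: np.ndarray) -> float:
    """epsilon = (|nonzero(A~)| - |nonzero(A)|) / |nonzero(A)|."""
    n_clean = np.count_nonzero(A_clean)
    return (np.count_nonzero(A_noisy) - n_clean) / n_clean

# A clean undirected graph with 2 edges (4 nonzero entries) ...
A = np.zeros((4, 4))
A[0, 1] = A[1, 0] = A[2, 3] = A[3, 2] = 1
# ... plus 1 randomly added edge (2 nonzero entries) gives epsilon_a = 0.5.
A_tilde = A.copy()
A_tilde[0, 2] = A_tilde[2, 0] = 1
assert noise_ratio(A, A_tilde) == 0.5
```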

Figure 2: Performances (test AUC) with simulated edge noise. For fair evaluations, results are obtained by repeating 10 times with 4-layer GNNs on each dataset.

Figure 3: Uniformity distribution on the Cora dataset. Here, we map the representations of the query edges in the test set to the unit circle in R^2, followed by Gaussian kernel density estimation, as in (Wang & Isola, 2020).

Figure 4: Diagrams of the proposed RGIB and its two instantiations (best viewed in color).

Analytically, I(Ã; Ỹ | H) = I(Ã; Ỹ) + I(H; Ỹ | Ã) + I(H; Ã | Ỹ) - H(H) + H(H | Ã, Ỹ), where I(Ã; Ỹ) is a constant and the redundancy H(H | Ã, Ỹ) can be easily minimized. Thus, I(Ã; Ỹ | H) can be approximated by the other three terms, i.e., H(H), I(H; Ỹ | Ã), and I(H; Ã | Ỹ). Since the two latter terms also contain noise, a balance of these three informative terms can be a solution to the problem. Based on the above analysis, we propose the Robust Graph Information Bottleneck (RGIB), a new learning objective that balances the informative signals regarding H, as illustrated in Figure 4(a), i.e.,

min RGIB ≜ -I(H; Ỹ),  s.t.  γ⁻_H < H(H) < γ⁺_H,  I(H; Ỹ | Ã) < γ_Y,  I(H; Ã | Ỹ) < γ_A.    (2)
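The decomposition above can be verified step by step under the usual interaction-information convention I(X;Y;Z) = I(X;Y) - I(X;Y|Z) (this intermediate bookkeeping is ours, not spelled out in the text):

```latex
\begin{align*}
I(\tilde{A};\tilde{Y}\mid H)
  &= I(\tilde{A};\tilde{Y}) - I(\tilde{A};\tilde{Y};H)
     && \text{definition of interaction information} \\
  &= I(\tilde{A};\tilde{Y}) - \big[I(H;\tilde{A}) - I(H;\tilde{A}\mid\tilde{Y})\big]
     && \text{symmetry of } I(\,\cdot\,;\,\cdot\,;\,\cdot\,) \\
  &= I(\tilde{A};\tilde{Y}) + I(H;\tilde{A}\mid\tilde{Y})
     - \big[I(H;\tilde{A},\tilde{Y}) - I(H;\tilde{Y}\mid\tilde{A})\big]
     && \text{chain rule: } I(H;\tilde{A}) = I(H;\tilde{A},\tilde{Y}) - I(H;\tilde{Y}\mid\tilde{A}) \\
  &= I(\tilde{A};\tilde{Y}) + I(H;\tilde{Y}\mid\tilde{A}) + I(H;\tilde{A}\mid\tilde{Y})
     - H(H) + H(H\mid\tilde{A},\tilde{Y})
     && I(H;\tilde{A},\tilde{Y}) = H(H) - H(H\mid\tilde{A},\tilde{Y}).
\end{align*}
```

This makes explicit why minimizing I(Ã;Ỹ|H) amounts to balancing H(H), I(H;Ỹ|Ã), and I(H;Ã|Ỹ) once the constant I(Ã;Ỹ) and the redundancy H(H|Ã,Ỹ) are set aside.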

the supervision term in Eqn. 4 can be empirically reduced to the classification loss. Proof. See Appendix C.4. Theorem 4.5. Assume the noisy training data D_train = (Ã, X, Ỹ) contains a potentially clean subset

Figure 5: Uniformity distribution on Citeseer with ε = 40%.

Y : the labels of the query edges
h_ij : the representation of a query edge e_ij
H : the representation of all the query edges
I(X; Y) : the mutual information of X and Y
I(X; Y|Z) : the conditional mutual information of X and Y when observing Z

C THEORETICAL JUSTIFICATION

C.1 PROOF FOR PROPOSITION 4.1

(8)

Thus, the MI I(H; Z_Y) can be lower bounded by the classification loss, and min -λ_s I(H; Z_Y) in RGIB-REP (Eqn. 4) is upper bounded by min (λ_s/|Z_Y|) Σ_{e_ij∈Z_Y} L_cls(f_w(Z_A), Z_Y), as in Eqn. 8.

C.5 PROOF FOR THEOREM 4.5

Since I(Ã; Y | Z_A) ≥ 0 and I(Z_A; Ã | Y) ≥ 0 always hold, the optimal Z*_A should make -(1 - λ) I(Ã; Y | Z_A) - λ I(Z_A; Ã | Y) = 0 to reach the optimum. Thus, it should satisfy I(Ã; Y | Z_A) = 0 and I(Z_A; Ã | Y) = 0 simultaneously. Therefore, Z_A = Z*

loss of positive/negative samples with clean data (left) and 40% input-noisy data (right).

loss of clean/noisy samples with clean data (left) and 40% label-noisy data (right).

Figure 6: Loss distribution of the standard GCN with 40% input noise (a) and 40% label noise (b).

Figure 8: Distributions of edge homophily with random split manner.

Figure 9: Distributions of edge homophily for decoupled noise (case 1).

Figure 10: Distributions of edge homophily for decoupled noise (case 2).

Figure 12: Grid search of hyper-parameters with RGIB-SSL on the Cora dataset (ε = 40%).

Figure 16: Learning curves of RGIB-SSL on Cora dataset.

Figure 17: Learning curves of RGIB-REP on Cora dataset.

Figure 18: Learning curves of RGIB-SSL on Citeseer dataset.

Figure 19: Learning curves of RGIB-REP on Citeseer dataset.

Mean values of alignment on the Cora dataset, calculated as the L2 distance between the representations of two randomly perturbed graphs.

where the latent variables Z_Y and Z_A are clean signals extracted from the noisy Ỹ and Ã. Their complementary parts Z̄_Y and Z̄_A are considered as noise, satisfying Ỹ = Z_Y + Z̄_Y and Ã = Z_A + Z̄_A. When Z_Y ≡ Ỹ and Z_A ≡ Ã, RGIB-REP degenerates to the basic GIB. Here, I(H; Z_Y) measures the supervised signals with the selected samples Z_Y, where the classifier takes Z_A (a subgraph of Ã) as input instead of the original Ã, i.e., H = f_w(Z_A, X). The constraints I(Z_A; Ã) and I(Z_Y; Ỹ) aim to select the cleanest and most task-relevant information from Ã and Ỹ.

Instantiation. To derive a tractable objective regarding Z_A and Z_Y, a parameterized sampler f_φ(·), sharing the same architecture and weights as f_w(·), is adopted. f_φ(·) generates the probabilistic distribution of the edges covering both Ã and Ỹ by P = σ(H_φ H_φ^T) ∈ (0,1)^{|V|×|V|}, where the hidden representations are H_φ = f_φ(Ã, X). Then, Bernoulli sampling is used to obtain the edges of high confidence, i.e., Z_A = SAMP(P | Ã) and Z_Y = SAMP(P | Ỹ), where |Z_A| ≤ |Ã| and |Z_Y| ≤ |Ỹ|.
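The Bernoulli sampling step SAMP(·) can be sketched as follows; this is a minimal illustration with a dense probability matrix and our own helper name, not the paper's implementation.

```python
import numpy as np

def bernoulli_sample_edges(P: np.ndarray, edges: list, rng) -> list:
    """Z = SAMP(P | edges): keep each edge e_ij independently
    with probability P[i, j] given by the sampler f_phi."""
    return [(i, j) for i, j in edges if rng.random() < P[i, j]]

rng = np.random.default_rng(0)
P = np.array([[0.0, 1.0],
              [1.0, 0.0]])          # confident edges on the off-diagonal
edges = [(0, 1), (1, 0), (0, 0)]    # candidate edges from the noisy graph
Z_A = bernoulli_sample_edges(P, edges, rng)
assert (0, 1) in Z_A and (1, 0) in Z_A  # P = 1 edges are always kept
assert (0, 0) not in Z_A                # P = 0 edges are always dropped
```

Applying the same sampler to the query edges gives Z_Y, so both the input subgraph and the retained supervision are filtered by the same confidence estimates.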

Table 2: Method comparison with GCN under the inherent noise, i.e., both the input and label noise exist. The boldface numbers mean the best results, while the underlines indicate the second-best results.

Method comparison with GCN under decoupled input noise (upper) or label noise (lower).

Comparison of different schedulers. Here, SSL/REP are short for RGIB-SSL/RGIB-REP. Experiments are performed with a 4-layer GAT and ε = 40% inherent edge noise.

         std.   REP    SSL    std.   REP    SSL
clean    .616   .524   .475   .445   .439   .418
ε = 20%  .687   .642   .543   .586   .533   .505
ε = 40%  .695   .679   .578   .689   .623   .533
ε = 60%  .732   .704   .615   .696   .647   .542

Ablation study for RGIB-SSL and RGIB-REP with a 4-layer SAGE. Here, ε = 60% indicates the 60% coupled inherent noise, while ε_a/ε_y represent the ratios of the decoupled input/label noise.

Table 8: Dataset statistics.

Table 9: Detailed architectures of different GNNs.

where the node representations h^1_i and h^1_j are generated by the GNN model f(·|w) with learnable weights w, and similarly for h^2_ij with G^2.

Table 10: Search space H_T of the hybrid graph augmentation.

Table 11: Method comparison with a 4-layer GCN trained on the clean data. The boldface numbers mean the best results, while the underlines indicate the second-best results.

Table 13: Full results of GCN with edge noise.

Table 14: Full results of GAT with edge noise.
.9869±.0005 .9856±.0008 .9836±.0007 .9878±.0006 .9861±.0006 .9858±.0008 .9872±.0006 .9864±.0009 .9857±.0008 .9875±.0007 .9857±.0015 .9850±.0012 .9820±.0014 .9855±.0011 .9827±.0019 .9773±.0046 .9873±.0010 .9874±.0010 .9874±.0005 .9860±.0007 .9805±.0025 .9658±.0321 .9577±.0314 .9804±.0018 .9738±.0044 .9710±.0036 .9854±.0015 .9860±.0007 .9867±.0012
Chameleon .9770±.0044 .9725±.0027 .9650±.0018 .9625±.0018 .9767±.0026 .9747±.0020 .9759±.0018 .9746±.0023 .9743±.0017 .9711±.0041
.9734±.0047 .9721±.0035 .9652±.0023 .9605±.0031 .9741±.0028 .9686±.0030 .9674±.0027 .9740±.0037 .9738±.0027 .9712±.0047
.9742±.0052 .9659±.0029 .9573±.0036 .9482±.0054 .9644±.0033 .9543±.0075 .9474±.0074 .9722±.0043 .9688±.0055 .9698±.0065
Squirrel .9740±.0011 .9680±.0007 .9635±.0017 .9588±.0025 .9702±.0008 .9690±.0010 .9659±.0014 .9719±.0018 .9701±.0017 .9686±.0012
.9720±.0023 .9581±.0046 .9436±.0063 .9335±.0062 .9592±.0047 .9455±.0075 .9415±.0061 .9682±.0030 .9690±.0028 .9686±.0021
.9578±.0067 .9507±.0050 .9309±.0164 .9254±.0089 .9487±.0065 .9419±.0041 .9255±.0073 .9585±.0097 .9520±.0070 .9507±.0162

Table 15: Full results of SAGE with edge noise.
.9680±.0015 .9626±.0015 .9570±.0016 .9737±.0010 .9736±.0010 .9721±.0011 .9692±.0015 .9643±.0012 .9606±.0017 .9689±.0052 .9584±.0107 .9577±.0076 .9541±.0021 .9637±.0092 .9630±.0079 .9607±.0090 .9663±.0020 .9612±.0049 .9612±.0020 .9682±.0045 .9555±.0065 .9528±.0038 .9461±.0054 .9592±.0053 .9600±.0036 .9551±.0042 .9574±.0192 .9540±.0207 .9583±.0023

Table 16: Full results on Cora dataset with GCN.
RGIB-REP .8103±.0137 .7439±.0221 .7040±.0192 .8282±.0123 .7857±.0142 .7623±.0144 .8365±.0163 .8247±.0142 .8240±.0119
RGIB-SSL .8623±.0126 .8080±.0240 .7357±.0342 .8632±.0187 .7878±.0368 .7310±.0483 .9184±.0070 .9120±.0108 .9126±.0081

Table 17: Full results on Cora dataset with GAT.

Table 18: Full results on Cora dataset with SAGE.
.7809±.0176 .7383±.0218 .8568±.0115 .8450±.0153 .8445±.0187 .8426±.0105 .8150±.0170 .7943±.0129
GRACE .6242±.0245 .6424±.0290 .6711±.0452 .6465±.0381 .6172±.0320 .6496±.0544 .6434±.0384 .6376±.0251 .6438±.0449
RGIB-REP .8274±.0112 .7822±.0143 .7692±.0202 .8634±.0121 .8470±.0144 .8528±.0131 .8367±.0149 .8087±.0187 .7991±.0120
RGIB-SSL .8837±.0065 .8728±.0116 .8613±.0148 .8960±.0109 .8817±.0119 .8825±.0113 .9130±.0038 .9041±.0075 .9023±.0072
L=6 Standard .7787±.0423 .7420±.0251 .7180±.0248 .8256±.0222 .7947±.0561 .8005±.0421 .8158±.0168 .7707±.0235 .7660±.0134
DropEdge .8035±.0228 .7398±.0560 .7176±.0389 .8262±.0153 .8193±.0679 .8089±.0260 .8340±.0161 .7993±.0091 .7897±.0144
NeuralSparse .7953±.0177 .7378±.0180 .7292±.0238 .8384±.0120 .8234±.0288 .7980±.0701 .8214±.0107 .7908±.0136 .7622±.0160
PTDNet .7999±.0151 .7604±.0169 .7352±.0202 .8311±.0143 .8267±.0078 .8109±.0140 .8222±.0121 .7823±.0078 .7745±.0231
Co-teaching .7817±.0477 .7445±.0312 .7212±.0332 .8306±.0256 .7991±.0595 .8007±.0439 .8324±.0256 .7720±.0263 .7687±.0266
Peer loss .7781±.0451 .7445±.0286 .7192±.0277 .8300±.0234 .8020±.0624 .8043±.0449 .8309±.0149 .7734±.0388 .7652±.0262
Jaccard .7779±.0437 .7493±.0245 .7277±.0238 .8333±.0323 .8075±.0605 .8037±.0494 .8148±.0186 .7707±.0243 .7709±.0204
GIB .7814±.0493 .7473±.0442 .7349±.0437 .8366±.0194 .8106±.0689 .8040±.0617 .8172±.0258 .7806±.0265 .7689±.0180
SupCon .7879±.0356 .7019±.0285 .6673±.0317 .8219±.0469 .7648±.0666 .7159±.0717 .8242±.0159 .7880±.0152 .7686±.0148
GRACE .6866±.0160 .6437±.0455 .5967±.0248 .6949±.0181 .6536±.0365 .6114±.0394 .7239±.0231 .7035±.0160 .7014±.0111
RGIB-REP .8049±.0146 .7157±.0725 .7099±.0473 .8391±.0215 .8149±.0234 .7927±.0171 .8358±.0100 .7974±.0140 .8046±.0135
RGIB-SSL .8662±.0130 .8430±.0178 .8306±.0108 .8746±.0091 .8634±.0099 .8603±.0156 .8982±.0089 .8930±.0108 .8940±.0076

Table 19: Full results on Citeseer dataset with GCN.

Table 20: Full results on Citeseer dataset with GAT.
.8338±.0127 .8207±.0121 .8689±.0096 .8526±.0130 .8512±.0174 .8762±.0076 .8650±.0102 .8648±.0166
DropEdge .8566±.0113 .8333±.0183 .8100±.0098 .8750±.0079 .8496±.0101 .8512±.0121 .8820±.0086 .8679±.0112 .8673±.0114
NeuralSparse .8573±.0101 .8431±.0151 .8222±.0092 .8743±.0117 .8577±.0067 .8580±.0135 .8826±.0080 .8724±.0076 .8657±.0089
PTDNet .8602±.0107 .8381±.0137 .8157±.0075 .8755±.0090 .8560±.0084 .8574±.0154 .8784±.0120 .8693±.0098 .8669±.0142
Co-teaching .8628±.0220 .8366±.0124 .8199±.0194 .8720±.0128 .8521±.0139 .8510±.0224 .8924±.0122 .8888±.0365 .8919±.0305
Peer loss .8637±.0125 .8378±.0170 .8235±.0120 .8721±.0172 .8529±.0173 .8559±.0216 .8878±.0185 .8653±.0288 .8631±.0258
Jaccard .8615±.0197 .8379±.0222 .8223±.0124 .8841±.0079 .8556±.0119 .8498±.0309 .8843±.0143 .8676±.0195 .8661±.0256
GIB .8610±.0230 .8462±.0114 .8324±.0316 .8909±.0091 .8823±.0188 .8488±.0276 .8781±.0135 .8739±.0144 .8741±.0156
SupCon .8495±.0100 .8138±.0174 .8155±.0099 .8611±.0086 .8454±.0111 .8393±.0172 .8558±.0137 .8459±.0170 .8379±.0185
GRACE .8092±.0221 .7564±.0264 .7479±.0278 .8014±.0370 .7628±.0240 .7433±.0245 .8788±.0146 .8768±.0073 .8654±.0172
RGIB-REP .8545±.0108 .8310±.0127 .8137±.0091 .8736±.0107 .8566±.0097 .8503±.0159 .8778±.0093 .8696±.0081 .8614±.0084
RGIB-SSL .9106±.0102 .8829±.0058 .8677±.0095 .9172±.0072 .8909±.0086 .8785±.0121 .9419±.0071 .9410±.0047 .9410±.0090
L=4 Standard .8026±.0157 .7775±.0248 .7518±.0183 .8191±.0092 .8043±.0105 .7912±.0073 .8174±.0172 .7998±.0143 .7934±.0156
DropEdge .8063±.0079 .7624±.0211 .7434±.0124 .8171±.0132 .7977±.0178 .7814±.0162 .8262±.0148 .8103±.0178 .8057±.0148
NeuralSparse .7958±.0142 .7761±.0172 .7550±.0129 .8282±.0130 .8088±.0088 .7911±.0174 .8259±.0119 .8135±.0092 .7986±.0109
PTDNet .8000±.0113 .7734±.0198 .7597±.0185 .8254±.0105 .8132±.0089 .7950±.0143 .8137±.0243 .8082±.0094 .8036±.0139
Co-teaching .8016±.0184 .7807±.0315 .7521±.0267 .8213±.0173 .8068±.0156 .7903±.0105 .8402±.0220 .8109±.0316 .7947±.0350
Peer loss .8064±.0178 .7802±.0253 .7544±.0191 .8246±.0145 .8108±.0122 .7945±.0113 .8160±.0329 .8045±.0185 .7925±.0207
Jaccard .8098±.0222 .7771±.0273 .7517±.0186 .8258±.0124 .8083±.0138 .7901±.0073 .8206±.0168 .8036±.0176 .7999±.0215
GIB .8170±.0230 .7884±.0341 .7645±.0247 .8422±.0365 .8112±.0212 .7972±.0305 .8192±.0249 .8080±.0155 .8010±.0177
SupCon .7940±.0114 .7728±.0125 .7478±.0145 .8137±.0115 .8003±.0116 .7777±.0409 .8038±.0114 .7972±.0198 .7852±.0201
GRACE .7319±.0433 .6611±.0395 .6449±.0579 .7216±.0261 .5947±.0660 .6060±.0507 .7775±.1040 .7739±.0475 .7882±.0328
RGIB-REP .7991±.0107 .7743±.0164 .7418±.0121 .8155±.0156 .7905±.0157 .7372±.0908 .8108±.0118 .7946±.0180 .7935±.0131
RGIB-SSL .8520±.0145 .8306±.0149 .8029±.0098 .8592±.0120 .8251±.0132 .8145±.0110 .9084±.0091 .9101±.0076 .9102±.0117
L=6 Standard .7807±.0117 .7373±.0270 .7139±.0251 .7970±.0134 .7860±.0107 .7741±.0126 .7963±.0129 .7883±.0162 .7801±.0161
DropEdge .7768±.0088 .7477±.0195 .7116±.0119 .7854±.0232 .7640±.0188 .7425±.0362 .8114±.0132 .7840±.0217 .7826±.0186
NeuralSparse .7704±.0099 .7462±.0170 .7242±.0138 .8047±.0101 .7647±.0372 .7248±.0596 .8087±.0235 .7855±.0176 .7880±.0148
PTDNet .7805±.0193 .7503±.0223 .7286±.0237 .7927±.0287 .7822±.0132 .7579±.0355 .8002±.0085 .7977±.0134 .7890±.0145
Co-teaching .7819±.0141 .7399±.0335 .7236±.0292 .7964±.0189 .7809±.0183 .7740±.0185 .7933±.0406 .7918±.0348 .7979±.0245
Peer loss .7846±.0214 .7459±.0294 .7187±.0259 .7979±.0172 .7955±.0168 .7796±.0218 .7957±.0273 .7865±.0285 .7912±.0148
Jaccard .7902±.0117 .7365±.0332 .7157±.0248 .8056±.0264 .8038±.0226 .7733±.0245 .7964±.0226 .7936±.0255 .7847±.0218
GIB .7818±.0230 .7378±.0285 .7137±.0416 .8161±.0267 .7995±.0183 .7762±.0176 .8002±.0155 .7955±.0166 .7794±.0244
SupCon .7370±.0524 .7160±.0462 .6670±.0442 .7667±.0402 .7729±.0356 .6999±.0597 .7810±.0219 .7752±.0119 .7591±.0362
GRACE .5068±.0128 .5034±.0106 .5108±.0319 .5058±.0096 .4956±.0069 .5379±.0427 .5181±.0547 .5288±.0467 .5068±.0178
RGIB-REP .7817±.0129 .7062±.0681 .7254±.0188 .7883±.0160 .7769±.0168 .7620±.0176 .7981±.0092 .7711±.0487 .7817±.0164
RGIB-SSL .8275±.0148 .7989±.0136 .7681±.0140 .8261±.0096 .8024±.0087 .7806±.0174 .8855±.0103 .8918±.0143 .8940±.0119

Table 21: Full results on Citeseer dataset with SAGE.

Table 22: Full results on PubMed dataset with GCN.

Table 23: Full results on PubMed dataset with GAT.

Table 24: Full results on PubMed dataset with SAGE.

Table 26: Full results on Facebook dataset with GAT.

Table 28: Full results on Chameleon dataset with GCN.

Table 30: Full results on Chameleon dataset with SAGE.

Table 32: Full results on Squirrel dataset with GAT.

