REVISITING UNCERTAINTY ESTIMATION FOR NODE CLASSIFICATION: NEW BENCHMARK AND INSIGHTS

Anonymous authors
Paper under double-blind review

Abstract

Uncertainty estimation is an important task that can be essential for high-risk applications of machine learning. This problem is especially challenging for node-level prediction in graph-structured data, as the samples (nodes) are interdependent. However, there is no established benchmark that allows for the evaluation of node-level uncertainty estimation methods in a unified setup covering diverse and meaningful distribution shifts. In this paper, we address this problem and propose such a benchmark, together with a technique for the controllable generation of data splits with various types of distribution shifts. Importantly, we describe shifts that are specific to graph-structured data. Our benchmark consists of several graph datasets equipped with various distribution shifts, on which we evaluate the robustness of models and their uncertainty estimation performance. To illustrate the benchmark, we decompose the current state-of-the-art Dirichlet-based framework and perform an ablation study on its components. In our experiments on the proposed benchmark, we show that when faced with complex yet realistic distribution shifts, most models fail to maintain high classification performance and consistency of uncertainty estimates with prediction errors. However, ensembling techniques help to partially overcome significant drops in performance and achieve better results than individual models.

1. INTRODUCTION

Uncertainty estimation is an important and challenging task with many applications in financial systems, medical diagnostics, autonomous driving, etc. It aims at quantifying the confidence of machine learning models and can be used to design more reliable decision-making systems. In particular, it enables one to solve such problems as misclassification detection, where the model has to assign higher uncertainty to potential prediction errors, or out-of-distribution (OOD) detection, where the model is required to yield higher uncertainty for samples from an unknown distribution. Depending on its source, uncertainty can be divided into data uncertainty, which describes the inherent noise in the data due to labeling mistakes or class overlap, and knowledge uncertainty, which accounts for an insufficient amount of information for accurate predictions when the distribution of test data differs from the training one (Gal, 2016; Malinin, 2019).

The problem of uncertainty estimation for graph-structured data has recently started to gain attention. It is especially complex at the node level, as one has to deal with interdependent samples that may come from different distributions, so their predictions can change significantly depending on the neighborhood. This problem has already been addressed in several studies, and the proposed methods are commonly based on the Dirichlet distribution and introduce various extensions to the Dirichlet framework (Sensoy et al., 2018; Malinin & Gales, 2018; Malinin, 2019; Charpentier et al., 2020), such as graph-based kernel Dirichlet estimation (Zhao et al., 2020) or graph propagation of Dirichlet parameters (Stadler et al., 2021). However, the field of robustness and uncertainty estimation for node-level graph problems suffers from the absence of benchmarks with diverse and meaningful distribution shifts.
Usually, the evaluation is limited to somewhat unrealistic distribution shifts, such as noisy node features (Stadler et al., 2021) or left-out classes (Zhao et al., 2020; Stadler et al., 2021). Importantly, Gui et al. (2022) try to overcome this issue and systematically construct a graph OOD benchmark in which they explicitly distinguish between covariate and concept shifts. However, the authors either consider synthetic datasets or ignore the graph structure when creating distribution shifts. The problem with the mentioned approaches is that, in real applications, distribution shifts can be much more complex and diverse, and may depend on the global graph structure (for a more detailed discussion, refer to Appendix C). Thus, the existing benchmarks can be insufficient to reliably and comprehensively evaluate uncertainty estimation methods for graph-structured data. Therefore, which uncertainty estimation methods are best for node classification remains unclear and requires further investigation. In this work, we propose a new benchmark for evaluating robustness and uncertainty estimation in transductive node classification tasks. The main feature of our benchmark is a general approach to constructing data splits with distribution shifts: it can be applied to any graph dataset, allows for generating shifts of different nature, and one can easily vary the sizes of the splits. For demonstration purposes, we apply our method to 7 common node classification datasets and describe 3 particular strategies to induce distribution shifts. Using the proposed benchmark, we evaluate the robustness of various models and their ability to detect errors and OOD inputs. In particular, we show that the recently proposed Graph Posterior Network (Stadler et al., 2021) is consistently the best method for detecting OOD inputs.
However, the best results for the other tasks are achieved using Natural Posterior Networks (Charpentier et al., 2021). We also confirm that ensembling often allows one to improve model performance: ensembles of GPNs achieve the best performance for OOD detection, while ensembles of NatPNs have the best predictive performance and error detection.

2. PROBLEM STATEMENT

We consider the problem of transductive node classification in an attributed graph G = (A, X, Y) with an adjacency matrix A ∈ {0, 1}^{n×n}, a node feature matrix X ∈ R^{n×d}, and a vector of categorical targets Y ∈ {1, …, C}^n. We split the set of nodes V into several non-intersecting subsets depending on whether they are used for training, validation, or testing and whether they belong to the in-distribution (ID) or out-of-distribution (OOD) subset. Let Y_train denote the labels of the train nodes V_train. Given a graph G_train = (A, X, Y_train), we aim at predicting the labels Y_test of the test nodes V_test and estimating an uncertainty measure u_i ∈ R associated with these predictions. The obtained uncertainty estimates are used to solve the misclassification detection and OOD detection problems.

3. PROPOSED BENCHMARK

This section describes our benchmark for evaluating uncertainty estimates and robustness to distribution shifts for node-level graph problems. The most important ingredient of our benchmark is a unified approach for the controllable generation of diverse distribution shifts that can be applied to any graph dataset. Our benchmark includes a collection of common node classification datasets, several data split strategies, a set of problems for evaluating robustness and uncertainty estimation performance, and the associated metrics. We describe these components below.

3.1. GRAPH DATASETS

While our approach can potentially be applied to any node classification or node regression dataset, for our experiments we pick the following 7 datasets commonly used in the literature: 3 citation networks, CoraML, CiteSeer (McCallum et al., 2000; Giles et al., 1998; Getoor, 2005; Sen et al., 2008), and PubMed (Namata et al., 2012); 2 co-authorship graphs, CoauthorPhysics and CoauthorCS (Shchur et al., 2018); and 2 co-purchase datasets, AmazonPhoto and AmazonComputers (McAuley et al., 2015; Shchur et al., 2018).

3.2. DATA SPLITS

The most important ingredient of our benchmark is a general approach to generating data splits in a graph G that yield non-trivial yet reasonable distribution shifts. For this purpose, we make a distinction between the ID parts that are described by p(Y_in|X, A) and shifted (OOD) parts where the targets may come from a significantly different distribution p(Y_out|X, A). We define the following ID parts:
• Train contains the nodes V_train that are used for the regular training of models and represent the only observations that take part in gradient computation;
• Valid-In enables us to monitor the model during the training stage by computing the validation loss on the nodes V_valid-in and to choose the best checkpoint;
• Test-In is used for testing on the remaining in-distribution nodes V_test-in and represents the simplest setup that requires the model to reproduce in-distribution dependencies.
Both Valid-In and Test-In are assumed to come from exactly the same distribution as Train. At the same time, we introduce the following OOD parts:
• Valid-Out contains the validation nodes V_valid-out that can also be used for monitoring but tends to be a more difficult part of the graph with potentially different dependencies;
• Test-Out represents the most shifted part V_test-out and is used for evaluating the robustness of models to distribution shifts.
To construct a particular data split, we choose some characteristic σ_i and compute it for every node i ∈ V, as described in Section 3.3. This characteristic reflects a node property that may depend on the features or the graph structure. After that, we sort all nodes in ascending order of σ_i. Some fraction of the nodes with the smallest values of σ_i is considered to be ID, while the remaining ones become OOD and are split into Valid-Out and Test-Out based on their values of σ_i.
Importantly, this general split strategy is very flexible: it allows one to vary the size of the training part and to analyze the effect of this size on the robustness and the quality of uncertainty estimates. The type of distribution shift depends on the choice of σ_i and can also be easily varied. In our experiments, we split each dataset in the following proportions. Half of the nodes, those with the smallest values of σ_i, are assumed to be ID and are split into Train, Valid-In, and Test-In uniformly at random in proportion 30%/10%/10% of all nodes. The second half contains the remaining OOD nodes, which are split into Valid-Out and Test-Out in ascending order of σ_i in proportion 10%/40%. As a result, the Test-Out part has the most significant distribution shift.
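The split procedure above can be sketched in a few lines. Here, `make_split` is a hypothetical helper (not the authors' released code), and `sigma` stands for the per-node characteristic σ_i from Section 3.3:

```python
import numpy as np

def make_split(sigma, seed=0):
    """Split nodes into the five benchmark parts given a per-node
    characteristic sigma (hypothetical helper, not the authors' code).

    The half of the nodes with the smallest sigma is ID and is split
    uniformly at random into Train/Valid-In/Test-In (30%/10%/10% of all
    nodes); the other half is OOD and is split by ascending sigma into
    Valid-Out/Test-Out (10%/40%)."""
    n = len(sigma)
    order = np.argsort(sigma)                    # ascending sigma
    id_nodes, ood_nodes = order[: n // 2], order[n // 2 :]

    rng = np.random.default_rng(seed)
    id_nodes = rng.permutation(id_nodes)         # random split within ID
    n_train, n_val = int(0.3 * n), int(0.1 * n)
    return {
        "train":     id_nodes[:n_train],
        "valid_in":  id_nodes[n_train : n_train + n_val],
        "test_in":   id_nodes[n_train + n_val :],
        "valid_out": ood_nodes[:n_val],          # smallest sigma among OOD
        "test_out":  ood_nodes[n_val:],          # the most shifted nodes
    }
```

Sorting once by σ_i and cutting the ordered list is what makes the strategy flexible: changing the shift type only changes how `sigma` is computed.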

3.3. DISTRIBUTION SHIFTS

To define our data splits, it is necessary to choose some node characteristic σ_i as a split factor. We aim to consider diverse characteristics that cover a variety of distribution shifts that may occur in practice. In a standard non-graph ML setup, shifts typically happen only in feature space (or, more generally, the joint distribution of features and targets may become shifted). In graph learning tasks, there can be shifts specifically related to the graph structure: the training part can be biased towards more popular nodes or may consist of nodes from a particular region of the graph. Thus, we consider the following representative data split strategies.

Random This is the standard approach to constructing data splits, where the nodes are selected uniformly at random, i.e., we can take σ_i to be a random position in a sorted list. This type of shift is not realistic for practical applications but is helpful for the analysis: it shows how well the model generalizes when the distribution does not change. The random splitting strategy also allows for evaluating the robustness of models when the size of the training dataset varies.

Feature This approach represents a family of possible feature-based shifts that do not take the graph structure into account explicitly. There are multiple ways to construct such shifts, e.g., a split can be based on the values of one particular feature (Gui et al., 2022). However, to follow our general setup described above, we base the split on a continuous characteristic that can be computed for any dataset. Namely, we project the original features x_i ∈ R^d into R^2 via a random linear transform W, where all entries w_ij are independent and come from N(0, 1). After that, σ_i is set to the distance between node i ∈ V and the centroid of the projected data, so the most central nodes in terms of features are considered ID, while the OOD parts are close to the periphery.
This setup naturally corresponds to the situation when the training dataset consists of the most typical elements, while some outliers may be encountered at the inference stage. Thus, this type of shift tests the robustness of models to non-standard feature combinations. We visualize all the split strategies applied to the AmazonPhoto dataset in Figure 1; the figures for the remaining datasets can be found in Appendix A. Here, one can see that the feature-based split does not introduce a notable structural shift, i.e., the nodes are distributed across all regions of the graph. This is additionally confirmed by our analysis in Appendix B: there is no significant difference in the degree distribution and pairwise node distances between the ID and OOD parts. Our empirical observations confirm that the feature-based shifts are the easiest for the considered methods to handle.

PageRank This strategy represents a possible bias towards popularity. It is natural to expect the training set to consist of more popular items. For instance, in web search, the importance of pages in the internet graph can be measured via PageRank (Page et al., 1999). For this application, the labeling of pages should start with the important ones, since they are visited more often. Similar situations may happen for social networks, where it is natural to start labeling with the most influential users, or citation networks, where the most cited papers should be labeled first. However, when applying the model, it is essential to make accurate predictions on less popular elements. Motivated by this, we introduce a PageRank-based split. In particular, we compute the PageRank (PR) value for every node i ∈ V and define the measure σ_i as the negative PR score, which means that the nodes with smaller values of PR (i.e., less important ones) go to the OOD subsets.
As can be seen in Figure 1, the PageRank-based split separates the most important nodes, which belong to the cores of large clusters, from the structural periphery, which consists of nodes that are less important in terms of PageRank. Our analysis in Appendix B confirms this observation: the degree distribution changes significantly across the ID and OOD subsets, tending to higher values for the ID nodes. The distance between such nodes also appears to be smaller on average. Our experiments show that such a structural distribution shift creates a more severe challenge for the considered methods.

Personalized PageRank This strategy is focused on a potential bias towards locality, which may happen when labeling is performed by exploring the graph starting from some node. For instance, this may occur in web search, where a crawler has to explore the web graph following the links. Similarly, information about the users of a social network can usually be obtained via an API, and new users are discovered by following the friends of known users. To model such a situation, we use the concept of Personalized PageRank (PPR) (Page et al., 1999). It represents the stationary distribution of a random walk that always restarts from some fixed node (see, e.g., Klicpera et al. (2018) for more details). The associated distribution shift naturally combines popularity and locality: PPR is related to node importance, since the stationary distribution concentrates more on higher-degree nodes; on the other hand, locality is also preserved, since restarts always happen at a fixed node. For our splits, we select the node j ∈ V with the highest PR score as the restarting node and compute the PPR score for every node i ∈ V. After that, we define the measure σ_i as the negative PPR score. The nodes with high PPR, which belong to the ID part, are expected to be close to the restarting node, while far-away nodes go to the OOD subset.
Figure 1 shows that locality is indeed preserved, as the ID part consists of one compact region around the most important node chosen as the restarting one. Thus, the ID subset may include peripheral nodes located near the restarting node, while some nodes that were marked as the most important in the PR-based split, but are less important with respect to the restarting node, fall into the OOD part. Our analysis in Appendix B also provides strong evidence for this behavior: the PPR-based split strongly affects the distribution of pairwise distances within the ID/OOD parts, as the locality bias of the ID part makes the OOD nodes even more distant from each other. The shift of the degree distribution is also notable but not as severe as for the PR-based split, since here we consider only the popularity conditioned on some fixed node. Finally, our empirical results in Section 5 confirm that the PPR-based split is the most challenging one for graph neural networks.
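As an illustration, the three shift characteristics could be computed roughly as follows. This is a sketch using `numpy` and `networkx`; all function names are our own and are not taken from the benchmark code:

```python
import numpy as np
import networkx as nx

def sigma_feature(X, seed=0):
    """Feature shift: distance to the centroid of a random 2-D projection."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], 2))   # entries w_ij ~ N(0, 1)
    Z = X @ W
    return np.linalg.norm(Z - Z.mean(axis=0), axis=1)

def sigma_pagerank(G):
    """Popularity shift: less important nodes (low PageRank) become OOD."""
    pr = nx.pagerank(G)
    return {i: -pr[i] for i in G}

def sigma_ppr(G):
    """Locality shift: restart from the top-PageRank node; far nodes become OOD."""
    pr = nx.pagerank(G)
    root = max(pr, key=pr.get)                 # node with the highest PR score
    weights = {i: 0.0 for i in G}
    weights[root] = 1.0
    ppr = nx.pagerank(G, personalization=weights)
    return {i: -ppr[i] for i in G}
```

In all three cases, nodes with the smallest σ_i form the ID part, so the same split routine can be reused with any of these characteristics.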

3.4. METRICS

To evaluate the classification performance, we use the standard Accuracy computed on the Test-In and Test-Out parts. To analyze the aggregated performance, we also report Accuracy and AUROC on the mixture of Test-In and Test-Out. To evaluate the quality of uncertainty estimates, we consider two problems: error (misclassification) detection and OOD detection. To assess how well a model can detect misclassified samples, we use the concept of the Prediction Rejection Curve (PRC) (Malinin et al., 2021; 2022). The PRC traces the error rate as we replace model predictions with ground-truth labels in order of decreasing uncertainty. If uncertainty is high for incorrectly classified samples, then the error rate is expected to drop quickly as we replace such predictions with ground-truth labels. Thus, the Area Under the Prediction Rejection Curve (AUPRC) evaluates the joint classification and uncertainty estimation performance, requiring the model not only to provide high prediction accuracy but also to signal possible errors through higher uncertainty scores. In our experiments, we compute AUPRC_model on the merged test subset of nodes V_test using total uncertainty (TU), as errors may occur due to the inherent noise in data or because of predicting on OOD samples. A measure called the Prediction Rejection Ratio (PRR) is also based on the Prediction Rejection Curve but evaluates only the ability of a model to detect misclassified samples. For this purpose, AUPRC is normalized as follows. Let AUPRC_model be the area under the rejection curve built from the model's uncertainty estimates, AUPRC_random the area for random uncertainty estimates, and AUPRC_oracle the area for oracle estimates that perfectly sort samples according to the prediction errors (i.e., all misclassified samples have higher oracle uncertainty).
Then, the PRR metric is defined as follows:
PRR = (AUPRC_random − AUPRC_model) / (AUPRC_random − AUPRC_oracle).
The best value of this measure is 1 (for perfect uncertainty estimates), while random uncertainty estimates give PRR = 0. Note that each model has its own oracle with the associated estimates that perfectly match its prediction errors, so the AUPRC_oracle values of different models are independent and computed only based on the corresponding model predictions. We also evaluate the ability of models to detect OOD samples. For this, we consider the mixture of Test-In and Test-Out. A good model is expected to have higher knowledge uncertainty (KU) (Gal, 2016; Malinin, 2019) values for the observations from Test-Out compared to Test-In. Here, we use the standard AUROC for binary classification, with positive events corresponding to the observations coming from the OOD subset.
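A minimal sketch of the PRR computation, assuming `errors` is a 0/1 array of misclassification indicators and `uncertainty` the corresponding scores. We use a simple rectangle-rule approximation of the areas; this is our own illustration, not the authors' implementation:

```python
import numpy as np

def rejection_curve(errors, uncertainty):
    """Error rate over the whole test set after replacing the k most
    uncertain predictions with ground-truth labels, for k = 0..n."""
    order = np.argsort(-np.asarray(uncertainty))     # most uncertain first
    errs = np.asarray(errors, dtype=float)[order]
    n = len(errs)
    remaining = np.concatenate([[errs.sum()], errs.sum() - np.cumsum(errs)])
    return remaining / n

def prr(errors, uncertainty):
    """Prediction Rejection Ratio: 1 for an oracle ordering, 0 in
    expectation for random uncertainty estimates."""
    errors = np.asarray(errors, dtype=float)
    n = len(errors)
    auc = lambda u: rejection_curve(errors, u).mean()   # rectangle rule
    auc_model = auc(uncertainty)
    auc_oracle = auc(errors)        # errors get the highest uncertainty
    # expected curve for random rejection: base_rate * (1 - k/n)
    auc_random = errors.mean() * (1 - np.arange(n + 1) / n).mean()
    return (auc_random - auc_model) / (auc_random - auc_oracle)
```

Note how the normalization makes PRR comparable across models: each model is scored against its own oracle and the analytic random baseline.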

4. METHODS

We consider several methods for estimating uncertainty in graph-related problems. Specifically, we cover message-passing neural networks, ensemble approaches (Lakshminarayanan et al., 2017), and Dirichlet-based methods, including GPN (Stadler et al., 2021), which is currently considered state-of-the-art for OOD detection. For the Dirichlet-based approaches, we conduct an ablation study to evaluate which design choices contribute most to performance.

4.1. STANDARD METHODS

In this class of methods, the constructed model f_θ predicts the parameters µ_i = f_θ(x_i) of the categorical distribution P_θ(y_i|x_i) = P(y_i|µ_i) in the standard classification task, while the uncertainty estimates are obtained from the entropy of this distribution. A simple baseline that serves as a lower bound for our further experiments with more advanced methods is MLP, a graph-agnostic MLP model that takes into account only the features of the current observation. Further, GNN is a simple GNN model based on a two-layer SAGE convolution (Hamilton et al., 2017), which combines the information from both the central node and its neighborhood. As a training objective, these methods use the standard Cross-Entropy loss between the one-hot-encoded target y_i and the predicted categorical vector µ_i. For these methods, we can only define uncertainty as the entropy u_i = H[P(y_i|µ_i)] of the predictive categorical distribution.
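For such models, the uncertainty score is simply the entropy of the predicted categorical distribution; a small sketch (our own illustration, not the authors' code), assuming the model outputs raw logits:

```python
import numpy as np

def entropy_uncertainty(logits):
    """Uncertainty of a standard classifier: the entropy of the predicted
    categorical distribution, computed from raw logits."""
    z = logits - logits.max(axis=-1, keepdims=True)    # numerically stable softmax
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    return -(p * np.log(np.clip(p, 1e-12, None))).sum(axis=-1)
```

The entropy is maximal (log C) for a uniform prediction and near zero for a confident one, which is exactly the ordering used for misclassification detection.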

4.2. DIRICHLET-BASED METHODS

The core idea behind the Dirichlet-based uncertainty estimation methods is to model the pointwise Dirichlet distribution p_θ(µ_i|x_i) = p(µ_i|β_i^post) by predicting its parameters β_i^feat = f_θ(x_i) and updating the uniform prior distribution with parameters β_i^prior through the sum β_i^post = β_i^feat + β_i^prior. Using this Dirichlet distribution, one can obtain the target categorical one as follows: P_θ(y_i|x_i) = E_{p(µ_i|β_i^post)}[P(y_i|µ_i)]. It implies that P_θ(y_i|x_i) has parameters β_i^post / S_i, where S_i = Σ_k β_ik^post is called evidence or precision. In other words, the parameters of the categorical distribution can be obtained from the Dirichlet ones by normalization. Importantly, the Dirichlet-based methods allow us to distinguish between total and knowledge uncertainty as follows:
u_i^total = H[P_θ(y_i|x_i)] = H[E_{p(µ_i|β_i^post)} P(y_i|µ_i)],   u_i^know = −S_i.
For this class of methods, the training objective is the Expected Cross-Entropy with an optional regularization term equal to the entropy of the predicted Dirichlet distribution p(µ_i|β_i^post):
L_i = E_{p(µ_i|β_i^post)}[−log P(y_i|µ_i)] − λ H[p(µ_i|β_i^post)].   (1)
This loss function can be computed in closed form (Malinin & Gales, 2018; Charpentier et al., 2020). As the most straightforward method in this class, we consider a modification of GNN that is referred to as EN (Evidential Network) (Sensoy et al., 2018): while exploiting the same architecture, it is trained to predict the Dirichlet parameters via Loss (1). There are also more advanced methods based on the Dirichlet distribution which induce the behavior of the underlying model by estimating the density function in the latent space using Normalizing Flows (Kingma et al., 2016; Huang et al., 2018). These methods can be united within the recently proposed framework of Posterior Networks (Charpentier et al., 2020; 2021), which was applied to node-level problems by Stadler et al. (2021).
In this paper, we provide a detailed study of this framework and consider different variations depending on how the density estimation is performed and how the graph information is used. The description of these components can be found in Appendix E.
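The posterior update and the two uncertainty measures described above can be sketched as follows. This is our own illustration: `beta_feat` stands for the evidence β_i^feat predicted by the network, and we assume a uniform prior of 1 per class:

```python
import numpy as np

def dirichlet_uncertainties(beta_feat, beta_prior=1.0):
    """Total and knowledge uncertainty from predicted Dirichlet evidence.
    beta_feat: (n, C) non-negative per-node evidence (assumed input)."""
    beta_post = beta_feat + beta_prior           # update the uniform prior
    S = beta_post.sum(axis=-1, keepdims=True)    # evidence / precision S_i
    p = beta_post / S                            # expected categorical dist.
    total = -(p * np.log(p)).sum(axis=-1)        # H[ E p(y|mu) ]
    knowledge = -S.squeeze(-1)                   # low evidence -> high KU
    return total, knowledge
```

With zero predicted evidence, the posterior falls back to the prior: the expected categorical distribution is uniform (maximal total uncertainty) and the precision is minimal (maximal knowledge uncertainty).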

4.3. ENSEMBLES

In our study, we also consider ensembling techniques, which have proved to increase the predictive performance of models and provide instruments for estimating uncertainty. Among the methods that predict the parameters µ_i of categorical distributions P(y_i|µ_i), there is a widely used approach for uncertainty estimation introduced by Lakshminarayanan et al. (2017). It can be formulated via an empirical distribution of model parameters q(θ|G_train) obtained after training several instances of the model with different random seeds for initialization: P_θ(y_i|x_i) = E_{q(θ|G_train)}[P(y_i|µ_i)]. Given this, we can split the total uncertainty into data and knowledge uncertainty through the following expressions (Malinin & Gales, 2018):
u_i^total = H[P_θ(y_i|x_i)] = H[E_{q(θ|G_train)} P(y_i|µ_i)],   u_i^data = E_{q(θ|G_train)} H[P(y_i|µ_i)],   u_i^know = u_i^total − u_i^data.
We apply this approach to GNN models and denote the obtained ensemble as EnsGNN. As for the Dirichlet-based approaches, we follow Charpentier et al. (2021) and define an ensemble of models that predict the parameters of the posterior Dirichlet distribution as the mean over the parameters in the ensemble. Here, uncertainty is estimated in the same way as for a single Dirichlet-based model.
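The decomposition above can be sketched as follows (our own illustration, assuming `probs` stacks the categorical predictions of the M ensemble members):

```python
import numpy as np

def ensemble_uncertainties(probs):
    """Decompose ensemble uncertainty (Malinin & Gales, 2018).
    probs: (M, n, C) categorical predictions of M ensemble members."""
    eps = 1e-12
    mean_p = probs.mean(axis=0)                              # (n, C)
    total = -(mean_p * np.log(mean_p + eps)).sum(-1)         # H[ E P ]
    data = -(probs * np.log(probs + eps)).sum(-1).mean(0)    # E H[ P ]
    knowledge = total - data                                 # mutual information
    return total, data, knowledge
```

When all members agree, total and data uncertainty coincide and knowledge uncertainty vanishes; disagreement between members shows up as positive knowledge uncertainty, which is what makes the ensemble useful for OOD detection.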

5. EXPERIMENTS AND ABLATION

Setup As discussed in Section 3.4, we compare the methods on 4 different problems using the associated metrics. In particular, we report the standard Accuracy for general classification performance, PRR@TU for the consistency of total uncertainty estimates u_i^total with the prediction errors, AUPRC@TU for the joint classification and confidence performance, and AUROC@KU for OOD detection using knowledge uncertainty estimates u_i^know. The details of our experimental setup can be found in Appendix G. Our code and benchmark are publicly available at https://anonymous.4open.science/r/revisiting-uncertainty-estimation-F4B6, together with a framework for evaluating a variety of baseline models, Dirichlet-based methods, and ensembling techniques. We run each method 5 times with different random seeds and report the mean and standard deviation. To compare a pair of methods and analyze whether one is noticeably better than the other, we report win/tie/loss counts aggregated over the datasets and, in some cases, over different distribution shifts. We say that a method wins on a particular dataset if it is better than its competitor and the difference is larger than the sum of their standard deviations; if the difference is smaller, there is a tie. Note that we do not compute the standard deviation for the ensembles since they combine all 5 models. To compare different methods, we aggregate their results over all the datasets as follows. First, for each dataset, metric, and distribution shift, we rank all the methods according to their performance (the smaller the rank, the better). Then, we average the obtained ranks over all the datasets.
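The win/tie/loss rule described above amounts to a simple comparison; a sketch with hypothetical inputs (per-dataset means and standard deviations of a higher-is-better metric):

```python
def compare(mean_a, std_a, mean_b, std_b):
    """Win/tie/loss rule used for aggregation: method A wins only if the
    gap exceeds the sum of the two standard deviations."""
    if mean_a - mean_b > std_a + std_b:
        return "win"
    if mean_b - mean_a > std_a + std_b:
        return "loss"
    return "tie"
```

Requiring the gap to exceed the sum of the deviations is a conservative choice: overlapping error bars are always counted as a tie.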

Results and insights

In this section, we address several research questions. First, we analyze the complexity and diversity of the proposed data split strategies. Then, we demonstrate the performance of the considered methods on our benchmark. After that, we investigate the Dirichlet-based methods to figure out which of them is superior for each task. Finally, we find out whether the best-performing models can be further improved via ensembling.

Q1: How complex and diverse are the proposed data splits?

To answer this question, we analyze the predictive performance of the models on the Test-In (ID) and Test-Out (OOD) parts. Table 1 shows this comparison for GNN; similar results for GPN, EnsGNN, and MLP can be found in Tables 6-8 in the Appendix. We can see that feature-based splits are the easiest for GNN: the difference between the ID and OOD parts is fairly small, and the performance on the latter can even be better for some datasets. The PageRank splits are noticeably more complicated, but the most difficult split strategy is the one based on Personalized PageRank: the performance drop is dramatic on some datasets. This can be explained by the locality of the ID part (see Figure 1): the clear separation of ID from OOD in terms of the graph structure makes graph-based learning significantly more complicated.

Q2: How do existing methods perform in our setup, and what is the current status quo?

To answer this question, we compare the following models: GNN as the classic graph processing method, EnsGNN as the standard and universal approach for obtaining uncertainty estimates, and GPN, which is known to be state-of-the-art for OOD detection (Stadler et al., 2021). Table 2 compares these methods over all datasets. One can see that, according to AUROC@KU, the best OOD detection performance is achieved by GPN for all types of shifts, which is consistent with previously reported results (Stadler et al., 2021). Unsurprisingly, EnsGNN shows the best predictive performance, as measured by Accuracy. Moreover, it has the most consistent uncertainty estimates in the context of PRR@TU and provides the best joint performance via AUPRC@TU. In summary, ensembles are the best for all tasks but OOD detection, where the superior method is GPN.

Q3: Which Dirichlet-based methods are the best for each prediction task?

Here, we provide a detailed analysis of the Dirichlet-based framework. The simplest method is EN, which represents a standard GNN trained with Loss (1). Further, we consider the methods using Normalizing Flows and compare Standard vs Natural density estimation and Graph Encoding vs Graph Propagation. Based on Table 3, one can make the following conclusions. First, GPN is still the best method for OOD detection. Second, NatPN is the best according to all other tasks. Third, EN is a strong baseline that shows competitive results in Accuracy, PRR@TU, and AUPRC@TU, often staying close to NatPN. To show whether the difference is statistically significant, we report the win/tie/loss counts for some pairs of models; see Table 4 for the aggregated results and Tables 9, 11-14 in the Appendix for more details.

Q4: Do ensembles consistently improve the performance of complex models?

Ensembles are known to consistently improve model performance in various tasks, so we aim to confirm that this result holds for transductive node classification.
For this purpose, we compare the ensembles of the two most promising methods, GPN and NatPN. The aggregated results are shown in Table 5 (also, see Tables 10-14 in the Appendix for more details). One can see that ensembling via EnsGPN consistently improves GPN for OOD detection, but the difference is mostly insignificant. In contrast, EnsNatPN is noticeably better than NatPN, and this gain is especially significant for Accuracy and the joint performance measured by AUPRC. To summarize our findings, we refer to Table 15 for the comparison of all the methods in terms of ranks and to Tables 11-14 for the detailed comparison of their win/tie/loss counts. One can conclude that GPN is the best single-pass method for OOD detection, while NatPN is the best one for all other tasks. Moreover, the performance of the latter can be further improved by ensembling, so EnsNatPN achieves the best results in terms of Accuracy, PRR, and AUPRC.

6. CONCLUSION

In this work, we propose a new benchmark for evaluating robustness and uncertainty estimation in transductive node classification tasks. For this, we design a universal approach to creating data splits with distribution shifts: it can be applied to any graph dataset and allows for generating shifts of various natures. Using our benchmark, we show that the recently proposed Graph Posterior Network (Stadler et al., 2021) is consistently the best method for detecting OOD inputs, while the best results for the other tasks are achieved using Natural Posterior Networks (Charpentier et al., 2021). Our experiments also confirm that ensembling allows one to improve model performance. We believe that our benchmark will be useful for future studies of node-level uncertainty estimation.

A VISUALIZATION OF DISTRIBUTION SHIFTS

In Figures 2-8, we provide the visualization of different split strategies for all the datasets. Some graphs have multiple connected components; in that case, we keep only the largest one.

B PROPERTIES OF DISTRIBUTION SHIFTS

This section provides a detailed analysis and comparison of the proposed distribution shifts. For this purpose, we consider three representative real-world datasets, AmazonComputers, CoauthorCS, and CoraML, and discuss how different distribution shifts affect the basic properties of the data: class balance, degree distribution, and graph distances between nodes within the ID and OOD subsets.

Class balance

In Figures 9-11, one can see that neither the feature-based nor the PageRank-based split makes a notable difference in the class balance between the ID and OOD subsets. At the same time, the PPR-based split leads to significant changes for some classes. This shows that split strategies based on structural locality in a graph can be very challenging, as they also affect such crucial statistics as class balance. Interestingly, the PageRank-based split does not lead to significant shifts in class balance (for the datasets under consideration), i.e., the more important and less important nodes have, on average, the same probability of belonging to a particular class.

Degree distribution

The node degree distribution is one of the basic structural characteristics of a graph that describes the local importance of nodes. Degrees are especially important for graph processing methods such as GNNs, since they determine how many channels around the considered node are used for message passing and aggregation. In Figures 12-14, one can see that the most significant change in the degree distribution appears when the ID and OOD subsets are separated based on PageRank: the ID part contains more high-degree nodes. This is expected since PageRank is a graph characteristic measuring node importance (a.k.a. centrality), and node degree is the simplest centrality measure, known to be correlated with PageRank. For PPR-based splits, the difference in degree distribution is smaller but still significant, since PPR selects nodes by their relative importance for a particular node, so some high-degree nodes can be less important in terms of PPR. Finally, for feature-based splits, the degree distribution also changes, but the shift is much less significant.
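A minimal sketch of a PageRank-based split and the resulting degree shift, using networkx; the 50/50 split ratio and function names are our own illustrative choices, not the benchmark's exact setup:

```python
import networkx as nx

def pagerank_split(G, id_fraction=0.5):
    """Split nodes into ID (high PageRank) and OOD (low PageRank) parts."""
    pr = nx.pagerank(G)
    ranked = sorted(G.nodes, key=lambda v: pr[v], reverse=True)
    k = int(id_fraction * len(ranked))
    return ranked[:k], ranked[k:]

G = nx.karate_club_graph()
id_nodes, ood_nodes = pagerank_split(G)
mean_deg = lambda nodes: sum(G.degree(v) for v in nodes) / len(nodes)
print(mean_deg(id_nodes), mean_deg(ood_nodes))  # the ID part is denser
```

As the discussion above predicts, the high-PageRank (ID) part has a noticeably larger mean degree than the OOD part.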

Distribution of pairwise distances

The distance between two nodes in a graph is defined as the length of the shortest path between them. Here, we compute such distances between the nodes in the ID or OOD subset within the original graph, i.e., we consider the whole graph when searching for the shortest path. The distribution of distances shows how easily messages can be passed between the nodes. Therefore, we expect that larger pairwise distances create more complicated tasks. In Figures 15-17, one can observe that the PPR-based split leads to the most significant changes in distances, making the OOD nodes nearly twice as far from each other as the ID ones. At the same time, the PageRank-based split does not lead to such a difference, revealing almost the same distributions on the ID and OOD subsets. This means that the popularity bias in a graph does not prevent one from covering the less popular periphery nodes, since the most popular nodes may be spread across the whole graph. Finally, the feature-based split preserves the distances within the subsets.
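The within-subset distance statistic described above can be computed as follows; note that the shortest paths are searched in the full graph, not in the subgraph induced by the subset (the helper name is ours):

```python
import itertools
import networkx as nx

def subset_distances(G, nodes):
    """Shortest-path distances between all pairs of subset nodes, computed on the full graph."""
    dists = []
    for u, v in itertools.combinations(nodes, 2):
        if nx.has_path(G, u, v):
            dists.append(nx.shortest_path_length(G, u, v))
    return dists

G = nx.path_graph(5)                   # 0 - 1 - 2 - 3 - 4
print(subset_distances(G, [0, 2, 4]))  # [2, 4, 2]
```

For large subsets, one would typically sample node pairs instead of enumerating all of them; histograms of these values give the distributions shown in Figures 15-17.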

C COMPARISON WITH GOOD BENCHMARK FROM GUI ET AL. (2022)

Our work complements and extends the GOOD benchmark recently proposed by Gui et al. (2022). However, there are several important differences that we discuss in this section. One of the main properties of the GOOD benchmark is its theoretical distinction between two types of distribution shifts, which are represented through a graphical model. In particular, the authors consider covariate shifts, in which the distribution of features changes while the conditional distribution of targets given features remains the same, and concept shifts, where the opposite situation occurs, i.e., the conditional target distribution changes while the feature distribution stays the same. Although this distinction might be very helpful for understanding the properties of particular GNN models, such exclusively covariate or concept shifts rarely happen in practice, where both types of shifts are usually present at the same time. To create pure covariate or concept shifts, Gui et al. (2022) introduce different subsets of variables that either fully determine the target, create confounding associations with the target, or are completely independent of the target. This has to be properly handled and makes it non-trivial to create distribution shifts on new datasets with this approach. Indeed, the distribution shifts in the GOOD benchmark can be properly implemented only for synthetic graph datasets or via appending synthetic features that either describe various domains as completely independent variables or create the necessary concepts by inducing some spurious correlation with the target. Moreover, the authors claim that, in the case of real-world datasets, one has to perform screening over the available node features to create the required setup of domain or concept shift. This implies numerous restrictions on how the data splits can be prepared.
In contrast, our method does not make a distinction between covariate and concept shifts, and thus it can be universally applied to any dataset and does not require any dataset modifications. Importantly, the type of distribution shift and the sizes of all split parts are easily controllable. This flexibility is the main advantage of our approach. Finally, Gui et al. (2022) confirm the importance of using both node features and graph structure. Still, their node-level distribution shifts are usually based on node features, such as the number of words or the year of publication in a citation network, the language of users in a social network, or the name of organizations in a webpage network. As for graph properties, only node degrees are used in some citation networks. In contrast, we propose to use the graph structure directly and create significant distribution shifts using a very simple technique that only requires computing some node property in the graph, which should be chosen depending on the specific problem. For instance, one may use PageRank to create distribution shifts by the structural popularity of instances or Personalized PageRank to take into account locality and distinguish between the core and periphery nodes. Further in this section, we compare our benchmark with GOOD in terms of the distribution shift statistics discussed in Appendix B.

Degree distribution

Compared to our distribution shifts, the GOOD data splits have much less impact on the degree distribution: this graph property changes dramatically only when the covariate shift is constructed using the degree domain, as in the case of the GOOD-Cora dataset (see Figure 22).
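The Personalized PageRank strategy mentioned above can be sketched as follows; the anchor node, restart parameter, and ID fraction below are our own illustrative assumptions:

```python
import networkx as nx

def ppr_split(G, anchor, id_fraction=0.5, alpha=0.85):
    """ID = nodes most important w.r.t. the anchor (core); OOD = the periphery."""
    ppr = nx.pagerank(G, alpha=alpha, personalization={anchor: 1.0})
    ranked = sorted(G.nodes, key=lambda v: ppr[v], reverse=True)
    k = int(id_fraction * len(ranked))
    return ranked[:k], ranked[k:]

G = nx.karate_club_graph()
id_nodes, ood_nodes = ppr_split(G, anchor=0)
print(0 in id_nodes)  # the anchor itself lands in the core (ID) part
```

Because PPR scores decay with distance from the anchor, this split naturally places the anchor's structural neighborhood in the ID part and the periphery in the OOD part.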

Graph distance distribution

In contrast to our benchmark, the GOOD approach does not lead to a significant change in the pairwise distance distribution between the ID and OOD parts: the distance distribution hardly changes for concept shifts, and only covariate shifts make the difference between the ID and OOD subsets somewhat notable. This demonstrates the necessity of considering the graph structure for inducing challenging distribution shifts.

D ADDITIONAL RESULTS

In this section, we provide additional experimental results.

• Tables 6-8 are similar to Table 1 in the main text. They show how difficult our splits are for different models. These results are consistent: the feature-based split is the easiest for all the methods, while the PPR-based one is the hardest. Interestingly, this also holds for the graph-agnostic MLP, which means that this graph-based shift also implies a noticeable shift in the feature space;
• Table 9 is a detailed version of Table 4, where we consider different distribution shifts separately and add a random partition. Similarly, Table 10 details Table 5;
• Tables 11-14 aggregate win/tie/loss counts for all pairs of methods;
• Table 15 compares all the methods in terms of their ranks averaged over the datasets;
• Tables 16-22 provide the results for all the methods on all the datasets. Note that all other aggregated results can be deduced from these tables.

Win/tie/loss counts for selected pairs of Dirichlet-based methods:

                 Accuracy   PRR@TU   AUPRC@TU   AUROC@KU
NatPN vs EN
  Random         0/7/0      1/6/0    2/4/1      n/a
  Feature        1/6/0      2/5/0    2/4/1      1/1/5
  PageRank       1/6/0      1/6/0    3/4/0      2/1/4
  PPR            1/6/0      3/3/1    3/4/0      0/1/6
GPN vs NatGPN
  Random         4/2/1      1/5/1    4/2/1      n/a
  Feature        5/1/1      1/6/0    5/0/2      2/5/0
  PageRank       5/2/0      3/3/1    5/1/1      5/1/1
  PPR            4/2/1      4/2/1    5/1/1      5/2/0
NatPN vs GPN
  Random         5/0/2      6/1/0    5/1/1      n/a
  Feature        4/1/2      7/0/0    5/1/1      0/0/7
  PageRank       5/0/2      5/1/1    5/1/1      0/0/7
  PPR            3/1/3      2/4/1    3/1/3      0/0/7

[Tables 11-14: full pairwise win/tie/loss matrices for Accuracy, PRR@TU, AUPRC@TU, and AUROC@KU over all ten methods (MLP, GNN, EN, PN, NatPN, GPN, NatGPN, EnsGNN, EnsNatPN, EnsGPN).]

One property that we vary in the considered Dirichlet-based methods is the number of Normalizing Flows and how we use their density estimates.
In particular, we consider the Standard approach (Charpentier et al., 2020), where a distinct Normalizing Flow p_ψ(z_i | k) is used for each class k to predict the corresponding Dirichlet parameter β^feat_ik based on the representation z_i = f_φ(x_i):

β^feat_ik = n · p_ψ(z_i, k) = n · p_ψ(z_i | k) · P(k) = n_k · p_ψ(z_i | k),

where the probability P(k) of class k is approximated by the fraction of train observations from class k, n is usually equal to the dataset size (but can, in general, be a hyperparameter), and n_k := n · P(k). Another approach is the Natural version of Posterior Network proposed by Charpentier et al. (2021). It exploits a single Normalizing Flow p_ψ(z_i) to predict the evidence S_i, while the normalized Dirichlet parameters μ_i are obtained using a one-layer linear transformation g_ω(z_i). The final predictions are obtained as follows:

S_i = n · p_ψ(z_i), μ_i = g_ω(z_i) =⇒ β^feat_ik = S_i · μ_ik.

E.2 GRAPH ENCODING OR GRAPH PROPAGATION

Another aspect that we analyze is how the graph structure can be combined with Posterior Networks. Here, we consider two approaches: use the graph for encoding to obtain z_i, or use graph propagation to smooth the predicted parameters as a post-processing step. In the case of graph encoding, we use a two-layer SAGE convolution to produce the representations z_i and then estimate the density as usual via Normalizing Flows. Thus, the graph structure is used in the pre-processing step. The second approach is adopted by Stadler et al. (2021). In this case, the initial representations z_i are obtained using a graph-agnostic two-layer MLP, while graph propagation is applied to smooth the Dirichlet parameters β^feat_i. Graph propagation is performed in several steps via some transformation π : R^{n×C} → R^{n×C} as follows:

B^{t+1} = (1 − α) π(B^t) + α B,

where B^0 = B ∈ R^{n×C} is formed by the parameters β^feat_ik and α is a hyperparameter that controls the smoothing effect of the propagation step.
Similarly to Stadler et al. (2021), we use a Personalized Propagation scheme (Klicpera et al., 2018), which takes into account the mutual importance of nodes and is defined as π(B) = D^{−1/2} A D^{−1/2} B. We refer to the method with graph propagation as GPN (Graph Posterior Network) and the method with graph encoding as PN (Posterior Network). Further, if the Natural version of the Posterior Network is used instead of the Standard one, we denote the models NatGPN and NatPN, respectively.
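A minimal NumPy sketch of the two ingredients above: the Natural parametrization β^feat_ik = S_i · μ_ik and the propagation B^{t+1} = (1 − α) Â B^t + α B with Â = D^{−1/2} A D^{−1/2}. The toy graph, random densities, and function names are our own illustrative stand-ins for the flow and encoder outputs:

```python
import numpy as np

def natpn_params(density, mu):
    """NatPN: evidence S_i = n * p(z_i); Dirichlet parameters beta_ik = S_i * mu_ik."""
    n = len(density)                      # "dataset size" = number of nodes here
    S = n * density                       # per-node evidence, shape (n,)
    return S[:, None] * mu                # Dirichlet parameters, shape (n, C)

def propagate(B, A, alpha=0.2, steps=5):
    """B^{t+1} = (1 - alpha) * D^{-1/2} A D^{-1/2} B^t + alpha * B."""
    d = A.sum(axis=1)
    A_hat = A / np.sqrt(np.outer(d, d))   # symmetric degree normalization
    Bt = B
    for _ in range(steps):
        Bt = (1 - alpha) * A_hat @ Bt + alpha * B
    return Bt

rng = np.random.default_rng(0)
density = rng.uniform(0.1, 1.0, size=4)   # stand-in for flow densities p(z_i)
mu = rng.dirichlet(np.ones(3), size=4)    # normalized parameters, rows sum to 1
A = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
B = natpn_params(density, mu)
print(propagate(B, A).shape)  # (4, 3)
```

Note that with α = 1 the propagation leaves B unchanged, while smaller α increasingly smooths the parameters over neighbors, matching the role of α described above.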



For GNNs, we formally have μ_i = f_θ(A, X, i), but we write f_θ(x_i) for simplicity of notation and consistency with non-graph methods.



Figure 1: Visualization of data splits for AmazonPhoto dataset: ID is blue, OOD is red.

Figure 2: Visualization of data splits for CoraML dataset: ID is blue, OOD is red.

Figure 3: Visualization of data splits for CiteSeer dataset: ID is blue, OOD is red.

Figure 5: Visualization of data splits for AmazonComputers dataset: ID is blue, OOD is red.

Figure 6: Visualization of data splits for AmazonPhoto dataset: ID is blue, OOD is red.

Figure 7: Visualization of data splits for CoauthorCS dataset: ID is blue, OOD is red.

Figure 9: Visualization of class balance for AmazonComputers dataset.

Figure 12: Visualization of degree distribution for AmazonComputers dataset.

Figure 13: Visualization of degree distribution for CoauthorCS dataset.

Figure 14: Visualization of degree distribution for CoraML dataset.

Figure 15: Visualization of pairwise distance distribution for AmazonComputers dataset.

Figure 17: Visualization of pairwise distance distribution for CoraML dataset.

Figure 18: Visualization of class balance for GOOD-Twitch dataset.

Figure 21: Visualization of degree distribution for GOOD-Twitch dataset.

Figure 24: Visualization of pairwise distance distribution for GOOD-Twitch dataset.

Accuracy of GNN on ID vs OOD test subsets. Diff.% is the difference between the accuracy scores on the OOD and ID parts divided by the accuracy score on the ID part.

Average ranks of standard uncertainty estimation methods over graph datasets.

Average ranks of Dirichlet-based methods over datasets, including EN, PN, NatPN, GPN and NatGPN, which are obtained for every considered split strategy and prediction task.

Win/tie/loss counts for some pairs of Dirichlet-based methods across all the considered graph datasets and split strategies (except Random).

Win/tie/loss counts for ensembles vs the corresponding single models across all the considered graph datasets and split strategies (except Random).

Accuracy of MLP on ID vs OOD test subsets for every split strategy.

Accuracy of GPN on ID vs OOD test subsets for every split strategy.

Accuracy of EnsGNN on ID vs OOD test subsets for every split strategy.

Win/tie/loss counts for some pairs of Dirichlet-based methods across all the considered graph datasets.

Win/tie/loss counts for ensembles vs the corresponding single models across all the considered graph datasets.

Pairwise win/tie/loss counts for Accuracy across all graph datasets and split strategies (except Random).

Pairwise win/tie/loss counts for PRR@TU across all graph datasets and split strategies (except Random).

Pairwise win/tie/loss counts for AUPRC@TU across all graph datasets and split strategies (except Random).

Pairwise win/tie/loss counts for AUROC@KU across all graph datasets and split strategies (except Random).

Average ranks of the considered methods across all datasets.

Experiment results on AmazonComputers dataset.

Experiment results on AmazonPhoto dataset.

Experiment results on CoauthorCS dataset.

Experiment results on CoauthorPhysics dataset.

Experiment results on CoraML dataset.

Experiment results on CiteSeer dataset.

PPR split (columns: Accuracy, PRR@TU, AUPRC@TU, AUROC@KU):
MLP     55.53 ± 0.28   39.66 ± 1.44   65.32 ± 0.23   69.37 ± 0.39
GNN     62.22 ± 0.60   44.67 ± 2.44   72.72 ± 0.68   81.83 ± 0.54
EN      62.18 ± 0.91   44.78 ± 1.08   72.70 ± 0.89   80.35 ± 0.77
PN      46.30 ± 2.63   33.99 ± 8.13   54.74 ± 4.41   80.83 ± 1.27
NatPN   63.09 ± 0.58   41.88 ± 1.41   72.84 ± 0.37   47.42 ± 6.69
GPN     56.27 ± 1.23   40.11 ± 3.87   66.14 ± 0.75   93.34 ± 4.34
NatGPN  64.00 ± 1.40   36.33 ± 0.48   72.37 ± 1.17   88.78 ± 4.75

Experiment results on PubMed dataset.

Feature split (columns: Accuracy, PRR@TU, AUPRC@TU, AUROC@KU):
MLP     84.66 ± 0.14   60.50 ± 0.34   92.52 ± 0.04   48.63 ± 0.09
GNN     86.32 ± 0.17   65.27 ± 0.39   94.03 ± 0.06   50.50 ± 0.24
EN      86.36 ± 0.30   65.81 ± 1.03   94.11 ± 0.06   50.27 ± 0.31
PN      ± 0.19         58.47 ± 1.87   92.73 ± 0.24   49.03 ± 0.84
NatPN   86.15 ± 0.17   64.99 ± 0.40   93.91 ± 0.06   50.73 ± 0.28
GPN     86.89 ± 0.29   62.46 ± 0.73   94.00 ± 0.14   52.47 ± 0.37
NatGPN  84.76 ± 1.06   60.53 ± 1.60   92.56 ± 0.84   50.83 ± 0.28

PPR split:
MLP     84.12 ± 0.05   53.29 ± 0.34   91.24 ± 0.05   56.44 ± 0.24
GNN     84.80 ± 0.20   60.82 ± 0.70   92.64 ± 0.10   58.41 ± 1.03
EN      85.04 ± 0.31   61.21 ± 0.65   92.83 ± 0.16   66.78 ± 2.97
PN      83.31 ± 0.40   48.27 ± 1.77   90.02 ± 0.47   72.08 ± 1.95
NatPN   84.50 ± 0.43   60.66 ± 1.06   92.44 ± 0.24   61.40 ± 4.36
GPN     86.05 ± 0.14   62.12 ± 1.35   93.50 ± 0.13   74.88 ± 1.93
NatGPN  82.85 ± 1.88   57.71 ± 2.81   91.01 ± 1.60   68.86 ± 0.67

Description of the considered graph datasets for node classification task.

G EXPERIMENTAL SETUP

We evaluate all the methods discussed in Section 4 using the benchmark proposed in Section 3. Some of the considered methods require a specific training procedure consisting of several stages. Methods without Normalizing Flows, including MLP, GNN, EnsGNN, and EN, have only one training stage, in which the corresponding models are trained in the standard end-to-end mode. Methods with Normalizing Flows have at least three phases of training: a warm-up of exclusively the Flow layers, end-to-end training of the entire model, and finetuning of the same Flow layers. In this regard, we follow the setup of Stadler et al. (2021); further details are provided below.

In our experiments, one training stage (i.e., warm-up, main training stage, or finetuning) takes 200 epochs, while the best loss value on the Valid-In part serves as the criterion for saving the model checkpoint. We exploit the standard Adam optimizer (Kingma & Ba, 2014) with a learning rate of 0.001 for Normalizing Flows and 0.0003 for the other neural modules. For all the considered models, we utilize a weight decay of 0.00001 and set λ = 0.001 in the Expected Cross-Entropy loss.

As for model configurations, we set the hidden size of linear layers to 64, the number of layers in Normalizing Flows to 8, and the latent space dimension to 16. Also, Graph Propagation is performed in 5 steps with α = 0.2.
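For reference, the setup above can be gathered into a single configuration sketch; the dictionary layout and field names are our own, not from the benchmark code:

```python
# Hyperparameters from the experimental setup, collected in one place.
# Field names are illustrative; values follow the text above.
CONFIG = {
    "stages": ["flow_warmup", "end_to_end", "flow_finetune"],  # flow-based methods only
    "epochs_per_stage": 200,
    "checkpoint_criterion": "best Valid-In loss",
    "optimizer": "Adam",
    "lr_flows": 1e-3,
    "lr_other": 3e-4,
    "weight_decay": 1e-5,
    "lambda_ece": 1e-3,        # λ in the Expected Cross-Entropy loss
    "hidden_size": 64,
    "flow_layers": 8,
    "latent_dim": 16,
    "propagation_steps": 5,
    "propagation_alpha": 0.2,
}
print(len(CONFIG))
```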

