REVISITING UNCERTAINTY ESTIMATION FOR NODE CLASSIFICATION: NEW BENCHMARK AND INSIGHTS

Anonymous authors
Paper under double-blind review

Abstract

Uncertainty estimation is an important task that can be essential for high-risk applications of machine learning. This problem is especially challenging for node-level prediction in graph-structured data, as the samples (nodes) are interdependent. However, there is no established benchmark that allows for the evaluation of node-level uncertainty estimation methods in a unified setup covering diverse and meaningful distribution shifts. In this paper, we address this problem and propose such a benchmark, together with a technique for the controllable generation of data splits with various types of distribution shifts. Importantly, we describe shifts that are specific to graph-structured data. Our benchmark consists of several graph datasets equipped with various distribution shifts, on which we evaluate the robustness of models and their uncertainty estimation performance. To illustrate the benchmark, we decompose the current state-of-the-art Dirichlet-based framework and perform an ablation study on its components. In our experiments on the proposed benchmark, we show that when faced with complex yet realistic distribution shifts, most models fail to maintain both high classification performance and consistency of uncertainty estimates with prediction errors. However, ensembling techniques help to partially overcome significant drops in performance and achieve better results than individual models.

1. INTRODUCTION

Uncertainty estimation is an important and challenging task with many applications in financial systems, medical diagnostics, autonomous driving, etc. It aims at quantifying the confidence of machine learning models and can be used to design more reliable decision-making systems. In particular, it enables one to solve such problems as misclassification detection, where the model has to assign higher uncertainty to potential prediction errors, or out-of-distribution (OOD) detection, where the model is required to yield higher uncertainty for samples from an unknown distribution. Depending on its source, uncertainty can be divided into data uncertainty, which describes the inherent noise in the data due to labeling mistakes or class overlap, and knowledge uncertainty, which accounts for an insufficient amount of information for accurate predictions when the distribution of test data differs from the training one (Gal, 2016; Malinin, 2019).

The problem of uncertainty estimation for graph-structured data has recently started to gain attention. It is especially complex at the node level, as one has to deal with interdependent samples that may come from different distributions, so their predictions can change significantly depending on the neighborhood. This problem has already been addressed in several studies, and the proposed methods are commonly based on the Dirichlet distribution, introducing various extensions to the Dirichlet framework (Sensoy et al., 2018; Malinin & Gales, 2018; Malinin, 2019; Charpentier et al., 2020), such as graph-based kernel Dirichlet estimation (Zhao et al., 2020) or graph propagation of Dirichlet parameters (Stadler et al., 2021). However, the field of robustness and uncertainty estimation for node-level graph problems suffers from the absence of benchmarks with diverse and meaningful distribution shifts.
Usually, the evaluation is limited to somewhat unrealistic distribution shifts, such as noisy node features (Stadler et al., 2021) or left-out classes (Zhao et al., 2020; Stadler et al., 2021). Importantly, Gui et al. (2022) try to overcome this issue and systematically construct a graph OOD benchmark in which they explicitly distinguish between covariate and concept shifts. However, the authors either consider synthetic datasets or ignore the graph structure when creating distribution shifts. The problem with the mentioned approaches is that, in real applications, distribution shifts can be much more complex and diverse and may depend on the global graph structure (for a more detailed discussion, refer to Appendix C). Thus, the existing benchmarks can be insufficient to reliably and comprehensively evaluate uncertainty estimation methods for graph-structured data, and it remains unclear which uncertainty estimation methods for node classification perform best.

In this work, we propose a new benchmark for evaluating robustness and uncertainty estimation in transductive node classification tasks. The main feature of our benchmark is a general approach to constructing data splits with distribution shifts: it can be applied to any graph dataset, allows for generating shifts of different nature, and lets one easily vary the sizes of the splits. For demonstration purposes, we apply our method to 7 common node classification datasets and describe 3 particular strategies for inducing distribution shifts. Using the proposed benchmark, we evaluate the robustness of various models and their ability to detect errors and OOD inputs. We show that the recently proposed Graph Posterior Network (Stadler et al., 2021) is consistently the best method for detecting OOD inputs, while the best results for the other tasks are achieved by Natural Posterior Networks (Charpentier et al., 2021). We also confirm that ensembling often improves model performance: ensembles of GPNs achieve the best performance for OOD detection, while ensembles of NatPNs have the best predictive performance and error detection.
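The ensembling referred to above typically averages the predictive distributions of independently trained models; a standard by-product is a decomposition of total uncertainty into data and knowledge components. The sketch below (plain NumPy, not the exact procedure used in the paper) illustrates this decomposition:

```python
import numpy as np

def ensemble_uncertainty(member_probs):
    """Decompose ensemble uncertainty.

    member_probs: array of shape (M, N, C) -- class probabilities from
    M ensemble members for N nodes and C classes.
    Returns (total, data, knowledge) per node, where
      total     = entropy of the averaged prediction,
      data      = average entropy of member predictions,
      knowledge = total - data (the mutual information: it is positive
                  only when members disagree).
    """
    eps = 1e-12
    mean_p = member_probs.mean(axis=0)                     # (N, C)
    total = -(mean_p * np.log(mean_p + eps)).sum(axis=-1)  # (N,)
    data = -(member_probs * np.log(member_probs + eps)).sum(axis=-1).mean(axis=0)
    return total, data, total - data

# Two members that agree -> knowledge uncertainty is (numerically) zero.
agree = np.array([[[0.9, 0.1]], [[0.9, 0.1]]])
# Two members that disagree -> large knowledge uncertainty (about ln 2).
disagree = np.array([[[1.0, 0.0]], [[0.0, 1.0]]])
```

Disagreement between members is the signal that makes ensembles useful for detecting inputs far from the training distribution.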

2. PROBLEM STATEMENT

We consider the problem of transductive node classification in an attributed graph G = (A, X, Y) with an adjacency matrix A ∈ {0, 1}^{n×n}, a node feature matrix X ∈ R^{n×d}, and a categorical target vector Y ∈ {1, . . . , C}^n. We split the set of nodes V into several non-intersecting subsets depending on whether they are used for training, validation, or testing, and on whether they belong to the in-distribution (ID) or out-of-distribution (OOD) subset. Let Y_train denote the labels of the train nodes V_train. Given a graph G_train = (A, X, Y_train), we aim at predicting the labels Y_test of the test nodes V_test and estimating an uncertainty measure u_i ∈ R associated with these predictions. The obtained uncertainty estimates are used to solve the misclassification detection and OOD detection problems.
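To make the role of the uncertainty measure u_i concrete, here is a minimal NumPy sketch of two common scores derived from a model's predicted class probabilities; the actual estimators compared in this paper are model-specific, so this is only an illustration:

```python
import numpy as np

def predictive_entropy(probs):
    """Entropy of the predictive distribution: a common choice for u_i."""
    eps = 1e-12
    return -(probs * np.log(probs + eps)).sum(axis=-1)

def max_prob_uncertainty(probs):
    """1 - max class probability: another simple uncertainty score."""
    return 1.0 - probs.max(axis=-1)

# Row 0: a confident prediction; row 1: a near-uniform, uncertain one.
probs = np.array([[0.98, 0.01, 0.01],
                  [0.34, 0.33, 0.33]])
u = predictive_entropy(probs)
```

For misclassification or OOD detection, nodes are simply ranked by u_i, so any monotone transformation of such a score yields the same detector.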

3. PROPOSED BENCHMARK

This section describes our benchmark for evaluating uncertainty estimates and robustness to distribution shifts for node-level graph problems. The most important ingredient of our benchmark is a unified approach for the controllable generation of diverse distribution shifts that can be applied to any graph dataset. Our benchmark includes a collection of common node classification datasets, several data split strategies, a set of problems for evaluating robustness and uncertainty estimation performance, and the associated metrics. We describe these components below.
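Misclassification detection and OOD detection are binary retrieval problems, so they are typically scored with threshold-free ranking metrics. As an illustration of one such metric, the sketch below computes AUROC from uncertainty scores via the rank-sum (Mann-Whitney U) formulation; it is an assumption-level example, not necessarily the exact metric set used in the benchmark:

```python
import numpy as np

def auroc(scores, labels):
    """AUROC of using `scores` (higher = more uncertain) to detect
    positives (labels == 1, e.g., OOD or misclassified nodes).
    Uses the rank-sum formulation with tied scores given average ranks.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    for s in np.unique(scores):          # average the ranks of ties
        mask = scores == s
        ranks[mask] = ranks[mask].mean()
    n_pos = labels.sum()
    n_neg = len(labels) - n_pos
    u = ranks[labels == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

# Perfect separation: every positive outranks every negative.
# auroc([0.1, 0.2, 0.8, 0.9], [0, 0, 1, 1]) -> 1.0
```

An AUROC of 0.5 corresponds to uncertainty scores that are uninformative about the detection target, and 1.0 to a perfect ranking.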

3.1. GRAPH DATASETS

While our approach can potentially be applied to any node classification or node regression dataset, for our experiments we pick the following 7 datasets commonly used in the literature: 3 citation networks - CoraML, CiteSeer (McCallum et al., 2000; Giles et al., 1998; Getoor, 2005; Sen et al., 2008) and PubMed (Namata et al., 2012); 2 co-authorship graphs - CoauthorPhysics and CoauthorCS (Shchur et al., 2018); and 2 co-purchase datasets - AmazonPhoto and AmazonComputers (McAuley et al., 2015; Shchur et al., 2018).

3.2. DATA SPLITS

The most important ingredient of our benchmark is a general approach to generating data splits in a graph G that yield non-trivial yet reasonable distribution shifts. For this purpose, we make a distinction between the ID parts, which are described by p(Y_in | X, A), and the shifted (OOD) parts, where the targets may come from a significantly different distribution p(Y_out | X, A). We define the following ID parts:

