A NON-ASYMPTOTIC ANALYSIS OF OVERSMOOTHING IN GRAPH NEURAL NETWORKS

Abstract

Oversmoothing is a central challenge in building more powerful Graph Neural Networks (GNNs). While previous works have only demonstrated that oversmoothing is inevitable as the number of graph convolutions tends to infinity, in this paper we precisely characterize the mechanism behind the phenomenon via a non-asymptotic analysis. Specifically, we distinguish between two different effects of applying graph convolutions: an undesirable mixing effect that homogenizes node representations across different classes, and a desirable denoising effect that homogenizes node representations within the same class. By quantifying these two effects on random graphs sampled from the Contextual Stochastic Block Model (CSBM), we show that oversmoothing happens once the mixing effect starts to dominate the denoising effect, and that the number of layers required for this transition is O(log N / log(log N)) for sufficiently dense graphs with N nodes. We also extend our analysis to study the effects of Personalized PageRank (PPR), or equivalently, the effects of initial residual connections, on oversmoothing. Our results suggest that while PPR mitigates oversmoothing at deeper layers, PPR-based architectures still achieve their best performance at shallow depths and are outperformed by the graph convolution approach on certain graphs. Finally, we support our theoretical results with numerical experiments, which further suggest that the oversmoothing phenomenon observed in practice can be magnified by the difficulty of optimizing deep GNN models.

1. INTRODUCTION

Graph Neural Networks (GNNs) are a powerful framework for learning with graph-structured data (Gori et al., 2005; Scarselli et al., 2009; Bruna et al., 2014; Duvenaud et al., 2015; Defferrard et al., 2016; Battaglia et al., 2016; Li et al., 2016). Most GNN models are built by stacking graph convolutions or message-passing layers (Gilmer et al., 2017), where the representation of each node is computed by recursively aggregating and transforming the representations of its neighboring nodes. The most representative and popular example is the Graph Convolutional Network (GCN) (Kipf & Welling, 2017), which has demonstrated success in node classification, a primary graph task that asks for node labels and reflects the community structure of real graphs. Despite these achievements, the choice of depth for these GNN models remains an intriguing question. GNNs often achieve optimal classification performance when the networks are shallow. Many widely used GNNs such as the GCN are no deeper than 4 layers (Kipf & Welling, 2017; Wu et al., 2019), and it has been observed that for deeper GNNs, repeated message-passing makes node representations in different classes indistinguishable and leads to lower node classification accuracy, a phenomenon known as oversmoothing (Kipf & Welling, 2017; Li et al., 2018; Klicpera et al., 2019; Wu et al., 2019; Oono & Suzuki, 2020; Chen et al., 2020a;b; Keriven, 2022). Through the insight that graph convolutions can be regarded as low-pass filters on graph signals, prior studies have established that oversmoothing is inevitable as the number of layers in a GNN increases to infinity (Li et al., 2018; Oono & Suzuki, 2020). However, these asymptotic analyses do not fully explain how rapidly oversmoothing occurs as we increase the network depth, let alone the fact that for some datasets, having no graph convolution at all is optimal (Liu et al., 2021). These observations motivate the following key questions about oversmoothing in GNNs:
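As context for the low-pass-filter view above, a single symmetric-normalized graph convolution (the propagation step used by GCN, without the learned weights and nonlinearity) can be sketched as follows; the toy path graph and features are our own illustration, not from the paper:

```python
import numpy as np

def gcn_propagate(A, X):
    """One graph-convolution step: X' = D^{-1/2} (A + I) D^{-1/2} X,
    where D is the degree matrix of A + I (self-loops added)."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    return (d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]) @ X

# Toy example: path graph on 3 nodes, one scalar feature per node.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
X = np.array([[1.0], [0.0], [-1.0]])

H = X
for _ in range(50):
    H = gcn_propagate(A, H)
# Repeated propagation acts as a low-pass filter: node features are
# driven together, until they become indistinguishable (oversmoothing).
print(float(np.abs(H).max()))
```

On this toy graph, fifty rounds of propagation shrink the feature differences to essentially zero, which is the asymptotic picture the paper refines with a finite-depth analysis.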


Figure 1: Stacking GNN layers increases both the mixing and denoising effects, which counteract each other. Depending on the graph properties, either the denoising effect dominates the mixing effect, making node classification easier (A), or the mixing effect dominates the denoising effect, making node classification harder (B); the latter case is when oversmoothing starts to happen.

Why does oversmoothing happen at a relatively shallow depth? Can we quantitatively model the effect of applying a finite number of graph convolutions and theoretically predict the "sweet spot" for the choice of depth?

In this paper, we propose a non-asymptotic analysis framework to study the effects of graph convolutions and oversmoothing using the Contextual Stochastic Block Model (CSBM) (Deshpande et al., 2018). The CSBM mimics the community structure of real graphs and enables us to evaluate the performance of linear GNNs through the probabilistic model with ground-truth community labels. More importantly, as a generative model, the CSBM gives us full control over the graph structure and allows us to analyze the effect of graph convolutions non-asymptotically. In particular, we distinguish between two counteracting effects of graph convolutions:

• mixing effect (undesirable): homogenizing node representations in different classes;
• denoising effect (desirable): homogenizing node representations in the same class.

Adding graph convolutions increases both the mixing and denoising effects. As a result, oversmoothing happens not just because the mixing effect keeps accumulating as the depth increases, which is the premise of the asymptotic analyses (Li et al., 2018; Oono & Suzuki, 2020), but because the mixing effect starts to dominate the denoising effect (see Figure 1 for a schematic illustration).
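A two-class CSBM of the kind described above can be sampled along the following lines; this is a minimal sketch, and the parameter names (p, q, mu, sigma) are ours, not notation from the paper:

```python
import numpy as np

def sample_csbm(n, p, q, mu, sigma, rng):
    """Sample a two-class CSBM graph.
    Each node i gets a label y_i in {-1, +1}; an edge appears with
    probability p within a class and q across classes (p > q gives
    community structure); features are x_i = y_i * mu + Gaussian noise."""
    y = rng.choice([-1, 1], size=n)
    probs = np.where(y[:, None] == y[None, :], p, q)
    upper = rng.random((n, n)) < probs
    A = np.triu(upper, k=1)
    A = (A | A.T).astype(float)  # symmetric adjacency, no self-loops
    X = y[:, None] * mu[None, :] + sigma * rng.standard_normal((n, len(mu)))
    return A, X, y

rng = np.random.default_rng(0)
A, X, y = sample_csbm(n=200, p=0.10, q=0.02,
                      mu=np.array([1.0, -1.0]), sigma=1.0, rng=rng)
print(A.shape, X.shape)
```

Because both the graph and the features are generated from known labels, the mixing and denoising effects of each propagation step can be measured exactly against the ground truth, which is what makes the non-asymptotic analysis tractable.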
By quantifying both effects as a function of the model depth, we show that the turning point of the tradeoff between the two effects is O(log N / log(log N)) for graphs with N nodes sampled from the CSBM in sufficiently dense regimes. Besides new theory, this paper also presents numerical results directly comparing theoretical predictions and empirical results. This comparison leads to new insights, highlighting that the oversmoothing phenomenon observed in practice is often a mixture of pure oversmoothing and the difficulty of optimizing weights in deep GNN models.

In addition, we apply our framework to analyze the effects of Personalized PageRank (PPR) on oversmoothing. Personalized propagation of neural predictions (PPNP) and its approximate variant (APPNP) make use of PPR and its approximation, respectively, and were proposed as a solution to mitigate oversmoothing while retaining the ability to aggregate information from larger neighborhoods in the graph (Klicpera et al., 2019). We show mathematically that PPR makes the model performance more robust to an increasing number of layers by reducing the mixing effect at each layer, while it nonetheless reduces the desirable denoising effect at the same time. For graphs with a large size or strong community structure, the reduction of the denoising effect is greater than the reduction of the mixing effect, and thus PPNP and APPNP perform worse than the vanilla GNN on those graphs.

Our contributions are summarized as follows:

• We show that adding graph convolutions strengthens the denoising effect while exacerbating the mixing effect. Oversmoothing happens because the mixing effect dominates the denoising effect beyond a certain depth. For sufficiently dense CSBM graphs with N nodes, the required number of layers for this to happen is O(log N / log(log N)).
• We apply our framework to rigorously characterize the effects of PPR on oversmoothing. We show that PPR reduces both the mixing effect and the denoising effect of message-passing and thus does not necessarily improve node classification performance.
• We verify our theoretical results in experiments. Through comparison between theory and experiments, we find that the difficulty of optimizing weights in deep GNN architectures often aggravates oversmoothing.
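The PPR-based propagation behind APPNP, discussed above, can be sketched as the following iteration; this is a simplified, untrained version (in APPNP the initial state H0 would be an MLP's predictions), with alpha as the teleport / initial-residual weight:

```python
import numpy as np

def appnp_propagate(A, H0, alpha, K):
    """APPNP-style PPR propagation: H_{k+1} = (1 - alpha) * S @ H_k + alpha * H0,
    where S is the symmetric-normalized adjacency with self-loops.
    For K -> infinity this approaches the personalized PageRank limit
    alpha * (I - (1 - alpha) S)^{-1} @ H0."""
    A_hat = A + np.eye(A.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    S = d_inv_sqrt[:, None] * A_hat * d_inv_sqrt[None, :]
    H = H0
    for _ in range(K):
        H = (1 - alpha) * (S @ H) + alpha * H0
    return H

# Toy path graph on 3 nodes, one scalar feature per node.
A = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
H0 = np.array([[1.0], [0.0], [-1.0]])

deep = appnp_propagate(A, H0, alpha=0.1, K=50)
# Unlike plain graph convolution, the initial residual keeps node features
# from collapsing to a common value even after many propagation steps.
print(float(np.abs(deep).max()))
```

On this toy graph, fifty plain convolution steps would drive the features to essentially zero, whereas the PPR iteration settles at a nonzero profile proportional to H0, which illustrates why PPR reduces the mixing effect at each layer while also attenuating what propagation can contribute.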

