DEEPERGCN: TRAINING DEEPER GCNS WITH GENERALIZED AGGREGATION FUNCTIONS

Abstract

Graph Convolutional Networks (GCNs) have been drawing significant attention for their power of representation learning on graphs. Recent works developed frameworks to train deep GCNs and showed impressive results on tasks like point cloud classification and segmentation, and protein interaction prediction. In this work, we study the performance of such deep models on large-scale graph datasets from the Open Graph Benchmark (OGB). In particular, we examine how the choice of aggregation function affects final performance. Common choices of aggregation are mean, max, and sum, and it has been shown that GCNs are sensitive to this choice across datasets. We further validate this point and propose to alleviate it by introducing a novel Generalized Aggregation Function. Our new aggregation not only covers all commonly used ones, but can also be tuned to learn customized functions for different tasks. It is fully differentiable, so its parameters can be learned in an end-to-end fashion. We add our generalized aggregation to a deep GCN framework and show that it achieves state-of-the-art results on six benchmarks from OGB.

1. INTRODUCTION

The rising availability of non-Euclidean data (Bronstein et al., 2017) has recently renewed interest in Graph Convolutional Networks (GCNs). GCNs provide powerful deep learning architectures for irregular data, like point clouds and graphs. They have proven valuable for applications in social networks (Tang & Liu, 2009), drug discovery (Zitnik & Leskovec, 2017; Wale et al., 2008), recommendation engines (Monti et al., 2017b; Ying et al., 2018), and point clouds (Wang et al., 2018; Li et al., 2019b). Recent works introduced frameworks to train deeper GCN architectures (Li et al., 2019b;a). These works demonstrate how increased depth leads to state-of-the-art performance on tasks like point cloud classification and segmentation, and protein interaction prediction. The power of deep models becomes more evident with the introduction of more challenging and large-scale graph datasets, such as those recently released in the Open Graph Benchmark (OGB) (Hu et al., 2020) for tasks of node classification, link prediction, and graph classification.

Graph convolutions in GCNs are based on the notion of message passing (Gilmer et al., 2017). To compute a new node feature at each GCN layer, information is aggregated from the node and its connected neighbors. Given the nature of graphs, aggregation functions must be permutation invariant. This property guarantees invariance/equivariance to isomorphic graphs (Battaglia et al., 2018; Xu et al., 2019b; Maron et al., 2019a). Popular choices for aggregation functions are mean (Kipf & Welling, 2016), max (Hamilton et al., 2017), and sum (Xu et al., 2019b). Recent works suggest that different aggregations have different performance impact depending on the task. For example, mean and sum perform best in node classification (Kipf & Welling, 2016), while max is favorable for dealing with 3D point clouds (Qi et al., 2017; Wang et al., 2019).
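The permutation-invariance requirement on these three common aggregations can be sketched in a few lines of NumPy (the function and variable names here are illustrative, not from any GCN library): shuffling the neighbor order must leave the aggregated feature unchanged.

```python
import numpy as np

def aggregate(neighbor_feats, mode="mean"):
    """Aggregate a node's neighbor features with a permutation-invariant reduction.

    neighbor_feats: array of shape (num_neighbors, feat_dim).
    """
    if mode == "mean":
        return neighbor_feats.mean(axis=0)
    if mode == "max":
        return neighbor_feats.max(axis=0)
    if mode == "sum":
        return neighbor_feats.sum(axis=0)
    raise ValueError(f"unknown mode: {mode}")

feats = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, 4.0]])
shuffled = feats[[2, 0, 1]]  # same neighbors, different order

# Permutation invariance: the result is identical for any neighbor ordering.
for mode in ("mean", "max", "sum"):
    assert np.allclose(aggregate(feats, mode), aggregate(shuffled, mode))
```

Because each reduction treats the neighbor axis symmetrically, isomorphic graphs (which only permute neighbor order) produce identical node features.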
Currently, all works rely on empirical analysis to choose aggregation functions. In DeepGCNs (Li et al., 2019b), the authors complement aggregation functions with residual and dense connections, and dilated convolutions, in order to train very deep GCNs. Equipped with these new modules, GCNs with more than 100 layers can be reliably trained. Despite the potential of these modules (Kipf & Welling, 2016; Hamilton et al., 2017; Veličković et al., 2018; Xu et al., 2019a), it is still unclear whether they are the ideal choice for DeepGCNs when handling large-scale graphs.

In this work, we analyze the performance of GCNs on large-scale graphs. In particular, we look at the effect of aggregation functions on performance. We unify aggregation functions by proposing a novel Generalized Aggregation Function (Figure 1) suited for graph convolutions. We show how our function covers all commonly used aggregations (mean, max, and sum), and how its parameters can be tuned to learn customized functions for different tasks. Our novel aggregation is fully differentiable and can be learned in an end-to-end fashion in a deep GCN framework. In our experiments, we first report the performance of baseline aggregations on various large-scale graph datasets. We then introduce our generalized aggregation and observe improved performance with the correct choice of aggregation parameters. Finally, we demonstrate how learning the parameters of our generalized aggregation end-to-end leads to state-of-the-art performance on several OGB benchmarks. Our analysis indicates that choosing a suitable aggregation is imperative for performance on different tasks; a differentiable generalized aggregation function ensures the correct aggregation is used for each learning scenario.
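One concrete family with this interpolating behavior is a softmax-weighted aggregation governed by an inverse-temperature parameter (named `beta` here; the name and exact parameterization are illustrative sketches, not necessarily the formulation in Figure 1): as beta approaches 0 it recovers the mean, as beta grows it approaches the max, and the output remains differentiable in beta throughout. Note this particular sketch does not cover sum, which is unnormalized; it only illustrates the mean-to-max interpolation.

```python
import numpy as np

def softmax_agg(neighbor_feats, beta):
    """Softmax-weighted aggregation over the neighbor axis (axis 0).

    `beta` is an illustrative inverse-temperature parameter:
    beta -> 0 recovers the mean, beta -> +inf approaches the max,
    and the output is differentiable in beta in between.
    """
    # Subtract the per-dimension max before exponentiating for numerical stability.
    shifted = neighbor_feats - neighbor_feats.max(axis=0, keepdims=True)
    w = np.exp(beta * shifted)
    w = w / w.sum(axis=0, keepdims=True)  # softmax weights over neighbors
    return (w * neighbor_feats).sum(axis=0)

feats = np.array([[1.0, 2.0], [3.0, 0.5], [-1.0, 4.0]])
# Tiny beta ~ uniform weights ~ mean; large beta concentrates weight on the max.
assert np.allclose(softmax_agg(feats, 1e-8), feats.mean(axis=0), atol=1e-5)
assert np.allclose(softmax_agg(feats, 50.0), feats.max(axis=0), atol=1e-3)
```

Since the weights vary smoothly with beta, the parameter can be exposed to the optimizer like any other network weight.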

We summarize our contributions as two-fold: (1) We propose a novel Generalized Aggregation Function. This new function is suitable for GCNs, as it enjoys permutation invariance. We show how our generalized aggregation covers commonly used functions such as mean, max, and sum in graph convolutions. Additionally, we show how its parameters can be tuned to improve performance on diverse GCN tasks. Since this new function is fully differentiable, its parameters can be learned in an end-to-end fashion. (2) We run extensive experiments on seven datasets from the Open Graph Benchmark (OGB). Our results show that combining depth with our generalized aggregation function achieves state-of-the-art results on several of these benchmarks.
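What "learned in an end-to-end fashion" means for an aggregation parameter can be shown with a toy example (all names illustrative; central finite differences stand in for the automatic differentiation a real framework would provide): fitting a softmax-style aggregation's `beta` by gradient descent toward a task where a max-like aggregation is best drives `beta` up and the loss down.

```python
import numpy as np

def softmax_agg(x, beta):
    """Softmax-weighted aggregation over neighbors (axis 0); `beta` is illustrative."""
    shifted = x - x.max(axis=0, keepdims=True)  # numerical stability
    w = np.exp(beta * shifted)
    w = w / w.sum(axis=0, keepdims=True)
    return (w * x).sum(axis=0)

feats = np.random.default_rng(0).normal(size=(5, 3))
target = feats.max(axis=0)  # pretend the task is best served by max-style aggregation

def loss(beta):
    diff = softmax_agg(feats, beta) - target
    return float((diff ** 2).mean())

# Gradient descent on beta via central finite differences (autodiff stand-in).
beta, lr, eps = 0.0, 1.0, 1e-4
for _ in range(300):
    grad = (loss(beta + eps) - loss(beta - eps)) / (2 * eps)
    beta -= lr * grad

assert beta > 0.0              # moved toward a max-like aggregation
assert loss(beta) < loss(0.0)  # and reduced the loss
```

In a deep GCN, the same effect is obtained by registering the parameter as trainable so the task loss selects the aggregation automatically, instead of picking mean/max/sum by hand.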

2. RELATED WORK

Graph Convolutional Networks (GCNs). Current GCN algorithms can be divided into two categories: spectral-based and spatial-based. Building on spectral graph theory, Bruna et al. (2013) first developed graph convolutions using the Fourier basis of a given graph in the spectral domain. Later, many methods proposed improvements, extensions, and approximations of spectral-based GCNs (Kipf & Welling, 2016; Defferrard et al., 2016; Henaff et al., 2015; Levie et al., 2018; Li et al., 2018; Wu et al., 2019). Spatial-based GCNs (Scarselli et al., 2008; Hamilton et al., 2017; Monti et al., 2017a; Niepert et al., 2016; Gao et al., 2018; Xu et al., 2019b; Veličković et al., 2018) define graph convolution operations directly on the graph by aggregating information from neighbor nodes. To address the scalability issue of GCNs on large-scale graphs, two main categories of algorithms exist: sampling-based (Hamilton et al., 2017; Chen et al., 2018b; Li et al., 2018; Chen et al., 2018a; Zeng et al., 2020) and clustering-based (Chiang et al., 2019).

Training Deep GCNs. Despite the rapid and fruitful progress of GCNs, most prior work employs shallow GCNs. Several works attempt different ways of training deeper GCNs (Hamilton et al., 2017; Armeni et al., 2017; Rahimi et al., 2018; Xu et al., 2018). However, all these approaches are limited to around 10 layers of depth, after which GCN performance degrades because of vanishing gradients and over-smoothing (Li et al., 2018). Inspired by the merits of training deep CNN-based networks (He et al., 2016a; Huang et al., 2017; Yu & Koltun, 2016), DeepGCNs (Li et al., 2019b) propose to train very deep GCNs (56 layers) by adapting residual/dense connections and dilated convolutions.



Figure 1: Illustration of Generalized Message Aggregation Functions

