LEARNING DISCRETE ADAPTIVE RECEPTIVE FIELDS FOR GRAPH CONVOLUTIONAL NETWORKS

Anonymous

Abstract

Different nodes in a graph neighborhood generally carry different importance. In previous work on Graph Convolutional Networks (GCNs), such differences are typically modeled with attention mechanisms. However, as we prove in this paper, soft attention weights suffer from over-smoothness in large neighborhoods. To address this weakness, we introduce a novel framework for conducting graph convolutions, where nodes are discretely selected among multi-hop neighborhoods to construct adaptive receptive fields (ARFs). ARFs free GCNs from the over-smoothness of soft attention weights and allow them to efficiently explore long-distance dependencies in graphs. We further propose GRARF (GCN with Reinforced Adaptive Receptive Fields) as an instance, where an optimal policy for constructing ARFs is learned with reinforcement learning. GRARF achieves or matches state-of-the-art performance on public datasets from different domains. Our further analysis corroborates that GRARF is more robust than attention models against neighborhood noise.

1. INTRODUCTION

After a series of explorations and modifications (Bruna et al., 2014; Kipf & Welling, 2017; Velickovic et al., 2017; Xu et al., 2019; Li et al., 2019; Abu-El-Haija et al., 2019), Graph Convolutional Networks (GCNs)[1] have gained considerable attention in the machine learning community. Typically, a graph convolutional model can be abstracted as a message-passing process (Gilmer et al., 2017): nodes in the neighborhood of a central node are regarded as contexts, which individually pass their messages to the central node via convolutional layers. The central node then weighs and transforms these messages. This process is conducted recursively as the depth of the network increases.[2]

Neighborhood convolutions have proved widely useful on various kinds of graph data. However, current GCNs also have some shortcomings. While different nodes may carry different importance in a neighborhood, early GCNs (Kipf & Welling, 2017; Hamilton et al., 2017) did not discriminate among contexts in their receptive fields. These models either treated contexts equally or used normalized edge weights as the weights of contexts. As a result, such implementations failed to capture critical contexts, i.e., contexts that pose greater influence on the central node (close friends among acquaintances, for example). Graph Attention Networks (GATs) (Velickovic et al., 2017) addressed this problem with attention mechanisms (Bahdanau et al., 2015; Vaswani et al., 2017). Soft attention weights were used to discriminate the importance of contexts, which allowed the model to better focus on relevant contexts when making decisions. With impressive performances, GATs became widely used in later generations of GCNs, including (Li et al., 2019; Liu et al., 2019). However, we observe that using soft attention weights in hierarchical convolutions does not fully solve the problem.
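The message-passing abstraction above can be sketched in a few lines. The function name, the dense-matrix formulation, and the uniform context weights are our own illustrative choices (matching the "treat contexts equally" behavior of early GCNs), not a specific model from the literature:

```python
import numpy as np

def message_passing_layer(H, A, W, weights=None):
    """One graph-convolution layer viewed as message passing.

    H: (n, d) node representations; A: (n, n) adjacency with self-loops;
    W: (d, d_out) learnable transform; weights: optional (n, n) per-context
    weights (uniform over each neighborhood when omitted, as in early GCNs).
    """
    if weights is None:
        deg = A.sum(axis=1, keepdims=True)   # neighborhood sizes
        weights = A / deg                    # equal weight per context
    messages = weights @ H                   # each node aggregates its contexts
    return np.maximum(messages @ W, 0.0)     # transform + ReLU
```

Passing a non-uniform `weights` matrix (e.g., softmax-normalized attention scores) recovers the GAT-style weighting discussed next.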
Firstly, we show in Proposition 1 that under common conditions, soft attention weights almost surely approach 0 as neighborhood sizes increase. This smoothness[3] hinders the discrimination of context importance in large neighborhoods. Secondly, we show by experiments in Section 4.2 that GATs cannot reliably distinguish true graph nodes from artificial noise nodes: the attention weights assigned to true nodes and to noise nodes are almost identically distributed, which further leads to a dramatic drop in performance.

Meanwhile, an ideal GCN architecture is often expected to exploit information from nodes at various distances. Most existing GCNs use hierarchical convolutional layers in which only one-hop neighborhoods are convolved. As a result, one must increase the model depth to detect long-distance dependencies (informative nodes that are distant from the central node). This is particularly an issue in large graphs, as the complexity of graph convolutions is exponential in the model depth.[4] In large graphs, the model depth is therefore often set to 1, 2, or 3 (Hamilton et al., 2017; Velickovic et al., 2017). Accordingly, no dependencies longer than 3 hops are exploited in these models.

Motivated by the discussions above, we propose the idea of adaptive receptive fields (ARFs). Figure 1 illustrates the differences between hierarchical convolutions and convolutions with ARFs. An ARF is defined as a subset of contexts that are most informative for a central node, and is constructed by selecting contexts from the neighborhood. Nodes in an ARF can be at various distances from the central node. The discrete selection of contexts avoids the undesired smoothness of soft weights (see Section 2). In addition, by allowing ARFs to choose contexts on different hops from the central node, one can efficiently explore dependencies over longer distances. Experiments also show that ARFs are more robust to noise (see Section 4).
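The first issue can be illustrated numerically: with i.i.d. attention logits, even the largest softmax weight in a neighborhood shrinks toward 0 as the neighborhood grows. The following toy simulation is our own illustration of this tendency, not a restatement of Proposition 1:

```python
import numpy as np

rng = np.random.default_rng(0)

def max_softmax_weight(k, scale=1.0):
    """Largest soft attention weight over a size-k neighborhood,
    with i.i.d. Gaussian attention logits (an illustrative assumption)."""
    logits = rng.normal(0.0, scale, size=k)
    w = np.exp(logits - logits.max())  # numerically stable softmax
    return (w / w.sum()).max()

for k in (5, 50, 500, 5000):
    print(k, round(max_softmax_weight(k), 4))
```

Even the single largest weight collapses as k grows, so no context can stand out in a large neighborhood, which is precisely the smoothness that discrete selection avoids.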
We further propose GRARF (GCNs with Reinforced Adaptive Receptive Fields) as an instance of using ARFs in node-level tasks. In GRARF, an optimal policy for constructing ARFs is learned with reinforcement learning (RL). An RL agent (the constructor) successively expands the ARF via a two-stage process: first, a contact node in the intermediately-constructed ARF is selected; then, a context among the direct neighbors of the contact node is added to the ARF. The reward of the constructor is defined as the performance of a trained GCN (the evaluator) on the constructed ARF. GRARF is validated on datasets from different domains, including three citation networks, one social network, and an inductive protein-protein interaction dataset. GRARF matches or improves performance on node classification tasks compared with strong baselines.[5] Moreover, we design two tasks to test the models' abilities to focus on informative contexts and to leverage long-distance dependencies, by injecting node noise into graphs with different strategies.
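The two-stage expansion loop can be sketched as follows. The policy stubs stand in for the learned RL constructor, and all names here are our own; this is a structural sketch, not the paper's implementation:

```python
import random

def construct_arf(v, neighbors, select_contact, select_context, budget):
    """Sketch of two-stage ARF construction.

    v: central node; neighbors: dict mapping a node to its one-hop neighbors;
    select_contact / select_context: placeholder policies standing in for the
    learned RL constructor; budget: number of contexts to add.
    """
    arf = [v]
    for _ in range(budget):
        contact = select_contact(arf)        # stage 1: pick a contact node in the ARF
        candidates = [u for u in neighbors[contact] if u not in arf]
        if not candidates:
            continue                         # contact has no new neighbors to offer
        arf.append(select_context(candidates))  # stage 2: add one of its neighbors
    return arf

# Random policies as stand-ins for the learned ones, on a 4-node path graph:
arf = construct_arf(0, {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]},
                    random.choice, random.choice, budget=3)
```

Note that once an intermediate node enters the ARF it can serve as a contact, so nodes several hops from the central node become reachable without stacking deep convolutional layers.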

2. PRELIMINARIES AND THEORIES

Notations. In this paper, we consider node-level supervised learning tasks on attributed graphs. An attributed graph G is generally represented as G = (V, A, X), where V = {v_1, ..., v_n} denotes the set of nodes, A ∈ {0, 1}^{n×n} denotes the (binary) adjacency matrix, and X ∈ R^{n×d_0} denotes the input node features, with x_v ∈ R^{d_0} the features of node v. E is used to denote the set of edges. We use N(v_i) to denote the one-hop neighborhood of node v_i, with v_i itself included. We use H^(l) ∈ R^{n×d_l} as the matrix containing the d_l-dimensional hidden representations of nodes in the l-th layer, and h^(l)_v as that of node v.
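Under this notation, the one-hop neighborhood N(v_i) can be read directly off a row of the adjacency matrix; a minimal sketch (the function name is ours):

```python
import numpy as np

def one_hop_neighborhood(A, i):
    """N(v_i): all j with A[i, j] = 1, plus v_i itself."""
    nbrs = set(np.flatnonzero(A[i]).tolist())
    nbrs.add(i)  # the central node is included by convention
    return nbrs
```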



[1] We use the name GCN for a class of deep learning approaches where information is convolved among graph neighborhoods, including but not limited to the vanilla GCN (Kipf & Welling, 2017).
[2] We use the term contexts to denote the neighbor nodes, and receptive field to denote the set of contexts that the convolutions refer to.
[3] The smoothness discussed in our paper is different from that in (Li et al., 2018), i.e., the phenomenon that representations of nodes converge in very deep GNNs.
[4] With sparse adjacency matrices, the average complexity of graph convolutions is O(d^L), where L is the model depth and d is the graph degree (or the neighborhood-sampling size in (Hamilton et al., 2017)).
[5] We mainly show the results of node classification tasks in our paper, whereas GRARF is intrinsically adaptable to all node-level supervised learning tasks.



Figure 1: A comparison between hierarchical convolutions and convolutions with ARFs. Left: GCNs with ARFs better focus on critical nodes and filter out noise in large neighborhoods. Right: ARFs explore long-distance dependencies more efficiently.

