ADAPTIVE STACKED GRAPH FILTER

Abstract

We study Graph Convolutional Networks (GCN) from the graph signal processing viewpoint by addressing a difference between learning graph filters with fully-connected weights versus trainable polynomial coefficients. We find that by stacking graph filters with learnable polynomial parameters, we can build a highly adaptive and robust vertex classification model. Our treatment relaxes the low-frequency (or, equivalently, high-homophily) assumption in existing vertex classification models, resulting in a solution that applies across the spectral properties of graphs. Empirically, using only one hyper-parameter setting, our model achieves strong results on most benchmark datasets across the frequency spectrum.

1. INTRODUCTION

The semi-supervised vertex classification problem (Weston et al., 2012; Yang et al., 2016) in attributed graphs has become one of the most fundamental machine learning problems in recent years. This problem is often associated with its most popular recent solution, Graph Convolutional Networks (GCN) (Kipf & Welling, 2017). Since the GCN proposal, there has been a vast amount of research to improve its scalability (Hamilton et al., 2017; Chen et al., 2018; Wu et al., 2019) as well as its performance (Liao et al., 2019; Li et al., 2019; Pei et al., 2020).

Existing vertex classification models often (implicitly) assume that the graph has high vertex homophily (Pei et al., 2020), or equivalently, a low-frequency property (Li et al., 2019; Wu et al., 2019); see Section 2.1 for graph frequency. However, this assumption does not hold in general. For instance, consider the Wisconsin dataset (Table 1), which captures a network of students, faculty, staff, courses, and projects. These categories naturally exhibit different frequency patterns: connections between people are often low-frequency, while connections between topics and projects are often midrange. The problem becomes apparent as GCN-like models show low accuracies on this dataset; see, for example, (Pei et al., 2020; Chen et al., 2020b; Liu et al., 2020).

This paper aims at establishing a GCN model for the vertex classification problem (Definition 1) that does not rely on any frequency assumption. Such a model can be applied to a wide range of datasets without any hyper-parameter tuning for the graph structure.

Contributions. By observing the relation between label frequency and the performance of existing GCN-like models, we propose to learn the graph filter coefficients directly rather than learning the MLP part of a GCN-like layer. We use filter stacking to implement a trainable graph filter, which is capable of learning any filter function.
Our stacked filter construction with novel learnable filter parameters is easy to implement, sufficiently expressive, and less sensitive to the filters' degree. Using only one hyper-parameter setting, we show that our model is more adaptive than existing work on a wide range of benchmark datasets.

The rest of our paper is organized as follows. Section 2 introduces notations and analytical tools. Section 3 provides insights into the vertex classification problem and motivates our model's design. Section 4 presents an implementation of our model. Section 5 summarizes related literature with a focus on graph filters and state-of-the-art models. Section 6 compares our model with existing methods empirically. We also provide additional experimental results in Appendix A.

2. PRELIMINARIES

We consider a simple undirected graph G = (V, E), where V = {1, . . . , n} is a set of n vertices and E ⊆ V × V is a set of edges. A graph G is called an attributed graph, denoted by G(X), when it is associated with a vertex feature mapping X : V → R^d, where d is the dimension of the features. We define the following vertex classification problem, also known in the literature as the semi-supervised vertex classification problem (Yang et al., 2016).

Definition 1 (Vertex Classification Problem). We are given an attributed graph G(X), a set of training vertices V_tr ⊂ V, training labels Y_tr : V_tr → C, and a label set C. The task is to find a model h : V → C using the training data (V_tr, Y_tr) that approximates the true labeling function Y : V → C.

Let A be the adjacency matrix of the graph G, i.e., A_{i,j} = 1 if (i, j) ∈ E and 0 otherwise. Let d_i = Σ_j A_{i,j} be the degree of vertex i ∈ V, and let D = diag(d_1, . . . , d_n) be the n × n diagonal matrix of degrees. Let L = D − A be the combinatorial graph Laplacian, and let 𝓛 = D^{-1/2} L D^{-1/2} be the symmetric normalized graph Laplacian.
We mainly focus on the symmetric normalized graph Laplacian 𝓛 due to its interesting spectral properties: (1) its eigenvalues range from 0 to 2; and (2) its spectral properties can be compared between different graphs (Chung & Graham, 1997). In recent literature, the normalized adjacency matrix with added self-loops, Ã = I − 𝓛 + C, is often used as the propagation matrix, where C is some diagonal matrix.
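The matrices above can be assembled directly with NumPy. The snippet below is a minimal illustrative sketch (not the paper's implementation): it builds A, D, L, and 𝓛 for a toy 4-cycle and checks spectral property (1), that the eigenvalues of 𝓛 lie in [0, 2].

```python
import numpy as np

# Toy undirected graph: a 4-cycle on vertices 0..3.
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
n = 4

A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0          # symmetric adjacency matrix

d = A.sum(axis=1)                    # vertex degrees d_i
D = np.diag(d)
L = D - A                            # combinatorial Laplacian L = D - A
D_inv_sqrt = np.diag(d ** -0.5)
L_sym = D_inv_sqrt @ L @ D_inv_sqrt  # normalized Laplacian D^{-1/2} L D^{-1/2}

# Spectral property (1): eigenvalues of the normalized Laplacian lie in [0, 2].
eigvals = np.linalg.eigvalsh(L_sym)
assert eigvals.min() >= -1e-9 and eigvals.max() <= 2 + 1e-9
```

For the 4-cycle, the normalized Laplacian spectrum is {0, 1, 1, 2}, attaining both endpoints of the [0, 2] range.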

2.1. GRAPH FREQUENCY

Graph signal processing (Shuman et al., 2012) extends "frequency" concepts from classical signal processing to graphs via the graph Laplacian. Let 𝓛 = U Λ U^T be the eigendecomposition of the normalized Laplacian, where U ∈ R^{n×n} is the orthogonal matrix whose columns are the orthonormal eigenvectors of 𝓛 and Λ is the diagonal matrix of eigenvalues. We can then regard each eigenvector u_k as an "oscillation pattern" and its eigenvalue λ_k as the "frequency" of that oscillation. This intuition is supported by the Rayleigh quotient:

    r(𝓛, x) ≜ (x^T 𝓛 x) / (x^T x) = [Σ_{u∼v} (x(u)/√d_u − x(v)/√d_v)^2] / [Σ_{u∈V} x(u)^2],    (1)

where Σ_{u∼v} sums over all unordered pairs of adjacent vertices u and v, and x(u) denotes the entry of vector x corresponding to vertex u. From this definition we see that r(𝓛, x) is non-negative and 𝓛 is positive semi-definite. The Rayleigh quotient is also known as a variational characterization of the eigenvalues of 𝓛 (Horn & Johnson, 2012, Chapter 4); hence 0 ≤ r(𝓛, x) ≤ 2 for any non-zero real vector x. We write r(x) for the Rayleigh quotient when the normalized graph Laplacian is clear from context.

The Rayleigh quotient r(x) measures how much the data x oscillates; hence, in this study, we use the terms "frequency" and "Rayleigh quotient" interchangeably. By definition, the eigenvector u_i has frequency λ_i. A labeling y of the vertices is low-frequency if adjacent vertices are likely to have the same label. This is a common assumption made by spectral clustering algorithms (Shi & Malik, 2000; Ng et al., 2002; Shaham et al., 2018). The terms homophily and heterophily, commonly used in network science, correspond to low-frequency and high-frequency, respectively.
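To make the frequency intuition concrete, the following hypothetical snippet (not from the paper) computes the Rayleigh quotient on a small regular graph and checks two facts stated above: each eigenvector attains its eigenvalue as its frequency, and constant (maximally homophilous) versus alternating (maximally heterophilous) signals sit at the two ends of the [0, 2] range.

```python
import numpy as np

def rayleigh_quotient(L_sym, x):
    """Frequency r(L, x) = (x^T L x) / (x^T x) of a signal x on the graph."""
    return (x @ L_sym @ x) / (x @ x)

# 4-cycle graph (2-regular) and its symmetric normalized Laplacian.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L_sym = np.eye(4) - A / np.sqrt(np.outer(d, d))

# Each eigenvector u_k has frequency exactly lambda_k.
lam, U = np.linalg.eigh(L_sym)
for k in range(4):
    assert np.isclose(rayleigh_quotient(L_sym, U[:, k]), lam[k])

# A constant (homophilous) signal has frequency 0; an alternating
# (heterophilous) signal on this bipartite graph has frequency 2.
assert np.isclose(rayleigh_quotient(L_sym, np.ones(4)), 0.0)
assert np.isclose(rayleigh_quotient(L_sym, np.array([1., -1., 1., -1.])), 2.0)
```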

2.2. GRAPH FILTERING

In classical signal processing, a given signal is processed by filters in order to remove unwanted interference. One first designs a frequency response f(λ) for the filter, and then applies the filter to the signal in the sense that each frequency component x̂(λ) of the data is modulated as f(λ)x̂(λ). Graph signal processing extends this concept as follows. As in the classical setting, we design a filter f(λ). We then represent a given graph signal x ∈ R^{|V|} as a linear combination of the eigenvectors, x = Σ_i x_i u_i, and modulate each frequency component by f(λ) to obtain the filtered signal x̃ = Σ_i f(λ_i) x_i u_i. An important fact is that this can be done without performing the eigendecomposition explicitly: letting f(𝓛) be the matrix function induced by f(λ), the filtered signal is given by f(𝓛)x.
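The equivalence between spectral-domain modulation and the matrix function f(𝓛) can be verified numerically. The sketch below (an illustrative example under our own choice of filter, f(λ) = 1 − λ/2, a simple low-pass response) applies the filter both ways and confirms they agree.

```python
import numpy as np

# 4-cycle graph and its symmetric normalized Laplacian.
A = np.array([[0, 1, 0, 1],
              [1, 0, 1, 0],
              [0, 1, 0, 1],
              [1, 0, 1, 0]], dtype=float)
d = A.sum(axis=1)
L_sym = np.eye(4) - A / np.sqrt(np.outer(d, d))
lam, U = np.linalg.eigh(L_sym)

x = np.array([1.0, 3.0, -2.0, 0.5])   # an arbitrary graph signal

# (a) Spectral route: decompose x into frequency components x_i = <u_i, x>,
# modulate each by f(lambda_i) = 1 - lambda_i / 2, and recombine.
coeffs = U.T @ x
x_filt_spectral = U @ ((1 - lam / 2) * coeffs)

# (b) Matrix-function route: f(L) x = (I - L/2) x, no eigendecomposition needed.
x_filt_matrix = x - (L_sym @ x) / 2

assert np.allclose(x_filt_spectral, x_filt_matrix)
```

Route (b) is the one used in practice: for a polynomial f, computing f(𝓛)x needs only sparse matrix-vector products, which is what makes polynomial graph filters scalable.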




