DEEP ENSEMBLES FOR GRAPHS WITH HIGHER-ORDER DEPENDENCIES

Abstract

Graph neural networks (GNNs) continue to achieve state-of-the-art performance on many graph learning tasks, but rely on the assumption that a given graph is a sufficient approximation of the true neighborhood structure. When a system contains higher-order sequential dependencies, we show that the tendency of traditional graph representations to underfit each node's neighborhood causes existing GNNs to generalize poorly. To address this, we propose a novel Deep Graph Ensemble (DGE), which captures neighborhood variance by training an ensemble of GNNs on different neighborhood subspaces of the same node within a higher-order network representation. We show that DGE consistently outperforms existing GNNs on semisupervised and supervised tasks on six real-world data sets with known higher-order dependencies, even under a similar parameter budget. We demonstrate that diverse and accurate base classifiers are central to DGE's success, and discuss the implications of these findings for future work on ensembles of GNNs.

1. INTRODUCTION

Graph neural networks (GNNs) solve learning tasks by propagating information through each node's neighborhood in a graph (Zhou et al., 2020; Wu et al., 2020). Most present work on GNNs assumes that a given graph is a sufficient approximation of the underlying neighborhood structure. But a growing body of work has challenged this assumption by showing that traditional graphs often cannot capture the higher-order structure and dynamics that govern many real-world systems (Lambiotte et al., 2019; Battiston et al., 2020; Porter, 2020; Torres et al., 2021; Battiston et al., 2021). In the present work, we couple GNNs with a specific family of graphs, higher-order networks (HONs), which encode sequential higher-order dependencies (i.e., conditional probabilities that cannot be explained by a first-order Markov model) in a graph structure. A traditional graph, which we call a first-order network (FON), represents a system by decomposing it into a set of pairwise edges, so the only way to infer polyadic interactions is via transitive paths over adjacent nodes. When higher-order dependencies are present, these Markovian paths underfit the true neighborhood (Scholtes, 2017) and can thus produce many false positive interactions between nodes (Lambiotte et al., 2019). To address this limitation, Xu et al. (2016) proposed a HON that creates conditional nodes to more accurately encode the observed higher-order interactions. By preserving this additional information in the graph structure, HONs have produced new insights in studies of user behavior (Chierichetti et al., 2012), citation networks (Rosvall et al., 2014), human mobility and navigation patterns (Scholtes et al., 2014; Peixoto & Rosvall, 2017), the spread of invasive species (Saebi et al., 2020b), anomaly detection (Saebi et al., 2020d), disease progression (Krieg et al., 2020b), and more (Koher et al., 2016; Peixoto & Rosvall, 2017; Scholtes, 2017; Lambiotte et al., 2019; Saebi et al., 2020a).
However, their use with GNNs has not been thoroughly explored. As Figure 1 illustrates, the tendency of FONs to underfit has consequences for GNNs, which typically compute representations by recursively pooling features from each node's neighbors. In order to maximize GNN performance, we must ensure that local neighborhoods capture the true distribution of interactions in the system. To enable GNNs to utilize the additional information encoded in HONs, we propose a novel Deep Graph Ensemble (DGE), which uses independent GNNs to exploit variance in higher-order node neighborhoods and learn effective representations in graphs with higher-order dependencies.

Figure 1: A toy example of challenges faced by GNNs in modeling systems with higher-order dependencies. A FON (G^1) underfits the higher-order dependencies in the observed paths. Consequently, a GNN will learn similar representations for A and B, since they share the same 2-hop neighborhood in G^1. A HON (G^k, with k = 2 in this example) uses conditional nodes to encode higher-order dependencies. For example, node C|A represents the observed dependency that C only interacts with D when it also interacts with A (note that in real-world systems, G^k rarely breaks the graph into multiple components). However, computing a representation for C then requires a GNN to aggregate multiple local neighborhoods. Colors depict node features.

The key contributions of our work include:
1. We analyze the data-level challenges that fundamentally limit the ability of existing GNNs to learn effective models of systems with higher-order dependencies.
2. We introduce the notion of neighborhood subspaces by showing that neighborhoods in a HON are analogous to feature subspaces of first-order neighborhoods. Borrowing from ensemble methods, we then propose DGE to exploit the variance in these subspaces.
3. We experimentally evaluate DGE against eight state-of-the-art baselines on six real-world data sets with known higher-order dependencies, and show that, even with similar parameter budgets, DGE consistently outperforms baselines on semisupervised (node classification) and supervised (link prediction) tasks.¹
4. We demonstrate that DGE's ability to train accurate and diverse classifiers is central to strong performance, and show that ensembling multiple GNNs with separate parameters is a consistent way to maximize the trade-off between accuracy and diversity.

2. BACKGROUND AND PRELIMINARIES

2.1. HIGHER-ORDER NETWORKS

Let S = {S_1, S_2, ..., S_n} be a set of observed paths (e.g., flight itineraries, disease trajectories, or user clickstreams), where each S_i = ⟨s_1, s_2, ..., s_m⟩ is a sequence of entities (e.g., airports, diagnosis codes, or web pages). Let A denote the set of entities across all sequences. By using a graph to summarize S, we can model the global function of each entity in the system and solve a number of useful learning problems. For example, we can predict disease function via node classification or interactions between airports via link prediction.² However, there is a large space of possible graphs that can represent S. We consider two: a FON, and the HON introduced by Xu et al. (2016). In a FON G^1 = (V^1, E^1), the node set V^1 = A (or, more generally, the mapping f : V^1 → A is bijective), and the edge set E^1 is the set of node pairs (u, v) ∈ V^1 × V^1 that are adjacent elements in at least one S_i. In a HON G^k = (V^k, E^k) with order k > 1, each node is a sequence of entities u′ = ⟨a′_1, ..., a′_{m-1}, a′_m⟩, where each a′_i ∈ V^1 and m ≤ k. We define a′_m as the base node, and in practice use the notation u′ = a′_m | a′_1, ..., a′_{m-1} to emphasize that each u′ ∈ V^k represents a base node whose current state is conditioned on a set of predecessors. Each node can have a different number of predecessors, and a conditional node with m > 1 is only created if the conditional distribution of paths it encodes sufficiently reduces the entropy of the graph (Saebi et al., 2020d). This means that V^k ⊆ A^k; or, more generally, the mapping f : V^k → A^k is injective but not necessarily bijective. We define Ω^k_u = {u′ ∈ V^k : a′_m = u} as the higher-order family of u (including u itself), and call each u′ ∈ Ω^k_u a relative of u (in Figure 1, for example, C|A and C|B are the relatives of C).
Like E^1, the edge set E^k is the set of node pairs (u, v) ∈ V^k × V^k that are adjacent in at least one S_i. In both HONs and FONs, edges are directed, such that (u, v) ≠ (v, u), and weighted via w^k : E^k → R_{≥0}, where 0 indicates a missing edge. By creating conditional nodes, a HON can express higher-order interactions while remaining a graph, since each edge is still a 2-tuple. For example, consider that passengers who fly from Atlanta to Chicago are much more likely to fly back to Atlanta than to New York, and vice versa. A HON can encode this dependency by creating the conditional nodes "Chicago|Atlanta" and "Chicago|New York", which changes the topology and flow of information within local neighborhoods (Rosvall et al., 2014). Choosing which conditional nodes and edges to create is a non-trivial problem; readers interested in more details can refer to Krieg et al. (2020a), who proposed the procedure that we used in this study.

Related graph-based models. The term "higher-order" is also used in the literature to refer to the analysis of polyadic structures within graphs (Benson et al., 2016), as well as GNNs that are able to distinguish these structures (Morris et al., 2019; Li et al., 2020; Schnake et al., 2021). These studies rely on the same assumptions as other GNNs, i.e., that a graph is a sufficient approximation of the neighborhood structure. Despite similarities in terminology, HONs are primarily concerned with the question of initial representation (i.e., how should the graph be constructed?) rather than downstream analysis of an existing graph, and thus address a fundamentally different, though complementary, problem (Lambiotte et al., 2019).
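The Atlanta/Chicago example above can be sketched in a few lines of code. This is a minimal illustration only, not the GROWHON procedure of Krieg et al. (2020a): it unconditionally creates an order-2 conditional node for every observed context, whereas the real method keeps a conditional node only if it sufficiently reduces the entropy of the graph.

```python
# Toy sketch: build FON (G1) and order-2 HON (G2) edge counts from observed
# paths. Every order-2 context becomes a conditional node "v|u" here; the
# entropy-based pruning of the actual HON construction is omitted.
from collections import defaultdict

paths = [
    ["Atlanta", "Chicago", "Atlanta"],
    ["NewYork", "Chicago", "NewYork"],
]

fon_edges = defaultdict(int)  # first-order network G1
hon_edges = defaultdict(int)  # higher-order network G2 (k = 2)

for p in paths:
    prev_hon = p[0]  # the first node in a path has no predecessor
    for i in range(len(p) - 1):
        u, v = p[i], p[i + 1]
        fon_edges[(u, v)] += 1
        # conditional node "v|u" encodes v's dependence on predecessor u
        cur_hon = f"{v}|{u}"
        hon_edges[(prev_hon, cur_hon)] += 1
        prev_hon = cur_hon
```

In G1 both itineraries are forced through a single "Chicago" node, so the return-flight dependency is lost; in G2 the relatives "Chicago|Atlanta" and "Chicago|NewYork" keep the two flows separate.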
HONs are also distinct from other formalisms like hypergraphs and simplicial complexes in that they encode conditional distributions that govern higher-order paths (via conditional nodes and directed, weighted edges), and therefore represent different kinds of systems (Battiston et al., 2020; Porter, 2020; Torres et al., 2021; Battiston et al., 2021) . Some very recent works have shown that aggregators based on paths (via random walks) can improve the expressiveness of GNNs (Eliasof et al., 2022; Jin et al., 2022b) , but these still rely on the graph structure to guide path sampling. This further motivates the use of representations like HONs, which are designed to consistently and accurately encode higher-order paths for downstream learning tasks (Rosvall et al., 2014; Xu et al., 2016; Saebi et al., 2020d; Krieg et al., 2020a) . To our knowledge, only one study has previously used GNNs with HONs (Jin et al., 2022a) ; however, as we discuss in Section 3, its proposed method has critical shortcomings.

2.2. GRAPH NEURAL NETWORKS

A generic GNN computes a hidden vector representation for a node u at timestamp t according to:

h^{(t)}_u = COMBINE(h^{(t-1)}_u, AGGREGATE({h^{(t-1)}_v : v ∈ N(u)})),   (1)

where N(u) is the neighborhood of u in a graph G. What typically distinguishes GNNs is how they define COMBINE and AGGREGATE, and how they represent N(u) (Xu et al., 2019; Zhou et al., 2020; Wu et al., 2020). By recursively pooling features via Eq. 1, GNNs implicitly construct higher-order neighborhoods across transitive paths. This allows nonadjacent nodes to share information if they are close in the graph (Li et al., 2018; Chen et al., 2020); however, as we will demonstrate, when higher-order dependencies are present, assuming transitivity leads GNNs that are trained on FONs to generalize poorly. In this work, we do not reformulate Eq. 1 and are agnostic toward its particular implementation. We instead abstract GNN as a function that takes a single node u as input and returns either the final hidden representation h^{(t)}_u or, in a supervised setting, a vector of predicted label probabilities ŷ_u:

GNN(u) = h^{(t)}_u or GNN(u) = ŷ_u.   (2)

We always assume that G contains initial node features {x_u, ∀u ∈ V} and that GNN(·) is parameterized by weights θ, but for simplicity we omit them from our notation.

GNN ensembles. Ho (1998) and Breiman (2001) showed that training an ensemble of shallow learners on random subspaces could exploit variance in the feature space and improve performance, and Dietterich (2000) demonstrated that these ensembles are most potent when the predictions of their base classifiers are accurate and diverse.
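A single propagation step of Eq. 1 can be sketched in NumPy. This is an illustrative instantiation only, with a mean AGGREGATE and a sum-then-ReLU COMBINE; the weight matrices are random placeholders rather than trained parameters.

```python
# One GNN propagation step (Eq. 1): for each node, average its neighbors'
# hidden states (AGGREGATE), then combine with its own state (COMBINE + ReLU).
import numpy as np

rng = np.random.default_rng(0)

def gnn_layer(h, neighbors, w_self, w_neigh):
    """h: (n, d) hidden states; neighbors: dict mapping node -> list of nodes."""
    out = np.zeros_like(h @ w_self)
    for u in range(h.shape[0]):
        if neighbors[u]:
            agg = h[neighbors[u]].mean(axis=0)      # mean AGGREGATE over N(u)
        else:
            agg = np.zeros(h.shape[1])              # isolated node: no message
        out[u] = np.maximum(0.0, h[u] @ w_self + agg @ w_neigh)  # COMBINE
    return out

h = rng.normal(size=(4, 8))                          # 4 nodes, 8 input features
neighbors = {0: [1, 2], 1: [0], 2: [0, 3], 3: [2]}   # toy adjacency
w_self = rng.normal(size=(8, 16))
w_neigh = rng.normal(size=(8, 16))
h1 = gnn_layer(h, neighbors, w_self, w_neigh)        # (4, 16) updated states
```

Stacking t such layers pools features along transitive paths of length t, which is exactly the implicit higher-order neighborhood construction discussed above.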
Deep ensembles have typically been used with random weight initializations to improve uncertainty estimation and robustness (Lakshminarayanan et al., 2017; Fort et al., 2019; Wasay & Idreos, 2020) , which has benefited GNNs via mechanisms like multi-head attention (Veličković et al., 2018; Brody et al., 2022; Hou et al., 2021) , but very few works have directly explored ensembles of GNNs. Some recent exceptions have suggested that ensembling subgraphs could benefit GNNs (Zeng et al., 2021; Tang et al., 2021; Lin et al., 2022) .

3.1. WHY ENSEMBLES?

There are a number of design challenges (CHs) we must address in order to realize the joint potential of GNNs and HONs. In a HON, entities are represented by a non-fixed number of conditional nodes (CH1); for example, in Figure 1, C is represented by two nodes, but A, B, D, and E are each only represented by one node. Conditional nodes typically have different neighborhoods (CH2); for example, in Figure 1, C|A and C|B have different neighbors. Further, they may vary in importance (i.e., degree) in the graph (CH3) (Xu et al., 2016; Saebi et al., 2020d). To address these challenges, one intuitive idea is to reformulate Eq. 2 so that it computes a representation for u by sampling neighbors from any of u's relatives. However, this fails to address CH2, because a GNN would aggregate the samples without considering differences between relatives. Another idea is to compute representations for each relative separately, then pool them via a permutation-invariant function like an elementwise MEAN, as proposed for HO-GNN by Jin et al. (2022a). But this does not address CH3, since all relatives would contribute equally to the final representation. Moreover, if we assume that all relatives in a higher-order family share the same (or similar) features, this solution will overrepresent features associated with larger higher-order families. In order to propose a method that comprehensively addresses these challenges, we first consider the following relationship.

Theorem 1. Let G^1 and G^k be a FON and HON, respectively, both constructed from the same input S. Let N^1(u) and N^k(u) denote the neighborhoods of any node u in G^1 and G^k, respectively. Let AGGREGATE(·) represent any symmetric neighborhood aggregation function. If u ∈ V^1 and u′ ∈ Ω^k_u, then AGGREGATE(N^k(u′)) is a biased estimator of AGGREGATE(N^1(u)).

We prove Theorem 1 in Appendix A.
Intuitively, we observe that HONs are constructed such that N^k(u′) ⊆ N^1(u), and u′ only exists in G^k if the expectation of a random walker differs substantially (measured via KL-divergence) from u in G^1 (Saebi et al., 2020d; Krieg et al., 2020a). Consequently, these differences in neighborhood structure will shift the expectation of the features gathered by AGGREGATE(N^k(u′)). A typical strategy for training a single GNN would involve attempting to eliminate this variance via some sampling method on the graph. We instead take an ensemble approach and propose to regularize the model via multiple GNNs that err in different ways. Toward this end, and inspired by feature subspace methods (Ho, 1998; Breiman, 2001), we call N^k(u′) a neighborhood subspace of N^1(u).

Remark. Our result from Theorem 1 also relates to the expressiveness of an aggregator over neighborhoods in G^k. Consider two nodes u, v ∈ V^1 whose rooted subgraphs (i.e., neighborhoods) are non-isomorphic but are not distinguishable by the aggregator in G^1. As long as there exists some u′ ∈ Ω^k_u (excluding u itself), then we know that, since it is a biased estimator, the aggregator can distinguish u′ from u in G^k. It follows transitively that the aggregator can also distinguish u′ from v. It is possible that there also exists some v′ ∈ Ω^k_v that cannot be distinguished from u′, but this would be extremely unlikely to occur over all pairs of relatives in Ω^k_u and Ω^k_v. We can thus improve the expressiveness of the model by allowing the information from nodes in Ω^k_u to contribute to the final representation for each u. Guided by these observations, we propose to address the CHs outlined above by training an ensemble of GNNs {GNN_1, GNN_2, ..., GNN_ℓ}.
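The intuition behind Theorem 1 can be checked numerically on the Figure 1 scenario. The scalar feature values below are made up for this sketch; the point is only that a mean AGGREGATE over a neighborhood subspace N^k(u′) ⊆ N^1(u) deviates from the first-order aggregate whenever the subspace excludes some neighbors.

```python
# Toy numeric illustration of Theorem 1: each subspace mean differs from the
# first-order mean, i.e., AGGREGATE(N^k(u')) is a biased estimator.
import numpy as np

feats = {"D": 2.0, "E": 5.0}               # hypothetical scalar node features
n1_C = ["D", "E"]                          # N^1(C): C's neighborhood in G^1
subspaces = {"C|A": ["D"], "C|B": ["E"]}   # N^2(C|A), N^2(C|B) in G^2

full_mean = np.mean([feats[v] for v in n1_C])   # AGGREGATE(N^1(C))
biases = {rel: np.mean([feats[v] for v in nbrs]) - full_mean
          for rel, nbrs in subspaces.items()}
# biases["C|A"] and biases["C|B"] are nonzero and of opposite sign: each
# relative's aggregate errs, but in a different direction.
```

This is precisely the variance that DGE exploits: each base GNN sees a different biased view, and the ensemble regularizes across them.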
Given a set of training nodes D ⊆ V^1, we generate bootstraps {D^{(1)}, D^{(2)}, ..., D^{(ℓ)}} subject to the constraint that |D^{(i)} ∩ Ω^k_u| = 1 for all u ∈ D and i ≤ ℓ (i.e., each bootstrap contains exactly one relative of each training node). This constraint allows us to avoid the feature overrepresentation problem, address CH1 by sampling with replacement, and address CH2 by training each GNN_i on different neighborhood subspaces as represented in D^{(i)}. To solve CH3, we weight the sampling probability for each relative u′ according to the normalized out-degree of its higher-order family:

P^k_u(u′) = OUTDEG^k(u′) / Σ_{v′ ∈ Ω^k_u} OUTDEG^k(v′),   (3)

where u ∈ D and OUTDEG^k(u′) is the weighted out-degree of u′ in G^k. Because weighted out-degree in G^k is the frequency with which u′ appears in S (Krieg et al., 2020a), it is a natural measure of the importance of u′ with respect to the rest of the higher-order family Ω^k_u. If D ⊆ E^1 consists of node pairs for an edge task, we modify Eq. 3 slightly. For each edge (u, v) ∈ D, we resample a single pair of relatives (u′, v′) with probability according to the normalized weights of all edges between relatives of u and v:

P^k_{u,v}(u′, v′) = w^k(u′, v′) / Σ_{(u′′, v′′) ∈ Ω^k_u × Ω^k_v} w^k(u′′, v′′).   (4)
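The node-level sampling in Eq. 3 can be sketched as follows. The families and out-degrees are toy values for illustration; in DGE they come from the constructed HON G^k.

```python
# Sketch of Eq. 3: one bootstrap per ensemble member, drawing exactly one
# relative per training node with probability proportional to its weighted
# out-degree within the higher-order family.
import random

random.seed(7)

# Higher-order families (each includes the base node itself) and toy degrees.
families = {"A": ["A"], "B": ["B"], "C": ["C", "C|A", "C|B"]}
out_deg = {"A": 3.0, "B": 2.0, "C": 1.0, "C|A": 6.0, "C|B": 2.0}

def sample_relative(u):
    """Draw one relative of u with probability P^k_u(u') from Eq. 3."""
    rels = families[u]
    total = sum(out_deg[r] for r in rels)
    weights = [out_deg[r] / total for r in rels]
    return random.choices(rels, weights=weights, k=1)[0]

ell, train_nodes = 4, ["A", "B", "C"]
bootstraps = [[sample_relative(u) for u in train_nodes] for _ in range(ell)]
```

Each bootstrap satisfies the |D^{(i)} ∩ Ω^k_u| = 1 constraint by construction: singleton families always contribute their only member, while C contributes one of {C, C|A, C|B}, with C|A favored because of its larger out-degree.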

3.2. TRAINING AND INFERENCE

Let D^{(i)}_u denote the relative of u that was sampled for the i-th bootstrap. We consider a supervised or semisupervised setting, in which our goal is to predict class probabilities ŷ_u for each u ∈ D such that some loss is minimized w.r.t. the ground truth y_u. We propose three methods for computing ŷ_u:

ŷ_u = σ(CONCAT(GNN_i(D^{(i)}_u), ∀i ≤ ℓ)^⊤ · W), with GNN_i(u′) = h^{(i,t)}_{u′},   (5a)
ŷ_u = σ(MEAN(GNN_i(D^{(i)}_u), ∀i ≤ ℓ)^⊤ · W), with GNN_i(u′) = h^{(i,t)}_{u′},   (5b)
ŷ_u = MEAN(GNN_i(D^{(i)}_u), ∀i ≤ ℓ), with GNN_i(u′) = ŷ^{(i)}_{u′},   (5c)

where σ is a non-linear activation, CONCAT is vector concatenation, MEAN is the elementwise mean, x^⊤ is the transpose of x, h^{(i,t)}_{u′} ∈ R^d represents the hidden state of node u′ in the t-th (final) layer of GNN_i, and d is the number of hidden units (we assume d is fixed for all GNN_i). Using c to denote the number of classes, for Eq. 5a we have W ∈ R^{dℓ×c}, and for Eq. 5b we have W ∈ R^{d×c}. We refer to Eq. 5a as DGE-concat, Eq. 5b as DGE-pool, and Eq. 5c as DGE-bag. In DGE-concat and DGE-pool, each GNN outputs hidden representations, which are concatenated or pooled, respectively, before computing the logits. This means that they can be trained in end-to-end fashion as a single neural network with parallel but independent GNN modules. During a forward pass, each GNN_i only computes representations for the nodes in D^{(i)}. In DGE-bag, on the other hand, each GNN outputs class probabilities, which means that each GNN is trained independently and their predictions are simply averaged to compute the final probabilities. We also designed an attention-based pooling method, but found that it did not generalize well (Appendix D).

One important question is whether all GNN_i should share parameters. In other words, is an ensemble necessary, or is it sufficient to use a single, more complex model (Abe et al., 2022)? We evaluated this question experimentally, and use DGE-concat*, DGE-pool*, and DGE-bag* to denote shared-parameter variants of Eqs. 5a, 5b, and 5c, respectively. For DGE-concat* and DGE-pool*, we adjusted our training procedure so that, during backpropagation, each parameter was updated once according to its contribution to the summed loss across all D^{(i)}. For DGE-bag*, this meant that each GNN_i was essentially pretrained on D^{(i-1)}. Since each GNN_i shares parameters, none of these variants are true ensembles. Instead, they are single models that synthesize representations for conditional nodes via a READOUT function (Wu et al., 2020) on each higher-order family. DGE-pool* is similar to HO-GNN (Jin et al., 2022a), which uses a single GNN and computes a representation for each u via the mean of all its relatives (rather than a weighted sample) in Ω^k_u. We also considered one final variant, DGE-batch*, which does not use a fixed set of bootstraps for training. Instead, it uses Eq. 3 to sample a new set of relatives for each batch. Then, during inference, we use the same procedure as DGE-bag: resample ℓ relatives for each node and compute their outputs via Eq. 5c.

DGE-batch* thus drops any resemblance to an ensemble, instead addressing CH1, CH2, and CH3 entirely via batch sampling. Figure 2 summarizes the components of DGE: resampling ℓ relatives for each entity via Eq. 3, computing a node representation for each sampled relative via Eq. 2, and pooling the computed representations via Eq. 5a, 5b, or 5c. In general, DGE's computational cost is linear in the cost of the base GNN and the ensemble size ℓ, since we are essentially constructing ℓ copies of that GNN (or, in the case of shared parameters, repeating ℓ forward passes per example). For a given node u, the additional cost of sampling and pooling ℓ relatives is O(ℓ |Ω^k_u|) and O(ℓ), respectively, which are trivial compared to the costs of Eq. 2 (Wu et al., 2020). Additionally, the one-time cost of constructing G^k increases linearly with k (Krieg et al., 2020a). Since we used G^2 in our experiments, there was marginal overhead for graph construction as compared to G^1.
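The three output heads in Eqs. 5a-5c can be sketched in NumPy. Each row of h below stands in for a member output GNN_i(D^{(i)}_u); the weight matrices are random placeholders, not trained parameters, and σ is instantiated as a softmax for this sketch.

```python
# Sketch of the three DGE output heads (Eqs. 5a-5c) over per-member outputs.
import numpy as np

rng = np.random.default_rng(1)
ell, d, c = 4, 8, 3                    # ensemble size, hidden units, classes
h = rng.normal(size=(ell, d))          # one final hidden state per member
W_cat = rng.normal(size=(ell * d, c))  # W for DGE-concat (Eq. 5a)
W_pool = rng.normal(size=(d, c))       # W for DGE-pool (Eq. 5b)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

y_concat = softmax(h.reshape(-1) @ W_cat)   # Eq. 5a: concatenate, then project
y_pool = softmax(h.mean(axis=0) @ W_pool)   # Eq. 5b: pool, then project
member_probs = np.array([softmax(row @ W_pool) for row in h])
y_bag = member_probs.mean(axis=0)           # Eq. 5c: average probabilities
```

The structural difference is visible in the shapes: DGE-concat's head grows with ℓ (W ∈ R^{dℓ×c}) and is position-sensitive, while DGE-pool and DGE-bag are permutation-invariant over the ensemble members.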

4.1. EXPERIMENTAL SETUP

We used GROWHON (Krieg et al., 2020a) to construct FONs (G^1) and HONs with k = 2 (G^2) for six real-world data sets with known higher-order dependencies: flight itineraries for airline passengers in the United States (Air) (Rosvall et al., 2014), disease trajectories for type 2 diabetes patients in Indiana (T2D) (Krieg et al., 2020b), clickstreams of users playing the Wikispeedia game (Wiki) (West et al., 2009), readership trajectories for a large online magazine (Mag, and a larger version, Mag+, for node classification only) (Wang et al., 2020), and global shipping routes (Ship) (Saebi et al., 2020c). Table 1 summarizes their key characteristics, including average homophily H (see Appendix G for details). We discuss additional details and preprocessing steps in Appendix B. We evaluated several baselines, including GCN (Kipf & Welling, 2017), GAT (Veličković et al., 2018), GraphSAGE (Hamilton et al., 2017), GIN (Xu et al., 2019), and GATv2 (Brody et al., 2022), as well as SEAL (Zhang & Chen, 2018) (link prediction only). Other noteworthy baselines are GraphSAINT (Zeng et al., 2020), which samples a different subgraph for each training iteration (Hu et al., 2020); GCNII (Chen et al., 2020), which uses residual connections to address over-smoothing; and PathGCN (Eliasof et al., 2022), which learns spatial operators on paths sampled via random walks. We also evaluated two baselines designed specifically for HONs: HONEM, a matrix factorization method (Saebi et al., 2020a), and HO-GNN (Jin et al., 2022a). All baselines used G^1 as input (we also evaluated each baseline using G^2 as input; details are in Appendix E). We manually tuned each model (details in Appendix C).
For DGE, unless noted otherwise, we fixed ℓ = 16 and used the mean-pooling variant of GraphSAGE as the base GNN, since a) it performed reasonably well as a baseline on all six data sets, and b) its sample-and-aggregate procedure intuitively complements the relative sampling and pooling used by DGE. We used Python 3.7.3 and TensorFlow 2.4.1 for all experiments, and utilized StellarGraph 1.2.1 (Data61, 2018) for the implementation of DGE.

4.2. EXPERIMENTAL RESULTS

Node classification and link prediction. On the node classification task (Table 2), DGE-bag outperformed all other methods on five data sets and was second to DGE-pool* on Air. DGE-pool performed only slightly worse than DGE-bag in all cases, but DGE-concat did not generalize well and performed poorly in all cases except Ship. This suggests that our relative sampling procedure was effective in conjunction with ensembling (DGE-bag), but not as effective at regularizing a single fully-connected network against the variance in G^k. Since concatenation depends heavily on the position of each relative in the vector representation, DGE-concat overfit on nodes that had many relatives.

5. LIMITATIONS, OTHER RELATED WORK, AND FUTURE DIRECTIONS

DGE's empirical success has implications for multiple research areas. Most broadly, we suggest that ensembling techniques for GNNs have been underexplored. However, generalizing our findings is complicated by the fact that DGE relies on the HON to supply the neighborhood variance, which is in turn limited to representing sequential data. Our results also suggest that the task of designing informed graph representations like HONs is essential and has been previously overshadowed by downstream learning algorithms. Other research on higher-order models has made progress toward this end, but much work remains to generalize these models and the types of dependencies they can represent. One area we consider promising for future work is graph structure learning. At present, HONs are constructed in unsupervised fashion, but future work could overcome this limitation by inferring the graph structure that best solves the given task (Brugere & Berger-Wolf, 2020; Chen & Wu, 2022) . Unfortunately, most standard data sets for GNN tasks are pre-constructed graphs, meaning that any higher-order dependencies (sequential or otherwise) have already been lost before learning begins. By integrating graph construction into the learning process and continuing to develop more powerful GNNs, we will increase our capacity to model higher-order relationships and expand the frontier of machine learning on graphs.



¹ Code and 3 data sets are available at https://github.com/sjkrieg/dge.
² This is distinct from sequence models like transformers, which typically predict an entity's local function within a single sequence.



Figure 2: Overview of DGE. Given a HON G k , DGE computes outputs for each base node u by a) resampling relatives from u's higher-order family Ω k u via a sampling distribution P k u , b) computing outputs for each sampled relative using independent GNN modules, and c) pooling the outputs.

Figure 4: Mean node classification loss for all pairs of the ℓ = 16 base GNNs within each testing fold on T2D, plotted as a function of Cohen's kappa (lower values indicate lower agreement). Each point represents one pair of GNNs in the ensemble. All plots contain the same number of points.

Table 1: Summary of graphs used in node classification and link prediction experiments.

Table 2: Node classification results (micro F1). Bold font indicates the best result for each data set.

± 0.04  0.501 ± 0.06  0.615 ± 0.02  0.790 ± 0.02  0.681 ± 0.04  0.828 ± 0.01
DGE-concat*  0.810 ± 0.04  0.439 ± 0.03  0.577 ± 0.02  0.761 ± 0.02  0.642 ± 0.01  0.809 ± 0.01
DGE-pool  0.839 ± 0.03  0.735 ± 0.03  0.671 ± 0.01  0.860 ± 0.01  0.722 ± 0.01  0.808 ± 0.01
DGE-pool*  0.865 ± 0.02  0.555 ± 0.07  0.599 ± 0.04  0.775 ± 0.01  0.671 ± 0.01  0.767 ± 0.02
DGE-bag  0.856 ± 0.02  0.770 ± 0.04  0.681 ± 0.00  0.871 ± 0.01  0.769 ± 0.01  0.840 ± 0.01
DGE-bag*  0.766 ± 0.04  0.719 ± 0.04  0.644 ± 0.02  0.841 ± 0.02  0.739 ± 0.01  0.825 ± 0.01
DGE-batch*  0.764 ± 0.03  0.646 ± 0.01  0.623 ± 0.01  0.818 ± 0.01  0.742 ± 0.00  0.812 ± 0.01

Table 3: Link prediction results (AUPRC). Bold font indicates the best result for each data set.

Table 4: Node classification results (mean micro F1 for 5-fold cross-validation) under various parameter budgets. Bold font indicates the best result for each budget and data set.


Sharing parameters typically reduced performance for each DGE variant, except for DGE-pool* on Air. We discuss this observation in Section 4.3. The differences between HO-GNN and DGE emphasize the importance of accounting for both the variance in neighborhood subspaces in G^k and the differences in importance between relatives. Some baselines were occasionally competitive: GCNII, GATv2, and PathGCN performed relatively well on Air, and GCNII and GraphSAINT performed relatively well on Wiki. However, on Mag and especially T2D, DGE-bag outperformed all baselines by a significant margin. PathGCN and the full-batch models struggled on T2D, likely due to its high density, which means that there are many transitive and false positive paths in G^1. Low homophily may also have contributed to their poor performance (Zhu et al., 2020) (see discussion in Appendix G). GAT and GATv2 performed especially poorly on the dense graphs, likely because they compute attention weights (instead of using the given edge weights) and were thus more susceptible to overfitting. The link prediction results, summarized in Table 3, tell a similar story. Because of its strong performance, we focus the remainder of our analysis on DGE-bag. For details on training time and convergence for all models, please see Appendix F.

Model size. For the baselines, increasing the number of GNN layers typically increased generalization error (Figure 3). For DGE-bag, the test loss decreased with increased model depth in all but one case (layer 3 on T2D). These results support the findings of Lambiotte et al. (2019) that transitively inferring paths in a FON cannot account for higher-order dependencies, and our hypothesis that GNNs trained on G^1 will overfit on non-existent paths. The exception to this pattern was Wiki, on which many models performed best with 3 layers. This difference is perhaps because Wiki was the sparsest graph, meaning that the likelihood of sampling a false positive path is relatively low. However, DGE-bag still produced the lowest test error at all depths. In order to ensure that DGE's performance was not simply due to using a more expensive model, we compared DGE-bag to the strongest baselines at several parameter budgets (Table 4; results for Wiki and Mag are available in Appendix E). DGE-bag underperformed with the smallest budgets, since each base learner was too weak and underfit. We found that sharing parameters (DGE-bag*) resolved this issue and produced strong results on T2D, Wiki, and Mag. However, with a moderate parameter budget, DGE-bag consistently outperformed other models. Again, for most baselines, increasing model depth simply caused them to overfit and decreased performance.

4.3. ENSEMBLE DIVERSITY AND PARAMETER SHARING

The variants of DGE that used separate parameters generally performed best. To understand this observation, we draw from prior work on ensembles, which has established that ensembles are most potent when the individual classifiers have low error and high disagreement (Dietterich, 2000). As Figure 4 shows, DGE-bag consistently produced classifiers that were both diverse and accurate. The shared-parameter variants often produced classifiers that were either diverse or accurate, but not both. We observed similar results for Air, Wiki, and Mag (Appendix E). These results reflect node classification performance (Table 2) and support several important conclusions. First, the high disagreement in DGE-bag demonstrates that treating higher-order relatives as neighborhood subspaces successfully introduced variance into the model. Second, the low mean error for DGE-bag demonstrates that ensembling GNNs with separate parameters effectively captured and exploited this variance. Third, sharing parameters (i.e., using a single, more complex model) did not achieve the same effect. DGE-pool* outperformed DGE-bag on Air because the classifiers were accurate enough to make up for the lack of disagreement. This may be because Air had larger higher-order families than the other data sets, so each bootstrap was less representative of the true neighborhood. However, using separate parameters for each GNN, a true ensemble approach, was the most consistently effective way to balance the trade-off between accuracy and diversity.
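The pairwise diversity measurement behind Figure 4 can be sketched as follows: Cohen's kappa between the hard predictions of every pair of base classifiers. The predictions below are synthetic; in the experiments they would come from the ℓ trained base GNNs.

```python
# Pairwise Cohen's kappa over ensemble members: lower kappa = more
# disagreement. Pairing each kappa with the members' losses reproduces the
# accuracy/diversity scatter described above.
import itertools
import numpy as np

def cohen_kappa(a, b):
    """Agreement between two label vectors, corrected for chance agreement."""
    a, b = np.asarray(a), np.asarray(b)
    po = (a == b).mean()                            # observed agreement
    pe = sum((a == lab).mean() * (b == lab).mean()  # expected by chance
             for lab in np.union1d(a, b))
    return 1.0 if pe == 1.0 else (po - pe) / (1.0 - pe)

rng = np.random.default_rng(0)
preds = rng.integers(0, 3, size=(4, 100))  # 4 ensemble members, 100 test nodes
kappas = [cohen_kappa(preds[i], preds[j])
          for i, j in itertools.combinations(range(len(preds)), 2)]
```

With ℓ members this produces ℓ(ℓ-1)/2 points per test fold, matching the note in the Figure 4 caption that all plots contain the same number of points.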

